diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 8669c25c452b53da48239bc20c9a2d3528e75422..db4b1581ae671b1e676e215c9a80dfaab832fa21 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -90,7 +90,7 @@ Bazel BUILD files also need to include a license section, e.g.,
 Changes to TensorFlow C++ code should conform to
 [Google C++ Style Guide](https://google.github.io/styleguide/cppguide.html).
 
-Use `clang-tidy` to check your C/C++ changes. To install clang-tidy on ubuntu:16.04, do:
+Use `clang-tidy` to check your C/C++ changes. To install `clang-tidy` on ubuntu:16.04, do:
 
 ```bash
 apt-get install -y clang-tidy
diff --git a/README.md b/README.md
index 6fb4486d0de9ff476b5cf1dbd63d66879637df84..63853137cfd30b396f8c7d204811f3e4a1794c07 100644
--- a/README.md
+++ b/README.md
@@ -56,6 +56,7 @@ $ python
 42
 >>> sess.close()
 ```
+Learn more examples about how to do specific tasks in TensorFlow at the [tutorials page of tensorflow.org](https://www.tensorflow.org/tutorials/).
 
 ## Contribution guidelines
 
diff --git a/RELEASE.md b/RELEASE.md
index 84d9d52868ecd55d38d6073315749d11c2340e8c..e09e9c6190f57adec67c2ae1d85848dabfd9c2a7 100644
--- a/RELEASE.md
+++ b/RELEASE.md
@@ -1,3 +1,62 @@
+# Release 1.9.0
+
+## Major Features And Improvements
+* Update tf.keras to the Keras 2.1.6 API.
+* `tfe.Network` is deprecated. Please inherit from `tf.keras.Model`.
+* Adding support of core feature columns and losses to gradient boosted trees estimators.
+* The distributions.Bijector API supports broadcasting for Bijectors with new API changes. See [here](https://www.tensorflow.org/versions/r1.9/api_docs/python/tf/distributions/bijectors/Bijector) for more details.
+* Layered variable names have changed in the following conditions:
+  * Using `tf.keras.layers` with custom variable scopes.
+  * Using `tf.layers` in  a subclassed `tf.keras.Model` class. See [here](https://www.tensorflow.org/versions/r1.9/api_docs/python/tf/layers) for more details
+
+## Breaking Chances
+  * If you're opening empty variable scopes; replace `variable_scope`('', ...) by `variable_scope`(`tf.get_variable_scope()`, ...).
+
+## Bug Fixes and Other Changes
+* `tf.data`:
+  * The `DatasetBase::DebugString()` method is now `const`.
+  * Added the `tf.contrib.data.sample_from_datasets()` API for randomly sampling from multiple datasets.
+* Eager Execution:
+* `tf.keras`:
+  * Move Keras code out of _impl folder and remove API files.
+  * `tf.keras.Model.save_weights` now saves in TensorFlow format by default.
+  * Enable dataset iterators to be passed to `tf.keras.Model` training/eval methods.
+* Accelerated Linear Algebra (XLA):
+* TensorFlow Debugger (tfdbg): fix an issue in which the TensorBoard Debugger Plugin could not handle total source file size exceeding gRPC message size limit (4 MB).
+* `tf.contrib`:
+  * Add `tf.contrib.data.choose_from_datasets()`.
+  * `tf.contrib.data.make_csv_dataset()` now supports line breaks in quoted strings. Two arguments were removed from `make_csv_dataset`.
+  * `tf.contrib.framework.zero_initializer` supports ResourceVariable.
+  * Adding "constrained_optimization" to tensorflow/contrib.
+* Other:
+  * Add GCS Configuration Ops.
+  * Changing signature of `MakeIterator` to enable propagating error status.
+  * KL divergence for two Dirichlet distributions.
+  * More consistent GcsFileSystem behavior for certain reads past EOF.
+  * Update benchmark for tf.scan to match ranges across eager and graph modes.
+  * Fixed bug in `tf.reduce_prod gradient` for complex dtypes.
+  * Add optional `args` argument to `Dataset.from_generator()`.
+  * Allow the use of '.' in variables (e.g. "hparams.parse('a.b=1.0')"), which would previously raise an error. This will correspond to an attribute name with an embedded '.' symbol (e.g. 'a.b'), which can only be accessed indirectly (e.g. through getattr and setattr).  To set this up the user will first need to explicitly add the variable to the hparam object (e.g. "hparams.add_hparam(name='a.b', value=0.0)").
+  * Benchmark for tf.scan in graph and eager modes.
+  * Added complex128 support to FFT, FFT2D, FFT3D, IFFT, IFFT2D, and IFFT3D.
+  * Making ids unique in `nn.embedding_lookup_sparse`. This helps to reduce RPC calls for looking up the embeddings when there are repeated ids in the batch.
+  * Support indicator column in boosted trees.
+  * Prevent `tf.gradients()` from backpropagating through integer tensors.
+  * LinearOperator[1D,2D,3D]Circulant added to `tensorflow.linalg`.
+  * Conv3D, Conv3DBackpropInput, Conv3DBackpropFilter now supports arbitrary.
+  * Added `tf.train.Checkpoint` for reading/writing object-based checkpoints.
+  * `Dataset.list_files()` now produces determinstic results when `shuffle=False` or a `seed` is passed.
+  * Added LinearOperatorKronecker, a dense-free implementation of the Kronecker Product.
+  * Allow LinearOperator to broadcast.
+  * SavedModelBuilder will now deduplicate asset names that point to files with the same basename and the same contents. Note that this may result in new asset files included in SavedModels in cases where assets with the same name but different contents were previously overwriting each other.
+
+
+## Thanks to our Contributors
+
+This release contains contributions from many people at Google, as well as:
+
+Abdullah Alrasheed, Achal Shah, Ad-530, ADiegoCAlonso, Aditya Yogi, Ag Ramesh, akindyakov, Andy Kernahan, Anya Petrova, Aurelien Geron, Ben, Ben Barsdell, Bhavani-Subramanian, braincodercn, Brett Koonce, Brian Nemsick, Brian Zier, Bryan Heden, candy.dc, cclauss, Clayne Robison, ctiijima, Dalmo Cirne, David Norman, David T.H. Kao, DosLin, ekelsen, Elson Rodriguez, Erik Smistad, Felix Abecassis, Fergal Cotter, fo40225, foo0x29a, Freedom" Koan-Sin Tan, FréDéRic Branchaud-Charron, gdh1995, Geoffrey Irving, Giuseppe, gracehoney, Guido Zuidhof, Guillaume Klein, Guozhong Zhuang, Haggai, Harald Husum, imsheridan, Ivan Zhang, Jan Zikes, Jayaram Bobba, Jesse Benson, Jesse Gumz, Jiajia Li, Jie, jinghuangintel, Jingwen, jjsjann123, Joe Yearsley, Joel Hestness, Joel Shor, josephyearsley, Junpeng Lao, Karol M. Langner, Kb Sriram, krantideep95, Krish Ravindranath, Letian Feng, Loo Rong Jie, Lukas Geiger, Maciej, Mahmoud Abuzaina, ManHyuk, Mark Ryan, mbhuiyan, Michal Turek, Mostafa Alaa, Myungsung Kwak, Nand Dalal, Nehal J Wani, Neil Tenenholtz, ngc92, Nicholas Nadeau, P.Eng., Avs, Niranjan Hasabnis, P-Hidringer, Paul Van Eck, Peng Yu, Qing Zhao, Qingying Chen, Quanlong, Rajendra Arora, Rholais Lii, rmanyari, Robin Richtsfeld, Russell Klopfer, Sagi, Sam Sendelbach, Sandeep N Gupta, Sandip Giri, Sarah Edkins, Scott Tseng, Sdalbsoo, Sergii Khomenko, Seungwoo Choi (Biggie), Seyed Majid Azimi, Shaoning Zeng, shengfuintel, Siu Kei, Muk, Smit Shilu, soonson, Stefan Schweter, Sukhwan Kim, Sunitha Kambhampati, Taehoon Lee, tamimaddari82, Tang, Wenyi, Ted Chang, u2takey, Utkarsh Upadhyay, Vadim Markovtsev, voegtlel, Wai Hon Law, wangsiyu, Wenhao Hu, wenhao.hu, William D. Irons, Yan Facai (颜发才), Yanbo Liang, Yihong Wang, Yilei (Dolee) Yang, Yong Tang, Yuan (Terry) Tang
+
 # Release 1.8.0
 
 ## Major Features And Improvements
@@ -404,14 +463,6 @@ answered questions, and were part of inspiring discussions.
 
 # Release 1.4.0
 
-## Major Features And Improvements
-* `tf.keras` is now part of the core TensorFlow API.
-* [`tf.data`](http://tensorflow.org/programmers_guide/datasets) is now part of
-  the core TensorFlow API.
-  * The API is now subject to backwards compatibility guarantees.
-
-# Release 1.4.0
-
 ## Major Features And Improvements
 * `tf.keras` is now part of the core TensorFlow API.
 * [`tf.data`](http://tensorflow.org/programmers_guide/datasets) is now part of
diff --git a/SECURITY.md b/SECURITY.md
index 01886b613e5d93793953124331b57f075fe7a373..e2f6ff353a3c04a6ec6b8ccbaeb75db59fa22d54 100644
--- a/SECURITY.md
+++ b/SECURITY.md
@@ -168,7 +168,7 @@ below).
 
 Please use a descriptive subject line for your report email. After the initial
 reply to your report, the security team will endeavor to keep you informed of
-the progress being made towards a fix and announcement. 
+the progress being made towards a fix and announcement.
 
 In addition, please include the following information along with your report:
 
@@ -242,9 +242,7 @@ v//Fw6ZeY+HmRDFdirjD7wXtIuER4vqCryIqR6Xe9X8oJXz9L/Jhslc=
 -----END PGP PUBLIC KEY BLOCK-----
 ```
 
-### Known vulnerabilities
-
-| Type               | Versions affected | Reported by           | Additional Information      |
-|--------------------|:-----------------:|-----------------------|-----------------------------|
-| Out Of Bounds Read |             <=1.4 | Blade Team of Tencent | [issue report](https://github.com/tensorflow/tensorflow/issues/14959) |
+### Known Vulnerabilities
 
+For a list of known vulnerabilities and security advisories for TensorFlow,
+(https://github.com/tensorflow/tensorflow/blob/master/tensorflow/security/index.md)[click here].
diff --git a/WORKSPACE b/WORKSPACE
index 4ddfb9a3832ea1ea639ace887e1d601bdd857086..fd7570a80ae2ee0087f7d2fd771fcce5b9690028 100644
--- a/WORKSPACE
+++ b/WORKSPACE
@@ -22,26 +22,10 @@ check_bazel_version_at_least("0.10.0")
 
 load("//tensorflow:workspace.bzl", "tf_workspace")
 
-# Uncomment and update the paths in these entries to build the Android demo.
-#android_sdk_repository(
-#    name = "androidsdk",
-#    api_level = 23,
-#    # Ensure that you have the build_tools_version below installed in the
-#    # SDK manager as it updates periodically.
-#    build_tools_version = "26.0.1",
-#    # Replace with path to Android SDK on your system
-#    path = "<PATH_TO_SDK>",
-#)
-#
-#android_ndk_repository(
-#    name="androidndk",
-#    path="<PATH_TO_NDK>",
-#    # This needs to be 14 or higher to compile TensorFlow.
-#    # Please specify API level to >= 21 to build for 64-bit
-#    # archtectures or the Android NDK will automatically select biggest
-#    # API level that it supports without notice.
-#    # Note that the NDK version is not the API level.
-#    api_level=14)
+load("//third_party/android:android_configure.bzl", "android_configure")
+android_configure(name="local_config_android")
+load("@local_config_android//:android.bzl", "android_workspace")
+android_workspace()
 
 # Please add all new TensorFlow dependencies in workspace.bzl.
 tf_workspace()
diff --git a/configure.py b/configure.py
index b6c32543cf707983d48e390cc89abf13dafd55d3..ada342a50ab5104509156d3e44e6435a308255a3 100644
--- a/configure.py
+++ b/configure.py
@@ -670,8 +670,9 @@ def create_android_ndk_rule(environ_cp):
       error_msg=('The path %s or its child file "source.properties" '
                  'does not exist.')
   )
-
-  write_android_ndk_workspace_rule(android_ndk_home_path)
+  write_action_env_to_bazelrc('ANDROID_NDK_HOME', android_ndk_home_path)
+  write_action_env_to_bazelrc('ANDROID_NDK_API_LEVEL',
+                              check_ndk_level(android_ndk_home_path))
 
 
 def create_android_sdk_rule(environ_cp):
@@ -733,41 +734,12 @@ def create_android_sdk_rule(environ_cp):
       error_msg=('The selected SDK does not have build-tools version %s '
                  'available.'))
 
-  write_android_sdk_workspace_rule(android_sdk_home_path,
-                                   android_build_tools_version,
-                                   android_api_level)
-
-
-def write_android_sdk_workspace_rule(android_sdk_home_path,
-                                     android_build_tools_version,
-                                     android_api_level):
-  print('Writing android_sdk_workspace rule.\n')
-  with open(_TF_WORKSPACE, 'a') as f:
-    f.write("""
-android_sdk_repository(
-  name="androidsdk",
-  api_level=%s,
-  path="%s",
-  build_tools_version="%s")\n
-""" % (android_api_level, android_sdk_home_path, android_build_tools_version))
-
-
-def write_android_ndk_workspace_rule(android_ndk_home_path):
-  print('Writing android_ndk_workspace rule.')
-  ndk_api_level = check_ndk_level(android_ndk_home_path)
-  if int(ndk_api_level) not in _SUPPORTED_ANDROID_NDK_VERSIONS:
-    print('WARNING: The API level of the NDK in %s is %s, which is not '
-          'supported by Bazel (officially supported versions: %s). Please use '
-          'another version. Compiling Android targets may result in confusing '
-          'errors.\n' % (android_ndk_home_path, ndk_api_level,
-                         _SUPPORTED_ANDROID_NDK_VERSIONS))
-  with open(_TF_WORKSPACE, 'a') as f:
-    f.write("""
-android_ndk_repository(
-  name="androidndk",
-  path="%s",
-  api_level=%s)\n
-""" % (android_ndk_home_path, ndk_api_level))
+  write_action_env_to_bazelrc('ANDROID_BUILD_TOOLS_VERSION',
+                              android_build_tools_version)
+  write_action_env_to_bazelrc('ANDROID_SDK_API_LEVEL',
+                              android_api_level)
+  write_action_env_to_bazelrc('ANDROID_SDK_HOME',
+                              android_sdk_home_path)
 
 
 def check_ndk_level(android_ndk_home_path):
@@ -780,18 +752,16 @@ def check_ndk_level(android_ndk_home_path):
 
   revision = re.search(r'Pkg.Revision = (\d+)', filedata)
   if revision:
-    return revision.group(1)
-  return None
-
-
-def workspace_has_any_android_rule():
-  """Check the WORKSPACE for existing android_*_repository rules."""
-  with open(_TF_WORKSPACE, 'r') as f:
-    workspace = f.read()
-  has_any_rule = re.search(r'^android_[ns]dk_repository',
-                           workspace,
-                           re.MULTILINE)
-  return has_any_rule
+    ndk_api_level = revision.group(1)
+  else:
+    raise Exception('Unable to parse NDK revision.')
+  if int(ndk_api_level) not in _SUPPORTED_ANDROID_NDK_VERSIONS:
+    print('WARNING: The API level of the NDK in %s is %s, which is not '
+          'supported by Bazel (officially supported versions: %s). Please use '
+          'another version. Compiling Android targets may result in confusing '
+          'errors.\n' % (android_ndk_home_path, ndk_api_level,
+                         _SUPPORTED_ANDROID_NDK_VERSIONS))
+  return ndk_api_level
 
 
 def set_gcc_host_compiler_path(environ_cp):
@@ -1223,7 +1193,7 @@ def set_tf_cuda_compute_capabilities(environ_cp):
     # Check whether all capabilities from the input is valid
     all_valid = True
     # Remove all whitespace characters before splitting the string
-    # that users may insert by accident, as this will result in error 
+    # that users may insert by accident, as this will result in error
     tf_cuda_compute_capabilities = ''.join(tf_cuda_compute_capabilities.split())
     for compute_capability in tf_cuda_compute_capabilities.split(','):
       m = re.match('[0-9]+.[0-9]+', compute_capability)
@@ -1427,6 +1397,10 @@ def set_grpc_build_flags():
   write_to_bazelrc('build --define grpc_no_ares=true')
 
 
+def set_build_strip_flag():
+  write_to_bazelrc('build --strip=always')
+
+
 def set_windows_build_flags():
   if is_windows():
     # The non-monolithic build is not supported yet
@@ -1549,23 +1523,18 @@ def main():
 
   set_grpc_build_flags()
   set_cc_opt_flags(environ_cp)
+  set_build_strip_flag()
   set_windows_build_flags()
 
-  if workspace_has_any_android_rule():
-    print('The WORKSPACE file has at least one of ["android_sdk_repository", '
-          '"android_ndk_repository"] already set. Will not ask to help '
-          'configure the WORKSPACE. Please delete the existing rules to '
-          'activate the helper.\n')
-  else:
-    if get_var(
-        environ_cp, 'TF_SET_ANDROID_WORKSPACE', 'android workspace',
-        False,
-        ('Would you like to interactively configure ./WORKSPACE for '
-         'Android builds?'),
-        'Searching for NDK and SDK installations.',
-        'Not configuring the WORKSPACE for Android builds.'):
-      create_android_ndk_rule(environ_cp)
-      create_android_sdk_rule(environ_cp)
+  if get_var(
+      environ_cp, 'TF_SET_ANDROID_WORKSPACE', 'android workspace',
+      False,
+      ('Would you like to interactively configure ./WORKSPACE for '
+       'Android builds?'),
+      'Searching for NDK and SDK installations.',
+      'Not configuring the WORKSPACE for Android builds.'):
+    create_android_ndk_rule(environ_cp)
+    create_android_sdk_rule(environ_cp)
 
   print('Preconfigured Bazel build configs. You can use any of the below by '
         'adding "--config=<>" to your build command. See tools/bazel.rc for '
diff --git a/tensorflow/BUILD b/tensorflow/BUILD
index f2ad16fa04f5beb6616c58c28d0f0c460c3e3a17..6d134dbb80cb8c3dcf15b2ba20783870a67e9a62 100644
--- a/tensorflow/BUILD
+++ b/tensorflow/BUILD
@@ -19,6 +19,10 @@ load(
     "//tensorflow/core:platform/default/build_config.bzl",
     "tf_additional_binary_deps",
 )
+load(
+    "//tensorflow/tools/api/generator:api_gen.bzl",
+    "gen_api_init_files",  # @unused
+)
 
 # Config setting for determining if we are building for Android.
 config_setting(
@@ -471,7 +475,7 @@ tf_cc_shared_object(
 # excludes all but a subset of function names.
 # On MacOS, the linker does not support version_script, but has an
 # an "-exported_symbols_list" command.  -z defs disallows undefined
-# symbols in object files and -s strips the output.
+# symbols in object files.
 
 tf_cc_shared_object(
     name = "libtensorflow.so",
@@ -485,7 +489,6 @@ tf_cc_shared_object(
         "//tensorflow:windows_msvc": [],
         "//conditions:default": [
             "-z defs",
-            "-s",
             "-Wl,--version-script",  #  This line must be directly followed by the version_script.lds file
             "$(location //tensorflow/c:version_script.lds)",
         ],
@@ -511,7 +514,6 @@ tf_cc_shared_object(
         "//tensorflow:windows_msvc": [],
         "//conditions:default": [
             "-z defs",
-            "-s",
             "-Wl,--version-script",  #  This line must be directly followed by the version_script.lds file
             "$(location //tensorflow:tf_version_script.lds)",
         ],
@@ -536,13 +538,19 @@ exports_files(
     ],
 )
 
+gen_api_init_files(
+    name = "tensorflow_python_api_gen",
+    srcs = ["api_template.__init__.py"],
+    root_init_template = "api_template.__init__.py",
+)
+
 py_library(
     name = "tensorflow_py",
-    srcs = ["__init__.py"],
+    srcs = [
+        ":tensorflow_python_api_gen",
+        "//tensorflow/python/estimator/api:estimator_python_api_gen",
+    ],
     srcs_version = "PY2AND3",
     visibility = ["//visibility:public"],
-    deps = [
-        "//tensorflow/python",
-        "//tensorflow/tools/api/generator:python_api",
-    ],
+    deps = ["//tensorflow/python"],
 )
diff --git a/tensorflow/__init__.py b/tensorflow/__init__.py
index c8683e3976c90add3f1f54d8e575c798327e9273..440e9f8dbd2f4b2a2ab78eaaf26408584e7c1446 100644
--- a/tensorflow/__init__.py
+++ b/tensorflow/__init__.py
@@ -22,9 +22,6 @@ from __future__ import print_function
 
 # pylint: disable=g-bad-import-order
 from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
-# pylint: disable=wildcard-import
-from tensorflow.tools.api.generator.api import *  # pylint: disable=redefined-builtin
-# pylint: enable=wildcard-import
 
 from tensorflow.python.util.lazy_loader import LazyLoader
 contrib = LazyLoader('contrib', globals(), 'tensorflow.contrib')
diff --git a/tensorflow/api_template.__init__.py b/tensorflow/api_template.__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..9662d7b478ba61c69edc20b0d47293f9939e7881
--- /dev/null
+++ b/tensorflow/api_template.__init__.py
@@ -0,0 +1,58 @@
+# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Bring in all of the public TensorFlow interface into this module."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+# pylint: disable=g-bad-import-order
+from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
+# API IMPORTS PLACEHOLDER
+
+try:
+  import os  # pylint: disable=g-import-not-at-top
+  # Add `estimator` attribute to allow access to estimator APIs via
+  # "tf.estimator..."
+  from tensorflow.python.estimator.api import estimator  # pylint: disable=g-import-not-at-top
+
+  # Add `estimator` to the __path__ to allow "from tensorflow.estimator..."
+  # style imports.
+  from tensorflow.python.estimator import api as estimator_api  # pylint: disable=g-import-not-at-top
+  __path__ += [os.path.dirname(estimator_api.__file__)]
+  del estimator_api
+  del os
+except (ImportError, AttributeError):
+  print('tf.estimator package not installed.')
+
+from tensorflow.python.util.lazy_loader import LazyLoader  # pylint: disable=g-import-not-at-top
+contrib = LazyLoader('contrib', globals(), 'tensorflow.contrib')
+del LazyLoader
+
+from tensorflow.python.platform import flags  # pylint: disable=g-import-not-at-top
+app.flags = flags  # pylint: disable=undefined-variable
+
+del absolute_import
+del division
+del print_function
+
+# These symbols appear because we import the python package which
+# in turn imports from tensorflow.core and tensorflow.python. They
+# must come from this module. So python adds these symbols for the
+# resolution to succeed.
+# pylint: disable=undefined-variable
+del python
+del core
+# pylint: enable=undefined-variable
diff --git a/tensorflow/c/c_api.cc b/tensorflow/c/c_api.cc
index b86b277ac3200b88ae03490a6c1b64d464e81950..cb0b093ad260e000dcef9d1123e967a77cf1a041 100644
--- a/tensorflow/c/c_api.cc
+++ b/tensorflow/c/c_api.cc
@@ -631,7 +631,22 @@ Status MessageToBuffer(const tensorflow::protobuf::Message& in,
         "Failed to allocate memory to serialize message of type '",
         in.GetTypeName(), "' and size ", proto_size);
   }
-  in.SerializeToArray(buf, proto_size);
+  // SerializeToArray takes size as an int.
+  // This next 'if' is a workaround till we update to depend on a version
+  // of protocol buffers that includes
+  // https://github.com/google/protobuf/pull/4739
+  if (proto_size > std::numeric_limits<int>::max()) {
+    return InvalidArgument("Cannot serialize protocol buffer of type ",
+                           in.GetTypeName(), " as the serialized size (",
+                           proto_size,
+                           "bytes) would be larger than the limit (",
+                           std::numeric_limits<int>::max(), " bytes)");
+  }
+  if (!in.SerializeToArray(buf, proto_size)) {
+    return InvalidArgument("Unable to serialize ", in.GetTypeName(),
+                           " protocol buffer, perhaps the serialized size (",
+                           proto_size, " bytes) is too large?");
+  }
   out->data = buf;
   out->length = proto_size;
   out->data_deallocator = [](void* data, size_t length) {
diff --git a/tensorflow/c/eager/BUILD b/tensorflow/c/eager/BUILD
index 9ce781fab0be709fb0f115a2206eea4c2826bf36..f265da2c2c89c0e9caf14f2213c606fcb69997e0 100644
--- a/tensorflow/c/eager/BUILD
+++ b/tensorflow/c/eager/BUILD
@@ -14,6 +14,7 @@ tf_cuda_library(
     name = "c_api",
     srcs = [
         "c_api.cc",
+        "c_api_debug.cc",
         "c_api_internal.h",
     ],
     hdrs = ["c_api.h"],
@@ -45,6 +46,7 @@ tf_cuda_library(
         "//tensorflow:with_xla_support": [
             "//tensorflow/compiler/tf2xla:xla_compiler",
             "//tensorflow/compiler/jit",
+            "//tensorflow/compiler/jit:xla_device",
         ],
         "//conditions:default": [],
     }) + [
@@ -99,9 +101,31 @@ tf_cuda_library(
     ],
 )
 
+tf_cuda_library(
+    name = "c_api_test_util",
+    testonly = 1,
+    srcs = ["c_api_test_util.cc"],
+    hdrs = ["c_api_test_util.h"],
+    visibility = [
+        "//learning/brain:__subpackages__",
+        "//tensorflow:__subpackages__",
+    ],
+    deps = [
+        ":c_api",
+        "//tensorflow/c:c_test_util",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:protos_all_cc",
+        "//tensorflow/core:test",
+    ],
+)
+
 tf_cuda_cc_test(
     name = "c_api_test",
-    srcs = ["c_api_test.cc"],
+    srcs = [
+        "c_api_debug_test.cc",
+        "c_api_test.cc",
+    ],
     extra_copts = tfe_xla_copts(),
     tags = [
         "guitar",
@@ -109,6 +133,7 @@ tf_cuda_cc_test(
     ],
     deps = [
         ":c_api",
+        ":c_api_test_util",
         "//tensorflow/c:c_test_util",
         "//tensorflow/core:lib",
         "//tensorflow/core:protos_all_cc",
diff --git a/tensorflow/c/eager/c_api.cc b/tensorflow/c/eager/c_api.cc
index 216210c88c1593ebc68f604547ab06b543a7b2af..81221c4078bec9820ee187efdf0314da378be62b 100644
--- a/tensorflow/c/eager/c_api.cc
+++ b/tensorflow/c/eager/c_api.cc
@@ -73,10 +73,6 @@ string DeviceName(const tensorflow::Device* d) {
   return (d == nullptr) ? "cpu:0" : d->name();
 }
 
-#ifdef TENSORFLOW_EAGER_USE_XLA
-std::atomic_int_fast64_t func_id_generator(0);
-#endif  // TENSORFLOW_EAGER_USE_XLA
-
 tensorflow::Status GetAllRemoteDevices(
     const std::vector<string>& remote_workers,
     tensorflow::WorkerCacheInterface* worker_cache,
diff --git a/tensorflow/c/eager/c_api.h b/tensorflow/c/eager/c_api.h
index 574a097e0d6f5d6e7acd77cae246678b6675129b..1862af3ce2f505a6e83b4805417eaf335ed07bc0 100644
--- a/tensorflow/c/eager/c_api.h
+++ b/tensorflow/c/eager/c_api.h
@@ -191,6 +191,45 @@ TF_CAPI_EXPORT extern TFE_TensorHandle* TFE_TensorHandleCopyToDevice(
     TFE_TensorHandle* h, TFE_Context* ctx, const char* device_name,
     TF_Status* status);
 
+// Debugging/Profiling information for TFE_TensorHandle
+//
+// TFE_TensorDebugInfo contains information useful for debugging and
+// profiling tensors.
+typedef struct TFE_TensorDebugInfo TFE_TensorDebugInfo;
+
+// Retrieves TFE_TensorDebugInfo for `handle`.
+// If TFE_TensorHandleTensorDebugInfo succeeds, `status` is set to OK and caller
+// is responsible for deleting returned TFE_TensorDebugInfo.
+// If TFE_TensorHandleTensorDebugInfo fails, `status` is set to appropriate
+// error and nullptr is returned. This function can block till the operation
+// that produces `handle` has completed.
+TF_CAPI_EXPORT extern TFE_TensorDebugInfo* TFE_TensorHandleTensorDebugInfo(
+    TFE_TensorHandle* handle, TF_Status* status);
+
+// Deletes `debug_info`.
+TF_CAPI_EXPORT extern void TFE_DeleteTensorDebugInfo(
+    TFE_TensorDebugInfo* debug_info);
+
+// Returns the number of dimensions used to represent the tensor on its device.
+// The number of dimensions used to reprensent the tensor on device can be
+// different from the number returned by TFE_TensorHandleNumDims.
+// The return value was current at the time of TFE_TensorDebugInfo creation.
+TF_CAPI_EXPORT extern int TFE_TensorDebugInfoOnDeviceNumDims(
+    TFE_TensorDebugInfo* debug_info);
+
+// Returns the number of elements in dimension `dim_index`.
+// Tensor representation on device can be transposed from its representation
+// on host. The data contained in dimension `dim_index` on device
+// can correspond to the data contained in another dimension in on-host
+// representation. The dimensions are indexed using the standard TensorFlow
+// major-to-minor order (slowest varying dimension first),
+// not the XLA's minor-to-major order.
+// On-device dimensions can be padded. TFE_TensorDebugInfoOnDeviceDim returns
+// the number of elements in a dimension after padding.
+// The return value was current at the time of TFE_TensorDebugInfo creation.
+TF_CAPI_EXPORT extern int64_t TFE_TensorDebugInfoOnDeviceDim(
+    TFE_TensorDebugInfo* debug_info, int dim_index);
+
 // Description of the TensorFlow op to execute.
 //
 // Assumes that the provided 'ctx' outlives the returned TFE_Op, i.e.,
diff --git a/tensorflow/c/eager/c_api_debug.cc b/tensorflow/c/eager/c_api_debug.cc
new file mode 100644
index 0000000000000000000000000000000000000000..5006b76f1981d068e99a2c081115ebb3a66d8c7f
--- /dev/null
+++ b/tensorflow/c/eager/c_api_debug.cc
@@ -0,0 +1,167 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/c/eager/c_api.h"
+
+#include <vector>
+
+#include "tensorflow/c/c_api.h"
+#include "tensorflow/c/eager/c_api_internal.h"
+#ifdef TENSORFLOW_EAGER_USE_XLA
+#include "tensorflow/compiler/jit/xla_device.h"
+#endif  // TENSORFLOW_EAGER_USE_XLA
+
+using tensorflow::int64;
+using tensorflow::string;
+
+namespace {
+
+std::vector<int64> TensorShapeAsVector(TFE_TensorHandle* handle,
+                                       TF_Status* status) {
+  std::vector<int64> shape;
+  int rank = TFE_TensorHandleNumDims(handle, status);
+  if (!status->status.ok()) {
+    return shape;
+  }
+  shape.reserve(rank);
+  for (int i = 0; i < rank; ++i) {
+    shape.push_back(TFE_TensorHandleDim(handle, i, status));
+    if (!status->status.ok()) {
+      return shape;
+    }
+  }
+  return shape;
+}
+
+}  // namespace
+
+extern "C" {
+
+TF_CAPI_EXPORT extern TFE_TensorDebugInfo* TFE_TensorHandleTensorDebugInfo(
+    TFE_TensorHandle* handle, TF_Status* status) {
+  const tensorflow::Tensor* tensor;
+  status->status = handle->handle->Tensor(&tensor);
+  if (!status->status.ok()) {
+    return nullptr;
+  }
+
+  tensorflow::Device* device;
+  status->status = handle->handle->Device(&device);
+  if (!status->status.ok()) {
+    return nullptr;
+  }
+
+#ifdef TENSORFLOW_EAGER_USE_XLA
+  // If tensor resides on an XLA device, use XLA device's PaddedShapeFn.
+  tensorflow::XlaDevice* xla_device =
+      dynamic_cast<tensorflow::XlaDevice*>(device);
+  if (xla_device != nullptr) {
+    tensorflow::XlaDevice::PaddedShapeFn shape_fn =
+        xla_device->metadata().padded_shape_fn();
+    xla::Shape padded_shape;
+    status->status = shape_fn(*tensor, &padded_shape);
+    if (!status->status.ok()) {
+      return nullptr;
+    }
+    if (VLOG_IS_ON(3)) {
+      std::vector<int64> shape_to_log = TensorShapeAsVector(handle, status);
+      if (!status->status.ok()) {
+        // Ignore the status here as we are simply logging.
+        status->status = tensorflow::Status::OK();
+      } else {
+        VLOG(3) << "Fully padded shape of ["
+                << tensorflow::str_util::Join(shape_to_log, ", ") << "] is "
+                << padded_shape.DebugString();
+      }
+    }
+
+    if (xla::ShapeUtil::IsTuple(padded_shape)) {
+      if (xla::ShapeUtil::TupleElementCount(padded_shape) != 2) {
+        // Currently, the only case of XlaTensor containing a tuple shape is to
+        // represent 64 bit ints, doubles, and complex numbers (we don't support
+        // 64bit complex numbers).
+        status->status = tensorflow::errors::InvalidArgument(
+            "XlaTensors should only contain tuples of size 2. Shape: ",
+            padded_shape.DebugString());
+        return nullptr;
+      }
+
+      // shape0 is not a const& because we will assign it to padded_shape below.
+      // It is illegal to assign a part of a message to itself.
+      xla::Shape shape0 = xla::ShapeUtil::GetTupleElementShape(padded_shape, 0);
+      const xla::Shape& shape1 =
+          xla::ShapeUtil::GetTupleElementShape(padded_shape, 1);
+      if (xla::ShapeUtil::IsTuple(shape0) || xla::ShapeUtil::IsTuple(shape1)) {
+        status->status = tensorflow::errors::InvalidArgument(
+            "XlaTensors should not contain nested tuples. Shape: ",
+            padded_shape.DebugString());
+        return nullptr;
+      }
+      if (!xla::ShapeUtil::Equal(shape0, shape1)) {
+        status->status = tensorflow::errors::InvalidArgument(
+            "Subshapes of XlaTensors should be the same. Shape: ",
+            padded_shape.DebugString());
+        return nullptr;
+      }
+
+      // Since the only case we handle here are two equal subshapes, we
+      // simply return one of them. The caller will interpret it as this
+      // shape directly storing the 64bit types. This approximation is good
+      // enough for this API's debugging use case.
+      padded_shape = shape0;
+    }
+
+    int rank = padded_shape.dimensions_size();
+    std::vector<int64> dev_dims;
+    dev_dims.reserve(rank);
+    if (rank == 1) {
+      // Rank 1 tensors might not have padded_shape.layout.minor_to_major set,
+      dev_dims.push_back(padded_shape.dimensions(0));
+    } else {
+      for (int i = rank - 1; i >= 0; --i) {
+        int64 dim_index = padded_shape.layout().minor_to_major(i);
+        dev_dims.push_back(padded_shape.dimensions(dim_index));
+      }
+    }
+    status->status = tensorflow::Status::OK();
+    return new TFE_TensorDebugInfo(dev_dims);
+  }
+#endif  // TENSORFLOW_EAGER_USE_XLA
+
+  // If the tensor is not an XLA tensor, the device shape is
+  // the same as regular tensor shape.
+  std::vector<int64> dev_dims = TensorShapeAsVector(handle, status);
+  if (!status->status.ok()) {
+    return nullptr;
+  }
+  return new TFE_TensorDebugInfo(dev_dims);
+}
+
+TF_CAPI_EXPORT extern void TFE_DeleteTensorDebugInfo(
+    TFE_TensorDebugInfo* debug_info) {
+  delete debug_info;
+}
+
+TF_CAPI_EXPORT extern int TFE_TensorDebugInfoOnDeviceNumDims(
+    TFE_TensorDebugInfo* debug_info) {
+  return debug_info->dev_dims.size();
+}
+
+TF_CAPI_EXPORT extern int64_t TFE_TensorDebugInfoOnDeviceDim(
+    TFE_TensorDebugInfo* debug_info, int dim_index) {
+  return debug_info->dev_dims[dim_index];
+}
+
+}  // extern "C"
diff --git a/tensorflow/c/eager/c_api_debug_test.cc b/tensorflow/c/eager/c_api_debug_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..cddb9f6e00e9d639026f4bbe061d58f76771c0a9
--- /dev/null
+++ b/tensorflow/c/eager/c_api_debug_test.cc
@@ -0,0 +1,50 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/c/eager/c_api.h"
+
+#include <string.h>
+#include "tensorflow/c/eager/c_api_test_util.h"
+#include "tensorflow/core/platform/logging.h"
+#include "tensorflow/core/platform/test.h"
+
+TEST(CApiDebug, ScalarCPU) {
+  TFE_TensorHandle* h = TestScalarTensorHandle();
+  TF_Status* status = TF_NewStatus();
+  TFE_TensorDebugInfo* debug_info = TFE_TensorHandleTensorDebugInfo(h, status);
+  CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
+
+  ASSERT_EQ(0, TFE_TensorDebugInfoOnDeviceNumDims(debug_info));
+
+  TFE_DeleteTensorDebugInfo(debug_info);
+  TFE_DeleteTensorHandle(h);
+  TF_DeleteStatus(status);
+}
+
+TEST(CApiDebug, 2DCPU) {
+  TFE_TensorHandle* h = TestMatrixTensorHandle3X2();
+  TF_Status* status = TF_NewStatus();
+  TFE_TensorDebugInfo* debug_info = TFE_TensorHandleTensorDebugInfo(h, status);
+  CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
+
+  ASSERT_EQ(2, TFE_TensorDebugInfoOnDeviceNumDims(debug_info));
+  // Shape is the same for CPU tensors.
+  EXPECT_EQ(3, TFE_TensorDebugInfoOnDeviceDim(debug_info, 0));
+  EXPECT_EQ(2, TFE_TensorDebugInfoOnDeviceDim(debug_info, 1));
+
+  TFE_DeleteTensorDebugInfo(debug_info);
+  TFE_DeleteTensorHandle(h);
+  TF_DeleteStatus(status);
+}
diff --git a/tensorflow/c/eager/c_api_internal.h b/tensorflow/c/eager/c_api_internal.h
index 2b8384d72038c3b4a050c70b3e7c5e0ca0bd94f3..04a6efc47c5177c82b7e88168b67cc584587de7c 100644
--- a/tensorflow/c/eager/c_api_internal.h
+++ b/tensorflow/c/eager/c_api_internal.h
@@ -107,6 +107,14 @@ struct TFE_TensorHandle {
   tensorflow::TensorHandle* handle;
 };
 
+struct TFE_TensorDebugInfo {
+  TFE_TensorDebugInfo(const std::vector<tensorflow::int64>& dims)
+      : dev_dims(dims) {}
+
+  // Fully-padded, minor-to-major.
+  std::vector<tensorflow::int64> dev_dims;
+};
+
 struct TFE_Op {
   // t is NULL iff the TFE_Op corresponds to a TensorFlow function instead of a
   // primitive operation.
diff --git a/tensorflow/c/eager/c_api_test.cc b/tensorflow/c/eager/c_api_test.cc
index 49646bb73599d96fce2df90f918e692df7972aeb..992d1afd5fcb0641794bb2abbe5ab20a287d3b62 100644
--- a/tensorflow/c/eager/c_api_test.cc
+++ b/tensorflow/c/eager/c_api_test.cc
@@ -16,6 +16,7 @@ limitations under the License.
 #include "tensorflow/c/eager/c_api.h"
 
 #include <string.h>
+#include "tensorflow/c/eager/c_api_test_util.h"
 #include "tensorflow/core/distributed_runtime/rpc/eager/eager_grpc_server_lib.h"
 #include "tensorflow/core/framework/function.pb.h"
 #include "tensorflow/core/lib/strings/strcat.h"
@@ -32,122 +33,6 @@ using tensorflow::string;
 
 namespace {
 
-TFE_TensorHandle* DoubleTestMatrixTensorHandle() {
-  int64_t dims[] = {2, 2};
-  double data[] = {1.0, 2.0, 3.0, 4.0};
-  TF_Tensor* t = TF_AllocateTensor(
-      TF_DOUBLE, &dims[0], sizeof(dims) / sizeof(int64_t), sizeof(data));
-  memcpy(TF_TensorData(t), &data[0], TF_TensorByteSize(t));
-  TF_Status* status = TF_NewStatus();
-  TFE_TensorHandle* th = TFE_NewTensorHandle(t, status);
-  CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
-  TF_DeleteTensor(t);
-  TF_DeleteStatus(status);
-  return th;
-}
-
-TFE_TensorHandle* TestMatrixTensorHandle() {
-  int64_t dims[] = {2, 2};
-  float data[] = {1.0f, 2.0f, 3.0f, 4.0f};
-  TF_Tensor* t = TF_AllocateTensor(
-      TF_FLOAT, &dims[0], sizeof(dims) / sizeof(int64_t), sizeof(data));
-  memcpy(TF_TensorData(t), &data[0], TF_TensorByteSize(t));
-  TF_Status* status = TF_NewStatus();
-  TFE_TensorHandle* th = TFE_NewTensorHandle(t, status);
-  CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
-  TF_DeleteTensor(t);
-  TF_DeleteStatus(status);
-  return th;
-}
-
-TFE_TensorHandle* TestMatrixTensorHandle3X2() {
-  int64_t dims[] = {3, 2};
-  double data[] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0};
-  TF_Tensor* t = TF_AllocateTensor(
-      TF_FLOAT, &dims[0], sizeof(dims) / sizeof(int64_t), sizeof(data));
-  memcpy(TF_TensorData(t), &data[0], TF_TensorByteSize(t));
-  TF_Status* status = TF_NewStatus();
-  TFE_TensorHandle* th = TFE_NewTensorHandle(t, status);
-  CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
-  TF_DeleteTensor(t);
-  TF_DeleteStatus(status);
-  return th;
-}
-
-TFE_Op* MatMulOp(TFE_Context* ctx, TFE_TensorHandle* a, TFE_TensorHandle* b) {
-  TF_Status* status = TF_NewStatus();
-
-  TFE_Op* op = TFE_NewOp(ctx, "MatMul", status);
-  CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
-  TFE_OpAddInput(op, a, status);
-  CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
-  TFE_OpAddInput(op, b, status);
-  CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
-  TF_DeleteStatus(status);
-  TFE_OpSetAttrBool(op, "transpose_a", 0);
-  TFE_OpSetAttrBool(op, "transpose_b", 0);
-  TFE_OpSetAttrType(op, "T", TFE_TensorHandleDataType(a));
-
-  return op;
-}
-
-TFE_TensorHandle* TestAxisTensorHandle() {
-  int64_t dims[] = {1};
-  int data[] = {1};
-  TF_Tensor* t = TF_AllocateTensor(
-      TF_INT32, &dims[0], sizeof(dims) / sizeof(int64_t), sizeof(data));
-  memcpy(TF_TensorData(t), &data[0], TF_TensorByteSize(t));
-  TF_Status* status = TF_NewStatus();
-  TFE_TensorHandle* th = TFE_NewTensorHandle(t, status);
-  CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
-  TF_DeleteTensor(t);
-  TF_DeleteStatus(status);
-  return th;
-}
-
-TFE_Op* MinOp(TFE_Context* ctx, TFE_TensorHandle* input,
-              TFE_TensorHandle* axis) {
-  TF_Status* status = TF_NewStatus();
-
-  TFE_Op* op = TFE_NewOp(ctx, "Min", status);
-  CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
-  TFE_OpAddInput(op, input, status);
-  CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
-  TFE_OpAddInput(op, axis, status);
-  CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
-  TFE_OpSetAttrBool(op, "keep_dims", 1);
-  TFE_OpSetAttrType(op, "Tidx", TF_INT32);
-  TF_DeleteStatus(status);
-  TFE_OpSetAttrType(op, "T", TFE_TensorHandleDataType(input));
-
-  return op;
-}
-
-// If there is a GPU device, returns true and sets 'gpu_device_name'
-// accordingly.
-bool GetGPUDeviceName(TFE_Context* ctx, string* gpu_device_name) {
-  std::unique_ptr<TF_Status, decltype(&TF_DeleteStatus)> status(
-      TF_NewStatus(), TF_DeleteStatus);
-  TF_DeviceList* devices = TFE_ContextListDevices(ctx, status.get());
-  CHECK_EQ(TF_OK, TF_GetCode(status.get())) << TF_Message(status.get());
-
-  const int num_devices = TF_DeviceListCount(devices);
-  for (int i = 0; i < num_devices; ++i) {
-    const string device_type(TF_DeviceListType(devices, i, status.get()));
-    CHECK_EQ(TF_GetCode(status.get()), TF_OK) << TF_Message(status.get());
-    const string device_name(TF_DeviceListName(devices, i, status.get()));
-    CHECK_EQ(TF_GetCode(status.get()), TF_OK) << TF_Message(status.get());
-    if (device_type == "GPU") {
-      *gpu_device_name = device_name;
-      LOG(INFO) << "Found GPU device " << device_name;
-      TF_DeleteDeviceList(devices);
-      return true;
-    }
-  }
-  TF_DeleteDeviceList(devices);
-  return false;
-}
-
 void BM_InitOp(int iters) {
   tensorflow::testing::StopTiming();
   TF_Status* status = TF_NewStatus();
@@ -257,8 +142,10 @@ void TestRemoteExecute(bool async) {
   TFE_ContextOptions* opts = TFE_NewContextOptions();
   TFE_ContextOptionsSetServerDef(opts, serialized.data(), serialized.size(),
                                  status);
-  TFE_ContextOptionsSetAsync(opts, static_cast<unsigned char>(1));
   EXPECT_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
+  TFE_ContextOptionsSetAsync(opts, static_cast<unsigned char>(1));
+  TFE_ContextOptionsSetDevicePlacementPolicy(opts,
+                                             TFE_DEVICE_PLACEMENT_EXPLICIT);
   TFE_Context* ctx = TFE_NewContext(opts, status);
   EXPECT_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
   TFE_DeleteContextOptions(opts);
@@ -320,6 +207,83 @@ void TestRemoteExecute(bool async) {
 TEST(CAPI, RemoteExecute) { TestRemoteExecute(false); }
 TEST(CAPI, RemoteExecuteAsync) { TestRemoteExecute(true); }
 
+void TestRemoteExecuteSilentCopies(bool async) {
+  tensorflow::ServerDef server_def = GetServerDef(2);
+
+  // This server def has the task index set to 0.
+  string serialized = server_def.SerializeAsString();
+
+  server_def.set_task_index(1);
+
+  std::unique_ptr<tensorflow::eager::EagerGrpcServer> worker_server;
+  ASSERT_TRUE(
+      tensorflow::eager::EagerGrpcServer::Create(server_def, &worker_server)
+          .ok());
+  ASSERT_TRUE(worker_server->Start().ok());
+
+  TF_Status* status = TF_NewStatus();
+  TFE_ContextOptions* opts = TFE_NewContextOptions();
+  TFE_ContextOptionsSetServerDef(opts, serialized.data(), serialized.size(),
+                                 status);
+  EXPECT_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
+  TFE_ContextOptionsSetAsync(opts, static_cast<unsigned char>(1));
+  TFE_ContextOptionsSetDevicePlacementPolicy(opts, TFE_DEVICE_PLACEMENT_SILENT);
+  TFE_Context* ctx = TFE_NewContext(opts, status);
+  EXPECT_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
+  TFE_DeleteContextOptions(opts);
+
+  TFE_TensorHandle* h0_task0 = TestMatrixTensorHandle();
+  TFE_TensorHandle* h1_task0 = TestMatrixTensorHandle();
+  const char remote_device_name[] =
+      "/job:localhost/replica:0/task:1/device:CPU:0";
+
+  // Handles are on task0, but op is on remote (task1).
+  TFE_Op* matmul = MatMulOp(ctx, h0_task0, h1_task0);
+  TFE_OpSetDevice(matmul, remote_device_name, status);
+  EXPECT_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
+
+  TFE_TensorHandle* retvals[1];
+  int num_retvals = 1;
+  TFE_Execute(matmul, &retvals[0], &num_retvals, status);
+  EXPECT_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
+
+  auto* retval_task0 = TFE_TensorHandleCopyToDevice(
+      retvals[0], ctx, "/job:localhost/replica:0/task:0/device:CPU:0", status);
+  ASSERT_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
+
+  TF_Tensor* t = TFE_TensorHandleResolve(retval_task0, status);
+  ASSERT_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
+  TFE_DeleteTensorHandle(retval_task0);
+  float product[4] = {0};
+  EXPECT_EQ(sizeof(product), TF_TensorByteSize(t));
+  memcpy(&product[0], TF_TensorData(t), TF_TensorByteSize(t));
+  TF_DeleteTensor(t);
+  EXPECT_EQ(7, product[0]);
+  EXPECT_EQ(10, product[1]);
+  EXPECT_EQ(15, product[2]);
+  EXPECT_EQ(22, product[3]);
+
+  TFE_DeleteTensorHandle(h0_task0);
+  TFE_DeleteTensorHandle(h1_task0);
+  TFE_DeleteTensorHandle(retvals[0]);
+
+  TFE_DeleteOp(matmul);
+
+  TFE_ContextAsyncWait(ctx, status);
+  TFE_DeleteContext(ctx, status);
+  EXPECT_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
+
+  TF_DeleteStatus(status);
+
+  // TODO(nareshmodi): Figure out how to correctly shut the server down.
+  worker_server.release();
+}
+
+TEST(CAPI, RemoteExecuteSilentCopies) { TestRemoteExecuteSilentCopies(false); }
+TEST(CAPI, RemoteExecuteSilentCopiesAsync) {
+  TestRemoteExecuteSilentCopies(true);
+}
+
 TEST(CAPI, TensorHandle) {
   TFE_TensorHandle* h = TestMatrixTensorHandle();
   EXPECT_EQ(TF_FLOAT, TFE_TensorHandleDataType(h));
@@ -536,7 +500,7 @@ void TensorHandleSilentCopy(bool async) {
 
   // Disable the test if no GPU is present.
   string gpu_device_name;
-  if (GetGPUDeviceName(ctx, &gpu_device_name)) {
+  if (GetDeviceName(ctx, &gpu_device_name, "GPU")) {
     TFE_TensorHandle* hgpu = TFE_TensorHandleCopyToDevice(
         hcpu, ctx, gpu_device_name.c_str(), status.get());
     ASSERT_TRUE(TF_GetCode(status.get()) == TF_OK) << TF_Message(status.get());
@@ -583,7 +547,7 @@ void TensorHandleSilentCopyLocal(bool async) {
 
   // Disable the test if no GPU is present.
   string gpu_device_name;
-  if (GetGPUDeviceName(ctx, &gpu_device_name)) {
+  if (GetDeviceName(ctx, &gpu_device_name, "GPU")) {
     TFE_TensorHandle* hgpu = TFE_TensorHandleCopyToDevice(
         hcpu, ctx, gpu_device_name.c_str(), status.get());
     ASSERT_TRUE(TF_GetCode(status.get()) == TF_OK) << TF_Message(status.get());
@@ -624,7 +588,7 @@ void SetAndGetOpDevices(bool async) {
 
   // Disable the test if no GPU is present.
   string gpu_device_name;
-  if (GetGPUDeviceName(ctx, &gpu_device_name)) {
+  if (GetDeviceName(ctx, &gpu_device_name, "GPU")) {
     TFE_OpSetDevice(matmul, "GPU:0", status);
     ASSERT_TRUE(TF_GetCode(status) == TF_OK) << TF_Message(status);
     const char* device_name = TFE_OpGetDevice(matmul, status);
@@ -688,7 +652,7 @@ void Execute_MatMul_CPU_Runtime_Error(bool async) {
   TFE_DeleteContextOptions(opts);
 
   TFE_TensorHandle* m1 = TestMatrixTensorHandle();
-  TFE_TensorHandle* m2 = TestMatrixTensorHandle3X2();
+  TFE_TensorHandle* m2 = DoubleTestMatrixTensorHandle3X2();
   TFE_Op* matmul = MatMulOp(ctx, m1, m2);
   TFE_OpSetDevice(matmul, "/job:localhost/replica:0/task:0/device:CPU:0",
                   status);
diff --git a/tensorflow/c/eager/c_api_test_util.cc b/tensorflow/c/eager/c_api_test_util.cc
new file mode 100644
index 0000000000000000000000000000000000000000..5607c9dcb0bbec72b2f86def3dd4e6590d73197b
--- /dev/null
+++ b/tensorflow/c/eager/c_api_test_util.cc
@@ -0,0 +1,163 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/c/eager/c_api_test_util.h"
+
+#include "tensorflow/c/eager/c_api.h"
+#include "tensorflow/core/platform/logging.h"
+#include "tensorflow/core/platform/test.h"
+
+using tensorflow::string;
+
+TFE_TensorHandle* TestScalarTensorHandle() {
+  float data[] = {1.0f};
+  TF_Tensor* t = TF_AllocateTensor(TF_FLOAT, nullptr, 0, sizeof(float));
+  memcpy(TF_TensorData(t), &data[0], TF_TensorByteSize(t));
+  TF_Status* status = TF_NewStatus();
+  TFE_TensorHandle* th = TFE_NewTensorHandle(t, status);
+  CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
+  TF_DeleteTensor(t);
+  TF_DeleteStatus(status);
+  return th;
+}
+
+TFE_TensorHandle* DoubleTestMatrixTensorHandle() {
+  int64_t dims[] = {2, 2};
+  double data[] = {1.0, 2.0, 3.0, 4.0};
+  TF_Tensor* t = TF_AllocateTensor(
+      TF_DOUBLE, &dims[0], sizeof(dims) / sizeof(int64_t), sizeof(data));
+  memcpy(TF_TensorData(t), &data[0], TF_TensorByteSize(t));
+  TF_Status* status = TF_NewStatus();
+  TFE_TensorHandle* th = TFE_NewTensorHandle(t, status);
+  CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
+  TF_DeleteTensor(t);
+  TF_DeleteStatus(status);
+  return th;
+}
+
+TFE_TensorHandle* TestMatrixTensorHandle() {
+  int64_t dims[] = {2, 2};
+  float data[] = {1.0f, 2.0f, 3.0f, 4.0f};
+  TF_Tensor* t = TF_AllocateTensor(
+      TF_FLOAT, &dims[0], sizeof(dims) / sizeof(int64_t), sizeof(data));
+  memcpy(TF_TensorData(t), &data[0], TF_TensorByteSize(t));
+  TF_Status* status = TF_NewStatus();
+  TFE_TensorHandle* th = TFE_NewTensorHandle(t, status);
+  CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
+  TF_DeleteTensor(t);
+  TF_DeleteStatus(status);
+  return th;
+}
+
+TFE_TensorHandle* DoubleTestMatrixTensorHandle3X2() {
+  int64_t dims[] = {3, 2};
+  double data[] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0};
+  TF_Tensor* t = TF_AllocateTensor(
+      TF_FLOAT, &dims[0], sizeof(dims) / sizeof(int64_t), sizeof(data));
+  memcpy(TF_TensorData(t), &data[0], TF_TensorByteSize(t));
+  TF_Status* status = TF_NewStatus();
+  TFE_TensorHandle* th = TFE_NewTensorHandle(t, status);
+  CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
+  TF_DeleteTensor(t);
+  TF_DeleteStatus(status);
+  return th;
+}
+
+TFE_TensorHandle* TestMatrixTensorHandle3X2() {
+  int64_t dims[] = {3, 2};
+  float data[] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f};
+  TF_Tensor* t = TF_AllocateTensor(
+      TF_FLOAT, &dims[0], sizeof(dims) / sizeof(int64_t), sizeof(data));
+  memcpy(TF_TensorData(t), &data[0], TF_TensorByteSize(t));
+  TF_Status* status = TF_NewStatus();
+  TFE_TensorHandle* th = TFE_NewTensorHandle(t, status);
+  CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
+  TF_DeleteTensor(t);
+  TF_DeleteStatus(status);
+  return th;
+}
+
+TFE_Op* MatMulOp(TFE_Context* ctx, TFE_TensorHandle* a, TFE_TensorHandle* b) {
+  TF_Status* status = TF_NewStatus();
+
+  TFE_Op* op = TFE_NewOp(ctx, "MatMul", status);
+  CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
+  TFE_OpAddInput(op, a, status);
+  CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
+  TFE_OpAddInput(op, b, status);
+  CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
+  TF_DeleteStatus(status);
+  TFE_OpSetAttrBool(op, "transpose_a", 0);
+  TFE_OpSetAttrBool(op, "transpose_b", 0);
+  TFE_OpSetAttrType(op, "T", TFE_TensorHandleDataType(a));
+
+  return op;
+}
+
+TFE_TensorHandle* TestAxisTensorHandle() {
+  int64_t dims[] = {1};
+  int data[] = {1};
+  TF_Tensor* t = TF_AllocateTensor(
+      TF_INT32, &dims[0], sizeof(dims) / sizeof(int64_t), sizeof(data));
+  memcpy(TF_TensorData(t), &data[0], TF_TensorByteSize(t));
+  TF_Status* status = TF_NewStatus();
+  TFE_TensorHandle* th = TFE_NewTensorHandle(t, status);
+  CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
+  TF_DeleteTensor(t);
+  TF_DeleteStatus(status);
+  return th;
+}
+
+TFE_Op* MinOp(TFE_Context* ctx, TFE_TensorHandle* input,
+              TFE_TensorHandle* axis) {
+  TF_Status* status = TF_NewStatus();
+
+  TFE_Op* op = TFE_NewOp(ctx, "Min", status);
+  CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
+  TFE_OpAddInput(op, input, status);
+  CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
+  TFE_OpAddInput(op, axis, status);
+  CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
+  TFE_OpSetAttrBool(op, "keep_dims", 1);
+  TFE_OpSetAttrType(op, "Tidx", TF_INT32);
+  TF_DeleteStatus(status);
+  TFE_OpSetAttrType(op, "T", TFE_TensorHandleDataType(input));
+
+  return op;
+}
+
+bool GetDeviceName(TFE_Context* ctx, string* device_name,
+                   const char* device_type) {
+  std::unique_ptr<TF_Status, decltype(&TF_DeleteStatus)> status(
+      TF_NewStatus(), TF_DeleteStatus);
+  TF_DeviceList* devices = TFE_ContextListDevices(ctx, status.get());
+  CHECK_EQ(TF_OK, TF_GetCode(status.get())) << TF_Message(status.get());
+
+  const int num_devices = TF_DeviceListCount(devices);
+  for (int i = 0; i < num_devices; ++i) {
+    const string dev_type(TF_DeviceListType(devices, i, status.get()));
+    CHECK_EQ(TF_GetCode(status.get()), TF_OK) << TF_Message(status.get());
+    const string dev_name(TF_DeviceListName(devices, i, status.get()));
+    CHECK_EQ(TF_GetCode(status.get()), TF_OK) << TF_Message(status.get());
+    if (dev_type == device_type) {
+      *device_name = dev_name;
+      LOG(INFO) << "Found " << device_type << " device " << *device_name;
+      TF_DeleteDeviceList(devices);
+      return true;
+    }
+  }
+  TF_DeleteDeviceList(devices);
+  return false;
+}
diff --git a/tensorflow/c/eager/c_api_test_util.h b/tensorflow/c/eager/c_api_test_util.h
new file mode 100644
index 0000000000000000000000000000000000000000..474cae67c89249af3a62707f0db00ba458ca8f31
--- /dev/null
+++ b/tensorflow/c/eager/c_api_test_util.h
@@ -0,0 +1,53 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#ifndef TENSORFLOW_C_EAGER_C_API_TEST_UTIL_H_
+#define TENSORFLOW_C_EAGER_C_API_TEST_UTIL_H_
+
+#include "tensorflow/c/eager/c_api.h"
+
+#include "tensorflow/core/platform/types.h"
+
+// Return a tensor handle containing a float scalar
+TFE_TensorHandle* TestScalarTensorHandle();
+
+// Return a tensor handle containing a 2x2 matrix of doubles
+TFE_TensorHandle* DoubleTestMatrixTensorHandle();
+
+// Return a tensor handle containing a 2x2 matrix of floats
+TFE_TensorHandle* TestMatrixTensorHandle();
+
+// Return a tensor handle containing a 3x2 matrix of doubles
+TFE_TensorHandle* DoubleTestMatrixTensorHandle3X2();
+
+// Return a tensor handle containing a 3x2 matrix of floats
+TFE_TensorHandle* TestMatrixTensorHandle3X2();
+
+// Return a matmul op multiplying `a` by `b`.
+TFE_Op* MatMulOp(TFE_Context* ctx, TFE_TensorHandle* a, TFE_TensorHandle* b);
+
+// Return an 1-D INT32 tensor containing a single value 1.
+TFE_TensorHandle* TestAxisTensorHandle();
+
+// Return an op taking minimum of `input` long `axis` dimension.
+TFE_Op* MinOp(TFE_Context* ctx, TFE_TensorHandle* input,
+              TFE_TensorHandle* axis);
+
+// If there is a device of type `device_type`, returns true
+// and sets 'device_name' accordingly.
+// `device_type` must be either "GPU" or "TPU".
+bool GetDeviceName(TFE_Context* ctx, tensorflow::string* device_name,
+                   const char* device_type);
+
+#endif  // TENSORFLOW_C_EAGER_C_API_TEST_UTIL_H_
diff --git a/tensorflow/c/eager/tape.h b/tensorflow/c/eager/tape.h
index 1833b25fea0047c9652318e49599ba623daaec26..734e712daa39c03f0177eb199b1acb1b19e5d845 100644
--- a/tensorflow/c/eager/tape.h
+++ b/tensorflow/c/eager/tape.h
@@ -48,7 +48,7 @@ struct OpTapeEntry {
 
   // Should be called before deleting the backward function. TODO(apassos) use
   // unique_ptrs to ensure this happens.
-  std::function<void()> backward_function_deleter;
+  std::function<void(BackwardFunction*)> backward_function_deleter;
 };
 
 // Map from tensor_id to internally-defined operation-id of the operation which
@@ -110,12 +110,6 @@ class VSpace {
 
   // Deletes the input tensor.
   virtual void DeleteGradient(Gradient* gradient) const = 0;
-
-  // Lets this VSpace know that it can release resources held by the
-  // `backward_function`, It will not be called again.
-  // `backward_function` must not be null.
-  virtual void ReleaseBackwardFunction(
-      BackwardFunction* backward_function) const = 0;
 };
 
 // Traces the execution of operations, doing eager garbage collection, and
@@ -130,7 +124,7 @@ class GradientTape {
   GradientTape(bool persistent) : persistent_(persistent) {}
   ~GradientTape() {
     for (const auto& pair : op_tape_) {
-      pair.second.backward_function_deleter();
+      pair.second.backward_function_deleter(pair.second.backward_function);
     }
   }
 
@@ -139,12 +133,12 @@ class GradientTape {
 
   void Watch(int64 tensor_id);
 
-  void RecordOperation(const string& op_type,
-                       gtl::ArraySlice<TapeTensor> output_tensors,
-                       gtl::ArraySlice<int64> input_tensor_id,
-                       gtl::ArraySlice<tensorflow::DataType> input_dtypes,
-                       BackwardFunction* backward_function,
-                       const std::function<void()>& backward_function_deleter);
+  void RecordOperation(
+      const string& op_type, gtl::ArraySlice<TapeTensor> output_tensors,
+      gtl::ArraySlice<int64> input_tensor_id,
+      gtl::ArraySlice<tensorflow::DataType> input_dtypes,
+      BackwardFunction* backward_function,
+      const std::function<void(BackwardFunction*)>& backward_function_deleter);
 
   void DeleteTrace(int64 tensor_id);
 
@@ -218,9 +212,9 @@ void GradientTape<Gradient, BackwardFunction>::RecordOperation(
     gtl::ArraySlice<int64> input_tensor_id,
     gtl::ArraySlice<tensorflow::DataType> input_dtypes,
     BackwardFunction* backward_function,
-    const std::function<void()>& backward_function_deleter) {
+    const std::function<void(BackwardFunction*)>& backward_function_deleter) {
   if (!ShouldRecord(input_tensor_id, input_dtypes)) {
-    backward_function_deleter();
+    backward_function_deleter(backward_function);
     return;
   }
   std::vector<int64> ids;
@@ -275,7 +269,7 @@ void GradientTape<Gradient, BackwardFunction>::DeleteTrace(int64 tensor_id) {
   for (int64 id : op_it->second.input_tensor_id) {
     DeleteTrace(id);
   }
-  op_it->second.backward_function_deleter();
+  op_it->second.backward_function_deleter(op_it->second.backward_function);
   op_tape_.erase(op_it);
 }
 
@@ -381,7 +375,8 @@ BackpropInitialState<BackwardFunction> PrepareBackprop(
     // backward functions that will be used for gradient computation
     // has been transferred to `result`.
     for (const auto& op_pair : *op_tape) {
-      op_pair.second.backward_function_deleter();
+      op_pair.second.backward_function_deleter(
+          op_pair.second.backward_function);
     }
     op_tape->clear();
   }
@@ -473,7 +468,7 @@ Status GradientTape<Gradient, BackwardFunction>::ComputeGradient(
     if (!persistent_) {
       // Release all backprop functions
       for (const auto& pair : state.op_tape) {
-        pair.second.backward_function_deleter();
+        pair.second.backward_function_deleter(pair.second.backward_function);
       }
     }
   };
@@ -541,7 +536,7 @@ Status GradientTape<Gradient, BackwardFunction>::ComputeGradient(
       Status s = vspace.CallBackwardFunction(trace.backward_function,
                                              out_gradients, &in_gradients);
       if (!persistent_) {
-        vspace.ReleaseBackwardFunction(trace.backward_function);
+        trace.backward_function_deleter(trace.backward_function);
       }
       if (!s.ok()) {
         cleanup();
@@ -550,7 +545,7 @@ Status GradientTape<Gradient, BackwardFunction>::ComputeGradient(
     } else {
       in_gradients.resize(trace.input_tensor_id.size());
       if (!persistent_) {
-        vspace.ReleaseBackwardFunction(trace.backward_function);
+        trace.backward_function_deleter(trace.backward_function);
       }
       for (Gradient* grad : out_gradients) {
         if (grad != nullptr) {
diff --git a/tensorflow/c/generate-pc.sh b/tensorflow/c/generate-pc.sh
index 02a6a58b6153bb78c684f9290ef95900f96e9357..7184ad68fb79f2598067d68d5ab5ba8f2c7a22c8 100755
--- a/tensorflow/c/generate-pc.sh
+++ b/tensorflow/c/generate-pc.sh
@@ -15,10 +15,12 @@
 # ==============================================================================
 
 TF_PREFIX='/usr/local'
+LIBDIR='lib'
 
 usage() {
     echo "Usage: $0 OPTIONS"
     echo -e "-p, --prefix\tset installation prefix (default: /usr/local)"
+    echo -e "-l, --libdir\tset lib directory (default: lib)"
     echo -e "-v, --version\tset TensorFlow version"
     echo -e "-h, --help\tdisplay this message"
 }
@@ -26,7 +28,7 @@ usage() {
 [ $# == 0 ] && usage && exit 0
 
 # read the options
-ARGS=$(getopt -o p:v:h --long prefix:,version:,help -n $0 -- "$@")
+ARGS=$(getopt -o p:l:v:h --long prefix:,libdir:,version:,help -n $0 -- "$@")
 eval set -- "$ARGS"
 
 # extract options and their arguments into variables.
@@ -38,6 +40,11 @@ while true ; do
                 "") shift 2 ;;
                 *) TF_PREFIX=$2 ; shift 2 ;;
             esac ;;
+        -l|--libdir)
+            case "$2" in
+                "") shift 2 ;;
+                *) LIBDIR=$2 ; shift 2 ;;
+            esac ;;
         -v|--version)
             case "$2" in
                 "") shift 2 ;;
@@ -55,7 +62,7 @@ echo "Generating pkgconfig file for TensorFlow $TF_VERSION in $TF_PREFIX"
 cat << EOF > tensorflow.pc
 prefix=${TF_PREFIX}
 exec_prefix=\${prefix}
-libdir=\${exec_prefix}/lib
+libdir=\${exec_prefix}/${LIBDIR}
 includedir=\${prefix}/include
 
 Name: TensorFlow
diff --git a/tensorflow/cc/framework/cc_op_gen.cc b/tensorflow/cc/framework/cc_op_gen.cc
index d6a4f141b6bb8ccadb77f1fa83b5fb742d78f70f..dfdef88945deca376368edd6f7aa322b1e1cbf94 100644
--- a/tensorflow/cc/framework/cc_op_gen.cc
+++ b/tensorflow/cc/framework/cc_op_gen.cc
@@ -273,6 +273,12 @@ string PrintAttrValue(const string& op, const AttrValue& attr_value) {
   return "<Unknown AttrValue type>";  // Prevent missing return warning
 }
 
+bool IsEmptyList(const AttrValue::ListValue& list) {
+  return list.s_size() == 0 && list.i_size() == 0 && list.f_size() == 0 &&
+         list.b_size() == 0 && list.type_size() == 0 &&
+         list.shape_size() == 0 && list.tensor_size() == 0;
+}
+
 string ToCamelCase(const string& str) {
   string result;
   const char joiner = '_';
@@ -297,9 +303,9 @@ string ToCamelCase(const string& str) {
 // indicate whether to treat the type as const when accepting the C++ type as an
 // argument to a function.
 std::pair<const char*, bool> AttrTypeName(StringPiece attr_type) {
-  static const std::unordered_map<StringPiece, std::pair<const char*, bool>,
-                                  StringPieceHasher>
-      attr_type_map{
+  static const auto* attr_type_map =
+      new std::unordered_map<StringPiece, std::pair<const char*, bool>,
+                             StringPieceHasher>{
           {"string", {"StringPiece", false}},
           {"list(string)", {"gtl::ArraySlice<string>", true}},
           {"int", {"int64", false}},
@@ -317,14 +323,34 @@ std::pair<const char*, bool> AttrTypeName(StringPiece attr_type) {
           {"func", {"NameAttrList", true}},
       };
 
-  auto entry = attr_type_map.find(attr_type);
-  if (entry == attr_type_map.end()) {
+  auto entry = attr_type_map->find(attr_type);
+  if (entry == attr_type_map->end()) {
     LOG(FATAL) << "Unsupported Attr type: " << attr_type;
     return {"", false};
   }
   return entry->second;
 }
 
+const char* ListElementTypeName(StringPiece attr_type) {
+  static const auto* attr_list_type_map =
+      new std::unordered_map<StringPiece, const char*, StringPieceHasher>{
+          {"list(string)", "string"},
+          {"list(int)", "int"},
+          {"list(float)", "float"},
+          {"list(bool)", "bool"},
+          {"list(type)", "DataType"},
+          {"list(shape)", "PartialTensorShape"},
+          {"list(tensor)", "TensorProto"},
+      };
+
+  auto entry = attr_list_type_map->find(attr_type);
+  if (entry == attr_list_type_map->end()) {
+    LOG(FATAL) << "Unsupported or non-list Attr type: " << attr_type;
+    return "";
+  }
+  return entry->second;
+}
+
 bool IsCPPKeyword(StringPiece name) {
   static const std::unordered_set<StringPiece, StringPieceHasher>
       // Keywords obtained from http://en.cppreference.com/w/cpp/keyword
@@ -668,6 +694,7 @@ OpInfo::OpInfo(const OpDef& graph_op_def, const ApiDef& api_def,
 string OpInfo::GetOpAttrStruct() const {
   string struct_fields;
   string setters;
+  string defaults_static_storage;
 
   for (int i = 0; i < graph_op_def.attr_size(); ++i) {
     const auto& attr(graph_op_def.attr(i));
@@ -705,11 +732,32 @@ string OpInfo::GetOpAttrStruct() const {
                        "_ = x;\n");
     strings::StrAppend(&setters, "      return ret;\n    }\n\n");
 
-    strings::StrAppend(
-        &struct_fields, "    ", attr_type_name, " ", api_def_attr.rename_to(),
-        "_ = ",
-        PrintAttrValue(graph_op_def.name(), api_def_attr.default_value()),
-        ";\n");
+    string field_initiliazer;
+    auto& default_value = api_def_attr.default_value();
+    if (default_value.value_case() == AttrValue::kList &&
+        !IsEmptyList(default_value.list())) {
+      // Non-empty lists need static storage for their defaults. Define a
+      // function with static local variable that stores the array.
+      strings::StrAppend(&defaults_static_storage, "    static ",
+                         attr_type_name, " Default_", api_def_attr.rename_to(),
+                         "() {\n");
+      strings::StrAppend(
+          &defaults_static_storage, "      static const ",
+          ListElementTypeName(attr.type()), " kStorage[] = ",
+          PrintAttrValue(graph_op_def.name(), api_def_attr.default_value()),
+          ";\n");
+      strings::StrAppend(&defaults_static_storage, "      return ",
+                         attr_type_name, "(kStorage);\n    }\n");
+      // Set the field_initializer to call the defined function.
+      strings::StrAppend(&field_initiliazer, "Default_",
+                         api_def_attr.rename_to(), "()");
+    } else {
+      field_initiliazer =
+          PrintAttrValue(graph_op_def.name(), api_def_attr.default_value());
+    }
+    strings::StrAppend(&struct_fields, "    ", attr_type_name, " ",
+                       api_def_attr.rename_to(), "_ = ", field_initiliazer,
+                       ";\n");
   }
 
   if (struct_fields.empty()) {
@@ -721,6 +769,9 @@ string OpInfo::GetOpAttrStruct() const {
   string struct_decl = MakeComment(attrs_comment, "  ");
   strings::StrAppend(&struct_decl, "  struct Attrs {\n");
   strings::StrAppend(&struct_decl, setters, struct_fields);
+  if (!defaults_static_storage.empty()) {
+    strings::StrAppend(&struct_decl, "  private:\n", defaults_static_storage);
+  }
   strings::StrAppend(&struct_decl, "  };\n");
 
   return struct_decl;
diff --git a/tensorflow/cc/gradients/math_grad.cc b/tensorflow/cc/gradients/math_grad.cc
index 52c177212a8c88f1857defcc38de4a01ac47dab0..35a01e0341cb08c9b314908b6dcd76fd99c1e68b 100644
--- a/tensorflow/cc/gradients/math_grad.cc
+++ b/tensorflow/cc/gradients/math_grad.cc
@@ -38,6 +38,7 @@ REGISTER_NO_GRADIENT_OP("NotEqual");
 REGISTER_NO_GRADIENT_OP("LogicalAnd");
 REGISTER_NO_GRADIENT_OP("LogicalOr");
 REGISTER_NO_GRADIENT_OP("LogicalNot");
+REGISTER_NO_GRADIENT_OP("Floor");
 
 // Conjugate helper function returns the conjugate of an Output if it
 // is complex valued.
diff --git a/tensorflow/compiler/aot/tests/BUILD b/tensorflow/compiler/aot/tests/BUILD
index fd2cf2b67d4618dd626b8eef78eed044d7fde0a4..0ecc3feeb6fef1dd691ab2785b3221075a79ba88 100644
--- a/tensorflow/compiler/aot/tests/BUILD
+++ b/tensorflow/compiler/aot/tests/BUILD
@@ -7,6 +7,10 @@ package(
 load("//tensorflow/compiler/aot:tfcompile.bzl", "tf_library")
 load("//tensorflow:tensorflow.bzl", "tf_cc_test")
 
+# We disable some tfcompile tests in the open source build with the
+# "manual" tag to avoid making our OSS users build LLVM twice
+# (once for host and once for target).
+
 test_suite(
     name = "all_tests",
     tags = ["manual"],
diff --git a/tensorflow/compiler/jit/BUILD b/tensorflow/compiler/jit/BUILD
index 980e0eec9e23b15a97b826067bac08053a437712..8c74014614789758192691ee065f92759a113a7a 100644
--- a/tensorflow/compiler/jit/BUILD
+++ b/tensorflow/compiler/jit/BUILD
@@ -25,6 +25,7 @@ load("//tensorflow:tensorflow.bzl", "tf_kernel_library")
 load("//tensorflow:tensorflow.bzl", "tf_cc_test")
 load("@local_config_cuda//cuda:build_defs.bzl", "if_cuda")
 load("@local_config_cuda//cuda:build_defs.bzl", "if_cuda_is_configured")
+load("//tensorflow:tensorflow.bzl", "tf_cuda_cc_test")
 
 # Target that bundles up the XLA CPU and GPU JIT devices.
 cc_library(
@@ -178,7 +179,9 @@ cc_library(
         "//tensorflow/core/kernels:identity_n_op",
         "//tensorflow/core/kernels:identity_op",
         "//tensorflow/core/kernels:no_op",
+        "//tensorflow/core/kernels:resource_variable_ops",
         "//tensorflow/core/kernels:sendrecv_ops",
+        "//tensorflow/core/kernels:shape_ops",
         "//tensorflow/core/kernels:variable_ops",
     ],
 )
@@ -311,9 +314,9 @@ cc_library(
         ":common",
         ":shape_inference_helpers",
         ":union_find",
+        ":xla_cluster_util",
         "//tensorflow/compiler/jit/graphcycles",
         "//tensorflow/compiler/jit/kernels:parallel_check_op",
-        "//tensorflow/compiler/jit/legacy_flags:encapsulate_subgraphs_pass_flags",
         "//tensorflow/compiler/jit/legacy_flags:mark_for_compilation_pass_flags",
         "//tensorflow/compiler/jit/ops:parallel_check_op",
         "//tensorflow/compiler/jit/ops:xla_ops",
@@ -331,6 +334,19 @@ cc_library(
     ],
 )
 
+cc_library(
+    name = "xla_cluster_util",
+    srcs = ["xla_cluster_util.cc"],
+    hdrs = ["xla_cluster_util.h"],
+    deps = [
+        "//tensorflow/compiler/jit/graphcycles",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:graph",
+        "//tensorflow/core:protos_all_cc",
+        "//tensorflow/core/kernels:bounds_check",
+    ],
+)
+
 cc_library(
     name = "union_find",
     hdrs = ["union_find.h"],
@@ -407,6 +423,38 @@ tf_cc_test(
     ],
 )
 
+cc_library(
+    name = "xla_fusion_optimizer",
+    srcs = ["xla_fusion_optimizer.cc"],
+    hdrs = ["xla_fusion_optimizer.h"],
+    visibility = ["//visibility:public"],
+    deps = [
+        ":common",
+        ":union_find",
+        ":xla_cluster_util",
+        "//tensorflow/compiler/jit/graphcycles",
+        "//tensorflow/core:core_cpu_base",
+        "//tensorflow/core:protos_all_cc",
+        "//tensorflow/core/grappler:grappler_item",
+        "//tensorflow/core/grappler/optimizers:custom_graph_optimizer",
+        "//tensorflow/core/grappler/optimizers:custom_graph_optimizer_registry",
+    ],
+)
+
+tf_cuda_cc_test(
+    name = "xla_fusion_optimizer_test",
+    srcs = ["xla_fusion_optimizer_test.cc"],
+    deps = [
+        ":common",
+        ":xla_cluster_util",
+        ":xla_fusion_optimizer",
+        "//tensorflow/core:graph",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
+        "//tensorflow/core/grappler/utils:grappler_test",
+    ],
+)
+
 # This target can be used by XLA device plugins to prevent circular dependencies, and provides access to all of the required headers for building a device library.
 cc_header_only_library(
     name = "xla_jit_headers_lib",
diff --git a/tensorflow/compiler/jit/encapsulate_subgraphs_pass.cc b/tensorflow/compiler/jit/encapsulate_subgraphs_pass.cc
index 6d1e3325ebd35b9608ea273fb7de39bad381e60d..ea90d714c8fe2c2959a92a83ec105dbf2c098f7a 100644
--- a/tensorflow/compiler/jit/encapsulate_subgraphs_pass.cc
+++ b/tensorflow/compiler/jit/encapsulate_subgraphs_pass.cc
@@ -23,7 +23,6 @@ limitations under the License.
 #include <vector>
 
 #include "tensorflow/compiler/jit/graphcycles/graphcycles.h"
-#include "tensorflow/compiler/jit/legacy_flags/encapsulate_subgraphs_pass_flags.h"
 #include "tensorflow/compiler/jit/mark_for_compilation_pass.h"
 #include "tensorflow/compiler/jit/shape_inference_helpers.h"
 #include "tensorflow/compiler/tf2xla/const_analysis.h"
@@ -182,8 +181,7 @@ class Encapsulator {
 
   // Write a copy of the input graph to 'graph_out', where the subgraphs are
   // replaced with calls to the new functions.
-  Status BuildOutputGraph(bool parallel_checking, Graph* graph_out,
-                          FunctionLibraryDefinition* library);
+  Status BuildOutputGraph(Graph* graph_out, FunctionLibraryDefinition* library);
 
  private:
   // A subgraph of the input, all marked with a common 'group_attribute'
@@ -271,7 +269,7 @@ class Encapsulator {
     // Adds the function call node to graph_out.
     Status AddFunctionCallNode(
         const std::unordered_map<const Node*, Node*>& node_images,
-        bool parallel_checking, Graph* graph_out);
+        Graph* graph_out);
 
     // Adds _RecvAtHost and _SendFromHost nodes, where needed, to graph_out.
     Status AddOutsideCompilationHostIONodes(
@@ -284,11 +282,9 @@ class Encapsulator {
     // Subgraph.
     void GetOutsideCompilationSubgraphNames(std::vector<string>* names) const;
 
-    // Returns the Node that inputs to the function should be wired up to.
-    Node* GetCallNodeForInputs() const;
-
-    // Returns the Node that outputs to the function should be wired up to.
-    Node* GetCallNodeForOutputs() const;
+    // Returns the Node that the inputs and outputs of the function should be
+    // wired up to.
+    Node* GetCallNode() const;
 
     // Returns the index of the arg that the dst of edge should connect to.
     int GetArgIndexForEdge(const Edge* edge) const;
@@ -425,12 +421,6 @@ class Encapsulator {
     OutsideCompilationSubgraph* LookupOrCreateOutsideCompilationSubgraph(
         const string& outside_compilation_id);
 
-    // Builds a ParallelCheck op that compares the output of the original
-    // subgraph with the encapsulated subgraph.
-    Status BuildParallelCheckOp(
-        const std::unordered_map<const Node*, Node*>& node_images,
-        Graph* graph_out);
-
     // Builds a placeholder node used to provide the key input to a RecvAtHost
     // or SendFromHost node. This placeholder node will be removed by a later
     // pass.
@@ -482,13 +472,8 @@ class Encapsulator {
     // Not owned.
     Node* host_compute_key_placeholder_ = nullptr;
 
-    // Function call node(s) in the output graph. Not owned.
-    // If parallel_checking is enabled, 'call_node_inputs' is the function call
-    // node to which inputs should be fed, and 'call_node_outputs' is the
-    // parallel check op from which outputs should be read. If parallel checking
-    // is disabled, both point to the function call node.
-    Node* call_node_inputs_;
-    Node* call_node_outputs_;
+    // Function call node in the output graph. Not owned.
+    Node* call_node_;
 
     // Maps from source (producer node/slot) and destination
     // (consumer node/slot) tensors in the input graph to _Arg numbers in
@@ -541,13 +526,12 @@ class Encapsulator {
 
   // Copies all nodes that aren't in a compiled subgraph to the output graph.
   Status CopyNodesToOutputGraph(
-      bool parallel_checking, Graph* graph_out,
-      std::unordered_map<const Node*, Node*>* node_images);
+      Graph* graph_out, std::unordered_map<const Node*, Node*>* node_images);
 
   // Adds function call nodes for each compiled subgraph.
   Status AddFunctionCallNodes(
       const std::unordered_map<const Node*, Node*>& node_images,
-      bool parallel_checking, Graph* graph_out);
+      Graph* graph_out);
 
   // Adds _RecvAtHost and _SendFromHost nodes, where needed, for all
   // outside_compilation subgraphs.
@@ -598,7 +582,7 @@ class Encapsulator {
       const string& src_outside_compilation_id, const string& dst_func_id,
       const string& dst_outside_compilation_id,
       const std::unordered_map<const Node*, Node*>& node_images,
-      bool parallel_checking, Graph* graph_out,
+      Graph* graph_out,
       std::unordered_set<std::pair<NodeSlot, NodeSlot>, NodeSlot::PairHasher>*
           edges_added);
 
@@ -609,7 +593,7 @@ class Encapsulator {
   // Adds all edges to the output graph.
   Status AddEdgesToOutputGraph(
       const std::unordered_map<const Node*, Node*>& node_images,
-      bool parallel_checking, Graph* graph_out);
+      Graph* graph_out);
 
   // Constructs a minimal shape inference graph that can be used to determine
   // the shape of send_node at the time that the subgraph is compiled.
@@ -729,13 +713,7 @@ void TopologicalClusterSort(
 
 }  // namespace
 
-Node* Encapsulator::Subgraph::GetCallNodeForInputs() const {
-  return call_node_inputs_;
-}
-
-Node* Encapsulator::Subgraph::GetCallNodeForOutputs() const {
-  return call_node_outputs_;
-}
+Node* Encapsulator::Subgraph::GetCallNode() const { return call_node_; }
 
 int Encapsulator::Subgraph::GetArgIndexForEdge(const Edge* edge) const {
   return args_by_dst_.at(NodeSlot(edge->dst(), edge->dst_input()));
@@ -1075,7 +1053,7 @@ Status Encapsulator::Subgraph::MakeSequencingNode(const string& subgraph_name,
 void Encapsulator::Subgraph::ConnectSequencerToCallNode(Graph* graph_out) {
   if (sequencer_ != nullptr) {
     VLOG(2) << "ConnectSequencerToCallNode";
-    graph_out->AddControlEdge(sequencer_, call_node_inputs_);
+    graph_out->AddControlEdge(sequencer_, call_node_);
   }
 }
 
@@ -1200,83 +1178,16 @@ Status Encapsulator::Subgraph::ReplaceFunctionDef(
   return Status::OK();
 }
 
-Status Encapsulator::Subgraph::BuildParallelCheckOp(
-    const std::unordered_map<const Node*, Node*>& node_images,
-    Graph* graph_out) {
-  // Build an index mapping output positions to node/slot pairs in the
-  // original graph.
-  std::vector<NodeSlot> results_by_num(results_.size());
-  for (const auto& entry : results_) {
-    results_by_num[entry.second] = entry.first;
-  }
-
-  // Build a parallel check NodeDef.
-  int num_results = results_by_num.size();
-  std::vector<DataType> result_dtypes(num_results);
-  std::vector<NodeDefBuilder::NodeOut> expected_outputs(num_results);
-  std::vector<NodeDefBuilder::NodeOut> actual_outputs(num_results);
-  for (int i = 0; i < num_results; ++i) {
-    const NodeSlot& node_slot = results_by_num[i];
-    result_dtypes[i] = node_slot.node->output_type(node_slot.slot);
-    expected_outputs[i] =
-        NodeDefBuilder::NodeOut(node_images.at(node_slot.node)->name(),
-                                node_slot.slot, result_dtypes[i]);
-    actual_outputs[i] =
-        NodeDefBuilder::NodeOut(call_node_def_.name(), i, result_dtypes[i]);
-  }
-  // Assign the parallel check op to a CPU on the same task as the cluster it is
-  // checking.
-  string device, dummy;
-  if (!DeviceNameUtils::SplitDeviceName(
-          call_node_inputs_->assigned_device_name(), &device, &dummy)) {
-    return errors::InvalidArgument("Could not parse device name");
-  }
-  strings::StrAppend(&device, "/cpu:0");
-
-  NodeDef check_def;
-  TF_RETURN_IF_ERROR(
-      NodeDefBuilder(graph_out->NewName(strings::StrCat(call_node_def_.name(),
-                                                        "_parallel_check")),
-                     "ParallelCheck")
-          .Device(device)
-          .Attr("T", result_dtypes)
-          .Input(expected_outputs)
-          .Input(actual_outputs)
-          .Finalize(&check_def));
-
-  Status s;
-  Node* check_op = graph_out->AddNode(check_def, &s);
-  if (!s.ok()) return s;
-  check_op->set_assigned_device_name(device);
-
-  // TODO(phawkins): it seems redundant to call AddEdge as well as
-  // pass Inputs to the NodeDefBuilder, but I have been unable to find a
-  // way to avoid it.
-  for (int i = 0; i < num_results; ++i) {
-    const NodeSlot& node_slot = results_by_num[i];
-    graph_out->AddEdge(node_images.at(node_slot.node), node_slot.slot, check_op,
-                       i);
-    graph_out->AddEdge(call_node_inputs_, i, check_op, num_results + i);
-  }
-
-  call_node_outputs_ = check_op;
-  return Status::OK();
-}
-
 Status Encapsulator::Subgraph::AddFunctionCallNode(
     const std::unordered_map<const Node*, Node*>& node_images,
-    bool parallel_checking, Graph* graph_out) {
+    Graph* graph_out) {
   Status s;
-  call_node_inputs_ = graph_out->AddNode(call_node_def_, &s);
+  call_node_ = graph_out->AddNode(call_node_def_, &s);
   if (!s.ok()) return s;
 
   // Copy the assigned device and the key_annotation over.
-  call_node_inputs_->set_assigned_device_name(device_);
-  call_node_outputs_ = call_node_inputs_;
+  call_node_->set_assigned_device_name(device_);
 
-  if (parallel_checking) {
-    TF_RETURN_IF_ERROR(BuildParallelCheckOp(node_images, graph_out));
-  }
   return Status::OK();
 }
 
@@ -1627,27 +1538,17 @@ Status Encapsulator::BuildFunctionDefs(
 }
 
 Status Encapsulator::CopyNodesToOutputGraph(
-    bool parallel_checking, Graph* graph_out,
-    std::unordered_map<const Node*, Node*>* node_images) {
+    Graph* graph_out, std::unordered_map<const Node*, Node*>* node_images) {
   for (Node* node : graph_in_->op_nodes()) {
     string func_id;
     string outside_compilation_id;
     TF_RETURN_IF_ERROR(
         GetFunctionNameAttr(node, &func_id, &outside_compilation_id));
 
-    // Don't copy nodes that going to be encapsulated, unless parallel checking
-    // is enabled.
-    if (IsInSubgraph(func_id, outside_compilation_id) && !parallel_checking)
-      continue;
+    // Don't copy nodes that are going to be encapsulated.
+    if (IsInSubgraph(func_id, outside_compilation_id)) continue;
 
     Node* image = graph_out->CopyNode(node);
-    if (!outside_compilation_id.empty()) {
-      if (parallel_checking) {
-        return errors::InvalidArgument(
-            "Parallel checking is not supported when outside_compilation "
-            "clusters are present.");
-      }
-    }
     (*node_images)[node] = image;
   }
   (*node_images)[graph_in_->source_node()] = graph_out->source_node();
@@ -1657,10 +1558,10 @@ Status Encapsulator::CopyNodesToOutputGraph(
 
 Status Encapsulator::AddFunctionCallNodes(
     const std::unordered_map<const Node*, Node*>& node_images,
-    bool parallel_checking, Graph* graph_out) {
+    Graph* graph_out) {
   for (auto& subgraph_entry : subgraphs_) {
-    TF_RETURN_IF_ERROR(subgraph_entry.second.AddFunctionCallNode(
-        node_images, parallel_checking, graph_out));
+    TF_RETURN_IF_ERROR(
+        subgraph_entry.second.AddFunctionCallNode(node_images, graph_out));
   }
   return Status::OK();
 }
@@ -1694,7 +1595,7 @@ Status Encapsulator::FindOutputImageOfEdgeSrc(
     } else {
       // The edge is from a subgraph to a regular node in the output graph so
       // use the subgraph's call node output.
-      *src_image = subgraphs_.at(src_func_id).GetCallNodeForOutputs();
+      *src_image = subgraphs_.at(src_func_id).GetCallNode();
     }
   } else {
     // The source of the edge is in the output graph so use the node image in
@@ -1742,7 +1643,7 @@ Status Encapsulator::FindOutputImageOfEdgeDst(
     } else {
       // The edge is to a subgraph from a regular node in the output graph so
       // use the subgraph's call node input.
-      *dst_image = subgraphs_.at(dst_func_id).GetCallNodeForInputs();
+      *dst_image = subgraphs_.at(dst_func_id).GetCallNode();
     }
   } else {
     // The destination of the edge is in the output graph so use the node image
@@ -1778,8 +1679,7 @@ Status Encapsulator::CopyEdgeToOutputGraph(
     const Edge* edge, const string& src_func_id,
     const string& src_outside_compilation_id, const string& dst_func_id,
     const string& dst_outside_compilation_id,
-    const std::unordered_map<const Node*, Node*>& node_images,
-    bool parallel_checking, Graph* graph_out,
+    const std::unordered_map<const Node*, Node*>& node_images, Graph* graph_out,
     std::unordered_set<std::pair<NodeSlot, NodeSlot>, NodeSlot::PairHasher>*
         edges_added) {
   Node* src_image;
@@ -1801,11 +1701,6 @@ Status Encapsulator::CopyEdgeToOutputGraph(
       graph_out->AddControlEdge(src_image, dst_image);
     }
 
-    // If parallel checking is enabled, also add a control edge to the
-    // corresponding parallel check op.
-    if (parallel_checking) {
-      graph_out->AddControlEdge(src_image, node_images.at(edge->dst()));
-    }
     return Status::OK();
   }
 
@@ -1817,14 +1712,6 @@ Status Encapsulator::CopyEdgeToOutputGraph(
       FindOutputSlotOfEdgeDst(src_func_id, src_outside_compilation_id,
                               dst_func_id, dst_outside_compilation_id, edge);
 
-  if (IsInSubgraph(dst_func_id, dst_outside_compilation_id) &&
-      parallel_checking) {
-    // If we are parallel checking, also feed the tensor as an input to the
-    // corresponding parallel check subgraph.
-    graph_out->AddEdge(src_image, src_output, node_images.at(edge->dst()),
-                       edge->dst_input());
-  }
-
   // Add the edge, if we have not already added it.
   if (edges_added
           ->emplace(NodeSlot(src_image, src_output),
@@ -1839,8 +1726,8 @@ Status Encapsulator::AddCallNodeDependencies(Graph* graph_out) {
   for (const auto& ancestors : subgraph_ancestors_) {
     const string& subgraph = ancestors.first;
     for (const string& ancestor : ancestors.second) {
-      graph_out->AddControlEdge(subgraphs_[ancestor].GetCallNodeForOutputs(),
-                                subgraphs_[subgraph].GetCallNodeForInputs());
+      graph_out->AddControlEdge(subgraphs_[ancestor].GetCallNode(),
+                                subgraphs_[subgraph].GetCallNode());
     }
   }
   return Status::OK();
@@ -1848,7 +1735,7 @@ Status Encapsulator::AddCallNodeDependencies(Graph* graph_out) {
 
 Status Encapsulator::AddEdgesToOutputGraph(
     const std::unordered_map<const Node*, Node*>& node_images,
-    bool parallel_checking, Graph* graph_out) {
+    Graph* graph_out) {
   // Set of edges already added to the output graph, represented as (src, dst)
   // pairs. We use the set to deduplicate edges; multiple edges in the input
   // graph may map to one edge in the output graph.
@@ -1870,16 +1757,6 @@ Status Encapsulator::AddEdgesToOutputGraph(
     if (IsInSubgraph(src_func_id, src_outside_compilation_id) &&
         IsInSubgraph(dst_func_id, dst_outside_compilation_id) &&
         src_func_id == dst_func_id) {
-      if (parallel_checking) {
-        Node* src_image = node_images.at(edge->src());
-        Node* dst_image = node_images.at(edge->dst());
-        if (edge->IsControlEdge()) {
-          graph_out->AddControlEdge(src_image, dst_image);
-        } else {
-          graph_out->AddEdge(src_image, edge->src_output(), dst_image,
-                             edge->dst_input());
-        }
-      }
       continue;
     }
 
@@ -1887,8 +1764,7 @@ Status Encapsulator::AddEdgesToOutputGraph(
     // unclustered graph.
     TF_RETURN_IF_ERROR(CopyEdgeToOutputGraph(
         edge, src_func_id, src_outside_compilation_id, dst_func_id,
-        dst_outside_compilation_id, node_images, parallel_checking, graph_out,
-        &edges_added));
+        dst_outside_compilation_id, node_images, graph_out, &edges_added));
   }
 
   for (auto& subgraph_entry : subgraphs_) {
@@ -2504,18 +2380,15 @@ Status Encapsulator::GetShapeInfoForOutsideCompilationSends(
   return Status::OK();
 }
 
-Status Encapsulator::BuildOutputGraph(bool parallel_checking, Graph* graph_out,
+Status Encapsulator::BuildOutputGraph(Graph* graph_out,
                                       FunctionLibraryDefinition* library) {
   // Map from nodes in the input graph to nodes in the output graph.
   std::unordered_map<const Node*, Node*> node_images;
 
-  TF_RETURN_IF_ERROR(
-      CopyNodesToOutputGraph(parallel_checking, graph_out, &node_images));
-  TF_RETURN_IF_ERROR(
-      AddFunctionCallNodes(node_images, parallel_checking, graph_out));
+  TF_RETURN_IF_ERROR(CopyNodesToOutputGraph(graph_out, &node_images));
+  TF_RETURN_IF_ERROR(AddFunctionCallNodes(node_images, graph_out));
   TF_RETURN_IF_ERROR(AddOutsideCompilationHostIONodes(node_images, graph_out));
-  TF_RETURN_IF_ERROR(
-      AddEdgesToOutputGraph(node_images, parallel_checking, graph_out));
+  TF_RETURN_IF_ERROR(AddEdgesToOutputGraph(node_images, graph_out));
 
   TF_RETURN_IF_ERROR(
       GetShapeInfoForOutsideCompilationSends(graph_out, library));
@@ -2528,8 +2401,8 @@ Status Encapsulator::BuildOutputGraph(bool parallel_checking, Graph* graph_out,
 Status EncapsulateSubgraphsInFunctions(
     string group_attribute, string outside_compilation_attribute,
     const Graph& graph_in, const RewriteSubgraphFn& rewrite_subgraph_fn,
-    bool parallel_checking, bool reuse_existing_functions,
-    std::unique_ptr<Graph>* graph_out, FunctionLibraryDefinition* library) {
+    bool reuse_existing_functions, std::unique_ptr<Graph>* graph_out,
+    FunctionLibraryDefinition* library) {
   Status s;
 
   Encapsulator encapsulator(std::move(group_attribute),
@@ -2543,8 +2416,7 @@ Status EncapsulateSubgraphsInFunctions(
 
   std::unique_ptr<Graph> out(new Graph(library));
   out->set_versions(graph_in.versions());
-  TF_RETURN_IF_ERROR(
-      encapsulator.BuildOutputGraph(parallel_checking, out.get(), library));
+  TF_RETURN_IF_ERROR(encapsulator.BuildOutputGraph(out.get(), library));
 
   *graph_out = std::move(out);
   return Status::OK();
@@ -2585,8 +2457,6 @@ static Status RenumberArguments(Graph* graph,
 Status EncapsulateSubgraphsPass::Run(
     const GraphOptimizationPassOptions& options) {
   VLOG(1) << "EncapsulateSubgraphsPass::Run";
-  legacy_flags::EncapsulateSubgraphsPassFlags* flags =
-      legacy_flags::GetEncapsulateSubgraphsPassFlags();
   if (VLOG_IS_ON(1)) {
     dump_graph::DumpGraphToFile("before_encapsulate_subgraphs", **options.graph,
                                 options.flib_def);
@@ -2663,7 +2533,7 @@ Status EncapsulateSubgraphsPass::Run(
 
   TF_RETURN_IF_ERROR(EncapsulateSubgraphsInFunctions(
       kXlaClusterAttr, kXlaOutsideCompilationAttr, **options.graph,
-      rewrite_subgraph, flags->tf_xla_parallel_checking,
+      rewrite_subgraph,
       /*reuse_existing_functions=*/false, &graph_out, library));
 
   if (VLOG_IS_ON(1)) {
diff --git a/tensorflow/compiler/jit/encapsulate_subgraphs_pass.h b/tensorflow/compiler/jit/encapsulate_subgraphs_pass.h
index 5fee36f022a7515504cb6faa5cca658481b784c5..e5dab7c657c79afa861b0443314d0c7801e4b66d 100644
--- a/tensorflow/compiler/jit/encapsulate_subgraphs_pass.h
+++ b/tensorflow/compiler/jit/encapsulate_subgraphs_pass.h
@@ -61,10 +61,6 @@ typedef std::function<Status(
 // If 'rewrite_subgraph_fn' is set, it is applied to each subgraph before
 // function conversion.
 //
-// If 'parallel_checking' is true, the unencapsulated operators are added to the
-// output graph, together with a "ParallelCheck" operator, that verifies that
-// the original and encapsulated subgraphs produce similar results.
-//
 // If 'reuse_existing_functions' is set, use an existing function with the
 // same name, if any.
 //
@@ -76,8 +72,8 @@ typedef std::function<Status(
 Status EncapsulateSubgraphsInFunctions(
     string group_attribute, string outside_compilation_attribute,
     const Graph& graph_in, const RewriteSubgraphFn& rewrite_subgraph_fn,
-    bool parallel_checking, bool reuse_existing_functions,
-    std::unique_ptr<Graph>* graph_out, FunctionLibraryDefinition* library);
+    bool reuse_existing_functions, std::unique_ptr<Graph>* graph_out,
+    FunctionLibraryDefinition* library);
 
 // The attribute that marks function calls produced by the encapsulate
 // subgraphs pass and that should in turn be compiled via XlaLaunch operators.
diff --git a/tensorflow/compiler/jit/encapsulate_subgraphs_pass_test.cc b/tensorflow/compiler/jit/encapsulate_subgraphs_pass_test.cc
index 5ec24d39a2c40a766dbb0ec51ebe798de620e24b..6a7cd932e53cc0428850ab048cc325e84ef1fce6 100644
--- a/tensorflow/compiler/jit/encapsulate_subgraphs_pass_test.cc
+++ b/tensorflow/compiler/jit/encapsulate_subgraphs_pass_test.cc
@@ -511,7 +511,6 @@ Status Encapsulate(GraphDef* graphdef, FunctionDefLibrary* library) {
   std::unique_ptr<Graph> graph_out;
   s = EncapsulateSubgraphsInFunctions("_encapsulate", "_outside", *graph,
                                       /*rewrite_subgraph_fn=*/{},
-                                      /*parallel_checking=*/false,
                                       /*reuse_existing_functions=*/false,
                                       &graph_out, lib_def.get());
   if (!s.ok()) return s;
@@ -560,8 +559,9 @@ TEST(EncapsulateSubgraphsTest, OneFunction) {
     Node* b = Input(b1.opts().WithName("B"));
     // Give nodes 'c' and 'd' names that collide after lowercasing.
     Node* c = Unary(a, b1.opts().WithName("C").WithAttr("_encapsulate", "F1"));
-    Node* d = Binary(b, c, b1.opts().WithName("c").WithControlInput(c).WithAttr(
-                               "_encapsulate", "F1"));
+    Node* d = Binary(b, c,
+                     b1.opts().WithName("c").WithControlInput(c).WithAttr(
+                         "_encapsulate", "F1"));
     Binary(a, d, b1.opts().WithName("E"));
     TF_EXPECT_OK(b1.ToGraphDef(&graphdef));
   }
@@ -614,8 +614,8 @@ TEST(EncapsulateSubgraphsTest, TwoFunctions) {
     Node* c =
         Unary(a, b1.opts().WithName("C").WithControlInput(control).WithAttr(
                      "_encapsulate", "F1"));
-    Node* d =
-        Binary(b, c, b1.opts().WithName("D").WithControlInput(control).WithAttr(
+    Node* d = Binary(b, c,
+                     b1.opts().WithName("D").WithControlInput(control).WithAttr(
                          "_encapsulate", "F2"));
     Binary(a, d, b1.opts().WithName("E"));
     TF_EXPECT_OK(b1.ToGraphDef(&graphdef));
@@ -707,7 +707,7 @@ TEST(EncapsulateSubgraphsTest, InputDeduplication) {
   std::unique_ptr<Graph> graph;
   TF_ASSERT_OK(EncapsulateSubgraphsInFunctions(
       "_cluster", "_outside", graph_before_encapsulation,
-      /*rewrite_subgraph_fn=*/{}, /*parallel_checking=*/false,
+      /*rewrite_subgraph_fn=*/{},
       /*reuse_existing_functions=*/false, &graph, &library));
 
   std::vector<string> expected_nodes = {"cluster1", "cluster2", "mul", "x"};
@@ -721,47 +721,6 @@ TEST(EncapsulateSubgraphsTest, InputDeduplication) {
   EXPECT_EQ(expected_edges, GraphEdges(*graph));
 }
 
-TEST(EncapsulateSubgraphsTest, ParallelChecking) {
-  Scope root = Scope::NewRootScope().ExitOnError().WithDevice(
-      "/job:localhost/replica:0/task:0/cpu:0");
-  auto x1 = ops::Placeholder(root.WithOpName("x1"), DT_FLOAT);
-  auto x2 = ops::Placeholder(root.WithOpName("x2"), DT_FLOAT);
-  auto add1 = ops::Add(root.WithOpName("add1"), x1, x2);
-  add1.node()->AddAttr("_cluster", "cluster1");
-  auto add2 = ops::Add(root.WithOpName("add2"), add1, x2);
-  add2.node()->AddAttr("_cluster", "cluster1");
-  auto out = ops::Mul(root.WithOpName("mul"), x1, add2);
-
-  Graph graph_before_encapsulation(OpRegistry::Global());
-  TF_ASSERT_OK(root.ToGraph(&graph_before_encapsulation));
-
-  FunctionLibraryDefinition library(OpRegistry::Global(), {});
-  std::unique_ptr<Graph> graph;
-  TF_ASSERT_OK(EncapsulateSubgraphsInFunctions(
-      "_cluster", "_outside", graph_before_encapsulation,
-      /*rewrite_subgraph_fn=*/{}, /*parallel_checking=*/true,
-      /*reuse_existing_functions=*/false, &graph, &library));
-
-  std::vector<string> expected_nodes = {
-      "add1", "add2", "cluster1", "cluster1_parallel_check/_0",
-      "mul",  "x1",   "x2"};
-  EXPECT_EQ(expected_nodes, GraphNodes(*graph));
-
-  std::vector<std::pair<string, string>> expected_edges = {
-      {"add1:0", "add2:0"},
-      {"add2:0", "cluster1_parallel_check/_0:0"},
-      {"cluster1:0", "cluster1_parallel_check/_0:1"},
-      {"cluster1_parallel_check/_0:0", "mul:1"},
-      {"x1:0", "add1:0"},
-      {"x1:0", "cluster1:0"},
-      {"x1:0", "mul:0"},
-      {"x2:0", "add1:1"},
-      {"x2:0", "add2:1"},
-      {"x2:0", "cluster1:1"},
-  };
-  EXPECT_EQ(expected_edges, GraphEdges(*graph));
-}
-
 const Node* FindNodeByName(const Graph& graph, const string& name) {
   for (const Node* node : graph.nodes()) {
     if (node->name() == name) return node;
@@ -814,7 +773,6 @@ TEST(EncapsulateSubgraphsWithGuaranteeConstOpTest, Simple) {
         }
         return Status::OK();
       },
-      /*parallel_checking=*/false,
       /*reuse_existing_functions=*/false, &graph_after, &library));
   EXPECT_EQ(2, guaranteed_consts);
 }
@@ -859,7 +817,6 @@ TEST(EncapsulateSubgraphsWithGuaranteeConstOpTest, Add) {
         }
         return Status::OK();
       },
-      /*parallel_checking=*/false,
       /*reuse_existing_functions=*/false, &graph_after, &library));
   // Only 1 runtime const, which is const_guarantee_add1. Add2 has one const
   // and another non-const, so overall non-const.
@@ -1050,7 +1007,7 @@ TEST(EncapsulateSubgraphsTest, OneFunctionTwoOutside) {
                          .WithAttr("_outside", "O1"));
     Node* recv2 = RecvAtHost(ops::NodeOut(key_constant, 0), "F1", "O2",
                              {DT_FLOAT, DT_FLOAT}, shape2.opts());
-    Node* h = Binary(ops::NodeOut(recv2, 0), e,
+    Node* h = Binary(ops::NodeOut(recv2, 1), e,
                      shape2.opts()
                          .WithName("H")
                          .WithAttr("_encapsulate", "F1")
@@ -1075,7 +1032,7 @@ TEST(EncapsulateSubgraphsTest, OneFunctionTwoOutside) {
            {"outside_compilation_O1_host_compute"}},
           {{"outside_compilation_O2_host_compute"},
            "XlaHostCompute",
-           {"D:o:0", "F:o:0"},
+           {"F:o:0", "D:o:0"},
            {{"Tinputs", gtl::ArraySlice<DataType>({DT_FLOAT, DT_FLOAT})},
             {"Toutputs", gtl::ArraySlice<DataType>({DT_FLOAT})},
             {"ancestors",
@@ -1123,13 +1080,13 @@ TEST(EncapsulateSubgraphsTest, OneFunctionTwoOutside) {
 
     Node* recv2 = RecvAtHost(ops::NodeOut(key_constant, 0), "F1", "O2",
                              {DT_FLOAT, DT_FLOAT}, b2.opts());
-    Node* g = Binary(e, ops::NodeOut(recv2, 1),
+    Node* g = Binary(e, ops::NodeOut(recv2, 0),
                      b2.opts()
                          .WithName("G")
                          .WithControlInputs({recv2, e})
                          .WithAttr("_encapsulate", "F1")
                          .WithAttr("_outside", "O2"));
-    Node* h = Binary(ops::NodeOut(recv2, 0), e,
+    Node* h = Binary(ops::NodeOut(recv2, 1), e,
                      b2.opts()
                          .WithName("H")
                          .WithAttr("_encapsulate", "F1")
diff --git a/tensorflow/compiler/jit/kernels/xla_launch_op.cc b/tensorflow/compiler/jit/kernels/xla_launch_op.cc
index 27287e0f9637929b2e04c6a76de19c2785ec357e..902fe27acdec1cb323217e6409fbd02f62177612 100644
--- a/tensorflow/compiler/jit/kernels/xla_launch_op.cc
+++ b/tensorflow/compiler/jit/kernels/xla_launch_op.cc
@@ -148,7 +148,7 @@ void XlaLocalLaunchBase::Compute(OpKernelContext* ctx) {
 
   XlaCompiler::Options options;
   options.client = client;
-  options.device_type = &cache->device_type();
+  options.device_type = cache->device_type();
   options.flib_def = ctx->function_library()->GetFunctionLibraryDefinition();
   options.graph_def_version = ctx->function_library()->graph_def_version();
   options.allow_cpu_custom_calls = (platform_id_ == se::host::kHostPlatformId);
diff --git a/tensorflow/compiler/jit/legacy_flags/BUILD b/tensorflow/compiler/jit/legacy_flags/BUILD
index 5d211f4d733d8d807426e62dd116092799184f35..5b6692f523658749f7ef48f9d7d89e97d4ce8b09 100644
--- a/tensorflow/compiler/jit/legacy_flags/BUILD
+++ b/tensorflow/compiler/jit/legacy_flags/BUILD
@@ -16,18 +16,6 @@ licenses(["notice"])  # Apache 2.0
 
 package(default_visibility = ["//tensorflow:internal"])
 
-cc_library(
-    name = "encapsulate_subgraphs_pass_flags",
-    srcs = ["encapsulate_subgraphs_pass_flags.cc"],
-    hdrs = ["encapsulate_subgraphs_pass_flags.h"],
-    deps =
-        [
-            "//tensorflow/compiler/xla/legacy_flags:parse_flags_from_env",
-            "//tensorflow/core:framework_internal",
-            "//tensorflow/core:lib",
-        ],
-)
-
 cc_library(
     name = "mark_for_compilation_pass_flags",
     srcs = ["mark_for_compilation_pass_flags.cc"],
diff --git a/tensorflow/compiler/jit/legacy_flags/encapsulate_subgraphs_pass_flags.cc b/tensorflow/compiler/jit/legacy_flags/encapsulate_subgraphs_pass_flags.cc
deleted file mode 100644
index 856475f12c8a411cd80c1c1859323304ca4029e0..0000000000000000000000000000000000000000
--- a/tensorflow/compiler/jit/legacy_flags/encapsulate_subgraphs_pass_flags.cc
+++ /dev/null
@@ -1,63 +0,0 @@
-/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-==============================================================================*/
-
-// Legacy flags for the XLA bridge's encapsulate_subgraphs_pass module.
-
-#include <mutex>
-#include <vector>
-
-#include "tensorflow/compiler/jit/legacy_flags/encapsulate_subgraphs_pass_flags.h"
-#include "tensorflow/compiler/xla/legacy_flags/parse_flags_from_env.h"
-#include "tensorflow/core/platform/types.h"
-#include "tensorflow/core/util/command_line_flags.h"
-
-namespace tensorflow {
-namespace legacy_flags {
-
-// Pointers to the parsed value of the flags and flag descriptors, initialized
-// via flags_init.
-static EncapsulateSubgraphsPassFlags* flags;
-static std::vector<Flag>* flag_list;
-static std::once_flag flags_init;
-
-// Allocate *flags.  Called via call_once(&flags_init,...).
-static void AllocateFlags() {
-  flags = new EncapsulateSubgraphsPassFlags;
-  flags->tf_xla_parallel_checking = false;
-  flag_list = new std::vector<Flag>({
-      Flag("tf_xla_parallel_checking", &flags->tf_xla_parallel_checking,
-           "Debug tool. Runs both JIT-compiled and interpreted graphs in "
-           "parallel and verifies they produce the same outputs."),
-  });
-  xla::legacy_flags::ParseFlagsFromEnv(*flag_list);
-}
-
-// Append to *append_to flag definitions associated with the XLA bridge's
-// encapsulate_subgraphs_pass module.
-void AppendEncapsulateSubgraphsPassFlags(std::vector<Flag>* append_to) {
-  std::call_once(flags_init, &AllocateFlags);
-  append_to->insert(append_to->end(), flag_list->begin(), flag_list->end());
-}
-
-// Return a pointer to the EncapsulateSubgraphsPassFlags struct;
-// repeated calls return the same pointer.
-// This should be called only after Flags::Parse() has returned.
-EncapsulateSubgraphsPassFlags* GetEncapsulateSubgraphsPassFlags() {
-  std::call_once(flags_init, &AllocateFlags);
-  return flags;
-}
-
-}  // namespace legacy_flags
-}  // namespace tensorflow
diff --git a/tensorflow/compiler/jit/legacy_flags/encapsulate_subgraphs_pass_flags.h b/tensorflow/compiler/jit/legacy_flags/encapsulate_subgraphs_pass_flags.h
deleted file mode 100644
index d371bd269dbdfbf737d81490fb877fcf88661a8f..0000000000000000000000000000000000000000
--- a/tensorflow/compiler/jit/legacy_flags/encapsulate_subgraphs_pass_flags.h
+++ /dev/null
@@ -1,50 +0,0 @@
-/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-==============================================================================*/
-
-#ifndef TENSORFLOW_COMPILER_JIT_LEGACY_FLAGS_ENCAPSULATE_SUBGRAPHS_PASS_FLAGS_H_
-#define TENSORFLOW_COMPILER_JIT_LEGACY_FLAGS_ENCAPSULATE_SUBGRAPHS_PASS_FLAGS_H_
-
-// Legacy flags for the XLA bridge's encapsulate_subgraphs_pass module.
-
-#include <vector>
-
-#include "tensorflow/core/platform/types.h"
-#include "tensorflow/core/util/command_line_flags.h"
-
-namespace tensorflow {
-namespace legacy_flags {
-
-// Append to *flag_list flag definitions associated with the XLA bridge's
-// encapsulate_subgraphs_pass module.
-void AppendEncapsulateSubgraphsPassFlags(
-    std::vector<tensorflow::Flag>* flag_list);
-
-// The values of flags associated with the XLA bridge's
-// encapsulate_subgraphs_pass module.
-typedef struct {
-  bool tf_xla_parallel_checking;  // Debug tool. Runs both JIT-compiled and
-                                  // interpreted graphs in parallel and verifies
-                                  // they produce the same outputs.
-} EncapsulateSubgraphsPassFlags;
-
-// Return a pointer to the EncapsulateSubgraphsPassFlags struct;
-// repeated calls return the same pointer.
-// This should be called only after Flags::Parse() has returned.
-EncapsulateSubgraphsPassFlags* GetEncapsulateSubgraphsPassFlags();
-
-}  // namespace legacy_flags
-}  // namespace tensorflow
-
-#endif  // TENSORFLOW_COMPILER_JIT_LEGACY_FLAGS_ENCAPSULATE_SUBGRAPHS_PASS_FLAGS_H_
diff --git a/tensorflow/compiler/jit/mark_for_compilation_pass.cc b/tensorflow/compiler/jit/mark_for_compilation_pass.cc
index 8e2ee0f1d71bc17b4c12c792c38002af4f9eb5eb..8c3882116dd4f048ea3e32c037bf4139c67a3eb9 100644
--- a/tensorflow/compiler/jit/mark_for_compilation_pass.cc
+++ b/tensorflow/compiler/jit/mark_for_compilation_pass.cc
@@ -25,6 +25,7 @@ limitations under the License.
 #include "tensorflow/compiler/jit/graphcycles/graphcycles.h"
 #include "tensorflow/compiler/jit/legacy_flags/mark_for_compilation_pass_flags.h"
 #include "tensorflow/compiler/jit/union_find.h"
+#include "tensorflow/compiler/jit/xla_cluster_util.h"
 #include "tensorflow/compiler/tf2xla/dump_graph.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
 #include "tensorflow/core/common_runtime/function.h"
@@ -41,9 +42,6 @@ limitations under the License.
 
 namespace tensorflow {
 
-const char* const kXlaClusterAttr = "_XlaCluster";
-const char* const kXlaOutsideCompilationAttr = "_XlaOutsideCompilation";
-
 namespace {
 
 bool HasXLAKernel(const Node& node, const DeviceType& jit_device_type) {
@@ -60,6 +58,14 @@ bool HasXLAKernel(const Node& node, const DeviceType& jit_device_type) {
       return false;
     }
   }
+
+  // XLA does not offer guaranteed aliasing between the input and output of the
+  // XLA cluster so it can't implement the forward-tensor-ref semantic.  Leave
+  // such nodes out of XLA clusters.
+  if (HasForwardedRefInput(node)) {
+    return false;
+  }
+
   return FindKernelDef(jit_device_type, node.def(), nullptr, nullptr).ok();
 }
 
@@ -165,16 +171,6 @@ bool IsCompilableCall(const NodeDef& call_def,
   return true;
 }
 
-// Returns the DeviceType corresponding to 'device'.
-Status DeviceTypeOfDevice(const string& device, DeviceType* device_type) {
-  DeviceNameUtils::ParsedName parsed;
-  if (!DeviceNameUtils::ParseFullName(device, &parsed)) {
-    return errors::Internal("Malformed assigned device '", device, "'");
-  }
-  *device_type = DeviceType(parsed.type);
-  return Status::OK();
-}
-
 // Tests whether `node` has a DT_RESOURCE typed input or output.
 bool HasResourceInputOrOutput(const Node& node) {
   return std::find(node.input_types().begin(), node.input_types().end(),
@@ -183,18 +179,11 @@ bool HasResourceInputOrOutput(const Node& node) {
                    DT_RESOURCE) != node.output_types().end();
 }
 
-struct NodeCompare {
-  bool operator()(const Node* a, const Node* b) const {
-    return a->id() < b->id();
-  }
-};
-using OrderedNodeSet = std::set<Node*, NodeCompare>;
-
 // Returns true if the op can be decomposed into XLA ops for which
 // there are fusable elemental implementations.
 //
-// TODO(hpucha): Consider a black list instead of a white list as
-// implemented below.
+// TODO(hpucha): Remove this code since this functionality is subsumed by
+// Grappler XlaFusionOptimizer.
 bool IsXlaFusable(const NodeDef& node) {
   static const std::unordered_set<std::string>* elementwise_ops =
       new std::unordered_set<std::string>(
@@ -364,7 +353,7 @@ Status FindCompilationCandidates(
   for (Node* node : graph.op_nodes()) {
     sorted_nodes.push_back(node);
   }
-  std::sort(sorted_nodes.begin(), sorted_nodes.end(), NodeCompare());
+  std::sort(sorted_nodes.begin(), sorted_nodes.end(), NodeComparatorID());
 
   for (Node* node : sorted_nodes) {
     VLOG(2) << "Fuel: " << fuel;
@@ -379,9 +368,13 @@ Status FindCompilationCandidates(
 
     DeviceType device_type("");
     TF_RETURN_IF_ERROR(
-        DeviceTypeOfDevice(node->assigned_device_name(), &device_type));
+        DeviceToDeviceType(node->assigned_device_name(), &device_type));
 
-    if (is_compilable_fn && !is_compilable_fn(node, device_type)) continue;
+    if (is_compilable_fn && !is_compilable_fn(node, device_type)) {
+      VLOG(2) << "Compilation rejected node: not compilable " << node->name()
+              << ": " << node->type_string();
+      continue;
+    }
 
     const XlaOpRegistry::DeviceRegistration* registration;
     CHECK(
@@ -430,46 +423,6 @@ struct Cluster {
   int representative = -1;
 };
 
-// Returns a string describing how an edge from src to dst would
-// create a cycle.
-string DescribeCycle(const GraphCycles& cycles, const Graph& graph, int src,
-                     int dst) {
-  int32 max_path_size = graph.num_node_ids() + 1;
-  std::vector<int32> path(max_path_size);
-  int32 path_size = cycles.FindPath(dst, src, max_path_size, path.data());
-  if (path_size == 0) {
-    return "";
-  }
-
-  auto node_name = [&cycles, &graph](int node_id) {
-    if (!FastBoundsCheck(node_id, graph.num_node_ids())) {
-      return string("(null)");
-    }
-    auto* node = graph.FindNodeId(node_id);
-    if (node == nullptr) {
-      return string("(null)");
-    }
-    return node->name();
-  };
-
-  string description;
-  strings::StrAppend(&description, "Edge from ", node_name(src), " to ",
-                     node_name(dst), " would create a cycle.\n");
-  path.resize(path_size);
-  for (int32 node_id : path) {
-    string ascii_art;
-    if (node_id == dst) {
-      ascii_art = "+-> ";
-    } else if (node_id != src) {
-      ascii_art = "|   ";
-    } else {
-      ascii_art = "+-- ";
-    }
-    strings::StrAppend(&description, ascii_art, node_name(node_id), "\n");
-  }
-  return description;
-}
-
 }  // anonymous namespace
 
 bool IsCompilable(FunctionLibraryRuntime* flr, const NodeDef& ndef) {
@@ -575,84 +528,13 @@ Status MarkForCompilationPass::RunImpl(
                                            : Env::Default(),
       is_compilable_fn, &compilation_candidates));
 
-  GraphCycles cycles;
-  for (int i = 0; i < graph->num_node_ids(); ++i) {
-    // We rely on the node IDs in the cycle detection graph being consecutive
-    // integers starting from 0.
-    CHECK_EQ(i, cycles.NewNode());
+  if (compilation_candidates.empty()) {
+    VLOG(2) << "No compilable candidates";
+    return Status::OK();
   }
 
-  // Compute the loop structure of the graph.
-  std::vector<ControlFlowInfo> control_flow_info;
-  TF_RETURN_IF_ERROR(BuildControlFlowInfo(graph, &control_flow_info));
-
-  // The clustering code must avoid adding cycles to the graph to prevent
-  // deadlock. However, the graph may contain loops, which would trigger the
-  // cycle detection code. To handle loops, we alter the structure of the cycle
-  // detection graph, disconnecting each loop from the enclosing graph.
-  // Specifically, we:
-  // * add a new "frame" node for each loop.
-  // * replace edges to "Enter" nodes, and edges from "Exit" nodes with edges
-  //   to/from the corresponding frame node. In essence, we collapse the loop
-  //   into a single node for the purpose of cycle detection in the enclosing
-  //   graph.
-  // * the body of the loop should now be disconnected from the rest of the
-  //   graph; we make it acyclic by breaking loop backedges (edges outgoing from
-  //   "NextIteration" nodes.
-
-  // Map from frame name strings to node IDs in the cycle detection graph.
-  std::unordered_map<string, int> frame_nodes;
-
-  // Get the cycle graph node ID for frame 'frame_name', or add one if none
-  // exists.
-  auto GetOrAddFrameNodeId = [&frame_nodes, &cycles](const string& frame_name) {
-    int& frame_id = frame_nodes.emplace(frame_name, -1).first->second;
-    if (frame_id < 0) {
-      // The emplace succeeded; we have not allocated a frame node yet.
-      frame_id = cycles.NewNode();
-    }
-    return frame_id;
-  };
-
-  for (Edge const* edge : graph->edges()) {
-    if (edge->dst()->IsEnter()) {
-      // Lift edges to an "Enter" node to the corresponding frame node.
-      const string& frame_name =
-          control_flow_info[edge->dst()->id()].frame_name;
-      int dst = GetOrAddFrameNodeId(frame_name);
-      if (!cycles.InsertEdge(edge->src()->id(), dst)) {
-        return errors::Internal(
-            "Cycle detected when adding enter->frame edge: ",
-            DescribeCycle(cycles, *graph, edge->src()->id(), dst));
-      }
-      continue;
-    }
-    if (edge->src()->IsExit()) {
-      // Lift edges from an "Exit" node to the corresponding frame node.
-      const string& frame_name =
-          control_flow_info[edge->src()->id()].frame_name;
-      int src = GetOrAddFrameNodeId(frame_name);
-      if (!cycles.InsertEdge(src, edge->dst()->id())) {
-        return errors::Internal(
-            "Cycle detected when adding frame->exit edge: ",
-            DescribeCycle(cycles, *graph, src, edge->dst()->id()));
-      }
-      // Drop the original edge.
-      continue;
-    }
-    if (edge->src()->IsNextIteration()) {
-      // Break loop back-edges.
-      continue;
-    }
-    if (!cycles.InsertEdge(edge->src()->id(), edge->dst()->id())) {
-      // This should never happen. All cycles in the graph should contain
-      // a control flow operator.
-      return errors::Internal(
-          "Found cycle in graph without control flow operator during XLA "
-          "compilation: ",
-          DescribeCycle(cycles, *graph, edge->src()->id(), edge->dst()->id()));
-    }
-  }
+  GraphCycles cycles;
+  TF_RETURN_IF_ERROR(CreateCycleDetectionGraph(graph, &cycles));
 
   // Each compilation candidate belongs to a cluster. The cluster's
   // representative
@@ -670,6 +552,9 @@ Status MarkForCompilationPass::RunImpl(
 
   // Repeatedly contract edges between clusters that are on the same device,
   // provided the contraction would not create a cycle.
+  //
+  // TODO(hpucha): Handle the case where kXlaClusterAttr is already set (for
+  // example, from the Grappler fusion pass).
   while (!worklist.empty()) {
     int from = worklist.front()->Get().representative;
     worklist.pop_front();
@@ -778,7 +663,7 @@ Status MarkForCompilationPass::RunImpl(
     // compilation.
     DeviceType device_type("");
     TF_RETURN_IF_ERROR(
-        DeviceTypeOfDevice(n->assigned_device_name(), &device_type));
+        DeviceToDeviceType(n->assigned_device_name(), &device_type));
     const XlaOpRegistry::DeviceRegistration* registration;
     XlaOpRegistry::GetCompilationDevice(device_type.type(), &registration);
 
diff --git a/tensorflow/compiler/jit/mark_for_compilation_pass_test.cc b/tensorflow/compiler/jit/mark_for_compilation_pass_test.cc
index 703d8825d74ced8d4d69c31ccd730adc89a8bffe..772c92d369e67f431b5d030d1d5cdc5ae2700d39 100644
--- a/tensorflow/compiler/jit/mark_for_compilation_pass_test.cc
+++ b/tensorflow/compiler/jit/mark_for_compilation_pass_test.cc
@@ -633,5 +633,52 @@ TEST(XlaCompilationTest, ConstOp) {
   }
 }
 
+TEST(XlaCompilationTest, DontClusterIdentityWithRefInput) {
+  Scope root = Scope::NewRootScope().ExitOnError();
+  Output variable = ops::Variable(root.WithOpName("variable"),
+                                  PartialTensorShape{}, DT_FLOAT);
+  Output read = ops::Identity(root.WithOpName("read"), variable);
+  Output neg = ops::Negate(root.WithOpName("negate"), read);
+  Output add = ops::Add(root.WithOpName("add"), neg, neg);
+  std::unique_ptr<Graph> graph(new Graph(OpRegistry::Global()));
+
+  TF_ASSERT_OK(root.ToGraph(graph.get()));
+  TF_ASSERT_OK(MarkForCompilation(&graph));
+
+  std::unordered_map<string, string> clusters = GetClusters(*graph);
+
+  ASSERT_FALSE(clusters.empty());
+  string cluster_name = clusters.begin()->second;
+
+  std::unordered_map<string, string> expected_clusters(
+      {{"negate", cluster_name}, {"add", cluster_name}});
+  EXPECT_EQ(clusters, expected_clusters);
+}
+
+TEST(XlaCompilationTest, ClusterIdentityWithNonRefInput) {
+  Scope root = Scope::NewRootScope().ExitOnError();
+  Output variable = ops::Variable(root.WithOpName("variable"),
+                                  PartialTensorShape{}, DT_FLOAT);
+  Output read = ops::Identity(root.WithOpName("read"), variable);
+  Output neg = ops::Negate(root.WithOpName("negate"), read);
+  Output identity = ops::Negate(root.WithOpName("identity"), neg);
+  Output add = ops::Add(root.WithOpName("add"), identity, neg);
+  std::unique_ptr<Graph> graph(new Graph(OpRegistry::Global()));
+
+  TF_ASSERT_OK(root.ToGraph(graph.get()));
+  TF_ASSERT_OK(MarkForCompilation(&graph));
+
+  std::unordered_map<string, string> clusters = GetClusters(*graph);
+
+  ASSERT_FALSE(clusters.empty());
+  string cluster_name = clusters.begin()->second;
+
+  std::unordered_map<string, string> expected_clusters(
+      {{"negate", cluster_name},
+       {"identity", cluster_name},
+       {"add", cluster_name}});
+  EXPECT_EQ(clusters, expected_clusters);
+}
+
 }  // namespace
 }  // namespace tensorflow
diff --git a/tensorflow/compiler/jit/xla_cluster_util.cc b/tensorflow/compiler/jit/xla_cluster_util.cc
new file mode 100644
index 0000000000000000000000000000000000000000..05b7821b8865d0f210ca9af92370e177d6043e80
--- /dev/null
+++ b/tensorflow/compiler/jit/xla_cluster_util.cc
@@ -0,0 +1,183 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/jit/xla_cluster_util.h"
+
+#include <unordered_map>
+
+#include "tensorflow/core/framework/node_def.pb.h"
+#include "tensorflow/core/graph/control_flow.h"
+#include "tensorflow/core/kernels/bounds_check.h"
+#include "tensorflow/core/util/device_name_utils.h"
+
+namespace tensorflow {
+
+const char* const kXlaClusterAttr = "_XlaCluster";
+const char* const kXlaOutsideCompilationAttr = "_XlaOutsideCompilation";
+
+namespace {
+// Returns a string describing how an edge from src to dst would
+// create a cycle.
+string DescribeCycle(const GraphCycles* cycles, const Graph& graph, int src,
+                     int dst) {
+  int32 max_path_size = graph.num_node_ids() + 1;
+  std::vector<int32> path(max_path_size);
+  int32 path_size = cycles->FindPath(dst, src, max_path_size, path.data());
+  if (path_size == 0) {
+    return "";
+  }
+
+  auto node_name = [cycles, &graph](int node_id) {
+    if (!FastBoundsCheck(node_id, graph.num_node_ids())) {
+      return string("(null)");
+    }
+    auto* node = graph.FindNodeId(node_id);
+    if (node == nullptr) {
+      return string("(null)");
+    }
+    return node->name();
+  };
+
+  string description;
+  strings::StrAppend(&description, "Edge from ", node_name(src), " to ",
+                     node_name(dst), " would create a cycle.\n");
+  path.resize(path_size);
+  for (int32 node_id : path) {
+    string ascii_art;
+    if (node_id == dst) {
+      ascii_art = "+-> ";
+    } else if (node_id != src) {
+      ascii_art = "|   ";
+    } else {
+      ascii_art = "+-- ";
+    }
+    strings::StrAppend(&description, ascii_art, node_name(node_id), "\n");
+  }
+  return description;
+}
+
+bool AlwaysForwardsRefInput(const Node& node) { return node.IsIdentity(); }
+
+}  // namespace
+
+Status DeviceToDeviceType(const string& device, DeviceType* device_type) {
+  DeviceNameUtils::ParsedName parsed;
+  if (!DeviceNameUtils::ParseFullName(device, &parsed)) {
+    return errors::Internal("Malformed assigned device '", device, "'");
+  }
+  *device_type = DeviceType(parsed.type);
+  return Status::OK();
+}
+
+bool HasForwardedRefInput(const Node& node) {
+  if (AlwaysForwardsRefInput(node)) {
+    for (const Edge* incoming_edge : node.in_edges()) {
+      if (incoming_edge->IsControlEdge()) {
+        continue;
+      }
+
+      Node* incoming_node = incoming_edge->src();
+      if (IsRefType(incoming_node->output_type(incoming_edge->src_output()))) {
+        VLOG(2) << "Node " << node.def().ShortDebugString() << " has ref input "
+                << incoming_node->name() << " " << incoming_node->type_string();
+        return true;
+      }
+    }
+  }
+  return false;
+}
+
+Status CreateCycleDetectionGraph(const Graph* graph, GraphCycles* cycles) {
+  for (int i = 0; i < graph->num_node_ids(); ++i) {
+    // We rely on the node IDs in the cycle detection graph being consecutive
+    // integers starting from 0.
+    CHECK_EQ(i, cycles->NewNode());
+  }
+
+  // Compute the loop structure of the graph.
+  std::vector<ControlFlowInfo> control_flow_info;
+  TF_RETURN_IF_ERROR(BuildControlFlowInfo(graph, &control_flow_info));
+
+  // The clustering code must avoid adding cycles to the graph to prevent
+  // deadlock. However, the graph may contain loops, which would trigger the
+  // cycle detection code. To handle loops, we alter the structure of the cycle
+  // detection graph, disconnecting each loop from the enclosing graph.
+  // Specifically, we:
+  // * add a new "frame" node for each loop.
+  // * replace edges to "Enter" nodes, and edges from "Exit" nodes with edges
+  //   to/from the corresponding frame node. In essence, we collapse the loop
+  //   into a single node for the purpose of cycle detection in the enclosing
+  //   graph.
+  // * the body of the loop should now be disconnected from the rest of the
+  //   graph; we make it acyclic by breaking loop backedges (edges outgoing from
+  //   "NextIteration" nodes.
+
+  // Map from frame name strings to node IDs in the cycle detection graph.
+  std::unordered_map<string, int> frame_nodes;
+
+  // Get the cycle graph node ID for frame 'frame_name', or add one if none
+  // exists.
+  auto GetOrAddFrameNodeId = [&frame_nodes, cycles](const string& frame_name) {
+    int& frame_id = frame_nodes.emplace(frame_name, -1).first->second;
+    if (frame_id < 0) {
+      // The emplace succeeded; we have not allocated a frame node yet.
+      frame_id = cycles->NewNode();
+    }
+    return frame_id;
+  };
+
+  for (Edge const* edge : graph->edges()) {
+    if (edge->dst()->IsEnter()) {
+      // Lift edges to an "Enter" node to the corresponding frame node.
+      const string& frame_name =
+          control_flow_info[edge->dst()->id()].frame_name;
+      int dst = GetOrAddFrameNodeId(frame_name);
+      if (!cycles->InsertEdge(edge->src()->id(), dst)) {
+        return errors::Internal(
+            "Cycle detected when adding enter->frame edge: ",
+            DescribeCycle(cycles, *graph, edge->src()->id(), dst));
+      }
+      continue;
+    }
+    if (edge->src()->IsExit()) {
+      // Lift edges from an "Exit" node to the corresponding frame node.
+      const string& frame_name =
+          control_flow_info[edge->src()->id()].frame_name;
+      int src = GetOrAddFrameNodeId(frame_name);
+      if (!cycles->InsertEdge(src, edge->dst()->id())) {
+        return errors::Internal(
+            "Cycle detected when adding frame->exit edge: ",
+            DescribeCycle(cycles, *graph, src, edge->dst()->id()));
+      }
+      // Drop the original edge.
+      continue;
+    }
+    if (edge->src()->IsNextIteration()) {
+      // Break loop back-edges.
+      continue;
+    }
+    if (!cycles->InsertEdge(edge->src()->id(), edge->dst()->id())) {
+      // This should never happen. All cycles in the graph should contain
+      // a control flow operator.
+      return errors::Internal(
+          "Found cycle in graph without control flow operator during XLA "
+          "compilation: ",
+          DescribeCycle(cycles, *graph, edge->src()->id(), edge->dst()->id()));
+    }
+  }
+  return Status::OK();
+}
+
+}  // namespace tensorflow
diff --git a/tensorflow/compiler/jit/xla_cluster_util.h b/tensorflow/compiler/jit/xla_cluster_util.h
new file mode 100644
index 0000000000000000000000000000000000000000..bcce082aaf6044ff0654efa4d78c0f493a350d00
--- /dev/null
+++ b/tensorflow/compiler/jit/xla_cluster_util.h
@@ -0,0 +1,49 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+// Contains utilities for clustering compilable graph nodes via XLA.
+
+#ifndef TENSORFLOW_COMPILER_JIT_XLA_CLUSTER_UTIL_H_
+#define TENSORFLOW_COMPILER_JIT_XLA_CLUSTER_UTIL_H_
+
+#include "tensorflow/compiler/jit/graphcycles/graphcycles.h"
+#include "tensorflow/core/graph/algorithm.h"
+
+namespace tensorflow {
+
+// The attribute that marks nodes to be grouped into functions by the
+// encapsulate subgraphs pass.
+extern const char* const kXlaClusterAttr;
+
+// The attribute that marks nodes in a cluster to be placed outside the xla
+// compilation by the encapsulate subgraphs pass.
+extern const char* const kXlaOutsideCompilationAttr;
+
+using OrderedNodeSet = std::set<Node*, NodeComparatorID>;
+
+// Returns the DeviceType corresponding to 'device'.
+Status DeviceToDeviceType(const string& device, DeviceType* device_type);
+
+// Returns true if `node` has a ref tensor input that it forwards to its output.
+bool HasForwardedRefInput(const Node& node);
+
+// Creates a graph representation to enable cycle detection when clustering.
+// This representation handles loops in graph by disconnecting each loop from
+// the enclosing graph.
+Status CreateCycleDetectionGraph(const Graph* graph, GraphCycles* cycles);
+
+}  // namespace tensorflow
+
+#endif  // TENSORFLOW_COMPILER_JIT_XLA_CLUSTER_UTIL_H_
diff --git a/tensorflow/compiler/jit/xla_compile_on_demand_op.cc b/tensorflow/compiler/jit/xla_compile_on_demand_op.cc
index ab644ff5a61c407b246b97af5328bf5cd8c1893b..b1943d3e1a7e321b5a3796a0c6e4f2b5d9ac7018 100644
--- a/tensorflow/compiler/jit/xla_compile_on_demand_op.cc
+++ b/tensorflow/compiler/jit/xla_compile_on_demand_op.cc
@@ -151,8 +151,7 @@ Status XlaCompileOnDemandOp::Compile(
   core::ScopedUnref cache_ref(cache);
 
   XlaCompiler::Options options;
-  DeviceType device_type = metadata.jit_device_type();
-  options.device_type = &device_type;
+  options.device_type = metadata.jit_device_type();
   options.client = metadata.client();
   options.flib_def =
       new FunctionLibraryDefinition(OpRegistry::Global(), FunctionDefLibrary{});
diff --git a/tensorflow/compiler/jit/xla_cpu_device.cc b/tensorflow/compiler/jit/xla_cpu_device.cc
index ea9e0366043a4a64bfe43703c55d4470693bbac8..43648402f65c656b6b4eb2e83e61ce45f1c73669 100644
--- a/tensorflow/compiler/jit/xla_cpu_device.cc
+++ b/tensorflow/compiler/jit/xla_cpu_device.cc
@@ -50,11 +50,12 @@ Status XlaCpuDeviceFactory::CreateDevices(const SessionOptions& options,
   (void)registrations;
 
   std::unique_ptr<XlaDevice> device;
-  TF_RETURN_IF_ERROR(
-      XlaDevice::Create("Host", DEVICE_XLA_CPU, 0, DEVICE_CPU_XLA_JIT, options,
-                        name_prefix, registration,
-                        /*transfer_as_literal=*/false,
-                        /*shape_representation_fn=*/{}, &device));
+  TF_RETURN_IF_ERROR(XlaDevice::Create("Host", DEVICE_XLA_CPU, 0,
+                                       DEVICE_CPU_XLA_JIT, options, name_prefix,
+                                       registration,
+                                       /*transfer_as_literal=*/false,
+                                       /*shape_representation_fn=*/{},
+                                       /*padded_shape_fn=*/{}, &device));
   devices->push_back(device.release());
   return Status::OK();
 }
diff --git a/tensorflow/compiler/jit/xla_device.cc b/tensorflow/compiler/jit/xla_device.cc
index f13b46c532e6008477849f2e06887901c90038ab..ed007d603ea1b3d27dd25f00726261cdd029c20c 100644
--- a/tensorflow/compiler/jit/xla_device.cc
+++ b/tensorflow/compiler/jit/xla_device.cc
@@ -23,6 +23,7 @@ limitations under the License.
 #include "tensorflow/compiler/jit/xla_device_context.h"
 #include "tensorflow/compiler/jit/xla_device_ops.h"
 #include "tensorflow/compiler/tf2xla/dump_graph.h"
+#include "tensorflow/compiler/tf2xla/shape_util.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
 #include "tensorflow/compiler/xla/client/client_library.h"
 #include "tensorflow/core/common_runtime/device.h"
@@ -105,6 +106,25 @@ XlaDeviceAllocator* XlaDeviceAllocatorState::GetOrCreateXlaDeviceAllocator(
   return alloc_ptr;
 }
 
+namespace {
+
+// Default PaddedShapeFn implementation that simply returns the unpadded
+// on-device shape. This is accurate for CPU and GPU devices that neither
+// transpose nor pad tensors.
+Status DefaultPaddedShapeFn(const Tensor& tensor, xla::Shape* shape) {
+  const tensorflow::XlaTensor* xla_tensor =
+      tensorflow::XlaTensor::FromTensor(&tensor);
+  if (xla_tensor == nullptr) {
+    return TensorShapeToXLAShape(tensor.dtype(), tensor.shape(), shape);
+  }
+
+  const xla::ShapedBuffer& shaped_buffer = xla_tensor->shaped_buffer();
+  *shape = shaped_buffer.on_device_shape();
+  return Status::OK();
+}
+
+}  // namespace
+
 /* static */ Status XlaDevice::Create(
     const string& platform_name, const string& device_name, int device_ordinal,
     const string& jit_device_name, const SessionOptions& options,
@@ -112,7 +132,7 @@ XlaDeviceAllocator* XlaDeviceAllocatorState::GetOrCreateXlaDeviceAllocator(
     const XlaOpRegistry::DeviceRegistration& registration,
     bool transfer_as_literal,
     const XlaCompiler::ShapeRepresentationFn& shape_representation_fn,
-    std::unique_ptr<XlaDevice>* device) {
+    const PaddedShapeFn& padded_shape_fn, std::unique_ptr<XlaDevice>* device) {
   VLOG(1) << "XlaDevice::Create " << platform_name << " " << device_name << ":"
           << device_ordinal;
 
@@ -133,17 +153,20 @@ XlaDeviceAllocator* XlaDeviceAllocatorState::GetOrCreateXlaDeviceAllocator(
 
   device->reset(new XlaDevice(
       options, attrs, device_ordinal, DeviceType(jit_device_name),
-      platform.ValueOrDie(), transfer_as_literal, shape_representation_fn));
+      platform.ValueOrDie(), transfer_as_literal, shape_representation_fn,
+      padded_shape_fn ? padded_shape_fn : DefaultPaddedShapeFn));
   return Status::OK();
 }
 
 XlaDevice::Metadata::Metadata(
     int device_ordinal, se::Platform* platform, const DeviceType& device_type,
-    XlaCompiler::ShapeRepresentationFn shape_representation_fn)
+    XlaCompiler::ShapeRepresentationFn shape_representation_fn,
+    PaddedShapeFn padded_shape_fn)
     : device_ordinal_(device_ordinal),
       device_type_(device_type),
       platform_(platform),
-      shape_representation_fn_(std::move(shape_representation_fn)) {}
+      shape_representation_fn_(std::move(shape_representation_fn)),
+      padded_shape_fn_(std::move(padded_shape_fn)) {}
 
 int XlaDevice::Metadata::device_ordinal() const { return device_ordinal_; }
 
@@ -178,10 +201,11 @@ XlaDevice::XlaDevice(
     const SessionOptions& options, const DeviceAttributes& attrs,
     int device_ordinal, const DeviceType& jit_device_name,
     se::Platform* platform, bool transfer_as_literal,
-    const XlaCompiler::ShapeRepresentationFn& shape_representation_fn)
+    const XlaCompiler::ShapeRepresentationFn& shape_representation_fn,
+    const PaddedShapeFn& padded_shape_fn)
     : LocalDevice(options, attrs),
       xla_metadata_(device_ordinal, platform, jit_device_name,
-                    shape_representation_fn),
+                    shape_representation_fn, padded_shape_fn),
       device_ordinal_(device_ordinal),
       jit_device_name_(jit_device_name),
       xla_allocator_(nullptr),
diff --git a/tensorflow/compiler/jit/xla_device.h b/tensorflow/compiler/jit/xla_device.h
index d5d345d43b16c43c7a202791b2604b39d29e8cdb..02e88ee6793e984a7b782790f8011cbcbc5a5026 100644
--- a/tensorflow/compiler/jit/xla_device.h
+++ b/tensorflow/compiler/jit/xla_device.h
@@ -45,13 +45,19 @@ namespace tensorflow {
 
 class XlaDevice : public LocalDevice {
  public:
+  // Given a tensor, sets `xla::Shape*` the shape of tensor's representation
+  // on device, fully padded. On error, the contents of `xla::Shape*`
+  // are undefined.
+  typedef std::function<Status(const Tensor&, xla::Shape*)> PaddedShapeFn;
+
   // Wrapper class to store metadata about the XlaDevice, where it can be
   // retrieved e.g., when lazily creating the XlaCompilationCache device.
   class Metadata {
    public:
     Metadata(int device_ordinal, se::Platform* platform,
              const DeviceType& device_type,
-             XlaCompiler::ShapeRepresentationFn shape_representation_fn);
+             XlaCompiler::ShapeRepresentationFn shape_representation_fn,
+             PaddedShapeFn padded_shape_fn);
 
     // The index of the device on this host.
     int device_ordinal() const;
@@ -62,12 +68,14 @@ class XlaDevice : public LocalDevice {
     const XlaCompiler::ShapeRepresentationFn& shape_representation_fn() const {
       return shape_representation_fn_;
     }
+    const PaddedShapeFn& padded_shape_fn() const { return padded_shape_fn_; }
 
    private:
     const int device_ordinal_;
     const DeviceType device_type_;
     se::Platform* platform_;  // Not owned.
     XlaCompiler::ShapeRepresentationFn shape_representation_fn_;
+    PaddedShapeFn padded_shape_fn_;
 
     TF_DISALLOW_COPY_AND_ASSIGN(Metadata);
   };
@@ -81,6 +89,8 @@ class XlaDevice : public LocalDevice {
   // 'transfer_as_literal' is true if device<->host transfers must be done using
   // XLA's TransferLiteral{To,From}Device interface. If false, we can use
   // ThenMemcpy instead.
+  // If padded_shape_fn is empty, a default implementation that returns
+  // the on-host shape is used.
   static Status Create(
       const string& platform_name, const string& device_name,
       int device_ordinal, const string& jit_device_name,
@@ -88,12 +98,16 @@ class XlaDevice : public LocalDevice {
       const XlaOpRegistry::DeviceRegistration& registration,
       bool transfer_as_literal,
       const XlaCompiler::ShapeRepresentationFn& shape_representation_fn,
-      std::unique_ptr<XlaDevice>* device);
+      const PaddedShapeFn& padded_shape_fn, std::unique_ptr<XlaDevice>* device);
 
+  // Creates a new XLA Device.
+  // If padded_shape_fn is empty, a default implementation that returns
+  // the logical on-device shape without padding is used.
   XlaDevice(const SessionOptions& options, const DeviceAttributes& attrs,
             int device_ordinal, const DeviceType& jit_device_name,
             se::Platform* platform, bool transfer_as_literal,
-            const XlaCompiler::ShapeRepresentationFn& shape_representation_fn);
+            const XlaCompiler::ShapeRepresentationFn& shape_representation_fn,
+            const PaddedShapeFn& padded_shape_fn);
   ~XlaDevice() override;
 
   Allocator* GetAllocator(AllocatorAttributes attr) override;
@@ -110,6 +124,7 @@ class XlaDevice : public LocalDevice {
                              Tensor* tensor) override;
 
   xla::LocalClient* client() const;
+  const Metadata& metadata() { return xla_metadata_; }
   xla::StatusOr<se::Stream*> GetStream();
 
   // If not already set, create and set GpuDeviceInfo.
diff --git a/tensorflow/compiler/jit/xla_device_context.cc b/tensorflow/compiler/jit/xla_device_context.cc
index ff30b62bad782f281bcd25275521ed8b0c4c0bfd..71e63b110b3b132a57fc291e53a165954c72a03c 100644
--- a/tensorflow/compiler/jit/xla_device_context.cc
+++ b/tensorflow/compiler/jit/xla_device_context.cc
@@ -54,16 +54,26 @@ XlaTransferManager::XlaTransferManager(
       client_(client),
       transfer_manager_(client->backend().transfer_manager()),
       transfer_as_literal_(transfer_as_literal),
-      shape_representation_fn_(std::move(shape_representation_fn)) {}
+      shape_representation_fn_(std::move(shape_representation_fn)) {
+  if (!shape_representation_fn_) {
+    shape_representation_fn_ = [](const TensorShape& shape, DataType dtype) {
+      return shape;
+    };
+  }
+}
 
 Status XlaTransferManager::TransferLiteralToDevice(
     const Tensor& host_tensor, Tensor* device_tensor) const {
-  xla::Literal literal;
-  TF_RETURN_IF_ERROR(HostTensorToLiteral(host_tensor, &literal));
-  VLOG(1) << "Transfer to device as literal: " << literal.ToString();
+  xla::Shape xla_shape;
+  TF_RETURN_IF_ERROR(TensorShapeToXLAShape(host_tensor.dtype(),
+                                           host_tensor.shape(), &xla_shape));
+  xla::BorrowingLiteral literal(
+      static_cast<const char*>(DMAHelper::base(&host_tensor)), xla_shape);
 
   const xla::ShapedBuffer& shaped_buffer =
       XlaTensor::FromTensor(device_tensor)->shaped_buffer();
+  VLOG(1) << "Transfer to device as literal: " << literal.ToString() << " "
+          << shaped_buffer.ToString();
   return transfer_manager_->TransferLiteralToDevice(stream_->parent(), literal,
                                                     shaped_buffer);
 }
@@ -76,7 +86,8 @@ Status XlaTransferManager::TransferLiteralFromDevice(
   TF_ASSIGN_OR_RETURN(std::unique_ptr<xla::Literal> literal,
                       transfer_manager_->TransferLiteralFromDevice(
                           stream_->parent(), shaped_buffer));
-  VLOG(1) << "Transfer from device as literal: " << literal->ToString();
+  VLOG(1) << "Transfer from device as literal: " << literal->ToString() << " "
+          << shaped_buffer.ToString();
   Tensor tensor;
   TF_RETURN_IF_ERROR(
       LiteralToHostTensor(*literal, host_tensor->dtype(), &tensor));
@@ -98,7 +109,9 @@ void XlaTransferManager::CopyCPUTensorToDevice(const Tensor* cpu_tensor,
             << " "
             << reinterpret_cast<const void*>(
                    device_tensor->tensor_data().data())
-            << " " << cpu_tensor->NumElements();
+            << " " << cpu_tensor->NumElements() << " "
+            << cpu_tensor->shape().DebugString() << " "
+            << device_tensor->shape().DebugString();
 
     void* src_ptr = const_cast<void*>(DMAHelper::base(cpu_tensor));
     const int64 total_bytes = cpu_tensor->TotalBytes();
@@ -106,13 +119,8 @@ void XlaTransferManager::CopyCPUTensorToDevice(const Tensor* cpu_tensor,
     XlaTensor* xla_tensor = XlaTensor::FromTensor(device_tensor);
     CHECK(xla_tensor);
 
-    TensorShape shape;
-    if (shape_representation_fn_) {
-      shape = shape_representation_fn_(device_tensor->shape(),
-                                       device_tensor->dtype());
-    } else {
-      shape = device_tensor->shape();
-    }
+    TensorShape shape = shape_representation_fn_(device_tensor->shape(),
+                                                 device_tensor->dtype());
     if (!xla_tensor->has_shaped_buffer()) {
       Status s = xla_tensor->AllocateShapedBuffer(
           device_tensor->dtype(), shape, client_,
@@ -165,7 +173,9 @@ void XlaTransferManager::CopyDeviceTensorToCPU(const Tensor* device_tensor,
                    device_tensor->tensor_data().data())
             << " "
             << reinterpret_cast<const void*>(cpu_tensor->tensor_data().data())
-            << device_tensor->NumElements();
+            << " " << device_tensor->NumElements() << " "
+            << cpu_tensor->shape().DebugString() << " "
+            << device_tensor->shape().DebugString();
 
     const int64 total_bytes = cpu_tensor->TotalBytes();
     se::DeviceMemoryBase dev_src_ptr =
@@ -194,6 +204,42 @@ void XlaTransferManager::CopyDeviceTensorToCPU(const Tensor* device_tensor,
   done(Status::OK());
 }
 
+void XlaTransferManager::CopyDeviceTensorToDevice(const Tensor& src_tensor,
+                                                  Tensor* dst_tensor,
+                                                  const StatusCallback& done) {
+  // TODO(phawkins): replace this code with an asynchronous implementation.
+  auto body = [&]() {
+    if (src_tensor.NumElements() == 0) {
+      return Status::OK();
+    }
+    XlaTensor* xla_src = XlaTensor::FromTensor(&src_tensor);
+    XlaTensor* xla_dst = XlaTensor::FromTensor(dst_tensor);
+    CHECK(xla_src && xla_dst)
+        << "Missing destination tensor for device-to-device copy";
+    if (!xla_dst->has_shaped_buffer()) {
+      TensorShape shape =
+          shape_representation_fn_(src_tensor.shape(), src_tensor.dtype());
+      TF_RETURN_IF_ERROR(
+          xla_dst->AllocateShapedBuffer(src_tensor.dtype(), shape, client_,
+                                        stream_->parent()->device_ordinal()));
+    }
+    TF_RETURN_IF_ERROR(
+        xla_dst->shaped_buffer().buffers().ForEachMutableElementWithStatus(
+            [&](const xla::ShapeIndex& index, se::DeviceMemoryBase* buffer) {
+              const se::DeviceMemoryBase& from_buffer =
+                  xla_src->shaped_buffer().buffers().element(index);
+              CHECK_EQ(buffer->size(), from_buffer.size());
+              if (!stream_->parent()->SynchronousMemcpy(buffer, from_buffer,
+                                                        buffer->size())) {
+                return errors::Internal("Device to device memcpy failed");
+              }
+              return Status::OK();
+            }));
+    return Status::OK();
+  };
+  done(body());
+}
+
 XlaDeviceContext::XlaDeviceContext(
     se::Stream* stream, xla::LocalClient* client, bool transfer_as_literal,
     XlaCompiler::ShapeRepresentationFn shape_representation_fn)
@@ -215,4 +261,10 @@ void XlaDeviceContext::CopyDeviceTensorToCPU(const Tensor* device_tensor,
                                  done);
 }
 
+void XlaDeviceContext::CopyDeviceTensorToDevice(const Tensor& src_tensor,
+                                                Tensor* dst_tensor,
+                                                const StatusCallback& done) {
+  manager_.CopyDeviceTensorToDevice(src_tensor, dst_tensor, done);
+}
+
 }  // namespace tensorflow
diff --git a/tensorflow/compiler/jit/xla_device_context.h b/tensorflow/compiler/jit/xla_device_context.h
index 9af9655868448ce5116db3611c5f88339135947e..ee346e5653bbf9f393df202572c2150b4989506f 100644
--- a/tensorflow/compiler/jit/xla_device_context.h
+++ b/tensorflow/compiler/jit/xla_device_context.h
@@ -55,6 +55,10 @@ class XlaTransferManager {
   void CopyDeviceTensorToCPU(const Tensor* device_tensor,
                              StringPiece tensor_name, Device* device,
                              Tensor* cpu_tensor, StatusCallback done);
+
+  void CopyDeviceTensorToDevice(const Tensor& src_tensor, Tensor* dst_tensor,
+                                const StatusCallback& done);
+
   se::Stream* stream() const { return stream_; }
 
  private:
@@ -72,7 +76,7 @@ class XlaTransferManager {
   xla::TransferManager* transfer_manager_;
   // True if we must use XLA's TransferManager for correct device transfers.
   const bool transfer_as_literal_;
-  const XlaCompiler::ShapeRepresentationFn shape_representation_fn_;
+  XlaCompiler::ShapeRepresentationFn shape_representation_fn_;
 };
 
 // DeviceContext for operators assigned to XlaDevice devices. The
@@ -90,6 +94,9 @@ class XlaDeviceContext : public DeviceContext {
   void CopyDeviceTensorToCPU(const Tensor* device_tensor,
                              StringPiece tensor_name, Device* device,
                              Tensor* cpu_tensor, StatusCallback done) override;
+  void CopyDeviceTensorToDevice(const Tensor& src_tensor, Tensor* dst_tensor,
+                                const StatusCallback& done);
+
   se::Stream* stream() const override { return manager_.stream(); }
 
  private:
diff --git a/tensorflow/compiler/jit/xla_device_ops.cc b/tensorflow/compiler/jit/xla_device_ops.cc
index f68dba6b6a26c0c289fd8457ad143d62e5fb9a69..5ecb1afa7bcec910ca843ccd3a782745f2bb6ca8 100644
--- a/tensorflow/compiler/jit/xla_device_ops.cc
+++ b/tensorflow/compiler/jit/xla_device_ops.cc
@@ -15,7 +15,10 @@ limitations under the License.
 
 #include "tensorflow/compiler/jit/xla_device_ops.h"
 
+#include <memory>
+
 #include "tensorflow/compiler/jit/xla_device_context.h"
+#include "tensorflow/compiler/jit/xla_tensor.h"
 
 namespace tensorflow {
 
@@ -26,4 +29,82 @@ void XlaDeviceDummyOp::Compute(OpKernelContext* ctx) {
              << type_string() << " on an XLA device. This should never happen.";
 }
 
+XlaAssignVariableOp::XlaAssignVariableOp(OpKernelConstruction* c)
+    : AsyncOpKernel(c) {
+  OP_REQUIRES_OK(c, c->GetAttr("dtype", &dtype_));
+}
+
+void XlaAssignVariableOp::ComputeAsync(OpKernelContext* context,
+                                       DoneCallback done) {
+  OP_REQUIRES_ASYNC(context, dtype_ == context->input(1).dtype(),
+                    errors::InvalidArgument(
+                        "Variable and value dtypes don't match; respectively, ",
+                        dtype_, " and ", context->input(1).dtype()),
+                    done);
+  Var* variable = nullptr;
+  OP_REQUIRES_OK_ASYNC(
+      context,
+      LookupOrCreateResource<Var>(
+          context, HandleFromInput(context, 0), &variable,
+          [this, context](Var** ptr) {
+            *ptr = new Var(dtype_);
+            PersistentTensor unused;
+            Tensor* tmp;
+            AllocatorAttributes attr;
+            TF_RETURN_IF_ERROR(context->allocate_persistent(
+                dtype_, context->input(1).shape(), &unused, &tmp, attr));
+            *(*ptr)->tensor() = *tmp;
+            return Status::OK();
+          }),
+      done);
+  core::ScopedUnref s(variable);
+
+  OP_REQUIRES_ASYNC(context, variable->tensor()->dtype() == dtype_,
+                    errors::InvalidArgument(
+                        "Trying to assign variable with wrong dtype. Expected ",
+                        DataTypeString(variable->tensor()->dtype()), " got ",
+                        DataTypeString(dtype_)),
+                    done);
+
+  const Tensor& value = context->input(1);
+  AllocatorAttributes attr;
+
+  // Copying is unnecessary if we are the last user of the value tensor, we can
+  // just adopt the input tensor's buffer instead.
+  std::unique_ptr<Tensor> input_alias = context->forward_input(
+      1, /*output_index=*/OpKernelContext::Params::kNoReservation, dtype_,
+      value.shape(), DEVICE_MEMORY, attr);
+  mutex_lock ml(*variable->mu());
+  variable->is_initialized = true;
+  if (input_alias) {
+    *variable->tensor() = *input_alias;
+    done();
+    return;
+  }
+
+  // Need to copy, but maybe we can re-use variable's buffer?
+  if (!XlaTensor::RefCountIsOne(*variable->tensor()) ||
+      !variable->tensor()->shape().IsSameSize(value.shape())) {
+    // Copy to new buffer
+    PersistentTensor unused;
+    Tensor* tmp;
+    OP_REQUIRES_OK_ASYNC(context,
+                         context->allocate_persistent(dtype_, value.shape(),
+                                                      &unused, &tmp, attr),
+                         done);
+    *variable->tensor() = *tmp;
+  }
+
+  XlaDeviceContext* device_context =
+      static_cast<XlaDeviceContext*>(context->op_device_context());
+
+  variable->Ref();
+  device_context->CopyDeviceTensorToDevice(
+      value, variable->tensor(), [context, variable, done](Status status) {
+        variable->Unref();
+        context->SetStatus(status);
+        done();
+      });
+}
+
 }  // namespace tensorflow
diff --git a/tensorflow/compiler/jit/xla_device_ops.h b/tensorflow/compiler/jit/xla_device_ops.h
index 9c00a0682ccdc08e7bb09e32d32f01e87e7aaf8d..11e45d2823da2b623bd3cd45f7147686b05fdb2f 100644
--- a/tensorflow/compiler/jit/xla_device_ops.h
+++ b/tensorflow/compiler/jit/xla_device_ops.h
@@ -26,7 +26,9 @@ limitations under the License.
 #include "tensorflow/core/kernels/identity_n_op.h"
 #include "tensorflow/core/kernels/identity_op.h"
 #include "tensorflow/core/kernels/no_op.h"
+#include "tensorflow/core/kernels/resource_variable_ops.h"
 #include "tensorflow/core/kernels/sendrecv_ops.h"
+#include "tensorflow/core/kernels/shape_ops.h"
 #include "tensorflow/core/kernels/variable_ops.h"
 
 namespace tensorflow {
@@ -41,6 +43,15 @@ class XlaDeviceDummyOp : public OpKernel {
   void Compute(OpKernelContext* ctx) override;
 };
 
+class XlaAssignVariableOp : public AsyncOpKernel {
+ public:
+  explicit XlaAssignVariableOp(OpKernelConstruction* c);
+  void ComputeAsync(OpKernelContext* context, DoneCallback done) override;
+
+ private:
+  DataType dtype_;
+};
+
 #define REGISTER_XLA_LAUNCH_KERNEL(DEVICE, KERNEL, TYPES) \
   REGISTER_KERNEL_BUILDER(Name("XlaLaunch")               \
                               .Device(DEVICE)             \
@@ -73,7 +84,68 @@ class XlaDeviceDummyOp : public OpKernel {
                                                                                \
   REGISTER_KERNEL_BUILDER(                                                     \
       Name("VarHandleOp").Device(DEVICE).HostMemory("resource"),               \
-      ResourceHandleOp<Var>);
+      ResourceHandleOp<Var>);                                                  \
+  REGISTER_KERNEL_BUILDER(                                                     \
+      Name("ReadVariableOp").Device(DEVICE).HostMemory("resource"),            \
+      ReadVariableOp);                                                         \
+  REGISTER_KERNEL_BUILDER(Name("Shape")                                        \
+                              .Device(DEVICE)                                  \
+                              .HostMemory("output")                            \
+                              .TypeConstraint<int32>("out_type")               \
+                              .TypeConstraint("T", TYPES),                     \
+                          ShapeOp<int32>);                                     \
+  REGISTER_KERNEL_BUILDER(Name("Shape")                                        \
+                              .Device(DEVICE)                                  \
+                              .HostMemory("output")                            \
+                              .TypeConstraint<int64>("out_type")               \
+                              .TypeConstraint("T", TYPES),                     \
+                          ShapeOp<int64>);                                     \
+  REGISTER_KERNEL_BUILDER(Name("ShapeN")                                       \
+                              .Device(DEVICE)                                  \
+                              .HostMemory("output")                            \
+                              .TypeConstraint<int32>("out_type")               \
+                              .TypeConstraint("T", TYPES),                     \
+                          ShapeNOp<int32>);                                    \
+  REGISTER_KERNEL_BUILDER(Name("ShapeN")                                       \
+                              .Device(DEVICE)                                  \
+                              .HostMemory("output")                            \
+                              .TypeConstraint<int64>("out_type")               \
+                              .TypeConstraint("T", TYPES),                     \
+                          ShapeNOp<int64>);                                    \
+  REGISTER_KERNEL_BUILDER(Name("Size")                                         \
+                              .Device(DEVICE)                                  \
+                              .HostMemory("output")                            \
+                              .TypeConstraint<int32>("out_type")               \
+                              .TypeConstraint("T", TYPES),                     \
+                          SizeOp<int32>);                                      \
+  REGISTER_KERNEL_BUILDER(Name("Size")                                         \
+                              .Device(DEVICE)                                  \
+                              .HostMemory("output")                            \
+                              .TypeConstraint<int64>("out_type")               \
+                              .TypeConstraint("T", TYPES),                     \
+                          SizeOp<int64>);                                      \
+  REGISTER_KERNEL_BUILDER(                                                     \
+      Name("Rank").Device(DEVICE).HostMemory("output").TypeConstraint("T",     \
+                                                                      TYPES),  \
+      RankOp);                                                                 \
+  REGISTER_KERNEL_BUILDER(                                                     \
+      Name("AssignVariableOp").Device(DEVICE).HostMemory("resource"),          \
+      XlaAssignVariableOp);                                                    \
+  REGISTER_KERNEL_BUILDER(Name("ControlTrigger").Device(DEVICE),               \
+                          ControlTriggerOp);                                   \
+  REGISTER_KERNEL_BUILDER(Name("Switch").Device(DEVICE).HostMemory("pred"),    \
+                          SwitchOp);                                           \
+  REGISTER_KERNEL_BUILDER(                                                     \
+      Name("Merge").Device(DEVICE).HostMemory("value_index"), MergeOp);        \
+  REGISTER_KERNEL_BUILDER(Name("Enter").Device(DEVICE), EnterOp);              \
+  REGISTER_KERNEL_BUILDER(Name("Exit").Device(DEVICE), ExitOp);                \
+  REGISTER_KERNEL_BUILDER(Name("NextIteration").Device(DEVICE),                \
+                          NextIterationOp);                                    \
+  REGISTER_KERNEL_BUILDER(Name("LoopCond")                                     \
+                              .Device(DEVICE)                                  \
+                              .HostMemory("input")                             \
+                              .HostMemory("output"),                           \
+                          LoopCondOp);
 
 }  // namespace tensorflow
 
diff --git a/tensorflow/compiler/jit/xla_fusion_optimizer.cc b/tensorflow/compiler/jit/xla_fusion_optimizer.cc
new file mode 100644
index 0000000000000000000000000000000000000000..74257b09a808a39454eace3b1a9bf57a2e071360
--- /dev/null
+++ b/tensorflow/compiler/jit/xla_fusion_optimizer.cc
@@ -0,0 +1,328 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/jit/xla_fusion_optimizer.h"
+
+#include <atomic>
+#include <deque>
+#include <unordered_map>
+#include <unordered_set>
+
+#include "tensorflow/compiler/jit/defs.h"
+#include "tensorflow/compiler/jit/graphcycles/graphcycles.h"
+#include "tensorflow/compiler/jit/union_find.h"
+#include "tensorflow/compiler/jit/xla_cluster_util.h"
+#include "tensorflow/core/common_runtime/shape_refiner.h"
+#include "tensorflow/core/framework/node_def.pb.h"
+#include "tensorflow/core/graph/graph_constructor.h"
+#include "tensorflow/core/grappler/grappler_item.h"
+#include "tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.h"
+
+namespace tensorflow {
+
+// Is 'node' an operator that consumes only the shape of its input, not the
+// data itself?
+static bool IsShapeConsumerOp(const Node& node) {
+  return node.type_string() == "Shape" || node.type_string() == "ShapeN" ||
+         node.type_string() == "Rank" || node.type_string() == "Size";
+}
+
+// Returns true if the op can be decomposed into XLA ops for which
+// there are fusable elemental implementations.
+bool IsXlaFusable(const NodeDef& node) {
+  static const std::unordered_set<std::string>* elementwise_ops =
+      new std::unordered_set<std::string>(
+          {// tf2xla/kernels/aggregate_ops.cc
+           "AddN",
+           // tf2xla/kernels/binary_ops.cc
+           "Add", "Sub", "Mul", "Div", "Atan2", "Complex", "FloorDiv",
+           "FloorMod", "BitwiseAnd", "BitwiseOr", "LeftShift", "RightShift",
+           "LogicalAnd", "LogicalOr", "Mod", "Maximum", "Minimum", "RealDiv",
+           "ReciprocalGrad", "RsqrtGrad", "SqrtGrad", "SquaredDifference",
+           "TruncateDiv", "TruncateMod", "Equal", "NotEqual", "Greater",
+           "GreaterEqual", "Less", "LessEqual", "SigmoidGrad", "SoftplusGrad",
+           "SoftsignGrad", "TanhGrad", "Pow", "ApproximateEqual",
+           // tf2xla/kernels/unary_ops.cc
+           "ComplexAbs", "Angle", "Conj", "Abs", "Acos", "Acosh", "Asin",
+           "Asinh", "Atan", "Atanh", "Ceil", "Cos", "Cosh", "Sin", "Exp",
+           "Expm1", "Floor", "IsFinite", "IsInf", "IsNan", "Inv", "Reciprocal",
+           "Log", "Log1p", "Invert", "LogicalNot", "Neg", "Rint", "Round",
+           "Rsqrt", "Sigmoid", "Sign", "Sinh", "Softplus", "Softsign", "Sqrt",
+           "Square", "Tan", "Tanh", "Real", "Imag",
+           // tf2xla/kernels/bcast_ops.cc
+           "BroadcastArgs", "BroadcastGradientArgs",
+           // tf2xla/kernels/bias_ops.cc
+           "BiasAdd", "BiasAddV1", "BiasAddGrad" /*(Reduce)*/,
+           // tf2xla/kernels/cast_op.cc
+           "Cast",
+           // tf2xla/kernels/concat_op.cc
+           "Concat", "ConcatV2", "ConcatOffset",
+           // tf2xla/kernels/const_op.cc
+           "Const",
+           // tf2xla/kernels/elu_op.cc
+           "Elu", "EluGrad", "Selu", "SeluGrad",
+           // tf2xla/kernels/fill_op.cc
+           "Fill",
+           // tf2xla/kernels/identity_op.cc
+           "Identity", "IdentityN", "PreventGradient",
+           "StopGradient", /*"Snapshot",*/
+           // tf2xla/kernels/index_ops.cc
+           "ArgMax", "ArgMin",
+           // tf2xla/kernels/mirror_pad_op.cc
+           "MirrorPad",
+           // tf2xla/kernels/one_hot_op.cc
+           "OneHot",
+           // tf2xla/kernels/pack_op.cc
+           "Pack",
+           // tf2xla/kernels/pad_op.cc
+           "Pad", "PadV2",
+           // tf2xla/kernels/relu_op.cc
+           "Relu", "Relu6", "ReluGrad", "Relu6Grad",
+           // tf2xla/kernels/reshape_op.cc
+           "Reshape",
+           // tf2xla/kernels/reverse_op.cc
+           "Reverse", "ReverseV2",
+           // tf2xla/kernels/reverse_sequence_op.cc
+           "ReverseSequence",
+           // tf2xla/kernels/shape_op.cc
+           "Shape", "ShapeN", "Rank", "Size", "ExpandDims", "Squeeze",
+           "ZerosLike", "OnesLike",
+           // tf2xla/kernels/slice_op.cc
+           "Slice",
+           // tf2xla/kernels/split_op.cc
+           "Split", "SplitV",
+           // tf2xla/kernels/strided_slice_op.cc
+           "StridedSlice", "StridedSliceGrad", "ResourceStridedSliceAssign",
+           // tf2xla/kernels/tile_ops.cc
+           "Tile",
+           // tf2xla/kernels/transpose_op.cc
+           "Transpose", "InvertPermutation",
+           // tf2xla/kernels/unpack_op.cc
+           "Unpack"});
+
+  return elementwise_ops->count(node.op()) > 0;
+}
+
+Status XlaFusionOptimizer::Optimize(grappler::Cluster* cluster,
+                                    const grappler::GrapplerItem& item,
+                                    GraphDef* output) {
+  VLOG(2) << "Here at fusion optimizer";
+
+  // TODO(hpucha): Implement encapsulation and replacing with XlaLaunch op.
+  // Once that happens, the expected interaction between this optimizer and when
+  // the global_jit_level is set is as follows: Fusion optimizer will replace
+  // appropriate fusion clusters with XlaLaunch nodes. The remaining graph can
+  // be further compiled where possible via mark_for_compilation_pass. Note that
+  // this might lead to inefficient clustering, and it is best to use either the
+  // fusion optimizer or the global_jit flag, and not combine the two.
+
+  // Create a Graph out of GraphDef. This is required currently because the
+  // helpers around clustering, encapsulation etc work on graphs.
+  FunctionLibraryDefinition function_library(OpRegistry::Global(),
+                                             item.graph.library());
+  Graph graph(function_library);
+  ShapeRefiner shape_refiner(graph.versions(), graph.op_registry());
+  shape_refiner.set_require_shape_inference_fns(false);
+  shape_refiner.set_disable_constant_propagation(true);
+  ImportGraphDefOptions options;
+  // Graph optimization happens at the late stage of graph execution, when
+  // colocation constraints are already validated previously and the device
+  // placement of nodes has also completed, so there is no need to validate
+  // colocation constraints again.
+  options.validate_colocation_constraints = false;
+  options.validate_shape = false;
+  TF_RETURN_IF_ERROR(
+      ImportGraphDef(options, item.graph, &graph, &shape_refiner));
+
+  // Collect nodes that can be fused via XLA, while ignoring those that
+  // explicitly ask for XLA: (*) nodes that are marked to be compiled
+  // explicitly. (*) nodes assigned to XLA device.
+  OrderedNodeSet compilation_candidates;
+  for (Node* node : graph.op_nodes()) {
+    // If there is a _XlaCompile annotation, ignore the node if it is
+    // true. Nodes are marked with this attr via experimental_jit_scope, and
+    // will be handled by the mark_for_compilation pass.
+    bool compile = false;
+    Status status = GetNodeAttr(node->attrs(), kXlaCompileAttr, &compile);
+    if (status.ok() && compile) {
+      continue;
+    }
+    // If there is already a _XlaCluster annotation, ignore the node. Nodes are
+    // marked with this attr to indicate they are already part of a cluster and
+    // hence ignored.
+    status = GetNodeAttr(node->attrs(), kXlaClusterAttr, &compile);
+    if (status.ok()) {
+      continue;
+    }
+
+    // If there is an explicit XLA device placement, ignore the node.
+    DeviceType device_type("");
+    TF_RETURN_IF_ERROR(DeviceToDeviceType(node->def().device(), &device_type));
+    if (device_type.type_string().find("XLA") != string::npos) continue;
+
+    // Assume all fusable ops are registered.
+    // TODO(hpucha): Check for registration if possible.
+    if (!IsXlaFusable(node->def())) {
+      continue;
+    }
+
+    // XLA does not offer guaranteed aliasing between the input and output of
+    // the XLA cluster so it can't implement the forward-tensor-ref semantic.
+    // Leave such nodes out of XLA clusters.
+    if (HasForwardedRefInput(*node)) {
+      continue;
+    }
+
+    compilation_candidates.insert(node);
+  }
+
+  if (compilation_candidates.empty()) {
+    VLOG(2) << "No compilable candidates";
+    *output = item.graph;
+    return Status::OK();
+  }
+
+  GraphCycles cycles;
+  TF_RETURN_IF_ERROR(CreateCycleDetectionGraph(&graph, &cycles));
+
+  // TODO(hpucha): Make clustering more robust. There are two known issues that
+  // we need to mitigate: (a) Non-resource variables can cause deadlocks
+  // when clustering changes order of execution. See b/77263461 for a specific
+  // example. (b) Queue operations can also cause deadlocks. See b/77261498 for
+  // example.
+
+  struct Cluster {
+    // Identifies the node that represents this cluster in the cycle detection
+    // graph.
+    int representative = -1;
+  };
+
+  // Each compilation candidate belongs to a cluster. The cluster's
+  // representative names the node in the 'cycles' graph that represents the
+  // cluster.
+  std::vector<UnionFind<Cluster>> clusters(graph.num_node_ids());
+  std::deque<UnionFind<Cluster>*> worklist;
+  for (Node* node : compilation_candidates) {
+    Cluster& cluster = clusters[node->id()].Get();
+    cluster.representative = node->id();
+    worklist.push_back(&clusters[node->id()]);
+  }
+
+  // Repeatedly contract edges between clusters that are on the same device,
+  // provided the contraction would not create a cycle. This is a simplified
+  // version of the clustering in mark_for_compilation_pass that also deals with
+  // nodes that are explicitly tagged to be compiled/clustered.
+  while (!worklist.empty()) {
+    int from = worklist.front()->Get().representative;
+    worklist.pop_front();
+
+    Node* node_from = graph.FindNodeId(from);
+    if (node_from->IsControlFlow()) {
+      // Control flow nodes aren't compilation candidates and should never
+      // appear.
+      return errors::Internal(
+          "Found control flow node in clustering worklist: ",
+          node_from->type_string());
+    }
+    for (int to : cycles.Successors(from)) {
+      if (to >= graph.num_node_ids()) {
+        // Node is a "frame" node that is present only in the cycle detection
+        // graph. No clustering is possible.
+        continue;
+      }
+      Node* node_to = graph.FindNodeId(to);
+      if (compilation_candidates.find(node_to) ==
+          compilation_candidates.cend()) {
+        continue;
+      }
+
+      // Do not cluster across devices.
+      if (node_from->def().device() != node_to->def().device()) {
+        VLOG(2) << "Devices " << node_from->def().device() << " "
+                << node_to->def().device();
+        VLOG(2) << "Device names " << node_from->assigned_device_name() << " "
+                << node_to->assigned_device_name();
+        continue;
+      }
+
+      // Ops that consume shapes cannot be the root of a cluster. This is an
+      // optimization.
+      if (clusters[from].Size() == 1 && IsShapeConsumerOp(*node_from)) {
+        continue;
+      }
+
+      // If contracting the edge would create a cycle, bail out.
+      // However, just because we can't merge the clusters now does not mean
+      // we won't be able to merge them in the future.
+      // e.g., if we have edges 1->2, 2->3 and 1->3, we cannot contract edge
+      // 1->3. But if we first contract 1->2 then we can later contract 1->3.
+      if (!cycles.ContractEdge(from, to)) continue;
+
+      // Merge the clusters. ContractEdge uses 'from' as the number of the
+      // merged node, so make sure 'from' is the chosen representative.
+      clusters[from].Merge(&clusters[to]);
+
+      worklist.push_back(&clusters[from]);
+      break;
+    }
+  }
+
+  // Count the number of non-trivial elements in each cluster.
+  std::vector<int> effective_cluster_sizes(graph.num_node_ids());
+  for (const Node* n : compilation_candidates) {
+    int cluster = clusters[n->id()].Get().representative;
+    // Identity nodes will be removed if the node gets marked for compilation.
+    // Therefore we don't want to count them towards the effective cluster size.
+    if (n->def().op() != "Identity") {
+      effective_cluster_sizes[cluster]++;
+    }
+  }
+
+  const int min_cluster_size = 2;
+  int num_clusters = 0;
+  for (auto size : effective_cluster_sizes) {
+    if (size >= min_cluster_size) {
+      VLOG(3) << "Cluster " << num_clusters << " " << size;
+      num_clusters++;
+    }
+  }
+
+  // Names for each cluster.
+  std::unordered_map<int, string> cluster_names;
+  // Sequence number generator to ensure clusters have unique names.
+  static std::atomic<int64> cluster_sequence_num;
+
+  for (Node* n : compilation_candidates) {
+    int cluster = clusters[n->id()].Get().representative;
+
+    // Compile if this is a cluster of >= min_cluster_size compilable operators.
+    if (effective_cluster_sizes[cluster] >= min_cluster_size) {
+      string& name = cluster_names[cluster];
+
+      if (name.empty()) {
+        name = strings::StrCat("cluster_", cluster_sequence_num++);
+      }
+      n->AddAttr(kXlaClusterAttr, name);
+      VLOG(3) << "Assigning node " << n->name() << " to cluster " << name;
+    }
+  }
+
+  graph.ToGraphDef(output);
+  return Status::OK();
+}
+
+REGISTER_GRAPH_OPTIMIZER_AS(XlaFusionOptimizer, "xla-fusion");
+
+}  // namespace tensorflow
diff --git a/tensorflow/compiler/jit/xla_fusion_optimizer.h b/tensorflow/compiler/jit/xla_fusion_optimizer.h
new file mode 100644
index 0000000000000000000000000000000000000000..3d2309e782d38725f8db025fbfda0bf0f63d18be
--- /dev/null
+++ b/tensorflow/compiler/jit/xla_fusion_optimizer.h
@@ -0,0 +1,49 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_COMPILER_JIT_XLA_FUSION_OPTIMIZER_H_
+#define TENSORFLOW_COMPILER_JIT_XLA_FUSION_OPTIMIZER_H_
+
+#include "tensorflow/core/grappler/optimizers/custom_graph_optimizer.h"
+
+namespace tensorflow {
+
+// Optimizes graphs by fusing ops where possible, resulting in more efficient
+// execution.
+class XlaFusionOptimizer : public grappler::CustomGraphOptimizer {
+ public:
+  XlaFusionOptimizer() {}
+  ~XlaFusionOptimizer() override {}
+
+  Status Init(
+      const RewriterConfig_CustomGraphOptimizer* config = nullptr) override {
+    return Status::OK();
+  }
+
+  string name() const override { return "xla-fusion"; };
+
+  Status Optimize(grappler::Cluster* cluster,
+                  const grappler::GrapplerItem& item,
+                  GraphDef* output) override;
+
+  void Feedback(grappler::Cluster* cluster, const grappler::GrapplerItem& item,
+                const GraphDef& optimize_output, double result) override {
+    // Nothing to do for XlaFusionOptimizer.
+  }
+};
+
+}  // namespace tensorflow
+
+#endif  // TENSORFLOW_COMPILER_JIT_XLA_FUSION_OPTIMIZER_H_
diff --git a/tensorflow/compiler/jit/xla_fusion_optimizer_test.cc b/tensorflow/compiler/jit/xla_fusion_optimizer_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..5736760a878dc857a8558093054d0adc0f727398
--- /dev/null
+++ b/tensorflow/compiler/jit/xla_fusion_optimizer_test.cc
@@ -0,0 +1,183 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/jit/xla_fusion_optimizer.h"
+#include "tensorflow/compiler/jit/defs.h"
+#include "tensorflow/compiler/jit/xla_cluster_util.h"
+#include "tensorflow/core/graph/graph_def_builder.h"
+#include "tensorflow/core/graph/graph_def_builder_util.h"
+#include "tensorflow/core/grappler/utils/grappler_test.h"
+#include "tensorflow/core/lib/core/status_test_util.h"
+
+namespace tensorflow {
+namespace {
+
+REGISTER_OP("UncompilableNullary").Output("o: float");
+REGISTER_OP("UncompilableUnary").Input("a: float").Output("o: float");
+
+class XlaFusionOptimizerTest : public grappler::GrapplerTest {
+ protected:
+  std::unordered_map<string, string> GetClusters(const GraphDef& graph) {
+    std::unordered_map<string, string> ids;
+    for (const NodeDef& node : graph.node()) {
+      string cluster;
+      if (GetNodeAttr(AttrSlice(node), kXlaClusterAttr, &cluster).ok()) {
+        CHECK(!cluster.empty());
+        ids[node.name()] = cluster;
+      }
+    }
+    return ids;
+  }
+};
+
+TEST_F(XlaFusionOptimizerTest, Chains) {
+  GraphDef graph;
+  {
+    GraphDefBuilder builder(GraphDefBuilder::kFailImmediately);
+    Node* a =
+        ops::SourceOp("UncompilableNullary", builder.opts().WithName("A"));
+    Node* b = ops::UnaryOp("Relu", a, builder.opts().WithName("B"));
+    Node* c = ops::UnaryOp("Relu", b, builder.opts().WithName("C"));
+    Node* d =
+        ops::UnaryOp("UncompilableUnary", c, builder.opts().WithName("D"));
+    Node* e = ops::UnaryOp("Relu", d, builder.opts().WithName("E"));
+    ops::UnaryOp("Relu", e, builder.opts().WithName("F"));
+    TF_ASSERT_OK(builder.ToGraphDef(&graph));
+  }
+  grappler::GrapplerItem item;
+  item.graph = graph;
+
+  XlaFusionOptimizer optimizer;
+  GraphDef output;
+  TF_ASSERT_OK(optimizer.Optimize(nullptr, item, &output));
+  auto clusters = GetClusters(output);
+  EXPECT_EQ(4, clusters.size());
+  EXPECT_EQ(clusters["B"], clusters["C"]);
+  EXPECT_EQ(clusters["E"], clusters["F"]);
+  EXPECT_NE(clusters["B"], clusters["E"]);
+  EXPECT_TRUE(clusters.find("A") == clusters.cend());
+  EXPECT_TRUE(clusters.find("D") == clusters.cend());
+}
+
+TEST_F(XlaFusionOptimizerTest, FusableOps) {
+  GraphDef graph;
+  {
+    GraphDefBuilder builder(GraphDefBuilder::kFailImmediately);
+    Node* a = ops::SourceOp(
+        "Placeholder",
+        builder.opts().WithName("A").WithAttr("dtype", tensorflow::DT_FLOAT));
+    Node* b = ops::SourceOp(
+        "Placeholder",
+        builder.opts().WithName("B").WithAttr("dtype", tensorflow::DT_FLOAT));
+
+    Node* c = ops::BinaryOp("Add", a, b, builder.opts().WithName("C"));
+    ops::BinaryOp("MatMul", a, c, builder.opts().WithName("D"));
+    ops::UnaryOp("Abs", c, builder.opts().WithName("E"));
+
+    TF_ASSERT_OK(builder.ToGraphDef(&graph));
+  }
+  grappler::GrapplerItem item;
+  item.graph = graph;
+
+  XlaFusionOptimizer optimizer;
+  GraphDef output;
+  TF_ASSERT_OK(optimizer.Optimize(nullptr, item, &output));
+  auto clusters = GetClusters(output);
+  EXPECT_EQ(2, clusters.size());
+  EXPECT_EQ(clusters["C"], clusters["E"]);
+  EXPECT_TRUE(clusters.find("D") == clusters.cend());
+}
+
+TEST_F(XlaFusionOptimizerTest, IgnoreExplicitXLAAttrs) {
+  GraphDef graph;
+  {
+    GraphDefBuilder builder(GraphDefBuilder::kFailImmediately);
+    Node* a = ops::SourceOp(
+        "Placeholder",
+        builder.opts().WithName("A").WithAttr("dtype", tensorflow::DT_FLOAT));
+    Node* b = ops::SourceOp(
+        "Placeholder",
+        builder.opts().WithName("B").WithAttr("dtype", tensorflow::DT_FLOAT));
+
+    Node* c = ops::BinaryOp(
+        "Add", a, b,
+        builder.opts().WithName("C").WithDevice("/device:XLA_CPU"));
+    ops::BinaryOp("MatMul", a, c, builder.opts().WithName("D"));
+    Node* e = ops::UnaryOp("Abs", c, builder.opts().WithName("E"));
+    ops::UnaryOp("Cos", e,
+                 builder.opts().WithName("F").WithAttr(kXlaCompileAttr, true));
+
+    TF_ASSERT_OK(builder.ToGraphDef(&graph));
+  }
+  grappler::GrapplerItem item;
+  item.graph = graph;
+
+  XlaFusionOptimizer optimizer;
+  GraphDef output;
+  TF_ASSERT_OK(optimizer.Optimize(nullptr, item, &output));
+  auto clusters = GetClusters(output);
+  EXPECT_TRUE(clusters.empty());
+}
+
+TEST_F(XlaFusionOptimizerTest, UncompilableCycles) {
+  GraphDef graph;
+  {
+    GraphDefBuilder builder(GraphDefBuilder::kFailImmediately);
+    Node* a = ops::SourceOp("Const", builder.opts()
+                                         .WithName("A")
+                                         .WithAttr("dtype", DT_FLOAT)
+                                         .WithAttr("value", Tensor()));
+    Node* b =
+        ops::UnaryOp("UncompilableUnary", a, builder.opts().WithName("B"));
+    ops::BinaryOp("Mul", a, b, builder.opts().WithName("C"));
+
+    TF_ASSERT_OK(builder.ToGraphDef(&graph));
+  }
+  grappler::GrapplerItem item;
+  item.graph = graph;
+
+  XlaFusionOptimizer optimizer;
+  GraphDef output;
+  TF_ASSERT_OK(optimizer.Optimize(nullptr, item, &output));
+  auto clusters = GetClusters(output);
+  EXPECT_TRUE(clusters.empty());
+}
+
+TEST_F(XlaFusionOptimizerTest, CompilableCycles) {
+  GraphDef graph;
+  {
+    GraphDefBuilder builder(GraphDefBuilder::kFailImmediately);
+    Node* a = ops::SourceOp("Const", builder.opts()
+                                         .WithName("A")
+                                         .WithAttr("dtype", DT_FLOAT)
+                                         .WithAttr("value", Tensor()));
+    Node* b = ops::UnaryOp("Relu", a, builder.opts().WithName("B"));
+    ops::BinaryOp("Mul", a, b, builder.opts().WithName("C"));
+    TF_ASSERT_OK(builder.ToGraphDef(&graph));
+  }
+  grappler::GrapplerItem item;
+  item.graph = graph;
+
+  XlaFusionOptimizer optimizer;
+  GraphDef output;
+  TF_ASSERT_OK(optimizer.Optimize(nullptr, item, &output));
+  auto clusters = GetClusters(output);
+  EXPECT_EQ(3, clusters.size());
+  EXPECT_EQ(clusters["A"], clusters["B"]);
+  EXPECT_EQ(clusters["A"], clusters["C"]);
+}
+
+}  // namespace
+}  // namespace tensorflow
diff --git a/tensorflow/compiler/jit/xla_gpu_device.cc b/tensorflow/compiler/jit/xla_gpu_device.cc
index 26842fbe5cc110fa9ce7a2767d245484fd67556d..c0d86a28c7698c302e28bab972bb2f847cc00ca4 100644
--- a/tensorflow/compiler/jit/xla_gpu_device.cc
+++ b/tensorflow/compiler/jit/xla_gpu_device.cc
@@ -49,7 +49,8 @@ Status XlaGpuDeviceFactory::CreateDevices(const SessionOptions& options,
       XlaDevice::Create("CUDA", DEVICE_XLA_GPU, 0, DEVICE_GPU_XLA_JIT, options,
                         name_prefix, registration,
                         /*transfer_as_literal=*/false,
-                        /*shape_representation_fn=*/{}, &device);
+                        /*shape_representation_fn=*/{},
+                        /*padded_shape_fn=*/{}, &device);
   if (!status.ok()) {
     // Treat failures as non-fatal; there might not be a GPU in the machine.
     VLOG(1) << "Failed to create XLA_GPU device: " << status;
diff --git a/tensorflow/compiler/jit/xla_interpreter_device.cc b/tensorflow/compiler/jit/xla_interpreter_device.cc
index 4146996f6346446e715ffb225882cfb20359dae1..661187f4a873b03b8d013aa74cb6b6315bb4e2eb 100644
--- a/tensorflow/compiler/jit/xla_interpreter_device.cc
+++ b/tensorflow/compiler/jit/xla_interpreter_device.cc
@@ -48,11 +48,12 @@ Status XlaInterpreterDeviceFactory::CreateDevices(
   registration.compile_resource_ops = true;
 
   std::unique_ptr<XlaDevice> device;
-  TF_RETURN_IF_ERROR(XlaDevice::Create(
-      "Interpreter", DEVICE_XLA_INTERPRETER, 0, DEVICE_INTERPRETER_XLA_JIT,
-      options, name_prefix, registration,
-      /*transfer_as_literal=*/false,
-      /*shape_representation_fn=*/{}, &device));
+  TF_RETURN_IF_ERROR(XlaDevice::Create("Interpreter", DEVICE_XLA_INTERPRETER, 0,
+                                       DEVICE_INTERPRETER_XLA_JIT, options,
+                                       name_prefix, registration,
+                                       /*transfer_as_literal=*/false,
+                                       /*shape_representation_fn=*/{},
+                                       /*padded_shape_fn=*/{}, &device));
   devices->push_back(device.release());
   return Status::OK();
 }
diff --git a/tensorflow/compiler/jit/xla_tensor.cc b/tensorflow/compiler/jit/xla_tensor.cc
index a7211c9c7e281a8141d5671b345c628441b2359d..3c44c4ae6df7f3e2d60d8933561c0c71888e8c3f 100644
--- a/tensorflow/compiler/jit/xla_tensor.cc
+++ b/tensorflow/compiler/jit/xla_tensor.cc
@@ -18,7 +18,7 @@ limitations under the License.
 
 namespace tensorflow {
 
-/*static*/ XlaTensor* XlaTensor::FromTensor(Tensor* tensor) {
+/*static*/ XlaTensor* XlaTensor::FromTensor(const Tensor* tensor) {
   if (tensor->NumElements() == 0) {
     return nullptr;
   }
@@ -27,8 +27,8 @@ namespace tensorflow {
   return xla_tensor;
 }
 
-/*static*/ const XlaTensor* XlaTensor::FromTensor(const Tensor* tensor) {
-  return FromTensor(const_cast<Tensor*>(tensor));
+/*static*/ bool XlaTensor::RefCountIsOne(const Tensor& tensor) {
+  return tensor.RefCountIsOne();
 }
 
 /*static*/ se::DeviceMemoryBase XlaTensor::DeviceMemoryFromTensor(
@@ -67,6 +67,8 @@ Status XlaTensor::AllocateShapedBuffer(DataType dtype, const TensorShape& shape,
     index_to_buffer.second = buffer.Forget();
   }
 
+  VLOG(4) << shaped_buffer.ToString();
+
   set_shaped_buffer(std::move(shaped_buffer));
   return Status::OK();
 }
diff --git a/tensorflow/compiler/jit/xla_tensor.h b/tensorflow/compiler/jit/xla_tensor.h
index 6b29c82ec11e39ad525663991e179443c2b6dca7..c54001a999998f45c0cdacd752ca4036f0792857 100644
--- a/tensorflow/compiler/jit/xla_tensor.h
+++ b/tensorflow/compiler/jit/xla_tensor.h
@@ -34,10 +34,9 @@ class XlaTensor {
  public:
   // Downcast from a Tensor to an XlaTensor. Return nullptr if the downcast
   // fails.
-  static XlaTensor* FromTensor(Tensor* tensor);
-  // Downcast from a Tensor to an XlaTensor. Return nullptr if the downcast
-  // fails.
-  static const XlaTensor* FromTensor(const Tensor* tensor);
+  static XlaTensor* FromTensor(const Tensor* tensor);
+
+  static bool RefCountIsOne(const Tensor& tensor);
 
   // Create a DeviceMemoryBase from a Tensor. The Tensor can be an XlaTensor, in
   // which case the returned value is shaped_buffer()->root_buffer(), or a
@@ -62,6 +61,10 @@ class XlaTensor {
     CHECK(has_shaped_buffer());
     return *shaped_buffer_;
   }
+  xla::ShapedBuffer& shaped_buffer() {
+    CHECK(has_shaped_buffer());
+    return *shaped_buffer_;
+  }
   // Mutates the XlaTensor to set the ShapedBuffer.
   void set_shaped_buffer(xla::ScopedShapedBuffer shaped_buffer) {
     shaped_buffer_ =
diff --git a/tensorflow/compiler/tests/BUILD b/tensorflow/compiler/tests/BUILD
index 4c291d2383163e5def54657186c2190c023832fc..e6c92f9720e1285617280f60d1c5fea443c5ebef 100644
--- a/tensorflow/compiler/tests/BUILD
+++ b/tensorflow/compiler/tests/BUILD
@@ -120,6 +120,19 @@ tf_xla_py_test(
     ],
 )
 
+tf_xla_py_test(
+    name = "bucketize_op_test",
+    size = "small",
+    srcs = ["bucketize_op_test.py"],
+    deps = [
+        ":xla_test",
+        "//tensorflow/python:array_ops",
+        "//tensorflow/python:framework",
+        "//tensorflow/python:math_ops",
+        "//tensorflow/python:platform_test",
+    ],
+)
+
 tf_xla_py_test(
     name = "categorical_op_test",
     size = "small",
@@ -532,7 +545,9 @@ tf_xla_py_test(
     ],
     deps = [
         ":xla_test",
+        "//tensorflow/python:array_ops",
         "//tensorflow/python:framework",
+        "//tensorflow/python:math_ops",
         "//tensorflow/python:platform_test",
         "//tensorflow/python:random_ops",
     ],
diff --git a/tensorflow/compiler/tests/bucketize_op_test.py b/tensorflow/compiler/tests/bucketize_op_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..fde9759a1c209844caac99d5f303cd3e406e5370
--- /dev/null
+++ b/tensorflow/compiler/tests/bucketize_op_test.py
@@ -0,0 +1,78 @@
+# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for bucketize_op."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.compiler.tests.xla_test import XLATestCase
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import errors_impl
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import math_ops
+from tensorflow.python.platform import test
+
+
+class BucketizationOpTest(XLATestCase):
+
+  def testInt(self):
+    with self.test_session() as sess:
+      p = array_ops.placeholder(dtypes.int32)
+      with self.test_scope():
+        op = math_ops._bucketize(p, boundaries=[0, 3, 8, 11])
+      expected_out = [0, 1, 1, 2, 2, 3, 3, 4, 4]
+      self.assertAllEqual(expected_out,
+                          sess.run(op, {p: [-5, 0, 2, 3, 5, 8, 10, 11, 12]}))
+
+  def testFloat(self):
+    with self.test_session() as sess:
+      p = array_ops.placeholder(dtypes.float32)
+      with self.test_scope():
+        op = math_ops._bucketize(p, boundaries=[0., 3., 8., 11.])
+      expected_out = [0, 1, 1, 2, 2, 3, 3, 4, 4]
+      self.assertAllEqual(
+          expected_out,
+          sess.run(op, {p: [-5., 0., 2., 3., 5., 8., 10., 11., 12.]}))
+
+  def test2DInput(self):
+    with self.test_session() as sess:
+      p = array_ops.placeholder(dtypes.float32)
+      with self.test_scope():
+        op = math_ops._bucketize(p, boundaries=[0, 3, 8, 11])
+      expected_out = [[0, 1, 1, 2, 2], [3, 3, 4, 4, 1]]
+      self.assertAllEqual(
+          expected_out, sess.run(op,
+                                 {p: [[-5, 0, 2, 3, 5], [8, 10, 11, 12, 0]]}))
+
+  def testInvalidBoundariesOrder(self):
+    with self.test_session() as sess:
+      p = array_ops.placeholder(dtypes.int32)
+      with self.test_scope():
+        op = math_ops._bucketize(p, boundaries=[0, 8, 3, 11])
+      with self.assertRaisesRegexp(errors_impl.InvalidArgumentError,
+                                   "Expected sorted boundaries"):
+        sess.run(op, {p: [-5, 0]})
+
+  def testBoundariesNotList(self):
+    with self.test_session():
+      with self.assertRaisesRegexp(TypeError, "Expected list.*"):
+        p = array_ops.placeholder(dtypes.int32)
+        with self.test_scope():
+          math_ops._bucketize(p, boundaries=0)
+
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/compiler/tests/depthwise_conv_op_test.py b/tensorflow/compiler/tests/depthwise_conv_op_test.py
index 0a0d335ca76dd7ec7ca3b12f9e8a83b596daa07e..03d96a2cd8ab22a472a67f092e36224820405fa8 100644
--- a/tensorflow/compiler/tests/depthwise_conv_op_test.py
+++ b/tensorflow/compiler/tests/depthwise_conv_op_test.py
@@ -153,7 +153,7 @@ class DepthwiseConv2DTest(XLATestCase):
                   dtype=data_type).reshape(filter_in_sizes)
     with self.test_session() as sess:
       if data_type == np.float32:
-        tolerance = 1e-5
+        tolerance = 1e-4
       else:
         self.assertEqual(data_type, np.float64)
         tolerance = 1e-8
@@ -339,7 +339,7 @@ class DepthwiseConv2DTest(XLATestCase):
 
     gpu_value = _GetVal(use_xla=True)
     cpu_value = _GetVal(use_xla=False)
-    self.assertAllClose(cpu_value, gpu_value, rtol=1e-4, atol=1e-4)
+    self.assertAllClose(cpu_value, gpu_value, rtol=1e-3, atol=1e-3)
 
   def testDepthwiseConv2DInputGradCompare(self):
     for index, (input_size, filter_size, output_size, stride,
diff --git a/tensorflow/compiler/tests/eager_test.py b/tensorflow/compiler/tests/eager_test.py
index 52d8d6d295c428f2c3466ef2963223cc978b4277..a4154ad1e846f8241a2ab6598da36ccb6b3b653e 100644
--- a/tensorflow/compiler/tests/eager_test.py
+++ b/tensorflow/compiler/tests/eager_test.py
@@ -31,11 +31,13 @@ from tensorflow.python.framework import ops
 from tensorflow.python.layers import convolutional
 from tensorflow.python.layers import pooling
 from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import embedding_ops
 from tensorflow.python.ops import init_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import nn_ops
 from tensorflow.python.ops import resource_variable_ops
 from tensorflow.python.platform import googletest
+from tensorflow.python.training import adam
 
 
 class EagerTest(XLATestCase):
@@ -117,6 +119,15 @@ class EagerTest(XLATestCase):
       v.assign_add(2.0)
     self.assertEqual(3.0, v.numpy())
 
+  def testReadAssignRead(self):
+    with self.test_scope():
+      v = resource_variable_ops.ResourceVariable(1.0)
+      val1 = v.read_value()
+      v.assign_add(2.0)
+      val2 = v.read_value()
+    self.assertEqual(1.0, val1.numpy())
+    self.assertEqual(3.0, val2.numpy())
+
   def testGradient(self):
     def f(x):
       return x
@@ -136,6 +147,129 @@ class EagerTest(XLATestCase):
       grads = backprop.implicit_grad(f)()
     self.assertEqual(2., grads[0][0].numpy())
 
+  def testMultipleVariableReads(self):
+    # This test makes sure consecutive variable reads don't copy
+    # the underlying memory.
+    with self.test_scope():
+      # Create 128MiB variables
+      var = resource_variable_ops.ResourceVariable(
+          array_ops.ones([32, 1024, 1024]))
+
+      # Read the same variable 100 times. If the underlying tensor
+      # is not copied, this is a trivial operation. If it is copied,
+      # this will eat over 13GB and OOM.
+      values = []
+      for _ in range(100):
+        values.append(var.value())
+
+  # The shape, shape_n, size, and rank are tested here because their
+  # execution kernels (as opposed to compilation only tf2xla kernels)
+  # are distincts from tf2xla kernels.
+
+  def testShape(self):
+    def const(value):
+      return array_ops.shape(
+          constant_op.constant(value)).numpy()
+
+    def ones(value):
+      return array_ops.shape(
+          array_ops.ones(value)).numpy()
+
+    with self.test_scope():
+      # Shapes of directly constructed tensors
+      self.assertAllEqual([], const(3))
+      self.assertAllEqual([3], const([1.0, 2.0, 3.0]))
+      self.assertAllEqual([2, 2], const([[1.0, 2.0], [3.0, 4.0]]))
+      self.assertAllEqual([2, 1, 2], const([[[1.0, 2.0]], [[3.0, 4.0]]]))
+
+      # Shapes of tensors created by op running on device
+      # We make this distinction because directly constructed tensors
+      # are treated differently in a few places that can influence shape:
+      #  - they always have on_host_tensor
+      #  - they and their shapes can be cached
+      #  - they end up on device via a copy, instead of as program output
+      self.assertAllEqual([], ones([]))
+      self.assertAllEqual([3], ones([3]))
+      self.assertAllEqual([2, 2], ones([2, 2]))
+      self.assertAllEqual([2, 1, 2], ones([2, 1, 2]))
+
+  def testShapeN(self):
+    with self.test_scope():
+      # Shapes of directly constructed tensors
+      shapes = array_ops.shape_n([
+          constant_op.constant(1.0),
+          constant_op.constant([1.0, 2.0, 3.0]),
+          constant_op.constant([[1.0, 2.0], [3.0, 4.0]])])
+      self.assertAllEqual(
+          [[], [3], [2, 2]],
+          [x.numpy().tolist() for x in shapes])
+
+      # Shapes of tensors created by op running on device
+      shapes = array_ops.shape_n([
+          array_ops.ones([]),
+          array_ops.ones([3]),
+          array_ops.ones([2, 2])])
+      self.assertAllEqual(
+          [[], [3], [2, 2]],
+          [x.numpy().tolist() for x in shapes])
+
+  def testSize(self):
+    with self.test_scope():
+      self.assertEqual(
+          1, array_ops.size(constant_op.constant(1.0)).numpy())
+      self.assertEqual(
+          3, array_ops.size(constant_op.constant([1.0, 2.0, 3.0])).numpy())
+      self.assertEqual(
+          4, array_ops.size(
+              constant_op.constant([[1.0, 2.0], [3.0, 4.0]])).numpy())
+
+  def testRank(self):
+    with self.test_scope():
+      self.assertEqual(
+          0, array_ops.rank(constant_op.constant(1.0)).numpy())
+      self.assertEqual(
+          1, array_ops.rank(constant_op.constant([1.0, 2.0, 3.0])).numpy())
+      self.assertEqual(
+          2, array_ops.rank(
+              constant_op.constant([[1.0, 2.0], [3.0, 4.0]])).numpy())
+
+  def testAdam(self):
+    with self.test_scope():
+      optimizer = adam.AdamOptimizer(0.1)
+      x = resource_variable_ops.ResourceVariable(10.0)
+      with backprop.GradientTape() as tape:
+        y = x * x
+      dy_dx = tape.gradient(y, x)
+      optimizer.apply_gradients([(dy_dx, x)])
+      self.assertAlmostEqual(9.9, x.numpy(), places=3)
+
+  def testAdamSparse(self):
+    with ops.device('/cpu:0'):
+      # Create 2-D embedding for 3 objects on CPU because sparse/sliced updates
+      # are not implemented on TPU.
+      embedding_matrix = resource_variable_ops.ResourceVariable(
+          array_ops.ones([3, 2]))
+
+    with self.test_scope():
+      with backprop.GradientTape() as tape:
+        embedding = embedding_ops.embedding_lookup(embedding_matrix, [1])
+        y = math_ops.reduce_sum(embedding)
+      dy_dx = tape.gradient(y, embedding_matrix)
+      self.assertIsInstance(dy_dx, ops.IndexedSlices)
+      optimizer = adam.AdamOptimizer(0.1)
+      # The gradient application operations will run on CPU because optimizer
+      # updates are always collocated with the variable.
+      optimizer.apply_gradients([(dy_dx, embedding_matrix)])
+
+      # This assign_add will run on CPU because when an input to an
+      # operation is a resource, this operation is placed on the resource's
+      # device by the eager runtime.
+      embedding_matrix.assign_add(array_ops.ones([3, 2]))
+
+    self.assertAllClose([[2.0, 2.0],
+                         [1.9, 1.9],
+                         [2.0, 2.0]], embedding_matrix.numpy())
+
 
 class EagerFunctionTest(XLATestCase):
 
diff --git a/tensorflow/compiler/tests/image_ops_test.py b/tensorflow/compiler/tests/image_ops_test.py
index 42e637734c578fcc70473060cb156e172a0a1995..7cf953ef25ef5daf8a6d4fc9985ed8dbfb2081e5 100644
--- a/tensorflow/compiler/tests/image_ops_test.py
+++ b/tensorflow/compiler/tests/image_ops_test.py
@@ -65,9 +65,7 @@ class RGBToHSVTest(XLATestCase):
         join1 = array_ops.stack(split1)
         join2 = array_ops.stack(split2)
         batch1, batch2, join1, join2 = sess.run([batch1, batch2, join1, join2],
-                                                {
-                                                    batch0: inp
-                                                })
+                                                {batch0: inp})
 
       # Verify that processing batch elements together is the same as separate
       self.assertAllClose(batch1, join1)
@@ -401,9 +399,7 @@ class AdjustSaturationTest(XLATestCase):
           x = array_ops.placeholder(dtypes.float32, shape=x_shape)
           with self.test_scope():
             y_fused = self._adjust_saturation(x,
-                                              scale).eval(feed_dict={
-                                                  x: x_np
-                                              })
+                                              scale).eval(feed_dict={x: x_np})
           self.assertAllClose(y_fused, y_baseline, rtol=2e-5, atol=1e-5)
 
 
@@ -412,7 +408,8 @@ class ResizeBilinearTest(XLATestCase):
   def _assertForwardOpMatchesExpected(self,
                                       image_np,
                                       target_shape,
-                                      expected=None):
+                                      expected=None,
+                                      large_tolerance=False):
     if expected is None:
       self.fail("expected must be specified")
     with self.test_session() as sess, self.test_scope():
@@ -420,7 +417,11 @@ class ResizeBilinearTest(XLATestCase):
       resized = gen_image_ops.resize_bilinear(
           image, target_shape, align_corners=True)
       out = sess.run(resized, {image: image_np[np.newaxis, :, :, np.newaxis]})
-      self.assertAllClose(expected[np.newaxis, :, :, np.newaxis], out)
+      if large_tolerance:
+        self.assertAllClose(
+            expected[np.newaxis, :, :, np.newaxis], out, rtol=0.03, atol=0.1)
+      else:
+        self.assertAllClose(expected[np.newaxis, :, :, np.newaxis], out)
 
   def _assertBackwardOpMatchesExpected(self,
                                        grads_np,
@@ -555,6 +556,28 @@ class ResizeBilinearTest(XLATestCase):
               [[12.5, 27.5, 21.875], [42.5, 80.0, 57.5], [40.625, 72.5, 50]],
               dtype=np.float32))
 
+  def testAlignCorners4x4To8x8(self):
+    self._assertForwardOpMatchesExpected(
+        (np.array([[0, 1, 2, 3]], dtype=np.float32) + np.array(
+            [[0], [1], [2], [3]], dtype=np.float32)) * 7.0, [8, 8],
+        expected=3 *
+        (np.array([[0, 1, 2, 3, 4, 5, 6, 7]], dtype=np.float32) + np.array(
+            [[0], [1], [2], [3], [4], [5], [6], [7]], dtype=np.float32)),
+        large_tolerance=True)
+
+  def testAlignCorners8x8To16x16(self):
+    self._assertForwardOpMatchesExpected(
+        (np.array([[0, 1, 2, 3, 4, 5, 6, 7]], dtype=np.float32) + np.array(
+            [[0], [1], [2], [3], [4], [5], [6], [7]], dtype=np.float32)) * 15.0,
+        [16, 16],
+        expected=7 * (np.array(
+            [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]],
+            dtype=np.float32) + np.array(
+                [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11],
+                 [12], [13], [14], [15]],
+                dtype=np.float32)),
+        large_tolerance=True)
+
 
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/compiler/tests/jit_test.py b/tensorflow/compiler/tests/jit_test.py
index 4b0043b6b4c7fbf57ec1507b84adf18daaea9363..6e0db54b7a74b284dc7d18bcbb07c178c664c1e5 100644
--- a/tensorflow/compiler/tests/jit_test.py
+++ b/tensorflow/compiler/tests/jit_test.py
@@ -125,7 +125,7 @@ class JitLaunchTest(test.TestCase):
           for (x, y) in zip(compiled, direct):
             self.assertAllClose(x, y, rtol=1e-1)
         else:
-          self.assertAllClose(compiled, direct)
+          self.assertAllClose(compiled, direct, rtol=1e-2)
 
   def testNoOutputs(self):
     with session_lib.Session() as sess:
diff --git a/tensorflow/compiler/tests/random_ops_test.py b/tensorflow/compiler/tests/random_ops_test.py
index d6c93088d4efff7d8306e262a79ae49d3d8ac722..f13dff96203b5480480c2a2fc9ac38ca78b7f78a 100644
--- a/tensorflow/compiler/tests/random_ops_test.py
+++ b/tensorflow/compiler/tests/random_ops_test.py
@@ -22,6 +22,8 @@ import numpy as np
 
 from tensorflow.compiler.tests.xla_test import XLATestCase
 from tensorflow.python.framework import dtypes
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import random_ops
 from tensorflow.python.platform import googletest
 
@@ -47,18 +49,18 @@ class RandomOpsTest(XLATestCase):
       # We use exact equality here. If the random-number generator is producing
       # deterministic output, all three outputs will be bitwise identical.
       self.assertTrue((not np.array_equal(y, z)) or
-                      (not np.array_equal(z, w)) or
-                      (not np.array_equal(y, w)))
+                      (not np.array_equal(z, w)) or (not np.array_equal(y, w)))
 
   def testRandomUniformIsNotConstant(self):
+
     def rng(dtype):
-      return random_ops.random_uniform(shape=[2], dtype=dtype,
-                                       maxval=1000000)
+      return random_ops.random_uniform(shape=[2], dtype=dtype, maxval=1000000)
 
     for dtype in self._random_types():
       self._testRngIsNotConstant(rng, dtype)
 
   def testRandomNormalIsNotConstant(self):
+
     def rng(dtype):
       return random_ops.random_normal(shape=[2], dtype=dtype)
 
@@ -70,12 +72,20 @@ class RandomOpsTest(XLATestCase):
     for dtype in self._random_types():
       with self.test_session() as sess:
         with self.test_scope():
-          x = random_ops.random_uniform(shape=[1000], dtype=dtype, minval=-2,
-                                        maxval=33)
+          x = random_ops.random_uniform(
+              shape=[1000], dtype=dtype, minval=-2, maxval=33)
         y = sess.run(x)
         self.assertTrue((y >= -2).sum() == 1000)
         self.assertTrue((y < 33).sum() == 1000)
 
+  def testTruncatedNormalIsNotConstant(self):
+
+    def rng(dtype):
+      return random_ops.truncated_normal(shape=[2], dtype=dtype)
+
+    # TODO(b/34339814): implement inverse erf support for non-F32 types.
+    self._testRngIsNotConstant(rng, dtypes.float32)
+
   def testTruncatedNormalIsInRange(self):
     count = 10000
     # TODO(b/34339814): implement inverse erf support for non-F32 types.
@@ -87,6 +97,29 @@ class RandomOpsTest(XLATestCase):
         self.assertTrue((y >= -2).sum() == count)
         self.assertTrue((y <= 2).sum() == count)
 
+  def testShuffle1d(self):
+    with self.test_session() as sess:
+      with self.test_scope():
+        x = math_ops.range(20)
+        shuffle = random_ops.random_shuffle(x)
+      result = sess.run(shuffle)
+      expected = range(20)
+      # Compare sets to avoid randomness behavior changes but make sure still
+      # have all the values.
+      self.assertAllEqual(set(result), set(expected))
+
+  def testShuffle2d(self):
+    with self.test_session() as sess:
+      with self.test_scope():
+        x = array_ops.diag(math_ops.range(20))
+        shuffle = random_ops.random_shuffle(x)
+      result = sess.run(shuffle)
+      expected = np.diag(range(20)).flatten()
+      # Compare sets to avoid randomness behavior changes but make sure still
+      # have all the values.
+      self.assertAllEqual(len(result.flatten()), len(expected))
+      self.assertAllEqual(set(result.flatten()), set(expected))
+
 
 if __name__ == '__main__':
   googletest.main()
diff --git a/tensorflow/compiler/tests/variable_ops_test.py b/tensorflow/compiler/tests/variable_ops_test.py
index 8ecad00f6e23b3a7746bbb473102ac847bf4cbfd..2c09b03d5a35cde2c42d8a145781270c0c908587 100644
--- a/tensorflow/compiler/tests/variable_ops_test.py
+++ b/tensorflow/compiler/tests/variable_ops_test.py
@@ -187,6 +187,25 @@ class VariableOpsTest(XLATestCase):
           rtol=1e-4)
       self.assertAllClose(np.array([1.9, 2.9], dtype=np.float32), vb, rtol=1e-4)
 
+  def testWriteOfAliasedTensor(self):
+    for dtype in self.numeric_types:
+      init = np.array([[1, 2j], [3, 4]]).astype(dtype)
+      update = np.array([[7, 1j], [2, 11]]).astype(dtype)
+      with self.test_session() as sess, self.test_scope():
+        v = resource_variable_ops.ResourceVariable(init)
+        sess.run(variables.variables_initializer([v]))
+        p = array_ops.placeholder(dtype)
+        q = array_ops.identity(p)
+        x = v.read_value()
+        # Writes the value of 'p' to 'v', but keeps a reference to the original
+        # value of 'v' so the variable update cannot reuse its buffer.
+        with ops.control_dependencies([x]):
+          y = v.assign(q)
+        result = sess.run([x, y, q], {p: update})
+        self.assertAllClose(init, result[0])
+        self.assertAllClose(update, result[1])
+        self.assertAllClose(update, result[2])
+
 
 class StridedSliceAssignChecker(object):
   """Compares the results of a slice assignment using Tensorflow and numpy."""
diff --git a/tensorflow/compiler/tests/xla_device_test.py b/tensorflow/compiler/tests/xla_device_test.py
index b707bd0963d71d7c4b43b8d42752b4c50e9bbf7c..f0b010fa67f2ffb3f81fd14d4d89585f716b4890 100644
--- a/tensorflow/compiler/tests/xla_device_test.py
+++ b/tensorflow/compiler/tests/xla_device_test.py
@@ -23,6 +23,7 @@ import numpy as np
 from tensorflow.compiler.tests.xla_test import XLATestCase
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import gen_control_flow_ops
 from tensorflow.python.platform import test
 
 
@@ -46,6 +47,12 @@ class XlaDeviceTest(XLATestCase):
           result = sess.run(z, {x: inputs})
         self.assertAllCloseAccordingToType(result, inputs + inputs)
 
+  def testControlTrigger(self):
+    with self.test_session() as sess:
+      with self.test_scope():
+        x = gen_control_flow_ops.control_trigger()
+      sess.run(x)
+
 
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/compiler/tf2xla/functionalize_control_flow.cc b/tensorflow/compiler/tf2xla/functionalize_control_flow.cc
index 42585ad4d8a17d71146e48b69f9fa56f9ff24c3e..1438f6b48c4913e60b0c0a9f5c3d67fe595cbfe8 100644
--- a/tensorflow/compiler/tf2xla/functionalize_control_flow.cc
+++ b/tensorflow/compiler/tf2xla/functionalize_control_flow.cc
@@ -1438,7 +1438,13 @@ Status FunctionalizeControlFlow(const FunctionLibraryDefinition* lookup_library,
   // connected to all source nodes in the graph. Many graphs violate this
   // invariant.
   std::vector<ControlFlowInfo> cf_info;
-  TF_RETURN_IF_ERROR(BuildControlFlowInfo(graph, &cf_info));
+  std::vector<string> unreachable_nodes;
+  TF_RETURN_IF_ERROR(BuildControlFlowInfo(graph, &cf_info, &unreachable_nodes));
+  if (!unreachable_nodes.empty()) {
+    return errors::InvalidArgument(
+        "The following nodes are unreachable from the source in the graph: ",
+        tensorflow::str_util::Join(unreachable_nodes, ", "));
+  }
 
   // Builds Frames, indexed by name.
   std::unordered_map<string, Frame> frames;
diff --git a/tensorflow/compiler/tf2xla/kernels/BUILD b/tensorflow/compiler/tf2xla/kernels/BUILD
index e6da157c111ad9167bf7b1e743d9afbb8fb2ad03..edd2ab6301ee891c433639ce300cde0c72929cea 100644
--- a/tensorflow/compiler/tf2xla/kernels/BUILD
+++ b/tensorflow/compiler/tf2xla/kernels/BUILD
@@ -18,6 +18,7 @@ tf_kernel_library(
         "bcast_ops.cc",
         "bias_ops.cc",
         "binary_ops.cc",
+        "bucketize_op.cc",
         "cast_op.cc",
         "categorical_op.cc",
         "cholesky_op.cc",
diff --git a/tensorflow/compiler/tf2xla/kernels/bucketize_op.cc b/tensorflow/compiler/tf2xla/kernels/bucketize_op.cc
new file mode 100644
index 0000000000000000000000000000000000000000..ca9a6b40688d1e8496d1b823e20d273d519f65e8
--- /dev/null
+++ b/tensorflow/compiler/tf2xla/kernels/bucketize_op.cc
@@ -0,0 +1,67 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include <algorithm>
+
+#include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
+#include "tensorflow/compiler/tf2xla/xla_op_registry.h"
+#include "tensorflow/compiler/xla/client/lib/arithmetic.h"
+#include "tensorflow/core/framework/op_kernel.h"
+
+namespace tensorflow {
+namespace {
+
+class BucketizeOp : public XlaOpKernel {
+ public:
+  explicit BucketizeOp(OpKernelConstruction* context) : XlaOpKernel(context) {
+    OP_REQUIRES_OK(context, context->GetAttr("boundaries", &boundaries_));
+    OP_REQUIRES(context, std::is_sorted(boundaries_.begin(), boundaries_.end()),
+                errors::InvalidArgument("Expected sorted boundaries"));
+  }
+
+  void Compile(XlaOpKernelContext* context) override {
+    xla::XlaBuilder* builder = context->builder();
+    const DataType dtype = context->input_type(0);
+    xla::XlaOp input = context->Input(0);
+
+    xla::XlaOp boundaries = builder->ConstantR1<float>(boundaries_);
+    // TODO(phawkins): the following behavior matches the behavior of the core
+    // Bucketize kernel. However, comparing an int32 or int64 against float may
+    // lead to inaccurate bucketing due to rounding.
+    if (dtype == DT_DOUBLE) {
+      input = builder->ConvertElementType(input, xla::F64);
+      boundaries = builder->ConvertElementType(boundaries, xla::F64);
+    } else {
+      input = builder->ConvertElementType(input, xla::F32);
+    }
+    xla::XlaOp comparison = builder->ConvertElementType(
+        builder->Ge(builder->Broadcast(input, {1}), boundaries,
+                    /*broadcast_dimensions=*/{0}),
+        xla::S32);
+    xla::XlaOp buckets = builder->Reduce(
+        comparison, /*init_value=*/builder->ConstantR0<int32>(0),
+        /*computation=*/xla::CreateScalarAddComputation(xla::S32, builder),
+        /*dimensions_to_reduce=*/{0});
+    context->SetOutput(0, buckets);
+  }
+
+ private:
+  std::vector<float> boundaries_;
+};
+
+REGISTER_XLA_OP(Name("Bucketize"), BucketizeOp);
+
+}  // namespace
+}  // namespace tensorflow
diff --git a/tensorflow/compiler/tf2xla/kernels/if_op.cc b/tensorflow/compiler/tf2xla/kernels/if_op.cc
index 8b9b026643cf35216a2082dfcce9270c017bd14f..d48c6eea754f75a8879d3938f233a6a591d26d0d 100644
--- a/tensorflow/compiler/tf2xla/kernels/if_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/if_op.cc
@@ -48,11 +48,11 @@ void XlaIfOp::Compile(XlaOpKernelContext* ctx) {
 
   VLOG(1) << "Building If: " << input_types_.size() << " inputs";
 
-  std::vector<xla::XlaOp> inputs(input_types_.size());
   std::vector<XlaCompiler::Argument> arguments(input_types_.size());
   for (int i = 0; i < input_types_.size(); ++i) {
     XlaCompiler::Argument& arg = arguments[i];
     DataType type = ctx->input_type(i + 1);
+
     if (type == DT_RESOURCE) {
       XlaResource* resource;
       OP_REQUIRES_OK(ctx, ctx->GetResourceInput(i + 1, &resource));
@@ -60,7 +60,6 @@ void XlaIfOp::Compile(XlaOpKernelContext* ctx) {
       arg.initialized = resource->initialized();
       arg.kind = XlaCompiler::Argument::kResource;
       arg.resource_kind = resource->kind();
-      OP_REQUIRES_OK(ctx, resource->Pack(&inputs[i], b));
 
       arg.type = resource->type();
       arg.shape = resource->shape();
@@ -79,7 +78,6 @@ void XlaIfOp::Compile(XlaOpKernelContext* ctx) {
       arg.kind = XlaCompiler::Argument::kParameter;
       arg.type = input_types_[i];
       arg.shape = ctx->InputShape(i + 1);
-      inputs[i] = ctx->Input(i + 1);
       VLOG(2) << "Arg type: " << DataTypeString(arg.type)
               << " shape: " << arg.shape.DebugString();
     }
@@ -100,6 +98,7 @@ void XlaIfOp::Compile(XlaOpKernelContext* ctx) {
   OP_REQUIRES_OK(ctx, compiler->CompileFunction(options, else_branch_,
                                                 arguments, &else_result));
 
+  bool has_tensor_array_gradients = false;
   for (XlaCompiler::CompilationResult* result : {&then_result, &else_result}) {
     for (const XlaCompiler::ResourceUpdate& update : result->resource_updates) {
       XlaResource* resource;
@@ -121,9 +120,21 @@ void XlaIfOp::Compile(XlaOpKernelContext* ctx) {
       for (const auto& gradient : resource->tensor_array_gradients()) {
         arg.tensor_array_gradients.insert(gradient.first);
       }
+      if (!resource->tensor_array_gradients().empty())
+        has_tensor_array_gradients = true;
     }
   }
 
+  // Recompile the functions to update the argument shapes for tensor arrays.
+  if (has_tensor_array_gradients) {
+    then_result = {};
+    OP_REQUIRES_OK(ctx, compiler->CompileFunction(options, then_branch_,
+                                                  arguments, &then_result));
+    else_result = {};
+    OP_REQUIRES_OK(ctx, compiler->CompileFunction(options, else_branch_,
+                                                  arguments, &else_result));
+  }
+
   // Check that both branches have identical input shapes.
   OP_REQUIRES(ctx, then_result.xla_input_shapes.size() == 1,
               errors::FailedPrecondition("Expected one input shape"));
@@ -175,6 +186,19 @@ void XlaIfOp::Compile(XlaOpKernelContext* ctx) {
             "Mismatch in resource of then and else branch for resource ", i));
   }
 
+  int num_inputs = then_result.input_mapping.size();
+  std::vector<xla::XlaOp> inputs(num_inputs);
+  for (int i = 0; i < num_inputs; ++i) {
+    int input_num = then_result.input_mapping[i] + 1;
+    if (ctx->input_type(input_num) == DT_RESOURCE) {
+      XlaResource* resource;
+      OP_REQUIRES_OK(ctx, ctx->GetResourceInput(input_num, &resource));
+      OP_REQUIRES_OK(ctx, resource->Pack(&inputs[i], b));
+    } else {
+      inputs[i] = ctx->Input(i + 1);
+    }
+  }
+
   xla::XlaOp outputs =
       b->Conditional(ctx->Input(0), b->Tuple(inputs), *then_result.computation,
                      b->Tuple(inputs), *else_result.computation);
diff --git a/tensorflow/compiler/tf2xla/kernels/image_resize_ops.cc b/tensorflow/compiler/tf2xla/kernels/image_resize_ops.cc
index 9058cbc74762576c7e6f8ec1b2b0f6b247ac0502..79d3a6979cec4c6bda92a71dcff4ddd2151367d5 100644
--- a/tensorflow/compiler/tf2xla/kernels/image_resize_ops.cc
+++ b/tensorflow/compiler/tf2xla/kernels/image_resize_ops.cc
@@ -99,27 +99,34 @@ ResizeConvolutionDims ComputeResizeConvolutionParameters(
   return dims;
 }
 
+// Form a 2D convolution kernel like:
+//       1 2 3 2 1
+//       2 4 6 4 2
+// 1/9 * 3 6 9 6 3
+//       2 4 6 4 2
+//       1 2 3 2 1
+// by multiplying two 1D kernels of the form:
+// 1/3 * [1 2 3 2 1]
+// If the 2D kernel would be very large, the 1D kernel can be applied once in
+// each dimension due to the symmetry of the kernel along all axis to reduce the
+// computational intensity.
+std::vector<float> Make1DKernel(int64 n) {
+  std::vector<float> kernel(n * 2 - 1);
+  for (int64 i = 0; i < n; ++i) {
+    float v = (i + 1.0f) / n;
+    kernel[i] = v;
+    kernel[n * 2 - 2 - i] = v;
+  }
+  return kernel;
+}
+
+// Kernels with more than 16 spatial elements are considered intense and the
+// kernel should applied to each dimension independently.
+const int64 kMax2DKernelSize = 16;
+
 xla::XlaOp MakeBilinearResizeKernel(xla::XlaBuilder* builder,
                                     gtl::ArraySlice<int64> kernel_size,
                                     int64 channels) {
-  // Form a 2D convolution kernel like:
-  //       1 2 3 2 1
-  //       2 4 6 4 2
-  // 1/9 * 3 6 9 6 3
-  //       2 4 6 4 2
-  //       1 2 3 2 1
-  // by multiplying two 1D kernels of the form:
-  // 1/3 * [1 2 3 2 1]
-  auto make_1d_kernel = [](int64 n) {
-    std::vector<float> kernel(n * 2 - 1);
-    for (int64 i = 0; i < n; ++i) {
-      float v = (i + 1.0f) / n;
-      kernel[i] = v;
-      kernel[n * 2 - 2 - i] = v;
-    }
-    return kernel;
-  };
-
   xla::XlaOp channels_iota;
   // DT_INT32 Iota will always return status::OK().
   TF_CHECK_OK(
@@ -133,12 +140,37 @@ xla::XlaOp MakeBilinearResizeKernel(xla::XlaBuilder* builder,
       xla::PrimitiveType::F32);
   return builder->Mul(
       builder->Mul(diag,
-                   builder->ConstantR1<float>(make_1d_kernel(kernel_size[1])),
+                   builder->ConstantR1<float>(Make1DKernel(kernel_size[1])),
                    /*broadcast_dimensions=*/{1}),
-      builder->ConstantR1<float>(make_1d_kernel(kernel_size[0])),
+      builder->ConstantR1<float>(Make1DKernel(kernel_size[0])),
       /*broadcast_dimensions=*/{0});
 }
 
+xla::XlaOp MakeBilinearResizeKernelInDim(xla::XlaBuilder* builder,
+                                         gtl::ArraySlice<int64> kernel_size,
+                                         int64 channels, int64 dim) {
+  xla::XlaOp channels_iota;
+  // DT_INT32 Iota will always return status::OK().
+  TF_CHECK_OK(
+      XlaHelpers::Iota(builder, DataType::DT_INT32, channels, &channels_iota));
+
+  auto diag = builder->ConvertElementType(
+      builder->Eq(builder->Broadcast(
+                      channels_iota,
+                      {dim == 0 ? (2 * kernel_size[0] - 1) : 1,
+                       dim == 1 ? (2 * kernel_size[1] - 1) : 1, channels}),
+                  channels_iota, /*broadcast_dimensions=*/{2}),
+      xla::PrimitiveType::F32);
+  if (dim == 1) {
+    return builder->Mul(
+        diag, builder->ConstantR1<float>(Make1DKernel(kernel_size[1])),
+        /*broadcast_dimensions=*/{1});
+  }
+  return builder->Mul(diag,
+                      builder->ConstantR1<float>(Make1DKernel(kernel_size[0])),
+                      /*broadcast_dimensions=*/{0});
+}
+
 xla::XlaOp ResizeUsingDilationAndConvolution(xla::XlaBuilder* builder,
                                              const xla::XlaOp& input,
                                              const int num_spatial_dims,
@@ -165,20 +197,42 @@ xla::XlaOp ResizeUsingDilationAndConvolution(xla::XlaBuilder* builder,
     dimension_numbers.add_output_spatial_dimensions(1 + i);
     dimension_numbers.add_kernel_spatial_dimensions(i);
   }
-  dimension_numbers.set_kernel_input_feature_dimension(num_spatial_dims);
-  dimension_numbers.set_kernel_output_feature_dimension(num_spatial_dims + 1);
+  dimension_numbers.set_kernel_input_feature_dimension(num_spatial_dims + 1);
+  dimension_numbers.set_kernel_output_feature_dimension(num_spatial_dims);
 
   ResizeConvolutionDims dims =
       ComputeResizeConvolutionParameters(in_size, out_size);
-  xla::XlaOp kernel =
-      MakeBilinearResizeKernel(builder, dims.kernel_size, channels);
-  xla::XlaOp output = builder->ConvGeneralDilated(
-      input, kernel, dims.stride,
-      /*padding=*/
-      {{dims.kernel_size[0] - 1, dims.kernel_size[0] - 1},
-       {dims.kernel_size[1] - 1, dims.kernel_size[1] - 1}},
-      /*lhs_dilation=*/dims.kernel_size,
-      /*rhs_dilation=*/{1, 1}, dimension_numbers);
+  xla::XlaOp output;
+  // Split convolutions into independent dimensions if they wmuld be a very
+  // large kernel.
+  if (dims.kernel_size[0] * dims.kernel_size[1] < kMax2DKernelSize) {
+    xla::XlaOp kernel =
+        MakeBilinearResizeKernel(builder, dims.kernel_size, channels);
+    output = builder->ConvGeneralDilated(
+        input, kernel, dims.stride,
+        /*padding=*/
+        {{dims.kernel_size[0] - 1, dims.kernel_size[0] - 1},
+         {dims.kernel_size[1] - 1, dims.kernel_size[1] - 1}},
+        /*lhs_dilation=*/dims.kernel_size,
+        /*rhs_dilation=*/{1, 1}, dimension_numbers);
+  } else {
+    xla::XlaOp kernel0 =
+        MakeBilinearResizeKernelInDim(builder, dims.kernel_size, channels, 0);
+    output = builder->ConvGeneralDilated(
+        input, kernel0, {dims.stride[0], 1},
+        /*padding=*/
+        {{dims.kernel_size[0] - 1, dims.kernel_size[0] - 1}, {0, 0}},
+        /*lhs_dilation=*/{dims.kernel_size[0], 1},
+        /*rhs_dilation=*/{1, 1}, dimension_numbers);
+    xla::XlaOp kernel1 =
+        MakeBilinearResizeKernelInDim(builder, dims.kernel_size, channels, 1);
+    output = builder->ConvGeneralDilated(
+        output, kernel1, {1, dims.stride[1]},
+        /*padding=*/
+        {{0, 0}, {dims.kernel_size[1] - 1, dims.kernel_size[1] - 1}},
+        /*lhs_dilation=*/{1, dims.kernel_size[1]},
+        /*rhs_dilation=*/{1, 1}, dimension_numbers);
+  }
 
   // Add broadcasts to handle expanding from a size == 1 dimension to a
   // size > 1 dimension.
@@ -214,26 +268,63 @@ xla::XlaOp ResizeUsingDilationAndConvolutionGradOp(xla::XlaBuilder* builder,
   }
   dimension_numbers.set_kernel_input_feature_dimension(num_spatial_dims);
   dimension_numbers.set_kernel_output_feature_dimension(num_spatial_dims + 1);
-  xla::XlaOp kernel =
-      MakeBilinearResizeKernel(builder, dims.kernel_size, channels);
+  xla::XlaOp output;
+  if (dims.kernel_size[0] * dims.kernel_size[1] < kMax2DKernelSize) {
+    xla::XlaOp kernel =
+        MakeBilinearResizeKernel(builder, dims.kernel_size, channels);
+
+    // Broadcast the input kernel where the forward op expanded from a size == 1
+    // dimension to a size > 1 dimension. This has the effect of summing the
+    // gradient contributions in that dimension.
+    for (int i = 0; i < num_spatial_dims; ++i) {
+      if (in_size[i] == 1 && grad_size[i] > 1) {
+        kernel =
+            builder->Add(kernel, builder->ConstantR1<float>(grad_size[i], 0),
+                         /*broadcast_dimensions=*/{i});
+      }
+    }
 
-  // Broadcast the input kernel where the forward op expanded from a size == 1
-  // dimension to a size > 1 dimension. This has the effect of summing the
-  // gradient contributions in that dimension.
-  for (int i = 0; i < num_spatial_dims; ++i) {
-    if (in_size[i] == 1 && grad_size[i] > 1) {
-      kernel = builder->Add(kernel, builder->ConstantR1<float>(grad_size[i], 0),
-                            /*broadcast_dimensions=*/{i});
+    output = builder->ConvGeneralDilated(
+        grad, kernel, /*window_strides=*/dims.kernel_size,
+        /*padding=*/
+        {{dims.kernel_size[0] - 1, dims.kernel_size[0] - 1},
+         {dims.kernel_size[1] - 1, dims.kernel_size[1] - 1}},
+        /*lhs_dilation=*/dims.stride,
+        /*rhs_dilation=*/{1, 1}, dimension_numbers);
+  } else {
+    xla::XlaOp kernel0 =
+        MakeBilinearResizeKernelInDim(builder, dims.kernel_size, channels, 0);
+    xla::XlaOp kernel1 =
+        MakeBilinearResizeKernelInDim(builder, dims.kernel_size, channels, 1);
+
+    // Broadcast the input kernel where the forward op expanded from a size == 1
+    // dimension to a size > 1 dimension. This has the effect of summing the
+    // gradient contributions in that dimension.
+    if (in_size[0] == 1 && grad_size[0] > 1) {
+      kernel0 =
+          builder->Add(kernel0, builder->ConstantR1<float>(grad_size[0], 0),
+                       /*broadcast_dimensions=*/{0});
+    }
+    if (in_size[1] == 1 && grad_size[1] > 1) {
+      kernel1 =
+          builder->Add(kernel0, builder->ConstantR1<float>(grad_size[1], 0),
+                       /*broadcast_dimensions=*/{1});
     }
-  }
 
-  xla::XlaOp output = builder->ConvGeneralDilated(
-      grad, kernel, /*window_strides=*/dims.kernel_size,
-      /*padding=*/
-      {{dims.kernel_size[0] - 1, dims.kernel_size[0] - 1},
-       {dims.kernel_size[1] - 1, dims.kernel_size[1] - 1}},
-      /*lhs_dilation=*/dims.stride,
-      /*rhs_dilation=*/{1, 1}, dimension_numbers);
+    output = builder->ConvGeneralDilated(
+        grad, kernel0, /*window_strides=*/{dims.kernel_size[0], 1},
+        /*padding=*/
+        {{dims.kernel_size[0] - 1, dims.kernel_size[0] - 1}, {0, 0}},
+        /*lhs_dilation=*/{dims.stride[0], 1},
+        /*rhs_dilation=*/{1, 1}, dimension_numbers);
+
+    output = builder->ConvGeneralDilated(
+        output, kernel1, /*window_strides=*/{1, dims.kernel_size[1]},
+        /*padding=*/
+        {{0, 0}, {dims.kernel_size[1] - 1, dims.kernel_size[1] - 1}},
+        /*lhs_dilation=*/{1, dims.stride[1]},
+        /*rhs_dilation=*/{1, 1}, dimension_numbers);
+  }
 
   // If in_size[i] > 1 and grad_size[i] == 1, pad the output in dimension i.
   // Opposite of the slice performed by the forward op.
diff --git a/tensorflow/compiler/tf2xla/kernels/no_op.cc b/tensorflow/compiler/tf2xla/kernels/no_op.cc
index 8c8a9bbe787f3224e7444b62dcf8ad99130cf37f..65ab9da8d7ca0509a4a69c43727a0e6c0435908a 100644
--- a/tensorflow/compiler/tf2xla/kernels/no_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/no_op.cc
@@ -24,8 +24,7 @@ namespace tensorflow {
 REGISTER_XLA_OP(Name("NoOp").CompilationOnly(), NoOp);
 
 // We register ControlTrigger as a no-op. This is correct since nodes seen
-// by the XLA compiler are never dead. This may need rethinking when we add
-// support for conditionals to XLA.
-REGISTER_XLA_OP(Name("ControlTrigger"), NoOp);
+// by the XLA compiler are never dead.
+REGISTER_XLA_OP(Name("ControlTrigger").CompilationOnly(), NoOp);
 
 }  // namespace tensorflow
diff --git a/tensorflow/compiler/tf2xla/kernels/random_ops.cc b/tensorflow/compiler/tf2xla/kernels/random_ops.cc
index 5f5bd586376ab368e443671ac8a5de23a5fd604b..105be38fe26b6667e8b4ce6da92a3969cdc0c187 100644
--- a/tensorflow/compiler/tf2xla/kernels/random_ops.cc
+++ b/tensorflow/compiler/tf2xla/kernels/random_ops.cc
@@ -17,6 +17,9 @@ limitations under the License.
 // TODO(misard,phawkins): handle random number generator seeds/states correctly.
 // TODO(misard,phawkins): add tests.
 
+#include "tensorflow/compiler/tf2xla/kernels/gather_op_helpers.h"
+#include "tensorflow/compiler/tf2xla/lib/util.h"
+#include "tensorflow/compiler/tf2xla/lib/while_loop.h"
 #include "tensorflow/compiler/tf2xla/shape_util.h"
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
@@ -55,6 +58,78 @@ class RandomUniformOp : public XlaOpKernel {
 REGISTER_XLA_OP(Name("RandomUniform").CompileTimeConstInput("shape"),
                 RandomUniformOp);
 
+class RandomShuffleOp : public XlaOpKernel {
+ public:
+  explicit RandomShuffleOp(OpKernelConstruction* ctx) : XlaOpKernel(ctx) {}
+
+  void Compile(XlaOpKernelContext* ctx) override {
+    auto builder = ctx->builder();
+    xla::XlaOp input = ctx->Input(0);
+    TensorShape input_shape = ctx->InputShape(0);
+    const int64 n = input_shape.dim_size(0);
+    int64 num_elements = 1;
+    for (tensorflow::TensorShapeDim dimension : input_shape) {
+      num_elements *= dimension.size;
+    }
+    if (num_elements <= 1 || n <= 1) {
+      // No shuffling is required, so copy input directly to output
+      ctx->SetOutput(0, input);
+    } else {
+      // Generate the random swaps for the indices.
+      auto swaps_shape = xla::ShapeUtil::MakeShape(xla::S32, {n});
+      auto swaps =
+          builder->RngUniform(builder->ConstantR0<int32>(0),
+                              builder->ConstantR0<int32>(n), swaps_shape);
+
+      // Generate range(n) as the initial value for the indices to be swapped.
+      xla::XlaOp indices;
+      TF_CHECK_OK(XlaHelpers::Iota(builder, DataType::DT_INT32, n, &indices));
+
+      // Swap the indices at i and swaps[i].
+      auto swap_body_fn = [&](xla::XlaOp i,
+                              gtl::ArraySlice<xla::XlaOp> loop_vars,
+                              xla::XlaBuilder* builder)
+          -> xla::StatusOr<std::vector<xla::XlaOp>> {
+        auto swaps = loop_vars[0];
+        auto indices = loop_vars[1];
+        i = builder->Reshape(i, {1});
+        // temp = indices[i]
+        auto temp = builder->DynamicSlice(indices, i, {1});
+        // swap_index = swaps[i]
+        auto swap_index = builder->DynamicSlice(swaps, i, {1});
+        // swap_value = indices[swaps[i]]
+        auto swap_value = builder->DynamicSlice(indices, swap_index, {1});
+        // indices[i] = indices[swaps[i]]
+        indices = builder->DynamicUpdateSlice(indices, swap_value, i);
+        // indices[swaps[i]] = temp
+        indices = builder->DynamicUpdateSlice(indices, temp, swap_index);
+        return std::vector<xla::XlaOp>{swaps, indices};
+      };
+      // for i in range(n):
+      auto swap_loop_result =
+          XlaForEachIndex(n, xla::S32, swap_body_fn, {swaps, indices},
+                          "indices_swap_loop", builder)
+              .ValueOrDie();
+      auto swapped_indices = swap_loop_result[1];
+
+      // Gather the data using the swapped indices as the shuffled order.
+      auto indices_tensor_shape = TensorShape({n});
+      DataType type = ctx->expected_output_dtype(0);
+      xla::XlaOp gather;
+      OP_REQUIRES_OK(ctx, XlaGather(input, input_shape, swapped_indices,
+                                    indices_tensor_shape,
+                                    /*axis=*/0, /*indices_are_nd=*/false, type,
+                                    DT_INT32, builder, &gather));
+      ctx->SetOutput(0, gather);
+    }
+  }
+
+ private:
+  TF_DISALLOW_COPY_AND_ASSIGN(RandomShuffleOp);
+};
+
+REGISTER_XLA_OP(Name("RandomShuffle"), RandomShuffleOp);
+
 class RandomUniformIntOp : public XlaOpKernel {
  public:
   explicit RandomUniformIntOp(OpKernelConstruction* ctx) : XlaOpKernel(ctx) {}
@@ -127,13 +202,8 @@ class TruncatedNormalOp : public XlaOpKernel {
     OP_REQUIRES_OK(ctx, ctx->ConstantInputAsShape(0, &shape));
     xla::Shape xla_shape;
     OP_REQUIRES_OK(ctx, TensorShapeToXLAShape(dtype, shape, &xla_shape));
-    xla::Shape xla_element_shape =
-        xla::ShapeUtil::MakeShape(xla_shape.element_type(), {});
 
     xla::XlaBuilder* b = ctx->builder();
-    xla::XlaOp mean = XlaHelpers::Zero(b, dtype);
-    xla::XlaOp stddev = XlaHelpers::One(b, dtype);
-    xla::XlaOp candidate = b->RngNormal(mean, stddev, xla_shape);
 
     auto two_sd = [dtype](bool negate, xla::XlaBuilder* b) {
       return XlaHelpers::FloatLiteral(b, dtype, negate ? -2.0 : 2.0);
@@ -151,34 +221,38 @@ class TruncatedNormalOp : public XlaOpKernel {
     //   out_of_range_mask := candidate < mean-2*sd || candidate > mean+2*sd
     //   candidate = select(out_of_range_mask, rng_normal(), candidate)
     // }
-    std::unique_ptr<xla::XlaBuilder> test_builder =
-        b->CreateSubBuilder("truncated_normal_test");
-    {
-      auto* b = test_builder.get();
-      xla::XlaOp candidate = b->Parameter(0, xla_shape, "candidate");
-      out_of_range_mask(candidate, b);
-      OP_REQUIRES_OK(ctx, Any(out_of_range_mask(candidate, b), b).status());
-    }
-
-    std::unique_ptr<xla::XlaBuilder> body_builder =
-        b->CreateSubBuilder("truncated_normal_body");
-    {
-      auto* b = body_builder.get();
-      xla::XlaOp candidate = b->Parameter(0, xla_shape, "candidate");
-      xla::XlaOp to_resample = out_of_range_mask(candidate, b);
+    std::vector<xla::XlaOp> initial_values = {
+        // The current candidate.
+        b->Broadcast(XlaHelpers::Zero(b, dtype), shape.dim_sizes()),
+        // The to_resample mask, where 'true' identifies a location in the
+        // current candidate that is out of range and must be regenerated.
+        b->Broadcast(b->ConstantR0<bool>(true), shape.dim_sizes()),
+        // Is any element in the mask true?
+        b->ConstantR0<bool>(true)};
+    auto condition = [&](gtl::ArraySlice<xla::XlaOp> values,
+                         xla::XlaBuilder* b) -> xla::StatusOr<xla::XlaOp> {
+      // Continue while any element in the mask is true.
+      return values[2];
+    };
+    auto body =
+        [&](gtl::ArraySlice<xla::XlaOp> values,
+            xla::XlaBuilder* b) -> xla::StatusOr<std::vector<xla::XlaOp>> {
+      xla::XlaOp candidate = values[0];
+      xla::XlaOp to_resample = values[1];
       xla::XlaOp mean = XlaHelpers::Zero(b, dtype);
       xla::XlaOp stddev = XlaHelpers::One(b, dtype);
-      b->Select(to_resample, b->RngNormal(mean, stddev, xla_shape), candidate);
-    }
-
-    xla::StatusOr<xla::XlaComputation> test_computation = test_builder->Build();
-    OP_REQUIRES_OK(ctx, test_computation.status());
-    xla::StatusOr<xla::XlaComputation> body_computation = body_builder->Build();
-    OP_REQUIRES_OK(ctx, body_computation.status());
-    xla::XlaOp result = b->While(test_computation.ValueOrDie(),
-                                 body_computation.ValueOrDie(), candidate);
-
-    ctx->SetOutput(0, result);
+      candidate = b->Select(to_resample, b->RngNormal(mean, stddev, xla_shape),
+                            candidate);
+      // Compute a new to_resample mask, and determine whether any value is
+      // still out of range.
+      to_resample = out_of_range_mask(candidate, b);
+      TF_ASSIGN_OR_RETURN(xla::XlaOp done, Any(to_resample, b));
+      return std::vector<xla::XlaOp>{candidate, to_resample, done};
+    };
+    auto result =
+        XlaWhileLoop(condition, body, initial_values, "truncated_normal", b);
+    OP_REQUIRES_OK(ctx, result.status());
+    ctx->SetOutput(0, result.ValueOrDie()[0]);
   }
 };
 
diff --git a/tensorflow/compiler/tf2xla/kernels/shape_op.cc b/tensorflow/compiler/tf2xla/kernels/shape_op.cc
index 05354bca5bb089703fdcceb6f44648bbb98d004b..d59720bef742c7441ee01a954247013559bb909c 100644
--- a/tensorflow/compiler/tf2xla/kernels/shape_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/shape_op.cc
@@ -43,7 +43,7 @@ class ShapeOp : public XlaOpKernel {
   DataType out_dtype_;
 };
 
-REGISTER_XLA_OP(Name("Shape"), ShapeOp);
+REGISTER_XLA_OP(Name("Shape").CompilationOnly(), ShapeOp);
 
 class ShapeNOp : public XlaOpKernel {
  public:
@@ -65,7 +65,7 @@ class ShapeNOp : public XlaOpKernel {
  private:
   DataType out_dtype_;
 };
-REGISTER_XLA_OP(Name("ShapeN"), ShapeNOp);
+REGISTER_XLA_OP(Name("ShapeN").CompilationOnly(), ShapeNOp);
 
 class RankOp : public XlaOpKernel {
  public:
@@ -81,7 +81,7 @@ class RankOp : public XlaOpKernel {
   }
 };
 
-REGISTER_XLA_OP(Name("Rank"), RankOp);
+REGISTER_XLA_OP(Name("Rank").CompilationOnly(), RankOp);
 
 class SizeOp : public XlaOpKernel {
  public:
@@ -100,7 +100,7 @@ class SizeOp : public XlaOpKernel {
   }
 };
 
-REGISTER_XLA_OP(Name("Size"), SizeOp);
+REGISTER_XLA_OP(Name("Size").CompilationOnly(), SizeOp);
 
 class ExpandDimsOp : public XlaOpKernel {
  public:
@@ -189,10 +189,9 @@ class SqueezeOp : public XlaOpKernel {
       if (!wrapped_squeeze_dims.empty()) {
         if (wrapped_squeeze_dims.count(i) > 0) {
           OP_REQUIRES(ctx, existing_dim == 1,
-                      errors::InvalidArgument("Tried to explicitly squeeze "
-                                              "dimension ",
-                                              i, " but dimension was not 1: ",
-                                              existing_dim));
+                      errors::InvalidArgument(
+                          "Tried to explicitly squeeze dimension ", i,
+                          " but dimension was not 1: ", existing_dim));
         } else {
           // This dimension is not being squeezed.
           new_shape.push_back(existing_dim);
diff --git a/tensorflow/compiler/tf2xla/kernels/variable_ops.cc b/tensorflow/compiler/tf2xla/kernels/variable_ops.cc
index 6109db8e89e5ee67e0635d26e258bfe7cb70a15d..a163fa0a5b34675e46d0d7c5f4e0ccb1e3fb18eb 100644
--- a/tensorflow/compiler/tf2xla/kernels/variable_ops.cc
+++ b/tensorflow/compiler/tf2xla/kernels/variable_ops.cc
@@ -57,7 +57,7 @@ class ReadVariableOp : public XlaOpKernel {
  private:
   DataType dtype_;
 };
-REGISTER_XLA_OP(Name("ReadVariableOp"), ReadVariableOp);
+REGISTER_XLA_OP(Name("ReadVariableOp").CompilationOnly(), ReadVariableOp);
 
 class AssignVariableOp : public XlaOpKernel {
  public:
@@ -67,7 +67,7 @@ class AssignVariableOp : public XlaOpKernel {
                    ctx->AssignVariable(0, ctx->input_type(1), ctx->Input(1)));
   }
 };
-REGISTER_XLA_OP(Name("AssignVariableOp"), AssignVariableOp);
+REGISTER_XLA_OP(Name("AssignVariableOp").CompilationOnly(), AssignVariableOp);
 
 class AssignAddVariableOp : public XlaOpKernel {
  public:
diff --git a/tensorflow/compiler/tf2xla/lib/cholesky.cc b/tensorflow/compiler/tf2xla/lib/cholesky.cc
index 3f1384bc864abd882ebba2b90acbe0b1e664687a..20925118bf598a6436c43bd727ce40e3abafc46c 100644
--- a/tensorflow/compiler/tf2xla/lib/cholesky.cc
+++ b/tensorflow/compiler/tf2xla/lib/cholesky.cc
@@ -110,7 +110,6 @@ xla::StatusOr<xla::XlaOp> CholeskyUnblocked(xla::XlaBuilder* builder,
         FloatLiteral(body_builder, a_shape.element_type(), 0.5));
 
     // a[..., i+1:, i]
-    auto ip1 = body_builder->Add(i, body_builder->ConstantR0<int32>(1));
     // select the whole i-th column, then mask out all rows above i+1
     TF_ASSIGN_OR_RETURN(
         auto a_0i, DynamicSliceInMinorDims(body_builder, body_a, {i}, {1}));
diff --git a/tensorflow/compiler/tf2xla/literal_util.cc b/tensorflow/compiler/tf2xla/literal_util.cc
index 43e1c1e9fecec1c71db1509757251cb5d903ca49..db56b128375ce8ff2faf12c5d7ea256bdfab0f63 100644
--- a/tensorflow/compiler/tf2xla/literal_util.cc
+++ b/tensorflow/compiler/tf2xla/literal_util.cc
@@ -40,6 +40,37 @@ Status HostTensorToLiteral(const Tensor& host_tensor, xla::Literal* literal) {
   return Status::OK();
 }
 
+Status HostTensorToBorrowingLiteral(const Tensor& host_tensor,
+                                    xla::BorrowingLiteral* literal) {
+  xla::Shape xla_shape;
+  TF_RETURN_IF_ERROR(TensorShapeToXLAShape(host_tensor.dtype(),
+                                           host_tensor.shape(), &xla_shape));
+  *literal = xla::BorrowingLiteral(
+      static_cast<const char*>(DMAHelper::base(&host_tensor)), xla_shape);
+  return Status::OK();
+}
+
+Status HostTensorsToBorrowingLiteralTuple(
+    tensorflow::gtl::ArraySlice<Tensor> host_tensors,
+    xla::BorrowingLiteral* literal) {
+  std::vector<const char*> buf_ptrs;
+  buf_ptrs.reserve(host_tensors.size());
+  std::vector<xla::Shape> tensor_shapes(host_tensors.size());
+
+  for (int i = 0; i < host_tensors.size(); i++) {
+    // Validate runtime shapes and fail if it doesn't match the contract.
+    const Tensor* tensor = &host_tensors[i];
+    buf_ptrs.emplace_back(static_cast<const char*>(DMAHelper::base(tensor)));
+    TF_RETURN_IF_ERROR(TensorShapeToXLAShape(tensor->dtype(), tensor->shape(),
+                                             &tensor_shapes[i]));
+  }
+
+  *literal = xla::BorrowingLiteral(
+      buf_ptrs, xla::ShapeUtil::MakeTupleShape(tensor_shapes));
+
+  return Status::OK();
+}
+
 Status CopyLiteralToHostTensor(const xla::LiteralSlice& literal,
                                Tensor* host_tensor) {
   TF_RET_CHECK(xla::ShapeUtil::IsArray(literal.shape()) &&
diff --git a/tensorflow/compiler/tf2xla/literal_util.h b/tensorflow/compiler/tf2xla/literal_util.h
index 220bec15538c36fa30abef9e729b64dbbb9f72b3..74685025c1780c5c0ba56205a98786582e9191e9 100644
--- a/tensorflow/compiler/tf2xla/literal_util.h
+++ b/tensorflow/compiler/tf2xla/literal_util.h
@@ -22,6 +22,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/xla_data.pb.h"
 #include "tensorflow/core/framework/tensor.h"
 #include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/lib/gtl/array_slice.h"
 
 namespace tensorflow {
 
@@ -29,6 +30,17 @@ namespace tensorflow {
 // unsupported type.
 Status HostTensorToLiteral(const Tensor& host_tensor, xla::Literal* literal);
 
+// Returns a BorrowingLiteral that utilizes the same underlying buffer owned by
+// 'host_tensor'.
+Status HostTensorToBorrowingLiteral(const Tensor& host_tensor,
+                                    xla::BorrowingLiteral* literal);
+
+// Returns a BorrowingLiteral tuple that utilizes the same underlying buffers
+// owned by 'host_tensors'.
+Status HostTensorsToBorrowingLiteralTuple(
+    tensorflow::gtl::ArraySlice<Tensor> host_tensors,
+    xla::BorrowingLiteral* literal);
+
 // Copies 'literal' to freshly allocated 'host_tensor', which is allocated of
 // type <target_type>.
 // Fails if the literal's primitive type !=
diff --git a/tensorflow/compiler/tf2xla/tf2xla.cc b/tensorflow/compiler/tf2xla/tf2xla.cc
index 3a08aa8cf4f5cea6210cc9470d57c3387445ea6e..ac768b206e2a8d163a4253432a1911152f89ce86 100644
--- a/tensorflow/compiler/tf2xla/tf2xla.cc
+++ b/tensorflow/compiler/tf2xla/tf2xla.cc
@@ -263,8 +263,7 @@ Status ConvertGraphToXla(std::unique_ptr<Graph> graph, xla::Client* client,
   // Compile the graph into an XLA computation.
   XlaCompiler::Options compiler_options;
   compiler_options.client = client;
-  DeviceType device_type(DEVICE_CPU_XLA_JIT);
-  compiler_options.device_type = &device_type;
+  compiler_options.device_type = DeviceType(DEVICE_CPU_XLA_JIT);
   compiler_options.flib_def = &graph->flib_def();
   compiler_options.graph_def_version = graph->versions().producer();
   compiler_options.allow_cpu_custom_calls = true;
diff --git a/tensorflow/compiler/tf2xla/xla_compiler.cc b/tensorflow/compiler/tf2xla/xla_compiler.cc
index f7098917b191058c53a1d6a5923e80e5e8319d72..9c8e56a17e07348d3cfaaca0b5eb335295af05c3 100644
--- a/tensorflow/compiler/tf2xla/xla_compiler.cc
+++ b/tensorflow/compiler/tf2xla/xla_compiler.cc
@@ -83,12 +83,9 @@ XlaCompiler::XlaCompiler(XlaCompiler::Options options)
     : options_(options),
       initialization_status_(Status::OK()),
       next_step_id_(1),
-      device_(
-          new XlaCompilationDevice(SessionOptions(), *options_.device_type)),
+      device_(new XlaCompilationDevice(SessionOptions(), options_.device_type)),
       device_mgr_({device_}) {
-  // We no longer need the device_type.
-  options_.device_type = nullptr;
-
+  CHECK(!options_.device_type.type_string().empty());
   if (options_.populate_resource_manager) {
     initialization_status_ =
         (*options_.populate_resource_manager)(device_->resource_manager());
@@ -228,7 +225,7 @@ Status XlaCompiler::CompileFunction(const XlaCompiler::CompileOptions& options,
 // Computes the XLA shape for argument 'arg'.
 Status XlaCompiler::XLAShapeForArgument(const XlaCompiler::Argument& arg,
                                         bool is_entry_computation,
-                                        xla::Shape* xla_shape) {
+                                        xla::Shape* xla_shape) const {
   switch (arg.kind) {
     case XlaCompiler::Argument::kConstant:
       LOG(FATAL) << "Unreachable case";
@@ -655,10 +652,70 @@ Status XlaCompiler::CompileSingleOp(
                         .Finalize(graph.get(), &node);
     TF_RETURN_IF_ERROR(status);
   }
+  FixupSourceAndSinkEdges(graph.get());
 
   return CompileGraph(options, name, std::move(graph), args, result);
 }
 
+namespace {
+
+// Check that the ops of all non-functional nodes have been registered.
+string ValidateFunctionDef(const FunctionDef* fdef,
+                           const FunctionLibraryDefinition& flib_def) {
+  std::vector<string> invalid_ops;
+  for (const NodeDef& node : fdef->node_def()) {
+    const string& op = node.op();
+    if (op == FunctionLibraryDefinition::kGradientOp || flib_def.Find(op)) {
+      continue;
+    }
+    const OpDef* op_def;
+    if (!OpRegistry::Global()->LookUpOpDef(op, &op_def).ok()) {
+      invalid_ops.push_back(op);
+    }
+  }
+  return tensorflow::str_util::Join(invalid_ops, ", ");
+}
+
+// Check that the graph doesn't have any invalid nodes (e.g. incompatible with
+// given device_type, invalid data type, missing attributes...)
+Status ValidateGraph(const Graph* graph,
+                     const FunctionLibraryDefinition& flib_def,
+                     const DeviceType& device_type, const string& name) {
+  std::vector<string> invalid_ops;
+  for (const Node* node : graph->nodes()) {
+    if (node->type_string() == FunctionLibraryDefinition::kGradientOp) {
+      continue;
+    }
+    const FunctionDef* fdef = flib_def.Find(node->def().op());
+    if (fdef) {
+      string error_msg = ValidateFunctionDef(fdef, flib_def);
+      if (!error_msg.empty()) {
+        invalid_ops.push_back(
+            strings::StrCat(node->def().op(), ":{", error_msg, "}"));
+      }
+      continue;
+    }
+    const OpDef* op_def;
+    if (!OpRegistry::Global()->LookUpOpDef(node->def().op(), &op_def).ok()) {
+      invalid_ops.push_back(node->def().op());
+      continue;
+    }
+    TF_RETURN_IF_ERROR(ValidateNodeDef(node->def(), *op_def));
+    if (!FindKernelDef(device_type, node->def(), nullptr, nullptr).ok()) {
+      invalid_ops.push_back(node->def().op());
+    }
+  }
+  if (!invalid_ops.empty()) {
+    return errors::InvalidArgument(strings::StrCat(
+        "Detected unsupported operations when trying to compile graph ", name,
+        " on ", device_type.type_string(), ":",
+        tensorflow::str_util::Join(invalid_ops, ", ")));
+  }
+  return Status::OK();
+}
+
+}  // namespace
+
 Status XlaCompiler::CompileGraph(const XlaCompiler::CompileOptions& options,
                                  string const& name,
                                  std::unique_ptr<Graph> graph,
@@ -681,6 +738,11 @@ Status XlaCompiler::CompileGraph(const XlaCompiler::CompileOptions& options,
       FunctionalizeControlFlow(flib_runtime_->GetFunctionLibraryDefinition(),
                                graph.get(), local_flib_def_.get()));
 
+  // Detect invalid nodes.
+  // FunctionalizeControlFlow may remove some nodes from the graph.
+  TF_RETURN_IF_ERROR(ValidateGraph(graph.get(), *options_.flib_def,
+                                   options_.device_type, name));
+
   xla::XlaBuilder builder(name);
   XlaContext* context = new XlaContext(
       this, &builder, options_.allow_cpu_custom_calls,
diff --git a/tensorflow/compiler/tf2xla/xla_compiler.h b/tensorflow/compiler/tf2xla/xla_compiler.h
index bf496bd8bc81e67056eba380288bca88737cc00d..c93850ce270502ea1df1f6469963e96e86994fa2 100644
--- a/tensorflow/compiler/tf2xla/xla_compiler.h
+++ b/tensorflow/compiler/tf2xla/xla_compiler.h
@@ -18,6 +18,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/tf2xla/host_compute_metadata.pb.h"
 #include "tensorflow/compiler/tf2xla/xla_compilation_device.h"
+#include "tensorflow/compiler/tf2xla/xla_op_registry.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
 #include "tensorflow/core/common_runtime/device.h"
 #include "tensorflow/core/common_runtime/device_mgr.h"
@@ -244,9 +245,9 @@ class XlaCompiler {
   typedef std::function<TensorShape(const TensorShape&, DataType)>
       ShapeRepresentationFn;
   struct Options {
-    // Name of the compilation device to use. Needs to be live only during
-    // XlaCompiler's constructor.
-    const DeviceType* device_type = nullptr;
+    // Name of the compilation device to use. It must be set by the caller.
+    // The default empty value is invalid.
+    DeviceType device_type = DeviceType("");
 
     xla::Client* client = nullptr;
 
@@ -313,7 +314,7 @@ class XlaCompiler {
   // See the class comment for more details about the argument passing
   // convention.
   Status XLAShapeForArgument(const Argument& arg, bool is_entry_computation,
-                             xla::Shape* xla_shape);
+                             xla::Shape* xla_shape) const;
 
   // Retrieves the channel handle associated with `key`. Allocates
   // a new channel handle if none exists.
diff --git a/tensorflow/compiler/tf2xla/xla_compiler_test.cc b/tensorflow/compiler/tf2xla/xla_compiler_test.cc
index 55772ca324872f6d5fac008de7819b7fae64966a..613230452b74755ce7543ec2ab82861aa0dfeb7a 100644
--- a/tensorflow/compiler/tf2xla/xla_compiler_test.cc
+++ b/tensorflow/compiler/tf2xla/xla_compiler_test.cc
@@ -34,6 +34,7 @@ limitations under the License.
 #include "tensorflow/core/framework/resource_mgr.h"
 #include "tensorflow/core/framework/tensor.h"
 #include "tensorflow/core/framework/tensor_testutil.h"
+#include "tensorflow/core/graph/algorithm.h"
 #include "tensorflow/core/graph/graph.h"
 #include "tensorflow/core/graph/graph_constructor.h"
 #include "tensorflow/core/lib/core/status_test_util.h"
@@ -45,8 +46,6 @@ namespace tensorflow {
 
 class XlaCompilerTest : public ::testing::Test {
  protected:
-  XlaCompilerTest() : cpu_device_type_(DEVICE_CPU_XLA_JIT) {}
-
   void SetUp() override {
     client_ = xla::ClientLibrary::LocalClientOrDie();
 
@@ -58,7 +57,7 @@ class XlaCompilerTest : public ::testing::Test {
 
   XlaCompiler::Options DefaultOptions() {
     XlaCompiler::Options options;
-    options.device_type = &cpu_device_type_;
+    options.device_type = DeviceType(DEVICE_CPU_XLA_JIT);
     options.client = client_;
     options.flib_def = flib_def_.get();
     return options;
@@ -68,7 +67,6 @@ class XlaCompilerTest : public ::testing::Test {
     return compiler->local_flib_def_.get();
   }
 
-  DeviceType cpu_device_type_;
   xla::Client* client_;
   std::unique_ptr<FunctionLibraryDefinition> flib_def_;
 };
@@ -979,5 +977,115 @@ TEST_F(XlaCompilerTest, ArgRetvalShapeRepresentationFunction) {
   EXPECT_TRUE(xla::LiteralTestUtil::Equal(*expected_literal, *actual_literal));
 }
 
+// Tests a graph which has a function with an invalid op.
+TEST_F(XlaCompilerTest, FunctionWithInvalidOp) {
+  XlaCompiler compiler(DefaultOptions());
+
+  FunctionDefLibrary flib;
+  FunctionDef fn = FillFn();
+  NodeDef* node = fn.add_node_def();
+  node->set_name("Invalid");
+  node->set_op("InvalidOp"); /* unsupported op */
+  node = fn.add_node_def();
+  node->set_name("Switch");
+  node->set_op("Switch"); /* control flow node */
+  *flib.add_function() = fn;
+
+  TF_ASSERT_OK(flib_def_->AddFunctionDef(fn));
+
+  std::unique_ptr<Graph> graph(new Graph(OpRegistry::Global()));
+
+  Scope scope = Scope::NewRootScope().ExitOnError();
+  auto value = ops::Const<int32>(scope.WithOpName("value"), 1, {});
+  auto shape = ops::Const<int32>(scope.WithOpName("shape"), {5}, {1});
+  TF_ASSERT_OK(scope.graph()->AddFunctionLibrary(flib));
+
+  NodeDef def;
+  TF_ASSERT_OK(NodeDefBuilder("fill_fn", "FillFn", flib_def_.get())
+                   .Input(value.name(), 0, DT_INT32)
+                   .Input(shape.name(), 1, DT_INT32)
+                   .Finalize(&def));
+  Status status;
+  Node* fill = scope.graph()->AddNode(def, &status);
+  TF_ASSERT_OK(status);
+  TF_ASSERT_OK(scope.DoShapeInference(fill));
+  scope.graph()->AddEdge(value.node(), 0, fill, 0);
+  scope.graph()->AddEdge(shape.node(), 0, fill, 1);
+
+  auto retval = ops::_Retval(scope.WithOpName("retval"), Output(fill), 0);
+
+  TF_ASSERT_OK(scope.ToGraph(graph.get()));
+
+  std::vector<XlaCompiler::Argument> args;
+  XlaCompiler::CompilationResult result;
+  status = compiler.CompileGraph(XlaCompiler::CompileOptions(), "fill",
+                                 std::move(graph), args, &result);
+  ASSERT_FALSE(status.ok());
+  EXPECT_TRUE(
+      str_util::StrContains(status.error_message(), "FillFn:{InvalidOp}"))
+      << status.error_message();
+}
+
+// Tests a graph which has a node with invalid data type.
+TEST_F(XlaCompilerTest, NodeWithInvalidDataType) {
+  std::unique_ptr<Graph> graph(new Graph(OpRegistry::Global()));
+  NodeDef shape;
+  shape.set_name("Shape");
+  shape.set_op("Shape");
+  (*shape.mutable_attr())["T"].set_type(DT_INT32);
+  (*shape.mutable_attr())["out_type"].set_type(DT_BOOL); /* invalid type */
+  Status status;
+  Node* shape_node = graph->AddNode(shape, &status);
+  TF_ASSERT_OK(status);
+  graph->AddControlEdge(graph->source_node(), shape_node);
+
+  std::vector<XlaCompiler::Argument> args;
+  XlaCompiler::CompilationResult result;
+  XlaCompiler compiler(DefaultOptions());
+  status = compiler.CompileGraph(XlaCompiler::CompileOptions(), "invalid_type",
+                                 std::move(graph), args, &result);
+  ASSERT_FALSE(status.ok());
+  EXPECT_TRUE(str_util::StrContains(status.error_message(),
+                                    "is not in the list of allowed values"))
+      << status.error_message();
+}
+
+TEST_F(XlaCompilerTest, SingleOpWithoutInputs) {
+  std::unique_ptr<Graph> graph(new Graph(OpRegistry::Global()));
+  NodeDef no_op;
+  no_op.set_name("NoOp");
+  no_op.set_op("NoOp");
+  Status status;
+  graph->AddNode(no_op, &status);
+  TF_ASSERT_OK(status);
+
+  std::vector<XlaCompiler::Argument> args;
+  XlaCompiler compiler(DefaultOptions());
+  // No control edge linking NoOp with source/sink.
+  {
+    std::unique_ptr<Graph> graph_copy(new Graph(OpRegistry::Global()));
+    CopyGraph(*graph, graph_copy.get());
+    XlaCompiler::CompilationResult result;
+    status = compiler.CompileGraph(XlaCompiler::CompileOptions(), "NoOp",
+                                   std::move(graph_copy), args, &result);
+    ASSERT_FALSE(status.ok());
+    EXPECT_TRUE(str_util::StrContains(status.error_message(),
+                                      "The following nodes are unreachable "
+                                      "from the source in the graph: NoOp"))
+        << status.error_message();
+  }
+
+  // Fix control edges for NoOp.
+  {
+    std::unique_ptr<Graph> graph_copy(new Graph(OpRegistry::Global()));
+    CopyGraph(*graph, graph_copy.get());
+    EXPECT_TRUE(FixupSourceAndSinkEdges(graph_copy.get()));
+    XlaCompiler::CompilationResult result;
+    TF_ASSERT_OK(compiler.CompileGraph(XlaCompiler::CompileOptions(), "NoOp",
+                                       std::move(graph_copy), args, &result));
+    EXPECT_EQ(0, result.resource_updates.size());
+  }
+}
+
 }  // namespace
 }  // namespace tensorflow
diff --git a/tensorflow/compiler/tf2xla/xla_helpers.cc b/tensorflow/compiler/tf2xla/xla_helpers.cc
index f1594193af09c7193f03b4685d3a7d4510d654dd..a1da176fe30ddd0d4460a51b60b2568ecc1af6aa 100644
--- a/tensorflow/compiler/tf2xla/xla_helpers.cc
+++ b/tensorflow/compiler/tf2xla/xla_helpers.cc
@@ -19,11 +19,13 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/lib/util.h"
 
 #include "tensorflow/compiler/tf2xla/literal_util.h"
+#include "tensorflow/compiler/tf2xla/shape_util.h"
 #include "tensorflow/compiler/tf2xla/type_util.h"
 #include "tensorflow/compiler/tf2xla/xla_context.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
 #include "tensorflow/compiler/xla/types.h"
+#include "tensorflow/core/common_runtime/dma_helper.h"
 #include "tensorflow/core/framework/tensor.h"
 #include "tensorflow/core/lib/core/status.h"
 #include "tensorflow/core/lib/gtl/array_slice.h"
@@ -210,8 +212,9 @@ Status XlaHelpers::Iota(xla::XlaBuilder* builder, DataType dtype, int64 size,
       return errors::InvalidArgument("Invalid argument type ",
                                      DataTypeString(dtype));
   }
-  xla::Literal linspace_literal;
-  TF_RETURN_IF_ERROR(HostTensorToLiteral(linspace, &linspace_literal));
+  xla::BorrowingLiteral linspace_literal;
+  TF_RETURN_IF_ERROR(HostTensorToBorrowingLiteral(linspace, &linspace_literal));
+
   *iota = builder->ConstantLiteral(linspace_literal);
   return Status::OK();
 }
@@ -245,8 +248,8 @@ Status XlaHelpers::OneHot(xla::XlaBuilder* builder, int64 depth, int axis,
       return errors::InvalidArgument("Invalid argument type ",
                                      DataTypeString(index_type));
   }
-  xla::Literal linspace_literal;
-  TF_RETURN_IF_ERROR(HostTensorToLiteral(linspace, &linspace_literal));
+  xla::BorrowingLiteral linspace_literal;
+  TF_RETURN_IF_ERROR(HostTensorToBorrowingLiteral(linspace, &linspace_literal));
 
   // Broadcast the linspace constant across the indices along the new axis,
   // and test equality at each position.
diff --git a/tensorflow/compiler/xla/BUILD b/tensorflow/compiler/xla/BUILD
index c6deb959a59f7b79500a0948b4035ea56cd9b4a1..1b8e516770c3e217dd7c2f26ce426895b478c2e4 100644
--- a/tensorflow/compiler/xla/BUILD
+++ b/tensorflow/compiler/xla/BUILD
@@ -53,7 +53,6 @@ xla_proto_library(
     deps = [
         ":xla_data_proto",
         "//tensorflow/compiler/xla/service:hlo_proto",
-        "//tensorflow/compiler/xla/service:session_proto",
     ],
 )
 
diff --git a/tensorflow/compiler/xla/client/BUILD b/tensorflow/compiler/xla/client/BUILD
index aacb394ae5f92aa0d87ee3a23bcc3d4ec5cd99a3..8f08d3b2e04670ad6590aca1db0fd9d25faed83f 100644
--- a/tensorflow/compiler/xla/client/BUILD
+++ b/tensorflow/compiler/xla/client/BUILD
@@ -86,6 +86,7 @@ cc_library(
     hdrs = ["executable_build_options.h"],
     deps = [
         "//tensorflow/compiler/xla:shape_util",
+        "//tensorflow/compiler/xla:util",
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/service:device_memory_allocator",
         "//tensorflow/core:lib",
@@ -109,6 +110,7 @@ cc_library(
         "//tensorflow/compiler/xla/service:compiler",
         "//tensorflow/compiler/xla/service:device_memory_allocator",
         "//tensorflow/compiler/xla/service:executable",
+        "//tensorflow/compiler/xla/service:hlo_proto",
         "//tensorflow/compiler/xla/service:local_service",
         "//tensorflow/compiler/xla/service:shaped_buffer",
         "//tensorflow/compiler/xla/service:source_map_util",
diff --git a/tensorflow/compiler/xla/client/client.cc b/tensorflow/compiler/xla/client/client.cc
index c9d275a77b5cd40225f4b5c45e02c242d27d9aa1..3d596a6e65430b6e9692aabd65fc8aa84b7b873d 100644
--- a/tensorflow/compiler/xla/client/client.cc
+++ b/tensorflow/compiler/xla/client/client.cc
@@ -64,7 +64,7 @@ StatusOr<std::unique_ptr<Literal>> Client::Transfer(
 }
 
 StatusOr<std::unique_ptr<GlobalData>> Client::TransferToServer(
-    const Literal& literal, const DeviceHandle* device_handle) {
+    const LiteralSlice& literal, const DeviceHandle* device_handle) {
   TransferToServerRequest request;
   *request.mutable_literal() = literal.ToProto();
   if (device_handle) {
@@ -91,7 +91,7 @@ StatusOr<std::unique_ptr<GlobalData>> Client::TransferToServer(
   return MakeUnique<GlobalData>(stub_, response.data());
 }
 
-Status Client::TransferToInfeed(const Literal& literal, int64 replica_id,
+Status Client::TransferToInfeed(const LiteralSlice& literal, int64 replica_id,
                                 const DeviceHandle* device_handle) {
   TransferToInfeedRequest request;
   *request.mutable_literal() = literal.ToProto();
diff --git a/tensorflow/compiler/xla/client/client.h b/tensorflow/compiler/xla/client/client.h
index d57e2536d0b44cda46d7c1c2513b82c9f8a31c1b..68f0d0ac78c859fde7a6a007cd250b047a7bfcda 100644
--- a/tensorflow/compiler/xla/client/client.h
+++ b/tensorflow/compiler/xla/client/client.h
@@ -107,14 +107,14 @@ class Client {
   // device (and its replicas if replication is enabled). Otherwise, data is
   // transferred to the default device (and its replicas).
   StatusOr<std::unique_ptr<GlobalData>> TransferToServer(
-      const Literal& literal, const DeviceHandle* device_handle = nullptr);
+      const LiteralSlice& literal, const DeviceHandle* device_handle = nullptr);
 
   // Transfer the given literal to the Infeed interface of the device.
   //
   // device_handle and replica_id together specify a particular device; a device
   // assigned for the given replica_id among the replicas that the given device
   // handle belongs to.
-  Status TransferToInfeed(const Literal& literal, int64 replica_id = 0,
+  Status TransferToInfeed(const LiteralSlice& literal, int64 replica_id = 0,
                           const DeviceHandle* device_handle = nullptr);
 
   // Transfers from the Outfeed of the device.
@@ -153,8 +153,6 @@ class Client {
   //
   // If output_layout is non-null, then the output of the computation will be
   // stored using that layout.
-  //
-  // TODO(b/74197823): This is a part of a NOT YET ready refactor.
   StatusOr<std::unique_ptr<Literal>> ComputeConstant(
       const XlaComputation& computation,
       const Layout* output_layout = nullptr) const;
diff --git a/tensorflow/compiler/xla/client/compile_only_client.cc b/tensorflow/compiler/xla/client/compile_only_client.cc
index dc69d2097ebe14ca0e14a39849d4fcae99024fdc..5c9abad4c3126be5e45e96c770c0679fe8606788 100644
--- a/tensorflow/compiler/xla/client/compile_only_client.cc
+++ b/tensorflow/compiler/xla/client/compile_only_client.cc
@@ -24,7 +24,8 @@ namespace xla {
 StatusOr<std::vector<std::unique_ptr<AotCompilationResult>>>
 CompileOnlyClient::CompileAheadOfTime(
     const tensorflow::gtl::ArraySlice<AotXlaComputationInstance> computations,
-    const AotCompilationOptions& options) {
+    const AotCompilationOptions& options,
+    std::unique_ptr<AotCompilationMetadata>* metadata) {
   std::vector<CompileOnlyService::AotXlaComputationInstance> service_instances;
   service_instances.reserve(computations.size());
   for (const AotXlaComputationInstance& instance : computations) {
@@ -36,7 +37,8 @@ CompileOnlyClient::CompileAheadOfTime(
     service_instance.argument_layouts = instance.argument_layouts;
     service_instance.result_layout = instance.result_layout;
   }
-  return compiler_service_->CompileAheadOfTime(service_instances, options);
+  return compiler_service_->CompileAheadOfTime(service_instances, options,
+                                               metadata);
 }
 
 int64 CompileOnlyClient::PointerSizeForTriple(tensorflow::StringPiece triple) {
diff --git a/tensorflow/compiler/xla/client/compile_only_client.h b/tensorflow/compiler/xla/client/compile_only_client.h
index f9a7c31270c7a11175f47a537639a97d0c9211af..332c96503637344d56e363e19db4880c37ca9684 100644
--- a/tensorflow/compiler/xla/client/compile_only_client.h
+++ b/tensorflow/compiler/xla/client/compile_only_client.h
@@ -46,13 +46,15 @@ class CompileOnlyClient : public Client {
     const Shape* result_layout;
   };
 
-  // Compiles a list of xla computations for ahead-of-time execution.  This is
-  // intended for use in static compilation. The |options| parameter describes
-  // the target for which the compiler should emit code.
+  // Compiles a list of xla computations for ahead-of-time execution.
+  // This is intended for use in static compilation. The |options|
+  // parameter describes the target for which the compiler should emit
+  // code. |metadata|, if provided, is populated during compilation.
   StatusOr<std::vector<std::unique_ptr<AotCompilationResult>>>
   CompileAheadOfTime(
       const tensorflow::gtl::ArraySlice<AotXlaComputationInstance> computations,
-      const AotCompilationOptions& options);
+      const AotCompilationOptions& options,
+      std::unique_ptr<AotCompilationMetadata>* metadata = nullptr);
 
   // Returns the size of a pointer in bytes for a given triple.
   static int64 PointerSizeForTriple(tensorflow::StringPiece triple);
diff --git a/tensorflow/compiler/xla/client/executable_build_options.cc b/tensorflow/compiler/xla/client/executable_build_options.cc
index 6e3c5cb484b8f1ef053fa287a4d462aeb886e530..7dee41f6a05025ec196b78e54015e8e71777031f 100644
--- a/tensorflow/compiler/xla/client/executable_build_options.cc
+++ b/tensorflow/compiler/xla/client/executable_build_options.cc
@@ -87,6 +87,18 @@ ExecutableBuildOptions::dump_optimized_hlo_proto_to() const {
   return dump_optimized_hlo_proto_to_;
 }
 
+ExecutableBuildOptions&
+ExecutableBuildOptions::set_dump_unoptimized_hlo_proto_to(
+    tensorflow::StringPiece dirpath) {
+  dump_unoptimized_hlo_proto_to_ = dirpath.ToString();
+  return *this;
+}
+
+const tensorflow::gtl::optional<string>&
+ExecutableBuildOptions::dump_unoptimized_hlo_proto_to() const {
+  return dump_unoptimized_hlo_proto_to_;
+}
+
 ExecutableBuildOptions& ExecutableBuildOptions::set_dump_per_pass_hlo_proto_to(
     tensorflow::StringPiece dirpath) {
   dump_per_pass_hlo_proto_to_ = dirpath.ToString();
diff --git a/tensorflow/compiler/xla/client/executable_build_options.h b/tensorflow/compiler/xla/client/executable_build_options.h
index 11f10983606fe02b1edb11a260edde8e5f9a726f..9dc9be4423564fb967b247c2d1df31099cb80237 100644
--- a/tensorflow/compiler/xla/client/executable_build_options.h
+++ b/tensorflow/compiler/xla/client/executable_build_options.h
@@ -17,6 +17,7 @@ limitations under the License.
 #define TENSORFLOW_COMPILER_XLA_CLIENT_EXECUTABLE_BUILD_OPTIONS_H_
 
 #include "tensorflow/compiler/xla/service/device_memory_allocator.h"
+#include "tensorflow/compiler/xla/util.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
 #include "tensorflow/core/lib/core/stringpiece.h"
 #include "tensorflow/core/lib/gtl/optional.h"
@@ -64,6 +65,13 @@ class ExecutableBuildOptions {
       tensorflow::StringPiece dirpath);
   const tensorflow::gtl::optional<string>& dump_optimized_hlo_proto_to() const;
 
+  // If set, specifies a dirpath to dump the start-of-optimization-pipeline HLO
+  // protobuf to (as in DebugOptions).
+  ExecutableBuildOptions& set_dump_unoptimized_hlo_proto_to(
+      tensorflow::StringPiece dirpath);
+  const tensorflow::gtl::optional<string>& dump_unoptimized_hlo_proto_to()
+      const;
+
   // If set, specifies a dirpath to dump the per-pass-in-pipeline HLO protobufs
   // to (as in DebugOptions).
   ExecutableBuildOptions& set_dump_per_pass_hlo_proto_to(
@@ -76,6 +84,13 @@ class ExecutableBuildOptions {
   ExecutableBuildOptions& set_hlo_profile(bool enabled);
   tensorflow::gtl::optional<bool> hlo_profile() const;
 
+  void add_disabled_hlo_pass(tensorflow::StringPiece pass_name) {
+    disabled_hlo_passes_.push_back(std::string(pass_name));
+  }
+  const tensorflow::gtl::ArraySlice<std::string> disabled_hlo_passes() const {
+    return disabled_hlo_passes_;
+  }
+
   // Returns a string representation of the build options, suitable for
   // debugging.
   string ToString() const;
@@ -87,8 +102,10 @@ class ExecutableBuildOptions {
   bool result_layout_set_ = false;
   tensorflow::gtl::optional<string> generate_hlo_graph_;
   tensorflow::gtl::optional<string> dump_optimized_hlo_proto_to_;
+  tensorflow::gtl::optional<string> dump_unoptimized_hlo_proto_to_;
   tensorflow::gtl::optional<string> dump_per_pass_hlo_proto_to_;
   DeviceMemoryAllocator* device_allocator_ = nullptr;
+  std::vector<std::string> disabled_hlo_passes_;
 };
 
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/client/local_client.cc b/tensorflow/compiler/xla/client/local_client.cc
index a7c55c6b2b7fe2b5541ce71bf3eaa24114522fc5..ae0308020d014e038d2f0fd7de6c5f372d6cbed1 100644
--- a/tensorflow/compiler/xla/client/local_client.cc
+++ b/tensorflow/compiler/xla/client/local_client.cc
@@ -185,7 +185,7 @@ StatusOr<ScopedShapedBuffer> LocalExecutable::Run(
       run_options, backend_->StreamBorrower(),
       backend_->eigen_intra_op_thread_pool());
 
-  if (executable_->dumping()) {
+  if (executable_->dumping_snapshot()) {
     return ExecuteAndDump(&service_options, arguments);
   }
   return executable_->ExecuteOnStreamWrapper(
@@ -195,36 +195,36 @@ StatusOr<ScopedShapedBuffer> LocalExecutable::Run(
 StatusOr<ScopedShapedBuffer> LocalExecutable::ExecuteAndDump(
     const ServiceExecutableRunOptions* run_options,
     const tensorflow::gtl::ArraySlice<const ShapedBuffer*> arguments) {
-  executable_->session_module()->set_execution_platform(
+  executable_->hlo_snapshot()->set_execution_platform(
       backend_->platform()->Name());
-  TF_RETURN_IF_ERROR(RecordArguments(arguments, executable_->session_module()));
+  TF_RETURN_IF_ERROR(RecordArguments(arguments, executable_->hlo_snapshot()));
   TF_ASSIGN_OR_RETURN(
       ScopedShapedBuffer result,
       executable_->ExecuteOnStream(run_options, arguments,
                                    /*hlo_execution_profile=*/nullptr));
-  TF_RETURN_IF_ERROR(RecordResult(&result, executable_->session_module()));
-  TF_RETURN_IF_ERROR(executable_->DumpSessionModule());
+  TF_RETURN_IF_ERROR(RecordResult(&result, executable_->hlo_snapshot()));
+  TF_RETURN_IF_ERROR(executable_->DumpHloSnapshot());
   return std::move(result);
 }
 
 Status LocalExecutable::RecordArguments(
     const tensorflow::gtl::ArraySlice<const ShapedBuffer*> arguments,
-    SessionModule* session_module) {
-  session_module->clear_arguments();
+    HloSnapshot* hlo_snapshot) {
+  hlo_snapshot->clear_arguments();
   for (const ShapedBuffer* argument : arguments) {
     TF_ASSIGN_OR_RETURN(std::unique_ptr<Literal> literal,
                         LiteralFromShapedBuffer(*argument));
-    *session_module->add_arguments() = literal->ToProto();
+    *hlo_snapshot->add_arguments() = literal->ToProto();
   }
   return Status::OK();
 }
 
 Status LocalExecutable::RecordResult(const ShapedBuffer* result,
-                                     SessionModule* session_module) {
-  session_module->clear_result();
+                                     HloSnapshot* hlo_snapshot) {
+  hlo_snapshot->clear_result();
   TF_ASSIGN_OR_RETURN(std::unique_ptr<Literal> literal,
                       LiteralFromShapedBuffer(*result));
-  *session_module->mutable_result() = literal->ToProto();
+  *hlo_snapshot->mutable_result() = literal->ToProto();
   return Status::OK();
 }
 
@@ -304,6 +304,11 @@ StatusOr<std::unique_ptr<Literal>> LocalClient::ShapedBufferToLiteral(
                                                                  shaped_buffer);
 }
 
+StatusOr<const ShapedBuffer*> LocalClient::GlobalDataToShapedBuffer(
+    const GlobalDataHandle& data, int replica_number) {
+  return local_service_->GlobalDataToShapedBuffer(data, replica_number);
+}
+
 Status LocalClient::TransferToInfeedLocal(const Literal& literal,
                                           int device_ordinal) {
   TF_ASSIGN_OR_RETURN(se::StreamExecutor * executor,
diff --git a/tensorflow/compiler/xla/client/local_client.h b/tensorflow/compiler/xla/client/local_client.h
index d63d4ec7f3744d507cc854213e430e25e861e559..4d9e0d7cd9d6ddebead1e12b23e94b529038039b 100644
--- a/tensorflow/compiler/xla/client/local_client.h
+++ b/tensorflow/compiler/xla/client/local_client.h
@@ -25,6 +25,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/compiler.h"
 #include "tensorflow/compiler/xla/service/device_memory_allocator.h"
 #include "tensorflow/compiler/xla/service/executable.h"
+#include "tensorflow/compiler/xla/service/hlo.pb.h"
 #include "tensorflow/compiler/xla/service/local_service.h"
 #include "tensorflow/compiler/xla/service/shaped_buffer.h"
 #include "tensorflow/compiler/xla/statusor.h"
@@ -58,12 +59,18 @@ class LocalExecutable {
 
   // Validates that the given arguments and options satisfy various constraints
   // of the computation.
+  //
+  // The given ExecutableRunOptions override any values from legacy_flags
+  // (TF_XLA_FLAGS environment variable).
   Status ValidateExecutionOptions(
       const tensorflow::gtl::ArraySlice<const ShapedBuffer*> arguments,
       const ExecutableRunOptions& run_options, const Backend& backend);
 
   // Records the computation in a SessionModule proto with the arguments used to
   // invoke it, and the result. Enabled by flag: --tla_dump_executions_to.
+  //
+  // The given ServiceExecutableRunOptions override any values from legacy_flags
+  // (TF_XLA_FLAGS environment variable).
   StatusOr<ScopedShapedBuffer> ExecuteAndDump(
       const ServiceExecutableRunOptions* run_options,
       const tensorflow::gtl::ArraySlice<const ShapedBuffer*> arguments);
@@ -72,11 +79,10 @@ class LocalExecutable {
   // proto.
   Status RecordArguments(
       const tensorflow::gtl::ArraySlice<const ShapedBuffer*> arguments,
-      SessionModule* session_module);
+      HloSnapshot* hlo_snapshot);
 
   // Records the result of the computation in a SessionModule proto.
-  Status RecordResult(const ShapedBuffer* result,
-                      SessionModule* session_module);
+  Status RecordResult(const ShapedBuffer* result, HloSnapshot* hlo_snapshot);
 
   // Returns a literal containing the contents of the given ShapedBuffer.
   StatusOr<std::unique_ptr<Literal>> LiteralFromShapedBuffer(
@@ -109,6 +115,9 @@ class LocalClient : public Client {
 
   // Build and return a LocalExecutable object. The executable is compiled using
   // the given XlaComputation, argument layouts and options.
+  //
+  // The given ExecutableBuildOptions override any values from legacy_flags
+  // (TF_XLA_FLAGS environment variable).
   StatusOr<std::unique_ptr<LocalExecutable>> Compile(
       const XlaComputation& computation,
       const tensorflow::gtl::ArraySlice<const Shape*> argument_layouts,
@@ -127,6 +136,11 @@ class LocalClient : public Client {
   StatusOr<std::unique_ptr<Literal>> ShapedBufferToLiteral(
       const ShapedBuffer& shaped_buffer);
 
+  // Converts a GlobalDataHandle into a pointer to a ShapedBuffer that's valid
+  // as long as the handle is valid.
+  StatusOr<const ShapedBuffer*> GlobalDataToShapedBuffer(
+      const GlobalDataHandle& data, int replica_number);
+
   // Transfer the given literal to the infeed queue of the given device.
   // TODO(b/69670845): Remove the 'Local' from the name when LocalClient does
   // not inherit from Client and there is no possibility of confusion with
diff --git a/tensorflow/compiler/xla/client/xla_client/BUILD b/tensorflow/compiler/xla/client/xla_client/BUILD
index 0d6e207971ec64515ec5e6da292910920edd101a..507a2dc5f088e159156f0ef3d663ba2819f6a2d4 100644
--- a/tensorflow/compiler/xla/client/xla_client/BUILD
+++ b/tensorflow/compiler/xla/client/xla_client/BUILD
@@ -37,7 +37,6 @@ cc_library(
     ],
 )
 
-# TODO(b/74197823): Replace computation_builder with xla_builder.
 cc_library(
     name = "xla_builder",
     srcs = ["xla_builder.cc"],
diff --git a/tensorflow/compiler/xla/client/xla_client/xla_builder.cc b/tensorflow/compiler/xla/client/xla_client/xla_builder.cc
index ae506317c2e4862d77cb4f0628e919871ad1aeb2..ae8fbdb2dc8b3f4fc21bcfef9692645f7e1d48b5 100644
--- a/tensorflow/compiler/xla/client/xla_client/xla_builder.cc
+++ b/tensorflow/compiler/xla/client/xla_client/xla_builder.cc
@@ -1611,14 +1611,41 @@ XlaOp XlaBuilder::BatchNormGrad(const XlaOp& operand, const XlaOp& scale,
   });
 }
 
-XlaOp XlaBuilder::CrossReplicaSum(const XlaOp& operand) {
+XlaOp XlaBuilder::CrossReplicaSum(
+    const XlaOp& operand,
+    tensorflow::gtl::ArraySlice<int64> replica_group_ids) {
   return NoteErrorOrReturn([&]() -> StatusOr<XlaOp> {
-    HloInstructionProto instr;
+    TF_ASSIGN_OR_RETURN(const Shape& shape, GetShape(operand));
+    const Shape& scalar_shape = ShapeUtil::MakeShape(shape.element_type(), {});
+    auto b = CreateSubBuilder("sum");
+    b->Add(b->Parameter(/*parameter_number=*/0, scalar_shape, "x"),
+           b->Parameter(/*parameter_number=*/1, scalar_shape, "y"));
+    TF_ASSIGN_OR_RETURN(auto computation, b->Build());
+    return CrossReplicaSum(operand, computation, replica_group_ids,
+                           /*channel_id=*/tensorflow::gtl::nullopt);
+  });
+}
 
+XlaOp XlaBuilder::CrossReplicaSum(
+    const XlaOp& operand, const XlaComputation& computation,
+    tensorflow::gtl::ArraySlice<int64> replica_group_ids,
+    const tensorflow::gtl::optional<ChannelHandle>& channel_id) {
+  return NoteErrorOrReturn([&]() -> StatusOr<XlaOp> {
+    if (channel_id.has_value()) {
+      return Unimplemented(
+          "replica_group_ids and channel_id and is not supported in AllReduce");
+    }
+
+    HloInstructionProto instr;
     TF_ASSIGN_OR_RETURN(const Shape& operand_shape, GetShape(operand));
     TF_ASSIGN_OR_RETURN(
         *instr.mutable_shape(),
         ShapeInference::InferCrossReplicaSumShape({&operand_shape}));
+    for (int64 replica_group_id : replica_group_ids) {
+      instr.add_replica_group_ids(replica_group_id);
+    }
+
+    AddCalledComputation(computation, &instr);
 
     return AddInstruction(std::move(instr), HloOpcode::kCrossReplicaSum,
                           {operand});
diff --git a/tensorflow/compiler/xla/client/xla_client/xla_builder.h b/tensorflow/compiler/xla/client/xla_client/xla_builder.h
index 2b3013a91c488782098bd81994e899eae5a1f506..0329e42ed1aef8edd1537e888ddcd78f08584407 100644
--- a/tensorflow/compiler/xla/client/xla_client/xla_builder.h
+++ b/tensorflow/compiler/xla/client/xla_client/xla_builder.h
@@ -528,9 +528,35 @@ class XlaBuilder {
       tensorflow::gtl::ArraySlice<int64> window_strides,
       tensorflow::gtl::ArraySlice<std::pair<int64, int64>> padding);
 
-  // Returns the sum of the operand value across all replicas. All replicas
-  // supply one input to the sum and all replicas receive the resulting sum.
-  XlaOp CrossReplicaSum(const XlaOp& operand);
+  // Returns the sum of the operand value within each subgroup of replicas. All
+  // replicas supply one input to the sum and all replicas receive the resulting
+  // sum for each subgroup.
+  XlaOp CrossReplicaSum(
+      const XlaOp& operand,
+      tensorflow::gtl::ArraySlice<int64> replica_group_ids = {});
+
+  // Enqueues an operation that do an AllReduce of the operand cross cores. Here
+  // AllReduce means doing a reduction on the input operand cross cores and then
+  // broadcasting the reduction result to those cores. The reduction function is
+  // defined by `computation`, which should be a commutative computation on
+  // scalars, e.g., add, min, or max. The way that AllReduce is applied is
+  // configured by:
+  //
+  // - `replica_group_ids`: maps replica ids to subgroup ids. If empty, all
+  // replicas belong to one group. Allreduce will be applied within subgroups.
+  // For example, we have 4 replicas, then replica_group_ids={0,1,0,1} means,
+  // replica 0 and 2 are in subgroup 0, replica 1 and 3 are in subgroup 1.
+  //
+  // - `channel_id`: for Allreduce nodes from different models, if they have the
+  // same channel_id, they will be 'Allreduce'd. If empty, Allreduce will not be
+  // applied cross models.
+  //
+  // TODO(b/79737069): Rename this to AllReduce when it's ready to use.
+  XlaOp CrossReplicaSum(
+      const XlaOp& operand, const XlaComputation& computation,
+      tensorflow::gtl::ArraySlice<int64> replica_group_ids = {},
+      const tensorflow::gtl::optional<ChannelHandle>& channel_id =
+          tensorflow::gtl::nullopt);
 
   // Enqueues an operation that scatters the `source` array to the selected
   // indices of each window.
diff --git a/tensorflow/compiler/xla/layout_util.cc b/tensorflow/compiler/xla/layout_util.cc
index a76fdcda250168cbed2acd01bdd9ddc3b4c93b92..e8f29b83291a7cb238dc25b9f4bb743fe426a162 100644
--- a/tensorflow/compiler/xla/layout_util.cc
+++ b/tensorflow/compiler/xla/layout_util.cc
@@ -65,6 +65,16 @@ void SetDefaultLayoutToContainer(
   return layout;
 }
 
+/* static */ Layout LayoutUtil::MakeLayoutFromMajorToMinor(
+    tensorflow::gtl::ArraySlice<int64> major_to_minor) {
+  Layout layout;
+  layout.set_format(DENSE);
+  for (int i = major_to_minor.size() - 1; i >= 0; i--) {
+    layout.add_minor_to_major(major_to_minor[i]);
+  }
+  return layout;
+}
+
 /* static */ Layout LayoutUtil::MakeSparseLayout(int64 max_sparse_elements) {
   Layout layout;
   layout.set_format(SPARSE);
@@ -88,8 +98,13 @@ Layout CreateDefaultLayoutForRank(int64 rank) {
 }  // namespace
 
 /* static */ Layout LayoutUtil::GetDefaultLayoutForShape(const Shape& shape) {
+  if (ShapeUtil::IsOpaque(shape) || ShapeUtil::IsToken(shape)) {
+    // Opaque and token types have empty layouts.
+    return Layout();
+  }
+
   // A Layout proto corresponds to a single array, not a tuple.
-  DCHECK(!ShapeUtil::IsTuple(shape));
+  CHECK(ShapeUtil::IsArray(shape));
   return CreateDefaultLayoutForRank(shape.dimensions_size());
 }
 
@@ -116,14 +131,15 @@ Layout CreateDefaultLayoutForRank(int64 rank) {
       SetToDefaultLayout(&element_shape);
     }
     shape->clear_layout();
-  } else if (ShapeUtil::IsOpaque(*shape)) {
-    shape->clear_layout();
-  } else {
+  } else if (ShapeUtil::IsArray(*shape)) {
     shape->mutable_layout()->set_format(DENSE);
     tensorflow::protobuf::RepeatedField<tensorflow::protobuf_int64>*
         minor_to_major = shape->mutable_layout()->mutable_minor_to_major();
     minor_to_major->Resize(shape->dimensions_size(), 0);
     SetDefaultLayoutToContainer(minor_to_major);
+  } else {
+    // Opaque, token types etc. have no layout.
+    shape->clear_layout();
   }
 }
 
@@ -150,18 +166,20 @@ Layout CreateDefaultLayoutForRank(int64 rank) {
       TF_RETURN_IF_ERROR(ValidateLayoutInShape(element_shape));
     }
     return Status::OK();
-  } else if (ShapeUtil::IsOpaque(shape)) {
-    if (shape.has_layout()) {
-      return InvalidArgument("opaque should not have a layout field");
-    }
-    return Status::OK();
-  } else {
-    // Array shape.
+  } else if (ShapeUtil::IsArray(shape)) {
     if (!shape.has_layout()) {
       return InvalidArgument("shape %s does not have a layout",
                              ShapeUtil::HumanString(shape).c_str());
     }
     return ValidateLayoutForShape(shape.layout(), shape);
+  } else {
+    // Token, opaque, etc. shape.
+    if (shape.has_layout()) {
+      return InvalidArgument(
+          "shape of primitive type %s should not have a layout",
+          PrimitiveType_Name(shape.element_type()).c_str());
+    }
+    return Status::OK();
   }
 }
 
@@ -171,8 +189,10 @@ Layout CreateDefaultLayoutForRank(int64 rank) {
     return InvalidArgument("a single Layout is not valid for tuple shapes");
   }
 
-  if (ShapeUtil::IsOpaque(shape)) {
-    return Status::OK();
+  if (!ShapeUtil::IsArray(shape)) {
+    return InvalidArgument(
+        "shape of primitive type %s should not have a layout",
+        PrimitiveType_Name(shape.element_type()).c_str());
   }
 
   if (layout.format() == INVALID_FORMAT) {
@@ -263,7 +283,7 @@ Layout CreateDefaultLayoutForRank(int64 rank) {
 }
 
 /* static */ bool LayoutUtil::IsPadded(const Shape& shape) {
-  if (ShapeUtil::IsTuple(shape) || !HasLayout(shape) ||
+  if (!ShapeUtil::IsArray(shape) || !HasLayout(shape) ||
       shape.layout().padded_dimensions_size() == 0) {
     return false;
   }
@@ -313,7 +333,8 @@ Layout CreateDefaultLayoutForRank(int64 rank) {
     // Tuple shape: all subshapes must have a layout.
     return std::all_of(shape.tuple_shapes().begin(), shape.tuple_shapes().end(),
                        [](const Shape& s) { return HasLayout(s); });
-  } else if (ShapeUtil::IsOpaque(shape)) {
+  } else if (!ShapeUtil::IsArray(shape)) {
+    // Opaque, token types etc. ignore layout.
     return true;
   }
   return shape.has_layout() && shape.layout().format() != INVALID_FORMAT;
@@ -422,12 +443,9 @@ Status LayoutUtil::CopyLayoutBetweenShapes(const Shape& src, Shape* dst) {
 
 /* static */ bool LayoutUtil::LayoutsInShapesEqual(const Shape& lhs,
                                                    const Shape& rhs) {
-  if (ShapeUtil::IsTuple(lhs) != ShapeUtil::IsTuple(rhs)) {
-    return false;
-  }
   if (ShapeUtil::IsTuple(lhs)) {
-    if (ShapeUtil::TupleElementCount(lhs) !=
-        ShapeUtil::TupleElementCount(rhs)) {
+    if (!ShapeUtil::IsTuple(rhs) || ShapeUtil::TupleElementCount(lhs) !=
+                                        ShapeUtil::TupleElementCount(rhs)) {
       return false;
     }
     for (int i = 0; i < ShapeUtil::TupleElementCount(lhs); ++i) {
@@ -436,9 +454,12 @@ Status LayoutUtil::CopyLayoutBetweenShapes(const Shape& src, Shape* dst) {
       }
     }
     return true;
-  } else {
+  } else if (ShapeUtil::IsArray(lhs)) {
     return ShapeUtil::Rank(lhs) == ShapeUtil::Rank(rhs) &&
            LayoutUtil::Equal(lhs.layout(), rhs.layout());
+  } else {
+    // Layouts of non-array and non-tuple shapes is ignored.
+    return true;
   }
 }
 
diff --git a/tensorflow/compiler/xla/layout_util.h b/tensorflow/compiler/xla/layout_util.h
index d3d6a2cc94012f7113fd1cb1b17e9c9d5323d9bf..739bbe73675c7fb855627006028eafdf703d6540 100644
--- a/tensorflow/compiler/xla/layout_util.h
+++ b/tensorflow/compiler/xla/layout_util.h
@@ -36,6 +36,10 @@ class LayoutUtil {
   // convenience function for protobuf construction.)
   static Layout MakeLayout(tensorflow::gtl::ArraySlice<int64> minor_to_major);
 
+  // Similar to MakeLayout, but take indices in reverse order.
+  static Layout MakeLayoutFromMajorToMinor(
+      tensorflow::gtl::ArraySlice<int64> major_to_minor);
+
   // Creates a sparse layout with the given maximum number of elements. (This is
   // a convenience function for protobuf construction.)
   static Layout MakeSparseLayout(int64 max_sparse_elements);
diff --git a/tensorflow/compiler/xla/layout_util_test.cc b/tensorflow/compiler/xla/layout_util_test.cc
index 4fd1d818e3e3b417eee9f6b14bb598bfb9480c6e..e4c825450dcd45a8fbeaacbb2ad145f94307176f 100644
--- a/tensorflow/compiler/xla/layout_util_test.cc
+++ b/tensorflow/compiler/xla/layout_util_test.cc
@@ -218,6 +218,47 @@ TEST_F(LayoutUtilTest, CopyLayoutBogusLayout) {
                                "elements, but shape is rank"));
 }
 
+TEST_F(LayoutUtilTest, CopyTokenLayout) {
+  Shape src = ShapeUtil::MakeTokenShape();
+  Shape dst = ShapeUtil::MakeTokenShape();
+
+  // Layouts are trivially the same for token types and copying layouts should
+  // be a nop.
+  EXPECT_TRUE(LayoutUtil::LayoutsInShapesEqual(src, dst));
+  EXPECT_IS_OK(LayoutUtil::CopyLayoutBetweenShapes(src, &dst));
+  EXPECT_TRUE(LayoutUtil::LayoutsInShapesEqual(src, dst));
+}
+
+TEST_F(LayoutUtilTest, CopyOpaqueLayout) {
+  Shape src = ShapeUtil::MakeOpaqueShape();
+  Shape dst = ShapeUtil::MakeOpaqueShape();
+
+  // Layouts are trivially the same for opaque types and copying layouts should
+  // be a nop.
+  EXPECT_TRUE(LayoutUtil::LayoutsInShapesEqual(src, dst));
+  EXPECT_IS_OK(LayoutUtil::CopyLayoutBetweenShapes(src, &dst));
+  EXPECT_TRUE(LayoutUtil::LayoutsInShapesEqual(src, dst));
+}
+
+TEST_F(LayoutUtilTest, CopyTupleLayoutWithTokenAndOpaque) {
+  Shape src = ShapeUtil::MakeTupleShape(
+      {MakeShapeWithLayout(F32, {2, 3}, {0, 1}),
+       MakeShapeWithLayout(F32, {42, 123}, {1, 0}), ShapeUtil::MakeTokenShape(),
+       ShapeUtil::MakeTupleShape(
+           {ShapeUtil::MakeOpaqueShape(), MakeShapeWithLayout(F32, {}, {}),
+            MakeShapeWithLayout(F32, {1, 2, 3}, {0, 2, 1})})});
+  Shape dst = ShapeUtil::MakeTupleShape(
+      {MakeShapeWithLayout(F32, {2, 3}, {1, 0}),
+       MakeShapeWithLayout(F32, {42, 123}, {1, 0}), ShapeUtil::MakeTokenShape(),
+       ShapeUtil::MakeTupleShape(
+           {ShapeUtil::MakeOpaqueShape(), MakeShapeWithLayout(F32, {}, {}),
+            MakeShapeWithLayout(F32, {1, 2, 3}, {1, 2, 0})})});
+
+  EXPECT_FALSE(LayoutUtil::LayoutsInShapesEqual(src, dst));
+  EXPECT_IS_OK(LayoutUtil::CopyLayoutBetweenShapes(src, &dst));
+  EXPECT_TRUE(LayoutUtil::LayoutsInShapesEqual(src, dst));
+}
+
 TEST_F(LayoutUtilTest, ClearLayoutTuple) {
   Shape shape = ShapeUtil::MakeTupleShape(
       {MakeShapeWithLayout(F32, {2, 3}, {1, 0}),
@@ -236,6 +277,16 @@ TEST_F(LayoutUtilTest, ClearLayoutTuple) {
   EXPECT_FALSE(shape.tuple_shapes(2).tuple_shapes(1).has_layout());
 }
 
+TEST_F(LayoutUtilTest, ClearLayoutOpaqueAndToken) {
+  // Opaque and token types trivially have layouts.
+  for (Shape shape :
+       {ShapeUtil::MakeOpaqueShape(), ShapeUtil::MakeTokenShape()}) {
+    EXPECT_TRUE(LayoutUtil::HasLayout(shape));
+    LayoutUtil::ClearLayout(&shape);
+    EXPECT_TRUE(LayoutUtil::HasLayout(shape));
+  }
+}
+
 TEST_F(LayoutUtilTest, SetToDefaultLayoutTuple) {
   Shape shape = ShapeUtil::MakeTupleShape(
       {MakeShapeWithLayout(F32, {2, 3, 4}, {1, 0, 2}),
diff --git a/tensorflow/compiler/xla/literal_comparison.cc b/tensorflow/compiler/xla/literal_comparison.cc
index 3696fdbe12e311af3b286ef0dfe91377983b72dd..bf9679cafec72c2e9dc5796e9058c6703239c508 100644
--- a/tensorflow/compiler/xla/literal_comparison.cc
+++ b/tensorflow/compiler/xla/literal_comparison.cc
@@ -716,9 +716,11 @@ Status Equal(const LiteralSlice& expected, const LiteralSlice& actual) {
   }
 
   return AppendStatus(result,
-                      tensorflow::strings::Printf("expected: %s\nactual:   %s",
-                                                  expected.ToString().c_str(),
-                                                  actual.ToString().c_str()));
+                      tensorflow::strings::Printf(
+                          "\nat index: %s\nexpected: %s\nactual:   %s",
+                          Literal::MultiIndexAsString(multi_index).c_str(),
+                          ToStringTruncated(expected).c_str(),
+                          ToStringTruncated(actual).c_str()));
 }
 
 Status Near(const LiteralSlice& expected, const LiteralSlice& actual,
diff --git a/tensorflow/compiler/xla/literal_util.cc b/tensorflow/compiler/xla/literal_util.cc
index 4c560767dc603bf805f365d594810f4df7e90ed3..6b295897004cebce003ddd3999aacf63915ffe5f 100644
--- a/tensorflow/compiler/xla/literal_util.cc
+++ b/tensorflow/compiler/xla/literal_util.cc
@@ -807,6 +807,47 @@ std::unique_ptr<Literal> LiteralBase::Relayout(
   return result;
 }
 
+StatusOr<std::unique_ptr<Literal>> LiteralBase::Broadcast(
+    const Shape& result_shape,
+    tensorflow::gtl::ArraySlice<int64> dimensions) const {
+  if (!ShapeUtil::IsArray(shape())) {
+    return InvalidArgument("Broadcast only supports arrays.");
+  }
+
+  for (int64 i = 0; i < dimensions.size(); i++) {
+    TF_RET_CHECK(shape().dimensions(i) ==
+                 result_shape.dimensions(dimensions[i]));
+  }
+
+  std::unique_ptr<Literal> result = MakeUnique<Literal>(result_shape);
+
+  // scratch_source_index is temporary storage space for the computed index into
+  // the input literal.  We put it here to avoid allocating an std::vector in
+  // every iteration of ShapeUtil::ForEachIndex.
+  std::vector<int64> scratch_source_index(shape().dimensions_size());
+
+  char* dest_data = static_cast<char*>(result->untyped_data());
+  const char* source_data = static_cast<const char*>(untyped_data());
+  const int64 primitive_size =
+      ShapeUtil::ByteSizeOfPrimitiveType(shape().element_type());
+
+  ShapeUtil::ForEachIndex(
+      result_shape, [&](tensorflow::gtl::ArraySlice<int64> output_index) {
+        for (int64 i = 0; i < dimensions.size(); ++i) {
+          scratch_source_index[i] = output_index[dimensions[i]];
+        }
+        int64 dest_index = IndexUtil::MultidimensionalIndexToLinearIndex(
+            result_shape, output_index);
+        int64 source_index = IndexUtil::MultidimensionalIndexToLinearIndex(
+            shape(), scratch_source_index);
+        memcpy(dest_data + primitive_size * dest_index,
+               source_data + primitive_size * source_index, primitive_size);
+        return true;
+      });
+
+  return std::move(result);
+}
+
 StatusOr<std::unique_ptr<Literal>> LiteralBase::Reshape(
     tensorflow::gtl::ArraySlice<int64> dimensions) const {
   if (!ShapeUtil::IsArray(shape())) {
@@ -946,6 +987,23 @@ std::unique_ptr<Literal> LiteralBase::Transpose(
   return new_literal;
 }
 
+template <typename NativeT>
+std::unique_ptr<Literal> LiteralBase::SliceInternal(
+    const Shape& result_shape,
+    tensorflow::gtl::ArraySlice<int64> start_indices) const {
+  auto result_literal = MakeUnique<Literal>(result_shape);
+  DimensionVector new_indices(ShapeUtil::Rank(result_shape));
+  result_literal->EachCell<NativeT>(
+      [&](tensorflow::gtl::ArraySlice<int64> indices, NativeT /*value*/) {
+        for (int64 i = 0; i < ShapeUtil::Rank(result_shape); ++i) {
+          new_indices[i] = indices[i] + start_indices[i];
+        }
+        NativeT value = Get<NativeT>(new_indices);
+        result_literal->Set<NativeT>(indices, value);
+      });
+  return result_literal;
+}
+
 std::unique_ptr<Literal> LiteralBase::Slice(
     tensorflow::gtl::ArraySlice<int64> start_indices,
     tensorflow::gtl::ArraySlice<int64> limit_indices) const {
@@ -963,51 +1021,17 @@ std::unique_ptr<Literal> LiteralBase::Slice(
   const auto result_shape =
       ShapeUtil::MakeShapeWithLayout(shape().element_type(), result_dimensions,
                                      LayoutUtil::MinorToMajor(shape()));
-
-  auto result_literal = MakeUnique<Literal>(result_shape);
-
-  DimensionVector new_indices(ShapeUtil::Rank(result_shape));
   switch (result_shape.element_type()) {
     case F32:
-      result_literal->EachCell<float>(
-          [&](tensorflow::gtl::ArraySlice<int64> indices, float /*value*/) {
-            for (int64 i = 0; i < ShapeUtil::Rank(result_shape); ++i) {
-              new_indices[i] = indices[i] + start_indices[i];
-            }
-            float value = Get<float>(new_indices);
-            result_literal->Set<float>(indices, value);
-          });
-      return result_literal;
+      return SliceInternal<float>(result_shape, start_indices);
+    case BF16:
+      return SliceInternal<bfloat16>(result_shape, start_indices);
     case C64:
-      result_literal->EachCell<complex64>(
-          [&](tensorflow::gtl::ArraySlice<int64> indices, complex64 /*value*/) {
-            for (int64 i = 0; i < ShapeUtil::Rank(result_shape); ++i) {
-              new_indices[i] = indices[i] + start_indices[i];
-            }
-            complex64 value = Get<complex64>(new_indices);
-            result_literal->Set<complex64>(indices, value);
-          });
-      return result_literal;
+      return SliceInternal<complex64>(result_shape, start_indices);
     case S32:
-      result_literal->EachCell<int32>(
-          [&](tensorflow::gtl::ArraySlice<int64> indices, int32 /*value*/) {
-            for (int64 i = 0; i < ShapeUtil::Rank(result_shape); ++i) {
-              new_indices[i] = indices[i] + start_indices[i];
-            }
-            int32 value = Get<int32>(new_indices);
-            result_literal->Set<int32>(indices, value);
-          });
-      return result_literal;
+      return SliceInternal<int32>(result_shape, start_indices);
     case U32:
-      result_literal->EachCell<uint32>(
-          [&](tensorflow::gtl::ArraySlice<int64> indices, uint32 /*value*/) {
-            for (int64 i = 0; i < ShapeUtil::Rank(result_shape); ++i) {
-              new_indices[i] = indices[i] + start_indices[i];
-            }
-            uint32 value = Get<uint32>(new_indices);
-            result_literal->Set<uint32>(indices, value);
-          });
-      return result_literal;
+      return SliceInternal<uint32>(result_shape, start_indices);
     default:
       LOG(FATAL) << "not yet implemented: "
                  << PrimitiveType_Name(result_shape.element_type());
@@ -2317,28 +2341,28 @@ LiteralSlice::LiteralSlice(const LiteralBase& literal,
     : LiteralBase(), root_piece_(&literal.piece(view_root)) {}
 
 BorrowingLiteral::BorrowingLiteral(const char* src_buf_ptr, const Shape& shape)
-    : LiteralBase(), shape_(shape) {
-  CHECK(ShapeUtil::IsArray(shape_));
+    : LiteralBase(), shape_(MakeUnique<Shape>(shape)) {
+  CHECK(ShapeUtil::IsArray(*shape_));
   CHECK_NE(src_buf_ptr, nullptr);
-  CHECK(LayoutUtil::HasLayout(shape_));
+  CHECK(LayoutUtil::HasLayout(*shape_));
 
   root_piece_ = Piece();
   root_piece_.set_buffer(const_cast<char*>(src_buf_ptr));
-  root_piece_.set_subshape(&shape_);
+  root_piece_.set_subshape(shape_.get());
 }
 
 BorrowingLiteral::BorrowingLiteral(
     tensorflow::gtl::ArraySlice<const char*> src_buf_ptrs, const Shape& shape)
-    : LiteralBase(), shape_(shape) {
-  CHECK(ShapeUtil::IsTuple(shape_));
-  CHECK(!ShapeUtil::IsNestedTuple(shape_));
-  CHECK_EQ(src_buf_ptrs.size(), ShapeUtil::TupleElementCount(shape_));
+    : LiteralBase(), shape_(MakeUnique<Shape>(shape)) {
+  CHECK(ShapeUtil::IsTuple(*shape_));
+  CHECK(!ShapeUtil::IsNestedTuple(*shape_));
+  CHECK_EQ(src_buf_ptrs.size(), ShapeUtil::TupleElementCount(*shape_));
   root_piece_ = Piece();
-  root_piece_.set_subshape(&shape_);
-  BuildPieceSubtree(shape_, &root_piece_);
+  root_piece_.set_subshape(shape_.get());
+  BuildPieceSubtree(*shape_, &root_piece_);
 
   for (int i = 0; i < src_buf_ptrs.size(); ++i) {
-    const auto& src_shape = shape_.tuple_shapes(i);
+    const auto& src_shape = shape_->tuple_shapes(i);
     CHECK(ShapeUtil::IsArray(src_shape));
     root_piece_.child(i).set_buffer(const_cast<char*>(src_buf_ptrs[i]));
   }
diff --git a/tensorflow/compiler/xla/literal_util.h b/tensorflow/compiler/xla/literal_util.h
index 609dc7a3aca646a5bb787487de101ac115df8ea5..8e4159e360e042beb31a75c432a3c7dfa7356007 100644
--- a/tensorflow/compiler/xla/literal_util.h
+++ b/tensorflow/compiler/xla/literal_util.h
@@ -277,6 +277,12 @@ class LiteralBase {
   StatusOr<std::unique_ptr<Literal>> Reshape(
       tensorflow::gtl::ArraySlice<int64> dimensions) const;
 
+  // Creates a new literal by broadcasting this literal with `dimensions` to
+  // yield a literal of shape `result_shape`.
+  StatusOr<std::unique_ptr<Literal>> Broadcast(
+      const Shape& result_shape,
+      tensorflow::gtl::ArraySlice<int64> dimensions) const;
+
   // Creates a new literal by reordering the dimensions of this literal.
   // The given `permutation` must be a permutation of the dimension numbers
   // in the original literal, and it specifies the order of the new dimensions
@@ -536,6 +542,12 @@ class LiteralBase {
   friend class Literal;
   friend class LiteralSlice;
   friend class BorrowingLiteral;
+
+ private:
+  template <typename NativeT>
+  std::unique_ptr<Literal> SliceInternal(
+      const Shape& result_shape,
+      tensorflow::gtl::ArraySlice<int64> start_indices) const;
 };
 
 // Class representing literal values in XLA.
@@ -1087,8 +1099,10 @@ class BorrowingLiteral : public LiteralBase {
   const Piece& root_piece() const override { return root_piece_; };
   Piece root_piece_;
 
-  // Shape of this literal.
-  const Shape shape_;
+  // Shape of this literal. Stored as unique_ptr so such that the (default)
+  // move construction of this class would be trivially correct: the pointer to
+  // Shape root_piece_ stores will still point to the correct address.
+  std::unique_ptr<Shape> shape_;
 };
 
 template <typename NativeT>
diff --git a/tensorflow/compiler/xla/literal_util_test.cc b/tensorflow/compiler/xla/literal_util_test.cc
index 77f979a0d701f09162e112b69f6128008872aa18..53b926163c472c3ed7b72bf8b035d13996d59e34 100644
--- a/tensorflow/compiler/xla/literal_util_test.cc
+++ b/tensorflow/compiler/xla/literal_util_test.cc
@@ -1431,7 +1431,7 @@ TEST_F(LiteralUtilTest, LiteralSliceOfALiteralSlice) {
   EXPECT_EQ(matrix_view, *Literal::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}}));
 }
 
-TEST_F(LiteralUtilTest, BorrowingLiteralFromOneBufferPtrTest) {
+TEST_F(LiteralUtilTest, BorrowingLiteralFromOneBufferPtr) {
   std::vector<int64> int64_values = {1, 2, 3};
   const Shape literal_shape = ShapeUtil::MakeShape(S64, {3});
 
@@ -1443,7 +1443,7 @@ TEST_F(LiteralUtilTest, BorrowingLiteralFromOneBufferPtrTest) {
   EXPECT_EQ(literal.Get<int64>({2}), 3);
 }
 
-TEST_F(LiteralUtilTest, BorrowingLiteralFromMultipleBufferPtrsTest) {
+TEST_F(LiteralUtilTest, BorrowingLiteralFromMultipleBufferPtrs) {
   std::vector<int64> one_two_three = {1, 2, 3};
   const Shape one_two_three_shape = ShapeUtil::MakeShape(S64, {3});
 
@@ -1810,5 +1810,35 @@ TEST_F(LiteralUtilTest, GetSparseElementAsString) {
       tensorflow::strings::StrCat("(", float{3.0}, ", ", float{4.0}, ")"));
 }
 
+TEST_F(LiteralUtilTest, BroadcastVectorToMatrix0) {
+  std::unique_ptr<Literal> literal = Literal::CreateR1<int64>({1, 2});
+  TF_ASSERT_OK_AND_ASSIGN(
+      std::unique_ptr<Literal> broadcasted_literal,
+      literal->Broadcast(
+          /*result_shape=*/ShapeUtil::MakeShape(S64, {2, 2}),
+          /*dimensions=*/{0}));
+  EXPECT_EQ(*broadcasted_literal, *Literal::CreateR2<int64>({{1, 1}, {2, 2}}));
+}
+
+TEST_F(LiteralUtilTest, BroadcastVectorToMatrix1) {
+  std::unique_ptr<Literal> literal = Literal::CreateR1<int64>({1, 2});
+  TF_ASSERT_OK_AND_ASSIGN(
+      std::unique_ptr<Literal> broadcasted_literal,
+      literal->Broadcast(
+          /*result_shape=*/ShapeUtil::MakeShape(S64, {2, 2}),
+          /*dimensions=*/{1}));
+  EXPECT_EQ(*broadcasted_literal, *Literal::CreateR2<int64>({{1, 2}, {1, 2}}));
+}
+
+TEST_F(LiteralUtilTest, BroadcastScalarToMatrix) {
+  std::unique_ptr<Literal> literal = Literal::CreateR0<int32>(9);
+  TF_ASSERT_OK_AND_ASSIGN(
+      std::unique_ptr<Literal> broadcasted_literal,
+      literal->Broadcast(
+          /*result_shape=*/ShapeUtil::MakeShape(S32, {2, 2}),
+          /*dimensions=*/{}));
+  EXPECT_EQ(*broadcasted_literal, *Literal::CreateR2<int32>({{9, 9}, {9, 9}}));
+}
+
 }  // namespace
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/python/BUILD b/tensorflow/compiler/xla/python/BUILD
index 932cce943f7c046a85984e6e5ed6b59dae371473..83834c1ff65ea2f9989fe08279c29056d9070adb 100644
--- a/tensorflow/compiler/xla/python/BUILD
+++ b/tensorflow/compiler/xla/python/BUILD
@@ -12,6 +12,7 @@ py_library(
     deps = [
         ":pywrap_xla",
         "//tensorflow/compiler/xla:xla_data_proto_py",
+        "//tensorflow/compiler/xla/service:hlo_proto_py",
     ],
 )
 
@@ -53,6 +54,7 @@ cc_library(
         "//tensorflow/compiler/xla/client:local_client",
         "//tensorflow/compiler/xla/client/xla_client:xla_builder",
         "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/service:hlo_proto",
         "//tensorflow/compiler/xla/service:shaped_buffer",
         "//tensorflow/core:framework_lite",
         "//tensorflow/core:lib",
diff --git a/tensorflow/compiler/xla/python/local_computation_builder.cc b/tensorflow/compiler/xla/python/local_computation_builder.cc
index cb4dc1782b680fca1485e883343fbb262b86b1d1..445cee1aa7b462f7ae2b6b0771ff57f0c8f3db99 100644
--- a/tensorflow/compiler/xla/python/local_computation_builder.cc
+++ b/tensorflow/compiler/xla/python/local_computation_builder.cc
@@ -20,7 +20,6 @@ limitations under the License.
 #include "tensorflow/core/platform/thread_annotations.h"
 
 namespace xla {
-
 namespace swig {
 
 // TODO(b/34473877) Ideally XLA would support AllReduce among arbitrary sets of
@@ -97,6 +96,36 @@ const ScopedShapedBuffer* LocalShapedBuffer::shaped_buffer() const {
   return &shaped_buffer_;
 }
 
+ShapedBuffer LocalShapedBuffer::Release() { return shaped_buffer_.release(); }
+
+LocalShapedBufferTuple::LocalShapedBufferTuple(
+    std::vector<LocalShapedBuffer*> elements)
+    : elements_(std::move(elements)) {
+  for (auto* element : elements_) {
+    DCHECK(element != nullptr);
+  }
+}
+
+LocalShapedBufferTuple::~LocalShapedBufferTuple() {
+  for (LocalShapedBuffer* element : elements_) {
+    if (element != nullptr) {
+      delete element;
+    }
+  }
+}
+
+StatusOr<LocalShapedBuffer*> LocalShapedBufferTuple::Release(int i) {
+  LocalShapedBuffer* element = elements_[i];
+  if (element == nullptr) {
+    return InvalidArgument("Attempted to release already-released element %d.",
+                           i);
+  }
+  elements_[i] = nullptr;
+  return element;
+}
+
+int LocalShapedBufferTuple::size() const { return elements_.size(); }
+
 static StatusOr<ScopedShapedBuffer> ToBuffer(LocalClient* client,
                                              int device_ordinal,
                                              const Literal& arg) {
@@ -276,6 +305,15 @@ const XlaComputation& LocalComputation::computation() const {
   return computation_;
 }
 
+string LocalComputation::GetSerializedProto() const {
+  string result;
+  if (!computation_.proto().SerializeToString(&result)) {
+    LOG(ERROR) << "Failed to serialize the HloModuleProto.";
+    return "";
+  }
+  return result;
+}
+
 StatusOr<Shape> LocalComputation::GetReturnValueShape() const {
   TF_ASSIGN_OR_RETURN(ProgramShape program_shape,
                       computation_.GetProgramShape());
@@ -589,10 +627,12 @@ _FORWARD_BINOP(Or)
 _FORWARD_UNOP(Not)
 _FORWARD_UNOP(Abs)
 _FORWARD_UNOP(Exp)
+_FORWARD_UNOP(Expm1)
 _FORWARD_UNOP(Floor)
 _FORWARD_UNOP(Ceil)
 _FORWARD_UNOP(Round)
 _FORWARD_UNOP(Log)
+_FORWARD_UNOP(Log1p)
 _FORWARD_UNOP(Sign)
 _FORWARD_UNOP(Cos)
 _FORWARD_UNOP(Sin)
@@ -622,6 +662,54 @@ void DeleteLocalComputation(LocalComputation* computation) {
   delete computation;
 }
 
-}  // namespace swig
+StatusOr<LocalShapedBufferTuple*> DestructureLocalShapedBufferTuple(
+    LocalShapedBuffer* local_shaped_buffer) {
+  if (!ShapeUtil::IsTuple(
+          local_shaped_buffer->shaped_buffer()->on_device_shape())) {
+    return InvalidArgument(
+        "Attemped to destructure a LocalShapedBuffer that did not have a tuple "
+        "shape; shape: %s",
+        ShapeUtil::HumanString(
+            local_shaped_buffer->shaped_buffer()->on_device_shape())
+            .c_str());
+  }
 
+  DeviceMemoryAllocator* allocator =
+      local_shaped_buffer->shaped_buffer()->memory_allocator();
+  ShapedBuffer tuple_buffer = local_shaped_buffer->Release();
+
+  // Extract some metadata we use to construct scoped buffers.
+  const se::Platform* platform = tuple_buffer.platform();
+  int device_ordinal = tuple_buffer.device_ordinal();
+
+  ShapeTree<se::DeviceMemoryBase>& shape_tree = tuple_buffer.buffers();
+  const Shape& tuple_shape = tuple_buffer.on_device_shape();
+  std::vector<LocalShapedBuffer*> results;
+  for (int64 i = 0; i < ShapeUtil::TupleElementCount(tuple_shape); ++i) {
+    // Create a shaped buffer for this destructured tuple element.
+    const Shape& subshape = ShapeUtil::GetSubshape(tuple_shape, {i});
+    VLOG(3) << "Starting tuple element " << i << " subshape: " << subshape;
+    ShapedBuffer shaped_buffer(subshape, subshape, platform, device_ordinal);
+
+    ShapeUtil::ForEachSubshape(
+        subshape, [&](const Shape& s, const ShapeIndex& index) {
+          ShapeIndex original(index);
+          original.push_front(i);
+          se::DeviceMemoryBase* device_memory =
+              shape_tree.mutable_element(original);
+          shaped_buffer.set_buffer(*device_memory, index);
+          *device_memory = se::DeviceMemoryBase();
+        });
+
+    VLOG(3) << "Completed tuple element: " << i;
+    results.push_back(new LocalShapedBuffer(
+        ScopedShapedBuffer(std::move(shaped_buffer), allocator)));
+  }
+  // Deallocate the root buffer.
+  se::DeviceMemoryBase root_buffer = tuple_buffer.root_buffer();
+  TF_RETURN_IF_ERROR(allocator->Deallocate(device_ordinal, root_buffer));
+  return new LocalShapedBufferTuple(std::move(results));
+}
+
+}  // namespace swig
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/python/local_computation_builder.h b/tensorflow/compiler/xla/python/local_computation_builder.h
index a06b85b4ea28c4f386598901138930eaaed12079..0da3964676e9c6729229686f38bb05c8b2427bff 100644
--- a/tensorflow/compiler/xla/python/local_computation_builder.h
+++ b/tensorflow/compiler/xla/python/local_computation_builder.h
@@ -26,7 +26,6 @@ limitations under the License.
 #include "tensorflow/core/lib/gtl/array_slice.h"
 
 namespace xla {
-
 namespace swig {
 
 // Initializes the number of replicas that XLA will be initialized with (when
@@ -69,10 +68,42 @@ class LocalShapedBuffer {
 
   StatusOr<std::unique_ptr<Literal> > ToLiteral() const;
 
+  // Transfers ownership of the encapsulated ShapedBuffer to the caller,
+  // analogous to std::unique_ptr::release().
+  ShapedBuffer Release();
+
  private:
   ScopedShapedBuffer shaped_buffer_;
 };
 
+// Result of a tuple destructuring operation on a LocalShapedBuffer -- this
+// appears to be a simpler mechanism for the time being than an alternative like
+// using SWIG to transform std::vectors into Python lists of SWIG objects
+// directly.
+class LocalShapedBufferTuple {
+ public:
+  // Note: any LocalShapedBuffer elements that are not Release()'d will be
+  // deallocated in the destructor.
+  explicit LocalShapedBufferTuple(std::vector<LocalShapedBuffer*> elements);
+
+  ~LocalShapedBufferTuple();
+
+  // Releases the ith element to the caller. Further attempts to release the ith
+  // element will return an invalid argument error.
+  StatusOr<LocalShapedBuffer*> Release(int i);
+
+  // Returns the number of elements in the destructured tuple.
+  int size() const;
+
+ private:
+  std::vector<LocalShapedBuffer*> elements_;
+};
+
+// Destructures a tuple-valued LocalShapedBuffer into its constitutent elements
+// in LocalShapedBufferTuple form.
+StatusOr<LocalShapedBufferTuple*> DestructureLocalShapedBufferTuple(
+    LocalShapedBuffer* local_shaped_buffer);
+
 // Wraps a LocalExecutable produced by compiling a
 // LocalComputation. The Execute method forwards to that of the
 // underlying LocalExecutable, and additionally handles tranferring
@@ -112,6 +143,11 @@ class LocalComputation {
 
   const XlaComputation& computation() const;
 
+  // Returns the HloModuleProto contained in the XlaComputation in the
+  // serialized binary format. Logs an internal error and returns an empty
+  // string on failure.
+  string GetSerializedProto() const;
+
   // Returns the return-value shape for this computation.
   StatusOr<Shape> GetReturnValueShape() const;
 
@@ -300,10 +336,12 @@ class LocalComputationBuilder {
   _FORWARD_UNOP(Not)
   _FORWARD_UNOP(Abs)
   _FORWARD_UNOP(Exp)
+  _FORWARD_UNOP(Expm1)
   _FORWARD_UNOP(Floor)
   _FORWARD_UNOP(Ceil)
   _FORWARD_UNOP(Round)
   _FORWARD_UNOP(Log)
+  _FORWARD_UNOP(Log1p)
   _FORWARD_UNOP(Sign)
   _FORWARD_UNOP(Cos)
   _FORWARD_UNOP(Sin)
@@ -331,7 +369,6 @@ void DeleteCompiledLocalComputation(CompiledLocalComputation* computation);
 void DeleteLocalComputation(LocalComputation* computation);
 
 }  // namespace swig
-
 }  // namespace xla
 
 #endif  // TENSORFLOW_COMPILER_XLA_PYTHON_LOCAL_COMPUTATION_BUILDER_H_
diff --git a/tensorflow/compiler/xla/python/local_computation_builder.i b/tensorflow/compiler/xla/python/local_computation_builder.i
index 04c56bbba95fbf3248df6c49700ff563c8b253c0..477df6fde25d0db760e08df9d335bd12e31ccb55 100644
--- a/tensorflow/compiler/xla/python/local_computation_builder.i
+++ b/tensorflow/compiler/xla/python/local_computation_builder.i
@@ -200,6 +200,20 @@ tensorflow::ImportNumpy();
   }
 }
 
+%typemap(out) StatusOr<xla::swig::LocalShapedBufferTuple*> {
+  if ($1.ok()) {
+    auto* value = $1.ValueOrDie();
+    {
+      auto* $1 = value;
+      $typemap(out, xla::swig::LocalShapedBufferTuple*)
+    }
+  } else {
+    PyErr_SetString(PyExc_RuntimeError, $1.status().ToString().c_str());
+    SWIG_fail;
+  }
+}
+
+
 %typemap(out) StatusOr< std::unique_ptr<Literal> > {
   if ($1.ok()) {
     std::unique_ptr<Literal> value = $1.ConsumeValueOrDie();
@@ -851,6 +865,11 @@ tensorflow::ImportNumpy();
     })) {
       return nullptr;
     }
+    if (!HandleStringAttribute($input, "dump_unoptimized_hlo_proto_to", [&](string s) {
+      build_options.set_dump_unoptimized_hlo_proto_to(std::move(s));
+    })) {
+      return nullptr;
+    }
     if (!HandleStringAttribute($input, "dump_per_pass_hlo_proto_to", [&](string s) {
       build_options.set_dump_per_pass_hlo_proto_to(std::move(s));
     })) {
@@ -900,12 +919,16 @@ tensorflow::ImportNumpy();
 %unignore xla::swig::LocalShapedBuffer;
 %unignore xla::swig::LocalShapedBuffer::FromLiteral;
 %unignore xla::swig::LocalShapedBuffer::ToLiteral;
+%unignore xla::swig::LocalShapedBufferTuple;
+%unignore xla::swig::LocalShapedBufferTuple::Release;
+%unignore xla::swig::LocalShapedBufferTuple::size;
 %unignore xla::swig::CompiledLocalComputation;
 %unignore xla::swig::CompiledLocalComputation::Execute;
 %unignore xla::swig::CompiledLocalComputation::ExecuteWithShapedBuffers;
 %unignore xla::swig::LocalComputation;
 %unignore xla::swig::LocalComputation::Compile;
 %unignore xla::swig::LocalComputation::GetReturnValueShape;
+%unignore xla::swig::LocalComputation::GetSerializedProto;
 %unignore xla::swig::LocalOp;
 %unignore xla::swig::LocalComputationBuilder;
 %unignore xla::swig::LocalComputationBuilder::LocalComputationBuilder;
@@ -968,10 +991,12 @@ tensorflow::ImportNumpy();
 %unignore xla::swig::LocalComputationBuilder::Not;
 %unignore xla::swig::LocalComputationBuilder::Abs;
 %unignore xla::swig::LocalComputationBuilder::Exp;
+%unignore xla::swig::LocalComputationBuilder::Expm1;
 %unignore xla::swig::LocalComputationBuilder::Floor;
 %unignore xla::swig::LocalComputationBuilder::Ceil;
 %unignore xla::swig::LocalComputationBuilder::Round;
 %unignore xla::swig::LocalComputationBuilder::Log;
+%unignore xla::swig::LocalComputationBuilder::Log1p;
 %unignore xla::swig::LocalComputationBuilder::Sign;
 %unignore xla::swig::LocalComputationBuilder::Cos;
 %unignore xla::swig::LocalComputationBuilder::Sin;
@@ -983,6 +1008,7 @@ tensorflow::ImportNumpy();
 %unignore xla::swig::LocalComputationBuilder::ReciprocalF32;
 %unignore xla::swig::LocalComputationBuilder::Neg;
 %unignore xla::swig::LocalComputationBuilder::Sort;
+%unignore xla::swig::DestructureLocalShapedBufferTuple;
 %unignore xla::swig::DeleteLocalShapedBuffer;
 %unignore xla::swig::DeleteLocalComputation;
 %unignore xla::swig::DeleteCompiledLocalComputation;
diff --git a/tensorflow/compiler/xla/python/xla_client.py b/tensorflow/compiler/xla/python/xla_client.py
index 1d5b75d1bee2dcee3e448d0bcb72103b539efac6..c025127c3cf1871d4def1297ed36c046cae61d4b 100644
--- a/tensorflow/compiler/xla/python/xla_client.py
+++ b/tensorflow/compiler/xla/python/xla_client.py
@@ -28,6 +28,7 @@ import numpy as np
 
 from tensorflow.compiler.xla import xla_data_pb2
 from tensorflow.compiler.xla.python import pywrap_xla as c_api
+from tensorflow.compiler.xla.service import hlo_pb2
 
 
 # Most functions are snake_case for consistency with other modules, whereas
@@ -88,10 +89,12 @@ _UNARY_OPS = [
     'Not',
     'Abs',
     'Exp',
+    'Expm1',
     'Floor',
     'Round',
     'Ceil',
     'Log',
+    'Log1p',
     'Sign',
     'Cos',
     'Sin',
@@ -183,6 +186,14 @@ class LocalBuffer(object):
       self._delete(self.c_local_shaped_buffer)
       self.c_local_shaped_buffer = None
 
+  def destructure(self):
+    assert self.c_local_shaped_buffer is not None
+    result = c_api.DestructureLocalShapedBufferTuple(self.c_local_shaped_buffer)
+    self.c_local_shaped_buffer = None
+    size = result.size()
+    destructured = tuple(LocalBuffer(result.Release(i)) for i in xrange(size))
+    return destructured
+
   def is_deleted(self):
     return self.c_local_shaped_buffer is None
 
@@ -352,6 +363,7 @@ class CompileOptions(object):
   def __init__(self):
     self.generate_hlo_graph = None
     self.dump_optimized_hlo_proto_to = None
+    self.dump_unoptimized_hlo_proto_to = None
     self.dump_per_pass_hlo_proto_to = None
     self.hlo_profile = False
 
@@ -410,6 +422,17 @@ class LocalComputation(object):
       assert isinstance(c_local_computation, c_api.LocalComputation)
       self._delete = c_api.DeleteLocalComputation
 
+  def GetProto(self):
+    """Get the HloModuleProto proto object in this local computation.
+
+    Returns:
+       An HloModuleProto proto object that has the whole-graph information.
+    """
+
+    serialized = self.c_local_computation.GetSerializedProto()
+    proto = hlo_pb2.HloModuleProto.FromString(serialized)
+    return proto
+
   def Compile(self, argument_shapes=(), compile_options=None, layout_fn=None):
     """Compiles an un-compiled local computation.
 
@@ -1100,6 +1123,61 @@ class ComputationBuilder(object):
     dimension_numbers.output_spatial_dimensions.extend(range(2, 2 + nd))
     return dimension_numbers
 
+  def ConvGeneralDilated(self, lhs, rhs, window_strides, padding, lhs_dilation,
+                         rhs_dilation, dimension_numbers):
+    """Enqueues a ConvGeneralDilated operation onto the computation.
+
+    Args:
+      lhs: LocalOp for the rank N+2 array of inputs.
+      rhs: LocalOp for the rank N+2 array of kernel weights.
+      window_strides: length-N array-like of integer kernel strides.
+      padding: length-N array-like of pairs of integers of (low, high) padding.
+      lhs_dilation: length-N array-like of integer dilation factors.
+      rhs_dilation: length-N array-like of integer dilation factors.
+      dimension_numbers: either an xla_data_pb2.ConvolutionDimensionNumbers or a
+        triple (lhs_spec, rhs_spec, out_spec) where each element is a string of
+        length N+2 identifying by position (1) batch dimensions in lhs, rhs, and
+        the output with the character 'N', (2) feature dimensions in lhs and the
+        output with the character 'C', (3) input and output feature dimensions
+        in rhs with the characters 'I' and 'O' respectively, and (4) spatial
+        dimension correspondences between lhs, rhs, and the output using any
+        distinct characters. For example, to indicate dimension numbers
+        consistent with the Conv operation with two spatial dimensions, one
+        could use ('NCHW', 'OIHW', 'NCHW'). As another example, to indicate
+        dimension numbers consistent with the TensorFlow Conv2D operation, one
+        could use ('NHWC', 'HWIO', 'NHWC'). When using the latter form of
+        convolution dimension specification, window strides are associated with
+        spatial dimension character labels according to the order in which the
+        labels appear in the rhs_spec string, so that window_strides[0] is
+        matched with the dimension corresponding to the first character
+        appearing in rhs_spec that is not 'I' or 'O'.
+
+    Returns: a LocalOp representing the ConvGenralDilated operation.
+    """
+    if not isinstance(dimension_numbers,
+                      xla_data_pb2.ConvolutionDimensionNumbers):
+      lhs_spec, rhs_spec, out_spec = dimension_numbers
+      dimension_numbers = xla_data_pb2.ConvolutionDimensionNumbers()
+
+      dimension_numbers.input_batch_dimension = lhs_spec.index('N')
+      dimension_numbers.input_feature_dimension = lhs_spec.index('C')
+      dimension_numbers.output_batch_dimension = out_spec.index('N')
+      dimension_numbers.output_feature_dimension = out_spec.index('C')
+      dimension_numbers.kernel_output_feature_dimension = rhs_spec.index('O')
+      dimension_numbers.kernel_input_feature_dimension = rhs_spec.index('I')
+
+      dimension_numbers.kernel_spatial_dimensions.extend(
+          i for i, c in enumerate(rhs_spec) if c not in {'I', 'O'})
+      dimension_numbers.input_spatial_dimensions.extend(
+          sorted((i for i, c in enumerate(lhs_spec) if c not in {'N', 'C'}),
+                 key=lambda i: rhs_spec.index(lhs_spec[i])))
+      dimension_numbers.output_spatial_dimensions.extend(
+          sorted((i for i, c in enumerate(out_spec) if c not in {'N', 'C'}),
+                 key=lambda i: rhs_spec.index(out_spec[i])))
+    return self._client.ConvGeneralDilated(lhs, rhs, window_strides, padding,
+                                           lhs_dilation, rhs_dilation,
+                                           dimension_numbers)
+
 
 def _forward_methods_to_local_builder():
   """Forward remaining ComputationBuilder methods to the C API.
diff --git a/tensorflow/compiler/xla/python/xla_client_test.py b/tensorflow/compiler/xla/python/xla_client_test.py
index c073c02040e4d260cf760ea2b25f70d60ddd41a1..71e1d60a4e23dbfef333223c396e109533da9365 100644
--- a/tensorflow/compiler/xla/python/xla_client_test.py
+++ b/tensorflow/compiler/xla/python/xla_client_test.py
@@ -164,6 +164,16 @@ class ComputationsWithConstantsTest(LocalComputationTest):
         c.Constant(NumpyArrayF32([[1, -1, 1], [-1, 1, -1]])))
     self._ExecuteAndCompareClose(c, expected=[[2, 1, 4], [3, 6, 5]])
 
+  def testGetProto(self):
+    c = self._NewComputation()
+    c.Add(
+        c.Constant(NumpyArrayF32([[1, 2, 3], [4, 5, 6]])),
+        c.Constant(NumpyArrayF32([[1, -1, 1], [-1, 1, -1]])))
+    built = c.Build()
+    proto = built.GetProto()  # HloModuleProto
+    self.assertTrue(len(proto.computations) == 1)
+    self.assertTrue(len(proto.computations[0].instructions) == 3)
+
   def testSum2DF64(self):
     c = self._NewComputation()
     c.Add(
@@ -355,6 +365,55 @@ class LocalBufferTest(LocalComputationTest):
     with self.assertRaises(ValueError):
       compiled_c.ExecuteWithLocalBuffers([arg_buffer])
 
+  def testDestructureTupleEmpty(self):
+    t = ()
+    local_buffer = xla_client.LocalBuffer.from_pyval(t)
+    pieces = local_buffer.destructure()
+    self.assertTrue(local_buffer.is_deleted())
+    self.assertEqual(len(pieces), 0)
+
+  def testDestructureTupleOneArrayElement(self):
+    t = (np.array([1, 2, 3, 4], dtype=np.int32),)
+    local_buffer = xla_client.LocalBuffer.from_pyval(t)
+    pieces = local_buffer.destructure()
+    self.assertTrue(local_buffer.is_deleted())
+    self.assertEqual(len(pieces), 1)
+    array = pieces[0]
+    got = array.to_py()
+    want = NumpyArrayS32([1, 2, 3, 4])
+    np.testing.assert_equal(want, got)
+
+  def testDestructureTupleTwoArrayElementDifferentType(self):
+    t = (np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32),
+         np.array([2, 3, 4, 5], dtype=np.int32))
+    local_buffer = xla_client.LocalBuffer.from_pyval(t)
+    pieces = local_buffer.destructure()
+    self.assertTrue(local_buffer.is_deleted())
+    self.assertEqual(len(pieces), 2)
+    array0, array1 = pieces
+    got = array0.to_py()
+    want = NumpyArrayF32([1.0, 2.0, 3.0, 4.0])
+    np.testing.assert_equal(want, got)
+    got = array1.to_py()
+    want = NumpyArrayS32([2, 3, 4, 5])
+    np.testing.assert_equal(want, got)
+
+  def testDestructureTupleNested(self):
+    t = ((NumpyArrayF32([1.0, 2.0]), NumpyArrayS32([3, 4])), NumpyArrayS32([5]))
+    local_buffer = xla_client.LocalBuffer.from_pyval(t)
+    pieces = local_buffer.destructure()
+    self.assertTrue(local_buffer.is_deleted())
+    self.assertEqual(len(pieces), 2)
+    tuple0, array1 = pieces
+    got = array1.to_py()
+    want = NumpyArrayS32([5])
+    np.testing.assert_equal(want, got)
+    got = tuple0.to_py()
+    self.assertEqual(type(got), tuple)
+    self.assertEqual(len(got), 2)
+    np.testing.assert_equal(NumpyArrayF32([1.0, 2.0]), got[0])
+    np.testing.assert_equal(NumpyArrayS32([3, 4]), got[1])
+
 
 class SingleOpTest(LocalComputationTest):
   """Tests for single ops.
@@ -509,6 +568,46 @@ class SingleOpTest(LocalComputationTest):
                          [40., 50., 0.]]]])
     self._ExecuteAndCompareClose(c, expected=result)
 
+  def testConvGeneralDilatedF32(self):
+    c = self._NewComputation()
+    a = lambda *dims: np.arange(np.prod(dims)).reshape(dims).astype("float32")
+    lhs = a(1, 1, 2, 3)
+    rhs = a(1, 1, 1, 2) * 10
+    strides = [1, 1]
+    pads = [(1, 0), (0, 1)]
+    lhs_dilation = (2, 1)
+    rhs_dilation = (1, 1)
+    dimension_numbers = ("NCHW", "OIHW", "NCHW")
+    c.ConvGeneralDilated(c.Constant(lhs), c.Constant(rhs),
+                         strides, pads, lhs_dilation, rhs_dilation,
+                         dimension_numbers)
+    result = np.array([[[[0., 0., 0.],
+                         [10., 20., 0.],
+                         [0., 0., 0.],
+                         [40., 50., 0.]]]])
+    self._ExecuteAndCompareClose(c, expected=result)
+
+  def testConvGeneralDilatedPermutedF32(self):
+    c = self._NewComputation()
+    a = lambda *dims: np.arange(np.prod(dims)).reshape(dims).astype("float32")
+    lhs = a(1, 1, 2, 3)
+    rhs = a(1, 1, 1, 2) * 10
+    strides = [1, 1]
+    pads = [(1, 0), (0, 1)]
+    lhs_dilation = (2, 1)
+    rhs_dilation = (1, 1)
+
+    dimension_numbers = ("NHWC", "OIHW", "CWNH")
+    c.ConvGeneralDilated(c.Constant(np.transpose(lhs, (0, 2, 3, 1))),
+                         c.Constant(rhs),
+                         strides, pads, lhs_dilation, rhs_dilation,
+                         dimension_numbers)
+    result = np.array([[[[0., 0., 0.],
+                         [10., 20., 0.],
+                         [0., 0., 0.],
+                         [40., 50., 0.]]]])
+    self._ExecuteAndCompareClose(c, expected=np.transpose(result, (1, 3, 0, 2)))
+
   def testBooleanNot(self):
     c = self._NewComputation()
     arr = NumpyArrayBool([True, False, True])
@@ -521,6 +620,12 @@ class SingleOpTest(LocalComputationTest):
     c.Exp(c.Constant(arr))
     self._ExecuteAndCompareClose(c, expected=np.exp(arr))
 
+  def testExpm1(self):
+    c = self._NewComputation()
+    arr = NumpyArrayF32([3.3, 12.1])
+    c.Expm1(c.Constant(arr))
+    self._ExecuteAndCompareClose(c, expected=np.expm1(arr))
+
   def testRound(self):
     c = self._NewComputation()
     arr = NumpyArrayF32([3.3, 12.1])
@@ -533,6 +638,12 @@ class SingleOpTest(LocalComputationTest):
     c.Log(c.Constant(arr))
     self._ExecuteAndCompareClose(c, expected=np.log(arr))
 
+  def testLog1p(self):
+    c = self._NewComputation()
+    arr = NumpyArrayF32([3.3, 12.1])
+    c.Log1p(c.Constant(arr))
+    self._ExecuteAndCompareClose(c, expected=np.log1p(arr))
+
   def testNeg(self):
     c = self._NewComputation()
     arr = NumpyArrayF32([3.3, 12.1])
diff --git a/tensorflow/compiler/xla/reference_util.h b/tensorflow/compiler/xla/reference_util.h
index 2698ba7d79e246530b6b486d3e3bc8bf101c891e..8fa6961d197dce519cf151283b8bc0836a4615c0 100644
--- a/tensorflow/compiler/xla/reference_util.h
+++ b/tensorflow/compiler/xla/reference_util.h
@@ -265,9 +265,9 @@ class ReferenceUtil {
                                               const Array3D<T>& rhs,
                                               int concatenate_dimension) {
     CHECK(0 <= concatenate_dimension && concatenate_dimension < 3);
-    std::vector<int64> lhs_dims = {lhs.n1(), lhs.n2(), lhs.n3()};
-    std::vector<int64> rhs_dims = {rhs.n1(), rhs.n2(), rhs.n3()};
-    std::vector<int64> out_dims = {rhs.n1(), rhs.n2(), rhs.n3()};
+    const int64 lhs_dims[] = {lhs.n1(), lhs.n2(), lhs.n3()};
+    const int64 rhs_dims[] = {rhs.n1(), rhs.n2(), rhs.n3()};
+    int64 out_dims[] = {rhs.n1(), rhs.n2(), rhs.n3()};
     for (int i = 0; i < 3; ++i) {
       if (i != concatenate_dimension) {
         out_dims[i] = lhs_dims[i];
@@ -299,9 +299,9 @@ class ReferenceUtil {
                                               const Array4D<T>& rhs,
                                               int concatenate_dimension) {
     CHECK(0 <= concatenate_dimension && concatenate_dimension < 4);
-    std::vector<int64> lhs_dims = {lhs.n1(), lhs.n2(), lhs.n3(), lhs.n4()};
-    std::vector<int64> rhs_dims = {rhs.n1(), rhs.n2(), rhs.n3(), rhs.n4()};
-    std::vector<int64> out_dims = {rhs.n1(), rhs.n2(), rhs.n3(), rhs.n4()};
+    const int64 lhs_dims[] = {lhs.n1(), lhs.n2(), lhs.n3(), lhs.n4()};
+    const int64 rhs_dims[] = {rhs.n1(), rhs.n2(), rhs.n3(), rhs.n4()};
+    int64 out_dims[] = {rhs.n1(), rhs.n2(), rhs.n3(), rhs.n4()};
     for (int i = 0; i < 4; ++i) {
       if (i != concatenate_dimension) {
         out_dims[i] = lhs_dims[i];
@@ -553,12 +553,11 @@ class ReferenceUtil {
                                      const NativeT pad) {
     CHECK_EQ(padding.dimensions_size(), 3);
 
-    const std::vector<int64> input_bounds = {operand.n1(), operand.n2(),
-                                             operand.n3()};
-    std::vector<int64> pad_low(3);
-    std::vector<int64> pad_high(3);
-    std::vector<int64> pad_interior(3);
-    std::vector<int64> output_bounds(3);
+    const int64 input_bounds[] = {operand.n1(), operand.n2(), operand.n3()};
+    int64 pad_low[3];
+    int64 pad_high[3];
+    int64 pad_interior[3];
+    int64 output_bounds[3];
     for (int64 i = 0; i < 3; ++i) {
       pad_low[i] = padding.dimensions(i).edge_padding_low();
       pad_high[i] = padding.dimensions(i).edge_padding_high();
@@ -574,7 +573,7 @@ class ReferenceUtil {
 
     Array3D<NativeT> result(output_bounds[0], output_bounds[1],
                             output_bounds[2]);
-    std::vector<int> indices = {0, 0, 0};
+    int indices[] = {0, 0, 0};
     for (indices[0] = 0; indices[0] < output_bounds[0]; ++indices[0]) {
       for (indices[1] = 0; indices[1] < output_bounds[1]; ++indices[1]) {
         for (indices[2] = 0; indices[2] < output_bounds[2]; ++indices[2]) {
@@ -612,12 +611,12 @@ class ReferenceUtil {
                                      const NativeT pad) {
     CHECK_EQ(padding.dimensions_size(), 4);
 
-    const std::vector<int64> input_bounds = {operand.n1(), operand.n2(),
-                                             operand.n3(), operand.n4()};
-    std::vector<int64> pad_low(4);
-    std::vector<int64> pad_high(4);
-    std::vector<int64> pad_interior(4);
-    std::vector<int64> output_bounds(4);
+    const int64 input_bounds[] = {operand.n1(), operand.n2(), operand.n3(),
+                                  operand.n4()};
+    int64 pad_low[4];
+    int64 pad_high[4];
+    int64 pad_interior[4];
+    int64 output_bounds[4];
     for (int64 i = 0; i < 4; ++i) {
       pad_low[i] = padding.dimensions(i).edge_padding_low();
       pad_high[i] = padding.dimensions(i).edge_padding_high();
diff --git a/tensorflow/compiler/xla/rpc/BUILD b/tensorflow/compiler/xla/rpc/BUILD
index 0d56a9a477b15964ad45e798865aa8d2c7385073..1775666652f303ee095a11405537106a3eb9b056 100644
--- a/tensorflow/compiler/xla/rpc/BUILD
+++ b/tensorflow/compiler/xla/rpc/BUILD
@@ -42,7 +42,7 @@ tf_cc_binary(
         "//tensorflow/compiler/xla/service:cpu_plugin",
         "//tensorflow/core:framework_internal",
         "//tensorflow/core:lib",
-        "@grpc//:grpc++_unsecure",
+        "@grpc//:grpc++",
     ],
 )
 
@@ -61,7 +61,7 @@ tf_cc_test(
         "//tensorflow/core:lib",
         "//tensorflow/core:test",
         "//tensorflow/core:test_main",
-        "@grpc//:grpc++_unsecure",
+        "@grpc//:grpc++",
     ],
 )
 
@@ -74,6 +74,6 @@ cc_library(
         "//tensorflow/compiler/xla/service",
         "//tensorflow/compiler/xla/service:platform_util",
         "//tensorflow/core/distributed_runtime/rpc:grpc_util",
-        "@grpc//:grpc++_unsecure",
+        "@grpc//:grpc++",
     ],
 )
diff --git a/tensorflow/compiler/xla/rpc/grpc_client_test.cc b/tensorflow/compiler/xla/rpc/grpc_client_test.cc
index 313f11a9a957155eb277dc02ba5d2565c87e0235..d7dd9786a2bbde2d18ae81a9a9d4cc4b2cc38411 100644
--- a/tensorflow/compiler/xla/rpc/grpc_client_test.cc
+++ b/tensorflow/compiler/xla/rpc/grpc_client_test.cc
@@ -20,8 +20,8 @@ limitations under the License.
 #include <memory>
 #include <vector>
 
-#include "grpc++/create_channel.h"
-#include "grpc++/security/credentials.h"
+#include "grpcpp/create_channel.h"
+#include "grpcpp/security/credentials.h"
 
 #include "tensorflow/compiler/xla/client/client.h"
 #include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
diff --git a/tensorflow/compiler/xla/rpc/grpc_service.cc b/tensorflow/compiler/xla/rpc/grpc_service.cc
index 5f4dc6bd08f18b50e60b173432d3d305759bccea..4e1435fa30a24c320ddbedb84d37b369a3158a54 100644
--- a/tensorflow/compiler/xla/rpc/grpc_service.cc
+++ b/tensorflow/compiler/xla/rpc/grpc_service.cc
@@ -32,19 +32,6 @@ namespace xla {
   return tensorflow::ToGrpcStatus(s);
 }
 
-::grpc::Status GRPCService::Computation(::grpc::ServerContext* context,
-                                        const ComputationRequest* arg,
-                                        ComputationResponse* result) {
-  return DelegateRPC(
-      [this, arg, result]() { return service_->Computation(arg, result); });
-}
-
-::grpc::Status GRPCService::CreateOp(::grpc::ServerContext* context,
-                                     const OpRequest* arg, OpResponse* result) {
-  return DelegateRPC(
-      [this, arg, result]() { return service_->Op(arg, result); });
-}
-
 ::grpc::Status GRPCService::Unregister(::grpc::ServerContext* context,
                                        const UnregisterRequest* arg,
                                        UnregisterResponse* result) {
@@ -60,21 +47,6 @@ namespace xla {
   });
 }
 
-::grpc::Status GRPCService::SetReturnValue(::grpc::ServerContext* context,
-                                           const SetReturnValueRequest* arg,
-                                           SetReturnValueResponse* results) {
-  return DelegateRPC([this, arg, results]() {
-    return service_->SetReturnValue(arg, results);
-  });
-}
-
-::grpc::Status GRPCService::Execute(::grpc::ServerContext* context,
-                                    const ExecuteRequest* arg,
-                                    ExecuteResponse* result) {
-  return DelegateRPC(
-      [this, arg, result]() { return service_->Execute(arg, result); });
-}
-
 ::grpc::Status GRPCService::ExecuteGraph(::grpc::ServerContext* /*context*/,
                                          const ExecuteGraphRequest* arg,
                                          ExecuteResponse* result) {
@@ -82,13 +54,6 @@ namespace xla {
       [this, arg, result]() { return service_->ExecuteGraph(arg, result); });
 }
 
-::grpc::Status GRPCService::ExecuteAsync(::grpc::ServerContext* context,
-                                         const ExecuteAsyncRequest* arg,
-                                         ExecuteAsyncResponse* result) {
-  return DelegateRPC(
-      [this, arg, result]() { return service_->ExecuteAsync(arg, result); });
-}
-
 ::grpc::Status GRPCService::WaitForExecution(::grpc::ServerContext* context,
                                              const WaitForExecutionRequest* arg,
                                              WaitForExecutionResponse* result) {
@@ -136,20 +101,6 @@ namespace xla {
       [this, arg, result]() { return service_->ResetDevice(arg, result); });
 }
 
-::grpc::Status GRPCService::IsConstant(::grpc::ServerContext* context,
-                                       const IsConstantRequest* arg,
-                                       IsConstantResponse* result) {
-  return DelegateRPC(
-      [this, arg, result]() { return service_->IsConstant(arg, result); });
-}
-
-::grpc::Status GRPCService::ComputeConstant(::grpc::ServerContext* context,
-                                            const ComputeConstantRequest* arg,
-                                            ComputeConstantResponse* result) {
-  return DelegateRPC(
-      [this, arg, result]() { return service_->ComputeConstant(arg, result); });
-}
-
 ::grpc::Status GRPCService::GetShape(::grpc::ServerContext* context,
                                      const GetShapeRequest* arg,
                                      GetShapeResponse* result) {
@@ -157,43 +108,4 @@ namespace xla {
       [this, arg, result]() { return service_->GetShape(arg, result); });
 }
 
-::grpc::Status GRPCService::GetComputationShape(
-    ::grpc::ServerContext* context, const GetComputationShapeRequest* arg,
-    GetComputationShapeResponse* result) {
-  return DelegateRPC([this, arg, result]() {
-    return service_->GetComputationShape(arg, result);
-  });
-}
-
-::grpc::Status GRPCService::GetLocalShape(::grpc::ServerContext* context,
-                                          const GetLocalShapeRequest* arg,
-                                          GetLocalShapeResponse* result) {
-  return DelegateRPC(
-      [this, arg, result]() { return service_->GetLocalShape(arg, result); });
-}
-
-::grpc::Status GRPCService::GetComputationStats(
-    ::grpc::ServerContext* context, const ComputationStatsRequest* arg,
-    ComputationStatsResponse* result) {
-  return DelegateRPC([this, arg, result]() {
-    return service_->GetComputationStats(arg, result);
-  });
-}
-
-::grpc::Status GRPCService::SnapshotComputation(
-    ::grpc::ServerContext* context, const SnapshotComputationRequest* arg,
-    SnapshotComputationResponse* result) {
-  return DelegateRPC([this, arg, result]() {
-    return service_->SnapshotComputation(arg, result);
-  });
-}
-
-::grpc::Status GRPCService::LoadComputationSnapshot(
-    ::grpc::ServerContext* context, const LoadComputationSnapshotRequest* arg,
-    LoadComputationSnapshotResponse* result) {
-  return DelegateRPC([this, arg, result]() {
-    return service_->LoadComputationSnapshot(arg, result);
-  });
-}
-
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/rpc/grpc_service.h b/tensorflow/compiler/xla/rpc/grpc_service.h
index 50f02796f2d45baf894841782cd96d8d51a5ba00..ca1b09b648013ad45d806040c5ddcf11d9e5604e 100644
--- a/tensorflow/compiler/xla/rpc/grpc_service.h
+++ b/tensorflow/compiler/xla/rpc/grpc_service.h
@@ -16,7 +16,7 @@ limitations under the License.
 #ifndef TENSORFLOW_COMPILER_XLA_RPC_GRPC_SERVICE_H_
 #define TENSORFLOW_COMPILER_XLA_RPC_GRPC_SERVICE_H_
 
-#include "grpc++/server_context.h"
+#include "grpcpp/server_context.h"
 #include "tensorflow/compiler/xla/rpc/xla_service.grpc.pb.h"
 #include "tensorflow/compiler/xla/service/service.h"
 
@@ -31,13 +31,6 @@ class GRPCService : public grpc::XlaService::Service {
   static StatusOr<std::unique_ptr<GRPCService>> NewService(
       se::Platform* platform = nullptr);
 
-  ::grpc::Status Computation(::grpc::ServerContext* context,
-                             const ComputationRequest* arg,
-                             ComputationResponse* result) override;
-
-  ::grpc::Status CreateOp(::grpc::ServerContext* context, const OpRequest* arg,
-                          OpResponse* result) override;
-
   ::grpc::Status Unregister(::grpc::ServerContext* context,
                             const UnregisterRequest* arg,
                             UnregisterResponse* result) override;
@@ -46,22 +39,10 @@ class GRPCService : public grpc::XlaService::Service {
                                   const DeconstructTupleRequest* arg,
                                   DeconstructTupleResponse* result) override;
 
-  ::grpc::Status SetReturnValue(::grpc::ServerContext* context,
-                                const SetReturnValueRequest* arg,
-                                SetReturnValueResponse* results) override;
-
-  ::grpc::Status Execute(::grpc::ServerContext* context,
-                         const ExecuteRequest* arg,
-                         ExecuteResponse* result) override;
-
   ::grpc::Status ExecuteGraph(::grpc::ServerContext* context,
                               const ExecuteGraphRequest* arg,
                               ExecuteResponse* result) override;
 
-  ::grpc::Status ExecuteAsync(::grpc::ServerContext* context,
-                              const ExecuteAsyncRequest* arg,
-                              ExecuteAsyncResponse* result) override;
-
   ::grpc::Status WaitForExecution(::grpc::ServerContext* context,
                                   const WaitForExecutionRequest* arg,
                                   WaitForExecutionResponse* result) override;
@@ -86,38 +67,10 @@ class GRPCService : public grpc::XlaService::Service {
                              const ResetDeviceRequest* arg,
                              ResetDeviceResponse* result) override;
 
-  ::grpc::Status IsConstant(::grpc::ServerContext* context,
-                            const IsConstantRequest* arg,
-                            IsConstantResponse* result) override;
-
-  ::grpc::Status ComputeConstant(::grpc::ServerContext* context,
-                                 const ComputeConstantRequest* arg,
-                                 ComputeConstantResponse* result) override;
-
   ::grpc::Status GetShape(::grpc::ServerContext* context,
                           const GetShapeRequest* arg,
                           GetShapeResponse* result) override;
 
-  ::grpc::Status GetComputationShape(
-      ::grpc::ServerContext* context, const GetComputationShapeRequest* arg,
-      GetComputationShapeResponse* result) override;
-
-  ::grpc::Status GetLocalShape(::grpc::ServerContext* context,
-                               const GetLocalShapeRequest* arg,
-                               GetLocalShapeResponse* result) override;
-
-  ::grpc::Status GetComputationStats(::grpc::ServerContext* context,
-                                     const ComputationStatsRequest* arg,
-                                     ComputationStatsResponse* result) override;
-
-  ::grpc::Status SnapshotComputation(
-      ::grpc::ServerContext* context, const SnapshotComputationRequest* arg,
-      SnapshotComputationResponse* result) override;
-
-  ::grpc::Status LoadComputationSnapshot(
-      ::grpc::ServerContext* context, const LoadComputationSnapshotRequest* arg,
-      LoadComputationSnapshotResponse* result) override;
-
  private:
   std::unique_ptr<::xla::Service> service_;
 
diff --git a/tensorflow/compiler/xla/rpc/grpc_service_main.cc b/tensorflow/compiler/xla/rpc/grpc_service_main.cc
index e29908ccec80db76e3b5b856e57382c56430c379..c68c857c304138ff4318e243f66547c6acce1005 100644
--- a/tensorflow/compiler/xla/rpc/grpc_service_main.cc
+++ b/tensorflow/compiler/xla/rpc/grpc_service_main.cc
@@ -15,9 +15,9 @@ limitations under the License.
 
 // Basic server binary that exposes a xla::Service through a GRPC interface
 // on a configurable port.
-#include "grpc++/security/server_credentials.h"
-#include "grpc++/server.h"
-#include "grpc++/server_builder.h"
+#include "grpcpp/security/server_credentials.h"
+#include "grpcpp/server.h"
+#include "grpcpp/server_builder.h"
 #include "tensorflow/compiler/xla/rpc/grpc_service.h"
 #include "tensorflow/core/lib/strings/stringprintf.h"
 #include "tensorflow/core/platform/init_main.h"
diff --git a/tensorflow/compiler/xla/rpc/grpc_stub.cc b/tensorflow/compiler/xla/rpc/grpc_stub.cc
index 620ac6cec4f76d938e57e87849066df59514938a..7b8ab158e1396d7087a407be180ab44d2e16e121 100644
--- a/tensorflow/compiler/xla/rpc/grpc_stub.cc
+++ b/tensorflow/compiler/xla/rpc/grpc_stub.cc
@@ -62,21 +62,6 @@ Status GRPCStub::ResetDevice(const ResetDeviceRequest* request,
   });
 }
 
-Status GRPCStub::LoadComputationSnapshot(
-    const LoadComputationSnapshotRequest* request,
-    LoadComputationSnapshotResponse* response) {
-  return MakeRPC([this, request, response](::grpc::ClientContext* context) {
-    return grpc_stub_->LoadComputationSnapshot(context, *request, response);
-  });
-}
-
-Status GRPCStub::Execute(const ExecuteRequest* request,
-                         ExecuteResponse* response) {
-  return MakeRPC([this, request, response](::grpc::ClientContext* context) {
-    return grpc_stub_->Execute(context, *request, response);
-  });
-}
-
 Status GRPCStub::ExecuteGraph(const ExecuteGraphRequest* request,
                               ExecuteResponse* response) {
   return MakeRPC([this, request, response](::grpc::ClientContext* context) {
@@ -84,13 +69,6 @@ Status GRPCStub::ExecuteGraph(const ExecuteGraphRequest* request,
   });
 }
 
-Status GRPCStub::ExecuteParallel(const ExecuteParallelRequest* request,
-                                 ExecuteParallelResponse* response) {
-  return MakeRPC([this, request, response](::grpc::ClientContext* context) {
-    return grpc_stub_->ExecuteParallel(context, *request, response);
-  });
-}
-
 Status GRPCStub::ExecuteGraphParallel(
     const ExecuteGraphParallelRequest* request,
     ExecuteParallelResponse* response) {
@@ -99,13 +77,6 @@ Status GRPCStub::ExecuteGraphParallel(
   });
 }
 
-Status GRPCStub::ExecuteAsync(const ExecuteAsyncRequest* request,
-                              ExecuteAsyncResponse* response) {
-  return MakeRPC([this, request, response](::grpc::ClientContext* context) {
-    return grpc_stub_->ExecuteAsync(context, *request, response);
-  });
-}
-
 Status GRPCStub::WaitForExecution(const WaitForExecutionRequest* request,
                                   WaitForExecutionResponse* response) {
   return MakeRPC([this, request, response](::grpc::ClientContext* context) {
@@ -120,13 +91,6 @@ Status GRPCStub::DeconstructTuple(const DeconstructTupleRequest* request,
   });
 }
 
-Status GRPCStub::GetComputationStats(const ComputationStatsRequest* request,
-                                     ComputationStatsResponse* response) {
-  return MakeRPC([this, request, response](::grpc::ClientContext* context) {
-    return grpc_stub_->GetComputationStats(context, *request, response);
-  });
-}
-
 Status GRPCStub::GetComputationGraphStats(
     const ComputationGraphStatsRequest* request,
     ComputationStatsResponse* response) {
@@ -135,13 +99,6 @@ Status GRPCStub::GetComputationGraphStats(
   });
 }
 
-Status GRPCStub::GetComputationShape(const GetComputationShapeRequest* request,
-                                     GetComputationShapeResponse* response) {
-  return MakeRPC([this, request, response](::grpc::ClientContext* context) {
-    return grpc_stub_->GetComputationShape(context, *request, response);
-  });
-}
-
 Status GRPCStub::GetShape(const GetShapeRequest* request,
                           GetShapeResponse* response) {
   return MakeRPC([this, request, response](::grpc::ClientContext* context) {
@@ -163,48 +120,6 @@ Status GRPCStub::CreateChannelHandle(const CreateChannelHandleRequest* request,
   });
 }
 
-// Methods used by ComputationBuilder.
-Status GRPCStub::Computation(const ComputationRequest* request,
-                             ComputationResponse* response) {
-  return MakeRPC([this, request, response](::grpc::ClientContext* context) {
-    return grpc_stub_->Computation(context, *request, response);
-  });
-}
-
-Status GRPCStub::Op(const OpRequest* request, OpResponse* response) {
-  return MakeRPC([this, request, response](::grpc::ClientContext* context) {
-    return grpc_stub_->CreateOp(context, *request, response);
-  });
-}
-
-Status GRPCStub::GetLocalShape(const GetLocalShapeRequest* request,
-                               GetLocalShapeResponse* response) {
-  return MakeRPC([this, request, response](::grpc::ClientContext* context) {
-    return grpc_stub_->GetLocalShape(context, *request, response);
-  });
-}
-
-Status GRPCStub::SetReturnValue(const SetReturnValueRequest* request,
-                                SetReturnValueResponse* responses) {
-  return MakeRPC([this, request, responses](::grpc::ClientContext* context) {
-    return grpc_stub_->SetReturnValue(context, *request, responses);
-  });
-}
-
-Status GRPCStub::IsConstant(const IsConstantRequest* request,
-                            IsConstantResponse* response) {
-  return MakeRPC([this, request, response](::grpc::ClientContext* context) {
-    return grpc_stub_->IsConstant(context, *request, response);
-  });
-}
-
-Status GRPCStub::ComputeConstant(const ComputeConstantRequest* request,
-                                 ComputeConstantResponse* response) {
-  return MakeRPC([this, request, response](::grpc::ClientContext* context) {
-    return grpc_stub_->ComputeConstant(context, *request, response);
-  });
-}
-
 Status GRPCStub::ComputeConstantGraph(
     const ComputeConstantGraphRequest* request,
     ComputeConstantResponse* response) {
@@ -213,14 +128,6 @@ Status GRPCStub::ComputeConstantGraph(
   });
 }
 
-// Methods used by Computation.
-Status GRPCStub::SnapshotComputation(const SnapshotComputationRequest* request,
-                                     SnapshotComputationResponse* response) {
-  return MakeRPC([this, request, response](::grpc::ClientContext* context) {
-    return grpc_stub_->SnapshotComputation(context, *request, response);
-  });
-}
-
 // Methods used by GlobalData.
 Status GRPCStub::Unregister(const UnregisterRequest* request,
                             UnregisterResponse* response) {
diff --git a/tensorflow/compiler/xla/rpc/grpc_stub.h b/tensorflow/compiler/xla/rpc/grpc_stub.h
index 5906d45769b5749b0c590dbc0e1972077dc3e7ba..8dfcb761387d608abbb1f62974f49b976a7ff7ff 100644
--- a/tensorflow/compiler/xla/rpc/grpc_stub.h
+++ b/tensorflow/compiler/xla/rpc/grpc_stub.h
@@ -43,39 +43,21 @@ class GRPCStub : public ServiceInterface {
   Status ResetDevice(const ResetDeviceRequest* arg,
                      ResetDeviceResponse* result) override;
 
-  Status LoadComputationSnapshot(
-      const LoadComputationSnapshotRequest* request,
-      LoadComputationSnapshotResponse* result) override;
-
-  Status Execute(const ExecuteRequest* arg, ExecuteResponse* result) override;
-
   Status ExecuteGraph(const ExecuteGraphRequest* request,
                       ExecuteResponse* response) override;
 
-  Status ExecuteParallel(const ExecuteParallelRequest* arg,
-                         ExecuteParallelResponse* result) override;
-
   Status ExecuteGraphParallel(const ExecuteGraphParallelRequest* request,
                               ExecuteParallelResponse* response) override;
 
-  Status ExecuteAsync(const ExecuteAsyncRequest* arg,
-                      ExecuteAsyncResponse* result) override;
-
   Status WaitForExecution(const WaitForExecutionRequest* arg,
                           WaitForExecutionResponse* result) override;
 
   Status DeconstructTuple(const DeconstructTupleRequest* arg,
                           DeconstructTupleResponse* result) override;
 
-  Status GetComputationStats(const ComputationStatsRequest* arg,
-                             ComputationStatsResponse* result) override;
-
   Status GetComputationGraphStats(const ComputationGraphStatsRequest* request,
                                   ComputationStatsResponse* response) override;
 
-  Status GetComputationShape(const GetComputationShapeRequest* arg,
-                             GetComputationShapeResponse* result) override;
-
   Status GetShape(const GetShapeRequest* arg,
                   GetShapeResponse* result) override;
 
@@ -85,30 +67,9 @@ class GRPCStub : public ServiceInterface {
   Status CreateChannelHandle(const CreateChannelHandleRequest* arg,
                              CreateChannelHandleResponse* result) override;
 
-  // Methods used by ComputationBuilder.
-  Status Computation(const ComputationRequest* arg,
-                     ComputationResponse* result) override;
-
-  Status Op(const OpRequest* arg, OpResponse* result) override;
-  Status GetLocalShape(const GetLocalShapeRequest* arg,
-                       GetLocalShapeResponse* result) override;
-
-  Status SetReturnValue(const SetReturnValueRequest* arg,
-                        SetReturnValueResponse* results) override;
-
-  Status IsConstant(const IsConstantRequest* arg,
-                    IsConstantResponse* result) override;
-
-  Status ComputeConstant(const ComputeConstantRequest* arg,
-                         ComputeConstantResponse* result) override;
-
   Status ComputeConstantGraph(const ComputeConstantGraphRequest* arg,
                               ComputeConstantResponse* result) override;
 
-  // Methods used by Computation.
-  Status SnapshotComputation(const SnapshotComputationRequest* ag,
-                             SnapshotComputationResponse* result) override;
-
   // Methods used by GlobalData.
   Status Unregister(const UnregisterRequest* arg,
                     UnregisterResponse* result) override;
diff --git a/tensorflow/compiler/xla/rpc/xla_service.proto b/tensorflow/compiler/xla/rpc/xla_service.proto
index c47164ee1b7657ae378a053f553442bee751753e..551ae895e05586daec0ffcd425f4950f76bdd50d 100644
--- a/tensorflow/compiler/xla/rpc/xla_service.proto
+++ b/tensorflow/compiler/xla/rpc/xla_service.proto
@@ -75,19 +75,7 @@ service XlaService {
   rpc GetShape(GetShapeRequest) returns (GetShapeResponse) {
   }
 
-  // Requests the program shape of the referenced computation.
-  rpc GetComputationShape(GetComputationShapeRequest)
-      returns (GetComputationShapeResponse) {
-  }
-
-  // Requests the statistics of the given computation.
-  rpc GetComputationStats(ComputationStatsRequest)
-      returns (ComputationStatsResponse) {
-  }
-
   // Requests the statistics of the given computation.
-  //
-  // TODO(b/74197823): This is a part of a NOT YET ready refactor.
   rpc GetComputationGraphStats(ComputationGraphStatsRequest)
       returns (ComputationStatsResponse) {
   }
@@ -121,25 +109,12 @@ service XlaService {
   rpc ResetDevice(ResetDeviceRequest) returns (ResetDeviceResponse) {
   }
 
-  // Tests if an expression is a compile-time constant.
-  rpc IsConstant(IsConstantRequest) returns (IsConstantResponse) {
-  }
-
-  // Computes the value of a constant expression.
-  rpc ComputeConstant(ComputeConstantRequest)
-      returns (ComputeConstantResponse) {
-  }
-
   // Computes the value of a constant expression. The request contains the
   // computation graph for the constant expression.
   rpc ComputeConstantGraph(ComputeConstantGraphRequest)
       returns (ComputeConstantResponse) {
   }
 
-  // Retrieves the inferred shape for a value within a computation.
-  rpc GetLocalShape(GetLocalShapeRequest) returns (GetLocalShapeResponse) {
-  }
-
   // Requests one or more device handles from the target. The returned device
   // handles can be used to specify the device on which to execute computations
   // or transfer data.
@@ -153,32 +128,6 @@ service XlaService {
       returns (CreateChannelHandleResponse) {
   }
 
-  // Requests that the referenced computation be specialized for the provided
-  // arguments for subsequent execution. This permits things such as value
-  // specialization.
-  rpc Specialize(SpecializeRequest) returns (SpecializeResponse) {
-  }
-
-  // Modifies the provided computation so that subsequent executions
-  // will compute the provided ComputationDataHandle, rather than the
-  // last expression enqueued on that Computation.
-  rpc SetReturnValue(SetReturnValueRequest) returns (SetReturnValueResponse) {
-  }
-
-  // Computation creates a new computation with the given name.
-  // A unique ComputationHandle is returned.
-  rpc Computation(ComputationRequest) returns (ComputationResponse) {
-  }
-
-  // Adds a new op to a computation.
-  rpc CreateOp(OpRequest) returns (OpResponse) {
-  }
-
-  // Invokes the provided computation with the provided global data passed as
-  // immutable arguments. Returns global data output and execution timing.
-  rpc Execute(ExecuteRequest) returns (ExecuteResponse) {
-  }
-
   // Invokes the provided computation with the provided global data passed as
   // immutable arguments. The request contains the whole computation graph.
   // Returns global data output and execution timing.
@@ -188,38 +137,13 @@ service XlaService {
   // Invokes the provided list of computations in parallel with the provided
   // global data for each computation. Returns a list of global data output and
   // execution timing.
-  rpc ExecuteParallel(ExecuteParallelRequest)
-      returns (ExecuteParallelResponse) {
-  }
-
-  // Invokes the provided list of computations in parallel with the provided
-  // global data for each computation. Returns a list of global data output and
-  // execution timing.
-  //
-  // TODO(b/74197823): This is a part of a NOT YET ready refactor.
   rpc ExecuteGraphParallel(ExecuteGraphParallelRequest)
       returns (ExecuteParallelResponse) {
   }
 
-  // Invokes the provided computation with the provided global data passed as
-  // immutable arguments. Returns a handle to the execution.
-  rpc ExecuteAsync(ExecuteAsyncRequest) returns (ExecuteAsyncResponse) {
-  }
-
   // Waits until the given execution (aysnchronously launched) is complete, and
   // returns the global data output.
   rpc WaitForExecution(WaitForExecutionRequest)
       returns (WaitForExecutionResponse) {
   }
-
-  // Serializes a computation to proto form, so it can be loaded via
-  // LoadComputationSnapshot.
-  rpc SnapshotComputation(SnapshotComputationRequest)
-      returns (SnapshotComputationResponse) {
-  }
-
-  // Loads a computation from a captured snapshot.
-  rpc LoadComputationSnapshot(LoadComputationSnapshotRequest)
-      returns (LoadComputationSnapshotResponse) {
-  }
 }
diff --git a/tensorflow/compiler/xla/service/BUILD b/tensorflow/compiler/xla/service/BUILD
index d1722644c72646538dab77899b79d25056f2f2bf..2942edbf71f29304ebb240f0547808ae0af1ac87 100644
--- a/tensorflow/compiler/xla/service/BUILD
+++ b/tensorflow/compiler/xla/service/BUILD
@@ -16,19 +16,22 @@ load("//tensorflow/compiler/xla/tests:build_defs.bzl", "xla_test")
 load("//tensorflow/compiler/xla:xla.bzl", "xla_proto_library")
 load("//tensorflow:tensorflow.bzl", "tf_cc_test")
 load("//tensorflow:tensorflow.bzl", "tf_cc_binary")
+load(
+    "//tensorflow/core:platform/default/build_config.bzl",
+    "tf_proto_library_py",
+)
 
 xla_proto_library(
-    name = "session_proto",
-    srcs = ["session.proto"],
+    name = "hlo_proto",
+    srcs = ["hlo.proto"],
     visibility = ["//visibility:public"],
     deps = ["//tensorflow/compiler/xla:xla_data_proto"],
 )
 
-xla_proto_library(
-    name = "hlo_proto",
+tf_proto_library_py(
+    name = "hlo_proto",  # bzl adds a _py suffix only to the OSS target.
     srcs = ["hlo.proto"],
     visibility = ["//visibility:public"],
-    deps = ["//tensorflow/compiler/xla:xla_data_proto"],
 )
 
 xla_proto_library(
@@ -266,6 +269,7 @@ cc_library(
         "dfs_hlo_visitor.cc",
         "hlo_computation.cc",
         "hlo_instruction.cc",
+        "hlo_instructions.cc",
         "hlo_module.cc",
         "hlo_opcode.cc",
         "hlo_sharding.cc",
@@ -273,18 +277,21 @@ cc_library(
     hdrs = [
         "dfs_hlo_visitor.h",
         "dfs_hlo_visitor_with_default.h",
+        "hlo_clone_context.h",
         "hlo_computation.h",
+        "hlo_domain_metadata.h",
         "hlo_instruction.h",
+        "hlo_instructions.h",
         "hlo_module.h",
         "hlo_opcode.h",
         "hlo_sharding.h",
     ],
     deps = [
+        ":hlo_casting_utils",
         ":hlo_module_config",
         ":hlo_proto",
         ":hlo_reachability",
         ":name_uniquer",
-        ":versioned_computation_handle",
         "//tensorflow/compiler/xla:array",
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:protobuf_util",
@@ -297,6 +304,7 @@ cc_library(
         "//tensorflow/compiler/xla:util",
         "//tensorflow/compiler/xla:window_util",
         "//tensorflow/compiler/xla:xla_data_proto",
+        "//tensorflow/core:human_readable_json",
         "//tensorflow/core:lib",
         "//tensorflow/core:lib_internal",
     ],
@@ -336,8 +344,8 @@ tf_cc_test(
         ":hlo",
         ":pattern_matcher",
         "//tensorflow/compiler/xla:shape_util",
+        "//tensorflow/compiler/xla/service:hlo_parser",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
-        "//tensorflow/compiler/xla/tools/parser:hlo_parser",
         "//tensorflow/core:test",
     ],
 )
@@ -375,6 +383,7 @@ cc_library(
     deps = [
         ":hlo",
         "//tensorflow/compiler/xla:test",
+        "//tensorflow/compiler/xla/service:hlo_parser",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
         "//tensorflow/core:lib",
     ],
@@ -385,20 +394,9 @@ tf_cc_test(
     srcs = ["hlo_matchers_test.cc"],
     deps = [
         ":hlo_matchers",
+        ":hlo_parser",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
-        "//tensorflow/compiler/xla/tools/parser:hlo_parser",
-    ],
-)
-
-cc_library(
-    name = "versioned_computation_handle",
-    srcs = ["versioned_computation_handle.cc"],
-    hdrs = ["versioned_computation_handle.h"],
-    deps = [
-        "//tensorflow/compiler/xla:types",
-        "//tensorflow/compiler/xla:xla_data_proto",
-        "//tensorflow/core:lib",
     ],
 )
 
@@ -407,12 +405,14 @@ tf_cc_test(
     srcs = ["hlo_instruction_test.cc"],
     deps = [
         ":hlo",
+        ":hlo_parser",
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:protobuf_util",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:test",
         "//tensorflow/compiler/xla:test_helpers",
         "//tensorflow/compiler/xla:util",
+        "//tensorflow/compiler/xla:window_util",
         "//tensorflow/compiler/xla/tests:hlo_test_base",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
     ],
@@ -429,6 +429,7 @@ tf_cc_test(
         "//tensorflow/compiler/xla:test",
         "//tensorflow/compiler/xla:test_helpers",
         "//tensorflow/compiler/xla:util",
+        "//tensorflow/compiler/xla/service:hlo_parser",
         "//tensorflow/compiler/xla/tests:hlo_test_base",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
     ],
@@ -533,45 +534,6 @@ tf_cc_test(
     ],
 )
 
-cc_library(
-    name = "user_computation",
-    srcs = ["user_computation.cc"],
-    hdrs = ["user_computation.h"],
-    deps = [
-        ":hlo",
-        ":session_proto",
-        ":shape_inference",
-        ":versioned_computation_handle",
-        "//tensorflow/compiler/xla:literal_util",
-        "//tensorflow/compiler/xla:shape_util",
-        "//tensorflow/compiler/xla:status_macros",
-        "//tensorflow/compiler/xla:statusor",
-        "//tensorflow/compiler/xla:types",
-        "//tensorflow/compiler/xla:util",
-        "//tensorflow/compiler/xla:xla_data_proto",
-        "//tensorflow/compiler/xla:xla_proto",
-        "//tensorflow/core:lib",
-    ],
-)
-
-tf_cc_test(
-    name = "user_computation_test",
-    srcs = ["user_computation_test.cc"],
-    deps = [
-        ":hlo_matchers",
-        ":user_computation",
-        "//tensorflow/compiler/xla:literal_util",
-        "//tensorflow/compiler/xla:shape_util",
-        "//tensorflow/compiler/xla:status_macros",
-        "//tensorflow/compiler/xla:test",
-        "//tensorflow/compiler/xla:test_helpers",
-        "//tensorflow/compiler/xla:xla_data_proto",
-        "//tensorflow/compiler/xla/service:hlo",
-        "//tensorflow/compiler/xla/tests:xla_internal_test_main",
-        "//tensorflow/core:test",
-    ],
-)
-
 cc_library(
     name = "platform_util",
     srcs = ["platform_util.cc"],
@@ -617,10 +579,8 @@ cc_library(
         ":allocation_tracker",
         ":backend",
         ":channel_tracker",
-        ":compilation_cache",
         ":compiler",
         ":computation_layout",
-        ":computation_tracker",
         ":device_memory_allocator",
         ":executable",
         ":execution_tracker",
@@ -631,11 +591,8 @@ cc_library(
         ":hlo_module_config",
         ":hlo_proto_util",
         ":platform_util",
-        ":session_proto",
         ":source_map_util",
         ":transfer_manager",
-        ":user_computation",
-        ":versioned_computation_handle",
         "//tensorflow/compiler/xla:executable_run_options",
         "//tensorflow/compiler/xla:execution_options_util",
         "//tensorflow/compiler/xla:service_interface",
@@ -662,7 +619,6 @@ cc_library(
         ":backend",
         ":compiler",
         ":computation_layout",
-        ":computation_tracker",
         ":device_memory_allocator",
         ":executable",
         ":hlo",
@@ -671,8 +627,6 @@ cc_library(
         ":platform_util",
         ":service",
         ":shaped_buffer",
-        ":user_computation",
-        ":versioned_computation_handle",
         "//tensorflow/compiler/xla:execution_options_util",
         "//tensorflow/compiler/xla:shape_layout",
         "//tensorflow/compiler/xla:shape_util",
@@ -696,7 +650,6 @@ cc_library(
         ":backend",
         ":compiler",
         ":computation_layout",
-        ":computation_tracker",
         ":platform_util",
         ":service",
         "//tensorflow/compiler/xla:status_macros",
@@ -793,9 +746,7 @@ cc_library(
         ":hlo_graph_dumper",
         ":hlo_proto",
         ":pool",
-        ":session_proto",
         ":shaped_buffer",
-        ":versioned_computation_handle",
         "//tensorflow/compiler/xla:executable_run_options",
         "//tensorflow/compiler/xla:status",
         "//tensorflow/compiler/xla:status_macros",
@@ -891,34 +842,12 @@ cc_library(
     ],
 )
 
-cc_library(
-    name = "computation_tracker",
-    srcs = ["computation_tracker.cc"],
-    hdrs = ["computation_tracker.h"],
-    deps = [
-        ":hlo",
-        ":hlo_module_config",
-        ":session_proto",
-        ":user_computation",
-        ":versioned_computation_handle",
-        "//tensorflow/compiler/xla:status_macros",
-        "//tensorflow/compiler/xla:statusor",
-        "//tensorflow/compiler/xla:types",
-        "//tensorflow/compiler/xla:util",
-        "//tensorflow/compiler/xla:xla_data_proto",
-        "//tensorflow/core:lib",
-    ],
-)
-
 cc_library(
     name = "channel_tracker",
     srcs = ["channel_tracker.cc"],
     hdrs = ["channel_tracker.h"],
     deps = [
         ":hlo",
-        ":session_proto",
-        ":user_computation",
-        ":versioned_computation_handle",
         "//tensorflow/compiler/xla:status",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:statusor",
@@ -1024,7 +953,6 @@ tf_cc_test(
         ":buffer_assignment",
         ":buffer_value",
         ":call_graph",
-        ":computation_tracker",
         ":copy_insertion",
         ":cpu_plugin",
         ":flatten_call_graph",
@@ -1038,9 +966,9 @@ tf_cc_test(
         "//tensorflow/compiler/xla:types",
         "//tensorflow/compiler/xla:util",
         "//tensorflow/compiler/xla:xla_data_proto",
+        "//tensorflow/compiler/xla/service:hlo_parser",
         "//tensorflow/compiler/xla/tests:hlo_test_base",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
-        "//tensorflow/compiler/xla/tools/parser:hlo_parser",
         "//tensorflow/core:lib",
     ],
 )
@@ -1076,9 +1004,9 @@ tf_cc_test(
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:types",
         "//tensorflow/compiler/xla:xla_data_proto",
+        "//tensorflow/compiler/xla/service:hlo_parser",
         "//tensorflow/compiler/xla/tests:hlo_test_base",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
-        "//tensorflow/compiler/xla/tools/parser:hlo_parser",
     ],
 )
 
@@ -1179,9 +1107,9 @@ tf_cc_test(
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:types",
         "//tensorflow/compiler/xla:xla_data_proto",
+        "//tensorflow/compiler/xla/service:hlo_parser",
         "//tensorflow/compiler/xla/tests:hlo_test_base",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
-        "//tensorflow/compiler/xla/tools/parser:hlo_parser",
     ],
 )
 
@@ -1214,9 +1142,22 @@ tf_cc_test(
     deps = [
         ":hlo_matchers",
         ":instruction_fusion",
+        "//tensorflow/compiler/xla/service:hlo_parser",
         "//tensorflow/compiler/xla/tests:hlo_test_base",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
-        "//tensorflow/compiler/xla/tools/parser:hlo_parser",
+    ],
+)
+
+cc_library(
+    name = "multi_output_fusion",
+    srcs = ["multi_output_fusion.cc"],
+    hdrs = ["multi_output_fusion.h"],
+    deps = [
+        "//tensorflow/compiler/xla:shape_util",
+        "//tensorflow/compiler/xla:statusor",
+        "//tensorflow/compiler/xla/service:hlo",
+        "//tensorflow/compiler/xla/service:hlo_pass",
+        "//tensorflow/core:lib",
     ],
 )
 
@@ -1388,9 +1329,9 @@ tf_cc_test(
     deps = [
         ":gather_expander",
         "//tensorflow/compiler/xla:test",
+        "//tensorflow/compiler/xla/service:hlo_parser",
         "//tensorflow/compiler/xla/tests:test_macros_header",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",  # fixdeps: keep
-        "//tensorflow/compiler/xla/tools/parser:hlo_parser",
     ],
 )
 
@@ -1696,14 +1637,11 @@ tf_cc_test(
     name = "hlo_cost_analysis_test",
     srcs = ["hlo_cost_analysis_test.cc"],
     deps = [
-        ":computation_tracker",
         ":cpu_plugin",
         ":hlo",
         ":hlo_cost_analysis",
         ":local_service",
         ":service",
-        ":user_computation",
-        ":versioned_computation_handle",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:statusor",
         "//tensorflow/compiler/xla:test_helpers",
@@ -1742,9 +1680,9 @@ tf_cc_test(
         ":cpu_plugin",
         ":hlo_cost_analysis",
         ":hlo_execution_profile",
+        "//tensorflow/compiler/xla/service:hlo_parser",
         "//tensorflow/compiler/xla/tests:hlo_test_base",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
-        "//tensorflow/compiler/xla/tools/parser:hlo_parser",
         "//tensorflow/core:lib",
     ],
 )
@@ -1925,9 +1863,9 @@ tf_cc_test(
         "//tensorflow/compiler/xla:test",
         "//tensorflow/compiler/xla:test_helpers",
         "//tensorflow/compiler/xla:xla_data_proto",
+        "//tensorflow/compiler/xla/service:hlo_parser",
         "//tensorflow/compiler/xla/tests:hlo_test_base",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
-        "//tensorflow/compiler/xla/tools/parser:hlo_parser",
         "//tensorflow/core:lib",
         "//tensorflow/core:test",
     ],
@@ -2044,20 +1982,6 @@ tf_cc_test(
     ],
 )
 
-cc_library(
-    name = "compilation_cache",
-    srcs = ["compilation_cache.cc"],
-    hdrs = ["compilation_cache.h"],
-    deps = [
-        ":executable",
-        ":hlo_module_config",
-        ":versioned_computation_handle",
-        "//tensorflow/compiler/xla:types",
-        "//tensorflow/compiler/xla:xla_data_proto",
-        "//tensorflow/core:lib",
-    ],
-)
-
 cc_library(
     name = "layout_assignment",
     srcs = [
@@ -2228,6 +2152,7 @@ tf_cc_test(
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/tests:hlo_test_base",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
+        "//tensorflow/core:test",
     ],
 )
 
@@ -2262,11 +2187,11 @@ tf_cc_test(
         "//tensorflow/compiler/xla:types",
         "//tensorflow/compiler/xla:util",
         "//tensorflow/compiler/xla:xla_data_proto",
+        "//tensorflow/compiler/xla/service:hlo_parser",
         "//tensorflow/compiler/xla/tests:hlo_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
         "//tensorflow/compiler/xla/tests:test_utils",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
-        "//tensorflow/compiler/xla/tools/parser:hlo_parser",
         "//tensorflow/core:lib",
         "//tensorflow/core:test",
     ],
@@ -2288,9 +2213,9 @@ tf_cc_test(
         "//tensorflow/compiler/xla:test_helpers",
         "//tensorflow/compiler/xla:util",
         "//tensorflow/compiler/xla:xla_data_proto",
+        "//tensorflow/compiler/xla/service:hlo_parser",
         "//tensorflow/compiler/xla/tests:hlo_test_base",
         "//tensorflow/compiler/xla/tests:test_utils",
-        "//tensorflow/compiler/xla/tools/parser:hlo_parser",
         "//tensorflow/core:lib",
         "//tensorflow/core:test",
     ],
@@ -2338,6 +2263,7 @@ cc_library(
     hdrs = ["hlo_cse.h"],
     deps = [
         ":hlo",
+        ":hlo_domain_map",
         ":hlo_pass",
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:shape_util",
@@ -2360,10 +2286,10 @@ tf_cc_test(
         "//tensorflow/compiler/xla:types",
         "//tensorflow/compiler/xla:util",
         "//tensorflow/compiler/xla:xla_data_proto",
+        "//tensorflow/compiler/xla/service:hlo_parser",
         "//tensorflow/compiler/xla/tests:hlo_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
         "//tensorflow/compiler/xla/tests:test_utils",
-        "//tensorflow/compiler/xla/tools/parser:hlo_parser",
         "//tensorflow/core:lib",
     ],
 )
@@ -2402,6 +2328,78 @@ tf_cc_test(
     ],
 )
 
+cc_library(
+    name = "hlo_domain_map",
+    srcs = ["hlo_domain_map.cc"],
+    hdrs = ["hlo_domain_map.h"],
+    deps = [
+        ":hlo",
+        "//tensorflow/compiler/xla:statusor",
+        "//tensorflow/compiler/xla:types",
+        "//tensorflow/compiler/xla:util",
+        "//tensorflow/core:lib",
+    ],
+)
+
+cc_library(
+    name = "hlo_sharding_metadata",
+    srcs = ["hlo_sharding_metadata.cc"],
+    hdrs = [
+        "hlo_sharding_metadata.h",
+    ],
+    deps = [
+        ":hlo",
+        "//tensorflow/compiler/xla:shape_tree",
+        "//tensorflow/compiler/xla:shape_util",
+        "//tensorflow/core:lib",
+    ],
+)
+
+cc_library(
+    name = "hlo_domain_isolator",
+    srcs = ["hlo_domain_isolator.cc"],
+    hdrs = ["hlo_domain_isolator.h"],
+    deps = [
+        ":hlo",
+        ":hlo_graph_dumper",
+        ":hlo_pass",
+        "//tensorflow/compiler/xla:types",
+        "//tensorflow/compiler/xla:util",
+    ],
+)
+
+cc_library(
+    name = "hlo_domain_remover",
+    srcs = ["hlo_domain_remover.cc"],
+    hdrs = ["hlo_domain_remover.h"],
+    deps = [
+        ":hlo",
+        ":hlo_domain_isolator",
+        ":hlo_domain_map",
+        ":hlo_graph_dumper",
+        ":hlo_pass",
+        "//tensorflow/compiler/xla:types",
+        "//tensorflow/core:lib",
+    ],
+)
+
+tf_cc_test(
+    name = "hlo_domain_test",
+    srcs = ["hlo_domain_test.cc"],
+    deps = [
+        ":hlo",
+        ":hlo_domain_isolator",
+        ":hlo_domain_remover",
+        ":hlo_parser",
+        ":hlo_sharding_metadata",
+        "//tensorflow/compiler/xla:test",
+        "//tensorflow/compiler/xla/legacy_flags:debug_options_flags",
+        "//tensorflow/compiler/xla/tests:hlo_test_base",
+        "//tensorflow/compiler/xla/tests:xla_internal_test_main",
+        "//tensorflow/core:test",
+    ],
+)
+
 cc_library(
     name = "hlo_element_type_converter",
     srcs = ["hlo_element_type_converter.cc"],
@@ -2483,10 +2481,10 @@ xla_test(
         "//tensorflow/compiler/xla:execution_options_util",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:test",
+        "//tensorflow/compiler/xla/service:hlo_parser",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:hlo_test_base",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
-        "//tensorflow/compiler/xla/tools/parser:hlo_parser",
     ],
 )
 
@@ -2544,7 +2542,6 @@ cc_library(
     name = "hlo_tfgraph_builder",
     srcs = ["hlo_tfgraph_builder.cc"],
     hdrs = ["hlo_tfgraph_builder.h"],
-    visibility = ["//tensorflow/compiler/xla/tools:__pkg__"],
     deps = [
         ":hlo",
         "//tensorflow/compiler/xla:literal_util",
@@ -2575,6 +2572,7 @@ cc_library(
     hdrs = ["hlo_graph_dumper.h"],
     deps = [
         ":hlo",
+        ":hlo_casting_utils",
         ":hlo_execution_profile",
         ":hlo_tfgraph_builder",
         "//tensorflow/compiler/xla:literal_util",
@@ -2632,10 +2630,10 @@ tf_cc_test(
         "//tensorflow/compiler/xla:test_helpers",
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client/xla_client:xla_builder",
+        "//tensorflow/compiler/xla/service:hlo_parser",
         "//tensorflow/compiler/xla/service/gpu:ir_emission_utils",
         "//tensorflow/compiler/xla/tests:hlo_test_base",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
-        "//tensorflow/compiler/xla/tools/parser:hlo_parser",
         "//tensorflow/core:lib",
     ],
 )
@@ -2772,7 +2770,7 @@ cc_library(
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/service:backend",
         "//tensorflow/compiler/xla/service:compiler",
-        "//tensorflow/compiler/xla/tools/parser:hlo_parser",
+        "//tensorflow/compiler/xla/service:hlo_parser",
         "//tensorflow/core:core_cpu_internal",
         "//tensorflow/core:lib",
         "//tensorflow/core:stream_executor_no_cuda",
@@ -2808,8 +2806,8 @@ tf_cc_test(
         ":tuple_util",
         "//tensorflow/compiler/xla:test",
         "//tensorflow/compiler/xla/service:hlo_matchers",
+        "//tensorflow/compiler/xla/service:hlo_parser",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
-        "//tensorflow/compiler/xla/tools/parser:hlo_parser",
     ],
 )
 
@@ -2832,9 +2830,10 @@ tf_cc_test(
     deps = [
         ":while_util",
         "//tensorflow/compiler/xla:test",
+        "//tensorflow/compiler/xla:util",
         "//tensorflow/compiler/xla/service:hlo_matchers",
+        "//tensorflow/compiler/xla/service:hlo_parser",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
-        "//tensorflow/compiler/xla/tools/parser:hlo_parser",
     ],
 )
 
@@ -2860,6 +2859,7 @@ tf_cc_test(
         ":hlo_matchers",
         ":while_loop_invariant_code_motion",
         "//tensorflow/compiler/xla:test",
+        "//tensorflow/compiler/xla/service:hlo_parser",
         "//tensorflow/compiler/xla/tests:hlo_verified_test_base",
         "//tensorflow/core:test",
     ],
@@ -2886,8 +2886,8 @@ tf_cc_test(
         ":hlo_matchers",
         ":while_loop_constant_sinking",
         "//tensorflow/compiler/xla:test",
+        "//tensorflow/compiler/xla/service:hlo_parser",
         "//tensorflow/compiler/xla/tests:hlo_verified_test_base",
-        "//tensorflow/compiler/xla/tools/parser:hlo_parser",
         "//tensorflow/core:test",
     ],
 )
@@ -2925,6 +2925,7 @@ cc_library(
     hdrs = ["indexed_array_analysis.h"],
     deps = [
         ":hlo",
+        ":hlo_evaluator",
         ":hlo_pass",
         "//tensorflow/compiler/xla:util",
         "//tensorflow/core:lib",
@@ -2939,9 +2940,75 @@ tf_cc_test(
         ":hlo_matchers",
         ":indexed_array_analysis",
         "//tensorflow/compiler/xla:test",
+        "//tensorflow/compiler/xla/service:hlo_parser",
         "//tensorflow/compiler/xla/tests:hlo_verified_test_base",
         "//tensorflow/compiler/xla/tests:test_utils",
-        "//tensorflow/compiler/xla/tools/parser:hlo_parser",
+        "//tensorflow/core:test",
+    ],
+)
+
+cc_library(
+    name = "hlo_parser",
+    srcs = ["hlo_parser.cc"],
+    hdrs = ["hlo_parser.h"],
+    deps = [
+        ":hlo",
+        ":hlo_lexer",
+        ":hlo_sharding_metadata",
+        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:shape_util",
+        "//tensorflow/compiler/xla:statusor",
+        "//tensorflow/compiler/xla:util",
+        "//tensorflow/compiler/xla:xla_data_proto",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:lib_internal",
+    ],
+)
+
+tf_cc_test(
+    name = "hlo_parser_test",
+    size = "small",
+    srcs = ["hlo_parser_test.cc"],
+    deps = [
+        ":hlo_parser",
+        "//tensorflow/compiler/xla:window_util",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",  # fixdeps: keep
+    ],
+)
+
+cc_library(
+    name = "hlo_lexer",
+    srcs = ["hlo_lexer.cc"],
+    hdrs = [
+        "hlo_lexer.h",
+        "hlo_token.h",
+    ],
+    deps = [
+        "//tensorflow/compiler/xla:shape_util",
+        "//tensorflow/compiler/xla:statusor",
+        "//tensorflow/compiler/xla:types",
+        "//tensorflow/compiler/xla:util",
+        "//tensorflow/compiler/xla:xla_data_proto",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:regexp_internal",
+    ],
+)
+
+cc_library(
+    name = "hlo_casting_utils",
+    hdrs = ["hlo_casting_utils.h"],
+    deps = ["//tensorflow/core:lib"],
+)
+
+tf_cc_test(
+    name = "hlo_casting_utils_test",
+    srcs = ["hlo_casting_utils_test.cc"],
+    deps = [
+        ":hlo",
+        ":hlo_casting_utils",
+        "//tensorflow/compiler/xla/tests:xla_internal_test_main",  # fixdeps: keep
         "//tensorflow/core:test",
     ],
 )
diff --git a/tensorflow/compiler/xla/service/algebraic_simplifier.cc b/tensorflow/compiler/xla/service/algebraic_simplifier.cc
index f732ed8f398c4699bd5247dc7fa1e9677340dcae..3b36939b8a6900f047bbec225aa232e0e805b5d1 100644
--- a/tensorflow/compiler/xla/service/algebraic_simplifier.cc
+++ b/tensorflow/compiler/xla/service/algebraic_simplifier.cc
@@ -157,6 +157,8 @@ class AlgebraicSimplifierVisitor : public DfsHloVisitorWithDefault {
 
   Status HandleSubtract(HloInstruction* sub) override;
 
+  Status HandleMap(HloInstruction* map) override;
+
   Status HandleMaximum(HloInstruction* maximum) override;
   Status HandleMinimum(HloInstruction* minimum) override;
 
@@ -231,10 +233,10 @@ class AlgebraicSimplifierVisitor : public DfsHloVisitorWithDefault {
                                    HloInstruction* operand, HloInstruction* max,
                                    HloInstruction* max_operand);
 
-  // A Reshape or Broadcast that feeds an element-wise operation with a unique
-  // non-scalar operand can sink to after the operation.
-  StatusOr<bool> TryToSinkReshapeOrBroadcastAfterOpWithUniqueNonScalarOperand(
-      HloInstruction* reshape_or_broadcast);
+  // A Broadcast that feeds an element-wise operation with a unique non-scalar
+  // operand can sink to after the operation.
+  StatusOr<bool> TryToSinkBroadcastAfterOpWithUniqueNonScalarOperand(
+      HloInstruction* broadcast);
 
   // Replaces the existing HLO instruction old_instruction, with
   // new_instruction, and marks the optimizer status as changed.
@@ -1303,7 +1305,7 @@ Status AlgebraicSimplifierVisitor::HandleBroadcast(HloInstruction* broadcast) {
   // broadcast after the unary element-wise operation.
   TF_ASSIGN_OR_RETURN(
       bool sink_succeeded,
-      TryToSinkReshapeOrBroadcastAfterOpWithUniqueNonScalarOperand(broadcast));
+      TryToSinkBroadcastAfterOpWithUniqueNonScalarOperand(broadcast));
   changed_ |= sink_succeeded;
   if (sink_succeeded) {
     return Status::OK();
@@ -1555,15 +1557,16 @@ Status AlgebraicSimplifierVisitor::HandlePower(HloInstruction* power) {
   return Status::OK();
 }
 
-StatusOr<bool> AlgebraicSimplifierVisitor::
-    TryToSinkReshapeOrBroadcastAfterOpWithUniqueNonScalarOperand(
-        HloInstruction* reshape_or_broadcast) {
+StatusOr<bool>
+AlgebraicSimplifierVisitor::TryToSinkBroadcastAfterOpWithUniqueNonScalarOperand(
+    HloInstruction* broadcast) {
+  TF_RET_CHECK(broadcast->opcode() == HloOpcode::kBroadcast);
   bool changed = false;
-  if (ShapeUtil::IsScalar(reshape_or_broadcast->shape())) {
+  if (ShapeUtil::IsScalar(broadcast->shape())) {
     return false;
   }
-  HloInstruction* operand = reshape_or_broadcast->mutable_operand(0);
-  for (HloInstruction* user : reshape_or_broadcast->users()) {
+  HloInstruction* operand = broadcast->mutable_operand(0);
+  for (HloInstruction* user : broadcast->users()) {
     if (user->user_count() == 0 && user != computation_->root_instruction()) {
       continue;
     }
@@ -1581,55 +1584,50 @@ StatusOr<bool> AlgebraicSimplifierVisitor::
       continue;
     }
 
-    int64 reshape_or_broadcast_operand_index = -1;
     // Find the unique non-scalar operand or continue if there isn't one.
-    int64 scalar_count = 0;
-    for (int64 i = 0; i < user->operand_count(); ++i) {
-      if (ShapeUtil::IsScalar(user->operand(i)->shape())) {
-        ++scalar_count;
-      } else {
-        reshape_or_broadcast_operand_index = i;
+    int64 scalar_broadcast_count = 0;
+    int64 broadcast_use_count = 0;
+    for (HloInstruction* user_operand : user->operands()) {
+      if (user_operand->opcode() == HloOpcode::kBroadcast &&
+          ShapeUtil::IsScalar(user_operand->operand(0)->shape())) {
+        ++scalar_broadcast_count;
+      } else if (broadcast == user_operand) {
+        ++broadcast_use_count;
       }
     }
-    if (scalar_count != user->operand_count() - 1) {
+    if (scalar_broadcast_count + broadcast_use_count != user->operand_count()) {
       continue;
     }
-    VLOG(4) << "Sinking reshape or broadcast after user:";
-    VLOG(4) << "  old reshape/broadcast: " << reshape_or_broadcast->ToString();
+    std::vector<HloInstruction*> new_operands;
+    new_operands.reserve(user->operand_count());
+
+    for (HloInstruction* user_operand : user->operands()) {
+      if (user_operand->opcode() == HloOpcode::kBroadcast &&
+          ShapeUtil::IsScalar(user_operand->operand(0)->shape())) {
+        new_operands.push_back(
+            computation_->AddInstruction(HloInstruction::CreateBroadcast(
+                ShapeUtil::ChangeElementType(
+                    operand->shape(), user_operand->shape().element_type()),
+                user_operand->mutable_operand(0), {})));
+      } else {
+        CHECK_EQ(broadcast, user_operand);
+        new_operands.push_back(operand);
+      }
+    }
+    VLOG(4) << "Sinking broadcast after user:";
+    VLOG(4) << "  old broadcast: " << broadcast->ToString();
     VLOG(4) << "  old user: " << user->ToString();
-    CHECK_EQ(user->operand(reshape_or_broadcast_operand_index),
-             reshape_or_broadcast);
-    auto new_user_operands = user->operands();
-    new_user_operands[reshape_or_broadcast_operand_index] = operand;
-    auto new_user = computation_->AddInstruction(user->CloneWithNewOperands(
-        ShapeUtil::MakeShapeWithLayout(
-            user->shape().element_type(),
-            AsInt64Slice(operand->shape().dimensions()),
-            LayoutUtil::MinorToMajor(operand->shape())),
-        new_user_operands));
+    HloInstruction* new_user =
+        computation_->AddInstruction(user->CloneWithNewOperands(
+            ShapeUtil::ChangeElementType(operand->shape(),
+                                         user->shape().element_type()),
+            new_operands));
     VLOG(4) << "  new user: " << new_user->ToString();
-    HloInstruction* new_reshape_or_broadcast = nullptr;
-    if (reshape_or_broadcast->opcode() == HloOpcode::kReshape) {
-      new_reshape_or_broadcast =
-          computation_->AddInstruction(HloInstruction::CreateReshape(
-              ShapeUtil::MakeShapeWithLayout(
-                  user->shape().element_type(),
-                  AsInt64Slice(reshape_or_broadcast->shape().dimensions()),
-                  LayoutUtil::MinorToMajor(reshape_or_broadcast->shape())),
-              new_user));
-    } else {
-      TF_RET_CHECK(reshape_or_broadcast->opcode() == HloOpcode::kBroadcast);
-      new_reshape_or_broadcast =
-          computation_->AddInstruction(HloInstruction::CreateBroadcast(
-              ShapeUtil::MakeShapeWithLayout(
-                  user->shape().element_type(),
-                  AsInt64Slice(reshape_or_broadcast->shape().dimensions()),
-                  LayoutUtil::MinorToMajor(reshape_or_broadcast->shape())),
-              new_user, reshape_or_broadcast->dimensions()));
-    }
-    VLOG(4) << "  new reshape/broadcast: "
-            << new_reshape_or_broadcast->ToString();
-    TF_RETURN_IF_ERROR(user->ReplaceAllUsesWith(new_reshape_or_broadcast));
+    HloInstruction* new_broadcast =
+        computation_->AddInstruction(HloInstruction::CreateBroadcast(
+            user->shape(), new_user, broadcast->dimensions()));
+    VLOG(4) << "  new broadcast: " << new_broadcast->ToString();
+    TF_RETURN_IF_ERROR(user->ReplaceAllUsesWith(new_broadcast));
     changed = true;
   }
   return changed;
@@ -1672,16 +1670,6 @@ Status AlgebraicSimplifierVisitor::HandleReshape(HloInstruction* reshape) {
     }
   }
 
-  // A Reshape that feeds a unary element-wise operation can sink the
-  // reshape after the unary element-wise operation.
-  TF_ASSIGN_OR_RETURN(
-      bool sink_succeeded,
-      TryToSinkReshapeOrBroadcastAfterOpWithUniqueNonScalarOperand(reshape));
-  changed_ |= sink_succeeded;
-  if (sink_succeeded) {
-    return Status::OK();
-  }
-
   // Make this a bitcast if possible.
   if (is_layout_sensitive_ &&
       ReshapeIsBitcast(reshape, valid_bitcast_callback_)) {
@@ -1786,6 +1774,46 @@ Status AlgebraicSimplifierVisitor::HandleReduce(HloInstruction* reduce) {
                     new_reduce_dimensions, function));
   }
 
+  // If the reduction results in the same number of elements, then the only
+  // possible side effect would be a reshape. Since the init_value is an
+  // identity of the reduction function, we can therefore replace the reduce
+  // with a simple reshape, ignoring the reduction function completely.
+  if (ShapeUtil::ElementsIn(reduce->shape()) ==
+      ShapeUtil::ElementsIn(arg->shape())) {
+    return ReplaceWithNewInstruction(
+        reduce, HloInstruction::CreateReshape(reduce->shape(), arg));
+  }
+
+  // If a reduce feeds a reduce with the same computation and initial value,
+  // they can be combined into a single reduce.
+  if (arg->opcode() == HloOpcode::kReduce &&
+      init_value->Identical(*arg->operand(1)) &&
+      *function == *arg->to_apply()) {
+    // Create a new reduce with the combined reduction dimensions of both
+    // reduces.
+    std::vector<int64> arg_dims = arg->dimensions();
+    std::sort(arg_dims.begin(), arg_dims.end());
+    std::vector<int64> reduce_dims = reduce->dimensions();
+    std::sort(reduce_dims.begin(), reduce_dims.end());
+    // Transform reduce_dims to the same rank as the operand of the operand.
+    for (int64 arg_dim : arg_dims) {
+      for (int64& dim : reduce_dims) {
+        if (dim >= arg_dim) {
+          ++dim;
+        }
+      }
+    }
+    std::vector<int64> new_dimensions;
+    new_dimensions.reserve(arg->dimensions().size() +
+                           reduce->dimensions().size());
+    std::merge(arg_dims.begin(), arg_dims.end(), reduce_dims.begin(),
+               reduce_dims.end(), std::back_inserter(new_dimensions));
+    return ReplaceWithNewInstruction(
+        reduce,
+        HloInstruction::CreateReduce(reduce->shape(), arg->mutable_operand(0),
+                                     init_value, new_dimensions, function));
+  }
+
   // A reshape that collapses multiple dimensions into a dimension being
   // reduced can just reduce all of those dimensions instead of doing a
   // collapsing reshape before a reduction.
@@ -1830,15 +1858,6 @@ Status AlgebraicSimplifierVisitor::HandleReduce(HloInstruction* reduce) {
                       new_reduce_dimensions, function));
     }
   }
-  if (ShapeUtil::ElementsIn(reduce->shape()) ==
-          ShapeUtil::ElementsIn(arg->shape()) ||
-      ShapeUtil::HasZeroElements(arg->shape())) {
-    auto reshape = computation_->AddInstruction(
-        HloInstruction::CreateReshape(reduce->shape(), arg));
-    return ReplaceWithNewInstruction(
-        reduce, HloInstruction::CreateMap(reduce->shape(),
-                                          {init_value, reshape}, function));
-  }
   return Status::OK();
 }
 
@@ -1858,7 +1877,7 @@ Status AlgebraicSimplifierVisitor::HandleReduceWindow(
     return ReplaceWithNewInstruction(
         reduce_window,
         HloInstruction::CreateMap(reduce_window->shape(),
-                                  {operand, reduce_window->mutable_operand(1)},
+                                  {reduce_window->mutable_operand(1), operand},
                                   function));
   }
 
@@ -2188,6 +2207,39 @@ bool AlgebraicSimplifierVisitor::TransformToClampIfSameShape(
   return true;
 }
 
+Status AlgebraicSimplifierVisitor::HandleMap(HloInstruction* map) {
+  auto* map_computation = map->to_apply();
+  auto* map_root = map_computation->root_instruction();
+  if (map_root->opcode() == HloOpcode::kParameter) {
+    ReplaceInstructionIfSameShape(
+        map, map->mutable_operand(map_root->parameter_number()));
+    return Status::OK();
+  }
+  if (map_root->opcode() == HloOpcode::kConstant) {
+    if (!ShapeUtil::IsScalar(map_root->shape())) {
+      return Status::OK();
+    }
+    auto clone = map_root->CloneWithNewOperands(map_root->shape(), {});
+    if (ShapeUtil::IsScalar(map->shape())) {
+      return ReplaceWithNewInstruction(map, std::move(clone));
+    }
+    return ReplaceWithNewInstruction(
+        map,
+        HloInstruction::CreateBroadcast(
+            map->shape(), computation_->AddInstruction(std::move(clone)), {}));
+  }
+  std::vector<HloInstruction*> new_operands;
+  for (auto* root_operand : map_root->operands()) {
+    if (root_operand->opcode() != HloOpcode::kParameter) {
+      return Status::OK();
+    }
+    new_operands.push_back(
+        map->mutable_operand(root_operand->parameter_number()));
+  }
+  auto clone = map_root->CloneWithNewOperands(map->shape(), new_operands);
+  return ReplaceWithNewInstruction(map, std::move(clone));
+}
+
 Status AlgebraicSimplifierVisitor::HandleMaximum(HloInstruction* maximum) {
   // Match the following tree:
   //          min_operand     operand
diff --git a/tensorflow/compiler/xla/service/algebraic_simplifier_test.cc b/tensorflow/compiler/xla/service/algebraic_simplifier_test.cc
index 4e082877c776c35bab499c805fef7632765a3ee1..2605b0488cb7c6850746df94c4ab05d6b5d35de5 100644
--- a/tensorflow/compiler/xla/service/algebraic_simplifier_test.cc
+++ b/tensorflow/compiler/xla/service/algebraic_simplifier_test.cc
@@ -74,6 +74,44 @@ TEST_F(AlgebraicSimplifierTest, AddZero) {
   EXPECT_EQ(root, param0);
 }
 
+// Test that Reduce(Reduce(A)) -> Reduce(A)
+TEST_F(AlgebraicSimplifierTest, TwoReducesToOne) {
+  HloComputation::Builder builder(TestName());
+  // Create add computation.
+  HloInstruction* zero = builder.AddInstruction(
+      HloInstruction::CreateConstant(Literal::CreateR0<float>(0.0f)));
+  HloComputation* add_computation = nullptr;
+  {
+    HloComputation::Builder builder(TestName() + ".add");
+    const Shape scalar_shape = ShapeUtil::MakeShape(F32, {});
+    HloInstruction* p0 = builder.AddInstruction(
+        HloInstruction::CreateParameter(0, scalar_shape, "p0"));
+    HloInstruction* p1 = builder.AddInstruction(
+        HloInstruction::CreateParameter(1, scalar_shape, "p1"));
+    builder.AddInstruction(
+        HloInstruction::CreateBinary(scalar_shape, HloOpcode::kAdd, p0, p1));
+    add_computation = module().AddEmbeddedComputation(builder.Build());
+  }
+  Shape r4f32 = ShapeUtil::MakeShape(F32, {4, 5, 6, 7});
+  HloInstruction* param = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, r4f32, "param"));
+  std::vector<int64> dims0({0});
+  Shape r3f32 = ShapeUtil::MakeShape(F32, {5, 6, 7});
+  HloInstruction* reduce0 = builder.AddInstruction(
+      HloInstruction::CreateReduce(r3f32, param, zero, dims0, add_computation));
+  std::vector<int64> dims1({1, 2});
+  Shape r1f32 = ShapeUtil::MakeShape(F32, {5});
+  builder.AddInstruction(HloInstruction::CreateReduce(r1f32, reduce0, zero,
+                                                      dims1, add_computation));
+  module().AddEntryComputation(builder.Build());
+  AlgebraicSimplifier simplifier(/*is_layout_sensitive=*/false,
+                                 non_bitcasting_callback());
+  ASSERT_TRUE(simplifier.Run(&module()).ValueOrDie());
+  HloInstruction* root = module().entry_computation()->root_instruction();
+  EXPECT_THAT(root, op::Reduce(param, zero));
+  EXPECT_EQ(root->dimensions(), std::vector<int64>({0, 2, 3}));
+}
+
 // Test that Const + A is canonicalized to A + Const.
 TEST_F(AlgebraicSimplifierTest, AddConstOnLHS) {
   Shape r0f32 = ShapeUtil::MakeShape(F32, {});
@@ -143,6 +181,39 @@ TEST_F(AlgebraicSimplifierTest, AddBroadcastZeroR0Operand) {
   EXPECT_EQ(root, param0);
 }
 
+TEST_F(AlgebraicSimplifierTest, InlineTrivialMap) {
+  HloComputation::Builder builder(TestName());
+  // Create add computation.
+  HloComputation* add_computation = nullptr;
+  {
+    HloComputation::Builder builder(TestName() + ".add");
+    const Shape scalar_shape = ShapeUtil::MakeShape(F32, {});
+    HloInstruction* p0 = builder.AddInstruction(
+        HloInstruction::CreateParameter(0, scalar_shape, "p0"));
+    HloInstruction* p1 = builder.AddInstruction(
+        HloInstruction::CreateParameter(1, scalar_shape, "p1"));
+    builder.AddInstruction(
+        HloInstruction::CreateBinary(scalar_shape, HloOpcode::kAdd, p0, p1));
+    add_computation = module().AddEmbeddedComputation(builder.Build());
+  }
+  Shape r2f32 = ShapeUtil::MakeShape(F32, {32, 1});
+  HloInstruction* param0 = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, r2f32, "param0"));
+  HloInstruction* zero = builder.AddInstruction(
+      HloInstruction::CreateConstant(Literal::CreateR0<float>(0.0f)));
+  builder.AddInstruction(
+      HloInstruction::CreateMap(r2f32, {param0, zero}, add_computation));
+
+  auto computation = module().AddEntryComputation(builder.Build());
+  HloInstruction* root = computation->root_instruction();
+  EXPECT_EQ(root->opcode(), HloOpcode::kMap);
+  AlgebraicSimplifier simplifier(/*is_layout_sensitive=*/false,
+                                 non_bitcasting_callback());
+  ASSERT_TRUE(simplifier.Run(&module()).ValueOrDie());
+  root = computation->root_instruction();
+  EXPECT_THAT(root, op::Add(param0, zero));
+}
+
 TEST_F(AlgebraicSimplifierTest, AddBroadcastZeroR1Operand) {
   Shape r2f32 = ShapeUtil::MakeShape(F32, {3, 2});
   HloComputation::Builder builder(TestName());
@@ -1318,32 +1389,6 @@ TEST_F(AlgebraicSimplifierTest, ReshapeReplacedWithBitcast) {
       op::Tuple(op::Bitcast(), dimensions_wrong_reshape, layout_wrong_reshape));
 }
 
-TEST_F(AlgebraicSimplifierTest, ReshapeAfterEffectiveUnary) {
-  HloComputation::Builder builder(TestName());
-  HloInstruction* param =
-      builder.AddInstruction(HloInstruction::CreateParameter(
-          0, ShapeUtil::MakeShape(F32, {2, 3, 4, 5}), "param"));
-  HloInstruction* movable_reshape =
-      builder.AddInstruction(HloInstruction::CreateReshape(
-          ShapeUtil::MakeShape(F32, {1, 2, 3, 4, 5}), param));
-  HloInstruction* zero = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(0.0f)));
-  builder.AddInstruction(
-      HloInstruction::CreateBinary(ShapeUtil::MakeShape(F32, {1, 2, 3, 4, 5}),
-                                   HloOpcode::kMaximum, movable_reshape, zero));
-  auto computation = module().AddEntryComputation(builder.Build());
-
-  EXPECT_THAT(computation->root_instruction(),
-              op::Maximum(op::Reshape(param), zero));
-
-  AlgebraicSimplifier simplifier(/*is_layout_sensitive=*/false,
-                                 bitcasting_callback());
-
-  simplifier.Run(&module()).ValueOrDie();
-  EXPECT_THAT(computation->root_instruction(),
-              op::Reshape(op::Maximum(param, zero)));
-}
-
 // Regression test for a bug in the reshape sinking transformation, where
 // moving a reshape to a scalar led to a crash.
 TEST_F(AlgebraicSimplifierTest, ReshapeToScalarNotHoistedAfterEffectiveUnary) {
@@ -1707,7 +1752,7 @@ TEST_F(AlgebraicSimplifierTest, RemoveNoopPad) {
 
   AlgebraicSimplifier simplifier(/*is_layout_sensitive=*/false,
                                  non_bitcasting_callback());
-  ASSERT_TRUE(simplifier.Run(module.get()).ValueOrDie());
+  ASSERT_TRUE(simplifier.Run(module).ValueOrDie());
 
   EXPECT_THAT(computation->root_instruction(), param);
 }
@@ -1752,7 +1797,7 @@ TEST_F(AlgebraicSimplifierTest, NegativePadding) {
   EXPECT_THAT(computation->root_instruction(), op::Pad(param, zero));
   EXPECT_TRUE(has_negative_padding(pad));
 
-  ASSERT_TRUE(simplifier.Run(module.get()).ValueOrDie());
+  ASSERT_TRUE(simplifier.Run(module).ValueOrDie());
 
   EXPECT_THAT(computation->root_instruction(), op::Slice(op::Pad(param, zero)));
   EXPECT_FALSE(
@@ -1774,7 +1819,7 @@ TEST_F(AlgebraicSimplifierTest, RemoveNoopReshape) {
 
   AlgebraicSimplifier simplifier(/*is_layout_sensitive=*/false,
                                  non_bitcasting_callback());
-  ASSERT_TRUE(simplifier.Run(module.get()).ValueOrDie());
+  ASSERT_TRUE(simplifier.Run(module).ValueOrDie());
 
   EXPECT_THAT(computation->root_instruction(), param);
 }
@@ -1797,7 +1842,7 @@ TEST_F(AlgebraicSimplifierTest, RemoveNoopSlice) {
 
   AlgebraicSimplifier simplifier(/*is_layout_sensitive=*/false,
                                  non_bitcasting_callback());
-  ASSERT_TRUE(simplifier.Run(module.get()).ValueOrDie());
+  ASSERT_TRUE(simplifier.Run(module).ValueOrDie());
 
   EXPECT_THAT(computation->root_instruction(), param);
 }
@@ -1925,7 +1970,8 @@ TEST_F(AlgebraicSimplifierTest, ConvertConvToMatmul) {
     b.AddInstruction(HloInstruction::CreateConvolve(out_shape, input, filter,
                                                     window, dnums));
 
-    auto module = CreateNewModule();
+    // TODO(b/80488902): verify this module.
+    auto module = HloTestBase::CreateNewModule();
     auto* computation = module->AddEntryComputation(b.Build());
 
     AlgebraicSimplifier simplifier(/*is_layout_sensitive=*/true,
@@ -2053,7 +2099,7 @@ TEST_F(AlgebraicSimplifierTest, MaxMinToClamp) {
 
   AlgebraicSimplifier simplifier(/*is_layout_sensitive=*/false,
                                  non_bitcasting_callback());
-  ASSERT_TRUE(simplifier.Run(module.get()).ValueOrDie());
+  ASSERT_TRUE(simplifier.Run(module).ValueOrDie());
 
   EXPECT_THAT(computation->root_instruction(),
               op::Clamp(max_value, param0, min_value));
@@ -2083,7 +2129,7 @@ TEST_F(AlgebraicSimplifierTest, MinMaxToClamp) {
 
   AlgebraicSimplifier simplifier(/*is_layout_sensitive=*/false,
                                  non_bitcasting_callback());
-  ASSERT_TRUE(simplifier.Run(module.get()).ValueOrDie());
+  ASSERT_TRUE(simplifier.Run(module).ValueOrDie());
 
   EXPECT_THAT(computation->root_instruction(),
               op::Clamp(max_value, param0, min_value));
@@ -2114,7 +2160,7 @@ TEST_F(AlgebraicSimplifierTest, MinMaxWithBroadcastToClamp) {
 
   AlgebraicSimplifier simplifier(/*is_layout_sensitive=*/false,
                                  non_bitcasting_callback());
-  ASSERT_TRUE(simplifier.Run(module.get()).ValueOrDie());
+  ASSERT_TRUE(simplifier.Run(module).ValueOrDie());
 
   EXPECT_THAT(computation->root_instruction(),
               op::Clamp(max_value, param0, min_value));
@@ -2144,7 +2190,7 @@ TEST_F(AlgebraicSimplifierTest, MinMaxNotToClamp) {
 
   AlgebraicSimplifier simplifier(/*is_layout_sensitive=*/false,
                                  non_bitcasting_callback());
-  EXPECT_FALSE(simplifier.Run(module.get()).ValueOrDie());
+  EXPECT_FALSE(simplifier.Run(module).ValueOrDie());
 
   EXPECT_THAT(computation->root_instruction(),
               op::Minimum(op::Maximum(param0, max_value), min_value));
@@ -2177,7 +2223,7 @@ TEST_F(AlgebraicSimplifierTest, MinEquationWithMaxNotToClamp) {
 
   AlgebraicSimplifier simplifier(/*is_layout_sensitive=*/false,
                                  non_bitcasting_callback());
-  EXPECT_FALSE(simplifier.Run(module.get()).ValueOrDie());
+  EXPECT_FALSE(simplifier.Run(module).ValueOrDie());
 
   EXPECT_THAT(computation->root_instruction(),
               op::Minimum(op::Add(op::Maximum(param0, max_value), max_value),
@@ -2193,10 +2239,8 @@ TEST_F(AlgebraicSimplifierTest, ScalarBroadcastToSlice) {
       HloInstruction::CreateParameter(0, r0f32, "scalar_param"));
 
   Shape broadcast_shape = ShapeUtil::MakeShape(F32, {4, 5, 6, 7});
-  HloInstruction* broadcast =
-      builder.AddInstruction(HloInstruction::CreateBroadcast(
-          broadcast_shape, scalar_param,
-          AsInt64Slice(broadcast_shape.dimensions())));
+  HloInstruction* broadcast = builder.AddInstruction(
+      HloInstruction::CreateBroadcast(broadcast_shape, scalar_param, {}));
 
   Shape slice_shape = ShapeUtil::MakeShape(F32, {2, 2, 3, 3});
   HloInstruction* slice = builder.AddInstruction(HloInstruction::CreateSlice(
@@ -2212,10 +2256,10 @@ TEST_F(AlgebraicSimplifierTest, ScalarBroadcastToSlice) {
   AlgebraicSimplifier simplifier(/*is_layout_sensitive=*/false,
                                  non_bitcasting_callback());
 
-  ASSERT_TRUE(simplifier.Run(module.get()).ValueOrDie());
+  ASSERT_TRUE(simplifier.Run(module).ValueOrDie());
 
   // Running simplification again should not result in any further changes.
-  ASSERT_FALSE(simplifier.Run(module.get()).ValueOrDie());
+  ASSERT_FALSE(simplifier.Run(module).ValueOrDie());
 
   root = computation->root_instruction();
   EXPECT_THAT(root, op::Broadcast(scalar_param));
@@ -2230,10 +2274,8 @@ TEST_F(AlgebraicSimplifierTest, ScalarBroadcastToTransposeReshape) {
       HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0f)));
 
   Shape broadcast_shape = ShapeUtil::MakeShape(F32, {4, 5, 6});
-  HloInstruction* broadcast =
-      builder.AddInstruction(HloInstruction::CreateBroadcast(
-          broadcast_shape, forty_two,
-          AsInt64Slice(broadcast_shape.dimensions())));
+  HloInstruction* broadcast = builder.AddInstruction(
+      HloInstruction::CreateBroadcast(broadcast_shape, forty_two, {}));
 
   HloInstruction* transpose =
       builder.AddInstruction(HloInstruction::CreateTranspose(
@@ -2252,7 +2294,7 @@ TEST_F(AlgebraicSimplifierTest, ScalarBroadcastToTransposeReshape) {
 
   AlgebraicSimplifier simplifier(/*is_layout_sensitive=*/false,
                                  non_bitcasting_callback());
-  ASSERT_TRUE(simplifier.Run(module.get()).ValueOrDie());
+  ASSERT_TRUE(simplifier.Run(module).ValueOrDie());
 
   root = computation->root_instruction();
   EXPECT_THAT(root, op::Broadcast(forty_two));
@@ -2261,7 +2303,8 @@ TEST_F(AlgebraicSimplifierTest, ScalarBroadcastToTransposeReshape) {
 
 // Test that ReduceWindow(Pad(op, x), y) can simplify to ReduceWindow(op, x).
 TEST_F(AlgebraicSimplifierTest, FoldPadIntoReduceWindow) {
-  auto module = CreateNewModule();
+  // TODO(b/80488902): verify this module.
+  auto module = HloTestBase::CreateNewModule();
   HloComputation::Builder builder(TestName());
 
   // Create operand to the pad.
@@ -2342,7 +2385,8 @@ TEST_F(AlgebraicSimplifierTest, FoldPadIntoReduceWindow) {
 // Test that ReduceWindow(Convert(Pad(op, x)), y) can simplify to
 // ReduceWindow(Convert(op), x).
 TEST_F(AlgebraicSimplifierTest, FoldConvertedPadIntoReduceWindow) {
-  auto module = CreateNewModule();
+  // TODO(b/80488902): verify this module.
+  auto module = HloTestBase::CreateNewModule();
   HloComputation::Builder builder(TestName());
 
   // Create operand to the pad.
@@ -2437,7 +2481,7 @@ TEST_F(AlgebraicSimplifierTest, ReversalOfTrivialDimensionsToBitcast) {
 
   AlgebraicSimplifier simplifier(/*is_layout_sensitive=*/false,
                                  non_bitcasting_callback());
-  ASSERT_TRUE(simplifier.Run(module.get()).ValueOrDie());
+  ASSERT_TRUE(simplifier.Run(module).ValueOrDie());
 
   HloInstruction* root = computation->root_instruction();
   EXPECT_EQ(a, root);
diff --git a/tensorflow/compiler/xla/service/batchnorm_expander.cc b/tensorflow/compiler/xla/service/batchnorm_expander.cc
index 96e02b82b97ff2fd682638f4c6297cbc2019c481..ec13fadbc75e2315d1d6ef72e24a0faca0c7de40 100644
--- a/tensorflow/compiler/xla/service/batchnorm_expander.cc
+++ b/tensorflow/compiler/xla/service/batchnorm_expander.cc
@@ -58,8 +58,7 @@ class BatchNormExpanderVisitor : public DfsHloVisitorWithDefault {
 
   // Runs the visitor on a computation.
   static bool Run(HloComputation* computation, bool rewrite_training_op,
-                  bool rewrite_inference_op, bool rewrite_grad_op,
-                  bool use_fusion);
+                  bool rewrite_inference_op, bool rewrite_grad_op);
 
   // Returns whether any batch norm ops were rewritten.
   const bool changed() const { return changed_; }
@@ -70,21 +69,14 @@ class BatchNormExpanderVisitor : public DfsHloVisitorWithDefault {
   explicit BatchNormExpanderVisitor(HloComputation* computation,
                                     bool rewrite_training_op,
                                     bool rewrite_inference_op,
-                                    bool rewrite_grad_op, bool use_fusion)
+                                    bool rewrite_grad_op)
       : computation_(computation),
         rewrite_training_op_(rewrite_training_op),
         rewrite_inference_op_(rewrite_inference_op),
-        rewrite_grad_op_(rewrite_grad_op),
-        use_fusion_(use_fusion) {}
+        rewrite_grad_op_(rewrite_grad_op) {}
 
   HloComputation* GetOrCreateScalarAddComputation(
       PrimitiveType primitive_type) {
-    HloComputation** scalar_add_computation =
-        &scalar_add_computations_[primitive_type];
-    if (*scalar_add_computation) {
-      return *scalar_add_computation;
-    }
-
     HloComputation::Builder b("scalar_add_computation");
     Shape shape = ShapeUtil::MakeShape(primitive_type, {});
     auto scalar_lhs = b.AddInstruction(
@@ -93,26 +85,39 @@ class BatchNormExpanderVisitor : public DfsHloVisitorWithDefault {
         HloInstruction::CreateParameter(1, shape, "scalar_rhs"));
     auto scalar_op = b.AddInstruction(HloInstruction::CreateBinary(
         shape, HloOpcode::kAdd, scalar_lhs, scalar_rhs));
-    *scalar_add_computation =
-        computation_->parent()->AddEmbeddedComputation(b.Build(scalar_op));
-    return *scalar_add_computation;
+    return computation_->parent()->AddEmbeddedComputation(b.Build(scalar_op));
   }
 
-  // Current HloComputation instance the BatchNormExpander is
-  // traversing.
-  HloComputation* computation_;
-
-  bool rewrite_training_op_;
-  bool rewrite_inference_op_;
-  bool rewrite_grad_op_;
-  bool use_fusion_;
-
-  // Whether rewrite has occurred.
-  bool changed_ = false;
+  std::unique_ptr<HloInstruction> Rsqrt(
+      HloInstruction* operand,
+      const std::function<HloInstruction*(std::unique_ptr<HloInstruction>)>&
+          add_instruction) {
+    HloInstruction* exponent = add_instruction(HloInstruction::CreateBroadcast(
+        operand->shape(),
+        add_instruction(HloInstruction::CreateConvert(
+            ShapeUtil::MakeShape(operand->shape().element_type(), {}),
+            add_instruction(HloInstruction::CreateConstant(
+                Literal::CreateR0<float>(-0.5f))))),
+        {}));
+    return HloInstruction::CreateBinary(operand->shape(), HloOpcode::kPower,
+                                        operand, exponent);
+  }
 
-  // Cached computations for adding two scalars.
-  tensorflow::gtl::FlatMap<PrimitiveType, HloComputation*>
-      scalar_add_computations_;
+  std::unique_ptr<HloInstruction> Mean(
+      int64 element_count, HloInstruction* operand,
+      const std::function<HloInstruction*(std::unique_ptr<HloInstruction>)>&
+          add_instruction) {
+    HloInstruction* elem_count_recip =
+        add_instruction(HloInstruction::CreateBroadcast(
+            operand->shape(),
+            add_instruction(HloInstruction::CreateConvert(
+                ShapeUtil::MakeShape(operand->shape().element_type(), {}),
+                add_instruction(HloInstruction::CreateConstant(
+                    Literal::CreateR0<float>(1.0 / element_count))))),
+            {}));
+    return HloInstruction::CreateBinary(operand->shape(), HloOpcode::kMultiply,
+                                        operand, elem_count_recip);
+  }
 
   // Replaces the existing HLO instruction old_instruction, with
   // new_instruction, and marks the optimizer status as changed.
@@ -136,6 +141,16 @@ class BatchNormExpanderVisitor : public DfsHloVisitorWithDefault {
     changed_ = true;
     return Status::OK();
   }
+  // Current HloComputation instance the BatchNormExpander is
+  // traversing.
+  HloComputation* computation_;
+
+  bool rewrite_training_op_;
+  bool rewrite_inference_op_;
+  bool rewrite_grad_op_;
+
+  // Whether rewrite has occurred.
+  bool changed_ = false;
 };
 
 }  // namespace
@@ -143,13 +158,12 @@ class BatchNormExpanderVisitor : public DfsHloVisitorWithDefault {
 bool BatchNormExpanderVisitor::Run(HloComputation* computation,
                                    bool rewrite_training_op,
                                    bool rewrite_inference_op,
-                                   bool rewrite_grad_op, bool use_fusion) {
+                                   bool rewrite_grad_op) {
   BatchNormExpanderVisitor visitor(
       computation,
       /*rewrite_training_op=*/rewrite_training_op,
       /*rewrite_inference_op=*/rewrite_inference_op,
-      /*rewrite_grad_op=*/rewrite_grad_op,
-      /*use_fusion=*/use_fusion);
+      /*rewrite_grad_op=*/rewrite_grad_op);
   TF_CHECK_OK(computation->Accept(&visitor));
   return visitor.changed_;
 }
@@ -167,6 +181,10 @@ Status BatchNormExpanderVisitor::HandleBatchNormTraining(
     added_instructions.push_back(added_inst);
     return added_inst;
   };
+  auto add_binary = [&](const Shape& shape, const HloOpcode opcode,
+                        HloInstruction* a, HloInstruction* b) {
+    return add(HloInstruction::CreateBinary(shape, opcode, a, b));
+  };
   int64 instruction_count_before = computation_->instruction_count();
 
   // Expand batch norm training into smaller HLO ops.
@@ -176,12 +194,7 @@ Status BatchNormExpanderVisitor::HandleBatchNormTraining(
   int64 feature_index = batch_norm->feature_index();
   const int64 feature_count = operand_shape.dimensions(feature_index);
   const int64 size_in_elements = ShapeUtil::ElementsIn(operand_shape);
-  auto elements_per_feature_literal =
-      Literal::CreateR0<float>(size_in_elements / feature_count);
-  TF_ASSIGN_OR_RETURN(elements_per_feature_literal,
-                      elements_per_feature_literal->Convert(ptype));
-  auto elements_per_feature = add(
-      HloInstruction::CreateConstant(std::move(elements_per_feature_literal)));
+  int64 elements_per_feature_int64 = size_in_elements / feature_count;
 
   HloInstruction* scale = batch_norm->mutable_operand(1);
   HloInstruction* offset = batch_norm->mutable_operand(2);
@@ -193,8 +206,9 @@ Status BatchNormExpanderVisitor::HandleBatchNormTraining(
 
   auto epsilon_literal = Literal::CreateR0(batch_norm->epsilon());
   TF_ASSIGN_OR_RETURN(epsilon_literal, epsilon_literal->Convert(ptype));
-  auto epsilon =
-      add(HloInstruction::CreateConstant(std::move(epsilon_literal)));
+  auto epsilon = add(HloInstruction::CreateBroadcast(
+      operand_shape,
+      add(HloInstruction::CreateConstant(std::move(epsilon_literal))), {}));
   std::vector<int64> dimensions_without_feature;
 
   for (int64 i = 0; i < ShapeUtil::Rank(operand_shape); ++i) {
@@ -213,8 +227,8 @@ Status BatchNormExpanderVisitor::HandleBatchNormTraining(
       GetOrCreateScalarAddComputation(ptype);
 
   // X^2.
-  auto operand_squared = add(HloInstruction::CreateBinary(
-      operand_shape, HloOpcode::kMultiply, operand, operand));
+  auto operand_squared =
+      add_binary(operand_shape, HloOpcode::kMultiply, operand, operand);
   // Sum[X].
   auto sum = add(HloInstruction::CreateReduce(feature_shape, operand, zero,
                                               dimensions_without_feature,
@@ -225,71 +239,48 @@ Status BatchNormExpanderVisitor::HandleBatchNormTraining(
       feature_shape, operand_squared, zero, dimensions_without_feature,
       add_reduce_computation));
 
-  // Fuse two parallel reduces together to improve performance.
-  if (use_fusion_ && !batch_norm->has_sharding()) {
-    auto tuple = add(HloInstruction::CreateTuple({sum, squared_sum}));
-
-    auto fused = computation_->CreateFusionInstruction(
-        {tuple, sum, squared_sum, operand_squared},
-        HloInstruction::FusionKind::kInput);
-
-    sum = add(HloInstruction::CreateGetTupleElement(feature_shape, fused, 0));
-
-    squared_sum =
-        add(HloInstruction::CreateGetTupleElement(feature_shape, fused, 1));
-  }
-
   // E[X].
-  auto mean = add(HloInstruction::CreateBinary(
-      feature_shape, HloOpcode::kDivide, sum, elements_per_feature));
+  auto mean = add(Mean(elements_per_feature_int64, sum, add));
 
   auto mean_broadcasted = add(
       HloInstruction::CreateBroadcast(operand_shape, mean, {feature_index}));
 
   // E[X^2].
-  auto square_mean = add(HloInstruction::CreateBinary(
-      feature_shape, HloOpcode::kDivide, squared_sum, elements_per_feature));
+  auto square_mean = add(Mean(elements_per_feature_int64, squared_sum, add));
 
   // E^2[X].
-  auto mean_square = add(HloInstruction::CreateBinary(
-      feature_shape, HloOpcode::kMultiply, mean, mean));
+  auto mean_square =
+      add_binary(feature_shape, HloOpcode::kMultiply, mean, mean);
 
   // Var[X].
-  auto var = add(HloInstruction::CreateBinary(
-      feature_shape, HloOpcode::kSubtract, square_mean, mean_square));
+  auto var =
+      add_binary(feature_shape, HloOpcode::kSubtract, square_mean, mean_square);
 
   auto var_broadcasted =
       add(HloInstruction::CreateBroadcast(operand_shape, var, {feature_index}));
 
   // Var[X] + epsilon.
-  auto var_add_epsilon = add(HloInstruction::CreateBinary(
-      operand_shape, HloOpcode::kAdd, var_broadcasted, epsilon));
-
-  auto neg_half_literal = Literal::CreateR0(-0.5f);
-  TF_ASSIGN_OR_RETURN(neg_half_literal, neg_half_literal->Convert(ptype));
-  auto neg_half =
-      add(HloInstruction::CreateConstant(std::move(neg_half_literal)));
+  auto var_add_epsilon =
+      add_binary(operand_shape, HloOpcode::kAdd, var_broadcasted, epsilon);
 
   // 1 / Sqrt[Var[X] + epsilon].
-  auto rsqrt_var_add_epsilon = add(HloInstruction::CreateBinary(
-      operand_shape, HloOpcode::kPower, var_add_epsilon, neg_half));
+  auto rsqrt_var_add_epsilon = add(Rsqrt(var_add_epsilon, add));
 
   // X - E[X].
-  auto operand_minus_mean = add(HloInstruction::CreateBinary(
-      operand_shape, HloOpcode::kSubtract, operand, mean_broadcasted));
+  auto operand_minus_mean = add_binary(operand_shape, HloOpcode::kSubtract,
+                                       operand, mean_broadcasted);
 
   // (X - E[X]) / Sqrt[Var[X] + epsilon].
-  auto normalized = add(
-      HloInstruction::CreateBinary(operand_shape, HloOpcode::kMultiply,
-                                   operand_minus_mean, rsqrt_var_add_epsilon));
+  auto normalized = add_binary(operand_shape, HloOpcode::kMultiply,
+                               operand_minus_mean, rsqrt_var_add_epsilon);
 
   // (X - E[X]) / Sqrt[Var[X] + epsilon] * scale.
-  auto scaled_normalized = add(HloInstruction::CreateBinary(
-      operand_shape, HloOpcode::kMultiply, normalized, scale_broadcasted));
+  auto scaled_normalized = add_binary(operand_shape, HloOpcode::kMultiply,
+                                      normalized, scale_broadcasted);
 
   // (X - E[X]) / Sqrt[Var[X] + epsilon] * scale + offset.
-  auto shifted_normalized = add(HloInstruction::CreateBinary(
-      operand_shape, HloOpcode::kAdd, scaled_normalized, offset_broadcasted));
+  auto shifted_normalized = add_binary(operand_shape, HloOpcode::kAdd,
+                                       scaled_normalized, offset_broadcasted);
 
   auto tuple = HloInstruction::CreateTuple({shifted_normalized, mean, var});
 
@@ -331,8 +322,11 @@ Status BatchNormExpanderVisitor::HandleBatchNormInference(
 
   auto epsilon_literal = Literal::CreateR0(batch_norm->epsilon());
   TF_ASSIGN_OR_RETURN(epsilon_literal, epsilon_literal->Convert(ptype));
-  auto epsilon = computation_->AddInstruction(
-      HloInstruction::CreateConstant(std::move(epsilon_literal)));
+  auto epsilon = computation_->AddInstruction(HloInstruction::CreateBroadcast(
+      operand_shape,
+      computation_->AddInstruction(
+          HloInstruction::CreateConstant(std::move(epsilon_literal))),
+      {}));
 
   std::vector<int64> dimensions_without_feature;
 
@@ -349,6 +343,10 @@ Status BatchNormExpanderVisitor::HandleBatchNormInference(
     added_instructions.push_back(added_inst);
     return added_inst;
   };
+  auto add_binary = [&](const Shape& shape, const HloOpcode opcode,
+                        HloInstruction* a, HloInstruction* b) {
+    return add(HloInstruction::CreateBinary(shape, opcode, a, b));
+  };
   int64 instruction_count_before = computation_->instruction_count();
 
   auto scale_broadcasted = add(
@@ -364,30 +362,23 @@ Status BatchNormExpanderVisitor::HandleBatchNormInference(
       add(HloInstruction::CreateBroadcast(operand_shape, var, {feature_index}));
 
   // Var[X] + epsilon.
-  auto var_add_epsilon = add(HloInstruction::CreateBinary(
-      operand_shape, HloOpcode::kAdd, var_broadcasted, epsilon));
-
-  auto neg_half_literal = Literal::CreateR0(-0.5f);
-  TF_ASSIGN_OR_RETURN(neg_half_literal, neg_half_literal->Convert(ptype));
-  auto neg_half =
-      add(HloInstruction::CreateConstant(std::move(neg_half_literal)));
+  auto var_add_epsilon =
+      add_binary(operand_shape, HloOpcode::kAdd, var_broadcasted, epsilon);
 
   // 1 / Sqrt[Var[X] + epsilon].
-  auto rsqrt_var_add_epsilon = add(HloInstruction::CreateBinary(
-      operand_shape, HloOpcode::kPower, var_add_epsilon, neg_half));
+  auto rsqrt_var_add_epsilon = add(Rsqrt(var_add_epsilon, add));
 
   // X - E[X].
-  auto operand_minus_mean = add(HloInstruction::CreateBinary(
-      operand_shape, HloOpcode::kSubtract, operand, mean_broadcasted));
+  auto operand_minus_mean = add_binary(operand_shape, HloOpcode::kSubtract,
+                                       operand, mean_broadcasted);
 
   // (X - E[X]) / Sqrt[Var[X] + epsilon].
-  auto normalized = add(
-      HloInstruction::CreateBinary(operand_shape, HloOpcode::kMultiply,
-                                   operand_minus_mean, rsqrt_var_add_epsilon));
+  auto normalized = add_binary(operand_shape, HloOpcode::kMultiply,
+                               operand_minus_mean, rsqrt_var_add_epsilon);
 
   // (X - E[X]) / Sqrt[Var[X] + epsilon] * scale.
-  auto scaled_normalized = add(HloInstruction::CreateBinary(
-      operand_shape, HloOpcode::kMultiply, normalized, scale_broadcasted));
+  auto scaled_normalized = add_binary(operand_shape, HloOpcode::kMultiply,
+                                      normalized, scale_broadcasted);
 
   // (X - E[X]) / Sqrt[Var[X] + epsilon] * scale + offset.
   auto shifted_normalized = HloInstruction::CreateBinary(
@@ -435,6 +426,10 @@ Status BatchNormExpanderVisitor::HandleBatchNormGrad(
     added_instructions.push_back(added_inst);
     return added_inst;
   };
+  auto add_binary = [&](const Shape& shape, const HloOpcode opcode,
+                        HloInstruction* a, HloInstruction* b) {
+    return add(HloInstruction::CreateBinary(shape, opcode, a, b));
+  };
   int64 instruction_count_before = computation_->instruction_count();
 
   HloInstruction* activation = batch_norm->mutable_operand(0);
@@ -450,26 +445,20 @@ Status BatchNormExpanderVisitor::HandleBatchNormGrad(
 
   const int64 size_in_elements = ShapeUtil::ElementsIn(activation_shape);
   const int64 feature_count = activation_shape.dimensions(feature_index);
-  auto elements_per_feature_literal =
-      Literal::CreateR0<float>(size_in_elements / feature_count);
-  TF_ASSIGN_OR_RETURN(elements_per_feature_literal,
-                      elements_per_feature_literal->Convert(ptype));
-  auto elements_per_feature = add(
-      HloInstruction::CreateConstant(std::move(elements_per_feature_literal)));
+  const int64 elements_per_feature_int64 = size_in_elements / feature_count;
 
   auto zero_literal = Literal::CreateR0(0.0f);
   TF_ASSIGN_OR_RETURN(zero_literal, zero_literal->Convert(ptype));
   auto zero = add(HloInstruction::CreateConstant(std::move(zero_literal)));
 
-  auto neg_half_literal = Literal::CreateR0(-0.5f);
-  TF_ASSIGN_OR_RETURN(neg_half_literal, neg_half_literal->Convert(ptype));
-  auto neg_half =
-      add(HloInstruction::CreateConstant(std::move(neg_half_literal)));
-
   auto epsilon_literal = Literal::CreateR0(batch_norm->epsilon());
   TF_ASSIGN_OR_RETURN(epsilon_literal, epsilon_literal->Convert(ptype));
-  auto epsilon =
+  auto epsilon_scalar =
       add(HloInstruction::CreateConstant(std::move(epsilon_literal)));
+  auto epsilon_activation = add(
+      HloInstruction::CreateBroadcast(activation_shape, epsilon_scalar, {}));
+  auto epsilon_feature =
+      add(HloInstruction::CreateBroadcast(feature_shape, epsilon_scalar, {}));
 
   std::vector<int64> dimensions_without_feature;
 
@@ -489,26 +478,23 @@ Status BatchNormExpanderVisitor::HandleBatchNormGrad(
       HloInstruction::CreateBroadcast(activation_shape, mean, {feature_index}));
 
   // rsqrt[Var[X] + epsilon].
-  auto rsqrt_var_add_epsilon_broadcasted = add(HloInstruction::CreateBinary(
-      activation_shape, HloOpcode::kPower,
-      add(HloInstruction::CreateBinary(activation_shape, HloOpcode::kAdd,
-                                       variance_broadcasted, epsilon)),
-      neg_half));
-
-  auto rsqrt_var_add_epsilon = add(HloInstruction::CreateBinary(
-      feature_shape, HloOpcode::kPower,
-      add(HloInstruction::CreateBinary(feature_shape, HloOpcode::kAdd, variance,
-                                       epsilon)),
-      neg_half));
+  auto rsqrt_var_add_epsilon_broadcasted =
+      add(Rsqrt(add_binary(activation_shape, HloOpcode::kAdd,
+                           variance_broadcasted, epsilon_activation),
+                add));
+
+  auto rsqrt_var_add_epsilon = add(Rsqrt(
+      add_binary(feature_shape, HloOpcode::kAdd, variance, epsilon_feature),
+      add));
 
   // X - E[X].
-  auto activation_minus_mean = add(HloInstruction::CreateBinary(
-      activation_shape, HloOpcode::kSubtract, activation, mean_broadcasted));
+  auto activation_minus_mean = add_binary(
+      activation_shape, HloOpcode::kSubtract, activation, mean_broadcasted);
 
   // Grad[Y] * (X - E[X]).
   auto grad_output_times_activiation_minus_mean =
-      add(HloInstruction::CreateBinary(activation_shape, HloOpcode::kMultiply,
-                                       grad_output, activation_minus_mean));
+      add_binary(activation_shape, HloOpcode::kMultiply, grad_output,
+                 activation_minus_mean);
 
   HloComputation* add_reduce_computation =
       GetOrCreateScalarAddComputation(ptype);
@@ -524,25 +510,10 @@ Status BatchNormExpanderVisitor::HandleBatchNormGrad(
       feature_shape, grad_output, zero, dimensions_without_feature,
       add_reduce_computation));
 
-  if (use_fusion_ && !batch_norm->has_sharding()) {
-    auto tuple = add(HloInstruction::CreateTuple(
-        {sum_grad_output_times_activiation_minus_mean, grad_beta}));
-
-    auto fused = computation_->CreateFusionInstruction(
-        {tuple, sum_grad_output_times_activiation_minus_mean, grad_beta},
-        HloInstruction::FusionKind::kInput);
-
-    sum_grad_output_times_activiation_minus_mean =
-        add(HloInstruction::CreateGetTupleElement(feature_shape, fused, 0));
-
-    grad_beta =
-        add(HloInstruction::CreateGetTupleElement(feature_shape, fused, 1));
-  }
-
   // Grad[scale] = Sum(Grad[Y] * (X - E[X]) * rsqrt[Var[X] + epsilon]).
-  auto grad_scale = add(HloInstruction::CreateBinary(
-      feature_shape, HloOpcode::kMultiply,
-      sum_grad_output_times_activiation_minus_mean, rsqrt_var_add_epsilon));
+  auto grad_scale = add_binary(feature_shape, HloOpcode::kMultiply,
+                               sum_grad_output_times_activiation_minus_mean,
+                               rsqrt_var_add_epsilon);
 
   // I2 = Sum(Grad[Y])
   auto i2 = add(HloInstruction::CreateBroadcast(activation_shape, grad_beta,
@@ -554,39 +525,40 @@ Status BatchNormExpanderVisitor::HandleBatchNormGrad(
       {feature_index}));
 
   // I4 = (X - E[X]) * I3
-  auto i4 = add(HloInstruction::CreateBinary(
-      activation_shape, HloOpcode::kMultiply, i3, activation_minus_mean));
+  auto i4 = add_binary(activation_shape, HloOpcode::kMultiply, i3,
+                       activation_minus_mean);
 
   // I5 = I4 / (Var[X] + epsilon)
-  auto i5 = add(HloInstruction::CreateBinary(
-      activation_shape, HloOpcode::kDivide, i4,
-      add(HloInstruction::CreateBinary(activation_shape, HloOpcode::kAdd,
-                                       variance_broadcasted, epsilon))));
+  auto i5 = add_binary(activation_shape, HloOpcode::kDivide, i4,
+                       add_binary(activation_shape, HloOpcode::kAdd,
+                                  variance_broadcasted, epsilon_activation));
 
   // scale * rsqrt[Var[X] + epsilon] * 1/N
-  auto scale_times_rsqrt_var_add_epsilon = add(HloInstruction::CreateBinary(
-      activation_shape, HloOpcode::kMultiply, scale_broadcasted,
-      rsqrt_var_add_epsilon_broadcasted));
+  auto scale_times_rsqrt_var_add_epsilon =
+      add_binary(activation_shape, HloOpcode::kMultiply, scale_broadcasted,
+                 rsqrt_var_add_epsilon_broadcasted);
 
-  scale_times_rsqrt_var_add_epsilon = add(HloInstruction::CreateBinary(
-      activation_shape, HloOpcode::kDivide, scale_times_rsqrt_var_add_epsilon,
-      elements_per_feature));
+  scale_times_rsqrt_var_add_epsilon = add(
+      Mean(elements_per_feature_int64, scale_times_rsqrt_var_add_epsilon, add));
 
-  auto i1 =
-      add(HloInstruction::CreateBinary(activation_shape, HloOpcode::kMultiply,
-                                       grad_output, elements_per_feature));
+  auto elements_per_feature_literal =
+      Literal::CreateR0<float>(elements_per_feature_int64);
+  TF_ASSIGN_OR_RETURN(elements_per_feature_literal,
+                      elements_per_feature_literal->Convert(ptype));
+  auto elements_per_feature = add(
+      HloInstruction::CreateConstant(std::move(elements_per_feature_literal)));
+  auto i1 = add_binary(activation_shape, HloOpcode::kMultiply, grad_output,
+                       add(HloInstruction::CreateBroadcast(
+                           activation_shape, elements_per_feature, {})));
 
   // I6 = I1 - I2 - I5
-  auto i6 = add(HloInstruction::CreateBinary(
+  auto i6 = add_binary(
       activation_shape, HloOpcode::kSubtract,
-      add(HloInstruction::CreateBinary(activation_shape, HloOpcode::kSubtract,
-                                       i1, i2)),
-      i5));
+      add_binary(activation_shape, HloOpcode::kSubtract, i1, i2), i5);
 
   // Grad[X] = scale * rsqrt[Var[X] + epsilon] * 1/N * I6.
-  auto grad_activation =
-      add(HloInstruction::CreateBinary(activation_shape, HloOpcode::kMultiply,
-                                       scale_times_rsqrt_var_add_epsilon, i6));
+  auto grad_activation = add_binary(activation_shape, HloOpcode::kMultiply,
+                                    scale_times_rsqrt_var_add_epsilon, i6);
   auto tuple =
       HloInstruction::CreateTuple({grad_activation, grad_scale, grad_beta});
   if (batch_norm->has_sharding()) {
@@ -615,8 +587,8 @@ StatusOr<bool> BatchNormExpander::Run(HloModule* module) {
   bool changed = false;
   for (auto* comp : module->MakeNonfusionComputations()) {
     if (BatchNormExpanderVisitor::Run(comp, rewrite_training_op_,
-                                      rewrite_inference_op_, rewrite_grad_op_,
-                                      use_fusion_)) {
+                                      rewrite_inference_op_,
+                                      rewrite_grad_op_)) {
       changed = true;
     }
   }
diff --git a/tensorflow/compiler/xla/service/batchnorm_expander.h b/tensorflow/compiler/xla/service/batchnorm_expander.h
index 4ad987085da91684bb7891070afeefd19be4138f..7ae202c583516443a6263403fb5460d1adbabd97 100644
--- a/tensorflow/compiler/xla/service/batchnorm_expander.h
+++ b/tensorflow/compiler/xla/service/batchnorm_expander.h
@@ -31,11 +31,10 @@ class BatchNormExpander : public HloPassInterface {
   // When use_fusion is set, a multi-output fusion node is created.
   BatchNormExpander(bool rewrite_training_op = false,
                     bool rewrite_inference_op = false,
-                    bool rewrite_grad_op = false, bool use_fusion = true)
+                    bool rewrite_grad_op = false)
       : rewrite_training_op_(rewrite_training_op),
         rewrite_inference_op_(rewrite_inference_op),
-        rewrite_grad_op_(rewrite_grad_op),
-        use_fusion_(use_fusion) {}
+        rewrite_grad_op_(rewrite_grad_op) {}
   ~BatchNormExpander() = default;
   tensorflow::StringPiece name() const override { return "batchnorm_expander"; }
 
@@ -47,7 +46,6 @@ class BatchNormExpander : public HloPassInterface {
   bool rewrite_training_op_;
   bool rewrite_inference_op_;
   bool rewrite_grad_op_;
-  bool use_fusion_;
 };
 
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/bfloat16_conversion_folding.cc b/tensorflow/compiler/xla/service/bfloat16_conversion_folding.cc
index 08d0152e3cfcfcb7ae1e85f72c2f7dc856f5e8b3..1b8b2d204503576c3fcb02f6d5b37f2db45e1768 100644
--- a/tensorflow/compiler/xla/service/bfloat16_conversion_folding.cc
+++ b/tensorflow/compiler/xla/service/bfloat16_conversion_folding.cc
@@ -182,15 +182,26 @@ Status BFloat16ConversionFoldingVisitor::DefaultAction(HloInstruction* hlo) {
 
 Status BFloat16ConversionFoldingVisitor::HandleCrossReplicaSum(
     HloInstruction* crs) {
-  if (!ShapeUtil::IsTuple(crs->shape()) ||
-      !bfloat16_support_->SupportsMixedPrecisions(*crs)) {
-    return DefaultAction(crs);
-  }
-
   // First use DefaultAction() to handle the operands. It can't handle
   // tuple-shaped output.
   TF_RETURN_IF_ERROR(DefaultAction(crs));
 
+  if (!bfloat16_support_->SupportsMixedPrecisions(*crs)) {
+    return Status::OK();
+  }
+
+  // If the output is not a tuple, we don't need special handling.
+  if (!ShapeUtil::IsTuple(crs->shape())) {
+    return Status::OK();
+  }
+
+  // If crs is the root instruction, we should keep its original output type.
+  // The root instruction implicitly has a use from being the result of the
+  // computation, and the code below does not take this use into account.
+  if (crs == computation_->root_instruction()) {
+    return Status::OK();
+  }
+
   // Then do per-tuple-element handling on the output.
   std::vector<std::vector<HloInstruction*>> per_tuple_element_gtes(
       crs->operand_count());
diff --git a/tensorflow/compiler/xla/service/bfloat16_conversion_folding_test.cc b/tensorflow/compiler/xla/service/bfloat16_conversion_folding_test.cc
index 28e71c2054f59ba4d5d096bf7d898161877bb42f..f7b4c1405dbc8719d8fba5476e6e41d2921ea877 100644
--- a/tensorflow/compiler/xla/service/bfloat16_conversion_folding_test.cc
+++ b/tensorflow/compiler/xla/service/bfloat16_conversion_folding_test.cc
@@ -211,6 +211,17 @@ TEST_F(BFloat16ConversionFoldingTest, DoNotFoldTuple) {
 
 TEST_F(BFloat16ConversionFoldingTest, FoldCrossReplicaSumTupleOutput) {
   auto builder = HloComputation::Builder(TestName());
+
+  auto module = CreateNewModule();
+  HloComputation::Builder sum_builder("add");
+  auto x = sum_builder.AddInstruction(HloInstruction::CreateParameter(
+      /*parameter_number=*/0, ShapeUtil::MakeShape(F32, {}), "x"));
+  auto y = sum_builder.AddInstruction(HloInstruction::CreateParameter(
+      /*parameter_number=*/1, ShapeUtil::MakeShape(F32, {}), "y"));
+  sum_builder.AddInstruction(HloInstruction::CreateBinary(
+      ShapeUtil::MakeShape(F32, {}), HloOpcode::kAdd, x, y));
+  HloComputation* sum = module->AddEmbeddedComputation(sum_builder.Build());
+
   Shape f32_shape = ShapeUtil::MakeShape(F32, {2, 4});
   Shape bf16_shape = ShapeUtil::MakeShape(BF16, {2, 4});
 
@@ -223,7 +234,8 @@ TEST_F(BFloat16ConversionFoldingTest, FoldCrossReplicaSumTupleOutput) {
 
   HloInstruction* crs =
       builder.AddInstruction(HloInstruction::CreateCrossReplicaSum(
-          ShapeUtil::MakeTupleShape({f32_shape, f32_shape}), {convert_a, b}));
+          ShapeUtil::MakeTupleShape({f32_shape, f32_shape}), {convert_a, b},
+          sum, /*replica_group_ids=*/{}, /*barrier=*/""));
   HloInstruction* gte_a = builder.AddInstruction(
       HloInstruction::CreateGetTupleElement(f32_shape, crs, 0));
   HloInstruction* gte_b = builder.AddInstruction(
@@ -233,7 +245,6 @@ TEST_F(BFloat16ConversionFoldingTest, FoldCrossReplicaSumTupleOutput) {
   HloInstruction* tuple = builder.AddInstruction(
       HloInstruction::CreateTuple({gte_a, convert_gte_b}));
 
-  auto module = CreateNewModule();
   auto computation = module->AddEntryComputation(builder.Build());
 
   EXPECT_TRUE(FoldConversions(module.get()));
diff --git a/tensorflow/compiler/xla/service/bfloat16_normalization_test.cc b/tensorflow/compiler/xla/service/bfloat16_normalization_test.cc
index 1afaefd9df9c5771fb9e134ae9050f3abb00ea4a..830f26422bdc2b3bd789e7d5926bcebac815d34a 100644
--- a/tensorflow/compiler/xla/service/bfloat16_normalization_test.cc
+++ b/tensorflow/compiler/xla/service/bfloat16_normalization_test.cc
@@ -228,6 +228,17 @@ TEST_F(BFloat16NormalizationTest, ResolveUnsupportedMixedPrecisionReduce) {
 }
 
 TEST_F(BFloat16NormalizationTest, ResolveMixedPrecisionTupleCrossReplicaSum) {
+  auto module = CreateNewModule();
+  HloComputation::Builder sum_builder("sum");
+  auto x = sum_builder.AddInstruction(HloInstruction::CreateParameter(
+      /*parameter_number=*/0, ShapeUtil::MakeShape(F32, {}), "x"));
+  auto y = sum_builder.AddInstruction(HloInstruction::CreateParameter(
+      /*parameter_number=*/1, ShapeUtil::MakeShape(F32, {}), "y"));
+  sum_builder.AddInstruction(HloInstruction::CreateBinary(
+      ShapeUtil::MakeShape(F32, {}), HloOpcode::kAdd, x, y));
+  HloComputation* reduction =
+      module->AddEmbeddedComputation(sum_builder.Build());
+
   auto builder = HloComputation::Builder(TestName());
   Shape f32_shape = ShapeUtil::MakeShape(F32, {2, 4});
   Shape bf16_shape = ShapeUtil::MakeShape(BF16, {2, 4});
@@ -239,11 +250,11 @@ TEST_F(BFloat16NormalizationTest, ResolveMixedPrecisionTupleCrossReplicaSum) {
 
   HloInstruction* crs =
       builder.AddInstruction(HloInstruction::CreateCrossReplicaSum(
-          ShapeUtil::MakeTupleShape({f32_shape, bf16_shape}), {a, b}));
+          ShapeUtil::MakeTupleShape({f32_shape, bf16_shape}), {a, b}, reduction,
+          /*replica_group_ids=*/{}, /*barrier=*/""));
   HloInstruction* gte = builder.AddInstruction(
       HloInstruction::CreateGetTupleElement(bf16_shape, crs, 1));
 
-  auto module = CreateNewModule();
   auto computation = module->AddEntryComputation(builder.Build());
 
   EXPECT_TRUE(Normalize(module.get()));
diff --git a/tensorflow/compiler/xla/service/buffer_assignment.cc b/tensorflow/compiler/xla/service/buffer_assignment.cc
index c0b8bf903923a327fb1378eafb51a7d493d5e62d..5d3b0cb333928d3b7b042ef0a6f4969f87655d7f 100644
--- a/tensorflow/compiler/xla/service/buffer_assignment.cc
+++ b/tensorflow/compiler/xla/service/buffer_assignment.cc
@@ -135,6 +135,7 @@ Status GatherComputationsByAllocationType(
             worklist.push_back(std::make_pair(subcomputation,
                                               false));  // Not thread local.
             break;
+          case HloOpcode::kCrossReplicaSum:
           case HloOpcode::kMap:
           case HloOpcode::kReduce:
           case HloOpcode::kReduceWindow:
@@ -630,9 +631,8 @@ Status BufferAssignment::ComputeSummaryStats() {
     }
   }
   if (module_sequence.size() == module_->computation_count()) {
-    TF_ASSIGN_OR_RETURN(
-        const int64 min_size,
-        MinimumMemoryForSequence(module_sequence, buffer_size_));
+    TF_ASSIGN_OR_RETURN(const int64 min_size,
+                        MinimumMemoryForModule(module_sequence, buffer_size_));
     stats_.total_fragmentation_bytes = stats_.total_allocation_bytes - min_size;
   }
 
diff --git a/tensorflow/compiler/xla/service/buffer_assignment_test.cc b/tensorflow/compiler/xla/service/buffer_assignment_test.cc
index a4fb0eefaca094898ed9acad8062484d1a36afe7..efa4696130ffeff669b0d674438a45c5a9d48ef2 100644
--- a/tensorflow/compiler/xla/service/buffer_assignment_test.cc
+++ b/tensorflow/compiler/xla/service/buffer_assignment_test.cc
@@ -25,7 +25,6 @@ limitations under the License.
 #include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/service/buffer_value.h"
 #include "tensorflow/compiler/xla/service/call_graph.h"
-#include "tensorflow/compiler/xla/service/computation_tracker.h"
 #include "tensorflow/compiler/xla/service/copy_insertion.h"
 #include "tensorflow/compiler/xla/service/dfs_hlo_visitor_with_default.h"
 #include "tensorflow/compiler/xla/service/flatten_call_graph.h"
@@ -33,12 +32,12 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
 #include "tensorflow/compiler/xla/service/hlo_opcode.h"
 #include "tensorflow/compiler/xla/service/hlo_ordering.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/service/hlo_scheduling.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/test.h"
 #include "tensorflow/compiler/xla/test_helpers.h"
 #include "tensorflow/compiler/xla/tests/hlo_test_base.h"
-#include "tensorflow/compiler/xla/tools/parser/hlo_parser.h"
 #include "tensorflow/compiler/xla/types.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
 #include "tensorflow/core/platform/macros.h"
@@ -82,7 +81,7 @@ const std::vector<const HloInstruction*> GetInstructions(HloInstruction* root) {
 
 class BufferAssignmentTest : public HloTestBase {
  protected:
-  BufferAssignmentTest() : computation_tracker_() {}
+  BufferAssignmentTest() {}
   ~BufferAssignmentTest() override {}
 
   std::unique_ptr<BufferAssignment> RunBufferAssignment(HloModule* module,
@@ -252,9 +251,6 @@ class BufferAssignmentTest : public HloTestBase {
     return total_size;
   }
 
-  // Computation tracker for nested computations.
-  ComputationTracker computation_tracker_;
-
   // Shapes for use in the examples.
   Shape s32_ = ShapeUtil::MakeShape(xla::S32, {});
   Shape r0f32_ = ShapeUtil::MakeShape(xla::F32, {});
@@ -375,11 +371,11 @@ TEST_F(BufferAssignmentTest, Basic) {
   // param1[100] --------------/--------/
   auto builder = HloComputation::Builder(TestName());
   auto paramscalar =
-      builder.AddInstruction(HloInstruction::CreateParameter(0, r0f32_, ""));
+      builder.AddInstruction(HloInstruction::CreateParameter(0, r0f32_, "p"));
   auto param0 = builder.AddInstruction(
-      HloInstruction::CreateParameter(1, f32vec100_, ""));
+      HloInstruction::CreateParameter(1, f32vec100_, "p1"));
   auto param1 = builder.AddInstruction(
-      HloInstruction::CreateParameter(2, f32vec100_, ""));
+      HloInstruction::CreateParameter(2, f32vec100_, "p2"));
   auto mul = builder.AddInstruction(HloInstruction::CreateBinary(
       f32vec100_, HloOpcode::kMultiply, paramscalar, param0));
   auto add = builder.AddInstruction(
@@ -422,11 +418,11 @@ TEST_F(BufferAssignmentTest, BasicUniquelyColored) {
   // share anything.
   auto builder = HloComputation::Builder(TestName());
   auto paramscalar =
-      builder.AddInstruction(HloInstruction::CreateParameter(0, r0f32_, ""));
+      builder.AddInstruction(HloInstruction::CreateParameter(0, r0f32_, "p"));
   auto param0 = builder.AddInstruction(
-      HloInstruction::CreateParameter(1, f32vec100_, ""));
+      HloInstruction::CreateParameter(1, f32vec100_, "p1"));
   auto param1 = builder.AddInstruction(
-      HloInstruction::CreateParameter(2, f32vec100_, ""));
+      HloInstruction::CreateParameter(2, f32vec100_, "p2"));
   auto mul = builder.AddInstruction(HloInstruction::CreateBinary(
       f32vec100_, HloOpcode::kMultiply, paramscalar, param0));
   auto add = builder.AddInstruction(
@@ -481,11 +477,11 @@ TEST_F(BufferAssignmentTest, BasicPartiallyColored) {
   // have the color 0, which allows the mul and add to share buffers.
   auto builder = HloComputation::Builder(TestName());
   auto paramscalar =
-      builder.AddInstruction(HloInstruction::CreateParameter(0, r0f32_, ""));
+      builder.AddInstruction(HloInstruction::CreateParameter(0, r0f32_, "p"));
   auto param0 = builder.AddInstruction(
-      HloInstruction::CreateParameter(1, f32vec100_, ""));
+      HloInstruction::CreateParameter(1, f32vec100_, "p1"));
   auto param1 = builder.AddInstruction(
-      HloInstruction::CreateParameter(2, f32vec100_, ""));
+      HloInstruction::CreateParameter(2, f32vec100_, "p2"));
   auto mul = builder.AddInstruction(HloInstruction::CreateBinary(
       f32vec100_, HloOpcode::kMultiply, paramscalar, param0));
   auto add = builder.AddInstruction(
@@ -551,11 +547,11 @@ TEST_F(BufferAssignmentTest, MultipleUsersForNode) {
   //
   auto builder = HloComputation::Builder(TestName());
   auto paramscalar =
-      builder.AddInstruction(HloInstruction::CreateParameter(0, r0f32_, ""));
+      builder.AddInstruction(HloInstruction::CreateParameter(0, r0f32_, "p"));
   auto param0 = builder.AddInstruction(
-      HloInstruction::CreateParameter(1, f32vec100_, ""));
+      HloInstruction::CreateParameter(1, f32vec100_, "p1"));
   auto param1 = builder.AddInstruction(
-      HloInstruction::CreateParameter(2, f32vec100_, ""));
+      HloInstruction::CreateParameter(2, f32vec100_, "p2"));
   auto mul = builder.AddInstruction(HloInstruction::CreateBinary(
       f32vec100_, HloOpcode::kMultiply, paramscalar, param0));
   auto add = builder.AddInstruction(
@@ -605,7 +601,7 @@ TEST_F(BufferAssignmentTest, TrivialMap) {
   // Creates the main kernel and verifies instruction counts.
   auto builder = HloComputation::Builder(TestName());
   auto param0 = builder.AddInstruction(
-      HloInstruction::CreateParameter(0, f32a100x10_, ""));
+      HloInstruction::CreateParameter(0, f32a100x10_, "p"));
   auto map = builder.AddInstruction(
       HloInstruction::CreateMap(f32a100x10_, {param0}, map_computation));
   module->AddEntryComputation(builder.Build());
@@ -658,7 +654,7 @@ TEST_F(BufferAssignmentTest, CannotReuseInputBufferOfReduce) {
 
   auto builder = HloComputation::Builder(TestName());
   auto param0 = builder.AddInstruction(
-      HloInstruction::CreateParameter(0, f32a100x10_, ""));
+      HloInstruction::CreateParameter(0, f32a100x10_, "p"));
   auto exp1 = builder.AddInstruction(
       HloInstruction::CreateUnary(f32a100x10_, HloOpcode::kExp, param0));
   auto exp2 = builder.AddInstruction(
@@ -822,7 +818,7 @@ TEST_F(BufferAssignmentTest, UnaryOpReuseChain) {
   // param0[100] ---> (exp) ---> (tanh) ---> (exp) ---> (neg)
   auto builder = HloComputation::Builder(TestName());
   auto param0 = builder.AddInstruction(
-      HloInstruction::CreateParameter(0, f32vec100_, ""));
+      HloInstruction::CreateParameter(0, f32vec100_, "p"));
   auto exp1 = builder.AddInstruction(
       HloInstruction::CreateUnary(f32vec100_, HloOpcode::kExp, param0));
   auto tanh = builder.AddInstruction(
@@ -1500,11 +1496,11 @@ TEST_F(BufferAssignmentTest, TrivialPeakBuffers) {
   // param1[100] --------------/--------/
   auto builder = HloComputation::Builder(TestName());
   auto paramscalar =
-      builder.AddInstruction(HloInstruction::CreateParameter(0, r0f32_, ""));
+      builder.AddInstruction(HloInstruction::CreateParameter(0, r0f32_, "p"));
   auto param0 = builder.AddInstruction(
-      HloInstruction::CreateParameter(1, f32vec100_, ""));
+      HloInstruction::CreateParameter(1, f32vec100_, "p1"));
   auto param1 = builder.AddInstruction(
-      HloInstruction::CreateParameter(2, f32vec100_, ""));
+      HloInstruction::CreateParameter(2, f32vec100_, "p2"));
   auto mul = builder.AddInstruction(HloInstruction::CreateBinary(
       f32vec100_, HloOpcode::kMultiply, paramscalar, param0));
   auto add = builder.AddInstruction(
@@ -1540,7 +1536,7 @@ TEST_F(BufferAssignmentTest, PeakBuffers) {
   // be {%rev, %neg, %concat}. This occurs right at the concat itself.
   auto builder = HloComputation::Builder(TestName());
   auto param = builder.AddInstruction(
-      HloInstruction::CreateParameter(0, f32vec100_, ""));
+      HloInstruction::CreateParameter(0, f32vec100_, "p"));
   auto log = builder.AddInstruction(
       HloInstruction::CreateUnary(f32vec100_, HloOpcode::kLog, param));
   auto rev = builder.AddInstruction(
@@ -1677,7 +1673,7 @@ class WhileBufferAssignmentTest : public HloTestBase {
   std::unique_ptr<BufferAssignment> RunBufferAssignment(HloModule* module,
                                                         int64 alignment = 1) {
     auto sequence =
-        CreateMemoryMinimizingSequence(*module, ByteSizeOf).ConsumeValueOrDie();
+        ScheduleComputationsInModule(*module, ByteSizeOf).ConsumeValueOrDie();
     return BufferAssigner::Run(
                module, xla::MakeUnique<SequentialHloOrdering>(module, sequence),
                ByteSizeOf,
@@ -1797,7 +1793,7 @@ ENTRY %test_module {
 })";
 
   TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
-                          tools::Parse(module_str));
+                          ParseHloString(module_str));
 
   // Run CopyInsertion and check if the graph constructed above doesn't need
   // any copies inserted for BufferAssignment to run.
@@ -2107,7 +2103,7 @@ TEST_F(WhileBufferAssignmentTest, WhileLoopsInterferingResultRange) {
   RunCopyInsertion(module.get());
 
   auto sequence =
-      CreateMemoryMinimizingSequence(*module, ByteSizeOf).ConsumeValueOrDie();
+      ScheduleComputationsInModule(*module, ByteSizeOf).ConsumeValueOrDie();
 
   // To trigger b/38494731, we want a specific Hlo sequence for the
   // root computation, so we overwrite that entry with a manually
diff --git a/tensorflow/compiler/xla/service/call_graph.cc b/tensorflow/compiler/xla/service/call_graph.cc
index a8053d15e124319c5c898f0034b9aaa95a007a89..a23427f00ccd88bb0fe1d973a667f80ca54b14cd 100644
--- a/tensorflow/compiler/xla/service/call_graph.cc
+++ b/tensorflow/compiler/xla/service/call_graph.cc
@@ -57,6 +57,7 @@ CallContext GetInstructionCallContext(HloOpcode opcode) {
     case HloOpcode::kConditional:
     case HloOpcode::kWhile:
       return CallContext::kSequential;
+    case HloOpcode::kCrossReplicaSum:
     case HloOpcode::kMap:
     case HloOpcode::kReduce:
     case HloOpcode::kReduceWindow:
diff --git a/tensorflow/compiler/xla/service/channel_tracker.h b/tensorflow/compiler/xla/service/channel_tracker.h
index c7763f2ca3e68490cd0cd9b4ba4d7bd180134080..fac0afd672ff3ed083aacf778dd9c4f90a2ee870 100644
--- a/tensorflow/compiler/xla/service/channel_tracker.h
+++ b/tensorflow/compiler/xla/service/channel_tracker.h
@@ -19,9 +19,6 @@ limitations under the License.
 #include <map>
 
 #include "tensorflow/compiler/xla/service/hlo_module.h"
-#include "tensorflow/compiler/xla/service/session.pb.h"
-#include "tensorflow/compiler/xla/service/user_computation.h"
-#include "tensorflow/compiler/xla/service/versioned_computation_handle.h"
 #include "tensorflow/compiler/xla/status.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/types.h"
diff --git a/tensorflow/compiler/xla/service/compilation_cache.cc b/tensorflow/compiler/xla/service/compilation_cache.cc
deleted file mode 100644
index b16907da9e9c909d2639f83895db27d724a84a7b..0000000000000000000000000000000000000000
--- a/tensorflow/compiler/xla/service/compilation_cache.cc
+++ /dev/null
@@ -1,78 +0,0 @@
-/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-==============================================================================*/
-
-#include "tensorflow/compiler/xla/service/compilation_cache.h"
-
-#include <utility>
-
-#include "tensorflow/compiler/xla/types.h"
-#include "tensorflow/compiler/xla/xla_data.pb.h"
-#include "tensorflow/core/lib/strings/strcat.h"
-#include "tensorflow/core/platform/logging.h"
-
-namespace xla {
-
-std::shared_ptr<Executable> CompilationCache::Insert(
-    std::unique_ptr<Executable> executable,
-    const HloModuleConfig& module_config) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  CacheKey key =
-      BuildKey(executable->entry_computation_handle(), module_config);
-  VLOG(2) << "inserting cache key: " << key;
-  if (cache_.count(key) == 0) {
-    cache_.emplace(key, std::move(executable));
-  } else {
-    // Executable already exists in the cache. This can happen if two Execute
-    // calls for a new computation are received simultaneously by the
-    // service. In this case, we discard the Executable given as a parameter and
-    // return what is in the cache. This is necessary because the service relies
-    // on the cache to keep ownership of the Executable. We only want to store
-    // one Executable for a given computation version and we can't discard the
-    // executable which is in the cache because it may be in use.
-    executable.reset();
-  }
-  return cache_.at(key);
-}
-
-std::shared_ptr<Executable> CompilationCache::LookUp(
-    const VersionedComputationHandle& versioned_handle,
-    const HloModuleConfig& module_config) const {
-  tensorflow::mutex_lock lock(mutex_);
-
-  CacheKey key = BuildKey(versioned_handle, module_config);
-  VLOG(2) << "looking up cache key: " << key;
-  if (cache_.count(key) == 0) {
-    VLOG(2) << "cache key not found: " << key;
-    return nullptr;
-  } else {
-    std::shared_ptr<Executable> result = cache_.at(key);
-    VLOG(2) << "hit executable with module config: "
-            << result->module_config().compilation_cache_key();
-    return result;
-  }
-}
-
-CompilationCache::CacheKey CompilationCache::BuildKey(
-    const VersionedComputationHandle& versioned_handle,
-    const HloModuleConfig& module_config) const {
-  // The computation shape is represented entirely by its ProgramShape member,
-  // so just serialize the proto as part of the key.
-  return tensorflow::strings::StrCat(versioned_handle.handle.handle(), "::",
-                                     versioned_handle.version, "::",
-                                     module_config.compilation_cache_key());
-}
-
-}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/compilation_cache.h b/tensorflow/compiler/xla/service/compilation_cache.h
deleted file mode 100644
index 09989726ae6629aa65cb1dd84c16408a75019fa5..0000000000000000000000000000000000000000
--- a/tensorflow/compiler/xla/service/compilation_cache.h
+++ /dev/null
@@ -1,78 +0,0 @@
-/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-==============================================================================*/
-
-#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_COMPILATION_CACHE_H_
-#define TENSORFLOW_COMPILER_XLA_SERVICE_COMPILATION_CACHE_H_
-
-#include <map>
-#include <memory>
-#include <string>
-
-#include "tensorflow/compiler/xla/service/executable.h"
-#include "tensorflow/compiler/xla/service/hlo_module_config.h"
-#include "tensorflow/compiler/xla/service/versioned_computation_handle.h"
-#include "tensorflow/compiler/xla/types.h"
-#include "tensorflow/core/platform/macros.h"
-#include "tensorflow/core/platform/mutex.h"
-#include "tensorflow/core/platform/thread_annotations.h"
-
-namespace xla {
-
-// A cache which stores Executables indexed by computation handle and version.
-class CompilationCache {
- public:
-  CompilationCache() {}
-
-  // Insert the given Executable into the cache. Return a bare Executable
-  // pointer for the caller to use. Note: the returned pointer will *not* be the
-  // same as the given unique pointer if the computation already exists in the
-  // cache. See comments in the .cc implementation for details of this case.
-  //
-  // module_config is provided by the caller, instead of being taken from the
-  // executable, so that we can insert keys into the compilation cache that are
-  // devoid of layout (where XLA gets to choose what layout to compile).
-  //
-  // A shared_ptr is returned so the caller can keep the Executable from being
-  // destructed in the event that the Executable is evicted from the
-  // computation cache (and the cache's shared_ptr to the Executable is
-  // destructed).
-  std::shared_ptr<Executable> Insert(std::unique_ptr<Executable> executable,
-                                     const HloModuleConfig& module_config);
-
-  // Lookup the Executable for the specified versioned computation in the cache.
-  // Return a shared_ptr to the Executable if it exists in the cache. Return
-  // nullptr otherwise.
-  std::shared_ptr<Executable> LookUp(
-      const VersionedComputationHandle& versioned_handle,
-      const HloModuleConfig& module_config) const;
-
- protected:
-  mutable tensorflow::mutex mutex_;
-
-  // Map from versioned handle with program layout to Executable built
-  // for that computation version and program layout.
-  using CacheKey = string;
-
-  CacheKey BuildKey(const VersionedComputationHandle& versioned_handle,
-                    const HloModuleConfig& module_config) const;
-  std::map<CacheKey, std::shared_ptr<Executable>> cache_ GUARDED_BY(mutex_);
-
- private:
-  TF_DISALLOW_COPY_AND_ASSIGN(CompilationCache);
-};
-
-}  // namespace xla
-
-#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_COMPILATION_CACHE_H_
diff --git a/tensorflow/compiler/xla/service/compile_only_service.cc b/tensorflow/compiler/xla/service/compile_only_service.cc
index d39fd7307ae1b5bd0c431f98c413011ca081050b..7426672a7a2a9102bd5ea98bd51092982e1e09b4 100644
--- a/tensorflow/compiler/xla/service/compile_only_service.cc
+++ b/tensorflow/compiler/xla/service/compile_only_service.cc
@@ -22,7 +22,6 @@ limitations under the License.
 #include "tensorflow/compiler/xla/legacy_flags/debug_options_flags.h"
 #include "tensorflow/compiler/xla/service/backend.h"
 #include "tensorflow/compiler/xla/service/computation_layout.h"
-#include "tensorflow/compiler/xla/service/computation_tracker.h"
 #include "tensorflow/compiler/xla/service/platform_util.h"
 #include "tensorflow/compiler/xla/status_macros.h"
 #include "tensorflow/compiler/xla/types.h"
@@ -64,7 +63,8 @@ CompileOnlyService::CompileOnlyService(const ServiceOptions& options,
 StatusOr<std::vector<std::unique_ptr<AotCompilationResult>>>
 CompileOnlyService::CompileAheadOfTime(
     const tensorflow::gtl::ArraySlice<AotXlaComputationInstance> computations,
-    const AotCompilationOptions& options) {
+    const AotCompilationOptions& options,
+    std::unique_ptr<AotCompilationMetadata>* metadata) {
   std::vector<std::unique_ptr<HloModule>> hlo_modules;
   for (const AotXlaComputationInstance& instance : computations) {
     TF_RET_CHECK(instance.computation.has_program_shape());
@@ -101,59 +101,8 @@ CompileOnlyService::CompileAheadOfTime(
     hlo_modules.push_back(std::move(hlo_module));
   }
 
-  return compiler_->CompileAheadOfTime(std::move(hlo_modules), options);
-}
-
-StatusOr<std::vector<std::unique_ptr<AotCompilationResult>>>
-CompileOnlyService::CompileAheadOfTime(
-    const tensorflow::gtl::ArraySlice<AotComputationInstance> computations,
-    const AotCompilationOptions& options) {
-  std::vector<std::unique_ptr<HloModule>> hlo_modules;
-  for (const AotComputationInstance& instance : computations) {
-    TF_ASSIGN_OR_RETURN(UserComputation * user_computation,
-                        computation_tracker_.Resolve(instance.computation));
-    VersionedComputationHandle versioned_handle =
-        user_computation->GetVersionedHandle();
-
-    const DebugOptions& debug_options = options.debug_options();
-
-    // Dump computation proto state if flag is set.
-    const string& directory_path = debug_options.xla_dump_computations_to();
-    if (!directory_path.empty()) {
-      TF_ASSIGN_OR_RETURN(
-          std::unique_ptr<SessionModule> session_module,
-          computation_tracker_.SnapshotComputation(versioned_handle.handle));
-      string filename = tensorflow::strings::StrCat(
-          "computation_", versioned_handle.handle.handle(), "__",
-          session_module->entry().name(), "__version_",
-          versioned_handle.version);
-      const string& per_host_path = tensorflow::io::JoinPath(
-          directory_path, tensorflow::port::Hostname());
-
-      TF_RETURN_IF_ERROR(Executable::DumpToDirectory(per_host_path, filename,
-                                                     *session_module));
-    }
-
-    TF_ASSIGN_OR_RETURN(
-        std::shared_ptr<const ProgramShape> program_shape,
-        user_computation->ComputeProgramShape(versioned_handle.version));
-
-    ExecutionOptions execution_options;
-    *execution_options.mutable_debug_options() = debug_options;
-    TF_ASSIGN_OR_RETURN(
-        std::unique_ptr<HloModuleConfig> module_config,
-        CreateModuleConfig(*program_shape, instance.argument_layouts,
-                           &execution_options, user_computation));
-
-    TF_ASSIGN_OR_RETURN(std::unique_ptr<HloModule> hlo_module,
-                        computation_tracker_.BuildHloModule(
-                            versioned_handle, *module_config,
-                            /*include_unreachable_instructions=*/true));
-    TF_RETURN_IF_ERROR(MaybeDumpHloModule(*hlo_module));
-    hlo_modules.push_back(std::move(hlo_module));
-  }
-
-  return compiler_->CompileAheadOfTime(std::move(hlo_modules), options);
+  return compiler_->CompileAheadOfTime(std::move(hlo_modules), options,
+                                       metadata);
 }
 
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/compile_only_service.h b/tensorflow/compiler/xla/service/compile_only_service.h
index 7f2ce0e8974c01b09664235d7b9d19555b2705a3..1ac950bdd66bd034dfdafa8598ec506221e99c2f 100644
--- a/tensorflow/compiler/xla/service/compile_only_service.h
+++ b/tensorflow/compiler/xla/service/compile_only_service.h
@@ -38,24 +38,7 @@ class CompileOnlyService : public Service {
   static StatusOr<std::unique_ptr<CompileOnlyService>> NewService(
       const ServiceOptions& options);
 
-  // A description of a computation to compile using CompileAheadOfTime.
-  struct AotComputationInstance {
-    ComputationHandle computation;
-    std::vector<const Shape*> argument_layouts;
-    const Shape* result_layout = nullptr;
-  };
-
-  // Compiles a list of computations for ahead-of-time execution.  This is
-  // intended for use in static compilation.  See
-  // |CompileOnlyClient::CompileAheadOfTime| for additional details.
-  StatusOr<std::vector<std::unique_ptr<AotCompilationResult>>>
-  CompileAheadOfTime(
-      const tensorflow::gtl::ArraySlice<AotComputationInstance> computations,
-      const AotCompilationOptions& Options);
-
   // A description of a xla computation to compile using CompileAheadOfTime.
-  //
-  // TODO(b/74197823): This is a part of a NOT YET ready refactor.
   struct AotXlaComputationInstance {
     HloModuleProto computation;
     std::vector<const Shape*> argument_layouts;
@@ -65,31 +48,21 @@ class CompileOnlyService : public Service {
   // Compiles a list of xla computations for ahead-of-time execution.  This is
   // intended for use in static compilation.  See
   // |CompileOnlyClient::CompileAheadOfTime| for additional details.
-  //
-  // TODO(b/74197823): This is a part of a NOT YET ready refactor.
   StatusOr<std::vector<std::unique_ptr<AotCompilationResult>>>
   CompileAheadOfTime(
       const tensorflow::gtl::ArraySlice<AotXlaComputationInstance> computations,
       const AotCompilationOptions& options);
 
-  // Override Service methods that require or imply the existence of an
-  // execute backend.  Note that this does not include TransferToClient, as
-  // computing constants produces global data that we may wish to transfer.
-  Status Execute(const ExecuteRequest* arg, ExecuteResponse* result) override {
-    return Unimplemented("CompileOnlyService does not support execution.");
-  }
-  Status ExecuteParallel(const ExecuteParallelRequest* arg,
-                         ExecuteParallelResponse* result) override {
-    return Unimplemented("CompileOnlyService does not support execution.");
-  }
+  StatusOr<std::vector<std::unique_ptr<AotCompilationResult>>>
+  CompileAheadOfTime(
+      const tensorflow::gtl::ArraySlice<AotXlaComputationInstance> computations,
+      const AotCompilationOptions& options,
+      std::unique_ptr<AotCompilationMetadata>* metadata);
+
   Status GetDeviceHandles(const GetDeviceHandlesRequest* arg,
                           GetDeviceHandlesResponse* result) override {
     return Unimplemented("CompileOnlyService does not support devices.");
   }
-  Status ExecuteAsync(const ExecuteAsyncRequest* arg,
-                      ExecuteAsyncResponse* result) override {
-    return Unimplemented("CompileOnlyService does not support execution.");
-  }
   Status WaitForExecution(const WaitForExecutionRequest* arg,
                           WaitForExecutionResponse* result) override {
     return Unimplemented("CompileOnlyService does not support execution.");
diff --git a/tensorflow/compiler/xla/service/compiler.cc b/tensorflow/compiler/xla/service/compiler.cc
index 8b01a6c4b5004d03e6e7d23b99b923fdcdeaff99..0dceed853dcbae211657f00433866cfe10c51fc7 100644
--- a/tensorflow/compiler/xla/service/compiler.cc
+++ b/tensorflow/compiler/xla/service/compiler.cc
@@ -28,6 +28,27 @@ namespace xla {
 /* static */ tensorflow::mutex Compiler::platform_compiler_mutex_(
     tensorflow::LINKER_INITIALIZED);
 
+std::vector<std::unique_ptr<tensorflow::protobuf::Message>>
+Compiler::ComputeBackendConfigs(const HloInstruction& hlo,
+                                se::StreamExecutor* executor) const {
+  CHECK(executor != nullptr);
+  return {};
+}
+
+// Define a default version where metadata is not used.
+StatusOr<std::vector<std::unique_ptr<AotCompilationResult>>>
+Compiler::CompileAheadOfTime(
+    std::vector<std::unique_ptr<HloModule>> modules,
+    const AotCompilationOptions& options,
+    std::unique_ptr<AotCompilationMetadata>* metadata) {
+  if (metadata != nullptr) {
+    return Unimplemented(
+        "Populating AotCompilationMetadata is not implemented on this "
+        "compiler.");
+  }
+  return CompileAheadOfTime(std::move(modules), options);
+}
+
 /* static */ std::map<se::Platform::Id, Compiler::CompilerFactory>*
 Compiler::GetPlatformCompilerFactories() {
   static auto* r = new std::map<se::Platform::Id, CompilerFactory>;
diff --git a/tensorflow/compiler/xla/service/compiler.h b/tensorflow/compiler/xla/service/compiler.h
index a4b59d1ba9b24e3f886a7feb51181ae8f990951f..d1144f97bb2ab29d3d18f3b3f65a38af46e68dd1 100644
--- a/tensorflow/compiler/xla/service/compiler.h
+++ b/tensorflow/compiler/xla/service/compiler.h
@@ -24,9 +24,11 @@ limitations under the License.
 #include <map>
 #include <memory>
 #include <string>
+#include <vector>
 
 #include "tensorflow/compiler/xla/service/buffer_value.h"
 #include "tensorflow/compiler/xla/service/executable.h"
+#include "tensorflow/compiler/xla/service/hlo_instruction.h"
 #include "tensorflow/compiler/xla/service/hlo_module.h"
 #include "tensorflow/compiler/xla/service/hlo_module_config.h"
 #include "tensorflow/compiler/xla/service/logical_buffer.h"
@@ -34,6 +36,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/types.h"
 #include "tensorflow/core/lib/gtl/array_slice.h"
 #include "tensorflow/core/platform/mutex.h"
+#include "tensorflow/core/platform/protobuf.h"
 #include "tensorflow/core/platform/stream_executor_no_cuda.h"
 #include "tensorflow/core/platform/thread_annotations.h"
 
@@ -91,6 +94,19 @@ class AotCompilationOptions {
   DebugOptions debug_options_;
 };
 
+// Abstract superclass describing metadata produced during ahead-of-time
+// compilation.
+class AotCompilationMetadata {
+ public:
+  AotCompilationMetadata(const AotCompilationMetadata&) = delete;
+  AotCompilationMetadata& operator=(AotCompilationMetadata const&) = delete;
+
+  virtual ~AotCompilationMetadata() = default;
+
+ protected:
+  AotCompilationMetadata() = default;
+};
+
 // Abstract compiler interface that is subclassed for compilation on a
 // particular platform.
 //
@@ -153,12 +169,29 @@ class Compiler {
       std::vector<std::vector<se::StreamExecutor*>> stream_exec,
       DeviceMemoryAllocator* device_allocator) = 0;
 
+  // Returns the backend configurations that the backend will consider for the
+  // given HLO. Returns no configurations if the backend does not support
+  // configurations for the given HLO.
+  //
+  // The stream executor is passed in to provide information about the hardware
+  // that the backend configurations would be targeting.
+  virtual std::vector<std::unique_ptr<tensorflow::protobuf::Message>>
+  ComputeBackendConfigs(const HloInstruction& hlo,
+                        se::StreamExecutor* executor) const;
+
   // Compiles the HLO module for ahead-of-time execution.  This is intended for
   // use in static compilation.
   virtual StatusOr<std::vector<std::unique_ptr<AotCompilationResult>>>
   CompileAheadOfTime(std::vector<std::unique_ptr<HloModule>> modules,
                      const AotCompilationOptions& options) = 0;
 
+  // Similar to CompileAheadOfTime above but AotCompilationMetadata
+  // has an argument that can be populated during compilation.
+  virtual StatusOr<std::vector<std::unique_ptr<AotCompilationResult>>>
+  CompileAheadOfTime(std::vector<std::unique_ptr<HloModule>> modules,
+                     const AotCompilationOptions& options,
+                     std::unique_ptr<AotCompilationMetadata>* metadata);
+
   /////
   // The Compiler class also serves as a point to register compiler objects
   // for the various platforms.
diff --git a/tensorflow/compiler/xla/service/computation_layout.h b/tensorflow/compiler/xla/service/computation_layout.h
index 53c3a3f7b738687db3098acfaef1ae87860d0440..6975f387b4864bf28ea0ad23d7d4602b5b346e08 100644
--- a/tensorflow/compiler/xla/service/computation_layout.h
+++ b/tensorflow/compiler/xla/service/computation_layout.h
@@ -32,12 +32,21 @@ namespace xla {
 // mutable layouts.
 class ComputationLayout {
  public:
+  // Creates a new ComputationLayout with the given result layout.
+  explicit ComputationLayout(ShapeLayout result_layout)
+      : result_layout_(std::move(result_layout)) {}
+
   // Constructs a ComputationLayout from a ProgramShape. The layouts of the
   // parameters and results are set to the default layout. Layouts in the
   // ProgramShape are ignored if ignore_layouts is true.
   explicit ComputationLayout(const ProgramShape& program_shape,
                              bool ignore_layouts = true);
 
+  // Adds a new parameter layout to the computation layout.
+  void add_parameter_layout(ShapeLayout shape_layout) {
+    parameter_layouts_.push_back(std::move(shape_layout));
+  }
+
   // Returns the layout of a particular parameter.
   const ShapeLayout& parameter_layout(int64 param_no) const {
     return parameter_layouts_[param_no];
diff --git a/tensorflow/compiler/xla/service/computation_tracker.cc b/tensorflow/compiler/xla/service/computation_tracker.cc
deleted file mode 100644
index 70e25eebdb068db893e24aec0f72d09090ac7027..0000000000000000000000000000000000000000
--- a/tensorflow/compiler/xla/service/computation_tracker.cc
+++ /dev/null
@@ -1,256 +0,0 @@
-/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-==============================================================================*/
-
-#include "tensorflow/compiler/xla/service/computation_tracker.h"
-
-#include <list>
-#include <string>
-#include <utility>
-#include <vector>
-
-#include "tensorflow/compiler/xla/ptr_util.h"
-#include "tensorflow/compiler/xla/service/hlo_computation.h"
-#include "tensorflow/compiler/xla/status_macros.h"
-#include "tensorflow/compiler/xla/types.h"
-#include "tensorflow/compiler/xla/util.h"
-#include "tensorflow/core/lib/strings/strcat.h"
-#include "tensorflow/core/lib/strings/stringprintf.h"
-#include "tensorflow/core/platform/logging.h"
-
-using ::tensorflow::strings::Appendf;
-
-namespace xla {
-
-ComputationTracker::ComputationTracker() : next_computation_(1) {}
-
-ComputationHandle ComputationTracker::NewComputation(
-    const string& computation_name) {
-  tensorflow::mutex_lock lock(computation_mutex_);
-  ComputationHandle computation_handle;
-  int64 handle_value = next_computation_++;
-  computation_handle.set_handle(handle_value);
-  opaque_to_computation_[handle_value] =
-      MakeUnique<UserComputation>(computation_name, computation_handle);
-  return computation_handle;
-}
-
-StatusOr<ComputationHandle> ComputationTracker::LoadSessionModule(
-    const SessionModule& session_module) {
-  tensorflow::mutex_lock lock(computation_mutex_);
-
-  // For each embedded computation, create a new computation based on its
-  // serialized data, and place the mapping from the old computation handle to
-  // the new computation handle.
-
-  // Build a mapping from old embedded computation handles to new computation
-  // handles. We build the ID mapping first since the embedded computations are
-  // in no particular order and may refer to each other.
-  std::map<int64, ComputationHandle> old_to_new;
-  for (const SessionComputation& computation :
-       session_module.embedded_computations()) {
-    const int64 old_handle = computation.computation_handle().handle();
-    if (!old_to_new.emplace(old_handle, AllocateHandle()).second) {
-      return InvalidArgument("Duplicate embedded computation handle %lld",
-                             old_handle);
-    }
-  }
-
-  // Create a new computation from each serialized embedded computation.
-  for (const SessionComputation& computation :
-       session_module.embedded_computations()) {
-    const int64 old_handle = computation.computation_handle().handle();
-    const ComputationHandle& new_handle = old_to_new[old_handle];
-    TF_ASSIGN_OR_RETURN(opaque_to_computation_[new_handle.handle()],
-                        UserComputation::MakeWithRemapping(
-                            computation, new_handle, old_to_new));
-  }
-
-  // Finally, place the entry computation in the tracker with all of the
-  // remappings populated from the above.
-  const int64 old_handle = session_module.entry().computation_handle().handle();
-  TF_ASSIGN_OR_RETURN(
-      old_to_new[old_handle],
-      LoadSessionComputation(session_module.entry(), &old_to_new));
-  return old_to_new[old_handle];
-}
-
-StatusOr<std::unique_ptr<SessionModule>>
-ComputationTracker::SnapshotComputation(const ComputationHandle& computation) {
-  TF_ASSIGN_OR_RETURN(UserComputation * user_computation, Resolve(computation));
-  const VersionedComputationHandle entry_versioned_handle =
-      user_computation->GetVersionedHandle();
-  std::set<VersionedComputationHandle> visited;
-  std::list<VersionedComputationHandle> post_order;
-  {
-    tensorflow::mutex_lock lock(computation_mutex_);
-    ComputeComputationPostOrder(entry_versioned_handle, &visited, &post_order);
-  }
-  auto session_module = MakeUnique<SessionModule>();
-  *session_module->mutable_entry() =
-      Resolve(entry_versioned_handle.handle)
-          .ValueOrDie()
-          ->CloneSessionComputation(entry_versioned_handle.version);
-  for (auto it = ++post_order.rbegin(); it != post_order.rend(); ++it) {
-    *session_module->add_embedded_computations() =
-        Resolve(it->handle).ValueOrDie()->CloneSessionComputation(it->version);
-  }
-  return std::move(session_module);
-}
-
-StatusOr<UserComputation*> ComputationTracker::Resolve(
-    const ComputationHandle& computation) const {
-  tensorflow::mutex_lock lock(computation_mutex_);
-  return ResolveInternal(computation);
-}
-
-ComputationHandle ComputationTracker::AllocateHandle() {
-  int64 handle_value = next_computation_++;
-  ComputationHandle result;
-  result.set_handle(handle_value);
-  return result;
-}
-
-StatusOr<ComputationHandle> ComputationTracker::LoadSessionComputation(
-    const SessionComputation& session_computation,
-    std::map<int64, ComputationHandle>* old_to_new) {
-  TF_RET_CHECK(old_to_new != nullptr);
-  const ComputationHandle new_handle = AllocateHandle();
-  (*old_to_new)[session_computation.computation_handle().handle()] = new_handle;
-  TF_ASSIGN_OR_RETURN(opaque_to_computation_[new_handle.handle()],
-                      UserComputation::MakeWithRemapping(
-                          session_computation, new_handle, *old_to_new));
-  return new_handle;
-}
-
-StatusOr<UserComputation*> ComputationTracker::ResolveInternal(
-    const ComputationHandle& computation) const {
-  auto it = opaque_to_computation_.find(computation.handle());
-  if (it == opaque_to_computation_.end()) {
-    return NotFound("computation handle not found: %lld", computation.handle());
-  }
-  UserComputation* user_computation = it->second.get();
-  return user_computation;
-}
-
-void ComputationTracker::ComputeComputationPostOrder(
-    const VersionedComputationHandle& versioned_handle,
-    std::set<VersionedComputationHandle>* visited,
-    std::list<VersionedComputationHandle>* post_order) const {
-  if (visited->count(versioned_handle) > 0) {
-    CHECK_EQ(1, visited->count(versioned_handle));
-    return;
-  }
-
-  UserComputation* computation =
-      ResolveInternal(versioned_handle.handle).ValueOrDie();
-  std::vector<VersionedComputationHandle> embedded_handles =
-      computation->GetEmbeddedComputations(versioned_handle.version);
-
-  for (const auto& embedded_handle : embedded_handles) {
-    ComputeComputationPostOrder(embedded_handle, visited, post_order);
-  }
-
-  visited->insert(versioned_handle);
-  post_order->push_back(versioned_handle);
-}
-
-StatusOr<std::unique_ptr<HloModule>> ComputationTracker::BuildHloModule(
-    const VersionedComputationHandle& entry_handle,
-    const HloModuleConfig& config,
-    bool include_unreachable_instructions) const {
-  tensorflow::mutex_lock lock(computation_mutex_);
-
-  VLOG(1) << "BuildHloModule(" << entry_handle
-          << ", include_unreachable_instructions="
-          << include_unreachable_instructions << ")";
-  XLA_VLOG_LINES(1, ToStringInternal());
-
-  TF_ASSIGN_OR_RETURN(UserComputation * entry_computation,
-                      ResolveInternal(entry_handle.handle));
-
-  // Build a topological sort of the entry and any embedded computations as a
-  // list. The root of the computation will be the last element in the list.
-  std::set<VersionedComputationHandle> visited;
-  std::list<VersionedComputationHandle> post_order;
-  ComputeComputationPostOrder(entry_handle, &visited, &post_order);
-
-  // Map from ComputationHandle value and computation version to HloComputation.
-  std::map<VersionedComputationHandle, HloComputation*> hlo_computations;
-
-  // The resolver lambda resolves VersionedHandles to embedded
-  // HloComputation*. This is required by UserComputation::BuildHloComputation
-  // when lowering calling operations (map, reduce etc).
-  auto resolver = [&hlo_computations](
-      const VersionedComputationHandle& versioned_handle) -> HloComputation* {
-    CHECK_GT(hlo_computations.count(versioned_handle), 0);
-    return hlo_computations.at(versioned_handle);
-  };
-
-  // Print the post-order list for this entry computation.
-  if (VLOG_IS_ON(2)) {
-    VLOG(2) << "Visiting UserComputations in post order:";
-    for (const VersionedComputationHandle& versioned_handle : post_order) {
-      VLOG(2) << "  " << versioned_handle;
-    }
-  }
-
-  string module_name =
-      tensorflow::strings::StrCat(entry_computation->name(), "_module");
-  auto module = MakeUnique<HloModule>(module_name, entry_handle, config);
-  for (auto versioned_handle : post_order) {
-    UserComputation* computation =
-        ResolveInternal(versioned_handle.handle).ValueOrDie();
-
-    TF_ASSIGN_OR_RETURN(
-        std::unique_ptr<HloComputation> hlo_computation,
-        computation->BuildHloComputation(versioned_handle.version, resolver,
-                                         config.debug_options(),
-                                         include_unreachable_instructions));
-
-    // Add the newly created computation to VersionedHandle-to-HloComputation
-    // map.
-    DCHECK_EQ(0, hlo_computations.count(versioned_handle));
-    hlo_computations[versioned_handle] = hlo_computation.get();
-
-    if (computation == entry_computation) {
-      module->AddEntryComputation(std::move(hlo_computation));
-    } else {
-      module->AddEmbeddedComputation(std::move(hlo_computation));
-    }
-  }
-
-  return std::move(module);
-}
-
-string ComputationTracker::ToString() const {
-  tensorflow::mutex_lock lock(computation_mutex_);
-  return ToStringInternal();
-}
-
-string ComputationTracker::ToStringInternal() const {
-  string out;
-  Appendf(&out, "ComputationTracker(%p):\n", this);
-  for (const auto& handle_computation : opaque_to_computation_) {
-    int64 handle = handle_computation.first;
-    const std::unique_ptr<UserComputation>& computation =
-        handle_computation.second;
-    Appendf(&out, "  %4lld : %s \"%s\"\n", handle,
-            computation->GetVersionedHandle().ToString().c_str(),
-            computation->name().c_str());
-  }
-  return out;
-}
-
-}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/computation_tracker.h b/tensorflow/compiler/xla/service/computation_tracker.h
deleted file mode 100644
index d42d66adefe7faa2751da4cd80b392a38917ce70..0000000000000000000000000000000000000000
--- a/tensorflow/compiler/xla/service/computation_tracker.h
+++ /dev/null
@@ -1,147 +0,0 @@
-/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-==============================================================================*/
-
-#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_COMPUTATION_TRACKER_H_
-#define TENSORFLOW_COMPILER_XLA_SERVICE_COMPUTATION_TRACKER_H_
-
-#include <list>
-#include <map>
-#include <memory>
-#include <set>
-#include <string>
-
-#include "tensorflow/compiler/xla/service/hlo_module.h"
-#include "tensorflow/compiler/xla/service/hlo_module_config.h"
-#include "tensorflow/compiler/xla/service/session.pb.h"
-#include "tensorflow/compiler/xla/service/user_computation.h"
-#include "tensorflow/compiler/xla/service/versioned_computation_handle.h"
-#include "tensorflow/compiler/xla/statusor.h"
-#include "tensorflow/compiler/xla/types.h"
-#include "tensorflow/compiler/xla/xla_data.pb.h"
-#include "tensorflow/core/platform/macros.h"
-#include "tensorflow/core/platform/mutex.h"
-#include "tensorflow/core/platform/thread_annotations.h"
-#include "tensorflow/core/platform/types.h"
-
-namespace xla {
-
-// Tracks computations for the XLA service; computations can be registered
-// with a UserComputation instance and can be resolved from a handle for later
-// use.
-//
-// This class is also capable of serializing/deserializing computations that it
-// tracks (and to serialize properly you need to serialize all referred-to
-// computations as well).
-class ComputationTracker {
- public:
-  ComputationTracker();
-
-  // Creates a new UserComputation object and returns the corresponding
-  // ComputationHandle for it.
-  //
-  // Precondition: user_computation is not already present in the map.
-  ComputationHandle NewComputation(const string& computation_name);
-
-  // Restores session data for a computation that has been serialized, and
-  // allocates a new computation handle for it.
-  StatusOr<ComputationHandle> LoadSessionModule(
-      const SessionModule& session_module);
-
-  // Snapshots a computation (referenced by the provided handle) at its latest
-  // version, returning a module where it is the entry, and any referred-to
-  // computations are entrained as "embedded" (non-entry) computations.
-  StatusOr<std::unique_ptr<SessionModule>> SnapshotComputation(
-      const ComputationHandle& computation);
-
-  // Resolves a ComputationHandle to a UserComputation that is present in the
-  // map.
-  StatusOr<UserComputation*> Resolve(
-      const ComputationHandle& computation) const;
-
-  // Builds an HLO module using the specified computation as the entry. The
-  // module will include the entry computation as well as all computations which
-  // are called directly or indirectly from the entry computation via operations
-  // like "map". config is the HLO module configuration to use for the
-  // constructed module.
-  // If include_unreachable_instructions is true, then instructions
-  // which are not reachable from the root are lowered into HloInstructions
-  // including unreachable parameters. This ensures the entry HloComputation has
-  // the same program shape (ProgramShape) as the entry UserComputation.
-  StatusOr<std::unique_ptr<HloModule>> BuildHloModule(
-      const VersionedComputationHandle& entry_handle,
-      const HloModuleConfig& config,
-      bool include_unreachable_instructions = true) const;
-
-  string ToString() const;
-
- private:
-  // Bumps the next_computation_ number and returns the allocated number wrapped
-  // in a ComputationHandle.
-  ComputationHandle AllocateHandle()
-      EXCLUSIVE_LOCKS_REQUIRED(computation_mutex_);
-
-  // Loads a session computation into a UserComputation, registers it, and
-  // returns the computation handle of the registered computation. If old_to_new
-  // is provided, it is used for remapping references to computations present in
-  // session_computation.
-  //
-  // old_to_new will be updated with the mapping from session_computation's old
-  // handle to the returned handle value, and may not be null.
-  StatusOr<ComputationHandle> LoadSessionComputation(
-      const SessionComputation& session_computation,
-      std::map<int64, ComputationHandle>* old_to_new)
-      EXCLUSIVE_LOCKS_REQUIRED(computation_mutex_);
-
-  // Internal implementation of Resolve method which requires, but does not
-  // acquire the mutex.
-  StatusOr<UserComputation*> ResolveInternal(
-      const ComputationHandle& computation) const
-      EXCLUSIVE_LOCKS_REQUIRED(computation_mutex_);
-
-  // Builds a post order sort of a computation ("entry") and all of its embedded
-  // computations including all transitively embedded computations. An embedded
-  // computation (the callee) will always appear in the sort before the
-  // computation which calls the embedded computation (the caller). Necessarily,
-  // the entry computation is the last element in the sort. visited and
-  // post_order should be empty when calling. post_order contains the post order
-  // sort when the function return.
-  void ComputeComputationPostOrder(
-      const VersionedComputationHandle& versioned_handle,
-      std::set<VersionedComputationHandle>* visited,
-      std::list<VersionedComputationHandle>* post_order) const
-      EXCLUSIVE_LOCKS_REQUIRED(computation_mutex_);
-
-  string ToStringInternal() const EXCLUSIVE_LOCKS_REQUIRED(computation_mutex_);
-
-  // Guards the computation mapping. Marked mutable so that the Resolve method
-  // can remain const; Resolve does't really modify the tracker in any way, but
-  // it has to lock the mutex for safety.
-  mutable tensorflow::mutex computation_mutex_;
-
-  // The next sequence number to assign to a computation, guarded by the same
-  // mutex as the mapping as they'll be mutated at the same time.
-  int64 next_computation_ GUARDED_BY(computation_mutex_);
-
-  // Mapping from ComputationHandle value to the corresponding registered
-  // UserComputation object.
-  std::map<int64, std::unique_ptr<UserComputation>> opaque_to_computation_
-      GUARDED_BY(computation_mutex_);
-
-  TF_DISALLOW_COPY_AND_ASSIGN(ComputationTracker);
-};
-
-}  // namespace xla
-
-#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_COMPUTATION_TRACKER_H_
diff --git a/tensorflow/compiler/xla/service/copy_insertion_test.cc b/tensorflow/compiler/xla/service/copy_insertion_test.cc
index 153f062d015e49db11c4c9ae0a2a61e76c020f02..684fff8a6fd4fd045a0de9df9660b887f32bdf40 100644
--- a/tensorflow/compiler/xla/service/copy_insertion_test.cc
+++ b/tensorflow/compiler/xla/service/copy_insertion_test.cc
@@ -1636,8 +1636,7 @@ void BM_SequentialWhiles(int num_iters, int num_whiles) {
   for (int i = 0; i < num_iters; ++i) {
     HloModuleConfig config;
     config.set_debug_options(legacy_flags::GetDebugOptionsFromFlags());
-    HloModule module("BM_SequentialWhiles", VersionedComputationHandle(),
-                     config);
+    HloModule module("BM_SequentialWhiles", config);
 
     auto builder = HloComputation::Builder("BM_SequentialWhiles");
     HloInstruction* x = builder.AddInstruction(HloInstruction::CreateParameter(
@@ -1677,8 +1676,7 @@ void BM_ParallelWhiles(int num_iters, int num_whiles) {
   for (int i = 0; i < num_iters; ++i) {
     HloModuleConfig config;
     config.set_debug_options(legacy_flags::GetDebugOptionsFromFlags());
-    HloModule module("BM_SequentialWhiles", VersionedComputationHandle(),
-                     config);
+    HloModule module("BM_SequentialWhiles", config);
 
     auto builder = HloComputation::Builder("BM_ParallelWhiles");
     HloInstruction* x = builder.AddInstruction(HloInstruction::CreateParameter(
@@ -1750,8 +1748,7 @@ void BM_ManyElementTuple(int num_iters, const int num_tuple_inputs) {
   std::vector<HloInstruction*> tuple_params(num_tuple_inputs);
   for (int i = 0; i < num_iters; ++i) {
     auto builder = HloComputation::Builder("BM_ParallelWhiles");
-    HloModule module("BM_ManyElementTuple", VersionedComputationHandle(),
-                     config);
+    HloModule module("BM_ManyElementTuple", config);
     for (int j = 0; j < num_tuple_inputs; ++j) {
       tuple_params[j] = builder.AddInstruction(
           HloInstruction::CreateParameter(j, element_shape, ""));
diff --git a/tensorflow/compiler/xla/service/cpu/BUILD b/tensorflow/compiler/xla/service/cpu/BUILD
index bfd85f257fb9550a6babb2459a7227ca9003a14f..b703be0f39e2032bc58479f0b957f9d8b01a77c3 100644
--- a/tensorflow/compiler/xla/service/cpu/BUILD
+++ b/tensorflow/compiler/xla/service/cpu/BUILD
@@ -151,7 +151,14 @@ cc_library(
         "@llvm//:target",  # fixdeps: keep
         "@llvm//:x86_code_gen",  # fixdeps: keep
         "@llvm//:x86_disassembler",  # fixdeps: keep
-    ],
+    ] + select({
+        "@org_tensorflow//tensorflow:linux_ppc64le": [
+            "@llvm//:powerpc_disassembler",
+            "@llvm//:powerpc_code_gen",
+        ],
+        "//conditions:default": [
+        ],
+    }),
     alwayslink = True,  # Contains compiler registration
 )
 
@@ -649,10 +656,10 @@ tf_cc_test(
     deps = [
         ":cpu_instruction_fusion",
         "//tensorflow/compiler/xla/service:hlo_matchers",
+        "//tensorflow/compiler/xla/service:hlo_parser",
         "//tensorflow/compiler/xla/service:transpose_folding",
         "//tensorflow/compiler/xla/tests:hlo_test_base",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
-        "//tensorflow/compiler/xla/tools/parser:hlo_parser",
         "//tensorflow/core:lib",
     ],
 )
@@ -706,9 +713,9 @@ tf_cc_test(
         "//tensorflow/compiler/xla:util",
         "//tensorflow/compiler/xla/service:hlo",
         "//tensorflow/compiler/xla/service:hlo_matchers",
+        "//tensorflow/compiler/xla/service:hlo_parser",
         "//tensorflow/compiler/xla/tests:hlo_test_base",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
-        "//tensorflow/compiler/xla/tools/parser:hlo_parser",
     ],
 )
 
@@ -898,6 +905,7 @@ cc_library(
         "//tensorflow/compiler/xla:types",
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/service/llvm_ir:llvm_util",
+        "//tensorflow/core:lib",
         "@llvm//:core",
         "@llvm//:support",
     ],
@@ -958,7 +966,7 @@ tf_cc_test(
         ":ir_emission_utils",
         ":target_machine_features_fake",
         "//tensorflow/compiler/xla:test",
+        "//tensorflow/compiler/xla/service:hlo_parser",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
-        "//tensorflow/compiler/xla/tools/parser:hlo_parser",
     ],
 )
diff --git a/tensorflow/compiler/xla/service/cpu/cpu_compiler.cc b/tensorflow/compiler/xla/service/cpu/cpu_compiler.cc
index 25b18eff20f901fc34343a12bfbd353ecec49cfb..d039132535071661d047579587385210719fede3 100644
--- a/tensorflow/compiler/xla/service/cpu/cpu_compiler.cc
+++ b/tensorflow/compiler/xla/service/cpu/cpu_compiler.cc
@@ -264,8 +264,7 @@ Status CpuCompiler::RunHloPasses(HloModule* module, bool is_aot_compile,
     pass.AddPass<BatchNormExpander>(
         /*rewrite_training_op=*/true,
         /*rewrite_inference_op=*/true,
-        /*rewrite_grad_op=*/true,
-        /*use_fusion=*/false);
+        /*rewrite_grad_op=*/true);
     pass.AddPass<AlgebraicSimplifier>(
         /*is_layout_sensitive=*/false,
         [](const Shape&, const Shape&) { return false; },
@@ -550,8 +549,8 @@ StatusOr<std::unique_ptr<Executable>> CpuCompiler::RunBackend(
   // and reduced memory usage (as compared to using DependencyHloOrdering).
   TF_ASSIGN_OR_RETURN(
       SequentialHloOrdering::HloModuleSequence module_sequence,
-      CreateMemoryMinimizingSequence(*module, BufferSizeBytesFunction(),
-                                     DFSMemoryScheduler));
+      ScheduleComputationsInModule(*module, BufferSizeBytesFunction(),
+                                   DFSMemoryScheduler));
 
   // Run buffer analysis on the HLO graph. This analysis figures out which
   // temporary buffers are required to run the computation.
@@ -730,7 +729,7 @@ CpuCompiler::CompileAheadOfTime(std::vector<std::unique_ptr<HloModule>> modules,
 
     TF_ASSIGN_OR_RETURN(
         SequentialHloOrdering::HloModuleSequence module_sequence,
-        CreateMemoryMinimizingSequence(*module, BufferSizeBytesFunction()));
+        ScheduleComputationsInModule(*module, BufferSizeBytesFunction()));
 
     // Run buffer analysis on the HLO graph. This analysis figures out which
     // temporary buffers are required to run the computation.
diff --git a/tensorflow/compiler/xla/service/cpu/cpu_eigen_tensor_alignment_test.cc b/tensorflow/compiler/xla/service/cpu/cpu_eigen_tensor_alignment_test.cc
index d12fa6bb9ad2054bdc052c9d7b3729cc28e11f6d..8727c72b6e42517b1859e98ecadb41bbceed761c 100644
--- a/tensorflow/compiler/xla/service/cpu/cpu_eigen_tensor_alignment_test.cc
+++ b/tensorflow/compiler/xla/service/cpu/cpu_eigen_tensor_alignment_test.cc
@@ -16,8 +16,8 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/cpu/dot_op_emitter.h"
 #include "tensorflow/compiler/xla/service/cpu/ir_emission_utils.h"
 #include "tensorflow/compiler/xla/service/cpu/target_machine_features_fake.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/test.h"
-#include "tensorflow/compiler/xla/tools/parser/hlo_parser.h"
 
 namespace xla {
 namespace cpu {
@@ -40,7 +40,7 @@ ENTRY DotOperation {
 )";
 
   TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
-                          tools::Parse(hlo_string));
+                          ParseHloString(hlo_string));
 
   HloInstruction* dot = module->entry_computation()->root_instruction();
 
@@ -71,7 +71,7 @@ ENTRY ConvOperation {
 )";
 
   TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
-                          tools::Parse(hlo_string));
+                          ParseHloString(hlo_string));
 
   HloInstruction* conv = module->entry_computation()->root_instruction();
 
diff --git a/tensorflow/compiler/xla/service/cpu/cpu_instruction_fusion_test.cc b/tensorflow/compiler/xla/service/cpu/cpu_instruction_fusion_test.cc
index 46fe060817b0264d90574b45a94cf1f6e5964593..97e10a89a209c057685709e7a5034052ff4376ed 100644
--- a/tensorflow/compiler/xla/service/cpu/cpu_instruction_fusion_test.cc
+++ b/tensorflow/compiler/xla/service/cpu/cpu_instruction_fusion_test.cc
@@ -19,9 +19,9 @@ limitations under the License.
 #include <set>
 
 #include "tensorflow/compiler/xla/service/hlo_matchers.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/service/transpose_folding.h"
 #include "tensorflow/compiler/xla/tests/hlo_test_base.h"
-#include "tensorflow/compiler/xla/tools/parser/hlo_parser.h"
 #include "tensorflow/core/lib/gtl/array_slice.h"
 
 namespace op = xla::testing::opcode_matchers;
@@ -172,7 +172,7 @@ ENTRY DotOperationFusion_TransposeFusion {
 )";
 
   TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
-                          tools::Parse(hlo_string));
+                          ParseHloString(hlo_string));
   HloComputation* computation = module->entry_computation();
 
   TransposeFolding transpose_folding(
@@ -202,7 +202,7 @@ ENTRY DotOperationFusion_TransposeFusion {
 )";
 
   TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
-                          tools::Parse(hlo_string));
+                          ParseHloString(hlo_string));
   HloComputation* computation = module->entry_computation();
 
   TransposeFolding transpose_folding(
@@ -233,7 +233,7 @@ ENTRY DotOperationFusion_TransposeFusion {
 )";
 
   TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
-                          tools::Parse(hlo_string));
+                          ParseHloString(hlo_string));
   HloComputation* computation = module->entry_computation();
 
   TransposeFolding transpose_folding(
@@ -775,7 +775,7 @@ TEST_P(GatherLoopFusionTest, GatherLoopFusion) {
   string hlo_string = tensorflow::strings::StrCat(
       "HloModule ", spec.test_name, "\n\n", spec.hlo_computation_text);
   TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
-                          tools::Parse(hlo_string));
+                          ParseHloString(hlo_string));
 
   RunFusionAndCheckOpcodesWereFused(
       module.get(),
diff --git a/tensorflow/compiler/xla/service/cpu/cpu_options.cc b/tensorflow/compiler/xla/service/cpu/cpu_options.cc
index e75fcb6bc9719f7453d5f0cb52d1673cef1fd3df..3ed7876715f64191f6e652d2b5cb1673df9a1b94 100644
--- a/tensorflow/compiler/xla/service/cpu/cpu_options.cc
+++ b/tensorflow/compiler/xla/service/cpu/cpu_options.cc
@@ -16,6 +16,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/cpu/cpu_options.h"
 
 #include "tensorflow/core/lib/strings/numbers.h"
+#include "tensorflow/core/lib/strings/str_util.h"
 
 namespace {
 
@@ -24,6 +25,7 @@ const char* const kXlaDisableVectorizedReduce = "xla_disable_vectorized_reduce";
 const char* const kLlvmIrDotTilingFactor = "xla_llvm_dot_tiling_factor";
 const char* const kXlaEnableExperimentalLlvmIrGemm =
     "xla_enable_experimental_llvm_ir_gemm";
+const char* const kLlvmIrGemmTileSize = "xla_llvm_ir_gemm_tile_size";
 
 }  // namespace
 
@@ -62,6 +64,43 @@ bool EnableExperimentalLlvmIrGemm(const HloModuleConfig& config) {
   return extra_options_map.count(kXlaEnableExperimentalLlvmIrGemm) > 0;
 }
 
+static tensorflow::StringPiece RemoveSuffix(tensorflow::StringPiece str,
+                                            tensorflow::StringPiece suffix) {
+  CHECK_GE(str.size(), suffix.size());
+  CHECK_EQ(str.substr(str.size() - suffix.size()), suffix);
+  return str.substr(0, str.size() - suffix.size());
+}
+
+tensorflow::gtl::optional<std::tuple<int64, int64, int64>> LlvmIrGemmTileSize(
+    const HloModuleConfig& config) {
+  const auto& extra_options_map =
+      config.debug_options().xla_backend_extra_options();
+  auto it = extra_options_map.find(kLlvmIrGemmTileSize);
+  if (it == extra_options_map.end()) {
+    return tensorflow::gtl::nullopt;
+  }
+
+  std::vector<string> tile_components =
+      tensorflow::str_util::Split(it->second, ':');
+  CHECK_EQ(tile_components.size(), 3);
+
+  int64 tile_size_m;
+  int64 tile_size_k;
+  int64 tile_size_n_in_vector_width;
+
+  CHECK(tensorflow::strings::safe_strto64(tile_components[0], &tile_size_m));
+  CHECK(tensorflow::strings::safe_strto64(tile_components[1], &tile_size_k));
+
+  tensorflow::StringPiece tile_size_n_in_vector_width_str =
+      RemoveSuffix(tile_components[2], "*vectwidth");
+
+  CHECK(tensorflow::strings::safe_strto64(tile_size_n_in_vector_width_str,
+                                          &tile_size_n_in_vector_width));
+
+  return std::tuple<int64, int64, int64>(tile_size_m, tile_size_k,
+                                         tile_size_n_in_vector_width);
+}
+
 }  // namespace options
 }  // namespace cpu
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/cpu/cpu_options.h b/tensorflow/compiler/xla/service/cpu/cpu_options.h
index 106dfbbc62dfba8d3de74e0a2ae3bb247bd91d67..429b9e16cbdd6f623919533582481f1640118081 100644
--- a/tensorflow/compiler/xla/service/cpu/cpu_options.h
+++ b/tensorflow/compiler/xla/service/cpu/cpu_options.h
@@ -29,6 +29,8 @@ bool VectorizedReduceDisabled(const HloModuleConfig& config);
 bool EnableExperimentalLlvmIrGemm(const HloModuleConfig& config);
 tensorflow::gtl::optional<int64> LlvmIrGemvTilingFactor(
     const HloModuleConfig& config);
+tensorflow::gtl::optional<std::tuple<int64, int64, int64>> LlvmIrGemmTileSize(
+    const HloModuleConfig& config);
 
 }  // namespace options
 }  // namespace cpu
diff --git a/tensorflow/compiler/xla/service/cpu/dot_op_emitter.cc b/tensorflow/compiler/xla/service/cpu/dot_op_emitter.cc
index af69fc3da9869aa2df958ecc5c064ee37dd9ea21..8eb39d615fd482cdcea716ba7b105c643a2d8b87 100644
--- a/tensorflow/compiler/xla/service/cpu/dot_op_emitter.cc
+++ b/tensorflow/compiler/xla/service/cpu/dot_op_emitter.cc
@@ -42,17 +42,17 @@ using llvm_ir::SetToFirstInsertPoint;
 namespace cpu {
 
 namespace {
-// Loads a tile of values from a 2D tensor.
-class TileLoader {
+// Provides tiled access to an in-memory rank 2 array.
+class MemoryTile {
  public:
-  // Constructs a TileLoader that will load a tile consisting of
+  // Constructs a MemoryTile that can operate on tiles consisting of
   // `tile_size_along_major_dim` vectors from the matrix `matrix`, starting at
   // `major_dim_offset` in the major dimension.  The tile size along the minor
   // dimension is the vector size, and that is implicitly determined by `vsl`.
-  TileLoader(VectorSupportLibrary* vsl, llvm::IRBuilder<>* ir_builder,
+  MemoryTile(VectorSupportLibrary* vsl, llvm::IRBuilder<>* ir_builder,
              llvm::Value* matrix, int64 matrix_size_along_minor_dim,
              llvm::Value* major_dim_offset, int64 tile_size_along_major_dim)
-      : vsl_(vsl) {
+      : vsl_(vsl), ir_builder_(ir_builder) {
     pointers_.reserve(tile_size_along_major_dim);
     for (int64 i = 0; i < tile_size_along_major_dim; i++) {
       llvm::Value* total_offset = ir_builder->CreateMul(
@@ -62,9 +62,10 @@ class TileLoader {
     }
   }
 
-  // Load a tile consisting of `tile_size_along_major_dim_` vectors starting at
-  // `major_dim_offset_` in the major dimension and `minor_dim_offset` in the
-  // minor dimension.
+  // Load a tile consisting of `tile_size_along_major_dim` vectors from position
+  // {major: `major_dim_offset`, minor: `minor_dim_offset`}.
+  //
+  // Note: `major_dim_offset` is a parameter to the constructor.
   std::vector<llvm::Value*> LoadTile(llvm::Value* minor_dim_offset) const {
     std::vector<llvm::Value*> result;
     result.reserve(pointers_.size());
@@ -74,11 +75,104 @@ class TileLoader {
     return result;
   }
 
+  // Stores `tile` to position {major: `major_dim_offset`, minor:
+  // `minor_dim_offset`}.
+  //
+  // Note: `major_dim_offset` is a parameter to the constructor.
+  void StoreTile(tensorflow::gtl::ArraySlice<llvm::Value*> tile,
+                 llvm::Value* minor_dim_offset) const {
+    CHECK_EQ(tile.size(), pointers_.size());
+    for (int64 i = 0; i < pointers_.size(); i++) {
+      vsl_->StoreVector(tile[i], pointers_[i], minor_dim_offset);
+    }
+  }
+
+  // Loads a tile of size [`tile_size_along_major_dim`,
+  // `tile_size_along_middle_dim`] from position {major: `major_dim_offset`,
+  // minor: `minor_dim_offset`} and then broadcasts each element into a vector
+  // of size vsl_.vector_size().  The (i,j)'th element of the return value is
+  // the (i,j)'th element in the tile broadcasted into an LLVM vector.
+  //
+  // Note: `major_dim_offset` is a parameter to the constructor.
+  std::vector<std::vector<llvm::Value*>> LoadBroadcastTile(
+      llvm::Value* minor_dim_offset, int64 tile_size_along_middle_dim) const {
+    std::vector<std::vector<llvm::Value*>> result;
+    result.resize(pointers_.size());
+    for (int64 i = 0; i < pointers_.size(); i++) {
+      for (int64 j = 0; j < tile_size_along_middle_dim; j++) {
+        result[i].push_back(vsl_->LoadBroadcast(
+            pointers_[i], ir_builder_->CreateAdd(minor_dim_offset,
+                                                 ir_builder_->getInt64(j))));
+      }
+    }
+    return result;
+  }
+
  private:
   VectorSupportLibrary* vsl_;
+  llvm::IRBuilder<>* ir_builder_;
   std::vector<llvm::Value*> pointers_;
 };
 
+// The base class for the classes representing the GEMV emitter configurations.
+//
+// The IR emitted (modulo the LLVM values representing the input and output
+// buffers) by the row major and column major GEMV emitters should be a function
+// of their configuration.  This is important because their configuration is
+// used as a key to cache the generated IR.
+class GemvConfig {
+ public:
+  // Mixin for convenience.
+  template <typename T>
+  struct User {
+   public:
+    PrimitiveType scalar_type() const {
+      return derived().config().scalar_type();
+    }
+    int64 tile_rows() const { return derived().config().tile_rows(); }
+    int64 tile_cols() const { return derived().config().tile_cols(); }
+    int64 m() const { return derived().config().m(); }
+    int64 k() const { return derived().config().k(); }
+    int64 has_addend() const { return derived().config().has_addend(); }
+
+   private:
+    const T& derived() const { return *static_cast<const T*>(this); }
+  };
+
+  PrimitiveType scalar_type() const { return scalar_type_; }
+  int64 tile_rows() const { return tile_rows_; }
+  int64 tile_cols() const { return tile_cols_; }
+  int64 m() const { return m_; }
+  int64 k() const { return k_; }
+  bool has_addend() const { return has_addend_; }
+
+  string GetCacheKey() const {
+    return tensorflow::strings::StrCat(
+        name_, "_", PrimitiveType_Name(scalar_type()), "_", tile_rows(), "_",
+        tile_cols(), "_", m(), "_", k(), has_addend() ? "_with_addend" : "");
+  }
+
+ protected:
+  explicit GemvConfig(string name, PrimitiveType scalar_type, int64 tile_rows,
+                      int64 tile_cols, int64 m, int64 k, bool has_addend)
+      : name_(std::move(name)),
+        scalar_type_(scalar_type),
+        tile_rows_(tile_rows),
+        tile_cols_(tile_cols),
+        m_(m),
+        k_(k),
+        has_addend_(has_addend) {}
+
+ private:
+  string name_;
+  PrimitiveType scalar_type_;
+  int64 tile_rows_;
+  int64 tile_cols_;
+  int64 m_;
+  int64 k_;
+  bool has_addend_;
+};
+
 // Computes a dot product between "[M,K]{0,1} lhs" with a [K,1] vector (the
 // layout of the vector does not matter).  This implementation uses a tiling
 // scheme to improve performance.
@@ -140,38 +234,46 @@ class TileLoader {
 // TODO(sanjoy): We should investigate if using gather loads and scatter stores
 // can be used here have the same inner loop for both column-major and row-major
 // matrix-vector products.
-class ColumnMajorMatrixVectorProductEmitter {
+class ColumnMajorMatrixVectorProductEmitter
+    : public GemvConfig::User<ColumnMajorMatrixVectorProductEmitter> {
  public:
-  ColumnMajorMatrixVectorProductEmitter(PrimitiveType scalar_type,
-                                        int64 tile_rows, int64 tile_cols,
-                                        int64 m, int64 k, llvm::Value* lhs,
+  class Config : public GemvConfig {
+   public:
+    explicit Config(PrimitiveType scalar_type, int64 tile_rows, int64 tile_cols,
+                    int64 m, int64 k, bool has_addend)
+        : GemvConfig(/*name=*/"col_major_gemv", scalar_type,
+                     /*tile_rows=*/tile_rows, /*tile_cols=*/tile_cols, /*m=*/m,
+                     /*k=*/k, /*has_addend=*/has_addend) {}
+  };
+
+  ColumnMajorMatrixVectorProductEmitter(const Config& config, llvm::Value* lhs,
                                         llvm::Value* rhs, llvm::Value* addend,
                                         llvm::Value* result,
                                         llvm::IRBuilder<>* ir_builder)
-      : scalar_type_(scalar_type),
-        tile_rows_(tile_rows),
-        tile_cols_(tile_cols),
-        m_(m),
-        k_(k),
+      : config_(config),
         lhs_(lhs),
         rhs_(rhs),
         addend_(addend),
         result_(result),
         ir_builder_(ir_builder),
         ksl_(ir_builder_),
-        vsl_(scalar_type_, /*vector_size=*/tile_rows_, ir_builder_, "") {
-    CHECK(tile_rows_ > 0 && IsPowerOfTwo(static_cast<uint64>(tile_rows_)));
+        vsl_(config.scalar_type(), /*vector_size=*/config.tile_rows(),
+             ir_builder_, "") {
+    CHECK(tile_rows() > 0 && IsPowerOfTwo(static_cast<uint64>(tile_rows())));
+    CHECK(!has_addend() || addend != nullptr);
   }
 
   void Emit();
 
+  const Config& config() const { return config_; }
+
  private:
   void EmitOuterLoopBody(llvm::Value* column, int64 column_count,
                          bool is_first_column);
 
-  TileLoader GetLhsTileLoader(llvm::Value* column_start, int64 column_count) {
-    return TileLoader(&vsl_, ir_builder_, /*matrix=*/lhs_,
-                      /*matrix_size_along_minor_dim=*/m_,
+  MemoryTile GetLhsMemoryTile(llvm::Value* column_start, int64 column_count) {
+    return MemoryTile(&vsl_, ir_builder_, /*matrix=*/lhs_,
+                      /*matrix_size_along_minor_dim=*/m(),
                       /*major_dim_offset=*/column_start,
                       /*tile_size_along_major_dim=*/column_count);
   }
@@ -188,18 +290,14 @@ class ColumnMajorMatrixVectorProductEmitter {
     return result;
   }
 
-  void EmitInnerLoopTiled(TileLoader* lhs_tile_loader,
+  void EmitInnerLoopTiled(MemoryTile* lhs_memory_tile,
                           const std::vector<llvm::Value*>& rhs_tile,
                           int64 columns, bool is_first_column);
 
   void EmitInnerLoopEpilogue(llvm::Value* current_tile_col, int64 columns,
                              bool is_first_tiled_column);
 
-  PrimitiveType scalar_type_;
-  int64 tile_rows_;
-  int64 tile_cols_;
-  int64 m_;
-  int64 k_;
+  Config config_;
   llvm::Value* lhs_;
   llvm::Value* rhs_;
   llvm::Value* addend_;
@@ -211,26 +309,26 @@ class ColumnMajorMatrixVectorProductEmitter {
 
 void ColumnMajorMatrixVectorProductEmitter::EmitOuterLoopBody(
     llvm::Value* column, int64 column_count, bool is_first_column) {
-  TileLoader lhs_tile_loader = GetLhsTileLoader(/*column_start=*/column,
+  MemoryTile lhs_memory_tile = GetLhsMemoryTile(/*column_start=*/column,
                                                 /*column_count=*/column_count);
 
   std::vector<llvm::Value*> rhs_tile =
       LoadRhsTile(column, /*count=*/column_count);
-  EmitInnerLoopTiled(&lhs_tile_loader, rhs_tile,
+  EmitInnerLoopTiled(&lhs_memory_tile, rhs_tile,
                      /*columns=*/column_count, is_first_column);
   EmitInnerLoopEpilogue(column, /*columns=*/column_count, is_first_column);
 }
 
 void ColumnMajorMatrixVectorProductEmitter::Emit() {
   // See the comment on the class declaration for the algorithm used here.
-  int64 column_remainder = k_ % tile_cols_;
-  int64 column_limit = k_ - column_remainder;
+  int64 column_remainder = k() % tile_cols();
+  int64 column_limit = k() - column_remainder;
 
-  ksl_.For("dot.outer.tiled",
-           /*start=*/0, /*end=*/column_limit, /*step=*/tile_cols_,
-           [&](llvm::Value* column, bool is_first_column) {
-             EmitOuterLoopBody(column, tile_cols_, is_first_column);
-           });
+  ksl_.ForReturnVoid("dot.outer.tiled",
+                     /*start=*/0, /*end=*/column_limit, /*step=*/tile_cols(),
+                     [&](llvm::Value* column, bool is_first_column) {
+                       EmitOuterLoopBody(column, tile_cols(), is_first_column);
+                     });
 
   if (column_remainder != 0) {
     EmitOuterLoopBody(ir_builder_->getInt64(column_limit), column_remainder,
@@ -239,29 +337,30 @@ void ColumnMajorMatrixVectorProductEmitter::Emit() {
 }
 
 void ColumnMajorMatrixVectorProductEmitter::EmitInnerLoopTiled(
-    TileLoader* lhs_tile_loader, const std::vector<llvm::Value*>& rhs_tile,
+    MemoryTile* lhs_memory_tile, const std::vector<llvm::Value*>& rhs_tile,
     int64 columns, bool is_first_column) {
-  int64 row_limit = m_ - (m_ % tile_rows_);
-
-  ksl_.For("dot.inner.tiled", /*start=*/0, /*end=*/row_limit,
-           /*step=*/tile_rows_, [&](llvm::Value* row) {
-             std::vector<llvm::Value*> lhs_tile =
-                 lhs_tile_loader->LoadTile(/*minor_dim_offset=*/row);
-             llvm::Value* accumulator =
-                 is_first_column ? (addend_ ? vsl_.LoadVector(addend_, row)
-                                            : vsl_.GetZeroVector())
-                                 : vsl_.LoadVector(result_, row);
-             for (int i = 0; i < columns; i++) {
-               accumulator = vsl_.MulAdd(lhs_tile[i], rhs_tile[i], accumulator);
-             }
-             vsl_.StoreVector(accumulator, result_, row);
-           });
+  int64 row_limit = m() - (m() % tile_rows());
+
+  ksl_.ForReturnVoid(
+      "dot.inner.tiled", /*start=*/0, /*end=*/row_limit,
+      /*step=*/tile_rows(), [&](llvm::Value* row) {
+        std::vector<llvm::Value*> lhs_tile =
+            lhs_memory_tile->LoadTile(/*minor_dim_offset=*/row);
+        llvm::Value* accumulator =
+            is_first_column ? (addend_ ? vsl_.LoadVector(addend_, row)
+                                       : vsl_.GetZeroVector())
+                            : vsl_.LoadVector(result_, row);
+        for (int i = 0; i < columns; i++) {
+          accumulator = vsl_.MulAdd(lhs_tile[i], rhs_tile[i], accumulator);
+        }
+        vsl_.StoreVector(accumulator, result_, row);
+      });
 }
 
 void ColumnMajorMatrixVectorProductEmitter::EmitInnerLoopEpilogue(
     llvm::Value* current_tile_col, int64 columns, bool is_first_tiled_column) {
-  int64 row_start = m_ - (m_ % tile_rows_);
-  if (row_start == m_) {
+  int64 row_start = m() - (m() % tile_rows());
+  if (row_start == m()) {
     return;
   }
 
@@ -274,25 +373,25 @@ void ColumnMajorMatrixVectorProductEmitter::EmitInnerLoopEpilogue(
   //     // initialized.
   //   }
 
-  ksl_.For(
+  ksl_.ForReturnVoid(
       "dot.inner.epilg.outer", /*start=*/current_tile_col,
       /*end=*/ir_builder_->CreateAdd(columns_llvm, current_tile_col),
       /*step=*/1, /*peel_first_iteration=*/false,
       [&](llvm::Value* col, llvm::Value* is_first_scalar_col) {
         llvm::Value* rhs_element = vsl_.LoadScalar(rhs_, col);
         llvm::Value* total_offset =
-            ir_builder_->CreateMul(col, ir_builder_->getInt64(m_));
+            ir_builder_->CreateMul(col, ir_builder_->getInt64(m()));
         llvm::Value* lhs_base_pointer =
             vsl_.ComputeOffsetPointer(lhs_, total_offset);
-        ksl_.For(
-            "dot.inner.epilg.inner", /*start=*/row_start, /*end=*/m_,
+        ksl_.ForReturnVoid(
+            "dot.inner.epilg.inner", /*start=*/row_start, /*end=*/m(),
             /*step=*/1, [&](llvm::Value* scalar_row) {
               llvm::Value* product = vsl_.Mul(
                   vsl_.LoadScalar(lhs_base_pointer, scalar_row), rhs_element);
               llvm::Value* setting_result_first_time = ir_builder_->CreateAnd(
                   is_first_scalar_col,
                   ir_builder_->getInt1(is_first_tiled_column));
-              ksl_.If(
+              ksl_.IfReturnVoid(
                   setting_result_first_time,
                   /*true_block_generator=*/
                   [&]() {
@@ -365,51 +464,55 @@ void ColumnMajorMatrixVectorProductEmitter::EmitInnerLoopEpilogue(
 //
 // We have an inner epilogue loop to deal with the "B" sub-matrix and an outer
 // epilogue loop to deal with the C,D submatrix.
-class RowMajorMatrixVectorProductEmitter {
+class RowMajorMatrixVectorProductEmitter
+    : public GemvConfig::User<RowMajorMatrixVectorProductEmitter> {
  public:
-  RowMajorMatrixVectorProductEmitter(PrimitiveType scalar_type, int64 tile_rows,
-                                     int64 tile_cols, int64 m, int64 k,
-                                     llvm::Value* lhs, llvm::Value* rhs,
-                                     llvm::Value* addend, llvm::Value* result,
+  class Config : public GemvConfig {
+   public:
+    explicit Config(PrimitiveType scalar_type, int64 tile_rows, int64 tile_cols,
+                    int64 m, int64 k, bool has_addend)
+        : GemvConfig(/*name=*/"row_major_gemv", scalar_type,
+                     /*tile_rows=*/tile_rows, /*tile_cols=*/tile_cols, /*m=*/m,
+                     /*k=*/k, /*has_addend=*/has_addend) {}
+  };
+
+  RowMajorMatrixVectorProductEmitter(const Config& config, llvm::Value* lhs,
+                                     llvm::Value* rhs, llvm::Value* addend,
+                                     llvm::Value* result,
                                      llvm::IRBuilder<>* ir_builder)
-      : scalar_type_(scalar_type),
-        tile_rows_(tile_rows),
-        tile_cols_(tile_cols),
-        m_(m),
-        k_(k),
+      : config_(config),
         lhs_(lhs),
         rhs_(rhs),
         addend_(addend),
         result_(result),
         ir_builder_(ir_builder),
         ksl_(ir_builder_),
-        vsl_(scalar_type_, /*vector_size=*/tile_cols_, ir_builder_, "") {
-    CHECK(tile_cols_ > 0 && IsPowerOfTwo(static_cast<uint64>(tile_cols_)));
+        vsl_(scalar_type(), /*vector_size=*/tile_cols(), ir_builder_, "") {
+    CHECK(tile_cols() > 0 && IsPowerOfTwo(static_cast<uint64>(tile_cols())));
+    CHECK(!has_addend() || addend != nullptr);
   }
 
   void Emit();
 
+  const Config& config() const { return config_; }
+
  private:
-  TileLoader GetLhsTileLoader(llvm::Value* row_start, int64 row_count) {
-    return TileLoader(&vsl_, ir_builder_, /*matrix=*/lhs_,
-                      /*matrix_size_along_minor_dim=*/k_,
+  MemoryTile GetLhsMemoryTile(llvm::Value* row_start, int64 row_count) {
+    return MemoryTile(&vsl_, ir_builder_, /*matrix=*/lhs_,
+                      /*matrix_size_along_minor_dim=*/k(),
                       /*major_dim_offset=*/row_start,
                       /*tile_size_along_major_dim=*/row_count);
   }
 
   void EmitOuterLoopBody(llvm::Value* row, int64 row_count);
 
-  void EmitInnerLoopTiled(TileLoader* lhs_tile_loader, int64 rows,
+  void EmitInnerLoopTiled(MemoryTile* lhs_memory_tile, int64 rows,
                           std::vector<VectorVariable>* vector_accumulators);
 
   void EmitInnerLoopEpilogue(llvm::Value* current_tile_row, int64 rows,
                              std::vector<ScalarVariable>* scalar_accumulators);
 
-  PrimitiveType scalar_type_;
-  int64 tile_rows_;
-  int64 tile_cols_;
-  int64 m_;
-  int64 k_;
+  Config config_;
   llvm::Value* lhs_;
   llvm::Value* rhs_;
   llvm::Value* addend_;
@@ -421,7 +524,7 @@ class RowMajorMatrixVectorProductEmitter {
 
 void RowMajorMatrixVectorProductEmitter::EmitOuterLoopBody(llvm::Value* row,
                                                            int64 row_count) {
-  TileLoader lhs_tile_loader = GetLhsTileLoader(/*row_start=*/row,
+  MemoryTile lhs_memory_tile = GetLhsMemoryTile(/*row_start=*/row,
                                                 /*row_count=*/row_count);
   std::vector<VectorVariable> vector_accumulators;
   std::vector<ScalarVariable> scalar_accumulators;
@@ -429,7 +532,7 @@ void RowMajorMatrixVectorProductEmitter::EmitOuterLoopBody(llvm::Value* row,
     vector_accumulators.emplace_back(&vsl_, vsl_.GetZeroVector());
     scalar_accumulators.emplace_back(&vsl_, vsl_.GetZeroScalar());
   }
-  EmitInnerLoopTiled(&lhs_tile_loader, /*rows=*/row_count,
+  EmitInnerLoopTiled(&lhs_memory_tile, /*rows=*/row_count,
                      &vector_accumulators);
   EmitInnerLoopEpilogue(/*current_tile_row=*/row, /*rows=*/row_count,
                         &scalar_accumulators);
@@ -466,12 +569,13 @@ void RowMajorMatrixVectorProductEmitter::EmitOuterLoopBody(llvm::Value* row,
 
 void RowMajorMatrixVectorProductEmitter::Emit() {
   // See the comment on the class declaration for the algorithm used here.
-  int64 row_remainder = m_ % tile_rows_;
-  int64 row_limit = m_ - row_remainder;
+  int64 row_remainder = m() % tile_rows();
+  int64 row_limit = m() - row_remainder;
 
-  ksl_.For("dot.outer.tiled",
-           /*start=*/0, /*end=*/row_limit, /*step=*/tile_rows_,
-           [&](llvm::Value* row) { EmitOuterLoopBody(row, tile_rows_); });
+  ksl_.ForReturnVoid(
+      "dot.outer.tiled",
+      /*start=*/0, /*end=*/row_limit, /*step=*/tile_rows(),
+      [&](llvm::Value* row) { EmitOuterLoopBody(row, tile_rows()); });
 
   if (row_remainder != 0) {
     EmitOuterLoopBody(ir_builder_->getInt64(row_limit), row_remainder);
@@ -479,45 +583,46 @@ void RowMajorMatrixVectorProductEmitter::Emit() {
 }
 
 void RowMajorMatrixVectorProductEmitter::EmitInnerLoopTiled(
-    TileLoader* lhs_tile_loader, int64 rows,
+    MemoryTile* lhs_memory_tile, int64 rows,
     std::vector<VectorVariable>* vector_accumulators) {
-  int64 column_limit = k_ - (k_ % tile_cols_);
-
-  ksl_.For("dot.inner.tiled", /*start=*/0, /*end=*/column_limit,
-           /*step=*/tile_cols_, [&](llvm::Value* col) {
-             std::vector<llvm::Value*> lhs_tile =
-                 lhs_tile_loader->LoadTile(/*minor_dim_offset=*/col);
-             llvm::Value* rhs_value = vsl_.LoadVector(rhs_, col);
-             for (int i = 0; i < rows; i++) {
-               llvm::Value* old_sum = (*vector_accumulators)[i].Get();
-               (*vector_accumulators)[i].Set(
-                   vsl_.Add(old_sum, vsl_.Mul(rhs_value, lhs_tile[i])));
-             }
-           });
+  int64 column_limit = k() - (k() % tile_cols());
+
+  ksl_.ForReturnVoid("dot.inner.tiled", /*start=*/0, /*end=*/column_limit,
+                     /*step=*/tile_cols(), [&](llvm::Value* col) {
+                       std::vector<llvm::Value*> lhs_tile =
+                           lhs_memory_tile->LoadTile(/*minor_dim_offset=*/col);
+                       llvm::Value* rhs_value = vsl_.LoadVector(rhs_, col);
+                       for (int i = 0; i < rows; i++) {
+                         llvm::Value* old_sum = (*vector_accumulators)[i].Get();
+                         (*vector_accumulators)[i].Set(vsl_.Add(
+                             old_sum, vsl_.Mul(rhs_value, lhs_tile[i])));
+                       }
+                     });
 }
 
 void RowMajorMatrixVectorProductEmitter::EmitInnerLoopEpilogue(
     llvm::Value* current_tile_row, int64 rows,
     std::vector<ScalarVariable>* scalar_accumulators) {
-  int64 column_start = k_ - (k_ % tile_cols_);
-  if (column_start == k_) {
+  int64 column_start = k() - (k() % tile_cols());
+  if (column_start == k()) {
     return;
   }
 
   for (int r = 0; r < rows; r++) {
     llvm::Value* total_offset = ir_builder_->CreateMul(
         ir_builder_->CreateAdd(ir_builder_->getInt64(r), current_tile_row),
-        ir_builder_->getInt64(k_));
+        ir_builder_->getInt64(k()));
     llvm::Value* lhs_base_pointer =
         vsl_.ComputeOffsetPointer(lhs_, total_offset);
-    ksl_.For("dot.inner.epilg.inner", /*start=*/column_start, /*end=*/k_,
-             /*step=*/1, [&](llvm::Value* scalar_col) {
-               llvm::Value* product =
-                   vsl_.Mul(vsl_.LoadScalar(lhs_base_pointer, scalar_col),
-                            vsl_.LoadScalar(rhs_, scalar_col));
-               llvm::Value* old_value = (*scalar_accumulators)[r].Get();
-               (*scalar_accumulators)[r].Set(vsl_.Add(old_value, product));
-             });
+    ksl_.ForReturnVoid(
+        "dot.inner.epilg.inner", /*start=*/column_start, /*end=*/k(),
+        /*step=*/1, [&](llvm::Value* scalar_col) {
+          llvm::Value* product =
+              vsl_.Mul(vsl_.LoadScalar(lhs_base_pointer, scalar_col),
+                       vsl_.LoadScalar(rhs_, scalar_col));
+          llvm::Value* old_value = (*scalar_accumulators)[r].Get();
+          (*scalar_accumulators)[r].Set(vsl_.Add(old_value, product));
+        });
   }
 }
 
@@ -543,141 +648,234 @@ class MatrixMatrixBlockPanelEmitter {
     int64 k() const { return k_; }
     int64 n() const { return n_; }
 
+    string ToString() const {
+      return tensorflow::strings::StrCat(m(), "x", k(), "x", n());
+    }
+
    private:
     const int64 m_;
     const int64 k_;
     const int64 n_;
   };
 
-  // Creates an instance of MatrixMatrixBlockPanelEmitter that matrix-multiplies
-  // `lhs` with `rhs` and stores the result in `result`.
+  // Represents the configuration of the GEBP emitter.  The LLVM IR emitted by
+  // the emitter, modulo the LLVM values holding the input and output buffers,
+  // must be a function of the instance of `Config` passed to it.
   //
-  // `m`, `k` and `n` are the matrix multiplication dimensions.
+  // `dims` holds the matrix multiplication dimensions.
   //
   // `max_vectorization_width` is the maximum vector width (i.e. the width of
   // the largest vector register we will use).  This can be larger than the
   // largest vector register supported by the machine -- LLVM will legalize
   // these large vector widths into legally sized vectors.
+  //
+  // `max_vector_count` is the maximum number of vectors of size
+  // `max_vectorization_width` that we will attempt to process at once.
+  //
   // `min_vectorization_width` is the smallest vector width the emitter will use
   // -- below that it will devolve to using a scalar loop.
   //
-  // `k_tiling_factor` is the number of elements along the reduction dimensions
-  // that we will attempt to process at once.
-  explicit MatrixMatrixBlockPanelEmitter(
-      llvm::Value* lhs, llvm::Value* rhs, llvm::Value* result, Dimensions dims,
-      int max_vectorization_width, int min_vectorization_width,
-      int k_tiling_factor, const TargetMachineFeatures& target_machine_features,
-      llvm::IRBuilder<>* ir_builder, PrimitiveType primitive_type)
+  // The innermost reduction loop executes the matrix multiply in tiles of size
+  // [`tile_size_m`, `tile_size_k`] from the LHS and [`tile_size_k`,
+  // <vectorization width>] in the RHS.
+  class Config {
+   public:
+    explicit Config(PrimitiveType scalar_type, Dimensions dims,
+                    int64 max_vectorization_width, int64 max_vector_count,
+                    int64 min_vectorization_width, int64 tile_size_m,
+                    int64 tile_size_k)
+        : scalar_type_(scalar_type),
+          dims_(dims),
+          max_vectorization_width_(max_vectorization_width),
+          max_vector_count_(max_vector_count),
+          min_vectorization_width_(min_vectorization_width),
+          tile_size_m_(tile_size_m),
+          tile_size_k_(tile_size_k) {}
+
+    string GetCacheKey() const {
+      return tensorflow::strings::StrCat(
+          "gebp_", PrimitiveType_Name(scalar_type()), "_", dims().ToString(),
+          "_", max_vectorization_width(), "_", min_vectorization_width(), "_",
+          tile_size_m(), "_", tile_size_k());
+    }
+
+    PrimitiveType scalar_type() const { return scalar_type_; }
+    Dimensions dims() const { return dims_; }
+    int64 max_vectorization_width() const { return max_vectorization_width_; }
+    int64 max_vector_count() const { return max_vector_count_; }
+    int64 min_vectorization_width() const { return min_vectorization_width_; }
+
+    int64 tile_size_m() const { return tile_size_m_; }
+    int64 tile_size_k() const { return tile_size_k_; }
+
+   private:
+    PrimitiveType scalar_type_;
+    Dimensions dims_;
+    int64 max_vectorization_width_;
+    int64 max_vector_count_;
+    int64 min_vectorization_width_;
+    int64 tile_size_m_;
+    int64 tile_size_k_;
+  };
+
+  // Creates an instance of MatrixMatrixBlockPanelEmitter that matrix-multiplies
+  // `lhs` with `rhs` and stores the result in `result`.
+  explicit MatrixMatrixBlockPanelEmitter(Config config, llvm::Value* lhs,
+                                         llvm::Value* rhs, llvm::Value* result,
+                                         llvm::IRBuilder<>* ir_builder)
       : lhs_(lhs),
         rhs_(rhs),
         result_(result),
-        dims_(dims),
-        max_vectorization_width_(max_vectorization_width),
-        min_vectorization_width_(min_vectorization_width),
-        k_tiling_factor_(k_tiling_factor),
-        target_machine_features_(target_machine_features),
+        config_(config),
         ir_builder_(ir_builder),
-        primitive_type_(primitive_type),
         ksl_(ir_builder_) {
-    CHECK(max_vectorization_width > 0 &&
-          IsPowerOfTwo(static_cast<uint64>(max_vectorization_width)));
-    CHECK(min_vectorization_width > 0 &&
-          IsPowerOfTwo(static_cast<uint64>(min_vectorization_width)));
-    CHECK_GT(k_tiling_factor, 0);
+    CHECK(max_vectorization_width() > 0 &&
+          IsPowerOfTwo(static_cast<uint64>(max_vectorization_width())));
+    CHECK_GT(max_vector_count(), 0);
+    CHECK(min_vectorization_width() > 0 &&
+          IsPowerOfTwo(static_cast<uint64>(min_vectorization_width())));
+    CHECK_GE(max_vectorization_width(), min_vectorization_width());
+    CHECK_GT(tile_size_k(), 0);
   }
 
   void Emit();
 
  private:
-  // We can only iterate the `n` dimension for an extent that is divisible by
-  // the vectorization width.  So we emit an outer loop that first processes the
-  // largest extent in `n` that is divisible by max_vectorization_width, then
-  // the largest remaining extent that is divisible by max_vectorization_width /
-  // 2 etc.  This function emits that outermost loop.
-  void EmitChunkedLoopOverN();
-
-  // This emits a loop that loops over the `k` dimension in multiples of
-  // `k_tiling_factor` as much as possible and then emits a remainder epilogue.
-  void EmitLoopOverK(VectorSupportLibrary* vsl, llvm::Value* n_start,
-                     llvm::Value* n_end);
-
-  // This emits the inner reduction loop.  This inner reduction loop processes
-  // all indices in the `m` dimension, [`k_start`, `k_end`) in the k dimension
-  // and [`n_start`, `n_end`) in the `n` dimension.
-  void EmitInnerLoop(int64 k_tiling_factor, llvm::Value* k_start,
-                     llvm::Value* k_end, llvm::Value* n_start,
-                     llvm::Value* n_end, VectorSupportLibrary* vsl);
-
-  llvm::Value* getInt64(int64 value) { return ir_builder_->getInt64(value); }
+  // The HandleResiduesOnX helpers split the iteration space for dimension X
+  // into a multiple of the tile size on dimension X and an epilogue.  These
+  // helpers ultimately call into `EmitTiledGemm` for emitting the
+  // tiled GEMM kernel.
+
+  void HandleResiduesOnN();
+  void HandleResiduesOnK(VectorSupportLibrary* vsl, llvm::Value* n_start,
+                         llvm::Value* n_end);
+  void HandleResiduesOnM(VectorSupportLibrary* vsl, int64 tile_size_k,
+                         llvm::Value* k_start, llvm::Value* k_end,
+                         llvm::Value* n_start, llvm::Value* n_end);
+
+  // This emits a tiled GEMM kernel.  For a detailed description see the comment
+  // on the implementation.
+  void EmitTiledGemm(VectorSupportLibrary* vsl, int64 tile_size_k,
+                     llvm::Value* k_start, llvm::Value* k_end,
+                     llvm::Value* n_start, llvm::Value* n_end,
+                     int64 tile_size_m, llvm::Value* m_start,
+                     llvm::Value* m_end);
+
+  llvm::Value* GetInt64(int64 value) { return ir_builder_->getInt64(value); }
+
+  Config config() const { return config_; }
+  Dimensions dims() const { return config().dims(); }
+
+  int64 max_vectorization_width() const {
+    return config().max_vectorization_width();
+  }
+  int64 max_vector_count() const { return config().max_vector_count(); }
+  int64 min_vectorization_width() const {
+    return config().min_vectorization_width();
+  }
+  int64 tile_size_m() const { return config().tile_size_m(); }
+  int64 tile_size_k() const { return config().tile_size_k(); }
+  PrimitiveType scalar_type() const { return config().scalar_type(); }
 
   llvm::Value* lhs_;
   llvm::Value* rhs_;
   llvm::Value* result_;
-  Dimensions dims_;
-
-  int64 max_vectorization_width_;
-  int64 min_vectorization_width_;
-  int64 k_tiling_factor_;
+  Config config_;
 
-  const TargetMachineFeatures& target_machine_features_;
   llvm::IRBuilder<>* ir_builder_;
-  PrimitiveType primitive_type_;
   KernelSupportLibrary ksl_;
 };
 
-void MatrixMatrixBlockPanelEmitter::Emit() { EmitChunkedLoopOverN(); }
+void MatrixMatrixBlockPanelEmitter::Emit() { HandleResiduesOnN(); }
+
+void MatrixMatrixBlockPanelEmitter::HandleResiduesOnN() {
+  // We can only iterate the `n` dimension for an extent that is divisible by
+  // the vectorization width.  So we emit an outer loop that first processes the
+  // largest extent in `n` that is divisible by max_vectorization_width, then
+  // the largest remaining extent that is divisible by max_vectorization_width /
+  // 2 etc.
+
+  int64 current_vectorization_width =
+      max_vector_count() * max_vectorization_width();
+  int64 current_vector_count = max_vector_count();
 
-void MatrixMatrixBlockPanelEmitter::EmitChunkedLoopOverN() {
-  int64 current_vectorization_width = max_vectorization_width_;
   int64 n_start = 0;
-  while (n_start != dims_.n() &&
-         current_vectorization_width >= min_vectorization_width_) {
-    int64 n_end = dims_.n() - (dims_.n() % current_vectorization_width);
+  while (n_start != dims().n() &&
+         current_vectorization_width >= min_vectorization_width()) {
+    int64 n_end = dims().n() - (dims().n() % current_vectorization_width);
     if (n_start != n_end) {
-      VectorSupportLibrary vsl(primitive_type_, current_vectorization_width,
+      VectorSupportLibrary vsl(scalar_type(), current_vectorization_width,
                                ir_builder_, "gebp");
-      EmitLoopOverK(&vsl, getInt64(n_start), getInt64(n_end));
+      HandleResiduesOnK(&vsl, GetInt64(n_start), GetInt64(n_end));
       n_start = n_end;
     }
-    current_vectorization_width /= 2;
+    if (current_vector_count == 1) {
+      current_vectorization_width /= 2;
+    } else {
+      current_vector_count--;
+      current_vectorization_width =
+          current_vector_count * max_vectorization_width();
+    }
   }
 
-  if (n_start != dims_.n()) {
-    VectorSupportLibrary vsl(primitive_type_, 1, ir_builder_, "gebp");
-    ksl_.For("epi.n", n_start, dims_.n(), 1, [&](llvm::Value* n_i) {
+  if (n_start != dims().n()) {
+    VectorSupportLibrary vsl(scalar_type(), 1, ir_builder_, "gebp");
+    ksl_.ForReturnVoid("epi.n", n_start, dims().n(), 1, [&](llvm::Value* n_i) {
       llvm::Value* n_i_next =
           ir_builder_->CreateAdd(n_i, ir_builder_->getInt64(1));
-      EmitLoopOverK(&vsl, n_i, n_i_next);
+      HandleResiduesOnK(&vsl, n_i, n_i_next);
     });
   }
 }
 
-void MatrixMatrixBlockPanelEmitter::EmitLoopOverK(VectorSupportLibrary* vsl,
-                                                  llvm::Value* n_start,
-                                                  llvm::Value* n_end) {
+void MatrixMatrixBlockPanelEmitter::HandleResiduesOnK(VectorSupportLibrary* vsl,
+                                                      llvm::Value* n_start,
+                                                      llvm::Value* n_end) {
   int64 k_start = 0;
-  int64 k_end = dims_.k() - (dims_.k() % k_tiling_factor_);
+  int64 k_end = dims().k() - (dims().k() % tile_size_k());
   if (k_end != k_start) {
-    EmitInnerLoop(k_tiling_factor_, getInt64(k_start), getInt64(k_end), n_start,
-                  n_end, vsl);
+    HandleResiduesOnM(vsl, tile_size_k(), GetInt64(k_start), GetInt64(k_end),
+                      n_start, n_end);
     k_start = k_end;
   }
 
-  if (k_start != dims_.k()) {
-    EmitInnerLoop(dims_.k() - k_start, getInt64(k_start), getInt64(dims_.k()),
-                  n_start, n_end, vsl);
+  if (k_start != dims().k()) {
+    HandleResiduesOnM(vsl, dims().k() - k_start, GetInt64(k_start),
+                      GetInt64(dims().k()), n_start, n_end);
   }
 }
 
+void MatrixMatrixBlockPanelEmitter::HandleResiduesOnM(
+    VectorSupportLibrary* vsl, int64 tile_size_k, llvm::Value* k_start,
+    llvm::Value* k_end, llvm::Value* n_start, llvm::Value* n_end) {
+  const int64 m_end = dims().m() - dims().m() % tile_size_m();
+  EmitTiledGemm(vsl, tile_size_k, k_start, k_end, n_start, n_end, tile_size_m(),
+                GetInt64(0), GetInt64(m_end));
+
+  if (m_end != dims().m()) {
+    EmitTiledGemm(vsl, tile_size_k, k_start, k_end, n_start, n_end,
+                  dims().m() - m_end, GetInt64(m_end), GetInt64(dims().m()));
+  }
+}
+
+// The loop structure is:
+//
+// Iterate over dimension M as m:
+//   Iterate over dimension N as n:
+//     Iterate over dimension K as k:
+//       OutputTile[m,n] += Dot(LhsTile[m,k], RhsTile[k,n])
+//
+// I.e. a just a tiled version of a "naive" GEMM.
+//
 // The tiling scheme is as follows:
 //
 // Let the LHS be:
 //
-//   +---+---+---+
-//   | a | b | c | .
-//   +---+---+---+ .
-//   |   |   |   | .
-//   +---+---+---+
+//   +----+----+----+
+//   | a0 | b0 | c0 | .
+//   +----+----+----+ .
+//   | a1 | b1 | c1 | .
+//   +----+----+----+
 //     ..     ..
 //
 // and the RHS be:
@@ -691,75 +889,87 @@ void MatrixMatrixBlockPanelEmitter::EmitLoopOverK(VectorSupportLibrary* vsl,
 //   +----+----+----+----+ .
 //     ......    ......
 //
-// and let k_tiling_factor be 3 and the vector width (implicitly denoted by
-// `vsl`) be 4.
+// and let tile_size_m=2, tile_size_k=3 and the vector width (implicitly denoted
+// by `vsl`) be 4.  Then we want to matrix multiply this tile to get a [2,4]
+// matrix that we can increment the result matrix by.
 //
-// Then we
+// First broadcast the rows row in LHS to 3 vectors of width 4, giving us a rank
+// 3 array, L, of dimension [2,3,4]:
 //
-//  1. broadcast the first row in LHS to 3 vectors of width 4
-//  2. elementwise multiply the RHS rows with these broadcasted vectors
-//  3. elementwise add them:
+//       L[0,_,_]           *      L[1,_,_]
+//                          *
+//   +----+----+----+----+  *  +----+----+----+----+
+//   | a0 | a0 | a0 | a0 |  *  | a1 | a1 | a1 | a1 |
+//   +----+----+----+----+  *  +----+----+----+----+
+//   | b0 | b0 | b0 | b0 |  *  | b1 | b1 | b1 | b1 |
+//   +----+----+----+----+  *  +----+----+----+----+
+//   | c0 | c0 | c0 | c0 |  *  | c1 | c1 | c1 | c1 |
+//   +----+----+----+----+  *  +----+----+----+----+
 //
-//   +---+---+---+---+   +----+----+----+----+
-//   | a | a | a | a | * | p0 | p1 | p2 | p3 |   +
-//   +---+---+---+---+   +----+----+----+----+
 //
-//   +---+---+---+---+   +----+----+----+----+
-//   | b | b | b | b | * | q0 | q1 | q2 | q3 |   +
-//   +---+---+---+---+   +----+----+----+----+
+// Then we FMA L[0,_,_] with the RHS to get the first row of the result and
+// L[1,_,_] with the RHS to get the second row of the result.  For example,
+// L[0,_,_] is computed as:
 //
-//   +---+---+---+---+   +----+----+----+----+
-//   | c | c | c | c | * | r0 | r1 | r2 | r3 |
-//   +---+---+---+---+   +----+----+----+----+
+//   +----+----+----+----+   +----+----+----+----+
+//   | a0 | a0 | a0 | a0 | * | p0 | p1 | p2 | p3 |   +
+//   +----+----+----+----+   +----+----+----+----+
 //
-// to get:
+//   +----+----+----+----+   +----+----+----+----+
+//   | b0 | b0 | b0 | b0 | * | q0 | q1 | q2 | q3 |   +
+//   +----+----+----+----+   +----+----+----+----+
 //
-//   +----------------+----------------+----------------+----------------+
-//   | a*p0+b*q0+c*r0 | a*p1+b*q1+c*r1 | a*p2+b*q2+c*r2 | a*p3+b*q3+c*r3 |
-//   +----------------+----------------+----------------+----------------+
+//   +----+----+----+----+   +----+----+----+----+
+//   | c0 | c0 | c0 | c0 | * | r0 | r1 | r2 | r3 |
+//   +----+----+----+----+   +----+----+----+----+
 //
-// which we increment into the appropriate region in the result.
-void MatrixMatrixBlockPanelEmitter::EmitInnerLoop(
-    int64 k_tiling_factor, llvm::Value* k_start, llvm::Value* k_end,
-    llvm::Value* n_start, llvm::Value* n_end, VectorSupportLibrary* vsl) {
-  ksl_.For("dot.m", 0, dims_.m(), 1, [&](llvm::Value* m_i) {
-    // This outer loop iterates over all of the M dimension
-    llvm::Value* result_row_begin = vsl->ComputeOffsetPointer(
-        result_, /*offset_elements=*/m_i, /*scale=*/dims_.n());
-    llvm::Value* lhs_row_begin = vsl->ComputeOffsetPointer(
-        lhs_, /*offset_elements=*/m_i, /*scale=*/dims_.k());
-
-    ksl_.For("dot.k", k_start, k_end, k_tiling_factor, [&](llvm::Value* k_i) {
-      // broadcasted_a is the broadcasted set of vectors denoted as <a,a,a,a>,
-      // <b,b,b,b> etc. in the diagram.
-      std::vector<llvm::Value*> broadcasted_a;
-      broadcasted_a.reserve(k_tiling_factor);
-      for (int i = 0; i < k_tiling_factor; i++) {
-        broadcasted_a.push_back(vsl->LoadBroadcast(
-            lhs_row_begin, ir_builder_->CreateAdd(getInt64(i), k_i)));
-      }
+// to get:
+//
+//   +-------------------+-------------------+-------------------+---------
+//   | a0*p0+b0*q0+c0*r0 | a0*p1+b0*q1+c0*r1 | a0*p2+b0*q2+c0*r2 |  ...
+//   +-------------------+-------------------+-------------------+---------
+void MatrixMatrixBlockPanelEmitter::EmitTiledGemm(
+    VectorSupportLibrary* vsl, int64 tile_size_k, llvm::Value* k_start,
+    llvm::Value* k_end, llvm::Value* n_start, llvm::Value* n_end,
+    int64 tile_size_m, llvm::Value* m_start, llvm::Value* m_end) {
+  ksl_.ForReturnVoid(
+      "dot.m", m_start, m_end, tile_size_m, [&](llvm::Value* m_i) {
+        MemoryTile result_memory_tile(
+            vsl, ir_builder_, /*matrix=*/result_,
+            /*matrix_size_along_minor_dim=*/dims().n(),
+            /*major_dim_offset=*/m_i,
+            /*tile_size_along_major_dim=*/tile_size_m);
+        MemoryTile lhs_memory_tile(vsl, ir_builder_, /*matrix=*/lhs_,
+                                   /*matrix_size_along_minor_dim=*/dims().k(),
+                                   /*major_dim_offset=*/m_i,
+                                   /*tile_size_along_major_dim=*/tile_size_m);
+        ksl_.ForReturnVoid(
+            "dot.n", n_start, n_end, vsl->vector_size(), [&](llvm::Value* n_i) {
+              TileVariable result_tile_var(vsl,
+                                           result_memory_tile.LoadTile(n_i));
+              ksl_.ForReturnVoid(
+                  "dot.k", k_start, k_end, tile_size_k, [&](llvm::Value* k_i) {
+                    MemoryTile rhs_memory_tile(vsl, ir_builder_, rhs_,
+                                               dims().n(), k_i, tile_size_k);
+                    std::vector<std::vector<llvm::Value*>> lhs_tile =
+                        lhs_memory_tile.LoadBroadcastTile(k_i, tile_size_k);
+                    std::vector<llvm::Value*> rhs_tile =
+                        rhs_memory_tile.LoadTile(n_i);
+                    std::vector<llvm::Value*> result_tile =
+                        result_tile_var.Get();
+                    for (int64 r_m_i = 0; r_m_i < tile_size_m; r_m_i++) {
+                      for (int64 r_k_i = 0; r_k_i < tile_size_k; r_k_i++) {
+                        result_tile[r_m_i] =
+                            vsl->MulAdd(lhs_tile[r_m_i][r_k_i], rhs_tile[r_k_i],
+                                        result_tile[r_m_i]);
+                      }
+                    }
+                    result_tile_var.Set(result_tile);
+                  });
 
-      // rhs_loader will be used to load the tile off of the RHS, denoted as
-      // <<p0,p1,p2,p3>,<q0,q1,q2,q3> ...> in the diagram.
-      TileLoader rhs_loader(vsl, ir_builder_, rhs_, dims_.n(), k_i,
-                            k_tiling_factor);
-      ksl_.For(
-          "dot.n", n_start, n_end, vsl->vector_size(), [&](llvm::Value* n_i) {
-            // This loop iterates over the N dimension.  It loads the tile from
-            // RHS, does the FMA resulting in the
-            // <a*p0+b*q0+c*r0,a*p1+b*q1+c*r1,...> in the diagram and increments
-            // the result.
-            std::vector<llvm::Value*> tile = rhs_loader.LoadTile(n_i);
-            llvm::Value* result_accumulator =
-                vsl->LoadVector(result_row_begin, n_i);
-            for (int i = 0; i < tile.size(); i++) {
-              result_accumulator =
-                  vsl->MulAdd(tile[i], broadcasted_a[i], result_accumulator);
-            }
-            vsl->StoreVector(result_accumulator, result_row_begin, n_i);
-          });
-    });
-  });
+              result_memory_tile.StoreTile(result_tile_var.Get(), n_i);
+            });
+      });
 }
 
 }  // namespace
@@ -827,8 +1037,6 @@ bool DotOpEmitter::EmitExperimentalGebpDotIfEnabled(
     return false;
   }
 
-  VLOG(2) << "Emitting GEBP kernel in LLVM IR";
-
   llvm::Value* lhs = lhs_array_.GetBasePointer();
   llvm::Value* rhs = rhs_array_.GetBasePointer();
   llvm::Value* target = target_array_.GetBasePointer();
@@ -846,14 +1054,41 @@ bool DotOpEmitter::EmitExperimentalGebpDotIfEnabled(
       target, ir_builder_->getInt8(0), size_bytes,
       target_machine_features_.minimum_alignment_for_allocation(size_bytes));
 
-  MatrixMatrixBlockPanelEmitter::Dimensions gebp_dims(/*m=*/m, /*k=*/k,
-                                                      /*n=*/n);
-  MatrixMatrixBlockPanelEmitter gebp_emitter(
-      /*lhs=*/lhs, /*rhs=*/rhs, /*result=*/target, gebp_dims,
-      /*max_vectorization_width=*/8, /*min_vectorization_width=*/4,
-      /*k_tiling_factor=*/8, target_machine_features_, ir_builder_,
-      primitive_type);
-  gebp_emitter.Emit();
+  int64 max_target_vector_width =
+      target_machine_features_.vector_register_num_elements(
+          *ir_builder_->GetInsertBlock()->getParent(), primitive_type);
+
+  int64 tile_size_m, tile_size_k, tile_size_n_in_vector_width;
+  std::tie(tile_size_m, tile_size_k, tile_size_n_in_vector_width) =
+      GetGemmTileSize();
+
+  MatrixMatrixBlockPanelEmitter::Config config(
+      /*scalar_type=*/primitive_type,
+      MatrixMatrixBlockPanelEmitter::Dimensions{/*m=*/m, /*k=*/k, /*n=*/n},
+      /*max_vectorization_width=*/max_target_vector_width,
+      /*max_vector_count=*/tile_size_n_in_vector_width,
+      /*min_vectorization_width=*/std::min<int64>(4, max_target_vector_width),
+      /*tile_size_m=*/tile_size_m, /*tile_size_k=*/tile_size_k);
+
+  VLOG(2) << "Emitting GEBP kernel in LLVM IR with config "
+          << config.GetCacheKey();
+
+  const bool enable_fast_math =
+      hlo_module_config_.debug_options().xla_enable_fast_math();
+  const bool optimize_for_size =
+      options::OptimizeForSizeRequested(hlo_module_config_);
+
+  KernelSupportLibrary::EmitAndCallOutlinedKernel(
+      /*enable_fast_math=*/enable_fast_math,
+      /*optimize_for_size=*/optimize_for_size, ir_builder_,
+      config.GetCacheKey(), lhs, rhs, target,
+      [this, config](llvm::Value* lhs, llvm::Value* rhs, llvm::Value* target) {
+        MatrixMatrixBlockPanelEmitter gebp_emitter(
+            config, /*lhs=*/lhs, /*rhs=*/rhs,
+            /*result=*/target, ir_builder_);
+        gebp_emitter.Emit();
+      });
+
   return true;
 }
 
@@ -942,47 +1177,39 @@ bool DotOpEmitter::EmitLlvmIrDotIfProfitable() {
   if (is_column_major_matrix_vector) {
     VLOG(2) << "Emitting column major matrix-vector multiply with m = " << m
             << " and k = " << k;
-    int64 tile_rows = vector_register_element_size;
-    int64 tile_cols = tiling_factor;
-
-    string kernel_name = tensorflow::strings::StrCat(
-        "col_major_gemv_", PrimitiveType_Name(primitive_type), "_", tile_rows,
-        "_", tile_cols, "_", m, "_", k, addend_array_ ? "_with_addend" : "");
+    ColumnMajorMatrixVectorProductEmitter::Config config(
+        /*scalar_type=*/primitive_type,
+        /*tile_rows=*/vector_register_element_size, /*tile_cols=*/tiling_factor,
+        /*m=*/m, /*k=*/k, /*has_addend=*/addend_array_ != nullptr);
 
     KernelSupportLibrary::EmitAndCallOutlinedKernel(
         /*enable_fast_math=*/enable_fast_math,
-        /*optimize_for_size=*/optimize_for_size, ir_builder_, kernel_name,
-        lhs_op, rhs_op,
+        /*optimize_for_size=*/optimize_for_size, ir_builder_,
+        config.GetCacheKey(), lhs_op, rhs_op,
         addend_array_ ? addend_array_->GetBasePointer() : nullptr, result_op,
-        [this, tile_rows, tile_cols, m, k, primitive_type](
-            llvm::Value* lhs_op, llvm::Value* rhs_op, llvm::Value* addend_op,
-            llvm::Value* result_op) {
+        [this, config](llvm::Value* lhs_op, llvm::Value* rhs_op,
+                       llvm::Value* addend_op, llvm::Value* result_op) {
           ColumnMajorMatrixVectorProductEmitter emitter(
-              primitive_type, tile_rows, tile_cols, m, k, lhs_op, rhs_op,
-              addend_op, result_op, ir_builder_);
+              config, lhs_op, rhs_op, addend_op, result_op, ir_builder_);
           emitter.Emit();
         });
   } else {
     VLOG(2) << "Emitting row major matrix-vector multiply with m = " << m
             << " and k = " << k;
-    int64 tile_rows = tiling_factor;
-    int64 tile_cols = vector_register_element_size;
-
-    string kernel_name = tensorflow::strings::StrCat(
-        "row_major_gemv_", PrimitiveType_Name(primitive_type), "_", tile_rows,
-        "_", tile_cols, "_", m, "_", k, addend_array_ ? "_with_addend" : "");
+    RowMajorMatrixVectorProductEmitter::Config config(
+        /*scalar_type=*/primitive_type,
+        /*tile_rows=*/tiling_factor, /*tile_cols=*/vector_register_element_size,
+        /*m=*/m, /*k=*/k, /*has_addend=*/addend_array_ != nullptr);
 
     KernelSupportLibrary::EmitAndCallOutlinedKernel(
         /*enable_fast_math=*/enable_fast_math,
-        /*optimize_for_size=*/optimize_for_size, ir_builder_, kernel_name,
-        lhs_op, rhs_op,
+        /*optimize_for_size=*/optimize_for_size, ir_builder_,
+        config.GetCacheKey(), lhs_op, rhs_op,
         addend_array_ ? addend_array_->GetBasePointer() : nullptr, result_op,
-        [this, tile_rows, tile_cols, m, k, primitive_type](
-            llvm::Value* lhs_op, llvm::Value* rhs_op, llvm::Value* addend_op,
-            llvm::Value* result_op) {
+        [this, config](llvm::Value* lhs_op, llvm::Value* rhs_op,
+                       llvm::Value* addend_op, llvm::Value* result_op) {
           RowMajorMatrixVectorProductEmitter emitter(
-              primitive_type, tile_rows, tile_cols, m, k, lhs_op, rhs_op,
-              addend_op, result_op, ir_builder_);
+              config, lhs_op, rhs_op, addend_op, result_op, ir_builder_);
           emitter.Emit();
         });
   }
@@ -1074,8 +1301,11 @@ Status DotOpEmitter::Emit() {
   // from messing up the vectorization.
   std::unique_ptr<llvm_ir::ForLoop> reduction_loop = loop_nest.AddLoop(
       0, lhs_shape.dimensions(lhs_reduction_dimension), "reduction",
-      /*prevent_unrolling=*/lhs_reduction_along_minor_dimension &&
-          rhs_reduction_along_minor_dimension);
+      /*unroll_mode=*/
+      (lhs_reduction_along_minor_dimension &&
+       rhs_reduction_along_minor_dimension)
+          ? xla::llvm_ir::UnrollMode::kNoUnroll
+          : xla::llvm_ir::UnrollMode::kDefaultUnroll);
 
   // The final entry in the rhs and lhs indexes is the indvar of the
   // reduction loop.
diff --git a/tensorflow/compiler/xla/service/cpu/dot_op_emitter.h b/tensorflow/compiler/xla/service/cpu/dot_op_emitter.h
index d88ccea0dbc845c0d9a580a5b118c57c888fb557..ed2a18976a0f1a88e7bb4632d3a63167d5c146ad 100644
--- a/tensorflow/compiler/xla/service/cpu/dot_op_emitter.h
+++ b/tensorflow/compiler/xla/service/cpu/dot_op_emitter.h
@@ -143,6 +143,17 @@ class DotOpEmitter {
         .value_or(kDefaultTilingFactor);
   }
 
+  std::tuple<int64, int64, int64> GetGemmTileSize() const {
+    // Tuned for broadwell - Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
+    //
+    // TODO(b/80093688): Tune for other architectures and centralize this
+    // information in one place.
+    const std::tuple<int64, int64, int64> kDefaultTileSize =
+        std::tuple<int64, int64, int64>(11, 9, 1);
+    return options::LlvmIrGemmTileSize(hlo_module_config_)
+        .value_or(kDefaultTileSize);
+  }
+
   // Returns true if we should use an experimental implementation of GEMM
   // (general matrix matrix multiplication) if possible.
   bool EnableExperimentalLlvmIrGemm() const {
diff --git a/tensorflow/compiler/xla/service/cpu/ir_emission_utils_test.cc b/tensorflow/compiler/xla/service/cpu/ir_emission_utils_test.cc
index abb2471e6ae6b2f2949ab2e91235e5047ae404f8..530ebce854fedf4e4db12139d5b56087b1176a6c 100644
--- a/tensorflow/compiler/xla/service/cpu/ir_emission_utils_test.cc
+++ b/tensorflow/compiler/xla/service/cpu/ir_emission_utils_test.cc
@@ -16,8 +16,8 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/cpu/ir_emission_utils.h"
 
 #include "tensorflow/compiler/xla/service/cpu/target_machine_features_fake.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/test.h"
-#include "tensorflow/compiler/xla/tools/parser/hlo_parser.h"
 
 namespace xla {
 namespace {
@@ -35,7 +35,7 @@ ENTRY Conv {
 }
 )";
   TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
-                          tools::Parse(hlo_string));
+                          ParseHloString(hlo_string));
 
   HloComputation* entry_computation = module->entry_computation();
 
diff --git a/tensorflow/compiler/xla/service/cpu/ir_emitter.cc b/tensorflow/compiler/xla/service/cpu/ir_emitter.cc
index 13bd5e73db500e20b0e8c33bf921ee2457e126e5..59223fddac2f5f7e2e85de4d37e4b6c5760ae697 100644
--- a/tensorflow/compiler/xla/service/cpu/ir_emitter.cc
+++ b/tensorflow/compiler/xla/service/cpu/ir_emitter.cc
@@ -160,39 +160,44 @@ Status IrEmitter::HandleBitcast(HloInstruction* bitcast) {
   return Status::OK();
 }
 
-llvm::GlobalVariable* IrEmitter::EmitGlobalForLiteral(const Literal& literal) {
-  llvm::GlobalVariable* result;
+llvm::Constant* IrEmitter::EmitGlobalForLiteral(const Literal& literal) {
+  llvm::Constant* result;
 
   // We avoid creating large constants in the LLVM IR since LLVM is not
   // efficient for large constant arrays.  We still emit "small enough" constant
   // arrays into the Ir, in the off chance the LLVM optimizer can do something
   // interesting with it.
+  //
+  // TODO(b/29904935): Remove the large constant pool.
   const int kMaxInternalConstantSizeInBytes = 128;
   if (external_constant_pool_ &&
       ByteSizeOf(literal.shape()) >= kMaxInternalConstantSizeInBytes) {
     string global_name = tensorflow::strings::StrCat(
         "constant_global_", external_global_constant_counter_++);
-    result = new llvm::GlobalVariable(
+    llvm::GlobalVariable* result_global = new llvm::GlobalVariable(
         /*Module=*/*module_,
         /*Type=*/IrShapeType(literal.shape()),
         /*isConstant=*/true,
         /*Linkage=*/llvm::GlobalValue::ExternalLinkage,
         /*Initializer=*/nullptr,
         /*Name=*/AsStringRef(global_name));
-    result->setAlignment(MinimumAlignmentForShape(literal.shape()));
+    result_global->setAlignment(MinimumAlignmentForShape(literal.shape()));
     external_constant_pool_->Insert(global_name, literal,
                                     MinimumAlignmentForShape(literal.shape()));
+    result = result_global;
   } else {
     llvm::Constant* initializer =
         llvm_ir::ConvertLiteralToIrConstant(literal, module_);
-    result = new llvm::GlobalVariable(
+    llvm::GlobalVariable* result_global = new llvm::GlobalVariable(
         /*Module=*/*module_,
         /*Type=*/initializer->getType(),
         /*isConstant=*/true,
         /*Linkage=*/llvm::GlobalValue::PrivateLinkage,
         /*Initializer=*/initializer,
         /*Name=*/"");
-    result->setAlignment(MinimumAlignmentForShape(literal.shape()));
+    result_global->setAlignment(MinimumAlignmentForShape(literal.shape()));
+    result = llvm::ConstantExpr::getBitCast(
+        result_global, IrShapeType(literal.shape())->getPointerTo());
   }
   return result;
 }
@@ -200,7 +205,7 @@ llvm::GlobalVariable* IrEmitter::EmitGlobalForLiteral(const Literal& literal) {
 Status IrEmitter::HandleConstant(HloInstruction* constant) {
   VLOG(2) << "HandleConstant: " << constant->ToString();
   const Literal& literal = constant->literal();
-  llvm::GlobalVariable* global_for_const;
+  llvm::Constant* global_for_const;
 
   auto it = emitted_literals_.find(&literal);
   if (it != emitted_literals_.end()) {
diff --git a/tensorflow/compiler/xla/service/cpu/ir_emitter.h b/tensorflow/compiler/xla/service/cpu/ir_emitter.h
index f49cfc1dc378bb80da3ddf995363acfa2081067b..32c536e18fee86cc60067ba3b25ab1eb0e4233df 100644
--- a/tensorflow/compiler/xla/service/cpu/ir_emitter.h
+++ b/tensorflow/compiler/xla/service/cpu/ir_emitter.h
@@ -527,7 +527,8 @@ class IrEmitter : public DfsHloVisitorWithDefault {
   Status EmitXfeedTransfer(XfeedKind kind, const Shape& shape,
                            llvm::Value* program_buffer_address);
 
-  llvm::GlobalVariable* EmitGlobalForLiteral(const Literal& literal);
+  // Returns a ConstExpr bitcast.
+  llvm::Constant* EmitGlobalForLiteral(const Literal& literal);
 
   const HloModuleConfig& hlo_module_config_;
 
@@ -548,7 +549,7 @@ class IrEmitter : public DfsHloVisitorWithDefault {
     }
   };
 
-  tensorflow::gtl::FlatMap<const Literal*, llvm::GlobalVariable*,
+  tensorflow::gtl::FlatMap<const Literal*, llvm::Constant*,
                            LiteralPtrHashFunctor, LiteralPtrEqualityFunctor>
       emitted_literals_;
 
diff --git a/tensorflow/compiler/xla/service/cpu/parallel_task_assignment.cc b/tensorflow/compiler/xla/service/cpu/parallel_task_assignment.cc
index 63d0f7b95c7e45913c707471dbe2dc62e05251d6..4fa5984b0466b178a587e97cbced97deac749f74 100644
--- a/tensorflow/compiler/xla/service/cpu/parallel_task_assignment.cc
+++ b/tensorflow/compiler/xla/service/cpu/parallel_task_assignment.cc
@@ -38,7 +38,7 @@ class SimpleCostModel : public ParallelCostModel {
     const int64 min_cost_per_thread = 256LL << 10;  // 256KB L2 Cache size.
     // Return target parallel task count in [1, max_parallelism_].
     return std::min(max_parallelism_,
-                    std::max(1LL, instruction_cost / min_cost_per_thread));
+                    std::max(int64{1}, instruction_cost / min_cost_per_thread));
   }
 
  private:
@@ -63,7 +63,7 @@ class DefaultCostModel : public ParallelCostModel {
     int64 max_parallelism;
     // Calculate flops-to-bytes-ratio for 'instruction'.
     const int64 bytes_accessed =
-        std::max(1LL, cost_analysis_->bytes_accessed(*instruction));
+        std::max(int64{1}, cost_analysis_->bytes_accessed(*instruction));
     const float flops_to_bytes_ratio =
         cost_analysis_->flop_count(*instruction) /
         static_cast<float>(bytes_accessed);
@@ -93,7 +93,7 @@ class DefaultCostModel : public ParallelCostModel {
     }
     // Return target parallel task count in [1, max_parallelism_].
     return std::min(max_parallelism,
-                    std::max(1LL, instruction_cost / min_cost_per_thread));
+                    std::max(int64{1}, instruction_cost / min_cost_per_thread));
   }
 
  private:
diff --git a/tensorflow/compiler/xla/service/cpu/runtime_matmul_mkl.cc b/tensorflow/compiler/xla/service/cpu/runtime_matmul_mkl.cc
index 92da5f71c23d5e1450b39ea8b7bb8345f6fabb3b..f8c8dd5e93d53db8d87be0208b5cf4daac3464f1 100644
--- a/tensorflow/compiler/xla/service/cpu/runtime_matmul_mkl.cc
+++ b/tensorflow/compiler/xla/service/cpu/runtime_matmul_mkl.cc
@@ -13,7 +13,7 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#ifdef INTEL_MKL
+#if defined(INTEL_MKL) && !defined(DO_NOT_USE_ML)
 #include "tensorflow/compiler/xla/service/cpu/runtime_matmul_mkl.h"
 #include "third_party/intel_mkl_ml/include/mkl_cblas.h"
 #include "third_party/intel_mkl_ml/include/mkl_service.h"
diff --git a/tensorflow/compiler/xla/service/cpu/shape_partition.cc b/tensorflow/compiler/xla/service/cpu/shape_partition.cc
index 42fe955f1917e0268dc739e44fbd0a7afb39185c..d12c5396148d32adb178b955a34e050cc56784da 100644
--- a/tensorflow/compiler/xla/service/cpu/shape_partition.cc
+++ b/tensorflow/compiler/xla/service/cpu/shape_partition.cc
@@ -115,7 +115,7 @@ ShapePartitionIterator::ShapePartitionIterator(
   for (int i = 0; i < dimension_partition_sizes_.size(); ++i) {
     const int64 dim_size = shape_.dimensions(dimensions_[i]);
     dimension_partition_sizes_[i] =
-        std::max(1LL, dim_size / dimension_partition_counts_[i]);
+        std::max(int64{1}, dim_size / dimension_partition_counts_[i]);
   }
 
   // Calculate the partition strides for each dimension.
diff --git a/tensorflow/compiler/xla/service/cpu/tests/BUILD b/tensorflow/compiler/xla/service/cpu/tests/BUILD
index 67f776e7b5883f425b41c05342b74bebe223e17f..66ae5ef0f66e90982102d73e474f5d0582f5415c 100644
--- a/tensorflow/compiler/xla/service/cpu/tests/BUILD
+++ b/tensorflow/compiler/xla/service/cpu/tests/BUILD
@@ -152,9 +152,9 @@ tf_cc_test(
     srcs = ["cpu_literal_caching_test.cc"],
     deps = [
         "//tensorflow/compiler/xla/service:hlo",
+        "//tensorflow/compiler/xla/service:hlo_parser",
         "//tensorflow/compiler/xla/service/cpu:cpu_compiler",
         "//tensorflow/compiler/xla/service/cpu/tests:cpu_codegen_test",
-        "//tensorflow/compiler/xla/tools/parser:hlo_parser",
         "//tensorflow/core:lib",
         "//tensorflow/core:test",
         "//tensorflow/core:test_main",
@@ -166,9 +166,9 @@ tf_cc_test(
     srcs = ["cpu_outfeed_test.cc"],
     deps = [
         "//tensorflow/compiler/xla/service:hlo",
+        "//tensorflow/compiler/xla/service:hlo_parser",
         "//tensorflow/compiler/xla/service/cpu:cpu_compiler",
         "//tensorflow/compiler/xla/service/cpu/tests:cpu_codegen_test",
-        "//tensorflow/compiler/xla/tools/parser:hlo_parser",
         "//tensorflow/core:lib",
         "//tensorflow/core:test",
         "//tensorflow/core:test_main",
diff --git a/tensorflow/compiler/xla/service/cpu/tests/cpu_external_constants_test.cc b/tensorflow/compiler/xla/service/cpu/tests/cpu_external_constants_test.cc
index ed8f375bd6186e4805fe9ded5be9ae7c9f4d5c84..faac927027c48e44eb8ff1fcc4109fbc177fc579 100644
--- a/tensorflow/compiler/xla/service/cpu/tests/cpu_external_constants_test.cc
+++ b/tensorflow/compiler/xla/service/cpu/tests/cpu_external_constants_test.cc
@@ -64,8 +64,8 @@ TEST_F(CpuExternalConstantsTest, BasicNegative) {
   // The constant array in this test case is small enough that there is no need
   // to externalize it.
   TestWithArray(/*rows=*/4, /*cols=*/4, R"(
-CHECK-NOT: @constant_global_0 = external constant [4 x [4 x float]], align 8
-CHECK: @0 = private constant [4 x [4 x float]] {{.*}}, align 8
+CHECK-NOT: @constant_global_0 = external constant [16 x float], align 8
+CHECK: @0 = private constant [16 x float] {{.*}}, align 8
 )");
 }
 }  // namespace
diff --git a/tensorflow/compiler/xla/service/cpu/tests/cpu_literal_caching_test.cc b/tensorflow/compiler/xla/service/cpu/tests/cpu_literal_caching_test.cc
index d6e0425c5542be89835571f0103b1829f63cc2c2..27044b1d62027e3b83744c486cb790269e505aff 100644
--- a/tensorflow/compiler/xla/service/cpu/tests/cpu_literal_caching_test.cc
+++ b/tensorflow/compiler/xla/service/cpu/tests/cpu_literal_caching_test.cc
@@ -15,7 +15,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/service/cpu/cpu_compiler.h"
 #include "tensorflow/compiler/xla/service/cpu/tests/cpu_codegen_test.h"
-#include "tensorflow/compiler/xla/tools/parser/hlo_parser.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
 
 namespace xla {
 namespace cpu {
@@ -55,12 +55,12 @@ ENTRY main {
 )";
 
   string filecheck_pattern = R"(
-CHECK: private constant [2 x [3 x [2 x float]]]
-CHECK-NOT: private constant [2 x [3 x [2 x float]]]
+CHECK: private constant [12 x float]
+CHECK-NOT: private constant [12 x float]
 )";
 
   TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
-                          tools::Parse(hlo_text));
+                          ParseHloString(hlo_text));
 
   CpuAotCompilationOptions options{
       /*triple=*/"x86_64-pc-linux", /*cpu_name=*/"", /*features=*/"",
@@ -78,34 +78,34 @@ TEST_F(CpuDuplicateConstantsTest, RepeatedTupleConstants) {
 HloModule RepeatedConstants
 
 while_body {
-  arg_body = (f32[2,1]{1,0}, f32[2]{0}) parameter(0)
-  ROOT const = (f32[2,1]{1,0}, f32[2]{0}) constant((f32[2,1], f32[2]) ( f32[2,1] { { 1 }, { 2 } }, {2, 42} ))
+  arg_body = (f32[2,1]{1,0}, f32[1]{0}) parameter(0)
+  ROOT const = (f32[2,1]{1,0}, f32[1]{0}) constant((f32[2,1], f32[1]) ( f32[2,1] { { 1 }, { 2 } }, {2} ))
 }
 
 while_cond {
-  arg_cond = (f32[2,1]{1,0}, f32[2]{0}) parameter(0)
+  arg_cond = (f32[2,1]{1,0}, f32[1]{0}) parameter(0)
   ROOT unknown = pred[] infeed()
 }
 
 ENTRY main {
   param = f32[2,3,2] parameter(0)
-  const_a = (f32[2,1]{1,0}, f32[2]{0}) constant((f32[2,1], f32[2]) ( f32[2,1] { { 1 }, { 2 } }, {2, 42} ))
-  const_b = (f32[2,1]{1,0}, f32[2]{0}) while((f32[2,1]{1,0}, f32[2]{0}) const_a), condition=while_cond, body=while_body
+  const_a = (f32[2,1]{1,0}, f32[1]{0}) constant((f32[2,1], f32[1]) ( f32[2,1] { { 1 }, { 2 } }, {2} ))
+  const_b = (f32[2,1]{1,0}, f32[1]{0}) while((f32[2,1]{1,0}, f32[1]{0}) const_a), condition=while_cond, body=while_body
 
-  out0 = () outfeed((f32[2,1]{1,0}, f32[2]{0}) const_a)
-  ROOT out1 = () outfeed((f32[2,1]{1,0}, f32[2]{0}) const_b)
+  out0 = () outfeed((f32[2,1]{1,0}, f32[1]{0}) const_a)
+  ROOT out1 = () outfeed((f32[2,1]{1,0}, f32[1]{0}) const_b)
 }
 )";
 
   string filecheck_pattern = R"(
+CHECK: private constant [1 x float]
 CHECK: private constant [2 x float]
-CHECK: private constant [2 x [1 x float]]
+CHECK-NOT: private constant [1 x float]
 CHECK-NOT: private constant [2 x float]
-CHECK-NOT: private constant [2 x [1 x float]]
 )";
 
   TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
-                          tools::Parse(hlo_text));
+                          ParseHloString(hlo_text));
 
   CpuAotCompilationOptions options{
       /*triple=*/"x86_64-pc-linux", /*cpu_name=*/"", /*features=*/"",
diff --git a/tensorflow/compiler/xla/service/cpu/tests/cpu_outfeed_test.cc b/tensorflow/compiler/xla/service/cpu/tests/cpu_outfeed_test.cc
index 879372eb13884cdb7edd8cfb3e8b4bac4e314951..1ee279290b6fcfe775ce9867d424b1c031f5d2bd 100644
--- a/tensorflow/compiler/xla/service/cpu/tests/cpu_outfeed_test.cc
+++ b/tensorflow/compiler/xla/service/cpu/tests/cpu_outfeed_test.cc
@@ -15,7 +15,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/service/cpu/cpu_compiler.h"
 #include "tensorflow/compiler/xla/service/cpu/tests/cpu_codegen_test.h"
-#include "tensorflow/compiler/xla/tools/parser/hlo_parser.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
 
 namespace xla {
 namespace cpu {
@@ -37,11 +37,11 @@ ENTRY main {
 )";
 
   string filecheck_pattern = R"(
-CHECK: private constant [2 x [3 x [2 x float]]]
+CHECK: private constant [12 x float]
 )";
 
   TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
-                          tools::Parse(hlo_text));
+                          ParseHloString(hlo_text));
 
   CpuAotCompilationOptions options{
       /*triple=*/"x86_64-pc-linux", /*cpu_name=*/"", /*features=*/"",
diff --git a/tensorflow/compiler/xla/service/cpu/vector_support_library.cc b/tensorflow/compiler/xla/service/cpu/vector_support_library.cc
index cd1165e23812861ba9951546b7dd744529232196..c444d151858d3a152a01b99657ffae89ebc6b487 100644
--- a/tensorflow/compiler/xla/service/cpu/vector_support_library.cc
+++ b/tensorflow/compiler/xla/service/cpu/vector_support_library.cc
@@ -427,5 +427,27 @@ llvm::Value* LlvmVariable::Get() const {
 void LlvmVariable::Set(llvm::Value* new_value) {
   ir_builder_->CreateStore(new_value, alloca_);
 }
+
+TileVariable::TileVariable(VectorSupportLibrary* vector_support,
+                           std::vector<llvm::Value*> initial_value) {
+  for (llvm::Value* initial_vector_value : initial_value) {
+    storage_.emplace_back(vector_support, initial_vector_value);
+  }
+}
+
+std::vector<llvm::Value*> TileVariable::Get() const {
+  std::vector<llvm::Value*> result;
+  c_transform(storage_, std::back_inserter(result),
+              [&](VectorVariable vect_var) { return vect_var.Get(); });
+  return result;
+}
+
+void TileVariable::Set(tensorflow::gtl::ArraySlice<llvm::Value*> value) {
+  CHECK_EQ(value.size(), storage_.size());
+  for (int64 i = 0, e = value.size(); i < e; i++) {
+    storage_[i].Set(value[i]);
+  }
+}
+
 }  // namespace cpu
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/cpu/vector_support_library.h b/tensorflow/compiler/xla/service/cpu/vector_support_library.h
index edcaec584997b17dce30b8c46fda4abc78441064..49c2a4e2f4bae9e1672b7d2fe891301bce08bd4b 100644
--- a/tensorflow/compiler/xla/service/cpu/vector_support_library.h
+++ b/tensorflow/compiler/xla/service/cpu/vector_support_library.h
@@ -23,6 +23,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/primitive_util.h"
 #include "tensorflow/compiler/xla/types.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
+#include "tensorflow/core/lib/gtl/array_slice.h"
 
 namespace xla {
 namespace cpu {
@@ -317,6 +318,21 @@ class ScalarVariable : public LlvmVariable {
     Set(initial_value);
   }
 };
+
+// This wraps a set of alloca-backed stack variables that can, as a whole, store
+// a tile.  A "tile" is a sequence of vectors that is typically used as a 2D
+// grid of scalar values (e.g. for tiled GEMMs).
+class TileVariable {
+ public:
+  TileVariable(VectorSupportLibrary* vector_support,
+               std::vector<llvm::Value*> initial_value);
+
+  std::vector<llvm::Value*> Get() const;
+  void Set(tensorflow::gtl::ArraySlice<llvm::Value*> value);
+
+ private:
+  std::vector<VectorVariable> storage_;
+};
 }  // namespace cpu
 }  // namespace xla
 
diff --git a/tensorflow/compiler/xla/service/dfs_hlo_visitor.h b/tensorflow/compiler/xla/service/dfs_hlo_visitor.h
index b9d7ec9c2e17e560580fcea060bf552c42fe3b3c..ee2b455730f8f520db6652f0352f8a96291cac73 100644
--- a/tensorflow/compiler/xla/service/dfs_hlo_visitor.h
+++ b/tensorflow/compiler/xla/service/dfs_hlo_visitor.h
@@ -197,6 +197,10 @@ class DfsHloVisitorBase {
     return HandleElementwiseUnary(hlo);
   }
 
+  virtual Status HandleDomain(HloInstructionPtr hlo) {
+    return HandleElementwiseUnary(hlo);
+  }
+
   virtual Status HandleInfeed(HloInstructionPtr hlo) = 0;
   virtual Status HandleOutfeed(HloInstructionPtr hlo) = 0;
   virtual Status HandleHostCompute(HloInstructionPtr hlo) = 0;
@@ -239,6 +243,8 @@ class DfsHloVisitorBase {
 
   virtual Status HandleBatchNormGrad(HloInstructionPtr hlo) = 0;
 
+  virtual Status HandleGenerateToken(HloInstructionPtr token) = 0;
+
   // Invoked to inform the visitor that the traversal has completed, and that
   // the root was "root".
   virtual Status FinishVisit(HloInstructionPtr root) = 0;
diff --git a/tensorflow/compiler/xla/service/dfs_hlo_visitor_with_default.h b/tensorflow/compiler/xla/service/dfs_hlo_visitor_with_default.h
index 240faebe62f5cee4f61b3c36b5e8f653cfd6db8e..6934e00a4b665e9e6a4302e0c0a8ce1d5bb94373 100644
--- a/tensorflow/compiler/xla/service/dfs_hlo_visitor_with_default.h
+++ b/tensorflow/compiler/xla/service/dfs_hlo_visitor_with_default.h
@@ -188,6 +188,9 @@ class DfsHloVisitorWithDefaultBase
   Status HandleGather(HloInstructionPtr gather) override {
     return DefaultAction(gather);
   }
+  Status HandleGenerateToken(HloInstructionPtr token) override {
+    return DefaultAction(token);
+  }
 
   // Invoked to inform the visitor that the traversal has completed, and that
   // the root was "root".
diff --git a/tensorflow/compiler/xla/service/elemental_ir_emitter.cc b/tensorflow/compiler/xla/service/elemental_ir_emitter.cc
index 9a8bab353ef6b1e0b05b250d35296bc3cef8bc37..93fea7ead7a86bb34c449668fd88a58145681eb1 100644
--- a/tensorflow/compiler/xla/service/elemental_ir_emitter.cc
+++ b/tensorflow/compiler/xla/service/elemental_ir_emitter.cc
@@ -456,17 +456,15 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitFloatUnaryOp(
                                     llvm::ConstantFP::get(type, 1.0)));
     }
     case HloOpcode::kIsFinite: {
-      // (x == x) && abs(x) != inf
+      // abs(x) o!= inf, this works because the comparison returns false if
+      // either operand is NaN.
       auto type = operand_value->getType();
-      auto equal_self =
-          ir_builder_->CreateFCmpOEQ(operand_value, operand_value);
       auto abs_value = llvm_ir::EmitCallToIntrinsic(
           llvm::Intrinsic::fabs, {operand_value}, {type}, ir_builder_);
       auto infinity = llvm::ConstantFP::getInfinity(type);
       auto not_infinite = ir_builder_->CreateFCmpONE(abs_value, infinity);
-      auto result_i1 = ir_builder_->CreateAnd(equal_self, not_infinite);
       return ir_builder_->CreateZExt(
-          result_i1, llvm_ir::PrimitiveTypeToIrType(PRED, module_));
+          not_infinite, llvm_ir::PrimitiveTypeToIrType(PRED, module_));
     }
     case HloOpcode::kNegate:
       return ir_builder_->CreateFNeg(operand_value);
diff --git a/tensorflow/compiler/xla/service/elemental_ir_emitter_test.cc b/tensorflow/compiler/xla/service/elemental_ir_emitter_test.cc
index b43dc0c65d9b6e7c05e06010ba2ff2eb27392295..8980d4303353a132ada2b3c685b4f2856c33c6a1 100644
--- a/tensorflow/compiler/xla/service/elemental_ir_emitter_test.cc
+++ b/tensorflow/compiler/xla/service/elemental_ir_emitter_test.cc
@@ -14,12 +14,12 @@ limitations under the License.
 ==============================================================================*/
 
 #include "tensorflow/compiler/xla/execution_options_util.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/status_macros.h"
 #include "tensorflow/compiler/xla/test.h"
 #include "tensorflow/compiler/xla/tests/client_library_test_base.h"
 #include "tensorflow/compiler/xla/tests/hlo_test_base.h"
 #include "tensorflow/compiler/xla/tests/test_macros.h"
-#include "tensorflow/compiler/xla/tools/parser/hlo_parser.h"
 
 namespace xla {
 namespace {
@@ -33,7 +33,7 @@ class ElementalIrEmitterExecutionTest : public HloTestBase {
     HloModuleConfig config;
     config.set_debug_options(GetDebugOptionsForTest());
     TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
-                            tools::Parse(hlo_text, config));
+                            ParseHloString(hlo_text, config));
     EXPECT_TRUE(RunAndCompareNoHloPasses(std::move(module), args, nullopt));
   }
 };
diff --git a/tensorflow/compiler/xla/service/executable.cc b/tensorflow/compiler/xla/service/executable.cc
index 8119478ce934da06969024905e5e054e0b509b03..6df172db8e541c5cef7aab04f3d8611fc735e8b0 100644
--- a/tensorflow/compiler/xla/service/executable.cc
+++ b/tensorflow/compiler/xla/service/executable.cc
@@ -129,20 +129,6 @@ StatusOr<ScopedShapedBuffer> Executable::ExecuteOnStreamWrapper(
   return return_value;
 }
 
-Status Executable::DumpSessionModule() {
-  TF_RET_CHECK(dumping());
-  const string& directory_path =
-      module_config().debug_options().xla_dump_executions_to();
-  VersionedComputationHandle versioned_handle = entry_computation_handle();
-  // This filename does not include the version number because the computation
-  // is only ever executed at one version.
-  string filename = tensorflow::strings::Printf(
-      "computation_%lld__%s__execution_%lld", versioned_handle.handle.handle(),
-      session_module_->entry().name().c_str(), ++execution_count_);
-  return Executable::DumpToDirectory(directory_path, filename,
-                                     *session_module_);
-}
-
 Status Executable::DumpHloSnapshot() {
   TF_RET_CHECK(dumping_snapshot());
   TF_RET_CHECK(hlo_snapshot_->has_hlo() &&
@@ -156,26 +142,6 @@ Status Executable::DumpHloSnapshot() {
   return Executable::DumpToDirectory(directory_path, filename, *hlo_snapshot_);
 }
 
-/* static */ Status Executable::DumpToDirectory(
-    const string& directory_path, string filename,
-    const SessionModule& session_module) {
-  tensorflow::Env* env = tensorflow::Env::Default();
-  if (!env->IsDirectory(directory_path).ok()) {
-    // NB! CreateDir does not work reliably with multiple XLA threads -- two
-    // threads can race to observe the absence of the dump directory and
-    // simultaneously try to create it, causing the "losing" thread to get a
-    // "directory already exists" error.
-    TF_RETURN_IF_ERROR(env->RecursivelyCreateDir(directory_path));
-  }
-  filename = SanitizeFileName(std::move(filename));
-  string file_path = tensorflow::io::JoinPath(directory_path, filename);
-  string result;
-  TF_RET_CHECK(
-      tensorflow::SerializeToStringDeterministic(session_module, &result));
-  return tensorflow::WriteStringToFile(tensorflow::Env::Default(), file_path,
-                                       result);
-}
-
 /* static */ Status Executable::DumpToDirectory(
     const string& directory_path, string filename,
     const HloSnapshot& hlo_session) {
diff --git a/tensorflow/compiler/xla/service/executable.h b/tensorflow/compiler/xla/service/executable.h
index 4f0466c544738fa1ec4602ee5104daee8d969c83..dc1f26ea65cc707d4f0522af2aa3ec40621632f1 100644
--- a/tensorflow/compiler/xla/service/executable.h
+++ b/tensorflow/compiler/xla/service/executable.h
@@ -27,9 +27,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/hlo_graph_dumper.h"
 #include "tensorflow/compiler/xla/service/hlo_module.h"
 #include "tensorflow/compiler/xla/service/service_executable_run_options.h"
-#include "tensorflow/compiler/xla/service/session.pb.h"
 #include "tensorflow/compiler/xla/service/shaped_buffer.h"
-#include "tensorflow/compiler/xla/service/versioned_computation_handle.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/util.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
@@ -132,26 +130,12 @@ class Executable {
 
   const HloModuleConfig& module_config() const { return hlo_module_->config(); }
 
-  // Returns the versioned computation handle of the computation computed by
-  // this executable.
-  const VersionedComputationHandle& entry_computation_handle() const {
-    return hlo_module_->entry_computation_handle();
-  }
-
   // The shape (including layout) that results from this execution. This is the
   // shape of the DeviceMemoryBase result value in ExecuteOnStream above.
   const Shape& host_result_shape() const {
     return hlo_module_->config().host_entry_computation_layout().result_shape();
   }
 
-  // TODO(b/74197823): Delete the session module dumping helpers.
-  void set_session_module(std::unique_ptr<xla::SessionModule> session_module) {
-    session_module_ = std::move(session_module);
-  }
-  bool dumping() const { return session_module_ != nullptr; }
-  SessionModule* session_module() const { return session_module_.get(); }
-  Status DumpSessionModule();
-
   // Dumping helpers.
   void set_hlo_snapshot(std::unique_ptr<xla::HloSnapshot> hlo_snapshot) {
     hlo_snapshot_ = std::move(hlo_snapshot);
@@ -160,10 +144,6 @@ class Executable {
   HloSnapshot* hlo_snapshot() const { return hlo_snapshot_.get(); }
   Status DumpHloSnapshot();
 
-  // Dump session_module to directory_path/filename.
-  static Status DumpToDirectory(const string& directory_path, string filename,
-                                const SessionModule& session_module);
-
   // Dump hlo snapshot to directory_path/filename.
   static Status DumpToDirectory(const string& directory_path, string filename,
                                 const HloSnapshot& hlo_session);
@@ -179,9 +159,6 @@ class Executable {
   // around.
   const std::unique_ptr<const HloModule> hlo_module_;
 
-  // SessionModule this was compiled from. Null if not dumping executions.
-  std::unique_ptr<SessionModule> session_module_;
-
   // HloSnapshot this was compiled from. Null if not dumping executions.
   std::unique_ptr<HloSnapshot> hlo_snapshot_;
 
diff --git a/tensorflow/compiler/xla/tools/parser/README.md b/tensorflow/compiler/xla/service/g3doc/hlo_parser.md
similarity index 100%
rename from tensorflow/compiler/xla/tools/parser/README.md
rename to tensorflow/compiler/xla/service/g3doc/hlo_parser.md
diff --git a/tensorflow/compiler/xla/service/gather_expander_test.cc b/tensorflow/compiler/xla/service/gather_expander_test.cc
index 1c72ca066502eb549bf8638cdf0b7827b06f92d7..020ffcd106862cb2641a9f3bceb70acdd969a458 100644
--- a/tensorflow/compiler/xla/service/gather_expander_test.cc
+++ b/tensorflow/compiler/xla/service/gather_expander_test.cc
@@ -14,9 +14,9 @@ limitations under the License.
 ==============================================================================*/
 
 #include "tensorflow/compiler/xla/service/gather_expander.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/test.h"
 #include "tensorflow/compiler/xla/tests/test_macros.h"
-#include "tensorflow/compiler/xla/tools/parser/hlo_parser.h"
 
 namespace xla {
 namespace {
@@ -36,7 +36,7 @@ ENTRY main {
 }
 )";
   TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
-                          tools::Parse(hlo_text));
+                          ParseHloString(hlo_text));
 
   Status status = GatherExpander{}.Run(module.get()).status();
   EXPECT_EQ(status.code(), tensorflow::error::UNIMPLEMENTED);
@@ -63,7 +63,7 @@ ENTRY main {
 }
 )";
   TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
-                          tools::Parse(hlo_text));
+                          ParseHloString(hlo_text));
   TF_ASSERT_OK_AND_ASSIGN(bool changed, GatherExpander{}.Run(module.get()));
   ASSERT_TRUE(changed);
 
diff --git a/tensorflow/compiler/xla/service/gpu/BUILD b/tensorflow/compiler/xla/service/gpu/BUILD
index 4012f87f2bf69d1ab056da5d6c750441c7404980..5e02631a5856a45762027d246144d6f6dd913c26 100644
--- a/tensorflow/compiler/xla/service/gpu/BUILD
+++ b/tensorflow/compiler/xla/service/gpu/BUILD
@@ -1,6 +1,8 @@
 # Description:
 #   GPU-specific components in XLA service implementation.
 
+load("//tensorflow/compiler/xla:xla.bzl", "xla_proto_library")
+
 licenses(["notice"])  # Apache 2.0
 
 package(default_visibility = [":friends"])
@@ -23,6 +25,11 @@ filegroup(
 
 load("//tensorflow:tensorflow.bzl", "tf_cc_test")
 
+xla_proto_library(
+    name = "backend_configs",
+    srcs = ["backend_configs.proto"],
+)
+
 cc_library(
     name = "gpu_constants",
     srcs = ["gpu_constants.cc"],
@@ -133,6 +140,7 @@ cc_library(
         "ir_emitter_unnested.h",
     ],
     deps = [
+        ":backend_configs",
         ":cudnn_convolution_runner",
         ":elemental_ir_emitter",
         ":gpu_constants",
@@ -156,6 +164,7 @@ cc_library(
         "//tensorflow/compiler/xla/service:name_uniquer",
         "//tensorflow/compiler/xla/service/llvm_ir:fused_ir_emitter",
         "//tensorflow/compiler/xla/service/llvm_ir:ir_array",
+        "//tensorflow/compiler/xla/service/llvm_ir:kernel_support_library",
         "//tensorflow/compiler/xla/service/llvm_ir:llvm_loop",
         "//tensorflow/compiler/xla/service/llvm_ir:llvm_util",
         "//tensorflow/compiler/xla/service/llvm_ir:loop_emitter",
@@ -266,6 +275,7 @@ cc_library(
         "while_thunk.h",
     ],
     deps = [
+        ":backend_configs",
         ":buffer_allocations",
         ":cudnn_convolution_runner",
         ":infeed_manager",
@@ -322,6 +332,7 @@ cc_library(
     srcs = ["cudnn_convolution_algorithm_picker.cc"],
     hdrs = ["cudnn_convolution_algorithm_picker.h"],
     deps = [
+        ":backend_configs",
         ":cudnn_convolution_runner",
         ":gpu_executable",
         ":ir_emission_utils",
@@ -338,6 +349,7 @@ cc_library(
     srcs = ["cudnn_convolution_runner.cc"],
     hdrs = ["cudnn_convolution_runner.h"],
     deps = [
+        ":stream_executor_util",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status",
         "//tensorflow/compiler/xla:status_macros",
@@ -401,10 +413,41 @@ tf_cc_test(
     srcs = ["instruction_fusion_test.cc"],
     deps = [
         ":instruction_fusion",
+        "//tensorflow/compiler/xla:status_macros",
+        "//tensorflow/compiler/xla:util",
+        "//tensorflow/compiler/xla/service:hlo",
         "//tensorflow/compiler/xla/service:hlo_matchers",
+        "//tensorflow/compiler/xla/service:hlo_parser",
         "//tensorflow/compiler/xla/tests:hlo_test_base",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
-        "//tensorflow/compiler/xla/tools/parser:hlo_parser",
+    ],
+)
+
+cc_library(
+    name = "multi_output_fusion",
+    srcs = ["multi_output_fusion.cc"],
+    hdrs = ["multi_output_fusion.h"],
+    deps = [
+        "//tensorflow/compiler/xla:shape_util",
+        "//tensorflow/compiler/xla/service:hlo",
+        "//tensorflow/compiler/xla/service:multi_output_fusion",
+        "//tensorflow/core:lib",
+    ],
+)
+
+tf_cc_test(
+    name = "multi_output_fusion_test",
+    srcs = ["multi_output_fusion_test.cc"],
+    deps = [
+        ":multi_output_fusion",
+        "//tensorflow/compiler/xla:status_macros",
+        "//tensorflow/compiler/xla:util",
+        "//tensorflow/compiler/xla/service:hlo",
+        "//tensorflow/compiler/xla/service:hlo_matchers",
+        "//tensorflow/compiler/xla/service:hlo_parser",
+        "//tensorflow/compiler/xla/tests:hlo_test_base",
+        "//tensorflow/compiler/xla/tests:xla_internal_test_main",
+        "//tensorflow/core:lib",
     ],
 )
 
@@ -446,9 +489,9 @@ tf_cc_test(
         ":instruction_fusion",
         "//tensorflow/compiler/xla:test_helpers",
         "//tensorflow/compiler/xla/service:hlo_matchers",
+        "//tensorflow/compiler/xla/service:hlo_parser",
         "//tensorflow/compiler/xla/tests:hlo_test_base",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
-        "//tensorflow/compiler/xla/tools/parser:hlo_parser",
     ],
 )
 
@@ -508,6 +551,7 @@ cc_library(
         ":instruction_fusion",
         ":ir_emission_utils",
         ":ir_emitter",
+        ":multi_output_fusion",
         ":pad_insertion",
         ":partition_assignment",
         ":stream_assignment",
@@ -542,6 +586,8 @@ cc_library(
         "//tensorflow/compiler/xla/service:reshape_mover",
         "//tensorflow/compiler/xla/service:transpose_folding",
         "//tensorflow/compiler/xla/service:tuple_simplifier",
+        "//tensorflow/compiler/xla/service:while_loop_constant_sinking",
+        "//tensorflow/compiler/xla/service:while_loop_invariant_code_motion",
         "//tensorflow/compiler/xla/service:while_loop_simplifier",
         "//tensorflow/compiler/xla/service:zero_sized_hlo_elimination",
         "//tensorflow/compiler/xla/service/gpu:cudnn_batchnorm_rewriter",
@@ -587,14 +633,18 @@ cc_library(
     srcs = ["gpu_layout_assignment.cc"],
     hdrs = ["gpu_layout_assignment.h"],
     deps = [
+        ":gpu_options",
         ":ir_emission_utils",
+        ":stream_executor_util",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status_macros",
+        "//tensorflow/compiler/xla:window_util",
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/service:computation_layout",
         "//tensorflow/compiler/xla/service:hlo",
         "//tensorflow/compiler/xla/service:layout_assignment",
         "//tensorflow/core:lib",
+        "//tensorflow/core:stream_executor_no_cuda",
     ],
 )
 
@@ -691,6 +741,27 @@ cc_library(
     ],
 )
 
+cc_library(
+    name = "gpu_options",
+    srcs = ["gpu_options.cc"],
+    hdrs = ["gpu_options.h"],
+    deps = [
+        "//tensorflow/compiler/xla/service:hlo_module_config",
+        "//tensorflow/core:lib_internal",
+    ],
+)
+
+cc_library(
+    name = "stream_executor_util",
+    srcs = ["stream_executor_util.cc"],
+    hdrs = ["stream_executor_util.h"],
+    deps = [
+        "//tensorflow/compiler/xla:shape_util",
+        "//tensorflow/compiler/xla:xla_data_proto",
+        "//tensorflow/core:stream_executor_no_cuda",
+    ],
+)
+
 tf_cc_test(
     name = "gpu_hlo_support_checker_test",
     srcs = ["gpu_hlo_support_checker_test.cc"],
diff --git a/tensorflow/compiler/xla/service/gpu/backend_configs.proto b/tensorflow/compiler/xla/service/gpu/backend_configs.proto
new file mode 100644
index 0000000000000000000000000000000000000000..640c6392b8b820c708b853c2a3cea4d4116e85a8
--- /dev/null
+++ b/tensorflow/compiler/xla/service/gpu/backend_configs.proto
@@ -0,0 +1,27 @@
+syntax = "proto3";
+
+package xla.gpu;
+
+// Backend configs for XLA:GPU.
+//
+// These are metadata that the GPU backend attaches to HloInstrucitons and later
+// uses during e.g. codegen.
+//
+// Remember that proto3 doesn't give clients a way to tell the difference
+// between a field not being present and a field having the default value.
+// Choose your defaults carefully.
+//
+// No guarantee is made about the stability of these protos.
+//
+// See HloInstruction::backend_config() for more info.
+
+// Backend config for a convolution that runs through cudnn.
+message CudnnConvBackendConfig {
+  // Opaque algorithm number of cudnn algorithm chosen for this conv.
+  int64 algorithm = 1;
+
+  // Whether we may use tensor cores when running this conv.  Even if this is
+  // true, cudnn may choose not to use tensor cores, e.g. because the GPU or
+  // selected algorithm doesn't support it.
+  bool tensor_ops_enabled = 2;
+}
diff --git a/tensorflow/compiler/xla/service/gpu/cudnn_convolution_algorithm_picker.cc b/tensorflow/compiler/xla/service/gpu/cudnn_convolution_algorithm_picker.cc
index 6a46bdb9b438f81dc564b9033f5d302f90b6a997..3dc98c4c93ea2b9b68dd3ee27794a39847f8756c 100644
--- a/tensorflow/compiler/xla/service/gpu/cudnn_convolution_algorithm_picker.cc
+++ b/tensorflow/compiler/xla/service/gpu/cudnn_convolution_algorithm_picker.cc
@@ -14,6 +14,7 @@ limitations under the License.
 ==============================================================================*/
 
 #include "tensorflow/compiler/xla/service/gpu/cudnn_convolution_algorithm_picker.h"
+#include "tensorflow/compiler/xla/service/gpu/backend_configs.pb.h"
 #include "tensorflow/compiler/xla/service/gpu/convolution_thunk.h"
 #include "tensorflow/compiler/xla/service/gpu/ir_emission_utils.h"
 #include "tensorflow/core/lib/gtl/optional.h"
@@ -316,21 +317,20 @@ StatusOr<bool> CudnnConvolutionAlgorithmPicker::RunOnInstruction(
   Shape new_call_shape =
       ShapeUtil::MakeTupleShape({instr->shape().tuple_shapes(0),
                                  ShapeUtil::MakeShape(U8, {scratch_bytes})});
-  HloInstruction* algorithm_hlo = computation->AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<int64>(algorithm)));
-  HloInstruction* tensor_ops_enabled_hlo =
-      computation->AddInstruction(HloInstruction::CreateConstant(
-          Literal::CreateR0<bool>(tensor_ops_enabled)));
+
+  CudnnConvBackendConfig backend_config;
+  backend_config.set_algorithm(algorithm);
+  backend_config.set_tensor_ops_enabled(tensor_ops_enabled);
 
   HloInstruction* new_call =
       computation->AddInstruction(HloInstruction::CreateCustomCall(
           new_call_shape,
-          {instr->mutable_operand(0), instr->mutable_operand(1), algorithm_hlo,
-           tensor_ops_enabled_hlo},
+          {instr->mutable_operand(0), instr->mutable_operand(1)},
           instr->custom_call_target()));
   new_call->set_window(instr->window());
   new_call->set_convolution_dimension_numbers(
       instr->convolution_dimension_numbers());
+  TF_RETURN_IF_ERROR(new_call->set_backend_config(backend_config));
 
   // Repackage new_call so it has the same shape as the original call, namely
   // (conv_result, u8[0]).
diff --git a/tensorflow/compiler/xla/service/gpu/cudnn_convolution_runner.cc b/tensorflow/compiler/xla/service/gpu/cudnn_convolution_runner.cc
index 10b4c3de89989c52cfea5273c3d5b0beef76abd2..0645fbb3ad39f1f1649caf45a6068b5a196c30b9 100644
--- a/tensorflow/compiler/xla/service/gpu/cudnn_convolution_runner.cc
+++ b/tensorflow/compiler/xla/service/gpu/cudnn_convolution_runner.cc
@@ -14,6 +14,8 @@ limitations under the License.
 ==============================================================================*/
 
 #include "tensorflow/compiler/xla/service/gpu/cudnn_convolution_runner.h"
+#include "tensorflow/compiler/xla/layout_util.h"
+#include "tensorflow/compiler/xla/service/gpu/stream_executor_util.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/status_macros.h"
 #include "tensorflow/compiler/xla/util.h"
@@ -113,8 +115,17 @@ Status RunCudnnConvolution(
 
   // cuDNN's convolution APIs support the BDYX layout for activations/output and
   // the OIYX layout for weights.
+  DataLayout input_dl;
+  FilterLayout filter_dl;
+  DataLayout output_dl;
+
+  TF_ASSIGN_OR_RETURN(std::tie(input_dl, filter_dl, output_dl),
+                      XlaConvLayoutsToStreamExecutorLayouts(
+                          dnums, input_shape.layout(), filter_shape.layout(),
+                          output_shape.layout()));
+
   BatchDescriptor input_descriptor(effective_num_dimensions);
-  input_descriptor.set_layout(DataLayout::kBatchDepthYX)
+  input_descriptor.set_layout(input_dl)
       .set_feature_map_count(
           input_shape.dimensions(dnums.input_feature_dimension()))
       .set_count(input_shape.dimensions(dnums.input_batch_dimension()));
@@ -126,7 +137,7 @@ Status RunCudnnConvolution(
   }
 
   FilterDescriptor filter_descriptor(effective_num_dimensions);
-  filter_descriptor.set_layout(FilterLayout::kOutputInputYX)
+  filter_descriptor.set_layout(filter_dl)
       .set_input_feature_map_count(
           filter_shape.dimensions(dnums.kernel_input_feature_dimension()))
       .set_output_feature_map_count(
@@ -149,7 +160,7 @@ Status RunCudnnConvolution(
   }
 
   BatchDescriptor output_descriptor(effective_num_dimensions);
-  output_descriptor.set_layout(DataLayout::kBatchDepthYX)
+  output_descriptor.set_layout(output_dl)
       .set_feature_map_count(
           output_shape.dimensions(dnums.output_feature_dimension()))
       .set_count(output_shape.dimensions(dnums.output_batch_dimension()));
diff --git a/tensorflow/compiler/xla/service/gpu/elemental_ir_emitter.cc b/tensorflow/compiler/xla/service/gpu/elemental_ir_emitter.cc
index e5e2a0478a0659986ddec8d6785827b14b9efb56..b812dd7d3fbb25f279e87f79b647e299f29073ea 100644
--- a/tensorflow/compiler/xla/service/gpu/elemental_ir_emitter.cc
+++ b/tensorflow/compiler/xla/service/gpu/elemental_ir_emitter.cc
@@ -53,11 +53,17 @@ using llvm_ir::IrName;
 using llvm_ir::SetToFirstInsertPoint;
 using tensorflow::strings::StrAppend;
 
+namespace {
 // Returns whether operand is a floating-point literal with the given value.
 bool IsFPLiteralWithValue(const HloInstruction* operand, float value) {
-  return operand->opcode() == HloOpcode::kConstant &&
-         operand->literal().IsAllFloat(value);
+  if (operand->opcode() == HloOpcode::kConstant &&
+      operand->literal().IsAllFloat(value)) {
+    return true;
+  }
+  return operand->opcode() == HloOpcode::kBroadcast &&
+         IsFPLiteralWithValue(operand->operand(0), value);
 }
+}  // namespace
 
 GpuElementalIrEmitter::GpuElementalIrEmitter(
     const HloModuleConfig& hlo_module_config, llvm::Module* module,
diff --git a/tensorflow/compiler/xla/service/gpu/fusion_merger_test.cc b/tensorflow/compiler/xla/service/gpu/fusion_merger_test.cc
index 2217776c7d5a5f92c520d56222988f80401be9e4..b22bb1d39ba177ef42673c7a3755694b43c15d14 100644
--- a/tensorflow/compiler/xla/service/gpu/fusion_merger_test.cc
+++ b/tensorflow/compiler/xla/service/gpu/fusion_merger_test.cc
@@ -17,9 +17,9 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/service/gpu/instruction_fusion.h"
 #include "tensorflow/compiler/xla/service/hlo_matchers.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/test_helpers.h"
 #include "tensorflow/compiler/xla/tests/hlo_test_base.h"
-#include "tensorflow/compiler/xla/tools/parser/hlo_parser.h"
 
 namespace xla {
 namespace gpu {
@@ -40,7 +40,7 @@ class FusionMergerTest : public HloTestBase {};
 //                   Tuple
 //
 TEST_F(FusionMergerTest, MergeSharedFusionInstruction) {
-  auto module = tools::Parse(R"(
+  auto module = ParseHloString(R"(
 HloModule MergeSharedFusionInstruction
 
 comp.3 {
@@ -104,7 +104,7 @@ ENTRY MergeSharedFusionInstruction.Computation0 {
 //
 // Fusion2 is not merged because it exceeds the threshold flops-to-bytes ratio.
 TEST_F(FusionMergerTest, FlopsToBytesRatioThresholdExceeded) {
-  auto module = tools::Parse(R"(
+  auto module = ParseHloString(R"(
 HloModule FlopsToBytesRatioThresholdExceeded
 
 comp.2 {
@@ -162,7 +162,7 @@ ENTRY FlopsToBytesRatioThresholdExceeded.Computation1 {
 // is merged into Fusion0 and Fusion1) would exceed the bytes transferred
 // threshold.
 TEST_F(FusionMergerTest, BytesTransferredThresholdExeceeded) {
-  auto module = tools::Parse(R"(
+  auto module = ParseHloString(R"(
 HloModule BytesTransferredThresholdExeceeded
 
 comp.2 {
@@ -210,7 +210,7 @@ ENTRY BytesTransferredThresholdExeceeded.Computation2 {
 // Fusion2 is reduced for this test which makes the merge operation into its
 // operand below the bytes transferred threshold.
 TEST_F(FusionMergerTest, BytesTransferredThresholdNotExeceeded) {
-  auto module = tools::Parse(R"(
+  auto module = ParseHloString(R"(
 HloModule BytesTransferredThresholdNotExeceeded
 
 comp.2 {
@@ -253,7 +253,7 @@ ENTRY BytesTransferredThresholdNotExeceeded.Computation2 {
 // Check that we're willing to merge f1_computation into f2_computation, even
 // though f2 is an input fusion node.
 TEST_F(FusionMergerTest, WillMergeIntoInputFusion) {
-  auto module = tools::Parse(R"(
+  auto module = ParseHloString(R"(
     HloModule m
 
     f1_computation {
diff --git a/tensorflow/compiler/xla/service/gpu/gpu_compiler.cc b/tensorflow/compiler/xla/service/gpu/gpu_compiler.cc
index d50153d8a31077e759bd6104d5bca8868a554fde..afefc740d707cd6fd01420e950eafd37abe80119 100644
--- a/tensorflow/compiler/xla/service/gpu/gpu_compiler.cc
+++ b/tensorflow/compiler/xla/service/gpu/gpu_compiler.cc
@@ -52,6 +52,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/gpu/ir_emitter_context.h"
 #include "tensorflow/compiler/xla/service/gpu/ir_emitter_unnested.h"
 #include "tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.h"
+#include "tensorflow/compiler/xla/service/gpu/multi_output_fusion.h"
 #include "tensorflow/compiler/xla/service/gpu/pad_insertion.h"
 #include "tensorflow/compiler/xla/service/gpu/partition_assignment.h"
 #include "tensorflow/compiler/xla/service/gpu/stream_assignment.h"
@@ -73,6 +74,8 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/reshape_mover.h"
 #include "tensorflow/compiler/xla/service/transpose_folding.h"
 #include "tensorflow/compiler/xla/service/tuple_simplifier.h"
+#include "tensorflow/compiler/xla/service/while_loop_constant_sinking.h"
+#include "tensorflow/compiler/xla/service/while_loop_invariant_code_motion.h"
 #include "tensorflow/compiler/xla/service/while_loop_simplifier.h"
 #include "tensorflow/compiler/xla/service/zero_sized_hlo_elimination.h"
 #include "tensorflow/compiler/xla/status_macros.h"
@@ -160,8 +163,7 @@ Status OptimizeHloModule(HloModule* hlo_module, se::StreamExecutor* stream_exec,
       pass.AddPass<BatchNormExpander>(
           /*rewrite_training_op=*/true,
           /*rewrite_inference_op=*/true,
-          /*rewrite_grad_op=*/true,
-          /*use_fusion=*/false);
+          /*rewrite_grad_op=*/true);
 
       // Rewrite gather ops into smaller ones.
       pass.AddPass<GatherExpander>();
@@ -174,6 +176,7 @@ Status OptimizeHloModule(HloModule* hlo_module, se::StreamExecutor* stream_exec,
           /*is_layout_sensitive=*/false,
           [](const Shape&, const Shape&) { return false; });
       pass.AddPass<TupleSimplifier>();
+      pass.AddPass<WhileLoopConstantSinking>();
       pass.AddPass<WhileLoopSimplifier>();
       pass.AddPass<HloDCE>();
       pass.AddPass<ReshapeMover>();
@@ -200,18 +203,28 @@ Status OptimizeHloModule(HloModule* hlo_module, se::StreamExecutor* stream_exec,
     pipeline.AddInvariantChecker<HloVerifier>();
     pipeline.AddPass<CudnnConvolutionRewriter>();
     pipeline.AddPass<PadInsertion>();
+    TF_RETURN_IF_ERROR(pipeline.Run(hlo_module).status());
+  }
+
+  {
+    HloPassPipeline pipeline("layout_assignment");
+    pipeline.AddPass<GpuLayoutAssignment>(
+        hlo_module->mutable_device_entry_computation_layout(), stream_exec);
+
+    // The LayoutAssignment pass may leave behind kCopy instructions which are
+    // duplicate or NOPs, so remove them with algebraic simplification and CSE.
+    pipeline.AddPass<HloPassFix<AlgebraicSimplifier>>(
+        /*is_layout_sensitive=*/true,
+        /*valid_bitcast_callback=*/[](const Shape&, const Shape&) {
+          return true;
+        });
 
     // Choose the fastest algorithm for each conv.
     //
-    // In theory doing this here is way too early: It needs to happen after
-    // layout assignment, because the layout of the inputs/outputs affects the
-    // speed of the conv.  But currently we only allow only one input/output
-    // layout when calling cudnn, so there's no ambiguity.
-    //
-    // We pick the algorithm at this early stage so we can generate better HLO.
-    // After CudnnConvolutionRewriter, our convolutions are CustomCalls which
-    // return a tuple (conv_result, scratch_memory), and the each conv uses 0
-    // bytes of scratch:
+    // We pick the algorithm before fusion so we can generate better HLO. After
+    // CudnnConvolutionRewriter, our convolutions are CustomCalls which return a
+    // tuple (conv_result, scratch_memory), and the each conv uses 0 bytes of
+    // scratch:
     //
     //   customcall = (f32[...], f32[0])
     //   return gte(customcall, 0)
@@ -227,35 +240,15 @@ Status OptimizeHloModule(HloModule* hlo_module, se::StreamExecutor* stream_exec,
     // The new tuple and gte instructions then be simplified away, because
     // nobody is expected to use the scratch value.
     //
-    // However, if we were to run CudnnConvolutionAlgorithmPicker after layout
-    // assignment, fusion would already have run, and the gte(customcall, 0)
-    // would probably already be into a fusion node.  We can't simplify across
-    // HloComputation boundaries, so in this case we wouldn't be able to
-    // simplify away the new_tuple bits.
-    //
-    // We'll need to revisit this if we ever allow multiple layouts for the
-    // inputs/outputs of a cudnn convolution.
+    // However, if we were to run CudnnConvolutionAlgorithmPicker after fusion
+    // the gte(customcall, 0) would probably already be into a fusion node.  We
+    // can't simplify across HloComputation boundaries, so in this case we
+    // wouldn't be able to simplify away the new_tuple bits.
     pipeline.AddPass<CudnnConvolutionAlgorithmPicker>(stream_exec,
                                                       device_allocator);
     // Clean up new_tuple described above.
     pipeline.AddPass<TupleSimplifier>();
-    pipeline.AddPass<HloDCE>();
 
-    TF_RETURN_IF_ERROR(pipeline.Run(hlo_module).status());
-  }
-
-  {
-    HloPassPipeline pipeline("layout_assignment");
-    pipeline.AddPass<GpuLayoutAssignment>(
-        hlo_module->mutable_device_entry_computation_layout());
-
-    // The LayoutAssignment pass may leave behind kCopy instructions which are
-    // duplicate or NOPs, so remove them with algebraic simplification and CSE.
-    pipeline.AddPass<HloPassFix<AlgebraicSimplifier>>(
-        /*is_layout_sensitive=*/true,
-        /*valid_bitcast_callback=*/[](const Shape&, const Shape&) {
-          return true;
-        });
     pipeline.AddPass<HloCSE>(/*is_layout_sensitive=*/true);
     TF_RETURN_IF_ERROR(pipeline.Run(hlo_module).status());
   }
@@ -266,6 +259,7 @@ Status OptimizeHloModule(HloModule* hlo_module, se::StreamExecutor* stream_exec,
     fusion.AddPass<GpuInstructionFusion>(/*may_duplicate=*/false);
     fusion.AddPass<GpuInstructionFusion>(/*may_duplicate=*/true);
     fusion.AddPass<FusionMerger>();
+    fusion.AddPass<GpuMultiOutputFusion>();
     TF_RETURN_IF_ERROR(fusion.Run(hlo_module).status());
 
     HloPassPipeline reduce_pipeline("reduce-precision");
@@ -282,6 +276,15 @@ Status OptimizeHloModule(HloModule* hlo_module, se::StreamExecutor* stream_exec,
       TF_RETURN_IF_ERROR(fusion.Run(hlo_module).status());
     }
   }
+
+  {
+    // Do an aggressive LICM pass over while loops.  In particular, this hoists
+    // constants that were sunk by WhileLoopConstantSinking.  Leaving them in
+    // the while loop may result in unnecessary copies.
+    HloPassPipeline pipeline("while-loop-licm");
+    pipeline.AddPass<WhileLoopInvariantCodeMotion>(true);
+    TF_RETURN_IF_ERROR(pipeline.Run(hlo_module).status());
+  }
   return Status::OK();
 }
 
diff --git a/tensorflow/compiler/xla/service/gpu/gpu_copy_insertion.cc b/tensorflow/compiler/xla/service/gpu/gpu_copy_insertion.cc
index d9560779f313d5a559c3eb0f5b28ff5dd210d9d5..c5ccdd4a7dcec02ddab8a1f748659de41f6202d2 100644
--- a/tensorflow/compiler/xla/service/gpu/gpu_copy_insertion.cc
+++ b/tensorflow/compiler/xla/service/gpu/gpu_copy_insertion.cc
@@ -78,12 +78,6 @@ StatusOr<bool> GpuCopyInsertion::Run(HloModule* module) {
       for (int64 i = 0; i < hlo->operand_count() - 2; ++i) {
         TF_RETURN_IF_ERROR(copy_operand_if_constant(i));
       }
-    } else if (IsCustomCallToDnnConvolution(*hlo)) {
-      // The last two arguments to a CUDNN convolution are two HLO constants for
-      // cudnn algorithm and tensor_ops_enabled flag, which shouldn't be copied.
-      for (int64 i = 0; i < hlo->operand_count() - 2; ++i) {
-        TF_RETURN_IF_ERROR(copy_operand_if_constant(i));
-      }
     } else if (ImplementedAsLibraryCall(*hlo) ||
                hlo->opcode() == HloOpcode::kCrossReplicaSum) {
       // For all other library calls and cross-replica-sum, materialize all the
diff --git a/tensorflow/compiler/xla/service/gpu/gpu_layout_assignment.cc b/tensorflow/compiler/xla/service/gpu/gpu_layout_assignment.cc
index 89f1e625884568bf7370b3801d851ef4846c2a98..8bf62dde8b9948375fc493fd1a524cfa7b062502 100644
--- a/tensorflow/compiler/xla/service/gpu/gpu_layout_assignment.cc
+++ b/tensorflow/compiler/xla/service/gpu/gpu_layout_assignment.cc
@@ -18,31 +18,72 @@ limitations under the License.
 #include <memory>
 
 #include "tensorflow/compiler/xla/layout_util.h"
+#include "tensorflow/compiler/xla/service/gpu/gpu_options.h"
 #include "tensorflow/compiler/xla/service/gpu/ir_emission_utils.h"
+#include "tensorflow/compiler/xla/service/gpu/stream_executor_util.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
 #include "tensorflow/compiler/xla/status_macros.h"
+#include "tensorflow/compiler/xla/window_util.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
 #include "tensorflow/core/lib/core/errors.h"
 
 namespace xla {
 namespace gpu {
 
-// cuDNN convolutions are called with specific layouts on the input, output,
-// and filter:
-//
-//   input: DataLayout::kBatchDepthYX
-//   output: DataLayout::kBatchDepthYX
-//   filter: FilterLayout::kOutputInputYX
-//
-// The order dimensions in the constant name is major-to-minor (eg, the
-// most-major dimension of the input is batch, most-minor is X). The
-// specific dimension numbers these named dimensions correspond to is
-// determined by the ConvolutionDimensionNumbers argument. Y is spatial
-// dimension 0, and X is spatial dimension 1.
-//
-// TODO(b/29399649): Be more flexible about handling layouts of cuDNN calls.
-static Status AddBackendConstraintsToDnnConvCustomCall(
+using stream_executor::dnn::DataLayout;
+using stream_executor::dnn::FilterLayout;
+
+static bool IsVoltaOrLater(const se::StreamExecutor& stream_executor) {
+  int major, minor;
+  CHECK(stream_executor.GetDeviceDescription().cuda_compute_capability(&major,
+                                                                       &minor));
+  return major >= 7;
+}
+
+// Returns (input, filter, output) layouts.
+static std::tuple<DataLayout, FilterLayout, DataLayout>
+HeuristicLayoutAssignment(const HloInstruction* instr,
+                          stream_executor::StreamExecutor* stream_executor) {
+  // DataLayout and FilterLayout uses weird enum names. Translations:
+  //   N <=> Batch or Output
+  //   C <=> Depth or Input
+  //   H <=> Y
+  //   W <=> X
+  //
+  // Therefore kOutputInputYX means NHWC; kBatchDepthYX means NCHW.
+
+  // As of today, our empirical evidence is that cudnn 7.0 is faster on V100 x
+  // fp16 with the mostly-NHWC layout. The heuristic may change as cudnn version
+  // changes, as well as the hardware updates.
+  if (!(instr->operand(0)->shape().element_type() == xla::PrimitiveType::F16 &&
+        IsVoltaOrLater(*stream_executor))) {
+    return std::make_tuple(DataLayout::kBatchDepthYX,
+                           FilterLayout::kOutputInputYX,
+                           DataLayout::kBatchDepthYX);
+  }
+  VLOG(2) << "Using heuristic to figure out layouts for " << instr->ToString();
+  // For BackwardInput that has stride, full NHWC layouts run significantly
+  // slower than (NHWC, NCHW, NCHW) or (NHWC, NCHW, NHWC).
+  //
+  // TODO(timshen): more closely compare (NHWC, NCHW, NCHW) and (NHWC, NCHW,
+  // NHWC).
+  if (instr->custom_call_target() == kCudnnConvBackwardInputCallTarget &&
+      window_util::HasStride(instr->window())) {
+    return std::make_tuple(DataLayout::kBatchYXDepth,
+                           FilterLayout::kOutputInputYX,
+                           DataLayout::kBatchDepthYX);
+  }
+  return std::make_tuple(DataLayout::kBatchYXDepth,
+                         FilterLayout::kOutputYXInput,
+                         DataLayout::kBatchYXDepth);
+}
+
+// Adds layout constraints on the cudnn custom-call instruction. The layout
+// constraints are represented in terms of minor_to_major fields of both
+// operands and the output shape. Depending on the underlying algorithm, one of
+// { NCHW, NHWC } ^ 3 = 8 different layout combinations may be chosen.
+Status GpuLayoutAssignment::AddBackendConstraintsToDnnConvCustomCall(
     HloInstruction* instr, LayoutConstraints* constraints) {
   CHECK(IsCustomCallToDnnConvolution(*instr)) << instr->ToString();
   Shape input_shape;
@@ -66,39 +107,25 @@ static Status AddBackendConstraintsToDnnConvCustomCall(
                << instr->custom_call_target();
   }
 
-  // Construct minor-to-major dimension orders for operands and result.
-  // cuDNN's convolution APIs support the BDYX layout for activations/output
-  // and the OIYX layout for weights.
-  // TODO(b/29399649): Be more flexible about handling layouts of cuDNN
-  // calls after we switch to cuDNN v5.
-  const ConvolutionDimensionNumbers& dimension_numbers =
-      instr->convolution_dimension_numbers();
-  std::vector<int64> input_layout;
-  for (int i = dimension_numbers.input_spatial_dimensions_size() - 1; i >= 0;
-       --i) {
-    input_layout.push_back(dimension_numbers.input_spatial_dimensions(i));
-  }
-  input_layout.push_back(dimension_numbers.input_feature_dimension());
-  input_layout.push_back(dimension_numbers.input_batch_dimension());
-  *input_shape.mutable_layout() = LayoutUtil::MakeLayout(input_layout);
-
-  std::vector<int64> filter_layout;
-  for (int i = dimension_numbers.kernel_spatial_dimensions_size() - 1; i >= 0;
-       --i) {
-    filter_layout.push_back(dimension_numbers.kernel_spatial_dimensions(i));
-  }
-  filter_layout.push_back(dimension_numbers.kernel_input_feature_dimension());
-  filter_layout.push_back(dimension_numbers.kernel_output_feature_dimension());
-  *filter_shape.mutable_layout() = LayoutUtil::MakeLayout(filter_layout);
-
-  std::vector<int64> output_layout;
-  for (int i = dimension_numbers.output_spatial_dimensions_size() - 1; i >= 0;
-       --i) {
-    output_layout.push_back(dimension_numbers.output_spatial_dimensions(i));
+  {
+    DataLayout input;
+    FilterLayout filter;
+    DataLayout output;
+    if (ConvUseLayoutHeuristic(instr->GetModule()->config())) {
+      std::tie(input, filter, output) =
+          HeuristicLayoutAssignment(instr, stream_executor_);
+    } else {
+      input = DataLayout::kBatchDepthYX;
+      filter = FilterLayout::kOutputInputYX;
+      output = DataLayout::kBatchDepthYX;
+    }
+
+    TF_ASSIGN_OR_RETURN(
+        std::tie(*input_shape.mutable_layout(), *filter_shape.mutable_layout(),
+                 *output_shape.mutable_layout()),
+        StreamExecutorConvLayoutsToXlaLayouts(
+            instr->convolution_dimension_numbers(), input, filter, output));
   }
-  output_layout.push_back(dimension_numbers.output_feature_dimension());
-  output_layout.push_back(dimension_numbers.output_batch_dimension());
-  *output_shape.mutable_layout() = LayoutUtil::MakeLayout(output_layout);
 
   // The custom call returns a tuple of (actual_result, scratch_buffer);
   // call_result_buf is the logical buffer for actual_result, the thing that
@@ -132,7 +159,13 @@ static Status AddBackendConstraintsToDnnConvCustomCall(
 
 Status GpuLayoutAssignment::AddBackendConstraints(
     LayoutConstraints* constraints) {
-  for (auto* instruction : constraints->computation()->instructions()) {
+  // Add convolution constraints in reverse postorder that the earliest
+  // convolution layout propagates first. This reduces the likelihood of fusion
+  // nodes with copies.
+  auto post_order = constraints->computation()->MakeInstructionPostOrder();
+  for (auto iterator = post_order.rbegin(); iterator != post_order.rend();
+       ++iterator) {
+    HloInstruction* instruction = *iterator;
     if (IsCustomCallToDnnConvolution(*instruction)) {
       TF_RETURN_IF_ERROR(
           AddBackendConstraintsToDnnConvCustomCall(instruction, constraints));
diff --git a/tensorflow/compiler/xla/service/gpu/gpu_layout_assignment.h b/tensorflow/compiler/xla/service/gpu/gpu_layout_assignment.h
index 86a3a7111fd79494e469beecf3234f6cec9adb9c..ce24af1cf8856920ccf438b5bbd2ef28cfa8ba6f 100644
--- a/tensorflow/compiler/xla/service/gpu/gpu_layout_assignment.h
+++ b/tensorflow/compiler/xla/service/gpu/gpu_layout_assignment.h
@@ -19,6 +19,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/computation_layout.h"
 #include "tensorflow/compiler/xla/service/layout_assignment.h"
 #include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/platform/stream_executor_no_cuda.h"
 
 namespace xla {
 namespace gpu {
@@ -27,8 +28,10 @@ namespace gpu {
 // layout constraints for operands and results of library calls.
 class GpuLayoutAssignment : public LayoutAssignment {
  public:
-  explicit GpuLayoutAssignment(ComputationLayout* entry_computation_layout)
-      : LayoutAssignment(entry_computation_layout) {}
+  explicit GpuLayoutAssignment(ComputationLayout* entry_computation_layout,
+                               se::StreamExecutor* stream_executor)
+      : LayoutAssignment(entry_computation_layout),
+        stream_executor_(stream_executor) {}
   ~GpuLayoutAssignment() override {}
 
  protected:
@@ -41,6 +44,12 @@ class GpuLayoutAssignment : public LayoutAssignment {
       LayoutConstraints* constraints) override;
   bool CustomCallRequiresMajorFirstLayout(
       const HloInstruction* instruction) override;
+
+ private:
+  Status AddBackendConstraintsToDnnConvCustomCall(
+      HloInstruction* instr, LayoutConstraints* constraints);
+
+  se::StreamExecutor* stream_executor_;
 };
 
 }  // namespace gpu
diff --git a/tensorflow/compiler/xla/service/gpu/gpu_layout_assignment_test.cc b/tensorflow/compiler/xla/service/gpu/gpu_layout_assignment_test.cc
index 4c45d2e94aebce5496da94841f6a1ae9015615c1..e48165c1426ea04839c245bc20b851a0f1710246 100644
--- a/tensorflow/compiler/xla/service/gpu/gpu_layout_assignment_test.cc
+++ b/tensorflow/compiler/xla/service/gpu/gpu_layout_assignment_test.cc
@@ -69,7 +69,8 @@ TEST_F(LayoutAssignmentTest, Elementwise) {
         *computation_layout.mutable_result_layout() =
             ShapeLayout(result_shape_with_layout);
 
-        GpuLayoutAssignment layout_assignment(&computation_layout);
+        GpuLayoutAssignment layout_assignment(
+            &computation_layout, backend().default_stream_executor());
         EXPECT_TRUE(layout_assignment.Run(module.get()).ValueOrDie());
 
         for (const HloInstruction* operand : add->operands()) {
@@ -156,7 +157,8 @@ TEST_F(LayoutAssignmentTest, BatchNormInference) {
         *computation_layout.mutable_result_layout() = ShapeLayout(result_shape);
       }
 
-      GpuLayoutAssignment layout_assignment(&computation_layout);
+      GpuLayoutAssignment layout_assignment(
+          &computation_layout, backend().default_stream_executor());
       EXPECT_TRUE(layout_assignment.Run(module.get()).ValueOrDie());
 
       // The first operand to batchnorm should have the same layout as the
@@ -225,7 +227,8 @@ TEST_F(LayoutAssignmentTest, BatchNormTraining) {
                 {result_shape, offset_scale_shape, offset_scale_shape}));
       }
 
-      GpuLayoutAssignment layout_assignment(&computation_layout);
+      GpuLayoutAssignment layout_assignment(
+          &computation_layout, backend().default_stream_executor());
       EXPECT_TRUE(layout_assignment.Run(module.get()).ValueOrDie());
 
       // The first operand to batchnorm should have the same layout as the
@@ -305,7 +308,8 @@ TEST_F(LayoutAssignmentTest, BatchNormGrad) {
                   {result_shape, scale_shape, scale_shape}));
         }
 
-        GpuLayoutAssignment layout_assignment(&computation_layout);
+        GpuLayoutAssignment layout_assignment(
+            &computation_layout, backend().default_stream_executor());
         EXPECT_TRUE(layout_assignment.Run(module.get()).ValueOrDie());
 
         // The first and fourth operands to the batchnorm call should have the
diff --git a/tensorflow/compiler/xla/service/gpu/gpu_options.cc b/tensorflow/compiler/xla/service/gpu/gpu_options.cc
new file mode 100644
index 0000000000000000000000000000000000000000..35b4b4e20b633792de4251a4b0e89f4b579053ce
--- /dev/null
+++ b/tensorflow/compiler/xla/service/gpu/gpu_options.cc
@@ -0,0 +1,28 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/service/gpu/gpu_options.h"
+#include "tensorflow/core/lib/gtl/map_util.h"
+
+namespace xla {
+namespace gpu {
+
+bool ConvUseLayoutHeuristic(const HloModuleConfig& config) {
+  return !config.debug_options().xla_backend_extra_options().count(
+      "xla_gpu_experimental_conv_disable_layout_heuristic");
+}
+
+}  // namespace gpu
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/gpu/gpu_options.h b/tensorflow/compiler/xla/service/gpu/gpu_options.h
new file mode 100644
index 0000000000000000000000000000000000000000..498d4a94955cb2c50e0b165f28ded44ac1c0bfff
--- /dev/null
+++ b/tensorflow/compiler/xla/service/gpu/gpu_options.h
@@ -0,0 +1,33 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_GPU_GPU_OPTIONS_H_
+#define TENSORFLOW_COMPILER_XLA_SERVICE_GPU_GPU_OPTIONS_H_
+
+#include "tensorflow/compiler/xla/service/hlo_module_config.h"
+
+// Helper functions for querying options that are specific to the GPU backend.
+
+namespace xla {
+namespace gpu {
+
+// Returns true if we should use heuristics to assign convolution layouts, as
+// opposed to always assigning NCHW.
+bool ConvUseLayoutHeuristic(const HloModuleConfig& config);
+
+}  // namespace gpu
+}  // namespace xla
+
+#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_GPU_GPU_OPTIONS_H_
diff --git a/tensorflow/compiler/xla/service/gpu/hlo_schedule.cc b/tensorflow/compiler/xla/service/gpu/hlo_schedule.cc
index f766f968826d960a8e86308f2395301aaa09f1ae..375709150e08996ea6a40f5e9e66a8f8d9287008 100644
--- a/tensorflow/compiler/xla/service/gpu/hlo_schedule.cc
+++ b/tensorflow/compiler/xla/service/gpu/hlo_schedule.cc
@@ -199,7 +199,7 @@ StatusOr<std::unique_ptr<HloSchedule>> HloSchedule::Build(
     // concurrency by optimizing for minimal memory usage.
     TF_ASSIGN_OR_RETURN(
         schedule->thunk_launch_order_,
-        CreateMemoryMinimizingSequence(
+        ScheduleOneComputation(
             *entry_computation, [pointer_size](const BufferValue& buffer) {
               return ShapeUtil::ByteSizeOf(buffer.shape(), pointer_size);
             }));
diff --git a/tensorflow/compiler/xla/service/gpu/hlo_schedule_test.cc b/tensorflow/compiler/xla/service/gpu/hlo_schedule_test.cc
index e230d538cc2df826778e8d13eaaaf31ec81c57f0..45f0a1c645b2875cf90d2c11cfb66c3dd855d097 100644
--- a/tensorflow/compiler/xla/service/gpu/hlo_schedule_test.cc
+++ b/tensorflow/compiler/xla/service/gpu/hlo_schedule_test.cc
@@ -47,8 +47,7 @@ class HloScheduleTest : public HloTestBase {
     auto debug_options = GetDebugOptionsForTest();
     debug_options.set_xla_gpu_disable_multi_streaming(false);
     config.set_debug_options(debug_options);
-    return MakeUnique<HloModule>("test_module", VersionedComputationHandle(),
-                                 config);
+    return MakeUnique<HloModule>("test_module", config);
   }
 
   HloVec RemoveHlo(const HloVec& input,
diff --git a/tensorflow/compiler/xla/service/gpu/infeed_manager.cc b/tensorflow/compiler/xla/service/gpu/infeed_manager.cc
index 3ddc1c0789d746bf021256638342364aac63e0e3..ae310beefad0c81c17fd4140b441b3a19a002e2c 100644
--- a/tensorflow/compiler/xla/service/gpu/infeed_manager.cc
+++ b/tensorflow/compiler/xla/service/gpu/infeed_manager.cc
@@ -49,13 +49,25 @@ void InfeedManager::EnqueueBuffers(const std::vector<InfeedBuffer*>& buffers) {
 }
 
 InfeedBuffer* InfeedManager::BlockingDequeueBuffer() {
-  tensorflow::mutex_lock l(mu_);
-  while (enqueued_buffer_.empty()) {
-    cv_.wait(l);
+  bool became_empty = false;
+  InfeedBuffer* current_buffer;
+  {
+    tensorflow::mutex_lock l(mu_);
+    while (enqueued_buffer_.empty()) {
+      cv_.wait(l);
+    }
+    current_buffer = enqueued_buffer_.front();
+    enqueued_buffer_.pop_front();
+    dequeued_buffer_.insert(current_buffer);
+    if (enqueued_buffer_.empty()) {
+      became_empty = true;
+    }
+  }
+  if (became_empty) {
+    for (const auto& callback : on_empty_callbacks_) {
+      callback();
+    }
   }
-  InfeedBuffer* current_buffer = enqueued_buffer_.front();
-  enqueued_buffer_.pop_front();
-  dequeued_buffer_.insert(current_buffer);
   return current_buffer;
 }
 
@@ -88,6 +100,10 @@ se::Stream* InfeedManager::GetStream(se::StreamExecutor* executor) {
   return host_to_device_stream_.get();
 }
 
+void InfeedManager::RegisterOnEmptyCallback(std::function<void()> callback) {
+  on_empty_callbacks_.push_back(std::move(callback));
+}
+
 InfeedManager* GetOrCreateInfeedManager() {
   static InfeedManager* manager = new InfeedManager;
   return manager;
diff --git a/tensorflow/compiler/xla/service/gpu/infeed_manager.h b/tensorflow/compiler/xla/service/gpu/infeed_manager.h
index d5f2216d460a45085536b15f9bf6e3bd3579f9c8..a3fc15cfe36a490f38daabca9ff36fbb1012aead 100644
--- a/tensorflow/compiler/xla/service/gpu/infeed_manager.h
+++ b/tensorflow/compiler/xla/service/gpu/infeed_manager.h
@@ -21,6 +21,7 @@ limitations under the License.
 #define TENSORFLOW_COMPILER_XLA_SERVICE_GPU_INFEED_MANAGER_H_
 
 #include <deque>
+#include <vector>
 
 #include "tensorflow/compiler/xla/types.h"
 #include "tensorflow/core/lib/gtl/flatset.h"
@@ -100,6 +101,10 @@ class InfeedManager {
   // returns null.
   se::Stream* GetStream(se::StreamExecutor* executor);
 
+  // Registers a callback that will be called when 'enqueued_buffer_' becomes
+  // empty.
+  void RegisterOnEmptyCallback(std::function<void()> callback);
+
  private:
   // TODO(b/30467474): Revisit if this mutex becomes a point of
   // contention.
@@ -122,6 +127,10 @@ class InfeedManager {
 
   // Executor that the host_to_device_stream belongs to. Not owned.
   se::StreamExecutor* host_to_device_executor_;
+
+  // List of callbacks which will be called when 'enqueued_buffer_' becomes
+  // empty.
+  std::vector<std::function<void()>> on_empty_callbacks_;
 };
 
 // Singleton creator-or-accessor: Returns the GPU infeed manager.
diff --git a/tensorflow/compiler/xla/service/gpu/instruction_fusion.cc b/tensorflow/compiler/xla/service/gpu/instruction_fusion.cc
index 5d5bef6b57b57fce4255a145634745b38dccacc7..6c4519185b34989eb53c884ba214d69b824b113c 100644
--- a/tensorflow/compiler/xla/service/gpu/instruction_fusion.cc
+++ b/tensorflow/compiler/xla/service/gpu/instruction_fusion.cc
@@ -77,15 +77,14 @@ bool GpuInstructionFusion::ShouldFuse(HloInstruction* consumer,
   HloInstruction* producer = consumer->mutable_operand(operand_index);
 
   // Check if we can use output fusion for (A @ B) * alpha
-  if (consumer->operand_count() == 2 &&
-      (producer->opcode() == HloOpcode::kDot ||
-       (producer->opcode() == HloOpcode::kFusion &&
-        producer->fused_expression_root()->opcode() == HloOpcode::kDot))) {
+  if (producer->opcode() == HloOpcode::kDot ||
+      (producer->opcode() == HloOpcode::kFusion &&
+       producer->fused_expression_root()->opcode() == HloOpcode::kDot)) {
     int64 other_operand_index = 1 - operand_index;
-    const HloInstruction* alpha = consumer->operand(other_operand_index);
     HloInstruction* op1 = nullptr;
     HloInstruction* op2 = nullptr;
-    if (consumer->opcode() == HloOpcode::kFusion &&
+    if (consumer->operand_count() == 1 &&
+        consumer->opcode() == HloOpcode::kFusion &&
         consumer->fusion_kind() == HloInstruction::FusionKind::kLoop &&
         Match(consumer->fused_expression_root(),
               match::Op()
@@ -103,10 +102,12 @@ bool GpuInstructionFusion::ShouldFuse(HloInstruction* consumer,
           op2->opcode() != HloOpcode::kBroadcast) {
         return false;
       }
-      if (IsIEEEFloatingPointScalarConstant(alpha)) {
+      if (IsIEEEFloatingPointScalarConstant(op2->operand(0))) {
         return true;
       }
-    } else if (consumer->opcode() == HloOpcode::kMultiply) {
+    } else if (consumer->operand_count() == 2 &&
+               consumer->opcode() == HloOpcode::kMultiply) {
+      const HloInstruction* alpha = consumer->operand(other_operand_index);
       // Fuse if 'alpha' is a broadcast of a scalar constant.
       if (alpha->opcode() == HloOpcode::kBroadcast &&
           alpha->dimensions().empty() &&
@@ -173,10 +174,38 @@ bool GpuInstructionFusion::ShouldFuse(HloInstruction* consumer,
     return false;
   }
 
+  // Fuse scalar constants into loop fusion nodes, this reduces the number of
+  // parameters and makes matching scalar broadcasts easier.
+  if (ShapeUtil::IsEffectiveScalar(producer->shape()) &&
+      consumer->opcode() == HloOpcode::kFusion &&
+      producer->opcode() == HloOpcode::kConstant) {
+    return true;
+  }
+
   return IsFusile(*producer) && IsFusile(*consumer) &&
          InstructionFusion::ShouldFuse(consumer, operand_index);
 }
 
+bool GpuInstructionFusion::ShouldFuseIntoMultiOutput(HloInstruction* consumer,
+                                                     int64 operand_index) {
+  const HloInstruction* producer = consumer->operand(operand_index);
+  // The IR emitter has limited support for non-loop fusions with multi output
+  // at present.
+  // TODO(tjoerg): Relax this constraint to allow for arbitraty kinds of fusion.
+  if (consumer->opcode() == HloOpcode::kFusion &&
+      consumer->fusion_kind() != HloInstruction::FusionKind::kLoop) {
+    return false;
+  }
+  // Multi-output fusion requires instructions with compatible shapes.
+  if (!ShapeUtil::Compatible(producer->shape(), consumer->shape())) {
+    return false;
+  }
+  // TODO(tjoerg): Stop calling `ShouldFuse` to relax the criteria for
+  // multi-output fusion. In particular, do not check whether an instruction is
+  // expensive to duplicate, since this doesn't matter here.
+  return GpuInstructionFusion::ShouldFuse(consumer, operand_index);
+}
+
 HloInstruction::FusionKind GpuInstructionFusion::ChooseKind(
     const HloInstruction* producer, const HloInstruction* consumer) {
   if (IsReductionToVector(*consumer)) {
diff --git a/tensorflow/compiler/xla/service/gpu/instruction_fusion.h b/tensorflow/compiler/xla/service/gpu/instruction_fusion.h
index 9fb06b0a244186484b1c17edf13bd28a4305a1a6..f629d9ff2c7165b652369612c30979150f93bd24 100644
--- a/tensorflow/compiler/xla/service/gpu/instruction_fusion.h
+++ b/tensorflow/compiler/xla/service/gpu/instruction_fusion.h
@@ -31,6 +31,9 @@ class GpuInstructionFusion : public InstructionFusion {
 
   bool ShouldFuse(HloInstruction* consumer, int64 operand_index) override;
 
+  bool ShouldFuseIntoMultiOutput(HloInstruction* consumer,
+                                 int64 operand_index) override;
+
   HloInstruction::FusionKind ChooseKind(
       const HloInstruction* producer, const HloInstruction* consumer) override;
 };
diff --git a/tensorflow/compiler/xla/service/gpu/instruction_fusion_test.cc b/tensorflow/compiler/xla/service/gpu/instruction_fusion_test.cc
index 760e0e90f583d0e43975e23b731a40af75c7dc17..1963d9eef72d41fa0a275bea98f959671fa7e737 100644
--- a/tensorflow/compiler/xla/service/gpu/instruction_fusion_test.cc
+++ b/tensorflow/compiler/xla/service/gpu/instruction_fusion_test.cc
@@ -15,9 +15,12 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/service/gpu/instruction_fusion.h"
 
+#include "tensorflow/compiler/xla/service/dfs_hlo_visitor_with_default.h"
 #include "tensorflow/compiler/xla/service/hlo_matchers.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
+#include "tensorflow/compiler/xla/status_macros.h"
 #include "tensorflow/compiler/xla/tests/hlo_test_base.h"
-#include "tensorflow/compiler/xla/tools/parser/hlo_parser.h"
+#include "tensorflow/compiler/xla/util.h"
 
 namespace op = xla::testing::opcode_matchers;
 
@@ -140,7 +143,7 @@ TEST_F(InstructionFusionTest, PotentialBitcastTransposeOfDotUnfused) {
 
 // Tests that broadcasts fused into a fusion with a reduce root.
 TEST_F(InstructionFusionTest, BroadcastIntoReduce) {
-  auto module = tools::Parse(R"(
+  auto module = ParseHloString(R"(
     HloModule test_module
 
     add {
@@ -165,11 +168,11 @@ TEST_F(InstructionFusionTest, BroadcastIntoReduce) {
   HloInstruction* root = module->entry_computation()->root_instruction();
   EXPECT_THAT(root, op::Fusion());
   EXPECT_THAT(root->fused_expression_root(),
-              op::Reduce(op::Broadcast(op::Parameter()), op::Parameter()));
+              op::Reduce(op::Broadcast(op::Constant()), op::Constant()));
 }
 
 TEST_F(InstructionFusionTest, BitcastIntoAdd) {
-  auto module = tools::Parse(R"(
+  auto module = ParseHloString(R"(
     HloModule test_module
 
     ENTRY BroadcastIntoAdd {
@@ -191,7 +194,7 @@ TEST_F(InstructionFusionTest, BitcastIntoAdd) {
 }
 
 TEST_F(InstructionFusionTest, AddIntoBitcast) {
-  auto module = tools::Parse(R"(
+  auto module = ParseHloString(R"(
     HloModule test_module
 
     ENTRY BroadcastIntoAdd {
@@ -213,7 +216,7 @@ TEST_F(InstructionFusionTest, AddIntoBitcast) {
 }
 
 TEST_F(InstructionFusionTest, DontFuseGTE) {
-  auto module = tools::Parse(R"(
+  auto module = ParseHloString(R"(
   HloModule test_module
   ENTRY DontFuseGTE {
     p0 = (f32[10], f32[10]) parameter(0)
@@ -229,7 +232,7 @@ TEST_F(InstructionFusionTest, DontFuseGTE) {
 }
 
 TEST_F(InstructionFusionTest, DotOutputFusion) {
-  auto module = tools::Parse(R"(
+  auto module = ParseHloString(R"(
   HloModule test_module
   ENTRY OutputFusion {
     alpha = f32[] constant(3)
@@ -252,13 +255,13 @@ TEST_F(InstructionFusionTest, DotOutputFusion) {
   EXPECT_THAT(
       root->fused_expression_root(),
       op::Multiply(op::Dot(op::Parameter(), op::Transpose(op::Parameter())),
-                   op::Broadcast(op::Parameter())));
+                   op::Broadcast(op::Constant())));
 }
 
 // Compute sum(1/p0), where p0 has type f32, twice.  Check that the division is
 // duplicated and fused into both reduces.
 TEST_F(InstructionFusionTest, FloatingPointDivIsCheap) {
-  auto module = tools::Parse(R"(
+  auto module = ParseHloString(R"(
   HloModule test_module
   Add {
     lhs = f32[] parameter(0)
@@ -281,14 +284,15 @@ TEST_F(InstructionFusionTest, FloatingPointDivIsCheap) {
                   .ValueOrDie());
 
   HloInstruction* root = module->entry_computation()->root_instruction();
-  EXPECT_THAT(root, op::Tuple(op::Fusion(), op::Fusion()));
+  EXPECT_THAT(root, op::Tuple(op::Fusion(), op::Fusion()))
+      << module->ToString();
 }
 
 // Compute sum(100/p0), where p0 has type s32, twice.  Check that the division
 // is *not* duplicated and fused into both reduces, because we say that integer
 // division is not cheap.
 TEST_F(InstructionFusionTest, IntegerDivIsNotCheap) {
-  auto module = tools::Parse(R"(
+  auto module = ParseHloString(R"(
   HloModule test_module
   Add {
     lhs = s32[] parameter(0)
@@ -308,11 +312,12 @@ TEST_F(InstructionFusionTest, IntegerDivIsNotCheap) {
 
   EXPECT_FALSE(GpuInstructionFusion(/*may_duplicate=*/true)
                    .Run(module.get())
-                   .ValueOrDie());
+                   .ValueOrDie())
+      << module->ToString();
 }
 
 TEST_F(InstructionFusionTest, DotOutputFusionImpossible) {
-  auto module = tools::Parse(R"(
+  auto module = ParseHloString(R"(
   HloModule test_module
   ENTRY NoOutputFusion {
     alpha = f32[] constant(3)
@@ -334,7 +339,271 @@ TEST_F(InstructionFusionTest, DotOutputFusionImpossible) {
   EXPECT_EQ(root->fusion_kind(), HloInstruction::FusionKind::kLoop);
   EXPECT_THAT(root->fused_expression_root(),
               op::Multiply(op::Multiply(op::Parameter(), op::Parameter()),
-                           op::Broadcast(op::Parameter())));
+                           op::Broadcast(op::Constant())));
+}
+
+// Counts the HLO ops with a given op code in the specified module.
+static int Count(const HloModule& module, HloOpcode op) {
+  int count = 0;
+  for (const auto* computation : module.computations()) {
+    for (const auto* instruction : computation->instructions()) {
+      if (instruction->opcode() == op) {
+        ++count;
+      }
+    }
+  }
+  return count;
+}
+
+// Returns an HLO instruction from the given computation with the op code.
+static StatusOr<const HloInstruction*> FindHloInstruction(
+    const HloComputation& computation, HloOpcode op) {
+  for (const auto* instruction : computation.instructions()) {
+    if (instruction->opcode() == op) {
+      return instruction;
+    }
+  }
+  return NotFound(
+      "Computation '%s' does not contain an instruction with op code '%s'.",
+      computation.name().c_str(), HloOpcodeString(op).c_str());
+}
+
+TEST_F(InstructionFusionTest, MultiOutputFusion) {
+  // sub --> add --> tuple
+  //  \---------------/
+  auto module = ParseHloString(R"(
+    HloModule test_module
+    ENTRY OutputFusion {
+     p0 = f32[4,3]{1,0} parameter(0)
+     p1 = f32[4,3]{1,0} parameter(1)
+     p2 = f32[4,3]{1,0} parameter(2)
+     sub = f32[4,3]{1,0} subtract(p0, p2)
+     add = f32[4,3]{1,0} add(sub, p1)
+     ROOT tuple = (f32[4,3]{1,0}, f32[4,3]{1,0}) tuple(sub, add)
+    })")
+                    .ValueOrDie();
+
+  ASSERT_TRUE(GpuInstructionFusion(/*may_duplicate=*/true)
+                  .Run(module.get())
+                  .ValueOrDie());
+  SCOPED_TRACE(module->ToString());
+
+  // Expect that there is one multi-output fusion and subtract has not been
+  // duplicated.
+  EXPECT_EQ(Count(*module, HloOpcode::kFusion), 1);
+  EXPECT_EQ(Count(*module, HloOpcode::kSubtract), 1);
+  TF_ASSERT_OK_AND_ASSIGN(
+      const HloInstruction* fusion,
+      FindHloInstruction(*module->entry_computation(), HloOpcode::kFusion));
+  EXPECT_THAT(
+      fusion->fused_expression_root(),
+      op::Tuple(op::Add(op::Subtract(), op::Parameter()), op::Subtract()));
+}
+
+TEST_F(InstructionFusionTest, MultiOutputFusionExpensiveOp) {
+  // tanh --> add --> tuple
+  //  \---------------/
+  auto module = ParseHloString(R"(
+    HloModule test_module
+    ENTRY OutputFusion {
+     p0 = f32[4,3]{1,0} parameter(0)
+     p1 = f32[4,3]{1,0} parameter(1)
+     tanh = f32[4,3]{1,0} tanh(p0)
+     add = f32[4,3]{1,0} add(tanh, p1)
+     ROOT tuple = (f32[4,3]{1,0}, f32[4,3]{1,0}) tuple(tanh, add)
+    })")
+                    .ValueOrDie();
+
+  // TODO(tjoerg): Allow multi-output fusion for expensive operations like tanh.
+  ASSERT_FALSE(GpuInstructionFusion(/*may_duplicate=*/true)
+                   .Run(module.get())
+                   .ValueOrDie())
+      << module->ToString();
+}
+
+TEST_F(InstructionFusionTest, MultiOutputFusion2) {
+  // sub --> add1 --\--------\
+  //  \----------> add2 --> tuple
+  auto module = ParseHloString(R"(
+    HloModule test_module
+    ENTRY OutputFusion {
+     p0 = f32[4,3]{1,0} parameter(0)
+     p1 = f32[4,3]{1,0} parameter(1)
+     p2 = f32[4,3]{1,0} parameter(2)
+     sub = f32[4,3]{1,0} subtract(p0, p2)
+     add1 = f32[4,3]{1,0} add(sub, p1)
+     add2 = f32[4,3]{1,0} add(sub, add1)
+     ROOT tuple = (f32[4,3]{1,0}) tuple(add1, add2)
+    })")
+                    .ValueOrDie();
+
+  ASSERT_TRUE(GpuInstructionFusion(/*may_duplicate=*/true)
+                  .Run(module.get())
+                  .ValueOrDie());
+  SCOPED_TRACE(module->ToString());
+
+  // Expect that there is one multi-output fusion and subtract has not been
+  // duplicated.
+  EXPECT_EQ(Count(*module, HloOpcode::kFusion), 1);
+  EXPECT_EQ(Count(*module, HloOpcode::kSubtract), 1);
+  TF_ASSERT_OK_AND_ASSIGN(
+      const HloInstruction* fusion,
+      FindHloInstruction(*module->entry_computation(), HloOpcode::kFusion));
+  EXPECT_THAT(fusion->fused_expression_root(),
+              op::Tuple(op::Add(op::Subtract(), op::Add()),
+                        op::Add(op::Subtract(), op::Parameter())));
+}
+
+TEST_F(InstructionFusionTest, MultiOutputFusion3) {
+  // sub --> add1 ----\--------\
+  //  \ --> add2 --> add3 --> tuple
+  auto module = ParseHloString(R"(
+    HloModule test_module
+    ENTRY OutputFusion {
+     p0 = f32[4,3]{1,0} parameter(0)
+     p1 = f32[4,3]{1,0} parameter(1)
+     p2 = f32[4,3]{1,0} parameter(2)
+     p3 = f32[4,3]{1,0} parameter(3)
+     sub = f32[4,3]{1,0} subtract(p0, p2)
+     add1 = f32[4,3]{1,0} add(sub, p1)
+     add2 = f32[4,3]{1,0} add(p2, sub)
+     add3 = f32[4,3]{1,0} add(add1, add2)
+     ROOT tuple = (f32[4,3]{1,0}) tuple(add3, add2)
+    })")
+                    .ValueOrDie();
+
+  ASSERT_TRUE(GpuInstructionFusion(/*may_duplicate=*/true)
+                  .Run(module.get())
+                  .ValueOrDie());
+  SCOPED_TRACE(module->ToString());
+
+  // Expect that there is one multi-output fusion and subtract has not been
+  // duplicated.
+  EXPECT_EQ(Count(*module, HloOpcode::kFusion), 1);
+  EXPECT_EQ(Count(*module, HloOpcode::kSubtract), 1);
+  TF_ASSERT_OK_AND_ASSIGN(
+      const HloInstruction* fusion,
+      FindHloInstruction(*module->entry_computation(), HloOpcode::kFusion));
+  EXPECT_THAT(fusion->fused_expression_root(),
+              op::Tuple(op::Add(op::Add(), op::Add()),
+                        op::Add(op::Parameter(), op::Subtract())));
+}
+
+TEST_F(InstructionFusionTest, NoCyclesDueToMultiOutputFusion) {
+  // sub --> mul ---\
+  //  \--> call --> add --> tuple
+  auto module = ParseHloString(R"(
+  HloModule test_module
+  ENTRY OutputFusion {
+    c = f32[] constant(42)
+    p0 = f32[4,3]{1,0} parameter(0)
+    p1 = f32[4,3]{1,0} parameter(1)
+    sub = f32[4,3]{1,0} subtract(p0, p1)
+    mul = f32[4,3]{1,0} multiply(sub, c)
+    call = f32[4,3]{1,0} custom-call(sub), custom_call_target="foo"
+    add = f32[4,3]{1,0} add(mul, call)
+    ROOT tuple = (f32[4,3]{1,0}) tuple(add)
+  })")
+                    .ValueOrDie();
+
+  ASSERT_TRUE(GpuInstructionFusion(/*may_duplicate=*/true)
+                  .Run(module.get())
+                  .ValueOrDie());
+  // Visit instructions in post order to detect cycles.
+  // TODO(tjoerg): Add cycle detection to the HloVerifier.
+  class DummyVisitor : public DfsHloVisitorWithDefault {
+   public:
+    DummyVisitor() {}
+    Status DefaultAction(HloInstruction* /*hlo_instruction*/) override {
+      return Status::OK();
+    }
+  } visitor;
+  for (const HloComputation* computation : module->MakeComputationPostOrder()) {
+    // Accept will return a FailedPrecondition when a cycle is detected.
+    EXPECT_TRUE(computation->root_instruction()->Accept(&visitor).ok());
+  }
+}
+
+TEST_F(InstructionFusionTest, NoMultiOutputFusionWithIncompatibleShapes) {
+  // sub[2,3] --> add[4,3] --> tuple([2,3], [4,3])
+  //  \-------------------------/
+  auto module = ParseHloString(R"(
+    HloModule test_module
+    ENTRY OutputFusion {
+     p0 = f32[2,3]{1,0} parameter(0)
+     p1 = f32[4,3]{1,0} parameter(1)
+     p2 = f32[2,3]{1,0} parameter(2)
+     sub = f32[2,3]{1,0} subtract(p0, p2)
+     add = f32[4,3]{1,0} add(sub, p1)
+     ROOT tuple = (f32[2,3]{1,0}, f32[4,3]{1,0}) tuple(sub, add)
+    })")
+                    .ValueOrDie();
+
+  // Multi-output fusion requires shapes to be compatible. Since `sub` and `add`
+  // have incompatible shapes, expect that no multi-output fusion happens.
+  ASSERT_FALSE(GpuInstructionFusion(/*may_duplicate=*/true)
+                   .Run(module.get())
+                   .ValueOrDie())
+      << module->ToString();
+}
+
+TEST_F(InstructionFusionTest, FuseIntoInputFusionInstruction) {
+  auto module = ParseHloString(R"(
+  HloModule test_module
+
+  add_computation {
+    add_lhs = f32[] parameter(0)
+    add_rhs = f32[] parameter(1)
+    ROOT add_root = f32[] add(add_lhs, add_rhs)
+  }
+
+  fused_computation {
+    p1 = f32[10] parameter(0)
+    zero = f32[] constant(0)
+    ROOT f2_root = f32[] reduce(p1, zero), dimensions={0},
+           to_apply=add_computation
+  }
+
+  ENTRY entry {
+    p0 = f32[10] parameter(0)
+    mul = f32[10] multiply(p0, p0)
+    fusion = f32[] fusion(mul), kind=kInput, calls=fused_computation
+    ROOT tuple = (f32[10], f32[]) tuple(fusion, mul)
+  })")
+                    .ValueOrDie();
+
+  // Multi-output fusion is not supported for non-loop fusions at present. Since
+  // `fused_computation` is a input fusion, expect no multi-output fusion to
+  // happen.
+  ASSERT_FALSE(GpuInstructionFusion(/*may_duplicate=*/true)
+                   .Run(module.get())
+                   .ValueOrDie())
+      << module->ToString();
+}
+
+TEST_F(InstructionFusionTest, FuseScalarConstant) {
+  auto module = ParseHloString(R"(
+  HloModule test_module
+
+  ENTRY FuseScalarConstant {
+    p0 = f32[] parameter(0)
+    c0 = f32[] constant(1)
+    add1 = f32[] add(p0, c0)
+    b0 = f32[2]{0} broadcast(add1), dimensions={}
+    c1 = f32[2]{0} constant({1, 2})
+    ROOT add2 = f32[2]{0} add(b0, c1)
+  })")
+                    .ValueOrDie();
+
+  EXPECT_TRUE(GpuInstructionFusion(/*may_duplicate=*/true)
+                  .Run(module.get())
+                  .ValueOrDie());
+
+  HloInstruction* root = module->entry_computation()->root_instruction();
+  EXPECT_THAT(root, op::Fusion());
+  EXPECT_THAT(root->fused_expression_root(),
+              op::Add(op::Broadcast(op::Add(op::Parameter(), op::Constant())),
+                      op::Parameter()));
 }
 
 }  // namespace gpu
diff --git a/tensorflow/compiler/xla/service/gpu/ir_emission_utils.cc b/tensorflow/compiler/xla/service/gpu/ir_emission_utils.cc
index 22e715099526c20532bb298e84e50457d89f615e..67890bfed1136796c83c7ef6912ffc1ab1b7e332 100644
--- a/tensorflow/compiler/xla/service/gpu/ir_emission_utils.cc
+++ b/tensorflow/compiler/xla/service/gpu/ir_emission_utils.cc
@@ -162,19 +162,8 @@ static HloInstruction* CreateCudnnConv(
   Shape call_shape =
       ShapeUtil::MakeTupleShape({shape, ShapeUtil::MakeShape(U8, {0})});
 
-  // Our CustomCall takes four arguments: The conv lhs and rhs, the cudnn
-  // algorithm to use, and a boolean indicating whether to use tensor cores.
-  //
-  // It's up to a later pass to choose the algorithm and decide whether to use
-  // tensor cores, so to indicate that we haven't yet made a choice, we speicfy
-  // -1 and false for those args.
-  HloInstruction* negative_one = computation->AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<int64>(-1)));
-  HloInstruction* false_constant = computation->AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
-  HloInstruction* custom_call =
-      computation->AddInstruction(HloInstruction::CreateCustomCall(
-          call_shape, {lhs, rhs, negative_one, false_constant}, call_target));
+  HloInstruction* custom_call = computation->AddInstruction(
+      HloInstruction::CreateCustomCall(call_shape, {lhs, rhs}, call_target));
   custom_call->set_window(window);
   custom_call->set_convolution_dimension_numbers(dnums);
   return custom_call;
diff --git a/tensorflow/compiler/xla/service/gpu/ir_emitter.cc b/tensorflow/compiler/xla/service/gpu/ir_emitter.cc
index 1e0db2821a2c212d0f212ae94ab69231bc6053ea..547af33e9a98c03e1429366172f9a401e385a9d1 100644
--- a/tensorflow/compiler/xla/service/gpu/ir_emitter.cc
+++ b/tensorflow/compiler/xla/service/gpu/ir_emitter.cc
@@ -94,7 +94,10 @@ Status IrEmitter::HandleConstant(HloInstruction* constant) {
           << std::endl
           << "  its type: "
           << llvm_ir::DumpToString(*global_for_const->getType());
-  bindings_.BindHloToIrValue(*constant, global_for_const);
+  llvm::Constant* shape_constant = llvm::ConstantExpr::getBitCast(
+      global_for_const,
+      llvm_ir::ShapeToIrType(literal.shape(), module_)->getPointerTo());
+  bindings_.BindHloToIrValue(*constant, shape_constant);
   return Status::OK();
 }
 
diff --git a/tensorflow/compiler/xla/service/gpu/ir_emitter.h b/tensorflow/compiler/xla/service/gpu/ir_emitter.h
index b0accc08d479258d65a18202122e4c9e90ff78d0..e55dfc6dae844ceb1d28ad389d133c80823bad9a 100644
--- a/tensorflow/compiler/xla/service/gpu/ir_emitter.h
+++ b/tensorflow/compiler/xla/service/gpu/ir_emitter.h
@@ -120,10 +120,11 @@ class IrEmitter : public DfsHloVisitorWithDefault {
   llvm::Value* GetBasePointer(const HloInstruction& inst) const {
     return bindings_.GetBasePointer(inst);
   }
-  // A convenient helper for calling BufferAssignment::GetUniqueTopLevelSlice.
-  BufferAllocation::Slice GetAllocationSlice(const HloInstruction& hlo) const {
+  // A convenient helper for calling BufferAssignment::GetUniqueSlice.
+  BufferAllocation::Slice GetAllocationSlice(
+      const HloInstruction& hlo, const ShapeIndex& index = {}) const {
     return ir_emitter_context_->buffer_assignment()
-        .GetUniqueTopLevelSlice(&hlo)
+        .GetUniqueSlice(&hlo, index)
         .ConsumeValueOrDie();
   }
 
diff --git a/tensorflow/compiler/xla/service/gpu/ir_emitter_unnested.cc b/tensorflow/compiler/xla/service/gpu/ir_emitter_unnested.cc
index 55d4c1d13d3ad41e09d48db70478cf5e6af59808..726434c3dfd4f1ef866d2eb9b6d7eb8b659e0984 100644
--- a/tensorflow/compiler/xla/service/gpu/ir_emitter_unnested.cc
+++ b/tensorflow/compiler/xla/service/gpu/ir_emitter_unnested.cc
@@ -32,6 +32,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/service/buffer_assignment.h"
 #include "tensorflow/compiler/xla/service/dfs_hlo_visitor.h"
+#include "tensorflow/compiler/xla/service/gpu/backend_configs.pb.h"
 #include "tensorflow/compiler/xla/service/gpu/conditional_thunk.h"
 #include "tensorflow/compiler/xla/service/gpu/convolution_thunk.h"
 #include "tensorflow/compiler/xla/service/gpu/copy_thunk.h"
@@ -58,6 +59,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
 #include "tensorflow/compiler/xla/service/hlo_opcode.h"
 #include "tensorflow/compiler/xla/service/llvm_ir/fused_ir_emitter.h"
+#include "tensorflow/compiler/xla/service/llvm_ir/kernel_support_library.h"
 #include "tensorflow/compiler/xla/service/llvm_ir/llvm_util.h"
 #include "tensorflow/compiler/xla/service/llvm_ir/ops.h"
 #include "tensorflow/compiler/xla/service/llvm_ir/tuple_ops.h"
@@ -79,6 +81,7 @@ namespace {
 
 using llvm_ir::IrName;
 using tensorflow::gtl::ArraySlice;
+using tensorflow::gtl::InlinedVector;
 using tensorflow::gtl::nullopt;
 using tensorflow::gtl::optional;
 using tensorflow::strings::StrCat;
@@ -422,15 +425,8 @@ Status IrEmitterUnnested::HandleCustomCall(HloInstruction* custom_call) {
     auto conv_result_slice = assn.GetUniqueSlice(custom_call, {0}).ValueOrDie();
     auto scratch_slice = assn.GetUniqueSlice(custom_call, {1}).ValueOrDie();
 
-    const HloInstruction* algorithm_inst = custom_call->operand(2);
-    CHECK(algorithm_inst->IsConstant()) << algorithm_inst->ToString();
-    int64 algorithm = algorithm_inst->literal().Get<int64>({});
-
-    const HloInstruction* tensor_ops_enabled_inst = custom_call->operand(3);
-    CHECK(tensor_ops_enabled_inst->IsConstant())
-        << tensor_ops_enabled_inst->ToString();
-    bool tensor_ops_enabled = tensor_ops_enabled_inst->literal().Get<bool>({});
-
+    TF_ASSIGN_OR_RETURN(CudnnConvBackendConfig backend_config,
+                        custom_call->backend_config<CudnnConvBackendConfig>());
     const auto& target = custom_call->custom_call_target();
     std::unique_ptr<ConvolutionThunk> thunk;
     if (target == kCudnnConvForwardCallTarget) {
@@ -445,7 +441,8 @@ Status IrEmitterUnnested::HandleCustomCall(HloInstruction* custom_call) {
           /*filter_shape=*/rhs_shape,
           /*output_shape=*/conv_result_shape,  //
           custom_call->window(), custom_call->convolution_dimension_numbers(),
-          algorithm, tensor_ops_enabled, custom_call);
+          backend_config.algorithm(), backend_config.tensor_ops_enabled(),
+          custom_call);
     } else if (target == kCudnnConvBackwardInputCallTarget) {
       thunk = MakeUnique<ConvolutionThunk>(
           CudnnConvKind::kBackwardInput,
@@ -458,7 +455,8 @@ Status IrEmitterUnnested::HandleCustomCall(HloInstruction* custom_call) {
           /*filter_shape=*/rhs_shape,
           /*output_shape=*/lhs_shape,  //
           custom_call->window(), custom_call->convolution_dimension_numbers(),
-          algorithm, tensor_ops_enabled, custom_call);
+          backend_config.algorithm(), backend_config.tensor_ops_enabled(),
+          custom_call);
     } else if (target == kCudnnConvBackwardFilterCallTarget) {
       thunk = MakeUnique<ConvolutionThunk>(
           CudnnConvKind::kBackwardFilter,
@@ -471,7 +469,8 @@ Status IrEmitterUnnested::HandleCustomCall(HloInstruction* custom_call) {
           /*filter_shape=*/conv_result_shape,
           /*output_shape=*/rhs_shape,  //
           custom_call->window(), custom_call->convolution_dimension_numbers(),
-          algorithm, tensor_ops_enabled, custom_call);
+          backend_config.algorithm(), backend_config.tensor_ops_enabled(),
+          custom_call);
     } else {
       LOG(FATAL) << "Unexpected custom call target: "
                  << custom_call->custom_call_target();
@@ -499,12 +498,31 @@ Status IrEmitterUnnested::HandleFusion(HloInstruction* fusion) {
   // initializes the output array to the initial value of the reduce.
   if (HloInstruction::FusionKind::kInput == fusion->fusion_kind()) {
     switch (root->opcode()) {
+      case HloOpcode::kTuple:
       case HloOpcode::kReduce: {
         VLOG(3) << "Emitting fused reduction to vector: " << fusion->ToString();
-        TF_ASSIGN_OR_RETURN(std::unique_ptr<Thunk> initializer_thunk,
-                            BuildInitializerThunk(fusion));
         std::vector<std::unique_ptr<Thunk>> thunks;
-        thunks.push_back(std::move(initializer_thunk));
+        ArraySlice<HloInstruction*> output_instructions =
+            root->opcode() == HloOpcode::kTuple
+                ? root->operands()
+                : ArraySlice<HloInstruction*>(&root, 1);
+
+        // For multi-output fusion emit an initializer for each tuple element.
+        // Otherwise it's sufficient to just initialize the single output.
+        HloInstruction* first_reduce = nullptr;
+        for (int i = 0, e = output_instructions.size(); i != e; ++i) {
+          if (output_instructions[i]->opcode() == HloOpcode::kReduce) {
+            TF_ASSIGN_OR_RETURN(
+                std::unique_ptr<Thunk> initializer_thunk,
+                BuildInitializerThunk(fusion, output_instructions[i] == root
+                                                  ? ShapeIndex()
+                                                  : ShapeIndex({i})));
+            thunks.push_back(std::move(initializer_thunk));
+            first_reduce =
+                first_reduce == nullptr ? output_instructions[i] : first_reduce;
+          }
+        }
+        CHECK(first_reduce != nullptr);
         thunks.push_back(BuildKernelThunk(fusion));
         thunk_sequence_->emplace_back(
             MakeUnique<SequentialThunk>(std::move(thunks), fusion));
@@ -518,11 +536,50 @@ Status IrEmitterUnnested::HandleFusion(HloInstruction* fusion) {
         FusedIrEmitter fused_emitter(parameter_arrays, &elemental_emitter);
         TF_RETURN_IF_ERROR(root->Accept(&fused_emitter));
 
-        Shape input_shape = root->operand(0)->shape();
-        return EmitReductionToVector(
-            root, input_shape, fused_emitter.GetGenerator(root->operand(0)),
-            fused_emitter.GetGenerator(root->operand(1)), root->dimensions(),
-            root->to_apply());
+        // For multi-output fusion CHECK the constraints and feed all the
+        // reduces into a single loop code generator. Single-output reduce
+        // fusion is a special case of that.
+        InlinedVector<llvm_ir::ElementGenerator, 1> input_gens;
+        InlinedVector<llvm_ir::ElementGenerator, 1> init_value_gens;
+        std::vector<std::pair<llvm_ir::ElementGenerator, ShapeIndex>>
+            extra_output_gens;
+        InlinedVector<HloComputation*, 1> reducers;
+        InlinedVector<ShapeIndex, 1> reduce_output_shapes;
+        for (int i = 0, e = output_instructions.size(); i != e; ++i) {
+          const HloInstruction* inst = output_instructions[i];
+          ShapeIndex output_shape_index;
+          if (root->opcode() == HloOpcode::kTuple) {
+            output_shape_index = {i};
+          }
+          // TODO(kramerb): CHECK that layouts are equal. Currently this
+          // breaks multioutputfusion_test. The test has pre-fused
+          // instructions, but layout_assignment will not assign any layouts
+          // for instructions inside of a fused computation. It just removes
+          // the layouts instead.
+          if (inst->opcode() == HloOpcode::kReduce) {
+            CHECK(ShapeUtil::Compatible(first_reduce->shape(), inst->shape()));
+            CHECK(ShapeUtil::Compatible(first_reduce->operand(0)->shape(),
+                                        inst->operand(0)->shape()));
+            CHECK(ShapeUtil::Compatible(first_reduce->operand(1)->shape(),
+                                        inst->operand(1)->shape()));
+            CHECK(first_reduce->dimensions() == inst->dimensions());
+            input_gens.push_back(fused_emitter.GetGenerator(inst->operand(0)));
+            init_value_gens.push_back(
+                fused_emitter.GetGenerator(inst->operand(1)));
+            reducers.push_back(inst->to_apply());
+            reduce_output_shapes.push_back(std::move(output_shape_index));
+          } else {
+            CHECK(ShapeUtil::Compatible(first_reduce->operand(0)->shape(),
+                                        inst->shape()));
+            extra_output_gens.emplace_back(fused_emitter.GetGenerator(inst),
+                                           std::move(output_shape_index));
+          }
+        }
+        const Shape& input_shape = first_reduce->operand(0)->shape();
+        return EmitReductionToVector(first_reduce, input_shape, input_gens,
+                                     init_value_gens,
+                                     first_reduce->dimensions(), reducers,
+                                     reduce_output_shapes, extra_output_gens);
       }
       default:
         LOG(FATAL) << "Bad opcode for input fusion: "
@@ -907,10 +964,33 @@ Status IrEmitterUnnested::HandleCopy(HloInstruction* copy) {
   return IrEmitter::HandleCopy(copy);
 }
 
+Status IrEmitterUnnested::EmitExtraOutputsForReduce(
+    const HloInstruction* reduce, const llvm_ir::IrArray::Index& index,
+    tensorflow::gtl::ArraySlice<
+        std::pair<llvm_ir::ElementGenerator, ShapeIndex>>
+        extra_output_gens) {
+  for (int i = 0; i != extra_output_gens.size(); ++i) {
+    const HloInstruction* output = reduce->parent()->FusionInstruction();
+    llvm::Value* extra_output_address =
+        GetIrArray(*output, *output, extra_output_gens[i].second)
+            .EmitArrayElementAddress(index, &ir_builder_,
+                                     "extra_output_element_address");
+    TF_ASSIGN_OR_RETURN(llvm::Value* const extra_output_ir_value,
+                        extra_output_gens[i].first(index));
+    ir_builder_.CreateStore(extra_output_ir_value, extra_output_address);
+  }
+  return Status::OK();
+}
+
 Status IrEmitterUnnested::EmitReductionToScalar(
     HloInstruction* reduce, const Shape& input_shape,
-    const llvm_ir::ElementGenerator& input_gen,
-    const llvm_ir::ElementGenerator& init_value_gen, HloComputation* reducer) {
+    tensorflow::gtl::ArraySlice<llvm_ir::ElementGenerator> input_gens,
+    tensorflow::gtl::ArraySlice<llvm_ir::ElementGenerator> init_value_gens,
+    tensorflow::gtl::ArraySlice<HloComputation*> reducers,
+    tensorflow::gtl::ArraySlice<ShapeIndex> reduce_output_shapes,
+    tensorflow::gtl::ArraySlice<
+        std::pair<llvm_ir::ElementGenerator, ShapeIndex>>
+        extra_output_gens) {
   // Number of elements processed by a single thread.
   constexpr int64 kTileSize = 16;
   int64 num_elems = ShapeUtil::ElementsIn(input_shape);
@@ -962,14 +1042,19 @@ Status IrEmitterUnnested::EmitReductionToScalar(
   //
   auto loop_body_emitter =
       [=](const llvm_ir::IrArray::Index& tile_index) -> Status {
+    const int num_reduces = reducers.size();
     llvm::Type* element_ir_type =
         llvm_ir::PrimitiveTypeToIrType(input_shape.element_type(), module_);
-    llvm::Value* partial_reduction_result_address = ir_builder_.CreateAlloca(
-        element_ir_type, /*ArraySize=*/nullptr, "partial_reduction_result");
-    {
-      TF_ASSIGN_OR_RETURN(llvm::Value * init_ir_value,
-                          init_value_gen(llvm_ir::IrArray::Index({})));
+    std::vector<llvm::Value*> partial_reduction_result_addresses;
+    for (int i = 0; i != num_reduces; ++i) {
+      llvm::Value* partial_reduction_result_address = ir_builder_.CreateAlloca(
+          element_ir_type, /*ArraySize=*/nullptr,
+          "partial_reduction_result." + llvm::Twine(i));
+      TF_ASSIGN_OR_RETURN(llvm::Value* const init_ir_value,
+                          init_value_gens[i](llvm_ir::IrArray::Index({})));
       ir_builder_.CreateStore(init_ir_value, partial_reduction_result_address);
+      partial_reduction_result_addresses.push_back(
+          partial_reduction_result_address);
     }
 
     llvm::Value* x_in_tiles = tile_index[0];
@@ -1002,11 +1087,16 @@ Status IrEmitterUnnested::EmitReductionToScalar(
       llvm_ir::IrArray::Index input_index(
           /*linear=*/x, input_shape, &ir_builder_);
       llvm::Value* input_address = ir_builder_.CreateAlloca(element_ir_type);
-      TF_ASSIGN_OR_RETURN(llvm::Value * input_ir_value, input_gen(input_index));
-      ir_builder_.CreateStore(input_ir_value, input_address);
-      return (EmitCallToNestedComputation(
-          *reducer, {partial_reduction_result_address, input_address},
-          partial_reduction_result_address));
+      for (int i = 0; i != num_reduces; ++i) {
+        TF_ASSIGN_OR_RETURN(llvm::Value* const input_ir_value,
+                            input_gens[i](input_index));
+        ir_builder_.CreateStore(input_ir_value, input_address);
+        TF_RETURN_IF_ERROR(EmitCallToNestedComputation(
+            *reducers[i],
+            {partial_reduction_result_addresses[i], input_address},
+            partial_reduction_result_addresses[i]));
+      }
+      return EmitExtraOutputsForReduce(reduce, input_index, extra_output_gens);
     };
 
     // x_end = kTileSize + x_in_tiles * kTileSize, i.e., the location that's
@@ -1041,20 +1131,24 @@ Status IrEmitterUnnested::EmitReductionToScalar(
                                       : element_ir_type;
     for (int shuffle_distance = kWarpSize / 2; shuffle_distance >= 1;
          shuffle_distance /= 2) {
-      llvm::Value* partial_reduction_result = ir_builder_.CreateLoad(
-          ir_builder_.CreateBitCast(partial_reduction_result_address,
-                                    shuffle_ir_type->getPointerTo()),
-          "partial_reduction_result");
       llvm::Value* result_from_other_lane = ir_builder_.CreateAlloca(
           element_ir_type, nullptr, "result_from_other_lane");
-      ir_builder_.CreateStore(
-          EmitShuffleDown(partial_reduction_result,
-                          ir_builder_.getInt32(shuffle_distance), &ir_builder_),
-          ir_builder_.CreateBitCast(result_from_other_lane,
-                                    shuffle_ir_type->getPointerTo()));
-      TF_RETURN_IF_ERROR(EmitCallToNestedComputation(
-          *reducer, {partial_reduction_result_address, result_from_other_lane},
-          partial_reduction_result_address));
+      for (int i = 0; i != num_reduces; ++i) {
+        llvm::Value* partial_reduction_result = ir_builder_.CreateLoad(
+            ir_builder_.CreateBitCast(partial_reduction_result_addresses[i],
+                                      shuffle_ir_type->getPointerTo()),
+            "partial_reduction_result");
+        ir_builder_.CreateStore(
+            EmitShuffleDown(partial_reduction_result,
+                            ir_builder_.getInt32(shuffle_distance),
+                            &ir_builder_),
+            ir_builder_.CreateBitCast(result_from_other_lane,
+                                      shuffle_ir_type->getPointerTo()));
+        TF_RETURN_IF_ERROR(EmitCallToNestedComputation(
+            *reducers[i],
+            {partial_reduction_result_addresses[i], result_from_other_lane},
+            partial_reduction_result_addresses[i]));
+      }
     }
 
     const HloInstruction* output =
@@ -1070,14 +1164,21 @@ Status IrEmitterUnnested::EmitReductionToScalar(
         "lane_id_is_zero", &ir_builder_);
     llvm_ir::SetToFirstInsertPoint(if_lane_id_is_zero_data.true_block,
                                    &ir_builder_);
-    llvm::Value* output_address =
-        GetIrArray(*output, *output)
-            .EmitArrayElementAddress(
-                llvm_ir::IrArray::Index(/*linear=*/ir_builder_.getInt64(0),
-                                        output->shape(), &ir_builder_),
-                &ir_builder_, "output_element_address");
-    return EmitAtomicOperationForNestedComputation(
-        *reducer, output_address, partial_reduction_result_address);
+
+    for (int i = 0; i != num_reduces; ++i) {
+      llvm::Value* output_address =
+          GetIrArray(*output, *output, reduce_output_shapes[i])
+              .EmitArrayElementAddress(
+                  llvm_ir::IrArray::Index(
+                      /*linear=*/ir_builder_.getInt64(0),
+                      ShapeUtil::GetSubshape(output->shape(),
+                                             reduce_output_shapes[i]),
+                      &ir_builder_),
+                  &ir_builder_, "output_element_address");
+      TF_RETURN_IF_ERROR(EmitAtomicOperationForNestedComputation(
+          *reducers[i], output_address, partial_reduction_result_addresses[i]));
+    }
+    return Status::OK();
   };
 
   // Emit a parallel loop that iterates through all input tiles, one per thread.
@@ -1097,8 +1198,13 @@ Status IrEmitterUnnested::EmitReductionToScalar(
 
 Status IrEmitterUnnested::EmitColumnReduction(
     int64 height, int64 width, HloInstruction* reduce, const Shape& input_shape,
-    const llvm_ir::ElementGenerator& input_gen,
-    const llvm_ir::ElementGenerator& init_value_gen, HloComputation* reducer) {
+    tensorflow::gtl::ArraySlice<llvm_ir::ElementGenerator> input_gens,
+    tensorflow::gtl::ArraySlice<llvm_ir::ElementGenerator> init_value_gens,
+    tensorflow::gtl::ArraySlice<HloComputation*> reducers,
+    tensorflow::gtl::ArraySlice<ShapeIndex> reduce_output_shapes,
+    tensorflow::gtl::ArraySlice<
+        std::pair<llvm_ir::ElementGenerator, ShapeIndex>>
+        extra_output_gens) {
   // Divide the input matrix into tiles of size Kx1. For example, when the
   // input matrix is 4x4 and K=2, the tiled matrix looks like
   //
@@ -1108,9 +1214,13 @@ Status IrEmitterUnnested::EmitColumnReduction(
   //   4567  // Numbers indicate tile IDs.
   //
   // Each tile is first partially reduced to a scalar by a thread, and then the
-  // scalar is accumulated to the output vector using atomic operations. We
-  // choose 16 as the tile size, which matches Eigen's ColumnReduceKernel.
-  constexpr int64 kTileSize = 16;
+  // scalar is accumulated to the output vector using atomic operations.
+  //
+  // We choose 128 as the tile size based on empirical evidence. It's big enough
+  // to reduce the amount of atomic adds in the end, maximizing the memory
+  // bandwidth.
+  constexpr int64 kTileSize = 128;
+
   // If the height is not a multiple of the tile size, we pad the bottom of the
   // input matrix.
   const int64 height_in_tiles = CeilOfRatio(height, kTileSize);
@@ -1140,15 +1250,20 @@ Status IrEmitterUnnested::EmitColumnReduction(
   // }
   auto loop_body_emitter =
       [=](const llvm_ir::IrArray::Index& tile_index) -> Status {
+    const int num_reduces = reducers.size();
     // Emit the loop body that reduces one tile.
     llvm::Type* element_ir_type =
         llvm_ir::PrimitiveTypeToIrType(input_shape.element_type(), module_);
-    llvm::Value* partial_reduction_result_address = ir_builder_.CreateAlloca(
-        element_ir_type, /*ArraySize=*/nullptr, "partial_reduction_result");
-    {
-      TF_ASSIGN_OR_RETURN(llvm::Value * init_ir_value,
-                          init_value_gen(llvm_ir::IrArray::Index({})));
+    std::vector<llvm::Value*> partial_reduction_result_addresses;
+    for (int i = 0; i != num_reduces; ++i) {
+      llvm::Value* partial_reduction_result_address = ir_builder_.CreateAlloca(
+          element_ir_type, /*ArraySize=*/nullptr,
+          "partial_reduction_result." + llvm::Twine(i));
+      TF_ASSIGN_OR_RETURN(llvm::Value* const init_ir_value,
+                          init_value_gens[i](llvm_ir::IrArray::Index({})));
       ir_builder_.CreateStore(init_ir_value, partial_reduction_result_address);
+      partial_reduction_result_addresses.push_back(
+          partial_reduction_result_address);
     }
 
     // Emit an inner for-loop that partially reduces the elements in the given
@@ -1206,13 +1321,18 @@ Status IrEmitterUnnested::EmitColumnReduction(
                 .SourceIndexOfTranspose(normalized_input_shape, input_shape,
                                         transpose_dimension_mapping,
                                         &ir_builder_);
-        TF_ASSIGN_OR_RETURN(llvm::Value * input_ir_value,
-                            input_gen(input_index));
-        ir_builder_.CreateStore(input_ir_value, input_address);
+        for (int i = 0; i != num_reduces; ++i) {
+          TF_ASSIGN_OR_RETURN(llvm::Value* const input_ir_value,
+                              input_gens[i](input_index));
+          ir_builder_.CreateStore(input_ir_value, input_address);
+          TF_RETURN_IF_ERROR(EmitCallToNestedComputation(
+              *reducers[i],
+              {partial_reduction_result_addresses[i], input_address},
+              partial_reduction_result_addresses[i]));
+        }
+        return EmitExtraOutputsForReduce(reduce, input_index,
+                                         extra_output_gens);
       }
-      return (EmitCallToNestedComputation(
-          *reducer, {partial_reduction_result_address, input_address},
-          partial_reduction_result_address));
     };
 
     // y_end = kTileSize + y_in_tiles * kTileSize, i.e., the y location that's
@@ -1241,13 +1361,20 @@ Status IrEmitterUnnested::EmitColumnReduction(
                                    &ir_builder_);
     const HloInstruction* output =
         reduce->IsFused() ? reduce->parent()->FusionInstruction() : reduce;
-    llvm::Value* output_address =
-        GetIrArray(*output, *output)
-            .EmitArrayElementAddress(
-                llvm_ir::IrArray::Index(x, output->shape(), &ir_builder_),
-                &ir_builder_, "output_element_address");
-    return EmitAtomicOperationForNestedComputation(
-        *reducer, output_address, partial_reduction_result_address);
+    for (int i = 0; i != num_reduces; ++i) {
+      llvm::Value* output_address =
+          GetIrArray(*output, *output, reduce_output_shapes[i])
+              .EmitArrayElementAddress(
+                  llvm_ir::IrArray::Index(
+                      x,
+                      ShapeUtil::GetSubshape(output->shape(),
+                                             reduce_output_shapes[i]),
+                      &ir_builder_),
+                  &ir_builder_, "output_element_address");
+      TF_RETURN_IF_ERROR(EmitAtomicOperationForNestedComputation(
+          *reducers[i], output_address, partial_reduction_result_addresses[i]));
+    }
+    return Status::OK();
   };
 
   // Emit a parallel loop that iterate through all input tiles.
@@ -1265,12 +1392,42 @@ Status IrEmitterUnnested::EmitColumnReduction(
       .EmitLoop(IrName(reduce));
 }
 
+static std::pair<int64, int64> ComputeTilingSchemeForReduction(
+    int64 depth, int64 width, int64 kWarpSize) {
+  constexpr int64 kTargetNumElementsPerThread = 64;
+  int64 x_tile_size = kTargetNumElementsPerThread;
+  int64 z_tile_size = 1;
+
+  // Only tile along the x dimension with tile size kTargetNumElementsPerThread
+  // if doing so doesn't require a slow version of loop with bound check on each
+  // dimension. A more sophisticated heuristics is to enable tile along the
+  // x dimension with tile size kTargetNumElementsPerThread when either width is
+  // a factor of (kWarpSize * kTargetNumElementsPerThread) or width is big
+  // enough so that only a small fraction of the threads execute the slow
+  // version of loop with bound check.
+  if (width % (kWarpSize * kTargetNumElementsPerThread) != 0) {
+    x_tile_size = 8;
+    z_tile_size = 8;
+    while (depth % z_tile_size != 0) {
+      z_tile_size -= 1;
+    }
+  }
+
+  return std::pair<int64, int64>(x_tile_size, z_tile_size);
+}
+
 Status IrEmitterUnnested::EmitRowReduction(
     int64 depth, int64 height, int64 width, HloInstruction* reduce,
-    const Shape& input_shape, const llvm_ir::ElementGenerator& input_gen,
-    const llvm_ir::ElementGenerator& init_value_gen, HloComputation* reducer) {
+    const Shape& input_shape,
+    tensorflow::gtl::ArraySlice<llvm_ir::ElementGenerator> input_gens,
+    tensorflow::gtl::ArraySlice<llvm_ir::ElementGenerator> init_value_gens,
+    tensorflow::gtl::ArraySlice<HloComputation*> reducers,
+    tensorflow::gtl::ArraySlice<ShapeIndex> reduce_output_shapes,
+    tensorflow::gtl::ArraySlice<
+        std::pair<llvm_ir::ElementGenerator, ShapeIndex>>
+        extra_output_gens) {
   // A naive algorithm is:
-  // 1. Divide the input tensor into tiles of size 1x1xK.
+  // 1. Divide the x dimension of the input tensor into tiles of size 1x1xX.
   // 2. Partially reduces each tile to a scalar using one thread.
   // 3. Accumulates that scalar to the output vector using atomic operations.
   //
@@ -1281,15 +1438,15 @@ Status IrEmitterUnnested::EmitRowReduction(
   //   int y = linear_index / width_in_tiles % height;
   //   int z = linear_index / (height * width_in_tiles);
   //   float partial_result = 0;
-  //   for (element_id_in_tile : range(kTileSize)) {
-  //     int x = x_in_tiles * kTileSize + element_id_in_tile;
+  //   for (element_id_in_tile : range(x_tile_size)) {
+  //     int x = x_in_tiles * x_tile_size + element_id_in_tile;
   //     if (x < width)
   //       partial_result = reducer(partial_result, input[z][y][z]);
   //   }
   //   AtomicReducer(&output[y], partial_result);
   // }
   //
-  // Three optimizations are performed.
+  // Four optimizations are performed.
   //
   // 1. To coalesce global memory accesses, dilate the tile with a factor of 32
   // (i.e. the warp size). For example, suppose the width is 8x32=256. Instead
@@ -1316,29 +1473,44 @@ Status IrEmitterUnnested::EmitRowReduction(
   // element_id_in_tile, which makes the code more friendly to optimizations
   // such as LICM.
   //
+  // 4. When the width is too small and x_tile_size is less than the target
+  //    number of elements per thread and use a small factor of depth as
+  //    z_tile_size to increase the number of elements calculated by each
+  //    partial sum. This can reduce the needed number of dynamic shfl_down and
+  //    atomic operations.
+  //
   // for (linear_index = threadIdx.x + blockIdx.x * blockDim.x;
   //      linear_index < depth * height * width_in_tiles;
   //      linear_index += blockDim.x * gridDim.x) {
   //   int x_in_tiles = linear_index % width_in_tiles;
   //   int y = linear_index / width_in_tiles % height;
-  //   int z = linear_index / (height * width_in_tiles);
+  //   int z_in_tiles = linear_index / (height * width_in_tiles);
   //   int warp_id = x_in_tiles / warpSize;
   //   int lane_id = x_in_tiles % warpSize;
   //   float partial_result = 0;
   //   int x = warp_id * kTileSize * warpSize + lane_id;
-  //   if (width % (kTileSize * warpSize) == 0 ||
-  //       x + (kTileSize - 1) * warpSize < width) {
-  //     // The entire tile is in bounds.
-  //     for (int element_id_in_tile = 0; element_id_in_tile < kTileSize;
-  //        ++element_id_in_tile, x += warpSize) {
-  //       partial_result = Reducer(partial_result, input[z][y][x]);
+  //   if (width % (x_tile_size * warpSize) == 0 ||
+  //       x + (x_tile_size - 1) * warpSize < width) {
+  //     // The entire x_tile is in bounds.
+  //     for (int element_id_in_z_tile = 0; element_id_in_z_tile < z_tile_size;
+  //        ++element_id_in_z_tile) {
+  //       z = z_in_tiles * z_tile_size + element_id_in_z_tile;
+  //       for (int element_id_in_x_tile = 0;element_id_in_x_tile < x_tile_size;
+  //        ++element_id_in_x_tile, x += warpSize) {
+  //         partial_result = Reducer(partial_result, input[z][y][x]);
+  //       }
   //     }
   //   } else {
   //     // The tile is partially in bounds.
-  //     for (int element_id_in_tile = 0; element_id_in_tile < kTileSize;
+  //     for (int element_id_in_z_tile = 0; element_id_in_z_tile < z_tile_size;
+  //        ++element_id_in_z_tile) {
+  //       z = z_in_tiles * z_tile_size + element_id_in_z_tile;
+  //       for (int element_id_in_x_tile = 0; element_id_in_x_tile <
+  //       x_tile_size;
   //          ++element_id_in_tile, x += warpSize) {
-  //       if (x < width)
-  //         partial_result = Reducer(partial_result, input[z][y][x]);
+  //         if (x < width)
+  //           partial_result = Reducer(partial_result, input[z][y][x]);
+  //       }
   //     }
   //   }
   //   for (shuffle_distance = 16; shuffle_distance > 0; shuffle_distance /= 2)
@@ -1349,29 +1521,35 @@ Status IrEmitterUnnested::EmitRowReduction(
   //     AtomicReducer(&output[y], partial_result);
   // }
   //
-  // Choose 8 as the tile size, which matches Eigen's RowReduceKernel.
-  constexpr int64 kTileSize = 8;
+
+  int64 x_tile_size;
+  int64 z_tile_size;
+  std::tie(x_tile_size, z_tile_size) =
+      ComputeTilingSchemeForReduction(depth, width, kWarpSize);
+
   // Round the width in tiles up to the nearest multiple of kWarpSize, so that
   // the use of shfl_down is valid.
   const int64 width_in_tiles =
-      RoundUpToNearest(CeilOfRatio(width, kTileSize), kWarpSize);
+      RoundUpToNearest(CeilOfRatio(width, x_tile_size), kWarpSize);
 
-  auto loop_body_emitter =
-      [=](const llvm_ir::IrArray::Index& tile_index) -> Status {
-    // Emit the loop body that reduces one tile.
+  auto loop_body_emitter = [=](const llvm_ir::IrArray::Index& tile_index) {
+    // Emit the loop body that reduces one z-x-tile.
+    const int num_reduces = reducers.size();
     llvm::Type* element_ir_type = llvm_ir::PrimitiveTypeToIrType(
         input_shape.element_type(), ir_emitter_context_->llvm_module());
-    llvm::Value* partial_reduction_result_address = ir_builder_.CreateAlloca(
-        element_ir_type, /*ArraySize=*/nullptr, "partial_reduction_result");
-    {
-      TF_ASSIGN_OR_RETURN(llvm::Value * init_ir_value,
-                          init_value_gen(llvm_ir::IrArray::Index({})));
+    std::vector<llvm::Value*> partial_reduction_result_addresses;
+    for (int i = 0; i != num_reduces; ++i) {
+      llvm::Value* partial_reduction_result_address = ir_builder_.CreateAlloca(
+          element_ir_type, /*ArraySize=*/nullptr,
+          "partial_reduction_result." + llvm::Twine(i));
+      TF_ASSIGN_OR_RETURN(llvm::Value* const init_ir_value,
+                          init_value_gens[i](llvm_ir::IrArray::Index({})));
       ir_builder_.CreateStore(init_ir_value, partial_reduction_result_address);
+      partial_reduction_result_addresses.push_back(
+          partial_reduction_result_address);
     }
 
-    // Emit an inner for-loop that partially reduces the elements in the given
-    // tile.
-    llvm::Value* z = tile_index[0];
+    llvm::Value* z_tile = tile_index[0];
     llvm::Value* y = tile_index[1];
     llvm::Value* x_tile = tile_index[2];
     llvm::Value* warp_id = ir_builder_.CreateUDiv(
@@ -1379,102 +1557,132 @@ Status IrEmitterUnnested::EmitRowReduction(
     llvm::Value* lane_id = ir_builder_.CreateURem(
         x_tile, ir_builder_.getInt64(kWarpSize), "lane_id");
 
-    // The x-location of the last element in this tile.
-    //   last_x = lane_id + warpSize * (kTileSize - 1 + warp_id * kTileSize);
+    // The x-location of the last element in this z-x-tile.
+    //   last_x = lane_id + warpSize * (x_tile_size - 1 + warp_id *
+    //   x_tile_size);
     llvm::Value* last_x = ir_builder_.CreateNSWAdd(
-        lane_id,
-        ir_builder_.CreateNSWMul(
-            ir_builder_.getInt64(kWarpSize),
-            ir_builder_.CreateNSWAdd(
-                ir_builder_.getInt64(kTileSize - 1),
-                ir_builder_.CreateNSWMul(warp_id,
-                                         ir_builder_.getInt64(kTileSize)))));
-
-    auto emit_tile_element_loop = [=](bool tile_in_bounds) -> Status {
-      std::unique_ptr<llvm_ir::ForLoop> tile_element_loop =
-          llvm_ir::ForLoop::EmitForLoop("element_id_in_tile",
-                                        ir_builder_.getInt64(0),
-                                        ir_builder_.getInt64(kTileSize),
-                                        ir_builder_.getInt64(1), &ir_builder_);
-
-      // Emit the body of the partial reduction loop.
-      llvm_ir::SetToFirstInsertPoint(tile_element_loop->GetBodyBasicBlock(),
-                                     &ir_builder_);
-      // x = lane_id + warpSize * (element_id_in_tile + warp_id * kTileSize);
-      llvm::Value* x = ir_builder_.CreateNSWAdd(
-          lane_id,
-          ir_builder_.CreateNSWMul(
-              ir_builder_.getInt64(kWarpSize),
-              ir_builder_.CreateNSWAdd(
-                  tile_element_loop->GetIndVarValue(),
-                  ir_builder_.CreateNSWMul(warp_id,
-                                           ir_builder_.getInt64(kTileSize)))));
-
-      // Unless we know the tile is entirely in bounds, we have to emit a
-      // x-in-bounds check before reading from the input.
-      if (!tile_in_bounds) {
-        llvm_ir::LlvmIfData if_x_in_bounds_data = llvm_ir::EmitIfThenElse(
-            ir_builder_.CreateICmpULT(x, ir_builder_.getInt64(width)),
-            "x_in_bounds", &ir_builder_);
-
-        // Points ir_builder_ to the then-block.
-        llvm_ir::SetToFirstInsertPoint(if_x_in_bounds_data.true_block,
-                                       &ir_builder_);
-      }
+        lane_id, ir_builder_.CreateNSWMul(
+                     ir_builder_.getInt64(kWarpSize),
+                     ir_builder_.CreateNSWAdd(
+                         ir_builder_.getInt64(x_tile_size - 1),
+                         ir_builder_.CreateNSWMul(
+                             warp_id, ir_builder_.getInt64(x_tile_size)))));
+
+    KernelSupportLibrary ksl(
+        &ir_builder_,
+        /*unroll_mode=*/xla::llvm_ir::UnrollMode::kFullyUnroll,
+        /*prevent_vectorization=*/false);
+
+    // Emit a for-loop that partially reduces the elements in the given
+    // z-x-tile.
+    auto emit_z_x_tile_element_loop = [&](bool x_tile_in_bounds,
+                                          int64 x_tile_loop_bound) -> Status {
+      auto emit_z_tile_element_loop = [&](llvm::Value* z_indvar) -> Status {
+        llvm::Value* z = ir_builder_.CreateNSWAdd(
+            z_indvar, ir_builder_.CreateNSWMul(
+                          ir_builder_.getInt64(z_tile_size), z_tile));
+
+        TF_RETURN_IF_ERROR(ksl.For(
+            "x_tile",
+            /*start=*/0, /*end=*/x_tile_loop_bound, /*step=*/1,
+            [&](llvm::Value* x_indvar) -> Status {
+              // x = lane_id + warpSize * (element_id_in_x_tile + warp_id *
+              // x_tile_size);
+              llvm::Value* x = ir_builder_.CreateNSWAdd(
+                  lane_id,
+                  ir_builder_.CreateNSWMul(
+                      ir_builder_.getInt64(kWarpSize),
+                      ir_builder_.CreateNSWAdd(
+                          x_indvar,
+                          ir_builder_.CreateNSWMul(
+                              warp_id, ir_builder_.getInt64(x_tile_size)))));
+
+              // Unless we know the x-tile is entirely in bounds, we have to
+              // emit a x-in-bounds check before reading from the input.
+              if (!x_tile_in_bounds) {
+                llvm_ir::LlvmIfData if_x_in_bounds_data =
+                    llvm_ir::EmitIfThenElse(ir_builder_.CreateICmpULT(
+                                                x, ir_builder_.getInt64(width)),
+                                            "x_in_bounds", &ir_builder_);
+                // Points ir_builder_ to the then-block.
+                llvm_ir::SetToFirstInsertPoint(if_x_in_bounds_data.true_block,
+                                               &ir_builder_);
+              }
+
+              // Emit code that reads the input element and accumulates it
+              // to the partial reduction result.
+              llvm::Value* input_address =
+                  ir_builder_.CreateAlloca(element_ir_type);
+              {
+                // {z,y,x} is an index to input_3d_tensor_shape
+                // [depth,height,width]. We need to convert that to an index
+                // to input_shape (the shape of the operand of "reduce").
+                // This conversion is composed of a transposition from
+                // input_shape to normalized_input_shape and a reshape from
+                // normalized_input_shape to input_3d_tensor_shape.
+                const Shape normalized_input_shape = ShapeUtil::
+                    MakeShapeWithDescendingLayoutAndSamePhysicalLayout(
+                        input_shape);
+                auto input_shape_min2maj =
+                    LayoutUtil::MinorToMajor(input_shape);
+                const std::vector<int64> transpose_dimension_mapping(
+                    input_shape_min2maj.rbegin(), input_shape_min2maj.rend());
+                const Shape input_3d_tensor_shape =
+                    ShapeUtil::MakeShapeWithDescendingLayout(
+                        input_shape.element_type(), {depth, height, width});
+                const llvm_ir::IrArray::Index input_3d_tensor_index(
+                    {z, y, x}, input_3d_tensor_shape, &ir_builder_);
+                const llvm_ir::IrArray::Index input_index =
+                    input_3d_tensor_index
+                        .SourceIndexOfReshape(input_3d_tensor_shape,
+                                              normalized_input_shape,
+                                              &ir_builder_)
+                        .SourceIndexOfTranspose(
+                            normalized_input_shape, input_shape,
+                            transpose_dimension_mapping, &ir_builder_);
+
+                for (int i = 0; i != num_reduces; ++i) {
+                  TF_ASSIGN_OR_RETURN(llvm::Value* const input_ir_value,
+                                      input_gens[i](input_index));
+                  ir_builder_.CreateStore(input_ir_value, input_address);
+                  TF_RETURN_IF_ERROR(EmitCallToNestedComputation(
+                      *reducers[i],
+                      {partial_reduction_result_addresses[i], input_address},
+                      partial_reduction_result_addresses[i]));
+                }
+                return EmitExtraOutputsForReduce(reduce, input_index,
+                                                 extra_output_gens);
+              }
+            }));
+        return Status::OK();
+      };
 
-      // Emit code that reads the input element and accumulates it to the
-      // partial reduction result.
-      llvm::Value* input_address = ir_builder_.CreateAlloca(element_ir_type);
-      {
-        // {z,y,x} is an index to input_3d_tensor_shape [depth,height,width]. We
-        // need to convert that to an index to input_shape (the shape of the
-        // operand of "reduce"). This conversion is composed of a transposition
-        // from input_shape to normalized_input_shape and a reshape from
-        // normalized_input_shape to input_3d_tensor_shape.
-        const Shape normalized_input_shape =
-            ShapeUtil::MakeShapeWithDescendingLayoutAndSamePhysicalLayout(
-                input_shape);
-        auto input_shape_min2maj = LayoutUtil::MinorToMajor(input_shape);
-        const std::vector<int64> transpose_dimension_mapping(
-            input_shape_min2maj.rbegin(), input_shape_min2maj.rend());
-        const Shape input_3d_tensor_shape =
-            ShapeUtil::MakeShapeWithDescendingLayout(input_shape.element_type(),
-                                                     {depth, height, width});
-        const llvm_ir::IrArray::Index input_3d_tensor_index(
-            {z, y, x}, input_3d_tensor_shape, &ir_builder_);
-        const llvm_ir::IrArray::Index input_index =
-            input_3d_tensor_index
-                .SourceIndexOfReshape(input_3d_tensor_shape,
-                                      normalized_input_shape, &ir_builder_)
-                .SourceIndexOfTranspose(normalized_input_shape, input_shape,
-                                        transpose_dimension_mapping,
-                                        &ir_builder_);
-        TF_ASSIGN_OR_RETURN(llvm::Value * input_ir_value,
-                            input_gen(input_index));
-        ir_builder_.CreateStore(input_ir_value, input_address);
-      }
-      return EmitCallToNestedComputation(
-          *reducer, {partial_reduction_result_address, input_address},
-          partial_reduction_result_address);
+      return ksl.For("z_tile",
+                     /*start=*/0, /*end=*/z_tile_size, /*step=*/1,
+                     emit_z_tile_element_loop);
     };
 
     llvm::Value* tile_in_bounds = ir_builder_.CreateOr(
-        ir_builder_.getInt1(width % (kTileSize * kWarpSize) == 0),
+        ir_builder_.getInt1(width % (x_tile_size * kWarpSize) == 0),
         ir_builder_.CreateICmpULT(last_x, ir_builder_.getInt64(width)));
-    llvm_ir::LlvmIfData if_tile_in_bounds_data =
-        llvm_ir::EmitIfThenElse(tile_in_bounds, "tile_in_bounds", &ir_builder_);
-    llvm_ir::SetToFirstInsertPoint(if_tile_in_bounds_data.true_block,
-                                   &ir_builder_);
-    TF_RETURN_IF_ERROR(emit_tile_element_loop(/*tile_in_bounds=*/true));
-    llvm_ir::SetToFirstInsertPoint(if_tile_in_bounds_data.false_block,
-                                   &ir_builder_);
-    TF_RETURN_IF_ERROR(emit_tile_element_loop(/*tile_in_bounds=*/false));
 
-    // After the if-then-else statement on tile_in_bounds, emit calls to
-    // shfl_down that accumulate the partial reduction results of all threads
-    // from the warp.
-    llvm_ir::SetToFirstInsertPoint(if_tile_in_bounds_data.after_block,
-                                   &ir_builder_);
+    TF_RETURN_IF_ERROR(
+        ksl.If(tile_in_bounds,
+               /*true_block_generator=*/
+               [&]() -> Status {
+                 return emit_z_x_tile_element_loop(/*x_tile_in_bounds=*/true,
+                                                   x_tile_size);
+               },
+               /*false_block_generator=*/
+               [&]() -> Status {
+                 return emit_z_x_tile_element_loop(
+                     /*x_tile_in_bounds=*/false,
+                     CeilOfRatio(width % (x_tile_size * kWarpSize), kWarpSize));
+               }));
+
+    // After accumulating the elements of the z_x_tile, emit calls to
+    // shfl_down that accumulate the partial reduction results of all
+    // threads in a warp.
     int bit_width = llvm_ir::GetSizeInBits(element_ir_type);
     // bitcast cannot be applied to aggregate types (even packed ones), so we
     // instead bitcast addresses of load/store to intN* of the same bit-width.
@@ -1483,20 +1691,24 @@ Status IrEmitterUnnested::EmitRowReduction(
                                       : element_ir_type;
     for (int shuffle_distance = 16; shuffle_distance >= 1;
          shuffle_distance /= 2) {
-      llvm::Value* partial_reduction_result = ir_builder_.CreateLoad(
-          ir_builder_.CreateBitCast(partial_reduction_result_address,
-                                    shuffle_ir_type->getPointerTo()),
-          "partial_reduction_result");
       llvm::Value* result_from_other_lane = ir_builder_.CreateAlloca(
           element_ir_type, nullptr, "result_from_other_lane");
-      ir_builder_.CreateStore(
-          EmitShuffleDown(partial_reduction_result,
-                          ir_builder_.getInt32(shuffle_distance), &ir_builder_),
-          ir_builder_.CreateBitCast(result_from_other_lane,
-                                    shuffle_ir_type->getPointerTo()));
-      TF_RETURN_IF_ERROR(EmitCallToNestedComputation(
-          *reducer, {partial_reduction_result_address, result_from_other_lane},
-          partial_reduction_result_address));
+      for (int i = 0; i != num_reduces; ++i) {
+        llvm::Value* partial_reduction_result = ir_builder_.CreateLoad(
+            ir_builder_.CreateBitCast(partial_reduction_result_addresses[i],
+                                      shuffle_ir_type->getPointerTo()),
+            "partial_reduction_result");
+        ir_builder_.CreateStore(
+            EmitShuffleDown(partial_reduction_result,
+                            ir_builder_.getInt32(shuffle_distance),
+                            &ir_builder_),
+            ir_builder_.CreateBitCast(result_from_other_lane,
+                                      shuffle_ir_type->getPointerTo()));
+        TF_RETURN_IF_ERROR(EmitCallToNestedComputation(
+            *reducers[i],
+            {partial_reduction_result_addresses[i], result_from_other_lane},
+            partial_reduction_result_addresses[i]));
+      }
     }
 
     const HloInstruction* output =
@@ -1510,19 +1722,34 @@ Status IrEmitterUnnested::EmitRowReduction(
         "lane_id_is_zero", &ir_builder_);
     llvm_ir::SetToFirstInsertPoint(if_lane_id_is_zero_data.true_block,
                                    &ir_builder_);
-    llvm::Value* output_address =
-        GetIrArray(*output, *output)
-            .EmitArrayElementAddress(
-                llvm_ir::IrArray::Index(y, output->shape(), &ir_builder_),
-                &ir_builder_, "output_element_address");
-    return EmitAtomicOperationForNestedComputation(
-        *reducer, output_address, partial_reduction_result_address);
+    for (int i = 0; i != num_reduces; ++i) {
+      llvm::Value* output_address =
+          GetIrArray(*output, *output, reduce_output_shapes[i])
+              .EmitArrayElementAddress(
+                  llvm_ir::IrArray::Index(
+                      y,
+                      ShapeUtil::GetSubshape(output->shape(),
+                                             reduce_output_shapes[i]),
+                      &ir_builder_),
+                  &ir_builder_, "output_element_address");
+      if (x_tile_size * z_tile_size < depth * width) {
+        TF_RETURN_IF_ERROR(EmitAtomicOperationForNestedComputation(
+            *reducers[i], output_address,
+            partial_reduction_result_addresses[i]));
+      } else {
+        TF_RETURN_IF_ERROR(EmitCallToNestedComputation(
+            *reducers[i],
+            {output_address, partial_reduction_result_addresses[i]},
+            output_address));
+      }
+    }
+    return Status::OK();
   };
 
   // Emit a parallel loop that iterates through every input tiles.
   Shape tiled_input_shape = ShapeUtil::MakeShapeWithLayout(
-      reduce->shape().element_type(), {depth, height, width_in_tiles},
-      {2, 1, 0});
+      reduce->shape().element_type(),
+      {depth / z_tile_size, height, width_in_tiles}, {2, 1, 0});
   LaunchDimensions launch_dimensions = CalculateLaunchDimensions(
       tiled_input_shape, ir_emitter_context_->device_description());
   CHECK(LastThunk()->kind() == Thunk::Kind::kSequential);
@@ -1543,10 +1770,14 @@ Status IrEmitterUnnested::EmitRowReduction(
 //               elementwise.
 Status IrEmitterUnnested::EmitReductionToVector(
     HloInstruction* reduce, const Shape& input_shape,
-    const llvm_ir::ElementGenerator& input_gen,
-    const llvm_ir::ElementGenerator& init_value_gen,
+    tensorflow::gtl::ArraySlice<llvm_ir::ElementGenerator> input_gens,
+    tensorflow::gtl::ArraySlice<llvm_ir::ElementGenerator> init_value_gens,
     tensorflow::gtl::ArraySlice<int64> dimensions_to_reduce,
-    HloComputation* reducer) {
+    tensorflow::gtl::ArraySlice<HloComputation*> reducers,
+    tensorflow::gtl::ArraySlice<ShapeIndex> reduce_output_shapes,
+    tensorflow::gtl::ArraySlice<
+        std::pair<llvm_ir::ElementGenerator, ShapeIndex>>
+        extra_output_gens) {
   // This emission requires "reduce" to have an input layout. It is either set
   // by LayoutAssignment (for a top-level kReduce) or by InstructionFusion (for
   // a fused kReduce).
@@ -1581,8 +1812,9 @@ Status IrEmitterUnnested::EmitReductionToVector(
   // `EmitReductionToVector`, we only need to check whether the minormost
   // dimension of the input is to keep.
   if (input_dims_to_keep.empty()) {
-    return EmitReductionToScalar(reduce, input_shape, input_gen, init_value_gen,
-                                 reducer);
+    return EmitReductionToScalar(reduce, input_shape, input_gens,
+                                 init_value_gens, reducers,
+                                 reduce_output_shapes, extra_output_gens);
   } else if (input_dims_to_keep.front() ==
              LayoutUtil::Minor(input_shape.layout(), 0)) {
     // Column reduction. Treat the result of "input" as a matrix whose width
@@ -1599,8 +1831,9 @@ Status IrEmitterUnnested::EmitReductionToVector(
         height *= input_shape.dimensions(input_dim);
       }
     }
-    return EmitColumnReduction(height, width, reduce, input_shape, input_gen,
-                               init_value_gen, reducer);
+    return EmitColumnReduction(height, width, reduce, input_shape, input_gens,
+                               init_value_gens, reducers, reduce_output_shapes,
+                               extra_output_gens);
   } else {
     // Reduce the row dimension of a matrix or reduce dimension 0 and 2 in a
     // 3D tensor. The size of dimension 1 (the height) is the size of the
@@ -1626,7 +1859,8 @@ Status IrEmitterUnnested::EmitReductionToVector(
     }
     const int64 height = ShapeUtil::ElementsIn(reduce->shape());
     return EmitRowReduction(depth, height, width, reduce, input_shape,
-                            input_gen, init_value_gen, reducer);
+                            input_gens, init_value_gens, reducers,
+                            reduce_output_shapes, extra_output_gens);
   }
 }
 
@@ -1650,16 +1884,15 @@ Status IrEmitterUnnested::HandleReduce(HloInstruction* reduce) {
         MakeUnique<SequentialThunk>(std::move(thunks), reduce));
 
     return EmitReductionToVector(
-        reduce, input->shape(),
-        [&](const llvm_ir::IrArray::Index& index) {
+        reduce, input->shape(), {[&](const llvm_ir::IrArray::Index& index) {
           return GetIrArray(*input, *reduce)
               .EmitReadArrayElement(index, &ir_builder_);
-        },
-        [&](const llvm_ir::IrArray::Index& index) {
+        }},
+        {[&](const llvm_ir::IrArray::Index& index) {
           return GetIrArray(*init_value, *reduce)
               .EmitReadArrayElement(index, &ir_builder_);
-        },
-        dimensions_to_reduce, reducer);
+        }},
+        dimensions_to_reduce, {reducer}, {{}}, {});
   }
 
   thunk_sequence_->emplace_back(BuildKernelThunk(reduce));
@@ -2281,7 +2514,9 @@ std::unique_ptr<Thunk> IrEmitterUnnested::BuildGemmThunk(
     if (alpha->opcode() == HloOpcode::kBroadcast) {
       alpha = alpha->operand(0);
     }
-    alpha = inst->operand(alpha->parameter_number());
+    if (alpha->opcode() == HloOpcode::kParameter) {
+      alpha = inst->operand(alpha->parameter_number());
+    }
     // TODO(b/74185543): Remove the following if block once we support fusion
     // with a non-constant as well. Then we will just always use the constant
     // on the device.
@@ -2324,21 +2559,30 @@ std::unique_ptr<Thunk> IrEmitterUnnested::BuildFftThunk(
 }
 
 StatusOr<std::unique_ptr<Thunk>> IrEmitterUnnested::BuildInitializerThunk(
-    const HloInstruction* hlo) {
+    const HloInstruction* hlo, const ShapeIndex& index) {
   bool fused = HloOpcode::kFusion == hlo->opcode();
   const HloInstruction* inst = fused ? hlo->fused_expression_root() : hlo;
-  const HloInstruction* init_value = [&] {
+  const HloInstruction* init_value_operand = [&] {
     switch (inst->opcode()) {
       case HloOpcode::kSelectAndScatter:
         return inst->operand(2);
       case HloOpcode::kReduce:
         return inst->operand(1);
+      case HloOpcode::kTuple:
+        CHECK(hlo->IsMultiOutputFusion())
+            << ": " << hlo->ToString() << " is not a multi-output fusion.";
+        CHECK(inst->operand(index.back())->opcode() == HloOpcode::kReduce)
+            << ": Found '" << inst->operand(index.back())->opcode() << "' in "
+            << inst->ToString() << " but expected 'reduce'.";
+        // For multi-output fusion look through the tuple.
+        return inst->operand(index.back())->operand(1);
       default:
         LOG(FATAL) << "Opcode " << inst->opcode()
                    << " should not need an initializer.";
     }
   }();
 
+  const HloInstruction* init_value = init_value_operand;
   if (fused && init_value->opcode() == HloOpcode::kParameter) {
     init_value = hlo->operand(init_value->parameter_number());
   }
@@ -2356,7 +2600,7 @@ StatusOr<std::unique_ptr<Thunk>> IrEmitterUnnested::BuildInitializerThunk(
     ArraySlice<uint8> literal_bytes(
         reinterpret_cast<const uint8*>(literal.untyped_data()), num_bytes);
     if (c_all_of(literal_bytes, [](uint8 byte) { return byte == 0; })) {
-      return {MakeUnique<MemzeroThunk>(GetAllocationSlice(*hlo), hlo)};
+      return {MakeUnique<MemzeroThunk>(GetAllocationSlice(*hlo, index), hlo)};
     }
 
     // If the literal is 8 or 16 bits wide, we can emit a 32-bit memset by
@@ -2372,8 +2616,8 @@ StatusOr<std::unique_ptr<Thunk>> IrEmitterUnnested::BuildInitializerThunk(
         pattern16 = literal_bytes.front();
       }
       uint32 pattern32 = uint32{pattern16} | (uint32{pattern16} << 16);
-      return {MakeUnique<Memset32BitValueThunk>(pattern32,
-                                                GetAllocationSlice(*hlo), hlo)};
+      return {MakeUnique<Memset32BitValueThunk>(
+          pattern32, GetAllocationSlice(*hlo, index), hlo)};
     }
 
     // If the literal is an even multiple of 32 bits wide, we can emit a 32-bit
@@ -2383,20 +2627,31 @@ StatusOr<std::unique_ptr<Thunk>> IrEmitterUnnested::BuildInitializerThunk(
                literal_bytes.size() - 4) == 0) {
       uint32 word;
       memcpy(&word, literal_bytes.data(), sizeof(word));
-      return {MakeUnique<Memset32BitValueThunk>(word, GetAllocationSlice(*hlo),
-                                                hlo)};
+      return {MakeUnique<Memset32BitValueThunk>(
+          word, GetAllocationSlice(*hlo, index), hlo)};
     }
   }
 
   // Otherwise fall back to our slow initializer code.
   std::unique_ptr<KernelThunk> kernel_thunk = BuildKernelThunk(hlo);
-  TF_RETURN_IF_ERROR(EmitTargetElementLoopInThunk(
-      *hlo,
-      [=](const llvm_ir::IrArray::Index& index) {
-        return GetIrArray(*init_value, *hlo)
-            .EmitReadArrayElement(index, &ir_builder_);
-      },
-      kernel_thunk.get()));
+  LaunchDimensions launch_dimensions =
+      CalculateLaunchDimensions(ShapeUtil::GetSubshape(hlo->shape(), index),
+                                ir_emitter_context_->device_description());
+  UpdateLaunchDimensions(launch_dimensions, kernel_thunk.get(),
+                         ir_emitter_context_->llvm_module());
+  // If the init_value was fused into this reduce we have to generate it first.
+  if (fused && init_value_operand->opcode() != HloOpcode::kParameter) {
+    CHECK_EQ(HloOpcode::kConstant, init_value_operand->opcode());
+    TF_RETURN_IF_ERROR(HandleConstant(const_cast<HloInstruction*>(init_value)));
+  }
+  TF_RETURN_IF_ERROR(ParallelLoopEmitter(
+                         [=](const llvm_ir::IrArray::Index& index) {
+                           return GetIrArray(*init_value, *hlo)
+                               .EmitReadArrayElement(index, &ir_builder_);
+                         },
+                         GetIrArray(*hlo, *hlo, index), launch_dimensions,
+                         &ir_builder_)
+                         .EmitLoop(IrName(hlo)));
 
   // Clean up state left behind by emitting the loop above.  (This is normally
   // done in IrEmitterUnnested::Postprocess().)
diff --git a/tensorflow/compiler/xla/service/gpu/ir_emitter_unnested.h b/tensorflow/compiler/xla/service/gpu/ir_emitter_unnested.h
index e42c5e86862576bad1c8610652d1c50d2364cd83..202231b82f3877c11cf932bd00a8aac350fd0afa 100644
--- a/tensorflow/compiler/xla/service/gpu/ir_emitter_unnested.h
+++ b/tensorflow/compiler/xla/service/gpu/ir_emitter_unnested.h
@@ -100,6 +100,13 @@ class IrEmitterUnnested : public IrEmitter {
       const HloInstruction& inst,
       tensorflow::gtl::ArraySlice<const BufferAllocation*> args);
 
+  // Helper for writing extra outputs from inside a reduce kernel.
+  Status EmitExtraOutputsForReduce(
+      const HloInstruction* reduce, const llvm_ir::IrArray::Index& index,
+      tensorflow::gtl::ArraySlice<
+          std::pair<llvm_ir::ElementGenerator, ShapeIndex>>
+          extra_output_gens);
+
   // EmitColumnReduction and EmitRowReduction emit code for column and row
   // reduction of a matrix and/or 3D tensor. Row and column reduction have
   // different memory access pattern, so for performance their implementations
@@ -110,28 +117,43 @@ class IrEmitterUnnested : public IrEmitter {
   // `EmitReductionToVector`. Note that input shape might not be
   // [height x width], but can be bitcast to [height x weight] with "height"
   // being the major dimension.
-  Status EmitColumnReduction(int64 height, int64 width, HloInstruction* reduce,
-                             const Shape& input_shape,
-                             const llvm_ir::ElementGenerator& input_gen,
-                             const llvm_ir::ElementGenerator& init_value_gen,
-                             HloComputation* reducer);
+  Status EmitColumnReduction(
+      int64 height, int64 width, HloInstruction* reduce,
+      const Shape& input_shape,
+      tensorflow::gtl::ArraySlice<llvm_ir::ElementGenerator> input_gens,
+      tensorflow::gtl::ArraySlice<llvm_ir::ElementGenerator> init_value_gens,
+      tensorflow::gtl::ArraySlice<HloComputation*> reducers,
+      tensorflow::gtl::ArraySlice<ShapeIndex> reduce_output_shapes,
+      tensorflow::gtl::ArraySlice<
+          std::pair<llvm_ir::ElementGenerator, ShapeIndex>>
+          extra_output_gens);
 
   // Emits code that reduces a 3D tensor of shape [depth x height x width] to a
   // vector of shape [height]. Other parameters have the same meaning as those
   // of `EmitReductionToVector`. Note that input shape might not be
   // [depth x height x width], but can be bitcast to [depth x height x weight]
   // with "depth" being the most major dimension.
-  Status EmitRowReduction(int64 depth, int64 height, int64 width,
-                          HloInstruction* reduce, const Shape& input_shape,
-                          const llvm_ir::ElementGenerator& input_gen,
-                          const llvm_ir::ElementGenerator& init_value_gen,
-                          HloComputation* reducer);
+  Status EmitRowReduction(
+      int64 depth, int64 height, int64 width, HloInstruction* reduce,
+      const Shape& input_shape,
+      tensorflow::gtl::ArraySlice<llvm_ir::ElementGenerator> input_gens,
+      tensorflow::gtl::ArraySlice<llvm_ir::ElementGenerator> init_value_gens,
+      tensorflow::gtl::ArraySlice<HloComputation*> reducers,
+      tensorflow::gtl::ArraySlice<ShapeIndex> reduce_output_shapes,
+      tensorflow::gtl::ArraySlice<
+          std::pair<llvm_ir::ElementGenerator, ShapeIndex>>
+          extra_output_gens);
 
   // Emits code that reduces a tensor of arbitrary rank to a scalar.
-  Status EmitReductionToScalar(HloInstruction* reduce, const Shape& input_shape,
-                               const llvm_ir::ElementGenerator& input_gen,
-                               const llvm_ir::ElementGenerator& init_value_gen,
-                               HloComputation* reducer);
+  Status EmitReductionToScalar(
+      HloInstruction* reduce, const Shape& input_shape,
+      tensorflow::gtl::ArraySlice<llvm_ir::ElementGenerator> input_gens,
+      tensorflow::gtl::ArraySlice<llvm_ir::ElementGenerator> init_value_gens,
+      tensorflow::gtl::ArraySlice<HloComputation*> reducers,
+      tensorflow::gtl::ArraySlice<ShapeIndex> reduce_output_shapes,
+      tensorflow::gtl::ArraySlice<
+          std::pair<llvm_ir::ElementGenerator, ShapeIndex>>
+          extra_output_gens);
 
   // Figures out whether `reduce` is a row or column reduction, and which
   // dimensions to reduce, and calls either `EmitRowReduction` or
@@ -141,13 +163,24 @@ class IrEmitterUnnested : public IrEmitter {
   // generate elements of the input and the initial value. Other parameters mean
   // the same as for `HandleReduce`.
   //
+  // Multiple reduces can be emitted in the same loop, assuming they have the
+  // same input and output shapes, and the same reduce dimensions.
+  //
+  // extra_output_gens can contain extra generators for intermediate outputs.
+  // These must have the same shape as the reduce input as they are computed
+  // when the reduce inputs are being read.
+  //
   // Prerequisite: `IsReductionToVector(*reduce)`
   Status EmitReductionToVector(
       HloInstruction* reduce, const Shape& input_shape,
-      const llvm_ir::ElementGenerator& input_gen,
-      const llvm_ir::ElementGenerator& init_value_gen,
+      tensorflow::gtl::ArraySlice<llvm_ir::ElementGenerator> input_gens,
+      tensorflow::gtl::ArraySlice<llvm_ir::ElementGenerator> init_value_gens,
       tensorflow::gtl::ArraySlice<int64> dimensions_to_reduce,
-      HloComputation* reducer);
+      tensorflow::gtl::ArraySlice<HloComputation*> reducers,
+      tensorflow::gtl::ArraySlice<ShapeIndex> reduce_output_shapes,
+      tensorflow::gtl::ArraySlice<
+          std::pair<llvm_ir::ElementGenerator, ShapeIndex>>
+          extra_output_gens);
 
   // Returns a KernelThunk that invokes the kernel emitted for `inst`. The
   // caller needs to make sure `inst` outlives the lifetime of the returned
@@ -166,7 +199,7 @@ class IrEmitterUnnested : public IrEmitter {
   // Returns a thunk that, given a reduce or select-and-scatter op, initializes
   // its memory to the appropriate initial value.
   StatusOr<std::unique_ptr<Thunk>> BuildInitializerThunk(
-      const HloInstruction* hlo);
+      const HloInstruction* hlo, const ShapeIndex& index = {});
 
   // Returns a thunk that calls host-to-device cuMemcpy to implement `inst`.
   std::unique_ptr<Thunk> BuildHostToDeviceCopyThunk(const HloInstruction* inst);
diff --git a/tensorflow/compiler/xla/service/gpu/multi_output_fusion.cc b/tensorflow/compiler/xla/service/gpu/multi_output_fusion.cc
new file mode 100644
index 0000000000000000000000000000000000000000..e3f444a1268cf8e3e551a3c5b986fee0339fa327
--- /dev/null
+++ b/tensorflow/compiler/xla/service/gpu/multi_output_fusion.cc
@@ -0,0 +1,144 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/service/gpu/multi_output_fusion.h"
+
+#include <stdint.h>
+#include <algorithm>
+#include <iterator>
+#include <list>
+#include <memory>
+#include <string>
+#include <utility>
+
+#include "tensorflow/compiler/xla/service/hlo_instruction.h"
+#include "tensorflow/compiler/xla/service/hlo_opcode.h"
+#include "tensorflow/compiler/xla/shape_util.h"
+#include "tensorflow/core/platform/types.h"
+
+namespace xla {
+namespace gpu {
+
+GpuMultiOutputFusion::GpuMultiOutputFusion() : MultiOutputFusion(INT64_MAX) {}
+
+bool GpuMultiOutputFusion::ShapesCompatibleForFusion(HloInstruction* instr1,
+                                                     HloInstruction* instr2) {
+  auto get_element_instr =
+      [&](const HloInstruction* instr) -> const HloInstruction* {
+    const HloInstruction* element_instr = instr;
+    if (instr->opcode() == HloOpcode::kFusion) {
+      auto fused_expression_root = instr->fused_expression_root();
+      if (instr->IsMultiOutputFusion()) {
+        // If possible, we want to pick a reduce operand of the fusion root,
+        // because it has the most constraints.
+        for (const auto* inst : fused_expression_root->operands()) {
+          if (inst->opcode() == HloOpcode::kReduce) {
+            return inst;
+          }
+        }
+        return fused_expression_root->operands()[0];
+      } else {
+        element_instr = fused_expression_root;
+      }
+    }
+    return element_instr;
+  };
+
+  auto get_element_shape = [&](const HloInstruction* element_instr) {
+    // Special handling of kReduce instructions -- the fusion
+    // applies to the first operand.
+    if (element_instr->opcode() == HloOpcode::kReduce) {
+      return element_instr->operand(0)->shape();
+    }
+    return element_instr->shape();
+  };
+
+  // The shapes in all tuple operands should agree, unless it is a reduce.
+  // In that case, the operand of the reduce needs to have the same shape
+  // as the other tuple operands, but also we need to compare the output
+  // shapes of the reduces.
+  auto* element_instr_1 = get_element_instr(instr1);
+  auto* element_instr_2 = get_element_instr(instr2);
+  if (element_instr_1->opcode() == HloOpcode::kReduce &&
+      element_instr_2->opcode() == HloOpcode::kReduce &&
+      !ShapeUtil::Equal(element_instr_1->shape(), element_instr_2->shape())) {
+    return false;
+  }
+  // The elementwise output shapes must be the same (including layout).
+  return ShapeUtil::Equal(get_element_shape(element_instr_1),
+                          get_element_shape(element_instr_2));
+}
+
+bool GpuMultiOutputFusion::IsProfitableOperand(HloInstruction* instr) {
+  // kConstant instruction will not have memory reads, so it won't be a profit
+  // source. Skip them.
+  if (instr->opcode() == HloOpcode::kConstant &&
+      ShapeUtil::IsEffectiveScalar(instr->shape())) {
+    return false;
+  }
+  // We don't target to fuse producer/consumer instructions -- this should
+  // be taken care of by the instruction_fusion pass. If instr has only
+  // one user, it will not have sibling instructions. We won't consider it.
+  if (instr->user_count() < 2) {
+    return false;
+  }
+  return true;
+}
+
+namespace {
+bool IsReduction(HloInstruction* instr) {
+  if (instr->IsMultiOutputFusion()) {
+    for (const HloInstruction* operand :
+         instr->fused_expression_root()->operands()) {
+      if (operand->opcode() == HloOpcode::kReduce) {
+        return true;
+      }
+    }
+    return false;
+  } else if (instr->opcode() == HloOpcode::kFusion) {
+    return instr->fused_expression_root()->opcode() == HloOpcode::kReduce;
+  } else {
+    return instr->opcode() == HloOpcode::kReduce;
+  }
+}
+}  // namespace
+
+bool GpuMultiOutputFusion::IsFusible(HloInstruction* instr) {
+  return IsReduction(instr);
+}
+
+int64 GpuMultiOutputFusion::GetProfit(HloInstruction* instr1,
+                                      HloInstruction* instr2) {
+  tensorflow::gtl::FlatSet<HloInstruction*> in_list;
+  for (auto instr : instr1->operands()) {
+    if (!IsProfitableOperand(instr)) {
+      continue;
+    }
+    in_list.insert(instr);
+  }
+  int64 profit = 0;
+  for (auto instr : instr2->operands()) {
+    if (!IsProfitableOperand(instr) || in_list.count(instr) == 0) {
+      continue;
+    }
+    profit += ShapeUtil::ByteSizeOf(instr->shape());
+  }
+  VLOG(2) << "Fusing instr1=" << instr1->name() << " instr2=" << instr2->name()
+          << ", the profit is =" << profit;
+  return profit;
+}
+
+}  // namespace gpu
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/gpu/multi_output_fusion.h b/tensorflow/compiler/xla/service/gpu/multi_output_fusion.h
new file mode 100644
index 0000000000000000000000000000000000000000..5451a93cec5e4aeca05717c181edb9dad0305c83
--- /dev/null
+++ b/tensorflow/compiler/xla/service/gpu/multi_output_fusion.h
@@ -0,0 +1,55 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_GPU_MULTI_OUTPUT_FUSION_H_
+#define TENSORFLOW_COMPILER_XLA_SERVICE_GPU_MULTI_OUTPUT_FUSION_H_
+
+#include "tensorflow/compiler/xla/service/multi_output_fusion.h"
+
+namespace xla {
+namespace gpu {
+
+// Multi-output fusion of sibling and producer-consumer instructions for the
+// Jellyfish backend.
+class GpuMultiOutputFusion : public MultiOutputFusion {
+ public:
+  GpuMultiOutputFusion();
+
+ protected:
+  // Test if instr1 and instr2 have the compatible shapes that can be legally
+  // fused.
+  bool ShapesCompatibleForFusion(HloInstruction* instr1,
+                                 HloInstruction* instr2) override;
+
+  // We currently only consider reduce and reduce fusion nodes as candidates.
+  bool IsFusible(HloInstruction* instr) override;
+
+  // This function estimates the amount of memory reads saved by merging
+  // instr1 and instr2 into one multi-output fusion instruction. For a fusion
+  // instruction, all the operands need to be loaded from memory. If we merge
+  // instr1 and instr2, common operands will not be loaded twice. The profit is
+  // estimated as the size of the common operands b/w instr1 and instr2.
+  int64 GetProfit(HloInstruction* instr1, HloInstruction* instr2) override;
+
+  // Whether fusing the instruction can reduce memory reads.
+  //
+  // TODO(tjoerg): Move this method up into the MultiOutputFusion base class.
+  bool IsProfitableOperand(HloInstruction* instr) override;
+};
+
+}  // namespace gpu
+}  // namespace xla
+
+#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_GPU_MULTI_OUTPUT_FUSION_H_
diff --git a/tensorflow/compiler/xla/service/gpu/multi_output_fusion_test.cc b/tensorflow/compiler/xla/service/gpu/multi_output_fusion_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..924cfb11f3da76c475458ea14f201e809be61be8
--- /dev/null
+++ b/tensorflow/compiler/xla/service/gpu/multi_output_fusion_test.cc
@@ -0,0 +1,230 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/service/gpu/multi_output_fusion.h"
+
+#include "tensorflow/compiler/xla/service/hlo_matchers.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
+#include "tensorflow/compiler/xla/status_macros.h"
+#include "tensorflow/compiler/xla/tests/hlo_test_base.h"
+#include "tensorflow/compiler/xla/util.h"
+#include "tensorflow/core/lib/strings/str_util.h"
+
+namespace op = xla::testing::opcode_matchers;
+
+namespace xla {
+namespace gpu {
+
+using InstructionFusionTest = HloTestBase;
+
+const char kModulePrefix[] = R"(
+    HloModule test_module
+
+    scalar_add_computation {
+      scalar_lhs.0 = f32[] parameter(0)
+      scalar_rhs.0 = f32[] parameter(1)
+      ROOT add.0 = f32[] add(scalar_lhs.0, scalar_rhs.0)
+    }
+    scalar_mul_computation {
+      scalar_lhs.1 = f32[] parameter(0)
+      scalar_rhs.1 = f32[] parameter(1)
+      ROOT mul.1 = f32[] add(scalar_lhs.1, scalar_rhs.1)
+    })";
+
+TEST_F(InstructionFusionTest, MultiOutputFusionSiblingReduceAndReduceFusion) {
+  // Fusion with reduce instruction root and a sibling reduce instruction
+  // sharing the same input param.
+  auto module = ParseHloString(tensorflow::strings::StrCat(kModulePrefix, R"(
+    fused_computation {
+      p1.1 = f32[128,512,28,28]{3,2,1,0} parameter(1)
+      mul = f32[128,512,28,28]{3,2,1,0} multiply(p1.1, p1.1)
+      const.1 = f32[] parameter(0)
+      ROOT reduce.1 = f32[512]{0} reduce(mul, const.1), dimensions={0,2,3}, to_apply=scalar_add_computation
+    }
+
+    ENTRY entry {
+      p0 = f32[] parameter(0)
+      p1 = f32[128,512,28,28]{3,2,1,0} parameter(1)
+      const.2 = f32[] constant(1)
+      fusion = f32[512] fusion(p0, p1), kind=kInput, calls=fused_computation
+      reduce.2 = f32[512]{0} reduce(p1, const.2), dimensions={0,2,3}, to_apply=scalar_add_computation
+      ROOT root = (f32[512]{0}, f32[512]{0}) tuple(fusion, reduce.2)
+    })"))
+                    .ValueOrDie();
+  ASSERT_TRUE(GpuMultiOutputFusion().Run(module.get()).ValueOrDie());
+  SCOPED_TRACE(module->ToString());
+  const HloInstruction* fusion =
+      module->entry_computation()->root_instruction()->operand(0)->operand(0);
+  ASSERT_TRUE(fusion->IsMultiOutputFusion());
+  EXPECT_THAT(fusion->fused_expression_root(),
+              op::Tuple(op::Reduce(), op::Reduce()));
+}
+
+TEST_F(InstructionFusionTest, MultiOutputFusionDifferentReduceInputShapes) {
+  auto module = ParseHloString(tensorflow::strings::StrCat(kModulePrefix, R"(
+    fused_computation_1 {
+      p1.1 = f32[6400]{0} parameter(1)
+      mul = f32[6400]{0} multiply(p1.1, p1.1)
+      const.1 = f32[] parameter(0)
+      ROOT reduce.1 = f32[] reduce(mul, const.1), dimensions={0}, to_apply=scalar_add_computation
+    }
+
+    fused_computation_2 {
+      p1.2 = f32[6400]{0} parameter(1)
+      r1 = f32[64,100]{0,1} reshape(p1.2)
+      const.2 = f32[] parameter(0)
+      ROOT reduce.2 = f32[] reduce(r1, const.2), dimensions={1,0}, to_apply=scalar_mul_computation
+    }
+
+    ENTRY entry {
+      p0 = f32[] parameter(0)
+      p1 = f32[6400]{0} parameter(1)
+      fusion.1 = f32[] fusion(p0, p1), kind=kInput, calls=fused_computation_1
+      fusion.2 = f32[] fusion(p0, p1), kind=kInput, calls=fused_computation_2
+      ROOT root = (f32[], f32[]) tuple(fusion.1, fusion.2)
+    })"))
+                    .ValueOrDie();
+  ASSERT_FALSE(GpuMultiOutputFusion().Run(module.get()).ValueOrDie());
+}
+
+TEST_F(InstructionFusionTest, MultiOutputFusionDifferentReduceOutputShapes) {
+  auto module = ParseHloString(tensorflow::strings::StrCat(kModulePrefix, R"(
+    fused_computation_1 {
+      p1.1 = f32[10,10]{1,0} parameter(1)
+      mul = f32[10,10]{1,0} multiply(p1.1, p1.1)
+      const.1 = f32[] parameter(0)
+      ROOT reduce.1 = f32[] reduce(mul, const.1), dimensions={0,1}, to_apply=scalar_add_computation
+    }
+
+    fused_computation_2 {
+      p1.2 = f32[10,10]{1,0} parameter(1)
+      const.2 = f32[10]{0} parameter(0)
+      ROOT reduce.2 = f32[10]{0} reduce(p1.2, const.2), dimensions={0}, to_apply=scalar_mul_computation
+    }
+
+    ENTRY entry {
+      p0 = f32[] parameter(0)
+      p1.3 = f32[10,10]{1,0} parameter(1)
+      fusion.1 = f32[] fusion(p0, p1.3), kind=kInput, calls=fused_computation_1
+      p2 = f32[] parameter(2)
+      fusion.2 = f32[10]{0} fusion(p2, p1.3), kind=kInput, calls=fused_computation_2
+      ROOT root = (f32[], f32[10]{0}) tuple(fusion.1, fusion.2)
+    })"))
+                    .ValueOrDie();
+  ASSERT_FALSE(GpuMultiOutputFusion().Run(module.get()).ValueOrDie());
+}
+
+TEST_F(InstructionFusionTest, MultiOutputFusionSiblingReduceFusions) {
+  // Two sibling fusions with reduce instruction roots sharing the same input
+  // param.
+  auto module = ParseHloString(tensorflow::strings::StrCat(kModulePrefix, R"(
+    fused_computation_1 {
+      p1.1 = f32[128,512,28,28]{3,2,1,0} parameter(1)
+      mul = f32[128,512,28,28]{3,2,1,0} multiply(p1.1, p1.1)
+      const.1 = f32[] parameter(0)
+      ROOT reduce.1 = f32[512]{0} reduce(mul, const.1), dimensions={0,2,3}, to_apply=scalar_add_computation
+    }
+
+    fused_computation_2 {
+      p1.2 = f32[128,512,28,28]{3,2,1,0} parameter(1)
+      const.2 = f32[] parameter(0)
+      ROOT reduce.2 = f32[512]{0} reduce(p1.2, const.2), dimensions={0,2,3}, to_apply=scalar_add_computation
+    }
+
+    ENTRY entry {
+      p0 = f32[] parameter(0)
+      p1 = f32[128,512,28,28]{3,2,1,0} parameter(1)
+      fusion.1 = f32[512] fusion(p0, p1), kind=kInput, calls=fused_computation_1
+      fusion.2 = f32[512] fusion(p0, p1), kind=kInput, calls=fused_computation_2
+      ROOT root = (f32[512]{0}, f32[512]{0}) tuple(fusion.1, fusion.2)
+    })"))
+                    .ValueOrDie();
+  ASSERT_TRUE(GpuMultiOutputFusion().Run(module.get()).ValueOrDie());
+  SCOPED_TRACE(module->ToString());
+  const HloInstruction* fusion =
+      module->entry_computation()->root_instruction()->operand(0)->operand(0);
+  ASSERT_TRUE(fusion->IsMultiOutputFusion());
+  EXPECT_THAT(fusion->fused_expression_root(),
+              op::Tuple(op::Reduce(), op::Reduce()));
+}
+
+TEST_F(InstructionFusionTest,
+       MultiOutputFusionSiblingReduceAndReduceMultiOutputFusion) {
+  // Multi-output fusion with two reduce instructions root and a sibling reduce
+  // instruction sharing the same input param.
+  auto module = ParseHloString(tensorflow::strings::StrCat(kModulePrefix, R"(
+    fused_computation (p0: f32[128,512,28,28]) -> (f32[512], f32[512]) {
+      const.1 = f32[] constant(1)
+      p0.1 = f32[128,512,28,28]{3,2,1,0} parameter(0)
+      mul = f32[128,512,28,28]{3,2,1,0} multiply(f32[128,512,28,28]{3,2,1,0} p0.1, f32[128,512,28,28]{3,2,1,0} p0.1)
+      reduce.1 = f32[512]{0} reduce(f32[128,512,28,28]{3,2,1,0} mul, f32[] const.1), dimensions={0,2,3}, to_apply=scalar_add_computation
+      reduce.2 = f32[512]{0} reduce(f32[128,512,28,28]{3,2,1,0} p0.1, f32[] const.1), dimensions={0,2,3}, to_apply=scalar_add_computation
+      ROOT tuple = (f32[512]{0}, f32[512]{0}) tuple(f32[512]{0} reduce.1, f32[512]{0} reduce.2)
+    }
+
+    ENTRY entry (p0: f32[128,512,28,28]) -> (f32[512], f32[512], f32[512]) {
+      p0 = f32[128,512,28,28]{3,2,1,0} parameter(0)
+      const = f32[] constant(1)
+      fusion = (f32[512]{0}, f32[512]{0}) fusion(f32[128,512,28,28]{3,2,1,0} p0), kind=kInput, calls=fused_computation
+      get-tuple-element = f32[512]{0} get-tuple-element((f32[512]{0}, f32[512]{0}) fusion), index=0
+      get-tuple-element.1 = f32[512]{0} get-tuple-element((f32[512]{0}, f32[512]{0}) fusion), index=1
+      reduce.3 = f32[512]{0} reduce(p0, const), dimensions={0,2,3}, to_apply=scalar_add_computation
+      ROOT root = (f32[512]{0}, f32[512]{0}, f32[512]{0}) tuple(f32[512]{0} get-tuple-element, f32[512]{0} get-tuple-element.1, f32[512]{0} reduce.3)
+    })"))
+                    .ValueOrDie();
+  ASSERT_TRUE(GpuMultiOutputFusion().Run(module.get()).ValueOrDie());
+  SCOPED_TRACE(module->ToString());
+  const HloInstruction* fusion =
+      module->entry_computation()->root_instruction()->operand(0)->operand(0);
+  ASSERT_TRUE(fusion->IsMultiOutputFusion());
+  EXPECT_THAT(fusion->fused_expression_root(),
+              op::Tuple(op::Reduce(), op::Reduce(), op::Reduce()));
+}
+
+TEST_F(InstructionFusionTest,
+       MultiOutputFusionSiblingFusionCheckAgainstReduceOperand) {
+  // Verify that if we already have a multi-output fusion that we prefer to pick
+  // a reduce op from its operands for checking shape compatibility.
+  auto module = ParseHloString(tensorflow::strings::StrCat(kModulePrefix, R"(
+    fused_computation_1 {
+      p1.1 = f32[10,10]{1,0} parameter(1)
+      mul = f32[10,10]{1,0} multiply(p1.1, p1.1)
+      const.1 = f32[] parameter(0)
+      reduce.1 = f32[] reduce(p1.1, const.1), dimensions={0,1}, to_apply=scalar_add_computation
+      ROOT tuple = (f32[10,10], f32[]) tuple(mul, reduce.1)
+    }
+
+    fused_computation_2 {
+      p1.2 = f32[10,10]{1,0} parameter(1)
+      const.2 = f32[10] parameter(0)
+      ROOT reduce.2 = f32[10] reduce(p1.2, const.2), dimensions={0}, to_apply=scalar_mul_computation
+    }
+
+    ENTRY entry {
+      p0 = f32[] parameter(0)
+      p1 = f32[10,10]{1,0} parameter(1)
+      p2 = f32[10]{0} parameter(2)
+      fusion.1 = (f32[10,10], f32[10]) fusion(p0, p1), kind=kInput, calls=fused_computation_1
+      get-tuple-element.1 = f32[10,10] get-tuple-element((f32[10,10], f32[10]) fusion.1), index=0
+      get-tuple-element.2 = f32[] get-tuple-element((f32[10,10], f32[10]) fusion.1), index=1
+      fusion.2 = f32[10] fusion(p2, p1), kind=kInput, calls=fused_computation_2
+      ROOT root = (f32[10,10], f32[], f32[10]) tuple(get-tuple-element.1, get-tuple-element.2, fusion.2)
+    })"))
+                    .ValueOrDie();
+  ASSERT_FALSE(GpuMultiOutputFusion().Run(module.get()).ValueOrDie());
+}
+
+}  // namespace gpu
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/gpu/stream_assignment_test.cc b/tensorflow/compiler/xla/service/gpu/stream_assignment_test.cc
index 696fa7e0194032b5c78bf11383c3280a62de07fa..6f4bb0580e8dfc1dce1cca0a60cc3dd9ea600fb3 100644
--- a/tensorflow/compiler/xla/service/gpu/stream_assignment_test.cc
+++ b/tensorflow/compiler/xla/service/gpu/stream_assignment_test.cc
@@ -33,8 +33,7 @@ class StreamAssignmentTest : public HloTestBase {
     auto debug_options = GetDebugOptionsForTest();
     debug_options.set_xla_gpu_disable_multi_streaming(false);
     config.set_debug_options(debug_options);
-    return MakeUnique<HloModule>("test_module", VersionedComputationHandle(),
-                                 config);
+    return MakeUnique<HloModule>("test_module", config);
   }
 
   // Pre-canned shapes.
diff --git a/tensorflow/compiler/xla/service/gpu/stream_executor_util.cc b/tensorflow/compiler/xla/service/gpu/stream_executor_util.cc
new file mode 100644
index 0000000000000000000000000000000000000000..a50ddf6ac63c7fa7ccace94bc7f40f438aedccf8
--- /dev/null
+++ b/tensorflow/compiler/xla/service/gpu/stream_executor_util.cc
@@ -0,0 +1,151 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/service/gpu/stream_executor_util.h"
+
+#include "tensorflow/compiler/xla/layout_util.h"
+
+namespace xla {
+namespace gpu {
+
+using stream_executor::dnn::DataLayout;
+using stream_executor::dnn::DataLayoutString;
+using stream_executor::dnn::FilterLayout;
+using stream_executor::dnn::FilterLayoutString;
+
+StatusOr<std::tuple<Layout, Layout, Layout>>
+StreamExecutorConvLayoutsToXlaLayouts(const ConvolutionDimensionNumbers& dnums,
+                                      DataLayout input, FilterLayout filter,
+                                      DataLayout output) {
+  std::vector<int64> input_layout;
+  switch (input) {
+    case DataLayout::kBatchDepthYX:
+      input_layout.push_back(dnums.input_batch_dimension());
+      input_layout.push_back(dnums.input_feature_dimension());
+      input_layout.insert(input_layout.end(),
+                          dnums.input_spatial_dimensions().begin(),
+                          dnums.input_spatial_dimensions().end());
+      break;
+    case DataLayout::kBatchYXDepth:
+      input_layout.push_back(dnums.input_batch_dimension());
+      input_layout.insert(input_layout.end(),
+                          dnums.input_spatial_dimensions().begin(),
+                          dnums.input_spatial_dimensions().end());
+      input_layout.push_back(dnums.input_feature_dimension());
+      break;
+    default:
+      return tensorflow::errors::Internal("Invalid input layout: ",
+                                          DataLayoutString(input));
+  }
+
+  std::vector<int64> filter_layout;
+  switch (filter) {
+    case FilterLayout::kOutputInputYX:
+      filter_layout.push_back(dnums.kernel_output_feature_dimension());
+      filter_layout.push_back(dnums.kernel_input_feature_dimension());
+      filter_layout.insert(filter_layout.end(),
+                           dnums.kernel_spatial_dimensions().begin(),
+                           dnums.kernel_spatial_dimensions().end());
+      break;
+    case FilterLayout::kOutputYXInput:
+      filter_layout.push_back(dnums.kernel_output_feature_dimension());
+      filter_layout.insert(filter_layout.end(),
+                           dnums.kernel_spatial_dimensions().begin(),
+                           dnums.kernel_spatial_dimensions().end());
+      filter_layout.push_back(dnums.kernel_input_feature_dimension());
+      break;
+    default:
+      return tensorflow::errors::Internal("Invalid filter layout: ",
+                                          FilterLayoutString(filter));
+  }
+
+  std::vector<int64> output_layout;
+  switch (output) {
+    case DataLayout::kBatchDepthYX:
+      output_layout.push_back(dnums.output_batch_dimension());
+      output_layout.push_back(dnums.output_feature_dimension());
+      output_layout.insert(output_layout.end(),
+                           dnums.output_spatial_dimensions().begin(),
+                           dnums.output_spatial_dimensions().end());
+      break;
+    case DataLayout::kBatchYXDepth:
+      output_layout.push_back(dnums.output_batch_dimension());
+      output_layout.insert(output_layout.end(),
+                           dnums.output_spatial_dimensions().begin(),
+                           dnums.output_spatial_dimensions().end());
+      output_layout.push_back(dnums.output_feature_dimension());
+      break;
+    default:
+      return tensorflow::errors::Internal("Invalid output layout: ",
+                                          DataLayoutString(output));
+  }
+
+  return std::make_tuple(LayoutUtil::MakeLayoutFromMajorToMinor(input_layout),
+                         LayoutUtil::MakeLayoutFromMajorToMinor(filter_layout),
+                         LayoutUtil::MakeLayoutFromMajorToMinor(output_layout));
+}
+
+StatusOr<std::tuple<DataLayout, FilterLayout, DataLayout>>
+XlaConvLayoutsToStreamExecutorLayouts(const ConvolutionDimensionNumbers& dnums,
+                                      const Layout& input, const Layout& filter,
+                                      const Layout& output) {
+  Layout nchw_input, nchw_filter, nchw_output;
+  std::tie(nchw_input, nchw_filter, nchw_output) =
+      StreamExecutorConvLayoutsToXlaLayouts(dnums, DataLayout::kBatchDepthYX,
+                                            FilterLayout::kOutputInputYX,
+                                            DataLayout::kBatchDepthYX)
+          .ConsumeValueOrDie();
+
+  Layout nhwc_input, nhwc_filter, nhwc_output;
+  std::tie(nhwc_input, nhwc_filter, nhwc_output) =
+      StreamExecutorConvLayoutsToXlaLayouts(dnums, DataLayout::kBatchYXDepth,
+                                            FilterLayout::kOutputYXInput,
+                                            DataLayout::kBatchYXDepth)
+          .ConsumeValueOrDie();
+
+  DataLayout input_layout;
+  if (LayoutUtil::Equal(input, nchw_input)) {
+    input_layout = DataLayout::kBatchDepthYX;
+  } else if (LayoutUtil::Equal(input, nhwc_input)) {
+    input_layout = DataLayout::kBatchYXDepth;
+  } else {
+    return tensorflow::errors::Internal("Invalid input layout: ",
+                                        input.ShortDebugString());
+  }
+
+  FilterLayout filter_layout;
+  if (LayoutUtil::Equal(filter, nchw_filter)) {
+    filter_layout = FilterLayout::kOutputInputYX;
+  } else if (LayoutUtil::Equal(filter, nhwc_filter)) {
+    filter_layout = FilterLayout::kOutputYXInput;
+  } else {
+    return tensorflow::errors::Internal("Invalid filter layout: ",
+                                        filter.ShortDebugString());
+  }
+
+  DataLayout output_layout;
+  if (LayoutUtil::Equal(output, nchw_output)) {
+    output_layout = DataLayout::kBatchDepthYX;
+  } else if (LayoutUtil::Equal(output, nhwc_output)) {
+    output_layout = DataLayout::kBatchYXDepth;
+  } else {
+    return tensorflow::errors::Internal("Invalid output layout: ",
+                                        output.ShortDebugString());
+  }
+
+  return std::make_tuple(input_layout, filter_layout, output_layout);
+}
+}  // namespace gpu
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/gpu/stream_executor_util.h b/tensorflow/compiler/xla/service/gpu/stream_executor_util.h
new file mode 100644
index 0000000000000000000000000000000000000000..8218f4fd11d3978d0ecc53fc15e287aea4b69ec3
--- /dev/null
+++ b/tensorflow/compiler/xla/service/gpu/stream_executor_util.h
@@ -0,0 +1,46 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_GPU_STREAM_EXECUTOR_UTIL_H_
+#define TENSORFLOW_COMPILER_XLA_SERVICE_GPU_STREAM_EXECUTOR_UTIL_H_
+
+#include "tensorflow/compiler/xla/xla_data.pb.h"
+#include "tensorflow/core/platform/stream_executor_no_cuda.h"
+
+// Helper functions for interacting with StreamExecutor.
+
+namespace xla {
+namespace gpu {
+
+// Returns (input, filter, output) XLA Layout protos given the StreamExecutor
+// layouts.
+StatusOr<std::tuple<Layout, Layout, Layout>>
+StreamExecutorConvLayoutsToXlaLayouts(const ConvolutionDimensionNumbers& dnums,
+                                      stream_executor::dnn::DataLayout input,
+                                      stream_executor::dnn::FilterLayout filter,
+                                      stream_executor::dnn::DataLayout output);
+
+// Returns (input, filter, output) StreamExecutor layouts given the XLA layouts.
+StatusOr<std::tuple<stream_executor::dnn::DataLayout,
+                    stream_executor::dnn::FilterLayout,
+                    stream_executor::dnn::DataLayout>>
+XlaConvLayoutsToStreamExecutorLayouts(const ConvolutionDimensionNumbers& dnums,
+                                      const Layout& input, const Layout& filter,
+                                      const Layout& output);
+
+}  // namespace gpu
+}  // namespace xla
+
+#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_GPU_STREAM_EXECUTOR_UTIL_H_
diff --git a/tensorflow/compiler/xla/service/gpu/while_transformer.cc b/tensorflow/compiler/xla/service/gpu/while_transformer.cc
index ad55728c45599c801aad7e12fac95ae9f0c4fc3b..7749201cbceece216a2db2569936949eb7de5125 100644
--- a/tensorflow/compiler/xla/service/gpu/while_transformer.cc
+++ b/tensorflow/compiler/xla/service/gpu/while_transformer.cc
@@ -457,8 +457,8 @@ class WhileBodyComputationMatcher : public MatcherBase {
         return InvalidArgument("Unexpected tuple index instruction : %s",
                                inst->name().c_str());
       } else if (tag == "loop_increment") {
-        // Parse the constant which represents the loop induction variable
-        // increment value.
+        // ParseHloString the constant which represents the loop induction
+        // variable increment value.
         TF_RETURN_IF_ERROR(ParseConstInteger(inst, &loop_increment_));
       } else if (tag == "param0" &&
                  inst != computation_->parameter_instruction(0)) {
diff --git a/tensorflow/compiler/xla/service/heap_simulator.cc b/tensorflow/compiler/xla/service/heap_simulator.cc
index 06a5e0351b63270b61b998ca2211f480f256f759..5dba50a63b1d77bda0835e0333cc7dd9ddbe2dcf 100644
--- a/tensorflow/compiler/xla/service/heap_simulator.cc
+++ b/tensorflow/compiler/xla/service/heap_simulator.cc
@@ -26,6 +26,41 @@ namespace xla {
 using tensorflow::gtl::FlatMap;
 using tensorflow::gtl::FlatSet;
 
+StatusOr<int64> MinimumMemoryForModule(
+    const SequentialHloOrdering::HloModuleSequence& module_sequence,
+    const LogicalBuffer::SizeFunction& size_function) {
+  if (module_sequence.empty()) {
+    return 0;
+  }
+
+  const HloModule* module = module_sequence.begin()->first->parent();
+  TF_ASSIGN_OR_RETURN(std::unique_ptr<TuplePointsToAnalysis> points_to_analysis,
+                      TuplePointsToAnalysis::Run(module));
+
+  // The absolute minimum memory required for a given sequence of instructions
+  // is determined by the sequence of Alloc and Free calls on a simulated heap,
+  // ignoring fragmentation. We run the heap simulation on the whole module,
+  // rather than summing each computation, since it gives us a better lower
+  // bound, by minimizing the liveness of sub-computations.
+  TF_ASSIGN_OR_RETURN(
+      HeapSimulator::Result result,
+      HeapSimulator::Run(MakeUnique<NoFragmentationStatsHeap>(), *module,
+                         module_sequence, *points_to_analysis, size_function));
+  return result.heap_size;
+}
+
+StatusOr<int64> MinimumMemoryForComputation(
+    const HloComputation& computation,
+    const std::vector<const HloInstruction*>& sequence,
+    const TuplePointsToAnalysis& points_to_analysis,
+    const LogicalBuffer::SizeFunction& size_function) {
+  TF_ASSIGN_OR_RETURN(
+      HeapSimulator::Result result,
+      HeapSimulator::Run(MakeUnique<NoFragmentationStatsHeap>(), computation,
+                         sequence, points_to_analysis, size_function));
+  return result.heap_size;
+}
+
 /*static*/
 StatusOr<HeapSimulator::Result> HeapSimulator::Run(
     std::unique_ptr<HeapAlgorithm> algorithm, const HloModule& module,
diff --git a/tensorflow/compiler/xla/service/heap_simulator.h b/tensorflow/compiler/xla/service/heap_simulator.h
index 8b2b43a37a5c41d334e5338c6a6fad160f03a51e..3be3bb8e7fae9d22c9be00f3f81d2b93638c22ef 100644
--- a/tensorflow/compiler/xla/service/heap_simulator.h
+++ b/tensorflow/compiler/xla/service/heap_simulator.h
@@ -34,6 +34,21 @@ limitations under the License.
 
 namespace xla {
 
+// Returns the minimum memory required to compute an HLO module where all
+// computations have been scheduled (represented by the given module_sequence),
+// assuming no fragmentation.
+StatusOr<int64> MinimumMemoryForModule(
+    const SequentialHloOrdering::HloModuleSequence& module_sequence,
+    const LogicalBuffer::SizeFunction& size_function);
+
+// Returns the minimum memory required to compute the given computation,
+// assuming no fragmentation.
+StatusOr<int64> MinimumMemoryForComputation(
+    const HloComputation& computation,
+    const std::vector<const HloInstruction*>& sequence,
+    const TuplePointsToAnalysis& points_to_analysis,
+    const LogicalBuffer::SizeFunction& size_function);
+
 // Forward declare classes defined below.
 class HeapAlgorithm;
 
diff --git a/tensorflow/compiler/xla/service/heap_simulator_test.cc b/tensorflow/compiler/xla/service/heap_simulator_test.cc
index 6271652412c2979ff926702f12722102344b0dfb..309ab85f784274835904015472f3f0d601885763 100644
--- a/tensorflow/compiler/xla/service/heap_simulator_test.cc
+++ b/tensorflow/compiler/xla/service/heap_simulator_test.cc
@@ -34,6 +34,64 @@ limitations under the License.
 namespace xla {
 namespace {
 
+class MinimumMemoryForSequenceTest : public HloTestBase {};
+
+TEST_F(MinimumMemoryForSequenceTest, MultiComputation) {
+  auto module = CreateNewModule();
+  const Shape scalar_shape = ShapeUtil::MakeShape(xla::F32, {});
+  const Shape tuple_shape =
+      ShapeUtil::MakeTupleShape({scalar_shape, scalar_shape});
+
+  auto cond_builder = HloComputation::Builder("WhileCond");
+  // Tuple param: 24 bytes (each elem has 8 byte pointer, 4 byte element)
+  HloInstruction* cond_param = cond_builder.AddInstruction(
+      HloInstruction::CreateParameter(0, tuple_shape, "cond_param"));
+  HloInstruction* cond_iter = cond_builder.AddInstruction(
+      HloInstruction::CreateGetTupleElement(scalar_shape, cond_param, 0));
+  HloInstruction* cond_data = cond_builder.AddInstruction(
+      HloInstruction::CreateGetTupleElement(scalar_shape, cond_param, 1));
+  // Free cond_param[] (16 bytes), Alloc PRED[] (1 byte)
+  HloInstruction* cond_lt = cond_builder.AddInstruction(
+      HloInstruction::CreateBinary(ShapeUtil::MakeShape(PRED, {}),
+                                   HloOpcode::kLt, cond_iter, cond_data));
+  HloComputation* cond_computation =
+      module->AddEmbeddedComputation(cond_builder.Build());
+
+  auto body_builder = HloComputation::Builder("WhileBody");
+  // Tuple param: 24 bytes (each elem has 8 byte pointer, 4 byte element)
+  HloInstruction* body_param = body_builder.AddInstruction(
+      HloInstruction::CreateParameter(0, tuple_shape, "body_param"));
+  HloComputation* body_computation =
+      module->AddEmbeddedComputation(body_builder.Build());
+
+  auto builder = HloComputation::Builder(TestName());
+  // Entry params: 8 bytes (4 bytes per param), TOTAL=8
+  HloInstruction* iter = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, scalar_shape, "param_iter"));
+  HloInstruction* data = builder.AddInstruction(
+      HloInstruction::CreateParameter(1, scalar_shape, "param_data"));
+  // Tuple: 16 bytes (8 bytes per pointer), TOTAL=24
+  HloInstruction* tuple =
+      builder.AddInstruction(HloInstruction::CreateTuple({iter, data}));
+  // While: 8 bytes (4 bytes per element), TOTAL=32
+  // Both cond and body use a max of 24 bytes, TOTAL=56
+  HloInstruction* while_op = builder.AddInstruction(HloInstruction::CreateWhile(
+      tuple_shape, cond_computation, body_computation, tuple));
+  HloComputation* entry_computation =
+      module->AddEntryComputation(builder.Build());
+
+  auto size_fn = [](const BufferValue& buffer) {
+    return ShapeUtil::ByteSizeOf(buffer.shape(), /*pointer_size=*/8);
+  };
+
+  SequentialHloOrdering::HloModuleSequence module_sequence;
+  module_sequence[cond_computation] = {cond_param, cond_iter, cond_data,
+                                       cond_lt};
+  module_sequence[body_computation] = {body_param};
+  module_sequence[entry_computation] = {iter, data, tuple, while_op};
+  EXPECT_EQ(56, MinimumMemoryForModule(module_sequence, size_fn).ValueOrDie());
+}
+
 const char kAlloc[] = "Alloc";
 const char kFree[] = "Free";
 const char kFinish[] = "Finish";
diff --git a/tensorflow/compiler/xla/service/hlo.proto b/tensorflow/compiler/xla/service/hlo.proto
index 1f7c1cffd324ad2f4e4cdf11046c8459b8ceb6d5..e201359d3d25b7d2dda852762c6de1fcb75685d7 100644
--- a/tensorflow/compiler/xla/service/hlo.proto
+++ b/tensorflow/compiler/xla/service/hlo.proto
@@ -145,6 +145,7 @@ message HloInstructionProto {
   repeated int64 operand_ids = 36;
   repeated int64 control_predecessor_ids = 37;
   repeated int64 called_computation_ids = 38;
+  repeated int64 replica_group_ids = 44;
 
   xla.OpSharding sharding = 40;
 
diff --git a/tensorflow/compiler/xla/service/hlo_casting_utils.h b/tensorflow/compiler/xla/service/hlo_casting_utils.h
new file mode 100644
index 0000000000000000000000000000000000000000..7f73bba036534a62a70a80431236cffa766c9b38
--- /dev/null
+++ b/tensorflow/compiler/xla/service/hlo_casting_utils.h
@@ -0,0 +1,104 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+// Casting utilitiy functions for HLO instructions.
+
+#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_HLO_CASTING_UTILS_H_
+#define TENSORFLOW_COMPILER_XLA_SERVICE_HLO_CASTING_UTILS_H_
+
+#include <type_traits>
+#include "tensorflow/core/platform/logging.h"
+
+namespace xla {
+
+class HloInstruction;
+
+template <class T>
+using EnableIfDerivedFromHlo =
+    typename std::enable_if<std::is_base_of<HloInstruction, T>::value>::type;
+
+// TODO(b/93238915): Switch implementation from C++'s dynamic_cast to LLVM-like
+// RTTI if it turns out to be a performance issue.
+// Casts an HloInstruction pointer to one of its subclasses, dies if argument is
+// nullptr or runtime information does not match.
+//
+// Similar to LLVM's cast.
+template <class T, EnableIfDerivedFromHlo<T>* = nullptr>
+const T* Cast(const HloInstruction* instruction) {
+  CHECK(instruction != nullptr);
+  const T* casted = dynamic_cast<const T*>(instruction);
+  CHECK(casted != nullptr);
+  return casted;
+}
+
+// Non-const overload of Cast.
+template <class T, EnableIfDerivedFromHlo<T>* = nullptr>
+T* Cast(HloInstruction* instruction) {
+  return const_cast<T*>(
+      Cast<T>(const_cast<const HloInstruction*>(instruction)));
+}
+
+// Works just like the Cast, except that it allows for a null pointer as an
+// argument which it then propagates.
+//
+// Similar to LLVM's cast_or_null.
+template <class T, EnableIfDerivedFromHlo<T>* = nullptr>
+const T* CastOrNull(const HloInstruction* instruction) {
+  return instruction != nullptr ? Cast<T>(instruction) : nullptr;
+}
+
+// Non-const overload of CastOrNull.
+template <class T, EnableIfDerivedFromHlo<T>* = nullptr>
+T* CastOrNull(HloInstruction* instruction) {
+  return const_cast<T*>(
+      CastOrNull<T>(const_cast<const HloInstruction*>(instruction)));
+}
+
+// Casts an HloInstruction pointer to one of its subclasses, dies if argument is
+// nullptr, returns nullptr if runtime information does not match.
+//
+// Similar to LLVM's dyn_cast.
+template <class T, EnableIfDerivedFromHlo<T>* = nullptr>
+const T* DynCast(const HloInstruction* instruction) {
+  CHECK(instruction != nullptr);
+  return dynamic_cast<const T*>(instruction);
+}
+
+// Non-const overload of DynCast.
+template <class T, EnableIfDerivedFromHlo<T>* = nullptr>
+T* DynCast(HloInstruction* instruction) {
+  return const_cast<T*>(
+      DynCast<T>(const_cast<const HloInstruction*>(instruction)));
+}
+
+// Works just like the DynCast, except that it allows for a null pointer as an
+// argument which it then propagates.
+//
+// Similar to LLVM's dyn_cast_or_null.
+template <class T, EnableIfDerivedFromHlo<T>* = nullptr>
+const T* DynCastOrNull(const HloInstruction* instruction) {
+  return instruction != nullptr ? DynCast<T>(instruction) : nullptr;
+}
+
+// Non-const overload of DynCastOrNull.
+template <class T, EnableIfDerivedFromHlo<T>* = nullptr>
+T* DynCastOrNull(HloInstruction* instruction) {
+  return const_cast<T*>(
+      DynCastOrNull<T>(const_cast<const HloInstruction*>(instruction)));
+}
+
+}  // namespace xla
+
+#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_HLO_CASTING_UTILS_H_
diff --git a/tensorflow/compiler/xla/service/hlo_casting_utils_test.cc b/tensorflow/compiler/xla/service/hlo_casting_utils_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..a3364275409122254bf99b40a7d2fcbb2d7564cc
--- /dev/null
+++ b/tensorflow/compiler/xla/service/hlo_casting_utils_test.cc
@@ -0,0 +1,113 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/service/hlo_casting_utils.h"
+
+#include "tensorflow/compiler/xla/service/hlo_instruction.h"
+#include "tensorflow/core/platform/test.h"
+
+namespace xla {
+namespace {
+
+class DummyInstruction : public HloInstruction {
+ public:
+  DummyInstruction()
+      : HloInstruction(HloOpcode::kConstant, ShapeUtil::MakeShape(F32, {})) {}
+};
+
+class AnotherDummyInstruction : public HloInstruction {
+ public:
+  AnotherDummyInstruction()
+      : HloInstruction(HloOpcode::kParameter, ShapeUtil::MakeShape(F32, {})) {}
+};
+
+TEST(HloCastingUtilsTest, CastSucceeds) {
+  DummyInstruction instruction;
+  DummyInstruction* casted =
+      Cast<DummyInstruction>(static_cast<HloInstruction*>(&instruction));
+  ASSERT_EQ(casted, &instruction);
+}
+
+TEST(HloCastingUtilsTest, CastDiesForWrongType) {
+  AnotherDummyInstruction instruction;
+  ASSERT_DEATH(
+      Cast<DummyInstruction>(static_cast<HloInstruction*>(&instruction)), "");
+}
+
+TEST(HloCastingUtilsTest, CastDiesForNullptr) {
+  HloInstruction* null = nullptr;
+  ASSERT_DEATH(Cast<DummyInstruction>(null), "");
+}
+
+TEST(HloCastingUtilsTest, CastOrNullSucceeds) {
+  DummyInstruction instruction;
+  DummyInstruction* casted =
+      Cast<DummyInstruction>(static_cast<HloInstruction*>(&instruction));
+  ASSERT_EQ(casted, &instruction);
+}
+
+TEST(HloCastingUtilsTest, CastOrNullDiesForWrongType) {
+  AnotherDummyInstruction instruction;
+  ASSERT_DEATH(
+      Cast<DummyInstruction>(static_cast<HloInstruction*>(&instruction)), "");
+}
+
+TEST(HloCastingUtilsTest, CastOrNullReturnsNullptrForNullptr) {
+  HloInstruction* null = nullptr;
+  DummyInstruction* casted = CastOrNull<DummyInstruction>(null);
+  ASSERT_EQ(casted, nullptr);
+}
+
+TEST(HloCastingUtilsTest, DynCastSucceeds) {
+  DummyInstruction instruction;
+  DummyInstruction* casted =
+      DynCast<DummyInstruction>(static_cast<HloInstruction*>(&instruction));
+  ASSERT_EQ(casted, &instruction);
+}
+
+TEST(HloCastingUtilsTest, DynCastReturnsNullptrForWrongType) {
+  AnotherDummyInstruction instruction;
+  DummyInstruction* casted =
+      DynCast<DummyInstruction>(static_cast<HloInstruction*>(&instruction));
+  ASSERT_EQ(casted, nullptr);
+}
+
+TEST(HloCastingUtilsTest, DynCastDiesForNullptr) {
+  HloInstruction* null = nullptr;
+  ASSERT_DEATH(DynCast<DummyInstruction>(null), "");
+}
+
+TEST(HloCastingUtilsTest, DynCastOrNullSucceeds) {
+  DummyInstruction instruction;
+  DummyInstruction* casted = DynCastOrNull<DummyInstruction>(
+      static_cast<HloInstruction*>(&instruction));
+  ASSERT_EQ(casted, &instruction);
+}
+
+TEST(HloCastingUtilsTest, DynCastOrNullReturnsNullptrForWrongType) {
+  AnotherDummyInstruction instruction;
+  DummyInstruction* casted = DynCastOrNull<DummyInstruction>(
+      static_cast<HloInstruction*>(&instruction));
+  ASSERT_EQ(casted, nullptr);
+}
+
+TEST(HloCastingUtilsTest, DynCastOrNullReturnsNullptrForNullptr) {
+  HloInstruction* null = nullptr;
+  DummyInstruction* casted = DynCastOrNull<DummyInstruction>(null);
+  ASSERT_EQ(casted, nullptr);
+}
+
+}  // namespace
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/hlo_clone_context.h b/tensorflow/compiler/xla/service/hlo_clone_context.h
new file mode 100644
index 0000000000000000000000000000000000000000..658643b427a9625fac1166151a89cbd669f817d5
--- /dev/null
+++ b/tensorflow/compiler/xla/service/hlo_clone_context.h
@@ -0,0 +1,97 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_HLO_CLONE_CONTEXT_H_
+#define TENSORFLOW_COMPILER_XLA_SERVICE_HLO_CLONE_CONTEXT_H_
+
+#include <string>
+
+#include "tensorflow/compiler/xla/map_util.h"
+#include "tensorflow/core/lib/gtl/flatmap.h"
+
+namespace xla {
+
+class HloInstruction;
+class HloComputation;
+class HloModule;
+
+// Data structure used to track the cloning of HloInstruction and HloComputation
+// objects.
+class HloCloneContext {
+ public:
+  // Creates a new HloCloneContext object to clone HloInstruction and
+  // HloComputation objects to be added to the module specified as argument.
+  // The suffix string will be appended to computation names.
+  explicit HloCloneContext(HloModule* module, const string& suffix = "")
+      : module_(module), suffix_(suffix) {}
+
+  HloModule* module() const { return module_; }
+
+  const string& suffix() const { return suffix_; }
+
+  void MapInstruction(const HloInstruction* old_instruction,
+                      HloInstruction* new_instruction) {
+    instructions_[old_instruction] = new_instruction;
+  }
+
+  void MapComputation(const HloComputation* old_computation,
+                      HloComputation* new_computation) {
+    computations_[old_computation] = new_computation;
+  }
+
+  // Finds the new instruction mapped to its old copy, or return nullptr in case
+  // it is not found.
+  HloInstruction* FindInstruction(const HloInstruction* old_instruction) const {
+    return FindOrDefault(instructions_, old_instruction, nullptr);
+  }
+
+  // Finds the new computation mapped to its old copy, or return nullptr in case
+  // it is not found.
+  HloComputation* FindComputation(const HloComputation* old_computation) const {
+    return FindOrDefault(computations_, old_computation, nullptr);
+  }
+
+  // Retrieves the new instruction mapped to its old copy, or fail if not found.
+  HloInstruction* GetInstruction(const HloInstruction* old_instruction) const {
+    return FindOrDie(instructions_, old_instruction);
+  }
+
+  // Retrieves the new computation mapped to its old copy, or fail if not found.
+  HloComputation* GetComputation(const HloComputation* old_computation) const {
+    return FindOrDie(computations_, old_computation);
+  }
+
+  const tensorflow::gtl::FlatMap<const HloInstruction*, HloInstruction*>&
+  cloned_instructions() const {
+    return instructions_;
+  }
+
+  const tensorflow::gtl::FlatMap<const HloComputation*, HloComputation*>&
+  cloned_computations() const {
+    return computations_;
+  }
+
+ private:
+  HloModule* module_;
+  string suffix_;
+  tensorflow::gtl::FlatMap<const HloInstruction*, HloInstruction*>
+      instructions_;
+  tensorflow::gtl::FlatMap<const HloComputation*, HloComputation*>
+      computations_;
+};
+
+}  // namespace xla
+
+#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_HLO_CLONE_CONTEXT_H_
diff --git a/tensorflow/compiler/xla/service/hlo_computation.cc b/tensorflow/compiler/xla/service/hlo_computation.cc
index 63c3dc4a5932f754a9ccdd70d03c999fe528a448..b158f44923982642615543cebbd54f00596e2641 100644
--- a/tensorflow/compiler/xla/service/hlo_computation.cc
+++ b/tensorflow/compiler/xla/service/hlo_computation.cc
@@ -64,7 +64,7 @@ HloComputation::HloComputation(
     const string& name, int parameter_count,
     std::vector<std::unique_ptr<HloInstruction>>* instructions,
     HloInstruction* root_instruction, HloInstruction* fusion_instruction)
-    : name_(name),
+    : name_(NameUniquer::GetSanitizedName(name)),
       unique_id_(-1),
       root_instruction_(root_instruction),
       fusion_instruction_(fusion_instruction) {
@@ -315,6 +315,42 @@ void ComputeComputationPostOrder(
   }
 }
 
+std::list<HloInstruction*> ComputeInstructionPostOrder(
+    HloInstruction* root, tensorflow::gtl::FlatSet<HloInstruction*>* visited) {
+  std::list<HloInstruction*> post_order;
+  std::vector<std::pair<HloInstruction*, bool>> dfs_stack;
+  dfs_stack.emplace_back(root, false);
+  while (!dfs_stack.empty()) {
+    const auto current = dfs_stack.back();
+    if (current.second) {
+      dfs_stack.pop_back();
+      if (!visited->insert(current.first).second) {
+        continue;
+      }
+      post_order.push_back(current.first);
+    } else {
+      if (visited->count(current.first)) {
+        dfs_stack.pop_back();
+        continue;
+      }
+      dfs_stack.back().second = true;
+
+      // Add the operands to the stack in reverse order so the first operand is
+      // processed first. This will produce a more natural ordering and a nicer
+      // result for thigns like HLO stringification.
+      const auto& operands = current.first->operands();
+      for (int64 i = operands.size() - 1; i >= 0; --i) {
+        dfs_stack.emplace_back(operands[i], false);
+      }
+
+      for (HloInstruction* op : current.first->control_predecessors()) {
+        dfs_stack.emplace_back(op, false);
+      }
+    }
+  }
+  return post_order;
+}
+
 }  // namespace
 
 std::list<HloInstruction*> HloComputation::MakeInstructionPostOrder() const {
@@ -328,9 +364,9 @@ std::list<HloInstruction*> HloComputation::MakeInstructionPostOrder() const {
       // users).
       trace_instructions.push_back(instruction.get());
     } else if (instruction->users().empty()) {
-      post_order.splice(post_order.end(),
-                        InstructionPostOrderer::GetOrder(instruction.get(),
-                                                         &added_instructions));
+      post_order.splice(
+          post_order.end(),
+          ComputeInstructionPostOrder(instruction.get(), &added_instructions));
     }
   }
   post_order.splice(post_order.end(), trace_instructions);
@@ -752,22 +788,21 @@ Status HloComputation::Accept(
 }
 
 std::unique_ptr<HloComputation> HloComputation::Clone(
-    const string& suffix, HloModule* module,
-    HloInstruction::CloneMap* clone_map) {
+    const string& suffix, HloCloneContext* context) {
   return CloneWithReplacements(
       /*replacements=*/std::unordered_map<const HloInstruction*,
                                           std::unique_ptr<HloInstruction>>(),
-      module, clone_map, suffix);
+      context, suffix);
 }
 
 std::unique_ptr<HloComputation> HloComputation::CloneWithReplacements(
     std::unordered_map<const HloInstruction*, std::unique_ptr<HloInstruction>>
         replacements,
-    HloModule* module, HloInstruction::CloneMap* clone_map,
-    const string& suffix) {
-  HloInstruction::CloneMap local_clone_map;
-  if (clone_map == nullptr) {
-    clone_map = &local_clone_map;
+    HloCloneContext* context, const string& suffix) {
+  std::unique_ptr<HloCloneContext> context_ptr;
+  if (context == nullptr) {
+    context_ptr = MakeUnique<HloCloneContext>(parent(), suffix);
+    context = context_ptr.get();
   }
 
   // Look up instr in the replacements map, and return either the replacement,
@@ -792,18 +827,18 @@ std::unique_ptr<HloComputation> HloComputation::CloneWithReplacements(
   }
 
   std::vector<std::unique_ptr<HloInstruction>> instructions;
-  std::unique_ptr<HloInstruction> new_instr = nullptr;
+  std::unique_ptr<HloInstruction> new_instr;
   for (auto instr : postorder) {
     std::vector<HloInstruction*> new_operands;
     for (auto operand : instr->operands()) {
       auto replaced_operand = replace(operand);
       CHECK_NE(replaced_operand, nullptr)
-          << "Replacements map specifies to leave out " << operand->ToString()
-          << ", but it is used by " << instr->ToString() << ".";
-      new_operands.push_back(FindOrDie(*clone_map, replaced_operand));
+          << "replacements map tried to eliminate a used instruction "
+          << operand->ToString() << ", used by " << instr->ToString();
+      new_operands.push_back(context->GetInstruction(replaced_operand));
     }
-    new_instr = instr->CloneWithNewOperands(instr->shape(), new_operands,
-                                            module, clone_map);
+    new_instr =
+        instr->CloneWithNewOperands(instr->shape(), new_operands, context);
     instructions.push_back(std::move(new_instr));
   }
   Builder builder(name() + "." + suffix);
@@ -811,22 +846,23 @@ std::unique_ptr<HloComputation> HloComputation::CloneWithReplacements(
     builder.AddInstruction(std::move(instr));
   }
   auto result = builder.Build(
-      /*root_instruction=*/FindOrDie(*clone_map, replace(root_instruction())));
+      /*root_instruction=*/context->GetInstruction(
+          replace(root_instruction())));
 
   // Clone control dependencies.
   for (auto instr : postorder) {
-    HloInstruction* new_instr = FindOrDie(*clone_map, instr);
+    HloInstruction* new_instr = context->GetInstruction(instr);
     for (auto successor : instr->control_successors()) {
       auto replaced_successor = replace(successor);
-      CHECK_NE(replaced_successor, nullptr)
-          << "Replacements map specifies to leave out " << successor->ToString()
-          << ", but it is control-depended-on by " << instr->ToString() << ".";
-
-      TF_CHECK_OK(new_instr->AddControlDependencyTo(
-          FindOrDie(*clone_map, replaced_successor)));
+      // successor may not have been remapped, because it might have been
+      // removed by the replacements map.
+      if (replaced_successor != nullptr) {
+        TF_CHECK_OK(new_instr->AddControlDependencyTo(
+            context->GetInstruction(replaced_successor)));
+      }
     }
   }
-
+  context->MapComputation(this, result.get());
   // We cloned the elements of 'replacements', so they're all going to be
   // destroyed. HloInstructions need to be detached from their operands before
   // they're destroyed, otherwise they stick around in the operands' users lists
@@ -836,7 +872,6 @@ std::unique_ptr<HloComputation> HloComputation::CloneWithReplacements(
       new_instr->DetachFromOperands();
     }
   }
-
   return result;
 }
 
diff --git a/tensorflow/compiler/xla/service/hlo_computation.h b/tensorflow/compiler/xla/service/hlo_computation.h
index 8bc97df0365a32bdc89d4636ad4c7076ffb08296..0da4a305f3d5d694a1918fed294337100b0a27fd 100644
--- a/tensorflow/compiler/xla/service/hlo_computation.h
+++ b/tensorflow/compiler/xla/service/hlo_computation.h
@@ -29,6 +29,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/dfs_hlo_visitor.h"
 #include "tensorflow/compiler/xla/service/dfs_hlo_visitor_with_default.h"
 #include "tensorflow/compiler/xla/service/hlo.pb.h"
+#include "tensorflow/compiler/xla/service/hlo_clone_context.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
 #include "tensorflow/compiler/xla/service/hlo_reachability.h"
 #include "tensorflow/compiler/xla/service/name_uniquer.h"
@@ -300,17 +301,11 @@ class HloComputation {
       const std::function<Status(const HloInstruction*)>& visitor_func) const;
 
   // Returns a deep copy of this computation including all instructions.
-  //
-  // If the module pointer is not nullptr, then the cloned computations will be
-  // added to this module in order to support deep cloning. Otherwise the module
-  // of the computation is used.
-  //
-  // If clone_map is not nullptr, then each original instruction that is cloned
-  // will be inserted and map to its clone. clone_map should not already contain
-  // any of the instructions to clone.
-  std::unique_ptr<HloComputation> Clone(
-      const string& suffix = "clone", HloModule* module = nullptr,
-      HloInstruction::CloneMap* clone_map = nullptr);
+  // If the clone context is specified, it will be populated with the cloned
+  // object mappings, and its module() will be used to add new computations
+  // into.
+  std::unique_ptr<HloComputation> Clone(const string& suffix = "clone",
+                                        HloCloneContext* context = nullptr);
 
   // Like Clone(), but if an instruction is present in replacement_map, we use
   // the map's value to replace that instruction in the cloned computation.
@@ -320,9 +315,7 @@ class HloComputation {
   std::unique_ptr<HloComputation> CloneWithReplacements(
       std::unordered_map<const HloInstruction*, std::unique_ptr<HloInstruction>>
           replacements,
-      HloModule* module = nullptr,
-      HloInstruction::CloneMap* clone_map = nullptr,
-      const string& suffix = "clone");
+      HloCloneContext* context = nullptr, const string& suffix = "clone");
 
   // Returns true if the given instruction can be removed from the computation.
   // Parameter instructions cannot be removed without violating invariants of
diff --git a/tensorflow/compiler/xla/service/hlo_cost_analysis.cc b/tensorflow/compiler/xla/service/hlo_cost_analysis.cc
index 94c9c7eabcc99d4cf61f535925c068a9b55ed136..762e1afc71b108b2e32b5a7f7f1bbeb783fc6fbd 100644
--- a/tensorflow/compiler/xla/service/hlo_cost_analysis.cc
+++ b/tensorflow/compiler/xla/service/hlo_cost_analysis.cc
@@ -172,15 +172,22 @@ Status HloCostAnalysis::HandleReverse(const HloInstruction*) {
   return Status::OK();
 }
 
-Status HloCostAnalysis::HandleSlice(const HloInstruction*) {
+Status HloCostAnalysis::HandleSlice(const HloInstruction* slice) {
+  current_properties_[kBytesAccessedKey] = shape_size_(slice->shape()) * 2;
   return Status::OK();
 }
 
-Status HloCostAnalysis::HandleDynamicSlice(const HloInstruction*) {
+Status HloCostAnalysis::HandleDynamicSlice(
+    const HloInstruction* dynamic_slice) {
+  current_properties_[kBytesAccessedKey] =
+      shape_size_(dynamic_slice->shape()) * 2;
   return Status::OK();
 }
 
-Status HloCostAnalysis::HandleDynamicUpdateSlice(const HloInstruction*) {
+Status HloCostAnalysis::HandleDynamicUpdateSlice(
+    const HloInstruction* dynamic_update_slice) {
+  current_properties_[kBytesAccessedKey] =
+      shape_size_(dynamic_update_slice->operand(1)->shape()) * 2;
   return Status::OK();
 }
 
@@ -386,6 +393,10 @@ Status HloCostAnalysis::HandleTranspose(const HloInstruction*) {
   return Status::OK();
 }
 
+Status HloCostAnalysis::HandleGenerateToken(const HloInstruction*) {
+  return Status::OK();
+}
+
 Status HloCostAnalysis::HandleConvolution(const HloInstruction* convolution) {
   auto lhs = convolution->operand(0);
   auto rhs = convolution->operand(1);
diff --git a/tensorflow/compiler/xla/service/hlo_cost_analysis.h b/tensorflow/compiler/xla/service/hlo_cost_analysis.h
index d17678d20f2a23fd98d18b77d5fb25853901a789..0d66736fe1d0677d13a63ede7a203d6ac20c76f5 100644
--- a/tensorflow/compiler/xla/service/hlo_cost_analysis.h
+++ b/tensorflow/compiler/xla/service/hlo_cost_analysis.h
@@ -97,6 +97,7 @@ class HloCostAnalysis : public ConstDfsHloVisitor {
   Status HandleBroadcast(const HloInstruction* broadcast) override;
   Status HandlePad(const HloInstruction* pad) override;
   Status HandleReshape(const HloInstruction* reshape) override;
+  Status HandleGenerateToken(const HloInstruction* token) override;
   Status HandleTranspose(const HloInstruction* transpose) override;
   Status HandleWhile(const HloInstruction* xla_while) override;
   Status HandleConditional(const HloInstruction* conditional) override;
diff --git a/tensorflow/compiler/xla/service/hlo_cost_analysis_test.cc b/tensorflow/compiler/xla/service/hlo_cost_analysis_test.cc
index 16fdda8a8b9ade09ea31cda1f4cf5e8ff2c0a081..d22bef56730da194816b4ee89dc3196439b350f9 100644
--- a/tensorflow/compiler/xla/service/hlo_cost_analysis_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_cost_analysis_test.cc
@@ -460,5 +460,51 @@ TEST_F(HloCostAnalysisTest, BaseDilatedConvolution) {
   EXPECT_EQ(analysis.flop_count(), 1472);
 }
 
+TEST_F(HloCostAnalysisTest, Slice) {
+  // Test the analysis on a slice.
+  XlaBuilder builder("slice");
+  auto x = builder.Parameter(0, ShapeUtil::MakeShape(F32, {2}), "x");
+  auto slice = builder.Slice(x, {0}, {1}, {1});
+  auto hlo_module = BuildHloGraph(&builder);
+
+  // Run HLO cost analysis.
+  HloCostAnalysis analysis(ShapeSize);
+  ASSERT_IS_OK(
+      hlo_module->entry_computation()->root_instruction()->Accept(&analysis));
+
+  EXPECT_EQ(analysis.bytes_accessed(), 8);
+}
+
+TEST_F(HloCostAnalysisTest, DynamicSlice) {
+  // Test the analysis on a slice.
+  XlaBuilder builder("dynamic-slice");
+  auto x = builder.Parameter(0, ShapeUtil::MakeShape(F32, {2}), "x");
+  auto slice = builder.DynamicSlice(x, builder.ConstantR1<int32>({1}), {1});
+  auto hlo_module = BuildHloGraph(&builder);
+
+  // Run HLO cost analysis.
+  HloCostAnalysis analysis(ShapeSize);
+  ASSERT_IS_OK(
+      hlo_module->entry_computation()->root_instruction()->Accept(&analysis));
+
+  EXPECT_EQ(analysis.bytes_accessed(), 8);
+}
+
+TEST_F(HloCostAnalysisTest, DynamicUpdateSlice) {
+  // Test the analysis on a slice.
+  XlaBuilder builder("dynamic-update-slice");
+  auto x = builder.Parameter(0, ShapeUtil::MakeShape(F32, {2}), "x");
+  auto slice = builder.DynamicUpdateSlice(x, builder.ConstantR1<float>({1.0}),
+                                          builder.ConstantR1<int32>({1}));
+  auto hlo_module = BuildHloGraph(&builder);
+
+  // Run HLO cost analysis.
+  HloCostAnalysis analysis(ShapeSize);
+  ASSERT_IS_OK(
+      hlo_module->entry_computation()->root_instruction()->Accept(&analysis));
+
+  EXPECT_EQ(analysis.bytes_accessed(), 8);
+}
+
 }  // namespace
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/hlo_cse.cc b/tensorflow/compiler/xla/service/hlo_cse.cc
index c17c26c5a435fe34dd1024d596004cf6b5fdce8c..a0ee8896230d6dcacb5a8eb607fc00ae5226cfa5 100644
--- a/tensorflow/compiler/xla/service/hlo_cse.cc
+++ b/tensorflow/compiler/xla/service/hlo_cse.cc
@@ -26,6 +26,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/layout_util.h"
 #include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
+#include "tensorflow/compiler/xla/service/hlo_domain_map.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
 #include "tensorflow/compiler/xla/service/hlo_opcode.h"
 #include "tensorflow/compiler/xla/shape_util.h"
@@ -41,16 +42,16 @@ namespace {
 
 // Find and combine identical constants. Constants are identical if they have
 // the same type and value.
-bool CombineConstants(HloComputation* computation, bool is_layout_sensitive) {
-  bool changed = false;
-
+StatusOr<bool> CombineConstants(HloComputation* computation,
+                                bool is_layout_sensitive) {
+  TF_ASSIGN_OR_RETURN(auto domain_map, HloDomainMap::Create(computation, ""));
   // Map from ShortDebugString of the layoutless shape of the constant to the
   // set of constant instructions with that shape. Layoutless shape is used to
   // bin possible common constants together to reduce number of constant
   // comparisons. If we end up having too many constant comparisons, a more
   // precise binning might have to be used.
   std::multimap<string, HloInstruction*> constants;
-
+  int64 combined = 0;
   auto inst_it = computation->instructions().begin();
   while (inst_it != computation->instructions().end()) {
     HloInstruction* instruction = *inst_it;
@@ -70,7 +71,8 @@ bool CombineConstants(HloComputation* computation, bool is_layout_sensitive) {
       auto range = constants.equal_range(shape_string);
       HloInstruction* match = nullptr;
       for (auto it = range.first; it != range.second; ++it) {
-        if (instruction->literal() == it->second->literal()) {
+        if (instruction->literal() == it->second->literal() &&
+            domain_map->InSameDomain(it->second, instruction)) {
           match = it->second;
           break;
         }
@@ -81,12 +83,13 @@ bool CombineConstants(HloComputation* computation, bool is_layout_sensitive) {
         // Match found, replace this instruction with the one in the multimap.
         TF_CHECK_OK(instruction->ReplaceAllUsesWith(match));
         TF_CHECK_OK(computation->RemoveInstruction(instruction));
-        changed = true;
+        ++combined;
       }
     }
   }
-
-  return changed;
+  VLOG(4) << "Combined " << combined << " constants in " << computation->name()
+          << " computation";
+  return combined > 0;
 }
 
 // An instruction is considered to be equivalent to another only if they
@@ -123,24 +126,27 @@ StatusOr<bool> HloCSE::Run(HloModule* module) {
       continue;
     }
 
-    changed |= CombineConstants(computation, is_layout_sensitive_);
+    TF_ASSIGN_OR_RETURN(bool combined,
+                        CombineConstants(computation, is_layout_sensitive_));
+    changed |= combined;
 
     // HLO instructions are grouped into equivalency classes by using the
     // cse_equal predicate defined above. This set holds a representative
     // instruction for each class.
     tensorflow::gtl::FlatSet<HloInstruction*, decltype(&CseHash),
                              decltype(cse_equal)>
-        representatives(/*N=*/1024, &CseHash, cse_equal);
-
+        representatives(/*N=*/computation->instruction_count() + 1, &CseHash,
+                        cse_equal);
     for (auto instruction : computation->MakeInstructionPostOrder()) {
       // If the instruction has zero operands (constants, parameters, etc.) skip
       // over it.
       if (instruction->operand_count() == 0) {
         continue;
       }
-
-      // Skip instructions which have side effects.
-      if (instruction->HasSideEffect()) {
+      // Skip instructions which have side effects or are a domain (which must
+      // not be CSE-ed).
+      if (instruction->HasSideEffect() ||
+          instruction->opcode() == HloOpcode::kDomain) {
         continue;
       }
 
diff --git a/tensorflow/compiler/xla/service/hlo_cse_test.cc b/tensorflow/compiler/xla/service/hlo_cse_test.cc
index 9735764b692238d6a320bcff51e43b98dcadabda..16db374566c727f1f3efe2a6d419f1f3caf0aaf1 100644
--- a/tensorflow/compiler/xla/service/hlo_cse_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_cse_test.cc
@@ -32,10 +32,10 @@ limitations under the License.
 #include "tensorflow/compiler/xla/tests/hlo_test_base.h"
 #include "tensorflow/compiler/xla/tests/literal_test_util.h"
 #include "tensorflow/compiler/xla/tests/test_utils.h"
-#include "tensorflow/compiler/xla/tools/parser/hlo_parser.h"
 #include "tensorflow/compiler/xla/util.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
 
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/types.h"
 #include "tensorflow/core/platform/types.h"
 
@@ -142,31 +142,46 @@ TEST_F(HloCseTest, ConstantsSameValueDifferentType) {
   // Test that constants with the same value but different type are *not*
   // commoned.
   auto builder = HloComputation::Builder(TestName());
-  builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<uint32>(42)));
-  builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<int32>(42)));
-  builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<uint64>(42.0)));
-  builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<int64>(42.0)));
-  builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<double>(42.0)));
-  builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0f)));
+  std::vector<HloInstruction*> constants;
+  constants.push_back(builder.AddInstruction(
+      HloInstruction::CreateConstant(Literal::CreateR0<uint32>(42))));
+  constants.push_back(builder.AddInstruction(
+      HloInstruction::CreateConstant(Literal::CreateR0<int32>(42))));
+  constants.push_back(builder.AddInstruction(
+      HloInstruction::CreateConstant(Literal::CreateR0<uint64>(42.0))));
+  constants.push_back(builder.AddInstruction(
+      HloInstruction::CreateConstant(Literal::CreateR0<int64>(42.0))));
+  constants.push_back(builder.AddInstruction(
+      HloInstruction::CreateConstant(Literal::CreateR0<double>(42.0))));
+  constants.push_back(builder.AddInstruction(
+      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0f))));
   // Duplicate the float constant to verify something happens.
-  builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0f)));
+  constants.push_back(builder.AddInstruction(
+      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0f))));
+
+  const Shape shape_r0 = ShapeUtil::MakeShape(F32, {});
+  for (int64 i = 0; i < constants.size(); ++i) {
+    constants[i] = builder.AddInstruction(
+        HloInstruction::CreateConvert(shape_r0, constants[i]));
+  }
+  HloInstruction* root = builder.AddInstruction(HloInstruction::CreateBinary(
+      shape_r0, HloOpcode::kAdd, constants[0], constants[1]));
+  for (int64 i = 2; i < constants.size(); ++i) {
+    root = builder.AddInstruction(HloInstruction::CreateBinary(
+        shape_r0, HloOpcode::kAdd, root, constants[i]));
+  }
 
   auto module = CreateNewModule();
   auto computation = module->AddEntryComputation(builder.Build());
 
-  EXPECT_EQ(7, computation->instruction_count());
+  EXPECT_EQ(20, computation->instruction_count());
 
   HloCSE cse(/*is_layout_sensitive=*/false);
   EXPECT_TRUE(cse.Run(module.get()).ValueOrDie());
 
-  EXPECT_EQ(6, computation->instruction_count());
+  // CSE will remove both the second float(42.0f) and the corresponding
+  // convert/cast.
+  EXPECT_EQ(18, computation->instruction_count());
 }
 
 TEST_F(HloCseTest, NonscalarConstants) {
@@ -471,7 +486,7 @@ TEST_F(HloCseTest, DoNotCombineCallsToImpureFunctions) {
 }
 
 TEST_F(HloCseTest, CompareComputations) {
-  auto module = tools::Parse(R"(
+  auto module = ParseHloString(R"(
     HloModule m
 
     add_computation {
@@ -501,5 +516,25 @@ TEST_F(HloCseTest, CompareComputations) {
   EXPECT_EQ(root->operand(0), root->operand(1));
 }
 
+TEST_F(HloCseTest, ConstantsSameValueInDifferentDomains) {
+  // Test that constants with the same value but in different domains (disjoint
+  // in this case) are not collapsed.
+  auto builder = HloComputation::Builder(TestName());
+  builder.AddInstruction(
+      HloInstruction::CreateConstant(Literal::CreateR0<uint32>(42)));
+  builder.AddInstruction(
+      HloInstruction::CreateConstant(Literal::CreateR0<uint32>(42)));
+
+  auto module = CreateNewModule();
+  auto computation = module->AddEntryComputation(builder.Build());
+
+  EXPECT_EQ(2, computation->instruction_count());
+
+  HloCSE cse(/*is_layout_sensitive=*/false);
+  EXPECT_FALSE(cse.Run(module.get()).ValueOrDie());
+
+  EXPECT_EQ(2, computation->instruction_count());
+}
+
 }  // namespace
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/hlo_dataflow_analysis.cc b/tensorflow/compiler/xla/service/hlo_dataflow_analysis.cc
index b06e6c9f3e62f375a9e48f8ef81efe7121bbef94..d0200058683b2db8f5f0469d6c643014881f741e 100644
--- a/tensorflow/compiler/xla/service/hlo_dataflow_analysis.cc
+++ b/tensorflow/compiler/xla/service/hlo_dataflow_analysis.cc
@@ -363,7 +363,7 @@ bool HloDataflowAnalysis::UpdateCallValueSet(HloInstruction* call) {
 bool HloDataflowAnalysis::UpdateConditionalValueSet(
     HloInstruction* conditional) {
   CHECK_EQ(conditional->opcode(), HloOpcode::kConditional);
-  std::vector<const InstructionValueSet*> inputs = {
+  const InstructionValueSet* const inputs[] = {
       &GetInstructionValueSet(
           conditional->true_computation()->root_instruction()),
       &GetInstructionValueSet(
@@ -538,7 +538,7 @@ bool HloDataflowAnalysis::UpdateTupleValueSet(HloInstruction* tuple) {
 
 bool HloDataflowAnalysis::UpdateWhileValueSet(HloInstruction* xla_while) {
   CHECK_EQ(xla_while->opcode(), HloOpcode::kWhile);
-  std::vector<const InstructionValueSet*> inputs = {
+  const InstructionValueSet* const inputs[] = {
       &GetInstructionValueSet(xla_while->while_body()->root_instruction()),
       &GetInstructionValueSet(xla_while->operand(0))};
   if (ssa_form_) {
@@ -931,16 +931,17 @@ bool HloDataflowAnalysis::CanShareOperandBufferWithUser(
     }
     const HloUse& use = value.uses()[0];
 
-    if (user->fusion_kind() == HloInstruction::FusionKind::kLoop &&
-        user->fused_expression_root()->opcode() ==
-            HloOpcode::kDynamicUpdateSlice) {
-      // Loop fusion with kDynamicUpdateSlice fused root.
-      //
-      // Returns true iff there is exactly one use of 'operand' at shape index
-      // 'operand_index', and this singleton use is the fused root at operand
-      // index 0.
-      return use.instruction == user->fused_expression_root() &&
-             use.operand_number == 0;
+    if (user->fusion_kind() == HloInstruction::FusionKind::kLoop) {
+      if (user->fused_expression_root()->opcode() ==
+          HloOpcode::kDynamicUpdateSlice) {
+        // Loop fusion with kDynamicUpdateSlice fused root.
+        //
+        // Returns true iff there is exactly one use of 'operand' at shape index
+        // 'operand_index', and this singleton use is the fused root at operand
+        // index 0.
+        return use.instruction == user->fused_expression_root() &&
+               use.operand_number == 0;
+      }
     } else if (user->fusion_kind() == HloInstruction::FusionKind::kOutput &&
                user->fused_expression_root()->opcode() == HloOpcode::kAdd) {
       // Output fusion with kAdd fused root.
@@ -967,6 +968,7 @@ bool HloDataflowAnalysis::CanShareOperandBufferWithUser(
              use.operand_number == other_add_operand_index;
     }
   }
+
   if (user->opcode() == HloOpcode::kDynamicUpdateSlice ||
       user->opcode() == HloOpcode::kWhile) {
     // We eliminated other users in BufferLiveness::live_range_strictly_before,
@@ -998,8 +1000,13 @@ bool HloDataflowAnalysis::CanShareOperandBufferWithUser(
             }) != uses.end();
     return uses.size() == 2 && found_caller_use && found_elementwise_callee_use;
   }
-  // Check if 'user' is element-wise.
-  return user->IsElementwise();
+
+  // Loop fusions that contain transposing copies won't reach here as they have
+  // different layouts, which fails the check in the beginning of this function.
+  //
+  // Multi-output fusion will fail the check here as tuples are not considered
+  // an elementwise operation.
+  return user->IsElementwiseOnOperand(user->operand_index(operand));
 }
 
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/hlo_dataflow_analysis_test.cc b/tensorflow/compiler/xla/service/hlo_dataflow_analysis_test.cc
index 5798326dcbf65c3c34748afb02afab1dc7af9147..db1822ec47a7f52e2c3ef8dcbf433cd787ef75ab 100644
--- a/tensorflow/compiler/xla/service/hlo_dataflow_analysis_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_dataflow_analysis_test.cc
@@ -1974,6 +1974,89 @@ TEST_F(CanShareOperandBufferWithUserTest, ElementWiseSameShape) {
       dataflow_analysis_->CanShareOperandBufferWithUser(exp, {}, log, {}));
 }
 
+TEST_F(CanShareOperandBufferWithUserTest,
+       NonElementwiseLoopFusionCantAliasOperandBuffer) {
+  auto builder = HloComputation::Builder(TestName());
+  Shape data_shape = ShapeUtil::MakeShape(F32, {2, 2});
+
+  auto param0 = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, data_shape, "param0"));
+
+  auto neg = builder.AddInstruction(
+      HloInstruction::CreateUnary(data_shape, HloOpcode::kNegate, param0));
+
+  auto reverse = builder.AddInstruction(
+      HloInstruction::CreateReverse(data_shape, neg, {0, 1}));
+
+  BuildModule(builder.Build());
+  auto fusion = computation_->CreateFusionInstruction(
+      {reverse, neg}, HloInstruction::FusionKind::kLoop);
+  RunAnalysis();
+
+  EXPECT_FALSE(dataflow_analysis_->CanShareOperandBufferWithUser(param0, {},
+                                                                 fusion, {}));
+}
+
+TEST_F(CanShareOperandBufferWithUserTest,
+       MultiOutputFusionCantAliasOperandBuffer) {
+  auto builder = HloComputation::Builder(TestName());
+  Shape data_shape = ShapeUtil::MakeShape(F32, {2, 2});
+
+  Shape in_shape = ShapeUtil::MakeShape(F32, {8});
+  Shape out_shape = ShapeUtil::MakeShape(PRED, {8});
+  auto param0 = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, in_shape, "param0"));
+  auto param1 = builder.AddInstruction(
+      HloInstruction::CreateParameter(1, in_shape, "param1"));
+
+  auto copy0 = builder.AddInstruction(
+      HloInstruction::CreateUnary(in_shape, HloOpcode::kCopy, param0));
+  auto copy1 = builder.AddInstruction(
+      HloInstruction::CreateUnary(in_shape, HloOpcode::kCopy, param1));
+
+  auto tuple =
+      builder.AddInstruction(HloInstruction::CreateTuple({copy1, copy0}));
+
+  BuildModule(builder.Build());
+  auto fusion = computation_->CreateFusionInstruction(
+      {tuple, copy1, copy0}, HloInstruction::FusionKind::kLoop);
+  RunAnalysis();
+
+  EXPECT_FALSE(dataflow_analysis_->CanShareOperandBufferWithUser(param0, {},
+                                                                 fusion, {0}));
+  EXPECT_FALSE(dataflow_analysis_->CanShareOperandBufferWithUser(param0, {},
+                                                                 fusion, {1}));
+  EXPECT_FALSE(dataflow_analysis_->CanShareOperandBufferWithUser(param1, {},
+                                                                 fusion, {0}));
+  EXPECT_FALSE(dataflow_analysis_->CanShareOperandBufferWithUser(param1, {},
+                                                                 fusion, {1}));
+}
+
+TEST_F(CanShareOperandBufferWithUserTest,
+       ElementwiseLoopFusionCantAliasOperandBuffer) {
+  auto builder = HloComputation::Builder(TestName());
+  Shape data_shape = ShapeUtil::MakeShape(F32, {2, 2});
+
+  auto one = builder.AddInstruction(
+      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+  auto operand = builder.AddInstruction(
+      HloInstruction::CreateBroadcast(data_shape, one, {1}));
+
+  auto neg = builder.AddInstruction(
+      HloInstruction::CreateUnary(data_shape, HloOpcode::kNegate, operand));
+
+  auto exp = builder.AddInstruction(
+      HloInstruction::CreateUnary(data_shape, HloOpcode::kExp, neg));
+
+  BuildModule(builder.Build());
+  auto fusion = computation_->CreateFusionInstruction(
+      {exp, neg}, HloInstruction::FusionKind::kLoop);
+  RunAnalysis();
+
+  EXPECT_TRUE(dataflow_analysis_->CanShareOperandBufferWithUser(operand, {},
+                                                                fusion, {}));
+}
+
 TEST_F(CanShareOperandBufferWithUserTest, ElementWiseDifferentShape) {
   auto builder = HloComputation::Builder(TestName());
 
@@ -2048,6 +2131,46 @@ TEST_F(CanShareOperandBufferWithUserTest, FusedDynamicUpdateSlice) {
                                                                 fusion, {}));
 }
 
+TEST_F(CanShareOperandBufferWithUserTest,
+       FusedDynamicUpdateSliceWithConvertCantShare) {
+  auto builder = HloComputation::Builder(TestName());
+
+  Shape data_shape = ShapeUtil::MakeShape(F32, {8});
+  Shape data_shape_bf16 = ShapeUtil::MakeShape(BF16, {8});
+  auto tuple = builder.AddInstruction(HloInstruction::CreateParameter(
+      0, ShapeUtil::MakeTupleShape({data_shape, data_shape}), "tuple"));
+  auto gte0 = builder.AddInstruction(
+      HloInstruction::CreateGetTupleElement(data_shape, tuple, 0));
+  auto gte1 = builder.AddInstruction(
+      HloInstruction::CreateGetTupleElement(data_shape, tuple, 1));
+
+  auto convert1 = builder.AddInstruction(
+      HloInstruction::CreateConvert(data_shape_bf16, gte1));
+
+  // Create a DynamicUpdateSlice instruction of tuple element 1.
+  auto starts = builder.AddInstruction(
+      HloInstruction::CreateConstant(Literal::CreateR1<int32>({2})));
+  auto update = builder.AddInstruction(HloInstruction::CreateConstant(
+      Literal::CreateR1<float>({2.f, 2.f, 2.f})));
+  auto dynamic_update_slice =
+      builder.AddInstruction(HloInstruction::CreateDynamicUpdateSlice(
+          data_shape_bf16, convert1, update, starts));
+
+  auto convert2 = builder.AddInstruction(
+      HloInstruction::CreateConvert(data_shape, dynamic_update_slice));
+  builder.AddInstruction(HloInstruction::CreateTuple({gte0, convert2}));
+
+  BuildModule(builder.Build());
+  auto fusion = computation_->CreateFusionInstruction(
+      {convert2, dynamic_update_slice, starts, update, convert1},
+      HloInstruction::FusionKind::kLoop);
+  RunAnalysis();
+
+  // The fusion instruction can't share with tuple element 1.
+  EXPECT_FALSE(
+      dataflow_analysis_->CanShareOperandBufferWithUser(gte1, {}, fusion, {}));
+}
+
 TEST_F(CanShareOperandBufferWithUserTest, DynamicUpdateSliceCanShare) {
   auto builder = HloComputation::Builder(TestName());
 
diff --git a/tensorflow/compiler/xla/service/hlo_domain_isolator.cc b/tensorflow/compiler/xla/service/hlo_domain_isolator.cc
new file mode 100644
index 0000000000000000000000000000000000000000..78955db0da02f16eb93689db947dc1190ab7049a
--- /dev/null
+++ b/tensorflow/compiler/xla/service/hlo_domain_isolator.cc
@@ -0,0 +1,104 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/service/hlo_domain_isolator.h"
+
+#include "tensorflow/compiler/xla/map_util.h"
+#include "tensorflow/compiler/xla/service/hlo_computation.h"
+#include "tensorflow/compiler/xla/service/hlo_graph_dumper.h"
+#include "tensorflow/compiler/xla/service/hlo_opcode.h"
+#include "tensorflow/compiler/xla/types.h"
+
+namespace xla {
+
+class HloDomainIsolator::RunContext {
+ public:
+  RunContext(HloModule* module, HloDomainIsolator* isolator)
+      : module_(module), isolator_(isolator) {}
+
+  StatusOr<bool> Run();
+
+ private:
+  // Inserts a kDomain instruction between parent and operand, in case
+  // the attribute (ie, sharding) values change between instruction and operand.
+  // Returns the newly inserted kDomain instruction, or nullptr if no kDomain
+  // instruction was necessary.
+  StatusOr<HloInstruction*> CreateDomain(HloInstruction* instruction,
+                                         HloInstruction* parent,
+                                         HloInstruction* operand);
+
+  HloModule* module_;
+  HloDomainIsolator* isolator_;
+};
+
+StatusOr<HloInstruction*> HloDomainIsolator::RunContext::CreateDomain(
+    HloInstruction* instruction, HloInstruction* parent,
+    HloInstruction* operand) {
+  HloInstruction* domain = nullptr;
+  std::unique_ptr<HloInstruction> domain_instruction =
+      isolator_->creator_(instruction, operand);
+  if (domain_instruction != nullptr) {
+    domain = operand->parent()->AddInstruction(std::move(domain_instruction));
+    TF_RETURN_IF_ERROR(operand->ReplaceUseWith(parent, domain));
+  }
+  return domain;
+}
+
+StatusOr<bool> HloDomainIsolator::RunContext::Run() {
+  hlo_graph_dumper::MaybeDumpHloModule(*module_, "Before Domain Isolator");
+
+  int64 added_domains = 0;
+  for (HloComputation* computation : module_->computations()) {
+    // Walk in post order and place all the required kDomain instructions.
+    for (HloInstruction* instruction :
+         computation->MakeInstructionPostOrder()) {
+      if (instruction->opcode() == HloOpcode::kDomain) {
+        continue;
+      }
+      for (HloInstruction* operand : instruction->unique_operands()) {
+        // When applying multiple domains, we could end up stacking more than
+        // one in one edge, so here we want to build the effective
+        // (kDomain-less) instruction->operand edge.
+        HloInstruction* parent = instruction;
+        while (operand->opcode() == HloOpcode::kDomain) {
+          parent = operand;
+          operand = operand->mutable_operand(0);
+        }
+        // Check whether a kDomain is necessary between instruction and operand.
+        TF_ASSIGN_OR_RETURN(HloInstruction * domain,
+                            CreateDomain(instruction, parent, operand));
+        if (domain != nullptr) {
+          VLOG(4) << "New domain: " << domain->ToString();
+          ++added_domains;
+        }
+      }
+    }
+  }
+  VLOG(3) << "Added " << added_domains << " kDomain instructions";
+  if (added_domains > 0) {
+    hlo_graph_dumper::MaybeDumpHloModule(*module_, "After Domain Isolator");
+  }
+  return added_domains > 0;
+}
+
+HloDomainIsolator::HloDomainIsolator(DomainCreator creator)
+    : creator_(std::move(creator)) {}
+
+StatusOr<bool> HloDomainIsolator::Run(HloModule* module) {
+  RunContext run_context(module, this);
+  return run_context.Run();
+}
+
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/hlo_domain_isolator.h b/tensorflow/compiler/xla/service/hlo_domain_isolator.h
new file mode 100644
index 0000000000000000000000000000000000000000..e0c5718509dabebb7b9307bf764b0ea1ce7369a0
--- /dev/null
+++ b/tensorflow/compiler/xla/service/hlo_domain_isolator.h
@@ -0,0 +1,56 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_HLO_DOMAIN_ISOLATOR_H_
+#define TENSORFLOW_COMPILER_XLA_SERVICE_HLO_DOMAIN_ISOLATOR_H_
+
+#include <memory>
+#include <vector>
+
+#include "tensorflow/compiler/xla/service/hlo_instruction.h"
+#include "tensorflow/compiler/xla/service/hlo_module.h"
+#include "tensorflow/compiler/xla/service/hlo_pass_interface.h"
+
+namespace xla {
+
+// Domain isolation is the task of placing kDomain instructions between HLO
+// instructions having different shrading. A kDomain instruction is essentially
+// used to break an HLO graph edge connecting two instructions with different
+// sharding. If a set of connected instructions have all the same sharding, no
+// kDomain instruciton will be placed.
+class HloDomainIsolator : public HloPassInterface {
+ public:
+  // Creates a new kDomain instruction for the edge between the use instruction
+  // (the first HloInstruction argument), and the operand instruction (the
+  // second HloInstruction argument).
+  // Returns nullptr in case no domain separation is necessary.
+  using DomainCreator = std::function<std::unique_ptr<HloInstruction>(
+      HloInstruction*, HloInstruction*)>;
+
+  explicit HloDomainIsolator(DomainCreator creator);
+
+  tensorflow::StringPiece name() const override { return "domain_isolator"; }
+
+  StatusOr<bool> Run(HloModule* module) override;
+
+ private:
+  class RunContext;
+
+  DomainCreator creator_;
+};
+
+}  // namespace xla
+
+#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_HLO_DOMAIN_ISOLATOR_H_
diff --git a/tensorflow/compiler/xla/service/hlo_domain_map.cc b/tensorflow/compiler/xla/service/hlo_domain_map.cc
new file mode 100644
index 0000000000000000000000000000000000000000..ebd5adb5d573ce4b556046f85eb26a6ad59efcb9
--- /dev/null
+++ b/tensorflow/compiler/xla/service/hlo_domain_map.cc
@@ -0,0 +1,176 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/service/hlo_domain_map.h"
+
+#include <algorithm>
+
+#include "tensorflow/compiler/xla/map_util.h"
+#include "tensorflow/compiler/xla/service/hlo_opcode.h"
+#include "tensorflow/compiler/xla/types.h"
+
+namespace xla {
+
+/* static */ StatusOr<std::unique_ptr<HloDomainMap>> HloDomainMap::Create(
+    HloComputation* computation, string domain_kind) {
+  auto domain_map = WrapUnique(new HloDomainMap(std::move(domain_kind)));
+  TF_RETURN_IF_ERROR(domain_map->Populate(computation));
+  return std::move(domain_map);
+}
+
+/* static */ StatusOr<std::unique_ptr<HloDomainMap>> HloDomainMap::Create(
+    HloModule* module, string domain_kind) {
+  auto domain_map = WrapUnique(new HloDomainMap(std::move(domain_kind)));
+  for (HloComputation* computation : module->computations()) {
+    TF_RETURN_IF_ERROR(domain_map->Populate(computation));
+  }
+  return std::move(domain_map);
+}
+
+bool HloDomainMap::InSameDomain(HloInstruction* instruction1,
+                                HloInstruction* instruction2) const {
+  int64 domain_id1 = FindOrDefault(instruction_to_domain_, instruction1, -1);
+  int64 domain_id2 = FindOrDefault(instruction_to_domain_, instruction2, -1);
+  return domain_id1 >= 0 && domain_id1 == domain_id2;
+}
+
+Status HloDomainMap::TryProcessEmptyDomain(HloInstruction* instruction) {
+  TF_RET_CHECK(instruction->opcode() == HloOpcode::kDomain);
+  // We only check operands, so we are sure to not process the empty domain from
+  // both sides.
+  for (HloInstruction* operand : instruction->unique_operands()) {
+    if (IsDomainInstruction(operand)) {
+      auto domain = MakeUnique<DomainMetadata::Domain>();
+      domain->enter_domains.insert(operand);
+      domain->exit_domains.insert(instruction);
+      TF_RETURN_IF_ERROR(InsertDomain(std::move(domain)));
+    }
+  }
+  return Status::OK();
+}
+
+Status HloDomainMap::Populate(HloComputation* computation) {
+  for (HloInstruction* instruction : computation->instructions()) {
+    if (IsDomainInstruction(instruction)) {
+      // If this is a kDomain of the kind we are currently processing, check
+      // whether this is an "empty domain".
+      TF_RETURN_IF_ERROR(TryProcessEmptyDomain(instruction));
+      continue;
+    }
+    int64 domain_id = FindOrDefault(instruction_to_domain_, instruction, -1);
+    if (domain_id >= 0) {
+      // We have already processed this instruction.
+      continue;
+    }
+    TF_ASSIGN_OR_RETURN(std::unique_ptr<DomainMetadata::Domain> domain,
+                        CreateDomain(instruction));
+    TF_RETURN_IF_ERROR(InsertDomain(std::move(domain)));
+  }
+  return Status::OK();
+}
+
+Status HloDomainMap::InsertDomain(
+    std::unique_ptr<DomainMetadata::Domain> domain) {
+  int64 domain_id = instruction_domains_.size();
+  instruction_domains_.push_back(std::move(domain));
+  for (HloInstruction* instruction : instruction_domains_.back()->reach_set) {
+    instruction_to_domain_[instruction] = domain_id;
+  }
+  return Status::OK();
+}
+
+Status HloDomainMap::ExpandDomain(HloInstruction* instruction,
+                                  DomainMetadata::Domain* domain) const {
+  std::vector<HloInstruction*> in_queue;
+  in_queue.push_back(instruction);
+  while (!in_queue.empty()) {
+    HloInstruction* current_instruction = in_queue.back();
+    in_queue.pop_back();
+    if (domain->reach_set.insert(current_instruction).second) {
+      // We should not be finding instructions with assigned domain here.
+      // If we assigned a domain to the instruction, it means that all the
+      // instructions reached by it, should have a domain as well.
+      int64 domain_id =
+          FindOrDefault(instruction_to_domain_, current_instruction, -1);
+      TF_RET_CHECK(domain_id < 0)
+          << "Instruction " << current_instruction->ToString()
+          << " already has domain " << domain_id;
+      for (HloInstruction* operand : current_instruction->operands()) {
+        if (IsDomainInstruction(operand)) {
+          // The reach set instruction is a user of the domain instruction
+          // (the instruction sees the kDomain as operand).
+          // IOW the dataflow enters the domain through the kDomain instruction.
+          domain->enter_domains.insert(operand);
+        } else {
+          in_queue.push_back(operand);
+        }
+      }
+      for (HloInstruction* user : current_instruction->users()) {
+        if (IsDomainInstruction(user)) {
+          // The reach set instruction is an operand of the domain instruction
+          // (the instruction sees the kDomain as user).
+          // IOW the dataflow exits the domain through the kDomain instruction.
+          domain->exit_domains.insert(user);
+        } else {
+          in_queue.push_back(user);
+        }
+      }
+    }
+  }
+  return Status::OK();
+}
+
+StatusOr<std::unique_ptr<DomainMetadata::Domain>> HloDomainMap::CreateDomain(
+    HloInstruction* instruction) const {
+  auto domain = MakeUnique<DomainMetadata::Domain>();
+  TF_RETURN_IF_ERROR(ExpandDomain(instruction, domain.get()));
+  domain->instructions = MakeNonDomainInstructions(domain->reach_set);
+  return std::move(domain);
+}
+
+bool HloDomainMap::IsDomainInstruction(HloInstruction* instruction) const {
+  if (instruction->opcode() != HloOpcode::kDomain) {
+    return false;
+  }
+  if (!domain_kind_.empty()) {
+    if (instruction->user_side_metadata().Kind() != domain_kind_) {
+      return false;
+    }
+    // Both user and operand side of the metadata must be of the same kind.
+    CHECK(instruction->operand_side_metadata().Kind() == domain_kind_)
+        << "Instruction " << instruction->ToString()
+        << " has mismatching metadata kinds";
+  }
+  return true;
+}
+
+/* static */ std::vector<HloInstruction*>
+HloDomainMap::MakeNonDomainInstructions(
+    const tensorflow::gtl::FlatSet<HloInstruction*>& instruction_set) {
+  std::vector<HloInstruction*> instructions;
+  instructions.reserve(instruction_set.size());
+  for (HloInstruction* instruction : instruction_set) {
+    if (instruction->opcode() != HloOpcode::kDomain) {
+      instructions.push_back(instruction);
+    }
+  }
+  std::sort(instructions.begin(), instructions.end(),
+            [](HloInstruction* a, HloInstruction* b) {
+              return a->unique_id() < b->unique_id();
+            });
+  return instructions;
+}
+
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/hlo_domain_map.h b/tensorflow/compiler/xla/service/hlo_domain_map.h
new file mode 100644
index 0000000000000000000000000000000000000000..e62ef763fb3881ab6030b1f6a66266ac80a3d84d
--- /dev/null
+++ b/tensorflow/compiler/xla/service/hlo_domain_map.h
@@ -0,0 +1,108 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_HLO_DOMAIN_MAP_H_
+#define TENSORFLOW_COMPILER_XLA_SERVICE_HLO_DOMAIN_MAP_H_
+
+#include <memory>
+#include <vector>
+
+#include "tensorflow/compiler/xla/service/hlo_computation.h"
+#include "tensorflow/compiler/xla/service/hlo_domain_metadata.h"
+#include "tensorflow/compiler/xla/service/hlo_instruction.h"
+#include "tensorflow/compiler/xla/service/hlo_module.h"
+#include "tensorflow/compiler/xla/statusor.h"
+#include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/lib/gtl/flatmap.h"
+#include "tensorflow/core/lib/gtl/flatset.h"
+
+namespace xla {
+
+// The HloDomainMap splits a set of instructions within a module or computation,
+// into different domains, separated by kDomain instructions.
+// A domain is composed by a set of instructions which can reach each other via
+// operand/user edges, without crossing a kDomain insutrction of a given kind.
+// A domain never crosses computation boundaries.
+class HloDomainMap {
+ public:
+  // Creates a new HloDomainMap, creating all the domains within the input
+  // computation, of the given kind. If domain_kind is not empty, only the
+  // kDomain instructions of domain_kind will be considered as separators.
+  // Otherwise every kDomain instruction will be splitting domains.
+  static StatusOr<std::unique_ptr<HloDomainMap>> Create(
+      HloComputation* computation, string domain_kind);
+
+  // Creates a new HloDomainMap, creating all the domains within the input
+  // module, of the given kind. If domain_kind is not empty, only the
+  // kDomain instructions of domain_kind will be considered as separators.
+  // Otherwise every kDomain instruction will be splitting domains.
+  static StatusOr<std::unique_ptr<HloDomainMap>> Create(HloModule* module,
+                                                        string domain_kind);
+
+  // Retrieves all the domains the input module or computation are composed by.
+  const std::vector<std::unique_ptr<DomainMetadata::Domain>>& GetDomains()
+      const {
+    return instruction_domains_;
+  }
+
+  // Checks whether two instructions are within the same domain.
+  bool InSameDomain(HloInstruction* instruction1,
+                    HloInstruction* instruction2) const;
+
+  // Checks whether instruction is a kDomain instruction of the kind we are
+  // currently processing.
+  bool IsDomainInstruction(HloInstruction* instruction) const;
+
+ private:
+  HloDomainMap(string domain_kind) : domain_kind_(std::move(domain_kind)) {}
+
+  // Check if the kDomain instruction is facing (via its operand link) another
+  // kDomain instruction of the same kind, hence defining an empty domain.
+  // If that is the case, create the empty domain and call the proper
+  // normalizer.
+  Status TryProcessEmptyDomain(HloInstruction* instruction);
+
+  Status Populate(HloComputation* computation);
+
+  // Inserts the provided domain into the ones tracked by this object,
+  // creating a new domain ID.
+  Status InsertDomain(std::unique_ptr<DomainMetadata::Domain> domain);
+
+  // From the given instruction, epxands operand and user wise, the set of
+  // instructions which can be reached without crossing a kDomain instruction
+  // of the kind specified by domain_kind_.
+  // The domain data structure will be populated with all the reached
+  // instructions, and the boundaries of the domain, with the kDomain
+  // instructions encountered while expanding the reach.
+  Status ExpandDomain(HloInstruction* instruction,
+                      DomainMetadata::Domain* domain) const;
+
+  // Creates a domain data structure using the ExpandDomain() API.
+  StatusOr<std::unique_ptr<DomainMetadata::Domain>> CreateDomain(
+      HloInstruction* instruction) const;
+
+  // Out of an instruction set, returns a vector of all the ones which are not
+  // a kDomain kind.
+  static std::vector<HloInstruction*> MakeNonDomainInstructions(
+      const tensorflow::gtl::FlatSet<HloInstruction*>& instruction_set);
+
+  string domain_kind_;
+  std::vector<std::unique_ptr<DomainMetadata::Domain>> instruction_domains_;
+  tensorflow::gtl::FlatMap<HloInstruction*, int64> instruction_to_domain_;
+};
+
+}  // namespace xla
+
+#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_HLO_DOMAIN_MAP_H_
diff --git a/tensorflow/compiler/xla/service/hlo_domain_metadata.h b/tensorflow/compiler/xla/service/hlo_domain_metadata.h
new file mode 100644
index 0000000000000000000000000000000000000000..aa0308100a21f109579de75788fce7d242d6a6b0
--- /dev/null
+++ b/tensorflow/compiler/xla/service/hlo_domain_metadata.h
@@ -0,0 +1,84 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_HLO_DOMAIN_METADATA_H_
+#define TENSORFLOW_COMPILER_XLA_SERVICE_HLO_DOMAIN_METADATA_H_
+
+#include <memory>
+#include <string>
+#include <vector>
+
+#include "tensorflow/compiler/xla/statusor.h"
+#include "tensorflow/compiler/xla/types.h"
+#include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/lib/core/stringpiece.h"
+#include "tensorflow/core/lib/gtl/flatset.h"
+
+namespace xla {
+
+// Cannot include hlo_instruction.h as this file is included from there.
+class HloInstruction;
+
+// The DomainMetadata represents the base class for metadata which can be
+// attached to kDomain HLO instructions.
+class DomainMetadata {
+ public:
+  // A Domain data structure captures all the information about a kDomain
+  // bounded instruction set.
+  struct Domain {
+    // The set of instructions which are reachable from each other via
+    // operand/user pathways, without crossing a kDomain instruction of a given
+    // kind. The reach_set can contain kDomain instructions of other kinds, if
+    // two domains of different kind intersect each other.
+    tensorflow::gtl::FlatSet<HloInstruction*> reach_set;
+
+    // The same instructions in reach_set, but purged from kDomain instructions.
+    std::vector<HloInstruction*> instructions;
+
+    // If we consider a graph edge as an arrow oriented from the operand to the
+    // user, the enter_domains will contain the set of kDomain instructions
+    // whose dataflow enters the reach set (domain), while the exit_domains
+    // contains the set of kDomain instructions whose dataflow exit the reach
+    // set.
+    tensorflow::gtl::FlatSet<HloInstruction*> enter_domains;
+    tensorflow::gtl::FlatSet<HloInstruction*> exit_domains;
+  };
+
+  virtual ~DomainMetadata() = default;
+
+  // Clones the metadata object.
+  virtual std::unique_ptr<DomainMetadata> Clone() const = 0;
+
+  // Returns the metadata type. A unique identifier which describes the real
+  // metadata type.
+  virtual tensorflow::StringPiece Kind() const = 0;
+
+  // Compares the metadata object with another one and returns true if the
+  // two matches.
+  virtual bool Matches(const DomainMetadata& other) const = 0;
+
+  // Returns a string representation of the metadata.
+  virtual string ToString() const = 0;
+
+  // Given a reachable set (the set of instructions which are reachable from
+  // each other via user/operand pathways, without crossing a kDomain
+  // instruciton), makes sure that all of them have metadata attributes which
+  // are coherent with this metadata object.
+  virtual Status NormalizeInstructions(const Domain& domain) const = 0;
+};
+
+}  // namespace xla
+
+#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_HLO_DOMAIN_METADATA_H_
diff --git a/tensorflow/compiler/xla/service/hlo_domain_remover.cc b/tensorflow/compiler/xla/service/hlo_domain_remover.cc
new file mode 100644
index 0000000000000000000000000000000000000000..1d06040b0e7c92b03f4cb5481bdee73a0f74f939
--- /dev/null
+++ b/tensorflow/compiler/xla/service/hlo_domain_remover.cc
@@ -0,0 +1,149 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/service/hlo_domain_remover.h"
+
+#include "tensorflow/compiler/xla/service/hlo_computation.h"
+#include "tensorflow/compiler/xla/service/hlo_domain_isolator.h"
+#include "tensorflow/compiler/xla/service/hlo_domain_map.h"
+#include "tensorflow/compiler/xla/service/hlo_graph_dumper.h"
+#include "tensorflow/compiler/xla/service/hlo_instruction.h"
+#include "tensorflow/compiler/xla/service/hlo_opcode.h"
+#include "tensorflow/compiler/xla/types.h"
+
+namespace xla {
+
+class HloDomainRemover::RunContext {
+ public:
+  RunContext(HloModule* module, HloDomainRemover* remover)
+      : module_(module), remover_(remover) {}
+
+  StatusOr<bool> Run();
+
+ private:
+  // Verifies the consistency of the domain, and normalizes the instructions
+  // within it.
+  Status VerifyAndNormalizeDomain(const DomainMetadata::Domain& domain);
+
+  HloModule* module_;
+  HloDomainRemover* remover_;
+};
+
+Status HloDomainRemover::RunContext::VerifyAndNormalizeDomain(
+    const DomainMetadata::Domain& domain) {
+  // Verify that the whole kDomain frontier bounding the instruction reach set,
+  // has matching metadata.
+  // A kDomain instruction has two sides of metadata, a user facing and an
+  // operand facing.
+  // A reachable instruction set can make contact with a kDomain instruction on
+  // a user facing side (the kDomain is operand of the instruction), or on a
+  // operand facing side (the kDomain is user of the instruction).
+  // And depending on the contact side, the proper metadata object
+  // (user_side_metadata() vs. operand_side_metadata()) needs to be used for
+  // consistency checks.
+  const DomainMetadata* ref_metadata = nullptr;
+  VLOG(4) << "Reach set:";
+  for (HloInstruction* instruction : domain.instructions) {
+    VLOG(4) << "  " << instruction->name();
+  }
+  VLOG(4) << "  Domains:";
+  for (HloInstruction* instruction : domain.enter_domains) {
+    const DomainMetadata& meta = instruction->user_side_metadata();
+    VLOG(4) << "    User side: " << instruction->name();
+    VLOG(4) << "      " << meta.ToString();
+    if (ref_metadata == nullptr) {
+      ref_metadata = &meta;
+    } else {
+      TF_RET_CHECK(meta.Matches(*ref_metadata))
+          << "Metadata mismatch at instruction " << instruction->name() << " : "
+          << meta.ToString() << " vs " << ref_metadata->ToString();
+    }
+  }
+  for (HloInstruction* instruction : domain.exit_domains) {
+    const DomainMetadata& meta = instruction->operand_side_metadata();
+    VLOG(4) << "    Operand side: " << instruction->name();
+    VLOG(4) << "      " << meta.ToString();
+    if (ref_metadata == nullptr) {
+      ref_metadata = &meta;
+    } else {
+      TF_RET_CHECK(meta.Matches(*ref_metadata))
+          << "Metadata mismatch at instruction " << instruction->name() << " : "
+          << meta.ToString() << " vs " << ref_metadata->ToString();
+    }
+  }
+  if (ref_metadata != nullptr) {
+    VLOG(4) << "Applying domain normalization: " << ref_metadata->ToString();
+    TF_RETURN_IF_ERROR(ref_metadata->NormalizeInstructions(domain));
+  } else {
+    // No kDomain instruction was present within this domain, so call the
+    // generic normalization functions and have them apply their heuristic.
+    VLOG(2) << "Applying domain-less normalization";
+    TF_RETURN_IF_ERROR(remover_->normalizer_(domain));
+  }
+  return Status::OK();
+}
+
+StatusOr<bool> HloDomainRemover::RunContext::Run() {
+  VLOG(4) << "Processing metadata domain: '" << remover_->kind_ << "'";
+  hlo_graph_dumper::MaybeDumpHloModule(*module_, "Before Domain Remover");
+
+  int64 removed_domains = 0;
+  for (HloComputation* computation : module_->computations()) {
+    // First create the domain instruciton sets. A domain instruction set is
+    // the set of instructions whose edges never cross a kDomain instruction.
+    TF_ASSIGN_OR_RETURN(std::unique_ptr<HloDomainMap> domain_map,
+                        HloDomainMap::Create(computation, remover_->kind_));
+    // Verify and normalize every domain populated within the map.
+    for (auto& domain : domain_map->GetDomains()) {
+      TF_RETURN_IF_ERROR(VerifyAndNormalizeDomain(*domain));
+    }
+
+    // Now remove all the kDomain instructions of the kind specified by the
+    // remover, that are within the currently processed computation from the
+    // graph.
+    for (HloInstruction* instruction :
+         computation->MakeInstructionPostOrder()) {
+      for (HloInstruction* operand : instruction->unique_operands()) {
+        if (domain_map->IsDomainInstruction(operand)) {
+          VLOG(5) << "Removing " << operand->name();
+          TF_RETURN_IF_ERROR(
+              operand->ReplaceAllUsesWith(operand->mutable_operand(0)));
+          TF_RETURN_IF_ERROR(computation->RemoveInstruction(operand));
+          ++removed_domains;
+        }
+      }
+    }
+    HloInstruction* root = computation->root_instruction();
+    if (root != nullptr && domain_map->IsDomainInstruction(root)) {
+      VLOG(5) << "Removing " << root->name();
+      computation->set_root_instruction(root->mutable_operand(0));
+      TF_RETURN_IF_ERROR(computation->RemoveInstruction(root));
+      ++removed_domains;
+    }
+  }
+  VLOG(3) << "Removed " << removed_domains << " kDomain instructions of '"
+          << remover_->kind_ << "' kind";
+  if (removed_domains > 0) {
+    hlo_graph_dumper::MaybeDumpHloModule(*module_, "After Domain Remover");
+  }
+  return removed_domains > 0;
+}
+
+StatusOr<bool> HloDomainRemover::Run(HloModule* module) {
+  RunContext run_context(module, this);
+  return run_context.Run();
+}
+
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/hlo_domain_remover.h b/tensorflow/compiler/xla/service/hlo_domain_remover.h
new file mode 100644
index 0000000000000000000000000000000000000000..0c71dd34fd4d2944037dc965a2c9ad2c592d6e3e
--- /dev/null
+++ b/tensorflow/compiler/xla/service/hlo_domain_remover.h
@@ -0,0 +1,56 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_HLO_DOMAIN_REMOVER_H_
+#define TENSORFLOW_COMPILER_XLA_SERVICE_HLO_DOMAIN_REMOVER_H_
+
+#include "tensorflow/compiler/xla/service/hlo_domain_metadata.h"
+#include "tensorflow/compiler/xla/service/hlo_module.h"
+#include "tensorflow/compiler/xla/service/hlo_pass_interface.h"
+#include "tensorflow/core/lib/core/status.h"
+
+namespace xla {
+
+// Removes all the kDomain instructions of a given kind from the input module,
+// and calls the normalizer to propagate the properties on the possibly new born
+// instructions.
+class HloDomainRemover : public HloPassInterface {
+ public:
+  // Creates a new HloDomainRemover object tasked at removing all the kDomain
+  // instructions of a given kind.
+  // In case a reachable set (the set of instructions within a computation,
+  // which are mutually reachable via operand/user pathways) has all the
+  // instructions in it with the same attributes (ie, sharding), a normalizer
+  // function is tasked at applying attribute normalization on the instructions
+  // within such domain.
+  HloDomainRemover(
+      tensorflow::StringPiece kind,
+      std::function<Status(const DomainMetadata::Domain&)> normalizer)
+      : kind_(kind.ToString()), normalizer_(std::move(normalizer)) {}
+
+  tensorflow::StringPiece name() const override { return "domain_remover"; }
+
+  StatusOr<bool> Run(HloModule* module) override;
+
+ private:
+  class RunContext;
+
+  string kind_;
+  std::function<Status(const DomainMetadata::Domain&)> normalizer_;
+};
+
+}  // namespace xla
+
+#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_HLO_DOMAIN_REMOVER_H_
diff --git a/tensorflow/compiler/xla/service/hlo_domain_test.cc b/tensorflow/compiler/xla/service/hlo_domain_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..5553ddb153f7f1f2e6a790890c11f35e192488c4
--- /dev/null
+++ b/tensorflow/compiler/xla/service/hlo_domain_test.cc
@@ -0,0 +1,432 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/legacy_flags/debug_options_flags.h"
+#include "tensorflow/compiler/xla/service/hlo_domain_isolator.h"
+#include "tensorflow/compiler/xla/service/hlo_domain_metadata.h"
+#include "tensorflow/compiler/xla/service/hlo_domain_remover.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
+#include "tensorflow/compiler/xla/service/hlo_sharding_metadata.h"
+#include "tensorflow/compiler/xla/test.h"
+#include "tensorflow/compiler/xla/tests/hlo_test_base.h"
+#include "tensorflow/core/lib/core/status_test_util.h"
+
+namespace xla {
+namespace {
+
+class HloDomainTest : public HloTestBase {
+ protected:
+  bool FindUserViaDomainPath(HloInstruction* instruction,
+                             HloInstruction* operand) const {
+    for (HloInstruction* user : operand->users()) {
+      if (user == instruction) {
+        return true;
+      }
+      if (user->opcode() == HloOpcode::kDomain &&
+          FindUserViaDomainPath(instruction, user)) {
+        return true;
+      }
+    }
+    return false;
+  }
+
+  // Checks whether there is a kDomain instruction in the edge between the
+  // instruction and the operand.
+  bool HasDomainEdge(HloModule* module,
+                     tensorflow::StringPiece instruction_name,
+                     tensorflow::StringPiece operand_name) {
+    HloInstruction* instruction = FindInstruction(module, instruction_name);
+    HloInstruction* operand = FindInstruction(module, operand_name);
+    CHECK_NE(instruction, nullptr);
+    CHECK_NE(operand, nullptr);
+    if (!instruction->IsUserOf(operand)) {
+      // If instruction is not an immediate user, we must find a path from
+      // operand to instruction anyway, otherwise there is a corruption.
+      if (FindUserViaDomainPath(instruction, operand)) {
+        return true;
+      }
+      LOG(FATAL) << "Bad HLO module generated across the '" << instruction_name
+                 << "' and '" << operand_name << "' instructions:\n"
+                 << module->ToString();
+    }
+    return false;
+  }
+
+  StatusOr<std::unique_ptr<HloModule>> ParseModule(
+      tensorflow::StringPiece hlo_string) {
+    HloModuleConfig config;
+    config.set_debug_options(legacy_flags::GetDebugOptionsFromFlags());
+    return ParseHloString(hlo_string, config);
+  }
+};
+
+// Dummy DomainMetadata implementation which create kDomain boundaries around
+// HLO instructions with the same metadata().op_name() values.
+class OpNameMetadata : public DomainMetadata {
+ public:
+  explicit OpNameMetadata(string opname) : opname_(std::move(opname)) {}
+
+  std::unique_ptr<DomainMetadata> Clone() const override {
+    return MakeUnique<OpNameMetadata>(opname_);
+  }
+
+  tensorflow::StringPiece Kind() const override { return KindName(); }
+
+  bool Matches(const DomainMetadata& other) const override {
+    const OpNameMetadata* other_ptr =
+        dynamic_cast<const OpNameMetadata*>(&other);
+    if (other_ptr == nullptr) {
+      // If other is not a OpNameMetadata, then it is clearly a no match.
+      return false;
+    }
+    return opname_ == other_ptr->opname_;
+  }
+
+  string ToString() const override { return opname_; }
+
+  Status NormalizeInstructions(
+      const DomainMetadata::Domain& domain) const override {
+    // For the purposes of this test, nothing to do.
+    return Status::OK();
+  }
+
+  static tensorflow::StringPiece KindName() { return "opname"; }
+
+ private:
+  string opname_;
+};
+
+// Creator function for OpNameMetadata domains.
+std::unique_ptr<HloInstruction> OpNameDomainCreator(HloInstruction* instruction,
+                                                    HloInstruction* operand) {
+  if (instruction->metadata().op_name() == operand->metadata().op_name()) {
+    return nullptr;
+  }
+  std::unique_ptr<DomainMetadata> operand_side_metadata =
+      MakeUnique<OpNameMetadata>(operand->metadata().op_name());
+  std::unique_ptr<DomainMetadata> user_side_metadata =
+      MakeUnique<OpNameMetadata>(instruction->metadata().op_name());
+  return HloInstruction::CreateDomain(operand->shape(), operand,
+                                      std::move(operand_side_metadata),
+                                      std::move(user_side_metadata));
+}
+
+Status OpNameDomainNormalizer(const DomainMetadata::Domain& domain) {
+  // Nothing to do for the particular use this test make of the OpName domains.
+  return Status::OK();
+}
+
+TEST_F(HloDomainTest, CheckDomainLinks) {
+  const char* const hlo_string = R"(
+HloModule Module
+
+ENTRY entry {
+  p0 = (f32[4], f32[4]) parameter(0)
+  a = f32[4] get-tuple-element(p0), index=0
+  b = f32[4] get-tuple-element(p0), index=1
+  c = f32[4] add(f32[4] a, f32[4] b), sharding={maximal device=1}
+  d = f32[4] subtract(a, b), sharding={maximal device=1}
+  e = f32[4] multiply(c, d), sharding={maximal device=1}
+  ROOT f = (f32[4], f32[4], f32[4]) tuple(c, d, e)
+}
+)";
+
+  TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
+                          ParseModule(hlo_string));
+  LOG(INFO) << "Original module:\n" << module->ToString();
+
+  HloDomainIsolator isolator(CreateShardingDomain);
+  TF_ASSERT_OK_AND_ASSIGN(bool isolator_changed, isolator.Run(module.get()));
+  EXPECT_TRUE(isolator_changed);
+
+  EXPECT_TRUE(HasDomainEdge(module.get(), "c", "a"));
+  EXPECT_TRUE(HasDomainEdge(module.get(), "c", "b"));
+  EXPECT_TRUE(HasDomainEdge(module.get(), "d", "a"));
+  EXPECT_TRUE(HasDomainEdge(module.get(), "d", "b"));
+  EXPECT_FALSE(HasDomainEdge(module.get(), "e", "c"));
+  EXPECT_FALSE(HasDomainEdge(module.get(), "e", "d"));
+
+  HloDomainRemover remover(ShardingMetadata::KindName(),
+                           NormalizeShardingDomain);
+  TF_ASSERT_OK_AND_ASSIGN(bool remover_changed, remover.Run(module.get()));
+  EXPECT_TRUE(remover_changed);
+
+  EXPECT_FALSE(HasDomainEdge(module.get(), "c", "a"));
+  EXPECT_FALSE(HasDomainEdge(module.get(), "c", "b"));
+  EXPECT_FALSE(HasDomainEdge(module.get(), "d", "a"));
+  EXPECT_FALSE(HasDomainEdge(module.get(), "d", "b"));
+  EXPECT_FALSE(HasDomainEdge(module.get(), "e", "c"));
+  EXPECT_FALSE(HasDomainEdge(module.get(), "e", "d"));
+}
+
+TEST_F(HloDomainTest, CheckNoDomainAddedIfNoSharding) {
+  const char* const hlo_string = R"(
+HloModule Module
+
+ENTRY entry {
+  p0 = (f32[4], f32[4]) parameter(0)
+  a = f32[4] get-tuple-element(p0), index=0
+  b = f32[4] get-tuple-element(p0), index=1
+  c = f32[4] add(f32[4] a, f32[4] b)
+  d = f32[4] subtract(a, b)
+  e = f32[4] multiply(c, d)
+  ROOT f = (f32[4], f32[4], f32[4]) tuple(c, d, e)
+}
+)";
+
+  TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
+                          ParseModule(hlo_string));
+  LOG(INFO) << "Original module:\n" << module->ToString();
+
+  HloDomainIsolator isolator(CreateShardingDomain);
+  TF_ASSERT_OK_AND_ASSIGN(bool isolator_changed, isolator.Run(module.get()));
+  EXPECT_TRUE(!isolator_changed);
+}
+
+TEST_F(HloDomainTest, CheckDomainAroundIO) {
+  const char* const hlo_string = R"(
+HloModule Module
+
+ENTRY entry {
+  p0 = (f32[4]) parameter(0)
+  a = f32[4] get-tuple-element(p0), index=0
+  b = (f32[4], u32[]) send(a), channel_id=1, sharding={maximal device=0}
+  c = () send-done(b), channel_id=1, sharding={maximal device=0}
+  d = (f32[4], u32[]) recv(), channel_id=2, sharding={maximal device=0}
+  e = f32[4] recv-done(d), channel_id=2, sharding={maximal device=0}
+  f = f32[4] add(a, e)
+  g = f32[4] subtract(a, e)
+  ROOT h = (f32[4], f32[4]) tuple(f, g)
+}
+)";
+
+  TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
+                          ParseModule(hlo_string));
+  LOG(INFO) << "Original module:\n" << module->ToString();
+
+  HloDomainIsolator isolator(CreateShardingDomain);
+  TF_ASSERT_OK_AND_ASSIGN(bool isolator_changed, isolator.Run(module.get()));
+  EXPECT_TRUE(isolator_changed);
+
+  EXPECT_TRUE(HasDomainEdge(module.get(), "b", "a"));
+  EXPECT_TRUE(HasDomainEdge(module.get(), "f", "e"));
+  EXPECT_FALSE(HasDomainEdge(module.get(), "a", "p0"));
+  EXPECT_FALSE(HasDomainEdge(module.get(), "c", "b"));
+  EXPECT_FALSE(HasDomainEdge(module.get(), "e", "d"));
+
+  HloDomainRemover remover(ShardingMetadata::KindName(),
+                           NormalizeShardingDomain);
+  TF_ASSERT_OK_AND_ASSIGN(bool remover_changed, remover.Run(module.get()));
+  EXPECT_TRUE(remover_changed);
+
+  EXPECT_FALSE(HasDomainEdge(module.get(), "b", "a"));
+  EXPECT_FALSE(HasDomainEdge(module.get(), "f", "e"));
+}
+
+TEST_F(HloDomainTest, CheckNoDomainAddedOnPureIOComputation) {
+  const char* const hlo_string = R"(
+HloModule Module
+
+ENTRY entry {
+  a = (f32[4], u32[]) recv(), channel_id=1, sharding={maximal device=-1}
+  b = f32[4] recv-done(a), channel_id=1, sharding={maximal device=-1}
+  c = f32[4] add(b, b), sharding={maximal device=-1}
+  d = (f32[4], u32[]) send(c), channel_id=2, sharding={maximal device=-1}
+  ROOT e = () send-done(d), channel_id=2, sharding={maximal device=-1}
+}
+)";
+
+  TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
+                          ParseModule(hlo_string));
+  LOG(INFO) << "Original module:\n" << module->ToString();
+
+  HloDomainIsolator isolator(CreateShardingDomain);
+  TF_ASSERT_OK_AND_ASSIGN(bool isolator_changed, isolator.Run(module.get()));
+  EXPECT_FALSE(isolator_changed);
+}
+
+TEST_F(HloDomainTest, CheckNormalizationOnPureIOComputation) {
+  const char* const hlo_string = R"(
+HloModule Module
+
+ENTRY entry {
+  a = (f32[4], u32[]) recv(), channel_id=1, sharding={maximal device=0}
+  b = f32[4] recv-done(a), channel_id=1, sharding={maximal device=0}
+  c = f32[4] add(b, b)
+  d = (f32[4], u32[]) send(c), channel_id=2, sharding={maximal device=0}
+  ROOT e = () send-done(d), channel_id=2, sharding={maximal device=0}
+}
+)";
+
+  TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
+                          ParseModule(hlo_string));
+  LOG(INFO) << "Original module:\n" << module->ToString();
+
+  HloDomainRemover remover(ShardingMetadata::KindName(),
+                           NormalizeShardingDomain);
+  TF_ASSERT_OK_AND_ASSIGN(bool remover_changed, remover.Run(module.get()));
+  EXPECT_FALSE(remover_changed);
+
+  HloInstruction* add = FindInstruction(module.get(), "c");
+  ASSERT_NE(add, nullptr);
+  auto device = add->sharding_unique_device();
+  EXPECT_TRUE(device.has_value());
+  EXPECT_EQ(*device, 0);
+}
+
+TEST_F(HloDomainTest, CheckMultiDomainLinks) {
+  const char* const hlo_string = R"(
+HloModule Module
+
+ENTRY entry {
+  p0 = (f32[4], f32[4]) parameter(0)
+  a = f32[4] get-tuple-element(p0), index=0
+  b = f32[4] get-tuple-element(p0), index=1
+  c = f32[4] add(a, b), sharding={maximal device=1}
+  d = f32[4] subtract(a, c), sharding={maximal device=1}, metadata={op_name="D"}
+  e = f32[4] multiply(c, d), sharding={maximal device=1}, metadata={op_name="D"}
+  f = f32[4] add(e, c), sharding={maximal device=1}
+  ROOT g = (f32[4], f32[4], f32[4]) tuple(c, d, f)
+}
+)";
+
+  TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
+                          ParseModule(hlo_string));
+  LOG(INFO) << "Original module:\n" << module->ToString();
+
+  HloDomainIsolator sharding_isolator(CreateShardingDomain);
+  TF_ASSERT_OK_AND_ASSIGN(bool sharding_isolator_changed,
+                          sharding_isolator.Run(module.get()));
+  EXPECT_TRUE(sharding_isolator_changed);
+
+  HloDomainIsolator opname_isolator(OpNameDomainCreator);
+  TF_ASSERT_OK_AND_ASSIGN(bool opname_isolator_changed,
+                          opname_isolator.Run(module.get()));
+  EXPECT_TRUE(opname_isolator_changed);
+
+  EXPECT_TRUE(HasDomainEdge(module.get(), "c", "a"));
+  EXPECT_TRUE(HasDomainEdge(module.get(), "c", "b"));
+  EXPECT_TRUE(HasDomainEdge(module.get(), "d", "a"));
+  EXPECT_TRUE(HasDomainEdge(module.get(), "d", "c"));
+  EXPECT_FALSE(HasDomainEdge(module.get(), "e", "d"));
+
+  HloDomainRemover sharding_remover(ShardingMetadata::KindName(),
+                                    NormalizeShardingDomain);
+  TF_ASSERT_OK_AND_ASSIGN(bool sharding_remover_changed,
+                          sharding_remover.Run(module.get()));
+  EXPECT_TRUE(sharding_remover_changed);
+
+  HloDomainRemover opname_remover(OpNameMetadata::KindName(),
+                                  OpNameDomainNormalizer);
+  TF_ASSERT_OK_AND_ASSIGN(bool opname_remover_changed,
+                          opname_remover.Run(module.get()));
+  EXPECT_TRUE(opname_remover_changed);
+
+  EXPECT_FALSE(HasDomainEdge(module.get(), "c", "a"));
+  EXPECT_FALSE(HasDomainEdge(module.get(), "c", "b"));
+  EXPECT_FALSE(HasDomainEdge(module.get(), "d", "a"));
+  EXPECT_FALSE(HasDomainEdge(module.get(), "d", "c"));
+}
+
+TEST_F(HloDomainTest, CheckNormalizationOnInfeedTuple) {
+  const char* const hlo_string = R"(
+HloModule Module
+
+ENTRY entry {
+  infeed = (f32[4], f32[4]) infeed(),
+    sharding={{maximal device=1}, {maximal device=0}}
+  gte0 = f32[4] get-tuple-element(infeed), index=0
+  gte1 = f32[4] get-tuple-element(infeed), index=1
+  copy0 = f32[4] copy(gte0)
+  copy1 = f32[4] copy(gte1)
+  ROOT add = f32[4] add(copy0, copy1)
+}
+)";
+
+  TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
+                          ParseModule(hlo_string));
+  LOG(INFO) << "Original module:\n" << module->ToString();
+
+  HloDomainIsolator isolator(CreateShardingDomain);
+  TF_ASSERT_OK_AND_ASSIGN(bool isolator_changed, isolator.Run(module.get()));
+  EXPECT_TRUE(isolator_changed);
+
+  EXPECT_TRUE(HasDomainEdge(module.get(), "gte0", "infeed"));
+  EXPECT_TRUE(HasDomainEdge(module.get(), "gte1", "infeed"));
+  EXPECT_FALSE(HasDomainEdge(module.get(), "copy0", "gte0"));
+  EXPECT_FALSE(HasDomainEdge(module.get(), "copy1", "gte1"));
+
+  // Inject unassigned tuple/gte within the infeed domain, to simulate the
+  // HLO passes adding unexpected instructions.
+  //
+  //            infeed
+  //           /      \
+  //         GTE0    GTE1
+  //         /          \
+  //       COPY0       COPY1
+  //          \         /
+  //           \       /
+  //             TUPLE
+  //               |
+  //             DOMAIN
+  HloInstruction* infeed = FindInstruction(module.get(), "infeed");
+  ASSERT_NE(infeed, nullptr);
+  auto infeed_users = infeed->users();
+  HloInstruction* new_gte0 =
+      infeed->parent()->AddInstruction(HloInstruction::CreateGetTupleElement(
+          ShapeUtil::GetTupleElementShape(infeed->shape(), 0), infeed, 0));
+  HloInstruction* new_copy0 =
+      infeed->parent()->AddInstruction(HloInstruction::CreateUnary(
+          new_gte0->shape(), HloOpcode::kCopy, new_gte0));
+  HloInstruction* new_gte1 =
+      infeed->parent()->AddInstruction(HloInstruction::CreateGetTupleElement(
+          ShapeUtil::GetTupleElementShape(infeed->shape(), 1), infeed, 1));
+  HloInstruction* new_copy1 =
+      infeed->parent()->AddInstruction(HloInstruction::CreateUnary(
+          new_gte1->shape(), HloOpcode::kCopy, new_gte1));
+  HloInstruction* new_tuple = infeed->parent()->AddInstruction(
+      HloInstruction::CreateTuple({new_copy0, new_copy1}));
+  for (HloInstruction* user : infeed_users) {
+    TF_EXPECT_OK(infeed->ReplaceUseWith(user, new_tuple));
+  }
+
+  HloDomainRemover remover(ShardingMetadata::KindName(),
+                           NormalizeShardingDomain);
+  TF_ASSERT_OK_AND_ASSIGN(bool remover_changed, remover.Run(module.get()));
+  EXPECT_TRUE(remover_changed);
+
+  struct Assignment {
+    HloInstruction* instruction;
+    int64 device;
+  } assignments[] = {
+      {new_gte0, 1},
+      {new_copy0, 1},
+      {new_gte1, 0},
+      {new_copy1, 0},
+  };
+  for (auto& assignment : assignments) {
+    auto device = assignment.instruction->sharding_unique_device();
+    EXPECT_TRUE(device.has_value());
+    EXPECT_EQ(*device, assignment.device);
+  }
+  EXPECT_TRUE(new_tuple->has_sharding());
+  EXPECT_EQ(
+      new_tuple->sharding(),
+      HloSharding::Tuple(new_tuple->shape(), {HloSharding::AssignDevice(1),
+                                              HloSharding::AssignDevice(0)}));
+}
+
+}  // namespace
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/hlo_element_type_converter.cc b/tensorflow/compiler/xla/service/hlo_element_type_converter.cc
index d236f83aeb9254b9c6e6d04629758ac2c8fd0da3..4ed1508d7067684a15d0fb7d86e69b055bc1333b 100644
--- a/tensorflow/compiler/xla/service/hlo_element_type_converter.cc
+++ b/tensorflow/compiler/xla/service/hlo_element_type_converter.cc
@@ -119,6 +119,7 @@ StatusOr<bool> HloElementTypeConverter::Run(HloModule* module) {
     return false;
   }
 
+  HloCloneContext context(module);
   bool changed = false;
   for (auto* computation : module->computations()) {
     for (auto* hlo : computation->MakeInstructionPostOrder()) {
@@ -140,6 +141,7 @@ StatusOr<bool> HloElementTypeConverter::Run(HloModule* module) {
       // These are ops with embedded computations where it suffices to convert
       // the embedded computations instead of converting the ops themselves.
       if (opcode == HloOpcode::kWhile || opcode == HloOpcode::kCall ||
+          opcode == HloOpcode::kCrossReplicaSum ||
           opcode == HloOpcode::kFusion || opcode == HloOpcode::kMap ||
           opcode == HloOpcode::kReduce || opcode == HloOpcode::kReduceWindow ||
           opcode == HloOpcode::kSelectAndScatter ||
@@ -180,7 +182,7 @@ StatusOr<bool> HloElementTypeConverter::Run(HloModule* module) {
             ShapeUtil::ChangeElementType(hlo->shape(), replace_with_type_);
 
         new_hlo = computation->AddInstruction(
-            hlo->CloneWithNewOperands(shape, new_operands, hlo->GetModule()));
+            hlo->CloneWithNewOperands(shape, new_operands, &context));
         TF_RETURN_IF_ERROR(new_hlo->CopyAllControlDepsFrom(hlo));
 
         new_hlo = ToElementType(new_hlo, eliminate_type_);
@@ -189,16 +191,16 @@ StatusOr<bool> HloElementTypeConverter::Run(HloModule* module) {
         Shape new_shape = GetConvertedTupleShape(hlo->shape(), eliminate_type_,
                                                  replace_with_type_);
 
-        new_hlo = computation->AddInstruction(hlo->CloneWithNewOperands(
-            new_shape, new_operands, hlo->GetModule()));
+        new_hlo = computation->AddInstruction(
+            hlo->CloneWithNewOperands(new_shape, new_operands, &context));
         TF_RETURN_IF_ERROR(new_hlo->CopyAllControlDepsFrom(hlo));
 
         // Convert the elements of the result of `new_hlo` to produce a new
         // tuple with shape `old_shape`.
         new_hlo = ConvertTupleElements(new_hlo, old_shape);
       } else {
-        new_hlo = computation->AddInstruction(hlo->CloneWithNewOperands(
-            hlo->shape(), new_operands, hlo->GetModule()));
+        new_hlo = computation->AddInstruction(
+            hlo->CloneWithNewOperands(hlo->shape(), new_operands, &context));
         TF_RETURN_IF_ERROR(new_hlo->CopyAllControlDepsFrom(hlo));
       }
 
diff --git a/tensorflow/compiler/xla/service/hlo_evaluator.cc b/tensorflow/compiler/xla/service/hlo_evaluator.cc
index fa59a5fb2030b22aa9e6a59abbfba521d19adb51..e0648e14672c45e9a691fd6a674c9a2cd7605a12 100644
--- a/tensorflow/compiler/xla/service/hlo_evaluator.cc
+++ b/tensorflow/compiler/xla/service/hlo_evaluator.cc
@@ -309,6 +309,35 @@ StatusOr<std::unique_ptr<Literal>> HloEvaluator::EvaluateWithSubstitutions(
   return result;
 }
 
+StatusOr<std::unique_ptr<Literal>> HloEvaluator::EvaluateElementwiseBinaryOp(
+    HloOpcode opcode, const Literal& lhs, const Literal& rhs) {
+  std::unique_ptr<HloInstruction> lhs_instr =
+      HloInstruction::CreateConstant(lhs.CloneToUnique());
+  std::unique_ptr<HloInstruction> rhs_instr =
+      HloInstruction::CreateConstant(rhs.CloneToUnique());
+
+  std::unique_ptr<HloInstruction> cloned_instruction =
+      HloInstruction::CreateBinary(lhs.shape(), opcode, lhs_instr.get(),
+                                   rhs_instr.get());
+  auto result = Evaluate(cloned_instruction.get());
+
+  cloned_instruction->DetachFromOperands();
+  return result;
+}
+
+StatusOr<std::unique_ptr<Literal>> HloEvaluator::EvaluateElementwiseUnaryOp(
+    HloOpcode opcode, const Literal& operand) {
+  std::unique_ptr<HloInstruction> operand_instr =
+      HloInstruction::CreateConstant(operand.CloneToUnique());
+
+  std::unique_ptr<HloInstruction> cloned_instruction =
+      HloInstruction::CreateUnary(operand.shape(), opcode, operand_instr.get());
+  auto result = Evaluate(cloned_instruction.get());
+
+  cloned_instruction->DetachFromOperands();
+  return result;
+}
+
 Status HloEvaluator::HandleParameter(HloInstruction* parameter) {
   CHECK_LT(parameter->parameter_number(), arg_literals_.size());
   const Literal* input_literal = arg_literals_[parameter->parameter_number()];
@@ -859,6 +888,36 @@ Status HloEvaluator::HandleGather(HloInstruction* gather) {
   return Status::OK();
 }
 
+Status HloEvaluator::HandleBroadcast(HloInstruction* broadcast) {
+  const Literal& operand = GetEvaluatedLiteralFor(broadcast->operand(0));
+
+  TF_RET_CHECK(broadcast->dimensions().size() ==
+               ShapeUtil::Rank(operand.shape()))
+      << "broadcast dimensions is of size: " << broadcast->dimensions().size()
+      << " and rank of operand_to_broadcast is: "
+      << ShapeUtil::Rank(operand.shape());
+  // Checks that operand's dimensions are the same as the broadcast's
+  // dimensions along the dimensions to be broadcasted.
+  for (int64 i = 0; i < broadcast->dimensions().size(); ++i) {
+    TF_RET_CHECK(broadcast->shape().dimensions(broadcast->dimensions(i)) ==
+                 operand.shape().dimensions(i));
+  }
+
+  TF_ASSIGN_OR_RETURN(
+      evaluated_[broadcast],
+      operand.Broadcast(broadcast->shape(), broadcast->dimensions()));
+
+  return Status::OK();
+}
+
+Status HloEvaluator::HandleGenerateToken(HloInstruction* token) {
+  // Literals cannot represent a TOKEN shape so just create an empty tuple as
+  // the "result" of the kGenerateToken operation.
+  // TODO(b/109929053): Add support for TOKENs in Literals.
+  evaluated_[token] = Literal::MakeTuple({});
+  return Status::OK();
+}
+
 Status HloEvaluator::HandleGetTupleElement(HloInstruction* get_tuple_element) {
   const auto result_shape = get_tuple_element->shape();
   const int64 index = get_tuple_element->tuple_index();
@@ -914,9 +973,10 @@ Status HloEvaluator::HandleFusion(HloInstruction* fusion) {
   // Attach cloned computation to an empty HLO module so the existing ones are
   // not modified.
   HloModule empty_hlo_module("EmptyModuleForFusion", config);
+  HloCloneContext context(&empty_hlo_module);
   auto cloned_fused_computation =
       fusion->fused_instructions_computation()->Clone(
-          /*suffix=*/"clone_with_layout", &empty_hlo_module);
+          /*suffix=*/"clone_with_layout", &context);
   for (auto* instruction : cloned_fused_computation->instructions()) {
     LayoutUtil::SetToDefaultLayout(instruction->mutable_shape());
   }
diff --git a/tensorflow/compiler/xla/service/hlo_evaluator.h b/tensorflow/compiler/xla/service/hlo_evaluator.h
index 566d53a41427119ea3d429a60a4430068bc953b1..fc2fc9437b238a2e519401b2b121dfbef070e2dc 100644
--- a/tensorflow/compiler/xla/service/hlo_evaluator.h
+++ b/tensorflow/compiler/xla/service/hlo_evaluator.h
@@ -109,6 +109,12 @@ class HloEvaluator : public DfsHloVisitorWithDefault {
       const std::unordered_map<const HloInstruction*, const Literal*>&
           substitutions);
 
+  StatusOr<std::unique_ptr<Literal>> EvaluateElementwiseBinaryOp(
+      HloOpcode opcode, const Literal& lhs, const Literal& rhs);
+
+  StatusOr<std::unique_ptr<Literal>> EvaluateElementwiseUnaryOp(
+      HloOpcode opcode, const Literal& operand);
+
  protected:
   // Make HloEvaluatorTypedVisitor a friend because it is logically part of this
   // class.
@@ -166,6 +172,10 @@ class HloEvaluator : public DfsHloVisitorWithDefault {
 
   Status HandleSelect(HloInstruction* select) override;
 
+  Status HandleBroadcast(HloInstruction* broadcast) override;
+
+  Status HandleGenerateToken(HloInstruction* token) override;
+
   // Returns the already-evaluated literal result for the instruction.
   // A Constant instruction is considered evaluated and its literal will be
   // returned directly without looking up the cache.
diff --git a/tensorflow/compiler/xla/service/hlo_evaluator_test.cc b/tensorflow/compiler/xla/service/hlo_evaluator_test.cc
index ae5b5e0412ef99db9b72d645a954759ca0b9eb8b..72eb9930e92c340ab9f42cd563c27507623b2ba7 100644
--- a/tensorflow/compiler/xla/service/hlo_evaluator_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_evaluator_test.cc
@@ -262,13 +262,13 @@ TEST_P(HloEvaluatorTest, DoesCosR2) {
   auto operand = Literal::CreateR2<float>({{0, M_PI}, {-M_PI, 2 * M_PI}});
   auto expected = Literal::CreateR2<float>({{1, -1}, {-1, 1}});
   TestUnaryOp(HloOpcode::kCos, std::move(expected), std::move(operand),
-              use_bfloat16_ ? 0x1.0P-5 : 0x1.0P-20);
+              use_bfloat16_ ? 0.031250 : 9.5367431640625E-7);
 }
 TEST_P(HloEvaluatorTest, DoesSinR2) {
   auto operand = Literal::CreateR2<float>({{0, M_PI}, {-M_PI, 2 * M_PI}});
   auto expected = Literal::CreateR2<float>({{0, 0}, {0, 0}});
   TestUnaryOp(HloOpcode::kSin, std::move(expected), std::move(operand),
-              use_bfloat16_ ? 0x1.0P-5 : 0x1.0P-20);
+              use_bfloat16_ ? 0.031250 : 9.5367431640625E-7);
 }
 TEST_P(HloEvaluatorTest, DoesNotR2) {
   auto operand =
@@ -333,7 +333,7 @@ TEST_P(HloEvaluatorTest, DoesReshape) {
   result->EachCell<NativeT>(
       [&](tensorflow::gtl::ArraySlice<int64> indices, NativeT value) {
         std::vector<int64> rindexes = Permute(permutation, indices);
-        EXPECT_NEAR(value, literal_clone->Get<NativeT>(rindexes), 0x1.0P-5);
+        EXPECT_NEAR(value, literal_clone->Get<NativeT>(rindexes), 0.031250);
       });
 }
 
@@ -567,7 +567,7 @@ TEST_P(HloEvaluatorTest, NegativePadding2D) {
   (*expected_array)(0, 4) = 2.718f;
   auto expected = Literal::CreateR2FromArray2D<float>(*expected_array);
 
-  EXPECT_TRUE(LiteralTestUtil::Near(*expected, *result, ErrorSpec(0x1.0P-5)));
+  EXPECT_TRUE(LiteralTestUtil::Near(*expected, *result, ErrorSpec(0.031250)));
 }
 
 TEST_P(HloEvaluatorTest, NegativeAndInteriorPadding2D) {
@@ -1248,7 +1248,7 @@ void BM_ReducePrecisely(int num_iters) {
   HloComputation::Builder b("BM_ReducePrecisely");
   HloModuleConfig config;
   config.set_debug_options(legacy_flags::GetDebugOptionsFromFlags());
-  HloModule module("BM_ReducePrecisely", VersionedComputationHandle(), config);
+  HloModule module("BM_ReducePrecisely", config);
 
   constexpr int kNumElements = 1 << 25;  // float += 1 saturates at 1<<24
   std::vector<float> v(kNumElements, 1.0f);
diff --git a/tensorflow/compiler/xla/service/hlo_evaluator_typed_visitor.h b/tensorflow/compiler/xla/service/hlo_evaluator_typed_visitor.h
index 024e8751f79b8b73cb868f6cbd4603f3e94ca7ea..13f46407e33e36bdbef4c9032630101d6c18268f 100644
--- a/tensorflow/compiler/xla/service/hlo_evaluator_typed_visitor.h
+++ b/tensorflow/compiler/xla/service/hlo_evaluator_typed_visitor.h
@@ -161,36 +161,6 @@ class HloEvaluatorTypedVisitor : public DfsHloVisitorWithDefault {
     return HandleRound<ReturnT>(round);
   }
 
-  Status HandleBroadcast(HloInstruction* broadcast) override {
-    const Literal& operand_to_broadcast =
-        parent_->GetEvaluatedLiteralFor(broadcast->operand(0));
-    std::vector<int64> broadcast_indices(
-        ShapeUtil::Rank(broadcast->operand(0)->shape()), 0);
-
-    TF_RET_CHECK(broadcast->dimensions().size() ==
-                 ShapeUtil::Rank(operand_to_broadcast.shape()))
-        << "broadcast dimensions is of size: " << broadcast->dimensions().size()
-        << " and rank of operand_to_broadcast is: "
-        << ShapeUtil::Rank(operand_to_broadcast.shape());
-    // Checks that operand's dimensions are the same as the broadcast's
-    // dimensions along the dimensions to be broadcasted.
-    for (int64 i = 0; i < broadcast->dimensions().size(); ++i) {
-      TF_RET_CHECK(broadcast->shape().dimensions(broadcast->dimensions(i)) ==
-                   operand_to_broadcast.shape().dimensions(i));
-    }
-
-    auto output = MakeUnique<Literal>(broadcast->shape());
-    TF_RETURN_IF_ERROR(output->Populate<ReturnT>(
-        [&](tensorflow::gtl::ArraySlice<int64> multi_index) {
-          for (int64 i = 0; i < broadcast->dimensions().size(); ++i) {
-            broadcast_indices[i] = multi_index[broadcast->dimensions(i)];
-          }
-          return operand_to_broadcast.Get<ReturnT>(broadcast_indices);
-        }));
-    parent_->evaluated_[broadcast] = std::move(output);
-    return Status::OK();
-  }
-
   template <
       typename NativeT,
       typename std::enable_if<!is_complex_t<NativeT>::value>::type* = nullptr>
@@ -1482,11 +1452,12 @@ class HloEvaluatorTypedVisitor : public DfsHloVisitorWithDefault {
             // Evaluate computation with specified literal operands.
             auto curr_val_literal = Literal::CreateR0<ReturnT>(curr_val);
             auto result_val_literal = Literal::CreateR0<ReturnT>(result_val);
-            std::vector<const Literal*> args = {result_val_literal.get(),
-                                                curr_val_literal.get()};
 
             std::unique_ptr<Literal> computed_result =
-                embedded_evaluator.Evaluate<const Literal*>(*function, args)
+                embedded_evaluator
+                    .Evaluate<const Literal*>(
+                        *function,
+                        {result_val_literal.get(), curr_val_literal.get()})
                     .ConsumeValueOrDie();
             // Clear visit states so that we can use the evaluator again on
             // the same computation.
@@ -1685,10 +1656,11 @@ class HloEvaluatorTypedVisitor : public DfsHloVisitorWithDefault {
                     Literal::CreateR0<ReturnT>(curr_val);
                 const auto result_val_literal =
                     Literal::CreateR0<ReturnT>(result_val);
-                const std::vector<const Literal*> args = {
-                    result_val_literal.get(), curr_val_literal.get()};
                 std::unique_ptr<Literal> computed_result =
-                    embedded_evaluator.Evaluate<const Literal*>(*function, args)
+                    embedded_evaluator
+                        .Evaluate<const Literal*>(
+                            *function,
+                            {result_val_literal.get(), curr_val_literal.get()})
                         .ConsumeValueOrDie();
 
                 // Clear visit states so that the we can use the evaluate again
@@ -1990,10 +1962,10 @@ class HloEvaluatorTypedVisitor : public DfsHloVisitorWithDefault {
 
     // TODO(b/74360564): This is implementation defined behavior, but is
     // currently respected by all implementations. Change this if we ever decide
-    // to oficially document different behavior.
+    // to officially document different behavior.
     for (int64 i = 0; i < start.size(); ++i) {
       start[i] = std::min<int64>(
-          std::max(0LL, start[i]),
+          std::max(int64{0}, start[i]),
           operand_literal.shape().dimensions(i) - result_shape.dimensions(i));
     }
 
diff --git a/tensorflow/compiler/xla/service/hlo_execution_profile_test.cc b/tensorflow/compiler/xla/service/hlo_execution_profile_test.cc
index 4900c813fdf037e65c6b42d027f1cbefb6ee9830..eba80c0f199f6224f4b46ac19af482c713585154 100644
--- a/tensorflow/compiler/xla/service/hlo_execution_profile_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_execution_profile_test.cc
@@ -15,8 +15,8 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/service/hlo_execution_profile.h"
 #include "tensorflow/compiler/xla/service/hlo_cost_analysis.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/tests/hlo_test_base.h"
-#include "tensorflow/compiler/xla/tools/parser/hlo_parser.h"
 #include "tensorflow/core/lib/strings/strcat.h"
 
 namespace xla {
@@ -29,7 +29,7 @@ using ::testing::ContainsRegex;
 class HloExecutionProfileTest : public HloTestBase {};
 
 TEST_F(HloExecutionProfileTest, Basic) {
-  auto hlo_module = tools::Parse(R"(
+  auto hlo_module = ParseHloString(R"(
   HloModule test_module
   ENTRY entry_computation {
     lhs = f32[30,30]{1,0} parameter(0)
diff --git a/tensorflow/compiler/xla/service/hlo_graph_dumper.cc b/tensorflow/compiler/xla/service/hlo_graph_dumper.cc
index 17e3c405f1e5269ddf2f03c031a1137f9bb14fcc..28fc6c4209bcc14d890f28d6a9935c55b76586c0 100644
--- a/tensorflow/compiler/xla/service/hlo_graph_dumper.cc
+++ b/tensorflow/compiler/xla/service/hlo_graph_dumper.cc
@@ -28,6 +28,8 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/layout_util.h"
 #include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/service/hlo_casting_utils.h"
+#include "tensorflow/compiler/xla/service/hlo_instructions.h"
 #include "tensorflow/compiler/xla/service/hlo_module.h"
 #include "tensorflow/compiler/xla/service/hlo_tfgraph_builder.h"
 #include "tensorflow/compiler/xla/shape_util.h"
@@ -321,13 +323,11 @@ optional<string> MatchTrivialComputation(const HloComputation* computation) {
 class HloDotDumper {
  public:
   HloDotDumper(const HloComputation* computation, tensorflow::StringPiece label,
-               const DebugOptions& debug_options, bool show_metadata,
-               bool show_backend_config, const HloExecutionProfile* profile,
-               NodeFilter filter)
+               const DebugOptions& debug_options, bool show_backend_config,
+               const HloExecutionProfile* profile, NodeFilter filter)
       : computation_(computation),
         label_(std::string(label)),
         debug_options_(debug_options),
-        show_metadata_(show_metadata),
         show_backend_config_(show_backend_config),
         profile_(profile),
         filter_(std::move(filter)) {}
@@ -395,7 +395,6 @@ class HloDotDumper {
   const HloComputation* computation_;  // never null
   const string label_;                 // overall name for the graph
   const DebugOptions& debug_options_;
-  const bool show_metadata_;
   const bool show_backend_config_;
   const HloExecutionProfile* profile_;  // may be null
   const NodeFilter filter_;
@@ -430,7 +429,8 @@ class HloDotDumper {
 
   // When coloring by sharding information, we track the sharding string
   // representation to color association, by round-robin the color schemes.
-  std::unordered_map<string, ColorScheme> sharding_colors_;
+  std::unordered_map<HloSharding, ColorScheme, HloSharding::Hasher>
+      sharding_colors_;
   int64 next_shard_color_ = 0;
 };
 
@@ -592,15 +592,26 @@ bool HloDotDumper::ShouldShowSubcomputation(const HloComputation* subcomp) {
 string HloDotDumper::DumpSubcomputation(const HloComputation* subcomp,
                                         const HloInstruction* parent_instr) {
   VLOG(2) << "Dumping subcomputation " << subcomp->name();
-  const char* computation_fmt = R"(subgraph %s {
-%s
-label = <%s>;
-labelloc = t;
-tooltip = " ";
-%s
-}  // %s
+  // Add an edge from the subcomputation to its parent node.  If subcomp
+  // belongs to a fusion node, it's drawn in place of the fusion instruction,
+  // so there's no need to link those.
+  if (parent_instr->opcode() != HloOpcode::kFusion) {
+    const HloInstruction* from = GetNodeForEdge(subcomp->root_instruction());
+    VLOG(2) << "Edge: from " << from->name() << " to " << parent_instr->name()
+            << " as " << next_edge_id_;
+    edge_ids_.insert({{from, parent_instr}, next_edge_id_++});
+    const char* edge_fmt =
+        R"(%s -> %s [ltail="%s", style="dashed" tooltip="%s -> %s"];)";
+    edges_.push_back(Printf(
+        edge_fmt, InstructionId(from), InstructionId(parent_instr),
+        SubcomputationId(subcomp), subcomp->name(), parent_instr->name()));
+  }
 
-)";
+  // Have we already dumped this subcomputation?  If so, generating the edge
+  // linking it and parent_instr is all we want to do in this function.
+  if (cluster_ids_.find(subcomp) != cluster_ids_.end()) {
+    return "";
+  }
 
   cluster_ids_[subcomp] = next_cluster_id_++;
 
@@ -647,25 +658,16 @@ tooltip = " ";
 
   string comp_body = DumpComputation(subcomp);
 
-  // Add an edge from the subcomputation to its parent node.  If subcomp
-  // belongs to a fusion node, it's drawn in place of the fusion instruction,
-  // so there's no need to link those.
-  if (parent_instr->opcode() != HloOpcode::kFusion) {
-    const HloInstruction* from = GetNodeForEdge(subcomp->root_instruction());
-    VLOG(2) << "Edge: from " << from->name() << " to " << parent_instr->name()
-            << " as " << next_edge_id_;
-    edge_ids_.insert({{from, parent_instr}, next_edge_id_++});
-    const char* edge_fmt =
-        R"(%s -> %s [ltail="%s", style="dashed" tooltip="%s -> %s"];)";
-    edges_.push_back(Printf(
-        edge_fmt, InstructionId(from), InstructionId(parent_instr),
-        SubcomputationId(subcomp), subcomp->name(), parent_instr->name()));
-  }
-
-  string computation =
-      Printf(computation_fmt, id, style, subcomp_label, comp_body, id);
+  const char* computation_fmt = R"(subgraph %s {
+%s
+label = <%s>;
+labelloc = t;
+tooltip = " ";
+%s
+}  // %s
 
-  return computation;
+)";
+  return Printf(computation_fmt, id, style, subcomp_label, comp_body, id);
 }
 
 string HloDotDumper::DumpComputation(const HloComputation* comp) {
@@ -723,11 +725,25 @@ string HloDotDumper::DumpRootTag() {
                 to_id, node_body, node_shape, NodeColorAttributes(color));
 }
 
+static const HloConstantInstruction* TryGetFusionParameterConstant(
+    const HloInstruction* instr) {
+  if (instr->opcode() != HloOpcode::kParameter || !instr->IsFused()) {
+    return nullptr;
+  }
+  const HloInstruction* fusion = instr->parent()->FusionInstruction();
+  const HloInstruction* operand = fusion->operand(instr->parameter_number());
+  return DynCast<HloConstantInstruction>(operand);
+}
+
 bool HloDotDumper::ShouldMergeIntoUsers(const HloInstruction* instr) const {
   // If a node:
   //
-  //  - is a tuple-shaped parameter,
-  //  - is not a parameter to a fusion node,
+  //  - is a parameter of a fusion node which is bound to a constant,
+  //
+  // or
+  //
+  //  - is a tuple-shaped parameter, and
+  //  - is not a parameter to a fusion node, and
   //  - has at least kMinUsersToOmit users shown, and
   //  - all of the shown users are get-tuple-elements,
   //
@@ -735,6 +751,9 @@ bool HloDotDumper::ShouldMergeIntoUsers(const HloInstruction* instr) const {
   //
   // This helps us handle the common case where a while loop body has one big
   // tuple-shaped parameter.
+  if (TryGetFusionParameterConstant(instr) != nullptr) {
+    return true;
+  }
   const int kMinUsersToOmit = 3;
   return instr->opcode() == HloOpcode::kParameter &&
          ShapeUtil::IsTuple(instr->shape()) && !instr->IsFused() &&
@@ -791,22 +810,22 @@ string HloDotDumper::DumpInstruction(const HloInstruction* instr) {
   }
   // Build the text that will be displayed inside the node.
   string node_body = node_label;
-  for (const string& s : {trivial_subcomputation, node_metadata,
-                          node_backend_config, extra_info, inlined_constants}) {
+  for (const string& s : {trivial_subcomputation, node_backend_config,
+                          extra_info, inlined_constants}) {
     if (!s.empty()) {
       StrAppend(&node_body, "<br/>", s);
     }
   }
 
-  return Printf(R"(%s [label=<%s>, shape=%s, tooltip=" ", %s];)"
+  return Printf(R"(%s [label=<%s>, shape=%s, tooltip="%s", %s];)"
                 "\n",
-                InstructionId(instr), node_body, node_shape,
+                InstructionId(instr), node_body, node_shape, node_metadata,
                 NodeColorAttributes(color));
 }
 
 string HloDotDumper::GetInstructionNodeInlinedOperands(
     const HloInstruction* instr) {
-  auto stringify_constant = [](const HloInstruction* constant) {
+  auto stringify_constant = [](const HloConstantInstruction* constant) {
     const auto& shape = constant->shape();
 
     // If the shape has a dimension of size zero, print it as e.g.
@@ -825,7 +844,7 @@ string HloDotDumper::GetInstructionNodeInlinedOperands(
         *elem_count *= dim;
       }
     }
-    if (elem_count.has_value() && *elem_count <= 8 && constant->HasLiteral()) {
+    if (elem_count.has_value() && *elem_count <= 8) {
       return Printf("%s (%s)", constant->literal().ToString(),
                     ShapeUtil::HumanString(constant->shape()));
     }
@@ -841,29 +860,26 @@ string HloDotDumper::GetInstructionNodeInlinedOperands(
                   ShapeUtil::HumanString(constant->shape()));
   };
 
-  // Special case: If instr is a parameter to a fusion node, check whether the
-  // corresponding operand to the fusion node is a constant.
-  if (instr->opcode() == HloOpcode::kParameter && instr->IsFused()) {
-    const HloInstruction* fusion = instr->parent()->FusionInstruction();
-    const HloInstruction* operand = fusion->operand(instr->parameter_number());
-    if (operand->opcode() != HloOpcode::kConstant) {
-      return "";
-    }
-    return StrCat("<b>constant</b> ", stringify_constant(operand));
-  }
-
   std::vector<string> lines;
   for (int64 i = 0; i < instr->operand_count(); ++i) {
     const HloInstruction* operand = instr->operand(i);
+    const auto* constant_operand = DynCast<HloConstantInstruction>(operand);
     optional<string> operand_str;
-    if (operand->opcode() == HloOpcode::kConstant) {
-      operand_str = stringify_constant(operand);
+    if (constant_operand != nullptr) {
+      operand_str = stringify_constant(constant_operand);
     } else if (ShouldMergeIntoUsers(operand)) {
-      // Special case: If the operand is a parameter, use its parameter number
-      // rather than its name, because that's generally how people think of the
-      // node.
+      // Special case: If the operand is a parameter to a fusion node and it
+      // always has a constant value, display it like a regular constant.
+      //
+      // For other parameters, use the parameter number rather than the proper
+      // name, because that's generally how people think of the node.
       if (operand->opcode() == HloOpcode::kParameter) {
-        operand_str = Printf("Parameter %lld", operand->parameter_number());
+        if (const HloConstantInstruction* constant =
+                TryGetFusionParameterConstant(operand)) {
+          operand_str = stringify_constant(constant);
+        } else {
+          operand_str = Printf("Parameter %lld", operand->parameter_number());
+        }
       } else {
         operand_str = operand->name();
       }
@@ -885,24 +901,26 @@ ColorScheme HloDotDumper::GetInstructionColor(const HloInstruction* instr) {
     if (!instr->has_sharding()) {
       return kDashedBorder;
     }
-    string shard_str = instr->sharding().ToString();
-    auto it = sharding_colors_.find(shard_str);
+    auto it = sharding_colors_.find(instr->sharding());
     if (it != sharding_colors_.end()) {
       return it->second;
     }
     ColorScheme color = static_cast<ColorScheme>(
         kBlue + (next_shard_color_++ % (kDashedBorder - kBlue)));
-    sharding_colors_.emplace(shard_str, color);
+    sharding_colors_.emplace(instr->sharding(), color);
     return color;
   }
   const auto kParameterColor = kOrange;
 
   // Special case: If this instruction has a parameter merged into it, paint it
-  // the same color as a parameter.
+  // the same color as a parameter.  Unless the merged-in parameter is a
+  // parameter to a fusion node that is bound to a constant -- these aren't
+  // "real" parameters from the user's perspective.
   if (std::any_of(instr->operands().begin(), instr->operands().end(),
                   [&](const HloInstruction* operand) {
                     return operand->opcode() == HloOpcode::kParameter &&
-                           ShouldMergeIntoUsers(operand);
+                           ShouldMergeIntoUsers(operand) &&
+                           TryGetFusionParameterConstant(operand) == nullptr;
                   })) {
     return kParameterColor;
   }
@@ -965,6 +983,7 @@ ColorScheme HloDotDumper::GetInstructionColor(const HloInstruction* instr) {
     case HloOpcode::kBitcast:
     case HloOpcode::kGetTupleElement:
     case HloOpcode::kTrace:
+    case HloOpcode::kGenerateToken:
     case HloOpcode::kTuple:
       return kWhite;
     case HloOpcode::kBroadcast:
@@ -976,7 +995,6 @@ ColorScheme HloDotDumper::GetInstructionColor(const HloInstruction* instr) {
       }
       return kGreen;
     case HloOpcode::kConcatenate:
-    case HloOpcode::kCopy:
     case HloOpcode::kDynamicSlice:
     case HloOpcode::kGather:
     case HloOpcode::kPad:
@@ -998,6 +1016,10 @@ ColorScheme HloDotDumper::GetInstructionColor(const HloInstruction* instr) {
         return kWhite;
       }
       return kGreen;
+    case HloOpcode::kCopy:
+      // Emphasize copy nodes, which are either physical transposes (and thus
+      // significant), or copies of read-only buffers (and thus dead weight).
+      return kGreen;
     case HloOpcode::kConvolution:
     case HloOpcode::kDot:
     case HloOpcode::kFft:
@@ -1013,6 +1035,7 @@ ColorScheme HloDotDumper::GetInstructionColor(const HloInstruction* instr) {
     case HloOpcode::kReduceWindow:
     case HloOpcode::kSelectAndScatter:
       return kPurple;
+    case HloOpcode::kDomain:
     case HloOpcode::kFusion:
     case HloOpcode::kMap:
       return kGray;
@@ -1068,10 +1091,6 @@ string HloDotDumper::GetInstructionNodeLabel(const HloInstruction* instr) {
 }
 
 string HloDotDumper::GetInstructionNodeMetadata(const HloInstruction* instr) {
-  if (!show_metadata_) {
-    return "";
-  }
-
   std::vector<string> lines;
   if (!instr->metadata().op_name().empty()) {
     lines.push_back(HtmlLikeStringSanitize(instr->metadata().op_name()));
@@ -1091,11 +1110,11 @@ string HloDotDumper::GetInstructionNodeMetadata(const HloInstruction* instr) {
 
 string HloDotDumper::GetInstructionNodeBackendConfig(
     const HloInstruction* instr) {
-  if (!show_backend_config_ || instr->backend_config().empty()) {
+  if (!show_backend_config_ || instr->raw_backend_config_string().empty()) {
     return "";
   }
 
-  return StrCat("backend_config=\"", instr->backend_config(), "\"");
+  return StrCat("backend_config=\"", instr->raw_backend_config_string(), "\"");
 }
 
 string HloDotDumper::GetInstructionNodeExtraInfo(const HloInstruction* instr) {
@@ -1154,6 +1173,20 @@ string HloDotDumper::GetInstructionNodeExtraInfo(const HloInstruction* instr) {
   return Join(lines, "<br/>");
 }
 
+// Gets the total number of array elements in the given shape.  For tuples, this
+// is the sum of all the sizes of all of the array elements recursively in the
+// tuple.
+static int64 TotalElementsInShape(const Shape& shape) {
+  int64 elems = 0;
+  ShapeUtil::ForEachSubshape(
+      shape, [&](const Shape& subshape, const ShapeIndex& /*index*/) {
+        if (ShapeUtil::IsArray(subshape)) {
+          elems += ShapeUtil::ElementsIn(subshape);
+        }
+      });
+  return elems;
+}
+
 void HloDotDumper::AddInstructionIncomingEdges(const HloInstruction* instr) {
   auto add_edge = [&](const HloInstruction* from, const HloInstruction* to,
                       int64 operand_num, bool control_edge = false) {
@@ -1173,9 +1206,16 @@ void HloDotDumper::AddInstructionIncomingEdges(const HloInstruction* instr) {
     } else if (control_edge) {
       edge_label = "style=\"dotted\" color=\"gray\" label=\"ctrl\"";
     }
-    const char* kEdgeFmt = R"(%s -> %s [tooltip="%s -> %s" %s];)";
+
+    // We print "small" arrays using a hollow arrowhead and "large" arrays using
+    // a filled arrowhead.  For now, we use an arbitrary cutoff for what "big"
+    // means.
+    bool is_big_array = TotalElementsInShape(from->shape()) >= 4096;
+
+    const char* kEdgeFmt = R"(%s -> %s [arrowhead=%s tooltip="%s -> %s" %s];)";
     edges_.push_back(Printf(kEdgeFmt, InstructionId(from), InstructionId(to),
-                            from->name(), to->name(), edge_label));
+                            (is_big_array ? "normal" : "empty"), from->name(),
+                            to->name(), edge_label));
   };
 
   // Add edges from instr's operands to instr.  Parameters within fusion
@@ -1425,7 +1465,7 @@ string ExportGraph(const string& graph,
 string DumpGraph(const HloComputation& computation, const string& label,
                  const DebugOptions& debug_options,
                  const HloExecutionProfile* hlo_execution_profile,
-                 bool show_metadata, bool show_backend_config) {
+                 bool show_backend_config) {
   GraphRendererInterface::GraphKind graph_kind;
   string graph;
   if (debug_options.xla_hlo_dump_as_graphdef()) {
@@ -1436,8 +1476,8 @@ string DumpGraph(const HloComputation& computation, const string& label,
     graph_kind = GraphRendererInterface::TF_GRAPHDEF;
   } else {
     graph =
-        HloDotDumper(&computation, label, debug_options, show_metadata,
-                     show_backend_config, hlo_execution_profile, NodeFilter())
+        HloDotDumper(&computation, label, debug_options, show_backend_config,
+                     hlo_execution_profile, NodeFilter())
             .Dump();
     graph_kind = GraphRendererInterface::DOT_GRAPH;
   }
@@ -1449,15 +1489,15 @@ string DumpGraph(const HloComputation& computation, const string& label,
 }
 
 string DumpNeighborhoodAround(const HloInstruction& node, int radius,
-                              bool show_metadata, bool show_backend_config) {
+                              bool show_backend_config) {
   auto debug_options = node.GetModule()->config().debug_options();
   string label =
       StrCat("Neighborhood of ", radius, " nodes around ", node.name());
   NodeFilter filter = MakeNodeFilter(&node, radius);
-  string graph = HloDotDumper(node.parent(), label, debug_options,
-                              show_metadata, show_backend_config,
-                              /*profile=*/nullptr, filter)
-                     .Dump();
+  string graph =
+      HloDotDumper(node.parent(), label, debug_options, show_backend_config,
+                   /*profile=*/nullptr, filter)
+          .Dump();
   return ExportGraph(graph, GraphRendererInterface::DOT_GRAPH, debug_options);
 }
 
diff --git a/tensorflow/compiler/xla/service/hlo_graph_dumper.h b/tensorflow/compiler/xla/service/hlo_graph_dumper.h
index fc8e1468aca9c2edbc22c30a41a1be8b32a1feca..0b11f34abb7f0d937a24d11f4dc5d2d6a0aae6e7 100644
--- a/tensorflow/compiler/xla/service/hlo_graph_dumper.h
+++ b/tensorflow/compiler/xla/service/hlo_graph_dumper.h
@@ -56,7 +56,7 @@ string MaybeDumpHloModule(const HloModule& module, const string& label,
 string DumpGraph(const HloComputation& computation, const string& label,
                  const DebugOptions& debug_options,
                  const HloExecutionProfile* hlo_execution_profile = nullptr,
-                 bool show_metadata = false, bool show_backend_config = false);
+                 bool show_backend_config = false);
 
 // Like DumpGraph, but renders only nodes "near" the given node in the graph.
 //
@@ -64,7 +64,6 @@ string DumpGraph(const HloComputation& computation, const string& label,
 // (roughly) corresponds to the max distance a node may be from the primary node
 // before it's omitted from the graph.
 string DumpNeighborhoodAround(const HloInstruction& node, int radius,
-                              bool show_metadata = false,
                               bool show_backend_config = false);
 
 // Dumps the HloModule::ToString() as a file into the provided directory path
diff --git a/tensorflow/compiler/xla/service/hlo_graph_dumper_test.cc b/tensorflow/compiler/xla/service/hlo_graph_dumper_test.cc
index 8e52d926d85f1ce6fabeb2dedd2f8e0fe0c2051d..68f41a1cbb4db228f5dcf8b4a6130f05e81262a8 100644
--- a/tensorflow/compiler/xla/service/hlo_graph_dumper_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_graph_dumper_test.cc
@@ -121,7 +121,7 @@ TEST(HloGraphDumperTest, Constant) {
   HloComputation::Builder b("b");
   auto instruction = b.AddInstruction(
       HloInstruction::CreateConstant(Literal::CreateR0<float>(-42)));
-  instruction->set_name("i_am_a_constant_root_instruction");
+  instruction->SetAndSanitizeName("i_am_a_constant_root_instruction");
   HloModuleConfig config;
   HloModule m(TestName(), config);
   HloComputation* root_computation = m.AddEntryComputation(b.Build());
diff --git a/tensorflow/compiler/xla/service/hlo_instruction.cc b/tensorflow/compiler/xla/service/hlo_instruction.cc
index db1c33e2f0dfa0599810ab2e8d32209e64c5c865..39662d1735e4b411ef36cbc8421eabe52be6165a 100644
--- a/tensorflow/compiler/xla/service/hlo_instruction.cc
+++ b/tensorflow/compiler/xla/service/hlo_instruction.cc
@@ -16,7 +16,6 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
 
 #include <algorithm>
-#include <deque>
 #include <ostream>
 #include <set>
 #include <unordered_set>
@@ -27,7 +26,9 @@ limitations under the License.
 #include "tensorflow/compiler/xla/protobuf_util.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/service/dfs_hlo_visitor.h"
+#include "tensorflow/compiler/xla/service/hlo_casting_utils.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
+#include "tensorflow/compiler/xla/service/hlo_instructions.h"
 #include "tensorflow/compiler/xla/service/hlo_module.h"
 #include "tensorflow/compiler/xla/service/name_uniquer.h"
 #include "tensorflow/compiler/xla/shape_util.h"
@@ -37,9 +38,11 @@ limitations under the License.
 #include "tensorflow/compiler/xla/window_util.h"
 #include "tensorflow/core/lib/core/errors.h"
 #include "tensorflow/core/lib/gtl/flatmap.h"
+#include "tensorflow/core/lib/gtl/flatset.h"
 #include "tensorflow/core/lib/gtl/map_util.h"
 #include "tensorflow/core/lib/strings/str_util.h"
 #include "tensorflow/core/lib/strings/strcat.h"
+#include "tensorflow/core/platform/human_readable_json.h"
 #include "tensorflow/core/platform/logging.h"
 
 namespace xla {
@@ -58,68 +61,218 @@ StatusOr<std::unique_ptr<HloInstruction>> HloInstruction::CreateFromProto(
   TF_ASSIGN_OR_RETURN(HloOpcode opcode, StringToHloOpcode(proto.opcode()));
   TF_RET_CHECK(proto.has_shape());
 
-  auto instruction = WrapUnique(new HloInstruction(opcode, proto.shape()));
-  for (const int64 operand_id : proto.operand_ids()) {
-    TF_RET_CHECK(ContainsKey(instruction_map, operand_id))
-        << "No instruction with id " << operand_id;
-    instruction->AppendOperand(instruction_map.at(operand_id));
-  }
-  for (const int64 predecessor_id : proto.control_predecessor_ids()) {
-    TF_RET_CHECK(ContainsKey(instruction_map, predecessor_id))
-        << "No instruction with id " << predecessor_id;
-    TF_RETURN_IF_ERROR(instruction_map.at(predecessor_id)
-                           ->AddControlDependencyTo(instruction.get()));
-  }
-
-  // In the proto, fused computations are held exclusively within the
-  // HloInstructionProto and do not appear as an HloComputationProto within the
-  // HloModuleProto.
-  if (instruction->opcode() == HloOpcode::kFusion) {
-    TF_RET_CHECK(!proto.fusion_kind().empty());
-    TF_ASSIGN_OR_RETURN(instruction->fusion_kind_,
-                        StringToFusionKind(proto.fusion_kind()));
-
-    // Find the fused computation and set its fusion instruction.
-    TF_RET_CHECK(proto.called_computation_ids_size() == 1)
-        << "Expect 1 called computation for fusion instruction, but sees "
-        << proto.called_computation_ids_size();
-    const int64 fusion_id = proto.called_computation_ids(0);
-    auto* fused_computation = FindPtrOrNull(computation_map, fusion_id);
-    TF_RET_CHECK(fused_computation != nullptr)
-        << "No fusion computation with id " << fusion_id;
-    fused_computation->SetFusionInstruction(instruction.get());
-    instruction->called_computations_.push_back(fused_computation);
-  } else {
-    for (const int64 computation_id : proto.called_computation_ids()) {
-      TF_RET_CHECK(ContainsKey(computation_map, computation_id))
-          << "No computation with id " << computation_id;
-      instruction->called_computations_.push_back(
-          computation_map.at(computation_id));
+  std::unique_ptr<HloInstruction> instruction;
+  const auto operands = [&instruction_map, &proto](int index) {
+    return instruction_map.at(proto.operand_ids(index));
+  };
+  const auto computations = [&computation_map, &proto](int index) {
+    return computation_map.at(proto.called_computation_ids(index));
+  };
+  switch (opcode) {
+    // Ops migrated to subclasses.
+    case HloOpcode::kBatchNormTraining:
+      CHECK_EQ(proto.operand_ids_size(), 3);
+      instruction = CreateBatchNormTraining(
+          proto.shape(), operands(0), operands(1), operands(2), proto.epsilon(),
+          proto.feature_index());
+      break;
+    case HloOpcode::kBatchNormInference:
+      CHECK_EQ(proto.operand_ids_size(), 5);
+      instruction = CreateBatchNormInference(
+          proto.shape(), operands(0), operands(1), operands(2), operands(3),
+          operands(4), proto.epsilon(), proto.feature_index());
+      break;
+    case HloOpcode::kBatchNormGrad:
+      CHECK_EQ(proto.operand_ids_size(), 5);
+      instruction = CreateBatchNormGrad(proto.shape(), operands(0), operands(1),
+                                        operands(2), operands(3), operands(4),
+                                        proto.epsilon(), proto.feature_index());
+      break;
+    case HloOpcode::kFft: {
+      CHECK_EQ(proto.operand_ids_size(), 1);
+      std::vector<int64> fft_length(proto.fft_length().begin(),
+                                    proto.fft_length().end());
+      instruction = CreateFft(proto.shape(), operands(0), proto.fft_type(),
+                              tensorflow::gtl::ArraySlice<int64>(fft_length));
+      break;
+    }
+    case HloOpcode::kSend:
+      CHECK_EQ(proto.operand_ids_size(), 1);
+      instruction = CreateSend(operands(0), proto.channel_id());
+      break;
+    case HloOpcode::kSendDone:
+      CHECK_EQ(proto.operand_ids_size(), 1);
+      instruction = CreateSendDone(operands(0));
+      break;
+    case HloOpcode::kRecv:
+      CHECK_EQ(proto.operand_ids_size(), 0);
+      instruction =
+          CreateRecv(proto.shape().tuple_shapes(0), proto.channel_id());
+      break;
+    case HloOpcode::kRecvDone:
+      CHECK_EQ(proto.operand_ids_size(), 1);
+      instruction = CreateRecvDone(operands(0));
+      break;
+    case HloOpcode::kReverse:
+      CHECK_EQ(proto.operand_ids_size(), 1);
+      instruction = CreateReverse(proto.shape(), operands(0),
+                                  std::vector<int64>(proto.dimensions().begin(),
+                                                     proto.dimensions().end()));
+      break;
+    case HloOpcode::kConcatenate: {
+      CHECK_EQ(proto.dimensions_size(), 1);
+      std::vector<HloInstruction*> concat_operands(proto.operand_ids_size());
+      std::transform(proto.operand_ids().begin(), proto.operand_ids().end(),
+                     concat_operands.begin(),
+                     [&instruction_map](int64 operand_id) {
+                       return instruction_map.at(operand_id);
+                     });
+      instruction = CreateConcatenate(proto.shape(), concat_operands,
+                                      proto.dimensions(0));
+      break;
+    }
+    case HloOpcode::kReduce:
+      CHECK_EQ(proto.operand_ids_size(), 2);
+      CHECK_EQ(proto.called_computation_ids_size(), 1);
+      instruction = CreateReduce(proto.shape(), operands(0), operands(1),
+                                 std::vector<int64>(proto.dimensions().begin(),
+                                                    proto.dimensions().end()),
+                                 computations(0));
+      break;
+    case HloOpcode::kTranspose:
+      CHECK_EQ(proto.operand_ids_size(), 1);
+      instruction =
+          CreateTranspose(proto.shape(), operands(0),
+                          std::vector<int64>(proto.dimensions().begin(),
+                                             proto.dimensions().end()));
+      break;
+    case HloOpcode::kBroadcast:
+      CHECK_EQ(proto.operand_ids_size(), 1);
+      instruction =
+          CreateBroadcast(proto.shape(), operands(0),
+                          std::vector<int64>(proto.dimensions().begin(),
+                                             proto.dimensions().end()));
+      break;
+    case HloOpcode::kMap: {
+      CHECK_EQ(proto.called_computation_ids_size(), 1);
+      std::vector<HloInstruction*> map_operands(proto.operand_ids_size());
+      std::transform(proto.operand_ids().begin(), proto.operand_ids().end(),
+                     map_operands.begin(),
+                     [&instruction_map](int64 operand_id) {
+                       return instruction_map.at(operand_id);
+                     });
+      instruction = CreateMap(proto.shape(), map_operands, computations(0));
+      break;
+    }
+    case HloOpcode::kSlice: {
+      CHECK_EQ(proto.operand_ids_size(), 1);
+      std::vector<int64> slice_starts, slice_limits, slice_strides;
+      for (const HloInstructionProto::SliceDimensions& slice_dimensions :
+           proto.slice_dimensions()) {
+        slice_starts.push_back(slice_dimensions.start());
+        slice_limits.push_back(slice_dimensions.limit());
+        slice_strides.push_back(slice_dimensions.stride());
+      }
+      instruction = CreateSlice(proto.shape(), operands(0), slice_starts,
+                                slice_limits, slice_strides);
+      break;
+    }
+    case HloOpcode::kConstant: {
+      CHECK(proto.has_literal());
+      TF_ASSIGN_OR_RETURN(auto literal,
+                          Literal::CreateFromProto(proto.literal()));
+      instruction = CreateConstant(std::move(literal));
+      break;
+    }
+    case HloOpcode::kTrace: {
+      TF_RET_CHECK(proto.operand_ids_size() == 1)
+          << "Trace instruction should have 1 operand but sees "
+          << proto.operand_ids_size();
+      CHECK(proto.has_literal());
+      TF_ASSIGN_OR_RETURN(auto literal,
+                          Literal::CreateFromProto(proto.literal()));
+      instruction = CreateTrace(literal->GetR1U8AsString(), operands(0));
+      break;
+    }
+    case HloOpcode::kFusion: {
+      // In the proto, fused computations are held exclusively within the
+      // HloInstructionProto and do not appear as an HloComputationProto within
+      // the HloModuleProto.
+      TF_RET_CHECK(!proto.fusion_kind().empty());
+      TF_ASSIGN_OR_RETURN(FusionKind fusion_kind,
+                          StringToFusionKind(proto.fusion_kind()));
+
+      // Find the fused computation and set its fusion instruction.
+      TF_RET_CHECK(proto.called_computation_ids_size() == 1)
+          << "Expect 1 called computation for fusion instruction, but sees "
+          << proto.called_computation_ids_size();
+      const int64 fusion_id = proto.called_computation_ids(0);
+      auto* fused_computation = FindPtrOrNull(computation_map, fusion_id);
+      TF_RET_CHECK(fused_computation != nullptr)
+          << "No fusion computation with id " << fusion_id;
+      std::vector<HloInstruction*> fusion_operands(proto.operand_ids_size());
+      std::transform(proto.operand_ids().begin(), proto.operand_ids().end(),
+                     fusion_operands.begin(),
+                     [&instruction_map](int64 operand_id) {
+                       return instruction_map.at(operand_id);
+                     });
+      instruction = CreateFusion(proto.shape(), fusion_kind, fusion_operands,
+                                 fused_computation);
+      break;
+    }
+    case HloOpcode::kRng: {
+      std::vector<HloInstruction*> rng_parms(proto.operand_ids_size());
+      std::transform(proto.operand_ids().begin(), proto.operand_ids().end(),
+                     rng_parms.begin(), [&instruction_map](int64 operand_id) {
+                       return instruction_map.at(operand_id);
+                     });
+      instruction = CreateRng(proto.shape(), proto.distribution(), rng_parms);
+      break;
+    }
+    case HloOpcode::kParameter:
+      instruction = CreateParameter(proto.parameter_number(), proto.shape(),
+                                    proto.name());
+      break;
+    case HloOpcode::kGetTupleElement:
+      CHECK_EQ(proto.operand_ids_size(), 1);
+      instruction = CreateGetTupleElement(proto.shape(), operands(0),
+                                          proto.tuple_index());
+      break;
+    case HloOpcode::kReducePrecision:
+      instruction =
+          CreateReducePrecision(proto.shape(), operands(0),
+                                proto.exponent_bits(), proto.mantissa_bits());
+      break;
+    default: {
+      instruction = WrapUnique(new HloInstruction(opcode, proto.shape()));
+      for (const int64 operand_id : proto.operand_ids()) {
+        TF_RET_CHECK(ContainsKey(instruction_map, operand_id))
+            << "No instruction with id " << operand_id;
+        instruction->AppendOperand(instruction_map.at(operand_id));
+      }
+      for (const int64 predecessor_id : proto.control_predecessor_ids()) {
+        TF_RET_CHECK(ContainsKey(instruction_map, predecessor_id))
+            << "No instruction with id " << predecessor_id;
+        TF_RETURN_IF_ERROR(instruction_map.at(predecessor_id)
+                               ->AddControlDependencyTo(instruction.get()));
+      }
+      if (instruction->opcode() != HloOpcode::kFusion) {
+        for (const int64 computation_id : proto.called_computation_ids()) {
+          TF_RET_CHECK(ContainsKey(computation_map, computation_id))
+              << "No computation with id " << computation_id;
+          instruction->called_computations_.push_back(
+              computation_map.at(computation_id));
+        }
+      }
+      break;
     }
-  }
-
-  if (instruction->opcode() == HloOpcode::kTrace) {
-    TF_RET_CHECK(instruction->operands().size() == 1)
-        << "Trace instruction should have 1 operand but sees "
-        << instruction->operands().size();
-    instruction->mutable_operand(0)->set_tracing(instruction.get());
   }
 
   TF_RET_CHECK(!proto.name().empty());
-  instruction->name_ = proto.name();
-
+  instruction->SetAndSanitizeName(proto.name());
   instruction->metadata_ = proto.metadata();
-  instruction->set_backend_config(proto.backend_config());
-  if (proto.has_literal()) {
-    TF_ASSIGN_OR_RETURN(instruction->literal_,
-                        Literal::CreateFromProto(proto.literal()));
-  }
-  instruction->parameter_number_ = proto.parameter_number();
+  instruction->backend_config_ = proto.backend_config();
 
-  instruction->tuple_index_ = proto.tuple_index();
-  for (int64 dimension : proto.dimensions()) {
-    instruction->dimensions_.push_back(dimension);
-  }
   if (proto.has_window()) {
     instruction->window_ = MakeUnique<Window>(proto.window());
   }
@@ -132,14 +285,7 @@ StatusOr<std::unique_ptr<HloInstruction>> HloInstruction::CreateFromProto(
     instruction->dot_dimension_numbers_ =
         MakeUnique<DotDimensionNumbers>(proto.dot_dimension_numbers());
   }
-  for (const HloInstructionProto::SliceDimensions& slice_dimensions :
-       proto.slice_dimensions()) {
-    instruction->slice_starts_.push_back(slice_dimensions.start());
-    instruction->slice_limits_.push_back(slice_dimensions.limit());
-    instruction->slice_strides_.push_back(slice_dimensions.stride());
-  }
-  instruction->exponent_bits_ = proto.exponent_bits();
-  instruction->mantissa_bits_ = proto.mantissa_bits();
+
   for (int64 dynamic_slice_size : proto.dynamic_slice_sizes()) {
     instruction->dynamic_slice_sizes_.push_back(dynamic_slice_size);
   }
@@ -148,17 +294,9 @@ StatusOr<std::unique_ptr<HloInstruction>> HloInstruction::CreateFromProto(
         MakeUnique<PaddingConfig>(proto.padding_config());
   }
   instruction->outfeed_config_ = proto.outfeed_config();
-  instruction->distribution_ = proto.distribution();
-  instruction->epsilon_ = proto.epsilon();
-  instruction->feature_index_ = proto.feature_index();
-  instruction->channel_id_ = proto.channel_id();
   instruction->infeed_config_ = proto.infeed_config();
   instruction->custom_call_target_ = proto.custom_call_target();
   instruction->outfeed_shape_ = proto.outfeed_shape();
-  instruction->fft_type_ = proto.fft_type();
-  for (int64 fft_len : proto.fft_length()) {
-    instruction->fft_length_.push_back(fft_len);
-  }
 
   if (proto.has_sharding()) {
     TF_ASSIGN_OR_RETURN(const auto& sharding,
@@ -177,57 +315,38 @@ StatusOr<std::unique_ptr<HloInstruction>> HloInstruction::CreateFromProto(
   instruction->channel_name_ = proto.channel_name();
   instruction->cost_estimate_ns_ = proto.cost_estimate_ns();
 
+  for (int64 replica_group_id : proto.replica_group_ids()) {
+    instruction->replica_group_ids_.push_back(replica_group_id);
+  }
+
   return std::move(instruction);
 }
 
 /* static */ std::unique_ptr<HloInstruction> HloInstruction::CreateParameter(
     int64 parameter_number, const Shape& shape, const string& name) {
-  auto instruction =
-      WrapUnique(new HloInstruction(HloOpcode::kParameter, shape));
-  instruction->parameter_number_ = parameter_number;
-  instruction->name_ = name;
-  return instruction;
+  return MakeUnique<HloParameterInstruction>(parameter_number, shape, name);
 }
 
 /* static */ std::unique_ptr<HloInstruction> HloInstruction::CreateTrace(
     const string& tag, HloInstruction* operand) {
-  auto instruction =
-      WrapUnique(new HloInstruction(HloOpcode::kTrace, ShapeUtil::MakeNil()));
-  instruction->operands_.push_back(operand);
-  instruction->literal_ = Literal::CreateR1U8(tag);
-  operand->set_tracing(instruction.get());
-  return instruction;
+  return MakeUnique<HloTraceInstruction>(tag, operand);
 }
 
 /* static */ std::unique_ptr<HloInstruction> HloInstruction::CreateConstant(
     std::unique_ptr<Literal> literal) {
-  auto instruction =
-      WrapUnique(new HloInstruction(HloOpcode::kConstant, literal->shape()));
-  instruction->literal_ = std::move(literal);
-  return instruction;
+  return MakeUnique<HloConstantInstruction>(std::move(literal));
 }
 
 /* static */ std::unique_ptr<HloInstruction>
 HloInstruction::CreateGetTupleElement(const Shape& shape,
                                       HloInstruction* operand, int64 index) {
-  CHECK(ShapeUtil::IsTuple(operand->shape()));
-  auto instruction =
-      WrapUnique(new HloInstruction(HloOpcode::kGetTupleElement, shape));
-  instruction->tuple_index_ = index;
-  instruction->AppendOperand(operand);
-  return instruction;
+  return MakeUnique<HloGetTupleElementInstruction>(shape, operand, index);
 }
 
 /* static */ std::unique_ptr<HloInstruction> HloInstruction::CreateRng(
     const Shape& shape, RandomDistribution distribution,
     tensorflow::gtl::ArraySlice<HloInstruction*> parameters) {
-  auto instruction = WrapUnique(new HloInstruction(HloOpcode::kRng, shape));
-  instruction->distribution_ = distribution;
-  instruction->shape_ = shape;
-  for (HloInstruction* param : parameters) {
-    instruction->AppendOperand(param);
-  }
-  return instruction;
+  return MakeUnique<HloRngInstruction>(shape, distribution, parameters);
 }
 
 /* static */ std::unique_ptr<HloInstruction> HloInstruction::CreateNary(
@@ -256,6 +375,7 @@ HloInstruction::CreateGetTupleElement(const Shape& shape,
     case HloOpcode::kCopy:
     case HloOpcode::kCos:
     case HloOpcode::kClz:
+    case HloOpcode::kDomain:
     case HloOpcode::kExp:
     case HloOpcode::kExpm1:
     case HloOpcode::kFloor:
@@ -341,13 +461,8 @@ HloInstruction::CreateGetTupleElement(const Shape& shape,
     const Shape& shape, tensorflow::gtl::ArraySlice<HloInstruction*> operands,
     HloComputation* map_computation,
     tensorflow::gtl::ArraySlice<HloInstruction*> static_operands) {
-  CHECK(static_operands.empty()) << "static_operands not yet supported";
-  auto instruction = WrapUnique(new HloInstruction(HloOpcode::kMap, shape));
-  for (auto operand : operands) {
-    instruction->AppendOperand(operand);
-  }
-  instruction->called_computations_.push_back(map_computation);
-  return instruction;
+  return MakeUnique<HloMapInstruction>(shape, operands, map_computation,
+                                       static_operands);
 }
 
 /* static */ std::unique_ptr<HloInstruction> HloInstruction::CreateConvolve(
@@ -373,11 +488,7 @@ HloInstruction::CreateGetTupleElement(const Shape& shape,
 /* static */ std::unique_ptr<HloInstruction> HloInstruction::CreateFft(
     const Shape& shape, HloInstruction* operand, FftType fft_type,
     tensorflow::gtl::ArraySlice<int64> fft_length) {
-  auto instruction = WrapUnique(new HloInstruction(HloOpcode::kFft, shape));
-  instruction->AppendOperand(operand);
-  instruction->fft_type_ = fft_type;
-  instruction->fft_length_.assign(fft_length.begin(), fft_length.end());
-  return instruction;
+  return MakeUnique<HloFftInstruction>(shape, operand, fft_type, fft_length);
 }
 
 /* static */ std::unique_ptr<HloInstruction> HloInstruction::CreateDot(
@@ -410,18 +521,29 @@ HloInstruction::CreateReducePrecision(const Shape& shape,
                                       HloInstruction* operand,
                                       const int exponent_bits,
                                       const int mantissa_bits) {
-  auto instruction =
-      WrapUnique(new HloInstruction(HloOpcode::kReducePrecision, shape));
-  instruction->AppendOperand(operand);
-  instruction->exponent_bits_ = exponent_bits;
-  instruction->mantissa_bits_ = mantissa_bits;
-  return instruction;
+  return MakeUnique<HloReducePrecisionInstruction>(
+      shape, operand, exponent_bits, mantissa_bits);
 }
 
 /* static */ std::unique_ptr<HloInstruction>
 HloInstruction::CreateCrossReplicaSum(
-    const Shape& shape, tensorflow::gtl::ArraySlice<HloInstruction*> operands) {
-  return CreateNary(shape, HloOpcode::kCrossReplicaSum, operands);
+    const Shape& shape, tensorflow::gtl::ArraySlice<HloInstruction*> operands,
+    HloComputation* reduce_computation,
+    tensorflow::gtl::ArraySlice<int64> replica_group_ids,
+    tensorflow::StringPiece barrier,
+    const tensorflow::gtl::optional<int64>& channel_id) {
+  // TODO(b/79737069): Remove the CHECK when supported.
+  CHECK(!channel_id.has_value());
+  auto instruction =
+      WrapUnique(new HloInstruction(HloOpcode::kCrossReplicaSum, shape));
+  for (auto operand : operands) {
+    instruction->AppendOperand(operand);
+  }
+  instruction->called_computations_.push_back(reduce_computation);
+  instruction->replica_group_ids_.assign(replica_group_ids.begin(),
+                                         replica_group_ids.end());
+  instruction->cross_replica_sum_barrier_ = std::string(barrier);
+  return instruction;
 }
 
 /* static */ std::unique_ptr<HloInstruction> HloInstruction::CreateInfeed(
@@ -447,56 +569,44 @@ HloInstruction::CreateCrossReplicaSum(
 
 /* static */ std::unique_ptr<HloInstruction> HloInstruction::CreateSend(
     HloInstruction* operand, int64 channel_id) {
-  // Send instruction produces a tuple of {aliased operand, U32 context}.
-  Shape output_shape = ShapeUtil::MakeTupleShape(
-      {operand->shape(), ShapeUtil::MakeShape(U32, {})});
-  auto instruction =
-      WrapUnique(new HloInstruction(HloOpcode::kSend, output_shape));
-  instruction->AppendOperand(operand);
-  instruction->channel_id_ = channel_id;
-  return instruction;
+  return MakeUnique<HloSendInstruction>(operand, channel_id);
 }
 
 /* static */ std::unique_ptr<HloInstruction> HloInstruction::CreateSendDone(
     HloInstruction* operand) {
-  CHECK(operand->opcode() == HloOpcode::kSend)
+  auto send_operand = DynCast<HloSendInstruction>(operand);
+  CHECK(send_operand != nullptr)
       << "SendDone must take the context operand from Send";
-  auto instruction = WrapUnique(
-      new HloInstruction(HloOpcode::kSendDone, ShapeUtil::MakeNil()));
-  instruction->AppendOperand(operand);
-  instruction->channel_id_ = operand->channel_id();
-  return instruction;
+  return MakeUnique<HloSendDoneInstruction>(send_operand);
 }
 
 /* static */ std::unique_ptr<HloInstruction> HloInstruction::CreateRecv(
     const Shape& shape, int64 channel_id) {
-  // Recv instruction produces a tuple of {receive buffer, U32 context}.
-  Shape output_shape =
-      ShapeUtil::MakeTupleShape({shape, ShapeUtil::MakeShape(U32, {})});
-  auto instruction =
-      WrapUnique(new HloInstruction(HloOpcode::kRecv, output_shape));
-  instruction->channel_id_ = channel_id;
-  return instruction;
+  return MakeUnique<HloRecvInstruction>(shape, channel_id);
 }
 
 /* static */ std::unique_ptr<HloInstruction> HloInstruction::CreateRecvDone(
     HloInstruction* operand) {
-  CHECK(operand->opcode() == HloOpcode::kRecv)
+  auto recv_operand = DynCast<HloRecvInstruction>(operand);
+  CHECK(recv_operand != nullptr)
       << "RecvDone must take the context operand from Recv";
-  Shape output_shape = ShapeUtil::GetTupleElementShape(operand->shape(), 0);
-  auto instruction =
-      WrapUnique(new HloInstruction(HloOpcode::kRecvDone, output_shape));
-  instruction->AppendOperand(operand);
-  instruction->channel_id_ = operand->channel_id();
-  return instruction;
+  return MakeUnique<HloRecvDoneInstruction>(recv_operand);
 }
 
 /* static */ std::unique_ptr<HloInstruction> HloInstruction::CreateReverse(
     const Shape& shape, HloInstruction* operand,
     tensorflow::gtl::ArraySlice<int64> dimensions) {
-  auto instruction = WrapUnique(new HloInstruction(HloOpcode::kReverse, shape));
-  instruction->AppendOperand(operand);
-  instruction->dimensions_.assign(dimensions.begin(), dimensions.end());
+  return MakeUnique<HloReverseInstruction>(shape, operand, dimensions);
+}
+
+/* static */ std::unique_ptr<HloInstruction>
+HloInstruction::CreateGenerateToken(
+    tensorflow::gtl::ArraySlice<HloInstruction*> operands) {
+  auto instruction = WrapUnique(new HloInstruction(
+      HloOpcode::kGenerateToken, ShapeUtil::MakeTokenShape()));
+  for (auto operand : operands) {
+    instruction->AppendOperand(operand);
+  }
   return instruction;
 }
 
@@ -533,18 +643,8 @@ HloInstruction::CreateCrossReplicaSum(
     tensorflow::gtl::ArraySlice<int64> start_indices,
     tensorflow::gtl::ArraySlice<int64> limit_indices,
     tensorflow::gtl::ArraySlice<int64> strides) {
-  auto instruction = WrapUnique(new HloInstruction(HloOpcode::kSlice, shape));
-  instruction->AppendOperand(operand);
-  instruction->slice_starts_.assign(start_indices.begin(), start_indices.end());
-  instruction->slice_limits_.assign(limit_indices.begin(), limit_indices.end());
-  instruction->slice_strides_.assign(strides.begin(), strides.end());
-  // For backward compatibility with old serialized computations: if there are
-  // no strides, assume all strides are 1.
-  // TODO(b/63317920): remove this code.
-  if (instruction->slice_strides_.empty()) {
-    instruction->slice_strides_ = std::vector<int64>(start_indices.size(), 1LL);
-  }
-  return instruction;
+  return MakeUnique<HloSliceInstruction>(shape, operand, start_indices,
+                                         limit_indices, strides);
 }
 
 /* static */ std::unique_ptr<HloInstruction> HloInstruction::CreateDynamicSlice(
@@ -575,13 +675,7 @@ HloInstruction::CreateDynamicUpdateSlice(const Shape& shape,
 /* static */ std::unique_ptr<HloInstruction> HloInstruction::CreateConcatenate(
     const Shape& shape, tensorflow::gtl::ArraySlice<HloInstruction*> operands,
     int64 dimension) {
-  auto instruction =
-      WrapUnique(new HloInstruction(HloOpcode::kConcatenate, shape));
-  for (auto operand : operands) {
-    instruction->AppendOperand(operand);
-  }
-  instruction->dimensions_.push_back(dimension);
-  return instruction;
+  return MakeUnique<HloConcatenateInstruction>(shape, operands, dimension);
 }
 
 /* static */ std::unique_ptr<HloInstruction> HloInstruction::CreateConvert(
@@ -604,13 +698,8 @@ HloInstruction::CreateBitcastConvert(const Shape& shape,
     const Shape& shape, HloInstruction* arg, HloInstruction* init_value,
     tensorflow::gtl::ArraySlice<int64> dimensions_to_reduce,
     HloComputation* reduce_computation) {
-  auto instruction = WrapUnique(new HloInstruction(HloOpcode::kReduce, shape));
-  instruction->AppendOperand(arg);
-  instruction->AppendOperand(init_value);
-  instruction->dimensions_.assign(dimensions_to_reduce.begin(),
-                                  dimensions_to_reduce.end());
-  instruction->called_computations_.push_back(reduce_computation);
-  return instruction;
+  return MakeUnique<HloReduceInstruction>(
+      shape, arg, init_value, dimensions_to_reduce, reduce_computation);
 }
 
 /* static */ std::unique_ptr<HloInstruction> HloInstruction::CreateReduceWindow(
@@ -631,14 +720,8 @@ HloInstruction::CreateBatchNormTraining(const Shape& shape,
                                         HloInstruction* scale,
                                         HloInstruction* offset, float epsilon,
                                         int64 feature_index) {
-  auto instruction =
-      WrapUnique(new HloInstruction(HloOpcode::kBatchNormTraining, shape));
-  instruction->AppendOperand(operand);
-  instruction->AppendOperand(scale);
-  instruction->AppendOperand(offset);
-  instruction->epsilon_ = epsilon;
-  instruction->feature_index_ = feature_index;
-  return instruction;
+  return MakeUnique<HloBatchNormTrainingInstruction>(
+      shape, operand, scale, offset, epsilon, feature_index);
 }
 
 /* static */ std::unique_ptr<HloInstruction>
@@ -646,16 +729,8 @@ HloInstruction::CreateBatchNormInference(
     const Shape& shape, HloInstruction* operand, HloInstruction* scale,
     HloInstruction* offset, HloInstruction* mean, HloInstruction* variance,
     float epsilon, int64 feature_index) {
-  auto instruction =
-      WrapUnique(new HloInstruction(HloOpcode::kBatchNormInference, shape));
-  instruction->AppendOperand(operand);
-  instruction->AppendOperand(scale);
-  instruction->AppendOperand(offset);
-  instruction->AppendOperand(mean);
-  instruction->AppendOperand(variance);
-  instruction->epsilon_ = epsilon;
-  instruction->feature_index_ = feature_index;
-  return instruction;
+  return MakeUnique<HloBatchNormInferenceInstruction>(
+      shape, operand, scale, offset, mean, variance, epsilon, feature_index);
 }
 
 /* static */ std::unique_ptr<HloInstruction>
@@ -664,16 +739,9 @@ HloInstruction::CreateBatchNormGrad(const Shape& shape, HloInstruction* operand,
                                     HloInstruction* variance,
                                     HloInstruction* grad_output, float epsilon,
                                     int64 feature_index) {
-  auto instruction =
-      WrapUnique(new HloInstruction(HloOpcode::kBatchNormGrad, shape));
-  instruction->AppendOperand(operand);
-  instruction->AppendOperand(scale);
-  instruction->AppendOperand(mean);
-  instruction->AppendOperand(variance);
-  instruction->AppendOperand(grad_output);
-  instruction->epsilon_ = epsilon;
-  instruction->feature_index_ = feature_index;
-  return instruction;
+  return MakeUnique<HloBatchNormGradInstruction>(shape, operand, scale, mean,
+                                                 variance, grad_output, epsilon,
+                                                 feature_index);
 }
 
 /* static */ std::unique_ptr<HloInstruction>
@@ -696,12 +764,8 @@ HloInstruction::CreateSelectAndScatter(
 /* static */ std::unique_ptr<HloInstruction> HloInstruction::CreateBroadcast(
     const Shape& shape, HloInstruction* operand,
     tensorflow::gtl::ArraySlice<int64> broadcast_dimensions) {
-  auto instruction =
-      WrapUnique(new HloInstruction(HloOpcode::kBroadcast, shape));
-  instruction->AppendOperand(operand);
-  instruction->dimensions_.assign(broadcast_dimensions.begin(),
-                                  broadcast_dimensions.end());
-  return instruction;
+  return MakeUnique<HloBroadcastInstruction>(shape, operand,
+                                             broadcast_dimensions);
 }
 
 /* static */ std::unique_ptr<HloInstruction>
@@ -780,45 +844,29 @@ HloInstruction::CreateBroadcastSequence(
 /* static */ std::unique_ptr<HloInstruction> HloInstruction::CreateTranspose(
     const Shape& shape, HloInstruction* operand,
     tensorflow::gtl::ArraySlice<int64> dimensions) {
-  CHECK_EQ(shape.dimensions().size(), dimensions.size());
-  CHECK_EQ(shape.dimensions().size(), operand->shape().dimensions().size());
-  CHECK(std::equal(operand->shape().dimensions().begin(),
-                   operand->shape().dimensions().end(),
-                   Permute(dimensions, shape.dimensions()).begin()))
-      << "shape: " << ShapeUtil::HumanString(shape)
-      << ", operand->shape(): " << ShapeUtil::HumanString(shape)
-      << ", dimensions: {" << Join(dimensions, ", ") << "}";
-  auto instruction =
-      WrapUnique(new HloInstruction(HloOpcode::kTranspose, shape));
-  instruction->AppendOperand(operand);
-  instruction->dimensions_.assign(dimensions.begin(), dimensions.end());
-  return instruction;
+  return MakeUnique<HloTransposeInstruction>(shape, operand, dimensions);
 }
 
 /* static */ std::unique_ptr<HloInstruction> HloInstruction::CreateFusion(
     const Shape& shape, FusionKind fusion_kind, HloInstruction* fused_root) {
-  auto instruction = WrapUnique(new HloInstruction(HloOpcode::kFusion, shape));
-  instruction->fusion_kind_ = fusion_kind;
-  instruction->name_ = "fusion";
-  instruction->set_parent(fused_root->parent());
-  instruction->set_metadata(fused_root->metadata());
-  instruction->CloneAndFuseInternal(fused_root);
-  return instruction;
+  return MakeUnique<HloFusionInstruction>(shape, fusion_kind, fused_root);
 }
 
 /* static */ std::unique_ptr<HloInstruction> HloInstruction::CreateFusion(
     const Shape& shape, FusionKind fusion_kind,
     tensorflow::gtl::ArraySlice<HloInstruction*> operands,
     HloComputation* fusion_computation) {
-  auto instruction = WrapUnique(new HloInstruction(HloOpcode::kFusion, shape));
-  for (auto operand : operands) {
-    instruction->AppendOperand(operand);
+  return MakeUnique<HloFusionInstruction>(shape, fusion_kind, operands,
+                                          fusion_computation);
+}
+
+void HloInstruction::set_single_sharding(const HloSharding& sharding) {
+  CHECK(!sharding.IsTuple()) << sharding;
+  if (ShapeUtil::IsTuple(shape())) {
+    set_sharding(HloSharding::Tuple(sharding.GetAsShapeTree(shape())));
+  } else {
+    set_sharding(sharding);
   }
-  instruction->fusion_kind_ = fusion_kind;
-  instruction->name_ = "fusion";
-  instruction->called_computations_.push_back(fusion_computation);
-  fusion_computation->SetFusionInstruction(instruction.get());
-  return instruction;
 }
 
 void HloInstruction::SetupDerivedInstruction(
@@ -831,289 +879,6 @@ void HloInstruction::SetupDerivedInstruction(
   derived_instruction->set_metadata(metadata_);
 }
 
-HloInstruction* HloInstruction::AddFusionOperand(HloInstruction* new_operand) {
-  CHECK_EQ(opcode(), HloOpcode::kFusion);
-  CHECK_EQ(operand_count(),
-           fused_instructions_computation()->parameter_instructions().size());
-  const int64 param_no = operand_count();
-  // Name the parameter after the instruction it represents in the outer
-  // (non-fusion) computation.
-  string param_name = StrCat(new_operand->name(), ".param_", param_no);
-  HloInstruction* fused_parameter =
-      fused_instructions_computation()->AddParameter(
-          HloInstruction::CreateParameter(param_no, new_operand->shape(),
-                                          param_name));
-  AppendOperand(new_operand);
-  return fused_parameter;
-}
-
-void HloInstruction::MergeFusionInstruction(
-    HloInstruction* instruction_to_merge) {
-  CHECK_EQ(opcode_, HloOpcode::kFusion);
-  CHECK_EQ(instruction_to_merge->opcode(), HloOpcode::kFusion);
-  CHECK(std::find(operands().begin(), operands().end(), instruction_to_merge) !=
-        operands().end());
-  // Clone the instruction from which to merge fused instructions.
-  std::unique_ptr<HloInstruction> clone = instruction_to_merge->Clone();
-  // Replace uses of fused parameters with the corresponding operand of the
-  // fusion.  Add all non-parameter fused instructions to 'unfused_instructions'
-  // to be merged into 'this'.  This is done in reverse post order.
-  std::vector<HloInstruction*> unfused_instructions;
-  auto fused_instructions =
-      clone->fused_instructions_computation()->MakeInstructionPostOrder();
-  for (auto fused_it = fused_instructions.rbegin();
-       fused_it != fused_instructions.rend(); ++fused_it) {
-    auto fused_instruction = *fused_it;
-    if (fused_instruction->opcode() == HloOpcode::kParameter) {
-      TF_CHECK_OK(fused_instruction->ReplaceAllUsesWith(
-          clone->mutable_operand(fused_instruction->parameter_number())));
-    } else {
-      unfused_instructions.push_back(fused_instruction);
-    }
-  }
-  CHECK(unfused_instructions.front() == clone->fused_expression_root());
-  // Replace instruction_to_merge use of 'this' with unfused_root.
-  TF_CHECK_OK(
-      instruction_to_merge->ReplaceUseWith(this, unfused_instructions.front()));
-  // Fuse 'unfused_instructions' into 'this'.
-  for (auto& instruction : unfused_instructions) {
-    FuseInstruction(instruction);
-    instruction->DetachFromOperands();
-  }
-  CHECK_EQ(0, clone->user_count());
-  clone->DetachFromOperands();
-  TF_CHECK_OK(parent()->parent()->RemoveEmbeddedComputation(
-      clone->fused_instructions_computation()));
-}
-
-void HloInstruction::MergeFusionInstructionIntoMultiOutput(
-    HloInstruction* instruction_to_merge) {
-  CHECK_EQ(opcode_, HloOpcode::kFusion);
-  CHECK_EQ(instruction_to_merge->opcode(), HloOpcode::kFusion);
-  // Add all non-parameter fused instructions to 'unfused_instructions' to be
-  // merged into 'this'. `old_to_new' maps the instructions in the fused node
-  // to the disaseembled fusion instructions.
-  // Note that we add the unfused instructions to this->parent_ computation.
-  // This is necessary because the unique_id needs for an instruction and
-  // it's only added when inserting to the computation.
-  tensorflow::gtl::FlatMap<HloInstruction*, HloInstruction*> old_to_new;
-  std::vector<HloInstruction*> unfused_instructions;
-  auto computation_to_merge =
-      instruction_to_merge->fused_instructions_computation();
-  auto post_order = computation_to_merge->MakeInstructionPostOrder();
-  for (auto rit = post_order.rbegin(); rit != post_order.rend(); ++rit) {
-    auto fused_instruction = *rit;
-    if (fused_instruction->opcode() == HloOpcode::kParameter) {
-      InsertOrDie(&old_to_new, fused_instruction,
-                  instruction_to_merge->mutable_operand(
-                      fused_instruction->parameter_number()));
-      continue;
-    }
-
-    // Here we clone the insertion and call FuseInstructionIntoMultiOutput()
-    // which clones again. This can be improved.
-    auto cloned_instruction =
-        parent_->AddInstruction(fused_instruction->Clone());
-    unfused_instructions.push_back(cloned_instruction);
-    InsertOrDie(&old_to_new, fused_instruction, cloned_instruction);
-  }
-  for (auto unfused_instruction : unfused_instructions) {
-    for (int64 index = 0; index < unfused_instruction->operand_count();
-         index++) {
-      auto new_operand =
-          FindOrDie(old_to_new, unfused_instruction->mutable_operand(index));
-      TF_CHECK_OK(unfused_instruction->ReplaceOperandWith(index, new_operand));
-    }
-  }
-
-  HloInstruction* unfused_root = unfused_instructions.front();
-  TF_CHECK_OK(instruction_to_merge->ReplaceAllUsesWith(unfused_root));
-
-  TF_CHECK_OK(
-      instruction_to_merge->parent()->RemoveInstruction(instruction_to_merge));
-  if (GetModule()) {
-    TF_CHECK_OK(GetModule()->RemoveEmbeddedComputation(computation_to_merge));
-  }
-
-  // Fuse the root instruction and generate multiple outputs.
-  FuseInstructionIntoMultiOutput(unfused_root);
-  TF_CHECK_OK(unfused_root->parent()->RemoveInstruction(unfused_root));
-  // The rest instructions are of normal fusing.
-  for (int64 i = 1; i < unfused_instructions.size(); i++) {
-    auto instruction = unfused_instructions[i];
-    FuseInstruction(instruction);
-    TF_CHECK_OK(instruction->parent()->RemoveInstruction(instruction));
-  }
-}
-
-HloInstruction* HloInstruction::FuseInstructionInternal(
-    HloInstruction* instruction_to_fuse, bool add_output) {
-  CHECK_EQ(opcode_, HloOpcode::kFusion);
-
-  // When add_output is false, this fusion instruction must be a user of
-  // instruction_to_fuse.
-  if (!add_output) {
-    CHECK(IsUserOf(instruction_to_fuse));
-  }
-  HloInstruction* fused_instruction =
-      CloneAndFuseInternal(instruction_to_fuse, add_output);
-  return fused_instruction;
-}
-
-HloInstruction* HloInstruction::CloneAndFuseInternal(
-    HloInstruction* instruction_to_fuse, bool add_output) {
-  CHECK_EQ(opcode_, HloOpcode::kFusion);
-  CHECK(instruction_to_fuse->IsFusable()) << instruction_to_fuse->ToString();
-  VLOG(3) << "CloneAndFuseInternal:\n" << instruction_to_fuse->ToString();
-  HloInstruction* clone = nullptr;
-  if (called_computations_.empty()) {
-    // New fusion instruction. It should not be a multioutput instruction.
-    CHECK(!add_output);
-    auto builder = HloComputation::Builder("fused_computation", this);
-    builder.AddInstruction(instruction_to_fuse->Clone(/*suffix=*/""));
-    called_computations_.push_back(
-        CHECK_NOTNULL(GetModule())->AddEmbeddedComputation(builder.Build()));
-    clone = fused_expression_root();
-  } else {
-    clone = fused_instructions_computation()->AddInstruction(
-        instruction_to_fuse->Clone(/*suffix=*/""));
-    // When add_output is false, instruction_to_fuse is necessarily an operand
-    // of the fusion instruction. After fusion this will no longer be the case.
-    // Remove the operand from the operand list and remove its corresponding
-    // fused parameter instruction. Renumber parameters as necessary to make
-    // parameter numbers consistent with their index in the
-    // fused_parameter_ vector.
-    bool in_operand_list = std::find(operands_.begin(), operands_.end(),
-                                     instruction_to_fuse) != operands_.end();
-    CHECK(add_output || in_operand_list);
-    const std::vector<HloInstruction*>& fused_parameters =
-        fused_instructions_computation()->parameter_instructions();
-    for (int64 operand_num = 0; operand_num < operand_count(); ++operand_num) {
-      if (instruction_to_fuse == operands_[operand_num]) {
-        // replace the fused parameter instruction's uses with the clone.
-        HloInstruction* fused_parameter = fused_parameters[operand_num];
-        TF_CHECK_OK(fused_parameter->ReplaceAllUsesWith(clone));
-
-        // Remove the corresponding fused parameter and operand from their
-        // respective vectors.
-        TF_CHECK_OK(
-            fused_instructions_computation()->RemoveParameter(operand_num));
-        operands_.erase(operands_.begin() + operand_num);
-        break;
-      }
-    }
-    // We've cloned instruction_to_fuse into this fusion instruction, so this
-    // fusion instruction is no longer a use of instruction_to_fuse.
-    if (in_operand_list) {
-      instruction_to_fuse->RemoveUser(this);
-      // When the instruction_to_fuse does not have other users, we don't need
-      // to generate a multioutput fusion instruction.
-      if (instruction_to_fuse->user_count() == 0) {
-        add_output = false;
-      }
-    }
-  }
-
-  // Reread the parameters in the computation.
-  const std::vector<HloInstruction*>& fused_parameters =
-      fused_instructions_computation()->parameter_instructions();
-
-  // Add each operand of the clone as an operand of the fusion instruction. A
-  // complication is that some clone operands may already be operands of the
-  // fusion instruction.
-  for (int64 operand_num = 0; operand_num < clone->operand_count();
-       ++operand_num) {
-    HloInstruction* operand = clone->mutable_operand(operand_num);
-
-    // See if this operand is already an operand of the fusion node.
-    CHECK_EQ(operands_.size(), fused_parameters.size());
-    HloInstruction* fused_param = nullptr;
-    for (int64 i = 0; i < operands_.size(); ++i) {
-      if (operands_[i] == operand) {
-        fused_param = fused_parameters[i];
-        break;
-      }
-    }
-
-    if (fused_param == nullptr) {
-      // Clone's operand was not already an operand of the fusion
-      // instruction. Add it as an operand and add a corresponding fused
-      // parameter instruction.
-      fused_param = AddFusionOperand(operand);
-    }
-    TF_CHECK_OK(clone->ReplaceOperandWith(operand_num, fused_param));
-  }
-
-  if (add_output) {
-    CHECK_GT(instruction_to_fuse->user_count(), 0);
-    // If this is already a multioutput fusion instruction, expand the root
-    // tuple by 1.
-    HloInstruction* fused_root = fused_expression_root();
-    HloInstruction::InstructionVector tuple_elements;
-    bool newly_created_tuple_instr = false;
-    if (fused_root->opcode() == HloOpcode::kTuple) {
-      tuple_elements = fused_root->operands();
-    } else {
-      tuple_elements.push_back(fused_root);
-      newly_created_tuple_instr = true;
-    }
-    if (clone->opcode() == HloOpcode::kTuple) {
-      for (auto inst : clone->operands()) {
-        tuple_elements.push_back(inst);
-      }
-    } else {
-      tuple_elements.push_back(clone);
-    }
-    HloInstruction* new_root = fused_instructions_computation()->AddInstruction(
-        HloInstruction::CreateTuple(tuple_elements));
-    fused_instructions_computation()->set_root_instruction(new_root);
-    shape_ = new_root->shape();
-    if (fused_root->opcode() == HloOpcode::kTuple) {
-      TF_CHECK_OK(
-          fused_instructions_computation()->RemoveInstruction(fused_root));
-    }
-
-    // If this is a newly created multioutput instruction, we need to update
-    // the use of the original fusion instruction.
-    if (newly_created_tuple_instr) {
-      HloInstruction* new_instr = parent_->AddInstruction(
-          HloInstruction::CreateGetTupleElement(fused_root->shape(), this, 0));
-      TF_CHECK_OK(ReplaceAllUsesWith(new_instr));
-    }
-    int64 index = tuple_elements.size();
-    if (instruction_to_fuse->opcode() == HloOpcode::kTuple) {
-      index -= instruction_to_fuse->operand_count();
-      std::vector<HloInstruction*> to_be_removed;
-      for (auto old_gte : instruction_to_fuse->users()) {
-        CHECK_EQ(old_gte->opcode(), HloOpcode::kGetTupleElement);
-        int64 old_tuple_index = old_gte->tuple_index();
-        HloInstruction* new_gte =
-            parent_->AddInstruction(HloInstruction::CreateGetTupleElement(
-                old_gte->shape(), this, index + old_tuple_index));
-        TF_CHECK_OK(old_gte->ReplaceAllUsesWith(new_gte));
-        to_be_removed.push_back(old_gte);
-      }
-      for (auto old_gte : to_be_removed) {
-        TF_CHECK_OK(parent_->RemoveInstruction(old_gte));
-      }
-      TF_CHECK_OK(fused_instructions_computation()->RemoveInstruction(clone));
-    } else {
-      HloInstruction* new_gte =
-          parent_->AddInstruction(HloInstruction::CreateGetTupleElement(
-              clone->shape(), this, index - 1));
-      TF_CHECK_OK(instruction_to_fuse->ReplaceAllUsesWith(new_gte));
-    }
-  }
-
-  VLOG(2) << "New clone:\n" << clone->ToString();
-  return clone;
-}
-
-RandomDistribution HloInstruction::random_distribution() const {
-  CHECK_EQ(opcode_, HloOpcode::kRng);
-  return distribution_;
-}
-
 bool HloInstruction::HasSideEffectNoRecurse() const {
   switch (opcode_) {
     case HloOpcode::kSend:
@@ -1225,25 +990,58 @@ bool HloInstruction::HasSideEffect() const {
   return gather_dim_numbers;
 }
 
+/* static */ std::unique_ptr<HloInstruction> HloInstruction::CreateDomain(
+    const Shape& shape, HloInstruction* operand,
+    std::unique_ptr<DomainMetadata> operand_side_metadata,
+    std::unique_ptr<DomainMetadata> user_side_metadata) {
+  auto instruction = WrapUnique(new HloInstruction(HloOpcode::kDomain, shape));
+  instruction->operand_side_metadata_ = std::move(operand_side_metadata);
+  instruction->user_side_metadata_ = std::move(user_side_metadata);
+  instruction->AppendOperand(operand);
+  return instruction;
+}
+
 std::unique_ptr<HloInstruction> HloInstruction::CloneWithNewOperands(
     const Shape& shape,
     tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
-    HloModule* module, CloneMap* clone_map) const {
+    HloCloneContext* context) const {
   VLOG(3) << "CloneWithNewOperands:\n  " << ToString();
   VLOG(3) << "  new operands:";
   for (const HloInstruction* new_operand : new_operands) {
     VLOG(3) << "    %" << new_operand->name();
   }
-  if (module == nullptr) {
-    module = GetModule();
-  }
 
   std::unique_ptr<HloInstruction> clone;
-
   // Explicitly call the factory for the instruction type. This is more robust
   // in the face of code changes than copying fields explicitly. This also
   // properly sets the user fields of the operands.
   switch (opcode_) {
+    // Ops migrated to subclasses.
+    // TODO(b/80131774): Remove this switch when migration is complete.
+    case HloOpcode::kBatchNormTraining:
+    case HloOpcode::kBatchNormInference:
+    case HloOpcode::kBatchNormGrad:
+    case HloOpcode::kFft:
+    case HloOpcode::kSend:
+    case HloOpcode::kSendDone:
+    case HloOpcode::kRecv:
+    case HloOpcode::kRecvDone:
+    case HloOpcode::kReverse:
+    case HloOpcode::kConcatenate:
+    case HloOpcode::kReduce:
+    case HloOpcode::kTranspose:
+    case HloOpcode::kBroadcast:
+    case HloOpcode::kMap:
+    case HloOpcode::kSlice:
+    case HloOpcode::kConstant:
+    case HloOpcode::kTrace:
+    case HloOpcode::kFusion:
+    case HloOpcode::kRng:
+    case HloOpcode::kParameter:
+    case HloOpcode::kGetTupleElement:
+    case HloOpcode::kReducePrecision:
+      clone = CloneWithNewOperandsImpl(shape, new_operands, context);
+      break;
     // Unary ops.
     case HloOpcode::kAbs:
     case HloOpcode::kRoundNearestAfz:
@@ -1302,23 +1100,24 @@ std::unique_ptr<HloInstruction> HloInstruction::CloneWithNewOperands(
                             new_operands[2]);
       break;
     // Other supported ops.
-    case HloOpcode::kBroadcast:
-      CHECK_EQ(new_operands.size(), 1);
-      clone = CreateBroadcast(shape, new_operands[0], dimensions_);
-      break;
     case HloOpcode::kCall:
       clone = CreateCall(shape, new_operands, to_apply());
       break;
     case HloOpcode::kCustomCall:
       clone = CreateCustomCall(shape, new_operands, custom_call_target_);
+      if (window_ != nullptr) {
+        clone->window_ = MakeUnique<Window>(*window_);
+      }
+      if (convolution_dimension_numbers_ != nullptr) {
+        clone->convolution_dimension_numbers_ =
+            MakeUnique<ConvolutionDimensionNumbers>(
+                *convolution_dimension_numbers_);
+      }
       break;
     case HloOpcode::kHostCompute:
       clone = CreateHostCompute(shape, new_operands, channel_name_,
                                 cost_estimate_ns_);
       break;
-    case HloOpcode::kConcatenate:
-      clone = CreateConcatenate(shape, new_operands, dimensions(0));
-      break;
     case HloOpcode::kConvert:
       CHECK_EQ(new_operands.size(), 1);
       clone = CreateConvert(shape, new_operands[0]);
@@ -1327,11 +1126,6 @@ std::unique_ptr<HloInstruction> HloInstruction::CloneWithNewOperands(
       CHECK_EQ(new_operands.size(), 1);
       clone = CreateBitcastConvert(shape, new_operands[0]);
       break;
-    case HloOpcode::kReducePrecision:
-      CHECK_EQ(new_operands.size(), 1);
-      clone = CreateReducePrecision(shape, new_operands[0], exponent_bits_,
-                                    mantissa_bits_);
-      break;
     case HloOpcode::kConvolution:
       CHECK_EQ(new_operands.size(), 2);
       clone = CreateConvolve(shape, new_operands[0], new_operands[1], *window_,
@@ -1342,30 +1136,16 @@ std::unique_ptr<HloInstruction> HloInstruction::CloneWithNewOperands(
       clone = CreateDot(shape, new_operands[0], new_operands[1],
                         *dot_dimension_numbers_);
       break;
-    case HloOpcode::kFft:
-      CHECK_EQ(new_operands.size(), 1);
-      clone = CreateFft(shape, new_operands[0], fft_type_, fft_length_);
-      break;
     case HloOpcode::kCrossReplicaSum:
-      clone = CreateCrossReplicaSum(shape, new_operands);
-      break;
-    case HloOpcode::kGetTupleElement:
-      CHECK_EQ(new_operands.size(), 1);
-      clone = CreateGetTupleElement(shape, new_operands[0], tuple_index());
-      break;
-    case HloOpcode::kMap:
-      clone = CreateMap(shape, new_operands, to_apply());
+      clone =
+          CreateCrossReplicaSum(shape, new_operands, to_apply(),
+                                replica_group_ids_, cross_replica_sum_barrier_);
       break;
     case HloOpcode::kPad:
       CHECK_EQ(new_operands.size(), 2);
       clone =
           CreatePad(shape, new_operands[0], new_operands[1], *padding_config_);
       break;
-    case HloOpcode::kReduce:
-      CHECK_EQ(new_operands.size(), 2);
-      clone = CreateReduce(shape, new_operands[0], new_operands[1], dimensions_,
-                           to_apply());
-      break;
     case HloOpcode::kReduceWindow:
       CHECK_EQ(new_operands.size(), 2);
       clone = CreateReduceWindow(shape, new_operands[0], new_operands[1],
@@ -1377,22 +1157,10 @@ std::unique_ptr<HloInstruction> HloInstruction::CloneWithNewOperands(
           CreateSelectAndScatter(shape, new_operands[0], select(), *window_,
                                  new_operands[1], new_operands[2], scatter());
       break;
-    case HloOpcode::kReverse:
-      CHECK_EQ(new_operands.size(), 1);
-      clone = CreateReverse(shape, new_operands[0], dimensions_);
-      break;
-    case HloOpcode::kRng:
-      clone = CreateRng(shape, distribution_, new_operands);
-      break;
     case HloOpcode::kReshape:
       CHECK_EQ(new_operands.size(), 1);
       clone = CreateReshape(shape, new_operands[0]);
       break;
-    case HloOpcode::kSlice:
-      CHECK_EQ(new_operands.size(), 1);
-      clone = CreateSlice(shape, new_operands[0], slice_starts_, slice_limits_,
-                          slice_strides_);
-      break;
     case HloOpcode::kDynamicSlice:
       clone = CreateDynamicSlice(shape, new_operands[0], new_operands[1],
                                  dynamic_slice_sizes_);
@@ -1402,10 +1170,6 @@ std::unique_ptr<HloInstruction> HloInstruction::CloneWithNewOperands(
       clone = CreateDynamicUpdateSlice(shape, new_operands[0], new_operands[1],
                                        new_operands[2]);
       break;
-    case HloOpcode::kTranspose:
-      CHECK_EQ(new_operands.size(), 1);
-      clone = CreateTranspose(shape, new_operands[0], dimensions_);
-      break;
     case HloOpcode::kTuple:
       clone = CreateTuple(new_operands);
       *clone->mutable_shape() = shape;
@@ -1415,33 +1179,6 @@ std::unique_ptr<HloInstruction> HloInstruction::CloneWithNewOperands(
       clone =
           CreateWhile(shape, while_condition(), while_body(), new_operands[0]);
       break;
-    case HloOpcode::kConstant:
-      clone = CreateConstant(literal_->CloneToUnique());
-      break;
-    case HloOpcode::kFusion: {
-      CHECK_NE(module, nullptr);
-      auto new_fused_computation = module->AddEmbeddedComputation(
-          fused_instructions_computation()->Clone("clone", module, clone_map));
-      clone = CreateFusion(/*shape=*/shape, /*fusion_kind=*/fusion_kind(),
-                           /*operands=*/new_operands,
-                           /*fusion_computation=*/new_fused_computation);
-      break;
-    }
-    case HloOpcode::kParameter:
-      clone = CreateParameter(parameter_number_, shape, name_);
-      break;
-    case HloOpcode::kBatchNormTraining:
-      CHECK_EQ(new_operands.size(), 3);
-      clone =
-          CreateBatchNormTraining(shape, new_operands[0], new_operands[1],
-                                  new_operands[2], epsilon(), feature_index());
-      break;
-    case HloOpcode::kBatchNormInference:
-      CHECK_EQ(new_operands.size(), 5);
-      clone = CreateBatchNormInference(
-          shape, new_operands[0], new_operands[1], new_operands[2],
-          new_operands[3], new_operands[4], epsilon(), feature_index());
-      break;
     case HloOpcode::kInfeed:
       CHECK_EQ(new_operands.size(), 0);
       clone = CreateInfeed(shape, infeed_config());
@@ -1450,49 +1187,37 @@ std::unique_ptr<HloInstruction> HloInstruction::CloneWithNewOperands(
       CHECK_EQ(new_operands.size(), 1);
       clone = CreateOutfeed(outfeed_shape_, new_operands[0], outfeed_config());
       break;
-    case HloOpcode::kBatchNormGrad:
-      CHECK_EQ(new_operands.size(), 5);
-      clone = CreateBatchNormGrad(shape, new_operands[0], new_operands[1],
-                                  new_operands[2], new_operands[3],
-                                  new_operands[4], epsilon(), feature_index());
-      break;
     case HloOpcode::kConditional:
       CHECK_EQ(new_operands.size(), 3);
       clone = CreateConditional(shape, new_operands[0], new_operands[1],
                                 true_computation(), new_operands[2],
                                 false_computation());
       break;
-    case HloOpcode::kSend:
-      CHECK_EQ(new_operands.size(), 1);
-      clone = CreateSend(new_operands[0], channel_id());
-      break;
-    case HloOpcode::kSendDone:
-      CHECK_EQ(new_operands.size(), 1);
-      clone = CreateSendDone(new_operands[0]);
-      break;
-    case HloOpcode::kRecv:
-      CHECK_EQ(new_operands.size(), 0);
-      // The shape is a tuple, but CreateRecv() wants the raw data shape.
-      clone =
-          CreateRecv(ShapeUtil::GetTupleElementShape(shape, 0), channel_id());
-      break;
-    case HloOpcode::kRecvDone:
-      CHECK_EQ(new_operands.size(), 1);
-      clone = CreateRecvDone(new_operands[0]);
-      break;
     case HloOpcode::kGather:
       CHECK_EQ(new_operands.size(), 2);
       clone = CreateGather(shape, new_operands[0], new_operands[1],
                            *gather_dimension_numbers_, gather_window_bounds_);
       break;
-    case HloOpcode::kTrace:
-      LOG(FATAL) << "Not yet implemented, clone: " << HloOpcodeString(opcode_);
+    case HloOpcode::kDomain:
+      CHECK_EQ(new_operands.size(), 1);
+      clone =
+          CreateDomain(shape, new_operands[0], operand_side_metadata_->Clone(),
+                       user_side_metadata_->Clone());
+      break;
+    case HloOpcode::kGenerateToken:
+      clone = CreateGenerateToken(new_operands);
+      break;
   }
   SetupDerivedInstruction(clone.get());
   clone->set_parent(parent_);
-  clone->set_backend_config(backend_config());
-  if (clone_map != nullptr) {
-    InsertOrDie(clone_map, this, clone.get());
+  clone->set_raw_backend_config_string(backend_config_);
+  if (context != nullptr) {
+    context->MapInstruction(this, clone.get());
+    clone->ReplaceCalledComputations([&](HloComputation* callee) {
+      return callee->parent() != context->module()
+                 ? context->module()->DeepCloneComputation(callee, context)
+                 : callee;
+    });
   }
   return clone;
 }
@@ -1500,9 +1225,9 @@ std::unique_ptr<HloInstruction> HloInstruction::CloneWithNewOperands(
 HloInstruction::~HloInstruction() {}
 
 std::unique_ptr<HloInstruction> HloInstruction::Clone(
-    const string& suffix, HloModule* module, CloneMap* clone_map) const {
+    const string& suffix, HloCloneContext* context) const {
   std::unique_ptr<HloInstruction> clone =
-      CloneWithNewOperands(shape_, operands_, module, clone_map);
+      CloneWithNewOperands(shape_, operands_, context);
   if (suffix.empty()) {
     clone->name_ = name();
   } else {
@@ -1562,40 +1287,6 @@ const HloInstruction* HloInstruction::LatestNonGteAncestor() const {
   return hlo;
 }
 
-const Literal& HloInstruction::literal() const {
-  CHECK_EQ(HloOpcode::kConstant, opcode_);
-  return *literal_;
-}
-
-bool HloInstruction::HasLiteral() const { return literal_ != nullptr; }
-
-bool HloInstruction::CanHaveDimensionsField() const {
-  return (opcode() == HloOpcode::kReverse ||
-          opcode() == HloOpcode::kConcatenate ||
-          opcode() == HloOpcode::kReduce || opcode() == HloOpcode::kBroadcast ||
-          opcode() == HloOpcode::kTranspose);
-}
-
-const std::vector<int64>& HloInstruction::dimensions() const {
-  CHECK(CanHaveDimensionsField());
-  return dimensions_;
-}
-
-int64 HloInstruction::dimensions(int64 index) const {
-  return dimensions()[index];
-}
-
-int64 HloInstruction::concatenate_dimension() const {
-  CHECK(opcode() == HloOpcode::kConcatenate);
-  CHECK_EQ(1, dimensions_.size());
-  return dimensions(0);
-}
-
-int64 HloInstruction::tuple_index() const {
-  CHECK_EQ(HloOpcode::kGetTupleElement, opcode_);
-  return tuple_index_;
-}
-
 const HloInstruction* HloInstruction::operand(int64 i) const {
   return operands_[i];
 }
@@ -1614,6 +1305,17 @@ int64 HloInstruction::operand_index(const HloInstruction* target) const {
   LOG(FATAL) << "target was not an operand: " << target->ToString();
 }
 
+HloInstruction::InstructionVector HloInstruction::unique_operands() const {
+  InstructionVector unique;
+  tensorflow::gtl::FlatSet<const HloInstruction*> seen;
+  for (HloInstruction* operand : operands()) {
+    if (seen.insert(operand).second) {
+      unique.push_back(operand);
+    }
+  }
+  return unique;
+}
+
 Status HloInstruction::AddControlDependencyTo(HloInstruction* instruction) {
   TF_RET_CHECK(instruction->parent() == parent());
   if (std::find(control_successors_.begin(), control_successors_.end(),
@@ -1673,10 +1375,6 @@ void HloInstruction::AddUser(HloInstruction* user) {
   }
 }
 
-bool HloInstruction::IsConstant() const {
-  return opcode_ == HloOpcode::kConstant;
-}
-
 bool HloInstruction::HasConstantOperand() const {
   for (const HloInstruction* operand : operands_) {
     if (operand->IsConstant()) {
@@ -1706,7 +1404,6 @@ bool HloInstruction::IdenticalSlowPath(
     case HloOpcode::kConvert:
     case HloOpcode::kCopy:
     case HloOpcode::kCos:
-    case HloOpcode::kCrossReplicaSum:
     case HloOpcode::kDivide:
     case HloOpcode::kDynamicSlice:
     case HloOpcode::kDynamicUpdateSlice:
@@ -1746,41 +1443,12 @@ bool HloInstruction::IdenticalSlowPath(
     case HloOpcode::kTuple:
       return true;
 
-    // Broadcast, Concatenate, and Transpose need the same dimensions field.
-    case HloOpcode::kBroadcast:
-    case HloOpcode::kConcatenate:
-    case HloOpcode::kTranspose:
-      return dimensions() == other.dimensions();
-
-    case HloOpcode::kFusion:
-      return fusion_kind() == other.fusion_kind() &&
-             eq_computations(fused_instructions_computation(),
-                             other.fused_instructions_computation());
-
     // These opcodes have complex or special behavior so just return false.
-    case HloOpcode::kRng:
-    case HloOpcode::kTrace:
+    case HloOpcode::kDomain:
     case HloOpcode::kWhile:
+    case HloOpcode::kGenerateToken:
       return false;
 
-    case HloOpcode::kParameter:
-      return parameter_number() == other.parameter_number();
-
-    case HloOpcode::kBatchNormTraining:
-    case HloOpcode::kBatchNormInference:
-    case HloOpcode::kBatchNormGrad:
-      return feature_index() == other.feature_index() &&
-             epsilon() == other.epsilon();
-
-    // A constant is defined by the value in the literal.
-    case HloOpcode::kConstant:
-      return literal() == other.literal();
-
-    // A reduce-precision operation is determined by the bit sizes.
-    case HloOpcode::kReducePrecision:
-      return exponent_bits() == other.exponent_bits() &&
-             mantissa_bits() == other.mantissa_bits();
-
     // Convolution has a window and dimensions.
     case HloOpcode::kConvolution:
       return protobuf_util::ProtobufEquals(window(), other.window()) &&
@@ -1797,16 +1465,6 @@ bool HloInstruction::IdenticalSlowPath(
                                            other.gather_dimension_numbers()) &&
              gather_window_bounds() == other.gather_window_bounds();
 
-    // FFT has various types & lengths.
-    case HloOpcode::kFft:
-      return fft_type() == other.fft_type() &&
-             fft_length() == other.fft_length();
-
-    // Reduction results are determined by the reduction dimension and the
-    // reduction computation.
-    case HloOpcode::kReduce:
-      return dimensions() == other.dimensions() &&
-             eq_computations(to_apply(), other.to_apply());
     case HloOpcode::kReduceWindow:
       return eq_computations(to_apply(), other.to_apply()) &&
              protobuf_util::ProtobufEquals(window(), other.window());
@@ -1818,24 +1476,30 @@ bool HloInstruction::IdenticalSlowPath(
              eq_computations(scatter(), other.scatter()) &&
              protobuf_util::ProtobufEquals(window(), other.window());
 
-
     // Remaining instructions with special values.
-    case HloOpcode::kGetTupleElement:
-      return tuple_index() == other.tuple_index();
     case HloOpcode::kPad:
       return protobuf_util::ProtobufEquals(padding_config(),
                                            other.padding_config());
-    case HloOpcode::kSlice:
-      return slice_starts_ == other.slice_starts_ &&
-             slice_limits_ == other.slice_limits_ &&
-             slice_strides_ == other.slice_strides_;
     case HloOpcode::kCall:
-    case HloOpcode::kMap:
-      return eq_computations(to_apply(), other.to_apply());
+    case HloOpcode::kCrossReplicaSum:
+      return replica_group_ids() == other.replica_group_ids() &&
+             cross_replica_sum_barrier() == other.cross_replica_sum_barrier() &&
+             eq_computations(to_apply(), other.to_apply());
     case HloOpcode::kCustomCall:
+      if ((window_ == nullptr) != (other.window_ == nullptr) ||
+          (window_ != nullptr &&
+           !protobuf_util::ProtobufEquals(window(), other.window()))) {
+        return false;
+      }
+      if ((convolution_dimension_numbers_ == nullptr) !=
+              (other.convolution_dimension_numbers_ == nullptr) ||
+          (convolution_dimension_numbers_ != nullptr &&
+           !protobuf_util::ProtobufEquals(
+               convolution_dimension_numbers(),
+               other.convolution_dimension_numbers()))) {
+        return false;
+      }
       return custom_call_target_ == other.custom_call_target_;
-    case HloOpcode::kReverse:
-      return dimensions() == other.dimensions();
     case HloOpcode::kConditional:
       return eq_computations(true_computation(), other.true_computation()) &&
              eq_computations(false_computation(), other.false_computation());
@@ -1844,21 +1508,36 @@ bool HloInstruction::IdenticalSlowPath(
     case HloOpcode::kInfeed:
     case HloOpcode::kOutfeed:
     case HloOpcode::kSort:
-    case HloOpcode::kRecv:
-    case HloOpcode::kRecvDone:
-    case HloOpcode::kSend:
-    case HloOpcode::kSendDone:
     case HloOpcode::kHostCompute:
       return false;
-  }
-}
 
-bool HloInstruction::IsRank2Transpose() const {
-  return (opcode_ == HloOpcode::kTranspose) &&
-         dimensions_ == std::vector<int64>({1, 0}) &&
-         shape_.dimensions_size() == 2 &&
-         std::equal(shape_.dimensions().begin(), shape_.dimensions().end(),
-                    operands_[0]->shape_.dimensions().rbegin());
+    // Ops migrated to subclasses should never come to this line.
+    // TODO(b/80131774): Remove this switch when migration is complete.
+    case HloOpcode::kBatchNormTraining:
+    case HloOpcode::kBatchNormInference:
+    case HloOpcode::kBatchNormGrad:
+    case HloOpcode::kFft:
+    case HloOpcode::kSend:
+    case HloOpcode::kSendDone:
+    case HloOpcode::kRecv:
+    case HloOpcode::kRecvDone:
+    case HloOpcode::kReverse:
+    case HloOpcode::kConcatenate:
+    case HloOpcode::kReduce:
+    case HloOpcode::kTranspose:
+    case HloOpcode::kBroadcast:
+    case HloOpcode::kMap:
+    case HloOpcode::kSlice:
+    case HloOpcode::kConstant:
+    case HloOpcode::kTrace:
+    case HloOpcode::kFusion:
+    case HloOpcode::kRng:
+    case HloOpcode::kParameter:
+    case HloOpcode::kGetTupleElement:
+    case HloOpcode::kReducePrecision:
+      LOG(FATAL) << "Base class impl called for opcode with subclass: "
+                 << opcode();
+  }
 }
 
 void HloInstruction::RemoveUser(HloInstruction* user) {
@@ -1964,6 +1643,7 @@ HloComputation* HloInstruction::to_apply() const {
     case HloOpcode::kMap:
     case HloOpcode::kReduceWindow:
     case HloOpcode::kReduce:
+    case HloOpcode::kCrossReplicaSum:
       CHECK_EQ(called_computations_.size(), 1);
       return called_computations_[0];
     default:
@@ -2099,6 +1779,71 @@ string HloInstruction::ToString(const HloPrintOptions& options) const {
   return ToStringWithCanonicalNameMap(options, &new_map);
 }
 
+bool HloInstruction::IsElementwiseImpl(
+    const tensorflow::gtl::optional<int64>& operand_idx) const {
+  switch (opcode_) {
+    // Unary elementwise operations.
+    case HloOpcode::kAbs:
+    case HloOpcode::kRoundNearestAfz:
+    case HloOpcode::kCeil:
+    case HloOpcode::kClz:
+    case HloOpcode::kConvert:
+    case HloOpcode::kBitcastConvert:
+    case HloOpcode::kCopy:
+    case HloOpcode::kCos:
+    case HloOpcode::kExp:
+    case HloOpcode::kExpm1:
+    case HloOpcode::kFloor:
+    case HloOpcode::kImag:
+    case HloOpcode::kIsFinite:
+    case HloOpcode::kLog:
+    case HloOpcode::kLog1p:
+    case HloOpcode::kNot:
+    case HloOpcode::kNegate:
+    case HloOpcode::kReal:
+    case HloOpcode::kReducePrecision:
+    case HloOpcode::kSign:
+    case HloOpcode::kSin:
+    case HloOpcode::kTanh:
+      CHECK_EQ(1, operand_count());
+      return true;
+
+    // Binary elementwise operations, the same as in IsElementwiseBinary().
+    case HloOpcode::kAdd:
+    case HloOpcode::kAtan2:
+    case HloOpcode::kComplex:
+    case HloOpcode::kDivide:
+    case HloOpcode::kEq:
+    case HloOpcode::kGe:
+    case HloOpcode::kGt:
+    case HloOpcode::kLe:
+    case HloOpcode::kLt:
+    case HloOpcode::kMaximum:
+    case HloOpcode::kMinimum:
+    case HloOpcode::kMultiply:
+    case HloOpcode::kNe:
+    case HloOpcode::kPower:
+    case HloOpcode::kRemainder:
+    case HloOpcode::kSubtract:
+    case HloOpcode::kAnd:
+    case HloOpcode::kOr:
+    case HloOpcode::kShiftLeft:
+    case HloOpcode::kShiftRightArithmetic:
+    case HloOpcode::kShiftRightLogical:
+      CHECK_EQ(2, operand_count());
+      return true;
+
+    // Ternary elementwise operations.
+    case HloOpcode::kSelect:
+      return !ShapeUtil::IsTuple(shape_);
+    case HloOpcode::kClamp:
+      return true;
+
+    default:
+      return false;
+  }
+}
+
 string HloInstruction::ToStringWithCanonicalNameMap(
     const HloPrintOptions& options,
     CanonicalNameMap* canonical_name_map) const {
@@ -2134,8 +1879,8 @@ string HloInstruction::ToStringWithCanonicalNameMap(
        !metadata_.source_file().empty())) {
     StrAppend(&result, ", metadata={", xla::OpMetadataToString(metadata_), "}");
   }
-  if (options.print_backend_config() && !backend_config().empty()) {
-    StrAppend(&result, ", backend_config=\"", CEscape(backend_config()), "\"");
+  if (options.print_backend_config() && !backend_config_.empty()) {
+    StrAppend(&result, ", backend_config=\"", CEscape(backend_config_), "\"");
   }
   return result;
 }
@@ -2149,76 +1894,39 @@ string HloInstruction::OperandsToStringWithCanonicalNameMap(
     const HloPrintOptions& options,
     CanonicalNameMap* canonical_name_map) const {
   string operands;
-  if (opcode() == HloOpcode::kConstant) {
-    // For constants, show the actual value in place of an empty operand list.
-    //
-    // In HloInstruction, sometimes a constant literal is not constructed due
-    // to its size. Skip the printing in this case.
-    if (HasLiteral() && ((!ShapeUtil::IsTuple(shape()) &&
-                          ShapeUtil::ElementsIn(shape()) <= 10) ||
-                         options.print_large_constants())) {
-      // Literal::ToString emits multidimensional arrays over multiple
-      // lines. Compact this into one line by stripping out white space.
-      string tmp = literal().ToString();
-      std::replace(tmp.begin(), tmp.end(), '\n', ' ');
-      std::vector<string> v = tensorflow::str_util::Split(tmp, ' ');
-      bool first = true;
-      // Concatenate elements in "v" with spaces separating them, but ignoring
-      // empty entries.
-      for (const auto& s : v) {
-        if (s.empty()) {
-          continue;
-        }
-        StrAppend(&operands, (first ? "" : " "), s);
-        first = false;
-      }
-    } else {
-      // Do not show large constants or tuples.
-      operands = "{...}";
-    }
-  } else if (opcode() == HloOpcode::kParameter) {
-    StrAppend(&operands, parameter_number_);
-  } else {
-    tensorflow::gtl::ArraySlice<HloInstruction*> slice(operands_);
-    const int64 kMaxOperandsToShowIfCompact = 4;
-    if (options.compact_operands() &&
-        slice.size() > kMaxOperandsToShowIfCompact) {
-      slice.remove_suffix(slice.size() - kMaxOperandsToShowIfCompact);
+  tensorflow::gtl::ArraySlice<HloInstruction*> slice(operands_);
+  const int64 kMaxOperandsToShowIfCompact = 4;
+  if (options.compact_operands() &&
+      slice.size() > kMaxOperandsToShowIfCompact) {
+    slice.remove_suffix(slice.size() - kMaxOperandsToShowIfCompact);
+  }
+  operands = Join(slice, ", ", [&](string* out, HloInstruction* operand) {
+    std::vector<string> str;
+    if (options.print_operand_shape()) {
+      str.push_back(ShapeUtil::HumanStringWithLayout(operand->shape()));
     }
-    operands = Join(slice, ", ", [&](string* out, HloInstruction* operand) {
-      std::vector<string> str;
-      if (options.print_operand_shape()) {
-        str.push_back(ShapeUtil::HumanStringWithLayout(operand->shape()));
-      }
 
-      // In a top-level HloInstruction::ToString() call, the operand name is not
-      // part of the canonical string.
-      if (options.canonicalize_instruction_names() &&
-          options.is_in_nested_computation()) {
-        str.push_back(PrintName(
-            canonical_name_map->LookupOrInsert(operand->name()), options));
-      } else if (!options.compact_operands()) {
-        str.push_back(PrintName(operand->name(), options));
-      }
-      StrAppend(out, Join(str, " "));
-    });
-    const int64 remaining = operands_.size() - slice.size();
-    if (slice.size() != operands_.size()) {
-      StrAppend(&operands, ", ...(+", remaining, ")");
+    // In a top-level HloInstruction::ToString() call, the operand name is not
+    // part of the canonical string.
+    if (options.canonicalize_instruction_names() &&
+        options.is_in_nested_computation()) {
+      str.push_back(PrintName(
+          canonical_name_map->LookupOrInsert(operand->name()), options));
+    } else if (!options.compact_operands()) {
+      str.push_back(PrintName(operand->name(), options));
     }
+    StrAppend(out, Join(str, " "));
+  });
+  const int64 remaining = operands_.size() - slice.size();
+  if (slice.size() != operands_.size()) {
+    StrAppend(&operands, ", ...(+", remaining, ")");
   }
   return operands;
 }
 
 std::vector<string> HloInstruction::ExtraAttributesToString(
     const HloPrintOptions& options) const {
-  std::vector<string> extra;
-  if (opcode() == HloOpcode::kFusion) {
-    extra.push_back(StrCat("kind=", xla::ToString(fusion_kind())));
-  }
-  if (CanHaveDimensionsField()) {
-    extra.push_back(StrCat("dimensions={", Join(dimensions(), ","), "}"));
-  }
+  std::vector<string> extra = ExtraAttributesToStringImpl(options);
   if (window_ != nullptr && window_->dimensions_size() != 0) {
     extra.push_back(StrCat("window={", window_util::ToString(*window_), "}"));
   }
@@ -2226,32 +1934,16 @@ std::vector<string> HloInstruction::ExtraAttributesToString(
     extra.push_back(
         StrCat("padding=", xla::PaddingConfigToString(*padding_config_)));
   }
-  if (opcode() == HloOpcode::kSlice) {
-    std::vector<string> bounds;
-    bounds.reserve(slice_starts_.size());
-    const bool omit_stride =
-        std::all_of(slice_strides_.begin(), slice_strides_.end(),
-                    [](int64 stride) { return stride == 1; });
-    for (int i = 0; i < slice_starts_.size(); ++i) {
-      string stride_str = omit_stride ? "" : StrCat(":", slice_strides_[i]);
-      bounds.push_back(StrCat("[", slice_starts_[i], ":", slice_limits_[i],
-                              stride_str, "]"));
-    }
-    extra.push_back(StrCat("slice={", Join(bounds, ", "), "}"));
-  }
+
   if (opcode() == HloOpcode::kDynamicSlice) {
     extra.push_back(
         StrCat("dynamic_slice_sizes={", Join(dynamic_slice_sizes(), ","), "}"));
   }
-  if (opcode() == HloOpcode::kBatchNormTraining ||
-      opcode() == HloOpcode::kBatchNormInference ||
-      opcode() == HloOpcode::kBatchNormGrad) {
-    extra.push_back(StrCat("epsilon=", epsilon()));
-    extra.push_back(StrCat("feature_index=", feature_index()));
-  }
 
   if (convolution_dimension_numbers_ != nullptr) {
-    extra.push_back(ConvolutionDimensionNumbersToString());
+    extra.push_back(StrCat(
+        "dim_labels=",
+        ConvolutionDimensionNumbersToString(*convolution_dimension_numbers_)));
   }
   if (dot_dimension_numbers_ != nullptr) {
     extra.push_back(DotDimensionNumbersToString());
@@ -2261,10 +1953,6 @@ std::vector<string> HloInstruction::ExtraAttributesToString(
     extra.push_back(
         StrCat("window_bounds={", Join(gather_window_bounds(), ","), "}"));
   }
-  if (opcode() == HloOpcode::kFft) {
-    extra.push_back(StrCat("fft_type=", FftType_Name(fft_type())));
-    extra.push_back(StrCat("fft_length={", Join(fft_length(), ","), "}"));
-  }
 
   if (options.print_subcomputation_mode() ==
       HloPrintOptions::PrintSubcomputationMode::kNameOnly) {
@@ -2284,7 +1972,8 @@ std::vector<string> HloInstruction::ExtraAttributesToString(
                              PrintName(false_computation()->name(), options)));
     } else if (opcode() == HloOpcode::kCall || opcode() == HloOpcode::kMap ||
                opcode() == HloOpcode::kReduceWindow ||
-               opcode() == HloOpcode::kReduce) {
+               opcode() == HloOpcode::kReduce ||
+               opcode() == HloOpcode::kCrossReplicaSum) {
       extra.push_back(
           StrCat("to_apply=", PrintName(to_apply()->name(), options)));
     } else if (!called_computations().empty()) {
@@ -2334,14 +2023,7 @@ std::vector<string> HloInstruction::ExtraAttributesToString(
         break;
     }
   }
-  if (opcode() == HloOpcode::kSend || opcode() == HloOpcode::kRecv ||
-      opcode() == HloOpcode::kSendDone || opcode() == HloOpcode::kRecvDone) {
-    extra.push_back(StrCat("channel_id=", channel_id_));
-  }
 
-  if (opcode() == HloOpcode::kGetTupleElement) {
-    extra.push_back(StrCat("index=", tuple_index()));
-  }
   if (has_sharding()) {
     extra.push_back(StrCat("sharding=", sharding().ToString()));
   }
@@ -2361,13 +2043,17 @@ std::vector<string> HloInstruction::ExtraAttributesToString(
     extra.push_back(
         StrCat("outfeed_config=\"", CEscape(outfeed_config_), "\""));
   }
-  if (opcode() == HloOpcode::kRng) {
+  if (operand_side_metadata_ != nullptr && user_side_metadata_ != nullptr) {
+    extra.push_back(StrCat("domain={kind=\"", operand_side_metadata_->Kind(),
+                           "\", entry=", operand_side_metadata_->ToString(),
+                           ", exit=", user_side_metadata_->ToString(), "}"));
+  }
+  if (!replica_group_ids().empty()) {
     extra.push_back(
-        StrCat("distribution=", RandomDistributionToString(distribution_)));
+        StrCat("replica_group_ids={", Join(replica_group_ids(), ","), "}"));
   }
-  if (opcode() == HloOpcode::kReducePrecision) {
-    extra.push_back(StrCat("exponent_bits=", exponent_bits_));
-    extra.push_back(StrCat("mantissa_bits=", mantissa_bits_));
+  if (!cross_replica_sum_barrier().empty()) {
+    extra.push_back(StrCat("barrier=\"", cross_replica_sum_barrier(), "\""));
   }
 
   // By contract, we print the custom call target even if
@@ -2407,25 +2093,13 @@ HloInstructionProto HloInstruction::ToProto() const {
   }
 
   *proto.mutable_metadata() = metadata_;
-  proto.set_backend_config(backend_config());
-  if (literal_ != nullptr) {
-    *proto.mutable_literal() = literal_->ToProto();
-  }
-  proto.set_parameter_number(parameter_number_);
-  if (opcode() == HloOpcode::kFusion) {
-    proto.set_fusion_kind(xla::ToString(fusion_kind()));
-    proto.add_called_computation_ids(
-        fused_instructions_computation()->unique_id());
-  } else {
+  proto.set_backend_config(backend_config_);
+  if (opcode() != HloOpcode::kFusion) {
     for (const HloComputation* computation : called_computations_) {
       proto.add_called_computation_ids(computation->unique_id());
     }
   }
 
-  proto.set_tuple_index(tuple_index_);
-  for (int64 dimension : dimensions_) {
-    proto.add_dimensions(dimension);
-  }
   if (window_ != nullptr) {
     *proto.mutable_window() = *window_;
   }
@@ -2444,14 +2118,7 @@ HloInstructionProto HloInstruction::ToProto() const {
       proto.add_gather_window_bounds(bound);
     }
   }
-  for (int i = 0; i < slice_starts_.size(); ++i) {
-    auto* slice_dimension = proto.add_slice_dimensions();
-    slice_dimension->set_start(slice_starts_[i]);
-    slice_dimension->set_limit(slice_limits_[i]);
-    slice_dimension->set_stride(slice_strides_[i]);
-  }
-  proto.set_exponent_bits(exponent_bits_);
-  proto.set_mantissa_bits(mantissa_bits_);
+
   for (int64 slice_size : dynamic_slice_sizes_) {
     proto.add_dynamic_slice_sizes(slice_size);
   }
@@ -2459,19 +2126,9 @@ HloInstructionProto HloInstruction::ToProto() const {
     *proto.mutable_padding_config() = *padding_config_;
   }
   proto.set_outfeed_config(outfeed_config_);
-  if (opcode() == HloOpcode::kRng) {
-    proto.set_distribution(distribution_);
-  }
-  proto.set_epsilon(epsilon_);
-  proto.set_feature_index(feature_index_);
-  proto.set_channel_id(channel_id_);
   proto.set_infeed_config(infeed_config_);
   proto.set_custom_call_target(custom_call_target_);
   *proto.mutable_outfeed_shape() = outfeed_shape_;
-  proto.set_fft_type(fft_type_);
-  for (int64 fft_len : fft_length_) {
-    proto.add_fft_length(fft_len);
-  }
 
   if (has_sharding()) {
     *proto.mutable_sharding() = sharding().ToProto();
@@ -2479,6 +2136,9 @@ HloInstructionProto HloInstruction::ToProto() const {
 
   proto.set_channel_name(channel_name_);
   proto.set_cost_estimate_ns(cost_estimate_ns_);
+  for (int64 replica_group_id : replica_group_ids_) {
+    proto.add_replica_group_ids(replica_group_id);
+  }
 
   return proto;
 }
@@ -2531,12 +2191,6 @@ void HloInstruction::set_tracing(HloInstruction* trace_instruction) {
   trace_instruction_ = trace_instruction;
 }
 
-string HloInstruction::TracingTag() const {
-  CHECK_EQ(HloOpcode::kTrace, opcode());
-  CHECK(literal_ != nullptr);
-  return literal_->GetR1U8AsString();
-}
-
 bool HloInstruction::IsFused() const { return parent_->IsFusionComputation(); }
 
 bool HloInstruction::IsFusable() const {
@@ -2546,6 +2200,7 @@ bool HloInstruction::IsFusable() const {
   }
   // Some kinds of instructions don't make sense to fuse.
   switch (opcode_) {
+    case HloOpcode::kDomain:
     case HloOpcode::kParameter:
       return false;
     // Side effecting instrutions cannot be fused.
@@ -2554,49 +2209,6 @@ bool HloInstruction::IsFusable() const {
   }
 }
 
-HloComputation* HloInstruction::fused_instructions_computation() const {
-  CHECK_EQ(opcode_, HloOpcode::kFusion);
-  CHECK(!called_computations_.empty());
-  auto* fused_instructions_computation = called_computations_.front();
-  CHECK(fused_instructions_computation->IsFusionComputation());
-  return fused_instructions_computation;
-}
-
-HloInstruction* HloInstruction::fused_expression_root() const {
-  CHECK_EQ(opcode_, HloOpcode::kFusion);
-  return fused_instructions_computation()->root_instruction();
-}
-
-HloInstruction* HloInstruction::fused_parameter(int64 parameter_number) const {
-  CHECK_EQ(opcode_, HloOpcode::kFusion);
-  return fused_instructions_computation()->parameter_instruction(
-      parameter_number);
-}
-
-const std::vector<HloInstruction*>& HloInstruction::fused_parameters() const {
-  CHECK_EQ(opcode_, HloOpcode::kFusion);
-  return fused_instructions_computation()->parameter_instructions();
-}
-
-const tensorflow::gtl::iterator_range<UnwrappingIterator<
-    std::list<std::unique_ptr<HloInstruction>>::const_iterator>>
-HloInstruction::fused_instructions() const {
-  CHECK_EQ(opcode_, HloOpcode::kFusion);
-  const HloComputation* subcomp = fused_instructions_computation();
-  return subcomp->instructions();
-}
-
-const tensorflow::gtl::iterator_range<
-    UnwrappingIterator<std::list<std::unique_ptr<HloInstruction>>::iterator>>
-HloInstruction::fused_instructions() {
-  CHECK_EQ(opcode_, HloOpcode::kFusion);
-  return fused_instructions_computation()->instructions();
-}
-
-int64 HloInstruction::fused_instruction_count() const {
-  return fused_instructions_computation()->instruction_count();
-}
-
 HloInstruction::HloInstruction(HloOpcode opcode, const Shape& shape)
     : unique_id_(-1),
       opcode_(opcode),
@@ -2773,6 +2385,10 @@ Status HloInstruction::Visit(DfsHloVisitorBase<HloInstructionPtr>* visitor) {
       return visitor->HandleSendDone(this);
     case HloOpcode::kGather:
       return visitor->HandleGather(this);
+    case HloOpcode::kDomain:
+      return visitor->HandleDomain(this);
+    case HloOpcode::kGenerateToken:
+      return visitor->HandleGenerateToken(this);
 
     // These opcodes are not handled here.
     case HloOpcode::kTrace:
@@ -3040,87 +2656,7 @@ bool HloInstruction::IsElementwiseBinary() const {
 }
 
 bool HloInstruction::IsElementwise() const {
-  switch (opcode_) {
-    // Nullary elementwise operations.
-    case HloOpcode::kConstant:
-      return true;
-
-    // Unary elementwise operations.
-    case HloOpcode::kAbs:
-    case HloOpcode::kRoundNearestAfz:
-    case HloOpcode::kCeil:
-    case HloOpcode::kClz:
-    case HloOpcode::kConvert:
-    case HloOpcode::kBitcastConvert:
-    case HloOpcode::kCopy:
-    case HloOpcode::kCos:
-    case HloOpcode::kExp:
-    case HloOpcode::kExpm1:
-    case HloOpcode::kFloor:
-    case HloOpcode::kImag:
-    case HloOpcode::kIsFinite:
-    case HloOpcode::kLog:
-    case HloOpcode::kLog1p:
-    case HloOpcode::kNot:
-    case HloOpcode::kNegate:
-    case HloOpcode::kReal:
-    case HloOpcode::kReducePrecision:
-    case HloOpcode::kSign:
-    case HloOpcode::kSin:
-    case HloOpcode::kTanh:
-      CHECK_EQ(1, operand_count());
-      return true;
-
-    // Binary elementwise operations, the same as in IsElementwiseBinary().
-    case HloOpcode::kAdd:
-    case HloOpcode::kAtan2:
-    case HloOpcode::kComplex:
-    case HloOpcode::kDivide:
-    case HloOpcode::kEq:
-    case HloOpcode::kGe:
-    case HloOpcode::kGt:
-    case HloOpcode::kLe:
-    case HloOpcode::kLt:
-    case HloOpcode::kMaximum:
-    case HloOpcode::kMinimum:
-    case HloOpcode::kMultiply:
-    case HloOpcode::kNe:
-    case HloOpcode::kPower:
-    case HloOpcode::kRemainder:
-    case HloOpcode::kSubtract:
-    case HloOpcode::kAnd:
-    case HloOpcode::kOr:
-    case HloOpcode::kShiftLeft:
-    case HloOpcode::kShiftRightArithmetic:
-    case HloOpcode::kShiftRightLogical:
-      CHECK_EQ(2, operand_count());
-      return true;
-
-    // Ternary elementwise operations.
-    case HloOpcode::kSelect:
-      return !ShapeUtil::IsTuple(shape_);
-    case HloOpcode::kClamp:
-      return true;
-
-    // Other operations.
-    case HloOpcode::kRng:
-    case HloOpcode::kMap:
-      return true;
-    case HloOpcode::kFusion:
-      if (fusion_kind() != FusionKind::kLoop) {
-        return false;
-      }
-      for (auto* fused : fused_instructions()) {
-        if (fused->opcode() != HloOpcode::kParameter &&
-            !fused->IsElementwise()) {
-          return false;
-        }
-      }
-      return true;
-
-    default:
-      return false;
-  }
+  return IsElementwiseImpl(tensorflow::gtl::nullopt);
 }
 
 bool HloInstruction::ImplicitlyBroadcastsOperand(int64 operand_idx) const {
@@ -3128,54 +2664,8 @@ bool HloInstruction::ImplicitlyBroadcastsOperand(int64 operand_idx) const {
   return !ShapeUtil::SameDimensions(shape(), operand(operand_idx)->shape());
 }
 
-namespace {
-bool IsInstructionElementwiseOnOperand(const HloInstruction* instruction,
-                                       const HloInstruction* operand) {
-  std::vector<int64> operand_indices = instruction->OperandIndices(operand);
-  return std::all_of(
-      operand_indices.begin(), operand_indices.end(),
-      [instruction](int64 operand_index) {
-        return instruction->IsElementwiseOnOperand(operand_index);
-      });
-}
-}  // namespace
-
 bool HloInstruction::IsElementwiseOnOperand(int64 operand_idx) const {
-  // For all instructions other than kFusion, being elementwise on one of the
-  // operands is equivalent to being elementwise on all the operands.
-  if (opcode() != HloOpcode::kFusion) {
-    return IsElementwise();
-  }
-
-  CHECK_EQ(HloOpcode::kFusion, opcode());
-  if (fusion_kind() != FusionKind::kLoop) {
-    return false;
-  }
-
-  // A loop-fusion is elementwise on an operand if all operations (computed
-  // using BFS) between the operand and the fused root are elementwise.
-  std::deque<HloInstruction*> worklist;
-  std::unordered_set<const HloInstruction*> visited;
-  worklist.push_back(fused_parameter(operand_idx));
-  visited.insert(fused_parameter(operand_idx));
-  while (!worklist.empty()) {
-    HloInstruction* operand = worklist.front();
-    worklist.pop_front();
-    for (HloInstruction* user : operand->users()) {
-      CHECK_GE(user->unique_id(), 0);
-      if (ContainsKey(visited, user)) {
-        continue;
-      }
-      if (user->IsElementwise() ||
-          IsInstructionElementwiseOnOperand(user, operand)) {
-        worklist.push_back(user);
-        visited.insert(user);
-      } else {
-        return false;
-      }
-    }
-  }
-  return true;
+  return IsElementwiseImpl(operand_idx);
 }
 
 // A helper class for memoized, recursive computation of HloOpcode::kFusion
@@ -3197,8 +2687,10 @@ class HloInstruction::FusionReusesParamElements {
   static UseKind ComputeInternal(
       int64 i, const HloInstruction& hlo,
       tensorflow::gtl::FlatMap<const HloInstruction*, UseKind>* cache) {
-    if (hlo.opcode_ == HloOpcode::kParameter && hlo.parameter_number_ == i) {
-      return UseKind::kUse;
+    if (auto hlo_param = DynCast<HloParameterInstruction>(&hlo)) {
+      if (hlo_param->parameter_number() == i) {
+        return UseKind::kUse;
+      }
     }
 
     auto p = cache->emplace(&hlo, UseKind{});
@@ -3360,42 +2852,8 @@ string RandomDistributionToString(const RandomDistribution& distribution) {
   return tensorflow::str_util::Lowercase(RandomDistribution_Name(distribution));
 }
 
-StatusOr<RandomDistribution> StringToRandomDistribution(const string& name) {
-  static std::unordered_map<string, RandomDistribution>* map = [] {
-    static auto* map = new std::unordered_map<string, RandomDistribution>;
-    for (int i = 0; i < RandomDistribution_ARRAYSIZE; i++) {
-      if (RandomDistribution_IsValid(i)) {
-        auto value = static_cast<RandomDistribution>(i);
-        (*map)[RandomDistributionToString(value)] = value;
-      }
-    }
-    return map;
-  }();
-  auto found = map->find(tensorflow::str_util::Lowercase(name));
-  if (found == map->end()) {
-    return InvalidArgument("Unknown distribution");
-  }
-  return found->second;
-}
-
-std::ostream& operator<<(std::ostream& os, HloInstruction::FusionKind kind) {
-  return os << ToString(kind);
-}
-
-string HloInstruction::ConvolutionDimensionNumbersToString() const {
-  string result;
-  if (convolution_dimension_numbers_ == nullptr) {
-    return result;
-  }
-  const ConvolutionDimensionNumbers& dnums = *convolution_dimension_numbers_;
-  // Show the given dimension labels in order of major to minor based on the
-  // shape's layout.
-  const auto append_dims = [&](const std::vector<string>& dims,
-                               const Shape& shape) {
-    CHECK_EQ(dims.size(), ShapeUtil::Rank(shape));
-    StrAppend(&result, Join(dims, ""));
-  };
-
+string ConvolutionDimensionNumbersToString(
+    const ConvolutionDimensionNumbers& dnums) {
   // lhs_dims[i] is the symbol of the logical dimension i for the lhs
   // operand. E.g. if batch has dimension number 2, then lhs_dims[2] == "b".
   std::vector<string> lhs_dims(2 + dnums.input_spatial_dimensions().size());
@@ -3419,19 +2877,8 @@ string HloInstruction::ConvolutionDimensionNumbersToString() const {
     output_dims[dnums.output_spatial_dimensions(i)] = StrCat(i);
   }
 
-  result += "dim_labels=";
-  append_dims(lhs_dims, operand(0)->shape());
-  result += "_";
-  append_dims(rhs_dims, operand(1)->shape());
-  result += "->";
-
-  // A convolution can be represented as a kConvolution HLO or as a CustomCall
-  // that returns a tuple, the first element of which is the result of the
-  // convolution.
-  Shape this_shape =
-      ShapeUtil::IsTuple(shape()) ? shape().tuple_shapes(0) : shape();
-  append_dims(output_dims, this_shape);
-  return result;
+  return StrCat(Join(lhs_dims, ""), "_", Join(rhs_dims, ""), "->",
+                Join(output_dims, ""));
 }
 
 string HloInstruction::DotDimensionNumbersToString() const {
@@ -3457,6 +2904,28 @@ string HloInstruction::DotDimensionNumbersToString() const {
   return Join(result, ", ");
 }
 
+StatusOr<RandomDistribution> StringToRandomDistribution(const string& name) {
+  static std::unordered_map<string, RandomDistribution>* map = [] {
+    static auto* map = new std::unordered_map<string, RandomDistribution>;
+    for (int i = 0; i < RandomDistribution_ARRAYSIZE; i++) {
+      if (RandomDistribution_IsValid(i)) {
+        auto value = static_cast<RandomDistribution>(i);
+        (*map)[RandomDistributionToString(value)] = value;
+      }
+    }
+    return map;
+  }();
+  auto found = map->find(tensorflow::str_util::Lowercase(name));
+  if (found == map->end()) {
+    return InvalidArgument("Unknown distribution");
+  }
+  return found->second;
+}
+
+std::ostream& operator<<(std::ostream& os, HloInstruction::FusionKind kind) {
+  return os << ToString(kind);
+}
+
 string HloInstruction::GatherDimensionNumbersToString() const {
   CHECK_NE(gather_dimension_numbers_.get(), nullptr);
   string output_window_dims =
@@ -3488,6 +2957,31 @@ bool HloInstruction::CouldBeBitcast() const {
   }
 }
 
+Status HloInstruction::GetBackendConfigInternal(
+    tensorflow::protobuf::Message* proto) const {
+  proto->Clear();
+
+  // Empty string does not parse as valid JSON, but it's a valid backend config,
+  // corresponding to the empty proto.
+  if (backend_config_.empty()) {
+    return Status::OK();
+  }
+  return tensorflow::HumanReadableJsonToProto(backend_config_, proto);
+}
+
+Status HloInstruction::set_backend_config(
+    const tensorflow::protobuf::Message& proto) {
+  TF_ASSIGN_OR_RETURN(backend_config_, BackendConfigToRawString(proto));
+  return Status::OK();
+}
+
+/* static */ StatusOr<string> HloInstruction::BackendConfigToRawString(
+    const tensorflow::protobuf::Message& proto) {
+  string ret;
+  TF_RETURN_IF_ERROR(tensorflow::ProtoToHumanReadableJson(proto, &ret));
+  return ret;
+}
+
 HloModule* HloInstruction::GetModule() const {
   if (parent_) {
     return parent_->parent();
@@ -3505,21 +2999,173 @@ void HloInstruction::set_outer_dimension_partitions(
   outer_dimension_partitions_ = outer_dimension_partitions;
 }
 
+// TODO(b/80131774): Remove these temporary methods after transition.
+int64 HloInstruction::feature_index() const {
+  return Cast<HloBatchNormInstruction>(this)->feature_index();
+}
+
+float HloInstruction::epsilon() const {
+  return Cast<HloBatchNormInstruction>(this)->epsilon();
+}
+
+FftType HloInstruction::fft_type() const {
+  return Cast<HloFftInstruction>(this)->fft_type();
+}
+
+const std::vector<int64>& HloInstruction::fft_length() const {
+  return Cast<HloFftInstruction>(this)->fft_length();
+}
+
+int64 HloInstruction::channel_id() const {
+  return Cast<HloSendRecvInstruction>(this)->channel_id();
+}
+
+int64 HloInstruction::concatenate_dimension() const {
+  return Cast<HloConcatenateInstruction>(this)->concatenate_dimension();
+}
+
+bool HloInstruction::IsRank2Transpose() const {
+  auto transpose = DynCast<HloTransposeInstruction>(this);
+  return transpose != nullptr && transpose->IsRank2Transpose();
+}
+
+int64 HloInstruction::slice_starts(int64 dimension) const {
+  return Cast<HloSliceInstruction>(this)->slice_starts(dimension);
+}
+
+const std::vector<int64>& HloInstruction::slice_starts() const {
+  return Cast<HloSliceInstruction>(this)->slice_starts();
+}
+
+int64 HloInstruction::slice_limits(int64 dimension) const {
+  return Cast<HloSliceInstruction>(this)->slice_limits(dimension);
+}
+
+const std::vector<int64>& HloInstruction::slice_limits() const {
+  return Cast<HloSliceInstruction>(this)->slice_limits();
+}
+
+int64 HloInstruction::slice_strides(int64 dimension) const {
+  return Cast<HloSliceInstruction>(this)->slice_strides(dimension);
+}
+
+const std::vector<int64>& HloInstruction::slice_strides() const {
+  return Cast<HloSliceInstruction>(this)->slice_strides();
+}
+
+bool HloInstruction::IsInPlaceSlice() const {
+  return Cast<HloSliceInstruction>(this)->IsInPlaceSlice();
+}
+
+const Literal& HloInstruction::literal() const {
+  return Cast<HloConstantInstruction>(this)->literal();
+}
+
+bool HloInstruction::IsConstant() const {
+  return DynCast<HloConstantInstruction>(this) != nullptr;
+}
+
 void HloInstruction::RelayoutConstant(const Layout& new_layout,
                                       const ShapeIndex& shape_index) {
-  CHECK_EQ(opcode(), HloOpcode::kConstant);
-  Shape* mutable_array_subshape =
-      ShapeUtil::GetMutableSubshape(mutable_shape(), shape_index);
-  CHECK(ShapeUtil::IsArray(*mutable_array_subshape));
+  Cast<HloConstantInstruction>(this)->RelayoutConstant(new_layout, shape_index);
+}
 
-  // Normally array_subshape will always have a layout, but this invariant is
-  // temporarily broken in LayoutAssignment::AssignLayouts.
+string HloInstruction::TracingTag() const {
+  return Cast<HloTraceInstruction>(this)->TracingTag();
+}
 
-  if (!mutable_array_subshape->has_layout() ||
-      !LayoutUtil::Equal(mutable_array_subshape->layout(), new_layout)) {
-    literal_ = literal_->Relayout(new_layout, shape_index);
-    *mutable_array_subshape->mutable_layout() = new_layout;
-  }
+HloInstruction* HloInstruction::AddFusionOperand(HloInstruction* new_operand) {
+  return Cast<HloFusionInstruction>(this)->AddFusionOperand(new_operand);
+}
+
+// Delegates to HloFusionInstruction::MergeFusionInstruction.
+void HloInstruction::MergeFusionInstruction(
+    HloInstruction* instruction_to_merge) {
+  return Cast<HloFusionInstruction>(this)->MergeFusionInstruction(
+      Cast<HloFusionInstruction>(instruction_to_merge));
+}
+
+// Delegates to HloFusionInstruction::MergeFusionInstructionIntoMultiOutput.
+void HloInstruction::MergeFusionInstructionIntoMultiOutput(
+    HloInstruction* instruction_to_merge) {
+  return Cast<HloFusionInstruction>(this)
+      ->MergeFusionInstructionIntoMultiOutput(
+          Cast<HloFusionInstruction>(instruction_to_merge));
+}
+
+HloInstruction* HloInstruction::FuseInstruction(
+    HloInstruction* instruction_to_fuse) {
+  return Cast<HloFusionInstruction>(this)->FuseInstruction(instruction_to_fuse);
+}
+
+HloInstruction* HloInstruction::FuseInstructionIntoMultiOutput(
+    HloInstruction* instruction_to_fuse) {
+  return Cast<HloFusionInstruction>(this)->FuseInstructionIntoMultiOutput(
+      instruction_to_fuse);
+}
+
+HloComputation* HloInstruction::fused_instructions_computation() const {
+  return Cast<HloFusionInstruction>(this)->fused_instructions_computation();
+}
+
+HloInstruction* HloInstruction::fused_expression_root() const {
+  return Cast<HloFusionInstruction>(this)->fused_expression_root();
+}
+
+const tensorflow::gtl::iterator_range<UnwrappingIterator<
+    std::list<std::unique_ptr<HloInstruction>>::const_iterator>>
+HloInstruction::fused_instructions() const {
+  return Cast<HloFusionInstruction>(this)->fused_instructions();
 }
 
+const tensorflow::gtl::iterator_range<
+    UnwrappingIterator<std::list<std::unique_ptr<HloInstruction>>::iterator>>
+HloInstruction::fused_instructions() {
+  return Cast<HloFusionInstruction>(this)->fused_instructions();
+}
+
+int64 HloInstruction::fused_instruction_count() const {
+  return Cast<HloFusionInstruction>(this)->fused_instruction_count();
+}
+
+HloInstruction* HloInstruction::fused_parameter(int64 parameter_number) const {
+  return Cast<HloFusionInstruction>(this)->fused_parameter(parameter_number);
+}
+
+const std::vector<HloInstruction*>& HloInstruction::fused_parameters() const {
+  return Cast<HloFusionInstruction>(this)->fused_parameters();
+}
+
+const bool HloInstruction::IsMultiOutputFusion() const {
+  const HloFusionInstruction* fusion = DynCast<HloFusionInstruction>(this);
+  return fusion != nullptr && fusion->IsMultiOutputFusion();
+}
+
+HloInstruction::FusionKind HloInstruction::fusion_kind() const {
+  return Cast<HloFusionInstruction>(this)->fusion_kind();
+}
+
+void HloInstruction::set_fusion_kind(FusionKind kind) {
+  return Cast<HloFusionInstruction>(this)->set_fusion_kind(kind);
+}
+
+RandomDistribution HloInstruction::random_distribution() const {
+  return Cast<HloRngInstruction>(this)->random_distribution();
+}
+
+int64 HloInstruction::parameter_number() const {
+  return Cast<HloParameterInstruction>(this)->parameter_number();
+}
+
+int64 HloInstruction::tuple_index() const {
+  return Cast<HloGetTupleElementInstruction>(this)->tuple_index();
+}
+
+int32 HloInstruction::exponent_bits() const {
+  return Cast<HloReducePrecisionInstruction>(this)->exponent_bits();
+}
+
+int32 HloInstruction::mantissa_bits() const {
+  return Cast<HloReducePrecisionInstruction>(this)->mantissa_bits();
+}
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/hlo_instruction.h b/tensorflow/compiler/xla/service/hlo_instruction.h
index 234dbc8399de2d88209dd8dd2be58dd152ddbe76..a206cdab2739cd5046295179ead5d8bf19d521c4 100644
--- a/tensorflow/compiler/xla/service/hlo_instruction.h
+++ b/tensorflow/compiler/xla/service/hlo_instruction.h
@@ -37,6 +37,8 @@ limitations under the License.
 #include "tensorflow/compiler/xla/map_util.h"
 #include "tensorflow/compiler/xla/service/dfs_hlo_visitor.h"
 #include "tensorflow/compiler/xla/service/hlo.pb.h"
+#include "tensorflow/compiler/xla/service/hlo_clone_context.h"
+#include "tensorflow/compiler/xla/service/hlo_domain_metadata.h"
 #include "tensorflow/compiler/xla/service/hlo_opcode.h"
 #include "tensorflow/compiler/xla/service/hlo_sharding.h"
 #include "tensorflow/compiler/xla/service/name_uniquer.h"
@@ -50,6 +52,7 @@ limitations under the License.
 #include "tensorflow/core/lib/gtl/iterator_range.h"
 #include "tensorflow/core/platform/logging.h"
 #include "tensorflow/core/platform/macros.h"
+#include "tensorflow/core/platform/protobuf.h"
 #include "tensorflow/core/platform/types.h"
 
 namespace xla {
@@ -319,7 +322,7 @@ class HloInstruction {
     kCustom,
   };
 
-  ~HloInstruction();
+  virtual ~HloInstruction();
 
   // Creates an instruction from the given proto. Arguments:
   //
@@ -423,10 +426,27 @@ class HloInstruction {
       const Shape& shape, HloInstruction* operand, const int exponent_bits,
       const int mantissa_bits);
 
-  // Creates a cross replica sum op.
+  // Creates a cross replica reduction op.
+  //
+  // `reduction_computation`: the reduction function.
+  //
+  // `replica_group_ids`: maps replica ids to subgroup ids. If empty, all
+  // replicas belong to one group. Allreduce will be applied within subgroups.
+  // For example, we have 4 replicas, then replica_group_ids={0,1,0,1} means,
+  // replica 0 and 2 are in subgroup 0, replica 1 and 3 are in subgroup 1.
+  //
+  // `channel_id`: for Allreduce nodes from different models, if they have the
+  // same channel_id, they will be 'Allreduce'd. If empty, Allreduce will not be
+  // applied cross models.
+  //
+  // TODO(b/79737069): Rename this to AllReduce.
   static std::unique_ptr<HloInstruction> CreateCrossReplicaSum(
-      const Shape& shape,
-      tensorflow::gtl::ArraySlice<HloInstruction*> operands);
+      const Shape& shape, tensorflow::gtl::ArraySlice<HloInstruction*> operands,
+      HloComputation* reduce_computation,
+      tensorflow::gtl::ArraySlice<int64> replica_group_ids,
+      tensorflow::StringPiece barrier,
+      const tensorflow::gtl::optional<int64>& channel_id =
+          tensorflow::gtl::nullopt);
 
   // Creates a conversion instruction, where operand is the data to convert and
   // shape is the target shape for the conversion.
@@ -597,6 +617,13 @@ class HloInstruction {
       const GatherDimensionNumbers& gather_dim_numbers,
       tensorflow::gtl::ArraySlice<int64> window_bounds);
 
+  // Creates a kDomain instruction which delimits an HLO domain which have
+  // the provided user and operand side metadata.
+  static std::unique_ptr<HloInstruction> CreateDomain(
+      const Shape& shape, HloInstruction* operand,
+      std::unique_ptr<DomainMetadata> operand_side_metadata,
+      std::unique_ptr<DomainMetadata> user_side_metadata);
+
   // Creates a fusion instruction. A fusion instruction contains one or more
   // fused instructions forming an expression with a single root
   // "fused_root". Additional instructions can be added to the fusion
@@ -638,6 +665,11 @@ class HloInstruction {
       const Shape& shape, HloInstruction* operand,
       tensorflow::gtl::ArraySlice<int64> dimensions);
 
+  // Creates a token instruction used for joining or creating token types which
+  // thread through side-effecting operations.
+  static std::unique_ptr<HloInstruction> CreateGenerateToken(
+      tensorflow::gtl::ArraySlice<HloInstruction*> operands);
+
   // Creates an instance of GatherDimensionNumbers.
   static GatherDimensionNumbers MakeGatherDimNumbers(
       tensorflow::gtl::ArraySlice<int64> output_window_dims,
@@ -676,6 +708,10 @@ class HloInstruction {
   using InstructionVector = tensorflow::gtl::InlinedVector<HloInstruction*, 2>;
   const InstructionVector& operands() const { return operands_; }
 
+  // Returns the vector of unique operands, in the same order they are found
+  // within the operand vector.
+  InstructionVector unique_operands() const;
+
   // Returns the index of 'target' in the operands sequence.
   // Precondition: target must be an operand (or a fatal error will occur).
   int64 operand_index(const HloInstruction* target) const;
@@ -762,15 +798,16 @@ class HloInstruction {
       }
     }
 
+    if (backend_config_ != other.backend_config_) {
+      return false;
+    }
+
     return IdenticalSlowPath(other, eq_computations);
   }
 
   // Returns whether the instruction has a constant operand.
   bool HasConstantOperand() const;
 
-  // Returns whether this instruction does a rank-2 transposition.
-  bool IsRank2Transpose() const;
-
   // Replaces the use of this instruction in "user" with "new_producer". Note
   // that there might be multiple uses of this instruction in "user"; all will
   // be replaced.
@@ -839,38 +876,6 @@ class HloInstruction {
   template <typename HloInstructionPtr>
   Status Visit(DfsHloVisitorBase<HloInstructionPtr>* visitor);
 
-  // Returns the literal associated with this instruction.
-  //
-  // Note: only constant and parameter opcodes have an associated literal.
-  const Literal& literal() const;
-
-  // Returns whether there is literal associated with this instruction.
-  bool HasLiteral() const;
-
-  // Returns the parameter number associated with this instruction.
-  //
-  // Note: only parameter opcodes have an associated parameter number.
-  int64 parameter_number() const {
-    CHECK_EQ(HloOpcode::kParameter, opcode_);
-    return parameter_number_;
-  }
-
-  // Returns the dimension sizes or numbers associated with this instruction.
-  //
-  // Precondition: opcode() is one of: concatenate, reduce, broadcast, reshape,
-  // and reverse.
-  const std::vector<int64>& dimensions() const;
-  int64 dimensions(int64 index) const;
-
-  // Accessor for the dimension in which a concatenate HLO should occur.
-  // Precondition: opcode() == HloOpcode::kConcatenate
-  int64 concatenate_dimension() const;
-
-  // Returns the tuple index associated with this instruction.
-  //
-  // Precondition: opcode() == HloOpcode::kGetTupleElement
-  int64 tuple_index() const;
-
   // Returns the first non-GetTupleElement ancestor instruction of 'hlo'.
   // If the first non-GTE ancestor is tuple-shaped, populates 'index' with the
   // (possibly nested) tuple indices used on the path from ancestor to 'hlo'.
@@ -965,7 +970,7 @@ class HloInstruction {
   string ToShortString() const;
 
   // Returns a serialized representation of this instruction.
-  HloInstructionProto ToProto() const;
+  virtual HloInstructionProto ToProto() const;
 
   // Returns a category for the HLO. This could be something like "convolution"
   // or "elementwise".
@@ -977,111 +982,26 @@ class HloInstruction {
   HloInstruction* tracing() const;
   void set_tracing(HloInstruction* trace_instruction);
 
-  // Returns the channel id associated with the instruction. The id is
-  // shared between each Send/Recv pair and is globally unique to identify each
-  // channel.
-  //
-  // Precondition: opcode() == HloOpcode::kSend or HloOpcode::kRecv
-  int64 channel_id() const { return channel_id_; }
-
   // Returns the channel name associated with the instruction. The name is
   // used to identify host Send/Recv operations.
   //
   // Precondition: opcode() == HloOpcode::kHostCompute
   string channel_name() const { return channel_name_; }
 
-  // Returns feature_index field associated with the instruction. The index
-  // represents the index of the feature dimension.
-  //
-  // Precondition: opcode() is one of kBatchNormTraining, kBatchNormInference,
-  // or kBatchNormGrad.
-  int64 feature_index() const { return feature_index_; }
-
-  // Returns a epsilon value associated with the instruction. The is a small
-  // number added to the variance to avoid divide-by-zero error.
-  //
-  // Precondition: opcode() is one of kBatchNormTraining, kBatchNormInference,
-  // or kBatchNormGrad.
-  float epsilon() const { return epsilon_; }
-
   // Returns the infeed configuration string. The infeed configuration includes
   // any metadata needed for the backend compiler (e.g., infeed buffer address)
   // and is target-dependent.
   string infeed_config() const { return infeed_config_; }
   void set_infeed_config(const string& config) { infeed_config_ = config; }
 
-  // Returns a tag to be used in tracing.
-  //
-  // Precondition: opcode() == HloOpcode::kTrace
-  string TracingTag() const;
-
-  // Returns whether the instruction is a constant.
-  bool IsConstant() const;
-
   // Returns true if this instruction is fused, ie contained within a fusion
   // instruction.
   bool IsFused() const;
 
-  // Returns the computation for this fused instruction.
-  //
-  // Precondition: opcode() == HloOpcode::kFusion
-  HloComputation* fused_instructions_computation() const;
-
   // Returns true if this instruction can be legally fused into a fusion
   // instruction.
   bool IsFusable() const;
 
-  // Returns the root instruction of the fused expression contained within this
-  // fusion instruction.
-  //
-  // Precondition: opcode() == HloOpcode::kFusion
-  HloInstruction* fused_expression_root() const;
-
-  // Returns the list of fused instructions inside this fusion instruction.  The
-  // returned type is a range of HloInstruction*s.
-  //
-  // Precondition: opcode() == HloOpcode::kFusion
-  const tensorflow::gtl::iterator_range<UnwrappingIterator<
-      std::list<std::unique_ptr<HloInstruction>>::const_iterator>>
-  fused_instructions() const;
-
-  const tensorflow::gtl::iterator_range<
-      UnwrappingIterator<std::list<std::unique_ptr<HloInstruction>>::iterator>>
-  fused_instructions();
-
-  // Gets the number of instructions inside this fusion instruction.
-  //
-  // Precondition: opcode() == HloOpcode::kFusion
-  int64 fused_instruction_count() const;
-
-  // Returns the fused parameter instruction in this fusion instruction
-  // corresponding to the given parameter number.
-  //
-  // Precondition: opcode() == HloOpcode::kFusion
-  HloInstruction* fused_parameter(int64 parameter_number) const;
-
-  // Returns the vector of fused parameters inside this fusion instruction.
-  //
-  // Precondition: opcode() == HloOpcode::kFusion
-  const std::vector<HloInstruction*>& fused_parameters() const;
-
-  // Returns true if this instruction is a fusion instruction that generates
-  // multiple outputs.
-  const bool IsMultiOutputFusion() const {
-    return opcode() == HloOpcode::kFusion &&
-           fused_expression_root()->opcode() == HloOpcode::kTuple;
-  }
-
-  FusionKind fusion_kind() const {
-    CHECK_EQ(HloOpcode::kFusion, opcode_);
-    return fusion_kind_;
-  }
-
-  void set_fusion_kind(FusionKind kind) {
-    CHECK_EQ(HloOpcode::kFusion, opcode_);
-    fusion_kind_ = kind;
-  }
-
   // Returns the sharding applied to this operator.
   // REQUIRES: has_sharding() is true.
   const HloSharding& sharding() const {
@@ -1094,16 +1014,23 @@ class HloInstruction {
   }
   // Returns the sharding unique device, if any.
   tensorflow::gtl::optional<int64> sharding_unique_device() const {
-    if (sharding_ == nullptr || !sharding_->HasUniqueDevice()) {
+    if (sharding_ == nullptr) {
       return tensorflow::gtl::optional<int64>();
     }
-    return sharding_->UniqueDevice().ValueOrDie();
+    auto device = sharding_->UniqueDevice();
+    return device.ok() ? device.ValueOrDie()
+                       : tensorflow::gtl::optional<int64>();
   }
   // Sets the sharding of this operator. Should only be called by HloModule or
   // HloComputation methods.
   void set_sharding(const HloSharding& sharding) {
     sharding_ = MakeUnique<HloSharding>(sharding);
   }
+  void set_single_sharding(const HloSharding& sharding);
+  // Sets a sharding that assigns the current instruction to device.
+  void set_device_sharding(int64 device) {
+    set_single_sharding(HloSharding::AssignDevice(device));
+  }
   // Remove any sharding from this operator.
   void clear_sharding() { sharding_ = nullptr; }
   // Return true if this operator has a sharding assigned.
@@ -1117,6 +1044,15 @@ class HloInstruction {
     return other->has_sharding() ? sharding() == other->sharding() : false;
   }
 
+  // Retrieves the operand side metadata of a kDomain instruction.
+  const DomainMetadata& operand_side_metadata() const {
+    return *operand_side_metadata_;
+  }
+  // Retrieves the user side metadata of a kDomain instruction.
+  const DomainMetadata& user_side_metadata() const {
+    return *user_side_metadata_;
+  }
+
   // When creating a new instruction which either replaces, or shifts up (kCopy
   // insertion case), another instruction, we need to make sure the certain
   // properties of the new instruction are copied into the derived one. As of
@@ -1124,93 +1060,6 @@ class HloInstruction {
   // instruction.
   void SetupDerivedInstruction(HloInstruction* derived_instruction) const;
 
-  // Adds a new operand the fusion instruction.
-  HloInstruction* AddFusionOperand(HloInstruction* new_operand);
-
-  // Merges the fused instructions from 'instruction_to_merge' into the
-  // fused instruction set of 'this', updating operands as necessary.
-  //
-  // Precondition: opcode() == HloOpcode::kFusion
-  // Predondition: 'instruction_to_merge' must be an operand of 'this'.
-  void MergeFusionInstruction(HloInstruction* instruction_to_merge);
-
-  // Merges the fused instructions from instruction_to_merge into the fused
-  // instruction set of 'this' and generates multioutput fusion instructions.
-  // All the users of instruction_to_merge will be redirected to 'this'
-  // instruction. instruction_to_merge will be removed from its parent
-  // computation.
-  //
-  // Precondition: opcode() == HloOpcode::kFusion
-  void MergeFusionInstructionIntoMultiOutput(
-      HloInstruction* instruction_to_merge);
-
-  // Fuses the given instruction in this fusion instruction. instruction_to_fuse
-  // is cloned and the clone is placed in the fusion
-  // instruction. instruction_to_fuse is unchanged. Instruction is cloned rather
-  // than moved to cleanly handle the case where the instruction has a use
-  // outside the fusion instruction. Moving such an instruction into a fusion
-  // instruction would violate the single-result invariant of HLO instructions
-  // and significantly complicate code generation.
-  //
-  // Precondition: this->opcode() == HloOpcode::kFusion
-  HloInstruction* FuseInstruction(HloInstruction* instruction_to_fuse) {
-    return FuseInstructionInternal(instruction_to_fuse);
-  }
-
-  // Fuses the given instruction in this fusion instruction and generate
-  // multioutput fusion instruction. A clone of the instruction_to_fuse will
-  // be part of the output of fusion instructions. The users of
-  // instruction_to_fuse will be redirected to this fusion instructions.
-  // instruction_to_fuse will be removed from its parent computation.
-  //
-  // Precondition: this->opcode() == HloOpcode::kFusion
-  HloInstruction* FuseInstructionIntoMultiOutput(
-      HloInstruction* instruction_to_fuse) {
-    return FuseInstructionInternal(instruction_to_fuse, /* add_output */ true);
-  }
-
-  // Returns the start index in the given dimension for a slice node.
-  //
-  // Precondition: opcode() == HloOpcode::kSlice
-  int64 slice_starts(int64 dimension) const {
-    CHECK_EQ(HloOpcode::kSlice, opcode_);
-    return slice_starts_[dimension];
-  }
-  const std::vector<int64>& slice_starts() const { return slice_starts_; }
-
-  // Returns the (exclusive) limit index in the given dimension for a slice
-  // node.
-  //
-  // Precondition: opcode() == HloOpcode::kSlice
-  int64 slice_limits(int64 dimension) const {
-    CHECK_EQ(HloOpcode::kSlice, opcode_);
-    return slice_limits_[dimension];
-  }
-  const std::vector<int64>& slice_limits() const {
-    CHECK_EQ(HloOpcode::kSlice, opcode_);
-    return slice_limits_;
-  }
-
-  // Returns the stride in the given dimension for a slice node.
-  //
-  // Precondition: opcode() == HloOpcode::kSlice
-  int64 slice_strides(int64 dimension) const {
-    CHECK_EQ(HloOpcode::kSlice, opcode_);
-    return slice_strides_[dimension];
-  }
-  const std::vector<int64>& slice_strides() const { return slice_strides_; }
-
-  // Returns the flag that describes whether a slice must be lowered into an
-  // offset into the original operand.
-  bool IsInPlaceSlice() const { return is_in_place_slice_; }
-
-  // Sets and returns the flag that describes whether a slice must be lowered
-  // into an offset into the original operand.
-  bool SetIsInPlaceSlice(bool value) {
-    is_in_place_slice_ = value;
-    return value;
-  }
-
   // Returns the size of the slice in the given dimension for a dynamic
   // slice node.
   //
@@ -1224,22 +1073,6 @@ class HloInstruction {
     return dynamic_slice_sizes_;
   }
 
-  // Returns the number of exponent bits for a reduce-precision node.
-  //
-  // Precondition: opcode() == HloOpcode::kReducePrecision
-  int32 exponent_bits() const {
-    CHECK_EQ(HloOpcode::kReducePrecision, opcode_);
-    return exponent_bits_;
-  }
-
-  // Returns the number of mantissa bits for a reduce-precision node.
-  //
-  // Precondition: opcode() == HloOpcode::kReducePrecision
-  int32 mantissa_bits() const {
-    CHECK_EQ(HloOpcode::kReducePrecision, opcode_);
-    return mantissa_bits_;
-  }
-
   // Returns data on the window in a windowed operation such as
   // convolution.
   const Window& window() const {
@@ -1277,19 +1110,6 @@ class HloInstruction {
         MakeUnique<ConvolutionDimensionNumbers>(dnums);
   }
 
-  FftType fft_type() const {
-    CHECK_EQ(HloOpcode::kFft, opcode_);
-    return fft_type_;
-  }
-
-  const std::vector<int64>& fft_length() const {
-    CHECK_EQ(HloOpcode::kFft, opcode_);
-    return fft_length_;
-  }
-
-  // Returns the dump string of the convolution dimension numbers.
-  string ConvolutionDimensionNumbersToString() const;
-
   // Returns data on the dimension numbers used for a dot operation.
   const DotDimensionNumbers& dot_dimension_numbers() const {
     CHECK(dot_dimension_numbers_ != nullptr);
@@ -1312,35 +1132,19 @@ class HloInstruction {
   // Returns the dump string of the gather dimension numbers.
   string GatherDimensionNumbersToString() const;
 
-  // Returns the random distribution for this rng node.
-  //
-  // Precondition: opcode() == HloOpcode::kRng
-  RandomDistribution random_distribution() const;
-
-  // See documentation for Clone().
-  using CloneMap = std::unordered_map<const HloInstruction*, HloInstruction*>;
-
   // Clones the HLO instruction. The clone will have the same opcode, shape, and
   // operands. After creation the clone has no uses. "this" (the instruction
   // cloned from) is not changed. Suffix is the string to append to the name of
-  // the instruction to form the name of the cloned instruction. Ignores the
-  // control predecessors and successors of this HLO instruction.
-  //
-  // If the module pointer is not nullptr, then any cloned computations will be
-  // added to this module in order to support deep cloning. Otherwise the module
-  // of the instruction is used.
-  //
-  // If clone_map is not nullptr, then each original instruction that is cloned
-  // will be inserted and map to its clone. clone_map should not already contain
-  // any of the instructions to clone.
-  std::unique_ptr<HloInstruction> Clone(const string& suffix = "clone",
-                                        HloModule* module = nullptr,
-                                        CloneMap* clone_map = nullptr) const;
+  // the instruction to form the name of the cloned instruction.
+  // Ignores the control predecessors and successors of this HLO instruction.
+  std::unique_ptr<HloInstruction> Clone(
+      const string& suffix = "clone", HloCloneContext* context = nullptr) const;
 
   // Clones the HLO instruction as above but with new shape and operands.
   std::unique_ptr<HloInstruction> CloneWithNewOperands(
-      const Shape& shape, tensorflow::gtl::ArraySlice<HloInstruction*> operands,
-      HloModule* module = nullptr, CloneMap* clone_map = nullptr) const;
+      const Shape& shape,
+      tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+      HloCloneContext* context = nullptr) const;
 
   // Returns the computations this instruction directly calls (if any).
   const std::vector<HloComputation*>& called_computations() const {
@@ -1410,9 +1214,14 @@ class HloInstruction {
   std::tuple<bool, std::vector<int64>, std::vector<int64>>
   ReshapeMerelyInsertsOrDeletes1SizedDimensions() const;
 
-  // Gets/sets the string identifier for this instruction.
+  // Gets the string identifier for this instruction.
   const string& name() const { return name_; }
-  void set_name(tensorflow::StringPiece name) { name_ = std::string(name); }
+
+  // Sets the string identifier for this instruction. Name will be sanitized to
+  // match the regexp "[a-zA-Z_][a-zA-Z0-9_.-]*".
+  void SetAndSanitizeName(const string& name) {
+    name_ = NameUniquer::GetSanitizedName(name);
+  }
 
   // Use the given NameUniquer to select a unique name for the instruction based
   // on the instruction's existing name.
@@ -1435,13 +1244,34 @@ class HloInstruction {
   // this field and they cannot interpret it due to its meaning being backend
   // specific.
   //
-  // TODO(b/78194644): Introduce structured configuration format as per
-  // go/xla-heuristics.
-  const string& backend_config() const { return backend_config_; }
-  void set_backend_config(string backend_config) {
-    backend_config_ = std::move(backend_config);
+  // ConfigProto should be a protobuf Message type.
+  template <typename ConfigProto>
+  StatusOr<ConfigProto> backend_config() const {
+    ConfigProto proto;
+    TF_RETURN_IF_ERROR(GetBackendConfigInternal(&proto));
+    return std::move(proto);
+  }
+  Status set_backend_config(const tensorflow::protobuf::Message& proto);
+
+  // Getter/setter for raw JSON-encoded backend config.  Prefer the
+  // functions above that deal in proto Messages where possible.
+  const string& raw_backend_config_string() const { return backend_config_; }
+  void set_raw_backend_config_string(string config_str) {
+    backend_config_ = std::move(config_str);
   }
 
+  // Returns a string representation of a proto in the format used by
+  // raw_backend_config_string.
+  //
+  // This is morally equivalent to:
+  //
+  //   HloInstruction instr;
+  //   TF_RETURN_IF_ERROR(instr.set_backend_config(proto));
+  //   return instr.raw_backend_config_string();
+  //
+  static StatusOr<string> BackendConfigToRawString(
+      const tensorflow::protobuf::Message& proto);
+
   // Sets the debug metadata for this instruction.
   void set_metadata(const OpMetadata& metadata) { metadata_ = metadata; }
   const OpMetadata& metadata() const { return metadata_; }
@@ -1472,13 +1302,193 @@ class HloInstruction {
   void set_outer_dimension_partitions(
       const std::vector<int64>& outer_dimension_partitions);
 
-  // Change the layout for an Constant Hlo instruction to match new_layout.  For
-  // tuple shaped constants shape_index is the path to the internal array
-  // subshape whose layout needs to be changed.
+  // Old methods kept for smooth subclassing transition BEGIN.
+  // TODO(b/80131774): Remove this code.
+
+  // Delegates to HloBatchNormInstruction::feature_index.
+  int64 feature_index() const;
+
+  // Delegates to HloBatchNormInstruction::epsilon.
+  float epsilon() const;
+
+  // Delegates to HloFftInstruction::fft_type.
+  FftType fft_type() const;
+
+  // Delegates to HloFftInstruction::fft_length.
+  const std::vector<int64>& fft_length() const;
+
+  // Delegates to HloSendRecvInstruction::channel_id.
+  int64 channel_id() const;
+
+  // Returns the dimension sizes or numbers associated with this instruction.
+  virtual const std::vector<int64>& dimensions() const {
+    LOG(FATAL) << "Unimplemented method.";
+  }
+  virtual int64 dimensions(int64 index) const {
+    LOG(FATAL) << "Unimplemented method.";
+  }
+
+  // Delegates to HloConcatenateInstruction::concatenate_dimension.
+  int64 concatenate_dimension() const;
+
+  // Returns whether this instruction does a rank-2 transposition.
+  bool IsRank2Transpose() const;
+
+  // Delegates to HloSliceInstruction::slice_start.
+  int64 slice_starts(int64 dimension) const;
+  const std::vector<int64>& slice_starts() const;
+
+  // Delegates to HloSliceInstruction::slice_limits.
+  int64 slice_limits(int64 dimension) const;
+  const std::vector<int64>& slice_limits() const;
+
+  // Delegates to HloSliceInstruction::slice_strides.
+  int64 slice_strides(int64 dimension) const;
+  const std::vector<int64>& slice_strides() const;
+
+  // Delegates to HloSliceInstruction::IsInPlaceSlice.
+  bool IsInPlaceSlice() const;
+
+  // Returns the literal associated with this instruction.
+  const Literal& literal() const;
+
+  // Returns whether the instruction is a constant.
+  bool IsConstant() const;
+
+  // Delegate to HloConstantInstruction::RelayoutConstant.
   void RelayoutConstant(const Layout& new_layout,
                         const ShapeIndex& shape_index = {});
 
+  // Delegates to HloTraceInstruction::TracingTag.
+  string TracingTag() const;
+
+  // Delegates to HloFusionInstruction::AddFusionOperand.
+  HloInstruction* AddFusionOperand(HloInstruction* new_operand);
+
+  // Delegates to HloFusionInstruction::MergeFusionInstruction.
+  void MergeFusionInstruction(HloInstruction* instruction_to_merge);
+
+  // Delegates to HloFusionInstruction::MergeFusionInstructionIntoMultiOutput.
+  void MergeFusionInstructionIntoMultiOutput(
+      HloInstruction* instruction_to_merge);
+
+  // Delegates to HloFusionInstruction::FuseInstruction.
+  HloInstruction* FuseInstruction(HloInstruction* instruction_to_fuse);
+
+  // Delegates to HloFusionInstruction::FuseInstructionIntoMultiOutput.
+  HloInstruction* FuseInstructionIntoMultiOutput(
+      HloInstruction* instruction_to_fuse);
+
+  // Delegates to HloFusionInstruction::fused_instruction.
+  HloComputation* fused_instructions_computation() const;
+
+  // Delegates to HloFusionInstruction::fused_expression_root.
+  HloInstruction* fused_expression_root() const;
+
+  // Delegates to HloFusionInstruction::fused_instructions.
+  const tensorflow::gtl::iterator_range<UnwrappingIterator<
+      std::list<std::unique_ptr<HloInstruction>>::const_iterator>>
+  fused_instructions() const;
+
+  const tensorflow::gtl::iterator_range<
+      UnwrappingIterator<std::list<std::unique_ptr<HloInstruction>>::iterator>>
+  fused_instructions();
+
+  // Delegates to HloFusionInstruction::fused_instruction_count.
+  int64 fused_instruction_count() const;
+
+  // Delegates to HloFusionInstruction::fused_parameter.
+  HloInstruction* fused_parameter(int64 parameter_number) const;
+
+  // Delegates to HloFusionInstruction::fused_parameters.
+  const std::vector<HloInstruction*>& fused_parameters() const;
+
+  // Returns true if this instruction is a fusion instruction that generates
+  // multiple outputs.
+  const bool IsMultiOutputFusion() const;
+
+  // Delegates to HloFusionInstruction::fusion_kind.
+  FusionKind fusion_kind() const;
+
+  // Delegates to HloFusionInstruction::set_fusion_kind.
+  void set_fusion_kind(FusionKind kind);
+
+  // Delegates to HloRngInstruction::random_distribution.
+  RandomDistribution random_distribution() const;
+
+  // Delegates to HloParameterInstruction::parameter_number.
+  int64 parameter_number() const;
+
+  // Delegates to HloGetTupleElementInstruction::tuple_index.
+  int64 tuple_index() const;
+
+  // Returns the number of exponent bits for a reduce-precision node.
+  int32 exponent_bits() const;
+
+  // Returns the number of mantissa bits for a reduce-precision node.
+  int32 mantissa_bits() const;
+  // Old methods kept for smooth subclassing transition END.
+
+  // Returns the group ids of each replica for CrossReplicaSum op.
+  const std::vector<int64>& replica_group_ids() const {
+    return replica_group_ids_;
+  }
+
+  // Returns the barrier config used for the CrossReplicaSum implementation of
+  // each backend.
+  string cross_replica_sum_barrier() const {
+    return cross_replica_sum_barrier_;
+  }
+  void set_cross_replica_sum_barrier(string barrier) {
+    cross_replica_sum_barrier_ = barrier;
+  }
+
+ protected:
+  enum class UseKind { kNoUse, kReuse, kUsePermutingElements, kUse };
+  // Helper class for computing OperandElementUse for kFusion.
+  class FusionReusesParamElements;
+
+  // Internal constructor for a given opcode/shape, other fields must be filled
+  // by factory methods.
+  HloInstruction(HloOpcode opcode, const Shape& shape);
+
+  // Appends operand to the list of operands and adds this instruction as a user
+  // of the operand.
+  void AppendOperand(HloInstruction* operand);
+
+  void RemoveOperandAt(int index) {
+    operands_.erase(operands_.begin() + index);
+  }
+
+  void AppendComputation(HloComputation* computation) {
+    called_computations_.push_back(computation);
+  }
+
+  void DetachFrom(HloInstruction* usee) { usee->RemoveUser(this); }
+
  private:
+  // Implementation for non-common logic of CloneWithNewOperands.
+  virtual std::unique_ptr<HloInstruction> CloneWithNewOperandsImpl(
+      const Shape& shape,
+      tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+      HloCloneContext* context) const {
+    // TODO(b/80131774): This should be pure virtual.
+    LOG(FATAL) << "Unimplemented method.";
+  }
+
+  // Implementation for non-common logic of ExtraAttributesToString.
+  virtual std::vector<string> ExtraAttributesToStringImpl(
+      const HloPrintOptions& options) const {
+    return {};
+  }
+
+  // Implementation for IsElementwise if operand_idx is nullopt and for
+  // IsElementwiseOnOperand if otherwise.
+  //
+  // NOTE: For all instructions other than kFusion, being elementwise on one of
+  // the operands is equivalent to being elementwise on all the operands.
+  virtual bool IsElementwiseImpl(
+      const tensorflow::gtl::optional<int64>& operand_idx) const;
   // Prints an instruction to a string.
   //
   // The canonical string representation needs to name operands and instruction
@@ -1489,7 +1499,7 @@ class HloInstruction {
       CanonicalNameMap* canonical_name_map) const;
 
   // Prints an operand to a string.
-  string OperandsToStringWithCanonicalNameMap(
+  virtual string OperandsToStringWithCanonicalNameMap(
       const HloPrintOptions& options,
       CanonicalNameMap* canonical_name_map) const;
 
@@ -1497,13 +1507,8 @@ class HloInstruction {
   // OperandsToStringWithCanonicalNameMap() functions.
   friend class HloComputation;
 
-  enum class UseKind { kNoUse, kReuse, kUsePermutingElements, kUse };
-
-  // Helper class for computing OperandElementUse for kFusion.
-  class FusionReusesParamElements;
-
   // See comments on Identical().
-  bool IdenticalSlowPath(
+  virtual bool IdenticalSlowPath(
       const HloInstruction& other,
       const std::function<bool(const HloComputation*, const HloComputation*)>&
           eq_computations) const;
@@ -1513,55 +1518,19 @@ class HloInstruction {
       const Shape& shape, HloOpcode opcode,
       tensorflow::gtl::ArraySlice<HloInstruction*> operands);
 
-  // Appends operand to the list of operands and adds this instruction as a user
-  // of the operand.
-  void AppendOperand(HloInstruction* operand);
-
   // Adds a user for this instruction.
   void AddUser(HloInstruction* user);
 
   // Removes a user for this instruction.
   void RemoveUser(HloInstruction* user);
 
-  // Internal constructor for a given opcode/shape, other fields must be filled
-  // by factory methods.
-  HloInstruction(HloOpcode opcode, const Shape& shape);
-
-  // Fuses the given instruction into this fusion instruction. When add_output
-  // is false (which is the default), instruction_to_fuse is cloned and the
-  // clone is placed in the fusion instruction. instruction_to_fuse is
-  // unchanged.
-  //
-  // When add_output is true, a clone of the instruction_to_fuse will be part
-  // of the output of fusion instructions. The users of instruction_to_fuse
-  // will be redirected to this fusion instructions. instruction_to_fuse will
-  // be removed from its parent computation.
-  //
-  // Precondition: this->opcode() == HloOpcode::kFusion
-  HloInstruction* FuseInstructionInternal(HloInstruction* instruction_to_fuse,
-                                          bool add_output = false);
-
-  // Clones the given instruction_to_fuse and insert the clone into this fusion
-  // instruction. If add_output is true, a clone of instruction_to_fuse will
-  // be in the output of the this fusion instruction (part of the tuple of the
-  // fusion root).
-  //
-  // Precondition: opcode() == HloOpcode::kFusion
-  HloInstruction* CloneAndFuseInternal(HloInstruction* instruction_to_fuse,
-                                       bool add_output = false);
-
-  // Clones a fusion instruction with a new shape and operands.
-  std::unique_ptr<HloInstruction> CloneFusionWithNewOperands(
-      const Shape& shape, tensorflow::gtl::ArraySlice<HloInstruction*> operands,
-      HloModule* module = nullptr) const;
-
-  // Returns true if this instruction can legally have the dimensions field
-  // set. Used for checking precondition of dimensions field accessors.
-  bool CanHaveDimensionsField() const;
-
   // Returns how this instruction uses elements of its `i`th operand.
   UseKind OperandElementUse(int64 i) const;
 
+  // Helper for implementing backend_config().  Parses backend_config_ into the
+  // given proto.
+  Status GetBackendConfigInternal(tensorflow::protobuf::Message* proto) const;
+
   int unique_id_;  // Unique to this HloInstruction within a HloModule
 
   // Opcode for this instruction.
@@ -1592,16 +1561,6 @@ class HloInstruction {
   // Result shape of this instruction.
   Shape shape_;
 
-  // Literal, only present for kConstant.
-  std::unique_ptr<Literal> literal_;
-
-  // Constant index, only present for kGetTupleElement.
-  int64 tuple_index_ = -1;
-
-  // Dimensions present for some operations that require reshaping or
-  // broadcasting, including Reshape, Reduce, ReduceWindow, and Reverse.
-  std::vector<int64> dimensions_;
-
   // Describes the window in a windowed operation such as convolution.
   std::unique_ptr<Window> window_;
 
@@ -1614,24 +1573,6 @@ class HloInstruction {
   std::unique_ptr<GatherDimensionNumbers> gather_dimension_numbers_;
   std::vector<int64> gather_window_bounds_;
 
-  // Describes FFT type for an FFT instruction.
-  FftType fft_type_ = FftType::FFT;
-
-  // Indicates the FFT length for an FFT instruction.
-  std::vector<int64> fft_length_;
-
-  // Describes the [begin, end) index range for a slice.
-  std::vector<int64> slice_starts_;
-  std::vector<int64> slice_limits_;
-  std::vector<int64> slice_strides_;
-
-  // Describes whether the slice can be lowered to an offset into the operand.
-  bool is_in_place_slice_ = false;
-
-  // The bit sizes for a reduce-precision operation.
-  int32 exponent_bits_ = 0;
-  int32 mantissa_bits_ = 0;
-
   // Describes the [start, start + size) range size for a dynamic slice
   // ('start' is specified dynamically in the second operand of the operation).
   std::vector<int64> dynamic_slice_sizes_;
@@ -1640,14 +1581,12 @@ class HloInstruction {
   // padding of this pad instruction. Only set for pad instructions.
   std::unique_ptr<PaddingConfig> padding_config_;
 
-  // The type of the fusion. Used by kFusion only.
-  FusionKind fusion_kind_;
-
   // The sharding, if one exists.
   std::unique_ptr<HloSharding> sharding_;
 
-  // For parameter instructions this field holds the parameter number.
-  int64 parameter_number_ = 0;
+  // Fields used by the kDomain instruction.
+  std::unique_ptr<DomainMetadata> operand_side_metadata_;
+  std::unique_ptr<DomainMetadata> user_side_metadata_;
 
   // Name of a global symbol to call, only present for kCustomCall.
   string custom_call_target_;
@@ -1686,22 +1625,6 @@ class HloInstruction {
   // an operand.
   HloInstruction* trace_instruction_ = nullptr;
 
-  // The distribution requested for random number generation.
-  // Only present for kRng.
-  RandomDistribution distribution_;
-
-  // A small float number added to the variance to avoid divide-by-zero error.
-  // Only present for kBatchNormTraining.
-  float epsilon_ = 0.0f;
-
-  // An integer value representing the index of the feature dimension.
-  // Only present for kBatchNormTraining.
-  int64 feature_index_ = -1;
-
-  // Represents a unique identifier for each Send/Recv instruction pair.
-  // Only present for kSend or kRecv.
-  int64 channel_id_ = -1;
-
   // The string representation of the infeed configuration.
   string infeed_config_;
 
@@ -1709,6 +1632,12 @@ class HloInstruction {
   // HLO. See the documentation on backend_config().
   string backend_config_;
 
+  // The group id of each replica for CrossReplicaSum.
+  std::vector<int64> replica_group_ids_;
+
+  // The string representation of the barrier config used for CrossReplicaSum.
+  string cross_replica_sum_barrier_;
+
   // String identifier for instruction.
   string name_;
 
@@ -1731,6 +1660,9 @@ StatusOr<HloInstruction::FusionKind> StringToFusionKind(
 string PaddingConfigToString(const PaddingConfig& padding);
 string OpMetadataToString(const OpMetadata& metadata);
 string RandomDistributionToString(const RandomDistribution& distribution);
+string ConvolutionDimensionNumbersToString(
+    const ConvolutionDimensionNumbers& dnums);
+
 StatusOr<RandomDistribution> StringToRandomDistribution(const string& name);
 
 std::ostream& operator<<(std::ostream& os, HloInstruction::FusionKind kind);
diff --git a/tensorflow/compiler/xla/service/hlo_instruction_test.cc b/tensorflow/compiler/xla/service/hlo_instruction_test.cc
index a61c472c72804b077d21274d2e866a69c5e73157..5d6f8b931f0c665fba03e1c845214fa83aabf12e 100644
--- a/tensorflow/compiler/xla/service/hlo_instruction_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_instruction_test.cc
@@ -24,11 +24,13 @@ limitations under the License.
 #include "tensorflow/compiler/xla/protobuf_util.h"
 #include "tensorflow/compiler/xla/service/dfs_hlo_visitor_with_default.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/test.h"
 #include "tensorflow/compiler/xla/test_helpers.h"
 #include "tensorflow/compiler/xla/tests/hlo_test_base.h"
 #include "tensorflow/compiler/xla/util.h"
+#include "tensorflow/compiler/xla/window_util.h"
 
 namespace xla {
 namespace {
@@ -340,7 +342,7 @@ TEST_F(HloInstructionTest, TrivialMap) {
   // Builds a parameter and feeds it to the map.
   HloComputation::Builder builder(TestName());
   auto param0 = builder.AddInstruction(
-      HloInstruction::CreateParameter(0, f32a100x10, ""));
+      HloInstruction::CreateParameter(0, f32a100x10, "p"));
   auto map = builder.AddInstruction(
       HloInstruction::CreateMap(f32a100x10, {param0}, add_f32));
   module->AddEntryComputation(builder.Build());
@@ -379,7 +381,7 @@ TEST_F(HloInstructionTest, TrivialReduce) {
   // Builds a parameter and an initial value and feeds them to the reduce.
   HloComputation::Builder builder(TestName());
   auto param0 = builder.AddInstruction(
-      HloInstruction::CreateParameter(0, f32a100x10, ""));
+      HloInstruction::CreateParameter(0, f32a100x10, "p"));
   auto const0 = builder.AddInstruction(
       HloInstruction::CreateConstant(Literal::CreateR0<float>(0.0f)));
   builder.AddInstruction(
@@ -978,6 +980,23 @@ TEST_F(HloInstructionTest, FullyElementwise) {
   }
 }
 
+TEST_F(HloInstructionTest, MapIsElementwise) {
+  auto module = CreateNewModule();
+  const Shape r2f32 = ShapeUtil::MakeShapeWithLayout(F32, {10, 10}, {1, 0});
+  HloComputation::Builder builder(TestName());
+  HloComputation::Builder map_builder("id");
+  map_builder.AddInstruction(
+      HloInstruction::CreateParameter(0, ShapeUtil::MakeShape(F32, {}), "p0"));
+  auto map_computation = module->AddEmbeddedComputation(map_builder.Build());
+  auto x =
+      builder.AddInstruction(HloInstruction::CreateParameter(0, r2f32, "x"));
+  auto map = builder.AddInstruction(
+      HloInstruction::CreateMap(r2f32, {x}, map_computation));
+  module->AddEntryComputation(builder.Build());
+
+  EXPECT_TRUE(map->IsElementwise());
+}
+
 TEST_F(HloInstructionTest, PartiallyElementwise) {
   const Shape r1f32 = ShapeUtil::MakeShape(F32, {5});
   const Shape r2f32 = ShapeUtil::MakeShape(F32, {3, 5});
@@ -1494,5 +1513,117 @@ TEST_F(HloInstructionTest, CanonnicalStringificationConditional) {
 })");
 }
 
+TEST_F(HloInstructionTest, CheckDeepClone) {
+  const char* const hlo_string = R"(
+HloModule Module
+
+addy (lhs: s32[], rhs: s32[]) -> s32[] {
+  lhs = s32[] parameter(0)
+  rhs = s32[] parameter(1)
+  ROOT zadd = s32[] add(lhs, rhs)
+}
+
+calla (x: s32[]) -> s32[] {
+  x = s32[] parameter(0)
+  reduce = s32[] reduce-window(x, x), to_apply=addy
+  ROOT xadd = s32[] add(x, reduce)
+}
+
+body (bparam: s32[]) -> s32[] {
+  constant = s32[] constant(1)
+  bparam = s32[] parameter(0)
+  v = s32[] call(bparam), to_apply=calla
+  ROOT add = s32[] add(constant, bparam)
+}
+
+condition (cparam: s32[]) -> pred[] {
+  xconstant = s32[] constant(5)
+  cparam = s32[] parameter(0)
+  ROOT greater-than = pred[] greater-than(xconstant, cparam)
+}
+
+ENTRY entry (param: s32[]) -> s32[] {
+  eparam = s32[] parameter(0)
+  ROOT while = s32[] while(eparam), condition=condition, body=body
+ }
+)";
+  // Check that deep clones really deep clones every instruction and
+  // computations, without leaving dangling pointers to the old module.
+  TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
+                          ParseHloString(hlo_string));
+  std::unique_ptr<HloModule> clone = module->Clone();
+  for (HloComputation* computation : clone->computations()) {
+    EXPECT_EQ(computation->parent(), clone.get());
+    for (HloInstruction* instruction : computation->instructions()) {
+      EXPECT_EQ(instruction->parent()->parent(), clone.get());
+    }
+  }
+}
+
+TEST_F(HloInstructionTest, IdenticalAccountsForBackendConfig) {
+  const Shape shape = ShapeUtil::MakeShape(F32, {42});
+  HloComputation::Builder builder("test");
+  HloInstruction* p =
+      builder.AddInstruction(HloInstruction::CreateParameter(0, shape, "p"));
+
+  HloInstruction* add1 = builder.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kAdd, p, p));
+  HloInstruction* add2 = builder.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kAdd, p, p));
+
+  EXPECT_TRUE(add1->Identical(*add2));
+  add1->set_raw_backend_config_string("abc");
+  EXPECT_FALSE(add1->Identical(*add2));
+}
+
+TEST_F(HloInstructionTest, IdenticalAccountsForCustomCallWindow) {
+  auto instr1 = HloInstruction::CreateCustomCall(ShapeUtil::MakeShape(F32, {}),
+                                                 /*operands=*/{},
+                                                 /*custom_call_target=*/"foo");
+  auto instr2 = instr1->Clone();
+  EXPECT_TRUE(instr1->Identical(*instr2));
+
+  Window w = window_util::MakeWindow({1, 2, 3});
+  instr1->set_window(w);
+  EXPECT_FALSE(instr1->Identical(*instr2));
+}
+
+TEST_F(HloInstructionTest, IdenticalAccountsForCustomCallDnums) {
+  auto instr1 = HloInstruction::CreateCustomCall(ShapeUtil::MakeShape(F32, {}),
+                                                 /*operands=*/{},
+                                                 /*custom_call_target=*/"foo");
+  auto instr2 = instr1->Clone();
+  EXPECT_TRUE(instr1->Identical(*instr2));
+
+  ConvolutionDimensionNumbers dnums;
+  dnums.set_output_batch_dimension(42);
+  instr1->set_convolution_dimension_numbers(dnums);
+  EXPECT_FALSE(instr1->Identical(*instr2));
+}
+
+TEST_F(HloInstructionTest, CloneWindowOnCustomCall) {
+  auto instr = HloInstruction::CreateCustomCall(ShapeUtil::MakeShape(F32, {}),
+                                                /*operands=*/{},
+                                                /*custom_call_target=*/"foo");
+  Window w = window_util::MakeWindow({1, 2, 3});
+  instr->set_window(w);
+  auto clone = instr->Clone();
+  EXPECT_TRUE(protobuf_util::ProtobufEquals(clone->window(), w))
+      << clone->window().DebugString();
+}
+
+TEST_F(HloInstructionTest, CloneDnumsOnCustomCall) {
+  auto instr = HloInstruction::CreateCustomCall(ShapeUtil::MakeShape(F32, {}),
+                                                /*operands=*/{},
+                                                /*custom_call_target=*/"foo");
+  ConvolutionDimensionNumbers dnums;
+  dnums.set_output_batch_dimension(42);
+  instr->set_convolution_dimension_numbers(dnums);
+  auto clone = instr->Clone();
+  EXPECT_TRUE(protobuf_util::ProtobufEquals(
+      clone->convolution_dimension_numbers(), dnums))
+      << clone->convolution_dimension_numbers().DebugString();
+}
+
 }  // namespace
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/hlo_instructions.cc b/tensorflow/compiler/xla/service/hlo_instructions.cc
new file mode 100644
index 0000000000000000000000000000000000000000..d326d5d009fa3a378cfe92e1b081445f09d982e1
--- /dev/null
+++ b/tensorflow/compiler/xla/service/hlo_instructions.cc
@@ -0,0 +1,1287 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/service/hlo_instructions.h"
+
+#include <deque>
+
+#include "tensorflow/compiler/xla/service/hlo_casting_utils.h"
+#include "tensorflow/compiler/xla/service/hlo_computation.h"
+#include "tensorflow/compiler/xla/service/hlo_module.h"
+
+namespace xla {
+namespace {
+
+using ::tensorflow::str_util::Join;
+using ::tensorflow::strings::StrAppend;
+using ::tensorflow::strings::StrCat;
+
+bool IsInstructionElementwiseOnOperand(const HloInstruction* instruction,
+                                       const HloInstruction* operand) {
+  std::vector<int64> operand_indices = instruction->OperandIndices(operand);
+  return std::all_of(
+      operand_indices.begin(), operand_indices.end(),
+      [instruction](int64 operand_index) {
+        return instruction->IsElementwiseOnOperand(operand_index);
+      });
+}
+}  // namespace
+
+HloBatchNormInstruction::HloBatchNormInstruction(
+    HloOpcode opcode, const Shape& shape, HloInstruction* operand,
+    HloInstruction* scale, float epsilon, int64 feature_index)
+    : HloInstruction(opcode, shape),
+      epsilon_(epsilon),
+      feature_index_(feature_index) {
+  AppendOperand(operand);
+  AppendOperand(scale);
+}
+
+bool HloBatchNormInstruction::IdenticalSlowPath(
+    const HloInstruction& other,
+    const std::function<bool(const HloComputation*, const HloComputation*)>&
+        eq_computations) const {
+  const auto& casted_other = static_cast<const HloBatchNormInstruction&>(other);
+  return feature_index() == casted_other.feature_index() &&
+         epsilon() == casted_other.epsilon();
+}
+
+HloInstructionProto HloBatchNormInstruction::ToProto() const {
+  HloInstructionProto proto = HloInstruction::ToProto();
+  proto.set_epsilon(epsilon_);
+  proto.set_feature_index(feature_index_);
+  return proto;
+}
+
+std::vector<string> HloBatchNormInstruction::ExtraAttributesToStringImpl(
+    const HloPrintOptions& options) const {
+  return {StrCat("epsilon=", epsilon()),
+          StrCat("feature_index=", feature_index())};
+}
+
+HloBatchNormTrainingInstruction::HloBatchNormTrainingInstruction(
+    const Shape& shape, HloInstruction* operand, HloInstruction* scale,
+    HloInstruction* offset, float epsilon, int64 feature_index)
+    : HloBatchNormInstruction(HloOpcode::kBatchNormTraining, shape, operand,
+                              scale, epsilon, feature_index) {
+  AppendOperand(offset);
+}
+
+std::unique_ptr<HloInstruction>
+HloBatchNormTrainingInstruction::CloneWithNewOperandsImpl(
+    const Shape& shape,
+    tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+    HloCloneContext* context) const {
+  CHECK_EQ(new_operands.size(), 3);
+  return MakeUnique<HloBatchNormTrainingInstruction>(
+      shape, new_operands[0], new_operands[1], new_operands[2], epsilon(),
+      feature_index());
+}
+
+HloBatchNormInferenceInstruction::HloBatchNormInferenceInstruction(
+    const Shape& shape, HloInstruction* operand, HloInstruction* scale,
+    HloInstruction* offset, HloInstruction* mean, HloInstruction* variance,
+    float epsilon, int64 feature_index)
+    : HloBatchNormInstruction(HloOpcode::kBatchNormInference, shape, operand,
+                              scale, epsilon, feature_index) {
+  AppendOperand(offset);
+  AppendOperand(mean);
+  AppendOperand(variance);
+}
+
+std::unique_ptr<HloInstruction>
+HloBatchNormInferenceInstruction::CloneWithNewOperandsImpl(
+    const Shape& shape,
+    tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+    HloCloneContext* context) const {
+  CHECK_EQ(new_operands.size(), 5);
+  return MakeUnique<HloBatchNormInferenceInstruction>(
+      shape, new_operands[0], new_operands[1], new_operands[2], new_operands[3],
+      new_operands[4], epsilon(), feature_index());
+}
+
+HloBatchNormGradInstruction::HloBatchNormGradInstruction(
+    const Shape& shape, HloInstruction* operand, HloInstruction* scale,
+    HloInstruction* mean, HloInstruction* variance, HloInstruction* grad_output,
+    float epsilon, int64 feature_index)
+    : HloBatchNormInstruction(HloOpcode::kBatchNormGrad, shape, operand, scale,
+                              epsilon, feature_index) {
+  AppendOperand(mean);
+  AppendOperand(variance);
+  AppendOperand(grad_output);
+}
+
+std::unique_ptr<HloInstruction>
+HloBatchNormGradInstruction::CloneWithNewOperandsImpl(
+    const Shape& shape,
+    tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+    HloCloneContext* context) const {
+  CHECK_EQ(new_operands.size(), 5);
+  return MakeUnique<HloBatchNormGradInstruction>(
+      shape, new_operands[0], new_operands[1], new_operands[2], new_operands[3],
+      new_operands[4], epsilon(), feature_index());
+}
+
+HloFftInstruction::HloFftInstruction(
+    const Shape& shape, HloInstruction* operand, FftType fft_type,
+    tensorflow::gtl::ArraySlice<int64> fft_length)
+    : HloInstruction(HloOpcode::kFft, shape), fft_type_(fft_type) {
+  fft_length_.assign(fft_length.begin(), fft_length.end());
+  AppendOperand(operand);
+}
+
+HloInstructionProto HloFftInstruction::ToProto() const {
+  HloInstructionProto proto = HloInstruction::ToProto();
+  proto.set_fft_type(fft_type_);
+  for (int64 fft_len : fft_length_) {
+    proto.add_fft_length(fft_len);
+  }
+  return proto;
+}
+
+std::vector<string> HloFftInstruction::ExtraAttributesToStringImpl(
+    const HloPrintOptions& options) const {
+  return {StrCat("fft_type=", FftType_Name(fft_type())),
+          StrCat("fft_length={", Join(fft_length(), ","), "}")};
+}
+
+bool HloFftInstruction::IdenticalSlowPath(
+    const HloInstruction& other,
+    const std::function<bool(const HloComputation*, const HloComputation*)>&
+        eq_computations) const {
+  const auto& casted_other = static_cast<const HloFftInstruction&>(other);
+  return fft_type() == casted_other.fft_type() &&
+         fft_length() == casted_other.fft_length();
+}
+
+std::unique_ptr<HloInstruction> HloFftInstruction::CloneWithNewOperandsImpl(
+    const Shape& shape,
+    tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+    HloCloneContext* context) const {
+  CHECK_EQ(new_operands.size(), 1);
+  return MakeUnique<HloFftInstruction>(shape, new_operands[0], fft_type_,
+                                       fft_length_);
+}
+
+HloSendRecvInstruction::HloSendRecvInstruction(HloOpcode opcode,
+                                               const Shape& shape,
+                                               int64 channel_id)
+    : HloInstruction(opcode, shape), channel_id_(channel_id) {}
+
+HloInstructionProto HloSendRecvInstruction::ToProto() const {
+  HloInstructionProto proto = HloInstruction::ToProto();
+  proto.set_channel_id(channel_id_);
+  return proto;
+}
+
+std::vector<string> HloSendRecvInstruction::ExtraAttributesToStringImpl(
+    const HloPrintOptions& options) const {
+  return {StrCat("channel_id=", channel_id_)};
+}
+
+bool HloSendRecvInstruction::IdenticalSlowPath(
+    const HloInstruction& other,
+    const std::function<bool(const HloComputation*, const HloComputation*)>&
+        eq_computations) const {
+  // Not yet supported.
+  return false;
+}
+
+// Send instruction produces a tuple of {aliased operand, U32 context}.
+HloSendInstruction::HloSendInstruction(HloInstruction* operand,
+                                       int64 channel_id)
+    : HloSendRecvInstruction(
+          HloOpcode::kSend,
+          ShapeUtil::MakeTupleShape(
+              {CHECK_NOTNULL(operand)->shape(), ShapeUtil::MakeShape(U32, {})}),
+          channel_id) {
+  AppendOperand(operand);
+}
+
+std::unique_ptr<HloInstruction> HloSendInstruction::CloneWithNewOperandsImpl(
+    const Shape& shape,
+    tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+    HloCloneContext* context) const {
+  CHECK_EQ(new_operands.size(), 1);
+  return MakeUnique<HloSendInstruction>(new_operands[0], channel_id());
+}
+
+HloSendDoneInstruction::HloSendDoneInstruction(HloSendInstruction* operand)
+    : HloSendRecvInstruction(HloOpcode::kSendDone, ShapeUtil::MakeNil(),
+                             CHECK_NOTNULL(operand)->channel_id()) {
+  AppendOperand(operand);
+}
+
+std::unique_ptr<HloInstruction>
+HloSendDoneInstruction::CloneWithNewOperandsImpl(
+    const Shape& shape,
+    tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+    HloCloneContext* context) const {
+  CHECK_EQ(new_operands.size(), 1);
+  return MakeUnique<HloSendDoneInstruction>(
+      Cast<HloSendInstruction>(new_operands[0]));
+}
+
+// Recv instruction produces a tuple of {receive buffer, U32 context}.
+HloRecvInstruction::HloRecvInstruction(const Shape& shape, int64 channel_id)
+    : HloSendRecvInstruction(
+          HloOpcode::kRecv,
+          ShapeUtil::MakeTupleShape({shape, ShapeUtil::MakeShape(U32, {})}),
+          channel_id) {}
+
+std::unique_ptr<HloInstruction> HloRecvInstruction::CloneWithNewOperandsImpl(
+    const Shape& shape,
+    tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+    HloCloneContext* context) const {
+  CHECK_EQ(new_operands.size(), 0);
+  return MakeUnique<HloRecvInstruction>(
+      ShapeUtil::GetTupleElementShape(shape, 0), channel_id());
+}
+
+HloRecvDoneInstruction::HloRecvDoneInstruction(HloRecvInstruction* operand)
+    : HloSendRecvInstruction(
+          HloOpcode::kRecvDone,
+          ShapeUtil::GetTupleElementShape(operand->shape(), 0),
+          CHECK_NOTNULL(operand)->channel_id()) {
+  AppendOperand(operand);
+}
+
+std::unique_ptr<HloInstruction>
+HloRecvDoneInstruction::CloneWithNewOperandsImpl(
+    const Shape& shape,
+    tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+    HloCloneContext* context) const {
+  CHECK_EQ(new_operands.size(), 1);
+  return MakeUnique<HloRecvDoneInstruction>(
+      Cast<HloRecvInstruction>(new_operands[0]));
+}
+
+HloReverseInstruction::HloReverseInstruction(
+    const Shape& shape, HloInstruction* operand,
+    tensorflow::gtl::ArraySlice<int64> dimensions)
+    : HloInstruction(HloOpcode::kReverse, shape),
+      dimensions_(dimensions.begin(), dimensions.end()) {
+  AppendOperand(operand);
+}
+
+HloInstructionProto HloReverseInstruction::ToProto() const {
+  HloInstructionProto proto = HloInstruction::ToProto();
+  for (int64 dimension : dimensions_) {
+    proto.add_dimensions(dimension);
+  }
+  return proto;
+}
+
+std::vector<string> HloReverseInstruction::ExtraAttributesToStringImpl(
+    const HloPrintOptions& options) const {
+  return {StrCat("dimensions={", Join(dimensions(), ","), "}")};
+}
+
+bool HloReverseInstruction::IdenticalSlowPath(
+    const HloInstruction& other,
+    const std::function<bool(const HloComputation*, const HloComputation*)>&
+        eq_computations) const {
+  const auto& casted_other = static_cast<const HloReverseInstruction&>(other);
+  return dimensions() == casted_other.dimensions();
+}
+
+std::unique_ptr<HloInstruction> HloReverseInstruction::CloneWithNewOperandsImpl(
+    const Shape& shape,
+    tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+    HloCloneContext* context) const {
+  CHECK_EQ(new_operands.size(), 1);
+  return MakeUnique<HloReverseInstruction>(shape, new_operands[0],
+                                           dimensions());
+}
+
+HloConcatenateInstruction::HloConcatenateInstruction(
+    const Shape& shape, tensorflow::gtl::ArraySlice<HloInstruction*> operands,
+    int64 dimension)
+    : HloInstruction(HloOpcode::kConcatenate, shape), dimensions_({dimension}) {
+  for (auto operand : operands) {
+    AppendOperand(operand);
+  }
+}
+
+HloInstructionProto HloConcatenateInstruction::ToProto() const {
+  HloInstructionProto proto = HloInstruction::ToProto();
+  for (int64 dimension : dimensions_) {
+    proto.add_dimensions(dimension);
+  }
+  return proto;
+}
+
+std::vector<string> HloConcatenateInstruction::ExtraAttributesToStringImpl(
+    const HloPrintOptions& options) const {
+  return {StrCat("dimensions={", Join(dimensions(), ","), "}")};
+}
+
+bool HloConcatenateInstruction::IdenticalSlowPath(
+    const HloInstruction& other,
+    const std::function<bool(const HloComputation*, const HloComputation*)>&
+        eq_computations) const {
+  const auto& casted_other =
+      static_cast<const HloConcatenateInstruction&>(other);
+  return dimensions() == casted_other.dimensions();
+}
+
+std::unique_ptr<HloInstruction>
+HloConcatenateInstruction::CloneWithNewOperandsImpl(
+    const Shape& shape,
+    tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+    HloCloneContext* context) const {
+  return MakeUnique<HloConcatenateInstruction>(shape, new_operands,
+                                               dimensions(0));
+}
+
+HloReduceInstruction::HloReduceInstruction(
+    const Shape& shape, HloInstruction* arg, HloInstruction* init_value,
+    tensorflow::gtl::ArraySlice<int64> dimensions_to_reduce,
+    HloComputation* reduce_computation)
+    : HloInstruction(HloOpcode::kReduce, shape),
+      dimensions_(dimensions_to_reduce.begin(), dimensions_to_reduce.end()) {
+  AppendOperand(arg);
+  AppendOperand(init_value);
+  AppendComputation(reduce_computation);
+}
+
+HloInstructionProto HloReduceInstruction::ToProto() const {
+  HloInstructionProto proto = HloInstruction::ToProto();
+  for (int64 dimension : dimensions_) {
+    proto.add_dimensions(dimension);
+  }
+  return proto;
+}
+
+std::vector<string> HloReduceInstruction::ExtraAttributesToStringImpl(
+    const HloPrintOptions& options) const {
+  return {StrCat("dimensions={", Join(dimensions(), ","), "}")};
+}
+
+bool HloReduceInstruction::IdenticalSlowPath(
+    const HloInstruction& other,
+    const std::function<bool(const HloComputation*, const HloComputation*)>&
+        eq_computations) const {
+  const auto& casted_other = static_cast<const HloReduceInstruction&>(other);
+  // Reduction results are determined by the reduction dimension and the
+  // reduction computation.
+  return dimensions() == casted_other.dimensions() &&
+         eq_computations(to_apply(), casted_other.to_apply());
+}
+
+std::unique_ptr<HloInstruction> HloReduceInstruction::CloneWithNewOperandsImpl(
+    const Shape& shape,
+    tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+    HloCloneContext* context) const {
+  CHECK_EQ(new_operands.size(), 2);
+  return MakeUnique<HloReduceInstruction>(
+      shape, new_operands[0], new_operands[1], dimensions(), to_apply());
+}
+
+HloTransposeInstruction::HloTransposeInstruction(
+    const Shape& shape, HloInstruction* operand,
+    tensorflow::gtl::ArraySlice<int64> dimensions)
+    : HloInstruction(HloOpcode::kTranspose, shape),
+      dimensions_(dimensions.begin(), dimensions.end()) {
+  CHECK_EQ(shape.dimensions().size(), dimensions.size());
+  CHECK_EQ(shape.dimensions().size(), operand->shape().dimensions().size());
+  CHECK(std::equal(operand->shape().dimensions().begin(),
+                   operand->shape().dimensions().end(),
+                   Permute(dimensions, shape.dimensions()).begin()))
+      << "shape: " << ShapeUtil::HumanString(shape)
+      << ", operand->shape(): " << ShapeUtil::HumanString(shape)
+      << ", dimensions: {" << Join(dimensions, ", ") << "}";
+  AppendOperand(operand);
+}
+
+bool HloTransposeInstruction::IsRank2Transpose() const {
+  return dimensions() == std::vector<int64>({1, 0}) &&
+         shape().dimensions_size() == 2 &&
+         std::equal(shape().dimensions().begin(), shape().dimensions().end(),
+                    operand(0)->shape().dimensions().rbegin());
+}
+
+HloInstructionProto HloTransposeInstruction::ToProto() const {
+  HloInstructionProto proto = HloInstruction::ToProto();
+  for (int64 dimension : dimensions_) {
+    proto.add_dimensions(dimension);
+  }
+  return proto;
+}
+
+std::vector<string> HloTransposeInstruction::ExtraAttributesToStringImpl(
+    const HloPrintOptions& options) const {
+  return {StrCat("dimensions={", Join(dimensions(), ","), "}")};
+}
+
+bool HloTransposeInstruction::IdenticalSlowPath(
+    const HloInstruction& other,
+    const std::function<bool(const HloComputation*, const HloComputation*)>&
+        eq_computations) const {
+  const auto& casted_other = static_cast<const HloTransposeInstruction&>(other);
+  return dimensions() == casted_other.dimensions();
+}
+
+std::unique_ptr<HloInstruction>
+HloTransposeInstruction::CloneWithNewOperandsImpl(
+    const Shape& shape,
+    tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+    HloCloneContext* context) const {
+  CHECK_EQ(new_operands.size(), 1);
+  return MakeUnique<HloTransposeInstruction>(shape, new_operands[0],
+                                             dimensions());
+}
+
+HloBroadcastInstruction::HloBroadcastInstruction(
+    const Shape& shape, HloInstruction* operand,
+    tensorflow::gtl::ArraySlice<int64> broadcast_dimension)
+    : HloInstruction(HloOpcode::kBroadcast, shape),
+      dimensions_(broadcast_dimension.begin(), broadcast_dimension.end()) {
+  AppendOperand(operand);
+}
+
+HloInstructionProto HloBroadcastInstruction::ToProto() const {
+  HloInstructionProto proto = HloInstruction::ToProto();
+  for (int64 dimension : dimensions_) {
+    proto.add_dimensions(dimension);
+  }
+  return proto;
+}
+
+std::vector<string> HloBroadcastInstruction::ExtraAttributesToStringImpl(
+    const HloPrintOptions& options) const {
+  return {StrCat("dimensions={", Join(dimensions(), ","), "}")};
+}
+
+bool HloBroadcastInstruction::IdenticalSlowPath(
+    const HloInstruction& other,
+    const std::function<bool(const HloComputation*, const HloComputation*)>&
+        eq_computations) const {
+  const auto& casted_other = static_cast<const HloBroadcastInstruction&>(other);
+  return dimensions() == casted_other.dimensions();
+}
+
+std::unique_ptr<HloInstruction>
+HloBroadcastInstruction::CloneWithNewOperandsImpl(
+    const Shape& shape,
+    tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+    HloCloneContext* context) const {
+  CHECK_EQ(new_operands.size(), 1);
+  return MakeUnique<HloBroadcastInstruction>(shape, new_operands[0],
+                                             dimensions());
+}
+
+HloMapInstruction::HloMapInstruction(
+    const Shape& shape, tensorflow::gtl::ArraySlice<HloInstruction*> operands,
+    HloComputation* map_computation,
+    tensorflow::gtl::ArraySlice<HloInstruction*> static_operands)
+    : HloInstruction(HloOpcode::kMap, shape) {
+  CHECK(static_operands.empty()) << "static_operands not yet supported";
+  for (auto operand : operands) {
+    AppendOperand(operand);
+  }
+  AppendComputation(map_computation);
+  // TODO(b/65689298) Remove code below once Map is generalized to accept
+  // arbitrary map dimensions.
+  dimensions_.resize(ShapeUtil::Rank(shape));
+  std::iota(dimensions_.begin(), dimensions_.end(), 0);
+}
+
+HloInstructionProto HloMapInstruction::ToProto() const {
+  HloInstructionProto proto = HloInstruction::ToProto();
+  for (int64 dimension : dimensions_) {
+    proto.add_dimensions(dimension);
+  }
+  return proto;
+}
+
+bool HloMapInstruction::IsElementwiseImpl(
+    const tensorflow::gtl::optional<int64>& operand_idx) const {
+  if (!dimensions().empty()) {
+    // Check that the map is executed in elementwise compatible dimensions.
+    if (dimensions().size() != shape().dimensions_size()) {
+      return false;
+    }
+    for (int i = 0; i < dimensions().size(); ++i) {
+      if (dimensions()[i] != i) {
+        return false;
+      }
+    }
+  }
+  return true;
+}
+
+std::vector<string> HloMapInstruction::ExtraAttributesToStringImpl(
+    const HloPrintOptions& options) const {
+  return {StrCat("dimensions={", Join(dimensions(), ","), "}")};
+}
+
+bool HloMapInstruction::IdenticalSlowPath(
+    const HloInstruction& other,
+    const std::function<bool(const HloComputation*, const HloComputation*)>&
+        eq_computations) const {
+  return eq_computations(to_apply(), other.to_apply());
+}
+
+std::unique_ptr<HloInstruction> HloMapInstruction::CloneWithNewOperandsImpl(
+    const Shape& shape,
+    tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+    HloCloneContext* context) const {
+  return MakeUnique<HloMapInstruction>(shape, new_operands, to_apply());
+}
+
+HloSliceInstruction::HloSliceInstruction(
+    const Shape& shape, HloInstruction* operand,
+    tensorflow::gtl::ArraySlice<int64> start_indices,
+    tensorflow::gtl::ArraySlice<int64> limit_indices,
+    tensorflow::gtl::ArraySlice<int64> strides)
+    : HloInstruction(HloOpcode::kSlice, shape),
+      slice_starts_(start_indices.begin(), start_indices.end()),
+      slice_limits_(limit_indices.begin(), limit_indices.end()),
+      slice_strides_(strides.begin(), strides.end()) {
+  AppendOperand(operand);
+  // For backward compatibility with old serialized computations: if there are
+  // no strides, assume all strides are 1.
+  // TODO(b/63317920): remove this code.
+  if (slice_strides_.empty()) {
+    slice_strides_ = std::vector<int64>(start_indices.size(), 1LL);
+  }
+}
+
+HloInstructionProto HloSliceInstruction::ToProto() const {
+  HloInstructionProto proto = HloInstruction::ToProto();
+  for (int i = 0; i < slice_starts_.size(); ++i) {
+    auto* slice_dimension = proto.add_slice_dimensions();
+    slice_dimension->set_start(slice_starts_[i]);
+    slice_dimension->set_limit(slice_limits_[i]);
+    slice_dimension->set_stride(slice_strides_[i]);
+  }
+  return proto;
+}
+
+std::vector<string> HloSliceInstruction::ExtraAttributesToStringImpl(
+    const HloPrintOptions& options) const {
+  std::vector<string> bounds;
+  bounds.reserve(slice_starts_.size());
+  const bool omit_stride =
+      std::all_of(slice_strides_.begin(), slice_strides_.end(),
+                  [](int64 stride) { return stride == 1; });
+  for (int i = 0; i < slice_starts_.size(); ++i) {
+    string stride_str = omit_stride ? "" : StrCat(":", slice_strides_[i]);
+    bounds.push_back(
+        StrCat("[", slice_starts_[i], ":", slice_limits_[i], stride_str, "]"));
+  }
+  return {StrCat("slice={", Join(bounds, ", "), "}")};
+}
+
+bool HloSliceInstruction::IdenticalSlowPath(
+    const HloInstruction& other,
+    const std::function<bool(const HloComputation*, const HloComputation*)>&
+        eq_computations) const {
+  const auto& other_slice = static_cast<const HloSliceInstruction&>(other);
+  return slice_starts_ == other_slice.slice_starts_ &&
+         slice_limits_ == other_slice.slice_limits_ &&
+         slice_strides_ == other_slice.slice_strides_;
+}
+
+std::unique_ptr<HloInstruction> HloSliceInstruction::CloneWithNewOperandsImpl(
+    const Shape& shape,
+    tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+    HloCloneContext* context) const {
+  CHECK_EQ(new_operands.size(), 1);
+  return MakeUnique<HloSliceInstruction>(shape, new_operands[0], slice_starts_,
+                                         slice_limits_, slice_strides_);
+}
+
+HloConstantInstruction::HloConstantInstruction(std::unique_ptr<Literal> literal)
+    : HloInstruction(HloOpcode::kConstant, CHECK_NOTNULL(literal)->shape()),
+      literal_(std::move(literal)) {}
+
+HloInstructionProto HloConstantInstruction::ToProto() const {
+  HloInstructionProto proto = HloInstruction::ToProto();
+  *proto.mutable_literal() = literal_->ToProto();
+  return proto;
+}
+
+bool HloConstantInstruction::IsElementwiseImpl(
+    const tensorflow::gtl::optional<int64>& operand_idx) const {
+  return true;
+}
+
+void HloConstantInstruction::RelayoutConstant(const Layout& new_layout,
+                                              const ShapeIndex& shape_index) {
+  Shape* mutable_array_subshape =
+      ShapeUtil::GetMutableSubshape(mutable_shape(), shape_index);
+  CHECK(ShapeUtil::IsArray(*mutable_array_subshape));
+
+  // Normally array_subshape will always have a layout, but this invariant is
+  // temporarily broken in LayoutAssignment::AssignLayouts.
+
+  if (!mutable_array_subshape->has_layout() ||
+      !LayoutUtil::Equal(mutable_array_subshape->layout(), new_layout)) {
+    literal_ = literal_->Relayout(new_layout, shape_index);
+    *mutable_array_subshape->mutable_layout() = new_layout;
+  }
+}
+
+bool HloConstantInstruction::IdenticalSlowPath(
+    const HloInstruction& other,
+    const std::function<bool(const HloComputation*, const HloComputation*)>&
+        eq_computations) const {
+  const auto& other_slice = static_cast<const HloSliceInstruction&>(other);
+  return literal() == other_slice.literal();
+}
+
+std::unique_ptr<HloInstruction>
+HloConstantInstruction::CloneWithNewOperandsImpl(
+    const Shape& shape,
+    tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+    HloCloneContext* context) const {
+  return MakeUnique<HloConstantInstruction>(literal_->CloneToUnique());
+}
+
+string HloConstantInstruction::OperandsToStringWithCanonicalNameMap(
+    const HloPrintOptions& options,
+    CanonicalNameMap* canonical_name_map) const {
+  string operands;
+  // For constants, show the actual value in place of an empty operand list.
+  if ((!ShapeUtil::IsTuple(shape()) && ShapeUtil::ElementsIn(shape()) <= 10) ||
+      options.print_large_constants()) {
+    // Literal::ToString emits multidimensional arrays over multiple
+    // lines. Compact this into one line by stripping out white space.
+    string tmp = literal().ToString();
+    std::replace(tmp.begin(), tmp.end(), '\n', ' ');
+    std::vector<string> v = tensorflow::str_util::Split(tmp, ' ');
+    bool first = true;
+    // Concatenate elements in "v" with spaces separating them, but ignoring
+    // empty entries.
+    for (const auto& s : v) {
+      if (s.empty()) {
+        continue;
+      }
+      StrAppend(&operands, (first ? "" : " "), s);
+      first = false;
+    }
+  } else {
+    // Do not show large constants or tuples.
+    operands = "{...}";
+  }
+  return operands;
+}
+
+HloTraceInstruction::HloTraceInstruction(const string& tag,
+                                         HloInstruction* operand)
+    : HloInstruction(HloOpcode::kTrace, ShapeUtil::MakeNil()),
+      literal_(Literal::CreateR1U8(tag)) {
+  AppendOperand(operand);
+  operand->set_tracing(this);
+}
+
+HloInstructionProto HloTraceInstruction::ToProto() const {
+  HloInstructionProto proto = HloInstruction::ToProto();
+  *proto.mutable_literal() = literal_->ToProto();
+  return proto;
+}
+
+bool HloTraceInstruction::IdenticalSlowPath(
+    const HloInstruction& other,
+    const std::function<bool(const HloComputation*, const HloComputation*)>&
+        eq_computations) const {
+  return false;
+}
+
+std::unique_ptr<HloInstruction> HloTraceInstruction::CloneWithNewOperandsImpl(
+    const Shape& shape,
+    tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+    HloCloneContext* context) const {
+  LOG(FATAL) << "Not yet implemented, clone: " << HloOpcodeString(opcode());
+}
+
+HloFusionInstruction::HloFusionInstruction(const Shape& shape,
+                                           FusionKind fusion_kind,
+                                           HloInstruction* fused_root)
+    : HloInstruction(HloOpcode::kFusion, shape), fusion_kind_(fusion_kind) {
+  CHECK(fused_root != nullptr);
+  SetAndSanitizeName("fusion");
+  set_parent(fused_root->parent());
+  set_metadata(fused_root->metadata());
+  CloneAndFuseInternal(fused_root);
+}
+
+HloFusionInstruction::HloFusionInstruction(
+    const Shape& shape, FusionKind fusion_kind,
+    tensorflow::gtl::ArraySlice<HloInstruction*> operands,
+    HloComputation* fusion_computation)
+    : HloInstruction(HloOpcode::kFusion, shape), fusion_kind_(fusion_kind) {
+  for (auto operand : operands) {
+    AppendOperand(operand);
+  }
+  SetAndSanitizeName("fusion");
+  AppendComputation(fusion_computation);
+  fusion_computation->SetFusionInstruction(this);
+}
+
+HloInstructionProto HloFusionInstruction::ToProto() const {
+  HloInstructionProto proto = HloInstruction::ToProto();
+  proto.set_fusion_kind(xla::ToString(fusion_kind()));
+  proto.add_called_computation_ids(
+      fused_instructions_computation()->unique_id());
+  return proto;
+}
+
+bool HloFusionInstruction::IsElementwiseImpl(
+    const tensorflow::gtl::optional<int64>& operand_idx) const {
+  if (fusion_kind() != FusionKind::kLoop) {
+    return false;
+  }
+
+  if (!operand_idx.has_value()) {
+    for (auto* fused : fused_instructions()) {
+      if (fused->opcode() != HloOpcode::kParameter && !fused->IsElementwise()) {
+        return false;
+      }
+    }
+    return true;
+  }
+  // A loop-fusion is elementwise on an operand if all operations (computed
+  // using BFS) between the operand and the fused root are elementwise.
+  std::deque<HloInstruction*> worklist;
+  std::unordered_set<const HloInstruction*> visited;
+  worklist.push_back(fused_parameter(operand_idx.value()));
+  visited.insert(fused_parameter(operand_idx.value()));
+  while (!worklist.empty()) {
+    HloInstruction* operand = worklist.front();
+    worklist.pop_front();
+    for (HloInstruction* user : operand->users()) {
+      CHECK_GE(user->unique_id(), 0);
+      if (ContainsKey(visited, user)) {
+        continue;
+      }
+      if (user->IsElementwise() ||
+          IsInstructionElementwiseOnOperand(user, operand)) {
+        worklist.push_back(user);
+        visited.insert(user);
+      } else {
+        return false;
+      }
+    }
+  }
+  return true;
+}
+
+HloInstruction* HloFusionInstruction::AddFusionOperand(
+    HloInstruction* new_operand) {
+  CHECK_EQ(operand_count(),
+           fused_instructions_computation()->parameter_instructions().size());
+  const int64 param_no = operand_count();
+  // Name the parameter after the instruction it represents in the outer
+  // (non-fusion) computation.
+  string param_name = StrCat(new_operand->name(), ".param_", param_no);
+  HloInstruction* fused_parameter =
+      fused_instructions_computation()->AddParameter(
+          HloInstruction::CreateParameter(param_no, new_operand->shape(),
+                                          param_name));
+  AppendOperand(new_operand);
+  return fused_parameter;
+}
+
+void HloFusionInstruction::MergeFusionInstruction(
+    HloFusionInstruction* instruction_to_merge) {
+  CHECK(std::find(operands().begin(), operands().end(), instruction_to_merge) !=
+        operands().end());
+  // Clone the instruction from which to merge fused instructions.
+  std::unique_ptr<HloInstruction> cloned = instruction_to_merge->Clone();
+  HloFusionInstruction* cloned_fusion =
+      static_cast<HloFusionInstruction*>(cloned.get());
+  // Replace uses of fused parameters with the corresponding operand of the
+  // fusion.  Add all non-parameter fused instructions to
+  // 'unfused_instructions' to be merged into 'this'.  This is done in reverse
+  // post order.
+  std::vector<HloInstruction*> unfused_instructions;
+  auto fused_instructions = cloned_fusion->fused_instructions_computation()
+                                ->MakeInstructionPostOrder();
+  for (auto fused_it = fused_instructions.rbegin();
+       fused_it != fused_instructions.rend(); ++fused_it) {
+    auto fused_instruction = *fused_it;
+    if (fused_instruction->opcode() == HloOpcode::kParameter) {
+      TF_CHECK_OK(
+          fused_instruction->ReplaceAllUsesWith(cloned_fusion->mutable_operand(
+              fused_instruction->parameter_number())));
+    } else {
+      unfused_instructions.push_back(fused_instruction);
+    }
+  }
+  CHECK(unfused_instructions.front() == cloned_fusion->fused_expression_root());
+  // Replace instruction_to_merge use of 'this' with unfused_root.
+  TF_CHECK_OK(
+      instruction_to_merge->ReplaceUseWith(this, unfused_instructions.front()));
+  // Fuse 'unfused_instructions' into 'this'.
+  for (auto& instruction : unfused_instructions) {
+    FuseInstruction(instruction);
+    instruction->DetachFromOperands();
+  }
+  CHECK_EQ(0, cloned_fusion->user_count());
+  cloned_fusion->DetachFromOperands();
+  TF_CHECK_OK(parent()->parent()->RemoveEmbeddedComputation(
+      cloned_fusion->fused_instructions_computation()));
+}
+
+void HloFusionInstruction::MergeFusionInstructionIntoMultiOutput(
+    HloFusionInstruction* instruction_to_merge) {
+  // Add all non-parameter fused instructions to 'unfused_instructions' to be
+  // merged into 'this'. `old_to_new' maps the instructions in the fused node
+  // to the disaseembled fusion instructions.
+  // Note that we add the unfused instructions to this->parent_ computation.
+  // This is necessary because the unique_id needs for an instruction and
+  // it's only added when inserting to the computation.
+  tensorflow::gtl::FlatMap<HloInstruction*, HloInstruction*> old_to_new;
+  std::vector<HloInstruction*> unfused_instructions;
+  auto computation_to_merge =
+      instruction_to_merge->fused_instructions_computation();
+  auto post_order = computation_to_merge->MakeInstructionPostOrder();
+  for (auto rit = post_order.rbegin(); rit != post_order.rend(); ++rit) {
+    auto fused_instruction = *rit;
+    if (fused_instruction->opcode() == HloOpcode::kParameter) {
+      InsertOrDie(&old_to_new, fused_instruction,
+                  instruction_to_merge->mutable_operand(
+                      fused_instruction->parameter_number()));
+      continue;
+    }
+
+    // Here we clone the insertion and call FuseInstructionIntoMultiOutput()
+    // which clones again. This can be improved.
+    auto cloned_instruction =
+        parent()->AddInstruction(fused_instruction->Clone());
+    unfused_instructions.push_back(cloned_instruction);
+    InsertOrDie(&old_to_new, fused_instruction, cloned_instruction);
+  }
+  for (auto unfused_instruction : unfused_instructions) {
+    for (int64 index = 0; index < unfused_instruction->operand_count();
+         index++) {
+      auto new_operand =
+          FindOrDie(old_to_new, unfused_instruction->mutable_operand(index));
+      TF_CHECK_OK(unfused_instruction->ReplaceOperandWith(index, new_operand));
+    }
+  }
+
+  HloInstruction* unfused_root = unfused_instructions.front();
+  TF_CHECK_OK(instruction_to_merge->ReplaceAllUsesWith(unfused_root));
+
+  TF_CHECK_OK(
+      instruction_to_merge->parent()->RemoveInstruction(instruction_to_merge));
+  if (GetModule()) {
+    TF_CHECK_OK(GetModule()->RemoveEmbeddedComputation(computation_to_merge));
+  }
+
+  // Fuse the root instruction and generate multiple outputs.
+  FuseInstructionIntoMultiOutput(unfused_root);
+  TF_CHECK_OK(unfused_root->parent()->RemoveInstruction(unfused_root));
+  // The rest instructions are of normal fusing.
+  for (int64 i = 1; i < unfused_instructions.size(); i++) {
+    auto instruction = unfused_instructions[i];
+    FuseInstruction(instruction);
+    TF_CHECK_OK(instruction->parent()->RemoveInstruction(instruction));
+  }
+}
+
+HloComputation* HloFusionInstruction::fused_instructions_computation() const {
+  CHECK(!called_computations().empty());
+  auto* fused_instructions_computation = called_computations().front();
+  CHECK(fused_instructions_computation->IsFusionComputation())
+      << "Computation " << fused_instructions_computation->name()
+      << " is not a fusion kind";
+  return fused_instructions_computation;
+}
+
+HloInstruction* HloFusionInstruction::fused_expression_root() const {
+  return fused_instructions_computation()->root_instruction();
+}
+
+HloInstruction* HloFusionInstruction::fused_parameter(
+    int64 parameter_number) const {
+  return fused_instructions_computation()->parameter_instruction(
+      parameter_number);
+}
+
+const std::vector<HloInstruction*>& HloFusionInstruction::fused_parameters()
+    const {
+  return fused_instructions_computation()->parameter_instructions();
+}
+
+const tensorflow::gtl::iterator_range<UnwrappingIterator<
+    std::list<std::unique_ptr<HloInstruction>>::const_iterator>>
+HloFusionInstruction::fused_instructions() const {
+  const HloComputation* subcomp = fused_instructions_computation();
+  return subcomp->instructions();
+}
+
+const tensorflow::gtl::iterator_range<
+    UnwrappingIterator<std::list<std::unique_ptr<HloInstruction>>::iterator>>
+HloFusionInstruction::fused_instructions() {
+  return fused_instructions_computation()->instructions();
+}
+
+int64 HloFusionInstruction::fused_instruction_count() const {
+  return fused_instructions_computation()->instruction_count();
+}
+
+HloInstruction* HloFusionInstruction::FuseInstructionInternal(
+    HloInstruction* instruction_to_fuse, bool add_output) {
+  // When add_output is false, this fusion instruction must be a user of
+  // instruction_to_fuse.
+  if (!add_output) {
+    CHECK(IsUserOf(instruction_to_fuse));
+  }
+  HloInstruction* fused_instruction =
+      CloneAndFuseInternal(instruction_to_fuse, add_output);
+  return fused_instruction;
+}
+
+HloInstruction* HloFusionInstruction::CloneAndFuseInternal(
+    HloInstruction* instruction_to_fuse, bool add_output) {
+  CHECK(instruction_to_fuse->IsFusable()) << instruction_to_fuse->ToString();
+  VLOG(3) << "CloneAndFuseInternal:\n" << instruction_to_fuse->ToString();
+  HloInstruction* clone = nullptr;
+  if (called_computations().empty()) {
+    // New fusion instruction. It should not be a multioutput instruction.
+    CHECK(!add_output);
+    auto builder = HloComputation::Builder("fused_computation", this);
+    builder.AddInstruction(instruction_to_fuse->Clone(/*suffix=*/""));
+    AppendComputation(
+        CHECK_NOTNULL(GetModule())->AddEmbeddedComputation(builder.Build()));
+    clone = fused_expression_root();
+  } else {
+    clone = fused_instructions_computation()->AddInstruction(
+        instruction_to_fuse->Clone(/*suffix=*/""));
+    // When add_output is false, instruction_to_fuse is necessarily an operand
+    // of the fusion instruction. After fusion this will no longer be the
+    // case. Remove the operand from the operand list and remove its
+    // corresponding fused parameter instruction. Renumber parameters as
+    // necessary to make parameter numbers consistent with their index in the
+    // fused_parameter_ vector.
+    bool in_operand_list = std::find(operands().begin(), operands().end(),
+                                     instruction_to_fuse) != operands().end();
+    CHECK(add_output || in_operand_list);
+    const std::vector<HloInstruction*>& fused_parameters =
+        fused_instructions_computation()->parameter_instructions();
+    for (int64 operand_num = 0; operand_num < operand_count(); ++operand_num) {
+      if (instruction_to_fuse == operand(operand_num)) {
+        // replace the fused parameter instruction's uses with the clone.
+        HloInstruction* fused_parameter = fused_parameters[operand_num];
+        TF_CHECK_OK(fused_parameter->ReplaceAllUsesWith(clone));
+
+        // Remove the corresponding fused parameter and operand from their
+        // respective vectors.
+        TF_CHECK_OK(
+            fused_instructions_computation()->RemoveParameter(operand_num));
+        RemoveOperandAt(operand_num);
+        break;
+      }
+    }
+    // We've cloned instruction_to_fuse into this fusion instruction, so this
+    // fusion instruction is no longer a use of instruction_to_fuse.
+    if (in_operand_list) {
+      DetachFrom(instruction_to_fuse);
+      // When the instruction_to_fuse does not have other users, we don't need
+      // to generate a multioutput fusion instruction.
+      if (instruction_to_fuse->user_count() == 0) {
+        add_output = false;
+      }
+    }
+  }
+
+  // Reread the parameters in the computation.
+  const std::vector<HloInstruction*>& fused_parameters =
+      fused_instructions_computation()->parameter_instructions();
+
+  // Add each operand of the clone as an operand of the fusion instruction. A
+  // complication is that some clone operands may already be operands of the
+  // fusion instruction.
+  for (int64 operand_num = 0; operand_num < clone->operand_count();
+       ++operand_num) {
+    HloInstruction* operand = clone->mutable_operand(operand_num);
+
+    // See if this operand is already an operand of the fusion node.
+    CHECK_EQ(operands().size(), fused_parameters.size());
+    HloInstruction* fused_param = nullptr;
+    for (int64 i = 0; i < operands().size(); ++i) {
+      if (this->operand(i) == operand) {
+        fused_param = fused_parameters[i];
+        break;
+      }
+    }
+
+    if (fused_param == nullptr) {
+      // Clone's operand was not already an operand of the fusion
+      // instruction. Add it as an operand and add a corresponding fused
+      // parameter instruction.
+      fused_param = AddFusionOperand(operand);
+    }
+    TF_CHECK_OK(clone->ReplaceOperandWith(operand_num, fused_param));
+  }
+
+  if (add_output) {
+    CHECK_GT(instruction_to_fuse->user_count(), 0);
+    // If this is already a multioutput fusion instruction, expand the root
+    // tuple by 1.
+    HloInstruction* fused_root = fused_expression_root();
+    HloInstruction::InstructionVector tuple_elements;
+    bool newly_created_tuple_instr = false;
+    if (fused_root->opcode() == HloOpcode::kTuple) {
+      tuple_elements = fused_root->operands();
+    } else {
+      tuple_elements.push_back(fused_root);
+      newly_created_tuple_instr = true;
+    }
+    if (clone->opcode() == HloOpcode::kTuple) {
+      for (auto inst : clone->operands()) {
+        tuple_elements.push_back(inst);
+      }
+    } else {
+      tuple_elements.push_back(clone);
+    }
+    HloInstruction* new_root = fused_instructions_computation()->AddInstruction(
+        HloInstruction::CreateTuple(tuple_elements));
+    fused_instructions_computation()->set_root_instruction(new_root);
+    *mutable_shape() = new_root->shape();
+    if (fused_root->opcode() == HloOpcode::kTuple) {
+      TF_CHECK_OK(
+          fused_instructions_computation()->RemoveInstruction(fused_root));
+    }
+
+    // If this is a newly created multioutput instruction, we need to update
+    // the use of the original fusion instruction.
+    if (newly_created_tuple_instr) {
+      HloInstruction* new_instr = parent()->AddInstruction(
+          HloInstruction::CreateGetTupleElement(fused_root->shape(), this, 0));
+      TF_CHECK_OK(ReplaceAllUsesWith(new_instr));
+    }
+    int64 index = tuple_elements.size();
+    if (instruction_to_fuse->opcode() == HloOpcode::kTuple) {
+      index -= instruction_to_fuse->operand_count();
+      std::vector<HloInstruction*> to_be_removed;
+      for (auto old_gte : instruction_to_fuse->users()) {
+        CHECK_EQ(old_gte->opcode(), HloOpcode::kGetTupleElement);
+        int64 old_tuple_index = old_gte->tuple_index();
+        HloInstruction* new_gte =
+            parent()->AddInstruction(HloInstruction::CreateGetTupleElement(
+                old_gte->shape(), this, index + old_tuple_index));
+        TF_CHECK_OK(old_gte->ReplaceAllUsesWith(new_gte));
+        to_be_removed.push_back(old_gte);
+      }
+      for (auto old_gte : to_be_removed) {
+        TF_CHECK_OK(parent()->RemoveInstruction(old_gte));
+      }
+      TF_CHECK_OK(fused_instructions_computation()->RemoveInstruction(clone));
+    } else {
+      HloInstruction* new_gte =
+          parent()->AddInstruction(HloInstruction::CreateGetTupleElement(
+              clone->shape(), this, index - 1));
+      TF_CHECK_OK(instruction_to_fuse->ReplaceAllUsesWith(new_gte));
+    }
+  }
+
+  VLOG(2) << "New clone:\n" << clone->ToString();
+  return clone;
+}
+
+std::vector<string> HloFusionInstruction::ExtraAttributesToStringImpl(
+    const HloPrintOptions& options) const {
+  return {StrCat("kind=", xla::ToString(fusion_kind()))};
+}
+
+bool HloFusionInstruction::IdenticalSlowPath(
+    const HloInstruction& other,
+    const std::function<bool(const HloComputation*, const HloComputation*)>&
+        eq_computations) const {
+  return fusion_kind() == other.fusion_kind() &&
+         eq_computations(fused_instructions_computation(),
+                         other.fused_instructions_computation());
+}
+
+std::unique_ptr<HloInstruction> HloFusionInstruction::CloneWithNewOperandsImpl(
+    const Shape& shape,
+    tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+    HloCloneContext* context) const {
+  HloModule* module = context != nullptr ? context->module() : GetModule();
+  HloComputation* new_fused_computation = nullptr;
+  if (context != nullptr) {
+    new_fused_computation =
+        context->FindComputation(fused_instructions_computation());
+  }
+  if (new_fused_computation == nullptr) {
+    new_fused_computation = module->AddEmbeddedComputation(
+        fused_instructions_computation()->Clone("clone", context));
+  }
+  return MakeUnique<HloFusionInstruction>(shape, fusion_kind(), new_operands,
+                                          new_fused_computation);
+}
+
+HloRngInstruction::HloRngInstruction(
+    const Shape& shape, RandomDistribution distribution,
+    tensorflow::gtl::ArraySlice<HloInstruction*> parameters)
+    : HloInstruction(HloOpcode::kRng, shape), distribution_(distribution) {
+  for (HloInstruction* param : parameters) {
+    AppendOperand(param);
+  }
+}
+
+HloInstructionProto HloRngInstruction::ToProto() const {
+  HloInstructionProto proto = HloInstruction::ToProto();
+  proto.set_distribution(distribution_);
+  return proto;
+}
+
+std::vector<string> HloRngInstruction::ExtraAttributesToStringImpl(
+    const HloPrintOptions& options) const {
+  return {StrCat("distribution=", RandomDistributionToString(distribution_))};
+}
+
+bool HloRngInstruction::IsElementwiseImpl(
+    const tensorflow::gtl::optional<int64>& operand_idx) const {
+  return true;
+}
+
+bool HloRngInstruction::IdenticalSlowPath(
+    const HloInstruction& other,
+    const std::function<bool(const HloComputation*, const HloComputation*)>&
+        eq_computations) const {
+  return false;
+}
+
+std::unique_ptr<HloInstruction> HloRngInstruction::CloneWithNewOperandsImpl(
+    const Shape& shape,
+    tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+    HloCloneContext* context) const {
+  return MakeUnique<HloRngInstruction>(shape, distribution_, new_operands);
+}
+
+HloParameterInstruction::HloParameterInstruction(int64 parameter_number,
+                                                 const Shape& shape,
+                                                 const string& name)
+    : HloInstruction(HloOpcode::kParameter, shape),
+      parameter_number_(parameter_number) {
+  SetAndSanitizeName(name);
+}
+
+HloInstructionProto HloParameterInstruction::ToProto() const {
+  HloInstructionProto proto = HloInstruction::ToProto();
+  proto.set_parameter_number(parameter_number_);
+  return proto;
+}
+
+string HloParameterInstruction::OperandsToStringWithCanonicalNameMap(
+    const HloPrintOptions& options,
+    CanonicalNameMap* canonical_name_map) const {
+  return StrCat(parameter_number_);
+}
+
+bool HloParameterInstruction::IdenticalSlowPath(
+    const HloInstruction& other,
+    const std::function<bool(const HloComputation*, const HloComputation*)>&
+        eq_computations) const {
+  const auto& casted_other = static_cast<const HloParameterInstruction&>(other);
+  return parameter_number() == casted_other.parameter_number();
+}
+
+std::unique_ptr<HloInstruction>
+HloParameterInstruction::CloneWithNewOperandsImpl(
+    const Shape& shape,
+    tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+    HloCloneContext* context) const {
+  return MakeUnique<HloParameterInstruction>(parameter_number_, shape, name());
+}
+
+HloGetTupleElementInstruction::HloGetTupleElementInstruction(
+    const Shape& shape, HloInstruction* operand, int64 index)
+    : HloInstruction(HloOpcode::kGetTupleElement, shape), tuple_index_(index) {
+  CHECK(ShapeUtil::IsTuple(operand->shape()));
+  AppendOperand(operand);
+}
+
+HloInstructionProto HloGetTupleElementInstruction::ToProto() const {
+  HloInstructionProto proto = HloInstruction::ToProto();
+  proto.set_tuple_index(tuple_index_);
+  return proto;
+}
+
+std::vector<string> HloGetTupleElementInstruction::ExtraAttributesToStringImpl(
+    const HloPrintOptions& options) const {
+  return {StrCat("index=", tuple_index())};
+}
+
+bool HloGetTupleElementInstruction::IdenticalSlowPath(
+    const HloInstruction& other,
+    const std::function<bool(const HloComputation*, const HloComputation*)>&
+        eq_computations) const {
+  const auto& casted_other =
+      static_cast<const HloGetTupleElementInstruction&>(other);
+  return tuple_index() == casted_other.tuple_index();
+}
+
+std::unique_ptr<HloInstruction>
+HloGetTupleElementInstruction::CloneWithNewOperandsImpl(
+    const Shape& shape,
+    tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+    HloCloneContext* context) const {
+  CHECK_EQ(new_operands.size(), 1);
+  return MakeUnique<HloGetTupleElementInstruction>(shape, new_operands[0],
+                                                   tuple_index());
+}
+
+HloReducePrecisionInstruction::HloReducePrecisionInstruction(
+    const Shape& shape, HloInstruction* operand, const int exponent_bits,
+    const int mantissa_bits)
+    : HloInstruction(HloOpcode::kReducePrecision, shape),
+      exponent_bits_(exponent_bits),
+      mantissa_bits_(mantissa_bits) {
+  AppendOperand(operand);
+}
+
+HloInstructionProto HloReducePrecisionInstruction::ToProto() const {
+  HloInstructionProto proto = HloInstruction::ToProto();
+  proto.set_exponent_bits(exponent_bits_);
+  proto.set_mantissa_bits(mantissa_bits_);
+  return proto;
+}
+
+std::vector<string> HloReducePrecisionInstruction::ExtraAttributesToStringImpl(
+    const HloPrintOptions& options) const {
+  return {StrCat("exponent_bits=", exponent_bits_),
+          StrCat("mantissa_bits=", mantissa_bits_)};
+}
+
+bool HloReducePrecisionInstruction::IdenticalSlowPath(
+    const HloInstruction& other,
+    const std::function<bool(const HloComputation*, const HloComputation*)>&
+        eq_computations) const {
+  const auto& casted_other =
+      static_cast<const HloReducePrecisionInstruction&>(other);
+  // A reduce-precision operation is determined by the bit sizes.
+  return exponent_bits() == casted_other.exponent_bits() &&
+         mantissa_bits() == casted_other.mantissa_bits();
+}
+
+std::unique_ptr<HloInstruction>
+HloReducePrecisionInstruction::CloneWithNewOperandsImpl(
+    const Shape& shape,
+    tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+    HloCloneContext* context) const {
+  CHECK_EQ(new_operands.size(), 1);
+  return MakeUnique<HloReducePrecisionInstruction>(
+      shape, new_operands[0], exponent_bits(), mantissa_bits());
+}
+
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/hlo_instructions.h b/tensorflow/compiler/xla/service/hlo_instructions.h
new file mode 100644
index 0000000000000000000000000000000000000000..6749d875559008b3a2bd479ff075c83d85d87509
--- /dev/null
+++ b/tensorflow/compiler/xla/service/hlo_instructions.h
@@ -0,0 +1,727 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+// All HloInstruction subclasses are put in this file.
+
+#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_HLO_INSTRUCTIONS_H_
+#define TENSORFLOW_COMPILER_XLA_SERVICE_HLO_INSTRUCTIONS_H_
+
+#include "tensorflow/compiler/xla/service/hlo_instruction.h"
+
+namespace xla {
+
+class HloBatchNormInstruction : public HloInstruction {
+ public:
+  // Returns feature_index field associated with the instruction. The index
+  // represents the index of the feature dimension.
+  int64 feature_index() const { return feature_index_; }
+
+  // Returns a epsilon value associated with the instruction. The is a small
+  // number added to the variance to avoid divide-by-zero error.
+  float epsilon() const { return epsilon_; }
+
+  // Returns a serialized representation of this instruction.
+  HloInstructionProto ToProto() const override;
+
+ protected:
+  explicit HloBatchNormInstruction(HloOpcode opcode, const Shape& shape,
+                                   HloInstruction* operand,
+                                   HloInstruction* scale, float epsilon,
+                                   int64 feature_index);
+
+ private:
+  std::vector<string> ExtraAttributesToStringImpl(
+      const HloPrintOptions& options) const override;
+  bool IdenticalSlowPath(
+      const HloInstruction& other,
+      const std::function<bool(const HloComputation*, const HloComputation*)>&
+          eq_computations) const override;
+  // A small float number added to the variance to avoid divide-by-zero error.
+  float epsilon_ = 0.0f;
+
+  // An integer value representing the index of the feature dimension.
+  int64 feature_index_ = -1;
+};
+
+class HloBatchNormTrainingInstruction : public HloBatchNormInstruction {
+ public:
+  explicit HloBatchNormTrainingInstruction(const Shape& shape,
+                                           HloInstruction* operand,
+                                           HloInstruction* scale,
+                                           HloInstruction* offset,
+                                           float epsilon, int64 feature_index);
+
+ private:
+  // Implementation for non-common logic of CloneWithNewOperands.
+  std::unique_ptr<HloInstruction> CloneWithNewOperandsImpl(
+      const Shape& shape,
+      tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+      HloCloneContext* context) const override;
+};
+
+class HloBatchNormInferenceInstruction : public HloBatchNormInstruction {
+ public:
+  explicit HloBatchNormInferenceInstruction(
+      const Shape& shape, HloInstruction* operand, HloInstruction* scale,
+      HloInstruction* offset, HloInstruction* mean, HloInstruction* variance,
+      float epsilon, int64 feature_index);
+
+ private:
+  // Implementation for non-common logic of CloneWithNewOperands.
+  std::unique_ptr<HloInstruction> CloneWithNewOperandsImpl(
+      const Shape& shape,
+      tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+      HloCloneContext* context) const override;
+};
+
+class HloBatchNormGradInstruction : public HloBatchNormInstruction {
+ public:
+  explicit HloBatchNormGradInstruction(
+      const Shape& shape, HloInstruction* operand, HloInstruction* scale,
+      HloInstruction* mean, HloInstruction* variance,
+      HloInstruction* grad_output, float epsilon, int64 feature_index);
+
+ private:
+  // Implementation for non-common logic of CloneWithNewOperands.
+  std::unique_ptr<HloInstruction> CloneWithNewOperandsImpl(
+      const Shape& shape,
+      tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+      HloCloneContext* context) const override;
+};
+
+class HloFftInstruction : public HloInstruction {
+ public:
+  explicit HloFftInstruction(const Shape& shape, HloInstruction* operand,
+                             FftType fft_type,
+                             tensorflow::gtl::ArraySlice<int64> fft_length);
+  FftType fft_type() const { return fft_type_; }
+
+  const std::vector<int64>& fft_length() const { return fft_length_; }
+
+  // Returns a serialized representation of this instruction.
+  HloInstructionProto ToProto() const override;
+
+ private:
+  std::vector<string> ExtraAttributesToStringImpl(
+      const HloPrintOptions& options) const override;
+  bool IdenticalSlowPath(
+      const HloInstruction& other,
+      const std::function<bool(const HloComputation*, const HloComputation*)>&
+          eq_computations) const override;
+
+  // Implementation for non-common logic of CloneWithNewOperands.
+  std::unique_ptr<HloInstruction> CloneWithNewOperandsImpl(
+      const Shape& shape,
+      tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+      HloCloneContext* context) const override;
+
+  // Describes FFT type for an FFT instruction.
+  FftType fft_type_ = FftType::FFT;
+
+  // Indicates the FFT length for an FFT instruction.
+  std::vector<int64> fft_length_;
+};
+
+class HloSendRecvInstruction : public HloInstruction {
+ public:
+  // Returns the channel id associated with the instruction. The id is
+  // shared between each Send/Recv pair and is globally unique to identify each
+  // channel.
+  int64 channel_id() const { return channel_id_; }
+
+  // Returns a serialized representation of this instruction.
+  HloInstructionProto ToProto() const override;
+
+ protected:
+  explicit HloSendRecvInstruction(HloOpcode opcode, const Shape& shape,
+                                  int64 channel_id);
+
+ private:
+  std::vector<string> ExtraAttributesToStringImpl(
+      const HloPrintOptions& options) const override;
+  bool IdenticalSlowPath(
+      const HloInstruction& other,
+      const std::function<bool(const HloComputation*, const HloComputation*)>&
+          eq_computations) const override;
+  // Represents a unique identifier for each Send/Recv instruction pair.
+  int64 channel_id_;
+};
+
+class HloSendInstruction : public HloSendRecvInstruction {
+ public:
+  explicit HloSendInstruction(HloInstruction* operand, int64 channel_id);
+
+ private:
+  // Implementation for non-common logic of CloneWithNewOperands.
+  std::unique_ptr<HloInstruction> CloneWithNewOperandsImpl(
+      const Shape& shape,
+      tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+      HloCloneContext* context) const override;
+};
+
+class HloSendDoneInstruction : public HloSendRecvInstruction {
+ public:
+  explicit HloSendDoneInstruction(HloSendInstruction* operand);
+
+ private:
+  // Implementation for non-common logic of CloneWithNewOperands.
+  std::unique_ptr<HloInstruction> CloneWithNewOperandsImpl(
+      const Shape& shape,
+      tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+      HloCloneContext* context) const override;
+};
+
+class HloRecvInstruction : public HloSendRecvInstruction {
+ public:
+  explicit HloRecvInstruction(const Shape& shape, int64 channel_id);
+
+ private:
+  // Implementation for non-common logic of CloneWithNewOperands.
+  std::unique_ptr<HloInstruction> CloneWithNewOperandsImpl(
+      const Shape& shape,
+      tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+      HloCloneContext* context) const override;
+};
+
+class HloRecvDoneInstruction : public HloSendRecvInstruction {
+ public:
+  explicit HloRecvDoneInstruction(HloRecvInstruction* operand);
+
+ private:
+  // Implementation for non-common logic of CloneWithNewOperands.
+  std::unique_ptr<HloInstruction> CloneWithNewOperandsImpl(
+      const Shape& shape,
+      tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+      HloCloneContext* context) const override;
+};
+
+class HloReverseInstruction : public HloInstruction {
+ public:
+  explicit HloReverseInstruction(const Shape& shape, HloInstruction* operand,
+                                 tensorflow::gtl::ArraySlice<int64> dimensions);
+  // Returns the dimension sizes or numbers associated with this instruction.
+  const std::vector<int64>& dimensions() const override { return dimensions_; }
+  int64 dimensions(int64 index) const override { return dimensions()[index]; }
+  // Returns a serialized representation of this instruction.
+  HloInstructionProto ToProto() const override;
+
+ private:
+  std::vector<string> ExtraAttributesToStringImpl(
+      const HloPrintOptions& options) const override;
+  bool IdenticalSlowPath(
+      const HloInstruction& other,
+      const std::function<bool(const HloComputation*, const HloComputation*)>&
+          eq_computations) const override;
+  // Implementation for non-common logic of CloneWithNewOperands.
+  std::unique_ptr<HloInstruction> CloneWithNewOperandsImpl(
+      const Shape& shape,
+      tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+      HloCloneContext* context) const override;
+
+  std::vector<int64> dimensions_;
+};
+
+class HloConcatenateInstruction : public HloInstruction {
+ public:
+  explicit HloConcatenateInstruction(
+      const Shape& shape, tensorflow::gtl::ArraySlice<HloInstruction*> operands,
+      int64 dimension);
+  // Returns the dimension sizes or numbers associated with this instruction.
+  const std::vector<int64>& dimensions() const override { return dimensions_; }
+  int64 dimensions(int64 index) const override { return dimensions()[index]; }
+  // Accessor for the dimension in which a concatenate HLO should occur.
+  int64 concatenate_dimension() const { return dimensions(0); }
+  // Returns a serialized representation of this instruction.
+  HloInstructionProto ToProto() const override;
+
+ private:
+  std::vector<string> ExtraAttributesToStringImpl(
+      const HloPrintOptions& options) const override;
+  bool IdenticalSlowPath(
+      const HloInstruction& other,
+      const std::function<bool(const HloComputation*, const HloComputation*)>&
+          eq_computations) const override;
+  // Implementation for non-common logic of CloneWithNewOperands.
+  std::unique_ptr<HloInstruction> CloneWithNewOperandsImpl(
+      const Shape& shape,
+      tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+      HloCloneContext* context) const override;
+
+  std::vector<int64> dimensions_;
+};
+
+class HloReduceInstruction : public HloInstruction {
+ public:
+  explicit HloReduceInstruction(
+      const Shape& shape, HloInstruction* arg, HloInstruction* init_value,
+      tensorflow::gtl::ArraySlice<int64> dimensions_to_reduce,
+      HloComputation* reduce_computation);
+  // Returns the dimension sizes or numbers associated with this instruction.
+  const std::vector<int64>& dimensions() const override { return dimensions_; }
+  int64 dimensions(int64 index) const override { return dimensions()[index]; }
+  // Returns a serialized representation of this instruction.
+  HloInstructionProto ToProto() const override;
+
+ private:
+  std::vector<string> ExtraAttributesToStringImpl(
+      const HloPrintOptions& options) const override;
+  bool IdenticalSlowPath(
+      const HloInstruction& other,
+      const std::function<bool(const HloComputation*, const HloComputation*)>&
+          eq_computations) const override;
+  // Implementation for non-common logic of CloneWithNewOperands.
+  std::unique_ptr<HloInstruction> CloneWithNewOperandsImpl(
+      const Shape& shape,
+      tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+      HloCloneContext* context) const override;
+
+  std::vector<int64> dimensions_;
+};
+
+class HloTransposeInstruction : public HloInstruction {
+ public:
+  explicit HloTransposeInstruction(
+      const Shape& shape, HloInstruction* operand,
+      tensorflow::gtl::ArraySlice<int64> dimensions);
+  // Returns the dimension sizes or numbers associated with this instruction.
+  const std::vector<int64>& dimensions() const override { return dimensions_; }
+  int64 dimensions(int64 index) const override { return dimensions()[index]; }
+  // Returns whether this instruction does a rank-2 transposition.
+  bool IsRank2Transpose() const;
+  // Returns a serialized representation of this instruction.
+  HloInstructionProto ToProto() const override;
+
+ private:
+  std::vector<string> ExtraAttributesToStringImpl(
+      const HloPrintOptions& options) const override;
+  bool IdenticalSlowPath(
+      const HloInstruction& other,
+      const std::function<bool(const HloComputation*, const HloComputation*)>&
+          eq_computations) const override;
+  // Implementation for non-common logic of CloneWithNewOperands.
+  std::unique_ptr<HloInstruction> CloneWithNewOperandsImpl(
+      const Shape& shape,
+      tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+      HloCloneContext* context) const override;
+
+  std::vector<int64> dimensions_;
+};
+
+class HloBroadcastInstruction : public HloInstruction {
+ public:
+  explicit HloBroadcastInstruction(
+      const Shape& shape, HloInstruction* operand,
+      tensorflow::gtl::ArraySlice<int64> broadcast_dimension);
+  // Returns the dimension sizes or numbers associated with this instruction.
+  const std::vector<int64>& dimensions() const override { return dimensions_; }
+  int64 dimensions(int64 index) const override { return dimensions()[index]; }
+  // Returns a serialized representation of this instruction.
+  HloInstructionProto ToProto() const override;
+
+ private:
+  std::vector<string> ExtraAttributesToStringImpl(
+      const HloPrintOptions& options) const override;
+  bool IdenticalSlowPath(
+      const HloInstruction& other,
+      const std::function<bool(const HloComputation*, const HloComputation*)>&
+          eq_computations) const override;
+  // Implementation for non-common logic of CloneWithNewOperands.
+  std::unique_ptr<HloInstruction> CloneWithNewOperandsImpl(
+      const Shape& shape,
+      tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+      HloCloneContext* context) const override;
+
+  std::vector<int64> dimensions_;
+};
+
+class HloMapInstruction : public HloInstruction {
+ public:
+  explicit HloMapInstruction(
+      const Shape& shape, tensorflow::gtl::ArraySlice<HloInstruction*> operands,
+      HloComputation* map_computation,
+      tensorflow::gtl::ArraySlice<HloInstruction*> static_operands = {});
+  // Returns the dimension sizes or numbers associated with this instruction.
+  const std::vector<int64>& dimensions() const override { return dimensions_; }
+  int64 dimensions(int64 index) const override { return dimensions()[index]; }
+  // Returns a serialized representation of this instruction.
+  HloInstructionProto ToProto() const override;
+
+ private:
+  bool IsElementwiseImpl(
+      const tensorflow::gtl::optional<int64>& operand_idx) const override;
+  std::vector<string> ExtraAttributesToStringImpl(
+      const HloPrintOptions& options) const override;
+  bool IdenticalSlowPath(
+      const HloInstruction& other,
+      const std::function<bool(const HloComputation*, const HloComputation*)>&
+          eq_computations) const override;
+  // Implementation for non-common logic of CloneWithNewOperands.
+  std::unique_ptr<HloInstruction> CloneWithNewOperandsImpl(
+      const Shape& shape,
+      tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+      HloCloneContext* context) const override;
+
+  std::vector<int64> dimensions_;
+};
+
+class HloSliceInstruction : public HloInstruction {
+ public:
+  explicit HloSliceInstruction(const Shape& shape, HloInstruction* operand,
+                               tensorflow::gtl::ArraySlice<int64> start_indices,
+                               tensorflow::gtl::ArraySlice<int64> limit_indices,
+                               tensorflow::gtl::ArraySlice<int64> strides);
+
+  HloInstructionProto ToProto() const override;
+
+  // Returns the start index in the given dimension for a slice node.
+  int64 slice_starts(int64 dimension) const { return slice_starts_[dimension]; }
+  const std::vector<int64>& slice_starts() const { return slice_starts_; }
+
+  // Returns the (exclusive) limit index in the given dimension for a slice
+  // node.
+  int64 slice_limits(int64 dimension) const { return slice_limits_[dimension]; }
+  const std::vector<int64>& slice_limits() const { return slice_limits_; }
+
+  // Returns the stride in the given dimension for a slice node.
+  int64 slice_strides(int64 dimension) const {
+    return slice_strides_[dimension];
+  }
+  const std::vector<int64>& slice_strides() const { return slice_strides_; }
+
+  // Returns the flag that describes whether a slice must be lowered into an
+  // offset into the original operand.
+  bool IsInPlaceSlice() const { return is_in_place_slice_; }
+
+  // Sets and returns the flag that describes whether a slice must be lowered
+  // into an offset into the original operand.
+  bool SetIsInPlaceSlice(bool value) {
+    is_in_place_slice_ = value;
+    return value;
+  }
+
+ private:
+  std::vector<string> ExtraAttributesToStringImpl(
+      const HloPrintOptions& options) const override;
+  bool IdenticalSlowPath(
+      const HloInstruction& other,
+      const std::function<bool(const HloComputation*, const HloComputation*)>&
+          eq_computations) const override;
+  // Implementation for non-common logic of CloneWithNewOperands.
+  std::unique_ptr<HloInstruction> CloneWithNewOperandsImpl(
+      const Shape& shape,
+      tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+      HloCloneContext* context) const override;
+
+  // Describes the [begin, end) index range for a slice.
+  std::vector<int64> slice_starts_;
+  std::vector<int64> slice_limits_;
+  std::vector<int64> slice_strides_;
+
+  // Describes whether the slice can be lowered to an offset into the operand.
+  bool is_in_place_slice_ = false;
+};
+
+class HloConstantInstruction : public HloInstruction {
+ public:
+  explicit HloConstantInstruction(std::unique_ptr<Literal> literal);
+  // Returns the literal associated with this instruction.
+  const Literal& literal() const { return *literal_; }
+  // Returns a serialized representation of this instruction.
+  HloInstructionProto ToProto() const override;
+
+  // Change the layout for an Constant Hlo instruction to match new_layout.  For
+  // tuple shaped constants shape_index is the path to the internal array
+  // subshape whose layout needs to be changed.
+  void RelayoutConstant(const Layout& new_layout,
+                        const ShapeIndex& shape_index = {});
+
+ private:
+  bool IsElementwiseImpl(
+      const tensorflow::gtl::optional<int64>& operand_idx) const override;
+  bool IdenticalSlowPath(
+      const HloInstruction& other,
+      const std::function<bool(const HloComputation*, const HloComputation*)>&
+          eq_computations) const override;
+  string OperandsToStringWithCanonicalNameMap(
+      const HloPrintOptions& options,
+      CanonicalNameMap* canonical_name_map) const override;
+  // Implementation for non-common logic of CloneWithNewOperands.
+  std::unique_ptr<HloInstruction> CloneWithNewOperandsImpl(
+      const Shape& shape,
+      tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+      HloCloneContext* context) const override;
+  // TODO(b/36360764): Remove unique_ptr wrapping.
+  std::unique_ptr<Literal> literal_;
+};
+
+class HloTraceInstruction : public HloInstruction {
+ public:
+  explicit HloTraceInstruction(const string& tag, HloInstruction* operand);
+  // Returns a tag to be used in tracing.
+  string TracingTag() const { return literal_->GetR1U8AsString(); }
+  // Returns a serialized representation of this instruction.
+  HloInstructionProto ToProto() const override;
+
+ private:
+  bool IdenticalSlowPath(
+      const HloInstruction& other,
+      const std::function<bool(const HloComputation*, const HloComputation*)>&
+          eq_computations) const override;
+  // Implementation for non-common logic of CloneWithNewOperands.
+  std::unique_ptr<HloInstruction> CloneWithNewOperandsImpl(
+      const Shape& shape,
+      tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+      HloCloneContext* context) const override;
+  // TODO(b/36360764): Remove unique_ptr wrapping.
+  std::unique_ptr<Literal> literal_;
+};
+
+class HloFusionInstruction : public HloInstruction {
+ public:
+  explicit HloFusionInstruction(const Shape& shape, FusionKind fusion_kind,
+                                HloInstruction* fused_root);
+
+  explicit HloFusionInstruction(
+      const Shape& shape, FusionKind fusion_kind,
+      tensorflow::gtl::ArraySlice<HloInstruction*> operands,
+      HloComputation* fusion_computation);
+
+  // Returns a serialized representation of this instruction.
+  HloInstructionProto ToProto() const override;
+
+  // Adds a new operand the fusion instruction.
+  HloInstruction* AddFusionOperand(HloInstruction* new_operand);
+
+  // Merges the fused instructions from 'instruction_to_merge' into the
+  // fused instruction set of 'this', updating operands as necessary.
+  //
+  // Predondition: 'instruction_to_merge' must be an operand of 'this'.
+  void MergeFusionInstruction(HloFusionInstruction* instruction_to_merge);
+
+  // Merges the fused instructions from instruction_to_merge into the fused
+  // instruction set of 'this' and generates multioutput fusion instructions.
+  // All the users of instruction_to_merge will be redirected to 'this'
+  // instruction. instruction_to_merge will be removed from its parent
+  // computation.
+  void MergeFusionInstructionIntoMultiOutput(
+      HloFusionInstruction* instruction_to_merge);
+
+  // Fuses the given instruction in this fusion instruction. instruction_to_fuse
+  // is cloned and the clone is placed in the fusion
+  // instruction. instruction_to_fuse is unchanged. Instruction is cloned rather
+  // than moved to cleanly handle the case where the instruction has a use
+  // outside the fusion instruction. Moving such an instruction into a fusion
+  // instruction would violate the single-result invariant of HLO instructions
+  // and significantly complicate code generation.
+  HloInstruction* FuseInstruction(HloInstruction* instruction_to_fuse) {
+    return FuseInstructionInternal(instruction_to_fuse);
+  }
+
+  // Fuses the given instruction in this fusion instruction and generate
+  // multioutput fusion instruction. A clone of the instruction_to_fuse will
+  // be part of the output of fusion instructions. The users of
+  // instruction_to_fuse will be redirected to this fusion instructions.
+  // instruction_to_fuse will be removed from its parent computation.
+  HloInstruction* FuseInstructionIntoMultiOutput(
+      HloInstruction* instruction_to_fuse) {
+    return FuseInstructionInternal(instruction_to_fuse, /* add_output */ true);
+  }
+
+  // Returns the computation for this fused instruction.
+  HloComputation* fused_instructions_computation() const;
+
+  // Returns the root instruction of the fused expression contained within this
+  // fusion instruction.
+  HloInstruction* fused_expression_root() const;
+
+  // Returns the list of fused instructions inside this fusion instruction.  The
+  // returned type is a range of HloInstruction*s.
+  const tensorflow::gtl::iterator_range<UnwrappingIterator<
+      std::list<std::unique_ptr<HloInstruction>>::const_iterator>>
+  fused_instructions() const;
+
+  const tensorflow::gtl::iterator_range<
+      UnwrappingIterator<std::list<std::unique_ptr<HloInstruction>>::iterator>>
+  fused_instructions();
+
+  // Gets the number of instructions inside this fusion instruction.
+  int64 fused_instruction_count() const;
+
+  // Returns the fused parameter instruction in this fusion instruction
+  // corresponding to the given parameter number.
+  HloInstruction* fused_parameter(int64 parameter_number) const;
+
+  // Returns the vector of fused parameters inside this fusion instruction.
+  const std::vector<HloInstruction*>& fused_parameters() const;
+
+  // Returns true if this instruction is a fusion instruction that generates
+  // multiple outputs.
+  const bool IsMultiOutputFusion() const {
+    return fused_expression_root()->opcode() == HloOpcode::kTuple;
+  }
+
+  FusionKind fusion_kind() const { return fusion_kind_; }
+
+  void set_fusion_kind(FusionKind kind) { fusion_kind_ = kind; }
+
+ private:
+  // Fuses the given instruction into this fusion instruction. When add_output
+  // is false (which is the default), instruction_to_fuse is cloned and the
+  // clone is placed in the fusion instruction. instruction_to_fuse is
+  // unchanged.
+  //
+  // When add_output is true, a clone of the instruction_to_fuse will be part
+  // of the output of fusion instructions. The users of instruction_to_fuse
+  // will be redirected to this fusion instructions. instruction_to_fuse will
+  // be removed from its parent computation.
+  HloInstruction* FuseInstructionInternal(HloInstruction* instruction_to_fuse,
+                                          bool add_output = false);
+  // Clones the given instruction_to_fuse and insert the clone into this fusion
+  // instruction. If add_output is true, a clone of instruction_to_fuse will
+  // be in the output of the this fusion instruction (part of the tuple of the
+  // fusion root).
+  HloInstruction* CloneAndFuseInternal(HloInstruction* instruction_to_fuse,
+                                       bool add_output = false);
+
+  bool IsElementwiseImpl(
+      const tensorflow::gtl::optional<int64>& operand_idx) const override;
+  std::vector<string> ExtraAttributesToStringImpl(
+      const HloPrintOptions& options) const override;
+  bool IdenticalSlowPath(
+      const HloInstruction& other,
+      const std::function<bool(const HloComputation*, const HloComputation*)>&
+          eq_computations) const override;
+  // Implementation for non-common logic of CloneWithNewOperands.
+  std::unique_ptr<HloInstruction> CloneWithNewOperandsImpl(
+      const Shape& shape,
+      tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+      HloCloneContext* context) const override;
+
+  // The type of the fusion. Used by kFusion only.
+  FusionKind fusion_kind_;
+};
+
+class HloRngInstruction : public HloInstruction {
+ public:
+  explicit HloRngInstruction(
+      const Shape& shape, RandomDistribution distribution,
+      tensorflow::gtl::ArraySlice<HloInstruction*> parameters);
+  // Returns the random distribution for this rng node.
+  RandomDistribution random_distribution() const { return distribution_; }
+  // Returns a serialized representation of this instruction.
+  HloInstructionProto ToProto() const override;
+
+ private:
+  bool IsElementwiseImpl(
+      const tensorflow::gtl::optional<int64>& operand_idx) const override;
+  std::vector<string> ExtraAttributesToStringImpl(
+      const HloPrintOptions& options) const override;
+  bool IdenticalSlowPath(
+      const HloInstruction& other,
+      const std::function<bool(const HloComputation*, const HloComputation*)>&
+          eq_computations) const override;
+  // Implementation for non-common logic of CloneWithNewOperands.
+  std::unique_ptr<HloInstruction> CloneWithNewOperandsImpl(
+      const Shape& shape,
+      tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+      HloCloneContext* context) const override;
+
+  // The distribution requested for random number generation.
+  RandomDistribution distribution_;
+};
+
+class HloParameterInstruction : public HloInstruction {
+ public:
+  explicit HloParameterInstruction(int64 parameter_number, const Shape& shape,
+                                   const string& name);
+  int64 parameter_number() const { return parameter_number_; }
+  // Returns a serialized representation of this instruction.
+  HloInstructionProto ToProto() const override;
+
+ private:
+  bool IdenticalSlowPath(
+      const HloInstruction& other,
+      const std::function<bool(const HloComputation*, const HloComputation*)>&
+          eq_computations) const override;
+  string OperandsToStringWithCanonicalNameMap(
+      const HloPrintOptions& options,
+      CanonicalNameMap* canonical_name_map) const override;
+  // Implementation for non-common logic of CloneWithNewOperands.
+  std::unique_ptr<HloInstruction> CloneWithNewOperandsImpl(
+      const Shape& shape,
+      tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+      HloCloneContext* context) const override;
+
+  int64 parameter_number_ = 0;
+};
+
+class HloGetTupleElementInstruction : public HloInstruction {
+ public:
+  explicit HloGetTupleElementInstruction(const Shape& shape,
+                                         HloInstruction* operand, int64 index);
+  // Returns the tuple index associated with this instruction.
+  int64 tuple_index() const { return tuple_index_; }
+  // Returns a serialized representation of this instruction.
+  HloInstructionProto ToProto() const override;
+
+ private:
+  std::vector<string> ExtraAttributesToStringImpl(
+      const HloPrintOptions& options) const override;
+  bool IdenticalSlowPath(
+      const HloInstruction& other,
+      const std::function<bool(const HloComputation*, const HloComputation*)>&
+          eq_computations) const override;
+  // Implementation for non-common logic of CloneWithNewOperands.
+  std::unique_ptr<HloInstruction> CloneWithNewOperandsImpl(
+      const Shape& shape,
+      tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+      HloCloneContext* context) const override;
+
+  int64 tuple_index_ = -1;
+};
+
+class HloReducePrecisionInstruction : public HloInstruction {
+ public:
+  explicit HloReducePrecisionInstruction(const Shape& shape,
+                                         HloInstruction* operand,
+                                         const int exponent_bits,
+                                         const int mantissa_bits);
+  // Returns the number of exponent bits for a reduce-precision node.
+  int32 exponent_bits() const { return exponent_bits_; }
+  // Returns the number of mantissa bits for a reduce-precision node.
+  int32 mantissa_bits() const { return mantissa_bits_; }
+  // Returns a serialized representation of this instruction.
+  HloInstructionProto ToProto() const override;
+
+ private:
+  std::vector<string> ExtraAttributesToStringImpl(
+      const HloPrintOptions& options) const override;
+  bool IdenticalSlowPath(
+      const HloInstruction& other,
+      const std::function<bool(const HloComputation*, const HloComputation*)>&
+          eq_computations) const override;
+  // Implementation for non-common logic of CloneWithNewOperands.
+  std::unique_ptr<HloInstruction> CloneWithNewOperandsImpl(
+      const Shape& shape,
+      tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+      HloCloneContext* context) const override;
+
+  // The bit sizes for a reduce-precision operation.
+  int32 exponent_bits_ = 0;
+  int32 mantissa_bits_ = 0;
+};
+}  // namespace xla
+
+#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_HLO_INSTRUCTIONS_H_
diff --git a/tensorflow/compiler/xla/tools/parser/hlo_lexer.cc b/tensorflow/compiler/xla/service/hlo_lexer.cc
similarity index 95%
rename from tensorflow/compiler/xla/tools/parser/hlo_lexer.cc
rename to tensorflow/compiler/xla/service/hlo_lexer.cc
index 350db126535e418cbfa914edd958f47ba90a3ee5..f0d9fdbc8f86da0bb9d7f9235239df677c9506bc 100644
--- a/tensorflow/compiler/xla/tools/parser/hlo_lexer.cc
+++ b/tensorflow/compiler/xla/service/hlo_lexer.cc
@@ -13,7 +13,7 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#include "tensorflow/compiler/xla/tools/parser/hlo_lexer.h"
+#include "tensorflow/compiler/xla/service/hlo_lexer.h"
 
 #include <unordered_map>
 
@@ -26,9 +26,8 @@ limitations under the License.
 #include "tensorflow/core/platform/regexp.h"
 
 namespace xla {
-namespace tools {
 
-using tensorflow::StringPiece;
+using ::tensorflow::StringPiece;
 
 namespace {
 
@@ -67,12 +66,12 @@ bool HloLexer::CanDereference(const char* ptr) const {
   return ptr < buf_.end() && ptr >= buf_.begin();
 }
 
-StringPiece HloLexer::StringPieceFromPointers(const char* begin,
-                                              const char* end) const {
+tensorflow::StringPiece HloLexer::StringPieceFromPointers(
+    const char* begin, const char* end) const {
   CHECK(begin <= end);
   CHECK(begin == buf_.end() || CanDereference(begin));
   CHECK(end == buf_.end() || CanDereference(end));
-  return StringPiece(begin, end - begin);
+  return tensorflow::StringPiece(begin, end - begin);
 }
 
 tensorflow::RegexpStringPiece HloLexer::RegexpStringPieceFromPointers(
@@ -197,7 +196,8 @@ TokKind HloLexer::LexIdentifier() {
     return TokKind::kAttributeName;
   }
 
-  StringPiece identifier = StringPieceFromPointers(token_start_, current_ptr_);
+  tensorflow::StringPiece identifier =
+      StringPieceFromPointers(token_start_, current_ptr_);
 
   // See if this is a keyword.
 #define KEYWORD(STR)            \
@@ -332,23 +332,24 @@ std::pair<unsigned, unsigned> HloLexer::GetLineAndColumn(LocTy location) const {
   line_no_cache_.last_query = ptr;
   line_no_cache_.line_no_of_query = line_no;
   size_t line_offset = StringPieceFromPointers(start, ptr).rfind('\n');
-  if (line_offset == StringPiece::npos) {
+  if (line_offset == tensorflow::StringPiece::npos) {
     line_offset = 0;
   }
   return {line_no, ptr - start - line_offset};
 }
 
-StringPiece HloLexer::GetLine(LocTy loc) const {
+tensorflow::StringPiece HloLexer::GetLine(LocTy loc) const {
   if (!CanDereference(loc)) {
     return "LINE OUT OF RANGE";
   }
   size_t line_start =
       StringPieceFromPointers(buf_.begin(), loc + 1).rfind('\n');
-  const char* start = line_start == StringPiece::npos
+  const char* start = line_start == tensorflow::StringPiece::npos
                           ? buf_.begin()
                           : buf_.begin() + line_start + 1;
   size_t line_end = StringPieceFromPointers(loc, buf_.end()).find('\n');
-  const char* end = line_end == StringPiece::npos ? buf_.end() : loc + line_end;
+  const char* end =
+      line_end == tensorflow::StringPiece::npos ? buf_.end() : loc + line_end;
 
   return StringPieceFromPointers(start, end);
 }
@@ -370,7 +371,7 @@ TokKind HloLexer::LexString() {
   static LazyRE2 escaping_pattern = {R"("([^"\\]|\\.)*")"};
   if (RE2::Consume(&consumable, *escaping_pattern)) {
     current_ptr_ = consumable.begin();
-    StringPiece raw =
+    tensorflow::StringPiece raw =
         StringPieceFromPointers(token_start_ + 1, current_ptr_ - 1);
     string error;
     if (!tensorflow::str_util::CUnescape(raw, &str_val_, &error)) {
@@ -453,5 +454,4 @@ string TokKindToString(TokKind kind) {
   }
 }
 
-}  // namespace tools
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/tools/parser/hlo_lexer.h b/tensorflow/compiler/xla/service/hlo_lexer.h
similarity index 90%
rename from tensorflow/compiler/xla/tools/parser/hlo_lexer.h
rename to tensorflow/compiler/xla/service/hlo_lexer.h
index 27880b9b8afbfa58abfedc3b2cecd5236b78a6d6..ceb674f25e94ac3ac2e6a4a0687a93ffdcd065e0 100644
--- a/tensorflow/compiler/xla/tools/parser/hlo_lexer.h
+++ b/tensorflow/compiler/xla/service/hlo_lexer.h
@@ -13,12 +13,12 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#ifndef TENSORFLOW_COMPILER_XLA_TOOLS_PARSER_HLO_LEXER_H_
-#define TENSORFLOW_COMPILER_XLA_TOOLS_PARSER_HLO_LEXER_H_
+#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_HLO_LEXER_H_
+#define TENSORFLOW_COMPILER_XLA_SERVICE_HLO_LEXER_H_
 
 #include <string>
 
-#include "tensorflow/compiler/xla/tools/parser/hlo_token.h"
+#include "tensorflow/compiler/xla/service/hlo_token.h"
 #include "tensorflow/compiler/xla/types.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
 #include "tensorflow/core/lib/core/stringpiece.h"
@@ -27,9 +27,11 @@ limitations under the License.
 #include "tensorflow/core/platform/types.h"
 
 namespace xla {
-namespace tools {
 
 // Lexer for the HloModule::ToString() format text.
+//
+// This class is meant to be used by hlo_parser.cc.  You shouldn't need to use
+// it directly.
 class HloLexer {
  public:
   explicit HloLexer(tensorflow::StringPiece buf) : buf_(buf) {
@@ -57,7 +59,7 @@ class HloLexer {
     CHECK(GetKind() == TokKind::kShape);
     return shape_val_;
   }
-  int64 GetInt64Val() const {
+  tensorflow::int64 GetInt64Val() const {
     CHECK(GetKind() == TokKind::kInt);
     return int64_val_;
   }
@@ -114,7 +116,7 @@ class HloLexer {
   TokKind current_kind_;
   string str_val_;
   Shape shape_val_;
-  int64 int64_val_;
+  tensorflow::int64 int64_val_;
   double decimal_val_;
 
   struct LineNoCacheTy {
@@ -125,7 +127,6 @@ class HloLexer {
   mutable LineNoCacheTy line_no_cache_{nullptr, 0};
 };
 
-}  // namespace tools
 }  // namespace xla
 
-#endif  // TENSORFLOW_COMPILER_XLA_TOOLS_PARSER_HLO_LEXER_H_
+#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_HLO_LEXER_H_
diff --git a/tensorflow/compiler/xla/service/hlo_liveness_analysis_test.cc b/tensorflow/compiler/xla/service/hlo_liveness_analysis_test.cc
index 8e2e2c7627ba6ac9e5078446056917a07436cbd7..0275294a1a86cef13e5b267ad578f30cc18858dc 100644
--- a/tensorflow/compiler/xla/service/hlo_liveness_analysis_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_liveness_analysis_test.cc
@@ -18,12 +18,12 @@ limitations under the License.
 #include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_opcode.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/status_macros.h"
 #include "tensorflow/compiler/xla/test.h"
 #include "tensorflow/compiler/xla/test_helpers.h"
 #include "tensorflow/compiler/xla/tests/hlo_test_base.h"
-#include "tensorflow/compiler/xla/tools/parser/hlo_parser.h"
 #include "tensorflow/core/platform/logging.h"
 #include "tensorflow/core/platform/test.h"
 
@@ -59,7 +59,7 @@ class HloLivenessAnalysisTest : public HloTestBase {
 
 // Test that add instruction at entry root is live at all output shape indices.
 TEST_F(HloLivenessAnalysisTest, AddAtEntryRoot) {
-  auto module = tools::Parse(R"(
+  auto module = ParseHloString(R"(
   HloModule SimpleModule
   ENTRY SimpleComputation {
     constant.1 = s32[] constant(0)
@@ -75,7 +75,7 @@ TEST_F(HloLivenessAnalysisTest, AddAtEntryRoot) {
 
 // Test that a dead add instruction is marked as dead by analysis.
 TEST_F(HloLivenessAnalysisTest, DeadAdd) {
-  auto module = tools::Parse(R"(
+  auto module = ParseHloString(R"(
   HloModule SimpleModule
   ENTRY SimpleComputation {
     constant.1 = s32[] constant(0)
@@ -94,7 +94,7 @@ TEST_F(HloLivenessAnalysisTest, DeadAdd) {
 // Test that all output shape indices of entry root tuple (and defining
 // instruction in its output) are marked live.
 TEST_F(HloLivenessAnalysisTest, TupleAtEntryRoot) {
-  auto module = tools::Parse(R"(
+  auto module = ParseHloString(R"(
   HloModule SimpleModule
   ENTRY SimpleComputation {
     constant.1 = s32[] constant(0)
@@ -113,7 +113,7 @@ TEST_F(HloLivenessAnalysisTest, TupleAtEntryRoot) {
 // Tests that all outputs of nested tuple and entry root (and defining
 // instruction values appearing in its output) are marked live.
 TEST_F(HloLivenessAnalysisTest, NestedTupleAtEntryRoot) {
-  auto module = tools::Parse(R"(
+  auto module = ParseHloString(R"(
   HloModule SimpleModule
   ENTRY SimpleComputation {
     constant.1 = s32[] constant(1)
@@ -140,7 +140,7 @@ TEST_F(HloLivenessAnalysisTest, NestedTupleAtEntryRoot) {
 // Tests that GTE at entry root of Tuple instruction only propgates liveness
 // to the live elements in tuple.
 TEST_F(HloLivenessAnalysisTest, GteOfTuple) {
-  auto module = tools::Parse(R"(
+  auto module = ParseHloString(R"(
   HloModule SimpleModule
   ENTRY SimpleComputation {
     constant.1 = s32[] constant(0)
@@ -162,7 +162,7 @@ TEST_F(HloLivenessAnalysisTest, GteOfTuple) {
 // Tests that GTE at entry root of nested Tuple instruction only propgates
 // liveness to the live elements in tuple.
 TEST_F(HloLivenessAnalysisTest, GteOfNestedTuple) {
-  auto module = tools::Parse(R"(
+  auto module = ParseHloString(R"(
   HloModule SimpleModule
   ENTRY SimpleComputation {
     constant.1 = s32[] constant(0)
@@ -199,7 +199,7 @@ TEST_F(HloLivenessAnalysisTest, GteOfNestedTuple) {
 // Tests that GTE of GTE (at entry root) of nested Tuple instruction only
 // propgates liveness to the live elements in tuple.
 TEST_F(HloLivenessAnalysisTest, GteOfGteOfNestedTuple) {
-  auto module = tools::Parse(R"(
+  auto module = ParseHloString(R"(
   HloModule SimpleModule
   ENTRY SimpleComputation {
     constant.1 = s32[] constant(0)
@@ -240,7 +240,7 @@ TEST_F(HloLivenessAnalysisTest, GteOfGteOfNestedTuple) {
 
 // Test that live/dead while tuple elements are marked live/dead correctly.
 TEST_F(HloLivenessAnalysisTest, WhileWithDeadTupleElement) {
-  auto module = tools::Parse(R"(
+  auto module = ParseHloString(R"(
   HloModule SimpleLoop
   SimpleLoop.body {
     loop_var.1 = (s32[], s32[3]{0}) parameter(0)
@@ -291,7 +291,7 @@ TEST_F(HloLivenessAnalysisTest, WhileWithDeadTupleElement) {
 // Tests that a tuple element live in while.cond computation, propagates
 // liveness to while.body.root/while.result/while.operand (where it is unused).
 TEST_F(HloLivenessAnalysisTest, WhileCondPropagatesLiveness) {
-  auto module = tools::Parse(R"(
+  auto module = ParseHloString(R"(
   HloModule SimpleLoop
   SimpleLoop.body {
     loop_var.1 = (s32[], s32[3]{0}) parameter(0)
@@ -345,7 +345,7 @@ TEST_F(HloLivenessAnalysisTest, WhileCondPropagatesLiveness) {
 // Tests that a use of while.result{0} propagates liveness to
 // while.body.param{1} to while.body.root{1}, and then to while.body.param{2}.
 TEST_F(HloLivenessAnalysisTest, WhileWithLiveTupleElements) {
-  auto module = tools::Parse(R"(
+  auto module = ParseHloString(R"(
   HloModule SimpleLoop
   SimpleLoop.body {
     loop_var.1 = (s32[], s32[], s32[]) parameter(0)
diff --git a/tensorflow/compiler/xla/service/hlo_matchers.h b/tensorflow/compiler/xla/service/hlo_matchers.h
index c33bdadf1c7145bf2aff09b01423c6c21382da0c..c570b420c21fed4d7828feb24ee5c7859db94a79 100644
--- a/tensorflow/compiler/xla/service/hlo_matchers.h
+++ b/tensorflow/compiler/xla/service/hlo_matchers.h
@@ -17,6 +17,7 @@ limitations under the License.
 #define TENSORFLOW_COMPILER_XLA_SERVICE_HLO_MATCHERS_H_
 
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/test.h"
 #include "tensorflow/core/lib/gtl/optional.h"
 
@@ -324,6 +325,12 @@ inline ::testing::Matcher<const ::xla::HloInstruction*> Sharding(
   return ::testing::MakeMatcher(
       new ::xla::testing::HloShardingMatcher(sharding));
 }
+// Matcher for Sharding from sharding string
+inline ::testing::Matcher<const ::xla::HloInstruction*> Sharding(
+    tensorflow::StringPiece sharding) {
+  return ::testing::MakeMatcher(new ::xla::testing::HloShardingMatcher(
+      ParseSharding(sharding).ValueOrDie()));
+}
 // Verifies that no HloSharding is set for an HLO instruction.
 inline ::testing::Matcher<const ::xla::HloInstruction*> NoSharding() {
   return ::testing::MakeMatcher(
diff --git a/tensorflow/compiler/xla/service/hlo_matchers_test.cc b/tensorflow/compiler/xla/service/hlo_matchers_test.cc
index 016cc01e33840aa195dfc0a21e8ac8f3d24a3e06..9a3010cf1ff75e840130d8442bbe26d6041cef25 100644
--- a/tensorflow/compiler/xla/service/hlo_matchers_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_matchers_test.cc
@@ -14,8 +14,8 @@ limitations under the License.
 ==============================================================================*/
 
 #include "tensorflow/compiler/xla/service/hlo_matchers.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/shape_util.h"
-#include "tensorflow/compiler/xla/tools/parser/hlo_parser.h"
 
 namespace op = xla::testing::opcode_matchers;
 using ::testing::_;
@@ -147,6 +147,18 @@ TEST(HloMatchersTest, ShardingMatcher) {
                                             "param.1");
   p1->set_sharding(HloSharding::AssignDevice(1));
 
+  auto tuple_shape = ShapeUtil::MakeTupleShape(
+      {ShapeUtil::MakeShape(F32, {7}), ShapeUtil::MakeShape(S32, {9}),
+       ShapeUtil::MakeShape(F32, {11})});
+  auto p2 = HloInstruction::CreateParameter(1, tuple_shape, "param.2");
+  Array<int64> assignment({2});
+  assignment.SetValues({0, 1});
+  auto sharding = HloSharding::Tuple(
+      tuple_shape,
+      {HloSharding::Tile(ShapeUtil::MakeShape(F32, {5}), assignment),
+       HloSharding::AssignDevice(1), HloSharding::Replicate()});
+  p2->set_sharding(sharding);
+
   EXPECT_THAT(p0.get(), op::NoSharding());
   EXPECT_THAT(p0.get(),
               ::testing::Not(op::Sharding(HloSharding::AssignDevice(1))));
@@ -155,6 +167,11 @@ TEST(HloMatchersTest, ShardingMatcher) {
               ::testing::Not(op::Sharding(HloSharding::AssignDevice(0))));
   EXPECT_THAT(p1.get(), op::Sharding(HloSharding::AssignDevice(1)));
 
+  EXPECT_THAT(
+      p2.get(),
+      op::Sharding(
+          "{{f32[5] devices=[2]0,1}, {maximal device=1}, {replicated}}"));
+
   EXPECT_THAT(Explain(p0.get(), op::Sharding(HloSharding::AssignDevice(1))),
               "%param.0 = f32[5]{0} parameter(0) has no sharding (expected: "
               "{maximal device=1})");
@@ -178,7 +195,7 @@ ENTRY DotOperationFusion_TransposeFusion {
 )";
 
   TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
-                          tools::Parse(hlo_string));
+                          ParseHloString(hlo_string));
   HloInstruction* root = module->entry_computation()->root_instruction();
 
   EXPECT_THAT(root, op::Dot(op::Parameter(0), op::Parameter(1),
diff --git a/tensorflow/compiler/xla/service/hlo_module.cc b/tensorflow/compiler/xla/service/hlo_module.cc
index fbf1d58007e318a8a08aa9e11d9d54811533703e..9c59374b4a9d7e3dbfb99d8a6b30d4230e553658 100644
--- a/tensorflow/compiler/xla/service/hlo_module.cc
+++ b/tensorflow/compiler/xla/service/hlo_module.cc
@@ -32,15 +32,6 @@ limitations under the License.
 
 namespace xla {
 
-HloModule::HloModule(const string& name,
-                     const VersionedComputationHandle& entry_computation_handle,
-                     const HloModuleConfig& config)
-    : name_(NameUniquer::GetSanitizedName(name)),
-      config_(config),
-      has_entry_computation_handle_(true),
-      entry_computation_handle_(entry_computation_handle),
-      unique_id_(next_unique_module_id_++) {}
-
 HloModule::HloModule(const string& name, const HloModuleConfig& config)
     : name_(NameUniquer::GetSanitizedName(name)),
       config_(config),
@@ -234,8 +225,7 @@ HloModuleProto HloModule::ToProto() const {
 
 /* static */
 StatusOr<std::unique_ptr<HloModule>> HloModule::CreateFromProto(
-    const HloModuleProto& proto, const HloModuleConfig& module_config,
-    const VersionedComputationHandle& entry_computation_handle) {
+    const HloModuleProto& proto, const HloModuleConfig& module_config) {
   // The ProgramShape in the passed in module config must match the shapes of
   // the entry parameters and root.
   TF_RET_CHECK(proto.has_program_shape())
@@ -287,8 +277,7 @@ StatusOr<std::unique_ptr<HloModule>> HloModule::CreateFromProto(
   }
   TF_RET_CHECK(entry != nullptr);
 
-  auto module = MakeUnique<HloModule>(proto.name(), entry_computation_handle,
-                                      module_config);
+  auto module = MakeUnique<HloModule>(proto.name(), module_config);
 
   // Sort the computations in the proto id's order.
   std::sort(computations.begin(), computations.end(),
@@ -401,7 +390,7 @@ HloInstruction* HloModule::OutlineExpressionFromComputation(
         // as a parameter in the new function.
         arguments.push_back(old_operand);
         *operand_slot = builder.AddInstruction(HloInstruction::CreateParameter(
-            parameter_count, old_operand->shape(), ""));
+            parameter_count, old_operand->shape(), "p"));
         ++parameter_count;
       }
       TF_CHECK_OK(
@@ -496,7 +485,18 @@ std::list<HloComputation*> HloModule::MakeComputationPostOrder() const {
       added_computations.insert(computation.get());
     }
   }
-  CHECK_EQ(post_order.size(), computations_.size());
+  if (post_order.size() != computations_.size()) {
+    for (HloComputation* computation : post_order) {
+      LOG(ERROR) << "Post Order: " << computation->name() << " ("
+                 << computation->parent()->name() << ")";
+    }
+    for (auto& computation : computations_) {
+      LOG(ERROR) << "Computations: " << computation->name() << " ("
+                 << computation->parent()->name() << ")";
+    }
+    LOG(FATAL) << "Mismatch computation count: post_order=" << post_order.size()
+               << " computation_count=" << computations_.size();
+  }
   return post_order;
 }
 
@@ -514,57 +514,26 @@ std::vector<HloComputation*> HloModule::MakeNonfusionComputations() const {
 std::unique_ptr<HloModule> HloModule::Clone(const string& suffix) const {
   VLOG(1) << "Cloning module :" << name_ << " --> " << suffix << "\n";
   auto module = MakeUnique<HloModule>(name_ + "-" + suffix, config_);
-  module->entry_computation_handle_ = entry_computation_handle_;
-  module->has_entry_computation_handle_ = has_entry_computation_handle_;
-
-  std::unordered_map<HloComputation*, HloComputation*> clone_map;
-  for (auto& computation : computations_) {
-    if (computation->IsFusionComputation()) {
-      // Cloning of a fused computation is handled by its fusion instruction.
-      continue;
-    }
 
-    // When cloning a computation, pass in the new module, so that for any
-    // fusion instruction in this computation, the fused computation will be
-    // deep cloned to the new module.
-    auto cloned_computation = computation->Clone(suffix, module.get());
-    InsertOrDie(&clone_map, computation.get(), cloned_computation.get());
-
-    if (entry_computation_ == computation.get()) {
-      module->AddEntryComputation(std::move(cloned_computation));
-    } else {
-      module->AddEmbeddedComputation(std::move(cloned_computation));
-    }
-  }
-
-  for (auto& cloned_computation : module->computations_) {
-    for (auto* instruction : cloned_computation->instructions()) {
-      // Rewrite instruction's called_computation to point to the cloned
-      // computations.
-      instruction->ReplaceCalledComputations([&](HloComputation* hlo) {
-        if (hlo->IsFusionComputation()) {
-          // Cloning of a fused computation has already been handled when its
-          // fusion instruction is cloned. So this hlo computation is already
-          // the cloned one.
-          return hlo;
-        }
-        return FindOrDie(clone_map, hlo);
-      });
-    }
-  }
+  HloCloneContext context(module.get(), suffix);
+  auto cloned_computation = entry_computation_->Clone(suffix, &context);
+  module->AddEntryComputation(std::move(cloned_computation));
   return module;
 }
 
-HloComputation* HloModule::DeepCloneComputation(HloComputation* computation) {
-  HloComputation* clone = AddEmbeddedComputation(computation->Clone("", this));
-  TF_CHECK_OK(
-      clone->root_instruction()->Accept([this](HloInstruction* instruction) {
-        instruction->ReplaceCalledComputations([this](HloComputation* callee) {
-          return DeepCloneComputation(callee);
-        });
-        return Status::OK();
-      }));
-  return clone;
+HloComputation* HloModule::DeepCloneComputation(HloComputation* computation,
+                                                HloCloneContext* context) {
+  HloComputation* new_computation;
+  if (context != nullptr) {
+    if ((new_computation = context->FindComputation(computation)) != nullptr) {
+      return new_computation;
+    }
+    new_computation =
+        AddEmbeddedComputation(computation->Clone(context->suffix(), context));
+  } else {
+    new_computation = AddEmbeddedComputation(computation->Clone(""));
+  }
+  return new_computation;
 }
 
 uint64 HloModule::RandomNew64() const {
diff --git a/tensorflow/compiler/xla/service/hlo_module.h b/tensorflow/compiler/xla/service/hlo_module.h
index 02918c377776b73f2086fe41afc406567a12af4c..757e65bda286d983d05e5a791aa7dffe97bac945 100644
--- a/tensorflow/compiler/xla/service/hlo_module.h
+++ b/tensorflow/compiler/xla/service/hlo_module.h
@@ -26,11 +26,11 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/iterator_util.h"
 #include "tensorflow/compiler/xla/service/hlo.pb.h"
+#include "tensorflow/compiler/xla/service/hlo_clone_context.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
 #include "tensorflow/compiler/xla/service/hlo_module_config.h"
 #include "tensorflow/compiler/xla/service/name_uniquer.h"
-#include "tensorflow/compiler/xla/service/versioned_computation_handle.h"
 #include "tensorflow/compiler/xla/types.h"
 #include "tensorflow/core/lib/core/stringpiece.h"
 #include "tensorflow/core/lib/gtl/array_slice.h"
@@ -56,10 +56,6 @@ namespace xla {
 // attached to.
 class HloModule {
  public:
-  HloModule(const string& name,
-            const VersionedComputationHandle& entry_computation_handle,
-            const HloModuleConfig& config);
-
   // Constructor without a versioned computation handle. This constructor should
   // only be used for HloModules used outside of the XLA service (eg
   // tests). The versioned handle is used by the service in the compilation
@@ -94,8 +90,10 @@ class HloModule {
   std::unique_ptr<HloModule> Clone(const string& suffix = "clone") const;
 
   // Performs a deep clone of the computation, by recursively cloning all
-  // the called computations as well.
-  HloComputation* DeepCloneComputation(HloComputation* computation);
+  // the called computations as well. If the clone context is specified, it
+  // will be populated with the cloned object mappings.
+  HloComputation* DeepCloneComputation(HloComputation* computation,
+                                       HloCloneContext* context = nullptr);
 
   // Return a pointer to the entry computation of the module..
   const HloComputation* entry_computation() const {
@@ -123,10 +121,6 @@ class HloModule {
     return config_.device_entry_computation_layout();
   }
 
-  const VersionedComputationHandle& entry_computation_handle() const {
-    return entry_computation_handle_;
-  }
-
   // Gets the computations in this module.
   //
   // Returns a view of HloComputation*s, so you can iterate over this in the
@@ -185,9 +179,7 @@ class HloModule {
   // Convert an HloModule to or from a proto.
   HloModuleProto ToProto() const;
   static StatusOr<std::unique_ptr<HloModule>> CreateFromProto(
-      const HloModuleProto& proto, const HloModuleConfig& module_config,
-      const VersionedComputationHandle& entry_computation_handle =
-          VersionedComputationHandle());
+      const HloModuleProto& proto, const HloModuleConfig& module_config);
 
   // Creates and returns an HloModuleConfig with an appropriate program shape
   // for the HLO module in the given proto.
@@ -261,10 +253,6 @@ class HloModule {
   mutable std::mt19937_64 rng_{42};
   mutable tensorflow::mutex rng_mutex_;
 
-  // Versioned handle of the entry computation of the module.
-  bool has_entry_computation_handle_ = false;
-  VersionedComputationHandle entry_computation_handle_;
-
   // Unique name generator for computation and instruction names, which are
   // unique per module.
   NameUniquer computation_name_uniquer_{/*separator=*/"."};
diff --git a/tensorflow/compiler/xla/service/hlo_module_dce_test.cc b/tensorflow/compiler/xla/service/hlo_module_dce_test.cc
index 53b7d0ed3964ca8a2c3bb73c62015a1c7dbfe487..363862e4905fc13a4ef07aeaac255259fc6b86ba 100644
--- a/tensorflow/compiler/xla/service/hlo_module_dce_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_module_dce_test.cc
@@ -19,11 +19,11 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
 #include "tensorflow/compiler/xla/service/hlo_module.h"
 #include "tensorflow/compiler/xla/service/hlo_opcode.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/tests/hlo_test_base.h"
 #include "tensorflow/compiler/xla/tests/literal_test_util.h"
 #include "tensorflow/compiler/xla/tests/test_utils.h"
-#include "tensorflow/compiler/xla/tools/parser/hlo_parser.h"
 #include "tensorflow/compiler/xla/types.h"
 #include "tensorflow/core/lib/core/status_test_util.h"
 #include "tensorflow/core/platform/types.h"
@@ -73,7 +73,7 @@ class HloModuleDceTest : public HloTestBase {
 
 // Tests that a while with all outputs live is unmodified.
 TEST_F(HloModuleDceTest, WhileWithLiveOutputs) {
-  auto module = tools::Parse(R"(
+  auto module = ParseHloString(R"(
   HloModule SimpleLoop
   SimpleLoop.body {
     loop_var.1 = (s32[], s32[3]{0}) parameter(0)
@@ -110,7 +110,7 @@ TEST_F(HloModuleDceTest, WhileWithLiveOutputs) {
 // Tests a while loop with one unused output (which is used in the while loop
 // body by an instruction with side-effects: rng) is unmodified.
 TEST_F(HloModuleDceTest, WhileWithUnusedSideEffectingTupleElement) {
-  auto module = tools::Parse(R"(
+  auto module = ParseHloString(R"(
   HloModule SimpleLoop
   SimpleLoop.body {
     loop_var.1 = (s32[], f32[]) parameter(0)
@@ -150,7 +150,7 @@ TEST_F(HloModuleDceTest, WhileWithUnusedSideEffectingTupleElement) {
 // Tests that a while loop with one dead tuple element at {1} has its while
 // loop body modified to make that tuple element pass-through the while body.
 TEST_F(HloModuleDceTest, OneWhileWithDeadTupleElement) {
-  auto module = tools::Parse(R"(
+  auto module = ParseHloString(R"(
   HloModule SimpleLoop
   SimpleLoop.body {
     loop_var.1 = (s32[], s32[3]{0}) parameter(0)
@@ -193,7 +193,7 @@ TEST_F(HloModuleDceTest, OneWhileWithDeadTupleElement) {
 // dead in while.body{1} and at while.result{1}) propgates liveness of this
 // tuple element to while.body{1} and at while.result{1}.
 TEST_F(HloModuleDceTest, OneWhileWithTupleElementUsedByCond) {
-  auto module = tools::Parse(R"(
+  auto module = ParseHloString(R"(
   HloModule SimpleLoop
   SimpleLoop.body {
     loop_var.1 = (s32[], s32[]) parameter(0)
@@ -235,7 +235,7 @@ TEST_F(HloModuleDceTest, OneWhileWithTupleElementUsedByCond) {
 // Tests that HloModuleDCE can remove a dead tuple element at index {1} between
 // two dependent while loops.
 TEST_F(HloModuleDceTest, TwoWhilesWithDeadTupleElement) {
-  auto module = tools::Parse(R"(
+  auto module = ParseHloString(R"(
   HloModule SimpleLoop
   SimpleLoop.body0 {
     loop_var.1 = (s32[], s32[3]{0}) parameter(0)
@@ -303,7 +303,7 @@ TEST_F(HloModuleDceTest, TwoWhilesWithDeadTupleElement) {
 // Tests that HloModuleDCE can remove a dead tuple element at while.1{0} and
 // while.2{1}, between two dependent while loops.
 TEST_F(HloModuleDceTest, TwoWhilesWithDeadTupleElementSwizzled) {
-  auto module = tools::Parse(R"(
+  auto module = ParseHloString(R"(
   HloModule SimpleLoop
   SimpleLoop.body0 {
     loop_var.1 = (s32[3]{0}, s32[]) parameter(0)
diff --git a/tensorflow/compiler/xla/service/hlo_module_group_metadata.cc b/tensorflow/compiler/xla/service/hlo_module_group_metadata.cc
index b4cd3c730e323b8459312edbebc564e08f9d6840..bf33640db16638803f4f8e6c66f35d6bb6e2c9fe 100644
--- a/tensorflow/compiler/xla/service/hlo_module_group_metadata.cc
+++ b/tensorflow/compiler/xla/service/hlo_module_group_metadata.cc
@@ -87,6 +87,7 @@ Status HloModuleGroupMetadata::Build() {
           << "Peer instruction does not match the computation kind";
       TF_RETURN_IF_ERROR(
           AddCompanion(tracked->instruction(), peer_tracked->instruction()));
+      tracked_instructions_comms_[tracked->instruction()].push_back(hlo);
     }
 
     // Add the parents of companion instructions (they must be all of the same
@@ -112,27 +113,43 @@ Status HloModuleGroupMetadata::Build() {
     }
   }
   TF_RETURN_IF_ERROR(VerifyCompanionSets());
+  if (VLOG_IS_ON(4)) {
+    DumpCollectedStats();
+  }
   return Status::OK();
 }
 
 Status HloModuleGroupMetadata::VerifyCompanionSets() const {
-  // TODO(dlibenzi): Migrate this to use the device instead of module ID, once
-  // the kDomain CL goes in.
   for (const auto& companions : companion_sets_) {
     // A companion set must be composed at most of an instruction per
     // device/module.
     std::unordered_set<int64> devices;
     for (HloInstruction* instruction : *companions) {
-      int64 device = GetModuleId(instruction->parent()->parent());
-      if (!devices.insert(device).second) {
-        std::stringstream ss;
-        ss << "Companion set:" << std::endl;
-        for (HloInstruction* hlo : *companions) {
-          ss << "  " << hlo->name() << " ("
-             << GetModuleId(hlo->parent()->parent()) << ")" << std::endl;
+      // Go through all the communicating instructions (send, recv) of the given
+      // companion, and record their device.
+      auto it = tracked_instructions_comms_.find(instruction);
+      if (it == tracked_instructions_comms_.end()) {
+        // Companions can be added even if they have no communicating
+        // instructions, if they are parent of companions.
+        continue;
+      }
+      std::unordered_set<int64> comm_devices;
+      for (HloInstruction* comm_instruction : it->second) {
+        auto device = GetInstructionDevice(*comm_instruction);
+        TF_RET_CHECK(device) << "Instruction " << comm_instruction->ToString()
+                             << " does not have a device";
+        comm_devices.insert(*device);
+      }
+      for (int64 device : comm_devices) {
+        if (!devices.insert(device).second) {
+          std::stringstream ss;
+          ss << "Companion set:" << std::endl;
+          for (HloInstruction* hlo : *companions) {
+            ss << "  " << hlo->name() << std::endl;
+          }
+          ss << "has multiple instructions on the same device";
+          return FailedPrecondition("%s", ss.str().c_str());
         }
-        ss << "has multiple instructions on the same device";
-        return FailedPrecondition("%s", ss.str().c_str());
       }
     }
   }
@@ -223,6 +240,28 @@ int64 HloModuleGroupMetadata::GetModuleId(const HloModule* module) const {
   LOG(FATAL) << "unknown module";
 }
 
+tensorflow::gtl::optional<int64> HloModuleGroupMetadata::GetInstructionDevice(
+    const HloInstruction& instruction) const {
+  // The module group metadata can be created in both "single module, multiple
+  // devices" and "multiple modules, no explicit devices" fashions.
+  // The API returns an optional even though the current implementation always
+  // returns a device, to account for cases where we cannot guess a device.
+  // In such cases the VerifyChannelInstructions() will return proper errors.
+  tensorflow::gtl::optional<int64> device =
+      instruction.sharding_unique_device();
+  if (!device) {
+    device = GetModuleId(instruction.parent()->parent());
+  }
+  return device;
+}
+
+int64 HloModuleGroupMetadata::GetDeviceModulesCount() const {
+  return std::count_if(modules_.begin(), modules_.end(),
+                       [](const HloModule* module) {
+                         return !module->config().is_host_module();
+                       });
+}
+
 Status HloModuleGroupMetadata::RecordInstructions() {
   const auto visitor = [this](HloInstruction* hlo) -> Status {
     if (hlo->opcode() == HloOpcode::kWhile) {
@@ -284,6 +323,7 @@ Status HloModuleGroupMetadata::RecordInstructions() {
       TF_RETURN_IF_ERROR(computation->Accept(visitor));
     }
   }
+  VLOG(2) << "Created " << channels_.size() << " channels";
   return Status::OK();
 }
 
@@ -346,26 +386,38 @@ Status HloModuleGroupMetadata::VerifyChannelInstructions() {
     if (!ShapeUtil::Compatible(send_shape, recv_shape)) {
       return FailedPrecondition("send/recv shapes do not match");
     }
-    const HloModule* send_module = channel.send->parent()->parent();
-    const HloModule* send_done_module = channel.send_done->parent()->parent();
-    if (send_module != send_done_module) {
+    auto send_device = GetInstructionDevice(*channel.send);
+    auto send_done_device = GetInstructionDevice(*channel.send_done);
+    if (!send_device) {
+      return FailedPrecondition("send instruction must have a device: %s",
+                                channel.send->ToString().c_str());
+    }
+    if (!send_done_device) {
+      return FailedPrecondition("send_done instruction must have a device: %s",
+                                channel.send_done->ToString().c_str());
+    }
+    if (*send_device != *send_done_device) {
       return FailedPrecondition(
           "send and send-done (channel=%lld) must be on the same device: %lld "
           "vs. %lld",
-          channel.id, GetModuleId(send_module), GetModuleId(send_done_module));
+          channel.id, *send_device, *send_done_device);
     }
-    const HloModule* recv_module = channel.recv->parent()->parent();
-    const HloModule* recv_done_module = channel.recv_done->parent()->parent();
-    if (recv_module != recv_done_module) {
+    auto recv_device = GetInstructionDevice(*channel.recv);
+    auto recv_done_device = GetInstructionDevice(*channel.recv_done);
+    if (!recv_done_device) {
+      return FailedPrecondition("recv_done instruction must have a device: %s",
+                                channel.recv_done->ToString().c_str());
+    }
+    if (*recv_device != *recv_done_device) {
       return FailedPrecondition(
           "recv and recv-done (channel=%lld) must be on the same device: %lld "
           "vs. %lld",
-          channel.id, GetModuleId(recv_module), GetModuleId(recv_done_module));
+          channel.id, *recv_device, *recv_done_device);
     }
-    if (send_module == recv_module) {
+    if (*send_device == *recv_device) {
       return FailedPrecondition(
           "send and recv (channel=%lld) must be on different devices: %lld",
-          channel.id, GetModuleId(send_module));
+          channel.id, *send_device);
     }
   }
 
@@ -402,4 +454,36 @@ Status HloModuleGroupMetadata::CheckCommunicatingInstruction(
   return FailedPrecondition("channel is used in disallowed computation");
 }
 
+void HloModuleGroupMetadata::DumpCollectedStats() const {
+  std::map<std::pair<int64, int64>, int64> communication_histogram;
+  for (auto& channel : channels_) {
+    auto from_device = GetInstructionDevice(*channel.send);
+    auto to_device = GetInstructionDevice(*channel.recv);
+    LOG(INFO) << "Channel " << channel.id << ": from_device=" << *from_device
+              << " to_device=" << *to_device << " send=" << channel.send->name()
+              << " send_done=" << channel.send_done->name()
+              << " recv=" << channel.recv->name()
+              << " recv_done=" << channel.recv_done->name();
+    communication_histogram[std::pair<int64, int64>(*from_device,
+                                                    *to_device)] += 1;
+  }
+  for (auto& fromto_count : communication_histogram) {
+    LOG(INFO) << "From " << fromto_count.first.first << " to "
+              << fromto_count.first.second << ": " << fromto_count.second;
+  }
+  for (auto& companion_set : companion_sets_) {
+    LOG(INFO) << "Companion set:";
+    for (HloInstruction* instruction : *companion_set) {
+      LOG(INFO) << "  " << instruction->name();
+    }
+  }
+  for (auto& instruction_comm : tracked_instructions_comms_) {
+    LOG(INFO) << "Communicating instruction " << instruction_comm.first->name();
+    for (HloInstruction* instruction : instruction_comm.second) {
+      auto device = GetInstructionDevice(*instruction);
+      LOG(INFO) << "  " << instruction->name() << " on device " << *device;
+    }
+  }
+}
+
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/hlo_module_group_metadata.h b/tensorflow/compiler/xla/service/hlo_module_group_metadata.h
index 3ef4542f9129632de4975688ae7e9e2c5f43a7ee..ffde3a332dfc141ca928a44cfdf4686900e9f47b 100644
--- a/tensorflow/compiler/xla/service/hlo_module_group_metadata.h
+++ b/tensorflow/compiler/xla/service/hlo_module_group_metadata.h
@@ -29,6 +29,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/core/lib/core/status.h"
 #include "tensorflow/core/lib/gtl/flatmap.h"
+#include "tensorflow/core/lib/gtl/optional.h"
 #include "tensorflow/core/platform/types.h"
 
 namespace xla {
@@ -148,6 +149,15 @@ class HloModuleGroupMetadata {
   // the module in the module vector.
   int64 GetModuleId(const HloModule* module) const;
 
+  // Retrieves the device an instruction is assigned to. Either from the
+  // sharding information, or from the ordinal of the module the instruction
+  // is in.
+  tensorflow::gtl::optional<int64> GetInstructionDevice(
+      const HloInstruction& instruction) const;
+
+  // Returns the number of modules for devices (excluding the host module).
+  int64 GetDeviceModulesCount() const;
+
   // Returns the companion instructions for the given instruction.
   //
   // Precondition: IsCompanionWhile(instruction) is true.
@@ -220,6 +230,9 @@ class HloModuleGroupMetadata {
     return it != tracked_instructions_.end() ? &it->second : nullptr;
   }
 
+  // Dump all the collected module group statistics to the logs.
+  void DumpCollectedStats() const;
+
   // List of all companion instructions sets in the module.
   std::vector<std::unique_ptr<std::unordered_set<HloInstruction*>>>
       companion_sets_;
@@ -231,6 +244,11 @@ class HloModuleGroupMetadata {
   tensorflow::gtl::FlatMap<const HloComputation*, TrackedInstruction>
       tracked_instructions_;
 
+  // Maps tracked instructions (kWhile, kConditional, kCall, ...) to the set of
+  // communicating instructions within the proper called computation(s).
+  tensorflow::gtl::FlatMap<HloInstruction*, std::vector<HloInstruction*>>
+      tracked_instructions_comms_;
+
   // All channels in the module.
   std::vector<Channel> channels_;
 
diff --git a/tensorflow/compiler/xla/service/hlo_opcode.h b/tensorflow/compiler/xla/service/hlo_opcode.h
index ac7cd2f2f517cf8831416d9265fc48bbf9fce340..a35546f5f41b149d119ee141fd734da8bfd055b2 100644
--- a/tensorflow/compiler/xla/service/hlo_opcode.h
+++ b/tensorflow/compiler/xla/service/hlo_opcode.h
@@ -69,6 +69,7 @@ namespace xla {
   V(kCrossReplicaSum, "cross-replica-sum")                   \
   V(kCustomCall, "custom-call")                              \
   V(kDivide, "divide")                                       \
+  V(kDomain, "domain")                                       \
   V(kDot, "dot")                                             \
   V(kDynamicSlice, "dynamic-slice")                          \
   V(kDynamicUpdateSlice, "dynamic-update-slice")             \
@@ -80,6 +81,7 @@ namespace xla {
   V(kFusion, "fusion", kHloOpcodeIsVariadic)                 \
   V(kGather, "gather")                                       \
   V(kGe, "greater-than-or-equal-to", kHloOpcodeIsComparison) \
+  V(kGenerateToken, "generate-token", kHloOpcodeIsVariadic)  \
   V(kGetTupleElement, "get-tuple-element")                   \
   V(kGt, "greater-than", kHloOpcodeIsComparison)             \
   V(kHostCompute, "host-compute")                            \
diff --git a/tensorflow/compiler/xla/service/hlo_opcode_test.cc b/tensorflow/compiler/xla/service/hlo_opcode_test.cc
index cd2ce5c69f030c65b889d67e082a3677b8739ddb..774345124b4ad62e35d9423a23f1dbaa28e44d80 100644
--- a/tensorflow/compiler/xla/service/hlo_opcode_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_opcode_test.cc
@@ -58,6 +58,7 @@ TEST(HloOpcodeTest, OpcodeProperties) {
       case HloOpcode::kConcatenate:
       case HloOpcode::kFusion:
       case HloOpcode::kMap:
+      case HloOpcode::kGenerateToken:
       case HloOpcode::kTuple:
         EXPECT_TRUE(HloOpcodeIsVariadic(opcode));
         break;
diff --git a/tensorflow/compiler/xla/service/hlo_ordering.h b/tensorflow/compiler/xla/service/hlo_ordering.h
index ee526d8dd7f7e81b3a846741d3e452935f486bd2..985f3fa64d8767b0c0063ee900f7d11c3b7f6d4a 100644
--- a/tensorflow/compiler/xla/service/hlo_ordering.h
+++ b/tensorflow/compiler/xla/service/hlo_ordering.h
@@ -183,6 +183,10 @@ class DependencyHloOrdering : public PredecessorHloOrdering {
 // interference is reduced relative to DependencyHloOrdering.
 class SequentialHloOrdering : public HloOrdering {
  public:
+  // TODO(dimvar): HloModuleSequence is not a good name because it sounds like
+  // a sequence of modules, instead of a map of schedules for all computations
+  // in a module. We should change it at some point.
+  //
   // A sequence of instructions for each computation in the module.
   using HloModuleSequence =
       tensorflow::gtl::FlatMap<const HloComputation*,
diff --git a/tensorflow/compiler/xla/service/hlo_ordering_test.cc b/tensorflow/compiler/xla/service/hlo_ordering_test.cc
index 37a7fbad97cea2f34798efecc2489e57d1374f35..cfe5dace05ac03f1573f90b2ce664c94837837b4 100644
--- a/tensorflow/compiler/xla/service/hlo_ordering_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_ordering_test.cc
@@ -22,10 +22,10 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/hlo_dataflow_analysis.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
 #include "tensorflow/compiler/xla/service/hlo_opcode.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/service/hlo_scheduling.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/tests/hlo_test_base.h"
-#include "tensorflow/compiler/xla/tools/parser/hlo_parser.h"
 #include "tensorflow/compiler/xla/types.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
 
@@ -310,7 +310,7 @@ ENTRY while.v11 {
 })";
 
   TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
-                          tools::Parse(module_str));
+                          ParseHloString(module_str));
   DependencyHloOrdering ordering(module.get());
   ordering.ToString();  // Shouldn't crash.
 }
@@ -347,7 +347,7 @@ ENTRY root {
 })";
 
   TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
-                          tools::Parse(module_str));
+                          ParseHloString(module_str));
   TF_ASSERT_OK_AND_ASSIGN(auto dataflow,
                           HloDataflowAnalysis::Run(*module, /*ssa_form=*/true));
   DependencyHloOrdering ordering(module.get());
diff --git a/tensorflow/compiler/xla/tools/parser/hlo_parser.cc b/tensorflow/compiler/xla/service/hlo_parser.cc
similarity index 86%
rename from tensorflow/compiler/xla/tools/parser/hlo_parser.cc
rename to tensorflow/compiler/xla/service/hlo_parser.cc
index d0e7af8844203da93dac5b45cb7e13916448dd47..fef475380c5c810e1c4712406dde6b1135be3d97 100644
--- a/tensorflow/compiler/xla/tools/parser/hlo_parser.cc
+++ b/tensorflow/compiler/xla/service/hlo_parser.cc
@@ -13,10 +13,12 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#include "tensorflow/compiler/xla/tools/parser/hlo_parser.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
 
 #include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/service/hlo_domain_metadata.h"
 #include "tensorflow/compiler/xla/service/hlo_opcode.h"
+#include "tensorflow/compiler/xla/service/hlo_sharding_metadata.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/util.h"
 #include "tensorflow/core/lib/gtl/map_util.h"
@@ -24,18 +26,17 @@ limitations under the License.
 #include "tensorflow/core/lib/strings/stringprintf.h"
 
 namespace xla {
-namespace tools {
 
 namespace {
 
-using tensorflow::StringPiece;
-using tensorflow::gtl::optional;
-using tensorflow::str_util::Join;
-using tensorflow::str_util::Split;
-using tensorflow::str_util::SplitAndParseAsInts;
-using tensorflow::strings::Printf;
-using tensorflow::strings::StrAppend;
-using tensorflow::strings::StrCat;
+using ::tensorflow::StringPiece;
+using ::tensorflow::gtl::optional;
+using ::tensorflow::str_util::Join;
+using ::tensorflow::str_util::Split;
+using ::tensorflow::str_util::SplitAndParseAsInts;
+using ::tensorflow::strings::Printf;
+using ::tensorflow::strings::StrAppend;
+using ::tensorflow::strings::StrCat;
 
 const double kF16max = 65504;
 
@@ -56,6 +57,11 @@ class HloParser {
   // Returns the error information.
   string GetError() const { return Join(error_, "\n"); }
 
+  // Stand alone parsing utils for various aggregate data types.
+  StatusOr<HloSharding> ParseShardingOnly();
+  StatusOr<Window> ParseWindowOnly();
+  StatusOr<ConvolutionDimensionNumbers> ParseConvolutionDimensionNumbersOnly();
+
  private:
   // ParseXXX returns false if an error occurred.
   bool ParseHloModule();
@@ -78,11 +84,15 @@ class HloParser {
 
   // Sets the sub-value of literal at the given index to the given value. The
   // literal's shape must have the default layout.
-  bool SetValueInLiteral(int64 value, int64 linear_index, Literal* literal);
-  bool SetValueInLiteral(double value, int64 linear_index, Literal* literal);
-  bool SetValueInLiteral(bool value, int64 linear_index, Literal* literal);
+  bool SetValueInLiteral(tensorflow::int64 value,
+                         tensorflow::int64 linear_index, Literal* literal);
+  bool SetValueInLiteral(double value, tensorflow::int64 linear_index,
+                         Literal* literal);
+  bool SetValueInLiteral(bool value, tensorflow::int64 linear_index,
+                         Literal* literal);
   template <typename LiteralNativeT, typename ParsedElemT>
-  bool SetValueInLiteralHelper(ParsedElemT value, int64 linear_index,
+  bool SetValueInLiteralHelper(ParsedElemT value,
+                               tensorflow::int64 linear_index,
                                Literal* literal);
 
   bool ParseOperands(std::vector<HloInstruction*>* operands);
@@ -94,9 +104,15 @@ class HloParser {
   // Describes the start, limit, and stride on every dimension of the operand
   // being sliced.
   struct SliceRanges {
-    std::vector<int64> starts;
-    std::vector<int64> limits;
-    std::vector<int64> strides;
+    std::vector<tensorflow::int64> starts;
+    std::vector<tensorflow::int64> limits;
+    std::vector<tensorflow::int64> strides;
+  };
+
+  // The data parsed for the kDomain instruction.
+  struct DomainData {
+    std::unique_ptr<DomainMetadata> entry_metadata;
+    std::unique_ptr<DomainMetadata> exit_metadata;
   };
 
   // Types of attributes.
@@ -117,6 +133,7 @@ class HloParser {
     kMetadata,
     kFusionKind,
     kDistribution,
+    kDomain,
   };
 
   struct AttrConfig {
@@ -164,21 +181,27 @@ class HloParser {
   bool ParseComputationName(HloComputation** value);
   // Parses a list of names and finds the corresponding hlo instructions.
   bool ParseInstructionNames(std::vector<HloInstruction*>* instructions);
-  bool ParseWindow(Window* window);
+  // Pass expect_outer_curlies == true when parsing a Window in the context of a
+  // larger computation.  Pass false when parsing a stand-alone Window string.
+  bool ParseWindow(Window* window, bool expect_outer_curlies);
   bool ParseConvolutionDimensionNumbers(ConvolutionDimensionNumbers* dnums);
   bool ParsePaddingConfig(PaddingConfig* padding);
   bool ParseMetadata(OpMetadata* metadata);
   bool ParseSharding(OpSharding* sharding);
   bool ParseSingleSharding(OpSharding* sharding, bool lbrace_pre_lexed);
 
+  // Parses the metadata behind a kDOmain instruction.
+  bool ParseDomain(DomainData* domain);
+
   // Parses a sub-attribute of the window attribute, e.g.,size=1x2x3.
-  bool ParseDxD(const string& name, std::vector<int64>* result);
+  bool ParseDxD(const string& name, std::vector<tensorflow::int64>* result);
   // Parses window's pad sub-attriute, e.g., pad=0_0x3x3.
-  bool ParseWindowPad(std::vector<std::vector<int64>>* pad);
+  bool ParseWindowPad(std::vector<std::vector<tensorflow::int64>>* pad);
 
   bool ParseSliceRanges(SliceRanges* result);
   bool ParseInt64List(const TokKind start, const TokKind end,
-                      const TokKind delim, std::vector<int64>* result);
+                      const TokKind delim,
+                      std::vector<tensorflow::int64>* result);
 
   bool ParseParamListToShape(Shape* shape, LocTy* shape_loc);
   bool ParseParamList();
@@ -190,7 +213,7 @@ class HloParser {
   bool ParseFftType(FftType* result);
   bool ParseFusionKind(HloInstruction::FusionKind* result);
   bool ParseRandomDistribution(RandomDistribution* result);
-  bool ParseInt64(int64* result);
+  bool ParseInt64(tensorflow::int64* result);
   bool ParseDouble(double* result);
   bool ParseBool(bool* result);
   bool ParseToken(TokKind kind, const string& msg);
@@ -384,6 +407,7 @@ bool HloParser::ParseComputation(HloComputation** entry_computation) {
     }
     *entry_computation = computation;
   }
+  instruction_pool_.clear();
 
   return AddComputation(name, computation, name_loc);
 }
@@ -447,7 +471,7 @@ bool HloParser::ParseInstruction(HloComputation::Builder* builder,
   HloInstruction* instruction;
   switch (opcode) {
     case HloOpcode::kParameter: {
-      int64 parameter_number;
+      tensorflow::int64 parameter_number;
       if (!ParseToken(TokKind::kLparen,
                       "expects '(' before parameter number") ||
           !ParseInt64(&parameter_number) ||
@@ -563,11 +587,28 @@ bool HloParser::ParseInstruction(HloComputation::Builder* builder,
       break;
     }
     case HloOpcode::kCrossReplicaSum: {
+      optional<HloComputation*> to_apply;
+      optional<std::vector<int64>> replica_group_ids;
+      optional<string> barrier;
+      attrs["to_apply"] = {/*required=*/true, AttrTy::kHloComputation,
+                           &to_apply};
+      attrs["replica_group_ids"] = {
+          /*required=*/false, AttrTy::kBracedInt64List, &replica_group_ids};
+      attrs["barrier"] = {/*required=*/false, AttrTy::kString, &barrier};
       if (!ParseOperands(&operands) || !ParseAttributes(attrs)) {
         return false;
       }
-      instruction = builder->AddInstruction(
-          HloInstruction::CreateCrossReplicaSum(shape, operands));
+
+      if (replica_group_ids) {
+        instruction =
+            builder->AddInstruction(HloInstruction::CreateCrossReplicaSum(
+                shape, operands, *to_apply, *replica_group_ids,
+                barrier ? *barrier : ""));
+      } else {
+        instruction =
+            builder->AddInstruction(HloInstruction::CreateCrossReplicaSum(
+                shape, operands, *to_apply, {}, barrier ? *barrier : ""));
+      }
       break;
     }
     case HloOpcode::kReshape: {
@@ -579,6 +620,14 @@ bool HloParser::ParseInstruction(HloComputation::Builder* builder,
           HloInstruction::CreateReshape(shape, operands[0]));
       break;
     }
+    case HloOpcode::kGenerateToken: {
+      if (!ParseOperands(&operands) || !ParseAttributes(attrs)) {
+        return false;
+      }
+      instruction = builder->AddInstruction(
+          HloInstruction::CreateGenerateToken(operands));
+      break;
+    }
     case HloOpcode::kTuple: {
       if (!ParseOperands(&operands) || !ParseAttributes(attrs)) {
         return false;
@@ -602,7 +651,7 @@ bool HloParser::ParseInstruction(HloComputation::Builder* builder,
       break;
     }
     case HloOpcode::kRecv: {
-      optional<int64> channel_id;
+      optional<tensorflow::int64> channel_id;
       attrs["channel_id"] = {/*required=*/true, AttrTy::kInt64, &channel_id};
       if (!ParseOperands(&operands, /*expected_size=*/0) ||
           !ParseAttributes(attrs)) {
@@ -613,7 +662,7 @@ bool HloParser::ParseInstruction(HloComputation::Builder* builder,
       break;
     }
     case HloOpcode::kRecvDone: {
-      optional<int64> channel_id;
+      optional<tensorflow::int64> channel_id;
       attrs["channel_id"] = {/*required=*/true, AttrTy::kInt64, &channel_id};
       if (!ParseOperands(&operands, /*expected_size=*/1) ||
           !ParseAttributes(attrs)) {
@@ -627,7 +676,7 @@ bool HloParser::ParseInstruction(HloComputation::Builder* builder,
       break;
     }
     case HloOpcode::kSend: {
-      optional<int64> channel_id;
+      optional<tensorflow::int64> channel_id;
       attrs["channel_id"] = {/*required=*/true, AttrTy::kInt64, &channel_id};
       if (!ParseOperands(&operands, /*expected_size=*/1) ||
           !ParseAttributes(attrs)) {
@@ -638,7 +687,7 @@ bool HloParser::ParseInstruction(HloComputation::Builder* builder,
       break;
     }
     case HloOpcode::kSendDone: {
-      optional<int64> channel_id;
+      optional<tensorflow::int64> channel_id;
       attrs["channel_id"] = {/*required=*/true, AttrTy::kInt64, &channel_id};
       if (!ParseOperands(&operands, /*expected_size=*/1) ||
           !ParseAttributes(attrs)) {
@@ -652,7 +701,7 @@ bool HloParser::ParseInstruction(HloComputation::Builder* builder,
       break;
     }
     case HloOpcode::kGetTupleElement: {
-      optional<int64> index;
+      optional<tensorflow::int64> index;
       attrs["index"] = {/*required=*/true, AttrTy::kInt64, &index};
       if (!ParseOperands(&operands, /*expected_size=*/1) ||
           !ParseAttributes(attrs)) {
@@ -710,7 +759,7 @@ bool HloParser::ParseInstruction(HloComputation::Builder* builder,
     }
     case HloOpcode::kFft: {
       optional<FftType> fft_type;
-      optional<std::vector<int64>> fft_length;
+      optional<std::vector<tensorflow::int64>> fft_length;
       attrs["fft_type"] = {/*required=*/true, AttrTy::kFftType, &fft_type};
       attrs["fft_length"] = {/*required=*/true, AttrTy::kBracedInt64List,
                              &fft_length};
@@ -723,7 +772,7 @@ bool HloParser::ParseInstruction(HloComputation::Builder* builder,
       break;
     }
     case HloOpcode::kBroadcast: {
-      optional<std::vector<int64>> broadcast_dimensions;
+      optional<std::vector<tensorflow::int64>> broadcast_dimensions;
       attrs["dimensions"] = {/*required=*/true, AttrTy::kBracedInt64List,
                              &broadcast_dimensions};
       if (!ParseOperands(&operands, /*expected_size=*/1) ||
@@ -735,7 +784,7 @@ bool HloParser::ParseInstruction(HloComputation::Builder* builder,
       break;
     }
     case HloOpcode::kConcatenate: {
-      optional<std::vector<int64>> dimensions;
+      optional<std::vector<tensorflow::int64>> dimensions;
       attrs["dimensions"] = {/*required=*/true, AttrTy::kBracedInt64List,
                              &dimensions};
       if (!ParseOperands(&operands) || !ParseAttributes(attrs) ||
@@ -750,6 +799,9 @@ bool HloParser::ParseInstruction(HloComputation::Builder* builder,
       optional<HloComputation*> to_apply;
       attrs["to_apply"] = {/*required=*/true, AttrTy::kHloComputation,
                            &to_apply};
+      optional<std::vector<tensorflow::int64>> dimensions;
+      attrs["dimensions"] = {/*required=*/false, AttrTy::kBracedInt64List,
+                             &dimensions};
       if (!ParseOperands(&operands) || !ParseAttributes(attrs)) {
         return false;
       }
@@ -761,7 +813,7 @@ bool HloParser::ParseInstruction(HloComputation::Builder* builder,
       optional<HloComputation*> reduce_computation;
       attrs["to_apply"] = {/*required=*/true, AttrTy::kHloComputation,
                            &reduce_computation};
-      optional<std::vector<int64>> dimensions_to_reduce;
+      optional<std::vector<tensorflow::int64>> dimensions_to_reduce;
       attrs["dimensions"] = {/*required=*/true, AttrTy::kBracedInt64List,
                              &dimensions_to_reduce};
       if (!ParseOperands(&operands, /*expected_size=*/2) ||
@@ -774,7 +826,7 @@ bool HloParser::ParseInstruction(HloComputation::Builder* builder,
       break;
     }
     case HloOpcode::kReverse: {
-      optional<std::vector<int64>> dimensions;
+      optional<std::vector<tensorflow::int64>> dimensions;
       attrs["dimensions"] = {/*required=*/true, AttrTy::kBracedInt64List,
                              &dimensions};
       if (!ParseOperands(&operands, /*expected_size=*/1) ||
@@ -818,7 +870,7 @@ bool HloParser::ParseInstruction(HloComputation::Builder* builder,
       break;
     }
     case HloOpcode::kDynamicSlice: {
-      optional<std::vector<int64>> dynamic_slice_sizes;
+      optional<std::vector<tensorflow::int64>> dynamic_slice_sizes;
       attrs["dynamic_slice_sizes"] = {
           /*required=*/true, AttrTy::kBracedInt64List, &dynamic_slice_sizes};
       if (!ParseOperands(&operands, /*expected_size=*/2) ||
@@ -842,7 +894,7 @@ bool HloParser::ParseInstruction(HloComputation::Builder* builder,
       break;
     }
     case HloOpcode::kTranspose: {
-      optional<std::vector<int64>> dimensions;
+      optional<std::vector<tensorflow::int64>> dimensions;
       attrs["dimensions"] = {/*required=*/true, AttrTy::kBracedInt64List,
                              &dimensions};
       if (!ParseOperands(&operands, /*expected_size=*/1) ||
@@ -856,7 +908,7 @@ bool HloParser::ParseInstruction(HloComputation::Builder* builder,
     case HloOpcode::kBatchNormTraining: {
       optional<float> epsilon;
       attrs["epsilon"] = {/*required=*/true, AttrTy::kFloat, &epsilon};
-      optional<int64> feature_index;
+      optional<tensorflow::int64> feature_index;
       attrs["feature_index"] = {/*required=*/true, AttrTy::kInt64,
                                 &feature_index};
       if (!ParseOperands(&operands, /*expected_size=*/3) ||
@@ -872,7 +924,7 @@ bool HloParser::ParseInstruction(HloComputation::Builder* builder,
     case HloOpcode::kBatchNormInference: {
       optional<float> epsilon;
       attrs["epsilon"] = {/*required=*/true, AttrTy::kFloat, &epsilon};
-      optional<int64> feature_index;
+      optional<tensorflow::int64> feature_index;
       attrs["feature_index"] = {/*required=*/true, AttrTy::kInt64,
                                 &feature_index};
       if (!ParseOperands(&operands, /*expected_size=*/5) ||
@@ -889,7 +941,7 @@ bool HloParser::ParseInstruction(HloComputation::Builder* builder,
     case HloOpcode::kBatchNormGrad: {
       optional<float> epsilon;
       attrs["epsilon"] = {/*required=*/true, AttrTy::kFloat, &epsilon};
-      optional<int64> feature_index;
+      optional<tensorflow::int64> feature_index;
       attrs["feature_index"] = {/*required=*/true, AttrTy::kInt64,
                                 &feature_index};
       if (!ParseOperands(&operands, /*expected_size=*/5) ||
@@ -960,8 +1012,8 @@ bool HloParser::ParseInstruction(HloComputation::Builder* builder,
       break;
     }
     case HloOpcode::kReducePrecision: {
-      optional<int64> exponent_bits;
-      optional<int64> mantissa_bits;
+      optional<tensorflow::int64> exponent_bits;
+      optional<tensorflow::int64> mantissa_bits;
       attrs["exponent_bits"] = {/*required=*/true, AttrTy::kInt64,
                                 &exponent_bits};
       attrs["mantissa_bits"] = {/*required=*/true, AttrTy::kInt64,
@@ -1006,7 +1058,7 @@ bool HloParser::ParseInstruction(HloComputation::Builder* builder,
     }
     case HloOpcode::kHostCompute: {
       optional<string> channel_name;
-      optional<int64> cost_estimate_ns;
+      optional<tensorflow::int64> cost_estimate_ns;
       attrs["channel_name"] = {/*required=*/true, AttrTy::kString,
                                &channel_name};
       attrs["cost_estimate_ns"] = {/*required=*/true, AttrTy::kInt64,
@@ -1019,16 +1071,16 @@ bool HloParser::ParseInstruction(HloComputation::Builder* builder,
       break;
     }
     case HloOpcode::kDot: {
-      optional<std::vector<int64>> lhs_contracting_dims;
+      optional<std::vector<tensorflow::int64>> lhs_contracting_dims;
       attrs["lhs_contracting_dims"] = {
           /*required=*/false, AttrTy::kBracedInt64List, &lhs_contracting_dims};
-      optional<std::vector<int64>> rhs_contracting_dims;
+      optional<std::vector<tensorflow::int64>> rhs_contracting_dims;
       attrs["rhs_contracting_dims"] = {
           /*required=*/false, AttrTy::kBracedInt64List, &rhs_contracting_dims};
-      optional<std::vector<int64>> lhs_batch_dims;
+      optional<std::vector<tensorflow::int64>> lhs_batch_dims;
       attrs["lhs_batch_dims"] = {/*required=*/false, AttrTy::kBracedInt64List,
                                  &lhs_batch_dims};
-      optional<std::vector<int64>> rhs_batch_dims;
+      optional<std::vector<tensorflow::int64>> rhs_batch_dims;
       attrs["rhs_batch_dims"] = {/*required=*/false, AttrTy::kBracedInt64List,
                                  &rhs_batch_dims};
 
@@ -1060,20 +1112,20 @@ bool HloParser::ParseInstruction(HloComputation::Builder* builder,
       break;
     }
     case HloOpcode::kGather: {
-      optional<std::vector<int64>> output_window_dims;
+      optional<std::vector<tensorflow::int64>> output_window_dims;
       attrs["output_window_dims"] = {
           /*required=*/true, AttrTy::kBracedInt64List, &output_window_dims};
-      optional<std::vector<int64>> elided_window_dims;
+      optional<std::vector<tensorflow::int64>> elided_window_dims;
       attrs["elided_window_dims"] = {
           /*required=*/true, AttrTy::kBracedInt64List, &elided_window_dims};
-      optional<std::vector<int64>> gather_dims_to_operand_dims;
+      optional<std::vector<tensorflow::int64>> gather_dims_to_operand_dims;
       attrs["gather_dims_to_operand_dims"] = {/*required=*/true,
                                               AttrTy::kBracedInt64List,
                                               &gather_dims_to_operand_dims};
-      optional<int64> index_vector_dim;
+      optional<tensorflow::int64> index_vector_dim;
       attrs["index_vector_dim"] = {/*required=*/true, AttrTy::kInt64,
                                    &index_vector_dim};
-      optional<std::vector<int64>> window_bounds;
+      optional<std::vector<tensorflow::int64>> window_bounds;
       attrs["window_bounds"] = {/*required=*/true, AttrTy::kBracedInt64List,
                                 &window_bounds};
 
@@ -1093,12 +1145,29 @@ bool HloParser::ParseInstruction(HloComputation::Builder* builder,
           dim_numbers, *window_bounds));
       break;
     }
+    case HloOpcode::kDomain: {
+      DomainData domain;
+      attrs["domain"] = {/*required=*/true, AttrTy::kDomain, &domain};
+      if (!ParseOperands(&operands, /*expected_size=*/1) ||
+          !ParseAttributes(attrs)) {
+        return false;
+      }
+      instruction = builder->AddInstruction(HloInstruction::CreateDomain(
+          shape, operands[0], std::move(domain.entry_metadata),
+          std::move(domain.exit_metadata)));
+      break;
+    }
     case HloOpcode::kTrace:
       return TokenError(StrCat("parsing not yet implemented for op: ",
                                HloOpcodeString(opcode)));
   }
 
-  instruction->set_name(name);
+  instruction->SetAndSanitizeName(name);
+  if (instruction->name() != name) {
+    return Error(name_loc,
+                 StrCat("illegal instruction name: ", name,
+                        "; suggest renaming to: ", instruction->name()));
+  }
 
   // Add shared attributes like metadata to the instruction, if they were seen.
   if (sharding) {
@@ -1118,7 +1187,7 @@ bool HloParser::ParseInstruction(HloComputation::Builder* builder,
     instruction->set_metadata(*metadata);
   }
   if (backend_config) {
-    instruction->set_backend_config(std::move(*backend_config));
+    instruction->set_raw_backend_config_string(std::move(*backend_config));
   }
   return AddInstruction(name, instruction, name_loc);
 }  // NOLINT(readability/fn_size)
@@ -1169,8 +1238,8 @@ bool HloParser::ParseSingleSharding(OpSharding* sharding,
   LocTy loc = lexer_.GetLoc();
   bool maximal = false;
   bool replicated = false;
-  std::vector<int64> devices;
-  std::vector<int64> tile_assignment_dimensions;
+  std::vector<tensorflow::int64> devices;
+  std::vector<tensorflow::int64> tile_assignment_dimensions;
   Shape tile_shape;
   while (lexer_.GetKind() != TokKind::kRbrace) {
     switch (lexer_.GetKind()) {
@@ -1197,7 +1266,7 @@ bool HloParser::ParseSingleSharding(OpSharding* sharding,
           }
 
           do {
-            int64 dim;
+            tensorflow::int64 dim;
             if (!ParseInt64(&dim)) {
               return false;
             }
@@ -1209,7 +1278,7 @@ bool HloParser::ParseSingleSharding(OpSharding* sharding,
             return false;
           }
           do {
-            int64 device;
+            tensorflow::int64 device;
             if (!ParseInt64(&device)) {
               return false;
             }
@@ -1268,10 +1337,10 @@ bool HloParser::ParseSingleSharding(OpSharding* sharding,
     }
     sharding->set_type(OpSharding::Type::OpSharding_Type_OTHER);
     *sharding->mutable_tile_shape() = tile_shape;
-    for (int64 dim : tile_assignment_dimensions) {
+    for (tensorflow::int64 dim : tile_assignment_dimensions) {
       sharding->add_tile_assignment_dimensions(dim);
     }
-    for (int64 device : devices) {
+    for (tensorflow::int64 device : devices) {
       sharding->add_tile_assignment_devices(device);
     }
   }
@@ -1280,6 +1349,34 @@ bool HloParser::ParseSingleSharding(OpSharding* sharding,
   return true;
 }
 
+// domain ::= '{' 'kind=' domain_kind ',' 'entry=' entry_sharding ','
+//            'exit=' exit_sharding '}'
+bool HloParser::ParseDomain(DomainData* domain) {
+  std::unordered_map<string, AttrConfig> attrs;
+  optional<string> kind;
+  optional<OpSharding> entry_sharding;
+  optional<OpSharding> exit_sharding;
+  attrs["kind"] = {/*required=*/true, AttrTy::kString, &kind};
+  attrs["entry"] = {/*required=*/true, AttrTy::kSharding, &entry_sharding};
+  attrs["exit"] = {/*required=*/true, AttrTy::kSharding, &exit_sharding};
+  if (!ParseSubAttributes(attrs)) {
+    return false;
+  }
+  if (*kind == ShardingMetadata::KindName()) {
+    auto entry_sharding_ptr = MakeUnique<HloSharding>(
+        HloSharding::FromProto(*entry_sharding).ValueOrDie());
+    auto exit_sharding_ptr = MakeUnique<HloSharding>(
+        HloSharding::FromProto(*exit_sharding).ValueOrDie());
+    domain->entry_metadata =
+        MakeUnique<ShardingMetadata>(std::move(entry_sharding_ptr));
+    domain->exit_metadata =
+        MakeUnique<ShardingMetadata>(std::move(exit_sharding_ptr));
+  } else {
+    return TokenError(StrCat("unsupported domain kind: ", *kind));
+  }
+  return true;
+}
+
 // '{' name+ '}'
 bool HloParser::ParseInstructionNames(
     std::vector<HloInstruction*>* instructions) {
@@ -1306,40 +1403,50 @@ bool HloParser::ParseInstructionNames(
                     "expects '}' at the end of instruction name list");
 }
 
-bool HloParser::SetValueInLiteral(int64 value, int64 linear_index,
+bool HloParser::SetValueInLiteral(tensorflow::int64 value,
+                                  tensorflow::int64 linear_index,
                                   Literal* literal) {
   const Shape& shape = literal->shape();
   switch (shape.element_type()) {
     case S8:
-      return SetValueInLiteralHelper<int8>(value, linear_index, literal);
+      return SetValueInLiteralHelper<tensorflow::int8>(value, linear_index,
+                                                       literal);
     case S16:
-      return SetValueInLiteralHelper<int16>(value, linear_index, literal);
+      return SetValueInLiteralHelper<tensorflow::int16>(value, linear_index,
+                                                        literal);
     case S32:
-      return SetValueInLiteralHelper<int32>(value, linear_index, literal);
+      return SetValueInLiteralHelper<tensorflow::int32>(value, linear_index,
+                                                        literal);
     case S64:
-      return SetValueInLiteralHelper<int64>(value, linear_index, literal);
+      return SetValueInLiteralHelper<tensorflow::int64>(value, linear_index,
+                                                        literal);
     case U8:
-      return SetValueInLiteralHelper<uint8>(value, linear_index, literal);
+      return SetValueInLiteralHelper<tensorflow::uint8>(value, linear_index,
+                                                        literal);
     case U16:
-      return SetValueInLiteralHelper<uint8>(value, linear_index, literal);
+      return SetValueInLiteralHelper<tensorflow::uint16>(value, linear_index,
+                                                         literal);
     case U32:
-      return SetValueInLiteralHelper<uint32>(value, linear_index, literal);
+      return SetValueInLiteralHelper<tensorflow::uint32>(value, linear_index,
+                                                         literal);
     case U64:
-      return SetValueInLiteralHelper<uint64>(value, linear_index, literal);
+      return SetValueInLiteralHelper<tensorflow::uint64>(value, linear_index,
+                                                         literal);
     default:
       LOG(FATAL) << "unknown integral primitive type "
                  << PrimitiveType_Name(shape.element_type());
   }
 }
 
-bool HloParser::SetValueInLiteral(double value, int64 linear_index,
+bool HloParser::SetValueInLiteral(double value, tensorflow::int64 linear_index,
                                   Literal* literal) {
   const Shape& shape = literal->shape();
   switch (shape.element_type()) {
     case F16:
-      return SetValueInLiteralHelper<half>(value, linear_index, literal);
+      return SetValueInLiteralHelper<Eigen::half>(value, linear_index, literal);
     case BF16:
-      return SetValueInLiteralHelper<bfloat16>(value, linear_index, literal);
+      return SetValueInLiteralHelper<tensorflow::bfloat16>(value, linear_index,
+                                                           literal);
     case F32:
       return SetValueInLiteralHelper<float>(value, linear_index, literal);
     case F64:
@@ -1350,7 +1457,7 @@ bool HloParser::SetValueInLiteral(double value, int64 linear_index,
   }
 }
 
-bool HloParser::SetValueInLiteral(bool value, int64 linear_index,
+bool HloParser::SetValueInLiteral(bool value, tensorflow::int64 linear_index,
                                   Literal* literal) {
   const Shape& shape = literal->shape();
   switch (shape.element_type()) {
@@ -1363,7 +1470,8 @@ bool HloParser::SetValueInLiteral(bool value, int64 linear_index,
 }
 
 template <typename LiteralNativeT, typename ParsedElemT>
-bool HloParser::SetValueInLiteralHelper(ParsedElemT value, int64 linear_index,
+bool HloParser::SetValueInLiteralHelper(ParsedElemT value,
+                                        tensorflow::int64 linear_index,
                                         Literal* literal) {
   // Check that linear_index is in range.
   if (linear_index >= ShapeUtil::ElementsIn(literal->shape())) {
@@ -1475,7 +1583,7 @@ bool HloParser::ParseNonTupleLiteral(std::unique_ptr<Literal>* literal,
 
 bool HloParser::ParseDenseLiteral(std::unique_ptr<Literal>* literal,
                                   const Shape& shape) {
-  const int64 rank = ShapeUtil::Rank(shape);
+  const tensorflow::int64 rank = ShapeUtil::Rank(shape);
   if (rank > 1 && !EatShapeAndCheckCompatible(shape)) {
     return false;
   }
@@ -1483,8 +1591,8 @@ bool HloParser::ParseDenseLiteral(std::unique_ptr<Literal>* literal,
   // Create a literal with the given shape in default layout.
   *literal = Literal::CreateFromDimensions(shape.element_type(),
                                            AsInt64Slice(shape.dimensions()));
-  int64 nest_level = 0;
-  int64 linear_index = 0;
+  tensorflow::int64 nest_level = 0;
+  tensorflow::int64 linear_index = 0;
   // elems_seen_per_dim[i] is how many elements or sub-arrays we have seen for
   // the dimension i. For example, to parse f32[2,3] {{1, 2, 3}, {4, 5, 6}},
   // when we are parsing the 2nd '{' (right before '1'), we are seeing a
@@ -1492,14 +1600,14 @@ bool HloParser::ParseDenseLiteral(std::unique_ptr<Literal>* literal,
   // the first '}' (right after '3'), it means the sub-array ends, and the
   // sub-array is supposed to contain exactly 3 elements, so check if
   // elems_seen_per_dim[1] is 3.
-  std::vector<int64> elems_seen_per_dim(rank);
+  std::vector<tensorflow::int64> elems_seen_per_dim(rank);
   auto get_index_str = [&elems_seen_per_dim](int dim) -> string {
-    std::vector<int64> elems_seen_until_dim(elems_seen_per_dim.begin(),
-                                            elems_seen_per_dim.begin() + dim);
+    std::vector<tensorflow::int64> elems_seen_until_dim(
+        elems_seen_per_dim.begin(), elems_seen_per_dim.begin() + dim);
     return StrCat("[",
                   Join(elems_seen_until_dim, ",",
-                       [](string* out, const int64& num_elems) {
-                         tensorflow::strings::StrAppend(out, num_elems - 1);
+                       [](string* out, const tensorflow::int64& num_elems) {
+                         StrAppend(out, num_elems - 1);
                        }),
                   "]");
   };
@@ -1575,7 +1683,7 @@ bool HloParser::ParseDenseLiteral(std::unique_ptr<Literal>* literal,
           lexer_.Lex();
         } else if (primitive_util::IsIntegralType(shape.element_type())) {
           LocTy loc = lexer_.GetLoc();
-          int64 value;
+          tensorflow::int64 value;
           if (!ParseInt64(&value)) {
             return Error(loc, StrCat("expects integer for primitive type: ",
                                      PrimitiveType_Name(shape.element_type())));
@@ -1615,29 +1723,29 @@ bool HloParser::ParseSparseLiteral(std::unique_ptr<Literal>* literal,
 
   switch (shape.element_type()) {
     case PRED:
-      return ParseSparseLiteralHelper<uint8>(literal, shape);
+      return ParseSparseLiteralHelper<tensorflow::uint8>(literal, shape);
     case S8:
-      return ParseSparseLiteralHelper<int8>(literal, shape);
+      return ParseSparseLiteralHelper<tensorflow::int8>(literal, shape);
     case S16:
-      return ParseSparseLiteralHelper<int16>(literal, shape);
+      return ParseSparseLiteralHelper<tensorflow::int16>(literal, shape);
     case S32:
-      return ParseSparseLiteralHelper<int32>(literal, shape);
+      return ParseSparseLiteralHelper<tensorflow::int32>(literal, shape);
     case S64:
-      return ParseSparseLiteralHelper<int64>(literal, shape);
+      return ParseSparseLiteralHelper<tensorflow::int64>(literal, shape);
     case U8:
-      return ParseSparseLiteralHelper<uint8>(literal, shape);
+      return ParseSparseLiteralHelper<tensorflow::uint8>(literal, shape);
     case U16:
-      return ParseSparseLiteralHelper<uint16>(literal, shape);
+      return ParseSparseLiteralHelper<tensorflow::uint16>(literal, shape);
     case U32:
-      return ParseSparseLiteralHelper<uint32>(literal, shape);
+      return ParseSparseLiteralHelper<tensorflow::uint32>(literal, shape);
     case U64:
-      return ParseSparseLiteralHelper<uint64>(literal, shape);
+      return ParseSparseLiteralHelper<tensorflow::uint64>(literal, shape);
     case F16:
-      return ParseSparseLiteralHelper<half>(literal, shape);
+      return ParseSparseLiteralHelper<Eigen::half>(literal, shape);
     case F32:
       return ParseSparseLiteralHelper<float>(literal, shape);
     case BF16:
-      return ParseSparseLiteralHelper<bfloat16>(literal, shape);
+      return ParseSparseLiteralHelper<tensorflow::bfloat16>(literal, shape);
     case F64:
       return ParseSparseLiteralHelper<double>(literal, shape);
     default:
@@ -1650,9 +1758,9 @@ bool HloParser::ParseSparseLiteral(std::unique_ptr<Literal>* literal,
 template <typename LiteralNativeT>
 bool HloParser::ParseSparseLiteralHelper(std::unique_ptr<Literal>* literal,
                                          const Shape& shape) {
-  std::vector<int64> index;
+  std::vector<tensorflow::int64> index;
 
-  int64 rank = ShapeUtil::Rank(shape);
+  tensorflow::int64 rank = ShapeUtil::Rank(shape);
 
   *literal = MakeUnique<Literal>(shape);
 
@@ -1670,7 +1778,7 @@ bool HloParser::ParseSparseLiteralHelper(std::unique_ptr<Literal>* literal,
     LocTy index_loc = lexer_.GetLoc();
     index.clear();
     if (lexer_.GetKind() == TokKind::kInt) {
-      int64 single_index = lexer_.GetInt64Val();
+      tensorflow::int64 single_index = lexer_.GetInt64Val();
       lexer_.Lex();
       if (rank != 1) {
         return Error(
@@ -1703,7 +1811,7 @@ bool HloParser::ParseSparseLiteralHelper(std::unique_ptr<Literal>* literal,
       value = static_cast<LiteralNativeT>(lexer_.GetKind() == TokKind::kw_true);
       lexer_.Lex();
     } else if (primitive_util::IsIntegralType(shape.element_type())) {
-      int64 value_s64;
+      tensorflow::int64 value_s64;
       if (!ParseInt64(&value_s64)) {
         return Error(value_loc,
                      StrCat("expects integer for primitive type: ",
@@ -1876,23 +1984,24 @@ bool HloParser::ParseAttributeHelper(
     LocTy attr_loc = lexer_.GetLoc();
     switch (attr_type) {
       case AttrTy::kInt64: {
-        int64 result;
+        tensorflow::int64 result;
         if (!ParseInt64(&result)) {
           return false;
         }
-        static_cast<optional<int64>*>(attr_out_ptr)->emplace(result);
+        static_cast<optional<tensorflow::int64>*>(attr_out_ptr)
+            ->emplace(result);
         return true;
       }
       case AttrTy::kInt32: {
-        int64 result;
+        tensorflow::int64 result;
         if (!ParseInt64(&result)) {
           return false;
         }
-        if (result != static_cast<int32>(result)) {
+        if (result != static_cast<tensorflow::int32>(result)) {
           return Error(attr_loc, "value out of range for int32");
         }
-        static_cast<optional<int32>*>(attr_out_ptr)
-            ->emplace(static_cast<int32>(result));
+        static_cast<optional<tensorflow::int32>*>(attr_out_ptr)
+            ->emplace(static_cast<tensorflow::int32>(result));
         return true;
       }
       case AttrTy::kFloat: {
@@ -1926,7 +2035,7 @@ bool HloParser::ParseAttributeHelper(
       }
       case AttrTy::kWindow: {
         Window result;
-        if (!ParseWindow(&result)) {
+        if (!ParseWindow(&result, /*expect_outer_curlies=*/true)) {
           return false;
         }
         static_cast<optional<Window>*>(attr_out_ptr)->emplace(result);
@@ -1968,12 +2077,12 @@ bool HloParser::ParseAttributeHelper(
         return true;
       }
       case AttrTy::kBracedInt64List: {
-        std::vector<int64> result;
+        std::vector<tensorflow::int64> result;
         if (!ParseInt64List(TokKind::kLbrace, TokKind::kRbrace, TokKind::kComma,
                             &result)) {
           return false;
         }
-        static_cast<optional<std::vector<int64>>*>(attr_out_ptr)
+        static_cast<optional<std::vector<tensorflow::int64>>*>(attr_out_ptr)
             ->emplace(result);
         return true;
       }
@@ -2018,6 +2127,9 @@ bool HloParser::ParseAttributeHelper(
             ->emplace(result);
         return true;
       }
+      case AttrTy::kDomain: {
+        return ParseDomain(static_cast<DomainData*>(attr_out_ptr));
+      }
     }
   }();
   if (!success) {
@@ -2044,9 +2156,10 @@ bool HloParser::ParseComputationName(HloComputation** value) {
 // ::= '{' size stride? pad? lhs_dilate? rhs_dilate? '}'
 // The subattributes can appear in any order. 'size=' is required, others are
 // optional.
-bool HloParser::ParseWindow(Window* window) {
+bool HloParser::ParseWindow(Window* window, bool expect_outer_curlies) {
   LocTy loc = lexer_.GetLoc();
-  if (!ParseToken(TokKind::kLbrace, "expected '{' to start window attribute")) {
+  if (expect_outer_curlies &&
+      !ParseToken(TokKind::kLbrace, "expected '{' to start window attribute")) {
     return false;
   }
 
@@ -2056,7 +2169,9 @@ bool HloParser::ParseWindow(Window* window) {
   std::vector<int64> lhs_dilate;
   std::vector<int64> rhs_dilate;
   std::vector<int64> rhs_reversal;
-  while (lexer_.GetKind() != TokKind::kRbrace) {
+  const auto end_token =
+      expect_outer_curlies ? TokKind::kRbrace : TokKind::kEof;
+  while (lexer_.GetKind() != end_token) {
     LocTy attr_loc = lexer_.GetLoc();
     string field_name;
     if (!ParseAttributeName(&field_name)) {
@@ -2120,7 +2235,8 @@ bool HloParser::ParseWindow(Window* window) {
     window->mutable_dimensions(i)->set_window_reversal(
         rhs_reversal.empty() ? false : (rhs_reversal[i] == 1));
   }
-  return ParseToken(TokKind::kRbrace, "expected '}' to end window attribute");
+  return !expect_outer_curlies ||
+         ParseToken(TokKind::kRbrace, "expected '}' to end window attribute");
 }
 
 // This is the inverse of HloInstruction::ConvolutionDimensionNumbersToString.
@@ -2144,7 +2260,7 @@ bool HloParser::ParseConvolutionDimensionNumbers(
                << str;
   }
 
-  const int64 rank = lhs_rhs_out[0].length();
+  const tensorflow::int64 rank = lhs_rhs_out[0].length();
   if (rank != lhs_rhs_out[1].length() || rank != lhs_rhs_out[2].length()) {
     return TokenError(
         "convolution lhs, rhs, and output must have the same rank");
@@ -2258,7 +2374,7 @@ bool HloParser::ParseSliceRanges(SliceRanges* result) {
   if (!ParseToken(TokKind::kLbrace, "expects '{' to start ranges")) {
     return false;
   }
-  std::vector<std::vector<int64>> ranges;
+  std::vector<std::vector<tensorflow::int64>> ranges;
   if (lexer_.GetKind() == TokKind::kRbrace) {
     // empty
     return ParseToken(TokKind::kRbrace, "expects '}' to end ranges");
@@ -2292,7 +2408,7 @@ bool HloParser::ParseSliceRanges(SliceRanges* result) {
 //   ::= int64_val (delim int64_val)*
 bool HloParser::ParseInt64List(const TokKind start, const TokKind end,
                                const TokKind delim,
-                               std::vector<int64>* result) {
+                               std::vector<tensorflow::int64>* result) {
   if (!ParseToken(start, StrCat("expects an int64 list starting with ",
                                 TokKindToString(start)))) {
     return false;
@@ -2301,7 +2417,7 @@ bool HloParser::ParseInt64List(const TokKind start, const TokKind end,
     // empty
   } else {
     do {
-      int64 i;
+      tensorflow::int64 i;
       if (!ParseInt64(&i)) {
         return false;
       }
@@ -2418,7 +2534,8 @@ bool HloParser::ParseString(string* result) {
   return true;
 }
 
-bool HloParser::ParseDxD(const string& name, std::vector<int64>* result) {
+bool HloParser::ParseDxD(const string& name,
+                         std::vector<tensorflow::int64>* result) {
   LocTy loc = lexer_.GetLoc();
   if (!result->empty()) {
     return Error(loc,
@@ -2426,7 +2543,7 @@ bool HloParser::ParseDxD(const string& name, std::vector<int64>* result) {
   }
   // 1D
   if (lexer_.GetKind() == TokKind::kInt) {
-    int64 number;
+    tensorflow::int64 number;
     if (!ParseInt64(&number)) {
       return Error(loc, Printf("expects sub-attribute '%s=i'", name.c_str()));
     }
@@ -2446,7 +2563,8 @@ bool HloParser::ParseDxD(const string& name, std::vector<int64>* result) {
   return TokenError("expects token type kInt or kDxD");
 }
 
-bool HloParser::ParseWindowPad(std::vector<std::vector<int64>>* pad) {
+bool HloParser::ParseWindowPad(
+    std::vector<std::vector<tensorflow::int64>>* pad) {
   LocTy loc = lexer_.GetLoc();
   if (!pad->empty()) {
     return Error(loc, "sub-attribute 'pad=' already exists");
@@ -2457,7 +2575,7 @@ bool HloParser::ParseWindowPad(std::vector<std::vector<int64>>* pad) {
   string str = lexer_.GetStrVal();
   std::vector<string> padding_str = Split(str, 'x');
   for (int i = 0; i < padding_str.size(); i++) {
-    std::vector<int64> low_high;
+    std::vector<tensorflow::int64> low_high;
     if (!SplitAndParseAsInts(padding_str[i], '_', &low_high) ||
         low_high.size() != 2) {
       return Error(loc,
@@ -2481,7 +2599,7 @@ bool HloParser::ParsePaddingConfig(PaddingConfig* padding) {
   string str = lexer_.GetStrVal();
   std::vector<string> padding_str = Split(str, 'x');
   for (const auto& padding_dim_str : padding_str) {
-    std::vector<int64> padding_dim;
+    std::vector<tensorflow::int64> padding_dim;
     if (!SplitAndParseAsInts(padding_dim_str, '_', &padding_dim) ||
         (padding_dim.size() != 2 && padding_dim.size() != 3)) {
       return Error(loc,
@@ -2503,7 +2621,7 @@ bool HloParser::ParseMetadata(OpMetadata* metadata) {
   optional<string> op_type;
   optional<string> op_name;
   optional<string> source_file;
-  optional<int32> source_line;
+  optional<tensorflow::int32> source_line;
   attrs["op_type"] = {/*required=*/false, AttrTy::kString, &op_type};
   attrs["op_name"] = {/*required=*/false, AttrTy::kString, &op_name};
   attrs["source_file"] = {/*required=*/false, AttrTy::kString, &source_file};
@@ -2590,7 +2708,7 @@ bool HloParser::ParseRandomDistribution(RandomDistribution* result) {
   return true;
 }
 
-bool HloParser::ParseInt64(int64* result) {
+bool HloParser::ParseInt64(tensorflow::int64* result) {
   VLOG(1) << "ParseInt64";
   if (lexer_.GetKind() != TokKind::kInt) {
     return TokenError("expects integer");
@@ -2673,10 +2791,48 @@ bool HloParser::AddComputation(const string& name, HloComputation* computation,
   return true;
 }
 
+StatusOr<HloSharding> HloParser::ParseShardingOnly() {
+  lexer_.Lex();
+  OpSharding op_sharding;
+  if (!ParseSharding(&op_sharding)) {
+    return InvalidArgument("Syntax error:\n%s", GetError().c_str());
+  }
+  if (lexer_.GetKind() != TokKind::kEof) {
+    return InvalidArgument("Syntax error:\nExtra content after sharding");
+  }
+  return HloSharding::FromProto(op_sharding);
+}
+
+StatusOr<Window> HloParser::ParseWindowOnly() {
+  lexer_.Lex();
+  Window window;
+  if (!ParseWindow(&window, /*expect_outer_curlies=*/false)) {
+    return InvalidArgument("Syntax error:\n%s", GetError().c_str());
+  }
+  if (lexer_.GetKind() != TokKind::kEof) {
+    return InvalidArgument("Syntax error:\nExtra content after window");
+  }
+  return window;
+}
+
+StatusOr<ConvolutionDimensionNumbers>
+HloParser::ParseConvolutionDimensionNumbersOnly() {
+  lexer_.Lex();
+  ConvolutionDimensionNumbers dnums;
+  if (!ParseConvolutionDimensionNumbers(&dnums)) {
+    return InvalidArgument("Syntax error:\n%s", GetError().c_str());
+  }
+  if (lexer_.GetKind() != TokKind::kEof) {
+    return InvalidArgument(
+        "Syntax error:\nExtra content after convolution dnums");
+  }
+  return dnums;
+}
+
 }  // namespace
 
-StatusOr<std::unique_ptr<HloModule>> Parse(StringPiece str,
-                                           const HloModuleConfig& config) {
+StatusOr<std::unique_ptr<HloModule>> ParseHloString(
+    tensorflow::StringPiece str, const HloModuleConfig& config) {
   HloParser parser(str, config);
   if (!parser.Run()) {
     return InvalidArgument("Syntax error:\n%s", parser.GetError().c_str());
@@ -2684,10 +2840,29 @@ StatusOr<std::unique_ptr<HloModule>> Parse(StringPiece str,
   return parser.ConsumeHloModule();
 }
 
-StatusOr<std::unique_ptr<HloModule>> Parse(StringPiece str) {
+StatusOr<std::unique_ptr<HloModule>> ParseHloString(
+    tensorflow::StringPiece str) {
+  HloModuleConfig config;
+  return ParseHloString(str, config);
+}
+
+StatusOr<HloSharding> ParseSharding(tensorflow::StringPiece str) {
+  HloModuleConfig config;
+  HloParser parser(str, config);
+  return parser.ParseShardingOnly();
+}
+
+StatusOr<Window> ParseWindow(tensorflow::StringPiece str) {
+  HloModuleConfig config;
+  HloParser parser(str, config);
+  return parser.ParseWindowOnly();
+}
+
+StatusOr<ConvolutionDimensionNumbers> ParseConvolutionDimensionNumbers(
+    tensorflow::StringPiece str) {
   HloModuleConfig config;
-  return Parse(str, config);
+  HloParser parser(str, config);
+  return parser.ParseConvolutionDimensionNumbersOnly();
 }
 
-}  // namespace tools
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/tools/parser/hlo_parser.h b/tensorflow/compiler/xla/service/hlo_parser.h
similarity index 52%
rename from tensorflow/compiler/xla/tools/parser/hlo_parser.h
rename to tensorflow/compiler/xla/service/hlo_parser.h
index 2f97a2b9b19d0cdb64a2869913da62c55e14c1d5..3f3a51215e34bbdd667f1cb20d0ae968e0ce5efd 100644
--- a/tensorflow/compiler/xla/tools/parser/hlo_parser.h
+++ b/tensorflow/compiler/xla/service/hlo_parser.h
@@ -13,30 +13,47 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#ifndef TENSORFLOW_COMPILER_XLA_TOOLS_PARSER_HLO_PARSER_H_
-#define TENSORFLOW_COMPILER_XLA_TOOLS_PARSER_HLO_PARSER_H_
+#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_HLO_PARSER_H_
+#define TENSORFLOW_COMPILER_XLA_SERVICE_HLO_PARSER_H_
 
 #include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
+#include "tensorflow/compiler/xla/service/hlo_lexer.h"
 #include "tensorflow/compiler/xla/service/hlo_module.h"
 #include "tensorflow/compiler/xla/statusor.h"
-#include "tensorflow/compiler/xla/tools/parser/hlo_lexer.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
 
 namespace xla {
-namespace tools {
+
+// For details about the syntax accepted by this parser, see
+// g3doc/hlo_parser.md.
 
 // The api of the hlo parser. Given a string in the HloModule::ToString()
 // format, parses the string and creates a HloModule with the given config.
-StatusOr<std::unique_ptr<HloModule>> Parse(tensorflow::StringPiece str,
-                                           const HloModuleConfig& config);
+StatusOr<std::unique_ptr<HloModule>> ParseHloString(
+    tensorflow::StringPiece str, const HloModuleConfig& config);
 
 // The api of the hlo parser. Given a string in the HloModule::ToString()
 // format, parses the string and creates a HloModule with default config.
-StatusOr<std::unique_ptr<HloModule>> Parse(tensorflow::StringPiece str);
+StatusOr<std::unique_ptr<HloModule>> ParseHloString(
+    tensorflow::StringPiece str);
+
+// Parses the result of HloSharding::ToString(), e.g. "{replicated}".
+StatusOr<HloSharding> ParseSharding(tensorflow::StringPiece str);
+
+// Parses the result of window_util::ToString(const Window&).
+StatusOr<Window> ParseWindow(tensorflow::StringPiece str);
+
+// Parses the result of ConvolutionDimensionNumbersToString(), e.g.
+// "b0f_0io->b0f".
+StatusOr<ConvolutionDimensionNumbers> ParseConvolutionDimensionNumbers(
+    tensorflow::StringPiece str);
+
+// ParseHloString sharding from str. str is supposed to contain the body of the
+// sharding, i.e. just the rhs of the "sharding={...}" attribute string.
+StatusOr<HloSharding> ParseSharding(tensorflow::StringPiece str);
 
-}  // namespace tools
 }  // namespace xla
 
-#endif  // TENSORFLOW_COMPILER_XLA_TOOLS_PARSER_HLO_PARSER_H_
+#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_HLO_PARSER_H_
diff --git a/tensorflow/compiler/xla/tools/parser/hlo_parser_test.cc b/tensorflow/compiler/xla/service/hlo_parser_test.cc
similarity index 88%
rename from tensorflow/compiler/xla/tools/parser/hlo_parser_test.cc
rename to tensorflow/compiler/xla/service/hlo_parser_test.cc
index 131aded95ab04c4327c275ed8cd18b8fc7ac1bd6..f834d34d57106b11cf398f966d8c0224f00d1b8d 100644
--- a/tensorflow/compiler/xla/tools/parser/hlo_parser_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_parser_test.cc
@@ -13,19 +13,20 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#include "tensorflow/compiler/xla/tools/parser/hlo_parser.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
 
 #include <string>
+#include "tensorflow/compiler/xla/window_util.h"
 #include "tensorflow/core/lib/core/status_test_util.h"
 #include "tensorflow/core/lib/core/stringpiece.h"
 #include "tensorflow/core/lib/strings/str_util.h"
 #include "tensorflow/core/platform/test.h"
 
 namespace xla {
-namespace tools {
+
 namespace {
 
-using tensorflow::StringPiece;
+using ::tensorflow::StringPiece;
 
 struct TestData {
   string test_name;
@@ -233,6 +234,17 @@ ENTRY %ShardedTupleCreate.v4 (v1: f32[], v2: f32[3], v3: f32[2,3]) -> (f32[], f3
   ROOT %tuple = (f32[], f32[3]{0}, f32[2,3]{1,0}) tuple(f32[] %v1, f32[3]{0} %v2, f32[2,3]{1,0} %v3), sharding={{replicated}, {maximal device=0}, {replicated}}
 }
 
+)"
+},
+{
+"DomainParsing",
+R"(HloModule DomainParsing_module
+
+ENTRY %DomainParsing (v1: f32[]) -> f32[] {
+  %v1 = f32[] parameter(0)
+  ROOT %dom = f32[] domain(f32[] %v1), domain={kind="sharding", entry={maximal device=0}, exit={maximal device=1}}
+}
+
 )"
 },
 // int32 result = 0;
@@ -753,7 +765,7 @@ add_F32.v3 {
 ENTRY MapBinaryAdder.v3 {
   param0 = f32[4]{0} parameter(0)
   param1 = f32[4]{0} parameter(1)
-  ROOT map = f32[4]{0} map(param0, param1), to_apply=add_F32.v3
+  ROOT map = f32[4]{0} map(param0, param1), dimensions={0}, to_apply=add_F32.v3
 }
 
 )"
@@ -888,6 +900,42 @@ ENTRY Gather {
 
 )"
 },
+// cross-replica-sum
+{
+"CrossReplicaSum",
+R"(HloModule CRS
+
+add {
+  lhs = f32[] parameter(0)
+  rhs = f32[] parameter(1)
+  ROOT add = f32[] add(lhs, rhs)
+}
+
+ENTRY CRS {
+  input = f32[8]{0} parameter(0)
+  ROOT crs = f32[8]{0} cross-replica-sum(input), to_apply=add
+}
+
+)"
+},
+// cross-replica-sum with subgroups
+{
+"CrossReplicaSumWithSubgroups",
+R"(HloModule CRS_Subgroups
+
+add {
+  lhs = f32[] parameter(0)
+  rhs = f32[] parameter(1)
+  ROOT add = f32[] add(lhs, rhs)
+}
+
+ENTRY CrossReplicaSumWithSubgroups {
+  input = f32[128,32]{0,1} parameter(0)
+  ROOT cross-replica-sum = f32[128,32]{0,1} cross-replica-sum(input), to_apply=add, replica_group_ids={0,0,1,1}, barrier="abc"
+}
+
+)"
+}
   });
   // clang-format on
 }
@@ -900,12 +948,12 @@ class HloParserTest : public ::testing::Test,
         << "'" << s << "' does not contain '" << expected << "'";
   }
 
-  // Expects "ToString(Parse(string)) == string", that is, parses the string,
-  // asserts that it succeeded, stringifies the parsed module, and checks that
-  // the it equals the original string.
+  // Expects "ToString(ParseHloString(string)) == string", that is, parses the
+  // string, asserts that it succeeded, stringifies the parsed module, and
+  // checks that the it equals the original string.
   void ExpectEqual() {
     const string& original = GetParam().module_string;
-    auto result = Parse(original);
+    auto result = ParseHloString(original);
     TF_ASSERT_OK(result.status());
     EXPECT_EQ(original, result.ValueOrDie()->ToString(
                             HloPrintOptions().set_print_large_constants(true)));
@@ -916,7 +964,7 @@ class HloParserShortTest : public HloParserTest {
  protected:
   void ExpectEqualShort() {
     const string& original = GetParam().module_string;
-    auto result = Parse(original);
+    auto result = ParseHloString(original);
     TF_ASSERT_OK(result.status());
     EXPECT_EQ(original,
               result.ValueOrDie()->ToString(HloPrintOptions::ShortParsable()));
@@ -937,13 +985,13 @@ INSTANTIATE_TEST_CASE_P(HloParserTestSuccessInstantiation, HloParserShortTest,
 
 TEST_F(HloParserTest, Empty) {
   const string original = "";
-  auto result = Parse(original);
+  auto result = ParseHloString(original);
   EXPECT_NE(Status::OK(), result.status());
 }
 
 TEST_F(HloParserTest, Garbage) {
   const string original = "HloModule thi$ str1ng makes# N0 sen$e @all!*&^%$";
-  auto result = Parse(original);
+  auto result = ParseHloString(original);
   EXPECT_NE(Status::OK(), result.status());
 }
 
@@ -957,7 +1005,7 @@ ENTRY %blabla (x: f32[], y: f32[]) -> f32[] {
 }
 
 )";
-  auto result = Parse(original);
+  auto result = ParseHloString(original);
   EXPECT_NE(Status::OK(), result.status());
 }
 
@@ -969,7 +1017,7 @@ ENTRY %blabla (x: g32[]) -> g32[] {
 }
 
 )";
-  auto result = Parse(original);
+  auto result = ParseHloString(original);
   EXPECT_NE(Status::OK(), result.status());
 }
 
@@ -982,7 +1030,7 @@ ENTRY %blabla (x: f32[]) -> pred[] {
 }
 
 )";
-  auto result = Parse(original);
+  auto result = ParseHloString(original);
   EXPECT_NE(Status::OK(), result.status());
 }
 
@@ -993,7 +1041,7 @@ ENTRY %blabla (x: f32[]) -> pred[] {
   %eq = pred[]{} equal-to(f32[]{} %x, f32[]{} %y)
 }
 )";
-  auto result = Parse(original);
+  auto result = ParseHloString(original);
   EXPECT_NE(Status::OK(), result.status());
 }
 
@@ -1008,7 +1056,7 @@ ENTRY %SelectScalarS32True.v4 () -> s32[] {
 }
 
 )";
-  auto result = Parse(original);
+  auto result = ParseHloString(original);
   TF_EXPECT_OK(result.status());
   // Constant instructions have no name. The string will be parsed successfully
   // but the constant names will not be exactly the same.
@@ -1019,12 +1067,12 @@ TEST_F(HloParserTest, ConfigurationField) {
 ENTRY %configuration_test() -> s32[] {
   %constant = s32[] constant(42), backend_config="foo bar"
 })";
-  auto result = Parse(original);
+  auto result = ParseHloString(original);
   TF_ASSERT_OK(result.status());
   EXPECT_EQ("foo bar", result.ValueOrDie()
                            ->entry_computation()
                            ->root_instruction()
-                           ->backend_config());
+                           ->raw_backend_config_string());
 }
 
 TEST_F(HloParserTest, LiteralDimensionsMismatch_1) {
@@ -1035,7 +1083,7 @@ ENTRY %some_2 () -> f32[2] {
 }
 
 )";
-  auto result = Parse(original);
+  auto result = ParseHloString(original);
   EXPECT_NE(Status::OK(), result.status());
   ExpectHasSubstr(result.status().error_message(),
                   "expects nested array in rank 1, but sees larger");
@@ -1049,7 +1097,7 @@ ENTRY %some_2x3 () -> f32[2,3] {
 }
 
 )";
-  auto result = Parse(original);
+  auto result = ParseHloString(original);
   EXPECT_NE(Status::OK(), result.status());
   ExpectHasSubstr(result.status().error_message(),
                   "expects nested array in rank 2, but sees 1");
@@ -1063,7 +1111,7 @@ ENTRY %some_2x3x2 () -> f32[2,3,2] {
 }
 
 )";
-  auto result = Parse(original);
+  auto result = ParseHloString(original);
   EXPECT_NE(Status::OK(), result.status());
   ExpectHasSubstr(result.status().error_message(),
                   "expects 3 elements in the [0]th element");
@@ -1078,7 +1126,7 @@ ENTRY %ConstantF16Overflow.v4 () -> f16[] {
 }
 
 )";
-  auto result = Parse(original);
+  auto result = ParseHloString(original);
   EXPECT_NE(Status::OK(), result.status());
   ExpectHasSubstr(result.status().error_message(),
                   "is out of range for literal's primitive type F16");
@@ -1092,7 +1140,7 @@ ENTRY %ConstantWithExp.v4 () -> f32[] {
 }
 
 )";
-  auto result = Parse(original);
+  auto result = ParseHloString(original);
   TF_EXPECT_OK(result.status());
   // The string will be parsed successfully but the output strings are not
   // exactly the same, because "3e2" is parsed into value 300 and will be
@@ -1110,7 +1158,7 @@ ENTRY %Convolve1D1Window_0.v3 (input: f32[1,2,1], filter: f32[1,1,1]) -> f32[1,2
 }
 
 )";
-  TF_EXPECT_OK(Parse(original).status());
+  TF_EXPECT_OK(ParseHloString(original).status());
 }
 
 TEST_F(HloParserTest, InvalidDimLabels) {
@@ -1126,17 +1174,18 @@ ENTRY %Convolve1D1Window_0.v3 (input: f32[1,2,1], filter: f32[1,1,1]) -> f32[1,2
 
 )";
 
+  ExpectHasSubstr(ParseHloString(tensorflow::strings::StrCat(
+                                     prefix, ",dim_labels=00_01_10", suffix))
+                      .status()
+                      .error_message(),
+                  "expects dim labels pattern");
+
   ExpectHasSubstr(
-      Parse(tensorflow::strings::StrCat(prefix, ",dim_labels=00_01_10", suffix))
+      ParseHloString(tensorflow::strings::StrCat(
+                         prefix, ",dim_labels=010_1100->010", suffix))
           .status()
           .error_message(),
-      "expects dim labels pattern");
-
-  ExpectHasSubstr(Parse(tensorflow::strings::StrCat(
-                            prefix, ",dim_labels=010_1100->010", suffix))
-                      .status()
-                      .error_message(),
-                  "must have the same rank");
+      "must have the same rank");
 }
 
 TEST_F(HloParserTest, UnexpectedAttribute) {
@@ -1151,7 +1200,7 @@ ENTRY %TwoSendRecvBothWayRecvFist.v3 () -> f32[] {
 }
 
 )";
-  ExpectHasSubstr(Parse(original).status().error_message(),
+  ExpectHasSubstr(ParseHloString(original).status().error_message(),
                   "unexpected attribute \"calls\"");
 }
 
@@ -1167,7 +1216,7 @@ ENTRY %TwoSendRecvBothWayRecvFist.v3 () -> f32[] {
 }
 
 )";
-  ExpectHasSubstr(Parse(original).status().error_message(),
+  ExpectHasSubstr(ParseHloString(original).status().error_message(),
                   "attribute channel_id is expected but not seen");
 }
 
@@ -1183,7 +1232,7 @@ ENTRY %TwoSendRecvBothWayRecvFist.v3 () -> f32[] {
 }
 
 )";
-  ExpectHasSubstr(Parse(original).status().error_message(),
+  ExpectHasSubstr(ParseHloString(original).status().error_message(),
                   "'done' is not defined");
 }
 
@@ -1196,7 +1245,7 @@ ENTRY %slice.v2 (p0: f32[3,3,4,4]) -> f32[3,3,2,4] {
 }
 
 )";
-  TF_EXPECT_OK(Parse(original).status());
+  TF_EXPECT_OK(ParseHloString(original).status());
 }
 
 TEST_F(HloParserTest, PaddingConfigIsNotWindowPad) {
@@ -1210,7 +1259,7 @@ ENTRY %Convolve1D1Window_0.v3 (input: f32[1,2,1], filter: f32[1,1,1]) -> f32[1,2
 }
 
 )";
-  ExpectHasSubstr(Parse(original).status().error_message(),
+  ExpectHasSubstr(ParseHloString(original).status().error_message(),
                   "expects padding_low and padding_high separated by '_'");
 }
 
@@ -1222,7 +1271,7 @@ ENTRY %test_comma.v4 () -> f32[] {
 }
 
 )";
-  TF_EXPECT_OK(Parse(original).status());
+  TF_EXPECT_OK(ParseHloString(original).status());
 }
 
 TEST_F(HloParserTest, ComputationShapeDoesNotMatchRootShape) {
@@ -1232,7 +1281,7 @@ ENTRY %CustomCall () -> f32[1] {
   %constant = f32[1]{0} constant({12345})
   ROOT %foo = f32[1,2,3]{0,2,1} custom-call(f32[1]{0} %constant), custom_call_target="foo\"bar"
 })";
-  ExpectHasSubstr(Parse(original).status().error_message(),
+  ExpectHasSubstr(ParseHloString(original).status().error_message(),
                   "Shape of computation CustomCall, f32[1], is not compatible "
                   "with that of its root instruction foo, f32[1,2,3]");
 }
@@ -1251,7 +1300,7 @@ ENTRY %Reduce (input: f32[8,16,256]) -> f32[8,16] {
   ROOT reduce = f32[8,16]{0,1} reduce(input, constant), dimensions={2}, to_apply=add_F32.v3
 })";
 
-  auto module = Parse(original);
+  auto module = ParseHloString(original);
   TF_ASSERT_OK(module.status());
   auto program_layout = module.ValueOrDie()->host_entry_computation_layout();
   ASSERT_EQ(program_layout.parameter_count(), 1);
@@ -1274,7 +1323,7 @@ c1 {
 c2 {
   const2 = f32[1]{0} constant({67890})
 })";
-  auto module = Parse(original);
+  auto module = ParseHloString(original);
   TF_ASSERT_OK(module.status());
   EXPECT_EQ(module.ValueOrDie()->entry_computation()->name(), "c2");
 }
@@ -1285,7 +1334,7 @@ ENTRY consts {
   first = f32[1]{0} constant({12345})
   last = f32[1]{0} constant({67890})
 })";
-  auto module = Parse(original);
+  auto module = ParseHloString(original);
   TF_ASSERT_OK(module.status());
   EXPECT_EQ(
       module.ValueOrDie()->entry_computation()->root_instruction()->name(),
@@ -1300,7 +1349,7 @@ ENTRY c1 {
 ENTRY c2 {
   const2 = f32[1]{0} constant({67890})
 })";
-  ExpectHasSubstr(Parse(original).status().error_message(),
+  ExpectHasSubstr(ParseHloString(original).status().error_message(),
                   "expects only one ENTRY");
 }
 
@@ -1310,25 +1359,10 @@ ENTRY consts {
   ROOT const1 = f32[1]{0} constant({12345})
   ROOT const2 = f32[1]{0} constant({12345})
 })";
-  ExpectHasSubstr(Parse(original).status().error_message(),
+  ExpectHasSubstr(ParseHloString(original).status().error_message(),
                   "one computation should have only one ROOT");
 }
 
-TEST_F(HloParserTest, InstructionExists) {
-  const string original = R"(HloModule comp_exists
-c1 {
-  instr = f32[1]{0} constant({12345})
-}
-c2 {
-  instr = f32[1]{0} constant({67890})
-})";
-
-  ExpectHasSubstr(Parse(original).status().error_message(),
-                  R"(was parsing 3:3: error: instruction previously defined here
-  instr = f32[1]{0} constant({12345})
-  ^)");
-}
-
 TEST_F(HloParserTest, ComputationExists) {
   const string original = R"(HloModule comp_exists
 comp {
@@ -1337,12 +1371,52 @@ comp {
 comp {
   const2 = f32[1]{0} constant({67890})
 })";
-  ExpectHasSubstr(Parse(original).status().error_message(),
+  ExpectHasSubstr(ParseHloString(original).status().error_message(),
                   R"(was parsing 2:1: error: computation previously defined here
 comp {
 ^)");
 }
 
+TEST_F(HloParserTest, CrossComputationLookup) {
+  const string original = R"(HloModule cross_computation_lookup:
+tcalla (a: (s32[], s32[])) -> (s32[], s32[]) {
+  ROOT aparam = (s32[], s32[]) parameter(0)
+}
+
+tcallb (b: (s32[], s32[])) -> s32[] {
+  rparam = (s32[], s32[]) parameter(0)
+  ROOT gte0 = s32[] get-tuple-element(aparam), index=0
+}
+
+ENTRY entry {
+  param = (s32[], s32[]) parameter(0)
+  call0 = (s32[], s32[]) call(param), to_apply=tcalla
+  ROOT call1 = s32[] call(param), to_apply=tcallb
+})";
+  ExpectHasSubstr(
+      ParseHloString(original).status().error_message(),
+      "was parsing 8:39: error: instruction does not exist: aparam");
+}
+
+TEST_F(HloParserTest, ParseSharding) {
+  const string original = "{maximal device=42}";
+  TF_ASSERT_OK_AND_ASSIGN(HloSharding sharding, ParseSharding(original));
+  EXPECT_EQ(sharding.ToString(), original);
+}
+
+TEST_F(HloParserTest, ParseWindow) {
+  Window original = window_util::MakeWindow({1, 2, 3});
+  TF_ASSERT_OK_AND_ASSIGN(Window parsed,
+                          ParseWindow(window_util::ToString(original)))
+  EXPECT_EQ(window_util::ToString(original), window_util::ToString(parsed));
+}
+
+TEST_F(HloParserTest, ParseConvolutionDimensionNumbers) {
+  const string original = "b0f_0io->b0f";
+  TF_ASSERT_OK_AND_ASSIGN(ConvolutionDimensionNumbers dnums,
+                          ParseConvolutionDimensionNumbers(original));
+  EXPECT_EQ(original, ConvolutionDimensionNumbersToString(dnums));
+}
+
 }  // namespace
-}  // namespace tools
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/hlo_reachability.cc b/tensorflow/compiler/xla/service/hlo_reachability.cc
index 8e167633bb13476301fa0c4afa0b123c9b47e40d..4738e46f8aeb96a4c25d04b3246bd21f644fe3ea 100644
--- a/tensorflow/compiler/xla/service/hlo_reachability.cc
+++ b/tensorflow/compiler/xla/service/hlo_reachability.cc
@@ -33,17 +33,27 @@ bool HloReachabilityMap::SetReachabilityToUnion(
     const HloInstruction* instruction) {
   BitVector& bit_vector = GetBitVector(instruction);
   tmp_bit_vector_ = bit_vector;
+  SetReachabilityToUnionHelper(inputs, instruction, &bit_vector);
+  return bit_vector != tmp_bit_vector_;
+}
 
+void HloReachabilityMap::FastSetReachabilityToUnion(
+    tensorflow::gtl::ArraySlice<const HloInstruction*> inputs,
+    const HloInstruction* instruction) {
+  SetReachabilityToUnionHelper(inputs, instruction, &GetBitVector(instruction));
+}
+
+void HloReachabilityMap::SetReachabilityToUnionHelper(
+    tensorflow::gtl::ArraySlice<const HloInstruction*> inputs,
+    const HloInstruction* instruction, BitVector* bit_vector) {
   // If instruction is part of inputs, don't reset the bit_vector.
   if (std::find(inputs.begin(), inputs.end(), instruction) == inputs.end()) {
-    bit_vector.SetToZero();
+    bit_vector->SetToZero();
   }
-  bit_vector.Set(GetIndex(instruction));
+  bit_vector->Set(GetIndex(instruction));
   for (const HloInstruction* input : inputs) {
-    bit_vector.OrWith(GetBitVector(input));
+    bit_vector->OrWith(GetBitVector(input));
   }
-
-  return bit_vector != tmp_bit_vector_;
 }
 
 void HloReachabilityMap::SetReachable(const HloInstruction* a,
diff --git a/tensorflow/compiler/xla/service/hlo_reachability.h b/tensorflow/compiler/xla/service/hlo_reachability.h
index 553ec11f6f9a2997ab7113f9b8241e04c7fe20d5..69bb2b3cee6dafe058c45b4e74e93401bea2cfc9 100644
--- a/tensorflow/compiler/xla/service/hlo_reachability.h
+++ b/tensorflow/compiler/xla/service/hlo_reachability.h
@@ -57,6 +57,11 @@ class HloReachabilityMap {
       tensorflow::gtl::ArraySlice<const HloInstruction*> inputs,
       const HloInstruction* instruction);
 
+  // As above, but faster because it does not check if the reachability changed.
+  void FastSetReachabilityToUnion(
+      tensorflow::gtl::ArraySlice<const HloInstruction*> inputs,
+      const HloInstruction* instruction);
+
   // Sets entry so that IsReachable(a, b) will return true
   //
   // !!! THIS FUNCTION DOES NOT COMPUTE REACHABILITY !!! It sets the adjacency
@@ -133,6 +138,11 @@ class HloReachabilityMap {
     return bit_vectors_[GetIndex(instruction)];
   }
 
+  // Helper for SetReachabilityToUnion/FastSetReachabilityToUnion.
+  void SetReachabilityToUnionHelper(
+      tensorflow::gtl::ArraySlice<const HloInstruction*> inputs,
+      const HloInstruction* instruction, BitVector* bit_vector);
+
   // Return the index of the given instruction. The value is used to index into
   // the vector of BitVectors and the BitVectors themselves.
   int GetIndex(const HloInstruction* instruction) const {
diff --git a/tensorflow/compiler/xla/service/hlo_rematerialization.cc b/tensorflow/compiler/xla/service/hlo_rematerialization.cc
index 39b85de0f12024f5e20ddd37618987c6d06bc307..9c7bc7a5ea7c77dadb8772f08b823c3579cf2154 100644
--- a/tensorflow/compiler/xla/service/hlo_rematerialization.cc
+++ b/tensorflow/compiler/xla/service/hlo_rematerialization.cc
@@ -71,6 +71,20 @@ bool IsRematerializable(const HloInstruction* instruction) {
   }
 }
 
+// Checks whether an instruction can be rematerialized, by looking up the
+// cache before, and eventually calling the IsRematerializable() API.
+bool CanBeRematerialized(
+    const HloInstruction* instruction,
+    tensorflow::gtl::FlatMap<const HloInstruction*, bool>* remat_able) {
+  auto it = remat_able->find(instruction);
+  if (it != remat_able->end()) {
+    return it->second;
+  }
+  bool rematerializable = IsRematerializable(instruction);
+  (*remat_able)[instruction] = rematerializable;
+  return rematerializable;
+}
+
 // Type holding a unique identifier for each Buffer object.
 using BufferId = int64;
 using BufferIdList = tensorflow::gtl::InlinedVector<BufferId, 3>;
@@ -843,9 +857,10 @@ int64 RematerializationCost(const HloInstruction* instruction,
 // candidate which reduce memory use at the program point of the current
 // instruction as indicated by memory_tracker. nullptr is returned if no
 // candidate can be found.
-Item* PickRematerializationCandidate(const MemoryUsageTracker& memory_tracker,
-                                     const InstructionList& instruction_list,
-                                     int64 memory_limit_bytes) {
+Item* PickRematerializationCandidate(
+    const MemoryUsageTracker& memory_tracker,
+    const InstructionList& instruction_list, int64 memory_limit_bytes,
+    tensorflow::gtl::FlatMap<const HloInstruction*, bool>* remat_able) {
   Item* best_item = nullptr;
   int64 best_cost = 0;
 
@@ -869,8 +884,7 @@ Item* PickRematerializationCandidate(const MemoryUsageTracker& memory_tracker,
               << " is excluded from rematerialization";
       continue;
     }
-
-    if (!IsRematerializable(candidate)) {
+    if (!CanBeRematerialized(candidate, remat_able)) {
       VLOG(5) << "candidate " << candidate->name()
               << " not viable: is not rematerializable";
       continue;
@@ -974,6 +988,9 @@ StatusOr<bool> HloRematerialization::RematerializeComputation(
   // blacklist.
   tensorflow::gtl::FlatSet<const HloInstruction*> remat_move_instructions;
 
+  // The map from instructions to their rematerializable status.
+  tensorflow::gtl::FlatMap<const HloInstruction*, bool> remat_able;
+
   // The peak memory of the computation at any point in the instruction
   // sequence.
   int64 peak_memory = memory_tracker.memory_usage();
@@ -1011,7 +1028,7 @@ StatusOr<bool> HloRematerialization::RematerializeComputation(
               << ", limit is " << HumanReadableNumBytes(memory_limit_bytes);
 
       Item* best_item = PickRematerializationCandidate(
-          memory_tracker, instruction_list, memory_limit_bytes);
+          memory_tracker, instruction_list, memory_limit_bytes, &remat_able);
 
       if (best_item == nullptr) {
         VLOG(3) << "Unable to find rematerialization candidate at program "
@@ -1213,7 +1230,7 @@ StatusOr<bool> HloRematerialization::Run(
 
   XLA_VLOG_LINES(3, "Before HloRematerialization:\n" + module->ToString());
   // Create initial sequence of HLO instructions.
-  TF_ASSIGN_OR_RETURN(*sequence, CreateMemoryMinimizingSequence(
+  TF_ASSIGN_OR_RETURN(*sequence, ScheduleComputationsInModule(
                                      *module,
                                      [this](const BufferValue& buffer) {
                                        return size_function_(buffer.shape());
diff --git a/tensorflow/compiler/xla/service/hlo_rematerialization_test.cc b/tensorflow/compiler/xla/service/hlo_rematerialization_test.cc
index 83de54f3fa56ee660b79d8c366dbc0b52f9fde87..e81334d5a84268a129cd4e90091e97dc23243226 100644
--- a/tensorflow/compiler/xla/service/hlo_rematerialization_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_rematerialization_test.cc
@@ -27,6 +27,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/tests/hlo_test_base.h"
 #include "tensorflow/compiler/xla/types.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
+#include "tensorflow/core/lib/core/status_test_util.h"
 
 namespace xla {
 namespace {
@@ -40,7 +41,8 @@ class HloRematerializationTest : public HloTestBase {
   // Creates and returns a computation which can benefit from
   // rematerialization. The computation looks like:
   //
-  //   F32[] %param = {...}
+  //   F32[1] %param = {...}
+  //   F32[] %reshape = reshape(F32[], param)
   //   F32[1024] %bcast = broadcast(%param)
   //   F32[1024] %negate = negate(%bcast)
   //   F32[2048] %concat_1 = concat({%negate, %negate})
@@ -57,9 +59,11 @@ class HloRematerializationTest : public HloTestBase {
       const string& suffix = "") {
     auto builder = HloComputation::Builder(TestName() + suffix);
     auto param = builder.AddInstruction(
-        HloInstruction::CreateParameter(0, scalar_shape_, "param"));
+        HloInstruction::CreateParameter(0, vec1_shape_, "param"));
+    auto reshape = builder.AddInstruction(
+        HloInstruction::CreateReshape(scalar_shape_, param));
     auto bcast = builder.AddInstruction(
-        HloInstruction::CreateBroadcast(vec1024_shape_, param, {}));
+        HloInstruction::CreateBroadcast(vec1024_shape_, reshape, {}));
     auto negate = builder.AddInstruction(
         HloInstruction::CreateUnary(vec1024_shape_, HloOpcode::kNegate, bcast));
     auto concat_1 = builder.AddInstruction(HloInstruction::CreateConcatenate(
@@ -100,9 +104,11 @@ class HloRematerializationTest : public HloTestBase {
       const string& suffix = "") {
     auto builder = HloComputation::Builder(TestName() + suffix);
     auto param = builder.AddInstruction(
-        HloInstruction::CreateParameter(0, scalar_shape_, "param"));
+        HloInstruction::CreateParameter(0, vec1_shape_, "param"));
+    auto reshape = builder.AddInstruction(
+        HloInstruction::CreateReshape(scalar_shape_, param));
     auto bcast = builder.AddInstruction(
-        HloInstruction::CreateBroadcast(vec1024_shape_, param, {}));
+        HloInstruction::CreateBroadcast(vec1024_shape_, reshape, {}));
     auto slice_1 = builder.AddInstruction(
         HloInstruction::CreateSlice(vec1_shape_, bcast, /*start_indices=*/{0},
                                     /*limit_indices=*/{1},
@@ -135,6 +141,15 @@ class HloRematerializationTest : public HloTestBase {
     return ShapeUtil::ByteSizeOf(shape, sizeof(void*));
   }
 
+  StatusOr<bool> RunHloRematerialization(
+      int64 memory_limit_bytes, HloModule* module,
+      SequentialHloOrdering::HloModuleSequence* sequence) {
+    TF_EXPECT_OK(verifier().Run(module).status());
+    return HloRematerialization::RematerializeAndSchedule(
+        ByteSizeOf, memory_limit_bytes, module, DefaultMemoryScheduler,
+        sequence);
+  }
+
   // Various shapes used in the canned computations.
   const Shape scalar_shape_ = ShapeUtil::MakeShape(xla::F32, {});
   const Shape vec1_shape_ = ShapeUtil::MakeShape(xla::F32, {1});
@@ -158,11 +173,9 @@ TEST_F(HloRematerializationTest, SingleComputation) {
   SequentialHloOrdering::HloModuleSequence sequence;
   // Computation requires 16KB without rematerialization, but uses only 12KB
   // with rematerialization so pick a memory limit between these values (14KB).
-  TF_ASSERT_OK_AND_ASSIGN(bool changed,
-                          HloRematerialization::RematerializeAndSchedule(
-                              ByteSizeOf,
-                              /*memory_limit_bytes=*/14 * 1024, module.get(),
-                              DefaultMemoryScheduler, &sequence));
+  TF_ASSERT_OK_AND_ASSIGN(bool changed, RunHloRematerialization(
+                                            /*memory_limit_bytes=*/14 * 1024,
+                                            module.get(), &sequence));
   EXPECT_TRUE(changed);
 
   // Root should not have changed.
@@ -188,18 +201,16 @@ TEST_F(HloRematerializationTest, SingleComputationNoRematerialization) {
   HloComputation* computation =
       module->AddEntryComputation(MakeRematerializableComputation());
 
-  EXPECT_EQ(computation->instruction_count(), 7);
+  EXPECT_EQ(computation->instruction_count(), 8);
 
   SequentialHloOrdering::HloModuleSequence sequence;
-  TF_ASSERT_OK_AND_ASSIGN(bool changed,
-                          HloRematerialization::RematerializeAndSchedule(
-                              ByteSizeOf,
-                              /*memory_limit_bytes=*/20 * 1024, module.get(),
-                              DefaultMemoryScheduler, &sequence));
+  TF_ASSERT_OK_AND_ASSIGN(bool changed, RunHloRematerialization(
+                                            /*memory_limit_bytes=*/20 * 1024,
+                                            module.get(), &sequence));
 
   // No instructions should have been materialized.
   EXPECT_FALSE(changed);
-  EXPECT_EQ(computation->instruction_count(), 7);
+  EXPECT_EQ(computation->instruction_count(), 8);
 }
 
 // Test rematerialization of a computation which calls another computation via a
@@ -225,23 +236,21 @@ TEST_F(HloRematerializationTest, RematerializeAroundWhile) {
       module->AddEntryComputation(MakeRematerializableWhileComputation(
           while_cond, /*while_body=*/body_computation));
 
-  EXPECT_EQ(entry_computation->instruction_count(), 6);
-  EXPECT_EQ(body_computation->instruction_count(), 7);
+  EXPECT_EQ(entry_computation->instruction_count(), 7);
+  EXPECT_EQ(body_computation->instruction_count(), 8);
 
   // The body computation uses 16KB and the entry computation uses 2KB at the
   // while so the peak memory use of the module is 18KB. Set the memory limit a
   // bit lower (17KB) to force rematerialization of the entry computation.
   SequentialHloOrdering::HloModuleSequence sequence;
-  TF_ASSERT_OK_AND_ASSIGN(bool changed,
-                          HloRematerialization::RematerializeAndSchedule(
-                              ByteSizeOf,
-                              /*memory_limit_bytes=*/17 * 1024, module.get(),
-                              DefaultMemoryScheduler, &sequence));
+  TF_ASSERT_OK_AND_ASSIGN(bool changed, RunHloRematerialization(
+                                            /*memory_limit_bytes=*/17 * 1024,
+                                            module.get(), &sequence));
   EXPECT_TRUE(changed);
 
   // Only the entry computation should have a rematerialized instruction added.
-  EXPECT_EQ(entry_computation->instruction_count(), 7);
-  EXPECT_EQ(body_computation->instruction_count(), 7);
+  EXPECT_EQ(entry_computation->instruction_count(), 8);
+  EXPECT_EQ(body_computation->instruction_count(), 8);
 }
 
 // Test rematerialization of a computation which calls another computation via a
@@ -264,20 +273,18 @@ TEST_F(HloRematerializationTest, RematerializeEntryAndWhileBody) {
       module->AddEntryComputation(MakeRematerializableWhileComputation(
           while_cond, /*while_body=*/body_computation));
 
-  EXPECT_EQ(entry_computation->instruction_count(), 6);
-  EXPECT_EQ(body_computation->instruction_count(), 7);
+  EXPECT_EQ(entry_computation->instruction_count(), 7);
+  EXPECT_EQ(body_computation->instruction_count(), 8);
 
   SequentialHloOrdering::HloModuleSequence sequence;
-  TF_ASSERT_OK_AND_ASSIGN(bool changed,
-                          HloRematerialization::RematerializeAndSchedule(
-                              ByteSizeOf,
-                              /*memory_limit_bytes=*/15 * 1024, module.get(),
-                              DefaultMemoryScheduler, &sequence));
+  TF_ASSERT_OK_AND_ASSIGN(bool changed, RunHloRematerialization(
+                                            /*memory_limit_bytes=*/15 * 1024,
+                                            module.get(), &sequence));
   EXPECT_TRUE(changed);
 
-  // Both computations should have a rematerialized instruction added.
-  EXPECT_EQ(entry_computation->instruction_count(), 7);
-  EXPECT_EQ(body_computation->instruction_count(), 8);
+  // Both computations should have rematerialized instructions added.
+  EXPECT_EQ(entry_computation->instruction_count(), 9);
+  EXPECT_EQ(body_computation->instruction_count(), 9);
 }
 
 // Test rematerialization of a doubly nested computation. All computations
@@ -303,24 +310,22 @@ TEST_F(HloRematerializationTest, RematerializeNestedComputations) {
       module->AddEntryComputation(MakeRematerializableWhileComputation(
           while_cond, /*while_body=*/middle_computation));
 
-  EXPECT_EQ(entry_computation->instruction_count(), 6);
-  EXPECT_EQ(middle_computation->instruction_count(), 6);
-  EXPECT_EQ(inner_computation->instruction_count(), 7);
+  EXPECT_EQ(entry_computation->instruction_count(), 7);
+  EXPECT_EQ(middle_computation->instruction_count(), 7);
+  EXPECT_EQ(inner_computation->instruction_count(), 8);
 
   // If all computations are maximally rematerialized then peak memory usage is
   // ~12K so pick something slightly larger.
   SequentialHloOrdering::HloModuleSequence sequence;
-  TF_ASSERT_OK_AND_ASSIGN(bool changed,
-                          HloRematerialization::RematerializeAndSchedule(
-                              ByteSizeOf,
-                              /*memory_limit_bytes=*/13 * 1024, module.get(),
-                              DefaultMemoryScheduler, &sequence));
+  TF_ASSERT_OK_AND_ASSIGN(bool changed, RunHloRematerialization(
+                                            /*memory_limit_bytes=*/13 * 1024,
+                                            module.get(), &sequence));
   EXPECT_TRUE(changed);
 
-  // All computations should have a rematerialized instruction added.
-  EXPECT_EQ(entry_computation->instruction_count(), 7);
-  EXPECT_EQ(middle_computation->instruction_count(), 7);
-  EXPECT_EQ(inner_computation->instruction_count(), 8);
+  // All computations should have rematerialized instructions added.
+  EXPECT_EQ(entry_computation->instruction_count(), 9);
+  EXPECT_EQ(middle_computation->instruction_count(), 9);
+  EXPECT_EQ(inner_computation->instruction_count(), 9);
 }
 
 TEST_F(HloRematerializationTest, RngNotRematerialized) {
@@ -382,10 +387,9 @@ TEST_F(HloRematerializationTest, RngNotRematerialized) {
   // parameter and output) and 20KB (peak memory possible with
   // rematerialization).
   TF_ASSERT_OK_AND_ASSIGN(
-      bool changed, HloRematerialization::RematerializeAndSchedule(
-                        ByteSizeOf,
+      bool changed, RunHloRematerialization(
                         /*memory_limit_bytes=*/4 * ByteSizeOf(vec1024_shape_),
-                        module.get(), DefaultMemoryScheduler, &sequence));
+                        module.get(), &sequence));
   EXPECT_TRUE(changed);
   // The rng should not have been rematerialized.
   EXPECT_EQ(count_rngs(entry_computation), 1);
@@ -476,11 +480,9 @@ TEST_F(HloRematerializationTest, InstructionRematerializedMultipleTimes) {
   // Pick a memory limit some where between 24KB (initial peak memory including
   // parameter and output) and 20KB (peak memory possible with
   // rematerialization).
-  TF_ASSERT_OK_AND_ASSIGN(bool changed,
-                          HloRematerialization::RematerializeAndSchedule(
-                              ByteSizeOf,
-                              /*memory_limit_bytes=*/22 * 1024, module.get(),
-                              DefaultMemoryScheduler, &sequence));
+  TF_ASSERT_OK_AND_ASSIGN(bool changed, RunHloRematerialization(
+                                            /*memory_limit_bytes=*/22 * 1024,
+                                            module.get(), &sequence));
   EXPECT_TRUE(changed);
 
   // The broadcast should have been rematerialized 3 times.
@@ -573,11 +575,9 @@ TEST_P(IndirectUseTest, IndirectUseNotRematerialized) {
   // Pick a memory limit some where between 24KB (initial peak memory including
   // parameter and output) and 20KB (peak memory possible with
   // rematerialization).
-  TF_ASSERT_OK_AND_ASSIGN(bool changed,
-                          HloRematerialization::RematerializeAndSchedule(
-                              ByteSizeOf,
-                              /*memory_limit_bytes=*/22 * 1024, module.get(),
-                              DefaultMemoryScheduler, &sequence));
+  TF_ASSERT_OK_AND_ASSIGN(bool changed, RunHloRematerialization(
+                                            /*memory_limit_bytes=*/22 * 1024,
+                                            module.get(), &sequence));
   // Rematerialization should only occur if the rematerializable instruction has
   // no indirect uses.
   if (indirectly_used) {
diff --git a/tensorflow/compiler/xla/service/hlo_runner.cc b/tensorflow/compiler/xla/service/hlo_runner.cc
index 2a601ec3d183023954b6f1b6bca7594384378169..e1f9d8efd4974055947438c8a2e15cb77d1b5c75 100644
--- a/tensorflow/compiler/xla/service/hlo_runner.cc
+++ b/tensorflow/compiler/xla/service/hlo_runner.cc
@@ -22,9 +22,9 @@ limitations under the License.
 #include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
 #include "tensorflow/compiler/xla/layout_util.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/service/transfer_manager.h"
 #include "tensorflow/compiler/xla/shape_util.h"
-#include "tensorflow/compiler/xla/tools/parser/hlo_parser.h"
 #include "tensorflow/core/common_runtime/eigen_thread_pool.h"
 #include "tensorflow/core/platform/logging.h"
 #include "tensorflow/core/platform/types.h"
@@ -36,7 +36,7 @@ HloRunner::CreateModuleFromString(const tensorflow::StringPiece hlo_string,
                                   const DebugOptions& debug_options) {
   HloModuleConfig config;
   config.set_debug_options(debug_options);
-  return tools::Parse(hlo_string, config);
+  return ParseHloString(hlo_string, config);
 }
 
 namespace {
@@ -80,7 +80,7 @@ HloRunner::ReadModuleFromHloTextFile(const std::string& filename,
                                                   filename, &hlo_string));
   HloModuleConfig config;
   config.set_debug_options(debug_options);
-  return tools::Parse(hlo_string, config);
+  return ParseHloString(hlo_string, config);
 }
 
 HloRunner::HloRunner(se::Platform* platform) {
@@ -92,53 +92,108 @@ HloRunner::HloRunner(se::Platform* platform) {
 
 HloRunner::~HloRunner() {}
 
-StatusOr<std::unique_ptr<Literal>> HloRunner::Execute(
-    std::unique_ptr<HloModule> module,
-    const tensorflow::gtl::ArraySlice<Literal*> arguments,
-    bool run_hlo_passes) {
-  TF_ASSIGN_OR_RETURN(std::unique_ptr<Executable> executable,
-                      CreateExecutable(std::move(module), run_hlo_passes));
-  se::Stream stream(backend().default_stream_executor());
-  stream.Init();
+StatusOr<ScopedShapedBuffer> HloRunner::TransferLiteralToDevice(
+    const Literal& literal) {
+  TF_ASSIGN_OR_RETURN(ScopedShapedBuffer buffer,
+                      backend().transfer_manager()->AllocateScopedShapedBuffer(
+                          literal.shape(), backend().memory_allocator(),
+                          backend().default_device_ordinal()));
+  TF_RETURN_IF_ERROR(backend().transfer_manager()->TransferLiteralToDevice(
+      backend().default_stream_executor(), literal, buffer));
+  return std::move(buffer);
+}
 
-  ServiceExecutableRunOptions service_run_options(GetServiceRunOptionsForDevice(
-      backend().default_device_ordinal(), &stream, nullptr));
-  const ExecutableRunOptions& run_options = service_run_options.run_options();
+StatusOr<std::vector<ScopedShapedBuffer>> HloRunner::TransferLiteralsToDevice(
+    const tensorflow::gtl::ArraySlice<const Literal*> literals) {
+  std::vector<ScopedShapedBuffer> buffers;
+  for (const Literal* literal : literals) {
+    CHECK(literal != nullptr);
+    TF_ASSIGN_OR_RETURN(ScopedShapedBuffer buffer,
+                        TransferLiteralToDevice(*literal));
+    buffers.push_back(std::move(buffer));
+  }
+  return std::move(buffers);
+}
 
-  // Copy arguments to device.
-  std::vector<ScopedShapedBuffer> argument_buffers;
-  for (Literal* argument : arguments) {
-    TF_ASSIGN_OR_RETURN(
-        ScopedShapedBuffer argument_buffer,
-        backend().transfer_manager()->AllocateScopedShapedBuffer(
-            argument->shape(), run_options.allocator(),
-            run_options.device_ordinal()));
-    TF_RETURN_IF_ERROR(backend().transfer_manager()->TransferLiteralToDevice(
-        stream.parent(), *argument, argument_buffer));
-    argument_buffers.push_back(std::move(argument_buffer));
+StatusOr<std::vector<ScopedShapedBuffer>> HloRunner::TransferLiteralsToDevice(
+    const tensorflow::gtl::ArraySlice<std::unique_ptr<Literal>> literals) {
+  std::vector<const Literal*> literal_pointers;
+  literal_pointers.reserve(literals.size());
+  for (const auto& literal : literals) {
+    literal_pointers.push_back(literal.get());
   }
+  return TransferLiteralsToDevice(literal_pointers);
+}
+
+StatusOr<std::unique_ptr<Literal>> HloRunner::TransferLiteralFromDevice(
+    const ShapedBuffer& buffer) {
+  return backend().transfer_manager()->TransferLiteralFromDevice(
+      backend().default_stream_executor(), buffer);
+}
 
-  std::vector<const ShapedBuffer*> argument_buffer_ptrs;
-  argument_buffer_ptrs.reserve(argument_buffers.size());
-  for (const auto& buf : argument_buffers) {
-    argument_buffer_ptrs.push_back(&buf);
+StatusOr<std::unique_ptr<Literal>> HloRunner::Execute(
+    std::unique_ptr<HloModule> module,
+    const tensorflow::gtl::ArraySlice<const Literal*> arguments,
+    bool run_hlo_passes, ExecutionProfile* profile) {
+  TF_ASSIGN_OR_RETURN(std::vector<ScopedShapedBuffer> argument_buffers,
+                      TransferLiteralsToDevice(arguments));
+  TF_ASSIGN_OR_RETURN(ScopedShapedBuffer result,
+                      ExecuteWithDeviceBuffers(
+                          /*module=*/std::move(module),
+                          /*arguments=*/argument_buffers,
+                          /*run_hlo_passes=*/run_hlo_passes,
+                          /*profile=*/profile));
+  return TransferLiteralFromDevice(result);
+}
+
+StatusOr<std::unique_ptr<Literal>> HloRunner::Execute(
+    std::unique_ptr<HloModule> module,
+    const tensorflow::gtl::ArraySlice<std::unique_ptr<Literal>> arguments,
+    bool run_hlo_passes, ExecutionProfile* profile) {
+  // Construct a vector of plain pointers for the arguments.
+  std::vector<const Literal*> argument_pointers;
+  argument_pointers.reserve(arguments.size());
+  for (const auto& argument : arguments) {
+    argument_pointers.push_back(argument.get());
   }
+  return Execute(
+      /*module=*/std::move(module),
+      /*arguments=*/argument_pointers,
+      /*run_hlo_passes=*/run_hlo_passes,
+      /*profile=*/profile);
+}
 
-  TF_ASSIGN_OR_RETURN(
-      ScopedShapedBuffer result,
-      executable->ExecuteOnStreamWrapper(
-          &service_run_options, /*profile=*/nullptr, argument_buffer_ptrs));
-
-  auto result_literal = backend().transfer_manager()->TransferLiteralFromDevice(
-      stream.parent(), result);
-  if (result_literal.ok()) {
-    VLOG(4) << "Executed binary and got result: "
-            << result_literal.ValueOrDie()->ToString();
-  } else {
-    VLOG(4) << "Executed binary and got status: "
-            << result_literal.status().ToString();
+StatusOr<ScopedShapedBuffer> HloRunner::ExecuteWithDeviceBuffers(
+    std::unique_ptr<HloModule> module,
+    const tensorflow::gtl::ArraySlice<const ShapedBuffer*> arguments,
+    bool run_hlo_passes, ExecutionProfile* profile) {
+  // Get service run options.
+  se::Stream stream(backend().default_stream_executor());
+  stream.Init();
+  ServiceExecutableRunOptions service_run_options =
+      GetServiceRunOptionsForDevice(backend().default_device_ordinal(), &stream,
+                                    nullptr);
+
+  TF_ASSIGN_OR_RETURN(std::unique_ptr<Executable> executable,
+                      CreateExecutable(std::move(module), run_hlo_passes));
+  return executable->ExecuteOnStreamWrapper(&service_run_options,
+                                            /*profile=*/profile, arguments);
+}
+
+StatusOr<ScopedShapedBuffer> HloRunner::ExecuteWithDeviceBuffers(
+    std::unique_ptr<HloModule> module,
+    const tensorflow::gtl::ArraySlice<ScopedShapedBuffer> arguments,
+    bool run_hlo_passes, ExecutionProfile* profile) {
+  std::vector<const ShapedBuffer*> argument_pointers;
+  argument_pointers.reserve(arguments.size());
+  for (const auto& argument : arguments) {
+    argument_pointers.push_back(&argument);
   }
-  return result_literal;
+  return ExecuteWithDeviceBuffers(
+      /*module=*/std::move(module),
+      /*arguments=*/argument_pointers,
+      /*run_hlo_passes=*/run_hlo_passes,
+      /*profile=*/profile);
 }
 
 StatusOr<std::vector<std::unique_ptr<Literal>>> HloRunner::ExecuteReplicated(
@@ -295,4 +350,8 @@ Backend& HloRunner::backend() {
   return *backend_;
 }
 
+const Backend& HloRunner::backend() const {
+  return const_cast<HloRunner*>(this)->backend();
+}
+
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/hlo_runner.h b/tensorflow/compiler/xla/service/hlo_runner.h
index 53f7c6fe4a09111c5ee24f2290f0f4aeed0a4401..65537f07f56e74b7fe2c2f9792af21efc7229573 100644
--- a/tensorflow/compiler/xla/service/hlo_runner.h
+++ b/tensorflow/compiler/xla/service/hlo_runner.h
@@ -102,6 +102,15 @@ class HloRunner {
   static StatusOr<std::unique_ptr<HloModule>> ReadModuleFromHloTextFile(
       const std::string& filename, const DebugOptions& debug_options);
 
+  // Transfers data between the host and device.
+  StatusOr<ScopedShapedBuffer> TransferLiteralToDevice(const Literal& literal);
+  StatusOr<std::vector<ScopedShapedBuffer>> TransferLiteralsToDevice(
+      const tensorflow::gtl::ArraySlice<const Literal*> literals);
+  StatusOr<std::vector<ScopedShapedBuffer>> TransferLiteralsToDevice(
+      const tensorflow::gtl::ArraySlice<std::unique_ptr<Literal>> literals);
+  StatusOr<std::unique_ptr<Literal>> TransferLiteralFromDevice(
+      const ShapedBuffer& buffer);
+
   // Executes the given module with given literals as input and returns the
   // result as a Literal.
   //
@@ -109,20 +118,25 @@ class HloRunner {
   // optimization.
   StatusOr<std::unique_ptr<Literal>> Execute(
       std::unique_ptr<HloModule> module,
-      const tensorflow::gtl::ArraySlice<Literal*> arguments,
-      bool run_hlo_passes = true);
+      const tensorflow::gtl::ArraySlice<const Literal*> arguments,
+      bool run_hlo_passes = true, ExecutionProfile* profile = nullptr);
 
   StatusOr<std::unique_ptr<Literal>> Execute(
       std::unique_ptr<HloModule> module,
       const tensorflow::gtl::ArraySlice<std::unique_ptr<Literal>> arguments,
-      bool run_hlo_passes = true) {
-    // Construct a vector of plain pointers for the arguments.
-    std::vector<Literal*> argument_pointers;
-    c_transform(
-        arguments, std::back_inserter(argument_pointers),
-        [](const std::unique_ptr<Literal>& literal) { return literal.get(); });
-    return Execute(std::move(module), argument_pointers, run_hlo_passes);
-  }
+      bool run_hlo_passes = true, ExecutionProfile* profile = nullptr);
+
+  // As Execute(), but accepts and returns device buffers instead of host
+  // buffers.
+  StatusOr<ScopedShapedBuffer> ExecuteWithDeviceBuffers(
+      std::unique_ptr<HloModule> module,
+      const tensorflow::gtl::ArraySlice<const ShapedBuffer*> arguments,
+      bool run_hlo_passes = true, ExecutionProfile* profile = nullptr);
+
+  StatusOr<ScopedShapedBuffer> ExecuteWithDeviceBuffers(
+      std::unique_ptr<HloModule> module,
+      const tensorflow::gtl::ArraySlice<ScopedShapedBuffer> arguments,
+      bool run_hlo_passes = true, ExecutionProfile* profile = nullptr);
 
   // Executes a given HLO module into a set of replicas, and returns a map
   // with the replica number as key, and the corresponding returned literal as
@@ -137,6 +151,7 @@ class HloRunner {
   // This creates the backend lazily so it's possible to instantiate an
   // HloRunner in a program without any backends linked in.
   Backend& backend();
+  const Backend& backend() const;
 
  private:
   // Creates an executable object given an HLO module. If run_hlo_passes is
diff --git a/tensorflow/compiler/xla/service/hlo_scheduling.cc b/tensorflow/compiler/xla/service/hlo_scheduling.cc
index 854aa943199397c0e3f84d48a74ef41ae0d3db56..b14ade3549d093acdb5cdc7ae99dd025a42d5621 100644
--- a/tensorflow/compiler/xla/service/hlo_scheduling.cc
+++ b/tensorflow/compiler/xla/service/hlo_scheduling.cc
@@ -36,29 +36,6 @@ using ::tensorflow::strings::HumanReadableNumBytes;
 
 namespace xla {
 
-StatusOr<int64> MinimumMemoryForSequence(
-    const SequentialHloOrdering::HloModuleSequence& module_sequence,
-    const LogicalBuffer::SizeFunction& size_function) {
-  if (module_sequence.empty()) {
-    return 0;
-  }
-
-  const HloModule* module = module_sequence.begin()->first->parent();
-  TF_ASSIGN_OR_RETURN(std::unique_ptr<TuplePointsToAnalysis> points_to_analysis,
-                      TuplePointsToAnalysis::Run(module));
-
-  // The absolute minimum memory required for a given sequence of instructions
-  // is determined by the sequence of Alloc and Free calls on a simulated heap,
-  // ignoring fragmentation. We run the heap simulation on the whole module,
-  // rather than summing each computation, since it gives us a better lower
-  // bound, by minimizing the liveness of sub-computations.
-  TF_ASSIGN_OR_RETURN(
-      HeapSimulator::Result result,
-      HeapSimulator::Run(MakeUnique<NoFragmentationStatsHeap>(), *module,
-                         module_sequence, *points_to_analysis, size_function));
-  return result.heap_size;
-}
-
 namespace {
 
 // Class implementing a list scheduler of HLO instructions which produces a
@@ -299,6 +276,8 @@ class ListScheduler {
       auto best_it = ready_queue.end();
       --best_it;
       const HloInstruction* best = best_it->second.instruction;
+      VLOG(2) << "Schedule instruction: " << best->ToShortString()
+              << " Bytes freed: " << best_it->first.first;
       ready_queue.erase(best_it);
       ready_instructions.erase(best);
       schedule.push_back(best);
@@ -396,7 +375,7 @@ int64 SumLogicalBufferSizes(
   return size;
 }
 
-StatusOr<std::vector<const HloInstruction*>> CreateMemoryMinimizingSequence(
+StatusOr<std::vector<const HloInstruction*>> ScheduleComputationsInModule(
     const HloComputation& computation,
     const TuplePointsToAnalysis& points_to_analysis,
     const LogicalBuffer::SizeFunction& size_function,
@@ -414,18 +393,6 @@ StatusOr<std::vector<const HloInstruction*>> CreateMemoryMinimizingSequence(
 
 }  // namespace
 
-StatusOr<int64> MinimumMemoryForComputation(
-    const HloComputation& computation,
-    const std::vector<const HloInstruction*>& sequence,
-    const TuplePointsToAnalysis& points_to_analysis,
-    const LogicalBuffer::SizeFunction& size_function) {
-  TF_ASSIGN_OR_RETURN(
-      HeapSimulator::Result result,
-      HeapSimulator::Run(MakeUnique<NoFragmentationStatsHeap>(), computation,
-                         sequence, points_to_analysis, size_function));
-  return result.heap_size;
-}
-
 StatusOr<std::vector<const HloInstruction*>> DFSMemoryScheduler(
     const HloComputation& computation,
     const TuplePointsToAnalysis& points_to_analysis,
@@ -437,6 +404,7 @@ StatusOr<std::vector<const HloInstruction*>> DFSMemoryScheduler(
   // simply users-1 for each instruction.  By subtracting 1, we're saying that
   // instructions with no users or a single user don't count; instructions with
   // lots of fan-out will be visited earlier.
+  int64 cumulative_total_size = 0;
   tensorflow::gtl::FlatMap<const HloInstruction*, int64> extra_users;
   tensorflow::gtl::FlatMap<const HloInstruction*, int64> total_sizes;
   for (const HloInstruction* hlo : computation.MakeInstructionPostOrder()) {
@@ -449,12 +417,21 @@ StatusOr<std::vector<const HloInstruction*>> DFSMemoryScheduler(
     int64 logical_buffer_size = SumLogicalBufferSizes(
         points_to_analysis.GetBuffersDefinedByInstruction(hlo), size_function);
     total_sizes[hlo] = logical_buffer_size;
+    cumulative_total_size += logical_buffer_size;
     tensorflow::gtl::FlatSet<const HloInstruction*> unique_operands(
         hlo->operands().begin(), hlo->operands().end());
     for (const HloInstruction* operand : unique_operands) {
       extra_users[hlo] += extra_users[operand];
       total_sizes[hlo] += total_sizes[operand];
     }
+    // total_sizes[hlo] transitively includes the sizes of all nodes that
+    // lead to it. But computation is a DAG, so we are double-counting nodes,
+    // which can lead to overflows for large programs.
+    // cumulative_total_size caps the size to prevent overflows.
+    // NOTE(dimvar): this is quite ugly and should be changed. It's unclear
+    // why we care about transitive sizes; when scheduling a node, its input
+    // and output buffers should be all that matters, not its "history".
+    total_sizes[hlo] = std::min(total_sizes[hlo], cumulative_total_size);
   }
   CHECK_EQ(extra_users.size(), computation.instruction_count());
   CHECK_EQ(total_sizes.size(), computation.instruction_count());
@@ -564,10 +541,9 @@ StatusOr<std::vector<const HloInstruction*>> DefaultMemoryScheduler(
   }
 }
 
-StatusOr<SequentialHloOrdering::HloModuleSequence>
-CreateMemoryMinimizingSequence(const HloModule& module,
-                               const LogicalBuffer::SizeFunction& size_function,
-                               const MemorySchedulerAlgorithm& algorithm) {
+StatusOr<SequentialHloOrdering::HloModuleSequence> ScheduleComputationsInModule(
+    const HloModule& module, const LogicalBuffer::SizeFunction& size_function,
+    const MemorySchedulerAlgorithm& algorithm) {
   SequentialHloOrdering::HloModuleSequence sequence;
   TF_ASSIGN_OR_RETURN(std::unique_ptr<TuplePointsToAnalysis> points_to_analysis,
                       TuplePointsToAnalysis::Run(&module));
@@ -575,7 +551,7 @@ CreateMemoryMinimizingSequence(const HloModule& module,
   for (const auto* computation : module.MakeComputationPostOrder()) {
     if (!computation->IsFusionComputation()) {
       TF_ASSIGN_OR_RETURN(auto one_computation_sequence,
-                          CreateMemoryMinimizingSequence(
+                          ScheduleComputationsInModule(
                               *computation, *points_to_analysis, size_function,
                               algorithm, memory_by_computation));
       memory_by_computation[computation] =
@@ -588,15 +564,15 @@ CreateMemoryMinimizingSequence(const HloModule& module,
   return sequence;
 }
 
-StatusOr<std::vector<const HloInstruction*>> CreateMemoryMinimizingSequence(
+StatusOr<std::vector<const HloInstruction*>> ScheduleOneComputation(
     const HloComputation& computation,
     const LogicalBuffer::SizeFunction& size_function) {
   CHECK(!computation.IsFusionComputation());
   TF_ASSIGN_OR_RETURN(std::unique_ptr<TuplePointsToAnalysis> points_to_analysis,
                       TuplePointsToAnalysis::Run(computation.parent()));
   tensorflow::gtl::FlatMap<const HloComputation*, int64> empty_map;
-  return CreateMemoryMinimizingSequence(computation, *points_to_analysis,
-                                        size_function, nullptr, empty_map);
+  return ScheduleComputationsInModule(computation, *points_to_analysis,
+                                      size_function, nullptr, empty_map);
 }
 
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/hlo_scheduling.h b/tensorflow/compiler/xla/service/hlo_scheduling.h
index 49b927eefd24f4e26df781dd8d2b977bedba2b80..2b33ccc8bfb895286bb3747aab0a16cf25e2cfae 100644
--- a/tensorflow/compiler/xla/service/hlo_scheduling.h
+++ b/tensorflow/compiler/xla/service/hlo_scheduling.h
@@ -28,20 +28,6 @@ limitations under the License.
 
 namespace xla {
 
-// Returns the minimum memory required to compute the given module sequence,
-// assuming no fragmentation.
-StatusOr<int64> MinimumMemoryForSequence(
-    const SequentialHloOrdering::HloModuleSequence& module_sequence,
-    const LogicalBuffer::SizeFunction& size_function);
-
-// Returns the minimum memory required to compute the given computation,
-// assuming no fragmentation.
-StatusOr<int64> MinimumMemoryForComputation(
-    const HloComputation& computation,
-    const std::vector<const HloInstruction*>& sequence,
-    const TuplePointsToAnalysis& points_to_analysis,
-    const LogicalBuffer::SizeFunction& size_function);
-
 // A memory scheduler computes an execution sequence for the HLO instructions in
 // 'computation' that minimizes peak memory, given a points-to analysis result
 // that describes buffer aliasing, together with a target-specific size function
@@ -89,14 +75,13 @@ StatusOr<std::vector<const HloInstruction*>> DefaultMemoryScheduler(
 // Returns an HloModuleSequence which seeks to minimize the memory required for
 // the computation. size_function is the function returning the number of bytes
 // required for a LogicalBuffer.
-StatusOr<SequentialHloOrdering::HloModuleSequence>
-CreateMemoryMinimizingSequence(const HloModule& module,
-                               const LogicalBuffer::SizeFunction& size_function,
-                               const MemorySchedulerAlgorithm& algorithm = {});
+StatusOr<SequentialHloOrdering::HloModuleSequence> ScheduleComputationsInModule(
+    const HloModule& module, const LogicalBuffer::SizeFunction& size_function,
+    const MemorySchedulerAlgorithm& algorithm = {});
 
-// Overload of above that computes the sequence for a single computation.
+// Computes the schedule for a single computation.
 // Currently only used by the GPU backend.
-StatusOr<std::vector<const HloInstruction*>> CreateMemoryMinimizingSequence(
+StatusOr<std::vector<const HloInstruction*>> ScheduleOneComputation(
     const HloComputation& computation,
     const LogicalBuffer::SizeFunction& size_function);
 
diff --git a/tensorflow/compiler/xla/service/hlo_scheduling_test.cc b/tensorflow/compiler/xla/service/hlo_scheduling_test.cc
index c018ba2ffc404d0c6a0d08b8f5c63a9f90888b70..6f1b1215d39dfbaeff768de70fa0a0859cd97381 100644
--- a/tensorflow/compiler/xla/service/hlo_scheduling_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_scheduling_test.cc
@@ -22,74 +22,15 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
 #include "tensorflow/compiler/xla/service/hlo_opcode.h"
 #include "tensorflow/compiler/xla/service/hlo_ordering.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/tests/hlo_test_base.h"
-#include "tensorflow/compiler/xla/tools/parser/hlo_parser.h"
 #include "tensorflow/compiler/xla/types.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
 
 namespace xla {
 namespace {
 
-class MinimumMemoryForSequenceTest : public HloTestBase {};
-
-TEST_F(MinimumMemoryForSequenceTest, MultiComputation) {
-  auto module = CreateNewModule();
-  const Shape scalar_shape = ShapeUtil::MakeShape(xla::F32, {});
-  const Shape tuple_shape =
-      ShapeUtil::MakeTupleShape({scalar_shape, scalar_shape});
-
-  auto cond_builder = HloComputation::Builder("WhileCond");
-  // Tuple param: 24 bytes (each elem has 8 byte pointer, 4 byte element)
-  HloInstruction* cond_param = cond_builder.AddInstruction(
-      HloInstruction::CreateParameter(0, tuple_shape, "cond_param"));
-  HloInstruction* cond_iter = cond_builder.AddInstruction(
-      HloInstruction::CreateGetTupleElement(scalar_shape, cond_param, 0));
-  HloInstruction* cond_data = cond_builder.AddInstruction(
-      HloInstruction::CreateGetTupleElement(scalar_shape, cond_param, 1));
-  // Free cond_param[] (16 bytes), Alloc PRED[] (1 byte)
-  HloInstruction* cond_lt = cond_builder.AddInstruction(
-      HloInstruction::CreateBinary(ShapeUtil::MakeShape(PRED, {}),
-                                   HloOpcode::kLt, cond_iter, cond_data));
-  HloComputation* cond_computation =
-      module->AddEmbeddedComputation(cond_builder.Build());
-
-  auto body_builder = HloComputation::Builder("WhileBody");
-  // Tuple param: 24 bytes (each elem has 8 byte pointer, 4 byte element)
-  HloInstruction* body_param = body_builder.AddInstruction(
-      HloInstruction::CreateParameter(0, tuple_shape, "body_param"));
-  HloComputation* body_computation =
-      module->AddEmbeddedComputation(body_builder.Build());
-
-  auto builder = HloComputation::Builder(TestName());
-  // Entry params: 8 bytes (4 bytes per param), TOTAL=8
-  HloInstruction* iter = builder.AddInstruction(
-      HloInstruction::CreateParameter(0, scalar_shape, "param_iter"));
-  HloInstruction* data = builder.AddInstruction(
-      HloInstruction::CreateParameter(1, scalar_shape, "param_data"));
-  // Tuple: 16 bytes (8 bytes per pointer), TOTAL=24
-  HloInstruction* tuple =
-      builder.AddInstruction(HloInstruction::CreateTuple({iter, data}));
-  // While: 8 bytes (4 bytes per element), TOTAL=32
-  // Both cond and body use a max of 24 bytes, TOTAL=56
-  HloInstruction* while_op = builder.AddInstruction(HloInstruction::CreateWhile(
-      tuple_shape, cond_computation, body_computation, tuple));
-  HloComputation* entry_computation =
-      module->AddEntryComputation(builder.Build());
-
-  auto size_fn = [](const BufferValue& buffer) {
-    return ShapeUtil::ByteSizeOf(buffer.shape(), /*pointer_size=*/8);
-  };
-
-  SequentialHloOrdering::HloModuleSequence module_sequence;
-  module_sequence[cond_computation] = {cond_param, cond_iter, cond_data,
-                                       cond_lt};
-  module_sequence[body_computation] = {body_param};
-  module_sequence[entry_computation] = {iter, data, tuple, while_op};
-  EXPECT_EQ(56,
-            MinimumMemoryForSequence(module_sequence, size_fn).ValueOrDie());
-}
-
 class HloSchedulingTest : public HloTestBase {};
 
 TEST_F(HloSchedulingTest, LastUseScheduledFirst) {
@@ -124,7 +65,7 @@ TEST_F(HloSchedulingTest, LastUseScheduledFirst) {
 
   TF_ASSERT_OK_AND_ASSIGN(
       SequentialHloOrdering::HloModuleSequence sequence,
-      CreateMemoryMinimizingSequence(*module, [](const BufferValue& buffer) {
+      ScheduleComputationsInModule(*module, [](const BufferValue& buffer) {
         return ShapeUtil::ByteSizeOf(buffer.shape());
       }));
   // Verify that all instructions are in the sequence.
@@ -158,14 +99,14 @@ ENTRY root {
 })";
 
   TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
-                          tools::Parse(module_str));
+                          ParseHloString(module_str));
 
   auto size_fn = [](const BufferValue& buffer) {
     return ShapeUtil::ByteSizeOf(buffer.shape(), /*pointer_size=*/8);
   };
   TF_ASSERT_OK_AND_ASSIGN(
       SequentialHloOrdering::HloModuleSequence sequence,
-      CreateMemoryMinimizingSequence(*module, size_fn, ListMemoryScheduler));
+      ScheduleComputationsInModule(*module, size_fn, ListMemoryScheduler));
   // Verify that all instructions are in the sequence.
   EXPECT_EQ(module->entry_computation()->instruction_count(),
             sequence.at(module->entry_computation()).size());
@@ -270,7 +211,7 @@ TEST_F(HloSchedulingTest, ListAccountsForSubcomputations) {
   module->AddEntryComputation(builder.Build());
 
   TF_ASSERT_OK_AND_ASSIGN(SequentialHloOrdering::HloModuleSequence sequence,
-                          CreateMemoryMinimizingSequence(
+                          ScheduleComputationsInModule(
                               *module,
                               [](const BufferValue& buffer) {
                                 return ShapeUtil::ByteSizeOf(buffer.shape());
@@ -289,5 +230,100 @@ TEST_F(HloSchedulingTest, ListAccountsForSubcomputations) {
   EXPECT_TRUE(ordering.ExecutesBefore(transpose, add));
 }
 
+TEST_F(HloSchedulingTest, TuplesAreAccountedCorrectly) {
+  auto builder = HloComputation::Builder(TestName());
+  const auto TUPLE_SIZE = 1;
+  const Shape r1f32 = ShapeUtil::MakeShape(xla::F32, {6});
+
+  // Wrap lit in abs because constants are considered free by
+  // IgnoreInstruction, and it skews the accounting.
+  auto lit = builder.AddInstruction(HloInstruction::CreateConstant(
+      Literal::CreateR1<float>({1, 1, 1, 1, 1, 1})));
+  auto abs_const = builder.AddInstruction(
+      HloInstruction::CreateUnary(r1f32, HloOpcode::kAbs, lit));
+
+  auto abs_abs1 = builder.AddInstruction(
+      HloInstruction::CreateUnary(r1f32, HloOpcode::kAbs, abs_const));
+  auto tuple = builder.AddInstruction(HloInstruction::CreateTuple(
+      tensorflow::gtl::ArraySlice<HloInstruction*>({abs_abs1})));
+  auto tuple_elm = builder.AddInstruction(
+      HloInstruction::CreateGetTupleElement(r1f32, tuple, 0));
+
+  auto abs_abs2 = builder.AddInstruction(
+      HloInstruction::CreateUnary(r1f32, HloOpcode::kAbs, abs_const));
+
+  builder.AddInstruction(HloInstruction::CreateBinary(r1f32, HloOpcode::kAdd,
+                                                      tuple_elm, abs_abs2));
+
+  auto module = CreateNewModule();
+  module->AddEntryComputation(builder.Build());
+  TF_ASSERT_OK_AND_ASSIGN(
+      SequentialHloOrdering::HloModuleSequence sequence,
+      ScheduleComputationsInModule(*module,
+                                   [&TUPLE_SIZE](const BufferValue& buffer) {
+                                     return ShapeUtil::ByteSizeOf(
+                                         buffer.shape(), TUPLE_SIZE);
+                                   },
+                                   ListMemoryScheduler));
+
+  // Verify that all instructions are in the sequence.
+  EXPECT_EQ(module->entry_computation()->instruction_count(),
+            sequence.at(module->entry_computation()).size());
+  SequentialHloOrdering ordering(module.get(), sequence);
+  // tuple allocates the tuple buffer and doesn't free anything.
+  // abs_abs2 uses the same buffer for input/output, so its bytes-freed is 0.
+  // abs_abs2 should be scheduled before tuple by List.
+  EXPECT_TRUE(ordering.ExecutesBefore(abs_abs2, tuple));
+}
+
+TEST_F(HloSchedulingTest, MultiOutputFusionAccountedCorrectly) {
+  const Shape r1f32 = ShapeUtil::MakeShape(xla::F32, {5});
+  HloComputation::Builder builder(TestName());
+
+  auto c1 = builder.AddInstruction(HloInstruction::CreateConstant(
+      Literal::CreateR1<float>({1, 1, 1, 1, 1})));
+  auto c2 = builder.AddInstruction(HloInstruction::CreateConstant(
+      Literal::CreateR1<float>({1, 2, 3, 4, 5})));
+  auto c3 = builder.AddInstruction(HloInstruction::CreateConstant(
+      Literal::CreateR1<float>({0, 2, 4, 6, 8})));
+
+  auto add = builder.AddInstruction(
+      HloInstruction::CreateBinary(r1f32, HloOpcode::kAdd, c1, c2));
+  auto mul = builder.AddInstruction(
+      HloInstruction::CreateBinary(r1f32, HloOpcode::kMultiply, add, c3));
+  auto tuple = builder.AddInstruction(HloInstruction::CreateTuple({add, mul}));
+
+  auto tuple_elm = builder.AddInstruction(
+      HloInstruction::CreateGetTupleElement(r1f32, tuple, 0));
+
+  auto exp = builder.AddInstruction(
+      HloInstruction::CreateUnary(r1f32, HloOpcode::kExp, c3));
+
+  builder.AddInstruction(
+      HloInstruction::CreateBinary(r1f32, HloOpcode::kAdd, tuple_elm, exp));
+
+  auto module = CreateNewModule();
+  auto* computation = module->AddEntryComputation(builder.Build());
+
+  auto fusion = computation->CreateFusionInstruction(
+      {tuple, mul, add}, HloInstruction::FusionKind::kLoop);
+
+  TF_ASSERT_OK_AND_ASSIGN(SequentialHloOrdering::HloModuleSequence sequence,
+                          ScheduleComputationsInModule(
+                              *module,
+                              [](const BufferValue& buffer) {
+                                return ShapeUtil::ByteSizeOf(buffer.shape(), 2);
+                              },
+                              ListMemoryScheduler));
+
+  // Verify that all instructions are in the sequence.
+  EXPECT_EQ(module->entry_computation()->instruction_count(),
+            sequence.at(module->entry_computation()).size());
+  SequentialHloOrdering ordering(module.get(), sequence);
+  // fusion allocates memory for the tuple elements and doesn't free anything,
+  // so it's more expensive than exp.
+  EXPECT_TRUE(ordering.ExecutesBefore(exp, fusion));
+}
+
 }  // namespace
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/hlo_sharding.cc b/tensorflow/compiler/xla/service/hlo_sharding.cc
index 7f7e3f7dab03ce0ad64bd0fcfe4ddd020d31bf56..9fb15df7c26951fb7f0d62b0d6533d6312e7a4d5 100644
--- a/tensorflow/compiler/xla/service/hlo_sharding.cc
+++ b/tensorflow/compiler/xla/service/hlo_sharding.cc
@@ -39,6 +39,34 @@ HloSharding HloSharding::Tile1D(const Shape& input_shape, int64 num_tiles) {
   return HloSharding(tile_shape, assignment);
 }
 
+HloSharding HloSharding::Tuple(const ShapeTree<HloSharding>& sub_shardings) {
+  std::vector<HloSharding> flattened_list;
+  flattened_list.reserve(sub_shardings.leaf_count());
+  for (const auto& index_to_sharding : sub_shardings.leaves()) {
+    flattened_list.push_back(index_to_sharding.second);
+  }
+  if (flattened_list.empty()) {
+    // Empty tuple sharding ends up having no leaves, but we want to allow
+    // empty tuple HLO instruction results to have sharding, so we fetch the
+    // root ({}) sharding value from the ShapeTree.
+    // A ShapeTree created with ShapeTree<HloSharding>(shape, init) will have
+    // init as value at its root.
+    flattened_list.push_back(sub_shardings.element(ShapeIndex({})));
+  }
+  return HloSharding(flattened_list);
+}
+
+HloSharding HloSharding::Tuple(
+    const Shape& tuple_shape,
+    tensorflow::gtl::ArraySlice<HloSharding> shardings) {
+  CHECK(ShapeUtil::IsTuple(tuple_shape)) << ShapeUtil::HumanString(tuple_shape);
+  std::vector<HloSharding> flattened_list(shardings.begin(), shardings.end());
+  CHECK_EQ(flattened_list.size(), RequiredLeaves(tuple_shape))
+      << "Flat list has " << flattened_list.size() << ", required "
+      << RequiredLeaves(tuple_shape);
+  return HloSharding(flattened_list);
+}
+
 string HloSharding::ToString() const {
   if (IsTuple()) {
     std::vector<string> parts;
@@ -49,9 +77,6 @@ string HloSharding::ToString() const {
     return StrCat("{", tensorflow::str_util::Join(parts, ", "), "}");
   }
 
-  string result = StrCat("{", (replicated_ ? " replicated" : ""),
-                         (maximal_ ? " maximal" : ""));
-
   if (replicated_) {
     return "{replicated}";
   } else if (maximal_) {
@@ -126,6 +151,49 @@ std::vector<int64> HloSharding::TileLimitForDevice(int64 device) const {
   return index;
 }
 
+int64 HloSharding::RequiredLeaves(const Shape& shape) {
+  // Empty tuples have no leaf nodes as far as ShapeUtil and ShapeTree are
+  // concerned, but they do have a single tuple_elements_ entry since we want
+  // to allow empty tuple results to have sharding.
+  return ShapeUtil::IsEmptyTuple(shape) ? 1 : ShapeUtil::GetLeafCount(shape);
+}
+
+Status HloSharding::CheckLeafCount(const Shape& shape) const {
+  int64 shape_leaves = RequiredLeaves(shape);
+  TF_RET_CHECK(shape_leaves == tuple_elements_.size())
+      << "Shape " << ShapeUtil::HumanString(shape) << " has " << shape_leaves
+      << " leaf nodes while this sharding has " << tuple_elements_.size();
+  return Status::OK();
+}
+
+StatusOr<ShapeTree<HloSharding>> HloSharding::AsShapeTree(
+    const Shape& shape) const {
+  if (IsTuple()) {
+    ShapeTree<HloSharding> result(shape, HloSharding::Replicate());
+    TF_RETURN_IF_ERROR(CheckLeafCount(shape));
+    auto it = tuple_elements_.begin();
+    for (auto& index_to_sharding : result.leaves()) {
+      index_to_sharding.second = *it++;
+    }
+    if (ShapeUtil::IsEmptyTuple(shape)) {
+      // Empty tuples have no leaves, but we want to assign them a sharding
+      // anyway, so we use the root element sharding.
+      *result.mutable_element(ShapeIndex({})) = *it;
+    }
+    return std::move(result);
+  } else {
+    return ShapeTree<HloSharding>(shape, *this);
+  }
+}
+
+StatusOr<HloSharding> HloSharding::GetTupleSharding(const Shape& shape) const {
+  if (IsTuple()) {
+    TF_RETURN_IF_ERROR(CheckLeafCount(shape));
+    return *this;
+  }
+  return Tuple(ShapeTree<HloSharding>(shape, *this));
+}
+
 StatusOr<int64> HloSharding::UniqueDevice() const {
   if (IsTuple()) {
     if (tuple_elements_.empty()) {
@@ -167,28 +235,12 @@ Status HloSharding::ValidateTuple(const Shape& shape, int64 num_devices) const {
     return tensorflow::errors::InvalidArgument(
         StrCat("Sharding is tuple-shaped but validation shape is not."));
   }
-  // The easiest way to get the number of elements in a nested tuple is just to
-  // create a shape tree. We could call GetAsShapeTree, but that will try and
-  // apply our tuple_shardings_ to the shape tree, and that might cause a crash
-  // at this point as we haven't validated them.
-  ShapeTree<bool> bool_shape_tree(shape, false);
-  int64 num_leaves =
-      std::distance(bool_shape_tree.leaf_begin(), bool_shape_tree.leaf_end());
-  if (num_leaves != tuple_elements_.size()) {
-    return tensorflow::errors::InvalidArgument(
-        StrCat("Validation tuple shape has ", num_leaves,
-               " leaf elements, but this sharding contains ",
-               tuple_elements_.size(), " elements."));
-  }
+  TF_RETURN_IF_ERROR(CheckLeafCount(shape));
 
   // Now we've validated the number of tuple elements, it's safe to request a
   // shape tree.
   ShapeTree<HloSharding> shape_tree = GetAsShapeTree(shape);
   for (const auto& index_to_sharding : shape_tree.leaves()) {
-    if (index_to_sharding.first.empty()) {
-      // An empty tuple has a ShapeTree with a single leaf at the empty index.
-      continue;
-    }
     Status status = index_to_sharding.second.ValidateNonTuple(
         ShapeUtil::GetSubshape(shape, index_to_sharding.first), num_devices);
     if (!status.ok()) {
@@ -370,11 +422,21 @@ HloSharding HloSharding::GetSubSharding(const Shape& shape,
   Shape sub_shape = ShapeUtil::GetSubshape(shape, index);
   ShapeTree<HloSharding> sub_shape_tree(sub_shape, Replicate());
   sub_shape_tree.CopySubtreeFrom(GetAsShapeTree(shape), index, {});
-  if (ShapeUtil::IsTuple(sub_shape)) {
-    return Tuple(sub_shape_tree);
-  } else {
-    return sub_shape_tree.element({});
+  return ShapeUtil::IsTuple(sub_shape) ? Tuple(sub_shape_tree)
+                                       : sub_shape_tree.element(ShapeIndex({}));
+}
+
+tensorflow::gtl::optional<HloSharding> HloSharding::ExtractSingleSharding()
+    const {
+  if (!IsTuple()) {
+    return *this;
+  }
+  for (int64 i = 1; i < tuple_elements_.size(); ++i) {
+    if (tuple_elements_[0] != tuple_elements_[i]) {
+      return tensorflow::gtl::optional<HloSharding>();
+    }
   }
+  return tuple_elements_.front();
 }
 
 std::ostream& operator<<(std::ostream& out, const HloSharding& sharding) {
diff --git a/tensorflow/compiler/xla/service/hlo_sharding.h b/tensorflow/compiler/xla/service/hlo_sharding.h
index 2b8e757f42991f697df37d3d34bfdff6a36bc509..6a744e0247273e25c5de3143b7bbba2b79ee816a 100644
--- a/tensorflow/compiler/xla/service/hlo_sharding.h
+++ b/tensorflow/compiler/xla/service/hlo_sharding.h
@@ -70,26 +70,13 @@ class HloSharding {
 
   // Creates a new sharding for a tuple type. The given ShapeTree must have
   // elements for every leaf shape contained in the tuple.
-  static HloSharding Tuple(const ShapeTree<HloSharding>& sub_shardings) {
-    std::vector<HloSharding> flattened_list;
-    flattened_list.reserve(
-        std::distance(sub_shardings.leaf_begin(), sub_shardings.leaf_end()));
-    for (const auto& index_to_sharding : sub_shardings.leaves()) {
-      flattened_list.push_back(index_to_sharding.second);
-    }
-    return HloSharding(flattened_list);
-  }
+  static HloSharding Tuple(const ShapeTree<HloSharding>& sub_shardings);
 
-  // Creates a new sharding for a tuple type. The requested tuple shape must not
-  // be nested. For nested tuples, use the ShapeTree overload.
+  // Creates a new sharding for a tuple type. The number of elements in
+  // shardings must match the number of leaf nodes in tuple_shape. For
+  // empty tuples, the shardings array must have one element.
   static HloSharding Tuple(const Shape& tuple_shape,
-                           tensorflow::gtl::ArraySlice<HloSharding> shardings) {
-    CHECK(ShapeUtil::IsTuple(tuple_shape));
-    CHECK(!ShapeUtil::IsNestedTuple(tuple_shape));
-    std::vector<HloSharding> flattened_list(shardings.begin(), shardings.end());
-    CHECK_EQ(flattened_list.size(), ShapeUtil::TupleElementCount(tuple_shape));
-    return HloSharding(flattened_list);
-  }
+                           tensorflow::gtl::ArraySlice<HloSharding> shardings);
 
   // Create a new sharding from a protobuf OpSharding.
   static StatusOr<HloSharding> FromProto(const OpSharding& proto);
@@ -99,6 +86,9 @@ class HloSharding {
   static bool IsReservedDevice(int64 device) { return device < 0; }
 
   OpSharding ToProto() const;
+
+  // Note that this string canonically has outer curly braces, e.g.
+  // "{replicated}".
   string ToString() const;
 
   // Validate that this sharding can be applied to a tensor with shape `shape`.
@@ -160,25 +150,27 @@ class HloSharding {
   // tuple, if IsTuple, or a ShapeTree with a single element containing this
   // sharding. Only the leaf elements are populated. This creates a new
   // ShapeTree object so is not cheap.
+  StatusOr<ShapeTree<HloSharding>> AsShapeTree(const Shape& shape) const;
   ShapeTree<HloSharding> GetAsShapeTree(const Shape& shape) const {
-    if (IsTuple()) {
-      ShapeTree<HloSharding> result(shape, HloSharding::Replicate());
-      CHECK_EQ(std::distance(result.leaf_begin(), result.leaf_end()),
-               tuple_elements_.size());
-      auto it = tuple_elements_.begin();
-      for (auto& index_to_sharding : result.leaves()) {
-        index_to_sharding.second = *it++;
-      }
-      return result;
-    } else {
-      return ShapeTree<HloSharding>(shape, *this);
-    }
+    return AsShapeTree(shape).ValueOrDie();
   }
 
   // Retrieves the sub sharding at a given index, out of a tuple sharding.
   // REQUIRES: IsTuple()
   HloSharding GetSubSharding(const Shape& shape, const ShapeIndex& index) const;
 
+  // If the current sharding is a tuple sharding, return itself as result.
+  // Otherwise returns a tuple sharding for the input shape, with all the leaves
+  // having this object sharding.
+  StatusOr<HloSharding> GetTupleSharding(const Shape& shape) const;
+
+  // Extracts the sharding that is common within the current sharding.
+  // If the current sharding is not a tuple sharding, the current sharding will
+  // be returned. If it is a tuple, and all the tuple elements are common, the
+  // common element will be returned. Otherwise the optional will contain no
+  // value.
+  tensorflow::gtl::optional<HloSharding> ExtractSingleSharding() const;
+
   bool operator==(const HloSharding& other) const {
     return replicated_ == other.replicated_ && maximal_ == other.maximal_ &&
            ShapeUtil::Compatible(tile_shape_, other.tile_shape_) &&
@@ -208,6 +200,12 @@ class HloSharding {
     return h;
   }
 
+  struct Hasher {
+    size_t operator()(const HloSharding& sharding) const {
+      return sharding.Hash();
+    }
+  };
+
   // Gets the tile shape.
   // REQUIRES: !IsTileMaximal() && !IsTuple()
   const Shape& tile_shape() const { return tile_shape_; }
@@ -261,11 +259,19 @@ class HloSharding {
         tile_assignment_({0}),
         tuple_elements_(tuple_shardings) {}
 
+  // Checks that the number of elements in tuple_elements_ is consistent with
+  // the tuple shape passes as argument.
+  Status CheckLeafCount(const Shape& shape) const;
+
   // Internal helper to validate a tuple sharding.
   Status ValidateTuple(const Shape& shape, int64 num_devices) const;
+
   // Internal helper to validate a non-tuple (leaf) sharding.
   Status ValidateNonTuple(const Shape& shape, int64 num_devices) const;
 
+  // Returns the number of tuple_elements_ entries to fit the shape.
+  static int64 RequiredLeaves(const Shape& shape);
+
   bool replicated_;
   bool maximal_;
   bool tuple_;
diff --git a/tensorflow/compiler/xla/service/hlo_sharding_metadata.cc b/tensorflow/compiler/xla/service/hlo_sharding_metadata.cc
new file mode 100644
index 0000000000000000000000000000000000000000..7b4b071af46df19520f9ba1f1f632692d489de59
--- /dev/null
+++ b/tensorflow/compiler/xla/service/hlo_sharding_metadata.cc
@@ -0,0 +1,394 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/service/hlo_sharding_metadata.h"
+
+#include "tensorflow/compiler/xla/service/hlo_computation.h"
+#include "tensorflow/compiler/xla/shape_tree.h"
+#include "tensorflow/compiler/xla/shape_util.h"
+
+namespace xla {
+
+namespace {
+
+struct PassThrough {
+  PassThrough(HloInstruction* user, HloInstruction* operand)
+      : user(user), operand(operand) {}
+
+  HloInstruction* user = nullptr;
+  HloInstruction* operand = nullptr;
+};
+
+void SetSingleSharding(HloInstruction* instruction,
+                       const HloSharding& sharding) {
+  VLOG(4) << "  " << instruction->name() << " to " << sharding;
+  instruction->set_single_sharding(sharding);
+}
+
+bool ShardingMatches(const HloSharding& sharding1,
+                     const HloSharding& sharding2) {
+  auto single_sharding1 = sharding1.ExtractSingleSharding();
+  if (single_sharding1) {
+    auto single_sharding2 = sharding2.ExtractSingleSharding();
+    if (single_sharding2) {
+      return *single_sharding1 == single_sharding2;
+    }
+  }
+  // Anything which is not unique across all elements, gets a full sharding
+  // compare.
+  return sharding1 == sharding2;
+}
+
+// When we create domains, they are never "empty", where with empty we mean
+// that a kDomain instruction has as operand another kDomain instruction of the
+// same kind.
+// But when the HLO optimizations are run, empty domains can be created.
+// For example:
+//
+//  Domain(device=None, device=0) ->
+//    Tuple(device=0) ->
+//      GTE(device=0) ->
+//        Domain(device=0, device=None)
+//
+// In that case the tuple simplifier could create something like:
+//
+//  Domain(device=None, device=0) -> Domain(device=0, device=None)
+//
+// Which is a so called empty domain.
+// In the case above, crossing an empty domain which was transiting through
+// device 0, requires the normalization phase to fixup the empty domain by
+// adding back a Tuple+GTE pair with the proper device.
+// One particular case where this can create problems is the result of the
+// entry computation, where the GTE assignments are used by TF to tell the
+// XLA where the results should be sent.
+std::vector<PassThrough> LocatePassThroughDomainLinks(
+    const DomainMetadata::Domain& domain) {
+  std::vector<PassThrough> pass_through;
+  for (HloInstruction* instruction : domain.enter_domains) {
+    CHECK(instruction->opcode() == HloOpcode::kDomain)
+        << "Instruction is not a kDomain: " << instruction->ToString();
+    for (HloInstruction* user : instruction->users()) {
+      if (user->opcode() == HloOpcode::kDomain &&
+          domain.exit_domains.count(user) != 0) {
+        pass_through.emplace_back(user, instruction);
+        VLOG(2) << "Found passthrough domain link:";
+        VLOG(2) << "  " << user->ToString();
+        VLOG(2) << "  " << instruction->ToString();
+      }
+    }
+  }
+  return pass_through;
+}
+
+Status FixupPassThroughDomainLinks(const DomainMetadata::Domain& domain,
+                                   const HloSharding& sharding) {
+  for (auto& pass_through : LocatePassThroughDomainLinks(domain)) {
+    HloInstruction* tuple = pass_through.operand->parent()->AddInstruction(
+        HloInstruction::CreateTuple({pass_through.operand}));
+    HloInstruction* gte = pass_through.operand->parent()->AddInstruction(
+        HloInstruction::CreateGetTupleElement(pass_through.operand->shape(),
+                                              tuple, 0));
+    gte->set_sharding(sharding);
+    TF_RETURN_IF_ERROR(
+        pass_through.operand->ReplaceUseWith(pass_through.user, gte));
+  }
+  return Status::OK();
+}
+
+std::unique_ptr<HloSharding> CloneShardingForDomain(
+    const HloSharding& sharding) {
+  auto single_sharding = sharding.ExtractSingleSharding();
+  if (!single_sharding) {
+    return MakeUnique<HloSharding>(sharding);
+  }
+  return MakeUnique<HloSharding>(*single_sharding);
+}
+
+Status ApplyDomainSingleSharding(const DomainMetadata::Domain& domain,
+                                 const HloSharding& sharding) {
+  VLOG(4) << "Applying " << sharding << " sharding";
+  for (HloInstruction* instruction : domain.instructions) {
+    // We only change instructions without sharding, since otherwise we might
+    // mess up with eventual HLO passes which has knowledge of it.
+    if (!instruction->has_sharding()) {
+      SetSingleSharding(instruction, sharding);
+    } else {
+      VLOG(4) << "  " << instruction->name() << " already has sharding "
+              << instruction->sharding();
+    }
+  }
+  return Status::OK();
+}
+
+// Retrieves the sharding of a tuple shaped instruction in form of a ShapeTree.
+// If the instruction has no sharding, a ShapeTree with HloSharding::Replicate()
+// sharding will be returned.
+ShapeTree<HloSharding> GetTupleSharding(HloInstruction* tuple) {
+  if (tuple->has_sharding()) {
+    return tuple->sharding().GetAsShapeTree(tuple->shape());
+  }
+  return ShapeTree<HloSharding>(tuple->shape(), HloSharding::Replicate());
+}
+
+// Retrieves the sharding of operand, asked from a user instruction which is
+// within domain. If operand is a kDomain, it means that sharding argument is
+// the operand sharding, otherwise the operand's own sharding will be returned.
+const HloSharding* GetOperandSharding(const HloInstruction* operand,
+                                      const DomainMetadata::Domain& domain,
+                                      const HloSharding& sharding) {
+  DCHECK_EQ(domain.reach_set.count(const_cast<HloInstruction*>(operand)), 1);
+  // Here the user of operand is within the domain instruction set, and since it
+  // is user of operand, we need to look into the enter_domains set. If this is
+  // not a kDomain within the user domains set, then return the operand
+  // sharding, if any.
+  if (operand->opcode() != HloOpcode::kDomain ||
+      domain.enter_domains.count(const_cast<HloInstruction*>(operand)) == 0) {
+    return operand->has_sharding() ? &operand->sharding() : nullptr;
+  }
+  // At this point operand is a kDomain of the currently processed domain, so we
+  // can refer to sharding as the domain sharding.
+  return &sharding;
+}
+
+// Tries to propagate the sharding information into the instructions that are
+// part of the domain, in a post order manner (operand propagate to user).
+StatusOr<int64> ApplyDomainShardingPass(const DomainMetadata::Domain& domain,
+                                        const HloSharding& sharding) {
+  int64 assigned = 0;
+  for (HloInstruction* instruction : domain.instructions) {
+    if (instruction->has_sharding()) {
+      continue;
+    }
+    if (instruction->opcode() == HloOpcode::kGetTupleElement) {
+      HloInstruction* tuple = instruction->mutable_operand(0);
+      const HloSharding* tuple_sharding =
+          GetOperandSharding(tuple, domain, sharding);
+      if (tuple_sharding != nullptr) {
+        if (tuple_sharding->IsTuple()) {
+          HloSharding sub_sharding = tuple_sharding->GetSubSharding(
+              tuple->shape(), {instruction->tuple_index()});
+          VLOG(4) << "  " << instruction->name() << " to sharding "
+                  << sub_sharding;
+          instruction->set_sharding(sub_sharding);
+        } else {
+          SetSingleSharding(instruction, *tuple_sharding);
+        }
+        ++assigned;
+      }
+    } else if (instruction->opcode() == HloOpcode::kTuple) {
+      int64 tuple_assigned = 0;
+      ShapeTree<HloSharding> shape_tree = GetTupleSharding(instruction);
+      for (int64 i = 0; i < instruction->operand_count(); ++i) {
+        const HloSharding* operand_sharding =
+            GetOperandSharding(instruction->operand(i), domain, sharding);
+        if (operand_sharding != nullptr &&
+            shape_tree.element({i}) != *operand_sharding) {
+          *shape_tree.mutable_element({i}) = *operand_sharding;
+          ++tuple_assigned;
+        }
+      }
+      if (tuple_assigned > 0) {
+        HloSharding tuple_sharding = HloSharding::Tuple(shape_tree);
+        VLOG(4) << "  " << instruction->name() << " to sharding "
+                << tuple_sharding;
+        instruction->set_sharding(tuple_sharding);
+        ++assigned;
+      }
+    } else {
+      // If all the operand of the given instruction has the same single device
+      // assignment, assign that device to this instruction as well.
+      const HloSharding* common_sharding = nullptr;
+      for (const HloInstruction* operand : instruction->operands()) {
+        const HloSharding* operand_sharding =
+            GetOperandSharding(operand, domain, sharding);
+        if (operand_sharding != nullptr) {
+          if (common_sharding != nullptr &&
+              *common_sharding != *operand_sharding) {
+            common_sharding = nullptr;
+            break;
+          }
+          common_sharding = operand_sharding;
+        }
+      }
+      if (common_sharding != nullptr) {
+        VLOG(4) << "  " << instruction->name() << " to sharding "
+                << *common_sharding;
+        instruction->set_sharding(*common_sharding);
+        ++assigned;
+      }
+    }
+  }
+  return assigned;
+}
+
+Status ApplyDomainSharding(const DomainMetadata::Domain& domain,
+                           const HloSharding& sharding) {
+  auto single_sharding = sharding.ExtractSingleSharding();
+  if (single_sharding) {
+    // Shortcut the simple case. We have a unique sharding, so we call
+    // the ApplyDomainSingleSharding() API which will apply array or tuple
+    // shaped sharding to the domain instructions.
+    return ApplyDomainSingleSharding(domain, *single_sharding);
+  }
+  VLOG(1) << "Assigning non-trivial sharding " << sharding;
+  for (;;) {
+    TF_ASSIGN_OR_RETURN(int64 assigned,
+                        ApplyDomainShardingPass(domain, sharding));
+    if (assigned == 0) {
+      break;
+    }
+  }
+  int64 unassigned = 0;
+  for (HloInstruction* instruction : domain.instructions) {
+    if (!instruction->has_sharding()) {
+      LOG(WARNING) << "Unassigned instruction: " << instruction->ToString();
+      ++unassigned;
+    }
+  }
+  // Should we error out if unassigned > 0?
+  return Status::OK();
+}
+
+// Creates a kDomain instruction to be placed between instruction and operand.
+// The kDomain instruction will be created only if the sharding differ between
+// the instruction and the operand.
+std::unique_ptr<HloInstruction> CreateDomain(HloInstruction* instruction,
+                                             HloInstruction* operand) {
+  const HloSharding* instruction_sharding =
+      instruction->has_sharding() ? &instruction->sharding() : nullptr;
+  const HloSharding* operand_sharding =
+      operand->has_sharding() ? &operand->sharding() : nullptr;
+  // No need for domain if they both have no sharding.
+  if (instruction_sharding == nullptr && operand_sharding == nullptr) {
+    return nullptr;
+  }
+  // No need for domain if they match.
+  if (instruction_sharding != nullptr && operand_sharding != nullptr &&
+      ShardingMatches(*instruction_sharding, *operand_sharding)) {
+    return nullptr;
+  }
+  std::unique_ptr<HloSharding> real_instruction_sharding;
+  std::unique_ptr<HloSharding> real_operand_sharding;
+  if (instruction_sharding != nullptr) {
+    real_instruction_sharding = CloneShardingForDomain(*instruction_sharding);
+  }
+  if (operand_sharding != nullptr) {
+    real_operand_sharding = CloneShardingForDomain(*operand_sharding);
+  }
+  VLOG(3) << "Creating domain:";
+  VLOG(3) << "  Instruction: " << instruction->name();
+  VLOG(3) << "  Operand: " << operand->name();
+  VLOG(3) << "    User side sharding: "
+          << (real_instruction_sharding != nullptr
+                  ? real_instruction_sharding->ToString()
+                  : "None");
+  VLOG(3) << "    Operand side sharding: "
+          << (real_operand_sharding != nullptr
+                  ? real_operand_sharding->ToString()
+                  : "None");
+
+  std::unique_ptr<DomainMetadata> operand_side_metadata =
+      MakeUnique<ShardingMetadata>(std::move(real_operand_sharding));
+  std::unique_ptr<DomainMetadata> user_side_metadata =
+      MakeUnique<ShardingMetadata>(std::move(real_instruction_sharding));
+  return HloInstruction::CreateDomain(operand->shape(), operand,
+                                      std::move(operand_side_metadata),
+                                      std::move(user_side_metadata));
+}
+
+StatusOr<std::unique_ptr<HloSharding>> ExtractOriginalCommonSharding(
+    tensorflow::gtl::ArraySlice<HloInstruction*> instructions) {
+  // If we are here, all the instructions being passed had the same sharding
+  // (or no sharding), by the means of the ShardingMatches() API.
+  // As such, no kDomain was inserted, and here we are asked to extract the
+  // original common sharding.
+  // All the instructions passed to this API are part of the same computation.
+  const HloSharding* sharding = nullptr;
+  for (HloInstruction* instruction : instructions) {
+    if (instruction->has_sharding()) {
+      if (sharding == nullptr) {
+        sharding = &instruction->sharding();
+      } else {
+        TF_RET_CHECK(ShardingMatches(*sharding, instruction->sharding()))
+            << "Sharding " << *sharding << " does not match the one in "
+            << instruction->ToString();
+      }
+    }
+  }
+  if (sharding == nullptr) {
+    return std::unique_ptr<HloSharding>();
+  }
+  VLOG(4) << "Extracted sharding is " << *sharding;
+  return CloneShardingForDomain(*sharding);
+}
+
+}  // namespace
+
+std::unique_ptr<DomainMetadata> ShardingMetadata::Clone() const {
+  std::unique_ptr<HloSharding> sharding;
+  if (sharding_ != nullptr) {
+    sharding = MakeUnique<HloSharding>(*sharding_);
+  }
+  return MakeUnique<ShardingMetadata>(std::move(sharding));
+}
+
+bool ShardingMetadata::Matches(const DomainMetadata& other) const {
+  const ShardingMetadata* other_ptr =
+      dynamic_cast<const ShardingMetadata*>(&other);
+  if (other_ptr == nullptr) {
+    // If other is not a ShardingMetadata, then it is clearly a no match.
+    return false;
+  }
+  if (sharding_ == nullptr) {
+    return other_ptr->sharding_ == nullptr;
+  }
+  return other_ptr->sharding_ != nullptr
+             ? ShardingMatches(*sharding_, *other_ptr->sharding_)
+             : false;
+}
+
+string ShardingMetadata::ToString() const {
+  return sharding_ != nullptr ? sharding_->ToString() : "None";
+}
+
+Status ShardingMetadata::NormalizeInstructions(
+    const DomainMetadata::Domain& domain) const {
+  if (sharding_ != nullptr) {
+    VLOG(4) << "Normalizing sharding to " << sharding_->ToString() << ":";
+    TF_RETURN_IF_ERROR(ApplyDomainSharding(domain, *sharding_));
+    TF_RETURN_IF_ERROR(FixupPassThroughDomainLinks(domain, *sharding_));
+  }
+  return Status::OK();
+}
+
+Status NormalizeShardingDomain(const DomainMetadata::Domain& domain) {
+  TF_ASSIGN_OR_RETURN(std::unique_ptr<HloSharding> sharding,
+                      ExtractOriginalCommonSharding(domain.instructions));
+  if (sharding != nullptr) {
+    VLOG(4) << "Normalizing sharding-less domain to " << sharding->ToString()
+            << ":";
+    TF_RETURN_IF_ERROR(ApplyDomainSharding(domain, *sharding));
+  } else {
+    VLOG(1) << "Unable to find common sharding";
+  }
+  return Status::OK();
+}
+
+std::unique_ptr<HloInstruction> CreateShardingDomain(
+    HloInstruction* instruction, HloInstruction* operand) {
+  return CreateDomain(instruction, operand);
+}
+
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/hlo_sharding_metadata.h b/tensorflow/compiler/xla/service/hlo_sharding_metadata.h
new file mode 100644
index 0000000000000000000000000000000000000000..ec162c34904ee2dfac3daeeee37133282a9c9698
--- /dev/null
+++ b/tensorflow/compiler/xla/service/hlo_sharding_metadata.h
@@ -0,0 +1,67 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_HLO_SHARDING_METADATA_H_
+#define TENSORFLOW_COMPILER_XLA_SERVICE_HLO_SHARDING_METADATA_H_
+
+#include "tensorflow/compiler/xla/service/hlo_domain_metadata.h"
+#include "tensorflow/compiler/xla/service/hlo_instruction.h"
+#include "tensorflow/compiler/xla/service/hlo_sharding.h"
+#include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/lib/gtl/array_slice.h"
+
+namespace xla {
+
+// A DomainMetadata implementation that internally wraps a sharding attribute.
+class ShardingMetadata : public DomainMetadata {
+ public:
+  explicit ShardingMetadata(std::unique_ptr<HloSharding> sharding)
+      : sharding_(std::move(sharding)) {}
+
+  std::unique_ptr<DomainMetadata> Clone() const override;
+
+  tensorflow::StringPiece Kind() const override { return KindName(); }
+
+  bool Matches(const DomainMetadata& other) const override;
+
+  string ToString() const override;
+
+  Status NormalizeInstructions(
+      const DomainMetadata::Domain& domain) const override;
+
+  static tensorflow::StringPiece KindName() { return "sharding"; }
+
+ private:
+  std::unique_ptr<HloSharding> sharding_;
+};
+
+// Within a set of instructions which had common sharding attributes before
+// entring the HLO passes pipeline, apply sharding heuristics and normalize the
+// instructions whose sharding deviates from the one which is inferred as to be
+// the original one.
+// Policy wise, HLO passes are allowed to create new unassigned instructions,
+// but if they do create assigned ones, they have to conform to the ones around.
+Status NormalizeShardingDomain(const DomainMetadata::Domain& domain);
+
+// Given an HLO graph edge between instruction and one of its operands, creates
+// a ShardingMetadata based kDomain instruction if the sharding between
+// instruction and operand changes. Returns nullptr if there is no need for a
+// domain separation.
+std::unique_ptr<HloInstruction> CreateShardingDomain(
+    HloInstruction* instruction, HloInstruction* operand);
+
+}  // namespace xla
+
+#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_HLO_SHARDING_METADATA_H_
diff --git a/tensorflow/compiler/xla/service/hlo_sharding_test.cc b/tensorflow/compiler/xla/service/hlo_sharding_test.cc
index 3bf0d25efb7fad78aeccdd9269c289950b2171ab..54b7402b866361748d9eb35182b0bf486c4c9bdc 100644
--- a/tensorflow/compiler/xla/service/hlo_sharding_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_sharding_test.cc
@@ -13,14 +13,13 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#include "tensorflow/compiler/xla/service/hlo_sharding.h"
-
 #include <set>
 #include <unordered_map>
 #include <utility>
 #include <vector>
 
 #include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/test.h"
 #include "tensorflow/compiler/xla/test_helpers.h"
@@ -312,5 +311,50 @@ TEST_F(HloShardingTest, OstreamTest) {
   EXPECT_EQ(oss.str(), "{f32[3,5,7,11] devices=[1,1,2,2]0,1,2,3}");
 }
 
+TEST_F(HloShardingTest, ParseHloString) {
+  auto check = [](const HloSharding& sharding) {
+    TF_ASSERT_OK_AND_ASSIGN(auto parsed_sharding,
+                            ParseSharding(sharding.ToString()));
+    EXPECT_EQ(sharding, parsed_sharding);
+  };
+  check(HloSharding::Replicate());
+  check(HloSharding::AssignDevice(2));
+  check(HloSharding::Tile(ShapeUtil::MakeShape(F32, {3, 1, 3, 7}),
+                          Array4D<int64>({{{{0}, {1}}}})));
+  // Empty tuple. One sharding is required for empty tuples, as we need to be
+  // able to assign sharding to them, even though they have no leaves.
+  check(HloSharding::Tuple(ShapeUtil::MakeTupleShape({}),
+                           {HloSharding::Replicate()}));
+  {
+    // Non-nested tuple.
+    auto tuple_shape =
+        ShapeUtil::MakeTupleShape({ShapeUtil::MakeShape(F32, {3, 1, 5, 7}),
+                                   ShapeUtil::MakeShape(F32, {3, 5, 7}),
+                                   ShapeUtil::MakeShape(F32, {3, 7})});
+    check(HloSharding::Tuple(
+        tuple_shape, {HloSharding::Tile(ShapeUtil::MakeShape(F32, {3, 1, 3, 7}),
+                                        Array4D<int64>({{{{0}, {1}}}})),
+                      HloSharding::Replicate(), HloSharding::AssignDevice(1)}));
+  }
+  {
+    // Nested tuple.
+    auto tuple_shape = ShapeUtil::MakeTupleShape(
+        {ShapeUtil::MakeShape(F32, {3, 1, 5, 7}),
+         ShapeUtil::MakeTupleShape({ShapeUtil::MakeShape(F32, {3, 5, 7}),
+                                    ShapeUtil::MakeShape(F32, {3, 7})})});
+    std::vector<HloSharding> leaf_shardings = {
+        HloSharding::Tile(ShapeUtil::MakeShape(F32, {3, 1, 3, 7}),
+                          Array4D<int64>({{{{0}, {1}}}})),
+        HloSharding::Replicate(), HloSharding::AssignDevice(1)};
+    ShapeTree<HloSharding> sharding_tree(tuple_shape, HloSharding::Replicate());
+    // Assign leaf_shardings to sharding_tree leaves.
+    auto it = leaf_shardings.begin();
+    for (auto& index_to_sharding : sharding_tree.leaves()) {
+      index_to_sharding.second = *it++;
+    }
+    check(HloSharding::Tuple(sharding_tree));
+  }
+}
+
 }  // namespace
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/tools/parser/hlo_token.h b/tensorflow/compiler/xla/service/hlo_token.h
similarity index 84%
rename from tensorflow/compiler/xla/tools/parser/hlo_token.h
rename to tensorflow/compiler/xla/service/hlo_token.h
index 7928bee5c2097f353b182095a555c334d7b69c95..533429608bc2e13626a3e746fbe465398e1f4bb4 100644
--- a/tensorflow/compiler/xla/tools/parser/hlo_token.h
+++ b/tensorflow/compiler/xla/service/hlo_token.h
@@ -13,8 +13,8 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#ifndef TENSORFLOW_COMPILER_XLA_TOOLS_PARSER_HLO_TOKEN_H_
-#define TENSORFLOW_COMPILER_XLA_TOOLS_PARSER_HLO_TOKEN_H_
+#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_HLO_TOKEN_H_
+#define TENSORFLOW_COMPILER_XLA_SERVICE_HLO_TOKEN_H_
 
 #include <string>
 
@@ -22,9 +22,11 @@ limitations under the License.
 #include "tensorflow/core/platform/types.h"
 
 namespace xla {
-namespace tools {
 
 // Defines different kinds of tokens in a hlo module string.
+//
+// You shouldn't need to use this directly unless you're using HloLexer
+// directly, and you probably don't need to do that.  Use hlo_parser instead.
 enum class TokKind {
   // Markers
   kEof,
@@ -72,7 +74,6 @@ enum class TokKind {
 
 string TokKindToString(TokKind kind);
 
-}  // namespace tools
 }  // namespace xla
 
-#endif  // TENSORFLOW_COMPILER_XLA_TOOLS_PARSER_HLO_TOKEN_H_
+#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_HLO_TOKEN_H_
diff --git a/tensorflow/compiler/xla/service/hlo_verifier.cc b/tensorflow/compiler/xla/service/hlo_verifier.cc
index 7d6d0d9eaf70969c1a3762959233b561706398c2..9034073cc8a82311297ccd087741e6713110a5a7 100644
--- a/tensorflow/compiler/xla/service/hlo_verifier.cc
+++ b/tensorflow/compiler/xla/service/hlo_verifier.cc
@@ -376,6 +376,7 @@ Status CheckMixedPrecisionOperands(const HloInstruction* instruction) {
     case HloOpcode::kConstant:
     case HloOpcode::kCrossReplicaSum:
     case HloOpcode::kCustomCall:
+    case HloOpcode::kDomain:
     case HloOpcode::kFusion:
     case HloOpcode::kGetTupleElement:
     case HloOpcode::kInfeed:
@@ -425,6 +426,14 @@ Status ShapeVerifier::HandleGather(HloInstruction* gather) {
           gather->gather_dimension_numbers(), gather->gather_window_bounds()));
 }
 
+Status ShapeVerifier::HandleGenerateToken(HloInstruction* token) {
+  std::vector<const Shape*> operand_shapes;
+  for (const HloInstruction* operand : token->operands()) {
+    operand_shapes.push_back(&operand->shape());
+  }
+  return CheckShape(token, ShapeInference::InferTokenShape(operand_shapes));
+}
+
 Status ShapeVerifier::CheckShape(const HloInstruction* instruction,
                                  const Shape& inferred_shape) {
   // If allow_mixed_precision_ is false, check if there are operands with
@@ -790,6 +799,46 @@ Status HloVerifier::CheckElementwiseInstruction(HloInstruction* instruction) {
   return Status::OK();
 }
 
+namespace {
+
+// Returns true if the given Shape has a TOKEN shape as any subshape.
+bool ShapeContainsToken(const Shape& shape) {
+  bool contains_token = false;
+  ShapeUtil::ForEachSubshape(
+      shape, [&contains_token](const Shape& subshape, const ShapeIndex&) {
+        if (ShapeUtil::IsToken(subshape)) {
+          contains_token = true;
+        }
+      });
+  return contains_token;
+}
+
+// Verifies that all types entering and exiting the entry computation are
+// legal. For example, TOKEN types have no Literal representation and cannot be
+// on the interface of the entry computation (parameters and root instruction).
+Status VerifyEntryAndExitShapes(const HloModule& module) {
+  for (int i = 0; i < module.entry_computation()->num_parameters(); ++i) {
+    HloInstruction* param =
+        module.entry_computation()->parameter_instruction(i);
+    if (ShapeContainsToken(param->shape())) {
+      return InternalError(
+          "Entry parameter %d is or contains a token shape: %s", i,
+          ShapeUtil::HumanString(param->shape()).c_str());
+    }
+  }
+  if (ShapeContainsToken(
+          module.entry_computation()->root_instruction()->shape())) {
+    return InternalError(
+        "Entry root is or contains a token shape: %s",
+        ShapeUtil::HumanString(
+            module.entry_computation()->root_instruction()->shape())
+            .c_str());
+  }
+  return Status::OK();
+}
+
+}  // namespace
+
 StatusOr<bool> HloVerifier::Run(HloModule* module) {
   TF_RETURN_IF_ERROR(VerifyHloStructure(module));
 
@@ -850,6 +899,8 @@ StatusOr<bool> HloVerifier::Run(HloModule* module) {
     TF_RETURN_IF_ERROR(computation->Accept(shape_verifier.get()));
   }
 
+  TF_RETURN_IF_ERROR(VerifyEntryAndExitShapes(*module));
+
   return false;
 }
 
diff --git a/tensorflow/compiler/xla/service/hlo_verifier.h b/tensorflow/compiler/xla/service/hlo_verifier.h
index 1392a78097aa026b2f7cffa2b0135402d3ca7ae5..7283b3e7dcdbed5be18a1da1571287cf0c089288 100644
--- a/tensorflow/compiler/xla/service/hlo_verifier.h
+++ b/tensorflow/compiler/xla/service/hlo_verifier.h
@@ -81,6 +81,7 @@ class ShapeVerifier : public DfsHloVisitor {
       HloInstruction* batch_norm_inference) override;
   Status HandleBatchNormGrad(HloInstruction* batch_norm_grad) override;
   Status HandleGather(HloInstruction* gather) override;
+  Status HandleGenerateToken(HloInstruction* token) override;
 
   Status FinishVisit(HloInstruction*) override { return Status::OK(); }
 
diff --git a/tensorflow/compiler/xla/service/human_readable_profile_builder.cc b/tensorflow/compiler/xla/service/human_readable_profile_builder.cc
index dc3bfce0c495bc40a2df7b985cab67e02a3e15ce..d7458c338e9f1df9fac90270845aae0b8f779ee2 100644
--- a/tensorflow/compiler/xla/service/human_readable_profile_builder.cc
+++ b/tensorflow/compiler/xla/service/human_readable_profile_builder.cc
@@ -169,6 +169,23 @@ string HumanReadableProfileBuilder::ToString() const {
       StrAppend(&s, table.MakeReport(CyclesToMicroseconds(total_cycles_)));
     }
   }
+
+  if (total_bytes > 0) {
+    MetricTableReport table;
+    table.SetMetricName("MiB read+written");
+    table.SetEntryName("ops");
+    table.SetShowCategoryTable();
+    for (const auto& op : op_infos_) {
+      MetricTableReport::Entry entry;
+      entry.text = op.name;
+      entry.short_text = op.short_name;
+      entry.category_text = op.category;
+      entry.metric = static_cast<double>(op.bytes_accessed) / (1 << 20);
+      table.AddEntry(std::move(entry));
+    }
+    StrAppend(&s,
+              table.MakeReport(static_cast<double>(total_bytes) / (1 << 20)));
+  }
   return s;
 }
 
diff --git a/tensorflow/compiler/xla/service/indexed_array_analysis.cc b/tensorflow/compiler/xla/service/indexed_array_analysis.cc
index 15b2d8f4990735c56f105e7c1b9b7dc70609d898..8b3fa6c1572cf0ed91fc427722edcb23d8b8529d 100644
--- a/tensorflow/compiler/xla/service/indexed_array_analysis.cc
+++ b/tensorflow/compiler/xla/service/indexed_array_analysis.cc
@@ -15,6 +15,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/service/indexed_array_analysis.h"
 #include "tensorflow/compiler/xla/map_util.h"
+#include "tensorflow/compiler/xla/service/hlo_evaluator.h"
 #include "tensorflow/compiler/xla/util.h"
 #include "tensorflow/core/lib/gtl/flatset.h"
 #include "tensorflow/core/lib/gtl/inlined_vector.h"
@@ -28,9 +29,11 @@ using Analysis = IndexedArrayAnalysis;
 using UnknownArray = Analysis::UnknownArray;
 using ConstantArray = Analysis::ConstantArray;
 using ScalarIndexedArray = Analysis::ScalarIndexedArray;
+using tensorflow::gtl::ArraySlice;
+using tensorflow::str_util::Join;
 }  // namespace
 
-string IndexedArrayAnalysis::ToString(Array* root) {
+string IndexedArrayAnalysis::ToString(Array* root, bool print_constants) {
   switch (root->kind()) {
     case Array::kUnknown: {
       auto* unknown_tensor = root->as<UnknownArray>();
@@ -39,6 +42,12 @@ string IndexedArrayAnalysis::ToString(Array* root) {
     }
 
     case Array::kConstant: {
+      if (print_constants) {
+        string contents = root->as<ConstantArray>()->literal()->ToString();
+        return tensorflow::strings::StrCat(
+            "(constant ", ShapeUtil::HumanString(root->shape()), " ", contents,
+            ")");
+      }
       return tensorflow::strings::StrCat(
           "(constant ", ShapeUtil::HumanString(root->shape()), ")");
     }
@@ -50,26 +59,26 @@ string IndexedArrayAnalysis::ToString(Array* root) {
                         ? "scalar-indexed-const"
                         : "scalar-indexed";
       return tensorflow::strings::StrCat(
-          "(", name, " ", ToString(indexed_array->source()), " ",
-          ToString(indexed_array->indices()), " ", indexed_array->source_dim(),
-          "->[", tensorflow::str_util::Join(indexed_array->output_dims(), ","),
-          "])");
+          "(", name, " ", ToString(indexed_array->source(), print_constants),
+          " ", ToString(indexed_array->indices(), print_constants), " ",
+          indexed_array->source_dim(), "->[",
+          Join(indexed_array->output_dims(), ","), "])");
     }
   }
 }
 
-Analysis::Array* IndexedArrayAnalysis::GetArrayFor(
+StatusOr<Analysis::Array*> IndexedArrayAnalysis::GetArrayFor(
     const HloInstruction* instr) {
   auto it = cache_.find(instr);
   if (it != cache_.end()) {
     return it->second;
   }
 
-  TraverseAndPopulateCache(instr);
+  TF_RETURN_IF_ERROR(TraverseAndPopulateCache(instr));
   return FindOrDie(cache_, instr);
 }
 
-void IndexedArrayAnalysis::TraverseAndPopulateCache(
+Status IndexedArrayAnalysis::TraverseAndPopulateCache(
     const HloInstruction* root) {
   // Depth first search over the DAG, invoking ComputeArrayFor in post order.
   // The HLO instructions already in the cache are considered leaves.
@@ -105,28 +114,46 @@ void IndexedArrayAnalysis::TraverseAndPopulateCache(
 
       case kVisited:
         stack.pop_back();
-        InsertOrDie(&cache_, instr, ComputeArrayFor(instr));
+        TF_ASSIGN_OR_RETURN(Array * array, ComputeArrayFor(instr));
+        InsertOrDie(&cache_, instr, array);
         break;
     }
   } while (!stack.empty());
+
+  return Status::OK();
 }
 
-Analysis::Array* IndexedArrayAnalysis::ComputeArrayFor(
+StatusOr<Analysis::Array*> IndexedArrayAnalysis::ComputeArrayFor(
     const HloInstruction* instr) {
   Array* computed_array;
-  switch (instr->opcode()) {
-    default:
-      computed_array = nullptr;
-      break;
-    case HloOpcode::kConstant:
-      computed_array = ComputeArrayForConstant(instr->literal());
-      break;
-    case HloOpcode::kGather:
-      computed_array = ComputeArrayForGather(
-          instr->shape(), instr->gather_dimension_numbers(),
-          instr->gather_window_bounds(), FindOrDie(cache_, instr->operand(0)),
-          FindOrDie(cache_, instr->operand(1)));
-      break;
+  if (instr->IsElementwise() && instr->operand_count() == 1) {
+    TF_ASSIGN_OR_RETURN(
+        computed_array,
+        ComputeArrayForElementwiseUnaryOp(
+            instr->opcode(), FindOrDie(cache_, instr->operand(0))));
+  } else if (instr->IsElementwise() && instr->operand_count() == 2) {
+    TF_ASSIGN_OR_RETURN(
+        computed_array,
+        ComputeArrayForElementwiseBinaryOp(
+            instr->opcode(), FindOrDie(cache_, instr->operand(0)),
+            FindOrDie(cache_, instr->operand(1))));
+  } else if (instr->opcode() == HloOpcode::kConstant) {
+    TF_ASSIGN_OR_RETURN(computed_array,
+                        ComputeArrayForConstant(instr->literal()));
+  } else if (instr->opcode() == HloOpcode::kGather) {
+    TF_ASSIGN_OR_RETURN(
+        computed_array,
+        ComputeArrayForGather(instr->shape(), instr->gather_dimension_numbers(),
+                              instr->gather_window_bounds(),
+                              FindOrDie(cache_, instr->operand(0)),
+                              FindOrDie(cache_, instr->operand(1))));
+  } else if (instr->opcode() == HloOpcode::kReshape) {
+    TF_ASSIGN_OR_RETURN(
+        computed_array,
+        ComputeArrayForReshape(instr->shape(),
+                               FindOrDie(cache_, instr->operand(0))));
+  } else {
+    computed_array = nullptr;
   }
 
   if (!computed_array) {
@@ -136,12 +163,12 @@ Analysis::Array* IndexedArrayAnalysis::ComputeArrayFor(
   return computed_array;
 }
 
-Analysis::Array* IndexedArrayAnalysis::ComputeArrayForConstant(
+StatusOr<Analysis::Array*> IndexedArrayAnalysis::ComputeArrayForConstant(
     const Literal& literal) {
   return Construct<ConstantArray>(&literal);
 }
 
-ScalarIndexedArray* IndexedArrayAnalysis::FoldGatherOfGather(
+StatusOr<ScalarIndexedArray*> IndexedArrayAnalysis::FoldGatherOfGather(
     ScalarIndexedArray* source, Array* indices, int64 source_dim,
     tensorflow::gtl::ArraySlice<int64> output_dims, Shape shape) {
   // We want to transform Gather(Gather(A, X), Y) => Gather(A, Gather(X, Y)).
@@ -161,14 +188,14 @@ ScalarIndexedArray* IndexedArrayAnalysis::FoldGatherOfGather(
                                               IndexComponent::Ungathered);
 
   // Simulate the first gather.
-  simulated_index.erase(simulated_index.begin() + source->source_dim());
+  EraseAt(&simulated_index, source->source_dim());
   for (int64 gather_dim : source->output_dims()) {
     simulated_index.insert(simulated_index.begin() + gather_dim,
                            IndexComponent::GatheredFirst);
   }
 
   // Simulate the second gather.
-  simulated_index.erase(simulated_index.begin() + source_dim);
+  EraseAt(&simulated_index, source_dim);
   for (int64 output_dim : output_dims) {
     simulated_index.insert(simulated_index.begin() + output_dim,
                            IndexComponent::GatheredSecond);
@@ -207,7 +234,7 @@ ScalarIndexedArray* IndexedArrayAnalysis::FoldGatherOfGather(
                                      std::move(shape));
 }
 
-Analysis::Array* IndexedArrayAnalysis::ComputeArrayForGather(
+StatusOr<Analysis::Array*> IndexedArrayAnalysis::ComputeArrayForGather(
     const Shape& shape, const GatherDimensionNumbers& dim_numbers,
     tensorflow::gtl::ArraySlice<int64> window_bounds, Array* source,
     Array* indices) {
@@ -244,6 +271,443 @@ Analysis::Array* IndexedArrayAnalysis::ComputeArrayForGather(
                                        shape);
 }
 
+namespace {
+// Returns an index into `values` such that the product of the range
+// [values.begin()+index, values.end()) is equal to `product`.  If there is no
+// such index, return -1.  All integers in `values` must be positive.
+int64 FindSuffixWithProduct(ArraySlice<int64> values, int64 product) {
+  DCHECK(c_all_of(values, [](int64 value) { return value > 0; }));
+
+  int64 current_product = 1;
+  int64 i;
+  for (i = values.size() - 1; i >= 0 && product > current_product; --i) {
+    current_product *= values[i];
+  }
+
+  if (product == current_product) {
+    return i + 1;
+  }
+
+  return -1;
+}
+
+struct ReshapePassthroughDimPair {
+  int64 result_dim;
+  int64 operand_dim;
+};
+
+// Returns a set of dimension pairs such for all (result_dim, operand_dim) in
+// the set:
+//
+// output_index[result_dim] = SourceIndexOfReshape(output_index)[operand_dim]
+//
+// The returned vector of pairs is sorted in both the result_dim and the
+// operand_dim components.
+std::vector<ReshapePassthroughDimPair> ComputeReshapePassthroughDimPairs(
+    ArraySlice<int64> operand_shape, ArraySlice<int64> result_shape) {
+  // A reshape can be seen as an index mapping from output index to input index:
+  //
+  // (i_0, ..., i_n) = f(o_0, ..., o_m)
+  //
+  // This function returns the pairs (j, k) for which the following invariant
+  // holds for all indices in the shape:
+  //
+  //   o_j == i_k
+  //
+  // And this occurs when:
+  //
+  //    O_{j+1} * ... * O_n == I_{k+1} * ...  * I_m
+  //
+  // (where O_x are the sizes of the output shape and I_x are the sizes of the
+  // input shape) and the size of the dimension j of the result is the same as
+  // the size of dimension k in the operand.
+  //
+  // These conditions are sufficient because the Reshape HLO is spec'ed such
+  // that the rightmost dimensions are always minor in the flattening and refine
+  // operation.
+
+  std::vector<ReshapePassthroughDimPair> result;
+  int64 result_subarray_size = 1;
+  for (int64 result_dim = result_shape.size() - 1; result_dim >= 0;
+       --result_dim) {
+    int64 candidate_operand_dim =
+        FindSuffixWithProduct(operand_shape, result_subarray_size);
+
+    // result_subarray_size does not include the elements in the current
+    // `result_dim` dimension (we multiply in result_shape[result_dim] at the
+    // end of loop body) so candidate_operand_dim can never be zero.
+    CHECK_NE(candidate_operand_dim, 0);
+
+    if (candidate_operand_dim != -1 &&
+        result_shape[result_dim] == operand_shape[candidate_operand_dim - 1]) {
+      result.push_back({/*result_dim=*/result_dim,
+                        /*operand_dim=*/candidate_operand_dim - 1});
+    }
+    result_subarray_size *= result_shape[result_dim];
+  }
+
+  c_reverse(result);
+
+  if (VLOG_IS_ON(3)) {
+    std::vector<string> result_strings;
+    c_transform(result, std::back_inserter(result_strings),
+                [](ReshapePassthroughDimPair value) {
+                  return tensorflow::strings::StrCat(value.result_dim, "->",
+                                                     value.operand_dim);
+                });
+    VLOG(3) << "For a reshape from [" << Join(operand_shape, ",") << "] to ["
+            << Join(result_shape, ",") << "] passthrough indices are ["
+            << Join(result_strings, ",") << "]";
+  }
+
+  DCHECK(c_is_sorted(
+      result, [](ReshapePassthroughDimPair lhs, ReshapePassthroughDimPair rhs) {
+        return lhs.result_dim < rhs.result_dim;
+      }));
+
+  DCHECK(c_is_sorted(
+      result, [](ReshapePassthroughDimPair lhs, ReshapePassthroughDimPair rhs) {
+        return lhs.operand_dim < rhs.operand_dim;
+      }));
+
+  return result;
+}
+
+// Return true if `dim` is stated as an passthrough operand dim in
+// `passthrough_dims`.
+bool IsReshapePassthroughOperandDim(
+    ArraySlice<ReshapePassthroughDimPair> passthrough_dims, int64 dim) {
+  return c_any_of(passthrough_dims,
+                  [&](ReshapePassthroughDimPair passthrough_dim_pair) {
+                    return passthrough_dim_pair.operand_dim == dim;
+                  });
+}
+
+// Maps `operand_dim` which must be an passthrough operand dimension to its
+// corresponding passthrough result dimension based on `passthrough_dims`.
+int64 MapPassthroughOperandDimToResultDim(
+    ArraySlice<ReshapePassthroughDimPair> passthrough_dims, int64 operand_dim) {
+  auto it = c_find_if(passthrough_dims,
+                      [&](ReshapePassthroughDimPair passthrough_dim_pair) {
+                        return passthrough_dim_pair.operand_dim == operand_dim;
+                      });
+  CHECK(it != passthrough_dims.end());
+  return it->result_dim;
+}
+
+int64 FindSourcePositionForPassthroughResultDim(ArraySlice<int64> operand_shape,
+                                                ArraySlice<int64> result_shape,
+                                                int64 source_passthrough_dim) {
+  int64 indexed_source_subarray_size =
+      std::accumulate(operand_shape.begin() + source_passthrough_dim + 1,
+                      operand_shape.end(), 1, std::multiplies<int64>());
+
+  return FindSuffixWithProduct(result_shape, indexed_source_subarray_size);
+}
+
+};  // namespace
+
+StatusOr<Analysis::Array*> IndexedArrayAnalysis::ComputeArrayForReshape(
+    const Shape& shape, Array* operand) {
+  auto* scalar_indexed = dynamic_cast<ScalarIndexedConstantArray*>(operand);
+  if (!scalar_indexed) {
+    return nullptr;
+  }
+
+  // Try to fold Reshape(ScalarIndexed(Const, Indices))
+  //          => ScalarIndexed(Const', Indices)
+  //
+  // We can view the reshape and the scalar-indexed operations as functions that
+  // map an output index (i.e. an index into the result) to an input index
+  // (i.e. an index into the operand).  The key idea used here is that the
+  // output-to-input mapping for some reshape operations may "pass through" some
+  // output dimensions into the input space unchanged -- i.e. there may exist
+  // output dimension "O" and input dimension "I" such that OutputIndex[O] is
+  // always == InputIndexForReshape(OutputIndex)[I].  If these pass-through
+  // dimensions in the input space of the reshape happen to be include all the
+  // output dimensions for the scalar-indexed node then, roughly, the following
+  // holds:
+  //
+  //    SourceIndexOfScalarIndexed(SourceIndexOfReshape(Idx))
+  // == SourceIndexOfScalarIndexed(SourceIndexOfReshape(Ps ++ Qs))
+  //
+  //      Where Ps are the set of the pass-through components of Idx that are
+  //      also the output dims of the scalar-indexed node, and Qs are the rest.
+  //      For brevity, we're playing fast and loose with the notation here -- we
+  //      don't literally require Idx to be a concatenation of Ps and Qs, as
+  //      suggested by the "++".
+  //
+  // == SourceIndexOfScalarIndexed(Ps ++ SourceIndexOfReshape(Qs))
+  //
+  //      Again, we're playing fast and loose with the notation around "++".
+  //      Generally this ++ will be a different function that the ++ in the
+  //      previous step.
+  //
+  // If the scalar-indexed node has a constant as the source then the
+  // SourceIndexOfReshape function can be "folded into" the constant itself by
+  // reshaping it, leaving us with:
+  //
+  // == SourceIndexOfScalarIndexed(Ps ++ Qs)
+  // == SourceIndexOfScalarIndexed(Idx)
+  //
+  // which is just a scalar-indexed node (with parameters different from the
+  // scalar-indexed node we started with) with a reshaped constant as the
+  // source.
+  //
+  // We can't fold SourceIndexOfReshape into the constant without introducing
+  // another precondition: since the new scalar-indexed node will have a
+  // reshaped (constant) array as its source it will, in general, have a
+  // different source dimension than the original scalar-indexed node.  This
+  // source dimension will have to be a passthrough dimension of the
+  // SourceIndexOfReshape indexing function that is folded into the source. And
+  // such a dimension need not exist so this is a non-trivial precondition.
+
+  std::vector<ReshapePassthroughDimPair> reshape_passthrough_dims =
+      ComputeReshapePassthroughDimPairs(
+          /*operand_shape=*/AsInt64Slice(operand->shape().dimensions()),
+          /*result_shape=*/AsInt64Slice(shape.dimensions()));
+
+  auto is_reshape_passthrough_operand_dim = [&](int64 operand_dim) {
+    return IsReshapePassthroughOperandDim(reshape_passthrough_dims,
+                                          operand_dim);
+  };
+
+  if (!c_all_of(scalar_indexed->output_dims(),
+                is_reshape_passthrough_operand_dim)) {
+    return nullptr;
+  }
+
+  // To compute the shape of the source for the new scalar-indexed node we're
+  // going to create, we first "undo" the scalar-indexed operation.
+  std::vector<int64> new_scalar_indexed_source_shape(shape.dimensions().begin(),
+                                                     shape.dimensions().end());
+  for (int64 i = scalar_indexed->output_dims().size() - 1; i >= 0; i--) {
+    int64 output_dim = scalar_indexed->output_dims()[i];
+    int64 output_dim_after_reshape = MapPassthroughOperandDimToResultDim(
+        reshape_passthrough_dims, output_dim);
+    EraseAt(&new_scalar_indexed_source_shape, output_dim_after_reshape);
+  }
+
+  // After this, we need to add in the dimension that will be the source
+  // dimension for the new scalar-indexed node.  A scalar-indexed node "removes"
+  // the source dimensions and "adds" the output dimensions, so to get back to
+  // the shape for the *source* of the scalar-indexed node we need to remove the
+  // output dims (which we did above) and then add back the source dim (which we
+  // are about to do below):
+
+  const Shape& scalar_indexed_source_shape = scalar_indexed->source()->shape();
+
+  int64 source_dim_for_new_scalar_indexed_node =
+      FindSourcePositionForPassthroughResultDim(
+          /*operand_shape=*/AsInt64Slice(
+              scalar_indexed_source_shape.dimensions()),
+          /*result_shape=*/new_scalar_indexed_source_shape,
+          scalar_indexed->source_dim());
+
+  // We may not be able to find a source dim for the new scalar-indexed node.
+  // For instance consider:
+  //
+  //   operand = s32[3,5,2] constant({...})
+  //   indices = s32[7] parameter(0)
+  //   gather = s32[3,2,7] gather(operand, indices),
+  //       output_window_dims={0,1},
+  //       elided_window_dims={1},
+  //       gather_dims_to_operand_dims={1},
+  //       index_vector_dim=1,
+  //       window_bounds={3,1,2}
+  //   reshape = s32[6,7] reshape(gather)
+  //
+  // In this case the gather maps to:
+  //    (scalar-indexed-const (constant s32[3,5,2]) %indices 1->[2])
+  //
+  // and the reshape passes through dimension 2 from its input into dimension 1
+  // in its output.  However, we can't rewrite the reshape as a scalar-indexed
+  // node because then we'd have to reshape the [3,5,2] `operand` array to
+  // [6,5], but then dimension 1 of the reshaped [6,5] array indexes differently
+  // (a.k.a. isn't pass-through) than the [3,5,2] array.
+
+  if (source_dim_for_new_scalar_indexed_node == -1) {
+    return nullptr;
+  }
+
+  InsertAt(
+      &new_scalar_indexed_source_shape, source_dim_for_new_scalar_indexed_node,
+      scalar_indexed_source_shape.dimensions(scalar_indexed->source_dim()));
+
+  CHECK(IsReshapePassthroughOperandDim(
+      ComputeReshapePassthroughDimPairs(
+          /*operand_shape=*/AsInt64Slice(
+              scalar_indexed_source_shape.dimensions()),
+          /*result_shape=*/new_scalar_indexed_source_shape),
+      scalar_indexed->source_dim()));
+
+  auto map_passthrough_operand_dim_to_result_dim = [&](int64 result_dim) {
+    return MapPassthroughOperandDimToResultDim(reshape_passthrough_dims,
+                                               result_dim);
+  };
+
+  std::vector<int64> output_dims_for_new_scalar_indexed_node;
+  c_transform(scalar_indexed->output_dims(),
+              std::back_inserter(output_dims_for_new_scalar_indexed_node),
+              map_passthrough_operand_dim_to_result_dim);
+
+  TF_ASSIGN_OR_RETURN(const Literal* new_scalar_indexed_source_literal,
+                      TakeOwnership(scalar_indexed->literal().Reshape(
+                          new_scalar_indexed_source_shape)));
+  TF_ASSIGN_OR_RETURN(
+      Array * new_scalar_indexed_source,
+      ComputeArrayForConstant(*new_scalar_indexed_source_literal));
+
+  return ConstructScalarIndexedArray(
+      new_scalar_indexed_source, scalar_indexed->indices(),
+      source_dim_for_new_scalar_indexed_node,
+      output_dims_for_new_scalar_indexed_node, shape);
+}
+
+StatusOr<Analysis::Array*>
+IndexedArrayAnalysis::ComputeArrayForElementwiseBinaryOp(HloOpcode opcode,
+                                                         Array* lhs,
+                                                         Array* rhs) {
+  // Try to fold BinaryOp(Broadcast(Const0), ScalarIndexed(Const1, Indices))
+  //          => ScalarIndexed(BinaryOp(Broadcast'(Const0), Const1), Indices)
+  //
+  // We can do this if every output dimension from the scalar-indexed node is a
+  // broadcasted dimension for the broadcast node.  Informally, the precondition
+  // means Broadcast(Const0)[IDX] is solely a function of the components of IDX
+  // that are not output-dims for the scalar-indexed node. In other words, for
+  // every assignment to the non-output dims in IDX we have a "constant" LHS to
+  // the BinaryOp.  This transform propagates this "constant" to the source for
+  // the scalar-indexed node.
+
+  ScalarIndexedConstantArray* lhs_scalar_indexed_const =
+      dynamic_cast<ScalarIndexedConstantArray*>(lhs);
+  ScalarIndexedConstantArray* rhs_scalar_indexed_const =
+      dynamic_cast<ScalarIndexedConstantArray*>(rhs);
+
+  bool lhs_is_indexed;
+
+  // One of the operands must be scalar-indexed and the other must be a
+  // broadcast of a constant.
+  if (lhs_scalar_indexed_const && !rhs_scalar_indexed_const) {
+    lhs_is_indexed = true;
+  } else if (rhs_scalar_indexed_const && !lhs_scalar_indexed_const) {
+    lhs_is_indexed = false;
+  } else {
+    return nullptr;
+  }
+
+  ScalarIndexedConstantArray* scalar_indexed_const =
+      lhs_is_indexed ? lhs_scalar_indexed_const : rhs_scalar_indexed_const;
+  UnknownArray* candidate_broadcast_array =
+      dynamic_cast<UnknownArray*>(lhs_is_indexed ? rhs : lhs);
+  if (!candidate_broadcast_array ||
+      candidate_broadcast_array->instruction().opcode() !=
+          HloOpcode::kBroadcast) {
+    return nullptr;
+  }
+
+  const HloInstruction* broadcast_instr =
+      &candidate_broadcast_array->instruction();
+  const HloInstruction* broadcast_const_operand = broadcast_instr->operand(0);
+  if (broadcast_const_operand->opcode() != HloOpcode::kConstant) {
+    return nullptr;
+  }
+
+  ArraySlice<int64> broadcast_dims = broadcast_instr->dimensions();
+  auto is_broadcasted_dim = [&](int64 output_dim) {
+    return c_find(broadcast_dims, output_dim) == broadcast_dims.end();
+  };
+
+  // All of the output dims must be "broadcasted" dims for the other operand.
+  if (!c_all_of(scalar_indexed_const->output_dims(), is_broadcasted_dim)) {
+    return nullptr;
+  }
+
+  // To figure out the broadcast dimensions for the (constant) source for the
+  // scalar-indexed node, we "simulate" the index transformation done by the
+  // existing broadcsat:
+  enum class IndexComponent { Broadcasted, NotBroadcasted };
+  std::vector<IndexComponent> simulated_index(
+      broadcast_instr->shape().dimensions_size(), IndexComponent::Broadcasted);
+  for (int64 broadcast_dim : broadcast_dims) {
+    simulated_index[broadcast_dim] = IndexComponent::NotBroadcasted;
+  }
+
+  // The scalar-indexed node "removes" the source dim and "inserts" the output
+  // dims.  We do the opposite here to undo the scalar-indexed operation.
+  ArraySlice<int64> output_dims = scalar_indexed_const->output_dims();
+  for (int64 i = output_dims.size() - 1; i >= 0; --i) {
+    CHECK(simulated_index[output_dims[i]] == IndexComponent::Broadcasted);
+    EraseAt(&simulated_index, output_dims[i]);
+  }
+
+  InsertAt(&simulated_index, scalar_indexed_const->source_dim(),
+           IndexComponent::Broadcasted);
+
+  // new_inner_broadcast_dims holds the broadcast dimensions for the inner
+  // BinaryOp(Broadcast'(Const0), Const1).  We now translate simulated_index to
+  // new_inner_broadcast_dims.
+  std::vector<int64> new_inner_broadcast_dims;
+  for (int64 i = 0; i < simulated_index.size(); i++) {
+    if (simulated_index[i] == IndexComponent::NotBroadcasted) {
+      new_inner_broadcast_dims.push_back(i);
+    }
+  }
+
+  // inner_broadcast_result is the Broadcast'(Const0) bit in
+  // BinaryOp(Broadcast'(Const0), Const1)
+  TF_ASSIGN_OR_RETURN(
+      std::unique_ptr<Literal> inner_broadcast_result,
+      broadcast_const_operand->literal().Broadcast(
+          scalar_indexed_const->source()->shape(), new_inner_broadcast_dims));
+
+  // literal_for_new_source is BinaryOp(Broadcast'(Const0), Const1)
+  const Literal* literal_for_new_source;
+  if (lhs_is_indexed) {
+    TF_ASSIGN_OR_RETURN(
+        literal_for_new_source,
+        TakeOwnership(HloEvaluator{}.EvaluateElementwiseBinaryOp(
+            opcode, scalar_indexed_const->literal(), *inner_broadcast_result)));
+  } else {
+    TF_ASSIGN_OR_RETURN(
+        literal_for_new_source,
+        TakeOwnership(HloEvaluator{}.EvaluateElementwiseBinaryOp(
+            opcode, *inner_broadcast_result, scalar_indexed_const->literal())));
+  }
+
+  ConstantArray* new_source = Construct<ConstantArray>(literal_for_new_source);
+  return Construct<ScalarIndexedConstantArray>(
+      new_source, scalar_indexed_const->indices(),
+      scalar_indexed_const->source_dim(),
+      std::vector<int64>(scalar_indexed_const->output_dims().begin(),
+                         scalar_indexed_const->output_dims().end()),
+      scalar_indexed_const->shape());
+}
+
+StatusOr<Analysis::Array*>
+IndexedArrayAnalysis::ComputeArrayForElementwiseUnaryOp(HloOpcode opcode,
+                                                        Array* operand) {
+  auto* scalar_indexed_const =
+      dynamic_cast<ScalarIndexedConstantArray*>(operand);
+  if (scalar_indexed_const == nullptr) {
+    return nullptr;
+  }
+
+  // Fold UnaryOp(ScalarIndexed(Const, Indices))
+  //   => ScalarIndexed(UnaryOp(Const), Indices)
+
+  TF_ASSIGN_OR_RETURN(Literal * literal_for_new_source,
+                      TakeOwnership(HloEvaluator{}.EvaluateElementwiseUnaryOp(
+                          opcode, scalar_indexed_const->literal())));
+  ConstantArray* new_source = Construct<ConstantArray>(literal_for_new_source);
+  return Construct<ScalarIndexedConstantArray>(
+      new_source, scalar_indexed_const->indices(),
+      scalar_indexed_const->source_dim(),
+      std::vector<int64>(scalar_indexed_const->output_dims().begin(),
+                         scalar_indexed_const->output_dims().end()),
+      scalar_indexed_const->shape());
+}
+
 tensorflow::StringPiece IndexedArrayAnalysisPrinterPass::name() const {
   return "indexed-array-analysis-printer-pass";
 }
@@ -256,7 +720,7 @@ StatusOr<bool> IndexedArrayAnalysisPrinterPass::Run(HloModule* module) {
   IndexedArrayAnalysis analysis;
   for (auto* computation : module->MakeNonfusionComputations()) {
     for (auto* instr : computation->instructions()) {
-      auto* t = analysis.GetArrayFor(instr);
+      TF_ASSIGN_OR_RETURN(Analysis::Array * t, analysis.GetArrayFor(instr));
       if (!dynamic_cast<UnknownArray*>(t) && !dynamic_cast<ConstantArray*>(t)) {
         VLOG(2) << instr->ToString() << "   ->   " << analysis.ToString(t);
       }
diff --git a/tensorflow/compiler/xla/service/indexed_array_analysis.h b/tensorflow/compiler/xla/service/indexed_array_analysis.h
index b132a8f25153d2e86e8aa477fdb851f1c9c8e719..ce92fd2919c90fa8a2fb7b796ed6f0fdaf48fe62 100644
--- a/tensorflow/compiler/xla/service/indexed_array_analysis.h
+++ b/tensorflow/compiler/xla/service/indexed_array_analysis.h
@@ -143,8 +143,8 @@ class IndexedArrayAnalysis {
   //
   // For example, if source is of shape [11,13,17,19], indices is of shape
   // [23,29], output_dims is [0,2] and source_dim is 2 then the output is of
-  // shape [23,11,29,19] and the output index [A,B,C,D,E] is mapped to the input
-  // index [B,D,indices[A,C],E].
+  // shape [23,11,29,13,19] and the output index [A,B,C,D,E] is mapped to the
+  // input index [B,D,indices[A,C],E].
   class ScalarIndexedArray : public Array {
    public:
     Kind kind() const override { return kScalarIndexed; }
@@ -152,7 +152,15 @@ class IndexedArrayAnalysis {
 
     Array* source() const { return source_; }
     Array* indices() const { return indices_; }
+
+    // `source_dim` is the dimension in the source array that is being indexed
+    // over using indices from the `indices` array.  See the class documentation
+    // and the overview for more details.
     int64 source_dim() const { return source_dim_; }
+
+    // `output_dims` are the dimensions in the output array that are being used
+    // to compute an index into the `indices` array.  See the class
+    // documentation and the overview for more details.
     tensorflow::gtl::ArraySlice<int64> output_dims() const {
       return output_dims_;
     }
@@ -212,26 +220,26 @@ class IndexedArrayAnalysis {
   // NB!  By inspecting the implementation, you may be able to infer a stronger
   // caching guarantee than what is mentioned above.  Nevertheless, what is
   // stated above is the contract.
-  Array* GetArrayFor(const HloInstruction* instr);
+  StatusOr<Array*> GetArrayFor(const HloInstruction* instr);
 
   // Pretty-prints the expression rooted at `root`.
-  string ToString(Array* root);
+  string ToString(Array* root, bool print_constants = false);
 
  private:
   // Helper function that ensures that every HLO instruction that is
   // transitively used by `root` has an entry in `cache_`.
-  void TraverseAndPopulateCache(const HloInstruction* root);
+  Status TraverseAndPopulateCache(const HloInstruction* root);
 
   // Creates an Array instance for `instr` under the assumption that all
   // operations of `instr` are present in `cache_`.
-  Array* ComputeArrayFor(const HloInstruction* instr);
+  StatusOr<Array*> ComputeArrayFor(const HloInstruction* instr);
 
-  Array* ComputeArrayForConstant(const Literal& literal);
+  StatusOr<Array*> ComputeArrayForConstant(const Literal& literal);
 
-  Array* ComputeArrayForGather(const Shape& shape,
-                               const GatherDimensionNumbers& dim_numbers,
-                               tensorflow::gtl::ArraySlice<int64> window_bounds,
-                               Array* source, Array* indices);
+  StatusOr<Array*> ComputeArrayForGather(
+      const Shape& shape, const GatherDimensionNumbers& dim_numbers,
+      tensorflow::gtl::ArraySlice<int64> window_bounds, Array* source,
+      Array* indices);
 
   // This tries to fold a ScalarIndexedArray which has another
   // ScalarIndexedArray as a source into a ScalarIndexedArray that instead has a
@@ -254,10 +262,17 @@ class IndexedArrayAnalysis {
   //
   //    I2 = [I0[i]  for i in I1]
   //    G1 = [Arr[i] for i in I2]
-  ScalarIndexedArray* FoldGatherOfGather(
+  StatusOr<ScalarIndexedArray*> FoldGatherOfGather(
       ScalarIndexedArray* source, Array* indices, int64 source_dim,
       tensorflow::gtl::ArraySlice<int64> output_dims, Shape shape);
 
+  StatusOr<Array*> ComputeArrayForReshape(const Shape& shape, Array* operand);
+
+  StatusOr<Array*> ComputeArrayForElementwiseBinaryOp(HloOpcode opcode,
+                                                      Array* lhs, Array* rhs);
+  StatusOr<Array*> ComputeArrayForElementwiseUnaryOp(HloOpcode opcode,
+                                                     Array* operand);
+
   template <typename T, typename... Args>
   T* Construct(Args&&... args) {
     T* new_tensor = new T(std::forward<Args>(args)...);
@@ -279,6 +294,19 @@ class IndexedArrayAnalysis {
     }
   }
 
+  Literal* TakeOwnership(std::unique_ptr<Literal> literal) {
+    owned_literals_.push_back(std::move(literal));
+    return owned_literals_.back().get();
+  }
+
+  StatusOr<Literal*> TakeOwnership(
+      StatusOr<std::unique_ptr<Literal>> literal_or_error) {
+    TF_ASSIGN_OR_RETURN(std::unique_ptr<Literal> literal,
+                        std::move(literal_or_error));
+    owned_literals_.push_back(std::move(literal));
+    return owned_literals_.back().get();
+  }
+
   std::vector<std::unique_ptr<Array>> owned_tensors_;
   std::vector<std::unique_ptr<Literal>> owned_literals_;
   tensorflow::gtl::FlatMap<const HloInstruction*, Array*> cache_;
diff --git a/tensorflow/compiler/xla/service/indexed_array_analysis_test.cc b/tensorflow/compiler/xla/service/indexed_array_analysis_test.cc
index b2731b7c51a45c4f9b713d99ef3e4623ad2c9c83..373556ebeba883f7dc2116bdf0ffc3274182f775 100644
--- a/tensorflow/compiler/xla/service/indexed_array_analysis_test.cc
+++ b/tensorflow/compiler/xla/service/indexed_array_analysis_test.cc
@@ -23,14 +23,31 @@ class IndexedArrayAnalysisTest : public HloVerifiedTestBase {
  protected:
   void AssertArrayForRootExpressionIs(const string& hlo_text,
                                       const string& root_expression) {
+    AssertArrayForRootExpressionIsImpl(hlo_text, root_expression,
+                                       /*print_constants=*/false);
+  }
+
+  void AssertArrayWithConstantsForRootExpressionIs(
+      const string& hlo_text, const string& root_expression) {
+    AssertArrayForRootExpressionIsImpl(hlo_text, root_expression,
+                                       /*print_constants=*/true);
+  }
+
+ private:
+  void AssertArrayForRootExpressionIsImpl(const string& hlo_text,
+                                          const string& root_expression,
+                                          bool print_constants) {
     IndexedArrayAnalysis indexed_tensor_analysis;
     ParseAndVerifyModule(hlo_text);
 
-    string result =
-        indexed_tensor_analysis.ToString(indexed_tensor_analysis.GetArrayFor(
+    TF_ASSERT_OK_AND_ASSIGN(
+        IndexedArrayAnalysis::Array* const array_result,
+        indexed_tensor_analysis.GetArrayFor(
             module().entry_computation()->root_instruction()));
-    LOG(INFO) << result;
-    ASSERT_EQ(result, root_expression);
+    string string_result =
+        indexed_tensor_analysis.ToString(array_result, print_constants);
+    LOG(INFO) << string_result;
+    ASSERT_EQ(string_result, root_expression);
   }
 };
 
@@ -187,5 +204,301 @@ ENTRY main {
       "(scalar-indexed %operand (scalar-indexed %indices_a %indices_b "
       "1->[0,2]) 1->[0,1,3])");
 }
+
+TEST_F(IndexedArrayAnalysisTest, ReshapeOfGather0) {
+  string hlo_text = R"(
+HloModule ReshapeOfGather
+
+ENTRY main {
+  operand = s32[3,4] constant(s32[3,4]{{1,2,3,4},{1,2,3,4},{1,2,3,4}})
+  indices = s32[5] parameter(0)
+  gather = s32[5,4] gather(operand, indices),
+      output_window_dims={1},
+      elided_window_dims={0},
+      gather_dims_to_operand_dims={0},
+      index_vector_dim=1,
+      window_bounds={1,4}
+  ROOT reshape = s32[5,2,2] reshape(gather)
+}
+)";
+
+  AssertArrayForRootExpressionIs(
+      hlo_text, "(scalar-indexed-const (constant s32[3,2,2]) %indices 0->[0])");
+}
+
+TEST_F(IndexedArrayAnalysisTest, ReshapeOfGather1) {
+  string hlo_text = R"(
+HloModule ReshapeOfGather
+
+ENTRY main {
+  operand = s32[3,4] constant(s32[3,4]{{1,2,3,4},{1,2,3,4},{1,2,3,4}})
+  indices = s32[5,7] parameter(0)
+  gather = s32[5,4,7] gather(operand, indices),
+      output_window_dims={1},
+      elided_window_dims={0},
+      gather_dims_to_operand_dims={0},
+      index_vector_dim=2,
+      window_bounds={1,4}
+  ROOT reshape = s32[5,2,2,7] reshape(gather)
+}
+)";
+
+  AssertArrayForRootExpressionIs(
+      hlo_text,
+      "(scalar-indexed-const (constant s32[3,2,2]) %indices 0->[0,3])");
+}
+
+TEST_F(IndexedArrayAnalysisTest, ReshapeOfGather2) {
+  string hlo_text = R"(
+HloModule ReshapeOfGather
+
+ENTRY main {
+  operand = s32[3,2,6] constant(s32[3,2,6]{
+      {{1,2,3,4,5,6},{1,2,3,4,5,6}},
+      {{1,2,3,4,5,6},{1,2,3,4,5,6}},
+      {{1,2,3,4,5,6},{1,2,3,4,5,6}}})
+  indices = s32[5,7] parameter(0)
+  gather = s32[5,2,6,7] gather(operand, indices),
+      output_window_dims={1,2},
+      elided_window_dims={0},
+      gather_dims_to_operand_dims={0},
+      index_vector_dim=2,
+      window_bounds={1,2,6}
+  ROOT reshape = s32[5,3,4,7] reshape(gather)
+}
+)";
+
+  AssertArrayForRootExpressionIs(
+      hlo_text,
+      "(scalar-indexed-const (constant s32[3,3,4]) %indices 0->[0,3])");
+}
+
+TEST_F(IndexedArrayAnalysisTest, ReshapeOfGatherNegative0) {
+  string hlo_text = R"(
+HloModule ReshapeOfGather
+
+ENTRY main {
+  operand = s32[3,4] constant(s32[3,4]{{1,2,3,4},{1,2,3,4},{1,2,3,4}})
+  indices = s32[5,6] parameter(0)
+  gather = s32[5,4,6] gather(operand, indices),
+      output_window_dims={1},
+      elided_window_dims={0},
+      gather_dims_to_operand_dims={0},
+      index_vector_dim=2,
+      window_bounds={1,4}
+  ROOT reshape = s32[5,2,2,2,3] reshape(gather)
+}
+)";
+
+  AssertArrayForRootExpressionIs(hlo_text, "%reshape");
+}
+
+TEST_F(IndexedArrayAnalysisTest, ReshapeOfGatherNegative1) {
+  string hlo_text = R"(
+HloModule ReshapeOfGather
+
+ENTRY main {
+  operand = s32[3,5,2] constant(s32[3,5,2]{
+      {{1,2},{3,4},{5,6},{7,8},{9,10}},
+      {{1,2},{3,4},{5,6},{7,8},{9,10}},
+      {{1,2},{3,4},{5,6},{7,8},{9,10}}})
+  indices = s32[7] parameter(0)
+  gather = s32[3,2,7] gather(operand, indices),
+      output_window_dims={0,1},
+      elided_window_dims={1},
+      gather_dims_to_operand_dims={1},
+      index_vector_dim=1,
+      window_bounds={3,1,2}
+  ROOT reshape = s32[6,7] reshape(gather)
+}
+)";
+
+  AssertArrayForRootExpressionIs(hlo_text, "%reshape");
+}
+
+TEST_F(IndexedArrayAnalysisTest, UnaryOpOfGather) {
+  string hlo_text = R"(
+HloModule UnaryOpOfGather
+
+ENTRY main {
+  operand = f32[3,4] constant(f32[3,4]{{1,2,3,4},{1,3,2,4},{4,3,2,1}})
+  indices = s32[5] parameter(0)
+  gather = f32[5,4] gather(operand, indices),
+      output_window_dims={1},
+      elided_window_dims={0},
+      gather_dims_to_operand_dims={0},
+      index_vector_dim=1,
+      window_bounds={1,4}
+  ROOT tanh = f32[5,4] tanh(gather)
+}
+)";
+
+  AssertArrayWithConstantsForRootExpressionIs(hlo_text, 1 + R"(
+(scalar-indexed-const (constant f32[3,4] f32[3,4] {
+  { 0.761594176, 0.964027584, 0.995054781, 0.999329329 },
+  { 0.761594176, 0.995054781, 0.964027584, 0.999329329 },
+  { 0.999329329, 0.995054781, 0.964027584, 0.761594176 }
+}) %indices 0->[0]))");
+}
+
+TEST_F(IndexedArrayAnalysisTest, AddBroadcastedScalarWithGather) {
+  string hlo_text = R"(
+HloModule AddBroadcastedScalarWithGather
+
+ENTRY main {
+  gather_operand = s32[3,4] constant(s32[3,4]{{1,2,3,4},{1,3,2,4},{4,3,2,1}})
+  constant = s32[] constant(5)
+  constant_broadcasted = s32[5,4] broadcast(constant), dimensions={}
+  indices = s32[5] parameter(0)
+  gather = s32[5,4] gather(gather_operand, indices),
+      output_window_dims={1},
+      elided_window_dims={0},
+      gather_dims_to_operand_dims={0},
+      index_vector_dim=1,
+      window_bounds={1,4}
+  ROOT add = s32[5,4] add(gather, constant_broadcasted)
+}
+)";
+
+  AssertArrayWithConstantsForRootExpressionIs(hlo_text, 1 + R"(
+(scalar-indexed-const (constant s32[3,4] s32[3,4] {
+  { 6, 7, 8, 9 },
+  { 6, 8, 7, 9 },
+  { 9, 8, 7, 6 }
+}) %indices 0->[0]))");
+}
+
+TEST_F(IndexedArrayAnalysisTest,
+       SubtractBroadcastedScalarWithGather_GatherIsLhs) {
+  string hlo_text = R"(
+HloModule SubtractBroadcastedScalarWithGather
+
+ENTRY main {
+  gather_operand = s32[3,4] constant(s32[3,4]{{1,2,3,4},{1,3,2,4},{4,3,2,1}})
+  constant = s32[] constant(5)
+  constant_broadcasted = s32[5,4] broadcast(constant), dimensions={}
+  indices = s32[5] parameter(0)
+  gather = s32[5,4] gather(gather_operand, indices),
+      output_window_dims={1},
+      elided_window_dims={0},
+      gather_dims_to_operand_dims={0},
+      index_vector_dim=1,
+      window_bounds={1,4}
+  ROOT sub = s32[5,4] subtract(gather, constant_broadcasted)
+}
+)";
+
+  AssertArrayWithConstantsForRootExpressionIs(hlo_text, 1 + R"(
+(scalar-indexed-const (constant s32[3,4] s32[3,4] {
+  { -4, -3, -2, -1 },
+  { -4, -2, -3, -1 },
+  { -1, -2, -3, -4 }
+}) %indices 0->[0]))");
+}
+
+TEST_F(IndexedArrayAnalysisTest,
+       SubtractBroadcastedScalarWithGather_GatherIsRhs) {
+  string hlo_text = R"(
+HloModule SubtractBroadcastedScalarWithGather
+
+ENTRY main {
+  gather_operand = s32[3,4] constant(s32[3,4]{{1,2,3,4},{1,3,2,4},{4,3,2,1}})
+  constant = s32[] constant(5)
+  constant_broadcasted = s32[5,4] broadcast(constant), dimensions={}
+  indices = s32[5] parameter(0)
+  gather = s32[5,4] gather(gather_operand, indices),
+      output_window_dims={1},
+      elided_window_dims={0},
+      gather_dims_to_operand_dims={0},
+      index_vector_dim=1,
+      window_bounds={1,4}
+  ROOT sub = s32[5,4] subtract(constant_broadcasted, gather)
+}
+)";
+
+  AssertArrayWithConstantsForRootExpressionIs(hlo_text, 1 + R"(
+(scalar-indexed-const (constant s32[3,4] s32[3,4] {
+  { 4, 3, 2, 1 },
+  { 4, 2, 3, 1 },
+  { 1, 2, 3, 4 }
+}) %indices 0->[0]))");
+}
+
+TEST_F(IndexedArrayAnalysisTest, AddBroadcastedVectorWithGather) {
+  string hlo_text = R"(
+HloModule AddBroadcastedVectorWithGather
+
+ENTRY main {
+  gather_operand = s32[3,4] constant(s32[3,4]{{1,2,3,4},{1,3,2,4},{4,3,2,1}})
+  constant_vect = s32[4] constant({10,11,12,13})
+  constant_broadcasted = s32[5,4] broadcast(constant_vect), dimensions={1}
+  indices = s32[5] parameter(0)
+  gather = s32[5,4] gather(gather_operand, indices),
+      output_window_dims={1},
+      elided_window_dims={0},
+      gather_dims_to_operand_dims={0},
+      index_vector_dim=1,
+      window_bounds={1,4}
+  ROOT add = s32[5,4] add(gather, constant_broadcasted)
+}
+)";
+
+  AssertArrayWithConstantsForRootExpressionIs(hlo_text, 1 + R"(
+(scalar-indexed-const (constant s32[3,4] s32[3,4] {
+  { 11, 13, 15, 17 },
+  { 11, 14, 14, 17 },
+  { 14, 14, 14, 14 }
+}) %indices 0->[0]))");
+}
+
+TEST_F(IndexedArrayAnalysisTest, AddBroadcastedVectorWithGather_Negative) {
+  string hlo_text = R"(
+HloModule AddBroadcastedVectorWithGather
+
+ENTRY main {
+  gather_operand = s32[3,4] constant(s32[3,4]{{1,2,3,4},{1,3,2,4},{4,3,2,1}})
+  constant_vect = s32[5] constant({10,11,12,13,14})
+  constant_broadcasted = s32[5,4] broadcast(constant_vect), dimensions={0}
+  indices = s32[5] parameter(0)
+  gather = s32[5,4] gather(gather_operand, indices),
+      output_window_dims={1},
+      elided_window_dims={0},
+      gather_dims_to_operand_dims={0},
+      index_vector_dim=1,
+      window_bounds={1,4}
+  ROOT add = s32[5,4] add(gather, constant_broadcasted)
+}
+)";
+
+  AssertArrayForRootExpressionIs(hlo_text, "%add");
+}
+
+TEST_F(IndexedArrayAnalysisTest, RegularUnaryOp) {
+  string hlo_text = R"(
+HloModule RegularUnaryOp
+
+ENTRY main {
+  input = f32[100] parameter(0)
+  ROOT tanh = f32[100] tanh(input)
+}
+)";
+
+  AssertArrayForRootExpressionIs(hlo_text, "%tanh");
+}
+
+TEST_F(IndexedArrayAnalysisTest, RegularBinaryOp) {
+  string hlo_text = R"(
+HloModule RegularUnaryOp
+
+ENTRY main {
+  input0 = f32[100] parameter(0)
+  input1 = f32[100] parameter(1)
+  ROOT add = f32[100] add(input0, input1)
+}
+)";
+
+  AssertArrayForRootExpressionIs(hlo_text, "%add");
+}
+
 }  // namespace
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/instruction_fusion.cc b/tensorflow/compiler/xla/service/instruction_fusion.cc
index cb6c98c48171a06539499b723a8d8b7aa0ccc96a..abedb4063d3763516e66cff36633dbd90c8cafde 100644
--- a/tensorflow/compiler/xla/service/instruction_fusion.cc
+++ b/tensorflow/compiler/xla/service/instruction_fusion.cc
@@ -96,6 +96,7 @@ bool IsAlwaysDuplicable(const HloInstruction& instruction) {
     case HloOpcode::kShiftRightLogical:
     case HloOpcode::kSlice:
     case HloOpcode::kSubtract:
+    case HloOpcode::kGenerateToken:
     case HloOpcode::kTranspose:
     case HloOpcode::kTuple:
       return false;
@@ -118,6 +119,7 @@ bool IsAlwaysDuplicable(const HloInstruction& instruction) {
     case HloOpcode::kCrossReplicaSum:
     case HloOpcode::kCustomCall:
     case HloOpcode::kDivide:
+    case HloOpcode::kDomain:
     case HloOpcode::kDot:
     case HloOpcode::kExp:
     case HloOpcode::kExpm1:
@@ -178,8 +180,7 @@ bool InstructionFusion::EffectivelyAtMostUnary(HloInstruction* hlo) {
 
 bool InstructionFusion::CanFuseOnAllPaths(
     HloInstruction* producer, HloInstruction* consumer,
-    const HloReachabilityMap& reachability_map,
-    const DoNotFuseSet& do_not_fuse) {
+    const HloInstructionSet& do_not_duplicate) {
   if (consumer == producer) {
     return true;
   }
@@ -190,10 +191,11 @@ bool InstructionFusion::CanFuseOnAllPaths(
     auto* consumer_operand = consumer->mutable_operand(i);
     // If the operand is not on a path to the producer, it doesn't matter
     // whether it's fusable.
-    if (!reachability_map.IsReachable(producer, consumer_operand)) {
+    if (!reachability_->IsReachable(producer, consumer_operand)) {
       continue;
     }
-    if (do_not_fuse.count(consumer_operand) > 0 || !ShouldFuse(consumer, i)) {
+    if (do_not_duplicate.count(consumer_operand) > 0 ||
+        !ShouldFuse(consumer, i)) {
       return false;
     }
     // The producer is reachable from consumer_operand which means we need
@@ -201,18 +203,16 @@ bool InstructionFusion::CanFuseOnAllPaths(
     // producer to be fusable into consumer on all paths.
     // Perform the recursive step: make sure producer can be fused into
     // consumer_operand on all paths.
-    if (!CanFuseOnAllPaths(producer, consumer_operand, reachability_map,
-                           do_not_fuse)) {
+    if (!CanFuseOnAllPaths(producer, consumer_operand, do_not_duplicate)) {
       return false;
     }
   }
   return true;
 }
 
-InstructionFusion::DoNotFuseSet InstructionFusion::ComputeGloballyUnfusable(
+InstructionFusion::HloInstructionSet
+InstructionFusion::ComputeGloballyUnfusable(
     tensorflow::gtl::ArraySlice<HloInstruction*> post_order) {
-  auto reachability = computation_->ComputeReachability();
-
   // Forbid fusion of producers that:
   // a) Need to be duplicated, unless they can be fused into all consumers
   //    via all paths.
@@ -222,10 +222,10 @@ InstructionFusion::DoNotFuseSet InstructionFusion::ComputeGloballyUnfusable(
   // Note that if we allow fusion by these global rules, we may still forbid
   // fusing operations that require duplication later depending on
   // is_expensive_().
-  DoNotFuseSet do_not_fuse;
+  HloInstructionSet do_not_duplicate;
   for (HloInstruction* consumer : post_order) {
     for (HloInstruction* producer : consumer->operands()) {
-      if (do_not_fuse.count(producer) > 0) {
+      if (do_not_duplicate.count(producer) > 0) {
         continue;
       }
 
@@ -254,14 +254,14 @@ InstructionFusion::DoNotFuseSet InstructionFusion::ComputeGloballyUnfusable(
       // A will be not allowed to be fused into B, as it cannot be fused via
       // all paths.
       if (producer->IsFusable() &&
-          CanFuseOnAllPaths(producer, consumer, *reachability, do_not_fuse)) {
+          CanFuseOnAllPaths(producer, consumer, do_not_duplicate)) {
         continue;
       }
-      do_not_fuse.insert(producer);
+      do_not_duplicate.insert(producer);
     }
   }
 
-  return do_not_fuse;
+  return do_not_duplicate;
 }
 
 StatusOr<bool> InstructionFusion::Run(HloModule* module) {
@@ -273,6 +273,7 @@ StatusOr<bool> InstructionFusion::Run(HloModule* module) {
   for (auto* computation : module->MakeNonfusionComputations()) {
     CHECK(!computation->IsFusionComputation());
     computation_ = computation;
+    reachability_ = computation_->ComputeReachability();
 
     // We want to be able to remove arbitrary instructions from the post order
     // and also compare positions of instructions in the post order. To make
@@ -290,7 +291,7 @@ StatusOr<bool> InstructionFusion::Run(HloModule* module) {
       InsertOrDie(&post_order_index, post_order[i], i);
     }
 
-    DoNotFuseSet do_not_fuse = ComputeGloballyUnfusable(post_order);
+    HloInstructionSet do_not_duplicate = ComputeGloballyUnfusable(post_order);
 
     // Instruction fusion effectively fuses edges in the computation graph
     // (producer instruction -> consumer instruction) so we iterate over all
@@ -358,9 +359,20 @@ StatusOr<bool> InstructionFusion::Run(HloModule* module) {
       // ensures that B will be considered before A.
       //
       // We store the original indices of the operands to pass to ShouldFuse.
-      std::vector<int64> sorted_operand_numbers(instruction->operands().size());
-      std::iota(std::begin(sorted_operand_numbers),
-                std::end(sorted_operand_numbers), 0);
+      std::vector<int64> sorted_operand_numbers;
+      sorted_operand_numbers.reserve(instruction->operands().size());
+      for (int i = 0; i < instruction->operands().size(); ++i) {
+        // This will happen if we have two possible instructions to fuse the
+        // same operand into; once the operand is fused into one instruction,
+        // the other instruction will get a new get-tuple-element as its
+        // operand, which is not in the post-order index.
+        // TODO(tjoerg): Look into fusing past these multi-output fuse points.
+        if (post_order_index.find(instruction->mutable_operand(i)) ==
+            post_order_index.end()) {
+          continue;
+        }
+        sorted_operand_numbers.push_back(i);
+      }
       std::sort(
           sorted_operand_numbers.begin(), sorted_operand_numbers.end(),
           [&](int64 i, int64 j) {
@@ -377,13 +389,20 @@ StatusOr<bool> InstructionFusion::Run(HloModule* module) {
         if (!operand->IsFusable()) {
           continue;
         }
-        if (!ShouldFuse(instruction, i)) {
-          continue;
-        }
-        if (do_not_fuse.count(operand) > 0) {
+
+        HloInstruction* fusion_instruction;
+        // Try "regular" fusion if the operand may be duplicated. Otherwise,
+        // perform multi-output fusion, unless this creates a cycle.
+        // TODO(tjoerg): Consider making multi-output fusion the default.
+        if (ShouldFuse(instruction, i) &&
+            do_not_duplicate.count(operand) == 0) {
+          fusion_instruction = Fuse(operand, instruction);
+        } else if (ShouldFuseIntoMultiOutput(instruction, i) &&
+                   !MultiOutputFusionCreatesCycle(operand, instruction)) {
+          fusion_instruction = FuseIntoMultiOutput(operand, instruction);
+        } else {
           continue;
         }
-        HloInstruction* fusion_instruction = Fuse(operand, instruction);
 
         // Fusing an instruction into a fusion instruction can change the
         // operand set of the fusion instruction. For simplicity just push the
@@ -449,6 +468,19 @@ HloInstruction* InstructionFusion::FuseIntoMultiOutput(
   return fusion_instruction;
 }
 
+bool InstructionFusion::MultiOutputFusionCreatesCycle(
+    HloInstruction* producer, HloInstruction* consumer) {
+  return c_any_of(
+      consumer->operands(), [&](const HloInstruction* consumer_operand) {
+        // The fusion algorithm traverses the HLO graph in reverse post order.
+        // Thus `cosumers` is visited before its operands (including
+        // `producer`). Therefore, consumer operands cannot have been fused yet.
+        // It is thus safe to use the pre-computed reachability map.
+        return consumer_operand != producer &&
+               reachability_->IsReachable(producer, consumer_operand);
+      });
+}
+
 bool InstructionFusion::ShouldFuse(HloInstruction* consumer,
                                    int64 operand_index) {
   HloInstruction* producer = consumer->mutable_operand(operand_index);
diff --git a/tensorflow/compiler/xla/service/instruction_fusion.h b/tensorflow/compiler/xla/service/instruction_fusion.h
index c3c2ed0aaa81d6f346ec6e70d9c8b3b923e0a3d2..f73ca9adf768ed26f9ec9f162e01b7b160f50daf 100644
--- a/tensorflow/compiler/xla/service/instruction_fusion.h
+++ b/tensorflow/compiler/xla/service/instruction_fusion.h
@@ -61,6 +61,14 @@ class InstructionFusion : public HloPassInterface {
   // Subtypes can override this with target-specific heuristics.
   virtual bool ShouldFuse(HloInstruction* consumer, int64 operand_index);
 
+  // Returns whether multi-output fusion can be applied to fuse `producer` into
+  // `consumer`. In contrast to "regular" fusion, the `producer` is not
+  // duplicated by multi-output fusion.
+  virtual bool ShouldFuseIntoMultiOutput(HloInstruction* consumer,
+                                         int64 operand_index) {
+    return false;
+  }
+
   // Chooses a fusion kind for `producer` and `consumer`.
   // Default method chooses `kLoop`.
   virtual HloInstruction::FusionKind ChooseKind(const HloInstruction* producer,
@@ -97,10 +105,12 @@ class InstructionFusion : public HloPassInterface {
   // Current HloComputation instance the loop fuser is traversing.
   HloComputation* computation_;
   HloModule* module_;
+  // Reachability information for the current computation.
+  std::unique_ptr<HloReachabilityMap> reachability_;
 
  private:
   // The set of producers whose consumers we cannot fuse into.
-  using DoNotFuseSet = std::unordered_set<HloInstruction*>;
+  using HloInstructionSet = std::unordered_set<HloInstruction*>;
 
   HloInstruction* AddFusionInstruction(HloInstruction* producer,
                                        HloInstruction* consumer);
@@ -108,18 +118,21 @@ class InstructionFusion : public HloPassInterface {
   // Whether or not we can fuse producer into consumer on all paths
   // from the producer to the consumer where nodes are HLOs and edges are uses.
   bool CanFuseOnAllPaths(HloInstruction* producer, HloInstruction* consumer,
-                         const HloReachabilityMap& reachability_map,
-                         const DoNotFuseSet& do_not_fuse);
+                         const HloInstructionSet& do_not_fuse);
 
   // Computes the set of nodes that we do not want to fuse into any of their
   // consumers based on a global analysis of the HLO graph.
-  DoNotFuseSet ComputeGloballyUnfusable(
+  HloInstructionSet ComputeGloballyUnfusable(
       tensorflow::gtl::ArraySlice<HloInstruction*> post_order);
 
   // Used to determine if an HLO is expensive. Expensive operations will not be
   // duplicated.
   std::function<bool(const HloInstruction& instruction)> is_expensive_;
 
+  // Whether multi-output fusion would introduce a cycle into the HLO graph.
+  bool MultiOutputFusionCreatesCycle(HloInstruction* producer,
+                                     HloInstruction* consumer);
+
   // Returns whether we may duplicate an instruction if we want to fuse it.
   bool may_duplicate_;
 
diff --git a/tensorflow/compiler/xla/service/instruction_fusion_test.cc b/tensorflow/compiler/xla/service/instruction_fusion_test.cc
index df109df7877eefe4c337f93cc5a3a7a48e2e76c7..21db2338995960bde00ec9c4b325e5562fc3a592 100644
--- a/tensorflow/compiler/xla/service/instruction_fusion_test.cc
+++ b/tensorflow/compiler/xla/service/instruction_fusion_test.cc
@@ -16,8 +16,8 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/instruction_fusion.h"
 
 #include "tensorflow/compiler/xla/service/hlo_matchers.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/tests/hlo_test_base.h"
-#include "tensorflow/compiler/xla/tools/parser/hlo_parser.h"
 
 namespace xla {
 
@@ -47,7 +47,7 @@ class InstructionFusionForTesting : public InstructionFusion {
 };
 
 TEST_F(InstructionFusionTest, FuseInstructions) {
-  auto module = tools::Parse(R"(
+  auto module = ParseHloString(R"(
   HloModule test_module
   ENTRY entry_computation {
     p0 = f32[4,3]{1,0} parameter(0)
@@ -67,7 +67,7 @@ TEST_F(InstructionFusionTest, FuseInstructions) {
 }
 
 TEST_F(InstructionFusionTest, FuseIntoFusionInstruction) {
-  auto module = tools::Parse(R"(
+  auto module = ParseHloString(R"(
   HloModule test_module
   fused_computation {
     p1 = f32[4,3] parameter(0)
@@ -90,7 +90,7 @@ TEST_F(InstructionFusionTest, FuseIntoFusionInstruction) {
 }
 
 TEST_F(InstructionFusionTest, FuseInstructionsIntoMultiOutput) {
-  auto module = tools::Parse(R"(
+  auto module = ParseHloString(R"(
   HloModule test_module
   ENTRY entry_computation {
     p0 = f32[4,3]{1,0} parameter(0)
@@ -195,7 +195,7 @@ static int Count(const HloModule& module, HloOpcode op) {
 }
 
 TEST_F(InstructionFusionTest, FuseCheapNonDuplicatableOps) {
-  auto module = tools::Parse(R"(
+  auto module = ParseHloString(R"(
   HloModule test_module
   ENTRY OutputFusion {
     p0 = f32[4,3]{1,0} parameter(0)
@@ -220,7 +220,7 @@ TEST_F(InstructionFusionTest, AvoidDuplicationIfNotAllFusableRecursively) {
   //
   // p0 -> add -------------------------> sub
   //           \-> abs1 -> rng -> abs2 -/
-  auto module = tools::Parse(R"(
+  auto module = ParseHloString(R"(
   HloModule test_module
   ENTRY OutputFusion {
     p0 = f32[4,3]{1,0} parameter(0)
@@ -251,7 +251,7 @@ TEST_F(InstructionFusionTest, AvoidDuplicationIfNotAllFusableRecursively) {
   // p0 -> add -------------------------> sub
   //           \-> abs1 -> log -> abs2 -/
   //                           \-> send
-  module = tools::Parse(R"(
+  module = ParseHloString(R"(
   HloModule test_module
   ENTRY OutputFusion {
     p0 = f32[4,3]{1,0} parameter(0)
@@ -282,7 +282,7 @@ TEST_F(InstructionFusionTest, AvoidDuplicationIfNotAllFusableRecursively) {
   //    \         \-> add2 -/
   //     \-> log -/
   //             \-> send
-  module = tools::Parse(R"(
+  module = ParseHloString(R"(
   HloModule test_module
   ENTRY OutputFusion {
     p0 = f32[4,3]{1,0} parameter(0)
@@ -314,7 +314,7 @@ TEST_F(InstructionFusionTest, AvoidDuplicationIfNotAllFusableRecursively) {
   //                       \------> sub1
   //                        log -/
   //                            \-> send
-  module = tools::Parse(R"(
+  module = ParseHloString(R"(
   HloModule test_module
   ENTRY OutputFusion {
     p0 = f32[4,3]{1,0} parameter(0)
@@ -390,7 +390,7 @@ TEST_F(InstructionFusionTest, AllowEffectiveUnaryDuplication) {
 
 TEST_F(InstructionFusionTest,
        WideningConvertsAreAlwaysDuplicableIntoConsumers) {
-  auto module = tools::Parse(R"(
+  auto module = ParseHloString(R"(
   HloModule test_module
   ENTRY Test {
     p0 = f16[100] parameter(0)
diff --git a/tensorflow/compiler/xla/service/layout_assignment_test.cc b/tensorflow/compiler/xla/service/layout_assignment_test.cc
index 7508013199a82267efc0e1426cb5989d5fe844a0..bf0448a67674f24591d866b646b98aea09ebb12c 100644
--- a/tensorflow/compiler/xla/service/layout_assignment_test.cc
+++ b/tensorflow/compiler/xla/service/layout_assignment_test.cc
@@ -29,13 +29,13 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/hlo_matchers.h"
 #include "tensorflow/compiler/xla/service/hlo_module.h"
 #include "tensorflow/compiler/xla/service/hlo_opcode.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/shape_layout.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/test.h"
 #include "tensorflow/compiler/xla/test_helpers.h"
 #include "tensorflow/compiler/xla/tests/hlo_test_base.h"
 #include "tensorflow/compiler/xla/tests/test_utils.h"
-#include "tensorflow/compiler/xla/tools/parser/hlo_parser.h"
 #include "tensorflow/compiler/xla/util.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
 #include "tensorflow/core/lib/core/status.h"
@@ -651,7 +651,7 @@ TEST_F(LayoutAssignmentTest, TransposeWithinFusionDoesNotCrash) {
     }
   )";
 
-  auto module = tools::Parse(module_str).ValueOrDie();
+  auto module = ParseHloString(module_str).ValueOrDie();
 
   module =
       backend()
@@ -691,7 +691,7 @@ TEST_F(LayoutAssignmentTest, GTEInheritsLayoutFromOperand) {
     }
   )";
 
-  auto module = tools::Parse(module_str).ValueOrDie();
+  auto module = ParseHloString(module_str).ValueOrDie();
   ComputationLayout computation_layout(
       module->entry_computation()->ComputeProgramShape());
   Shape param_shape = ShapeUtil::MakeTupleShape(
diff --git a/tensorflow/compiler/xla/service/llvm_ir/fused_ir_emitter.cc b/tensorflow/compiler/xla/service/llvm_ir/fused_ir_emitter.cc
index f172b1d87c870270436f7301ed200b47d08431a7..d909845a3a21fc55e44b0037371fca30e577980f 100644
--- a/tensorflow/compiler/xla/service/llvm_ir/fused_ir_emitter.cc
+++ b/tensorflow/compiler/xla/service/llvm_ir/fused_ir_emitter.cc
@@ -80,8 +80,10 @@ Status FusedIrEmitter::HandleConstant(HloInstruction* constant) {
       *ir_builder_->GetInsertBlock()->getModule(), initializer->getType(),
       /*isConstant=*/true, llvm::GlobalValue::ExternalLinkage, initializer,
       /*Name=*/"");
+  llvm::Constant* shape_constant = llvm::ConstantExpr::getBitCast(
+      global, llvm_ir::ShapeToIrType(literal.shape(), module_)->getPointerTo());
   generators_[constant] = [=](const IrArray::Index& index) {
-    return IrArray(global, constant->shape())
+    return IrArray(shape_constant, constant->shape())
         .EmitReadArrayElement(index, ir_builder_);
   };
 
diff --git a/tensorflow/compiler/xla/service/llvm_ir/kernel_support_library.cc b/tensorflow/compiler/xla/service/llvm_ir/kernel_support_library.cc
index 23d2d4e87d26f4988ebddcf20f5a27af6a7fe0d6..1f6e3c829f890d68aa251b101f0402c120a19d61 100644
--- a/tensorflow/compiler/xla/service/llvm_ir/kernel_support_library.cc
+++ b/tensorflow/compiler/xla/service/llvm_ir/kernel_support_library.cc
@@ -15,53 +15,57 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/service/llvm_ir/kernel_support_library.h"
 
-#include "tensorflow/compiler/xla/service/llvm_ir/llvm_loop.h"
 #include "tensorflow/compiler/xla/service/llvm_ir/llvm_util.h"
 
 namespace xla {
-void KernelSupportLibrary::For(
+Status KernelSupportLibrary::For(
     tensorflow::StringPiece name, llvm::Value* start, llvm::Value* end,
     llvm::Value* step,
-    const std::function<void(llvm::Value*, bool)>& for_body_generator) {
-  If(ir_builder_->CreateICmpSLT(start, end), [&]() {
-    for_body_generator(start, /*is_first_iteration=*/true);
-    For(name, ir_builder_->CreateAdd(start, step), end, step,
-        [&](llvm::Value* iv) { for_body_generator(iv, false); });
+    const std::function<Status(llvm::Value*, bool)>& for_body_generator) {
+  return If(ir_builder_->CreateICmpSLT(start, end), [&]() -> Status {
+    TF_RETURN_IF_ERROR(for_body_generator(start, /*is_first_iteration=*/true));
+    return For(name, ir_builder_->CreateAdd(start, step), end, step,
+               [&](llvm::Value* iv) { return for_body_generator(iv, false); });
   });
 }
 
-void KernelSupportLibrary::For(
+Status KernelSupportLibrary::For(
     tensorflow::StringPiece name, llvm::Value* start, llvm::Value* end,
     llvm::Value* step, bool peel_first_iteration,
-    const std::function<void(llvm::Value*, llvm::Value*)>& for_body_generator) {
+    const std::function<Status(llvm::Value*, llvm::Value*)>&
+        for_body_generator) {
   if (peel_first_iteration) {
-    For(name, start, end, step, true,
-        [&](llvm::Value* indvar, bool is_first_iteration) {
-          for_body_generator(indvar, ir_builder_->getInt1(is_first_iteration));
-        });
+    return For(name, start, end, step, true,
+               [&](llvm::Value* indvar, bool is_first_iteration) -> Status {
+                 return for_body_generator(
+                     indvar, ir_builder_->getInt1(is_first_iteration));
+               });
   } else {
     std::unique_ptr<llvm_ir::ForLoop> loop = llvm_ir::ForLoop::EmitForLoop(
         name, start, end, step, ir_builder_,
-        /*prevent_unrolling=*/prevent_unrolling_,
+        /*unroll_mode=*/unroll_mode_,
         /*prevent_vectorization=*/prevent_vectorization_);
     ir_builder_->SetInsertPoint(&loop->GetBodyBasicBlock()->back());
-    for_body_generator(loop->GetIndVarValue(),
-                       /*is_first_iteration=*/ir_builder_->CreateICmpEQ(
-                           loop->GetIndVarValue(), start));
+    TF_RETURN_IF_ERROR(
+        for_body_generator(loop->GetIndVarValue(),
+                           /*is_first_iteration=*/ir_builder_->CreateICmpEQ(
+                               loop->GetIndVarValue(), start)));
     llvm_ir::SetToLastInsertPoint(loop->GetExitBasicBlock(), ir_builder_);
+    return Status::OK();
   }
 }
 
-void KernelSupportLibrary::If(
-    llvm::Value* condition, const std::function<void()>& true_block_generator,
-    const std::function<void()>& false_block_generator) {
+Status KernelSupportLibrary::If(
+    llvm::Value* condition, const std::function<Status()>& true_block_generator,
+    const std::function<Status()>& false_block_generator) {
   llvm_ir::LlvmIfData if_data =
       llvm_ir::EmitIfThenElse(condition, "", ir_builder_);
   ir_builder_->SetInsertPoint(&if_data.true_block->back());
-  true_block_generator();
+  TF_RETURN_IF_ERROR(true_block_generator());
   ir_builder_->SetInsertPoint(&if_data.false_block->back());
-  false_block_generator();
+  TF_RETURN_IF_ERROR(false_block_generator());
   llvm_ir::SetToLastInsertPoint(if_data.after_block, ir_builder_);
+  return Status::OK();
 }
 
 void KernelSupportLibrary::EmitAndCallOutlinedKernel(
diff --git a/tensorflow/compiler/xla/service/llvm_ir/kernel_support_library.h b/tensorflow/compiler/xla/service/llvm_ir/kernel_support_library.h
index 64b935bbf1fb9033cd2e1259b4639cd3780be711..e17c649e5272a9e7c0d5126083ab76542abfdf48 100644
--- a/tensorflow/compiler/xla/service/llvm_ir/kernel_support_library.h
+++ b/tensorflow/compiler/xla/service/llvm_ir/kernel_support_library.h
@@ -21,6 +21,7 @@ limitations under the License.
 #include "llvm/IR/BasicBlock.h"
 #include "llvm/IR/IRBuilder.h"
 #include "llvm/IR/Value.h"
+#include "tensorflow/compiler/xla/service/llvm_ir/llvm_loop.h"
 #include "tensorflow/compiler/xla/service/llvm_ir/llvm_util.h"
 #include "tensorflow/core/lib/core/stringpiece.h"
 
@@ -30,13 +31,14 @@ namespace xla {
 class KernelSupportLibrary {
  public:
   // `ir_builder` is the llvm::IRBuilder instance used to generate LLVM IR.
-  // If `prevent_unrolling` is true then unrolling is explicitly disabled on
-  // every loop generated by this instance of KernelSupportLibrary.
-  explicit KernelSupportLibrary(llvm::IRBuilder<>* ir_builder,
-                                bool prevent_unrolling = true,
-                                bool prevent_vectorization = true)
+  // `unroll_mode` specifies the desired LLVM unrolling behavior for every loop
+  // generated by this instance of KernelSupportLibrary.
+  explicit KernelSupportLibrary(
+      llvm::IRBuilder<>* ir_builder,
+      llvm_ir::UnrollMode unroll_mode = llvm_ir::UnrollMode::kNoUnroll,
+      bool prevent_vectorization = true)
       : ir_builder_(ir_builder),
-        prevent_unrolling_(prevent_unrolling),
+        unroll_mode_(unroll_mode),
         prevent_vectorization_(prevent_vectorization) {}
 
   // Generates the following control flow structure:
@@ -46,19 +48,41 @@ class KernelSupportLibrary {
   //     for (i64 i = `start` + `step`; i s< `end`; i += `step`)
   //       `for_body_generator(/*ind_var=*/,i, /*is_first_iteration=*/false)`;
   //   }
-  void For(
+  Status For(
+      tensorflow::StringPiece name, llvm::Value* start, llvm::Value* end,
+      llvm::Value* step,
+      const std::function<Status(llvm::Value* ind_var,
+                                 bool is_first_iteration)>& for_body_generator);
+
+  void ForReturnVoid(
       tensorflow::StringPiece name, llvm::Value* start, llvm::Value* end,
       llvm::Value* step,
       const std::function<void(llvm::Value* ind_var, bool is_first_iteration)>&
-          for_body_generator);
+          for_body_generator) {
+    CHECK_EQ(Status::OK(),
+             For(name, start, end, step,
+                 [&](llvm::Value* ind_var, bool is_first_iteration) -> Status {
+                   for_body_generator(ind_var, is_first_iteration);
+                   return Status::OK();
+                 }));
+  }
+
+  Status For(tensorflow::StringPiece name, int64 start, int64 end, int64 step,
+             const std::function<Status(llvm::Value* ind_var,
+                                        bool is_first_iteration)>&
+                 for_body_generator) {
+    return For(name, /*start=*/ir_builder_->getInt64(start),
+               /*end=*/ir_builder_->getInt64(end),
+               /*step=*/ir_builder_->getInt64(step), for_body_generator);
+  }
 
-  void For(
+  void ForReturnVoid(
       tensorflow::StringPiece name, int64 start, int64 end, int64 step,
       const std::function<void(llvm::Value* ind_var, bool is_first_iteration)>&
           for_body_generator) {
-    For(name, /*start=*/ir_builder_->getInt64(start),
-        /*end=*/ir_builder_->getInt64(end),
-        /*step=*/ir_builder_->getInt64(step), for_body_generator);
+    ForReturnVoid(name, /*start=*/ir_builder_->getInt64(start),
+                  /*end=*/ir_builder_->getInt64(end),
+                  /*step=*/ir_builder_->getInt64(step), for_body_generator);
   }
 
   // Generates the following control flow structure if `peel_first_iteration` is
@@ -75,46 +99,101 @@ class KernelSupportLibrary {
   //   for (i64 i = `start`; i s< `end`; i += `step`)
   //     `for_body_generator(/*ind_var=*/,i,
   //                         /*is_first_iteration=*/,(i != `start`))`;
-  void For(tensorflow::StringPiece name, llvm::Value* start, llvm::Value* end,
-           llvm::Value* step, bool peel_first_iteration,
-           const std::function<void(llvm::Value* ind_var,
-                                    llvm::Value* is_first_iteration)>&
+  Status For(tensorflow::StringPiece name, llvm::Value* start, llvm::Value* end,
+             llvm::Value* step, bool peel_first_iteration,
+             const std::function<Status(llvm::Value* ind_var,
+                                        llvm::Value* is_first_iteration)>&
+                 for_body_generator);
+
+  void ForReturnVoid(tensorflow::StringPiece name, llvm::Value* start,
+                     llvm::Value* end, llvm::Value* step,
+                     bool peel_first_iteration,
+                     const std::function<void(llvm::Value* ind_var,
+                                              llvm::Value* is_first_iteration)>&
+                         for_body_generator) {
+    TF_CHECK_OK(For(
+        name, start, end, step, peel_first_iteration,
+        [&](llvm::Value* ind_var, llvm::Value* is_first_iteration) -> Status {
+          for_body_generator(ind_var, is_first_iteration);
+          return Status::OK();
+        }));
+  }
+
+  Status For(tensorflow::StringPiece name, llvm::Value* start, llvm::Value* end,
+             int64 step, bool peel_first_iteration,
+             const std::function<Status(llvm::Value* ind_var,
+                                        llvm::Value* is_first_iteration)>&
+                 for_body_generator) {
+    return For(name, /*start=*/start, /*end=*/end,
+               /*step=*/ir_builder_->getInt64(step), peel_first_iteration,
                for_body_generator);
+  }
+
+  void ForReturnVoid(tensorflow::StringPiece name, llvm::Value* start,
+                     llvm::Value* end, int64 step, bool peel_first_iteration,
+                     const std::function<void(llvm::Value* ind_var,
+                                              llvm::Value* is_first_iteration)>&
+                         for_body_generator) {
+    ForReturnVoid(name, /*start=*/start, /*end=*/end,
+                  /*step=*/ir_builder_->getInt64(step), peel_first_iteration,
+                  for_body_generator);
+  }
 
-  void For(tensorflow::StringPiece name, llvm::Value* start, llvm::Value* end,
-           int64 step, bool peel_first_iteration,
-           const std::function<void(llvm::Value* ind_var,
-                                    llvm::Value* is_first_iteration)>&
-               for_body_generator) {
-    For(name, /*start=*/start, /*end=*/end,
-        /*step=*/ir_builder_->getInt64(step), peel_first_iteration,
-        for_body_generator);
+  Status For(
+      tensorflow::StringPiece name, llvm::Value* start, llvm::Value* end,
+      llvm::Value* step,
+      const std::function<Status(llvm::Value* ind_var)>& for_body_generator) {
+    return For(name, start, end, step,
+               /*peel_first_iteration=*/false,
+               [&](llvm::Value* indvar, llvm::Value*) -> Status {
+                 return for_body_generator(indvar);
+               });
   }
 
-  void For(
+  void ForReturnVoid(
       tensorflow::StringPiece name, llvm::Value* start, llvm::Value* end,
       llvm::Value* step,
       const std::function<void(llvm::Value* ind_var)>& for_body_generator) {
-    For(name, start, end, step,
-        /*peel_first_iteration=*/false,
-        [&](llvm::Value* indvar, llvm::Value*) { for_body_generator(indvar); });
+    ForReturnVoid(name, start, end, step,
+                  /*peel_first_iteration=*/false,
+                  [&](llvm::Value* indvar, llvm::Value*) {
+                    return for_body_generator(indvar);
+                  });
+  }
+
+  Status For(
+      tensorflow::StringPiece name, llvm::Value* start, llvm::Value* end,
+      int64 step,
+      const std::function<Status(llvm::Value* ind_var)>& for_body_generator) {
+    return For(name, start, end, ir_builder_->getInt64(step),
+               /*peel_first_iteration=*/false,
+               [&](llvm::Value* indvar, llvm::Value*) -> Status {
+                 return for_body_generator(indvar);
+               });
   }
 
-  void For(
+  void ForReturnVoid(
       tensorflow::StringPiece name, llvm::Value* start, llvm::Value* end,
       int64 step,
       const std::function<void(llvm::Value* ind_var)>& for_body_generator) {
-    For(name, start, end, ir_builder_->getInt64(step),
-        /*peel_first_iteration=*/false,
-        [&](llvm::Value* indvar, llvm::Value*) { for_body_generator(indvar); });
+    ForReturnVoid(name, start, end, ir_builder_->getInt64(step),
+                  for_body_generator);
+  }
+
+  Status For(
+      tensorflow::StringPiece name, int64 start, int64 end, int64 step,
+      const std::function<Status(llvm::Value* ind_var)>& for_body_generator) {
+    return For(name, /*start=*/ir_builder_->getInt64(start),
+               /*end=*/ir_builder_->getInt64(end),
+               /*step=*/ir_builder_->getInt64(step), for_body_generator);
   }
 
-  void For(
+  void ForReturnVoid(
       tensorflow::StringPiece name, int64 start, int64 end, int64 step,
       const std::function<void(llvm::Value* ind_var)>& for_body_generator) {
-    For(name, /*start=*/ir_builder_->getInt64(start),
-        /*end=*/ir_builder_->getInt64(end),
-        /*step=*/ir_builder_->getInt64(step), for_body_generator);
+    ForReturnVoid(name, /*start=*/ir_builder_->getInt64(start),
+                  /*end=*/ir_builder_->getInt64(end),
+                  /*step=*/ir_builder_->getInt64(step), for_body_generator);
   }
 
   // Generates the following control flow structure:
@@ -123,9 +202,25 @@ class KernelSupportLibrary {
   //     `true_block_generator()`;
   //   else
   //      `false_block_generator()`;
-  void If(llvm::Value* condition,
-          const std::function<void()>& true_block_generator,
-          const std::function<void()>& false_block_generator = []() {});
+  Status If(llvm::Value* condition,
+            const std::function<Status()>& true_block_generator,
+            const std::function<Status()>& false_block_generator =
+                []() -> Status { return Status::OK(); });
+
+  void IfReturnVoid(llvm::Value* condition,
+                    const std::function<void()>& true_block_generator,
+                    const std::function<void()>& false_block_generator = []() {
+                    }) {
+    TF_CHECK_OK(If(condition,
+                   [&]() {
+                     true_block_generator();
+                     return Status::OK();
+                   },
+                   [&]() {
+                     false_block_generator();
+                     return Status::OK();
+                   }));
+  }
 
   using ArgumentVector = tensorflow::gtl::ArraySlice<llvm::Value*>;
 
@@ -183,7 +278,7 @@ class KernelSupportLibrary {
 
  private:
   llvm::IRBuilder<>* ir_builder_;
-  bool prevent_unrolling_;
+  llvm_ir::UnrollMode unroll_mode_;
   bool prevent_vectorization_;
 };
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/llvm_ir/llvm_loop.cc b/tensorflow/compiler/xla/service/llvm_ir/llvm_loop.cc
index 497b48ff227d7d1f158080529372df44b6932b24..9f867014fb015845448c4fcf9c165750f8a61935 100644
--- a/tensorflow/compiler/xla/service/llvm_ir/llvm_loop.cc
+++ b/tensorflow/compiler/xla/service/llvm_ir/llvm_loop.cc
@@ -34,7 +34,7 @@ namespace llvm_ir {
 
 ForLoop::ForLoop(tensorflow::StringPiece prefix, tensorflow::StringPiece suffix,
                  llvm::Value* start_index, llvm::Value* end_index,
-                 llvm::Value* step, bool prevent_unrolling,
+                 llvm::Value* step, UnrollMode unroll_mode,
                  bool prevent_vectorization)
     : prefix_(std::string(prefix)),
       suffix_(std::string(suffix)),
@@ -42,15 +42,15 @@ ForLoop::ForLoop(tensorflow::StringPiece prefix, tensorflow::StringPiece suffix,
       end_index_(end_index),
       step_(step),
       insert_before_bb_(nullptr),
-      prevent_unrolling_(prevent_unrolling),
+      unroll_mode_(unroll_mode),
       prevent_vectorization_(prevent_vectorization) {}
 
 /* static */ std::unique_ptr<ForLoop> ForLoop::EmitForLoop(
     tensorflow::StringPiece prefix, llvm::Value* start_index,
     llvm::Value* end_index, llvm::Value* step, llvm::IRBuilder<>* ir_builder,
-    bool prevent_unrolling, bool prevent_vectorization) {
+    UnrollMode unroll_mode, bool prevent_vectorization) {
   std::unique_ptr<ForLoop> loop(new ForLoop(prefix, /*suffix=*/"", start_index,
-                                            end_index, step, prevent_unrolling,
+                                            end_index, step, unroll_mode,
                                             prevent_vectorization));
   loop->Emit(ir_builder);
   return loop;
@@ -147,11 +147,12 @@ void ForLoop::Emit(llvm::IRBuilder<>* ir_builder) {
 std::vector<llvm::Metadata*> ForLoop::GetLoopMetadata(
     llvm::IRBuilder<>* ir_builder) {
   const char* const kLlvmLoopUnrollDisableMDName = "llvm.loop.unroll.disable";
+  const char* const kLlvmLoopUnrollFullMDName = "llvm.loop.unroll.full";
   const char* const kLlvmLoopVectorizeMDName = "llvm.loop.vectorize.enable";
   llvm::LLVMContext* ctx = &start_index_->getContext();
 
   std::vector<llvm::Metadata*> result;
-  if (prevent_unrolling_) {
+  if (unroll_mode_ == xla::llvm_ir::UnrollMode::kNoUnroll) {
     result.push_back(llvm::MDNode::get(
         *ctx, {llvm::MDString::get(*ctx, kLlvmLoopUnrollDisableMDName)}));
   }
@@ -162,6 +163,10 @@ std::vector<llvm::Metadata*> ForLoop::GetLoopMetadata(
                llvm::ConstantAsMetadata::get(ir_builder->getFalse())}));
   }
 
+  if (unroll_mode_ == xla::llvm_ir::UnrollMode::kFullyUnroll) {
+    result.push_back(llvm::MDNode::get(
+        *ctx, {llvm::MDString::get(*ctx, kLlvmLoopUnrollFullMDName)}));
+  }
   return result;
 }
 
@@ -178,25 +183,25 @@ llvm::BasicBlock* ForLoop::CreateLoopBB(tensorflow::StringPiece name,
 std::unique_ptr<ForLoop> ForLoopNest::AddLoop(tensorflow::StringPiece suffix,
                                               llvm::Value* start_index,
                                               llvm::Value* end_index,
-                                              bool prevent_unrolling,
+                                              UnrollMode unroll_mode,
                                               bool prevent_vectorization) {
   return AddLoop(suffix, start_index, end_index, ir_builder_->getInt64(1),
-                 prevent_unrolling, prevent_vectorization);
+                 unroll_mode, prevent_vectorization);
 }
 
 std::unique_ptr<ForLoop> ForLoopNest::AddLoop(tensorflow::StringPiece suffix,
                                               llvm::Value* start_index,
                                               llvm::Value* end_index,
                                               llvm::Value* stride,
-                                              bool prevent_unrolling,
+                                              UnrollMode unroll_mode,
                                               bool prevent_vectorization) {
   if (inner_loop_body_bb_ != nullptr) {
     // Create this loop inside the previous one.
     ir_builder_->SetInsertPoint(&*inner_loop_body_bb_->getFirstInsertionPt());
   }
   std::unique_ptr<ForLoop> loop(new ForLoop(
-      /*prefix=*/name_, suffix, start_index, end_index, stride,
-      prevent_unrolling, prevent_vectorization));
+      /*prefix=*/name_, suffix, start_index, end_index, stride, unroll_mode,
+      prevent_vectorization));
   loop->Emit(ir_builder_);
 
   if (outer_loop_preheader_bb_ == nullptr) {
@@ -215,23 +220,23 @@ std::unique_ptr<ForLoop> ForLoopNest::AddLoop(tensorflow::StringPiece suffix,
 std::unique_ptr<ForLoop> ForLoopNest::AddLoop(int64 start_index,
                                               int64 end_index,
                                               tensorflow::StringPiece suffix,
-                                              bool prevent_unrolling,
+                                              UnrollMode unroll_mode,
                                               bool prevent_vectorization) {
   CHECK_LE(start_index, end_index);
   return AddLoop(suffix, ir_builder_->getInt64(start_index),
-                 ir_builder_->getInt64(end_index), prevent_unrolling,
+                 ir_builder_->getInt64(end_index), unroll_mode,
                  prevent_vectorization);
 }
 
 std::unique_ptr<ForLoop> ForLoopNest::AddLoop(int64 start_index,
                                               int64 end_index, int64 stride,
                                               tensorflow::StringPiece suffix,
-                                              bool prevent_unrolling,
+                                              UnrollMode unroll_mode,
                                               bool prevent_vectorization) {
   CHECK_LE(start_index, end_index);
   return AddLoop(suffix, ir_builder_->getInt64(start_index),
                  ir_builder_->getInt64(end_index),
-                 ir_builder_->getInt64(stride), prevent_unrolling,
+                 ir_builder_->getInt64(stride), unroll_mode,
                  prevent_vectorization);
 }
 
diff --git a/tensorflow/compiler/xla/service/llvm_ir/llvm_loop.h b/tensorflow/compiler/xla/service/llvm_ir/llvm_loop.h
index d915f95db134918a173a9711936bb1e2f1ea0d95..4e403cd994874c27453574283c5c573c876628db 100644
--- a/tensorflow/compiler/xla/service/llvm_ir/llvm_loop.h
+++ b/tensorflow/compiler/xla/service/llvm_ir/llvm_loop.h
@@ -34,6 +34,12 @@ limitations under the License.
 namespace xla {
 namespace llvm_ir {
 
+enum class UnrollMode {
+  kDefaultUnroll,
+  kFullyUnroll,
+  kNoUnroll,
+};
+
 // A class for constructing a for-loop in LLVM IR.
 class ForLoop {
  public:
@@ -69,12 +75,13 @@ class ForLoop {
   // LLVM IR. If non-empty, it is prepended to the name of the induction
   // variable value and each basic block created for the loop.
   //
-  // If `prevent_unrolling` is true then emit metadata that directs LLVM to not
-  // unroll the generated loop.
+  // `unroll_mode` specifies the desired LLVM unrolling behavior for generated
+  //  loop.
   static std::unique_ptr<ForLoop> EmitForLoop(
       tensorflow::StringPiece prefix, llvm::Value* start_index,
       llvm::Value* end_index, llvm::Value* step, llvm::IRBuilder<>* ir_builder,
-      bool prevent_unrolling = false, bool prevent_vectorization = false);
+      UnrollMode unroll_mode = llvm_ir::UnrollMode::kDefaultUnroll,
+      bool prevent_vectorization = false);
 
   // The names of the blocks follow LLVM's conventions. Control flow amongst the
   // blocks for the example C code looks like:
@@ -128,7 +135,7 @@ class ForLoop {
 
   ForLoop(tensorflow::StringPiece prefix, tensorflow::StringPiece suffix,
           llvm::Value* start_index, llvm::Value* end_index, llvm::Value* step,
-          bool prevent_unrolling, bool prevent_vectorization);
+          UnrollMode unroll_mode, bool prevent_vectorization);
 
   // Emit the loop at the insert point of the builder.
   void Emit(llvm::IRBuilder<>* ir_builder);
@@ -161,7 +168,7 @@ class ForLoop {
   llvm::BasicBlock* body_bb_;
   llvm::BasicBlock* exit_bb_;
   llvm::Value* indvar_;
-  bool prevent_unrolling_;
+  UnrollMode unroll_mode_;
   bool prevent_vectorization_;
 
   TF_DISALLOW_COPY_AND_ASSIGN(ForLoop);
@@ -182,34 +189,34 @@ class ForLoopNest {
 
   // Adds a loop to the nest. If no loop has been added yet then emit a loop at
   // the current insert point of the given builder. If one or more loops have
-  // been added then emit loop inside the body of the last added loop.  If
-  // prevent_unrolling is true, then metadata is emitting directing LLVM to not
-  // unroll this loop.
-  std::unique_ptr<ForLoop> AddLoop(tensorflow::StringPiece suffix,
-                                   llvm::Value* start_index,
-                                   llvm::Value* end_index, llvm::Value* stride,
-                                   bool prevent_unrolling = false,
-                                   bool prevent_vectorization = false);
+  // been added then emit loop inside the body of the last added loop.
+  // unroll_mode is used to emit metadata that controls LLVM unrolling.
+  std::unique_ptr<ForLoop> AddLoop(
+      tensorflow::StringPiece suffix, llvm::Value* start_index,
+      llvm::Value* end_index, llvm::Value* stride,
+      UnrollMode unroll_mode = xla::llvm_ir::UnrollMode::kDefaultUnroll,
+      bool prevent_vectorization = false);
 
   // Like the above, except that it defaults to a stride of one.
-  std::unique_ptr<ForLoop> AddLoop(tensorflow::StringPiece suffix,
-                                   llvm::Value* start_index,
-                                   llvm::Value* end_index,
-                                   bool prevent_unrolling = false,
-                                   bool prevent_vectorization = false);
+  std::unique_ptr<ForLoop> AddLoop(
+      tensorflow::StringPiece suffix, llvm::Value* start_index,
+      llvm::Value* end_index,
+      UnrollMode unroll_mode = xla::llvm_ir::UnrollMode::kDefaultUnroll,
+      bool prevent_vectorization = false);
 
   // A convenient wrapper of the other flavor of AddLoop. The given start and
   // end index are constant.
-  std::unique_ptr<ForLoop> AddLoop(int64 start_index, int64 end_index,
-                                   int64 stride, tensorflow::StringPiece suffix,
-                                   bool prevent_unrolling = false,
-                                   bool prevent_vectorization = false);
+  std::unique_ptr<ForLoop> AddLoop(
+      int64 start_index, int64 end_index, int64 stride,
+      tensorflow::StringPiece suffix,
+      UnrollMode unroll_mode = xla::llvm_ir::UnrollMode::kDefaultUnroll,
+      bool prevent_vectorization = false);
 
   // Like the above, except that it defaults to a stride of one.
-  std::unique_ptr<ForLoop> AddLoop(int64 start_index, int64 end_index,
-                                   tensorflow::StringPiece suffix,
-                                   bool prevent_unrolling = false,
-                                   bool prevent_vectorization = false);
+  std::unique_ptr<ForLoop> AddLoop(
+      int64 start_index, int64 end_index, tensorflow::StringPiece suffix,
+      UnrollMode unroll_mode = xla::llvm_ir::UnrollMode::kDefaultUnroll,
+      bool prevent_vectorization = false);
 
   // Add loops to iterate through the indices within the specified
   // shape. The returned index collects the induction variables of the
diff --git a/tensorflow/compiler/xla/service/llvm_ir/llvm_util.cc b/tensorflow/compiler/xla/service/llvm_ir/llvm_util.cc
index ec04239b4f9112134ba876fdfbb3905a3baf1f72..ff64da87e9c9acf8a9d7ff87d3b1be7a9e9106bb 100644
--- a/tensorflow/compiler/xla/service/llvm_ir/llvm_util.cc
+++ b/tensorflow/compiler/xla/service/llvm_ir/llvm_util.cc
@@ -87,18 +87,10 @@ llvm::Value* EmitCallToIntrinsic(
     tensorflow::gtl::ArraySlice<llvm::Value*> operands,
     tensorflow::gtl::ArraySlice<llvm::Type*> overloaded_types,
     llvm::IRBuilder<>* ir_builder) {
-  std::vector<llvm::Type*> types;
-  for (auto type : overloaded_types) {
-    types.push_back(type);
-  }
   llvm::Module* module = ModuleFromIRBuilder(ir_builder);
-  llvm::Function* intrinsic =
-      llvm::Intrinsic::getDeclaration(module, intrinsic_id, types);
-  std::vector<llvm::Value*> operands_vec;
-  for (auto operand : operands) {
-    operands_vec.push_back(operand);
-  }
-  return ir_builder->CreateCall(intrinsic, operands_vec);
+  llvm::Function* intrinsic = llvm::Intrinsic::getDeclaration(
+      module, intrinsic_id, AsArrayRef(overloaded_types));
+  return ir_builder->CreateCall(intrinsic, AsArrayRef(operands));
 }
 
 llvm::Value* EmitFloatMax(llvm::Value* lhs_value, llvm::Value* rhs_value,
@@ -368,15 +360,52 @@ llvm::Constant* LiteralToConstant(const Literal& literal, int64 dimension_index,
   return llvm::ConstantArray::get(aggregate_type, elements);
 }
 
+template <typename T>
+llvm::Constant* GetConstantDataArray(const Literal& literal,
+                                     llvm::Module* module) {
+  const T* data = static_cast<const T*>(literal.untyped_data());
+  int64 num_elements = literal.size_bytes() / sizeof(T);
+  return llvm::ConstantDataArray::get(module->getContext(),
+                                      llvm::makeArrayRef(data, num_elements));
+}
+
 }  // namespace
 
 llvm::Constant* ConvertLiteralToIrConstant(const Literal& literal,
                                            llvm::Module* module) {
-  std::vector<int64> multi_index(ShapeUtil::Rank(literal.shape()), 0);
-  llvm::Constant* value = LiteralToConstant(
-      literal, /*dimension_index=*/ShapeUtil::Rank(literal.shape()) - 1,
-      &multi_index, module);
-  return value;
+  const Shape& shape = literal.shape();
+  // TODO(b/29904935): We can get rid of this switch by exposing a
+  // ConstantDataArray factory method that takes a llvm::Type and a StringRef.
+  switch (shape.element_type()) {
+    case U64:
+      return GetConstantDataArray<uint64>(literal, module);
+    case U32:
+      return GetConstantDataArray<uint32>(literal, module);
+    case U8:
+      return GetConstantDataArray<uint8>(literal, module);
+    case S64:
+      return GetConstantDataArray<int64>(literal, module);
+    case S32:
+      return GetConstantDataArray<int32>(literal, module);
+    case F64:
+      return GetConstantDataArray<double>(literal, module);
+    case F32:
+      return GetConstantDataArray<float>(literal, module);
+    case BF16:
+    case F16:
+      return GetConstantDataArray<uint16>(literal, module);
+    case PRED:
+      return GetConstantDataArray<bool>(literal, module);
+    // TODO(b/29904935): Also use ConstantDataArray for complex numbers.
+    case C64: {
+      int64 dimensions = ShapeUtil::Rank(shape);
+      std::vector<int64> multi_index(dimensions, 0);
+      return LiteralToConstant(literal, /*dimension_index=*/dimensions - 1,
+                               &multi_index, module);
+    }
+    default:
+      LOG(FATAL) << "unsupported type " << shape.element_type();
+  }
 }
 
 llvm::AllocaInst* EmitAllocaAtFunctionEntry(llvm::Type* type,
diff --git a/tensorflow/compiler/xla/service/llvm_ir/loop_emitter.cc b/tensorflow/compiler/xla/service/llvm_ir/loop_emitter.cc
index 0728ccfff7b85e3751f33bc5272a5f22d4e5411a..dc2934a34c23f8229947210cacc9863d47c2ea55 100644
--- a/tensorflow/compiler/xla/service/llvm_ir/loop_emitter.cc
+++ b/tensorflow/compiler/xla/service/llvm_ir/loop_emitter.cc
@@ -83,7 +83,9 @@ LoopEmitter::LoopEmitter(const ElementGenerator& target_element_generator,
   // Sanity check: In multi-output fusion, all shapes produced must have the
   // same dimensions.
   for (const IrArray& array : target_arrays) {
-    CHECK(ShapeUtil::SameDimensions(shape_, array.GetShape()));
+    CHECK(ShapeUtil::SameDimensions(shape_, array.GetShape()))
+        << ": '" << shape_.ShortDebugString() << "' does not match '"
+        << array.GetShape().ShortDebugString() << "'";
   }
 }
 
diff --git a/tensorflow/compiler/xla/service/llvm_ir/tuple_ops.cc b/tensorflow/compiler/xla/service/llvm_ir/tuple_ops.cc
index 3a21eda35757aa706565ee4a5286eee1acea117b..5fc08aab916e377b245b6221108956c06da70767 100644
--- a/tensorflow/compiler/xla/service/llvm_ir/tuple_ops.cc
+++ b/tensorflow/compiler/xla/service/llvm_ir/tuple_ops.cc
@@ -24,15 +24,14 @@ limitations under the License.
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/types.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
-#include "tensorflow/core/lib/strings/stringprintf.h"
 #include "tensorflow/core/platform/logging.h"
 
 namespace xla {
 namespace llvm_ir {
 
-void EmitTupleSelect(IrArray select, IrArray pred, llvm::Value* on_true,
-                     llvm::Value* on_false, llvm::IRBuilder<>* ir_builder,
-                     llvm::Module* module) {
+void EmitTupleSelect(const IrArray& select, const IrArray& pred,
+                     llvm::Value* on_true, llvm::Value* on_false,
+                     llvm::IRBuilder<>* ir_builder, llvm::Module* module) {
   CHECK(ShapeUtil::IsScalar(pred.GetShape()));
 
   llvm::LoadInst* pred_value =
@@ -47,30 +46,27 @@ void EmitTupleSelect(IrArray select, IrArray pred, llvm::Value* on_true,
   VLOG(2) << "  pred_cond: " << DumpToString(*pred_cond);
 
   for (int i = 0; i < ShapeUtil::TupleElementCount(select.GetShape()); ++i) {
-    std::vector<llvm::Value*> element_index = {ir_builder->getInt64(0),
-                                               ir_builder->getInt64(i)};
+    llvm::Value* const element_index[] = {ir_builder->getInt64(0),
+                                          ir_builder->getInt64(i)};
     llvm::Value* on_true_element_address =
         ir_builder->CreateInBoundsGEP(on_true, element_index);
     llvm::Value* on_true_element = ir_builder->CreateLoad(
-        on_true_element_address,
-        tensorflow::strings::Printf("on_true_element_%d", i).c_str());
+        on_true_element_address, "on_true_element_" + llvm::Twine(i));
     llvm::Value* on_false_element_address =
         ir_builder->CreateInBoundsGEP(on_false, element_index);
     llvm::Value* on_false_element = ir_builder->CreateLoad(
-        on_false_element_address,
-        tensorflow::strings::Printf("on_false_element_%d", i).c_str());
+        on_false_element_address, "on_false_element_" + llvm::Twine(i));
 
     llvm::Value* output_element_address =
         ir_builder->CreateInBoundsGEP(select.GetBasePointer(), element_index);
     ir_builder->CreateStore(
-        ir_builder->CreateSelect(
-            pred_cond, on_true_element, on_false_element,
-            tensorflow::strings::Printf("select_output_element_%d", i).c_str()),
+        ir_builder->CreateSelect(pred_cond, on_true_element, on_false_element,
+                                 "select_output_element_" + llvm::Twine(i)),
         output_element_address);
   }
 }
 
-void EmitTuple(IrArray tuple,
+void EmitTuple(const IrArray& tuple,
                tensorflow::gtl::ArraySlice<llvm::Value*> operands,
                llvm::IRBuilder<>* ir_builder, llvm::Module* module) {
   for (size_t i = 0; i < operands.size(); ++i) {
diff --git a/tensorflow/compiler/xla/service/llvm_ir/tuple_ops.h b/tensorflow/compiler/xla/service/llvm_ir/tuple_ops.h
index dbf9a140068b60505f6798360438f709bfd3feba..352d34ebf839c6c2465abade7c3d3eb3b7a34506 100644
--- a/tensorflow/compiler/xla/service/llvm_ir/tuple_ops.h
+++ b/tensorflow/compiler/xla/service/llvm_ir/tuple_ops.h
@@ -59,13 +59,13 @@ namespace llvm_ir {
 // of the address from the corresponding element in either
 // tuple_on_true or tuple_on_false:
 //   output[i] = pred ? tuple_on_true[i] : tuple_on_false[i]
-void EmitTupleSelect(IrArray select, IrArray pred, llvm::Value* on_true,
-                     llvm::Value* on_false, llvm::IRBuilder<>* ir_builder,
-                     llvm::Module* module);
+void EmitTupleSelect(const IrArray& select, const IrArray& pred,
+                     llvm::Value* on_true, llvm::Value* on_false,
+                     llvm::IRBuilder<>* ir_builder, llvm::Module* module);
 
 // A tuple is an array of pointers, one for each operand. Each pointer points to
 // the output buffer of its corresponding operand.
-void EmitTuple(IrArray tuple,
+void EmitTuple(const IrArray& tuple,
                tensorflow::gtl::ArraySlice<llvm::Value*> operands,
                llvm::IRBuilder<>* ir_builder, llvm::Module* module);
 
diff --git a/tensorflow/compiler/xla/service/local_service.cc b/tensorflow/compiler/xla/service/local_service.cc
index 0fa4061738612df76c72a18a9353f16bf6a42677..296d04d4362b12fdc39798a016ca9e8795e02586 100644
--- a/tensorflow/compiler/xla/service/local_service.cc
+++ b/tensorflow/compiler/xla/service/local_service.cc
@@ -24,15 +24,12 @@ limitations under the License.
 #include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/service/backend.h"
 #include "tensorflow/compiler/xla/service/computation_layout.h"
-#include "tensorflow/compiler/xla/service/computation_tracker.h"
 #include "tensorflow/compiler/xla/service/executable.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_execution_profile.h"
 #include "tensorflow/compiler/xla/service/hlo_module.h"
 #include "tensorflow/compiler/xla/service/hlo_module_config.h"
 #include "tensorflow/compiler/xla/service/platform_util.h"
-#include "tensorflow/compiler/xla/service/user_computation.h"
-#include "tensorflow/compiler/xla/service/versioned_computation_handle.h"
 #include "tensorflow/compiler/xla/shape_layout.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/status_macros.h"
@@ -110,6 +107,11 @@ ExecutionOptions CreateExecutionOptions(
         ->set_xla_dump_optimized_hlo_proto_to(
             build_options.dump_optimized_hlo_proto_to().value());
   }
+  if (build_options.dump_unoptimized_hlo_proto_to().has_value()) {
+    execution_options.mutable_debug_options()
+        ->set_xla_dump_unoptimized_hlo_proto_to(
+            build_options.dump_unoptimized_hlo_proto_to().value());
+  }
   if (build_options.dump_per_pass_hlo_proto_to().has_value()) {
     execution_options.mutable_debug_options()
         ->set_xla_dump_per_pass_hlo_proto_to(
@@ -124,75 +126,17 @@ ExecutionOptions CreateExecutionOptions(
     LayoutUtil::SetToDefaultLayout(
         execution_options.mutable_shape_with_output_layout());
   }
-  return execution_options;
-}
-
-}  // namespace
-
-StatusOr<std::unique_ptr<Executable>> LocalService::CompileExecutable(
-    const ComputationHandle& computation,
-    const tensorflow::gtl::ArraySlice<const Shape*> argument_layouts,
-    const ExecutableBuildOptions& build_options) {
-  TF_ASSIGN_OR_RETURN(UserComputation * user_computation,
-                      computation_tracker_.Resolve(computation));
-  VersionedComputationHandle versioned_handle =
-      user_computation->GetVersionedHandle();
 
-  TF_ASSIGN_OR_RETURN(
-      std::shared_ptr<const ProgramShape> program_shape,
-      user_computation->ComputeProgramShape(versioned_handle.version));
-
-  // Validate incoming layouts.
-  if (argument_layouts.size() != program_shape->parameters_size()) {
-    return InvalidArgument(
-        "Invalid number of arguments for computation: expected %d, got %zu.",
-        program_shape->parameters_size(), argument_layouts.size());
+  for (const std::string& disabled_pass : build_options.disabled_hlo_passes()) {
+    execution_options.mutable_debug_options()->add_xla_disable_hlo_passes(
+        disabled_pass);
   }
-  for (int i = 0; i < argument_layouts.size(); ++i) {
-    const Shape& argument_shape = *argument_layouts[i];
-    TF_RETURN_IF_ERROR(ShapeUtil::ValidateShape(argument_shape));
-    if (!ShapeUtil::Compatible(argument_shape, program_shape->parameters(i))) {
-      tensorflow::gtl::optional<const OpMetadata*> metadata =
-          user_computation->ParameterMetadata(i);
-      auto metadata_string = [&metadata]() -> string {
-        if (!metadata.has_value()) {
-          return "";
-        }
-        CHECK(metadata.value() != nullptr);
-        const OpMetadata& m = *metadata.value();
-        if (!m.source_file().empty()) {
-          return tensorflow::strings::Printf(
-              " (%s:%d)", m.source_file().c_str(), m.source_line());
-        }
-        return "";
-      };
-      return InvalidArgument(
-          "Invalid argument shape for argument %d%s, expected %s, got %s.", i,
-          metadata_string().c_str(),
-          ShapeUtil::HumanString(program_shape->parameters(i)).c_str(),
-          ShapeUtil::HumanString(argument_shape).c_str());
-    }
-  }
-  if (build_options.result_layout() != nullptr) {
-    TF_RETURN_IF_ERROR(ValidateResultShapeWithLayout(
-        *build_options.result_layout(), program_shape->result()));
-  }
-
-  ExecutionOptions execution_options =
-      CreateExecutionOptions(build_options, program_shape.get());
-  TF_ASSIGN_OR_RETURN(std::unique_ptr<HloModuleConfig> module_config,
-                      CreateModuleConfig(*program_shape, argument_layouts,
-                                         &execution_options, user_computation));
 
-  TF_ASSIGN_OR_RETURN(
-      se::StreamExecutor * executor,
-      execute_backend_->stream_executor(build_options.device_ordinal()));
-
-  return BuildExecutable(versioned_handle, std::move(module_config),
-                         execute_backend_.get(), executor,
-                         build_options.device_allocator());
+  return execution_options;
 }
 
+}  // namespace
+
 StatusOr<std::unique_ptr<Executable>> LocalService::CompileExecutable(
     const XlaComputation& computation,
     const tensorflow::gtl::ArraySlice<const Shape*> argument_layouts,
@@ -260,4 +204,15 @@ StatusOr<int> LocalService::ReplicaNumberToDeviceOrdinal(int replica_number) {
       /*computation_count=*/1);
 }
 
+StatusOr<const ShapedBuffer*> LocalService::GlobalDataToShapedBuffer(
+    const GlobalDataHandle& data, int replica_number) {
+  TF_ASSIGN_OR_RETURN(auto buffers, allocation_tracker_.Resolve(data));
+  if (replica_number >= buffers.size()) {
+    return InvalidArgument(
+        "replica_number %d out of range; must be less than num_replicas = %zu.",
+        replica_number, buffers.size());
+  }
+  return buffers[replica_number];
+}
+
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/local_service.h b/tensorflow/compiler/xla/service/local_service.h
index 06567cabd6eb28aae53881613cd6beb78e25e222..39d6734c3fc06df6832cf67edddbc7c14c815cd1 100644
--- a/tensorflow/compiler/xla/service/local_service.h
+++ b/tensorflow/compiler/xla/service/local_service.h
@@ -41,23 +41,11 @@ class LocalService : public Service {
   static StatusOr<std::unique_ptr<LocalService>> NewService(
       const ServiceOptions& options);
 
-  // Builds an Executable with the given argument layouts and options. If
-  // result_layout is non-null, then the executable is compiled to produce a
-  // result of the given layout.  If device_allocator is non-null, then the
-  // compiler may use it to allocate temp space on the device.  The compiler is
-  // responsible for freeing any memory it allocates this way.
-  StatusOr<std::unique_ptr<Executable>> CompileExecutable(
-      const ComputationHandle& computation,
-      const tensorflow::gtl::ArraySlice<const Shape*> argument_layouts,
-      const ExecutableBuildOptions& options);
-
   // Builds an Executable with the given XlaComputation, argument layouts and
   // options. If result_layout is non-null, then the executable is compiled to
   // produce a result of the given layout.  If device_allocator is non-null,
   // then the compiler may use it to allocate temp space on the device.  The
   // compiler is responsible for freeing any memory it allocates this way.
-  //
-  // TODO(b/74197823): This is a part of a NOT YET ready refactor.
   StatusOr<std::unique_ptr<Executable>> CompileExecutable(
       const XlaComputation& computation,
       const tensorflow::gtl::ArraySlice<const Shape*> argument_layouts,
@@ -70,6 +58,11 @@ class LocalService : public Service {
   // the "easy" case where a single replica is a single device.
   StatusOr<int> ReplicaNumberToDeviceOrdinal(int replica_number);
 
+  // Converts a GlobalDataHandle into a pointer to a ShapedBuffer that's valid
+  // as long as the handle is valid.
+  StatusOr<const ShapedBuffer*> GlobalDataToShapedBuffer(
+      const GlobalDataHandle& data, int replica_number);
+
  private:
   explicit LocalService(const ServiceOptions& options,
                         std::unique_ptr<Backend> backend);
diff --git a/tensorflow/compiler/xla/service/logical_buffer_analysis.cc b/tensorflow/compiler/xla/service/logical_buffer_analysis.cc
index 6aca6ba38572c5311797fbb91acbbcd6610a3410..f410921b4b5337192bdeae5924631d9c06b7d5a5 100644
--- a/tensorflow/compiler/xla/service/logical_buffer_analysis.cc
+++ b/tensorflow/compiler/xla/service/logical_buffer_analysis.cc
@@ -125,6 +125,12 @@ Status LogicalBufferAnalysis::HandleBitcast(HloInstruction*) {
   return Status::OK();
 }
 
+Status LogicalBufferAnalysis::HandleDomain(HloInstruction*) {
+  // A kDomain instruction aliases its operand. That is, the buffer of its
+  // result *is* the buffer of its operand.
+  return Status::OK();
+}
+
 Status LogicalBufferAnalysis::HandleRecvDone(HloInstruction*) {
   // RecvDone doesn't create a new buffer but rather aliases its input (Recv)
   // tuple element at {0} to its output.
diff --git a/tensorflow/compiler/xla/service/logical_buffer_analysis.h b/tensorflow/compiler/xla/service/logical_buffer_analysis.h
index f4c63dd86b4d8a6f598d46047012e4e5bc7b3d7e..b5ef3967875a58b35631d5f69c210f5cbcd91250 100644
--- a/tensorflow/compiler/xla/service/logical_buffer_analysis.h
+++ b/tensorflow/compiler/xla/service/logical_buffer_analysis.h
@@ -59,6 +59,7 @@ class LogicalBufferAnalysis : public DfsHloVisitorWithDefault {
   Status HandleTuple(HloInstruction* tuple) override;
   Status HandleGetTupleElement(HloInstruction* get_tuple_element) override;
   Status HandleBitcast(HloInstruction* bitcast) override;
+  Status HandleDomain(HloInstruction* domain) override;
   Status HandleCopy(HloInstruction* copy) override;
   Status HandleRecvDone(HloInstruction* recv_done) override;
   Status HandleSend(HloInstruction* send) override;
diff --git a/tensorflow/compiler/xla/service/multi_output_fusion.cc b/tensorflow/compiler/xla/service/multi_output_fusion.cc
new file mode 100644
index 0000000000000000000000000000000000000000..29f787b86b9cbb6f80d048b46b78bdad8074f488
--- /dev/null
+++ b/tensorflow/compiler/xla/service/multi_output_fusion.cc
@@ -0,0 +1,342 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/service/multi_output_fusion.h"
+
+#include "tensorflow/compiler/xla/service/hlo_instruction.h"
+#include "tensorflow/compiler/xla/service/hlo_opcode.h"
+#include "tensorflow/compiler/xla/shape_util.h"
+#include "tensorflow/core/lib/gtl/flatmap.h"
+#include "tensorflow/core/platform/types.h"
+
+namespace xla {
+
+StatusOr<bool> MultiOutputFusion::Run(HloModule* module) {
+  bool changed = false;
+
+  for (auto* computation : module->MakeNonfusionComputations()) {
+    computation_ = computation;
+    reachability_ = computation_->ComputeReachability();
+    candidates_.clear();
+    candidates_index_.clear();
+    all_fusion_candidates_.clear();
+
+    int64 index = 0;
+    for (auto it : computation_->MakeInstructionPostOrder()) {
+      candidates_.emplace_back(it);
+      InsertOrDie(&candidates_index_, it, index++);
+    }
+
+    // Create the initial candidate list for each Node.
+    for (auto& node : candidates_) {
+      HloInstruction* instruction = node.hlo;
+      int64 instruction_id = get_candidate_id(instruction);
+      FusionCandidate& instr_node = candidates_[instruction_id];
+      if (!IsFusible(instruction)) {
+        continue;
+      }
+      all_fusion_candidates_.push_back(instruction);
+
+      std::vector<HloInstruction*> candidates;
+      tensorflow::gtl::FlatSet<HloInstruction*> candidates_set;
+      VLOG(10) << "Looking at instruction: " << instruction->name();
+      for (auto operand : instruction->operands()) {
+        // Filter out the non-interesting instructions -- they
+        // will not generate the savings.
+        if (!IsProfitableOperand(operand)) {
+          VLOG(10) << "Operand not profitable: " << operand->name();
+          continue;
+        }
+        VLOG(10) << "Operand profitable: " << operand->name();
+        for (auto user : operand->users()) {
+          VLOG(10) << "User: " << user->name();
+          if (user == instruction || !IsFusible(user)) {
+            VLOG(10) << "User is not fusible, or is the instruction itself: "
+                     << user->name();
+            continue;
+          }
+          int64 user_id = get_candidate_id(user);
+          if (is_connected(instruction, user)) {
+            VLOG(10) << "User is connected: " << user->name();
+            continue;
+          }
+          if (instruction_id < user_id &&
+              user->opcode() == HloOpcode::kFusion) {
+            VLOG(10) << "User ID for user: " << user->name() << " is "
+                     << user_id << " which is higher than " << instruction_id;
+            continue;
+          }
+          if (!LegalToFuse(instruction, user)) {
+            VLOG(10) << "User not legal to fuse: " << user->name();
+            continue;
+          }
+          if (candidates_set.insert(user).second) {
+            VLOG(10) << "User added to candidate list: " << user->name();
+            candidates.push_back(user);
+          }
+        }
+      }
+
+      // Iterate over candidates rather than candidates_set to avoid
+      // nondeterminism.
+      for (auto candidate : candidates) {
+        int64 profit = GetProfit(instruction, candidate);
+        if (profit > 0) {
+          FusionCandidate& candidate_node =
+              candidates_[get_candidate_id(candidate)];
+          instr_node.fusibles.emplace_back(candidate, profit);
+          candidate_node.fusibles.emplace_back(instruction, profit);
+          worklist_.emplace(instruction, candidate, profit);
+        }
+      }
+    }
+    if (Perform()) {
+      changed = true;
+    }
+  }
+  return changed;
+}
+
+HloInstruction* MultiOutputFusion::Fuse(HloInstruction* instr1,
+                                        HloInstruction* instr2) {
+  HloInstruction* remaining = instr1;
+  HloInstruction* fused = instr2;
+  // Make sure that if only one of the instructions is a fusion, or if only one
+  // of the instructions is a multi-output fusion, it's what will be fused into.
+  //
+  // An invariant is that no bitcast nodes will show up in the middle of a
+  // fusion node. This invariant must hold in order for us to lower it. Given
+  // that, we require that during multi-output fusion, a fusion node ending with
+  // bitcast to preserve its structure as a nested fusion instead being
+  // merged and flattened.
+  if (fused->opcode() == HloOpcode::kFusion &&
+      fused->fused_expression_root()->opcode() != HloOpcode::kBitcast) {
+    std::swap(remaining, fused);
+  }
+  if (fused->IsMultiOutputFusion()) {
+    std::swap(remaining, fused);
+  }
+
+  if (fused->opcode() == HloOpcode::kFusion &&
+      fused->fused_expression_root()->opcode() != HloOpcode::kBitcast) {
+    remaining->MergeFusionInstructionIntoMultiOutput(fused);
+  } else {
+    if (remaining->opcode() == HloOpcode::kFusion &&
+        remaining->fused_expression_root()->opcode() == HloOpcode::kBitcast) {
+      auto parent_computation = remaining->parent();
+      // Create a nested fusion node.
+      auto remaining_nested_fused =
+          parent_computation->AddInstruction(HloInstruction::CreateFusion(
+              remaining->shape(), HloInstruction::FusionKind::kLoop,
+              remaining));
+      TF_CHECK_OK(parent_computation->ReplaceInstruction(
+          remaining, remaining_nested_fused));
+      remaining = remaining_nested_fused;
+    }
+    remaining->FuseInstructionIntoMultiOutput(fused);
+  }
+
+  return remaining;
+}
+
+void MultiOutputFusion::Update(HloInstruction* instr1, HloInstruction* instr2) {
+  HloInstruction* fusion = instr1;
+  HloInstruction* fused = instr2;
+  if (is_fused(instr1)) {
+    fusion = instr2;
+    fused = instr1;
+  }
+
+  // Insert the newly created instruction (if any), to candidates_.
+  for (auto use : fusion->users()) {
+    if (candidates_index_.find(use) == candidates_index_.end()) {
+      int64 index = candidates_.size();
+      candidates_.emplace_back(use);
+      InsertOrDie(&candidates_index_, use, index++);
+    }
+  }
+  FusionCandidate& fusion_node = candidates_[get_candidate_id(fusion)];
+  FusionCandidate& fused_node = candidates_[get_candidate_id(fused)];
+
+  // Update the reachability graph.
+  UpdateReachability(fusion, fused, all_fusion_candidates_,
+                     [this](HloInstruction* instr) { return is_fused(instr); });
+
+  // Update the fusible list for fusion. Variable new_fusibles keeps
+  // track of the new or changed entries.
+  std::vector<std::pair<HloInstruction*, int64>> new_fusibles;
+  tensorflow::gtl::FlatSet<HloInstruction*> in_list;
+  auto it = fusion_node.fusibles.begin();
+  while (it != fusion_node.fusibles.end()) {
+    HloInstruction* instr = it->first;
+    if (is_fused(instr) || is_connected(fusion, instr)) {
+      it = fusion_node.fusibles.erase(it);
+      continue;
+    }
+    in_list.insert(instr);
+    int64 profit = GetProfit(instr, fusion);
+    if (profit > it->second) {
+      it->second = profit;
+      new_fusibles.emplace_back(instr, profit);
+    }
+    ++it;
+  }
+
+  // Fused_node has been fused into fusion_node. Take the fusion candidates
+  // (fusibles) from fused_nodes and add them to the fusion_node's. Filter
+  // out those fusibles that no longer valid (or already in the list).
+  for (const auto& it : fused_node.fusibles) {
+    HloInstruction* instr = it.first;
+    if (instr == fusion || is_fused(instr) || is_connected(fusion, instr)) {
+      continue;
+    }
+    if (in_list.count(instr) > 0) {
+      continue;
+    }
+    int64 profit = GetProfit(instr, fusion);
+    fusion_node.fusibles.emplace_back(instr, profit);
+    new_fusibles.emplace_back(instr, profit);
+  }
+  fused_node.fusibles.clear();
+
+  // Update the worklist_.
+  for (auto it : new_fusibles) {
+    worklist_.emplace(fusion, it.first, it.second);
+  }
+}
+
+bool MultiOutputFusion::LegalToFuse(HloInstruction* instr1,
+                                    HloInstruction* instr2) {
+  if (instr1 == instr2) {
+    return false;
+  }
+  if (instr1->opcode() != HloOpcode::kFusion) {
+    return false;
+  }
+
+  // Fusing nodes with 0 user makes no sense and the rest of the implementation
+  // doesn't support it either.
+  if (instr1->user_count() == 0 || instr2->user_count() == 0) {
+    return false;
+  }
+
+  // Check if the users of multioutput fusion is not a get-tuple-element.
+  // If this is the case, we bail out because the transformation assumes
+  // the users are get-tuple-element.
+  auto multioutput_user_is_not_gte = [](HloInstruction* instr) {
+    if (!instr->IsMultiOutputFusion()) {
+      return false;
+    }
+    for (auto user : instr->users()) {
+      if (user->opcode() != HloOpcode::kGetTupleElement) {
+        return true;
+      }
+    }
+    return false;
+  };
+  if (multioutput_user_is_not_gte(instr1) ||
+      multioutput_user_is_not_gte(instr2)) {
+    return false;
+  }
+
+  if (is_connected(instr1, instr2)) {
+    return false;
+  }
+  if (!ShapesCompatibleForFusion(instr1, instr2)) {
+    return false;
+  }
+
+  return true;
+}
+
+void MultiOutputFusion::UpdateReachability(
+    HloInstruction* instr1, HloInstruction* instr2,
+    tensorflow::gtl::ArraySlice<HloInstruction*> instrs_to_update,
+    const std::function<bool(HloInstruction*)>& skip) {
+  for (auto instr : instrs_to_update) {
+    if (skip != nullptr && skip(instr)) {
+      continue;
+    }
+    if (reachability_->IsReachable(instr2, instr) &&
+        reachability_->IsReachable(instr1, instr)) {
+      // If a candidate was already reachable by both, no update needed.
+      continue;
+    }
+    if (reachability_->IsReachable(instr2, instr)) {
+      reachability_->FastSetReachabilityToUnion({instr, instr1}, instr);
+    }
+    if (reachability_->IsReachable(instr1, instr)) {
+      reachability_->FastSetReachabilityToUnion({instr, instr2}, instr);
+    }
+  }
+}
+
+bool MultiOutputFusion::Perform() {
+  int changed = false;
+  // Pick the top candidate from queue and try to merge.
+  while (!worklist_.empty()) {
+    if (fuel_ <= 0) {
+      VLOG(2) << "No fusing: run out of fuel.";
+      break;
+    }
+    ToBeFused candidate = worklist_.top();
+    worklist_.pop();
+
+    HloInstruction* instr1 = candidate.instr1;
+    HloInstruction* instr2 = candidate.instr2;
+
+    if (is_fused(instr1) || is_fused(instr2)) {
+      continue;
+    }
+
+    VLOG(1) << "Considering candidate profit_score=" << candidate.score
+            << "\n\t\tinstr1 = " << instr1->ToString()
+            << "\n\t\tinstr2 = " << instr2->ToString();
+
+    if (LegalToFuse(instr1, instr2)) {
+      VLOG(1) << "Fuse!";
+      VLOG(2) << "Before multi_output_fusion:";
+      VLOG(2) << "instr1: " << instr1->ToString();
+      VLOG(2) << "\n"
+              << instr1->fused_instructions_computation()->ToString(
+                     HloPrintOptions().set_indent_amount(1));
+      VLOG(2) << "instr2: " << instr2->ToString();
+      if (instr2->opcode() == HloOpcode::kFusion) {
+        VLOG(2) << "\n"
+                << instr2->fused_instructions_computation()->ToString(
+                       HloPrintOptions().set_indent_amount(1));
+      }
+      HloInstruction* ret = Fuse(instr1, instr2);
+      set_is_fused(ret == instr1 ? instr2 : instr1);
+      Update(instr1, instr2);
+      changed = true;
+      VLOG(2) << "After fusion, \t this: " << ret->name() << "\n"
+              << ret->fused_instructions_computation()->ToString(
+                     HloPrintOptions().set_indent_amount(1));
+      auto users = ret->users();
+      --fuel_;
+    }
+  }
+  if (DoProducerConsumerMultiOutputFusion(computation_)) {
+    changed = true;
+  }
+  return changed;
+}
+
+bool MultiOutputFusion::DoProducerConsumerMultiOutputFusion(
+    HloComputation* /*computation*/) {
+  return false;
+}
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/multi_output_fusion.h b/tensorflow/compiler/xla/service/multi_output_fusion.h
new file mode 100644
index 0000000000000000000000000000000000000000..cfdf83cfe856a7c3b05f51129446cd4e1055a8d6
--- /dev/null
+++ b/tensorflow/compiler/xla/service/multi_output_fusion.h
@@ -0,0 +1,160 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_MULTI_OUTPUT_FUSION_H_
+#define TENSORFLOW_COMPILER_XLA_SERVICE_MULTI_OUTPUT_FUSION_H_
+
+#include <queue>
+#include <vector>
+
+#include "tensorflow/compiler/xla/service/hlo_module.h"
+#include "tensorflow/compiler/xla/service/hlo_pass_interface.h"
+#include "tensorflow/compiler/xla/statusor.h"
+#include "tensorflow/core/lib/core/stringpiece.h"
+
+namespace xla {
+
+// This class implements the fusing of sibling fusion instructions that sharing
+// common operands.
+// It constructs the following associated data structures.
+//  (1) candidates_: stores the instruction and the set of instructions it can
+//      fuse to.
+//  (2) candidates_index_: maps instruction to id.
+//  (3) reachability_: reachability map in this computation.
+//  (4) all_fusion_candidates_: the vector of candidate instructions.
+//  (5) worklist_: a priority queue that contains pairs of instructions to be
+//      fused and their fusion profit scores.
+//
+//  Function Perform() applies the optimization. It picks up the most profitable
+//  pair in the worklist_, check if it's legal to fuse and fuse the pair.
+//  After fusion, it updates the associated structure such as reachability_,
+//  candidates_ and worklist_.
+//  Note that the reachability map is updated based on the original computation.
+//  This works because the reachability is monotonically increasing with
+//  instruction fusion.
+class MultiOutputFusion : public HloPassInterface {
+ public:
+  MultiOutputFusion(int64 fuel) : fuel_(fuel) {}
+
+  tensorflow::StringPiece name() const override {
+    return "multi_output_fusion";
+  }
+
+  // Run multi-output fusion on the given module. Returns whether the module
+  // was changed.
+  StatusOr<bool> Run(HloModule* module) override;
+
+ protected:
+  // Main entry for the optimization. Returns true if the optimization happens.
+  bool Perform();
+
+  // Test if instr1 and instr2 have the compatible shapes that can be legally
+  // fused.
+  virtual bool ShapesCompatibleForFusion(HloInstruction* instr1,
+                                         HloInstruction* instr2) = 0;
+
+  // Whether the instruction is a candidate for fusion.
+  virtual bool IsFusible(HloInstruction* instr) = 0;
+
+  // This function estimates the savings by merging instr1 and instr2 into one
+  // multi-output fusion instruction.
+  virtual int64 GetProfit(HloInstruction* instr1, HloInstruction* instr2) = 0;
+
+  // Whether fusing the instruction can reduce cost.
+  virtual bool IsProfitableOperand(HloInstruction* instr) = 0;
+
+  // Test if it's legal to fuse instr1 and instr2 into one fusion instruction.
+  virtual bool LegalToFuse(HloInstruction* instr1, HloInstruction* instr2);
+
+  // Update the reachability map after fusing instr1 and instr2.
+  void UpdateReachability(
+      HloInstruction* instr1, HloInstruction* instr2,
+      tensorflow::gtl::ArraySlice<HloInstruction*> instrs_to_update,
+      const std::function<bool(HloInstruction*)>& skip = nullptr);
+
+  // Hook for multi-output fusion along producer-consumer edges.
+  // Returns whether any instructions were fused.
+  //
+  // TODO(b/80420762): Perform producer-consumer multi-output fusion in
+  // InstructionFusion instead.
+  virtual bool DoProducerConsumerMultiOutputFusion(HloComputation* computation);
+
+ private:
+  // Fuse HloInstrctuion instr1 and instr2 and return the fused instruction.
+  // The other instruction is removed from its parent computation.
+  HloInstruction* Fuse(HloInstruction* instr1, HloInstruction* instr2);
+
+  // Update the internal data structures after instr1 and instr2 are fused into
+  // one fusion instruction.
+  void Update(HloInstruction* instr1, HloInstruction* instr2);
+
+  // Optimization fuel is a compiler debugging technique that makes an
+  // optimization pass stop what it is doing after having made N changes to the
+  // program, where N is the fuel. By varying N, this can be used to find the
+  // first single change that makes a test fail.
+  int64 fuel_;
+
+  // Computation for the pass.
+  HloComputation* computation_;
+
+  // An internal data structure for each instruction in current computation.
+  // When an instruction is removed, member 'hlo' is set to nullptr.
+  struct FusionCandidate {
+    HloInstruction* hlo;
+    std::list<std::pair<HloInstruction*, int64>> fusibles;
+    explicit FusionCandidate(HloInstruction* hlo) : hlo(hlo) {}
+  };
+  std::vector<FusionCandidate> candidates_;
+
+  // A map that maps an instruction to the index_.
+  tensorflow::gtl::FlatMap<HloInstruction*, int> candidates_index_;
+
+  // The reachability map of current computation.
+  std::unique_ptr<HloReachabilityMap> reachability_;
+
+  // This stores all the candidate instructions in current computation.
+  std::vector<HloInstruction*> all_fusion_candidates_;
+
+  // The pair of candidates to be fused and the profit score.
+  struct ToBeFused {
+    HloInstruction* instr1;
+    HloInstruction* instr2;
+    int64 score;
+    ToBeFused(HloInstruction* instr1, HloInstruction* instr2, int64 score)
+        : instr1(instr1), instr2(instr2), score(score) {}
+    bool operator<(const ToBeFused& rhs) const { return score < rhs.score; }
+  };
+  std::priority_queue<ToBeFused> worklist_;
+
+  int64 get_candidate_id(HloInstruction* instr) {
+    return FindOrDie(candidates_index_, instr);
+  }
+
+  bool is_fused(HloInstruction* instr) {
+    return candidates_[get_candidate_id(instr)].hlo == nullptr;
+  }
+
+  void set_is_fused(HloInstruction* instr) {
+    candidates_[get_candidate_id(instr)].hlo = nullptr;
+  }
+
+  bool is_connected(HloInstruction* instr1, HloInstruction* instr2) {
+    return reachability_->IsConnected(instr1, instr2);
+  }
+};
+
+}  // namespace xla
+
+#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_MULTI_OUTPUT_FUSION_H_
diff --git a/tensorflow/compiler/xla/service/pattern_matcher.h b/tensorflow/compiler/xla/service/pattern_matcher.h
index d3bc47e61e0e75fa2ef181988700f88cec9c1d76..2515222cf2db3d9699c85c13f4fe72b3488fa217 100644
--- a/tensorflow/compiler/xla/service/pattern_matcher.h
+++ b/tensorflow/compiler/xla/service/pattern_matcher.h
@@ -204,7 +204,7 @@ class LayoutPattern {
   // Modifies the pattern to match only if the layout equals the given proto.
   // The layout must outlive the returned pattern.
   constexpr LayoutPattern<LayoutType, LayoutPatternEqualImpl<Impl>> EqualTo(
-      const Layout* layout) const {
+      const ::xla::Layout* layout) const {
     return LayoutPattern<LayoutType, LayoutPatternEqualImpl<Impl>>(
         LayoutPatternEqualImpl<Impl>(impl_, layout), matched_layout_);
   }
diff --git a/tensorflow/compiler/xla/service/pattern_matcher_test.cc b/tensorflow/compiler/xla/service/pattern_matcher_test.cc
index 204e8c99209fa95adb868a676bb9e5144fed432c..fef3c132b0f3467a01b02f2be88b419459179277 100644
--- a/tensorflow/compiler/xla/service/pattern_matcher_test.cc
+++ b/tensorflow/compiler/xla/service/pattern_matcher_test.cc
@@ -16,7 +16,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/pattern_matcher.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
 #include "tensorflow/compiler/xla/service/hlo_opcode.h"
-#include "tensorflow/compiler/xla/tools/parser/hlo_parser.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/core/platform/test.h"
 
 namespace xla {
@@ -29,7 +29,7 @@ TEST(PatternMatcherTest, AddOp) {
       ROOT %two_plus_two = f32[] add(f32[] %two, f32[] %two)
     }
   )";
-  TF_ASSERT_OK_AND_ASSIGN(auto hlo_module, tools::Parse(kModuleStr));
+  TF_ASSERT_OK_AND_ASSIGN(auto hlo_module, ParseHloString(kModuleStr));
 
   const HloInstruction* matched_inst;
   HloInstruction* matched_operand;
@@ -182,7 +182,7 @@ TEST(PatternMatcherTest, FusionKind) {
       p0 = f32[] parameter(0)
       ROOT fusion = f32[] fusion(p0), kind=kLoop, calls=fused_computation
     })";
-  TF_ASSERT_OK_AND_ASSIGN(auto hlo_module, tools::Parse(kModuleStr));
+  TF_ASSERT_OK_AND_ASSIGN(auto hlo_module, ParseHloString(kModuleStr));
 
   auto* root = hlo_module->entry_computation()->root_instruction();
   EXPECT_TRUE(Match(
diff --git a/tensorflow/compiler/xla/service/reshape_mover.cc b/tensorflow/compiler/xla/service/reshape_mover.cc
index 0f26a025bf125f70199637894741540f89eae7e5..49ec38eb62c7b51c7a2d301d882cef032b288036 100644
--- a/tensorflow/compiler/xla/service/reshape_mover.cc
+++ b/tensorflow/compiler/xla/service/reshape_mover.cc
@@ -155,20 +155,15 @@ HloInstruction* UpdateOperand(const HloInstruction* first_reshape_operand,
     case HloOpcode::kConstant: {
       if (first_reshape_operand->opcode() == HloOpcode::kReshape) {
         VLOG(5) << "Adding reshape to kConstant operand";
-        HloInstruction* reshape = computation->AddInstruction(
+        return computation->AddInstruction(
             HloInstruction::CreateReshape(new_shape, operand));
-        operand->SetupDerivedInstruction(reshape);
-        return reshape;
       } else {
         CHECK(first_reshape_operand->opcode() == HloOpcode::kTranspose);
         VLOG(5) << "Adding transpose to kConstant operand";
         std::vector<int64> inverse_permutation =
             InversePermutation(first_reshape_operand->dimensions());
-        HloInstruction* transpose =
-            computation->AddInstruction(HloInstruction::CreateTranspose(
-                new_shape, operand, inverse_permutation));
-        operand->SetupDerivedInstruction(transpose);
-        return transpose;
+        return computation->AddInstruction(HloInstruction::CreateTranspose(
+            new_shape, operand, inverse_permutation));
       }
     }
     case HloOpcode::kRng: {
diff --git a/tensorflow/compiler/xla/service/service.cc b/tensorflow/compiler/xla/service/service.cc
index cb0f76ebe4d445059fdf37ebf559bef851a57104..d01c35b99231310692f85d0f9fbf4f2c3709d44c 100644
--- a/tensorflow/compiler/xla/service/service.cc
+++ b/tensorflow/compiler/xla/service/service.cc
@@ -36,7 +36,6 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/hlo_module_config.h"
 #include "tensorflow/compiler/xla/service/hlo_proto_util.h"
 #include "tensorflow/compiler/xla/service/platform_util.h"
-#include "tensorflow/compiler/xla/service/session.pb.h"
 #include "tensorflow/compiler/xla/service/source_map_util.h"
 #include "tensorflow/compiler/xla/service/transfer_manager.h"
 #include "tensorflow/compiler/xla/shape_layout.h"
@@ -62,33 +61,6 @@ namespace xla {
 
 namespace {
 
-// Records the arguments used to invoke a computation in a SessionModule
-// proto.
-Status RecordArguments(
-    const tensorflow::gtl::ArraySlice<const ShapedBuffer*> arguments,
-    se::StreamExecutor* executor, TransferManager* transfer_manager,
-    SessionModule* module) {
-  module->clear_arguments();
-  for (const ShapedBuffer* argument : arguments) {
-    TF_ASSIGN_OR_RETURN(
-        std::unique_ptr<Literal> literal,
-        transfer_manager->TransferLiteralFromDevice(executor, *argument));
-    *module->add_arguments() = literal->ToProto();
-  }
-  return Status::OK();
-}
-
-// Records the result of a computation in a SessionModule proto.
-Status RecordResult(const ShapedBuffer& result, se::StreamExecutor* executor,
-                    TransferManager* transfer_manager, SessionModule* module) {
-  module->clear_result();
-  TF_ASSIGN_OR_RETURN(
-      std::unique_ptr<Literal> literal,
-      transfer_manager->TransferLiteralFromDevice(executor, result));
-  *module->mutable_result() = literal->ToProto();
-  return Status::OK();
-}
-
 // Records the arguments used to invoke a computation in an HloSnapshot proto.
 Status RecordArguments(
     const tensorflow::gtl::ArraySlice<const ShapedBuffer*> arguments,
@@ -195,20 +167,6 @@ Service::Service(const ServiceOptions& options,
   }
 }
 
-Status Service::Computation(const ComputationRequest* arg,
-                            ComputationResponse* result) {
-  if (arg->name().empty()) {
-    return InvalidArgument("computation request needs a name");
-  }
-
-  *result->mutable_computation() =
-      computation_tracker_.NewComputation(arg->name());
-  VLOG(1) << Printf("Created new computation %s on service %p, name %s",
-                    result->computation().ShortDebugString().c_str(), this,
-                    arg->name().c_str());
-  return Status::OK();
-}
-
 Status Service::CreateChannelHandle(const CreateChannelHandleRequest* arg,
                                     CreateChannelHandleResponse* result) {
   *result->mutable_channel() = channel_tracker_.NewChannel();
@@ -288,8 +246,7 @@ Service::ResolveAndValidateArguments(
 StatusOr<std::unique_ptr<HloModuleConfig>> Service::CreateModuleConfig(
     const ProgramShape& program_shape,
     tensorflow::gtl::ArraySlice<const Shape*> argument_shapes,
-    const ExecutionOptions* execution_options,
-    const UserComputation* user_computation) {
+    const ExecutionOptions* execution_options) {
   auto config = MakeUnique<HloModuleConfig>(program_shape);
   ComputationLayout* host_computation_layout =
       config->mutable_host_entry_computation_layout();
@@ -305,17 +262,9 @@ StatusOr<std::unique_ptr<HloModuleConfig>> Service::CreateModuleConfig(
     // ProgramShape.
     if (!ShapeUtil::Compatible(*argument_shapes[i],
                                program_shape.parameters(i))) {
-      if (user_computation == nullptr) {
-        return InvalidArgument(
-            "Argument does not match shape of computation parameter %d: want "
-            "%s, got %s",
-            i, ShapeUtil::HumanString(program_shape.parameters(i)).c_str(),
-            ShapeUtil::HumanString(*argument_shapes[i]).c_str());
-      }
-      return InvalidParameterArgument(
-          *user_computation->ParameterMetadata(i).value(),
-          "Argument does not match shape of computation parameter %d: want %s, "
-          "got %s",
+      return InvalidArgument(
+          "Argument does not match shape of computation parameter %d: want "
+          "%s, got %s",
           i, ShapeUtil::HumanString(program_shape.parameters(i)).c_str(),
           ShapeUtil::HumanString(*argument_shapes[i]).c_str());
     }
@@ -366,76 +315,12 @@ StatusOr<std::unique_ptr<HloModuleConfig>> Service::CreateModuleConfig(
 StatusOr<std::unique_ptr<HloModuleConfig>> Service::CreateModuleConfig(
     const ProgramShape& program_shape,
     tensorflow::gtl::ArraySlice<const ShapedBuffer*> arguments,
-    const ExecutionOptions& execution_options,
-    const UserComputation* user_computation) {
+    const ExecutionOptions& execution_options) {
   std::vector<const Shape*> argument_shapes;
   for (const auto* arg : arguments) {
     argument_shapes.push_back(&arg->on_host_shape());
   }
-  return CreateModuleConfig(program_shape, argument_shapes, &execution_options,
-                            user_computation);
-}
-
-StatusOr<std::vector<std::unique_ptr<Executable>>> Service::BuildExecutables(
-    std::vector<VersionedComputationHandle> versioned_handles,
-    std::vector<std::unique_ptr<HloModuleConfig>> module_configs,
-    Backend* backend, std::vector<std::vector<se::StreamExecutor*>> executors,
-    DeviceMemoryAllocator* device_allocator) {
-  VLOG(1) << Printf("BuildExecutable on service %p", this);
-
-  // Dump computation proto state if flag is set.
-  std::vector<std::unique_ptr<SessionModule>> session_modules;
-  for (int64 i = 0; i < versioned_handles.size(); ++i) {
-    const string& directory_path =
-        module_configs[i]->debug_options().xla_dump_computations_to();
-    const string& other_directory_path =
-        module_configs[i]->debug_options().xla_dump_executions_to();
-    if (directory_path.empty() && other_directory_path.empty()) {
-      continue;
-    }
-    TF_ASSIGN_OR_RETURN(
-        std::unique_ptr<SessionModule> session_module,
-        computation_tracker_.SnapshotComputation(versioned_handles[i].handle));
-    if (!directory_path.empty()) {
-      string filename = Printf("computation_%lld__%s__version_%lld",
-                               versioned_handles[i].handle.handle(),
-                               session_module->entry().name().c_str(),
-                               versioned_handles[i].version);
-      TF_RETURN_IF_ERROR(Executable::DumpToDirectory(directory_path, filename,
-                                                     *session_module));
-      session_modules.push_back(std::move(session_module));
-    }
-  }
-
-  VLOG(1) << "Computation handles:";
-  for (const VersionedComputationHandle& versioned_handle : versioned_handles) {
-    VLOG(1) << versioned_handle;
-  }
-
-  CHECK_EQ(versioned_handles.size(), module_configs.size());
-  std::vector<std::unique_ptr<HloModule>> modules;
-  for (int64 i = 0; i < versioned_handles.size(); ++i) {
-    const VersionedComputationHandle& versioned_handle = versioned_handles[i];
-    const HloModuleConfig& config = *module_configs[i];
-    TF_ASSIGN_OR_RETURN(auto module,
-                        computation_tracker_.BuildHloModule(
-                            versioned_handle, config,
-                            /*include_unreachable_instructions=*/true));
-    modules.push_back(std::move(module));
-  }
-
-  TF_ASSIGN_OR_RETURN(
-      std::vector<std::unique_ptr<Executable>> executables,
-      backend->compiler()->Compile(std::move(modules), std::move(executors),
-                                   device_allocator));
-
-  for (size_t i = 0; i < versioned_handles.size(); ++i) {
-    if (!module_configs[i]->debug_options().xla_dump_executions_to().empty()) {
-      executables[i]->set_session_module(std::move(session_modules[i]));
-    }
-  }
-
-  return std::move(executables);
+  return CreateModuleConfig(program_shape, argument_shapes, &execution_options);
 }
 
 StatusOr<std::vector<std::unique_ptr<Executable>>> Service::BuildExecutables(
@@ -512,98 +397,6 @@ Status Service::ValidateEntryComputationLayout(HloModule* module) {
   return Status::OK();
 }
 
-StatusOr<std::unique_ptr<Executable>> Service::BuildExecutable(
-    const VersionedComputationHandle& versioned_handle,
-    std::unique_ptr<HloModuleConfig> module_config, Backend* backend,
-    se::StreamExecutor* executor, DeviceMemoryAllocator* device_allocator) {
-  VLOG(1) << Printf("BuildExecutable on service %p with handle %s", this,
-                    versioned_handle.ToString().c_str());
-
-  // Dump computation proto state if flag is set.
-  std::unique_ptr<SessionModule> session_module;
-  const string& directory_path =
-      module_config->debug_options().xla_dump_computations_to();
-  const string& other_directory_path =
-      module_config->debug_options().xla_dump_executions_to();
-  if (!directory_path.empty() || !other_directory_path.empty()) {
-    TF_ASSIGN_OR_RETURN(
-        session_module,
-        computation_tracker_.SnapshotComputation(versioned_handle.handle));
-    if (!directory_path.empty()) {
-      string filename = Printf("computation_%lld__%s__version_%lld",
-                               versioned_handle.handle.handle(),
-                               session_module->entry().name().c_str(),
-                               versioned_handle.version);
-      TF_RETURN_IF_ERROR(Executable::DumpToDirectory(directory_path, filename,
-                                                     *session_module));
-    }
-  }
-
-  TF_ASSIGN_OR_RETURN(
-      std::unique_ptr<HloModule> module,
-      computation_tracker_.BuildHloModule(versioned_handle, *module_config,
-                                          /*include_unreachable_instructions=*/
-                                          true));
-
-  TF_RETURN_IF_ERROR(MaybeDumpHloModule(*module));
-
-  TF_ASSIGN_OR_RETURN(
-      module, backend->compiler()->RunHloPasses(std::move(module), executor,
-                                                device_allocator));
-  // Check that on-host and on-device shapes are consistent.
-  TF_RETURN_IF_ERROR(ValidateEntryComputationLayout(module.get()));
-
-  TF_ASSIGN_OR_RETURN(std::unique_ptr<Executable> executable,
-                      backend->compiler()->RunBackend(
-                          std::move(module), executor, device_allocator));
-
-  if (!other_directory_path.empty()) {
-    executable->set_session_module(std::move(session_module));
-  }
-
-  return std::move(executable);
-}
-
-StatusOr<std::shared_ptr<Executable>> Service::BuildAndCacheExecutable(
-    const VersionedComputationHandle& versioned_handle,
-    std::unique_ptr<HloModuleConfig> module_config, Backend* backend,
-    se::StreamExecutor* executor, ExecutionProfile* profile,
-    DeviceMemoryAllocator* device_allocator) {
-  std::shared_ptr<Executable> executable =
-      compilation_cache_.LookUp(versioned_handle, *module_config);
-
-  if (executable != nullptr) {
-    // Executable found in the computation cache.
-    if (profile != nullptr) {
-      profile->set_compilation_cache_hit(true);
-    }
-    return executable;
-  }
-
-  uint64 start_micros =
-      // Avoid reading the clock if we don't want timing info
-      (profile != nullptr) ? tensorflow::Env::Default()->NowMicros() : 0;
-
-  // Take a copy of the module config, as compilation introduces layouts where
-  // layouts were optional before.
-  HloModuleConfig original_module_config = *module_config;
-  TF_ASSIGN_OR_RETURN(
-      std::unique_ptr<Executable> executable_unique_ptr,
-      BuildExecutable(versioned_handle, std::move(module_config), backend,
-                      executor, device_allocator));
-
-  if (profile != nullptr) {
-    uint64 end_micros = tensorflow::Env::Default()->NowMicros();
-    uint64 milliseconds = (end_micros - start_micros) / 1000;
-    profile->set_compilation_cache_hit(false);
-    profile->set_compile_time_ms(milliseconds);
-  }
-
-  // Insert executable into the cache.
-  return compilation_cache_.Insert(std::move(executable_unique_ptr),
-                                   original_module_config);
-}
-
 StatusOr<std::vector<GlobalDataHandle>>
 Service::ExecuteParallelAndRegisterResult(
     tensorflow::gtl::ArraySlice<Executable*> executables,
@@ -624,9 +417,16 @@ Service::ExecuteParallelAndRegisterResult(
   // profiled.
   std::map<int64, se::Stream*> index_to_profiled_streams;
 
-  TF_ASSIGN_OR_RETURN(DeviceAssignment device_assignment,
-                      backend->computation_placer()->AssignDevices(
-                          options_.number_of_replicas(), executables.size()));
+  // Build DeviceAssignment for all cores based on the provided device handles.
+  DeviceAssignment device_assignment(options_.number_of_replicas(),
+                                     executables.size());
+  for (int64 i = 0; i < executables.size(); i++) {
+    TF_ASSIGN_OR_RETURN(auto replicas, Replicas(*backend, device_handles[i]));
+    CHECK_EQ(replicas.size(), arguments[i].size());
+    for (int64 replica = 0; replica < replicas.size(); ++replica) {
+      device_assignment(replica, i) = replicas[replica]->device_ordinal();
+    }
+  }
 
   for (int64 i = 0; i < executables.size(); i++) {
     // Stream executors for the replicas of the current computation.
@@ -799,13 +599,6 @@ StatusOr<GlobalDataHandle> Service::ExecuteAndRegisterResult(
                                                        result_tag);
 }
 
-Status Service::SetReturnValue(const SetReturnValueRequest* arg,
-                               SetReturnValueResponse* results) {
-  TF_ASSIGN_OR_RETURN(UserComputation * computation,
-                      computation_tracker_.Resolve(arg->computation()));
-  return computation->SetReturnValue(arg->operand());
-}
-
 StatusOr<std::vector<se::StreamExecutor*>> Service::GetExecutors(
     const ExecutionOptions& execution_options, int64 requests_size,
     int64 request_index) const {
@@ -847,117 +640,6 @@ StatusOr<std::vector<std::vector<const ShapedBuffer*>>> Service::GetArguments(
   return replicated_arguments;
 }
 
-Status Service::ExecuteParallel(const ExecuteParallelRequest* arg,
-                                ExecuteParallelResponse* result) {
-  VLOG(1) << "running execute-parallel request: " << arg->ShortDebugString();
-
-  std::vector<std::vector<std::vector<const ShapedBuffer*>>> all_arguments;
-  std::vector<std::vector<se::StreamExecutor*>> all_executors;
-  std::vector<VersionedComputationHandle> versioned_handles;
-  std::vector<std::unique_ptr<HloModuleConfig>> module_configs;
-  std::vector<string> computation_names;
-  std::vector<DeviceHandle> device_handles;
-
-  int num_requested_devices =
-      std::accumulate(arg->requests().begin(), arg->requests().end(), 0,
-                      [](int a, const ExecuteRequest& r) -> int {
-                        return a + r.execution_options().device_handles_size();
-                      });
-  if (num_requested_devices * options_.number_of_replicas() >
-      execute_backend_->device_count()) {
-    return FailedPrecondition(
-        "there are not enough stream executors to execute %d computations",
-        num_requested_devices);
-  }
-
-  for (int64 i = 0; i < arg->requests_size(); ++i) {
-    // Get the stream executor for the i'th computation. This stream executor
-    // is one of the executors to run the replicated computation.
-    const ExecutionOptions& execution_options =
-        arg->requests(i).execution_options();
-
-    // Get the executors.
-    TF_ASSIGN_OR_RETURN(auto executors, GetExecutors(execution_options,
-                                                     arg->requests_size(), i));
-
-    // Resolve the UserComputation object associated with the requested
-    // computation and compute the program shape.
-    const ExecuteRequest& request = arg->requests(i);
-    TF_ASSIGN_OR_RETURN(UserComputation * user_computation,
-                        computation_tracker_.Resolve(request.computation()));
-    VersionedComputationHandle versioned_handle =
-        user_computation->GetVersionedHandle();
-    if (user_computation->request_count(versioned_handle.version) == 0) {
-      return InvalidArgument("computations may not be empty");
-    }
-
-    TF_ASSIGN_OR_RETURN(
-        std::shared_ptr<const ProgramShape> program_shape,
-        user_computation->ComputeProgramShape(versioned_handle.version));
-
-    // Get the replicated arguments.
-    TF_ASSIGN_OR_RETURN(auto replicated_arguments,
-                        GetArguments(execution_options, request.arguments()));
-
-    // Create an HloModuleConfig object for the computation, given the shape of
-    // the program and the argument allocations. Here, we care only about the
-    // shapes of the arguments, so, it is sufficient to use the arguments of
-    // replica 0.
-    TF_ASSIGN_OR_RETURN(
-        std::unique_ptr<HloModuleConfig> module_config,
-        CreateModuleConfig(*program_shape, replicated_arguments.front(),
-                           request.execution_options(), user_computation));
-    VLOG(3) << "ExecuteParallel created HloModuleConfig computation layout: "
-            << module_config->host_entry_computation_layout().ToString();
-
-    // Adds to the vectors to build and execute the computations after the loop.
-    all_arguments.push_back(replicated_arguments);
-    all_arguments.insert(all_arguments.end(), executors.size() - 1, {{}});
-    versioned_handles.push_back(versioned_handle);
-    module_configs.push_back(std::move(module_config));
-    computation_names.insert(computation_names.end(), executors.size(),
-                             user_computation->name());
-    all_executors.push_back(executors);
-    device_handles.insert(device_handles.end(),
-                          execution_options.device_handles().begin(),
-                          execution_options.device_handles().end());
-  }
-
-  // Build the user computations into HloModules and compile to generate the
-  // executables.
-  //
-  // TODO(jlebar): There's currently no way to pass a device allocator to
-  // ExecuteParallel, so we have to pass a null device_allocator below.
-  TF_ASSIGN_OR_RETURN(
-      std::vector<std::unique_ptr<Executable>> executables,
-      BuildExecutables(versioned_handles, std::move(module_configs),
-                       execute_backend_.get(), all_executors,
-                       /*device_allocator=*/nullptr));
-  std::vector<Executable*> executable_ptrs;
-  executable_ptrs.reserve(executables.size());
-  for (const auto& executable : executables) {
-    executable_ptrs.push_back(executable.get());
-  }
-
-  // Execute the generated executables in parallel and return the device
-  // handles for each computation's output.
-  ExecutionProfile profile;
-  TF_ASSIGN_OR_RETURN(
-      std::vector<GlobalDataHandle> outputs,
-      ExecuteParallelAndRegisterResult(executable_ptrs, all_arguments,
-                                       execute_backend_.get(), device_handles,
-                                       computation_names, &profile));
-  for (const GlobalDataHandle& output : outputs) {
-    ExecuteResponse response;
-    *response.mutable_output() = output;
-    *response.mutable_profile() = profile;
-    *result->add_responses() = response;
-  }
-
-  VLOG(1) << "successfully completed 'execute-parallel' request";
-  return Status::OK();
-}
-
 Status Service::ExecuteGraphParallel(const ExecuteGraphParallelRequest* arg,
                                      ExecuteParallelResponse* result) {
   VLOG(1) << "running execute-graph-parallel request";
@@ -1007,8 +689,7 @@ Status Service::ExecuteGraphParallel(const ExecuteGraphParallelRequest* arg,
         std::unique_ptr<HloModuleConfig> module_config,
         CreateModuleConfig(request.computation().program_shape(),
                            replicated_arguments.front(),
-                           request.execution_options(),
-                           /*user_computation=*/nullptr));
+                           request.execution_options()));
     VLOG(3)
         << "ExecuteGraphParallel created HloModuleConfig computation layout: "
         << module_config->host_entry_computation_layout().ToString();
@@ -1083,15 +764,6 @@ Status Service::GetDeviceHandles(const GetDeviceHandlesRequest* arg,
   return Status::OK();
 }
 
-Status Service::ExecuteOneToN(const ExecuteRequest* arg,
-                              ExecuteResponse* result) {
-  ExecuteParallelRequest parallel_arg;
-  *parallel_arg.add_requests() = *arg;
-  ExecuteParallelResponse parallel_result;
-  TF_RETURN_IF_ERROR(ExecuteParallel(&parallel_arg, &parallel_result));
-  return PickParallelResponse(parallel_result, result);
-}
-
 Status Service::ExecuteOneToN(const ExecuteGraphRequest* arg,
                               ExecuteResponse* result) {
   ExecuteGraphParallelRequest parallel_arg;
@@ -1124,80 +796,6 @@ Status Service::PickParallelResponse(
   return Status::OK();
 }
 
-Status Service::Execute(const ExecuteRequest* arg, ExecuteResponse* result) {
-  VLOG(1) << "running execute request: " << arg->ShortDebugString();
-
-  TF_ASSIGN_OR_RETURN(UserComputation * user_computation,
-                      computation_tracker_.Resolve(arg->computation()));
-
-  VersionedComputationHandle versioned_handle =
-      user_computation->GetVersionedHandle();
-
-  if (user_computation->request_count(versioned_handle.version) == 0) {
-    return InvalidArgument("computations may not be empty");
-  }
-
-  // If we received multiple device handles, we must partition the module.
-  if (arg->execution_options().device_handles_size() > 1) {
-    return ExecuteOneToN(arg, result);
-  }
-
-  TF_ASSIGN_OR_RETURN(
-      std::shared_ptr<const ProgramShape> program_shape,
-      user_computation->ComputeProgramShape(versioned_handle.version));
-
-  TF_ASSIGN_OR_RETURN(auto replicas, Replicas(*execute_backend_,
-                                              SingleComputationDeviceHandle()));
-  TF_ASSIGN_OR_RETURN(
-      std::vector<std::vector<const ShapedBuffer*>> replicated_arguments,
-      ResolveAndValidateArguments(arg->arguments(), replicas));
-
-  // Since we care only about the shapes of the arguments, it is sufficient to
-  // use the arguments of replica 0.
-  TF_ASSIGN_OR_RETURN(
-      std::unique_ptr<HloModuleConfig> module_config,
-      CreateModuleConfig(*program_shape, replicated_arguments.front(),
-                         arg->execution_options(), user_computation));
-
-  VLOG(3) << "Execute created HloModuleConfig computation layout: "
-          << module_config->host_entry_computation_layout().ToString();
-
-  TF_ASSIGN_OR_RETURN(
-      std::shared_ptr<Executable> executable,
-      BuildAndCacheExecutable(versioned_handle, std::move(module_config),
-                              execute_backend_.get(),
-                              execute_backend_->default_stream_executor(),
-                              result->mutable_profile()));
-
-  if (executable->dumping()) {
-    executable->session_module()->set_execution_platform(
-        execute_backend_->platform()->Name());
-    TF_RETURN_IF_ERROR(RecordArguments(
-        replicated_arguments.front(),
-        execute_backend_->default_stream_executor(),
-        execute_backend_->transfer_manager(), executable->session_module()));
-  }
-
-  TF_ASSIGN_OR_RETURN(
-      *result->mutable_output(),
-      ExecuteAndRegisterResult(
-          executable.get(), replicated_arguments, execute_backend_.get(),
-          "result of " + user_computation->name(), result->mutable_profile()));
-
-  if (executable->dumping()) {
-    TF_ASSIGN_OR_RETURN(
-        const ShapedBuffer* result_buffer,
-        allocation_tracker_.ResolveForReplica(result->output(), 0));
-    TF_RETURN_IF_ERROR(RecordResult(
-        *result_buffer, execute_backend_->default_stream_executor(),
-        execute_backend_->transfer_manager(), executable->session_module()));
-    TF_RETURN_IF_ERROR(executable->DumpSessionModule());
-  }
-
-  VLOG(1) << "successfully completed 'execute' request";
-  return Status::OK();
-}
-
 StatusOr<std::unique_ptr<Executable>> Service::BuildExecutable(
     const HloModuleProto& module_proto,
     std::unique_ptr<HloModuleConfig> module_config, Backend* backend,
@@ -1303,86 +901,6 @@ Status Service::ExecuteGraph(const ExecuteGraphRequest* arg,
   return Status::OK();
 }
 
-Status Service::ExecuteAsync(const ExecuteAsyncRequest* arg,
-                             ExecuteAsyncResponse* result) {
-  VLOG(1) << "running execute-async request: " << arg->ShortDebugString();
-
-  TF_ASSIGN_OR_RETURN(UserComputation * user_computation,
-                      computation_tracker_.Resolve(arg->computation()));
-
-  VersionedComputationHandle versioned_handle =
-      user_computation->GetVersionedHandle();
-  if (user_computation->request_count(versioned_handle.version) == 0) {
-    return InvalidArgument("computations may not be empty");
-  }
-
-  TF_ASSIGN_OR_RETURN(
-      std::shared_ptr<const ProgramShape> program_shape,
-      user_computation->ComputeProgramShape(versioned_handle.version));
-
-  TF_ASSIGN_OR_RETURN(auto replicas, Replicas(*execute_backend_,
-                                              SingleComputationDeviceHandle()));
-  TF_RET_CHECK(!replicas.empty());
-  TF_ASSIGN_OR_RETURN(
-      std::vector<std::vector<const ShapedBuffer*>> replicated_arguments,
-      ResolveAndValidateArguments(arg->arguments(), replicas));
-
-  TF_ASSIGN_OR_RETURN(
-      std::unique_ptr<HloModuleConfig> module_config,
-      CreateModuleConfig(*program_shape, replicated_arguments.front(),
-                         arg->execution_options(), user_computation));
-
-  VLOG(3) << "ExecuteAsync created HloModuleConfig computation layout: "
-          << module_config->host_entry_computation_layout().ToString();
-
-  ExecutionProfile profile;
-
-  TF_ASSIGN_OR_RETURN(
-      std::shared_ptr<Executable> executable,
-      BuildAndCacheExecutable(
-          versioned_handle, std::move(module_config), execute_backend_.get(),
-          execute_backend_->default_stream_executor(), &profile));
-
-  // Set up streams.
-  std::vector<Pool<se::Stream>::SmartPtr> streams;
-  for (se::StreamExecutor* executor : replicas) {
-    TF_ASSIGN_OR_RETURN(Pool<se::Stream>::SmartPtr stream,
-                        execute_backend_->BorrowStream(executor));
-    streams.push_back(std::move(stream));
-  }
-
-  std::vector<ScopedShapedBuffer> result_buffers;
-  for (size_t i = 0; i < streams.size(); ++i) {
-    const auto& stream = streams[i];
-    ExecutableRunOptions options;
-    options.set_stream(stream.get());
-    options.set_allocator(execute_backend_->memory_allocator());
-    options.set_intra_op_thread_pool(
-        execute_backend_->eigen_intra_op_thread_pool_device());
-
-    ServiceExecutableRunOptions service_options(
-        options, execute_backend_->StreamBorrower());
-
-    TF_ASSIGN_OR_RETURN(ScopedShapedBuffer this_result_buffer,
-                        executable->ExecuteAsyncOnStream(
-                            &service_options, replicated_arguments[i]));
-
-    result_buffers.emplace_back(std::move(this_result_buffer));
-  }
-
-  TF_ASSIGN_OR_RETURN(
-      GlobalDataHandle output,
-      allocation_tracker_.RegisterReplicatedBuffers(
-          std::move(result_buffers), "result of " + user_computation->name()));
-
-  *result->mutable_execution() = execution_tracker_.Register(
-      execute_backend_.get(), std::move(streams), profile, output);
-  streams.clear();
-
-  VLOG(1) << "successfully completed 'execute-async' request";
-  return Status::OK();
-}
-
 Status Service::WaitForExecution(const WaitForExecutionRequest* arg,
                                  WaitForExecutionResponse* result) {
   TF_ASSIGN_OR_RETURN(const auto execution,
@@ -1549,117 +1067,6 @@ Status Service::ResetDevice(const ResetDeviceRequest* arg,
   return execute_backend_->ResetDevices();
 }
 
-Status Service::IsConstant(const IsConstantRequest* arg,
-                           IsConstantResponse* result) {
-  TF_ASSIGN_OR_RETURN(UserComputation * user_computation,
-                      computation_tracker_.Resolve(arg->computation()));
-
-  VersionedComputationHandle versioned_handle =
-      user_computation->GetVersionedHandleAtOperation(arg->operand());
-
-  if (user_computation->request_count(versioned_handle.version) == 0) {
-    return InvalidArgument("computations may not be empty");
-  }
-
-  TF_ASSIGN_OR_RETURN(
-      bool is_constant,
-      user_computation->IsConstant(arg->operand(), arg->num_parameters()));
-
-  result->set_is_constant(is_constant);
-  return Status::OK();
-}
-
-Status Service::ComputeConstant(const ComputeConstantRequest* arg,
-                                ComputeConstantResponse* result) {
-  TF_ASSIGN_OR_RETURN(UserComputation * user_computation,
-                      computation_tracker_.Resolve(arg->computation()));
-
-  VersionedComputationHandle versioned_handle =
-      user_computation->GetVersionedHandleAtOperation(arg->operand());
-
-  if (user_computation->request_count(versioned_handle.version) == 0) {
-    return InvalidArgument("computations may not be empty");
-  }
-
-  TF_ASSIGN_OR_RETURN(
-      bool is_constant,
-      user_computation->IsConstant(arg->operand(), arg->parameters_size()));
-  if (!is_constant) {
-    StatusOr<const OperationRequest*> op_request_status =
-        user_computation->LookUpRequestForErrorReporting(arg->operand());
-    string op_request_string = "<unknown operation>";
-    if (op_request_status.ok()) {
-      op_request_string = op_request_status.ValueOrDie()->ShortDebugString();
-    }
-    return InvalidArgument(
-        "Operand to ComputeConstant depends on a parameter.\n\n"
-        "  op requested for constant evaluation: %s\n\n"
-        "This is an internal error that typically happens when the XLA user "
-        "(e.g. TensorFlow) is attempting to determine a value that must be a "
-        "compile-time constant (e.g. an array dimension) but it is not capable "
-        "of being evaluated at XLA compile time.\n\n"
-        "Please file a usability bug with the framework being used (e.g. "
-        "TensorFlow).",
-        op_request_string.c_str());
-  }
-
-  // We can't use ComputeProgramShape because it checks that all parameter
-  // instructions are present and contiguous. Instead construct ProgramShape
-  // directly.
-  ProgramShape program_shape;
-  TF_ASSIGN_OR_RETURN(*program_shape.mutable_result(),
-                      user_computation->GetShape(arg->operand()));
-
-  TF_DCHECK_OK(ShapeUtil::ValidateShape(program_shape.result()));
-
-  ExecutionOptions execution_options = xla::CreateDefaultExecutionOptions();
-  execution_options.mutable_debug_options()->set_xla_enable_fast_math(false);
-  execution_options.mutable_debug_options()
-      ->set_xla_eliminate_hlo_implicit_broadcast(true);
-  *execution_options.mutable_shape_with_output_layout() =
-      program_shape.result();
-
-  Shape shape_with_output_layout(program_shape.result());
-  if (arg->has_output_layout()) {
-    TF_RETURN_IF_ERROR(LayoutUtil::ValidateLayoutForShape(
-        arg->output_layout(), execution_options.shape_with_output_layout()));
-    *execution_options.mutable_shape_with_output_layout()->mutable_layout() =
-        arg->output_layout();
-  }
-
-  TF_ASSIGN_OR_RETURN(std::unique_ptr<HloModuleConfig> module_config,
-                      CreateModuleConfig(program_shape, {}, execution_options,
-                                         user_computation));
-
-  // Exclude dead parameter instructions for the purpose of computing constants.
-  TF_ASSIGN_OR_RETURN(
-      std::unique_ptr<HloModule> module,
-      computation_tracker_.BuildHloModule(versioned_handle, *module_config,
-                                          /*include_unreachable_instructions=*/
-                                          false));
-
-  std::vector<std::unique_ptr<Literal>> parameters(arg->parameters_size());
-  for (int64 i = 0; i < arg->parameters_size(); ++i) {
-    TF_ASSIGN_OR_RETURN(parameters[i],
-                        Literal::CreateFromProto(arg->parameters(i)));
-  }
-  HloEvaluator evaluator;
-  TF_ASSIGN_OR_RETURN(
-      auto result_literal,
-      evaluator.Evaluate<std::unique_ptr<Literal>>(*module, parameters));
-
-  // Since the shape_with_output_layout option in ExecutionOption is
-  // non-effective to the Evaluator results, explicit relayout here.
-  //
-  // TODO(b/77824332): Make HloEvaluator take care of the re-layout.
-  if (arg->has_output_layout()) {
-    result_literal = result_literal->Relayout(arg->output_layout());
-  }
-  *result->mutable_literal() = result_literal->ToProto();
-
-  return Status::OK();
-}
-
 Status Service::ComputeConstantGraph(const ComputeConstantGraphRequest* arg,
                                      ComputeConstantResponse* result) {
   if (!arg->has_computation()) {
@@ -1709,60 +1116,6 @@ Status Service::GetShape(const GetShapeRequest* arg, GetShapeResponse* result) {
   return Status::OK();
 }
 
-Status Service::GetComputationShape(const GetComputationShapeRequest* arg,
-                                    GetComputationShapeResponse* result) {
-  TF_ASSIGN_OR_RETURN(UserComputation * computation,
-                      computation_tracker_.Resolve(arg->computation()));
-
-  VersionedComputationHandle versioned_handle =
-      computation->GetVersionedHandle();
-
-  TF_ASSIGN_OR_RETURN(auto program_shape, computation->ComputeProgramShape(
-                                              versioned_handle.version));
-  *result->mutable_program_shape() = *program_shape;
-  return Status::OK();
-}
-
-Status Service::GetLocalShape(const GetLocalShapeRequest* arg,
-                              GetLocalShapeResponse* result) {
-  TF_ASSIGN_OR_RETURN(UserComputation * computation,
-                      computation_tracker_.Resolve(arg->computation()));
-
-  TF_ASSIGN_OR_RETURN(*result->mutable_shape(),
-                      computation->GetShape(arg->operand()));
-  return Status::OK();
-}
-
-Status Service::GetComputationStats(const ComputationStatsRequest* arg,
-                                    ComputationStatsResponse* result) {
-  TF_ASSIGN_OR_RETURN(UserComputation * user_computation,
-                      computation_tracker_.Resolve(arg->computation()));
-
-  VersionedComputationHandle versioned_handle =
-      user_computation->GetVersionedHandle();
-
-  HloModuleConfig config;
-  config.set_debug_options(arg->debug_options());
-  TF_ASSIGN_OR_RETURN(
-      std::unique_ptr<HloModule> module,
-      computation_tracker_.BuildHloModule(versioned_handle, config));
-
-  hlo_graph_dumper::MaybeDumpHloModule(*module,
-                                       "computation statistics subject");
-
-  // Run HLO analysis to get the computation statistics.
-  HloCostAnalysis analysis(
-      execute_backend_->compiler()->ShapeSizeBytesFunction());
-
-  TF_RETURN_IF_ERROR(module->entry_computation()->Accept(&analysis));
-
-  ComputationStats stats;
-  stats.set_flop_count(analysis.flop_count());
-  stats.set_transcendental_count(analysis.transcendental_count());
-  *result->mutable_stats() = stats;
-  return Status::OK();
-}
-
 Status Service::GetComputationGraphStats(
     const ComputationGraphStatsRequest* arg, ComputationStatsResponse* result) {
   if (!arg->has_computation()) {
@@ -1793,262 +1146,6 @@ Status Service::GetComputationGraphStats(
   return Status::OK();
 }
 
-template <typename RequestT, typename ResponseT>
-Status Service::AddInstruction(
-    const RequestT* arg, ResponseT* result,
-    const std::function<StatusOr<ComputationDataHandle>(UserComputation*)>&
-        adder) {
-  TF_ASSIGN_OR_RETURN(UserComputation * computation,
-                      computation_tracker_.Resolve(arg->computation()));
-
-  TF_ASSIGN_OR_RETURN(*result->mutable_output(), adder(computation));
-  return Status::OK();
-}
-
-Status Service::Op(const OpRequest* arg, OpResponse* result) {
-  TF_ASSIGN_OR_RETURN(UserComputation * computation,
-                      computation_tracker_.Resolve(arg->computation()));
-  StatusOr<ComputationDataHandle> handle_status;
-
-  switch (arg->op_case()) {
-    case OpRequest::kBatchNormTrainingRequest:
-      handle_status = computation->AddBatchNormTrainingInstruction(
-          arg->batch_norm_training_request());
-      break;
-    case OpRequest::kBatchNormInferenceRequest:
-      handle_status = computation->AddBatchNormInferenceInstruction(
-          arg->batch_norm_inference_request());
-      break;
-    case OpRequest::kBatchNormGradRequest:
-      handle_status = computation->AddBatchNormGradInstruction(
-          arg->batch_norm_grad_request());
-      break;
-    case OpRequest::kBinaryOpRequest:
-      handle_status =
-          computation->AddBinaryInstruction(arg->binary_op_request());
-      break;
-    case OpRequest::kBroadcastRequest:
-      handle_status =
-          computation->AddBroadcastInstruction(arg->broadcast_request());
-      break;
-    case OpRequest::kCallRequest: {
-      TF_ASSIGN_OR_RETURN(
-          UserComputation * to_apply,
-          computation_tracker_.Resolve(arg->call_request().to_apply()));
-      handle_status =
-          computation->AddCallInstruction(arg->call_request(), *to_apply);
-      break;
-    }
-    case OpRequest::kConcatenateRequest:
-      handle_status =
-          computation->AddConcatenateInstruction(arg->concatenate_request());
-      break;
-    case OpRequest::kConditionalRequest: {
-      TF_ASSIGN_OR_RETURN(UserComputation * true_computation,
-                          computation_tracker_.Resolve(
-                              arg->conditional_request().true_computation()));
-      TF_ASSIGN_OR_RETURN(UserComputation * false_computation,
-                          computation_tracker_.Resolve(
-                              arg->conditional_request().false_computation()));
-      handle_status = computation->AddConditionalInstruction(
-          arg->conditional_request(), *true_computation, *false_computation);
-      break;
-    }
-    case OpRequest::kConstantRequest:
-      handle_status =
-          computation->AddConstantInstruction(arg->constant_request());
-      break;
-    case OpRequest::kConvertRequest:
-      handle_status =
-          computation->AddConvertInstruction(arg->convert_request());
-      break;
-    case OpRequest::kBitcastConvertRequest:
-      handle_status = computation->AddBitcastConvertInstruction(
-          arg->bitcast_convert_request());
-      break;
-    case OpRequest::kConvolveRequest:
-      handle_status =
-          computation->AddConvolveInstruction(arg->convolve_request());
-      break;
-    case OpRequest::kCrossReplicaSumRequest:
-      handle_status = computation->AddCrossReplicaSumInstruction(
-          arg->cross_replica_sum_request());
-      break;
-    case OpRequest::kCustomCallRequest:
-      handle_status =
-          computation->AddCustomCallInstruction(arg->custom_call_request());
-      break;
-    case OpRequest::kDotRequest:
-      handle_status = computation->AddDotInstruction(arg->dot_request());
-      break;
-    case OpRequest::kDynamicSliceRequest:
-      handle_status =
-          computation->AddDynamicSliceInstruction(arg->dynamic_slice_request());
-      break;
-    case OpRequest::kDynamicUpdateSliceRequest:
-      handle_status = computation->AddDynamicUpdateSliceInstruction(
-          arg->dynamic_update_slice_request());
-      break;
-    case OpRequest::kFftRequest:
-      handle_status = computation->AddFftInstruction(arg->fft_request());
-      break;
-    case OpRequest::kGatherRequest:
-      handle_status = computation->AddGatherInstruction(arg->gather_request());
-      break;
-    case OpRequest::kGetTupleElementRequest:
-      handle_status = computation->AddGetTupleElementInstruction(
-          arg->get_tuple_element_request());
-      break;
-    case OpRequest::kInfeedRequest:
-      handle_status = computation->AddInfeedInstruction(arg->infeed_request());
-      break;
-    case OpRequest::kOutfeedRequest:
-      handle_status =
-          computation->AddOutfeedInstruction(arg->outfeed_request());
-      break;
-    case OpRequest::kHostComputeRequest:
-      handle_status =
-          computation->AddHostComputeInstruction(arg->host_compute_request());
-      break;
-    case OpRequest::kMapRequest: {
-      TF_ASSIGN_OR_RETURN(
-          UserComputation * to_apply,
-          computation_tracker_.Resolve(arg->map_request().to_apply()));
-      handle_status =
-          computation->AddMapInstruction(arg->map_request(), *to_apply);
-      break;
-    }
-    case OpRequest::kPadRequest:
-      handle_status = computation->AddPadInstruction(arg->pad_request());
-      break;
-    case OpRequest::kParameterRequest:
-      handle_status =
-          computation->AddParameterInstruction(arg->parameter_request());
-      break;
-    case OpRequest::kReduceRequest: {
-      TF_ASSIGN_OR_RETURN(
-          UserComputation * to_apply,
-          computation_tracker_.Resolve(arg->reduce_request().to_apply()));
-      handle_status =
-          computation->AddReduceInstruction(arg->reduce_request(), *to_apply);
-      break;
-    }
-    case OpRequest::kReducePrecisionRequest: {
-      handle_status = computation->AddReducePrecisionInstruction(
-          arg->reduce_precision_request());
-      break;
-    }
-    case OpRequest::kReduceWindowRequest: {
-      TF_ASSIGN_OR_RETURN(UserComputation * to_apply,
-                          computation_tracker_.Resolve(
-                              arg->reduce_window_request().to_apply()));
-      handle_status = computation->AddReduceWindowInstruction(
-          arg->reduce_window_request(), *to_apply);
-      break;
-    }
-    case OpRequest::kReshapeRequest:
-      handle_status =
-          computation->AddReshapeInstruction(arg->reshape_request());
-      break;
-    case OpRequest::kReverseRequest:
-      handle_status =
-          computation->AddReverseInstruction(arg->reverse_request());
-      break;
-    case OpRequest::kRngRequest:
-      handle_status = computation->AddRngInstruction(arg->rng_request());
-      break;
-    case OpRequest::kSelectAndScatterRequest: {
-      TF_ASSIGN_OR_RETURN(UserComputation * select,
-                          computation_tracker_.Resolve(
-                              arg->select_and_scatter_request().select()));
-      TF_ASSIGN_OR_RETURN(UserComputation * scatter,
-                          computation_tracker_.Resolve(
-                              arg->select_and_scatter_request().scatter()));
-      handle_status = computation->AddSelectAndScatterInstruction(
-          arg->select_and_scatter_request(), *select, *scatter);
-      break;
-    }
-    case OpRequest::kSliceRequest:
-      handle_status = computation->AddSliceInstruction(arg->slice_request());
-      break;
-    case OpRequest::kTernaryOpRequest:
-      handle_status =
-          computation->AddTernaryInstruction(arg->ternary_op_request());
-      break;
-    case OpRequest::kTraceRequest:
-      return computation->AddTraceInstruction(arg->trace_request());
-    case OpRequest::kTransposeRequest:
-      handle_status =
-          computation->AddTransposeInstruction(arg->transpose_request());
-      break;
-    case OpRequest::kUnaryOpRequest:
-      handle_status = computation->AddUnaryInstruction(arg->unary_op_request());
-      break;
-    case OpRequest::kVariadicOpRequest:
-      handle_status =
-          computation->AddVariadicInstruction(arg->variadic_op_request());
-      break;
-    case OpRequest::kWhileRequest: {
-      TF_ASSIGN_OR_RETURN(
-          UserComputation * condition,
-          computation_tracker_.Resolve(arg->while_request().condition()));
-      TF_ASSIGN_OR_RETURN(
-          UserComputation * body,
-          computation_tracker_.Resolve(arg->while_request().body()));
-      handle_status = computation->AddWhileInstruction(arg->while_request(),
-                                                       *condition, *body);
-      break;
-    }
-    case OpRequest::kSendRequest: {
-      TF_RETURN_IF_ERROR(
-          channel_tracker_.RegisterSend(arg->send_request().channel_handle()));
-      // Send does not return a value, but we need a handle to be able to
-      // set OpMetadata and OpSharding (device assignment).
-      handle_status = computation->AddSendInstruction(arg->send_request());
-      break;
-    }
-    case OpRequest::kRecvRequest: {
-      TF_RETURN_IF_ERROR(
-          channel_tracker_.RegisterRecv(arg->recv_request().channel_handle()));
-      handle_status = computation->AddRecvInstruction(arg->recv_request());
-      break;
-    }
-    case OpRequest::OP_NOT_SET:
-      return InvalidArgument("XLA service received OpRequest with OP_NOT_SET");
-    default:
-      return InvalidArgument("Unsupported operation in XLA service");
-  }
-  TF_ASSIGN_OR_RETURN(*result->mutable_output(), handle_status);
-
-  // We set the debug metadata here, because we slice off part of the OpRequest
-  // proto in the above switch statement.
-  TF_ASSIGN_OR_RETURN(ComputationDataHandle handle, handle_status);
-  TF_RETURN_IF_ERROR(computation->SetOpMetadata(handle, arg->metadata()));
-  if (arg->has_sharding()) {
-    TF_RETURN_IF_ERROR(computation->SetOpSharding(handle, arg->sharding()));
-  }
-  return Status::OK();
-}
-
-Status Service::SnapshotComputation(const SnapshotComputationRequest* arg,
-                                    SnapshotComputationResponse* result) {
-  TF_ASSIGN_OR_RETURN(
-      std::unique_ptr<SessionModule> module,
-      computation_tracker_.SnapshotComputation(arg->computation()));
-
-  result->set_allocated_module(module.release());
-
-  return Status::OK();
-}
-
-Status Service::LoadComputationSnapshot(
-    const LoadComputationSnapshotRequest* arg,
-    LoadComputationSnapshotResponse* result) {
-  TF_ASSIGN_OR_RETURN(*result->mutable_computation(),
-                      computation_tracker_.LoadSessionModule(arg->module()));
-  return Status::OK();
-}
-
 DeviceHandle Service::SingleComputationDeviceHandle() const {
   DeviceHandle device_handle;
   device_handle.set_handle(0);
diff --git a/tensorflow/compiler/xla/service/service.h b/tensorflow/compiler/xla/service/service.h
index 81fbd41957887aec763e1cfe165ad0d1d2ac2269..8748a4c1447eca691abc0f7ca48feda48ceb86e1 100644
--- a/tensorflow/compiler/xla/service/service.h
+++ b/tensorflow/compiler/xla/service/service.h
@@ -26,17 +26,12 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/allocation_tracker.h"
 #include "tensorflow/compiler/xla/service/backend.h"
 #include "tensorflow/compiler/xla/service/channel_tracker.h"
-#include "tensorflow/compiler/xla/service/compilation_cache.h"
-#include "tensorflow/compiler/xla/service/computation_tracker.h"
 #include "tensorflow/compiler/xla/service/device_memory_allocator.h"
 #include "tensorflow/compiler/xla/service/executable.h"
 #include "tensorflow/compiler/xla/service/execution_tracker.h"
 #include "tensorflow/compiler/xla/service/hlo_execution_profile.h"
 #include "tensorflow/compiler/xla/service/hlo_module.h"
 #include "tensorflow/compiler/xla/service/hlo_module_config.h"
-#include "tensorflow/compiler/xla/service/session.pb.h"
-#include "tensorflow/compiler/xla/service/user_computation.h"
-#include "tensorflow/compiler/xla/service/versioned_computation_handle.h"
 #include "tensorflow/compiler/xla/service_interface.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/types.h"
@@ -83,11 +78,6 @@ class Service : public ServiceInterface {
   static StatusOr<std::unique_ptr<Service>> NewService(
       const ServiceOptions& options);
 
-  // Creates a new computation with the given name.
-  // A unique ComputationHandle is returned.
-  Status Computation(const ComputationRequest* arg,
-                     ComputationResponse* result) override;
-
   // Unregisters a previously-allocated global handle.
   //
   // If the handle given is not currently allocated, a NOT_FOUND status is
@@ -100,35 +90,15 @@ class Service : public ServiceInterface {
   Status DeconstructTuple(const DeconstructTupleRequest* arg,
                           DeconstructTupleResponse* result) override;
 
-  // Modifies the provided computation so that subsequent executions
-  // will compute the provided ComputationDataHandle, rather than the
-  // last expression enqueued on that Computation.
-  Status SetReturnValue(const SetReturnValueRequest* arg,
-                        SetReturnValueResponse* results) override;
-
-  // Executes a computation with the provided global data passed as
-  // immutable arguments. Returns global data output and execution timing.
-  Status Execute(const ExecuteRequest* arg, ExecuteResponse* result) override;
-
   // Executes a computation with the provided global data passed as
   // immutable arguments. The request contains the whole computation graph.
   // Returns global data output and execution timing.
-  //
-  // TODO(b/74197823): This is a part of a NOT YET ready refactor.
   Status ExecuteGraph(const ExecuteGraphRequest* arg,
                       ExecuteResponse* result) override;
 
   // Executes one or more computations in parallel with the provided global data
   // passed as immutable arguments. Returns global data output for each
   // computation.
-  Status ExecuteParallel(const ExecuteParallelRequest* arg,
-                         ExecuteParallelResponse* result) override;
-
-  // Executes one or more computations in parallel with the provided global data
-  // passed as immutable arguments. Returns global data output for each
-  // computation.
-  //
-  // TODO(b/74197823): This is a part of a NOT YET ready refactor.
   Status ExecuteGraphParallel(const ExecuteGraphParallelRequest* arg,
                               ExecuteParallelResponse* result) override;
 
@@ -143,16 +113,6 @@ class Service : public ServiceInterface {
   Status GetDeviceHandles(const GetDeviceHandlesRequest* arg,
                           GetDeviceHandlesResponse* result) override;
 
-  // Asynchronously executes a computation with provided arguments. Invokes
-  // the provided computation with the provided global data passed as
-  // immutable arguments. Returns a handle to the execution.
-  //
-  // (Note: The corresponding function in xla::Client was removed as part of
-  // b/64116060, in an attempt to simplify our API.  We're keeping this around
-  // for now in case we want to expose this to clients in a different way.)
-  Status ExecuteAsync(const ExecuteAsyncRequest* arg,
-                      ExecuteAsyncResponse* result) override;
-
   // Waits until the specified execution is complete and returns the result.
   // Calling this API multiple times with the same execution handle returns the
   // method with an error since the execution handle is destroyed after the
@@ -190,13 +150,6 @@ class Service : public ServiceInterface {
   Status ResetDevice(const ResetDeviceRequest* arg,
                      ResetDeviceResponse* result) override;
 
-  // Tests if an expression is a compile-time constant.
-  Status IsConstant(const IsConstantRequest* arg,
-                    IsConstantResponse* result) override;
-
-  // Computes the value of a constant expression.
-  Status ComputeConstant(const ComputeConstantRequest* arg,
-                         ComputeConstantResponse* result) override;
   Status ComputeConstantGraph(const ComputeConstantGraphRequest* arg,
                               ComputeConstantResponse* result) override;
 
@@ -205,54 +158,15 @@ class Service : public ServiceInterface {
   Status GetShape(const GetShapeRequest* arg,
                   GetShapeResponse* result) override;
 
-  // Returns the program shape of the computation associated with the given
-  // handle.
-  Status GetComputationShape(const GetComputationShapeRequest* arg,
-                             GetComputationShapeResponse* result) override;
-
-  /////
-  // Computation-oriented methods.
-
-  // Enqueues an Op on the computation.
-  Status Op(const OpRequest* arg, OpResponse* result) override;
-
-  // Retrieves the inferred shape for a value within a computation.
-  Status GetLocalShape(const GetLocalShapeRequest* arg,
-                       GetLocalShapeResponse* result) override;
-
   // Retrieves the statistics of a computation.
-  Status GetComputationStats(const ComputationStatsRequest* arg,
-                             ComputationStatsResponse* result) override;
-
-  // Retrieves the statistics of a computation.
-  //
-  // TODO(b/74197823): This is a part of a NOT YET ready refactor.
   Status GetComputationGraphStats(const ComputationGraphStatsRequest* arg,
                                   ComputationStatsResponse* result) override;
 
-  // Snapshots the current state of a computation handle into a serializable
-  // protocol buffer form, so it can be loaded via
-  // LoadComputationSnapshot.
-  Status SnapshotComputation(const SnapshotComputationRequest* arg,
-                             SnapshotComputationResponse* result) override;
-
-  // Loads a computation from a serialized protocol buffer created via
-  // SnapshotComputation.
-  Status LoadComputationSnapshot(
-      const LoadComputationSnapshotRequest* arg,
-      LoadComputationSnapshotResponse* result) override;
-
   // Creates a unique channel handle that can be used for Send/Recv
   // instructions.
   Status CreateChannelHandle(const CreateChannelHandleRequest* arg,
                              CreateChannelHandleResponse* result) override;
 
-  // Returns the ComputationTracker of the current service instance.
-  // Only used in unit tests to access user computations from client.
-  const ComputationTracker& computation_tracker() {
-    return computation_tracker_;
-  }
-
   // Returns the backend used to execute computations.
   const Backend& backend() const { return *execute_backend_; }
   Backend* mutable_backend() { return execute_backend_.get(); }
@@ -263,8 +177,7 @@ class Service : public ServiceInterface {
   StatusOr<std::unique_ptr<HloModuleConfig>> CreateModuleConfig(
       const ProgramShape& program_shape,
       tensorflow::gtl::ArraySlice<const ShapedBuffer*> arguments,
-      const ExecutionOptions& execution_options,
-      const UserComputation* user_computation = nullptr);
+      const ExecutionOptions& execution_options);
 
   // Picks a parallel response and fills the result.
   Status PickParallelResponse(const ExecuteParallelResponse& parallel_result,
@@ -305,23 +218,13 @@ class Service : public ServiceInterface {
   StatusOr<std::unique_ptr<HloModuleConfig>> CreateModuleConfig(
       const ProgramShape& program_shape,
       tensorflow::gtl::ArraySlice<const Shape*> argument_shapes,
-      const ExecutionOptions* execution_options,
-      const UserComputation* user_computation = nullptr);
+      const ExecutionOptions* execution_options);
 
   // Builds an Executable for the given parameters.
   //
   // If device_allocator is not null, the compiler may use it to allocate temp
   // buffers, which the compiler is responsible for freeing.  The allocator
   // given here need not match the allocator used when running the executable.
-  StatusOr<std::unique_ptr<Executable>> BuildExecutable(
-      const VersionedComputationHandle& versioned_handle,
-      std::unique_ptr<HloModuleConfig> module_config, Backend* backend,
-      se::StreamExecutor* executor,
-      DeviceMemoryAllocator* device_allocator = nullptr);
-
-  // Builds an Executable for the given HLO module proto.
-  //
-  // TODO(b/74197823): This is a part of a NOT YET ready refactor.
   StatusOr<std::unique_ptr<Executable>> BuildExecutable(
       const HloModuleProto& module_proto,
       std::unique_ptr<HloModuleConfig> module_config, Backend* backend,
@@ -330,26 +233,12 @@ class Service : public ServiceInterface {
 
   // Same as BuildExecutable() above, but builds a list of Executables for the
   // given computations that may interact with each other.
-  StatusOr<std::vector<std::unique_ptr<Executable>>> BuildExecutables(
-      std::vector<VersionedComputationHandle> versioned_handles,
-      std::vector<std::unique_ptr<HloModuleConfig>> module_configs,
-      Backend* backend, std::vector<std::vector<se::StreamExecutor*>> executors,
-      DeviceMemoryAllocator* device_allocator);
   StatusOr<std::vector<std::unique_ptr<Executable>>> BuildExecutables(
       const std::vector<const HloModuleProto*>& module_protos,
       std::vector<std::unique_ptr<HloModuleConfig>> module_configs,
       Backend* backend, std::vector<std::vector<se::StreamExecutor*>> executors,
       DeviceMemoryAllocator* device_allocator);
 
-  // Similar to BuildExecutable, but look in the compilation cache for the
-  // executable first. If the executable is not in the cache, it is built and
-  // inserted into the cache.
-  StatusOr<std::shared_ptr<Executable>> BuildAndCacheExecutable(
-      const VersionedComputationHandle& versioned_handle,
-      std::unique_ptr<HloModuleConfig> module_config, Backend* backend,
-      se::StreamExecutor* executor, ExecutionProfile* profile,
-      DeviceMemoryAllocator* device_allocator = nullptr);
-
   // Runs the given executable with the given arguments and register the result
   // in the allocation tracker. The handle of the result from the tracker is
   // returned. If the parameter "profile" is not null, it points to an
@@ -372,17 +261,9 @@ class Service : public ServiceInterface {
       tensorflow::gtl::ArraySlice<string> result_tags,
       ExecutionProfile* profile);
 
-  // Convenience function for adding a function to a user computation.
-  template <typename RequestT, typename ResponseT>
-  Status AddInstruction(
-      const RequestT* arg, ResponseT* result,
-      const std::function<StatusOr<ComputationDataHandle>(UserComputation*)>&
-          adder);
-
   // Executes a single computation which has more than one target device.
   // The N devices are expected to all return an empty tuple, but one, which
   // will be the result of this computation.
-  Status ExecuteOneToN(const ExecuteRequest* arg, ExecuteResponse* result);
   Status ExecuteOneToN(const ExecuteGraphRequest* arg, ExecuteResponse* result);
 
   // Convenience function which checks whether the given shape_with_layout
@@ -405,9 +286,6 @@ class Service : public ServiceInterface {
 
   ServiceOptions options_;
 
-  // Tracks computations built via the API.
-  ComputationTracker computation_tracker_;
-
   // Tracks channels created via the API.
   ChannelTracker channel_tracker_;
 
@@ -417,9 +295,6 @@ class Service : public ServiceInterface {
   // Tracks asynchronously launched executions via the API.
   ExecutionTracker execution_tracker_;
 
-  // Cache containing previously built Executables.
-  CompilationCache compilation_cache_;
-
   // Backend to compile and execute computations on.
   std::unique_ptr<Backend> execute_backend_;
 
diff --git a/tensorflow/compiler/xla/service/session.proto b/tensorflow/compiler/xla/service/session.proto
deleted file mode 100644
index bb8d1cd2a106ea3e5bb61eee5052bd60c38cd0e2..0000000000000000000000000000000000000000
--- a/tensorflow/compiler/xla/service/session.proto
+++ /dev/null
@@ -1,85 +0,0 @@
-/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-==============================================================================*/
-
-// This proto file defines messages which store the state of XLA
-// computations within the XLA service. A computation is stored as a record
-// of the operation requests used to build it.
-syntax = "proto3";
-
-import "tensorflow/compiler/xla/xla_data.proto";
-
-package xla;
-
-// Describes a single operation request.
-message OperationRequest {
-  ComputationDataHandle output_handle = 1;
-  Shape output_shape = 2;
-
-  // For operations which call embedded computations such as "Map", these are
-  // the version(s) that the embedded computation should be called at. A version
-  // value of a computation is the ComputationDataHandle of the root of the
-  // computation at the point in time.
-  //
-  // "Call", "Map", "Reduce", and "ReduceWindow" operations take a single
-  // embedded computation so this field will have a single value for those
-  // operations.
-  //
-  // "While" operation takes two; index 0 is the "condition" version and index 1
-  // is the "body" version.
-  repeated int64 embedded_computation_versions = 3;
-
-  // The actual request, which in itself is a tagged union of all possible
-  // operation request types.
-  OpRequest request = 4;
-}
-
-// Describes a sequence of operation requests which define an XLA
-// computation.
-message SessionComputation {
-  string name = 1;
-
-  // The ComputationHandle used to refer to this computation in the XLA
-  // service.
-  ComputationHandle computation_handle = 2;
-
-  // Map from ComputationDataHandle value to operation request. The highest
-  // ComputationDataHandle value corresponds to the root of the computation.
-  map<int64, OperationRequest> requests = 3;
-}
-
-// Describes a group of SessionComputations with an "entry point" computation
-// that may refer to the other non-entry (AKA embedded) computations.
-//
-// This message is used to serialize a computation that has been built via the
-// XLA service API, along with its dependencies, for purposes such as
-// analysis/replay/file-storage.
-message SessionModule {
-  // The entry computation, which was requested for serialization. This may have
-  // referred to embedded computations, which are reflected below.
-  SessionComputation entry = 1;
-
-  // Embedded computations that are transitively referred to by the entry
-  // computation.
-  repeated SessionComputation embedded_computations = 2;
-
-  // The arguments passed to the computation.
-  repeated LiteralProto arguments = 3;
-
-  // The result of the computation.
-  LiteralProto result = 4;
-
-  // The name of the platform used to run the computation.
-  string execution_platform = 5;
-}
diff --git a/tensorflow/compiler/xla/service/shape_inference.cc b/tensorflow/compiler/xla/service/shape_inference.cc
index 3500978bdd808f0c7684d14a05636d90105aa594..bd98e86b08b7507b4a7a0d1a7ebac4b654ff2171 100644
--- a/tensorflow/compiler/xla/service/shape_inference.cc
+++ b/tensorflow/compiler/xla/service/shape_inference.cc
@@ -44,129 +44,6 @@ namespace xla {
 
 namespace {
 
-// Return the UnaryOperation proto enum value associated with the given HLO
-// opcode.
-UnaryOperation OpcodeToUnaryOperation(HloOpcode opcode) {
-  switch (opcode) {
-    case HloOpcode::kAbs:
-      return UNOP_ABS;
-    case HloOpcode::kCeil:
-      return UNOP_CEIL;
-    case HloOpcode::kClz:
-      return UNOP_CLZ;
-    case HloOpcode::kCos:
-      return UNOP_COS;
-    case HloOpcode::kExp:
-      return UNOP_EXP;
-    case HloOpcode::kExpm1:
-      return UNOP_EXPM1;
-    case HloOpcode::kFloor:
-      return UNOP_FLOOR;
-    case HloOpcode::kImag:
-      return UNOP_IMAG;
-    case HloOpcode::kIsFinite:
-      return UNOP_IS_FINITE;
-    case HloOpcode::kLog:
-      return UNOP_LOG;
-    case HloOpcode::kLog1p:
-      return UNOP_LOG1P;
-    case HloOpcode::kNot:
-      return UNOP_NOT;
-    case HloOpcode::kNegate:
-      return UNOP_NEGATE;
-    case HloOpcode::kReal:
-      return UNOP_REAL;
-    case HloOpcode::kRoundNearestAfz:
-      return UNOP_ROUND_NEAREST_AFZ;
-    case HloOpcode::kSign:
-      return UNOP_SIGN;
-    case HloOpcode::kSin:
-      return UNOP_SIN;
-    case HloOpcode::kSort:
-      return UNOP_SORT;
-    case HloOpcode::kTanh:
-      return UNOP_TANH;
-    default:
-      LOG(FATAL) << "Unhandled opcode for conversion to unary operation: "
-                 << opcode;
-  }
-}
-
-// Return the BinaryOperation proto enum value associated with the given HLO
-// opcode.
-BinaryOperation OpcodeToBinaryOperation(HloOpcode opcode) {
-  switch (opcode) {
-    case HloOpcode::kAtan2:
-      return BINOP_ATAN2;
-    case HloOpcode::kComplex:
-      return BINOP_COMPLEX;
-    case HloOpcode::kMultiply:
-      return BINOP_MUL;
-    case HloOpcode::kAdd:
-      return BINOP_ADD;
-    case HloOpcode::kSubtract:
-      return BINOP_SUB;
-    case HloOpcode::kDivide:
-      return BINOP_DIV;
-    case HloOpcode::kEq:
-      return BINOP_EQ;
-    case HloOpcode::kGe:
-      return BINOP_GE;
-    case HloOpcode::kGt:
-      return BINOP_GT;
-    case HloOpcode::kLe:
-      return BINOP_LE;
-    case HloOpcode::kLt:
-      return BINOP_LT;
-    case HloOpcode::kNe:
-      return BINOP_NE;
-    case HloOpcode::kMaximum:
-      return BINOP_MAX;
-    case HloOpcode::kMinimum:
-      return BINOP_MIN;
-    case HloOpcode::kPower:
-      return BINOP_POW;
-    case HloOpcode::kRemainder:
-      return BINOP_REM;
-    case HloOpcode::kOr:
-      return BINOP_OR;
-    case HloOpcode::kAnd:
-      return BINOP_AND;
-    case HloOpcode::kShiftLeft:
-      return BINOP_SHIFT_LEFT;
-    case HloOpcode::kShiftRightArithmetic:
-      return BINOP_SHIFT_RIGHT_ARITHMETIC;
-    case HloOpcode::kShiftRightLogical:
-      return BINOP_SHIFT_RIGHT_LOGICAL;
-    default:
-      LOG(FATAL) << "unhandled opcode " << opcode;
-  }
-}
-
-// Return the TernaryOperation proto enum value associated with the given HLO
-// opcode.
-TernaryOperation OpcodeToTernaryOperation(HloOpcode opcode) {
-  switch (opcode) {
-    case HloOpcode::kClamp:
-      return TRIOP_CLAMP;
-    case HloOpcode::kSelect:
-      return TRIOP_SELECT;
-    default:
-      LOG(FATAL) << "unhandled opcode " << opcode;
-  }
-}
-
-// Return the VariadicOperation proto enum value associated with the given HLO
-// opcode.
-VariadicOperation OpcodeToVariadicOperation(HloOpcode opcode) {
-  switch (opcode) {
-    case HloOpcode::kTuple:
-      return VAROP_TUPLE;
-    default:
-      LOG(FATAL) << "unhandled opcode " << opcode;
-  }
-}
-
 // Returns true if no element is present in slice more than once.
 bool AllUnique(tensorflow::gtl::ArraySlice<int64> slice) {
   return std::set<int64>(slice.begin(), slice.end()).size() == slice.size();
@@ -316,88 +193,86 @@ StatusOr<Shape> InferWindowOutputShape(const Shape& base_shape,
 /* static */ StatusOr<Shape> ShapeInference::InferUnaryOpShape(
     HloOpcode opcode, const Shape& shape) {
   // There is no copy operation at the proto level, so handle copy explicitly.
-  if (opcode == HloOpcode::kCopy) {
+  // A domain shape is the same as the input one.
+  if (opcode == HloOpcode::kCopy || opcode == HloOpcode::kDomain) {
     return shape;
   }
 
-  return InferUnaryOpShape(OpcodeToUnaryOperation(opcode), shape);
-}
+  TF_RETURN_IF_ERROR(
+      ExpectNotTupleOrOpaque(shape, "operand of unary operation"));
 
-/* static */ StatusOr<Shape> ShapeInference::InferUnaryOpShape(
-    UnaryOperation operation, const Shape& arg) {
-  TF_RETURN_IF_ERROR(ExpectNotTupleOrOpaque(arg, "operand of unary operation"));
-
-  TF_DCHECK_OK(ShapeUtil::ValidateShapeWithOptionalLayout(arg));
-  switch (operation) {
-    case UNOP_FLOOR:
-    case UNOP_CEIL:
-      if (!ShapeUtil::ElementIsFloating(arg)) {
+  TF_DCHECK_OK(ShapeUtil::ValidateShapeWithOptionalLayout(shape));
+  switch (opcode) {
+    case HloOpcode::kFloor:
+    case HloOpcode::kCeil:
+      if (!ShapeUtil::ElementIsFloating(shape)) {
         return InvalidArgument(
             "Expected element type in shape to be floating for floor/ceil "
             "operation; got %s.",
-            PrimitiveType_Name(arg.element_type()).c_str());
+            PrimitiveType_Name(shape.element_type()).c_str());
       }
-      return arg;
-    case UNOP_COS:
-    case UNOP_SIN:
-    case UNOP_EXP:
-    case UNOP_EXPM1:
-    case UNOP_LOG:
-    case UNOP_LOG1P:
-    case UNOP_TANH:
-      if (!ShapeUtil::ElementIsFloating(arg) &&
-          !ShapeUtil::ElementIsComplex(arg)) {
+      return shape;
+    case HloOpcode::kCos:
+    case HloOpcode::kSin:
+    case HloOpcode::kExp:
+    case HloOpcode::kExpm1:
+    case HloOpcode::kLog:
+    case HloOpcode::kLog1p:
+    case HloOpcode::kTanh:
+      if (!ShapeUtil::ElementIsFloating(shape) &&
+          !ShapeUtil::ElementIsComplex(shape)) {
         return InvalidArgument(
             "Expected element type in shape to be floating or complex for "
             "sin/cos/exp/log/tanh operation; got %s.",
-            PrimitiveType_Name(arg.element_type()).c_str());
+            PrimitiveType_Name(shape.element_type()).c_str());
       }
-      return arg;
-    case UNOP_REAL:
-    case UNOP_IMAG:
-      if (!ShapeUtil::ElementIsComplex(arg)) {
+      return shape;
+    case HloOpcode::kReal:
+    case HloOpcode::kImag:
+      if (!ShapeUtil::ElementIsComplex(shape)) {
         return InvalidArgument(
             "Expected element type in shape to be complex for real/imag "
             "operation; got %s.",
-            PrimitiveType_Name(arg.element_type()).c_str());
+            PrimitiveType_Name(shape.element_type()).c_str());
       }
-      return ShapeUtil::ChangeElementType(arg, F32);
-    case UNOP_ABS:
-      if (ShapeUtil::ElementIsComplex(arg)) {
+      return ShapeUtil::ChangeElementType(shape, F32);
+    case HloOpcode::kAbs:
+      if (ShapeUtil::ElementIsComplex(shape)) {
         return ShapeUtil::ChangeElementType(
-            arg, primitive_util::ComplexComponentType(arg.element_type()));
+            shape, primitive_util::ComplexComponentType(shape.element_type()));
       }
-      return arg;
-    case UNOP_CLZ:
-    case UNOP_NEGATE:
-    case UNOP_ROUND_NEAREST_AFZ:
-    case UNOP_SIGN:
-    case UNOP_SORT:
-      return arg;
-
-    case UNOP_NOT:
-      if (arg.element_type() != PRED &&
-          !primitive_util::IsIntegralType(arg.element_type())) {
+      return shape;
+    case HloOpcode::kClz:
+    case HloOpcode::kNegate:
+    case HloOpcode::kRoundNearestAfz:
+    case HloOpcode::kSign:
+    case HloOpcode::kSort:
+      return shape;
+
+    case HloOpcode::kNot:
+      if (shape.element_type() != PRED &&
+          !primitive_util::IsIntegralType(shape.element_type())) {
         return InvalidArgument(
             "Expected pred or an integral element type in argument to Not "
             "operation; got %s.",
-            PrimitiveType_Name(arg.element_type()).c_str());
+            PrimitiveType_Name(shape.element_type()).c_str());
       }
-      return arg;
+      return shape;
 
-    case UNOP_IS_FINITE:
-      if (!ShapeUtil::ElementIsFloating(arg)) {
+    case HloOpcode::kIsFinite:
+      if (!ShapeUtil::ElementIsFloating(shape)) {
         return InvalidArgument(
-            "Expected element type in shape to be floating point for IsFinite "
+            "Expected element type in shape to be floating "
+            "point for IsFinite "
             "operation; got %s.",
-            PrimitiveType_Name(arg.element_type()).c_str());
+            PrimitiveType_Name(shape.element_type()).c_str());
       }
-      return ShapeUtil::ChangeElementType(arg, PRED);
+      return ShapeUtil::ChangeElementType(shape, PRED);
 
     default:
       return InvalidArgument(
           "Unknown operation for unary shape inference: \"%s\".",
-          UnaryOperation_Name(operation).c_str());
+          HloOpcodeString(opcode).c_str());
   }
 }
 
@@ -462,6 +337,17 @@ StatusOr<Shape> InferWindowOutputShape(const Shape& base_shape,
   return ShapeUtil::MakeShape(element_type, new_dimensions);
 }
 
+/* static */ StatusOr<Shape> ShapeInference::InferTokenShape(
+    tensorflow::gtl::ArraySlice<const Shape*> arg_shapes) {
+  for (const Shape* arg_shape : arg_shapes) {
+    if (arg_shape->element_type() != TOKEN) {
+      return InvalidArgument(
+          "Operands of token instructions must be TOKEN types.");
+    }
+  }
+  return ShapeUtil::MakeTokenShape();
+}
+
 /* static */ StatusOr<Shape> ShapeInference::InferConvertShape(
     const Shape& operand_shape, PrimitiveType new_element_type) {
   auto old_element_type = operand_shape.element_type();
@@ -767,8 +653,9 @@ Status ValidateDotDimensionNumbers(
 }
 
 /* static */ StatusOr<Shape>
-ShapeInference::InferDegenerateDimensionBroadcastShape(
-    BinaryOperation operation, const Shape& lhs, const Shape& rhs) {
+ShapeInference::InferDegenerateDimensionBroadcastShape(HloOpcode operation,
+                                                       const Shape& lhs,
+                                                       const Shape& rhs) {
   TF_RET_CHECK(ShapeUtil::Rank(lhs) == ShapeUtil::Rank(rhs));
 
   // The shapes have to be compatible. That is, if some dimension d has a
@@ -786,7 +673,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     } else {
       return InvalidArgument(
           "Binary op %s with incompatible shapes: %s and %s.",
-          BinaryOperation_Name(operation).c_str(),
+          HloOpcodeString(operation).c_str(),
           ShapeUtil::HumanString(lhs).c_str(),
           ShapeUtil::HumanString(rhs).c_str());
     }
@@ -796,8 +683,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
 }
 
 /* static */ StatusOr<Shape> ShapeInference::InferInDimBroadcastShape(
-    BinaryOperation operation, const Shape& smaller_shape,
-    const Shape& larger_shape,
+    const Shape& smaller_shape, const Shape& larger_shape,
     tensorflow::gtl::ArraySlice<int64> broadcast_dimensions) {
   if (broadcast_dimensions.empty() && !ShapeUtil::IsScalar(smaller_shape)) {
     // Reject "magic" inference for binops on different shapes, requiring
@@ -898,7 +784,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
 }
 
 /* static */ StatusOr<Shape> ShapeInference::InferElementwiseBinaryOpShape(
-    BinaryOperation operation, const Shape& lhs, const Shape& rhs,
+    HloOpcode operation, const Shape& lhs, const Shape& rhs,
     tensorflow::gtl::ArraySlice<int64> broadcast_dimensions) {
   TF_RETURN_IF_ERROR(
       ExpectNotTupleOrOpaque(lhs, "lhs of elementwise binary operation"));
@@ -908,8 +794,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
   if (!ShapeUtil::SameElementTypeIgnoringFpPrecision(lhs, rhs)) {
     return InvalidArgument(
         "Binary op %s with different element types: %s and %s.",
-        BinaryOperation_Name(operation).c_str(),
-        ShapeUtil::HumanString(lhs).c_str(),
+        HloOpcodeString(operation).c_str(), ShapeUtil::HumanString(lhs).c_str(),
         ShapeUtil::HumanString(rhs).c_str());
   }
 
@@ -942,10 +827,9 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
         ShapeUtil::Rank(lhs) > ShapeUtil::Rank(rhs) ? rhs : lhs;
 
     // After InDim broadcasting, perform degenerate dimensions broadcasting.
-    TF_ASSIGN_OR_RETURN(
-        Shape indim_broadcast_shape,
-        InferInDimBroadcastShape(operation, smaller_shape, larger_shape,
-                                 broadcast_dimensions));
+    TF_ASSIGN_OR_RETURN(Shape indim_broadcast_shape,
+                        InferInDimBroadcastShape(smaller_shape, larger_shape,
+                                                 broadcast_dimensions));
 
     return InferDegenerateDimensionBroadcastShape(
         operation, indim_broadcast_shape, larger_shape);
@@ -954,51 +838,44 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
 
 /* static */ StatusOr<Shape> ShapeInference::InferBinaryOpShape(
     HloOpcode opcode, const HloInstruction* lhs, const HloInstruction* rhs) {
-  return InferBinaryOpShape(OpcodeToBinaryOperation(opcode), lhs->shape(),
-                            rhs->shape(), /*broadcast_dimensions=*/{});
+  return InferBinaryOpShape(opcode, lhs->shape(), rhs->shape(),
+                            /*broadcast_dimensions=*/{});
 }
 
 /* static */ StatusOr<Shape> ShapeInference::InferBinaryOpShape(
     HloOpcode opcode, const Shape& lhs, const Shape& rhs,
     tensorflow::gtl::ArraySlice<int64> broadcast_dimensions) {
-  return InferBinaryOpShape(OpcodeToBinaryOperation(opcode), lhs, rhs,
-                            broadcast_dimensions);
-}
-
-/* static */ StatusOr<Shape> ShapeInference::InferBinaryOpShape(
-    BinaryOperation operation, const Shape& lhs, const Shape& rhs,
-    tensorflow::gtl::ArraySlice<int64> broadcast_dimensions) {
   VLOG(2) << tensorflow::strings::Printf(
       "inferring shape for <%s>(%s, %s) with broadcast_dimensions={%s}",
-      BinaryOperation_Name(operation).c_str(),
-      ShapeUtil::HumanString(lhs).c_str(), ShapeUtil::HumanString(rhs).c_str(),
+      HloOpcodeString(opcode).c_str(), ShapeUtil::HumanString(lhs).c_str(),
+      ShapeUtil::HumanString(rhs).c_str(),
       Join(broadcast_dimensions, ", ").c_str());
   TF_DCHECK_OK(ShapeUtil::ValidateShapeWithOptionalLayout(lhs));
   TF_DCHECK_OK(ShapeUtil::ValidateShapeWithOptionalLayout(rhs));
 
   TF_RETURN_IF_ERROR(ExpectNotTupleOrOpaque(
       lhs, tensorflow::strings::StrCat("lhs of binary operation ",
-                                       BinaryOperation_Name(operation))));
+                                       HloOpcodeString(opcode))));
   TF_RETURN_IF_ERROR(ExpectNotTupleOrOpaque(
       rhs, tensorflow::strings::StrCat("rhs of binary operation ",
-                                       BinaryOperation_Name(operation))));
-  switch (operation) {
-    case BINOP_MAX:
-    case BINOP_MIN:
-    case BINOP_SUB:
-    case BINOP_ADD:
-    case BINOP_ATAN2:
-    case BINOP_POW:
-    case BINOP_DIV:
-    case BINOP_REM:
-    case BINOP_MUL:
-    case BINOP_SHIFT_LEFT:
-    case BINOP_SHIFT_RIGHT_ARITHMETIC:
-    case BINOP_SHIFT_RIGHT_LOGICAL:
-      return InferElementwiseBinaryOpShape(operation, lhs, rhs,
+                                       HloOpcodeString(opcode))));
+  switch (opcode) {
+    case HloOpcode::kMaximum:
+    case HloOpcode::kMinimum:
+    case HloOpcode::kSubtract:
+    case HloOpcode::kAdd:
+    case HloOpcode::kAtan2:
+    case HloOpcode::kPower:
+    case HloOpcode::kDivide:
+    case HloOpcode::kRemainder:
+    case HloOpcode::kMultiply:
+    case HloOpcode::kShiftLeft:
+    case HloOpcode::kShiftRightArithmetic:
+    case HloOpcode::kShiftRightLogical:
+      return InferElementwiseBinaryOpShape(opcode, lhs, rhs,
                                            broadcast_dimensions);
 
-    case BINOP_COMPLEX: {
+    case HloOpcode::kComplex: {
       if (!ShapeUtil::ElementIsFloating(lhs)) {
         return InvalidArgument(
             "Expected element type in shape to be floating for complex compose "
@@ -1006,7 +883,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
             PrimitiveType_Name(lhs.element_type()).c_str());
       }
       TF_ASSIGN_OR_RETURN(const Shape& shape,
-                          InferElementwiseBinaryOpShape(operation, lhs, rhs,
+                          InferElementwiseBinaryOpShape(opcode, lhs, rhs,
                                                         broadcast_dimensions));
       if (lhs.element_type() == F32 && rhs.element_type() == F32) {
         return ShapeUtil::ChangeElementType(shape, C64);
@@ -1014,8 +891,8 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
         return Unimplemented("Complex component type is not implemented.");
       }
     }
-    case BINOP_AND:
-    case BINOP_OR:
+    case HloOpcode::kAnd:
+    case HloOpcode::kOr:
       if (lhs.element_type() != PRED &&
           !primitive_util::IsIntegralType(lhs.element_type())) {
         return InvalidArgument(
@@ -1023,24 +900,24 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
             "got %s.",
             PrimitiveType_Name(lhs.element_type()).c_str());
       }
-      return InferElementwiseBinaryOpShape(operation, lhs, rhs,
+      return InferElementwiseBinaryOpShape(opcode, lhs, rhs,
                                            broadcast_dimensions);
-    case BINOP_EQ:
-    case BINOP_GE:
-    case BINOP_GT:
-    case BINOP_LE:
-    case BINOP_LT:
-    case BINOP_NE: {
+    case HloOpcode::kEq:
+    case HloOpcode::kGe:
+    case HloOpcode::kGt:
+    case HloOpcode::kLe:
+    case HloOpcode::kLt:
+    case HloOpcode::kNe: {
       TF_ASSIGN_OR_RETURN(const Shape& shape,
-                          InferElementwiseBinaryOpShape(operation, lhs, rhs,
+                          InferElementwiseBinaryOpShape(opcode, lhs, rhs,
                                                         broadcast_dimensions));
       return ShapeUtil::ChangeElementType(shape, PRED);
     }
     default:
       return Unimplemented(
           "Binary op shape inference: %s; lhs: %s; rhs: %s is not implemented.",
-          BinaryOperation_Name(operation).c_str(),
-          lhs.ShortDebugString().c_str(), rhs.ShortDebugString().c_str());
+          HloOpcodeString(opcode).c_str(), lhs.ShortDebugString().c_str(),
+          rhs.ShortDebugString().c_str());
   }
 }
 
@@ -1052,23 +929,17 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
 
 /* static */ StatusOr<Shape> ShapeInference::InferTernaryOpShape(
     HloOpcode opcode, const Shape& lhs, const Shape& rhs, const Shape& ehs) {
-  return InferTernaryOpShape(OpcodeToTernaryOperation(opcode), lhs, rhs, ehs);
-}
-
-/* static */ StatusOr<Shape> ShapeInference::InferTernaryOpShape(
-    TernaryOperation operation, const Shape& lhs, const Shape& rhs,
-    const Shape& ehs) {
   TF_DCHECK_OK(ShapeUtil::ValidateShapeWithOptionalLayout(lhs));
   TF_DCHECK_OK(ShapeUtil::ValidateShapeWithOptionalLayout(rhs));
   TF_DCHECK_OK(ShapeUtil::ValidateShapeWithOptionalLayout(ehs));
-  switch (operation) {
-    case TRIOP_CLAMP:
+  switch (opcode) {
+    case HloOpcode::kClamp:
       return InferClampShape(lhs, rhs, ehs);
-    case TRIOP_SELECT:
+    case HloOpcode::kSelect:
       return InferSelectShape(lhs, rhs, ehs);
     default:
       return InvalidArgument("Unknown operation %s.",
-                             TernaryOperation_Name(operation).c_str());
+                             HloOpcodeString(opcode).c_str());
   }
 }
 
@@ -1085,18 +956,11 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
 /* static */ StatusOr<Shape> ShapeInference::InferVariadicOpShape(
     HloOpcode opcode,
     tensorflow::gtl::ArraySlice<const Shape*> operand_shapes) {
-  return InferVariadicOpShape(OpcodeToVariadicOperation(opcode),
-                              operand_shapes);
-}
-
-/* static */ StatusOr<Shape> ShapeInference::InferVariadicOpShape(
-    VariadicOperation operation,
-    tensorflow::gtl::ArraySlice<const Shape*> operand_shapes) {
   for (const Shape* shape : operand_shapes) {
     TF_DCHECK_OK(ShapeUtil::ValidateShapeWithOptionalLayout(*shape));
   }
-  switch (operation) {
-    case VAROP_TUPLE: {
+  switch (opcode) {
+    case HloOpcode::kTuple: {
       Shape result = ShapeUtil::MakeTupleShape({});
       for (const Shape* shape : operand_shapes) {
         ShapeUtil::AppendShapeToTuple(*shape, &result);
@@ -1105,7 +969,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     }
     default:
       return InvalidArgument("Unknown operation %s.",
-                             VariadicOperation_Name(operation).c_str());
+                             HloOpcodeString(opcode).c_str());
   }
 }
 
diff --git a/tensorflow/compiler/xla/service/shape_inference.h b/tensorflow/compiler/xla/service/shape_inference.h
index 9da2c99b4177f08ece8daabaf2922ddd7e947a1b..f1f7b50902d899c0c629c3098d80fc400fb1388d 100644
--- a/tensorflow/compiler/xla/service/shape_inference.h
+++ b/tensorflow/compiler/xla/service/shape_inference.h
@@ -46,8 +46,6 @@ class ShapeInference {
  public:
   // Infers the shape produced by applying the given unary operation to the
   // given input shape.
-  static StatusOr<Shape> InferUnaryOpShape(UnaryOperation operation,
-                                           const Shape& arg);
   static StatusOr<Shape> InferUnaryOpShape(HloOpcode opcode,
                                            const Shape& shape);
   static StatusOr<Shape> InferUnaryOpShape(HloOpcode opcode,
@@ -55,9 +53,6 @@ class ShapeInference {
 
   // Infers the shape produced by applying the given binary operation to the
   // given input shapes.
-  static StatusOr<Shape> InferBinaryOpShape(
-      BinaryOperation operation, const Shape& lhs, const Shape& rhs,
-      tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
   static StatusOr<Shape> InferBinaryOpShape(
       HloOpcode opcode, const Shape& lhs, const Shape& rhs,
       tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
@@ -67,9 +62,6 @@ class ShapeInference {
 
   // Infers the shape produced by applying the given ternary operation to the
   // given input shapes.
-  static StatusOr<Shape> InferTernaryOpShape(TernaryOperation operation,
-                                             const Shape& lhs, const Shape& rhs,
-                                             const Shape& ehs);
   static StatusOr<Shape> InferTernaryOpShape(HloOpcode opcode, const Shape& lhs,
                                              const Shape& rhs,
                                              const Shape& ehs);
@@ -80,9 +72,6 @@ class ShapeInference {
 
   // Infers the shape produced by applying the given variadic operation to the
   // given input operand shapes.
-  static StatusOr<Shape> InferVariadicOpShape(
-      VariadicOperation operation,
-      tensorflow::gtl::ArraySlice<const Shape*> operand_shapes);
   static StatusOr<Shape> InferVariadicOpShape(
       HloOpcode opcode,
       tensorflow::gtl::ArraySlice<const Shape*> operand_shapes);
@@ -227,6 +216,13 @@ class ShapeInference {
   static StatusOr<Shape> InferConcatOpShape(
       tensorflow::gtl::ArraySlice<const Shape*> arg_shapes, int64 dimension);
 
+  // Infers the shape produced by a kGenerateToken operation. Trivially this
+  // shape is always a TOKEN shape. However, ShapeInference serves two purposes:
+  // inferring shapes and checking operand shapes. This method verifies that the
+  // operand shapes are all TOKENs.
+  static StatusOr<Shape> InferTokenShape(
+      tensorflow::gtl::ArraySlice<const Shape*> arg_shapes);
+
   // Helper that validates the given operand shape can be converted to the
   // target output_shape via a convert instruction -- the requirement is that
   // the shape is identical except for the element type.
@@ -279,7 +275,7 @@ class ShapeInference {
   // the LHS and a single element in the RHS to produce a single output element,
   // even in the presence of broadcasting of one of the operands over the other.
   static StatusOr<Shape> InferElementwiseBinaryOpShape(
-      BinaryOperation operation, const Shape& lhs, const Shape& rhs,
+      HloOpcode operation, const Shape& lhs, const Shape& rhs,
       tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
 
   // Helper for inferring the shape of Clamp ops.
@@ -295,7 +291,7 @@ class ShapeInference {
   // dimension broadcasting (a dimension of size 1 in one operand is broadcast
   // up to match the size of the dimension in the other operand).
   static StatusOr<Shape> InferDegenerateDimensionBroadcastShape(
-      BinaryOperation operation, const Shape& lhs, const Shape& rhs);
+      HloOpcode operation, const Shape& lhs, const Shape& rhs);
 
   // Helper for inferring shapes of binary operations using "InDim"
   // broadcasting. This is the broadcasting used in the *InDim binary operations
@@ -303,8 +299,7 @@ class ShapeInference {
   // lower-rank shape than larger_shape. Returns the shape that the
   // smaller_shape is broadcast to.
   static StatusOr<Shape> InferInDimBroadcastShape(
-      BinaryOperation operation, const Shape& smaller_shape,
-      const Shape& larger_shape,
+      const Shape& smaller_shape, const Shape& larger_shape,
       tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
 
   TF_DISALLOW_COPY_AND_ASSIGN(ShapeInference);
diff --git a/tensorflow/compiler/xla/service/shape_inference_test.cc b/tensorflow/compiler/xla/service/shape_inference_test.cc
index 0e61994a786b53a295ef9c9c2287b28fbf754d9b..6d017dffe2d8f927abad4a62bff7fe41bc871975 100644
--- a/tensorflow/compiler/xla/service/shape_inference_test.cc
+++ b/tensorflow/compiler/xla/service/shape_inference_test.cc
@@ -101,8 +101,8 @@ class SelectAndScatterShapeInferenceTest : public ShapeInferenceTest {
 
 TEST_F(ShapeInferenceTest, UnaryNegateMatrix) {
   Shape matrix_shape = ShapeUtil::MakeShape(F32, {128, 64});
-  auto inferred_status = ShapeInference::InferUnaryOpShape(
-      UnaryOperation::UNOP_NEGATE, matrix_shape);
+  auto inferred_status =
+      ShapeInference::InferUnaryOpShape(HloOpcode::kNegate, matrix_shape);
   ASSERT_IS_OK(inferred_status.status());
   ASSERT_TRUE(ShapeUtil::Equal(matrix_shape, inferred_status.ValueOrDie()));
 }
@@ -110,14 +110,14 @@ TEST_F(ShapeInferenceTest, UnaryNegateMatrix) {
 TEST_F(ShapeInferenceTest, SelectScalarPredBetweenTuples) {
   Shape tuple = ShapeUtil::MakeTupleShape({s32_, f32_});
   auto inferred_status = ShapeInference::InferTernaryOpShape(
-      TernaryOperation::TRIOP_SELECT, pred_, tuple, tuple);
+      HloOpcode::kSelect, pred_, tuple, tuple);
   ASSERT_IS_OK(inferred_status.status());
   ASSERT_TRUE(ShapeUtil::Equal(tuple, inferred_status.ValueOrDie()));
 }
 
 TEST_F(ShapeInferenceTest, SelectScalarPredBetweenArrays) {
   auto inferred_status = ShapeInference::InferTernaryOpShape(
-      TernaryOperation::TRIOP_SELECT, pred_, matrix_64_48_, matrix_64_48_);
+      HloOpcode::kSelect, pred_, matrix_64_48_, matrix_64_48_);
   ASSERT_IS_OK(inferred_status.status());
   ASSERT_TRUE(ShapeUtil::Equal(matrix_64_48_, inferred_status.ValueOrDie()));
 }
@@ -125,34 +125,34 @@ TEST_F(ShapeInferenceTest, SelectScalarPredBetweenArrays) {
 TEST_F(ShapeInferenceTest, SelectArrayPredBetweenArrays) {
   auto predarray = ShapeUtil::MakeShape(PRED, {64, 48});
   auto inferred_status = ShapeInference::InferTernaryOpShape(
-      TernaryOperation::TRIOP_SELECT, predarray, matrix_64_48_, matrix_64_48_);
+      HloOpcode::kSelect, predarray, matrix_64_48_, matrix_64_48_);
   ASSERT_IS_OK(inferred_status.status());
   ASSERT_TRUE(ShapeUtil::Equal(matrix_64_48_, inferred_status.ValueOrDie()));
 }
 
 TEST_F(ShapeInferenceTest, SelectBadShapes) {
   auto inferred_status_error1 = ShapeInference::InferTernaryOpShape(
-      TernaryOperation::TRIOP_SELECT, pred_, matrix_64_48_, matrix_32_64_);
+      HloOpcode::kSelect, pred_, matrix_64_48_, matrix_32_64_);
   ASSERT_FALSE(inferred_status_error1.ok());
   ASSERT_THAT(inferred_status_error1.status().error_message(),
               HasSubstr("Operands to select must be the same shape"));
 
   auto inferred_status_error2 = ShapeInference::InferTernaryOpShape(
-      TernaryOperation::TRIOP_SELECT, s32_, matrix_64_48_, matrix_64_48_);
+      HloOpcode::kSelect, s32_, matrix_64_48_, matrix_64_48_);
   ASSERT_FALSE(inferred_status_error2.ok());
   ASSERT_THAT(inferred_status_error2.status().error_message(),
               HasSubstr("pred operand must have PRED"));
 
   auto inferred_status_error3 = ShapeInference::InferTernaryOpShape(
-      TernaryOperation::TRIOP_SELECT, ShapeUtil::MakeShape(PRED, {64}),
-      matrix_64_48_, matrix_64_48_);
+      HloOpcode::kSelect, ShapeUtil::MakeShape(PRED, {64}), matrix_64_48_,
+      matrix_64_48_);
   ASSERT_FALSE(inferred_status_error3.ok());
   ASSERT_THAT(inferred_status_error3.status().error_message(),
               HasSubstr("with non-scalar predicate with dimensionality"));
 
   // Tuples have a TUPLE element type and cannot be the pred of a select.
   auto inferred_status_error4 = ShapeInference::InferTernaryOpShape(
-      TernaryOperation::TRIOP_SELECT, ShapeUtil::MakeTupleShape({pred_, pred_}),
+      HloOpcode::kSelect, ShapeUtil::MakeTupleShape({pred_, pred_}),
       ShapeUtil::MakeTupleShape({f32_, f32_}),
       ShapeUtil::MakeTupleShape({f32_, f32_}));
   ASSERT_FALSE(inferred_status_error4.ok());
@@ -162,102 +162,98 @@ TEST_F(ShapeInferenceTest, SelectBadShapes) {
 
 TEST_F(ShapeInferenceTest, ClampAllMatrix) {
   auto inferred_status = ShapeInference::InferTernaryOpShape(
-      TernaryOperation::TRIOP_CLAMP, matrix_64_48_, matrix_64_48_,
-      matrix_64_48_);
+      HloOpcode::kClamp, matrix_64_48_, matrix_64_48_, matrix_64_48_);
   ASSERT_IS_OK(inferred_status.status());
   ASSERT_TRUE(ShapeUtil::Equal(matrix_64_48_, inferred_status.ValueOrDie()));
 }
 
 TEST_F(ShapeInferenceTest, ClampAllScalar) {
-  auto inferred_status = ShapeInference::InferTernaryOpShape(
-      TernaryOperation::TRIOP_CLAMP, f32_, f32_, f32_);
+  auto inferred_status =
+      ShapeInference::InferTernaryOpShape(HloOpcode::kClamp, f32_, f32_, f32_);
   ASSERT_IS_OK(inferred_status.status());
   ASSERT_TRUE(ShapeUtil::Equal(f32_, inferred_status.ValueOrDie()));
 }
 
 TEST_F(ShapeInferenceTest, ClampMinScalar) {
   auto inferred_status = ShapeInference::InferTernaryOpShape(
-      TernaryOperation::TRIOP_CLAMP, f32_, matrix_64_48_, matrix_64_48_);
+      HloOpcode::kClamp, f32_, matrix_64_48_, matrix_64_48_);
   ASSERT_IS_OK(inferred_status.status());
   ASSERT_TRUE(ShapeUtil::Equal(matrix_64_48_, inferred_status.ValueOrDie()));
 }
 
 TEST_F(ShapeInferenceTest, ClampMaxScalar) {
   auto inferred_status = ShapeInference::InferTernaryOpShape(
-      TernaryOperation::TRIOP_CLAMP, matrix_64_48_, matrix_64_48_, f32_);
+      HloOpcode::kClamp, matrix_64_48_, matrix_64_48_, f32_);
   ASSERT_IS_OK(inferred_status.status());
   ASSERT_TRUE(ShapeUtil::Equal(matrix_64_48_, inferred_status.ValueOrDie()));
 }
 
 TEST_F(ShapeInferenceTest, ClampOperandScalar) {
   auto inferred_status = ShapeInference::InferTernaryOpShape(
-      TernaryOperation::TRIOP_CLAMP, matrix_64_48_, f32_, matrix_64_48_);
+      HloOpcode::kClamp, matrix_64_48_, f32_, matrix_64_48_);
   ASSERT_IS_OK(inferred_status.status());
   ASSERT_TRUE(ShapeUtil::Equal(matrix_64_48_, inferred_status.ValueOrDie()));
 }
 
 TEST_F(ShapeInferenceTest, ClampMinMatrix) {
   auto inferred_status = ShapeInference::InferTernaryOpShape(
-      TernaryOperation::TRIOP_CLAMP, matrix_64_48_, f32_, f32_);
+      HloOpcode::kClamp, matrix_64_48_, f32_, f32_);
   ASSERT_IS_OK(inferred_status.status());
   ASSERT_TRUE(ShapeUtil::Equal(matrix_64_48_, inferred_status.ValueOrDie()));
 }
 
 TEST_F(ShapeInferenceTest, ClampMaxMatrix) {
   auto inferred_status = ShapeInference::InferTernaryOpShape(
-      TernaryOperation::TRIOP_CLAMP, f32_, f32_, matrix_64_48_);
+      HloOpcode::kClamp, f32_, f32_, matrix_64_48_);
   ASSERT_IS_OK(inferred_status.status());
   ASSERT_TRUE(ShapeUtil::Equal(matrix_64_48_, inferred_status.ValueOrDie()));
 }
 
 TEST_F(ShapeInferenceTest, ClampOperandMatrix) {
   auto inferred_status = ShapeInference::InferTernaryOpShape(
-      TernaryOperation::TRIOP_CLAMP, f32_, matrix_64_48_, f32_);
+      HloOpcode::kClamp, f32_, matrix_64_48_, f32_);
   ASSERT_IS_OK(inferred_status.status());
   ASSERT_TRUE(ShapeUtil::Equal(matrix_64_48_, inferred_status.ValueOrDie()));
 }
 
 TEST_F(ShapeInferenceTest, ClampBadShapes) {
   // Type mismatch
-  ASSERT_FALSE(ShapeInference::InferTernaryOpShape(
-                   TernaryOperation::TRIOP_CLAMP, s32_, f32_, f32_)
-                   .ok());
-  ASSERT_FALSE(ShapeInference::InferTernaryOpShape(
-                   TernaryOperation::TRIOP_CLAMP, f32_, s32_, f32_)
-                   .ok());
-  ASSERT_FALSE(ShapeInference::InferTernaryOpShape(
-                   TernaryOperation::TRIOP_CLAMP, f32_, f32_, s32_)
-                   .ok());
-  // Dimension mismatch
   ASSERT_FALSE(
-      ShapeInference::InferTernaryOpShape(TernaryOperation::TRIOP_CLAMP,
-                                          vector_64_, vector_32_, vector_32_)
+      ShapeInference::InferTernaryOpShape(HloOpcode::kClamp, s32_, f32_, f32_)
           .ok());
   ASSERT_FALSE(
-      ShapeInference::InferTernaryOpShape(TernaryOperation::TRIOP_CLAMP,
-                                          vector_32_, vector_64_, vector_32_)
+      ShapeInference::InferTernaryOpShape(HloOpcode::kClamp, f32_, s32_, f32_)
           .ok());
   ASSERT_FALSE(
-      ShapeInference::InferTernaryOpShape(TernaryOperation::TRIOP_CLAMP,
-                                          vector_32_, vector_32_, vector_64_)
+      ShapeInference::InferTernaryOpShape(HloOpcode::kClamp, f32_, f32_, s32_)
           .ok());
-  // Dimension mismatch, where one operand is a scalar
+  // Dimension mismatch
   ASSERT_FALSE(ShapeInference::InferTernaryOpShape(
-                   TernaryOperation::TRIOP_CLAMP, vector_64_, vector_32_, f32_)
+                   HloOpcode::kClamp, vector_64_, vector_32_, vector_32_)
                    .ok());
   ASSERT_FALSE(ShapeInference::InferTernaryOpShape(
-                   TernaryOperation::TRIOP_CLAMP, vector_64_, f32_, vector_32_)
+                   HloOpcode::kClamp, vector_32_, vector_64_, vector_32_)
                    .ok());
   ASSERT_FALSE(ShapeInference::InferTernaryOpShape(
-                   TernaryOperation::TRIOP_CLAMP, f32_, vector_64_, vector_32_)
+                   HloOpcode::kClamp, vector_32_, vector_32_, vector_64_)
+                   .ok());
+  // Dimension mismatch, where one operand is a scalar
+  ASSERT_FALSE(ShapeInference::InferTernaryOpShape(HloOpcode::kClamp,
+                                                   vector_64_, vector_32_, f32_)
+                   .ok());
+  ASSERT_FALSE(ShapeInference::InferTernaryOpShape(HloOpcode::kClamp,
+                                                   vector_64_, f32_, vector_32_)
+                   .ok());
+  ASSERT_FALSE(ShapeInference::InferTernaryOpShape(HloOpcode::kClamp, f32_,
+                                                   vector_64_, vector_32_)
                    .ok());
 }
 
 TEST_F(ShapeInferenceTest, Complex) {
   auto complex_shape = [&](const Shape& lhs, const Shape& rhs,
                            const tensorflow::gtl::ArraySlice<int64>& bcast) {
-    return ShapeInference::InferBinaryOpShape(BinaryOperation::BINOP_COMPLEX,
-                                              lhs, rhs, bcast);
+    return ShapeInference::InferBinaryOpShape(HloOpcode::kComplex, lhs, rhs,
+                                              bcast);
   };
   // Inputs must be FP.
   ASSERT_FALSE(complex_shape(s32_, s32_, {}).ok());
@@ -292,8 +288,8 @@ TEST_F(ShapeInferenceTest, Complex) {
 }
 
 TEST_F(ShapeInferenceTest, VariadicOpTuplify) {
-  StatusOr<Shape> result = ShapeInference::InferVariadicOpShape(
-      VariadicOperation::VAROP_TUPLE, {&s32_, &f32_});
+  StatusOr<Shape> result =
+      ShapeInference::InferVariadicOpShape(HloOpcode::kTuple, {&s32_, &f32_});
   ASSERT_IS_OK(result.status());
   ASSERT_TRUE(ShapeUtil::Equal(result.ValueOrDie(),
                                ShapeUtil::MakeTupleShape({s32_, f32_})));
@@ -804,8 +800,8 @@ TEST_F(ShapeInferenceTest, InferConstIndexShape) {
 
 TEST_F(ShapeInferenceTest, InferPowShape) {
   auto ten_floats = ShapeUtil::MakeShape(F32, {10});
-  auto inferred_status =
-      ShapeInference::InferBinaryOpShape(BINOP_POW, ten_floats, f32_, {});
+  auto inferred_status = ShapeInference::InferBinaryOpShape(
+      HloOpcode::kPower, ten_floats, f32_, {});
   ASSERT_IS_OK(inferred_status.status());
   ASSERT_TRUE(ShapeUtil::Equal(ten_floats, inferred_status.ValueOrDie()));
 }
@@ -813,7 +809,7 @@ TEST_F(ShapeInferenceTest, InferPowShape) {
 TEST_F(ShapeInferenceTest, InferCompareShapeEq) {
   auto ten_floats = ShapeUtil::MakeShape(F32, {10});
   auto inferred_status =
-      ShapeInference::InferBinaryOpShape(BINOP_EQ, ten_floats, f32_, {});
+      ShapeInference::InferBinaryOpShape(HloOpcode::kEq, ten_floats, f32_, {});
   ASSERT_IS_OK(inferred_status.status());
   ASSERT_TRUE(ShapeUtil::Equal(ShapeUtil::MakeShape(PRED, {10}),
                                inferred_status.ValueOrDie()));
@@ -822,7 +818,7 @@ TEST_F(ShapeInferenceTest, InferCompareShapeEq) {
 TEST_F(ShapeInferenceTest, InferCompareShapeGe) {
   auto ten_floats = ShapeUtil::MakeShape(F32, {10});
   auto inferred_status =
-      ShapeInference::InferBinaryOpShape(BINOP_GE, ten_floats, f32_, {});
+      ShapeInference::InferBinaryOpShape(HloOpcode::kGe, ten_floats, f32_, {});
   ASSERT_IS_OK(inferred_status.status());
   ASSERT_TRUE(ShapeUtil::Equal(ShapeUtil::MakeShape(PRED, {10}),
                                inferred_status.ValueOrDie()));
@@ -831,7 +827,7 @@ TEST_F(ShapeInferenceTest, InferCompareShapeGe) {
 TEST_F(ShapeInferenceTest, InferCompareShapeGt) {
   auto ten_floats = ShapeUtil::MakeShape(F32, {10});
   auto inferred_status =
-      ShapeInference::InferBinaryOpShape(BINOP_GT, ten_floats, f32_, {});
+      ShapeInference::InferBinaryOpShape(HloOpcode::kGt, ten_floats, f32_, {});
   ASSERT_IS_OK(inferred_status.status());
   ASSERT_TRUE(ShapeUtil::Equal(ShapeUtil::MakeShape(PRED, {10}),
                                inferred_status.ValueOrDie()));
@@ -840,7 +836,7 @@ TEST_F(ShapeInferenceTest, InferCompareShapeGt) {
 TEST_F(ShapeInferenceTest, InferCompareShapeLe) {
   auto ten_floats = ShapeUtil::MakeShape(F32, {10});
   auto inferred_status =
-      ShapeInference::InferBinaryOpShape(BINOP_LE, ten_floats, f32_, {});
+      ShapeInference::InferBinaryOpShape(HloOpcode::kLe, ten_floats, f32_, {});
   ASSERT_IS_OK(inferred_status.status());
   ASSERT_TRUE(ShapeUtil::Equal(ShapeUtil::MakeShape(PRED, {10}),
                                inferred_status.ValueOrDie()));
@@ -849,7 +845,7 @@ TEST_F(ShapeInferenceTest, InferCompareShapeLe) {
 TEST_F(ShapeInferenceTest, InferCompareShapeLt) {
   auto ten_floats = ShapeUtil::MakeShape(F32, {10});
   auto inferred_status =
-      ShapeInference::InferBinaryOpShape(BINOP_LT, ten_floats, f32_, {});
+      ShapeInference::InferBinaryOpShape(HloOpcode::kLt, ten_floats, f32_, {});
   ASSERT_IS_OK(inferred_status.status());
   ASSERT_TRUE(ShapeUtil::Equal(ShapeUtil::MakeShape(PRED, {10}),
                                inferred_status.ValueOrDie()));
@@ -858,7 +854,7 @@ TEST_F(ShapeInferenceTest, InferCompareShapeLt) {
 TEST_F(ShapeInferenceTest, InferCompareShapeNe) {
   auto ten_floats = ShapeUtil::MakeShape(F32, {10});
   auto inferred_status =
-      ShapeInference::InferBinaryOpShape(BINOP_NE, ten_floats, f32_, {});
+      ShapeInference::InferBinaryOpShape(HloOpcode::kNe, ten_floats, f32_, {});
   ASSERT_IS_OK(inferred_status.status());
   ASSERT_TRUE(ShapeUtil::Equal(ShapeUtil::MakeShape(PRED, {10}),
                                inferred_status.ValueOrDie()));
@@ -1111,22 +1107,22 @@ TEST_F(ShapeInferenceTest, BinOpBroadcastMatrixVector) {
   const Shape vec8 = ShapeUtil::MakeShape(F32, {8});
   const Shape vec16 = ShapeUtil::MakeShape(F32, {16});
 
-  auto inferred_status_match = ShapeInference::InferBinaryOpShape(
-      BinaryOperation::BINOP_ADD, mat, vec8, {1});
+  auto inferred_status_match =
+      ShapeInference::InferBinaryOpShape(HloOpcode::kAdd, mat, vec8, {1});
   ASSERT_IS_OK(inferred_status_match.status());
   ASSERT_TRUE(ShapeUtil::Equal(inferred_status_match.ValueOrDie(), mat));
 
-  auto inferred_status_mismatch = ShapeInference::InferBinaryOpShape(
-      BinaryOperation::BINOP_ADD, mat, vec8, {0});
+  auto inferred_status_mismatch =
+      ShapeInference::InferBinaryOpShape(HloOpcode::kAdd, mat, vec8, {0});
   ASSERT_FALSE(inferred_status_mismatch.ok());
 
-  inferred_status_match = ShapeInference::InferBinaryOpShape(
-      BinaryOperation::BINOP_ADD, mat, vec16, {0});
+  inferred_status_match =
+      ShapeInference::InferBinaryOpShape(HloOpcode::kAdd, mat, vec16, {0});
   ASSERT_IS_OK(inferred_status_match.status());
   ASSERT_TRUE(ShapeUtil::Equal(inferred_status_match.ValueOrDie(), mat));
 
-  inferred_status_mismatch = ShapeInference::InferBinaryOpShape(
-      BinaryOperation::BINOP_ADD, mat, vec16, {1});
+  inferred_status_mismatch =
+      ShapeInference::InferBinaryOpShape(HloOpcode::kAdd, mat, vec16, {1});
   ASSERT_FALSE(inferred_status_mismatch.ok());
 }
 
@@ -1138,17 +1134,17 @@ TEST_F(ShapeInferenceTest, BinOpBroadcastCubeMatrix) {
   const Shape matrix16_8 = ShapeUtil::MakeShape(F32, {16, 8});
 
   auto inferred_status_match = ShapeInference::InferBinaryOpShape(
-      BinaryOperation::BINOP_ADD, cube, matrix8_4, {1, 2});
+      HloOpcode::kAdd, cube, matrix8_4, {1, 2});
   ASSERT_IS_OK(inferred_status_match.status());
   ASSERT_TRUE(ShapeUtil::Equal(inferred_status_match.ValueOrDie(), cube));
 
   inferred_status_match = ShapeInference::InferBinaryOpShape(
-      BinaryOperation::BINOP_ADD, cube, matrix16_4, {0, 2});
+      HloOpcode::kAdd, cube, matrix16_4, {0, 2});
   ASSERT_IS_OK(inferred_status_match.status());
   ASSERT_TRUE(ShapeUtil::Equal(inferred_status_match.ValueOrDie(), cube));
 
   inferred_status_match = ShapeInference::InferBinaryOpShape(
-      BinaryOperation::BINOP_ADD, cube, matrix16_8, {0, 1});
+      HloOpcode::kAdd, cube, matrix16_8, {0, 1});
   ASSERT_IS_OK(inferred_status_match.status());
   ASSERT_TRUE(ShapeUtil::Equal(inferred_status_match.ValueOrDie(), cube));
 }
@@ -1162,43 +1158,43 @@ TEST_F(ShapeInferenceTest, BinOpBroadcastBadDimension) {
   const Shape matrix8_8 = ShapeUtil::MakeShape(F32, {8, 8});
 
   // "magical" broadcast rejected
-  auto inferred_status_error1 = ShapeInference::InferBinaryOpShape(
-      BinaryOperation::BINOP_ADD, tensor, vec8, {});
+  auto inferred_status_error1 =
+      ShapeInference::InferBinaryOpShape(HloOpcode::kAdd, tensor, vec8, {});
   ASSERT_FALSE(inferred_status_error1.ok());
   ASSERT_THAT(inferred_status_error1.status().error_message(),
               HasSubstr("Automatic"));
 
   // broadcast_dimension out of bounds for tensor's rank
-  auto inferred_status_error2 = ShapeInference::InferBinaryOpShape(
-      BinaryOperation::BINOP_ADD, tensor, vec8, {3});
+  auto inferred_status_error2 =
+      ShapeInference::InferBinaryOpShape(HloOpcode::kAdd, tensor, vec8, {3});
   ASSERT_FALSE(inferred_status_error2.ok());
   ASSERT_THAT(inferred_status_error2.status().error_message(),
               ContainsRegex("Broadcast dimension number .* too large"));
 
   // broadcast_dimension doesn't match corresponding dimension
-  auto inferred_status_error3 = ShapeInference::InferBinaryOpShape(
-      BinaryOperation::BINOP_ADD, tensor, vec8, {0});
+  auto inferred_status_error3 =
+      ShapeInference::InferBinaryOpShape(HloOpcode::kAdd, tensor, vec8, {0});
   ASSERT_FALSE(inferred_status_error3.ok());
   ASSERT_THAT(inferred_status_error3.status().error_message(),
               HasSubstr("Broadcast dimension 0 mismatch"));
 
   // broadcast_dimensions list too long
   auto inferred_status_error4 = ShapeInference::InferBinaryOpShape(
-      BinaryOperation::BINOP_ADD, tensor, matrix8_4, {0, 1, 2});
+      HloOpcode::kAdd, tensor, matrix8_4, {0, 1, 2});
   ASSERT_FALSE(inferred_status_error4.ok());
   ASSERT_THAT(inferred_status_error4.status().error_message(),
               HasSubstr("broadcast_dimensions has to match"));
 
   // there's a dimension above the rank of the tensor
   auto inferred_status_error5 = ShapeInference::InferBinaryOpShape(
-      BinaryOperation::BINOP_ADD, tensor, matrix8_4, {3, 0});
+      HloOpcode::kAdd, tensor, matrix8_4, {3, 0});
   ASSERT_FALSE(inferred_status_error5.ok());
   ASSERT_THAT(inferred_status_error5.status().error_message(),
               ContainsRegex("dimension number .* too large"));
 
   // broadcasting dimensions don't match in this order
   auto inferred_status_error6 = ShapeInference::InferBinaryOpShape(
-      BinaryOperation::BINOP_ADD, tensor, matrix8_4, {2, 1});
+      HloOpcode::kAdd, tensor, matrix8_4, {2, 1});
   ASSERT_FALSE(inferred_status_error6.ok());
   ASSERT_THAT(inferred_status_error6.status().error_message(),
               HasSubstr("dimension 0 mismatch"));
@@ -1207,13 +1203,13 @@ TEST_F(ShapeInferenceTest, BinOpBroadcastBadDimension) {
   // in a proper (strictly increasing) order, even if the lower-rank array
   // matches the higher-rank array in many different ways.
   auto inferred_status_error7 = ShapeInference::InferBinaryOpShape(
-      BinaryOperation::BINOP_ADD, tensor8_8_8, matrix8_8, {0, 0});
+      HloOpcode::kAdd, tensor8_8_8, matrix8_8, {0, 0});
   ASSERT_FALSE(inferred_status_error7.ok());
   ASSERT_THAT(inferred_status_error7.status().error_message(),
               HasSubstr("dimensions order is wrong"));
 
   auto inferred_status_error8 = ShapeInference::InferBinaryOpShape(
-      BinaryOperation::BINOP_ADD, tensor8_8_8, matrix8_8, {1, 0});
+      HloOpcode::kAdd, tensor8_8_8, matrix8_8, {1, 0});
   ASSERT_FALSE(inferred_status_error8.ok());
   ASSERT_THAT(inferred_status_error8.status().error_message(),
               HasSubstr("dimensions order is wrong"));
diff --git a/tensorflow/compiler/xla/service/transpose_folding.cc b/tensorflow/compiler/xla/service/transpose_folding.cc
index ba16dc640e2d2974eab4fc8b134a6e33c03e3b85..49e1f873192f800056a2272f7d4f698898b0f8a1 100644
--- a/tensorflow/compiler/xla/service/transpose_folding.cc
+++ b/tensorflow/compiler/xla/service/transpose_folding.cc
@@ -178,7 +178,6 @@ bool FoldTransposeIntoConvolution(InstructionOperandsPair pair) {
 
   auto new_conv = HloInstruction::CreateConvolve(
       convolution.shape(), new_lhs, new_rhs, convolution.window(), new_dnums);
-  convolution.SetupDerivedInstruction(new_conv.get());
   TF_CHECK_OK(convolution.parent()->ReplaceWithNewInstruction(
       &convolution, std::move(new_conv)));
 
diff --git a/tensorflow/compiler/xla/service/transpose_folding_test.cc b/tensorflow/compiler/xla/service/transpose_folding_test.cc
index f73f1227aaf1630a9e7c43bb508732c5518ef929..cccb8f2fbb0266bbf1f40b09170938a1e5d3e78d 100644
--- a/tensorflow/compiler/xla/service/transpose_folding_test.cc
+++ b/tensorflow/compiler/xla/service/transpose_folding_test.cc
@@ -27,12 +27,12 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/hlo_matchers.h"
 #include "tensorflow/compiler/xla/service/hlo_module.h"
 #include "tensorflow/compiler/xla/service/hlo_opcode.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/service/shape_inference.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/test.h"
 #include "tensorflow/compiler/xla/test_helpers.h"
 #include "tensorflow/compiler/xla/tests/hlo_test_base.h"
-#include "tensorflow/compiler/xla/tools/parser/hlo_parser.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
 #include "tensorflow/core/platform/logging.h"
 
@@ -69,7 +69,7 @@ ENTRY entry_computation {
 }
 )";
   TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
-                          tools::Parse(hlo_string));
+                          ParseHloString(hlo_string));
 
   FoldTranspose(module.get());
 
@@ -91,7 +91,7 @@ ENTRY entry_computation {
 )";
 
   TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
-                          tools::Parse(hlo_string));
+                          ParseHloString(hlo_string));
 
   TransposeFolding transpose_folding(
       [](const HloInstruction& dot,
@@ -119,7 +119,7 @@ ENTRY entry_computation {
 )";
 
   TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
-                          tools::Parse(hlo_string));
+                          ParseHloString(hlo_string));
 
   TransposeFolding transpose_folding(
       [](const HloInstruction& dot,
@@ -147,7 +147,7 @@ ENTRY entry_computation {
 }
 )";
   TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
-                          tools::Parse(hlo_string));
+                          ParseHloString(hlo_string));
 
   FoldTranspose(module.get());
 
@@ -176,7 +176,7 @@ TEST_F(TransposeFoldingTest, FuseDotWithConstantOperands) {
   HloComputation* entry_computation =
       module->AddEntryComputation(builder.Build(mul));
   HloInstruction* call = module->OutlineExpressionFromComputation(
-      {add, sub, mul}, "", entry_computation);
+      {add, sub, mul}, "entry", entry_computation);
   EXPECT_EQ(call, entry_computation->root_instruction());
   HloComputation* callee_computation = call->to_apply();
   // The arguments to the call should be const1, const2, and const3.
@@ -205,7 +205,7 @@ ENTRY entry_computation {
 }
 )";
   TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
-                          tools::Parse(hlo_string));
+                          ParseHloString(hlo_string));
   FoldTranspose(module.get());
 
   const HloComputation* callee = module->GetComputationWithName("callee");
diff --git a/tensorflow/compiler/xla/service/tuple_points_to_analysis.cc b/tensorflow/compiler/xla/service/tuple_points_to_analysis.cc
index 8cb654493ca82dc702b2c1e7a4284f4f31d1e5f9..eb6d1ada6b553f998fe06917dfdf0b5092cd79cd 100644
--- a/tensorflow/compiler/xla/service/tuple_points_to_analysis.cc
+++ b/tensorflow/compiler/xla/service/tuple_points_to_analysis.cc
@@ -273,6 +273,14 @@ Status TuplePointsToAnalysis::HandleBitcast(HloInstruction* bitcast) {
   return Status::OK();
 }
 
+Status TuplePointsToAnalysis::HandleDomain(HloInstruction* domain) {
+  // A kDomain instruction aliases its operand. That is, the buffer of its
+  // result *is* the buffer of its operand, so just copy the operands points-to
+  // set.
+  CreateCopiedPointsToSet(domain, domain->operand(0));
+  return Status::OK();
+}
+
 Status TuplePointsToAnalysis::HandleSlice(HloInstruction* slice) {
   // A kSlice instruction aliases its operand if the backend lowers it to an
   // in-place implementation.
@@ -715,15 +723,16 @@ bool TuplePointsToAnalysis::CanShareOperandBufferWithUser(
     return false;
   }
   if (user->opcode() == HloOpcode::kFusion) {
-    if (user->fusion_kind() == HloInstruction::FusionKind::kLoop &&
-        user->fused_expression_root()->opcode() ==
-            HloOpcode::kDynamicUpdateSlice) {
-      // Loop fusion with kDynamicUpdateSlice fused root.
-      //
-      // Returns true iff there is exactly one use of 'operand' at shape index
-      // 'operand_index', and this singleton use is the fused root at operand
-      // index 0.
-      return HasUniqueFusedUseOfOperandAt(operand, operand_index, user, 0);
+    if (user->fusion_kind() == HloInstruction::FusionKind::kLoop) {
+      if (user->fused_expression_root()->opcode() ==
+          HloOpcode::kDynamicUpdateSlice) {
+        // Loop fusion with kDynamicUpdateSlice fused root.
+        //
+        // Returns true iff there is exactly one use of 'operand' at shape index
+        // 'operand_index', and this singleton use is the fused root at operand
+        // index 0.
+        return HasUniqueFusedUseOfOperandAt(operand, operand_index, user, 0);
+      }
     } else if (user->fusion_kind() == HloInstruction::FusionKind::kOutput &&
                user->fused_expression_root()->opcode() == HloOpcode::kAdd) {
       // Output fusion with kAdd fused root.
@@ -781,8 +790,12 @@ bool TuplePointsToAnalysis::CanShareOperandBufferWithUser(
     return param_uses.size() == 1 && param_uses[0].first == callee_root &&
            callee_root->IsElementwiseOnOperand(param_uses[0].second);
   }
-  // Check if 'user' is element-wise.
-  return user->IsElementwise();
+  // Loop fusions that contain transposing copies won't reach here as they have
+  // different layouts, which fails the check in the beginning of this function.
+  //
+  // Multi-output fusion will fail the check here as tuples are not considered
+  // an elementwise operation.
+  return user->IsElementwiseOnOperand(user->operand_index(operand));
 }
 
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/tuple_points_to_analysis.h b/tensorflow/compiler/xla/service/tuple_points_to_analysis.h
index 1ac713013650d807b15e33565e6d2dec406a5d13..c0d82414806d9a6ff57aec59d077f444137fec9a 100644
--- a/tensorflow/compiler/xla/service/tuple_points_to_analysis.h
+++ b/tensorflow/compiler/xla/service/tuple_points_to_analysis.h
@@ -248,6 +248,7 @@ class TuplePointsToAnalysis : public DfsHloVisitorWithDefault {
   Status HandleTuple(HloInstruction* tuple) override;
   Status HandleGetTupleElement(HloInstruction* get_tuple_element) override;
   Status HandleBitcast(HloInstruction* bitcast) override;
+  Status HandleDomain(HloInstruction* domain) override;
   Status HandleSlice(HloInstruction* slice) override;
   Status HandleCopy(HloInstruction* copy) override;
   Status HandleRecvDone(HloInstruction* recv_done) override;
diff --git a/tensorflow/compiler/xla/service/tuple_points_to_analysis_test.cc b/tensorflow/compiler/xla/service/tuple_points_to_analysis_test.cc
index f558316b05b168a6f100e8ef69adfd9dbc023102..5734f284071944bc22011405898cf86f33dc48d7 100644
--- a/tensorflow/compiler/xla/service/tuple_points_to_analysis_test.cc
+++ b/tensorflow/compiler/xla/service/tuple_points_to_analysis_test.cc
@@ -1148,5 +1148,30 @@ TEST_F(CanShareOperandBufferWithUserTest, CallToComputationWithFusionRoot) {
                                                                  call, {}));
 }
 
+TEST_F(CanShareOperandBufferWithUserTest, LoopFusionWithElementwiseOperand) {
+  Shape full_shape = ShapeUtil::MakeShape(F32, {16, 32});
+  Shape broadcast_shape = ShapeUtil::MakeShape(F32, {16});
+
+  auto builder = HloComputation::Builder(TestName() + "_fusion");
+  auto param0 = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, full_shape, "full"));
+  auto param1 = builder.AddInstruction(
+      HloInstruction::CreateParameter(1, broadcast_shape, "small"));
+  auto broadcast = builder.AddInstruction(
+      HloInstruction::CreateBroadcast(full_shape, param1, {0}));
+  auto add = builder.AddInstruction(HloInstruction::CreateBinary(
+      full_shape, HloOpcode::kAdd, param0, broadcast));
+
+  BuildModule(builder.Build());
+  auto fusion = computation_->CreateFusionInstruction(
+      {add, broadcast}, HloInstruction::FusionKind::kLoop);
+  RunAnalysis();
+
+  EXPECT_TRUE(points_to_analysis_->CanShareOperandBufferWithUser(param0, {},
+                                                                 fusion, {}));
+  EXPECT_FALSE(points_to_analysis_->CanShareOperandBufferWithUser(param1, {},
+                                                                  fusion, {}));
+}
+
 }  // namespace
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/tuple_simplifier.cc b/tensorflow/compiler/xla/service/tuple_simplifier.cc
index d668855084a884518b338cdf396a9330b9f43a2b..77bdcc9de0d830991208a1db271d009bccaf550e 100644
--- a/tensorflow/compiler/xla/service/tuple_simplifier.cc
+++ b/tensorflow/compiler/xla/service/tuple_simplifier.cc
@@ -30,10 +30,17 @@ limitations under the License.
 
 namespace xla {
 
+TupleSimplifier::TupleSimplifier(bool exclude_entry_computation) :
+    exclude_entry_computation_(exclude_entry_computation) {}
+
 StatusOr<bool> TupleSimplifier::Run(HloModule* module) {
   // Initially add all GTE and Tuple instructions to the worklist.
   std::queue<HloInstruction*> worklist;
   for (auto* computation : module->computations()) {
+    if (exclude_entry_computation_ &&
+        computation == module->entry_computation()) {
+      continue;
+    }
     for (auto* instruction : computation->instructions()) {
       if (instruction->opcode() == HloOpcode::kTuple ||
           instruction->opcode() == HloOpcode::kGetTupleElement) {
@@ -69,7 +76,6 @@ StatusOr<bool> TupleSimplifier::Run(HloModule* module) {
       //       Tuple
       //
       HloInstruction* top_tuple = nullptr;
-      HloInstruction* first_gte = nullptr;
       bool can_simplify = true;
       for (int64 operand_number = 0;
            operand_number < instruction->operand_count(); ++operand_number) {
@@ -79,17 +85,10 @@ StatusOr<bool> TupleSimplifier::Run(HloModule* module) {
           can_simplify = false;
           break;
         }
-        if (first_gte == nullptr) {
-          first_gte = operand;
-        } else if (!first_gte->has_compatible_sharding(operand)) {
-          can_simplify = false;
-          break;
-        }
         if (top_tuple == nullptr) {
           top_tuple = operand->mutable_operand(0);
           if (!ShapeUtil::Compatible(top_tuple->shape(),
-                                     instruction->shape()) ||
-              !instruction->has_compatible_sharding(top_tuple)) {
+                                     instruction->shape())) {
             can_simplify = false;
             break;
           }
@@ -118,14 +117,12 @@ StatusOr<bool> TupleSimplifier::Run(HloModule* module) {
         HloInstruction* element_source =
             instruction->mutable_operand(0)->mutable_operand(
                 instruction->tuple_index());
-        if (instruction->has_compatible_sharding(element_source)) {
-          changed = true;
-          TF_RETURN_IF_ERROR(instruction->ReplaceAllUsesWith(element_source));
-          for (HloInstruction* user : element_source->users()) {
-            if (user->opcode() == HloOpcode::kTuple ||
-                user->opcode() == HloOpcode::kGetTupleElement) {
-              worklist.push(user);
-            }
+        changed = true;
+        TF_RETURN_IF_ERROR(instruction->ReplaceAllUsesWith(element_source));
+        for (HloInstruction* user : element_source->users()) {
+          if (user->opcode() == HloOpcode::kTuple ||
+              user->opcode() == HloOpcode::kGetTupleElement) {
+            worklist.push(user);
           }
         }
       }
diff --git a/tensorflow/compiler/xla/service/tuple_simplifier.h b/tensorflow/compiler/xla/service/tuple_simplifier.h
index e5e9b10b5bf3f452d1bfec476b8d5c7d74c4f4e8..750950188312c5077d487f2feef0606f07839432 100644
--- a/tensorflow/compiler/xla/service/tuple_simplifier.h
+++ b/tensorflow/compiler/xla/service/tuple_simplifier.h
@@ -27,13 +27,20 @@ namespace xla {
 // the module.
 class TupleSimplifier : public HloPassInterface {
  public:
-  TupleSimplifier() {}
+  TupleSimplifier() : TupleSimplifier(/*exclude_entry_computation=*/false) {}
+  explicit TupleSimplifier(bool exclude_entry_computation);
   ~TupleSimplifier() override {}
   tensorflow::StringPiece name() const override { return "tuple-simplifier"; }
 
   // Run tuple simplification on the given computation. Returns whether the
   // computation was changed.
   StatusOr<bool> Run(HloModule* module) override;
+
+ private:
+  // When set, this pipeline stage will perform optimization of all computations
+  // apart from the module's entry computation. This is used by Graphcore's
+  // backend.
+  bool exclude_entry_computation_;
 };
 
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/tuple_simplifier_test.cc b/tensorflow/compiler/xla/service/tuple_simplifier_test.cc
index ca9ae91281fce5ee061d066fc3e538dbbc09f6b3..d3635eae81ec7017f9bf6a69250d10716309c9ec 100644
--- a/tensorflow/compiler/xla/service/tuple_simplifier_test.cc
+++ b/tensorflow/compiler/xla/service/tuple_simplifier_test.cc
@@ -42,6 +42,12 @@ class TupleSimplifierTest : public HloTestBase {
     TF_ASSERT_OK(changed_status.status());
     EXPECT_EQ(change_expected, changed_status.ValueOrDie());
   }
+  void Run(HloModule* module, bool change_expected, bool exclude_entry) {
+    TupleSimplifier simplifier(exclude_entry);
+    auto changed_status = simplifier.Run(module);
+    TF_ASSERT_OK(changed_status.status());
+    EXPECT_EQ(change_expected, changed_status.ValueOrDie());
+  }
 
   const Shape scalar_shape_ = ShapeUtil::MakeShape(F32, {});
   const Shape tuple_shape_ = ShapeUtil::MakeTupleShape(
@@ -211,5 +217,76 @@ TEST_F(TupleSimplifierTest, IncompatibleTuples) {
   EXPECT_THAT(computation->root_instruction(), tuple);
 }
 
+TEST_F(TupleSimplifierTest, CanExcludeEntryComputation) {
+  //  Verify that the root computation can be excluded
+  auto module = CreateNewModule();
+
+  HloInstruction* p0;
+  HloInstruction* p1;
+  HloComputation* c0;
+  HloComputation* c1;
+  HloComputation* entry;
+
+  {
+    HloComputation::Builder builder(TestName() + "_1");
+    p0 = builder.AddInstruction(
+        HloInstruction::CreateParameter(0, tuple_shape_, "param"));
+    HloInstruction* gte0 = builder.AddInstruction(
+        HloInstruction::CreateGetTupleElement(scalar_shape_, p0, 0));
+    HloInstruction* gte1 = builder.AddInstruction(
+        HloInstruction::CreateGetTupleElement(scalar_shape_, p0, 1));
+    HloInstruction* gte2 = builder.AddInstruction(
+        HloInstruction::CreateGetTupleElement(scalar_shape_, p0, 2));
+
+    builder.AddInstruction(HloInstruction::CreateTuple({gte0, gte1, gte2}));
+
+    c0 = module->AddEmbeddedComputation(builder.Build());
+  }
+  {
+    HloComputation::Builder builder(TestName() + "_2");
+    p1 = builder.AddInstruction(
+        HloInstruction::CreateParameter(0, tuple_shape_, "param"));
+    HloInstruction* gte0 = builder.AddInstruction(
+        HloInstruction::CreateGetTupleElement(scalar_shape_, p1, 0));
+    HloInstruction* gte1 = builder.AddInstruction(
+        HloInstruction::CreateGetTupleElement(scalar_shape_, p1, 1));
+    HloInstruction* gte2 = builder.AddInstruction(
+        HloInstruction::CreateGetTupleElement(scalar_shape_, p1, 2));
+
+    builder.AddInstruction(HloInstruction::CreateTuple({gte0, gte1, gte2}));
+
+    c1 = module->AddEmbeddedComputation(builder.Build());
+  }
+  {
+    HloComputation::Builder builder(TestName() + "_Entry");
+    HloInstruction* tuple_param = builder.AddInstruction(
+        HloInstruction::CreateParameter(0, tuple_shape_, "param"));
+    HloInstruction* call0 = builder.AddInstruction(
+        HloInstruction::CreateCall(tuple_shape_, {tuple_param}, c0));
+    HloInstruction* call1 = builder.AddInstruction(
+        HloInstruction::CreateCall(tuple_shape_, {tuple_param}, c1));
+    HloInstruction* gte0 = builder.AddInstruction(
+        HloInstruction::CreateGetTupleElement(scalar_shape_, call0, 0));
+    HloInstruction* gte1 = builder.AddInstruction(
+        HloInstruction::CreateGetTupleElement(scalar_shape_, call1, 1));
+    HloInstruction* tuple0 =
+        builder.AddInstruction(HloInstruction::CreateTuple({gte0, gte1}));
+    HloInstruction* gte2 = builder.AddInstruction(
+        HloInstruction::CreateGetTupleElement(scalar_shape_, tuple0, 0));
+    HloInstruction* gte3 = builder.AddInstruction(
+        HloInstruction::CreateGetTupleElement(scalar_shape_, tuple0, 1));
+
+    builder.AddInstruction(HloInstruction::CreateTuple({gte2, gte3}));
+
+    entry = module->AddEntryComputation(builder.Build());
+  }
+
+  Run(module.get(), /*change_expected=*/true, /*exclude_entry=*/ true);
+
+  EXPECT_THAT(c0->root_instruction(), p0);
+  EXPECT_THAT(c1->root_instruction(), p1);
+  EXPECT_THAT(entry->instruction_count(), 9);
+}
+
 }  // namespace
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/tuple_util_test.cc b/tensorflow/compiler/xla/service/tuple_util_test.cc
index 754fd8ef169231827eeb5bfd72aeb596644ca767..d33d5bb8f30c8504aa323d461e5f59709b48e1fc 100644
--- a/tensorflow/compiler/xla/service/tuple_util_test.cc
+++ b/tensorflow/compiler/xla/service/tuple_util_test.cc
@@ -16,8 +16,8 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/tuple_util.h"
 
 #include "tensorflow/compiler/xla/service/hlo_matchers.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/test.h"
-#include "tensorflow/compiler/xla/tools/parser/hlo_parser.h"
 
 namespace xla {
 namespace {
@@ -37,7 +37,7 @@ ENTRY entry {
 )";
 
   TF_ASSIGN_OR_RETURN(std::unique_ptr<HloModule> module,
-                      tools::Parse(hlo_string));
+                      ParseHloString(hlo_string));
 
   *entry_computation = module->entry_computation();
   *param0 = (*entry_computation)->parameter_instruction(0);
diff --git a/tensorflow/compiler/xla/service/user_computation.cc b/tensorflow/compiler/xla/service/user_computation.cc
deleted file mode 100644
index 9e62d0acfb98946f1e693fc0310098b4ec99750b..0000000000000000000000000000000000000000
--- a/tensorflow/compiler/xla/service/user_computation.cc
+++ /dev/null
@@ -1,3557 +0,0 @@
-/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-==============================================================================*/
-
-#include "tensorflow/compiler/xla/service/user_computation.h"
-
-#include <algorithm>
-#include <set>
-#include <stack>
-#include <unordered_map>
-#include <utility>
-#include <vector>
-
-#include "tensorflow/compiler/xla/layout_util.h"
-#include "tensorflow/compiler/xla/literal_util.h"
-#include "tensorflow/compiler/xla/ptr_util.h"
-#include "tensorflow/compiler/xla/service/hlo_computation.h"
-#include "tensorflow/compiler/xla/service/hlo_instruction.h"
-#include "tensorflow/compiler/xla/service/hlo_opcode.h"
-#include "tensorflow/compiler/xla/service/shape_inference.h"
-#include "tensorflow/compiler/xla/shape_util.h"
-#include "tensorflow/compiler/xla/status_macros.h"
-#include "tensorflow/compiler/xla/types.h"
-#include "tensorflow/compiler/xla/util.h"
-#include "tensorflow/core/lib/core/errors.h"
-#include "tensorflow/core/lib/strings/str_util.h"
-#include "tensorflow/core/lib/strings/strcat.h"
-#include "tensorflow/core/lib/strings/stringprintf.h"
-#include "tensorflow/core/platform/logging.h"
-#include "tensorflow/core/platform/protobuf.h"
-
-namespace xla {
-namespace {
-
-HloOpcode UnaryOperationToHloOpcode(UnaryOperation unop) {
-  switch (unop) {
-    case UNOP_ABS:
-      return HloOpcode::kAbs;
-    case UNOP_CEIL:
-      return HloOpcode::kCeil;
-    case UNOP_CLZ:
-      return HloOpcode::kClz;
-    case UNOP_COS:
-      return HloOpcode::kCos;
-    case UNOP_EXP:
-      return HloOpcode::kExp;
-    case UNOP_EXPM1:
-      return HloOpcode::kExpm1;
-    case UNOP_FLOOR:
-      return HloOpcode::kFloor;
-    case UNOP_IMAG:
-      return HloOpcode::kImag;
-    case UNOP_IS_FINITE:
-      return HloOpcode::kIsFinite;
-    case UNOP_LOG:
-      return HloOpcode::kLog;
-    case UNOP_LOG1P:
-      return HloOpcode::kLog1p;
-    case UNOP_NOT:
-      return HloOpcode::kNot;
-    case UNOP_NEGATE:
-      return HloOpcode::kNegate;
-    case UNOP_REAL:
-      return HloOpcode::kReal;
-    case UNOP_ROUND_NEAREST_AFZ:
-      return HloOpcode::kRoundNearestAfz;
-    case UNOP_SIGN:
-      return HloOpcode::kSign;
-    case UNOP_SIN:
-      return HloOpcode::kSin;
-    case UNOP_SORT:
-      return HloOpcode::kSort;
-    case UNOP_TANH:
-      return HloOpcode::kTanh;
-    default:
-      LOG(FATAL) << "unhandled operation " << unop;
-  }
-}
-
-HloOpcode BinaryOperationToHloOpcode(BinaryOperation binop) {
-  switch (binop) {
-    case BINOP_ATAN2:
-      return HloOpcode::kAtan2;
-    case BINOP_COMPLEX:
-      return HloOpcode::kComplex;
-    case BINOP_MUL:
-      return HloOpcode::kMultiply;
-    case BINOP_ADD:
-      return HloOpcode::kAdd;
-    case BINOP_SUB:
-      return HloOpcode::kSubtract;
-    case BINOP_DIV:
-      return HloOpcode::kDivide;
-    case BINOP_EQ:
-      return HloOpcode::kEq;
-    case BINOP_GE:
-      return HloOpcode::kGe;
-    case BINOP_GT:
-      return HloOpcode::kGt;
-    case BINOP_LE:
-      return HloOpcode::kLe;
-    case BINOP_LT:
-      return HloOpcode::kLt;
-    case BINOP_NE:
-      return HloOpcode::kNe;
-    case BINOP_MAX:
-      return HloOpcode::kMaximum;
-    case BINOP_MIN:
-      return HloOpcode::kMinimum;
-    case BINOP_POW:
-      return HloOpcode::kPower;
-    case BINOP_REM:
-      return HloOpcode::kRemainder;
-    case BINOP_OR:
-      return HloOpcode::kOr;
-    case BINOP_AND:
-      return HloOpcode::kAnd;
-    case BINOP_SHIFT_LEFT:
-      return HloOpcode::kShiftLeft;
-    case BINOP_SHIFT_RIGHT_ARITHMETIC:
-      return HloOpcode::kShiftRightArithmetic;
-    case BINOP_SHIFT_RIGHT_LOGICAL:
-      return HloOpcode::kShiftRightLogical;
-    default:
-      LOG(FATAL) << "unhandled operation " << binop;
-  }
-}
-
-HloOpcode TernaryOperationToHloOpcode(TernaryOperation triop) {
-  switch (triop) {
-    case TRIOP_CLAMP:
-      return HloOpcode::kClamp;
-    case TRIOP_SELECT:
-      return HloOpcode::kSelect;
-    default:
-      LOG(FATAL) << "unhandled operation " << triop;
-  }
-}
-
-HloOpcode VariadicOperationToHloOpcode(VariadicOperation varop) {
-  switch (varop) {
-    case VAROP_TUPLE:
-      return HloOpcode::kTuple;
-    default:
-      LOG(FATAL) << "unhandled operation " << varop;
-  }
-}
-
-}  // namespace
-
-/* static */ StatusOr<std::unique_ptr<UserComputation>>
-UserComputation::MakeWithRemapping(
-    const SessionComputation& session_computation,
-    const ComputationHandle& handle,
-    const std::map<int64, ComputationHandle>& old_to_new) {
-  auto user_computation =
-      MakeUnique<UserComputation>(session_computation.name(), handle);
-  {
-    tensorflow::mutex_lock lock(user_computation->mutex_);
-    user_computation->session_computation_ = session_computation;
-    user_computation->next_handle_value_ =
-        std::max_element(session_computation.requests().begin(),
-                         session_computation.requests().end(),
-                         [](const std::pair<int64, OperationRequest>& lhs,
-                            const std::pair<int64, OperationRequest>& rhs) {
-                           return lhs.first < rhs.first;
-                         })
-            ->first +
-        1;
-    TF_RETURN_IF_ERROR(user_computation->RemapEmbeddedComputations(old_to_new));
-  }
-
-  return std::move(user_computation);
-}
-
-UserComputation::UserComputation(const string& name,
-                                 const ComputationHandle& handle)
-    : name_(name), next_handle_value_(1) {
-  *session_computation_.mutable_computation_handle() = handle;
-  session_computation_.set_name(name);
-
-  VLOG(1) << "New UserComputation \"" << name
-          << "\", handle: " << handle.handle();
-}
-
-ComputationDataHandle UserComputation::CreateComputationDataHandle() {
-  ComputationDataHandle handle;
-  handle.set_handle(next_handle_value_);
-  // Handles are used as Version values and *must* be assigned consecutively for
-  // computation versioning to work.
-  next_handle_value_++;
-  return handle;
-}
-
-StatusOr<ComputationDataHandle> UserComputation::AddParameterInstruction(
-    const ParameterRequest& parameter_request) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  int64 parameter_number = parameter_request.parameter();
-  if (parameters_.count(parameter_number) != 0) {
-    return InvalidArgument("parameter %lld already registered",
-                           parameter_number);
-  }
-  ComputationDataHandle handle = CreateComputationDataHandle();
-
-  const Shape& validated_shape = parameter_request.shape();
-  TF_RETURN_IF_ERROR(
-      ShapeUtil::ValidateShapeWithOptionalLayout(validated_shape));
-
-  OperationRequest& request =
-      (*session_computation_.mutable_requests())[handle.handle()];
-  *request.mutable_output_handle() = handle;
-  *request.mutable_output_shape() = validated_shape;
-  *request.mutable_request()->mutable_parameter_request() = parameter_request;
-
-  parameters_[parameter_number] = &request;
-
-  VLOG(1) << "AddParameterInstruction (" << GetVersionedHandleInternal()
-          << "), data handle " << handle.handle() << ": "
-          << parameter_request.ShortDebugString();
-  return handle;
-}
-
-StatusOr<ComputationDataHandle> UserComputation::AddSendInstruction(
-    const SendRequest& send_request) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  // Check if the operand of the instruction is valid.
-  TF_RETURN_IF_ERROR(LookUpRequest(send_request.operand()).status());
-
-  // No handle is returned, but a handle must be assigned to this instruction
-  // for computation versioning.
-  ComputationDataHandle handle = CreateComputationDataHandle();
-  OperationRequest& request =
-      (*session_computation_.mutable_requests())[handle.handle()];
-  *request.mutable_output_handle() = handle;
-  *request.mutable_output_shape() = ShapeUtil::MakeNil();
-  *request.mutable_request()->mutable_send_request() = send_request;
-
-  VLOG(1) << "AddSendInstruction (" << GetVersionedHandleInternal()
-          << "), data handle " << handle.handle() << ": "
-          << send_request.ShortDebugString();
-  return handle;
-}
-
-StatusOr<ComputationDataHandle> UserComputation::AddRecvInstruction(
-    const RecvRequest& recv_request) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  const Shape& shape = recv_request.shape();
-  TF_RETURN_IF_ERROR(ShapeUtil::ValidateShapeWithOptionalLayout(shape));
-  ComputationDataHandle handle = CreateComputationDataHandle();
-  OperationRequest& request =
-      (*session_computation_.mutable_requests())[handle.handle()];
-  *request.mutable_output_handle() = handle;
-  *request.mutable_output_shape() = shape;
-  *request.mutable_request()->mutable_recv_request() = recv_request;
-
-  VLOG(1) << "AddRecvInstruction (" << GetVersionedHandleInternal()
-          << "), data handle " << handle.handle() << ": "
-          << recv_request.ShortDebugString();
-  return handle;
-}
-
-StatusOr<ComputationDataHandle> UserComputation::AddPadInstruction(
-    const PadRequest& pad_request) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  TF_ASSIGN_OR_RETURN(const OperationRequest* operand,
-                      LookUpRequest(pad_request.operand()));
-
-  TF_ASSIGN_OR_RETURN(const OperationRequest* padding_value,
-                      LookUpRequest(pad_request.padding_value()));
-
-  TF_ASSIGN_OR_RETURN(Shape inferred_shape, ShapeInference::InferPadShape(
-                                                operand->output_shape(),
-                                                padding_value->output_shape(),
-                                                pad_request.padding_config()));
-
-  ComputationDataHandle handle = CreateComputationDataHandle();
-  OperationRequest& request =
-      (*session_computation_.mutable_requests())[handle.handle()];
-  *request.mutable_output_handle() = handle;
-  *request.mutable_output_shape() = inferred_shape;
-  *request.mutable_request()->mutable_pad_request() = pad_request;
-
-  VLOG(1) << "AddPadInstruction (" << GetVersionedHandleInternal()
-          << "), data handle " << handle.handle() << ": "
-          << pad_request.ShortDebugString();
-  return handle;
-}
-
-StatusOr<ComputationDataHandle> UserComputation::AddConstantInstruction(
-    const ConstantRequest& constant_request) {
-  const Shape& validated_shape = constant_request.literal().shape();
-  TF_RETURN_IF_ERROR(
-      ShapeUtil::ValidateShapeWithOptionalLayout(validated_shape));
-
-  tensorflow::mutex_lock lock(mutex_);
-
-  ComputationDataHandle handle = CreateComputationDataHandle();
-
-  OperationRequest& request =
-      (*session_computation_.mutable_requests())[handle.handle()];
-  *request.mutable_output_handle() = handle;
-  *request.mutable_output_shape() = validated_shape;
-  *request.mutable_request()->mutable_constant_request() = constant_request;
-
-  VLOG(1) << "AddConstantInstruction (" << GetVersionedHandleInternal()
-          << "), data handle " << handle.handle();
-  return handle;
-}
-
-StatusOr<ComputationDataHandle> UserComputation::AddGatherInstruction(
-    const GatherRequest& gather_request) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  TF_ASSIGN_OR_RETURN(const OperationRequest* input_request,
-                      LookUpRequest(gather_request.input()));
-  TF_ASSIGN_OR_RETURN(const OperationRequest* gather_indices_request,
-                      LookUpRequest(gather_request.gather_indices()));
-
-  TF_ASSIGN_OR_RETURN(
-      Shape shape,
-      ShapeInference::InferGatherShape(
-          input_request->output_shape(), gather_indices_request->output_shape(),
-          gather_request.dimension_numbers(),
-          AsInt64Slice(gather_request.window_bounds())));
-
-  const ComputationDataHandle handle = CreateComputationDataHandle();
-
-  OperationRequest& request =
-      (*session_computation_.mutable_requests())[handle.handle()];
-  *request.mutable_output_handle() = handle;
-  *request.mutable_output_shape() = shape;
-  *request.mutable_request()->mutable_gather_request() = gather_request;
-
-  VLOG(1) << "AddGatherInstruction (" << GetVersionedHandleInternal()
-          << "), data handle " << handle.handle() << ": "
-          << gather_request.ShortDebugString();
-  return handle;
-}
-
-StatusOr<ComputationDataHandle> UserComputation::AddGetTupleElementInstruction(
-    const GetTupleElementRequest& get_tuple_element_request) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  TF_ASSIGN_OR_RETURN(const OperationRequest* operand,
-                      LookUpRequest(get_tuple_element_request.operand()));
-  if (!ShapeUtil::IsTuple(operand->output_shape())) {
-    return InvalidArgument(
-        "Operand to GetTupleElement() is not a tuple; got %s",
-        ShapeUtil::HumanString(operand->output_shape()).c_str());
-  }
-  Shape element_shape = ShapeUtil::GetTupleElementShape(
-      operand->output_shape(), get_tuple_element_request.index());
-
-  ComputationDataHandle handle = CreateComputationDataHandle();
-
-  OperationRequest& request =
-      (*session_computation_.mutable_requests())[handle.handle()];
-  *request.mutable_output_handle() = handle;
-  *request.mutable_output_shape() = element_shape;
-  *request.mutable_request()->mutable_get_tuple_element_request() =
-      get_tuple_element_request;
-
-  VLOG(1) << "AddGetTupleElementInstruction (" << GetVersionedHandleInternal()
-          << "), data handle " << handle.handle() << ": "
-          << get_tuple_element_request.ShortDebugString();
-  return handle;
-}
-
-Status UserComputation::AddTraceInstruction(const TraceRequest& trace_request) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  // Verify that the operand index is valid.
-  TF_RETURN_IF_ERROR(LookUpRequest(trace_request.operand()).status());
-
-  ComputationDataHandle handle = CreateComputationDataHandle();
-  OperationRequest& request =
-      (*session_computation_.mutable_requests())[handle.handle()];
-  *request.mutable_output_handle() = handle;
-  *request.mutable_output_shape() = ShapeUtil::MakeNil();
-  *request.mutable_request()->mutable_trace_request() = trace_request;
-
-  VLOG(1) << "AddTraceInstruction (" << GetVersionedHandleInternal()
-          << "), data handle " << handle.handle() << ": "
-          << trace_request.ShortDebugString();
-  return Status::OK();
-}
-
-StatusOr<ComputationDataHandle> UserComputation::AddRngInstruction(
-    const RngRequest& rng_request) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  // Check the number of parameters per RNG distribution.
-  switch (rng_request.distribution()) {
-    case RandomDistribution::RNG_NORMAL:
-    case RandomDistribution::RNG_UNIFORM:
-      if (rng_request.parameter_size() != 2) {
-        return InvalidArgument(
-            "RNG distribution (%s) expects 2 parameters, but got %d",
-            RandomDistribution_Name(rng_request.distribution()).c_str(),
-            rng_request.parameter_size());
-      }
-      break;
-    default:
-      LOG(FATAL) << "unhandled distribution " << rng_request.distribution();
-  }
-
-  // Verify that the parameter indices are valid;
-  for (const ComputationDataHandle& param : rng_request.parameter()) {
-    TF_RETURN_IF_ERROR(LookUpRequest(param).status());
-  }
-  const Shape& validated_shape = rng_request.shape();
-  TF_RETURN_IF_ERROR(
-      ShapeUtil::ValidateShapeWithOptionalLayout(validated_shape));
-
-  ComputationDataHandle handle = CreateComputationDataHandle();
-
-  OperationRequest& request =
-      (*session_computation_.mutable_requests())[handle.handle()];
-  *request.mutable_output_handle() = handle;
-  *request.mutable_output_shape() = validated_shape;
-  *request.mutable_request()->mutable_rng_request() = rng_request;
-
-  VLOG(1) << "AddRngInstruction (" << GetVersionedHandleInternal()
-          << "), data handle " << handle.handle() << ": "
-          << rng_request.ShortDebugString();
-  return handle;
-}
-
-StatusOr<ComputationDataHandle> UserComputation::AddMapInstruction(
-    const MapRequest& map_request,
-    const UserComputation& to_apply_computation) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  std::vector<const Shape*> operand_shapes;
-  for (const ComputationDataHandle& handle : map_request.operands()) {
-    TF_ASSIGN_OR_RETURN(const OperationRequest* operand, LookUpRequest(handle));
-    operand_shapes.push_back(&operand->output_shape());
-  }
-
-  VersionedComputationHandle::Version to_apply_version =
-      to_apply_computation.version();
-  TF_ASSIGN_OR_RETURN(
-      std::shared_ptr<const ProgramShape> to_apply_program_shape,
-      to_apply_computation.ComputeProgramShape(to_apply_version));
-  TF_ASSIGN_OR_RETURN(
-      Shape inferred_shape,
-      ShapeInference::InferMapShape(operand_shapes, *to_apply_program_shape,
-                                    AsInt64Slice(map_request.dimensions())));
-
-  ComputationDataHandle handle = CreateComputationDataHandle();
-
-  OperationRequest& request =
-      (*session_computation_.mutable_requests())[handle.handle()];
-  *request.mutable_output_handle() = handle;
-  *request.mutable_output_shape() = inferred_shape;
-  request.add_embedded_computation_versions(to_apply_version);
-  *request.mutable_request()->mutable_map_request() = map_request;
-
-  VLOG(1) << "AddMapInstruction (" << GetVersionedHandleInternal()
-          << "), data handle " << handle.handle() << ": "
-          << map_request.ShortDebugString();
-  return handle;
-}
-
-StatusOr<ComputationDataHandle> UserComputation::AddReduceInstruction(
-    const ReduceRequest& reduce_request,
-    const UserComputation& to_apply_computation) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  TF_ASSIGN_OR_RETURN(const OperationRequest* operand,
-                      LookUpRequest(reduce_request.operand()));
-  TF_ASSIGN_OR_RETURN(const OperationRequest* init_value,
-                      LookUpRequest(reduce_request.init_value()));
-
-  VersionedComputationHandle::Version to_apply_version =
-      to_apply_computation.version();
-  TF_ASSIGN_OR_RETURN(
-      std::shared_ptr<const ProgramShape> to_apply_program_shape,
-      to_apply_computation.ComputeProgramShape(to_apply_version));
-
-  TF_ASSIGN_OR_RETURN(
-      Shape inferred_shape,
-      ShapeInference::InferReduceShape(
-          operand->output_shape(), init_value->output_shape(),
-          AsInt64Slice(reduce_request.dimensions()), *to_apply_program_shape));
-
-  ComputationDataHandle handle = CreateComputationDataHandle();
-
-  OperationRequest& request =
-      (*session_computation_.mutable_requests())[handle.handle()];
-  *request.mutable_output_handle() = handle;
-  *request.mutable_output_shape() = inferred_shape;
-  request.add_embedded_computation_versions(to_apply_version);
-  *request.mutable_request()->mutable_reduce_request() = reduce_request;
-
-  VLOG(1) << "AddReduceInstruction (" << GetVersionedHandleInternal()
-          << "), data handle " << handle.handle() << ": "
-          << reduce_request.ShortDebugString();
-  return handle;
-}
-
-StatusOr<ComputationDataHandle>
-UserComputation::AddBatchNormTrainingInstruction(
-    const BatchNormTrainingRequest& batch_norm_training_request) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  TF_ASSIGN_OR_RETURN(const OperationRequest* operand,
-                      LookUpRequest(batch_norm_training_request.operand()));
-
-  TF_ASSIGN_OR_RETURN(const OperationRequest* scale,
-                      LookUpRequest(batch_norm_training_request.scale()));
-
-  TF_ASSIGN_OR_RETURN(const OperationRequest* offset,
-                      LookUpRequest(batch_norm_training_request.offset()));
-
-  ComputationDataHandle handle = CreateComputationDataHandle();
-
-  OperationRequest& request =
-      (*session_computation_.mutable_requests())[handle.handle()];
-
-  TF_ASSIGN_OR_RETURN(
-      Shape inferred_shape,
-      ShapeInference::InferBatchNormTrainingShape(
-          operand->output_shape(), scale->output_shape(),
-          offset->output_shape(), batch_norm_training_request.feature_index()));
-
-  *request.mutable_output_shape() = inferred_shape;
-
-  *request.mutable_output_handle() = handle;
-
-  *request.mutable_request()->mutable_batch_norm_training_request() =
-      batch_norm_training_request;
-
-  VLOG(1) << "AddBatchNormTrainingInstruction (" << GetVersionedHandleInternal()
-          << "), data handle " << handle.handle() << ": "
-          << batch_norm_training_request.ShortDebugString();
-
-  return handle;
-}
-
-StatusOr<ComputationDataHandle>
-UserComputation::AddBatchNormInferenceInstruction(
-    const BatchNormInferenceRequest& batch_norm_inference_request) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  TF_ASSIGN_OR_RETURN(const OperationRequest* operand,
-                      LookUpRequest(batch_norm_inference_request.operand()));
-
-  TF_ASSIGN_OR_RETURN(const OperationRequest* scale,
-                      LookUpRequest(batch_norm_inference_request.scale()));
-
-  TF_ASSIGN_OR_RETURN(const OperationRequest* offset,
-                      LookUpRequest(batch_norm_inference_request.offset()));
-
-  TF_ASSIGN_OR_RETURN(const OperationRequest* mean,
-                      LookUpRequest(batch_norm_inference_request.mean()));
-
-  TF_ASSIGN_OR_RETURN(const OperationRequest* variance,
-                      LookUpRequest(batch_norm_inference_request.variance()));
-
-  ComputationDataHandle handle = CreateComputationDataHandle();
-
-  OperationRequest& request =
-      (*session_computation_.mutable_requests())[handle.handle()];
-
-  TF_ASSIGN_OR_RETURN(Shape inferred_shape,
-                      ShapeInference::InferBatchNormInferenceShape(
-                          operand->output_shape(), scale->output_shape(),
-                          offset->output_shape(), mean->output_shape(),
-                          variance->output_shape(),
-                          batch_norm_inference_request.feature_index()));
-
-  *request.mutable_output_shape() = inferred_shape;
-
-  *request.mutable_output_handle() = handle;
-
-  *request.mutable_request()->mutable_batch_norm_inference_request() =
-      batch_norm_inference_request;
-
-  VLOG(1) << "AddBatchNormInferenceInstruction ("
-          << GetVersionedHandleInternal() << "), data handle "
-          << handle.handle() << ": "
-          << batch_norm_inference_request.ShortDebugString();
-
-  return handle;
-}
-
-StatusOr<ComputationDataHandle> UserComputation::AddBatchNormGradInstruction(
-    const BatchNormGradRequest& batch_norm_grad_request) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  TF_ASSIGN_OR_RETURN(const OperationRequest* operand,
-                      LookUpRequest(batch_norm_grad_request.operand()));
-
-  TF_ASSIGN_OR_RETURN(const OperationRequest* scale,
-                      LookUpRequest(batch_norm_grad_request.scale()));
-
-  TF_ASSIGN_OR_RETURN(const OperationRequest* mean,
-                      LookUpRequest(batch_norm_grad_request.mean()));
-
-  TF_ASSIGN_OR_RETURN(const OperationRequest* variance,
-                      LookUpRequest(batch_norm_grad_request.variance()));
-
-  TF_ASSIGN_OR_RETURN(const OperationRequest* grad_output,
-                      LookUpRequest(batch_norm_grad_request.grad_output()));
-
-  ComputationDataHandle handle = CreateComputationDataHandle();
-
-  OperationRequest& request =
-      (*session_computation_.mutable_requests())[handle.handle()];
-
-  TF_ASSIGN_OR_RETURN(
-      Shape inferred_shape,
-      ShapeInference::InferBatchNormGradShape(
-          operand->output_shape(), scale->output_shape(), mean->output_shape(),
-          variance->output_shape(), grad_output->output_shape(),
-          batch_norm_grad_request.feature_index()));
-
-  *request.mutable_output_shape() = inferred_shape;
-
-  *request.mutable_output_handle() = handle;
-
-  *request.mutable_request()->mutable_batch_norm_grad_request() =
-      batch_norm_grad_request;
-
-  VLOG(1) << "AddBatchNormGradInstruction (" << GetVersionedHandleInternal()
-          << "), data handle " << handle.handle() << ": "
-          << batch_norm_grad_request.ShortDebugString();
-
-  return handle;
-}
-
-StatusOr<ComputationDataHandle> UserComputation::AddReduceWindowInstruction(
-    const ReduceWindowRequest& reduce_window_request,
-    const UserComputation& to_apply_computation) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  TF_ASSIGN_OR_RETURN(const OperationRequest* operand,
-                      LookUpRequest(reduce_window_request.operand()));
-  TF_ASSIGN_OR_RETURN(const OperationRequest* init_value,
-                      LookUpRequest(reduce_window_request.init_value()));
-
-  VersionedComputationHandle::Version to_apply_version =
-      to_apply_computation.version();
-  TF_ASSIGN_OR_RETURN(
-      std::shared_ptr<const ProgramShape> to_apply_program_shape,
-      to_apply_computation.ComputeProgramShape(to_apply_version));
-
-  TF_ASSIGN_OR_RETURN(
-      Shape inferred_shape,
-      ShapeInference::InferReduceWindowShape(
-          operand->output_shape(), init_value->output_shape(),
-          reduce_window_request.window(), *to_apply_program_shape));
-
-  ComputationDataHandle handle = CreateComputationDataHandle();
-
-  OperationRequest& request =
-      (*session_computation_.mutable_requests())[handle.handle()];
-  *request.mutable_output_handle() = handle;
-  *request.mutable_output_shape() = inferred_shape;
-  request.add_embedded_computation_versions(to_apply_version);
-  *request.mutable_request()->mutable_reduce_window_request() =
-      reduce_window_request;
-
-  VLOG(1) << "AddReduceWindowInstruction (" << GetVersionedHandleInternal()
-          << "), data handle " << handle.handle() << ": "
-          << reduce_window_request.ShortDebugString();
-  return handle;
-}
-
-StatusOr<ComputationDataHandle> UserComputation::AddSelectAndScatterInstruction(
-    const SelectAndScatterRequest& select_and_scatter_request,
-    const UserComputation& select_computation,
-    const UserComputation& scatter_computation) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  TF_ASSIGN_OR_RETURN(const OperationRequest* operand,
-                      LookUpRequest(select_and_scatter_request.operand()));
-  TF_ASSIGN_OR_RETURN(const OperationRequest* source,
-                      LookUpRequest(select_and_scatter_request.source()));
-  TF_ASSIGN_OR_RETURN(const OperationRequest* init_value,
-                      LookUpRequest(select_and_scatter_request.init_value()));
-
-  VersionedComputationHandle::Version select_version =
-      select_computation.version();
-  TF_ASSIGN_OR_RETURN(std::shared_ptr<const ProgramShape> select_program_shape,
-                      select_computation.ComputeProgramShape(select_version));
-  VersionedComputationHandle::Version scatter_version =
-      scatter_computation.version();
-  TF_ASSIGN_OR_RETURN(std::shared_ptr<const ProgramShape> scatter_program_shape,
-                      scatter_computation.ComputeProgramShape(scatter_version));
-
-  TF_ASSIGN_OR_RETURN(
-      Shape inferred_shape,
-      ShapeInference::InferSelectAndScatterShape(
-          operand->output_shape(), *select_program_shape,
-          select_and_scatter_request.window(), source->output_shape(),
-          init_value->output_shape(), *scatter_program_shape));
-
-  ComputationDataHandle handle = CreateComputationDataHandle();
-
-  OperationRequest& request =
-      (*session_computation_.mutable_requests())[handle.handle()];
-  *request.mutable_output_handle() = handle;
-  *request.mutable_output_shape() = inferred_shape;
-  request.add_embedded_computation_versions(select_version);
-  request.add_embedded_computation_versions(scatter_version);
-  *request.mutable_request()->mutable_select_and_scatter_request() =
-      select_and_scatter_request;
-
-  VLOG(1) << "AddSelectAndScatterInstruction (" << GetVersionedHandleInternal()
-          << "), data handle " << handle.handle() << ": "
-          << select_and_scatter_request.ShortDebugString();
-  return handle;
-}
-
-StatusOr<ComputationDataHandle> UserComputation::AddReverseInstruction(
-    const ReverseRequest& reverse_request) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  TF_ASSIGN_OR_RETURN(const OperationRequest* operand,
-                      LookUpRequest(reverse_request.operand()));
-  TF_ASSIGN_OR_RETURN(
-      Shape inferred_shape,
-      ShapeInference::InferReverseShape(
-          operand->output_shape(), AsInt64Slice(reverse_request.dimensions())));
-
-  ComputationDataHandle handle = CreateComputationDataHandle();
-  OperationRequest& request =
-      (*session_computation_.mutable_requests())[handle.handle()];
-  *request.mutable_output_handle() = handle;
-  *request.mutable_output_shape() = inferred_shape;
-  *request.mutable_request()->mutable_reverse_request() = reverse_request;
-  VLOG(1) << "AddReverseInstruction (" << GetVersionedHandleInternal()
-          << "), data handle " << handle.handle() << ": "
-          << reverse_request.ShortDebugString();
-  return handle;
-}
-
-StatusOr<ComputationDataHandle> UserComputation::AddWhileInstruction(
-    const WhileRequest& while_request,
-    const UserComputation& condition_computation,
-    const UserComputation& body_computation) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  TF_ASSIGN_OR_RETURN(const OperationRequest* init,
-                      LookUpRequest(while_request.init()));
-
-  VersionedComputationHandle::Version condition_version =
-      condition_computation.version();
-  TF_ASSIGN_OR_RETURN(
-      std::shared_ptr<const ProgramShape> condition_program_shape,
-      condition_computation.ComputeProgramShape(condition_version));
-
-  VersionedComputationHandle::Version body_version = body_computation.version();
-  TF_ASSIGN_OR_RETURN(std::shared_ptr<const ProgramShape> body_program_shape,
-                      body_computation.ComputeProgramShape(body_version));
-
-  TF_ASSIGN_OR_RETURN(
-      Shape inferred_shape,
-      ShapeInference::InferWhileShape(
-          *condition_program_shape, *body_program_shape, init->output_shape()));
-
-  ComputationDataHandle handle = CreateComputationDataHandle();
-
-  OperationRequest& request =
-      (*session_computation_.mutable_requests())[handle.handle()];
-  *request.mutable_output_handle() = handle;
-  *request.mutable_output_shape() = inferred_shape;
-  request.add_embedded_computation_versions(condition_version);
-  request.add_embedded_computation_versions(body_version);
-  *request.mutable_request()->mutable_while_request() = while_request;
-
-  VLOG(1) << "AddWhileInstruction (" << GetVersionedHandleInternal()
-          << "), data handle " << handle.handle() << ": "
-          << while_request.ShortDebugString();
-  return handle;
-}
-
-StatusOr<ComputationDataHandle> UserComputation::AddConditionalInstruction(
-    const ConditionalRequest& conditional_request,
-    const UserComputation& true_computation,
-    const UserComputation& false_computation) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  TF_ASSIGN_OR_RETURN(const OperationRequest* pred,
-                      LookUpRequest(conditional_request.predicate()));
-  TF_ASSIGN_OR_RETURN(const OperationRequest* true_operand,
-                      LookUpRequest(conditional_request.true_operand()));
-  TF_ASSIGN_OR_RETURN(const OperationRequest* false_operand,
-                      LookUpRequest(conditional_request.false_operand()));
-
-  VersionedComputationHandle::Version true_computation_version =
-      true_computation.version();
-  TF_ASSIGN_OR_RETURN(
-      std::shared_ptr<const ProgramShape> true_computation_shape,
-      true_computation.ComputeProgramShape(true_computation_version));
-
-  VersionedComputationHandle::Version false_computation_version =
-      false_computation.version();
-  TF_ASSIGN_OR_RETURN(
-      std::shared_ptr<const ProgramShape> false_computation_shape,
-      false_computation.ComputeProgramShape(false_computation_version));
-
-  TF_ASSIGN_OR_RETURN(Shape inferred_shape,
-                      ShapeInference::InferConditionalShape(
-                          pred->output_shape(), true_operand->output_shape(),
-                          false_operand->output_shape(),
-                          *true_computation_shape, *false_computation_shape));
-
-  ComputationDataHandle handle = CreateComputationDataHandle();
-
-  OperationRequest& request =
-      (*session_computation_.mutable_requests())[handle.handle()];
-  *request.mutable_output_handle() = handle;
-  *request.mutable_output_shape() = inferred_shape;
-  request.add_embedded_computation_versions(true_computation_version);
-  request.add_embedded_computation_versions(false_computation_version);
-  *request.mutable_request()->mutable_conditional_request() =
-      conditional_request;
-
-  VLOG(1) << "AddConditionalInstruction (" << GetVersionedHandleInternal()
-          << "), data handle " << handle.handle() << ": "
-          << conditional_request.ShortDebugString();
-  return handle;
-}
-
-StatusOr<ComputationDataHandle> UserComputation::AddBroadcastInstruction(
-    const BroadcastRequest& broadcast_request) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  // Fetches and validates the operand.
-  TF_ASSIGN_OR_RETURN(const OperationRequest* operand,
-                      LookUpRequest(broadcast_request.operand()));
-  TF_ASSIGN_OR_RETURN(Shape inferred_shape,
-                      ShapeInference::InferBroadcastShape(
-                          operand->output_shape(),
-                          AsInt64Slice(broadcast_request.broadcast_sizes())));
-
-  ComputationDataHandle handle = CreateComputationDataHandle();
-  OperationRequest& request =
-      (*session_computation_.mutable_requests())[handle.handle()];
-  *request.mutable_output_handle() = handle;
-  *request.mutable_output_shape() = inferred_shape;
-  *request.mutable_request()->mutable_broadcast_request() = broadcast_request;
-
-  VLOG(1) << "AddBroadcastInstruction (" << GetVersionedHandleInternal()
-          << "), data handle " << handle.handle() << ": "
-          << broadcast_request.ShortDebugString();
-  return handle;
-}
-
-StatusOr<ComputationDataHandle> UserComputation::AddReshapeInstruction(
-    const ReshapeRequest& reshape_request) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  // Fetches and validates the operand.
-  TF_ASSIGN_OR_RETURN(const OperationRequest* operand,
-                      LookUpRequest(reshape_request.operand()));
-
-  TF_ASSIGN_OR_RETURN(
-      Shape inferred_shape,
-      ShapeInference::InferReshapeShape(
-          operand->output_shape(), AsInt64Slice(reshape_request.dimensions()),
-          AsInt64Slice(reshape_request.new_sizes())));
-
-  ComputationDataHandle handle = CreateComputationDataHandle();
-
-  OperationRequest& request =
-      (*session_computation_.mutable_requests())[handle.handle()];
-  *request.mutable_output_handle() = handle;
-  *request.mutable_output_shape() = inferred_shape;
-  *request.mutable_request()->mutable_reshape_request() = reshape_request;
-
-  VLOG(1) << "AddReshapeInstruction (" << GetVersionedHandleInternal()
-          << "), data handle " << handle.handle() << ": "
-          << reshape_request.ShortDebugString();
-  return handle;
-}
-
-StatusOr<ComputationDataHandle> UserComputation::AddTransposeInstruction(
-    const TransposeRequest& transpose_request) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  // Fetches and validates the operand.
-  TF_ASSIGN_OR_RETURN(const OperationRequest* operand,
-                      LookUpRequest(transpose_request.operand()));
-
-  TF_ASSIGN_OR_RETURN(Shape inferred_shape,
-                      ShapeInference::InferTransposeShape(
-                          operand->output_shape(),
-                          AsInt64Slice(transpose_request.dimensions())));
-
-  ComputationDataHandle handle = CreateComputationDataHandle();
-
-  OperationRequest& request =
-      (*session_computation_.mutable_requests())[handle.handle()];
-  *request.mutable_output_handle() = handle;
-  *request.mutable_output_shape() = inferred_shape;
-  *request.mutable_request()->mutable_transpose_request() = transpose_request;
-
-  VLOG(1) << "AddTransposeInstruction (" << GetVersionedHandleInternal()
-          << "), data handle " << handle.handle() << ": "
-          << transpose_request.ShortDebugString();
-  return handle;
-}
-
-StatusOr<ComputationDataHandle> UserComputation::AddSliceInstruction(
-    const SliceRequest& slice_request) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  TF_ASSIGN_OR_RETURN(const OperationRequest* operand,
-                      LookUpRequest(slice_request.operand()));
-
-  TF_ASSIGN_OR_RETURN(
-      Shape new_shape,
-      ShapeInference::InferSliceShape(
-          operand->output_shape(), AsInt64Slice(slice_request.start_indices()),
-          AsInt64Slice(slice_request.limit_indices()),
-          AsInt64Slice(slice_request.strides())));
-
-  ComputationDataHandle handle = CreateComputationDataHandle();
-
-  OperationRequest& request =
-      (*session_computation_.mutable_requests())[handle.handle()];
-  *request.mutable_output_handle() = handle;
-  *request.mutable_output_shape() = new_shape;
-  *request.mutable_request()->mutable_slice_request() = slice_request;
-
-  VLOG(1) << "AddSliceInstruction (" << GetVersionedHandleInternal()
-          << "), data handle " << handle.handle() << ": "
-          << slice_request.ShortDebugString();
-  return handle;
-}
-
-StatusOr<ComputationDataHandle> UserComputation::AddDynamicSliceInstruction(
-    const DynamicSliceRequest& dynamic_slice_request) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  TF_ASSIGN_OR_RETURN(const OperationRequest* operand,
-                      LookUpRequest(dynamic_slice_request.operand()));
-
-  TF_ASSIGN_OR_RETURN(const OperationRequest* start_indices,
-                      LookUpRequest(dynamic_slice_request.start_indices()));
-
-  TF_ASSIGN_OR_RETURN(
-      Shape new_shape,
-      ShapeInference::InferDynamicSliceShape(
-          operand->output_shape(), start_indices->output_shape(),
-          AsInt64Slice(dynamic_slice_request.slice_sizes())));
-
-  ComputationDataHandle handle = CreateComputationDataHandle();
-
-  OperationRequest& request =
-      (*session_computation_.mutable_requests())[handle.handle()];
-  *request.mutable_output_handle() = handle;
-  *request.mutable_output_shape() = new_shape;
-  *request.mutable_request()->mutable_dynamic_slice_request() =
-      dynamic_slice_request;
-
-  VLOG(1) << "AddDynamicSliceInstruction (" << GetVersionedHandleInternal()
-          << "), data handle " << handle.handle() << ": "
-          << dynamic_slice_request.ShortDebugString();
-  return handle;
-}
-
-StatusOr<ComputationDataHandle>
-UserComputation::AddDynamicUpdateSliceInstruction(
-    const DynamicUpdateSliceRequest& dynamic_update_slice_request) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  TF_ASSIGN_OR_RETURN(const OperationRequest* operand,
-                      LookUpRequest(dynamic_update_slice_request.operand()));
-
-  TF_ASSIGN_OR_RETURN(const OperationRequest* update,
-                      LookUpRequest(dynamic_update_slice_request.update()));
-
-  TF_ASSIGN_OR_RETURN(
-      const OperationRequest* start_indices,
-      LookUpRequest(dynamic_update_slice_request.start_indices()));
-
-  TF_ASSIGN_OR_RETURN(Shape new_shape,
-                      ShapeInference::InferDynamicUpdateSliceShape(
-                          operand->output_shape(), update->output_shape(),
-                          start_indices->output_shape()));
-
-  ComputationDataHandle handle = CreateComputationDataHandle();
-
-  OperationRequest& request =
-      (*session_computation_.mutable_requests())[handle.handle()];
-  *request.mutable_output_handle() = handle;
-  *request.mutable_output_shape() = new_shape;
-  *request.mutable_request()->mutable_dynamic_update_slice_request() =
-      dynamic_update_slice_request;
-
-  VLOG(1) << "AddDynamicUpdateSliceInstruction ("
-          << GetVersionedHandleInternal() << "), data handle "
-          << handle.handle() << ": "
-          << dynamic_update_slice_request.ShortDebugString();
-  return handle;
-}
-
-StatusOr<ComputationDataHandle> UserComputation::AddConcatenateInstruction(
-    const ConcatenateRequest& concatenate_request) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  std::vector<const Shape*> operand_shapes;
-  for (const ComputationDataHandle& handle : concatenate_request.operands()) {
-    TF_ASSIGN_OR_RETURN(const OperationRequest* operand, LookUpRequest(handle));
-    operand_shapes.push_back(&operand->output_shape());
-  }
-
-  TF_ASSIGN_OR_RETURN(Shape new_shape,
-                      ShapeInference::InferConcatOpShape(
-                          operand_shapes, concatenate_request.dimension()));
-
-  ComputationDataHandle handle = CreateComputationDataHandle();
-
-  OperationRequest& request =
-      (*session_computation_.mutable_requests())[handle.handle()];
-  *request.mutable_output_handle() = handle;
-  *request.mutable_output_shape() = new_shape;
-  *request.mutable_request()->mutable_concatenate_request() =
-      concatenate_request;
-
-  VLOG(1) << "AddConcatenateInstruction (" << GetVersionedHandleInternal()
-          << "), data handle " << handle.handle() << ": "
-          << concatenate_request.ShortDebugString();
-  return handle;
-}
-
-StatusOr<ComputationDataHandle> UserComputation::AddConvertInstruction(
-    const ConvertRequest& convert_request) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  TF_ASSIGN_OR_RETURN(const OperationRequest* operand,
-                      LookUpRequest(convert_request.operand()));
-
-  TF_ASSIGN_OR_RETURN(Shape new_shape, ShapeInference::InferConvertShape(
-                                           operand->output_shape(),
-                                           convert_request.new_element_type()));
-
-  ComputationDataHandle handle = CreateComputationDataHandle();
-
-  OperationRequest& request =
-      (*session_computation_.mutable_requests())[handle.handle()];
-  *request.mutable_output_handle() = handle;
-  *request.mutable_output_shape() = new_shape;
-  *request.mutable_request()->mutable_convert_request() = convert_request;
-
-  VLOG(1) << "AddConvertInstruction (" << GetVersionedHandleInternal()
-          << "), data handle " << handle.handle() << ": "
-          << convert_request.ShortDebugString();
-  return handle;
-}
-
-StatusOr<ComputationDataHandle> UserComputation::AddBitcastConvertInstruction(
-    const ConvertRequest& convert_request) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  TF_ASSIGN_OR_RETURN(const OperationRequest* operand,
-                      LookUpRequest(convert_request.operand()));
-
-  TF_ASSIGN_OR_RETURN(Shape new_shape, ShapeInference::InferConvertShape(
-                                           operand->output_shape(),
-                                           convert_request.new_element_type()));
-
-  ComputationDataHandle handle = CreateComputationDataHandle();
-
-  OperationRequest& request =
-      (*session_computation_.mutable_requests())[handle.handle()];
-  *request.mutable_output_handle() = handle;
-  *request.mutable_output_shape() = new_shape;
-  *request.mutable_request()->mutable_bitcast_convert_request() =
-      convert_request;
-
-  VLOG(1) << "AddBitcastConvertInstruction (" << GetVersionedHandleInternal()
-          << "), data handle " << handle.handle() << ": "
-          << convert_request.ShortDebugString();
-  return handle;
-}
-
-StatusOr<ComputationDataHandle> UserComputation::AddReducePrecisionInstruction(
-    const ReducePrecisionRequest& reduce_precision_request) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  TF_ASSIGN_OR_RETURN(const OperationRequest* operand,
-                      LookUpRequest(reduce_precision_request.operand()));
-
-  TF_ASSIGN_OR_RETURN(
-      Shape new_shape,
-      ShapeInference::InferReducePrecisionShape(
-          operand->output_shape(), reduce_precision_request.exponent_bits(),
-          reduce_precision_request.mantissa_bits()));
-
-  ComputationDataHandle handle = CreateComputationDataHandle();
-
-  OperationRequest& request =
-      (*session_computation_.mutable_requests())[handle.handle()];
-  *request.mutable_output_handle() = handle;
-  *request.mutable_output_shape() = new_shape;
-  *request.mutable_request()->mutable_reduce_precision_request() =
-      reduce_precision_request;
-
-  VLOG(1) << "AddReducePrecisionInstruction (" << GetVersionedHandleInternal()
-          << "), data handle " << handle.handle() << ": "
-          << reduce_precision_request.ShortDebugString();
-  return handle;
-}
-
-StatusOr<ComputationDataHandle> UserComputation::AddConvolveInstruction(
-    const ConvolveRequest& convolve_request) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  TF_ASSIGN_OR_RETURN(const OperationRequest* lhs,
-                      LookUpRequest(convolve_request.lhs()));
-  TF_ASSIGN_OR_RETURN(const OperationRequest* rhs,
-                      LookUpRequest(convolve_request.rhs()));
-  TF_ASSIGN_OR_RETURN(Shape shape, ShapeInference::InferConvolveShape(
-                                       lhs->output_shape(), rhs->output_shape(),
-                                       convolve_request.window(),
-                                       convolve_request.dimension_numbers()));
-
-  const ComputationDataHandle handle = CreateComputationDataHandle();
-
-  OperationRequest& request =
-      (*session_computation_.mutable_requests())[handle.handle()];
-  *request.mutable_output_handle() = handle;
-  *request.mutable_output_shape() = shape;
-  *request.mutable_request()->mutable_convolve_request() = convolve_request;
-
-  VLOG(1) << "AddConvolveInstruction (" << GetVersionedHandleInternal()
-          << "), data handle " << handle.handle() << ": "
-          << convolve_request.ShortDebugString();
-  return handle;
-}
-
-StatusOr<ComputationDataHandle> UserComputation::AddFftInstruction(
-    const FftRequest& fft_request) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  TF_ASSIGN_OR_RETURN(const OperationRequest* operand,
-                      LookUpRequest(fft_request.operand()));
-  TF_ASSIGN_OR_RETURN(Shape shape,
-                      ShapeInference::InferFftShape(
-                          operand->output_shape(), fft_request.fft_type(),
-                          AsInt64Slice(fft_request.fft_length())));
-
-  const ComputationDataHandle handle = CreateComputationDataHandle();
-
-  OperationRequest& request =
-      (*session_computation_.mutable_requests())[handle.handle()];
-  *request.mutable_output_handle() = handle;
-  *request.mutable_output_shape() = shape;
-  *request.mutable_request()->mutable_fft_request() = fft_request;
-
-  VLOG(1) << "AddFftInstruction (" << GetVersionedHandleInternal()
-          << "), data handle " << handle.handle() << ": "
-          << fft_request.ShortDebugString();
-  return handle;
-}
-
-StatusOr<ComputationDataHandle> UserComputation::AddCrossReplicaSumInstruction(
-    const CrossReplicaSumRequest& cross_replica_sum_request) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  TF_ASSIGN_OR_RETURN(const OperationRequest* operand,
-                      LookUpRequest(cross_replica_sum_request.operand()));
-  TF_ASSIGN_OR_RETURN(Shape shape, ShapeInference::InferCrossReplicaSumShape(
-                                       {&operand->output_shape()}));
-
-  ComputationDataHandle handle = CreateComputationDataHandle();
-
-  OperationRequest& request =
-      (*session_computation_.mutable_requests())[handle.handle()];
-  *request.mutable_output_handle() = handle;
-  *request.mutable_output_shape() = shape;
-  *request.mutable_request()->mutable_cross_replica_sum_request() =
-      cross_replica_sum_request;
-
-  VLOG(1) << "AddCrossreplicaSumInstruction (" << GetVersionedHandleInternal()
-          << "), data handle " << handle.handle() << ": "
-          << cross_replica_sum_request.ShortDebugString();
-  return handle;
-}
-
-StatusOr<ComputationDataHandle> UserComputation::AddInfeedInstruction(
-    const InfeedRequest& infeed_request) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  const Shape& shape = infeed_request.shape();
-  if (!LayoutUtil::HasLayout(shape)) {
-    return InvalidArgument("Given shape to Infeed must have a layout");
-  }
-
-  const ComputationDataHandle handle = CreateComputationDataHandle();
-
-  OperationRequest& request =
-      (*session_computation_.mutable_requests())[handle.handle()];
-  *request.mutable_output_handle() = handle;
-  *request.mutable_output_shape() = shape;
-  *request.mutable_request()->mutable_infeed_request() = infeed_request;
-
-  VLOG(1) << "AddInfeedInstruction (" << GetVersionedHandleInternal()
-          << "), data handle " << handle.handle() << ": "
-          << infeed_request.ShortDebugString();
-  return handle;
-}
-
-StatusOr<ComputationDataHandle> UserComputation::AddOutfeedInstruction(
-    const OutfeedRequest& outfeed_request) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  const Shape& shape = outfeed_request.shape();
-  if (!LayoutUtil::HasLayout(shape)) {
-    return InvalidArgument("Given shape to Outfeed must have a layout");
-  }
-
-  // Verify that operand is valid.
-  TF_RETURN_IF_ERROR(LookUpRequest(outfeed_request.operand()).status());
-
-  ComputationDataHandle handle = CreateComputationDataHandle();
-  OperationRequest& request =
-      (*session_computation_.mutable_requests())[handle.handle()];
-  *request.mutable_output_handle() = handle;
-  *request.mutable_output_shape() = shape;
-  *request.mutable_request()->mutable_outfeed_request() = outfeed_request;
-
-  VLOG(1) << "AddOutfeedInstruction (" << GetVersionedHandleInternal()
-          << "), data handle " << handle.handle() << ": "
-          << outfeed_request.ShortDebugString();
-  return handle;
-}
-
-StatusOr<ComputationDataHandle> UserComputation::AddCallInstruction(
-    const CallRequest& call_request,
-    const UserComputation& to_apply_computation) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  std::vector<const Shape*> operand_shapes;
-  for (const ComputationDataHandle& handle : call_request.operands()) {
-    TF_ASSIGN_OR_RETURN(const OperationRequest* operand, LookUpRequest(handle));
-    operand_shapes.push_back(&operand->output_shape());
-  }
-
-  VersionedComputationHandle::Version to_apply_version =
-      to_apply_computation.version();
-  TF_ASSIGN_OR_RETURN(
-      std::shared_ptr<const ProgramShape> to_apply_program_shape,
-      to_apply_computation.ComputeProgramShape(to_apply_version));
-  TF_ASSIGN_OR_RETURN(
-      Shape inferred_shape,
-      ShapeInference::InferCallShape(operand_shapes, *to_apply_program_shape));
-
-  ComputationDataHandle handle = CreateComputationDataHandle();
-
-  OperationRequest& request =
-      (*session_computation_.mutable_requests())[handle.handle()];
-  *request.mutable_output_handle() = handle;
-  *request.mutable_output_shape() = inferred_shape;
-  request.add_embedded_computation_versions(to_apply_version);
-  *request.mutable_request()->mutable_call_request() = call_request;
-
-  VLOG(1) << "AddCallInstruction (" << GetVersionedHandleInternal()
-          << "), data handle " << handle.handle() << ": "
-          << call_request.ShortDebugString();
-  return handle;
-}
-
-StatusOr<ComputationDataHandle> UserComputation::AddCustomCallInstruction(
-    const CustomCallRequest& custom_call_request) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  for (const ComputationDataHandle& handle : custom_call_request.operands()) {
-    TF_RETURN_IF_ERROR(LookUpRequest(handle).status());
-  }
-
-  if (tensorflow::str_util::StartsWith(custom_call_request.call_target_name(),
-                                       "$")) {
-    return InvalidArgument(
-        "Invalid custom_call_target \"%s\": Call targets that start with '$' "
-        "are reserved for internal use.",
-        custom_call_request.call_target_name().c_str());
-  }
-
-  const ComputationDataHandle handle = CreateComputationDataHandle();
-
-  OperationRequest& request =
-      (*session_computation_.mutable_requests())[handle.handle()];
-  *request.mutable_output_handle() = handle;
-  *request.mutable_output_shape() = custom_call_request.shape();
-  *request.mutable_request()->mutable_custom_call_request() =
-      custom_call_request;
-
-  VLOG(1) << "AddCustomCallInstruction (" << GetVersionedHandleInternal()
-          << "), data handle " << handle.handle() << ": "
-          << custom_call_request.ShortDebugString();
-  return handle;
-}
-
-StatusOr<ComputationDataHandle> UserComputation::AddHostComputeInstruction(
-    const HostComputeRequest& host_compute_request) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  for (const ComputationDataHandle& handle : host_compute_request.operands()) {
-    TF_RETURN_IF_ERROR(LookUpRequest(handle).status());
-  }
-
-  ComputationDataHandle handle = CreateComputationDataHandle();
-  OperationRequest& request =
-      (*session_computation_.mutable_requests())[handle.handle()];
-  *request.mutable_output_handle() = handle;
-  *request.mutable_output_shape() = host_compute_request.shape();
-  *request.mutable_request()->mutable_host_compute_request() =
-      host_compute_request;
-
-  VLOG(1) << "AddHostComputeInstruction (" << GetVersionedHandleInternal()
-          << "), data handle " << handle.handle() << ": "
-          << host_compute_request.ShortDebugString();
-  return handle;
-}
-
-StatusOr<ComputationDataHandle> UserComputation::AddDotInstruction(
-    const DotRequest& dot_request) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  TF_ASSIGN_OR_RETURN(const OperationRequest* lhs,
-                      LookUpRequest(dot_request.lhs()));
-  TF_ASSIGN_OR_RETURN(const OperationRequest* rhs,
-                      LookUpRequest(dot_request.rhs()));
-
-  TF_ASSIGN_OR_RETURN(Shape shape, ShapeInference::InferDotOpShape(
-                                       lhs->output_shape(), rhs->output_shape(),
-                                       dot_request.dimension_numbers()));
-
-  const ComputationDataHandle handle = CreateComputationDataHandle();
-
-  OperationRequest& request =
-      (*session_computation_.mutable_requests())[handle.handle()];
-  *request.mutable_output_handle() = handle;
-  *request.mutable_output_shape() = shape;
-  *request.mutable_request()->mutable_dot_request() = dot_request;
-
-  VLOG(1) << "AddDotInstruction (" << GetVersionedHandleInternal()
-          << "), data handle " << handle.handle() << ": "
-          << dot_request.ShortDebugString();
-  return handle;
-}
-
-StatusOr<ComputationDataHandle> UserComputation::AddUnaryInstruction(
-    const UnaryOpRequest& unary_request) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  TF_ASSIGN_OR_RETURN(const OperationRequest* operand,
-                      LookUpRequest(unary_request.operand()));
-  TF_ASSIGN_OR_RETURN(
-      Shape shape, ShapeInference::InferUnaryOpShape(unary_request.unop(),
-                                                     operand->output_shape()));
-
-  ComputationDataHandle handle = CreateComputationDataHandle();
-
-  OperationRequest& request =
-      (*session_computation_.mutable_requests())[handle.handle()];
-  *request.mutable_output_handle() = handle;
-  *request.mutable_output_shape() = shape;
-  *request.mutable_request()->mutable_unary_op_request() = unary_request;
-
-  VLOG(1) << "AddUnaryInstruction (" << GetVersionedHandleInternal()
-          << "), data handle " << handle.handle() << ": "
-          << unary_request.ShortDebugString();
-  return handle;
-}
-
-StatusOr<ComputationDataHandle> UserComputation::AddBinaryInstruction(
-    const BinaryOpRequest& binary_request) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  TF_ASSIGN_OR_RETURN(const OperationRequest* lhs,
-                      LookUpRequest(binary_request.lhs()));
-  TF_ASSIGN_OR_RETURN(const OperationRequest* rhs,
-                      LookUpRequest(binary_request.rhs()));
-  TF_ASSIGN_OR_RETURN(
-      Shape shape,
-      ShapeInference::InferBinaryOpShape(
-          binary_request.binop(), lhs->output_shape(), rhs->output_shape(),
-          AsInt64Slice(binary_request.broadcast_dimensions())));
-
-  ComputationDataHandle handle = CreateComputationDataHandle();
-
-  OperationRequest& request =
-      (*session_computation_.mutable_requests())[handle.handle()];
-  *request.mutable_output_handle() = handle;
-  *request.mutable_output_shape() = shape;
-  *request.mutable_request()->mutable_binary_op_request() = binary_request;
-
-  VLOG(1) << "AddBinaryInstruction (" << GetVersionedHandleInternal()
-          << "), data handle " << handle.handle() << ": "
-          << binary_request.ShortDebugString();
-  return handle;
-}
-
-StatusOr<ComputationDataHandle> UserComputation::AddTernaryInstruction(
-    const TernaryOpRequest& ternary_request) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  TF_ASSIGN_OR_RETURN(const OperationRequest* lhs,
-                      LookUpRequest(ternary_request.lhs()));
-  TF_ASSIGN_OR_RETURN(const OperationRequest* rhs,
-                      LookUpRequest(ternary_request.rhs()));
-  TF_ASSIGN_OR_RETURN(const OperationRequest* ehs,
-                      LookUpRequest(ternary_request.ehs()));
-  TF_ASSIGN_OR_RETURN(Shape shape,
-                      ShapeInference::InferTernaryOpShape(
-                          ternary_request.triop(), lhs->output_shape(),
-                          rhs->output_shape(), ehs->output_shape()));
-
-  ComputationDataHandle handle = CreateComputationDataHandle();
-
-  OperationRequest& request =
-      (*session_computation_.mutable_requests())[handle.handle()];
-  *request.mutable_output_handle() = handle;
-  *request.mutable_output_shape() = shape;
-  *request.mutable_request()->mutable_ternary_op_request() = ternary_request;
-
-  VLOG(1) << "AddTernaryInstruction (" << GetVersionedHandleInternal()
-          << "), data handle " << handle.handle() << ": "
-          << ternary_request.ShortDebugString();
-  return handle;
-}
-
-StatusOr<ComputationDataHandle> UserComputation::AddVariadicInstruction(
-    const VariadicOpRequest& variadic_request) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  std::vector<const Shape*> operand_shapes;
-  for (const ComputationDataHandle& handle : variadic_request.operands()) {
-    TF_ASSIGN_OR_RETURN(const OperationRequest* operand, LookUpRequest(handle));
-    operand_shapes.push_back(&operand->output_shape());
-  }
-
-  TF_ASSIGN_OR_RETURN(Shape shape,
-                      ShapeInference::InferVariadicOpShape(
-                          variadic_request.varop(), operand_shapes));
-
-  ComputationDataHandle handle = CreateComputationDataHandle();
-
-  OperationRequest& request =
-      (*session_computation_.mutable_requests())[handle.handle()];
-  *request.mutable_output_handle() = handle;
-  *request.mutable_output_shape() = shape;
-  *request.mutable_request()->mutable_variadic_op_request() = variadic_request;
-
-  VLOG(1) << "AddVariadicInstruction (" << GetVersionedHandleInternal()
-          << "), data handle " << handle.handle() << ": "
-          << variadic_request.ShortDebugString();
-  return handle;
-}
-
-StatusOr<Shape> UserComputation::GetShape(const ComputationDataHandle& handle) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  TF_ASSIGN_OR_RETURN(const OperationRequest* operand, LookUpRequest(handle));
-  return operand->output_shape();
-}
-
-Status UserComputation::SetOpMetadata(const ComputationDataHandle& handle,
-                                      const OpMetadata& metadata) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  int64 handle_value = handle.handle();
-  if (session_computation_.requests().count(handle_value) == 0) {
-    return InvalidArgument("Invalid handle in SetOpMetadata (%lld)",
-                           handle_value);
-  }
-  *session_computation_.mutable_requests()
-       ->at(handle_value)
-       .mutable_request()
-       ->mutable_metadata() = metadata;
-  return Status::OK();
-}
-
-Status UserComputation::SetOpSharding(const ComputationDataHandle& handle,
-                                      const OpSharding& sharding) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  int64 handle_value = handle.handle();
-  if (session_computation_.requests().count(handle_value) == 0) {
-    return InvalidArgument("Invalid handle in SetOpSharding (%lld)",
-                           handle_value);
-  }
-  *session_computation_.mutable_requests()
-       ->at(handle_value)
-       .mutable_request()
-       ->mutable_sharding() = sharding;
-  return Status::OK();
-}
-
-Status UserComputation::SetReturnValue(const ComputationDataHandle& handle) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  if (!(handle.handle() > 0 && handle.handle() < next_handle_value_)) {
-    return InvalidArgument("Invalid handle in SetReturnValue");
-  }
-
-  handle_to_return_ = handle;
-
-  VLOG(1) << "SetReturnValue of computation \"" << name() << "\" fixed to "
-          << GetVersionedHandleInternal();
-
-  return Status::OK();
-}
-
-VersionedComputationHandle UserComputation::GetVersionedHandle() const {
-  tensorflow::mutex_lock lock(mutex_);
-  return GetVersionedHandleInternal();
-}
-
-VersionedComputationHandle UserComputation::GetVersionedHandleInternal() const {
-  VersionedComputationHandle versioned_handle;
-  versioned_handle.handle = session_computation_.computation_handle();
-
-  if (handle_to_return_.handle() > 0) {
-    // A specific handle has been requested for the result of the computation.
-    versioned_handle.version = handle_to_return_.handle();
-  } else {
-    // A version value is simply the most recently assigned
-    // ComputationDataHandle value, ie the handle value of the root of the
-    // computation.
-    versioned_handle.version = next_handle_value_ - 1;
-  }
-
-  return versioned_handle;
-}
-
-VersionedComputationHandle UserComputation::GetVersionedHandleAtOperation(
-    const ComputationDataHandle& operation) const {
-  tensorflow::mutex_lock lock(mutex_);
-
-  // The version at which an operation was added is simply the handle value of
-  // the ComputationDataHandle.
-  VersionedComputationHandle versioned_handle;
-  versioned_handle.handle = session_computation_.computation_handle();
-  versioned_handle.version = operation.handle();
-  return versioned_handle;
-}
-
-VersionedComputationHandle::Version UserComputation::version() const {
-  return GetVersionedHandle().version;
-}
-
-namespace {
-
-// Returns true if the operation type corresponding to the given opcase can be
-// the root of the computation.
-bool CanBeRoot(const OpRequest::OpCase& op_case) {
-  switch (op_case) {
-    case OpRequest::kTraceRequest:
-    case OpRequest::kSendRequest:
-    case OpRequest::kOutfeedRequest:
-      return false;
-    default:
-      return true;
-  }
-}
-
-// Returns a pointer to the operation with the given data handle value in the
-// given SessionComputation.
-StatusOr<const OperationRequest*> LookUpRequest(
-    int64 handle_value, const SessionComputation& session_computation) {
-  if (session_computation.requests().count(handle_value) == 0) {
-    return InvalidArgument("no ComputationDataHandle value %lld", handle_value);
-  }
-  return &session_computation.requests().at(handle_value);
-}
-
-// Returns the OperationRequest corresponding to the root (result) of the
-// session computation.
-StatusOr<const OperationRequest*> GetRoot(
-    VersionedComputationHandle::Version version,
-    const SessionComputation& session_computation) {
-  TF_RET_CHECK(version > 0);
-  // Not all instructions can be roots. Walk backwards from the operation
-  // indicated by this version until a valid root is found.
-  const OperationRequest* root_request = nullptr;
-  while (version > 0) {
-    TF_ASSIGN_OR_RETURN(root_request,
-                        LookUpRequest(version, session_computation));
-    if (CanBeRoot(root_request->request().op_case())) {
-      break;
-    }
-    version--;
-  }
-  if (version == 0) {
-    return InternalError("Computation contains no root operation");
-  }
-  return root_request;
-}
-
-}  // namespace
-
-StatusOr<std::shared_ptr<const ProgramShape>>
-UserComputation::ComputeProgramShape(
-    VersionedComputationHandle::Version version) const {
-  tensorflow::mutex_lock lock(mutex_);
-
-  TF_RET_CHECK(version > 0 && version < next_handle_value_);
-
-  if (program_shape_ == nullptr || program_shape_version_ != version) {
-    // ProgramShape has not been computed yet, or is for different
-    // version. Compute it now.
-    TF_RETURN_IF_ERROR(CheckParametersAreContiguous(version));
-
-    auto program_shape = MakeUnique<ProgramShape>();
-    for (int64 request_num = 1; request_num <= version; ++request_num) {
-      const OperationRequest& request =
-          session_computation_.requests().at(request_num);
-      if (request.request().op_case() == OpRequest::kParameterRequest) {
-        const ParameterRequest& parameter_request =
-            request.request().parameter_request();
-        int64 param_no = parameter_request.parameter();
-        // Parameters may be out of order so expand ProgramShape parameters
-        // until it is at least large enough to hold the current parameter
-        // number.
-        while (program_shape->parameters_size() <= param_no) {
-          program_shape->add_parameters();
-          program_shape->add_parameter_names();
-        }
-        *program_shape->mutable_parameters(param_no) = request.output_shape();
-        *program_shape->mutable_parameter_names(param_no) =
-            parameter_request.name();
-      }
-    }
-
-    // The root determines the output shape.
-    TF_ASSIGN_OR_RETURN(const OperationRequest* root_request,
-                        GetRoot(version, session_computation_));
-    *program_shape->mutable_result() = root_request->output_shape();
-    if (ShapeUtil::IsOpaque(program_shape->result())) {
-      return Unimplemented("Computation results cannot be opaque");
-    }
-
-    program_shape_ = std::move(program_shape);
-    program_shape_version_ = version;
-  }
-
-  return program_shape_;
-}
-
-namespace {
-
-// A visitor which checks whether an operation is pure functional meaning that
-// it doesn't depend on any parameter with an index higher then num_parameters.
-// The visitor walks the computation starting at a given operation and sets
-// is_functional to false iff a parameter or RNG operation is encountered.
-void PureFunctionalVisitor(const SessionComputation& session_computation,
-                           const ComputationDataHandle& handle,
-                           int64 num_parameters, std::set<int64>* visited,
-                           bool* is_functional) {
-  if (visited->count(handle.handle()) != 0 || !*is_functional) {
-    return;
-  }
-
-  const OperationRequest& request =
-      session_computation.requests().at(handle.handle());
-  switch (request.request().op_case()) {
-    case OpRequest::kRngRequest:
-      *is_functional = false;
-      break;
-
-    case OpRequest::kConstantRequest:
-      break;
-
-    case OpRequest::kGetTupleElementRequest: {
-      const GetTupleElementRequest& get_tuple_element_request =
-          request.request().get_tuple_element_request();
-      PureFunctionalVisitor(session_computation,
-                            get_tuple_element_request.operand(), num_parameters,
-                            visited, is_functional);
-      break;
-    }
-
-    case OpRequest::kSliceRequest: {
-      const SliceRequest& slice_request = request.request().slice_request();
-      PureFunctionalVisitor(session_computation, slice_request.operand(),
-                            num_parameters, visited, is_functional);
-      break;
-    }
-
-    case OpRequest::kDynamicSliceRequest: {
-      const DynamicSliceRequest& dynamic_slice_request =
-          request.request().dynamic_slice_request();
-      PureFunctionalVisitor(session_computation,
-                            dynamic_slice_request.operand(), num_parameters,
-                            visited, is_functional);
-      PureFunctionalVisitor(session_computation,
-                            dynamic_slice_request.start_indices(),
-                            num_parameters, visited, is_functional);
-      break;
-    }
-
-    case OpRequest::kDynamicUpdateSliceRequest: {
-      const DynamicUpdateSliceRequest& dynamic_update_slice_request =
-          request.request().dynamic_update_slice_request();
-      PureFunctionalVisitor(session_computation,
-                            dynamic_update_slice_request.operand(),
-                            num_parameters, visited, is_functional);
-      PureFunctionalVisitor(session_computation,
-                            dynamic_update_slice_request.update(),
-                            num_parameters, visited, is_functional);
-      PureFunctionalVisitor(session_computation,
-                            dynamic_update_slice_request.start_indices(),
-                            num_parameters, visited, is_functional);
-      break;
-    }
-
-    case OpRequest::kConcatenateRequest: {
-      const ConcatenateRequest& concatenate_request =
-          request.request().concatenate_request();
-      for (const ComputationDataHandle& handle :
-           concatenate_request.operands()) {
-        PureFunctionalVisitor(session_computation, handle, num_parameters,
-                              visited, is_functional);
-      }
-      break;
-    }
-
-    case OpRequest::kConvolveRequest: {
-      const ConvolveRequest& convolve_request =
-          request.request().convolve_request();
-      PureFunctionalVisitor(session_computation, convolve_request.lhs(),
-                            num_parameters, visited, is_functional);
-      PureFunctionalVisitor(session_computation, convolve_request.rhs(),
-                            num_parameters, visited, is_functional);
-      break;
-    }
-
-    case OpRequest::kFftRequest: {
-      const FftRequest& fft_request = request.request().fft_request();
-      PureFunctionalVisitor(session_computation, fft_request.operand(),
-                            num_parameters, visited, is_functional);
-      break;
-    }
-
-    case OpRequest::kCrossReplicaSumRequest: {
-      // TODO(b/33009255): Implmement constant folding for cross replica sum.
-      *is_functional = false;
-      break;
-    }
-
-    case OpRequest::kInfeedRequest: {
-      *is_functional = false;
-      break;
-    }
-
-    case OpRequest::kOutfeedRequest: {
-      *is_functional = false;
-      break;
-    }
-
-    case OpRequest::kHostComputeRequest: {
-      *is_functional = false;
-      break;
-    }
-
-    case OpRequest::kCallRequest: {
-      const CallRequest& call_request = request.request().call_request();
-      for (const ComputationDataHandle& handle : call_request.operands()) {
-        PureFunctionalVisitor(session_computation, handle, num_parameters,
-                              visited, is_functional);
-      }
-      // TODO(b/32495713): We aren't checking the to_apply computation itself,
-      // so we conservatively say that computations containing the Call op
-      // cannot be constant.  We cannot set is_functional=false in other similar
-      // cases since we're already relying on IsConstant to return true.
-      *is_functional = false;
-      break;
-    }
-
-    case OpRequest::kCustomCallRequest: {
-      *is_functional = false;
-      break;
-    }
-
-    case OpRequest::kDotRequest: {
-      const DotRequest& dot_request = request.request().dot_request();
-      PureFunctionalVisitor(session_computation, dot_request.lhs(),
-                            num_parameters, visited, is_functional);
-      PureFunctionalVisitor(session_computation, dot_request.rhs(),
-                            num_parameters, visited, is_functional);
-      break;
-    }
-
-    case OpRequest::kSendRequest: {
-      *is_functional = false;
-      break;
-    }
-
-    case OpRequest::kRecvRequest: {
-      *is_functional = false;
-      break;
-    }
-
-    case OpRequest::kMapRequest: {
-      const MapRequest& map_request = request.request().map_request();
-      for (const ComputationDataHandle& handle : map_request.operands()) {
-        PureFunctionalVisitor(session_computation, handle, num_parameters,
-                              visited, is_functional);
-      }
-      // TODO(b/32495713): We aren't checking the to_apply computation itself.
-      break;
-    }
-
-    case OpRequest::kReduceRequest: {
-      const ReduceRequest& reduce_request = request.request().reduce_request();
-      PureFunctionalVisitor(session_computation, reduce_request.operand(),
-                            num_parameters, visited, is_functional);
-      PureFunctionalVisitor(session_computation, reduce_request.init_value(),
-                            num_parameters, visited, is_functional);
-      // TODO(b/32495713): We aren't checking the to_apply computation itself.
-      break;
-    }
-
-    case OpRequest::kReduceWindowRequest: {
-      const ReduceWindowRequest& reduce_window_request =
-          request.request().reduce_window_request();
-      PureFunctionalVisitor(session_computation,
-                            reduce_window_request.operand(), num_parameters,
-                            visited, is_functional);
-      PureFunctionalVisitor(session_computation,
-                            reduce_window_request.init_value(), num_parameters,
-                            visited, is_functional);
-      // TODO(b/32495713): We aren't checking the to_apply computation itself.
-      break;
-    }
-
-    case OpRequest::kSelectAndScatterRequest: {
-      const SelectAndScatterRequest& select_and_scatter_request =
-          request.request().select_and_scatter_request();
-      PureFunctionalVisitor(session_computation,
-                            select_and_scatter_request.operand(),
-                            num_parameters, visited, is_functional);
-      PureFunctionalVisitor(session_computation,
-                            select_and_scatter_request.source(), num_parameters,
-                            visited, is_functional);
-      PureFunctionalVisitor(session_computation,
-                            select_and_scatter_request.init_value(),
-                            num_parameters, visited, is_functional);
-      // TODO(b/32495713): We aren't checking the select and scatter
-      // computations themselves.
-      break;
-    }
-
-    case OpRequest::kBroadcastRequest: {
-      const BroadcastRequest& broadcast_request =
-          request.request().broadcast_request();
-      PureFunctionalVisitor(session_computation, broadcast_request.operand(),
-                            num_parameters, visited, is_functional);
-      break;
-    }
-
-    case OpRequest::kReshapeRequest: {
-      const ReshapeRequest& reshape_request =
-          request.request().reshape_request();
-      PureFunctionalVisitor(session_computation, reshape_request.operand(),
-                            num_parameters, visited, is_functional);
-      break;
-    }
-
-    case OpRequest::kReverseRequest: {
-      const ReverseRequest& reverse_request =
-          request.request().reverse_request();
-      PureFunctionalVisitor(session_computation, reverse_request.operand(),
-                            num_parameters, visited, is_functional);
-      break;
-    }
-
-    case OpRequest::kPadRequest: {
-      const PadRequest& pad_request = request.request().pad_request();
-      PureFunctionalVisitor(session_computation, pad_request.operand(),
-                            num_parameters, visited, is_functional);
-      PureFunctionalVisitor(session_computation, pad_request.padding_value(),
-                            num_parameters, visited, is_functional);
-      break;
-    }
-
-    case OpRequest::kParameterRequest: {
-      const ParameterRequest& parameter_request =
-          request.request().parameter_request();
-      if (parameter_request.parameter() >= num_parameters) {
-        *is_functional = false;
-      }
-      break;
-    }
-
-    case OpRequest::kConvertRequest: {
-      const ConvertRequest& convert_request =
-          request.request().convert_request();
-      PureFunctionalVisitor(session_computation, convert_request.operand(),
-                            num_parameters, visited, is_functional);
-      break;
-    }
-
-    case OpRequest::kBitcastConvertRequest: {
-      const ConvertRequest& convert_request =
-          request.request().bitcast_convert_request();
-      PureFunctionalVisitor(session_computation, convert_request.operand(),
-                            num_parameters, visited, is_functional);
-      break;
-    }
-
-    case OpRequest::kWhileRequest: {
-      const WhileRequest& while_request = request.request().while_request();
-      PureFunctionalVisitor(session_computation, while_request.init(),
-                            num_parameters, visited, is_functional);
-      // TODO(b/32495713): We aren't checking the condition and body
-      // computations themselves.
-      *is_functional = false;
-      break;
-    }
-
-    case OpRequest::kConditionalRequest: {
-      const ConditionalRequest& conditional_request =
-          request.request().conditional_request();
-      PureFunctionalVisitor(session_computation,
-                            conditional_request.predicate(), num_parameters,
-                            visited, is_functional);
-      PureFunctionalVisitor(session_computation,
-                            conditional_request.true_operand(), num_parameters,
-                            visited, is_functional);
-      PureFunctionalVisitor(session_computation,
-                            conditional_request.false_operand(), num_parameters,
-                            visited, is_functional);
-      // TODO(b/32495713): We aren't checking the true and false computations
-      // themselves.
-      break;
-    }
-
-    case OpRequest::kTernaryOpRequest: {
-      const TernaryOpRequest& ternary_op_request =
-          request.request().ternary_op_request();
-      PureFunctionalVisitor(session_computation, ternary_op_request.lhs(),
-                            num_parameters, visited, is_functional);
-      PureFunctionalVisitor(session_computation, ternary_op_request.rhs(),
-                            num_parameters, visited, is_functional);
-      PureFunctionalVisitor(session_computation, ternary_op_request.ehs(),
-                            num_parameters, visited, is_functional);
-      break;
-    }
-
-    case OpRequest::kTransposeRequest: {
-      const TransposeRequest& transpose_request =
-          request.request().transpose_request();
-      PureFunctionalVisitor(session_computation, transpose_request.operand(),
-                            num_parameters, visited, is_functional);
-      break;
-    }
-
-    case OpRequest::kVariadicOpRequest: {
-      const VariadicOpRequest& variadic_op_request =
-          request.request().variadic_op_request();
-      for (const ComputationDataHandle& handle :
-           variadic_op_request.operands()) {
-        PureFunctionalVisitor(session_computation, handle, num_parameters,
-                              visited, is_functional);
-      }
-      break;
-    }
-
-    case OpRequest::kUnaryOpRequest: {
-      const UnaryOpRequest& unary_op_request =
-          request.request().unary_op_request();
-      PureFunctionalVisitor(session_computation, unary_op_request.operand(),
-                            num_parameters, visited, is_functional);
-      break;
-    }
-
-    case OpRequest::kBatchNormTrainingRequest: {
-      const BatchNormTrainingRequest& batch_norm_training_request =
-          request.request().batch_norm_training_request();
-      PureFunctionalVisitor(session_computation,
-                            batch_norm_training_request.operand(),
-                            num_parameters, visited, is_functional);
-      PureFunctionalVisitor(session_computation,
-                            batch_norm_training_request.scale(), num_parameters,
-                            visited, is_functional);
-      PureFunctionalVisitor(session_computation,
-                            batch_norm_training_request.offset(),
-                            num_parameters, visited, is_functional);
-      break;
-    }
-
-    case OpRequest::kBatchNormInferenceRequest: {
-      const BatchNormInferenceRequest& batch_norm_inference_request =
-          request.request().batch_norm_inference_request();
-      PureFunctionalVisitor(session_computation,
-                            batch_norm_inference_request.operand(),
-                            num_parameters, visited, is_functional);
-      PureFunctionalVisitor(session_computation,
-                            batch_norm_inference_request.scale(),
-                            num_parameters, visited, is_functional);
-      PureFunctionalVisitor(session_computation,
-                            batch_norm_inference_request.offset(),
-                            num_parameters, visited, is_functional);
-      PureFunctionalVisitor(session_computation,
-                            batch_norm_inference_request.mean(), num_parameters,
-                            visited, is_functional);
-      PureFunctionalVisitor(session_computation,
-                            batch_norm_inference_request.variance(),
-                            num_parameters, visited, is_functional);
-      break;
-    }
-
-    case OpRequest::kBatchNormGradRequest: {
-      const BatchNormGradRequest& batch_norm_grad_request =
-          request.request().batch_norm_grad_request();
-      PureFunctionalVisitor(session_computation,
-                            batch_norm_grad_request.operand(), num_parameters,
-                            visited, is_functional);
-      PureFunctionalVisitor(session_computation,
-                            batch_norm_grad_request.scale(), num_parameters,
-                            visited, is_functional);
-      PureFunctionalVisitor(session_computation, batch_norm_grad_request.mean(),
-                            num_parameters, visited, is_functional);
-      PureFunctionalVisitor(session_computation,
-                            batch_norm_grad_request.variance(), num_parameters,
-                            visited, is_functional);
-      PureFunctionalVisitor(session_computation,
-                            batch_norm_grad_request.grad_output(),
-                            num_parameters, visited, is_functional);
-      break;
-    }
-
-    case OpRequest::kBinaryOpRequest: {
-      const BinaryOpRequest& binary_op_request =
-          request.request().binary_op_request();
-      PureFunctionalVisitor(session_computation, binary_op_request.lhs(),
-                            num_parameters, visited, is_functional);
-      PureFunctionalVisitor(session_computation, binary_op_request.rhs(),
-                            num_parameters, visited, is_functional);
-      break;
-    }
-
-    case OpRequest::kGatherRequest: {
-      PureFunctionalVisitor(session_computation,
-                            request.request().gather_request().input(),
-                            num_parameters, visited, is_functional);
-      PureFunctionalVisitor(session_computation,
-                            request.request().gather_request().gather_indices(),
-                            num_parameters, visited, is_functional);
-      break;
-    }
-
-    case OpRequest::OP_NOT_SET:
-      LOG(FATAL) << "OperationRequest doesn't contain a request";
-
-    default:
-      LOG(FATAL) << "Unexpected request type: " << request.request().op_case();
-  }
-  if (!*is_functional) {
-    VLOG(1) << "Non-functional: " << request.request().DebugString();
-  }
-  visited->insert(handle.handle());
-}
-
-}  // namespace
-
-StatusOr<bool> UserComputation::IsConstant(const ComputationDataHandle& handle,
-                                           int64 num_parameters) {
-  tensorflow::mutex_lock lock(mutex_);
-
-  // Verify that the handle is valid.
-  auto operation_status = LookUpRequest(handle);
-  if (!operation_status.ok()) {
-    return operation_status.status();
-  }
-
-  bool is_constant = true;
-  std::set<int64> visited;
-  PureFunctionalVisitor(session_computation_, handle, num_parameters, &visited,
-                        &is_constant);
-
-  return is_constant;
-}
-
-std::vector<VersionedComputationHandle>
-UserComputation::GetEmbeddedComputations(
-    VersionedComputationHandle::Version version) const {
-  tensorflow::mutex_lock lock(mutex_);
-
-  VLOG(1)
-      << "GetEmbeddedComputations(" << name() << " "
-      << VersionedComputationHandle{session_computation_.computation_handle(),
-                                    version}
-      << ")";
-  XLA_VLOG_LINES(3, session_computation_.DebugString());
-
-  std::vector<VersionedComputationHandle> computations;
-  std::vector<int64> sorted_handles;
-  for (const auto& handle_request : session_computation_.requests()) {
-    sorted_handles.push_back(handle_request.first);
-  }
-  std::sort(sorted_handles.begin(), sorted_handles.end());
-  for (int64 handle : sorted_handles) {
-    const auto& handle_request = session_computation_.requests().find(handle);
-    CHECK(handle_request != session_computation_.requests().end());
-    int64 handle_value = handle_request->first;
-    if (handle_value <= version) {
-      const OperationRequest& request = handle_request->second;
-      switch (request.request().op_case()) {
-        case OpRequest::kCallRequest: {
-          CHECK_EQ(1, request.embedded_computation_versions_size());
-          const CallRequest& call_request = request.request().call_request();
-          const VersionedComputationHandle versioned_handle = {
-              call_request.to_apply(),
-              request.embedded_computation_versions(0)};
-          computations.push_back(versioned_handle);
-          break;
-        }
-
-        case OpRequest::kMapRequest: {
-          CHECK_EQ(1, request.embedded_computation_versions_size());
-          const MapRequest& map_request = request.request().map_request();
-          const VersionedComputationHandle versioned_handle = {
-              map_request.to_apply(), request.embedded_computation_versions(0)};
-          computations.push_back(versioned_handle);
-          break;
-        }
-
-        case OpRequest::kReduceRequest: {
-          CHECK_EQ(1, request.embedded_computation_versions_size());
-          const ReduceRequest& reduce_request =
-              request.request().reduce_request();
-          const VersionedComputationHandle versioned_handle = {
-              reduce_request.to_apply(),
-              request.embedded_computation_versions(0)};
-          computations.push_back(versioned_handle);
-          break;
-        }
-
-        case OpRequest::kReduceWindowRequest: {
-          CHECK_EQ(1, request.embedded_computation_versions_size());
-          const ReduceWindowRequest& reduce_window_request =
-              request.request().reduce_window_request();
-          const VersionedComputationHandle versioned_handle = {
-              reduce_window_request.to_apply(),
-              request.embedded_computation_versions(0)};
-          computations.push_back(versioned_handle);
-          break;
-        }
-
-        case OpRequest::kSelectAndScatterRequest: {
-          CHECK_EQ(2, request.embedded_computation_versions_size());
-          const SelectAndScatterRequest& select_and_scatter_request =
-              request.request().select_and_scatter_request();
-          const VersionedComputationHandle select_versioned_handle = {
-              select_and_scatter_request.select(),
-              request.embedded_computation_versions(0)};
-          computations.push_back(select_versioned_handle);
-          const VersionedComputationHandle scatter_versioned_handle = {
-              select_and_scatter_request.scatter(),
-              request.embedded_computation_versions(1)};
-          computations.push_back(scatter_versioned_handle);
-          break;
-        }
-
-        case OpRequest::kWhileRequest: {
-          CHECK_EQ(2, request.embedded_computation_versions_size());
-          const WhileRequest& while_request = request.request().while_request();
-          const VersionedComputationHandle condition_versioned_handle = {
-              while_request.condition(),
-              request.embedded_computation_versions(0)};
-          computations.push_back(condition_versioned_handle);
-          const VersionedComputationHandle body_versioned_handle = {
-              while_request.body(), request.embedded_computation_versions(1)};
-          computations.push_back(body_versioned_handle);
-          break;
-        }
-
-        case OpRequest::kConditionalRequest: {
-          CHECK_EQ(2, request.embedded_computation_versions_size());
-          const ConditionalRequest& conditional_request =
-              request.request().conditional_request();
-          const VersionedComputationHandle true_computation_versioned_handle = {
-              conditional_request.true_computation(),
-              request.embedded_computation_versions(0)};
-          computations.push_back(true_computation_versioned_handle);
-          const VersionedComputationHandle false_computation_versioned_handle =
-              {conditional_request.false_computation(),
-               request.embedded_computation_versions(1)};
-          computations.push_back(false_computation_versioned_handle);
-          break;
-        }
-
-        default:
-          // No embedded computation.
-          break;
-      }
-    }
-  }
-  VLOG(2) << "Embedded computations: "
-          << tensorflow::str_util::Join(
-                 computations, ", ",
-                 [](string* out, const VersionedComputationHandle& h) {
-                   out->append(h.ToString());
-                 });
-  return computations;
-}
-
-StatusOr<const OperationRequest*>
-UserComputation::LookUpRequestForErrorReporting(
-    const ComputationDataHandle& handle) const {
-  tensorflow::mutex_lock lock(mutex_);
-  return LookUpRequest(handle);
-}
-
-tensorflow::gtl::optional<const OpMetadata*> UserComputation::ParameterMetadata(
-    int parameter_number) const {
-  tensorflow::mutex_lock lock(mutex_);
-  auto it = parameters_.find(parameter_number);
-  if (it == parameters_.end()) {
-    return tensorflow::gtl::nullopt;
-  }
-  OperationRequest* op = it->second;
-  return &op->request().metadata();
-}
-
-Status UserComputation::RemapEmbeddedComputations(
-    const std::map<int64, ComputationHandle>& old_to_new) {
-  auto update = [&old_to_new](ComputationHandle* to_update) -> Status {
-    int64 old = to_update->handle();
-    auto it = old_to_new.find(old);
-    if (it == old_to_new.end()) {
-      string mapping = tensorflow::str_util::Join(
-          old_to_new, ", ",
-          [](string* out, std::pair<int64, ComputationHandle> element) {
-            tensorflow::strings::Appendf(out, "%lld:%lld", element.first,
-                                         element.second.handle());
-          });
-      return NotFound(
-          "could not find referenced (old) computation handle in mapping: "
-          "%lld; mapping: {%s}",
-          old, mapping.c_str());
-    }
-    VLOG(2) << "remapping " << old << " to " << it->second.handle();
-    *to_update = it->second;
-    return Status::OK();
-  };
-  TF_RETURN_IF_ERROR(update(session_computation_.mutable_computation_handle()));
-  for (auto& handle_request : *session_computation_.mutable_requests()) {
-    OperationRequest& request = handle_request.second;
-    switch (request.request().op_case()) {
-      case OpRequest::kCallRequest: {
-        TF_RET_CHECK(1 == request.embedded_computation_versions_size());
-        CallRequest* call_request =
-            request.mutable_request()->mutable_call_request();
-        TF_RETURN_IF_ERROR(update(call_request->mutable_to_apply()));
-        break;
-      }
-      case OpRequest::kMapRequest: {
-        TF_RET_CHECK(1 == request.embedded_computation_versions_size());
-        MapRequest* map_request =
-            request.mutable_request()->mutable_map_request();
-        TF_RETURN_IF_ERROR(update(map_request->mutable_to_apply()));
-        break;
-      }
-      case OpRequest::kReduceRequest: {
-        TF_RET_CHECK(1 == request.embedded_computation_versions_size());
-        ReduceRequest* reduce_request =
-            request.mutable_request()->mutable_reduce_request();
-        TF_RETURN_IF_ERROR(update(reduce_request->mutable_to_apply()));
-        break;
-      }
-      case OpRequest::kReduceWindowRequest: {
-        TF_RET_CHECK(1 == request.embedded_computation_versions_size());
-        ReduceWindowRequest* reduce_window_request =
-            request.mutable_request()->mutable_reduce_window_request();
-        TF_RETURN_IF_ERROR(update(reduce_window_request->mutable_to_apply()));
-        break;
-      }
-      case OpRequest::kSelectAndScatterRequest: {
-        TF_RET_CHECK(2 == request.embedded_computation_versions_size());
-        SelectAndScatterRequest* select_and_scatter_request =
-            request.mutable_request()->mutable_select_and_scatter_request();
-        TF_RETURN_IF_ERROR(
-            update(select_and_scatter_request->mutable_select()));
-        TF_RETURN_IF_ERROR(
-            update(select_and_scatter_request->mutable_scatter()));
-        break;
-      }
-      case OpRequest::kWhileRequest: {
-        TF_RET_CHECK(2 == request.embedded_computation_versions_size());
-        WhileRequest* while_request =
-            request.mutable_request()->mutable_while_request();
-        TF_RETURN_IF_ERROR(update(while_request->mutable_condition()));
-        TF_RETURN_IF_ERROR(update(while_request->mutable_body()));
-        break;
-      }
-      case OpRequest::kConditionalRequest: {
-        TF_RET_CHECK(2 == request.embedded_computation_versions_size());
-        ConditionalRequest* conditional_request =
-            request.mutable_request()->mutable_conditional_request();
-        TF_RETURN_IF_ERROR(
-            update(conditional_request->mutable_true_computation()));
-        TF_RETURN_IF_ERROR(
-            update(conditional_request->mutable_false_computation()));
-        break;
-      }
-      default:
-        // No embedded computation.
-        TF_RET_CHECK(0 == request.embedded_computation_versions_size());
-        break;
-    }
-  }
-  return Status::OK();
-}
-
-SessionComputation UserComputation::CloneSessionComputation(
-    VersionedComputationHandle::Version version) const {
-  tensorflow::mutex_lock lock(mutex_);
-  SessionComputation result = session_computation_;
-  // Erase all the requests that exceed the version specified.
-  // There's no lower_bound method on tensorflow::protobuf::Map so we iterate
-  // all the elements.
-  auto it = result.mutable_requests()->begin();
-  while (it != result.mutable_requests()->end()) {
-    if (it->first > version) {
-      it = result.mutable_requests()->erase(it);
-    } else {
-      ++it;
-    }
-  }
-  return result;
-}
-
-StatusOr<const OperationRequest*> UserComputation::LookUpRequest(
-    const ComputationDataHandle& handle) const {
-  int64 handle_value = handle.handle();
-  if (session_computation_.requests().count(handle_value) == 0) {
-    return InvalidArgument("no ComputationDataHandle value %lld", handle_value);
-  }
-  return &session_computation_.requests().at(handle_value);
-}
-
-Status UserComputation::CheckParametersAreContiguous(
-    VersionedComputationHandle::Version version) const {
-  TF_RET_CHECK(version > 0 && version < next_handle_value_);
-
-  // Determine number of parameter inputs at the given version.
-  std::map<int64, const ParameterRequest*> parameter_requests;
-  for (int64 request_num = 1; request_num <= version; ++request_num) {
-    const OperationRequest& request =
-        session_computation_.requests().at(request_num);
-
-    if (request.request().op_case() == OpRequest::kParameterRequest) {
-      const ParameterRequest& parameter_request =
-          request.request().parameter_request();
-      // Duplicate parameters should be checked when parameter requests are
-      // added.
-      TF_RET_CHECK(0 ==
-                   parameter_requests.count(parameter_request.parameter()));
-      parameter_requests[parameter_request.parameter()] = &parameter_request;
-    }
-  }
-
-  for (int64 i = 0; i < parameter_requests.size(); ++i) {
-    auto it = parameter_requests.find(i);
-    if (it == parameter_requests.end()) {
-      return FailedPrecondition(
-          "computation %s does not have all its parameters populated "
-          "sequentially, missing parameter %lld",
-          name_.c_str(), i);
-    }
-  }
-
-  return Status::OK();
-}
-
-namespace {
-
-// Helper class which builds an HLO computation from a SessionComputation. To
-// construct the HLO computation, the SessionComputation graph is walked in
-// DFS order lowering each OperationRequest to an HLO instruction.
-class ComputationLowerer {
- public:
-  static StatusOr<std::unique_ptr<HloComputation>> Lower(
-      const string& computation_name,
-      const SessionComputation& session_computation,
-      VersionedComputationHandle::Version version,
-      UserComputation::HloComputationResolver hlo_resolver,
-      const DebugOptions& debug_options,
-      bool include_unreachable_instructions) {
-    ComputationLowerer lowerer(computation_name, session_computation, version,
-                               std::move(hlo_resolver), debug_options,
-                               include_unreachable_instructions);
-    return lowerer.Lower();
-  }
-
- private:
-  ComputationLowerer(const string& computation_name,
-                     const SessionComputation& session_computation,
-                     VersionedComputationHandle::Version version,
-                     UserComputation::HloComputationResolver hlo_resolver,
-                     const DebugOptions& debug_options,
-                     bool include_unreachable_instructions)
-      : hlo_builder_(computation_name),
-        session_computation_(session_computation),
-        version_(version),
-        hlo_resolver_(std::move(hlo_resolver)),
-        debug_options_(debug_options),
-        include_unreachable_instructions_(include_unreachable_instructions) {}
-
-  // Build an HLO computation from the SessionComputation at the given
-  // version.
-  StatusOr<std::unique_ptr<HloComputation>> Lower();
-
- private:
-  // Traverses the computation 'root' using a DFS, calling 'visit' in postorder.
-  void TraversePostorder(
-      const ComputationDataHandle& root,
-      std::unordered_map<int64, HloInstruction*>* visited,
-      const std::function<void(const ComputationDataHandle&)>& visit);
-
-  // DFS visitor of the UserComputation operations which lowers the operations
-  // to HLO instructions.
-  void Visit(const ComputationDataHandle& handle,
-             std::unordered_map<int64, HloInstruction*>* instructions);
-
-  // Resolves a ComputationHandle and Version to a previously lowered
-  // HloComputation using the hlo_resolver_ function.
-  HloComputation* ResolveComputation(
-      const ComputationHandle& handle,
-      VersionedComputationHandle::Version version);
-
-  // This function takes an input value which is being implicitly broadcast into
-  // an output shape and figures out the right kBroadcast instruction(s)
-  // necessary to replicate the implicit broadcast semantics explicitly.
-  HloInstruction* ImplicitBroadcastToExplicitBroadcast(
-      HloInstruction* operand, const Shape& output_shape);
-
-  HloComputation::Builder hlo_builder_;
-  const SessionComputation& session_computation_;
-  const VersionedComputationHandle::Version version_;
-  const UserComputation::HloComputationResolver hlo_resolver_;
-  const DebugOptions& debug_options_;
-  const bool include_unreachable_instructions_;
-};
-
-// Calls 'apply' on each operand of 'request'.
-static void ForEachOperand(
-    const OperationRequest& request,
-    const std::function<void(const ComputationDataHandle& param)>& apply) {
-  switch (request.request().op_case()) {
-    case OpRequest::kRngRequest: {
-      const RngRequest& rng_request = request.request().rng_request();
-      for (const ComputationDataHandle& param : rng_request.parameter()) {
-        apply(param);
-      }
-      break;
-    }
-
-    case OpRequest::kConstantRequest:
-      break;
-    case OpRequest::kGetTupleElementRequest: {
-      const GetTupleElementRequest& get_tuple_element_request =
-          request.request().get_tuple_element_request();
-      apply(get_tuple_element_request.operand());
-      break;
-    }
-
-    case OpRequest::kSliceRequest: {
-      const SliceRequest& slice_request = request.request().slice_request();
-      apply(slice_request.operand());
-      break;
-    }
-
-    case OpRequest::kDynamicSliceRequest: {
-      const DynamicSliceRequest& dynamic_slice_request =
-          request.request().dynamic_slice_request();
-      apply(dynamic_slice_request.operand());
-      apply(dynamic_slice_request.start_indices());
-      break;
-    }
-
-    case OpRequest::kDynamicUpdateSliceRequest: {
-      const DynamicUpdateSliceRequest& dynamic_update_slice_request =
-          request.request().dynamic_update_slice_request();
-      apply(dynamic_update_slice_request.operand());
-      apply(dynamic_update_slice_request.update());
-      apply(dynamic_update_slice_request.start_indices());
-      break;
-    }
-
-    case OpRequest::kConcatenateRequest: {
-      const ConcatenateRequest& concatenate_request =
-          request.request().concatenate_request();
-      for (const ComputationDataHandle& handle :
-           concatenate_request.operands()) {
-        apply(handle);
-      }
-      break;
-    }
-
-    case OpRequest::kConvolveRequest: {
-      const ConvolveRequest& convolve_request =
-          request.request().convolve_request();
-      apply(convolve_request.lhs());
-      apply(convolve_request.rhs());
-      break;
-    }
-
-    case OpRequest::kFftRequest: {
-      const FftRequest& fft_request = request.request().fft_request();
-      apply(fft_request.operand());
-      break;
-    }
-
-    case OpRequest::kBatchNormTrainingRequest: {
-      const BatchNormTrainingRequest& batch_norm_training_request =
-          request.request().batch_norm_training_request();
-
-      apply(batch_norm_training_request.operand());
-      apply(batch_norm_training_request.scale());
-      apply(batch_norm_training_request.offset());
-      break;
-    }
-
-    case OpRequest::kBatchNormInferenceRequest: {
-      const BatchNormInferenceRequest& batch_norm_inference_request =
-          request.request().batch_norm_inference_request();
-
-      apply(batch_norm_inference_request.operand());
-      apply(batch_norm_inference_request.scale());
-      apply(batch_norm_inference_request.offset());
-      apply(batch_norm_inference_request.mean());
-      apply(batch_norm_inference_request.variance());
-      break;
-    }
-
-    case OpRequest::kBatchNormGradRequest: {
-      const BatchNormGradRequest& batch_norm_grad_request =
-          request.request().batch_norm_grad_request();
-
-      apply(batch_norm_grad_request.operand());
-      apply(batch_norm_grad_request.scale());
-      apply(batch_norm_grad_request.mean());
-      apply(batch_norm_grad_request.variance());
-      apply(batch_norm_grad_request.grad_output());
-      break;
-    }
-
-    case OpRequest::kCrossReplicaSumRequest: {
-      const CrossReplicaSumRequest& cross_replica_sum_request =
-          request.request().cross_replica_sum_request();
-      apply(cross_replica_sum_request.operand());
-      break;
-    }
-
-    case OpRequest::kInfeedRequest:
-      break;
-
-    case OpRequest::kOutfeedRequest: {
-      const OutfeedRequest& outfeed_request =
-          request.request().outfeed_request();
-      apply(outfeed_request.operand());
-      break;
-    }
-
-    case OpRequest::kMapRequest: {
-      const MapRequest& map_request = request.request().map_request();
-      for (const ComputationDataHandle& handle : map_request.operands()) {
-        apply(handle);
-      }
-      break;
-    }
-
-    case OpRequest::kReduceRequest: {
-      const ReduceRequest& reduce_request = request.request().reduce_request();
-      apply(reduce_request.operand());
-      apply(reduce_request.init_value());
-      break;
-    }
-
-    case OpRequest::kReduceWindowRequest: {
-      const ReduceWindowRequest& reduce_window_request =
-          request.request().reduce_window_request();
-      apply(reduce_window_request.operand());
-      apply(reduce_window_request.init_value());
-      break;
-    }
-
-    case OpRequest::kSelectAndScatterRequest: {
-      const SelectAndScatterRequest& select_and_scatter_request =
-          request.request().select_and_scatter_request();
-      apply(select_and_scatter_request.operand());
-      apply(select_and_scatter_request.source());
-      apply(select_and_scatter_request.init_value());
-
-      break;
-    }
-
-    case OpRequest::kBroadcastRequest: {
-      const BroadcastRequest& broadcast_request =
-          request.request().broadcast_request();
-      apply(broadcast_request.operand());
-      break;
-    }
-
-    case OpRequest::kReshapeRequest: {
-      const ReshapeRequest& reshape_request =
-          request.request().reshape_request();
-      apply(reshape_request.operand());
-      break;
-    }
-
-    case OpRequest::kTransposeRequest: {
-      const TransposeRequest& transpose_request =
-          request.request().transpose_request();
-      apply(transpose_request.operand());
-      break;
-    }
-
-    case OpRequest::kReverseRequest: {
-      const ReverseRequest& reverse_request =
-          request.request().reverse_request();
-      apply(reverse_request.operand());
-      break;
-    }
-
-    case OpRequest::kPadRequest: {
-      const PadRequest& pad_request = request.request().pad_request();
-      apply(pad_request.operand());
-      apply(pad_request.padding_value());
-      break;
-    }
-
-    case OpRequest::kRecvRequest:
-    case OpRequest::kParameterRequest:
-      break;
-
-    case OpRequest::kConvertRequest: {
-      const ConvertRequest& convert_request =
-          request.request().convert_request();
-      apply(convert_request.operand());
-      break;
-    }
-
-    case OpRequest::kBitcastConvertRequest: {
-      const ConvertRequest& convert_request =
-          request.request().bitcast_convert_request();
-      apply(convert_request.operand());
-      break;
-    }
-
-    case OpRequest::kWhileRequest: {
-      const WhileRequest& while_request = request.request().while_request();
-      apply(while_request.init());
-      break;
-    }
-
-    case OpRequest::kConditionalRequest: {
-      const ConditionalRequest& conditional_request =
-          request.request().conditional_request();
-      apply(conditional_request.predicate());
-      apply(conditional_request.true_operand());
-      apply(conditional_request.false_operand());
-      break;
-    }
-
-    case OpRequest::kTernaryOpRequest: {
-      const TernaryOpRequest& ternary_op_request =
-          request.request().ternary_op_request();
-      apply(ternary_op_request.lhs());
-      apply(ternary_op_request.rhs());
-      apply(ternary_op_request.ehs());
-      break;
-    }
-
-    case OpRequest::kVariadicOpRequest: {
-      const VariadicOpRequest& variadic_op_request =
-          request.request().variadic_op_request();
-      for (const ComputationDataHandle& handle :
-           variadic_op_request.operands()) {
-        apply(handle);
-      }
-      break;
-    }
-
-    case OpRequest::kCallRequest: {
-      const CallRequest& call_request = request.request().call_request();
-      for (const ComputationDataHandle& handle : call_request.operands()) {
-        apply(handle);
-      }
-      break;
-    }
-
-    case OpRequest::kCustomCallRequest: {
-      const CustomCallRequest& cc_request =
-          request.request().custom_call_request();
-      for (const ComputationDataHandle& operand : cc_request.operands()) {
-        apply(operand);
-      }
-      break;
-    }
-
-    case OpRequest::kHostComputeRequest: {
-      const HostComputeRequest& hc_request =
-          request.request().host_compute_request();
-      for (const ComputationDataHandle& operand : hc_request.operands()) {
-        apply(operand);
-      }
-      break;
-    }
-
-    case OpRequest::kDotRequest: {
-      const DotRequest& dot_request = request.request().dot_request();
-      apply(dot_request.rhs());
-      apply(dot_request.lhs());
-      break;
-    }
-
-    case OpRequest::kUnaryOpRequest: {
-      const UnaryOpRequest& unary_op_request =
-          request.request().unary_op_request();
-      apply(unary_op_request.operand());
-      break;
-    }
-
-    case OpRequest::kBinaryOpRequest: {
-      const BinaryOpRequest& binary_op_request =
-          request.request().binary_op_request();
-      apply(binary_op_request.rhs());
-      apply(binary_op_request.lhs());
-      break;
-    }
-
-    case OpRequest::kReducePrecisionRequest: {
-      const ReducePrecisionRequest& reduce_precision_request =
-          request.request().reduce_precision_request();
-      apply(reduce_precision_request.operand());
-      break;
-    }
-
-    case OpRequest::kTraceRequest: {
-      const TraceRequest& trace_request = request.request().trace_request();
-      apply(trace_request.operand());
-      break;
-    }
-
-    case OpRequest::kSendRequest: {
-      const SendRequest& send_request = request.request().send_request();
-      apply(send_request.operand());
-      break;
-    }
-
-    case OpRequest::kGatherRequest: {
-      const GatherRequest& gather_request = request.request().gather_request();
-      apply(gather_request.input());
-      apply(gather_request.gather_indices());
-      break;
-    }
-
-    case OpRequest::OP_NOT_SET:
-      LOG(FATAL) << "OperationRequest doesn't contain a request";
-
-    default:
-      LOG(FATAL) << "Unexpected request type: " << request.request().op_case();
-  }
-}
-
-void ComputationLowerer::TraversePostorder(
-    const ComputationDataHandle& root,
-    std::unordered_map<int64, HloInstruction*>* visited,
-    const std::function<void(const ComputationDataHandle&)>& visit) {
-  // Stack containing {handle, enter} pairs. The 'enter' value describes whether
-  // we are entering or leaving 'handle'.
-  std::stack<std::pair<ComputationDataHandle, bool>> work;
-  work.push({root, true});
-  while (!work.empty()) {
-    ComputationDataHandle handle;
-    bool enter;
-    std::tie(handle, enter) = work.top();
-    work.pop();
-
-    if (enter) {
-      // We are entering 'handle'. The first time we enter 'handle', we add it
-      // to 'visited' with a nullptr value. If 'handle' is already in 'visited',
-      // we do not visit it again. This algorithm only uses the presence of
-      // a handle in 'visited', but we use a map so we can use the same data
-      // structure to store the HloInstruction outputs.
-      if (visited->emplace(handle.handle(), nullptr).second) {
-        const OperationRequest& request =
-            session_computation_.requests().at(handle.handle());
-        // Push the corresponding 'leave' action onto the stack, followed by
-        // the operands.
-        work.push({handle, false});
-        ForEachOperand(request, [&work](const ComputationDataHandle& child) {
-          work.push({child, true});
-        });
-      }
-    } else {
-      // We are leaving 'handle'. We have visited the operands of 'handle', and
-      // now can visit the 'handle' itself.
-      visit(handle);
-    }
-  }
-}
-
-StatusOr<std::unique_ptr<HloComputation>> ComputationLowerer::Lower() {
-  // Map from ComputationDataHandle to HLO instruction. Serves as a record of
-  // which operations have been visited as well as a cache for looking up
-  // ComputationDataHandles as HloInstructions.
-  std::unordered_map<int64, HloInstruction*> instructions;
-
-  TF_ASSIGN_OR_RETURN(const OperationRequest* root_request,
-                      GetRoot(version_, session_computation_));
-
-  auto visit = [&](const ComputationDataHandle& handle) {
-    Visit(handle, &instructions);
-  };
-  TraversePostorder(root_request->output_handle(), &instructions, visit);
-  HloInstruction* hlo_root =
-      instructions.at(root_request->output_handle().handle());
-
-  if (include_unreachable_instructions_) {
-    // Iterate through all computation data handles, and visit any unvisited
-    // operations.
-    for (int64 request_num = 1; request_num <= version_; ++request_num) {
-      TF_ASSIGN_OR_RETURN(const OperationRequest* request,
-                          LookUpRequest(request_num, session_computation_));
-      TraversePostorder(request->output_handle(), &instructions, visit);
-    }
-  }
-
-  return hlo_builder_.Build(hlo_root);
-}
-
-HloComputation* ComputationLowerer::ResolveComputation(
-    const ComputationHandle& handle,
-    VersionedComputationHandle::Version version) {
-  const VersionedComputationHandle checked_handle = {handle, version};
-  return hlo_resolver_(checked_handle);
-}
-
-HloInstruction* ComputationLowerer::ImplicitBroadcastToExplicitBroadcast(
-    HloInstruction* operand, const Shape& output_shape) {
-  auto fadd = [this](std::unique_ptr<HloInstruction> x) {
-    return hlo_builder_.AddInstruction(std::move(x));
-  };
-  return fadd(
-      HloInstruction::CreateBroadcastSequence(output_shape, operand, fadd));
-}
-
-void ComputationLowerer::Visit(
-    const ComputationDataHandle& handle,
-    std::unordered_map<int64, HloInstruction*>* instructions) {
-  CHECK_LE(handle.handle(), version_);
-  CHECK(instructions->at(handle.handle()) == nullptr);
-  const OperationRequest& request =
-      session_computation_.requests().at(handle.handle());
-  auto add_instruction = [&](std::unique_ptr<HloInstruction> instruction) {
-    HloInstruction* hlo_instruction =
-        hlo_builder_.AddInstruction(std::move(instruction));
-    hlo_instruction->set_metadata(request.request().metadata());
-    if (request.request().has_sharding()) {
-      OpSharding op_sharding = request.request().sharding();
-      hlo_instruction->set_sharding(
-          HloSharding::FromProto(op_sharding).ValueOrDie());
-    }
-    return hlo_instruction;
-  };
-  auto lookup_instruction = [&](const ComputationDataHandle& handle) {
-    return instructions->at(handle.handle());
-  };
-  HloInstruction* hlo_instruction;
-  switch (request.request().op_case()) {
-    case OpRequest::kRngRequest: {
-      const RngRequest& rng_request = request.request().rng_request();
-      std::vector<HloInstruction*> parameters;
-      for (const ComputationDataHandle& param : rng_request.parameter()) {
-        parameters.push_back(lookup_instruction(param));
-      }
-      hlo_instruction = add_instruction(HloInstruction::CreateRng(
-          request.output_shape(), rng_request.distribution(), parameters));
-      break;
-    }
-
-    case OpRequest::kConstantRequest: {
-      const ConstantRequest& constant_request =
-          request.request().constant_request();
-      hlo_instruction = add_instruction(HloInstruction::CreateConstant(
-          Literal::CreateFromProto(constant_request.literal())
-              .ConsumeValueOrDie()));
-      break;
-    }
-
-    case OpRequest::kGetTupleElementRequest: {
-      const GetTupleElementRequest& get_tuple_element_request =
-          request.request().get_tuple_element_request();
-      HloInstruction* operand =
-          lookup_instruction(get_tuple_element_request.operand());
-      hlo_instruction = add_instruction(HloInstruction::CreateGetTupleElement(
-          request.output_shape(), operand, get_tuple_element_request.index()));
-      break;
-    }
-
-    case OpRequest::kSliceRequest: {
-      const SliceRequest& slice_request = request.request().slice_request();
-      HloInstruction* operand = lookup_instruction(slice_request.operand());
-      hlo_instruction = add_instruction(HloInstruction::CreateSlice(
-          request.output_shape(), operand,
-          AsInt64Slice(slice_request.start_indices()),
-          AsInt64Slice(slice_request.limit_indices()),
-          AsInt64Slice(slice_request.strides())));
-      break;
-    }
-
-    case OpRequest::kDynamicSliceRequest: {
-      const DynamicSliceRequest& dynamic_slice_request =
-          request.request().dynamic_slice_request();
-      HloInstruction* operand =
-          lookup_instruction(dynamic_slice_request.operand());
-      HloInstruction* start_indices =
-          lookup_instruction(dynamic_slice_request.start_indices());
-
-      hlo_instruction = add_instruction(HloInstruction::CreateDynamicSlice(
-          request.output_shape(), operand, start_indices,
-          AsInt64Slice(dynamic_slice_request.slice_sizes())));
-      break;
-    }
-
-    case OpRequest::kDynamicUpdateSliceRequest: {
-      const DynamicUpdateSliceRequest& dynamic_update_slice_request =
-          request.request().dynamic_update_slice_request();
-      HloInstruction* operand =
-          lookup_instruction(dynamic_update_slice_request.operand());
-      HloInstruction* update =
-          lookup_instruction(dynamic_update_slice_request.update());
-      HloInstruction* start_indices =
-          lookup_instruction(dynamic_update_slice_request.start_indices());
-      hlo_instruction =
-          add_instruction(HloInstruction::CreateDynamicUpdateSlice(
-              request.output_shape(), operand, update, start_indices));
-      break;
-    }
-
-    case OpRequest::kConcatenateRequest: {
-      const ConcatenateRequest& concatenate_request =
-          request.request().concatenate_request();
-      std::vector<HloInstruction*> operands;
-      for (const ComputationDataHandle& handle :
-           concatenate_request.operands()) {
-        HloInstruction* operand = lookup_instruction(handle);
-        operands.push_back(operand);
-      }
-      hlo_instruction = add_instruction(HloInstruction::CreateConcatenate(
-          request.output_shape(), operands, concatenate_request.dimension()));
-      break;
-    }
-
-    case OpRequest::kConvolveRequest: {
-      const ConvolveRequest& convolve_request =
-          request.request().convolve_request();
-      HloInstruction* lhs = lookup_instruction(convolve_request.lhs());
-      HloInstruction* rhs = lookup_instruction(convolve_request.rhs());
-      hlo_instruction = add_instruction(HloInstruction::CreateConvolve(
-          request.output_shape(), lhs, rhs, convolve_request.window(),
-          convolve_request.dimension_numbers()));
-      break;
-    }
-
-    case OpRequest::kFftRequest: {
-      const FftRequest& fft_request = request.request().fft_request();
-      HloInstruction* operand = lookup_instruction(fft_request.operand());
-      hlo_instruction = add_instruction(HloInstruction::CreateFft(
-          request.output_shape(), operand, fft_request.fft_type(),
-          AsInt64Slice(fft_request.fft_length())));
-      break;
-    }
-
-    case OpRequest::kDotRequest: {
-      const DotRequest& dot_request = request.request().dot_request();
-      HloInstruction* lhs = lookup_instruction(dot_request.lhs());
-      HloInstruction* rhs = lookup_instruction(dot_request.rhs());
-      hlo_instruction = add_instruction(HloInstruction::CreateDot(
-          request.output_shape(), lhs, rhs, dot_request.dimension_numbers()));
-      break;
-    }
-
-    case OpRequest::kCrossReplicaSumRequest: {
-      const CrossReplicaSumRequest& cross_replica_sum_request =
-          request.request().cross_replica_sum_request();
-      HloInstruction* operand =
-          lookup_instruction(cross_replica_sum_request.operand());
-      hlo_instruction = add_instruction(HloInstruction::CreateCrossReplicaSum(
-          request.output_shape(), {operand}));
-      break;
-    }
-
-    case OpRequest::kInfeedRequest: {
-      const InfeedRequest& infeed_request = request.request().infeed_request();
-      hlo_instruction = add_instruction(HloInstruction::CreateInfeed(
-          request.output_shape(), infeed_request.config()));
-      break;
-    }
-
-    case OpRequest::kOutfeedRequest: {
-      const OutfeedRequest& outfeed_request =
-          request.request().outfeed_request();
-      HloInstruction* operand = lookup_instruction(outfeed_request.operand());
-      hlo_instruction = add_instruction(HloInstruction::CreateOutfeed(
-          outfeed_request.shape(), operand, outfeed_request.outfeed_config()));
-      break;
-    }
-
-    case OpRequest::kMapRequest: {
-      const MapRequest& map_request = request.request().map_request();
-      std::vector<HloInstruction*> operands;
-      for (const ComputationDataHandle& handle : map_request.operands()) {
-        HloInstruction* operand = lookup_instruction(handle);
-        operands.push_back(operand);
-      }
-      CHECK_EQ(1, request.embedded_computation_versions_size());
-      VersionedComputationHandle::Version map_version =
-          request.embedded_computation_versions(0);
-      HloComputation* map_computation =
-          ResolveComputation(map_request.to_apply(), map_version);
-      hlo_instruction = add_instruction(HloInstruction::CreateMap(
-          request.output_shape(), operands, map_computation));
-      break;
-    }
-
-    case OpRequest::kReduceRequest: {
-      const ReduceRequest& reduce_request = request.request().reduce_request();
-      HloInstruction* operand = lookup_instruction(reduce_request.operand());
-      HloInstruction* init_value =
-          lookup_instruction(reduce_request.init_value());
-      CHECK_EQ(1, request.embedded_computation_versions_size());
-      VersionedComputationHandle::Version reduce_version =
-          request.embedded_computation_versions(0);
-      HloComputation* reduce_computation =
-          ResolveComputation(reduce_request.to_apply(), reduce_version);
-      hlo_instruction = add_instruction(HloInstruction::CreateReduce(
-          request.output_shape(), operand, init_value,
-          AsInt64Slice(reduce_request.dimensions()), reduce_computation));
-      break;
-    }
-
-    case OpRequest::kReduceWindowRequest: {
-      const ReduceWindowRequest& reduce_window_request =
-          request.request().reduce_window_request();
-      HloInstruction* operand =
-          lookup_instruction(reduce_window_request.operand());
-      HloInstruction* init_value =
-          lookup_instruction(reduce_window_request.init_value());
-      CHECK_EQ(1, request.embedded_computation_versions_size());
-      VersionedComputationHandle::Version reduce_window_version =
-          request.embedded_computation_versions(0);
-      HloComputation* reduce_window_computation = ResolveComputation(
-          reduce_window_request.to_apply(), reduce_window_version);
-      hlo_instruction = add_instruction(HloInstruction::CreateReduceWindow(
-          request.output_shape(), operand, init_value,
-          reduce_window_request.window(), reduce_window_computation));
-      break;
-    }
-
-    case OpRequest::kSelectAndScatterRequest: {
-      const SelectAndScatterRequest& select_and_scatter_request =
-          request.request().select_and_scatter_request();
-      HloInstruction* operand =
-          lookup_instruction(select_and_scatter_request.operand());
-      HloInstruction* source =
-          lookup_instruction(select_and_scatter_request.source());
-      HloInstruction* init_value =
-          lookup_instruction(select_and_scatter_request.init_value());
-      CHECK_EQ(2, request.embedded_computation_versions_size());
-      VersionedComputationHandle::Version select_version =
-          request.embedded_computation_versions(0);
-      VersionedComputationHandle::Version scatter_version =
-          request.embedded_computation_versions(1);
-      HloComputation* select_computation = ResolveComputation(
-          select_and_scatter_request.select(), select_version);
-      HloComputation* scatter_computation = ResolveComputation(
-          select_and_scatter_request.scatter(), scatter_version);
-      hlo_instruction = add_instruction(HloInstruction::CreateSelectAndScatter(
-          request.output_shape(), operand, select_computation,
-          select_and_scatter_request.window(), source, init_value,
-          scatter_computation));
-      break;
-    }
-
-    case OpRequest::kBatchNormTrainingRequest: {
-      const BatchNormTrainingRequest& batch_norm_training_request =
-          request.request().batch_norm_training_request();
-      HloInstruction* operand =
-          lookup_instruction(batch_norm_training_request.operand());
-      HloInstruction* scale =
-          lookup_instruction(batch_norm_training_request.scale());
-      HloInstruction* offset =
-          lookup_instruction(batch_norm_training_request.offset());
-
-      hlo_instruction = add_instruction(HloInstruction::CreateBatchNormTraining(
-          request.output_shape(), operand, scale, offset,
-          batch_norm_training_request.epsilon(),
-          batch_norm_training_request.feature_index()));
-      break;
-    }
-
-    case OpRequest::kBatchNormInferenceRequest: {
-      const BatchNormInferenceRequest& batch_norm_inference_request =
-          request.request().batch_norm_inference_request();
-      HloInstruction* operand =
-          lookup_instruction(batch_norm_inference_request.operand());
-      HloInstruction* scale =
-          lookup_instruction(batch_norm_inference_request.scale());
-      HloInstruction* offset =
-          lookup_instruction(batch_norm_inference_request.offset());
-      HloInstruction* mean =
-          lookup_instruction(batch_norm_inference_request.mean());
-      HloInstruction* variance =
-          lookup_instruction(batch_norm_inference_request.variance());
-
-      hlo_instruction =
-          add_instruction(HloInstruction::CreateBatchNormInference(
-              request.output_shape(), operand, scale, offset, mean, variance,
-              batch_norm_inference_request.epsilon(),
-              batch_norm_inference_request.feature_index()));
-      break;
-    }
-
-    case OpRequest::kBatchNormGradRequest: {
-      const BatchNormGradRequest& batch_norm_grad_request =
-          request.request().batch_norm_grad_request();
-
-      HloInstruction* operand =
-          lookup_instruction(batch_norm_grad_request.operand());
-      HloInstruction* scale =
-          lookup_instruction(batch_norm_grad_request.scale());
-      HloInstruction* mean = lookup_instruction(batch_norm_grad_request.mean());
-      HloInstruction* variance =
-          lookup_instruction(batch_norm_grad_request.variance());
-      HloInstruction* grad_output =
-          lookup_instruction(batch_norm_grad_request.grad_output());
-
-      hlo_instruction = add_instruction(HloInstruction::CreateBatchNormGrad(
-          request.output_shape(), operand, scale, mean, variance, grad_output,
-          batch_norm_grad_request.epsilon(),
-          batch_norm_grad_request.feature_index()));
-      break;
-    }
-
-    case OpRequest::kBroadcastRequest: {
-      const BroadcastRequest& broadcast_request =
-          request.request().broadcast_request();
-      HloInstruction* operand = lookup_instruction(broadcast_request.operand());
-      std::vector<int64> broadcast_dimensions;
-      // The client-level broadcast instruction just appends dimensions on the
-      // left (adds lowest numbered dimensions). The HLO broadcast op is more
-      // flexible and can add new dimensions anywhere. The broadcast_dimensions
-      // maps operand dimensions to dimensions in the broadcast output, so
-      // to append dimensions on the left the broadcast_dimensions should just
-      // be the n highest dimension numbers of the output shape where n is
-      // the number of input dimensions.
-      broadcast_dimensions.reserve(ShapeUtil::Rank(operand->shape()));
-      for (int i = 0; i < ShapeUtil::Rank(operand->shape()); ++i) {
-        broadcast_dimensions.push_back(i +
-                                       ShapeUtil::Rank(request.output_shape()) -
-                                       ShapeUtil::Rank(operand->shape()));
-      }
-      hlo_instruction = add_instruction(HloInstruction::CreateBroadcast(
-          request.output_shape(), operand, broadcast_dimensions));
-      break;
-    }
-
-    case OpRequest::kReshapeRequest: {
-      const ReshapeRequest& reshape_request =
-          request.request().reshape_request();
-      HloInstruction* operand = lookup_instruction(reshape_request.operand());
-      HloInstruction* transposed;
-      if (IsIdentityPermutation(AsInt64Slice(reshape_request.dimensions()))) {
-        transposed = operand;
-      } else {
-        transposed = add_instruction(HloInstruction::CreateTranspose(
-            ShapeUtil::PermuteDimensions(
-                InversePermutation(AsInt64Slice(reshape_request.dimensions())),
-                operand->shape()),
-            operand, AsInt64Slice(reshape_request.dimensions())));
-      }
-      hlo_instruction = add_instruction(
-          HloInstruction::CreateReshape(request.output_shape(), transposed));
-      break;
-    }
-
-    case OpRequest::kTransposeRequest: {
-      const TransposeRequest& transpose_request =
-          request.request().transpose_request();
-      HloInstruction* operand = lookup_instruction(transpose_request.operand());
-      hlo_instruction = add_instruction(HloInstruction::CreateTranspose(
-          ShapeUtil::PermuteDimensions(
-              InversePermutation(AsInt64Slice(transpose_request.dimensions())),
-              operand->shape()),
-          operand, AsInt64Slice(transpose_request.dimensions())));
-      break;
-    }
-
-    case OpRequest::kReverseRequest: {
-      const ReverseRequest& reverse_request =
-          request.request().reverse_request();
-      HloInstruction* operand = lookup_instruction(reverse_request.operand());
-      hlo_instruction = add_instruction(HloInstruction::CreateReverse(
-          request.output_shape(), operand,
-          AsInt64Slice(reverse_request.dimensions())));
-      break;
-    }
-
-    case OpRequest::kPadRequest: {
-      const PadRequest& pad_request = request.request().pad_request();
-      HloInstruction* operand = lookup_instruction(pad_request.operand());
-      HloInstruction* padding_value =
-          lookup_instruction(pad_request.padding_value());
-      hlo_instruction = add_instruction(HloInstruction::CreatePad(
-          request.output_shape(), operand, padding_value,
-          pad_request.padding_config()));
-      break;
-    }
-
-    case OpRequest::kRecvRequest: {
-      const RecvRequest& recv_request = request.request().recv_request();
-      HloInstruction* recv = add_instruction(HloInstruction::CreateRecv(
-          request.output_shape(), recv_request.channel_handle().handle()));
-      hlo_instruction = add_instruction(HloInstruction::CreateRecvDone(recv));
-      break;
-    }
-
-    case OpRequest::kParameterRequest: {
-      const ParameterRequest& parameter_request =
-          request.request().parameter_request();
-      hlo_instruction = add_instruction(HloInstruction::CreateParameter(
-          parameter_request.parameter(), request.output_shape(),
-          parameter_request.name()));
-      break;
-    }
-
-    case OpRequest::kConvertRequest: {
-      const ConvertRequest& convert_request =
-          request.request().convert_request();
-      HloInstruction* operand = lookup_instruction(convert_request.operand());
-      hlo_instruction = add_instruction(
-          HloInstruction::CreateConvert(request.output_shape(), operand));
-      break;
-    }
-
-    case OpRequest::kBitcastConvertRequest: {
-      const ConvertRequest& convert_request =
-          request.request().bitcast_convert_request();
-      HloInstruction* operand = lookup_instruction(convert_request.operand());
-      hlo_instruction = add_instruction(HloInstruction::CreateBitcastConvert(
-          request.output_shape(), operand));
-      break;
-    }
-
-    case OpRequest::kWhileRequest: {
-      const WhileRequest& while_request = request.request().while_request();
-      CHECK_EQ(2, request.embedded_computation_versions_size());
-      VersionedComputationHandle::Version condition_version =
-          request.embedded_computation_versions(0);
-      HloComputation* condition =
-          ResolveComputation(while_request.condition(), condition_version);
-      VersionedComputationHandle::Version body_version =
-          request.embedded_computation_versions(1);
-      HloComputation* body =
-          ResolveComputation(while_request.body(), body_version);
-      HloInstruction* init = lookup_instruction(while_request.init());
-      hlo_instruction = add_instruction(HloInstruction::CreateWhile(
-          request.output_shape(), condition, body, init));
-      break;
-    }
-
-    case OpRequest::kConditionalRequest: {
-      const ConditionalRequest& conditional_request =
-          request.request().conditional_request();
-      CHECK_EQ(2, request.embedded_computation_versions_size());
-      VersionedComputationHandle::Version true_computation_version =
-          request.embedded_computation_versions(0);
-      HloComputation* true_computation = ResolveComputation(
-          conditional_request.true_computation(), true_computation_version);
-      VersionedComputationHandle::Version false_computation_version =
-          request.embedded_computation_versions(1);
-      HloComputation* false_computation = ResolveComputation(
-          conditional_request.false_computation(), false_computation_version);
-      HloInstruction* predicate =
-          lookup_instruction(conditional_request.predicate());
-      HloInstruction* true_operand =
-          lookup_instruction(conditional_request.true_operand());
-      HloInstruction* false_operand =
-          lookup_instruction(conditional_request.false_operand());
-      hlo_instruction = add_instruction(HloInstruction::CreateConditional(
-          request.output_shape(), predicate, true_operand, true_computation,
-          false_operand, false_computation));
-      break;
-    }
-
-    case OpRequest::kTernaryOpRequest: {
-      const TernaryOpRequest& ternary_op_request =
-          request.request().ternary_op_request();
-      HloInstruction* lhs = lookup_instruction(ternary_op_request.lhs());
-      HloInstruction* rhs = lookup_instruction(ternary_op_request.rhs());
-      HloInstruction* ehs = lookup_instruction(ternary_op_request.ehs());
-      auto hlo_opcode = TernaryOperationToHloOpcode(ternary_op_request.triop());
-      if (debug_options_.xla_eliminate_hlo_implicit_broadcast() &&
-          !ShapeUtil::IsTuple(request.output_shape())) {
-        if (!ShapeUtil::IsTuple(lhs->shape()) &&
-            !ShapeUtil::SameDimensions(request.output_shape(), lhs->shape())) {
-          // lhs side is being implicitly broadcast. Change to explicit.
-          lhs =
-              ImplicitBroadcastToExplicitBroadcast(lhs, request.output_shape());
-        }
-
-        if (!ShapeUtil::IsTuple(rhs->shape()) &&
-            !ShapeUtil::SameDimensions(request.output_shape(), rhs->shape())) {
-          rhs =
-              ImplicitBroadcastToExplicitBroadcast(rhs, request.output_shape());
-        }
-
-        if (!ShapeUtil::IsTuple(ehs->shape()) &&
-            !ShapeUtil::SameDimensions(request.output_shape(), ehs->shape())) {
-          ehs =
-              ImplicitBroadcastToExplicitBroadcast(ehs, request.output_shape());
-        }
-      }
-
-      hlo_instruction = add_instruction(HloInstruction::CreateTernary(
-          request.output_shape(), hlo_opcode, lhs, rhs, ehs));
-      break;
-    }
-
-    case OpRequest::kVariadicOpRequest: {
-      const VariadicOpRequest& variadic_op_request =
-          request.request().variadic_op_request();
-      std::vector<HloInstruction*> operands;
-      for (const ComputationDataHandle& handle :
-           variadic_op_request.operands()) {
-        HloInstruction* operand = lookup_instruction(handle);
-        operands.push_back(operand);
-      }
-      auto hlo_opcode =
-          VariadicOperationToHloOpcode(variadic_op_request.varop());
-      hlo_instruction = add_instruction(HloInstruction::CreateVariadic(
-          request.output_shape(), hlo_opcode, operands));
-      break;
-    }
-
-    case OpRequest::kCallRequest: {
-      const CallRequest& call_request = request.request().call_request();
-      std::vector<HloInstruction*> operands;
-      for (const ComputationDataHandle& handle : call_request.operands()) {
-        operands.push_back(lookup_instruction(handle));
-      }
-      CHECK_EQ(1, request.embedded_computation_versions_size());
-      VersionedComputationHandle::Version call_version =
-          request.embedded_computation_versions(0);
-      HloComputation* call_computation =
-          ResolveComputation(call_request.to_apply(), call_version);
-      hlo_instruction = add_instruction(HloInstruction::CreateCall(
-          request.output_shape(), operands, call_computation));
-      break;
-    }
-
-    case OpRequest::kCustomCallRequest: {
-      const CustomCallRequest& cc_request =
-          request.request().custom_call_request();
-      std::vector<HloInstruction*> operands;
-      for (const ComputationDataHandle& operand : cc_request.operands()) {
-        operands.push_back(lookup_instruction(operand));
-      }
-      hlo_instruction = add_instruction(HloInstruction::CreateCustomCall(
-          cc_request.shape(), operands, cc_request.call_target_name()));
-      break;
-    }
-
-    case OpRequest::kHostComputeRequest: {
-      const HostComputeRequest& host_compute_request =
-          request.request().host_compute_request();
-      std::vector<HloInstruction*> operands;
-      for (const ComputationDataHandle& operand :
-           host_compute_request.operands()) {
-        operands.push_back(lookup_instruction(operand));
-      }
-      auto output_shape = host_compute_request.shape();
-      auto channel_name = host_compute_request.channel_name();
-      auto cost_estimate_ns = host_compute_request.cost_estimate_ns();
-      hlo_instruction = add_instruction(HloInstruction::CreateHostCompute(
-          output_shape, operands, channel_name, cost_estimate_ns));
-      break;
-    }
-
-    case OpRequest::kUnaryOpRequest: {
-      const UnaryOpRequest& unary_op_request =
-          request.request().unary_op_request();
-      HloInstruction* operand = lookup_instruction(unary_op_request.operand());
-      auto hlo_opcode = UnaryOperationToHloOpcode(unary_op_request.unop());
-      hlo_instruction = add_instruction(HloInstruction::CreateUnary(
-          request.output_shape(), hlo_opcode, operand));
-      break;
-    }
-
-    case OpRequest::kBinaryOpRequest: {
-      const BinaryOpRequest& binary_op_request =
-          request.request().binary_op_request();
-      HloInstruction* lhs = lookup_instruction(binary_op_request.lhs());
-      HloInstruction* rhs = lookup_instruction(binary_op_request.rhs());
-      auto hlo_opcode = BinaryOperationToHloOpcode(binary_op_request.binop());
-      if (binary_op_request.broadcast_dimensions_size() > 0 &&
-          ShapeUtil::Rank(lhs->shape()) != ShapeUtil::Rank(rhs->shape())) {
-        // Emit a broadcast instruction to perform the "broadcast in dimension"
-        // operation.
-        HloInstruction* operand_to_broadcast =
-            ShapeUtil::Rank(lhs->shape()) < ShapeUtil::Rank(rhs->shape()) ? lhs
-                                                                          : rhs;
-        CHECK_EQ(ShapeUtil::Rank(operand_to_broadcast->shape()),
-                 binary_op_request.broadcast_dimensions().size());
-
-        // Construct the bounds of the shape of the kBroadcast instruction
-        // responsible for the in-dimension broadcast.
-        std::vector<int64> output_dimensions;
-        for (int64 size : request.output_shape().dimensions()) {
-          output_dimensions.push_back(size);
-        }
-        for (int64 operand_dim = 0;
-             operand_dim < ShapeUtil::Rank(operand_to_broadcast->shape());
-             ++operand_dim) {
-          int64 output_dim =
-              binary_op_request.broadcast_dimensions()[operand_dim];
-          output_dimensions[output_dim] =
-              operand_to_broadcast->shape().dimensions(operand_dim);
-        }
-
-        Shape broadcast_shape = ShapeUtil::MakeShape(
-            operand_to_broadcast->shape().element_type(), output_dimensions);
-
-        // The broadcast semantics of a client-level binary op broadcast is
-        // identical to the HLO broadcast semantics so the broadcast_dimensions
-        // field can just be passed to the instruction builder.
-        HloInstruction* broadcasted_operand =
-            add_instruction(HloInstruction::CreateBroadcast(
-                broadcast_shape, operand_to_broadcast,
-                AsInt64Slice(binary_op_request.broadcast_dimensions())));
-
-        lhs = (lhs == operand_to_broadcast) ? broadcasted_operand : lhs;
-        rhs = (rhs == operand_to_broadcast) ? broadcasted_operand : rhs;
-      }
-      if (debug_options_.xla_eliminate_hlo_implicit_broadcast()) {
-        if (!ShapeUtil::SameDimensions(request.output_shape(), lhs->shape())) {
-          // lhs side is being implicitly broadcast. Change to explicit.
-          lhs =
-              ImplicitBroadcastToExplicitBroadcast(lhs, request.output_shape());
-        }
-
-        if (!ShapeUtil::SameDimensions(request.output_shape(), rhs->shape())) {
-          rhs =
-              ImplicitBroadcastToExplicitBroadcast(rhs, request.output_shape());
-        }
-      }
-      hlo_instruction = add_instruction(HloInstruction::CreateBinary(
-          request.output_shape(), hlo_opcode, lhs, rhs));
-      break;
-    }
-
-    case OpRequest::kReducePrecisionRequest: {
-      const ReducePrecisionRequest& reduce_precision_request =
-          request.request().reduce_precision_request();
-      HloInstruction* operand =
-          lookup_instruction(reduce_precision_request.operand());
-      auto exponent_bits = reduce_precision_request.exponent_bits();
-      auto mantissa_bits = reduce_precision_request.mantissa_bits();
-      hlo_instruction = add_instruction(HloInstruction::CreateReducePrecision(
-          request.output_shape(), operand, exponent_bits, mantissa_bits));
-      break;
-    }
-
-    case OpRequest::kTraceRequest: {
-      const TraceRequest& trace_request = request.request().trace_request();
-      HloInstruction* operand = lookup_instruction(trace_request.operand());
-      hlo_instruction = add_instruction(
-          HloInstruction::CreateTrace(trace_request.tag(), operand));
-      break;
-    }
-
-    case OpRequest::kSendRequest: {
-      const SendRequest& send_request = request.request().send_request();
-      HloInstruction* operand = lookup_instruction(send_request.operand());
-      HloInstruction* send = add_instruction(HloInstruction::CreateSend(
-          operand, send_request.channel_handle().handle()));
-      hlo_instruction = add_instruction(HloInstruction::CreateSendDone(send));
-      break;
-    }
-
-    case OpRequest::kGatherRequest: {
-      const GatherRequest& gather_request = request.request().gather_request();
-      HloInstruction* input_operand =
-          lookup_instruction(gather_request.input());
-      HloInstruction* gather_indices_operand =
-          lookup_instruction(gather_request.gather_indices());
-      std::vector<int64> window_bounds;
-      c_copy(gather_request.window_bounds(), std::back_inserter(window_bounds));
-      hlo_instruction = add_instruction(HloInstruction::CreateGather(
-          request.output_shape(), input_operand, gather_indices_operand,
-          gather_request.dimension_numbers(), window_bounds));
-      break;
-    }
-
-    case OpRequest::OP_NOT_SET:
-      LOG(FATAL) << "OperationRequest doesn't contain a request";
-
-    default:
-      LOG(FATAL) << "Unexpected request type: " << request.request().op_case();
-  }
-  (*instructions)[handle.handle()] = hlo_instruction;
-}  // NOLINT(readability/fn_size)
-
-}  // namespace
-
-StatusOr<std::unique_ptr<HloComputation>> UserComputation::BuildHloComputation(
-    VersionedComputationHandle::Version version,
-    HloComputationResolver hlo_resolver, const DebugOptions& debug_options,
-    bool include_unreachable_instructions) const {
-  tensorflow::mutex_lock lock(mutex_);
-
-  VLOG(2) << "Building HloComputation from UserComputation " << name_
-          << " at version " << version;
-  XLA_VLOG_LINES(3, session_computation_.DebugString());
-
-  TF_ASSIGN_OR_RETURN(
-      std::unique_ptr<HloComputation> hlo_computation,
-      ComputationLowerer::Lower(
-          tensorflow::strings::StrCat(name(), ".v", version),
-          session_computation_, version, std::move(hlo_resolver), debug_options,
-          include_unreachable_instructions));
-
-  return std::move(hlo_computation);
-}
-
-}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/user_computation.h b/tensorflow/compiler/xla/service/user_computation.h
deleted file mode 100644
index 5544c868fe905c1ca7e6cab32738440add2e3b4f..0000000000000000000000000000000000000000
--- a/tensorflow/compiler/xla/service/user_computation.h
+++ /dev/null
@@ -1,413 +0,0 @@
-/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-==============================================================================*/
-
-#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_USER_COMPUTATION_H_
-#define TENSORFLOW_COMPILER_XLA_SERVICE_USER_COMPUTATION_H_
-
-#include <functional>
-#include <map>
-#include <memory>
-#include <string>
-#include <vector>
-
-#include "tensorflow/compiler/xla/service/hlo_computation.h"
-#include "tensorflow/compiler/xla/service/session.pb.h"
-#include "tensorflow/compiler/xla/service/versioned_computation_handle.h"
-#include "tensorflow/compiler/xla/statusor.h"
-#include "tensorflow/compiler/xla/types.h"
-#include "tensorflow/compiler/xla/xla.pb.h"
-#include "tensorflow/compiler/xla/xla_data.pb.h"
-#include "tensorflow/core/platform/macros.h"
-#include "tensorflow/core/platform/mutex.h"
-#include "tensorflow/core/platform/thread_annotations.h"
-#include "tensorflow/core/platform/types.h"
-
-namespace xla {
-
-// A UserComputation is the built-up computation that users create via the
-// XLA Service interface.
-//
-// The XLA service adds instructions to a user computation via this
-// interface. The state of the computation is stored as a SessionComputation
-// proto which holds a record of all operation-building requests received by the
-// XLA service.
-//
-// UserComputations are lowered to HloComputations which are passed to the high
-// level compiler interface.
-class UserComputation {
- public:
-  // Factory used when restoring a computation from serialized session
-  // computation (computation snapshot) data. Remaps any references to
-  // computation handle via the old_to_new mapping.
-  //
-  // An error will occur if the old_to_new mapping cannot resolve a reference to
-  // a computation that is present in session_computation.
-  static StatusOr<std::unique_ptr<UserComputation>> MakeWithRemapping(
-      const SessionComputation& session_computation,
-      const ComputationHandle& handle,
-      const std::map<int64, ComputationHandle>& old_to_new);
-
-  // Creates an empty computation with the given name and computation handle.
-  explicit UserComputation(const string& name, const ComputationHandle& handle);
-
-  // Enqueues a parameter-retrieving instruction onto this user computation.
-  // Returns an error status if the parameter number is already registered with
-  // different values.
-  StatusOr<ComputationDataHandle> AddParameterInstruction(
-      const ParameterRequest& parameter_request);
-
-  // Enqueues a pad instruction onto this user computation.
-  StatusOr<ComputationDataHandle> AddPadInstruction(
-      const PadRequest& pad_request);
-
-  // Enqueues a tracing instruction onto this user computation.
-  // Returns an error status if the operand cannot be resolved.
-  Status AddTraceInstruction(const TraceRequest& trace_request);
-
-  // Enqueues a random number generation instruction onto this user computation.
-  StatusOr<ComputationDataHandle> AddRngInstruction(
-      const RngRequest& rng_request);
-
-  // Enqueues a unary instruction onto this user computation.
-  // Returns an error status if the operand index is out of bounds.
-  StatusOr<ComputationDataHandle> AddUnaryInstruction(
-      const UnaryOpRequest& unary_request);
-
-  // Enqueues a batch norm training instruction onto this user computation.
-  StatusOr<ComputationDataHandle> AddBatchNormTrainingInstruction(
-      const BatchNormTrainingRequest& batch_norm_training_request);
-
-  // Enqueues a batch norm inference instruction onto this user computation.
-  StatusOr<ComputationDataHandle> AddBatchNormInferenceInstruction(
-      const BatchNormInferenceRequest& batch_norm_inference_request);
-
-  // Enqueues a batch norm grad instruction onto this user computation.
-  StatusOr<ComputationDataHandle> AddBatchNormGradInstruction(
-      const BatchNormGradRequest& batch_norm_grad_request);
-
-  // Enqueues a binary instruction onto this user computation.
-  // Returns an error status if the operand indices are out of bounds.
-  StatusOr<ComputationDataHandle> AddBinaryInstruction(
-      const BinaryOpRequest& binary_request);
-
-  // Enqueues a ternary instruction onto this user computation.
-  // Returns an error status if the operand indices are out of bounds.
-  StatusOr<ComputationDataHandle> AddTernaryInstruction(
-      const TernaryOpRequest& ternary_request);
-
-  // Enqueues a variadic instruction onto this user computation.
-  // Returns an error status if the operand indices are out of bounds.
-  StatusOr<ComputationDataHandle> AddVariadicInstruction(
-      const VariadicOpRequest& variadic_request);
-
-  // Enqueues a constant instruction onto this user computation.
-  StatusOr<ComputationDataHandle> AddConstantInstruction(
-      const ConstantRequest& constant_request);
-
-  // Enqueues a get tuple element instruction onto this user computation.
-  StatusOr<ComputationDataHandle> AddGetTupleElementInstruction(
-      const GetTupleElementRequest& get_tuple_element_request);
-
-  // Enqueues a map instruction onto this user computation.
-  StatusOr<ComputationDataHandle> AddMapInstruction(
-      const MapRequest& map_request,
-      const UserComputation& to_apply_computation);
-
-  // Enqueues a reduce-precision instruction onto this user computation.
-  StatusOr<ComputationDataHandle> AddReducePrecisionInstruction(
-      const ReducePrecisionRequest& reduce_precision_request);
-
-  // Enqueues a convolution instruction onto this user computation.
-  StatusOr<ComputationDataHandle> AddConvolveInstruction(
-      const ConvolveRequest& convolve_request);
-
-  // Enqueues an FFT instruction onto this user computation.
-  StatusOr<ComputationDataHandle> AddFftInstruction(
-      const FftRequest& fft_request);
-
-  // Enqueues a cross replica sum instruction onto this user computation.
-  StatusOr<ComputationDataHandle> AddCrossReplicaSumInstruction(
-      const CrossReplicaSumRequest& cross_replica_sum_request);
-
-  // Enqueues an infeed instruction onto this user computation.
-  StatusOr<ComputationDataHandle> AddInfeedInstruction(
-      const InfeedRequest& infeed_request);
-
-  // Enqueues an outfeed instruction onto this user computation.
-  StatusOr<ComputationDataHandle> AddOutfeedInstruction(
-      const OutfeedRequest& outfeed_request);
-
-  // Enqueues a host compute instruction onto this user computation.
-  StatusOr<ComputationDataHandle> AddHostComputeInstruction(
-      const HostComputeRequest& host_compute_request);
-
-  // Enqueues a call instruction onto this user computation.
-  StatusOr<ComputationDataHandle> AddCallInstruction(
-      const CallRequest& call_request,
-      const UserComputation& to_apply_computation);
-
-  // Enqueues a custom call instruction onto this user computation.
-  StatusOr<ComputationDataHandle> AddCustomCallInstruction(
-      const CustomCallRequest& custom_call_request);
-
-  // Enqueues a dot instruction onto this user computation.
-  StatusOr<ComputationDataHandle> AddDotInstruction(
-      const DotRequest& dot_request);
-
-  // Enqueues a broadcast instruction onto this user computation.
-  StatusOr<ComputationDataHandle> AddBroadcastInstruction(
-      const BroadcastRequest& broadcast_request);
-
-  // Enqueues a reshape instruction onto this user computation.
-  StatusOr<ComputationDataHandle> AddReshapeInstruction(
-      const ReshapeRequest& reshape_request);
-
-  // Enqueues a transpose instruction onto this user computation.
-  StatusOr<ComputationDataHandle> AddTransposeInstruction(
-      const TransposeRequest& transpose_request);
-
-  // Enqueues a slice instruction onto this user computation.
-  StatusOr<ComputationDataHandle> AddSliceInstruction(
-      const SliceRequest& slice_request);
-
-  // Enqueues a dynamic slice instruction onto this user computation.
-  StatusOr<ComputationDataHandle> AddDynamicSliceInstruction(
-      const DynamicSliceRequest& dynamic_slice_request);
-
-  // Enqueues a dynamic update slice instruction onto this user computation.
-  StatusOr<ComputationDataHandle> AddDynamicUpdateSliceInstruction(
-      const DynamicUpdateSliceRequest& dynamic_update_slice_request);
-
-  // Enqueues a concatenate instruction onto this user computation.
-  StatusOr<ComputationDataHandle> AddConcatenateInstruction(
-      const ConcatenateRequest& concatenate_request);
-
-  // Enqueues a convert instruction onto this user computation.
-  StatusOr<ComputationDataHandle> AddConvertInstruction(
-      const ConvertRequest& convert_request);
-
-  // Enqueues a bitcast element instruction onto this user computation.
-  StatusOr<ComputationDataHandle> AddBitcastConvertInstruction(
-      const ConvertRequest& convert_request);
-
-  // Enqueues a reduce instruction onto this user computation.
-  StatusOr<ComputationDataHandle> AddReduceInstruction(
-      const ReduceRequest& reduce_request,
-      const UserComputation& to_apply_computation);
-
-  // Enqueues a windowed reduce instruction onto this user computation.
-  StatusOr<ComputationDataHandle> AddReduceWindowInstruction(
-      const ReduceWindowRequest& reduce_window_request,
-      const UserComputation& to_apply_computation);
-
-  // Enqueues a select-and-scatter instruction onto this user
-  // computation.
-  StatusOr<ComputationDataHandle> AddSelectAndScatterInstruction(
-      const SelectAndScatterRequest& select_and_scatter_request,
-      const UserComputation& select_computation,
-      const UserComputation& scatter_computation);
-
-  // Enqueues a reverse instruction onto this user computation.
-  StatusOr<ComputationDataHandle> AddReverseInstruction(
-      const ReverseRequest& reverse_request);
-
-  // Enqueues a while instruction onto this user computation.
-  StatusOr<ComputationDataHandle> AddWhileInstruction(
-      const WhileRequest& while_request,
-      const UserComputation& condition_computation,
-      const UserComputation& body_computation);
-
-  // Enqueues a conditional instruction on this user computation.
-  StatusOr<ComputationDataHandle> AddConditionalInstruction(
-      const ConditionalRequest& conditional_request,
-      const UserComputation& true_computation,
-      const UserComputation& false_computation);
-
-  // Enqueues a Send instruction onto this user computation.
-  StatusOr<ComputationDataHandle> AddSendInstruction(
-      const SendRequest& send_request);
-
-  // Enqueues a Recv instruction onto this user computation.
-  StatusOr<ComputationDataHandle> AddRecvInstruction(
-      const RecvRequest& recv_request);
-
-  // Enqueues a Gather instruction onto this user computation.
-  StatusOr<ComputationDataHandle> AddGatherInstruction(
-      const GatherRequest& gather_request);
-
-  // Returns the user-provided name of this user computation, which is provided
-  // via the XLA computation-building API.
-  const string& name() const { return name_; }
-
-  // Subsequent executions of this computation will compute the value
-  // represented by handle, rather than the last expression enqueued
-  // on the computation.
-  Status SetReturnValue(const ComputationDataHandle& handle);
-
-  // Return a versioned handle for this computation.
-  VersionedComputationHandle GetVersionedHandle() const;
-
-  // Return a versioned handle for this computation with a version equal to the
-  // point at which given operation was added to the computation.
-  VersionedComputationHandle GetVersionedHandleAtOperation(
-      const ComputationDataHandle& operation) const;
-
-  // Return a version value representing the current state of the
-  // computation.
-  VersionedComputationHandle::Version version() const;
-
-  // Computes and returns the program shape for the user computation -- gathers
-  // parameters and result type into a single proto. A shared_ptr is used
-  // because the returned pointer refers to an internally cached value which may
-  // be discarded by the UserComputation object. This avoid unnecessary copies.
-  //
-  // If the parameter space is not dense (i.e. there are holes in the parameter
-  // numbers provided) then an error status is returned.
-  StatusOr<std::shared_ptr<const ProgramShape>> ComputeProgramShape(
-      VersionedComputationHandle::Version version) const;
-
-  // Returns true if the given data handle does not depend on any parameter with
-  // index higher then num_parameters. That is, the value can be computed at
-  // compile time if we know the first num_parameters arguments.
-  StatusOr<bool> IsConstant(const ComputationDataHandle& handle,
-                            int64 num_parameters);
-
-  // Returns the output shape of the operation indicated by the given handle.
-  StatusOr<Shape> GetShape(const ComputationDataHandle& handle);
-
-  // Sets metadata on the Hlo instruction referenced by the given handle.
-  Status SetOpMetadata(const ComputationDataHandle& handle,
-                       const OpMetadata& metadata);
-
-  // Sets the device assignment on the Hlo instruction referenced by 'handle'.
-  Status SetOpSharding(const ComputationDataHandle& handle,
-                       const OpSharding& sharding);
-
-  // Builds a HLO computation from the UserComputation. The parameter "resolver"
-  // is a function which returns a pointer to the HloComputation corresponding
-  // to the given ComputationHandle at the given version. The resolver is used
-  // for operations, such as map, which call other computations and need a
-  // pointer to the called HloComputation to construct the respective HLO
-  // instructions. If include_unreachable_instructions is true, then
-  // instructions which are not reachable from the root are lowered into
-  // HloInstructions.
-  using HloComputationResolver =
-      std::function<HloComputation*(const VersionedComputationHandle& handle)>;
-  StatusOr<std::unique_ptr<HloComputation>> BuildHloComputation(
-      VersionedComputationHandle::Version version,
-      HloComputationResolver hlo_resolver, const DebugOptions& debug_options,
-      bool include_unreachable_instructions = true) const;
-
-  // Return a vector containing the embedded computations used by this
-  // UserComputation. Only embedded computations which are called directly by
-  // this UserComputation are included. That is, the transitive closure of
-  // embedded computations is not included.
-  std::vector<VersionedComputationHandle> GetEmbeddedComputations(
-      VersionedComputationHandle::Version version) const;
-
-  // Returns the number of OperationRequest objects in this UserComputation.
-  // The 'version' of a computation is identical to the number of
-  // OperationRequests in the UserComputation.
-  int64 request_count(VersionedComputationHandle::Version version) const {
-    return version;
-  }
-
-  // Returns a copy of the internal session state for this computation -- this
-  // is useful for serializing the guts of a user computation, though references
-  // to other handles (e.g. referred-to computations) must be handled with care
-  // in the serialization / de-serialization process.
-  SessionComputation CloneSessionComputation(
-      VersionedComputationHandle::Version version) const;
-
-  // Warning: typically we don't want to look up computation data handles until
-  // the computation is finished being built, for consistency purposes. We
-  // expose this routine for error reporting purposes so that we can provide
-  // more meaningful error messages from the XLA service layer.
-  //
-  // Returns the operation request that the handle comes from.
-  StatusOr<const OperationRequest*> LookUpRequestForErrorReporting(
-      const ComputationDataHandle& handle) const;
-
-  // Retrieves the parameter metadata for the given parameter number.
-  //
-  // If the parameter number is invalid for this computation, nullopt is
-  // returned. When the return value has_value(), nullptr will never be
-  // the held value.
-  tensorflow::gtl::optional<const OpMetadata*> ParameterMetadata(
-      int parameter_number) const;
-
- private:
-  // Warning: dangerous mutating operation that doesn't respect versioning.
-  // This is only used at initialization time when constructing from a
-  // SessionComputation a la MakeWithRemapping.
-  //
-  // Remaps references to old computations (with handle values in the keys of
-  // old_to_new) to the computation handle given in the values. This is useful
-  // when loading computations from snapshots, to finish initialization, before
-  // the user computation is released into the wild.
-  Status RemapEmbeddedComputations(
-      const std::map<int64, ComputationHandle>& old_to_new)
-      EXCLUSIVE_LOCKS_REQUIRED(mutex_);
-
-  // Returns the OperationRequest corresponding to the given handle.
-  StatusOr<const OperationRequest*> LookUpRequest(
-      const ComputationDataHandle& handle) const
-      EXCLUSIVE_LOCKS_REQUIRED(mutex_);
-
-  // Creates a new ComputationDataHandle with the next available handle value.
-  ComputationDataHandle CreateComputationDataHandle()
-      EXCLUSIVE_LOCKS_REQUIRED(mutex_);
-
-  // Checks whether the parameter numbers of the parameter operations are
-  // contiguous starting from zero. Returns appropriate error status if not.
-  Status CheckParametersAreContiguous(
-      VersionedComputationHandle::Version version) const
-      EXCLUSIVE_LOCKS_REQUIRED(mutex_);
-
-  VersionedComputationHandle GetVersionedHandleInternal() const
-      EXCLUSIVE_LOCKS_REQUIRED(mutex_);
-
-  // Name of the computation.
-  string name_;
-
-  mutable tensorflow::mutex mutex_;
-
-  // State of the computation as a record of all operation-building requests.
-  SessionComputation session_computation_ GUARDED_BY(mutex_);
-
-  // Mapping from parameter number to operation request containing the
-  // respective ParameterRequest.
-  std::map<int64, OperationRequest*> parameters_ GUARDED_BY(mutex_);
-
-  // The next ComputationDataHandle value to assign. Handle values are assigned
-  // sequentially.
-  int64 next_handle_value_ GUARDED_BY(mutex_);
-
-  // If handle_to_return_.has_handle() then an Execution of this Computation
-  // will compute the value represented by handle_to_return_, otherwise it will
-  // compute the value of (next_handle_value_ - 1).
-  ComputationDataHandle handle_to_return_ GUARDED_BY(mutex_);
-
-  // Memoized ProgramShape and its version. A shared_ptr is used because
-  // references to this object are returned by ComputeProgramShape.
-  mutable int64 program_shape_version_ GUARDED_BY(mutex_) = 0;
-  mutable std::shared_ptr<const ProgramShape> program_shape_ GUARDED_BY(mutex_);
-
-  TF_DISALLOW_COPY_AND_ASSIGN(UserComputation);
-};
-
-}  // namespace xla
-
-#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_USER_COMPUTATION_H_
diff --git a/tensorflow/compiler/xla/service/user_computation_test.cc b/tensorflow/compiler/xla/service/user_computation_test.cc
deleted file mode 100644
index 2fa163953f638c0038e9f6bb11ce2a3742e0558c..0000000000000000000000000000000000000000
--- a/tensorflow/compiler/xla/service/user_computation_test.cc
+++ /dev/null
@@ -1,340 +0,0 @@
-/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-==============================================================================*/
-
-#include "tensorflow/compiler/xla/service/user_computation.h"
-
-#include "tensorflow/compiler/xla/literal_util.h"
-#include "tensorflow/compiler/xla/service/hlo_computation.h"
-#include "tensorflow/compiler/xla/service/hlo_matchers.h"
-#include "tensorflow/compiler/xla/shape_util.h"
-#include "tensorflow/compiler/xla/status_macros.h"
-#include "tensorflow/compiler/xla/test.h"
-#include "tensorflow/compiler/xla/test_helpers.h"
-#include "tensorflow/compiler/xla/xla_data.pb.h"
-#include "tensorflow/core/lib/core/status_test_util.h"
-
-namespace op = xla::testing::opcode_matchers;
-
-namespace xla {
-namespace {
-
-using UserComputationTest = ::testing::Test;
-
-TEST_F(UserComputationTest, SimpleComputation) {
-  const Shape kScalarShape = ShapeUtil::MakeShape(F32, {});
-  const Shape kVectorShape = ShapeUtil::MakeShape(F32, {2});
-
-  // Build a simple three operation computatation:
-  //
-  //   %constant = Constant({123, 42})
-  //   %param = Param(0)
-  //   %outfeed = Outfeed(%constant)
-  //
-  // Build the computation at two different versions and check invariants.
-  ComputationHandle handle;
-  handle.set_handle(123);
-  UserComputation computation("TheComputation", handle);
-
-  ConstantRequest constant_request;
-  *constant_request.mutable_literal() =
-      Literal::CreateR1<float>({123.0f, 42.0f})->ToProto();
-  TF_ASSERT_OK_AND_ASSIGN(ComputationDataHandle constant_handle,
-                          computation.AddConstantInstruction(constant_request));
-
-  ParameterRequest param_request;
-  *param_request.mutable_shape() = kScalarShape;
-  param_request.set_parameter(0);
-  param_request.set_name("param0");
-  TF_ASSERT_OK_AND_ASSIGN(ComputationDataHandle param_handle,
-                          computation.AddParameterInstruction(param_request));
-  OpMetadata metadata;
-  metadata.set_op_name("meta");
-  TF_ASSERT_OK(computation.SetOpMetadata(param_handle, metadata));
-
-  OutfeedRequest outfeed_request;
-  *outfeed_request.mutable_operand() = constant_handle;
-  *outfeed_request.mutable_shape() = kVectorShape;
-  outfeed_request.set_outfeed_config("abc");
-  TF_ASSERT_OK_AND_ASSIGN(ComputationDataHandle outfeed_handle,
-                          computation.AddOutfeedInstruction(outfeed_request));
-
-  auto hlo_resolver = [](const VersionedComputationHandle& handle) {
-    return nullptr;
-  };
-  {
-    // Test the computation at the latest version. In this case, the most
-    // recently added operation is an outfeed. However, the outfeed is not the
-    // root because outfeeds cannot be the root of a computation.
-    VersionedComputationHandle latest_version =
-        computation.GetVersionedHandle();
-
-    // Program shape should have a single scalar parameter and scalar
-    // result. The outfeed instruction should not affect the program shape.
-    TF_ASSERT_OK_AND_ASSIGN(
-        std::shared_ptr<const ProgramShape> program_shape,
-        computation.ComputeProgramShape(latest_version.version));
-    ASSERT_EQ(1, program_shape->parameters_size());
-    EXPECT_TRUE(
-        ShapeUtil::Compatible(kScalarShape, program_shape->parameters(0)));
-    EXPECT_TRUE(ShapeUtil::Compatible(kScalarShape, program_shape->result()));
-
-    // Build the HLO computation.
-    TF_ASSERT_OK_AND_ASSIGN(
-        std::unique_ptr<HloComputation> hlo_computation,
-        computation.BuildHloComputation(latest_version.version, hlo_resolver,
-                                        DebugOptions()));
-    // There should be one HloInstruction per UserComputation operation.
-    EXPECT_EQ(3, hlo_computation->instruction_count());
-    // The root of the instruction should be the parameter instruction (not the
-    // outfeed).
-    EXPECT_THAT(hlo_computation->root_instruction(), op::Parameter());
-  }
-
-  {
-    // Test the computation at the version right after the parameter instruction
-    // is added.
-    VersionedComputationHandle version_at_param =
-        computation.GetVersionedHandleAtOperation(param_handle);
-
-    // Program shape should have a single scalar parameter, and scalar result.
-    TF_ASSERT_OK_AND_ASSIGN(
-        std::shared_ptr<const ProgramShape> program_shape,
-        computation.ComputeProgramShape(version_at_param.version));
-    ASSERT_EQ(1, program_shape->parameters_size());
-    EXPECT_TRUE(
-        ShapeUtil::Compatible(kScalarShape, program_shape->parameters(0)));
-    EXPECT_TRUE(ShapeUtil::Compatible(kScalarShape, program_shape->result()));
-
-    // There should be two instructions, one for the constant and one for the
-    // parameter. The outfeed instruction should not be included.
-    TF_ASSERT_OK_AND_ASSIGN(
-        std::unique_ptr<HloComputation> hlo_computation,
-        computation.BuildHloComputation(version_at_param.version, hlo_resolver,
-                                        DebugOptions()));
-    EXPECT_EQ(2, hlo_computation->instruction_count());
-    EXPECT_THAT(hlo_computation->root_instruction(), op::Parameter());
-  }
-  {
-    // Test the computation at the latest version, but lowered with
-    // include_unreachable_instructions set to false.
-    VersionedComputationHandle latest_version =
-        computation.GetVersionedHandle();
-
-    // Build the HLO computation.
-    TF_ASSERT_OK_AND_ASSIGN(
-        std::unique_ptr<HloComputation> hlo_computation,
-        computation.BuildHloComputation(
-            latest_version.version, hlo_resolver, DebugOptions(),
-            /*include_unreachable_instructions=*/false));
-    // There is only one reachable instruction, the parameter.
-    EXPECT_EQ(1, hlo_computation->instruction_count());
-    // The root of the instruction should be the parameter instruction (not the
-    // outfeed).
-    EXPECT_THAT(hlo_computation->root_instruction(), op::Parameter());
-    EXPECT_EQ(hlo_computation->root_instruction()->metadata().op_name(),
-              "meta");
-  }
-}
-
-TEST_F(UserComputationTest, EliminateScalarBroadcast) {
-  auto debug_options = DebugOptions();
-  debug_options.set_xla_eliminate_hlo_implicit_broadcast(true);
-
-  // Build a binary computation with scalar broadcast.
-  //
-  //  %a = Constant({123, 42})
-  //  %b = Constant(1)
-  //  %add = Add(%a, %b)
-  ComputationHandle handle;
-  handle.set_handle(123);
-  UserComputation computation("TheComputation", handle);
-
-  ConstantRequest a_request;
-  *a_request.mutable_literal() =
-      Literal::CreateR1<float>({123.0f, 42.0f})->ToProto();
-  TF_ASSERT_OK_AND_ASSIGN(ComputationDataHandle a_handle,
-                          computation.AddConstantInstruction(a_request));
-
-  ConstantRequest b_request;
-  *b_request.mutable_literal() = Literal::CreateR0<float>(1.0f)->ToProto();
-  TF_ASSERT_OK_AND_ASSIGN(ComputationDataHandle b_handle,
-                          computation.AddConstantInstruction(b_request));
-
-  BinaryOpRequest add;
-  add.set_binop(BINOP_ADD);
-  *add.mutable_lhs() = a_handle;
-  *add.mutable_rhs() = b_handle;
-  TF_ASSERT_OK(computation.AddBinaryInstruction(add).status());
-
-  auto hlo_resolver = [](const VersionedComputationHandle& handle) {
-    return nullptr;
-  };
-  VersionedComputationHandle latest_version = computation.GetVersionedHandle();
-
-  // Build the HLO computation.
-  TF_ASSERT_OK_AND_ASSIGN(
-      std::unique_ptr<HloComputation> hlo_computation,
-      computation.BuildHloComputation(latest_version.version, hlo_resolver,
-                                      debug_options));
-  // The binary operation has implicit scalar broadcast, should be converted
-  // to an explicit broadcast intruction and a binary instruction.
-  EXPECT_EQ(4, hlo_computation->instruction_count());
-  EXPECT_THAT(hlo_computation->root_instruction(), op::Add());
-  LOG(INFO) << hlo_computation->root_instruction()->ToString();
-  const auto& operands = hlo_computation->root_instruction()->operands();
-  ASSERT_EQ(2, operands.size());
-  EXPECT_TRUE(operands[0]->opcode() == HloOpcode::kBroadcast ||
-              operands[1]->opcode() == HloOpcode::kBroadcast);
-}
-
-TEST_F(UserComputationTest, CheckImplicitBroadcastToExplicitBroadcast) {
-  auto debug_options = DebugOptions();
-  debug_options.set_xla_eliminate_hlo_implicit_broadcast(true);
-
-  // Build a binary computation with degenerate broadcast.
-  //
-  //  %a = Param({1, 2, 3});
-  //  %b = Param({1, 2, 1});
-  //  %add = Add(%a, %b, {});
-  ComputationHandle handle;
-  handle.set_handle(123);
-  UserComputation computation("TheComputation", handle);
-
-  ParameterRequest a_request;
-  *a_request.mutable_shape() = ShapeUtil::MakeShape(F32, {1, 2, 3});
-  a_request.set_name("a");
-  a_request.set_parameter(0);
-  TF_ASSERT_OK_AND_ASSIGN(ComputationDataHandle a_handle,
-                          computation.AddParameterInstruction(a_request));
-
-  ParameterRequest b_request;
-  *b_request.mutable_shape() = ShapeUtil::MakeShape(F32, {1, 2, 1});
-  b_request.set_name("b");
-  b_request.set_parameter(1);
-  TF_ASSERT_OK_AND_ASSIGN(ComputationDataHandle b_handle,
-                          computation.AddParameterInstruction(b_request));
-
-  const int64 kDevice = 7;
-  OpSharding sharding;
-  sharding.set_type(OpSharding::Type::OpSharding_Type_MAXIMAL);
-  sharding.add_tile_assignment_dimensions(1);
-  sharding.add_tile_assignment_devices(kDevice);
-
-  TF_EXPECT_OK(computation.SetOpSharding(b_handle, sharding));
-
-  BinaryOpRequest add;
-  add.set_binop(BINOP_ADD);
-  *add.mutable_lhs() = a_handle;
-  *add.mutable_rhs() = b_handle;
-  TF_ASSERT_OK(computation.AddBinaryInstruction(add).status());
-
-  auto hlo_resolver = [](const VersionedComputationHandle& handle) {
-    return nullptr;
-  };
-  VersionedComputationHandle latest_version = computation.GetVersionedHandle();
-
-  // Build the HLO computation.
-  TF_ASSERT_OK_AND_ASSIGN(
-      std::unique_ptr<HloComputation> hlo_computation,
-      computation.BuildHloComputation(latest_version.version, hlo_resolver,
-                                      debug_options));
-
-  //    b         a
-  //    |         |
-  // reshape      |
-  //    |         |
-  // broadcast    |
-  //     \       /
-  //        add
-  EXPECT_EQ(5, hlo_computation->instruction_count());
-  ASSERT_THAT(
-      hlo_computation->root_instruction(),
-      op::Add(op::Parameter(), op::Broadcast(op::Reshape(op::Parameter()))));
-
-  const HloInstruction* broadcast =
-      hlo_computation->root_instruction()->operand(1);
-  EXPECT_TRUE(broadcast->has_sharding());
-
-  const HloInstruction* reshape = broadcast->operand(0);
-  EXPECT_TRUE(reshape->has_sharding());
-}
-
-TEST_F(UserComputationTest, EliminateDegenerateBroadcastAfterIndimBroadcast) {
-  auto debug_options = DebugOptions();
-  debug_options.set_xla_eliminate_hlo_implicit_broadcast(true);
-
-  // Build a binary computation with in-dim broadcast and degenerate broadcast.
-  //
-  //  %a = Param({2, 3});
-  //  %b = Param({2, 1, 4});
-  //  %add = Add(%a, %b, {0, 1});
-  ComputationHandle handle;
-  handle.set_handle(123);
-  UserComputation computation("TheComputation", handle);
-
-  ParameterRequest a_request;
-  *a_request.mutable_shape() = ShapeUtil::MakeShape(F32, {2, 3});
-  a_request.set_name("a");
-  a_request.set_parameter(0);
-  TF_ASSERT_OK_AND_ASSIGN(ComputationDataHandle a_handle,
-                          computation.AddParameterInstruction(a_request));
-
-  ParameterRequest b_request;
-  *b_request.mutable_shape() = ShapeUtil::MakeShape(F32, {2, 1, 4});
-  b_request.set_name("b");
-  b_request.set_parameter(1);
-  TF_ASSERT_OK_AND_ASSIGN(ComputationDataHandle b_handle,
-                          computation.AddParameterInstruction(b_request));
-
-  BinaryOpRequest add;
-  add.set_binop(BINOP_ADD);
-  *add.mutable_lhs() = a_handle;
-  *add.mutable_rhs() = b_handle;
-  add.add_broadcast_dimensions(0);
-  add.add_broadcast_dimensions(1);
-  TF_ASSERT_OK(computation.AddBinaryInstruction(add).status());
-
-  auto hlo_resolver = [](const VersionedComputationHandle& handle) {
-    return nullptr;
-  };
-  VersionedComputationHandle latest_version = computation.GetVersionedHandle();
-
-  // Build the HLO computation.
-  TF_ASSERT_OK_AND_ASSIGN(
-      std::unique_ptr<HloComputation> hlo_computation,
-      computation.BuildHloComputation(latest_version.version, hlo_resolver,
-                                      debug_options));
-
-  // The binary operation has in-dim broadcast and degenerate broadcast, should
-  // first do the in-dim broadcast then convert the degnerate broadcast into a
-  // reshape and a broadcast.
-  //
-  //    b         a
-  //    |         |
-  // broadcast reshape
-  //    |         |
-  //    |     broadcast
-  //     \        /
-  //        add
-  EXPECT_EQ(6, hlo_computation->instruction_count());
-  EXPECT_THAT(hlo_computation->root_instruction(), op::Add());
-  const auto& operands = hlo_computation->root_instruction()->operands();
-  ASSERT_EQ(2, operands.size());
-  EXPECT_TRUE(operands[0]->opcode() == HloOpcode::kBroadcast &&
-              operands[1]->opcode() == HloOpcode::kBroadcast);
-}
-
-}  // namespace
-}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/versioned_computation_handle.h b/tensorflow/compiler/xla/service/versioned_computation_handle.h
deleted file mode 100644
index 5732a56caffa31dde52dff5c2775f9fde0cacfbd..0000000000000000000000000000000000000000
--- a/tensorflow/compiler/xla/service/versioned_computation_handle.h
+++ /dev/null
@@ -1,55 +0,0 @@
-/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-==============================================================================*/
-
-#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_VERSIONED_COMPUTATION_HANDLE_H_
-#define TENSORFLOW_COMPILER_XLA_SERVICE_VERSIONED_COMPUTATION_HANDLE_H_
-
-#include <ostream>
-
-#include "tensorflow/compiler/xla/types.h"
-#include "tensorflow/compiler/xla/xla_data.pb.h"
-
-namespace xla {
-
-// A data structure encapsulating a ComputationHandle and version value of that
-// computation. This object is used to unambiguously refer to a particular
-// computation in the service.
-struct VersionedComputationHandle {
-  // A version value unambiguously specifying the state of the computation at a
-  // particular point in time as it is being built. This value is the
-  // ComputationDataHandle of the current root instruction.
-  using Version = int64;
-
-  ComputationHandle handle;
-  Version version;
-
-  string ToString() const;
-  bool operator==(const VersionedComputationHandle& other) const {
-    return (handle.handle() == other.handle.handle()) &&
-           (version == other.version);
-  }
-  bool operator<(const VersionedComputationHandle& other) const {
-    return ((handle.handle() < other.handle.handle()) ||
-            ((handle.handle() == other.handle.handle()) &&
-             (version < other.version)));
-  }
-};
-
-std::ostream& operator<<(std::ostream& out,
-                         const VersionedComputationHandle& versioned_handle);
-
-}  // namespace xla
-
-#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_VERSIONED_COMPUTATION_HANDLE_H_
diff --git a/tensorflow/compiler/xla/service/while_loop_constant_sinking_test.cc b/tensorflow/compiler/xla/service/while_loop_constant_sinking_test.cc
index 0d2288d8ea6ebb0ac4ac9468a211b161438fc5f1..393e75803888d8a642881c4d525b170d1e1180ba 100644
--- a/tensorflow/compiler/xla/service/while_loop_constant_sinking_test.cc
+++ b/tensorflow/compiler/xla/service/while_loop_constant_sinking_test.cc
@@ -16,8 +16,8 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/while_loop_constant_sinking.h"
 
 #include "tensorflow/compiler/xla/service/hlo_matchers.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/test.h"
-#include "tensorflow/compiler/xla/tools/parser/hlo_parser.h"
 #include "tensorflow/core/lib/core/status_test_util.h"
 
 namespace xla {
@@ -55,7 +55,7 @@ ENTRY entry {
 )";
 
   TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
-                          tools::Parse(hlo_string));
+                          ParseHloString(hlo_string));
 
   TF_ASSERT_OK_AND_ASSIGN(bool changed,
                           WhileLoopConstantSinking{}.Run(module.get()));
@@ -95,7 +95,7 @@ ENTRY entry {
 )";
 
   TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
-                          tools::Parse(hlo_string));
+                          ParseHloString(hlo_string));
 
   TF_ASSERT_OK_AND_ASSIGN(bool changed,
                           WhileLoopConstantSinking{}.Run(module.get()));
@@ -136,7 +136,7 @@ ENTRY entry {
 )";
 
   TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
-                          tools::Parse(hlo_string));
+                          ParseHloString(hlo_string));
 
   TF_ASSERT_OK_AND_ASSIGN(bool changed,
                           WhileLoopConstantSinking{}.Run(module.get()));
@@ -184,7 +184,7 @@ ENTRY entry {
 )";
 
   TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
-                          tools::Parse(hlo_string));
+                          ParseHloString(hlo_string));
 
   TF_ASSERT_OK_AND_ASSIGN(bool changed,
                           WhileLoopConstantSinking{}.Run(module.get()));
diff --git a/tensorflow/compiler/xla/service/while_loop_invariant_code_motion.cc b/tensorflow/compiler/xla/service/while_loop_invariant_code_motion.cc
index 321fdeb1ea313d2bc00b0210b422f36915f41453..09ddcffb22c2184262adf87d570870ec000c0e6f 100644
--- a/tensorflow/compiler/xla/service/while_loop_invariant_code_motion.cc
+++ b/tensorflow/compiler/xla/service/while_loop_invariant_code_motion.cc
@@ -98,14 +98,17 @@ static void CreateLoopInvariantCopy(
 // Returns true if `instruction` is worth hoisting only if it lets us hoist some
 // instruction using it.  The rationale is that hoisting these instructions will
 // prevent simplification and fusion in the while body.
-static bool NotWorthHoistingIndividually(const HloInstruction& instruction) {
+bool WhileLoopInvariantCodeMotion::NotWorthHoistingIndividually(
+    const HloInstruction& instruction) {
   switch (instruction.opcode()) {
     default:
       return false;
 
+    case HloOpcode::kConstant:
+      return !hoist_constants_;
+
     case HloOpcode::kBitcast:
     case HloOpcode::kBroadcast:
-    case HloOpcode::kConstant:
     case HloOpcode::kReshape:
     case HloOpcode::kReverse:
     case HloOpcode::kSlice:
@@ -115,7 +118,8 @@ static bool NotWorthHoistingIndividually(const HloInstruction& instruction) {
   }
 }
 
-static StatusOr<bool> TryHoistingInvariantInstructionsFromWhileBody(
+StatusOr<bool>
+WhileLoopInvariantCodeMotion::TryHoistingInvariantInstructionsFromWhileBody(
     HloInstruction* while_instr) {
   auto print_no_metadata = HloPrintOptions{}.set_print_metadata(false);
 
@@ -161,12 +165,16 @@ static StatusOr<bool> TryHoistingInvariantInstructionsFromWhileBody(
     }
   }
 
-  if (unhoisted_invariant_instructions.empty()) {
+  if (unhoisted_invariant_instructions.empty() && !hoist_constants_) {
     // There are no obviously loop invariant elements in the state being
     // threaded through the while loop so give up.  In theory this precondition
     // is too strong -- we could have code that e.g. permutes the elements in
     // the while state but uses a select to pick the same value on every
     // iteration.
+    //
+    // If we were asked to hoist constants, we need to scan the while body for
+    // constants even if we didn't find any loop invariant values in the while
+    // state tuple.
     return false;
   }
 
@@ -243,6 +251,9 @@ static StatusOr<bool> TryHoistingInvariantInstructionsFromWhileBody(
 }
 
 StatusOr<bool> WhileLoopInvariantCodeMotion::Run(HloModule* module) {
+  VLOG(2) << "HLO module before WhileLoopConstantSinking:";
+  XLA_VLOG_LINES(2, module->ToString());
+
   bool changed = false;
   std::vector<HloInstruction*> while_instrs;
   for (auto* comp : module->computations()) {
@@ -270,6 +281,14 @@ StatusOr<bool> WhileLoopInvariantCodeMotion::Run(HloModule* module) {
         TryHoistingInvariantInstructionsFromWhileBody(while_instr));
     changed |= result;
   }
+
+  if (changed) {
+    VLOG(2) << "HLO module after WhileLoopConstantSinking:";
+    XLA_VLOG_LINES(2, module->ToString());
+  } else {
+    VLOG(2) << "HLO module unchanged after WhileLoopConstantSinking";
+  }
+
   return changed;
 }
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/while_loop_invariant_code_motion.h b/tensorflow/compiler/xla/service/while_loop_invariant_code_motion.h
index 8c4b765b0003c48cfacb9d28e7c8259ac0927d66..8e6cc8787576e4f041229da5cf8dd2b09194eb2a 100644
--- a/tensorflow/compiler/xla/service/while_loop_invariant_code_motion.h
+++ b/tensorflow/compiler/xla/service/while_loop_invariant_code_motion.h
@@ -27,12 +27,28 @@ namespace xla {
 
 class WhileLoopInvariantCodeMotion : public HloPassInterface {
  public:
+  // If `hoist_constants` is true then constants are always hoisted out of while
+  // loop bodies.  Otherwise they are only hoisted out if they enable other
+  // non-trivial computations to be hoisted out.
+  //
+  // Setting `hoist_constants` to false can be help if LICM is run in the mid
+  // level HLO pipeline because hoisting constants out of while loop bodies can
+  // break optimizations like constant folding.
+  explicit WhileLoopInvariantCodeMotion(bool hoist_constants = false)
+      : hoist_constants_(hoist_constants) {}
   ~WhileLoopInvariantCodeMotion() override = default;
 
   tensorflow::StringPiece name() const override {
     return "while-loop-invariant-code-motion";
   }
   StatusOr<bool> Run(HloModule* module) override;
+
+ private:
+  bool NotWorthHoistingIndividually(const HloInstruction& instruction);
+  StatusOr<bool> TryHoistingInvariantInstructionsFromWhileBody(
+      HloInstruction* while_instr);
+
+  bool hoist_constants_;
 };
 }  // namespace xla
 
diff --git a/tensorflow/compiler/xla/service/while_loop_invariant_code_motion_test.cc b/tensorflow/compiler/xla/service/while_loop_invariant_code_motion_test.cc
index 799340fda905fb7d40b19b4cb79bb0fcb5629fd3..8831c513eee66e36163135b732f833d46cb7eb03 100644
--- a/tensorflow/compiler/xla/service/while_loop_invariant_code_motion_test.cc
+++ b/tensorflow/compiler/xla/service/while_loop_invariant_code_motion_test.cc
@@ -16,6 +16,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/while_loop_invariant_code_motion.h"
 
 #include "tensorflow/compiler/xla/service/hlo_matchers.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/test.h"
 #include "tensorflow/compiler/xla/tests/hlo_verified_test_base.h"
 #include "tensorflow/core/lib/core/status_test_util.h"
@@ -438,5 +439,77 @@ TEST_F(WhileLoopInvariantCodeMotionTest, BodyHasNonTupleRoot) {
   EXPECT_FALSE(simplified_loop);
 }
 
+const char* const kConstantHoistingTestCase = R"(
+HloModule ModuleWithWhile
+
+body {
+  p_body = (f32[2]{0}) parameter(0)
+  p_body.1 = f32[2]{0} get-tuple-element(p_body), index=0
+  const = f32[2]{0} constant({3, 4})
+  add.0 = f32[2]{0} add(p_body.1, const)
+  ROOT root = (f32[2]{0}) tuple(add.0)
+}
+
+condition {
+  p_cond = (f32[2]{0}) parameter(0)
+  ROOT result = pred[] constant(true)
+}
+
+ENTRY entry {
+  const_0 = f32[2]{0} constant({1, 2})
+  while_init = (f32[2]{0}) tuple(const_0)
+  ROOT while = (f32[2]{0}) while(while_init), condition=condition, body=body
+}
+)";
+
+TEST_F(WhileLoopInvariantCodeMotionTest, HoistsConstantWhenAsked) {
+  ParseAndVerifyModule(kConstantHoistingTestCase);
+
+  TF_ASSERT_OK_AND_ASSIGN(
+      bool simplified_loop,
+      WhileLoopInvariantCodeMotion{/*hoist_constants=*/true}.Run(&module()));
+  EXPECT_TRUE(simplified_loop);
+
+  HloComputation* while_body = module().GetComputationWithName("wide.body");
+  ASSERT_NE(while_body, nullptr);
+
+  // We expect the while body to be the equivalent of:
+  //
+  //  wide.body {
+  //    wide_param.1 = (f32[2]{0}, f32[2]{0}) parameter(0)
+  //    get-tuple-element.1 = f32[2]{0} get-tuple-element(wide_param.1), index=0
+  //    tuple.1 = (f32[2]{0}) tuple(get-tuple-element.1)
+  //    get-tuple-element.4 = f32[2]{0} get-tuple-element(tuple.1), index=0
+  //    get-tuple-element.7 = f32[2]{0} get-tuple-element(wide_param.1), index=1
+  //    add.1 = f32[2]{0} add(get-tuple-element.4, get-tuple-element.7)
+  //    tuple.3 = (f32[2]{0}) tuple(add.1)
+  //    get-tuple-element.8 = f32[2]{0} get-tuple-element(tuple.3), index=0
+  //    get-tuple-element.9 = f32[2]{0} get-tuple-element(wide_param.1), index=1
+  //    ROOT tuple.4 = (f32[2]{0}, f32[2]{0}) tuple(get-tuple-element.8,
+  //                                                get-tuple-element.9)
+  //  }
+
+  auto wide_param_1 = op::Parameter(0);
+  auto get_tuple_element_1 = op::GetTupleElement(wide_param_1, 0);
+  auto tuple_1 = op::Tuple(get_tuple_element_1);
+  auto get_tuple_element_4 = op::GetTupleElement(tuple_1, 0);
+  auto get_tuple_element_7 = op::GetTupleElement(wide_param_1, 1);
+  auto add_1 = op::Add(get_tuple_element_4, get_tuple_element_7);
+  auto tuple_3 = op::Tuple(add_1);
+  auto get_tuple_element_8 = op::GetTupleElement(tuple_3, 0);
+  auto get_tuple_element_9 = op::GetTupleElement(wide_param_1, 1);
+  auto tuple_4 = op::Tuple(get_tuple_element_8, get_tuple_element_9);
+
+  EXPECT_THAT(while_body->root_instruction(), tuple_4);
+}
+
+TEST_F(WhileLoopInvariantCodeMotionTest, DoesNotHoistConstantByDefault) {
+  ParseAndVerifyModule(kConstantHoistingTestCase);
+
+  TF_ASSERT_OK_AND_ASSIGN(bool simplified_loop,
+                          WhileLoopInvariantCodeMotion{}.Run(&module()));
+  EXPECT_FALSE(simplified_loop);
+}
+
 }  // namespace
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/while_util.cc b/tensorflow/compiler/xla/service/while_util.cc
index ed20b36292a7f24385603627d74fc72ba6b3b724..473eab2ea84eb8faf745cbe299bc80bcc1b62a35 100644
--- a/tensorflow/compiler/xla/service/while_util.cc
+++ b/tensorflow/compiler/xla/service/while_util.cc
@@ -117,9 +117,13 @@ WhileUtil::MakeInstructionsLiveIn(
   HloInstruction* new_while = containing_computation->AddInstruction(
       HloInstruction::CreateWhile(new_while_shape, new_while_condition,
                                   new_while_body, new_while_init));
-  TF_RETURN_IF_ERROR(containing_computation->ReplaceInstruction(
-      while_instr, TupleUtil::ExtractPrefix(
-                       new_while, while_instr->shape().tuple_shapes_size())));
+
+  // We want to get rid of the old while instruction even if it has side
+  // effecting operations so we do a manual HloComputation::RemoveInstruction
+  // instead of relying on HloComputation::ReplaceInstruction.
+  TF_RETURN_IF_ERROR(while_instr->ReplaceAllUsesWith(TupleUtil::ExtractPrefix(
+      new_while, while_instr->shape().tuple_shapes_size())));
+  TF_RETURN_IF_ERROR(containing_computation->RemoveInstruction(while_instr));
 
   HloInstruction* while_body_param = new_while_body->parameter_instruction(0);
   std::vector<HloInstruction*> live_in_instructions;
diff --git a/tensorflow/compiler/xla/service/while_util.h b/tensorflow/compiler/xla/service/while_util.h
index 322d27b88cae60cb051f5fafdde70e2aafedbc1e..e67636d80f4b682fe1335eae535fb86105ac082b 100644
--- a/tensorflow/compiler/xla/service/while_util.h
+++ b/tensorflow/compiler/xla/service/while_util.h
@@ -38,17 +38,21 @@ class WhileUtil {
   };
 
   // Replaces `while_instr` with a new while instruction that is equivalent to
-  // `while_instr`, except that it has all of the HLO instructions in
+  // `while_instr` except that it has all of the HLO instructions in
   // `instructions` as live-in, loop invariant values.  These new live in values
   // are represented as new elements appended to the parameter of the while
   // loop, which must be of tuple shape.  GetTupleElement instructions computing
   // each new live in value is returned in the `while_body_live_in_values`
   // vector.
   //
-  // Precondition: `while_instr` must have a tuple shaped state.
+  // Deletes `while_instr` after replacing it.
   //
-  // Every instruction in `instructions` must be contained in the computation
-  // that contains `while_instr`.
+  // Preconditions:
+  //
+  //  `while_instr` must have a tuple shaped state.
+  //
+  //   Every instruction in `instructions` must be contained in the computation
+  //   that contains `while_instr`.
   static StatusOr<MakeInstructionsLiveInResult> MakeInstructionsLiveIn(
       HloInstruction* while_instr,
       tensorflow::gtl::ArraySlice<HloInstruction*> instructions);
diff --git a/tensorflow/compiler/xla/service/while_util_test.cc b/tensorflow/compiler/xla/service/while_util_test.cc
index 974bc542a34d0af6d41ed29f36df87f4c164a360..d79d3297213e832306ea4726483b0f215df0f5d3 100644
--- a/tensorflow/compiler/xla/service/while_util_test.cc
+++ b/tensorflow/compiler/xla/service/while_util_test.cc
@@ -16,8 +16,9 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/while_util.h"
 
 #include "tensorflow/compiler/xla/service/hlo_matchers.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/test.h"
-#include "tensorflow/compiler/xla/tools/parser/hlo_parser.h"
+#include "tensorflow/compiler/xla/util.h"
 
 namespace xla {
 namespace {
@@ -49,7 +50,7 @@ ENTRY entry {
 )";
 
   TF_ASSIGN_OR_RETURN(std::unique_ptr<HloModule> module,
-                      tools::Parse(hlo_string));
+                      ParseHloString(hlo_string));
 
   *entry_computation = module->entry_computation();
   *param0 = (*entry_computation)->parameter_instruction(0);
@@ -150,7 +151,7 @@ ENTRY main {
 )";
 
   TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
-                          tools::Parse(hlo_string));
+                          ParseHloString(hlo_string));
 
   HloComputation* while_body = module->GetComputationWithName("body");
 
@@ -163,5 +164,47 @@ ENTRY main {
   ASSERT_EQ(gte_list.size(), 1);
   EXPECT_EQ((*gte_list.begin())->name(), "gte.0");
 }
+
+TEST(WhileUtilTest, AlwaysRemovePreviousWhileBody) {
+  const char* const hlo_string = R"(
+HloModule WhileWithSideEffects
+
+body {
+  param.b = (s32[], s32[]) parameter(0)
+  gte.0 = s32[] get-tuple-element(param.b), index=0
+  gte.1 = s32[] get-tuple-element(param.b), index=1
+  add = s32[] add(gte.0, gte.1)
+  ROOT tuple = (s32[], s32[]) tuple(gte.0, add)
+}
+
+cond {
+  param.c = (s32[], s32[]) parameter(0)
+  ROOT condition = pred[] infeed()
+}
+
+ENTRY main {
+  init = (s32[], s32[]) parameter(0)
+  to_make_live_in = f32[100] parameter(1)
+  ROOT while = (s32[], s32[]) while(init), condition=cond, body=body
+}
+)";
+
+  TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
+                          ParseHloString(hlo_string));
+
+  HloComputation* main = module->GetComputationWithName("main");
+  HloInstruction* while_instr = main->root_instruction();
+  HloInstruction* to_make_live_in = main->parameter_instruction(1);
+
+  TF_ASSERT_OK_AND_ASSIGN(
+      WhileUtil::MakeInstructionsLiveInResult make_live_in_result,
+      WhileUtil::MakeInstructionsLiveIn(while_instr,
+                                        /*instructions=*/{to_make_live_in}));
+
+  auto is_while = [](const HloInstruction* instr) {
+    return instr->opcode() == HloOpcode::kWhile;
+  };
+  EXPECT_EQ(c_count_if(main->instructions(), is_while), 1);
+}
 }  // namespace
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service_interface.h b/tensorflow/compiler/xla/service_interface.h
index 141347a792c23a2c542d7b564ab76c118409865d..14c35e7b84f07bebac33a9753ac26a8ee1418f1e 100644
--- a/tensorflow/compiler/xla/service_interface.h
+++ b/tensorflow/compiler/xla/service_interface.h
@@ -47,41 +47,22 @@ class ServiceInterface {
   virtual Status ResetDevice(const ResetDeviceRequest* arg,
                              ResetDeviceResponse* result) = 0;
 
-  virtual Status LoadComputationSnapshot(
-      const LoadComputationSnapshotRequest* request,
-      LoadComputationSnapshotResponse* result) = 0;
-
-  virtual Status Execute(const ExecuteRequest* arg,
-                         ExecuteResponse* result) = 0;
-
   virtual Status ExecuteGraph(const ExecuteGraphRequest* arg,
                               ExecuteResponse* result) = 0;
 
-  virtual Status ExecuteParallel(const ExecuteParallelRequest* arg,
-                                 ExecuteParallelResponse* result) = 0;
-
   virtual Status ExecuteGraphParallel(const ExecuteGraphParallelRequest* arg,
                                       ExecuteParallelResponse* result) = 0;
 
-  virtual Status ExecuteAsync(const ExecuteAsyncRequest* arg,
-                              ExecuteAsyncResponse* result) = 0;
-
   virtual Status WaitForExecution(const WaitForExecutionRequest* arg,
                                   WaitForExecutionResponse* result) = 0;
 
   virtual Status DeconstructTuple(const DeconstructTupleRequest* arg,
                                   DeconstructTupleResponse* result) = 0;
 
-  virtual Status GetComputationStats(const ComputationStatsRequest* arg,
-                                     ComputationStatsResponse* result) = 0;
-
   virtual Status GetComputationGraphStats(
       const ComputationGraphStatsRequest* arg,
       ComputationStatsResponse* result) = 0;
 
-  virtual Status GetComputationShape(const GetComputationShapeRequest* arg,
-                                     GetComputationShapeResponse* result) = 0;
-
   virtual Status GetShape(const GetShapeRequest* arg,
                           GetShapeResponse* result) = 0;
 
@@ -91,31 +72,9 @@ class ServiceInterface {
   virtual Status GetDeviceHandles(const GetDeviceHandlesRequest* arg,
                                   GetDeviceHandlesResponse* result) = 0;
 
-  // Methods used by ComputationBuilder.
-  virtual Status Computation(const ComputationRequest* arg,
-                             ComputationResponse* result) = 0;
-
-  virtual Status Op(const OpRequest* arg, OpResponse* result) = 0;
-
-  virtual Status GetLocalShape(const GetLocalShapeRequest* arg,
-                               GetLocalShapeResponse* result) = 0;
-
-  virtual Status SetReturnValue(const SetReturnValueRequest* arg,
-                                SetReturnValueResponse* results) = 0;
-
-  virtual Status IsConstant(const IsConstantRequest* arg,
-                            IsConstantResponse* result) = 0;
-
-  virtual Status ComputeConstant(const ComputeConstantRequest* arg,
-                                 ComputeConstantResponse* result) = 0;
-
   virtual Status ComputeConstantGraph(const ComputeConstantGraphRequest* arg,
                                       ComputeConstantResponse* result) = 0;
 
-  // Methods used by Computation.
-  virtual Status SnapshotComputation(const SnapshotComputationRequest* ag,
-                                     SnapshotComputationResponse* result) = 0;
-
   // Methods used by GlobalData.
   virtual Status Unregister(const UnregisterRequest* arg,
                             UnregisterResponse* result) = 0;
diff --git a/tensorflow/compiler/xla/shape_tree.h b/tensorflow/compiler/xla/shape_tree.h
index 37c94ac543b166c14affd8165d244440ae6b67d6..18e54d23c241ae0d4c61d8be79ff021dfb02a3e6 100644
--- a/tensorflow/compiler/xla/shape_tree.h
+++ b/tensorflow/compiler/xla/shape_tree.h
@@ -47,6 +47,9 @@ struct ShapeTreeNode {
   // Children of this node, as indices into the container's nodes_ array.
   std::vector<size_t> children;
 
+  // Tells whether this is a leaf node.
+  bool is_leaf = true;
+
   explicit ShapeTreeNode(ShapeIndex index)
       : ShapeTreeNode(std::move(index), T()) {}
   ShapeTreeNode(ShapeIndex index, T data)
@@ -122,9 +125,7 @@ class ShapeTree {
 
   // Returns true if the node at the given index is a leaf node (an array
   // shape).
-  bool IsLeaf(const ShapeIndex& index) const {
-    return Lookup(index)->children.empty();
-  }
+  bool IsLeaf(const ShapeIndex& index) const { return Lookup(index)->is_leaf; }
 
   ShapeTree(const ShapeTree&) = default;
   ShapeTree& operator=(const ShapeTree&) = default;
@@ -222,6 +223,9 @@ class ShapeTree {
                     /*iterate_leaves_only=*/false);
   }
 
+  // Returns the number of leaf nodes in the tree.
+  int64 leaf_count() const { return std::distance(leaf_begin(), leaf_end()); }
+
   // Recursively traverses the shape and calls the given function at each
   // element. The function has the following arguments:
   //
@@ -308,16 +312,14 @@ class ShapeTreeIterator
       : nodes_(nodes),
         node_(std::move(node)),
         iterate_leaves_only_(iterate_leaves_only) {
-    while (iterate_leaves_only && node_ != nodes_->end() &&
-           !node_->children.empty()) {
+    while (iterate_leaves_only && node_ != nodes_->end() && !node_->is_leaf) {
       ++node_;
     }
   }
 
   ShapeTreeIterator& operator++() {
     ++node_;
-    while (iterate_leaves_only_ && node_ != nodes_->end() &&
-           !node_->children.empty()) {
+    while (iterate_leaves_only_ && node_ != nodes_->end() && !node_->is_leaf) {
       ++node_;
     }
     return *this;
@@ -330,8 +332,7 @@ class ShapeTreeIterator
 
   ShapeTreeIterator& operator--() {
     --node_;
-    while (iterate_leaves_only_ && node_ > nodes_->begin() &&
-           !node_->children.empty()) {
+    while (iterate_leaves_only_ && node_ > nodes_->begin() && !node_->is_leaf) {
       --node_;
     }
     return *this;
@@ -355,7 +356,7 @@ class ShapeTreeIterator
   ContainerType* nodes_;
   IteratorType node_;
   // True if we should not include interior nodes in our walk.
-  bool iterate_leaves_only_;
+  const bool iterate_leaves_only_;
 };
 
 template <typename T>
@@ -376,6 +377,7 @@ void ShapeTree<T>::InitChildren(const Shape& shape, const T& init_value,
   if (ShapeUtil::IsTuple(shape)) {
     const int64 size = ShapeUtil::TupleElementCount(shape);
     node->children.reserve(size);
+    node->is_leaf = false;
     ShapeIndex shape_index = node->data.first;
     shape_index.push_back(0);
     for (int i = 0; i < size; ++i) {
@@ -392,6 +394,7 @@ void ShapeTree<T>::InitChildren(const Shape& shape, Node* node) {
   if (ShapeUtil::IsTuple(shape)) {
     const int64 size = ShapeUtil::TupleElementCount(shape);
     node->children.reserve(size);
+    node->is_leaf = false;
     ShapeIndex shape_index = node->data.first;
     shape_index.push_back(0);
     for (int i = 0; i < size; ++i) {
diff --git a/tensorflow/compiler/xla/shape_tree_test.cc b/tensorflow/compiler/xla/shape_tree_test.cc
index dc5facf1581c07fbb74dfcee95025692938632bd..51de82e95746281ed6e587b545dc933b48ce1ad4 100644
--- a/tensorflow/compiler/xla/shape_tree_test.cc
+++ b/tensorflow/compiler/xla/shape_tree_test.cc
@@ -116,6 +116,11 @@ TEST_F(ShapeTreeTest, InitValueConstructor) {
   TestInitValueConstructor(nested_tuple_shape_, 10);
 }
 
+TEST_F(ShapeTreeTest, EmptyTupleMustHaveNoLeaves) {
+  ShapeTree<int> shape_tree{ShapeUtil::MakeTupleShape({})};
+  EXPECT_EQ(0, shape_tree.leaf_count());
+}
+
 TEST_F(ShapeTreeTest, ArrayShape) {
   ShapeTree<int> shape_tree{array_shape_};
   *shape_tree.mutable_element({}) = 42;
diff --git a/tensorflow/compiler/xla/shape_util.cc b/tensorflow/compiler/xla/shape_util.cc
index 7a897f6f8f99e65285e1be0757a55f703fc81c72..5db66599324913b9214d7623597060950246fb03 100644
--- a/tensorflow/compiler/xla/shape_util.cc
+++ b/tensorflow/compiler/xla/shape_util.cc
@@ -27,7 +27,6 @@ limitations under the License.
 #include "tensorflow/compiler/xla/primitive_util.h"
 #include "tensorflow/compiler/xla/status_macros.h"
 #include "tensorflow/compiler/xla/types.h"
-#include "tensorflow/compiler/xla/util.h"
 #include "tensorflow/core/lib/core/errors.h"
 #include "tensorflow/core/lib/core/stringpiece.h"
 #include "tensorflow/core/lib/gtl/iterator_range.h"
@@ -42,17 +41,35 @@ limitations under the License.
 
 namespace xla {
 
+using ::tensorflow::strings::StrAppend;
+using ::tensorflow::strings::StrCat;
+
 string ShapeIndex::ToString() const {
-  return tensorflow::strings::StrCat(
-      "{", tensorflow::str_util::Join(indices_, ","), "}");
+  return StrCat("{", tensorflow::str_util::Join(indices_, ","), "}");
 }
 
 string ShapeIndexView::ToString() const {
-  return tensorflow::strings::StrCat(
-      "{",
-      tensorflow::str_util::Join(tensorflow::gtl::make_range(begin_, end_),
-                                 ","),
-      "}");
+  return StrCat("{",
+                tensorflow::str_util::Join(
+                    tensorflow::gtl::make_range(begin_, end_), ","),
+                "}");
+}
+
+bool ShapeIndexView::operator==(const ShapeIndexView& other) const {
+  if (size() != other.size()) {
+    return false;
+  }
+  for (auto it = begin(), other_it = other.begin(); it != end();
+       ++it, ++other_it) {
+    if (*it != *other_it) {
+      return false;
+    }
+  }
+  return true;
+}
+
+bool ShapeIndexView::operator!=(const ShapeIndexView& other) const {
+  return !(*this == other);
 }
 
 std::ostream& operator<<(std::ostream& out, const ShapeIndex& shape_index) {
@@ -67,18 +84,30 @@ std::ostream& operator<<(std::ostream& out, const ShapeIndexView& shape_index) {
 
 namespace {
 
+// Returns whether the given primitive type corresponds to an array shape.
+bool IsArrayPrimitiveType(PrimitiveType primitive_type) {
+  return primitive_type != PRIMITIVE_TYPE_INVALID && primitive_type != TUPLE &&
+         primitive_type != OPAQUE && primitive_type != TOKEN;
+}
+
 // Recursive helper for comparing the equality of two shapes. Returns true if
 // the shapes are the same. If compare_layouts is true, then layouts must also
 // match.
 bool CompareShapes(const Shape& lhs, const Shape& rhs, bool compare_layouts) {
-  if (ShapeUtil::IsTuple(lhs) || ShapeUtil::IsTuple(rhs)) {
-    return ShapeUtil::IsTuple(lhs) && ShapeUtil::IsTuple(rhs) &&
-           ContainersEqual(lhs.tuple_shapes(), rhs.tuple_shapes(),
+  if (!ShapeUtil::SameElementType(lhs, rhs)) {
+    VLOG(3) << "CompareShapes: lhs element type != rhs element type";
+    return false;
+  }
+
+  if (ShapeUtil::IsTuple(lhs)) {
+    return ContainersEqual(lhs.tuple_shapes(), rhs.tuple_shapes(),
                            [=](const Shape& l, const Shape& r) {
                              return CompareShapes(l, r, compare_layouts);
                            });
-  } else if (ShapeUtil::IsOpaque(lhs) || ShapeUtil::IsOpaque(rhs)) {
-    return ShapeUtil::IsOpaque(lhs) && ShapeUtil::IsOpaque(rhs);
+  } else if (!ShapeUtil::IsArray(lhs)) {
+    // Non-tuple, non-array tupes such as opaque and token types are trivially
+    // the same.
+    return true;
   }
 
   if (compare_layouts) {
@@ -108,10 +137,6 @@ bool CompareShapes(const Shape& lhs, const Shape& rhs, bool compare_layouts) {
     VLOG(3) << "CompareShapes: lhs dimensions != rhs dimensions";
     return false;
   }
-  if (!ShapeUtil::SameElementType(lhs, rhs)) {
-    VLOG(3) << "CompareShapes: lhs element type != rhs element type";
-    return false;
-  }
   return true;
 }
 
@@ -154,8 +179,8 @@ StatusOr<Shape> MakeShapeWithLayoutInternal(
 }
 
 /* static */ int64 ShapeUtil::Rank(const Shape& shape) {
-  CHECK(!ShapeUtil::IsTuple(shape))
-      << "Tuples do not have a rank, shape: " << shape;
+  CHECK(ShapeUtil::IsArray(shape))
+      << "Non-arrays do not have a rank, shape: " << shape;
   return shape.dimensions_size();
 }
 
@@ -182,8 +207,7 @@ StatusOr<Shape> MakeShapeWithLayoutInternal(
 
 /* static */ Shape ShapeUtil::MakeShape(
     PrimitiveType element_type, tensorflow::gtl::ArraySlice<int64> dimensions) {
-  DCHECK_NE(TUPLE, element_type);
-  DCHECK_NE(OPAQUE, element_type);
+  CHECK(IsArrayPrimitiveType(element_type));
   Shape result;
   PopulateShape(element_type, dimensions, &result);
   return result;
@@ -206,8 +230,7 @@ StatusOr<Shape> MakeShapeWithLayoutInternal(
 /* static */ Shape ShapeUtil::MakeShapeWithSparseLayout(
     PrimitiveType element_type, tensorflow::gtl::ArraySlice<int64> dimensions,
     int64 max_sparse_elements) {
-  DCHECK_NE(TUPLE, element_type);
-  DCHECK_NE(OPAQUE, element_type);
+  CHECK(IsArrayPrimitiveType(element_type));
   Shape shape = ShapeUtil::MakeShape(element_type, dimensions);
   *shape.mutable_layout() = LayoutUtil::MakeSparseLayout(max_sparse_elements);
   TF_DCHECK_OK(ShapeUtil::ValidateShape(shape));
@@ -254,6 +277,13 @@ ShapeUtil::MakeShapeWithDescendingLayoutAndSamePhysicalLayout(
   return result;
 }
 
+/* static */ Shape ShapeUtil::MakeTokenShape() {
+  Shape result;
+  result.set_element_type(TOKEN);
+  TF_DCHECK_OK(ValidateShapeWithOptionalLayout(result));
+  return result;
+}
+
 /* static */ void ShapeUtil::AppendShapeToTuple(const Shape& shape,
                                                 Shape* tuple_shape) {
   TF_DCHECK_OK(ValidateShapeWithOptionalLayout(shape));
@@ -277,7 +307,7 @@ ShapeUtil::MakeShapeWithDescendingLayoutAndSamePhysicalLayout(
 }
 
 /* static */ bool ShapeUtil::ElementHasBitWidth(const Shape& shape, int bits) {
-  if (shape.element_type() == TUPLE || shape.element_type() == OPAQUE) {
+  if (!IsArray(shape)) {
     return false;
   }
   return primitive_util::BitWidth(shape.element_type()) == bits;
@@ -303,6 +333,7 @@ ShapeUtil::MakeShapeWithDescendingLayoutAndSamePhysicalLayout(
     case C64:
     case TUPLE:
     case OPAQUE:
+    case TOKEN:
       return false;
 
     default:
@@ -318,6 +349,10 @@ ShapeUtil::MakeShapeWithDescendingLayoutAndSamePhysicalLayout(
   return primitive_util::IsFloatingPointType(shape.element_type());
 }
 
+/* static */ bool ShapeUtil::IsArray(const Shape& shape) {
+  return IsArrayPrimitiveType(shape.element_type());
+}
+
 /* static */ bool ShapeUtil::IsNestedTuple(const Shape& shape) {
   return IsTuple(shape) && std::any_of(shape.tuple_shapes().begin(),
                                        shape.tuple_shapes().end(), IsTuple);
@@ -371,7 +406,7 @@ ShapeUtil::MakeShapeWithDescendingLayoutAndSamePhysicalLayout(
 }
 
 /* static */ int64 ShapeUtil::ElementsIn(const Shape& shape) {
-  CHECK(!IsTuple(shape)) << ShapeUtil::HumanString(shape);
+  CHECK(IsArray(shape)) << ShapeUtil::HumanString(shape);
   CHECK_EQ(shape.dimensions_size(), Rank(shape));
   return std::accumulate<decltype(shape.dimensions().begin()), int64>(
       shape.dimensions().begin(), shape.dimensions().end(), 1LL,
@@ -386,23 +421,6 @@ ShapeUtil::MakeShapeWithDescendingLayoutAndSamePhysicalLayout(
   return shape.element_type() == F32 && Rank(shape) == 0;
 }
 
-/* static */ string ShapeUtil::HumanString(const Shape& shape) {
-  if (IsTuple(shape)) {
-    string text = "(";
-    const char* prefix = "";
-    for (const Shape& elem_shape : shape.tuple_shapes()) {
-      tensorflow::strings::StrAppend(&text, prefix, HumanString(elem_shape));
-      prefix = ", ";
-    }
-    text += ")";
-    return text;
-  } else {
-    return tensorflow::strings::StrCat(
-        tensorflow::str_util::Lowercase(
-            PrimitiveType_Name(shape.element_type())),
-        "[", tensorflow::str_util::Join(shape.dimensions(), ","), "]");
-  }
-}
 
 namespace {
 
@@ -453,48 +471,56 @@ StatusOr<PrimitiveType> StringToPrimitiveType(const string& name) {
 
 }  // namespace
 
-/* static */ string ShapeUtil::HumanStringWithLayout(const Shape& shape) {
+/* static */ string ShapeUtil::HumanString(const Shape& shape) {
   if (IsTuple(shape)) {
     string text = "(";
     const char* prefix = "";
     for (const Shape& elem_shape : shape.tuple_shapes()) {
-      tensorflow::strings::StrAppend(&text, prefix,
-                                     HumanStringWithLayout(elem_shape));
+      StrAppend(&text, prefix, HumanString(elem_shape));
       prefix = ", ";
     }
     text += ")";
     return text;
-  } else {
-    string result = tensorflow::strings::StrCat(
-        LowercasePrimitiveTypeName(shape.element_type()), "[");
-    for (int i = 0; i < shape.dimensions().size(); i++) {
-      tensorflow::strings::StrAppend(&result, (i > 0) ? "," : "",
-                                     shape.dimensions(i));
+  }
+  return StrCat(LowercasePrimitiveTypeName(shape.element_type()), "[",
+                tensorflow::str_util::Join(shape.dimensions(), ","), "]");
+}
+
+/* static */ string ShapeUtil::HumanStringWithLayout(const Shape& shape) {
+  if (IsTuple(shape)) {
+    string text = "(";
+    const char* prefix = "";
+    for (const Shape& elem_shape : shape.tuple_shapes()) {
+      StrAppend(&text, prefix, HumanStringWithLayout(elem_shape));
+      prefix = ", ";
     }
-    result += "]";
-    if (!IsScalar(shape) && !IsOpaque(shape)) {
-      if (LayoutUtil::HasLayout(shape)) {
-        tensorflow::strings::StrAppend(&result,
-                                       LayoutUtil::HumanString(shape.layout()));
-      }
+    text += ")";
+    return text;
+  }
+  string result = StrCat(LowercasePrimitiveTypeName(shape.element_type()), "[");
+  for (int i = 0; i < shape.dimensions().size(); i++) {
+    StrAppend(&result, (i > 0) ? "," : "", shape.dimensions(i));
+  }
+  result += "]";
+  if (!IsScalar(shape) && IsArray(shape)) {
+    if (LayoutUtil::HasLayout(shape)) {
+      StrAppend(&result, LayoutUtil::HumanString(shape.layout()));
     }
-    return result;
   }
+  return result;
 }
 
 /* static */ string ShapeUtil::HumanString(const ProgramShape& program_shape) {
   std::vector<string> parameters;
   for (auto& shape : program_shape.parameters()) {
     const int i = parameters.size();
-    parameters.push_back(
-        tensorflow::strings::StrCat(i < program_shape.parameter_names_size()
-                                        ? program_shape.parameter_names(i)
-                                        : "(unknown)",
-                                    ": ", HumanString(shape)));
+    parameters.push_back(StrCat(i < program_shape.parameter_names_size()
+                                    ? program_shape.parameter_names(i)
+                                    : "(unknown)",
+                                ": ", HumanString(shape)));
   }
-  return tensorflow::strings::StrCat(
-      "(", tensorflow::str_util::Join(parameters, ", "), ") -> ",
-      HumanString(program_shape.result()));
+  return StrCat("(", tensorflow::str_util::Join(parameters, ", "), ") -> ",
+                HumanString(program_shape.result()));
 }
 
 namespace {
@@ -564,14 +590,17 @@ StatusOr<Shape> ParseShapeStringInternal(tensorflow::StringPiece* s) {
     // Extract the primitive element type.
     TF_ASSIGN_OR_RETURN(const PrimitiveType primitive_type,
                         StringToPrimitiveType(element_type_string));
-    if (primitive_type == PRIMITIVE_TYPE_INVALID || primitive_type == TUPLE ||
-        primitive_type == OPAQUE) {
+    if (primitive_type == PRIMITIVE_TYPE_INVALID || primitive_type == TUPLE) {
       return InvalidArgument("Invalid element type string: \"%s\".",
                              element_type_string.c_str());
     }
 
     Shape result;
-    if (format_string.empty() && layout_string.empty()) {
+    if (primitive_type == OPAQUE) {
+      result = ShapeUtil::MakeOpaqueShape();
+    } else if (primitive_type == TOKEN) {
+      result = ShapeUtil::MakeTokenShape();
+    } else if (format_string.empty() && layout_string.empty()) {
       // Create a shape without a layout set.
       result = ShapeUtil::MakeShape(primitive_type, dimensions);
     } else if (format_string == "sparse") {
@@ -616,43 +645,44 @@ StatusOr<Shape> ParseShapeStringInternal(tensorflow::StringPiece* s) {
 }
 
 /* static */ bool ShapeUtil::Compatible(const Shape& lhs, const Shape& rhs) {
-  if (lhs.element_type() == TUPLE) {
+  if (IsArray(lhs)) {
+    return SameElementType(lhs, rhs) && SameDimensions(lhs, rhs);
+  } else if (lhs.element_type() == TUPLE) {
     return rhs.element_type() == TUPLE &&
            ContainersEqual(lhs.tuple_shapes(), rhs.tuple_shapes(), Compatible);
+  } else {
+    // Opaque, token, etc types are vacuously compatible.
+    return true;
   }
-  if (lhs.element_type() == OPAQUE) {
-    return rhs.element_type() == OPAQUE;
-  }
-  return SameElementType(lhs, rhs) && SameDimensions(lhs, rhs);
 }
 
 /* static */ bool ShapeUtil::CompatibleIgnoringElementType(const Shape& lhs,
                                                            const Shape& rhs) {
-  if (lhs.element_type() == TUPLE) {
+  if (IsArray(lhs)) {
+    return IsArray(rhs) && SameDimensions(lhs, rhs);
+  } else if (lhs.element_type() == TUPLE) {
     return rhs.element_type() == TUPLE &&
            ContainersEqual(lhs.tuple_shapes(), rhs.tuple_shapes(),
                            CompatibleIgnoringElementType);
+  } else {
+    // Opaque, token, etc types are vacuously compatible.
+    return true;
   }
-  if (lhs.element_type() == OPAQUE) {
-    return rhs.element_type() == OPAQUE;
-  }
-  return ShapeUtil::IsArray(rhs) && SameDimensions(lhs, rhs);
 }
 
 /* static */ bool ShapeUtil::CompatibleIgnoringFpPrecision(const Shape& lhs,
                                                            const Shape& rhs) {
-  if (lhs.element_type() == TUPLE) {
+  if (IsArray(lhs)) {
+    return IsArray(rhs) && SameElementTypeIgnoringFpPrecision(lhs, rhs) &&
+           CompatibleIgnoringElementType(lhs, rhs);
+  } else if (lhs.element_type() == TUPLE) {
     return rhs.element_type() == TUPLE &&
            ContainersEqual(lhs.tuple_shapes(), rhs.tuple_shapes(),
                            CompatibleIgnoringFpPrecision);
+  } else {
+    // Opaque, token, etc types are vacuously compatible.
+    return true;
   }
-  if (lhs.element_type() == OPAQUE) {
-    return rhs.element_type() == OPAQUE;
-  }
-  if (SameElementTypeIgnoringFpPrecision(lhs, rhs)) {
-    return CompatibleIgnoringElementType(lhs, rhs);
-  }
-  return false;
 }
 
 /* static */ int64 ShapeUtil::GetDimension(const Shape& shape,
@@ -674,10 +704,6 @@ StatusOr<Shape> ParseShapeStringInternal(tensorflow::StringPiece* s) {
   switch (primitive_type) {
     case PRED:
       return sizeof(int8);
-    case TUPLE:
-      LOG(FATAL) << "tuples have no definitive size";
-    case OPAQUE:
-      LOG(FATAL) << "opaque have no definitive size";
     case S8:
       return sizeof(int8);
     case S16:
@@ -704,6 +730,13 @@ StatusOr<Shape> ParseShapeStringInternal(tensorflow::StringPiece* s) {
       return sizeof(double);
     case C64:
       return sizeof(complex64);
+    case TOKEN:
+      // Tokens require no space.
+      return 0;
+    case TUPLE:
+    case OPAQUE:
+      LOG(FATAL) << PrimitiveType_Name(primitive_type)
+                 << " primitive type has no definitive size";
     default:
       LOG(FATAL) << "Unhandled primitive type " << primitive_type;
   }
@@ -712,28 +745,32 @@ StatusOr<Shape> ParseShapeStringInternal(tensorflow::StringPiece* s) {
 /* static */ int64 ShapeUtil::ByteSizeOf(const Shape& shape,
                                          int64 pointer_size) {
   TF_DCHECK_OK(ValidateShape(shape));
-  DCHECK_NE(OPAQUE, shape.element_type());
   if (shape.element_type() == TUPLE) {
     return ByteSizeOfTupleIndexTable(shape, pointer_size);
+  } else if (IsArray(shape)) {
+    int64 byte_size = ByteSizeOfElements(shape);
+    if (LayoutUtil::IsSparseArray(shape)) {
+      byte_size += ByteSizeOfSparseIndices(shape);
+    }
+    return byte_size;
+  } else if (shape.element_type() == TOKEN) {
+    return 0;
   }
-  int64 byte_size = ByteSizeOfElements(shape);
-  if (LayoutUtil::IsSparseArray(shape)) {
-    byte_size += ByteSizeOfSparseIndices(shape);
-  }
-  return byte_size;
+  LOG(FATAL) << PrimitiveType_Name(shape.element_type())
+             << " primitive type has no definitive size";
 }
 
 /* static */ int64 ShapeUtil::ByteSizeOfTupleIndexTable(const Shape& shape,
                                                         int64 pointer_size) {
   TF_DCHECK_OK(ValidateShape(shape));
-  DCHECK_EQ(TUPLE, shape.element_type());
+  CHECK_EQ(TUPLE, shape.element_type());
   CHECK_GT(pointer_size, 0);
   return pointer_size * shape.tuple_shapes_size();
 }
 
 /* static */ int64 ShapeUtil::ByteSizeOfElements(const Shape& shape) {
   TF_DCHECK_OK(ValidateShape(shape));
-  DCHECK(ShapeUtil::IsArray(shape));
+  CHECK(ShapeUtil::IsArray(shape));
   int64 allocated_element_count;
 
   if (LayoutUtil::IsSparseArray(shape)) {
@@ -758,13 +795,17 @@ StatusOr<Shape> ParseShapeStringInternal(tensorflow::StringPiece* s) {
 
 /* static */ int64 ShapeUtil::ByteSizeOfSparseIndices(const Shape& shape) {
   TF_DCHECK_OK(ValidateShape(shape));
-  DCHECK(LayoutUtil::IsSparseArray(shape));
+  CHECK(LayoutUtil::IsSparseArray(shape));
   return LayoutUtil::MaxSparseElements(shape.layout()) *
          ShapeUtil::Rank(shape) * sizeof(int64);
 }
 
 /* static */ Status ShapeUtil::ValidateShapeWithOptionalLayoutInternal(
     const Shape& shape) {
+  if (shape.element_type() == PRIMITIVE_TYPE_INVALID) {
+    return InvalidArgument("shape has invalid element type: %s",
+                           shape.ShortDebugString().c_str());
+  }
   if (shape.element_type() == TUPLE) {
     if (shape.dimensions_size() != 0) {
       return InvalidArgument("tuples must not have dimensions specified");
@@ -780,10 +821,24 @@ StatusOr<Shape> ParseShapeStringInternal(tensorflow::StringPiece* s) {
   if (shape.tuple_shapes_size() > 0) {
     return InvalidArgument("non-tuple shape has tuple_shapes field");
   }
-  if (shape.element_type() == PRIMITIVE_TYPE_INVALID) {
-    return InvalidArgument("shape has invalid element type: %s",
-                           shape.ShortDebugString().c_str());
+
+  // Tokens and opaques can should not have layout or dimensions.
+  if (shape.element_type() == TOKEN || shape.element_type() == OPAQUE) {
+    if (shape.dimensions_size() != 0) {
+      return InvalidArgument(
+          "shape has %s element type, but has dimensions field: %s",
+          LowercasePrimitiveTypeName(shape.element_type()).c_str(),
+          shape.ShortDebugString().c_str());
+    }
+    if (shape.has_layout()) {
+      return InvalidArgument(
+          "shape has %s element type, but has layout field: %s",
+          LowercasePrimitiveTypeName(shape.element_type()).c_str(),
+          shape.ShortDebugString().c_str());
+    }
+    return Status::OK();
   }
+
   if (Rank(shape) != shape.dimensions_size()) {
     return InvalidArgument(
         "shape's rank is mismatched with dimension count; rank=%lld "
@@ -863,64 +918,25 @@ bool ShapeUtil::IsLeafIndex(const Shape& shape, const ShapeIndex& index) {
   return !IsTuple(GetSubshape(shape, index));
 }
 
-/* static */ Shape ShapeUtil::StripDegenerateDimensions(const Shape& shape) {
-  std::vector<int64> dimension_sizes;
-  std::vector<int64> degenerate_dimensions;
-  for (int64 i = 0; i < shape.dimensions_size(); ++i) {
-    if (shape.dimensions(i) == 1) {
-      degenerate_dimensions.push_back(i);
-    } else {
-      dimension_sizes.push_back(shape.dimensions(i));
+/* static */ int64 ShapeUtil::GetLeafCount(const Shape& shape) {
+  int64 count = 0;
+  ForEachSubshape(shape, [&](const Shape&, const ShapeIndex& index) {
+    if (IsLeafIndex(shape, index)) {
+      ++count;
     }
-  }
-
-  // Construct minor_to_major of stripped shape. The order of the non-degenerate
-  // dimensions should be preserved from the original shape. First, create
-  // vector of the non-degenerate dimensions from the original minor_to_major
-  // array.
-  std::vector<int64> minor_to_major;
-  for (int64 i : shape.layout().minor_to_major()) {
-    if (std::find(degenerate_dimensions.begin(), degenerate_dimensions.end(),
-                  i) == degenerate_dimensions.end()) {
-      minor_to_major.push_back(i);
-    }
-  }
+  });
+  return count;
+}
 
-  // The dimensions in minor_to_major need to be renumbered to account for the
-  // degenerate dimensions which have removed. Decrement each dimension number
-  // once for each degenerate dimension which has a smaller number.
-  for (int i = 0; i < minor_to_major.size(); ++i) {
-    int adjustment = 0;
-    for (int64 dim : degenerate_dimensions) {
-      if (minor_to_major[i] > dim) {
-        adjustment++;
-      }
+/* static */ std::vector<ShapeUtil::IndexedShape> ShapeUtil::GetLeafShapes(
+    const Shape& shape) {
+  std::vector<IndexedShape> leaves;
+  ForEachSubshape(shape, [&](const Shape& sub_shape, const ShapeIndex& index) {
+    if (IsLeafIndex(shape, index)) {
+      leaves.emplace_back(index, sub_shape);
     }
-    minor_to_major[i] -= adjustment;
-  }
-
-  {
-    std::vector<int64> dims(minor_to_major.size());
-    std::iota(dims.begin(), dims.end(), 0);
-    DCHECK(minor_to_major.size() == dims.size() &&
-           std::is_permutation(minor_to_major.begin(), minor_to_major.end(),
-                               dims.begin()));
-  }
-  Shape stripped_shape;
-  if (LayoutUtil::IsDenseArray(shape)) {
-    stripped_shape = MakeShapeWithLayout(shape.element_type(), dimension_sizes,
-                                         minor_to_major);
-  } else if (LayoutUtil::IsSparseArray(shape)) {
-    stripped_shape =
-        MakeShapeWithSparseLayout(shape.element_type(), dimension_sizes,
-                                  shape.layout().max_sparse_elements());
-  } else {
-    stripped_shape = MakeShape(shape.element_type(), dimension_sizes);
-  }
-
-  VLOG(10) << "Original_shape: " << HumanStringWithLayout(shape);
-  VLOG(10) << "Stripped_shape: " << HumanStringWithLayout(stripped_shape);
-  return stripped_shape;
+  });
+  return leaves;
 }
 
 namespace {
@@ -1028,6 +1044,9 @@ Status ForEachMutableSubshapeHelper(
 /* static */ std::tuple<bool, std::vector<int64>, std::vector<int64>>
 ShapeUtil::InsertedOrDeleted1SizedDimensions(const Shape& shape_pre,
                                              const Shape& shape_post) {
+  CHECK(IsArray(shape_pre));
+  CHECK(IsArray(shape_post));
+
   auto nil = std::make_tuple(false, std::vector<int64>(), std::vector<int64>());
 
   std::vector<int64> deleted_indices;
@@ -1085,6 +1104,9 @@ ShapeUtil::InsertedOrDeleted1SizedDimensions(const Shape& shape_pre,
 /* static */ std::vector<std::pair<int64, int64>>
 ShapeUtil::DimensionsUnmodifiedByReshape(const Shape& input_shape,
                                          const Shape& output_shape) {
+  CHECK(IsArray(input_shape));
+  CHECK(IsArray(output_shape));
+
   // Unmodified dimensions are merely common factors of rank 1.
   auto common_factors = CommonFactors(AsInt64Slice(input_shape.dimensions()),
                                       AsInt64Slice(output_shape.dimensions()));
@@ -1138,8 +1160,10 @@ ShapeUtil::DimensionsUnmodifiedByReshape(const Shape& input_shape,
 
 /* static */ bool ShapeUtil::ReshapeIsBitcast(const Shape& input_shape,
                                               const Shape& output_shape) {
-  CHECK(LayoutUtil::HasLayout(input_shape) &&
-        LayoutUtil::HasLayout(output_shape));
+  CHECK(IsArray(input_shape));
+  CHECK(IsArray(output_shape));
+  CHECK(LayoutUtil::HasLayout(input_shape));
+  CHECK(LayoutUtil::HasLayout(output_shape));
 
   if (!SameElementType(input_shape, output_shape)) {
     return false;
@@ -1301,6 +1325,9 @@ ShapeUtil::DimensionsUnmodifiedByReshape(const Shape& input_shape,
 
 /* static */ tensorflow::gtl::optional<Shape> ShapeUtil::AlignLayouts(
     const Shape& input_shape, const Shape& output_shape) {
+  CHECK(IsArray(input_shape));
+  CHECK(IsArray(output_shape));
+
   int64 input_rank = Rank(input_shape);
   int64 output_rank = Rank(output_shape);
 
@@ -1435,6 +1462,7 @@ ShapeUtil::DimensionsUnmodifiedByReshape(const Shape& input_shape,
 
 /* static */ Shape ShapeUtil::DeleteDimension(int64 dim_to_delete,
                                               Shape shape) {
+  CHECK(IsArray(shape));
   shape.mutable_dimensions()->erase(shape.dimensions().begin() + dim_to_delete);
   if (LayoutUtil::HasLayout(shape)) {
     Layout* layout = shape.mutable_layout();
@@ -1456,6 +1484,7 @@ ShapeUtil::DimensionsUnmodifiedByReshape(const Shape& input_shape,
 
 /* static */ Shape ShapeUtil::FilterDimensions(
     const std::function<bool(int64)>& p, Shape shape) {
+  CHECK(IsArray(shape));
   std::vector<int64> dims_to_delete;
   for (int64 i = shape.dimensions().size() - 1; i >= 0; --i) {
     if (!p(i)) {
diff --git a/tensorflow/compiler/xla/shape_util.h b/tensorflow/compiler/xla/shape_util.h
index 82c75f85d838f94cb040e56d59d0e012af5e0db0..ae2d17d6bbbfed96e1da192253838ae5e9a67e17 100644
--- a/tensorflow/compiler/xla/shape_util.h
+++ b/tensorflow/compiler/xla/shape_util.h
@@ -27,6 +27,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/status_macros.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/types.h"
+#include "tensorflow/compiler/xla/util.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
 #include "tensorflow/core/lib/core/threadpool.h"
 #include "tensorflow/core/lib/gtl/array_slice.h"
@@ -61,6 +62,8 @@ class ShapeIndex {
  public:
   ShapeIndex() = default;
   ShapeIndex(std::initializer_list<int64> init) : indices_(init) {}
+  template <typename InputIt>
+  ShapeIndex(InputIt start, InputIt end) : indices_(start, end) {}
 
   bool empty() const { return indices_.empty(); }
   size_t size() const { return indices_.size(); }
@@ -131,6 +134,10 @@ class ShapeIndexView {
     ++new_begin;
     return ShapeIndexView(new_begin, end_);
   }
+  ShapeIndex ToShapeIndex() const { return ShapeIndex(begin_, end_); }
+
+  bool operator==(const ShapeIndexView& other) const;
+  bool operator!=(const ShapeIndexView& other) const;
 
   string ToString() const;
 
@@ -150,12 +157,22 @@ std::ostream& operator<<(std::ostream& out, const ShapeIndexView& shape_index);
 // properties, which do invariant checks before / after the operation.
 class ShapeUtil {
  public:
+  // Data structure which describes the coordinates and the shape, of a tuple
+  // shaped sub-shape.
+  struct IndexedShape {
+    IndexedShape() = default;
+    IndexedShape(ShapeIndex index, Shape shape)
+        : index(std::move(index)), shape(std::move(shape)) {}
+    ShapeIndex index;
+    Shape shape;
+  };
+
   // Returns the number of elements are contained within the provided shape;
   // e.g. for rank 0 (scalars) the result is always 1. Note that sparse shapes
   // may not actually be able to store this number of elements. See
   // LayoutUtil::MaxSparseElements(shape) to obtain the maximum number of
   // elements that can be stored in a sparse shape.
-  // Precondition: !IsTuple(shape)
+  // Precondition: IsArray(shape)
   static int64 ElementsIn(const Shape& shape);
 
   // Returns true if 'shape' has zero elements.
@@ -166,13 +183,11 @@ class ShapeUtil {
   // shapes. This includes only the size of the top-level buffer. For example, a
   // tuple is stored as an array of pointers to other buffers. In this case,
   // this method only returns the size of the pointer array.
-  // Precondition: (!ShapeUtil::IsTuple(shape) || pointer_size > 0) &&
-  //               !ShapeUtil::IsOpaque(shape)
   static int64 ByteSizeOf(const Shape& shape, int64 pointer_size = -1);
 
   // Returns the number of bytes used to store the primitive_type.
   //
-  // Precondition: !ShapeUtil::IsOpaque(shape) && !ShapeUtil::IsTuple(shape)
+  // Precondition: ShapeUtil::IsArray(shape)
   static int64 ByteSizeOfPrimitiveType(PrimitiveType primitive_type);
 
   // Returns the number of bytes required to store the tuple member pointers for
@@ -231,7 +246,7 @@ class ShapeUtil {
   }
 
   // Returns the higher-precision element type if a and b are both floating
-  // point types; otherwise, checks that they have the same element type
+  // point types; otherwise, checks that that they have the same element type
   // and returns it.
   static PrimitiveType HigherPrecisionElementType(const Shape& a,
                                                   const Shape& b) {
@@ -279,10 +294,10 @@ class ShapeUtil {
   // Scalar-specific
 
   static bool IsScalar(const Shape& shape) {
-    return !IsTuple(shape) && !IsOpaque(shape) && Rank(shape) == 0;
+    return IsArray(shape) && Rank(shape) == 0;
   }
   static bool IsEffectiveScalar(const Shape& shape) {
-    return !IsTuple(shape) && !IsOpaque(shape) && TrueRank(shape) == 0;
+    return IsArray(shape) && TrueRank(shape) == 0;
   }
   static bool IsScalarF32(const Shape& shape);
 
@@ -311,6 +326,10 @@ class ShapeUtil {
   // into a custom operation.
   static Shape MakeOpaqueShape();
 
+  // Creates a token shape. Values of this shape are used for ordering
+  // side-effecting operations.
+  static Shape MakeTokenShape();
+
   // Appends a shape to the given tuple.
   static void AppendShapeToTuple(const Shape& shape, Shape* tuple_shape);
 
@@ -410,11 +429,15 @@ class ShapeUtil {
     return shape.element_type() == OPAQUE;
   }
 
+  // Returns whether the shape is an token value used for ordering
+  // side-effecting operations.
+  static bool IsToken(const Shape& shape) {
+    return shape.element_type() == TOKEN;
+  }
+
   // Returns whether the shape is an array.  Note that scalars are considered
   // arrays.
-  static bool IsArray(const Shape& shape) {
-    return !IsTuple(shape) && !IsOpaque(shape);
-  }
+  static bool IsArray(const Shape& shape);
 
   // Returns whether the shape is a tuple with at least one element which is
   // also a tuple.
@@ -461,6 +484,13 @@ class ShapeUtil {
   // shape.
   static bool IsLeafIndex(const Shape& shape, const ShapeIndex& index);
 
+  // Returns the number of leaves in the shape.
+  static int64 GetLeafCount(const Shape& shape);
+
+  // Retrieves all the leaf shapes and their indexes, in the order walked by
+  // the ForEachSubshape() API.
+  static std::vector<IndexedShape> GetLeafShapes(const Shape& shape);
+
   // Calls the given visitor function for each subshape of the given shape.
   // Subshapes are visited in DFS pre-order starting with the entire shape
   // (index {}).
@@ -483,26 +513,6 @@ class ShapeUtil {
   static Status ForEachMutableSubshapeWithStatus(
       Shape* shape, const MutatingStatusVisitorFunction& func);
 
-  // Removes all degenerate dimensions (size one) from the given shape. The
-  // stripped minor_to_major preserves the relative ordering of non-degenerate
-  // dimensions. The stripped shape has the property that the underlying
-  // representation (bits in memory) for the stripped shape is the same as the
-  // original shape modulo padding. Examples:
-  //
-  // input shape:    F32 [1, 2, 1], minor_to_major = {0, 1, 2}
-  // stripped shape: F32 [2], minor_to_major = {0}
-  //
-  // input shape:    F32 [6, 1, 5], minor_to_major = {2, 0, 1}
-  // stripped shape: F32 [6, 5], minor_to_major = {1, 0}
-  //
-  // input shape:    F32 [1, 7, 1, 6, 5, 1], minor_to_major = {0, 2, 5, 4, 3, 1}
-  // stripped shape: F32 [7, 6, 5], minor_to_major = {0, 2, 1}
-  //
-  // input shape:    F32 [1, 1], minor_to_major = {0, 1}
-  // stripped shape: F32 [], minor_to_major = {}
-  // Precondition: !ShapeUtil::IsOpaque(shape) && !ShapeUtil::IsTuple(shape)
-  static Shape StripDegenerateDimensions(const Shape& shape);
-
   // Permutes the dimensions by the given permutation, so
   // return_value.dimensions[permutation[i]] = argument.dimensions[i]
   static Shape PermuteDimensions(tensorflow::gtl::ArraySlice<int64> permutation,
@@ -626,6 +636,28 @@ class ShapeUtil {
         .IgnoreError();
   }
 
+  // These convenience wrappers don't take `base`, `count` and `incr`
+  // explicitly, but iterate over every element in `shape` instead.
+
+  template <typename FnType>
+  static Status ForEachIndexWithStatus(const Shape& shape,
+                                       const FnType& visitor_function) {
+    std::vector<int64> base(shape.dimensions_size());
+    std::vector<int64> incr(shape.dimensions_size(), 1);
+    return ForEachIndexWithStatus(shape, base,
+                                  /*count=*/AsInt64Slice(shape.dimensions()),
+                                  incr, visitor_function);
+  }
+
+  template <typename FnType>
+  static void ForEachIndex(const Shape& shape, const FnType& visitor_function) {
+    ForEachIndexWithStatus(shape,
+                           [&](tensorflow::gtl::ArraySlice<int64> indices) {
+                             return StatusOr<bool>(visitor_function(indices));
+                           })
+        .IgnoreError();
+  }
+
   // A parallel version of ForEachIndex(WithStatus). This can only be used if
   // the visitor_function is thread-safe and the order of iteration does not
   // matter.
diff --git a/tensorflow/compiler/xla/shape_util_test.cc b/tensorflow/compiler/xla/shape_util_test.cc
index f7675e97da7b061bde063e5093256c2288f99c98..0ff514564bdb27b7afa4cf99b0d727f2c029a5ae 100644
--- a/tensorflow/compiler/xla/shape_util_test.cc
+++ b/tensorflow/compiler/xla/shape_util_test.cc
@@ -93,12 +93,14 @@ TEST(ShapeUtilTest, ParseShapeStringTupleOfArrays) {
 }
 
 TEST(ShapeUtilTest, ParseShapeStringNestedTuple) {
-  string shape_string = "(f32[1],(f32[2]), f32[3])";
+  string shape_string = "(f32[1],(f32[2], token[]), opaque[], f32[3])";
   TF_ASSERT_OK_AND_ASSIGN(Shape actual,
                           ShapeUtil::ParseShapeString(shape_string));
   Shape expected = ShapeUtil::MakeTupleShape({
       ShapeUtil::MakeShape(F32, {1}),
-      ShapeUtil::MakeTupleShape({ShapeUtil::MakeShape(F32, {2})}),
+      ShapeUtil::MakeTupleShape(
+          {ShapeUtil::MakeShape(F32, {2}), ShapeUtil::MakeTokenShape()}),
+      ShapeUtil::MakeOpaqueShape(),
       ShapeUtil::MakeShape(F32, {3}),
   });
   ASSERT_TRUE(ShapeUtil::Equal(expected, actual))
@@ -136,6 +138,23 @@ TEST(ShapeUtilTest, ParseShapeStringWithSparseLayout) {
       << "actual: " << ShapeUtil::HumanString(actual);
 }
 
+TEST(ShapeUtilTest, ParseOpaqueType) {
+  TF_ASSERT_OK_AND_ASSIGN(Shape actual,
+                          ShapeUtil::ParseShapeString("opaque[]"));
+  Shape expected = ShapeUtil::MakeOpaqueShape();
+  ASSERT_TRUE(ShapeUtil::Equal(expected, actual))
+      << "expected: " << ShapeUtil::HumanString(expected)
+      << "actual:   " << ShapeUtil::HumanString(actual);
+}
+
+TEST(ShapeUtilTest, ParseTokenType) {
+  TF_ASSERT_OK_AND_ASSIGN(Shape actual, ShapeUtil::ParseShapeString("token[]"));
+  Shape expected = ShapeUtil::MakeTokenShape();
+  ASSERT_TRUE(ShapeUtil::Equal(expected, actual))
+      << "expected: " << ShapeUtil::HumanString(expected)
+      << "actual:   " << ShapeUtil::HumanString(actual);
+}
+
 TEST(ShapeUtilTest, ParseInvalidShapeString) {
   string shape_strings[] = {
       "f32[123,456]foobar{0,1}", "f32[123,456]sparse{0,1}", "f32[123,456]{foo}",
@@ -295,6 +314,9 @@ TEST(ShapeUtilTest, ByteSizeOfWithoutPadding) {
   EXPECT_EQ(8, ShapeUtil::ByteSizeOfPrimitiveType(C64));
   EXPECT_EQ(8, ShapeUtil::ByteSizeOf(ShapeUtil::MakeShape(C64, {})));
   EXPECT_EQ(1600, ShapeUtil::ByteSizeOf(ShapeUtil::MakeShape(C64, {10, 20})));
+
+  EXPECT_EQ(0, ShapeUtil::ByteSizeOfPrimitiveType(TOKEN));
+  EXPECT_EQ(0, ShapeUtil::ByteSizeOf(ShapeUtil::MakeTokenShape()));
 }
 
 TEST(ShapeUtilTest, ByteSizeOfWithPadding) {
@@ -449,19 +471,21 @@ TEST(ShapeUtilTest, IsLeafIndex) {
 
 TEST(ShapeUtilTest, HumanString) {
   Shape opaque = ShapeUtil::MakeOpaqueShape();
+  Shape token = ShapeUtil::MakeTokenShape();
   Shape scalar = ShapeUtil::MakeShape(F32, {});
   Shape matrix = ShapeUtil::MakeShape(U32, {1, 2});
   Shape matrix2 = ShapeUtil::MakeShapeWithLayout(S32, {3, 4}, {0, 1});
   Shape tuple = ShapeUtil::MakeTupleShape({opaque, scalar, matrix, matrix2});
-  Shape nested_tuple = ShapeUtil::MakeTupleShape({tuple, matrix});
+  Shape nested_tuple = ShapeUtil::MakeTupleShape({tuple, matrix, token});
 
   EXPECT_EQ("opaque[]", ShapeUtil::HumanString(opaque));
+  EXPECT_EQ("token[]", ShapeUtil::HumanString(token));
   EXPECT_EQ("f32[]", ShapeUtil::HumanString(scalar));
   EXPECT_EQ("u32[1,2]", ShapeUtil::HumanString(matrix));
   EXPECT_EQ("s32[3,4]", ShapeUtil::HumanString(matrix2));
   EXPECT_EQ("(opaque[], f32[], u32[1,2], s32[3,4])",
             ShapeUtil::HumanString(tuple));
-  EXPECT_EQ("((opaque[], f32[], u32[1,2], s32[3,4]), u32[1,2])",
+  EXPECT_EQ("((opaque[], f32[], u32[1,2], s32[3,4]), u32[1,2], token[])",
             ShapeUtil::HumanString(nested_tuple));
 
   EXPECT_EQ("opaque[]", ShapeUtil::HumanStringWithLayout(opaque));
@@ -470,8 +494,10 @@ TEST(ShapeUtilTest, HumanString) {
   EXPECT_EQ("s32[3,4]{0,1}", ShapeUtil::HumanStringWithLayout(matrix2));
   EXPECT_EQ("(opaque[], f32[], u32[1,2]{1,0}, s32[3,4]{0,1})",
             ShapeUtil::HumanStringWithLayout(tuple));
-  EXPECT_EQ("((opaque[], f32[], u32[1,2]{1,0}, s32[3,4]{0,1}), u32[1,2]{1,0})",
-            ShapeUtil::HumanStringWithLayout(nested_tuple));
+  EXPECT_EQ(
+      "((opaque[], f32[], u32[1,2]{1,0}, s32[3,4]{0,1}), u32[1,2]{1,0}, "
+      "token[])",
+      ShapeUtil::HumanStringWithLayout(nested_tuple));
 
   ProgramShape prog = ShapeUtil::MakeProgramShape(
       {opaque, scalar, matrix, matrix2, tuple, nested_tuple}, nested_tuple);
@@ -481,8 +507,9 @@ TEST(ShapeUtilTest, HumanString) {
       "(unknown): u32[1,2], "
       "(unknown): s32[3,4], "
       "(unknown): (opaque[], f32[], u32[1,2], s32[3,4]), "
-      "(unknown): ((opaque[], f32[], u32[1,2], s32[3,4]), u32[1,2])) -> "
-      "((opaque[], f32[], u32[1,2], s32[3,4]), u32[1,2])",
+      "(unknown): ((opaque[], f32[], u32[1,2], s32[3,4]), u32[1,2], token[])) "
+      "-> "
+      "((opaque[], f32[], u32[1,2], s32[3,4]), u32[1,2], token[])",
       ShapeUtil::HumanString(prog));
 
   prog.add_parameter_names("arg0");
@@ -497,8 +524,10 @@ TEST(ShapeUtilTest, HumanString) {
       "matrix: u32[1,2], "
       "matrix2: s32[3,4], "
       "tuple: (opaque[], f32[], u32[1,2], s32[3,4]), "
-      "nested_tuple: ((opaque[], f32[], u32[1,2], s32[3,4]), u32[1,2])) -> "
-      "((opaque[], f32[], u32[1,2], s32[3,4]), u32[1,2])",
+      "nested_tuple: ((opaque[], f32[], u32[1,2], s32[3,4]), u32[1,2], "
+      "token[])) "
+      "-> "
+      "((opaque[], f32[], u32[1,2], s32[3,4]), u32[1,2], token[])",
       ShapeUtil::HumanString(prog));
 }
 
@@ -713,16 +742,6 @@ TEST(ShapeUtilTest, ReshapeIsBitcast_3x2x2_6x2_Dim1IsMostMinor) {
       ShapeUtil::MakeShapeWithLayout(F32, {6, 2}, {0, 1})));
 }
 
-TEST(ShapeUtilTest, StripDegenerateDimensions) {
-  EXPECT_TRUE(ShapeUtil::Equal(ShapeUtil::StripDegenerateDimensions(
-                                   ShapeUtil::MakeShape(F32, {3, 1, 2})),
-                               ShapeUtil::MakeShape(F32, {3, 2})));
-  EXPECT_TRUE(ShapeUtil::Equal(
-      ShapeUtil::StripDegenerateDimensions(
-          ShapeUtil::MakeShapeWithSparseLayout(F32, {3, 1, 2}, 10)),
-      ShapeUtil::MakeShapeWithSparseLayout(F32, {3, 2}, 10)));
-}
-
 TEST(AlgebraicSimplifierTest, ReshapeIsBitcast_3x2x2_6x2_Dim0IsMostMinor) {
   EXPECT_FALSE(ShapeUtil::ReshapeIsBitcast(
       ShapeUtil::MakeShapeWithLayout(F32, {3, 2, 2}, {0, 1, 2}),
diff --git a/tensorflow/compiler/xla/tests/BUILD b/tensorflow/compiler/xla/tests/BUILD
index 4883380be1f8a291bda829dff713de549ba58c65..e7e0a19db0516e4210f6bb78d6b5e6968bf78b2a 100644
--- a/tensorflow/compiler/xla/tests/BUILD
+++ b/tensorflow/compiler/xla/tests/BUILD
@@ -117,11 +117,11 @@ cc_library(
         "//tensorflow/compiler/xla/service:backend",
         "//tensorflow/compiler/xla/service:computation_layout",
         "//tensorflow/compiler/xla/service:hlo",
+        "//tensorflow/compiler/xla/service:hlo_parser",
         "//tensorflow/compiler/xla/service:hlo_runner",
         "//tensorflow/compiler/xla/service:hlo_verifier",
         "//tensorflow/compiler/xla/service:interpreter_plugin",  # reference backend
         "//tensorflow/compiler/xla/service:platform_util",
-        "//tensorflow/compiler/xla/tools/parser:hlo_parser",
         "//tensorflow/core:lib",
         "//tensorflow/core:stream_executor_no_cuda",
         "//tensorflow/core:test",
@@ -138,8 +138,8 @@ cc_library(
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla/service:hlo",
+        "//tensorflow/compiler/xla/service:hlo_parser",
         "//tensorflow/compiler/xla/service:hlo_verifier",
-        "//tensorflow/compiler/xla/tools/parser:hlo_parser",
         "//tensorflow/core:lib",
         "//tensorflow/core:test",
     ],
@@ -619,6 +619,7 @@ xla_test(
 
 xla_test(
     name = "exhaustive_f32_elementwise_op_test",
+    size = "enormous",
     srcs = ["exhaustive_f32_elementwise_op_test.cc"],
     backends = [
         "cpu",
@@ -626,7 +627,6 @@ xla_test(
     ],
     shard_count = 48,
     tags = [
-        "enormous",
         "manual",
         "notap",
     ],
@@ -697,8 +697,8 @@ xla_test(
         "//tensorflow/compiler/xla:execution_options_util",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:test",
+        "//tensorflow/compiler/xla/service:hlo_parser",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
-        "//tensorflow/compiler/xla/tools/parser:hlo_parser",
     ],
 )
 
@@ -776,30 +776,42 @@ xla_test(
     ],
 )
 
+CONVOLUTION_TEST_DEPS = [
+    "//tensorflow/compiler/xla:array2d",
+    "//tensorflow/compiler/xla:array4d",
+    "//tensorflow/compiler/xla:literal_util",
+    "//tensorflow/compiler/xla:reference_util",
+    "//tensorflow/compiler/xla:shape_util",
+    "//tensorflow/compiler/xla:statusor",
+    "//tensorflow/compiler/xla:util",
+    "//tensorflow/compiler/xla:xla_data_proto",
+    "//tensorflow/compiler/xla/client:global_data",
+    "//tensorflow/compiler/xla/client:local_client",
+    "//tensorflow/compiler/xla/client:padding",
+    "//tensorflow/compiler/xla/client/xla_client:xla_builder",
+    "//tensorflow/compiler/xla/tests:client_library_test_base",
+    "//tensorflow/compiler/xla/tests:literal_test_util",
+    "//tensorflow/compiler/xla/tests:xla_internal_test_main",
+    "//tensorflow/core:lib",
+    "//tensorflow/core:test",
+]
+
 xla_test(
     name = "convolution_test",
     timeout = "long",
     srcs = ["convolution_test.cc"],
     shard_count = 25,
-    deps = [
-        "//tensorflow/compiler/xla:array2d",
-        "//tensorflow/compiler/xla:array4d",
-        "//tensorflow/compiler/xla:literal_util",
-        "//tensorflow/compiler/xla:reference_util",
-        "//tensorflow/compiler/xla:shape_util",
-        "//tensorflow/compiler/xla:statusor",
-        "//tensorflow/compiler/xla:util",
-        "//tensorflow/compiler/xla:xla_data_proto",
-        "//tensorflow/compiler/xla/client:global_data",
-        "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client:padding",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/tests:client_library_test_base",
-        "//tensorflow/compiler/xla/tests:literal_test_util",
-        "//tensorflow/compiler/xla/tests:xla_internal_test_main",
-        "//tensorflow/core:lib",
-        "//tensorflow/core:test",
-    ],
+    deps = CONVOLUTION_TEST_DEPS,
+)
+
+xla_test(
+    name = "convolution_test_gpu_alternative_layout",
+    timeout = "long",
+    srcs = ["convolution_test.cc"],
+    backend_args = {"gpu": ["--xla_backend_extra_options=xla_gpu_experimental_conv_disable_layout_heuristic"]},
+    backends = ["gpu"],
+    shard_count = 25,
+    deps = CONVOLUTION_TEST_DEPS,
 )
 
 xla_test(
@@ -1183,9 +1195,25 @@ xla_test(
     ],
     deps = [
         ":client_library_test_base",
+        "//tensorflow/compiler/xla/service:hlo_parser",
+        "//tensorflow/compiler/xla/tests:hlo_test_base",
+        "//tensorflow/compiler/xla/tests:xla_internal_test_main",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:test",
+    ],
+)
+
+xla_test(
+    name = "token_hlo_test",
+    srcs = ["token_hlo_test.cc"],
+    tags = [
+        "enable_for_xla_interpreter",
+    ],
+    deps = [
+        ":client_library_test_base",
+        "//tensorflow/compiler/xla/service:hlo_verifier",
         "//tensorflow/compiler/xla/tests:hlo_test_base",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
-        "//tensorflow/compiler/xla/tools/parser:hlo_parser",
         "//tensorflow/core:lib",
         "//tensorflow/core:test",
     ],
@@ -1508,11 +1536,11 @@ xla_test(
         "//tensorflow/compiler/xla/client:local_client",
         "//tensorflow/compiler/xla/client/xla_client:xla_builder",
         "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/service:hlo_parser",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:hlo_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
-        "//tensorflow/compiler/xla/tools/parser:hlo_parser",
         "//tensorflow/core:lib",
         "//tensorflow/core:stream_executor_no_cuda",
         "//tensorflow/core:test",
diff --git a/tensorflow/compiler/xla/tests/broadcast_simple_test.cc b/tensorflow/compiler/xla/tests/broadcast_simple_test.cc
index 34c86e007beea1cbac04641bdbdab62dc567f13e..3a0f51fc66d65c8684bd607b9e8103559cd4d8d4 100644
--- a/tensorflow/compiler/xla/tests/broadcast_simple_test.cc
+++ b/tensorflow/compiler/xla/tests/broadcast_simple_test.cc
@@ -671,7 +671,7 @@ XLA_TEST_F(BroadcastSimpleTest, InvalidInDimensionBroadcasting) {
   auto result_status = Execute(&b, {});
   EXPECT_FALSE(result_status.ok());
   EXPECT_THAT(result_status.status().error_message(),
-              HasSubstr("op BINOP_ADD with incompatible shapes"));
+              HasSubstr("op add with incompatible shapes"));
 }
 
 XLA_TEST_F(BroadcastSimpleTest, InvalidDegenerateBroadcasting) {
@@ -684,7 +684,7 @@ XLA_TEST_F(BroadcastSimpleTest, InvalidDegenerateBroadcasting) {
   auto result_status = Execute(&b, {});
   EXPECT_FALSE(result_status.ok());
   EXPECT_THAT(result_status.status().error_message(),
-              HasSubstr("op BINOP_ADD with incompatible shapes"));
+              HasSubstr("op add with incompatible shapes"));
 }
 
 }  // namespace
diff --git a/tensorflow/compiler/xla/tests/convert_test.cc b/tensorflow/compiler/xla/tests/convert_test.cc
index 4ef0a77884c90b9fe32f96d3361fa3d80bde623b..722d882471a41a75c1e5e60f8c1a151b76c7e004 100644
--- a/tensorflow/compiler/xla/tests/convert_test.cc
+++ b/tensorflow/compiler/xla/tests/convert_test.cc
@@ -249,10 +249,10 @@ XLA_TEST_F(ConvertTest, ConvertR1F32ToR1S64) {
                          -1.99f,
                          -2.0f,
                          -2.01f,
-                         0x1.FFFFFEp+62F,
-                         0x1.FFFFFCp+62F,
-                         -0x1.FFFFFEp+62F,
-                         -0x1.FFFFFCp+62F};
+                         9223371487098961920.f,
+                         9223370937343148032.f,
+                         -9223371487098961920.f,
+                         -9223370937343148032.f};
   std::unique_ptr<Literal> arg_literal = Literal::CreateR1<float>({arg});
   auto arg_param = builder.Parameter(0, arg_literal->shape(), "arg_param");
   std::unique_ptr<GlobalData> arg_data =
diff --git a/tensorflow/compiler/xla/tests/convolution_test.cc b/tensorflow/compiler/xla/tests/convolution_test.cc
index 947959beb144e1509a77ad2f94b8493de46ba6f2..346bb3a3996ee5bf662b0f74dd0c2096efbf5295 100644
--- a/tensorflow/compiler/xla/tests/convolution_test.cc
+++ b/tensorflow/compiler/xla/tests/convolution_test.cc
@@ -47,9 +47,9 @@ class ConvolutionTest : public ClientLibraryTestBase {
 #if XLA_TEST_BACKEND_GPU
   // XLA:GPU sometimes uses FFT convolution which isn't as precise as spatial
   // convolution. So relax the absolute error threshold.
-  ErrorSpec error_spec_ = ErrorSpec(1e-2);
+  ErrorSpec error_spec_ = ErrorSpec(1e-2, 1e-4);
 #else
-  ErrorSpec error_spec_ = ErrorSpec(1e-4);
+  ErrorSpec error_spec_ = ErrorSpec(1e-4, 1e-4);
 #endif
 };
 
diff --git a/tensorflow/compiler/xla/tests/cross_replica_sum_test.cc b/tensorflow/compiler/xla/tests/cross_replica_sum_test.cc
index b15988776513a60c9e5c85d4780912106db98e75..b151187c4b8f01c5b46ccadf27d2e22a7c902e98 100644
--- a/tensorflow/compiler/xla/tests/cross_replica_sum_test.cc
+++ b/tensorflow/compiler/xla/tests/cross_replica_sum_test.cc
@@ -14,12 +14,12 @@ limitations under the License.
 ==============================================================================*/
 
 #include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/test.h"
 #include "tensorflow/compiler/xla/test_helpers.h"
 #include "tensorflow/compiler/xla/tests/hlo_test_base.h"
 #include "tensorflow/compiler/xla/tests/test_macros.h"
-#include "tensorflow/compiler/xla/tools/parser/hlo_parser.h"
 
 namespace xla {
 namespace {
@@ -32,11 +32,19 @@ class TrivialCrossReplicaSumTest : public HloTestBase {};
 XLA_TEST_F(TrivialCrossReplicaSumTest, OneOperand) {
   const char* module_str = R"(
   HloModule test
+
+  add {
+    x = f32[] parameter(0)
+    y = f32[] parameter(1)
+    add = f32[] add(x, y)
+  }
+
   ENTRY test_computation {
     p = f32[3] parameter(0)
-    ROOT crs = f32[3] cross-replica-sum(p)
+    ROOT crs = f32[3] cross-replica-sum(p), to_apply=add
   })";
-  auto module = tools::Parse(module_str, GetModuleConfigForTest()).ValueOrDie();
+  auto module =
+      ParseHloString(module_str, GetModuleConfigForTest()).ValueOrDie();
   auto literal = Literal::CreateR1<float>({1, 2, 3});
   EXPECT_EQ(*literal, *ExecuteAndTransfer(std::move(module), {literal.get()}));
 }
@@ -44,12 +52,20 @@ XLA_TEST_F(TrivialCrossReplicaSumTest, OneOperand) {
 XLA_TEST_F(TrivialCrossReplicaSumTest, MultipleOperands) {
   const char* module_str = R"(
   HloModule test
+
+  add {
+    x = f32[] parameter(0)
+    y = f32[] parameter(1)
+    add = f32[] add(x, y)
+  }
+
   ENTRY test_computation {
     p0 = f32[3] parameter(0)
     p1 = f32[2] parameter(1)
-    ROOT crs = (f32[3], f32[2]) cross-replica-sum(p0, p1)
+    ROOT crs = (f32[3], f32[2]) cross-replica-sum(p0, p1), to_apply=add
   })";
-  auto module = tools::Parse(module_str, GetModuleConfigForTest()).ValueOrDie();
+  auto module =
+      ParseHloString(module_str, GetModuleConfigForTest()).ValueOrDie();
   auto literal0 = Literal::CreateR1<float>({1, 2, 3});
   auto literal1 = Literal::CreateR1<float>({10, 20});
   EXPECT_EQ(
@@ -63,12 +79,20 @@ XLA_TEST_F(TrivialCrossReplicaSumTest, MultipleOperands) {
 XLA_TEST_F(TrivialCrossReplicaSumTest, ConstantOperand) {
   const char* module_str = R"(
   HloModule test
+
+  add {
+    x = f32[] parameter(0)
+    y = f32[] parameter(1)
+    add = f32[] add(x, y)
+  }
+
   ENTRY test_computation {
     p0 = f32[3] parameter(0)
     p1 = f32[2] constant({10, 20})
-    ROOT crs = (f32[3], f32[2]) cross-replica-sum(p0, p1)
+    ROOT crs = (f32[3], f32[2]) cross-replica-sum(p0, p1), to_apply=add
   })";
-  auto module = tools::Parse(module_str, GetModuleConfigForTest()).ValueOrDie();
+  auto module =
+      ParseHloString(module_str, GetModuleConfigForTest()).ValueOrDie();
   auto literal0 = Literal::CreateR1<float>({1, 2, 3});
   auto literal1 = Literal::CreateR1<float>({10, 20});
   EXPECT_EQ(*Literal::MakeTuple({literal0.get(), literal1.get()}),
diff --git a/tensorflow/compiler/xla/tests/gather_operation_test.cc b/tensorflow/compiler/xla/tests/gather_operation_test.cc
index 4854c649c15f2ab89bd3b343abd248be6e227c60..143ffbdeb409d91ab6d46d386aa5ff98ebc4ae10 100644
--- a/tensorflow/compiler/xla/tests/gather_operation_test.cc
+++ b/tensorflow/compiler/xla/tests/gather_operation_test.cc
@@ -14,12 +14,12 @@ limitations under the License.
 ==============================================================================*/
 
 #include "tensorflow/compiler/xla/execution_options_util.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/status_macros.h"
 #include "tensorflow/compiler/xla/test.h"
 #include "tensorflow/compiler/xla/tests/client_library_test_base.h"
 #include "tensorflow/compiler/xla/tests/hlo_test_base.h"
 #include "tensorflow/compiler/xla/tests/test_macros.h"
-#include "tensorflow/compiler/xla/tools/parser/hlo_parser.h"
 
 // NB!  TODO(b/74360564): These tests do not test out of bounds behavior since
 // that hasn't been specced yet.
@@ -41,7 +41,7 @@ class GatherOperationTest : public HloTestBase {
     HloModuleConfig config;
     config.set_debug_options(GetDebugOptionsForTest());
     TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
-                            tools::Parse(hlo_text, config));
+                            ParseHloString(hlo_text, config));
     EXPECT_TRUE(RunAndCompare(std::move(module), args, nullopt));
   }
 };
diff --git a/tensorflow/compiler/xla/tests/hlo_test_base.cc b/tensorflow/compiler/xla/tests/hlo_test_base.cc
index 36e19e6507fa3b6f4a21949583f92716d2f44333..242cc5db11ff2bdf69209df7537216573d8afbf3 100644
--- a/tensorflow/compiler/xla/tests/hlo_test_base.cc
+++ b/tensorflow/compiler/xla/tests/hlo_test_base.cc
@@ -23,11 +23,11 @@ limitations under the License.
 #include "tensorflow/compiler/xla/layout_util.h"
 #include "tensorflow/compiler/xla/legacy_flags/debug_options_flags.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/service/platform_util.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/tests/literal_test_util.h"
 #include "tensorflow/compiler/xla/tests/test_utils.h"
-#include "tensorflow/compiler/xla/tools/parser/hlo_parser.h"
 #include "tensorflow/compiler/xla/types.h"
 #include "tensorflow/core/lib/core/status_test_util.h"
 #include "tensorflow/core/lib/gtl/array_slice.h"
@@ -94,8 +94,7 @@ HloTestBase::HloTestBase(se::Platform* test_platform,
 
 /* static */
 std::unique_ptr<HloModule> HloTestBase::CreateNewModule(const string& name) {
-  return MakeUnique<HloModule>(name, VersionedComputationHandle(),
-                               GetModuleConfigForTest());
+  return MakeUnique<HloModule>(name, GetModuleConfigForTest());
 }
 
 /*static*/ DebugOptions HloTestBase::GetDebugOptionsForTest() {
diff --git a/tensorflow/compiler/xla/tests/hlo_test_base.h b/tensorflow/compiler/xla/tests/hlo_test_base.h
index eb3a2ea76a667a2afa2562f01d28f34384b84a21..249da87f489324ed9d377cc46a15cef5a9e74192 100644
--- a/tensorflow/compiler/xla/tests/hlo_test_base.h
+++ b/tensorflow/compiler/xla/tests/hlo_test_base.h
@@ -66,6 +66,15 @@ namespace xla {
 //
 // For a more detailed example, see "../tests/sample_text_test.cc".
 class HloTestBase : public ::testing::Test {
+ public:
+  // Creates a new HLO module for a test. The module created will have
+  // TestName() for its name; it will also automatically populate its debug
+  // options from command-line flags. If you want a fresh HloModule object and
+  // then add HloComputations to it, it's recommended to use this method in your
+  // tests.
+  static std::unique_ptr<HloModule> CreateNewModule(
+      const string& name = TestName());
+
  protected:
   // This uses the interpreter backend as the reference backend and
   // automatically finds another supported backend as the test backend. If the
@@ -80,14 +89,6 @@ class HloTestBase : public ::testing::Test {
 
   ~HloTestBase() override {}
 
-  // Creates a new HLO module for a test. The module created will have
-  // TestName() for its name; it will also automatically populate its debug
-  // options from command-line flags. If you want a fresh HloModule object and
-  // then add HloComputations to it, it's recommended to use this method in your
-  // tests.
-  static std::unique_ptr<HloModule> CreateNewModule(
-      const string& name = TestName());
-
   // Populates debug options from command-line flags and adjusts the options for
   // testing. It is recommended to use this when you need to pass in
   // DebugOptions, e.g. when creating a module from a string or a file.
diff --git a/tensorflow/compiler/xla/tests/hlo_verified_test_base.cc b/tensorflow/compiler/xla/tests/hlo_verified_test_base.cc
index da4cf4ae0c31bc194cd2ec9b845df36afbde69b0..22c664d1426c598dbb695ff1b66ce009b0a19c00 100644
--- a/tensorflow/compiler/xla/tests/hlo_verified_test_base.cc
+++ b/tensorflow/compiler/xla/tests/hlo_verified_test_base.cc
@@ -15,10 +15,10 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/tests/hlo_verified_test_base.h"
 
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/service/hlo_verifier.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/status_macros.h"
-#include "tensorflow/compiler/xla/tools/parser/hlo_parser.h"
 #include "tensorflow/core/platform/logging.h"
 #include "tensorflow/core/platform/test.h"
 
@@ -41,14 +41,17 @@ void HloVerifiedTestBase::TearDown() {
       << "TearDown called more than once; it should be called exactly once.";
   tear_down_called_ = true;
   if (module_) {
-    VerifyModule();
+    VerifyModule(module_.get());
+  }
+  for (int i = 0; i < modules_.size(); ++i) {
+    VerifyModule(modules_.at(i).get());
   }
   HloTestBase::TearDown();
 }
 
-void HloVerifiedTestBase::VerifyModule() {
-  HloVerifier verifier;
-  xla::StatusOr<bool> mutated = verifier.Run(module_.get());
+void HloVerifiedTestBase::VerifyModule(HloModule* module) {
+  HloVerifier verifier(/*allow_mixed_precision=*/true);
+  xla::StatusOr<bool> mutated = verifier.Run(module);
   if (!mutated.ok()) {
     ADD_FAILURE() << "HloVerifier failed: " << mutated.status();
   } else {
@@ -59,15 +62,20 @@ void HloVerifiedTestBase::VerifyModule() {
 
 HloModule& HloVerifiedTestBase::module() {
   if (!module_) {
-    module_ = CreateNewModule();
+    module_ = HloTestBase::CreateNewModule();
   }
   return *module_;
 }
 
+HloModule* HloVerifiedTestBase::CreateNewModule(const string& name) {
+  modules_.emplace_back(HloTestBase::CreateNewModule());
+  return modules_.back().get();
+}
+
 void HloVerifiedTestBase::ParseAndVerifyModule(
     tensorflow::StringPiece hlo_text) {
   CHECK(!module_) << "Called ParseModule when test already has a module.";
-  TF_ASSERT_OK_AND_ASSIGN(module_, tools::Parse(hlo_text));
-  VerifyModule();
+  TF_ASSERT_OK_AND_ASSIGN(module_, ParseHloString(hlo_text));
+  VerifyModule(module_.get());
 }
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/tests/hlo_verified_test_base.h b/tensorflow/compiler/xla/tests/hlo_verified_test_base.h
index e5bb14a8839acbdef8fd2b79bb0f574c46ea3d40..5b59cc77f61b05092d3afb331e73932c9edc5840 100644
--- a/tensorflow/compiler/xla/tests/hlo_verified_test_base.h
+++ b/tensorflow/compiler/xla/tests/hlo_verified_test_base.h
@@ -52,11 +52,23 @@ class HloVerifiedTestBase : public HloTestBase {
     shape_verifier_ = std::move(shape_verifier);
   }
 
+  // Creates a new module for a test, and stores it in modules_ so it can be
+  // verified. Intentionally hides HloTestBase::CreateNewModule, to prevent
+  // creation of unverified modules.
+  HloModule* CreateNewModule(const string& name = TestName());
+
+  // It is confusing to store modules created by module() and CreateNewModule()
+  // in different fields, but it allows us to migrate tests to
+  // HloVerifiedTestBase more easily, so it's a win because we can verify more
+  // modules. See b/80488902.
  private:
-  std::unique_ptr<HloModule> module_;  // Lazily populated. Access via module().
+  // Lazily populated. Access via module().
+  std::unique_ptr<HloModule> module_;
+  // Populated by calls to CreateNewModule.
+  std::vector<std::unique_ptr<HloModule>> modules_;
   std::unique_ptr<ShapeVerifier> shape_verifier_;
   bool tear_down_called_ = false;
-  void VerifyModule();
+  static void VerifyModule(HloModule* module);
 };
 
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/tests/llvm_compiler_test.cc b/tensorflow/compiler/xla/tests/llvm_compiler_test.cc
index 2f46ee0be216d7dabf1c476d3cfb7d528f8ab6a4..082bc34136e004795ce300c66591758f47c665fe 100644
--- a/tensorflow/compiler/xla/tests/llvm_compiler_test.cc
+++ b/tensorflow/compiler/xla/tests/llvm_compiler_test.cc
@@ -124,8 +124,7 @@ class LLVMCompilerTest : public ::testing::Test {
   static std::unique_ptr<HloModule> CreateNewModule() {
     HloModuleConfig config;
     config.set_debug_options(legacy_flags::GetDebugOptionsFromFlags());
-    return MakeUnique<HloModule>(TestName(), VersionedComputationHandle(),
-                                 config);
+    return MakeUnique<HloModule>(TestName(), config);
   }
 };
 
diff --git a/tensorflow/compiler/xla/tests/map_test.cc b/tensorflow/compiler/xla/tests/map_test.cc
index 7df45bebebdd3eb2e71f27d831a8e2ac9e3b5f7c..3975e9125703ee081d4e84fa8bd27fcbe483ac34 100644
--- a/tensorflow/compiler/xla/tests/map_test.cc
+++ b/tensorflow/compiler/xla/tests/map_test.cc
@@ -488,10 +488,9 @@ TEST_F(MapTest, MapOperantionWithBuildError) {
 
   StatusOr<XlaComputation> computation_status = builder.Build();
   ASSERT_TRUE(!computation_status.ok());
-  EXPECT_THAT(
-      computation_status.status().ToString(),
-      ::testing::HasSubstr("error from: ErrorAdd: Binary op BINOP_ADD with "
-                           "different element types: f32[] and u16[]"));
+  EXPECT_THAT(computation_status.status().ToString(),
+              ::testing::HasSubstr("error from: ErrorAdd: Binary op add with "
+                                   "different element types: f32[] and u16[]"));
 }
 
 // MapTest disables inline and algsimp. MapTestWithFullOpt runs all
diff --git a/tensorflow/compiler/xla/tests/multioutput_fusion_test.cc b/tensorflow/compiler/xla/tests/multioutput_fusion_test.cc
index ec7ca20bdf266cf8ed220809c0c24bee473359be..41f723edf1ff3518686231f31b61b64291b1f6bf 100644
--- a/tensorflow/compiler/xla/tests/multioutput_fusion_test.cc
+++ b/tensorflow/compiler/xla/tests/multioutput_fusion_test.cc
@@ -273,5 +273,246 @@ XLA_TEST_F(MultiOutputFusionTest, MultiOutputLoopFeedingMap) {
       *result, *Literal::CreateR1<float>({0.0, 4.0, 9.0})));
 }
 
+const char* const kScalarOps = R"(
+    HloModule m
+
+    Add {
+      lhsadd = f32[] parameter(0)
+      rhsadd = f32[] parameter(1)
+      ROOT add = f32[] add(lhsadd, rhsadd)
+    }
+
+    Max {
+      lhsmax = f32[] parameter(0)
+      rhsmax = f32[] parameter(1)
+      ROOT max = f32[] maximum(lhsmax, rhsmax)
+    }
+)";
+
+XLA_TEST_F(MultiOutputFusionTest,
+           DISABLED_ON_CPU(MultiOutputReduceFusionMinor)) {
+  const string testcase = tensorflow::strings::StrCat(kScalarOps, R"(
+    fused_reduce {
+      p0 = f32[2,2,2]{2,1,0} parameter(0)
+      c0 = f32[] constant(0)
+      r1 = f32[2,2]{1,0} reduce(p0, c0), dimensions={2}, to_apply=Add
+      mul = f32[2,2,2]{2,1,0} multiply(p0, p0)
+      c1 = f32[] constant(5)
+      r2 = f32[2,2]{1,0} reduce(mul, c1), dimensions={2}, to_apply=Max
+      ROOT tuple = (f32[2,2]{1,0}, f32[2,2]{1,0}) tuple(r1, r2)
+    }
+
+    ENTRY reduce {
+      p = f32[2,2,2]{2,1,0} parameter(0)
+      ROOT fusion = (f32[2,2]{1,0}, f32[2,2]{1,0}) fusion(p), kind=kInput,
+                                                        calls=fused_reduce
+    })");
+  auto module =
+      HloRunner::CreateModuleFromString(testcase, GetDebugOptionsForTest())
+          .ValueOrDie();
+  auto param = Literal::CreateR3<float>({{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}});
+  TF_ASSERT_OK_AND_ASSIGN(auto result,
+                          Execute(std::move(module), {param.get()}));
+  EXPECT_TRUE(LiteralTestUtil::Equal(
+      *result,
+      *Literal::MakeTupleOwned(Literal::CreateR2<float>({{3, 7}, {11, 15}}),
+                               Literal::CreateR2<float>({{5, 16}, {36, 64}}))));
+}
+
+XLA_TEST_F(MultiOutputFusionTest,
+           DISABLED_ON_CPU(MultiOutputReduceFusionMajor)) {
+  const string testcase = tensorflow::strings::StrCat(kScalarOps, R"(
+    fused_reduce {
+      p0 = f32[2,2,2]{2,1,0} parameter(0)
+      c0 = f32[] constant(0)
+      r1 = f32[2,2]{1,0} reduce(p0, c0), dimensions={0}, to_apply=Add
+      mul = f32[2,2,2]{2,1,0} multiply(p0, p0)
+      c1 = f32[] constant(5)
+      r2 = f32[2,2]{1,0} reduce(mul, c1), dimensions={0}, to_apply=Max
+      ROOT tuple = (f32[2,2]{1,0}, f32[2,2]{1,0}) tuple(r1, r2)
+    }
+
+    ENTRY reduce {
+      p = f32[2,2,2]{2,1,0} parameter(0)
+      ROOT fusion = (f32[2,2]{1,0}, f32[2,2]{1,0}) fusion(p), kind=kInput,
+                                                        calls=fused_reduce
+    })");
+  auto module =
+      HloRunner::CreateModuleFromString(testcase, GetDebugOptionsForTest())
+          .ValueOrDie();
+  auto param = Literal::CreateR3<float>({{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}});
+  TF_ASSERT_OK_AND_ASSIGN(auto result,
+                          Execute(std::move(module), {param.get()}));
+  EXPECT_TRUE(LiteralTestUtil::Equal(
+      *result, *Literal::MakeTupleOwned(
+                   Literal::CreateR2<float>({{6, 8}, {10, 12}}),
+                   Literal::CreateR2<float>({{25, 36}, {49, 64}}))));
+}
+
+XLA_TEST_F(MultiOutputFusionTest,
+           DISABLED_ON_CPU(MultiOutputReduceFusionScalar)) {
+  const string testcase = tensorflow::strings::StrCat(kScalarOps, R"(
+    fused_reduce {
+      p0 = f32[2,2,2]{2,1,0} parameter(0)
+      c0 = f32[] constant(0)
+      r1 = f32[2]{0} reduce(p0, c0), dimensions={0,2}, to_apply=Add
+      mul = f32[2,2,2]{2,1,0} multiply(p0, p0)
+      c1 = f32[] constant(1.17549e-38)
+      r2 = f32[2]{0} reduce(mul, c1), dimensions={0,2}, to_apply=Max
+      r3 = f32[2]{0} reduce(mul, c0), dimensions={0,2}, to_apply=Add
+      ROOT tuple = (f32[2]{0}, f32[2]{0}, f32[2]{0}) tuple(r1, r2, r3)
+    }
+
+    ENTRY reduce {
+      p = f32[2,2,2]{2,1,0} parameter(0)
+      ROOT fusion = (f32[2]{0}, f32[2]{0}, f32[2]{0}) fusion(p), kind=kInput,
+                                                        calls=fused_reduce
+    })");
+  auto module =
+      HloRunner::CreateModuleFromString(testcase, GetDebugOptionsForTest())
+          .ValueOrDie();
+  auto param = Literal::CreateR3<float>({{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}});
+  TF_ASSERT_OK_AND_ASSIGN(auto result,
+                          Execute(std::move(module), {param.get()}));
+  EXPECT_TRUE(LiteralTestUtil::Equal(
+      *result, *Literal::MakeTupleOwned(Literal::CreateR1<float>({14, 22}),
+                                        Literal::CreateR1<float>({36, 64}),
+                                        Literal::CreateR1<float>({66, 138}))));
+}
+
+XLA_TEST_F(MultiOutputFusionTest,
+           DISABLED_ON_CPU(MultiOutputReduceFusionMinorWithExtraOutput)) {
+  const string testcase = tensorflow::strings::StrCat(kScalarOps, R"(
+    fused_reduce {
+      p0 = f32[2,2,2]{2,1,0} parameter(0)
+      c0 = f32[] constant(0)
+      r1 = f32[2,2]{1,0} reduce(p0, c0), dimensions={2}, to_apply=Add
+      mul = f32[2,2,2]{2,1,0} multiply(p0, p0)
+      c1 = f32[] constant(5)
+      r2 = f32[2,2]{1,0} reduce(mul, c1), dimensions={2}, to_apply=Max
+      ROOT tuple = (f32[2,2,2]{2,1,0}, f32[2,2]{1,0}, f32[2,2]{1,0})
+                     tuple(p0, r1, r2)
+    }
+
+    ENTRY reduce {
+      p = f32[2,2,2]{2,1,0} parameter(0)
+      ROOT fusion = (f32[2,2,2]{2,1,0}, f32[2,2]{1,0}, f32[2,2]{1,0}) fusion(p),
+                                                 kind=kInput, calls=fused_reduce
+    })");
+  auto module =
+      HloRunner::CreateModuleFromString(testcase, GetDebugOptionsForTest())
+          .ValueOrDie();
+  auto param = Literal::CreateR3<float>({{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}});
+  TF_ASSERT_OK_AND_ASSIGN(auto result,
+                          Execute(std::move(module), {param.get()}));
+  EXPECT_TRUE(LiteralTestUtil::Equal(
+      *result,
+      *Literal::MakeTupleOwned(
+          Literal::CreateR3<float>({{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}}),
+          Literal::CreateR2<float>({{3, 7}, {11, 15}}),
+          Literal::CreateR2<float>({{5, 16}, {36, 64}}))));
+}
+
+XLA_TEST_F(MultiOutputFusionTest,
+           DISABLED_ON_CPU(MultiOutputReduceFusionMajorWithExtraOutput)) {
+  const string testcase = tensorflow::strings::StrCat(kScalarOps, R"(
+    fused_reduce {
+      p0 = f32[2,2,2]{2,1,0} parameter(0)
+      c0 = f32[] constant(0)
+      r1 = f32[2,2]{1,0} reduce(p0, c0), dimensions={0}, to_apply=Add
+      mul = f32[2,2,2]{2,1,0} multiply(p0, p0)
+      c1 = f32[] constant(5)
+      r2 = f32[2,2]{1,0} reduce(mul, c1), dimensions={0}, to_apply=Max
+      ROOT tuple = (f32[2,2]{1,0}, f32[2,2,2]{2,1,0}, f32[2,2]{1,0})
+                     tuple(r1, mul, r2)
+    }
+
+    ENTRY reduce {
+      p = f32[2,2,2]{2,1,0} parameter(0)
+      ROOT fusion = (f32[2,2]{1,0}, f32[2,2,2]{2,1,0}, f32[2,2]{1,0}) fusion(p),
+                                                 kind=kInput, calls=fused_reduce
+    })");
+  auto module =
+      HloRunner::CreateModuleFromString(testcase, GetDebugOptionsForTest())
+          .ValueOrDie();
+  auto param = Literal::CreateR3<float>({{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}});
+  TF_ASSERT_OK_AND_ASSIGN(auto result,
+                          Execute(std::move(module), {param.get()}));
+  EXPECT_TRUE(LiteralTestUtil::Equal(
+      *result,
+      *Literal::MakeTupleOwned(
+          Literal::CreateR2<float>({{6, 8}, {10, 12}}),
+          Literal::CreateR3<float>({{{1, 4}, {9, 16}}, {{25, 36}, {49, 64}}}),
+          Literal::CreateR2<float>({{25, 36}, {49, 64}}))));
+}
+
+XLA_TEST_F(MultiOutputFusionTest,
+           DISABLED_ON_CPU(MultiOutputReduceFusionScalarWithExtraOutput)) {
+  const string testcase = tensorflow::strings::StrCat(kScalarOps, R"(
+    fused_reduce {
+      p0 = f32[2,2,2]{2,1,0} parameter(0)
+      c0 = f32[] constant(0)
+      r1 = f32[2]{0} reduce(p0, c0), dimensions={0,2}, to_apply=Add
+      mul = f32[2,2,2]{2,1,0} multiply(p0, p0)
+      c1 = f32[] constant(5)
+      mul2 = f32[2,2,2]{2,1,0} multiply(p0, c1)
+      ROOT tuple = (f32[2]{0}, f32[2,2,2]{2,1,0}, f32[2,2,2]{2,1,0})
+                                                           tuple(r1, mul, mul2)
+    }
+
+    ENTRY reduce {
+      p = f32[2,2,2]{2,1,0} parameter(0)
+      ROOT fusion = (f32[2]{0}, f32[2,2,2]{2,1,0}, f32[2,2,2]{2,1,0}) fusion(p),
+                                                 kind=kInput, calls=fused_reduce
+    })");
+  auto module =
+      HloRunner::CreateModuleFromString(testcase, GetDebugOptionsForTest())
+          .ValueOrDie();
+  auto param = Literal::CreateR3<float>({{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}});
+  TF_ASSERT_OK_AND_ASSIGN(auto result,
+                          Execute(std::move(module), {param.get()}));
+  EXPECT_TRUE(LiteralTestUtil::Equal(
+      *result,
+      *Literal::MakeTupleOwned(
+          Literal::CreateR1<float>({14, 22}),
+          Literal::CreateR3<float>({{{1, 4}, {9, 16}}, {{25, 36}, {49, 64}}}),
+          Literal::CreateR3<float>(
+              {{{5, 10}, {15, 20}}, {{25, 30}, {35, 40}}}))));
+}
+
+XLA_TEST_F(MultiOutputFusionTest,
+           DISABLED_ON_CPU(MultiOutputReduceFusionNonConstInit)) {
+  const string testcase = tensorflow::strings::StrCat(kScalarOps, R"(
+    fused_reduce {
+      p0 = f32[2,2,2]{2,1,0} parameter(0)
+      init1 = f32[] parameter(1)
+      init2 = f32[] parameter(2)
+      r1 = f32[2,2]{1,0} reduce(p0, init1), dimensions={2}, to_apply=Add
+      r2 = f32[2,2]{1,0} reduce(p0, init2), dimensions={2}, to_apply=Max
+      ROOT tuple = (f32[2,2]{1,0}, f32[2,2]{1,0}) tuple(r1, r2)
+    }
+
+    ENTRY reduce {
+      p = f32[2,2,2]{2,1,0} parameter(0)
+      i = f32[] parameter(1)
+      j = f32[] parameter(2)
+      ROOT fusion = (f32[2,2]{1,0}, f32[2,2]{1,0}) fusion(p, i, j), kind=kInput,
+                                                              calls=fused_reduce
+    })");
+  auto module =
+      HloRunner::CreateModuleFromString(testcase, GetDebugOptionsForTest())
+          .ValueOrDie();
+  auto param = Literal::CreateR3<float>({{{0, 2}, {3, 4}}, {{5, 6}, {7, 8}}});
+  auto init1 = Literal::CreateR0<float>(5);
+  auto init2 = Literal::CreateR0<float>(6);
+  TF_ASSERT_OK_AND_ASSIGN(
+      auto result,
+      Execute(std::move(module), {param.get(), init1.get(), init2.get()}));
+  EXPECT_TRUE(LiteralTestUtil::Equal(
+      *result, *Literal::MakeTupleOwned(
+                   Literal::CreateR2<float>({{167, 172}, {176, 180}}),
+                   Literal::CreateR2<float>({{6, 6}, {6, 8}}))));
+}
+
 }  // namespace
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/tests/reduce_hlo_test.cc b/tensorflow/compiler/xla/tests/reduce_hlo_test.cc
index c0a2c0ca4cb8414e0771a541b9f963f9aedc8376..9052b188ed09a715b6ad7c3a40dc853d02cdd70c 100644
--- a/tensorflow/compiler/xla/tests/reduce_hlo_test.cc
+++ b/tensorflow/compiler/xla/tests/reduce_hlo_test.cc
@@ -15,9 +15,9 @@ limitations under the License.
 
 #include <array>
 
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/tests/hlo_test_base.h"
 #include "tensorflow/compiler/xla/tests/test_macros.h"
-#include "tensorflow/compiler/xla/tools/parser/hlo_parser.h"
 #include "tensorflow/core/lib/strings/str_util.h"
 #include "tensorflow/core/lib/strings/strcat.h"
 #include "tensorflow/core/platform/test.h"
@@ -73,7 +73,7 @@ ENTRY reduce.1 {
 }
 )";
 
-  return tools::Parse(hlo_string);
+  return ParseHloString(hlo_string);
 }
 
 // TODO(b/72454718): XLA:GPU does not support executing code compiled without
diff --git a/tensorflow/compiler/xla/tests/slice_test.cc b/tensorflow/compiler/xla/tests/slice_test.cc
index 52195db2aa74710b901dd7744a670764a034e96b..5653bf11a7364bf9ed79bcb6b53f7db31f454803 100644
--- a/tensorflow/compiler/xla/tests/slice_test.cc
+++ b/tensorflow/compiler/xla/tests/slice_test.cc
@@ -197,9 +197,10 @@ class SliceR1Test : public ClientLibraryTestBase,
     // vector<bool>.
     tensorflow::gtl::InlinedVector<NativeT, 1> input(spec.input_dim0);
     std::iota(input.begin(), input.end(), NativeT());
+    auto literal = Literal::CreateR1<NativeT>(input);
 
     XlaBuilder builder(TestName());
-    auto original = builder.ConstantR1<NativeT>(input);
+    auto original = builder.Parameter(0, literal->shape(), "p0");
     builder.Slice(original, {spec.slice_start}, {spec.slice_limit},
                   {spec.slice_stride});
 
@@ -210,7 +211,9 @@ class SliceR1Test : public ClientLibraryTestBase,
       expected.push_back(i);
     }
 
-    ComputeAndCompareR1<NativeT>(&builder, expected, {});
+    TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<GlobalData> arg,
+                            client_->TransferToServer(*literal));
+    ComputeAndCompareR1<NativeT>(&builder, expected, {arg.get()});
   }
 };
 
@@ -365,15 +368,18 @@ XLA_TEST_P(SliceR2Test, DoIt) {
   const R2Spec& spec = GetParam();
   Array2D<int32> input(spec.input_dim0, spec.input_dim1);
   input.FillUnique();
+  auto literal = Literal::CreateR2FromArray2DWithLayout(
+      input, LayoutUtil::MakeLayout(spec.layout));
 
   XlaBuilder builder(TestName());
-  auto a = builder.ConstantR2FromArray2DWithLayout<int32>(
-      input, LayoutUtil::MakeLayout(spec.layout));
+  auto a = builder.Parameter(0, literal->shape(), "p0");
   builder.Slice(a, spec.slice_starts, spec.slice_limits, spec.slice_strides);
 
+  TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<GlobalData> arg,
+                          client_->TransferToServer(*literal));
   std::unique_ptr<Array2D<int32>> expected = ReferenceUtil::Slice2D(
       input, spec.slice_starts, spec.slice_limits, spec.slice_strides);
-  ComputeAndCompareR2<int32>(&builder, *expected, {});
+  ComputeAndCompareR2<int32>(&builder, *expected, {arg.get()});
 }
 
 INSTANTIATE_TEST_CASE_P(
@@ -453,7 +459,7 @@ class SliceR4Test : public ClientLibraryTestBase,
   void Run(const R4Spec& spec) {
     Array4D<float> values(spec.input_dims[0], spec.input_dims[1],
                           spec.input_dims[2], spec.input_dims[3]);
-    values.FillRandom(3.14f);
+    values.FillIota(3.14159);
     auto expected = ReferenceUtil::Slice4D(
         values, spec.slice_starts, spec.slice_limits, spec.slice_strides);
     XlaBuilder builder(TestName());
diff --git a/tensorflow/compiler/xla/tests/test_utils.cc b/tensorflow/compiler/xla/tests/test_utils.cc
index de1865138802bc72e9a4b2db7a21343b0d327108..dd7c541733634213606b5a7983b59bb1f14bf75c 100644
--- a/tensorflow/compiler/xla/tests/test_utils.cc
+++ b/tensorflow/compiler/xla/tests/test_utils.cc
@@ -26,6 +26,7 @@ namespace {
 template <typename FloatT, typename GeneratorT>
 void PopulateWithRandomFloatingPointDataImpl(Literal* literal,
                                              std::minstd_rand0* engine) {
+  CHECK(engine != nullptr);
   CHECK_EQ(literal->shape().element_type(),
            primitive_util::NativeToPrimitiveType<FloatT>());
   // Create uniform numbers between 1 and 1.125 to avoid creating denormal
@@ -59,12 +60,14 @@ void PopulateWithRandomFloatingPointDataImpl(Literal* literal,
 template <typename FloatT>
 void PopulateWithRandomFloatingPointData(Literal* literal,
                                          std::minstd_rand0* engine) {
+  CHECK(engine != nullptr);
   PopulateWithRandomFloatingPointDataImpl<FloatT, FloatT>(literal, engine);
 }
 
 template <>
 void PopulateWithRandomFloatingPointData<half>(Literal* literal,
                                                std::minstd_rand0* engine) {
+  CHECK(engine != nullptr);
   PopulateWithRandomFloatingPointDataImpl<half, float>(literal, engine);
 }
 
@@ -73,6 +76,7 @@ void PopulateWithRandomFloatingPointData<half>(Literal* literal,
 template <>
 void PopulateWithRandomFloatingPointData<bfloat16>(Literal* literal,
                                                    std::minstd_rand0* engine) {
+  CHECK(engine != nullptr);
   CHECK_EQ(literal->shape().element_type(), BF16);
   std::uniform_real_distribution<float> generator(-0.9f, 1.0f);
   TF_CHECK_OK(literal->Populate<bfloat16>(
@@ -84,6 +88,7 @@ void PopulateWithRandomFloatingPointData<bfloat16>(Literal* literal,
 template <typename IntT>
 void PopulateWithRandomIntegralData(Literal* literal,
                                     std::minstd_rand0* engine) {
+  CHECK(engine != nullptr);
   CHECK_EQ(literal->shape().element_type(),
            primitive_util::NativeToPrimitiveType<IntT>());
   std::uniform_int_distribution<IntT> generator(
@@ -107,6 +112,9 @@ StatusOr<std::unique_ptr<Literal>> MakeFakeLiteralInternal(
     }
     return Literal::MakeTupleOwned(std::move(elements));
   }
+  if (engine == nullptr) {
+    return Literal::CreateFromShape(shape);
+  }
   auto literal = MakeUnique<Literal>(shape);
   switch (shape.element_type()) {
     case BF16:
@@ -201,11 +209,13 @@ std::unique_ptr<Literal> MakeRandomNonwrappingSliceIndex(
     std::minstd_rand0* engine) {
   const int64 rank = ShapeUtil::Rank(input_shape);
   std::vector<int32> start_indices(rank);
-  for (int i = 0; i < rank; ++i) {
-    const int32 upper_bound = ShapeUtil::GetDimension(input_shape, i) -
-                              ShapeUtil::GetDimension(slice_shape, i);
-    std::uniform_int_distribution<int32> generator(0, upper_bound);
-    start_indices[i] = generator(*engine);
+  if (engine != nullptr) {
+    for (int i = 0; i < rank; ++i) {
+      const int32 upper_bound = ShapeUtil::GetDimension(input_shape, i) -
+                                ShapeUtil::GetDimension(slice_shape, i);
+      std::uniform_int_distribution<int32> generator(0, upper_bound);
+      start_indices[i] = generator(*engine);
+    }
   }
   return Literal::CreateR1<int32>(start_indices);
 }
@@ -321,20 +331,21 @@ StatusOr<std::unique_ptr<Literal>> MakeConstrainedArgument(
 
 }  // namespace
 
-StatusOr<std::unique_ptr<Literal>> MakeFakeLiteral(const Shape& shape) {
-  std::minstd_rand0 engine;
-  return MakeFakeLiteralInternal(shape, &engine);
+StatusOr<std::unique_ptr<Literal>> MakeFakeLiteral(const Shape& shape,
+                                                   bool pseudo_random) {
+  auto engine = pseudo_random ? MakeUnique<std::minstd_rand0>() : nullptr;
+  return MakeFakeLiteralInternal(shape, engine.get());
 }
 
 StatusOr<std::vector<std::unique_ptr<Literal>>> MakeFakeArguments(
-    HloModule* const module) {
+    HloModule* const module, bool pseudo_random) {
   TF_ASSIGN_OR_RETURN(auto dataflow, HloDataflowAnalysis::Run(*module));
   const auto params = module->entry_computation()->parameter_instructions();
-  std::minstd_rand0 engine;
+  auto engine = pseudo_random ? MakeUnique<std::minstd_rand0>() : nullptr;
   std::vector<std::unique_ptr<Literal>> arguments(params.size());
   for (int i = 0; i < params.size(); ++i) {
-    TF_ASSIGN_OR_RETURN(
-        arguments[i], MakeConstrainedArgument(*dataflow, *params[i], &engine));
+    TF_ASSIGN_OR_RETURN(arguments[i], MakeConstrainedArgument(
+                                          *dataflow, *params[i], engine.get()));
   }
   return std::move(arguments);
 }
diff --git a/tensorflow/compiler/xla/tests/test_utils.h b/tensorflow/compiler/xla/tests/test_utils.h
index f483cdebea5c7c8a43e73ab57748a93c97bb78d7..a8689f64981569ceb7c8a712f8ece00c99e8cf2d 100644
--- a/tensorflow/compiler/xla/tests/test_utils.h
+++ b/tensorflow/compiler/xla/tests/test_utils.h
@@ -55,16 +55,28 @@ class PseudorandomGenerator {
 };
 
 // Generates fake data in a literal of the given shape, or returns an error
-// status if the element type is currently unhandled for fake data generation.
-StatusOr<std::unique_ptr<Literal>> MakeFakeLiteral(const Shape& shape);
+// status if the element type is currently unhandled for fake data
+// generation. See below for documentation of pseudo_random.
+StatusOr<std::unique_ptr<Literal>> MakeFakeLiteral(const Shape& shape,
+                                                   bool pseudo_random = true);
 
 // Generates a vector of arguments containing fake data. The number, shape and
 // layout of the arguments is appropriate for given HLO module.
 //
 // Will handle special cases such as making sure that indices used for dynamic
 // slices are bounded, reduces that call adds use 0 as an init value, etc.
+//
+// If pseudo_random is true, the generated numbers will be generated
+// deterministically in a pseudo random way unless the values are constrated to
+// be e.g. init values as above. If pseudo_random is false, the returned values
+// will be generated in a faster way that yields less interesting data, e.g. the
+// values may all be just the same value.
+//
+// TODO(b/79942829): Make interesting argument generation fast enough that using
+// pseudo_random does not save any noticeable amount of time so that the
+// parameter can be removed.
 StatusOr<std::vector<std::unique_ptr<Literal>>> MakeFakeArguments(
-    HloModule* const module);
+    HloModule* const module, bool pseudo_random = true);
 
 // Check that a given module satisfies various constraints before trying to
 // execute it.
diff --git a/tensorflow/compiler/xla/tests/token_hlo_test.cc b/tensorflow/compiler/xla/tests/token_hlo_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..4585244ce81c14ab6d4d629bb7d208d73c82248d
--- /dev/null
+++ b/tensorflow/compiler/xla/tests/token_hlo_test.cc
@@ -0,0 +1,124 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include <array>
+
+#include "tensorflow/compiler/xla/service/hlo_verifier.h"
+#include "tensorflow/compiler/xla/tests/hlo_test_base.h"
+#include "tensorflow/compiler/xla/tests/test_macros.h"
+#include "tensorflow/core/lib/strings/str_util.h"
+#include "tensorflow/core/lib/strings/strcat.h"
+#include "tensorflow/core/platform/test.h"
+#include "tensorflow/core/platform/types.h"
+
+namespace xla {
+namespace {
+
+class TokenHloTest : public HloTestBase {};
+
+// TODO(b/79770375): Compile, not just verify the HLO module when the backends
+// support kGenerateToken.
+XLA_TEST_F(TokenHloTest, SingleTokenInstruction) {
+  std::unique_ptr<HloModule> module = CreateNewModule();
+  auto builder = HloComputation::Builder(TestName());
+  builder.AddInstruction(HloInstruction::CreateGenerateToken({}));
+  builder.AddInstruction(
+      HloInstruction::CreateConstant(Literal::CreateR0<int32>(42)));
+
+  module->AddEntryComputation(builder.Build());
+  EXPECT_IS_OK(HloVerifier().Run(module.get()).status());
+}
+
+XLA_TEST_F(TokenHloTest, TokenTree) {
+  std::unique_ptr<HloModule> module = CreateNewModule();
+  auto builder = HloComputation::Builder(TestName());
+  auto token0 = builder.AddInstruction(HloInstruction::CreateGenerateToken({}));
+  auto token1 = builder.AddInstruction(HloInstruction::CreateGenerateToken({}));
+  auto token2 = builder.AddInstruction(HloInstruction::CreateGenerateToken({}));
+  builder.AddInstruction(
+      HloInstruction::CreateGenerateToken({token0, token0, token1, token2}));
+  builder.AddInstruction(
+      HloInstruction::CreateConstant(Literal::CreateR0<int32>(42)));
+
+  module->AddEntryComputation(builder.Build());
+  EXPECT_IS_OK(HloVerifier().Run(module.get()).status());
+}
+
+XLA_TEST_F(TokenHloTest, InvalidTokenShapedEntryParameter) {
+  std::unique_ptr<HloModule> module = CreateNewModule();
+  auto builder = HloComputation::Builder(TestName());
+  builder.AddInstruction(
+      HloInstruction::CreateParameter(0, ShapeUtil::MakeShape(F32, {}), "p0"));
+  builder.AddInstruction(
+      HloInstruction::CreateParameter(1, ShapeUtil::MakeTokenShape(), "p1"));
+  builder.AddInstruction(
+      HloInstruction::CreateConstant(Literal::CreateR0<int32>(42)));
+  module->AddEntryComputation(builder.Build());
+
+  Status status = HloVerifier().Run(module.get()).status();
+  ASSERT_IS_NOT_OK(status);
+  EXPECT_THAT(
+      status.error_message(),
+      ::testing::HasSubstr("Entry parameter 1 is or contains a token shape"));
+}
+
+XLA_TEST_F(TokenHloTest, InvalidTupleTokenShapedEntryParameter) {
+  std::unique_ptr<HloModule> module = CreateNewModule();
+  auto builder = HloComputation::Builder(TestName());
+  builder.AddInstruction(HloInstruction::CreateParameter(
+      0,
+      ShapeUtil::MakeTupleShape(
+          {ShapeUtil::MakeShape(F32, {1, 2, 3}), ShapeUtil::MakeTokenShape()}),
+      "param"));
+  module->AddEntryComputation(builder.Build());
+
+  Status status = HloVerifier().Run(module.get()).status();
+  ASSERT_IS_NOT_OK(status);
+  EXPECT_THAT(
+      status.error_message(),
+      ::testing::HasSubstr("Entry parameter 0 is or contains a token shape"));
+}
+
+XLA_TEST_F(TokenHloTest, InvalidTokenRoot) {
+  std::unique_ptr<HloModule> module = CreateNewModule();
+  auto builder = HloComputation::Builder(TestName());
+  builder.AddInstruction(HloInstruction::CreateGenerateToken({}));
+  module->AddEntryComputation(builder.Build());
+
+  Status status = HloVerifier().Run(module.get()).status();
+  ASSERT_IS_NOT_OK(status);
+  EXPECT_THAT(status.error_message(),
+              ::testing::HasSubstr("Entry root is or contains a token shape"));
+}
+
+XLA_TEST_F(TokenHloTest, InvalidOperandToTokenInstruction) {
+  std::unique_ptr<HloModule> module = CreateNewModule();
+  auto builder = HloComputation::Builder(TestName());
+  auto param = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, ShapeUtil::MakeShape(F32, {}), "p0"));
+  builder.AddInstruction(HloInstruction::CreateGenerateToken({param}));
+  builder.AddInstruction(
+      HloInstruction::CreateConstant(Literal::CreateR0<int32>(123)));
+  module->AddEntryComputation(builder.Build());
+
+  Status status = HloVerifier().Run(module.get()).status();
+  ASSERT_IS_NOT_OK(status);
+  EXPECT_THAT(status.error_message(),
+              ::testing::HasSubstr(
+                  "Operands of token instructions must be TOKEN types"));
+}
+
+}  // namespace
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/tools/BUILD b/tensorflow/compiler/xla/tools/BUILD
index 415cf9c16a2613913265d0342e5ab9932de5eb19..e4a052c8f1c0009619c3a94606f6384d04006e4e 100644
--- a/tensorflow/compiler/xla/tools/BUILD
+++ b/tensorflow/compiler/xla/tools/BUILD
@@ -85,7 +85,9 @@ cc_library(
         "//tensorflow/compiler/xla/client:global_data",
         "//tensorflow/compiler/xla/client:local_client",
         "//tensorflow/compiler/xla/client/lib:testing",
+        "//tensorflow/compiler/xla/service:hlo_parser",
         "//tensorflow/compiler/xla/service:hlo_proto",
+        "//tensorflow/compiler/xla/service/gpu:infeed_manager",
         "//tensorflow/compiler/xla/tests:test_utils",
         "//tensorflow/core:framework_internal",
         "//tensorflow/core:lib",
@@ -134,7 +136,7 @@ tf_cc_binary(
     deps = [
         "//tensorflow/compiler/xla:statusor",
         "//tensorflow/compiler/xla:types",
-        "//tensorflow/compiler/xla/service:session_proto",
+        "//tensorflow/compiler/xla/service:hlo_proto",
         "//tensorflow/core:lib",
     ],
 )
@@ -163,7 +165,6 @@ tf_cc_binary(
         "//tensorflow/compiler/xla/client:client_library",
         "//tensorflow/compiler/xla/client:local_client",
         "//tensorflow/compiler/xla/service",
-        "//tensorflow/compiler/xla/service:computation_tracker",
         "//tensorflow/compiler/xla/service:hlo_proto",
         "//tensorflow/compiler/xla/service:interpreter_plugin",
         "//tensorflow/core:lib",
diff --git a/tensorflow/compiler/xla/tools/convert_computation.cc b/tensorflow/compiler/xla/tools/convert_computation.cc
index fe03a6e7bdfe99877c250fe1ae22beee4c8018a2..14d01b5bfb067cc39abc4d6e0605007624b6e0ae 100644
--- a/tensorflow/compiler/xla/tools/convert_computation.cc
+++ b/tensorflow/compiler/xla/tools/convert_computation.cc
@@ -21,7 +21,7 @@ limitations under the License.
 #include <unistd.h>
 #include <string>
 
-#include "tensorflow/compiler/xla/service/session.pb.h"
+#include "tensorflow/compiler/xla/service/hlo.pb.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/types.h"
 #include "tensorflow/core/platform/env.h"
@@ -33,7 +33,7 @@ namespace xla {
 namespace tools {
 
 void RealMain(const string& mode, const string& path) {
-  SessionModule module;
+  HloSnapshot module;
   tensorflow::Env* env = tensorflow::Env::Default();
   if (mode == "txt2bin") {
     TF_CHECK_OK(tensorflow::ReadTextProto(env, path, &module));
diff --git a/tensorflow/compiler/xla/tools/dumped_computation_to_text.cc b/tensorflow/compiler/xla/tools/dumped_computation_to_text.cc
index b815bbf854b82b323da7879c230a1026cae96625..5dd5150be339846d0775880931f615b92c5b08d8 100644
--- a/tensorflow/compiler/xla/tools/dumped_computation_to_text.cc
+++ b/tensorflow/compiler/xla/tools/dumped_computation_to_text.cc
@@ -20,7 +20,6 @@ limitations under the License.
 #include "tensorflow/compiler/xla/client/client.h"
 #include "tensorflow/compiler/xla/client/client_library.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/service/computation_tracker.h"
 #include "tensorflow/compiler/xla/service/hlo.pb.h"
 #include "tensorflow/compiler/xla/service/service.h"
 #include "tensorflow/compiler/xla/statusor.h"
diff --git a/tensorflow/compiler/xla/tools/parser/BUILD b/tensorflow/compiler/xla/tools/parser/BUILD
deleted file mode 100644
index 0fa4b98d0a41a1e7c681bb2302da3b752315867b..0000000000000000000000000000000000000000
--- a/tensorflow/compiler/xla/tools/parser/BUILD
+++ /dev/null
@@ -1,72 +0,0 @@
-# Build file for the Hlo parser.
-
-licenses(["notice"])  # Apache 2.0
-
-package(
-    default_visibility = [":friends"],
-)
-
-package_group(
-    name = "friends",
-    includes = [
-        "//tensorflow/compiler/xla:friends",
-    ],
-)
-
-# Filegroup used to collect source files for dependency checking.
-filegroup(
-    name = "c_srcs",
-    data = glob([
-        "**/*.cc",
-        "**/*.h",
-    ]),
-)
-
-load("//tensorflow:tensorflow.bzl", "tf_cc_test")
-
-cc_library(
-    name = "hlo_lexer",
-    srcs = ["hlo_lexer.cc"],
-    hdrs = [
-        "hlo_lexer.h",
-        "hlo_token.h",
-    ],
-    deps = [
-        "//tensorflow/compiler/xla:shape_util",
-        "//tensorflow/compiler/xla:statusor",
-        "//tensorflow/compiler/xla:types",
-        "//tensorflow/compiler/xla:util",
-        "//tensorflow/compiler/xla:xla_data_proto",
-        "//tensorflow/core:lib",
-        "//tensorflow/core:regexp_internal",
-    ],
-)
-
-cc_library(
-    name = "hlo_parser",
-    srcs = ["hlo_parser.cc"],
-    hdrs = ["hlo_parser.h"],
-    deps = [
-        ":hlo_lexer",
-        "//tensorflow/compiler/xla:literal_util",
-        "//tensorflow/compiler/xla:shape_util",
-        "//tensorflow/compiler/xla:statusor",
-        "//tensorflow/compiler/xla:util",
-        "//tensorflow/compiler/xla:xla_data_proto",
-        "//tensorflow/compiler/xla/service:hlo",
-        "//tensorflow/core:lib",
-        "//tensorflow/core:lib_internal",
-    ],
-)
-
-tf_cc_test(
-    name = "hlo_parser_test",
-    size = "small",
-    srcs = ["hlo_parser_test.cc"],
-    deps = [
-        ":hlo_parser",
-        "//tensorflow/core:lib",
-        "//tensorflow/core:test",
-        "//tensorflow/core:test_main",
-    ],
-)
diff --git a/tensorflow/compiler/xla/tools/replay_computation.cc b/tensorflow/compiler/xla/tools/replay_computation.cc
index df0501386c1e4de9111fbb6b2d9e8ec372dbf41e..f7574e0b1cc95daee6d6743ba4e2e490ee87e7c6 100644
--- a/tensorflow/compiler/xla/tools/replay_computation.cc
+++ b/tensorflow/compiler/xla/tools/replay_computation.cc
@@ -24,6 +24,9 @@ limitations under the License.
 // passing --use_fake_data on the command line.  If the real data is available
 // in the proto and --use_fake_data is false, the real data is used.
 //
+// Input can be a binary HloSnapshot proto, a binary HloProto proto, or a
+// textual HLO string.
+//
 // The output format is:
 //
 // file_path: computation_name :: type:literal_str
@@ -41,7 +44,9 @@ limitations under the License.
 #include "tensorflow/compiler/xla/client/local_client.h"
 #include "tensorflow/compiler/xla/execution_options_util.h"
 #include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/service/gpu/infeed_manager.h"
 #include "tensorflow/compiler/xla/service/hlo.pb.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/status_macros.h"
 #include "tensorflow/compiler/xla/statusor.h"
@@ -63,105 +68,177 @@ namespace {
 // fields.
 struct Options {
   string fake_infeed_shape;
+  bool generate_fake_infeed = false;
   bool use_fake_data = false;
   bool print_result = true;
   int num_runs = 1;
-  bool xla_hlo_profile_last_run = false;
 };
 
 // Invokes the given computation passing arbitrary data for every (unbound)
 // parameter if use_fake_data, Otherwise use recorded data if available.
 //
-// Similarly, infeeds fake data of shape fake_infeed_shape if it is provided;
-// otherwise, no infeed is performed.
-StatusOr<std::unique_ptr<Literal>> ReplayComputation(const HloSnapshot& module,
-                                                     Client* client,
-                                                     const Options& opts) {
-  TF_ASSIGN_OR_RETURN(auto computation, client->LoadSnapshot(module));
-
-  std::vector<std::unique_ptr<GlobalData>> arguments;
+// Similarly, infeeds fake data of shape fake_infeed_shape if it is provided.
+// If generate_fake_infeed is true, the required infeed shape is derived from
+// the computation and then used to provide a fake infeed shape.
+//
+// If neither generate_fake_infeed is true nor a fake_infeed_shape is provided,
+// no infeed is performed.
+StatusOr<Literal> ReplayComputation(const HloSnapshot& module,
+                                    LocalClient* client, const Options& opts) {
+  XlaComputation computation(module.hlo().hlo_module());
+
+  // Build the `argument_ptrs` vector, which contains ShapedBuffer*s to our
+  // arguments.  This is a bit involved, because we may have to convert from
+  // GlobalData to ShapedBuffer*, and we have to manage the lifetime of all our
+  // objects.
+  std::vector<ScopedShapedBuffer> scoped_shaped_buffer_arguments;
+  std::vector<std::unique_ptr<GlobalData>> global_data_arguments;
+  std::vector<const ShapedBuffer*> argument_ptrs;
   if (opts.use_fake_data) {
-    arguments = MakeFakeArgumentsOrDie(computation, client);
+    global_data_arguments = MakeFakeArgumentsOrDie(computation, client);
+    for (const auto& data : global_data_arguments) {
+      argument_ptrs.push_back(
+          client->GlobalDataToShapedBuffer(data->handle(), /*device_ordinal=*/0)
+              .ValueOrDie());
+    }
   } else {  // use recorded data if available
     for (const auto& proto : module.arguments()) {
       TF_ASSIGN_OR_RETURN(std::unique_ptr<xla::Literal> literal,
                           Literal::CreateFromProto(proto));
-      TF_ASSIGN_OR_RETURN(std::unique_ptr<GlobalData> data,
-                          client->TransferToServer(*literal));
-      arguments.push_back(std::move(data));
+      TF_ASSIGN_OR_RETURN(
+          ScopedShapedBuffer data,
+          client->LiteralToShapedBuffer(*literal, /*device_ordinal=*/0));
+      scoped_shaped_buffer_arguments.push_back(std::move(data));
+    }
+    for (const auto& argument : scoped_shaped_buffer_arguments) {
+      argument_ptrs.push_back(&argument);
     }
   }
 
+  bool provide_infeed = false;
+  Shape infeed_shape;
+  if (!opts.fake_infeed_shape.empty()) {
+    StatusOr<Shape> shape_status =
+        ShapeUtil::ParseShapeString(opts.fake_infeed_shape);
+    TF_CHECK_OK(shape_status.status());
+    infeed_shape = std::move(shape_status).ValueOrDie();
+    provide_infeed = true;
+  } else if (opts.generate_fake_infeed) {
+    for (const auto& comp : computation.proto().computations()) {
+      for (const auto& instruction : comp.instructions()) {
+        if (instruction.opcode() == HloOpcodeString(HloOpcode::kInfeed)) {
+          CHECK(!provide_infeed)
+              << "--generate_fake_infeed only works if the model has 0 or 1 "
+                 "infeed ops, but this one has >= 2.";
+          provide_infeed = true;
+          infeed_shape = instruction.shape();
+          LOG(INFO) << "Generating fake infeed shape for inferred shape: "
+                    << ShapeUtil::HumanString(infeed_shape);
+        }
+      }
+    }
+  }
   // We only instantiate the thread pool if the user has requested that a
-  // concurrent infeed occur via the fake_infeed_shape.
+  // concurrent infeed occur via the fake_infeed_shape, or when
+  // --generate_fake_infeed is passed and there exists an infeed operation in
+  // the HloSnapshot.
   tensorflow::gtl::optional<tensorflow::thread::ThreadPool> pool;
-
-  if (!opts.fake_infeed_shape.empty()) {
+  std::unique_ptr<Literal> data;
+  if (provide_infeed) {
+    data = std::move(MakeFakeLiteral(infeed_shape)).ValueOrDie();
+  }
+  auto transfer_infeed = [&data, client]() {
+    TF_CHECK_OK(client->TransferToInfeed(*data));
+  };
+  if (provide_infeed) {
     pool.emplace(tensorflow::Env::Default(), "infeed",
                  /*num_threads=*/1);
-    pool->Schedule([opts, client]() {
-      StatusOr<Shape> shape_status =
-          ShapeUtil::ParseShapeString(opts.fake_infeed_shape);
-      TF_CHECK_OK(shape_status.status());
-      Shape shape = std::move(shape_status).ValueOrDie();
-      StatusOr<std::unique_ptr<Literal>> data_status = MakeFakeLiteral(shape);
-      TF_CHECK_OK(data_status.status());
-      std::unique_ptr<Literal> data = std::move(data_status).ValueOrDie();
-      while (true) {
-        TF_CHECK_OK(client->TransferToInfeed(*data));
-      }
+    pool->Schedule([transfer_infeed]() {
+      // There may be several infeed buffers needed, however we don't know how
+      // many. If we proactively transfer too many infeed buffers, we may run
+      // out of memory. If we transfer too few infeed buffers, the program will
+      // hang. Therefore, we register a callback that is called when the infeed
+      // becomes empty, and in this callback we will transfer another fake
+      // infeed.
+      auto infeed_manager = xla::gpu::GetOrCreateInfeedManager();
+      infeed_manager->RegisterOnEmptyCallback(transfer_infeed);
+      transfer_infeed();
     });
   }
 
-  std::vector<GlobalData*> execute_arguments;
-  execute_arguments.reserve(arguments.size());
-  for (auto& argument : arguments) {
-    execute_arguments.push_back(argument.get());
+  std::vector<const Shape*> argument_layouts;
+  for (const auto& param : computation.proto().program_shape().parameters()) {
+    argument_layouts.push_back(&param);
   }
+  std::unique_ptr<LocalExecutable> executable =
+      client->Compile(computation, argument_layouts, ExecutableBuildOptions())
+          .ValueOrDie();
 
   // Run the computation num_runs times, and return the result from the last
   // execution.
-  std::unique_ptr<Literal> result;
+  StreamExecutorMemoryAllocator allocator(
+      client->platform(),
+      {client->platform()->ExecutorForDevice(0).ValueOrDie()});
+  tensorflow::gtl::optional<ScopedShapedBuffer> result;
   for (int i = 0; i < opts.num_runs; ++i) {
     ExecutionProfile profile;
-    ExecutionOptions execution_options = CreateDefaultExecutionOptions();
-    if (opts.xla_hlo_profile_last_run && i == opts.num_runs - 1) {
-      execution_options.mutable_debug_options()->set_xla_hlo_profile(true);
-    }
+    ExecutableRunOptions run_options;
+    run_options.set_execution_profile(&profile);
+    run_options.set_allocator(&allocator);
 
-    if (opts.print_result) {
-      TF_ASSIGN_OR_RETURN(
-          result, client->ExecuteAndTransfer(computation, execute_arguments,
-                                             &execution_options, &profile));
-    } else {
-      // If we're not printing the result, execute the computation but don't
-      // bother retrieving the result.  This can be a significant speedup.
-      TF_RETURN_IF_ERROR(client
-                             ->Execute(computation, execute_arguments,
-                                       &execution_options, &profile)
-                             .status());
-    }
+    TF_ASSIGN_OR_RETURN(result, executable->Run(argument_ptrs, run_options));
     LOG(INFO) << "Execution took "
               << static_cast<double>(profile.compute_time_ns()) / 1e9 << "s";
   }
 
-  return std::move(result);
+  // Check that --num_runs > 0, otherwise *result below will fail with an
+  // unhelpful error (because the loop didn't run any iterations).
+  CHECK_GT(opts.num_runs, 0) << "--num_runs must be > 0";
+  TF_ASSIGN_OR_RETURN(std::unique_ptr<Literal> result_literal,
+                      client->ShapedBufferToLiteral(*result));
+  return std::move(*result_literal);
 }
 
-int RealMain(tensorflow::gtl::ArraySlice<char*> args, const Options& opts) {
-  Client* client = ClientLibrary::LocalClientOrDie();
+StatusOr<HloSnapshot> ParseInputFile(const string& filename,
+                                     const Options& opts) {
   tensorflow::Env* env = tensorflow::Env::Default();
+  HloSnapshot snapshot;
+  if (tensorflow::ReadBinaryProto(env, filename, &snapshot).ok()) {
+    return snapshot;
+  }
+  CHECK(opts.use_fake_data)
+      << "Without --use_fake_data, you must pass an HloSnapshot -- HloProto "
+         "and textual HLO don't carry real data.";
+  fprintf(stderr, "%s: is not HloSnapshot. Trying HloProto.\n",
+          filename.c_str());
+
+  if (tensorflow::ReadBinaryProto(env, filename, snapshot.mutable_hlo()).ok()) {
+    return snapshot;
+  }
+  fprintf(stderr, "%s: is not HloProto. Trying HLO text.\n", filename.c_str());
+  string contents;
+  TF_RETURN_IF_ERROR(tensorflow::ReadFileToString(env, filename, &contents));
+  StatusOr<std::unique_ptr<HloModule>> module = ParseHloString(contents);
+  if (module.ok()) {
+    *snapshot.mutable_hlo()->mutable_hlo_module() =
+        module.ValueOrDie()->ToProto();
+    return snapshot;
+  }
+  fprintf(stderr, "%s: is not HLO text.  Nothing left to try.\n",
+          filename.c_str());
+  return InvalidArgument("Could not parse %s.", filename.c_str());
+}
+
+int RealMain(tensorflow::gtl::ArraySlice<char*> args, const Options& opts) {
+  LocalClient* client = ClientLibrary::LocalClientOrDie();
   int exit_status = EXIT_SUCCESS;
   for (char* arg : args) {
-    HloSnapshot snapshot;
-    auto status = tensorflow::ReadBinaryProto(env, arg, &snapshot);
-    if (!status.ok()) {
-      fprintf(stderr, "%s: is not HloSnapshot: %s.\n", arg,
-              status.ToString().c_str());
+    StatusOr<HloSnapshot> maybe_snapshot = ParseInputFile(arg, opts);
+    if (!maybe_snapshot.ok()) {
       continue;
     }
-    StatusOr<std::unique_ptr<Literal>> result_status =
-        ReplayComputation(snapshot, client, opts);
+    HloSnapshot snapshot = std::move(maybe_snapshot).ValueOrDie();
+    StatusOr<Literal> result_status = ReplayComputation(snapshot, client, opts);
     if (!result_status.ok()) {
       fprintf(stderr, "%s: error: %s\n", arg,
               result_status.status().ToString().c_str());
@@ -169,12 +246,12 @@ int RealMain(tensorflow::gtl::ArraySlice<char*> args, const Options& opts) {
       continue;
     }
 
-    std::unique_ptr<Literal> result = result_status.ConsumeValueOrDie();
-    if (result != nullptr) {
+    if (opts.print_result) {
+      Literal result = std::move(result_status).ValueOrDie();
       fprintf(stdout, "%s: %s :: %s:%s\n", arg,
               snapshot.hlo().hlo_module().name().c_str(),
-              ShapeUtil::HumanString(result->shape()).c_str(),
-              result->ToString().c_str());
+              ShapeUtil::HumanString(result.shape()).c_str(),
+              result.ToString().c_str());
       if (snapshot.has_result()) {
         std::unique_ptr<Literal> literal =
             Literal::CreateFromProto(snapshot.result()).ConsumeValueOrDie();
@@ -204,9 +281,9 @@ int main(int argc, char** argv) {
                        "Number of times to run each computation"),
       tensorflow::Flag("fake_infeed_shape", &opts.fake_infeed_shape,
                        "Shape of fake data to construct for (infinite) infeed"),
-      tensorflow::Flag(
-          "xla_hlo_profile_last_run", &opts.xla_hlo_profile_last_run,
-          "Pass --xla_hlo_profile the last time we run the computation."),
+      tensorflow::Flag("generate_fake_infeed", &opts.generate_fake_infeed,
+                       "Whether a fake infeed shape should be generated "
+                       "derived from the computation"),
   };
   xla::string usage = tensorflow::Flags::Usage(argv[0], flag_list);
   bool parse_ok = tensorflow::Flags::Parse(&argc, argv, flag_list);
diff --git a/tensorflow/compiler/xla/util.h b/tensorflow/compiler/xla/util.h
index be33bd6dd1304fa8fc6e5aed1d4c4d65bf97e692..b4f45cc972d3d397ddff8e8d9163d1fef387392f 100644
--- a/tensorflow/compiler/xla/util.h
+++ b/tensorflow/compiler/xla/util.h
@@ -218,6 +218,12 @@ Status Unavailable(const char* format, ...) TF_PRINTF_ATTRIBUTE(1, 2);
 // Passed-varargs variant of the InvalidArgument factory above.
 Status InvalidArgumentV(const char* format, va_list args);
 
+template <typename... Args>
+Status InvalidArgumentStrCat(Args&&... concat) {
+  return InvalidArgument(
+      "%s", tensorflow::strings::StrCat(std::forward<Args>(concat)...).c_str());
+}
+
 template <typename... Args>
 Status UnimplementedStrCat(Args&&... concat) {
   return Unimplemented(
@@ -486,6 +492,12 @@ bool c_is_sorted(const C& c) {
   return std::is_sorted(std::begin(c), std::end(c));
 }
 
+template <typename C, typename Compare>
+bool c_is_sorted(const C& c, Compare&& comp) {
+  return std::is_sorted(std::begin(c), std::end(c),
+                        std::forward<Compare>(comp));
+}
+
 template <typename C>
 auto c_adjacent_find(const C& c) -> decltype(std::begin(c)) {
   return std::adjacent_find(std::begin(c), std::end(c));
@@ -514,12 +526,29 @@ typename std::decay<T>::type c_accumulate(const Sequence& sequence, T&& init,
                          std::forward<BinaryOp>(binary_op));
 }
 
+template <typename C, typename Pred>
+typename std::iterator_traits<
+    decltype(std::begin(std::declval<C>()))>::difference_type
+c_count_if(const C& c, Pred&& pred) {
+  return std::count_if(std::begin(c), std::end(c), std::forward<Pred>(pred));
+}
+
 template <typename C, typename Value>
 int64 FindIndex(const C& c, Value&& value) {
   auto it = c_find(c, std::forward<Value>(value));
   return std::distance(c.begin(), it);
 }
 
+template <typename C, typename Value>
+void InsertAt(C* c, int64 index, Value&& value) {
+  c->insert(c->begin() + index, std::forward<Value>(value));
+}
+
+template <typename C>
+void EraseAt(C* c, int64 index) {
+  c->erase(c->begin() + index);
+}
+
 // Returns true if `x` fits in 32-bits.
 template <typename T>
 bool IsInt32(T x) {
diff --git a/tensorflow/compiler/xla/xla.proto b/tensorflow/compiler/xla/xla.proto
index f619b8dc24038af64a27fc0565c74447ca9d09cf..6f07e4606bef015214f2c564515c8258a906205b 100644
--- a/tensorflow/compiler/xla/xla.proto
+++ b/tensorflow/compiler/xla/xla.proto
@@ -17,7 +17,6 @@ syntax = "proto3";
 
 import "tensorflow/compiler/xla/xla_data.proto";
 import "tensorflow/compiler/xla/service/hlo.proto";
-import "tensorflow/compiler/xla/service/session.proto";
 
 package xla;
 
@@ -226,22 +225,6 @@ message ExecutionOptions {
   repeated DeviceHandle device_handles = 5;
 }
 
-message SnapshotComputationRequest {
-  ComputationHandle computation = 1;
-}
-
-message SnapshotComputationResponse {
-  SessionModule module = 1;
-}
-
-message LoadComputationSnapshotRequest {
-  SessionModule module = 1;
-}
-
-message LoadComputationSnapshotResponse {
-  ComputationHandle computation = 1;
-}
-
 message GetDeviceHandlesRequest {
   int64 device_count = 1;
 }
@@ -300,11 +283,6 @@ message ResetDeviceRequest {
 message ResetDeviceResponse {
 }
 
-message ComputationStatsRequest {
-  ComputationHandle computation = 1;
-  DebugOptions debug_options = 2;
-}
-
 message ComputationGraphStatsRequest {
   HloModuleProto computation = 1;
   DebugOptions debug_options = 2;
@@ -314,14 +292,6 @@ message ComputationStatsResponse {
   ComputationStats stats = 1;
 }
 
-message ComputationRequest {
-  string name = 1;
-}
-
-message ComputationResponse {
-  ComputationHandle computation = 1;
-}
-
 message CreateChannelHandleRequest {
 }
 
@@ -336,24 +306,6 @@ message UnregisterRequest {
 message UnregisterResponse {
 }
 
-message SetReturnValueRequest {
-  ComputationHandle computation = 1;
-  ComputationDataHandle operand = 2;
-}
-
-message SetReturnValueResponse {
-}
-
-message ExecuteRequest {
-  reserved 3, 4;
-
-  ComputationHandle computation = 1;
-  repeated GlobalDataHandle arguments = 2;
-
-  // Options that affect how XLA compiles and runs code to service this request.
-  ExecutionOptions execution_options = 5;
-}
-
 message ExecuteGraphRequest {
   HloModuleProto computation = 1;
   repeated GlobalDataHandle arguments = 2;
@@ -362,10 +314,6 @@ message ExecuteGraphRequest {
   ExecutionOptions execution_options = 3;
 }
 
-message ExecuteParallelRequest {
-  repeated ExecuteRequest requests = 1;
-}
-
 message ExecuteGraphParallelRequest {
   repeated ExecuteGraphRequest requests = 1;
 }
@@ -379,21 +327,6 @@ message ExecuteParallelResponse {
   repeated ExecuteResponse responses = 1;
 }
 
-message ExecuteAsyncRequest {
-  reserved 3, 4;
-
-  ComputationHandle computation = 1;
-  repeated GlobalDataHandle arguments = 2;
-
-  // Options that affect how XLA compiles and runs code to service this request.
-  ExecutionOptions execution_options = 6;
-}
-
-message ExecuteAsyncResponse {
-  // A handle to the execution launched asynchronously.
-  ExecutionHandle execution = 1;
-}
-
 message WaitForExecutionRequest {
   ExecutionHandle execution = 1;
 }
@@ -403,31 +336,13 @@ message WaitForExecutionResponse {
   ExecutionProfile profile = 2;
 }
 
-message IsConstantRequest {
-  ComputationHandle computation = 1;
-  ComputationDataHandle operand = 2;
-  int64 num_parameters = 3;
-}
-
-message IsConstantResponse {
-  bool is_constant = 1;
-}
-
-message ComputeConstantRequest {
-  ComputationHandle computation = 1;
-  ComputationDataHandle operand = 2;
-  Layout output_layout = 3;
-  repeated LiteralProto parameters = 4;
-}
-
 message ComputeConstantGraphRequest {
   HloModuleProto computation = 1;
   Layout output_layout = 2;
 }
 
 message ComputeConstantResponse {
-  // A LiteralProto is returned directly for this request, instead of a
-  // ComputationDataHandle.
+  // A LiteralProto is returned directly for this request.
   LiteralProto literal = 1;
 }
 
@@ -469,14 +384,6 @@ message LoadDataResponse {
   int64 nanoseconds = 5;
 }
 
-message SpecializeRequest {
-  ComputationHandle computation = 1;
-  repeated GlobalDataHandle arguments = 2;
-}
-
-message SpecializeResponse {
-}
-
 message GetShapeRequest {
   GlobalDataHandle data = 1;
 }
@@ -485,14 +392,6 @@ message GetShapeResponse {
   Shape shape = 1;
 }
 
-message GetComputationShapeRequest {
-  ComputationHandle computation = 1;
-}
-
-message GetComputationShapeResponse {
-  ProgramShape program_shape = 1;
-}
-
 message UnpackRequest {
   GlobalDataHandle data = 1;
 }
diff --git a/tensorflow/compiler/xla/xla_data.proto b/tensorflow/compiler/xla/xla_data.proto
index b895ac045c361b2336e0081eadf16334d49d3bee..0af73e8a93060f4569ddef9697b89a6fa2b8674b 100644
--- a/tensorflow/compiler/xla/xla_data.proto
+++ b/tensorflow/compiler/xla/xla_data.proto
@@ -66,11 +66,16 @@ enum PrimitiveType {
   // in the dimensions field.
   TUPLE = 13;
 
-  // An opaque type used for passing context specific data to a custom
-  // operation.
+  // An opaque type used for passing context-specific data to a custom
+  // operation. Shapes of this primitive type will have empty dimensions and
+  // tuple_shapes fields.
   OPAQUE = 14;
 
-  // Next = 17
+  // A token type threaded between side-effecting operations. Shapes of this
+  // primitive type will have empty dimensions and tuple_shapes fields.
+  TOKEN = 17;
+
+  // Next = 18
 }
 
 // Describes the value held inside padding elements.
@@ -271,12 +276,6 @@ message ExecutionProfile {
   int64 compute_and_transfer_time_ns = 5;
 }
 
-// Handle given to a user that represents a computation that the user builds up
-// before execution.
-message ComputationHandle {
-  int64 handle = 1;
-}
-
 // Handle given to a user that represents an execution that the user launched
 // asynchronously on the device.
 message ExecutionHandle {
@@ -290,13 +289,6 @@ message GlobalDataHandle {
   int64 handle = 1;
 }
 
-// Handle given to a user that represents a data result in a computation.
-// This is used to pass to subsequent computations that depends upon the data as
-// an operand.
-message ComputationDataHandle {
-  int64 handle = 1;
-}
-
 // Handle given to a user that represents a replicated virtual device. Each
 // replicated device represents N physical devices for execution where N is the
 // number of replicas.
@@ -436,44 +428,6 @@ message GatherDimensionNumbers {
   int64 index_vector_dim = 4;
 }
 
-// Operation requests that are all collected as a tagged union with a oneof
-// field in OpRequest.
-
-message ConstantRequest {
-  LiteralProto literal = 2;
-}
-
-message GetTupleElementRequest {
-  ComputationDataHandle operand = 2;
-  int64 index = 3;
-}
-
-message SliceRequest {
-  ComputationDataHandle operand = 2;
-  repeated int64 start_indices = 3;
-  repeated int64 limit_indices = 4;
-  repeated int64 strides = 5;
-}
-
-message DynamicSliceRequest {
-  // Operand from which to slice at dynamic 'start_indices'.
-  ComputationDataHandle operand = 2;
-  // Dynamically computed 'start_indices' for slice operation.
-  ComputationDataHandle start_indices = 3;
-  // Slice sizes for each dimension (note that indices calculations are computed
-  // modulo dimension sizes to avoid out-of-bound array accesses).
-  repeated int64 slice_sizes = 4;
-}
-
-message DynamicUpdateSliceRequest {
-  // Operand on which slice 'update' is to be applied.
-  ComputationDataHandle operand = 2;
-  // The slice update to apply to 'operand'.
-  ComputationDataHandle update = 3;
-  // Dynamically computed start indices for the update slice operation.
-  ComputationDataHandle start_indices = 4;
-}
-
 message ConvolutionDimensionNumbers {
   // The number of the dimension that represents batch in the input.
   int64 input_batch_dimension = 7;
@@ -511,13 +465,6 @@ message ConvolutionDimensionNumbers {
   // Next = 13
 };
 
-message ConvolveRequest {
-  ComputationDataHandle lhs = 2;
-  ComputationDataHandle rhs = 3;  // This is the filter/kernel.
-  Window window = 4;              // Describes the filter/kernel.
-  ConvolutionDimensionNumbers dimension_numbers = 5;
-}
-
 enum FftType {
   FFT = 0;    // Forward FFT; complex in, complex out.
   IFFT = 1;   // Inverse FFT; complex in, complex out.
@@ -526,56 +473,6 @@ enum FftType {
               //                   fft_length real out
 }
 
-message FftRequest {
-  FftType fft_type = 1;
-  repeated int64 fft_length = 2;  // Multivalent for higher-order FFT.
-  ComputationDataHandle operand = 3;
-}
-
-message InfeedRequest {
-  // The shape of the data returned by reading the device's infeed buffer.
-  Shape shape = 2;
-
-  // Additional infeed configuration for the backend.
-  bytes config = 3;
-}
-
-message OutfeedRequest {
-  // The shape of the data returned by reading the device's outfeed buffer.
-  Shape shape = 1;
-
-  // Operand to the Outfeed. Supports tuple.
-  ComputationDataHandle operand = 2;
-
-  // Backend-specific information for how to perform the outfeed.
-  bytes outfeed_config = 3;
-}
-
-message CallRequest {
-  ComputationHandle to_apply = 2;
-  repeated ComputationDataHandle operands = 3;
-}
-
-message CustomCallRequest {
-  string call_target_name = 2;
-  repeated ComputationDataHandle operands = 3;
-  Shape shape = 4;
-}
-
-message HostComputeRequest {
-  // Operand to the HostCompute. Supports tuple.
-  repeated ComputationDataHandle operands = 1;
-
-  // Name used to identify HostSend/Recv channels.
-  string channel_name = 2;
-
-  // Cost estimate in nanoseconds.
-  int64 cost_estimate_ns = 3;
-
-  // The shape of any data returned by host.
-  Shape shape = 4;
-}
-
 message DotDimensionNumbers {
   // The dimension numbers that represent the 'lhs' contracting dimensions.
   repeated int64 lhs_contracting_dimensions = 1;
@@ -587,297 +484,6 @@ message DotDimensionNumbers {
   repeated int64 rhs_batch_dimensions = 4;
 };
 
-message DotRequest {
-  ComputationDataHandle lhs = 2;
-  ComputationDataHandle rhs = 3;
-  DotDimensionNumbers dimension_numbers = 4;
-}
-
-message MapRequest {
-  repeated ComputationDataHandle operands = 2;
-  ComputationHandle to_apply = 3;
-  repeated ComputationDataHandle static_operands = 4;
-  // The dimensions over which to map.
-  // Example mapping a Dot operation along the batch dimension 0:
-  //   operand0.shape = [2, 2, 2], operand1.shape = [2,2,3]
-  //   Map({operand0, operand1}, Dot, {0})
-  repeated int64 dimensions = 5;
-}
-
-message ReduceRequest {
-  // Operand to the reduction.
-  ComputationDataHandle operand = 2;
-
-  // Initial value for the reduction. This must be consistent with the result
-  // shape of to_apply.
-  ComputationDataHandle init_value = 3;
-
-  // The dimensions to reduce over.
-  repeated int64 dimensions = 4;
-
-  // The computation to apply in the reduction.
-  ComputationHandle to_apply = 5;
-}
-
-message ReduceWindowRequest {
-  ComputationDataHandle operand = 2;
-  ComputationDataHandle init_value = 3;
-  Window window = 4;
-  ComputationHandle to_apply = 5;
-}
-
-message BatchNormTrainingRequest {
-  ComputationDataHandle operand = 1;
-  ComputationDataHandle scale = 2;
-  ComputationDataHandle offset = 3;
-  float epsilon = 4;
-  int64 feature_index = 5;
-}
-
-message BatchNormInferenceRequest {
-  ComputationDataHandle operand = 1;
-  ComputationDataHandle scale = 2;
-  ComputationDataHandle offset = 3;
-  ComputationDataHandle mean = 4;
-  ComputationDataHandle variance = 5;
-  float epsilon = 6;
-  int64 feature_index = 7;
-}
-
-message BatchNormGradRequest {
-  ComputationDataHandle operand = 1;
-  ComputationDataHandle scale = 2;
-  ComputationDataHandle mean = 3;
-  ComputationDataHandle variance = 4;
-  ComputationDataHandle grad_output = 5;
-  float epsilon = 6;
-  int64 feature_index = 7;
-}
-
-message CrossReplicaSumRequest {
-  ComputationDataHandle operand = 2;
-}
-
-message SelectAndScatterRequest {
-  // Operand array on which the windows slide.
-  ComputationDataHandle operand = 2;
-
-  // Source array for the data to scatter.
-  ComputationDataHandle source = 3;
-
-  // Initial scalar value for each element in the output.
-  ComputationDataHandle init_value = 4;
-
-  // Window configuration.
-  Window window = 5;
-
-  // Binary function used to select an element from each window.
-  ComputationHandle select = 6;
-
-  // Binary function used to combine each scattered value from source with the
-  // current output value at the selected location.
-  ComputationHandle scatter = 7;
-}
-
-message ReverseRequest {
-  ComputationDataHandle operand = 2;
-  repeated int64 dimensions = 3;
-}
-
-message BroadcastRequest {
-  ComputationDataHandle operand = 2;
-  repeated int64 broadcast_sizes = 3;
-}
-
-message PadRequest {
-  ComputationDataHandle operand = 2;
-  ComputationDataHandle padding_value = 3;
-  PaddingConfig padding_config = 4;
-}
-
-message ReshapeRequest {
-  ComputationDataHandle operand = 2;
-
-  // The dimension order for collapse (from fastest-changing to slowest).
-  repeated int64 dimensions = 3;
-
-  // The new dimension sizes (from dimension 0 to n-1).
-  repeated int64 new_sizes = 4;
-}
-
-message TransposeRequest {
-  ComputationDataHandle operand = 2;
-
-  // The permutation of the operand's dimensions (in the range 0 to n-1).
-  repeated int64 dimensions = 3;
-}
-
-message ParameterRequest {
-  Shape shape = 2;
-  int64 parameter = 3;
-  string name = 4;
-}
-
-message GetLocalShapeRequest {
-  ComputationHandle computation = 1;
-  ComputationDataHandle operand = 2;
-}
-
-message GetLocalShapeResponse {
-  Shape shape = 1;
-}
-
-message TraceRequest {
-  string tag = 2;
-  ComputationDataHandle operand = 3;
-}
-
-message ConvertRequest {
-  ComputationDataHandle operand = 2;
-  PrimitiveType new_element_type = 3;
-}
-
-message ConcatenateRequest {
-  repeated ComputationDataHandle operands = 2;
-  // The dimension in which we concatenate; e.g. if you had dimension arrays of
-  // [4, 1] and [5, 1], you'd concatenate in dimension 0 to produce a [9, 1].
-  // Attempting to concatenate those in dimension 1 would produce an error, as
-  // 4 != 5 (and there is no ragged array support).
-  int64 dimension = 3;
-}
-
-message ConditionalRequest {
-  ComputationDataHandle predicate = 2;
-  ComputationDataHandle true_operand = 3;
-  ComputationHandle true_computation = 4;
-  ComputationDataHandle false_operand = 5;
-  ComputationHandle false_computation = 6;
-}
-
-message WhileRequest {
-  ComputationHandle condition = 2;
-  ComputationHandle body = 3;
-  ComputationDataHandle init = 4;
-}
-
-enum UnaryOperation {
-  UNOP_INVALID = 0;
-
-  // Elementwise, logical negation on booleans and bitwise negation on ints.
-  UNOP_NOT = 1;
-
-  // Elementwise, computes e^x.
-  UNOP_EXP = 2;
-
-  // Elementwise, computes -x.
-  UNOP_NEGATE = 3;
-
-  // Puts the elements in the operand into sorted order.
-  UNOP_SORT = 4;
-
-  // Elementwise, computes tanh(x).
-  UNOP_TANH = 5;
-
-  // Elementwise, computes the natural logarithm of x.
-  UNOP_LOG = 6;
-
-  // Elementwise, computes the floor of x.
-  UNOP_FLOOR = 7;
-
-  // Elementwise, computes the ceil of x.
-  UNOP_CEIL = 8;
-
-  // Elementwise, computes the abs of x.
-  UNOP_ABS = 9;
-
-  // Elementwise, computes the sign of x.
-  UNOP_SIGN = 10;
-
-  // Elementwise, tests if values are finite (not NaN or inf)
-  UNOP_IS_FINITE = 11;
-
-  // Elementwise, computes the cosine of x.
-  UNOP_COS = 12;
-
-  // Elementwise, computes the sine of x.
-  UNOP_SIN = 13;
-
-  // Elementwise, rounds x to nearest integral value, rounding half-way cases
-  // away from zero.
-  UNOP_ROUND_NEAREST_AFZ = 14;
-
-  // Elementwise, extract real component of complex x.
-  UNOP_REAL = 15;
-
-  // Elementwise, extract real component of complex x.
-  UNOP_IMAG = 16;
-
-  // Elementwise, computes clz(x).
-  UNOP_CLZ = 17;
-
-  // Elementwise, computes exp(x)-1.
-  UNOP_EXPM1 = 18;
-
-  // Elementwise, computes log(x+1).
-  UNOP_LOG1P = 19;
-}
-
-message UnaryOpRequest {
-  UnaryOperation unop = 2;
-  ComputationDataHandle operand = 3;
-}
-
-enum BinaryOperation {
-  BINOP_INVALID = 0;
-
-  // Arithmetic operations.
-  BINOP_ADD = 1;
-  BINOP_DIV = 2;
-  BINOP_MUL = 3;
-  BINOP_SUB = 4;
-
-  // Comparison operators.
-  BINOP_EQ = 5;
-  BINOP_GE = 6;
-  BINOP_GT = 7;
-  BINOP_LE = 8;
-  BINOP_LT = 9;
-  BINOP_NE = 10;
-
-  // Element-wise maximum.
-  BINOP_MAX = 14;
-
-  // Element-wise minimum.
-  BINOP_MIN = 15;
-
-  // Raises the left-hand-side to the right-hand-side power.
-  BINOP_POW = 16;
-
-  // Remainder operation.
-  BINOP_REM = 17;
-
-  // Element-wise, logical operators on booleans and bitwise operators on ints.
-  BINOP_AND = 18;
-  BINOP_OR = 19;
-
-  BINOP_SHIFT_LEFT = 20;
-  BINOP_SHIFT_RIGHT_ARITHMETIC = 21;
-  BINOP_SHIFT_RIGHT_LOGICAL = 22;
-
-  // Complex from real, imag.
-  BINOP_COMPLEX = 23;
-
-  // Computes the 4-quadrant arctangent of the y, x input arguments.
-  BINOP_ATAN2 = 24;
-}
-
-message BinaryOpRequest {
-  BinaryOperation binop = 2;
-  ComputationDataHandle lhs = 3;
-  ComputationDataHandle rhs = 4;
-  repeated int64 broadcast_dimensions = 5;
-}
-
 enum RandomDistribution {
   RNG_INVALID = 0;
 
@@ -892,67 +498,6 @@ enum RandomDistribution {
   // Next: 4
 }
 
-message RngRequest {
-  RandomDistribution distribution = 2;
-  repeated ComputationDataHandle parameter = 3;
-  Shape shape = 4;
-}
-
-enum TernaryOperation {
-  TRIOP_INVALID = 0;
-
-  // Given a predicate and two operands, selects operand0 if the predicate is
-  // true and operand1 if the predicate is false.
-  TRIOP_SELECT = 1;
-
-  // Given a min, max and an operand returns the operand if between min and max,
-  // else returns min if operand is less than min or max if operand is greater
-  // than max.
-  TRIOP_CLAMP = 3;
-}
-
-message TernaryOpRequest {
-  TernaryOperation triop = 2;
-  ComputationDataHandle lhs = 3;
-  ComputationDataHandle rhs = 4;
-  ComputationDataHandle ehs = 5;
-}
-
-enum VariadicOperation {
-  VAROP_INVALID = 0;
-
-  // Creates a tuple from its operands.
-  VAROP_TUPLE = 1;
-}
-
-message VariadicOpRequest {
-  VariadicOperation varop = 2;
-  repeated ComputationDataHandle operands = 3;
-}
-
-message ReducePrecisionRequest {
-  ComputationDataHandle operand = 1;
-  int32 exponent_bits = 2;
-  int32 mantissa_bits = 3;
-}
-
-message SendRequest {
-  ComputationDataHandle operand = 1;
-  ChannelHandle channel_handle = 2;
-}
-
-message RecvRequest {
-  Shape shape = 1;
-  ChannelHandle channel_handle = 2;
-}
-
-message GatherRequest {
-  ComputationDataHandle input = 1;
-  ComputationDataHandle gather_indices = 2;
-  GatherDimensionNumbers dimension_numbers = 3;
-  repeated int64 window_bounds = 4;
-}
-
 message OpSharding {
   enum Type {
     // This sharding is replicated across all devices (implies maximal,
@@ -983,59 +528,3 @@ message OpSharding {
   // to.
   repeated OpSharding tuple_shardings = 5;
 }
-
-message OpRequest {
-  ComputationHandle computation = 1;
-  OpMetadata metadata = 33;
-  OpSharding sharding = 40;
-
-  oneof op {
-    BinaryOpRequest binary_op_request = 2;
-    BroadcastRequest broadcast_request = 3;
-    CallRequest call_request = 4;
-    ConcatenateRequest concatenate_request = 5;
-    ConstantRequest constant_request = 6;
-    ConvertRequest convert_request = 7;
-    ConvolveRequest convolve_request = 8;
-    CrossReplicaSumRequest cross_replica_sum_request = 9;
-    CustomCallRequest custom_call_request = 10;
-    DotRequest dot_request = 43;
-    DynamicSliceRequest dynamic_slice_request = 11;
-    DynamicUpdateSliceRequest dynamic_update_slice_request = 12;
-    GetTupleElementRequest get_tuple_element_request = 13;
-    InfeedRequest infeed_request = 14;
-    MapRequest map_request = 15;
-    PadRequest pad_request = 16;
-    ParameterRequest parameter_request = 17;
-    ReducePrecisionRequest reduce_precision_request = 36;
-    ReduceRequest reduce_request = 18;
-    ReduceWindowRequest reduce_window_request = 19;
-    ReshapeRequest reshape_request = 20;
-    ReverseRequest reverse_request = 21;
-    RngRequest rng_request = 22;
-    SelectAndScatterRequest select_and_scatter_request = 23;
-    SliceRequest slice_request = 24;
-    TernaryOpRequest ternary_op_request = 25;
-    TraceRequest trace_request = 26;
-    TransposeRequest transpose_request = 34;
-    UnaryOpRequest unary_op_request = 27;
-    VariadicOpRequest variadic_op_request = 28;
-    WhileRequest while_request = 29;
-    SendRequest send_request = 30;
-    RecvRequest recv_request = 31;
-    OutfeedRequest outfeed_request = 32;
-    BatchNormTrainingRequest batch_norm_training_request = 35;
-    BatchNormGradRequest batch_norm_grad_request = 37;
-    BatchNormInferenceRequest batch_norm_inference_request = 38;
-    FftRequest fft_request = 41;
-    ConvertRequest bitcast_convert_request = 42;
-    ConditionalRequest conditional_request = 44;
-    HostComputeRequest host_compute_request = 45;
-    GatherRequest gather_request = 46;
-    // Next: 47
-  }
-}
-
-message OpResponse {
-  ComputationDataHandle output = 1;
-}
diff --git a/tensorflow/contrib/BUILD b/tensorflow/contrib/BUILD
index 0f9c80404ad33c39ae783e0bfa3cfb26e342fe3d..50b1ae5cc3cba2d6ac89c4415a3419ffdf7aec93 100644
--- a/tensorflow/contrib/BUILD
+++ b/tensorflow/contrib/BUILD
@@ -31,13 +31,15 @@ py_library(
         "//tensorflow/contrib/cluster_resolver:cluster_resolver_py",
         "//tensorflow/contrib/coder:coder_py",
         "//tensorflow/contrib/compiler:compiler_py",
+        "//tensorflow/contrib/autograph",
         "//tensorflow/contrib/constrained_optimization",
+        "//tensorflow/contrib/control_flow",
         "//tensorflow/contrib/copy_graph:copy_graph_py",
         "//tensorflow/contrib/crf:crf_py",
         "//tensorflow/contrib/cudnn_rnn:cudnn_rnn_py",
         "//tensorflow/contrib/data",
-        "//tensorflow/contrib/distribute:distribute",
         "//tensorflow/contrib/deprecated:deprecated_py",
+        "//tensorflow/contrib/distribute:distribute",
         "//tensorflow/contrib/distributions:distributions_py",
         "//tensorflow/contrib/eager/python:tfe",
         "//tensorflow/contrib/estimator:estimator_py",
@@ -83,7 +85,6 @@ py_library(
         "//tensorflow/contrib/proto",
         "//tensorflow/contrib/quantization:quantization_py",
         "//tensorflow/contrib/quantize:quantize_graph",
-        "//tensorflow/contrib/autograph",
         "//tensorflow/contrib/receptive_field:receptive_field_py",
         "//tensorflow/contrib/recurrent:recurrent_py",
         "//tensorflow/contrib/reduce_slice_ops:reduce_slice_ops_py",
diff --git a/tensorflow/contrib/__init__.py b/tensorflow/contrib/__init__.py
index 9aad772f0acd941d50d6ba238d345616195a6939..ad8c40395c2cdcc5e4288e04bb2115bd3627cdc9 100644
--- a/tensorflow/contrib/__init__.py
+++ b/tensorflow/contrib/__init__.py
@@ -30,6 +30,7 @@ from tensorflow.contrib import cluster_resolver
 from tensorflow.contrib import coder
 from tensorflow.contrib import compiler
 from tensorflow.contrib import constrained_optimization
+from tensorflow.contrib import control_flow
 from tensorflow.contrib import copy_graph
 from tensorflow.contrib import crf
 from tensorflow.contrib import cudnn_rnn
diff --git a/tensorflow/contrib/all_reduce/BUILD b/tensorflow/contrib/all_reduce/BUILD
index 62d1b1cf079d04d50e4899cfd9ba1d405ee1efb9..881808a98bfd688c2efaa8beb5b8f11a2527fee8 100644
--- a/tensorflow/contrib/all_reduce/BUILD
+++ b/tensorflow/contrib/all_reduce/BUILD
@@ -11,6 +11,16 @@ exports_files(["LICENSE"])
 
 load("//tensorflow:tensorflow.bzl", "tf_py_test")
 
+py_library(
+    name = "all_reduce_py",
+    srcs = ["__init__.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":all_reduce",
+        "//tensorflow/python:util",
+    ],
+)
+
 py_library(
     name = "all_reduce",
     srcs = [
diff --git a/tensorflow/contrib/all_reduce/__init__.py b/tensorflow/contrib/all_reduce/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..f9824f4cfbf83d9b001a58cafe582226e96c076f
--- /dev/null
+++ b/tensorflow/contrib/all_reduce/__init__.py
@@ -0,0 +1,39 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""All-reduce implementations."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+# pylint: disable=unused-import,line-too-long,wildcard-import
+from tensorflow.contrib.all_reduce.python.all_reduce import *
+
+from tensorflow.python.util.all_util import remove_undocumented
+# pylint: enable=unused-import,line-too-long,wildcard-import
+
+_allowed_symbols = [
+    'build_ring_all_reduce',
+    'build_recursive_hd_all_reduce',
+    'build_shuffle_all_reduce',
+    'build_nccl_all_reduce',
+    'build_nccl_then_ring',
+    'build_nccl_then_recursive_hd',
+    'build_nccl_then_shuffle',
+    'build_shuffle_then_ring',
+    'build_shuffle_then_shuffle'
+]
+
+remove_undocumented(__name__, allowed_exception_list=_allowed_symbols)
diff --git a/tensorflow/contrib/android/jni/run_stats_jni.cc b/tensorflow/contrib/android/jni/run_stats_jni.cc
index 707853b59befc2625145ad96952fbf9f66d62b43..30de7b59af79cb36ee266a15bb6e668c2e3f628a 100644
--- a/tensorflow/contrib/android/jni/run_stats_jni.cc
+++ b/tensorflow/contrib/android/jni/run_stats_jni.cc
@@ -16,6 +16,7 @@ limitations under the License.
 #include "tensorflow/contrib/android/jni/run_stats_jni.h"
 
 #include <jni.h>
+
 #include <sstream>
 
 #include "tensorflow/core/protobuf/config.pb.h"
@@ -73,7 +74,8 @@ JNIEXPORT jstring RUN_STATS_METHOD(summary)(JNIEnv* env, jclass clazz,
   StatSummarizer* s = requireHandle(env, handle);
   if (s == nullptr) return nullptr;
   std::stringstream ret;
-  ret << s->GetStatsByMetric("Top 10 CPU", StatSummarizer::BY_TIME, 10)
+  ret << s->GetStatsByMetric("Top 10 CPU", tensorflow::StatsCalculator::BY_TIME,
+                             10)
       << s->GetStatsByNodeType() << s->ShortSummary();
   return env->NewStringUTF(ret.str().c_str());
 }
diff --git a/tensorflow/contrib/autograph/CONTRIBUTING.md b/tensorflow/contrib/autograph/CONTRIBUTING.md
index a7a3fe1452d2a3e9c2a37a25ae96f541f8f939e0..a4aec8c74a9ad1418072471a5d3cde8c3b968a38 100644
--- a/tensorflow/contrib/autograph/CONTRIBUTING.md
+++ b/tensorflow/contrib/autograph/CONTRIBUTING.md
@@ -2,6 +2,9 @@
 
 We'd love to have your patches and contributions! Here are some guidelines. In general, we follow the [TensorFlow contributing guidelines](../../CONTRIBUTING.md), but have some [AutoGraph-specific style guidelines](STYLE_GUIDE.md). More details below.
 
+## TensorFlow Code of Conduct
+Please review and follow the [TensorFlow Code of Conduct](../../CODE_OF_CONDUCT.md).
+
 ## Contributor License Agreement
 
 Contributions to this project must be accompanied by a Contributor License
@@ -28,7 +31,7 @@ repository (with credit to the original author) and closes the pull request.
 
 ## Style
 
-See the [TensorFlow AutoGraph style guide](STYLE_GUIDE.md).
+See the [AutoGraph style guide](STYLE_GUIDE.md).
 
 ## Unit tests
 
diff --git a/tensorflow/contrib/autograph/STYLE_GUIDE.md b/tensorflow/contrib/autograph/STYLE_GUIDE.md
index 5618ec3e34499ad0f0b2a0d8b0ad04c11ee9bf9c..866e5f583a34570dfddc733f57561ed1d2b7c5bf 100644
--- a/tensorflow/contrib/autograph/STYLE_GUIDE.md
+++ b/tensorflow/contrib/autograph/STYLE_GUIDE.md
@@ -1,43 +1,26 @@
-# TensorFlow AutoGraph Style Guide
+# AutoGraph Style Guide
 
-This page contains style decisions that both developers and users of TensorFlow
-AutoGraph should follow to increase the readability of their code, reduce the
-number of errors, and promote consistency. We borrow many style principles from the TensorFlow Probability style guide.
+This page contains style decisions that developers should follow when
+contributing code to AutoGraph.
 
 ## TensorFlow Style
 
 Follow the [TensorFlow style
-guide](https://www.tensorflow.org/community/style_guide) and [documentation
-guide](https://www.tensorflow.org/community/documentation). Below are additional
-TensorFlow conventions not noted in those guides. In the future, these noted
-conventions may be moved upstream.
+guide](https://www.tensorflow.org/community/style_guide), the [documentation
+guide](https://www.tensorflow.org/community/documentation) and the
+[Google Python style guide](https://google.github.io/styleguide/pyguide.html).
+
+Naming conventions:
 
 1.  The name is TensorFlow, not Tensorflow.
 2.  The name is AutoGraph, not Autograph.
 
-## TensorFlow Code of Conduct
-Please review and follow the [TensorFlow Code of Conduct](../../CODE_OF_CONDUCT.md).
-
-## TensorFlow AutoGraph Style
+## AutoGraph Style
 
-Below are TensorFlow AutoGraph-specific conventions. In the event of conflict,
+Below are AutoGraph-specific conventions. In the event of conflict,
 it supercedes all previous conventions.
 
-1.  __Importing submodule aliases.__ Use the Pythonic style 
-`from tensorflow.contrib.autograph.converters import ifexp` and `from tensorflow.contrib import autograph as ag`.
-
-2.  __Examples in Docstrings.__ Write a `#### Examples` subsection below `Args`,
-    `Returns`, `Raises`, etc. to illustrate examples. If the docstring's last
-    line is a fence bracket (\`\`\`) closing a code snippet, add an empty line
-    before closing the docstring with \"\"\". This properly displays the code
-    snippet.
-
-    Justification: Users regularly need to remind themselves of args and
-    semantics. But rarely look at examples more than the first time. But since
-    examples are usually long (which is great!) it means they have to do a lot
-    of annoying scrolling ...unless Examples follow Args/Returns/Raises.
-
-3.  __Citations in Docstrings.__ Write a `#### References` subsection at the
+1.  __Citations in Docstrings.__ Write a `#### References` subsection at the
     bottom of any docstring with citations. Use ICLR’s bibliography style to
     write references; for example, order entries by the first author's last
     name. Add a link to the paper if the publication is open source (ideally,
@@ -77,21 +60,12 @@ it supercedes all previous conventions.
          https://arxiv.org/abs/1803.04386
     ```
 
-4.  When doing float math over literals eg use `1.` instead of `1` or `1.0`.
-
-    *   Using `1.` is another line of defense against an automatic casting
-        mistake. (Using `1.0` is also such a defense but is not minimal.)
-
-5.  Prefer using named args for functions' 2nd args onward.
-
-    *   Definitely use named args for 2nd args onward in docstrings.
-
-9.  Avoid LaTeX in docstrings.
+2.  Avoid LaTeX in docstrings.
 
     *   It is not rendered in many (if not most) editors and can be hard to read
         for both LaTeX experts and non-experts.
 
-10. Write docstring and comment math using ASCII friendly notation; python using
+3. Write docstring and comment math using ASCII friendly notation; python using
     operators. E.g., `x**2` better than `x^2`, `x[i, j]` better than `x_{i,j}`,
     `sum{ f(x[i]) : i=1...n }` better than `\sum_{i=1}^n f(x_i)` `int{sin(x) dx:
     x in [0, 2 pi]}` better than `\int_0^{2\pi} sin(x) dx`.
@@ -99,27 +73,3 @@ it supercedes all previous conventions.
     *   The more we stick to python style, the more someone can
         copy/paste/execute.
     *   Python style is usually easier to read as ASCII.
-
-11. All public functions require docstrings with: one line description, Args,
-    Returns, Raises (if raises exceptions).
-
-    *   Returns docstrings should be in the same format as Args, eg, of the form
-        "name: Description." Part of the rationale is that we are suggesting a
-        reasonable variable name for the returned object(s).
-
-12. Regard `*args` and/or `**kwargs` as features of last resort.
-
-    *   Keyword arguments make the intention of a function call more clear.
-    *   [Possible exceptions for
-        `kwargs`](https://stackoverflow.com/questions/1415812/why-use-kwargs-in-python-what-are-some-real-world-advantages-over-using-named).
-
-18. The `__init__.py` file for modules should use TensorFlow's
-    `remove_undocumented` feature, which seals the module's methods.
-
-21. Use `"{}".format()` rather than `"" %` for string formatting.
-
-    Justification: [PEP 3101](https://www.python.org/dev/peps/pep-3101/) and
-    [Python official
-    tutorials](https://docs.python.org/3.2/tutorial/inputoutput.html#old-string-formatting):
-    "...this old style of formatting will eventually be removed from the
-    language, str.format() should generally be used."
diff --git a/tensorflow/contrib/autograph/__init__.py b/tensorflow/contrib/autograph/__init__.py
index 3386c4eca4b93e850f6fe3c6239d29c61d787ece..dbdbad8f4c91c725294baa36acebbaf5b5e8cf5c 100644
--- a/tensorflow/contrib/autograph/__init__.py
+++ b/tensorflow/contrib/autograph/__init__.py
@@ -23,18 +23,37 @@ from __future__ import print_function
 
 # TODO(mdan): Bring only the relevant symbols to the top level.
 from tensorflow.contrib.autograph import utils
+from tensorflow.contrib.autograph import operators
 from tensorflow.contrib.autograph.impl.api import convert
 from tensorflow.contrib.autograph.impl.api import converted_call
 from tensorflow.contrib.autograph.impl.api import do_not_convert
 from tensorflow.contrib.autograph.impl.api import RunMode
 from tensorflow.contrib.autograph.impl.api import to_code
 from tensorflow.contrib.autograph.impl.api import to_graph
+from tensorflow.contrib.autograph.impl.directives import set_element_type
+from tensorflow.contrib.autograph.impl.directives import set_loop_options
+from tensorflow.contrib.autograph.impl.special_functions import stack
 from tensorflow.contrib.autograph.pyct.transformer import AutographParseError
 from tensorflow.python.util.all_util import remove_undocumented
 
 _allowed_symbols = [
-    'utils', 'convert', 'converted_call', 'do_not_convert', 'RunMode',
-    'to_code', 'to_graph', 'AutographParseError'
+    # Main API
+    'RunMode',
+    'convert',
+    'converted_call',
+    'do_not_convert',
+    'to_code',
+    'to_graph',
+    # Overloaded operators
+    'operators',
+    # Special functions and directives
+    'set_element_type',
+    'set_loop_options',
+    'stack',
+    # Exceptions
+    'AutographParseError',
+    # Utilities: to be removed
+    'utils',
 ]
 
 remove_undocumented(__name__, _allowed_symbols)
diff --git a/tensorflow/contrib/autograph/converters/BUILD b/tensorflow/contrib/autograph/converters/BUILD
index 8f9bffa55e44e4942bb3845945b3d440c7957cc9..284ad84be566199adaaa1ab641d37528ae4dfd2d 100644
--- a/tensorflow/contrib/autograph/converters/BUILD
+++ b/tensorflow/contrib/autograph/converters/BUILD
@@ -31,6 +31,7 @@ py_library(
         "name_scopes.py",
         "side_effect_guards.py",
         "single_return.py",
+        "slices.py",
     ],
     srcs_version = "PY2AND3",
     visibility = ["//tensorflow:__subpackages__"],
@@ -208,3 +209,14 @@ py_test(
         "//tensorflow/python:client_testlib",
     ],
 )
+
+py_test(
+    name = "slices_test",
+    srcs = ["slices_test.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":test_lib",
+        "//tensorflow/contrib/autograph/pyct",
+        "//tensorflow/python:client_testlib",
+    ],
+)
diff --git a/tensorflow/contrib/autograph/converters/break_statements.py b/tensorflow/contrib/autograph/converters/break_statements.py
index 35877224b87c1abda1a270be4869e9dcfd0cf97c..775d92c1d9f8bc35d1eda62f3f3ef7ee43414779 100644
--- a/tensorflow/contrib/autograph/converters/break_statements.py
+++ b/tensorflow/contrib/autograph/converters/break_statements.py
@@ -18,8 +18,6 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-import gast
-
 from tensorflow.contrib.autograph.pyct import anno
 from tensorflow.contrib.autograph.pyct import templates
 from tensorflow.contrib.autograph.pyct import transformer
@@ -34,14 +32,6 @@ CONTROL_VAR_NAME = 'control_var_name'
 class BreakStatementTransformer(transformer.Base):
   """Canonicalizes break statements into additional conditionals."""
 
-  def _track_body(self, nodes, break_var):
-    self.enter_local_scope()
-    self.set_local(CONTROL_VAR_NAME, break_var)
-    nodes = self.visit_block(nodes)
-    break_used = self.get_local(BREAK_USED, False)
-    self.exit_local_scope()
-    return nodes, break_used
-
   def visit_Break(self, node):
     self.set_local(BREAK_USED, True)
     var_name = self.get_local(CONTROL_VAR_NAME)
@@ -54,13 +44,9 @@ class BreakStatementTransformer(transformer.Base):
 
   def _guard_if_present(self, block, var_name):
     """Prevents the block from executing if var_name is set."""
-
-    # If we don't have statements that immediately depend on the break
-    # we still need to make sure that the break variable remains
-    # used, in case the break becomes useful in later stages of transformation.
-    # Not having this broke the break_in_inner_loop test.
     if not block:
-      block = [gast.Pass()]
+      return block
+
     template = """
         if not var_name:
           block
@@ -71,9 +57,17 @@ class BreakStatementTransformer(transformer.Base):
         block=block)
     return node
 
+  def _track_body(self, nodes, break_var):
+    self.enter_local_scope()
+    self.set_local(CONTROL_VAR_NAME, break_var)
+    nodes = self.visit_block(nodes)
+    break_used = self.get_local(BREAK_USED, False)
+    self.exit_local_scope()
+    return nodes, break_used
+
   def visit_While(self, node):
     scope = anno.getanno(node, NodeAnno.BODY_SCOPE)
-    break_var = self.context.namer.new_symbol('break__', scope.referenced)
+    break_var = self.context.namer.new_symbol('break_', scope.referenced)
 
     node.test = self.visit(node.test)
     node.body, break_used = self._track_body(node.body, break_var)
@@ -81,6 +75,10 @@ class BreakStatementTransformer(transformer.Base):
     node.orelse = self.visit_block(node.orelse)
 
     if break_used:
+      # Python's else clause only triggers if the loop exited cleanly (e.g.
+      # break did not trigger).
+      guarded_orelse = self._guard_if_present(node.orelse, break_var)
+
       template = """
         var_name = False
         while test and not var_name:
@@ -88,20 +86,18 @@ class BreakStatementTransformer(transformer.Base):
         else:
           orelse
       """
-      # Python's else clause only triggers if the loop exited cleanly (e.g.
-      # break did not trigger).
       node = templates.replace(
           template,
           var_name=break_var,
           test=node.test,
           body=node.body,
-          orelse=self._guard_if_present(node.orelse, break_var))
+          orelse=guarded_orelse)
 
     return node
 
   def visit_For(self, node):
     scope = anno.getanno(node, NodeAnno.BODY_SCOPE)
-    break_var = self.context.namer.new_symbol('break__', scope.referenced)
+    break_var = self.context.namer.new_symbol('break_', scope.referenced)
 
     node.target = self.visit(node.target)
     node.iter = self.visit(node.iter)
@@ -110,19 +106,32 @@ class BreakStatementTransformer(transformer.Base):
     node.orelse = self.visit_block(node.orelse)
 
     if break_used:
-      node.orelse = self._guard_if_present(node.orelse, break_var)
+      # Python's else clause only triggers if the loop exited cleanly (e.g.
+      # break did not trigger).
+      guarded_orelse = self._guard_if_present(node.orelse, break_var)
+      extra_test = templates.replace_as_expression(
+          'not var_name', var_name=break_var)
+
+      # The extra test is hidden in the AST, which will confuse the static
+      # analysis. To mitigate that, we insert a no-op statement that ensures
+      # the control variable is marked as used.
+      # TODO(mdan): Use a marker instead, e.g. ag__.condition_loop_on(var_name)
       template = """
         var_name = False
-        for_stmt
+        for target in iter_:
+          (var_name,)
+          body
+        else:
+          orelse
       """
-      # Python's else clause only triggers if the loop exited cleanly (e.g.
-      # break did not trigger).
       node = templates.replace(
           template,
           var_name=break_var,
-          for_stmt=node)
-      extra_test = templates.replace_as_expression(
-          'not var_name', var_name=break_var)
+          iter_=node.iter,
+          target=node.target,
+          body=node.body,
+          orelse=guarded_orelse)
+
       anno.setanno(node[1], 'extra_test', extra_test)
 
     return node
diff --git a/tensorflow/contrib/autograph/converters/builtin_functions.py b/tensorflow/contrib/autograph/converters/builtin_functions.py
index 317711a866f731de1b497295a2752dee0eb544f5..231e4ee35a72f51845a476d9f605986ac73b4676 100644
--- a/tensorflow/contrib/autograph/converters/builtin_functions.py
+++ b/tensorflow/contrib/autograph/converters/builtin_functions.py
@@ -31,9 +31,6 @@ class BuiltinFunctionTransformer(transformer.Base):
   TF equivalent, like `len`.
   """
 
-  def __init__(self, context):
-    super(BuiltinFunctionTransformer, self).__init__(context)
-
   def _convert_builtin(self, node):
     template = """
       ag__.utils.dynamic_builtin(func, args)
@@ -51,7 +48,7 @@ class BuiltinFunctionTransformer(transformer.Base):
     # TODO(mdan): This won't work if the function was hidden.
     # TODO(mdan): Rely on the live_val and use inspect_utils.is_builtin instead.
     if (isinstance(node.func, gast.Name) and
-        node.func.id in ('len', 'range', 'xrange')):
+        node.func.id in ('len', 'range', 'xrange', 'float', 'int')):
       return self._convert_builtin(node)
     # Print needs to be handled separately because it can be read as statement.
     if isinstance(node.func, gast.Name) and node.func.id == 'print':
diff --git a/tensorflow/contrib/autograph/converters/continue_statements.py b/tensorflow/contrib/autograph/converters/continue_statements.py
index 4299a8a9d59715d032222c47794bbb4393f34ce6..0417817a77e706fc0ce805f7391bea600f5fbb2d 100644
--- a/tensorflow/contrib/autograph/converters/continue_statements.py
+++ b/tensorflow/contrib/autograph/converters/continue_statements.py
@@ -24,103 +24,115 @@ from tensorflow.contrib.autograph.pyct import transformer
 from tensorflow.contrib.autograph.pyct.static_analysis.annos import NodeAnno
 
 
-class ContinueCanonicalizationTransformer(transformer.Base):
-  """Canonicalizes continue statements into additional conditionals."""
+# Tags for local state.
+CONTROL_VAR_NAME = 'control_var_name'
+CONTINUE_USED = 'continue_used'
+GUARD_CREATED = 'guard_created'
+CREATE_GUARD_NEXT = 'create_guard_next'
 
-  def __init__(self, context):
-    super(ContinueCanonicalizationTransformer, self).__init__(context)
-    # This is a stack structure, to correctly process nested loops.
-    self.continuation_uses = []
 
-  def _create_continuation_check(self):
-    template = """
-      if not var_name:
-        pass
-    """
-    cond, = templates.replace(template, var_name=self.continuation_uses[-1][1])
-    cond.body = []
-    return cond
+class ContinueCanonicalizationTransformer(transformer.Base):
+  """Canonicalizes continue statements into additional conditionals."""
 
-  def _create_continuation_trigger(self):
+  def visit_Continue(self, node):
+    self.set_local(CONTINUE_USED, True)
     template = """
       var_name = True
     """
-    assign, = templates.replace(
-        template, var_name=self.continuation_uses[-1][1])
-    return assign
-
-  def _create_continuation_init(self):
-    template = """
-      var_name = False
-    """
-    assign, = templates.replace(
-        template, var_name=self.continuation_uses[-1][1])
-    return assign
-
-  def _visit_and_reindent_if_necessary(self, nodes):
-    reorganized_nodes = []
-    current_dest = reorganized_nodes
-    continue_used_in_block = False
-    for i, n in enumerate(nodes):
-      # TODO(mdan): This could be optimized if control structures are simple.
-      self.continuation_uses[-1][0] = False
-      n = self.visit(n)
-      current_dest.append(n)
-      if self.continuation_uses[-1][0]:
-        continue_used_in_block = True
-        if i < len(nodes) - 1:  # Last statement in block needs no protection.
-          cond = self._create_continuation_check()
-          current_dest.append(cond)
-          current_dest = cond.body
-    self.continuation_uses[-1][0] = continue_used_in_block
-    return reorganized_nodes
-
-  def _process_loop_block(self, block, scope):
-    cont_var = self.context.namer.new_symbol('cont_requested', scope.referenced)
-    self.continuation_uses.append([False, cont_var])
-    block = self._visit_and_reindent_if_necessary(block)
-    if self.continuation_uses[-1][0]:
-      block.insert(0, self._create_continuation_init())
-    self.continuation_uses.pop()
-    return block
+    return templates.replace(
+        template, var_name=self.get_local(CONTROL_VAR_NAME))
+
+  def _postprocess_statement(self, node):
+    # Example of how the state machine below works:
+    #
+    #   1| stmt           # State: CONTINUE_USED = False
+    #    |                # Action: none
+    #   2| if cond:
+    #   3|   continue     # State: CONTINUE_USED = True,
+    #    |                #        GUARD_CREATED = False,
+    #    |                #        CREATE_GUARD_NEXT = False
+    #    |                # Action: set CREATE_GUARD_NEXT = True
+    #   4| stmt           # State: CONTINUE_USED = True,
+    #    |                #        GUARD_CREATED = False,
+    #    |                #        CREATE_GUARD_NEXT = True
+    #    |                # Action: create `if not continue_used`,
+    #    |                #         set GUARD_CREATED = True
+    #   5| stmt           # State: CONTINUE_USED = True, GUARD_CREATED = True
+    #    |                # Action: none (will be wrapped under previously
+    #    |                #         created if node)
+
+    if self.get_local(CONTINUE_USED, False):
+      if self.get_local(GUARD_CREATED, False):
+        return node, None
+
+      elif not self.get_local(CREATE_GUARD_NEXT, False):
+        self.set_local(CREATE_GUARD_NEXT, True)
+        return node, None
+
+      else:
+        self.set_local(GUARD_CREATED, True)
+        template = """
+          if not var_name:
+            original_node
+        """
+        cond, = templates.replace(
+            template,
+            var_name=self.get_local(CONTROL_VAR_NAME),
+            original_node=node)
+        return cond, cond.body
+    return node, None
+
+  def _visit_loop_body(self, node, nodes):
+    self.enter_local_scope()
+    scope = anno.getanno(node, NodeAnno.BODY_SCOPE)
+    continue_var = self.context.namer.new_symbol('continue_', scope.referenced)
+    self.set_local(CONTROL_VAR_NAME, continue_var)
+
+    nodes = self.visit_block(nodes, after_visit=self._postprocess_statement)
+
+    if self.get_local(CONTINUE_USED, False):
+      template = """
+        var_name = False
+      """
+      control_var_init = templates.replace(template, var_name=continue_var)
+      nodes = control_var_init + nodes
+
+    self.exit_local_scope()
+    return nodes
+
+  def _visit_non_loop_body(self, nodes):
+    self.enter_local_scope(inherit=(CONTROL_VAR_NAME,))
+    nodes = self.visit_block(nodes, after_visit=self._postprocess_statement)
+    continue_used = self.get_local(CONTINUE_USED, False)
+    self.exit_local_scope(keep=(CONTINUE_USED,))
+    return nodes, continue_used
 
   def visit_While(self, node):
-    self.generic_visit(node.test)
-    node.body = self._process_loop_block(node.body,
-                                         anno.getanno(node,
-                                                      NodeAnno.BODY_SCOPE))
-    for n in node.orelse:
-      self.generic_visit(n)
+    node.test = self.visit(node.test)
+    node.body = self._visit_loop_body(node, node.body)
+    # A continue in the else clause applies to the containing scope.
+    node.orelse, _ = self._visit_non_loop_body(node.orelse)
     return node
 
   def visit_For(self, node):
-    self.generic_visit(node.target)
-    self.generic_visit(node.iter)
-    node.body = self._process_loop_block(node.body,
-                                         anno.getanno(node,
-                                                      NodeAnno.BODY_SCOPE))
-    for n in node.orelse:
-      self.generic_visit(n)
+    node.target = self.generic_visit(node.target)
+    node.iter = self.generic_visit(node.iter)
+    node.body = self._visit_loop_body(node, node.body)
+    # A continue in the else clause applies to the containing scope.
+    node.orelse, _ = self._visit_non_loop_body(node.orelse)
     return node
 
   def visit_If(self, node):
-    if self.continuation_uses:
-      self.generic_visit(node.test)
-      node.body = self._visit_and_reindent_if_necessary(node.body)
-      continue_used_in_body = self.continuation_uses[-1][0]
-      node.orelse = self._visit_and_reindent_if_necessary(node.orelse)
-      self.continuation_uses[-1][0] = (
-          continue_used_in_body or self.continuation_uses[-1][0])
-    else:
-      node = self.generic_visit(node)
+    node.test = self.generic_visit(node.test)
+    node.body, continue_used_body = self._visit_non_loop_body(node.body)
+    node.orelse, continue_used_orelse = self._visit_non_loop_body(node.orelse)
+    self.set_local(CONTINUE_USED, continue_used_body or continue_used_orelse)
     return node
 
-  def visit_Continue(self, node):
-    self.continuation_uses[-1][0] = True
-    return self._create_continuation_trigger()
-
-  def visit_Break(self, node):
-    assert False, 'break statement should be desugared at this point'
+  def visit_With(self, node):
+    node.items = self.visit_block(node.items)
+    node.body, _ = self._visit_non_loop_body(node.body)
+    return node
 
 
 def transform(node, namer):
diff --git a/tensorflow/contrib/autograph/converters/control_flow_test.py b/tensorflow/contrib/autograph/converters/control_flow_test.py
index 1a863590f97add9bfa587d1142a09ae26a9fdb44..9d23d9b5b7e8e8480e04fccc1c8c81799abf382b 100644
--- a/tensorflow/contrib/autograph/converters/control_flow_test.py
+++ b/tensorflow/contrib/autograph/converters/control_flow_test.py
@@ -42,7 +42,7 @@ class ControlFlowTest(converter_test_base.TestCase):
     node = self.parse_and_analyze(test_fn, {})
     node = control_flow.transform(node, self.ctx)
 
-    with self.compiled(node, control_flow_ops.while_loop) as result:
+    with self.compiled(node) as result:
       with self.test_session() as sess:
         self.assertEqual((10, 5, 5),
                          sess.run(result.test_fn(constant_op.constant(5))))
@@ -57,7 +57,7 @@ class ControlFlowTest(converter_test_base.TestCase):
     node = self.parse_and_analyze(test_fn, {})
     node = control_flow.transform(node, self.ctx)
 
-    with self.compiled(node, control_flow_ops.while_loop) as result:
+    with self.compiled(node) as result:
       with self.test_session() as sess:
         self.assertEqual(0, sess.run(result.test_fn(constant_op.constant(5))))
 
@@ -75,7 +75,7 @@ class ControlFlowTest(converter_test_base.TestCase):
     node = self.parse_and_analyze(test_fn, {})
     node = control_flow.transform(node, self.ctx)
 
-    with self.compiled(node, control_flow_ops.cond) as result:
+    with self.compiled(node) as result:
       with self.test_session() as sess:
         self.assertEqual((-1, 0),
                          sess.run(result.test_fn(constant_op.constant(1))))
@@ -92,7 +92,7 @@ class ControlFlowTest(converter_test_base.TestCase):
     node = self.parse_and_analyze(test_fn, {})
     node = control_flow.transform(node, self.ctx)
 
-    with self.compiled(node, control_flow_ops.cond) as result:
+    with self.compiled(node) as result:
       with self.test_session() as sess:
         self.assertEqual(-1, sess.run(result.test_fn(constant_op.constant(1))))
 
diff --git a/tensorflow/contrib/autograph/converters/lists.py b/tensorflow/contrib/autograph/converters/lists.py
index b49521b2c328f418828a5e92890aa1b169384b70..c15dfff9e8ebd8b96fd4aff82459a6fd7d0ac8ab 100644
--- a/tensorflow/contrib/autograph/converters/lists.py
+++ b/tensorflow/contrib/autograph/converters/lists.py
@@ -33,82 +33,193 @@ from __future__ import print_function
 import gast
 
 from tensorflow.contrib.autograph.pyct import anno
+from tensorflow.contrib.autograph.pyct import parser
 from tensorflow.contrib.autograph.pyct import templates
 from tensorflow.contrib.autograph.pyct import transformer
-from tensorflow.python.framework import dtypes
+from tensorflow.contrib.autograph.pyct.static_analysis.annos import NodeAnno
+
+
+# Tags for local state.
+POP_USES = 'pop_uses'
 
 
 class ListTransformer(transformer.Base):
   """Converts lists and related operations to their TF counterpart."""
 
-  def _empty_list(self, node):
-    if not anno.hasanno(node, 'element_type'):
-      raise NotImplementedError(
-          'type inference for empty lists is not yet supported; '
-          'use set_element_type(<list>, <dtype>) to continue')
-    dtype = anno.getanno(node, 'element_type')
-    if not isinstance(dtype, dtypes.DType):
-      # TODO(mdan): Allow non-TF dtypes?
-      # That would be consistent with the dynamic dispatch pattern, but
-      # we must make sure that doesn't become confusing.
-      raise NotImplementedError('element type "%s" not yet supported' % dtype)
-
-    dtype_name = dtype.name
-    # TODO(mdan): Does it ever make sense not to use tensor lists?
+  def visit_List(self, node):
+    node = self.generic_visit(node)
     template = """
-      tf.TensorArray(tf.dtype_name, size=0, dynamic_size=True)
+      ag__.new_list(elements)
     """
-    return templates.replace_as_expression(template, dtype_name=dtype_name)
+    return templates.replace_as_expression(template, elements=node)
 
-  def _pre_populated_list(self, node):
-    raise NotImplementedError('pre-populated lists')
+  def _replace_append_call(self, node):
+    assert len(node.args) == 1
+    assert isinstance(node.func, gast.Attribute)
+    template = """
+      target = ag__.list_append(target, element)
+    """
+    return templates.replace(
+        template,
+        target=node.func.value,
+        element=node.args[0])
+
+  def _replace_pop_call(self, node):
+    # Expressions that use pop() are converted to a statement + expression.
+    #
+    # For example:
+    #
+    #   print(target.pop())
+    #
+    # ... is converted to:
+    #
+    #   target, target_pop = ag__.list_pop(target)
+    #   print(target_pop)
+    #
+    # Here, we just generate the variable name and swap it in,
+    # and _generate_pop_operation will handle the rest.
+    #
+    # Multiple uses of pop() are allowed:
+    #
+    #   print(tartget.pop(), target.pop())
+    #   print(tartget.pop().pop())
+    #
+    assert isinstance(node.func, gast.Attribute)
+    scope = anno.getanno(node, NodeAnno.ARGS_SCOPE)
+    target_node = node.func.value
+
+    # Attempt to use a related name if can get one. Otherwise use something
+    # generic.
+    if anno.hasanno(target_node, anno.Basic.QN):
+      target_name = anno.getanno(target_node, anno.Basic.QN).ssf()
+    else:
+      target_name = 'list'
+    pop_var_name = self.context.namer.new_symbol(target_name, scope.referenced)
+
+    pop_uses = self.get_local(POP_USES, [])
+    pop_uses.append((node, pop_var_name))
+    self.set_local(POP_USES, pop_uses)
+
+    return templates.replace_as_expression('var_name', var_name=pop_var_name)
+
+  def _replace_stack_call(self, node):
+    assert len(node.args) == 1
+    dtype = anno.getanno(
+        node.args[0],
+        'element_type',
+        default=templates.replace_as_expression('None'))
+    template = """
+      ag__.list_stack(
+          target,
+          opts=ag__.ListStackOpts(
+              element_dtype=dtype,
+              original_call=orig_call))
+    """
+    return templates.replace_as_expression(
+        template,
+        dtype=dtype,
+        target=node.args[0],
+        orig_call=node.func)
 
-  def visit_Expr(self, node):
+  def visit_Call(self, node):
     node = self.generic_visit(node)
-    if isinstance(node.value, gast.Call):
-      call_node = node.value
-
-      if not anno.hasanno(call_node.func, anno.Basic.QN):
-        return node
-      qn = anno.getanno(call_node.func, anno.Basic.QN)
-
-      if qn.qn[-1] == 'append' and (len(call_node.args) == 1):
-        template = """
-          target = ag__.utils.dynamic_list_append(target, element)
-        """
-        node = templates.replace(
-            template,
-            target=qn.parent.ast(),
-            element=call_node.args[0])
+
+    # TODO(mdan): This is insufficient if target is a function argument.
+    # In the case of function arguments, we need to add the list to the
+    # function's return value, because it is being modified.
+    # TODO(mdan): Checking just the name is brittle, can it be improved?
+    if isinstance(node.func, gast.Attribute):
+      func_name = node.func.attr
+      if func_name == 'append' and (len(node.args) == 1):
+        node = self._replace_append_call(node)
+      elif func_name == 'pop' and (len(node.args) <= 1):
+        node = self._replace_pop_call(node)
+      elif func_name == 'stack' and (len(node.args) == 1):
+        node = self._replace_stack_call(node)
+
     return node
 
-  def _replace_list_constructors(self, targets, values):
-    for target in targets:
-      if (isinstance(target, (gast.Tuple, gast.List)) and
-          isinstance(values, (gast.Tuple, gast.List))):
-        n_targets = len(target.elts)
-        for i in range(n_targets):
-          target_el, value_el = target.elts[i], values.elts[i]
-          values.elts[i] = self._replace_list_constructors(
-              (target_el,), value_el)
-        return values
-      if isinstance(values, gast.List):
-        if values.elts:
-          return self._pre_populated_list(values)
-        else:
-          return self._empty_list(values)
-    return values
-
-  def visit_Assign(self, node):
-    node = self.generic_visit(node)
+  def _generate_pop_operation(self, original_call_node, pop_var_name):
+    assert isinstance(original_call_node.func, gast.Attribute)
+
+    if original_call_node.args:
+      pop_element = original_call_node.args[0]
+    else:
+      pop_element = parser.parse_expression('None')
+    # The call will be something like "target.pop()", and the dtype is hooked to
+    # target, hence the func.value.
+    dtype = anno.getanno(
+        original_call_node.func.value,
+        'element_type',
+        default=templates.replace_as_expression('None'))
+    shape = anno.getanno(
+        original_call_node.func.value,
+        'element_shape',
+        default=templates.replace_as_expression('None'))
+
+    template = """
+      target, pop_var_name = ag__.list_pop(
+          target, element,
+          opts=ag__.ListPopOpts(element_dtype=dtype, element_shape=shape))
+    """
+    return templates.replace(
+        template,
+        target=original_call_node.func.value,
+        pop_var_name=pop_var_name,
+        element=pop_element,
+        dtype=dtype,
+        shape=shape)
+
+  def _postprocess_statement(self, node):
+    """Inserts any separate pop() calls that node may use."""
+    pop_uses = self.get_local(POP_USES, None)
+    if pop_uses:
+      replacements = []
+      for original_call_node, pop_var_name in pop_uses:
+        replacements.extend(
+            self._generate_pop_operation(original_call_node, pop_var_name))
+      replacements.append(node)
+      node = replacements
+    self.exit_local_scope()
+    return node, None
+
+  # TODO(mdan): Should we have a generic visit_block instead?
+  # Right now it feels that a visit_block would add too much magic that's
+  # hard to follow.
+
+  def _visit_and_process_block(self, block):
+    return self.visit_block(
+        block,
+        before_visit=self.enter_local_scope,
+        after_visit=self._postprocess_statement)
+
+  def visit_FunctionDef(self, node):
+    node.args = self.generic_visit(node.args)
+    node.decorator_list = self.visit_block(node.decorator_list)
+    node.body = self._visit_and_process_block(node.body)
+    return node
+
+  def visit_For(self, node):
+    node.target = self.visit(node.target)
+    node.body = self._visit_and_process_block(node.body)
+    node.orelse = self._visit_and_process_block(node.orelse)
+    return node
+
+  def visit_While(self, node):
+    node.test = self.visit(node.test)
+    node.body = self._visit_and_process_block(node.body)
+    node.orelse = self._visit_and_process_block(node.orelse)
+    return node
+
+  def visit_If(self, node):
+    node.test = self.visit(node.test)
+    node.body = self._visit_and_process_block(node.body)
+    node.orelse = self._visit_and_process_block(node.orelse)
+    return node
 
-    # Only convert lists when they are assigned to a variable, e.g.:
-    #   l = []
-    # TODO(mdan): A similar pattern exists in type_info.py
-    # We should add a generic "unpack_assignment" function to the base
-    # transformer, that has the same effect as applying some logic to the SSA
-    # form.
-    node.value = self._replace_list_constructors(node.targets, node.value)
+  def visit_With(self, node):
+    node.items = self.visit_block(node.items)
+    node.body = self._visit_and_process_block(node.body)
     return node
 
 
diff --git a/tensorflow/contrib/autograph/converters/lists_test.py b/tensorflow/contrib/autograph/converters/lists_test.py
index 74c6dc64f197f75eb3e66c01fb078467e8e8ea89..9f18ab9f44dd8c3f341a02b950f75317c676eff8 100644
--- a/tensorflow/contrib/autograph/converters/lists_test.py
+++ b/tensorflow/contrib/autograph/converters/lists_test.py
@@ -22,74 +22,126 @@ from tensorflow.contrib.autograph import utils
 from tensorflow.contrib.autograph.converters import converter_test_base
 from tensorflow.contrib.autograph.converters import lists
 from tensorflow.python.framework import dtypes
-from tensorflow.python.ops import tensor_array_ops
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import list_ops
 from tensorflow.python.platform import test
 
 
 class ListTest(converter_test_base.TestCase):
 
-  def test_empty_annotated_list(self):
+  def test_empty_list(self):
 
     def test_fn():
-      l = []
-      utils.set_element_type(l, dtypes.int32)
-      l.append(1)
-      return l
+      return []
 
-    node = self.parse_and_analyze(test_fn, {'dtypes': dtypes, 'utils': utils})
+    node = self.parse_and_analyze(test_fn, {})
     node = lists.transform(node, self.ctx)
 
-    with self.compiled(node, tensor_array_ops.TensorArray,
-                       dtypes.int32) as result:
-      # TODO(mdan): Attach these additional modules automatically.
-      result.utils = utils
-      result.dtypes = dtypes
+    with self.compiled(node) as result:
+      tl = result.test_fn()
+      # Empty tensor lists cannot be evaluated or stacked.
+      self.assertTrue(isinstance(tl, ops.Tensor))
+      self.assertEqual(tl.dtype, dtypes.variant)
+
+  def test_initialized_list(self):
+
+    def test_fn():
+      return [1, 2, 3]
+
+    node = self.parse_and_analyze(test_fn, {})
+    node = lists.transform(node, self.ctx)
+
+    with self.compiled(node) as result:
       with self.test_session() as sess:
-        self.assertAllEqual([1], sess.run(result.test_fn().stack()))
+        tl = result.test_fn()
+        r = list_ops.tensor_list_stack(tl, dtypes.int32)
+        self.assertAllEqual(sess.run(r), [1, 2, 3])
 
-  def test_empty_annotated_lists_unpacked(self):
+  def test_list_append(self):
 
     def test_fn():
-      l, m = [], []
-      utils.set_element_type(l, dtypes.int32)
-      utils.set_element_type(m, dtypes.int32)
-      l.append(1)
-      m.append(2)
-      return l, m
+      l = [1]
+      l.append(2)
+      l.append(3)
+      return l
 
-    node = self.parse_and_analyze(test_fn, {'dtypes': dtypes, 'utils': utils})
+    node = self.parse_and_analyze(test_fn, {})
     node = lists.transform(node, self.ctx)
 
-    with self.compiled(node, tensor_array_ops.TensorArray,
-                       dtypes.int32) as result:
+    with self.compiled(node) as result:
+      with self.test_session() as sess:
+        tl = result.test_fn()
+        r = list_ops.tensor_list_stack(tl, dtypes.int32)
+        self.assertAllEqual(sess.run(r), [1, 2, 3])
+
+  def test_list_pop(self):
+
+    def test_fn():
+      l = [1, 2, 3]
+      utils.set_element_type(l, dtypes.int32, ())
+      s = l.pop()
+      return s, l
+
+    node = self.parse_and_analyze(
+        test_fn,
+        {
+            'utils': utils,
+            'dtypes': dtypes
+        },
+        include_type_analysis=True,
+    )
+    node = lists.transform(node, self.ctx)
+
+    with self.compiled(node) as result:
       result.utils = utils
       result.dtypes = dtypes
       with self.test_session() as sess:
-        res_l, res_m = result.test_fn()
-        self.assertEqual([1], sess.run(res_l.stack()))
-        self.assertEqual([2], sess.run(res_m.stack()))
+        ts, tl = result.test_fn()
+        r = list_ops.tensor_list_stack(tl, dtypes.int32)
+        self.assertAllEqual(sess.run(r), [1, 2])
+        self.assertAllEqual(sess.run(ts), 3)
+
+  def test_double_list_pop(self):
 
-  def test_empty_annotated_lists_list_unpacked(self):
+    def test_fn(l):
+      s = l.pop().pop()
+      return s
+
+    node = self.parse_and_analyze(test_fn, {})
+    node = lists.transform(node, self.ctx)
+
+    with self.compiled(node) as result:
+      test_input = [1, 2, [1, 2, 3]]
+      # TODO(mdan): Pass a list of lists of tensor when we fully support that.
+      # For now, we just pass a regular Python list of lists just to verify that
+      # the two pop calls are sequenced properly.
+      self.assertAllEqual(result.test_fn(test_input), 3)
+
+  def test_list_stack(self):
+
+    tf = None  # Will be replaced with a mock.
 
     def test_fn():
-      [l, m] = [], []
+      l = [1, 2, 3]
       utils.set_element_type(l, dtypes.int32)
-      utils.set_element_type(m, dtypes.int32)
-      l.append(1)
-      m.append(2)
-      return l, m
-
-    node = self.parse_and_analyze(test_fn, {'dtypes': dtypes, 'utils': utils})
+      return tf.stack(l)
+
+    node = self.parse_and_analyze(
+        test_fn,
+        {
+            'utils': utils,
+            'dtypes': dtypes
+        },
+        include_type_analysis=True,
+    )
     node = lists.transform(node, self.ctx)
 
-    with self.compiled(node, tensor_array_ops.TensorArray,
-                       dtypes.int32) as result:
+    with self.compiled(node, array_ops.stack, dtypes.int32) as result:
       result.utils = utils
       result.dtypes = dtypes
       with self.test_session() as sess:
-        res_l, res_m = result.test_fn()
-        self.assertEqual([1], sess.run(res_l.stack()))
-        self.assertEqual([2], sess.run(res_m.stack()))
+        self.assertAllEqual(sess.run(result.test_fn()), [1, 2, 3])
 
 
 if __name__ == '__main__':
diff --git a/tensorflow/contrib/autograph/converters/slices.py b/tensorflow/contrib/autograph/converters/slices.py
new file mode 100644
index 0000000000000000000000000000000000000000..85aeda9c4164eb70329bd50f789eea5441c8fc87
--- /dev/null
+++ b/tensorflow/contrib/autograph/converters/slices.py
@@ -0,0 +1,83 @@
+# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Converter for slice operations."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import gast
+
+from tensorflow.contrib.autograph.pyct import anno
+from tensorflow.contrib.autograph.pyct import templates
+from tensorflow.contrib.autograph.pyct import transformer
+
+
+class SliceTransformer(transformer.Base):
+  """Converts slicing operations to their TF counterpart.
+
+  Currently, relying on the default slice operator that Tensor uses is
+  insufficient, because TensorArray and tensor lists use dedicated index read
+  and write functions.
+  """
+
+  def _process_single_assignment(self, target, value):
+    if not isinstance(target, gast.Subscript):
+      return None
+
+    template = """
+      target = ag__.set_item(target, key, item)
+    """
+    return templates.replace(
+        template, target=target.value, key=target.slice, item=value)
+
+  def visit_Assign(self, node):
+    node = self.generic_visit(node)
+    # TODO(mdan): Support unpackings and multiple assignments.
+    if len(node.targets) != 1:
+      raise NotImplementedError('multiple assignment')
+    replacement = self._process_single_assignment(node.targets[0], node.value)
+    if replacement is not None:
+      return replacement
+    return node
+
+  def visit_Subscript(self, node):
+    node = self.generic_visit(node)
+    if not isinstance(node.slice, gast.Index):
+      # TODO(mdan): It might make more sense to wave them through.
+      raise NotImplementedError('non-index slice')
+
+    if not isinstance(node.ctx, gast.Load):
+      # Index writes are handled at a higher level, one at which the rvalue is
+      # also available.
+      return node
+
+    dtype = anno.getanno(
+        node.value,
+        'element_type',
+        default=templates.replace_as_expression('None'))
+
+    template = """
+      ag__.get_item(
+          target,
+          key,
+          opts=ag__.GetItemOpts(element_dtype=dtype))
+    """
+    return templates.replace_as_expression(
+        template, target=node.value, key=node.slice, dtype=dtype)
+
+
+def transform(node, context):
+  return SliceTransformer(context).visit(node)
diff --git a/tensorflow/contrib/autograph/converters/slices_test.py b/tensorflow/contrib/autograph/converters/slices_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..6c2d7e1ea1a6c46fcc3a2c6972a24507646ef858
--- /dev/null
+++ b/tensorflow/contrib/autograph/converters/slices_test.py
@@ -0,0 +1,59 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for slices module."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.contrib.autograph import utils
+from tensorflow.contrib.autograph.converters import converter_test_base
+from tensorflow.contrib.autograph.converters import slices
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
+from tensorflow.python.ops import list_ops
+from tensorflow.python.platform import test
+
+
+class SliceTest(converter_test_base.TestCase):
+
+  def test_index_access(self):
+
+    def test_fn(l):
+      utils.set_element_type(l, dtypes.int32)
+      return l[1]
+
+    node = self.parse_and_analyze(
+        test_fn,
+        {
+            'utils': utils,
+            'dtypes': dtypes
+        },
+        include_type_analysis=True,
+    )
+    node = slices.transform(node, self.ctx)
+
+    with self.compiled(node, dtypes.int32) as result:
+      result.utils = utils
+      result.dtypes = dtypes
+      with self.test_session() as sess:
+        tl = list_ops.tensor_list_from_tensor(
+            [1, 2], element_shape=constant_op.constant([], dtype=dtypes.int32))
+        y = result.test_fn(tl)
+        self.assertEqual(2, sess.run(y))
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/contrib/autograph/impl/BUILD b/tensorflow/contrib/autograph/impl/BUILD
index 54424e26472b8466b8fe68ea848b5463c10224c9..02f16ae1875d6bd1fb87d19f8bfc5cae900391dd 100644
--- a/tensorflow/contrib/autograph/impl/BUILD
+++ b/tensorflow/contrib/autograph/impl/BUILD
@@ -20,7 +20,9 @@ py_library(
         "api.py",
         "config.py",
         "conversion.py",
+        "directives.py",
         "naming.py",
+        "special_functions.py",
     ],
     srcs_version = "PY2AND3",
     visibility = ["//tensorflow:__subpackages__"],
@@ -69,3 +71,13 @@ py_test(
         "//tensorflow/python:client_testlib",
     ],
 )
+
+py_test(
+    name = "special_functions_test",
+    srcs = ["special_functions_test.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":impl",
+        "//tensorflow/python:client_testlib",
+    ],
+)
diff --git a/tensorflow/contrib/autograph/impl/conversion.py b/tensorflow/contrib/autograph/impl/conversion.py
index 55a30dc127957b2a9caa053db843380c94bacfbf..7802bbbe27ec5fed891440af2f589801918b3bdd 100644
--- a/tensorflow/contrib/autograph/impl/conversion.py
+++ b/tensorflow/contrib/autograph/impl/conversion.py
@@ -38,6 +38,7 @@ from tensorflow.contrib.autograph.converters import logical_expressions
 from tensorflow.contrib.autograph.converters import name_scopes
 from tensorflow.contrib.autograph.converters import side_effect_guards
 from tensorflow.contrib.autograph.converters import single_return
+from tensorflow.contrib.autograph.converters import slices
 from tensorflow.contrib.autograph.impl import config
 from tensorflow.contrib.autograph.impl import naming
 from tensorflow.contrib.autograph.pyct import ast_util
@@ -371,6 +372,8 @@ def node_to_graph(node, ctx, nocompile_decorators):
   # TODO(mdan): Clean this up.
   # Some intermediate analyses are not required, and some comments got orphaned.
 
+  # TODO(mdan): We may assume all converters require analysis to be re-done.
+
   # Past this point, line numbers are no longer accurate so we ignore the
   # source.
   # TODO(mdan): Is it feasible to reconstruct intermediate source code?
@@ -393,6 +396,8 @@ def node_to_graph(node, ctx, nocompile_decorators):
 
   node = _static_analysis_pass(node, ctx)
   node = lists.transform(node, ctx)
+  node = _static_analysis_pass(node, ctx)
+  node = slices.transform(node, ctx)
   node = builtin_functions.transform(node, ctx)
 
   node = _static_analysis_pass(node, ctx)
diff --git a/tensorflow/contrib/autograph/impl/directives.py b/tensorflow/contrib/autograph/impl/directives.py
new file mode 100644
index 0000000000000000000000000000000000000000..aabe5d99394a0cb921196d1c6a6b2a9496ea7545
--- /dev/null
+++ b/tensorflow/contrib/autograph/impl/directives.py
@@ -0,0 +1,68 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Directives are special no-op functions that serve as compilation markers.
+
+They provide static information like type hints, compilation and TensorFlow
+overrides.
+
+These serve as annotations in the compiled code, allowing the user some control
+over the compilation process. They have no functional role at runtime.
+"""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+
+UNSPECIFIED = object()
+
+
+def set_element_type(entity, dtype, shape=UNSPECIFIED):
+  """Indicates that the entity is expected hold items of specified type/shape.
+
+  The staged TensorFlow ops will reflect and assert this data type. Ignored
+  otherwise.
+
+  Args:
+    entity: The entity to annotate.
+    dtype: TensorFlow dtype value to assert for entity.
+    shape: Optional shape to assert for entity.
+  """
+  del entity
+  del dtype
+  del shape
+
+
+def set_loop_options(
+    parallel_iterations=UNSPECIFIED,
+    back_prop=UNSPECIFIED,
+    swap_memory=UNSPECIFIED,
+    maximum_iterations=UNSPECIFIED):
+  """Specifies additional arguments to be passed to the enclosing while_loop.
+
+  The parameters apply to and only to the immediately enclosing loop. It only
+  has effect if the loop is staged as a TF while_loop; otherwise the parameters
+  have no effect.
+
+  Args:
+    parallel_iterations: See tf.while_loop.
+    back_prop: See tf.while_loop.
+    swap_memory: See tf.while_loop.
+    maximum_iterations: See tf.while_loop.
+  """
+  del parallel_iterations
+  del back_prop
+  del swap_memory
+  del maximum_iterations
diff --git a/tensorflow/contrib/autograph/impl/special_functions.py b/tensorflow/contrib/autograph/impl/special_functions.py
new file mode 100644
index 0000000000000000000000000000000000000000..b7a8177c44c88217560fb7f72c77d3ac1aa0c9ec
--- /dev/null
+++ b/tensorflow/contrib/autograph/impl/special_functions.py
@@ -0,0 +1,48 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Special functions that only make sense for AutoGraph.
+
+These functions are meant to ensure feature parity between Python and AutoGraph,
+so that the exact same code works in both modes. In general, AutoGraph will
+replace these calls.
+"""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.contrib.autograph.operators import data_structures
+
+
+def stack(list_or_tensor, element_dtype=None):
+  """Stacks the input, if it admits the notion of stacking. No-op otherwise.
+
+  For example, a list of tensors can be stacked into a larger tensor. This
+  function is similar to tf.stack, but it accepts non-lists and lists of
+  non-tensors as arguments. In the latter case, the function does nothing.
+
+  Args:
+    list_or_tensor: Any entity.
+    element_dtype: Optional dtype for the elements in the list. Required if the
+        input is stackable, and the list is untyped.
+
+  Returns:
+    If the input is stackable, a new object representing the stacked inputs.
+  Otherwise it returns list_or_tensor unchanged.
+  """
+  return data_structures.list_stack(
+      list_or_tensor,
+      data_structures.ListStackOpts(
+          element_dtype=element_dtype, original_call=lambda x: x))
diff --git a/tensorflow/contrib/autograph/impl/special_functions_test.py b/tensorflow/contrib/autograph/impl/special_functions_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..9b52d2a59b5a3e3c92f11343197379c773ecc828
--- /dev/null
+++ b/tensorflow/contrib/autograph/impl/special_functions_test.py
@@ -0,0 +1,50 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for special_functions module."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.contrib.autograph.impl import special_functions
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import tensor_util
+from tensorflow.python.ops import list_ops
+from tensorflow.python.platform import test
+
+
+class SpecialFunctionsTest(test.TestCase):
+
+  def test_basic(self):
+    self.assertEqual(special_functions.stack(1), 1)
+    self.assertListEqual(special_functions.stack([1, 2, 3]), [1, 2, 3])
+    # TODO(mdan): This should probably forward to tf.stack.
+    self.assertTrue(
+        isinstance(
+            special_functions.stack(
+                [constant_op.constant(1),
+                 constant_op.constant(2)]), list))
+
+    t = constant_op.constant([1.0, 2.0])
+    l = list_ops.tensor_list_from_tensor(
+        t, element_shape=constant_op.constant([], dtype=dtypes.int32))
+    self.assertTrue(
+        tensor_util.is_tensor(
+            special_functions.stack(l, element_dtype=dtypes.float32)))
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/contrib/autograph/operators/BUILD b/tensorflow/contrib/autograph/operators/BUILD
index 18bfec5d9c69912f90414c51ac63ba540cf4d5fc..0c6ab65505ee03e19588adae73d3134399a34b65 100644
--- a/tensorflow/contrib/autograph/operators/BUILD
+++ b/tensorflow/contrib/autograph/operators/BUILD
@@ -22,7 +22,7 @@ py_library(
         "__init__.py",
         "control_flow.py",
         "data_structures.py",
-        "dispatch_context.py",
+        "slices.py",
     ],
     srcs_version = "PY2AND3",
     visibility = ["//tensorflow:__subpackages__"],
@@ -52,3 +52,13 @@ py_test(
         "//tensorflow/python:client_testlib",
     ],
 )
+
+py_test(
+    name = "slices_test",
+    srcs = ["slices_test.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":operators",
+        "//tensorflow/python:client_testlib",
+    ],
+)
diff --git a/tensorflow/contrib/autograph/operators/__init__.py b/tensorflow/contrib/autograph/operators/__init__.py
index 38b761d97d54bdaee4da91269964469b482895ae..c900fd6af2ea5dfb419f731ee8d8822d68424b27 100644
--- a/tensorflow/contrib/autograph/operators/__init__.py
+++ b/tensorflow/contrib/autograph/operators/__init__.py
@@ -28,6 +28,10 @@ closures for the body.
 #    - the names used in the Python docs, if the operator is a function (e.g.
 #      list_ and x for append, see
 #      https://docs.python.org/3.7/tutorial/datastructures.html)
+#
+# All operators may accept a final argument named "opts", of a type that
+# subclasses namedtuple and contains any arguments that are only required
+# for some specializations of the operator.
 
 from __future__ import absolute_import
 from __future__ import division
@@ -35,3 +39,12 @@ from __future__ import print_function
 
 from tensorflow.contrib.autograph.operators.control_flow import for_stmt
 from tensorflow.contrib.autograph.operators.control_flow import while_stmt
+from tensorflow.contrib.autograph.operators.data_structures import list_append
+from tensorflow.contrib.autograph.operators.data_structures import list_pop
+from tensorflow.contrib.autograph.operators.data_structures import list_stack
+from tensorflow.contrib.autograph.operators.data_structures import ListPopOpts
+from tensorflow.contrib.autograph.operators.data_structures import ListStackOpts
+from tensorflow.contrib.autograph.operators.data_structures import new_list
+from tensorflow.contrib.autograph.operators.slices import get_item
+from tensorflow.contrib.autograph.operators.slices import GetItemOpts
+from tensorflow.contrib.autograph.operators.slices import set_item
diff --git a/tensorflow/contrib/autograph/operators/data_structures.py b/tensorflow/contrib/autograph/operators/data_structures.py
index c862306baa9e8114a71a26323ddcbd35c8592c55..06d8727b0fcc30b532b3f11281cd1a83c51ac8bc 100644
--- a/tensorflow/contrib/autograph/operators/data_structures.py
+++ b/tensorflow/contrib/autograph/operators/data_structures.py
@@ -18,39 +18,250 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import collections
+
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
+from tensorflow.python.framework import tensor_util
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import control_flow_ops
+from tensorflow.python.ops import list_ops
 from tensorflow.python.ops import tensor_array_ops
+from tensorflow.python.ops import variables
+
+
+# TODO(mdan): Once control flow supports objects, repackage as a class.
+
+
+def new_list(iterable=None):
+  """The list constructor.
+
+  Args:
+    iterable: Optional elements to fill the list with.
+
+  Returns:
+    A list-like object. The exact return value depends on the initial elements.
+  """
+  if iterable:
+    elements = tuple(iterable)
+  else:
+    elements = ()
+
+  # TODO(mdan): Extend these criteria.
+  if any(isinstance(el, variables.Variable) for el in elements):
+    return _py_list_new(elements)
+  return _tf_tensor_list_new(elements)
 
-# TODO(mdan): Add support for TensorList once functional.
-# TODO(mdan): Add primitives for empty list, list with elements.
 
+def _tf_tensor_list_new(elements):
+  """Overload of new_list that stages a Tensor list creation."""
+  elements = tuple(ops.convert_to_tensor(el) for el in elements)
+  all_dtypes = set(el.dtype for el in elements)
+  if len(all_dtypes) == 1:
+    element_dtype = tuple(all_dtypes)[0]
+  else:
+    # Heterogeneous lists are ok.
+    element_dtype = dtypes.variant
+
+  # TODO(mdan): This may fail for elements of variable shapes.
+  all_shapes = set(tuple(el.shape.as_list()) for el in elements)
+  if len(all_shapes) == 1:
+    element_shape = array_ops.shape(elements[0])
+  else:
+    # Heterogeneous lists are ok.
+    element_shape = constant_op.constant(-1)  # unknown shape, by convention
+
+  l = list_ops.empty_tensor_list(
+      element_shape=element_shape, element_dtype=element_dtype)
+  for el in elements:
+    l = list_ops.tensor_list_push_back(l, el)
+  return l
 
-def append(target, element):
+
+def _py_list_new(elements):
+  """Overload of new_list that creates a Python list."""
+  return list(elements)
+
+
+def list_append(list_, x):
   """The list append function.
 
-  Note: it is unspecified where target will be mutated or not. If target is
-  a TensorFlow entity, it will not be typically mutated. If target is a plain
-  list, it will be. In general, if the target is mutated then the return value
+  Note: it is unspecified where list_ will be mutated or not. If list_ is
+  a TensorFlow entity, it will not be typically mutated. If list_ is a plain
+  list, it will be. In general, if the list is mutated then the return value
   should point to the original entity.
 
   Args:
-    target: An entity that supports append semantics.
-    element: The element to append.
+    list_: An entity that supports append semantics.
+    x: The element to append.
 
   Returns:
-    Same as target, after the append was performed.
+    Same as list_, after the append was performed.
+
+  Raises:
+    ValueError: if list_ is not of a known list-like type.
   """
-  if isinstance(target, tensor_array_ops.TensorArray):
-    return _tf_tensorarray_append(target, element)
+  if isinstance(list_, tensor_array_ops.TensorArray):
+    return _tf_tensorarray_append(list_, x)
+  elif tensor_util.is_tensor(list_):
+    if list_.dtype == dtypes.variant:
+      return _tf_tensor_list_append(list_, x)
+    else:
+      raise ValueError(
+          'tensor lists are expected to be Tensors with dtype=tf.variant,'
+          ' instead found %s' % list_)
   else:
-    return _py_append(target, element)
+    return _py_list_append(list_, x)
+
+
+def _tf_tensor_list_append(list_, x):
+  """Overload of list_append that stages a Tensor list write."""
+  def empty_list_of_elements_like_x():
+    tensor_x = ops.convert_to_tensor(x)
+    return list_ops.empty_tensor_list(
+        element_shape=array_ops.shape(tensor_x),
+        element_dtype=tensor_x.dtype)
+
+  list_ = control_flow_ops.cond(
+      list_ops.tensor_list_length(list_) > 0,
+      lambda: list_,
+      empty_list_of_elements_like_x,
+  )
+  return list_ops.tensor_list_push_back(list_, x)
+
+
+def _tf_tensorarray_append(list_, x):
+  """Overload of list_append that stages a TensorArray write."""
+  return list_.write(list_.size(), x)
+
+
+def _py_list_append(list_, x):
+  """Overload of list_append that executes a Python list append."""
+  # Revert to the original call.
+  list_.append(x)
+  return list_
+
+
+class ListPopOpts(
+    collections.namedtuple('ListPopOpts', ('element_dtype', 'element_shape'))):
+  pass
+
+
+def list_pop(list_, i, opts):
+  """The list pop function.
+
+  Note: it is unspecified where list_ will be mutated or not. If list_ is
+  a TensorFlow entity, it will not be typically mutated. If list_ is a plain
+  list, it will be. In general, if the list is mutated then the return value
+  should point to the original entity.
+
+  Args:
+    list_: An entity that supports pop semantics.
+    i: Optional index to pop from. May be None.
+    opts: A ListPopOpts.
+
+  Returns:
+    Tuple (x, out_list_):
+      out_list_: same as list_, after the removal was performed.
+      x: the removed element value.
+
+  Raises:
+    ValueError: if list_ is not of a known list-like type or the operation is
+    not supported for that type.
+  """
+  assert isinstance(opts, ListPopOpts)
+
+  if isinstance(list_, tensor_array_ops.TensorArray):
+    raise ValueError('TensorArray does not support item removal')
+  elif tensor_util.is_tensor(list_):
+    if list_.dtype == dtypes.variant:
+      return _tf_tensor_list_pop(list_, i, opts)
+    else:
+      raise ValueError(
+          'tensor lists are expected to be Tensors with dtype=tf.variant,'
+          ' instead found %s' % list_)
+  else:
+    return _py_list_pop(list_, i)
+
+
+def _tf_tensor_list_pop(list_, i, opts):
+  """Overload of list_pop that stages a Tensor list pop."""
+  if i is not None:
+    raise NotImplementedError('tensor lists only support removing from the end')
+
+  if opts.element_dtype is None:
+    raise ValueError('cannot pop from a list without knowing its element '
+                     'type; use set_element_type to annotate it')
+  if opts.element_shape is None:
+    raise ValueError('cannot pop from a list without knowing its element '
+                     'shape; use set_element_type to annotate it')
+  list_out, x = list_ops.tensor_list_pop_back(
+      list_, element_dtype=opts.element_dtype)
+  x.set_shape(opts.element_shape)
+  return list_out, x
+
+
+def _py_list_pop(list_, i):
+  """Overload of list_pop that executes a Python list append."""
+  if i is None:
+    x = list_.pop()
+  else:
+    x = list_.pop(i)
+  return list_, x
+
+
+# TODO(mdan): Look into reducing duplication between all these containers.
+class ListStackOpts(
+    collections.namedtuple('ListStackOpts',
+                           ('element_dtype', 'original_call'))):
+  pass
+
+
+def list_stack(list_, opts):
+  """The list stack function.
+
+  This does not have a direct correspondent in Python. The closest idiom to
+  this is tf.append or np.stack. It's different from those in the sense that it
+  accepts a Tensor list, rather than a list of tensors. It can also accept
+  TensorArray. When the target is anything else, the dispatcher will rely on
+  ctx.original_call for fallback.
+
+  Args:
+    list_: An entity that supports append semantics.
+    opts: A ListStackOpts object.
+
+  Returns:
+    The output of the stack operation, typically a Tensor.
+  """
+  assert isinstance(opts, ListStackOpts)
+
+  if isinstance(list_, tensor_array_ops.TensorArray):
+    return _tf_tensorarray_stack(list_)
+  elif tensor_util.is_tensor(list_):
+    if list_.dtype == dtypes.variant:
+      return _tf_tensor_list_stack(list_, opts)
+    else:
+      # No-op for primitive Tensor arguments.
+      return list_
+  else:
+    return _py_list_stack(list_, opts)
+
+
+def _tf_tensorarray_stack(list_):
+  """Overload of list_stack that stages a TensorArray stack."""
+  return list_.stack()
 
 
-def _tf_tensorarray_append(target, element):
-  """Overload of append that stages a TensorArray write at the last position."""
-  return target.write(target.size(), element)
+def _tf_tensor_list_stack(list_, opts):
+  """Overload of list_stack that stages a Tensor list write."""
+  if opts.element_dtype is None:
+    raise ValueError('cannot stack a list without knowing its element type;'
+                     ' use set_element_type to annotate it')
+  return list_ops.tensor_list_stack(list_, element_dtype=opts.element_dtype)
 
 
-def _py_append(target, element):
-  """Overload of append that executes a Python list append."""
-  target.append(element)
-  return target
+def _py_list_stack(list_, opts):
+  """Overload of list_stack that executes a Python list append."""
+  # Revert to the original call.
+  return opts.original_call(list_)
diff --git a/tensorflow/contrib/autograph/operators/data_structures_test.py b/tensorflow/contrib/autograph/operators/data_structures_test.py
index 577d28c34da39f1216669513c29a00ac07bec126..8bbb52d6c10b241ec754c7dea599fa15a869595f 100644
--- a/tensorflow/contrib/autograph/operators/data_structures_test.py
+++ b/tensorflow/contrib/autograph/operators/data_structures_test.py
@@ -19,25 +19,98 @@ from __future__ import division
 from __future__ import print_function
 
 from tensorflow.contrib.autograph.operators import data_structures
+from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import list_ops
 from tensorflow.python.ops import tensor_array_ops
 from tensorflow.python.platform import test
 
 
-class AppendTest(test.TestCase):
+class ListTest(test.TestCase):
 
-  def test_tf_tensorarray(self):
+  def test_new_list_empty(self):
+    l = data_structures.new_list()
+    # Can't evaluate an empty list.
+    # TODO(mdan): sess.run should allow tf.variant maybe?
+    self.assertTrue(isinstance(l, ops.Tensor))
+
+  def test_new_list_tensor(self):
+    l = data_structures.new_list([3, 4, 5])
+    t = list_ops.tensor_list_stack(l, element_dtype=dtypes.int32)
+    with self.test_session() as sess:
+      self.assertAllEqual(sess.run(t), [3, 4, 5])
+
+  def test_append_tensor_list(self):
+    l = data_structures.new_list()
+    x = constant_op.constant([1, 2, 3])
+    l = data_structures.list_append(l, x)
+
+    t = list_ops.tensor_list_stack(l, element_dtype=x.dtype)
+    with self.test_session() as sess:
+      self.assertAllEqual(sess.run(t), [[1, 2, 3]])
+
+  def test_append_tensorarray(self):
     l = tensor_array_ops.TensorArray(dtypes.int32, size=0, dynamic_size=True)
-    l1 = data_structures.append(l, 1)
-    l2 = data_structures.append(l1, 2)
+    l1 = data_structures.list_append(l, 1)
+    l2 = data_structures.list_append(l1, 2)
     with self.test_session() as sess:
       self.assertAllEqual(sess.run(l1.stack()), [1])
       self.assertAllEqual(sess.run(l2.stack()), [1, 2])
 
-  def test_python(self):
+  def test_append_python(self):
     l = []
-    self.assertAllEqual(data_structures.append(l, 1), [1])
-    self.assertAllEqual(data_structures.append(l, 2), [1, 2])
+    self.assertAllEqual(data_structures.list_append(l, 1), [1])
+    self.assertAllEqual(data_structures.list_append(l, 2), [1, 2])
+
+  def test_pop_tensor_list(self):
+    initial_list = constant_op.constant([[1, 2], [3, 4]])
+    elem_shape = constant_op.constant([2])
+    l = list_ops.tensor_list_from_tensor(initial_list, element_shape=elem_shape)
+
+    opts = data_structures.ListPopOpts(
+        element_dtype=initial_list.dtype,
+        element_shape=(2,))
+
+    with self.assertRaises(NotImplementedError):
+      data_structures.list_pop(l, 0, opts)
+
+    with self.test_session() as sess:
+      l, x = data_structures.list_pop(l, None, opts)
+      self.assertAllEqual(sess.run(x), [3, 4])
+
+      t = list_ops.tensor_list_stack(l, element_dtype=initial_list.dtype)
+      self.assertAllEqual(sess.run(t), [[1, 2]])
+
+  def test_pop_python(self):
+    l = [1, 2, 3]
+    opts = data_structures.ListPopOpts(element_dtype=None, element_shape=())
+    self.assertAllEqual(data_structures.list_pop(l, None, opts), ([1, 2], 3))
+    self.assertAllEqual(data_structures.list_pop(l, None, opts), ([1], 2))
+
+  def test_stack_tensor_list(self):
+    initial_list = constant_op.constant([[1, 2], [3, 4]])
+    elem_shape = constant_op.constant([2])
+    l = list_ops.tensor_list_from_tensor(initial_list, element_shape=elem_shape)
+
+    opts = data_structures.ListStackOpts(
+        element_dtype=initial_list.dtype, original_call=None)
+
+    with self.test_session() as sess:
+      t = data_structures.list_stack(l, opts)
+      self.assertAllEqual(sess.run(t), sess.run(initial_list))
+
+  def test_stack_fallback(self):
+
+    def dummy_function(l):
+      # Lazy person's mock: just transform the argument in a way in which we
+      # can check that this function was indeed called.
+      return [x * 2 for x in l]
+
+    opts = data_structures.ListStackOpts(
+        element_dtype=None, original_call=dummy_function)
+
+    self.assertAllEqual(data_structures.list_stack([1, 2], opts), [2, 4])
 
 
 if __name__ == '__main__':
diff --git a/tensorflow/contrib/autograph/operators/slices.py b/tensorflow/contrib/autograph/operators/slices.py
new file mode 100644
index 0000000000000000000000000000000000000000..04fbeb2f6e39234cad139442704fd7a8d0f56172
--- /dev/null
+++ b/tensorflow/contrib/autograph/operators/slices.py
@@ -0,0 +1,133 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Operators specific to slicing operations."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import collections
+
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import tensor_util
+from tensorflow.python.ops import list_ops
+from tensorflow.python.ops import tensor_array_ops
+
+
+# TODO(mdan): Support extended slices.
+
+
+class GetItemOpts(collections.namedtuple('GetItemOpts', ('element_dtype',))):
+  pass
+
+
+def get_item(target, i, opts):
+  """The slice read operator (i.e. __getitem__).
+
+  Note: it is unspecified whether target will be mutated or not. In general,
+  if target is mutable (like Python lists), it will be mutated.
+
+  Args:
+    target: An entity that supports getitem semantics.
+    i: Index to read from.
+    opts: A GetItemOpts object.
+
+  Returns:
+    The read element.
+
+  Raises:
+    ValueError: if target is not of a supported type.
+  """
+  assert isinstance(opts, GetItemOpts)
+
+  if isinstance(target, tensor_array_ops.TensorArray):
+    return _tf_tensorarray_get_item(target, i)
+  elif tensor_util.is_tensor(target):
+    if target.dtype == dtypes.variant:
+      return _tf_tensor_list_get_item(target, i, opts)
+    else:
+      return _tf_tensor_get_item(target, i)
+  else:
+    return _py_get_item(target, i)
+
+
+def _tf_tensorarray_get_item(target, i):
+  """Overload of get_item that stages a TensorArray read."""
+  return target.read(i)
+
+
+def _tf_tensor_list_get_item(target, i, opts):
+  """Overload of get_item that stages a Tensor list read."""
+  if opts.element_dtype is None:
+    raise ValueError('cannot retrieve from a list without knowing its '
+                     'element type; use set_element_type to annotate it')
+  x = list_ops.tensor_list_get_item(target, i, element_dtype=opts.element_dtype)
+  return x
+
+
+def _tf_tensor_get_item(target, i):
+  """Overload of get_item that stages a Tensor (not Tensor list) read."""
+  return target[i]
+
+
+def _py_get_item(target, i):
+  """Overload of get_item that executes a Python list modification."""
+  return target[i]
+
+
+def set_item(target, i, x):
+  """The slice write operator (i.e. __setitem__).
+
+  Note: it is unspecified whether target will be mutated or not. In general,
+  if target is mutable (like Python lists), it will be mutated.
+
+  Args:
+    target: An entity that supports setitem semantics.
+    i: Index to modify.
+    x: The new element value.
+
+  Returns:
+    Same as target, after the update was performed.
+
+  Raises:
+    ValueError: if target is not of a supported type.
+  """
+  if isinstance(target, tensor_array_ops.TensorArray):
+    return _tf_tensorarray_set_item(target, i, x)
+  elif tensor_util.is_tensor(target):
+    if target.dtype == dtypes.variant:
+      return _tf_tensor_list_set_item(target, i, x)
+    else:
+      raise ValueError(
+          'tensor lists are expected to be Tensors with dtype=tf.variant,'
+          ' instead found %s' % target)
+  else:
+    return _py_set_item(target, i, x)
+
+
+def _tf_tensorarray_set_item(target, i, x):
+  """Overload of set_item that stages a TensorArray write."""
+  return target.write(i, x)
+
+
+def _tf_tensor_list_set_item(target, i, x):
+  """Overload of set_item that stages a Tensor list update."""
+  return list_ops.tensor_list_set_item(target, i, x)
+
+
+def _py_set_item(target, i, x):
+  """Overload of set_item that executes a Python list modification."""
+  target[i] = x
+  return target
diff --git a/tensorflow/contrib/autograph/operators/slices_test.py b/tensorflow/contrib/autograph/operators/slices_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..d4aacb9d2015fec56a8df5ad85a20b733765ba26
--- /dev/null
+++ b/tensorflow/contrib/autograph/operators/slices_test.py
@@ -0,0 +1,51 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for slices module."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.contrib.autograph.operators import slices
+from tensorflow.python.framework import constant_op
+from tensorflow.python.ops import list_ops
+from tensorflow.python.platform import test
+
+
+class SlicesTest(test.TestCase):
+
+  def test_set_item_tensor_list(self):
+    initial_list = constant_op.constant([[1, 2], [3, 4]])
+    elem_shape = constant_op.constant([2])
+    l = list_ops.tensor_list_from_tensor(initial_list, element_shape=elem_shape)
+    l = slices.set_item(l, 0, [5, 6])
+
+    with self.test_session() as sess:
+      t = list_ops.tensor_list_stack(l, element_dtype=initial_list.dtype)
+      self.assertAllEqual(sess.run(t), [[5, 6], [3, 4]])
+
+  def test_get_item_tensor_list(self):
+    initial_list = constant_op.constant([[1, 2], [3, 4]])
+    elem_shape = constant_op.constant([2])
+    l = list_ops.tensor_list_from_tensor(initial_list, element_shape=elem_shape)
+    t = slices.get_item(
+        l, 1, slices.GetItemOpts(element_dtype=initial_list.dtype))
+
+    with self.test_session() as sess:
+      self.assertAllEqual(sess.run(t), [3, 4])
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/contrib/autograph/pyct/BUILD b/tensorflow/contrib/autograph/pyct/BUILD
index 796ab445c74128e1123e24b67c288e0e3c5ca24c..989b821e53a5cefbe39095e669f9a9e0bec65b8a 100644
--- a/tensorflow/contrib/autograph/pyct/BUILD
+++ b/tensorflow/contrib/autograph/pyct/BUILD
@@ -130,6 +130,7 @@ py_test(
     name = "transformer_test",
     srcs = ["transformer_test.py"],
     srcs_version = "PY2AND3",
+    tags = ["no_windows"],
     deps = [
         ":pyct",
         "//tensorflow/python:client_testlib",
diff --git a/tensorflow/contrib/autograph/pyct/anno.py b/tensorflow/contrib/autograph/pyct/anno.py
index cc4a7edf02ed7556c9a552d8730e4c7875038c83..ae861627fd65cca057e7bf1af41424e605d4b7a1 100644
--- a/tensorflow/contrib/autograph/pyct/anno.py
+++ b/tensorflow/contrib/autograph/pyct/anno.py
@@ -46,8 +46,15 @@ class Basic(NoValue):
       '`name_map` allows renaming symbols.')
 
 
-def getanno(node, key, field_name='___pyct_anno'):
-  return getattr(node, field_name)[key]
+FAIL = object()
+
+
+def getanno(node, key, default=FAIL, field_name='___pyct_anno'):
+  if (default is FAIL or
+      (hasattr(node, field_name) and (key in getattr(node, field_name)))):
+    return getattr(node, field_name)[key]
+  else:
+    return default
 
 
 def hasanno(node, key, field_name='___pyct_anno'):
@@ -73,5 +80,9 @@ def delanno(node, key, field_name='___pyct_anno'):
 
 
 def copyanno(from_node, to_node, key, field_name='___pyct_anno'):
-  if hasanno(from_node, key, field_name):
-    setanno(to_node, key, getanno(from_node, key, field_name), field_name)
+  if hasanno(from_node, key, field_name=field_name):
+    setanno(
+        to_node,
+        key,
+        getanno(from_node, key, field_name=field_name),
+        field_name=field_name)
diff --git a/tensorflow/contrib/autograph/pyct/anno_test.py b/tensorflow/contrib/autograph/pyct/anno_test.py
index 1d4d9d119e0c45c4bf9dd4e5b8156766489a2e4d..f2c0c8cf05ca4b3671eb653ce56f6da61de54aee 100644
--- a/tensorflow/contrib/autograph/pyct/anno_test.py
+++ b/tensorflow/contrib/autograph/pyct/anno_test.py
@@ -38,12 +38,14 @@ class AnnoTest(test.TestCase):
 
     anno.setanno(node, 'foo', 3)
     self.assertTrue(anno.hasanno(node, 'foo'))
-    self.assertEqual(3, anno.getanno(node, 'foo'))
+    self.assertEqual(anno.getanno(node, 'foo'), 3)
+    self.assertEqual(anno.getanno(node, 'bar', default=7), 7)
 
     anno.delanno(node, 'foo')
     self.assertFalse(anno.hasanno(node, 'foo'))
     with self.assertRaises(AttributeError):
       anno.getanno(node, 'foo')
+    self.assertIsNone(anno.getanno(node, 'foo', default=None))
 
   def test_copyanno(self):
     node_1 = ast.Name()
diff --git a/tensorflow/contrib/autograph/pyct/qual_names.py b/tensorflow/contrib/autograph/pyct/qual_names.py
index 583cf7ecd7bce31c55de58361ab5295abb5d6707..da07013cf4f4309c0e24adda3017575d942861b7 100644
--- a/tensorflow/contrib/autograph/pyct/qual_names.py
+++ b/tensorflow/contrib/autograph/pyct/qual_names.py
@@ -205,6 +205,7 @@ class QnResolver(gast.NodeTransformer):
     return node
 
   def visit_Subscript(self, node):
+    # TODO(mdan): This may no longer apply if we overload getitem.
     node = self.generic_visit(node)
     s = node.slice
     if not isinstance(s, gast.Index):
@@ -216,7 +217,11 @@ class QnResolver(gast.NodeTransformer):
     elif isinstance(s.value, gast.Str):
       subscript = QN(StringLiteral(s.value.s))
     else:
-      subscript = anno.getanno(node.slice.value, anno.Basic.QN)
+      # The index may be an expression, case in which a name doesn't make sense.
+      if anno.hasanno(node.slice.value, anno.Basic.QN):
+        subscript = anno.getanno(node.slice.value, anno.Basic.QN)
+      else:
+        return node
     if anno.hasanno(node.value, anno.Basic.QN):
       anno.setanno(node, anno.Basic.QN,
                    QN(anno.getanno(node.value, anno.Basic.QN),
diff --git a/tensorflow/contrib/autograph/pyct/static_analysis/type_info.py b/tensorflow/contrib/autograph/pyct/static_analysis/type_info.py
index c00946f9c41bc68d5c638d71f356b484db1286d1..7d1e65c958d7787ef5ed707d4822d14a83092975 100644
--- a/tensorflow/contrib/autograph/pyct/static_analysis/type_info.py
+++ b/tensorflow/contrib/autograph/pyct/static_analysis/type_info.py
@@ -17,8 +17,8 @@
 This analyzer uses known live values to further infer object types. This
 may include for instance constructed objects and object member functions.
 
-In addition, the analyzer will also process annotations for TF (staged) type
-annotations.
+In addition, the analyzer also handles user annotations made in the code (for
+example, the autograph.set_element_type function).
 
 Requires annotations generated by LiveValuesResolver.
 """
@@ -44,6 +44,7 @@ from __future__ import print_function
 import gast
 
 from tensorflow.contrib.autograph.pyct import anno
+from tensorflow.contrib.autograph.pyct import parser
 from tensorflow.contrib.autograph.pyct import transformer
 from tensorflow.python.util import tf_inspect
 
@@ -136,14 +137,14 @@ class TypeInfoResolver(transformer.Base):
 
   def _process_function_arg(self, arg_name):
     str_name = str(arg_name)
+    type_holder = arg_name.ast()
+    self.scope.setval(arg_name, type_holder)
     if len(self.enclosing_entities) == 1 and str_name in self.context.arg_types:
       # Forge a node to hold the type information, so that method calls on
       # it can resolve the type.
-      type_holder = arg_name.ast()
       type_string, type_obj = self.context.arg_types[str_name]
       anno.setanno(type_holder, 'type', type_obj)
       anno.setanno(type_holder, 'type_fqn', tuple(type_string.split('.')))
-      self.scope.setval(arg_name, type_holder)
 
   def visit_arg(self, node):
     self._process_function_arg(anno.getanno(node.arg, anno.Basic.QN))
@@ -159,58 +160,47 @@ class TypeInfoResolver(transformer.Base):
       # a = b
       # then for future references to `a` we should have definition = `b`
       definition = self.scope.getval(qn)
-      if anno.hasanno(definition, 'type'):
-        anno.setanno(node, 'type', anno.getanno(definition, 'type'))
-        anno.setanno(node, 'type_fqn', anno.getanno(definition, 'type_fqn'))
-      if anno.hasanno(definition, 'element_type'):
-        anno.setanno(node, 'element_type',
-                     anno.getanno(definition, 'element_type'))
+      anno.copyanno(definition, node, 'type')
+      anno.copyanno(definition, node, 'type_fqn')
+      anno.copyanno(definition, node, 'element_type')
+      anno.copyanno(definition, node, 'element_shape')
     return node
 
-  def _process_variable_assignment(self, source, targets):
-    # Special case: constructors.
-    if isinstance(source, gast.Call):
-      func = source.func
+  def _process_variable_assignment(self, target, value):
+    # Constructors
+    if isinstance(value, gast.Call):
+      func = value.func
       if anno.hasanno(func, 'live_val'):
         func_obj = anno.getanno(func, 'live_val')
         if tf_inspect.isclass(func_obj):
-          anno.setanno(source, 'is_constructor', True)
-          anno.setanno(source, 'type', func_obj)
-          anno.setanno(source, 'type_fqn', anno.getanno(func, 'fqn'))
+          anno.setanno(value, 'is_constructor', True)
+          anno.setanno(value, 'type', func_obj)
+          anno.setanno(value, 'type_fqn', anno.getanno(func, 'fqn'))
           # TODO(mdan): Raise an error if constructor has side effects.
           # We can have a whitelist of no-side-effects constructors.
           # We can also step inside the constructor and further analyze.
 
-    # Multiple targets mean multiple assignment.
-    for target in targets:
-      # Tuple target means unpacking.
-      if isinstance(target, (gast.Tuple, gast.List)):
-        for i, target_item in enumerate(target.elts):
-          # Two cases here:
-          #   1. Static unpacking, e.g. a, b = c, d
-          #   2. Dynamic unpacking, e.g. a, b = c
-          # The former case is optimized away.
-          if isinstance(source, (gast.Tuple, gast.List)):
-            source_item = source.elts[i]
-          else:
-            source_item = gast.Subscript(source, gast.Index(i), ctx=None)
-          self._process_variable_assignment(source_item, (target_item,))
-      elif isinstance(target, (gast.Name, gast.Attribute)):
-        target_symbol = anno.getanno(target, anno.Basic.QN)
-        self.scope.setval(target_symbol, source)
-      else:
-        raise ValueError('assignment target has unknown type: %s' % target)
+    if isinstance(target, (gast.Name, gast.Attribute)):
+      target_symbol = anno.getanno(target, anno.Basic.QN)
+      self.scope.setval(target_symbol, value)
+    elif isinstance(target, gast.Subscript):
+      pass
+    else:
+      raise ValueError('assignment target has unknown type: %s' % target)
 
   def visit_With(self, node):
-    for wi in node.items:
-      if wi.optional_vars is not None:
-        self._process_variable_assignment(wi.context_expr, (wi.optional_vars,))
+    for item in node.items:
+      if item.optional_vars is not None:
+        self.apply_to_single_assignments((item.optional_vars,),
+                                         item.context_expr,
+                                         self._process_variable_assignment)
     self.generic_visit(node)
     return node
 
   def visit_Assign(self, node):
     self.generic_visit(node)
-    self._process_variable_assignment(node.value, node.targets)
+    self.apply_to_single_assignments(
+        node.targets, node.value, self._process_variable_assignment)
     return node
 
   def visit_Call(self, node):
@@ -220,23 +210,20 @@ class TypeInfoResolver(transformer.Base):
       if (anno.getanno(node.func, 'live_val') is
           self.context.type_annotation_func):
 
-        if len(node.args) != 2:
-          raise ValueError('"%s" must have exactly two parameters'
+        if len(node.args) < 2 or len(node.args) > 3:
+          raise ValueError('"%s" must have either two or three parameters'
                            % self.context.type_annotation_func)
-        target_arg, type_arg = node.args
+        if len(node.args) == 2:
+          target_arg, type_arg = node.args
+          shape_arg = parser.parse_expression('None')
+        else:
+          target_arg, type_arg, shape_arg = node.args
         if not anno.hasanno(target_arg, anno.Basic.QN):
           raise ValueError('the first argument of "%s" must by a symbol'
                            % self.context.type_annotation_func)
-        if isinstance(type_arg, gast.Str):
-          element_type = type_arg.s
-        elif isinstance(type_arg, gast.Num):
-          element_type = type_arg.n
-        else:
-          if not anno.hasanno(type_arg, 'live_val'):
-            raise ValueError(
-                'the second argument of "%s" must be statically resolvable' %
-                self.context.type_annotation_func)
-          element_type = anno.getanno(type_arg, 'live_val')
+        # TODO(mdan): This is vulnerable to symbol renaming.
+        element_type = type_arg
+        element_shape = shape_arg
 
         target_symbol = anno.getanno(target_arg, anno.Basic.QN)
         # Find the definition of this symbol and annotate it with the given
@@ -244,7 +231,9 @@ class TypeInfoResolver(transformer.Base):
         # to receive the same type annotation.
         definition = self.scope.getval(target_symbol)
         anno.setanno(node, 'element_type', element_type)
+        anno.setanno(node, 'element_shape', element_shape)
         anno.setanno(definition, 'element_type', element_type)
+        anno.setanno(definition, 'element_shape', element_shape)
         # TODO(mdan): Should we update references between definition and here?
     return self.generic_visit(node)
 
diff --git a/tensorflow/contrib/autograph/pyct/static_analysis/type_info_test.py b/tensorflow/contrib/autograph/pyct/static_analysis/type_info_test.py
index 46b7701624a43073fb7cc612d2678ab851513d91..484562f294bb53a63feeca965b8f94c58aa2a685 100644
--- a/tensorflow/contrib/autograph/pyct/static_analysis/type_info_test.py
+++ b/tensorflow/contrib/autograph/pyct/static_analysis/type_info_test.py
@@ -187,14 +187,27 @@ class TypeInfoResolverTest(test.TestCase):
 
     def test_fn():
       f = []
-      f = utils.set_element_type(f, Foo)
+      f = utils.set_element_type(f, Foo, (1, 2, 3))
       return f
 
     node = self._parse_and_analyze(test_fn, {'Foo': Foo, 'utils': utils})
     f_def = node.body[0].body[0].value
-    self.assertEqual(anno.getanno(f_def, 'element_type'), Foo)
+    self.assertEqual(anno.getanno(f_def, 'element_type').id, 'Foo')
     f_ref = node.body[0].body[1].value
-    self.assertEqual(anno.getanno(f_ref, 'element_type'), Foo)
+    self.assertEqual(anno.getanno(f_ref, 'element_type').id, 'Foo')
+
+  def test_type_annotation_args(self):
+
+    class Foo(object):
+      pass
+
+    def test_fn(f):
+      utils.set_element_type(f, Foo)
+      return f
+
+    node = self._parse_and_analyze(test_fn, {'Foo': Foo, 'utils': utils})
+    f_ref = node.body[0].body[1].value
+    self.assertEqual(anno.getanno(f_ref, 'element_type').id, 'Foo')
 
   def test_nested_unpacking(self):
 
@@ -210,9 +223,9 @@ class TypeInfoResolverTest(test.TestCase):
 
     node = self._parse_and_analyze(test_fn, {'Foo': Foo, 'Bar': Bar})
     a, b, c = node.body[0].body[1].value.elts
-    self.assertEquals(Foo, anno.getanno(a, 'type'))
-    self.assertEquals(Bar, anno.getanno(b, 'type'))
-    self.assertEquals(Foo, anno.getanno(c, 'type'))
+    self.assertEquals(anno.getanno(a, 'type'), Foo)
+    self.assertEquals(anno.getanno(b, 'type'), Bar)
+    self.assertEquals(anno.getanno(c, 'type'), Foo)
     self.assertFalse(anno.hasanno(a, 'live_val'))
     self.assertFalse(anno.hasanno(b, 'live_val'))
     self.assertFalse(anno.hasanno(c, 'live_val'))
@@ -229,8 +242,8 @@ class TypeInfoResolverTest(test.TestCase):
 
     node = self._parse_and_analyze(test_fn, {'utils': utils})
     a, b = node.body[0].body[2].body[2].value.elts
-    self.assertEquals(1, anno.getanno(a, 'element_type'))
-    self.assertEquals(2, anno.getanno(b, 'element_type'))
+    self.assertEquals(anno.getanno(a, 'element_type').n, 1)
+    self.assertEquals(anno.getanno(b, 'element_type').n, 2)
     self.assertFalse(anno.hasanno(a, 'type'))
     self.assertFalse(anno.hasanno(b, 'type'))
     self.assertFalse(anno.hasanno(a, 'live_val'))
diff --git a/tensorflow/contrib/autograph/pyct/templates.py b/tensorflow/contrib/autograph/pyct/templates.py
index baf7923fff7c786c1abd05e11fa6ffdb8c8f0912..9c479ebc2fa83d27dc363ae306daedb556734a1f 100644
--- a/tensorflow/contrib/autograph/pyct/templates.py
+++ b/tensorflow/contrib/autograph/pyct/templates.py
@@ -239,8 +239,13 @@ def replace_as_expression(template, **replacements):
     raise ValueError(
         'single expression expected; for more general templates use replace')
   node = replacement[0]
-  if not isinstance(node, gast.Expr):
-    raise ValueError(
-        'the template is expected to generate an expression node; instead '
-        'found %s' % node)
-  return node.value
+  node = qual_names.resolve(node)
+
+  if isinstance(node, gast.Expr):
+    return node.value
+  elif isinstance(node, gast.Name):
+    return node
+
+  raise ValueError(
+      'the template is expected to generate an expression or a name node;'
+      ' instead found %s' % node)
diff --git a/tensorflow/contrib/autograph/pyct/transformer.py b/tensorflow/contrib/autograph/pyct/transformer.py
index 4db6cc0adfad90ffc1a6bbcadfc80215688d271e..60bca8b38dcf62b4e997379d075cfc45511a894f 100644
--- a/tensorflow/contrib/autograph/pyct/transformer.py
+++ b/tensorflow/contrib/autograph/pyct/transformer.py
@@ -70,14 +70,40 @@ class Base(gast.NodeTransformer):
     return tuple(self._enclosing_entities)
 
   @property
-  def locel_scope_level(self):
+  def local_scope_level(self):
     return len(self._local_scope_state)
 
-  def enter_local_scope(self):
-    self._local_scope_state.append({})
+  def enter_local_scope(self, inherit=None):
+    """Marks entry into a new local scope.
 
-  def exit_local_scope(self):
-    return self._local_scope_state.pop()
+    Args:
+      inherit: Optional enumerable of variable names to copy from the
+          parent scope.
+    """
+    scope_entered = {}
+    if inherit:
+      this_scope = self._local_scope_state[-1]
+      for name in inherit:
+        if name in this_scope:
+          scope_entered[name] = this_scope[name]
+    self._local_scope_state.append(scope_entered)
+
+  def exit_local_scope(self, keep=None):
+    """Marks exit from the current local scope.
+
+    Args:
+      keep: Optional enumerable of variable names to copy into the
+          parent scope.
+    Returns:
+      A dict containing the scope that has just been exited.
+    """
+    scope_left = self._local_scope_state.pop()
+    if keep:
+      this_scope = self._local_scope_state[-1]
+      for name in keep:
+        if name in scope_left:
+          this_scope[name] = scope_left[name]
+    return scope_left
 
   def set_local(self, name, value):
     self._local_scope_state[-1][name] = value
@@ -91,38 +117,163 @@ class Base(gast.NodeTransformer):
       print(pretty_printer.fmt(node))
     return node
 
-  def visit_block(self, nodes):
-    """Helper equivalent to generic_visit, but for node lists."""
+  def visit_block(self, nodes, before_visit=None, after_visit=None):
+    """A more powerful version of generic_visit for statement blocks.
+
+    An example of a block is the body of an if statement.
+
+    This function allows specifying a postprocessing callback (the
+    after_visit argument) argument which can be used to move nodes to a new
+    destination. This is done by after_visit by returning a non-null
+    second return value, e.g. return new_node, new_destination.
+
+    For example, a transformer could perform the following move:
+
+        foo()
+        bar()
+        baz()
+
+        foo()
+        if cond:
+          bar()
+          baz()
+
+    The above could be done with a postprocessor of this kind:
+
+        def after_visit(node):
+          if node_is_function_call(bar):
+            new_container_node = build_cond()
+            new_container_node.body.append(node)
+            return new_container_node, new_container_node.body
+          else:
+            # Once we set a new destination, all subsequent items will be
+            # moved to it, so we don't need to explicitly handle baz.
+            return node, None
+
+    Args:
+      nodes: enumerable of AST node objects
+      before_visit: optional callable that is called before visiting each item
+          in nodes
+      after_visit: optional callable that takes in an AST node and
+          returns a tuple (new_node, new_destination). It is called after
+          visiting each item in nodes. Is used in the same was as the
+          visit_* methods: new_node will replace the node; if not None,
+          new_destination must be a list, and subsequent nodes will be placed
+          in this list instead of the list returned by visit_block.
+    Returns:
+      A list of AST node objects containing the transformed items fron nodes,
+      except those nodes that have been relocated using after_visit.
+    """
     results = []
+    node_destination = results
     for node in nodes:
+      if before_visit:
+        # TODO(mdan): We can modify node here too, if ever needed.
+        before_visit()
+
       replacement = self.visit(node)
+
+      if after_visit and replacement:
+        replacement, new_destination = after_visit(replacement)
+      else:
+        new_destination = None
+
       if replacement:
         if isinstance(replacement, (list, tuple)):
-          results.extend(replacement)
+          node_destination.extend(replacement)
         else:
-          results.append(replacement)
+          node_destination.append(replacement)
+
+      # Allow the postprocessor to reroute the remaining nodes to a new list.
+      if new_destination is not None:
+        node_destination = new_destination
     return results
 
+  # TODO(mdan): Once we have error tracing, we may be able to just go to SSA.
+  def apply_to_single_assignments(self, targets, values, apply_fn):
+    """Applies a fuction to each individual assignment.
+
+    This function can process a possibly-unpacked (e.g. a, b = c, d) assignment.
+    It tries to break down the unpacking if possible. In effect, it has the same
+    effect as passing the assigned values in SSA form to apply_fn.
+
+    Examples:
+
+    The following will result in apply_fn(a, c), apply_fn(b, d):
+
+        a, b = c, d
+
+    The following will result in apply_fn(a, c[0]), apply_fn(b, c[1]):
+
+        a, b = c
+
+    The following will result in apply_fn(a, (b, c)):
+
+        a = b, c
+
+    It uses the visitor pattern to allow subclasses to process single
+    assignments individually.
+
+    Args:
+      targets: list, tuple of or individual AST node. Should be used with the
+          targets field of an ast.Assign node.
+      values: an AST node.
+      apply_fn: a function of a single argument, which will be called with the
+          respective nodes of each single assignment. The signaure is
+          apply_fn(target, value), no return value.
+    """
+    if not isinstance(targets, (list, tuple)):
+      targets = (targets,)
+    for target in targets:
+      if isinstance(target, (gast.Tuple, gast.List)):
+        for i in range(len(target.elts)):
+          target_el = target.elts[i]
+          if isinstance(values, (gast.Tuple, gast.List)):
+            value_el = values.elts[i]
+          else:
+            value_el = gast.Subscript(values, gast.Index(i), ctx=gast.Store())
+          self.apply_to_single_assignments(target_el, value_el, apply_fn)
+      else:
+        # TODO(mdan): Look into allowing to rewrite the AST here.
+        apply_fn(target, values)
+
   def visit(self, node):
     source_code = self.context.source_code
     source_file = self.context.source_file
     did_enter_function = False
-    local_scope_state_size = len(self._local_scope_state)
+    local_scope_size_at_entry = len(self._local_scope_state)
 
     try:
       if isinstance(node, (gast.FunctionDef, gast.ClassDef, gast.Lambda)):
-        self._enclosing_entities.append(node)
         did_enter_function = True
 
+      if did_enter_function:
+        self._enclosing_entities.append(node)
+
       if source_code and hasattr(node, 'lineno'):
         self._lineno = node.lineno
         self._col_offset = node.col_offset
-      if anno.hasanno(node, anno.Basic.SKIP_PROCESSING):
-        return node
-      return super(Base, self).visit(node)
 
-    except (ValueError, AttributeError, KeyError, NotImplementedError,
-            AssertionError) as e:
+      if not anno.hasanno(node, anno.Basic.SKIP_PROCESSING):
+        result = super(Base, self).visit(node)
+
+      # On exception, the local scope integrity is not guaranteed.
+      if did_enter_function:
+        self._enclosing_entities.pop()
+
+      if local_scope_size_at_entry != len(self._local_scope_state):
+        raise AssertionError(
+            'Inconsistent local scope stack. Before entering node %s, the'
+            ' stack had length %d, after exit it has length %d. This'
+            ' indicates enter_local_scope and exit_local_scope are not'
+            ' well paired.' % (
+                node,
+                local_scope_size_at_entry,
+                len(self._local_scope_state)
+            ))
+      return result
+
+    except (ValueError, AttributeError, KeyError, NotImplementedError) as e:
       msg = '%s: %s\nOffending source:\n%s\n\nOccurred at node:\n%s' % (
           e.__class__.__name__, str(e), try_ast_to_source(node),
           pretty_printer.fmt(node, color=False))
@@ -130,18 +281,11 @@ class Base(gast.NodeTransformer):
         line = source_code.splitlines()[self._lineno - 1]
       else:
         line = '<no source available>'
+      # TODO(mdan): Avoid the printing of the original exception.
+      # In other words, we need to find how to suppress the "During handling
+      # of the above exception, another exception occurred" message.
       six.reraise(AutographParseError,
                   AutographParseError(
                       msg,
                       (source_file, self._lineno, self._col_offset + 1, line)),
                   sys.exc_info()[2])
-    finally:
-      if did_enter_function:
-        self._enclosing_entities.pop()
-
-      if local_scope_state_size != len(self._local_scope_state):
-        raise AssertionError(
-            'Inconsistent local scope stack. Before entering node %s, the'
-            ' stack had length %d, after exit it has length %d. This'
-            ' indicates enter_local_scope and exit_local_scope are not'
-            ' well paired.')
diff --git a/tensorflow/contrib/autograph/pyct/transformer_test.py b/tensorflow/contrib/autograph/pyct/transformer_test.py
index f96b0dc377521a482d347436caa98633a0a32c8a..f110e79605945e908e8a49112cf758ec29fa1b11 100644
--- a/tensorflow/contrib/autograph/pyct/transformer_test.py
+++ b/tensorflow/contrib/autograph/pyct/transformer_test.py
@@ -18,6 +18,8 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import gast
+
 from tensorflow.contrib.autograph.pyct import anno
 from tensorflow.contrib.autograph.pyct import context
 from tensorflow.contrib.autograph.pyct import parser
@@ -27,7 +29,7 @@ from tensorflow.python.platform import test
 
 class TransformerTest(test.TestCase):
 
-  def _context_for_nodetesting(self):
+  def _context_for_testing(self):
     return context.EntityContext(
         namer=None,
         source_code=None,
@@ -53,7 +55,7 @@ class TransformerTest(test.TestCase):
         anno.setanno(node, 'enclosing_entities', self.enclosing_entities)
         return self.generic_visit(node)
 
-    tr = TestTransformer(self._context_for_nodetesting())
+    tr = TestTransformer(self._context_for_testing())
 
     def test_function():
       a = 0
@@ -94,7 +96,7 @@ class TransformerTest(test.TestCase):
                       inner_function, lambda_node),
                      anno.getanno(lambda_expr, 'enclosing_entities'))
 
-  def test_statement_info_stack(self):
+  def test_local_scope_info_stack(self):
 
     class TestTransformer(transformer.Base):
 
@@ -116,7 +118,7 @@ class TransformerTest(test.TestCase):
       def visit_For(self, node):
         return self._annotate_result(node)
 
-    tr = TestTransformer(self._context_for_nodetesting())
+    tr = TestTransformer(self._context_for_testing())
 
     def test_function(a):
       """Docstring."""
@@ -142,7 +144,7 @@ class TransformerTest(test.TestCase):
     self.assertFalse(anno.hasanno(while_node, 'string'))
     self.assertEqual('1', anno.getanno(while_node, 'test'))
 
-  def test_statement_info_stack_checks_integrity(self):
+  def test_local_scope_info_stack_checks_integrity(self):
 
     class TestTransformer(transformer.Base):
 
@@ -155,7 +157,7 @@ class TransformerTest(test.TestCase):
         self.exit_local_scope()
         return node
 
-    tr = TestTransformer(self._context_for_nodetesting())
+    tr = TestTransformer(self._context_for_testing())
 
     def no_exit(a):
       if a > 0:
@@ -174,6 +176,38 @@ class TransformerTest(test.TestCase):
     with self.assertRaises(AssertionError):
       tr.visit(node)
 
+  def test_visit_block_postprocessing(self):
+
+    class TestTransformer(transformer.Base):
+
+      def _process_body_item(self, node):
+        if isinstance(node, gast.Assign) and (node.value.id == 'y'):
+          if_node = gast.If(gast.Name('x', gast.Load(), None), [node], [])
+          return if_node, if_node.body
+        return node, None
+
+      def visit_FunctionDef(self, node):
+        node.body = self.visit_block(
+            node.body, after_visit=self._process_body_item)
+        return node
+
+    def test_function(x, y):
+      z = x
+      z = y
+      return z
+
+    tr = TestTransformer(self._context_for_testing())
+
+    node, _ = parser.parse_entity(test_function)
+    node = tr.visit(node)
+    node = node.body[0]
+
+    self.assertEqual(len(node.body), 2)
+    self.assertTrue(isinstance(node.body[0], gast.Assign))
+    self.assertTrue(isinstance(node.body[1], gast.If))
+    self.assertTrue(isinstance(node.body[1].body[0], gast.Assign))
+    self.assertTrue(isinstance(node.body[1].body[1], gast.Return))
+
 
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/contrib/autograph/utils/BUILD b/tensorflow/contrib/autograph/utils/BUILD
index d3a1b9468892531cbc51bc13de66ef595f1a95f8..d82c17bf2afd01aedf4344f983b02c09abcb9bad 100644
--- a/tensorflow/contrib/autograph/utils/BUILD
+++ b/tensorflow/contrib/autograph/utils/BUILD
@@ -33,6 +33,8 @@ py_library(
     srcs_version = "PY2AND3",
     visibility = ["//tensorflow:__subpackages__"],
     deps = [
+        "//tensorflow/contrib/autograph/pyct",
+        "//tensorflow/python:dtypes",
         "//tensorflow/python:list_ops",
         "//tensorflow/python:script_ops",
         "//tensorflow/python/data/ops:dataset_ops",
diff --git a/tensorflow/contrib/autograph/utils/builtins.py b/tensorflow/contrib/autograph/utils/builtins.py
index 211e8eaee9082dd3e4f035e4379871cd2e154a39..998087e056c2cd264399982220d6e0528aab9edb 100644
--- a/tensorflow/contrib/autograph/utils/builtins.py
+++ b/tensorflow/contrib/autograph/utils/builtins.py
@@ -24,6 +24,7 @@ import six
 
 from tensorflow.contrib.autograph.utils import py_func
 from tensorflow.contrib.autograph.utils import type_check
+from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import tensor_util
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import logging_ops
@@ -38,7 +39,13 @@ def dynamic_builtin(f, *args, **kwargs):
     return dynamic_range(*args, **kwargs)
   if f is range:
     return dynamic_range(*args, **kwargs)
-  raise ValueError('%s is not supported' % f)
+  if f is int:
+    return dynamic_int(*args, **kwargs)
+  if f is float:
+    return dynamic_float(*args, **kwargs)
+
+  raise NotImplementedError(
+      'The "%s" builtin is not yet supported.' % f.__name__)
 
 
 def dynamic_len(list_or_tensor):
@@ -52,6 +59,20 @@ def dynamic_len(list_or_tensor):
   return len(list_or_tensor)
 
 
+def dynamic_int(num_or_tensor, **kwargs):
+  """Implementation of int() using dynamic dispatch."""
+  if tensor_util.is_tensor(num_or_tensor):
+    return math_ops.cast(num_or_tensor, dtype=dtypes.int32, **kwargs)
+  return int(num_or_tensor)
+
+
+def dynamic_float(num_or_tensor, **kwargs):
+  """Implementation of float() using dynamic dispatch."""
+  if tensor_util.is_tensor(num_or_tensor):
+    return math_ops.cast(num_or_tensor, dtype=dtypes.float32, **kwargs)
+  return float(num_or_tensor)
+
+
 def dynamic_range(start_or_stop, stop=None, step=None):
   """Implementation of range using dynamic dispatch."""
   if type_check.is_tensor(start_or_stop, stop, step):
diff --git a/tensorflow/contrib/autograph/utils/builtins_test.py b/tensorflow/contrib/autograph/utils/builtins_test.py
index 163e6984079fea5c3b3d9aeda0ec8048d651686f..0c2312178a921037fa419818bf309d671c33914d 100644
--- a/tensorflow/contrib/autograph/utils/builtins_test.py
+++ b/tensorflow/contrib/autograph/utils/builtins_test.py
@@ -24,6 +24,7 @@ import six
 
 from tensorflow.contrib.autograph.utils import builtins
 from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
 from tensorflow.python.platform import test
 
 
@@ -77,7 +78,7 @@ class BuiltinsTest(test.TestCase):
       return x
 
     # Functions that just have the names of builtins are rejected.
-    with self.assertRaises(ValueError):
+    with self.assertRaises(NotImplementedError):
       self.assertEqual(builtins.dynamic_builtin(range, 1), 1)
     if six.PY2:
       self.assertListEqual(
@@ -87,6 +88,20 @@ class BuiltinsTest(test.TestCase):
     self.assertListEqual(
         list(builtins.dynamic_builtin(six.moves.xrange, 3)), [0, 1, 2])
 
+  def test_casts(self):
+    i = constant_op.constant(2, dtype=dtypes.int32)
+    f = constant_op.constant(1.0, dtype=dtypes.float32)
+
+    self.assertEqual(builtins.dynamic_builtin(int, i).dtype, dtypes.int32)
+    self.assertEqual(builtins.dynamic_builtin(int, f).dtype, dtypes.int32)
+    self.assertEqual(builtins.dynamic_builtin(float, i).dtype, dtypes.float32)
+    self.assertEqual(builtins.dynamic_builtin(float, f).dtype, dtypes.float32)
+
+    self.assertEqual(builtins.dynamic_builtin(int, True), 1)
+    self.assertEqual(builtins.dynamic_builtin(int, False), 0)
+    self.assertEqual(builtins.dynamic_builtin(float, True), 1.0)
+    self.assertEqual(builtins.dynamic_builtin(float, False), 0.0)
+
   def test_dynamic_print_tf(self):
     try:
       out_capturer = six.StringIO()
diff --git a/tensorflow/contrib/batching/BUILD b/tensorflow/contrib/batching/BUILD
index d65c990c87cbc316472237d183c03765416501e7..b27a19b16c08cb588b45949105a6399623e766e1 100644
--- a/tensorflow/contrib/batching/BUILD
+++ b/tensorflow/contrib/batching/BUILD
@@ -49,6 +49,14 @@ cc_library(
     ],
 )
 
+cc_library(
+    name = "serial_device_batch_scheduler",
+    hdrs = ["serial_device_batch_scheduler.h"],
+    deps = [
+        "//tensorflow/core/kernels/batching_util:serial_device_batch_scheduler",
+    ],
+)
+
 cc_library(
     name = "basic_batch_scheduler",
     hdrs = ["basic_batch_scheduler.h"],
@@ -96,6 +104,7 @@ py_test(
     name = "batch_ops_test",
     size = "small",
     srcs = ["python/ops/batch_ops_test.py"],
+    shard_count = 5,
     srcs_version = "PY2AND3",
     tags = [
         "manual",
diff --git a/tensorflow/contrib/batching/__init__.py b/tensorflow/contrib/batching/__init__.py
index 44fa5f42a73bfb1bf008f6f4eafd14913c88dcfa..1e503a097a7b72d9244b0a1cf57747c4b4122c81 100644
--- a/tensorflow/contrib/batching/__init__.py
+++ b/tensorflow/contrib/batching/__init__.py
@@ -14,6 +14,7 @@
 # ==============================================================================
 """Ops and modules related to batch.
 
+@@batch_function_v1
 @@batch_function
 """
 from __future__ import absolute_import
diff --git a/tensorflow/contrib/batching/python/ops/batch_ops.py b/tensorflow/contrib/batching/python/ops/batch_ops.py
index 921d6917a4e478c3e60771fdc3ae99febc33d2e3..012a51f71101471850d312033c41dcbc4805d44c 100644
--- a/tensorflow/contrib/batching/python/ops/batch_ops.py
+++ b/tensorflow/contrib/batching/python/ops/batch_ops.py
@@ -18,6 +18,7 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+from tensorflow.python.framework import function
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import gen_batch_ops
 # go/tf-wildcard-import
@@ -83,6 +84,74 @@ def batch_function(num_batch_threads,
   SparseTensor is not supported. The return value of the decorated function
   must be a Tensor or a list/tuple of Tensors.
 
+  Args:
+    num_batch_threads: Number of scheduling threads for processing batches
+     of work. Determines the number of batches processed in parallel.
+    max_batch_size: Batch sizes will never be bigger than this.
+    batch_timeout_micros: Maximum number of microseconds to wait before
+     outputting an incomplete batch.
+    allowed_batch_sizes: Optional list of allowed batch sizes. If left empty,
+     does nothing. Otherwise, supplies a list of batch sizes, causing the op
+     to pad batches up to one of those sizes. The entries must increase
+     monotonically, and the final entry must equal max_batch_size.
+    grad_timeout_micros: The timeout to use for the gradient. See the
+     documentation of the unbatch op for more details. Defaults to 60s.
+    unbatch_timeout_micros: The timeout to use for unbatching. See the
+     documentation of the unbatch op for more details. Defaults to 60s.
+    max_enqueued_batches: The maximum depth of the batch queue. Defaults to 10.
+
+  Returns:
+    The decorated function will return the unbatched computation output Tensors.
+  """
+
+  def decorator(fn):  # pylint: disable=missing-docstring
+
+    def decorated(*args):  # pylint: disable=missing-docstring
+      types = [arg.dtype for arg in args]
+
+      @function.Defun(*types)
+      def computation(*computation_args):
+        return fn(*computation_args)
+
+      with ops.name_scope("batch") as name:
+        for a in args:
+          if not isinstance(a, ops.Tensor):
+            raise ValueError("All arguments to functions decorated with "
+                             "`batch_function`  are supposed to be Tensors; "
+                             "found %s" % repr(a))
+        for inp in computation.captured_inputs:
+          print("inp: %s" % inp)
+          for op in inp.consumers():
+            print("op: %s" % op)
+        return gen_batch_ops.batch_function(
+            num_batch_threads=num_batch_threads,
+            max_batch_size=max_batch_size,
+            batch_timeout_micros=batch_timeout_micros,
+            allowed_batch_sizes=allowed_batch_sizes,
+            max_enqueued_batches=max_enqueued_batches,
+            shared_name=name,
+            f=computation,
+            in_tensors=list(args),
+            captured_tensors=computation.captured_inputs,
+            Tout=[o.type for o in computation.definition.signature.output_arg])
+
+    return decorated
+
+  return decorator
+
+
+def batch_function_v1(num_batch_threads,
+                      max_batch_size,
+                      batch_timeout_micros,
+                      allowed_batch_sizes=None,
+                      grad_timeout_micros=60 * 1000 * 1000,
+                      unbatch_timeout_micros=60 * 1000 * 1000,
+                      max_enqueued_batches=10):
+  """Batches the computation done by the decorated function.
+
+  This is the older version of batch_function(). Please use the former instead
+  of this.
+
   Args:
     num_batch_threads: Number of scheduling threads for processing batches
      of work. Determines the number of batches processed in parallel.
diff --git a/tensorflow/contrib/batching/python/ops/batch_ops_test.py b/tensorflow/contrib/batching/python/ops/batch_ops_test.py
index e22f978dde6f1b7febc771d526201579c20292c7..78468145469df216344bc00f116add250dc51dd3 100644
--- a/tensorflow/contrib/batching/python/ops/batch_ops_test.py
+++ b/tensorflow/contrib/batching/python/ops/batch_ops_test.py
@@ -23,7 +23,10 @@ import time
 
 from tensorflow.contrib.batching.python.ops import batch_ops
 from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import function
+from tensorflow.python.framework.errors import InvalidArgumentError
 from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import gen_batch_ops
 from tensorflow.python.ops import gradients_impl
 from tensorflow.python.ops import script_ops
 from tensorflow.python.platform import test
@@ -185,12 +188,62 @@ class BatchOpsTest(test.TestCase):
       self.assertEqual(thread_results[0], [2])
       self.assertEqual(main_results[0], [3])
 
+  def testBasicUnbatchV1Decorated(self):
+    """Tests that the batch_function_v1 decorator works."""
+    with self.test_session() as sess:
+      @batch_ops.batch_function_v1(1, 10, 100000)
+      def computation(in_t):
+        return in_t + 1
+
+      inp = array_ops.placeholder(dtype=dtypes.int32, shape=[1])
+      result = computation(inp)
+      thread_results = []
+
+      def worker():
+        thread_results.extend(sess.run([result], feed_dict={inp: [1]}))
+
+      worker_thread = threading.Thread(target=worker)
+      worker_thread.start()
+      main_results = sess.run([result], feed_dict={inp: [2]})
+      worker_thread.join()
+      self.assertEqual(thread_results[0], [2])
+      self.assertEqual(main_results[0], [3])
+
   def testBasicUnbatchDecorated(self):
     """Tests that the batch_function decorator works."""
     with self.test_session() as sess:
+      # TODO(apassos): Removing this line causes test flakiness! Ideally should
+      # be investigated.
+      default_inp = array_ops.placeholder_with_default(2, shape=[])  # pylint: disable=unused-variable
+
       @batch_ops.batch_function(1, 10, 100000)
       def computation(in_t):
         return in_t + 1
+
+      inp = array_ops.placeholder(dtype=dtypes.int32, shape=[1])
+      result = computation(inp)
+      thread_results = []
+
+      def worker():
+        thread_results.extend(sess.run([result], feed_dict={inp: [1]}))
+
+      worker_thread = threading.Thread(target=worker)
+      worker_thread.start()
+      main_results = sess.run([result], feed_dict={inp: [2]})
+      worker_thread.join()
+      self.assertEqual(thread_results[0], [2])
+      self.assertEqual(main_results[0], [3])
+
+  def testBatchDecoratedWithCapturedInput(self):
+    """Tests that the batch_function decorator works."""
+    with self.test_session() as sess:
+      captured_inp0 = array_ops.placeholder_with_default(2, shape=[])
+      captured_inp1 = array_ops.placeholder_with_default(1, shape=[])
+
+      @batch_ops.batch_function(1, 10, 100000)
+      def computation(in_t):
+        return in_t + captured_inp0 - captured_inp1
+
       inp = array_ops.placeholder(dtype=dtypes.int32, shape=[1])
       result = computation(inp)
       thread_results = []
@@ -205,6 +258,114 @@ class BatchOpsTest(test.TestCase):
       self.assertEqual(thread_results[0], [2])
       self.assertEqual(main_results[0], [3])
 
+  def testBatchFunctionOp(self):
+    """Tests that the batch_function op works."""
+    with self.test_session() as sess:
+
+      @function.Defun(dtypes.int32)
+      def computation(in_t):
+        return in_t + 1
+
+      inp = array_ops.placeholder(dtype=dtypes.int32, shape=[1])
+      result = gen_batch_ops.batch_function(
+          [inp],
+          num_batch_threads=1,
+          max_batch_size=10,
+          batch_timeout_micros=100000,
+          Tout=[dtypes.int32],
+          f=computation,
+          captured_tensors=computation.captured_inputs)
+      thread_results = []
+
+      def worker():
+        thread_results.extend(sess.run([result], feed_dict={inp: [1]}))
+
+      worker_thread = threading.Thread(target=worker)
+      worker_thread.start()
+      main_results = sess.run([result], feed_dict={inp: [2]})
+      worker_thread.join()
+      self.assertEqual(thread_results[0], [2])
+      self.assertEqual(main_results[0], [3])
+
+  def testBatchFunctionOpWithCapturedInput(self):
+    """Tests that batch_function op works with captured input."""
+    with self.test_session() as sess:
+      captured_inp0 = array_ops.placeholder_with_default(2, shape=[])
+      captured_inp1 = array_ops.placeholder_with_default(1, shape=[])
+      inp = array_ops.placeholder(dtype=dtypes.int32, shape=[1])
+
+      @function.Defun(dtypes.int32)
+      def computation(inp):
+        return inp + captured_inp0 - captured_inp1
+
+      result = gen_batch_ops.batch_function(
+          num_batch_threads=1,
+          max_batch_size=10,
+          batch_timeout_micros=100000,  # 100ms
+          allowed_batch_sizes=[3, 10],
+          batching_queue="",
+          f=computation,
+          in_tensors=[inp],
+          captured_tensors=computation.captured_inputs,
+          Tout=[o.type for o in computation.definition.signature.output_arg])
+
+      thread_results = []
+
+      def worker():
+        thread_results.extend(sess.run([result], feed_dict={inp: [1]}))
+
+      worker_thread = threading.Thread(target=worker)
+      worker_thread.start()
+      main_results = sess.run([result], feed_dict={inp: [2]})
+      worker_thread.join()
+      self.assertEqual(thread_results[0], [2])
+      self.assertEqual(main_results[0], [3])
+
+  def testBatchFunctionOpWithInputError(self):
+    """Tests that batch_function op works with error in the inputs."""
+    with self.test_session() as sess:
+      inp = array_ops.placeholder(dtype=dtypes.int32, shape=[1])
+
+      @function.Defun(dtypes.int32, dtypes.int32)
+      def computation(in0, in1):
+        return in0 + in1
+
+      result = gen_batch_ops.batch_function(
+          [inp],  # computation actually expects 2 inputs.
+          num_batch_threads=1,
+          max_batch_size=10,
+          batch_timeout_micros=100000,  # 100ms
+          batching_queue="",
+          f=computation,
+          captured_tensors=computation.captured_inputs,
+          Tout=[o.type for o in computation.definition.signature.output_arg])
+
+      with self.assertRaisesRegexp(InvalidArgumentError,
+                                   ".*2 arguments.*but 1.*"):
+        sess.run([result], feed_dict={inp: [2]})
+
+  def testBasicUnbatchDecoratedWithReshape(self):
+    """Tests that the batch_function decorator works."""
+    with self.test_session() as sess:
+
+      @batch_ops.batch_function(1, 10, 100000)
+      def computation(in_t):
+        return array_ops.reshape(in_t, [-1]) + 1
+
+      inp = array_ops.placeholder(dtype=dtypes.int32, shape=[1, 1])
+      result = computation(inp)
+      thread_results = []
+
+      def worker():
+        thread_results.extend(sess.run([result], feed_dict={inp: [[1]]}))
+
+      worker_thread = threading.Thread(target=worker)
+      worker_thread.start()
+      main_results = sess.run([result], feed_dict={inp: [[2]]})
+      worker_thread.join()
+      self.assertEqual(thread_results[0], [2])
+      self.assertEqual(main_results[0], [3])
+
   def testUnbatchTimeout(self):
     """Tests that the unbatch timeout works."""
     with self.test_session() as sess:
diff --git a/tensorflow/compiler/xla/service/versioned_computation_handle.cc b/tensorflow/contrib/batching/serial_device_batch_scheduler.h
similarity index 59%
rename from tensorflow/compiler/xla/service/versioned_computation_handle.cc
rename to tensorflow/contrib/batching/serial_device_batch_scheduler.h
index a693c4695f0e776cf297d0ecd28d6de53bd5c0c6..bf6b7083612018eecf0d1784e60cbbf0c5796fef 100644
--- a/tensorflow/compiler/xla/service/versioned_computation_handle.cc
+++ b/tensorflow/contrib/batching/serial_device_batch_scheduler.h
@@ -13,20 +13,9 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#include "tensorflow/compiler/xla/service/versioned_computation_handle.h"
+#ifndef TENSORFLOW_CONTRIB_BATCHING_SERIAL_DEVICE_BATCH_SCHEDULER_H_
+#define TENSORFLOW_CONTRIB_BATCHING_SERIAL_DEVICE_BATCH_SCHEDULER_H_
 
-#include "tensorflow/core/lib/strings/strcat.h"
+#include "tensorflow/core/kernels/batching_util/serial_device_batch_scheduler.h"
 
-namespace xla {
-
-string VersionedComputationHandle::ToString() const {
-  return tensorflow::strings::StrCat(handle.handle(), ":v", version);
-}
-
-std::ostream& operator<<(std::ostream& out,
-                         const VersionedComputationHandle& versioned_handle) {
-  out << versioned_handle.ToString();
-  return out;
-}
-
-}  // namespace xla
+#endif  // TENSORFLOW_CONTRIB_BATCHING_SERIAL_DEVICE_BATCH_SCHEDULER_H_
diff --git a/tensorflow/contrib/bayesflow/python/ops/monte_carlo.py b/tensorflow/contrib/bayesflow/python/ops/monte_carlo.py
index 5770bcdd706723394bb06196d24aeb32b8b8491a..68fa415eeaf1d1ae7c2ecf1be1c300eddbfa4e69 100644
--- a/tensorflow/contrib/bayesflow/python/ops/monte_carlo.py
+++ b/tensorflow/contrib/bayesflow/python/ops/monte_carlo.py
@@ -12,10 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Monte Carlo integration and helpers.
-
-See the @{$python/contrib.bayesflow.monte_carlo} guide.
-"""
+"""Monte Carlo integration and helpers."""
 
 from __future__ import absolute_import
 from __future__ import division
diff --git a/tensorflow/contrib/boosted_trees/estimator_batch/dnn_tree_combined_estimator.py b/tensorflow/contrib/boosted_trees/estimator_batch/dnn_tree_combined_estimator.py
index 758754feac31f1d2cf10e69d7a9a6d288931c900..911d87fa10570382ee5f03edfc1bfd1d116c8360 100644
--- a/tensorflow/contrib/boosted_trees/estimator_batch/dnn_tree_combined_estimator.py
+++ b/tensorflow/contrib/boosted_trees/estimator_batch/dnn_tree_combined_estimator.py
@@ -232,7 +232,13 @@ def _dnn_tree_combined_model_fn(features,
         return update_op
 
   if predict_with_tree_only:
-    tree_train_logits = tree_logits
+    if mode == model_fn.ModeKeys.TRAIN or mode == model_fn.ModeKeys.PREDICT:
+      tree_train_logits = tree_logits
+    else:
+      tree_train_logits = control_flow_ops.cond(
+          global_step > dnn_steps_to_train,
+          lambda: tree_logits,
+          lambda: dnn_logits)
   else:
     tree_train_logits = dnn_logits + tree_logits
 
diff --git a/tensorflow/contrib/boosted_trees/estimator_batch/estimator.py b/tensorflow/contrib/boosted_trees/estimator_batch/estimator.py
index 89d0d611d2905492cec09e033b8cbc238ec7fac6..9c36c302210185bc390751a0229a61f2f8cd91b8 100644
--- a/tensorflow/contrib/boosted_trees/estimator_batch/estimator.py
+++ b/tensorflow/contrib/boosted_trees/estimator_batch/estimator.py
@@ -41,7 +41,8 @@ class GradientBoostedDecisionTreeClassifier(estimator.Estimator):
                feature_engineering_fn=None,
                logits_modifier_function=None,
                center_bias=True,
-               use_core_libs=False):
+               use_core_libs=False,
+               output_leaf_index=False):
     """Initializes a GradientBoostedDecisionTreeClassifier estimator instance.
 
     Args:
@@ -66,6 +67,16 @@ class GradientBoostedDecisionTreeClassifier(estimator.Estimator):
         the bias.
       use_core_libs: Whether feature columns and loss are from the core (as
         opposed to contrib) version of tensorflow.
+      output_leaf_index: whether to output leaf indices along with predictions
+        during inference. The leaf node indexes are available in predictions
+        dict by the key 'leaf_index'. It is a Tensor of rank 2 and its shape is
+        [batch_size, num_trees].
+        For example,
+        result_iter = classifier.predict(...)
+        for result_dict in result_iter:
+          # access leaf index list by result_dict["leaf_index"]
+          # which contains one leaf index per tree
+
     Raises:
       ValueError: If learner_config is not valid.
     """
@@ -74,7 +85,9 @@ class GradientBoostedDecisionTreeClassifier(estimator.Estimator):
       # supports second order derivative.
       def loss_fn(labels, logits, weights=None):
         result = losses.per_example_maxent_loss(
-            labels=labels, logits=logits, weights=weights,
+            labels=labels,
+            logits=logits,
+            weights=weights,
             num_classes=n_classes)
         return math_ops.reduce_mean(result[0])
     else:
@@ -102,6 +115,7 @@ class GradientBoostedDecisionTreeClassifier(estimator.Estimator):
             'center_bias': center_bias,
             'logits_modifier_function': logits_modifier_function,
             'use_core_libs': use_core_libs,
+            'output_leaf_index': output_leaf_index,
         },
         model_dir=model_dir,
         config=config,
@@ -124,7 +138,8 @@ class GradientBoostedDecisionTreeRegressor(estimator.Estimator):
                feature_engineering_fn=None,
                logits_modifier_function=None,
                center_bias=True,
-               use_core_libs=False):
+               use_core_libs=False,
+               output_leaf_index=False):
     """Initializes a GradientBoostedDecisionTreeRegressor estimator instance.
 
     Args:
@@ -151,6 +166,13 @@ class GradientBoostedDecisionTreeRegressor(estimator.Estimator):
         the bias.
       use_core_libs: Whether feature columns and loss are from the core (as
         opposed to contrib) version of tensorflow.
+      output_leaf_index: whether to output leaf indices along with predictions
+        during inference. The leaf node indexes are available in predictions
+        dict by the key 'leaf_index'. For example,
+        result_dict = classifier.predict(...)
+        for example_prediction_result in result_dict:
+          # access leaf index list by example_prediction_result["leaf_index"]
+          # which contains one leaf index per tree
     """
     head = head_lib.regression_head(
         label_name=label_name,
@@ -173,6 +195,7 @@ class GradientBoostedDecisionTreeRegressor(estimator.Estimator):
             'logits_modifier_function': logits_modifier_function,
             'center_bias': center_bias,
             'use_core_libs': use_core_libs,
+            'output_leaf_index': False,
         },
         model_dir=model_dir,
         config=config,
@@ -197,7 +220,8 @@ class GradientBoostedDecisionTreeEstimator(estimator.Estimator):
                feature_engineering_fn=None,
                logits_modifier_function=None,
                center_bias=True,
-               use_core_libs=False):
+               use_core_libs=False,
+               output_leaf_index=False):
     """Initializes a GradientBoostedDecisionTreeEstimator estimator instance.
 
     Args:
@@ -220,6 +244,13 @@ class GradientBoostedDecisionTreeEstimator(estimator.Estimator):
         the bias.
       use_core_libs: Whether feature columns and loss are from the core (as
         opposed to contrib) version of tensorflow.
+      output_leaf_index: whether to output leaf indices along with predictions
+        during inference. The leaf node indexes are available in predictions
+        dict by the key 'leaf_index'. For example,
+        result_dict = classifier.predict(...)
+        for example_prediction_result in result_dict:
+          # access leaf index list by example_prediction_result["leaf_index"]
+          # which contains one leaf index per tree
     """
     super(GradientBoostedDecisionTreeEstimator, self).__init__(
         model_fn=model.model_builder,
@@ -233,6 +264,7 @@ class GradientBoostedDecisionTreeEstimator(estimator.Estimator):
             'logits_modifier_function': logits_modifier_function,
             'center_bias': center_bias,
             'use_core_libs': use_core_libs,
+            'output_leaf_index': False,
         },
         model_dir=model_dir,
         config=config,
diff --git a/tensorflow/contrib/boosted_trees/estimator_batch/estimator_test.py b/tensorflow/contrib/boosted_trees/estimator_batch/estimator_test.py
index 0d58317bd59331cfcde0e12aeb3a3a03fc45d89b..75ef1b050028b6462b255827c06e836e5c481844 100644
--- a/tensorflow/contrib/boosted_trees/estimator_batch/estimator_test.py
+++ b/tensorflow/contrib/boosted_trees/estimator_batch/estimator_test.py
@@ -68,6 +68,28 @@ class BoostedTreeEstimatorTest(test_util.TensorFlowTestCase):
     classifier.evaluate(input_fn=_eval_input_fn, steps=1)
     classifier.export(self._export_dir_base)
 
+  def testThatLeafIndexIsInPredictions(self):
+    learner_config = learner_pb2.LearnerConfig()
+    learner_config.num_classes = 2
+    learner_config.constraints.max_tree_depth = 1
+    model_dir = tempfile.mkdtemp()
+    config = run_config.RunConfig()
+
+    classifier = estimator.GradientBoostedDecisionTreeClassifier(
+        learner_config=learner_config,
+        num_trees=1,
+        examples_per_layer=3,
+        model_dir=model_dir,
+        config=config,
+        feature_columns=[contrib_feature_column.real_valued_column("x")],
+        output_leaf_index=True)
+
+    classifier.fit(input_fn=_train_input_fn, steps=15)
+    result_iter = classifier.predict(input_fn=_eval_input_fn)
+    for prediction_dict in result_iter:
+      self.assertTrue("leaf_index" in prediction_dict)
+      self.assertTrue("logits" in prediction_dict)
+
   def testFitAndEvaluateDontThrowExceptionWithCoreForEstimator(self):
     learner_config = learner_pb2.LearnerConfig()
     learner_config.num_classes = 2
diff --git a/tensorflow/contrib/boosted_trees/estimator_batch/model.py b/tensorflow/contrib/boosted_trees/estimator_batch/model.py
index 15ab6d814522ab1dee58dcd71246354fc4d8a483..1ee891198939e53fc5913104b2c2e65dc977823f 100644
--- a/tensorflow/contrib/boosted_trees/estimator_batch/model.py
+++ b/tensorflow/contrib/boosted_trees/estimator_batch/model.py
@@ -63,6 +63,8 @@ def model_builder(features, labels, mode, params, config):
   num_trees = params["num_trees"]
   use_core_libs = params["use_core_libs"]
   logits_modifier_function = params["logits_modifier_function"]
+  output_leaf_index = params["output_leaf_index"]
+
   if features is None:
     raise ValueError("At least one feature must be specified.")
 
@@ -96,7 +98,8 @@ def model_builder(features, labels, mode, params, config):
       feature_columns=feature_columns,
       logits_dimension=head.logits_dimension,
       features=training_features,
-      use_core_columns=use_core_libs)
+      use_core_columns=use_core_libs,
+      output_leaf_index=output_leaf_index)
   with ops.name_scope("gbdt", "gbdt_optimizer"):
     predictions_dict = gbdt_model.predict(mode)
     logits = predictions_dict["predictions"]
@@ -127,6 +130,9 @@ def model_builder(features, labels, mode, params, config):
         labels=labels,
         train_op_fn=_train_op_fn,
         logits=logits)
+  if output_leaf_index and gbdt_batch.LEAF_INDEX in predictions_dict:
+    model_fn_ops.predictions[gbdt_batch.LEAF_INDEX] = predictions_dict[
+        gbdt_batch.LEAF_INDEX]
   if num_trees:
     if center_bias:
       num_trees += 1
diff --git a/tensorflow/contrib/boosted_trees/kernels/prediction_ops.cc b/tensorflow/contrib/boosted_trees/kernels/prediction_ops.cc
index b3fe38614e05801b223f0c96f7a70ce7e432a70b..9493c1a1394040db3b744f1b382b20bd5bd1988d 100644
--- a/tensorflow/contrib/boosted_trees/kernels/prediction_ops.cc
+++ b/tensorflow/contrib/boosted_trees/kernels/prediction_ops.cc
@@ -59,6 +59,7 @@ const char* kApplyDropoutAttributeName = "apply_dropout";
 const char* kApplyAveragingAttributeName = "apply_averaging";
 const char* kDropoutInfoOutputTensorName = "drop_out_tree_indices_weights";
 const char* kPredictionsTensorName = "predictions";
+const char* kLeafIndexTensorName = "leaf_index";
 
 void CalculateTreesToInclude(
     const boosted_trees::trees::DecisionTreeEnsembleConfig& config,
@@ -170,15 +171,22 @@ class GradientTreesPredictionOp : public OpKernel {
     core::ScopedUnref unref_me(ensemble_resource);
     if (use_locking_) {
       tf_shared_lock l(*ensemble_resource->get_mutex());
-      DoCompute(context, ensemble_resource);
+      DoCompute(context, ensemble_resource,
+                /*return_output_leaf_index=*/false);
     } else {
-      DoCompute(context, ensemble_resource);
+      DoCompute(context, ensemble_resource,
+                /*return_output_leaf_index=*/false);
     }
   }
 
- private:
-  void DoCompute(OpKernelContext* context,
-                 DecisionTreeEnsembleResource* ensemble_resource) {
+ protected:
+  // return_output_leaf_index is a boolean variable indicating whether to output
+  // leaf index in prediction. Though this class invokes only with this param
+  // value as false, the subclass GradientTreesPredictionVerboseOp will invoke
+  // with the true value.
+  virtual void DoCompute(OpKernelContext* context,
+                         DecisionTreeEnsembleResource* ensemble_resource,
+                         const bool return_output_leaf_index) {
     // Read dense float features list;
     OpInputList dense_float_features_list;
     OP_REQUIRES_OK(context, TensorUtils::ReadDenseFloatFeatures(
@@ -267,6 +275,14 @@ class GradientTreesPredictionOp : public OpKernel {
                                           &output_predictions_t));
     auto output_predictions = output_predictions_t->matrix<float>();
 
+    // Allocate output leaf index matrix.
+    Tensor* output_leaf_index_t = nullptr;
+    if (return_output_leaf_index) {
+      OP_REQUIRES_OK(context, context->allocate_output(
+                                  kLeafIndexTensorName,
+                                  {batch_size, ensemble_resource->num_trees()},
+                                  &output_leaf_index_t));
+    }
     // Run predictor.
     thread::ThreadPool* const worker_threads =
         context->device()->tensorflow_cpu_worker_threads()->workers;
@@ -288,11 +304,13 @@ class GradientTreesPredictionOp : public OpKernel {
             i, weight * (num_ensembles - i + start_averaging) / num_ensembles);
       }
       MultipleAdditiveTrees::Predict(adjusted, trees_to_include, batch_features,
-                                     worker_threads, output_predictions);
+                                     worker_threads, output_predictions,
+                                     output_leaf_index_t);
     } else {
       MultipleAdditiveTrees::Predict(
           ensemble_resource->decision_tree_ensemble(), trees_to_include,
-          batch_features, worker_threads, output_predictions);
+          batch_features, worker_threads, output_predictions,
+          output_leaf_index_t);
     }
 
     // Output dropped trees and original weights.
@@ -302,7 +320,6 @@ class GradientTreesPredictionOp : public OpKernel {
                                 {2, static_cast<int64>(dropped_trees.size())},
                                 &output_dropout_info_t));
     auto output_dropout_info = output_dropout_info_t->matrix<float>();
-
     for (int32 i = 0; i < dropped_trees.size(); ++i) {
       output_dropout_info(0, i) = dropped_trees[i];
       output_dropout_info(1, i) = original_weights[i];
@@ -326,6 +343,27 @@ class GradientTreesPredictionOp : public OpKernel {
 REGISTER_KERNEL_BUILDER(Name("GradientTreesPrediction").Device(DEVICE_CPU),
                         GradientTreesPredictionOp);
 
+// GradientTreesPredictionVerboseOp is derived from GradientTreesPredictionOp
+// and have an additional output of tensor of rank 2 containing leaf ids for
+// each tree where an instance ended up with.
+class GradientTreesPredictionVerboseOp : public GradientTreesPredictionOp {
+ public:
+  explicit GradientTreesPredictionVerboseOp(OpKernelConstruction* const context)
+      : GradientTreesPredictionOp(context) {}
+
+ protected:
+  void DoCompute(OpKernelContext* context,
+                 DecisionTreeEnsembleResource* ensemble_resource,
+                 bool return_output_leaf_index) override {
+    GradientTreesPredictionOp::DoCompute(context, ensemble_resource,
+                                         /*return_output_leaf_index=*/true);
+  }
+};
+
+REGISTER_KERNEL_BUILDER(
+    Name("GradientTreesPredictionVerbose").Device(DEVICE_CPU),
+    GradientTreesPredictionVerboseOp);
+
 class GradientTreesPartitionExamplesOp : public OpKernel {
  public:
   explicit GradientTreesPartitionExamplesOp(OpKernelConstruction* const context)
diff --git a/tensorflow/contrib/boosted_trees/kernels/split_handler_ops.cc b/tensorflow/contrib/boosted_trees/kernels/split_handler_ops.cc
index 04e32267cc4a00b3169c3abbcbf549805a0fb462..401bec84a20a0fefcddbfa1039a117e65f853633 100644
--- a/tensorflow/contrib/boosted_trees/kernels/split_handler_ops.cc
+++ b/tensorflow/contrib/boosted_trees/kernels/split_handler_ops.cc
@@ -43,47 +43,60 @@ namespace {
 const int32 DUMMY_FEATURE_DIMENSION = -1;
 }  // namespace
 
-class BaseBuildSplitOp : public OpKernel {
+class SplitBuilderState {
  public:
-  explicit BaseBuildSplitOp(OpKernelConstruction* const context)
-      : OpKernel(context) {
-    OP_REQUIRES_OK(context, context->GetAttr("feature_column_group_id",
-                                             &feature_column_group_id_));
+  explicit SplitBuilderState(OpKernelContext* const context) {
+    const Tensor* l1_regularization_t;
     OP_REQUIRES_OK(context,
-                   context->GetAttr("l1_regularization", &l1_regularization_));
+                   context->input("l1_regularization", &l1_regularization_t));
+    const Tensor* l2_regularization_t;
     OP_REQUIRES_OK(context,
-                   context->GetAttr("l2_regularization", &l2_regularization_));
-    OP_REQUIRES_OK(context, context->GetAttr("tree_complexity_regularization",
-                                             &tree_complexity_regularization_));
+                   context->input("l2_regularization", &l2_regularization_t));
+    const Tensor* tree_complexity_regularization_t;
+    OP_REQUIRES_OK(context, context->input("tree_complexity_regularization",
+                                           &tree_complexity_regularization_t));
+    const Tensor* min_node_weight_t;
     OP_REQUIRES_OK(context,
-                   context->GetAttr("min_node_weight", &min_node_weight_));
+                   context->input("min_node_weight", &min_node_weight_t));
 
-    int strategy;
-    OP_REQUIRES_OK(context, context->GetAttr("multiclass_strategy", &strategy));
+    const Tensor* feature_column_group_id_t;
+    OP_REQUIRES_OK(context, context->input("feature_column_group_id",
+                                           &feature_column_group_id_t));
+
+    const Tensor* multiclass_strategy_t;
+    OP_REQUIRES_OK(
+        context, context->input("multiclass_strategy", &multiclass_strategy_t));
+    int strategy = multiclass_strategy_t->scalar<int32>()();
     OP_REQUIRES(
         context,
         boosted_trees::learner::LearnerConfig_MultiClassStrategy_IsValid(
             strategy),
         errors::InvalidArgument("Wrong multiclass strategy passed."));
-    multiclass_strategy_ = LearnerConfig_MultiClassStrategy(strategy);
-  }
 
-  NodeStats ComputeNodeStats(const GradientStats& grad_stats) {
-    return NodeStats(l1_regularization_, l2_regularization_, min_node_weight_,
-                     multiclass_strategy_, grad_stats);
-  }
+    multiclass_strategy_ = LearnerConfig_MultiClassStrategy(strategy);
 
-  void ReadClassId(OpKernelContext* const context, int32* class_id) {
     const Tensor* class_id_t;
     OP_REQUIRES_OK(context, context->input("class_id", &class_id_t));
     OP_REQUIRES(context, TensorShapeUtils::IsScalar(class_id_t->shape()),
                 errors::InvalidArgument("class_id must be a scalar."));
-    *class_id = class_id_t->scalar<int32>()();
+    class_id_ = class_id_t->scalar<int32>()();
+
+    l1_regularization_ = l1_regularization_t->scalar<float>()();
+    l2_regularization_ = l2_regularization_t->scalar<float>()();
+    tree_complexity_regularization_ =
+        tree_complexity_regularization_t->scalar<float>()();
+    min_node_weight_ = min_node_weight_t->scalar<float>()();
+    feature_column_group_id_ = feature_column_group_id_t->scalar<int32>()();
+  }
+
+  NodeStats ComputeNodeStats(const GradientStats& grad_stats) {
+    return NodeStats(l1_regularization_, l2_regularization_, min_node_weight_,
+                     multiclass_strategy_, grad_stats);
   }
 
-  void FillLeaf(const int class_id, const NodeStats& best_node_stats,
+  void FillLeaf(const NodeStats& best_node_stats,
                 boosted_trees::trees::Leaf* leaf) const {
-    if (class_id == -1) {
+    if (class_id_ == -1) {
       // This would be the case either for TREE_PER_CLASS with only 2 classes,
       // or for other multiclass strategies.
       for (float f : best_node_stats.weight_contribution) {
@@ -93,25 +106,31 @@ class BaseBuildSplitOp : public OpKernel {
       CHECK(best_node_stats.weight_contribution.size() == 1)
           << "Weight contribution size = "
           << best_node_stats.weight_contribution.size();
-      leaf->mutable_sparse_vector()->add_index(class_id);
+      leaf->mutable_sparse_vector()->add_index(class_id_);
       leaf->mutable_sparse_vector()->add_value(
           best_node_stats.weight_contribution[0]);
     }
   }
 
- protected:
+  int32 feature_column_group_id() { return feature_column_group_id_; }
+  float tree_complexity_regularization() {
+    return tree_complexity_regularization_;
+  }
+
+ private:
   LearnerConfig_MultiClassStrategy multiclass_strategy_;
-  int32 feature_column_group_id_;
   float l1_regularization_;
   float l2_regularization_;
-  float min_node_weight_;
   float tree_complexity_regularization_;
+  float min_node_weight_;
+  int32 class_id_;
+  int32 feature_column_group_id_;
 };
 
-class BuildDenseInequalitySplitsOp : public BaseBuildSplitOp {
+class BuildDenseInequalitySplitsOp : public OpKernel {
  public:
   explicit BuildDenseInequalitySplitsOp(OpKernelConstruction* const context)
-      : BaseBuildSplitOp(context) {}
+      : OpKernel(context) {}
 
   void Compute(OpKernelContext* const context) override {
     const Tensor* num_minibatches_t;
@@ -139,9 +158,6 @@ class BuildDenseInequalitySplitsOp : public BaseBuildSplitOp {
     const Tensor* hessians_t;
     OP_REQUIRES_OK(context, context->input("hessians", &hessians_t));
 
-    int class_id;
-    ReadClassId(context, &class_id);
-
     // Find the number of unique partitions before we allocate the output.
     std::vector<int32> partition_boundaries;
     partition_boundaries.push_back(0);
@@ -185,6 +201,7 @@ class BuildDenseInequalitySplitsOp : public BaseBuildSplitOp {
                                 &output_splits_t));
     tensorflow::TTypes<string>::Vec output_splits =
         output_splits_t->vec<string>();
+    SplitBuilderState state(context);
     for (int root_idx = 0; root_idx < num_elements; ++root_idx) {
       float best_gain = std::numeric_limits<float>::lowest();
       int start_index = partition_boundaries[root_idx];
@@ -196,7 +213,7 @@ class BuildDenseInequalitySplitsOp : public BaseBuildSplitOp {
             GradientStats(*gradients_t, *hessians_t, bucket_idx);
       }
       root_gradient_stats *= normalizer_ratio;
-      NodeStats root_stats = ComputeNodeStats(root_gradient_stats);
+      NodeStats root_stats = state.ComputeNodeStats(root_gradient_stats);
       int32 best_bucket_idx = 0;
       NodeStats best_right_node_stats(0);
       NodeStats best_left_node_stats(0);
@@ -206,10 +223,10 @@ class BuildDenseInequalitySplitsOp : public BaseBuildSplitOp {
         GradientStats g(*gradients_t, *hessians_t, bucket_idx);
         g *= normalizer_ratio;
         left_gradient_stats += g;
-        NodeStats left_stats = ComputeNodeStats(left_gradient_stats);
+        NodeStats left_stats = state.ComputeNodeStats(left_gradient_stats);
         GradientStats right_gradient_stats =
             root_gradient_stats - left_gradient_stats;
-        NodeStats right_stats = ComputeNodeStats(right_gradient_stats);
+        NodeStats right_stats = state.ComputeNodeStats(right_gradient_stats);
         if (left_stats.gain + right_stats.gain > best_gain) {
           best_gain = left_stats.gain + right_stats.gain;
           best_left_node_stats = left_stats;
@@ -220,18 +237,18 @@ class BuildDenseInequalitySplitsOp : public BaseBuildSplitOp {
       SplitInfo split_info;
       auto* dense_split =
           split_info.mutable_split_node()->mutable_dense_float_binary_split();
-      dense_split->set_feature_column(feature_column_group_id_);
+      dense_split->set_feature_column(state.feature_column_group_id());
       dense_split->set_threshold(
           bucket_boundaries(bucket_ids(best_bucket_idx, 0)));
 
       auto* left_child = split_info.mutable_left_child();
       auto* right_child = split_info.mutable_right_child();
 
-      FillLeaf(class_id, best_left_node_stats, left_child);
-      FillLeaf(class_id, best_right_node_stats, right_child);
+      state.FillLeaf(best_left_node_stats, left_child);
+      state.FillLeaf(best_right_node_stats, right_child);
       split_info.SerializeToString(&output_splits(root_idx));
       gains(root_idx) =
-          best_gain - root_stats.gain - tree_complexity_regularization_;
+          best_gain - root_stats.gain - state.tree_complexity_regularization();
       output_partition_ids(root_idx) = partition_ids(start_index);
     }
   }
@@ -239,13 +256,10 @@ class BuildDenseInequalitySplitsOp : public BaseBuildSplitOp {
 REGISTER_KERNEL_BUILDER(Name("BuildDenseInequalitySplits").Device(DEVICE_CPU),
                         BuildDenseInequalitySplitsOp);
 
-class BuildSparseInequalitySplitsOp : public BaseBuildSplitOp {
+class BuildSparseInequalitySplitsOp : public OpKernel {
  public:
   explicit BuildSparseInequalitySplitsOp(OpKernelConstruction* const context)
-      : BaseBuildSplitOp(context) {
-    OP_REQUIRES_OK(context,
-                   context->GetAttr("bias_feature_id", &bias_feature_id_));
-  }
+      : OpKernel(context) {}
 
   void Compute(OpKernelContext* const context) override {
     const Tensor* num_minibatches_t;
@@ -275,8 +289,10 @@ class BuildSparseInequalitySplitsOp : public BaseBuildSplitOp {
     const Tensor* hessians_t;
     OP_REQUIRES_OK(context, context->input("hessians", &hessians_t));
 
-    int class_id;
-    ReadClassId(context, &class_id);
+    const Tensor* bias_feature_id_t;
+    OP_REQUIRES_OK(context,
+                   context->input("bias_feature_id", &bias_feature_id_t));
+    int64 bias_feature_id = bias_feature_id_t->scalar<int64>()();
 
     // For each partition (tree node), store starting index for each dimension.
     PartitionAndDimensionBoundaries partition_boundaries;
@@ -354,6 +370,7 @@ class BuildSparseInequalitySplitsOp : public BaseBuildSplitOp {
                                 &output_splits_t));
     tensorflow::TTypes<string>::Vec output_splits =
         output_splits_t->vec<string>();
+    SplitBuilderState state(context);
     // For each tree node that needs to be split.
     for (int root_idx = 0; root_idx < num_elements; ++root_idx) {
       const auto& dimension_boundaries =
@@ -372,7 +389,7 @@ class BuildSparseInequalitySplitsOp : public BaseBuildSplitOp {
 
       OP_REQUIRES(
           context,
-          bucket_ids_and_dimensions(bias_start_index, 0) == bias_feature_id_,
+          bucket_ids_and_dimensions(bias_start_index, 0) == bias_feature_id,
           errors::InvalidArgument("Bias feature ID missing."));
 
       // Dimension for bias feature is always 0
@@ -388,7 +405,7 @@ class BuildSparseInequalitySplitsOp : public BaseBuildSplitOp {
       GradientStats root_gradient_stats(*gradients_t, *hessians_t,
                                         bias_start_index);
       root_gradient_stats *= normalizer_ratio;
-      NodeStats root_stats = ComputeNodeStats(root_gradient_stats);
+      NodeStats root_stats = state.ComputeNodeStats(root_gradient_stats);
 
       // Iterate through dimensions.
       for (int j = 0; j < dimension_boundaries.size() - 1; ++j) {
@@ -408,7 +425,7 @@ class BuildSparseInequalitySplitsOp : public BaseBuildSplitOp {
             << bucket_ids_and_dimensions(start_index, 1) << " and for "
             << bucket_ids_and_dimensions(end_index - 1, 0) << " "
             << bucket_ids_and_dimensions(end_index - 1, 1);
-        if (bucket_ids_and_dimensions(start_index, 0) == bias_feature_id_) {
+        if (bucket_ids_and_dimensions(start_index, 0) == bias_feature_id) {
           // 0-dimension case which has a first bucket for catch all feature.
           CHECK(bucket_ids_and_dimensions(start_index, 1) == 0)
               << "Dimension of bias feature should be 0";
@@ -447,10 +464,10 @@ class BuildSparseInequalitySplitsOp : public BaseBuildSplitOp {
               present_gradient_stats - left_gradient_stats;
 
           {
-            NodeStats left_stats_default_left =
-                ComputeNodeStats(root_gradient_stats - right_gradient_stats);
+            NodeStats left_stats_default_left = state.ComputeNodeStats(
+                root_gradient_stats - right_gradient_stats);
             NodeStats right_stats_default_left =
-                ComputeNodeStats(right_gradient_stats);
+                state.ComputeNodeStats(right_gradient_stats);
             if (left_stats_default_left.gain + right_stats_default_left.gain >
                 best_gain) {
               best_gain =
@@ -466,9 +483,9 @@ class BuildSparseInequalitySplitsOp : public BaseBuildSplitOp {
           // enough missing examples.
           if (!fixed_default_direction) {
             NodeStats left_stats_default_right =
-                ComputeNodeStats(left_gradient_stats);
-            NodeStats right_stats_default_right =
-                ComputeNodeStats(root_gradient_stats - left_gradient_stats);
+                state.ComputeNodeStats(left_gradient_stats);
+            NodeStats right_stats_default_right = state.ComputeNodeStats(
+                root_gradient_stats - left_gradient_stats);
             if (left_stats_default_right.gain + right_stats_default_right.gain >
                 best_gain) {
               best_gain = left_stats_default_right.gain +
@@ -494,7 +511,7 @@ class BuildSparseInequalitySplitsOp : public BaseBuildSplitOp {
                           ->mutable_sparse_float_binary_split_default_left()
                           ->mutable_split();
       }
-      dense_split->set_feature_column(feature_column_group_id_);
+      dense_split->set_feature_column(state.feature_column_group_id());
       // Set the feature index for the best feature column.
       const int64 best_dimension_id =
           bucket_ids_and_dimensions(best_element_idx, 1);
@@ -505,11 +522,11 @@ class BuildSparseInequalitySplitsOp : public BaseBuildSplitOp {
 
       auto* left_child = split_info.mutable_left_child();
       auto* right_child = split_info.mutable_right_child();
-      FillLeaf(class_id, best_left_node_stats, left_child);
-      FillLeaf(class_id, best_right_node_stats, right_child);
+      state.FillLeaf(best_left_node_stats, left_child);
+      state.FillLeaf(best_right_node_stats, right_child);
       split_info.SerializeToString(&output_splits(root_idx));
       gains(root_idx) =
-          best_gain - root_stats.gain - tree_complexity_regularization_;
+          best_gain - root_stats.gain - state.tree_complexity_regularization();
       output_partition_ids(root_idx) = partition_ids(bias_start_index);
     }
   }
@@ -526,19 +543,14 @@ class BuildSparseInequalitySplitsOp : public BaseBuildSplitOp {
   // For each partition, store start indices of feature column dimensions.
   typedef std::vector<std::vector<DimensionBoundary>>
       PartitionAndDimensionBoundaries;
-
-  int64 bias_feature_id_;
 };
 REGISTER_KERNEL_BUILDER(Name("BuildSparseInequalitySplits").Device(DEVICE_CPU),
                         BuildSparseInequalitySplitsOp);
 
-class BuildCategoricalEqualitySplitsOp : public BaseBuildSplitOp {
+class BuildCategoricalEqualitySplitsOp : public OpKernel {
  public:
   explicit BuildCategoricalEqualitySplitsOp(OpKernelConstruction* const context)
-      : BaseBuildSplitOp(context) {
-    OP_REQUIRES_OK(context,
-                   context->GetAttr("bias_feature_id", &bias_feature_id_));
-  }
+      : OpKernel(context) {}
 
   void Compute(OpKernelContext* const context) override {
     const Tensor* num_minibatches_t;
@@ -561,8 +573,10 @@ class BuildCategoricalEqualitySplitsOp : public BaseBuildSplitOp {
     const Tensor* hessians_t;
     OP_REQUIRES_OK(context, context->input("hessians", &hessians_t));
 
-    int class_id;
-    ReadClassId(context, &class_id);
+    const Tensor* bias_feature_id_t;
+    OP_REQUIRES_OK(context,
+                   context->input("bias_feature_id", &bias_feature_id_t));
+    int64 bias_feature_id = bias_feature_id_t->scalar<int64>()();
 
     // Find the number of unique partitions before we allocate the output.
     std::vector<int32> partition_boundaries;
@@ -605,16 +619,17 @@ class BuildCategoricalEqualitySplitsOp : public BaseBuildSplitOp {
                                 &output_splits_t));
     tensorflow::TTypes<string>::Vec output_splits =
         output_splits_t->vec<string>();
+    SplitBuilderState state(context);
     for (int root_idx = 0; root_idx < num_elements; ++root_idx) {
       float best_gain = std::numeric_limits<float>::lowest();
       int start_index = partition_boundaries[non_empty_partitions[root_idx]];
       int end_index = partition_boundaries[non_empty_partitions[root_idx] + 1];
       // First feature ID in each partition should be the bias feature.
-      OP_REQUIRES(context, feature_ids(start_index, 0) == bias_feature_id_,
+      OP_REQUIRES(context, feature_ids(start_index, 0) == bias_feature_id,
                   errors::InvalidArgument("Bias feature ID missing."));
       GradientStats root_gradient_stats(*gradients_t, *hessians_t, start_index);
       root_gradient_stats *= normalizer_ratio;
-      NodeStats root_stats = ComputeNodeStats(root_gradient_stats);
+      NodeStats root_stats = state.ComputeNodeStats(root_gradient_stats);
       int32 best_feature_idx = 0;
       NodeStats best_right_node_stats(0);
       NodeStats best_left_node_stats(0);
@@ -625,8 +640,8 @@ class BuildCategoricalEqualitySplitsOp : public BaseBuildSplitOp {
         left_gradient_stats *= normalizer_ratio;
         GradientStats right_gradient_stats =
             root_gradient_stats - left_gradient_stats;
-        NodeStats left_stats = ComputeNodeStats(left_gradient_stats);
-        NodeStats right_stats = ComputeNodeStats(right_gradient_stats);
+        NodeStats left_stats = state.ComputeNodeStats(left_gradient_stats);
+        NodeStats right_stats = state.ComputeNodeStats(right_gradient_stats);
         if (left_stats.gain + right_stats.gain > best_gain) {
           best_gain = left_stats.gain + right_stats.gain;
           best_left_node_stats = left_stats;
@@ -637,21 +652,18 @@ class BuildCategoricalEqualitySplitsOp : public BaseBuildSplitOp {
       SplitInfo split_info;
       auto* equality_split = split_info.mutable_split_node()
                                  ->mutable_categorical_id_binary_split();
-      equality_split->set_feature_column(feature_column_group_id_);
+      equality_split->set_feature_column(state.feature_column_group_id());
       equality_split->set_feature_id(feature_ids(best_feature_idx, 0));
       auto* left_child = split_info.mutable_left_child();
       auto* right_child = split_info.mutable_right_child();
-      FillLeaf(class_id, best_left_node_stats, left_child);
-      FillLeaf(class_id, best_right_node_stats, right_child);
+      state.FillLeaf(best_left_node_stats, left_child);
+      state.FillLeaf(best_right_node_stats, right_child);
       split_info.SerializeToString(&output_splits(root_idx));
       gains(root_idx) =
-          best_gain - root_stats.gain - tree_complexity_regularization_;
+          best_gain - root_stats.gain - state.tree_complexity_regularization();
       output_partition_ids(root_idx) = partition_ids(start_index);
     }
   }
-
- private:
-  int64 bias_feature_id_;
 };
 
 REGISTER_KERNEL_BUILDER(
diff --git a/tensorflow/contrib/boosted_trees/lib/learner/batch/ordinal_split_handler.py b/tensorflow/contrib/boosted_trees/lib/learner/batch/ordinal_split_handler.py
index f06b73c00d0bebb2717a79b7894e2addf914daba..409a2d8f46c331c13aec10542c4967d50575e94a 100644
--- a/tensorflow/contrib/boosted_trees/lib/learner/batch/ordinal_split_handler.py
+++ b/tensorflow/contrib/boosted_trees/lib/learner/batch/ordinal_split_handler.py
@@ -64,6 +64,8 @@ from __future__ import print_function
 import re
 
 from tensorflow.contrib.boosted_trees.lib.learner.batch import base_split_handler
+from tensorflow.contrib.boosted_trees.python.ops import gen_quantile_ops
+from tensorflow.contrib.boosted_trees.python.ops import gen_stats_accumulator_ops
 from tensorflow.contrib.boosted_trees.python.ops import quantile_ops
 from tensorflow.contrib.boosted_trees.python.ops import split_handler_ops
 from tensorflow.contrib.boosted_trees.python.ops import stats_accumulator_ops
@@ -72,9 +74,11 @@ from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import function
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import sparse_tensor
+from tensorflow.python.framework import tensor_shape
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import math_ops
+
 _BIAS_FEATURE_ID = -1
 # Pattern to remove all non alpha numeric from a string.
 _PATTERN = re.compile(r"[\W_]+")
@@ -130,11 +134,14 @@ class InequalitySplitHandler(base_split_handler.BaseSplitHandler):
         gradient_shape,
         hessian_shape,
         name="StatsAccumulator/{}".format(self._name))
-    self._quantile_accumulator = quantile_ops.QuantileAccumulator(
-        init_stamp_token,
-        epsilon=epsilon,
-        num_quantiles=num_quantiles,
-        name="QuantileAccumulator/{}".format(self._name))
+    # Allocate both stats accumulator and quantile accumulator on the same
+    # device so that we can build splits with fewer RPCs.
+    with ops.colocate_with(self._stats_accumulator.resource()):
+      self._quantile_accumulator = quantile_ops.QuantileAccumulator(
+          init_stamp_token,
+          epsilon=epsilon,
+          num_quantiles=num_quantiles,
+          name="QuantileAccumulator/{}".format(self._name))
 
 
 class DenseSplitHandler(InequalitySplitHandler):
@@ -236,45 +243,74 @@ class DenseSplitHandler(InequalitySplitHandler):
 
   def make_splits(self, stamp_token, next_stamp_token, class_id):
     """Create the best split using the accumulated stats and flush the state."""
-    # Get the bucket boundaries
-    are_splits_ready, buckets = (
-        self._quantile_accumulator.get_buckets(stamp_token))
-    # After we receive the boundaries from previous iteration we can flush
-    # the quantile accumulator.
-    with ops.control_dependencies([buckets]):
-      flush_quantiles = self._quantile_accumulator.flush(
-          stamp_token=stamp_token, next_stamp_token=next_stamp_token)
-
-    # Get the aggregated gradients and hessians per <partition_id, feature_id>
-    # pair.
-    # In order to distribute the computation on all the PSs we use the PS that
-    # had the stats accumulator on.
-    with ops.device(None):
-      with ops.device(self._stats_accumulator.resource().device):
-        num_minibatches, partition_ids, bucket_ids, gradients, hessians = (
-            self._stats_accumulator.flush(stamp_token, next_stamp_token))
-
-        # Put quantile and stats accumulator flushing in the dependency path.
-        are_splits_ready = control_flow_ops.with_dependencies(
-            [flush_quantiles, partition_ids], are_splits_ready)
-
-        partition_ids, gains, split_infos = (
-            split_handler_ops.build_dense_inequality_splits(
-                num_minibatches=num_minibatches,
-                bucket_boundaries=buckets,
-                partition_ids=partition_ids,
-                bucket_ids=bucket_ids,
-                gradients=gradients,
-                hessians=hessians,
-                class_id=class_id,
-                feature_column_group_id=self._feature_column_group_id,
-                l1_regularization=self._l1_regularization,
-                l2_regularization=self._l2_regularization,
-                tree_complexity_regularization=self.
-                _tree_complexity_regularization,
-                min_node_weight=self._min_node_weight,
-                multiclass_strategy=self._multiclass_strategy))
-    return (are_splits_ready, partition_ids, gains, split_infos)
+    if (self._gradient_shape == tensor_shape.scalar() and
+        self._hessian_shape == tensor_shape.scalar()):
+      handler = make_dense_split_scalar
+    else:
+      handler = make_dense_split_tensor
+
+    are_splits_ready, partition_ids, gains, split_infos = (
+        handler(self._quantile_accumulator.resource(),
+                self._stats_accumulator.resource(), stamp_token,
+                next_stamp_token, self._multiclass_strategy, class_id,
+                self._feature_column_group_id, self._l1_regularization,
+                self._l2_regularization, self._tree_complexity_regularization,
+                self._min_node_weight))
+    return are_splits_ready, partition_ids, gains, split_infos
+
+
+def _make_dense_split(quantile_accumulator_handle, stats_accumulator_handle,
+                      stamp_token, next_stamp_token, multiclass_strategy,
+                      class_id, feature_column_id, l1_regularization,
+                      l2_regularization, tree_complexity_regularization,
+                      min_node_weight, is_multi_dimentional):
+  """Function that builds splits for a dense feature column."""
+  # Get the bucket boundaries
+  are_splits_ready, buckets = (
+      gen_quantile_ops.quantile_accumulator_get_buckets(
+          quantile_accumulator_handles=[quantile_accumulator_handle],
+          stamp_token=stamp_token))
+  # quantile_accumulator_get_buckets returns a list of results per handle that
+  # we pass to it. In this case we're getting results just for one resource.
+  are_splits_ready = are_splits_ready[0]
+  buckets = buckets[0]
+
+  # After we receive the boundaries from previous iteration we can flush
+  # the quantile accumulator.
+  with ops.control_dependencies([buckets]):
+    flush_quantiles = gen_quantile_ops.quantile_accumulator_flush(
+        quantile_accumulator_handle=quantile_accumulator_handle,
+        stamp_token=stamp_token,
+        next_stamp_token=next_stamp_token)
+
+  if is_multi_dimentional:
+    num_minibatches, partition_ids, bucket_ids, gradients, hessians = (
+        gen_stats_accumulator_ops.stats_accumulator_tensor_flush(
+            stats_accumulator_handle, stamp_token, next_stamp_token))
+  else:
+    num_minibatches, partition_ids, bucket_ids, gradients, hessians = (
+        gen_stats_accumulator_ops.stats_accumulator_scalar_flush(
+            stats_accumulator_handle, stamp_token, next_stamp_token))
+
+  # Put quantile and stats accumulator flushing in the dependency path.
+  with ops.control_dependencies([flush_quantiles, partition_ids]):
+    are_splits_ready = array_ops.identity(are_splits_ready)
+  partition_ids, gains, split_infos = (
+      split_handler_ops.build_dense_inequality_splits(
+          num_minibatches=num_minibatches,
+          bucket_boundaries=buckets,
+          partition_ids=partition_ids,
+          bucket_ids=bucket_ids,
+          gradients=gradients,
+          hessians=hessians,
+          class_id=class_id,
+          feature_column_group_id=feature_column_id,
+          l1_regularization=l1_regularization,
+          l2_regularization=l2_regularization,
+          tree_complexity_regularization=tree_complexity_regularization,
+          min_node_weight=min_node_weight,
+          multiclass_strategy=multiclass_strategy))
+  return are_splits_ready, partition_ids, gains, split_infos
 
 
 class SparseSplitHandler(InequalitySplitHandler):
@@ -327,9 +363,6 @@ class SparseSplitHandler(InequalitySplitHandler):
         multiclass_strategy=multiclass_strategy,
         init_stamp_token=init_stamp_token,
         name=name)
-    # Register sparse_make_stats_update function as an Op to the graph.
-    g = ops.get_default_graph()
-    sparse_make_stats_update.add_to_graph(g)
     self._sparse_float_column = sparse_float_column
 
   def scheduled_reads(self):
@@ -361,8 +394,8 @@ class SparseSplitHandler(InequalitySplitHandler):
     are_buckets_ready, buckets = scheduled_reads[0]
     with ops.name_scope(self._name, "SparseSplitHandler"):
       (quantile_indices, quantile_values, quantile_shapes, quantile_weights,
-       example_partition_ids,
-       feature_ids, gradients, hessians) = sparse_make_stats_update(
+       example_partition_ids, feature_ids, gradients,
+       hessians) = sparse_make_stats_update(
            is_active, are_buckets_ready, self._sparse_float_column.indices,
            self._sparse_float_column.values,
            self._sparse_float_column.dense_shape, buckets,
@@ -379,42 +412,115 @@ class SparseSplitHandler(InequalitySplitHandler):
 
   def make_splits(self, stamp_token, next_stamp_token, class_id):
     """Create the best split using the accumulated stats and flush the state."""
-    # Get the bucket boundaries
-    are_splits_ready, buckets = (
-        self._quantile_accumulator.get_buckets(stamp_token))
-
-    # After we receive the boundaries from previous iteration we can flush
-    # the quantile accumulator.
-    with ops.control_dependencies([buckets]):
-      flush_quantiles = self._quantile_accumulator.flush(
-          stamp_token=stamp_token, next_stamp_token=next_stamp_token)
-
-    with ops.device(None):
-      with ops.device(self._stats_accumulator.resource().device):
-        num_minibatches, partition_ids, bucket_ids, gradients, hessians = (
-            self._stats_accumulator.flush(stamp_token, next_stamp_token))
-
-        # Put quantile and stats accumulator flushing in the dependency path.
-        are_splits_ready = control_flow_ops.with_dependencies(
-            [flush_quantiles, partition_ids], are_splits_ready)
-        partition_ids, gains, split_infos = (
-            split_handler_ops.build_sparse_inequality_splits(
-                num_minibatches=num_minibatches,
-                bucket_boundaries=buckets,
-                partition_ids=partition_ids,
-                bucket_ids=bucket_ids,
-                gradients=gradients,
-                hessians=hessians,
-                class_id=class_id,
-                feature_column_group_id=self._feature_column_group_id,
-                l1_regularization=self._l1_regularization,
-                l2_regularization=self._l2_regularization,
-                tree_complexity_regularization=self.
-                _tree_complexity_regularization,
-                min_node_weight=self._min_node_weight,
-                bias_feature_id=_BIAS_FEATURE_ID,
-                multiclass_strategy=self._multiclass_strategy))
-    return (are_splits_ready, partition_ids, gains, split_infos)
+    if (self._gradient_shape == tensor_shape.scalar() and
+        self._hessian_shape == tensor_shape.scalar()):
+      handler = make_sparse_split_scalar
+    else:
+      handler = make_sparse_split_tensor
+
+    are_splits_ready, partition_ids, gains, split_infos = (
+        handler(self._quantile_accumulator.resource(),
+                self._stats_accumulator.resource(), stamp_token,
+                next_stamp_token, self._multiclass_strategy, class_id,
+                self._feature_column_group_id, self._l1_regularization,
+                self._l2_regularization, self._tree_complexity_regularization,
+                self._min_node_weight))
+    return are_splits_ready, partition_ids, gains, split_infos
+
+
+def _make_sparse_split(quantile_accumulator_handle, stats_accumulator_handle,
+                       stamp_token, next_stamp_token, multiclass_strategy,
+                       class_id, feature_column_id, l1_regularization,
+                       l2_regularization, tree_complexity_regularization,
+                       min_node_weight, is_multi_dimentional):
+  """Function that builds splits for a sparse feature column."""
+  # Get the bucket boundaries
+  are_splits_ready, buckets = (
+      gen_quantile_ops.quantile_accumulator_get_buckets(
+          quantile_accumulator_handles=[quantile_accumulator_handle],
+          stamp_token=stamp_token))
+  # quantile_accumulator_get_buckets returns a list of results per handle that
+  # we pass to it. In this case we're getting results just for one resource.
+  are_splits_ready = are_splits_ready[0]
+  buckets = buckets[0]
+
+  # After we receive the boundaries from previous iteration we can flush
+  # the quantile accumulator.
+  with ops.control_dependencies([buckets]):
+    flush_quantiles = gen_quantile_ops.quantile_accumulator_flush(
+        quantile_accumulator_handle=quantile_accumulator_handle,
+        stamp_token=stamp_token,
+        next_stamp_token=next_stamp_token)
+
+  if is_multi_dimentional:
+    num_minibatches, partition_ids, bucket_ids, gradients, hessians = (
+        gen_stats_accumulator_ops.stats_accumulator_tensor_flush(
+            stats_accumulator_handle, stamp_token, next_stamp_token))
+  else:
+    num_minibatches, partition_ids, bucket_ids, gradients, hessians = (
+        gen_stats_accumulator_ops.stats_accumulator_scalar_flush(
+            stats_accumulator_handle, stamp_token, next_stamp_token))
+
+  # Put quantile and stats accumulator flushing in the dependency path.
+  with ops.control_dependencies([flush_quantiles, partition_ids]):
+    are_splits_ready = array_ops.identity(are_splits_ready)
+  partition_ids, gains, split_infos = (
+      split_handler_ops.build_sparse_inequality_splits(
+          num_minibatches=num_minibatches,
+          bucket_boundaries=buckets,
+          partition_ids=partition_ids,
+          bucket_ids=bucket_ids,
+          gradients=gradients,
+          hessians=hessians,
+          class_id=class_id,
+          feature_column_group_id=feature_column_id,
+          l1_regularization=l1_regularization,
+          l2_regularization=l2_regularization,
+          tree_complexity_regularization=tree_complexity_regularization,
+          min_node_weight=min_node_weight,
+          bias_feature_id=_BIAS_FEATURE_ID,
+          multiclass_strategy=multiclass_strategy))
+  return are_splits_ready, partition_ids, gains, split_infos
+
+
+def _specialize_make_split(func, is_multi_dimentional):
+  """Builds a specialized version of the function."""
+
+  @function.Defun(
+      dtypes.resource,
+      dtypes.resource,
+      dtypes.int64,
+      dtypes.int64,
+      dtypes.int32,
+      dtypes.int32,
+      dtypes.int32,
+      dtypes.float32,
+      dtypes.float32,
+      dtypes.float32,
+      dtypes.float32,
+      noinline=True)
+  def f(quantile_accumulator_handle, stats_accumulator_handle, stamp_token,
+        next_stamp_token, multiclass_strategy, class_id, feature_column_id,
+        l1_regularization, l2_regularization, tree_complexity_regularization,
+        min_node_weight):
+    """Function that builds splits for a sparse feature column."""
+    return func(
+        quantile_accumulator_handle, stats_accumulator_handle, stamp_token,
+        next_stamp_token, multiclass_strategy, class_id, feature_column_id,
+        l1_regularization, l2_regularization, tree_complexity_regularization,
+        min_node_weight, is_multi_dimentional)
+
+  return f
+
+make_dense_split_scalar = _specialize_make_split(_make_dense_split,
+                                                 is_multi_dimentional=False)
+make_dense_split_tensor = _specialize_make_split(_make_dense_split,
+                                                 is_multi_dimentional=True)
+
+make_sparse_split_scalar = _specialize_make_split(_make_sparse_split,
+                                                  is_multi_dimentional=False)
+make_sparse_split_tensor = _specialize_make_split(_make_sparse_split,
+                                                  is_multi_dimentional=True)
 
 
 @function.Defun(
@@ -540,8 +646,9 @@ def sparse_make_stats_update(
 
   empty_float = constant_op.constant([], dtype=dtypes.float32)
   handler_not_active = (constant_op.constant(
-      [], dtype=dtypes.int64, shape=[0, 2]), empty_float, constant_op.constant(
-          [0, 1], dtype=dtypes.int64), empty_float)
+      [], dtype=dtypes.int64, shape=[0, 2]), empty_float,
+                        constant_op.constant([0, 1], dtype=dtypes.int64),
+                        empty_float)
   handler_active = (sparse_column_indices, sparse_column_values,
                     sparse_column_shape, weights)
   quantile_indices, quantile_values, quantile_shape, quantile_weights = (
diff --git a/tensorflow/contrib/boosted_trees/lib/learner/batch/ordinal_split_handler_test.py b/tensorflow/contrib/boosted_trees/lib/learner/batch/ordinal_split_handler_test.py
index 54d03018d9e266beabbbabd78ebbb80cfe689c04..2f2c2302113bf59d6a065d5005c934dc76c2148d 100644
--- a/tensorflow/contrib/boosted_trees/lib/learner/batch/ordinal_split_handler_test.py
+++ b/tensorflow/contrib/boosted_trees/lib/learner/batch/ordinal_split_handler_test.py
@@ -18,6 +18,8 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import numpy as np
+
 from tensorflow.contrib.boosted_trees.lib.learner.batch import ordinal_split_handler
 from tensorflow.contrib.boosted_trees.proto import learner_pb2
 from tensorflow.contrib.boosted_trees.proto import split_info_pb2
@@ -65,9 +67,9 @@ class DenseSplitHandlerTest(test_util.TensorFlowTestCase):
       hessian_shape = tensor_shape.scalar()
       split_handler = ordinal_split_handler.DenseSplitHandler(
           l1_regularization=0.1,
-          l2_regularization=1,
-          tree_complexity_regularization=0,
-          min_node_weight=0,
+          l2_regularization=1.,
+          tree_complexity_regularization=0.,
+          min_node_weight=0.,
           epsilon=0.001,
           num_quantiles=10,
           feature_column_group_id=0,
@@ -92,7 +94,9 @@ class DenseSplitHandlerTest(test_util.TensorFlowTestCase):
           example_weights,
           is_active=array_ops.constant([True, True]))
       with ops.control_dependencies([update_1]):
-        are_splits_ready = split_handler.make_splits(0, 1, class_id)[0]
+        are_splits_ready = split_handler.make_splits(
+            np.int64(0), np.int64(1), class_id)[0]
+
       with ops.control_dependencies([are_splits_ready]):
         update_2 = split_handler.update_stats_sync(
             1,
@@ -105,7 +109,7 @@ class DenseSplitHandlerTest(test_util.TensorFlowTestCase):
             is_active=array_ops.constant([True, True]))
       with ops.control_dependencies([update_2]):
         are_splits_ready2, partitions, gains, splits = (
-            split_handler.make_splits(1, 2, class_id))
+            split_handler.make_splits(np.int64(1), np.int64(2), class_id))
         are_splits_ready, are_splits_ready2, partitions, gains, splits = (
             sess.run([
                 are_splits_ready, are_splits_ready2, partitions, gains, splits
@@ -199,10 +203,10 @@ class DenseSplitHandlerTest(test_util.TensorFlowTestCase):
       hessian_shape = tensor_shape.TensorShape([2, 2])
 
       split_handler = ordinal_split_handler.DenseSplitHandler(
-          l1_regularization=0,
-          l2_regularization=1,
-          tree_complexity_regularization=0,
-          min_node_weight=0,
+          l1_regularization=0.,
+          l2_regularization=1.,
+          tree_complexity_regularization=0.,
+          min_node_weight=0.,
           epsilon=0.001,
           num_quantiles=3,
           feature_column_group_id=0,
@@ -227,7 +231,9 @@ class DenseSplitHandlerTest(test_util.TensorFlowTestCase):
           example_weights,
           is_active=array_ops.constant([True, True]))
       with ops.control_dependencies([update_1]):
-        are_splits_ready = split_handler.make_splits(0, 1, class_id)[0]
+        are_splits_ready = split_handler.make_splits(
+            np.int64(0), np.int64(1), class_id)[0]
+
       with ops.control_dependencies([are_splits_ready]):
         update_2 = split_handler.update_stats_sync(
             1,
@@ -240,7 +246,7 @@ class DenseSplitHandlerTest(test_util.TensorFlowTestCase):
             is_active=array_ops.constant([True, True]))
       with ops.control_dependencies([update_2]):
         are_splits_ready2, partitions, gains, splits = (
-            split_handler.make_splits(1, 2, class_id))
+            split_handler.make_splits(np.int64(1), np.int64(2), class_id))
         are_splits_ready, are_splits_ready2, partitions, gains, splits = (
             sess.run([
                 are_splits_ready, are_splits_ready2, partitions, gains, splits
@@ -285,10 +291,10 @@ class DenseSplitHandlerTest(test_util.TensorFlowTestCase):
       hessian_shape = tensor_shape.TensorShape([2])
 
       split_handler = ordinal_split_handler.DenseSplitHandler(
-          l1_regularization=0,
-          l2_regularization=1,
-          tree_complexity_regularization=0,
-          min_node_weight=0,
+          l1_regularization=0.,
+          l2_regularization=1.,
+          tree_complexity_regularization=0.,
+          min_node_weight=0.,
           epsilon=0.001,
           num_quantiles=3,
           feature_column_group_id=0,
@@ -313,7 +319,8 @@ class DenseSplitHandlerTest(test_util.TensorFlowTestCase):
           example_weights,
           is_active=array_ops.constant([True, True]))
       with ops.control_dependencies([update_1]):
-        are_splits_ready = split_handler.make_splits(0, 1, class_id)[0]
+        are_splits_ready = split_handler.make_splits(
+            np.int64(0), np.int64(1), class_id)[0]
       with ops.control_dependencies([are_splits_ready]):
         update_2 = split_handler.update_stats_sync(
             1,
@@ -326,7 +333,7 @@ class DenseSplitHandlerTest(test_util.TensorFlowTestCase):
             is_active=array_ops.constant([True, True]))
       with ops.control_dependencies([update_2]):
         are_splits_ready2, partitions, gains, splits = (
-            split_handler.make_splits(1, 2, class_id))
+            split_handler.make_splits(np.int64(1), np.int64(2), class_id))
         are_splits_ready, are_splits_ready2, partitions, gains, splits = (
             sess.run([
                 are_splits_ready, are_splits_ready2, partitions, gains, splits
@@ -369,9 +376,9 @@ class DenseSplitHandlerTest(test_util.TensorFlowTestCase):
 
       split_handler = ordinal_split_handler.DenseSplitHandler(
           l1_regularization=0.1,
-          l2_regularization=1,
-          tree_complexity_regularization=0,
-          min_node_weight=0,
+          l2_regularization=1.,
+          tree_complexity_regularization=0.,
+          min_node_weight=0.,
           epsilon=0.001,
           num_quantiles=10,
           feature_column_group_id=0,
@@ -396,7 +403,8 @@ class DenseSplitHandlerTest(test_util.TensorFlowTestCase):
           example_weights,
           is_active=array_ops.constant([True, False]))
       with ops.control_dependencies([update_1]):
-        are_splits_ready = split_handler.make_splits(0, 1, class_id)[0]
+        are_splits_ready = split_handler.make_splits(
+            np.int64(0), np.int64(1), class_id)[0]
       with ops.control_dependencies([are_splits_ready]):
         update_2 = split_handler.update_stats_sync(
             1,
@@ -409,7 +417,7 @@ class DenseSplitHandlerTest(test_util.TensorFlowTestCase):
             is_active=array_ops.constant([False, True]))
       with ops.control_dependencies([update_2]):
         are_splits_ready2, partitions, gains, splits = (
-            split_handler.make_splits(1, 2, class_id))
+            split_handler.make_splits(np.int64(1), np.int64(2), class_id))
         are_splits_ready, are_splits_ready2, partitions, gains, splits = (
             sess.run([
                 are_splits_ready, are_splits_ready2, partitions, gains, splits
@@ -443,9 +451,9 @@ class DenseSplitHandlerTest(test_util.TensorFlowTestCase):
 
       split_handler = ordinal_split_handler.DenseSplitHandler(
           l1_regularization=0.1,
-          l2_regularization=1,
+          l2_regularization=1.,
           tree_complexity_regularization=0.5,
-          min_node_weight=0,
+          min_node_weight=0.,
           epsilon=0.001,
           num_quantiles=10,
           feature_column_group_id=0,
@@ -470,7 +478,8 @@ class DenseSplitHandlerTest(test_util.TensorFlowTestCase):
           example_weights,
           is_active=array_ops.constant([True, True]))
       with ops.control_dependencies([update_1]):
-        are_splits_ready = split_handler.make_splits(0, 1, class_id)[0]
+        are_splits_ready = split_handler.make_splits(
+            np.int64(0), np.int64(1), class_id)[0]
       with ops.control_dependencies([are_splits_ready]):
         update_2 = split_handler.update_stats_sync(
             1,
@@ -483,7 +492,7 @@ class DenseSplitHandlerTest(test_util.TensorFlowTestCase):
             is_active=array_ops.constant([True, True]))
       with ops.control_dependencies([update_2]):
         are_splits_ready2, partitions, gains, splits = (
-            split_handler.make_splits(1, 2, class_id))
+            split_handler.make_splits(np.int64(1), np.int64(2), class_id))
         are_splits_ready, are_splits_ready2, partitions, gains, splits = (
             sess.run([
                 are_splits_ready, are_splits_ready2, partitions, gains, splits
@@ -576,7 +585,7 @@ class DenseSplitHandlerTest(test_util.TensorFlowTestCase):
 
       split_handler = ordinal_split_handler.DenseSplitHandler(
           l1_regularization=0.1,
-          l2_regularization=1,
+          l2_regularization=1.,
           tree_complexity_regularization=0.5,
           min_node_weight=1.5,
           epsilon=0.001,
@@ -603,7 +612,8 @@ class DenseSplitHandlerTest(test_util.TensorFlowTestCase):
           example_weights,
           is_active=array_ops.constant([True, True]))
       with ops.control_dependencies([update_1]):
-        are_splits_ready = split_handler.make_splits(0, 1, class_id)[0]
+        are_splits_ready = split_handler.make_splits(
+            np.int64(0), np.int64(1), class_id)[0]
       with ops.control_dependencies([are_splits_ready]):
         update_2 = split_handler.update_stats_sync(
             1,
@@ -616,7 +626,7 @@ class DenseSplitHandlerTest(test_util.TensorFlowTestCase):
             is_active=array_ops.constant([True, True]))
       with ops.control_dependencies([update_2]):
         are_splits_ready2, partitions, gains, splits = (
-            split_handler.make_splits(1, 2, class_id))
+            split_handler.make_splits(np.int64(1), np.int64(2), class_id))
         are_splits_ready, are_splits_ready2, partitions, gains, splits = (
             sess.run([
                 are_splits_ready, are_splits_ready2, partitions, gains, splits
@@ -685,10 +695,10 @@ class SparseSplitHandlerTest(test_util.TensorFlowTestCase):
       class_id = -1
 
       split_handler = ordinal_split_handler.SparseSplitHandler(
-          l1_regularization=0,
-          l2_regularization=2,
-          tree_complexity_regularization=0,
-          min_node_weight=0,
+          l1_regularization=0.0,
+          l2_regularization=2.0,
+          tree_complexity_regularization=0.0,
+          min_node_weight=0.0,
           epsilon=0.01,
           num_quantiles=2,
           feature_column_group_id=0,
@@ -713,8 +723,8 @@ class SparseSplitHandlerTest(test_util.TensorFlowTestCase):
           example_weights,
           is_active=array_ops.constant([True, True]))
       with ops.control_dependencies([update_1]):
-        are_splits_ready = split_handler.make_splits(0, 1, class_id)[0]
-
+        are_splits_ready = split_handler.make_splits(
+            np.int64(0), np.int64(1), class_id)[0]
       with ops.control_dependencies([are_splits_ready]):
         update_2 = split_handler.update_stats_sync(
             1,
@@ -727,7 +737,7 @@ class SparseSplitHandlerTest(test_util.TensorFlowTestCase):
             is_active=array_ops.constant([True, True]))
       with ops.control_dependencies([update_2]):
         are_splits_ready2, partitions, gains, splits = (
-            split_handler.make_splits(1, 2, class_id))
+            split_handler.make_splits(np.int64(1), np.int64(2), class_id))
         are_splits_ready, are_splits_ready2, partitions, gains, splits = (
             sess.run([
                 are_splits_ready, are_splits_ready2, partitions, gains, splits
@@ -811,10 +821,10 @@ class SparseSplitHandlerTest(test_util.TensorFlowTestCase):
       class_id = -1
 
       split_handler = ordinal_split_handler.SparseSplitHandler(
-          l1_regularization=0,
-          l2_regularization=2,
-          tree_complexity_regularization=0,
-          min_node_weight=0,
+          l1_regularization=0.0,
+          l2_regularization=2.0,
+          tree_complexity_regularization=0.0,
+          min_node_weight=0.0,
           epsilon=0.01,
           num_quantiles=2,
           feature_column_group_id=0,
@@ -839,7 +849,8 @@ class SparseSplitHandlerTest(test_util.TensorFlowTestCase):
           example_weights,
           is_active=array_ops.constant([True, True]))
       with ops.control_dependencies([update_1]):
-        are_splits_ready = split_handler.make_splits(0, 1, class_id)[0]
+        are_splits_ready = split_handler.make_splits(
+            np.int64(0), np.int64(1), class_id)[0]
 
       with ops.control_dependencies([are_splits_ready]):
         update_2 = split_handler.update_stats_sync(
@@ -853,7 +864,7 @@ class SparseSplitHandlerTest(test_util.TensorFlowTestCase):
             is_active=array_ops.constant([True, True]))
       with ops.control_dependencies([update_2]):
         are_splits_ready2, partitions, gains, splits = (
-            split_handler.make_splits(1, 2, class_id))
+            split_handler.make_splits(np.int64(1), np.int64(2), class_id))
         are_splits_ready, are_splits_ready2, partitions, gains, splits = (
             sess.run([
                 are_splits_ready, are_splits_ready2, partitions, gains, splits
@@ -905,10 +916,10 @@ class SparseSplitHandlerTest(test_util.TensorFlowTestCase):
       class_id = -1
 
       split_handler = ordinal_split_handler.SparseSplitHandler(
-          l1_regularization=0,
-          l2_regularization=2,
-          tree_complexity_regularization=0,
-          min_node_weight=0,
+          l1_regularization=0.0,
+          l2_regularization=2.0,
+          tree_complexity_regularization=0.0,
+          min_node_weight=0.0,
           epsilon=0.01,
           num_quantiles=2,
           feature_column_group_id=0,
@@ -933,7 +944,8 @@ class SparseSplitHandlerTest(test_util.TensorFlowTestCase):
           example_weights,
           is_active=array_ops.constant([True, True]))
       with ops.control_dependencies([update_1]):
-        are_splits_ready = split_handler.make_splits(0, 1, class_id)[0]
+        are_splits_ready = split_handler.make_splits(
+            np.int64(0), np.int64(1), class_id)[0]
 
       with ops.control_dependencies([are_splits_ready]):
         update_2 = split_handler.update_stats_sync(
@@ -947,7 +959,7 @@ class SparseSplitHandlerTest(test_util.TensorFlowTestCase):
             is_active=array_ops.constant([True, True]))
       with ops.control_dependencies([update_2]):
         are_splits_ready2, partitions, gains, splits = (
-            split_handler.make_splits(1, 2, class_id))
+            split_handler.make_splits(np.int64(1), np.int64(2), class_id))
         are_splits_ready, are_splits_ready2, partitions, gains, splits = (
             sess.run([
                 are_splits_ready, are_splits_ready2, partitions, gains, splits
@@ -996,10 +1008,10 @@ class SparseSplitHandlerTest(test_util.TensorFlowTestCase):
       class_id = -1
 
       split_handler = ordinal_split_handler.SparseSplitHandler(
-          l1_regularization=0,
-          l2_regularization=2,
-          tree_complexity_regularization=0,
-          min_node_weight=0,
+          l1_regularization=0.0,
+          l2_regularization=2.0,
+          tree_complexity_regularization=0.0,
+          min_node_weight=0.0,
           epsilon=0.01,
           num_quantiles=2,
           feature_column_group_id=0,
@@ -1024,7 +1036,8 @@ class SparseSplitHandlerTest(test_util.TensorFlowTestCase):
           example_weights,
           is_active=array_ops.constant([True, False]))
       with ops.control_dependencies([update_1]):
-        are_splits_ready = split_handler.make_splits(0, 1, class_id)[0]
+        are_splits_ready = split_handler.make_splits(
+            np.int64(0), np.int64(1), class_id)[0]
 
       with ops.control_dependencies([are_splits_ready]):
         update_2 = split_handler.update_stats_sync(
@@ -1038,7 +1051,7 @@ class SparseSplitHandlerTest(test_util.TensorFlowTestCase):
             is_active=array_ops.constant([False, True]))
       with ops.control_dependencies([update_2]):
         are_splits_ready2, partitions, gains, splits = (
-            split_handler.make_splits(1, 2, class_id))
+            split_handler.make_splits(np.int64(1), np.int64(2), class_id))
         are_splits_ready, are_splits_ready2, partitions, gains, splits = (
             sess.run([
                 are_splits_ready, are_splits_ready2, partitions, gains, splits
@@ -1065,10 +1078,10 @@ class SparseSplitHandlerTest(test_util.TensorFlowTestCase):
       class_id = -1
 
       split_handler = ordinal_split_handler.SparseSplitHandler(
-          l1_regularization=0,
-          l2_regularization=2,
-          tree_complexity_regularization=0,
-          min_node_weight=0,
+          l1_regularization=0.0,
+          l2_regularization=2.0,
+          tree_complexity_regularization=0.0,
+          min_node_weight=0.0,
           epsilon=0.01,
           num_quantiles=2,
           feature_column_group_id=0,
@@ -1096,7 +1109,8 @@ class SparseSplitHandlerTest(test_util.TensorFlowTestCase):
           example_weights,
           is_active=array_ops.constant([True, True]))
       with ops.control_dependencies([update_1]):
-        are_splits_ready = split_handler.make_splits(0, 1, class_id)[0]
+        are_splits_ready = split_handler.make_splits(
+            np.int64(0), np.int64(1), class_id)[0]
 
       with ops.control_dependencies([are_splits_ready]):
         update_2 = split_handler.update_stats_sync(
@@ -1110,7 +1124,7 @@ class SparseSplitHandlerTest(test_util.TensorFlowTestCase):
             is_active=array_ops.constant([True, True]))
       with ops.control_dependencies([update_2]):
         are_splits_ready2, partitions, gains, splits = (
-            split_handler.make_splits(1, 2, class_id))
+            split_handler.make_splits(np.int64(1), np.int64(2), class_id))
         are_splits_ready, are_splits_ready2, partitions, gains, splits = (
             sess.run([
                 are_splits_ready, are_splits_ready2, partitions, gains, splits
@@ -1138,10 +1152,10 @@ class SparseSplitHandlerTest(test_util.TensorFlowTestCase):
       class_id = -1
 
       split_handler = ordinal_split_handler.SparseSplitHandler(
-          l1_regularization=0,
-          l2_regularization=2,
-          tree_complexity_regularization=0,
-          min_node_weight=0,
+          l1_regularization=0.0,
+          l2_regularization=2.0,
+          tree_complexity_regularization=0.0,
+          min_node_weight=0.0,
           epsilon=0.01,
           num_quantiles=2,
           feature_column_group_id=0,
@@ -1166,7 +1180,8 @@ class SparseSplitHandlerTest(test_util.TensorFlowTestCase):
           example_weights,
           is_active=array_ops.constant([True, True]))
       with ops.control_dependencies([update_1]):
-        are_splits_ready = split_handler.make_splits(0, 1, class_id)[0]
+        are_splits_ready = split_handler.make_splits(
+            np.int64(0), np.int64(1), class_id)[0]
 
       with ops.control_dependencies([are_splits_ready]):
         update_2 = split_handler.update_stats_sync(
@@ -1180,7 +1195,7 @@ class SparseSplitHandlerTest(test_util.TensorFlowTestCase):
             is_active=array_ops.constant([True, True]))
       with ops.control_dependencies([update_2]):
         are_splits_ready2, partitions, gains, splits = (
-            split_handler.make_splits(1, 2, class_id))
+            split_handler.make_splits(np.int64(1), np.int64(2), class_id))
         are_splits_ready, are_splits_ready2, partitions, gains, splits = (
             sess.run([
                 are_splits_ready, are_splits_ready2, partitions, gains, splits
diff --git a/tensorflow/contrib/boosted_trees/lib/models/multiple_additive_trees.cc b/tensorflow/contrib/boosted_trees/lib/models/multiple_additive_trees.cc
index 43b00d4c6dc2e0066810012292874314215c41be..c9223afeab233497bce9f680bd44bd10ccfc6491 100644
--- a/tensorflow/contrib/boosted_trees/lib/models/multiple_additive_trees.cc
+++ b/tensorflow/contrib/boosted_trees/lib/models/multiple_additive_trees.cc
@@ -26,7 +26,8 @@ void MultipleAdditiveTrees::Predict(
     const std::vector<int32>& trees_to_include,
     const boosted_trees::utils::BatchFeatures& features,
     tensorflow::thread::ThreadPool* const worker_threads,
-    tensorflow::TTypes<float>::Matrix output_predictions) {
+    tensorflow::TTypes<float>::Matrix output_predictions,
+    Tensor* const output_leaf_index) {
   // Zero out predictions as the model is additive.
   output_predictions.setZero();
 
@@ -38,8 +39,13 @@ void MultipleAdditiveTrees::Predict(
 
   // Lambda for doing a block of work.
   auto update_predictions = [&config, &features, &trees_to_include,
-                             &output_predictions](int64 start, int64 end) {
+                             &output_predictions,
+                             &output_leaf_index](int64 start, int64 end) {
     auto examples_iterable = features.examples_iterable(start, end);
+    Tensor dummy_tensor(DT_INT32, TensorShape({1, 1}));
+    tensorflow::TTypes<int>::Matrix output_leaf_index_mat =
+        output_leaf_index != nullptr ? output_leaf_index->matrix<int>()
+                                     : dummy_tensor.matrix<int>();
     for (const auto& example : examples_iterable) {
       for (const int32 tree_idx : trees_to_include) {
         const boosted_trees::trees::DecisionTreeConfig& tree =
@@ -47,6 +53,10 @@ void MultipleAdditiveTrees::Predict(
         const float tree_weight = config.tree_weights(tree_idx);
         const int leaf_idx = trees::DecisionTree::Traverse(tree, 0, example);
         QCHECK(leaf_idx >= 0) << "Invalid tree: " << tree.DebugString();
+        // Checks if output leaf tree index is required.
+        if (output_leaf_index != nullptr) {
+          output_leaf_index_mat(example.example_idx, tree_idx) = leaf_idx;
+        }
         const auto& leaf_node = tree.nodes(leaf_idx);
         QCHECK(leaf_node.has_leaf())
             << "Invalid leaf node: " << leaf_node.DebugString();
diff --git a/tensorflow/contrib/boosted_trees/lib/models/multiple_additive_trees.h b/tensorflow/contrib/boosted_trees/lib/models/multiple_additive_trees.h
index cc3dc226cdbc88fc7010ada1e7f0e6c0a3913c5f..940531c4ba4bcac19fa980deb091e55b48e0693b 100644
--- a/tensorflow/contrib/boosted_trees/lib/models/multiple_additive_trees.h
+++ b/tensorflow/contrib/boosted_trees/lib/models/multiple_additive_trees.h
@@ -33,12 +33,17 @@ class MultipleAdditiveTrees {
  public:
   // Predict runs tree ensemble on the given batch and updates
   // output predictions accordingly, for the given list of trees.
+  // output_leaf_indices is a pointer to a 2 dimensional tensor. If it is not
+  // nullptr, this method fills output_leaf_indices with a per-tree leaf id
+  // where each of the instances from 'features' ended up in. Its shape is num
+  // examples X num of trees.
   static void Predict(
       const boosted_trees::trees::DecisionTreeEnsembleConfig& config,
       const std::vector<int32>& trees_to_include,
       const boosted_trees::utils::BatchFeatures& features,
       tensorflow::thread::ThreadPool* const worker_threads,
-      tensorflow::TTypes<float>::Matrix output_predictions);
+      tensorflow::TTypes<float>::Matrix output_predictions,
+      Tensor* const output_leaf_index);
 };
 
 }  // namespace models
diff --git a/tensorflow/contrib/boosted_trees/lib/models/multiple_additive_trees_test.cc b/tensorflow/contrib/boosted_trees/lib/models/multiple_additive_trees_test.cc
index 4ca18bedb1054ef64c6d4b25bbad04842bab1a6a..462a9ac86fe51d07cfb958d9be49bef84811a52e 100644
--- a/tensorflow/contrib/boosted_trees/lib/models/multiple_additive_trees_test.cc
+++ b/tensorflow/contrib/boosted_trees/lib/models/multiple_additive_trees_test.cc
@@ -62,7 +62,8 @@ TEST_F(MultipleAdditiveTreesTest, Empty) {
   tensorflow::thread::ThreadPool threads(tensorflow::Env::Default(), "test",
                                          kNumThreadsSingleThreaded);
   MultipleAdditiveTrees::Predict(tree_ensemble_config, {}, batch_features_,
-                                 &threads, output_matrix);
+                                 &threads, output_matrix,
+                                 /*output_leaf_index=*/nullptr);
   EXPECT_EQ(0, output_matrix(0, 0));
   EXPECT_EQ(0, output_matrix(1, 0));
 }
@@ -99,17 +100,38 @@ TEST_F(MultipleAdditiveTreesTest, SingleClass) {
   // Normal case.
   {
     MultipleAdditiveTrees::Predict(tree_ensemble_config, {0, 1},
-                                   batch_features_, &threads, output_matrix);
+                                   batch_features_, &threads, output_matrix,
+                                   /*output_leaf_index=*/nullptr);
     EXPECT_FLOAT_EQ(-0.2f, output_matrix(0, 0));  // -0.4 (bias) + 0.2 (leaf 2).
     EXPECT_FLOAT_EQ(0.5f, output_matrix(1, 0));   // -0.4 (bias) + 0.9 (leaf 1).
   }
+  // Normal case with leaf node.
+  {
+    // Initialize output leaf index tensor, since leaf index is positive in this
+    // case, initialize with the value of -1. Since there are 2 examples and
+    // there are 2 trees, initialize leaf output index by 2 * 2.
+    Tensor output_leaf_index_tensor(DT_INT32, TensorShape({2, 2}));
+    MultipleAdditiveTrees::Predict(tree_ensemble_config, {0, 1},
+                                   batch_features_, &threads, output_matrix,
+                                   &output_leaf_index_tensor);
+    EXPECT_FLOAT_EQ(-0.2f, output_matrix(0, 0));  // -0.4 (bias) + 0.2 (leaf 2).
+    EXPECT_FLOAT_EQ(0.5f, output_matrix(1, 0));   // -0.4 (bias) + 0.9 (leaf 1).
+    EXPECT_FLOAT_EQ(0, output_leaf_index_tensor.matrix<int>()(
+                           0, 0));  // 1st leaf for the first example
+    EXPECT_FLOAT_EQ(0, output_leaf_index_tensor.matrix<int>()(
+                           1, 0));  // 1st leaf for the second example
+    EXPECT_FLOAT_EQ(2, output_leaf_index_tensor.matrix<int>()(
+                           0, 1));  // 2nd leaf for the first example
+    EXPECT_FLOAT_EQ(1, output_leaf_index_tensor.matrix<int>()(
+                           1, 1));  // 2nd leaf for the second example
+  }
   // Weighted case
   {
     DecisionTreeEnsembleConfig weighted = tree_ensemble_config;
     weighted.set_tree_weights(0, 6.0);
     weighted.set_tree_weights(1, 3.2);
     MultipleAdditiveTrees::Predict(weighted, {0, 1}, batch_features_, &threads,
-                                   output_matrix);
+                                   output_matrix, nullptr);
     // -0.4 (bias) + 0.2 (leaf 2).
     EXPECT_FLOAT_EQ(-0.4f * 6 + 0.2 * 3.2, output_matrix(0, 0));
     // -0.4 (bias) + 0.9 (leaf 1).
@@ -118,21 +140,21 @@ TEST_F(MultipleAdditiveTreesTest, SingleClass) {
   // Drop first tree.
   {
     MultipleAdditiveTrees::Predict(tree_ensemble_config, {1}, batch_features_,
-                                   &threads, output_matrix);
+                                   &threads, output_matrix, nullptr);
     EXPECT_FLOAT_EQ(0.2f, output_matrix(0, 0));  // 0.2 (leaf 2).
     EXPECT_FLOAT_EQ(0.9f, output_matrix(1, 0));  // 0.9 (leaf 1).
   }
   // Drop second tree.
   {
     MultipleAdditiveTrees::Predict(tree_ensemble_config, {0}, batch_features_,
-                                   &threads, output_matrix);
+                                   &threads, output_matrix, nullptr);
     EXPECT_FLOAT_EQ(-0.4f, output_matrix(0, 0));  // -0.4 (bias).
     EXPECT_FLOAT_EQ(-0.4f, output_matrix(1, 0));  // -0.4 (bias).
   }
   // Drop all trees.
   {
     MultipleAdditiveTrees::Predict(tree_ensemble_config, {}, batch_features_,
-                                   &threads, output_matrix);
+                                   &threads, output_matrix, nullptr);
     EXPECT_FLOAT_EQ(0.0, output_matrix(0, 0));
     EXPECT_FLOAT_EQ(0.0, output_matrix(1, 0));
   }
@@ -172,7 +194,8 @@ TEST_F(MultipleAdditiveTreesTest, MultiClass) {
   // Normal case.
   {
     MultipleAdditiveTrees::Predict(tree_ensemble_config, {0, 1},
-                                   batch_features_, &threads, output_matrix);
+                                   batch_features_, &threads, output_matrix,
+                                   nullptr);
     EXPECT_FLOAT_EQ(-0.4f, output_matrix(0, 0));  // -0.4 (bias)
     EXPECT_FLOAT_EQ(-0.5f, output_matrix(0, 1));  // -0.7 (bias) + 0.2 (leaf 2)
     EXPECT_FLOAT_EQ(0.5f, output_matrix(1, 0));   // -0.4 (bias) + 0.9 (leaf 1)
@@ -184,7 +207,7 @@ TEST_F(MultipleAdditiveTreesTest, MultiClass) {
     weighted.set_tree_weights(0, 6.0);
     weighted.set_tree_weights(1, 3.2);
     MultipleAdditiveTrees::Predict(weighted, {0, 1}, batch_features_, &threads,
-                                   output_matrix);
+                                   output_matrix, nullptr);
     // bias
     EXPECT_FLOAT_EQ(-0.4f * 6, output_matrix(0, 0));
     // bias + leaf 2
@@ -197,7 +220,7 @@ TEST_F(MultipleAdditiveTreesTest, MultiClass) {
   // Dropout first tree.
   {
     MultipleAdditiveTrees::Predict(tree_ensemble_config, {1}, batch_features_,
-                                   &threads, output_matrix);
+                                   &threads, output_matrix, nullptr);
     EXPECT_FLOAT_EQ(0.0, output_matrix(0, 0));
     EXPECT_FLOAT_EQ(0.2f, output_matrix(0, 1));  // 0.2 (leaf 2)
     EXPECT_FLOAT_EQ(0.9f, output_matrix(1, 0));  // 0.9 (leaf 2)
@@ -206,7 +229,7 @@ TEST_F(MultipleAdditiveTreesTest, MultiClass) {
   // Dropout second tree.
   {
     MultipleAdditiveTrees::Predict(tree_ensemble_config, {0}, batch_features_,
-                                   &threads, output_matrix);
+                                   &threads, output_matrix, nullptr);
     EXPECT_FLOAT_EQ(-0.4f, output_matrix(0, 0));  // -0.4 (bias)
     EXPECT_FLOAT_EQ(-0.7f, output_matrix(0, 1));  // -0.7 (bias)
     EXPECT_FLOAT_EQ(-0.4f, output_matrix(1, 0));  // -0.4 (bias)
@@ -215,7 +238,7 @@ TEST_F(MultipleAdditiveTreesTest, MultiClass) {
   // Drop both trees.
   {
     MultipleAdditiveTrees::Predict(tree_ensemble_config, {}, batch_features_,
-                                   &threads, output_matrix);
+                                   &threads, output_matrix, nullptr);
     EXPECT_FLOAT_EQ(0.0f, output_matrix(0, 0));
     EXPECT_FLOAT_EQ(0.0f, output_matrix(0, 1));
     EXPECT_FLOAT_EQ(0.0f, output_matrix(1, 0));
@@ -258,7 +281,8 @@ TEST_F(MultipleAdditiveTreesTest, DenseLeaves) {
   // Normal case.
   {
     MultipleAdditiveTrees::Predict(tree_ensemble_config, {0, 1},
-                                   batch_features_, &threads, output_matrix);
+                                   batch_features_, &threads, output_matrix,
+                                   nullptr);
     EXPECT_FLOAT_EQ(-0.2f, output_matrix(0, 0));  // -0.4 (tree1) + 0.2 (leaf 2)
     EXPECT_FLOAT_EQ(-0.4f, output_matrix(0, 1));  // -0.7 (tree1) + 0.3 (leaf 2)
     EXPECT_FLOAT_EQ(3.4f, output_matrix(0, 2));   // 3.0 -(tree1) + 0.4 (leaf 2)
diff --git a/tensorflow/contrib/boosted_trees/lib/quantiles/weighted_quantiles_stream.h b/tensorflow/contrib/boosted_trees/lib/quantiles/weighted_quantiles_stream.h
index 8ad97fedc923ac50bcaad86e0ba2c2e46df6821b..c120dd8a6c156ec9eb7ba0b6c552f5138bd21a16 100644
--- a/tensorflow/contrib/boosted_trees/lib/quantiles/weighted_quantiles_stream.h
+++ b/tensorflow/contrib/boosted_trees/lib/quantiles/weighted_quantiles_stream.h
@@ -295,7 +295,7 @@ WeightedQuantilesStream<ValueType, WeightType, CompareFn>::GetQuantileSpecs(
   if (eps <= std::numeric_limits<double>::epsilon()) {
     // Exact quantile computation at the expense of RAM.
     max_level = 1;
-    block_size = std::max(max_elements, 2LL);
+    block_size = std::max(max_elements, int64{2});
   } else {
     // The bottom-most level will become full at most
     // (max_elements / block_size) times, the level above will become full
@@ -315,7 +315,7 @@ WeightedQuantilesStream<ValueType, WeightType, CompareFn>::GetQuantileSpecs(
       block_size = static_cast<size_t>(ceil(max_level / eps)) + 1;
     }
   }
-  return std::make_tuple(max_level, std::max(block_size, 2LL));
+  return std::make_tuple(max_level, std::max(block_size, int64{2}));
 }
 
 }  // namespace quantiles
diff --git a/tensorflow/contrib/boosted_trees/lib/quantiles/weighted_quantiles_summary.h b/tensorflow/contrib/boosted_trees/lib/quantiles/weighted_quantiles_summary.h
index 7576856dc3a6d0b6681ee9745c875cf46d1e2960..a7e7bfc13cadcea4d29d33e0dbd955bdad6ffcb9 100644
--- a/tensorflow/contrib/boosted_trees/lib/quantiles/weighted_quantiles_summary.h
+++ b/tensorflow/contrib/boosted_trees/lib/quantiles/weighted_quantiles_summary.h
@@ -195,7 +195,7 @@ class WeightedQuantilesSummary {
   // designed to be cache-friendly.
   void Compress(int64 size_hint, double min_eps = 0) {
     // No-op if we're already within the size requirement.
-    size_hint = std::max(size_hint, 2LL);
+    size_hint = std::max(size_hint, int64{2});
     if (entries_.size() <= size_hint) {
       return;
     }
@@ -267,7 +267,7 @@ class WeightedQuantilesSummary {
     if (entries_.empty()) {
       return output;
     }
-    num_quantiles = std::max(num_quantiles, 2LL);
+    num_quantiles = std::max(num_quantiles, int64{2});
     output.reserve(num_quantiles + 1);
 
     // Make successive rank queries to get boundaries.
diff --git a/tensorflow/contrib/boosted_trees/ops/prediction_ops.cc b/tensorflow/contrib/boosted_trees/ops/prediction_ops.cc
index d66f645f62aba84261337eb37d6e3204930f8f15..6491d58794332e9417951753532e018aafb652b1 100644
--- a/tensorflow/contrib/boosted_trees/ops/prediction_ops.cc
+++ b/tensorflow/contrib/boosted_trees/ops/prediction_ops.cc
@@ -40,6 +40,24 @@ static Status ApplyGradientTreesPredictionShapeFn(InferenceContext* c) {
   return Status::OK();
 }
 
+static Status ApplyGradientTreesPredictionVerboseShapeFn(InferenceContext* c) {
+  string learner_config_str;
+  c->GetAttr("learner_config", &learner_config_str).IgnoreError();
+  LearnerConfig learner_config;
+  ParseProtoUnlimited(&learner_config, learner_config_str);
+
+  bool reduce_dim;
+  c->GetAttr("reduce_dim", &reduce_dim).IgnoreError();
+  // Sets the shape of the output as a matrix.
+  c->set_output(0, {c->Matrix(InferenceContext::kUnknownDim,
+                              reduce_dim ? learner_config.num_classes() - 1
+                                         : learner_config.num_classes())});
+  c->set_output(1, {c->UnknownShape()});
+  c->set_output(2, {c->Matrix(InferenceContext::kUnknownDim,
+                              InferenceContext::kUnknownDim)});
+  return Status::OK();
+}
+
 REGISTER_OP("GradientTreesPrediction")
     .Attr("learner_config: string")
     .Attr("num_dense_float_features: int >= 0")
@@ -90,6 +108,58 @@ drop_out_tree_indices_weights: Tensor of Rank 2 containing dropped trees indices
 and original weights of those trees during prediction.
 )doc");
 
+REGISTER_OP("GradientTreesPredictionVerbose")
+    .Attr("learner_config: string")
+    .Attr("num_dense_float_features: int >= 0")
+    .Attr("num_sparse_float_features: int >= 0")
+    .Attr("num_sparse_int_features: int >= 0")
+    .Attr("use_locking: bool = false")
+    .Attr("apply_dropout: bool")
+    .Attr("apply_averaging: bool")
+    .Attr("center_bias: bool")
+    .Attr("reduce_dim: bool")
+    .Input("tree_ensemble_handle: resource")
+    .Input("seed: int64")
+    .Input("dense_float_features: num_dense_float_features * float")
+    .Input("sparse_float_feature_indices: num_sparse_float_features * int64")
+    .Input("sparse_float_feature_values: num_sparse_float_features * float")
+    .Input("sparse_float_feature_shapes: num_sparse_float_features * int64")
+    .Input("sparse_int_feature_indices: num_sparse_int_features * int64")
+    .Input("sparse_int_feature_values: num_sparse_int_features * int64")
+    .Input("sparse_int_feature_shapes: num_sparse_int_features * int64")
+    .Output("predictions: float")
+    .Output("drop_out_tree_indices_weights: float")
+    .Output("leaf_index: int32")
+    .SetShapeFn(ApplyGradientTreesPredictionVerboseShapeFn)
+    .Doc(R"doc(
+Runs multiple additive regression forests predictors on input instances
+and computes the final prediction for each class, and outputs a matrix of
+leaf ids per each tree in an ensemble.
+
+learner_config: Config for the learner of type LearnerConfig proto. Prediction
+ops for now uses only LearningRateDropoutDrivenConfig config from the learner.
+num_dense_float_features: Number of dense float features.
+num_sparse_float_features: Number of sparse float features.
+num_sparse_int_features: Number of sparse int features.
+use_locking: Whether to use locking.
+seed: random seed to be used for dropout.
+reduce_dim: whether to reduce the dimension (legacy impl) or not.
+apply_dropout: whether to apply dropout during prediction.
+apply_averaging: whether averaging of tree ensembles should take place. If set
+to true, will be based on AveragingConfig from learner_config.
+tree_ensemble_handle: The handle to the tree ensemble.
+dense_float_features: Rank 2 Tensors containing dense float feature values.
+sparse_float_feature_indices: Rank 2 Tensors containing sparse float indices.
+sparse_float_feature_values: Rank 1 Tensors containing sparse float values.
+sparse_float_feature_shapes: Rank 1 Tensors containing sparse float shapes.
+sparse_int_feature_indices: Rank 2 Tensors containing sparse int indices.
+sparse_int_feature_values: Rank 1 Tensors containing sparse int values.
+sparse_int_feature_shapes: Rank 1 Tensors containing sparse int shapes.
+predictions: Rank 2 Tensor containing predictions per example per class.
+drop_out_tree_indices_weights: Tensor of Rank 2 containing dropped trees indices
+leaf_index: tensor of rank 2 containing leaf ids for each tree where an instance ended up.
+)doc");
+
 REGISTER_OP("GradientTreesPartitionExamples")
     .Attr("num_dense_float_features: int >= 0")
     .Attr("num_sparse_float_features: int >= 0")
diff --git a/tensorflow/contrib/boosted_trees/ops/split_handler_ops.cc b/tensorflow/contrib/boosted_trees/ops/split_handler_ops.cc
index 5d0ebbf73ce1272b51a475f67984db3a181b7130..ca5c7f3d8c78a543c63fbfa9f7eb7c3d348f11b8 100644
--- a/tensorflow/contrib/boosted_trees/ops/split_handler_ops.cc
+++ b/tensorflow/contrib/boosted_trees/ops/split_handler_ops.cc
@@ -23,12 +23,6 @@ using shape_inference::InferenceContext;
 using shape_inference::ShapeHandle;
 
 REGISTER_OP("BuildDenseInequalitySplits")
-    .Attr("feature_column_group_id: int")
-    .Attr("l1_regularization: float")
-    .Attr("l2_regularization: float")
-    .Attr("tree_complexity_regularization: float")
-    .Attr("min_node_weight: float")
-    .Attr("multiclass_strategy: int")
     .Input("num_minibatches: int64")
     .Input("partition_ids: int32")
     .Input("bucket_ids: int64")
@@ -36,6 +30,12 @@ REGISTER_OP("BuildDenseInequalitySplits")
     .Input("hessians: float32")
     .Input("bucket_boundaries: float32")
     .Input("class_id: int32")
+    .Input("feature_column_group_id: int32")
+    .Input("l1_regularization: float")
+    .Input("l2_regularization: float")
+    .Input("tree_complexity_regularization: float")
+    .Input("min_node_weight: float")
+    .Input("multiclass_strategy: int32")
     .Output("output_partition_ids: int32")
     .Output("gains: float32")
     .Output("split_infos: string")
@@ -73,6 +73,17 @@ bucket_ids: A rank 2 tensor of buckets IDs and dimensions.
 gradients: A rank 1 tensor of gradients.
 hessians: A rank 1 tensor of hessians.
 bucket_boundaries: A rank 1 tensor, thresholds that were used for bucketization.
+class_id: A scalar, the class id for which we're building the splits.
+feature_column_group_id: A scalar, the index of the feature we are spiltting on.
+l1_regularization: A scalar, which specifies the l1 regularization term.
+l2_regularization: A scalar, which specifies the l2 regularization term.
+tree_complexity_regularization: A scalar, which specifies the tree complexity
+    regularization term.
+min_node_weight: A scalar, minimum sum of example hessian needed in a child.
+    If a split results in a leaf node with a smaller value, the split will not
+    be considered.
+multiclass_strategy: A scalar, specifying the multiclass handling strategy.
+    See LearnerConfig.MultiClassStrategy for valid values.
 output_partition_ids: A rank 1 tensor, the partition IDs that we created splits
     for.
 gains: A rank 1 tensor, for the computed gain for the created splits.
@@ -81,13 +92,6 @@ split_infos: A rank 1 tensor of serialized protos which contains the
 )doc");
 
 REGISTER_OP("BuildSparseInequalitySplits")
-    .Attr("feature_column_group_id: int")
-    .Attr("bias_feature_id: int")
-    .Attr("l1_regularization: float")
-    .Attr("l2_regularization: float")
-    .Attr("tree_complexity_regularization: float")
-    .Attr("min_node_weight: float")
-    .Attr("multiclass_strategy: int")
     .Input("num_minibatches: int64")
     .Input("partition_ids: int32")
     .Input("bucket_ids: int64")
@@ -95,6 +99,13 @@ REGISTER_OP("BuildSparseInequalitySplits")
     .Input("hessians: float32")
     .Input("bucket_boundaries: float32")
     .Input("class_id: int32")
+    .Input("feature_column_group_id: int32")
+    .Input("bias_feature_id: int64")
+    .Input("l1_regularization: float")
+    .Input("l2_regularization: float")
+    .Input("tree_complexity_regularization: float")
+    .Input("min_node_weight: float")
+    .Input("multiclass_strategy: int32")
     .Output("output_partition_ids: int32")
     .Output("gains: float32")
     .Output("split_infos: string")
@@ -133,6 +144,17 @@ bucket_ids: A rank 2 tensor of buckets IDs and dimensions.
 gradients: A rank 1 tensor of gradients.
 hessians: A rank 1 tensor of hessians.
 bucket_boundaries: A rank 1 tensor, thresholds that were used for bucketization.
+class_id: A scalar, the class id for which we're building the splits.
+feature_column_group_id: A scalar, the index of the feature we are spiltting on.
+l1_regularization: A scalar, which specifies the l1 regularization term.
+l2_regularization: A scalar, which specifies the l2 regularization term.
+tree_complexity_regularization: A scalar, which specifies the tree complexity
+    regularization term.
+min_node_weight: A scalar, minimum sum of example hessian needed in a child.
+    If a split results in a leaf node with a smaller value, the split will not
+    be considered.
+multiclass_strategy: A scalar, specifying the multiclass handling strategy.
+    See LearnerConfig.MultiClassStrategy for valid values.
 output_partition_ids: A rank 1 tensor, the partition IDs that we created splits
     for.
 gains: A rank 1 tensor, for the computed gain for the created splits.
@@ -141,19 +163,19 @@ split_infos: A rank 1 tensor of serialized protos which contains the
 )doc");
 
 REGISTER_OP("BuildCategoricalEqualitySplits")
-    .Attr("feature_column_group_id: int")
-    .Attr("bias_feature_id: int")
-    .Attr("l1_regularization: float")
-    .Attr("l2_regularization: float")
-    .Attr("tree_complexity_regularization: float")
-    .Attr("min_node_weight: float")
-    .Attr("multiclass_strategy: int")
     .Input("num_minibatches: int64")
     .Input("partition_ids: int32")
     .Input("feature_ids: int64")
     .Input("gradients: float32")
     .Input("hessians: float32")
     .Input("class_id: int32")
+    .Input("feature_column_group_id: int32")
+    .Input("bias_feature_id: int64")
+    .Input("l1_regularization: float")
+    .Input("l2_regularization: float")
+    .Input("tree_complexity_regularization: float")
+    .Input("min_node_weight: float")
+    .Input("multiclass_strategy: int32")
     .Output("output_partition_ids: int32")
     .Output("gains: float32")
     .Output("split_infos: string")
@@ -188,6 +210,17 @@ partition_ids: A rank 1 tensor of partition IDs.
 feature_ids: A rank 2 tensor of feature IDs and dimensions.
 gradients: A rank 1 tensor of gradients.
 hessians: A rank 1 tensor of hessians.
+class_id: A scalar, the class id for which we're building the splits.
+feature_column_group_id: A scalar, the index of the feature we are spiltting on.
+l1_regularization: A scalar, which specifies the l1 regularization term.
+l2_regularization: A scalar, which specifies the l2 regularization term.
+tree_complexity_regularization: A scalar, which specifies the tree complexity
+    regularization term.
+min_node_weight: A scalar, minimum sum of example hessian needed in a child.
+    If a split results in a leaf node with a smaller value, the split will not
+    be considered.
+multiclass_strategy: A scalar, specifying the multiclass handling strategy.
+    See LearnerConfig.MultiClassStrategy for valid values.
 output_partition_ids: A rank 1 tensor, the partition IDs that we created splits
     for.
 gains: A rank 1 tensor, for the computed gain for the created splits.
@@ -196,4 +229,3 @@ split_infos: A rank 1 tensor of serialized protos which contains the
 )doc");
 
 }  // namespace tensorflow
-   // namespace tensorflow
diff --git a/tensorflow/contrib/boosted_trees/python/ops/batch_ops_utils.py b/tensorflow/contrib/boosted_trees/python/ops/batch_ops_utils.py
index 7a5f329b7ab3216972180ccbb4c85f2537175422..843420968ac6a6716fdf6b4967146e131139f67c 100644
--- a/tensorflow/contrib/boosted_trees/python/ops/batch_ops_utils.py
+++ b/tensorflow/contrib/boosted_trees/python/ops/batch_ops_utils.py
@@ -20,6 +20,8 @@ from __future__ import print_function
 import abc
 import collections
 
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import tensor_shape
 from tensorflow.python.ops import array_ops
@@ -60,6 +62,7 @@ def _move_tensors(tensors, device):
   """Moves a list of tensors to a device by concatenating/splitting them."""
   # Reset the device setting to avoid weird interactions with device merging
   # logic.
+  zero = constant_op.constant(0, dtype=dtypes.int32)
   with ops.device(None):
     if all(tensor.shape == tensor_shape.scalar() for tensor in tensors):
       with ops.device(tensors[0].device):
@@ -68,12 +71,11 @@ def _move_tensors(tensors, device):
         return array_ops.unstack(values)
     else:
       with ops.device(tensors[0].device):
-        sizes = array_ops.stack(
-            [array_ops.shape(tensor)[0] for tensor in tensors])
-        values = array_ops.concat(tensors, axis=0)
+        sizes = array_ops.stack(array_ops.shape_n(tensors))[:, 0]
+        values = array_ops.concat(tensors, axis=zero)
       with ops.device(device):
         sizes = array_ops.unstack(sizes)
-        return list(array_ops.split(values, sizes, axis=0))
+        return list(array_ops.split(values, sizes, axis=zero))
 
 
 def _scheduled_stamp_resource_op_runner(batch, stamp):
diff --git a/tensorflow/contrib/boosted_trees/python/ops/prediction_ops.py b/tensorflow/contrib/boosted_trees/python/ops/prediction_ops.py
index 58f0d36b0f78eeed6abcec1c4fa696f4ccffa615..7f6e55ae5888fc4ef50e34690d61c3ed303e971a 100644
--- a/tensorflow/contrib/boosted_trees/python/ops/prediction_ops.py
+++ b/tensorflow/contrib/boosted_trees/python/ops/prediction_ops.py
@@ -21,4 +21,5 @@ from __future__ import print_function
 from tensorflow.contrib.boosted_trees.python.ops import boosted_trees_ops_loader
 from tensorflow.contrib.boosted_trees.python.ops.gen_prediction_ops import gradient_trees_partition_examples
 from tensorflow.contrib.boosted_trees.python.ops.gen_prediction_ops import gradient_trees_prediction
+from tensorflow.contrib.boosted_trees.python.ops.gen_prediction_ops import gradient_trees_prediction_verbose
 # pylint: enable=unused-import
diff --git a/tensorflow/contrib/boosted_trees/python/ops/quantile_ops.py b/tensorflow/contrib/boosted_trees/python/ops/quantile_ops.py
index 50cc00afdcc77fedc9bf8c94a9a6fcf2a28ebde9..19b6b3296db394b07f57a25dbde187eb9195af38 100644
--- a/tensorflow/contrib/boosted_trees/python/ops/quantile_ops.py
+++ b/tensorflow/contrib/boosted_trees/python/ops/quantile_ops.py
@@ -201,3 +201,6 @@ class QuantileAccumulator(saver.BaseSaverBuilder.SaveableObject):
         stamp_token=stamp_token,
         next_stamp_token=next_stamp_token)
     return result
+
+  def resource(self):
+    return self._quantile_accumulator_handle
diff --git a/tensorflow/contrib/boosted_trees/python/training/functions/gbdt_batch.py b/tensorflow/contrib/boosted_trees/python/training/functions/gbdt_batch.py
index e53d86ec612f299c800753d67ceee79acb5db497..47698d45c81478f2b694aaadc603f742c44d5351 100644
--- a/tensorflow/contrib/boosted_trees/python/training/functions/gbdt_batch.py
+++ b/tensorflow/contrib/boosted_trees/python/training/functions/gbdt_batch.py
@@ -58,6 +58,7 @@ NUM_LAYERS_ATTEMPTED = "num_layers"
 NUM_TREES_ATTEMPTED = "num_trees"
 NUM_USED_HANDLERS = "num_used_handlers"
 USED_HANDLERS_MASK = "used_handlers_mask"
+LEAF_INDEX = "leaf_index"
 _FEATURE_NAME_TEMPLATE = "%s_%d"
 
 
@@ -71,18 +72,24 @@ def _get_column_by_index(tensor, indices):
   return array_ops.reshape(array_ops.gather(p_flat, i_flat), [shape[0], -1])
 
 
-def _make_predictions_dict(stamp, logits, partition_ids, ensemble_stats,
-                           used_handlers):
+def _make_predictions_dict(stamp,
+                           logits,
+                           partition_ids,
+                           ensemble_stats,
+                           used_handlers,
+                           leaf_index=None):
   """Returns predictions for the given logits and n_classes.
 
   Args:
     stamp: The ensemble stamp.
-    logits: A rank 2 `Tensor` with shape [batch_size, n_classes - 1].
-        that contains predictions when no dropout was applied.
+    logits: A rank 2 `Tensor` with shape [batch_size, n_classes - 1]. that
+      contains predictions when no dropout was applied.
     partition_ids: A rank 1 `Tensor` with shape [batch_size].
     ensemble_stats: A TreeEnsembleStatsOp result tuple.
     used_handlers: A TreeEnsembleUsedHandlerOp result tuple of an int and a
-        boolean mask..
+      boolean mask.
+    leaf_index: A rank 2 `Tensor` with shape [batch_size, number of trees]. that
+      contains leaf id for each example prediction.
 
   Returns:
     A dict of predictions.
@@ -95,6 +102,8 @@ def _make_predictions_dict(stamp, logits, partition_ids, ensemble_stats,
   result[NUM_TREES_ATTEMPTED] = ensemble_stats.attempted_trees
   result[NUM_USED_HANDLERS] = used_handlers.num_used_handlers
   result[USED_HANDLERS_MASK] = used_handlers.used_handlers_mask
+  if leaf_index is not None:
+    result[LEAF_INDEX] = leaf_index
   return result
 
 
@@ -180,8 +189,7 @@ def extract_features(features, feature_columns, use_core_columns):
         elif isinstance(fc, feature_column_lib._EmbeddingColumn):
           # pylint: enable=protected-access
           transformed_features[fc.name] = fc_core.input_layer(
-              features, [fc],
-              weight_collections=[scope])
+              features, [fc], weight_collections=[scope])
         else:
           result = feature_column_ops.transform_features(features, [fc])
           if len(result) > 1:
@@ -269,7 +277,8 @@ class GradientBoostedDecisionTreeModel(object):
                features,
                logits_dimension,
                feature_columns=None,
-               use_core_columns=False):
+               use_core_columns=False,
+               output_leaf_index=False):
     """Construct a new GradientBoostedDecisionTreeModel function.
 
     Args:
@@ -277,13 +286,15 @@ class GradientBoostedDecisionTreeModel(object):
       num_ps_replicas: Number of parameter server replicas, can be 0.
       ensemble_handle: A handle to the ensemble variable.
       center_bias: Whether to center the bias before growing trees.
-      examples_per_layer: Number of examples to accumulate before growing
-        a tree layer. It can also be a function that computes the number of
-        examples based on the depth of the layer that's being built.
+      examples_per_layer: Number of examples to accumulate before growing a tree
+        layer. It can also be a function that computes the number of examples
+        based on the depth of the layer that's being built.
       learner_config: A learner config.
       features: `dict` of `Tensor` objects.
       logits_dimension: An int, the dimension of logits.
       feature_columns: A list of feature columns.
+      output_leaf_index: A boolean variable indicating whether to output leaf
+        index into predictions dictionary.
 
     Raises:
       ValueError: if inputs are not valid.
@@ -334,10 +345,12 @@ class GradientBoostedDecisionTreeModel(object):
     self._feature_columns = feature_columns
     self._learner_config_serialized = learner_config.SerializeToString()
     self._attempted_trees = variables.Variable(
-        initial_value=array_ops.zeros([], dtypes.int64), trainable=False,
+        initial_value=array_ops.zeros([], dtypes.int64),
+        trainable=False,
         name="attempted_trees")
     self._finalized_trees = variables.Variable(
-        initial_value=array_ops.zeros([], dtypes.int64), trainable=False,
+        initial_value=array_ops.zeros([], dtypes.int64),
+        trainable=False,
         name="finalized_trees")
     if not features:
       raise ValueError("Features dictionary must be specified.")
@@ -354,9 +367,11 @@ class GradientBoostedDecisionTreeModel(object):
     self._sparse_int_indices = sparse_int_indices
     self._sparse_int_values = sparse_int_values
     self._sparse_int_shapes = sparse_int_shapes
-    self._reduce_dim = (self._learner_config.multi_class_strategy ==
-                        learner_pb2.LearnerConfig.TREE_PER_CLASS and
-                        learner_config.num_classes == 2)
+    self._reduce_dim = (
+        self._learner_config.multi_class_strategy ==
+        learner_pb2.LearnerConfig.TREE_PER_CLASS and
+        learner_config.num_classes == 2)
+    self._output_leaf_index = output_leaf_index
 
   def _predict_and_return_dict(self, ensemble_handle, ensemble_stamp, mode):
     """Runs prediction and returns a dictionary of the prediction results.
@@ -374,8 +389,8 @@ class GradientBoostedDecisionTreeModel(object):
     ensemble_stats = training_ops.tree_ensemble_stats(ensemble_handle,
                                                       ensemble_stamp)
     num_handlers = (
-        len(self._dense_floats) + len(self._sparse_float_shapes) +
-        len(self._sparse_int_shapes))
+        len(self._dense_floats) + len(self._sparse_float_shapes) + len(
+            self._sparse_int_shapes))
     # Used during feature selection.
     used_handlers = model_ops.tree_ensemble_used_handlers(
         ensemble_handle, ensemble_stamp, num_all_handlers=num_handlers)
@@ -386,22 +401,44 @@ class GradientBoostedDecisionTreeModel(object):
     # Make sure ensemble stats run. This will check that the ensemble has
     # the right stamp.
     with ops.control_dependencies(ensemble_stats):
-      predictions, _ = prediction_ops.gradient_trees_prediction(
-          ensemble_handle,
-          seed,
-          self._dense_floats,
-          self._sparse_float_indices,
-          self._sparse_float_values,
-          self._sparse_float_shapes,
-          self._sparse_int_indices,
-          self._sparse_int_values,
-          self._sparse_int_shapes,
-          learner_config=self._learner_config_serialized,
-          apply_dropout=apply_dropout,
-          apply_averaging=mode != learn.ModeKeys.TRAIN,
-          use_locking=True,
-          center_bias=self._center_bias,
-          reduce_dim=self._reduce_dim)
+      leaf_index = None
+      # Only used in infer (predict), not used in train and eval.
+      if self._output_leaf_index and mode == learn.ModeKeys.INFER:
+        predictions, _, leaf_index = (
+            prediction_ops).gradient_trees_prediction_verbose(
+                ensemble_handle,
+                seed,
+                self._dense_floats,
+                self._sparse_float_indices,
+                self._sparse_float_values,
+                self._sparse_float_shapes,
+                self._sparse_int_indices,
+                self._sparse_int_values,
+                self._sparse_int_shapes,
+                learner_config=self._learner_config_serialized,
+                apply_dropout=apply_dropout,
+                apply_averaging=mode != learn.ModeKeys.TRAIN,
+                use_locking=True,
+                center_bias=self._center_bias,
+                reduce_dim=self._reduce_dim)
+      else:
+        leaf_index = None
+        predictions, _ = prediction_ops.gradient_trees_prediction(
+            ensemble_handle,
+            seed,
+            self._dense_floats,
+            self._sparse_float_indices,
+            self._sparse_float_values,
+            self._sparse_float_shapes,
+            self._sparse_int_indices,
+            self._sparse_int_values,
+            self._sparse_int_shapes,
+            learner_config=self._learner_config_serialized,
+            apply_dropout=apply_dropout,
+            apply_averaging=mode != learn.ModeKeys.TRAIN,
+            use_locking=True,
+            center_bias=self._center_bias,
+            reduce_dim=self._reduce_dim)
       partition_ids = prediction_ops.gradient_trees_partition_examples(
           ensemble_handle,
           self._dense_floats,
@@ -414,7 +451,7 @@ class GradientBoostedDecisionTreeModel(object):
           use_locking=True)
 
     return _make_predictions_dict(ensemble_stamp, predictions, partition_ids,
-                                  ensemble_stats, used_handlers)
+                                  ensemble_stats, used_handlers, leaf_index)
 
   def predict(self, mode):
     """Returns predictions given the features and mode.
@@ -432,8 +469,9 @@ class GradientBoostedDecisionTreeModel(object):
     # Use the current ensemble to predict on the current batch of input.
     # For faster prediction we check if the inputs are on the same device
     # as the model. If not, we create a copy of the model on the worker.
-    input_deps = (self._dense_floats + self._sparse_float_indices +
-                  self._sparse_int_indices)
+    input_deps = (
+        self._dense_floats + self._sparse_float_indices +
+        self._sparse_int_indices)
     if not input_deps:
       raise ValueError("No input tensors for prediction.")
 
@@ -457,8 +495,8 @@ class GradientBoostedDecisionTreeModel(object):
 
       # Determine whether the local ensemble is stale and update it if needed.
       def _refresh_local_ensemble_fn():
-        # Serialize the model from parameter server after reading all inputs.
-        with ops.control_dependencies(input_deps):
+        # Serialize the model from parameter server after reading the inputs.
+        with ops.control_dependencies([input_deps[0]]):
           (ensemble_stamp, serialized_model) = (
               model_ops.tree_ensemble_serialize(self._ensemble_handle))
 
@@ -500,8 +538,9 @@ class GradientBoostedDecisionTreeModel(object):
       ValueError: if inputs are not valid.
     """
     # Get the worker device from input dependencies.
-    input_deps = (self._dense_floats + self._sparse_float_indices +
-                  self._sparse_int_indices)
+    input_deps = (
+        self._dense_floats + self._sparse_float_indices +
+        self._sparse_int_indices)
     worker_device = input_deps[0].device
 
     # Get tensors relevant for training and form the loss.
@@ -517,7 +556,7 @@ class GradientBoostedDecisionTreeModel(object):
         aggregation_method=None)[0]
     strategy = self._learner_config.multi_class_strategy
 
-    class_id = -1
+    class_id = constant_op.constant(-1, dtype=dtypes.int32)
     # Handle different multiclass strategies.
     if strategy == learner_pb2.LearnerConfig.TREE_PER_CLASS:
       # We build one vs rest trees.
@@ -571,31 +610,39 @@ class GradientBoostedDecisionTreeModel(object):
     # Get the weights for each example for quantiles calculation,
     weights = self._get_weights(hessian_shape, squeezed_hessians)
 
-    regularization_config = self._learner_config.regularization
-    min_node_weight = self._learner_config.constraints.min_node_weight
     # Create all handlers ensuring resources are evenly allocated across PS.
     fc_name_idx = 0
     handlers = []
     init_stamp_token = constant_op.constant(0, dtype=dtypes.int64)
+    l1_regularization = constant_op.constant(
+        self._learner_config.regularization.l1, dtypes.float32)
+    l2_regularization = constant_op.constant(
+        self._learner_config.regularization.l2, dtypes.float32)
+    tree_complexity_regularization = constant_op.constant(
+        self._learner_config.regularization.tree_complexity, dtypes.float32)
+    min_node_weight = constant_op.constant(
+        self._learner_config.constraints.min_node_weight, dtypes.float32)
+    epsilon = 0.01
+    num_quantiles = 100
+    strategy_tensor = constant_op.constant(strategy)
     with ops.device(self._get_replica_device_setter(worker_device)):
       # Create handlers for dense float columns
       for dense_float_column_idx in range(len(self._dense_floats)):
         fc_name = self._fc_names[fc_name_idx]
         handlers.append(
             ordinal_split_handler.DenseSplitHandler(
-                l1_regularization=regularization_config.l1,
-                l2_regularization=regularization_config.l2,
-                tree_complexity_regularization=(
-                    regularization_config.tree_complexity),
+                l1_regularization=l1_regularization,
+                l2_regularization=l2_regularization,
+                tree_complexity_regularization=tree_complexity_regularization,
                 min_node_weight=min_node_weight,
                 feature_column_group_id=dense_float_column_idx,
-                epsilon=0.01,
-                num_quantiles=100,
+                epsilon=epsilon,
+                num_quantiles=num_quantiles,
                 dense_float_column=self._dense_floats[dense_float_column_idx],
                 name=fc_name,
                 gradient_shape=gradient_shape,
                 hessian_shape=hessian_shape,
-                multiclass_strategy=strategy,
+                multiclass_strategy=strategy_tensor,
                 init_stamp_token=init_stamp_token))
         fc_name_idx += 1
 
@@ -604,14 +651,13 @@ class GradientBoostedDecisionTreeModel(object):
         fc_name = self._fc_names[fc_name_idx]
         handlers.append(
             ordinal_split_handler.SparseSplitHandler(
-                l1_regularization=regularization_config.l1,
-                l2_regularization=regularization_config.l2,
-                tree_complexity_regularization=(
-                    regularization_config.tree_complexity),
+                l1_regularization=l1_regularization,
+                l2_regularization=l2_regularization,
+                tree_complexity_regularization=tree_complexity_regularization,
                 min_node_weight=min_node_weight,
                 feature_column_group_id=sparse_float_column_idx,
-                epsilon=0.01,
-                num_quantiles=100,
+                epsilon=epsilon,
+                num_quantiles=num_quantiles,
                 sparse_float_column=sparse_tensor.SparseTensor(
                     self._sparse_float_indices[sparse_float_column_idx],
                     self._sparse_float_values[sparse_float_column_idx],
@@ -619,7 +665,7 @@ class GradientBoostedDecisionTreeModel(object):
                 name=fc_name,
                 gradient_shape=gradient_shape,
                 hessian_shape=hessian_shape,
-                multiclass_strategy=strategy,
+                multiclass_strategy=strategy_tensor,
                 init_stamp_token=init_stamp_token))
         fc_name_idx += 1
 
@@ -628,10 +674,9 @@ class GradientBoostedDecisionTreeModel(object):
         fc_name = self._fc_names[fc_name_idx]
         handlers.append(
             categorical_split_handler.EqualitySplitHandler(
-                l1_regularization=regularization_config.l1,
-                l2_regularization=regularization_config.l2,
-                tree_complexity_regularization=(
-                    regularization_config.tree_complexity),
+                l1_regularization=l1_regularization,
+                l2_regularization=l2_regularization,
+                tree_complexity_regularization=tree_complexity_regularization,
                 min_node_weight=min_node_weight,
                 feature_column_group_id=sparse_int_column_idx,
                 sparse_int_column=sparse_tensor.SparseTensor(
@@ -641,7 +686,7 @@ class GradientBoostedDecisionTreeModel(object):
                 name=fc_name,
                 gradient_shape=gradient_shape,
                 hessian_shape=hessian_shape,
-                multiclass_strategy=strategy,
+                multiclass_strategy=strategy_tensor,
                 init_stamp_token=init_stamp_token))
         fc_name_idx += 1
 
@@ -694,11 +739,11 @@ class GradientBoostedDecisionTreeModel(object):
         name="continue_centering",
         trainable=False)
     stats_update_ops.append(
-        control_flow_ops.cond(continue_centering,
-                              self._make_update_bias_stats_fn(
-                                  ensemble_stamp, predictions, gradients,
-                                  bias_stats_accumulator),
-                              control_flow_ops.no_op))
+        control_flow_ops.cond(
+            continue_centering,
+            self._make_update_bias_stats_fn(ensemble_stamp, predictions,
+                                            gradients, bias_stats_accumulator),
+            control_flow_ops.no_op))
 
     # Update handler stats.
     handler_reads = collections.OrderedDict()
@@ -720,8 +765,8 @@ class GradientBoostedDecisionTreeModel(object):
           shape=[len(handlers)], seed=[seed + 1, 1])
       active_handlers = array_ops.stack(
           [active_handlers_current_layer, active_handlers_next_layer], axis=1)
-      active_handlers = (active_handlers <
-                         self._learner_config.feature_fraction_per_level)
+      active_handlers = (
+          active_handlers < self._learner_config.feature_fraction_per_level)
     elif subsampling_type == "feature_fraction_per_tree":
       seed = predictions_dict[NUM_TREES_ATTEMPTED]
       active_handlers_current_layer = stateless.stateless_random_uniform(
@@ -729,9 +774,12 @@ class GradientBoostedDecisionTreeModel(object):
       active_handlers_current_layer = (
           active_handlers_current_layer <
           self._learner_config.feature_fraction_per_tree)
-      active_handlers = array_ops.stack([
-          active_handlers_current_layer,
-          array_ops.ones([len(handlers)], dtype=dtypes.bool)], axis=1)
+      active_handlers = array_ops.stack(
+          [
+              active_handlers_current_layer,
+              array_ops.ones([len(handlers)], dtype=dtypes.bool)
+          ],
+          axis=1)
     else:
       active_handlers = array_ops.ones([len(handlers), 2], dtype=dtypes.bool)
 
@@ -760,6 +808,7 @@ class GradientBoostedDecisionTreeModel(object):
     empty_hessians = constant_op.constant(
         [], dtype=dtypes.float32, shape=empty_hess_shape)
 
+    active_handlers = array_ops.unstack(active_handlers, axis=0)
     for handler_idx in range(len(handlers)):
       handler = handlers[handler_idx]
       is_active = active_handlers[handler_idx]
@@ -901,7 +950,6 @@ class GradientBoostedDecisionTreeModel(object):
         "DecisionTreeEnsembleResourceHandleOp",
         "StatsAccumulatorScalarResourceHandleOp",
         "StatsAccumulatorTensorResourceHandleOp",
-        "QuantileStreamResourceHandleOp",
     ]
     ps_strategy = _OpRoundRobinStrategy(ps_ops, ps_tasks)
     return device_setter.replica_device_setter(
@@ -971,7 +1019,7 @@ class GradientBoostedDecisionTreeModel(object):
       # This is a workaround for the slowness of graph building in tf.cond.
       # See (b/36554864).
       split_sizes = array_ops.reshape(
-          array_ops.shape_n(partition_ids_list), [-1])
+          array_ops.shape_n(partition_ids_list), [len(partition_ids_list)])
       partition_ids = array_ops.concat(partition_ids_list, axis=0)
       gains = array_ops.concat(gains_list, axis=0)
       split_infos = array_ops.concat(split_info_list, axis=0)
@@ -1036,8 +1084,11 @@ class GradientBoostedDecisionTreeModel(object):
 
       # Update ensemble.
       update_ops = [are_all_splits_ready]
-      update_model = control_flow_ops.cond(continue_centering, _center_bias_fn,
-                                           _grow_ensemble_fn)
+      if self._center_bias:
+        update_model = control_flow_ops.cond(continue_centering,
+                                             _center_bias_fn, _grow_ensemble_fn)
+      else:
+        update_model = _grow_ensemble_fn()
       update_ops.append(update_model)
 
       # Update ensemble stats.
diff --git a/tensorflow/contrib/boosted_trees/python/training/functions/gbdt_batch_test.py b/tensorflow/contrib/boosted_trees/python/training/functions/gbdt_batch_test.py
index f9c22283b7f5136777bfa60a12c94974adfbd245..e3d4397fadcbaf148f7f6cfaca13e850639786cf 100644
--- a/tensorflow/contrib/boosted_trees/python/training/functions/gbdt_batch_test.py
+++ b/tensorflow/contrib/boosted_trees/python/training/functions/gbdt_batch_test.py
@@ -19,19 +19,15 @@ from __future__ import division
 from __future__ import print_function
 
 from google.protobuf import text_format
-
 from tensorflow.contrib import layers
 from tensorflow.contrib.boosted_trees.proto import learner_pb2
 from tensorflow.contrib.boosted_trees.proto import tree_config_pb2
 from tensorflow.contrib.boosted_trees.python.ops import model_ops
 from tensorflow.contrib.boosted_trees.python.training.functions import gbdt_batch
 from tensorflow.contrib.boosted_trees.python.utils import losses
-
-from tensorflow.python.feature_column import feature_column_lib as core_feature_column
 from tensorflow.contrib.layers.python.layers import feature_column as feature_column_lib
 from tensorflow.contrib.learn.python.learn.estimators import model_fn
-
-
+from tensorflow.python.feature_column import feature_column_lib as core_feature_column
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import sparse_tensor
 from tensorflow.python.framework import test_util
@@ -97,8 +93,8 @@ class GbdtTest(test_util.TensorFlowTestCase):
           array_ops.zeros([2], dtypes.int64))
       features["sparse_int"] = sparse_tensor.SparseTensor(
           array_ops.zeros([2, 2], dtypes.int64),
-          array_ops.zeros([2], dtypes.int64),
-          array_ops.zeros([2], dtypes.int64))
+          array_ops.zeros([2], dtypes.int64), array_ops.zeros([2],
+                                                              dtypes.int64))
       (fc_names, dense_floats, sparse_float_indices, sparse_float_values,
        sparse_float_shapes, sparse_int_indices, sparse_int_values,
        sparse_int_shapes) = (
@@ -139,8 +135,8 @@ class GbdtTest(test_util.TensorFlowTestCase):
           array_ops.zeros([2], dtypes.int64))
       features["sparse_categorical"] = sparse_tensor.SparseTensor(
           array_ops.zeros([2, 2], dtypes.int64),
-          array_ops.zeros(
-              [2], dtypes.string), array_ops.zeros([2], dtypes.int64))
+          array_ops.zeros([2], dtypes.string), array_ops.zeros([2],
+                                                               dtypes.int64))
       feature_columns = set()
       feature_columns.add(layers.real_valued_column("dense_float"))
       feature_columns.add(
@@ -235,7 +231,8 @@ class GbdtTest(test_util.TensorFlowTestCase):
           ensemble_handle=ensemble_handle,
           examples_per_layer=1,
           learner_config=learner_config,
-          logits_dimension=1, features=features)
+          logits_dimension=1,
+          features=features)
 
       predictions = array_ops.constant(
           [[0.0], [1.0], [0.0], [2.0]], dtype=dtypes.float32)
@@ -316,6 +313,113 @@ class GbdtTest(test_util.TensorFlowTestCase):
           }"""
       self.assertProtoEquals(expected_tree, output.trees[0])
 
+  def testTrainFnChiefSparseAndDense(self):
+    """Tests the train function with sparse and dense features."""
+    with self.test_session() as sess:
+      ensemble_handle = model_ops.tree_ensemble_variable(
+          stamp_token=0, tree_ensemble_config="", name="tree_ensemble")
+      learner_config = learner_pb2.LearnerConfig()
+      learner_config.learning_rate_tuner.fixed.learning_rate = 0.1
+      learner_config.num_classes = 2
+      learner_config.regularization.l1 = 0
+      learner_config.regularization.l2 = 0
+      learner_config.constraints.max_tree_depth = 1
+      learner_config.constraints.min_node_weight = 0
+      features = {}
+      features["dense_float"] = array_ops.ones([4, 1], dtypes.float32)
+      features["sparse_float"] = sparse_tensor.SparseTensor(
+          array_ops.zeros([2, 2], dtypes.int64),
+          array_ops.zeros([2], dtypes.float32),
+          array_ops.constant([4, 1], dtypes.int64))
+
+      gbdt_model = gbdt_batch.GradientBoostedDecisionTreeModel(
+          is_chief=True,
+          num_ps_replicas=0,
+          center_bias=False,
+          ensemble_handle=ensemble_handle,
+          examples_per_layer=1,
+          learner_config=learner_config,
+          logits_dimension=1,
+          features=features)
+
+      predictions = array_ops.constant(
+          [[0.0], [1.0], [0.0], [2.0]], dtype=dtypes.float32)
+      partition_ids = array_ops.zeros([4], dtypes.int32)
+      ensemble_stamp = variables.Variable(
+          initial_value=0,
+          name="ensemble_stamp",
+          trainable=False,
+          dtype=dtypes.int64)
+
+      predictions_dict = {
+          "predictions": predictions,
+          "predictions_no_dropout": predictions,
+          "partition_ids": partition_ids,
+          "ensemble_stamp": ensemble_stamp,
+          "num_trees": 12,
+      }
+
+      labels = array_ops.ones([4, 1], dtypes.float32)
+      weights = array_ops.ones([4, 1], dtypes.float32)
+      # Create train op.
+      train_op = gbdt_model.train(
+          loss=math_ops.reduce_mean(
+              _squared_loss(labels, weights, predictions)),
+          predictions_dict=predictions_dict,
+          labels=labels)
+      variables.global_variables_initializer().run()
+      resources.initialize_resources(resources.shared_resources()).run()
+
+      # On first run, expect no splits to be chosen because the quantile
+      # buckets will not be ready.
+      train_op.run()
+      stamp_token, serialized = model_ops.tree_ensemble_serialize(
+          ensemble_handle)
+      output = tree_config_pb2.DecisionTreeEnsembleConfig()
+      output.ParseFromString(serialized.eval())
+      self.assertEquals(len(output.trees), 0)
+      self.assertEquals(len(output.tree_weights), 0)
+      self.assertEquals(stamp_token.eval(), 1)
+
+      # Update the stamp to be able to run a second time.
+      sess.run([ensemble_stamp.assign_add(1)])
+
+      train_op.run()
+      stamp_token, serialized = model_ops.tree_ensemble_serialize(
+          ensemble_handle)
+      output = tree_config_pb2.DecisionTreeEnsembleConfig()
+      output.ParseFromString(serialized.eval())
+      self.assertEquals(len(output.trees), 1)
+      self.assertAllClose(output.tree_weights, [0.1])
+      self.assertEquals(stamp_token.eval(), 2)
+      expected_tree = """
+          nodes {
+            sparse_float_binary_split_default_right {
+              split{
+                left_id: 1
+                right_id: 2
+              }
+            }
+            node_metadata {
+              gain: 1.125
+            }
+          }
+          nodes {
+            leaf {
+              vector {
+                value: 1.0
+              }
+            }
+          }
+          nodes {
+            leaf {
+              vector {
+                value: -0.5
+              }
+            }
+          }"""
+      self.assertProtoEquals(expected_tree, output.trees[0])
+
   def testTrainFnChiefScalingNumberOfExamples(self):
     """Tests the train function running on chief without bias centering."""
     with self.test_session() as sess:
@@ -339,7 +443,8 @@ class GbdtTest(test_util.TensorFlowTestCase):
           ensemble_handle=ensemble_handle,
           examples_per_layer=num_examples_fn,
           learner_config=learner_config,
-          logits_dimension=1, features=features)
+          logits_dimension=1,
+          features=features)
 
       predictions = array_ops.constant(
           [[0.0], [1.0], [0.0], [2.0]], dtype=dtypes.float32)
@@ -442,7 +547,8 @@ class GbdtTest(test_util.TensorFlowTestCase):
           ensemble_handle=ensemble_handle,
           examples_per_layer=1,
           learner_config=learner_config,
-          logits_dimension=1, features=features)
+          logits_dimension=1,
+          features=features)
 
       predictions = array_ops.constant(
           [[0.0], [1.0], [0.0], [2.0]], dtype=dtypes.float32)
@@ -513,7 +619,8 @@ class GbdtTest(test_util.TensorFlowTestCase):
           ensemble_handle=ensemble_handle,
           examples_per_layer=1,
           learner_config=learner_config,
-          logits_dimension=1, features=features)
+          logits_dimension=1,
+          features=features)
 
       predictions = array_ops.constant(
           [[0.0], [1.0], [0.0], [2.0]], dtype=dtypes.float32)
@@ -576,7 +683,8 @@ class GbdtTest(test_util.TensorFlowTestCase):
           ensemble_handle=ensemble_handle,
           examples_per_layer=1,
           learner_config=learner_config,
-          logits_dimension=1, features=features)
+          logits_dimension=1,
+          features=features)
 
       predictions = array_ops.constant(
           [[0.0], [1.0], [0.0], [2.0]], dtype=dtypes.float32)
@@ -622,7 +730,8 @@ class GbdtTest(test_util.TensorFlowTestCase):
     with self.test_session() as sess:
       # Create ensemble with one bias node.
       ensemble_config = tree_config_pb2.DecisionTreeEnsembleConfig()
-      text_format.Merge("""
+      text_format.Merge(
+          """
           trees {
             nodes {
               leaf {
@@ -659,16 +768,129 @@ class GbdtTest(test_util.TensorFlowTestCase):
           ensemble_handle=ensemble_handle,
           examples_per_layer=1,
           learner_config=learner_config,
-          logits_dimension=1, features=features)
+          logits_dimension=1,
+          features=features)
 
       # Create predict op.
       mode = model_fn.ModeKeys.EVAL
       predictions_dict = sess.run(gbdt_model.predict(mode))
       self.assertEquals(predictions_dict["ensemble_stamp"], 3)
-      self.assertAllClose(predictions_dict["predictions"], [[0.25], [0.25],
-                                                            [0.25], [0.25]])
+      self.assertAllClose(predictions_dict["predictions"],
+                          [[0.25], [0.25], [0.25], [0.25]])
       self.assertAllClose(predictions_dict["partition_ids"], [0, 0, 0, 0])
 
+  def testPredictFnWithLeafIndexAdvancedLeft(self):
+    """Tests the predict function with output leaf ids."""
+    with self.test_session() as sess:
+      # Create ensemble with one bias node.
+      ensemble_config = tree_config_pb2.DecisionTreeEnsembleConfig()
+      text_format.Merge(
+          """
+          trees {
+            nodes {
+                dense_float_binary_split {
+                  threshold: 1.0
+                  left_id: 1
+                  right_id: 2
+                }
+                node_metadata {
+                  gain: 0
+                }
+              }
+              nodes {
+                leaf {
+                  vector {
+                    value: 0.25
+                  }
+                }
+              }
+              nodes {
+                leaf {
+                  vector {
+                    value: 0.15
+                  }
+                }
+              }
+          }
+          trees {
+            nodes {
+                dense_float_binary_split {
+                  threshold: 0.99
+                  left_id: 1
+                  right_id: 2
+                }
+                node_metadata {
+                  gain: 00
+                }
+              }
+              nodes {
+                leaf {
+                  vector {
+                    value: 0.25
+                  }
+                }
+              }
+              nodes {
+                leaf {
+                  vector {
+                    value: 0.23
+                  }
+                }
+              }
+          }
+          tree_weights: 1.0
+          tree_weights: 1.0
+          tree_metadata {
+            num_tree_weight_updates: 1
+            num_layers_grown: 1
+            is_finalized: true
+          }
+          tree_metadata {
+            num_tree_weight_updates: 1
+            num_layers_grown: 1
+            is_finalized: true
+          }""", ensemble_config)
+      ensemble_handle = model_ops.tree_ensemble_variable(
+          stamp_token=3,
+          tree_ensemble_config=ensemble_config.SerializeToString(),
+          name="tree_ensemble")
+      resources.initialize_resources(resources.shared_resources()).run()
+      learner_config = learner_pb2.LearnerConfig()
+      learner_config.learning_rate_tuner.fixed.learning_rate = 0.1
+      learner_config.num_classes = 2
+      learner_config.regularization.l1 = 0
+      learner_config.regularization.l2 = 0
+      learner_config.constraints.max_tree_depth = 1
+      learner_config.constraints.min_node_weight = 0
+      features = {}
+      features["dense_float"] = array_ops.constant(
+          [[0.0], [1.0], [1.1], [2.0]], dtype=dtypes.float32)
+      gbdt_model = gbdt_batch.GradientBoostedDecisionTreeModel(
+          is_chief=False,
+          num_ps_replicas=0,
+          center_bias=True,
+          ensemble_handle=ensemble_handle,
+          examples_per_layer=1,
+          learner_config=learner_config,
+          logits_dimension=1,
+          features=features,
+          output_leaf_index=True)
+
+      # Create predict op.
+      mode = model_fn.ModeKeys.INFER
+      predictions_dict = sess.run(gbdt_model.predict(mode))
+      self.assertEquals(predictions_dict["ensemble_stamp"], 3)
+      # here are how the numbers in expected results are calculated,
+      # 0.5 = 0.25 + 0.25
+      # 0.48 = 0.25 + 0.23
+      # 0.38 = 0.15 + 0.23
+      # 0.38 = 0.15 + 0.23
+      self.assertAllClose(predictions_dict["predictions"],
+                          [[0.5], [0.48], [0.38], [0.38]])
+      self.assertAllClose(predictions_dict["partition_ids"], [0, 0, 0, 0])
+      self.assertAllClose(predictions_dict["leaf_index"],
+                          [[1, 1], [1, 2], [2, 2], [2, 2]])
+
   def testTrainFnMulticlassFullHessian(self):
     """Tests the GBDT train for multiclass full hessian."""
     with self.test_session() as sess:
@@ -698,7 +920,8 @@ class GbdtTest(test_util.TensorFlowTestCase):
           ensemble_handle=ensemble_handle,
           examples_per_layer=1,
           learner_config=learner_config,
-          logits_dimension=5, features=features)
+          logits_dimension=5,
+          features=features)
 
       predictions = array_ops.constant(
           [[0.0, -1.0, 0.5, 1.2, 3.1], [1.0, 0.0, 0.8, 0.3, 1.0],
@@ -801,7 +1024,8 @@ class GbdtTest(test_util.TensorFlowTestCase):
           ensemble_handle=ensemble_handle,
           examples_per_layer=1,
           learner_config=learner_config,
-          logits_dimension=5, features=features)
+          logits_dimension=5,
+          features=features)
 
       predictions = array_ops.constant(
           [[0.0, -1.0, 0.5, 1.2, 3.1], [1.0, 0.0, 0.8, 0.3, 1.0],
@@ -893,8 +1117,8 @@ class GbdtTest(test_util.TensorFlowTestCase):
       learner_config.constraints.max_tree_depth = 1
       learner_config.constraints.min_node_weight = 0
       features = {
-          "dense_float": array_ops.constant(
-              [[1.0], [1.5], [2.0]], dtypes.float32),
+          "dense_float":
+              array_ops.constant([[1.0], [1.5], [2.0]], dtypes.float32),
       }
 
       gbdt_model = gbdt_batch.GradientBoostedDecisionTreeModel(
@@ -904,7 +1128,8 @@ class GbdtTest(test_util.TensorFlowTestCase):
           ensemble_handle=ensemble_handle,
           examples_per_layer=1,
           learner_config=learner_config,
-          logits_dimension=5, features=features)
+          logits_dimension=5,
+          features=features)
 
       batch_size = 3
       predictions = array_ops.constant(
@@ -986,7 +1211,8 @@ class GbdtTest(test_util.TensorFlowTestCase):
       self.assertAllClose(
           0.893284678459,
           output.trees[0].nodes[2].leaf.sparse_vector.value[0],
-          atol=1e-4, rtol=1e-4)
+          atol=1e-4,
+          rtol=1e-4)
 
   def testTrainFnChiefFeatureSelectionReachedLimitNoGoodSplit(self):
     """Tests the train function running on chief with feature selection."""
@@ -1230,9 +1456,9 @@ class GbdtTest(test_util.TensorFlowTestCase):
       tree_ensemble_config = tree_config_pb2.DecisionTreeEnsembleConfig()
       tree = tree_ensemble_config.trees.add()
 
-      _set_float_split(tree.nodes.add()
-                       .sparse_float_binary_split_default_right.split, 2, 4.0,
-                       1, 2)
+      _set_float_split(
+          tree.nodes.add().sparse_float_binary_split_default_right.split, 2,
+          4.0, 1, 2)
       _append_to_leaf(tree.nodes.add().leaf, 0, 0.5)
       _append_to_leaf(tree.nodes.add().leaf, 1, 1.2)
       tree_ensemble_config.tree_weights.append(1.0)
@@ -1241,7 +1467,8 @@ class GbdtTest(test_util.TensorFlowTestCase):
       metadata.num_layers_grown = 1
       tree_ensemble_config = tree_ensemble_config.SerializeToString()
       ensemble_handle = model_ops.tree_ensemble_variable(
-          stamp_token=0, tree_ensemble_config=tree_ensemble_config,
+          stamp_token=0,
+          tree_ensemble_config=tree_ensemble_config,
           name="tree_ensemble")
       learner_config = learner_pb2.LearnerConfig()
       learner_config.learning_rate_tuner.fixed.learning_rate = 0.1
@@ -1333,5 +1560,6 @@ class GbdtTest(test_util.TensorFlowTestCase):
 
       self.assertEquals(output.growing_metadata.num_layers_attempted, 2)
 
+
 if __name__ == "__main__":
   googletest.main()
diff --git a/tensorflow/contrib/checkpoint/__init__.py b/tensorflow/contrib/checkpoint/__init__.py
index af8df72618b7255e182e98e6e4b96a0333b3dce6..8ae493ba998bd882b5ef946f927ec1882d91f61d 100644
--- a/tensorflow/contrib/checkpoint/__init__.py
+++ b/tensorflow/contrib/checkpoint/__init__.py
@@ -18,11 +18,15 @@ Visualization and inspection:
 @@dot_graph_from_checkpoint
 @@object_metadata
 
-Creating and managing dependencies:
+Managing dependencies:
 @@Checkpointable
 @@CheckpointableObjectGraph
 @@NoDependency
 @@split_dependency
+
+Checkpointable data structures:
+@@List
+@@Mapping
 @@UniqueNameTracker
 """
 
@@ -36,8 +40,11 @@ from tensorflow.contrib.checkpoint.python.visualize import dot_graph_from_checkp
 from tensorflow.core.protobuf.checkpointable_object_graph_pb2 import CheckpointableObjectGraph
 from tensorflow.python.training.checkpointable.base import Checkpointable
 from tensorflow.python.training.checkpointable.base import NoDependency
+from tensorflow.python.training.checkpointable.data_structures import List
+from tensorflow.python.training.checkpointable.data_structures import Mapping
 from tensorflow.python.training.checkpointable.util import object_metadata
 
 from tensorflow.python.util.all_util import remove_undocumented
 
 remove_undocumented(module_name=__name__)
+
diff --git a/tensorflow/contrib/checkpoint/python/BUILD b/tensorflow/contrib/checkpoint/python/BUILD
index 53f4e97f9932104933b3ecf80142e5af82cd487a..7b200a29bf60087d6da1010b0be05c04faec80cd 100644
--- a/tensorflow/contrib/checkpoint/python/BUILD
+++ b/tensorflow/contrib/checkpoint/python/BUILD
@@ -11,6 +11,7 @@ py_library(
         ":containers",
         ":split_dependency",
         ":visualize",
+        "//tensorflow/python/training/checkpointable:data_structures",
     ],
 )
 
@@ -19,7 +20,10 @@ py_library(
     srcs = ["containers.py"],
     srcs_version = "PY2AND3",
     visibility = ["//tensorflow:internal"],
-    deps = ["//tensorflow/python/training/checkpointable:base"],
+    deps = [
+        "//tensorflow/python/training/checkpointable:base",
+        "//tensorflow/python/training/checkpointable:data_structures",
+    ],
 )
 
 py_test(
@@ -30,8 +34,8 @@ py_test(
         "//tensorflow/python:client_testlib",
         "//tensorflow/python:framework_test_lib",
         "//tensorflow/python:resource_variable_ops",
-        "//tensorflow/python:training",
         "//tensorflow/python/training/checkpointable:base",
+        "//tensorflow/python/training/checkpointable:util",
         "@six_archive//:six",
     ],
 )
@@ -44,6 +48,7 @@ py_library(
     deps = [
         "//tensorflow/python:control_flow_ops",
         "//tensorflow/python:training",
+        "//tensorflow/python/training/checkpointable:base",
     ],
 )
 
@@ -55,8 +60,9 @@ py_test(
         "//tensorflow/python:array_ops",
         "//tensorflow/python:framework_test_lib",
         "//tensorflow/python:resource_variable_ops",
-        "//tensorflow/python:training",
         "//tensorflow/python/eager:test",
+        "//tensorflow/python/training/checkpointable:base",
+        "//tensorflow/python/training/checkpointable:util",
     ],
 )
 
@@ -67,6 +73,8 @@ py_library(
     visibility = ["//tensorflow:internal"],
     deps = [
         "//tensorflow/python:pywrap_tensorflow",
+        "//tensorflow/python/training/checkpointable:base",
+        "//tensorflow/python/training/checkpointable:util",
     ],
 )
 
@@ -75,10 +83,13 @@ py_test(
     srcs = ["visualize_test.py"],
     deps = [
         ":visualize",
-        "//tensorflow/python:array_ops",
-        "//tensorflow/python:framework_test_lib",
+        "//tensorflow/python:constant_op",
         "//tensorflow/python:resource_variable_ops",
         "//tensorflow/python:training",
+        "//tensorflow/python/eager:context",
         "//tensorflow/python/eager:test",
+        "//tensorflow/python/keras:engine",
+        "//tensorflow/python/keras:layers",
+        "//tensorflow/python/training/checkpointable:util",
     ],
 )
diff --git a/tensorflow/contrib/checkpoint/python/containers.py b/tensorflow/contrib/checkpoint/python/containers.py
index 9807abae1f5106bb84f858c3725f096aaa4eaca9..4d3d5312993740636709cb732c0b8e3e2626262d 100644
--- a/tensorflow/contrib/checkpoint/python/containers.py
+++ b/tensorflow/contrib/checkpoint/python/containers.py
@@ -18,9 +18,10 @@ from __future__ import division
 from __future__ import print_function
 
 from tensorflow.python.training.checkpointable import base as checkpointable_lib
+from tensorflow.python.training.checkpointable import data_structures
 
 
-class UniqueNameTracker(checkpointable_lib.CheckpointableBase):
+class UniqueNameTracker(data_structures.CheckpointableDataStructure):
   """Adds dependencies on checkpointable objects with name hints.
 
   Useful for creating dependencies with locally unique names.
@@ -41,6 +42,7 @@ class UniqueNameTracker(checkpointable_lib.CheckpointableBase):
   """
 
   def __init__(self):
+    super(UniqueNameTracker, self).__init__()
     self._maybe_initialize_checkpointable()
     self._name_counts = {}
 
@@ -74,4 +76,5 @@ class UniqueNameTracker(checkpointable_lib.CheckpointableBase):
       count += 1
       candidate = _format_name(base_name, count)
     self._name_counts[base_name] = count + 1
-    return self._track_checkpointable(checkpointable, name=candidate)
+    self._track_value(checkpointable, name=candidate)
+    return checkpointable
diff --git a/tensorflow/contrib/checkpoint/python/containers_test.py b/tensorflow/contrib/checkpoint/python/containers_test.py
index 851a80058852bd917aec075b4bf63264318603a7..3717d7f583ffdc205a279d45df60cddbc5cbf08e 100644
--- a/tensorflow/contrib/checkpoint/python/containers_test.py
+++ b/tensorflow/contrib/checkpoint/python/containers_test.py
@@ -22,6 +22,8 @@ import six
 
 from tensorflow.contrib.checkpoint.python import containers
 from tensorflow.python.framework import test_util
+from tensorflow.python.keras import layers
+from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import resource_variable_ops
 from tensorflow.python.platform import test
 from tensorflow.python.training.checkpointable import base as checkpointable
@@ -95,5 +97,12 @@ class UniqueNameTrackerTests(test.TestCase):
         dependency_names,
         ["x", "x_1", "y", "slot_manager", "slotdeps", "save_counter"])
 
+  @test_util.run_in_graph_and_eager_modes()
+  def testLayers(self):
+    tracker = containers.UniqueNameTracker()
+    tracker.track(layers.Dense(3), "dense")
+    tracker.layers[0](array_ops.zeros([1, 1]))
+    self.assertEqual(2, len(tracker.trainable_weights))
+
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/contrib/cloud/BUILD b/tensorflow/contrib/cloud/BUILD
index f3a75e8688ece19a6e6fd53ee9faf7f4144d76cf..42ba368531468b789a87429f88ca84937f9b909d 100644
--- a/tensorflow/contrib/cloud/BUILD
+++ b/tensorflow/contrib/cloud/BUILD
@@ -15,7 +15,10 @@ load(
 )
 
 tf_gen_op_libs(
-    op_lib_names = ["bigquery_reader_ops"],
+    op_lib_names = [
+        "bigquery_reader_ops",
+        "gcs_config_ops",
+    ],
     deps = [
         "//tensorflow/core:lib",
     ],
@@ -28,15 +31,25 @@ tf_gen_op_wrapper_py(
     deps = [":bigquery_reader_ops_op_lib"],
 )
 
+tf_gen_op_wrapper_py(
+    name = "gen_gcs_config_ops",
+    out = "python/ops/gen_gcs_config_ops.py",
+    require_shape_functions = True,
+    visibility = ["//tensorflow:internal"],
+    deps = [":gcs_config_ops_op_lib"],
+)
+
 py_library(
     name = "cloud_py",
     srcs = [
         "__init__.py",
         "python/ops/bigquery_reader_ops.py",
+        "python/ops/gcs_config_ops.py",
     ],
     srcs_version = "PY2AND3",
     deps = [
         ":gen_bigquery_reader_ops",
+        ":gen_gcs_config_ops",
         "//tensorflow/python:framework_for_generated_wrappers",
         "//tensorflow/python:io_ops",
         "//tensorflow/python:util",
diff --git a/tensorflow/contrib/cloud/__init__.py b/tensorflow/contrib/cloud/__init__.py
index 8870264b95dfd9f8c4b1655c475fe23e0639924f..a6e13ea3ae938444b9ead0772e52fb8797a847da 100644
--- a/tensorflow/contrib/cloud/__init__.py
+++ b/tensorflow/contrib/cloud/__init__.py
@@ -20,9 +20,15 @@ from __future__ import print_function
 
 # pylint: disable=line-too-long,wildcard-import
 from tensorflow.contrib.cloud.python.ops.bigquery_reader_ops import *
+from tensorflow.contrib.cloud.python.ops.gcs_config_ops import *
 # pylint: enable=line-too-long,wildcard-import
 
 from tensorflow.python.util.all_util import remove_undocumented
 
-_allowed_symbols = ['BigQueryReader']
+_allowed_symbols = [
+    'BigQueryReader',
+    'ConfigureColabSession',
+    'ConfigureGcs',
+    'ConfigureGcsHook',
+]
 remove_undocumented(__name__, _allowed_symbols)
diff --git a/tensorflow/contrib/cloud/kernels/BUILD b/tensorflow/contrib/cloud/kernels/BUILD
index ff46f0daa80a70badedf73e15bfaf4dca85fdd89..40160706f70e8fa8323005dd183770ed51c8c415 100644
--- a/tensorflow/contrib/cloud/kernels/BUILD
+++ b/tensorflow/contrib/cloud/kernels/BUILD
@@ -73,3 +73,17 @@ tf_proto_library(
     srcs = ["bigquery_table_partition.proto"],
     cc_api_version = 2,
 )
+
+tf_kernel_library(
+    name = "gcs_config_ops",
+    srcs = ["gcs_config_ops.cc"],
+    visibility = ["//tensorflow:internal"],
+    deps = [
+        "//tensorflow/core:framework",
+        "//tensorflow/core:lib",
+        "//tensorflow/core/platform/cloud:curl_http_request",
+        "//tensorflow/core/platform/cloud:gcs_file_system",
+        "//tensorflow/core/platform/cloud:oauth_client",
+        "@jsoncpp_git//:jsoncpp",
+    ],
+)
diff --git a/tensorflow/contrib/cloud/kernels/gcs_config_ops.cc b/tensorflow/contrib/cloud/kernels/gcs_config_ops.cc
new file mode 100644
index 0000000000000000000000000000000000000000..648a219fb87a6ebc64767a7da780013ef6b95443
--- /dev/null
+++ b/tensorflow/contrib/cloud/kernels/gcs_config_ops.cc
@@ -0,0 +1,205 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include <sstream>
+
+#include "include/json/json.h"
+#include "tensorflow/core/framework/op_kernel.h"
+#include "tensorflow/core/framework/tensor_shape.h"
+#include "tensorflow/core/platform/cloud/curl_http_request.h"
+#include "tensorflow/core/platform/cloud/gcs_file_system.h"
+#include "tensorflow/core/platform/cloud/oauth_client.h"
+#include "tensorflow/core/util/ptr_util.h"
+
+namespace tensorflow {
+namespace {
+
+// The default initial delay between retries with exponential backoff.
+constexpr int kInitialRetryDelayUsec = 500000;  // 0.5 sec
+
+// The minimum time delta between now and the token expiration time
+// for the token to be re-used.
+constexpr int kExpirationTimeMarginSec = 60;
+
+// The URL to retrieve the auth bearer token via OAuth with a refresh token.
+constexpr char kOAuthV3Url[] = "https://www.googleapis.com/oauth2/v3/token";
+
+// The URL to retrieve the auth bearer token via OAuth with a private key.
+constexpr char kOAuthV4Url[] = "https://www.googleapis.com/oauth2/v4/token";
+
+// The authentication token scope to request.
+constexpr char kOAuthScope[] = "https://www.googleapis.com/auth/cloud-platform";
+
+Status RetrieveGcsFs(OpKernelContext* ctx, RetryingGcsFileSystem** fs) {
+  DCHECK(fs != nullptr);
+  *fs = nullptr;
+
+  FileSystem* filesystem = nullptr;
+  TF_RETURN_IF_ERROR(
+      ctx->env()->GetFileSystemForFile("gs://fake/file.text", &filesystem));
+  if (filesystem == nullptr) {
+    return errors::FailedPrecondition("The GCS file system is not registered.");
+  }
+
+  *fs = dynamic_cast<RetryingGcsFileSystem*>(filesystem);
+  if (*fs == nullptr) {
+    return errors::Internal(
+        "The filesystem registered under the 'gs://' scheme was not a "
+        "tensorflow::RetryingGcsFileSystem*.");
+  }
+  return Status::OK();
+}
+
+template <typename T>
+Status ParseScalarArgument(OpKernelContext* ctx, StringPiece argument_name,
+                           T* output) {
+  const Tensor* argument_t;
+  TF_RETURN_IF_ERROR(ctx->input(argument_name, &argument_t));
+  if (!TensorShapeUtils::IsScalar(argument_t->shape())) {
+    return errors::InvalidArgument(argument_name, " must be a scalar");
+  }
+  *output = argument_t->scalar<T>()();
+  return Status::OK();
+}
+
+// GcsCredentialsOpKernel overrides the credentials used by the gcs_filesystem.
+class GcsCredentialsOpKernel : public OpKernel {
+ public:
+  explicit GcsCredentialsOpKernel(OpKernelConstruction* ctx) : OpKernel(ctx) {}
+  void Compute(OpKernelContext* ctx) override {
+    // Get a handle to the GCS file system.
+    RetryingGcsFileSystem* gcs = nullptr;
+    OP_REQUIRES_OK(ctx, RetrieveGcsFs(ctx, &gcs));
+
+    string json_string;
+    OP_REQUIRES_OK(ctx, ParseScalarArgument<string>(ctx, "json", &json_string));
+
+    Json::Value json;
+    Json::Reader reader;
+    std::stringstream json_stream(json_string);
+    OP_REQUIRES(ctx, reader.parse(json_stream, json),
+                errors::InvalidArgument("Could not parse json: ", json_string));
+
+    OP_REQUIRES(
+        ctx, json.isMember("refresh_token") || json.isMember("private_key"),
+        errors::InvalidArgument("JSON format incompatible; did not find fields "
+                                "`refresh_token` or `private_key`."));
+
+    auto provider =
+        tensorflow::MakeUnique<ConstantAuthProvider>(json, ctx->env());
+
+    // Test getting a token
+    string dummy_token;
+    OP_REQUIRES_OK(ctx, provider->GetToken(&dummy_token));
+    OP_REQUIRES(ctx, !dummy_token.empty(),
+                errors::InvalidArgument(
+                    "Could not retrieve a token with the given credentials."));
+
+    // Set the provider.
+    gcs->underlying()->SetAuthProvider(std::move(provider));
+  }
+
+ private:
+  class ConstantAuthProvider : public AuthProvider {
+   public:
+    ConstantAuthProvider(const Json::Value& json,
+                         std::unique_ptr<OAuthClient> oauth_client, Env* env,
+                         int64 initial_retry_delay_usec)
+        : json_(json),
+          oauth_client_(std::move(oauth_client)),
+          env_(env),
+          initial_retry_delay_usec_(initial_retry_delay_usec) {}
+
+    ConstantAuthProvider(const Json::Value& json, Env* env)
+        : ConstantAuthProvider(json, tensorflow::MakeUnique<OAuthClient>(), env,
+                               kInitialRetryDelayUsec) {}
+
+    ~ConstantAuthProvider() override {}
+
+    Status GetToken(string* token) override {
+      mutex_lock l(mu_);
+      const uint64 now_sec = env_->NowSeconds();
+
+      if (!current_token_.empty() &&
+          now_sec + kExpirationTimeMarginSec < expiration_timestamp_sec_) {
+        *token = current_token_;
+        return Status::OK();
+      }
+      if (json_.isMember("refresh_token")) {
+        TF_RETURN_IF_ERROR(oauth_client_->GetTokenFromRefreshTokenJson(
+            json_, kOAuthV3Url, &current_token_, &expiration_timestamp_sec_));
+      } else if (json_.isMember("private_key")) {
+        TF_RETURN_IF_ERROR(oauth_client_->GetTokenFromServiceAccountJson(
+            json_, kOAuthV4Url, kOAuthScope, &current_token_,
+            &expiration_timestamp_sec_));
+      } else {
+        return errors::FailedPrecondition(
+            "Unexpected content of the JSON credentials file.");
+      }
+
+      *token = current_token_;
+      return Status::OK();
+    }
+
+   private:
+    Json::Value json_;
+    std::unique_ptr<OAuthClient> oauth_client_;
+    Env* env_;
+
+    mutex mu_;
+    string current_token_ GUARDED_BY(mu_);
+    uint64 expiration_timestamp_sec_ GUARDED_BY(mu_) = 0;
+
+    // The initial delay for exponential backoffs when retrying failed calls.
+    const int64 initial_retry_delay_usec_;
+    TF_DISALLOW_COPY_AND_ASSIGN(ConstantAuthProvider);
+  };
+};
+
+REGISTER_KERNEL_BUILDER(Name("GcsConfigureCredentials").Device(DEVICE_CPU),
+                        GcsCredentialsOpKernel);
+
+class GcsBlockCacheOpKernel : public OpKernel {
+ public:
+  explicit GcsBlockCacheOpKernel(OpKernelConstruction* ctx) : OpKernel(ctx) {}
+  void Compute(OpKernelContext* ctx) override {
+    // Get a handle to the GCS file system.
+    RetryingGcsFileSystem* gcs = nullptr;
+    OP_REQUIRES_OK(ctx, RetrieveGcsFs(ctx, &gcs));
+
+    size_t max_cache_size, block_size, max_staleness;
+    OP_REQUIRES_OK(ctx, ParseScalarArgument<size_t>(ctx, "max_cache_size",
+                                                    &max_cache_size));
+    OP_REQUIRES_OK(ctx,
+                   ParseScalarArgument<size_t>(ctx, "block_size", &block_size));
+    OP_REQUIRES_OK(
+        ctx, ParseScalarArgument<size_t>(ctx, "max_staleness", &max_staleness));
+
+    if (gcs->underlying()->block_size() == block_size &&
+        gcs->underlying()->max_bytes() == max_cache_size &&
+        gcs->underlying()->max_staleness() == max_staleness) {
+      LOG(INFO) << "Skipping resetting the GCS block cache.";
+      return;
+    }
+    gcs->underlying()->ResetFileBlockCache(block_size, max_cache_size,
+                                           max_staleness);
+  }
+};
+
+REGISTER_KERNEL_BUILDER(Name("GcsConfigureBlockCache").Device(DEVICE_CPU),
+                        GcsBlockCacheOpKernel);
+
+}  // namespace
+}  // namespace tensorflow
diff --git a/tensorflow/contrib/cloud/ops/gcs_config_ops.cc b/tensorflow/contrib/cloud/ops/gcs_config_ops.cc
new file mode 100644
index 0000000000000000000000000000000000000000..9cf85f5f1811d873075b6d2e1931d8badfd6e32c
--- /dev/null
+++ b/tensorflow/contrib/cloud/ops/gcs_config_ops.cc
@@ -0,0 +1,70 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/framework/common_shape_fns.h"
+#include "tensorflow/core/framework/op.h"
+#include "tensorflow/core/framework/shape_inference.h"
+
+namespace tensorflow {
+
+REGISTER_OP("GcsConfigureCredentials")
+    .Input("json: string")
+    .SetShapeFn(shape_inference::NoOutputs)
+    .Doc(R"doc(
+Configures the credentials used by the GCS client of the local TF runtime.
+
+The json input can be of the format:
+
+1. Refresh Token:
+{
+  "client_id": "<redacted>",
+  "client_secret": "<redacted>",
+  "refresh_token: "<redacted>",
+  "type": "authorized_user",
+}
+
+2. Service Account:
+{
+  "type": "service_account",
+  "project_id": "<redacted>",
+  "private_key_id": "<redacted>",
+  "private_key": "------BEGIN PRIVATE KEY-----\n<REDACTED>\n-----END PRIVATE KEY------\n",
+  "client_email": "<REDACTED>@<REDACTED>.iam.gserviceaccount.com",
+  "client_id": "<REDACTED>",
+  # Some additional fields elided
+}
+
+Note the credentials established through this method are shared across all
+sessions run on this runtime.
+
+Note be sure to feed the inputs to this op to ensure the credentials are not
+stored in a constant op within the graph that might accidentally be checkpointed
+or in other ways be persisted or exfiltrated.
+)doc");
+
+REGISTER_OP("GcsConfigureBlockCache")
+    .Input("max_cache_size: uint64")
+    .Input("block_size: uint64")
+    .Input("max_staleness: uint64")
+    .SetShapeFn(shape_inference::NoOutputs)
+    .Doc(R"doc(
+Re-configures the GCS block cache with the new configuration values.
+
+If the values are the same as already configured values, this op is a no-op. If
+they are different, the current contents of the block cache is dropped, and a
+new block cache is created fresh.
+)doc");
+
+}  // namespace tensorflow
diff --git a/tensorflow/contrib/cloud/python/ops/gcs_config_ops.py b/tensorflow/contrib/cloud/python/ops/gcs_config_ops.py
new file mode 100644
index 0000000000000000000000000000000000000000..8c8c5acb31af69b4f738a13c6548cdd31947d71a
--- /dev/null
+++ b/tensorflow/contrib/cloud/python/ops/gcs_config_ops.py
@@ -0,0 +1,188 @@
+# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""GCS file system configuration for TensorFlow."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import json
+
+from tensorflow.contrib.cloud.python.ops import gen_gcs_config_ops
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import array_ops
+from tensorflow.python.training import training
+
+
+# @tf_export('contrib.cloud.BlockCacheParams')
+class BlockCacheParams(object):
+  """BlockCacheParams is a struct used for configuring the GCS Block Cache."""
+
+  def __init__(self, block_size=None, max_bytes=None, max_staleness=None):
+    self._block_size = block_size or 128 * 1024 * 1024
+    self._max_bytes = max_bytes or 2 * self._block_size
+    self._max_staleness = max_staleness or 0
+
+  @property
+  def block_size(self):
+    return self._block_size
+
+  @property
+  def max_bytes(self):
+    return self._max_bytes
+
+  @property
+  def max_staleness(self):
+    return self._max_staleness
+
+
+# @tf_export('contrib.cloud.ConfigureGcsHook')
+class ConfigureGcsHook(training.SessionRunHook):
+  """ConfigureGcsHook configures GCS when used with Estimator/TPUEstimator.
+
+  Warning: GCS `credentials` may be transmitted over the network unencrypted.
+  Please ensure that the network is trusted before using this function. For
+  users running code entirely within Google Cloud, your data is protected by
+  encryption in between data centers. For more information, please take a look
+  at https://cloud.google.com/security/encryption-in-transit/.
+
+  Example:
+
+  ```
+  sess = tf.Session()
+  refresh_token = raw_input("Refresh token: ")
+  client_secret = raw_input("Client secret: ")
+  client_id = "<REDACTED>"
+  creds = {
+      "client_id": client_id,
+      "refresh_token": refresh_token,
+      "client_secret": client_secret,
+      "type": "authorized_user",
+  }
+  tf.contrib.cloud.configure_gcs(sess, credentials=creds)
+  ```
+
+  """
+
+  def _verify_dictionary(self, creds_dict):
+    if 'refresh_token' in creds_dict or 'private_key' in creds_dict:
+      return True
+    return False
+
+  def __init__(self, credentials=None, block_cache=None):
+    """Constructs a ConfigureGcsHook.
+
+    Args:
+      credentials: A json-formatted string.
+      block_cache: A `BlockCacheParams`
+
+    Raises:
+      ValueError: If credentials is improperly formatted or block_cache is not a
+        BlockCacheParams.
+    """
+    if credentials is not None:
+      if isinstance(credentials, str):
+        try:
+          data = json.loads(credentials)
+        except ValueError as e:
+          raise ValueError('credentials was not a well formed JSON string.', e)
+        if not self._verify_dictionary(data):
+          raise ValueError(
+              'credentials has neither a "refresh_token" nor a "private_key" '
+              'field.')
+      elif isinstance(credentials, dict):
+        if not self._verify_dictionary(credentials):
+          raise ValueError('credentials has neither a "refresh_token" nor a '
+                           '"private_key" field.')
+        credentials = json.dumps(credentials)
+      else:
+        raise ValueError('credentials is of an unknown type')
+
+    self._credentials = credentials
+
+    if block_cache and not isinstance(block_cache, BlockCacheParams):
+      raise ValueError('block_cache must be an instance of BlockCacheParams.')
+    self._block_cache = block_cache
+
+  def begin(self):
+    if self._credentials:
+      self._credentials_placeholder = array_ops.placeholder(dtypes.string)
+      self._credentials_ops = gen_gcs_config_ops.gcs_configure_credentials(
+          self._credentials_placeholder)
+    if self._block_cache:
+      self._block_cache_op = gen_gcs_config_ops.gcs_configure_block_cache(
+          max_cache_size=self._block_cache.max_bytes,
+          block_size=self._block_cache.block_size,
+          max_staleness=self._block_cache.max_staleness)
+
+  def after_create_session(self, session, coord):
+    del coord
+    if self._credentials_op:
+      session.run(
+          self._credentials_op,
+          feed_dict={self._credentials_placeholder: self._credentials})
+    if self._block_cache_op:
+      session.run(self._block_cache_op)
+
+
+def configure_gcs(session, credentials=None, block_cache=None, device=None):
+  """Configures the GCS file system for a given a session.
+
+  Warning: GCS `credentials` may be transmitted over the network unencrypted.
+  Please ensure that the network is trusted before using this function. For
+  users running code entirely within Google Cloud, your data is protected by
+  encryption in between data centers. For more information, please take a look
+  at https://cloud.google.com/security/encryption-in-transit/.
+
+  Args:
+    session: A `tf.Session` session that should be used to configure the GCS
+      file system.
+    credentials: [Optional.] A JSON string
+    block_cache: [Optional.] A BlockCacheParams to configure the block cache .
+    device: [Optional.] The device to place the configure ops.
+  """
+
+  def configure(credentials, block_cache):
+    """Helper function to actually configure GCS."""
+    if credentials:
+      if isinstance(credentials, dict):
+        credentials = json.dumps(credentials)
+      placeholder = array_ops.placeholder(dtypes.string)
+      op = gen_gcs_config_ops.gcs_configure_credentials(placeholder)
+      session.run(op, feed_dict={placeholder: credentials})
+    if block_cache:
+      op = gen_gcs_config_ops.gcs_configure_block_cache(
+          max_cache_size=block_cache.max_bytes,
+          block_size=block_cache.block_size,
+          max_staleness=block_cache.max_staleness)
+      session.run(op)
+
+  if device:
+    with ops.device(device):
+      return configure(credentials, block_cache)
+  return configure(credentials, block_cache)
+
+
+def configure_colab_session(session):
+  """ConfigureColabSession configures the GCS file system in Colab.
+
+  Args:
+    session: A `tf.Session` session.
+  """
+  # Read from the application default credentials (adc).
+  with open('/content/datalab/adc.json') as f:
+    data = json.load(f)
+  configure_gcs(session, credentials=data)
diff --git a/tensorflow/contrib/cluster_resolver/python/training/tpu_cluster_resolver.py b/tensorflow/contrib/cluster_resolver/python/training/tpu_cluster_resolver.py
index 880fca4ea65608472838baee234e468bef37afb3..8f521ffee4d31e090c13bac98290656d6e1d330e 100644
--- a/tensorflow/contrib/cluster_resolver/python/training/tpu_cluster_resolver.py
+++ b/tensorflow/contrib/cluster_resolver/python/training/tpu_cluster_resolver.py
@@ -36,6 +36,7 @@ except ImportError:
 
 
 _GKE_ENV_VARIABLE = 'KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS'
+_ENDPOINTS_SEPARATOR = ','
 _DEFAULT_ENV_VARIABLE = 'TPU_NAME'
 _DISCOVERY_SERVICE_URL_ENV_VARIABLE = 'TPU_API_DISCOVERY_URL'
 
@@ -69,8 +70,8 @@ class TPUClusterResolver(ClusterResolver):
     return _GKE_ENV_VARIABLE in os.environ
 
   @staticmethod
-  def _gkeMaster():
-    return os.environ[_GKE_ENV_VARIABLE].split(',')[0]
+  def _gkeEndpoints():
+    return os.environ[_GKE_ENV_VARIABLE]
 
   @staticmethod
   def _envVarFallback():
@@ -143,7 +144,7 @@ class TPUClusterResolver(ClusterResolver):
     # When using GKE with Cloud TPUs, the env variable will be set.
     if tpu is None:
       if in_gke:
-        tpu = self._gkeMaster()
+        tpu = self._gkeEndpoints()
       else:
         tpu = self._envVarFallback()
 
@@ -170,10 +171,11 @@ class TPUClusterResolver(ClusterResolver):
 
     if service is None and should_resolve:
       if not _GOOGLE_API_CLIENT_INSTALLED:
-        raise ImportError('googleapiclient must be installed before using the '
-                          'TPU cluster resolver. Execute: `pip install '
-                          '--upgrade google-api-python-client` to install with '
-                          'pip.')
+        raise ImportError('googleapiclient and oauth2client must be installed '
+                          'before using the TPU cluster resolver. Execute: '
+                          '`pip install --upgrade google-api-python-client` '
+                          'and `pip install --upgrade oauth2client` to '
+                          'install with pip.')
 
       final_discovery_url = self._discoveryUrl() or discovery_url
       if final_discovery_url:
@@ -213,7 +215,7 @@ class TPUClusterResolver(ClusterResolver):
       ValueError: If none of the TPUs specified exists.
     """
     if not self._shouldResolve():
-      return self._tpu
+      return self._tpu.split(compat.as_bytes(_ENDPOINTS_SEPARATOR))[0]
 
     job_tasks = self.cluster_spec().job_tasks(self._job_name)
     if not job_tasks:
@@ -255,6 +257,10 @@ class TPUClusterResolver(ClusterResolver):
       request = self._service.projects().locations().nodes().get(name=full_name)
       response = request.execute()
 
+      if 'state' in response and response['state'] != 'READY':
+        raise RuntimeError('TPU "%s" is not yet ready; state: "%s"' %
+                           (self._tpu, response['state']))
+
       if 'health' in response and response['health'] != 'HEALTHY':
         raise RuntimeError('TPU "%s" is unhealthy: "%s"' % (self._tpu,
                                                             response['health']))
@@ -275,8 +281,12 @@ class TPUClusterResolver(ClusterResolver):
         # Case 3.
         return None
       # Case 2.
-      cluster_spec = {self._job_name: [self._tpu[len(
-          compat.as_bytes('grpc://')):]]}
+      cluster_spec = {
+          self._job_name: [
+              x[len(compat.as_bytes('grpc://')):]
+              for x in self._tpu.split(compat.as_bytes(_ENDPOINTS_SEPARATOR))
+          ]
+      }
 
     if self._coordinator_address:
       # {1, 2}.a
diff --git a/tensorflow/contrib/cluster_resolver/python/training/tpu_cluster_resolver_test.py b/tensorflow/contrib/cluster_resolver/python/training/tpu_cluster_resolver_test.py
index 5fac55fd027fa2d100621e08a09e05cdb3a1b941..ad4f6432630be44a7de6e778f55f1fb7fd66f307 100644
--- a/tensorflow/contrib/cluster_resolver/python/training/tpu_cluster_resolver_test.py
+++ b/tensorflow/contrib/cluster_resolver/python/training/tpu_cluster_resolver_test.py
@@ -158,6 +158,50 @@ class TPUClusterResolverTest(test.TestCase):
     """
     self._verifyClusterSpecEquality(actual_cluster_spec, expected_proto)
 
+  @mock.patch.object(TPUClusterResolver, '_requestComputeMetadata',
+                     mock_request_compute_metadata)
+  def testUnhealthyCloudTpu(self):
+    tpu_map = {
+        'projects/test-project/locations/us-central1-c/nodes/test-tpu-1': {
+            'ipAddress': '10.1.2.3',
+            'port': '8470',
+            'health': 'UNHEALTHY'
+        }
+    }
+
+    tpu_cluster_resolver = TPUClusterResolver(
+        project=None,
+        zone=None,
+        tpu='test-tpu-1',
+        coordinator_name=None,
+        credentials=None,
+        service=self.mock_service_client(tpu_map=tpu_map))
+
+    with self.assertRaises(RuntimeError):
+      tpu_cluster_resolver.cluster_spec()
+
+  @mock.patch.object(TPUClusterResolver, '_requestComputeMetadata',
+                     mock_request_compute_metadata)
+  def testNotReadyCloudTpu(self):
+    tpu_map = {
+        'projects/test-project/locations/us-central1-c/nodes/test-tpu-1': {
+            'ipAddress': '10.1.2.3',
+            'port': '8470',
+            'state': 'CREATING'
+        }
+    }
+
+    tpu_cluster_resolver = TPUClusterResolver(
+        project=None,
+        zone=None,
+        tpu='test-tpu-1',
+        coordinator_name=None,
+        credentials=None,
+        service=self.mock_service_client(tpu_map=tpu_map))
+
+    with self.assertRaises(RuntimeError):
+      tpu_cluster_resolver.cluster_spec()
+
   def testSimpleSuccessfulRetrieval(self):
     tpu_map = {
         'projects/test-project/locations/us-central1-c/nodes/test-tpu-1': {
@@ -358,13 +402,61 @@ class TPUClusterResolverTest(test.TestCase):
         compat.as_bytes('/bns/foo/bar'), tpu_cluster_resolver.master())
     self.assertEqual(None, tpu_cluster_resolver.cluster_spec())
 
-  def testGkeEnvironment(self):
+  def testGkeEnvironmentForDonut(self):
     os.environ['KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS'] = 'grpc://10.120.27.5:8470'
-    self.assertTrue('KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS' in os.environ)
+
+    self.assertIn('KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS', os.environ)
+    self.assertTrue(TPUClusterResolver._inGke())
+    self.assertEqual(
+        compat.as_bytes('grpc://10.120.27.5:8470'),
+        compat.as_bytes(TPUClusterResolver._gkeEndpoints()))
+
+    tpu_cluster_resolver = TPUClusterResolver()
+    self.assertEqual(
+        compat.as_bytes('grpc://10.120.27.5:8470'),
+        compat.as_bytes(tpu_cluster_resolver.master()))
+    actual_cluster_spec = tpu_cluster_resolver.cluster_spec()
+    expected_proto = """
+    job {
+      name: 'worker'
+      tasks { key: 0 value: '10.120.27.5:8470' }
+    }
+    """
+    self._verifyClusterSpecEquality(actual_cluster_spec, expected_proto)
+
+    del os.environ['KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS']
+
+  def testGkeEnvironmentForPod(self):
+    os.environ['KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS'] = ('grpc://10.120.27.5:8470,'
+                                                     'grpc://10.120.27.6:8470,'
+                                                     'grpc://10.120.27.7:8470,'
+                                                     'grpc://10.120.27.8:8470')
+
+    self.assertIn('KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS', os.environ)
     self.assertTrue(TPUClusterResolver._inGke())
+    self.assertEqual(
+        compat.as_bytes('grpc://10.120.27.5:8470,'
+                        'grpc://10.120.27.6:8470,'
+                        'grpc://10.120.27.7:8470,'
+                        'grpc://10.120.27.8:8470'),
+        compat.as_bytes(TPUClusterResolver._gkeEndpoints()))
+
+    tpu_cluster_resolver = TPUClusterResolver()
     self.assertEqual(
         compat.as_bytes('grpc://10.120.27.5:8470'),
-        compat.as_bytes(TPUClusterResolver._gkeMaster()))
+        compat.as_bytes(tpu_cluster_resolver.master()))
+    actual_cluster_spec = tpu_cluster_resolver.cluster_spec()
+    expected_proto = """
+    job {
+      name: 'worker'
+      tasks { key: 0 value: '10.120.27.5:8470' }
+      tasks { key: 1 value: '10.120.27.6:8470' }
+      tasks { key: 2 value: '10.120.27.7:8470' }
+      tasks { key: 3 value: '10.120.27.8:8470' }
+    }
+    """
+    self._verifyClusterSpecEquality(actual_cluster_spec, expected_proto)
+
     del os.environ['KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS']
 
   def testDiscoveryUrl(self):
diff --git a/tensorflow/contrib/cmake/CMakeLists.txt b/tensorflow/contrib/cmake/CMakeLists.txt
index 0708d6b7b9f0ba549aea091a265f42890e50d223..e524e9e7437b19e0d117fe7b85042e8154773a02 100644
--- a/tensorflow/contrib/cmake/CMakeLists.txt
+++ b/tensorflow/contrib/cmake/CMakeLists.txt
@@ -18,7 +18,16 @@ cmake_policy(SET CMP0022 NEW)
 
 # Options
 option(tensorflow_VERBOSE "Enable for verbose output" OFF)
+
+if(WIN32)
+# BoringSSL is disabled for windows as it currently doesn't build with
+# MSBuild. (Ninja is required.)
 option(tensorflow_ENABLE_SSL_SUPPORT "Enable boringssl support" OFF)
+else()
+# BoringSSL is enabled for gRPC.
+option(tensorflow_ENABLE_SSL_SUPPORT "Enable boringssl support" ON)
+endif()
+
 option(tensorflow_ENABLE_GRPC_SUPPORT "Enable gRPC support" ON)
 option(tensorflow_ENABLE_HDFS_SUPPORT "Enable HDFS support" OFF)
 option(tensorflow_ENABLE_JEMALLOC_SUPPORT "Enable jemalloc support" OFF)
diff --git a/tensorflow/contrib/cmake/external/double_conversion.cmake b/tensorflow/contrib/cmake/external/double_conversion.cmake
index 527ccdc8d887cb4c2e7d2412c99a8bc682568472..5c5adaf5798289fba1c5d0b3f9e0489dc242043e 100644
--- a/tensorflow/contrib/cmake/external/double_conversion.cmake
+++ b/tensorflow/contrib/cmake/external/double_conversion.cmake
@@ -16,15 +16,15 @@ include (ExternalProject)
 
 set(double_conversion_INCLUDE_DIR ${CMAKE_CURRENT_BINARY_DIR}/double_conversion/src/double_conversion)
 set(double_conversion_URL https://github.com/google/double-conversion.git)
-set(double_conversion_TAG 5664746)
+set(double_conversion_TAG 3992066a95b823efc8ccc1baf82a1cfc73f6e9b8)
 set(double_conversion_BUILD ${double_conversion_INCLUDE_DIR})
 set(double_conversion_LIBRARIES ${double_conversion_BUILD}/double-conversion/libdouble-conversion.so)
 set(double_conversion_INCLUDES ${double_conversion_BUILD})
 
 if(WIN32)
-  set(double_conversion_STATIC_LIBRARIES ${double_conversion_BUILD}/double-conversion/$(Configuration)/double-conversion.lib)
+  set(double_conversion_STATIC_LIBRARIES ${double_conversion_BUILD}/$(Configuration)/double-conversion.lib)
 else()
-  set(double_conversion_STATIC_LIBRARIES ${double_conversion_BUILD}/double-conversion/libdouble-conversion.a)
+  set(double_conversion_STATIC_LIBRARIES ${double_conversion_BUILD}/libdouble-conversion.a)
 endif()
 
 set(double_conversion_HEADERS
diff --git a/tensorflow/contrib/cmake/external/grpc.cmake b/tensorflow/contrib/cmake/external/grpc.cmake
index 693dc7cd673233b889b35a3f3170b57581da9a9f..b1e64aa55c80ad59cfdc0f4767c0282b4f73367f 100644
--- a/tensorflow/contrib/cmake/external/grpc.cmake
+++ b/tensorflow/contrib/cmake/external/grpc.cmake
@@ -20,6 +20,10 @@ set(GRPC_BUILD ${CMAKE_CURRENT_BINARY_DIR}/grpc/src/grpc)
 set(GRPC_TAG d184fa229d75d336aedea0041bd59cb93e7e267f)
 
 if(WIN32)
+  # We use unsecure gRPC because boringssl does not build on windows
+  set(grpc_TARGET grpc++_unsecure)
+  set(grpc_DEPENDS protobuf zlib)
+  set(grpc_SSL_PROVIDER NONE)
   if(${CMAKE_GENERATOR} MATCHES "Visual Studio.*")
     set(grpc_STATIC_LIBRARIES
         ${CMAKE_CURRENT_BINARY_DIR}/grpc/src/grpc/Release/grpc++_unsecure.lib
@@ -32,9 +36,12 @@ if(WIN32)
         ${CMAKE_CURRENT_BINARY_DIR}/grpc/src/grpc/gpr.lib)
   endif()
 else()
+  set(grpc_TARGET grpc++)
+  set(grpc_DEPENDS boringssl protobuf zlib)
+  set(grpc_SSL_PROVIDER module)
   set(grpc_STATIC_LIBRARIES
-      ${CMAKE_CURRENT_BINARY_DIR}/grpc/src/grpc/libgrpc++_unsecure.a
-      ${CMAKE_CURRENT_BINARY_DIR}/grpc/src/grpc/libgrpc_unsecure.a
+      ${CMAKE_CURRENT_BINARY_DIR}/grpc/src/grpc/libgrpc++.a
+      ${CMAKE_CURRENT_BINARY_DIR}/grpc/src/grpc/libgrpc.a
       ${CMAKE_CURRENT_BINARY_DIR}/grpc/src/grpc/libaddress_sorting.a
       ${CMAKE_CURRENT_BINARY_DIR}/grpc/src/grpc/third_party/cares/cares/lib/libcares.a
       ${CMAKE_CURRENT_BINARY_DIR}/grpc/src/grpc/libgpr.a)
@@ -44,13 +51,13 @@ add_definitions(-DGRPC_ARES=0)
 
 ExternalProject_Add(grpc
     PREFIX grpc
-    DEPENDS protobuf zlib
+    DEPENDS ${grpc_DEPENDS}
     GIT_REPOSITORY ${GRPC_URL}
     GIT_TAG ${GRPC_TAG}
     DOWNLOAD_DIR "${DOWNLOAD_LOCATION}"
     BUILD_IN_SOURCE 1
     BUILD_BYPRODUCTS ${grpc_STATIC_LIBRARIES}
-    BUILD_COMMAND ${CMAKE_COMMAND} --build . --config Release --target grpc++_unsecure
+    BUILD_COMMAND ${CMAKE_COMMAND} --build . --config Release --target ${grpc_TARGET}
     COMMAND ${CMAKE_COMMAND} --build . --config Release --target grpc_cpp_plugin
     INSTALL_COMMAND ""
     CMAKE_CACHE_ARGS
@@ -59,7 +66,7 @@ ExternalProject_Add(grpc
         -DPROTOBUF_INCLUDE_DIRS:STRING=${PROTOBUF_INCLUDE_DIRS}
         -DPROTOBUF_LIBRARIES:STRING=${protobuf_STATIC_LIBRARIES}
         -DZLIB_ROOT:STRING=${ZLIB_INSTALL}
-	-DgRPC_SSL_PROVIDER:STRING=NONE
+	-DgRPC_SSL_PROVIDER:STRING=${grpc_SSL_PROVIDER}
 )
 
 # grpc/src/core/ext/census/tracing.c depends on the existence of openssl/rand.h.
diff --git a/tensorflow/contrib/cmake/python_modules.txt b/tensorflow/contrib/cmake/python_modules.txt
index fece56c4127de4deebc1404f0eff9747f99ba89f..015cb73bbd93bb77f6748a364b263d99eb305c27 100644
--- a/tensorflow/contrib/cmake/python_modules.txt
+++ b/tensorflow/contrib/cmake/python_modules.txt
@@ -115,6 +115,8 @@ tensorflow/contrib/coder/python/ops
 tensorflow/contrib/compiler
 tensorflow/contrib/constrained_optimization
 tensorflow/contrib/constrained_optimization/python
+tensorflow/contrib/control_flow
+tensorflow/contrib/control_flow/python
 tensorflow/contrib/copy_graph
 tensorflow/contrib/copy_graph/python
 tensorflow/contrib/copy_graph/python/util
diff --git a/tensorflow/contrib/cmake/tf_c.cmake b/tensorflow/contrib/cmake/tf_c.cmake
index a06bdf78fb011b288d5d7af6488ec6802ff34c35..2e0a2fcef4cbdc50f0521296c4a25a864dbd8b77 100644
--- a/tensorflow/contrib/cmake/tf_c.cmake
+++ b/tensorflow/contrib/cmake/tf_c.cmake
@@ -21,6 +21,7 @@ set(tf_c_srcs
     "${tensorflow_source_dir}/tensorflow/c/c_api_function.cc"
     "${tensorflow_source_dir}/tensorflow/c/eager/c_api.cc"
     "${tensorflow_source_dir}/tensorflow/c/eager/c_api.h"
+    "${tensorflow_source_dir}/tensorflow/c/eager/c_api_debug.cc"
     "${tensorflow_source_dir}/tensorflow/c/eager/tape.h"
     "${tensorflow_source_dir}/tensorflow/c/checkpoint_reader.cc"
     "${tensorflow_source_dir}/tensorflow/c/checkpoint_reader.h"
diff --git a/tensorflow/contrib/cmake/tf_core_framework.cmake b/tensorflow/contrib/cmake/tf_core_framework.cmake
index b47c32f1c48b3d42fe5b4ba115cc2a511b7ee5f4..dac84ccb0dbf4848329e35a6e9bcf6213d8c0e55 100644
--- a/tensorflow/contrib/cmake/tf_core_framework.cmake
+++ b/tensorflow/contrib/cmake/tf_core_framework.cmake
@@ -213,10 +213,6 @@ else()
   list(REMOVE_ITEM tf_core_platform_srcs ${tf_core_platform_srcs_exclude})
 endif()
 
-file(GLOB tf_core_platform_exclude_srcs
-  "${tensorflow_source_dir}/tensorflow/core/platform/variant_coding.cc")
-list(REMOVE_ITEM tf_core_platform_srcs ${tf_core_platform_exclude_srcs})
-
 list(APPEND tf_core_lib_srcs ${tf_core_platform_srcs})
 
 if(UNIX)
@@ -286,8 +282,6 @@ set(tf_version_srcs ${tensorflow_source_dir}/tensorflow/core/util/version_info.c
 file(GLOB_RECURSE tf_core_framework_srcs
     "${tensorflow_source_dir}/tensorflow/core/framework/*.h"
     "${tensorflow_source_dir}/tensorflow/core/framework/*.cc"
-    "${tensorflow_source_dir}/tensorflow/core/platform/variant_coding.h"
-    "${tensorflow_source_dir}/tensorflow/core/platform/variant_coding.cc"
     "${tensorflow_source_dir}/tensorflow/core/graph/edgeset.h"
     "${tensorflow_source_dir}/tensorflow/core/graph/edgeset.cc"
     "${tensorflow_source_dir}/tensorflow/core/graph/graph.h"
diff --git a/tensorflow/contrib/cmake/tf_core_ops.cmake b/tensorflow/contrib/cmake/tf_core_ops.cmake
index e558691de4b74988031f7b2204aad92e8c7af68b..bc753333dba4f67eee0114c4022743dd59a05982 100644
--- a/tensorflow/contrib/cmake/tf_core_ops.cmake
+++ b/tensorflow/contrib/cmake/tf_core_ops.cmake
@@ -113,6 +113,7 @@ GENERATE_CONTRIB_OP_LIBRARY(tensor_forest_stats "${tensorflow_source_dir}/tensor
 GENERATE_CONTRIB_OP_LIBRARY(text_skip_gram "${tensorflow_source_dir}/tensorflow/contrib/text/ops/skip_gram_ops.cc")
 GENERATE_CONTRIB_OP_LIBRARY(tpu "${tpu_ops_srcs}")
 GENERATE_CONTRIB_OP_LIBRARY(bigquery_reader "${tensorflow_source_dir}/tensorflow/contrib/cloud/ops/bigquery_reader_ops.cc")
+GENERATE_CONTRIB_OP_LIBRARY(gcs_config "${tensorflow_source_dir}/tensorflow/contrib/cloud/ops/gcs_config_ops.cc")
 GENERATE_CONTRIB_OP_LIBRARY(reduce_slice_ops "${tensorflow_source_dir}/tensorflow/contrib/reduce_slice_ops/ops/reduce_slice_ops.cc")
 
 ########################################################
diff --git a/tensorflow/contrib/cmake/tf_python.cmake b/tensorflow/contrib/cmake/tf_python.cmake
index 894b1ead7688bd951c5c78f613fdb7aae226fe65..92446044892127284ecb8753a250b77cb2a5743a 100755
--- a/tensorflow/contrib/cmake/tf_python.cmake
+++ b/tensorflow/contrib/cmake/tf_python.cmake
@@ -420,6 +420,8 @@ GENERATE_PYTHON_OP_LIB("contrib_text_skip_gram_ops"
   DESTINATION ${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow/contrib/text/python/ops/gen_skip_gram_ops.py)
 GENERATE_PYTHON_OP_LIB("contrib_bigquery_reader_ops"
   DESTINATION ${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow/contrib/cloud/python/ops/gen_bigquery_reader_ops.py)
+GENERATE_PYTHON_OP_LIB("contrib_gcs_config_ops"
+  DESTINATION ${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow/contrib/cloud/python/ops/gen_gcs_config_ops.py)
 GENERATE_PYTHON_OP_LIB("stateless_random_ops"
   DESTINATION ${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow/contrib/stateless/gen_stateless_random_ops.py)
 GENERATE_PYTHON_OP_LIB("debug_ops"
@@ -723,7 +725,7 @@ endif()
 ########################################################
 
 # Parse tensorflow/tools/api/generator/BUILD to get list of generated files.
-FILE(READ ${tensorflow_source_dir}/tensorflow/tools/api/generator/BUILD api_generator_BUILD_text)
+FILE(READ ${tensorflow_source_dir}/tensorflow/tools/api/generator/api_gen.bzl api_generator_BUILD_text)
 STRING(REGEX MATCH "# BEGIN GENERATED FILES.*# END GENERATED FILES" api_init_files_text ${api_generator_BUILD_text})
 string(REPLACE "# BEGIN GENERATED FILES" "" api_init_files_text ${api_init_files_text})
 string(REPLACE "# END GENERATED FILES" "" api_init_files_text ${api_init_files_text})
@@ -734,7 +736,7 @@ foreach(api_init_file ${api_init_files_list})
     string(STRIP "${api_init_file}" api_init_file)
     if(api_init_file)
         string(REPLACE "\"" "" api_init_file "${api_init_file}")  # Remove quotes
-        list(APPEND api_init_files "${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow/tools/api/generator/${api_init_file}")
+        list(APPEND api_init_files "${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow/${api_init_file}")
     endif()
 endforeach(api_init_file)
 set(api_init_list_file "${tensorflow_source_dir}/api_init_files_list.txt")
@@ -747,18 +749,16 @@ add_custom_command(
 
       # tensorflow/__init__.py depends on files generated in this step. So, remove it while
       # this step is running since the files aren't there yet.
-      COMMAND ${CMAKE_COMMAND} -E rename ${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow/__init__.py
-                                         ${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow/final.__init__.py
-      COMMAND ${CMAKE_COMMAND} -E touch ${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow/__init__.py
+      COMMAND ${CMAKE_COMMAND} -E remove -f ${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow/__init__.py
 
       # Run create_python_api.py to generate API init files.
       COMMAND ${CMAKE_COMMAND} -E env PYTHONPATH=${CMAKE_CURRENT_BINARY_DIR}/tf_python ${PYTHON_EXECUTABLE}
-              "${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow/tools/api/generator/create_python_api.py" "${api_init_list_file}"
-
-      # Re-add tensorflow/__init__.py back.
-      COMMAND ${CMAKE_COMMAND} -E remove -f ${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow/__init__.py
-      COMMAND ${CMAKE_COMMAND} -E rename ${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow/final.__init__.py
-                                         ${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow/__init__.py
+              "${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow/tools/api/generator/create_python_api.py"
+              "--root_init_template=${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow/api_template.__init__.py"
+              "--apidir=${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow"
+              "--package=tensorflow.python"
+              "--apiname=tensorflow"
+              "${api_init_list_file}"
 
       COMMENT "Generating __init__.py files for Python API."
       WORKING_DIRECTORY "${CMAKE_CURRENT_BINARY_DIR}/tf_python"
@@ -767,7 +767,49 @@ add_custom_command(
 add_custom_target(tf_python_api SOURCES ${api_init_files})
 add_dependencies(tf_python_api tf_python_ops)
 
+# TODO(mikecase): This can be removed once tf.estimator is moved
+# out of TensorFlow.
+########################################################
+# Generate API __init__.py files for tf.estimator.
+########################################################
+
+# Parse tensorflow/tools/api/generator/BUILD to get list of generated files.
+FILE(READ ${tensorflow_source_dir}/tensorflow/tools/api/generator/api_gen.bzl api_generator_BUILD_text)
+STRING(REGEX MATCH "# BEGIN GENERATED ESTIMATOR FILES.*# END GENERATED ESTIMATOR FILES" api_init_files_text ${api_generator_BUILD_text})
+string(REPLACE "# BEGIN GENERATED ESTIMATOR FILES" "" api_init_files_text ${api_init_files_text})
+string(REPLACE "# END GENERATED ESTIMATOR FILES" "" api_init_files_text ${api_init_files_text})
+string(REPLACE "," ";" api_init_files_list ${api_init_files_text})
+
+set(api_init_files "")
+foreach(api_init_file ${api_init_files_list})
+    string(STRIP "${api_init_file}" api_init_file)
+    if(api_init_file)
+        string(REPLACE "\"" "" api_init_file "${api_init_file}")  # Remove quotes
+        list(APPEND api_init_files "${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow/python/estimator/api/${api_init_file}")
+    endif()
+endforeach(api_init_file)
+set(estimator_api_init_list_file "${tensorflow_source_dir}/estimator_api_init_files_list.txt")
+file(WRITE "${estimator_api_init_list_file}" "${api_init_files}")
+
+# Run create_python_api.py to generate __init__.py files.
+add_custom_command(
+      OUTPUT ${api_init_files}
+      DEPENDS tf_python_ops tf_python_copy_scripts_to_destination pywrap_tensorflow_internal tf_python_touchup_modules tf_extension_ops
+
+      # Run create_python_api.py to generate API init files.
+      COMMAND ${CMAKE_COMMAND} -E env PYTHONPATH=${CMAKE_CURRENT_BINARY_DIR}/tf_python ${PYTHON_EXECUTABLE}
+              "${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow/tools/api/generator/create_python_api.py"
+              "--apidir=${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow/python/estimator/api"
+              "--package=tensorflow.python.estimator"
+              "--apiname=estimator"
+              "${estimator_api_init_list_file}"
+
+      COMMENT "Generating __init__.py files for Python API."
+      WORKING_DIRECTORY "${CMAKE_CURRENT_BINARY_DIR}/tf_python"
+)
 
+add_custom_target(estimator_python_api SOURCES ${api_init_files})
+add_dependencies(estimator_python_api tf_python_ops)
 ############################################################
 # Build a PIP package containing the TensorFlow runtime.
 ############################################################
@@ -778,6 +820,7 @@ add_dependencies(tf_python_build_pip_package
     tf_python_touchup_modules
     tf_python_ops
     tf_python_api
+    estimator_python_api
     tf_extension_ops)
 
 # Fix-up Python files that were not included by the add_python_module() macros.
diff --git a/tensorflow/contrib/cmake/tf_tests.cmake b/tensorflow/contrib/cmake/tf_tests.cmake
index 5942ff3363a96de70df7e13d0857e4ad82e35fee..eb9482dc25f2be8ce46cc38bf3dd28889b09a9d4 100644
--- a/tensorflow/contrib/cmake/tf_tests.cmake
+++ b/tensorflow/contrib/cmake/tf_tests.cmake
@@ -212,6 +212,10 @@ if (tensorflow_BUILD_PYTHON_TESTS)
     "${tensorflow_source_dir}/tensorflow/contrib/factorization/python/ops/gmm_test.py"
     # Disable following manual tag in BUILD.
     "${tensorflow_source_dir}/tensorflow/python/keras/_impl/keras/layers/convolutional_test.py"
+    # These tests depend on a .so file
+    ${tensorflow_source_dir}/tensorflow/python/kernel_tests/duplicate_op_test.py
+    ${tensorflow_source_dir}/tensorflow/python/kernel_tests/invalid_op_test.py
+    ${tensorflow_source_dir}/tensorflow/python/kernel_tests/ackermann_test.py
 
   )
   if (WIN32)
diff --git a/tensorflow/contrib/control_flow/BUILD b/tensorflow/contrib/control_flow/BUILD
new file mode 100644
index 0000000000000000000000000000000000000000..e8036d63aeeac224b226899c036416a06b4ffe65
--- /dev/null
+++ b/tensorflow/contrib/control_flow/BUILD
@@ -0,0 +1,53 @@
+# New implementations of control flow ops
+
+licenses(["notice"])  # Apache 2.0
+
+package(default_visibility = ["//visibility:public"])
+
+load("//tensorflow:tensorflow.bzl", "tf_py_test")
+
+py_library(
+    name = "control_flow",
+    srcs = ["__init__.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":cond_v2",
+    ],
+)
+
+py_library(
+    name = "cond_v2",
+    srcs = ["python/cond_v2.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        "//tensorflow/core:protos_all_py",
+        "//tensorflow/python:array_ops",
+        "//tensorflow/python:c_api_util",
+        "//tensorflow/python:framework_ops",
+        "//tensorflow/python:function",
+        "//tensorflow/python:function_def_to_graph",
+        "//tensorflow/python:functional_ops_gen",
+        "//tensorflow/python:gradients",
+        "//tensorflow/python:pywrap_tensorflow",
+        "//tensorflow/python:util",
+    ],
+)
+
+tf_py_test(
+    name = "cond_v2_test",
+    size = "small",
+    srcs = ["python/cond_v2_test.py"],
+    additional_deps = [
+        ":cond_v2",
+        "//tensorflow/python:array_ops",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:constant_op",
+        "//tensorflow/python:control_flow_ops",
+        "//tensorflow/python:dtypes",
+        "//tensorflow/python:framework",
+        "//tensorflow/python:framework_ops",
+        "//tensorflow/python:gradients",
+        "//tensorflow/python:training",
+    ],
+    grpc_enabled = True,
+)
diff --git a/tensorflow/contrib/control_flow/__init__.py b/tensorflow/contrib/control_flow/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..582af2cf10a3d92dd8611b0f2826625e3acfb099
--- /dev/null
+++ b/tensorflow/contrib/control_flow/__init__.py
@@ -0,0 +1,31 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+
+"""New implementations of TF control flow ops.
+
+@@cond_v2
+"""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+# pylint: disable=unused-import
+from tensorflow.contrib.control_flow.python.cond_v2 import cond_v2
+# pylint: enable=unused-import
+
+from tensorflow.python.util.all_util import remove_undocumented
+
+remove_undocumented(__name__)
diff --git a/tensorflow/contrib/control_flow/python/cond_v2.py b/tensorflow/contrib/control_flow/python/cond_v2.py
new file mode 100644
index 0000000000000000000000000000000000000000..9ffad9caa92d2d3be8f598758a443b0eceb8d4d8
--- /dev/null
+++ b/tensorflow/contrib/control_flow/python/cond_v2.py
@@ -0,0 +1,429 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# =============================================================================
+"""cond_v2 and gradient.
+
+This is a version of cond that emits a single If op, as well as the gradient
+function for If ops produced by cond_v2. This will eventually replace the
+current tf.cond implementation once it reaches feature and performance parity.
+"""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.python import pywrap_tensorflow as c_api
+from tensorflow.python.framework import c_api_util
+from tensorflow.python.framework import function
+from tensorflow.python.framework import function_def_to_graph
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import gen_functional_ops
+from tensorflow.python.ops import gradients_impl
+from tensorflow.python.util import compat
+
+
+# NOTE(skyewm): TensorFlow uses protected class methods and fields to signify
+# that they aren't part of the official public API. These protected members
+# often need to be used by implementation code however. Rather than litter the
+# code with pylint comments, we ignore protected access violations for
+# readability.
+# pylint: disable=protected-access
+
+
+def cond_v2(pred, true_fn, false_fn, name="cond"):
+  """Like tf.cond, except emits a single If op."""
+  with ops.name_scope(name) as scope:
+    true_graph = function.func_graph_from_py_func(true_fn, [], [],
+                                                  name="%s_true" % scope)
+    false_graph = function.func_graph_from_py_func(false_fn, [], [],
+                                                   name="%s_false" % scope)
+    _check_same_outputs(true_graph, false_graph)
+
+    # Add inputs to true_graph and false_graph to make them match. Note that
+    # this modifies true_graph and false_graph.
+    cond_inputs = _make_inputs_match(true_graph, false_graph,
+                                     true_graph.extra_inputs,
+                                     false_graph.extra_inputs)
+
+    # Add all intermediate tensors as function outputs so they're available for
+    # the gradient computation.
+
+    true_intermediates = _get_intermediates(true_graph)
+    false_intermediates = _get_intermediates(false_graph)
+
+    # Save the original number of outputs to return to the caller.
+    num_cond_outputs = len(true_graph.outputs)
+
+    # Make the number/type of new intermediate outputs match.
+    extra_true_outputs, extra_false_outputs = _pad_params(
+        true_graph, false_graph, true_intermediates, false_intermediates)
+
+    true_graph.outputs.extend(extra_true_outputs)
+    false_graph.outputs.extend(extra_false_outputs)
+
+    # Create the If op.
+    tensors = gen_functional_ops._if(
+        pred, cond_inputs, [t.dtype for t in true_graph.outputs],
+        _create_new_tf_function(true_graph),
+        _create_new_tf_function(false_graph),
+        name=scope)
+
+    return tensors[:num_cond_outputs]
+
+
+@ops.RegisterGradient("If")
+def _IfGrad(op, *grads):  # pylint: disable=invalid-name
+  """The gradient of an If op produced by cond_v2."""
+  true_graph, false_graph = _get_func_graphs(op)
+
+  # Create grad functions that compute the gradient of the true/false forward
+  # graphs. These functions will capture tensors from the forward pass
+  # functions.
+  true_grad_graph = _create_grad_func(
+      true_graph, grads, _get_grad_fn_name(true_graph))
+  false_grad_graph = _create_grad_func(
+      false_graph, grads, _get_grad_fn_name(false_graph))
+
+  assert ([t.dtype for t in true_grad_graph.outputs] ==
+          [t.dtype for t in false_grad_graph.outputs])
+
+  # Match up the captured grad function inputs with outputs of 'op' and other
+  # external tensors.
+  true_grad_inputs = _get_grad_inputs(op, true_graph, true_grad_graph)
+  false_grad_inputs = _get_grad_inputs(op, false_graph, false_grad_graph)
+
+  # Make the inputs to true_grad_graph and false_grad_graph match. Note that
+  # this modifies true_grad_graph and false_grad_graph.
+  grad_inputs = _make_inputs_match(true_grad_graph, false_grad_graph,
+                                   true_grad_inputs, false_grad_inputs)
+
+  # Add all intermediate tensors as function outputs so they're available for
+  # higher-order gradient computations.
+
+  true_grad_intermediates = _get_intermediates(true_grad_graph)
+  false_grad_intermediates = _get_intermediates(false_grad_graph)
+
+  # Save the original number of gradient outputs to return.
+  num_grad_outputs = len(true_grad_graph.outputs)
+
+  # Make the number/type of new intermediate outputs match.
+  extra_true_grad_outputs, extra_false_grad_outputs = _pad_params(
+      true_grad_graph, false_grad_graph,
+      true_grad_intermediates, false_grad_intermediates)
+
+  true_grad_graph.outputs.extend(extra_true_grad_outputs)
+  false_grad_graph.outputs.extend(extra_false_grad_outputs)
+
+  # Create the gradient If op.
+  tensors = gen_functional_ops._if(
+      op.inputs[0], grad_inputs, [t.dtype for t in true_grad_graph.outputs],
+      _create_new_tf_function(true_grad_graph),
+      _create_new_tf_function(false_grad_graph))
+
+  # The predicate has no gradient.
+  return [None] + tensors[:num_grad_outputs]
+
+
+def _get_func_graphs(if_op):
+  """Returns `_FuncGraph`s for the input op branches.
+
+  Args:
+    if_op: The _If Operation.
+
+  Returns:
+    A 2-tuple of the `_FuncGraph`s of the then_branch and else_branch.
+  """
+  def _get_func_graph_for_branch(branch_name):
+    extra_inputs = if_op.inputs[1:]  # First input is pred.
+    input_shapes = [t.shape for t in extra_inputs]
+    func_name = if_op.get_attr(branch_name).name
+    fdef = if_op.graph._get_function(func_name).definition
+    func_graph = function_def_to_graph.function_def_to_graph(fdef, input_shapes)
+    func_graph.extra_inputs = extra_inputs
+    func_graph.extra_args = func_graph.inputs
+    func_graph._captured = dict(zip(extra_inputs, func_graph.inputs))
+    return func_graph
+
+  return (_get_func_graph_for_branch("then_branch"),
+          _get_func_graph_for_branch("else_branch"))
+
+
+def _grad_fn(func_graph, grads):
+  """The gradient function for each conditional branch.
+
+  This function builds the gradient graph of the corresponding forward-pass
+  conditional branch in `func_graph`. This is done by differentiating
+  func_graph's outputs w.r.t. its inputs.
+
+  Args:
+    func_graph: function._FuncGraph. The corresponding forward-pass function.
+    grads: The list of input gradient Tensors.
+
+  Returns:
+    The output gradient Tensors.
+  """
+  # Filter out untrainable function outputs.
+  # NOTE(skyewm): If we don't do this, the untrainable tensors can sometimes
+  # cause _GradientsHelper to raise an exception (e.g. the implementation
+  # doesn't expect 'ys' to contain boolean tensors).
+  assert len(func_graph.outputs) == len(grads)
+  ys = []
+  grad_ys = []
+  for y, grad_y in zip(func_graph.outputs, grads):
+    if not gradients_impl._IsTrainable(y):
+      continue
+    ys.append(y)
+    grad_ys.append(grad_y)
+
+  # Build the gradient graph. Note that this builds the gradient computation of
+  # func_graph in the current graph, which requires capturing tensors from
+  # func_graph. The captured func_graph tensors are resolved to external tensors
+  # in _get_grad_inputs.
+  result = gradients_impl._GradientsHelper(
+      ys, func_graph.inputs, grad_ys=grad_ys,
+      src_graph=func_graph)
+
+  # Functions can't return None; replace Nones with zero tensors.
+  # TODO(b/80444525): don't return anything here and make _IfGrad return None if
+  # both branches have zero gradient.
+  for i in range(len(result)):
+    if result[i] is None:
+      result[i] = array_ops.zeros_like(func_graph.inputs[i])
+
+  return result
+
+
+def _create_grad_func(func_graph, grads, name):
+  """Returns the _FuncGraph representation of _grad_fn."""
+  return function.func_graph_from_py_func(lambda: _grad_fn(func_graph, grads),
+                                          [], [], name)
+
+
+def _get_grad_inputs(if_op, cond_graph, grad_graph):
+  """Returns the tensors we should pass to grad_graph.
+
+  This method handles tensors captured from cond_graph in grad_graph. It
+  converts these to suitable input tensors from the outer graph.
+
+  Args:
+    if_op: Operation. The forward-pass If op that uses cond_graph.
+    cond_graph: function._FuncGraph. The forward-pass function.
+    grad_graph: function._FuncGraph. The gradients function.
+
+  Returns:
+    A list of inputs tensors to be passed to grad_graph.
+  """
+  inputs = []
+
+  # Maps placeholders in cond_graph -> input tensor in outer graph.
+  forward_input_map = {v: k for k, v in cond_graph._captured.items()}
+
+  for t in grad_graph.extra_inputs:
+    if t.graph == ops.get_default_graph():
+      # t is in the outer graph (e.g. one of the input gradients).
+      inputs.append(t)
+    elif t in forward_input_map:
+      # t is an input placeholder in cond_graph. Get the corresponding input
+      # tensor in the outer graph.
+      assert t.graph == cond_graph
+      assert forward_input_map[t].graph == ops.get_default_graph()
+      inputs.append(forward_input_map[t])
+    else:
+      # t is an intermediate value in cond_graph. Get the corresponding output
+      # of 'if_op' (note that all intermediate values are outputs).
+      assert t.graph == cond_graph
+      output_idx = cond_graph.outputs.index(t)
+      inputs.append(if_op.outputs[output_idx])
+
+  return inputs
+
+
+def _create_new_tf_function(func_graph):
+  """Converts func_graph to a TF_Function and adds it to the current graph.
+
+  Args:
+    func_graph: function._FuncGraph
+
+  Returns:
+    The name of the new TF_Function.
+  """
+  c_func = c_api.TF_GraphToFunction_wrapper(
+      func_graph._c_graph,
+      compat.as_str(func_graph.name),
+      False,  # append_hash_to_fn_name
+      None,  # opers
+      [t._as_tf_output() for t in func_graph.inputs],
+      [t._as_tf_output() for t in func_graph.outputs],
+      [],
+      None,  # opts
+      None)  # description
+  _ = c_api_util.ScopedTFFunction(c_func)
+
+  # TODO(b/109833212): this sucks, we're serializing the TF_Function*,
+  # deserializing it into a Python FunctionDef, then reserializing it to create
+  # a new TF_Function that we add to the graph.
+  fdef = function.function_def_from_tf_function(c_func)
+  defined_func = function._from_definition(fdef)
+  defined_func.add_to_graph(ops.get_default_graph())
+
+  return func_graph.name
+
+
+def _get_intermediates(func_graph):
+  """Returns all tensors in `func_graph` that aren't inputs or outputs."""
+  intermediates = []
+  for op in func_graph.get_operations():
+    for t in op.outputs:
+      if t in func_graph.inputs: continue
+      if t in func_graph.outputs: continue
+      intermediates.append(t)
+  return intermediates
+
+
+def _separate_unique_inputs(true_inputs, false_inputs):
+  """Separates tensors appearing only in true_inputs or false_inputs, or both.
+
+  Args:
+    true_inputs: list of Tensors
+    false_inputs: list of Tensors
+
+  Returns:
+    Three lists of Tensors:
+      1. The tensors that appear in both true_inputs and false_inputs
+      2. The tensors that only appear in true_inputs
+      3. The tensors that only appear in false_inputs
+  """
+  true_inputs = set(true_inputs)
+  false_inputs = set(false_inputs)
+
+  shared_inputs = true_inputs.intersection(false_inputs)
+  true_only_inputs = true_inputs - false_inputs
+  false_only_inputs = false_inputs - true_inputs
+
+  return list(shared_inputs), list(true_only_inputs), list(false_only_inputs)
+
+
+def _pad_params(true_graph, false_graph, true_params, false_params):
+  """Returns new param lists that have matching signatures.
+
+  This is done by mirroring each param list in the other using dummy params.
+  There is no merging of params.
+
+  Args:
+    true_graph: function._FuncGraph
+    false_graph: function._FuncGraph
+    true_params: a list of Tensors from true_graph
+    false_params: a list of Tensors from false_graph
+
+  Returns:
+    A new list of Tensors in true_graph and a new list of Tensors in
+    false_graph. The two lists have the same number of Tensors, with matching
+    types and shapes across the lists.
+  """
+  new_true_params = (true_params +
+                     _create_dummy_params(true_graph, false_params))
+  new_false_inputs = (_create_dummy_params(false_graph, true_params)
+                      + false_params)
+  return new_true_params, new_false_inputs
+
+
+def _make_inputs_match(true_graph, false_graph, true_inputs, false_inputs):
+  """Modifies true_graph and false_graph so they have the same input signature.
+
+  This method reorders and/or adds parameters to true_graph and false_graph so
+  they have the same input signature, and updates the 'inputs', 'extra_inputs',
+  and '_captured' fields of both graphs accordingly. It uses the input tensors
+  from the outer graph to avoid duplicating shared arguments.
+
+  Args:
+    true_graph: function._FuncGraph
+    false_graph: function._FuncGraph
+    true_inputs: a list of Tensors in the outer graph. The inputs for
+      true_graph.
+    false_inputs: a list of Tensors in the outer graph. The inputs for
+      false_graph.
+
+  Returns:
+    A new list of Tensors from the outer graph that are the new inputs for both
+    true_graph and false_graph. This is a deduped version of true_inputs +
+    false_inputs.
+  """
+  shared_inputs, true_only_inputs, false_only_inputs = _separate_unique_inputs(
+      true_inputs, false_inputs)
+
+  new_inputs = shared_inputs + true_only_inputs + false_only_inputs
+
+  true_input_to_param = dict(zip(true_inputs, true_graph.inputs))
+  false_input_to_param = dict(zip(false_inputs, false_graph.inputs))
+
+  true_graph.inputs = (
+      [true_input_to_param[t] for t in shared_inputs] +
+      [true_input_to_param[t] for t in true_only_inputs] +
+      _create_dummy_params(true_graph, false_only_inputs))
+
+  false_graph.inputs = (
+      [false_input_to_param[t] for t in shared_inputs] +
+      _create_dummy_params(false_graph, true_only_inputs) +
+      [false_input_to_param[t] for t in false_only_inputs])
+
+  # Rewrite the _FuncGraphs' state to reflect the new inputs.
+  true_graph.extra_inputs = new_inputs
+  false_graph.extra_inputs = new_inputs
+
+  true_graph._captured = dict(zip(new_inputs, true_graph.inputs))
+  false_graph._captured = dict(zip(new_inputs, false_graph.inputs))
+
+  return new_inputs
+
+
+def _create_dummy_params(func_graph, template_tensors):
+  """Creates tensors in func_graph to represent template_tensors.
+
+  Args:
+    func_graph: function._FuncGraph.
+    template_tensors: a list of tensors in the outer graph.
+
+  Returns:
+    A list of tensors in func_graph.
+  """
+  with func_graph.as_default():
+    return [gen_functional_ops.fake_param(dtype=t.dtype, shape=t.shape)
+            for t in template_tensors]
+
+
+def _get_grad_fn_name(func_graph):
+  """Returns a unique name to use for the grad function of `func_graph`."""
+  name = "%s_grad" % func_graph.name
+
+  base_name = name
+  counter = 1
+  if ops.get_default_graph()._is_function(name):
+    name = "%s_%s" % (base_name, counter)
+    counter += 1
+
+  return name
+
+
+def _check_same_outputs(true_graph, false_graph):
+  """Raises an error if true_graph and false_graph have different outputs."""
+  true_output_types = [t.dtype for t in true_graph.outputs]
+  false_output_types = [t.dtype for t in false_graph.outputs]
+  if (len(true_graph.outputs) != len(false_graph.outputs) or
+      true_output_types != false_output_types):
+    raise ValueError(
+        "true_fn() and false_fn() must return the same number and type of "
+        "arguments, got:\n"
+        "  true_fn: %s\n"
+        "  false_fn: %s" % (true_output_types, false_output_types))
diff --git a/tensorflow/contrib/control_flow/python/cond_v2_test.py b/tensorflow/contrib/control_flow/python/cond_v2_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..338601aa2c5ee8ffc97aa2e07ff05a2d17531936
--- /dev/null
+++ b/tensorflow/contrib/control_flow/python/cond_v2_test.py
@@ -0,0 +1,171 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+
+"""Tests for cond_v2."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.contrib.control_flow.python import cond_v2
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import control_flow_ops
+from tensorflow.python.ops import gradients_impl
+from tensorflow.python.ops import math_ops
+from tensorflow.python.platform import test
+from tensorflow.python.training import saver
+
+
+class NewCondTest(test.TestCase):
+
+  def _testCond(self, true_fn, false_fn, train_vals):
+    pred = array_ops.placeholder(dtypes.bool, name="pred")
+
+    expected = control_flow_ops.cond(pred, true_fn, false_fn, name="expected")
+    actual = cond_v2.cond_v2(pred, true_fn, false_fn, name="actual")
+
+    expected_grad = gradients_impl.gradients(expected, train_vals)
+    actual_grad = gradients_impl.gradients(actual, train_vals)
+
+    with self.test_session() as sess:
+      expected_val, actual_val, expected_grad_val, actual_grad_val = sess.run(
+          (expected, actual, expected_grad, actual_grad), {pred: True})
+      self.assertEqual(expected_val, actual_val)
+      self.assertEqual(expected_grad_val, actual_grad_val)
+
+      expected_val, actual_val, expected_grad_val, actual_grad_val = sess.run(
+          (expected, actual, expected_grad, actual_grad), {pred: False})
+      self.assertEqual(expected_val, actual_val)
+      self.assertEqual(expected_grad_val, actual_grad_val)
+
+  def testBasic(self):
+    x = constant_op.constant(1.0, name="x")
+    y = constant_op.constant(2.0, name="y")
+
+    def true_fn():
+      return x * 2.0
+
+    def false_fn():
+      return y * 3.0
+
+    self._testCond(true_fn, false_fn, [x])
+    self._testCond(true_fn, false_fn, [x, y])
+    self._testCond(true_fn, false_fn, [y])
+
+  def testBasic2(self):
+    x = constant_op.constant(1.0, name="x")
+    y = constant_op.constant(2.0, name="y")
+
+    def true_fn():
+      return x * y * 2.0
+
+    def false_fn():
+      return 2.0
+
+    self._testCond(true_fn, false_fn, [x])
+    self._testCond(true_fn, false_fn, [x, y])
+    self._testCond(true_fn, false_fn, [y])
+
+  def testNoInputs(self):
+    pred = array_ops.placeholder(dtypes.bool, name="pred")
+
+    def true_fn():
+      return constant_op.constant(1.0)
+
+    def false_fn():
+      return constant_op.constant(2.0)
+
+    out = cond_v2.cond_v2(pred, true_fn, false_fn)
+
+    with self.test_session() as sess:
+      self.assertEqual(sess.run(out, {pred: True}), [1.0])
+      self.assertEqual(sess.run(out, {pred: False}), [2.0])
+
+  def testSecondDerivative(self):
+    pred = array_ops.placeholder(dtypes.bool, name="pred")
+    x = constant_op.constant(3.0, name="x")
+
+    def true_fn():
+      return math_ops.pow(x, 3)
+
+    def false_fn():
+      return x
+
+    cond = cond_v2.cond_v2(pred, true_fn, false_fn, name="cond")
+    cond_grad = gradients_impl.gradients(cond, [x])
+    cond_grad_grad = gradients_impl.gradients(cond_grad, [x])
+
+    with self.test_session() as sess:
+      # d[x^3]/dx = 3x^2
+      true_val = sess.run(cond_grad, {pred: True})
+      self.assertEqual(true_val, [27.0])
+      # d[x]/dx = 1
+      false_val = sess.run(cond_grad, {pred: False})
+      self.assertEqual(false_val, [1.0])
+
+      true_val = sess.run(cond_grad_grad, {pred: True})
+      # d2[x^3]/dx2 = 6x
+      self.assertEqual(true_val, [18.0])
+      false_val = sess.run(cond_grad_grad, {pred: False})
+      # d2[x]/dx2 = 0
+      self.assertEqual(false_val, [0.0])
+
+  def testGradientOfDeserializedCond(self):
+    with ops.Graph().as_default():
+      pred = array_ops.placeholder(dtypes.bool, name="pred")
+      x = constant_op.constant(3.0, name="x")
+      ops.add_to_collection("x", x)
+
+      def true_fn():
+        return math_ops.pow(x, 3)
+
+      def false_fn():
+        return x
+
+      ops.add_to_collection("pred", pred)
+      cond = cond_v2.cond_v2(pred, true_fn, false_fn, name="cond")
+      for c in cond:
+        ops.add_to_collection("cond", c)
+      meta_graph = saver.export_meta_graph()
+
+    with ops.Graph().as_default() as g:
+      saver.import_meta_graph(meta_graph)
+      x = ops.get_collection("x")[0]
+      pred = ops.get_collection("pred")[0]
+      cond = ops.get_collection("cond")
+      cond_grad = gradients_impl.gradients(cond, [x], name="cond_grad")
+      cond_grad_grad = gradients_impl.gradients(
+          cond_grad, [x], name="cond_grad_grad")
+      with self.test_session(graph=g) as sess:
+        # d[x^3]/dx = 3x^2
+        true_val = sess.run(cond_grad, {pred: True})
+        self.assertEqual(true_val, [27.0])
+        # d[x]/dx = 1
+        false_val = sess.run(cond_grad, {pred: False})
+        self.assertEqual(false_val, [1.0])
+
+        true_val = sess.run(cond_grad_grad, {pred: True})
+        # d2[x^3]/dx2 = 6x
+        self.assertEqual(true_val, [18.0])
+        false_val = sess.run(cond_grad_grad, {pred: False})
+        # d2[x]/dx2 = 0
+        self.assertEqual(false_val, [0.0])
+
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/contrib/copy_graph/python/util/copy_elements.py b/tensorflow/contrib/copy_graph/python/util/copy_elements.py
index 102bc460fdadb0ad5dc9a2960b8655c55357108e..a0dd3881a86c19e47ccb65f84a2477a55626b81c 100644
--- a/tensorflow/contrib/copy_graph/python/util/copy_elements.py
+++ b/tensorflow/contrib/copy_graph/python/util/copy_elements.py
@@ -218,7 +218,6 @@ def copy_op_to_graph(org_instance, to_graph, variables, scope=''):
                            new_control_inputs, input_types, new_original_op,
                            op_def)
     #Use Graph's hidden methods to add the op
-    to_graph._add_op(new_op)  # pylint: disable=protected-access
     to_graph._record_op_seen_by_control_dependencies(new_op)
     for device_function in reversed(to_graph._device_function_stack):
       new_op._set_device(device_function(new_op))
diff --git a/tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py b/tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py
index ed0a26bbd87eeb5bd005de8f9d054d315e378529..8822a7523f6b168f569e29970c9c29f2eb3614fc 100644
--- a/tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py
+++ b/tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py
@@ -20,7 +20,6 @@ from __future__ import print_function
 import os
 from tensorflow.contrib.checkpoint.python import split_dependency
 from tensorflow.contrib.rnn.python.ops import lstm_ops
-from tensorflow.python.framework import common_shapes
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import random_seed
@@ -1647,10 +1646,3 @@ class CudnnRNNRelu(_CudnnRNNNoInputC):
   # 1 set of weight and bias parameters for the recurrent input, and 1 for the
   # previous layer input.
   _NUM_PARAMS_PER_LAYER = CUDNN_RNN_RELU_PARAMS_PER_LAYER
-
-
-ops.RegisterShape("CudnnRNNParamsSize")(common_shapes.call_cpp_shape_fn)
-ops.RegisterShape("CudnnRNNParamsToCanonical")(common_shapes.call_cpp_shape_fn)
-ops.RegisterShape("CudnnRNNCanonicalToParams")(common_shapes.call_cpp_shape_fn)
-ops.RegisterShape("CudnnRNN")(common_shapes.call_cpp_shape_fn)
-ops.RegisterShape("CudnnRNNBackprop")(common_shapes.call_cpp_shape_fn)
diff --git a/tensorflow/contrib/data/__init__.py b/tensorflow/contrib/data/__init__.py
index a25aa85251083c24ca6685c4ffef267955f66f63..1af1ed08b53ee04367eb316d5c9caa0216f2e88d 100644
--- a/tensorflow/contrib/data/__init__.py
+++ b/tensorflow/contrib/data/__init__.py
@@ -30,6 +30,7 @@ See the @{$datasets$Importing Data} Programmer's Guide for an overview.
 @@assert_element_shape
 @@batch_and_drop_remainder
 @@bucket_by_sequence_length
+@@choose_from_datasets
 @@dense_to_sparse_batch
 @@enumerate_dataset
 @@group_by_window
diff --git a/tensorflow/contrib/data/kernels/csv_dataset_op.cc b/tensorflow/contrib/data/kernels/csv_dataset_op.cc
index 76e54a284e07ec1bab9b0f364a44997a39bce78a..4657807785d58727d34f37172bd30c56a5b7cde6 100644
--- a/tensorflow/contrib/data/kernels/csv_dataset_op.cc
+++ b/tensorflow/contrib/data/kernels/csv_dataset_op.cc
@@ -18,7 +18,6 @@ limitations under the License.
 #include "tensorflow/core/framework/dataset.h"
 #include "tensorflow/core/framework/op.h"
 #include "tensorflow/core/framework/shape_inference.h"
-#include "tensorflow/core/lib/io/buffered_inputstream.h"
 #include "tensorflow/core/lib/io/random_inputstream.h"
 
 namespace tensorflow {
@@ -103,12 +102,11 @@ class CSVDatasetOp : public DatasetOpKernel {
     OP_REQUIRES(
         ctx, select_cols.empty() || select_cols.front() >= 0,
         errors::InvalidArgument("select_cols should be non-negative indices"));
-    bool select_all_cols = select_cols.empty();
 
-    *output = new Dataset(
-        ctx, std::move(filenames), header, buffer_size, output_types_,
-        output_shapes_, std::move(record_defaults), std::move(select_cols),
-        select_all_cols, use_quote_delim, delim[0], std::move(na_value));
+    *output = new Dataset(ctx, std::move(filenames), header, buffer_size,
+                          output_types_, output_shapes_,
+                          std::move(record_defaults), std::move(select_cols),
+                          use_quote_delim, delim[0], std::move(na_value));
   }
 
  private:
@@ -118,8 +116,7 @@ class CSVDatasetOp : public DatasetOpKernel {
             int64 buffer_size, const DataTypeVector& output_types,
             const std::vector<PartialTensorShape>& output_shapes,
             std::vector<Tensor> record_defaults, std::vector<int64> select_cols,
-            bool select_all_cols, bool use_quote_delim, char delim,
-            string na_value)
+            bool use_quote_delim, char delim, string na_value)
         : GraphDatasetBase(ctx),
           filenames_(std::move(filenames)),
           header_(header),
@@ -128,12 +125,11 @@ class CSVDatasetOp : public DatasetOpKernel {
           output_shapes_(output_shapes),
           record_defaults_(std::move(record_defaults)),
           select_cols_(std::move(select_cols)),
-          select_all_cols_(select_all_cols),
           use_quote_delim_(use_quote_delim),
           delim_(delim),
           na_value_(std::move(na_value)) {}
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
       return std::unique_ptr<IteratorBase>(
           new Iterator({this, strings::StrCat(prefix, "::CSV")}));
@@ -145,7 +141,7 @@ class CSVDatasetOp : public DatasetOpKernel {
       return output_shapes_;
     }
 
-    string DebugString() override { return "CSVDatasetOp::Dataset"; }
+    string DebugString() const override { return "CSVDatasetOp::Dataset"; }
 
    protected:
     Status AsGraphDefInternal(DatasetGraphDefBuilder* b,
@@ -166,11 +162,24 @@ class CSVDatasetOp : public DatasetOpKernel {
                              std::vector<Tensor>* out_tensors,
                              bool* end_of_sequence) override {
         mutex_lock l(mu_);
+        bool select_all = dataset()->select_cols_.empty();
         do {
           // We are currently processing a file, so try to read the next record
-          if (buffered_input_stream_) {
-            Status s = ReadRecord(ctx, out_tensors);
-            if (s.ok() || !errors::IsOutOfRange(s)) {
+          if (input_stream_) {
+            Status s = ReadRecord(ctx, out_tensors, select_all,
+                                  dataset()->select_cols_);
+            if (s.ok()) {
+              // Validate output
+              if (out_tensors->size() != dataset()->out_type_.size()) {
+                return errors::InvalidArgument(
+                    "Expect ", dataset()->out_type_.size(), " fields but have ",
+                    out_tensors->size(), " in record");
+              }
+
+              *end_of_sequence = false;
+              return s;
+            }
+            if (!errors::IsOutOfRange(s)) {
               // Not at the end of file, return OK or non-EOF errors to caller.
               *end_of_sequence = false;
               return s;
@@ -203,145 +212,317 @@ class CSVDatasetOp : public DatasetOpKernel {
       }
 
      private:
-      // Reads a record by parsing the input buffer, and converting extracted
+      // Reads an entire CSV row from the input stream, either from the
+      // existing buffer or by filling the buffer as needed. Converts extracted
       // fields to output tensors as we go.
-      Status ReadRecord(IteratorContext* ctx, std::vector<Tensor>* out_tensors)
+      //
+      // When this function is called, pos_ should be the index of the first
+      // character of the record in buffer_, or past the end of the buffer.
+      // Note: ctx and out_tensors are only used in this function
+      // when fields are included in the record.
+      Status ReadRecord(IteratorContext* ctx, std::vector<Tensor>* out_tensors,
+                        bool select_all, const std::vector<int64>& selected)
           EXCLUSIVE_LOCKS_REQUIRED(mu_) {
-        // Extracts fields from line(s) from the buffered input stream.
-        out_tensors->reserve(dataset()->record_defaults_.size());
-
-        string input;
-        TF_RETURN_IF_ERROR(buffered_input_stream_->ReadLine(&input));
-
-        size_t current_idx = 0;
-        size_t num_fields_parsed = 0;
-        size_t selector_idx = 0;  // Keep track of index into select_cols
-
-        while (current_idx < input.size()) {
-          // In each iteration, parse one field
-          if (input[current_idx] == '\n' || input[current_idx] == '\r') {
-            // This should never happen, because buffered input reader splits
-            // input on newlines.
-            return errors::InvalidArgument("Parsing error.");
-          }
+        if (pos_ >= buffer_.size()) {
+          // At the end of the file, this will return errors::OutOfRange
+          TF_RETURN_IF_ERROR(FillBuffer(&buffer_));
+          pos_ = 0;
+        }
+
+        // The first character may be \n if this is the continuation of a
+        // \r\n linebreak between this and the previous record. If so, skip it.
+
+        bool end_of_record = false;  // Keep track of when we find \n, \r or EOF
+        size_t num_parsed = 0;
+        size_t num_selected_parsed = 0;
+
+        Status result;
 
-          bool quoted = false;
+        while (!end_of_record) {  // Read till we reach \n, \r or EOF
           bool include =
-              (dataset()->select_all_cols_ ||
-               dataset()->select_cols_[selector_idx] == num_fields_parsed);
+              select_all || (num_selected_parsed < selected.size() &&
+                             selected[num_selected_parsed] == num_parsed);
 
-          if (dataset()->use_quote_delim_ && input[current_idx] == '"') {
-            quoted = true;
-            current_idx++;
+          // Don't fail fast, so that the next call to GetNext may still return
+          // a valid record
+          result.Update(
+              ParseOneField(ctx, out_tensors, &end_of_record, include));
+
+          num_parsed++;
+          if (include) num_selected_parsed++;
+        }
+
+        return result;
+      }
+
+      // Parses one field from position pos_ in the buffer. Fields are
+      // delimited by delim, CRLF, or EOF. Advances pos_ to the first char of
+      // the next field.
+      Status ParseOneField(IteratorContext* ctx,
+                           std::vector<Tensor>* out_tensors,
+                           bool* end_of_record, bool include)
+          EXCLUSIVE_LOCKS_REQUIRED(mu_) {
+        if (pos_ >= buffer_.size()) {
+          // If we get here, this means the previous field's end coincided
+          // with the end of the buffer. We can fill the buffer without abandon.
+          Status s = FillBuffer(&buffer_);
+
+          if (errors::IsOutOfRange(s)) {
+            // Reached EOF, and last field is empty
+            *end_of_record = true;
+            if (include) {
+              return FieldToOutput(ctx, StringPiece(), out_tensors);
+            } else {
+              return Status::OK();
+            }
+          } else if (!s.ok()) {
+            return s;  // Surface other errors back to caller
           }
 
-          // Parse the body of the field
-          string field;
-          if (!quoted) {
-            while (current_idx < input.size() &&
-                   input[current_idx] != dataset()->delim_) {
-              if ((dataset()->use_quote_delim_ && input[current_idx] == '"') ||
-                  input[current_idx] == '\n' || input[current_idx] == '\r') {
-                return errors::InvalidArgument(
-                    "Unquoted fields cannot have quotes/CRLFs inside");
+          pos_ = 0;
+        }
+
+        if (dataset()->use_quote_delim_ && buffer_[pos_] == '"') {
+          return ParseQuotedField(ctx, out_tensors, end_of_record, include);
+        }
+
+        return ParseUnquotedField(ctx, out_tensors, end_of_record, include);
+      }
+
+      // For keeping track of relevant parts of a field from a previous buffer
+      struct Piece {
+        size_t start;
+        size_t len;
+        string buffer;
+
+        Piece(string buffer, size_t start, size_t len)
+            : start(start), len(len), buffer(std::move(buffer)) {}
+      };
+
+      // Given that pos_ exceeds the buffer, saves the relevant part of the
+      // current buffer (if necessary), fills the buffer, and resets indices to
+      // 0.
+      Status SaveAndFillBuffer(std::vector<Piece>* earlier_pieces,
+                               size_t* start, bool include)
+          EXCLUSIVE_LOCKS_REQUIRED(mu_) {
+        string temp_buffer;
+
+        buffer_.swap(temp_buffer);
+        if (include && pos_ > *start) {
+          earlier_pieces->push_back(
+              Piece(std::move(temp_buffer), *start, pos_ - *start));
+        }
+        pos_ = 0;
+        *start = 0;
+        return FillBuffer(&buffer_);
+      }
+
+      // Parses unquoted field from position pos_ in the buffer. Continually
+      // reads from buffer until end of field is reached (delim, CRLF, or EOF).
+      // Advances pos_ to keep track of our position in the buffer as we go,
+      // stopping at the first character of the next field.
+      Status ParseQuotedField(IteratorContext* ctx,
+                              std::vector<Tensor>* out_tensors,
+                              bool* end_of_record, bool include)
+          EXCLUSIVE_LOCKS_REQUIRED(mu_) {
+        std::vector<Piece> earlier_pieces;
+        size_t start = pos_;
+        pos_++;  // Starting quotation mark
+
+        Status parse_result;
+        while (true) {  // Each iter reads 1 char, filling buffer if necessary
+          if (pos_ >= buffer_.size()) {
+            Status s = SaveAndFillBuffer(&earlier_pieces, &start, include);
+            if (errors::IsOutOfRange(s)) {
+              return errors::InvalidArgument(
+                  "Reached end of file without closing quoted field in "
+                  "record");
+            } else if (!s.ok()) {
+              return s;  // Surface all other errors to caller
+            }
+          }
+
+          char ch = buffer_[pos_];
+          if (ch == '"') {
+            // When we encounter a quote, we look ahead to the next character to
+            // decide what to do
+            pos_++;
+            if (pos_ >= buffer_.size()) {
+              Status s = SaveAndFillBuffer(&earlier_pieces, &start, include);
+              if (errors::IsOutOfRange(s)) {
+                // This was the last field. We are done
+                *end_of_record = true;
+                parse_result.Update(QuotedFieldToOutput(
+                    ctx, StringPiece(), out_tensors, earlier_pieces, include));
+                return parse_result;
+              } else if (!s.ok()) {
+                return s;
               }
-              if (include) field += input[current_idx];
-              current_idx++;
-            }  // Exit condition: end of input, or current index at delim
+            }
+
+            char next = buffer_[pos_];
+            pos_++;
+            if (next == dataset()->delim_) {
+              parse_result.Update(QuotedFieldToOutput(
+                  ctx, StringPiece(&buffer_[start], pos_ - 1 - start),
+                  out_tensors, earlier_pieces, include));
+              return parse_result;
+
+            } else if (next == '\n' || next == '\r') {
+              *end_of_record = true;
+              parse_result.Update(QuotedFieldToOutput(
+                  ctx, StringPiece(&buffer_[start], pos_ - 1 - start),
+                  out_tensors, earlier_pieces, include));
+              if (next == '\r') SkipNewLineIfNecessary();
+              return parse_result;
+            } else if (next != '"') {
+              // Take note of the error, but keep going to end of field.
+              include = false;  // So we don't get funky errors when trying to
+                                // unescape the quotes.
+              parse_result.Update(errors::InvalidArgument(
+                  "Quote inside a string has to be escaped by another quote"));
+            }
 
-            // Go to next field or the end
-            current_idx++;
           } else {
-            // Quoted field needs to be ended with '"' and delim or end
-            while (true) {
-              if (current_idx >= input.size() - 1 || input.empty()) {
-                if (current_idx == input.size() - 1 &&
-                    input[current_idx] == '"') {
-                  // We're at the end of the input, and the quote terminates the
-                  // record. Go to end.
-                  current_idx++;
-                  break;
-                }
-                // If there's no terminating quote, it means our buffered record
-                // line reader split a record up. This can happen if there is a
-                // newline encased in quotes. The next line is also part of the
-                // record, so we read it and reset the index.
-                if (include && current_idx == input.size() - 1) {
-                  // TODO(rachelim): Instead of building up a string, keep track
-                  //  of terminal indices (or starting char* and length)
-                  // Also look into using /lib/strings/Scanner
-                  field += input[current_idx];
-                }
-                if (include) {
-                  field += '\n';
-                }
-                current_idx = 0;
-                Status s = buffered_input_stream_->ReadLine(&input);
-                if (!s.ok()) {
-                  return errors::InvalidArgument(
-                      "Quoted field has to end with quote followed by delim, "
-                      "CRLF, or EOF");
-                }
-              } else if (input[current_idx] == '"' &&
-                         input[current_idx + 1] == dataset()->delim_) {
-                // End of field, go to next field or end
-                current_idx += 2;
-                break;
-              } else if (input[current_idx] == '"') {
-                // Current char is a quote. Since we're not at end of field,
-                // the next character must also be a quote.
-                if (input[current_idx + 1] != '"') {
-                  return errors::InvalidArgument(
-                      "Quote inside a string has to be escaped by another "
-                      "quote");
-                }
-                if (include) field += '"';
-                current_idx += 2;
-              } else {
-                if (include) field += input[current_idx];
-                current_idx++;
-              }
+            pos_++;
+          }
+        }
+      }
+
+      // Converts quoted field to an output tensor, removing the starting
+      // and ending quotes from it and unescaping double quotations if
+      // necessary.
+      Status QuotedFieldToOutput(IteratorContext* ctx, StringPiece field,
+                                 std::vector<Tensor>* out_tensors,
+                                 const std::vector<Piece>& earlier_pieces,
+                                 bool include) EXCLUSIVE_LOCKS_REQUIRED(mu_) {
+        if (!include) return Status::OK();
+
+        if (earlier_pieces.empty()) {
+          if (field.find('\"', 1) == field.size() - 1) {
+            // `field` contains no escaped quotation marks.
+            // Exclude framing quotation marks
+            field.remove_prefix(1);
+            field.remove_suffix(1);
+            return FieldToOutput(ctx, field, out_tensors);
+          }
+        }
+        string field_complete;
+        size_t str_len = field.size();
+        for (const Piece& p : earlier_pieces) {
+          str_len += p.len;
+        }
+        field_complete.reserve(str_len);
+
+        // This bool flips every time we see a quote, so that we skip the second
+        // quote of every pair of adjacent quotes in the field. We need to track
+        // this across iterations of the for loop because adjacent double quotes
+        // may be in different buffers. Initialize to true because we also skip
+        // the opening quotation mark of the quoted field.
+        bool skip_next_quote = true;
+        for (const Piece& p : earlier_pieces) {
+          AppendUnescapedPiece(StringPiece(&p.buffer[p.start], p.len),
+                               &field_complete, &skip_next_quote);
+        }
+        AppendUnescapedPiece(field, &field_complete, &skip_next_quote);
+        StringPiece result = StringPiece(field_complete);
+        result.remove_suffix(1);  // Skip final quote
+
+        return FieldToOutput(ctx, result, out_tensors);
+      }
+
+      void AppendUnescapedPiece(StringPiece piece, string* field_complete,
+                                bool* skip_next_quote) {
+        size_t from = 0;
+        size_t found = piece.find('\"', from);
+        while (found != string::npos) {
+          if (!*skip_next_quote) {
+            // This is the first quote in a pair of adjacent double quotes
+            field_complete->append(piece.data() + from, found + 1 - from);
+          }
+          *skip_next_quote = !*skip_next_quote;
+          from = found + 1;
+          found = piece.find('\"', from);
+        }
+        // Include the chunk after the last quotation mark in the string
+        if (from < piece.size()) {
+          field_complete->append(piece.data() + from, piece.size() - from);
+        }
+      }
+
+      // Parses unquoted field from position pos_ in the buffer. Continually
+      // reads from buffer until end of field is reached (delim, CRLF, or EOF).
+      // Advances pos_ to keep track of our position in the buffer as we go,
+      // stopping at the first character of the next field.
+      Status ParseUnquotedField(IteratorContext* ctx,
+                                std::vector<Tensor>* out_tensors,
+                                bool* end_of_record, bool include)
+          EXCLUSIVE_LOCKS_REQUIRED(mu_) {
+        std::vector<Piece> earlier_pieces;
+        size_t start = pos_;
+        Status parse_result;
+
+        while (true) {  // Each iter reads 1 char, filling buffer if necessary
+          if (pos_ >= buffer_.size()) {
+            Status s = SaveAndFillBuffer(&earlier_pieces, &start, include);
+            // Handle errors
+            if (errors::IsOutOfRange(s)) {
+              // Whatever we have is the last field of the last record
+              *end_of_record = true;
+              parse_result.Update(UnquotedFieldToOutput(
+                  ctx, StringPiece(&buffer_[start], pos_ - start), out_tensors,
+                  earlier_pieces, include));
+              return parse_result;
+            } else if (!s.ok()) {
+              return s;  // Surface all other errors to caller
             }
           }
 
-          num_fields_parsed++;
+          char ch = buffer_[pos_];
 
-          if (include) {
-            // Add the tensor to the result
-            TF_RETURN_IF_ERROR(FieldToOutput(ctx, std::move(field),
-                                             selector_idx, out_tensors));
-            selector_idx++;
-            // Terminate early if we have all the fields we want
-            if (selector_idx == dataset()->select_cols_.size())
-              return Status::OK();
+          if (ch == dataset()->delim_) {
+            parse_result.Update(UnquotedFieldToOutput(
+                ctx, StringPiece(&buffer_[start], pos_ - start), out_tensors,
+                earlier_pieces, include));
+            pos_++;
+            return parse_result;
           }
-        }  // Exit condition: current_idx has reached the end of record
-
-        // Check if the last field is empty, and include it if necessary
-        bool include =
-            (dataset()->select_all_cols_ ||
-             dataset()->select_cols_[selector_idx] == num_fields_parsed);
-        if (include && !input.empty() &&
-            input[input.size() - 1] == dataset()->delim_) {
-          TF_RETURN_IF_ERROR(
-              FieldToOutput(ctx, string(), selector_idx, out_tensors));
+          if (ch == '\n' || ch == '\r') {
+            // need special case to skip over first \n of record if the line
+            // breaks are \r\n
+            parse_result.Update(UnquotedFieldToOutput(
+                ctx, StringPiece(&buffer_[start], pos_ - start), out_tensors,
+                earlier_pieces, include));
+            *end_of_record = true;
+            pos_++;
+            if (ch == '\r') SkipNewLineIfNecessary();
+            return parse_result;
+          }
+          if (dataset()->use_quote_delim_ && ch == '"') {
+            // Take note of the error, but keep going to end of field.
+            parse_result.Update(errors::InvalidArgument(
+                "Unquoted fields cannot have quotes inside"));
+          }
+          // Otherwise, go to next character
+          pos_++;
         }
+      }
 
-        // Check that number of fields matches
-        if (out_tensors->size() != dataset()->out_type_.size()) {
-          return errors::InvalidArgument("Expect ", dataset()->out_type_.size(),
-                                         " fields but have ",
-                                         out_tensors->size(), " in record");
+      Status FillBuffer(string* result) EXCLUSIVE_LOCKS_REQUIRED(mu_) {
+        result->clear();
+        Status s = input_stream_->ReadNBytes(dataset()->buffer_size_, result);
+
+        if (errors::IsOutOfRange(s) && !result->empty()) {
+          // Ignore OutOfRange error when ReadNBytes read < N bytes.
+          return Status::OK();
         }
-        return Status::OK();
+        return s;
       }
 
-      // Given a string field, and its index in the output,
-      // converts it to a Tensor of the right type and adds it to the
-      // out_tensors vector.
-      Status FieldToOutput(IteratorContext* ctx, string field,
-                           size_t output_idx,
+      // Given a field, converts it to the right output tensor type
+      Status FieldToOutput(IteratorContext* ctx, StringPiece field,
                            std::vector<Tensor>* out_tensors) {
+        size_t output_idx = out_tensors->size();
         if (output_idx >= dataset()->out_type_.size()) {
           // We can get here if we're selecting all columns, but the number of
           // fields exceeds the number of defaults provided
@@ -397,7 +578,7 @@ class CSVDatasetOp : public DatasetOpKernel {
                   dataset()->record_defaults_[output_idx].flat<float>()(0);
             } else {
               float value;
-              if (!strings::safe_strtof(field.c_str(), &value)) {
+              if (!strings::safe_strtof(field, &value)) {
                 return errors::InvalidArgument(
                     "Field ", output_idx,
                     " in record is not a valid float: ", field);
@@ -412,7 +593,7 @@ class CSVDatasetOp : public DatasetOpKernel {
                   dataset()->record_defaults_[output_idx].flat<double>()(0);
             } else {
               double value;
-              if (!strings::safe_strtod(field.c_str(), &value)) {
+              if (!strings::safe_strtod(field, &value)) {
                 return errors::InvalidArgument(
                     "Field ", output_idx,
                     " in record is not a valid double: ", field);
@@ -426,7 +607,7 @@ class CSVDatasetOp : public DatasetOpKernel {
               component.scalar<string>()() =
                   dataset()->record_defaults_[output_idx].flat<string>()(0);
             } else {
-              component.scalar<string>()() = std::move(field);
+              component.scalar<string>()() = field.ToString();
             }
             break;
           }
@@ -439,6 +620,50 @@ class CSVDatasetOp : public DatasetOpKernel {
         return Status::OK();
       }
 
+      // Records can be delimited by "\r\n" line breaks. When we encounter a
+      // '\r', we have to check the next character to see if it is part of the
+      // linebreak, and ignore it if so.
+      void SkipNewLineIfNecessary() EXCLUSIVE_LOCKS_REQUIRED(mu_) {
+        if (pos_ >= buffer_.size()) {
+          Status s = FillBuffer(&buffer_);
+          pos_ = 0;
+          // If we failed to fill buffer, it doesn't matter because we're done
+          // with the record
+          if (!s.ok()) return;
+        }
+        if (buffer_[pos_] == '\n') {
+          pos_++;
+        }
+      }
+
+      // Given a string field, and its index in the output,
+      // converts it to a Tensor of the right type and adds it to the
+      // out_tensors vector.
+      Status UnquotedFieldToOutput(IteratorContext* ctx, StringPiece field,
+                                   std::vector<Tensor>* out_tensors,
+                                   const std::vector<Piece>& earlier_pieces,
+                                   bool include) EXCLUSIVE_LOCKS_REQUIRED(mu_) {
+        if (!include) return Status::OK();
+
+        if (earlier_pieces.empty()) {
+          return FieldToOutput(ctx, field, out_tensors);
+        }
+
+        size_t str_len = field.size();
+        for (const Piece& p : earlier_pieces) {
+          str_len += p.len;
+        }
+        string field_complete;
+        field_complete.reserve(str_len);
+
+        for (const Piece& p : earlier_pieces) {
+          field_complete.append(p.buffer, p.start, p.len);
+        }
+
+        field_complete.append(field.data(), field.size());
+        return FieldToOutput(ctx, field_complete, out_tensors);
+      }
+
       // Sets up reader streams to read from the file at `current_file_index_`.
       Status SetupStreamsLocked(Env* env) EXCLUSIVE_LOCKS_REQUIRED(mu_) {
         if (current_file_index_ >= dataset()->filenames_.size()) {
@@ -452,16 +677,18 @@ class CSVDatasetOp : public DatasetOpKernel {
             dataset()->filenames_[current_file_index_], &file_));
         input_stream_.reset(
             new io::RandomAccessInputStream(file_.get(), false));
-        // TODO(rachelim): Maintain our own buffer so we don't read every record
-        //   twice
-        buffered_input_stream_.reset(new io::BufferedInputStream(
-            input_stream_.get(), dataset()->buffer_size_, false));
+        buffer_.clear();
+        pos_ = 0;
         if (dataset()->header_) {
-          // Ignore header line
-          string str;
-          Status s = buffered_input_stream_->ReadLine(&str);
-          if (errors::IsOutOfRange(s)) {
-            return errors::InvalidArgument("Can't read header of empty file");
+          // Read one line, but don't include it. Pass nullptrs as dummy
+          // pointers to objects that shouldn't be invoked anyway
+          // We need to process this as a record here instead of just finding
+          // the first newline because it might contain quoted fields with
+          // newlines in the header as well
+          std::vector<int64> empty;
+          Status s = ReadRecord(nullptr, nullptr, false, empty);
+          if (!s.ok()) {
+            return errors::InvalidArgument("Can't read header of file");
           }
         }
         return Status::OK();
@@ -470,15 +697,15 @@ class CSVDatasetOp : public DatasetOpKernel {
       // Resets all reader streams.
       void ResetStreamsLocked() EXCLUSIVE_LOCKS_REQUIRED(mu_) {
         input_stream_.reset();
-        buffered_input_stream_.reset();
         file_.reset();
       }
 
       mutex mu_;
+      string buffer_ GUARDED_BY(mu_);  // Maintain our own buffer
+      size_t pos_ GUARDED_BY(
+          mu_);  // Index into the buffer must be maintained between iters
       std::unique_ptr<io::RandomAccessInputStream> input_stream_
           GUARDED_BY(mu_);
-      std::unique_ptr<io::BufferedInputStream> buffered_input_stream_
-          GUARDED_BY(mu_);
       size_t current_file_index_ GUARDED_BY(mu_) = 0;
       std::unique_ptr<RandomAccessFile> file_
           GUARDED_BY(mu_);  // must outlive input_stream_
@@ -491,7 +718,6 @@ class CSVDatasetOp : public DatasetOpKernel {
     const std::vector<PartialTensorShape> output_shapes_;
     const std::vector<Tensor> record_defaults_;
     const std::vector<int64> select_cols_;
-    const bool select_all_cols_;
     const bool use_quote_delim_;
     const char delim_;
     const string na_value_;
diff --git a/tensorflow/contrib/data/kernels/directed_interleave_dataset_op.cc b/tensorflow/contrib/data/kernels/directed_interleave_dataset_op.cc
index 48d3734162525ffc6ace076e4f0523c1d0cae511..6a12ca06f4d6cc2096aaf8191a01a899881b43db 100644
--- a/tensorflow/contrib/data/kernels/directed_interleave_dataset_op.cc
+++ b/tensorflow/contrib/data/kernels/directed_interleave_dataset_op.cc
@@ -91,7 +91,7 @@ class DirectedInterleaveDatasetOp : public DatasetOpKernel {
       }
     }
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
       return std::unique_ptr<IteratorBase>(new Iterator(
           {this, strings::StrCat(prefix, "::DirectedInterleave")}));
@@ -105,7 +105,7 @@ class DirectedInterleaveDatasetOp : public DatasetOpKernel {
       return output_shapes_;
     }
 
-    string DebugString() override {
+    string DebugString() const override {
       return strings::StrCat("DirectedInterleaveDatasetOp::Dataset");
     }
 
@@ -130,15 +130,21 @@ class DirectedInterleaveDatasetOp : public DatasetOpKernel {
      public:
       explicit Iterator(const Params& params)
           : DatasetIterator<Dataset>(params),
-            selector_input_impl_(params.dataset->selector_input_->MakeIterator(
-                params.prefix + ".selector")),
-            num_active_inputs_(params.dataset->data_inputs_.size()) {
-        data_input_impls_.reserve(params.dataset->data_inputs_.size());
-        for (size_t i = 0; i < params.dataset->data_inputs_.size(); ++i) {
-          const DatasetBase* data_input = params.dataset->data_inputs_[i];
-          data_input_impls_.push_back(data_input->MakeIterator(
-              strings::StrCat(params.prefix, "[", i, "]")));
+            num_active_inputs_(params.dataset->data_inputs_.size()) {}
+
+      Status Initialize(IteratorContext* ctx) override {
+        mutex_lock l(mu_);
+        TF_RETURN_IF_ERROR(dataset()->selector_input_->MakeIterator(
+            ctx, strings::StrCat(prefix(), ".selector"),
+            &selector_input_impl_));
+        data_input_impls_.resize(dataset()->data_inputs_.size());
+        for (size_t i = 0; i < data_input_impls_.size(); ++i) {
+          const DatasetBase* data_input = dataset()->data_inputs_[i];
+          TF_RETURN_IF_ERROR(data_input->MakeIterator(
+              ctx, strings::StrCat(prefix(), "[", i, "]"),
+              &data_input_impls_[i]));
         }
+        return Status::OK();
       }
 
       Status GetNextInternal(IteratorContext* ctx,
diff --git a/tensorflow/contrib/data/kernels/ignore_errors_dataset_op.cc b/tensorflow/contrib/data/kernels/ignore_errors_dataset_op.cc
index bb29df60e8f114aaa50f578c43e73874f72ab0a3..bbec50681c6f5decec5a3b5fbf09cc3011a21199 100644
--- a/tensorflow/contrib/data/kernels/ignore_errors_dataset_op.cc
+++ b/tensorflow/contrib/data/kernels/ignore_errors_dataset_op.cc
@@ -44,7 +44,7 @@ class IgnoreErrorsDatasetOp : public UnaryDatasetOpKernel {
 
     ~Dataset() override { input_->Unref(); }
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
       return std::unique_ptr<IteratorBase>(
           new Iterator({this, strings::StrCat(prefix, "::IgnoreErrors")}));
@@ -57,7 +57,9 @@ class IgnoreErrorsDatasetOp : public UnaryDatasetOpKernel {
       return input_->output_shapes();
     }
 
-    string DebugString() override { return "IgnoreErrorsDatasetOp::Dataset"; }
+    string DebugString() const override {
+      return "IgnoreErrorsDatasetOp::Dataset";
+    }
 
    protected:
     Status AsGraphDefInternal(OpKernelContext* ctx, DatasetGraphDefBuilder* b,
@@ -72,8 +74,11 @@ class IgnoreErrorsDatasetOp : public UnaryDatasetOpKernel {
     class Iterator : public DatasetIterator<Dataset> {
      public:
       explicit Iterator(const Params& params)
-          : DatasetIterator<Dataset>(params),
-            input_impl_(params.dataset->input_->MakeIterator(params.prefix)) {}
+          : DatasetIterator<Dataset>(params) {}
+
+      Status Initialize(IteratorContext* ctx) override {
+        return dataset()->input_->MakeIterator(ctx, prefix(), &input_impl_);
+      }
 
       Status GetNextInternal(IteratorContext* ctx,
                              std::vector<Tensor>* out_tensors,
diff --git a/tensorflow/contrib/data/kernels/threadpool_dataset_op.cc b/tensorflow/contrib/data/kernels/threadpool_dataset_op.cc
index 63e19ae3f837c9d3cfb1221df64360ee74117f13..3dfc3741c2b040dd5be3223c24f0715ba3be4248 100644
--- a/tensorflow/contrib/data/kernels/threadpool_dataset_op.cc
+++ b/tensorflow/contrib/data/kernels/threadpool_dataset_op.cc
@@ -127,7 +127,7 @@ class ThreadPoolDatasetOp : public UnaryDatasetOpKernel {
       threadpool_->Unref();
     }
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
       return std::unique_ptr<IteratorBase>(
           new Iterator({this, strings::StrCat(prefix, "::ThreadPool")}));
@@ -140,7 +140,9 @@ class ThreadPoolDatasetOp : public UnaryDatasetOpKernel {
       return input_->output_shapes();
     }
 
-    string DebugString() override { return "ThreadPoolDatasetOp::Dataset"; }
+    string DebugString() const override {
+      return "ThreadPoolDatasetOp::Dataset";
+    }
 
    protected:
     Status AsGraphDefInternal(OpKernelContext* ctx, DatasetGraphDefBuilder* b,
@@ -154,8 +156,11 @@ class ThreadPoolDatasetOp : public UnaryDatasetOpKernel {
     class Iterator : public DatasetIterator<Dataset> {
      public:
       explicit Iterator(const Params& params)
-          : DatasetIterator<Dataset>(params),
-            input_impl_(params.dataset->input_->MakeIterator(params.prefix)) {}
+          : DatasetIterator<Dataset>(params) {}
+
+      Status Initialize(IteratorContext* ctx) override {
+        return dataset()->input_->MakeIterator(ctx, prefix(), &input_impl_);
+      }
 
       Status GetNextInternal(IteratorContext* ctx,
                              std::vector<Tensor>* out_tensors,
diff --git a/tensorflow/contrib/data/kernels/unique_dataset_op.cc b/tensorflow/contrib/data/kernels/unique_dataset_op.cc
index 69fbb0fcdcce87951d2c9b84210fda378081b103..67c237799c10a2724f18bb0df99e4bf8f5cd2b8a 100644
--- a/tensorflow/contrib/data/kernels/unique_dataset_op.cc
+++ b/tensorflow/contrib/data/kernels/unique_dataset_op.cc
@@ -56,7 +56,7 @@ class UniqueDatasetOp : public UnaryDatasetOpKernel {
 
     ~Dataset() override { input_->Unref(); }
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
       return std::unique_ptr<IteratorBase>(
           new Iterator({this, strings::StrCat(prefix, "::Unique")}));
@@ -70,7 +70,7 @@ class UniqueDatasetOp : public UnaryDatasetOpKernel {
       return input_->output_shapes();
     }
 
-    string DebugString() override {
+    string DebugString() const override {
       return strings::StrCat("UniqueDatasetOp::Dataset");
     }
 
@@ -87,8 +87,11 @@ class UniqueDatasetOp : public UnaryDatasetOpKernel {
     class Iterator : public DatasetIterator<Dataset> {
      public:
       explicit Iterator(const typename Iterator::Params& params)
-          : DatasetIterator<Dataset>(params),
-            input_impl_(params.dataset->input_->MakeIterator(params.prefix)) {}
+          : DatasetIterator<Dataset>(params) {}
+
+      Status Initialize(IteratorContext* ctx) override {
+        return dataset()->input_->MakeIterator(ctx, prefix(), &input_impl_);
+      }
 
       Status GetNextInternal(IteratorContext* ctx,
                              std::vector<Tensor>* out_tensors,
diff --git a/tensorflow/contrib/data/python/kernel_tests/BUILD b/tensorflow/contrib/data/python/kernel_tests/BUILD
index f5082228e885d065e659abf208ca7b94bb4999a5..0dfd249ec27d96d6f1a4ae65d623df456db9991f 100644
--- a/tensorflow/contrib/data/python/kernel_tests/BUILD
+++ b/tensorflow/contrib/data/python/kernel_tests/BUILD
@@ -54,6 +54,19 @@ py_test(
     ],
 )
 
+py_test(
+    name = "cache_dataset_op_test",
+    size = "small",
+    srcs = ["cache_dataset_op_test.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":dataset_serialization_test",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:errors",
+        "//tensorflow/python/data/ops:dataset_ops",
+    ],
+)
+
 py_test(
     name = "concatenate_dataset_op_test",
     size = "small",
@@ -128,6 +141,7 @@ py_test(
     tags = ["no_pip"],
     deps = [
         ":dataset_serialization_test",
+        "//tensorflow/contrib/data/python/ops:error_ops",
         "//tensorflow/contrib/data/python/ops:readers",
         "//third_party/py/numpy",
     ],
@@ -208,6 +222,23 @@ py_test(
     ],
 )
 
+py_test(
+    name = "directed_interleave_dataset_test",
+    size = "medium",
+    srcs = ["directed_interleave_dataset_test.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":dataset_serialization_test",
+        "//tensorflow/contrib/data/python/ops:interleave_ops",
+        "//tensorflow/python:client",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:errors",
+        "//tensorflow/python:training",
+        "//tensorflow/python/data/ops:dataset_ops",
+        "//third_party/py/numpy",
+    ],
+)
+
 tf_py_test(
     name = "get_single_element_test",
     size = "small",
@@ -262,6 +293,19 @@ py_test(
     ],
 )
 
+py_test(
+    name = "optimize_dataset_op_test",
+    size = "small",
+    srcs = ["optimize_dataset_op_test.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":dataset_serialization_test",
+        "//tensorflow/contrib/data/python/ops:optimization",
+        "//tensorflow/python:platform",
+        "//tensorflow/python/data/ops:dataset_ops",
+    ],
+)
+
 py_test(
     name = "prefetch_dataset_op_test",
     size = "small",
@@ -299,6 +343,26 @@ py_test(
     ],
 )
 
+py_library(
+    name = "reader_dataset_ops_test_base",
+    testonly = 1,
+    srcs = [
+        "reader_dataset_ops_test_base.py",
+    ],
+    srcs_version = "PY2AND3",
+    visibility = ["//visibility:private"],
+    deps = [
+        "//tensorflow/contrib/data/python/ops:readers",
+        "//tensorflow/core:protos_all_py",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:dtypes",
+        "//tensorflow/python:lib",
+        "//tensorflow/python:parsing_ops",
+        "//tensorflow/python:util",
+        "//tensorflow/python/data/ops:readers",
+    ],
+)
+
 py_test(
     name = "reader_dataset_ops_test",
     size = "medium",
@@ -308,8 +372,8 @@ py_test(
     tags = ["no_pip"],
     deps = [
         ":dataset_serialization_test",
+        ":reader_dataset_ops_test_base",
         "//tensorflow/contrib/data/python/ops:readers",
-        "//tensorflow/core:protos_all_py",
         "//tensorflow/python:array_ops",
         "//tensorflow/python:client_testlib",
         "//tensorflow/python:constant_op",
@@ -321,6 +385,7 @@ py_test(
         "//tensorflow/python:string_ops",
         "//tensorflow/python:util",
         "//tensorflow/python/data/ops:iterator_ops",
+        "//tensorflow/python/data/ops:readers",
         "//third_party/py/numpy",
     ],
 )
@@ -410,6 +475,7 @@ py_test(
     tags = ["no_pip"],
     deps = [
         ":dataset_serialization_test",
+        "//tensorflow/contrib/data/python/ops:iterator_ops",
         "//tensorflow/contrib/data/python/ops:shuffle_ops",
         "//tensorflow/python:array_ops",
         "//tensorflow/python:client_testlib",
@@ -417,6 +483,7 @@ py_test(
         "//tensorflow/python:dtypes",
         "//tensorflow/python:errors",
         "//tensorflow/python:framework_ops",
+        "//tensorflow/python:training",
         "//tensorflow/python/data/ops:dataset_ops",
         "//tensorflow/python/data/ops:iterator_ops",
         "//third_party/py/numpy",
@@ -447,10 +514,15 @@ py_test(
     tags = ["no_pip"],
     deps = [
         ":dataset_serialization_test",
+        ":reader_dataset_ops_test_base",
         "//tensorflow/contrib/data/python/ops:stats_ops",
+        "//tensorflow/core:protos_all_py",
+        "//tensorflow/python:array_ops",
         "//tensorflow/python:client_testlib",
         "//tensorflow/python:errors",
+        "//tensorflow/python:framework_ops",
         "//tensorflow/python/data/ops:dataset_ops",
+        "//third_party/py/numpy",
     ],
 )
 
diff --git a/tensorflow/contrib/data/python/kernel_tests/batch_dataset_op_test.py b/tensorflow/contrib/data/python/kernel_tests/batch_dataset_op_test.py
index 2568b899d7ea1be685036ad8af93f584f861c951..b5fbc45ad3d8d262c1c79b5723ffeb38ff6a34c2 100644
--- a/tensorflow/contrib/data/python/kernel_tests/batch_dataset_op_test.py
+++ b/tensorflow/contrib/data/python/kernel_tests/batch_dataset_op_test.py
@@ -552,6 +552,44 @@ class BatchDatasetTest(test.TestCase):
       with self.assertRaises(errors.OutOfRangeError):
         sess.run(next_element)
 
+  def testMapAndBatchParallelGetNext(self):
+    iterator = (dataset_ops.Dataset.range(50000)
+                .apply(batching.map_and_batch(lambda x: x, batch_size=100))
+                .make_one_shot_iterator())
+    elements = []
+    for _ in range(100):
+      elements.append(iterator.get_next())
+    with self.test_session() as sess:
+      for i in range(5):
+        got = sess.run(elements)
+        got.sort(key=lambda x: x[0])
+        expected = []
+        for j in range(100):
+          expected.append(range(i*10000+j*100, i*10000+(j+1)*100))
+        self.assertAllEqual(got, expected)
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(elements)
+
+  def testMapAndBatchParallelGetNextDropRemainder(self):
+    iterator = (
+        dataset_ops.Dataset.range(49999).apply(
+            batching.map_and_batch(
+                lambda x: x, batch_size=100, drop_remainder=True))
+        .make_one_shot_iterator())
+    elements = []
+    for _ in range(100):
+      elements.append(iterator.get_next())
+    with self.test_session() as sess:
+      for i in range(4):
+        got = sess.run(elements)
+        got.sort(key=lambda x: x[0])
+        expected = []
+        for j in range(100):
+          expected.append(range(i*10000+j*100, i*10000+(j+1)*100))
+        self.assertAllEqual(got, expected)
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(elements)
+
   def testMapAndBatchSparse(self):
 
     def _sparse(i):
diff --git a/tensorflow/contrib/data/python/kernel_tests/cache_dataset_op_test.py b/tensorflow/contrib/data/python/kernel_tests/cache_dataset_op_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..f08216a303e2d7dee155ccadcdb9f42f1b24ea0f
--- /dev/null
+++ b/tensorflow/contrib/data/python/kernel_tests/cache_dataset_op_test.py
@@ -0,0 +1,190 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for the experimental features of CacheDataset."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import os
+
+from tensorflow.contrib.data.python.kernel_tests import dataset_serialization_test_base
+from tensorflow.python.data.ops import dataset_ops
+from tensorflow.python.framework import errors
+from tensorflow.python.platform import test
+
+
+class CacheToFileDatasetSerializationTest(
+    dataset_serialization_test_base.DatasetSerializationTestBase):
+
+  def setUp(self):
+    self.range_size = 10
+    self.num_repeats = 3
+    self.num_outputs = self.range_size * self.num_repeats
+    self.cache_file_prefix = 'test'
+
+  def ds_fn(self):
+    return dataset_ops.Dataset.range(self.range_size).cache(
+        os.path.join(self.get_temp_dir(),
+                     self.cache_file_prefix)).repeat(self.num_repeats)
+
+  def expected_outputs(self):
+    return list(range(self.range_size)) * self.num_repeats
+
+  def testCheckpointBeforeOneEpoch(self):
+    # Generate 5 entries from iterator and save checkpoint.
+    outputs = self.gen_outputs(self.ds_fn, [], 5, verify_exhausted=False)
+    self.assertSequenceEqual(outputs, range(5))
+
+    # Restore from checkpoint and produce the rest of the elements from the
+    # iterator.
+    outputs.extend(
+        self.gen_outputs(
+            self.ds_fn, [],
+            self.num_outputs - 5,
+            ckpt_saved=True,
+            verify_exhausted=False))
+    self.assertSequenceEqual(outputs, self.expected_outputs())
+
+  def testCheckpointBeforeOneEpochThenRunFewSteps(self):
+    # Generate 8 entries from iterator but save checkpoint after producing
+    # 5.
+    outputs = self.gen_outputs(
+        self.ds_fn, [5],
+        8,
+        verify_exhausted=False,
+        save_checkpoint_at_end=False)
+    self.assertSequenceEqual(outputs, range(8))
+
+    # Restoring from checkpoint and running GetNext should return a
+    # `AlreadExistsError` now because the lockfile already exists.
+    with self.assertRaises(errors.AlreadyExistsError):
+      self.gen_outputs(
+          self.ds_fn, [],
+          self.num_outputs - 5,
+          ckpt_saved=True,
+          verify_exhausted=False)
+
+  def testCheckpointAfterOneEpoch(self):
+    # Generate 15 entries from iterator and save checkpoint.
+    outputs = self.gen_outputs(self.ds_fn, [], 15, verify_exhausted=False)
+    self.assertSequenceEqual(outputs, list(range(10)) + list(range(5)))
+
+    # Restore from checkpoint and produce the rest of the elements from the
+    # iterator.
+    outputs.extend(
+        self.gen_outputs(
+            self.ds_fn, [],
+            self.num_outputs - 15,
+            ckpt_saved=True,
+            verify_exhausted=False))
+    self.assertSequenceEqual(outputs, self.expected_outputs())
+
+  def testCheckpointAfterOneEpochThenRunFewSteps(self):
+    # Generate 18 entries from iterator but save checkpoint after producing
+    # 15.
+    outputs = self.gen_outputs(
+        self.ds_fn, [15],
+        18,
+        verify_exhausted=False,
+        save_checkpoint_at_end=False)
+    self.assertSequenceEqual(outputs, list(range(10)) + list(range(8)))
+
+    outputs = list(range(10)) + list(range(5)) + self.gen_outputs(
+        self.ds_fn, [],
+        self.num_outputs - 15,
+        ckpt_saved=True,
+        verify_exhausted=False)
+    self.assertSequenceEqual(outputs, list(range(10)) * 3)
+
+  def testCheckpointBeforeOneEpochButRunCompleteEpoch(self):
+    # Generate 13 entries from iterator but save checkpoint after producing
+    # 5.
+    outputs = self.gen_outputs(
+        self.ds_fn, [5],
+        13,
+        verify_exhausted=False,
+        save_checkpoint_at_end=False)
+    self.assertSequenceEqual(outputs, list(range(10)) + list(range(3)))
+
+    # Since we ran for more than one epoch, the cache was completely written.
+    # The ckpt was saved when the iterator was in cache-write mode. Test that
+    # the iterator falls back to read mode after restoring if the cache has
+    # been completely written.
+
+    outputs = list(range(5)) + self.gen_outputs(
+        self.ds_fn, [],
+        self.num_outputs - 5,
+        ckpt_saved=True,
+        verify_exhausted=False)
+    self.assertSequenceEqual(outputs, list(range(10)) * 3)
+
+  def testCheckpointUnusedWriterIterator(self):
+    # Checkpoint before get_next is called even once.
+    outputs = self.gen_outputs(self.ds_fn, [], 0, verify_exhausted=False)
+    self.assertSequenceEqual(outputs, [])
+
+    outputs = self.gen_outputs(
+        self.ds_fn, [],
+        self.num_outputs,
+        ckpt_saved=True,
+        verify_exhausted=False)
+    self.assertSequenceEqual(outputs, list(range(10)) * 3)
+
+  def testCheckpointUnusedMidwayWriterIterator(self):
+    # Produce 5 elements and checkpoint.
+    outputs = self.gen_outputs(self.ds_fn, [], 5, verify_exhausted=False)
+    self.assertSequenceEqual(outputs, range(5))
+
+    # Restore from checkpoint, then produce no elements and checkpoint.
+    outputs.extend(
+        self.gen_outputs(
+            self.ds_fn, [], 0, ckpt_saved=True, verify_exhausted=False))
+    self.assertSequenceEqual(outputs, range(5))
+
+    # Restore from checkpoint and produce rest of the elements.
+    outputs.extend(
+        self.gen_outputs(
+            self.ds_fn, [],
+            self.num_outputs - 5,
+            ckpt_saved=True,
+            verify_exhausted=False))
+    self.assertSequenceEqual(outputs, list(range(10)) * 3)
+
+  def testUnusedCheckpointError(self):
+    # Produce 5 elements and save ckpt.
+    outputs = self.gen_outputs(self.ds_fn, [], 5, verify_exhausted=False)
+    self.assertSequenceEqual(outputs, range(5))
+
+    # Since the complete cache has not been written, a new iterator which does
+    # not restore the checkpoint will throw an error since there is a partial
+    # cache shard.
+    with self.assertRaises(errors.AlreadyExistsError):
+      outputs = self.gen_outputs(
+          self.ds_fn, [], self.num_outputs, verify_exhausted=False)
+
+  def testIgnoreCheckpointIfCacheWritten(self):
+    # Produce 15 elements and save ckpt. This will write the complete cache.
+    outputs = self.gen_outputs(self.ds_fn, [], 15, verify_exhausted=False)
+    self.assertSequenceEqual(outputs, list(range(10)) + list(range(5)))
+
+    # Build the iterator again but do not restore from ckpt. Since the cache
+    # has already been written we should be able to use it.
+    outputs = self.gen_outputs(
+        self.ds_fn, [], self.num_outputs, verify_exhausted=False)
+    self.assertSequenceEqual(outputs, list(range(10)) * 3)
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/contrib/data/python/kernel_tests/csv_dataset_op_test.py b/tensorflow/contrib/data/python/kernel_tests/csv_dataset_op_test.py
index 641a389c033687ebe081963182390b00230e4cb5..97b5e9416521dcad9ee5047a8275f8fd0142e338 100644
--- a/tensorflow/contrib/data/python/kernel_tests/csv_dataset_op_test.py
+++ b/tensorflow/contrib/data/python/kernel_tests/csv_dataset_op_test.py
@@ -19,11 +19,13 @@ from __future__ import division
 from __future__ import print_function
 
 import os
+import string
 import tempfile
 import time
 
 import numpy as np
 
+from tensorflow.contrib.data.python.ops import error_ops
 from tensorflow.contrib.data.python.ops import readers
 from tensorflow.python.client import session
 from tensorflow.python.data.ops import readers as core_readers
@@ -60,12 +62,12 @@ class CsvDatasetOpTest(test.TestCase):
         op2 = sess.run(next2)
         self.assertAllEqual(op1, op2)
 
-  def setup_files(self, inputs):
+  def setup_files(self, inputs, linebreak='\n'):
     filenames = []
     for i, ip in enumerate(inputs):
-      fn = os.path.join(self.get_temp_dir(), 'temp_%d.txt' % i)
-      with open(fn, 'w') as f:
-        f.write('\n'.join(ip))
+      fn = os.path.join(self.get_temp_dir(), 'temp_%d.csv' % i)
+      with open(fn, 'wb') as f:
+        f.write(linebreak.join(ip).encode('utf-8'))
       filenames.append(fn)
     return filenames
 
@@ -85,38 +87,47 @@ class CsvDatasetOpTest(test.TestCase):
           inputs, **kwargs)
       self._assert_datasets_equal(g, dataset_actual, dataset_expected)
 
+  def _verify_output_or_err(self,
+                            sess,
+                            dataset,
+                            expected_output=None,
+                            expected_err_re=None):
+    nxt = dataset.make_one_shot_iterator().get_next()
+    if expected_err_re is None:
+      # Verify that output is expected, without errors
+      expected_output = [[
+          v.encode('utf-8') if isinstance(v, str) else v for v in op
+      ] for op in expected_output]
+      for value in expected_output:
+        op = sess.run(nxt)
+        self.assertAllEqual(op, value)
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(nxt)
+    else:
+      # Verify that OpError is produced as expected
+      with self.assertRaisesOpError(expected_err_re):
+        while True:
+          try:
+            sess.run(nxt)
+          except errors.OutOfRangeError:
+            break
+
   def _test_dataset(self,
                     inputs,
                     expected_output=None,
                     expected_err_re=None,
+                    linebreak='\n',
                     **kwargs):
     """Checks that elements produced by CsvDataset match expected output."""
     # Convert str type because py3 tf strings are bytestrings
-    filenames = self.setup_files(inputs)
+    filenames = self.setup_files(inputs, linebreak)
     with ops.Graph().as_default() as g:
       with self.test_session(graph=g) as sess:
         dataset = readers.CsvDataset(filenames, **kwargs)
-        nxt = dataset.make_one_shot_iterator().get_next()
-        if expected_err_re is None:
-          # Verify that output is expected, without errors
-          expected_output = [[
-              v.encode('utf-8') if isinstance(v, str) else v for v in op
-          ] for op in expected_output]
-          for value in expected_output:
-            op = sess.run(nxt)
-            self.assertAllEqual(op, value)
-          with self.assertRaises(errors.OutOfRangeError):
-            sess.run(nxt)
-        else:
-          # Verify that OpError is produced as expected
-          with self.assertRaisesOpError(expected_err_re):
-            while True:
-              try:
-                sess.run(nxt)
-              except errors.OutOfRangeError:
-                break
-
-  def testCsvDataset_floatRequired(self):
+        self._verify_output_or_err(sess, dataset, expected_output,
+                                   expected_err_re)
+
+  def testCsvDataset_requiredFields(self):
     record_defaults = [[]] * 4
     inputs = [['1,2,3,4']]
     self._test_by_comparison(inputs, record_defaults=record_defaults)
@@ -136,10 +147,55 @@ class CsvDatasetOpTest(test.TestCase):
     inputs = [['1.0,2.1,hello,4.3', '5.4,6.5,goodbye,8.7']]
     self._test_by_comparison(inputs, record_defaults=record_defaults)
 
-  def testCsvDataset_withQuoted(self):
-    record_defaults = [['']] * 4
-    inputs = [['1.0,2.1,"hello, it is me",4.3', '5.4,6.5,goodbye,8.7']]
-    self._test_by_comparison(inputs, record_defaults=record_defaults)
+  def testCsvDataset_withEmptyFields(self):
+    record_defaults = [[0]] * 4
+    inputs = [[',,,', '1,1,1,', ',2,2,2']]
+    self._test_dataset(
+        inputs, [[0, 0, 0, 0], [1, 1, 1, 0], [0, 2, 2, 2]],
+        record_defaults=record_defaults)
+
+  def testCsvDataset_errWithUnquotedQuotes(self):
+    record_defaults = [['']] * 3
+    inputs = [['1,2"3,4']]
+    self._test_dataset(
+        inputs,
+        expected_err_re='Unquoted fields cannot have quotes inside',
+        record_defaults=record_defaults)
+
+  def testCsvDataset_errWithUnescapedQuotes(self):
+    record_defaults = [['']] * 3
+    inputs = [['"a"b","c","d"']]
+    self._test_dataset(
+        inputs,
+        expected_err_re=
+        'Quote inside a string has to be escaped by another quote',
+        record_defaults=record_defaults)
+
+  def testCsvDataset_ignoreErrWithUnescapedQuotes(self):
+    record_defaults = [['']] * 3
+    inputs = [['1,"2"3",4', '1,"2"3",4",5,5', 'a,b,"c"d"', 'e,f,g']]
+    filenames = self.setup_files(inputs)
+    with ops.Graph().as_default() as g:
+      with self.test_session(graph=g) as sess:
+        dataset = readers.CsvDataset(filenames, record_defaults=record_defaults)
+        dataset = dataset.apply(error_ops.ignore_errors())
+        self._verify_output_or_err(sess, dataset, [['e', 'f', 'g']])
+
+  def testCsvDataset_ignoreErrWithUnquotedQuotes(self):
+    record_defaults = [['']] * 3
+    inputs = [['1,2"3,4', 'a,b,c"d', '9,8"7,6,5', 'e,f,g']]
+    filenames = self.setup_files(inputs)
+    with ops.Graph().as_default() as g:
+      with self.test_session(graph=g) as sess:
+        dataset = readers.CsvDataset(filenames, record_defaults=record_defaults)
+        dataset = dataset.apply(error_ops.ignore_errors())
+        self._verify_output_or_err(sess, dataset, [['e', 'f', 'g']])
+
+  def testCsvDataset_withNoQuoteDelimAndUnquotedQuotes(self):
+    record_defaults = [['']] * 3
+    inputs = [['1,2"3,4']]
+    self._test_by_comparison(
+        inputs, record_defaults=record_defaults, use_quote_delim=False)
 
   def testCsvDataset_mixedTypes(self):
     record_defaults = [
@@ -163,11 +219,6 @@ class CsvDatasetOpTest(test.TestCase):
     self._test_by_comparison(
         inputs, record_defaults=record_defaults, field_delim=':')
 
-  def testCsvDataset_withEmptyValues(self):
-    record_defaults = [[0]] * 4
-    inputs = [['1,,3,4', ',6,7,8']]
-    self._test_by_comparison(inputs, record_defaults=record_defaults)
-
   def testCsvDataset_withNaValue(self):
     record_defaults = [[0]] * 4
     inputs = [['1,NA,3,4', 'NA,6,7,8']]
@@ -175,8 +226,8 @@ class CsvDatasetOpTest(test.TestCase):
         inputs, record_defaults=record_defaults, na_value='NA')
 
   def testCsvDataset_withSelectCols(self):
-    record_defaults = [[0]] * 2
-    inputs = [['1,2,3,4', '5,6,7,8']]
+    record_defaults = [['']] * 2
+    inputs = [['1,2,3,4', '"5","6","7","8"']]
     self._test_by_comparison(
         inputs, record_defaults=record_defaults, select_cols=[1, 2])
 
@@ -189,27 +240,17 @@ class CsvDatasetOpTest(test.TestCase):
         record_defaults=record_defaults,
         select_cols=[3, 4])
 
+  def testCsvDataset_withOneCol(self):
+    record_defaults = [['NA']]
+    inputs = [['0', '', '2']]
+    self._test_dataset(
+        inputs, [['0'], ['NA'], ['2']], record_defaults=record_defaults)
+
   def testCsvDataset_withMultipleFiles(self):
     record_defaults = [[0]] * 4
     inputs = [['1,2,3,4', '5,6,7,8'], ['5,6,7,8']]
     self._test_by_comparison(inputs, record_defaults=record_defaults)
 
-  def testCsvDataset_withNewLine(self):
-    # In this case, we expect it to behave differently from
-    # TextLineDataset->map(decode_csv) since that flow has bugs
-    record_defaults = [['']] * 4
-    inputs = [['a,b,"""c""\n0","d\ne"', 'f,g,h,i']]
-    expected = [['a', 'b', '"c"\n0', 'd\ne'], ['f', 'g', 'h', 'i']]
-    self._test_dataset(inputs, expected, record_defaults=record_defaults)
-
-  def testCsvDataset_withMultipleNewLines(self):
-    # In this case, we expect it to behave differently from
-    # TextLineDataset->map(decode_csv) since that flow has bugs
-    record_defaults = [['']] * 4
-    inputs = [['a,"b\n\nx","""c""\n \n0","d\ne"', 'f,g,h,i']]
-    expected = [['a', 'b\n\nx', '"c"\n \n0', 'd\ne'], ['f', 'g', 'h', 'i']]
-    self._test_dataset(inputs, expected, record_defaults=record_defaults)
-
   def testCsvDataset_withLeadingAndTrailingSpaces(self):
     record_defaults = [[0.0]] * 4
     inputs = [['0, 1, 2, 3']]
@@ -265,9 +306,10 @@ class CsvDatasetOpTest(test.TestCase):
   def testCsvDataset_errorWithHeaderEmptyFile(self):
     record_defaults = [[0]] * 2
     inputs = [[]]
+    expected_err_re = "Can't read header of file"
     self._test_dataset(
         inputs,
-        expected_err_re="Can't read header of empty file",
+        expected_err_re=expected_err_re,
         record_defaults=record_defaults,
         header=True,
     )
@@ -283,7 +325,7 @@ class CsvDatasetOpTest(test.TestCase):
     inputs = [['', '1,2']]  # First record is empty
     self._test_dataset(
         inputs,
-        expected_err_re='Expect 2 fields but have 0 in record',
+        expected_err_re='Expect 2 fields but have 1 in record',
         record_defaults=record_defaults)
 
   def testCsvDataset_withChainedOps(self):
@@ -300,7 +342,7 @@ class CsvDatasetOpTest(test.TestCase):
 
   def testCsvDataset_withTypeDefaults(self):
     # Testing using dtypes as record_defaults for required fields
-    record_defaults = [dtypes.float32, dtypes.float32]
+    record_defaults = [dtypes.float32, [0.0]]
     inputs = [['1.0,2.0', '3.0,4.0']]
     self._test_dataset(
         inputs,
@@ -308,71 +350,270 @@ class CsvDatasetOpTest(test.TestCase):
         record_defaults=record_defaults,
     )
 
+  def testMakeCsvDataset_fieldOrder(self):
+    data = [[
+        '1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19',
+        '1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19'
+    ]]
+    file_path = self.setup_files(data)
+
+    with ops.Graph().as_default() as g:
+      ds = readers.make_csv_dataset(
+          file_path, batch_size=1, shuffle=False, num_epochs=1)
+      next_batch = ds.make_one_shot_iterator().get_next()
+
+    with self.test_session(graph=g) as sess:
+      result = list(sess.run(next_batch).values())
+
+    self.assertEqual(result, sorted(result))
+
+## The following tests exercise parsing logic for quoted fields
+
+  def testCsvDataset_withQuoted(self):
+    record_defaults = [['']] * 4
+    inputs = [['"a","b","c :)","d"', '"e","f","g :(","h"']]
+    self._test_by_comparison(inputs, record_defaults=record_defaults)
+
+  def testCsvDataset_withOneColAndQuotes(self):
+    record_defaults = [['']]
+    inputs = [['"0"', '"1"', '"2"']]
+    self._test_dataset(
+        inputs, [['0'], ['1'], ['2']], record_defaults=record_defaults)
+
+  def testCsvDataset_withNewLine(self):
+    # In this case, we expect it to behave differently from
+    # TextLineDataset->map(decode_csv) since that flow has bugs
+    record_defaults = [['']] * 4
+    inputs = [['a,b,"""c""\n0","d\ne"', 'f,g,h,i']]
+    expected = [['a', 'b', '"c"\n0', 'd\ne'], ['f', 'g', 'h', 'i']]
+    self._test_dataset(inputs, expected, record_defaults=record_defaults)
+
+  def testCsvDataset_withNewLineInUnselectedCol(self):
+    record_defaults = [['']]
+    inputs = [['1,"2\n3",4', '5,6,7']]
+    self._test_dataset(
+        inputs,
+        expected_output=[['1'], ['5']],
+        record_defaults=record_defaults,
+        select_cols=[0])
+
+  def testCsvDataset_withMultipleNewLines(self):
+    # In this case, we expect it to behave differently from
+    # TextLineDataset->map(decode_csv) since that flow has bugs
+    record_defaults = [['']] * 4
+    inputs = [['a,"b\n\nx","""c""\n \n0","d\ne"', 'f,g,h,i']]
+    expected = [['a', 'b\n\nx', '"c"\n \n0', 'd\ne'], ['f', 'g', 'h', 'i']]
+    self._test_dataset(inputs, expected, record_defaults=record_defaults)
+
+  def testCsvDataset_errorWithTerminateMidRecord(self):
+    record_defaults = [['']] * 4
+    inputs = [['a,b,c,"a']]
+    self._test_dataset(
+        inputs,
+        expected_err_re=
+        'Reached end of file without closing quoted field in record',
+        record_defaults=record_defaults)
+
+  def testCsvDataset_withEscapedQuotes(self):
+    record_defaults = [['']] * 4
+    inputs = [['1.0,2.1,"she said: ""hello""",4.3', '5.4,6.5,goodbye,8.7']]
+    self._test_by_comparison(inputs, record_defaults=record_defaults)
+
+
+## Testing that parsing works with all buffer sizes, quoted/unquoted fields,
+## and different types of line breaks
+
+  def testCsvDataset_withInvalidBufferSize(self):
+    record_defaults = [['']] * 4
+    inputs = [['a,b,c,d']]
+    self._test_dataset(
+        inputs,
+        expected_err_re='buffer_size should be positive',
+        record_defaults=record_defaults,
+        buffer_size=0)
+
+  def testCsvDataset_withBufferSize(self):
+    record_defaults = [['NA']] * 3
+    inputs = [['abc,def,ghi', '0,1,2', ',,']]
+    expected = [['abc', 'def', 'ghi'], ['0', '1', '2'], ['NA', 'NA', 'NA']]
+    for i in range(20):
+      # Test a range of buffer sizes that should all work
+      self._test_dataset(
+          inputs, expected, record_defaults=record_defaults, buffer_size=i + 1)
+
+  def testCsvDataset_withCR(self):
+    # Test that when the line separator is '\r', parsing works with all buffer
+    # sizes
+    record_defaults = [['NA']] * 3
+    inputs = [['abc,def,ghi', '0,1,2', ',,']]
+    expected = [['abc', 'def', 'ghi'], ['0', '1', '2'], ['NA', 'NA', 'NA']]
+    for i in range(20):
+      # Test a range of buffer sizes that should all work
+      self._test_dataset(
+          inputs,
+          expected,
+          linebreak='\r',
+          record_defaults=record_defaults,
+          buffer_size=i + 1)
+
+  def testCsvDataset_withCRLF(self):
+    # Test that when the line separator is '\r\n', parsing works with all buffer
+    # sizes
+    record_defaults = [['NA']] * 3
+    inputs = [['abc,def,ghi', '0,1,2', ',,']]
+    expected = [['abc', 'def', 'ghi'], ['0', '1', '2'], ['NA', 'NA', 'NA']]
+    for i in range(20):
+      # Test a range of buffer sizes that should all work
+      self._test_dataset(
+          inputs,
+          expected,
+          linebreak='\r\n',
+          record_defaults=record_defaults,
+          buffer_size=i + 1)
+
+  def testCsvDataset_withBufferSizeAndQuoted(self):
+    record_defaults = [['NA']] * 3
+    inputs = [['"\n\n\n","\r\r\r","abc"', '"0","1","2"', '"","",""']]
+    expected = [['\n\n\n', '\r\r\r', 'abc'], ['0', '1', '2'],
+                ['NA', 'NA', 'NA']]
+    for i in range(20):
+      # Test a range of buffer sizes that should all work
+      self._test_dataset(
+          inputs,
+          expected,
+          linebreak='\n',
+          record_defaults=record_defaults,
+          buffer_size=i + 1)
+    self._test_dataset(
+        inputs, expected, linebreak='\n', record_defaults=record_defaults)
+
+  def testCsvDataset_withCRAndQuoted(self):
+    # Test that when the line separator is '\r', parsing works with all buffer
+    # sizes
+    record_defaults = [['NA']] * 3
+    inputs = [['"\n\n\n","\r\r\r","abc"', '"0","1","2"', '"","",""']]
+    expected = [['\n\n\n', '\r\r\r', 'abc'], ['0', '1', '2'],
+                ['NA', 'NA', 'NA']]
+    for i in range(20):
+      # Test a range of buffer sizes that should all work
+      self._test_dataset(
+          inputs,
+          expected,
+          linebreak='\r',
+          record_defaults=record_defaults,
+          buffer_size=i + 1)
+    self._test_dataset(
+        inputs, expected, linebreak='\r', record_defaults=record_defaults)
+
+  def testCsvDataset_withCRLFAndQuoted(self):
+    # Test that when the line separator is '\r\n', parsing works with all buffer
+    # sizes
+    record_defaults = [['NA']] * 3
+    inputs = [['"\n\n\n","\r\r\r","abc"', '"0","1","2"', '"","",""']]
+    expected = [['\n\n\n', '\r\r\r', 'abc'], ['0', '1', '2'],
+                ['NA', 'NA', 'NA']]
+    for i in range(20):
+      # Test a range of buffer sizes that should all work
+      self._test_dataset(
+          inputs,
+          expected,
+          linebreak='\r\n',
+          record_defaults=record_defaults,
+          buffer_size=i + 1)
+    self._test_dataset(
+        inputs, expected, linebreak='\r\n', record_defaults=record_defaults)
+
 
 class CsvDatasetBenchmark(test.Benchmark):
   """Benchmarks for the various ways of creating a dataset from CSV files.
   """
+  FLOAT_VAL = '1.23456E12'
+  STR_VAL = string.ascii_letters * 10
 
-  def _setUp(self):
+  def _setUp(self, str_val):
     # Since this isn't test.TestCase, have to manually create a test dir
     gfile.MakeDirs(googletest.GetTempDir())
     self._temp_dir = tempfile.mkdtemp(dir=googletest.GetTempDir())
 
     self._num_cols = [4, 64, 256]
-    self._batch_size = 500
+    self._num_per_iter = 5000
     self._filenames = []
     for n in self._num_cols:
       fn = os.path.join(self._temp_dir, 'file%d.csv' % n)
-      with open(fn, 'w') as f:
-        # Just write 10 rows and use `repeat`...
-        row = ','.join(['1.23456E12' for _ in range(n)])
-        f.write('\n'.join([row for _ in range(10)]))
+      with open(fn, 'wb') as f:
+        # Just write 100 rows and use `repeat`... Assumes the cost
+        # of creating an iterator is not significant
+        row = ','.join([str_val for _ in range(n)])
+        f.write('\n'.join([row for _ in range(100)]))
       self._filenames.append(fn)
 
   def _tearDown(self):
     gfile.DeleteRecursively(self._temp_dir)
 
   def _runBenchmark(self, dataset, num_cols, prefix):
-    next_element = dataset.make_one_shot_iterator().get_next()
-    with session.Session() as sess:
-      for _ in range(5):
-        sess.run(next_element)
-      deltas = []
-      for _ in range(10):
+    dataset = dataset.skip(self._num_per_iter - 1)
+    deltas = []
+    for _ in range(10):
+      next_element = dataset.make_one_shot_iterator().get_next()
+      with session.Session() as sess:
         start = time.time()
+        # NOTE: This depends on the underlying implementation of skip, to have
+        # the net effect of calling `GetNext` num_per_iter times on the
+        # input dataset. We do it this way (instead of a python for loop, or
+        # batching N inputs in one iter) so that the overhead from session.run
+        # or batch doesn't dominate. If we eventually optimize skip, this has
+        # to change.
         sess.run(next_element)
         end = time.time()
-        deltas.append(end - start)
-    median_wall_time = np.median(deltas) / 100
+      deltas.append(end - start)
+    # Median wall time per CSV record read and decoded
+    median_wall_time = np.median(deltas) / self._num_per_iter
     print('%s num_cols: %d Median wall time: %f' % (prefix, num_cols,
                                                     median_wall_time))
     self.report_benchmark(
-        iters=self._batch_size,
+        iters=self._num_per_iter,
         wall_time=median_wall_time,
         name='%s_with_cols_%d' % (prefix, num_cols))
 
-  def benchmarkBatchThenMap(self):
-    self._setUp()
+  def benchmarkMapWithFloats(self):
+    self._setUp(self.FLOAT_VAL)
     for i in range(len(self._filenames)):
       num_cols = self._num_cols[i]
       kwargs = {'record_defaults': [[0.0]] * num_cols}
       dataset = core_readers.TextLineDataset(self._filenames[i]).repeat()
       dataset = dataset.map(lambda l: gen_parsing_ops.decode_csv(l, **kwargs))  # pylint: disable=cell-var-from-loop
-      dataset = dataset.batch(self._batch_size)
-      self._runBenchmark(dataset, num_cols, 'csv_map_then_batch')
+      self._runBenchmark(dataset, num_cols, 'csv_float_map_decode_csv')
+    self._tearDown()
+
+  def benchmarkMapWithStrings(self):
+    self._setUp(self.STR_VAL)
+    for i in range(len(self._filenames)):
+      num_cols = self._num_cols[i]
+      kwargs = {'record_defaults': [['']] * num_cols}
+      dataset = core_readers.TextLineDataset(self._filenames[i]).repeat()
+      dataset = dataset.map(lambda l: gen_parsing_ops.decode_csv(l, **kwargs))  # pylint: disable=cell-var-from-loop
+      self._runBenchmark(dataset, num_cols, 'csv_strings_map_decode_csv')
     self._tearDown()
 
-  def benchmarkCsvDataset(self):
-    self._setUp()
+  def benchmarkCsvDatasetWithFloats(self):
+    self._setUp(self.FLOAT_VAL)
     for i in range(len(self._filenames)):
       num_cols = self._num_cols[i]
       kwargs = {'record_defaults': [[0.0]] * num_cols}
       dataset = core_readers.TextLineDataset(self._filenames[i]).repeat()
       dataset = readers.CsvDataset(self._filenames[i], **kwargs).repeat()  # pylint: disable=cell-var-from-loop
-      dataset = dataset.batch(self._batch_size)
-      self._runBenchmark(dataset, num_cols, 'csv_fused_dataset')
+      self._runBenchmark(dataset, num_cols, 'csv_float_fused_dataset')
     self._tearDown()
 
+  def benchmarkCsvDatasetWithStrings(self):
+    self._setUp(self.STR_VAL)
+    for i in range(len(self._filenames)):
+      num_cols = self._num_cols[i]
+      kwargs = {'record_defaults': [['']] * num_cols}
+      dataset = core_readers.TextLineDataset(self._filenames[i]).repeat()
+      dataset = readers.CsvDataset(self._filenames[i], **kwargs).repeat()  # pylint: disable=cell-var-from-loop
+      self._runBenchmark(dataset, num_cols, 'csv_strings_fused_dataset')
+    self._tearDown()
 
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/contrib/data/python/kernel_tests/dataset_serialization_test_base.py b/tensorflow/contrib/data/python/kernel_tests/dataset_serialization_test_base.py
index 78ecce8f7daaf84002ae78d8d77820755b967d89..393f08850b1865180a8b94e9209b2445b54c8b69 100644
--- a/tensorflow/contrib/data/python/kernel_tests/dataset_serialization_test_base.py
+++ b/tensorflow/contrib/data/python/kernel_tests/dataset_serialization_test_base.py
@@ -467,7 +467,8 @@ class DatasetSerializationTestBase(test.TestCase):
                   ckpt_saved=False,
                   init_before_restore=False,
                   sparse_tensors=False,
-                  verify_exhausted=True):
+                  verify_exhausted=True,
+                  save_checkpoint_at_end=True):
     """Generates elements from input dataset while stopping at break points.
 
     Produces `num_outputs` outputs and saves the state of the iterator in the
@@ -490,6 +491,10 @@ class DatasetSerializationTestBase(test.TestCase):
       sparse_tensors:  Whether dataset is built from SparseTensor(s).
       verify_exhausted: Whether to verify that the iterator has been exhausted
         after producing `num_outputs` elements.
+      save_checkpoint_at_end: Whether to save a checkpoint after producing all
+        outputs. If False, checkpoints are saved each break point but not at the
+        end. Note that checkpoints overwrite each other so there is always only
+        a single checkpoint available. Defaults to True.
 
     Returns:
       A list of `num_outputs` items.
@@ -526,8 +531,9 @@ class DatasetSerializationTestBase(test.TestCase):
           if i == len(break_points) and verify_exhausted:
             with self.assertRaises(errors.OutOfRangeError):
               sess.run(get_next_op)
-          self._save(sess, saver)
-          ckpt_saved = True
+          if save_checkpoint_at_end or i < len(break_points):
+            self._save(sess, saver)
+            ckpt_saved = True
 
     return outputs
 
diff --git a/tensorflow/contrib/data/python/kernel_tests/directed_interleave_dataset_test.py b/tensorflow/contrib/data/python/kernel_tests/directed_interleave_dataset_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..34b6a080c0aae7dfc228746139acc52cea4e6f28
--- /dev/null
+++ b/tensorflow/contrib/data/python/kernel_tests/directed_interleave_dataset_test.py
@@ -0,0 +1,167 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for the experimental input pipeline ops."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.contrib.data.python.kernel_tests import dataset_serialization_test_base
+from tensorflow.contrib.data.python.ops import interleave_ops
+from tensorflow.python.data.ops import dataset_ops
+from tensorflow.python.framework import errors
+from tensorflow.python.framework import random_seed
+from tensorflow.python.platform import test
+
+
+class DirectedInterleaveDatasetTest(test.TestCase):
+
+  def testBasic(self):
+    selector_dataset = dataset_ops.Dataset.range(10).repeat(100)
+    input_datasets = [
+        dataset_ops.Dataset.from_tensors(i).repeat(100) for i in range(10)
+    ]
+    dataset = interleave_ops.DirectedInterleaveDataset(selector_dataset,
+                                                       input_datasets)
+    iterator = dataset.make_initializable_iterator()
+    next_element = iterator.get_next()
+
+    with self.test_session() as sess:
+      sess.run(iterator.initializer)
+      for _ in range(100):
+        for i in range(10):
+          self.assertEqual(i, sess.run(next_element))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(next_element)
+
+  def _normalize(self, vec):
+    return vec / vec.sum()
+
+  def _chi2(self, expected, actual):
+    actual = np.asarray(actual)
+    expected = np.asarray(expected)
+    diff = actual - expected
+    chi2 = np.sum(diff * diff / expected, axis=0)
+    return chi2
+
+  def _testSampleFromDatasetsHelper(self, weights, num_datasets, num_samples):
+    # Create a dataset that samples each integer in `[0, num_datasets)`
+    # with probability given by `weights[i]`.
+    dataset = interleave_ops.sample_from_datasets([
+        dataset_ops.Dataset.from_tensors(i).repeat(None)
+        for i in range(num_datasets)
+    ], weights)
+    dataset = dataset.take(num_samples)
+    iterator = dataset.make_one_shot_iterator()
+    next_element = iterator.get_next()
+
+    with self.test_session() as sess:
+      freqs = np.zeros([num_datasets])
+      for _ in range(num_samples):
+        freqs[sess.run(next_element)] += 1
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(next_element)
+
+    return freqs
+
+  def testSampleFromDatasets(self):
+    random_seed.set_random_seed(1619)
+    num_samples = 5000
+    rand_probs = self._normalize(np.random.random_sample((15,)))
+
+    # Use chi-squared test to assert that the observed distribution matches the
+    # expected distribution. Based on the implementation in
+    # "tensorflow/python/kernel_tests/multinomial_op_test.py".
+    for probs in [[.85, .05, .1], rand_probs]:
+      probs = np.asarray(probs)
+      classes = len(probs)
+      freqs = self._testSampleFromDatasetsHelper(probs, classes, num_samples)
+      self.assertLess(self._chi2(probs, freqs / num_samples), 1e-2)
+
+      # Also check that `weights` as a dataset samples correctly.
+      probs_ds = dataset_ops.Dataset.from_tensors(probs).repeat()
+      freqs = self._testSampleFromDatasetsHelper(probs_ds, classes, num_samples)
+      self.assertLess(self._chi2(probs, freqs / num_samples), 1e-2)
+
+  def testSelectFromDatasets(self):
+    words = [b"foo", b"bar", b"baz"]
+    datasets = [dataset_ops.Dataset.from_tensors(w).repeat() for w in words]
+    choice_array = np.random.randint(3, size=(15,), dtype=np.int64)
+    choice_dataset = dataset_ops.Dataset.from_tensor_slices(choice_array)
+    dataset = interleave_ops.choose_from_datasets(datasets, choice_dataset)
+    iterator = dataset.make_one_shot_iterator()
+    next_element = iterator.get_next()
+
+    with self.test_session() as sess:
+      for i in choice_array:
+        self.assertEqual(words[i], sess.run(next_element))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(next_element)
+
+  def testErrors(self):
+    with self.assertRaisesRegexp(ValueError,
+                                 r"vector of length `len\(datasets\)`"):
+      interleave_ops.sample_from_datasets(
+          [dataset_ops.Dataset.range(10),
+           dataset_ops.Dataset.range(20)],
+          weights=[0.25, 0.25, 0.25, 0.25])
+
+    with self.assertRaisesRegexp(TypeError, "`tf.float32` or `tf.float64`"):
+      interleave_ops.sample_from_datasets(
+          [dataset_ops.Dataset.range(10),
+           dataset_ops.Dataset.range(20)],
+          weights=[1, 1])
+
+    with self.assertRaisesRegexp(TypeError, "must have the same type"):
+      interleave_ops.sample_from_datasets([
+          dataset_ops.Dataset.from_tensors(0),
+          dataset_ops.Dataset.from_tensors(0.0)
+      ])
+
+    with self.assertRaisesRegexp(TypeError, "tf.int64"):
+      interleave_ops.choose_from_datasets([
+          dataset_ops.Dataset.from_tensors(0),
+          dataset_ops.Dataset.from_tensors(1)
+      ], choice_dataset=dataset_ops.Dataset.from_tensors(1.0))
+
+    with self.assertRaisesRegexp(TypeError, "scalar"):
+      interleave_ops.choose_from_datasets([
+          dataset_ops.Dataset.from_tensors(0),
+          dataset_ops.Dataset.from_tensors(1)
+      ], choice_dataset=dataset_ops.Dataset.from_tensors([1.0]))
+
+
+class SampleFromDatasetsSerializationTest(
+    dataset_serialization_test_base.DatasetSerializationTestBase):
+
+  def _build_dataset(self, probs, num_samples):
+    dataset = interleave_ops.sample_from_datasets(
+        [
+            dataset_ops.Dataset.from_tensors(i).repeat(None)
+            for i in range(len(probs))
+        ],
+        probs,
+        seed=1813)
+    return dataset.take(num_samples)
+
+  def testSerializationCore(self):
+    self.run_core_tests(
+        lambda: self._build_dataset([0.5, 0.5], 100),
+        lambda: self._build_dataset([0.25, 0.25, 0.25, 0.25], 1000), 100)
+
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/contrib/data/python/kernel_tests/interleave_dataset_op_test.py b/tensorflow/contrib/data/python/kernel_tests/interleave_dataset_op_test.py
index 43aa4b1bd02791ff304a990c0bbe8e45534c0c77..bee561e3e23a2ab6f314894caa21785347e6ca8b 100644
--- a/tensorflow/contrib/data/python/kernel_tests/interleave_dataset_op_test.py
+++ b/tensorflow/contrib/data/python/kernel_tests/interleave_dataset_op_test.py
@@ -30,7 +30,6 @@ from tensorflow.contrib.data.python.ops import interleave_ops
 from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import errors
-from tensorflow.python.framework import random_seed
 from tensorflow.python.framework import sparse_tensor
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import math_ops
@@ -907,114 +906,5 @@ class ParallelInterleaveDatasetTest(test.TestCase):
         sess.run(self.next_element)
 
 
-class DirectedInterleaveDatasetTest(test.TestCase):
-
-  def testBasic(self):
-    selector_dataset = dataset_ops.Dataset.range(10).repeat(100)
-    input_datasets = [
-        dataset_ops.Dataset.from_tensors(i).repeat(100) for i in range(10)
-    ]
-    dataset = interleave_ops.DirectedInterleaveDataset(selector_dataset,
-                                                       input_datasets)
-    iterator = dataset.make_initializable_iterator()
-    next_element = iterator.get_next()
-
-    with self.test_session() as sess:
-      sess.run(iterator.initializer)
-      for _ in range(100):
-        for i in range(10):
-          self.assertEqual(i, sess.run(next_element))
-      with self.assertRaises(errors.OutOfRangeError):
-        sess.run(next_element)
-
-  def _normalize(self, vec):
-    return vec / vec.sum()
-
-  def _chi2(self, expected, actual):
-    actual = np.asarray(actual)
-    expected = np.asarray(expected)
-    diff = actual - expected
-    chi2 = np.sum(diff * diff / expected, axis=0)
-    return chi2
-
-  def _testSampleFromDatasetsHelper(self, weights, num_datasets, num_samples):
-    # Create a dataset that samples each integer in `[0, num_datasets)`
-    # with probability given by `weights[i]`.
-    dataset = interleave_ops.sample_from_datasets([
-        dataset_ops.Dataset.from_tensors(i).repeat(None)
-        for i in range(num_datasets)
-    ], weights)
-    dataset = dataset.take(num_samples)
-    iterator = dataset.make_one_shot_iterator()
-    next_element = iterator.get_next()
-
-    with self.test_session() as sess:
-      freqs = np.zeros([num_datasets])
-      for _ in range(num_samples):
-        freqs[sess.run(next_element)] += 1
-      with self.assertRaises(errors.OutOfRangeError):
-        sess.run(next_element)
-
-    return freqs
-
-  def testSampleFromDatasets(self):
-    random_seed.set_random_seed(1619)
-    num_samples = 10000
-    rand_probs = self._normalize(np.random.random_sample((15,)))
-
-    # Use chi-squared test to assert that the observed distribution matches the
-    # expected distribution. Based on the implementation in
-    # "tensorflow/python/kernel_tests/multinomial_op_test.py".
-    for probs in [[.85, .05, .1], rand_probs]:
-      probs = np.asarray(probs)
-      classes = len(probs)
-      freqs = self._testSampleFromDatasetsHelper(probs, classes, num_samples)
-      self.assertLess(self._chi2(probs, freqs / num_samples), 1e-3)
-
-      # Also check that `weights` as a dataset samples correctly.
-      probs_ds = dataset_ops.Dataset.from_tensors(probs).repeat()
-      freqs = self._testSampleFromDatasetsHelper(probs_ds, classes, num_samples)
-      self.assertLess(self._chi2(probs, freqs / num_samples), 1e-3)
-
-  def testErrors(self):
-    with self.assertRaisesRegexp(ValueError,
-                                 r"vector of length `len\(datasets\)`"):
-      interleave_ops.sample_from_datasets(
-          [dataset_ops.Dataset.range(10),
-           dataset_ops.Dataset.range(20)],
-          weights=[0.25, 0.25, 0.25, 0.25])
-
-    with self.assertRaisesRegexp(TypeError, "`tf.float32` or `tf.float64`"):
-      interleave_ops.sample_from_datasets(
-          [dataset_ops.Dataset.range(10),
-           dataset_ops.Dataset.range(20)],
-          weights=[1, 1])
-
-    with self.assertRaisesRegexp(TypeError, "must have the same type"):
-      interleave_ops.sample_from_datasets([
-          dataset_ops.Dataset.from_tensors(0),
-          dataset_ops.Dataset.from_tensors(0.0)
-      ])
-
-
-class SampleFromDatasetsSerializationTest(
-    dataset_serialization_test_base.DatasetSerializationTestBase):
-
-  def _build_dataset(self, probs, num_samples):
-    dataset = interleave_ops.sample_from_datasets(
-        [
-            dataset_ops.Dataset.from_tensors(i).repeat(None)
-            for i in range(len(probs))
-        ],
-        probs,
-        seed=1813)
-    return dataset.take(num_samples)
-
-  def testSerializationCore(self):
-    self.run_core_tests(
-        lambda: self._build_dataset([0.5, 0.5], 100),
-        lambda: self._build_dataset([0.25, 0.25, 0.25, 0.25], 1000), 100)
-
-
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/contrib/data/python/kernel_tests/optimize_dataset_op_test.py b/tensorflow/contrib/data/python/kernel_tests/optimize_dataset_op_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..30f1847dcddbfaf379ef2b09185f7a8db4aaeae2
--- /dev/null
+++ b/tensorflow/contrib/data/python/kernel_tests/optimize_dataset_op_test.py
@@ -0,0 +1,89 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for the experimental input pipeline ops."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.contrib.data.python.kernel_tests import dataset_serialization_test_base
+from tensorflow.contrib.data.python.ops import optimization
+from tensorflow.core.framework import graph_pb2
+from tensorflow.python.data.ops import dataset_ops
+from tensorflow.python.framework import errors
+from tensorflow.python.platform import test
+
+
+class OptimizeDatasetTest(test.TestCase):
+
+  def testDefaultOptimizations(self):
+    dataset = dataset_ops.Dataset.range(10).map(lambda x: x * x).batch(
+        10).apply(optimization.optimize())
+    iterator = dataset.make_one_shot_iterator()
+    get_next = iterator.get_next()
+
+    with self.test_session() as sess:
+      graph = graph_pb2.GraphDef().FromString(
+          sess.run(dataset._as_serialized_graph()))
+      self.assertTrue(
+          all([node.op != "MapAndBatchDatasetV2" for node in graph.node]))
+      self.assertAllEqual([x * x for x in range(10)], sess.run(get_next))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(get_next)
+
+  def testEmptyOptimizations(self):
+    dataset = dataset_ops.Dataset.range(10).map(lambda x: x * x).batch(
+        10).apply(optimization.optimize([]))
+    iterator = dataset.make_one_shot_iterator()
+    get_next = iterator.get_next()
+
+    with self.test_session() as sess:
+      graph = graph_pb2.GraphDef().FromString(
+          sess.run(dataset._as_serialized_graph()))
+      self.assertTrue(
+          all([node.op != "MapAndBatchDatasetV2" for node in graph.node]))
+      self.assertAllEqual([x * x for x in range(10)], sess.run(get_next))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(get_next)
+
+  def testOptimization(self):
+    dataset = dataset_ops.Dataset.range(10).map(lambda x: x * x).batch(
+        10).apply(optimization.optimize(["map_and_batch_fusion"]))
+    iterator = dataset.make_one_shot_iterator()
+    get_next = iterator.get_next()
+
+    with self.test_session() as sess:
+      graph = graph_pb2.GraphDef().FromString(
+          sess.run(dataset._as_serialized_graph()))
+      self.assertTrue(
+          any([node.op == "MapAndBatchDatasetV2" for node in graph.node]))
+      self.assertAllEqual([x * x for x in range(10)], sess.run(get_next))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(get_next)
+
+
+class OptimizeDatasetSerializationTest(
+    dataset_serialization_test_base.DatasetSerializationTestBase):
+
+  def testCore(self):
+
+    def build_dataset(num_elements, batch_size):
+      return dataset_ops.Dataset.range(num_elements).map(lambda x: x * x).batch(
+          batch_size).apply(optimization.optimize(["map_and_batch_fusion"]))
+
+    self.run_core_tests(lambda: build_dataset(200, 10), None, 20)
+
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/contrib/data/python/kernel_tests/reader_dataset_ops_test.py b/tensorflow/contrib/data/python/kernel_tests/reader_dataset_ops_test.py
index e0237198b7d47eb98eeffe88d28bf9681b2722c6..3b07ef290bc38daa37472ef8919f3350851fe370 100644
--- a/tensorflow/contrib/data/python/kernel_tests/reader_dataset_ops_test.py
+++ b/tensorflow/contrib/data/python/kernel_tests/reader_dataset_ops_test.py
@@ -24,9 +24,8 @@ import zlib
 import numpy as np
 
 from tensorflow.contrib.data.python.kernel_tests import dataset_serialization_test_base
+from tensorflow.contrib.data.python.kernel_tests import reader_dataset_ops_test_base
 from tensorflow.contrib.data.python.ops import readers
-from tensorflow.core.example import example_pb2
-from tensorflow.core.example import feature_pb2
 from tensorflow.python.data.ops import iterator_ops
 from tensorflow.python.data.ops import readers as core_readers
 from tensorflow.python.framework import constant_op
@@ -280,163 +279,8 @@ def _interleave(iterators, cycle_length):
           num_open -= 1
 
 
-class ReadBatchFeaturesTest(test.TestCase):
-
-  def setUp(self):
-    super(ReadBatchFeaturesTest, self).setUp()
-    self._num_files = 2
-    self._num_records = 7
-    self.test_filenames = self._createFiles()
-
-  def _read_batch_features(self,
-                           filenames,
-                           num_epochs,
-                           batch_size,
-                           reader_num_threads=1,
-                           parser_num_threads=1,
-                           shuffle=False,
-                           shuffle_seed=None,
-                           drop_final_batch=False):
-    self.filenames = filenames
-    self.num_epochs = num_epochs
-    self.batch_size = batch_size
-
-    return readers.make_batched_features_dataset(
-        file_pattern=self.filenames,
-        batch_size=self.batch_size,
-        features={
-            "file": parsing_ops.FixedLenFeature([], dtypes.int64),
-            "record": parsing_ops.FixedLenFeature([], dtypes.int64),
-            "keywords": parsing_ops.VarLenFeature(dtypes.string)
-        },
-        reader=core_readers.TFRecordDataset,
-        num_epochs=self.num_epochs,
-        shuffle=shuffle,
-        shuffle_seed=shuffle_seed,
-        reader_num_threads=reader_num_threads,
-        parser_num_threads=parser_num_threads,
-        drop_final_batch=drop_final_batch).make_one_shot_iterator(
-        ).get_next()
-
-  def _record(self, f, r):
-    example = example_pb2.Example(
-        features=feature_pb2.Features(
-            feature={
-                "file":
-                    feature_pb2.Feature(
-                        int64_list=feature_pb2.Int64List(value=[f])),
-                "record":
-                    feature_pb2.Feature(
-                        int64_list=feature_pb2.Int64List(value=[r])),
-                "keywords":
-                    feature_pb2.Feature(
-                        bytes_list=feature_pb2.BytesList(
-                            value=self._get_keywords(f, r)))
-            }))
-    return example.SerializeToString()
-
-  def _get_keywords(self, f, r):
-    num_keywords = 1 + (f + r) % 2
-    keywords = []
-    for index in range(num_keywords):
-      keywords.append(compat.as_bytes("keyword%d" % index))
-    return keywords
-
-  def _createFiles(self):
-    filenames = []
-    for i in range(self._num_files):
-      fn = os.path.join(self.get_temp_dir(), "tf_record.%d.txt" % i)
-      filenames.append(fn)
-      writer = python_io.TFRecordWriter(fn)
-      for j in range(self._num_records):
-        writer.write(self._record(i, j))
-      writer.close()
-    return filenames
-
-  def _run_actual_batch(self, outputs, sess):
-    file_op = outputs["file"]
-    keywords_indices_op = outputs["keywords"].indices
-    keywords_values_op = outputs["keywords"].values
-    keywords_dense_shape_op = outputs["keywords"].dense_shape
-    record_op = outputs["record"]
-    return sess.run([
-        file_op, keywords_indices_op, keywords_values_op,
-        keywords_dense_shape_op, record_op
-    ])
-
-  def _next_actual_batch(self, sess):
-    return self._run_actual_batch(self.outputs, sess)
-
-  def _next_expected_batch(self,
-                           file_indices,
-                           batch_size,
-                           num_epochs,
-                           cycle_length=1):
-
-    def _next_record(file_indices):
-      for j in file_indices:
-        for i in range(self._num_records):
-          yield j, i
-
-    def _next_record_interleaved(file_indices, cycle_length):
-      return _interleave([_next_record([i]) for i in file_indices],
-                         cycle_length)
-
-    file_batch = []
-    keywords_batch_indices = []
-    keywords_batch_values = []
-    keywords_batch_max_len = 0
-    record_batch = []
-    batch_index = 0
-    for _ in range(num_epochs):
-      if cycle_length == 1:
-        next_records = _next_record(file_indices)
-      else:
-        next_records = _next_record_interleaved(file_indices, cycle_length)
-      for record in next_records:
-        f = record[0]
-        r = record[1]
-        file_batch.append(f)
-        record_batch.append(r)
-        keywords = self._get_keywords(f, r)
-        keywords_batch_values.extend(keywords)
-        keywords_batch_indices.extend(
-            [[batch_index, i] for i in range(len(keywords))])
-        batch_index += 1
-        keywords_batch_max_len = max(keywords_batch_max_len, len(keywords))
-        if len(file_batch) == batch_size:
-          yield [
-              file_batch, keywords_batch_indices, keywords_batch_values,
-              [batch_size, keywords_batch_max_len], record_batch
-          ]
-          file_batch = []
-          keywords_batch_indices = []
-          keywords_batch_values = []
-          keywords_batch_max_len = 0
-          record_batch = []
-          batch_index = 0
-    if file_batch:
-      yield [
-          file_batch, keywords_batch_indices, keywords_batch_values,
-          [len(file_batch), keywords_batch_max_len], record_batch
-      ]
-
-  def _verify_records(self,
-                      sess,
-                      batch_size,
-                      file_index=None,
-                      num_epochs=1,
-                      interleave_cycle_length=1):
-    if file_index is not None:
-      file_indices = [file_index]
-    else:
-      file_indices = range(self._num_files)
-
-    for expected_batch in self._next_expected_batch(
-        file_indices, batch_size, num_epochs, interleave_cycle_length):
-      actual_batch = self._next_actual_batch(sess)
-      for i in range(len(expected_batch)):
-        self.assertAllEqual(expected_batch[i], actual_batch[i])
+class ReadBatchFeaturesTest(
+    reader_dataset_ops_test_base.ReadBatchFeaturesTestBase):
 
   def testRead(self):
     for batch_size in [1, 2]:
@@ -444,33 +288,33 @@ class ReadBatchFeaturesTest(test.TestCase):
         with ops.Graph().as_default() as g:
           with self.test_session(graph=g) as sess:
             # Basic test: read from file 0.
-            self.outputs = self._read_batch_features(
+            self.outputs = self.make_batch_feature(
                 filenames=self.test_filenames[0],
                 num_epochs=num_epochs,
-                batch_size=batch_size)
-            self._verify_records(sess, batch_size, 0, num_epochs=num_epochs)
+                batch_size=batch_size).make_one_shot_iterator().get_next()
+            self.verify_records(sess, batch_size, 0, num_epochs=num_epochs)
             with self.assertRaises(errors.OutOfRangeError):
               self._next_actual_batch(sess)
 
         with ops.Graph().as_default() as g:
           with self.test_session(graph=g) as sess:
             # Basic test: read from file 1.
-            self.outputs = self._read_batch_features(
+            self.outputs = self.make_batch_feature(
                 filenames=self.test_filenames[1],
                 num_epochs=num_epochs,
-                batch_size=batch_size)
-            self._verify_records(sess, batch_size, 1, num_epochs=num_epochs)
+                batch_size=batch_size).make_one_shot_iterator().get_next()
+            self.verify_records(sess, batch_size, 1, num_epochs=num_epochs)
             with self.assertRaises(errors.OutOfRangeError):
               self._next_actual_batch(sess)
 
         with ops.Graph().as_default() as g:
           with self.test_session(graph=g) as sess:
             # Basic test: read from both files.
-            self.outputs = self._read_batch_features(
+            self.outputs = self.make_batch_feature(
                 filenames=self.test_filenames,
                 num_epochs=num_epochs,
-                batch_size=batch_size)
-            self._verify_records(sess, batch_size, num_epochs=num_epochs)
+                batch_size=batch_size).make_one_shot_iterator().get_next()
+            self.verify_records(sess, batch_size, num_epochs=num_epochs)
             with self.assertRaises(errors.OutOfRangeError):
               self._next_actual_batch(sess)
 
@@ -504,18 +348,18 @@ class ReadBatchFeaturesTest(test.TestCase):
       # Test that shuffling with same seed produces the same result.
       with ops.Graph().as_default() as g:
         with self.test_session(graph=g) as sess:
-          outputs1 = self._read_batch_features(
+          outputs1 = self.make_batch_feature(
               filenames=self.test_filenames[0],
               num_epochs=num_epochs,
               batch_size=batch_size,
               shuffle=True,
-              shuffle_seed=5)
-          outputs2 = self._read_batch_features(
+              shuffle_seed=5).make_one_shot_iterator().get_next()
+          outputs2 = self.make_batch_feature(
               filenames=self.test_filenames[0],
               num_epochs=num_epochs,
               batch_size=batch_size,
               shuffle=True,
-              shuffle_seed=5)
+              shuffle_seed=5).make_one_shot_iterator().get_next()
           for _ in range(total_records // batch_size):
             batch1 = self._run_actual_batch(outputs1, sess)
             batch2 = self._run_actual_batch(outputs2, sess)
@@ -525,18 +369,18 @@ class ReadBatchFeaturesTest(test.TestCase):
       # Test that shuffling with different seeds produces a different order.
       with ops.Graph().as_default() as g:
         with self.test_session(graph=g) as sess:
-          outputs1 = self._read_batch_features(
+          outputs1 = self.make_batch_feature(
               filenames=self.test_filenames[0],
               num_epochs=num_epochs,
               batch_size=batch_size,
               shuffle=True,
-              shuffle_seed=5)
-          outputs2 = self._read_batch_features(
+              shuffle_seed=5).make_one_shot_iterator().get_next()
+          outputs2 = self.make_batch_feature(
               filenames=self.test_filenames[0],
               num_epochs=num_epochs,
               batch_size=batch_size,
               shuffle=True,
-              shuffle_seed=15)
+              shuffle_seed=15).make_one_shot_iterator().get_next()
           all_equal = True
           for _ in range(total_records // batch_size):
             batch1 = self._run_actual_batch(outputs1, sess)
@@ -552,13 +396,14 @@ class ReadBatchFeaturesTest(test.TestCase):
         for parser_num_threads in [2, 4]:
           with ops.Graph().as_default() as g:
             with self.test_session(graph=g) as sess:
-              self.outputs = self._read_batch_features(
+              self.outputs = self.make_batch_feature(
                   filenames=self.test_filenames,
                   num_epochs=num_epochs,
                   batch_size=batch_size,
                   reader_num_threads=reader_num_threads,
-                  parser_num_threads=parser_num_threads)
-              self._verify_records(
+                  parser_num_threads=parser_num_threads).make_one_shot_iterator(
+                  ).get_next()
+              self.verify_records(
                   sess,
                   batch_size,
                   num_epochs=num_epochs,
@@ -571,11 +416,11 @@ class ReadBatchFeaturesTest(test.TestCase):
       for num_epochs in [1, 10]:
         with ops.Graph().as_default():
           # Basic test: read from file 0.
-          self.outputs = self._read_batch_features(
+          self.outputs = self.make_batch_feature(
               filenames=self.test_filenames[0],
               num_epochs=num_epochs,
               batch_size=batch_size,
-              drop_final_batch=True)
+              drop_final_batch=True).make_one_shot_iterator().get_next()
           for _, tensor in self.outputs.items():
             if isinstance(tensor, ops.Tensor):  # Guard against SparseTensor.
               self.assertEqual(tensor.shape[0], batch_size)
diff --git a/tensorflow/contrib/data/python/kernel_tests/reader_dataset_ops_test_base.py b/tensorflow/contrib/data/python/kernel_tests/reader_dataset_ops_test_base.py
new file mode 100644
index 0000000000000000000000000000000000000000..805a7c7b7384d53cc166a48ba243502ef8643280
--- /dev/null
+++ b/tensorflow/contrib/data/python/kernel_tests/reader_dataset_ops_test_base.py
@@ -0,0 +1,218 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for the experimental input pipeline ops."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import os
+
+from tensorflow.contrib.data.python.ops import readers
+from tensorflow.core.example import example_pb2
+from tensorflow.core.example import feature_pb2
+from tensorflow.python.data.ops import readers as core_readers
+from tensorflow.python.framework import dtypes
+from tensorflow.python.lib.io import python_io
+from tensorflow.python.ops import parsing_ops
+from tensorflow.python.platform import test
+from tensorflow.python.util import compat
+
+
+class ReadBatchFeaturesTestBase(test.TestCase):
+  """Base class for setting up and testing `make_batched_feature_dataset`."""
+
+  def setUp(self):
+    super(ReadBatchFeaturesTestBase, self).setUp()
+    self._num_files = 2
+    self._num_records = 7
+    self.test_filenames = self._createFiles()
+
+  def make_batch_feature(self,
+                         filenames,
+                         num_epochs,
+                         batch_size,
+                         reader_num_threads=1,
+                         parser_num_threads=1,
+                         shuffle=False,
+                         shuffle_seed=None,
+                         drop_final_batch=False):
+    self.filenames = filenames
+    self.num_epochs = num_epochs
+    self.batch_size = batch_size
+
+    return readers.make_batched_features_dataset(
+        file_pattern=self.filenames,
+        batch_size=self.batch_size,
+        features={
+            "file": parsing_ops.FixedLenFeature([], dtypes.int64),
+            "record": parsing_ops.FixedLenFeature([], dtypes.int64),
+            "keywords": parsing_ops.VarLenFeature(dtypes.string)
+        },
+        reader=core_readers.TFRecordDataset,
+        num_epochs=self.num_epochs,
+        shuffle=shuffle,
+        shuffle_seed=shuffle_seed,
+        reader_num_threads=reader_num_threads,
+        parser_num_threads=parser_num_threads,
+        drop_final_batch=drop_final_batch)
+
+  def _record(self, f, r):
+    example = example_pb2.Example(
+        features=feature_pb2.Features(
+            feature={
+                "file":
+                    feature_pb2.Feature(
+                        int64_list=feature_pb2.Int64List(value=[f])),
+                "record":
+                    feature_pb2.Feature(
+                        int64_list=feature_pb2.Int64List(value=[r])),
+                "keywords":
+                    feature_pb2.Feature(
+                        bytes_list=feature_pb2.BytesList(
+                            value=self._get_keywords(f, r)))
+            }))
+    return example.SerializeToString()
+
+  def _get_keywords(self, f, r):
+    num_keywords = 1 + (f + r) % 2
+    keywords = []
+    for index in range(num_keywords):
+      keywords.append(compat.as_bytes("keyword%d" % index))
+    return keywords
+
+  def _sum_keywords(self, num_files):
+    sum_keywords = 0
+    for i in range(num_files):
+      for j in range(self._num_records):
+        sum_keywords += 1 + (i + j) % 2
+    return sum_keywords
+
+  def _createFiles(self):
+    filenames = []
+    for i in range(self._num_files):
+      fn = os.path.join(self.get_temp_dir(), "tf_record.%d.txt" % i)
+      filenames.append(fn)
+      writer = python_io.TFRecordWriter(fn)
+      for j in range(self._num_records):
+        writer.write(self._record(i, j))
+      writer.close()
+    return filenames
+
+  def _run_actual_batch(self, outputs, sess):
+    file_op = outputs["file"]
+    keywords_indices_op = outputs["keywords"].indices
+    keywords_values_op = outputs["keywords"].values
+    keywords_dense_shape_op = outputs["keywords"].dense_shape
+    record_op = outputs["record"]
+    return sess.run([
+        file_op, keywords_indices_op, keywords_values_op,
+        keywords_dense_shape_op, record_op
+    ])
+
+  def _next_actual_batch(self, sess):
+    return self._run_actual_batch(self.outputs, sess)
+
+  def _interleave(self, iterators, cycle_length):
+    pending_iterators = iterators
+    open_iterators = []
+    num_open = 0
+    for i in range(cycle_length):
+      if pending_iterators:
+        open_iterators.append(pending_iterators.pop(0))
+        num_open += 1
+
+    while num_open:
+      for i in range(min(cycle_length, len(open_iterators))):
+        if open_iterators[i] is None:
+          continue
+        try:
+          yield next(open_iterators[i])
+        except StopIteration:
+          if pending_iterators:
+            open_iterators[i] = pending_iterators.pop(0)
+          else:
+            open_iterators[i] = None
+            num_open -= 1
+
+  def _next_expected_batch(self,
+                           file_indices,
+                           batch_size,
+                           num_epochs,
+                           cycle_length=1):
+
+    def _next_record(file_indices):
+      for j in file_indices:
+        for i in range(self._num_records):
+          yield j, i
+
+    def _next_record_interleaved(file_indices, cycle_length):
+      return self._interleave([_next_record([i]) for i in file_indices],
+                              cycle_length)
+
+    file_batch = []
+    keywords_batch_indices = []
+    keywords_batch_values = []
+    keywords_batch_max_len = 0
+    record_batch = []
+    batch_index = 0
+    for _ in range(num_epochs):
+      if cycle_length == 1:
+        next_records = _next_record(file_indices)
+      else:
+        next_records = _next_record_interleaved(file_indices, cycle_length)
+      for record in next_records:
+        f = record[0]
+        r = record[1]
+        file_batch.append(f)
+        record_batch.append(r)
+        keywords = self._get_keywords(f, r)
+        keywords_batch_values.extend(keywords)
+        keywords_batch_indices.extend(
+            [[batch_index, i] for i in range(len(keywords))])
+        batch_index += 1
+        keywords_batch_max_len = max(keywords_batch_max_len, len(keywords))
+        if len(file_batch) == batch_size:
+          yield [
+              file_batch, keywords_batch_indices, keywords_batch_values,
+              [batch_size, keywords_batch_max_len], record_batch
+          ]
+          file_batch = []
+          keywords_batch_indices = []
+          keywords_batch_values = []
+          keywords_batch_max_len = 0
+          record_batch = []
+          batch_index = 0
+    if file_batch:
+      yield [
+          file_batch, keywords_batch_indices, keywords_batch_values,
+          [len(file_batch), keywords_batch_max_len], record_batch
+      ]
+
+  def verify_records(self,
+                     sess,
+                     batch_size,
+                     file_index=None,
+                     num_epochs=1,
+                     interleave_cycle_length=1):
+    if file_index is not None:
+      file_indices = [file_index]
+    else:
+      file_indices = range(self._num_files)
+
+    for expected_batch in self._next_expected_batch(
+        file_indices, batch_size, num_epochs, interleave_cycle_length):
+      actual_batch = self._next_actual_batch(sess)
+      for i in range(len(expected_batch)):
+        self.assertAllEqual(expected_batch[i], actual_batch[i])
diff --git a/tensorflow/contrib/data/python/kernel_tests/shuffle_dataset_op_test.py b/tensorflow/contrib/data/python/kernel_tests/shuffle_dataset_op_test.py
index bcc644c0971854d948025009dc7add2fea214048..1b67a33f04b0c2ac80402e163005123a4b3e4400 100644
--- a/tensorflow/contrib/data/python/kernel_tests/shuffle_dataset_op_test.py
+++ b/tensorflow/contrib/data/python/kernel_tests/shuffle_dataset_op_test.py
@@ -20,11 +20,13 @@ from __future__ import print_function
 import numpy as np
 
 from tensorflow.contrib.data.python.kernel_tests import dataset_serialization_test_base
+from tensorflow.contrib.data.python.ops import iterator_ops as contrib_iterator_ops
 from tensorflow.contrib.data.python.ops import shuffle_ops
 from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python.framework import errors
 from tensorflow.python.framework import ops
 from tensorflow.python.platform import test
+from tensorflow.python.training import saver as saver_lib
 
 
 class ShuffleDatasetSerializationTest(
@@ -50,26 +52,100 @@ class ShuffleDatasetSerializationTest(
     num_repeats = 5
     num_outputs = range_limit * num_repeats
     buffer_sizes = [1, 3, 8, 10, 25, 50]
-    reshuffle_each_iteration = False
     # pylint: disable=cell-var-from-loop
     # pylint: disable=g-long-lambda
-    for buffer_size in buffer_sizes:
-      self.run_core_tests(
-          lambda: self._build_shuffle_dataset(
+    for reshuffle_each_iteration in [True, False]:
+      for buffer_size in buffer_sizes:
+        self.run_core_tests(
+            lambda: self._build_shuffle_dataset(
+                range_limit=range_limit,
+                num_repeats=num_repeats,
+                buffer_size=buffer_size,
+                seed=seed,
+                reshuffle_each_iteration=reshuffle_each_iteration),
+            lambda: self._build_shuffle_dataset(
+                range_limit=range_limit,
+                num_repeats=num_repeats,
+                buffer_size=buffer_size,
+                seed=10,
+                reshuffle_each_iteration=reshuffle_each_iteration),
+            num_outputs)
+    # pylint: enable=cell-var-from-loop
+    # pylint: enable=g-long-lambda
+
+  def testNonDeterministicSeeding(self):
+
+    range_limit = 10
+    num_repeats = 5
+    num_outputs = range_limit * num_repeats
+    buffer_sizes = [1, 3, 8, 10, 25, 50]
+    for reshuffle_each_iteration in [True, False]:
+      for buffer_size in buffer_sizes:
+
+        def ds_fn():
+          # pylint: disable=cell-var-from-loop
+          return self._build_shuffle_dataset(
               range_limit=range_limit,
               num_repeats=num_repeats,
               buffer_size=buffer_size,
-              seed=seed,
-              reshuffle_each_iteration=reshuffle_each_iteration),
-          lambda: self._build_shuffle_dataset(
+              seed=None,  # Iterator seeds are generated non-deterministically.
+              reshuffle_each_iteration=reshuffle_each_iteration)
+          # pylint: enable=cell-var-from-loop
+
+        # We checkpoint the initial state of the Dataset so that we can restore
+        # the seeds in the next run. Since the seeding is non-deterministic
+        # the dataset gets initialized with different seeds each time.
+        expected = self.gen_outputs(
+            ds_fn,
+            break_points=[0],
+            num_outputs=num_outputs,
+            ckpt_saved=False,
+            verify_exhausted=False,
+            save_checkpoint_at_end=False)
+        actual = self.gen_outputs(
+            ds_fn,
+            break_points=self.gen_break_points(num_outputs),
+            num_outputs=num_outputs,
+            ckpt_saved=True,
+            verify_exhausted=False)
+        self.match(expected, actual)
+
+  def testMultipleIterators(self):
+    range_limit = 10
+    num_repeats = 5
+    num_outputs = range_limit * num_repeats
+    buffer_sizes = [1, 3, 8, 10, 25, 50]
+
+    for reshuffle_each_iteration in [True, False]:
+      for buffer_size in buffer_sizes:
+
+        def ds_fn():
+          # pylint: disable=cell-var-from-loop
+          return self._build_shuffle_dataset(
               range_limit=range_limit,
               num_repeats=num_repeats,
               buffer_size=buffer_size,
-              seed=10,
-              reshuffle_each_iteration=reshuffle_each_iteration),
-          num_outputs)
-    # pylint: enable=cell-var-from-loop
-    # pylint: enable=g-long-lambda
+              seed=None,  # Iterator seeds are generated non-deterministically.
+              reshuffle_each_iteration=reshuffle_each_iteration)
+          # pylint: enable=cell-var-from-loop
+
+        with ops.Graph().as_default() as g:
+          ds = ds_fn()
+          iterators = [ds.make_one_shot_iterator(), ds.make_one_shot_iterator()]
+          get_next_ops = [it.get_next() for it in iterators]
+          saveables = [
+              contrib_iterator_ops.make_saveable_from_iterator(it)
+              for it in iterators
+          ]
+          for saveable in saveables:
+            ops.add_to_collection(ops.GraphKeys.SAVEABLE_OBJECTS, saveable)
+          saver = saver_lib.Saver(allow_empty=True)
+          with self.test_session(graph=g) as sess:
+            self._save(sess, saver)
+            expected = [sess.run(get_next_ops) for _ in range(num_outputs)]
+            self._restore(saver, sess)
+            actual = [sess.run(get_next_ops) for _ in range(num_outputs)]
+            self.match(expected, actual)
 
 
 class ShuffleAndRepeatTest(
diff --git a/tensorflow/contrib/data/python/kernel_tests/stats_dataset_ops_test.py b/tensorflow/contrib/data/python/kernel_tests/stats_dataset_ops_test.py
index 5c74ed6ae7210e8e22efb6e8fdb773397459ce1e..17b6644759e53f84b23e070a71267aa15dcffe49 100644
--- a/tensorflow/contrib/data/python/kernel_tests/stats_dataset_ops_test.py
+++ b/tensorflow/contrib/data/python/kernel_tests/stats_dataset_ops_test.py
@@ -20,6 +20,7 @@ from __future__ import print_function
 import numpy as np
 
 from tensorflow.contrib.data.python.kernel_tests import dataset_serialization_test_base
+from tensorflow.contrib.data.python.kernel_tests import reader_dataset_ops_test_base
 from tensorflow.contrib.data.python.ops import stats_ops
 from tensorflow.core.framework import summary_pb2
 from tensorflow.python.data.ops import dataset_ops
@@ -29,7 +30,7 @@ from tensorflow.python.ops import array_ops
 from tensorflow.python.platform import test
 
 
-class StatsDatasetTest(test.TestCase):
+class StatsDatasetTestBase(test.TestCase):
 
   def _assertSummaryHasCount(self, summary_str, tag, expected_value):
     summary_proto = summary_pb2.Summary()
@@ -49,6 +50,9 @@ class StatsDatasetTest(test.TestCase):
         return
     self.fail("Expected tag %r not found in summary %r" % (tag, summary_proto))
 
+
+class StatsDatasetTest(StatsDatasetTestBase):
+
   def testBytesProduced(self):
     stats_aggregator = stats_ops.StatsAggregator()
     dataset = dataset_ops.Dataset.range(100).map(
@@ -193,6 +197,45 @@ class StatsDatasetTest(test.TestCase):
       self._assertSummaryHasCount(sess.run(summary_t), "record_latency", 200.0)
 
 
+class FeatureStatsDatasetTest(
+    StatsDatasetTestBase,
+    reader_dataset_ops_test_base.ReadBatchFeaturesTestBase):
+
+  def testFeaturesStats(self):
+    num_epochs = 5
+    total_records = num_epochs * self._num_records
+    batch_size = 2
+    stats_aggregator = stats_ops.StatsAggregator()
+    dataset = self.make_batch_feature(
+        filenames=self.test_filenames[0],
+        num_epochs=num_epochs,
+        batch_size=batch_size,
+        shuffle=True,
+        shuffle_seed=5,
+        drop_final_batch=True).apply(
+            stats_ops.set_stats_aggregator(stats_aggregator))
+    iterator = dataset.make_initializable_iterator()
+    next_element = iterator.get_next()
+    summary_t = stats_aggregator.get_summary()
+
+    with self.test_session() as sess:
+      sess.run(iterator.initializer)
+      for _ in range(total_records // batch_size):
+        sess.run(next_element)
+
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(next_element)
+      self._assertSummaryHasCount(
+          sess.run(summary_t), "record_stats:features", total_records)
+      self._assertSummaryHasCount(
+          sess.run(summary_t), "record_stats:feature-values", total_records)
+      self._assertSummaryHasSum(
+          sess.run(summary_t), "record_stats:features", total_records * 3)
+      self._assertSummaryHasSum(
+          sess.run(summary_t), "record_stats:feature-values",
+          self._sum_keywords(1) * num_epochs + 2 * total_records)
+
+
 class StatsDatasetSerializationTest(
     dataset_serialization_test_base.DatasetSerializationTestBase):
 
diff --git a/tensorflow/contrib/data/python/ops/BUILD b/tensorflow/contrib/data/python/ops/BUILD
index eceecfd1744d0ae28953a4504450653efa473569..33b7a75046cf2acfa3d787833b907aa2b28dbdca 100644
--- a/tensorflow/contrib/data/python/ops/BUILD
+++ b/tensorflow/contrib/data/python/ops/BUILD
@@ -96,8 +96,10 @@ py_library(
     srcs_version = "PY2AND3",
     deps = [
         ":batching",
+        ":gen_dataset_ops",
         ":interleave_ops",
         ":shuffle_ops",
+        ":stats_ops",
         "//tensorflow/python:constant_op",
         "//tensorflow/python:dataset_ops_gen",
         "//tensorflow/python:dtypes",
@@ -106,12 +108,12 @@ py_library(
         "//tensorflow/python:math_ops",
         "//tensorflow/python:parsing_ops",
         "//tensorflow/python:platform",
-        "//tensorflow/python:sparse_tensor",
         "//tensorflow/python:string_ops",
         "//tensorflow/python:tensor_shape",
         "//tensorflow/python:util",
         "//tensorflow/python/data/ops:dataset_ops",
         "//tensorflow/python/data/ops:readers",
+        "//tensorflow/python/data/util:convert",
         "//tensorflow/python/data/util:nest",
         "//third_party/py/numpy",
     ],
@@ -142,6 +144,7 @@ py_library(
         "//tensorflow/python:tensor_shape",
         "//tensorflow/python:tensor_util",
         "//tensorflow/python/data/ops:dataset_ops",
+        "//tensorflow/python/data/util:convert",
         "//tensorflow/python/data/util:nest",
         "//tensorflow/python/data/util:sparse",
     ],
@@ -208,6 +211,20 @@ py_library(
     ],
 )
 
+py_library(
+    name = "optimization",
+    srcs = ["optimization.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":contrib_op_loader",
+        ":gen_dataset_ops",
+        "//tensorflow/python:dtypes",
+        "//tensorflow/python:framework_ops",
+        "//tensorflow/python/data/util:nest",
+        "//tensorflow/python/data/util:sparse",
+    ],
+)
+
 py_library(
     name = "resampling",
     srcs = ["resampling.py"],
@@ -368,6 +385,7 @@ py_library(
         ":get_single_element",
         ":grouping",
         ":interleave_ops",
+        ":optimization",
         ":prefetching_ops",
         ":readers",
         ":resampling",
diff --git a/tensorflow/contrib/data/python/ops/batching.py b/tensorflow/contrib/data/python/ops/batching.py
index b9393de4e90ae2597045b29070934b94e18cfcbd..052618e08c8f204613db5a20d42e078f17f12840 100644
--- a/tensorflow/contrib/data/python/ops/batching.py
+++ b/tensorflow/contrib/data/python/ops/batching.py
@@ -19,6 +19,7 @@ from __future__ import print_function
 
 from tensorflow.contrib.framework import with_shape
 from tensorflow.python.data.ops import dataset_ops
+from tensorflow.python.data.util import convert
 from tensorflow.python.data.util import nest
 from tensorflow.python.data.util import sparse
 from tensorflow.python.framework import dtypes
@@ -29,6 +30,7 @@ from tensorflow.python.framework import tensor_util
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import gen_dataset_ops
 from tensorflow.python.ops import math_ops
+from tensorflow.python.util import deprecation
 
 
 def dense_to_sparse_batch(batch_size, row_shape):
@@ -101,10 +103,7 @@ class UnbatchDataset(dataset_ops.Dataset):
   def _as_variant_tensor(self):
     return gen_dataset_ops.unbatch_dataset(
         self._input_dataset._as_variant_tensor(),  # pylint: disable=protected-access
-        output_shapes=nest.flatten(
-            sparse.as_dense_shapes(self.output_shapes, self.output_classes)),
-        output_types=nest.flatten(
-            sparse.as_dense_types(self.output_types, self.output_classes)))
+        **dataset_ops.flat_structure(self))
 
   @property
   def output_classes(self):
@@ -218,6 +217,8 @@ def filter_irregular_batches(batch_size):
   return _apply_fn
 
 
+@deprecation.deprecated(
+    None, "Use `tf.data.Dataset.batch(..., drop_remainder=True)`.")
 def batch_and_drop_remainder(batch_size):
   """A batching transformation that omits the final small batch (if present).
 
@@ -250,12 +251,16 @@ def batch_and_drop_remainder(batch_size):
 
   def _apply_fn(dataset):
     """Function from `Dataset` to `Dataset` that applies the transformation."""
+    # TODO(jsimsa): Switch to using `batch(..., drop_remainder=True)` any time
+    # after 6/30/2018.
     batched = dataset.batch(batch_size)
     return filter_irregular_batches(batch_size)(batched)
 
   return _apply_fn
 
 
+@deprecation.deprecated(
+    None, "Use `tf.data.Dataset.padded_batch(..., drop_remainder=True)`.")
 def padded_batch_and_drop_remainder(batch_size,
                                     padded_shapes,
                                     padding_values=None):
@@ -284,6 +289,8 @@ def padded_batch_and_drop_remainder(batch_size,
 
   def _apply_fn(dataset):
     """Function from `Dataset` to `Dataset` that applies the transformation."""
+    # TODO(jsimsa): Switch to using `padded_batch(..., drop_remainder=True)`
+    # any time after 6/30/2018.
     batched = dataset.padded_batch(
         batch_size, padded_shapes=padded_shapes, padding_values=padding_values)
     return filter_irregular_batches(batch_size)(batched)
@@ -309,11 +316,8 @@ class DenseToSparseBatchDataset(dataset_ops.Dataset):
     return gen_dataset_ops.dense_to_sparse_batch_dataset(
         self._input_dataset._as_variant_tensor(),  # pylint: disable=protected-access
         self._batch_size,
-        row_shape=dataset_ops._partial_shape_to_tensor(self._row_shape),  # pylint: disable=protected-access
-        output_shapes=nest.flatten(
-            sparse.as_dense_shapes(self.output_shapes, self.output_classes)),
-        output_types=nest.flatten(
-            sparse.as_dense_types(self.output_types, self.output_classes)))
+        row_shape=convert.partial_shape_to_tensor(self._row_shape),
+        **dataset_ops.flat_structure(self))
 
   @property
   def output_classes(self):
@@ -490,10 +494,7 @@ class _MapAndBatchDataset(dataset_ops.MapDataset):
         batch_size=self._batch_size_t,
         num_parallel_calls=self._num_parallel_calls_t,
         drop_remainder=self._drop_remainder_t,
-        output_types=nest.flatten(
-            sparse.as_dense_types(self.output_types, self.output_classes)),
-        output_shapes=nest.flatten(
-            sparse.as_dense_shapes(self.output_shapes, self.output_classes)))
+        **dataset_ops.flat_structure(self))
     # pylint: enable=protected-access
 
   @property
diff --git a/tensorflow/contrib/data/python/ops/error_ops.py b/tensorflow/contrib/data/python/ops/error_ops.py
index 6c21e489f7c35484ebacd465e3b46d6920df5933..5f5513849cb29a18b86ba8bcee1ab6c9c60674cb 100644
--- a/tensorflow/contrib/data/python/ops/error_ops.py
+++ b/tensorflow/contrib/data/python/ops/error_ops.py
@@ -20,8 +20,6 @@ from __future__ import print_function
 from tensorflow.contrib.data.python.ops import contrib_op_loader  # pylint: disable=unused-import
 from tensorflow.contrib.data.python.ops import gen_dataset_ops
 from tensorflow.python.data.ops import dataset_ops
-from tensorflow.python.data.util import nest
-from tensorflow.python.data.util import sparse
 
 
 def ignore_errors():
@@ -64,10 +62,7 @@ class IgnoreErrorsDataset(dataset_ops.Dataset):
   def _as_variant_tensor(self):
     return gen_dataset_ops.ignore_errors_dataset(
         self._input_dataset._as_variant_tensor(),  # pylint: disable=protected-access
-        output_shapes=nest.flatten(
-            sparse.as_dense_shapes(self.output_shapes, self.output_classes)),
-        output_types=nest.flatten(
-            sparse.as_dense_types(self.output_types, self.output_classes)))
+        **dataset_ops.flat_structure(self))
 
   @property
   def output_classes(self):
diff --git a/tensorflow/contrib/data/python/ops/get_single_element.py b/tensorflow/contrib/data/python/ops/get_single_element.py
index 3a07df572748e464284f580d67e3a664e71acdfe..0f4cd8e20c5727a5bcfa1dce4dadbfa8f90bd551 100644
--- a/tensorflow/contrib/data/python/ops/get_single_element.py
+++ b/tensorflow/contrib/data/python/ops/get_single_element.py
@@ -64,10 +64,7 @@ def get_single_element(dataset):
   nested_ret = nest.pack_sequence_as(
       dataset.output_types, gen_dataset_ops.dataset_to_single_element(
           dataset._as_variant_tensor(),  # pylint: disable=protected-access
-          output_types=nest.flatten(sparse.as_dense_types(
-              dataset.output_types, dataset.output_classes)),
-          output_shapes=nest.flatten(sparse.as_dense_shapes(
-              dataset.output_shapes, dataset.output_classes))))
+          **dataset_ops.flat_structure(dataset)))
   return sparse.deserialize_sparse_tensors(
       nested_ret, dataset.output_types, dataset.output_shapes,
       dataset.output_classes)
diff --git a/tensorflow/contrib/data/python/ops/grouping.py b/tensorflow/contrib/data/python/ops/grouping.py
index ea229b5b27b117984e508fa4edc6f1cf713008b4..f9f25e6a0687fe7167525847c64743f52a551fb0 100644
--- a/tensorflow/contrib/data/python/ops/grouping.py
+++ b/tensorflow/contrib/data/python/ops/grouping.py
@@ -300,6 +300,7 @@ class GroupByReducerDataset(dataset_ops.Dataset):
         raise ValueError(
             "`key_func` must return a single tf.int64 tensor. "
             "Got type=%s and shape=%s" % (ret.dtype, ret.get_shape()))
+      dataset_ops._warn_if_collections("tf.contrib.data.group_by_reducer()")  # pylint: disable=protected-access
       return ret
 
     self._key_func = tf_key_func
@@ -327,6 +328,8 @@ class GroupByReducerDataset(dataset_ops.Dataset):
       self._state_types = nest.pack_sequence_as(
           ret, [t.dtype for t in nest.flatten(ret)])
 
+      dataset_ops._warn_if_collections("tf.contrib.data.group_by_reducer()")  # pylint: disable=protected-access
+
       # Serialize any sparse tensors.
       ret = nest.pack_sequence_as(
           ret, [t for t in nest.flatten(sparse.serialize_sparse_tensors(ret))])
@@ -398,6 +401,8 @@ class GroupByReducerDataset(dataset_ops.Dataset):
                  nest.pack_sequence_as(self._state_types,
                                        [t.dtype for t in flat_new_state])))
 
+        dataset_ops._warn_if_collections("tf.contrib.data.group_by_reducer()")  # pylint: disable=protected-access
+
         # Serialize any sparse tensors.
         ret = nest.pack_sequence_as(
             ret,
@@ -464,6 +469,8 @@ class GroupByReducerDataset(dataset_ops.Dataset):
       self._output_types = nest.pack_sequence_as(
           ret, [t.dtype for t in nest.flatten(ret)])
 
+      dataset_ops._warn_if_collections("tf.contrib.data.group_by_reducer()")  # pylint: disable=protected-access
+
       # Serialize any sparse tensors.
       ret = nest.pack_sequence_as(
           ret, [t for t in nest.flatten(sparse.serialize_sparse_tensors(ret))])
@@ -495,10 +502,7 @@ class GroupByReducerDataset(dataset_ops.Dataset):
         init_func=self._init_func,
         reduce_func=self._reduce_func,
         finalize_func=self._finalize_func,
-        output_types=nest.flatten(
-            sparse.as_dense_types(self.output_types, self.output_classes)),
-        output_shapes=nest.flatten(
-            sparse.as_dense_shapes(self.output_shapes, self.output_classes)))
+        **dataset_ops.flat_structure(self))
 
 
 class GroupByWindowDataset(dataset_ops.Dataset):
@@ -525,6 +529,7 @@ class GroupByWindowDataset(dataset_ops.Dataset):
       if window_size.dtype != dtypes.int64:
         raise ValueError(
             "`window_size_func` must return a single tf.int64 tensor.")
+      dataset_ops._warn_if_collections("tf.contrib.data.group_by_window()")  # pylint: disable=protected-access
       return window_size
 
     self._window_size_func = tf_window_size_func
@@ -557,6 +562,7 @@ class GroupByWindowDataset(dataset_ops.Dataset):
       ret = ops.convert_to_tensor(ret, dtype=dtypes.int64)
       if ret.dtype != dtypes.int64:
         raise ValueError("`key_func` must return a single tf.int64 tensor.")
+      dataset_ops._warn_if_collections("tf.contrib.data.group_by_window()")  # pylint: disable=protected-access
       return ret
 
     self._key_func = tf_key_func
@@ -580,6 +586,7 @@ class GroupByWindowDataset(dataset_ops.Dataset):
       self._output_classes = output_dataset.output_classes
       self._output_types = output_dataset.output_types
       self._output_shapes = output_dataset.output_shapes
+      dataset_ops._warn_if_collections("tf.contrib.data.group_by_window()")  # pylint: disable=protected-access
       return output_dataset._as_variant_tensor()  # pylint: disable=protected-access
 
     self._reduce_func = tf_reduce_func
@@ -606,10 +613,7 @@ class GroupByWindowDataset(dataset_ops.Dataset):
         key_func=self._key_func,
         reduce_func=self._reduce_func,
         window_size_func=self._window_size_func,
-        output_types=nest.flatten(
-            sparse.as_dense_types(self.output_types, self.output_classes)),
-        output_shapes=nest.flatten(
-            sparse.as_dense_shapes(self.output_shapes, self.output_classes)))
+        **dataset_ops.flat_structure(self))
 
 
 class Reducer(object):
diff --git a/tensorflow/contrib/data/python/ops/interleave_ops.py b/tensorflow/contrib/data/python/ops/interleave_ops.py
index 812a50ecbf105393f7e422edbbdf5c87311d72c1..70153ac575758f16beff373941dfefb32bd342cf 100644
--- a/tensorflow/contrib/data/python/ops/interleave_ops.py
+++ b/tensorflow/contrib/data/python/ops/interleave_ops.py
@@ -24,9 +24,9 @@ from tensorflow.contrib.data.python.ops import random_ops
 from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python.data.ops import readers
 from tensorflow.python.data.util import nest
-from tensorflow.python.data.util import sparse
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
+from tensorflow.python.framework import tensor_shape
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.util import deprecation
@@ -170,10 +170,7 @@ class DirectedInterleaveDataset(dataset_ops.Dataset):
     return gen_dataset_ops.directed_interleave_dataset(
         self._selector_input._as_variant_tensor(),
         [data_input._as_variant_tensor() for data_input in self._data_inputs],
-        output_shapes=nest.flatten(
-            sparse.as_dense_shapes(self.output_shapes, self.output_classes)),
-        output_types=nest.flatten(
-            sparse.as_dense_types(self.output_types, self.output_classes)))
+        **dataset_ops.flat_structure(self))
     # pylint: enable=protected-access
 
   @property
@@ -240,3 +237,47 @@ def sample_from_datasets(datasets, weights=None, seed=None):
       (logits_ds, random_ops.RandomDataset(seed).batch(2))).map(select_dataset)
 
   return DirectedInterleaveDataset(selector_input, datasets)
+
+
+def choose_from_datasets(datasets, choice_dataset):
+  """Creates a dataset that deterministically chooses elements from `datasets`.
+
+  For example, given the following datasets:
+
+  ```python
+  datasets = [tf.data.Dataset.from_tensors("foo").repeat(),
+              tf.data.Dataset.from_tensors("bar").repeat(),
+              tf.data.Dataset.from_tensors("baz").repeat()]
+
+  # Define a dataset containing `[0, 1, 2, 0, 1, 2, 0, 1, 2]`.
+  choice_dataset = tf.data.Dataset.range(3).repeat(3)
+
+  result = tf.contrib.data.choose_from_datasets(datasets, choice_dataset)
+  ```
+
+  The elements of `result` will be:
+
+  ```
+  "foo", "bar", "baz", "foo", "bar", "baz", "foo", "bar", "baz"
+  ```
+
+  Args:
+    datasets: A list of @{tf.data.Dataset} objects with compatible structure.
+    choice_dataset: A @{tf.data.Dataset} of scalar `tf.int64` tensors between
+      `0` and `len(datasets) - 1`.
+
+  Returns:
+    A dataset that interleaves elements from `datasets` according to the values
+    of `choice_dataset`.
+
+  Raises:
+    TypeError: If the `datasets` or `choice_dataset` arguments have the wrong
+      type.
+  """
+  if not (choice_dataset.output_types == dtypes.int64
+          and choice_dataset.output_shapes.is_compatible_with(
+              tensor_shape.scalar())
+          and choice_dataset.output_classes == ops.Tensor):
+    raise TypeError("`choice_dataset` must be a dataset of scalar "
+                    "`tf.int64` tensors.")
+  return DirectedInterleaveDataset(choice_dataset, datasets)
diff --git a/tensorflow/contrib/data/python/ops/optimization.py b/tensorflow/contrib/data/python/ops/optimization.py
new file mode 100644
index 0000000000000000000000000000000000000000..9612ac5ae910f8ee08d4b3ed9097a5c80266fcfd
--- /dev/null
+++ b/tensorflow/contrib/data/python/ops/optimization.py
@@ -0,0 +1,74 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Experimental API for optimizing `tf.data` pipelines."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.contrib.data.python.ops import contrib_op_loader  # pylint: disable=unused-import
+from tensorflow.python.data.ops import dataset_ops
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import gen_dataset_ops
+
+
+def optimize(optimizations=None):
+  """A transformation that applies optimizations.
+
+  Args:
+    optimizations: (Optional.) A `tf.string` vector `tf.Tensor` identifying
+      optimizations to use. If not specified, the default set of optimizations
+      is applied.
+
+  Returns:
+    A `Dataset` transformation function, which can be passed to
+    @{tf.data.Dataset.apply}.
+  """
+
+  def _apply_fn(dataset):
+    """Function from `Dataset` to `Dataset` that applies the transformation."""
+    return OptimizeDataset(dataset, optimizations)
+
+  return _apply_fn
+
+
+class OptimizeDataset(dataset_ops.Dataset):
+  """A `Dataset` that acts as an identity, and applies optimizations."""
+
+  def __init__(self, input_dataset, optimizations):
+    """See `optimize()` for details."""
+    super(OptimizeDataset, self).__init__()
+    self._input_dataset = input_dataset
+    if optimizations is None:
+      optimizations = []
+    self._optimizations = ops.convert_to_tensor(
+        optimizations, dtype=dtypes.string, name="optimizations")
+
+  def _as_variant_tensor(self):
+    return gen_dataset_ops.optimize_dataset(
+        self._input_dataset._as_variant_tensor(),  # pylint: disable=protected-access
+        self._optimizations,
+        **dataset_ops.flat_structure(self))
+  @property
+  def output_classes(self):
+    return self._input_dataset.output_classes
+
+  @property
+  def output_shapes(self):
+    return self._input_dataset.output_shapes
+
+  @property
+  def output_types(self):
+    return self._input_dataset.output_types
diff --git a/tensorflow/contrib/data/python/ops/random_ops.py b/tensorflow/contrib/data/python/ops/random_ops.py
index 28ef5e50f39dd7d1b6f124e58e068fc968ddd6dc..e670c4c8354f4067eb21c9b1fce708147c162967 100644
--- a/tensorflow/contrib/data/python/ops/random_ops.py
+++ b/tensorflow/contrib/data/python/ops/random_ops.py
@@ -18,9 +18,7 @@ from __future__ import division
 from __future__ import print_function
 
 from tensorflow.python.data.ops import dataset_ops
-from tensorflow.python.data.util import nest
 from tensorflow.python.data.util import random_seed
-from tensorflow.python.data.util import sparse
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import tensor_shape
@@ -39,10 +37,7 @@ class RandomDataset(dataset_ops.Dataset):
     return gen_dataset_ops.random_dataset(
         seed=self._seed,
         seed2=self._seed2,
-        output_shapes=nest.flatten(
-            sparse.as_dense_shapes(self.output_shapes, self.output_classes)),
-        output_types=nest.flatten(
-            sparse.as_dense_types(self.output_types, self.output_classes)))
+        **dataset_ops.flat_structure(self))
 
   @property
   def output_classes(self):
diff --git a/tensorflow/contrib/data/python/ops/readers.py b/tensorflow/contrib/data/python/ops/readers.py
index 75c31a944a09462f534f6ae3e3204c812ecf28d9..83095c7ba1c6465d18490e5197f71bf7f1fe2497 100644
--- a/tensorflow/contrib/data/python/ops/readers.py
+++ b/tensorflow/contrib/data/python/ops/readers.py
@@ -17,6 +17,7 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import collections
 import csv
 
 import numpy as np
@@ -25,6 +26,7 @@ from tensorflow.contrib.data.python.ops import batching
 from tensorflow.contrib.data.python.ops import gen_dataset_ops as contrib_gen_dataset_ops
 from tensorflow.contrib.data.python.ops import interleave_ops
 from tensorflow.contrib.data.python.ops import shuffle_ops
+from tensorflow.contrib.data.python.ops import stats_ops
 from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python.data.ops import readers as core_readers
 from tensorflow.python.data.util import convert
@@ -467,11 +469,11 @@ def make_csv_dataset(
     Args:
       *columns: list of `Tensor`s corresponding to one csv record.
     Returns:
-      A dictionary of feature names to values for that particular record. If
+      An OrderedDict of feature names to values for that particular record. If
       label_name is provided, extracts the label feature to be returned as the
       second element of the tuple.
     """
-    features = dict(zip(column_names, columns))
+    features = collections.OrderedDict(zip(column_names, columns))
     if label_name is not None:
       label = features.pop(label_name)
       return features, label
@@ -753,6 +755,8 @@ def make_batched_features_dataset(file_pattern,
   dataset = _maybe_shuffle_and_repeat(
       dataset, num_epochs, shuffle, shuffle_buffer_size, shuffle_seed)
 
+  dataset = dataset.apply(stats_ops.feature_stats("record_stats"))
+
   if drop_final_batch:
     dataset = dataset.apply(batching.batch_and_drop_remainder(batch_size))
   else:
diff --git a/tensorflow/contrib/data/python/ops/resampling.py b/tensorflow/contrib/data/python/ops/resampling.py
index bad6edd5147d832228c412919f1e6e782aafc40f..182a5c6ff36fcda8c9e2c522cce07bed0c2daec9 100644
--- a/tensorflow/contrib/data/python/ops/resampling.py
+++ b/tensorflow/contrib/data/python/ops/resampling.py
@@ -291,4 +291,4 @@ def _calculate_acceptance_probs_with_mixing(initial_probs, target_probs):
 
   # TODO(joelshor): Simplify fraction, if possible.
   a_i = (ratio_l - m) / (max_ratio - m)
-  return a_i, m
\ No newline at end of file
+  return a_i, m
diff --git a/tensorflow/contrib/data/python/ops/scan_ops.py b/tensorflow/contrib/data/python/ops/scan_ops.py
index e911ad0fa0541f2d8b991d66182dd002c2ecaab0..67eede981cb8f685ba4840d1f3c12bfea54c7646 100644
--- a/tensorflow/contrib/data/python/ops/scan_ops.py
+++ b/tensorflow/contrib/data/python/ops/scan_ops.py
@@ -148,6 +148,8 @@ class _ScanDataset(dataset_ops.Dataset):
         self._output_types = nest.pack_sequence_as(
             output_value, [t.dtype for t in nest.flatten(output_value)])
 
+        dataset_ops._warn_if_collections("tf.contrib.data.scan()")  # pylint: disable=protected-access
+
         # Serialize any sparse tensors.
         new_state = nest.pack_sequence_as(new_state, [
             t for t in nest.flatten(sparse.serialize_sparse_tensors(new_state))
@@ -193,10 +195,7 @@ class _ScanDataset(dataset_ops.Dataset):
         nest.flatten(sparse.serialize_sparse_tensors(self._initial_state)),
         self._scan_func.captured_inputs,
         f=self._scan_func,
-        output_types=nest.flatten(
-            sparse.as_dense_types(self.output_types, self.output_classes)),
-        output_shapes=nest.flatten(
-            sparse.as_dense_shapes(self.output_shapes, self.output_classes)))
+        **dataset_ops.flat_structure(self))
 
   @property
   def output_classes(self):
diff --git a/tensorflow/contrib/data/python/ops/shuffle_ops.py b/tensorflow/contrib/data/python/ops/shuffle_ops.py
index f35795abd38000b13cec0f08596e2ff66e86286c..d7f8a73fe3d67bb83e44e962832ce34c116aef66 100644
--- a/tensorflow/contrib/data/python/ops/shuffle_ops.py
+++ b/tensorflow/contrib/data/python/ops/shuffle_ops.py
@@ -18,9 +18,7 @@ from __future__ import division
 from __future__ import print_function
 
 from tensorflow.python.data.ops import dataset_ops
-from tensorflow.python.data.util import nest
 from tensorflow.python.data.util import random_seed
-from tensorflow.python.data.util import sparse
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
@@ -56,10 +54,7 @@ class _ShuffleAndRepeatDataset(dataset_ops.Dataset):
         count=self._count,
         seed=self._seed,
         seed2=self._seed2,
-        output_types=nest.flatten(
-            sparse.as_dense_types(self.output_types, self.output_classes)),
-        output_shapes=nest.flatten(
-            sparse.as_dense_shapes(self.output_shapes, self.output_classes)))
+        **dataset_ops.flat_structure(self))
     # pylint: enable=protected-access
 
   @property
diff --git a/tensorflow/contrib/data/python/ops/sliding.py b/tensorflow/contrib/data/python/ops/sliding.py
index 19cc3cb89fc5c494f79ce1d25ed57c92099c8bd2..f935beb1a9e85d4901857e7781a5ed8473838fa5 100644
--- a/tensorflow/contrib/data/python/ops/sliding.py
+++ b/tensorflow/contrib/data/python/ops/sliding.py
@@ -19,7 +19,6 @@ from __future__ import print_function
 
 from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python.data.util import nest
-from tensorflow.python.data.util import sparse
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import tensor_shape
@@ -43,10 +42,7 @@ class _SlideDataset(dataset_ops.Dataset):
         self._input_dataset._as_variant_tensor(),  # pylint: disable=protected-access
         window_size=self._window_size,
         stride=self._stride,
-        output_shapes=nest.flatten(
-            sparse.as_dense_shapes(self.output_shapes, self.output_classes)),
-        output_types=nest.flatten(
-            sparse.as_dense_types(self.output_types, self.output_classes)))
+        **dataset_ops.flat_structure(self))
 
   @property
   def output_classes(self):
diff --git a/tensorflow/contrib/data/python/ops/stats_ops.py b/tensorflow/contrib/data/python/ops/stats_ops.py
index 3cbaab5affd7397213b0fbb6b0682db92b99d591..3c82a03df1745d855b2d3f918f7bbde113600556 100644
--- a/tensorflow/contrib/data/python/ops/stats_ops.py
+++ b/tensorflow/contrib/data/python/ops/stats_ops.py
@@ -18,8 +18,6 @@ from __future__ import division
 from __future__ import print_function
 
 from tensorflow.python.data.ops import dataset_ops
-from tensorflow.python.data.util import nest
-from tensorflow.python.data.util import sparse
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import gen_dataset_ops
@@ -97,10 +95,7 @@ class _SetStatsAggregatorDataset(dataset_ops.Dataset):
     return gen_dataset_ops.set_stats_aggregator_dataset(
         self._input_dataset._as_variant_tensor(),  # pylint: disable=protected-access
         self._stats_aggregator._resource,  # pylint: disable=protected-access
-        output_types=nest.flatten(
-            sparse.as_dense_types(self.output_types, self.output_classes)),
-        output_shapes=nest.flatten(
-            sparse.as_dense_shapes(self.output_shapes, self.output_classes)))
+        **dataset_ops.flat_structure(self))
 
   @property
   def output_shapes(self):
@@ -176,6 +171,27 @@ def latency_stats(tag):
   return _apply_fn
 
 
+def feature_stats(tag):
+  """Records the features stats from `Example` records of the input dataset.
+
+  To consume the statistics, associate a `StatsAggregator` with the output
+  dataset.
+
+  Args:
+    tag: String. All statistics recorded by the returned transformation will be
+      associated with the given `tag`.
+
+  Returns:
+    A `Dataset` transformation function, which can be passed to
+    @{tf.data.Dataset.apply}.
+  """
+
+  def _apply_fn(dataset):
+    return _StatsDataset(dataset, gen_dataset_ops.feature_stats_dataset, tag)
+
+  return _apply_fn
+
+
 class _StatsDataset(dataset_ops.Dataset):
   """A `Dataset` that acts as an identity, and also records statistics."""
 
@@ -189,10 +205,7 @@ class _StatsDataset(dataset_ops.Dataset):
     return self._op_function(
         self._input_dataset._as_variant_tensor(),  # pylint: disable=protected-access
         self._tag,
-        output_types=nest.flatten(
-            sparse.as_dense_types(self.output_types, self.output_classes)),
-        output_shapes=nest.flatten(
-            sparse.as_dense_shapes(self.output_shapes, self.output_classes)))
+        **dataset_ops.flat_structure(self))
 
   @property
   def output_shapes(self):
diff --git a/tensorflow/contrib/data/python/ops/threadpool.py b/tensorflow/contrib/data/python/ops/threadpool.py
index 56f67e1766bbaff680bdff6b939df0c3ba68c679..bb49604d4de90d726418684124608438aa33e6cf 100644
--- a/tensorflow/contrib/data/python/ops/threadpool.py
+++ b/tensorflow/contrib/data/python/ops/threadpool.py
@@ -22,8 +22,6 @@ import threading
 from tensorflow.contrib.data.python.ops import contrib_op_loader  # pylint: disable=unused-import
 from tensorflow.contrib.data.python.ops import gen_dataset_ops
 from tensorflow.python.data.ops import dataset_ops
-from tensorflow.python.data.util import nest
-from tensorflow.python.data.util import sparse
 from tensorflow.python.eager import context
 from tensorflow.python.ops import resource_variable_ops
 
@@ -69,10 +67,7 @@ class _ThreadPoolDataset(dataset_ops.Dataset):
     return gen_dataset_ops.thread_pool_dataset(
         self._input_dataset._as_variant_tensor(),  # pylint: disable=protected-access
         self._thread_pool._resource,  # pylint: disable=protected-access
-        output_shapes=nest.flatten(
-            sparse.as_dense_shapes(self.output_shapes, self.output_classes)),
-        output_types=nest.flatten(
-            sparse.as_dense_types(self.output_types, self.output_classes)))
+        **dataset_ops.flat_structure(self))
 
   @property
   def output_shapes(self):
diff --git a/tensorflow/contrib/data/python/ops/unique.py b/tensorflow/contrib/data/python/ops/unique.py
index 765ef3f9b6d42c9d7af3ce4916731d37d65c9260..4ce6ddede8350735636fd152fdc9df0319265990 100644
--- a/tensorflow/contrib/data/python/ops/unique.py
+++ b/tensorflow/contrib/data/python/ops/unique.py
@@ -20,8 +20,6 @@ from __future__ import print_function
 from tensorflow.contrib.data.python.ops import contrib_op_loader  # pylint: disable=unused-import
 from tensorflow.contrib.data.python.ops import gen_dataset_ops
 from tensorflow.python.data.ops import dataset_ops
-from tensorflow.python.data.util import nest
-from tensorflow.python.data.util import sparse
 from tensorflow.python.framework import dtypes
 
 
@@ -65,10 +63,7 @@ class UniqueDataset(dataset_ops.Dataset):
   def _as_variant_tensor(self):
     return gen_dataset_ops.unique_dataset(
         self._input_dataset._as_variant_tensor(),  # pylint: disable=protected-access
-        output_shapes=nest.flatten(
-            sparse.as_dense_shapes(self.output_shapes, self.output_classes)),
-        output_types=nest.flatten(
-            sparse.as_dense_types(self.output_types, self.output_classes)))
+        **dataset_ops.flat_structure(self))
 
   @property
   def output_classes(self):
diff --git a/tensorflow/contrib/distribute/python/BUILD b/tensorflow/contrib/distribute/python/BUILD
index 64a77bbed1d55c3d95329d9c7783c2b468bde745..9dfb8552f1b0f058b44f8ed09c2ed681367293d5 100644
--- a/tensorflow/contrib/distribute/python/BUILD
+++ b/tensorflow/contrib/distribute/python/BUILD
@@ -77,6 +77,7 @@ py_library(
         "//tensorflow/python:device_util",
         "//tensorflow/python:distribute",
         "//tensorflow/python:framework_ops",
+        "//tensorflow/python:math_ops",
         "//tensorflow/python:pywrap_tensorflow",
         "//tensorflow/python:training",
         "//tensorflow/python:variable_scope",
@@ -148,6 +149,7 @@ py_library(
     ],
     deps = [
         ":mirrored_strategy",
+        ":multi_worker_strategy",
         ":one_device_strategy",
         ":tpu_strategy",
         "//tensorflow/contrib/optimizer_v2:training",
@@ -445,10 +447,31 @@ py_library(
     srcs = ["cross_tower_utils.py"],
     srcs_version = "PY2AND3",
     deps = [
+        ":values",
+        "//tensorflow/contrib/all_reduce:all_reduce_py",
         "//tensorflow/contrib/nccl:nccl_py",
         "//tensorflow/python:array_ops",
+        "//tensorflow/python:dtypes",
+        "//tensorflow/python:framework_ops",
+        "//tensorflow/python:math_ops",
+    ],
+)
+
+cuda_py_test(
+    name = "cross_tower_utils_test",
+    srcs = ["cross_tower_utils_test.py"],
+    additional_deps = [
+        ":combinations",
+        ":cross_tower_utils",
+        "@absl_py//absl/testing:parameterized",
+        "//tensorflow/python:constant_op",
         "//tensorflow/python:framework_ops",
         "//tensorflow/python:math_ops",
+        "//tensorflow/python/eager:context",
+        "//tensorflow/python/eager:test",
+    ],
+    tags = [
+        "no_pip",
     ],
 )
 
@@ -476,6 +499,7 @@ cuda_py_test(
     additional_deps = [
         ":combinations",
         ":cross_tower_ops",
+        ":multi_worker_test_base",
         ":values",
         "@absl_py//absl/testing:parameterized",
         "//tensorflow/python:array_ops",
@@ -485,6 +509,7 @@ cuda_py_test(
         "//tensorflow/python/eager:context",
         "//tensorflow/python/eager:test",
     ],
+    shard_count = 15,
     tags = [
         "multi_and_single_gpu",
         "no_pip",
@@ -547,3 +572,41 @@ cuda_py_test(
         "no_pip",
     ],
 )
+
+cuda_py_test(
+    name = "keras_test",
+    srcs = ["keras_test.py"],
+    additional_deps = [
+        "//third_party/py/numpy",
+        "//tensorflow/contrib/distribute/python:mirrored_strategy",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:training",
+        "//tensorflow/python/estimator:keras",
+        "//tensorflow/python/estimator:run_config",
+        "//tensorflow/python/keras",
+    ],
+    tags = [
+        "multi_and_single_gpu",
+        "noguitar",
+        "notsan",
+    ],
+)
+
+cuda_py_test(
+    name = "metrics_v1_test",
+    srcs = ["metrics_v1_test.py"],
+    additional_deps = [
+        ":combinations",
+        "@absl_py//absl/testing:parameterized",
+        "//tensorflow/contrib/data/python/ops:batching",
+        "//tensorflow/python:math_ops",
+        "//tensorflow/python:metrics",
+        "//tensorflow/python:variables",
+        "//tensorflow/python/data/ops:dataset_ops",
+        "//tensorflow/python/eager:test",
+    ],
+    tags = [
+        "multi_and_single_gpu",
+        "no_pip",
+    ],
+)
diff --git a/tensorflow/contrib/distribute/python/combinations.py b/tensorflow/contrib/distribute/python/combinations.py
index 15935817b0283ebc04b95304afe41d8690a11442..ba03b14deb9a3897dae29382ce601c0319f84735 100644
--- a/tensorflow/contrib/distribute/python/combinations.py
+++ b/tensorflow/contrib/distribute/python/combinations.py
@@ -41,11 +41,15 @@ from __future__ import print_function
 
 from collections import OrderedDict
 import sys
+import types
+import unittest
 from absl.testing import parameterized
+import six
 
-from tensorflow.contrib.distribute.python import mirrored_strategy
-from tensorflow.contrib.distribute.python import one_device_strategy
-from tensorflow.contrib.distribute.python import tpu_strategy
+from tensorflow.contrib.distribute.python import mirrored_strategy as mirrored_lib
+from tensorflow.contrib.distribute.python import multi_worker_strategy
+from tensorflow.contrib.distribute.python import one_device_strategy as one_device_lib
+from tensorflow.contrib.distribute.python import tpu_strategy as tpu_lib
 from tensorflow.contrib.optimizer_v2 import adam as adam_v2
 from tensorflow.contrib.optimizer_v2 import gradient_descent as gradient_descent_v2
 from tensorflow.python.eager import context
@@ -67,8 +71,8 @@ def generate(combinations):
     combinations: a list of dictionaries created using combine() and times().
 
   Restrictions:
-   -- there should always be a "mode" argument.  Accepted values are "eager"
-      and "graph".
+   -- the "mode" argument can be either "eager" or "graph".  It's "graph" by
+      default.
    -- arguments of the test method must match by name to get the corresponding
       value of the combination.  Tests must accept all arguments except the
       "mode", "required_tpu" and "required_gpus".
@@ -83,14 +87,15 @@ def generate(combinations):
       test will be skipped if the specified number of GPUs aren't available.
 
   Returns:
-    a decorator that will cause the test method to be run under the specified
-    conditions.
+    a decorator that will cause the test method or the test class to be run
+    under the specified conditions.
 
   Raises:
-    ValueError - if "mode" argument wasn't either "eager" or "graph".
+    ValueError - if "mode" argument wasn't either "eager" or "graph" or if other
+      arguments were not accepted by the test method.
   """
 
-  def decorator(test_function):
+  def decorator(test_method_or_class):
     """The decorator to be returned."""
 
     # Generate good test names that can be used with --test_filter.
@@ -110,70 +115,91 @@ def generate(combinations):
               list(combination.items()) + [("testcase_name",
                                             "_test{}".format(name))]))
 
-    @parameterized.named_parameters(*named_combinations)
-    def decorated(self, **kwargs):
-      """A wrapped test method that sets up `test_function`."""
-      assert "mode" in kwargs
-      mode = kwargs["mode"]
-
-      distribution = kwargs.pop("distribution", None)
-      required_tpu = kwargs.pop("required_tpu", False)
-      required_gpus = kwargs.pop("required_gpus", None)
-
-      if distribution:
-        assert required_gpus is None, (
-            "Do not use `required_gpus` and `distribution` together.")
-        assert required_tpu is False, (
-            "Do not use `required_tpu` and `distribution` together.")
-        kwargs["distribution"] = distribution.strategy
-        required_gpus = distribution.required_gpus
-        required_tpu = distribution.required_tpu
-
-      if required_tpu and not TPU_TEST:
-        self.skipTest("Test requires a TPU, but it's not available.")
-      if not required_tpu and TPU_TEST:
-        self.skipTest("Test that doesn't require a TPU.")
-
-      if not required_gpus:
-        if GPU_TEST:
-          self.skipTest("Test that doesn't require GPUs.")
-      elif context.num_gpus() < required_gpus:
-        self.skipTest(
-            "{} GPUs are not available for this test. {} GPUs are available".
-            format(required_gpus, context.num_gpus()))
-
-      # At this point, `kwargs` doesn't have `required_gpus` or `required_tpu`
-      # that the user might have specified.  `kwargs` still has `mode`, which
-      # the test is allowed to accept or ignore.
-      requested_arguments = tf_inspect.getfullargspec(test_function).args
-      missing_arguments = set(list(kwargs.keys()) + ["self"]).difference(
-          set(requested_arguments + ["mode"]))
-      if missing_arguments:
-        raise ValueError("The test is missing arguments {} .".format(
-            missing_arguments))
-
-      kwargs_to_pass = {}
-      for arg in requested_arguments:
-        if arg == "self":
-          kwargs_to_pass[arg] = self
-        else:
-          kwargs_to_pass[arg] = kwargs[arg]
-
-      if mode == "eager":
-        with context.eager_mode(), ops.Graph().as_default():
-          test_function(**kwargs_to_pass)
-      elif mode == "graph":
-        with context.graph_mode(), ops.Graph().as_default():
-          test_function(**kwargs_to_pass)
-      else:
-        raise ValueError(
-            "'mode' has to be either 'eager' or 'graph' and not {}".format(
-                mode))
+    if isinstance(test_method_or_class, type):
+      class_object = test_method_or_class
+      class_object._test_method_ids = test_method_ids = {}
+      for name, test_method in six.iteritems(class_object.__dict__.copy()):
+        if (name.startswith(unittest.TestLoader.testMethodPrefix) and
+            isinstance(test_method, types.FunctionType)):
+          delattr(class_object, name)
+          methods = {}
+          parameterized._update_class_dict_for_param_test_case(
+              class_object.__name__, methods, test_method_ids, name,
+              parameterized._ParameterizedTestIter(
+                  _augment_with_special_arguments(test_method),
+                  named_combinations, parameterized._NAMED, name))
+          for method_name, method in six.iteritems(methods):
+            setattr(class_object, method_name, method)
+
+      return class_object
+    else:
+      test_method = _augment_with_special_arguments(test_method_or_class)
+      return parameterized.named_parameters(*named_combinations)(test_method)
 
-    return decorated
   return decorator
 
 
+def _augment_with_special_arguments(test_method):
+  def decorated(self, **kwargs):
+    """A wrapped test method that treats some arguments in a special way."""
+    mode = kwargs.pop("mode", "graph")
+
+    distribution = kwargs.pop("distribution", None)
+    required_tpu = kwargs.pop("required_tpu", False)
+    required_gpus = kwargs.pop("required_gpus", None)
+
+    if distribution:
+      assert required_gpus is None, (
+          "Do not use `required_gpus` and `distribution` together.")
+      assert required_tpu is False, (
+          "Do not use `required_tpu` and `distribution` together.")
+      kwargs["distribution"] = distribution.strategy
+      required_gpus = distribution.required_gpus
+      required_tpu = distribution.required_tpu
+
+    if required_tpu and not TPU_TEST:
+      self.skipTest("Test requires a TPU, but it's not available.")
+    if not required_tpu and TPU_TEST:
+      self.skipTest("Test that doesn't require a TPU.")
+
+    if not required_gpus:
+      if GPU_TEST:
+        self.skipTest("Test that doesn't require GPUs.")
+    elif context.num_gpus() < required_gpus:
+      self.skipTest(
+          "{} GPUs are not available for this test. {} GPUs are available".
+          format(required_gpus, context.num_gpus()))
+
+    # At this point, `kwargs` doesn't have `required_gpus` or `required_tpu`
+    # that the user might have specified.  `kwargs` still has `mode`, which
+    # the test is allowed to accept or ignore.
+    requested_arguments = tf_inspect.getfullargspec(test_method).args
+    missing_arguments = set(list(kwargs.keys()) + ["self"]).difference(
+        set(requested_arguments + ["mode"]))
+    if missing_arguments:
+      raise ValueError("The test is missing arguments {} .".format(
+          missing_arguments))
+
+    kwargs_to_pass = {}
+    for arg in requested_arguments:
+      if arg == "self":
+        kwargs_to_pass[arg] = self
+      else:
+        kwargs_to_pass[arg] = kwargs[arg]
+
+    if mode == "eager":
+      with ops.Graph().as_default(), context.eager_mode():
+        test_method(**kwargs_to_pass)
+    elif mode == "graph":
+      with ops.Graph().as_default(), context.graph_mode():
+        test_method(**kwargs_to_pass)
+    else:
+      raise ValueError(
+          "'mode' has to be either 'eager' or 'graph' and not {}".format(
+              mode))
+  return decorated
+
+
 def combine(**kwargs):
   """Generate combinations based on its keyword arguments.
 
@@ -264,9 +290,9 @@ class NamedObject(object):
 class NamedDistribution(object):
   """Translates DistributionStrategy and its data into a good name."""
 
-  def __init__(self, name, distribution, required_gpus=None,
+  def __init__(self, name, distribution_fn, required_gpus=None,
                required_tpu=False):
-    self._distribution = distribution
+    self._distribution_fn = distribution_fn
     self._name = name
     self._required_gpus = required_gpus
     self._required_tpu = required_tpu
@@ -276,7 +302,7 @@ class NamedDistribution(object):
 
   @property
   def strategy(self):
-    return self._distribution
+    return self._distribution_fn()
 
   @property
   def required_gpus(self):
@@ -287,32 +313,60 @@ class NamedDistribution(object):
     return self._required_tpu
 
 
+# pylint: disable=g-long-lambda
 default_strategy = NamedDistribution(
     "Default",
-    distribute_lib._default_distribution_strategy,  # pylint: disable=protected-access
+    lambda: distribute_lib._default_distribution_strategy,  # pylint: disable=protected-access
     required_gpus=None)
 one_device_strategy = NamedDistribution(
-    "OneDeviceCPU", one_device_strategy.OneDeviceStrategy("/cpu:0"),
+    "OneDeviceCPU", lambda: one_device_lib.OneDeviceStrategy("/cpu:0"),
     required_gpus=None)
 tpu_strategy_single_iteration = NamedDistribution(
     "TPUSingleIteration",
-    tpu_strategy.TPUStrategy(iterations_per_step=1),
+    lambda: tpu_lib.TPUStrategy(iterations_per_step=1),
     required_tpu=True)
-tpu_strategy = NamedDistribution(
-    "TPU", tpu_strategy.TPUStrategy(), required_tpu=True)
+tpu_strategy = NamedDistribution("TPU", tpu_lib.TPUStrategy, required_tpu=True)
 # Note that we disable prefetching for testing since prefetching makes
 # the input non-deterministic.
 mirrored_strategy_with_gpu_and_cpu = NamedDistribution(
     "MirroredCPUAndGPU",
-    mirrored_strategy.MirroredStrategy(
+    lambda: mirrored_lib.MirroredStrategy(
         ["/gpu:0", "/cpu:0"], prefetch_on_device=False),
     required_gpus=1)
 mirrored_strategy_with_two_gpus = NamedDistribution(
     "Mirrored2GPUs",
-    mirrored_strategy.MirroredStrategy(
+    lambda: mirrored_lib.MirroredStrategy(
         ["/gpu:0", "/gpu:1"], prefetch_on_device=False),
     required_gpus=2)
 
+multi_worker_strategy_with_cpu = NamedDistribution(
+    "MultiWorkerCPU",
+    lambda: multi_worker_strategy.MultiWorkerMirroredStrategy(
+        cluster={
+            "worker": [
+                "/job:worker/replica:0/task:0", "/job:worker/replica:0/task:1"
+            ]
+        },
+        num_gpus_per_worker=0), 0)
+multi_worker_strategy_with_one_gpu = NamedDistribution(
+    "MultiWorker1GPU",
+    lambda: multi_worker_strategy.MultiWorkerMirroredStrategy(
+        cluster={
+            "worker": [
+                "/job:worker/replica:0/task:0", "/job:worker/replica:0/task:1"
+            ]
+        },
+        num_gpus_per_worker=1), 1)
+multi_worker_strategy_with_two_gpus = NamedDistribution(
+    "MultiWorker2GPUs",
+    lambda: multi_worker_strategy.MultiWorkerMirroredStrategy(
+        cluster={
+            "worker": [
+                "/job:worker/replica:0/task:0", "/job:worker/replica:0/task:1"
+            ]
+        },
+        num_gpus_per_worker=2), 2)
+
 adam_optimizer_v1_fn = NamedObject(
     "AdamV1", lambda: adam.AdamOptimizer(0.2, epsilon=1))
 gradient_descent_optimizer_v1_fn = NamedObject(
diff --git a/tensorflow/contrib/distribute/python/combinations_test.py b/tensorflow/contrib/distribute/python/combinations_test.py
index 184bcf27e59d68b82d28e8f01890c04f214c017c..86aa48cea889c6c2ce169b18bcabb6d08890fbed 100644
--- a/tensorflow/contrib/distribute/python/combinations_test.py
+++ b/tensorflow/contrib/distribute/python/combinations_test.py
@@ -19,6 +19,7 @@ from __future__ import division
 from __future__ import print_function
 
 from collections import OrderedDict
+from absl.testing import parameterized
 
 from tensorflow.contrib.distribute.python import combinations
 from tensorflow.python.eager import test
@@ -120,5 +121,28 @@ class TestingCombinationsTest(test.TestCase):
       _ = combinations.times(c1, c2)
 
 
+@combinations.generate(combinations.combine(a=[1, 0], b=[2, 3], c=[1]))
+class CombineTheTestSuite(parameterized.TestCase):
+
+  def test_add_things(self, a, b, c):
+    self.assertLessEqual(3, a + b + c)
+    self.assertLessEqual(a + b + c, 5)
+
+  def test_add_things_one_more(self, a, b, c):
+    self.assertLessEqual(3, a + b + c)
+    self.assertLessEqual(a + b + c, 5)
+
+  def not_a_test(self, a=0, b=0, c=0):
+    del a, b, c
+    self.fail()
+
+  def _test_but_private(self, a=0, b=0, c=0):
+    del a, b, c
+    self.fail()
+
+  # Check that nothing funny happens to a non-callable that starts with "_test".
+  test_member = 0
+
+
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/contrib/distribute/python/cross_tower_ops.py b/tensorflow/contrib/distribute/python/cross_tower_ops.py
index c6a1bf6a9f65828c45617ae18a1b0989f9d46225..f8ae8b9712c392fa948c8598dd123cdea01d9866 100644
--- a/tensorflow/contrib/distribute/python/cross_tower_ops.py
+++ b/tensorflow/contrib/distribute/python/cross_tower_ops.py
@@ -18,6 +18,7 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import collections
 import six
 
 from tensorflow.contrib.distribute.python import cross_tower_utils
@@ -77,12 +78,12 @@ def _all_devices_match(value_destination_pairs):
   return True
 
 
-def _simple_broadcast(tensor, destinations):
+def _simple_broadcast(value, destinations):
   index = {}
   devices = _get_devices_from(destinations)
   for d in devices:
-    with ops.device(d):
-      index[d] = array_ops.identity(tensor)
+    index[d] = cross_tower_utils.copy_tensor_or_indexed_slices_to_device(
+        value, d)
   return value_lib.Mirrored(index)
 
 
@@ -98,7 +99,9 @@ def _simple_reduce(per_device_value, reduce_to_device, accumulation_fn,
         continue
       count += len(v_list)
       # Sum within each device before aggregating across devices.
-      v = math_ops.add_n(v_list)
+      # TODO(yuefengz): Check whether it helps to use accumulation_fn here.
+      v = cross_tower_utils.aggregate_tensors_or_indexed_slices(
+          v_list, math_ops.add_n)
     else:
       count += 1
     all_values.append(v)
@@ -107,11 +110,12 @@ def _simple_reduce(per_device_value, reduce_to_device, accumulation_fn,
 
   with ops.device(reduce_to_device):
     with context.context().device_policy(context.DEVICE_PLACEMENT_SILENT):
-      if method_string == "sum":
-        reduced = accumulation_fn(all_values)
-      elif method_string == "mean":
-        reduced = accumulation_fn(all_values) / count
-      else:
+      reduced = cross_tower_utils.aggregate_tensors_or_indexed_slices(
+          all_values, accumulation_fn)
+      if method_string == "mean":
+        reduced = cross_tower_utils.divide_by_n_tensors_or_indexed_slices(
+            reduced, count)
+      elif method_string != "sum":
         raise ValueError("`method_string` must be 'sum' or 'mean'")
   return reduced
 
@@ -231,7 +235,13 @@ class ReductionToOneDeviceCrossTowerOps(CrossTowerOps):
 def _group_value_by_device(per_device_values):
   """Group values into sublists by their devices.
 
-  This grouping is needed to call the all-reduce library.
+  This grouping is needed to call the all-reduce library because it expects a
+  list of the following form:
+    [(grad0_gpu0, v0_gpu0), (grad1_gpu0, v1_gpu0), (grad2_gpu0, v2_gpu0) ...
+     (grad0_gpu1, v0_gpu1), (grad1_gpu1, v1_gpu1), (grad2_gpu1, v2_gpu1) ...
+     (grad0_gpu2, v0_gpu2), (grad1_gpu0, v1_gpu2), (grad2_gpu0, v2_gpu2) ...
+     ...
+    ]
 
   Args:
     per_device_values: a list of PerDevice obejcts.
@@ -319,7 +329,17 @@ class ConcatAndSplitPacker(object):
         # TODO(zhengxq): it is also possible to optimize away all the concat
         # as well.
         num_splits = self.num_packs
-        total_grad_size = array_ops.size(concat_grads)
+
+        # The array_ops.size function will sometimes remove static shapes. So if
+        # all gradient shapes are defined, we use another method to get the
+        # total size.
+        # TODO(yuefengz): move this logic to array_ops.size.
+        if all([g.shape.is_fully_defined() for g, _ in tower_grads_and_vars]):
+          total_grad_size = sum(
+              [g.shape.num_elements() for g, _ in tower_grads_and_vars])
+        else:
+          total_grad_size = array_ops.size(concat_grads)
+
         split_size = total_grad_size // num_splits
         split_size_last = total_grad_size - split_size * (num_splits - 1)
         split_sizes = [split_size] * (num_splits - 1) + [split_size_last]
@@ -409,6 +429,31 @@ class AggregateSmallTensorPacker(object):
                                                   self.packing)
 
 
+def _pack_tensors(device_grads,
+                  num_packs=0,
+                  agg_small_grads_max_bytes=0,
+                  agg_small_grads_max_group=0):
+  """Pack tensors if specified."""
+  if num_packs > 0:
+    tensor_packer = ConcatAndSplitPacker(num_packs)
+    device_grad_packs = tensor_packer.pack(device_grads)
+  elif agg_small_grads_max_bytes > 0 and agg_small_grads_max_group > 0:
+    tensor_packer = AggregateSmallTensorPacker(agg_small_grads_max_bytes,
+                                               agg_small_grads_max_group)
+    device_grad_packs = tensor_packer.pack(device_grads)
+  else:
+    tensor_packer = None
+    device_grad_packs = device_grads
+  return device_grad_packs, tensor_packer
+
+
+def _unpack_tensors(reduced, tensor_packer=None):
+  """Unpack tensors if they are packed before all-reduce."""
+  if tensor_packer:
+    return tensor_packer.unpack(reduced)
+  return reduced
+
+
 class AllReduceCrossTowerOps(CrossTowerOps):
   """Reduction using all reduce."""
 
@@ -437,17 +482,25 @@ class AllReduceCrossTowerOps(CrossTowerOps):
       agg_small_grads_max_group: see above.
         tensors.
     """
-    self.all_reduce_alg = all_reduce_alg
-    self.num_packs = num_packs
-    self.agg_small_grads_max_bytes = agg_small_grads_max_bytes
-    self.agg_small_grads_max_group = agg_small_grads_max_group
+    self._all_reduce_alg = all_reduce_alg
+    self._num_packs = num_packs
+    self._agg_small_grads_max_bytes = agg_small_grads_max_bytes
+    self._agg_small_grads_max_group = agg_small_grads_max_group
     super(AllReduceCrossTowerOps, self).__init__()
 
   def _reduce(self, method_string, per_device_value, destinations):
+    contains_indexed_slices = cross_tower_utils.contains_indexed_slices(
+        per_device_value)
     if ((destinations is None or _devices_match(per_device_value, destinations))
-        and not context.executing_eagerly()):
+        and not context.executing_eagerly()
+        and not contains_indexed_slices):
       return self._batch_all_reduce(method_string, [per_device_value])[0]
     else:
+      if contains_indexed_slices:
+        logging.log_first_n(
+            logging.WARN,
+            "Efficient allreduce is not supported for IndexedSlices.", 10)
+
       devices = _get_devices_from(destinations or per_device_value)
       reduce_to_device = devices[0]
       reduced = _simple_reduce(per_device_value, reduce_to_device,
@@ -455,14 +508,18 @@ class AllReduceCrossTowerOps(CrossTowerOps):
       return self.broadcast(reduced, devices)
 
   def _batch_reduce(self, method_string, value_destination_pairs):
-    if (_all_devices_match(value_destination_pairs) and
-        not context.executing_eagerly()):
+    all_devices_match = _all_devices_match(value_destination_pairs)
+    contains_indexed_slices = cross_tower_utils.contains_indexed_slices(
+        value_destination_pairs)
+    if (all_devices_match and not context.executing_eagerly()
+        and not contains_indexed_slices):
       return self._batch_all_reduce(method_string,
                                     [v[0] for v in value_destination_pairs])
     else:
-      if not context.executing_eagerly():
+      if not all_devices_match:
         logging.warning("Efficient batch_reduce is not supported if "
                         "destinations are different.")
+
       return [
           self._reduce(method_string, t, destinations=v)
           for t, v in value_destination_pairs
@@ -470,37 +527,24 @@ class AllReduceCrossTowerOps(CrossTowerOps):
 
   def _batch_all_reduce(self, method_string, per_device_values):
     """All reduce algorithm in a batch."""
+    logging.info(
+        "batch_all_reduce invoked for batches size = %d with "
+        "algorithm = %s, num_packs = %d, agg_small_grads_max_bytes = %d and "
+        "agg_small_grads_max_group = %d", len(per_device_values),
+        self._all_reduce_alg, self._num_packs, self._agg_small_grads_max_bytes,
+        self._agg_small_grads_max_group)
     destinations = per_device_values[0].devices
     grouped = _group_value_by_device(per_device_values)
-    if self.num_packs > 0:
-      logging.info(
-          "batch_all_reduce invoked for batches size = %d with "
-          "algorithm = %s and num_packs = %d", len(per_device_values),
-          self.all_reduce_alg, self.num_packs)
-      tensor_packer = ConcatAndSplitPacker(self.num_packs)
-      device_grad_packs = tensor_packer.pack(grouped)
-    elif (self.agg_small_grads_max_bytes > 0 and
-          self.agg_small_grads_max_group > 0):
-      logging.info(
-          "batch_all_reduce invoked for batches size = %d with "
-          "algorithm = %s, agg_small_grads_max_bytes = %d and "
-          "agg_small_grads_max_group = %d", len(per_device_values),
-          self.all_reduce_alg, self.agg_small_grads_max_bytes,
-          self.agg_small_grads_max_group)
-      tensor_packer = AggregateSmallTensorPacker(
-          self.agg_small_grads_max_bytes, self.agg_small_grads_max_group)
-      device_grad_packs = tensor_packer.pack(grouped)
-    else:
-      logging.info(
-          "batch_all_reduce invoked for batches size = %d with algorithm = %s",
-          len(per_device_values), self.all_reduce_alg)
-      tensor_packer = None
-      device_grad_packs = grouped
+
+    device_grad_packs, self._tensor_packer = _pack_tensors(
+        grouped, self._num_packs, self._agg_small_grads_max_bytes,
+        self._agg_small_grads_max_group)
 
     # The actual aggregation of the repacked gradients. Note that they are
     # sharded among different aggregation trees. So it is important to strike
     # the balance on num_splits.
-    if self.all_reduce_alg == "nccl":
+    if self._all_reduce_alg == "nccl":
+      # TODO(yuefengz): merge this into the all-reduce library.
       reduced = cross_tower_utils.aggregate_gradients_using_nccl(
           device_grad_packs)
     else:
@@ -510,13 +554,137 @@ class AllReduceCrossTowerOps(CrossTowerOps):
           cross_tower_utils.aggregate_gradients_using_hierarchical_copy(
               destinations, device_grad_packs))
 
-    if tensor_packer:
-      reduced = tensor_packer.unpack(reduced)
-
+    reduced = _unpack_tensors(reduced, self._tensor_packer)
     return _ungroup_and_make_mirrored(reduced, per_device_values[0].devices,
                                       method_string)
 
 
+AllReduceSpecTuple = collections.namedtuple("AllReduceSpecTuple",
+                                            "alg shards limit")
+
+
+class MultiWorkerAllReduce(AllReduceCrossTowerOps):
+  """All-reduce algorithms for distributed TensorFlow."""
+
+  def __init__(self,
+               worker_devices,
+               num_gpus_per_worker,
+               all_reduce_spec=("pscpu/pscpu", 2, -1),
+               num_packs=0,
+               agg_small_grads_max_bytes=0,
+               agg_small_grads_max_group=10):
+    """Initialize the all-reduce algorithm.
+
+    Args:
+      worker_devices: a list of device strings for workers participating in
+        all-reduce.
+      num_gpus_per_worker: number of GPU devices per worker.
+      all_reduce_spec: a tuple or a named tuple or a list of tuples specifying
+        the all-reduce algorithm.
+        1. The first element of a tuple is the name of the all-reduce algorithm.
+        Valid algorithm names are: "nccl", "nccl/xring", "nccl/rechd",
+        "nccl/pscpu", "xring", "pscpu", "psgpu", "pscpu/pscpu". Algorithms with
+        a "/" are hierarchical, so two all-reduces are executed, the first one
+        aggregates tensors within a worker and the second aggregates across
+        workers.
+        2. The second element of a tuple is the number of shards when doing
+        all-reduce. Let's say its values is M, each tensor after packing will be
+        split into M shards and then M parallel all-reduces would be performed
+        before finally they are concatenated backed into a complete tensor.
+        3. The third element is the maximum size of tensors that will be
+        applicable for the algorithm specified by the first element. For
+        example, if all_reduce_spec=[("nccl", 2, 1024), ("pscpu/pscpu", 2, -1)],
+        tensors with size not larger than 1024 bytes will be applied a 2-shard
+        "nccl" all-reduce and other tensors will be applied a 2-shard
+        "pscpu/pscpu" algorithm. The third elements should be in increasing
+        order across tuples and end with -1 which indicates infinity.
+      num_packs: see AllReduceCrossTowerOps.
+      agg_small_grads_max_bytes: see AllReduceCrossTowerOps.
+      agg_small_grads_max_group: see AllReduceCrossTowerOps.
+    """
+    self._worker_devices = worker_devices
+    self._num_gpus_per_worker = num_gpus_per_worker
+    super(MultiWorkerAllReduce, self).__init__(
+        num_packs=num_packs,
+        agg_small_grads_max_bytes=agg_small_grads_max_bytes,
+        agg_small_grads_max_group=agg_small_grads_max_group)
+
+    def validate_and_complete_spec(spec):
+      """Validate and complete the all-reduce spec."""
+      # TODO(yuefengz): support namedtuple.
+      if not isinstance(spec, tuple):
+        raise ValueError(
+            "A tuple is expected for all-reduce spec: %r" % all_reduce_spec)
+      if not spec or len(spec) > 3:
+        raise ValueError(
+            "Too many elements in the all-reduce spec tuple: %r" % spec)
+      if len(spec) == 1:
+        return AllReduceSpecTuple(spec[0], 1, -1)
+      elif len(spec) == 2:
+        return AllReduceSpecTuple(spec[0], spec[1], -1)
+      else:
+        return AllReduceSpecTuple(*spec)
+
+    self._all_reduce_spec = []
+    if isinstance(all_reduce_spec, six.string_types):
+      self._all_reduce_spec.append(AllReduceSpecTuple(all_reduce_spec, 1, -1))
+    elif isinstance(all_reduce_spec, tuple):
+      self._all_reduce_spec.append(validate_and_complete_spec(all_reduce_spec))
+    elif isinstance(all_reduce_spec, list):
+      self._all_reduce_spec = [
+          validate_and_complete_spec(spec) for spec in all_reduce_spec
+      ]
+
+  def _batch_all_reduce(self, method_string, per_device_values):
+    """All reduce algorithm in a batch."""
+    logging.info(
+        "distributed batch_all_reduce invoked for batches size = %d with "
+        "allreduce_spec = %r, num_packs = %d, agg_small_grads_max_bytes = %d "
+        "and agg_small_grads_max_group = %d", len(per_device_values),
+        self._all_reduce_spec, self._num_packs, self._agg_small_grads_max_bytes,
+        self._agg_small_grads_max_group)
+
+    destinations = sorted(per_device_values[0].devices)
+    device_grads = _group_value_by_device(per_device_values)
+
+    # The all reduce library requires fully defined shapes.
+    # TODO(yuefengz): when tensor sharding is not needed, static shapes are not
+    # required as well.
+    for device_grad in device_grads:
+      for grad, _ in device_grad:
+        if not grad.shape.is_fully_defined():
+          raise ValueError("Shape is unknown for node %r" % grad)
+
+    remaining_grads = device_grads
+    aggregated_grads = []
+    for spec_tuple in self._all_reduce_spec:
+      if spec_tuple.limit < 0:
+        this_grads = remaining_grads
+        remaining_grads = []
+      else:
+        (this_grads, remaining_grads) = cross_tower_utils.split_grads_by_size(
+            spec_tuple.limit, remaining_grads)
+      if this_grads:
+        device_grad_packs, self._tensor_packer = _pack_tensors(
+            this_grads, self._num_packs, self._agg_small_grads_max_bytes,
+            self._agg_small_grads_max_group)
+        range_agg_grads = cross_tower_utils.sum_gradients_all_reduce(
+            self._worker_devices, device_grad_packs, len(self._worker_devices),
+            spec_tuple.alg, spec_tuple.shards, range(self._num_gpus_per_worker))
+        range_agg_grads = _unpack_tensors(range_agg_grads, self._tensor_packer)
+
+        if not aggregated_grads:
+          aggregated_grads = range_agg_grads
+        else:
+          assert len(aggregated_grads) == len(range_agg_grads)
+          for i in range(len(aggregated_grads)):
+            aggregated_grads[i] += range_agg_grads[i]
+    assert not remaining_grads
+
+    return _ungroup_and_make_mirrored(aggregated_grads, destinations,
+                                      method_string)
+
+
 _dgx1_links = [[1, 2, 3, 4], [0, 2, 3, 5], [0, 1, 3, 6], [0, 1, 2, 7],
                [0, 5, 6, 7], [1, 4, 6, 7], [2, 4, 5, 7], [3, 4, 5, 6]]
 
diff --git a/tensorflow/contrib/distribute/python/cross_tower_ops_test.py b/tensorflow/contrib/distribute/python/cross_tower_ops_test.py
index 7c7b0870887465ec2fe40007695d099277db38bf..fed5505d92ef2544215069736c166a67d6141708 100644
--- a/tensorflow/contrib/distribute/python/cross_tower_ops_test.py
+++ b/tensorflow/contrib/distribute/python/cross_tower_ops_test.py
@@ -24,6 +24,7 @@ from absl.testing import parameterized
 
 from tensorflow.contrib.distribute.python import combinations
 from tensorflow.contrib.distribute.python import cross_tower_ops as cross_tower_ops_lib
+from tensorflow.contrib.distribute.python import multi_worker_test_base
 from tensorflow.contrib.distribute.python import values as value_lib
 from tensorflow.python.eager import context
 from tensorflow.python.eager import test
@@ -31,6 +32,7 @@ from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import math_ops
+from tensorflow.python.training import device_util
 
 
 def _make_per_device(values, devices):
@@ -56,19 +58,46 @@ def _fake_mirrored(value, devices):
       {d: v for d, v in zip(devices, [value] * len(devices))})
 
 
+def _make_indexed_slices(values, indices, dense_shape, device):
+  with ops.device(device):
+    tensor = ops.IndexedSlices(
+        values=constant_op.constant(values),
+        indices=constant_op.constant(indices),
+        dense_shape=constant_op.constant(dense_shape))
+  return tensor
+
+
+def _make_mirrored_indexed_slices(devices, values, indices, dense_shape):
+  return value_lib.Mirrored({
+      d: _make_indexed_slices(values, indices, dense_shape, d) for d in devices
+  })
+
+
 _cpu_device = "/device:CPU:0"
 
 
-class CrossTowerOpsTest(test.TestCase, parameterized.TestCase):
+class CrossTowerOpsTestBase(test.TestCase, parameterized.TestCase):
 
-  def _assert_value_equal(self, left, right):
+  def _assert_indexed_slices_equal(self, left, right):
+    self.assertIsInstance(left, ops.IndexedSlices)
+    self.assertIsInstance(right, ops.IndexedSlices)
+    self.assertEqual(device_util.resolve(left.device),
+                     device_util.resolve(right.device))
+    self.assertAllEqual(
+        self.evaluate(ops.convert_to_tensor(left)),
+        self.evaluate(ops.convert_to_tensor(right)))
+
+  def _assert_values_equal(self, left, right):
     if isinstance(left, list):
       for l, r in zip(left, right):
-        self._assert_value_equal(l, r)
+        self._assert_values_equal(l, r)
     else:
       self.assertEqual(type(left), type(right))
       self.assertEqual(left.devices, right.devices)
-      if context.executing_eagerly():
+      if isinstance(list(left._index.values())[0], ops.IndexedSlices):
+        for (d, v) in left._index.items():
+          self._assert_indexed_slices_equal(v, right._index[d])
+      elif context.executing_eagerly():
         self.assertEqual([v.numpy() for v in left._index.values()],
                          list(right._index.values()))
       else:
@@ -76,51 +105,7 @@ class CrossTowerOpsTest(test.TestCase, parameterized.TestCase):
           self.assertEqual(
               sess.run(list(left._index.values())), list(right._index.values()))
 
-  # TODO(yuefengz): decouple the num_gpus check from distribution in
-  # combinations module so that we can pass in devices instead of a distribution
-  # strategy.
-  reduction_to_one_combinations = combinations.combine(
-      cross_tower_ops=[
-          combinations.NamedObject(
-              "DefaultReductionToOneDeviceCrossTowerOps",
-              cross_tower_ops_lib.ReductionToOneDeviceCrossTowerOps()),
-          combinations.NamedObject(
-              "ReductionToCPUDeviceCrossTowerOps",
-              cross_tower_ops_lib.ReductionToOneDeviceCrossTowerOps(
-                  reduce_to_device=_cpu_device)),
-          combinations.NamedObject(
-              "AccumulateNCrossTowerOp",
-              cross_tower_ops_lib.ReductionToOneDeviceCrossTowerOps(
-                  accumulation_fn=math_ops.accumulate_n)),
-      ],
-      distribution=[
-          combinations.one_device_strategy,
-          combinations.mirrored_strategy_with_gpu_and_cpu,
-          combinations.mirrored_strategy_with_two_gpus
-      ],
-      mode=["graph", "eager"])
-  allreduce_combinations = combinations.combine(
-      cross_tower_ops=[
-          combinations.NamedObject(
-              "AllReduce",
-              cross_tower_ops_lib.AllReduceCrossTowerOps("nccl", 1, 0, 0)),
-          combinations.NamedObject(
-              "HierarchicalCopy",
-              cross_tower_ops_lib.AllReduceCrossTowerOps(
-                  "hierarchical_copy", 8, 0, 0)),
-          combinations.NamedObject(
-              "AllReduceNoGradientRepacking",
-              cross_tower_ops_lib.AllReduceCrossTowerOps("nccl", 0, 0, 0)),
-          combinations.NamedObject(
-              "HierarchicalCopyAggregateSmallTensors",
-              cross_tower_ops_lib.AllReduceCrossTowerOps(
-                  "hierarchical_copy", 0, 100, 10))
-      ],
-      distribution=[combinations.mirrored_strategy_with_two_gpus],
-      mode=["graph", "eager"])
-
-  @combinations.generate(reduction_to_one_combinations + allreduce_combinations)
-  def testReductionAndBroadcast(self, cross_tower_ops, distribution):
+  def _testReductionAndBroadcast(self, cross_tower_ops, distribution):
     devices = distribution.worker_devices
 
     values = [constant_op.constant(float(d)) for d in range(len(devices))]
@@ -143,29 +128,29 @@ class CrossTowerOpsTest(test.TestCase, parameterized.TestCase):
 
     # test reduce()
     for destinations in all_destinations:
-      self._assert_value_equal(
+      self._assert_values_equal(
           cross_tower_ops.reduce("mean", per_device, destinations=destinations),
           _fake_mirrored(mean, destinations or per_device))
-      self._assert_value_equal(
+      self._assert_values_equal(
           cross_tower_ops.reduce(
               "mean", per_device_2, destinations=destinations),
           _fake_mirrored(mean_2, destinations or per_device))
-      self._assert_value_equal(
+      self._assert_values_equal(
           cross_tower_ops.reduce("sum", per_device, destinations=destinations),
           _fake_mirrored(mean * len(devices), destinations or per_device))
-      self._assert_value_equal(
+      self._assert_values_equal(
           cross_tower_ops.reduce(
               "sum", per_device_2, destinations=destinations),
           _fake_mirrored(mean_2 * len(devices), destinations or per_device))
 
     # test batch_reduce()
     for d1, d2 in itertools.product(all_destinations, all_destinations):
-      self._assert_value_equal(
+      self._assert_values_equal(
           cross_tower_ops.batch_reduce(
               "mean", [(per_device, d1), (per_device_2, d2)]),
           [_fake_mirrored(mean, d1 or per_device),
            _fake_mirrored(mean_2, d2 or per_device_2)])
-      self._assert_value_equal(
+      self._assert_values_equal(
           cross_tower_ops.batch_reduce(
               "sum", [(per_device, d1), (per_device_2, d2)]),
           [_fake_mirrored(mean * len(devices), d1 or per_device),
@@ -176,45 +161,205 @@ class CrossTowerOpsTest(test.TestCase, parameterized.TestCase):
       if destinations is None:
         continue
       else:
-        self._assert_value_equal(
+        self._assert_values_equal(
             cross_tower_ops.broadcast(constant_op.constant(1.), destinations),
             _fake_mirrored(1., destinations))
 
+
+class SingleWorkerCrossTowerOpsTest(CrossTowerOpsTestBase):
+  # TODO(yuefengz): decouple the num_gpus check from distribution in
+  # combinations module so that we can pass in devices instead of a distribution
+  # strategy.
+  reduction_to_one_combinations = combinations.combine(
+      cross_tower_ops=[
+          combinations.NamedObject(
+              "DefaultReductionToOneDeviceCrossTowerOps",
+              cross_tower_ops_lib.ReductionToOneDeviceCrossTowerOps()),
+          combinations.NamedObject(
+              "ReductionToCPUDeviceCrossTowerOps",
+              cross_tower_ops_lib.ReductionToOneDeviceCrossTowerOps(
+                  reduce_to_device=_cpu_device)),
+          combinations.NamedObject(
+              "AccumulateNCrossTowerOp",
+              cross_tower_ops_lib.ReductionToOneDeviceCrossTowerOps(
+                  accumulation_fn=math_ops.accumulate_n)),
+      ],
+      distribution=[
+          combinations.one_device_strategy,
+          combinations.mirrored_strategy_with_gpu_and_cpu,
+          combinations.mirrored_strategy_with_two_gpus
+      ],
+      mode=["graph", "eager"])
+  allreduce_combinations = combinations.combine(
+      cross_tower_ops=[
+          combinations.NamedObject(
+              "AllReduce",
+              cross_tower_ops_lib.AllReduceCrossTowerOps("nccl", 1, 0, 0)),
+          combinations.NamedObject(
+              "HierarchicalCopy",
+              cross_tower_ops_lib.AllReduceCrossTowerOps(
+                  "hierarchical_copy", 8, 0, 0)),
+          combinations.NamedObject(
+              "AllReduceNoGradientRepacking",
+              cross_tower_ops_lib.AllReduceCrossTowerOps("nccl", 0, 0, 0)),
+          combinations.NamedObject(
+              "HierarchicalCopyAggregateSmallTensors",
+              cross_tower_ops_lib.AllReduceCrossTowerOps(
+                  "hierarchical_copy", 0, 100, 10))
+      ],
+      distribution=[combinations.mirrored_strategy_with_two_gpus],
+      mode=["graph", "eager"])
+
+  @combinations.generate(reduction_to_one_combinations + allreduce_combinations)
+  def testReductionAndBroadcast(self, cross_tower_ops, distribution):
+    with distribution.scope():
+      self._testReductionAndBroadcast(cross_tower_ops, distribution)
+
   def testChooseAlgorithm(self):
     device_links = [[1, 2, 3, 4], [0, 2, 3, 5], [0, 1, 3, 6], [0, 1, 2, 7],
                     [0, 5, 6, 7], [1, 4, 6, 7], [2, 4, 5, 7], [3, 4, 5, 6]]
     result = cross_tower_ops_lib._choose_all_reduce_algorithm(device_links)
-    self.assertTrue(
-        isinstance(result, cross_tower_ops_lib.AllReduceCrossTowerOps))
-    self.assertEqual(result.all_reduce_alg, "hierarchical_copy")
-    self.assertEqual(result.num_packs, 8)
+    self.assertIsInstance(result, cross_tower_ops_lib.AllReduceCrossTowerOps)
+    self.assertEqual(result._all_reduce_alg, "hierarchical_copy")
+    self.assertEqual(result._num_packs, 8)
 
     # if there are only 4 devices
     device_links = [[1, 2, 3, 4], [0, 2, 3, 5], [0, 1, 3, 6], [0, 1, 2, 7]]
     result = cross_tower_ops_lib._choose_all_reduce_algorithm(device_links)
-    self.assertTrue(
-        isinstance(result, cross_tower_ops_lib.AllReduceCrossTowerOps))
-    self.assertEqual(result.all_reduce_alg, "nccl")
-    self.assertEqual(result.num_packs, 1)
+    self.assertIsInstance(result, cross_tower_ops_lib.AllReduceCrossTowerOps)
+    self.assertEqual(result._all_reduce_alg, "nccl")
+    self.assertEqual(result._num_packs, 1)
 
     # if devices links contain each device itself
     device_links = [[0, 1, 2, 3, 4], [0, 1, 2, 3, 5], [0, 1, 2, 3, 6],
                     [0, 1, 2, 3, 7], [0, 4, 5, 6, 7], [1, 4, 5, 6, 7],
                     [2, 4, 5, 6, 7], [3, 4, 5, 6, 7]]
     result = cross_tower_ops_lib._choose_all_reduce_algorithm(device_links)
-    self.assertTrue(
-        isinstance(result, cross_tower_ops_lib.AllReduceCrossTowerOps))
-    self.assertEqual(result.all_reduce_alg, "hierarchical_copy")
-    self.assertEqual(result.num_packs, 8)
+    self.assertIsInstance(result, cross_tower_ops_lib.AllReduceCrossTowerOps)
+    self.assertEqual(result._all_reduce_alg, "hierarchical_copy")
+    self.assertEqual(result._num_packs, 8)
 
     # if not dgx1-like links
     device_links = [[0, 2, 3, 5], [0, 1, 3, 6], [0, 1, 2, 7], [0, 5, 6, 7],
                     [1, 4, 6, 7], [2, 4, 5, 7], [3, 4, 5, 6], [1, 2, 3, 4]]
     result = cross_tower_ops_lib._choose_all_reduce_algorithm(device_links)
-    self.assertTrue(
-        isinstance(result, cross_tower_ops_lib.AllReduceCrossTowerOps))
-    self.assertEqual(result.all_reduce_alg, "nccl")
-    self.assertEqual(result.num_packs, 1)
+    self.assertIsInstance(result, cross_tower_ops_lib.AllReduceCrossTowerOps)
+    self.assertEqual(result._all_reduce_alg, "nccl")
+    self.assertEqual(result._num_packs, 1)
+
+  @combinations.generate(combinations.combine(
+      mode=["graph", "eager"],
+      required_gpus=1))
+  def testSimpleReduceWithIndexedSlices(self):
+    devices = ["/cpu:0", "/gpu:0"]
+    t0 = _make_indexed_slices([[1., 2.]], [1], [5, 2], devices[0])
+    t1 = _make_indexed_slices([[3., 4.], [5., 6.]], [1, 3], [5, 2], devices[1])
+    per_device = value_lib.PerDevice({devices[0]: t0, devices[1]: t1})
+    result = cross_tower_ops_lib._simple_reduce(per_device, devices[0],
+                                                math_ops.add_n, "sum")
+
+    # Test that the result is semantically equal to both the concatenated
+    # IndexedSlices with and without duplicate indices.
+    total_with_dups = _make_indexed_slices(
+        [[1., 2.], [3., 4.], [5., 6.]], [1, 1, 3], [5, 2], devices[0])
+    total_without_dups = _make_indexed_slices(
+        [[4., 6.], [5., 6.]], [1, 3], [5, 2], devices[0])
+    self._assert_indexed_slices_equal(total_with_dups, result)
+    self._assert_indexed_slices_equal(total_without_dups, result)
+
+  @combinations.generate(combinations.combine(
+      cross_tower_ops_instance=[
+          combinations.NamedObject(
+              "ReductionToOneDeviceCrossTowerOps",
+              cross_tower_ops_lib.ReductionToOneDeviceCrossTowerOps()),
+          combinations.NamedObject(
+              "AllReduceCrossTowerOps",
+              cross_tower_ops_lib.AllReduceCrossTowerOps())
+      ],
+      method_string=["sum", "mean"],
+      batch_reduce=[True, False],
+      mode=["graph", "eager"],
+      required_gpus=1))
+  def testIndexedSlicesAllReduce(self, cross_tower_ops_instance,
+                                 method_string, batch_reduce):
+    devices = ["/cpu:0", "/gpu:0"]
+    dense_shape = [5, 2]
+    t0 = _make_indexed_slices([[1., 2.]], [1], dense_shape, devices[0])
+    t1 = _make_indexed_slices(
+        [[3., 4.], [5., 6.]], [1, 3], dense_shape, devices[1])
+    per_device = value_lib.PerDevice({devices[0]: t0, devices[1]: t1})
+
+    if batch_reduce:
+      result = cross_tower_ops_instance.batch_reduce(method_string,
+                                                     [(per_device, devices)])
+    else:
+      result = cross_tower_ops_instance.reduce(method_string, per_device,
+                                               devices)
+
+    total_indices_with_dups = [1, 1, 3]
+    total_indices_without_dups = [1, 3]
+
+    if method_string == "sum":
+      total_values_with_dups = [[1., 2.], [3., 4.], [5., 6.]]
+      total_values_without_dups = [[4., 6.], [5., 6.]]
+    else:
+      assert method_string == "mean"
+      total_values_with_dups = [[0.5, 1.], [1.5, 2.], [2.5, 3.]]
+      total_values_without_dups = [[2., 3.], [2.5, 3.]]
+
+    total_mirrored_with_dups = _make_mirrored_indexed_slices(
+        devices, total_values_with_dups, total_indices_with_dups, dense_shape)
+    total_mirrored_without_dups = _make_mirrored_indexed_slices(
+        devices, total_values_without_dups, total_indices_without_dups,
+        dense_shape)
+
+    # Test that the result is semantically equal to both the concatenated
+    # IndexedSlices, as well as when the duplicate indices are summed up.
+    if batch_reduce:
+      total_mirrored_with_dups = [total_mirrored_with_dups]
+      total_mirrored_without_dups = [total_mirrored_without_dups]
+
+    self._assert_values_equal(total_mirrored_with_dups, result)
+    self._assert_values_equal(total_mirrored_without_dups, result)
+
+
+class MultiWorkerCrossTowerOpsTest(multi_worker_test_base.MultiWorkerTestBase,
+                                   CrossTowerOpsTestBase):
+
+  worker_devices = [
+      "/job:worker/replica:0/task:0", "/job:worker/replica:0/task:1"
+  ]
+  multi_worker_allreduce_combinations = combinations.combine(
+      cross_tower_ops=[
+          combinations.NamedObject(
+              "MultiWorkerAllReduce",
+              cross_tower_ops_lib.MultiWorkerAllReduce(
+                  worker_devices, 2, ("pscpu/pscpu", 2, -1), 0, 0, 0)),
+          combinations.NamedObject(
+              "MultiWorkerAllReducePack",
+              cross_tower_ops_lib.MultiWorkerAllReduce(
+                  worker_devices, 2, ("pscpu/pscpu", 2, -1), 1, 0, 0)),
+          combinations.NamedObject(
+              "MultiWorkerAllReduceAggregation",
+              cross_tower_ops_lib.MultiWorkerAllReduce(
+                  worker_devices, 2, ("pscpu/pscpu", 2, -1), 0, 100, 10)),
+          combinations.NamedObject(
+              "MultiWorkerAllReduceMultipleSpecs",
+              cross_tower_ops_lib.MultiWorkerAllReduce(
+                  worker_devices, 2, [("pscpu/pscpu", 2, 100),
+                                      ("xring", 2, -1)], 0, 0, 0)),
+      ],
+      distribution=[
+          combinations.multi_worker_strategy_with_cpu,
+          combinations.multi_worker_strategy_with_one_gpu,
+          combinations.multi_worker_strategy_with_two_gpus
+      ],
+      mode=["graph"])
+
+  @combinations.generate(multi_worker_allreduce_combinations)
+  def testReductionAndBroadcast(self, cross_tower_ops, distribution):
+    with distribution.scope():
+      self._testReductionAndBroadcast(cross_tower_ops, distribution)
 
 
 if __name__ == "__main__":
diff --git a/tensorflow/contrib/distribute/python/cross_tower_utils.py b/tensorflow/contrib/distribute/python/cross_tower_utils.py
index fc04e2195f6d305e0f7c642f24c355286f1a8cfa..2bb088e704c584598b863b1b836166af2a5bb12c 100644
--- a/tensorflow/contrib/distribute/python/cross_tower_utils.py
+++ b/tensorflow/contrib/distribute/python/cross_tower_utils.py
@@ -21,9 +21,12 @@ from __future__ import print_function
 import collections as pycoll
 
 from tensorflow.contrib import nccl
+from tensorflow.contrib.all_reduce.python import all_reduce
+from tensorflow.contrib.distribute.python import values as value_lib
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import gradients_impl
 from tensorflow.python.ops import math_ops
 
 
@@ -156,6 +159,148 @@ def aggregate_single_gradient_using_copy(grad_and_vars, use_mean,
     return (grad, v), None
 
 
+def group_device_names(devices, group_size):
+  """Group device names into groups of group_size.
+
+  Args:
+    devices: a list of canonical device strings.
+    group_size: integer which is equal to or greater than 1.
+
+  Returns:
+    list of lists of devices, where each inner list is group_size long,
+      and each device appears at least once in an inner list.  If
+      len(devices) % group_size == 0 then each device will appear exactly once.
+
+  Raises:
+    ValueError: if group_size > len(devices)
+  """
+  num_devices = len(devices)
+  if group_size > num_devices:
+    raise ValueError(
+        'only %d devices, but group_size=%d' % (num_devices, group_size))
+  num_groups = (
+      num_devices // group_size + (1 if (num_devices % group_size != 0) else 0))
+  groups = [[] for i in range(num_groups)]
+  for i in range(num_groups * group_size):
+    groups[i % num_groups].append(devices[i % num_devices])
+  return groups
+
+
+def split_grads_by_size(threshold_size, device_grads):
+  """Break gradients into two sets according to tensor size.
+
+  Args:
+    threshold_size: int size cutoff for small vs large tensor.
+    device_grads: List of lists of (gradient, variable) tuples.  The outer
+        list is over devices. The inner list is over individual gradients.
+
+  Returns:
+    small_grads: Subset of device_grads where shape is <= threshold_size
+       elements.
+    large_grads: Subset of device_grads where shape is > threshold_size
+       elements.
+  """
+  small_grads = []
+  large_grads = []
+  for dl in device_grads:
+    small_dl = []
+    large_dl = []
+    for (g, v) in dl:
+      tensor_size = g.get_shape().num_elements()
+      if tensor_size <= threshold_size:
+        small_dl.append([g, v])
+      else:
+        large_dl.append([g, v])
+    if small_dl:
+      small_grads.append(small_dl)
+    if large_dl:
+      large_grads.append(large_dl)
+  return small_grads, large_grads
+
+
+def sum_grad_and_var_all_reduce(grad_and_vars,
+                                num_workers,
+                                alg,
+                                gpu_indices,
+                                aux_devices=None,
+                                num_shards=1):
+  """Apply all-reduce algorithm over specified gradient tensors."""
+  with ops.name_scope('allreduce'):
+    # Note that each grad_and_vars looks like the following:
+    #   ((grad0_gpu0, var0_gpu0), ... , (grad0_gpuN, var0_gpuN))
+    scaled_grads = [g for g, _ in grad_and_vars]
+    if alg == 'nccl':
+      summed_grads = nccl.all_sum(scaled_grads)
+    elif alg == 'xring':
+      summed_grads = all_reduce.build_ring_all_reduce(
+          scaled_grads, num_workers, num_shards, gpu_indices, math_ops.add)
+    elif alg == 'nccl/xring':
+      summed_grads = all_reduce.build_nccl_then_ring(scaled_grads, num_shards,
+                                                     math_ops.add)
+    elif alg == 'nccl/rechd':
+      summed_grads = all_reduce.build_nccl_then_recursive_hd(
+          scaled_grads, math_ops.add)
+    elif alg == 'nccl/pscpu':
+      summed_grads = all_reduce.build_nccl_then_shuffle(
+          scaled_grads, aux_devices, math_ops.add, math_ops.add_n)
+    elif alg == 'pscpu/pscpu':
+      second_gather_devices = aux_devices[:num_shards]
+      summed_grads = all_reduce.build_shuffle_then_shuffle(
+          scaled_grads, aux_devices, second_gather_devices, math_ops.add_n)
+    elif alg in ['pscpu', 'psgpu']:
+      summed_grads = all_reduce.build_shuffle_all_reduce(
+          scaled_grads, aux_devices, math_ops.add_n)
+    else:
+      raise ValueError('unsupported all_reduce alg: ', alg)
+
+    result = []
+    for (_, v), g in zip(grad_and_vars, summed_grads):
+      result.append([g, v])
+    return result
+
+
+def sum_gradients_all_reduce(dev_prefixes, tower_grads, num_workers, alg,
+                             num_shards, gpu_indices):
+  """Apply all-reduce algorithm over specified gradient tensors.
+
+  Args:
+    dev_prefixes: list of prefix strings to use to generate PS device names.
+    tower_grads: the gradients to reduce.
+    num_workers: number of worker processes across entire job.
+    alg: the all-reduce algorithm to apply.
+    num_shards: alg-specific sharding factor.
+    gpu_indices: indices of local GPUs in order usable for ring-reduce.
+
+  Returns:
+    list of reduced tensors
+  """
+  alg_contains_shuffle = any([n in alg for n in ['pscpu', 'psgpu']])
+  is_hierarchical = '/' in alg
+  if 'pscpu' in alg:
+    aux_devices = [prefix + '/cpu:0' for prefix in dev_prefixes]
+  elif 'psgpu' in alg:
+    aux_devices = [
+        prefix + '/gpu:%d' % i
+        for i in range(len(gpu_indices))
+        for prefix in dev_prefixes
+    ]
+  else:
+    aux_devices = ['/job:localhost/cpu:0']
+  # Auxiliary devices for hierarchical all-reduces.
+  aux_device_groups = group_device_names(
+      aux_devices, num_shards if alg_contains_shuffle else 1)
+  group_index = 0
+  reduced_gv_list = []
+  for grad_and_vars in zip(*tower_grads):
+    reduced_gv_list.append(
+        sum_grad_and_var_all_reduce(
+            grad_and_vars, num_workers, alg, gpu_indices, aux_devices
+            if is_hierarchical else aux_device_groups[group_index], num_shards))
+    group_index = (group_index + 1) % len(aux_device_groups)
+  new_tower_grads = [list(x) for x in zip(*reduced_gv_list)]
+  return new_tower_grads
+
+
 def extract_ranges(index_list, range_size_limit=32):
   """Extract consecutive ranges and singles from index_list.
 
@@ -328,7 +473,7 @@ def unpack_small_tensors(tower_grads, packing):
   for dev_idx, gv_list in enumerate(tower_grads):
     gv_list = list(gv_list)
     new_gv_list = gv_list[num_packed:]
-    for i in xrange(0, num_packed):
+    for i in range(num_packed):
       k = '%d:%d' % (dev_idx, i)
       gpt = packing[k]
       gv = unpack_grad_tuple(gv_list[i], gpt)
@@ -337,3 +482,46 @@ def unpack_small_tensors(tower_grads, packing):
         new_gv_list.insert(idx, gv[gi])
     new_tower_grads.append(new_gv_list)
   return new_tower_grads
+
+
+def aggregate_tensors_or_indexed_slices(values, accumulation_fn=math_ops.add_n):
+  """Aggregate tensors using `accumulation_fn` and IndexedSlices via concat."""
+  if any(isinstance(v, ops.IndexedSlices) for v in values):
+    return gradients_impl._AggregateIndexedSlicesGradients(values)  # pylint: disable=protected-access
+  else:
+    return accumulation_fn(values)
+
+
+def divide_by_n_tensors_or_indexed_slices(value, n):
+  if isinstance(value, ops.IndexedSlices):
+    value = gradients_impl._HandleNestedIndexedSlices(value)  # pylint: disable=protected-access
+    return ops.IndexedSlices(
+        value.values / n, value.indices, value.dense_shape)
+  else:
+    return value / n
+
+
+def copy_tensor_or_indexed_slices_to_device(value, device):
+  with ops.device(device):
+    if isinstance(value, ops.IndexedSlices):
+      copied_values = array_ops.identity(value.values)
+      copied_indices = array_ops.identity(value.indices)
+      copied_shape = array_ops.identity(value.dense_shape)
+      result = ops.IndexedSlices(copied_values, copied_indices, copied_shape)
+    else:
+      result = array_ops.identity(value)
+  return result
+
+
+def contains_indexed_slices(value):
+  """Check whether the value is `IndexedSlices` or contains `IndexedSlices`."""
+  if isinstance(value, ops.IndexedSlices):
+    return True
+  elif isinstance(value, (list, tuple)) and value:
+    return any(contains_indexed_slices(v) for v in value)
+  elif isinstance(value, value_lib.DistributedValues):
+    return contains_indexed_slices(list(value._index.values()))  # pylint: disable=protected-access
+  elif isinstance(value, value_lib.MapOutput):
+    return contains_indexed_slices(value.get())
+  else:
+    return False
diff --git a/tensorflow/contrib/distribute/python/cross_tower_utils_test.py b/tensorflow/contrib/distribute/python/cross_tower_utils_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..4ef8db681503dcef8c72f641455dbb999cef05cf
--- /dev/null
+++ b/tensorflow/contrib/distribute/python/cross_tower_utils_test.py
@@ -0,0 +1,152 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for cross_tower_utils."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from absl.testing import parameterized
+
+from tensorflow.contrib.distribute.python import combinations
+from tensorflow.contrib.distribute.python import cross_tower_utils
+from tensorflow.contrib.distribute.python import values as value_lib
+from tensorflow.python.eager import test
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import ops
+from tensorflow.python.framework import test_util
+from tensorflow.python.ops import math_ops
+from tensorflow.python.training import device_util
+
+
+class IndexedSlicesUtilsTest(test.TestCase, parameterized.TestCase):
+
+  def _assert_values_equal(self, left, right):
+    self.assertAllEqual(
+        self.evaluate(ops.convert_to_tensor(left)),
+        self.evaluate(ops.convert_to_tensor(right)))
+
+  @test_util.run_in_graph_and_eager_modes()
+  def testAggregateTensors(self):
+    t0 = constant_op.constant([[1., 2.], [0, 0], [3., 4.]])
+    t1 = constant_op.constant([[0., 0.], [5, 6], [7., 8.]])
+    total = constant_op.constant([[1., 2.], [5, 6], [10., 12.]])
+    result = cross_tower_utils.aggregate_tensors_or_indexed_slices([t0, t1])
+    self._assert_values_equal(total, result)
+
+  @test_util.run_in_graph_and_eager_modes()
+  def testAggregateIndexedSlices(self):
+    t0 = math_ops._as_indexed_slices(
+        constant_op.constant([[1., 2.], [0, 0], [3., 4.]]))
+    t1 = math_ops._as_indexed_slices(
+        constant_op.constant([[0., 0.], [5, 6], [7., 8.]]))
+    total = constant_op.constant([[1., 2.], [5, 6], [10., 12.]])
+    result = cross_tower_utils.aggregate_tensors_or_indexed_slices([t0, t1])
+    self.assertIsInstance(result, ops.IndexedSlices)
+    self._assert_values_equal(total, result)
+
+  @test_util.run_in_graph_and_eager_modes()
+  def testDivideTensor(self):
+    t = constant_op.constant([[1., 2.], [0, 0], [3., 4.]])
+    n = 2
+    expected = constant_op.constant([[0.5, 1.], [0, 0], [1.5, 2.]])
+    result = cross_tower_utils.divide_by_n_tensors_or_indexed_slices(t, n)
+    self._assert_values_equal(expected, result)
+
+  @test_util.run_in_graph_and_eager_modes()
+  def testDivideIndexedSlices(self):
+    t = math_ops._as_indexed_slices(
+        constant_op.constant([[1., 2.], [0, 0], [3., 4.]]))
+    n = 2
+    expected = constant_op.constant([[0.5, 1.], [0, 0], [1.5, 2.]])
+    result = cross_tower_utils.divide_by_n_tensors_or_indexed_slices(t, n)
+    self.assertIsInstance(result, ops.IndexedSlices)
+    self._assert_values_equal(expected, result)
+
+  @test_util.run_in_graph_and_eager_modes()
+  def testIsIndexedSlices(self):
+    t = math_ops._as_indexed_slices(
+        constant_op.constant([[1., 2.], [0, 0], [3., 4.]]))
+    self.assertTrue(cross_tower_utils.contains_indexed_slices(t))
+
+  @test_util.run_in_graph_and_eager_modes()
+  def testContainsIndexedSlices_List(self):
+    t0 = math_ops._as_indexed_slices(
+        constant_op.constant([[1., 2.], [0, 0], [3., 4.]]))
+    t1 = math_ops._as_indexed_slices(
+        constant_op.constant([[0., 0.], [5, 6], [7., 8.]]))
+    self.assertTrue(cross_tower_utils.contains_indexed_slices([t0, t1]))
+
+  @test_util.run_in_graph_and_eager_modes()
+  def testContainsIndexedSlices_Tuple(self):
+    t0 = math_ops._as_indexed_slices(
+        constant_op.constant([[1., 2.], [0, 0], [3., 4.]]))
+    t1 = math_ops._as_indexed_slices(
+        constant_op.constant([[0., 0.], [5, 6], [7., 8.]]))
+    self.assertTrue(cross_tower_utils.contains_indexed_slices((t0, t1)))
+
+  @test_util.run_in_graph_and_eager_modes()
+  def testContainsIndexedSlices_PerDevice(self):
+    t0 = math_ops._as_indexed_slices(
+        constant_op.constant([[1., 2.], [0, 0], [3., 4.]]))
+    t1 = math_ops._as_indexed_slices(
+        constant_op.constant([[0., 0.], [5, 6], [7., 8.]]))
+    per_device = value_lib.PerDevice({"/gpu:0": t0, "/cpu:0": t1})
+    self.assertTrue(cross_tower_utils.contains_indexed_slices(per_device))
+
+  @test_util.run_in_graph_and_eager_modes()
+  def testContainsIndexedSlices_PerDeviceMapOutput(self):
+    t0 = math_ops._as_indexed_slices(
+        constant_op.constant([[1., 2.], [0, 0], [3., 4.]]))
+    t1 = math_ops._as_indexed_slices(
+        constant_op.constant([[0., 0.], [5, 6], [7., 8.]]))
+    per_device = value_lib.PerDevice({
+        "/gpu:0": value_lib.MapOutput([t0]),
+        "/cpu:0": value_lib.MapOutput([t1])})
+    self.assertTrue(cross_tower_utils.contains_indexed_slices(per_device))
+
+  @combinations.generate(combinations.combine(
+      mode=["graph", "eager"],
+      required_gpus=1))
+  def testCopyTensor(self):
+    with ops.device("/cpu:0"):
+      t = constant_op.constant([[1., 2.], [0, 0], [3., 4.]])
+    destination = "/gpu:0"
+    result = cross_tower_utils.copy_tensor_or_indexed_slices_to_device(
+        t, destination)
+
+    self._assert_values_equal(t, result)
+    self.assertEqual(device_util.resolve(destination),
+                     device_util.resolve(result.device))
+
+  @combinations.generate(combinations.combine(
+      mode=["graph", "eager"],
+      required_gpus=1))
+  def testCopyIndexedSlices(self):
+    with ops.device("/cpu:0"):
+      t = math_ops._as_indexed_slices(
+          constant_op.constant([[1., 2.], [0, 0], [3., 4.]]))
+    destination = "/gpu:0"
+    result = cross_tower_utils.copy_tensor_or_indexed_slices_to_device(
+        t, destination)
+
+    self.assertIsInstance(result, ops.IndexedSlices)
+    self._assert_values_equal(t, result)
+    self.assertEqual(device_util.resolve(destination),
+                     device_util.resolve(result.device))
+
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/contrib/distribute/python/keras_test.py b/tensorflow/contrib/distribute/python/keras_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..75ecd90dcffa7a786b78238ef453c4c8e4346afa
--- /dev/null
+++ b/tensorflow/contrib/distribute/python/keras_test.py
@@ -0,0 +1,148 @@
+# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for Keras Sequential and Functional models."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import os
+
+import numpy as np
+
+from tensorflow.contrib.distribute.python import mirrored_strategy
+from tensorflow.python import keras
+from tensorflow.python.data.ops import dataset_ops
+from tensorflow.python.estimator import keras as keras_lib
+from tensorflow.python.estimator import run_config as run_config_lib
+from tensorflow.python.framework import test_util
+from tensorflow.python.keras import testing_utils
+from tensorflow.python.platform import gfile
+from tensorflow.python.platform import test
+from tensorflow.python.summary.writer import writer_cache
+from tensorflow.python.training import rmsprop
+
+_RANDOM_SEED = 1337
+_TRAIN_SIZE = 200
+_INPUT_SIZE = (10,)
+_NUM_CLASS = 2
+
+
+def simple_sequential_model():
+  model = keras.models.Sequential()
+  model.add(keras.layers.Dense(16, activation='relu', input_shape=_INPUT_SIZE))
+  model.add(keras.layers.Dropout(0.1))
+  model.add(keras.layers.Dense(_NUM_CLASS, activation='softmax'))
+  return model
+
+
+def simple_functional_model():
+  a = keras.layers.Input(shape=_INPUT_SIZE)
+  b = keras.layers.Dense(16, activation='relu')(a)
+  b = keras.layers.Dropout(0.1)(b)
+  b = keras.layers.Dense(_NUM_CLASS, activation='softmax')(b)
+  model = keras.models.Model(inputs=[a], outputs=[b])
+  return model
+
+
+def get_ds_train_input_fn():
+  np.random.seed(_RANDOM_SEED)
+  (x_train, y_train), _ = testing_utils.get_test_data(
+      train_samples=_TRAIN_SIZE,
+      test_samples=50,
+      input_shape=_INPUT_SIZE,
+      num_classes=_NUM_CLASS)
+  y_train = keras.utils.to_categorical(y_train)
+
+  dataset = dataset_ops.Dataset.from_tensor_slices((x_train, y_train))
+  dataset = dataset.batch(32)
+  return dataset
+
+
+def get_ds_test_input_fn():
+  np.random.seed(_RANDOM_SEED)
+  _, (x_test, y_test) = testing_utils.get_test_data(
+      train_samples=_TRAIN_SIZE,
+      test_samples=50,
+      input_shape=_INPUT_SIZE,
+      num_classes=_NUM_CLASS)
+  y_test = keras.utils.to_categorical(y_test)
+
+  dataset = dataset_ops.Dataset.from_tensor_slices((x_test, y_test))
+  dataset = dataset.batch(32)
+  return dataset
+
+
+class TestKerasDistributionStrategy(test_util.TensorFlowTestCase):
+
+  def setUp(self):
+    self._base_dir = os.path.join(self.get_temp_dir(),
+                                  'keras_mirrored_strategy_test')
+    gfile.MakeDirs(self._base_dir)
+    self._config = run_config_lib.RunConfig(
+        tf_random_seed=_RANDOM_SEED, model_dir=self._base_dir)
+
+  def tearDown(self):
+    writer_cache.FileWriterCache.clear()
+    if os.path.isdir(self._base_dir):
+      gfile.DeleteRecursively(self._base_dir)
+
+  def test_train_functional_with_distribution_strategy(self):
+    dist = mirrored_strategy.MirroredStrategy(
+        devices=['/device:GPU:0', '/device:GPU:1'])
+    keras_model = simple_functional_model()
+    keras_model.compile(
+        loss='categorical_crossentropy',
+        optimizer=rmsprop.RMSPropOptimizer(learning_rate=0.01))
+    config = run_config_lib.RunConfig(tf_random_seed=_RANDOM_SEED,
+                                      model_dir=self._base_dir,
+                                      train_distribute=dist)
+    with self.test_session():
+      est_keras = keras_lib.model_to_estimator(
+          keras_model=keras_model, config=config)
+      before_eval_results = est_keras.evaluate(
+          input_fn=get_ds_test_input_fn, steps=1)
+      est_keras.train(input_fn=get_ds_train_input_fn, steps=_TRAIN_SIZE / 16)
+      after_eval_results = est_keras.evaluate(input_fn=get_ds_test_input_fn,
+                                              steps=1)
+      self.assertLess(after_eval_results['loss'], before_eval_results['loss'])
+
+    writer_cache.FileWriterCache.clear()
+    gfile.DeleteRecursively(self._config.model_dir)
+
+  def test_train_sequential_with_distribution_strategy(self):
+    dist = mirrored_strategy.MirroredStrategy(
+        devices=['/device:GPU:0', '/device:GPU:1'])
+    keras_model = simple_sequential_model()
+    keras_model.compile(
+        loss='categorical_crossentropy',
+        optimizer=rmsprop.RMSPropOptimizer(learning_rate=0.01))
+    config = run_config_lib.RunConfig(tf_random_seed=_RANDOM_SEED,
+                                      model_dir=self._base_dir,
+                                      train_distribute=dist)
+    with self.test_session():
+      est_keras = keras_lib.model_to_estimator(
+          keras_model=keras_model, config=config)
+      before_eval_results = est_keras.evaluate(
+          input_fn=get_ds_test_input_fn, steps=1)
+      est_keras.train(input_fn=get_ds_train_input_fn, steps=_TRAIN_SIZE / 16)
+      after_eval_results = est_keras.evaluate(input_fn=get_ds_test_input_fn,
+                                              steps=1)
+      self.assertLess(after_eval_results['loss'], before_eval_results['loss'])
+
+    writer_cache.FileWriterCache.clear()
+    gfile.DeleteRecursively(self._config.model_dir)
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/contrib/distribute/python/metrics_v1_test.py b/tensorflow/contrib/distribute/python/metrics_v1_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..6c6bf143098c1bba64d47efce1bfface7682683d
--- /dev/null
+++ b/tensorflow/contrib/distribute/python/metrics_v1_test.py
@@ -0,0 +1,438 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for V1 metrics."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from absl.testing import parameterized
+
+from tensorflow.contrib.data.python.ops import batching
+from tensorflow.contrib.distribute.python import combinations
+from tensorflow.python.data.ops import dataset_ops
+from tensorflow.python.eager import test
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import metrics
+from tensorflow.python.ops import variables
+
+
+def _labeled_dataset_fn():
+  # First four batches of x: labels, predictions -> (labels == predictions)
+  #  0: 0, 0 -> True;   1: 1, 1 -> True;   2: 2, 2 -> True;   3: 3, 0 -> False
+  #  4: 4, 1 -> False;  5: 0, 2 -> False;  6: 1, 0 -> False;  7: 2, 1 -> False
+  #  8: 3, 2 -> False;  9: 4, 0 -> False; 10: 0, 1 -> False; 11: 1, 2 -> False
+  # 12: 2, 0 -> False; 13: 3, 1 -> False; 14: 4, 2 -> False; 15: 0, 0 -> True
+  return dataset_ops.Dataset.range(1000).map(
+      lambda x: {"labels": x % 5, "predictions": x % 3}).batch(4)
+
+
+def _boolean_dataset_fn():
+  # First four batches of labels, predictions: {TP, FP, TN, FN}
+  # with a threshold of 0.5:
+  #   T, T -> TP;  F, T -> FP;   T, F -> FN
+  #   F, F -> TN;  T, T -> TP;   F, T -> FP
+  #   T, F -> FN;  F, F -> TN;   T, T -> TP
+  #   F, T -> FP;  T, F -> FN;   F, F -> TN
+  return dataset_ops.Dataset.from_tensor_slices({
+      "labels": [True, False, True, False],
+      "predictions": [True, True, False, False]}).repeat().batch(3)
+
+
+def _threshold_dataset_fn():
+  # First four batches of labels, predictions: {TP, FP, TN, FN}
+  # with a threshold of 0.5:
+  #   True, 1.0 -> TP;  False, .75 -> FP;   True, .25 -> FN
+  #  False, 0.0 -> TN;   True, 1.0 -> TP;  False, .75 -> FP
+  #   True, .25 -> FN;  False, 0.0 -> TN;   True, 1.0 -> TP
+  #  False, .75 -> FP;   True, .25 -> FN;  False, 0.0 -> TN
+  return dataset_ops.Dataset.from_tensor_slices({
+      "labels": [True, False, True, False],
+      "predictions": [1.0, 0.75, 0.25, 0.]}).repeat().batch(3)
+
+
+def _regression_dataset_fn():
+  return dataset_ops.Dataset.from_tensor_slices({
+      "labels": [1., .5, 1., 0.],
+      "predictions": [1., .75, .25, 0.]}).repeat()
+
+
+def all_combinations():
+  return combinations.combine(
+      distribution=[combinations.default_strategy,
+                    combinations.one_device_strategy,
+                    combinations.mirrored_strategy_with_gpu_and_cpu,
+                    combinations.mirrored_strategy_with_two_gpus],
+      mode=["graph"])
+
+
+# TODO(josh11b): Test metrics.recall_at_top_k, metrics.average_precision_at_k,
+# metrics.precision_at_k
+class MetricsV1Test(test.TestCase, parameterized.TestCase):
+
+  def _test_metric(self, distribution, dataset_fn, metric_fn, expected_fn):
+    with ops.Graph().as_default(), distribution.scope():
+      iterator = distribution.distribute_dataset(
+          dataset_fn).make_one_shot_iterator()
+      value, update = distribution.call_for_each_tower(
+          metric_fn, iterator.get_next())
+      update = distribution.group(update)
+      self.evaluate(variables.local_variables_initializer())
+      # TODO(josh11b): Once we switch to using a global batch size for input,
+      # replace "distribution.num_towers" with "1".
+      batches_per_update = distribution.num_towers
+
+      # Update variables using the first `num_towers` batches.
+      self.evaluate(update)
+      self.assertAllClose(expected_fn(batches_per_update), self.evaluate(value),
+                          0.001, msg="After first update")
+
+      # Update variables using the second `num_towers` batches.
+      self.evaluate(update)
+      self.assertAllClose(expected_fn(2 * batches_per_update),
+                          self.evaluate(value),
+                          0.001,
+                          msg="After second update")
+
+      if batches_per_update == 1:  # Consume 4 input batches
+        self.evaluate(update)
+        self.assertAllClose(expected_fn(3 * batches_per_update),
+                            self.evaluate(value),
+                            0.001,
+                            msg="After third update")
+        self.evaluate(update)
+        self.assertAllClose(expected_fn(4 * batches_per_update),
+                            self.evaluate(value),
+                            0.001,
+                            msg="After fourth update")
+
+  @combinations.generate(all_combinations())
+  def testMean(self, distribution):
+    def _dataset_fn():
+      return dataset_ops.Dataset.range(1000).map(math_ops.to_float).batch(4)
+
+    def _expected_fn(num_batches):
+      # Mean(0..3) = 1.5, Mean(0..7) = 3.5, Mean(0..11) = 5.5, etc.
+      return num_batches * 2 - 0.5
+
+    self._test_metric(distribution, _dataset_fn, metrics.mean, _expected_fn)
+
+  @combinations.generate(all_combinations())
+  def testAccuracy(self, distribution):
+    def _metric_fn(x):
+      labels = x["labels"]
+      predictions = x["predictions"]
+      return metrics.accuracy(labels, predictions)
+
+    def _expected_fn(num_batches):
+      return [3./4, 3./8, 3./12, 4./16][num_batches - 1]
+
+    self._test_metric(
+        distribution, _labeled_dataset_fn, _metric_fn, _expected_fn)
+
+  @combinations.generate(all_combinations())
+  def testMeanPerClassAccuracy(self, distribution):
+    def _metric_fn(x):
+      labels = x["labels"]
+      predictions = x["predictions"]
+      return metrics.mean_per_class_accuracy(
+          labels, predictions, num_classes=5)
+
+    def _expected_fn(num_batches):
+      mean = lambda x: sum(x) / len(x)
+      return [mean([1., 1., 1., 0., 0.]),
+              mean([0.5, 0.5, 0.5, 0., 0.]),
+              mean([1./3, 1./3, 0.5, 0., 0.]),
+              mean([0.5, 1./3, 1./3, 0., 0.])][num_batches - 1]
+
+    self._test_metric(
+        distribution, _labeled_dataset_fn, _metric_fn, _expected_fn)
+
+  @combinations.generate(all_combinations())
+  def testMeanIOU(self, distribution):
+    def _metric_fn(x):
+      labels = x["labels"]
+      predictions = x["predictions"]
+      return metrics.mean_iou(
+          labels, predictions, num_classes=5)
+
+    def _expected_fn(num_batches):
+      mean = lambda x: sum(x) / len(x)
+      return [mean([1./2, 1./1, 1./1, 0.]),  # no class 4 in first batch
+              mean([1./4, 1./4, 1./3, 0., 0.]),
+              mean([1./6, 1./6, 1./5, 0., 0.]),
+              mean([2./8, 1./7, 1./7, 0., 0.])][num_batches - 1]
+
+    self._test_metric(
+        distribution, _labeled_dataset_fn, _metric_fn, _expected_fn)
+
+  @combinations.generate(all_combinations())
+  def testMeanTensor(self, distribution):
+    def _dataset_fn():
+      dataset = dataset_ops.Dataset.range(1000).map(math_ops.to_float)
+      # Want to produce a fixed, known shape, so drop remainder when batching.
+      dataset = dataset.apply(batching.batch_and_drop_remainder(4))
+      return dataset
+
+    def _expected_fn(num_batches):
+      # Mean(0, 4, ..., 4 * num_batches - 4) == 2 * num_batches - 2
+      # Mean(1, 5, ..., 4 * num_batches - 3) == 2 * num_batches - 1
+      # Mean(2, 6, ..., 4 * num_batches - 2) == 2 * num_batches
+      # Mean(3, 7, ..., 4 * num_batches - 1) == 2 * num_batches + 1
+      first = 2. * num_batches - 2.
+      return [first, first + 1., first + 2., first + 3.]
+
+    self._test_metric(
+        distribution, _dataset_fn, metrics.mean_tensor, _expected_fn)
+
+  @combinations.generate(all_combinations())
+  def testAUCROC(self, distribution):
+    def _metric_fn(x):
+      labels = x["labels"]
+      predictions = x["predictions"]
+      return metrics.auc(labels, predictions, num_thresholds=8, curve="ROC",
+                         summation_method="careful_interpolation")
+
+    def _expected_fn(num_batches):
+      return [0.5, 7./9, 0.8, 0.75][num_batches - 1]
+
+    self._test_metric(
+        distribution, _threshold_dataset_fn, _metric_fn, _expected_fn)
+
+  @combinations.generate(all_combinations())
+  def testAUCPR(self, distribution):
+    def _metric_fn(x):
+      labels = x["labels"]
+      predictions = x["predictions"]
+      return metrics.auc(labels, predictions, num_thresholds=8, curve="PR",
+                         summation_method="careful_interpolation")
+
+    def _expected_fn(num_batches):
+      return [0.797267, 0.851238, 0.865411, 0.797267][num_batches - 1]
+
+    self._test_metric(
+        distribution, _threshold_dataset_fn, _metric_fn, _expected_fn)
+
+  @combinations.generate(all_combinations())
+  def testFalseNegatives(self, distribution):
+    def _metric_fn(x):
+      labels = x["labels"]
+      predictions = x["predictions"]
+      return metrics.false_negatives(labels, predictions)
+
+    def _expected_fn(num_batches):
+      return [1., 1., 2., 3.][num_batches - 1]
+
+    self._test_metric(
+        distribution, _boolean_dataset_fn, _metric_fn, _expected_fn)
+
+  @combinations.generate(all_combinations())
+  def testFalseNegativesAtThresholds(self, distribution):
+    def _metric_fn(x):
+      labels = x["labels"]
+      predictions = x["predictions"]
+      return metrics.false_negatives_at_thresholds(labels, predictions, [.5])
+
+    def _expected_fn(num_batches):
+      return [[1.], [1.], [2.], [3.]][num_batches - 1]
+
+    self._test_metric(
+        distribution, _threshold_dataset_fn, _metric_fn, _expected_fn)
+
+  @combinations.generate(all_combinations())
+  def testTrueNegatives(self, distribution):
+    def _metric_fn(x):
+      labels = x["labels"]
+      predictions = x["predictions"]
+      return metrics.true_negatives(labels, predictions)
+
+    def _expected_fn(num_batches):
+      return [0., 1., 2., 3.][num_batches - 1]
+
+    self._test_metric(
+        distribution, _boolean_dataset_fn, _metric_fn, _expected_fn)
+
+  @combinations.generate(all_combinations())
+  def testTrueNegativesAtThresholds(self, distribution):
+    def _metric_fn(x):
+      labels = x["labels"]
+      predictions = x["predictions"]
+      return metrics.true_negatives_at_thresholds(labels, predictions, [.5])
+
+    def _expected_fn(num_batches):
+      return [[0.], [1.], [2.], [3.]][num_batches - 1]
+
+    self._test_metric(
+        distribution, _threshold_dataset_fn, _metric_fn, _expected_fn)
+
+  @combinations.generate(all_combinations())
+  def testFalsePositives(self, distribution):
+    def _metric_fn(x):
+      labels = x["labels"]
+      predictions = x["predictions"]
+      return metrics.false_positives(labels, predictions)
+
+    def _expected_fn(num_batches):
+      return [1., 2., 2., 3.][num_batches - 1]
+
+    self._test_metric(
+        distribution, _boolean_dataset_fn, _metric_fn, _expected_fn)
+
+  @combinations.generate(all_combinations())
+  def testFalsePositivesAtThresholds(self, distribution):
+    def _metric_fn(x):
+      labels = x["labels"]
+      predictions = x["predictions"]
+      return metrics.false_positives_at_thresholds(labels, predictions, [.5])
+
+    def _expected_fn(num_batches):
+      return [[1.], [2.], [2.], [3.]][num_batches - 1]
+
+    self._test_metric(
+        distribution, _threshold_dataset_fn, _metric_fn, _expected_fn)
+
+  @combinations.generate(all_combinations())
+  def testTruePositives(self, distribution):
+    def _metric_fn(x):
+      labels = x["labels"]
+      predictions = x["predictions"]
+      return metrics.true_positives(labels, predictions)
+
+    def _expected_fn(num_batches):
+      return [1., 2., 3., 3.][num_batches - 1]
+
+    self._test_metric(
+        distribution, _boolean_dataset_fn, _metric_fn, _expected_fn)
+
+  @combinations.generate(all_combinations())
+  def testTruePositivesAtThresholds(self, distribution):
+    def _metric_fn(x):
+      labels = x["labels"]
+      predictions = x["predictions"]
+      return metrics.true_positives_at_thresholds(labels, predictions, [.5])
+
+    def _expected_fn(num_batches):
+      return [[1.], [2.], [3.], [3.]][num_batches - 1]
+
+    self._test_metric(
+        distribution, _threshold_dataset_fn, _metric_fn, _expected_fn)
+
+  @combinations.generate(all_combinations())
+  def testPrecision(self, distribution):
+    def _metric_fn(x):
+      labels = x["labels"]
+      predictions = x["predictions"]
+      return metrics.precision(labels, predictions)
+
+    def _expected_fn(num_batches):
+      return [0.5, 0.5, 0.6, 0.5][num_batches - 1]
+
+    self._test_metric(
+        distribution, _boolean_dataset_fn, _metric_fn, _expected_fn)
+
+  @combinations.generate(all_combinations())
+  def testPrecisionAtThreshold(self, distribution):
+    def _metric_fn(x):
+      labels = x["labels"]
+      predictions = x["predictions"]
+      return metrics.precision_at_thresholds(labels, predictions, [0.5])
+
+    def _expected_fn(num_batches):
+      return [[0.5], [0.5], [0.6], [0.5]][num_batches - 1]
+
+    self._test_metric(
+        distribution, _threshold_dataset_fn, _metric_fn, _expected_fn)
+
+  @combinations.generate(all_combinations())
+  def testRecall(self, distribution):
+    def _metric_fn(x):
+      labels = x["labels"]
+      predictions = x["predictions"]
+      return metrics.recall(labels, predictions)
+
+    def _expected_fn(num_batches):
+      return [0.5, 2./3, 0.6, 0.5][num_batches - 1]
+
+    self._test_metric(
+        distribution, _boolean_dataset_fn, _metric_fn, _expected_fn)
+
+  @combinations.generate(all_combinations())
+  def testRecallAtThreshold(self, distribution):
+    def _metric_fn(x):
+      labels = x["labels"]
+      predictions = x["predictions"]
+      return metrics.recall_at_thresholds(labels, predictions, [0.5])
+
+    def _expected_fn(num_batches):
+      return [[0.5], [2./3], [0.6], [0.5]][num_batches - 1]
+
+    self._test_metric(
+        distribution, _threshold_dataset_fn, _metric_fn, _expected_fn)
+
+  @combinations.generate(all_combinations())
+  def testMeanSquaredError(self, distribution):
+    def _metric_fn(x):
+      labels = x["labels"]
+      predictions = x["predictions"]
+      return metrics.mean_squared_error(labels, predictions)
+
+    def _expected_fn(num_batches):
+      return [0., 1./32, 0.208333, 0.15625][num_batches - 1]
+
+    self._test_metric(
+        distribution, _regression_dataset_fn, _metric_fn, _expected_fn)
+
+  @combinations.generate(all_combinations())
+  def testRootMeanSquaredError(self, distribution):
+    def _metric_fn(x):
+      labels = x["labels"]
+      predictions = x["predictions"]
+      return metrics.root_mean_squared_error(labels, predictions)
+
+    def _expected_fn(num_batches):
+      return [0., 0.176777, 0.456435, 0.395285][num_batches - 1]
+
+    self._test_metric(
+        distribution, _regression_dataset_fn, _metric_fn, _expected_fn)
+
+  @combinations.generate(all_combinations())
+  def testSensitivityAtSpecificity(self, distribution):
+    def _metric_fn(x):
+      labels = x["labels"]
+      predictions = x["predictions"]
+      return metrics.sensitivity_at_specificity(labels, predictions, 0.8)
+
+    def _expected_fn(num_batches):
+      return [0.5, 2./3, 0.6, 0.5][num_batches - 1]
+
+    self._test_metric(
+        distribution, _threshold_dataset_fn, _metric_fn, _expected_fn)
+
+  @combinations.generate(all_combinations())
+  def testSpecificityAtSensitivity(self, distribution):
+    def _metric_fn(x):
+      labels = x["labels"]
+      predictions = x["predictions"]
+      return metrics.specificity_at_sensitivity(labels, predictions, 0.95)
+
+    def _expected_fn(num_batches):
+      return [0., 1./3, 0.5, 0.5][num_batches - 1]
+
+    self._test_metric(
+        distribution, _threshold_dataset_fn, _metric_fn, _expected_fn)
+
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/contrib/distribute/python/mirrored_strategy.py b/tensorflow/contrib/distribute/python/mirrored_strategy.py
index 89f2c431fece63269928fec6aa6d23b5a79ba0b9..900aa10e93e8881aa236bac8a2873d5c5531c6f6 100644
--- a/tensorflow/contrib/distribute/python/mirrored_strategy.py
+++ b/tensorflow/contrib/distribute/python/mirrored_strategy.py
@@ -18,6 +18,7 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import contextlib
 import threading
 import six
 
@@ -30,6 +31,7 @@ from tensorflow.python.eager import tape
 from tensorflow.python.framework import device as tf_device
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import variable_scope
 from tensorflow.python.training import coordinator
 from tensorflow.python.training import device_util
@@ -39,6 +41,16 @@ from tensorflow.python.training import distribute as distribute_lib
 # TODO(josh11b): Replace asserts in this file with if ...: raise ...
 
 
+@contextlib.contextmanager
+def _enter_graph(g):
+  if context.executing_eagerly():
+    with g.as_default(), context.eager_mode():
+      yield
+  else:
+    with g.as_default():
+      yield
+
+
 def _cpu_device(device):
   cpu_device = tf_device.DeviceSpec.from_string(device)
   cpu_device.merge_from(tf_device.DeviceSpec(device_type="CPU", device_index=0))
@@ -73,9 +85,8 @@ class MirroredStrategy(distribute_lib.DistributionStrategy):
     assert len(set(devices)) == len(devices), (
         "No duplicates allowed in `devices` argument.")
     # TODO(josh11b): Require at least 2 devices?
-    self._devices = devices
-    self._canonical_device_set = set(
-        [device_util.canonicalize(d) for d in devices])
+    self._devices = [device_util.resolve(d) for d in devices]
+    self._canonical_device_set = set(self._devices)
     self._device_index = values.PerDevice(
         dict((d, i) for i, d in enumerate(devices)))
     self._cross_tower_ops = cross_tower_ops
@@ -108,7 +119,10 @@ class MirroredStrategy(distribute_lib.DistributionStrategy):
           if i > 0:
             # Give replicas meaningful distinct names:
             var0name = index[devices[0]].name.split(":")[0]
-            kwargs["name"] = "%s/replica_%d" % (var0name, i)
+            # We append a / to variable names created on towers with id > 0 to
+            # ensure that we ignore the name scope and instead use the given
+            # name as the absolute name of the variable.
+            kwargs["name"] = "%s/replica_%d/" % (var0name, i)
             # Initialize replicas with the same value:
             if context.executing_eagerly():
               kwargs["initial_value"] = array_ops.identity(
@@ -248,8 +262,15 @@ class MirroredStrategy(distribute_lib.DistributionStrategy):
                 {t.device: t.merge_args for t in threads})
             merge_kwargs = values.regroup(
                 {t.device: t.merge_kwargs for t in threads})
-            merge_result = threads[0].merge_fn(
-                self, *merge_args, **merge_kwargs)
+            # We capture the name_scope of the MTT when we call merge_fn
+            # to ensure that if we have opened a name scope in the MTT,
+            # it will be respected when executing the merge function. We only
+            # capture the name_scope from the first MTT and assume it is
+            # the same for all other MTTs.
+            mtt_captured_name_scope = threads[0].captured_name_scope
+            with ops.name_scope(mtt_captured_name_scope):
+              merge_result = threads[0].merge_fn(
+                  self, *merge_args, **merge_kwargs)
             for t in threads:
               t.merge_result = values.select_device(t.device, merge_result)
     finally:
@@ -323,6 +344,13 @@ class MirroredStrategy(distribute_lib.DistributionStrategy):
                         **values.select_device_mirrored(d, kwargs))
     return values.regroup(updates, values.Mirrored)
 
+  def read_var(self, tower_local_var):
+    """Read the aggregate value of a tower-local variable."""
+    if isinstance(tower_local_var, values.TowerLocalVariable):
+      return math_ops.add_n(self.unwrap(tower_local_var))
+    assert isinstance(tower_local_var, values.Mirrored)
+    return array_ops.identity(tower_local_var.get())
+
   def _fetch(self, val, destination, fn):
     """Return a copy of `val` or `fn(val)` on `destination`."""
     if isinstance(val, values.TowerLocalVariable):
@@ -389,7 +417,9 @@ class MirroredStrategy(distribute_lib.DistributionStrategy):
       # pylint: disable=protected-access
       return list(colocate_with._index.keys())
     elif isinstance(colocate_with, six.string_types):
-      return [colocate_with]
+      return [device_util.resolve(colocate_with)]
+    elif isinstance(colocate_with, list):
+      return [device_util.resolve(d) for d in colocate_with]
     else:
       return colocate_with
 
@@ -416,6 +446,7 @@ class MirroredStrategy(distribute_lib.DistributionStrategy):
       self.merge_args = None
       self.merge_kwargs = None
       self.merge_result = None
+      self.captured_name_scope = None
       # We use a thread.Event for the main thread to signal when this
       # thread should start running (`should_run`), and another for
       # this thread to transfer control back to the main thread
@@ -439,13 +470,13 @@ class MirroredStrategy(distribute_lib.DistributionStrategy):
       self._variable_creator_stack = self.graph._variable_creator_stack[:]
       self._captured_var_scope = variable_scope.get_variable_scope()
       # Adding a "/" at end lets us re-enter this scope later.
-      self._captured_name_scope = self.graph.get_name_scope()
-      if self._captured_name_scope:
-        self._captured_name_scope += "/"
+      self._name_scope = self.graph.get_name_scope()
+      if self._name_scope:
+        self._name_scope += "/"
       if self.tower_id > 0:
-        if not self._captured_name_scope:
-          self._captured_name_scope = ""
-        self._captured_name_scope += "tower_%d/" % self.tower_id
+        if not self._name_scope:
+          self._name_scope = ""
+        self._name_scope += "tower_%d/" % self.tower_id
 
     def run(self):
       # pylint: disable=protected-access
@@ -458,10 +489,10 @@ class MirroredStrategy(distribute_lib.DistributionStrategy):
         with self.coord.stop_on_exception(), \
             context.context()._mode(self.context_mode), \
             context.context().device_policy(self.context_device_policy), \
-            self.graph.as_default(), \
+            _enter_graph(self.graph), \
             MirroredTowerContext(self.distribution, self.tower_id), \
             ops.device(self.device), \
-            ops.name_scope(self._captured_name_scope), \
+            ops.name_scope(self._name_scope), \
             variable_scope.variable_scope(
                 self._captured_var_scope, reuse=self.tower_id > 0), \
             variable_scope.variable_creator_scope(self.variable_creator_fn):
@@ -487,6 +518,10 @@ class MirroredTowerContext(distribute_lib.TowerContext):
     t.merge_fn = fn
     t.merge_args = args
     t.merge_kwargs = kwargs
+    t.captured_name_scope = t.graph.get_name_scope()
+    # Adding a "/" at end lets us re-enter this scope later.
+    if t.captured_name_scope:
+      t.captured_name_scope += "/"
     t.has_paused.set()
     t.should_run.wait()
     t.should_run.clear()
diff --git a/tensorflow/contrib/distribute/python/mirrored_strategy_multigpu_test.py b/tensorflow/contrib/distribute/python/mirrored_strategy_multigpu_test.py
index 3f9a02b249dde9a66056ed8952b664bbc3f74ead..bccd278847e3c87080af3cb15665e7a0d802d8fb 100644
--- a/tensorflow/contrib/distribute/python/mirrored_strategy_multigpu_test.py
+++ b/tensorflow/contrib/distribute/python/mirrored_strategy_multigpu_test.py
@@ -438,6 +438,74 @@ class MirroredStrategyVariableCreationTest(test.TestCase):
         self.assertEquals("foo/" + name + ":0", v0.name)
         self.assertEquals("tower_1/foo/" + name + ":0", v1.name)
 
+  # variable_scope.variable() respects name scopes when creating
+  # variables. On the other hand variable_scope.get_variable() ignores name
+  # scopes when creating variables. We test both methods of creating variables
+  # to make sure that we have the same variable names in both cases.
+  def testNameScopeWithVariable(self):
+    def in_cross_tower(_):
+      c = variable_scope.variable(1.0, name="c")
+      return c
+
+    def model_fn():
+      b = variable_scope.variable(1.0, name="b")
+      with ops.name_scope("foo"):
+        c = distribute_lib.get_tower_context().merge_call(in_cross_tower)
+      return b, c
+
+    dist = mirrored_strategy.MirroredStrategy(
+        ["/device:GPU:0", "/device:CPU:0"])
+
+    with context.graph_mode(), dist.scope():
+      with ops.name_scope("main"):
+        a = variable_scope.variable(1.0, name="a")
+        result = dist.call_for_each_tower(model_fn, run_concurrently=False)
+      result_b = result[0]
+      result_c = result[1]
+      self.assertIsInstance(result_b, values.DistributedValues)
+      self.assertIsInstance(result_c, values.DistributedValues)
+      a0, a1 = dist.unwrap(a)
+      b0, b1 = dist.unwrap(result_b)
+      c0, c1 = dist.unwrap(result_c)
+      self.assertEquals("main/a:0", a0.name)
+      self.assertEquals("main/a/replica_1:0", a1.name)
+      self.assertEquals("main/b:0", b0.name)
+      self.assertEquals("main/b/replica_1:0", b1.name)
+      self.assertEquals("main/foo/c:0", c0.name)
+      self.assertEquals("main/foo/c/replica_1:0", c1.name)
+
+  def testNameScopeWithGetVariable(self):
+    def in_cross_tower(_):
+      c = variable_scope.get_variable("c", [1])
+      return c
+
+    def model_fn():
+      b = variable_scope.get_variable("b", [1])
+      with ops.name_scope("foo"):
+        c = distribute_lib.get_tower_context().merge_call(in_cross_tower)
+      return b, c
+
+    dist = mirrored_strategy.MirroredStrategy(
+        ["/device:GPU:0", "/device:CPU:0"])
+
+    with context.graph_mode(), dist.scope():
+      with ops.name_scope("main"):
+        a = variable_scope.get_variable("a", [1])
+        result = dist.call_for_each_tower(model_fn, run_concurrently=False)
+      result_b = result[0]
+      result_c = result[1]
+      self.assertIsInstance(result_b, values.DistributedValues)
+      self.assertIsInstance(result_c, values.DistributedValues)
+      a0, a1 = dist.unwrap(a)
+      b0, b1 = dist.unwrap(result_b)
+      c0, c1 = dist.unwrap(result_c)
+      self.assertEquals("a:0", a0.name)
+      self.assertEquals("a/replica_1:0", a1.name)
+      self.assertEquals("b:0", b0.name)
+      self.assertEquals("b/replica_1:0", b1.name)
+      self.assertEquals("c:0", c0.name)
+      self.assertEquals("c/replica_1:0", c1.name)
+
   def testDynamicRnnVariables(self):
     def model_fn():
       inputs = constant_op.constant(2 * [2 * [[0.0, 1.0, 2.0, 3.0, 4.0]]])
diff --git a/tensorflow/contrib/distribute/python/monitor_test.py b/tensorflow/contrib/distribute/python/monitor_test.py
index 8277e1e7919e86ef616b31d0986589dcc9c49bbd..4fdb9bf69b4f6ad76b79fd298f5303f24a1bd455 100644
--- a/tensorflow/contrib/distribute/python/monitor_test.py
+++ b/tensorflow/contrib/distribute/python/monitor_test.py
@@ -25,6 +25,7 @@ from tensorflow.contrib.distribute.python import combinations
 from tensorflow.contrib.distribute.python import monitor as monitor_lib
 from tensorflow.contrib.distribute.python import one_device_strategy
 from tensorflow.contrib.distribute.python.single_loss_example import single_loss_example
+from tensorflow.python.client import session
 from tensorflow.python.eager import context
 from tensorflow.python.eager import test
 from tensorflow.python.framework import ops
@@ -65,7 +66,7 @@ class MonitorTest(test.TestCase, parameterized.TestCase):
     step_function, _ = single_loss_example(
         lambda: gradient_descent.GradientDescentOptimizer(0.2), distribution)
 
-    with self.test_session() as sess:
+    with session.Session() as sess, context.eager_mode():
       with self.assertRaisesRegexp(ValueError, "Should not provide"):
         _ = monitor_lib.Monitor(step_function, sess)
 
diff --git a/tensorflow/contrib/distribute/python/multi_worker_strategy.py b/tensorflow/contrib/distribute/python/multi_worker_strategy.py
index a552b370ebf359464afcaf3211119e73434e0dfb..0f21a427320510635279f80c11711e81715ec37c 100644
--- a/tensorflow/contrib/distribute/python/multi_worker_strategy.py
+++ b/tensorflow/contrib/distribute/python/multi_worker_strategy.py
@@ -121,7 +121,7 @@ class MultiWorkerMirroredStrategy(MirroredStrategy):
           worker: [device_util.canonicalize(worker, '/device:CPU:0')]
           for worker in self._workers
       }
-    self._devices = nest.flatten(self._worker_device_map.values())
+    self._devices = nest.flatten(self._worker_device_map)
 
     super(MultiWorkerMirroredStrategy, self).__init__(
         devices=self._devices, prefetch_on_device=prefetch_on_device)
diff --git a/tensorflow/contrib/distribute/python/one_device_strategy.py b/tensorflow/contrib/distribute/python/one_device_strategy.py
index 09b6d4a515ab46879520f304cd5ef60469512380..7f4bab9d93814eb70a2a1586fc291a16b2766b90 100644
--- a/tensorflow/contrib/distribute/python/one_device_strategy.py
+++ b/tensorflow/contrib/distribute/python/one_device_strategy.py
@@ -102,6 +102,10 @@ class OneDeviceStrategy(distribute_lib.DistributionStrategy):
     with ops.device(self._device), distribute_lib.UpdateContext(self._device):
       return fn(*args, **kwargs)
 
+  def read_var(self, tower_local_var):
+    """Read the aggregate value of a tower-local variable."""
+    return array_ops.identity(tower_local_var)
+
   def _fetch(self, val, destination, fn):
     """Return a copy of `val` or `fn(val)` on `destination`."""
     with ops.device(self._device):
diff --git a/tensorflow/contrib/distribute/python/values.py b/tensorflow/contrib/distribute/python/values.py
index 49b4e24daa4ffe417712bc854aa29995d5afc408..9572ade8e497fa13a7ca0746399d3e0237ee79fd 100644
--- a/tensorflow/contrib/distribute/python/values.py
+++ b/tensorflow/contrib/distribute/python/values.py
@@ -65,9 +65,10 @@ class DistributedValues(object):
     device = device_util.canonicalize(device)
     try:
       return self._index[device]
-    except KeyError:
-      raise ValueError("Device %s not found in %s (current device %s)" %
-                       (device, self._index.keys(), device_util.current()))
+    except KeyError as e:
+      six.raise_from(
+          ValueError("Device %s not found in %s (current device %s)" %
+                     (device, self._index.keys(), device_util.current())), e)
 
   def on_device(self, device):
     device = device_util.canonicalize(device)
diff --git a/tensorflow/contrib/distributions/BUILD b/tensorflow/contrib/distributions/BUILD
index 6192f04c8b695d124b498850ad430823b44fd472..ad00d1734dd14ed846522a33d888a5387cb25cc6 100644
--- a/tensorflow/contrib/distributions/BUILD
+++ b/tensorflow/contrib/distributions/BUILD
@@ -16,6 +16,13 @@ load("//tensorflow:tensorflow.bzl", "cuda_py_test")
 py_library(
     name = "bijectors_py",
     srcs = glob(["python/ops/bijectors/*.py"]),
+    deprecation = ("TensorFlow Distributions has migrated to " +
+                   "TensorFlow Probability " +
+                   "(https://github.com/tensorflow/probability). " +
+                   "Deprecated copies remaining in tf.contrib.distributions " +
+                   "are unmaintained, unsupported, and will be removed by " +
+                   "late 2018. You should update all usage of " +
+                   "`tf.contrib.distributions` to `tfp.distributions`."),
     srcs_version = "PY2AND3",
     deps = [
         "//tensorflow/contrib/linalg:linalg_py",
@@ -42,6 +49,13 @@ py_library(
 py_library(
     name = "distributions_py",
     srcs = ["__init__.py"] + glob(["python/ops/*.py"]),
+    deprecation = ("TensorFlow Distributions has migrated to " +
+                   "TensorFlow Probability " +
+                   "(https://github.com/tensorflow/probability). " +
+                   "Deprecated copies remaining in tf.contrib.distributions " +
+                   "are unmaintained, unsupported, and will be removed by " +
+                   "late 2018. You should update all usage of " +
+                   "`tf.contrib.distributions` to `tfp.distributions`."),
     srcs_version = "PY2AND3",
     deps = [
         ":bijectors_py",
@@ -940,6 +954,25 @@ cuda_py_test(
     ],
 )
 
+cuda_py_test(
+    name = "fill_triangular_test",
+    size = "small",
+    srcs = ["python/kernel_tests/bijectors/fill_triangular_test.py"],
+    additional_deps = [
+        ":bijectors_py",
+        ":distributions_py",
+        "//third_party/py/numpy",
+        "@six_archive//:six",
+        "//tensorflow/contrib/linalg:linalg_py",
+        "//tensorflow/python:array_ops",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:framework_for_generated_wrappers",
+        "//tensorflow/python:framework_test_lib",
+        "//tensorflow/python:math_ops",
+        "//tensorflow/python:platform_test",
+    ],
+)
+
 cuda_py_test(
     name = "gumbel_test",
     size = "small",
@@ -1032,6 +1065,25 @@ cuda_py_test(
     ],
 )
 
+cuda_py_test(
+    name = "matrix_inverse_tril_test",
+    size = "medium",
+    srcs = ["python/kernel_tests/bijectors/matrix_inverse_tril_test.py"],
+    additional_deps = [
+        ":bijectors_py",
+        ":distributions_py",
+        "//third_party/py/numpy",
+        "@six_archive//:six",
+        "//tensorflow/contrib/linalg:linalg_py",
+        "//tensorflow/python:array_ops",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:framework_for_generated_wrappers",
+        "//tensorflow/python:framework_test_lib",
+        "//tensorflow/python:math_ops",
+        "//tensorflow/python:platform_test",
+    ],
+)
+
 cuda_py_test(
     name = "real_nvp_test",
     size = "small",
@@ -1099,6 +1151,25 @@ cuda_py_test(
     ],
 )
 
+cuda_py_test(
+    name = "scale_tril_test",
+    size = "small",
+    srcs = ["python/kernel_tests/bijectors/scale_tril_test.py"],
+    additional_deps = [
+        ":bijectors_py",
+        ":distributions_py",
+        "//third_party/py/numpy",
+        "@six_archive//:six",
+        "//tensorflow/contrib/linalg:linalg_py",
+        "//tensorflow/python:array_ops",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:framework_for_generated_wrappers",
+        "//tensorflow/python:framework_test_lib",
+        "//tensorflow/python:math_ops",
+        "//tensorflow/python:platform_test",
+    ],
+)
+
 cuda_py_test(
     name = "sigmoid_test",
     size = "small",
@@ -1216,6 +1287,25 @@ cuda_py_test(
     ],
 )
 
+cuda_py_test(
+    name = "transform_diagonal_test",
+    size = "small",
+    srcs = ["python/kernel_tests/bijectors/transform_diagonal_test.py"],
+    additional_deps = [
+        ":bijectors_py",
+        ":distributions_py",
+        "//third_party/py/numpy",
+        "@six_archive//:six",
+        "//tensorflow/contrib/linalg:linalg_py",
+        "//tensorflow/python:array_ops",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:framework_for_generated_wrappers",
+        "//tensorflow/python:framework_test_lib",
+        "//tensorflow/python:math_ops",
+        "//tensorflow/python:platform_test",
+    ],
+)
+
 cuda_py_test(
     name = "weibull_test",
     size = "small",
diff --git a/tensorflow/contrib/distributions/__init__.py b/tensorflow/contrib/distributions/__init__.py
index ddf59891e626a85e6c917ac74b3cfaabf16eb15d..5cec93c4df2e970f203253be6342bb292f296eb0 100644
--- a/tensorflow/contrib/distributions/__init__.py
+++ b/tensorflow/contrib/distributions/__init__.py
@@ -13,8 +13,6 @@
 # limitations under the License.
 # ==============================================================================
 """Classes representing statistical distributions and ops for working with them.
-
-See the @{$python/contrib.distributions} guide.
 """
 from __future__ import absolute_import
 from __future__ import division
@@ -32,6 +30,7 @@ from tensorflow.contrib.distributions.python.ops.conditional_distribution import
 from tensorflow.contrib.distributions.python.ops.conditional_transformed_distribution import *
 from tensorflow.contrib.distributions.python.ops.deterministic import *
 from tensorflow.contrib.distributions.python.ops.distribution_util import fill_triangular
+from tensorflow.contrib.distributions.python.ops.distribution_util import fill_triangular_inverse
 from tensorflow.contrib.distributions.python.ops.distribution_util import matrix_diag_transform
 from tensorflow.contrib.distributions.python.ops.distribution_util import reduce_weighted_logsumexp
 from tensorflow.contrib.distributions.python.ops.distribution_util import softplus_inverse
@@ -156,6 +155,7 @@ _allowed_symbols = [
     'kl_divergence',
     'RegisterKL',
     'fill_triangular',
+    'fill_triangular_inverse',
     'matrix_diag_transform',
     'reduce_weighted_logsumexp',
     'softplus_inverse',
diff --git a/tensorflow/contrib/distributions/python/kernel_tests/bijectors/conditional_bijector_test.py b/tensorflow/contrib/distributions/python/kernel_tests/bijectors/conditional_bijector_test.py
index 8b279ebcd908b6f375b35594ac5f3db9228a1e31..f8a52615b0f3f5ad0c7e01e0f76c7d7a6b455ef7 100644
--- a/tensorflow/contrib/distributions/python/kernel_tests/bijectors/conditional_bijector_test.py
+++ b/tensorflow/contrib/distributions/python/kernel_tests/bijectors/conditional_bijector_test.py
@@ -59,7 +59,7 @@ class ConditionalBijectorTest(test.TestCase):
     for name in ["inverse_log_det_jacobian", "forward_log_det_jacobian"]:
       method = getattr(b, name)
       with self.assertRaisesRegexp(ValueError, name + ".*b1.*b2"):
-        method(1., event_ndims=0., arg1="b1", arg2="b2")
+        method(1., event_ndims=0, arg1="b1", arg2="b2")
 
 
 if __name__ == "__main__":
diff --git a/tensorflow/contrib/distributions/python/kernel_tests/bijectors/fill_triangular_test.py b/tensorflow/contrib/distributions/python/kernel_tests/bijectors/fill_triangular_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..caeaf2a0c6e4fff28c0edd82cb09ca0bcee85fc3
--- /dev/null
+++ b/tensorflow/contrib/distributions/python/kernel_tests/bijectors/fill_triangular_test.py
@@ -0,0 +1,98 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for FillTriangular bijector."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.contrib.distributions.python.ops import bijectors
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import tensor_shape
+from tensorflow.python.framework import test_util
+from tensorflow.python.ops import array_ops
+from tensorflow.python.platform import test
+
+
+class FillTriangularBijectorTest(test.TestCase):
+  """Tests the correctness of the FillTriangular bijector."""
+
+  @test_util.run_in_graph_and_eager_modes()
+  def testBijector(self):
+    x = np.float32(np.array([1., 2., 3.]))
+    y = np.float32(np.array([[3., 0.],
+                             [2., 1.]]))
+
+    b = bijectors.FillTriangular()
+
+    y_ = self.evaluate(b.forward(x))
+    self.assertAllClose(y, y_)
+
+    x_ = self.evaluate(b.inverse(y))
+    self.assertAllClose(x, x_)
+
+    fldj = self.evaluate(b.forward_log_det_jacobian(x, event_ndims=1))
+    self.assertAllClose(fldj, 0.)
+
+    ildj = self.evaluate(b.inverse_log_det_jacobian(y, event_ndims=2))
+    self.assertAllClose(ildj, 0.)
+
+  @test_util.run_in_graph_and_eager_modes()
+  def testShape(self):
+    x_shape = tensor_shape.TensorShape([5, 4, 6])
+    y_shape = tensor_shape.TensorShape([5, 4, 3, 3])
+
+    b = bijectors.FillTriangular(validate_args=True)
+
+    x = array_ops.ones(shape=x_shape, dtype=dtypes.float32)
+    y_ = b.forward(x)
+    self.assertAllEqual(y_.shape.as_list(), y_shape.as_list())
+    x_ = b.inverse(y_)
+    self.assertAllEqual(x_.shape.as_list(), x_shape.as_list())
+
+    y_shape_ = b.forward_event_shape(x_shape)
+    self.assertAllEqual(y_shape_.as_list(), y_shape.as_list())
+    x_shape_ = b.inverse_event_shape(y_shape)
+    self.assertAllEqual(x_shape_.as_list(), x_shape.as_list())
+
+    y_shape_tensor = self.evaluate(
+        b.forward_event_shape_tensor(x_shape.as_list()))
+    self.assertAllEqual(y_shape_tensor, y_shape.as_list())
+    x_shape_tensor = self.evaluate(
+        b.inverse_event_shape_tensor(y_shape.as_list()))
+    self.assertAllEqual(x_shape_tensor, x_shape.as_list())
+
+  @test_util.run_in_graph_and_eager_modes()
+  def testShapeError(self):
+
+    b = bijectors.FillTriangular(validate_args=True)
+
+    x_shape_bad = tensor_shape.TensorShape([5, 4, 7])
+    with self.assertRaisesRegexp(ValueError, "is not a triangular number"):
+      b.forward_event_shape(x_shape_bad)
+    with self.assertRaisesOpError("is not a triangular number"):
+      self.evaluate(b.forward_event_shape_tensor(x_shape_bad.as_list()))
+
+    y_shape_bad = tensor_shape.TensorShape([5, 4, 3, 2])
+    with self.assertRaisesRegexp(ValueError, "Matrix must be square"):
+      b.inverse_event_shape(y_shape_bad)
+    with self.assertRaisesOpError("Matrix must be square"):
+      self.evaluate(b.inverse_event_shape_tensor(y_shape_bad.as_list()))
+
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/contrib/distributions/python/kernel_tests/bijectors/matrix_inverse_tril_test.py b/tensorflow/contrib/distributions/python/kernel_tests/bijectors/matrix_inverse_tril_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..18397035571561731698b06d90e20dc74e3cf83c
--- /dev/null
+++ b/tensorflow/contrib/distributions/python/kernel_tests/bijectors/matrix_inverse_tril_test.py
@@ -0,0 +1,190 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for MatrixInverseTriL bijector."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.contrib.distributions.python.ops import bijectors
+from tensorflow.python.framework import errors
+from tensorflow.python.framework import test_util
+from tensorflow.python.platform import test
+
+
+class MatrixInverseTriLBijectorTest(test.TestCase):
+  """Tests the correctness of the Y = inv(tril) transformation."""
+
+  @test_util.run_in_graph_and_eager_modes()
+  def testComputesCorrectValues(self):
+    inv = bijectors.MatrixInverseTriL(validate_args=True)
+    self.assertEqual("matrix_inverse_tril", inv.name)
+    x_ = np.array([[0.7, 0., 0.],
+                   [0.1, -1., 0.],
+                   [0.3, 0.25, 0.5]], dtype=np.float32)
+    x_inv_ = np.linalg.inv(x_)
+    expected_fldj_ = -6. * np.sum(np.log(np.abs(np.diag(x_))))
+
+    y = inv.forward(x_)
+    x_back = inv.inverse(x_inv_)
+    fldj = inv.forward_log_det_jacobian(x_, event_ndims=2)
+    ildj = inv.inverse_log_det_jacobian(x_inv_, event_ndims=2)
+
+    y_, x_back_, fldj_, ildj_ = self.evaluate([y, x_back, fldj, ildj])
+
+    self.assertAllClose(x_inv_, y_, atol=0., rtol=1e-5)
+    self.assertAllClose(x_, x_back_, atol=0., rtol=1e-5)
+    self.assertNear(expected_fldj_, fldj_, err=1e-3)
+    self.assertNear(-expected_fldj_, ildj_, err=1e-3)
+
+  @test_util.run_in_graph_and_eager_modes()
+  def testOneByOneMatrix(self):
+    inv = bijectors.MatrixInverseTriL(validate_args=True)
+    x_ = np.array([[5.]], dtype=np.float32)
+    x_inv_ = np.array([[0.2]], dtype=np.float32)
+    expected_fldj_ = np.log(0.04)
+
+    y = inv.forward(x_)
+    x_back = inv.inverse(x_inv_)
+    fldj = inv.forward_log_det_jacobian(x_, event_ndims=2)
+    ildj = inv.inverse_log_det_jacobian(x_inv_, event_ndims=2)
+
+    y_, x_back_, fldj_, ildj_ = self.evaluate([y, x_back, fldj, ildj])
+
+    self.assertAllClose(x_inv_, y_, atol=0., rtol=1e-5)
+    self.assertAllClose(x_, x_back_, atol=0., rtol=1e-5)
+    self.assertNear(expected_fldj_, fldj_, err=1e-3)
+    self.assertNear(-expected_fldj_, ildj_, err=1e-3)
+
+  @test_util.run_in_graph_and_eager_modes()
+  def testZeroByZeroMatrix(self):
+    inv = bijectors.MatrixInverseTriL(validate_args=True)
+    x_ = np.eye(0, dtype=np.float32)
+    x_inv_ = np.eye(0, dtype=np.float32)
+    expected_fldj_ = 0.
+
+    y = inv.forward(x_)
+    x_back = inv.inverse(x_inv_)
+    fldj = inv.forward_log_det_jacobian(x_, event_ndims=2)
+    ildj = inv.inverse_log_det_jacobian(x_inv_, event_ndims=2)
+
+    y_, x_back_, fldj_, ildj_ = self.evaluate([y, x_back, fldj, ildj])
+
+    self.assertAllClose(x_inv_, y_, atol=0., rtol=1e-5)
+    self.assertAllClose(x_, x_back_, atol=0., rtol=1e-5)
+    self.assertNear(expected_fldj_, fldj_, err=1e-3)
+    self.assertNear(-expected_fldj_, ildj_, err=1e-3)
+
+  @test_util.run_in_graph_and_eager_modes()
+  def testBatch(self):
+    # Test batch computation with input shape (2, 1, 2, 2), i.e. batch shape
+    # (2, 1).
+    inv = bijectors.MatrixInverseTriL(validate_args=True)
+    x_ = np.array([[[[1., 0.],
+                     [2., 3.]]],
+                   [[[4., 0.],
+                     [5., -6.]]]], dtype=np.float32)
+    x_inv_ = np.linalg.inv(x_)
+    expected_fldj_ = -4. * np.sum(
+        np.log(np.abs(np.diagonal(x_, axis1=-2, axis2=-1))), axis=-1)
+
+    y = inv.forward(x_)
+    x_back = inv.inverse(x_inv_)
+    fldj = inv.forward_log_det_jacobian(x_, event_ndims=2)
+    ildj = inv.inverse_log_det_jacobian(x_inv_, event_ndims=2)
+
+    y_, x_back_, fldj_, ildj_ = self.evaluate([y, x_back, fldj, ildj])
+
+    self.assertAllClose(x_inv_, y_, atol=0., rtol=1e-5)
+    self.assertAllClose(x_, x_back_, atol=0., rtol=1e-5)
+    self.assertAllClose(expected_fldj_, fldj_, atol=0., rtol=1e-3)
+    self.assertAllClose(-expected_fldj_, ildj_, atol=0., rtol=1e-3)
+
+  @test_util.run_in_graph_and_eager_modes()
+  def testErrorOnInputRankTooLow(self):
+    inv = bijectors.MatrixInverseTriL(validate_args=True)
+    x_ = np.array([0.1], dtype=np.float32)
+    rank_error_msg = "must have rank at least 2"
+    with self.test_session():
+      with self.assertRaisesWithPredicateMatch(ValueError, rank_error_msg):
+        inv.forward(x_).eval()
+      with self.assertRaisesWithPredicateMatch(ValueError, rank_error_msg):
+        inv.inverse(x_).eval()
+      with self.assertRaisesWithPredicateMatch(ValueError, rank_error_msg):
+        inv.forward_log_det_jacobian(x_, event_ndims=2).eval()
+      with self.assertRaisesWithPredicateMatch(ValueError, rank_error_msg):
+        inv.inverse_log_det_jacobian(x_, event_ndims=2).eval()
+
+  # TODO(b/80481923): Figure out why these assertions fail, and fix them.
+  ## def testErrorOnInputNonSquare(self):
+  ##   inv = bijectors.MatrixInverseTriL(validate_args=True)
+  ##   x_ = np.array([[1., 2., 3.],
+  ##                  [4., 5., 6.]], dtype=np.float32)
+  ##   square_error_msg = "must be a square matrix"
+  ##   with self.test_session():
+  ##     with self.assertRaisesWithPredicateMatch(errors.InvalidArgumentError,
+  ##                                              square_error_msg):
+  ##       inv.forward(x_).eval()
+  ##     with self.assertRaisesWithPredicateMatch(errors.InvalidArgumentError,
+  ##                                              square_error_msg):
+  ##       inv.inverse(x_).eval()
+  ##     with self.assertRaisesWithPredicateMatch(errors.InvalidArgumentError,
+  ##                                              square_error_msg):
+  ##       inv.forward_log_det_jacobian(x_, event_ndims=2).eval()
+  ##     with self.assertRaisesWithPredicateMatch(errors.InvalidArgumentError,
+  ##                                              square_error_msg):
+  ##       inv.inverse_log_det_jacobian(x_, event_ndims=2).eval()
+
+  @test_util.run_in_graph_and_eager_modes()
+  def testErrorOnInputNotLowerTriangular(self):
+    inv = bijectors.MatrixInverseTriL(validate_args=True)
+    x_ = np.array([[1., 2.],
+                   [3., 4.]], dtype=np.float32)
+    triangular_error_msg = "must be lower triangular"
+    with self.test_session():
+      with self.assertRaisesWithPredicateMatch(errors.InvalidArgumentError,
+                                               triangular_error_msg):
+        inv.forward(x_).eval()
+      with self.assertRaisesWithPredicateMatch(errors.InvalidArgumentError,
+                                               triangular_error_msg):
+        inv.inverse(x_).eval()
+      with self.assertRaisesWithPredicateMatch(errors.InvalidArgumentError,
+                                               triangular_error_msg):
+        inv.forward_log_det_jacobian(x_, event_ndims=2).eval()
+      with self.assertRaisesWithPredicateMatch(errors.InvalidArgumentError,
+                                               triangular_error_msg):
+        inv.inverse_log_det_jacobian(x_, event_ndims=2).eval()
+
+  @test_util.run_in_graph_and_eager_modes()
+  def testErrorOnInputSingular(self):
+    inv = bijectors.MatrixInverseTriL(validate_args=True)
+    x_ = np.array([[1., 0.],
+                   [0., 0.]], dtype=np.float32)
+    nonsingular_error_msg = "must have all diagonal entries nonzero"
+    with self.test_session():
+      with self.assertRaisesOpError(nonsingular_error_msg):
+        inv.forward(x_).eval()
+      with self.assertRaisesOpError(nonsingular_error_msg):
+        inv.inverse(x_).eval()
+      with self.assertRaisesOpError(nonsingular_error_msg):
+        inv.forward_log_det_jacobian(x_, event_ndims=2).eval()
+      with self.assertRaisesOpError(nonsingular_error_msg):
+        inv.inverse_log_det_jacobian(x_, event_ndims=2).eval()
+
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/contrib/distributions/python/kernel_tests/bijectors/scale_tril_test.py b/tensorflow/contrib/distributions/python/kernel_tests/bijectors/scale_tril_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..566a7b3dff9b5d97a1cb143e0b32fc15984c3a02
--- /dev/null
+++ b/tensorflow/contrib/distributions/python/kernel_tests/bijectors/scale_tril_test.py
@@ -0,0 +1,69 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for ScaleTriL bijector."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.contrib.distributions.python.ops import bijectors
+from tensorflow.python.framework import test_util
+from tensorflow.python.platform import test
+
+
+class ScaleTriLBijectorTest(test.TestCase):
+  """Tests the correctness of the ScaleTriL bijector."""
+
+  def setUp(self):
+    self._rng = np.random.RandomState(42)
+
+  def testComputesCorrectValues(self):
+    shift = 1.61803398875
+    x = np.float32(np.array([-1, .5, 2]))
+    y = np.float32(np.array([[np.exp(2) + shift, 0.],
+                             [.5, np.exp(-1) + shift]]))
+
+    b = bijectors.ScaleTriL(diag_bijector=bijectors.Exp(),
+                            diag_shift=shift)
+
+    y_ = self.evaluate(b.forward(x))
+    self.assertAllClose(y, y_)
+
+    x_ = self.evaluate(b.inverse(y))
+    self.assertAllClose(x, x_)
+
+  @test_util.run_in_graph_and_eager_modes()
+  def testInvertible(self):
+
+    # Generate random inputs from an unconstrained space, with
+    # event size 6 to specify 3x3 triangular matrices.
+    batch_shape = [2, 1]
+    x = np.float32(np.random.randn(*(batch_shape + [6])))
+    b = bijectors.ScaleTriL(diag_bijector=bijectors.Softplus(),
+                            diag_shift=3.14159)
+    y = self.evaluate(b.forward(x))
+    self.assertAllEqual(y.shape, batch_shape + [3, 3])
+
+    x_ = self.evaluate(b.inverse(y))
+    self.assertAllClose(x, x_)
+
+    fldj = self.evaluate(b.forward_log_det_jacobian(x, event_ndims=1))
+    ildj = self.evaluate(b.inverse_log_det_jacobian(y, event_ndims=2))
+    self.assertAllClose(fldj, -ildj)
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/contrib/distributions/python/kernel_tests/bijectors/sinh_arcsinh_bijector_test.py b/tensorflow/contrib/distributions/python/kernel_tests/bijectors/sinh_arcsinh_bijector_test.py
index 45760a29ee42835da69ef63803ccec7ce82a5a8f..795f1993ba5c31bf5a26333f31f1bc73125bff07 100644
--- a/tensorflow/contrib/distributions/python/kernel_tests/bijectors/sinh_arcsinh_bijector_test.py
+++ b/tensorflow/contrib/distributions/python/kernel_tests/bijectors/sinh_arcsinh_bijector_test.py
@@ -151,16 +151,24 @@ class SinhArcsinhBijectorTest(test.TestCase):
         self.assertAllClose(y, bijector.forward(x).eval(), rtol=1e-4, atol=0.)
         self.assertAllClose(x, bijector.inverse(y).eval(), rtol=1e-4, atol=0.)
 
-        # Do the numpy calculation in float128 to avoid inf/nan.
-        y_float128 = np.float128(y)
-        self.assertAllClose(
-            np.log(np.cosh(
-                np.arcsinh(y_float128) / tailweight - skewness) / np.sqrt(
-                    y_float128**2 + 1)) -
-            np.log(tailweight),
-            bijector.inverse_log_det_jacobian(y, event_ndims=0).eval(),
-            rtol=1e-4,
-            atol=0.)
+        # On IBM PPC systems, longdouble (np.float128) is same as double except that it can have more precision.
+        # Type double being of 8 bytes, can't hold square of max of float64 (which is also 8 bytes) and
+        # below test fails due to overflow error giving inf. So this check avoids that error by skipping square
+        # calculation and corresponding assert.
+
+        if np.amax(y) <= np.sqrt(np.finfo(np.float128).max) and \
+           np.fabs(np.amin(y)) <= np.sqrt(np.fabs(np.finfo(np.float128).min)):
+
+          # Do the numpy calculation in float128 to avoid inf/nan.
+          y_float128 = np.float128(y)
+          self.assertAllClose(
+              np.log(np.cosh(
+                  np.arcsinh(y_float128) / tailweight - skewness) / np.sqrt(
+                      y_float128**2 + 1)) -
+              np.log(tailweight),
+              bijector.inverse_log_det_jacobian(y, event_ndims=0).eval(),
+              rtol=1e-4,
+              atol=0.)
         self.assertAllClose(
             -bijector.inverse_log_det_jacobian(y, event_ndims=0).eval(),
             bijector.forward_log_det_jacobian(x, event_ndims=0).eval(),
diff --git a/tensorflow/contrib/distributions/python/kernel_tests/bijectors/transform_diagonal_test.py b/tensorflow/contrib/distributions/python/kernel_tests/bijectors/transform_diagonal_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..6428a68702274fae384ae3de6d03f7ca126e2346
--- /dev/null
+++ b/tensorflow/contrib/distributions/python/kernel_tests/bijectors/transform_diagonal_test.py
@@ -0,0 +1,66 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for TransformDiagonal bijector."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.contrib.distributions.python.ops import bijectors
+from tensorflow.python.framework import test_util
+from tensorflow.python.platform import test
+
+
+class TransformDiagonalBijectorTest(test.TestCase):
+  """Tests correctness of the TransformDiagonal bijector."""
+
+  def setUp(self):
+    self._rng = np.random.RandomState(42)
+
+  @test_util.run_in_graph_and_eager_modes()
+  def testBijector(self):
+    x = np.float32(np.random.randn(3, 4, 4))
+
+    y = x.copy()
+    for i in range(x.shape[0]):
+      np.fill_diagonal(y[i, :, :], np.exp(np.diag(x[i, :, :])))
+
+    exp = bijectors.Exp()
+    b = bijectors.TransformDiagonal(diag_bijector=exp)
+
+    y_ = self.evaluate(b.forward(x))
+    self.assertAllClose(y, y_)
+
+    x_ = self.evaluate(b.inverse(y))
+    self.assertAllClose(x, x_)
+
+    fldj = self.evaluate(b.forward_log_det_jacobian(x, event_ndims=2))
+    ildj = self.evaluate(b.inverse_log_det_jacobian(y, event_ndims=2))
+    self.assertAllEqual(
+        fldj,
+        self.evaluate(exp.forward_log_det_jacobian(
+            np.array([np.diag(x_mat) for x_mat in x]),
+            event_ndims=1)))
+    self.assertAllEqual(
+        ildj,
+        self.evaluate(exp.inverse_log_det_jacobian(
+            np.array([np.diag(y_mat) for y_mat in y]),
+            event_ndims=1)))
+
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/contrib/distributions/python/kernel_tests/distribution_util_test.py b/tensorflow/contrib/distributions/python/kernel_tests/distribution_util_test.py
index 31d24aa9ea09007b8db40e4869371b1f62639ac7..bbbec2103aefd3f38a9b734bcd3f2e15fc8bb683 100644
--- a/tensorflow/contrib/distributions/python/kernel_tests/distribution_util_test.py
+++ b/tensorflow/contrib/distributions/python/kernel_tests/distribution_util_test.py
@@ -29,7 +29,9 @@ from tensorflow.contrib.distributions.python.ops import mvn_diag
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import tensor_shape
+from tensorflow.python.framework import test_util
 from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import random_ops
 from tensorflow.python.ops.distributions import categorical
 from tensorflow.python.ops.distributions import normal
 from tensorflow.python.ops.linalg import linear_operator_diag
@@ -540,5 +542,51 @@ class PadDynamicTest(_PadTest, test.TestCase):
     return False
 
 
+class TestMoveDimension(test.TestCase):
+
+  @test_util.run_in_graph_and_eager_modes()
+  def test_move_dimension_static_shape(self):
+
+    x = random_ops.random_normal(shape=[200, 30, 4, 1, 6])
+
+    x_perm = distribution_util.move_dimension(x, 1, 1)
+    self.assertAllEqual(x_perm.shape.as_list(), [200, 30, 4, 1, 6])
+
+    x_perm = distribution_util.move_dimension(x, 0, 3)
+    self.assertAllEqual(x_perm.shape.as_list(), [30, 4, 1, 200, 6])
+
+    x_perm = distribution_util.move_dimension(x, 0, -2)
+    self.assertAllEqual(x_perm.shape.as_list(), [30, 4, 1, 200, 6])
+
+    x_perm = distribution_util.move_dimension(x, 4, 2)
+    self.assertAllEqual(x_perm.shape.as_list(), [200, 30, 6, 4, 1])
+
+  @test_util.run_in_graph_and_eager_modes()
+  def test_move_dimension_dynamic_shape(self):
+
+    x_ = random_ops.random_normal(shape=[200, 30, 4, 1, 6])
+    x = array_ops.placeholder_with_default(input=x_, shape=None)
+
+    x_perm = distribution_util.move_dimension(x, 1, 1)
+    self.assertAllEqual(self.evaluate(array_ops.shape(x_perm)),
+                        [200, 30, 4, 1, 6])
+
+    x_perm = distribution_util.move_dimension(x, 0, 3)
+    self.assertAllEqual(self.evaluate(array_ops.shape(x_perm)),
+                        [30, 4, 1, 200, 6])
+
+    x_perm = distribution_util.move_dimension(x, 0, -2)
+    self.assertAllEqual(self.evaluate(array_ops.shape(x_perm)),
+                        [30, 4, 1, 200, 6])
+
+    x_perm = distribution_util.move_dimension(x, 4, 2)
+    self.assertAllEqual(self.evaluate(array_ops.shape(x_perm)),
+                        [200, 30, 6, 4, 1])
+
+    x_perm = distribution_util.move_dimension(x, -1, 2)
+    self.assertAllEqual(self.evaluate(array_ops.shape(x_perm)),
+                        [200, 30, 6, 4, 1])
+
+
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/contrib/distributions/python/kernel_tests/statistical_testing_test.py b/tensorflow/contrib/distributions/python/kernel_tests/statistical_testing_test.py
index ce6cf702d522792f1ad26066a3d9be42003a0e3c..9c4dfed83631e9f0815fb674d650cac2e570b923 100644
--- a/tensorflow/contrib/distributions/python/kernel_tests/statistical_testing_test.py
+++ b/tensorflow/contrib/distributions/python/kernel_tests/statistical_testing_test.py
@@ -98,23 +98,21 @@ class StatisticalTestingTest(test.TestCase):
     num_samples = 5000
     # 5000 samples is chosen to be enough to find discrepancies of
     # size 0.1 or more with assurance 1e-6, as confirmed here:
-    with self.test_session() as sess:
-      d = st.min_discrepancy_of_true_means_detectable_by_dkwm(
-          num_samples, 0., 1., false_fail_rate=1e-6, false_pass_rate=1e-6)
-      d = sess.run(d)
-      self.assertLess(d, 0.1)
+    d = st.min_discrepancy_of_true_means_detectable_by_dkwm(
+        num_samples, 0., 1., false_fail_rate=1e-6, false_pass_rate=1e-6)
+    d = self.evaluate(d)
+    self.assertLess(d, 0.1)
 
     # Test that the confidence interval computed for the mean includes
     # 0.5 and excludes 0.4 and 0.6.
-    with self.test_session() as sess:
-      samples = rng.uniform(size=num_samples).astype(np.float32)
-      (low, high) = st.true_mean_confidence_interval_by_dkwm(
-          samples, 0., 1., error_rate=1e-6)
-      low, high = sess.run([low, high])
-      self.assertGreater(low, 0.4)
-      self.assertLess(low, 0.5)
-      self.assertGreater(high, 0.5)
-      self.assertLess(high, 0.6)
+    samples = rng.uniform(size=num_samples).astype(np.float32)
+    (low, high) = st.true_mean_confidence_interval_by_dkwm(
+        samples, 0., 1., error_rate=1e-6)
+    low, high = self.evaluate([low, high])
+    self.assertGreater(low, 0.4)
+    self.assertLess(low, 0.5)
+    self.assertGreater(high, 0.5)
+    self.assertLess(high, 0.6)
 
   def test_dkwm_mean_one_sample_assertion(self):
     rng = np.random.RandomState(seed=0)
@@ -123,21 +121,45 @@ class StatisticalTestingTest(test.TestCase):
     # Test that the test assertion agrees that the mean of the standard
     # uniform distribution is 0.5.
     samples = rng.uniform(size=num_samples).astype(np.float32)
-    with self.test_session() as sess:
-      sess.run(st.assert_true_mean_equal_by_dkwm(
-          samples, 0., 1., 0.5, false_fail_rate=1e-6))
-
-      # Test that the test assertion confirms that the mean of the
-      # standard uniform distribution is not 0.4.
-      with self.assertRaisesOpError("Mean confidence interval too high"):
-        sess.run(st.assert_true_mean_equal_by_dkwm(
-            samples, 0., 1., 0.4, false_fail_rate=1e-6))
-
-      # Test that the test assertion confirms that the mean of the
-      # standard uniform distribution is not 0.6.
-      with self.assertRaisesOpError("Mean confidence interval too low"):
-        sess.run(st.assert_true_mean_equal_by_dkwm(
-            samples, 0., 1., 0.6, false_fail_rate=1e-6))
+    self.evaluate(st.assert_true_mean_equal_by_dkwm(
+        samples, 0., 1., 0.5, false_fail_rate=1e-6))
+
+    # Test that the test assertion confirms that the mean of the
+    # standard uniform distribution is not 0.4.
+    with self.assertRaisesOpError("true mean greater than expected"):
+      self.evaluate(st.assert_true_mean_equal_by_dkwm(
+          samples, 0., 1., 0.4, false_fail_rate=1e-6))
+
+    # Test that the test assertion confirms that the mean of the
+    # standard uniform distribution is not 0.6.
+    with self.assertRaisesOpError("true mean smaller than expected"):
+      self.evaluate(st.assert_true_mean_equal_by_dkwm(
+          samples, 0., 1., 0.6, false_fail_rate=1e-6))
+
+  def test_dkwm_mean_in_interval_one_sample_assertion(self):
+    rng = np.random.RandomState(seed=0)
+    num_samples = 5000
+
+    # Test that the test assertion agrees that the mean of the standard
+    # uniform distribution is between 0.4 and 0.6.
+    samples = rng.uniform(size=num_samples).astype(np.float32)
+    self.evaluate(st.assert_true_mean_in_interval_by_dkwm(
+        samples, 0., 1.,
+        expected_low=0.4, expected_high=0.6, false_fail_rate=1e-6))
+
+    # Test that the test assertion confirms that the mean of the
+    # standard uniform distribution is not between 0.2 and 0.4.
+    with self.assertRaisesOpError("true mean greater than expected"):
+      self.evaluate(st.assert_true_mean_in_interval_by_dkwm(
+          samples, 0., 1.,
+          expected_low=0.2, expected_high=0.4, false_fail_rate=1e-6))
+
+    # Test that the test assertion confirms that the mean of the
+    # standard uniform distribution is not between 0.6 and 0.8.
+    with self.assertRaisesOpError("true mean smaller than expected"):
+      self.evaluate(st.assert_true_mean_in_interval_by_dkwm(
+          samples, 0., 1.,
+          expected_low=0.6, expected_high=0.8, false_fail_rate=1e-6))
 
   def test_dkwm_mean_two_sample_assertion(self):
     rng = np.random.RandomState(seed=0)
@@ -145,20 +167,18 @@ class StatisticalTestingTest(test.TestCase):
 
     # 4000 samples is chosen to be enough to find discrepancies of
     # size 0.2 or more with assurance 1e-6, as confirmed here:
-    with self.test_session() as sess:
-      d = st.min_discrepancy_of_true_means_detectable_by_dkwm_two_sample(
-          num_samples, 0., 1., num_samples, 0., 1.,
-          false_fail_rate=1e-6, false_pass_rate=1e-6)
-      d = sess.run(d)
-      self.assertLess(d, 0.2)
+    d = st.min_discrepancy_of_true_means_detectable_by_dkwm_two_sample(
+        num_samples, 0., 1., num_samples, 0., 1.,
+        false_fail_rate=1e-6, false_pass_rate=1e-6)
+    d = self.evaluate(d)
+    self.assertLess(d, 0.2)
 
     # Test that the test assertion agrees that the standard
     # uniform distribution has the same mean as itself.
     samples1 = rng.uniform(size=num_samples).astype(np.float32)
     samples2 = rng.uniform(size=num_samples).astype(np.float32)
-    with self.test_session() as sess:
-      sess.run(st.assert_true_mean_equal_by_dkwm_two_sample(
-          samples1, 0., 1., samples2, 0., 1., false_fail_rate=1e-6))
+    self.evaluate(st.assert_true_mean_equal_by_dkwm_two_sample(
+        samples1, 0., 1., samples2, 0., 1., false_fail_rate=1e-6))
 
   def test_dkwm_mean_two_sample_assertion_beta_2_1_false(self):
     rng = np.random.RandomState(seed=0)
@@ -168,15 +188,14 @@ class StatisticalTestingTest(test.TestCase):
     # As established above, 4000 samples is enough to find discrepancies
     # of size 0.2 or more with assurance 1e-6.
 
-    with self.test_session() as sess:
-      # Test that the test assertion confirms that the mean of the
-      # standard uniform distribution is different from the mean of beta(2, 1).
-      beta_high_samples = rng.beta(2, 1, size=num_samples).astype(np.float32)
-      with self.assertRaisesOpError("samples1 has a smaller mean"):
-        sess.run(st.assert_true_mean_equal_by_dkwm_two_sample(
-            samples1, 0., 1.,
-            beta_high_samples, 0., 1.,
-            false_fail_rate=1e-6))
+    # Test that the test assertion confirms that the mean of the
+    # standard uniform distribution is different from the mean of beta(2, 1).
+    beta_high_samples = rng.beta(2, 1, size=num_samples).astype(np.float32)
+    with self.assertRaisesOpError("true mean smaller than expected"):
+      self.evaluate(st.assert_true_mean_equal_by_dkwm_two_sample(
+          samples1, 0., 1.,
+          beta_high_samples, 0., 1.,
+          false_fail_rate=1e-6))
 
   def test_dkwm_mean_two_sample_assertion_beta_1_2_false(self):
     rng = np.random.RandomState(seed=0)
@@ -186,15 +205,14 @@ class StatisticalTestingTest(test.TestCase):
     # As established above, 4000 samples is enough to find discrepancies
     # of size 0.2 or more with assurance 1e-6.
 
-    with self.test_session() as sess:
-      # Test that the test assertion confirms that the mean of the
-      # standard uniform distribution is different from the mean of beta(1, 2).
-      beta_low_samples = rng.beta(1, 2, size=num_samples).astype(np.float32)
-      with self.assertRaisesOpError("samples2 has a smaller mean"):
-        sess.run(st.assert_true_mean_equal_by_dkwm_two_sample(
-            samples1, 0., 1.,
-            beta_low_samples, 0., 1.,
-            false_fail_rate=1e-6))
+    # Test that the test assertion confirms that the mean of the
+    # standard uniform distribution is different from the mean of beta(1, 2).
+    beta_low_samples = rng.beta(1, 2, size=num_samples).astype(np.float32)
+    with self.assertRaisesOpError("true mean greater than expected"):
+      self.evaluate(st.assert_true_mean_equal_by_dkwm_two_sample(
+          samples1, 0., 1.,
+          beta_low_samples, 0., 1.,
+          false_fail_rate=1e-6))
 
   def test_dkwm_argument_validity_checking(self):
     rng = np.random.RandomState(seed=0)
@@ -203,18 +221,17 @@ class StatisticalTestingTest(test.TestCase):
 
     # Test that the test library complains if the given samples fall
     # outside the purported bounds.
-    with self.test_session() as sess:
-      with self.assertRaisesOpError("maximum value exceeds expectations"):
-        sess.run(st.true_mean_confidence_interval_by_dkwm(
-            samples, [[0., 1.]], [[0.5, 1.5]], error_rate=0.5))
-      with self.assertRaisesOpError("minimum value falls below expectations"):
-        sess.run(st.true_mean_confidence_interval_by_dkwm(
-            samples, [[0.5, 1.5]], [[1., 2.]], error_rate=0.5))
-
-      # But doesn't complain if they don't.
-      op = st.true_mean_confidence_interval_by_dkwm(
-          samples, [[0., 1.]], [[1., 2.]], error_rate=0.5)
-      _ = sess.run(op)
+    with self.assertRaisesOpError("maximum value exceeds expectations"):
+      self.evaluate(st.true_mean_confidence_interval_by_dkwm(
+          samples, [[0., 1.]], [[0.5, 1.5]], error_rate=0.5))
+    with self.assertRaisesOpError("minimum value falls below expectations"):
+      self.evaluate(st.true_mean_confidence_interval_by_dkwm(
+          samples, [[0.5, 1.5]], [[1., 2.]], error_rate=0.5))
+
+    # But doesn't complain if they don't.
+    op = st.true_mean_confidence_interval_by_dkwm(
+        samples, [[0., 1.]], [[1., 2.]], error_rate=0.5)
+    _ = self.evaluate(op)
 
   def test_do_maximum_mean(self):
     n = 117
@@ -223,10 +240,9 @@ class StatisticalTestingTest(test.TestCase):
     samples = rng.uniform(size=n).astype(np.float32)
 
     # Compute the answer in TF using the code under test
-    with self.test_session() as sess:
-      envelope_t = ops.convert_to_tensor(envelope)
-      max_mean = st._do_maximum_mean(samples, envelope_t, 1)
-      max_mean = sess.run(max_mean)
+    envelope_t = ops.convert_to_tensor(envelope)
+    max_mean = st._do_maximum_mean(samples, envelope_t, 1)
+    max_mean = self.evaluate(max_mean)
 
     # Compute the correct answer for this case in numpy.  In this
     # example, `n` and `envelope` are such that `samples[2]` is the
diff --git a/tensorflow/contrib/distributions/python/kernel_tests/util/BUILD b/tensorflow/contrib/distributions/python/kernel_tests/util/BUILD
new file mode 100644
index 0000000000000000000000000000000000000000..03e26b198ea02ad1bef8bcd2f6076078ecd7df0b
--- /dev/null
+++ b/tensorflow/contrib/distributions/python/kernel_tests/util/BUILD
@@ -0,0 +1,48 @@
+# Description:
+#   Internal testing utilities, e.g., computing the correct answer to
+#   put in a unit test.
+
+licenses(["notice"])  # Apache 2.0
+
+py_library(
+    name = "correlation_matrix_volumes_py",
+    srcs = [
+        "correlation_matrix_volumes_lib.py",
+    ],
+    deps = [
+        "//tensorflow/contrib/distributions:distributions_py",
+        "//tensorflow/python:array_ops",
+        "//tensorflow/python:errors",
+        "//tensorflow/python:framework",
+        "//tensorflow/python:framework_for_generated_wrappers",
+        "//tensorflow/python:math_ops",
+        "//third_party/py/numpy",
+    ],
+)
+
+py_binary(
+    name = "correlation_matrix_volumes",
+    srcs = [
+        "correlation_matrix_volumes.py",
+    ],
+    deps = [
+        ":correlation_matrix_volumes_py",
+    ],
+)
+
+py_test(
+    name = "correlation_matrix_volumes_test",
+    size = "medium",
+    srcs = ["correlation_matrix_volumes_test.py"],
+    tags = ["no_pip"],
+    deps = [
+        ":correlation_matrix_volumes_py",
+        # For statistical testing
+        "//tensorflow/contrib/distributions:distributions_py",
+        "//third_party/py/numpy",
+        "//tensorflow/python:array_ops",
+        "//tensorflow/python:check_ops",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:framework",
+    ],
+)
diff --git a/tensorflow/contrib/distributions/python/kernel_tests/util/correlation_matrix_volumes.py b/tensorflow/contrib/distributions/python/kernel_tests/util/correlation_matrix_volumes.py
new file mode 100644
index 0000000000000000000000000000000000000000..2eab51cd3053ea55f2e03619fd002fbf48251fb1
--- /dev/null
+++ b/tensorflow/contrib/distributions/python/kernel_tests/util/correlation_matrix_volumes.py
@@ -0,0 +1,98 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Executable to estimate the volume of various sets of correlation matrices.
+
+See correlation_matrix_volumes_lib.py for purpose and methodology.
+
+Invocation example:
+```
+python correlation_matrix_volumes.py --num_samples 1e7
+```
+
+This will compute 10,000,000-sample confidence intervals for the
+volumes of several sets of correlation matrices.  Which sets, and the
+desired statistical significance, are hard-coded in this source file.
+"""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import pprint
+
+from absl import app
+from absl import flags
+
+from tensorflow.contrib.distributions.python.kernel_tests.util import correlation_matrix_volumes_lib as corr
+
+FLAGS = flags.FLAGS
+
+# Float to support giving the number of samples in scientific notation.
+# The production run used for the LKJ test used 1e7 samples.
+flags.DEFINE_float('num_samples', 1e4, 'Number of samples to use.')
+
+
+def ctv_debatched(det_bounds, dim, num_samples, error_rate=1e-6, seed=42):
+  # This wrapper undoes the batching in compute_true_volumes, because
+  # apparently several 5x5x9x1e7 Tensors of float32 can strain RAM.
+  bounds = {}
+  for db in det_bounds:
+    bounds[db] = corr.compute_true_volumes(
+        [db], dim, num_samples, error_rate=error_rate, seed=seed)[db]
+  return bounds
+
+
+# The particular bounds in all three of these functions were chosen by
+# a somewhat arbitrary walk through an empirical tradeoff, for the
+# purpose of testing the LKJ distribution.  Setting the determinant
+# bound lower
+# - Covers more of the testee's sample space, and
+# - Increases the probability that the rejection sampler will hit, thus
+# - Decreases the relative error (at a fixed sample count) in the
+#   rejection-based volume estimate;
+# but also
+# - Increases the variance of the estimator used in the LKJ test.
+# This latter variance is also affected by the dimension and the
+# tested concentration parameter, and can be compensated for with more
+# compute (expensive) or a looser discrepancy limit (unsatisfying).
+# The values here are the projection of the points in that test design
+# space that ended up getting chosen.
+def compute_3x3_volumes(num_samples):
+  det_bounds = [0.01, 0.25, 0.3, 0.35, 0.4, 0.45]
+  return ctv_debatched(
+      det_bounds, 3, num_samples, error_rate=5e-7, seed=46)
+
+
+def compute_4x4_volumes(num_samples):
+  det_bounds = [0.01, 0.25, 0.3, 0.35, 0.4, 0.45]
+  return ctv_debatched(
+      det_bounds, 4, num_samples, error_rate=5e-7, seed=47)
+
+
+def compute_5x5_volumes(num_samples):
+  det_bounds = [0.01, 0.2, 0.25, 0.3, 0.35, 0.4]
+  return ctv_debatched(
+      det_bounds, 5, num_samples, error_rate=5e-7, seed=48)
+
+
+def main(_):
+  full_bounds = {}
+  full_bounds[3] = compute_3x3_volumes(int(FLAGS.num_samples))
+  full_bounds[4] = compute_4x4_volumes(int(FLAGS.num_samples))
+  full_bounds[5] = compute_5x5_volumes(int(FLAGS.num_samples))
+  pprint.pprint(full_bounds)
+
+if __name__ == '__main__':
+  app.run(main)
diff --git a/tensorflow/contrib/distributions/python/kernel_tests/util/correlation_matrix_volumes_lib.py b/tensorflow/contrib/distributions/python/kernel_tests/util/correlation_matrix_volumes_lib.py
new file mode 100644
index 0000000000000000000000000000000000000000..455e71f00c96e799c4aaae25050c77a9ae36df06
--- /dev/null
+++ b/tensorflow/contrib/distributions/python/kernel_tests/util/correlation_matrix_volumes_lib.py
@@ -0,0 +1,323 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Estimating the volume of the correlation matrices with bounded determinant.
+
+Why?  Because lkj_test.py tests the sampler for the LKJ distribution
+by estimating the same volume another way.
+
+How?  Rejection sampling.  Or, more precisely, importance sampling,
+proposing from the uniform distribution on symmetric matrices with
+diagonal 1s and entries in [-1, 1].  Such a matrix is a correlation
+matrix if and only if it is also positive semi-definite.
+
+The samples can then be converted into a confidence interval on the
+volume in question by the [Clopper-Pearson
+method](https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval),
+also implemented here.
+"""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import importlib
+import sys
+
+import numpy as np
+
+from tensorflow.python.client import session
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import linalg_ops
+from tensorflow.python.ops import math_ops
+from tensorflow.python.ops.distributions import uniform
+from tensorflow.python.ops.distributions import util
+from tensorflow.python.platform import tf_logging
+
+__all__ = [
+    "correlation_matrix_volume_rejection_samples",
+    "compute_true_volumes",
+]
+
+
+def try_import(name):  # pylint: disable=invalid-name
+  module = None
+  try:
+    module = importlib.import_module(name)
+  except ImportError as e:
+    tf_logging.warning("Could not import %s: %s" % (name, str(e)))
+  return module
+
+optimize = try_import("scipy.optimize")
+stats = try_import("scipy.stats")
+
+
+def _psd_mask(x):
+  """Computes whether each square matrix in the input is positive semi-definite.
+
+  Args:
+    x: A floating-point `Tensor` of shape `[B1, ..., Bn, M, M]`.
+
+  Returns:
+    mask: A floating-point `Tensor` of shape `[B1, ... Bn]`.  Each
+      scalar is 1 if the corresponding matrix was PSD, otherwise 0.
+  """
+  # Allegedly
+  # https://scicomp.stackexchange.com/questions/12979/testing-if-a-matrix-is-positive-semi-definite
+  # it is more efficient to test for positive semi-definiteness by
+  # trying to compute the Cholesky decomposition -- the matrix is PSD
+  # if you succeed and not PSD if you fail.  However, TensorFlow's
+  # Cholesky raises an exception if _any_ of the input matrices are
+  # not PSD, from which I don't know how to extract _which ones_, so I
+  # proceed by explicitly computing all the eigenvalues and checking
+  # whether they are all positive or not.
+  #
+  # Also, as was discussed in the answer, it is somewhat dangerous to
+  # treat SPD-ness as binary in floating-point arithmetic. Cholesky
+  # factorization can complete and 'look' like everything is fine
+  # (e.g., O(1) entries and a diagonal of all ones) but the matrix can
+  # have an exponential condition number.
+  eigenvalues, _ = linalg_ops.self_adjoint_eig(x)
+  return math_ops.cast(
+      math_ops.reduce_min(eigenvalues, axis=-1) >= 0, dtype=x.dtype)
+
+
+def _det_large_enough_mask(x, det_bounds):
+  """Returns whether the input matches the given determinant limit.
+
+  Args:
+    x: A floating-point `Tensor` of shape `[B1, ..., Bn, M, M]`.
+    det_bounds: A floating-point `Tensor` that must broadcast to shape
+      `[B1, ..., Bn]`, giving the desired lower bound on the
+      determinants in `x`.
+
+  Returns:
+    mask: A floating-point `Tensor` of shape [B1, ..., Bn].  Each
+      scalar is 1 if the corresponding matrix had determinant above
+      the corresponding bound, otherwise 0.
+  """
+  # For the curious: I wonder whether it is possible and desirable to
+  # use a Cholesky decomposition-based algorithm for this, since the
+  # only matrices whose determinant this code cares about will be PSD.
+  # Didn't figure out how to code that in TensorFlow.
+  #
+  # Expert opinion is that it would be about twice as fast since
+  # Cholesky is roughly half the cost of Gaussian Elimination with
+  # Partial Pivoting. But this is less of an impact than the switch in
+  # _psd_mask.
+  return math_ops.cast(
+      linalg_ops.matrix_determinant(x) > det_bounds, dtype=x.dtype)
+
+
+def _uniform_correlation_like_matrix(num_rows, batch_shape, dtype, seed):
+  """Returns a uniformly random `Tensor` of "correlation-like" matrices.
+
+  A "correlation-like" matrix is a symmetric square matrix with all entries
+  between -1 and 1 (inclusive) and 1s on the main diagonal.  Of these,
+  the ones that are positive semi-definite are exactly the correlation
+  matrices.
+
+  Args:
+    num_rows: Python `int` dimension of the correlation-like matrices.
+    batch_shape: `Tensor` or Python `tuple` of `int` shape of the
+      batch to return.
+    dtype: `dtype` of the `Tensor` to return.
+    seed: Random seed.
+
+  Returns:
+    matrices: A `Tensor` of shape `batch_shape + [num_rows, num_rows]`
+      and dtype `dtype`.  Each entry is in [-1, 1], and each matrix
+      along the bottom two dimensions is symmetric and has 1s on the
+      main diagonal.
+  """
+  num_entries = num_rows * (num_rows + 1) / 2
+  ones = array_ops.ones(shape=[num_entries], dtype=dtype)
+  # It seems wasteful to generate random values for the diagonal since
+  # I am going to throw them away, but `fill_triangular` fills the
+  # diagonal, so I probably need them.
+  # It's not impossible that it would be more efficient to just fill
+  # the whole matrix with random values instead of messing with
+  # `fill_triangular`.  Then would need to filter almost half out with
+  # `matrix_band_part`.
+  unifs = uniform.Uniform(-ones, ones).sample(batch_shape, seed=seed)
+  tril = util.fill_triangular(unifs)
+  symmetric = tril + array_ops.matrix_transpose(tril)
+  diagonal_ones = array_ops.ones(
+      shape=util.pad(batch_shape, axis=0, back=True, value=num_rows),
+      dtype=dtype)
+  return array_ops.matrix_set_diag(symmetric, diagonal_ones)
+
+
+def correlation_matrix_volume_rejection_samples(
+    det_bounds, dim, sample_shape, dtype, seed):
+  """Returns rejection samples from trying to get good correlation matrices.
+
+  The proposal being rejected from is the uniform distribution on
+  "correlation-like" matrices.  We say a matrix is "correlation-like"
+  if it is a symmetric square matrix with all entries between -1 and 1
+  (inclusive) and 1s on the main diagonal.  Of these, the ones that
+  are positive semi-definite are exactly the correlation matrices.
+
+  The rejection algorithm, then, is to sample a `Tensor` of
+  `sample_shape` correlation-like matrices of dimensions `dim` by
+  `dim`, and check each one for (i) being a correlation matrix (i.e.,
+  PSD), and (ii) having determinant at least the corresponding entry
+  of `det_bounds`.
+
+  Args:
+    det_bounds: A `Tensor` of lower bounds on the determinants of
+      acceptable matrices.  The shape must broadcast with `sample_shape`.
+    dim: A Python `int` dimension of correlation matrices to sample.
+    sample_shape: Python `tuple` of `int` shape of the samples to
+      compute, excluding the two matrix dimensions.
+    dtype: The `dtype` in which to do the computation.
+    seed: Random seed.
+
+  Returns:
+    weights: A `Tensor` of shape `sample_shape`.  Each entry is 0 if the
+      corresponding matrix was not a correlation matrix, or had too
+      small of a determinant.  Otherwise, the entry is the
+      multiplicative inverse of the density of proposing that matrix
+      uniformly, i.e., the volume of the set of `dim` by `dim`
+      correlation-like matrices.
+    volume: The volume of the set of `dim` by `dim` correlation-like
+      matrices.
+  """
+  with ops.name_scope("rejection_sampler"):
+    rej_proposals = _uniform_correlation_like_matrix(
+        dim, sample_shape, dtype, seed=seed)
+    rej_proposal_volume = 2. ** (dim * (dim - 1) / 2.)
+    # The density of proposing any given point is 1 / rej_proposal_volume;
+    # The weight of that point should be scaled by
+    # 1 / density = rej_proposal_volume.
+    rej_weights = rej_proposal_volume * _psd_mask(
+        rej_proposals) * _det_large_enough_mask(rej_proposals, det_bounds)
+    return rej_weights, rej_proposal_volume
+
+
+def _clopper_pearson_confidence_interval(samples, error_rate):
+  """Computes a confidence interval for the mean of the given 1-D distribution.
+
+  Assumes (and checks) that the given distribution is Bernoulli, i.e.,
+  takes only two values.  This licenses using the CDF of the binomial
+  distribution for the confidence, which is tighter (for extreme
+  probabilities) than the DKWM inequality.  The method is known as the
+  [Clopper-Pearson method]
+  (https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval).
+
+  Assumes:
+
+  - The given samples were drawn iid from the distribution of interest.
+
+  - The given distribution is a Bernoulli, i.e., supported only on
+    low and high.
+
+  Guarantees:
+
+  - The probability (over the randomness of drawing the given sample)
+    that the true mean is outside the returned interval is no more
+    than the given error_rate.
+
+  Args:
+    samples: `np.ndarray` of samples drawn iid from the distribution
+      of interest.
+    error_rate: Python `float` admissible rate of mistakes.
+
+  Returns:
+    low: Lower bound of confidence interval.
+    high: Upper bound of confidence interval.
+
+  Raises:
+    ValueError: If `samples` has rank other than 1 (batch semantics
+      are not implemented), or if `samples` contains values other than
+      `low` or `high` (as that makes the distribution not Bernoulli).
+  """
+  # TODO(b/78025336) Migrate this confidence interval function
+  # to statistical_testing.py.  In order to do that
+  # - Get the binomial CDF from the Binomial distribution
+  # - Implement scalar root finding in TF.  Batch bisection search
+  #   shouldn't be too hard, and is definitely good enough for this
+  #   problem.  Batching the Brent algorithm (from scipy) that is used
+  #   here may be more involved, but may also not be necessary---it's
+  #   only used here because scipy made it convenient.  In particular,
+  #   robustness is more important than speed here, which may make
+  #   bisection search actively better.
+  # - The rest is just a matter of rewriting in the appropriate style.
+  if optimize is None or stats is None:
+    raise ValueError(
+        "Scipy is required for computing Clopper-Pearson confidence intervals")
+  if len(samples.shape) != 1:
+    raise ValueError("Batch semantics not implemented")
+  n = len(samples)
+  low = np.amin(samples)
+  high = np.amax(samples)
+  successes = np.count_nonzero(samples - low)
+  failures = np.count_nonzero(samples - high)
+  if successes + failures != n:
+    uniques = np.unique(samples)
+    msg = ("Purportedly Bernoulli distribution had distinct samples"
+           " {}, {}, and {}".format(uniques[0], uniques[1], uniques[2]))
+    raise ValueError(msg)
+  def p_small_enough(p):
+    prob = stats.binom.logcdf(successes, n, p)
+    return prob - np.log(error_rate / 2.)
+  def p_big_enough(p):
+    prob = stats.binom.logsf(successes, n, p)
+    return prob - np.log(error_rate / 2.)
+  high_p = optimize.brentq(
+      p_small_enough, float(successes) / n, 1., rtol=1e-9)
+  low_p = optimize.brentq(
+      p_big_enough, 0., float(successes) / n, rtol=1e-9)
+  low_interval = low + (high - low) * low_p
+  high_interval = low + (high - low) * high_p
+  return (low_interval, high_interval)
+
+
+def compute_true_volumes(
+    det_bounds, dim, num_samples, error_rate=1e-6, seed=42):
+  """Returns confidence intervals for the desired correlation matrix volumes.
+
+  The confidence intervals are computed by the [Clopper-Pearson method]
+  (https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval).
+
+  Args:
+    det_bounds: A rank-1 numpy array of lower bounds on the
+      determinants of acceptable matrices.  Entries must be unique.
+    dim: A Python `int` dimension of correlation matrices to sample.
+    num_samples: The number of samples to draw.
+    error_rate: The statistical significance of the returned
+      confidence intervals.  The significance is broadcast: Each
+      returned interval separately may be incorrect with probability
+      (under the sample of correlation-like matrices drawn internally)
+      at most `error_rate`.
+    seed: Random seed.
+
+  Returns:
+    bounds: A Python `dict` mapping each determinant bound to the low, high
+      tuple giving the confidence interval.
+  """
+  bounds = {}
+  with session.Session() as sess:
+    rej_weights, _ = correlation_matrix_volume_rejection_samples(
+        det_bounds, dim, [num_samples, len(det_bounds)], np.float32, seed=seed)
+    rej_weights = sess.run(rej_weights)
+    for rw, det in zip(np.rollaxis(rej_weights, 1), det_bounds):
+      template = ("Estimating volume of {}x{} correlation "
+                  "matrices with determinant >= {}.")
+      print(template.format(dim, dim, det))
+      sys.stdout.flush()
+      bounds[det] = _clopper_pearson_confidence_interval(
+          rw, error_rate=error_rate)
+    return bounds
diff --git a/tensorflow/contrib/distributions/python/kernel_tests/util/correlation_matrix_volumes_test.py b/tensorflow/contrib/distributions/python/kernel_tests/util/correlation_matrix_volumes_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..8f99300e63871119800a42f122c8321e5986541a
--- /dev/null
+++ b/tensorflow/contrib/distributions/python/kernel_tests/util/correlation_matrix_volumes_test.py
@@ -0,0 +1,150 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for correlation_matrix_volumes_lib.py."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.contrib.distributions.python.kernel_tests.util import correlation_matrix_volumes_lib as corr
+from tensorflow.contrib.distributions.python.ops import statistical_testing as st
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import check_ops
+from tensorflow.python.platform import test
+
+
+# NxN correlation matrices are determined by the N*(N-1)/2
+# lower-triangular entries.  In addition to being between -1 and 1,
+# they must also obey the constraint that the determinant of the
+# resulting symmetric matrix is non-negative.  In 2x2, we can even
+# analytically compute the volume when the determinant is bounded to >
+# epsilon, as that boils down to the one lower-triangular entry being
+# less than 1 - epsilon in absolute value.
+def two_by_two_volume(det_bound):
+  return 2 * np.sqrt(1.0 - det_bound)
+
+
+# The post
+# https://psychometroscar.com/the-volume-of-a-3-x-3-correlation-matrix/
+# derives (with elementary calculus) that the volume (with respect to
+# Lebesgue^3 measure) of the set of 3x3 correlation matrices is
+# pi^2/2.  The same result is also obtained by [1].
+def three_by_three_volume():
+  return np.pi**2 / 2.
+
+
+# The volume of the unconstrained set of correlation matrices is also
+# the normalization constant of the LKJ distribution from [2].  As
+# part of defining the distribution, that reference a derives general
+# formula for this volume for all dimensions.  A TensorFlow
+# computation thereof gave the below result for 4x4:
+def four_by_four_volume():
+  # This constant computed as math_ops.exp(lkj.log_norm_const(4, [1.0]))
+  return 11.6973076
+
+# [1] Rousseeuw, P. J., & Molenberghs, G. (1994). "The shape of
+# correlation matrices." The American Statistician, 48(4), 276-279.
+
+# [2] Daniel Lewandowski, Dorota Kurowicka, and Harry Joe, "Generating
+# random correlation matrices based on vines and extended onion
+# method," Journal of Multivariate Analysis 100 (2009), pp 1989-2001.
+
+
+class CorrelationMatrixVolumesTest(test.TestCase):
+
+  def testRejection2D(self):
+    num_samples = int(1e5)  # Chosen for a small min detectable discrepancy
+    det_bounds = np.array(
+        [0.01, 0.02, 0.03, 0.04, 0.05, 0.3, 0.35, 0.4, 0.5], dtype=np.float32)
+    exact_volumes = two_by_two_volume(det_bounds)
+    (rej_weights,
+     rej_proposal_volume) = corr.correlation_matrix_volume_rejection_samples(
+         det_bounds, 2, [num_samples, 9], dtype=np.float32, seed=43)
+    # shape of rej_weights: [num_samples, 9, 2, 2]
+    chk1 = st.assert_true_mean_equal_by_dkwm(
+        rej_weights, low=0., high=rej_proposal_volume, expected=exact_volumes,
+        false_fail_rate=1e-6)
+    chk2 = check_ops.assert_less(
+        st.min_discrepancy_of_true_means_detectable_by_dkwm(
+            num_samples, low=0., high=rej_proposal_volume,
+            # Correct the false fail rate due to different broadcasting
+            false_fail_rate=1.1e-7, false_pass_rate=1e-6),
+        0.036)
+    with ops.control_dependencies([chk1, chk2]):
+      rej_weights = array_ops.identity(rej_weights)
+    self.evaluate(rej_weights)
+
+  def testRejection3D(self):
+    num_samples = int(1e5)  # Chosen for a small min detectable discrepancy
+    det_bounds = np.array([0.0], dtype=np.float32)
+    exact_volumes = np.array([three_by_three_volume()], dtype=np.float32)
+    (rej_weights,
+     rej_proposal_volume) = corr.correlation_matrix_volume_rejection_samples(
+         det_bounds, 3, [num_samples, 1], dtype=np.float32, seed=44)
+    # shape of rej_weights: [num_samples, 1, 3, 3]
+    chk1 = st.assert_true_mean_equal_by_dkwm(
+        rej_weights, low=0., high=rej_proposal_volume, expected=exact_volumes,
+        false_fail_rate=1e-6)
+    chk2 = check_ops.assert_less(
+        st.min_discrepancy_of_true_means_detectable_by_dkwm(
+            num_samples, low=0., high=rej_proposal_volume,
+            false_fail_rate=1e-6, false_pass_rate=1e-6),
+        # Going for about a 3% relative error
+        0.15)
+    with ops.control_dependencies([chk1, chk2]):
+      rej_weights = array_ops.identity(rej_weights)
+    self.evaluate(rej_weights)
+
+  def testRejection4D(self):
+    num_samples = int(1e5)  # Chosen for a small min detectable discrepancy
+    det_bounds = np.array([0.0], dtype=np.float32)
+    exact_volumes = [four_by_four_volume()]
+    (rej_weights,
+     rej_proposal_volume) = corr.correlation_matrix_volume_rejection_samples(
+         det_bounds, 4, [num_samples, 1], dtype=np.float32, seed=45)
+    # shape of rej_weights: [num_samples, 1, 4, 4]
+    chk1 = st.assert_true_mean_equal_by_dkwm(
+        rej_weights, low=0., high=rej_proposal_volume, expected=exact_volumes,
+        false_fail_rate=1e-6)
+    chk2 = check_ops.assert_less(
+        st.min_discrepancy_of_true_means_detectable_by_dkwm(
+            num_samples, low=0., high=rej_proposal_volume,
+            false_fail_rate=1e-6, false_pass_rate=1e-6),
+        # Going for about a 10% relative error
+        1.1)
+    with ops.control_dependencies([chk1, chk2]):
+      rej_weights = array_ops.identity(rej_weights)
+    self.evaluate(rej_weights)
+
+  def testVolumeEstimation2D(self):
+    # Test that the confidence intervals produced by
+    # corr.compte_true_volumes are sound, in the sense of containing
+    # the exact volume.
+    num_samples = int(1e5)  # Chosen by symmetry with testRejection2D
+    det_bounds = np.array(
+        [0.01, 0.02, 0.03, 0.04, 0.05, 0.3, 0.35, 0.4, 0.5], dtype=np.float32)
+    volume_bounds = corr.compute_true_volumes(
+        det_bounds, 2, num_samples, error_rate=1e-6, seed=47)
+    exact_volumes = two_by_two_volume(det_bounds)
+    for det, volume in zip(det_bounds, exact_volumes):
+      computed_low, computed_high = volume_bounds[det]
+      self.assertLess(computed_low, volume)
+      self.assertGreater(computed_high, volume)
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/contrib/distributions/python/ops/autoregressive.py b/tensorflow/contrib/distributions/python/ops/autoregressive.py
index d813831bef803a22c095d9c98e7163aa4861a15d..bb9b8043b2233b2109f51b5dde188d088fdb0d39 100644
--- a/tensorflow/contrib/distributions/python/ops/autoregressive.py
+++ b/tensorflow/contrib/distributions/python/ops/autoregressive.py
@@ -23,6 +23,7 @@ import numpy as np
 from tensorflow.python.framework import ops
 from tensorflow.python.ops.distributions import distribution as distribution_lib
 from tensorflow.python.ops.distributions import util as distribution_util
+from tensorflow.python.util import deprecation
 
 
 class Autoregressive(distribution_lib.Distribution):
@@ -107,6 +108,14 @@ class Autoregressive(distribution_lib.Distribution):
        https://arxiv.org/abs/1606.05328
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                distribution_fn,
                sample0=None,
@@ -144,7 +153,7 @@ class Autoregressive(distribution_lib.Distribution):
         `distribution_fn(sample0).event_shape.num_elements()` are both `None`.
       ValueError: if `num_steps < 1`.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     with ops.name_scope(name) as name:
       self._distribution_fn = distribution_fn
       self._sample0 = sample0
diff --git a/tensorflow/contrib/distributions/python/ops/batch_reshape.py b/tensorflow/contrib/distributions/python/ops/batch_reshape.py
index c709318f76552e1188f735f5bafff4be0537baed..519077bc9ab1063a1135486cfae34656f3f68157 100644
--- a/tensorflow/contrib/distributions/python/ops/batch_reshape.py
+++ b/tensorflow/contrib/distributions/python/ops/batch_reshape.py
@@ -28,7 +28,7 @@ from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops.distributions import distribution as distribution_lib
-from tensorflow.python.ops.distributions import util as distribution_util
+from tensorflow.python.util import deprecation
 
 
 __all__ = [
@@ -72,6 +72,14 @@ class BatchReshape(distribution_lib.Distribution):
 
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                distribution,
                batch_shape,
@@ -103,7 +111,7 @@ class BatchReshape(distribution_lib.Distribution):
       ValueError: if `batch_shape` size is not the same as a
         `distribution.batch_shape` size.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     name = name or "BatchReshape" + distribution.name
     with ops.name_scope(name, values=[batch_shape]) as name:
       # The unexpanded batch shape may contain up to one dimension of -1.
@@ -353,6 +361,14 @@ class BatchReshape(distribution_lib.Distribution):
       return runtime_assertions
 
 
+@deprecation.deprecated(
+    "2018-10-01",
+    "The TensorFlow Distributions library has moved to "
+    "TensorFlow Probability "
+    "(https://github.com/tensorflow/probability). You "
+    "should update all references to use `tfp.distributions` "
+    "instead of `tf.contrib.distributions`.",
+    warn_once=True)
 def calculate_reshape(original_shape, new_shape, validate=False, name=None):
   """Calculates the reshaped dimensions (replacing up to one -1 in reshape)."""
   batch_shape_static = tensor_util.constant_value_as_shape(new_shape)
@@ -385,6 +401,14 @@ def calculate_reshape(original_shape, new_shape, validate=False, name=None):
     return expanded_new_shape, batch_shape_static, validations
 
 
+@deprecation.deprecated(
+    "2018-10-01",
+    "The TensorFlow Distributions library has moved to "
+    "TensorFlow Probability "
+    "(https://github.com/tensorflow/probability). You "
+    "should update all references to use `tfp.distributions` "
+    "instead of `tf.contrib.distributions`.",
+    warn_once=True)
 def validate_init_args_statically(distribution, batch_shape):
   """Helper to __init__ which makes or raises assertions."""
   if batch_shape.shape.ndims is not None:
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/__init__.py b/tensorflow/contrib/distributions/python/ops/bijectors/__init__.py
index 51478dbeffaabc58ce3662f25f06bc579e8a407e..e141f8b5c6423bd6cce4d09da6f49d55b3e25a24 100644
--- a/tensorflow/contrib/distributions/python/ops/bijectors/__init__.py
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/__init__.py
@@ -24,23 +24,27 @@
 @@CholeskyOuterProduct
 @@ConditionalBijector
 @@Exp
+@@FillTriangular
 @@Gumbel
 @@Identity
 @@Inline
 @@Invert
 @@Kumaraswamy
 @@MaskedAutoregressiveFlow
+@@MatrixInverseTriL
 @@Ordered
 @@Permute
 @@PowerTransform
 @@RealNVP
 @@Reshape
+@@ScaleTriL
 @@Sigmoid
 @@SinhArcsinh
 @@SoftmaxCentered
 @@Softplus
 @@Softsign
 @@Square
+@@TransformDiagonal
 @@Weibull
 
 @@masked_autoregressive_default_template
@@ -63,22 +67,26 @@ from tensorflow.contrib.distributions.python.ops.bijectors.chain import *
 from tensorflow.contrib.distributions.python.ops.bijectors.cholesky_outer_product import *
 from tensorflow.contrib.distributions.python.ops.bijectors.conditional_bijector import *
 from tensorflow.contrib.distributions.python.ops.bijectors.exp import *
+from tensorflow.contrib.distributions.python.ops.bijectors.fill_triangular import *
 from tensorflow.contrib.distributions.python.ops.bijectors.gumbel import *
 from tensorflow.contrib.distributions.python.ops.bijectors.inline import *
 from tensorflow.contrib.distributions.python.ops.bijectors.invert import *
 from tensorflow.contrib.distributions.python.ops.bijectors.kumaraswamy import *
 from tensorflow.contrib.distributions.python.ops.bijectors.masked_autoregressive import *
+from tensorflow.contrib.distributions.python.ops.bijectors.matrix_inverse_tril import *
 from tensorflow.contrib.distributions.python.ops.bijectors.ordered import *
 from tensorflow.contrib.distributions.python.ops.bijectors.permute import *
 from tensorflow.contrib.distributions.python.ops.bijectors.power_transform import *
 from tensorflow.contrib.distributions.python.ops.bijectors.real_nvp import *
 from tensorflow.contrib.distributions.python.ops.bijectors.reshape import *
+from tensorflow.contrib.distributions.python.ops.bijectors.scale_tril import *
 from tensorflow.contrib.distributions.python.ops.bijectors.sigmoid import *
 from tensorflow.contrib.distributions.python.ops.bijectors.sinh_arcsinh import *
 from tensorflow.contrib.distributions.python.ops.bijectors.softmax_centered import *
 from tensorflow.contrib.distributions.python.ops.bijectors.softplus import *
 from tensorflow.contrib.distributions.python.ops.bijectors.softsign import *
 from tensorflow.contrib.distributions.python.ops.bijectors.square import *
+from tensorflow.contrib.distributions.python.ops.bijectors.transform_diagonal import *
 from tensorflow.python.ops.distributions.bijector import *
 from tensorflow.python.ops.distributions.identity_bijector import Identity
 
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/absolute_value.py b/tensorflow/contrib/distributions/python/ops/bijectors/absolute_value.py
index c9e31d7712f09f6c4b4cc6ae51a34c42a19c291d..4d6a46e7358933fdf512f49eae2673f35953c90a 100644
--- a/tensorflow/contrib/distributions/python/ops/bijectors/absolute_value.py
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/absolute_value.py
@@ -23,6 +23,7 @@ from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops.distributions import bijector
+from tensorflow.python.util import deprecation
 
 __all__ = [
     "AbsoluteValue",
@@ -70,6 +71,14 @@ class AbsoluteValue(bijector.Bijector):
 
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self, validate_args=False, name="absolute_value"):
     """Instantiates the `AbsoluteValue` bijector.
 
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/affine.py b/tensorflow/contrib/distributions/python/ops/bijectors/affine.py
index b4c2939eb914d50475ba6b1c1e979a804090f641..25f29452c3949600b8a4153a8585dd7269bd3b2b 100644
--- a/tensorflow/contrib/distributions/python/ops/bijectors/affine.py
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/affine.py
@@ -29,6 +29,7 @@ from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops.distributions import bijector
+from tensorflow.python.util import deprecation
 
 
 __all__ = [
@@ -36,6 +37,14 @@ __all__ = [
 ]
 
 
+@deprecation.deprecated(
+    "2018-10-01",
+    "The TensorFlow Distributions library has moved to "
+    "TensorFlow Probability "
+    "(https://github.com/tensorflow/probability). You "
+    "should update all references to use `tfp.distributions` "
+    "instead of `tf.contrib.distributions`.",
+    warn_once=True)
 def _as_tensor(x, name):
   """Convenience to convert to `Tensor` or leave as `None`."""
   return None if x is None else ops.convert_to_tensor(x, name=name)
@@ -97,6 +106,14 @@ class Affine(bijector.Bijector):
 
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                shift=None,
                scale_identity_multiplier=None,
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/affine_linear_operator.py b/tensorflow/contrib/distributions/python/ops/bijectors/affine_linear_operator.py
index 59f9742d576a7804f401d3a47ba31ae61d6c6e54..91301f15ad87e133777371b346864ecf7b964f27 100644
--- a/tensorflow/contrib/distributions/python/ops/bijectors/affine_linear_operator.py
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/affine_linear_operator.py
@@ -24,6 +24,7 @@ from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.ops.distributions import bijector
 from tensorflow.python.ops.linalg import linear_operator
+from tensorflow.python.util import deprecation
 
 
 __all__ = [
@@ -88,6 +89,14 @@ class AffineLinearOperator(bijector.Bijector):
 
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                shift=None,
                scale=None,
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/affine_scalar.py b/tensorflow/contrib/distributions/python/ops/bijectors/affine_scalar.py
index cd792e2c8cf48602daf9fb5eb56b8c34bac050c7..460d906231bd30f8cec4fe21d42afe7b2a05805e 100644
--- a/tensorflow/contrib/distributions/python/ops/bijectors/affine_scalar.py
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/affine_scalar.py
@@ -25,6 +25,7 @@ from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops.distributions import bijector
+from tensorflow.python.util import deprecation
 
 
 __all__ = [
@@ -52,6 +53,14 @@ class AffineScalar(bijector.Bijector):
 
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                shift=None,
                scale=None,
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/batch_normalization.py b/tensorflow/contrib/distributions/python/ops/bijectors/batch_normalization.py
index 224cec8a63dba53a528490117efac890312fe8d5..f19f147dd645b4f805f1905899b44293284d4225 100644
--- a/tensorflow/contrib/distributions/python/ops/bijectors/batch_normalization.py
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/batch_normalization.py
@@ -27,6 +27,7 @@ from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import nn
 from tensorflow.python.ops.distributions import bijector
+from tensorflow.python.util import deprecation
 
 
 __all__ = [
@@ -34,6 +35,14 @@ __all__ = [
 ]
 
 
+@deprecation.deprecated(
+    "2018-10-01",
+    "The TensorFlow Distributions library has moved to "
+    "TensorFlow Probability "
+    "(https://github.com/tensorflow/probability). You "
+    "should update all references to use `tfp.distributions` "
+    "instead of `tf.contrib.distributions`.",
+    warn_once=True)
 def _undo_batch_normalization(x,
                               mean,
                               variance,
@@ -128,6 +137,14 @@ class BatchNormalization(bijector.Bijector):
        Processing Systems_, 2017. https://arxiv.org/abs/1705.07057
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                batchnorm_layer=None,
                training=True,
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/chain.py b/tensorflow/contrib/distributions/python/ops/bijectors/chain.py
index b158a51bb022b5e2ea3afda74e97b9dc131665a6..910774ea5bb4106a948567144c46c6db23a2c6e0 100644
--- a/tensorflow/contrib/distributions/python/ops/bijectors/chain.py
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/chain.py
@@ -24,6 +24,7 @@ from tensorflow.python.framework import ops
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops.distributions import bijector
+from tensorflow.python.util import deprecation
 
 
 __all__ = [
@@ -31,10 +32,26 @@ __all__ = [
 ]
 
 
+@deprecation.deprecated(
+    "2018-10-01",
+    "The TensorFlow Distributions library has moved to "
+    "TensorFlow Probability "
+    "(https://github.com/tensorflow/probability). You "
+    "should update all references to use `tfp.distributions` "
+    "instead of `tf.contrib.distributions`.",
+    warn_once=True)
 def _use_static_shape(input_tensor, ndims):
   return input_tensor.shape.is_fully_defined() and isinstance(ndims, int)
 
 
+@deprecation.deprecated(
+    "2018-10-01",
+    "The TensorFlow Distributions library has moved to "
+    "TensorFlow Probability "
+    "(https://github.com/tensorflow/probability). You "
+    "should update all references to use `tfp.distributions` "
+    "instead of `tf.contrib.distributions`.",
+    warn_once=True)
 def _compute_min_event_ndims(bijector_list, compute_forward=True):
   """Computes the min_event_ndims associated with the give list of bijectors.
 
@@ -142,6 +159,14 @@ class Chain(bijector.Bijector):
 
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self, bijectors=None, validate_args=False, name=None):
     """Instantiates `Chain` bijector.
 
@@ -234,7 +259,7 @@ class Chain(bijector.Bijector):
     if not self.bijectors:
       return ildj
 
-    event_ndims = self._maybe_get_event_ndims_statically(
+    event_ndims = self._maybe_get_static_event_ndims(
         self.inverse_min_event_ndims)
 
     if _use_static_shape(y, event_ndims):
@@ -248,12 +273,15 @@ class Chain(bijector.Bijector):
 
       if _use_static_shape(y, event_ndims):
         event_shape = b.inverse_event_shape(event_shape)
-        event_ndims = self._maybe_get_event_ndims_statically(
+        event_ndims = self._maybe_get_static_event_ndims(
             event_shape.ndims)
       else:
         event_shape = b.inverse_event_shape_tensor(event_shape)
-        event_ndims = self._maybe_get_event_ndims_statically(
-            array_ops.size(event_shape))
+        event_ndims = array_ops.size(event_shape)
+        event_ndims_ = self._maybe_get_static_event_ndims(event_ndims)
+        if event_ndims_ is not None:
+          event_ndims = event_ndims_
+
       y = b.inverse(y, **kwargs.get(b.name, {}))
     return ildj
 
@@ -270,7 +298,7 @@ class Chain(bijector.Bijector):
     if not self.bijectors:
       return fldj
 
-    event_ndims = self._maybe_get_event_ndims_statically(
+    event_ndims = self._maybe_get_static_event_ndims(
         self.forward_min_event_ndims)
 
     if _use_static_shape(x, event_ndims):
@@ -283,21 +311,14 @@ class Chain(bijector.Bijector):
           x, event_ndims=event_ndims, **kwargs.get(b.name, {}))
       if _use_static_shape(x, event_ndims):
         event_shape = b.forward_event_shape(event_shape)
-        event_ndims = self._maybe_get_event_ndims_statically(event_shape.ndims)
+        event_ndims = self._maybe_get_static_event_ndims(event_shape.ndims)
       else:
         event_shape = b.forward_event_shape_tensor(event_shape)
-        event_ndims = self._maybe_get_event_ndims_statically(
-            array_ops.size(event_shape))
+        event_ndims = array_ops.size(event_shape)
+        event_ndims_ = self._maybe_get_static_event_ndims(event_ndims)
+        if event_ndims_ is not None:
+          event_ndims = event_ndims_
 
       x = b.forward(x, **kwargs.get(b.name, {}))
 
     return fldj
-
-  def _maybe_get_event_ndims_statically(self, event_ndims):
-    event_ndims_ = super(Chain, self)._maybe_get_event_ndims_statically(
-        event_ndims)
-    if event_ndims_ is None:
-      return event_ndims
-    return event_ndims_
-
-
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/cholesky_outer_product.py b/tensorflow/contrib/distributions/python/ops/bijectors/cholesky_outer_product.py
index 268c8d03426d435dc38412ac1bd05c674bd05d2b..8267ee7df89f69f8d610e9507e0cca9f4a5d4323 100644
--- a/tensorflow/contrib/distributions/python/ops/bijectors/cholesky_outer_product.py
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/cholesky_outer_product.py
@@ -27,6 +27,7 @@ from tensorflow.python.ops import linalg_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops.distributions import bijector
 from tensorflow.python.ops.distributions import util as distribution_util
+from tensorflow.python.util import deprecation
 
 
 __all__ = [
@@ -69,6 +70,14 @@ class CholeskyOuterProduct(bijector.Bijector):
 
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self, validate_args=False, name="cholesky_outer_product"):
     """Instantiates the `CholeskyOuterProduct` bijector.
 
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/exp.py b/tensorflow/contrib/distributions/python/ops/bijectors/exp.py
index 9fc1bbf052b419d07a9db149b990c2b80190d72b..07627e1e45eae6b63d830b2adf036bdc3b1d2895 100644
--- a/tensorflow/contrib/distributions/python/ops/bijectors/exp.py
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/exp.py
@@ -19,6 +19,7 @@ from __future__ import division
 from __future__ import print_function
 
 from tensorflow.contrib.distributions.python.ops.bijectors import power_transform
+from tensorflow.python.util import deprecation
 
 
 __all__ = [
@@ -47,6 +48,14 @@ class Exp(power_transform.PowerTransform):
     over the event space.
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                validate_args=False,
                name="exp"):
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/fill_triangular.py b/tensorflow/contrib/distributions/python/ops/bijectors/fill_triangular.py
new file mode 100644
index 0000000000000000000000000000000000000000..31a9ca27e519bc312813668bf621a875838f12a0
--- /dev/null
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/fill_triangular.py
@@ -0,0 +1,165 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""FillTriangular bijector."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import check_ops
+from tensorflow.python.ops import math_ops
+from tensorflow.python.ops.distributions import bijector
+from tensorflow.python.ops.distributions import util as dist_util
+from tensorflow.python.util import deprecation
+
+
+__all__ = [
+    "FillTriangular",
+]
+
+
+class FillTriangular(bijector.Bijector):
+  """Transforms vectors to triangular.
+
+  Triangular matrix elements are filled in a clockwise spiral.
+
+  Given input with shape `batch_shape + [d]`, produces output with
+  shape `batch_shape + [n, n]`, where
+   `n = (-1 + sqrt(1 + 8 * d))/2`.
+  This follows by solving the quadratic equation
+   `d = 1 + 2 + ... + n = n * (n + 1)/2`.
+
+  #### Example
+
+  ```python
+  b = tfb.FillTriangular(upper=False)
+  b.forward([1, 2, 3, 4, 5, 6])
+  # ==> [[4, 0, 0],
+  #      [6, 5, 0],
+  #      [3, 2, 1]]
+
+  b = tfb.FillTriangular(upper=True)
+  b.forward([1, 2, 3, 4, 5, 6])
+  # ==> [[1, 2, 3],
+  #      [0, 5, 6],
+  #      [0, 0, 4]]
+
+  ```
+  """
+
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
+  def __init__(self,
+               upper=False,
+               validate_args=False,
+               name="fill_triangular"):
+    """Instantiates the `FillTriangular` bijector.
+
+    Args:
+      upper: Python `bool` representing whether output matrix should be upper
+        triangular (`True`) or lower triangular (`False`, default).
+      validate_args: Python `bool` indicating whether arguments should be
+        checked for correctness.
+      name: Python `str` name given to ops managed by this object.
+    """
+    self._upper = upper
+    super(FillTriangular, self).__init__(
+        forward_min_event_ndims=1,
+        inverse_min_event_ndims=2,
+        validate_args=validate_args,
+        name=name)
+
+  def _forward(self, x):
+    return dist_util.fill_triangular(x, upper=self._upper)
+
+  def _inverse(self, y):
+    return dist_util.fill_triangular_inverse(y, upper=self._upper)
+
+  def _forward_log_det_jacobian(self, x):
+    return array_ops.zeros_like(x[..., 0])
+
+  def _inverse_log_det_jacobian(self, y):
+    return array_ops.zeros_like(y[..., 0, 0])
+
+  def _forward_event_shape(self, input_shape):
+    batch_shape, d = input_shape[:-1], input_shape[-1].value
+    if d is None:
+      n = None
+    else:
+      n = vector_size_to_square_matrix_size(d, self.validate_args)
+    return batch_shape.concatenate([n, n])
+
+  def _inverse_event_shape(self, output_shape):
+    batch_shape, n1, n2 = (output_shape[:-2],
+                           output_shape[-2].value,
+                           output_shape[-1].value)
+    if n1 is None or n2 is None:
+      m = None
+    elif n1 != n2:
+      raise ValueError("Matrix must be square. (saw [{}, {}])".format(n1, n2))
+    else:
+      m = n1 * (n1 + 1) / 2
+    return batch_shape.concatenate([m])
+
+  def _forward_event_shape_tensor(self, input_shape_tensor):
+    batch_shape, d = input_shape_tensor[:-1], input_shape_tensor[-1]
+    n = vector_size_to_square_matrix_size(d, self.validate_args)
+    return array_ops.concat([batch_shape, [n, n]], axis=0)
+
+  def _inverse_event_shape_tensor(self, output_shape_tensor):
+    batch_shape, n = output_shape_tensor[:-2], output_shape_tensor[-1]
+    if self.validate_args:
+      is_square_matrix = check_ops.assert_equal(
+          n, output_shape_tensor[-2], message="Matrix must be square.")
+      with ops.control_dependencies([is_square_matrix]):
+        n = array_ops.identity(n)
+    d = math_ops.cast(n * (n + 1) / 2, output_shape_tensor.dtype)
+    return array_ops.concat([batch_shape, [d]], axis=0)
+
+
+@deprecation.deprecated(
+    "2018-10-01",
+    "The TensorFlow Distributions library has moved to "
+    "TensorFlow Probability "
+    "(https://github.com/tensorflow/probability). You "
+    "should update all references to use `tfp.distributions` "
+    "instead of `tf.contrib.distributions`.",
+    warn_once=True)
+def vector_size_to_square_matrix_size(d, validate_args, name=None):
+  """Convert a vector size to a matrix size."""
+  if isinstance(d, (float, int, np.generic, np.ndarray)):
+    n = (-1 + np.sqrt(1 + 8 * d)) / 2.
+    if float(int(n)) != n:
+      raise ValueError("Vector length is not a triangular number.")
+    return int(n)
+  else:
+    with ops.name_scope(name, "vector_size_to_square_matrix_size", [d]) as name:
+      n = (-1. + math_ops.sqrt(1 + 8. * math_ops.to_float(d))) / 2.
+      if validate_args:
+        with ops.control_dependencies([check_ops.assert_equal(
+            math_ops.to_float(math_ops.to_int32(n)), n,
+            message="Vector length is not a triangular number")]):
+          n = array_ops.identity(n)
+      return math_ops.cast(n, d.dtype)
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/gumbel.py b/tensorflow/contrib/distributions/python/ops/bijectors/gumbel.py
index e656a258e56e71898ecb719dd2af876f158cf799..71e562a927a30a17d695b81c566f981db7553ad9 100644
--- a/tensorflow/contrib/distributions/python/ops/bijectors/gumbel.py
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/gumbel.py
@@ -24,6 +24,7 @@ from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops.distributions import bijector
+from tensorflow.python.util import deprecation
 
 __all__ = [
     "Gumbel",
@@ -45,6 +46,14 @@ class Gumbel(bijector.Bijector):
   ```
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                loc=0.,
                scale=1.,
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/inline.py b/tensorflow/contrib/distributions/python/ops/bijectors/inline.py
index 2bde956d1345129285acae4684256c5ac828b9a1..1504bd27204f728c0cb519159230e945128c4740 100644
--- a/tensorflow/contrib/distributions/python/ops/bijectors/inline.py
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/inline.py
@@ -19,6 +19,7 @@ from __future__ import division
 from __future__ import print_function
 
 from tensorflow.python.ops.distributions import bijector
+from tensorflow.python.util import deprecation
 
 
 __all__ = [
@@ -43,6 +44,14 @@ class Inline(bijector.Bijector):
   The above example is equivalent to the `Bijector` `Exp()`.
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                forward_fn=None,
                inverse_fn=None,
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/invert.py b/tensorflow/contrib/distributions/python/ops/bijectors/invert.py
index 84a3289ba2160ed22a2bc7030dd612ba9ca6f6df..a648676d4b1956e5c27f67a71e6bd93d0d7fc97d 100644
--- a/tensorflow/contrib/distributions/python/ops/bijectors/invert.py
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/invert.py
@@ -19,6 +19,7 @@ from __future__ import division
 from __future__ import print_function
 
 from tensorflow.python.ops.distributions import bijector
+from tensorflow.python.util import deprecation
 
 __all__ = [
     "Invert",
@@ -40,6 +41,14 @@ class Invert(bijector.Bijector):
 
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self, bijector, validate_args=False, name=None):
     """Creates a `Bijector` which swaps the meaning of `inverse` and `forward`.
 
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/kumaraswamy.py b/tensorflow/contrib/distributions/python/ops/bijectors/kumaraswamy.py
index 97000c17262d3efdef10274711364c2bc2083bd4..33b75a04d34fdd01bc0d854d4e5b9c45a737b122 100644
--- a/tensorflow/contrib/distributions/python/ops/bijectors/kumaraswamy.py
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/kumaraswamy.py
@@ -24,6 +24,7 @@ from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops.distributions import bijector
+from tensorflow.python.util import deprecation
 
 __all__ = [
     "Kumaraswamy",
@@ -44,6 +45,14 @@ class Kumaraswamy(bijector.Bijector):
   ```
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                concentration1=None,
                concentration0=None,
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/masked_autoregressive.py b/tensorflow/contrib/distributions/python/ops/bijectors/masked_autoregressive.py
index 83667b0e80cfcc1c4f0617cdc739221f24439665..b8f2a4b2c731bdaee78692c036fb9f2fba4e3760 100644
--- a/tensorflow/contrib/distributions/python/ops/bijectors/masked_autoregressive.py
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/masked_autoregressive.py
@@ -33,6 +33,7 @@ from tensorflow.python.ops import nn_ops
 from tensorflow.python.ops import template as template_ops
 from tensorflow.python.ops import variable_scope as variable_scope_lib
 from tensorflow.python.ops.distributions import bijector
+from tensorflow.python.util import deprecation
 
 
 __all__ = [
@@ -186,6 +187,14 @@ class MaskedAutoregressiveFlow(bijector.Bijector):
        Processing Systems_, 2017. https://arxiv.org/abs/1705.07057
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                shift_and_log_scale_fn,
                is_constant_jacobian=False,
@@ -296,6 +305,14 @@ MASK_INCLUSIVE = "inclusive"
 MASK_EXCLUSIVE = "exclusive"
 
 
+@deprecation.deprecated(
+    "2018-10-01",
+    "The TensorFlow Distributions library has moved to "
+    "TensorFlow Probability "
+    "(https://github.com/tensorflow/probability). You "
+    "should update all references to use `tfp.distributions` "
+    "instead of `tf.contrib.distributions`.",
+    warn_once=True)
 def _gen_slices(num_blocks, n_in, n_out, mask_type=MASK_EXCLUSIVE):
   """Generate the slices for building an autoregressive mask."""
   # TODO(b/67594795): Better support of dynamic shape.
@@ -313,6 +330,14 @@ def _gen_slices(num_blocks, n_in, n_out, mask_type=MASK_EXCLUSIVE):
   return slices
 
 
+@deprecation.deprecated(
+    "2018-10-01",
+    "The TensorFlow Distributions library has moved to "
+    "TensorFlow Probability "
+    "(https://github.com/tensorflow/probability). You "
+    "should update all references to use `tfp.distributions` "
+    "instead of `tf.contrib.distributions`.",
+    warn_once=True)
 def _gen_mask(num_blocks,
               n_in,
               n_out,
@@ -327,6 +352,14 @@ def _gen_mask(num_blocks,
   return mask
 
 
+@deprecation.deprecated(
+    "2018-10-01",
+    "The TensorFlow Distributions library has moved to "
+    "TensorFlow Probability "
+    "(https://github.com/tensorflow/probability). You "
+    "should update all references to use `tfp.distributions` "
+    "instead of `tf.contrib.distributions`.",
+    warn_once=True)
 def masked_dense(inputs,
                  units,
                  num_blocks=None,
@@ -399,6 +432,14 @@ def masked_dense(inputs,
     return layer.apply(inputs)
 
 
+@deprecation.deprecated(
+    "2018-10-01",
+    "The TensorFlow Distributions library has moved to "
+    "TensorFlow Probability "
+    "(https://github.com/tensorflow/probability). You "
+    "should update all references to use `tfp.distributions` "
+    "instead of `tf.contrib.distributions`.",
+    warn_once=True)
 def masked_autoregressive_default_template(
     hidden_layers,
     shift_only=False,
@@ -515,6 +556,14 @@ def masked_autoregressive_default_template(
         "masked_autoregressive_default_template", _fn)
 
 
+@deprecation.deprecated(
+    "2018-10-01",
+    "The TensorFlow Distributions library has moved to "
+    "TensorFlow Probability "
+    "(https://github.com/tensorflow/probability). You "
+    "should update all references to use `tfp.distributions` "
+    "instead of `tf.contrib.distributions`.",
+    warn_once=True)
 def _clip_by_value_preserve_grad(x, clip_value_min, clip_value_max, name=None):
   """Clips input while leaving gradient unaltered."""
   with ops.name_scope(name, "clip_by_value_preserve_grad",
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/matrix_inverse_tril.py b/tensorflow/contrib/distributions/python/ops/bijectors/matrix_inverse_tril.py
new file mode 100644
index 0000000000000000000000000000000000000000..49e6192f067edec4890dcfa107876a5104c14dd4
--- /dev/null
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/matrix_inverse_tril.py
@@ -0,0 +1,154 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""MatrixInverseTriL bijector."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import check_ops
+from tensorflow.python.ops import linalg_ops
+from tensorflow.python.ops import math_ops
+from tensorflow.python.ops.distributions import bijector
+from tensorflow.python.util import deprecation
+
+
+__all__ = [
+    "MatrixInverseTriL",
+]
+
+
+class MatrixInverseTriL(bijector.Bijector):
+  """Computes `g(L) = inv(L)`, where `L` is a lower-triangular matrix.
+
+  `L` must be nonsingular; equivalently, all diagonal entries of `L` must be
+  nonzero.
+
+  The input must have `rank >= 2`.  The input is treated as a batch of matrices
+  with batch shape `input.shape[:-2]`, where each matrix has dimensions
+  `input.shape[-2]` by `input.shape[-1]` (hence `input.shape[-2]` must equal
+  `input.shape[-1]`).
+
+  #### Examples
+
+  ```python
+  tfd.bijectors.MatrixInverseTriL().forward(x=[[1., 0], [2, 1]])
+  # Result: [[1., 0], [-2, 1]], i.e., inv(x)
+
+  tfd.bijectors.MatrixInverseTriL().inverse(y=[[1., 0], [-2, 1]])
+  # Result: [[1., 0], [2, 1]], i.e., inv(y).
+  ```
+
+  """
+
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
+  def __init__(self, validate_args=False, name="matrix_inverse_tril"):
+    """Instantiates the `MatrixInverseTriL` bijector.
+
+    Args:
+      validate_args: Python `bool` indicating whether arguments should be
+        checked for correctness.
+      name: Python `str` name given to ops managed by this object.
+    """
+    self._graph_parents = []
+    self._name = name
+    super(MatrixInverseTriL, self).__init__(
+        forward_min_event_ndims=2,
+        validate_args=validate_args,
+        name=name)
+
+  def _forward(self, x):
+    with ops.control_dependencies(self._assertions(x)):
+      shape = array_ops.shape(x)
+      return linalg_ops.matrix_triangular_solve(
+          x, linalg_ops.eye(shape[-1], batch_shape=shape[:-2]), lower=True)
+
+  def _inverse(self, y):
+    return self._forward(y)
+
+  def _forward_log_det_jacobian(self, x):
+    # Calculation of the Jacobian:
+    #
+    # Let X = (x_{ij}), 0 <= i,j < n, be a matrix of indeterminates.  Let Z =
+    # X^{-1} where Z = (z_{ij}).  Then
+    #
+    #     dZ/dx_{ij} = (d/dt | t=0) Y(t)^{-1},
+    #
+    # where Y(t) = X + t*E_{ij} and E_{ij} is the matrix with a 1 in the (i,j)
+    # entry and zeros elsewhere.  By the product rule,
+    #
+    #     0 = d/dt [Identity matrix]
+    #       = d/dt [Y Y^{-1}]
+    #       = Y d/dt[Y^{-1}] + dY/dt Y^{-1}
+    #
+    # so
+    #
+    #     d/dt[Y^{-1}] = -Y^{-1} dY/dt Y^{-1}
+    #                  = -Y^{-1} E_{ij} Y^{-1}.
+    #
+    # Evaluating at t=0,
+    #
+    #     dZ/dx_{ij} = -Z E_{ij} Z.
+    #
+    # Taking the (r,s) entry of each side,
+    #
+    #     dz_{rs}/dx_{ij} = -z_{ri}z_{sj}.
+    #
+    # Now, let J be the Jacobian dZ/dX, arranged as the n^2-by-n^2 matrix whose
+    # (r*n + s, i*n + j) entry is dz_{rs}/dx_{ij}.  Considering J as an n-by-n
+    # block matrix with n-by-n blocks, the above expression for dz_{rs}/dx_{ij}
+    # shows that the block at position (r,i) is -z_{ri}Z.  Hence
+    #
+    #          J = -KroneckerProduct(Z, Z),
+    #     det(J) = (-1)^(n^2) (det Z)^(2n)
+    #            = (-1)^n (det X)^(-2n).
+    with ops.control_dependencies(self._assertions(x)):
+      return (-2. * math_ops.cast(array_ops.shape(x)[-1], x.dtype.base_dtype) *
+              math_ops.reduce_sum(
+                  math_ops.log(math_ops.abs(array_ops.matrix_diag_part(x))),
+                  axis=-1))
+
+  def _assertions(self, x):
+    if not self.validate_args:
+      return []
+    shape = array_ops.shape(x)
+    is_matrix = check_ops.assert_rank_at_least(
+        x, 2, message="Input must have rank at least 2.")
+    is_square = check_ops.assert_equal(
+        shape[-2], shape[-1], message="Input must be a square matrix.")
+    above_diagonal = array_ops.matrix_band_part(
+        array_ops.matrix_set_diag(
+            x, array_ops.zeros(shape[:-1], dtype=dtypes.float32)),
+        0, -1)
+    is_lower_triangular = check_ops.assert_equal(
+        above_diagonal, array_ops.zeros_like(above_diagonal),
+        message="Input must be lower triangular.")
+    # A lower triangular matrix is nonsingular iff all its diagonal entries are
+    # nonzero.
+    diag_part = array_ops.matrix_diag_part(x)
+    is_nonsingular = check_ops.assert_none_equal(
+        diag_part, array_ops.zeros_like(diag_part),
+        message="Input must have all diagonal entries nonzero.")
+    return [is_matrix, is_square, is_lower_triangular, is_nonsingular]
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/ordered.py b/tensorflow/contrib/distributions/python/ops/bijectors/ordered.py
index 3f03592f314cc13e8a9ea7e2ae18c5bb1f14e74f..fb393218b6b47764f45b5055bbf15cc17aba219e 100644
--- a/tensorflow/contrib/distributions/python/ops/bijectors/ordered.py
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/ordered.py
@@ -25,6 +25,7 @@ from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops.distributions import bijector
+from tensorflow.python.util import deprecation
 
 
 __all__ = [
@@ -57,6 +58,14 @@ class Ordered(bijector.Bijector):
   ```
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self, validate_args=False, name="ordered"):
     super(Ordered, self).__init__(
         forward_min_event_ndims=1,
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/permute.py b/tensorflow/contrib/distributions/python/ops/bijectors/permute.py
index 12a16a3f2ba3da53077307fd97d3f10d99b2c81f..f182a1adcbb6b11af2376cd271f903d50e50f1a0 100644
--- a/tensorflow/contrib/distributions/python/ops/bijectors/permute.py
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/permute.py
@@ -29,6 +29,7 @@ from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import nn_ops
 from tensorflow.python.ops.distributions import bijector
+from tensorflow.python.util import deprecation
 
 
 __all__ = [
@@ -74,6 +75,14 @@ class Permute(bijector.Bijector):
 
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self, permutation, validate_args=False, name=None):
     """Creates the `Permute` bijector.
 
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/power_transform.py b/tensorflow/contrib/distributions/python/ops/bijectors/power_transform.py
index 71f123f2a998458edaa9c8da07ea2932f62625ca..16264fe728a334db347304500767ce5876f9db7e 100644
--- a/tensorflow/contrib/distributions/python/ops/bijectors/power_transform.py
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/power_transform.py
@@ -24,6 +24,7 @@ from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops.distributions import bijector
+from tensorflow.python.util import deprecation
 
 
 __all__ = [
@@ -41,6 +42,14 @@ class PowerTransform(bijector.Bijector):
   This bijector is equivalent to the `Exp` bijector when `c=0`.
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                power=0.,
                validate_args=False,
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/real_nvp.py b/tensorflow/contrib/distributions/python/ops/bijectors/real_nvp.py
index 66e8a5b9b356867424d1d47efaf848fc6903c371..773ae2446118051a61636bc21de6b81dfacda746 100644
--- a/tensorflow/contrib/distributions/python/ops/bijectors/real_nvp.py
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/real_nvp.py
@@ -26,6 +26,7 @@ from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import nn_ops
 from tensorflow.python.ops import template as template_ops
 from tensorflow.python.ops.distributions import bijector
+from tensorflow.python.util import deprecation
 
 
 __all__ = [
@@ -126,6 +127,14 @@ class RealNVP(bijector.Bijector):
        Processing Systems_, 2017. https://arxiv.org/abs/1705.07057
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                num_masked,
                shift_and_log_scale_fn,
@@ -228,6 +237,14 @@ class RealNVP(bijector.Bijector):
     return math_ops.reduce_sum(log_scale, axis=-1)
 
 
+@deprecation.deprecated(
+    "2018-10-01",
+    "The TensorFlow Distributions library has moved to "
+    "TensorFlow Probability "
+    "(https://github.com/tensorflow/probability). You "
+    "should update all references to use `tfp.distributions` "
+    "instead of `tf.contrib.distributions`.",
+    warn_once=True)
 def real_nvp_default_template(
     hidden_layers,
     shift_only=False,
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/reshape.py b/tensorflow/contrib/distributions/python/ops/bijectors/reshape.py
index 5497c422e4d51e259435692dac722f801e8844ac..c8282229a30fabff0c4c267d0bdfcdbce4f5f3d9 100644
--- a/tensorflow/contrib/distributions/python/ops/bijectors/reshape.py
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/reshape.py
@@ -29,6 +29,7 @@ from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops.distributions import bijector
+from tensorflow.python.util import deprecation
 
 
 __all__ = [
@@ -36,10 +37,26 @@ __all__ = [
 ]
 
 
+@deprecation.deprecated(
+    "2018-10-01",
+    "The TensorFlow Distributions library has moved to "
+    "TensorFlow Probability "
+    "(https://github.com/tensorflow/probability). You "
+    "should update all references to use `tfp.distributions` "
+    "instead of `tf.contrib.distributions`.",
+    warn_once=True)
 def _static_ndims_from_shape(shape):
   return shape.shape.with_rank_at_least(1)[0].value
 
 
+@deprecation.deprecated(
+    "2018-10-01",
+    "The TensorFlow Distributions library has moved to "
+    "TensorFlow Probability "
+    "(https://github.com/tensorflow/probability). You "
+    "should update all references to use `tfp.distributions` "
+    "instead of `tf.contrib.distributions`.",
+    warn_once=True)
 def _ndims_from_shape(shape):
   return array_ops.shape(shape)[0]
 
@@ -86,6 +103,14 @@ class Reshape(bijector.Bijector):
 
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self, event_shape_out, event_shape_in=(-1,),
                validate_args=False, name=None):
     """Creates a `Reshape` bijector.
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/scale_tril.py b/tensorflow/contrib/distributions/python/ops/bijectors/scale_tril.py
new file mode 100644
index 0000000000000000000000000000000000000000..6fbe8665781211ca803feb8bf5a8c04fb0b969e8
--- /dev/null
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/scale_tril.py
@@ -0,0 +1,123 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""ScaleTriL bijector."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.contrib.distributions.python.ops.bijectors import affine_scalar
+from tensorflow.contrib.distributions.python.ops.bijectors import chain
+from tensorflow.contrib.distributions.python.ops.bijectors import fill_triangular
+from tensorflow.contrib.distributions.python.ops.bijectors import softplus
+from tensorflow.contrib.distributions.python.ops.bijectors import transform_diagonal
+from tensorflow.python.util import deprecation
+
+__all__ = [
+    "ScaleTriL",
+]
+
+
+class ScaleTriL(chain.Chain):
+  """Transforms unconstrained vectors to TriL matrices with positive diagonal.
+
+  This is implemented as a simple `tfb.Chain` of `tfb.FillTriangular`
+  followed by `tfb.TransformDiagonal`, and provided mostly as a
+  convenience. The default setup is somewhat opinionated, using a
+  Softplus transformation followed by a small shift (`1e-5`) which
+  attempts to avoid numerical issues from zeros on the diagonal.
+
+  #### Examples
+
+  ```python
+  tfb = tf.contrib.distributions.bijectors
+  b = tfb.ScaleTriL(
+       diag_bijector=tfb.Exp(),
+       diag_shift=None)
+  b.forward(x=[0., 0., 0.])
+  # Result: [[1., 0.],
+  #          [0., 1.]]
+  b.inverse(y=[[1., 0],
+               [.5, 2]])
+  # Result: [log(2), .5, log(1)]
+
+  # Define a distribution over PSD matrices of shape `[3, 3]`,
+  # with `1 + 2 + 3 = 6` degrees of freedom.
+  dist = tfd.TransformedDistribution(
+          tfd.Normal(tf.zeros(6), tf.ones(6)),
+          tfb.Chain([tfb.CholeskyOuterProduct(), tfb.ScaleTriL()]))
+
+  # Using an identity transformation, ScaleTriL is equivalent to
+  # tfb.FillTriangular.
+  b = tfb.ScaleTriL(
+       diag_bijector=tfb.Identity(),
+       diag_shift=None)
+
+  # For greater control over initialization, one can manually encode
+  # pre- and post- shifts inside of `diag_bijector`.
+  b = tfb.ScaleTriL(
+       diag_bijector=tfb.Chain([
+         tfb.AffineScalar(shift=1e-3),
+         tfb.Softplus(),
+         tfb.AffineScalar(shift=0.5413)]),  # softplus_inverse(1.)
+                                            #  = log(expm1(1.)) = 0.5413
+       diag_shift=None)
+  ```
+  """
+
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
+  def __init__(self,
+               diag_bijector=None,
+               diag_shift=1e-5,
+               validate_args=False,
+               name="scale_tril"):
+    """Instantiates the `ScaleTriL` bijector.
+
+    Args:
+      diag_bijector: `Bijector` instance, used to transform the output diagonal
+        to be positive.
+        Default value: `None` (i.e., `tfb.Softplus()`).
+      diag_shift: Float value broadcastable and added to all diagonal entries
+        after applying the `diag_bijector`. Setting a positive
+        value forces the output diagonal entries to be positive, but
+        prevents inverting the transformation for matrices with
+        diagonal entries less than this value.
+        Default value: `1e-5` (i.e., no shift is applied).
+      validate_args: Python `bool` indicating whether arguments should be
+        checked for correctness.
+        Default value: `False` (i.e., arguments are not validated).
+      name: Python `str` name given to ops managed by this object.
+        Default value: `scale_tril`.
+    """
+
+    if diag_bijector is None:
+      diag_bijector = softplus.Softplus(validate_args=validate_args)
+
+    if diag_shift is not None:
+      diag_bijector = chain.Chain([affine_scalar.AffineScalar(shift=diag_shift),
+                                   diag_bijector])
+
+    super(ScaleTriL, self).__init__(
+        [transform_diagonal.TransformDiagonal(diag_bijector=diag_bijector),
+         fill_triangular.FillTriangular()],
+        validate_args=validate_args,
+        name=name)
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/sigmoid.py b/tensorflow/contrib/distributions/python/ops/bijectors/sigmoid.py
index 5df8c886315ff75cdc884e3b9b4665fb64bb109d..194b318fce31a13f84e7b664b58cebb24fc9a264 100644
--- a/tensorflow/contrib/distributions/python/ops/bijectors/sigmoid.py
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/sigmoid.py
@@ -21,6 +21,7 @@ from __future__ import print_function
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import nn_ops
 from tensorflow.python.ops.distributions import bijector
+from tensorflow.python.util import deprecation
 
 
 __all__ = [
@@ -31,6 +32,14 @@ __all__ = [
 class Sigmoid(bijector.Bijector):
   """Bijector which computes `Y = g(X) = 1 / (1 + exp(-X))`."""
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self, validate_args=False, name="sigmoid"):
     super(Sigmoid, self).__init__(
         forward_min_event_ndims=0,
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/sinh_arcsinh.py b/tensorflow/contrib/distributions/python/ops/bijectors/sinh_arcsinh.py
index 2a32e8abcde940b0056b0faf2955ec1b3bd71803..241fba2cb7ec33b7b02c1ca79051f1b826d7d2aa 100644
--- a/tensorflow/contrib/distributions/python/ops/bijectors/sinh_arcsinh.py
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/sinh_arcsinh.py
@@ -26,12 +26,21 @@ from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops.distributions import bijector
+from tensorflow.python.util import deprecation
 
 __all__ = [
     "SinhArcsinh",
 ]
 
 
+@deprecation.deprecated(
+    "2018-10-01",
+    "The TensorFlow Distributions library has moved to "
+    "TensorFlow Probability "
+    "(https://github.com/tensorflow/probability). You "
+    "should update all references to use `tfp.distributions` "
+    "instead of `tf.contrib.distributions`.",
+    warn_once=True)
 def _sqrtx2p1(x):
   """Implementation of `sqrt(1 + x**2)` which is stable despite large `x`."""
   return array_ops.where(
@@ -88,6 +97,14 @@ class SinhArcsinh(bijector.Bijector):
   `Y approx 0.5 X**tailweight e**(sign(X) skewness * tailweight)`.
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                skewness=None,
                tailweight=None,
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/softmax_centered.py b/tensorflow/contrib/distributions/python/ops/bijectors/softmax_centered.py
index f52b91550edff7390d8094a4508d862674e85d59..20ee0d340833d5c5275e2ab52a89dcdf7198add1 100644
--- a/tensorflow/contrib/distributions/python/ops/bijectors/softmax_centered.py
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/softmax_centered.py
@@ -26,6 +26,7 @@ from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import nn_ops
 from tensorflow.python.ops.distributions import bijector
+from tensorflow.python.util import deprecation
 
 
 __all__ = [
@@ -60,6 +61,14 @@ class SoftmaxCentered(bijector.Bijector):
   makes the (forward) image non-open and the theorem does not directly apply.
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                validate_args=False,
                name="softmax_centered"):
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/softplus.py b/tensorflow/contrib/distributions/python/ops/bijectors/softplus.py
index 96a938c803418ff818f9c531754b47ba1eb8667a..3df84ef8b04c2c8f6be91ecd1c972ad1484b4285 100644
--- a/tensorflow/contrib/distributions/python/ops/bijectors/softplus.py
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/softplus.py
@@ -25,6 +25,7 @@ from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import nn_ops
 from tensorflow.python.ops.distributions import bijector
 from tensorflow.python.ops.distributions import util as distribution_util
+from tensorflow.python.util import deprecation
 
 
 __all__ = [
@@ -80,6 +81,14 @@ class Softplus(bijector.Bijector):
           "hinge_softness": (
               "Nonzero floating point `Tensor`.  Controls the softness of what "
               "would otherwise be a kink at the origin.  Default is 1.0")})
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                hinge_softness=None,
                validate_args=False,
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/softsign.py b/tensorflow/contrib/distributions/python/ops/bijectors/softsign.py
index b4a658c171b8313358754228aabbfa4bf93fd84d..f96a4bb01de59a21107b9e7c14f929e13e358ac9 100644
--- a/tensorflow/contrib/distributions/python/ops/bijectors/softsign.py
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/softsign.py
@@ -22,6 +22,7 @@ from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops.distributions import bijector
+from tensorflow.python.util import deprecation
 
 
 __all__ = [
@@ -51,6 +52,14 @@ class Softsign(bijector.Bijector):
   ```
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self, validate_args=False, name="softsign"):
     super(Softsign, self).__init__(
         forward_min_event_ndims=0,
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/square.py b/tensorflow/contrib/distributions/python/ops/bijectors/square.py
index 2ccfdc95970e387e708603e2614ad29fb6a18db3..294460a80f6209797831ea361e64efe677f71e59 100644
--- a/tensorflow/contrib/distributions/python/ops/bijectors/square.py
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/square.py
@@ -24,6 +24,7 @@ from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops.distributions import bijector
+from tensorflow.python.util import deprecation
 
 
 __all__ = [
@@ -49,6 +50,14 @@ class Square(bijector.Bijector):
 
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self, validate_args=False, name="square"):
     """Instantiates the `Square` bijector.
 
@@ -81,4 +90,3 @@ class Square(bijector.Bijector):
     is_valid = check_ops.assert_non_negative(
         t, message="All elements must be non-negative.")
     return control_flow_ops.with_dependencies([is_valid], t)
-
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/transform_diagonal.py b/tensorflow/contrib/distributions/python/ops/bijectors/transform_diagonal.py
new file mode 100644
index 0000000000000000000000000000000000000000..9b7a3b026b8dcc31bed49c489d77b9c184f463cb
--- /dev/null
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/transform_diagonal.py
@@ -0,0 +1,111 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""TransformDiagonal bijector."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops.distributions import bijector
+from tensorflow.python.util import deprecation
+
+__all__ = [
+    "TransformDiagonal",
+]
+
+
+class TransformDiagonal(bijector.Bijector):
+  """Applies a Bijector to the diagonal of a matrix.
+
+  #### Example
+
+  ```python
+  b = tfb.TransformDiagonal(diag_bijector=tfb.Exp())
+
+  b.forward([[1., 0.],
+             [0., 1.]])
+  # ==> [[2.718, 0.],
+         [0., 2.718]]
+  ```
+
+  """
+
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
+  def __init__(self,
+               diag_bijector,
+               validate_args=False,
+               name="transform_diagonal"):
+    """Instantiates the `TransformDiagonal` bijector.
+
+    Args:
+      diag_bijector: `Bijector` instance used to transform the diagonal.
+      validate_args: Python `bool` indicating whether arguments should be
+        checked for correctness.
+      name: Python `str` name given to ops managed by this object.
+    """
+    self._diag_bijector = diag_bijector
+    super(TransformDiagonal, self).__init__(
+        forward_min_event_ndims=2,
+        inverse_min_event_ndims=2,
+        validate_args=validate_args,
+        name=name)
+
+  def _forward(self, x):
+    diag = self._diag_bijector.forward(array_ops.matrix_diag_part(x))
+    return array_ops.matrix_set_diag(x, diag)
+
+  def _inverse(self, y):
+    diag = self._diag_bijector.inverse(array_ops.matrix_diag_part(y))
+    return array_ops.matrix_set_diag(y, diag)
+
+  def _forward_log_det_jacobian(self, x):
+    # We formulate the Jacobian with respect to the flattened matrices
+    # `vec(x)` and `vec(y)`. Suppose for notational convenience that
+    # the first `n` entries of `vec(x)` are the diagonal of `x`, and
+    # the remaining `n**2-n` entries are the off-diagonals in
+    # arbitrary order. Then the Jacobian is a block-diagonal matrix,
+    # with the Jacobian of the diagonal bijector in the first block,
+    # and the identity Jacobian for the remaining entries (since this
+    # bijector acts as the identity on non-diagonal entries):
+    #
+    # J_vec(x) (vec(y)) =
+    # -------------------------------
+    # | J_diag(x) (diag(y))      0  | n entries
+    # |                             |
+    # | 0                        I  | n**2-n entries
+    # -------------------------------
+    #   n                     n**2-n
+    #
+    # Since the log-det of the second (identity) block is zero, the
+    # overall log-det-jacobian is just the log-det of first block,
+    # from the diagonal bijector.
+    #
+    # Note that for elementwise operations (exp, softplus, etc) the
+    # first block of the Jacobian will itself be a diagonal matrix,
+    # but our implementation does not require this to be true.
+    return self._diag_bijector.forward_log_det_jacobian(
+        array_ops.matrix_diag_part(x), event_ndims=1)
+
+  def _inverse_log_det_jacobian(self, y):
+    return self._diag_bijector.inverse_log_det_jacobian(
+        array_ops.matrix_diag_part(y), event_ndims=1)
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/weibull.py b/tensorflow/contrib/distributions/python/ops/bijectors/weibull.py
index a22560fe80298b762795e7b0e7aea2db55823065..8903a70d98ae144731b12047e5074d0450b59378 100644
--- a/tensorflow/contrib/distributions/python/ops/bijectors/weibull.py
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/weibull.py
@@ -24,6 +24,7 @@ from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops.distributions import bijector
+from tensorflow.python.util import deprecation
 
 
 __all__ = [
@@ -47,6 +48,14 @@ class Weibull(bijector.Bijector):
   ```
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                scale=1.,
                concentration=1.,
diff --git a/tensorflow/contrib/distributions/python/ops/binomial.py b/tensorflow/contrib/distributions/python/ops/binomial.py
index 24b26bf124c78c8320b9a6bc3b900e6c7a93f5e4..b349e5966dd750fdf96c0b211dce02658c9400b7 100644
--- a/tensorflow/contrib/distributions/python/ops/binomial.py
+++ b/tensorflow/contrib/distributions/python/ops/binomial.py
@@ -27,6 +27,7 @@ from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops.distributions import distribution
 from tensorflow.python.ops.distributions import util as distribution_util
+from tensorflow.python.util import deprecation
 
 
 _binomial_sample_note = """
@@ -42,6 +43,14 @@ to integer values.
 """
 
 
+@deprecation.deprecated(
+    "2018-10-01",
+    "The TensorFlow Distributions library has moved to "
+    "TensorFlow Probability "
+    "(https://github.com/tensorflow/probability). You "
+    "should update all references to use `tfp.distributions` "
+    "instead of `tf.contrib.distributions`.",
+    warn_once=True)
 def _bdtr(k, n, p):
   """The binomial cumulative distribution function.
 
@@ -130,6 +139,14 @@ class Binomial(distribution.Distribution):
   ```
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                total_count,
                logits=None,
@@ -163,7 +180,7 @@ class Binomial(distribution.Distribution):
         more of the statistic's batch members are undefined.
       name: Python `str` name prefixed to Ops created by this class.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     with ops.name_scope(name, values=[total_count, logits, probs]) as name:
       self._total_count = self._maybe_assert_valid_total_count(
           ops.convert_to_tensor(total_count, name="total_count"),
diff --git a/tensorflow/contrib/distributions/python/ops/cauchy.py b/tensorflow/contrib/distributions/python/ops/cauchy.py
index f5ffdd873124d6626dca26f603592bd0b030d7b3..cb5223b0557080e10bf24c3e1cb432f15fd5e7e3 100644
--- a/tensorflow/contrib/distributions/python/ops/cauchy.py
+++ b/tensorflow/contrib/distributions/python/ops/cauchy.py
@@ -29,7 +29,7 @@ from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import random_ops
 from tensorflow.python.ops.distributions import distribution
-from tensorflow.python.ops.distributions import util as distribution_util
+from tensorflow.python.util import deprecation
 
 __all__ = [
     "Cauchy",
@@ -93,6 +93,14 @@ class Cauchy(distribution.Distribution):
 
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                loc,
                scale,
@@ -121,7 +129,7 @@ class Cauchy(distribution.Distribution):
     Raises:
       TypeError: if `loc` and `scale` have different `dtype`.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     with ops.name_scope(name, values=[loc, scale]) as name:
       with ops.control_dependencies([check_ops.assert_positive(scale)]
                                     if validate_args else []):
diff --git a/tensorflow/contrib/distributions/python/ops/chi2.py b/tensorflow/contrib/distributions/python/ops/chi2.py
index 08cdc1582892cc7d308bd60f082dde082704f57f..e9a7b39070f3d76693ad54852ed0847a0980d2a6 100644
--- a/tensorflow/contrib/distributions/python/ops/chi2.py
+++ b/tensorflow/contrib/distributions/python/ops/chi2.py
@@ -25,7 +25,7 @@ from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops.distributions import gamma
-from tensorflow.python.ops.distributions import util as distribution_util
+from tensorflow.python.util import deprecation
 
 
 __all__ = [
@@ -64,6 +64,14 @@ class Chi2(gamma.Gamma):
 
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                df,
                validate_args=False,
@@ -84,7 +92,7 @@ class Chi2(gamma.Gamma):
         more of the statistic's batch members are undefined.
       name: Python `str` name prefixed to Ops created by this class.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     # Even though all stats of chi2 are defined for valid parameters, this is
     # not true in the parent class "gamma."  therefore, passing
     # allow_nan_stats=True
@@ -115,12 +123,20 @@ class Chi2(gamma.Gamma):
 class Chi2WithAbsDf(Chi2):
   """Chi2 with parameter transform `df = floor(abs(df))`."""
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                df,
                validate_args=False,
                allow_nan_stats=True,
                name="Chi2WithAbsDf"):
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     with ops.name_scope(name, values=[df]) as name:
       super(Chi2WithAbsDf, self).__init__(
           df=math_ops.floor(
diff --git a/tensorflow/contrib/distributions/python/ops/conditional_transformed_distribution.py b/tensorflow/contrib/distributions/python/ops/conditional_transformed_distribution.py
index 10b45361358b40a3c8fd725f27ad84ef9b8a37f5..3598c8d23ea9007fb359ae4931738fb61ede4ccc 100644
--- a/tensorflow/contrib/distributions/python/ops/conditional_transformed_distribution.py
+++ b/tensorflow/contrib/distributions/python/ops/conditional_transformed_distribution.py
@@ -20,7 +20,6 @@ from __future__ import print_function
 from tensorflow.contrib.distributions.python.ops import conditional_distribution
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
-from tensorflow.python.framework import tensor_util
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops.distributions import transformed_distribution
@@ -106,7 +105,7 @@ class ConditionalTransformedDistribution(
     bijector_kwargs = bijector_kwargs or {}
     distribution_kwargs = distribution_kwargs or {}
     x = self.bijector.inverse(y, **bijector_kwargs)
-    event_ndims = self._maybe_get_event_ndims_statically()
+    event_ndims = self._maybe_get_static_event_ndims()
     ildj = self.bijector.inverse_log_det_jacobian(
         y, event_ndims=event_ndims, **bijector_kwargs)
     if self.bijector._is_injective:  # pylint: disable=protected-access
@@ -131,7 +130,7 @@ class ConditionalTransformedDistribution(
     bijector_kwargs = bijector_kwargs or {}
     distribution_kwargs = distribution_kwargs or {}
     x = self.bijector.inverse(y, **bijector_kwargs)
-    event_ndims = self._maybe_get_event_ndims_statically()
+    event_ndims = self._maybe_get_static_event_ndims()
     ildj = self.bijector.inverse_log_det_jacobian(
         y, event_ndims=event_ndims, **bijector_kwargs)
     if self.bijector._is_injective:  # pylint: disable=protected-access
@@ -220,14 +219,14 @@ class ConditionalTransformedDistribution(
     inv_cdf = self.distribution.quantile(value, **distribution_kwargs)
     return self.bijector.forward(inv_cdf, **bijector_kwargs)
 
-  def _maybe_get_event_ndims_statically(self):
+  def _maybe_get_static_event_ndims(self):
     if self.event_shape.ndims is not None:
       return self.event_shape.ndims
 
     event_ndims = array_ops.size(self.event_shape_tensor())
-    static_event_ndims = tensor_util.constant_value(event_ndims)
+    event_ndims_ = distribution_util.maybe_get_static_value(event_ndims)
 
-    if static_event_ndims is not None:
-      return static_event_ndims
+    if event_ndims_ is not None:
+      return event_ndims_
 
     return event_ndims
diff --git a/tensorflow/contrib/distributions/python/ops/deterministic.py b/tensorflow/contrib/distributions/python/ops/deterministic.py
index 6d7d6d307bd0f815344c8a0e347f45ae11ba6462..ad853ee293f86565c1af601214522f53d936b70a 100644
--- a/tensorflow/contrib/distributions/python/ops/deterministic.py
+++ b/tensorflow/contrib/distributions/python/ops/deterministic.py
@@ -32,7 +32,7 @@ from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops.distributions import distribution
-from tensorflow.python.ops.distributions import util as distribution_util
+from tensorflow.python.util import deprecation
 
 __all__ = [
     "Deterministic",
@@ -44,6 +44,14 @@ __all__ = [
 class _BaseDeterministic(distribution.Distribution):
   """Base class for Deterministic distributions."""
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                loc,
                atol=None,
@@ -87,7 +95,7 @@ class _BaseDeterministic(distribution.Distribution):
     Raises:
       ValueError:  If `loc` is a scalar.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     with ops.name_scope(name, values=[loc, atol, rtol]) as name:
       loc = ops.convert_to_tensor(loc, name="loc")
       if is_vector and validate_args:
@@ -204,6 +212,14 @@ class Deterministic(_BaseDeterministic):
 
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                loc,
                atol=None,
@@ -309,6 +325,14 @@ class VectorDeterministic(_BaseDeterministic):
 
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                loc,
                atol=None,
diff --git a/tensorflow/contrib/distributions/python/ops/distribution_util.py b/tensorflow/contrib/distributions/python/ops/distribution_util.py
index 289e1d50e1146a641c0cc433ece3465aed73b1c2..6959b3e8775d2dd488b4ee3252d143ef376d58f9 100644
--- a/tensorflow/contrib/distributions/python/ops/distribution_util.py
+++ b/tensorflow/contrib/distributions/python/ops/distribution_util.py
@@ -21,12 +21,19 @@ from __future__ import print_function
 from tensorflow.contrib import linalg
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
+from tensorflow.python.framework import smart_cond
 from tensorflow.python.framework import tensor_util
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops.distributions import distribution as distribution_lib
+
+# The following two lines are redundant, in a sense. The first enables
+# good coding practice  *within* this file (`util.prefer_static_value`
+# rather than  `prefer_static_value`). The  second ensures  that users
+# also get the core utils when they import this file.
+from tensorflow.python.ops.distributions import util
 from tensorflow.python.ops.distributions.util import *  # pylint: disable=wildcard-import
 
 
@@ -484,3 +491,75 @@ def pad_mixture_dimensions(x, mixture_distribution, categorical_distribution,
 def static_value(x):
   """Returns the static value of a `Tensor` or `None`."""
   return tensor_util.constant_value(ops.convert_to_tensor(x))
+
+
+def move_dimension(x, source_idx, dest_idx):
+  """Move a single tensor dimension within its shape.
+
+  This is a special case of `tf.transpose()`, which applies
+  arbitrary permutations to tensor dimensions.
+
+  Args:
+    x: Tensor of rank `ndims`.
+    source_idx: Integer index into `x.shape` (negative indexing is
+      supported).
+    dest_idx: Integer index into `x.shape` (negative indexing is
+      supported).
+
+  Returns:
+    x_perm: Tensor of rank `ndims`, in which the dimension at original
+     index `source_idx` has been moved to new index `dest_idx`, with
+     all other dimensions retained in their original order.
+
+  Example:
+
+  ```python
+  x = tf.placeholder(shape=[200, 30, 4, 1, 6])
+  x_perm = _move_dimension(x, 1, 1) # no-op
+  x_perm = _move_dimension(x, 0, 3) # result shape [30, 4, 1, 200, 6]
+  x_perm = _move_dimension(x, 0, -2) # equivalent to previous
+  x_perm = _move_dimension(x, 4, 2) # result shape [200, 30, 6, 4, 1]
+  ```
+  """
+  ndims = util.prefer_static_rank(x)
+  if isinstance(source_idx, int):
+    dtype = dtypes.int32
+  else:
+    dtype = dtypes.as_dtype(source_idx.dtype)
+
+  # Handle negative indexing. Since ndims might be dynamic, this makes
+  # source_idx and dest_idx also possibly dynamic.
+  if source_idx < 0:
+    source_idx = ndims + source_idx
+  if dest_idx < 0:
+    dest_idx = ndims + dest_idx
+
+  # Construct the appropriate permutation of dimensions, depending
+  # whether the source is before or after the destination.
+  def move_left_permutation():
+    return util.prefer_static_value(
+        array_ops.concat([
+            math_ops.range(0, dest_idx, dtype=dtype),
+            [source_idx],
+            math_ops.range(dest_idx, source_idx, dtype=dtype),
+            math_ops.range(source_idx+1, ndims, dtype=dtype)], axis=0))
+
+  def move_right_permutation():
+    return util.prefer_static_value(
+        array_ops.concat([
+            math_ops.range(0, source_idx, dtype=dtype),
+            math_ops.range(source_idx+1, dest_idx+1, dtype=dtype),
+            [source_idx],
+            math_ops.range(dest_idx+1, ndims, dtype=dtype)], axis=0))
+
+  def x_permuted():
+    return array_ops.transpose(
+        x, perm=smart_cond.smart_cond(source_idx < dest_idx,
+                                      move_right_permutation,
+                                      move_left_permutation))
+
+  # One final conditional to handle the special case where source
+  # and destination indices are equal.
+  return smart_cond.smart_cond(math_ops.equal(source_idx, dest_idx),
+                               lambda: x,
+                               x_permuted)
diff --git a/tensorflow/contrib/distributions/python/ops/estimator.py b/tensorflow/contrib/distributions/python/ops/estimator.py
index 98edd337fe02ffbf53c6ecd9ebda9424231ea2fe..bdec6527d5378d6e86aa8e6279cc6ee672083e56 100644
--- a/tensorflow/contrib/distributions/python/ops/estimator.py
+++ b/tensorflow/contrib/distributions/python/ops/estimator.py
@@ -23,6 +23,7 @@ from tensorflow.contrib.learn.python.learn.estimators.head import _RegressionHea
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import tensor_util
 from tensorflow.python.ops import array_ops
+from tensorflow.python.util import deprecation
 
 
 __all__ = [
@@ -30,6 +31,14 @@ __all__ = [
 ]
 
 
+@deprecation.deprecated(
+    "2018-10-01",
+    "The TensorFlow Distributions library has moved to "
+    "TensorFlow Probability "
+    "(https://github.com/tensorflow/probability). You "
+    "should update all references to use `tfp.distributions` "
+    "instead of `tf.contrib.distributions`.",
+    warn_once=True)
 def estimator_head_distribution_regression(make_distribution_fn,
                                            label_dimension=1,
                                            logits_dimension=None,
@@ -77,6 +86,14 @@ def estimator_head_distribution_regression(make_distribution_fn,
 class _DistributionRegressionHead(_RegressionHead):
   """Creates a _RegressionHead instance from an arbitrary `Distribution`."""
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                make_distribution_fn,
                label_dimension,
diff --git a/tensorflow/contrib/distributions/python/ops/geometric.py b/tensorflow/contrib/distributions/python/ops/geometric.py
index 446cff6ec242f25178fed0c6a424791fa9f176ad..d62f024aa2a081f0ec231015af1f26a8851518e9 100644
--- a/tensorflow/contrib/distributions/python/ops/geometric.py
+++ b/tensorflow/contrib/distributions/python/ops/geometric.py
@@ -31,6 +31,7 @@ from tensorflow.python.ops import nn
 from tensorflow.python.ops import random_ops
 from tensorflow.python.ops.distributions import distribution
 from tensorflow.python.ops.distributions import util as distribution_util
+from tensorflow.python.util import deprecation
 
 
 class Geometric(distribution.Distribution):
@@ -55,6 +56,14 @@ class Geometric(distribution.Distribution):
 
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                logits=None,
                probs=None,
@@ -85,7 +94,7 @@ class Geometric(distribution.Distribution):
       name: Python `str` name prefixed to Ops created by this class.
     """
 
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     with ops.name_scope(name, values=[logits, probs]) as name:
       self._logits, self._probs = distribution_util.get_logits_and_probs(
           logits, probs, validate_args=validate_args, name=name)
diff --git a/tensorflow/contrib/distributions/python/ops/gumbel.py b/tensorflow/contrib/distributions/python/ops/gumbel.py
index ed9ea6f4f3ffe18fb6bf1e0a7d57728d010e0f01..acdea4d61d3ada7e9f4f0aa7bc58c5643db2802b 100644
--- a/tensorflow/contrib/distributions/python/ops/gumbel.py
+++ b/tensorflow/contrib/distributions/python/ops/gumbel.py
@@ -29,7 +29,7 @@ from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import random_ops
 from tensorflow.python.ops.distributions import distribution
-from tensorflow.python.ops.distributions import util as distribution_util
+from tensorflow.python.util import deprecation
 
 
 class _Gumbel(distribution.Distribution):
@@ -97,6 +97,14 @@ class _Gumbel(distribution.Distribution):
 
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                loc,
                scale,
@@ -125,7 +133,7 @@ class _Gumbel(distribution.Distribution):
     Raises:
       TypeError: if loc and scale are different dtypes.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     with ops.name_scope(name, values=[loc, scale]) as name:
       with ops.control_dependencies([check_ops.assert_positive(scale)] if
                                     validate_args else []):
diff --git a/tensorflow/contrib/distributions/python/ops/half_normal.py b/tensorflow/contrib/distributions/python/ops/half_normal.py
index 7e12767f6d8f6c61565ecf266d3b222de68c0e40..b02c4031069191592b8acc1a90313450f98af6d7 100644
--- a/tensorflow/contrib/distributions/python/ops/half_normal.py
+++ b/tensorflow/contrib/distributions/python/ops/half_normal.py
@@ -31,7 +31,7 @@ from tensorflow.python.ops import nn
 from tensorflow.python.ops import random_ops
 from tensorflow.python.ops.distributions import distribution
 from tensorflow.python.ops.distributions import special_math
-from tensorflow.python.ops.distributions import util as distribution_util
+from tensorflow.python.util import deprecation
 
 
 __all__ = [
@@ -86,6 +86,14 @@ class HalfNormal(distribution.Distribution):
 
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                scale,
                validate_args=False,
@@ -106,7 +114,7 @@ class HalfNormal(distribution.Distribution):
         if one or more of the statistic's batch members are undefined.
       name: Python `str` name prefixed to Ops created by this class.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     with ops.name_scope(name, values=[scale]) as name:
       with ops.control_dependencies([check_ops.assert_positive(scale)] if
                                     validate_args else []):
diff --git a/tensorflow/contrib/distributions/python/ops/independent.py b/tensorflow/contrib/distributions/python/ops/independent.py
index fa89fff3b7b2f8266a44c446a0c9807790b3aed8..0672702b96c1eb81c176774554df3f5922a0319e 100644
--- a/tensorflow/contrib/distributions/python/ops/independent.py
+++ b/tensorflow/contrib/distributions/python/ops/independent.py
@@ -29,7 +29,7 @@ from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops.distributions import distribution as distribution_lib
 from tensorflow.python.ops.distributions import kullback_leibler
-from tensorflow.python.ops.distributions import util as distribution_util
+from tensorflow.python.util import deprecation
 
 
 class Independent(distribution_lib.Distribution):
@@ -95,6 +95,14 @@ class Independent(distribution_lib.Distribution):
 
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(
       self, distribution, reinterpreted_batch_ndims=None,
       validate_args=False, name=None):
@@ -117,7 +125,7 @@ class Independent(distribution_lib.Distribution):
       ValueError: if `reinterpreted_batch_ndims` exceeds
         `distribution.batch_ndims`
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     name = name or "Independent" + distribution.name
     self._distribution = distribution
     with ops.name_scope(name) as name:
@@ -259,6 +267,14 @@ class Independent(distribution_lib.Distribution):
 
 
 @kullback_leibler.RegisterKL(Independent, Independent)
+@deprecation.deprecated(
+    "2018-10-01",
+    "The TensorFlow Distributions library has moved to "
+    "TensorFlow Probability "
+    "(https://github.com/tensorflow/probability). You "
+    "should update all references to use `tfp.distributions` "
+    "instead of `tf.contrib.distributions`.",
+    warn_once=True)
 def _kl_independent(a, b, name="kl_independent"):
   """Batched KL divergence `KL(a || b)` for Independent distributions.
 
diff --git a/tensorflow/contrib/distributions/python/ops/inverse_gamma.py b/tensorflow/contrib/distributions/python/ops/inverse_gamma.py
index 85e8e10466038e5e55ef4b754f82c0c2c2543b6d..70d050d7a647b38928ddb1c788db0e6957ac0f03 100644
--- a/tensorflow/contrib/distributions/python/ops/inverse_gamma.py
+++ b/tensorflow/contrib/distributions/python/ops/inverse_gamma.py
@@ -32,6 +32,7 @@ from tensorflow.python.ops import nn
 from tensorflow.python.ops import random_ops
 from tensorflow.python.ops.distributions import distribution
 from tensorflow.python.ops.distributions import util as distribution_util
+from tensorflow.python.util import deprecation
 
 
 __all__ = [
@@ -95,6 +96,14 @@ class InverseGamma(distribution.Distribution):
 
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                concentration,
                rate,
@@ -125,7 +134,7 @@ class InverseGamma(distribution.Distribution):
     Raises:
       TypeError: if `concentration` and `rate` are different dtypes.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     with ops.name_scope(name, values=[concentration, rate]) as name:
       with ops.control_dependencies([
           check_ops.assert_positive(concentration),
@@ -274,13 +283,21 @@ class InverseGamma(distribution.Distribution):
 class InverseGammaWithSoftplusConcentrationRate(InverseGamma):
   """`InverseGamma` with softplus of `concentration` and `rate`."""
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                concentration,
                rate,
                validate_args=False,
                allow_nan_stats=True,
                name="InverseGammaWithSoftplusConcentrationRate"):
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     with ops.name_scope(name, values=[concentration, rate]) as name:
       super(InverseGammaWithSoftplusConcentrationRate, self).__init__(
           concentration=nn.softplus(concentration,
diff --git a/tensorflow/contrib/distributions/python/ops/kumaraswamy.py b/tensorflow/contrib/distributions/python/ops/kumaraswamy.py
index 66682b2ff5493f8565410138e770b45ffc6b5d77..e3712dd84e36609d6bba4a5a39866046c0c8d1d8 100644
--- a/tensorflow/contrib/distributions/python/ops/kumaraswamy.py
+++ b/tensorflow/contrib/distributions/python/ops/kumaraswamy.py
@@ -31,7 +31,7 @@ from tensorflow.python.ops import special_math_ops
 from tensorflow.python.ops.distributions import distribution
 from tensorflow.python.ops.distributions import transformed_distribution
 from tensorflow.python.ops.distributions import uniform
-from tensorflow.python.util.tf_export import tf_export
+from tensorflow.python.util import deprecation
 
 __all__ = [
     "Kumaraswamy",
@@ -41,6 +41,14 @@ _kumaraswamy_sample_note = """Note: `x` must have dtype `self.dtype` and be in
 `[0, 1].` It must have a shape compatible with `self.batch_shape()`."""
 
 
+@deprecation.deprecated(
+    "2018-10-01",
+    "The TensorFlow Distributions library has moved to "
+    "TensorFlow Probability "
+    "(https://github.com/tensorflow/probability). You "
+    "should update all references to use `tfp.distributions` "
+    "instead of `tf.contrib.distributions`.",
+    warn_once=True)
 def _harmonic_number(x):
   """Compute the harmonic number from its analytic continuation.
 
@@ -59,7 +67,6 @@ def _harmonic_number(x):
   return math_ops.digamma(x + one) - math_ops.digamma(one)
 
 
-@tf_export("distributions.Kumaraswamy")
 class Kumaraswamy(transformed_distribution.TransformedDistribution):
   """Kumaraswamy distribution.
 
@@ -125,6 +132,14 @@ class Kumaraswamy(transformed_distribution.TransformedDistribution):
 
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                concentration1=None,
                concentration0=None,
diff --git a/tensorflow/contrib/distributions/python/ops/logistic.py b/tensorflow/contrib/distributions/python/ops/logistic.py
index 0103283259b0526b5a108ea1836f95709eedc067..02e3bad51ee48188acf83cb09359861c9e6932c7 100644
--- a/tensorflow/contrib/distributions/python/ops/logistic.py
+++ b/tensorflow/contrib/distributions/python/ops/logistic.py
@@ -31,7 +31,7 @@ from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import nn_ops
 from tensorflow.python.ops import random_ops
 from tensorflow.python.ops.distributions import distribution
-from tensorflow.python.ops.distributions import util as distribution_util
+from tensorflow.python.util import deprecation
 
 
 class Logistic(distribution.Distribution):
@@ -92,6 +92,14 @@ class Logistic(distribution.Distribution):
 
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                loc,
                scale,
@@ -120,7 +128,7 @@ class Logistic(distribution.Distribution):
     Raises:
       TypeError: if loc and scale are different dtypes.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     with ops.name_scope(name, values=[loc, scale]) as name:
       with ops.control_dependencies([check_ops.assert_positive(scale)] if
                                     validate_args else []):
diff --git a/tensorflow/contrib/distributions/python/ops/mixture.py b/tensorflow/contrib/distributions/python/ops/mixture.py
index d54f30dc634ab5c8aa82066056266747b63eec21..3b7114ef067c0aaede23fff04c40d1dc6e830f1c 100644
--- a/tensorflow/contrib/distributions/python/ops/mixture.py
+++ b/tensorflow/contrib/distributions/python/ops/mixture.py
@@ -32,6 +32,7 @@ from tensorflow.python.ops import nn_ops
 from tensorflow.python.ops.distributions import categorical
 from tensorflow.python.ops.distributions import distribution
 from tensorflow.python.ops.distributions import util as distribution_util
+from tensorflow.python.util import deprecation
 
 
 class Mixture(distribution.Distribution):
@@ -66,6 +67,14 @@ class Mixture(distribution.Distribution):
 
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                cat,
                components,
@@ -116,7 +125,7 @@ class Mixture(distribution.Distribution):
         matching static batch shapes, or all components do not
         have matching static event shapes.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     if not isinstance(cat, categorical.Categorical):
       raise TypeError("cat must be a Categorical distribution, but saw: %s" %
                       cat)
diff --git a/tensorflow/contrib/distributions/python/ops/mixture_same_family.py b/tensorflow/contrib/distributions/python/ops/mixture_same_family.py
index c7c90cf875484a1753577227bf22de878d00a502..8ffee940d03c9a5204f2ac6f7acd9ea482adae1a 100644
--- a/tensorflow/contrib/distributions/python/ops/mixture_same_family.py
+++ b/tensorflow/contrib/distributions/python/ops/mixture_same_family.py
@@ -28,6 +28,7 @@ from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import nn_ops
 from tensorflow.python.ops.distributions import distribution
 from tensorflow.python.ops.distributions import util as distribution_util
+from tensorflow.python.util import deprecation
 
 
 class MixtureSameFamily(distribution.Distribution):
@@ -95,6 +96,14 @@ class MixtureSameFamily(distribution.Distribution):
 
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                mixture_distribution,
                components_distribution,
@@ -130,7 +139,7 @@ class MixtureSameFamily(distribution.Distribution):
       ValueError: if `mixture_distribution` categories does not equal
         `components_distribution` rightmost batch shape.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     with ops.name_scope(name) as name:
       self._mixture_distribution = mixture_distribution
       self._components_distribution = components_distribution
@@ -321,6 +330,14 @@ class MixtureSameFamily(distribution.Distribution):
       return x
 
 
+@deprecation.deprecated(
+    "2018-10-01",
+    "The TensorFlow Distributions library has moved to "
+    "TensorFlow Probability "
+    "(https://github.com/tensorflow/probability). You "
+    "should update all references to use `tfp.distributions` "
+    "instead of `tf.contrib.distributions`.",
+    warn_once=True)
 def _outer_squared_difference(x, y):
   """Convenience function analogous to tf.squared_difference."""
   z = x - y
diff --git a/tensorflow/contrib/distributions/python/ops/mvn_diag.py b/tensorflow/contrib/distributions/python/ops/mvn_diag.py
index cad398582b9c939e8e96cf498638869ccd3701bd..cd0c282ba6cebf784261a4e821f36ce4eed98fe0 100644
--- a/tensorflow/contrib/distributions/python/ops/mvn_diag.py
+++ b/tensorflow/contrib/distributions/python/ops/mvn_diag.py
@@ -22,6 +22,7 @@ from tensorflow.contrib.distributions.python.ops import distribution_util
 from tensorflow.contrib.distributions.python.ops import mvn_linear_operator as mvn_linop
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import nn
+from tensorflow.python.util import deprecation
 
 
 __all__ = [
@@ -134,6 +135,14 @@ class MultivariateNormalDiag(
 
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                loc=None,
                scale_diag=None,
@@ -193,7 +202,7 @@ class MultivariateNormalDiag(
     Raises:
       ValueError: if at most `scale_identity_multiplier` is specified.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     with ops.name_scope(name) as name:
       with ops.name_scope("init", values=[
           loc, scale_diag, scale_identity_multiplier]):
@@ -218,13 +227,21 @@ class MultivariateNormalDiag(
 class MultivariateNormalDiagWithSoftplusScale(MultivariateNormalDiag):
   """MultivariateNormalDiag with `diag_stddev = softplus(diag_stddev)`."""
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                loc,
                scale_diag,
                validate_args=False,
                allow_nan_stats=True,
                name="MultivariateNormalDiagWithSoftplusScale"):
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     with ops.name_scope(name, values=[scale_diag]) as name:
       super(MultivariateNormalDiagWithSoftplusScale, self).__init__(
           loc=loc,
diff --git a/tensorflow/contrib/distributions/python/ops/mvn_diag_plus_low_rank.py b/tensorflow/contrib/distributions/python/ops/mvn_diag_plus_low_rank.py
index 1c11594df3ad2612dd8746bb8785d86390b69937..d8401801f21afbe8fd042053c6a38a31a2539438 100644
--- a/tensorflow/contrib/distributions/python/ops/mvn_diag_plus_low_rank.py
+++ b/tensorflow/contrib/distributions/python/ops/mvn_diag_plus_low_rank.py
@@ -22,6 +22,7 @@ from tensorflow.contrib import linalg
 from tensorflow.contrib.distributions.python.ops import distribution_util
 from tensorflow.contrib.distributions.python.ops import mvn_linear_operator as mvn_linop
 from tensorflow.python.framework import ops
+from tensorflow.python.util import deprecation
 
 
 __all__ = [
@@ -141,6 +142,14 @@ class MultivariateNormalDiagPlusLowRank(
 
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                loc=None,
                scale_diag=None,
@@ -215,7 +224,7 @@ class MultivariateNormalDiagPlusLowRank(
     Raises:
       ValueError: if at most `scale_identity_multiplier` is specified.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     def _convert_to_tensor(x, name):
       return None if x is None else ops.convert_to_tensor(x, name=name)
     with ops.name_scope(name) as name:
diff --git a/tensorflow/contrib/distributions/python/ops/mvn_full_covariance.py b/tensorflow/contrib/distributions/python/ops/mvn_full_covariance.py
index 47d7d13cf357f1ac657641420602c92eefdad197..dbc4c1b3dc956641f3e38ffafe3a3410bd3e2097 100644
--- a/tensorflow/contrib/distributions/python/ops/mvn_full_covariance.py
+++ b/tensorflow/contrib/distributions/python/ops/mvn_full_covariance.py
@@ -24,7 +24,7 @@ from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import linalg_ops
-from tensorflow.python.ops.distributions import util as distribution_util
+from tensorflow.python.util import deprecation
 
 
 __all__ = [
@@ -113,6 +113,14 @@ class MultivariateNormalFullCovariance(mvn_tril.MultivariateNormalTriL):
 
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                loc=None,
                covariance_matrix=None,
@@ -156,7 +164,7 @@ class MultivariateNormalFullCovariance(mvn_tril.MultivariateNormalTriL):
     Raises:
       ValueError: if neither `loc` nor `covariance_matrix` are specified.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
 
     # Convert the covariance_matrix up to a scale_tril and call MVNTriL.
     with ops.name_scope(name) as name:
diff --git a/tensorflow/contrib/distributions/python/ops/mvn_linear_operator.py b/tensorflow/contrib/distributions/python/ops/mvn_linear_operator.py
index 79916fef8d7b752649dcc673a84ea45ccf460905..efe5a6d0d99ca8fa9e0274049423bb3c4eef2d6f 100644
--- a/tensorflow/contrib/distributions/python/ops/mvn_linear_operator.py
+++ b/tensorflow/contrib/distributions/python/ops/mvn_linear_operator.py
@@ -27,6 +27,7 @@ from tensorflow.python.ops.distributions import kullback_leibler
 from tensorflow.python.ops.distributions import normal
 from tensorflow.python.ops.distributions import transformed_distribution
 from tensorflow.python.ops.linalg import linalg
+from tensorflow.python.util import deprecation
 
 
 __all__ = [
@@ -133,6 +134,14 @@ class MultivariateNormalLinearOperator(
 
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                loc=None,
                scale=None,
@@ -170,7 +179,7 @@ class MultivariateNormalLinearOperator(
       ValueError: if `scale` is unspecified.
       TypeError: if not `scale.dtype.is_floating`
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     if scale is None:
       raise ValueError("Missing required `scale` parameter.")
     if not scale.dtype.is_floating:
@@ -266,6 +275,14 @@ class MultivariateNormalLinearOperator(
 
 @kullback_leibler.RegisterKL(MultivariateNormalLinearOperator,
                              MultivariateNormalLinearOperator)
+@deprecation.deprecated(
+    "2018-10-01",
+    "The TensorFlow Distributions library has moved to "
+    "TensorFlow Probability "
+    "(https://github.com/tensorflow/probability). You "
+    "should update all references to use `tfp.distributions` "
+    "instead of `tf.contrib.distributions`.",
+    warn_once=True)
 def _kl_brute_force(a, b, name=None):
   """Batched KL divergence `KL(a || b)` for multivariate Normals.
 
diff --git a/tensorflow/contrib/distributions/python/ops/mvn_tril.py b/tensorflow/contrib/distributions/python/ops/mvn_tril.py
index d6b0ed994ec0a62e9b7684e7478130052a1fd300..d9110947ecdbba1a63669573f46db17b02e512ab 100644
--- a/tensorflow/contrib/distributions/python/ops/mvn_tril.py
+++ b/tensorflow/contrib/distributions/python/ops/mvn_tril.py
@@ -22,6 +22,7 @@ from tensorflow.contrib import linalg
 from tensorflow.contrib.distributions.python.ops import mvn_linear_operator as mvn_linop
 from tensorflow.python.framework import ops
 from tensorflow.python.ops.distributions import util as distribution_util
+from tensorflow.python.util import deprecation
 
 
 __all__ = [
@@ -134,6 +135,14 @@ class MultivariateNormalTriL(
 
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                loc=None,
                scale_tril=None,
@@ -179,7 +188,7 @@ class MultivariateNormalTriL(
     Raises:
       ValueError: if neither `loc` nor `scale_tril` are specified.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     def _convert_to_tensor(x, name):
       return None if x is None else ops.convert_to_tensor(x, name=name)
     if loc is None and scale_tril is None:
diff --git a/tensorflow/contrib/distributions/python/ops/negative_binomial.py b/tensorflow/contrib/distributions/python/ops/negative_binomial.py
index 1085c56dc86c8d45bdab2e7cecedf44663e5c408..6acfc5746a0cc20e916de81b71f90e08d8d91ad5 100644
--- a/tensorflow/contrib/distributions/python/ops/negative_binomial.py
+++ b/tensorflow/contrib/distributions/python/ops/negative_binomial.py
@@ -27,6 +27,7 @@ from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import random_ops
 from tensorflow.python.ops.distributions import distribution
 from tensorflow.python.ops.distributions import util as distribution_util
+from tensorflow.python.util import deprecation
 
 
 class NegativeBinomial(distribution.Distribution):
@@ -51,6 +52,14 @@ class NegativeBinomial(distribution.Distribution):
   * `n!` is the factorial of `n`.
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                total_count,
                logits=None,
@@ -90,7 +99,7 @@ class NegativeBinomial(distribution.Distribution):
       name: Python `str` name prefixed to Ops created by this class.
     """
 
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     with ops.name_scope(name, values=[total_count, logits, probs]) as name:
       self._logits, self._probs = distribution_util.get_logits_and_probs(
           logits, probs, validate_args=validate_args, name=name)
diff --git a/tensorflow/contrib/distributions/python/ops/onehot_categorical.py b/tensorflow/contrib/distributions/python/ops/onehot_categorical.py
index a4b9f3b78d4fdcc328bac84623114b921b9ded49..0c762f17c9b770ecada57b6ce60a4825ba374dd9 100644
--- a/tensorflow/contrib/distributions/python/ops/onehot_categorical.py
+++ b/tensorflow/contrib/distributions/python/ops/onehot_categorical.py
@@ -29,6 +29,7 @@ from tensorflow.python.ops import random_ops
 from tensorflow.python.ops.distributions import distribution
 from tensorflow.python.ops.distributions import kullback_leibler
 from tensorflow.python.ops.distributions import util as distribution_util
+from tensorflow.python.util import deprecation
 
 
 class OneHotCategorical(distribution.Distribution):
@@ -83,6 +84,14 @@ class OneHotCategorical(distribution.Distribution):
 
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(
       self,
       logits=None,
@@ -115,7 +124,7 @@ class OneHotCategorical(distribution.Distribution):
         more of the statistic's batch members are undefined.
       name: Python `str` name prefixed to Ops created by this class.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     with ops.name_scope(name, values=[logits, probs]) as name:
       self._logits, self._probs = distribution_util.get_logits_and_probs(
           name=name, logits=logits, probs=probs, validate_args=validate_args,
@@ -233,6 +242,14 @@ class OneHotCategorical(distribution.Distribution):
 
 
 @kullback_leibler.RegisterKL(OneHotCategorical, OneHotCategorical)
+@deprecation.deprecated(
+    "2018-10-01",
+    "The TensorFlow Distributions library has moved to "
+    "TensorFlow Probability "
+    "(https://github.com/tensorflow/probability). You "
+    "should update all references to use `tfp.distributions` "
+    "instead of `tf.contrib.distributions`.",
+    warn_once=True)
 def _kl_categorical_categorical(a, b, name=None):
   """Calculate the batched KL divergence KL(a || b) with a, b OneHotCategorical.
 
diff --git a/tensorflow/contrib/distributions/python/ops/poisson.py b/tensorflow/contrib/distributions/python/ops/poisson.py
index b34539402102b8f289d4eb289fcb82f4030f4e8c..3d055085cc7386e57a71aa310458b7666bb9a396 100644
--- a/tensorflow/contrib/distributions/python/ops/poisson.py
+++ b/tensorflow/contrib/distributions/python/ops/poisson.py
@@ -28,6 +28,7 @@ from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import random_ops
 from tensorflow.python.ops.distributions import distribution
 from tensorflow.python.ops.distributions import util as distribution_util
+from tensorflow.python.util import deprecation
 
 __all__ = [
     "Poisson",
@@ -65,6 +66,14 @@ class Poisson(distribution.Distribution):
 
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                rate=None,
                log_rate=None,
@@ -93,7 +102,7 @@ class Poisson(distribution.Distribution):
       TypeError: if `rate` is not a float-type.
       TypeError: if `log_rate` is not a float-type.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     with ops.name_scope(name, values=[rate]) as name:
       if (rate is None) == (log_rate is None):
         raise ValueError("Must specify exactly one of `rate` and `log_rate`.")
diff --git a/tensorflow/contrib/distributions/python/ops/poisson_lognormal.py b/tensorflow/contrib/distributions/python/ops/poisson_lognormal.py
index fe72091d7d759e54c51eb666f2ceacc8371e55fd..7a7ad1be35b80ff0f000181ea0778ab282a8220f 100644
--- a/tensorflow/contrib/distributions/python/ops/poisson_lognormal.py
+++ b/tensorflow/contrib/distributions/python/ops/poisson_lognormal.py
@@ -33,6 +33,7 @@ from tensorflow.python.ops.distributions import categorical as categorical_lib
 from tensorflow.python.ops.distributions import distribution as distribution_lib
 from tensorflow.python.ops.distributions import normal as normal_lib
 from tensorflow.python.ops.distributions import transformed_distribution as transformed_lib
+from tensorflow.python.util import deprecation
 
 
 __all__ = [
@@ -42,6 +43,14 @@ __all__ = [
 ]
 
 
+@deprecation.deprecated(
+    "2018-10-01",
+    "The TensorFlow Distributions library has moved to "
+    "TensorFlow Probability "
+    "(https://github.com/tensorflow/probability). You "
+    "should update all references to use `tfp.distributions` "
+    "instead of `tf.contrib.distributions`.",
+    warn_once=True)
 def quadrature_scheme_lognormal_gauss_hermite(
     loc, scale, quadrature_size,
     validate_args=False, name=None):  # pylint: disable=unused-argument
@@ -85,6 +94,14 @@ def quadrature_scheme_lognormal_gauss_hermite(
     return grid, probs
 
 
+@deprecation.deprecated(
+    "2018-10-01",
+    "The TensorFlow Distributions library has moved to "
+    "TensorFlow Probability "
+    "(https://github.com/tensorflow/probability). You "
+    "should update all references to use `tfp.distributions` "
+    "instead of `tf.contrib.distributions`.",
+    warn_once=True)
 def quadrature_scheme_lognormal_quantiles(
     loc, scale, quadrature_size,
     validate_args=False, name=None):
@@ -214,6 +231,14 @@ class PoissonLogNormalQuadratureCompound(distribution_lib.Distribution):
       validate_args=True)
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                loc,
                scale,
@@ -255,7 +280,7 @@ class PoissonLogNormalQuadratureCompound(distribution_lib.Distribution):
       TypeError: if `quadrature_grid` and `quadrature_probs` have different base
         `dtype`.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     with ops.name_scope(name, values=[loc, scale]) as name:
       if loc is not None:
         loc = ops.convert_to_tensor(loc, name="loc")
@@ -417,6 +442,14 @@ class PoissonLogNormalQuadratureCompound(distribution_lib.Distribution):
         axis=[-2, -1])
 
 
+@deprecation.deprecated(
+    "2018-10-01",
+    "The TensorFlow Distributions library has moved to "
+    "TensorFlow Probability "
+    "(https://github.com/tensorflow/probability). You "
+    "should update all references to use `tfp.distributions` "
+    "instead of `tf.contrib.distributions`.",
+    warn_once=True)
 def concat_vectors(*args):
   """Concatenates input vectors, statically if possible."""
   args_ = [distribution_util.static_value(x) for x in args]
diff --git a/tensorflow/contrib/distributions/python/ops/quantized_distribution.py b/tensorflow/contrib/distributions/python/ops/quantized_distribution.py
index 584d2c385fced95ec496bb8dae9556e5c376b66d..ef3bdfa75fcaa8df17db1238ceadadf788601356 100644
--- a/tensorflow/contrib/distributions/python/ops/quantized_distribution.py
+++ b/tensorflow/contrib/distributions/python/ops/quantized_distribution.py
@@ -27,10 +27,19 @@ from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops.distributions import distribution as distributions
 from tensorflow.python.ops.distributions import util as distribution_util
+from tensorflow.python.util import deprecation
 
 __all__ = ["QuantizedDistribution"]
 
 
+@deprecation.deprecated(
+    "2018-10-01",
+    "The TensorFlow Distributions library has moved to "
+    "TensorFlow Probability "
+    "(https://github.com/tensorflow/probability). You "
+    "should update all references to use `tfp.distributions` "
+    "instead of `tf.contrib.distributions`.",
+    warn_once=True)
 def _logsum_expbig_minus_expsmall(big, small):
   """Stable evaluation of `Log[exp{big} - exp{small}]`.
 
@@ -228,6 +237,14 @@ class QuantizedDistribution(distributions.Distribution):
        https://arxiv.org/abs/1711.10433
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                distribution,
                low=None,
@@ -263,7 +280,7 @@ class QuantizedDistribution(distributions.Distribution):
           `Distribution` or continuous.
       NotImplementedError:  If the base distribution does not implement `cdf`.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     values = (
         list(distribution.parameters.values()) +
         [low, high])
diff --git a/tensorflow/contrib/distributions/python/ops/relaxed_bernoulli.py b/tensorflow/contrib/distributions/python/ops/relaxed_bernoulli.py
index 0362996e684fb34b15cd98a2fc40df58087fbe95..7e1f64dc425e6a576bfbe1bb456901fddfac26e1 100644
--- a/tensorflow/contrib/distributions/python/ops/relaxed_bernoulli.py
+++ b/tensorflow/contrib/distributions/python/ops/relaxed_bernoulli.py
@@ -19,15 +19,16 @@ from __future__ import division
 from __future__ import print_function
 
 from tensorflow.contrib.distributions.python.ops import logistic
+from tensorflow.contrib.distributions.python.ops.bijectors.sigmoid import Sigmoid
 # Bijectors must be directly imported because `remove_undocumented` prevents
 # individual file imports.
-from tensorflow.contrib.distributions.python.ops.bijectors.sigmoid import Sigmoid
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import check_ops
 from tensorflow.python.ops.distributions import transformed_distribution
 from tensorflow.python.ops.distributions import util as distribution_util
+from tensorflow.python.util import deprecation
 
 
 class RelaxedBernoulli(transformed_distribution.TransformedDistribution):
@@ -131,6 +132,14 @@ class RelaxedBernoulli(transformed_distribution.TransformedDistribution):
   Gumbel-Softmax. 2016.
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                temperature,
                logits=None,
@@ -165,7 +174,7 @@ class RelaxedBernoulli(transformed_distribution.TransformedDistribution):
     Raises:
       ValueError: If both `probs` and `logits` are passed, or if neither.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     with ops.name_scope(name, values=[logits, probs, temperature]) as name:
       with ops.control_dependencies([check_ops.assert_positive(temperature)]
                                     if validate_args else []):
diff --git a/tensorflow/contrib/distributions/python/ops/relaxed_onehot_categorical.py b/tensorflow/contrib/distributions/python/ops/relaxed_onehot_categorical.py
index 910c430ae7f026a3ac9ce50d1d5936d4454cba41..9b5bd7576f2a3c364e21da76dd3905a8c6e35829 100644
--- a/tensorflow/contrib/distributions/python/ops/relaxed_onehot_categorical.py
+++ b/tensorflow/contrib/distributions/python/ops/relaxed_onehot_categorical.py
@@ -31,6 +31,7 @@ from tensorflow.python.ops import random_ops
 from tensorflow.python.ops.distributions import distribution
 from tensorflow.python.ops.distributions import transformed_distribution
 from tensorflow.python.ops.distributions import util as distribution_util
+from tensorflow.python.util import deprecation
 
 
 class ExpRelaxedOneHotCategorical(distribution.Distribution):
@@ -125,6 +126,14 @@ class ExpRelaxedOneHotCategorical(distribution.Distribution):
   A Continuous Relaxation of Discrete Random Variables. 2016.
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(
       self,
       temperature,
@@ -162,7 +171,7 @@ class ExpRelaxedOneHotCategorical(distribution.Distribution):
         more of the statistic's batch members are undefined.
       name: Python `str` name prefixed to Ops created by this class.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     with ops.name_scope(name, values=[logits, probs, temperature]) as name:
 
       self._logits, self._probs = distribution_util.get_logits_and_probs(
@@ -368,6 +377,14 @@ class RelaxedOneHotCategorical(
   A Continuous Relaxation of Discrete Random Variables. 2016.
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(
       self,
       temperature,
diff --git a/tensorflow/contrib/distributions/python/ops/shape.py b/tensorflow/contrib/distributions/python/ops/shape.py
index 6a7f28713acefd2285b07a212e2e47a6db1ae5e1..4f348be2806aa3ade7c1ea2a7bc68ca26db6447f 100644
--- a/tensorflow/contrib/distributions/python/ops/shape.py
+++ b/tensorflow/contrib/distributions/python/ops/shape.py
@@ -27,6 +27,7 @@ from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops.distributions import util as distribution_util
+from tensorflow.python.util import deprecation
 
 
 class _DistributionShape(object):
@@ -166,6 +167,14 @@ class _DistributionShape(object):
   "free," i.e., during graph construction.
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                batch_ndims=None,
                event_ndims=None,
diff --git a/tensorflow/contrib/distributions/python/ops/sinh_arcsinh.py b/tensorflow/contrib/distributions/python/ops/sinh_arcsinh.py
index f04dc8da39140240edbe4efb75de30e321436d55..a9d0fb4ccfb1803873f7fe17089f3e7c7f10f4b7 100644
--- a/tensorflow/contrib/distributions/python/ops/sinh_arcsinh.py
+++ b/tensorflow/contrib/distributions/python/ops/sinh_arcsinh.py
@@ -25,6 +25,7 @@ from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops.distributions import normal
 from tensorflow.python.ops.distributions import transformed_distribution
+from tensorflow.python.util import deprecation
 
 __all__ = [
     "SinhArcsinh",
@@ -94,6 +95,14 @@ class SinhArcsinh(transformed_distribution.TransformedDistribution):
   ```
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                loc,
                scale,
@@ -132,7 +141,7 @@ class SinhArcsinh(transformed_distribution.TransformedDistribution):
         if one or more of the statistic's batch members are undefined.
       name: Python `str` name prefixed to Ops created by this class.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
 
     with ops.name_scope(name,
                         values=[loc, scale, skewness, tailweight]) as name:
diff --git a/tensorflow/contrib/distributions/python/ops/statistical_testing.py b/tensorflow/contrib/distributions/python/ops/statistical_testing.py
index 9c69435fac109914ff29b307dfad105f62849339..c25e8c51d7705b641699fb05623c7b0fb4950e1b 100644
--- a/tensorflow/contrib/distributions/python/ops/statistical_testing.py
+++ b/tensorflow/contrib/distributions/python/ops/statistical_testing.py
@@ -140,6 +140,7 @@ __all__ = [
     "assert_true_mean_equal_by_dkwm",
     "min_discrepancy_of_true_means_detectable_by_dkwm",
     "min_num_samples_for_dkwm_mean_test",
+    "assert_true_mean_in_interval_by_dkwm",
     "assert_true_mean_equal_by_dkwm_two_sample",
     "min_discrepancy_of_true_means_detectable_by_dkwm_two_sample",
     "min_num_samples_for_dkwm_mean_two_sample_test",
@@ -209,17 +210,17 @@ def _maximum_mean(samples, envelope, high, name=None):
   separately.
 
   Args:
-    samples: Floating-point tensor of samples from the distribution(s)
+    samples: Floating-point `Tensor` of samples from the distribution(s)
       of interest.  Entries are assumed IID across the 0th dimension.
       The other dimensions must broadcast with `envelope` and `high`.
-    envelope: Floating-point tensor of sizes of admissible CDF
+    envelope: Floating-point `Tensor` of sizes of admissible CDF
       envelopes (i.e., the `eps` above).
-    high: Floating-point tensor of upper bounds on the distributions'
-      supports.
+    high: Floating-point `Tensor` of upper bounds on the distributions'
+      supports.  `samples <= high`.
     name: A name for this operation (optional).
 
   Returns:
-    bound: Floating-point tensor of upper bounds on the true means.
+    bound: Floating-point `Tensor` of upper bounds on the true means.
 
   Raises:
     InvalidArgumentError: If some `sample` is found to be larger than
@@ -254,17 +255,17 @@ def _minimum_mean(samples, envelope, low, name=None):
   separately.
 
   Args:
-    samples: Floating-point tensor of samples from the distribution(s)
+    samples: Floating-point `Tensor` of samples from the distribution(s)
       of interest.  Entries are assumed IID across the 0th dimension.
       The other dimensions must broadcast with `envelope` and `low`.
-    envelope: Floating-point tensor of sizes of admissible CDF
+    envelope: Floating-point `Tensor` of sizes of admissible CDF
       envelopes (i.e., the `eps` above).
-    low: Floating-point tensor of lower bounds on the distributions'
-      supports.
+    low: Floating-point `Tensor` of lower bounds on the distributions'
+      supports.  `samples >= low`.
     name: A name for this operation (optional).
 
   Returns:
-    bound: Floating-point tensor of lower bounds on the true means.
+    bound: Floating-point `Tensor` of lower bounds on the true means.
 
   Raises:
     InvalidArgumentError: If some `sample` is found to be smaller than
@@ -300,12 +301,12 @@ def _dkwm_cdf_envelope(n, error_rate, name=None):
   probability above.
 
   Args:
-    n: Tensor of numbers of samples drawn.
-    error_rate: Floating-point tensor of admissible rates of mistakes.
+    n: `Tensor` of numbers of samples drawn.
+    error_rate: Floating-point `Tensor` of admissible rates of mistakes.
     name: A name for this operation (optional).
 
   Returns:
-    eps: Tensor of maximum distances the true CDF can be from the
+    eps: `Tensor` of maximum distances the true CDF can be from the
       empirical CDF.  This scales as `O(sqrt(-log(error_rate)))` and
       as `O(1 / sqrt(n))`.  The shape is the broadcast of `n` and
       `error_rate`.
@@ -324,8 +325,8 @@ def _check_shape_dominates(samples, parameters):
   sample counts end up inflated.
 
   Args:
-    samples: A Tensor whose shape is to be protected against broadcasting.
-    parameters: A list of Tensors who are parameters for the statistical test.
+    samples: A `Tensor` whose shape is to be protected against broadcasting.
+    parameters: A list of `Tensor`s who are parameters for the statistical test.
 
   Returns:
     samples: Return original `samples` with control dependencies attached
@@ -369,19 +370,23 @@ def true_mean_confidence_interval_by_dkwm(
   members.
 
   Args:
-    samples: Floating-point tensor of samples from the distribution(s)
+    samples: Floating-point `Tensor` of samples from the distribution(s)
       of interest.  Entries are assumed IID across the 0th dimension.
       The other dimensions must broadcast with `low` and `high`.
-    low: Floating-point tensor of lower bounds on the distributions'
+      The support is bounded: `low <= samples <= high`.
+    low: Floating-point `Tensor` of lower bounds on the distributions'
       supports.
-    high: Floating-point tensor of upper bounds on the distributions'
+    high: Floating-point `Tensor` of upper bounds on the distributions'
       supports.
-    error_rate: *Scalar* admissible total rate of mistakes.
+    error_rate: *Scalar* floating-point `Tensor` admissible total rate
+      of mistakes.
     name: A name for this operation (optional).
 
   Returns:
-    low: A floating-point tensor of stochastic lower bounds on the true means.
-    high: A floating-point tensor of stochastic upper bounds on the true means.
+    low: A floating-point `Tensor` of stochastic lower bounds on the
+      true means.
+    high: A floating-point `Tensor` of stochastic upper bounds on the
+      true means.
   """
   with ops.name_scope(
       name, "true_mean_confidence_interval_by_dkwm",
@@ -436,15 +441,17 @@ def assert_true_mean_equal_by_dkwm(
   the assertion will insist on stronger evidence to fail any one member.
 
   Args:
-    samples: Floating-point tensor of samples from the distribution(s)
+    samples: Floating-point `Tensor` of samples from the distribution(s)
       of interest.  Entries are assumed IID across the 0th dimension.
       The other dimensions must broadcast with `low` and `high`.
-    low: Floating-point tensor of lower bounds on the distributions'
+      The support is bounded: `low <= samples <= high`.
+    low: Floating-point `Tensor` of lower bounds on the distributions'
       supports.
-    high: Floating-point tensor of upper bounds on the distributions'
+    high: Floating-point `Tensor` of upper bounds on the distributions'
       supports.
-    expected: Floating-point tensor of expected true means.
-    false_fail_rate: *Scalar* admissible total rate of mistakes.
+    expected: Floating-point `Tensor` of expected true means.
+    false_fail_rate: *Scalar* floating-point `Tensor` admissible total
+      rate of mistakes.
     name: A name for this operation (optional).
 
   Returns:
@@ -454,20 +461,8 @@ def assert_true_mean_equal_by_dkwm(
   with ops.name_scope(
       name, "assert_true_mean_equal_by_dkwm",
       [samples, low, high, expected, false_fail_rate]):
-    samples = ops.convert_to_tensor(samples, name="samples")
-    low = ops.convert_to_tensor(low, name="low")
-    high = ops.convert_to_tensor(high, name="high")
-    expected = ops.convert_to_tensor(expected, name="expected")
-    false_fail_rate = ops.convert_to_tensor(
-        false_fail_rate, name="false_fail_rate")
-    samples = _check_shape_dominates(samples, [low, high, expected])
-    min_mean, max_mean = true_mean_confidence_interval_by_dkwm(
-        samples, low, high, error_rate=false_fail_rate)
-    less_op = check_ops.assert_less(
-        min_mean, expected, message="Mean confidence interval too high")
-    with ops.control_dependencies([less_op]):
-      return check_ops.assert_greater(
-          max_mean, expected, message="Mean confidence interval too low")
+    return assert_true_mean_in_interval_by_dkwm(
+        samples, low, high, expected, expected, false_fail_rate)
 
 
 def min_discrepancy_of_true_means_detectable_by_dkwm(
@@ -487,30 +482,35 @@ def min_discrepancy_of_true_means_detectable_by_dkwm(
   with the same `false_pass_rate`.
 
   Args:
-    n: Tensor of numbers of samples to be drawn from the distributions
+    n: `Tensor` of numbers of samples to be drawn from the distributions
       of interest.
-    low: Floating-point tensor of lower bounds on the distributions'
+    low: Floating-point `Tensor` of lower bounds on the distributions'
       supports.
-    high: Floating-point tensor of upper bounds on the distributions'
+    high: Floating-point `Tensor` of upper bounds on the distributions'
       supports.
-    false_fail_rate: *Scalar* admissible total rate of false failures.
-    false_pass_rate: *Scalar* admissible rate of false passes.
+    false_fail_rate: *Scalar* floating-point `Tensor` admissible total
+      rate of false failures.
+    false_pass_rate: *Scalar* floating-point `Tensor` admissible rate
+      of false passes.
     name: A name for this operation (optional).
 
   Returns:
-    discr: Tensor of lower bounds on the distances between true
+    discr: `Tensor` of lower bounds on the distances between true
        means detectable by a DKWM-based test.
 
   For each batch member `i`, of `K` total, drawing `n[i]` samples from
   some scalar distribution supported on `[low[i], high[i]]` is enough
   to detect a difference in means of size `discr[i]` or more.
   Specifically, we guarantee that (a) if the true mean is the expected
-  mean, `assert_true_mean_equal_by_dkwm` will fail with probability at
-  most `false_fail_rate / K` (which amounts to `false_fail_rate` if
-  applied to the whole batch at once), and (b) if the true mean
-  differs from the expected mean by at least `discr[i]`,
-  `assert_true_mean_equal_by_dkwm` will pass with probability at most
-  `false_pass_rate`.
+  mean (resp. in the expected interval), then `assert_true_mean_equal_by_dkwm`
+  (resp. `assert_true_mean_in_interval_by_dkwm`) will fail with
+  probability at most `false_fail_rate / K` (which amounts to
+  `false_fail_rate` if applied to the whole batch at once), and (b) if
+  the true mean differs from the expected mean (resp. falls outside
+  the expected interval) by at least `discr[i]`,
+  `assert_true_mean_equal_by_dkwm`
+  (resp. `assert_true_mean_in_interval_by_dkwm`) will pass with
+  probability at most `false_pass_rate`.
 
   The detectable discrepancy scales as
 
@@ -558,17 +558,19 @@ def min_num_samples_for_dkwm_mean_test(
   on a scalar distribution supported on `[low, high]`.
 
   Args:
-    discrepancy: Floating-point tensor of desired upper limits on mean
+    discrepancy: Floating-point `Tensor` of desired upper limits on mean
       differences that may go undetected with probability higher than
       `1 - false_pass_rate`.
-    low: Tensor of lower bounds on the distributions' support.
-    high: Tensor of upper bounds on the distributions' support.
-    false_fail_rate: *Scalar* admissible total rate of false failures.
-    false_pass_rate: *Scalar* admissible rate of false passes.
+    low: `Tensor` of lower bounds on the distributions' support.
+    high: `Tensor` of upper bounds on the distributions' support.
+    false_fail_rate: *Scalar* floating-point `Tensor` admissible total
+      rate of false failures.
+    false_pass_rate: *Scalar* floating-point `Tensor` admissible rate
+      of false passes.
     name: A name for this operation (optional).
 
   Returns:
-    n: Tensor of numbers of samples to be drawn from the distributions
+    n: `Tensor` of numbers of samples to be drawn from the distributions
       of interest.
 
   The `discrepancy`, `low`, and `high` tensors must have
@@ -578,12 +580,15 @@ def min_num_samples_for_dkwm_mean_test(
   some scalar distribution supported on `[low[i], high[i]]` is enough
   to detect a difference in means of size `discrepancy[i]` or more.
   Specifically, we guarantee that (a) if the true mean is the expected
-  mean, `assert_true_mean_equal_by_dkwm` will fail with probability at
-  most `false_fail_rate / K` (which amounts to `false_fail_rate` if
-  applied to the whole batch at once), and (b) if the true mean
-  differs from the expected mean by at least `discrepancy[i]`,
-  `assert_true_mean_equal_by_dkwm` will pass with probability at most
-  `false_pass_rate`.
+  mean (resp. in the expected interval), then `assert_true_mean_equal_by_dkwm`
+  (resp. `assert_true_mean_in_interval_by_dkwm`) will fail with
+  probability at most `false_fail_rate / K` (which amounts to
+  `false_fail_rate` if applied to the whole batch at once), and (b) if
+  the true mean differs from the expected mean (resp. falls outside
+  the expected interval) by at least `discrepancy[i]`,
+  `assert_true_mean_equal_by_dkwm`
+  (resp. `assert_true_mean_in_interval_by_dkwm`) will pass with
+  probability at most `false_pass_rate`.
 
   The required number of samples scales
   as `O((high[i] - low[i])**2)`, `O(-log(false_fail_rate/K))`,
@@ -610,6 +615,76 @@ def min_num_samples_for_dkwm_mean_test(
     return math_ops.maximum(n1, n2)
 
 
+def assert_true_mean_in_interval_by_dkwm(
+    samples, low, high, expected_low, expected_high,
+    false_fail_rate=1e-6, name=None):
+  """Asserts the mean of the given distribution is in the given interval.
+
+  More precisely, fails if there is enough evidence (using the
+  [Dvoretzky-Kiefer-Wolfowitz-Massart inequality]
+  (https://en.wikipedia.org/wiki/CDF-based_nonparametric_confidence_interval))
+  that the mean of the distribution from which the given samples are
+  drawn is _outside_ the given interval with statistical significance
+  `false_fail_rate` or stronger, otherwise passes.  If you also want
+  to check that you are gathering enough evidence that a pass is not
+  spurious, see `min_num_samples_for_dkwm_mean_test` and
+  `min_discrepancy_of_true_means_detectable_by_dkwm`.
+
+  Note that `false_fail_rate` is a total false failure rate for all
+  the assertions in the batch.  As such, if the batch is nontrivial,
+  the assertion will insist on stronger evidence to fail any one member.
+
+  Args:
+    samples: Floating-point `Tensor` of samples from the distribution(s)
+      of interest.  Entries are assumed IID across the 0th dimension.
+      The other dimensions must broadcast with `low` and `high`.
+      The support is bounded: `low <= samples <= high`.
+    low: Floating-point `Tensor` of lower bounds on the distributions'
+      supports.
+    high: Floating-point `Tensor` of upper bounds on the distributions'
+      supports.
+    expected_low: Floating-point `Tensor` of lower bounds on the
+      expected true means.
+    expected_high: Floating-point `Tensor` of upper bounds on the
+      expected true means.
+    false_fail_rate: *Scalar* floating-point `Tensor` admissible total
+      rate of mistakes.
+    name: A name for this operation (optional).
+
+  Returns:
+    check: Op that raises `InvalidArgumentError` if any expected mean
+      interval does not overlap with the corresponding confidence
+      interval.
+  """
+  with ops.name_scope(
+      name, "assert_true_mean_in_interval_by_dkwm",
+      [samples, low, high, expected_low, expected_high, false_fail_rate]):
+    samples = ops.convert_to_tensor(samples, name="samples")
+    low = ops.convert_to_tensor(low, name="low")
+    high = ops.convert_to_tensor(high, name="high")
+    expected_low = ops.convert_to_tensor(expected_low, name="expected_low")
+    expected_high = ops.convert_to_tensor(expected_high, name="expected_high")
+    false_fail_rate = ops.convert_to_tensor(
+        false_fail_rate, name="false_fail_rate")
+    samples = _check_shape_dominates(
+        samples, [low, high, expected_low, expected_high])
+    min_mean, max_mean = true_mean_confidence_interval_by_dkwm(
+        samples, low, high, false_fail_rate)
+    # Assert that the interval [min_mean, max_mean] intersects the
+    # interval [expected_low, expected_high].  This is true if
+    #   max_mean >= expected_low and min_mean <= expected_high.
+    # By DeMorgan's law, that's also equivalent to
+    #   not (max_mean < expected_low or min_mean > expected_high),
+    # which is a way of saying the two intervals are not disjoint.
+    check_confidence_interval_can_intersect = check_ops.assert_greater_equal(
+        max_mean, expected_low, message="Confidence interval does not "
+        "intersect: true mean smaller than expected")
+    with ops.control_dependencies([check_confidence_interval_can_intersect]):
+      return check_ops.assert_less_equal(
+          min_mean, expected_high, message="Confidence interval does not "
+          "intersect: true mean greater than expected")
+
+
 def assert_true_mean_equal_by_dkwm_two_sample(
     samples1, low1, high1, samples2, low2, high2,
     false_fail_rate=1e-6, name=None):
@@ -630,23 +705,26 @@ def assert_true_mean_equal_by_dkwm_two_sample(
   the assertion will insist on stronger evidence to fail any one member.
 
   Args:
-    samples1: Floating-point tensor of samples from the
+    samples1: Floating-point `Tensor` of samples from the
       distribution(s) A.  Entries are assumed IID across the 0th
       dimension.  The other dimensions must broadcast with `low1`,
       `high1`, `low2`, and `high2`.
-    low1: Floating-point tensor of lower bounds on the supports of the
+      The support is bounded: `low1 <= samples1 <= high1`.
+    low1: Floating-point `Tensor` of lower bounds on the supports of the
       distributions A.
-    high1: Floating-point tensor of upper bounds on the supports of
+    high1: Floating-point `Tensor` of upper bounds on the supports of
       the distributions A.
-    samples2: Floating-point tensor of samples from the
+    samples2: Floating-point `Tensor` of samples from the
       distribution(s) B.  Entries are assumed IID across the 0th
       dimension.  The other dimensions must broadcast with `low1`,
       `high1`, `low2`, and `high2`.
-    low2: Floating-point tensor of lower bounds on the supports of the
+      The support is bounded: `low2 <= samples2 <= high2`.
+    low2: Floating-point `Tensor` of lower bounds on the supports of the
       distributions B.
-    high2: Floating-point tensor of upper bounds on the supports of
+    high2: Floating-point `Tensor` of upper bounds on the supports of
       the distributions B.
-    false_fail_rate: *Scalar* admissible total rate of mistakes.
+    false_fail_rate: *Scalar* floating-point `Tensor` admissible total
+      rate of mistakes.
     name: A name for this operation (optional).
 
   Returns:
@@ -676,20 +754,10 @@ def assert_true_mean_equal_by_dkwm_two_sample(
       # and sample counts should be valid; however, because the intervals
       # scale as O(-log(false_fail_rate)), there doesn't seem to be much
       # room to win.
-      min_mean_1, max_mean_1 = true_mean_confidence_interval_by_dkwm(
-          samples1, low1, high1, false_fail_rate / 2.)
       min_mean_2, max_mean_2 = true_mean_confidence_interval_by_dkwm(
           samples2, low2, high2, false_fail_rate / 2.)
-      # I want to assert
-      #   not (max_mean_1 < min_mean_2 or min_mean_1 > max_mean_2),
-      # but I think I only have and-combination of asserts, so use DeMorgan.
-      check_confidence_intervals_can_intersect = check_ops.assert_greater_equal(
-          max_mean_1, min_mean_2, message="Confidence intervals do not "
-          "intersect: samples1 has a smaller mean than samples2")
-      with ops.control_dependencies([check_confidence_intervals_can_intersect]):
-        return check_ops.assert_less_equal(
-            min_mean_1, max_mean_2, message="Confidence intervals do not "
-            "intersect: samples2 has a smaller mean than samples1")
+      return assert_true_mean_in_interval_by_dkwm(
+          samples1, low1, high1, min_mean_2, max_mean_2, false_fail_rate / 2.)
 
 
 def min_discrepancy_of_true_means_detectable_by_dkwm_two_sample(
@@ -710,22 +778,24 @@ def min_discrepancy_of_true_means_detectable_by_dkwm_two_sample(
   with the same `false_pass_rate`.
 
   Args:
-    n1: Tensor of numbers of samples to be drawn from the distributions A.
-    low1: Floating-point tensor of lower bounds on the supports of the
+    n1: `Tensor` of numbers of samples to be drawn from the distributions A.
+    low1: Floating-point `Tensor` of lower bounds on the supports of the
       distributions A.
-    high1: Floating-point tensor of upper bounds on the supports of
+    high1: Floating-point `Tensor` of upper bounds on the supports of
       the distributions A.
-    n2: Tensor of numbers of samples to be drawn from the distributions B.
-    low2: Floating-point tensor of lower bounds on the supports of the
+    n2: `Tensor` of numbers of samples to be drawn from the distributions B.
+    low2: Floating-point `Tensor` of lower bounds on the supports of the
       distributions B.
-    high2: Floating-point tensor of upper bounds on the supports of
+    high2: Floating-point `Tensor` of upper bounds on the supports of
       the distributions B.
-    false_fail_rate: *Scalar* admissible total rate of false failures.
-    false_pass_rate: *Scalar* admissible rate of false passes.
+    false_fail_rate: *Scalar* floating-point `Tensor` admissible total
+      rate of false failures.
+    false_pass_rate: *Scalar* floating-point `Tensor` admissible rate
+      of false passes.
     name: A name for this operation (optional).
 
   Returns:
-    discr: Tensor of lower bounds on the distances between true means
+    discr: `Tensor` of lower bounds on the distances between true means
        detectable by a two-sample DKWM-based test.
 
   For each batch member `i`, of `K` total, drawing `n1[i]` samples
@@ -776,24 +846,26 @@ def min_num_samples_for_dkwm_mean_two_sample_test(
   (https://en.wikipedia.org/wiki/CDF-based_nonparametric_confidence_interval).
 
   Args:
-    discrepancy: Floating-point tensor of desired upper limits on mean
+    discrepancy: Floating-point `Tensor` of desired upper limits on mean
       differences that may go undetected with probability higher than
       `1 - false_pass_rate`.
-    low1: Floating-point tensor of lower bounds on the supports of the
+    low1: Floating-point `Tensor` of lower bounds on the supports of the
       distributions A.
-    high1: Floating-point tensor of upper bounds on the supports of
+    high1: Floating-point `Tensor` of upper bounds on the supports of
       the distributions A.
-    low2: Floating-point tensor of lower bounds on the supports of the
+    low2: Floating-point `Tensor` of lower bounds on the supports of the
       distributions B.
-    high2: Floating-point tensor of upper bounds on the supports of
+    high2: Floating-point `Tensor` of upper bounds on the supports of
       the distributions B.
-    false_fail_rate: *Scalar* admissible total rate of false failures.
-    false_pass_rate: *Scalar* admissible rate of false passes.
+    false_fail_rate: *Scalar* floating-point `Tensor` admissible total
+      rate of false failures.
+    false_pass_rate: *Scalar* floating-point `Tensor` admissible rate
+      of false passes.
     name: A name for this operation (optional).
 
   Returns:
-    n1: Tensor of numbers of samples to be drawn from the distributions A.
-    n2: Tensor of numbers of samples to be drawn from the distributions B.
+    n1: `Tensor` of numbers of samples to be drawn from the distributions A.
+    n2: `Tensor` of numbers of samples to be drawn from the distributions B.
 
   For each batch member `i`, of `K` total, drawing `n1[i]` samples
   from scalar distribution A supported on `[low1[i], high1[i]]` and `n2[i]`
diff --git a/tensorflow/contrib/distributions/python/ops/vector_diffeomixture.py b/tensorflow/contrib/distributions/python/ops/vector_diffeomixture.py
index cd6d7499595d88d18de339371d4a07fe780662d9..ece03fe4aab3cc3046e0958d883ca9388517b94b 100644
--- a/tensorflow/contrib/distributions/python/ops/vector_diffeomixture.py
+++ b/tensorflow/contrib/distributions/python/ops/vector_diffeomixture.py
@@ -40,6 +40,7 @@ from tensorflow.python.ops.linalg import linear_operator_diag as linop_diag_lib
 from tensorflow.python.ops.linalg import linear_operator_full_matrix as linop_full_lib
 from tensorflow.python.ops.linalg import linear_operator_identity as linop_identity_lib
 from tensorflow.python.ops.linalg import linear_operator_lower_triangular as linop_tril_lib
+from tensorflow.python.util import deprecation
 
 
 __all__ = [
@@ -49,6 +50,14 @@ __all__ = [
 ]
 
 
+@deprecation.deprecated(
+    "2018-10-01",
+    "The TensorFlow Distributions library has moved to "
+    "TensorFlow Probability "
+    "(https://github.com/tensorflow/probability). You "
+    "should update all references to use `tfp.distributions` "
+    "instead of `tf.contrib.distributions`.",
+    warn_once=True)
 def quadrature_scheme_softmaxnormal_gauss_hermite(
     normal_loc, normal_scale, quadrature_size,
     validate_args=False, name=None):
@@ -111,6 +120,14 @@ def quadrature_scheme_softmaxnormal_gauss_hermite(
     return grid, probs
 
 
+@deprecation.deprecated(
+    "2018-10-01",
+    "The TensorFlow Distributions library has moved to "
+    "TensorFlow Probability "
+    "(https://github.com/tensorflow/probability). You "
+    "should update all references to use `tfp.distributions` "
+    "instead of `tf.contrib.distributions`.",
+    warn_once=True)
 def quadrature_scheme_softmaxnormal_quantiles(
     normal_loc, normal_scale, quadrature_size,
     validate_args=False, name=None):
@@ -318,6 +335,14 @@ class VectorDiffeomixture(distribution_lib.Distribution):
        https://arxiv.org/abs/1801.03080
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                mix_loc,
                temperature,
@@ -395,7 +420,7 @@ class VectorDiffeomixture(distribution_lib.Distribution):
       ValueError: if `not distribution.is_scalar_batch`.
       ValueError: if `not distribution.is_scalar_event`.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     with ops.name_scope(name, values=[mix_loc, temperature]) as name:
       if not scale or len(scale) < 2:
         raise ValueError("Must specify list (or list-like object) of scale "
@@ -779,6 +804,14 @@ class VectorDiffeomixture(distribution_lib.Distribution):
     return array_ops.reshape(p, shape=expand_shape)
 
 
+@deprecation.deprecated(
+    "2018-10-01",
+    "The TensorFlow Distributions library has moved to "
+    "TensorFlow Probability "
+    "(https://github.com/tensorflow/probability). You "
+    "should update all references to use `tfp.distributions` "
+    "instead of `tf.contrib.distributions`.",
+    warn_once=True)
 def maybe_check_quadrature_param(param, name, validate_args):
   """Helper which checks validity of `loc` and `scale` init args."""
   with ops.name_scope(name="check_" + name, values=[param]):
@@ -812,6 +845,14 @@ def maybe_check_quadrature_param(param, name, validate_args):
     return param
 
 
+@deprecation.deprecated(
+    "2018-10-01",
+    "The TensorFlow Distributions library has moved to "
+    "TensorFlow Probability "
+    "(https://github.com/tensorflow/probability). You "
+    "should update all references to use `tfp.distributions` "
+    "instead of `tf.contrib.distributions`.",
+    warn_once=True)
 def determine_batch_event_shapes(grid, endpoint_affine):
   """Helper to infer batch_shape and event_shape."""
   with ops.name_scope(name="determine_batch_event_shapes"):
@@ -850,6 +891,14 @@ def determine_batch_event_shapes(grid, endpoint_affine):
     return batch_shape, batch_shape_tensor, event_shape, event_shape_tensor
 
 
+@deprecation.deprecated(
+    "2018-10-01",
+    "The TensorFlow Distributions library has moved to "
+    "TensorFlow Probability "
+    "(https://github.com/tensorflow/probability). You "
+    "should update all references to use `tfp.distributions` "
+    "instead of `tf.contrib.distributions`.",
+    warn_once=True)
 def interpolate_loc(grid, loc):
   """Helper which interpolates between two locs."""
   if len(loc) != 2:
@@ -876,6 +925,14 @@ def interpolate_loc(grid, loc):
     return [x[..., k] for k in range(deg)]             # list(shape:[B, e])
 
 
+@deprecation.deprecated(
+    "2018-10-01",
+    "The TensorFlow Distributions library has moved to "
+    "TensorFlow Probability "
+    "(https://github.com/tensorflow/probability). You "
+    "should update all references to use `tfp.distributions` "
+    "instead of `tf.contrib.distributions`.",
+    warn_once=True)
 def interpolate_scale(grid, scale):
   """Helper which interpolates between two scales."""
   if len(scale) != 2:
@@ -892,6 +949,14 @@ def interpolate_scale(grid, scale):
     ])[0] for q in range(deg)]
 
 
+@deprecation.deprecated(
+    "2018-10-01",
+    "The TensorFlow Distributions library has moved to "
+    "TensorFlow Probability "
+    "(https://github.com/tensorflow/probability). You "
+    "should update all references to use `tfp.distributions` "
+    "instead of `tf.contrib.distributions`.",
+    warn_once=True)
 def linop_scale(w, op):
   # We assume w > 0. (This assumption only relates to the is_* attributes.)
   with ops.name_scope("linop_scale", values=[w]):
@@ -927,6 +992,14 @@ def linop_scale(w, op):
         "Unsupported Linop type ({})".format(type(op).__name__))
 
 
+@deprecation.deprecated(
+    "2018-10-01",
+    "The TensorFlow Distributions library has moved to "
+    "TensorFlow Probability "
+    "(https://github.com/tensorflow/probability). You "
+    "should update all references to use `tfp.distributions` "
+    "instead of `tf.contrib.distributions`.",
+    warn_once=True)
 def concat_vectors(*args):
   """Concatenates input vectors, statically if possible."""
   args_ = [distribution_util.static_value(x) for x in args]
@@ -935,6 +1008,14 @@ def concat_vectors(*args):
   return [val for vec in args_ for val in vec]
 
 
+@deprecation.deprecated(
+    "2018-10-01",
+    "The TensorFlow Distributions library has moved to "
+    "TensorFlow Probability "
+    "(https://github.com/tensorflow/probability). You "
+    "should update all references to use `tfp.distributions` "
+    "instead of `tf.contrib.distributions`.",
+    warn_once=True)
 def add(x, y):
   """Adds inputs; interprets `None` as zero."""
   if x is None:
@@ -944,11 +1025,27 @@ def add(x, y):
   return x + y
 
 
+@deprecation.deprecated(
+    "2018-10-01",
+    "The TensorFlow Distributions library has moved to "
+    "TensorFlow Probability "
+    "(https://github.com/tensorflow/probability). You "
+    "should update all references to use `tfp.distributions` "
+    "instead of `tf.contrib.distributions`.",
+    warn_once=True)
 def vec_osquare(x):
   """Computes the outer-product of a (batch of) vector, i.e., x.T x."""
   return x[..., :, array_ops.newaxis] * x[..., array_ops.newaxis, :]
 
 
+@deprecation.deprecated(
+    "2018-10-01",
+    "The TensorFlow Distributions library has moved to "
+    "TensorFlow Probability "
+    "(https://github.com/tensorflow/probability). You "
+    "should update all references to use `tfp.distributions` "
+    "instead of `tf.contrib.distributions`.",
+    warn_once=True)
 def softmax(x, axis, name=None):
   """Equivalent to tf.nn.softmax but works around b/70297725."""
   with ops.name_scope(name, "softmax", [x, axis]):
diff --git a/tensorflow/contrib/distributions/python/ops/vector_exponential_diag.py b/tensorflow/contrib/distributions/python/ops/vector_exponential_diag.py
index 3465d66b30501e7aebd9904d2ae2206d628c10b7..73356a3625c9a1aa15af5b6c1cf2ccb0c514b39a 100644
--- a/tensorflow/contrib/distributions/python/ops/vector_exponential_diag.py
+++ b/tensorflow/contrib/distributions/python/ops/vector_exponential_diag.py
@@ -21,6 +21,7 @@ from __future__ import print_function
 from tensorflow.contrib.distributions.python.ops import distribution_util
 from tensorflow.contrib.distributions.python.ops import vector_exponential_linear_operator as vector_exponential_linop
 from tensorflow.python.framework import ops
+from tensorflow.python.util import deprecation
 
 
 __all__ = [
@@ -116,6 +117,14 @@ class VectorExponentialDiag(
 
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                loc=None,
                scale_diag=None,
@@ -175,7 +184,7 @@ class VectorExponentialDiag(
     Raises:
       ValueError: if at most `scale_identity_multiplier` is specified.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     with ops.name_scope(name) as name:
       with ops.name_scope("init", values=[
           loc, scale_diag, scale_identity_multiplier]):
diff --git a/tensorflow/contrib/distributions/python/ops/vector_exponential_linear_operator.py b/tensorflow/contrib/distributions/python/ops/vector_exponential_linear_operator.py
index 2c31b019845d7e4558eb3047af84732a2ae03986..9a47b4855763a25b484ad04a3415d191f19256f7 100644
--- a/tensorflow/contrib/distributions/python/ops/vector_exponential_linear_operator.py
+++ b/tensorflow/contrib/distributions/python/ops/vector_exponential_linear_operator.py
@@ -26,6 +26,7 @@ from tensorflow.python.ops import math_ops
 from tensorflow.python.ops.distributions import exponential
 from tensorflow.python.ops.distributions import transformed_distribution
 from tensorflow.python.ops.linalg import linalg
+from tensorflow.python.util import deprecation
 
 __all__ = ["VectorExponentialLinearOperator"]
 
@@ -138,6 +139,14 @@ class VectorExponentialLinearOperator(
 
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                loc=None,
                scale=None,
@@ -175,7 +184,7 @@ class VectorExponentialLinearOperator(
       ValueError: if `scale` is unspecified.
       TypeError: if not `scale.dtype.is_floating`
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     if scale is None:
       raise ValueError("Missing required `scale` parameter.")
     if not scale.dtype.is_floating:
diff --git a/tensorflow/contrib/distributions/python/ops/vector_laplace_diag.py b/tensorflow/contrib/distributions/python/ops/vector_laplace_diag.py
index 6a36018d6f1b83955ef9080ec11c74c08a670075..e68ddc569c95ff63760b4b2f6d7a92f17240a558 100644
--- a/tensorflow/contrib/distributions/python/ops/vector_laplace_diag.py
+++ b/tensorflow/contrib/distributions/python/ops/vector_laplace_diag.py
@@ -21,6 +21,7 @@ from __future__ import print_function
 from tensorflow.contrib.distributions.python.ops import distribution_util
 from tensorflow.contrib.distributions.python.ops import vector_laplace_linear_operator as vector_laplace_linop
 from tensorflow.python.framework import ops
+from tensorflow.python.util import deprecation
 
 
 __all__ = [
@@ -151,6 +152,14 @@ class VectorLaplaceDiag(
 
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                loc=None,
                scale_diag=None,
@@ -210,7 +219,7 @@ class VectorLaplaceDiag(
     Raises:
       ValueError: if at most `scale_identity_multiplier` is specified.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     with ops.name_scope(name):
       with ops.name_scope("init", values=[
           loc, scale_diag, scale_identity_multiplier]):
diff --git a/tensorflow/contrib/distributions/python/ops/vector_laplace_linear_operator.py b/tensorflow/contrib/distributions/python/ops/vector_laplace_linear_operator.py
index 97e5c76d800acd800e34a9e66a3c5fdd7ce4f660..3923161a332a77e4eaab8d65d96fd8c278c872ec 100644
--- a/tensorflow/contrib/distributions/python/ops/vector_laplace_linear_operator.py
+++ b/tensorflow/contrib/distributions/python/ops/vector_laplace_linear_operator.py
@@ -28,6 +28,7 @@ from tensorflow.python.ops import math_ops
 from tensorflow.python.ops.distributions import laplace
 from tensorflow.python.ops.distributions import transformed_distribution
 from tensorflow.python.ops.linalg import linalg
+from tensorflow.python.util import deprecation
 
 
 __all__ = [
@@ -154,6 +155,14 @@ class VectorLaplaceLinearOperator(
 
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                loc=None,
                scale=None,
@@ -191,7 +200,7 @@ class VectorLaplaceLinearOperator(
       ValueError: if `scale` is unspecified.
       TypeError: if not `scale.dtype.is_floating`
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     if scale is None:
       raise ValueError("Missing required `scale` parameter.")
     if not scale.dtype.is_floating:
diff --git a/tensorflow/contrib/distributions/python/ops/vector_sinh_arcsinh_diag.py b/tensorflow/contrib/distributions/python/ops/vector_sinh_arcsinh_diag.py
index ff5ca4525700aedc88d75e391bf0c2415c2afa13..49ffff24caec8d6c525f65f06796d10548d5ec40 100644
--- a/tensorflow/contrib/distributions/python/ops/vector_sinh_arcsinh_diag.py
+++ b/tensorflow/contrib/distributions/python/ops/vector_sinh_arcsinh_diag.py
@@ -25,6 +25,7 @@ from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops.distributions import normal
 from tensorflow.python.ops.distributions import transformed_distribution
+from tensorflow.python.util import deprecation
 
 __all__ = [
     "VectorSinhArcsinhDiag",
@@ -95,6 +96,14 @@ class VectorSinhArcsinhDiag(transformed_distribution.TransformedDistribution):
   ```
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                loc=None,
                scale_diag=None,
@@ -163,7 +172,7 @@ class VectorSinhArcsinhDiag(transformed_distribution.TransformedDistribution):
     Raises:
       ValueError: if at most `scale_identity_multiplier` is specified.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
 
     with ops.name_scope(
         name,
diff --git a/tensorflow/contrib/distributions/python/ops/vector_student_t.py b/tensorflow/contrib/distributions/python/ops/vector_student_t.py
index 4742f7521816d4643354017495f3380c78ac7bc2..f289b39e51aff36780541a0545ed9e6cfe21dd4e 100644
--- a/tensorflow/contrib/distributions/python/ops/vector_student_t.py
+++ b/tensorflow/contrib/distributions/python/ops/vector_student_t.py
@@ -26,6 +26,7 @@ from tensorflow.python.framework import ops
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops.distributions import student_t
 from tensorflow.python.ops.distributions import transformed_distribution
+from tensorflow.python.util import deprecation
 
 
 class _VectorStudentT(transformed_distribution.TransformedDistribution):
@@ -121,6 +122,14 @@ class _VectorStudentT(transformed_distribution.TransformedDistribution):
 
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                df,
                loc=None,
@@ -175,7 +184,7 @@ class _VectorStudentT(transformed_distribution.TransformedDistribution):
         if one or more of the statistic's batch members are undefined.
       name: Python `str` name prefixed to Ops created by this class.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     graph_parents = [df, loc, scale_identity_multiplier, scale_diag,
                      scale_tril, scale_perturb_factor, scale_perturb_diag]
     with ops.name_scope(name) as name:
diff --git a/tensorflow/contrib/distributions/python/ops/wishart.py b/tensorflow/contrib/distributions/python/ops/wishart.py
index f555867e7f3c2a6bc797e9b3d56da2fa434aba6f..f1accaaa4c920344608015c792a2c3606de1337f 100644
--- a/tensorflow/contrib/distributions/python/ops/wishart.py
+++ b/tensorflow/contrib/distributions/python/ops/wishart.py
@@ -36,6 +36,7 @@ from tensorflow.python.ops import linalg_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import random_ops
 from tensorflow.python.ops.distributions import distribution
+from tensorflow.python.util import deprecation
 
 __all__ = [
     "WishartCholesky",
@@ -73,6 +74,14 @@ class _WishartLinearOperator(distribution.Distribution):
   this class.
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                df,
                scale_operator,
@@ -107,7 +116,7 @@ class _WishartLinearOperator(distribution.Distribution):
       ValueError: if df < k, where scale operator event shape is
         `(k, k)`
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     self._cholesky_input_output_matrices = cholesky_input_output_matrices
     with ops.name_scope(name) as name:
       with ops.name_scope("init", values=[df, scale_operator]):
@@ -501,6 +510,14 @@ class WishartCholesky(_WishartLinearOperator):
 
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                df,
                scale,
@@ -530,7 +547,7 @@ class WishartCholesky(_WishartLinearOperator):
         more of the statistic's batch members are undefined.
       name: Python `str` name prefixed to Ops created by this class.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     with ops.name_scope(name, values=[scale]) as name:
       with ops.name_scope("init", values=[scale]):
         scale = ops.convert_to_tensor(scale)
@@ -617,6 +634,14 @@ class WishartFull(_WishartLinearOperator):
 
   """
 
+  @deprecation.deprecated(
+      "2018-10-01",
+      "The TensorFlow Distributions library has moved to "
+      "TensorFlow Probability "
+      "(https://github.com/tensorflow/probability). You "
+      "should update all references to use `tfp.distributions` "
+      "instead of `tf.contrib.distributions`.",
+      warn_once=True)
   def __init__(self,
                df,
                scale,
@@ -646,7 +671,7 @@ class WishartFull(_WishartLinearOperator):
         more of the statistic's batch members are undefined.
       name: Python `str` name prefixed to Ops created by this class.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     with ops.name_scope(name) as name:
       with ops.name_scope("init", values=[scale]):
         scale = ops.convert_to_tensor(scale)
diff --git a/tensorflow/contrib/eager/python/datasets.py b/tensorflow/contrib/eager/python/datasets.py
index d7909dd5a2691a015a6afed2caa475b39ca7ebc3..adf92c27ea0a27c5741bcdd175b277462cb28d02 100644
--- a/tensorflow/contrib/eager/python/datasets.py
+++ b/tensorflow/contrib/eager/python/datasets.py
@@ -106,7 +106,8 @@ class Iterator(iterator_ops.EagerIterator, checkpointable.CheckpointableBase):
             target_device=target,
             buffer_size=10,
             container="",
-            shared_name=_generate_shared_name("function_buffer_resource"))
+            shared_name=_generate_shared_name(
+                "contrib_eager_iterator_function_buffer_resource"))
         self._buffer_resource_deleter = resource_variable_ops.EagerResourceDeleter(  # pylint: disable=line-too-long
             handle=self._buffer_resource_handle,
             handle_device=self._device)
diff --git a/tensorflow/contrib/eager/python/examples/BUILD b/tensorflow/contrib/eager/python/examples/BUILD
index c1fd9e0ed020beeb722204edf1adfe1dfcf8ff03..1d9371c7ac405dbf0ec40210270b90f2cf9b9a25 100644
--- a/tensorflow/contrib/eager/python/examples/BUILD
+++ b/tensorflow/contrib/eager/python/examples/BUILD
@@ -7,6 +7,8 @@ py_library(
     name = "examples_pip",
     deps = [
         "//tensorflow/contrib/eager/python/examples/gan:mnist",
+        "//tensorflow/contrib/eager/python/examples/l2hmc",
+        "//tensorflow/contrib/eager/python/examples/l2hmc:neural_nets",
         "//tensorflow/contrib/eager/python/examples/linear_regression",
         "//tensorflow/contrib/eager/python/examples/resnet50",
         "//tensorflow/contrib/eager/python/examples/rnn_colorbot",
diff --git a/tensorflow/contrib/eager/python/examples/l2hmc/BUILD b/tensorflow/contrib/eager/python/examples/l2hmc/BUILD
new file mode 100644
index 0000000000000000000000000000000000000000..7bdf9053de749af9d09b12ba7b848e21c1fdb8f0
--- /dev/null
+++ b/tensorflow/contrib/eager/python/examples/l2hmc/BUILD
@@ -0,0 +1,39 @@
+licenses(["notice"])  # Apache 2.0
+
+package(default_visibility = ["//tensorflow:internal"])
+
+load("//tensorflow:tensorflow.bzl", "cuda_py_test")
+
+py_library(
+    name = "neural_nets",
+    srcs = ["neural_nets.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        "//tensorflow:tensorflow_py",
+        "//tensorflow/contrib/eager/python:tfe",
+    ],
+)
+
+py_library(
+    name = "l2hmc",
+    srcs = ["l2hmc.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":neural_nets",
+        "//tensorflow:tensorflow_py",
+        "//tensorflow/contrib/eager/python:tfe",
+        "//third_party/py/numpy",
+    ],
+)
+
+cuda_py_test(
+    name = "l2hmc_test",
+    size = "large",
+    srcs = ["l2hmc_test.py"],
+    additional_deps = [
+        ":l2hmc",
+        "//tensorflow:tensorflow_py",
+        "//tensorflow/contrib/eager/python:tfe",
+        "//third_party/py/numpy",
+    ],
+)
diff --git a/tensorflow/contrib/eager/python/examples/l2hmc/l2hmc.py b/tensorflow/contrib/eager/python/examples/l2hmc/l2hmc.py
new file mode 100644
index 0000000000000000000000000000000000000000..729d8525fab31ee214178ca1bcb18dbd069f767a
--- /dev/null
+++ b/tensorflow/contrib/eager/python/examples/l2hmc/l2hmc.py
@@ -0,0 +1,326 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""L2HMC compatible with TensorFlow's eager execution.
+
+Reference [Generalizing Hamiltonian Monte Carlo with Neural
+Networks](https://arxiv.org/pdf/1711.09268.pdf)
+
+Code adapted from the released TensorFlow graph implementation by original
+authors https://github.com/brain-research/l2hmc.
+"""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+import numpy.random as npr
+import tensorflow as tf
+import tensorflow.contrib.eager as tfe
+from tensorflow.contrib.eager.python.examples.l2hmc import neural_nets
+
+
+class Dynamics(tf.keras.Model):
+  """Dynamics engine of naive L2HMC sampler.
+
+  Args:
+    x_dim: dimensionality of observed data
+    loglikelihood_fn: log-likelihood function of conditional probability
+    n_steps: number of leapfrog steps within each transition
+    eps: initial value learnable scale of step size
+  """
+
+  def __init__(self, x_dim, loglikelihood_fn, n_steps=25, eps=.1):
+    super(Dynamics, self).__init__()
+
+    self.x_dim = x_dim
+    self.potential = loglikelihood_fn
+    self.n_steps = n_steps
+
+    self._construct_time()
+    self._construct_masks()
+
+    self.position_fn = neural_nets.GenericNet(x_dim, factor=2.)
+    self.momentum_fn = neural_nets.GenericNet(x_dim, factor=1.)
+
+    self.eps = tfe.Variable(
+        initial_value=eps, name="eps", dtype=tf.float32, trainable=True)
+
+  def apply_transition(self, position):
+    """Propose a new state and perform the accept or reject step."""
+
+    # Simulate dynamics both forward and backward;
+    # Use sampled Bernoulli masks to compute the actual solutions
+    position_f, momentum_f, accept_prob_f = self.transition_kernel(
+        position, forward=True)
+    position_b, momentum_b, accept_prob_b = self.transition_kernel(
+        position, forward=False)
+
+    # Decide direction uniformly
+    forward_mask = tf.cast(
+        tf.random_uniform(shape=[tf.shape(position)[0]]) > .5, tf.float32)
+    backward_mask = 1. - forward_mask
+
+    # Obtain proposed states
+    position_post = (
+        forward_mask[:, None] * position_f +
+        backward_mask[:, None] * position_b)
+    momentum_post = (
+        forward_mask[:, None] * momentum_f +
+        backward_mask[:, None] * momentum_b)
+
+    # Probability of accepting the proposed states
+    accept_prob = forward_mask * accept_prob_f + backward_mask * accept_prob_b
+
+    # Accept or reject step
+    accept_mask = tf.cast(
+        accept_prob > tf.random_uniform(tf.shape(accept_prob)), tf.float32)
+    reject_mask = 1. - accept_mask
+
+    # Samples after accept/reject step
+    position_out = (
+        accept_mask[:, None] * position_post + reject_mask[:, None] * position)
+
+    return position_post, momentum_post, accept_prob, position_out
+
+  def transition_kernel(self, position, forward=True):
+    """Transition kernel of augmented leapfrog integrator."""
+
+    lf_fn = self._forward_lf if forward else self._backward_lf
+
+    # Resample momentum
+    momentum = tf.random_normal(tf.shape(position))
+    position_post, momentum_post = position, momentum
+    sumlogdet = 0.
+    # Apply augmented leapfrog steps
+    for i in range(self.n_steps):
+      position_post, momentum_post, logdet = lf_fn(position_post, momentum_post,
+                                                   i)
+      sumlogdet += logdet
+
+    accept_prob = self._compute_accept_prob(position, momentum, position_post,
+                                            momentum_post, sumlogdet)
+
+    return position_post, momentum_post, accept_prob
+
+  def _forward_lf(self, position, momentum, i):
+    """One forward augmented leapfrog step. See eq (5-6) in paper."""
+
+    t = self._get_time(i)
+    mask, mask_inv = self._get_mask(i)
+    sumlogdet = 0.
+
+    momentum, logdet = self._update_momentum_forward(position, momentum, t)
+    sumlogdet += logdet
+
+    position, logdet = self._update_position_forward(position, momentum, t,
+                                                     mask)
+    sumlogdet += logdet
+
+    position, logdet = self._update_position_forward(position, momentum, t,
+                                                     mask_inv)
+    sumlogdet += logdet
+
+    momentum, logdet = self._update_momentum_forward(position, momentum, t)
+    sumlogdet += logdet
+
+    return position, momentum, tf.reduce_sum(sumlogdet, axis=1)
+
+  def _backward_lf(self, position, momentum, i):
+    """One backward augmented leapfrog step. See Appendix A in paper."""
+
+    # Reversed index/sinusoidal time
+    t = self._get_time(self.n_steps - i - 1)
+    mask, mask_inv = self._get_mask(self.n_steps - i - 1)
+    sumlogdet = 0.
+
+    momentum, logdet = self._update_momentum_backward(position, momentum, t)
+    sumlogdet += logdet
+
+    position, logdet = self._update_position_backward(position, momentum, t,
+                                                      mask)
+    sumlogdet += logdet
+
+    position, logdet = self._update_position_backward(position, momentum, t,
+                                                      mask_inv)
+    sumlogdet += logdet
+
+    momentum, logdet = self._update_momentum_backward(position, momentum, t)
+    sumlogdet += logdet
+
+    return position, momentum, tf.reduce_sum(sumlogdet, axis=1)
+
+  def _update_momentum_forward(self, position, momentum, t):
+    """Update v in the forward leapfrog step."""
+
+    grad = self.grad_potential(position)
+    scale, translation, transformed = self.momentum_fn([position, grad, t])
+    scale *= .5 * self.eps
+    transformed *= self.eps
+    momentum = (
+        momentum * tf.exp(scale) -
+        .5 * self.eps * (tf.exp(transformed) * grad - translation))
+
+    return momentum, scale
+
+  def _update_position_forward(self, position, momentum, t, mask):
+    """Update x in the forward leapfrog step."""
+
+    mask_inv = 1. - mask
+    scale, translation, transformed = self.position_fn(
+        [momentum, mask * position, t])
+    scale *= self.eps
+    transformed *= self.eps
+    position = (
+        mask * position +
+        mask_inv * (position * tf.exp(scale) + self.eps *
+                    (tf.exp(transformed) * momentum + translation)))
+
+    return position, mask_inv * scale
+
+  def _update_momentum_backward(self, position, momentum, t):
+    """Update v in the backward leapfrog step. Inverting the forward update."""
+
+    grad = self.grad_potential(position)
+    scale, translation, transformed = self.momentum_fn([position, grad, t])
+    scale *= -.5 * self.eps
+    transformed *= self.eps
+    momentum = (
+        tf.exp(scale) * (momentum + .5 * self.eps *
+                         (tf.exp(transformed) * grad - translation)))
+
+    return momentum, scale
+
+  def _update_position_backward(self, position, momentum, t, mask):
+    """Update x in the backward leapfrog step. Inverting the forward update."""
+
+    mask_inv = 1. - mask
+    scale, translation, transformed = self.position_fn(
+        [momentum, mask_inv * position, t])
+    scale *= -self.eps
+    transformed *= self.eps
+    position = (
+        mask_inv * position + mask * tf.exp(scale) *
+        (position - self.eps * tf.exp(transformed) * momentum + translation))
+
+    return position, mask * scale
+
+  def _compute_accept_prob(self, position, momentum, position_post,
+                           momentum_post, sumlogdet):
+    """Compute the prob of accepting the proposed state given old state."""
+
+    old_hamil = self.hamiltonian(position, momentum)
+    new_hamil = self.hamiltonian(position_post, momentum_post)
+
+    return tf.exp(tf.minimum(old_hamil - new_hamil + sumlogdet, 0.))
+
+  def _construct_time(self):
+    """Convert leapfrog step index into sinusoidal time."""
+
+    self.ts = []
+    for i in range(self.n_steps):
+      t = tf.constant(
+          [
+              np.cos(2 * np.pi * i / self.n_steps),
+              np.sin(2 * np.pi * i / self.n_steps)
+          ],
+          dtype=tf.float32)
+      self.ts.append(t[None, :])
+
+  def _get_time(self, i):
+    """Get sinusoidal time for i-th augmented leapfrog step."""
+
+    return self.ts[i]
+
+  def _construct_masks(self):
+    """Construct different binary masks for different time steps."""
+
+    self.masks = []
+    for _ in range(self.n_steps):
+      idx = npr.permutation(np.arange(self.x_dim))[:self.x_dim // 2]
+      mask = np.zeros((self.x_dim,))
+      mask[idx] = 1.
+      mask = tf.constant(mask, dtype=tf.float32)
+      self.masks.append(mask[None, :])
+
+  def _get_mask(self, i):
+    """Get binary masks for i-th augmented leapfrog step."""
+
+    m = self.masks[i]
+    return m, 1. - m
+
+  def kinetic(self, v):
+    """Compute the kinetic energy."""
+
+    return .5 * tf.reduce_sum(v**2, axis=1)
+
+  def hamiltonian(self, position, momentum):
+    """Compute the overall Hamiltonian."""
+
+    return self.potential(position) + self.kinetic(momentum)
+
+  def grad_potential(self, position, check_numerics=True):
+    """Get gradient of potential function at current location."""
+
+    if not tf.executing_eagerly():
+      # TODO(lxuechen): Change this to tfe.gradients_function when it works
+      grad = tf.gradients(self.potential(position), position)[0]
+    else:
+      grad = tfe.gradients_function(self.potential)(position)[0]
+
+    if check_numerics:
+      return tf.check_numerics(grad, message="gradient of potential")
+
+    return grad
+
+
+# Examples of unnormalized log density/probabilities
+def get_scg_energy_fn():
+  """Get energy function for 2d strongly correlated Gaussian."""
+
+  # Avoid recreating tf constants on each invocation of gradients
+  mu = tf.constant([0., 0.])
+  sigma = tf.constant([[50.05, -49.95], [-49.95, 50.05]])
+  sigma_inv = tf.matrix_inverse(sigma)
+
+  def energy(x):
+    """Unnormalized log density/energy of 2d strongly correlated Gaussian."""
+
+    xmmu = x - mu
+    return .5 * tf.diag_part(
+        tf.matmul(tf.matmul(xmmu, sigma_inv), tf.transpose(xmmu)))
+
+  return energy
+
+
+def get_multivariate_gaussian_energy_fn(x_dim=2):
+  """Get energy function for 2d strongly correlated Gaussian."""
+
+  mu = tf.random_normal(shape=[x_dim])
+  # Lower triangularize and positive diagonal
+  l = tf.sigmoid(
+      tf.matrix_band_part(tf.random_normal(shape=[x_dim, x_dim]), -1, 0))
+  # Exploit Cholesky decomposition
+  sigma = tf.matmul(l, tf.transpose(l))
+  sigma *= 100.  # Small covariance causes extreme numerical instability
+  sigma_inv = tf.matrix_inverse(sigma)
+
+  def energy(x):
+    """Unnormalized log density/energy of 2d strongly correlated Gaussian."""
+
+    xmmu = x - mu
+    return .5 * tf.diag_part(
+        tf.matmul(tf.matmul(xmmu, sigma_inv), tf.transpose(xmmu)))
+
+  return energy
diff --git a/tensorflow/contrib/eager/python/examples/l2hmc/l2hmc_test.py b/tensorflow/contrib/eager/python/examples/l2hmc/l2hmc_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..e33b4cae4c73388dfd78542c9907953f137ad710
--- /dev/null
+++ b/tensorflow/contrib/eager/python/examples/l2hmc/l2hmc_test.py
@@ -0,0 +1,264 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests l2hmc fit to 2D strongly correlated Gaussian executed eagerly."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import time
+
+import numpy.random as npr
+import tensorflow as tf
+import tensorflow.contrib.eager as tfe
+from tensorflow.contrib.eager.python.examples.l2hmc import l2hmc
+
+
+def get_default_hparams():
+  return tf.contrib.training.HParams(
+      x_dim=2,
+      n_samples=200,
+      n_steps=10,
+      eps=.1,
+      n_iters=10,
+      learning_rate=.0003,
+      n_warmup_iters=3)
+
+
+# Relevant functions for benchmarking
+def compute_loss(dynamics, x, scale=.1, eps=1e-4):
+  """Compute loss defined in equation (8)."""
+
+  z = tf.random_normal(tf.shape(x))
+  x_, _, x_accept_prob, x_out = dynamics.apply_transition(x)
+  z_, _, z_accept_prob, _ = dynamics.apply_transition(z)
+
+  # Add eps for numerical stability; following released impl
+  x_loss = tf.reduce_sum((x - x_)**2, axis=1) * x_accept_prob + eps
+  z_loss = tf.reduce_sum((z - z_)**2, axis=1) * z_accept_prob + eps
+
+  loss = tf.reduce_mean(
+      (1. / x_loss + 1. / z_loss) * scale - (x_loss + z_loss) / scale, axis=0)
+
+  return loss, x_out
+
+
+def loss_and_grads(dynamics, x, loss_fn=compute_loss):
+  """Obtain loss value and gradients."""
+
+  with tf.GradientTape() as tape:
+    loss_val, x_out = loss_fn(dynamics, x)
+  grads = tape.gradient(loss_val, dynamics.variables)
+
+  return loss_val, grads, x_out
+
+
+def warmup(dynamics, optimizer, n_iters=1, n_samples=200, loss_fn=compute_loss):
+  """Warmup optimization to reduce overhead."""
+
+  samples = tf.random_normal(
+      shape=[n_samples, dynamics.x_dim], dtype=tf.float32)
+
+  for _ in range(n_iters):
+    _, grads, samples = loss_and_grads(dynamics, samples, loss_fn=loss_fn)
+    optimizer.apply_gradients(zip(grads, dynamics.variables))
+
+
+def fit(dynamics,
+        samples,
+        optimizer,
+        loss_fn=compute_loss,
+        n_iters=5000,
+        verbose=True,
+        logdir=None,
+        decay_lr=True):
+  """Fit L2HMC sampler with given log-likelihood function."""
+
+  if logdir:
+    summary_writer = tf.contrib.summary.create_file_writer(logdir)
+
+  for i in range(n_iters):
+    loss, grads, samples = loss_and_grads(dynamics, samples, loss_fn=loss_fn)
+    # TODO(lxuechen): Proper learning rate decay
+    if decay_lr:
+      grads = [grad * .96**(i // 1000) for grad in grads]
+    optimizer.apply_gradients(zip(grads, dynamics.variables))
+    if verbose:
+      print("Iteration %d: loss %.4f" % (i, loss))
+
+    if logdir:
+      with summary_writer.as_default():
+        with tf.contrib.summary.always_record_summaries():
+          tf.contrib.summary.scalar("loss", loss)
+
+
+class L2hmcTest(tf.test.TestCase):
+  """Unit tests for l2hmc in both eager and graph mode."""
+
+  def test_apply_transition(self):
+    """Testing function `Dynamics.apply_transition` in graph and eager mode."""
+
+    # Eager mode testing
+    hparams = get_default_hparams()
+    dynamics = l2hmc.Dynamics(
+        x_dim=hparams.x_dim,
+        loglikelihood_fn=l2hmc.get_scg_energy_fn(),
+        n_steps=hparams.n_steps,
+        eps=hparams.eps)
+    samples = tf.random_normal(shape=[hparams.n_samples, hparams.x_dim])
+    x_, v_, x_accept_prob, x_out = dynamics.apply_transition(samples)
+
+    self.assertEqual(x_.shape, v_.shape)
+    self.assertEqual(x_out.shape, samples.shape)
+    self.assertEqual(x_.shape, x_out.shape)
+    self.assertEqual(x_accept_prob.shape, (hparams.n_samples,))
+
+    # Graph mode testing
+    with tf.Graph().as_default():
+      dynamics = l2hmc.Dynamics(
+          x_dim=hparams.x_dim,
+          loglikelihood_fn=l2hmc.get_scg_energy_fn(),
+          n_steps=hparams.n_steps,
+          eps=hparams.eps)
+      x = tf.placeholder(tf.float32, shape=[None, hparams.x_dim])
+      x_, v_, x_accept_prob, x_out = dynamics.apply_transition(x)
+      samples = npr.normal(size=[hparams.n_samples, hparams.x_dim])
+
+      with tf.Session() as sess:
+        sess.run(tf.global_variables_initializer())
+        np_x_, np_v_, np_x_accept_prob, np_x_out = sess.run(
+            [x_, v_, x_accept_prob, x_out], feed_dict={x: samples})
+
+        self.assertEqual(np_x_.shape, np_v_.shape)
+        self.assertEqual(samples.shape, np_x_out.shape)
+        self.assertEqual(np_x_.shape, np_x_out.shape)
+        self.assertEqual(np_x_accept_prob.shape, (hparams.n_samples,))
+
+
+class L2hmcBenchmark(tf.test.Benchmark):
+  """Eager and graph benchmarks for l2hmc."""
+
+  def _get_energy_fn(self):
+    """Get specific energy function according to FLAGS."""
+
+    if FLAGS.energy_fn == "scg":
+      energy_fn = l2hmc.get_scg_energy_fn()
+    elif FLAGS.energy_fn == "multivariate_gaussian":
+      energy_fn = l2hmc.get_multivariate_gaussian_energy_fn(x_dim=FLAGS.x_dim)
+    else:
+      raise ValueError("No such energy function %s" % FLAGS.energy_fn)
+
+    return energy_fn
+
+  def benchmark_graph(self):
+    """Benchmark Graph performance."""
+
+    hparams = get_default_hparams()
+    tf.reset_default_graph()
+    with tf.Graph().as_default():
+      energy_fn = self._get_energy_fn()
+      dynamics = l2hmc.Dynamics(
+          x_dim=hparams.x_dim,
+          loglikelihood_fn=energy_fn,
+          n_steps=hparams.n_steps,
+          eps=hparams.eps)
+      x = tf.placeholder(tf.float32, shape=[None, hparams.x_dim])
+      loss, x_out = compute_loss(dynamics, x)
+
+      global_step = tf.Variable(0., name="global_step", trainable=False)
+      learning_rate = tf.train.exponential_decay(
+          hparams.learning_rate, global_step, 1000, 0.96, staircase=True)
+      optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
+      train_op = optimizer.minimize(loss, global_step=global_step)
+
+      with tf.Session() as sess:
+        sess.run(tf.global_variables_initializer())
+
+        # Warmup to reduce initialization effect when timing
+        samples = npr.normal(size=[hparams.n_samples, hparams.x_dim])
+        for _ in range(hparams.n_warmup_iters):
+          _, _, _, _ = sess.run(
+              [x_out, loss, train_op, learning_rate], feed_dict={x: samples})
+
+        # Training
+        start_time = time.time()
+        for i in range(hparams.n_iters):
+          samples, loss_np, _, _ = sess.run(
+              [x_out, loss, train_op, learning_rate], feed_dict={x: samples})
+          print("Iteration %d: loss %.4f" % (i, loss_np))
+        wall_time = time.time() - start_time
+        examples_per_sec = hparams.n_samples / wall_time
+
+        self.report_benchmark(
+            name="graph_train_%s" % ("gpu"
+                                     if tf.test.is_gpu_available() else "cpu"),
+            iters=hparams.n_iters,
+            extras={"examples_per_sec": examples_per_sec},
+            wall_time=wall_time)
+
+  def benchmark_eager(self):
+    self._benchmark_eager()
+
+  def benchmark_eager_defun(self):
+    self._benchmark_eager(defun=True)
+
+  def _benchmark_eager(self, defun=False):
+    """Benchmark Eager performance."""
+
+    hparams = get_default_hparams()
+    energy_fn = self._get_energy_fn()
+    dynamics = l2hmc.Dynamics(
+        x_dim=hparams.x_dim,
+        loglikelihood_fn=energy_fn,
+        n_steps=hparams.n_steps,
+        eps=hparams.eps)
+    optimizer = tf.train.AdamOptimizer(learning_rate=hparams.learning_rate)
+    loss_fn = tfe.defun(compute_loss) if defun else compute_loss
+
+    # Warmup to reduce initialization effect when timing
+    warmup(dynamics, optimizer, n_iters=hparams.n_warmup_iters, loss_fn=loss_fn)
+
+    # Training
+    samples = tf.random_normal(
+        shape=[hparams.n_samples, hparams.x_dim], dtype=tf.float32)
+    start_time = time.time()
+    fit(dynamics,
+        samples,
+        optimizer,
+        loss_fn=loss_fn,
+        n_iters=hparams.n_iters,
+        decay_lr=True)
+    wall_time = time.time() - start_time
+    examples_per_sec = hparams.n_samples / wall_time
+
+    self.report_benchmark(
+        name="eager_train_%s%s" % ("gpu" if tf.test.is_gpu_available() else
+                                   "cpu", "_defun" if defun else ""),
+        iters=hparams.n_iters,
+        extras={"examples_per_sec": examples_per_sec},
+        wall_time=wall_time)
+
+    del dynamics
+    del loss_fn
+
+
+if __name__ == "__main__":
+  tf.flags.DEFINE_string("energy_fn", "scg",
+                         ("The energy function/unnormalized log-probability. "
+                          "Either be `scg` or `multivariate_gaussian`"))
+  tf.flags.DEFINE_integer("x_dim", 2, "Dimensionality of observation space.")
+  FLAGS = tf.flags.FLAGS
+  tf.enable_eager_execution()
+  tf.test.main()
diff --git a/tensorflow/contrib/eager/python/examples/l2hmc/neural_nets.py b/tensorflow/contrib/eager/python/examples/l2hmc/neural_nets.py
new file mode 100644
index 0000000000000000000000000000000000000000..e230ad5e259df5b450897bd815e901e3934cd293
--- /dev/null
+++ b/tensorflow/contrib/eager/python/examples/l2hmc/neural_nets.py
@@ -0,0 +1,84 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Neural nets utility for L2HMC compatible with TensorFlow's eager execution.
+
+Reference [Generalizing Hamiltonian Monte Carlo with Neural
+Networks](https://arxiv.org/pdf/1711.09268.pdf)
+
+Code adapted from the released TensorFlow graph implementation by original
+authors https://github.com/brain-research/l2hmc.
+"""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import tensorflow as tf
+import tensorflow.contrib.eager as tfe
+
+
+class GenericNet(tf.keras.Model):
+  """Generic neural net with different initialization scale based on input.
+
+  Args:
+    x_dim: dimensionality of observed data
+    factor: factor of variance scaling initializer
+    n_hidden: number of hidden units
+  """
+
+  def __init__(self, x_dim, factor, n_hidden=10):
+    super(GenericNet, self).__init__()
+
+    self.v_layer = _custom_dense(n_hidden, 1. / 3.)
+    self.x_layer = _custom_dense(n_hidden, factor / 3.)
+    self.t_layer = _custom_dense(n_hidden, 1. / 3.)
+    self.h_layer = _custom_dense(n_hidden)
+
+    # Scale
+    self.scale_layer = _custom_dense(x_dim, .001)
+    self.coeff_scale = tfe.Variable(
+        initial_value=tf.zeros([1, x_dim]), name='coeff_scale', trainable=True)
+    # Translation
+    self.translation_layer = _custom_dense(x_dim, factor=.001)
+    # Transformation
+    self.transformation_layer = _custom_dense(x_dim, .001)
+    self.coeff_transformation = tfe.Variable(
+        initial_value=tf.zeros([1, x_dim]),
+        name='coeff_transformation',
+        trainable=True)
+
+  def call(self, inputs):
+    v, x, t = inputs
+    h = self.v_layer(v) + self.x_layer(x) + self.t_layer(t)
+    h = tf.nn.relu(h)
+    h = self.h_layer(h)
+    h = tf.nn.relu(h)
+    scale = tf.nn.tanh(self.scale_layer(h)) * tf.exp(self.coeff_scale)
+    translation = self.translation_layer(h)
+    transformation = (
+        tf.nn.tanh(self.transformation_layer(h)) * tf.exp(
+            self.coeff_transformation))
+
+    return scale, translation, transformation
+
+
+def _custom_dense(units, factor=1.):
+  """Custom dense layer with specified weight initialization."""
+
+  return tf.keras.layers.Dense(
+      units=units,
+      use_bias=True,
+      kernel_initializer=tf.contrib.layers.variance_scaling_initializer(
+          factor=factor * 2., mode='FAN_IN', uniform=False),
+      bias_initializer=tf.constant_initializer(0., dtype=tf.float32))
diff --git a/tensorflow/contrib/eager/python/examples/linear_regression/linear_regression.py b/tensorflow/contrib/eager/python/examples/linear_regression/linear_regression.py
index 2259c20741ab689dbe0d08d32ff05fc7f8a2100d..099b712fc06d1d3eb9ab4095f8db7283690bda76 100644
--- a/tensorflow/contrib/eager/python/examples/linear_regression/linear_regression.py
+++ b/tensorflow/contrib/eager/python/examples/linear_regression/linear_regression.py
@@ -75,7 +75,6 @@ def fit(model, dataset, optimizer, verbose=False, logdir=None):
   mse = lambda xs, ys: mean_square_loss(model, xs, ys)
   loss_and_grads = tfe.implicit_value_and_gradients(mse)
 
-  tf.train.get_or_create_global_step()
   if logdir:
     # Support for TensorBoard summaries. Once training has started, use:
     #   tensorboard --logdir=<logdir>
@@ -87,12 +86,13 @@ def fit(model, dataset, optimizer, verbose=False, logdir=None):
     if verbose:
       print("Iteration %d: loss = %s" % (i, loss.numpy()))
 
-    optimizer.apply_gradients(grads, global_step=tf.train.get_global_step())
+    optimizer.apply_gradients(grads)
 
     if logdir:
       with summary_writer.as_default():
         with tf.contrib.summary.always_record_summaries():
-          tf.contrib.summary.scalar("loss", loss)
+          tf.contrib.summary.scalar("loss", loss, step=i)
+          tf.contrib.summary.scalar("step", i, step=i)
 
 
 def synthetic_dataset(w, b, noise_level, batch_size, num_batches):
diff --git a/tensorflow/contrib/eager/python/examples/notebooks/1_basics.ipynb b/tensorflow/contrib/eager/python/examples/notebooks/1_basics.ipynb
index 9fd2d8d1254e32ae75ab5b085986c6e1c05e76f4..51d10a778413cfbb574b4e22e8adcb18bd731dee 100644
--- a/tensorflow/contrib/eager/python/examples/notebooks/1_basics.ipynb
+++ b/tensorflow/contrib/eager/python/examples/notebooks/1_basics.ipynb
@@ -1,495 +1,429 @@
 {
-  "nbformat": 4,
-  "nbformat_minor": 0,
-  "metadata": {
-    "colab": {
-      "name": "Eager Execution Tutorial: Basics",
-      "version": "0.3.2",
-      "views": {},
-      "default_view": {},
-      "provenance": [
-        {
-          "file_id": "0B0kLcpwLFwKEVm9XNkFueGk4bTg",
-          "timestamp": 1504118841551
-        }
-      ]
-    }
-  },
   "cells": [
     {
+      "cell_type": "markdown",
       "metadata": {
-        "id": "U9i2Dsh-ziXr",
-        "colab_type": "text"
+        "colab_type": "text",
+        "id": "U9i2Dsh-ziXr"
       },
-      "cell_type": "markdown",
       "source": [
-        "# Eager Execution Tutorial: Basics\n",
+        "# An introduction to TensorFlow\n",
         "\n",
-        "This notebook introduces the basics of using TensorFlow's eager execution capabilities. It covers concepts such as:\n",
+        "This is an introductory tutorial for using TensorFlow. It will cover:\n",
         "\n",
         "* Importing required packages\n",
-        "* Enabling eager execution\n",
-        "* Creating and using TensorFlow Tensors and Variables\n",
-        "* Using TensorFlow interactively\n",
-        "* Using GPUs with eager execution enabled\n",
-        "\n",
-        "This notebook does *not* cover modeling topics, such as gradients."
+        "* Creating and using Tensors\n",
+        "* Using GPU acceleration\n"
       ]
     },
     {
+      "cell_type": "markdown",
       "metadata": {
-        "id": "z1JcS5iBXMRO",
-        "colab_type": "text"
+        "colab_type": "text",
+        "id": "z1JcS5iBXMRO"
       },
-      "cell_type": "markdown",
       "source": [
-        "# Step 1: Import Eager\n",
+        "## Import TensorFlow\n",
         "\n",
-        "The key imports for eager execution are the following:"
+        "To get started, import the `tensorflow` module and enable eager execution.\n",
+        "Eager execution enables a more interactive frontend to TensorFlow, the details of which we will discuss much later."
       ]
     },
     {
+      "cell_type": "code",
+      "execution_count": 0,
       "metadata": {
-        "id": "RlIWhyeLoYnG",
-        "colab_type": "code",
+        "cellView": "code",
         "colab": {
           "autoexec": {
             "startup": false,
             "wait_interval": 0
           }
         },
-        "cellView": "code"
+        "colab_type": "code",
+        "id": "RlIWhyeLoYnG"
       },
-      "cell_type": "code",
+      "outputs": [],
       "source": [
-        "# Import TensorFlow.\n",
         "import tensorflow as tf\n",
         "\n",
-        "# Import TensorFlow eager execution support (subject to future changes).\n",
-        "tfe = tf.contrib.eager"
-      ],
-      "execution_count": 0,
-      "outputs": []
+        "tf.enable_eager_execution()"
+      ]
     },
     {
+      "cell_type": "markdown",
       "metadata": {
-        "id": "H9UySOPLXdaw",
-        "colab_type": "text"
+        "colab_type": "text",
+        "id": "H9UySOPLXdaw"
       },
-      "cell_type": "markdown",
       "source": [
-        "# Step 2: Enable eager execution\n",
+        "## Tensors\n",
         "\n",
-        "All future TensorFlow calls will execute the\n",
-        "underlying TensorFlow ops immediately:"
+        "A Tensor is a multi-dimensional array. Similar to NumPy `ndarray` objects, `Tensor` objects have a data type and a shape. Additionally, Tensors can reside in accelerator (like GPU) memory. TensorFlow offers a rich library of operations ([tf.add](https://www.tensorflow.org/api_docs/python/tf/add), [tf.matmul](https://www.tensorflow.org/api_docs/python/tf/matmul), [tf.linalg.inv](https://www.tensorflow.org/api_docs/python/tf/linalg/inv) etc.) that consume and produce Tensors. These operations automatically convert native Python types. For example:\n"
       ]
     },
     {
-      "metadata": {
-        "id": "WPTUfGq6kJ5w",
-        "colab_type": "code",
-        "colab": {
-          "autoexec": {
-            "startup": false,
-            "wait_interval": 0
-          }
-        },
-        "cellView": "code"
-      },
       "cell_type": "code",
-      "source": [
-        "tf.enable_eager_execution()"
-      ],
       "execution_count": 0,
-      "outputs": []
-    },
-    {
-      "metadata": {
-        "id": "twBfWd5xyu_d",
-        "colab_type": "text"
-      },
-      "cell_type": "markdown",
-      "source": [
-        "# Step 3: Interactively Use TensorFlow!\n",
-        "\n",
-        "Now you can call TensorFlow functions and get results, immediately! No more `tf.Sessions`!\n",
-        "\n",
-        "TensorFlow will automatically wrap native Python types for you with operator overloading for TensorFlow Tensors."
-      ]
-    },
-    {
       "metadata": {
-        "id": "ngUe237Wt48W",
-        "colab_type": "code",
+        "cellView": "code",
         "colab": {
           "autoexec": {
             "startup": false,
             "wait_interval": 0
-          }
+          },
+          "height": 125
+        },
+        "colab_type": "code",
+        "executionInfo": {
+          "elapsed": 320,
+          "status": "ok",
+          "timestamp": 1526420535530,
+          "user": {
+            "displayName": "",
+            "photoUrl": "",
+            "userId": ""
+          },
+          "user_tz": 420
         },
-        "cellView": "code"
+        "id": "ngUe237Wt48W",
+        "outputId": "b1a1cd60-4eb3-443d-cd6b-68406390784e"
       },
-      "cell_type": "code",
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "tf.Tensor(3, shape=(), dtype=int32)\n",
+            "tf.Tensor([4 6], shape=(2,), dtype=int32)\n",
+            "tf.Tensor(25, shape=(), dtype=int32)\n",
+            "tf.Tensor(6, shape=(), dtype=int32)\n",
+            "tf.Tensor(aGVsbG8gd29ybGQ, shape=(), dtype=string)\n",
+            "tf.Tensor(13, shape=(), dtype=int32)\n"
+          ]
+        }
+      ],
       "source": [
         "print(tf.add(1, 2))\n",
         "print(tf.add([1, 2], [3, 4]))\n",
         "print(tf.square(5))\n",
         "print(tf.reduce_sum([1, 2, 3]))\n",
         "print(tf.encode_base64(\"hello world\"))\n",
-        "print(\"\")\n",
         "\n",
-        "x = tf.constant(2)\n",
-        "y = tf.constant(3)\n",
-        "print(x * y + 1)\n",
-        "\n",
-        "# Most TensorFlow ops are directly usable with eager execution, giving\n",
-        "# results immediately.\n",
-        "print(tf.contrib.signal.hamming_window(x * y + 1))"
-      ],
-      "execution_count": 0,
-      "outputs": []
+        "# Operator overloading is also supported\n",
+        "print(tf.square(2) + tf.square(3))"
+      ]
     },
     {
+      "cell_type": "markdown",
       "metadata": {
-        "id": "IDY4WsYRhP81",
-        "colab_type": "text"
+        "colab_type": "text",
+        "id": "IDY4WsYRhP81"
       },
-      "cell_type": "markdown",
       "source": [
-        "Numpy arrays are supported, too:"
+        "Each Tensor has a shape and a datatype"
       ]
     },
     {
-      "metadata": {
-        "id": "lCUWzso6mbqR",
-        "colab_type": "code",
-        "colab": {
-          "autoexec": {
-            "startup": false,
-            "wait_interval": 0
-          }
-        }
-      },
       "cell_type": "code",
-      "source": [
-        "import numpy as np\n",
-        "\n",
-        "ones = np.ones([3, 3])\n",
-        "\n",
-        "print(\"numpy 3x3 matrix of 1s:\")\n",
-        "print(ones)\n",
-        "print(\"\")\n",
-        "\n",
-        "print(\"Multiplied by 42:\")\n",
-        "print(tf.multiply(ones, 42))"
-      ],
       "execution_count": 0,
-      "outputs": []
-    },
-    {
-      "metadata": {
-        "id": "PBNP8yTRfu_X",
-        "colab_type": "text"
-      },
-      "cell_type": "markdown",
-      "source": [
-        "# Step 4: Define and Print TensorFlow Variables\n",
-        "\n",
-        "To define TensorFlow variables, use the `get_variable()` function as follows:"
-      ]
-    },
-    {
       "metadata": {
-        "id": "3Twf_Rw-gQFM",
-        "colab_type": "code",
         "colab": {
           "autoexec": {
             "startup": false,
             "wait_interval": 0
-          }
+          },
+          "height": 53
+        },
+        "colab_type": "code",
+        "executionInfo": {
+          "elapsed": 215,
+          "status": "ok",
+          "timestamp": 1526420538162,
+          "user": {
+            "displayName": "",
+            "photoUrl": "",
+            "userId": ""
+          },
+          "user_tz": 420
         },
-        "cellView": "code"
+        "id": "srYWH1MdJNG7",
+        "outputId": "5e4ac41c-5115-4e50-eba0-42e249c16561"
       },
-      "cell_type": "code",
-      "source": [
-        "x = tfe.Variable(0.)"
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "(1, 2)\n",
+            "\u003cdtype: 'int32'\u003e\n"
+          ]
+        }
       ],
-      "execution_count": 0,
-      "outputs": []
-    },
-    {
-      "metadata": {
-        "id": "45G7094TxsMb",
-        "colab_type": "text"
-      },
-      "cell_type": "markdown",
       "source": [
-        "## Printing TensorFlow Variables"
+        "x = tf.matmul([[1]], [[2, 3]])\n",
+        "print(x.shape)\n",
+        "print(x.dtype)"
       ]
     },
     {
+      "cell_type": "markdown",
       "metadata": {
-        "id": "UJBJeZ5XxuwA",
-        "colab_type": "code",
-        "colab": {
-          "autoexec": {
-            "startup": false,
-            "wait_interval": 0
-          }
-        },
-        "cellView": "code"
+        "colab_type": "text",
+        "id": "eBPw8e8vrsom"
       },
-      "cell_type": "code",
       "source": [
-        "# This does NOT print the Variable's actual value:\n",
-        "print(\"Printing a TensorFlow Variable:\")\n",
-        "print(x)\n",
-        "print(\"\")\n",
+        "The most obvious differences between NumPy arrays and TensorFlow Tensors are:\n",
         "\n",
-        "\n",
-        "print(\"Printing a TensorFlow Variable's value as a numpy array:\")\n",
-        "print(x.numpy())"
-      ],
-      "execution_count": 0,
-      "outputs": []
+        "1. Tensors can be backed by accelerator memory (like GPU, TPU).\n",
+        "2. Tensors are immutable."
+      ]
     },
     {
+      "cell_type": "markdown",
       "metadata": {
-        "id": "2njjWHcTpBEn",
-        "colab_type": "text"
+        "colab_type": "text",
+        "id": "Dwi1tdW3JBw6"
       },
-      "cell_type": "markdown",
       "source": [
-        "## Changing a TensorFlow Variable's value\n",
+        "### NumPy Compatibility\n",
         "\n",
-        "To change a TensorFlow Variable's value, use its `.assign()` or `.assign_add()` method:"
+        "Conversion between TensorFlow Tensors and NumPy ndarrays is quite simple as:\n",
+        "* TensorFlow operations automatically convert NumPy ndarrays to Tensors.\n",
+        "* NumPy operations automatically convert Tensors to NumPy ndarrays.\n",
+        "\n",
+        "Tensors can be explicitly converted to NumPy ndarrays by invoking the `.numpy()` method on them.\n",
+        "These conversions are typically cheap as the array and Tensor share the underlying memory representation if possible. However, sharing the underlying representation isn't always possible since the Tensor may be hosted in GPU memory while NumPy arrays are always backed by host memory, and the conversion will thus involve a copy from GPU to host memory."
       ]
     },
     {
-      "metadata": {
-        "id": "v3wr6Erbo_hB",
-        "colab_type": "code",
-        "colab": {
-          "autoexec": {
-            "startup": false,
-            "wait_interval": 0
-          }
-        }
-      },
       "cell_type": "code",
-      "source": [
-        "x.assign(42)\n",
-        "print(x)\n",
-        "\n",
-        "x.assign_add(3)\n",
-        "print(x)"
-      ],
       "execution_count": 0,
-      "outputs": []
-    },
-    {
       "metadata": {
-        "id": "uhtynjHVpTB5",
-        "colab_type": "text"
-      },
-      "cell_type": "markdown",
-      "source": [
-        "## Use a Variable just like any other Tensor"
-      ]
-    },
-    {
-      "metadata": {
-        "id": "7PbktdnHoehR",
-        "colab_type": "code",
         "colab": {
           "autoexec": {
             "startup": false,
             "wait_interval": 0
-          }
-        }
+          },
+          "height": 251
+        },
+        "colab_type": "code",
+        "executionInfo": {
+          "elapsed": 238,
+          "status": "ok",
+          "timestamp": 1526420540562,
+          "user": {
+            "displayName": "",
+            "photoUrl": "",
+            "userId": ""
+          },
+          "user_tz": 420
+        },
+        "id": "lCUWzso6mbqR",
+        "outputId": "fd0a22bc-8249-49dd-fcbd-63161cc47e46"
       },
-      "cell_type": "code",
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "TensorFlow operations convert numpy arrays to Tensors automatically\n",
+            "tf.Tensor(\n",
+            "[[ 42.  42.  42.]\n",
+            " [ 42.  42.  42.]\n",
+            " [ 42.  42.  42.]], shape=(3, 3), dtype=float64)\n",
+            "And NumPy operations convert Tensors to numpy arrays automatically\n",
+            "[[ 43.  43.  43.]\n",
+            " [ 43.  43.  43.]\n",
+            " [ 43.  43.  43.]]\n",
+            "The .numpy() method explicitly converts a Tensor to a numpy array\n",
+            "[[ 42.  42.  42.]\n",
+            " [ 42.  42.  42.]\n",
+            " [ 42.  42.  42.]]\n"
+          ]
+        }
+      ],
       "source": [
-        "print(x + 3)\n",
+        "import numpy as np\n",
         "\n",
-        "# This code will broadcast the value across the list of numbers:\n",
-        "print(x * [1, 2, 4])"
-      ],
-      "execution_count": 0,
-      "outputs": []
+        "ndarray = np.ones([3, 3])\n",
+        "\n",
+        "print(\"TensorFlow operations convert numpy arrays to Tensors automatically\")\n",
+        "tensor = tf.multiply(ndarray, 42)\n",
+        "print(tensor)\n",
+        "\n",
+        "\n",
+        "print(\"And NumPy operations convert Tensors to numpy arrays automatically\")\n",
+        "print(np.add(tensor, 1))\n",
+        "\n",
+        "print(\"The .numpy() method explicitly converts a Tensor to a numpy array\")\n",
+        "print(tensor.numpy())"
+      ]
     },
     {
+      "cell_type": "markdown",
       "metadata": {
-        "id": "GVChqwlwy1SI",
-        "colab_type": "text"
+        "colab_type": "text",
+        "id": "PBNP8yTRfu_X"
       },
-      "cell_type": "markdown",
       "source": [
-        "# Step 5: Debug Errors with Instant Feedback\n",
+        "## GPU acceleration\n",
         "\n",
-        "TensorFlow's eager execution helps you identify and debug runtime issues through interactive exploration of code snippets.\n",
-        "\n",
-        "Below, we'll define a length-4 vector, and attempt two `tf.slice()` operations,\n",
-        "one being legal and the other being illegal, leading to a runtime error that is\n",
-        "raised immediately."
+        "Many TensorFlow operations can be accelerated by using the GPU for computation. Without any annotations, TensorFlow automatically decides whether to use the GPU or CPU for an operation (and copies the tensor between CPU and GPU memory if necessary). Tensors produced by an operation are typically backed by the memory of the device on which the operation executed. For example:"
       ]
     },
     {
-      "metadata": {
-        "id": "23ap04N0v4k0",
-        "colab_type": "code",
-        "colab": {
-          "autoexec": {
-            "startup": false,
-            "wait_interval": 0
-          }
-        },
-        "cellView": "code"
-      },
       "cell_type": "code",
-      "source": [
-        "vector = tf.constant([10.0, 20.0, 30.0, 40.0])"
-      ],
       "execution_count": 0,
-      "outputs": []
-    },
-    {
       "metadata": {
-        "id": "FCUMsIYxxRRa",
-        "colab_type": "code",
+        "cellView": "code",
         "colab": {
           "autoexec": {
             "startup": false,
             "wait_interval": 0
-          }
+          },
+          "height": 53
         },
-        "cellView": "code"
-      },
-      "cell_type": "code",
-      "source": [
-        "# Works, because the values of `begin` and `size` (the 2nd and 3rd input\n",
-        "# arguments) are within the bound of `vector`.\n",
-        "print(tf.slice(vector, [1], [3]))"
-      ],
-      "execution_count": 0,
-      "outputs": []
-    },
-    {
-      "metadata": {
-        "id": "T8me2oCNxpFp",
         "colab_type": "code",
-        "colab": {
-          "autoexec": {
-            "startup": false,
-            "wait_interval": 0
-          }
+        "executionInfo": {
+          "elapsed": 340,
+          "status": "ok",
+          "timestamp": 1526420543562,
+          "user": {
+            "displayName": "",
+            "photoUrl": "",
+            "userId": ""
+          },
+          "user_tz": 420
         },
-        "cellView": "code"
+        "id": "3Twf_Rw-gQFM",
+        "outputId": "2239ae2b-adf3-4895-b1f3-464cf5361d1b"
       },
-      "cell_type": "code",
-      "source": [
-        "# The following does NOT work, because the value of `size` (the 3rd\n",
-        "# argument) causes the indices to go out of the bounds of `vector`. The\n",
-        "# error is raised immediately.\n",
-        "try:\n",
-        "  print(tf.slice(vector, [1], [4]))\n",
-        "except tf.OpError as e:\n",
-        "  print(\"Caught error: %s\" % e)"
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "Is there a GPU available:  False\n",
+            "Is the Tensor on GPU #0:   False\n"
+          ]
+        }
       ],
-      "execution_count": 0,
-      "outputs": []
+      "source": [
+        "x = tf.random_uniform([3, 3])\n",
+        "\n",
+        "print(\"Is there a GPU available: \"),\n",
+        "print(tf.test.is_gpu_available())\n",
+        "\n",
+        "print(\"Is the Tensor on GPU #0:  \"),\n",
+        "print(x.device.endswith('GPU:0'))"
+      ]
     },
     {
+      "cell_type": "markdown",
       "metadata": {
-        "id": "irxJhAgar84v",
-        "colab_type": "text"
+        "colab_type": "text",
+        "id": "vpgYzgVXW2Ud"
       },
-      "cell_type": "markdown",
       "source": [
-        "# Step 6: Using the GPU\n",
-        "\n",
-        "You can explicitly place Tensors on the GPU by calling a Tensor's `.gpu()` method. The `.device` property tells you whether the Tensor is backed by CPU or GPU memory.\n",
+        "### Device Names\n",
         "\n",
-        "The first operation executing on the GPU may be slow as TensorFlow initializes. Subsequent uses will be much faster."
+        "The `Tensor.device` property provides a fully qualified string name of the device hosting the contents of the Tensor. This name encodes a bunch of details, such as an identifier of the network address of the host on which this program is executing and the device within that host. This is required for distributed execution of TensorFlow programs, but we'll skip that for now. The string will end with `GPU:\u003cN\u003e` if the tensor is placed on the `N`-th tensor on the host."
       ]
     },
     {
+      "cell_type": "markdown",
       "metadata": {
-        "id": "7J4N9baqaKCL",
-        "colab_type": "code",
-        "colab": {
-          "autoexec": {
-            "startup": false,
-            "wait_interval": 0
-          }
-        }
+        "colab_type": "text",
+        "id": "ZWZQCimzuqyP"
       },
-      "cell_type": "code",
       "source": [
-        "# Create some Tensors\n",
-        "SIZE = 1000\n",
-        "tensor = tf.random_normal([SIZE, SIZE])\n",
-        "print(tensor.device)\n",
         "\n",
         "\n",
-        "if tf.test.is_gpu_available():\n",
-        "  gpu_tensor = tensor.gpu()\n",
-        "  cpu_tensor = tensor.cpu()\n",
-        "else:\n",
-        "  print(\"GPU not available.\")\n",
-        "  cpu_tensor = tensor"
-      ],
-      "execution_count": 0,
-      "outputs": []
+        "### Explicit Device Placement\n",
+        "\n",
+        "The term \"placement\" in TensorFlow refers to how individual operations are assigned (placed on) a device for execution. As mentioned above, when there is no explicit guidance provided, TensorFlow automatically decides which device to execute an operation, and copies Tensors to that device if needed. However, TensorFlow operations can be explicitly placed on specific devices using the `tf.device` context manager. For example:"
+      ]
     },
     {
+      "cell_type": "code",
+      "execution_count": 0,
       "metadata": {
-        "id": "4E-2n7VbzY1n",
-        "colab_type": "code",
         "colab": {
           "autoexec": {
             "startup": false,
             "wait_interval": 0
-          }
-        }
+          },
+          "height": 53
+        },
+        "colab_type": "code",
+        "executionInfo": {
+          "elapsed": 1762,
+          "status": "ok",
+          "timestamp": 1526420547562,
+          "user": {
+            "displayName": "",
+            "photoUrl": "",
+            "userId": ""
+          },
+          "user_tz": 420
+        },
+        "id": "RjkNZTuauy-Q",
+        "outputId": "2e613293-ccac-4db2-b793-8ceb5b5adcfd"
       },
-      "cell_type": "code",
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "On CPU:\n",
+            "10 loops, best of 3: 35.8 ms per loop\n"
+          ]
+        }
+      ],
       "source": [
-        "# Time a CPU-based matrix multiplication\n",
+        "def time_matmul(x):\n",
+        "  %timeit tf.matmul(x, x)\n",
         "\n",
-        "print(\"Time to conduct matmul on CPU:\")\n",
-        "%time tf.matmul(cpu_tensor, cpu_tensor)"
-      ],
-      "execution_count": 0,
-      "outputs": []
+        "# Force execution on CPU\n",
+        "print(\"On CPU:\")\n",
+        "with tf.device(\"CPU:0\"):\n",
+        "  x = tf.random_uniform([1000, 1000])\n",
+        "  assert x.device.endswith(\"CPU:0\")\n",
+        "  time_matmul(x)\n",
+        "\n",
+        "# Force execution on GPU #0 if available\n",
+        "if tf.test.is_gpu_available():\n",
+        "  with tf.device(\"GPU:0\"): # Or GPU:1 for the 2nd GPU, GPU:2 for the 3rd etc.\n",
+        "    x = tf.random_uniform([1000, 1000])\n",
+        "    assert x.device.endswith(\"GPU:0\")\n",
+        "    time_matmul(x)"
+      ]
     },
     {
+      "cell_type": "markdown",
       "metadata": {
-        "id": "vbSFW-T5zhZF",
-        "colab_type": "code",
-        "colab": {
-          "autoexec": {
-            "startup": false,
-            "wait_interval": 0
-          }
-        }
+        "colab_type": "text",
+        "id": "YEOJTNiOvnpQ"
       },
-      "cell_type": "code",
       "source": [
-        "# Time GPU-based matrix multiplications.\n",
+        "## Next Steps\n",
         "\n",
-        "if tf.test.is_gpu_available():\n",
-        "  # First use of the GPU will be slow:\n",
-        "  print(\"Time to conduct first matmul on GPU:\")\n",
-        "  %time tf.matmul(gpu_tensor, gpu_tensor)\n",
-        "  print()\n",
-        "\n",
-        "  # Subsequent uses are much faster:\n",
-        "  print(\"Time to conduct second matmul on GPU:\")\n",
-        "  %time tf.matmul(gpu_tensor, gpu_tensor)"
-      ],
-      "execution_count": 0,
-      "outputs": []
+        "In this tutorial we covered the most fundamental concepts in TensorFlow - `Tensor`s, operations, and devices.\n",
+        "In [the next tutorial](https://github.com/tensorflow/models/tree/master/official/contrib/eager/python/examples/notebooks/2_gradients.ipynb) we will cover automatic differentiation - a building block required for training many machine learning models like neural networks."
+      ]
+    }
+  ],
+  "metadata": {
+    "colab": {
+      "collapsed_sections": [],
+      "default_view": {},
+      "name": "TensorFlow: An introduction",
+      "provenance": [],
+      "version": "0.3.2",
+      "views": {}
     }
-  ]
-}
\ No newline at end of file
+  },
+  "nbformat": 4,
+  "nbformat_minor": 0
+}
diff --git a/tensorflow/contrib/eager/python/examples/notebooks/2_gradients.ipynb b/tensorflow/contrib/eager/python/examples/notebooks/2_gradients.ipynb
index 1e65b27bc8be8b05fefa38dffae7799b1e503bd3..9c1af9c2084bac7ae6369babeaa13720e6199097 100644
--- a/tensorflow/contrib/eager/python/examples/notebooks/2_gradients.ipynb
+++ b/tensorflow/contrib/eager/python/examples/notebooks/2_gradients.ipynb
@@ -7,12 +7,9 @@
         "id": "vDJ4XzMqodTy"
       },
       "source": [
-        "# Eager Execution: Working with Gradients\n",
+        "# Automatic Differentiation\n",
         "\n",
-        "This notebook demonstrates:\n",
-        "\n",
-        "* How to get gradients using TensorFlow's eager execution capabilities\n",
-        "* How to apply the gradients so you can update your variables"
+        "In the previous tutorial we introduced `Tensor`s and operations on them. In this tutorial we will cover [automatic differentiation](https://en.wikipedia.org/wiki/Automatic_differentiation), a key technique for optimizing machine learning models."
       ]
     },
     {
@@ -22,7 +19,7 @@
         "id": "GQJysDM__Qb0"
       },
       "source": [
-        "# Setup: Import eager and enable eager execution.\n"
+        "## Setup\n"
       ]
     },
     {
@@ -40,12 +37,10 @@
       },
       "outputs": [],
       "source": [
-        "# Import TensorFlow.\n",
         "import tensorflow as tf\n",
+        "tf.enable_eager_execution()\n",
         "\n",
-        "\n",
-        "# Enable eager execution.\n",
-        "tf.enable_eager_execution()"
+        "tfe = tf.contrib.eager # Shorthand for some symbols"
       ]
     },
     {
@@ -55,28 +50,15 @@
         "id": "1CLWJl0QliB0"
       },
       "source": [
-        "# Fitting a Simple Linear Model"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "colab_type": "text",
-        "id": "-39gouo7mtgu"
-      },
-      "source": [
-        "## Step 1: Synthesize some data\n",
-        "\n",
-        "To demonstrate fitting a model with TensorFlow's eager execution, we'll fit a linear model to some synthesized data (which includes some noise).\n",
+        "## Derivatives of a function\n",
         "\n",
-        "In the code, we  use the variable names `w` and `b` to represent the single weight and bias we'll use to fit our model."
+        "TensorFlow provides APIs for automatic differentiation - computing the derivative of a function. The way that more closely mimics the math is to encapsulate the computation in a Python function, say `f`, and use `tfe.gradients_function` to create a function that computes the derivatives of `f` with respect to its arguments. If you're familiar with [autograd](https://github.com/HIPS/autograd) for differentiating numpy functions, this will be familiar. For example: "
       ]
     },
     {
       "cell_type": "code",
       "execution_count": 0,
       "metadata": {
-        "cellView": "code",
         "colab": {
           "autoexec": {
             "startup": false,
@@ -84,105 +66,53 @@
           }
         },
         "colab_type": "code",
-        "id": "rQsdCg9PfIL-"
+        "id": "9FViq92UX7P8"
       },
       "outputs": [],
       "source": [
-        "# The constants we'll try to fit our variables to:\n",
-        "true_w = 3\n",
-        "true_b = 2\n",
-        "\n",
-        "NUM_EXAMPLES = 1000\n",
+        "from math import pi\n",
         "\n",
-        "# Our inputs:\n",
-        "inputs = tf.random_normal(shape=[NUM_EXAMPLES, 1])\n",
+        "def f(x):\n",
+        "  return tf.square(tf.sin(x))\n",
         "\n",
-        "# Our labels, with noise:\n",
-        "noise = tf.random_normal(shape=[NUM_EXAMPLES, 1])\n",
-        "labels = inputs * true_w + true_b + noise"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 0,
-      "metadata": {
-        "cellView": "code",
-        "colab": {
-          "autoexec": {
-            "startup": false,
-            "wait_interval": 0
-          },
-          "base_uri": "https://localhost:8080/",
-          "height": 347
-        },
-        "colab_type": "code",
-        "executionInfo": {
-          "elapsed": 374,
-          "status": "ok",
-          "timestamp": 1525154227149,
-          "user": {
-            "displayName": "",
-            "photoUrl": "",
-            "userId": ""
-          },
-          "user_tz": 420
-        },
-        "id": "O4lsC4ckAcar",
-        "outputId": "f8becb3f-498b-4cb7-9ef3-608a68cb65d0"
-      },
-      "outputs": [
-        {
-          "data": {
-            "image/png": "iVBORw0KGgoAAAANSUhEUgAAAecAAAFKCAYAAAAnj5dkAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAIABJREFUeJzs3Xt8VPWdP/7X3M5MkpkkM8mEAAER\nQoICgUBALkUEQ7FucekDEeWL3VZXu121dler39pu1Vbb77b+2m1/3277qNXa2kUptGttt/tDEWqp\nyDWBiC6ES8slXDJJJpfJ3C+/P8JM5nLOmTOTmWQm83r+RebMnJyTAO/z+Xzen/dbFQqFQiAiIqKc\noR7rCyAiIqJYDM5EREQ5hsGZiIgoxzA4ExER5RgGZyIiohzD4ExERJRjtGN9AWE220DWzm02F8Nu\nd2bt/LmukO+/kO8d4P0X8v0X8r0D+XH/VqtJ8lhBjJy1Ws1YX8KYKuT7L+R7B3j/hXz/hXzvQP7f\nf0EEZyIionzC4ExERJRjGJyJiIhyDIMzERFRjmFwJiIiyjEMzkRERDmGwZmIiCjHMDgTERHlGAZn\nIiKiJDy+ADrtTnh8gVH5fjlTvpOIiCjXBIJBbNt9Gq3tNvT0e2Ap1aOxzopNq2uhUWdvfMvgTERE\nJGHb7tPYdfhi5Ovufk/k683NdVn7vpzWJiIiEuHxBdDabhM91treldUpbgZnIiIiEX0OD3r6PaLH\n7ANu9DnEj2UCgzMREZGIMqMellK96DGzyYAyo/ixTGBwJiIiEqHXadBYZxU91lhXCb0ue20pmRBG\nREQkYdPqWgBDa8z2ATfMJgMa6yojr2cLgzMREZEEjVqNzc112LByBvocHpQZ9VkdMYcxOBMRESWh\n12lQZS4ete/HNWciIsqa0a6sNV5w5ExERBk3VpW1xgsGZyIiyrixqqw1XvDxhYiIMmosK2uNFwzO\nRESUUWNZWWu8YHAmIqKMGsvKWuMFgzMREWXUWFbWGi+YEEZERBk3VpW1xgsGZyIiyrixqqw1XjA4\nExFR1ox2Za3xgmvORESUMawIlhmKRs7t7e34x3/8R3zmM5/Bli1bcPnyZTzxxBMIBAKwWq34zne+\nA0EQYj7zzW9+E8eOHYNKpcJTTz2FhoaGrNwAERGNPVYEy6ykPzGn04lvfOMbWLp0aeS1H/zgB9i8\neTO2bt2K6667Djt27Ij5zMGDB3Hu3Dls27YNzz//PJ5//vnMXzkREeWMcEWw7n4PQhiuCLZt9+mx\nvrS8lDQ4C4KAF198EVVVVZHXDhw4gFtvvRUAsGrVKrz//vsxn3n//ffR3NwMAJgxYwb6+vrgcDgy\ned1ERJQjlFQE43R3apJOa2u1Wmi1sW9zuVyRaeyKigrYbLG/lK6uLsyePTvytcVigc1mg9FozMQ1\nExFRCjy+QFYzppNVBHt150mcPG/ndHcKRpytHQqFMvIes7kYWm320uytVlPWzp0PCvn+C/neAd5/\nId+/xVKCl3/3IfYfvwxbrwvW8iIsmTMR962bDY0mc4HRVFYEq7kInXZXwjG9oMW+41ciX4enu4uL\nBDywfm7GrkFMPv/u0wrOxcXFcLvdMBgMuHr1asyUNwBUVVWhq6sr8nVnZyesVvFqMWF2uzOdS1HE\najXBZhvI2vlzXSHffyHfO8D7L+T7t1pN+L+/ao3pDNVpd+HNvWfhdHkz3hmqYUZFzPcKC4WCou9/\n79glfGLxlKztfc6H373cw0Naj07Lli3Dzp07AQBvvfUWVqxYEXN8+fLlkeMffvghqqqqOKVNRDSK\n3F7/qHaG2rS6Fs1NNagoNUCtAipKDVg+pxpur3hwZgMMeUlHzsePH8e//uu/oqOjA1qtFjt37sQL\nL7yA//2//ze2bduGSZMmYf369QCAf/qnf8K3vvUtLFiwALNnz8bdd98NlUqFp59+Ous3QkREw+z9\nyTtDZaI4SPR6dnxFMAA4cd6ObpHrYAMMeUmD85w5c/Dqq68mvP6zn/0s4bXvfe97kT8//vjjI7w0\nIiJKl7l0qDNUssCYLFlM6rjcvubooN9YZxWd7mYDDHks30lENA54fAHYel1AKASruRhWQSsbGLUa\nFbbuapcsGpKsqEh4X3NYONELQMx6tlgDjIYZFqxqnAyPL8AALYHBmYgojwWCQbz+zim898EVuL1D\n68gGQY3mxdfhzlumAxDvDJUsuMod37Byhsx6tg0bVs6IBN3oBhg9/W7sOnIRbae78MfWS9xWJYPB\nmYgoj23bfRrvHOmIec3tDeL3f/4L3G6faGeoZEVD1i2bJnv85nmTJNezu/s9eHXnSXz29lkxAVev\n02BPawf2tHTEvFdstE1sfEFElLfkgiwAtJy0RaaOq8zFkdFssqIhFzsdsscRCsFSKp3Mte/4lYSy\nnUqqiNEwBmciojE0krKWckEWAOwDHtHtSmVGvWRwNZsMqKkySh4XdBpYyorQWCdfuyI+4CZ7IOC2\nqlic1iYiGgPpdHGKz5wOB1mxjGwAMJv0otuV9DqNbLKYqViQPO72BvDG3rPYtLoWLrcf70VV/4oW\nv11L7lq5rSoRgzMR0RhQmu0MyAdyqSAKAAvqrZLZ0GJZ1OFkMQBYv+J6/LntciTJLFprexc2rJyB\nLWvr8T/netAz4E14T3zATfZAwKztWAzORESjLNn6a3S2MyAfyDetrkUoFIrL1tagefFU/O2y6ySv\nITqLWmwfs8Ppg0ckMAOxo+IF9VWKA26yBwIaxuBMRDTKlKy/hqeDlQTy/7WmHnfeUhuzz7lmUrmi\n2tLhZLF4SqehUwm4yR4IaBiDMxHRKEtl/VVpINfrNKixKu9hkKwymNJp6HQCrtQDAQ1jcCYiGmWp\nrL9mOpHK6fFh69uncOJcD+wDXlhK9WiorUTzwhpYSg0x3zuVUTEDbmYxOBMRjQGlgS9TiVThpLL4\nJK/ufg/2tAwVB6mIyxgXGxUDQHefO/JnTk9nB4MzEdEYSGU6eP2K6+F0+3HinB29Do9oIE82TR2f\nVCZGKmNcr9OgoswQyRjv7vfAIKgBqODxBliGMwsYnImIxlB4v7LSzk9LZ1fjnjV1KNZrJd/TWGfF\nw3c1Rs6TrJJYPCUZ49F9mlmGM/MYnImIRlH0CFerUaXc+em941fg8vrxd7fNgqlYkNxmVVwkYP3y\naQCSVxKLl0rGeDSxoE7pYXAmIsqAZNPKYiPcYoMOFzodkfco7fzU0t6F1vY/Y7K1BE63T/Q9+49f\nxicWT1FUSSye2aSH1xeI1OVWGtzjgzqlj8GZiGgElJbhFBvhSgXLZJ2fACAE4KJtUPJ4V68LZzv6\nMH1ymWxSmZhBtw9Pv3woci/rV0xXFNxZhjNzGJyJiEZASRnOVNd8ozs/KR3tJlAB33n9aCQDO7G3\nsx71U83QaVU4ftYO+4Abgk4DtzcQWU+OvhclwZ1lODOHwZmIKE1Ky3CmuuZbbtQDKhUaaitj+h+n\nIngtXyv+YUEsO9zjC8DW68K//eqoZC3tZ+9fhEAgiNZTXeh1eGEQhj7r9QVYhjMLGJyJiNKktHpX\nqmu+To8fT790EGaTgImWYlzucY74WsMPC2L0Og0ErRp2kQYWwNC9bH37FE6et6PP4YXZqMf8ukps\nWDkdDqeP+5yzgMGZiChNSqt3KV3zFbQqeP2hyOh1qNuTF3qdGh5fUPazydgH3Hh150mcPG8XXRuX\nuxdBp8G+qNaQdsdQ4RKNWsWtU1nC3eJERCny+ALotA+NZhvrrKLviV9/3bS6FsvmVMue1+sPib6u\nUqV5oVEEnRr7jl9Bd78HIQxPd7/+zikAww8Q4sSvq7W9Cx6feOcqGhmOnImIFBLLzJ4/sxKrF07G\nsVPdsmU4NWo17l1bj5Pn7Sknebm9QSyfU40j7TbRNWElfH7xkfd7H1zBnbfUQq/TYNPq2si6cp/D\nC0upAbOmluO9qFFzNG6dyh4GZyIihcQys9850oHmpho898BNSctwprqlKUytGirh+T/netIKztWW\nIlzpcYkec3uHksEmVhRj2+7TaDvTjT6HF+VGPRpqK7Bh5QyckHig4Nap7OG0NhGNC+Gp5mxNsybL\nzAYQad0oJRAMIhQKRTKdlQqGgE67SzJhKxmXxy//hlAo8uARnvYOryu/sfes4ql7yhyOnIkor8kV\nAckkpZnZcrbtPo13jqS+Ncpi0qOmypj2vue+QR/0WjU8IlPbBmGogpjcg8ez9y+K/DlZ60jKDAZn\nIsprckVAHr1nYca+z0j7KqdaiCRaSZEOGo0K9VPNMVnTSpUbBcybWYl3Wy8lHFs2txouj1/2wcPh\n9CnuoEWZweBMRHkr2VSz25tkOjcFep0GDTMqsEckwCmZ3k21EEm0C50OPP7DffB4AzAIGoRCoZS2\nVjXOrMTmNXXQadRoOWlDz4AHZSU6LKiz4p5bZ8IfkK5GFr8ljMlfoyPt4Lx9+3a8+eabka+PHz+O\n1tbWyNezZ8/GggULIl+/8sor0Gj4pEVEmZNsqtne78nICCQ8dd52phvAUIJWMDQ03bygXnoKPboZ\nRqqFSOKFE8FSTQibaCnG5jV10KjVQ9nYwRCOtneh1+FB25luaDSnsWl1rWSiGteVx0baf283btyI\njRs3AgAOHjyI//7v/445bjQa8eqrr47s6oiIZCSbajaX6jHQJ56lrEQ4uO48dCGmjGbw2rbfeTMr\nRYtwSK2Dz59Zmdaas5jwA4Icg6DGV/6uKdKAY9vu0zH3Eb0EEH7A4LpybsjItPYPf/hDvPDCC5k4\nFRGRYnJbkxrrKmEQtBgQ+Vwq7R27+z1QSxQBaTvdDc+qQMI5tr7dHjP9HQ6CtyyYhBpriWw3KRWk\nSn7EShaYAeBjDZNQrB/6b37A6cXh/+kUfV+4tCfXlXPHiINzW1sbJk6cCKs1NtXe6/XiscceQ0dH\nB9auXYvPfvazsucxm4uh1WbvL4LVasraufNBId9/Id87MP7v/+G7GlFcJGD/8cuw2V0wl+qxZM5E\nPLh+LoDY+w8Egnj5dx8OvbfXBWt5EZbMmYj71s2GRjO8s/TFNz6ICfhSgdA+4IZG0MFaWRI5/0/e\n+ADvHktclwaAAx9ehcsjPy2tJDADgLXcgAX1VXj70PlIk4toRXotHlg/F3pBi5++eRxvHzwHj1d8\nnTr+PmoUXkOuy+e/+yMOzjt27MCnPvWphNefeOIJ3HHHHVCpVNiyZQuampowd+5cyfPY7SMv7C7F\najXBZhN7fi4MhXz/hXzvQOHc/7qlUzEw6MFRXxfs/R4cOH4ZXq8fD9/ViJ6e4VHq1l3tMUG30+7C\nm3vPwunyxrR3fO+Ysqlns8mAgNcX+RnHnz9essAMDK1jz5tZibbT3TFtHOPNq63E8jnVeOvAedHz\neLx+/OWCHbuOXExa9CT+PsaDfPi7L/fwMOLgfODAAXz1q19NeP2ee+6J/HnJkiVob2+XDc5EROmS\nWkstLhKwfvk0ANlp7xidLDWSrVLRFtRbsbm5Dp5VQ1PvxmIBb+w9G7MWPG9mBUKhEP7tV0clR9qV\n5UUo0mvRclJ8KlvqPig3jKhC2NWrV1FSUgJBEGJeP3v2LB577DGEQiH4/X60tLRg5syZI7pQIiIx\nckFx//HLkYphSoqIAMNJZmLUqqEmFBWlBjQ31WDT6tpIZTKb3Zk0qMtVBrOY9JFzAsPblor1Wmxu\nrsNzD9yEbz64BF/7TBM8ngDeOdJxrWuVuCVzJg7tX05SVWzZnGomfeWgEY2cbTYbLBZL5Ouf/OQn\nWLRoERobG1FdXY0777wTarUaq1evRkNDw4gvlojGv2TJWvHkgq7N7sLZjj5Mn1yWkfaOK+dPwtrF\nU1Fm1EOrUSVkZAs6FTw+8bGsXqfGktlV+GPr5YRjy+dUY8vaetn71WpU2HXkIlpOdsoG3IprmeH3\nrZuNy1f7YTEJku+3lOpx79r6SDY35Y4RBec5c+bgpz/9aeTrBx98MPLnL33pSyM5NREVGLkynHLB\nQy7oqtTAC68fjZxr3sxK7BbZyiTW3hEQ31YUvpb49eVk+5c9viBuXVgDrUYje14p8ZXQxKgAPHpn\nA2qqTNBo1NDrNFhQXyX5uQV1Vk5n5yhWCCOinCBXhlNsL3GY3Eg3nMUcPtetCyejualGci9v9Kg9\nflsRAHT3uSN/Tmd9efeRDty7dlbS7UrxswceX0DR2rGl1ABrXAWvTatrEQyFsO+DK5HEMoOgwfK5\nnM7OZQzORDTmlCZrSYke6fb0u6GSKNBx9FQ3nnvgpoTgGAgGsXVXu+iovaLMkDCir59qTqsUZ9uZ\nHnh8AckymFKzB6saJyddOwbEE7s0ajW2rKnHxltqYbM7AZUK1vIijphzHIMzEY25kXZ80qjVkZHu\n2Y4+vPD6UdH39Qy4I2vQ0eeTG7UHAsGEgiL7jl+BRF0SWcnuRap4idcfkK0IJuhUWNEwSXYkrNdp\nUFOVv/t+Cw2DMxGNuZF2fArT6zSYPrlMeg0awHdePxpJmtq0uhb+QEhy1L637RK8EoU7lBYLiSZ1\nL0Mj91N496h48ZK2092yFcG8vhBUKhUTu8YR/iaJaMyF143FpLoHV+5c4QAXHpFu231adtTu8QbT\nCsJSpO4lvE9bKgD3ObwoNwriB69pbe+KbBsDALfXj067M+Y1yh8cORNRTshk44XwZ9rOdMPW64IK\n4lPCre1dWLds2oi6RSlRbhTQNKtK9F6UFC+xlBrQUFsRU2glXnjKPLxG3namGza7S3HWO+UWBmci\nygnR68YjbbwQPtfnNhTh4LEOfEdiDdo+4IbL45fM9s6EcqOAZ+9bDFOx+MhXSUWy6IeUd1vFR9jh\nKfN0s94pt/AxiohySjiTOZXAHK7SJTaFayrWoUKi4lc4oG1aXYtbF05GNgaWbq8fb+w9i8vdg6LX\nl6wi2c3zqrGqcTL8gRDuWlWLxTdMEH1vY10lAOktXvHT3pTbOHImorwltfXozlumY8cfz0amdvWC\neNSNXgMOBkOi3Z1SMVTeU4VA1NDW7R3K9t7TeikmES08xSy3T3tSZQk+/Isde49dgV7QAAjB7Q3C\nIKgBqOD1BWKm/7v73CPKeqfcweBMRHlLagr35PleXOh0RF53X8u41qiBwLUAbBA0CIVCCASDQxnb\np7pGdC2L6q24Z00dnvv5Ick9yVJTzGLr7cUGbdw9RCd7Dd3EsjnVuDeq7Gemst5p7DE4E1Fekkuk\n6rA5RF8PRI2M3d6h5hHBENBUZ0WvI3mRDznGEgFeXwB2BcVC4gurxK+3F+m1+Porh5Ke5+T53piv\n5Ubh7DyVXxiciSjjUm1ekY6efrdkhrXcnuB477Z2yGZBp3Ieh9MLs0yjiTD7gBu2XhcErTrmZxRe\nb+9U0OEKGPoZxE9VR2eqd/W6RpT1TmOHwZmIMibd5hXp2HVEOrtarppWvFQCebLzHDphg0bBbQo6\nDb63rRV2hw8Wk4AF9VUxPyO56eloekGTMFUdnal+5q/dWX1AouxhtjYRZUx4Dbi734MQYot9ZJLH\nF0Dbaek14urKsUt6CihIKnN7A7A7fACAngEvdh2+iNfeORXJOgcgWUhFKYOgTTnrnXIHR85ElBEj\nbV6RCrkpbQCwlhlwyebMyPfKJLUKQAgQi99/bO3A0XYb7ANeWEr1mDezErcunIyWk12wO8Tv1Xtt\n+YAZ2OMPR85ElBFKmldkytuHz0seU6uAY6d7Mva9MikoEZiBofaWPQPeyIzD7iMdUKlUeOa+RZKl\nO5mBPX4xOBNRRsgV05ALIh5fABc7B3DR5lBUJMPjC2D/h9K9jTO1hpwLWtu7IOg0aJpVJXqcGdjj\nF6e1iQjAyDOsU93GEwgG8do7p7Dvg8uRfbsGQYPlc6tx960zJRPIbL2umD2/41l4xiGTdccpPzA4\nExW4TGZYpxJEtu0+jd1HYrcwhfceq1QqbG6uE39gCOXe0DiV7PBUhGccMll3nPIDgzNRgctkowSl\nQcTjC6DlpPTUdGu7DYFAEG1nuhMeGKzmYhgEdWS0nQuyNZUeP+MQ3gdN4x/XnIkKWLIMa6WNEuIb\nTyRrXtHn8MgW6uju92BP6yXRLVl6nQY33Sje/GG8ELRqNDfVcNq6gHHkTFTAlGRYy43U0p0SL9Jr\nIWhU8AbEh5xS08QtJ20IBEM4fjY3s7GTUauGZuXNJj2cHr/o2rleq8a3/mEpypmFXdAYnIkKWCqN\nEsTWf1OdEo8O5lKBGZCeJu4Z8GSk1OZYWXzDBKxfcT3KjHr8+t0zoslzK+ZPYmAmBmeiQqYkw1pq\ndLx+xfUpFx2JD+ZSykp06Bv0JbyercSr0WAQNNiyth7F+qH/dpmBTXIYnIkKXLIgITU6drr9KU2J\ny61vR6soNWDuDDP+2Ho54Vi+BmYA+FjDxEhgBpQnz1FhYnAmKnByQUIuoJ44Z1c0JR6eDvf6Aoo6\nLTXWVSKYg9ul0lVRKj8iZgY2iWFwJiIA4kFCLmGs1+HB4hsmoPujqwnH6qeWIxAMYeuu9pjpcL3M\nFihTkRY3Xl+B25dch+d/cXjkN5QDyo0CvvaZJpiKxctvEklJOzgfOHAAjz76KGbOnAkAqKurw7/8\ny79Eju/btw/f/e53odFocPPNN+Ohhx4a+dUS0aiSSxgTdBq0X+wFMLwWrNepgJAK7x+/gpb2zphA\nnKz94YDLjwMfXR1KFvPnzh7mkegf9MLl8TM4U8pGNHJevHgxfvCDH4gee+655/DSSy9hwoQJ2LJl\nC9auXYvaWiY6EOWD6MxsqYQxtzcQ2QoUXgv2+EIAQteOiwdYg6BGKAR4fOLHMxGY9Tq15PnV6qEm\nE+kyCBrF5UPZmILSlZVp7QsXLqCsrAwTJ04EAKxcuRLvv/8+gzNRjhPLzJ4/sxKrF07GsVPdsA+4\nUW6U3qOrRCYre5mNAubXWdF2ujsmmW353Go8+zPxqfFQEBC0Knj9qa1rq1XAyvmToFar8M4RZdu5\nGmZYmORFaRlRcD59+jT+4R/+AX19fXj44YexfPlyAIDNZoPFYom8z2Kx4MKFCyO7UiLKOrHM7HeO\ndKC5qQbPPXDTUGKXP4inXzo4hlc5zHXtAeHhDXOgUathLS+CVqPC1rfbJbddhQD4ZfZYSwmGgOam\nKagyF0GlUqG1vQs9A27ZUt/NTVNS/j5EwAiC87Rp0/Dwww/jE5/4BC5cuIBPf/rTeOuttyAI6a2t\nmM3F0Gqz94RptZqydu58UMj3X8j3Dii/f6fLiz+3JW5fAoDWU134zLo5qJlUDrfXD6u5CJ12VyYv\nMy1ubwB7Wjqwp6UDVnMR5s6ohKBTY0/rJdnPpbsl670Pr+LzG+bh0XsWwu3140r3IL7+0/2w9boT\n3qtWD73/wfVzodGMTaVk/t3P3/tPOzhPmDABt99+OwBg6tSpqKysxNWrVzFlyhRUVVWhq6sr8t6r\nV6+iqkq8H2mY3e5M91KSslpNsNkGsnb+XFfI91/I9w4ov/9AMIiv/fSg5FR1d58bD39nN5pmVWHT\n6lo0zKhQVExkNNnsLuw+nN0Zuv0fXMa6pddFpqpLtGrMq60U/VkEg8Af9v0VXq8/5QYimcC/+7l/\n/3IPD2k/zr355pt46aWXAAxNY3d3d2PChKFi9DU1NXA4HLh48SL8fj/27NkTmfImotwQ3axi665T\nuNwj/4Dc6/BGmk9sWl2L5qYaVJQaoFYNJUllmirjZxy5ngEP+hyxWeebVtdiVeMkqCUuOJUGIkRh\naY+cV69ejccffxzvvPMOfD4fnnnmGfz+97+HyWTCmjVr8Mwzz+Cxxx4DANx+++24/vrrM3bRRJS+\n+KQvs0nAoMuv+PPh0pzRhUuMxTq8sfcvOHyiE70O6W5TqcjFMiRq1VDTjmgatRprF0/FHyWm0pU0\nECGKl3ZwNhqN+PGPfyx5fNGiRdi2bVu6pyeiLIlP+pJr3SgmOthEFy7Z3FyHdcum4emXD2YsQOea\nYAii+5ZTaSBCpAT7ORMVEKX1reWIdasKT4+bigUUG8Zv4cGKUr1ooA03EBETbiBClIrx+6+IiBLI\nleNUSq5bVZFei0td2UvuHGuNdVbJQMsuU5RJDM5EBaTMqIfZJIhOZet1ahiLdOgZ8KC8RI95MysQ\nCoVw7HQ3+hxeWEqTd6sCRhb4c9nK+ZNkAy27TFEmMTgTFRC9ToOSIvHgXGUuxlP3LoxJ8Gptt6HP\n4UW5UY+G2gpsWl0LjVqdkenxfPOJm6ZCo06+EsguU5QJDM5EBcTjC8Dp9okec7p9sNmdsJqL8et3\nz8SMiu0OD/a0dECjVmFzc11GpsfzidmoY1IXjSoGZ6ICIhdUu/s9+NrLh2AxCXB6xPfltrZ3Yf2K\n6fjDgXNQqSBbunI8MRbLT1FHNwrhVDZlAoMzUQGR2/ITJre1yj7gxvM/P5y0YMl4M+jyweMLJARe\nsaS4xjprZPqfKF3820NUQOS2/Cih1agKLjADQK8jsTIYMJwU193vQQhDsw/hKmpEI8HgTDTORO87\nFhNdelOVYo1Mf7odI/KcWCERuaQ4luykkeK0NlEOS2UtU2yKdfm8yVi3NDbLOLzlZ/2K6/HLnSdx\n4KNOxaUyg5lrxZxXxAqJyK3fs2QnjRSDM1EOSmctU2zf8Zt7z8Lp8op2RXpj71+w/6POrN1DPlOr\nhmp7W2QKibBkJ2UTgzNRDhILtOGvxQJtsinWDStnxIz8CnGfcipWzp+EtYunxsxYDDi9uNjpQE2V\nEaZiIbJ+L9YukiU7aaQYnIlyTKqBFkg+xWrrdUHQqiPBps/hkc3YLiRTqoxwuv0JJTfDMxRevx/P\n/6IFHTYHgqGhUfVkqxFf+fQCluykrGFwJsox6axlyk2xCjoN/u1XR2Ef8Eamx29fMhVq1VCXpZEQ\ntGp4/bm/EF1jLcGDd8zGntYOtJ3uTgik/kBIcm3/+V+04EKnI/J1MARc6HTg+V+04Nn7FrNkJ2UF\ngzNRjklnLVNuitXtDcDtHcocDk+PO93+EQdmAPjnTQ048FEn3j16KSPny4abGybi3tvqoVGrce/H\n6+FZlZhkp1FDNHlrwOlFh82R8DoAdNgcGHB6I1PcTP6iTOJWKqIck077wUAwiFAoBIMwfMwgqGEQ\nxP+Jnzhnh8UkiB5LxSt/OIlFCLMMAAAgAElEQVSb508as0phgjb5XjCNJvY94UCqZIR7sdMh+dAR\nDA0dJ8oGBmeiHBS9F1mtAipKDWhuqpFcy9y2+zTeOdIRGSEDgNsbhNsrPuXc6/DghussI77OK3YX\n/s8vWyDoUtwwnQHlJTo01FZCp5X/b2xP66W0i4LUVBmhlrg1tWroOFE2cFqbKAeF9yKvWzYtJkM4\nnscXgM3uTDnzWtBpcM+aOpzvdMSsp6bD4xubNefeQR8On1B231KJdMmYigVMthpFf0aTreK/E6JM\nYHAmyiHhoiPRLRvF9jlH74NOJ+s6FArhavcgBl3SdbTHk5EUBfnKpxdIZmsTZQuDM1EOiC86ohc0\nMVPU8fuc4/dBp8rjC+Ibvzgy4uvOFyMpCiJotXj2vsUJ+5yJsolrzkQ5YOvb7TENFKIDc7TW9i4M\nOL0sIJKiTBQFMRULuGGahYGZRgVHzkRjKBAMYuuuU3j36CVF7+/pd+Nip0NyH3Suq7YUodPuyvq2\nK/W1XtNWcxEaZlSwKAjlHQZnojHi8QXwy50n8d7xK4o/oxc0qKkyJu3JnItunj8RaxdNRZFei1f/\nvxM4cd4OlzeYkWIo8UIAHr97PhbPm4yBPldmT040ChiciUZZeH35yImrsDt8KX46hDf2nsWgO9XP\njb332i7jT0cvJ7yejVF0eYke0yeXwSBoMZD50xNlHdeciUZZOJkr9cA8tHd5T+ulhP3LGjWwsnEi\nKkpztxNSIMmOK71O+X9HS26sgtmokzw+n40nKM8xOBMl4fEF0Gl3wuMLiH6d6rmykcwVCAJqlVqy\nslg+ULJf2iBo0NxUg/s/eSMWzpog+p4pVUZsbp6Z6csjGlWc1iaSEL+9yWwSUFIkwOn2Ke6xHC+b\n3aBaT9owt9aSN80o4pmKdBB0atmfT4lBiw0rZ0CjVsd0hOrpd6PMKKBxZiU2r6lT/PsgylUMzkQS\n4vcS9wx40TMwXLQjWY/leIFgEDsPXchKAhQA9A56sfeY8uSyXNNYXwlBq5Hdv20f8ESKiYSrqLEj\nFI1HIwrO3/72t3HkyBH4/X587nOfw8c//vHIsdWrV6O6uhoazdA/lhdeeAETJohPQxHlmlSmn5WW\nhty2+zT2tHSM6LoErQpef462f7rm5saJOHuxHxdtg4o/o9WocO/H6wEAgWAI77Z2iD7AiBUTYUco\nGo/SDs779+/HqVOnsG3bNtjtdnzqU5+KCc4A8OKLL6KkpGTEF0k02uR6KsdTUhoyU2vNC+qtOH6m\nBw63f8TnyhatSoWnP7sIW99uR+upLvQ5vBB0atk15dJiHfyBELQaFTRqFXRa8fdnopgIUT5IOzgv\nWrQIDQ0NAIDS0lK4XC4EAoHISJkon8n1VI6npDSkXLBXqQBTkYB+p3yda4OggVarzunADADvHb+C\njatm4t61s3DX6qFa4V5fAF97+ZDkZ+wOL/ocHuw6clF0WnsoG30yi4lQwUg7OGs0GhQXD40UduzY\ngZtvvjkhMD/99NPo6OjAwoUL8dhjj0Glkm4rZzYXQ6vNXmC3Wk1ZO3c+KOT7T/fel8+bjDf3nlXw\nvkmomVQu+x5TWRGs5qHqWAnXV16EuuvK8WeRPcDRqiuKse+D3F9T9niD8KtUqCwrwmC3EyUmA2pM\nBljLDbD1ukU/Yy0vQs2kcrT96pjo8UAQMOh1qJ5QlvL18O9+4crn+x9xQtiuXbuwY8cOvPzyyzGv\nf+ELX8CKFStQVlaGhx56CDt37sRtt90meR673TnSS5FktZpgsxVuKYJCvv+R3Pu6pVPhdHnR2t4F\n+4Ab5UY9Sop0cLp9sA94YDYZ0FhXiXVLp8JmG4h0lJJKTGqYUSE6Kuwf9CQNzBMtxfjr5fz5Hf7i\nDx/hg9Ndkf3YBkGDynKD5PsbZlTg4qVe0YeXsPfbLmPd0utSmtbm3/3CvHcgP+5f7uFhRMF57969\n+PGPf4yf/vSnMJliv8n69esjf7755pvR3t4uG5yJco1GrcaGlTNwc8NEQKWCtbwIep0mIQgP1cdu\nl2zvGHbnLdNx8nxvpPVgWHxBkXhmowCvP/U91WPp0EedMV+7vQFc7BxEtaUI9gFPZD3ZIGiwfG41\nNq2uhT8QQrlRQK9DfHq/d9CTdttHonyTdnAeGBjAt7/9bbzyyisoLy9POPbFL34RP/rRjyAIAg4d\nOoS1a9eO+GKJRkv8HufogBufHRy/5Sp+i1U4mO88eB4XOh0pX8sN0yx4P4X627nsSo8LZpMejTPL\nsPam61BtKY6MhDVqoHFmJfa0ijcBsYyg7SNRvkk7OP/hD3+A3W7HF7/4xchrN910E+rr67FmzRrc\nfPPN2LRpE/R6PW688UaOmimvSAXcQDAU2fIDAANOLw6f6BQ7BVrbbQgEgmg7042efg9kUi4kLZ9T\njXvWzMTJ8/a8a3QhxT7gwf6POqFRq7FlbX3Msc1r6nC6o1/0IYaZ2lRIVKFQKCc2TWZzbSAf1h6y\nqZDvP5179/gC+OqL+0WDoVoFLLphAjavmYk39v4FR050ot+ZnSYUpmItHr2zAZOtppS7V+ULi0nA\ngvqqmCWAcBvNo+1d6B30wHJtbT+VSmxh/LtfmPcO5Mf9Z23NmSjfhaeci/RauDx+lBn1stuegiHg\nwEdXceCjq0nPrbrWUzhdA04/nvtFCwyCGotvqIJBUCddn843PQPehCprGrUad62qxarGyUAoBKu5\nmCNmKjgMzlSQwmvKLSc70TPgjZTUrCjVY870CpSVCOgdlN93nEym5qTc3iD+dOwKplQZ01qzzgfh\nKmtajUpyrZ/1sqmQMDhTQYpfUw5nT3f3e/DuUfGEpLE2MDg+1pzFhKusxRchSbV+OdF4wUdRKjjZ\natuYbb2D2VnbzgVmkwFFeq3k76W1vSutFp1E+YrBmcalcM9ltzex1GUqdbMzqdwoYGXjJOi16f2z\ny9VlV4Mw8v9GGusq4fL4JX8v4ZE1UaHgtDaNK/H7k63mIjTMqIhZs0ylbnaYkh7J1RVF6OxxiXZT\nUqmAL909H5ayIrSf68XlntQr4uXqwLGi1ICOLvn7MQgaeH0BmE16FBt0GHT50OsYrrIWLkIi9XtR\nUr+caDxhcKZxJX4tudPuSliz1Os0aKyzyvYNjmYx6THnegsOfHQVnmsBWqMeanPo8YWgAhAC4HL7\nJfs0h0LAsz8/BBVUst2Z8pHD5cOqBZPRdrob3f3uoZkBNeDzBSPBd/2K6XA4vZGqamKlTjVqSP5e\nuMeZCg2DM40bcmvJ8T2Xw92NWk7a0DPgiWRriykp0uFPbbG1rwNBYIKlCJe6nAh/rC/JmrDXFwKQ\nE2UFMqp/0Ie1i6bgrlW1kYALICH4FuuH/7uR6sEc/r2E65lHj6yJCgmDM40bcmvJ8T2XNWo1NjfX\nYcPKGZF9zg6XD7sOX0DbmZ5IYGiYYUHbmW7Rc15KMpVbKCylhkgQjg646dTAjv+9SDURIRrvGJxp\n3JBbS5Zas4wOKKZiAfeunRUz5WqzOyVrPcspLdZlrXJYrsnGlLPUyJqoUDBbm8aN8FqymPgAEs7m\njt+eEw7MxmIBv373DL63vS3l67CY9Nh068yUP5ePjEVaTjkTZQFHzjSuxK9ZVpYPZ2sD0t2m7rxl\nOnb88Sxa223o7vcoys6W4nB58dPffZSxe8plTrcfTrcfpmJhrC+FaFxhcKZxJX7Ncsa0Cgz0uSLH\npbpNnTzfG1MaM93APPTZ8Zf0JSUYAi52OnDDNMtYXwrRuMJpbRqXwmuWBmH4+VMum3u81qzONrUK\nqKkyjvVlEI07DM5UMMaqMthYaaqvhEFQlqhVbhSgVg01/qixlkCvG/6vwSBoYCwSn2SbbDVySpso\nCzitTQWjzKiH2SSgZ2Bk3abyhVarwavP3ob/OW2D1+fHd147KloAxSBo8Ox9iyMtM8NFQmx2J6BS\nwVpeBJUqhOd/0YIOmwPB0NCIebLViK98esEY3BnR+MfgTAVDr9Ng1nUW7Dt+ZawvZVS0n+8FANRY\njRhwehGSqrICQNBpYkbAep0GNVWxjeCfvW8xBpxeXOx0oKaKI2aibGJwpoKyec1MtLTb4PbmaKHq\nDOp1eNDV68L2XSfx52OX4Q2IB2ePNxBToEWOqVhg8hfRKOCaM41LUl2p9DoNKssMY3RVo8tsMuB3\ne89i95EO2ezzcIUvIsodHDmTLLEGBbkm+hq1GpVsV6ptu0/jom0w49dQXiKgpEiLrj53zjS2mD3d\njP0fJK9uJlbhKx9+70TjGYMziZIq1hHdenGsiV1jsUEXsy0quivVhpUzJLdSjVTvoBcurz9nArOx\nSIu2093odcgnvy2bUx1T4Ssffu9EhYDBmURJFesAhlsvjjWxa5Tq0dza3oWlN07I6laqXAnMAOBw\n+ZO/CYBOp4r5Oh9+70SFgI/ClCBZ68X4etRjQe4axXT3u/GDX7eNw4aNI/Nu62Vs230aQH783okK\nBYMzJVDSenGspVNQJFm/5ULVctIWWWPO9d87UaFgcKYE4daLYqRaL442uWuk1NgHPJHkr1z/vRMV\nCgZnSpBK68WxIneNwFDVK7UKqCiQbVMjYTbpI1nZuf57JyoUDM4katPqWjQ31aCi1HCt5rIBzU01\nOdW7d/2K6yVrR5cYtHjms4vw/X++BRUcYctaUG+NBN58+L0TFYK0s7W/+c1v4tixY1CpVHjqqafQ\n0NAQObZv3z5897vfhUajwc0334yHHnooIxdLoye+9WIu7nd1OH3wSFT6Cq+dlhn1mDXVjPcKpGQn\nANRPLcPJ831J32cQNFg2N3YrVT783okKQVrB+eDBgzh37hy2bduGM2fO4KmnnsK2bdsix5977jm8\n9NJLmDBhArZs2YK1a9eitpZP3vko3HpxrMgVwwivkYptnwoB+P6ONiw/1Y31N08vmOCsVgGf/cQN\n+MqL+xEQ2dmlF9T40j2NEDRqWM3FkoF3rH/vRIUureD8/vvvo7m5GQAwY8YM9PX1weFwwGg04sKF\nCygrK8PEiRMBACtXrsT777/P4EwpUVIMI7xGGr0vN1p3vwdv7j2LQx8WRmAGhjpFVZmLcUvjZLxz\npCPh+MfmTsT0iWVjcGVElIq0gnNXVxdmz54d+dpiscBms8FoNMJms8FiscQcu3DhQtJzms3F0Gqz\nN31mtZqSv2kcy7f7f/GND0SLYRQXCXhg/dzI6w/f1YjiIgHvf3AJtl636Lku9zizfr1jTa0GplWX\n4juPrIAgaPHIpgUoKdZj//HLsPW6YC0vwpI5E3HfutnQaAor1STf/u5nUiHfO5Df95+RCmGh0MhL\nO9jt2fsP1Go1wWYbyNr5c12u33/81LXT48NbB86Jvve9Y5fwicVTYqZjP7F4CqZXG/Fv29tG65Jz\nyv23z0JDbSVMxQL6+lyR19cvn4Z7b78BZ/7aHfnZ9vRkvq54Lsv1v/vZVMj3DuTH/cs9PKQVnKuq\nqtDV1RX5urOzE1arVfTY1atXUVVVlc63oXFOaura4fZJtnQMF8OoMhcnfL4QVZQa0HTDBMm1Y4Og\n5doxUR5Ka35r+fLl2LlzJwDgww8/RFVVFYxGIwCgpqYGDocDFy9ehN/vx549e7B8+fLMXTGNG+E6\nzt39HoQwPHXderJT8jPlJj28vgA8vkDC5wsR9x8TjU9pjZwXLFiA2bNn4+6774ZKpcLTTz+N3/zm\nNzCZTFizZg2eeeYZPPbYYwCA22+/Hddff31GL5ryn1wdZ49POtQ6nD48/fIhmE0CnJ7Cq/WsUSOS\nhW0Q1AiGQggEg+wYRTTOpL3m/Pjjj8d8PWvWrMifFy1aFLO1igqT3DaodGpjA4DXPxSZegbkWyGO\nFwZBA68vALPJgCK9JqYXtdsbxO4jHVCrVOwYRTTOsGUkZZySbVBye5QNgkZyzXm8U6uG9mhbTAY0\n1lVi/Yrp6HN4sPPQefz52GXRz7S2d2HDyhmc3iYaRxicKeOU9ASW26NcWW5AV687EqD1WjU8/uz3\nSv7nuxrQM+DBG386g95BZf2QM23l/ElYu3hqzGzDG3vP4k9HxQMzEJskR0TjA4MzZZTcWvKREzas\nWzYNpmIBACJlI1vbu2AfcMNsMqDYoMWFTkfsOf1BGAQ13N7EAG0QNCjSa2Af4TS3SgW88t8n0DPg\nRaa3AWvUgFajhscn/4BhEDTYcEstivXD/yyV9K1mxyii8YfBmTJKtieww4OnXz6IpllVkSnu6DrO\nRXotvv7KIYkzq0RftZYXYVJlMQ58JJ3hrUQoNLyOLVb2MhUq1dD5wl2xnti8AIJWjadfPoheh/RD\nhNcXgMPpjQnOStbmmbFNNP4wxZMyKlmf5V6HF7sOX8R/vH0y8lq4jrPL45cMRF5fANWWooTXL3Q6\ncPRUl8gnxk64Jk8wBNh63fiXn+7H7/b9FQvrpVtcAuIjYLmfp1oFrGqcxI5RROMQgzNlVLI+y2F/\nbL2MV986iUBweJhqLNZBL9ECUqdV42qPS/RYsuniseb2BrHr8EWEADQ31Ui2uWyYYUGfwwOPbzgZ\nTu7nubJxMu5dO4vbqIjGIU5rU0Z5fAGsapyMQDCElnYb+mSmcfe0dECjHt4G9Js/nZXM0s71AKzE\nsVPdeO6Bm7B+xfXY+vYpnDhnR6/Dg3KjHiVFOrSd6cYfWy8lZLeLrc031lVyxEw0jjE4U0ZEb5/q\n7vdA0Krg9Sev2xXeBgQA+z6QzkiWky9br6Kzqv/+kzdG9oHvPHQBe1qGO0jFZ7ezxzJR4eF8GGVE\ndClNAIoCMzAcsGx2p2g2thJL50xAjbUkrc+Opvg1Zb1OgzKjHm2nxdfMW9u7Eqa4q2R6MBPR+MHg\nTCOmZLuPlEjAUolnYyvh8wfh8ozNvuRUiGVVy2a3X3twIaLCw+BMafH4Aui0OyNTs+l2hQoHLGt5\nETRq8QCdLGy/13ZFtNKYEjoNUG4U0vqsHL1ODUE7fOUGQYPQtTrY0eSysbl/mahwcc2ZUiJWmrOh\nthJmk6Co3nV8ecropCadVoWAN3E6XC+oMb/Wiv0fXRU950g6UvkCkN17nK4qc3FMMRW3N4B3jnRA\nFVcHW6tRodigE3244P5losLF4EwpESvNuaelA1OqjIqC88caqnHTDdWoqTJGKoUBQ9O7UmvObm8Q\nK+dPwoGPruZ0a0iVauiho6G2AsdOiU/zRyfA9Tk82HnwfEJFNACYUmVkNjZRAWNwJsXk1padbh9W\nNU7C+x9eFc2cVquBSRUl+PAvduw9diVhu5CxWCebdf3j3x5XFJjD1blGm8WkxxfvmoeyEgEXOx0x\n2dfRevrd+OXOkzhx3o6efo/kUrvT7Yc/EMp4KVEiyg8MzqSYfPKSB2sXT8WGW2rx2tvtQ8FnwIOy\nEgGzppZDr9fi3dZLkffHbxd6Y+9fZLdD9Q36FF3jWARmAJhfV4k/HbsUme5Xq4YqhMXT6dR47/iV\nyNdS18tmFkSFjcGZFJNr8xhOXtLrNLj/2h5eW68LCIVQZtRL1sxube/CumXT0HJyZLWxR5v62gjd\nUjq0dh4KhWKm+6WCrldhMRUmgxEVNgZnUkyuzWN08lIgGMSv3z0TGUWWGQXJpCv7gBsXOx2K1qtz\nycrGyVg1fxKgUqGsRJB8+JAaQSfDZDCiwsbgTEmFt0uVGfWKSknGJ43JZUObTQZUmYvSDmIjoVYD\nQZmBrFoFlJYMPViEr89i0mN+XSVUAL6/oy3pw0cIQGmxgH6n/MNH/EicyWBEhY3BmSSJbZsKJ3FJ\nlZJMtSBJY10lXB7/qAdmYCgw11SVoMM2KDoNfcuCydh4S22knaXL40eZUY9fv3tG8cOHqViH/sHk\nswIrGydj7aIpLM1JRAAYnEmG2Lap6CQusWSlPodHtiCI2ahH36AHZpMBc6ab4XT78b3tbZm/eIVc\n7gD+n4eW4Ve7T+PDv3RjwBWA2ShgYVTP6fB9moqFlB8++gd9srMCFpMeC+qHs9aJiAAGZ5IgF4TC\ne3XjR3iBYBA7D12QDEYVpQZ87TNNcLh82HXkIt4/fjntetqZYh9ww+UJwFgsQNBpoXIFoJaoVAbI\nZ6wDQw8f9riSm1KBefmcamxZW8+RMhEl4KM6iVJa8zm6jOe23aexp6VDMhg11lXCVCxgT2sH9rR0\njHlgBoCyEj12HjofadoRwvAMwbbdp2PeGwgGsfPgecm9yRWlBjx17wLJcqBq1VAp0opSA5qbavCZ\n22cxMBORKI6cCUBs0le4W5LctiljsYCtu9qHM7JLBLi80s0nJltLsGl17YiaZGSD3eHBn4+Jt6qM\nnyHYtvs09kTt1Y7XWFeJQDAk2cM6BODxu+dj+uQyBmUiksXgXODkkr6ktk0V6TX4j7dO4v0Ph2td\n9yZJehp0+eDxBbD17VNpN6nIFqmRfnQhELmHCrVqKKFr0+pa+AMhyYcai8nAwExEijA457H40W46\n5JK+Nq2uxcnzvQm1ny/aBnHRNpjS9+lzeLH17VPYF1UdK9eVG/WRQiBy0/yhELB20RRo1Gpo1FC0\nF5yISA6Dcx6SG+2mkvGbLOlr3bJpcLqVlc1Mxlyqx4lzPSM6h0EYCmweXwAqZH9fdEmRLhJM5ab5\nLaWx1byU7AUnIpLD4JyHkm1xUipZ0tfFTkfafZrjOZw+eP3KE8AMggZeXwDma12emhfWwFJqiFz3\nHw6cw5+Oiq8VZ4rTPTQVr9dpFFdHAwCNWo3NzXWSe8GJiJJJKzj7/X585Stfwfnz5xEIBPDEE0+g\nqakp5j2zZ8/GggULIl+/8sor0Gj4H9RIpbPFSUqypK+aKqPkcTla9VBWcnQZaaWBWaUCbmmcjA0r\nZ8Dh9IoGtipzMdYumpr14Gwf8MQ0n0h1RKzXadi4gojSklZw/u1vf4uioiK89tprOHXqFL785S9j\nx44dMe8xGo149dVXM3KRNEzJFielASHZaFDQaVA/1ZzyOnEKA+QEEyxFuPfj9QCAYr30X09LqQEG\nQZ32dixBq4I/EJKdGo9vPpFsRJyJHAAiIiDN4HzHHXfgk5/8JADAYrGgt7c3oxdF0pR0hkqF2Ghw\n/swKBEMhfPXF/ejp90TWeuVaOmaKvd+DAac3UipTPshJFwvR69TwyHSA8vmTL1hLJXDFj4gzlQNA\nRBSWVnDW6XSRP//85z+PBOpoXq8Xjz32GDo6OrB27Vp89rOfTf8qKSKVtU8lxEaDv373DN6JOn84\nKC+bUw29To22Mz2wD7gh6DIftD2+IJ5+6SD6Br2yQa7P4YFH4vuqVEBTfVVM3+R4ZpMeKhVEH3LU\nKmDl/EmKE7gylQNARBSWNDhv374d27dvj3ntkUcewYoVK/Af//Ef+PDDD/HjH/844XNPPPEE7rjj\nDqhUKmzZsgVNTU2YO3eu5Pcxm4uh1WZvKtBqNWXt3KPt4bsaUVwkYP/xy+jqdaGyvAhL5kzEfetm\nQ6MRH6kpuf8aAG6vH21nukWPn+7oww+fWA0AuNLtRCAYwH/vO4cDx6+g15G5vcvhPdPhIFdcJOCB\n9bF/d0xlRbCai9BpdyV83lpehEfubkTFzpN4++A5uDyJQfxj8ycDAN7cezbh2G1Lp+HzG+Ypula5\nn1fbmW58bkMRDMLY5l2Op7/76Sjk+y/kewfy+/6T/q+xceNGbNy4MeH17du3Y/fu3fj3f//3mJF0\n2D333BP585IlS9De3i4bnO12p9JrTpnVaoLNNpC18482jy+AZTdW4dbGSTHTvz094nuPU7n/TrsT\nNpGABwBdvS60n+3CntYOtLbb0i4mkmp7yPeOXcInFk9JmBWYO92Cd450JLx/7nQLnA4P1i+fhs1r\n6/H/vt6KE+ftsA94Iklc65ZOBQA4Xd6EBK9PfWxaxn5eZ/7aPaZJYePt736qCvn+C/negfy4f7mH\nh7Qe6S9cuIDXX38dv/zlL6HXJ65xnj17Fj/84Q/xwgsvIBAIoKWlBbfddls634qiiK1tNsyoQHPT\nFFhKDRlJQkq2pr3r8AXZEpZKTLYaEwqbyOnpF090k4rv/mAQnXYnyox6WIsE3P/JGyWTtUa65SnT\nOQBERECawXn79u3o7e3Fgw8+GHntpZdewiuvvIJFixahsbER1dXVuPPOO6FWq7F69Wo0NDRk7KIL\nldja5p7WS9jTegkVGUpC0us0aKitxJ4WkRHpDEtMyc50LJ9TjXtvq8P2PWfw3gdXIuvVekENny8o\nOqLWC5qEIOfxBXBUYkvZ3qOX8afWy7CU6rF83mSsWzpVdlvTSLY8ZToHgIgIAFShkFib+dGXzemH\nfJjeAOS34nh8AXz1xf1Jp5Kbm2oSkpCU3n94ZN5yshM9A97I9HNFqR6zpprhDwZx4KPOpOdRqYZK\nWsazmPR4/sElkXvz+AKw9boQCASx52iH5L5lg6DB9x75WORzgWAQL/3+I+xXcC2A+M8kk4ZnNBL3\nP491tna+/N3PlkK+/0K+dyA/7j/j09qUWUq24iTrIxyWaiGSaPEj8/Ao1uHy4r3jV2Q2LsWqNhfj\nck9iDsGCemvMdel1GtRYjdi6q122oIj32kNLlbkYgWAQX3/lcErT4iP5mSjBimBElGnchJkDwkFR\nrp9weG0zmehey6mQqzzm8Q1F6WRTLAZBA4OgwZUeZ+TP4f7FqxZMxqrGyfD4YjOnlbSQjF673fp2\ne0qBGUj/Z5Kq8PQ4AzMRjRSD8xhLVo4zHMzCa5vJpJuEpHRkLkavU+OmG6vg9gbg9gYQAiJ/Xjqn\nGg0zLGg73YWvvngAX31xP7buakcgGFT8fcNrtx5fAK2nulK+PiZmEVG+4bT2GEulHGd0Na/ufrfo\nZ9JNQpLLOk5m2Zxqyb2+Le22mCIl8QU65L5vdJ9kYOhn1euQ7xstholZRJRvOHIeY3LT1VK1nZ97\n4CY8/8BNWLVgMipKDVCrhqaOm5tq0m5LqHRkDgwFTVXU92xumiL5gCFVPSw8KyD3fVfOn4R7P14f\nWXcv0mtRbhQUXWNYkXhq+nIAABFXSURBVF6D9Sump/QZIqKxxpHzGEtnK45ep8HEihLc+/F6eFZl\nrtlCfJ1tQacRDa4r50/C2sVTI9/T4wukPOruiZoVSNbtKTphLtWRs8cbgMPplW2iQUSUa/g/Vg5I\ntRVhtEy2JYzOOu7pd+Otw+dx4MOrkc5PBkGDJbOr0Nw0JeFhYNZUs2wt63gqADsPnsfmNXVJs53j\ns8hTUVlexPVmIso7DM45YKRbcTLdqlCv02BPawfebY3d3uT2BrD/w068e63Ax/yZlQgBOHaqC939\nHhgENQAVvL6A5Kg7LBgC9rRegkajjuxBFnvQkEuY0+vUKDFo0evwSn6/JXMmcr2ZiPIOg3MOSXUU\nnK1WhXIBMRwAu/s9CXWtwyPsJbMnoP28XVG3qmR7kOUS5nz+IL5413wIWjWMxQLe2Hs2YfbhvnWz\nJWuOExHlKgbnPJatVoUj2VYFACfP9cKucG04PiM9nnztaj2s5UWRwC42+yDVpYuIKJfxf648pXR/\ntNznO+1O0fcpLXgixZ5CwY9ke5D1Og2KDYldzwCg2KBLGHGzEAgRjQccOecpudFtd78bPf1uTKwo\nSTgWPxVebtRjfl0lNjfPjEyFy2WQK5FKS8hke5A9vgAGXeKj8EGXL7Idi4hoPOHIOU8lG93uOnxB\n9PX4UqF2hwd7Wjrw9VcOR6p2AUMZ5M1NNZF91AZBeQBUEpjVKmDVgslJM9L7HB7YB8SDc6/DMypl\nOYmIRhtHzjkqWQa2XGtHAGg705MwqpSbCr/Q6cDWt9tx79pZAGIzyMOdo/7Udhltp7sjCVfzZ1Zc\ny9Yefq2htgLHTtnQIxFQw0IhYO2iKUkT19gvmYgKEYNzjkklA7t5YY1kcBZLtEqW6PXe8SvYcEtt\npGBHIBjEr989E3MtDTMq0Nw0BZZSQyTwb7wl9kFCo1YlnRK3lIoH1vBDSZFeC5fHjzKjnv2Siajg\nMDjnmFQysC2lBlSkMKosM+pRbtRLJmx5fUG89nY77v/kjZLXEr83GUjcApZODXC5XtLzZ1Zi9cLJ\nMSN0pUVaiIjyEYNzDpGbdj58ohPrlk2DqXi4tnSqpT/1Og3m10lPhQPAifP2SAa3XDa43N7k+Epj\nuw5fQNuZHtnAKtVLOryfurmpBs89cBP7JRNRQWBwziFy0869Di+eefkQFs6KneJOtfTn5uaZOPFX\nOy73OEWP2weGk6yUdsuSEqkBvnaW7Bq6kp7O4QeCTJUqJSLKZQzOOSRZ20a7I3GKO9XSnxq1Gl/5\nu4V47P++B48vmHA8PB0eCIagF9SRql9i70mlbKhc9TMlRU+UPhAQEY0HDM45ROn+4tb2LqxbNi2S\nMKXXaVIq/Vms12HFvEmS0+EAsPXtdtHADADzZlYkJIqNpGyokl7SzMwmokLC4JxjwtPRh090SrZH\n7O534+mXD6LP4U07MIpNh8+bWYFQKISvvrhfMlAaBA2CwRB2tw6vW4+0bKiShxJmZhNRIWFwzjHh\naep1y6bhmZcPSWZWhwN3KoExfho6fjr81++eSTpq93gDOHaqW/RYskQxOeGHhZaTNvQMeGKytcMP\nH0REhYLBOUeZigUsnKW8hGZ8YIwOxIFAEFt3tYtOQ4enw5UkZQFAmVFAr8QDw0jWhePXzqP3OXPE\nTESFhsE5h8VPPZeVSO9RDgfGijJDQhGTMqMeZy/1R94rNtpW2omqcWYl2s50Z61iV/TaefS2MSKi\nQsLgnMPERpNff+WQbGAUKxwitX4cPdpOlpRlMemxoP7a2rbmNCt2ERFlEYNzHogeTcoVHQGkC4eI\niZ6GlkvKWj6nGlvW1kcCb6p7q4mIKDUMznkkEAzCHwhCr1XD4x/a5mQQNFg2txqbVteiu8+taGo6\nLH4aWi7oRmeCp7q3moiIUsPgnMOik7q0GhW+/sphXOh0xLzH7Q3A6fLjctegov3C0eKnoVMNuqns\nrSYiIuXSCs6/+c1v8P3vfx9Tp04FACxbtgyf//znY97z5ptv4uc//znUajXuuusubNy4ceRXWyDE\nOlMZ9Fp02AZF37//o6vY/9FVGAQ1KsuKACQG5ylVRjjdfkXT0Ay6RERjK+2R8+23344nn3xS9JjT\n6cQPf/hD7NixAzqdDnfeeSfWrFmD8vLytC+0kIgldYkF3HhubxAXbYMJgXj5vElYt3Qq/IEQp6GJ\niPJAVqa1jx07hrlz58JkMgEAFixYgJaWFqxevTob3y4vKK1D7fEF0HKyc0Tfa9Dlw9OfXRTZJ1wz\nqRw22wA0anBETESUB9IOzgcPHsT9998Pv9+PJ598EjfeeGPkWFdXFywWS+Rri8UCm00+i9hsLoZW\nm73RnNVqysp53V4/7P0emEv1MAiJP85AIIiXf/ch9h+/DFuvC9byIiyZMxH3rZsNjUad8N4f/Ooo\negbEy3YqZR/woKjEgOnXlURey9b954NCvneA91/I91/I9w7k9/0nDc7bt2/H9u3bY177m7/5Gzzy\nyCO45ZZb0NraiieffBK/+93vJM8RCoWSXojdLt7CMBOsVhNstoGMnlNsXVisxvXWXe0xU9Sddhfe\n3HsWTpc3odzm1l3t2K2wIpgcs0mPgNcXueds3H++KOR7B3j/hXz/hXzvQH7cv9zDQ9LgvHHjRtlk\nrsbGRvT09CAQCECjGRr5VlVVoaurK/Kezs5OzJ8/P5Vrznli68LxVbfkSmKKldtMtkc5vJbc0++G\nTqeGV6TlIwAsqLdyTZmIKI+l3t8PwIsvvojf//73AID29nZYLJZIYAaAefPm4YMPPkB/fz8GBwfR\n0tKCpqamzFxxDkgWdD2+AAD5kpjhAiBhycpnLptTja99pgnPPXATvvW5Jfjuwx/DrQsnwyAM/9wN\nggarF05mMRAiojyX1przunXr8KUvfQmvv/46/H4/nn/+eQDAT37yEyxatAiNjY147LHHcP/990Ol\nUuGhhx6KJIeNB0qCbpW5WHbfcXwBELn3VpTqce/aemjU6pikrv+1ph533lILW68LCIVgvVbpi4iI\n8ltawbm6uhqvvvpqwusPPvhg5M+33XYbbrvttvSvLIcpDbpyJTHjC4DIv1d6mlqv06DGakz3VoiI\nKAexQlgaUgm6qdShZs1qIiICGJzTpjSQplISkzWriYgIYHBOWzbrULN8JhFRYUsrW5uGhQNpvoxw\nPb4AOu3OSEY5ERHlHo6cC4TSoilERDT2GJwLhJKiKURElBvG7ZCJ07fD3F6/oqIpRESUG8bdyFls\n+nb5vMlYt3Rqzk/fKu1clSp7v7KiKURElBvGXXAWm76VajSRK7K9HmwuVV6pjIiIxl5uDyVTpLTm\nda4JP1B093sQwvB68LbdpzNyfoOgRWOdVfRYfNEUIiIae+MqOKfSaCJXjNYDxabVtWhuqkFFqQFq\nFVBRakBzUw2rjxER5aBxNa2dSqOJXKG0icZIsfoYEVH+GFcj53DNazG5On0bfqAQk40HinwrmkJE\nVIjGVXAGxKdv71gxPWenb/PxgYKIiLJrXE1rA+LTtzWTymGzDYz1pUliNyoiIoo27oJzWD41j8jm\nerDHF8DlrkEEfAGOwomI8sS4Dc75KJMPFDF7pwc8sJhYS5uIKF8wOI9TrKVNRJS/OIQah/K1GAsR\nEQ1hcB6H8rEYCxERDWNwHodGe+80ERFlFoPzOMS900RE+Y0JYRmWrbaPqeLeaSKi/MXgnCHZbvuY\nqui90xpBh4DXxxEzEVGe4LR2hmS77WO69DoNJlaWMDATEeURBucM4NYlIiLKJAbnDODWJSIiyqS0\n1px/9KMfYd++fQCAYDCIrq4u7Ny5M3L84sWLWLduHebMmQMAMJvN+MEPfpCBy81N+dhHmoiIclda\nwfnzn/88Pv/5zwMA/vM//xPd3d0J77n++uvx6quvjuzq8kR461J0ucwwbl0iIqJUjShb2+/347XX\nXsMvfvGLTF1P3uLWJSIiypQRBee33noLH/vYx2AwGBKOdXV14Qtf+AI6OzuxefNm3HHHHSP5Vjkv\nm20fiYiosKhCoVBI7g3bt2/H9u3bY1575JFHsGLFCtx///149tlnUVNTE3Pc4XBg586duOOOOzAw\nMICNGzfitddeQ1VVleT38fsD0GoZzIiIiJIGZylOpxMbN27Ef/3XfyV976OPPop77rkHS5YskXyP\nzTaQzmUoYrWasnr+XFfI91/I9w7w/gv5/gv53oH8uH+r1SR5LO2tVCdOnMD06dNFj+3fvx/f+ta3\nAAwF8RMnTuD6669P91sREREVlLSDs81mg8ViiXnt+eefx4ULF9DU1IS+vj5s2rQJn/70p/Hggw9i\nwoQJI75YIiKiQpD2tHamcVo7ewr5/gv53gHefyHffyHfO5Af95+VaW0iIiLKDgZnIiKiHMPgTERE\nlGMYnImIiHJMziSEERER0RCOnImIiHIMgzMREVGOYXAmIiLKMQzOREREOYbBmYiIKMcwOBMREeWY\nggjO3d3d+Pu//3vce++9uPvuu3Hs2LGxvqRR4/f78eSTT+Kee+7BXXfdhcOHD4/1JY26gwcPYunS\npdizZ89YX8qo+uY3v4lNmzbh7rvvRltb21hfzqhrb29Hc3MzfvnLX471pYy6b3/729i0aRM2bNiA\nt956a6wvZ1S5XC48+uij2LJlCzZu3Ji3/+61Y30Bo+HNN9/E3/7t32LdunU4ePAgvv/97+Pll18e\n68saFb/97W9RVFSE1157DadOncKXv/xl7NixY6wva9ScP38eP/vZz7BgwYKxvpRRdfDgQZw7dw7b\ntm3DmTNn8NRTT2Hbtm1jfVmjxul04hvf+AaWLl061pcy6vbv349Tp05h27ZtsNvt+NSnPoWPf/zj\nY31Zo2bPnj3/f3v3D5JaFIAB/BNvRtHfK9ewLVqKIlqaoqJoimgTWguChhqL4g7NRrQooZiDQ2Bo\nBEFDEVE0BOGoREtLiFEXScqSQHhDcHnCe5EP3j3q+X7TuWf6DlzOxz2IB/39/VhYWEA6ncb8/DzG\nx8dFxyqbFOU8NzdnjjOZjFTXV87MzGB6ehoAoKoqXl5eBCeylqZp8Pv90HVddBRLXV9fY3JyEgDQ\n3d2NXC6Ht7c3NDU1CU5mDYfDgVAohFAoJDqK5YaGhjAwMAAAaGlpwcfHB4rFIux2u+Bk1piamjLH\n1bzfS1HOwNf904uLi8jn84hEIqLjWKaurs4cRyIRs6hl0dDQIDqCEIZhoK+vz3xWVRXPz8/SlLOi\nKFAUaba3Ena7HY2NjQCAeDyO0dFRaYr5d7Ozs3h8fEQgEBAd5Z/U3Nsbi8UQi8VK5paXlzEyMoKD\ngwNcXl5ifX29Jo+1v1v73t4eUqlU1b6oP/Hd+mXHf+mVz9nZGeLxeE3udT8RjUZxe3uLlZUVHB0d\nwWaziY5UlporZ4/HA4/HUzJ3c3ODXC6H1tZWjI2NYXV1VVC6/+tPawe+Suv8/Bw7OzslX9K15m/r\nl5HL5YJhGObz09MTNE0TmIisdHV1hUAggN3dXTQ3N4uOY6lkMgmn0wm3243e3l4Ui0Vks1k4nU7R\n0coixa+1T09PcXh4CAC4u7uD2+0WnMg6Dw8PiEaj8Pv9qK+vFx2HLDI8PIyTkxMAQCqVgsvlkuZI\nW3avr6/Y3NxEMBhEW1ub6DiWSyQS5mmBYRh4f39He3u74FTlk+JWqmw2i7W1NeTzeXx+fkLXdQwO\nDoqOZYnt7W0cHx+js7PTnAuHw3A4HAJTWefi4gLhcBj39/dQVRWapklzzLe1tYVEIgGbzYaNjQ30\n9PSIjmSZZDIJr9eLdDoNRVHQ0dEBn88nRVnt7+/D5/Ohq6vLnPN6vSV7QC0rFArQdR2ZTAaFQgFL\nS0uYmJgQHatsUpQzERFRNZHiWJuIiKiasJyJiIgqDMuZiIiowrCciYiIKgzLmYiIqMKwnImIiCoM\ny5mIiKjCsJyJiIgqzC8iivHPF8qqogAAAABJRU5ErkJggg==\n",
-            "text/plain": [
-              "\u003cmatplotlib.figure.Figure at 0x7f7a18dfb8d0\u003e"
-            ]
-          },
-          "metadata": {
-            "tags": []
-          },
-          "output_type": "display_data"
-        }
-      ],
-      "source": [
-        "# Plot the Data (Optional)\n",
+        "assert f(pi/2).numpy() == 1.0\n",
         "\n",
-        "import matplotlib.pyplot as plt\n",
         "\n",
-        "plt.scatter(inputs, labels)\n",
-        "plt.show()"
+        "# grad_f will return a list of derivatives of f\n",
+        "# with respect to its arguments. Since f() has a single argument,\n",
+        "# grad_f will return a list with a single element.\n",
+        "grad_f = tfe.gradients_function(f)\n",
+        "assert tf.abs(grad_f(pi/2)[0]).numpy() \u003c 1e-7"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {
         "colab_type": "text",
-        "id": "JaFHyAG9nDET"
+        "id": "v9fPs8RyopCf"
       },
       "source": [
-        "## Step 2: Define our TensorFlow variables\n",
+        "### Higher-order gradients\n",
         "\n",
-        "We'll use Keras's object-oriented [`Dense`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense) layer to create our variables. In this case, we'll create a `Dense` layer with a single weight and bias."
+        "The same API can be used to differentiate as many times as you like:\n"
       ]
     },
     {
       "cell_type": "code",
       "execution_count": 0,
       "metadata": {
-        "cellView": "code",
         "colab": {
           "autoexec": {
             "startup": false,
             "wait_interval": 0
           },
-          "base_uri": "https://localhost:8080/",
-          "height": 34
+          "height": 276
         },
         "colab_type": "code",
         "executionInfo": {
-          "elapsed": 332,
+          "elapsed": 730,
           "status": "ok",
-          "timestamp": 1525154229931,
+          "timestamp": 1527005655565,
           "user": {
             "displayName": "",
             "photoUrl": "",
@@ -190,54 +120,61 @@
           },
           "user_tz": 420
         },
-        "id": "z9r-ZeyrXu3A",
-        "outputId": "e19a698e-5892-4fcd-80d3-1394605ee72c"
+        "id": "3D0ZvnGYo0rW",
+        "outputId": "e23f8cc6-6813-4944-f20f-825b8a03c2ff"
       },
       "outputs": [
         {
           "data": {
+            "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXYAAAEDCAYAAAAhsS8XAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzsnXd0HNX5sJ/ZXrTq3ZLV3IvcDdgGGwOm2WCbHhJa6C2B\nUBISQioBfoQPkjhACA4QCIQSDITQbGMbsHHvVbZ6s7q0vc18f4xmJVltJa0q+5zDOXhn9s7dqzvv\nfe/briBJkkSYMGHChBkxqAa7A2HChAkTJrSEBXuYMGHCjDDCgj1MmDBhRhhhwR4mTJgwI4ywYA8T\nJkyYEUZYsIcJEybMCCNkgl0URVasWMHtt98eqibDhAkTJkwvCJlgf+2118jJyQlVc2HChAkTppeE\nRLBXVlayceNGrrjiilA0FyZMmDBh+kBIBPvjjz/OQw89hCAIoWguTJgwYcL0gT4L9g0bNhAfH8/E\niRMJVycIEyZMmMFH6GutmGeeeYYPP/wQtVqN2+3Gbrdz3nnn8dRTT3X6HUmSwtp9CKittvH8UxsQ\nxZY/4aXXTGfa7PRB7NXAU1dj5y9PrIfmYUgeFcnya2aQmBI5uB0bYE5WNPHS/9uE6JcHYukVucw8\nPWOQezXw7NhcyCfvH0Bqfi+uumkO4ycnD3KvBpY+C/bWbNu2jdWrV/PCCy90e291tTVUj+03EhIs\nQ7qfWzfls2tzMTNPH01UrJEv/3eU5LRIVnx/5mB3rUP6azw3fnaMQ7vLOX1RNrVVNvIOVZGeFcPS\nq6YNmT6GmlP7KYoi/3ltF9WVNhacO4btXxfi9fi5+Mpc0jJjhkw/+5t9O0r5Zu1xDEYtpy/KZuOn\nR4mOM3HlTbNRqTo3UAynv3swhOPYhymSJJF3sAqtTs35l05mQm4K6VkxVJY2UVdtH+zuDRgOu4ej\n+yqIjDYwbW4a514yiYTkCMqKGnC7vIPdvQFjz9YSqittjJuSxNTZaVywcgoAX3xwCL9PHOTeDRyH\ndpej0ai47PqZTJyWwoTcFOprHBzdf3KwuzaghFSwz507NyhtPUzfOVnehLXRRdbYeLQ6DQATp6UC\ncGhv+WB2bUA5sLMMv19i2pz0gEaWNS4BUZQoOlE3yL0bGDxuHzu+LsRk1jH/nDEApI6OZtL0VFxO\nLyfLmwa5hwNDU4OT+loHozJiiIw2AjB7QSYajYrtXxXg9foHuYcDR1hjH6bkHawCYOzkxMBnmWPj\nMJq1HDtwEt93YBJ7PT4O7CrDYNQwPrfFhpo1Lh6AgmPVg9W1AaWyrBG/X2JCbjIGozbweXqWbIIp\nLawfrK4NKMX58kI+Oic28FmERc/UOWnYbR7yDn53tPawYB+GiKLI8SNVGEzaNvZTtVrFhKkpuF0+\n8o+OfKGWd7gKt8vHlJmj0GrVgc9j4kxExRopzq/7Tixw5cWNAKSkR7f5PHV0NIIApUXfEcHevEMb\nnR3b5vPxU5IAqChpHPA+DRZhwT4MKS2sx+XwMmZCYjuH0MRpKQAc3lsxGF0bUJQXNWdiYpvPBUEg\ne1w8Pq9IyXdAW60oaUAQ5Gig1uj0GhJTIqkqb8Lj9g1S7wYGn89PWXE90XGmgBlGITrWhN6gobIs\nLNjDDGE6MsMoRMUYSUiOoLKsacQ7zaoqrGh1amLiTO2uZY1LAKDgWM1Ad2tA8Xr9VFVYSUi2oNNr\n2l1Py4xBkqC8pGEQejdwVJQ04vOKZJyirYO80CeNiqSpwYXD5h6E3g08YcE+zJAkiZKCOswWHUmp\nHcdpJyRbEEWJupqRGx3jdvloqHWQmGLpMCciMcWCOUJH0fEaRHHkLnBV5U2IokRKelSH10dlyOaZ\nkW5nD5hhcuI6vJ48Sh6fyrLvhiM5LNiHGQ67B6fDS2JyZKdJXgnJcqxrdeXQj8vtLcpv6ywJSRAE\nMsfF43L6RvTLXF4sa+Kn2tcVkkdFodGoKCvqu8b+zjtv8f3vX8Fvf/ton9sKNUX5tWi0KlLSOl7g\nFDPVSJ4LrWm/dwszpKk5aQMgLimi03u+C4JdCeFLTOk8YSMlLYqDu8qpOWkjtRPBN9wpb/YzpHai\nsas1KlLSoygpqMdhc2OK0Pf6WWvWvMsf//hnkpNTet1Gf9BY76Sxzknm2DjUmo51VXlnBye/I3b2\nsMY+zKitkgV7fGLngj02wYxKLVBdaRuobg04VRWyYO/MHAUQlyCPkTJmIw2/X+RkeRNxCWb0Bm2n\n943KaA577IPW/vTTf6C8vIyHH76ft99+s9ft9AeKUzQto/MMW61OQ1xiBFWV1hHve4Kwxj7sUDT2\n+C40drVaRVyCmdpqG36/iFo9stZvSZKoKrditugwWzrXQKNijahUwojLxH17/XF25VXj8fhx+Hzo\nGh1s/+vmTu8X/SJ2RA59egTDxhMd3jNnQiJXLh7TaRsPPPAztm79lj//+UUiI4dWDZ76Zl9SXBfK\nDshmqZqTNqpPWgM295HKyHrjvwPUVNnQGzRERHa9pU5ItiD6pREn1ADsVjcOu6fbIl9qtYqYeBN1\nNfYRWXlU0Ty7W7hVzddFf181VYlApbUhhDLHYxPMXd6XnCbPl5PfATt7WGMfRng9PhrrnM2JJ11X\nx5Tt7BVUn7QGbO4jhZPliuO0+98VlxBBbZWdpgYnUTHtwyKHI1cuHsNdV83g1b9upuhELdf/cG63\ntvN/v7ydpgYnN99xxoirrFpX48Bk1rXJuu2IlsiYRqYxsiughjX2YURts2bSlRlGocWBOvLsy4p9\nPZiyvLGJshZXWzXydi71tXYMJm1QDtHYeBM+r4itaWTFcXs9PqyNLmLiu1+0IyL1mCN0VJY2jcgd\nXGvCgn0YEbCvd2NLBIiNN6NSCdSMwMiYqoqeaeww8hyoPq9fFmixwe1CouPkBa6+ti8L3NDT9Otq\nHED3ZhiQQ2ATUyNx2D3YbZ7+7tqgEhbsw4hgHKcKao2K2AQztVWyA3WkIIoS1ZVWYuJNHWZankpc\n8wtfO8J8DbLfAKI7yLrtiNhmjba+WRD2hnfe+YDIyKHldAzY1+O7F+xAIEu5sa734zAcCAv2YURt\nlQ2VWgj6ZU5ItuD3S4GogZFAU4MTr8dPQlJwfgNThA6DUTPinMi11fIi31E5hY5Q5kx97cgSaMrc\nDkZjB7luDEBDnbPf+jQUCAv2YYIoitRW24mNNwcdvjgS7eyN9fILGR1r7OZOGUEQiE2IoLFeXhBG\nCjXNpqXoYE0xMSYEoa+mmKGHUjYjJi44wR7VPG/CGnuYIUFDnRO/TwzKDKOg3DuS7MuNzZpWVJAC\nDVrMMSOpdk5AsAepsas1KiJjjNTXOEaU47Cuxo7ZokdvCC7AL6yxhxlS9MRxqqBotU0NI2cSN9bL\nmlZUTHAaO7Qkrijmi5GAYpazRBmC/k5snBm3y4fTMTKODHS7vNitnqDNMAAGoxaDUUNDWGMPMxRQ\nbMTdZde1Rm/QojdoaGxw9Ve3BhzFFNMTwa68+HUjJORRkiRqquxExciZtcESHXCgjoxxCETEBBHq\n2JqoWBNNDc4RFVRwKn0W7B6PhyuuuILly5ezbNky/vKXv4SiX2FOQdG6eyLQlPubGpyI4sjYfjfW\nOzGatEFFxCgoERMjJTLGYfPgcfuCdpwqxI4wB2rAcRpkRIxCdIwRSQJr48hReE6lz4Jdp9Px2muv\nsWbNGtasWcOmTZvYt29fKPoWphVNDU7UGhWmCF2PvhcZbUT0S9itwz8xxe8XsTa6Ag6wYNHq1ETF\nGKkbIaYYRTAHa19XiGkWgL0NeWxdtvebb77ijTdeDfq7lZUVfPHFp0Hd+/jjv2bjxvXd3te6lMCa\nNe/x2Wf/C6r9qICdXR6HTz75L9XVLUdJPvnk7ykqKgyqraFKSEoKGI3yi+bxePD5RvYRXINFU4OL\nyChDj9PBFQ2/sd7ZI3vsUMTa6EKS6FVpgMgYIyX5TjxuX4+0/aGIIpCCTU5SUByHvY2MObVs7/z5\nZ7a7x+/3o1ar231eXl7GF198xnnnXdCrZ3eE4gyPjDawfPllQX9PGQfFEf+//33EzJlTSUrKAODh\nh38esj4OFiGZ4aIosnLlSoqLi7n22mvJzc0NRbNhmnG7vLhdvnZnWgZDZLQszGVTTudlTYcDgYiY\nHpqjACKjlHFw9SiyaCjS0EuNXatTY4nU98oU07ps78UXX4LFYuHIkUPcd99DPP74r7FYIsnLO8r4\n8ROZP/9MnnvuaQRBQKvV8OyzL/Dii6soKirkppuu5YILlnLllde0af+ZZ55k9+6dpKSktonaOXr0\nCH/+8zO4XC6ioqL5+c8fIzY2jnvuuQ3RFUt1XSGxH5Zht9sxmUycccYCfve7x3jpJXk3UVlZwcMP\n38+rr77JK6/8nW+++QqHw4lWSmTS9HvYsGEdR44c5sEHH0Sj0fL886t54IF7ufvu+zh8+ADl5eXc\neee9gKzZHz16hB//+AE+//wT3nnnLfx+H5MmTeEnP/npkKrBExLBrlKpWLNmDTabjTvvvJPjx48z\nZkznJUDD9IymZufnqYf0BoMiBEdCZExDLyJiFJQFztroHPaC/Vv315ROK+RP+ZsRCnomTJxjPfh8\nInnfbGgjiGYkTmXlmKWdfu/Usr2ffPLfNt8vLS3mT396AYCHH76Pn/zkp0yZkktEhIamJg+33343\nb731Ok8++f/atb1x45eUlpbwz3++TU1NDd///hUsXXopPp+PZ599iieeeIaoqGjWrfuCF19cxc9+\n9kskScLpsHPdlT9l6VXTWL36bwBkZGTi9/uoqCgnJSWVdes+55xzzgPgssuu4oYbbsbn9XPj9+9k\n566t3P/Idbz33tv88pe/ICGhbWGwRYvO5fbbbwwI9nXrPuf6639IUVEh69Z9zgsvrEatVvPHPz7J\n559/wvnnX9Sjv0V/EtI9aUREBHPnzuWrr77qVrAnJAyPioNDoZ/VzdUMU9OiO+1PZ58b9HLFO5fD\nNyR+S1/64HHKCUaZ2fE9bidttLxb8fukbr87FMapK9xuH6oIAU0v6uyrNSp8PhEkUKtbBLPJqOv2\nd6tUEBdnJjragsViwNj8HYNBy8KFSwPfP/30uTz//HMsW7aMJUuWkJSURHS0CZ1O0+Ezjh07wIoV\nl5KQYCEhwcK8eWcQGWnEZquhoCCfBx+8F0mSEEWRxMREEhIsCAhkpE4nMTmShAQLZrMes9lAQoKF\npUsvZuvWTdxyyy1s2rSeZ599loQEC7t2bebll1/G6XRSVV9FWXkaCQkWtFo1ktQyL7RaNTExJsaO\nTSczM4OKigJGjx5NeXkpixcv4I033uD48WPccceNSJKE2+0mLS15SM2bPgv2uro6tFotFosFl8vF\nli1buPXWW7v9XnX10C9OlZBgGRL9LC2WDyJWaYQO+9NVPyVJQqNVUV1pHfTf0tfxPFkhn5QjIva4\nHalZhlWUNnb53aHyN+8Mr9dPbN5YZo45gwvPn9Lj7x/aU87GT49x9sUTmDA1uc217n63KErU1trw\netVYrS6cTg/V1VZcLi8+X8vcXLHiGqZNm8uWLV9z5ZVX8swzq2hocODx+Dp8htPpwWZzB6653V6a\nmpzU1dnIysrm+edXt+uny+VFY9Gh0amorrZit7uRJDXV1VZOO+0sHn30p8yaNQ+/X8JojKGsrJZf\n/erXrF79OvHxCTx8/2+wNjgpL6vH6/W3+f1er5/6egfV1VYWLDibd99dQ0ZGJvPnL6S62orV6mTJ\nkou47ba7ejR+oSDYxaPPUTHV1dVcd911XHrppVxxxRUsWLCAhQsX9rXZMK1QzCi9McUIgkBktJHG\nBuewzzhsqHNiMut65fxsbYoZziip8PGJPQvxU1Ac6LZ+DPUrKyslOzuHa6+9nilTplBcXIjJZMZu\n79hpO23aTNau/RxRFKmpqWHXrp0AjB6dSX19AwcO7AfA5/NRUJAPtBwy0tE7MWpUGmq1ilde+TuL\nF8tmGI/HgyBAZGQUDoeD44W7geY5ZTJhs3UcMbVw4WK++mpDG5POrFlz2bBhHfX1ssLV1NREZWVl\nr8aqv+izxj5+/Hjef//9UPQlTCcoSTmW6N5FtcihfnacDi8mc8/CJYcKfr+IrcnV6yPN9AY59r1p\nmMcuK6nwSjninqIIdmtTb8YhOHv+O++8ya5dO1Cr1YwfP47TT58PgFqt4cYbv8eFFy5r4zxduPBs\ndu3azvXXX016egYzZswCQKPR8LvfPcmzz/4fNpsNUfRz5ZXXkJWVjd8vtfk9p7J48RKef/5P3HLL\nnYBsJl62bAXXXXcVKSmpZGeNw14vv1sXXbSMxx57DK1Wx/PPr27jO7BYLGRmZlNcXMiECZMAyMzM\n4pZb7uT+++9CFCW0Wi333/8QycnJHfZlMBCkQVLjhvJ2V2GobMtff/5b/H6R6++e1+H17vq5ef0J\n9m4rYcX3Z5CcNnhlV/synvW1dt56aTsTpiZz9sUTetXGO//YQUOtg5t/cmanEQxD5W/eGbu/Lebb\nDflcddOcwCEiPcHn8/PS018xKiOaS66Z3g89bEt/jeen/zlAwbEarr9nXq+UleL8Wj5+ez9zFmQy\ne0HmkP+7KwyYKSZM/6JoqpG91NahVSz7MI6MCZQS6GFyUmssUQZ8PhGnffgesqBkS0b38pg/jUaN\nyawb9lmX1kYXGo0Ko6nr4/A6I1AMrH5kZOGeSliwD3FsTW4kCSKjei/QomLkRUERjsORvsSwKyj2\n2OFsjrE1m1D6Mg4RUfrmeTV8fS7WRheWXiTsKUREGlCpBJrqh+9c6IqwYB/iBBynoRBoI0Fj78OB\n1C3JWsP3ZbY2udHp1d0e3NwVkVEGRFEatsfDuV0+3C5fnzKpVSoBc4QOm3X4zoWuCAv2IU5LclLv\nJ7GinQxrjT0g2Hs/DgHH4TBd4CRJwtroIiKyb6UhlO/3Z2RMf6LsWvpaIiMi0oDd6hmRVR7Dgn2I\n05dQRwWVSsASbRjW205rkwuDSYtW1/tAroDGPkwFmsftw+vxY4nU96kdRSAO13FQ+t3bKDGFiCh5\nHEdCgbxTCQv2IU6LYO/bJI6KNuJyyjVnhhuSJGFvchNhCZFAG6amGGujLIAi+qipBmLZexXyOPhY\nlV1sCDR2kP1YI42wYB/iNDXI3v++xp8PZzu72+XD5xOJ6KOmqtGoMUcM34gQJfbc0kdTjPL94TYO\nu3fv5KGH7gv0uzNTzD333MbRo0e6bU/Z+diaXPzpT39i587tverX22+/idvdsjg89NCPsdsHt0R0\nWLAPYSRJoqnBiSW6995/BWXbaRuG207lRY6w9L3ssCXagK3JNSztqoqG3dcFztI8F4abYAcQBLoV\n7MGiaOyNDU7uvfdeZs2a06t23nnnTdzulrF86qlnMZsHt9Dc8C5MPcJxu3x43H5S0ntvX1dQzBjD\n0Z6oLEZ9FWggh41WljZht7r75LcYDBRTTF8FmlanwWDU9Eiwu1wufvnLn1JdXYUoilx//c0sXnxu\np2V1y8pK+b//exybrQlJEvjtb58gNXUUq1Y9x9atmxEEFddddxPnnHMeu3fvZPXqvxEVFU1BwQkm\nTJjIo4/+FoBvv93Mn//8DNHRMYwdOx6ApkYnGq0qEBnkdrt5/PFfU1RUSEZGBh5PS7TP9u3f8vLL\nf8Pr9TJqVBqPPPIYBoOBK664hLMXXcAXm7/Er1vGV9veZNas09HrDfzvfx/xm9/8AZB3Cf/+9xs8\n8cQzPP30Exw9egi3282iRedw00238u67b1FTU80999xOdHQ0zz33PFdccQkvv/xP3njjNZKTU1ix\n4nIAVq/+G2azmauuupZ//euffPnlF3i9Ps46axE33dR9fa2eEBbsQxjlxeurLRFaBPtwtCfam0In\n2C2tQh6Hm2BXNHb/hv+yY9WePu065tg8iH6J/IffBsAyew4JV1zd6f1bt24mPj6Bp556FgCHw95l\nWd1f//oXXHfdjaxYsZTy8jpEUWTjxvWcOJHHa6/9m/r6Om6++TpmzJgJQF7eMV5//R3i4uK4444f\nsn//XsaPn8hTT/2eP//5RUaNSuOXv/wZ0D6Gfc2adzEajbzyyr84ceI4N910LQCNjQ28+upqnnvu\nr+j1Bt5441Xeeut1brjhZvk3R5pZMu8uRqfHcrSkVB6XOafx9NN/wO12odcbWLfuCxYvXgLAbbfd\nhcViQRRFfvSjO8jPP87ll1/Nv//9ZqCcsYzcr3PPXcJzz/0xINjXr1/LM8/8me3bv6W0tJiXXnoN\nSZJ4+OH72bt3D9OmhS4TOCzYhzC2EAo087DW2BUTRN8XuMCBG43D7+ARa5MLlUpAq1XTVxe4oBKQ\n/CKSJJs3uiM7ewyrVj3HCy/8hTPOWMC0adPJzz9Bfv4J7rvvruayuhLx8Qk4HA5qaqpZsEAuBqjV\nypr1vn17OPfc8wGIiYllxoxZHD58CJPJxKRJk4mPjwdgzJhxVFRUYDAYSU0dxahRaQAsWXIhH3zw\nH3kXm9ayKO/Zs5srmhelnJwxjBkzDoCDBw9QWJjPHXf8EEmS8Pl8TJkyLfC9JUvO57//ymuj7KjV\nak477Qy+/vorFi1azJYtX3PXXT8CYN26z/jwwzX4/X7q6mopKCggO3sMIDX/pyD//9ix42loaKC2\ntob6+noiIyNJTEzinXfeYvv2bdx007VyXXmni9LS4rBg/66gCGFzH6NBWrcxHCMhrMoCF4JxaHEi\nD79xsDW6MVv0JF55NQl33dKn2ibfrDvOvu2lrLxuJkmp3Z/MlZ4+mpdffp0tW77hxRf/wty5p3PW\nWYvIzs5pV1bX4ei4iuOpma6t/60IfwC1WoXf3/HS5fPKu5RTzVGtfVBKu5IkMWfO6Tz22O86bMto\nNBIRaWj3TixefB7/+c/bREZamDhxMkajkYqKct566w1efvmfmM0RPP74r/F4uleSzj77HL78ci21\ntbWcc86SQL9+8IMbuOSSFd1+v7eEnadDmIBtOQQCTa2WD8Iejs5TW5MbQQCzpe+VKZXdj32YmaT8\nPhGH3ROyc2t7GvJYU1ODXq9nyZILuOaa73Ps2NFOy+qaTGYSE5P46qsNAHi9XtxuF9OmzWTdui8Q\nRZH6+nr27dvDpEmTO31mRkYmlZUVlJeXAbB27Wf4mmuntx6H6dNn8PnnnwCQn3+cEyfyAJg8eSr7\n9++lrEw2s7jdLkpKits8IyJSj8ftlw8faWbGjFkcO3aUDz9cEyjVa7fbMRqNmExm6upq+fbbzYH7\nuypJvHjxeaxb9zkbN67n7LPPAeC0007n448/xOl0No9tdaAEcKgIa+xDmFBq7CAvEDVVNiRJGlLn\nM3aHvcmFKUKHStV3PSSwcxlmC5xijuprcpKCEvIYbJJSfv5xVq16DpVKQKPR8sADP+uyrO4vfvFr\n/u//HueVV15CENT89rdPsHDh2Rw8uI8bbrgGQVBx5533EhMTS2FhQZtnKXNTp9Px4IOP8OCDPyI6\nOobc3OmcrKiT+99KsC9ffjmPP/5rbrjhe4wdO45Jk+QDSKKjo3nkkcf41a8ewePxIggCt9xyB+np\no1Hs4Ip5T1kwQD7qc968BXzyycf84he/BmDMmLGMHTueH/zgKlJTR5Gb22LSueSS5TzwwL3Exyfw\n3HPP07q8cVZWNg6Hg4SEJGJj4wCYM+d0iooKuf32GwEwmUw8+uhviYkJnWkwXLa3Cwa7lOcH/9pD\neXEDtz54FuoujkELtp99LXXaV3oznqIo8dLTm0hIsbDyBzND0o9X/vQNOr2G7912Wkj6OBCUFtbz\n0Vt7mTUvg7lnZfW5nzUnrbzzj51MmZnKmUvGhbCnbQn1eH79RR77d5Zx+Q2zSEju+1F0u7YUsXVj\nAVf/cC4xCb2vQzRQhMv2jgDsVjdGs7ZLod4ThmPIo8PuQRSlkJijFMwWPXbr8KpuGKr6KAqBujnD\nLJY9lKGvcjtKlNTwS9zrirBgH6JIkoTN2vc0+tZERA6/kMdQJeW0JsKix+cTh1V5hUCSVojGQT5R\nSh1wTA8X7FY3KrXQp+qWrVHGczifVdARYcE+RHG7fPh9Ysjs69CqNsYwKlVqDziQQ6OpwvAM/VQW\n41Bp7CDb2a2NrmG1c7Fb3Zgj9CHzEQV8DcO48mlH9FmwV1ZWct1113HRRRexbNkyXnvttVD06zuP\nLYQhfgrDWaCFUmMfjg5UpU5MSOdDpB6vx4/X4+/+5iGAKMqRQaGIjlIwRegQhJGnsfc5KkatVvOz\nn/2MiRMnYrfbWblyJfPnzycnJycU/fvOEuqIGBie2afWfjDFBBY42zAah0YXRpMWjVYdsjbNES0L\nvU4/9APkHHYvktTS71AghwHrh/VZBR3RZ409ISGBiRMnAmA2m8nJyaGqqqrPHfuuE8oYdgVThKzp\nDCfB3qKxh84EEXAiD5NxCPhbQjgGMPwWuP5QdkAOIW1qdCGKw8ck1R0htbGXlpZy5MgRcnNzQ9ls\nv2I/sA9nfv5gd6Md/TGJ1WpV83Fg7V9kT2UFjsOHQvasUKE4y3p7aHFHKFv51uMgSRKi14vo9SL5\nhpZT1enwIvqlkO5aAMzNC73d2lI0S3S7se3Zjd/WtuyszWbj/fffDfxbKaHbEU8++XuKigq7fX5X\nbbRGKcMbeCeC0NhffvnFoMvwRkQakEQJR/MC9/bbb+JyOrHu2IansmJIlOHtKSHbf9ntdu69914e\neeQRzGZzt/cHG4/Z3xT+8xU8dfWkLL2IjB9ci1rfdtIMVj+V1OnRmXHExoduPKNiTVSWNRIfFwGS\nSPlHH1P15QYchUUApF9zFaOvvrL3HQ9RPxUcNg9R0UYSE7tPew+WSItcVsDr8ZOQYEH0ejn8uz/Q\nsGcvxwFUKjK+/z3SLuu/lO+eUOlpBCA+IaLN+PV1bqamRcv/I8lt+Z1ODj3zJE2HDiOo1URNyyV1\n6UXEzJqJ293IRx/9h1tvlZNqoqNN6PWaDvvw9NNPtPm3co8oim2SzLpqozVarZqYGBP2etlhmjIq\nqsvviKLIT3/6QPcD0ExisoXjh6vQqNUkJFh49+03mJF/HOn4CZIvWMI//vFy0G0NFUIi2H0+H/fe\ney+XXnoGQVm+AAAgAElEQVQp5557blDfGSpJIEm33U3l6r9R8dHH1Gzbwagf/wRdQiIwuMkqNVXy\nc90eb7d96Ek/DUYNol+iuKgW19frqHn3bVCrMU+bjqesjJI3/43D4SFu2aV9/g196SfIafQ2q5vU\n9KiQ/x10ejX1tQ6qq61UvfUGDXv2ohuVhikhDmt+AUX/fANfXDLmyVNC+tzeUFoip5urNEJgHEIx\nN33N1SGrKps4WVpD2XPP4Dx2FOOEiYgOBw27dtOwZy8Zj/2Gx//2V4qLi1m27BJmzz6NM86YT0ND\nE7fddme7Urv33HMbd999H+PHT2DJkrO46qpr2bbtW+6++8fY7fY2ZXg9Hl+733FqGV673Ul9vYP6\nCh8V1cf42aOrEVRSuzK8F198Cdu3b2XlyivZunUz8+efGVQZ3sYGG3GWCZQUTeTdF/8fVSdP8ugX\nnxEdHcOq85exaNHZg16GVyHYxTwkgv2RRx5hzJgxXH/99aFobkAxZmeT8cvfUPOfd2hY+wU1775N\n6h13D3a3sFvdGIyhdZZBS9hgQ1k19o8+QB1hIePXv0MTFYW3tpbS/3uC2g/eR9DpiD3/wpA+u6co\ntt9Q25ahJUnJumsnDWu/QJeSyuhHHiUpLZ6SbXspfuL3VP79RTIe+y2a6OiQP78nKONgajZBbF5/\ngsK8GsQ+HhaimJSP7KvEsXsHWXlHiZg9h5RbbkdQq7Ht3kX5qj9R9cY/uf32uykszGf16jcAWUB2\nVGp36tRpbZ7hdDrJyRnDD394Gx6Ph6uvXtGuDO+pdFaGt7qqhgN5a3nxpb+RkBTdrgyvTqdn1aqX\nALnMMARXhvfEkZM89PC9HDt8mNOLi3lPq+WPj/6G1IVnN4dVDn4Z3p7SZxv7zp07+eijj/j2229Z\nvnw5K1asYNOmTaHo24Ch0ulIuOp76DOzsO3cgfuUQkEDTX8kJykoNvvyz9Yjud3EX34lmqgoALRx\ncaQ9+DDqqGhqP3i/nZ11oOmPUEeFCIset8tH+auvIOh0pNx+F6pmM5whK5uEK67Cb7VS8dILSOLg\nnrak2MAVm3ioUELBRb+Ir64Oc+40Um6+DUEtKxMRM2ZinjET57Gj2Hbvavd9pdSuIAiBUrunotFo\nWLhwMQBFRYXtyvB2xJ49uwPXWpfhPX7iCI22kzz007u48cbv8emnH3Py5MnA95SCXa1pXYbX7/ez\nZcvXnHmmXE543brPuOmm7/PL395Do/Ukx3ftQLTZEIwmLDNntYqVb1+G9/jxvEAZ3m3btgbK8N50\n07UUFxdRWjq4MqTPGvusWbM4fPhwKPoyqAiCQPzyFZQ9+wy1H35A6l33DFpfPG4fPm9ok5MUApl2\nReUk5owhct78Nte1cfHELDmfmnf+TeNXm4i98KKQ9yFY+iPrVEFxwDk9kPW9a9GPGtXmevQ55+E4\nchj7nt3Y9+4hYkZo6tT0BsWpp8yHeYtzuPSq6SExT73+/Ld4GxsZW7uDhB8/jqBpKxISr7qGwoMH\nqPv4w3YLXDCldnU6Xa+SiToqw+tyekhLnsA//vFih98xGjs+OKW7MrySX8Odt/4Ya1UtKrMZVSft\nwOCV4e0p4czTVpgmT8WQnYNt905cxUWD1g8lWsPcLwJN1vpcmggSr/0BQgcVE6POPAtBr6dh/dpB\njRCx2xRNNfTjYDLJWqkvbhSR889sd10QBOIvXQlA49eDuwPtL40dwKgRcUtajJNz0aWktruujU8g\n9qKlaB0ObLW1PW6/dVZrR2V4O6KzMrwW4yiq6gq6LMPbEd2V4XW6rZRXH8EraIg9/0LM5oghV4a3\np4QFeysEQSDuUnnVrf1wzaD1w94PMewK2qZqAPyJ6RhGZ3R4j9pkJmr+Anz1dR1uwQeK/opbBlDX\nyMJFGJ/b4eIGoE9PR5+ZhX3/PnwNDSHvQ7DYbW40GlW/JBFprDVIggrDWZ0HPcScfwGRlkhydDqu\nu+5q/vrXP7W7p7WG3dn/63Q6Hnro5zz44I+4665bSOlgIQG5DK/D4eCGG77Hm2++zqRJU/B6fKgF\nI8vOv5lf/eoRrr/+Gm677SaKAwpY57sCpQzv1q1bmDdPXsRbl+F96onfkBSdgU+tJ3rxOYEyvD/6\n0R3t2u6sDO95553P7bffyPXXX82jjz6M0+notD8DQbhs7ylIkkTJE7/HdeI4s/72PFbVwJ+LeWhv\nORs/OcbZF09gwtTkbu/vSYTEybfe5D8FSSTGaLns9vaaqoLnZCWFP/8phpwxjP7ZL4Lue6j6CfDZ\n+wfJP1rNdXefEXKtffsfVrFDmMycuUnMXjyx0z42bFhP1euvEb/ycmIvWhrSPgTLK3/+Bp2ubZnh\nkETFNNTz6ZNvURI1kcuun0liSuchpZWvrKbp602kPfAwpgkTO73vVEIVWVZfY+etv29n4rQUFl04\nvs/ttWl7/Vo+3tSI0xTLzQ8uGtJnFYTL9vYSQRCInLcAgLqt2walD/Z+qBMDIIki9p3b0IsunGLX\n2p8uKRlz7jRcJ47jzD8R0n4Ei8Mun5xkNIXWBOEuL0dVIv8mp6/rqCPL3NMRtFoav/lqUIpl+f0i\nTrs3kDUcShq+XI/eKzvIFbNXZ0SedjoA1m1bQ96PYLDb+m/3Ztu1E53fgU8Uhk3dnO4IC/YOiJg+\nAwSB2i3fDsrzbf1kgnDmHcNXX49Rr8Jh93QrqKIXy9tza6tjwAYSu9WDyaxDpQqtBtX09Sb0Pnvg\nGV2hNpmImDUb78mTOPOOhbQfweC094+fQZIkmrZuwaCSfSjdFYYzjp+AOioK687tg+J3USKkQlkA\nDMBvteI8dpSICNkRPFzKK3RHWLB3gCYqCuOYsTQdPoKvqWnAn99iYw/tJLZukxeqiPhI/H4Jj7vr\nF9Q0YSIqgwH7/n0Drq1KkoTD7gm5pir5fDRt+Qa9SYtaLQRV4TFqwVkANH61MaR9CYYWB3Jox8FT\nUY6vpoao0SmAnOHbFYJKhWX2XES7HfuhgyHtSzD0lyPdtncPiCIxaXJSYncL/XAhLNg7IWLGTJAk\n7Ht2D/izbVY3Or0arS50zjLJ58O6cwfqqCgik2SnT3eTWNBoME2egre6Gm9l+xjl/sTjluvRm0L8\nIjvzjuG3Womae7qcpBSEhmYcPwFNfDz23bsGXFvtLweyfd9eAGInjmnznK6wzJVt/IqCMJD0V0CB\nbfdOABInZAItoaXDnbBg74SIGbOAlj/8QOKweUKumdgPHUS02bDMnoup+eVw2LufxObmTEJbsyAY\nKPorxM9+8IDcbm4uZoseh82Dv5sMTkEQME/JRXS5cBUMbME4RZMO9c7Fvm8vCALxM+SSCcEscIbs\nHLTxCdh270Z0D6wA7I8FTnS5cBw8gG5UGjHpSfJzutm5DBfCgr0TtAkJmLMycRw+hN85cLWa/c1H\ntoX6Rbbtkhcoy9zTWqr6BTGJzVOnyvfu3xfS/nSHsuiEWmN3HDyAoNFgHDs+ICQUO3ZXmCdPBhhw\nM0TAaRjCcfA77DiP52HIysIQG41OrwlqLgiCgGXuaUhu14BXArXb3KjVAnpD6Hax9gP7kXw+ImbM\nDJykFLaxfweIPf00JJ8P+/6B01Zb6oKEVrA7jxxGZTJhyMoOtN2dXRVAExWNPjNLNmEM4ALXHxq7\nr6kJd0kxxrHjUOn1LQePBGGGMI6fCCoVjmaNf6DoD03VcfAgiGJgN2a26II+VcvUXBTNcWSABbvV\ng9kSuiPxoGU3HjFzVuDIwWDeieFAWLB3QdzpcwGwD2CSjqNZezSZQ/cie2uq8dZUYxw3HkGlajk5\nJ0jtxDw1F/x+HIcGTqgFxiGEgt1xWNa2TZNk4WQyB7/AqU0mDNk5uAry8XeSldgf2PvBFKPY1825\nzYI9Qq6b4/N2H+pnyM5B0GpxHDkSsv50h9/ffCReCHctkt+Pfd9eNHFx6NNHN5+jGjbFfCcwZWSg\njo7GcfTIgEWFOPohCkJ5CZXEkp4INICIZgFg3zdw5pieHKoQLIq2bWo2q/Rk5wLIJXwlaUC1VbtN\nPrZOG6Iqn5IoYj+wD3VUNPrmzOOWk5S6HweVVotxzFg8pSX4rAMTMRYI+QzhrsVdXITodGKeMhVB\nEFCpBExmXdh5+l1AEARM48bjb2rC26qKXH/SH84yx1G5SJsi2I1mbY+0E31GJmpLJPb9ewes0mGo\nw/wkScJ+8CBqiwV9Wnpz280CLQgnMoBpkrwgOAbQzu6whfbwZldhAX6rFfPU3IBZo+UkpeDGwdg8\nj5xHj4asX13RktcRwnfimNx347gJgc9MEXrstu7zO4YDYcHeDcaxcvqy89jATGJFyChadV+RJAnn\nkcOoLRZ0qXIFQ5VKhdEUvHYiqFSYp0zF39SEp6wsJP3qDiXr1BCirFNPeRn+xgZMkyYHasP0VGM3\nZGahMhqxHzwwIC+/z+vH7fKFdtfSvCgpTnHo+dmnioLgODIwVV0d/RDDrrzPxrHjAp+ZI3T4fWK3\n+R3DgbBg7wbjOFmwO/IGRrAHJnGItp3eqpNytun4CW2KXZkidEFlnyooL4DzeF5I+tUdoc46DZhh\nJrWciNRTk5SgVmOaOAlfTQ3eATiwvT+Sk5S/n6KwyO03C/Ygk3MMGZkIegPOgRLsIfa3SKKIM+8Y\n2oQEtLGxgc+VMOCRkKQUFuzdoEtJQRURMWAae8AUEyKNXdGqTi3cZI7Q4fOKeNzB1cYwjh0LDIxg\n74+sUyVMUQlbBNDpNWi0qh5FQgSiQgbAkRyIkArRIi+JIq4Tx9EmJaGJbCn4pZg4gtXY5XDRcXgq\nK/A19H952lC/E56yMkSHo83iBq1MUiPAzh4W7N0gqFQYx47DV1uLt7am35/nsHnQaENXotXZiWBX\n4sODSVIC0CY3L3An+l+whzrrVBJFXMfz0CYno4mOaXPNZNYFbWMHMI1vti/n9f84hNqR7ikvQ3Q6\nMeaMbfO5orH3xHFomiDbph1H+z865tSjAfuKsvtWduMKph7kdwx1woI9CEwBO3v/F4Gy290hsyVK\nkoTjyBHU0dFok9qW/+2xGUIQMOaMwVdT0+9aWqhj2D3lZYguF8bsMe2umSL0uBxeRDE4k5Q2KUle\n4PKPh6RvXRHqyKCAGWZMW8FuNMsFsHq0c5kwSf7OAJye5rSHVmMP2NfHnaqx93yBG6qEBXsQKBPA\n2c92dlFsLtEaqi1nRTl+axOm8RPbJXa0bL+Df5kVgdDf5phQZ50qZYcNOe0FuzlChySB09GDBS47\nR17gGvv38I1Ql6pV/m6GUwS77EzXYg8iA1dBP3o0KpMJ59H+F+x2m6f5oJG+h3xKkoTz2FFZ2UlI\naHOtJToorLED8MgjjzBv3jyWLVsWiuaGHPr0dFQGQyBEqr9w2r1A6JxErhOyVqnYx1ujJED1RDsZ\nKMEeao3ddUIW7MacnHbXerpzATlJB8DVz3XqQ+08dR0/jspsRpfc/vAWU4QuqNIKCoJKhTFnDN7q\n6n6vgKr4W0KRdeo9eRJ/UxOmcePbtWfqYeLeUCYkgn3lypW8/PLLoWhqSCKo1RjGjMVbWYmvsbHf\nnhNq779SsEoRRK3paagfgD4zE0GjwXm8f80QIR+H/BOoDIZAuGdrejMOxmbN33mifwW70idjCHZw\nvoYGOfs4Z0yHRwGazDo8bj/eILJPFQxZ2QD9WhhNFCWcdk/ozTCnOE4BjCYtKpUwIsoKhESwz549\nm8jIzo/VGgmYAuaY/rOzh7rgkzM/H0GnQz8qrd21nhQCU1BpdegzMuWsvX6s7hdK27LfbsdTUY4h\nK7tjgdbDJCUAQ1YWCEJgR9RfOOweDEYtanXfX9PO7OsKyjj0RGs3ZCuCvf8WOJfTiySFbpFX3l/j\nuHHtrgmCIIcBjwCNPfSn445QAtvvgnwss+cAYPPa2V65G5WgIjMyndSIFLSq4IZUkiRKqmwcKqzH\n6/Oj16pxVcs1SEKhnYhuN56yUoxjxiKo29smjQETRM8msXHMGFwnjuMqyA9E2tS7GjhQewS7186M\nxFySTAndtNKCy+OjqNJKQYUVm9NLXJSB6pPyGZmhMEEoQqejXUvrZ/RES1MZjOhSR+EqKkTy+RA0\nGlw+F2W2SmpddVg9NqbGTySxB+NQ3eCk+KSVmkYXjXYPybEmbFY3lsj+ta8rKHPObvMQGR3cOb+G\nTEWwFwQ+a/JYOVhzBLVKjU6tY5pxLALB/4ayGjvHShpweXx4vCKG5jyLUNVOchacQGU0ouvkIG1T\nhI6aShuSJA3ps0+7Y9AEe7CHsg42Sj995qmUCgL+smI0ESJrDn/G+vxvcPtbBIJereOHs65mUdYZ\nnbbncvt4d30ea7cXU9voanMtFRiFig0HK7GkRzNtbPCC4dTxbDxYDJJEzKTxnY61KUKH2+Xr0d9C\nNWsa9Z99iqqiGPuUFJ7f/k8K6ksC1z/K/4zxcdlcPuVipiVP6rSftY1O3vz8KGu3FeM/JSJlAgIR\nCHyxr4JLzxpDQkzvDxR3VpYCkDRzKrEd/E7RKz9b8kuBvgUzHo1TJnLys1JM9joKLV6e3foyTW5b\n4PoHJ/7HOTkLuHzyxUQbOt7NSpLEoYI63t9wnG2HKmmdKyYAs1FRWu9k8+EqLpqXiVbTdoHuyd+t\nvCgfQaMhbfZU1Pr2QjIxSW5Lq1YF326ChbKUZNyFBcTGGvmycAtv7H0fu7elCqjmoIZrc5dz4biz\nUQkd7zx8fpEvthbxxbZi8kraOqQjgfGo2Ha8moRJSSyYntprgeuz2zlWWUlU7lQSk6La/5wECzGx\nZqrKrUSY9CEvGT2QDJpgD8XJ5f3NqSes65JTsB4/zs8+e4I6dwMx+miWZi3BrDVT2FTCjpO7+eu2\n1yisKueirPPaTEBJktiTV8O/1h6jtsmN2aDh9MlJ5GbHYTHpcHv9HPy2GFu5lX2FdWx9YTNn5qZw\n9TljMXYT097RSfB1u+WEHCk5vdOxNpq0NDW4evS38CXIduqSndt5Ub0Zl8/FxNhxTImbiElrZGvF\nTo7WHucPm1Zx69TrmBrfItwTEixUVDby4TcFfLatBK9PJCnWxPQxcWSlRBJl1lHX5GbfF3l4PH4+\n2JTPf78uYMVZ2Vxw2mhUvXiha/fLBbs8cakd/k63V3ZY19bYqa62djiWHZI6GoAvv3iP12MLUQkq\nFqbNJ9mUiFqlYm3RRj4/vomvCrfx4xm3k2ZpqyHaXV5Wf3yY3XlybkRWSiRzJiQSH2Ug0qyjsKSB\nE5sK8UgSf//gAO9/mcfV54xj1viEwFgG+3cT3W5s+QUYMjKpa/IA7XcnIvKqUlHeSHxK8AuGdnQW\nrq1b+MM7v2efUIlBrWdZ9gVEaE04vE6+LPuKV/e8y9aivVw/+WoidW3bLqux8/f/HqKo0oogQG5O\nHLPGJWAx6dBpVRzeW0HV4WqqrW6een0H//06hmvPG0dKnDnoPiooNeRVqe3fCWU8NVp58SkuriMu\nIaLHz+hvgl10QybYR0LhnO7QjE7HU1GOVFXDBdPO56LMc1GrZC3qtJRZLEybx1/3ruZ/hWupdzdy\n7YTLEQQBUZR4Y+0xvtxVhlolcPEZGSw9IxO9rq0GdnJfJTbgnqum8eaXJ/hqXwWHCuu5fflkclLb\naxhdoURsKHbQjjBF6KmtsuP1+II+hk9jiUSKjcZZkI97ZgLXTbqKuckzA9fnJs8krz6fv+59mb/v\n/ye35t7A5DjZP9Fk9/D/3t7L4aJ6Yix6li/IYt7UZNStbN+SJLH/02MkJ0bww9mjeG/jCd7dcIJj\nJQ3cvHQSEUZt0GMgiSKu/BNok5JQR3T8khqMvXOYGZtNO1VH9hC5KI2bp/6A7KjMwPXTk2ezsWwz\n7+V9xPP7/sFDs+8hSi9r7gUVTTy/5gA1jS7GpUez8qxsxqZFtVEEIlUCJyhk/oxR5Khh3c4yVr2/\nn0vmZ3LJgqwe9dVdUgx+f9dzQTHN9cDGDqDPysK6dQuewgJy58zmqvHLida3zNWLpy7iua//wcHa\nI/xt32vcN/P2wDuzbmcp/15/HJ9fZP6UZFYuzCHmlNBOZ7mVqsPVfO+C8aw7Ws3+/FoeW72dW5dN\nYvaExB711VUom4wMWZ2PnzIOTrsHgt8wDzlC4jz9yU9+wtVXX01BQQGLFi3ivffeC0WzQwqv38s2\nXSUAC8VMlmYtCUxQhWRzIg/MvovRllFsqdjOloodeLx+Vr2/ny93lZGWEMGvb5rLZQtz2gl1kF8q\ntVpgfGYsj14/m6XzMqizunj6zT0cKqzrUX9dBfmoLZFoYuM6vcds7rkDtdZZzwmLG6Nb5Oa0S9oI\ndYWxMdncnnsjgiDw0v5XKWgspqzGzk+e28jhonpmjI3ndzefxpnTUtsIdWjJOjVb9MyfmsKvbpzL\n5KxY9p2o5TevbKemIfjDPjyVFXKmZQeJSQqCIGDsRbnWfK0Nl05gVK3Iw3N+3EaoA6hVahann8ml\n2RfS4G7kxX2v4vF72Hm0isf/uZPaRheXzM/koWtmMC49up15QVlooqMMXLV4LI/dMJuEaAMfflPI\nX98/gKsHhapcRYUAGDK6EGi98DUA7NLLO46JNjO3TP1BG6EOEG2I5I7cG5mdNJ2CpiI+yP8ESZJ4\nb+MJ3vjiGEa9mrtXTuWHSye1E+rQ4sxNSbLw4ytyuXP5FNRqgefXHGDtjpJ293dFQLBndqXs9G4c\nhhohEex//OMf+frrrzlw4AAbNmzgsssuC0WzQ4qPC77ggFGO1811xXRq54vUWbhl6nUY1HrezfuQ\nJ975ht15NUzMiOGn184kNb7zLaTdJod1CYKARq1i5Vk53L1iKn5R5Nl39rEnL7iSBr6GBnx1dRiy\ns7u0R5osPZvEkiTx5tH3qIyRp022tXPn5vjYMdw85Qd4RR+vHnybJ/+1g8paB8vmZXLXyqmdmpdO\nTaOPNOu478ppLJ2XSU2ji6fe3E1NY3DCPbBr6SB+vTXmCB32HhREs3nsvHbk31TGa7FYvZjdnX/v\nvIxFnJ48myJrCX/Z9i9e+OAgGo2K+66axvIzszstcnZqyOeohAgevX4OE0ZHs+tYNb//xza8vuBC\nE92FhYBcfrkzeqOx76s+yIeu3fhVkNOk79SGLggC14xfSaIpnnXFm1i1di0fbykiMcbIo9fPZua4\nzlXj1rH8giAwe0IiP/3eTCLNOv61No/3NgYfkeMqKGhWdmI7vUcJKuhJstZQJJx5GgQn7VWsL/kK\nX1IcqFS4iwq6vD/WEMOKMctw+92UGzczd1Ii9105DVMX5zVKUnO87ikOmxnjEvjRFdNQqWDV+/vZ\nd6K22/4G4tezOtdMAMzmniVkbKvcxeG6YwGNx11U1OX9U+InMit+FtWuKlxRedxxWS4rzsru0lau\nCJbWsdsqQWDlWdmsODNLFu7/Ck64K9Ea3Y2DyaxD9Eu4Xd1rwZIk8caRd2n0WIkeI0cFKZpgRwiC\nwDUTVhKvTeaE8xCaqHruv3IaU7I630lBx4WvIoxa7r9qOtPHxLMnr5oXPjiIr5uDuEHW2AW9ocPE\nJIWeFkRz+py8ceRdVFodmrQ0vKWliN7Ov2vQGPjh5O+jktQckjaQkqziZ9fOJD6qa8d4R+WbM5It\n/PwHs0iKMfLxliI+3VrcbX99TU346moxZGV1qewoCoUzrLGPbCRJ4p28D/FLflZOvBR9Wjru4mIk\nX+dCQJQkDuw04a9PQB1Vx4QZTWi6iUV2OeV6JR3F607OjOX+K6ejUgk8/8EBik927TQLVrD3ZNvZ\n5LHybt6H6NU6zjvjGvk5XQg0gEa7h6Nbk5G8OvTp+czO7d7x4+iiLsiy+Vksbxbu/+/tvTi6EcTu\n4iJQqzuM429NT8Zhb81B9tUcZGx0NhNzF7Y8pwsKym1U7pPNIElTCsgZ1X3OR2dJWhq1ijuWT2ba\n2Hh259Xwj/8d7nKnIbrdchz/6NEdxvG3xmTWBa2xf160AZvXzgWZ5xA1Zjz4/biLuxawhw77cBWN\nQ9B4mTCnhqggok4cNg9GU/vyzfHRRh64egbRETre/vI4Ww5UdtmOMle72rVA730NQ42wYO+GvTUH\nOVx3jImx45iWMAVDZhaSz4e7vPMDJ9798gTbDlUxyn0GBrWeTwq/aBMW2RHdnZw0Lj2aW5ZOwuPx\n8+w7e6lrcnV4H7QW7F072QICLYhJvOb4/3D4nFyacxEJcaloE5PkOO5OhIrXJ7Lq/f1U1/qZrJ+P\niI+Xd/272+d0V6L1kvlZLJmTTkWtg+c/OIC/kxOdJJ8Pd0kx+lFpCJquHcPBVroUJZGP8j9DQDYt\nKDZrxYbdETUNTv7yn/34bdGMi5hMtfsk31bs6PI50PU4aDVqfn7jaeSkRrLl4En+u7nz57uL5bBX\nfWb3DldThB6n3dNtQbQ6Vz1flnxFtD6KxekLMGQpOR6dL/Q7j1bx7/XHMTtyiNPHsa1qBycd1V0+\np7vyzXFRBu6/ajomvYbV/zvMwS78UO4gHKcARlNYsI94vH4v7+V9hFpQc8XYSxAEAUPzC9LZJN5y\nsJJPtxWTEmfivhWncXb6mdi8djaVbu7yWQFbYhfJSbMnJHLl4jE02Dw89+4+3B2kf0uShKuwAG1S\nMmpT1yFhwdZJqXLUsK1yF6nmZM4cdToAhsxMRLsdX017u78kSbzxxVGOlzYyd2Iid5x1PuNixrC7\n4gDHG7rW8oMpJ3Dl2WOYlhPHwYI63lrbcfanp6ICyefDkJnZ5fOgbXJOV+w4uYdK+0lOS5lFkjkR\nTXQ06sjITk1STreP597bh9Xh5drzxnL9tOXoVFo+PPEpTl/nCzO0ONI7K99s1Gu457Jc4iL1vP9V\nAbuPdSwkXUWKwzCzy+eBPA6SJO8eu+Kj/M/wij4uyb4AnVrXUlqgsOPSAkWVVv720SF0WjX3XT6D\n5WMvDCySXeH1+PF5xS7nQlpCBPdenosgwAtrDlDdiXM9GMcpgFqjQm/QhJ2nI5mvirZT56rnrLQz\nSN51YmQAACAASURBVDLLoVXKit/RJC6qtPLqJ0cw6tXcc1kuEUYti9PPxKgxsLZ4Iy5f5xqhI8ia\n00vmpLNoeiolVTZe+7T9IdvemmpEpxNDN1tOCH7b+VnReiQkLsg8J+AgU7a0rg78Det3lbFpbwUZ\nSRZuvGgiKpWKZdnny20Vru/yWcEcqqBSCdx6yWRGJZhZt6uUTXvL292jaNHKgc1dEczOxS/6+Tj/\nc9SCmosyzwVk+7l+dCa+ulr81rbmMUmS+Pt/D1FWbeecWWmcPTONaH0USzIWY/XaWF+8qcs+OZr9\nLV3ZgyPNOu65LBedVsXf/nuI0mpbu3sCAi2I+dCShdv5PC2xlrG9cjdpEanMSZ4BgDYxEUFv6NAU\nY3V4WPX+frw+kdsumUxGsoUZCVPJsKSzu2ofRU2dR7Z0ZZZrzbj0aK49bxx2l49V/9nfTuGRJAlX\nQQGa2Lg2B4x0hnK62HAmLNg7QZREPjwiv8jnjl4Y+FyXOgpBpwts7RRsTi+r3t+Pxydy89JJJMea\nADBpjUFp7cEWvhIEgWvOHUd28zb8y91tTUKK9qgfPbrb36jRqtHp1V1O4hpnHdsqd5FkSmRGYss5\nmYqgcDVHXCjklTbw5to8Ik1a7rlsKnqtHNaZHZXB5MRxHKo7SnFTaafPC/ZlNuo1/OiyXMwGDa9/\nfoyiyraC1V0s90s/OrPLdiC4sgJbKrZT46pjfuppxBlboioMGfLC4TrFzv7p1uJANNTV57SEWy4e\nfSZmjYmNZZvxdGKeCzjSgygtMTrJwg8vnoTb42fVf/bjPCUM0l1UhMpgQJuY1G1bxiAW+k8L5UV+\n+ZiLAou8oFJhGD0aT0V5mxpCoiTx9Bs7qWkO7Zw+Nl6+XxC4NOdCAD488Wmnz+rJwe4Lp49i4fRU\niqtsvHqKwuOrq8NvberWDKNgMssZ2X7fwBzc3h+EBXsn7Ks5RLn1JHOTZ7aJzRXUavTpo3GXlSF6\n5IknShIvfXQoMIFnnFIKYHH6AowaY7PW3vEWvCfHf2k1Ku5cPoUIo5Y31+ZxpJVtUXHkBaOhKc/r\n6kX+vOhLREnkgszFbcLZFE3Y3cq+3OTw8MIHB5GQuGP5FGIjDW3aWjHxAgA+K/qy0+c57B50ejUa\nbfe1t+Ojjdy8dBI+v8hf1+zH4WoxIbiKikClQp/eteMUujdJ+UU/nxauR6vSckHm4jbXlJ1L63E4\nWlzPuxtPEB2h47ZLJreJ1derdZyZdgZ2r6NTW3tXjvSOmDMhkQtPG83Jeif/+KRFqIkuJ57KCvSj\nM7p1nEL3C1yNs5a91QcYbRnFhJi2NWf0ozNAknCXtSzaH35dwK4jVUzJjm2XVDU+dgzjonM4Up9H\nqbX9jgtaFcULsk7M984dR05qJN8ePMmGVgpPixkmSMHeA9/TUCUs2DtAkiQ+L/oSAaGNtq5gyMgA\nUcRdKk/iT74tYn9+LZOz2k9gAKPGyDnpZ2L3Odhcsb3DZ/a0VG1spIHbL52MKEk8+c8d2Jrtoorm\nqE/vXmMHWai5HF78HYTN1bsa+LZiB4nGeGYlTmtzTW0yoU1KDjhQlcWt3upm5VnZjB8d0669qUkT\nyLCks7f6AJX2kx32x9HDEq3TxsSzdF4G1Q0u/v5fOUJEEkXcJcXoUkeh0nbfVncF0fbWHKTe3cAZ\nKbMD2aMKp2rsjTY3z39wEAGB2y+dQmQHv2Vh2jw0Kg3rSr5ClNqPe7C7ltasaM5e3XGkivW7ypr7\nJDtOgxVoxm58DV+WfI2ExOL0s9qZiJQdorJjPFhQx0ffFJIYY+TWZZM7DHFdPPpMud3Srzt8Xkeh\nr12h1ai4Q1F41uUFdnGKshOMWQ5GRmRMWLB3QF7DCYqaSpgzahrJ5vZpywFttaSIYyUNvL+pgBiL\nnluWTeo0RvvMUWegUWnYVLq5y5fZaAo+ZX5SZizLF2RR0+Dk7/89hF8UcRcVoYmL6zSF/lSUhcTl\naO8w21S2Bb/k57yMRe2ybEHeFYgOB97qaj7eXMjBgjpyc+K48PSOXyBBEDg/82wkJD4v2tDuut8v\n4nL0/ASp5QuymZgRw57jNXy2rQRPZQWSxxP0rkWtVmHo4gShDSXfALAwbX67a5rYOFQREbiLChFF\niRc/PEiT3cPli3IYlx7dYXuROgunJc9s1oAPtrvem8ObNWoVt186BYtJy1vr8sgvbwoqMak1gRju\nDsbB4ZWVkmh9FDMTc9tdNzSbvNwlRdRb3fzto4OoVAIPXzen0zIQk+MmkGiMZ0flbqye9v6B3pz5\nGhtpaN7FSc27OF/LLjZowa4cQhMW7COKdc2OrUsnLunwuiLYrScKeOED+bT62y6ZTKSp8wkYoTMz\nO3E61c5aDte1P4HIYfc0F/rv2Z/k4jMymT4ugX0nalm34SB+a1PQmgl0blf1ij42l2/DrDExO2lG\nh99VIi3yt+9nzdcFxEbquXlp54sbwNT4SSQa49lZtReb197mmtOhnCDVs6p6ijM1yqzjvY0nKN4j\nH9emzwh+HMzmjk8QKrGWcaKxgImx4zpc5AVBwDA6A+//Z++9oyS560PfT3WOk3ty3JyjNiqsJAQS\nCiRjHgbDRRhjHDg8Xb/jc1+wr6/TxX6PCxiuMRgso4vBZIQQKGu1knalzTnvTs6xezqHqvdHdfX0\nzHRPV3XXzG6P+nMO54jpqq7f/vpX39/3942jo/zqlYtc7pli++oaHtzdsuDz3tVyDwICL/W8Ns8B\nnm+jkUq3lc8+thFRlPjGL87jV8JeVUTEwMLRQW8OHCWaiHJv850ZN3lLQ4Ncvri7i2/98gLTwRgf\nuX8VazKc3BQMgoF7W+4iLiV4vf/IvM+12NjT2bKymkf2yae4J399iXBvD6bKKoxudQW0SqaYZch4\naIIL41foKGtldXXmI6y1sQmMRgbOX2HKH+VDB1Zk1c7SOdCyH4BDfW/O+yzfLjEGg8Cffmwn5S4L\nxw+eBtRrJpDdvnxq5Cz+WIC9jXdgMWbWuJQN5PQbZzAIAn/4/k05i3QZBAN3Ne0lLsbn2ZgLaVpc\n7rTw2ffJpqmzb+QxD65kB6HobOfjweRvdW8GbV1BmYdTh85QU27j04/M7zE7lzpnLZtq1tPl66Fr\nTmRIPqYYhY0dVTx2ZzvjvjCjF6/JjlOPumJZNocFQZgv0BJigoN9b2I1WrizcU/GewWTCUtzC6He\nPq71TLBzjYcHdub2b+yp34ndZONQ/xFi4uy5L2QePnB3B2tbKrh0sYfE1JSqYAIFRw7TXDFQEuxz\nODxwFAmJu5Lx2pkQTCZC5R5c02NsX1HFQ3vULZpWdzMdZW1cGL/CaHCmNEAsliAaSeTdJabCbeVz\n79tIXVj+zkRt5iYCmZhJzpn9Mh/qO4KAwN2N2WvLm5plrbQiMMZv37eKlU3qKlDubbgDs8HE6/1v\nzTJL5auhKaxvq+T9d3VQ4RtBQsDctLDWnI5ycvGnNTKejvo5PnyaWnsNG6rnt1JTiNfKpYwbouP8\n4Qc24bSpM6cdaJI3+jcH3p7190Ln4X13drCp2YUjMEmgok6V4xRkJcHumF8Q7ezYRaYiXvY27MJh\nzl4CIFBei0FMsMYa5vGH16mqm24zWdnfuJvpqJ+Tw2dmfabFkT4Xo8HAH7x/Ix2CbGcPVOSOClIo\n2diXGQkxwZuDR7Gb7OyY4yxM5/zNca7HnJilBJ/YVampTviB5v1ISBzqnwl9DGl0EmVibWslO8rk\nF/IHF0JZMzLnkmkR90730+nrZn31GjyO7DVNfnFsiCmTi6b4FA/snN9PNBtOs4OdtdsYC41zZWIm\nwUirsywTj+xppSE2yZiljGeOZY62yIQjJdhnopYODxwlLsY50Hxn1gJXsbjIDy7K9+yqiNHRoL5F\n5NqqVVTbqjgxfJpQfCaxphBNFWQB/cntZRiQuBiyc6l7UvW9mWK4lY3nrizaOsDIZJBDI7IA/vB6\nKw6VmxvIG5yAwBsDb836e9BfWK/TCpeVRzrkMT3fk8AXVCeoS6aYZcaZsQtMR/3srd+Z1fwwPBHk\nn5++wIhdFniG4eylBTKxvXYzZRa3XNI3IduU83GWZaJ8eoSIxcGZ4Rg/Oaiu6l0mU8yhPtneqWiU\nmXjr4hDPvd2D112DNRpE1Nip/u5m+USUblstVKABJMZGMSViTLk8PHO4S3VFzJR9OdlvVZREDg8c\nxWIws6dhfmlihe+/dJXzkxA3WamYHtE0VoNg4M7G3UTFGMeGTqX+rkcTa9PYIAAjtiq59rvKcscO\np4V4TCSajIcfD01weeIaK8rbaHRlLiIWiSX4p5+fp9con9hck5kjnrJRba9iXdVqbnq7GUxGSyUS\nIuFQrOAuRmU+OSP3pljGN35+XlXRNKvNJNfoLwn25cGb/UnNpCmzZhIMx/jqT84SjMTZcfc2gJyF\nj+ZiMpjY23AHoXiI06Pn5O/N01mWTsLvJz4+TvmqFdRVO3n+aC+vn82tsc7VTkLxMMeHT1Ftq8xq\nfugemubffn0Zm8XI6js2AvMTdHLR5m6hxd3E2bGLTIbldmip7NsCBFqkV/491u/ZjNlk4F9+dYHB\n8UCOu2bmwZ8U7NcmbzIWnmB77Rbspszmh4On+3nt9ACtdW6cHe3ERoY1N/ne27ALg2DgjYG3U05U\nPZpYK+ty54Ht+EMxvp4hIzMTc9fD4cFjSEhZbeuiJPHtZy7SM+Jn3a4NcvXTXm3vBMD+xt3y8waO\nAoX5W9KJ9HZjcDhZtbGdK71TfO+FqznLM880tS4J9qJnJDjG5clrrKrooN453x4nihL//MsLDE0E\neXB3C3fcK0eKaBVoAPsa5GbYR5LOQz00VeVlcrS384UPyxmZTz13Jecx3GY3z3KYnRw5Q1SMsa9h\nd0bzw4QvzNd/dpZoXOSzj22kZu2qWc9XiyAI3N20FwmJI8nYfj02OGUc9RtW86n3riMUSfA/fniG\nqRyOsJQpxidfd3hQFjCKwJnL2RtjfO/5q7jsZrm+fFurnKDTp635Q7nVzZaajfT7B1NO1KA/e+Er\ntUR65cqW++/blsrI/PavLuYs8JV+gkuICY4MHMNusmUMcQT46cEbnLg6yrrWCn7noY1Y6hsI9/Qg\nqTQFKmyp2YDL7OTtoRPExLgu74QYDhEbGcHa2spnHt1Ia62LQ2cGePlE9sxnBSVxr1g7w5UEexJF\nuNzVON9pKkoS//aby5y/OcHmFdX89r2rMNrtmGvr5BK+Gn/8WkcNqyo6uDp5nbHQuC6mmHBaEkZ9\nlYM/+ZCc/v8/f3aOgbHsGqviMFM0pCMDxxEQ2Nuwc961/lCM//GjM4z7IvzWgRVsW12TSoTKVbo2\nEztrt2IxmHlr8ASiJBIMROXa2xra380lPUFr38Z6Pnh3B+O+MF/58Zl56fbppNvYg7Egp0fPU+fw\nsHJOZySQW9v90y/OYzQKfOHDW/BU2LG2JHMbNJ7gYMZ2/cbAW8RjCaKReEFrQUomz1kbGzGYzXz8\n3WtY21LBiSujPPX8lQXXa7rGfmH8Mt6oj11127EY54/ntdP9/ObtHuqqHPzRBzdjMhqwtrUhRcLE\nRrSZY0wGE3sadhKIBTk7ekGnTb5PTtBqacVqkes3lTkt/ODlaxy9tPD4lBr9UQ2dqm4nSoId2Z76\n9uAJ7CYbWz2bZn0mSRLfe+Eqb5wbpL3ezR+8b2OqNrS1pQUxGCA+kbv5xVz2N8ia4JHB4/os4jkZ\np2tbK/nUe9cRjMT5hx+con8B4a5oJ0OBYTp93ayrWk2lbXb4Zjga58s/OsPAWID37Grh4WQSkqmq\nCoPTSaRXm6YKcvOFHbVbGQ9PcH3qplx72zm/9rYWIr09mKpmErQe3d/OPVsb6Rn28z9/fo5INLM5\nIt0Uc3T4FHExzr6GXfMiO/pH/Xz1x2eIxUU+976NqUggm5J5mYcZQnaiVnJy5CyTPjlRpxDBHh0a\nQopGU2vBZDTw+d/aQmudrLH+7FDmKozpzw0GoryZNItkMsO8fnaAp567gtNm4n//7S2pMFdbARuc\n8k4cHjiqi8Ye7p1dN6m63MYXPrwFq9nIvzxzMWtFTCj+FnklwQ5cmriKN+pjZ922WU5TUZT4wUvX\nOHiqn5Zal1z7Oa0LUioDNQ9tdXvtZmxGK28NHk/VAS/UFCPHLM/UqblzcwMff/cafIEo//D9k/SN\nzM/uA7C7LMSiCd7slU1DiqlIwReM8qUfnqZz0Medm+r5yP2rUgJPEASsLa3ERoZJhNT3I1XY23AH\nMLPBFTIHce8UCa93VsyyIAh84sE1bFtVw8WuSf6/H55KlV9Ix2I1YTAK+H0RDg8cxSAY2DPn1HKj\n38sX//0kvmCM333PWrantXSzNDSC0ZiXYDcIBvbU7ySaiHKm/zKgjzkqPVHNYTPxnz+yLdV16Iev\nXEPMoLkr8z/pnebixBVa3U00u2eHzx483c+Tv76Mw2bi//joduoqHanPlLkP5zEP9c5aVpZ3cHny\nGmNTXnk8eig7afPQ0VDGEx/Zislo4J9+cT6rcz1XeYXbHV0E+6FDh3jooYd48MEH+da3vqXHVy4p\niq17X1LIAATCMf76X9/mpRN9NNY4+dOPbpuXfKMkwITz0E4sRgs767YxFfEy4Z1esPZ2LhKRCNHB\nQawt87vkvGtnM598cC3TwRh///2TnL0xfyErNeBP9VzAaXKwxbMx9dnAWIC/+e5xbvT72Luxjk89\nvG5eeKcyD1GN9mWAVRUdeOzVnB68KNfeLmhzk58/t06O0WDgjz64ib0b67jR7+Pv//0k497ZxdgE\nQcDhtOD1Buj3D7K5ej1llplMxbM3xvh//+MUoUiC33tkPfdtnx3eKZhMWBubiPT1IiXU9SJNZ09y\n7V3ol7OSC5qHLPWCypwW/vSj22iodvD80V65xO2cE4wiSPvGhxElkb1pm3xCFPnF6zd56rkruB1m\n/uxjO2irn53NaU3mNuSzwQHsa5Sf1z0qO/4Lm4ceBLMZS33DrL+vbq7gCx/egtEg8LWfnuVXh7vm\n+R6cRR7yWLBgF0WRv/7rv+Y73/kOv/rVr3j22We5cUN9g9lbTSAW5NzoBeqddbS55UV5Y8DLX/3b\nMY5fGmZjRxX/5eM7MpYLSBU+ykNjB9ifXMTT06FUE+t8CPb0yl1yWjIn5Ny7vYnfe2Q9kViCr/z4\nLN9/8eqsRsj25CKOhBLsqt+O2WAiIYocPNXP3/6vmbKrv//ohlmVChUUAZKPI1kQBFlrj8jfq4um\nmqEAmslo4DOPbuCBO5rpHwvwF//6NgdP9c/SWh1Oi6yhSTMCJhiO893nLvOVH59FkuBPPrSZOzc3\nzPt+5blSLEZ0WJt9GaDGXsWaipWMTckRQos1DzXldv6vT+xkXWsFp66N8bf/6wRXe6dSnyuCdGzK\ni0kwckfdtuT/D/H3/36KX77ZRXWZjT/72A5aaufXIzK6XJiqqvMW7Ns9m7EYLYwm5yHfkE8pHic6\n0I+lqRnBOD/BaV1bJX/2sR1UuK387NBN/vt3j+JNc7Ar85CpzEQxkJ+KmMbZs2dpa2ujqUnWYB55\n5BFefvllVuboDH+7cGz4FHEpwd76ndwY8PGrw12phtEfeWAN79nRlNXmayqvwFhenpd9GeSQv3pH\nHVLEgLUy/58ikOzmtFBFxzs3N9BS6+Kbv7zASyf6OHVtjPt2NHHXlobUIjbFrGyr3s6JKyM8/UYn\nfaMBrBYjv//oBvZtyt4I2VqAfRnktPJXzsjOaz00VVuWeTAIAr/zrtU0e1z88JXrPPX8FY5cGOL+\nHc1sXVWN3WkGUaDcUEGdqY3nj/bw/NEepvxRmj1OHn94/YIJSNbWVjgsR6RYG9Vn/yrsbbiD586f\nBPKfB0mSiPT2YK7xYHQ4Ml7jtMlNsb//4lUOnh7gi/9+kl3rannPrhbaG9wYTQKJMGz2bGRiUuSn\np65w5PwQkViC3etr+eSDaxdMQLK2thI4fYq4dwo86uqzKNhMVnZ4tjB8TijIkR4dHJA7aC1QVmJF\nYxn/9VO7+Oenz/PW+SFOXh7hwLYmHtzdoqo2/e1MwYJ9eHiYhoYZDaauro5z584V+rVLxuHXbuK2\n1fLTX0SIhk4AsKa5nA/cvYK772hldHThxtHWllaC58+R8PtVV1RUEASBXVU76ZQgYgzm/W8I3OyS\nx5KjNkprnZu/+NQufn7oJgdP9/OTgzf4+aGbNDsEagFLqIIvfvs6kgQCcNeWBj50zwoqciSJWOrl\nAlD5OMwAKm0VtFrlscfN+dfnCPf2YLDbMdXUZL1GEATu2drI5hXVfO+FK5y6Nsa1Pi9mk4GV9ghu\nrIhDTfyXf5ZzGkxGgQ/e3cF797blbEg+43PpgT3ZSzFkY1vtZl6NyzZ2m4Yqn+nEp6ZITE9jX71m\nwetMRgOffGgd+zc38IOXrnHs8gjHLo9gtRhZb4hgilk5fzzB4SHZgVpdZuV337OG/Zvqc54srS2y\nYI/09sIq9WUdFPY27OTZ2GWwJPJ2pCvm0VzlqxXz1MkbE/zwxSu8eLyXF4/3Umkzsgq41jfIPopD\nSU2nYMGeb5ynR+NOvliU9zVisVThrK6mbUMZ79ndxuZVM4Ih1zgDa1cRPH8O2/QYFR2Zj+gLsT+4\nk05OMMl43nMyeLMTwWikactaDJbcmt7nP7qDx9+/mVeO9/DayT68kWvgr0OYqmJ9exWbV9Zw59ZG\nOhrV1X4BGGxvI9DVTXWFDYM5u1DK9m9cX76Wq/gYZgCPJ3Ps+EIkwmGuDg9TtnEDtbW50/o9Hjd/\n9bkauod8vHlmgDfODBCMD+CmkcRoLVtX13Dn1ib2b26gXGX2Y9yxnj5AGh7I+7f0mGqJAn7HOOs8\nC6+nTM+Y6L4KQNW61arG4PG42bOliWMXhzhxeYSzN4eJ+idwBMqxhl3csb6Sh/a2cceGeowqhaxh\n01omngHT+FDWcS5Edc0WXox3EbIFcFeYsZltuW+aw/SY/Oy6LesoU/H8h+vKeffuVl4+1suxi8Pc\nnOwkOi4QIXbbyCotFCzY6+vrGRiYyXAcHh6mtjZ3NblcmvBS4XY5KBMcfOKTM45TZWwejzvnOMUa\n+eUbOXeZWEO75uf7RuQ42QlxnDOd17KmbWdDEkUC3d2Y6xsY90YA9RrvvnW17F3r4YsHj8BwHfva\nV/LgYzPt77T8RoaGJqTrNxg4dzWrlrTQfNojZYCPs5PnGRrOXP99IUI3roMkYahv1DRuh1Hg3Tua\n2L3RzZd+fhTGG/n9d+1gzUY5SS0aijIaUn8cN9d4mL5xk5ERX14+E1vcSVgI8XLnm7Q5s5/Ass3l\n+DlZ449X1WmahxV1Lvl/66Z58RdhhEAl//UTu1MmoYnxzBFVmYiVy9FCE5ev0Yz2dz0WjSMkjMRM\nYV64eDjl79DC1JVrIAiEnFVEVDzf43EzNRlk56pqdq6q5t8vXeTwwDH+eNunbxtZBeo3yYKdp5s3\nb6anp4f+/n6i0SjPPvss73rXuwr92iXD6ZKTc/I9eaQch3nalxUbXswS4a2hzK3SFiI2MoIYDmsq\nS5pOr7+fgbhc7yYRzj/LrpAIIYBIQN7gvMIklyauar9/AYehGo4NnyJqliNlCnGYWVtaSUxPk/BO\n5b44A2JIQLLEOTt2nmBMe/io1m5BczkyeCxlDss3httUXYPBbi/4nYibI6mINS2k/Ax1dRhs2rX9\nSCLKyZEzVNrKWVe1OvcNtyEFC3aj0cif//mf8+lPf5pHH32URx55pGgcpyB73RMFZJjJHdqteduX\nlZfHZIWjQydJiNpC5RSBls1hmIu3Bk8gGuIYjIU5itK7SuVD+sv81tAJzfcXItglSeLI4HEkc2zW\nWPIhFcedx3qQJEmO5XdZiIlxToyc1vwdkd4ejC43psrsDS6yMRme4vLENdxuuTZOvvOQym0YHiYR\nztzjdyGUd6LM7eCGt5ORYPZEokzEx8cQQ6G834nTI+cIJyLsadiZtarn7Y4uo77nnnt4/vnneeGF\nF/jsZz+rx1cuGWo61C+EYDBgbW6RO7THtH+H8vKsbmhjOurn4sQVTfcXItBiYpzjQ6dwW1w4XbaC\nsuysTc0gCPlvcIEoJrOB2rIazo1eIBDT5kyO9PSA0Sg3QdFIl6+HocAwq+rlzamgeSigxEIkHEcU\nJWoqKhAQNGuriWSbQmtLa15moLcGTyAhsaJWbpBR8AYnSQS7ta8H5bntHvm31DoPhZ7elAYwe+vv\nyHHl7Utxbkc6okdYk7W1FUSRaL/6+t8KyrH/jhbZtq2kcatFrfc/E+fGLhKIB9lVvx2nq7CiRwab\nDXNdHZFe7bVzYKb29r6GO4hLCY4Pq9dWpUSCSF8v1sYmBJN2t5FSUXBfm1yeVw+NPZ/QT+W55W4H\nG6rX0u3rZcA/pPr+mYxT7WtBlETeGjyGxWBmbf0KoHCTFID/Zqfme5WNdVVdG3aTnbcHj2s6yabe\niTzMUWOhCa5O3ZAT5xboRXC7844X7Hp0S0lpaXmYIZTnrqxrodXdxIXxy0xFvKrvj/T2YPXUaA61\nhJkyxXc27sbutCBJEM6Qbq8WW2sbYihEbEzb0VkUJULBKA6XlV11OzAIBt5MK2Obi+jQEFIslteL\nHI6HOT5yhipbJRtqV2O1mQpaC6bKKowud14ae3oxOKWsw9z2gQtRiGC/PtWZKlNcWe6aNZ58UHwu\ngc4uzfcq819WZmdX3Ta80WlNJ9lCNPa3FW29QbvD9naiJNiz9PzUQiGOQ7n9lwmTycj+xt2pgmRq\niHu9JLxTODsy92ZdiLHQOJcnr7GyvJ16Z50uRY9mzBDa5iEciiFJ8m9RbnWzuWYD/f5BeqZzl1eV\nn9clP19D82qFkyNniSai7G24A4NgwOW2FiTYBUHA2tpKbHSURDB3Hfh00ovBba5Zj9Ps4O2hE6q1\n1ZlSAtrnQaluuq9hly6nWKV2TiAfjT2tAJgSEXNk4Jjq+yM9PRjLyzGVqw/XBbmD2uHBY1iNw/az\nyAAAIABJREFUFrZ7Nue+4TamJNh1qAlhaWzKu8FAeu3tO+q2YTaYOTxwdFYv0GwoJwRnR7vm5x5O\nvihK5T5dTi55OlDnNthQxvRG/1tZ70lH2VBteQi0wwPHEBBSdYJcZTbCwRgJFZ12sjErUUkDMxq7\nFZPBxO76HfhjAc6MXVB1f7inB8FiwVKvLWQ2FA9xauQcHns1qyo6sCeTowra4JK1c4Ld3Zpr56QL\n9hZXE02uBs6NX8IXzR12mPD7iU+M56WtX5y4wlTEy676HdhMhXVuutWUBLsOGrvBYsFS30Ckt1dT\ng4FU+6/kGOReq1sYC09wbTJ7aVWFcHdSsK9coWm8CTHBW4PHsJvsbE82ULiVGvvcssXrq1ZTZavk\n+PBpQvHcURWRnm4QhKy1crIxmFamuMomR5G43PILHQ7mb5KaqSFU2DwovQFe7zuS9R4FMRYjOjiA\ntblZdfNqhbcGTxATY+xv2I0gCBiNBmx287ym1lqxtrUhRqNEhwY13Rf0y450s8WIIAjsb1B/kk1F\nieVhlns9qUjcnaEnQ7FREuw6VXGztrbKDQZG1duXQ0nh4XDOZGoq2qrSwWchlKO3a4U2wX5+/BLe\n6DS767enyhTrobGbysowVlRoPrnMbTSS3gv0+PCphW6diVmu1R6zrPgY0rskKYK9kHlImeY0nlzm\ntoOrd9aypmIlV6duMBRYuLBYdKAfEgnNZhhJkni9/wgmwTgrEShTU2utKPMQ6dY+D+lF8XbXb8ds\nMPN6/5GcJ9l87eujgXEujl+ho6x1XpniYuQdL9hNJiMWa2EOM8jPgTpjgpg59q0ob6PeUcupkXN4\nIwsfPSM93Rhdbiw12rz3mRooOJNp84U2FrC1thGfnCQ+rb65daZGI/uUXqD9CztR42NjiMFgqtGF\nWsLxMEcGj1NucbOlZkPq704dBLu5tg7BastbY7enbfR3N8s1Z17PYZbK13F6ZfI6w8FRttduxW2Z\nccA7nBaikQRxFX1Ss2FtawcgnPSBqCEVy59WBM1hdrC7fjvj4UkujF9e8P5wlpLFuXj55htyb9em\n4tfWoSTYAXRpXGtTFnFXl+p7UppqmkATBIEDzXeSkBK80Z/9CJ4IBOSY5bY2TTHLI8HRlGbS5Jqp\nRTLjMCvw+J2HGSJTa8ByaxmbazbQ5x9I9QLNhCI0tEbEvDV0gnAizN1N+zAZZkIkXW7brDHlg2Aw\nYG1J5jZE1X9PwB9JOdIVttZspMzi5u2hE0QS2b8rX8fpoeQaO9A8u2iZLj6X5hbZ96RBY1cc6XPL\n9R5ovhOAg71vLnh/pLtbDr1VUdZEISEmePnmYewmOzuz9HYtNkqCHXkRh0M6Ocw0LOJsLfH2NOzE\nbrJzqP8IsURmW2+mLjlqeLVX1kzua7l71t9TDrMCN7h8en9mm4d7mmRh82rv61nvjeQRsyxKIq/1\nvYlJMHLXHA3NVVa4xg7JVnnJ3qNqCQWiqYQ5BaNBjpYKxcOcWCC2P9zTI/sZmptVP28yPMXZ0Qu0\nuBppL5ut4ephojRYrdibGjU1t86k7AA0uRpYVSF3VxoKjGS8VwyHiQ4NYm1t0+RnOD16Dm/Yx976\nnRl7uxYjJcGOPkX1jQ4H5to6wt1dquOvszWxthot3NW4B38skDVRJ9zdBYBNQ4ifPxbgyOBxqmyV\nbJvT29VoNGBzmAno4GsArSappAliTqnatZWraHY1cnLkLGOhiYz3ztRGUX/0vjRxjZHgGDvrts0y\nP0CaYC/UcagxQkh2pMczNpa4q3EPAgIH+97MuLYkUSTS24uloUFVdU+FN/rfQkLinub98059ejWa\ncK1ckWxunVkYz2WhXqeK1n4oy0k20tsjN5xJnp7VIEkSL/a8hoDAPc3aSy3frpQEO/o5UG1tbXJz\n67HMfRTnEsiiqQIcaN6PQTDwat8bGV/mGYHWrnp8b/S/TUyMcV/znRmrJzqdloJfZHONB4PDkYrY\nUUMwEMXuNGOYo2UJgsC7Wu9BQuKVLFp7uKcHU2UVJnfuUr0KB/veAODepKBIJ2WKKXiD09YPN7TA\nWqi0VbCzbiv9/kHOj1+a93lsZBgpEtZkhgnHw7ze/xYOkz3VJSkdvRpNOJOOfbV29oUE+9aajZRb\nynh78HjGaKl8lJ0rk9fpne5nT/N2ah2e3DcUCSXBjj72REhzFiUXWC5CSU3VmaHed6Wtgu2ezfT7\nB7k2Nb/VYKS7G4PdPqt59ULExDiv9b2JzWhjX2PmeucOl+wwixXgMBMEAVtbO7HhIRJBdfVeFmpi\nvbN2K5XWCo4MHMUfm53wIzevntKkrQ/4h7g4foUV5e20ls03WzidFgRBB5NUY5Pc3FqlSWohgQbw\nnrb7AHi+65V5G324S04CsmlIVDvUf4RAPMj9LXdnND/oEQYMssYO6k2UC82D0WDkQPN+wolIRlv7\njGBvVz2+F7pfBeD969+j+p5ioCTY0U+w2zQK9kAggsEgYLVlrm9yX8tdAPxmzssshsNEh4dkW6JK\nx+nx4dP4otNy+QBT5rBAvY7fyganRluNRePEogkcWZpZGA1G7mu5i6gY4/W+2ZEh+djXn+18AYD3\ntN2b8XPBIMz0Pi0AwWTC2tSsurl1LsHe5Gpgc80GOn098zZ6xWFva1Mn2COJKC/3HMJmtKXMG3PR\n6xSrJM+p3uCy2NgVDjTvx2ly8HLvoXlljSPd3QhWG+Y6dQla3b5erkxeZ23lKlZW5Vfm+HalJNjR\nJzkH0h2oXaquV7JOswnnjvI2NlSt5erkdS5PXEv9PdIrN69Wm4QRS8T4deeLmAQj97ZkfpFhZh4K\nFWqK5hjuzJ1OHgwosfzZbcPKZnSw7w3CaUfwlIamUmPv8fVxevQ87WWtbKpen/U6R4EF0RSsrW1y\nc+vB3MXhsvlb0nmw7X4Anu96ddbfw12dsuNU5Ty80f8W/liA+1ruxGG2Z7xGL43d5HTKvqcedb6n\nXPNgM9l4oPUAoXiIV5MmNQAxEiE6OICttVW14/TF7oPAzGloOVES7OinsRudTswejyoHaqZ43Uy8\nb+V7AXj6xq9TyRlhjbVRXus/zER4kgPNd6YyLDOhxNMXHPrZnhTs3SoE+5xyAhm/z2Tj/pa78ccC\nPN89I9RmTBDqErSeufk8AI+teHDBk47DaSURF4lG8jdJyePqmDXOhcgWGZROR3kraytXcXnyGlfH\n5MxkKZEg0tONpbEJgzV3Gnw0EePFnoNYjZZ5kVHpWG0mDEZBl2bO1tY2xECA+MR4zmuV9ZDJiaxw\nT/N+XGYnr/a+ntLatTpOu329nB49T6u7ibWVq1TdU0yUBDv6aewgmyHEQID4+MIO1Eg4jpiQcgr2\nFncjd9Rto9c/wMmRs/K93eodp/5YgOe6XsZhsvNQ+/0LXjtz/C4sIsRUVS1XOFQR069GoAE80HqA\nSmsFr/S+zlhoAkmSCHfexFhRgakid1OJ61OdXJy4wpqKlTm74ug1D6kNrjN3eYhcphiFhzveDcC/\nnvwhoiQSHRpEikZTz8rFq72vMx31c6D5TpxmR9brBEE2Sekh2BVnphqHeiAQxeYwY1ygcbjNZE1q\n7WFe6T2U/O6u5LPacz5DlER+dPVpJCQ+uOqRvGrX3+6UBDtgs5sRhMJty6Dezp7LlpjOYysexCgY\neebm88TFOOGebtXFnp7rfJlQPMx729+FY4EXGfQ7uQiCgLW9ndjYKAn/wr0y1ZggACxGCx9Y9TBx\nMc7Prz9LfHKShNerSlsXJZFfXP81AI+tfDDn9XqZIaxNzQgmkzqTlMr1sKqig931O7g52cOhviMz\np5b29pzPGA6O8uuul3CbXTzQeiDn9Ypg18MkBRBRc3LxR3HmWAsga+1ui4sXe15jKDCcMn+q0djf\nGjxOl6+HnbVbWbMMtXUoCXZgRjsp1LYMaY7DHNrJjKaa+/hcY6/m7qa9jIXGefbys0T7+7C1tee0\nJfb7BznUf4QaWxV3N+/P+Rw9Ty6KoMm5wanUVEGOkFlR3s7p0XN0npdjme0qBPvzXa/S6etmR+0W\nVpS357xeL1+DYDJhbWsn0tebMwM1FIgiCLKSkYsPrXoUp8XBMzefw3dD7g9rzeE4FSWRf7/0E+Ji\nnI+s/cCC2rqCw2VBTEhEwvm1jVRIKTs5NrhYNJF0pOdeC1ajhY+u+SBxMc5TF39EuKsLwWrNqewE\nY0GevvEbLEYLH1z1iOp/Q7FREuxJHAU2tVZIFYDKqbHnti2n89iKB6m113DhzKuy4zRH4a9ALMi3\nzn6XhJTgw2veh9mQu7OQXho7gK09Gb+cQ0tTa4oBeQP+8OrHEBA4d+ol+Tk5BHunt5tfd71IhbWc\nj679kJqhF9wuMR1be4ecgZqjMJrib1FjFnBbXHx8ywcJJyIMXz0jtwTMUdnyzYG3ueHtZKtnk+pa\n44rSESgwWcvocmGuqyfcdXPBDFTF9KVG2QHYVruZXXU76J/sITI4ILcEXEDZkSSJH1/7Jf5YgIfb\nH6DSVqHtH1JEFCTYn3vuOR599FHWr1/PhQvqakbfrjicFuJxkVi0MIeZ0eXCVFOT04G6UHJSJmwm\nG7+36XdpnJDHF2/OrpmIksi/XfgBY+EJHmq7n81pRa4WwmwxYjIb9NXYcwl2laYYhbayFt634iEq\nRmQTj9CcvRJfOB7m3y78AEmS+E8bPqpKS4UZwVKojR3SI4Sy29klSSLojy7oMJzL/Sv2s8rVimPE\nR6imbMGWgDemuvj59Wexm2z8b2s+oNqmrOcGZ1+xEjEUWrCEb0CDeVLhI2veR7vfgiBJRBqqFrz2\nmZvPc3ToJK3uplQo8XKlIMG+Zs0avv71r7NrV3G3kQL9Mu1A1lZFv3/BEr4zyUnqF3Gzu5GdIbmS\n43+E3s7YvT0hJvjZtV9xceIKG6rX8sgK9YkXejrMTBWVGMsrcjpQlSbWFqv6XqUPtNxDw6TERJmR\n73U9k7HD0Hhokq+c+iZj4Qne3XYvaypXqv5+XU8uyRPFQoI9GkkQj4ua1oJBMPCJqvswiXDdHeLZ\nzhczXnd54hpfP/0vxMQ4v7vutym3qs/Q1cskBaROmOGb2edB2UDU2NgVHGYHDxnWAfBi4irnxi5m\nvO5g75s83/0KHns1f7T192YVfluOFCTYV6xYQXt7e8Hmi9sBPe3L9lWyQyZ841rWawIabMsKkiTh\nGJwk6rJxnXH++7Gv8ubA28QSMSRJotPbzd8f/0de7XsDj72axzf8DgZB20+smKREsfDf1NbeTnxy\ngrh3Kus1akI+5xIfGcYUjROsr+T06Dn+7uiXOTt6QY6UiYc5N3aRvz/+VXqn+9nXsItHO7RlFerl\nPAW5hK/B4Vjw5JIyy6k0QSiYBuSNPVBXwW+6XuKpiz+k0ys3E58IT/JKzyG+ceZfEZH47OZPsq1W\nW7s3p1OfujkAthXyxhq+OT+LWkFLQEE65UNyiejBWgvfPPtdXuh+lfHQJJIk0Tc9wJMXvs9Prv0S\nt8XFn2z7zLz6QMuR5b1taSC1iHXQ0uwrZcEeun6dsn2ZE4JSha80CLX4xAQJr5eqHTt5fOPd/MeV\nn/H9yz/l+5d/ikEwpOLc72zczftXPpwzCiYTDqc11dRaq8Cdi629g8CZ04S7unBtnV+PRBQlQoEo\ndU3qtUiYMe9s2v4uhhoDHB44xjfPfReb0Uo4IQsho2Dkd9Z+iDsb92gOZzOaDHJTax0EuyAI2No7\nCF68QMLvz9h0PJDH6Q1InYYevPPjdE78hreHTvD20AncZhfTMdlUZTGY+YMtn8oZ4pkJXcOAm5oR\nzGbCndkFeyCPDU6SJELXr2EsL+f37v5j/vncv/H0jd/w9I3f4DQ5CMTlshaNznr+04aPUmPX1rug\nWMkp2B9//HHGMhS1euKJJ7j//oXjohfC43Hnfe9iUN8oCxdBmj22fMYpVmygz2Ih1n0z6/3RcByH\n00J9vfqGu2NXzwFQvXkDWzfdza6Ojfzowq+YCE4RSUSxGM18eOPDrPdof4kVqmuc3LwyitVsKvg3\nMm3byPjTP8cw1IvnATkZJv07/dMRJAkqq5yanjU9JJfCbb1jO19Ys5rf8j3ED889Q59vkFpnNR5H\nNfd27GNVdXte4/Z43JRV2Jn2hnVZp8GN6whevIB1apjKjoZ5nw/2eAGoayjT9LxY900MFgsb9uzm\nHw17OTt8iYOdR7gwcpXtDRvZ0bCZXU1bqXLk5yS0W+UInXhMLGgelHuHV6/Cd/kKVS4TRvv8jFcx\nLp8SW1orqax2qvruyOgoiakpqvbuYf2qjaxs/L95vfsoNya6uTnZTXtVM4+tfTfbGzbm3OBvN5lU\nCDkF+5NPPrkoDx4dzd2YdimJJ731I8PTqbF5PO68x2ltayd4/RpDPSMZF7HPG8JVZtP0/aOnzgOQ\nqGtO3mfmtzs+OG+chcytYJQXf3/fJEZLYUFTiepGEATGz5zH8eD0vHGODcv/bTQZNI158uIVMBoJ\nuqoJj05jxcUn1/zO7IvE/OZBGaPVZmJ0KMbgwBQm8/xKmFoQa5sAGD59gXjzfFv/0IAs2EVJUj3m\nSrtAsKcX+9p1jE/K2ZdNplY+vroV0vb1RABGA/mtB1GUEASYnAjkvabSf3NjcxtcvETf8XM41s0v\n6TAxLhd5C0diqp83ffQMAIaW9uQ9RvbX7GN/zewSvGNjC+dTFPKuLyVqNx/dwh2L3c7u1Cm0S8G2\nchUksyPnEo8liEYSmk0doZs3wGDQVL1OK3ral40OB9bmFsKdNxFj8xuGaAl1VJDicSK9PVhbWjGY\nc8d858tSOlDzsS37Ll8BScK+Kv/TWS4MSkG06cLnANLs7FnmIdVBSsNGGrpxHZgxf5aQKUiwv/TS\nSxw4cIAzZ87wuc99js985jN6jWvJ0VOgAakXLpxceOnkLdB6urE2NauqCZIvelX1U7CvXo0Ui2Us\njKY11BFk+7oUj2PX2MBbK3rOg6miAlNVNaEb1zPGcSvKRKbyzdnwXZCjP+yr1xQ8voXQqyAazETG\nhLI4UIP++R2kchG6cT2ZCLa8qjMWSkHO0wceeIAHHnhAr7HcUowmAza7WZfQLgDbSlk7CV2fHxmT\nj0CL9PUixWIprWex0H2DW72WqVdeJnTtKuzbMeuzlNPQrX4eglfkZsb2Net0GV829HQcAtjXrmX6\nyGGiA/1yL9A0gn456zS9iXUufJcugyBgX7nI68FlZXTITzQSx2or7IRkqqzCWFFB+OYNJEmaZfNO\nxEUi4Tg1deojVsRIhEhPN7aOFRjMy6OlnV6UMk/TcLosuoR2AZjcZZjr6uRFPEdLyycRQ9FycmWc\nFopTx9hlkDV2QBbsc8hHUw1dvSJ/75q1OowuO8qY9BLsjrXyRhRMjj+dgD+C3WGZ10EqG2Isiv/a\ndaytbRhsmcvu6oWe60EQBOwdK0l4vfMqPeZzig13dYIolswwGSgJ9jQcbqvcQShaWG0MBfuKVXK2\n3eDsbDslo1GTQFM01UW0qQLYHMkOQjpkXYKcqGT2eAhdn2+GCE5re5mleJzQtatYGpswlWkLkdSK\ncnIJ6DQP9qRgV35HBSXrVJNA60yao1Yv7lqAxTjBJTf6K7M3uFSoo1P9O6GYOW2LfGopRkqCPQ29\ntVVbMlEpNCdRSUvhK5CbFQcvX8JUVY25tk6XsWXDYBCwOy26vcgA9lVrEIMBgr19s/4e8EcwGAVV\nha9Arr8jRaPY1y6utg76m2LMNR5MlVWErlyZZa/OJ+s0nDTvLbZ9HcDp1i9JCcCxXi5vEbw0O0M0\nmEcsf8lxmp2SYE9D7+O3suDC1+YIdo2mmEhPD2IggGPDhiWpHe10yZUu9Yp0UgSQ7+Lslzngj+J0\nWVX/mxRtVzFrLCZ6a6qCIGBfu5aEf5rowExHpXyyToNXZbOWfdXiC/bUyUWnebA0NWN0uwlcujBr\nfWk1xUiiSOjGdUzV1arq8b/TKAn2NGZqY+ijnVgam+RFfPH87EWs0XmqaDeKtrPYOF3WlDNLD5Tj\nt+/ijBlCFCWC/ogmDW2pHKdAMuzOoFt0EMxsSKErl1J/05p1Koki4RvXsDU2YCpXn9yWL3qfXASD\nAce69SSmpoilFQTT+k5EeroR/f4leyeKjZJgTyNlitEpblcwGHBs3ETC6yXa15v6e9AvF74yW9TF\n6wYvyZUzHeuWSLAnj9+BaX02OHN9A0aXG9/FGYEWDkaRJPWaqhSPE7p+DUtD46Lb1xWcLqu+Jqk1\n8x2oWjX2aH8fYihE2frsPVv1xKljpUsFx/qNAATSzDFaywkEzstZ2M5N2urfvFMoCfY0UuVaddLY\nYWbhKQsRwO+PqDZBiLGoLNCampdEQwP9fQ2CIGBfs4bo2BjRoaFZ36021DHc3YUUiaSckEuBw2kh\nFNSnIBqAubYWU2XlLDu7Vo1dOb2VbVwawa6EYOql7EBmO7tWG3vg/DkQhNQmUWI2JcGeht4CDcCx\ncRMIQkqwJ+Ii4WAspRXnInzjBlI0uqRHTr01dgDnlq0A+M+ckr9bY6ijEua4FPZ1BYfLgiRBKKjn\nBreOxLQvFSml1d/iP30KBIHKnTtyX6wDBoMBu9Osq0nK7PHIkVKXLyEl5JLLWk6xiUCA8I3r2Fas\nxOhUV1PmnUZJsKeRqsmuo8ZucpdhbWsndP0aYjg0I9BUaqpLbV8H/TrnpOPcvFXe4M6clr97WqOm\nelk24yx2/Ho6etuXIS2e/bL8u2rZ4BJ+P6Hr17CtWImlYum6/zhdVgL+iK5lQxzrNyCGQqkG14FA\nRHUHqeClCyBJODdv0W08y42SYE/DaJS1Ez01dkiaYxIJgpcupR291WmqwUsXwGDAsQQhfgrKpqPn\nPJjKy3GvWU3o+jUSfr8mm2oiECB4+RLW1rYlM0dBWv0gHU8ujqRpzn/yhPzdGrJOA+fPgihmLIG8\nmDhcFuKxwruLzfrOpAkleOkCoigSCsRK9nUdKQn2OSyGdpJuZ1eEhBpTTCIYINzZKadML3KGYTqu\nRTDFAFTt3gWiSODc2Rmbqop58J8+CYkE7juWtlNXyiSl48nFXFWFbeUqQlcuE/f5CGrIOvWflk87\nzq3bdRuPGmYK5OnoSF6XPLlcukgoEEs+J/fpTZIkAufPYXS5sbaW6sNkoyTY5+BcBO3E1rECg8NB\n4MI5/Elh6VIj0E6evCVHTovVhNFk0NUkBVC1+w5AtrPPmCByv8z+48cAcO1cWsGu/EZ+nTc4985d\nIElMnzyhOutUiscJnj+L2ePB0pi9z+ti4FgkE6WtYwWhq1fwDcvlBdTMQ7S/j8TUFI6NmxZsXP1O\npzQzc1gM+7JgNOLYsJH42BjTQxOAOk3Vd+RNAMr27Mtxpb4IgiAnKekYCQFgb2nB7PEQPH+OgC+C\n2WLM2es0EQwQuHgBa0srlrrFzbqdy4wTWd95cN0hb3BTx0+qzjoNXrmMGA7j3Lp9SZLU0tGz92k6\n7n37QRQZPymH86oxTwbOlcwwaigJ9jnoHcuu4Eoen729Q7Oek43Y+BihK5exr1mL2ePRdSxqcLqt\nBANREon5ZWbzRRAEnFu3I4bDBLxBVRqa/9QpSCRwLbEZBtJ8DTpr7OaqamwrVjJ1U85tUGNbDiSj\niVzbltYMA/pnZCuU7doDRiMTV+X67K6yhedBkiR8bx0GoxHHpk26jmW5URLsc9C7NoaCa+cdGBxO\npsenEYTcx07fW0cAKNu7X9dxqEV5mUM6hrmBLJhEDISjkioNzX9CNsMstX0dwGQyYrObdDfFgLwe\nIkbZb5Jrk5dEEf/p0xgcjkUvApeJmdr0+s6D0e3GuXkLfp86v1P4+jWi/X24tu/E5F6aJLVipSTY\n57BYx06DxULZnXcRESzYzCzoLJMkCd+RNxFMpluiqcLiRMaAXJ0yXlkLgMO+cMxyIhggcOE81pYW\nLHX1uo5DLU63VXeNHeSNShHsuTT2wNkzxCfGcW3fiWBa+v7zi3WKBSjbt5+ISW66nsvvNHXwFQAq\n7r1P93EsN0qCfQ56t8hLp/yee4kYnViiC/dfjHR1EhsawrV9B0aHQ/dxqGExQv0ABJMJy54DAJgm\nBhe8dvrYMdkMs8RO03ScbiuxaIJoRJ+6OQrm6hrE2mYArGSfY0mSmPj1rwCofM9Duo5BLQ6XFUEA\n/3RY9+92btlGxCr38XQ4sod8xqd9+E8cx9LQuKTZx8VKSbDPYTGSUhSkihpEgxGzf4LIQH/W6xSn\nqXvfrTHDwOKE+ikIa2THl3TzEmI08zyLkQgTv3oawWymbP9duo9BLYsV+gkgJRtbx46+kfWa0LWr\nhG/ewLltO9amJt3HoAaDQcDhshLw6T8HBrOZmKMKczxE5NrlrNf53ngdKR6n/MB9S+48LkZKgn0O\n9mSjCb1NEEDKlmiNB/AefDXjNdGhQbyvH8JYXoFzw61zEC3m8Tuc/EqzfwLf4cxCbfKlF4hPTlL5\n7gcxV1XpPga1KCeXxbCzx9zVAMRPHSHS25PxmolfPwtA1Xsf0f35WnCVWQn49auboyBJEiEs2OIB\nxn72E6T4/JORJIp4XzuIYLFQtv/WKTvFREGC/R/+4R9473vfy/vf/34+//nP4/cvbGIoBpTO7Ho7\nT2FG+7WbJbxvHCLS2zvrc0kUGXryO0ixGLUf+/gtsacqLKbGrmyaNqJMPv+bVL0QhbjPx+RvnsXo\nclP50MO6P18Li1E3R8E/HUEQwBIPMfqTH837PNLbQ/D8Wexr1t7yZhIutxVRlHR3pkfCcRIJCWeZ\njUh3FxPP/XreNVMvv0hsbBT37r0YHaXaMGooSLDfddddPPvsszz99NO0tbXxzW9+U69x3VIcLquu\njSYUFOHg2b0NKRql/2tfJj41lfp88sXnCd+4jnvXbjmJ5RaSciIvgkBTNovqbZuIjY4y+sMfzGqb\nN/7M04jhMFXve/8t8zEoLKZgD/giuMpsODdsIHjh/KwKoHHvFEPffRK49do6LF6yljJpmH0fAAAa\nGklEQVSvVWtXYKyoYPyZp4mklbj2nznN6I/+A2N5OdXve7+uz17OFCTY9+/fn4ru2LZtG0PJkqzF\njtNlIREXCQVjun6vsohrNq+j5kMfJj4xQf/XvkLg/Dkmnv8N47/4GUa3m9qPfULX5+aDEuq3GCYp\nZR4aH3svloZGpl55iYGvfYXg1Sv0f+0reF99GXNdHRX33Kv7s7WSEmg6z0MiIRLwR3G5rdR8+CMg\nCAz809cY/fF/ELx0kZ6//SsiXZ2U7b8zVV/mVuJMxpj7dbazKxuFu8pJ3Sc/BYkEg//yTbyHXmP6\nxDEGv/UNBLOZpj/5Auaqal2fvZzR7az/k5/8hEceufWahR64ymwA+KZCGC36uSHSC4BVvPcRosPD\n+N58nf6vfEm+QBCo/cSnMLrduj2zEBwuK36f/pEQQX8Um92EraaKlv/z/2Hwm/9E4NxZAufOAnIr\nvdqPfeKWmqIUUmGfOgs0xTnvKrNia22j/vd+n7Gf/pjJ559j8vnnAKj+4G9R9fCjt4WzcLGcyOm1\nk1ybtlF+zwG8h15j+KknU9c0/OGfYOtYoetzlzs535zHH3+csbGxeX9/4oknuP/++wH4xje+gdls\n5rHHHlP9YI/n9hBemahvLOP8yX68UyHWbtQvfjoWkW3JbR3VWG1map74Y3qb5DR5Z1srrlUrsdXn\n97zFmM/KagcTowHKy+w5U//V4vG4CQailFfak2N2U/fXf0H3975P4GYnTR98P+Vbt9xSYZY+l5Ik\nYbYYiYRius5xKOmU9tSV4fG48Tz2IB0P3sfwiy8x+tobNH3wfVTv26t6nItNJCg7NRNxUfNzF7pe\nTMjmzqaWSjweNzX/+fP4H32IYE8Pwd4+XCtX4jlwd/4D12mcxUbOt/XJJ59c8POf//znvPbaazz1\n1FOaHjw6Oq3p+qVEMMpCxTcZ0nWck+MBzBYjvukwJGOCHe95FAAJmAam83iex+NelPlUmh50dY5T\nWV24rdvjcTPQP0UkHMdqM80as/PhD+AEYsDY2K1zwmeaS4fLwtSUvmuht2cSAKNZmPW9pt1307D7\nbkQWfkcW6zfPRizp4B4dntb03FzjHB2SP4snEjPXVTVgqGrAtW2PfM0S/DuXej7zRe3mU5Cd4dCh\nQ3z729/mG9/4BhaL+qbEtzvKsdM7FdL1ewMamzffapyL0CpwOmnaUcxdxYDLbSUcjJGI61c3J6Ch\nyuftgMNpwWAQdDfF+DWUsS6hnoLO13/zN39DLBbj05/+NABbt27lL//yL/UY1y1FKUbkndRPsMfj\nCcKhONW1Lt2+c7FZjIgQxWbvLi8ewZ6ejVxWoU9dfH9qgysOgSYnKVkWJSrGZjdhNqtr7F5CHQUJ\n9hdeeEGvcdxWKCnUemrsWhpL3C4ojkM9X+Zpb1JTLRKBBmkRIdN6CnZlHopng3O5rQwP+BBFCYNB\nHx+IPKfFMwfFQinzNAMGg4DLbcWno8ZejEdOd1Lo6Bnipphi3MUk0Bahbo5/OoLJbMBqu/WRP2px\nlVmRJHRrbB2NxIlFE0VjjiomSoI9C64yG9O+sG71yFM2VZV9HW8HFG1y2qtfyGNRmmIWySTlcltv\ni1BGtSjzoFcIbDEqO8VCSbBnwVWe1E50SkyZidctHuep1WbCYjWltGw9mPZGVNWjv53Q2yQVi8n+\nlmIywwC43PJ49drgis2BXEyUBHsWUtqqTkJN0XqLSVMFKCu3Me0N61Zewe8L43RbMRqLZ+nNJOfo\nu8kXm0Bz6Zx9qrbBRgntFM/btcS4dV7ExSrYXeVW4jGRcKjw8gpiQiQwHSk6TdWuc6hfSqAVkQMZ\n0kwxemvsRTYPxUBJsGfBlXIc6qOx+7xhLFYjVlv2ZgK3I8pGpMcG5/OGkaSZTbNYUJp769Vowl+E\nDmSYEcC6bXAlG/uiURLsWdDz2ClJEtPeMGXl+oTKLSXuVN2cwoWaEj7qKrJTC8gbXGA6qkuS0kyo\nY3EJNLtDPrnodYpN+Z2KKKCgWCgJ9iwojiI9NPZwKEY8JhadGQbSNXYdBHsyfLTYNHYgFb+uh8/F\nX6Q2doNB55PLdASL1ahbHaISM5QEexasNhNWm4lpHbSTYrWvw8yY9Qh59E4GgeJKylFwV+h3cim2\nrNN0nGU2gv4ooljYyUWSJDnkswjXQjFQEuwLUF5h10VTTQn2Isyw01ewh2Z9ZzFRVj5TyrlQ/NMR\nrDYTZkvxaaoutz5hwJFwnGgkUco6XSRKgn0ByirtRCMJIuHCOtQrWl4xCjRZABl1MUEUsynGrZhi\nCtzgZE01UnRmGAW9fE/KWtCrREOJ2ZQE+wKUJxddoTZFRRiUFaFgFwQBV5lVN429WDXVMp1MMak0\n+iLc3CDNmV7gelBOPiWNfXEoCfYFKK9MCvYCtZNitrGDvCEVenKRJAnvVKho58DhtGA0GZj2FmaK\nmYlhL855KEu+E4XWUVI2yJLGvjiUBPsCpDT2As0QPm84lZ5fjLh0iIwJh2JFrakKgoC73Fawxp4K\ndSxSU0x5pbwWCq18OqOxlwT7YlAS7AugaCeFRMYoMezFqqmCPsdvRaAVW1JOOmXlNiLheEEnF8W2\nrJwGiw1XmQ1B0FFjL+L34namJNgXQA+NPZTsvFPMtsRULHsBgl0xRxVzeJvyGxZijil2wW40GnCX\n23TR2F1lVoymkghaDEqzugDu8qR2UsDxu9jt66BPyGOqDnt5cZogANzJzOFC1oMSy1+sgh3ksYcC\nMaKR/E4uibiI3xcpaeuLSEmwL4DRaKCswo53In/tRLElLgvBXsDJxZ/snFTM86BHZIx3MoTdaS5a\nfwvM2MXznQdlHZXs64tHSbDnoKLKTjgUy7u64XLQ2O0OczIiJH9fQzE2sZ7LzMklv40+kRCZ9oYp\nr3ToOawlRzlt5NsTuBTquPgUpDZ89atf5eWXX8ZgMFBdXc0Xv/hFPB6PXmO7LSivcsCNCbyTIWx2\n7ZUZZ2LYi1c7EQQBd4Gx7FMTQSxWE3ZHcVW3TCelqeY5D9PJ6pbFbIaBdI09T8E+mXwninwebmcK\n0tg/85nP8Mtf/pJf/OIX3HvvvXz961/Xa1y3DRVV8uKbGg/mdf+Mxl68tmWQtVUlZFEroijhnQxR\nU+sqqlZwc0nVD8rTBKGY9IpesCshjwVr7MU9D7czBQl2p9OZ+u9QKITBsPwsOxVV8rF5ajI/we7z\nhrHZzUWZbZmOklKfz8s87Q0jJiSqa525L77NcZfbknXltXeUUtaQoiwUKwVr7KnkpJIpZrEoWNp8\n+ctf5umnn8btdvPUU0/pMabbivKkYM/HgSpJEn5vmOpal97DWnIqq+V5mBwPUFOn7d+jnHZqlsE8\nlFXYGBv2EwxENdcR9y2T+ihmsxGny5J3LLtvKoTZYszLtFlCHTkF++OPP87Y2Ni8vz/xxBPcf//9\nPPHEEzzxxBN861vf4nvf+x6f//znVT3Y43FrH+0toL2jGrPFiN8X0TxmnzdEIiFRU+ta9H/vYn9/\n+4oa3uQ60VBC87OuXxgBWJJ50IOFxljXUM7NK2MYMWj+twT9sgN+5eparLbCT3C3ci6ra130dE5Q\nWenAZDIueG36OCVJwucNU1XjpLa2bLGHqYliWJtqybm6nnzySVVf9Oijj/IHf/AHqgX76Oi0qutu\nJR6Pm7ExP+UVdsZH/YyM+DTZiHs7JwCwuyyL+u/1eNyLPp8Gs/zv7uuZ1Pysvp5JAKprF3+chZJr\nLk0W2dzY0zWOzaVN4xwdnsbhtOCbDkGB07AUv/lCOJwWkODm9bHUaS4Tc8cZDESJRRM4Fvmd0Mqt\nnk+1qN18CjKKd3d3p/775ZdfZsWKFYV83W1LeZWdeEzU3OtxYjQAQLWn+G3LTpcFs8XI5HhA871T\n40EEAapqijvMD/KPZU8kRPy+cNE7ThXyLQZWcpwuDQWdB7/0pS/R2dmJwWCgsbGR//bf/pte47qt\nSDlQJ0Ka4rAnxmQhWFlT/IJdEAQqaxyMDfkRRVGTo3xyIoi73JbzyF4MKGtB6wbnm1oeoY4KqVh2\njQ7UkuN0aShIsP/jP/6jXuO4rSlXQh4ngjS3V6q+b2IsgMEgLJuXubLaycjANN7J8ILH73TCoRjh\nYIy6huVhv3SX2zBbjIyPaBPsqVICRR4Ro1Be0thva5ZffOIiUJFHZIwkSUyOBamodmA0Lo9pTkXG\njKkXalMTyRA/lRvB7Y4gCFTXOpmaCBKPq4/pXy4x7AqKxq1VY1fWTrGHfN7uLA+Js8ikkpQ0xLL7\nfRFi0QRVy8AMo1BZo5gh1M+DEuqobI7LgSqPC0mCyTH186AIwOUi2K02M1abSXNew9iwH4vVVNQl\nNoqBkmBXgdVmxuYwa9LYFcfpcnAYKlRWy5uUFvuysgksF40dZpzhym+shuWmsYN8gvNNhojH1J1c\nYtEEUxMhauqKOwO5GCgJdpVUVNnxTYVIJERV1yuO06plEBGj4C63YTQZNGmqisau1iZfDCgJZ+Oj\nftX3eCdDOFyWos9ATsdT70aSYFzlBqfM13JIVLvdKQl2lVRUOpAk9WFuyykiRsFgEKiosjM1HlSd\nUj81EcRqMy2rLEPFvKbWgRoJx5n2qnc4Fws19bJDfHRIXfz32LAs2Ks1Zi6X0E5JsKskPTJGDROj\nAYwmw7Lz/lfWOInHRVWVHhMJEd9UmIpqx7I6elttJtxlVtWmGEXw1TbcXpmWheKplwW0WsE+PlLS\n2JeKkmBXiWJSGVOxiEVRYmo8SGW1A4Nh+Qg0SK8Zk3uD802FEEWJymXkOFWoqnURDEQJBaM5rx0Z\n9AFQu0xCPhUqqx2YTAZNGrvBIKSc8CUWj5JgV0ldo6xtDfX7cl477Q0Rj4vLKiJGIeVAVWFnV65Z\nTo5TBcWBqsYcMzwgrxllDS0XDAYD1XUuJsdyh36Kosj4aICqGueyCf+9nSnNsErsDgsVVXaGB3yI\n4sL25YlRWaAtJ8epwkzIo3qBphzZlxNqHaiSJDEyMI3TbcHpLu6a/Jnw1LkRRSnnBuedCJGIiyX7\n+hJREuwaqG8qJxZN5EzQmXGcLj9NtbzSjsEgpOylCzHQO4XBIFDXWL4EI1taUiGPOQRaYDpCMBBd\ndvZ1BbV29jHFvl4S7EtCSbBroK5ZMcd4F7wuFeq4DE0xRqOB2sYyxob9RMLZ+8DGonHGhvx46t2Y\nLcVfI2Yu5VV2jEYhZ6jfyKDiOF1e9nUFj8rIGCUipuQ4XRpKgl0D9U2y5jnUl93OLkkSw33eZZ1d\n19xWgSTBQM9U1muG+mWTVUPL8tPWQbYvV9Y4mRgLLGiaW672dYXKGtmBOja08AkuFepYEuxLQkmw\na6Cy2oHFalpQY58cDzLti9C6onJZhfiloxRC6+uazHrNYK88R40tFUsypltBtcdJIi4u6G9QNHZF\ns11uGAwGqmtdTIwFsjpQJUlibMSPu9ymS4ORErkpCXYNCIJAfVMZvqkwwUDmMLeeG+MAtKyoXsqh\nLSm1jWWYzAb6urNr7AO98mf1zctTYwdobJM3uO7r4xk/F0WJ0aFpKmtkhWC54ql3IYpS1rj+oD9K\nOBgr2deXkJJg10h9k3ykHs4S9thzU+6a1LqiasnGtNQYjQYaWyuYGg/iz9B8JB5PMDLgo6bOtaw1\ntPZV1QgCdF6d3zoS5HIKsWhi2TpOFXLZ2ZV3Yrmao25HSoJdI3WKnT2DOSYaiTPY68VT75Jbhy1j\nmpPaan8Gc8zIwDSJxPK1ryvY7GYaWysYGZzOuMEp9vXl6jhV8CT/fdlMc9cuDgOwan3tko3pnU5J\nsGukrtGNIGROVOrrmkQUJVqXsRlGoSkp2Pu657/Mg0kzzHK2ryt0rKkBoCuD1t6f7PW63DXVqhon\nVR4nXdfG55kofd4Q/d1T1DeXL9tggtuRkmDXiNliorrWxeigj3BodrhfygyzcvmaYRSqa53YHGb6\nuybnFQQbSDpOl7vGDtCxWhbsN6+Ozvq7fzrCjUujVFTZl71tWRAENmxrQBQlrpwbmvXZhdMDAKze\nUNLWlxJdBPt3vvMd1q1bx9RUdmfacmLNxjoSCYmTR2aaeUuSRM/NcWx207K3qYL8Mje3VRDwR2cV\nRgsGogz1e6msdmB3LG9zFICrzEZtg5uBnqlZG/25432IosTWPS3LNjoqnTUb6zCaDFw6Mzhroz9/\nsh+DQWDlOs8tHN07j4IF+9DQEIcPH6axsVGP8RQFm3Y04S6zcu5Ef6rK4cRogMB0lJaOqmVX+Csb\nTcmwx7PH+1N/e/2Fa8RjIhu3v3PWQ8eaGiRpJjomEo5z8fQADqeFNRvrbvHolgarzcyqdR68k7Lp\nBeTQ38E+Ly0dle+ITf52omDB/nd/93f82Z/9mR5jKRqMJgO77ulATEgcfb0T72SIF56+CEB78mj+\nTmD1+lqqPE4unhrg3Ik+blwe4eaVUeqby9m0s+lWD2/J6Fgja6PnT8kb/cUzA0QjCTbf0YTJtPyy\nbrOxYZu8mV86M5A0ywwCsGrDO2Nzu50oKBbtlVdeoaGhgbVr1+o1nqJhzca6/7+9u4tpMkvjAP6v\ntIDDOKaK06DD6CwOG4gFRhPdgURtbeSjVlFRboymDUZvrCB+hKJGA8aAqJekxAjRZDTK2myI0Wym\nWiEIIsYFN6Q6bHAcjAVRMhSj9OvZC9dO2NJqzOgp5fndnSYn+acfT09P3/c56Or4DY/+PYBfe19g\n7I0H6Uu/mVI/OWXRUuQVKPH3c/fQ+nMvZNFSREmnQZX31ymx/fCOfPYXSPxOjt/6hvGT+Q6ipNMg\ni46aUr9aAEAx7yvI47/Af+zP8fiXFng8Psiio/Dd95F/MUG4eW9h1+v1GBoK/Me/uLgYZrMZZ8+e\n9T/2oafqRAKJRIK/rfwLrl56ALfLixU5yf4Vy1QyY2Yscjcq8Y+f/gXXmAc/qpIi6uDqD5W3KQ29\nPQPobP0Vvw+/RvrSRMTERs6pUR9CIpFg8Y/z0fLPX/DVzFjMmhOHH5Z+G1HHAU4WEvrIavzo0SPo\n9XrExsa+7Y8yMACFQoHLly9j9mz+hmaMMVE+urD/P7VaDYvFgpkzI/8SN8YYC2d/2nXsEolkSm3F\nMMZYuPrTVuyMMcbCA995yhhjEYYLO2OMRRgu7IwxFmGEFXa73Y7CwkLk5+ejoKAADx48EBXlvc6f\nP4+cnBzodDrU1NSIjhNUuPfsqa6uRm5uLtatW4ddu3ZhdPT9B2J/Ts3NzcjJyUF2djbq6upEx5mQ\nw+HA1q1bkZeXB51Oh3PnzomOFJTP58P69euxc+dO0VGCcjqdMBqNyM3NhVarRVdXl+hIE2poaMCa\nNWug0+lQWloKl2vig378SBCDwUAtLS1ERGSz2WjLli2iooTU3t5Oer2e3G43ERG9ePFCcKKJPXv2\njAwGA6lUKhoeHhYdZ0Ktra3k9XqJiOjEiRNUU1MjONEfvF4vaTQa6u/vJ5fLRWvXrqXe3l7RsQIM\nDg5ST08PERGNjo7S6tWrwzInEVF9fT2VlpbSjh07REcJ6sCBA9TY2EhERG63m5xOp+BEgRwOB6nV\nahobGyMiot27d5PFYgk5R9iKXSKRwOl8e+KK0+mEQhGe/SQuXLiA7du3Qyp9e/fcrFnh2ZJ3MvTs\nyczMxLRpb99yGRkZcDgc75nx+XR3d2P+/PmYN28eZDIZtFotrFar6FgB5syZg5SUFABAXFwckpKS\nMDg4KDhVIIfDgVu3bmHTpk2iowQ1OjqKzs5ObNy4EQAglUrx5Zfh2WLZ5/Ph9evX8Hg8ePPmDb7+\nOnQbZGH3+paVlaGoqAhVVVUgIly8eFFUlJAeP36Mzs5OnD59GjExMdi/fz+USqXoWONMxp49jY2N\n0Gq1omP4DQwMICEhwT9WKBRhvT0IAP39/bDb7UhLSxMdJcC7hca7xVs46u/vh1wuR1lZGex2OxYt\nWoTy8nLExobXgSAKhQJ6vR4rV67E9OnTkZWVhczMzJBzPmlhD9ZnpqSkBLdv30Z5eTk0Gg2uX78O\nk8mE+vr6TxknqFD9cLxeL0ZGRnDp0iV0d3ejuLhYyEpusvTsCfWaq9VqAEBtbS1kMhl0Ot3njheU\nyOfsY7x69QpGoxEmkwlxcXGi44xjs9kQHx+PlJQU3LlzR3ScoDweD3p6enD48GEolUocO3YMdXV1\nMBqNoqONMzIyAqvVips3b2LGjBkwGo1oamoK/fn55BtEQSxZsmTcePHixYKShFZUVEQdHR3+sUaj\noZcvXwpMNN7Dhw8pMzOT1Go1qVQqSk1NJZVKRUNDQ6KjTejKlStUWFjo3y8MF/fv3yeDweAfm81m\nMpvNAhMF53a7yWAwUENDg+goEzp58iStWLGC1Go1ZWVlUUZGBu3bt090rADPnz8ntVrtH9+9ezcs\n/w+4du0alZeX+8cWi4WOHj0aco6wPXaFQoGOjg4AQFtbGxYsWCAqSkgajQZtbW0AgL6+Png8Hsjl\ncsGp/pCcnIzW1lZYrVbcuHEDCoUCFoslLBuxNTc348yZM6itrUV0dHgdvKBUKvHkyRM8ffoULpcL\nV69exapVq0THmpDJZMLChQuxbds20VEmtGfPHthsNlitVpw6dQrLli1DdXW16FgB4uPjkZCQgL6+\nPgBAe3s7kpKSBKcKNHfuXHR1dWFsbAxE9EE5he2xV1RUoLKyEj6fDzExMaioqBAVJaQNGzbAZDJB\np9NBJpOhqqpKdKSQwrlnT2VlJdxuNwwGAwAgPT0dR44cERvqf6KionDo0CEYDAYQEQoKCsLyQ37v\n3j00NTUhOTkZ+fn5kEgkKCkpwfLly0VHm5QOHjyIvXv3wuPxIDExEcePHxcdKUBaWhqys7ORn58P\nqVSK1NRUbN68OeQc7hXDGGMRhu88ZYyxCMOFnTHGIgwXdsYYizBc2BljLMJwYWeMsQjDhZ0xxiIM\nF3bGGIswXNgZYyzC/Be68EGj7hfMcwAAAABJRU5ErkJggg==\n",
             "text/plain": [
-              "[]"
+              "\u003cmatplotlib.figure.Figure at 0x7f385e198650\u003e"
             ]
           },
-          "execution_count": 48,
           "metadata": {
             "tags": []
           },
-          "output_type": "execute_result"
+          "output_type": "display_data"
         }
       ],
       "source": [
-        "# Create TensorFlow Variables using Keras's Dense layer.\n",
+        "def f(x):\n",
+        "  return tf.square(tf.sin(x))\n",
         "\n",
-        "wb = tf.keras.layers.Dense(units=1, use_bias=True)\n",
+        "def grad(f):\n",
+        "  return lambda x: tfe.gradients_function(f)(x)[0]\n",
         "\n",
-        "# We can access the underlying TensorFlow variables using wb.variables.\n",
-        "# However, the variables won't exist until the dimensions of the input\n",
-        "# tensors are known. Once the dimensions of the input tensors are known,\n",
-        "# Keras can create and initialize the variables. Until then, Keras will\n",
-        "# report the variables as an empty list: [].\n",
+        "x = tf.lin_space(-2*pi, 2*pi, 100)  # 100 points between -2π and +2π\n",
         "\n",
-        "wb.variables"
+        "import matplotlib.pyplot as plt\n",
+        "\n",
+        "plt.plot(x, f(x), label=\"f\")\n",
+        "plt.plot(x, grad(f)(x), label=\"first derivative\")\n",
+        "plt.plot(x, grad(grad(f))(x), label=\"second derivative\")\n",
+        "plt.plot(x, grad(grad(grad(f)))(x), label=\"third derivative\")\n",
+        "plt.legend()\n",
+        "plt.show()"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {
         "colab_type": "text",
-        "id": "docKLUaonYG_"
+        "id": "-39gouo7mtgu"
       },
       "source": [
-        "## Step 3: *Define the loss function*\n",
+        "## Gradient tapes\n",
         "\n",
-        "Our loss function is the standard L2 loss (where we reduce the loss to its mean across its inputs)."
+        "Every differentiable TensorFlow operation has an associated gradient function. For example, the gradient function of `tf.square(x)` would be a function that returns `2.0 * x`.  To compute the gradient of a user-defined function (like `f(x)` in the example above), TensorFlow first \"records\" all the operations applied to compute the output of the function. We call this record a \"tape\". It then uses that tape and the gradients functions associated with each primitive operation to compute the gradients of the user-defined function using [reverse mode differentiation](https://en.wikipedia.org/wiki/Automatic_differentiation).\n",
+        "\n",
+        "Since operations are recorded as they are executed, Python control flow (using `if`s and `while`s for example) is naturally handled:\n",
+        "\n"
       ]
     },
     {
       "cell_type": "code",
       "execution_count": 0,
       "metadata": {
-        "cellView": "code",
         "colab": {
           "autoexec": {
             "startup": false,
@@ -245,125 +182,42 @@
           }
         },
         "colab_type": "code",
-        "id": "0_w8ZJSCtuY7"
+        "id": "MH0UfjympWf7"
       },
       "outputs": [],
       "source": [
-        "def loss_fn(predictions, labels):\n",
-        "  \"\"\"Calculates the mean L2 loss for our linear model.\"\"\"\n",
-        "  return tf.reduce_mean(tf.square(predictions - labels))"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 0,
-      "metadata": {
-        "cellView": "code",
-        "colab": {
-          "autoexec": {
-            "startup": false,
-            "wait_interval": 0
-          },
-          "base_uri": "https://localhost:8080/",
-          "height": 34
-        },
-        "colab_type": "code",
-        "executionInfo": {
-          "elapsed": 348,
-          "status": "ok",
-          "timestamp": 1525154234538,
-          "user": {
-            "displayName": "",
-            "photoUrl": "",
-            "userId": ""
-          },
-          "user_tz": 420
-        },
-        "id": "RkNbXoXkpjVH",
-        "outputId": "e4688f3c-e29f-416d-f541-6d81953b5660"
-      },
-      "outputs": [
-        {
-          "data": {
-            "text/plain": [
-              "\u003ctf.Tensor: id=1252, shape=(), dtype=float32, numpy=16.979801\u003e"
-            ]
-          },
-          "execution_count": 50,
-          "metadata": {
-            "tags": []
-          },
-          "output_type": "execute_result"
-        }
-      ],
-      "source": [
-        "# Test loss function (optional).\n",
+        "def f(x, y):\n",
+        "  output = 1\n",
+        "  for i in range(y):\n",
+        "    output = tf.multiply(output, x)\n",
+        "  return output\n",
         "\n",
-        "loss_fn(wb(inputs), labels)"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 0,
-      "metadata": {
-        "colab": {
-          "autoexec": {
-            "startup": false,
-            "wait_interval": 0
-          },
-          "base_uri": "https://localhost:8080/",
-          "height": 51
-        },
-        "colab_type": "code",
-        "executionInfo": {
-          "elapsed": 418,
-          "status": "ok",
-          "timestamp": 1525154260083,
-          "user": {
-            "displayName": "",
-            "photoUrl": "",
-            "userId": ""
-          },
-          "user_tz": 420
-        },
-        "id": "K_7beXoHOU7t",
-        "outputId": "8f55c028-fe2b-4edb-ad68-a849afc60623"
-      },
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "w: -0.311619\n",
-            "b: 0.000000\n"
-          ]
-        }
-      ],
-      "source": [
-        "# At this point, the variables exist, and can now be queried:\n",
+        "def g(x, y):\n",
+        "  # Return the gradient of `f` with respect to it's first parameter\n",
+        "  return tfe.gradients_function(f)(x, y)[0]\n",
         "\n",
-        "w, b = wb.variables\n",
-        "print(\"w: %f\" % w.numpy())\n",
-        "print(\"b: %f\" % b.numpy())"
+        "assert f(3.0, 2).numpy() == 9.0   # f(x, 2) is essentially x * x\n",
+        "assert g(3.0, 2).numpy() == 6.0   # And its gradient will be 2 * x\n",
+        "assert f(4.0, 3).numpy() == 64.0  # f(x, 3) is essentially x * x * x\n",
+        "assert g(4.0, 3).numpy() == 48.0  # And its gradient will be 3 * x * x"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {
         "colab_type": "text",
-        "id": "JVDWpL9VYWdP"
+        "id": "aNmR5-jhpX2t"
       },
       "source": [
-        "## Step 4: Create an optimizer\n",
+        "At times it may be inconvenient to encapsulate computation of interest into a function. For example, if you want the gradient of the output with respect to intermediate values computed in the function. In such cases, the slightly more verbose but explicit [tf.GradientTape](https://www.tensorflow.org/api_docs/python/tf/GradientTape) context is useful. All computation inside the context of a `tf.GradientTape` is \"recorded\".\n",
         "\n",
-        "We'll use a `GradientDescentOptimizer` to fit our model."
+        "For example:"
       ]
     },
     {
       "cell_type": "code",
       "execution_count": 0,
       "metadata": {
-        "cellView": "code",
         "colab": {
           "autoexec": {
             "startup": false,
@@ -371,36 +225,48 @@
           }
         },
         "colab_type": "code",
-        "id": "DudNEebMKDWN"
+        "id": "bAFeIE8EuVIq"
       },
       "outputs": [],
       "source": [
-        "optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.1)"
+        "x = tf.ones((2, 2))\n",
+        "  \n",
+        "# TODO(b/78880779): Remove the 'persistent=True' argument and use\n",
+        "# a single t.gradient() call when the bug is resolved.\n",
+        "with tf.GradientTape(persistent=True) as t:\n",
+        "  # TODO(ashankar): Explain with \"watch\" argument better?\n",
+        "  t.watch(x)\n",
+        "  y = tf.reduce_sum(x)\n",
+        "  z = tf.multiply(y, y)\n",
+        "\n",
+        "# Use the same tape to compute the derivative of z with respect to the\n",
+        "# intermediate value y.\n",
+        "dz_dy = t.gradient(z, y)\n",
+        "assert dz_dy.numpy() == 8.0\n",
+        "\n",
+        "# Derivative of z with respect to the original input tensor x\n",
+        "dz_dx = t.gradient(z, x)\n",
+        "for i in [0, 1]:\n",
+        "  for j in [0, 1]:\n",
+        "    assert dz_dx[i][j].numpy() == 8.0"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {
         "colab_type": "text",
-        "id": "YBeJYxY8YaiO"
+        "id": "DK05KXrAAld3"
       },
       "source": [
-        "### Step 5: Define a training step\n",
-        "\n",
-        "To fit model variables to the data we'll need to:\n",
+        "### Higher-order gradients\n",
         "\n",
-        "1. Calculate the gradients of the loss with respect to the model variables.\n",
-        "2. Use `optimizer` to compute updates to the variable values based on those gradients.\n",
-        "\n",
-        "To calculate the gradients, we use the [`tf.GradientTape`](https://www.tensorflow.org/api_docs/python/tf/GradientTape) context manager\n",
-        "and its `gradient` function to compute gradients through computation conducted within its context:\n"
+        "Operations inside of the `GradientTape` context manager are recorded for automatic differentiation. If gradients are computed in that context, then the gradient computation is recorded as well. As a result, the exact same API works for higher-order gradients as well. For example:"
       ]
     },
     {
       "cell_type": "code",
       "execution_count": 0,
       "metadata": {
-        "cellView": "code",
         "colab": {
           "autoexec": {
             "startup": false,
@@ -408,163 +274,37 @@
           }
         },
         "colab_type": "code",
-        "id": "diDZfrMJM3OC"
+        "id": "cPQgthZ7ugRJ"
       },
       "outputs": [],
       "source": [
-        "def run_step(inputs, labels):\n",
-        "  with tf.GradientTape() as g:\n",
-        "    loss = loss_fn(wb(inputs), labels)\n",
-        "  # Compute the partial derivatives of loss with respect to the variables\n",
-        "  grads = g.gradient(loss, wb.variables)\n",
-        "  optimizer.apply_gradients(zip(grads, wb.variables))\n",
-        "  return loss"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "colab_type": "text",
-        "id": "1WWepgmJQOzc"
-      },
-      "source": [
-        "Repeatedly running the training step will nudge the variables towards the values that best fit the data (i.e., \"w\" will move closer to 3.0, while \"b\" will tend to 2.0):\n",
-        "\n"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 0,
-      "metadata": {
-        "colab": {
-          "autoexec": {
-            "startup": false,
-            "wait_interval": 0
-          },
-          "base_uri": "https://localhost:8080/",
-          "height": 51
-        },
-        "colab_type": "code",
-        "executionInfo": {
-          "elapsed": 380,
-          "status": "ok",
-          "timestamp": 1525154412590,
-          "user": {
-            "displayName": "",
-            "photoUrl": "",
-            "userId": ""
-          },
-          "user_tz": 420
-        },
-        "id": "ya5Qxz5XQlhU",
-        "outputId": "8dd47155-a6c1-44c5-c279-617c803f1723"
-      },
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "Values of w, b BEFORE applying gradients: 2.725763, 1.894334\n",
-            "Values of w, b AFTER  applying gradients: 2.774932, 1.922555\n"
-          ]
-        }
-      ],
-      "source": [
-        "w, b = wb.variables\n",
-        "print(\"Values of w, b BEFORE applying gradients: %f, %f\" % (w.numpy(), b.numpy()))\n",
-        "run_step(inputs, labels)\n",
-        "print(\"Values of w, b AFTER  applying gradients: %f, %f\" % (w.numpy(), b.numpy()))\n"
+        "# TODO(ashankar): Should we use the persistent tape here instead? Follow up on Tom and Alex's discussion\n",
+        "\n",
+        "x = tf.constant(1.0)  # Convert the Python 1.0 to a Tensor object\n",
+        "\n",
+        "with tf.GradientTape() as t:\n",
+        "  with tf.GradientTape() as t2:\n",
+        "    t2.watch(x)\n",
+        "    y = x * x * x\n",
+        "  # Compute the gradient inside the 't' context manager\n",
+        "  # which means the gradient computation is differentiable as well.\n",
+        "  dy_dx = t2.gradient(y, x)\n",
+        "d2y_dx2 = t.gradient(dy_dx, x)\n",
+        "\n",
+        "assert dy_dx.numpy() == 3.0\n",
+        "assert d2y_dx2.numpy() == 6.0"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {
         "colab_type": "text",
-        "id": "61TgeLVlKEQp"
-      },
-      "source": [
-        "## Step 6: Create a training loop\n",
-        "\n",
-        "Of course, now we can simply turn all of this code into a self-standing training loop. We'll also capture our loss and approximations of `w` and `b` and plot them over time."
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 0,
-      "metadata": {
-        "colab": {
-          "autoexec": {
-            "startup": false,
-            "wait_interval": 0
-          },
-          "base_uri": "https://localhost:8080/",
-          "height": 364
-        },
-        "colab_type": "code",
-        "executionInfo": {
-          "elapsed": 580,
-          "status": "ok",
-          "timestamp": 1525154278709,
-          "user": {
-            "displayName": "",
-            "photoUrl": "",
-            "userId": ""
-          },
-          "user_tz": 420
-        },
-        "id": "VukGe-huNaJ4",
-        "outputId": "c79c8e63-c781-451e-f74f-20815d8da49f"
+        "id": "4U1KKzUpNl58"
       },
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "[0.9409681558609009, 1.3733772039413452, 1.7128530740737915, 1.9793939590454102, 2.188689708709717, 2.3530514240264893, 2.4821391105651855, 2.583533763885498, 2.6631851196289062, 2.7257626056671143]\n"
-          ]
-        },
-        {
-          "data": {
-            "image/png": "iVBORw0KGgoAAAANSUhEUgAAAd8AAAFKCAYAAABcq1WoAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAIABJREFUeJzs3Xd4U2X/BvD7ZLRpumlLS6EDgbKh\niIggU7aAgPhDRKsIUoYgiK++ioAguBARXmZBEARFUBGhiChIEQcqe+/RMlpGd9KRcX5/nDZtaFra\nkuY07f25rlw5zXmSfPMk5OY5Oec8giiKIoiIiMhhFHIXQEREVN0wfImIiByM4UtERORgDF8iIiIH\nY/gSERE5GMOXiIjIwVSOeJJbtzLs/pi+vlqkpOjt/rhkjf3sGOxnx2A/Owb7WRIQ4FnsOqcd+apU\nSrlLqBbYz47BfnYM9rNjsJ/vzWnDl4iIyFkxfImIiByM4UtERORgDF8iIiIHY/gSERE5GMOXiIjI\nwRi+REREDsbwJSIih/vxx61YtGi+3GXIhuFLRETkYA45vSQREZEtGzeux65dPwMAOnbsjOeeG45/\n/tmHFSuWwNVVA1/fGnjnndk4eHB/kdtUKueNMKesPDZ2C7p37wSNxkfuUoiInN6MGVOxdetmuz2e\nQiGgb98BmDFjdontbty4hgMH/sGKFV8AAKKjX0DXrt3x3XcbMH78q2jZshX27PkVaWmpNm/z8/O3\nW82O5nSbnTMy0jFixHN48cUX5S6FiIjuw9mzZ9G0aXOoVCqoVCo0b94S58+fRdeu3fHxxx/giy9W\noUGDhvDz87d5mzNzupGvp6cXOnTohF27duHkyRNo0qSp3CURETm1GTNm33OUWhYBAZ6lms1OEABR\nFC1/GwwGCIICvXv3Rdu27fDbb3H4739fxezZc2zeFhYWbreaHc3pRr4AEB09DgCwYsVSmSshIqLy\niohoiOPHj8FoNMJoNOLkyROIiGiI1as/g1KpwoABT6Jbt564fPmizducmdONfAGgR49eqFevHr79\ndgPefnsG/P2de/MDEVF1FBQUjFatHsKECdEwm0X07z8AQUG1EBgYhEmTxsHT0wuenp4YOvQ56PX6\nIrc5M0EsPOavIKXZ/FBW69d/jokTJ+LNN6di8uQ37P74JCnt5iO6P+xnx2A/Owb7WRIQ4FnsOqfc\n7AwAL774Ijw9vbBq1Qrk5ubKXQ4REVGpOW34enp6YtiwKNy8mYQfftgkdzlERESl5rThCwAvvTQa\nCoUCMTFL4ICt50RERHbh1OEbFhaO3r374ujRw/j7731yl0NERFQqTh2+ADB6tHTY0fLlS2SuhIiI\nqHScPnwfeaQ9mjdviR9/3Ir4+Ctyl0NERHRPTh++giAgOnoszGYzVq5cLnc5REQkk/Pnz1kGYe+8\n8xZycrLL/ViHDx9ESkqyvUorwunDFwAGDhyMgICa+PLLL5CZmSl3OUREJIM9e35FQkI8AGDmzA/g\n6qop92Nt27alQsPXKc9wdTdXV1e8+OJLmDPnfWzY8BVGjoyWuyQiIrqHYcMGY+3ajRBFEX36PIaF\nC5ehUaMmmDx5PN54420EBdWCyWTCnDnv4fr1azAajXjppTFo3boNtm+PxaZNG6FSqVG/fgQGDhyM\nH37YhD17foWvry+mT38LX3yxAZ9+Oge+vr44c+Y0UlNT8OyzL2Dbtq1IS0vFokXLIQjAzJlTkZWV\nhezsbLz66uvQ6TKxd28cLl26iNmz5+DMmZP4+ut1UCpVaNiwMSZMePW+X3uVCF8AeOGFkZg/fy5W\nrFiKF198CQpFlRjUExFVOPcZU+FqxykFoRDg3ncAdPeYrKFhw8a4ePECjEYDGjVqjOPHjyIiohGS\nk5MRFFQLAPDLLz/Bz88fb701HampqZg4cQzWrPkaX3+9DnPmzEdgYBC2bduCOnXqoG3bdujSpRua\nNGlm9TxKpQoLFizFzJlTcezYUSxYsASzZk3DwYP7ER5eF/36DUSnTl1w4MC/+PLLNXjvvY9Rv34E\nJk9+A15eXlizZiWWLfscLi4umDbtTRw9ehgtWkTeVxdVmfANCAjA4MFDsH79Ouza9TN69Ogtd0lE\nRFSCyMgHceLEMeTm5uCpp57Gnj270bLleURENLS0OX78KI4cOYSjRw8DAHJycmAwGNC9ey9MmfI6\nevXqg+7de5W4iblxY2n2Oz8/f8tMSL6+ftDpMlGjhh/WrPkM69evhcFggEZj/TiXLl1EUlIiJk8e\nDwDQ6TKRmJiIFi3u77VXmfAFgFGjxmL9+nWIiVnK8CUiKiXdjNn3HKWWRUCAJ3SlOLdzq1atsW7d\nauTkZKNfvwHYtm0rjh07ggcffMjSRqVS4/nnRxT5To+KehE9evRBXNxOvPLKWCxeXPwOt0ql0uay\nKIrYuPEr+PvXxLRps3D69EksWjTf6r5qtbSped68Rfd8PWVRpbbNNmvWHB06dMJvv+3GqVMn5S6H\niIhKEBoahqSkJGRm6qDVusPPzw9798ZZhW+TJs3w++97AAApKcmIiVkMs9mMmJjF8Pf3x9Chz6FZ\ns+ZITEyEIAgwmUxlqiEtLRW1a9cBAOzZsxtGoxEAoFAoYDKZEBoajsuXL1l2vlq5Mga3bt2879de\nqvA9e/YsunfvjnXr1gEAbty4gaioKAwbNgwTJ06sVBMbcK5fIiLn4evri6CgIABS0N64cQM1awZa\n1j/2WHe4uWkxZswIvPHGq2jRIhIKhQJarTtGj34REyeOhSAIaNAgAi1btsL8+R9j//5/Sv38vXv3\nxYYNX+LVV19G06bNcOfOHWzbtgWRkQ9i6tT/4vr1a5g48TX85z8TMXbsCKSlpcLfP+C+X/c9pxTU\n6/UYPXo0wsPD0bBhQzz33HN466230KlTJ/Tp0wfz5s1DUFAQhg0bVuxjVMTUUsVNWWUymdCu3YO4\nceM6Dh06xbl+7xOnBnMM9rNjsJ8dg/0sua8pBV1cXLBixQrUrFnTctvff/+Nbt26AQC6du2Kv/76\nyw5l2odSqcSoUWOQk5ODtWs/l7scIiKiIu4ZviqVqsjeX1lZWXBxcQEA+Pn54datWxVTXTk988xz\nnOuXiIgqrfve27k0U/n5+mqhUinv2a6sihvSBwR44qWXRuLTTz9FXNxPePbZZ+3+3NVJSZtOyH7Y\nz47BfnYM9nPJyhW+Wq0W2dnZ0Gg0SEpKstokbUtKir5cxZXkXr8pDBv2IhYsWIC5cz9Bjx79IQiC\n3WuoDvjbjWOwnx2D/ewY7GfJff3ma0v79u2xY8cOAMDPP/+Mjh07lq+yChQWFo5evR7H4cOH8M8/\nf8tdDhERkcU9w/f48eOIiorC999/jy+++AJRUVEYP348Nm/ejGHDhiE1NRUDBw50RK1lxrl+iYio\nMrrnoUb24MhDjQoTRRHdunXEyZPH8e+/RxESEmr3Oqo6bj5yDPazY7CfHcPe/RwXtwtdunSz2+M5\nit03OzsLzvVLROTcbty4jp07d8hdht1V6fAFgEGDnoK/fwDWrVvDuX6JiCqRYcMGw2QywWg0okeP\nTjh9Wjot8OTJ45GYeAMAMG/eRzh8+CA+/3wFVq6MwaxZ0zFu3EvYv/8fTJ36huWx+vaVRsaXLl3E\nK6+MwcSJY/HWW68hI6Nybumo8uGbP9dvenoaNmz4Su5yiIgqpRqtm9m8aAptNfQcN8pmG8/o4ZY2\nmrWrgfDwUj1n/pSC586dsUwpaDabraYUfOaZKERGPogXXxwFADAaDViy5LNip42dP/9jvP76FCxY\nsBRt2jyCTZs2lqs/KlqVD19AmutXOlPXUpjNZrnLISIiFEwpeOzYETz11NM4efIELlywnlLwbvnT\nAxbn5MkT+Oij2Rg/Pho7dvxomRChsqlSUwoWp2bNmnjyyf/D119/ybl+iYhsSD5w/J5tMpasuGeb\n7Kjh8Jw8AbDTlIJ3U6vVAFDk3A35sxFpNBosXBhT6c/tUC1GvoA01y8AxMRwtiMiosqgNFMK5k/t\ndzd3d3fcuXMbAHD+/Dno9dLJnOrXb4B9+/4EAOzcuaNMMxw5UrUJ3+bNW+DRRztyrl8iokrkXlMK\nhoXVxZkzp/G//31idb/69SOg0bhhzJgR2LHjRwQFBQMAJk78D9au/Rzjx0fjxx9jS9yELacqfZzv\n3bZv34YXXngGzz33AubNW2j3mqoiHhfpGOxnx2A/Owb7WVJtj/O9W8+evREWFo5vvvkat2/flrsc\nIiKqpqpV+HKuXyIiqgyqVfgC0ly/Hh6enOuXiIhkU+3C19PTC88+G4WkpERs2fK93OUQEVE1VO3C\nFwBGjhwNQRCwfPkSOGB/MyIiIivVMnzDw+uid+++OHz4EP79t3IeA0ZERFVXtQxfgHP9EhHJ6ccf\nt2LRovl2eSydLhP//LMPALB27WocP3603I+VmJiIkyfvfbav+1Vtw7ddu0fRrFkLxMb+gISEeLnL\nISKicjpz5rQlfKOihqNZsxblfqyDB//FqVMn7FVasarFuZ1tyZ/r95VXxmLVqhV4551ZcpdERFSt\n3LhxDf/5zyu4eTMJQ4YMQ79+A6zWf/fdRuzc+RMEQYGOHbvgmWeew9mzp/HJJx9BrVbDxcUFM2d+\ngHnz5kCv1yEkJBTHjx9Fly7dkJaWisOHDyI1NRWXLl1EdPRY7Ny5A5cvX8L06bPRtGkzLFw4DydP\nnkBubi4GDhyMDh06Y9Wq5VCpVAgMDELt2iH49NM5EAQBWq0WU6bMgKdn8SfOKItqG76ANNfvu+9O\nx7p1a/Daa/+Fh4eH3CURETncjBmu2LrVfnGgUAB9+7pixoycEtslJMRj1aovodNlYvjwYejb9wnL\nhAjXr19DXNwuLFmyEgAwduxIdO3aHT/+uBWDBj2F3r374sCBf5GcfAfDhkXh4sULGDDgSatNzgkJ\n8Viy5DNs3boZ69atxqpVX2L79q3YuXMH6tdvgKCgYEyYMBk5OdkYMmQg+vcfiD59+sHHxwcdOnTG\nxIlj8frrUxASEopNm77Bpk0b8cILI+3SR9U6fPPn+v344w+wceN6jBgxSu6SiIiqjRYtIqFSqeDt\n7QN3d3ekpaXBx8cHAHDq1AlcvZqACRNGAwD0eh0SE6+jQ4fOmDv3QyQkxKNbtx4ICwvHiRPHbD5+\no0ZNIAgC/Pz8Ua9eAyiVSvj6+kGnOwJXV1ekp6dhzJgRUKlUSE1NKXL//OkJAcBgMKBx4yZ2e+3V\nOnwBaa7fBQs+wYoVSzF8+MhiJ2gmIqqqZszIuecotSykczuX5vGsp/0rPAugSqVGu3aP4o033i5y\nr88++wJ//rkXs2fPwPjxk4p9dKVSaXNZFEUcOnQABw/ux6JF0mbmHj06Frl/RU5PWO2TJn+u3wsX\nzuPXX3+RuxwiomrjxImjMJlMSElJQVZWFry8vC3rGjZsjIMHDyA7OxuiKGL+/LnIycnGd99tQHp6\nGnr27IOnnx6Gs2dPQxAEm9MOliQtLRU1awZCpVLh99/3wGQyw2AwWE1hWJHTE1b7kS8gzfX79ddf\nIiZmCbp37yV3OURE1UJoaDimTXsT164lIDp6nNUIMygoCEOGPIOXXx4FhUKBTp26wNVVg9q1QzBt\n2pvw8PCAWq3GlCnvIDU1BcuWLURAQM1SP/dDD7XFl1+uwfjx0ejYsTPat++AuXM/QPfuPTF79gz4\n+Phi4sT/YM6c9/Dll2vg4uKKGTNm2+21V6spBUsyaFBf/PHHXvz2299o1Kix3R7X2XFqMMdgPzsG\n+9kx2M8STilYCtHR0kk3VqxYKnMlRERU1TF88xSe6/fOnTtyl0NERFUYwzdP/ly/2dnZnOuXiIgq\nFMO3EM71S0REjsDwLSR/rt/ExBvYunWz3OUQEVEVxfC9S/5cvzExiznXLxERVQiG71041y8RUcUr\nzZSCu3fvdFA1jsfwtYFz/RIRyW/dujVyl1BhGL42cK5fIqKKlz+l4PPPP43Y2B+s1n311Rc4f/4s\npkx5HQcP7scbb0zC+PHROH36FPr27WZpN3XqGzh4cD/0eh2mTn0DEyeOxfjx0Th//pyjX06ZMHxt\nyJ/r12w2Y9WqFXKXQ0RU4Vq3drd5WblSbWkzbpzGZpvoaI2lzdq1aoSHl+45ExLi8eGH87BwYQxW\nroyx2s9m2LDn4eHhgfff/xgAcOHCecybt6jYMxBu3Lgebdu2x4IFS/Haa29i0aJPy94JDsTwLcbA\ngYPh7x+AdevWIDMzU+5yiIiqHFtTChanfv0GcHFxKXb9sWNHsXnzdxg/PhqffPIhdLrK/b3NiRWK\nodFoMHz4SMyd+yHn+iWiKu/AAd092yxZkn3PNlFRBkyerMGtW6V51uKnFLybWq22ebvRaMxbr8Kr\nr76OZs1alOaJZceRbwleeGEkXFxcsGLFUpjNZrnLISKqUkqaUhAAzGbbh3sKgoDs7GxkZ2fj7Nkz\nAIAmTZrht9/iAACXLl3E11+vq9Da7xfDtwSBgYEYNOgpzvVLRFQB8qcUnDRpbJEpBQEgIqIhRo16\nvsj9Bg58CtHRL+D992eiYUPpN+Cnnnoa164lYNy4l/DRR7MRGfmgQ15DeXFKwXs4duwIunXriM6d\nu+Kbb3649x2qGE4N5hjsZ8dgPzsG+1nCKQXvQ/PmLdG+fQfs2bMbp0+fkrscIiKqAhi+pcC5fomI\nyJ4YvqXQq1cfhIZyrl8iIrIPhm8pSHP9jkZ2djbWrVstdzlEROTkGL6lNGxYFDw8PLFy5XIYDAa5\nyyEiIifG8C0lT08vDBv2HOf6JSKi+8bwLQPO9UtERPbA8C2DunUfQK9ej+PQoYPYv59z/RIRUfmU\nK3x1Oh3Gjx+PqKgoDB06FHv37rV3XZVWwVy/POyIiIjKp1zh+/3336Nu3bpYu3YtFixYgPfee8/e\ndVVa7dt3QNOmzREb+wOuXk2QuxwiInJC5QpfX19fpKamAgDS09Ph6+tr16IqM0EQMHr0OJhMJs71\nS0RE5VLuczuPHDkS8fHxSE9PR0xMDCIjI4ttazSaoFIpy11kZZOdnY2wsDDk5ubi6tWrcHd3l7sk\nIiJyIuWaz/eHH35AcHAwVq5cidOnT2PKlCnYtGlTse1TUvTlLrA4cp+4+/nnR2Du3A+xePFyvPji\nS7LVUdHk7ufqgv3sGOxnx2A/S+w+scLBgwfRoUMHAECjRo1w8+ZNmEym8lXnpDjXLxERlVe5wjcs\nLAxHjhwBAFy7dg3u7u5QKqvOZuXSyJ/r9/z5c9i9e6fc5RARkRMpV/g+/fTTuHbtGp577jm89tpr\nmDFjhp3Lcg7R0WMBADExS2SuhIiInEm5fvN1d3fHggUL7F2L08mf6zcu7lecPn0KjRo1lrskIiJy\nAjzD1X0qmOt3mcyVEBGRs2D43qeCuX7XIzmZc/0SEdG9MXzvU+G5fteuXS13OURE5AQYvnbAuX6J\niKgsGL52wLl+iYioLBi+dsK5fomIqLQYvnbCuX6JiKi0GL52xLl+iYioNBi+dsS5fomIqDQYvnbE\nuX6JiKg0GL52NnDgYPj7B2Dt2tXQ6XRyl0NERJUQw9fONBoNhg8fibS0VGzcuF7ucoiIqBJi+FYA\nzvVLREQlYfhWgMDAQAwcOBjnz59DXNwuucshIqJKhuFbQTjXLxERFYfhW0FatIhEu3aPYvfuXThz\n5rTc5RARUSXC8K1AnOuXiIhsYfhWoN69H0doaBjn+iUiIisM3wqkVCrx0kujkZWVhXXr1shdDhER\nVRIM3wo2bFgU3N09ONcvERFZMHwrmJeXN4YNew43blxHbOwPcpdDRESVAMPXAV56aQwEQcAHH8xC\nZmam3OUQEZHMGL4OULfuA3j55Ym4fPkSpk79r9zlEBGRzBi+DvLmm1PRokUkvvpqLbZu3Sx3OURE\nJCOGr4O4uLhg2bKVcHNzw+TJr+Datatyl0RERDJh+DpQ/foNMGvWh0hLS8X48aNhMpnkLomIiGTA\n8HWwqKjh6NOnH/74Yy8WL/6f3OUQEZEMGL4OJggC5s1biMDAIHz44SwcPnxQ7pKIiMjBGL4y8PPz\nw6JFMTAajRgzZiR0Op3cJRERkQMxfGXSuXNXjBv3Ci5evIBp096UuxwiInIghq+M3nprGpo1a4F1\n69Zg61ae/YqIqLpg+MrI1dXVcvjRa69NwPXr1+QuiYiIHIDhK7OIiIaYOfN9pKamYsKEMTCbzXKX\nREREFYzhWwm88MII9O79OPbu3YMlSxbKXQ4REVUwhm8lIB1+tAg1awbigw/exdGjh+UuiYiIKhDD\nt5Lw9/fHwoXLYDAYePgREVEVx/CtRLp27YbRo1/G+fPnMH36FLnLISKiCsLwrWSmTp2Bpk2bY+3a\nz/Hjj7Fyl0NERBWA4VvJ5B9+pNFoMHnyeCQm3pC7JCIisjOGbyXUsGEjzJjxHpKTk/Hyy6N5+BER\nURXD8K2kXnzxJfTs2Rt798Zh2bLFcpdDRER2xPCtpARBwKefLkZAQE28994MHDt2RO6SiIjIThi+\nlVhAQAAWLlxqOfxIr9fLXRIREdkBw7eSe+yxHoiOHotz587inXfelrscIiKyA4avE5g6dSYaN26K\nNWtW4qeffpS7HCIiuk/lDt8tW7bgiSeewJNPPom4uDg7lkR302g0WLZsJVxdXfHqqy8jKSlR7pKI\niOg+lCt8U1JSsHjxYnz11VdYtmwZdu3aZe+66C6NGzfBjBmzcefOHc5+RETk5MoVvn/99RfatWsH\nDw8P1KxZE7NmzbJ3XWTDiBHR6N69J+LifsXy5UvkLoeIiMqpXOF79epVZGdnY8yYMRg2bBj++usv\ne9dFNgiCgAULlsLfPwCzZ8/AsWNH5S6JiIjKQRBFUSzrnZYvX46DBw9i0aJFuH79Op5//nns3r0b\ngiDYbG80mqBSKe+7WJJs374djz/+OBo3boz9+/dDq9XKXRIREZWBqjx38vPzQ6tWraBSqRAaGgp3\nd3ckJyfDz8/PZvuUFPsfnxoQ4IlbtzLs/rjO4KGHOuCll0bjs89iMH78RHz00bwKe67q3M+OxH52\nDPazY7CfJQEBnsWuK9dm5w4dOmDfvn0wm81ISUmBXq+Hr69vuQuksps+fRYaN26Czz//DD//vF3u\ncoiIqAzKFb6BgYHo1asXhgwZglGjRmHq1KlQKHjIsCNpNBosXSodfjRx4jgkJSXJXRIREZVSuX7z\nLauK2PzAzRqSFSuW4u23/4uuXbth/frv7P6fIPazY7CfHYP97BjsZ4ndNztT5fHSS2Pw2GPdsXv3\nLnz22TK5yyEiolJg+Dq5gsOP/PHuu9Nx4sRxuUsiIqJ7YPhWAYGBgZg/fzFyc3MxduxIZGVlyV0S\nERGVgOFbRfTs2QcjRozC6dOn8O670+Quh4iISsDwrULeeWc2GjZshJUrl2Pnzh1yl0NERMVg+FYh\nbm5uWLZsFVxcXPDKK+Nw8+ZNuUsiIiIbGL5VTNOmzTBt2kzcvn0LEyeOhQOOJCMiojJi+FZBo0aN\nRZcuj2HXrl+wcmWM3OUQEdFdGL5VkEKhwMKFy+Dn54eZM6fh1KmTcpdERESFMHyrqMDAIHz66WLk\n5ORgzJgRyM7OlrskIiLKw/Ctwnr3fhzDh4/EqVMnMWvWdLnLISKiPAzfKm7GjPcQEdEQK1Ysw65d\nP8tdDhERgeFb5Wm1WixdutJy+NGtW7fkLomIqNpj+FYDzZu3wNtvz8CtWzcxadI4Hn5ERCQzhm81\nMXr0OHTu3BW//LIDq1atkLscIqJqjeFbTeQfflSjRg3MnDkVp0+fkrskIqJqi+FbjQQF1cKnny5G\ndnY2xowZycOPiIhkwvCtZvr06Yvnnx+BkyeP4733ZspdDhFRtcTwrYZmznwP9es3QEzMYuzevUvu\ncoiIqh2GbzXk7u6OZctWQq1WY8KEMbh9+7bcJRERVSsM32qqRYtIvPXWdNy8mYRXX32Zhx8RETkQ\nw7caGzduAjp27IIdO7Zj9eqVcpdDRFRtMHyrMYVCgUWLlsHX1xfvvDMFZ8+ekbskIqJqgeFbzdWq\nFYx58xYhOzsbo0ePQE5OjtwlERFVeQxfQt++/REVNRwnThzD+++/K3c5RERVHsOXAADvvvsB6tWr\nj6VLFyIu7le5yyEiqtIYvgSg4PAjlUqFCRPG4M6dO3KXRERUZTF8yaJly1Z4881pSEpKxKuvjufh\nR0REFYThS1bGj5+IDh064aeftuGLLz6XuxwioiqJ4UtWpMOPYuDj44Pp09/C6dOn5S6JiKjKYfhS\nEcHBtfHJJwuRlZWF/v374+LFC3KXRERUpTB8yab+/Qdg8uTXcf78eTz+eDf8/fc+uUsiIqoyGL5U\nrDffnIbly5cjLS0Ngwf3w/fffyt3SUREVQLDl0o0atQorF//HVxdNRg9egQ+/fRj7gVNRHSfGL50\nT126PIbY2J9Rp04IPvhgFiZNehm5ublyl0VE5LQYvlQqjRs3wfbtuxAZ2Qrr16/DM88MRlpaqtxl\nERE5JYYvlVpgYBC+//5H9O7dF3v37kHfvj1w5cplucsiInI6DF8qE3d3d3z++TqMGTMeZ8+eQZ8+\nj+HAgX/lLouIyKkwfKnMlEol3n33fXz44SdITk7GoEF9sXXrD3KXRUTkNBi+VG4jRozCunUboFSq\nMHJkFBYtWsA9oYmISoHhS/ele/de2LLlJ9SqFYx3352G//xnEgwGg9xlERFVagxfum/Nm7fATz/9\nimbNWmDt2s/x7LP/h/T0NLnLIiKqtBi+ZBe1agVjy5af0KNHL8TF/Yr+/Xvh6tUEucsiIqqUGL5k\nNx4eHlizZj1GjozGqVMn0bv3Yzhy5JDcZRERVToMX7IrlUqFDz6Yi9mzP8StWzcxYEAfbN++Te6y\niIgqFYYvVYjo6HFYvforAMDw4cMQE7OYe0ITEeW5r/DNzs5G9+7dsWnTJnvVQ1VInz598cMP2xEQ\nUBPTpr2FKVNeh9FolLssIiLZ3Vf4Ll26FN7e3vaqhaqgli1b4aeffkXjxk2xcuVyvPDCM8jMzJS7\nLCIiWZU7fC9cuIDz58+jS5cudiyHqqI6dUIQG7sDXbo8hl9+2YEnnuiNGzeuy10WEZFsBLGcP8RF\nR0dj2rRp2Lx5M2rXro0nn3xblRT0AAAgAElEQVSy2LZGowkqlbLcRVLVYDAYMH78eCxfvhy1a9dG\nbGwsIiMj5S6LiMjhVOW50+bNmxEZGYmQkJBStU9J0ZfnaUoUEOCJW7cy7P64ZM3e/Txr1seoVSsU\nM2dOxaOPdsBnn61G9+697Pb4zoqfZ8dgPzsG+1kSEOBZ7LpyhW9cXBwSEhIQFxeHxMREuLi4ICgo\nCO3bty93kVQ9CIKAl19+BaGhYXj55VF47rmn8f77H2PEiFFyl0ZE5DDlCt/58+dblhcuXIjatWsz\neKlM+vcfgODgYERFDcWbb76GS5cuYsaM2VAq+fMEEVV9PM6XZNO6dRts374LERENEROzGC+++Bx0\nOp3cZRERVbj7Dt8JEyaUuLMVUUnCwsKxbdsv6NixM376aRsGDnwcSUmJcpdFRFShOPIl2Xl7+2D9\n+u/wzDPP4ciRQ+jTpxtOnTopd1lERBWG4UuVgouLC+bPX4wpU6bj6tUE9OvXE7t375K7LCKiCsHw\npUpDEARMmvQfxMSsQm5uDoYNewpr166WuywiIrtj+FKlM2jQU/j2263w9vbGa6+9glmz3oHZbJa7\nLCIiu2H4UqXUtu0j+PHHXahXrz4WLvwUo0YNR1ZWltxlERHZBcOXKq0HHqiHH3/ciXbtHsXWrZvx\n5JN9cevWLbnLIiK6bwxfqtR8fWtg48bNeOqpp3HgwH706dMNZ8+ekbssIqL7wvClSs/V1RWLFy/H\n66+/hfj4y+jbtwd+//03ucsiIio3hi85BUEQ8Prrb2HRohjo9ToMGTIQX3/9pdxlERGVC8OXnMqQ\nIc/gm29+gIeHB155ZSw+/HAWyjkrJhGRbBi+5HTat++AH3/chbCwcMyb9zHGjh2J7OxsucsiIio1\nhi85pfr1G2D79l/Rpk1bbNr0Lf7v/wbgzp07cpdFRFQqDF9yWv7+/vjuu60YOPBJ/P33X3j88W64\nePG83GUREd0Tw5ecmkajwbJlqzBp0n9w6dJF9OnTjXtCE1Glx/Alp6dQKDBlynTMn78YGRkZePLJ\nfnjhhWE4ffqU3KUREdnE8KUqY9iwKGzZ8hPatGmL7dtj0aVLO0yYMAbx8VfkLo2IyArDl6qUhx56\nGLGxP2Pdug1o1KgJNmz4Cu3aPYgpU17HzZs35S6PiAgAw5eqIEEQ0LNnH/z66+9YuvQzBAfXxmef\nxeDhh1vigw/eRVpaqtwlElE1x/ClKkuhUGDw4CH4888DmDPnU3h6euLTT+eiTZsWWLhwPvR6vdwl\nElE1xfClKk+tVmP48JH4++/DmDp1JgBg1qzpaNs2EqtXr4TBYJC5QiKqbgTRAefmu3Urw+6PGdCm\nOUzmoqXrx72C7JHRAADPcaOg/vuvIm0MrR9CxvLVAADN2tXQzp9r8zmS/zoIuLhAee4svIc+abNN\nxryFMHTuCgDw6dUFitu3i7TJHvIM9P99GwDg/s7bcI39oUgbU2gY0r7fBgBw2b4NHlP/a/P5Urfu\ngDm4NoTUFPh262izjW7KdOQMHgIA8Hr2/6CysddvbtfuyJw7HwDgtnA+3FZ/VqSNqNVCdfoUbt3K\ngGr/P/AaPcLm86WvWgtjy1YAAN+2kRCMxiJtsqLHImv0ywAAj0kvw2XvniJtjM1bIn21dL5m16+/\nhPvHH9h8vuQ9+wAPDyguX4LP4P4222TOmYfcbj0BAD79ekJx47plndlsRkZ6OlZl6fG60Yjw8Lr4\nrmEjtDxxHBAEq8cx1wpGauzPAACXXT/D443JNp8v9butMIfXBTIzUaPzIzbb6F5/CzlDnwUAeA1/\nFqpjRyzrlAoBJrOI3I6dkTl/MQDALWYx3JYvLfI4okqFlL8PAwBURw7Ba0SUzedLj1kF40MPAwB8\nOz4MwcZIP2v4S8iaMAkA4PGfSXDZvbNIG2Ojxkj/8hsAgOt3G+H+/rs2ny9l116IPr5QXL8Gn/69\nbLbJnP0Rcvv0BQB4D+oLpY2d4XL6DYBu5nsAAO1H70GzcX2RNmZ/f6TuiAMAqPfshufkCTafL+3r\nTTA1iAByc1Gj3YOWfi5MP+k/yI4aDgDwjB4O9YH9RR7H0LYdMpasAABoVi6Hdsn/bD5f8oHjAADl\nyRPwjnraZpuMRTEwtHsUAODb9VEI6WlF2mQ/+zz0k98AALhPeR2uO7YXaWOqVx9pGzcDAFy2bobH\njKk2ny9l+68Qa9aEcPMmfPs8ZrNN5ozZyO0/EADgPWQglBeKHi+f06sPdO9/DADQzpsDzZdfFGkj\nenkjZfcfCAjwROqWn+A5frTN50tbuwGmJk0BADVaN7PZRs7vcnsJCPAsdp3Krs9E5AQUCgW8fXzw\nwpBncBoivvjic+y4fAmBajW8vX3g5uYmd4lEVMU578g3wLNCHpesVYd+vnLlMj7++AN8883XEEUR\nDz/8CKZOnYFHHmnvsBqqQz9XBuxnx2A/S0oa+fI3X6r2wsLCsWhRDPbs2Yc+ffrhn3/24YknemPo\n0CdxrNCmYSIie2H4EuVp1Kgx1qz5Ctu370KHDp3w66870a1bR0RHD+c5o4nIrhi+RHdp3boNvvtu\nKzZu3IzIyFbYvHkTHn20DV577RVcv35N7vKIqApg+BLZIAgCunR5DDt2xGHlyrV44IF6WLt2Ndq2\njcQ777yN5GROX0hE5cfwJSqBIAjo338A9uzZhwULliAgoCaWLl2Ihx5qgblzP0RmJncqIXJqZjOg\n0wGZmQ59Wu7tTCViP1vLycnBmjUrMX/+XNy+fRv+/v6YOPE1vPDCSGg0mnI/LvvZMdjPjnHf/SyK\ngMEAITsLQlYWoNdDyM6GkKWHkJUFITsL0GdJfxe6HdlZEPRZBW2yCrXRF2pT+PbsbOkpFQqkffUt\nDI91t1MvlLy3M8OXSsR+ti0zMwMxMUuwZMlCZGSko3btOnj99bcwZMgzUKnKfvg8+9kx2M92IoqA\nTgdBp4Ogy4Sg00Ghy4SgywR0OngrTMhISraEoJCVBdwVggXhWNBG0OuB/DA1mexbsloN0U0L0c0N\n0GggarUQNRrLbaKPL3RvvwNznRC7PSfDl8qN/Vyy5OQ7+N//PsWqVcuRnZ2NBg0i8Oab09Cv3xMQ\n7jpbVknYz45RLftZFKWQKxSUQmZmwbLuruXM/GUdhMyMguXC6/Q6CHaIDlEQADc3KfzcCsIQbm4Q\nNW4QtXnrNG557Qq3KRScGqkd7g5UjRugzbsux3+K7xfDl8qN/Vw6169fwyeffISvvloLk8mEyMhW\nmDLlHXTu3LVUIcx+dgyn6Oe8sFSkp0FIS4OQnlZ8GBYXmlZ/Z0Iwm8tfjiBAdPeA6O4uXTw8Cy17\nWK9zl9Z5BvkhzaQoCNG84LQEZn6AuroWOaVrVcLwpXJjP5fNhQvn8NFH72Hz5k0AgA4dOuHtt99B\n69ZtSrwf+9kxHNLPZjOEjPS84Ey3CtGC5XTp70LLVuttnB+9tEStuyUMzR6eQKHl/Nvh7gGzR35o\neuSFqDtErUeRUIVWW+aA5OdZwvClcmM/l8+xY0fw/vvvYteuXwAAvXv3xVtvTUPjxk1stmc/O0ap\n+jk3F0JaGhQZxQVlqnVopqVBUWhZyMwo8yZZUauF2csbopcXRC9vmL3zl32k68JBefeIs/Cy1h1Q\nyH8QCz/PEoYvlRv7+f7s2/cnZs+egX/+2QdBEPDUU0/jjTemICws3Kod+9nOjEYIKSlQJN+BIvkO\nhDvStWeuDvrrN4sP0Yx0aeegMhAFAaKnF0Rvbyk4ve5a9vKG6C2FqNnLO29d/rJ0O9TqCuoIefDz\nLGH4Urmxn++fKIrYuXMH3nvvXZw8eRxqtRpRUcPx6qtvIDAwEAD7uURmM4TUFCiSky0hWjhQFcl3\nICTfgeJO3nVyMhRpqaV+eFGthujtDbMlQPOCMi9ArZel8LSMUr29IXp4VorRZmXCz7OE4Uvlxn62\nH7PZjM2bv8OHH87G5cuXoNVqER09Di+//Arq1w+pHv0sitIo0yo4k6G4Yx2iluWUZAgpKaXaYUhU\nqWCu4QfRz0+6riFdm/1qWJa9wmsjRVRLI868wIWbW5Xe6UcO/N6QMHyp3NjP9mcwGPDVV2sxd+6H\nSEpKhI+PDyZNmoQePfqhQYMIucsrPVGU9qS1MRq1CtSUZGn9HWm5NDsTiQoFxBo1pPDMD1K/QoFa\no4YUsr41LIErenrdM0T5eXYM9rOE4Uvlxn6uOHq9HitXLsfChfOQmiptJm3YsBH69n0C/foNQNOm\nzcp0rLDdmM0Qbt+GMukGFEmJUCQlQZFYaDnphnR96yaE3NzSPaSvr1WIWo9M85Z9a0D0ywtTb58K\n2ZTLz7NjsJ8lDF8qN/ZzxcvISMcff/yKr77agLi4XcjOO91deHhd9Os3AP36PYFWrVrffxAbjVDc\nvpUXoolQJCbeFah5t926WeLZhUQXF5gDg2AOCIDZz79oiNbwsx61+vjIcoIDW/h5dgz2s4ThS+XG\nfnaM/H7OzMzErl0/IzZ2C375ZQf0eh0AIDi4Nvr27Y9+/Qbg4YcfgVKpLLizwQDFrZt5o9OkvBC9\nAcXNJOuQvX2rxN9ORY0G5ppBMAcGwhxUC6bAQClk8y9BtWAODIToW8NpfyPl59kx2M8Shi+VG/vZ\nMWz1c1ZqKvbH/oCD27bg8l9/wFuvRy0AdTUaNK/hh1C1Gp46HRR3bpd4XKmo1cJcMxCmoFp5IRpk\nFbJSuAZKm3qdNFRLi59nx2A/S0oK38qxLYiousnKgjIhHsqEK1DExwOpt+B58Yr1ZuDkZIQCePLu\n+2ZnA9evIQPARYUCBv8AuNVvAP9mzSEE15HC1TJaDZIOhanioUrkbBi+RBUhOxvKawlQXLmSF7Lx\nUMRflpbj46G4dbPIXfInJDR7ecMcGAhj0+Yw1wwsGK3mBaohoCb+jr+CH3b9jG3btuLGjevArZvw\nOHYUPXr0RL/QAXisVWu4u7s79jUTUalxszOViP1cjNxcKK4m5IXpFSjyrqWQvQJlUqLNu4lqNcy1\n68AUGg5TaCjMIaEwhYbBq2kE7rh6wRwYJJ1Lt5TMZjMOHtyP2NgtiI3dgvj4ywAANzc3dO3aHX37\n9kevXn3g5eVtj1ft9Ph5dgz2s6RCfvOdM2cODhw4AKPRiNGjR6Nnz57FtmX4Oq9q288GAxTXrlqP\nWuPzlhPiobhx3ebvrKJSCXPtEJhCpVA1h4TCFBIKU2g4zKGhUrgW3lkqjz36WRRFHD9+DNu2/YDY\n2C04e/YMAECtVqNTpy7o128AevfuCz8/v/t6HmdWbT/PDsZ+ltg9fPft24eVK1dixYoVSElJwaBB\ngxAXF1dse4av86qy/Ww0QnHjuvWoNX85IR6K69ds7hksKhTSyDUktFCwhsEcGibdViu4XIfVVEQ/\nnz17BrGxUhAfP34UAKBUKtG+fQf07fsEHn+8H4KCatn1OSu7Kvt5rmTYzxK7h6/JZEJOTg60Wi1M\nJhPat2+PP//80/rwh0IYvs7LafvZZIIi8YYUpFcuW0asls3E167aPJZVFASYawVbNgebQkKlYM1f\nDq5dISfBr+h+vnz5ErZt24rY2B9w4MC/AABBEPDQQw+jX78B6Nu3P0JDwyrs+SsLp/08Oxn2s6RC\nDzXasGED9u/fj48//rjYNhXxJrRp4wmzjZHJuHG5GDnSkLeswd9/F/0PQevWJixfLp3IYO1aNebP\nd7H5HH/9pYOLC3DunAJDh7rZbDNvXjY6d5a+xHv10uL27aJ7lQ4ZYsB//yudCeidd1wRG1t0ZBQa\nasb330uzqWzfrsLUqa42n2/rVj2Cg0WkpgLdutneoWbKlBwMHiydwu/ZZ91w+nTRMwV17WrE3Lk5\nAICFC12wenXRQNFqRZw+rcStWxnYv1+B0aNt98GqVVlo2VJ6L9q2dYetswdGR+di9GjpfZk0yRV7\n9xbtg+bNTVi9Wnpfvv5ahY8/tt0He/bo4OEBXL4sYPBADWA0QDAYAEPetdGIJRiLx02xAIAO2Iur\nqFPwAAoloFLh/8L3YUbfP2EOCcP033rhu31hgEpptWdwrVpmxMZK78uuXUq88YYGtnz3nR7h4SIy\nM4HOnW2/L6+/noOhQ6XOGT5cg2PHCj6bCoUCZrMZHTsaMX++9L7ExKixfHnRz6ZKBfz9t3T875Ej\nCowYYft9iYnJwkMPSe9Lx45a6PXS6zIaTcjK0iMrS4/c3PkQxTkAAD+/b2A0doebmxvUhf6D0aiR\nGV9+mZX3OlV4/33b78uuXTr4+ADXrwvo39/279azZ+egTx+pDwYNckN8fNHPZr9+RsycKfXBRx+5\nYOPGop9Nf38RO3boAQB79igxebLt9+Xrr7PQoIEZublAu3buln4ubNKkXERFSZ/N6GgNDhwo+p3R\ntq0JS5ZIn82VK9VYssT2d8aBA9L7cvKkAlFRtt+XRYuy0a6d9J3RtasW6elFvzOefdaAyZOl74wp\nU1yxY0fRfy/16pmxcaP0vmzdqsKMGbbfl+3b9ahZU8TNmwL69LH9vsyYkYP+/aX3ZcgQN1y4UPR9\n6dXLiPffl96XefNc8OWXRd8XLy8Ru3frERDgiS1b9Bg/3vb7snZtFpo0kd6H1q1t/3uR87vcXirs\nUKOdO3fi22+/xapVq0ps5+urhUple1R8PxQ2Tj/n6alBQID0hms0ts9Q5+qqQECAOq998WexCwjw\nhIsLcOdO8W18fLQICJCWVSrb7dzdXREQIP3D0Gptt1GrFZY3ytu7+Ofz8/NAQEDxzwUAXl5ulppc\nXGy3c3NzQUCA9EH18LDdJn9DRkCAJ3x9i38+X193y/MplYCt8zh4eNzP+yICJhOQKwVswKyp8Dh/\nGBnH9VCkfFv0gRQKCA3qA62GAuHhwDf1gMy8sywpVZZwVT05CO4fDJJqugkoDhV9qLK+L25uxbfx\n9Cx4X1xdi7ZTKBTQaEr3vuTXVJb3Jb+di4sCLi7e8Pb2xvPPT0OdOvWwadMm/PxzMkQxFWlpqVCr\n1dBqtXB3d4eLi9ryfF5exT+fv7/0OcnJKb6Nt3dBH6jVtttptQV94F7M9LQqVUEf+JRwJsoaNaQ+\nyM0taHP390bh7wxb7wsAaDSl/86Qntd+3xnFfaZcXBSlfF+kz6bZbL/vjNK9L9p7vi9ASf9e5Psu\nd4Ryj3z37t2LBQsW4LPPPoOPj0+JbbnZ2Xk5tJ+NRigvX4Ly7Bkoz52BKu9aee4cFLpMq6aiQgFz\naBiMEQ1himgkXTeIgKlBBEQn3LO3MnyeU1NTsGPHdmzbtgW7d+9CTo40yqlb9wHLaS4jIx+U53zT\ndlIZ+rk6qMh+FkXAaATyNnbBYBBgMEj/wTIagdxcoci6/PWF/zYYhEL3kf7TMXSoAV5e9qvV7pud\nMzIyMGzYMKxevbpUe04yfJ1XhfRzVhaU589Bde4MlGfPQHXurBSyFy8UOVG/6OICU70GBeEa0RDG\nBg1hqldf+u9wFVHZPs+ZmRnYuVM6zeXOnT9bTnNZu3Ydy2ku27RpW+x+HpVVZevnyshkks7jkpsr\nBVlODpCTI11b3ybdLt0GZGdLyzk5AtRqV6Sm5tgIwMLhd3cAFl1XEKrW7SrKypVZls3v9mD38N2w\nYQMWLlyIunXrWm776KOPEBwcbLM9w9d53U8/C2mpBeFaaDSriL9S5DAds4cnTBERMDXIC9eIhtLf\noeGV5qT8Fakyf56zsrIQF/crYmN/wI4d25GengYA8PX1RYsWkWjZshVatoxEixaRCA0Nq9Qj48rc\nz2YzkJUlhVh2trSclSXkBaFgFXZSEBaEoK1l27cVF6QFj2syyfP+qdUi1GppBKpSiXnXgIuLdHvB\npaBd/nL+xcVFzLtP0XVqtfVjFrST1nl5AW3amGwdCVhuPLczlds9+1kUobiZJIXr2TPSaDY/bG2c\naMLsH5A3ipXC1digIUwNG8EcVKtanwLRWT7Pubm5+P33PYiN3YK9e/fgypXLVut9fHzQvHkkWraM\ntARyeHjdShPIZelnUcwf0UmBqNcXBGN2tmAVjkWvC9rq9db3KQhX679zciq+jwRBhKurFDouLiI0\nGmnZ1VXMuw3QaETL+vy2rq62lotfX7OmFnq9/q4ALByu1gGoUlXNf/4MXyo3Sz+bzVDEX8nbVHy2\n0G+yZ6FISy1yP1NIKEwNCsLVmBe2om8NGV5F5eesn+fU1BQcPXoER44cxtGj0uXSpYtWbby9fdCi\nRUs0b97SEsrh4Q/Y3GGyLERRGqnpdAJ0uruvbd8mii5ISTFYQq9wgNoKUbPZ/ong5iaFnkYjXbu5\niXBzk/7Ovy58e36wFQ486/C71/qCx3BUyDnr59neGL5UJkJKMlRHDkN15BA8Lp6F4dgJqM6fhZA3\nz2w+UaWCqe4DeTs8ReSNZhvCWK+BtDsklVpV+jynpaXi2LGjeYF8CEePHsGFCxcAuAPwAOABrTYI\nDzzQEqGhTREcHIGaNetBqw1EVpaiSGjq9cWHqz02kapURcPPVgjmL+cHp5tb6f8uuL1qjvDuVpU+\nz/eDsxpRsYSMdKiOHoHq8CGojhyE+tBBKO/alKjSamGMaGS9w1NEQ5jqPlAhJ5ygysFgANLTBaSn\nAxkZAtLSBMvf6ekCMjKKjjCloNRCpwuGTtfHchtgnTh6PXD8uHQpLa1WhLu7CHd3oEYNs2XZ+rrk\n22rXdkdWViY0GunxNJpqsUsBVUL82FUnej1Ux49BfeQgVIcOQnXkEJTnz1nt/GSuUQO5XbvB0OpB\nGFs+CO9Oj+C2WwkHk1KlZDYDmZn54WkdmmlpUnCmp8OynB+sGRkFt+WflKOslEoRHh5S2Pn6iqhT\n5+5QlJbV6hxkZNzAnTuXkZh4Htevn8aNG+cgiukAMgFkws1NRNOmddGqVSO0bNkSLVu2Qv36Dcq9\nl3VAAHDrVoVv7CO6J4ZvVZWbC9XJ49KI9vBBqA8fgvLMKatTKpo9vWB4tCOMkQ/CENkKxsgHYQ4J\ntd4uFuAJcPORQ4mi9Ptj4dC0DkncFZgC0tJQaFkKUVEsW3hKe3yK8PQEgoLM8PIS8y4otFxwm6en\nCA8PEVqtdai6uJRl02qtvEs7AIBOp8Px48dw9Oghy+/IBw/GYf/+Xy330Gq1aNq0uWWHrpYtW6FB\ngwioOIQlJ8LffKsCoxHKM6ehPnIob0R7EKqTJ6yOmRXd3GBs3jJvRCsFremBevcc0bKf7092NpCc\nLFhd7tyRrlNSCv7OzFQhOdlsGZ0aDGULTkGQQlMKTxHe3sWHZuG/vb0L7uPmVjl/j9Tr9Thx4hiO\nHj2MI0eky9mzp2Eq9B9JNzc3NGnSLG+HrlZo0SISDRs2KhLI/Dw7BvtZwh2uqhKzGcoL56E6fNAy\nolUdPwohK8vSRHRxgbFps7wRrRS2poiGlWa2HWeVk1NykOYvF/67tJtu3dwALy9zMSPNoiHq7Y1C\nISsWeyrKqiorKwsnTx63jI6PHDmMM2dOwVjoxOIajQZNmzbL28taCuSOHR9Gamp2CY9M9sDvDQnD\n11mJIhRXLhca0R6C6shhKDILXreoVMLUqIlls7ExshWMjZtK2/7soKr2c04OLCPPwkFaeDR6d5Dq\ndKULUq1WRI0a0sXXV4SfX/F/+/lJt4WEVM1+dqTs7GycOnXCKpBPnz4Jg8FgaaNUKlG7dh3UqRNi\nuYSEhFqua9euA1dX2xMUUOlV1e+NsuLezs5AFKG4cd0SsurD0rUiJaWgiSDA1CACuS1bFWw+btZC\nGjZVczodkJQkIClJgdu3bQdp4dFpZmbpglSjkQLygQfMRYLz7kt+kPLtkIdGo0GrVq3RqlVry205\nOTk4ffqkZXP1hQtncOnSZfz11x8obtwRGBiUF8YhCAkJsyzXqSOFtIeHh6NeElVhHPnKRLh1C+rD\nB6x2iFLcumnVxhReN29E21oa0bZoCdGj+P9JVQS5+zkzsyBUExMFJCUJSExU5N0mWNZlZNw7TF1d\nC8KzpCAt3EZrewY2u5O7n6uL/H7Ozc3FtWtXcfVqAq5eTUBCQrxlOT4+HtevX7XahF2Yr6/vXaEs\nBbMU1qHw8fGtNGf0kgs/zxKOfOWWlQX1/n+gOrgf6vxDfK5dtWpiql0HOY/3LxjRtoyssmeDEkUp\nVPNDND9Uk5IKQjV/3b029fr7mxESYkZgoIigIBGBgWYEBBQfpNX8O5HyuLi4oG7dB1C37gM215tM\nJiQlJSIhIQFXr8YjISHesnz1agLOnTuDo0cP27yvu7tHoVCWRs/5f4eEhCIgoOZ9n92LnB/DtyIY\njVAdPgiXvXug3rsH6n//hpA3PRsgnd84p0cvy2+0hpYPQqxZU8aC7UMUgfR0WI1MExMVuHlTsBq1\n3rxZ8o5IgiCFZt26+aEqXdesWRCwQUEiAgJEe/20TWRFqVQiOLg2goNro23bR4qsF0URd+7cQULC\nlbyRc0Ewx8dL16dPn7L52K6urggOrm0VyvnBHBISilq1gnnYVDXAd9geRBHKUyfhsjdOCts//7Da\nKcrYtDlyO3aG4eFHYGz1IMzBtZ1qCCaKQGoqrDb9Wo9SC/7Ozi7+dSkUIvz9RdSrZ7aEaGCgaBWw\ngYFSqPLEWVSZCYIAf39/+Pv7W/3GXFh6elpeKCcgIeGKZVkaSSfgt99227yfUqlErVrBVjuF1ahR\nA76+NQpd+6FGjRrw8vLmKNpJ8TffclJcvpQ3so2Dy++/QXH7tmWdse4DMHTsgtxOnWFo3xGiv79s\ndZaG0Qhcvy4gPl6BhAQBV64oLMtJSSrcuCGWOOOKQiGNSvM3/dasab0ZWLqWgpf/obdN7s9zdVGZ\n+lmv1+Patat3/d58xbKcmHgDZrO5xMdQKBTw9fWFr68Uyn5+fpbl/KDOX65RI3+dL1wqeJNRZepn\nOfE3XzsQkpLg8ru0Gdnl99+gjL9iWWcKDEL2U08jt1MXGDp0grlOiIyVFmU2SzstXbkiBWp+sMbH\nSyF77Zpg8wT1CoWIWolNSOEAAAkTSURBVLWAJk3Md41SrUet/v6iXefAJKoOtFotGjSIQIMGETbX\nGwwGXL9+DdevX0NycjJSUpILXd+x+jslJRmXLl20OvFISTw8PG2MpouGduEwd3d3r/Y7ktkTw7cY\nQloq1H/+IY1s9+6B6sxpyzqztw9yHu8vbUru1AWm+g1k3YwsisDt24JVoMbHFyxfvSogN9d2fUFB\nZjz4oBmhoWaEhZkREiIiNFT6OzhYRHCwJ27d0jv4FRGRWq1GWFg4wsLCS9XebDYjPT3NKpALL9+5\nU/T2s2dPI6vQCXpK4uLiYmMUbTu869ULQW6uAHd3d2i17uU+F3dVxvDNl5UF9T/7LJuSVUcOQ8jb\n5CO6uSG3y2PI7dgFhk6dpWNrHfxhSk0FEhIUVqPXwiPY4nZg8vc3o2lTsyVQ88M1LMyM2rWlWV2I\nyPkpFAr4+PjCx8cXQL1S30+v1xcJ6sIj7Ltvv379Ok6dOlmm2jQaDdzd3eHu7gGtVpsXyh5511q4\nuxddzg/u/PvZauvMv3dX3/A1GqE6dMB6j+S8cyGLKhWMDz1sGdkaHnxImp26AmVmSuEaHy9YQjZ/\nOT5egfR02+Hq5SWdACI/WMPCCpZDQszg+QCIqCRarRZarRa1a9cp9X2MRiNSUlKKDe2srAwkJ6dC\np9PlXTKh1+uh0+mQlJQInU6H3ELnnr+/2u8O6sIhbyvIrf8TkL/s6+sLb2+f+66ptKpP+JrN1nsk\n//WnZY9kURBgbNYChg6dYOjUGblt28PeqZWTgyKbhfODNT5ewJ07tv8Hp9VKI9VHHpHCVBrBFmwa\n9va2a5lERPekUqkQEBCAgIAAm+tLs8OVwWCAXq+zBHTBcmbe33rLsvV66zDPb5OamorMzIxS/+59\nN4VCga+++haPPda9XPcvq6obvqJYsEfy73uK7pFcrz5yBg+R9kh+tCPEGn52eVq9Hjh/XoEzZxQ4\nezb/WonLlwWYzUVHry4uIkJCRDRvbiwSrKGh0vGu3MeBiKoatVoNb28fu442RVFEbm6uzXC2DvOi\n6wEgIqKh3Wq5lyoVvoqkRGlUm79HckK8ZZ2pVjCyhzyD3A6dYOjYGeYybGKxJTMTOHs2P2CVlqBN\nSBCKzKNao4YZbdqYUK+eFKjSCFbaRFyzplitZqMhIqoogiDA1dUVrq6uqGGnAVVFcerwFdJSof7j\nd2lT8u+/We+R7OuLnH4DpLDt1AWmevXLtUdyaipw5owS584VjGbPnlXg2rWiiVmzphkdOpgQEWFG\nRIQZDRtK1/7+FX4oNRERORHnC19RhNvC+cCOWPgdOFCwR7JWi9zHukt7JHfsJO2RXIYh5e3bQqHN\nxAWbjG/eLPoYwcFmdOlitISrdDHB19dur5KIiKow5wtfnQ7un3wIGI0wPPwIDB07S5cHH7rnHLai\nCNy8Kdz1e6x0sbXDU2ioGd27G/NGsdKItkEDM7y8KurFERFRdeB84evhgTv7j8M/LBBpetunXhNF\n6XSJ1qNY6XfZtDTrTc+CICI8XESbNgarzcX165vh7u6IF0RERNWN84UvADEgAHB3hzkzAwkJgtVe\nxfnLd09Fp1RKx8N26GC22lxcr56Zk58TEZFDOV34iiIwfbor/v0XOHXKA1lZ1iGrVouoX99cZKen\nBx4wc/o5IiKqFJwufPV6YMMGNbKzYQnZ/IBt2NCE8HDOnENERJWb08WUuztw4kQmAgM9kZzME/4T\nEZHzccrTO6jVDp/XgIiIyG6cMnyJiIicGcOXiIjIwRi+REREDsbwJSIicjCGLxERkYMxfImIiByM\n4UtERORgDF8iIiIHY/gSERE5GMOXiIjIwRi+REREDiaIoijKXQQREVF1wpEvERGRgzF8iYiIHIzh\nS0RE5GAMXyIiIgdj+BIRETkYw5eIiMjBnC5833//fTz99NMYOnQojh49Knc5VdqcOXPw9NNPY/Dg\nwfj555/lLqdKy87ORvfu3bFp0ya5S6mytmzZgieeeAJPPvkk4uLi5C6nStLpdBg/fjyioqIwdOhQ\n7N27V+6SKi2V3AWUxT///IMrV65gw4YNuHDhAqZMmYINGzbIXVaVtG/fPpw7dw4bNmxASkoKBg0a\nhJ49e8pdVpW1dOlSeHv/f3v398r6H8Bx/LkzubBxzDJaIblRSigXWHJBLlz7kRa3cqVc0FKUq7lS\nKAp/gLZwI0pZuZgr5UJRXGExy8evxgU6d6fOt9x8a3vbp9fjbrt61i5ee38+n7bfpjNsy7IslpaW\niEajpNNpFhYW6OjoMJ1lO5ubm1RXVzM+Ps7d3R3Dw8Ps7u6azvqRcmp84/E4nZ2dANTU1PD09MTr\n6ytut9twmf00NzdTX18PQFFREW9vb3x+fuJ0Og2X2c/l5SUXFxcagwyKx+O0tLTgdrtxu93Mzs6a\nTrIlj8fD+fk5AM/Pz3g8HsNFP1dOXXZOpVL/fJglJSXc398bLLIvp9NJQUEBAJFIhPb2dg1vhoTD\nYSYnJ01n2Nr19TXv7++MjIwwODhIPB43nWRLPT09JBIJurq6CAaDTExMmE76sXLq5Ptf+mXMzNvf\n3ycSibC+vm46xZa2trZoaGigoqLCdIrtPT4+sri4SCKRYGhoiIODAxwOh+ksW9ne3sbv97O2tsbZ\n2RmhUEjPMXwjp8bX5/ORSqX+vk4mk5SWlhossrfDw0OWl5dZXV2lsLDQdI4txWIxrq6uiMVi3N7e\nkp+fT3l5Oa2trabTbMXr9dLY2EheXh6VlZW4XC4eHh7wer2m02zl+PiYQCAAQG1tLclkUrervpFT\nl53b2trY29sD4PT0FJ/Pp/u9GfLy8sLc3BwrKysUFxebzrGt+fl5otEoGxsb9Pb2Mjo6quHNgEAg\nwNHREV9fX1iWRTqd1v3IDKiqquLk5ASAm5sbXC6XhvcbOXXybWpqoq6ujoGBARwOB9PT06aTbGtn\nZwfLshgbG/v7Xjgcxu/3G6wS+X/Kysro7u6mr68PgKmpKX79yqmzR07o7+8nFAoRDAb5+PhgZmbG\ndNKPpb8UFBERyTJ99RMREckyja+IiEiWaXxFRESyTOMrIiKSZRpfERGRLNP4ioiIZJnGV0REJMs0\nviIiIln2BzQKNGAGnBgwAAAAAElFTkSuQmCC\n",
-            "text/plain": [
-              "\u003cmatplotlib.figure.Figure at 0x7f7a18df6b50\u003e"
-            ]
-          },
-          "metadata": {
-            "tags": []
-          },
-          "output_type": "display_data"
-        }
-      ],
       "source": [
-        "# Train our variables.\n",
-        "\n",
-        "# numpy is used for its asscalar() function.\n",
-        "import numpy as np\n",
-        "\n",
-        "num_training_steps = 10\n",
-        "\n",
-        "def train_model(inputs, labels, wb, optimizer, num_training_steps):\n",
-        "  loss_at_step = []\n",
-        "  w_at_step = []\n",
-        "  b_at_step = []\n",
-        "  for step_num in range(num_training_steps):\n",
-        "    loss_at_step.append(run_step(inputs, labels))\n",
-        "    w, b = wb.variables\n",
-        "    w_at_step.append(np.asscalar(w.numpy()))\n",
-        "    b_at_step.append(np.asscalar(b.numpy()))\n",
-        "\n",
-        "  print(w_at_step)\n",
-        "  t = range(0, num_training_steps)\n",
-        "  plt.plot(t, loss_at_step, 'k',\n",
-        "           t, w_at_step, 'r',\n",
-        "           t, [true_w] * num_training_steps, 'r--',\n",
-        "           t, b_at_step, 'b',\n",
-        "           t, [true_b] * num_training_steps, 'b--')\n",
-        "  plt.legend(['loss', 'w estimate', 'w true', 'b estimate', 'b true'])\n",
-        "  plt.show()\n",
+        "## Next Steps\n",
         "\n",
-        "train_model(inputs, labels, wb, optimizer, num_training_steps)"
+        "In this tutorial we covered gradient computation in TensorFlow. With that we have enough of the primitives required to build an train neural networks, which we will cover in the [next tutorial](https://github.com/tensorflow/models/tree/master/official/contrib/eager/python/examples/notebooks/3_neural_networks.ipynb)."
       ]
     }
   ],
@@ -572,7 +312,7 @@
     "colab": {
       "collapsed_sections": [],
       "default_view": {},
-      "name": "Eager Execution Tutorial: Working with Gradients",
+      "name": "Automatic Differentiation",
       "provenance": [],
       "version": "0.3.2",
       "views": {}
diff --git a/tensorflow/contrib/eager/python/examples/notebooks/3_training_models.ipynb b/tensorflow/contrib/eager/python/examples/notebooks/3_training_models.ipynb
new file mode 100644
index 0000000000000000000000000000000000000000..84f1d031d40604ae029e8a8347474950ee01b38a
--- /dev/null
+++ b/tensorflow/contrib/eager/python/examples/notebooks/3_training_models.ipynb
@@ -0,0 +1,485 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "k2o3TTG4TFpt"
+      },
+      "source": [
+        "# Training Models\n",
+        "\n",
+        "In the previous tutorial we covered the TensorFlow APIs for automatic differentiation, a basic building block for machine learning.\n",
+        "In this tutorial we will use the TensorFlow primitives introduced in the prior tutorials to do some simple machine learning.\n",
+        "\n",
+        "TensorFlow also includes a higher-level neural networks API (`tf.keras`) which provides useful abstractions to reduce boilerplate. We strongly recommend those higher level APIs for people working with neural networks. However, in this short tutorial we cover neural network training from first principles to establish a strong foundation."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "3LXMVuV0VhDr"
+      },
+      "source": [
+        "## Setup"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "PJ64L90aVir3"
+      },
+      "outputs": [],
+      "source": [
+        "import tensorflow as tf\n",
+        "tf.enable_eager_execution()\n",
+        "tfe = tf.contrib.eager # Shorthand for some symbols"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "eMAWbDJFVmMk"
+      },
+      "source": [
+        "## Variables\n",
+        "\n",
+        "Tensors in TensorFlow are immutable stateless objects. Machine learning models, however, need to have changing state: as your model trains, the same code to compute predictions should behave differently over time (hopefully with a lower loss!). To represent this state which needs to change over the course of your computation, you can choose to rely on the fact that Python is a stateful programming language:\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "VkJwtLS_Jbn8"
+      },
+      "outputs": [],
+      "source": [
+        "# Using python state\n",
+        "x = tf.zeros([10, 10])\n",
+        "x += 2  # This is equivalent to x = x + 2, which does not mutate the original\n",
+        "        # value of x\n",
+        "print(x)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "wfneTXy7JcUz"
+      },
+      "source": [
+        "TensorFlow, however, has stateful operations built in, and these are often more pleasant to use than low-level Python representations of your state. To represent weights in a model, for example, it's often convenient and efficient to use TensorFlow variables.\n",
+        "\n",
+        "A Variable is an object which stores a value and, when used in a TensorFlow computation, will implicitly read from this stored value. There are operations (`tf.assign_sub`, `tf.scatter_update`, etc) which manipulate the value stored in a TensorFlow variable."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "itxmrMil6DQi"
+      },
+      "outputs": [],
+      "source": [
+        "v = tfe.Variable(1.0)\n",
+        "assert v.numpy() == 1.0\n",
+        "\n",
+        "# Re-assign the value\n",
+        "v.assign(3.0)\n",
+        "assert v.numpy() == 3.0\n",
+        "\n",
+        "# Use `v` in a TensorFlow operation like tf.square() and reassign\n",
+        "v.assign(tf.square(v))\n",
+        "assert v.numpy() == 9.0"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "-paSaeq1JzwC"
+      },
+      "source": [
+        "Computations using Variables are automatically traced when computing gradients. For Variables representing embeddings TensorFlow will do sparse updates by default, which are more computation and memory efficient.\n",
+        "\n",
+        "Using Variables is also a way to quickly let a reader of your code know that this piece of state is mutable."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "BMiFcDzE7Qu3"
+      },
+      "source": [
+        "## Example: Fitting a linear model\n",
+        "\n",
+        "Let's now put the few concepts we have so far ---`Tensor`, `GradientTape`, `Variable` --- to build and train a simple model. This typically involves a few steps:\n",
+        "\n",
+        "1. Define the model.\n",
+        "2. Define a loss function.\n",
+        "3. Obtain training data.\n",
+        "4. Run through the training data and use an \"optimizer\" to adjust the variables to fit the data.\n",
+        "\n",
+        "In this tutorial, we'll walk through a trivial example of a simple linear model: `f(x) = x * W + b`, which has two variables - `W` and `b`. Furthermore, we'll synthesize data such that a well trained model would have `W = 3.0` and `b = 2.0`."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "gFzH64Jn9PIm"
+      },
+      "source": [
+        "### Define the model\n",
+        "\n",
+        "Let's define a simple class to encapsulate the variables and the computation."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "_WRu7Pze7wk8"
+      },
+      "outputs": [],
+      "source": [
+        "class Model(object):\n",
+        "  def __init__(self):\n",
+        "    # Initialize variable to (5.0, 0.0)\n",
+        "    # In practice, these should be initialized to random values.\n",
+        "    self.W = tfe.Variable(5.0)\n",
+        "    self.b = tfe.Variable(0.0)\n",
+        "    \n",
+        "  def __call__(self, x):\n",
+        "    return self.W * x + self.b\n",
+        "  \n",
+        "model = Model()\n",
+        "\n",
+        "assert model(3.0).numpy() == 15.0"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "xa6j_yXa-j79"
+      },
+      "source": [
+        "### Define a loss function\n",
+        "\n",
+        "A loss function measures how well the output of a model for a given input matches the desired output. Let's use the standard L2 loss."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "Y0ysUFGY924U"
+      },
+      "outputs": [],
+      "source": [
+        "def loss(predicted_y, desired_y):\n",
+        "  return tf.reduce_mean(tf.square(predicted_y - desired_y))"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "qutT_fkl_CBc"
+      },
+      "source": [
+        "### Obtain training data\n",
+        "\n",
+        "Let's synthesize the training data with some noise."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "gxPTb-kt_N5m"
+      },
+      "outputs": [],
+      "source": [
+        "TRUE_W = 3.0\n",
+        "TRUE_b = 2.0\n",
+        "NUM_EXAMPLES = 1000\n",
+        "\n",
+        "inputs  = tf.random_normal(shape=[NUM_EXAMPLES])\n",
+        "noise   = tf.random_normal(shape=[NUM_EXAMPLES])\n",
+        "outputs = inputs * TRUE_W + TRUE_b + noise"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "-50nq-wPBsAW"
+      },
+      "source": [
+        "Before we train the model let's visualize where the model stands right now. We'll plot the model's predictions in red and the training data in blue."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          },
+          "height": 293
+        },
+        "colab_type": "code",
+        "executionInfo": {
+          "elapsed": 1210,
+          "status": "ok",
+          "timestamp": 1527005898290,
+          "user": {
+            "displayName": "",
+            "photoUrl": "",
+            "userId": ""
+          },
+          "user_tz": 420
+        },
+        "id": "_eb83LtrB4nt",
+        "outputId": "3873f508-72fb-41e7-a7f5-3f513deefe38"
+      },
+      "outputs": [
+        {
+          "data": {
+            "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXwAAAEDCAYAAAA2k7/eAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJztnXlgU1X2xz/pAhRautCWUsCwWVlcUHHGBUFQcSg7uM8P\nFLUICo4VpygObihI3UdmUHBB0IGZQbEgFNGCqKgMolV2pKylCy1pukDp+n5/3LxmaUsDTUjSns8/\nbZKXd09C+b7zvvfccw2apmkIgiAITR4/TwcgCIIgnB9E8AVBEJoJIviCIAjNBBF8QRCEZoIIviAI\nQjNBBF8QBKGZENDYE+Tk5JCUlER+fj7+/v7cdtttTJgwgcLCQhITEzl27BidOnXijTfeICQkxBUx\nC4IgCOeAobF1+Hl5eeTn59OrVy9OnjzJ2LFj+ec//8mnn35KWFgYCQkJLFy4kKKiIh5//HFXxS0I\ngiCcJY22dKKioujVqxcAbdq0oXv37uTm5pKWlsaYMWMAGDNmDF999VVjhxIEQRAagUs9/MzMTPbs\n2cNll13GiRMniIyMBNRFoaCgwJVDCYIgCGeJywT/5MmTPPLII8ycOZM2bdpgMBhcdWpBEATBBbhE\n8CsrK3nkkUcYNWoUN910EwDt2rUjPz8fUD5/REREg+eRtj6CIAjuo9FVOgAzZ86kR48e3HPPPTXP\nDR48mE8//ZRJkyaxcuVKbrzxxgbPYzAYyMsrdkVIbiUqKkTidCESp2vxhTh9IUbwrTidodGCv23b\nNlavXk1cXByjR4/GYDCQmJhIQkICjz76KJ988gmxsbG8+eabjR1KEARBaASNFvwrr7yS3bt31/na\n4sWLG3t6QRAEwUXISltBEIRmggi+IAhCM0EEXxAEoZkggi8IgtBMEMEXBEFoJojgC4IgNBNE8AVB\nEJoJIviCIAjNBBF8QRCEZoIIviAIQjNBBF8QBKGZIIIvCILQTBDBFwRBaCaI4AuCIDQTRPAFQRCa\nCSL4giAIzQQRfEEQhLOk0GTi84R7+XbIDXyecA+FBSZPh+QULtnTVhAEoTnx7YzHuDflUwyAlv4z\nizEwfNFiT4fVIJLhC4IgnCWhhw9hsPxusDz2BVwi+DNnzuTaa69lxIgRNc/Nnz+fAQMGMGbMGMaM\nGcM333zjiqEEQRA8TqHRiGb5XQMKjV08GI3zuMTSGTt2LOPHjycpKcnu+YkTJzJx4kRXDCEIguA1\nXJ/8OosxEHr4EIXGLlyf/JqnQ3IKlwh+v379OHbsWK3nNU2r42hBEATfJjQ8wic8e0fc6uF//PHH\njBo1iqeeeori4mJ3DiUIgiA0gNsE/+677+arr74iJSWFyMhI5s6d666hBEEQXMLRjAwW9u3FWmN7\nFvbtxeGMDE+H5FLcVpYZERFR8/vtt9/O5MmTnXpfVFSIu0JyKRKna5E4XYsvxOmNMb53xQhmZh1T\n5Zalx5h3ww08cfSop8NyGS4TfEe/Pi8vj6ioKAC+/PJL4uLinDpPXp73Wz9RUSESpwuROF2LL8Tp\nTTEWmkx8O+MxQg8fIjory67cMtZk8po4z4SzF0+XCP706dPZsmULZrOZG264gWnTprFlyxZ2796N\nn58fHTt25Pnnn3fFUIIgCC7FdhHVXFSZpcHyM8vGqWgKuETwX3311VrPjRs3zhWnFgRBcCu2i6ju\nBp4JDKR7QACZ4RH839dfezAy1yMrbQVBaNbYLqK6AOgaP4L4w7lMSt+NsXt3T4bmcqSXjiAIzRpf\nXUR1LojgC4LQrPHVRVTnglg6giA0WXy1jbG7kAxfEIQmi6+2MXYXIviCIDQZbGvqC41Ggg9k+GQb\nY3chgi8IQpPBMaOfE9vRrq7eV9oYuwsRfEEQfB49s/dbn2qX0V8QEcHiq/7odAWOyWRmxoyNHD7c\nFqOxkPffHwX4uzv884YIviAIPk2hycS/B1/HJVnH2In9StnK7heelWc/Y8ZGUlLGAwbS0zWmTFnO\n/PnD3RK3JxDBFwTBJzmakUHquOFE5mTTpbqaAcD1wDygQ1AQ1UOGnnVN/eHDbcHmHuHgwWDXBu1h\nRPAFQfBJUscNt3a2BJYDdwG9gRNDhp5TNY7RWEh6uvUeoWvXEhdG7HlE8AVB8BkKTSY2PDqVgB+/\nI8ZstvPrg1HCvz22I3ec42rZ5OTBwFKLh1/EggUjqapyTezegAi+IAhejz4pq23aQBuzmWHAAuz9\n+t8CA8mPH8Edya8RGn5uXS7tu7w3vS1aRfAFQfBq9ElZR/vmbuBZwGgwkN0hlqEr19C5a7dGjdXU\nJ22ltYIgCF7NtzMe4xKL2IPVvrkAuAgwjBzDpPTdjRZ7aPqTtiL4giB4NaGHD1GC1WDR7ZvktqFs\nbH85r2eMJiHhUwoKzI0ey2gstBtJJm0FQRDciF5u2anARGZ4BC169eIBlI3TBsuk7MbNPJ60Sdkv\nuQa279CApSxaNOasx7NdbNWhw0mGDn2P7OxImbQVBEFwF/rEbNba1cysqKjZSPyF6mo+GzWW0MOH\nOGHsUjMp62i/HD7cttZK2eTkwYSHh51xXEffftSopaxffyMAERHes/euKxDBFwTBK9D74HwO9u0R\nCs3E11FT71gzbzQW1RJvZ7L+ui4cTRWXCP7MmTP5+uuvadeuHatXrwagsLCQxMREjh07RqdOnXjj\njTcICXFuZ3VBEJo+u7Zt48vRQ+ladpqDBgNRrVtjAIqxL7fMrKfE0rFmPjl5EHfcsY2zFe+6LhxN\nFZcI/tixYxk/fjxJSUk1zy1cuJBrrrmGhIQEFi5cyDvvvMPjjz/uiuEEQWgCfDkmntllp5XMahpP\nnzyJBsQDy4BC/MhoFcawxcvqfH94eFit7P1cxLuuC0dTxSWC369fP44dO2b3XFpaGh999BEAY8aM\nYfz48SL4giCwa9s20sbG0+10qZ110x14JSwMf8LZZO7HKt6G0+Hs/8dSFi3q69S5z0W867pwNMS5\nzBV4A27z8E0mE5GRkQBERUVRUFDgrqEEQfAQZyN8+qTs6VUruUjTOIS9dZMNxAwczN8Pjyc9fXTN\n+87GUz8X8T4XzmWuwBvwuknbqCjf8PklTtcicbqW8xXn1Kmf2wlfy5bL+fe/76p1nPnECVbc1J/e\nmZmUAEOBpajOltHAAYOBsBtvZMz7i1g3ZZ2dLRMXV4qfXxUPPZTKwYPBdO1azIIF8UREnJ+Muq7v\nMisrHNu5gqyscJ/423Cb4Ldr1478/HwiIyPJy8sjIsK53ha+UAIVFeUbpVoSp2uROGuzb18QtsK3\nb18QeXnFtTL/+PIVzMjMtGuN0BUYDsxqFcRfjuQCUFEFs2dfT1mZ1ZaZPXsQ99+/qubCsnWrRlnZ\nUubNG+R2W6W+7zI21oTt/UlsbIFH/zacvdi4TPA1+65DDB48mE8//ZRJkyaxcuVKbrzxRlcNJQiC\nl1DfJKlueRgwcUH6FPBbb+fXtwF+A7a0CuLmVevszlmXLVNX6aQnbRVfneh1ieBPnz6dLVu2YDab\nueGGG5g2bRqTJk3iL3/5C5988gmxsbG8+eabrhhKEAQvoi7hKzSZCN34D17hUfIoYS4VLKu29+t3\nderEHWnfOd3Vsq4Liyfr58/XXIGrcYngv/rqq3U+v3jxYlecXhAEL8VW+ApNJj57YAKl337NDSgp\n7mD5GY+yccotO1FNfn8RuXknSUhY6ZQlU9eFJSlpQ7Opn3cVXjdpKwiCb/LtjMeI/fZr7sKayb9k\n+RkG3AkstuxEFRYRwr33fe60JaNfWPS5gTvu2Far742v2CqeRARfEASnqasMs9BUwOJxCVyanU4+\nUIgSeAOqffFLQPuwMAwDB3N98muYTGamTv2c9evB1pLJyPBvMOOvr++NyWQmKcn36uLPNyL4giA4\nja3g/pqeT+DqP3BJ9UHmY83ql6E2J9GAn4Hc9pezLCqRblRzHX4251iGrbNvMh1mx44nOVPGX59v\n76t18ecbEXxBEJxGF1wDJ5jM5fyjOrNWs7NC4G3gd/zZenUS3/74ol0LY6toK2c/KKiCIUPgwIE4\nsrLOPAnrOHl7/PguDhzowaZNucDnqE488U26AVpjEMEXBKEW9a2g7djhGMb0eG5mHa3R6mx2to6r\nWcUPwCrC9uRbXjEDqaxfD+HhO4GBNe8oKytl69Z8evUKtjuTPglr36++nPbtnyY39yrgJFlZUxg7\ndgFm85PY3mMYjZVn/BzNFRF8QRBqoSySEcA60tPD2bp1CY/cW0nv1GfpDeQBuWDX7Kwc2A2s4l+W\nV04C+ZbfU4E7KS01UFqqERT0DKWlLYGZVFcbyMrSqK5+gVGjate2O9o1YWGvACNrYi0o6ITtPUZY\n2GmSk2+u873N3eoRwReEZsDZZrrKElkH3Ik/WxmSNYaCOdX0B0qAB4BVwNNAJ9QFIB9Y1vZeKNoO\n/Aj8iWuuWU6LFktZvx5KS62iXFraDyXS1ucKC411irGjbw/tsL0TCA8/Smmp9fHAgQE1n6059bp3\nBhF8QWgG1JXpnqk1gdFYyK/pp4nnEv7ATiKBUGCA5edyIALoDOwliNfIJCzsM7744g/MmfOz5Zyr\nSU4eTnh4GAkJn5KSYmv8nLT8tBXuzDpjd/Ttr7mmmhYtrHcCM2eOYs6cule9Nqde984ggi8ITRDH\njP7AgTY01JrgxImX+emnYsrKuhKifcxf2EAw0BdqGp6lAnehWiMUA7tpxRtsB8Ixm1vx7LPf0aJF\na8s41nYrtgunjh/fRVbWFEs8y/DzKyYm5gQrV1ptGltqL7q6pdbdyaJFRiff27xr9UXwBaEJ4ijm\nsbFzcJwQdbQ7Nm8uAm0ioxjFzeykEHgCa06+HNCnVf8H/MKlrGUssMvyTDybNy+kqOhBHD1z2xW5\nBQVXMmvWOvbtC8JorCQ5Of6M9lJj2hj4agsEdyGCLwhNEEcxj4jowlVXnbk1gb92kkR6MM/yzCrs\nnfM2wK/Atxh4mXuB14A1qJ6X6hynToXSkGceHh7Gv/99l090Hm1qiOALQhNEedcFqInXlvz++06O\nHGmFn18nOnSoAuztjuh2+7k07Q36Y5XrEuzLLb8DXmYDMAhQ1TLV1aUUFS1BOfoltG5toqiocZ65\nlFK6DxF8QfBRMjIOM27cKgoKOhEefpSVK0fRtavyspOTB7N16wKyslR9elnZGMrKlgHxpKauZfPm\nLwgOziW8bTsuP/4SxvTddMZe5IcCs4COwC78mc8e4D+Wo4rp1CmW7t0rSUmZgC7w/fq9xZ49cy0x\nZTJzZm1fXm+toCyd2oIupZTuQwRfEFzI+cxOx41bVSPopaUaI0c+w9VX9yArK5zYWBMREUa7lasQ\nhFoDO4PiIhN9i/7ENVk/0Qb4G/AKcDvKq28DbAPKgHXczKqaupyLgRGAxv79s3jvvTuxnRQtLw+0\ni2nOnKW1JlQbEnQppXQfIviC4EKczU6dvTDUd5zJZCYnR8O2nUBeXsuasdUuTHOxN2X2AH3wYz/j\nuYgYNHoDpZYjwlBVOCGWM5qBv/Moyqu3LacEMHD6tCrBtP18Q4akoZorpALBbNqUQ0GB2e6z2Qt6\nIZs25TJkSFrN55NSSvchgi8ILsTZ7NTZC4PjcWVl7wGQlpZLdfVMbNsJaFqE3dgFBbG0ajWL06fb\no0Q4GH/SmcooWgNXo8wZ/Qy3AWuBTGB/y1DSus7B//dDVFUtAyqBY8Bky/mV+Dt+PiXWa8HSJNls\nHk5S0lK71saHDlUCHwPDgLWYzY+Tnm79HqSU0n2I4AuCC3E2O63vwmCb0cfE5PH99/arUX/80Q+z\neSLUallWTnCwieJi69iqdcFsYmJe4FTO10xhA3HAPuBFrEK/BJiLMmwOGAL5vPsLxPVpz6fJg3n0\n0dWkpgLkoMR+Hcrw2QU8SEzMJ3YtjWfOvJJNm/6H2Xzmjpb6pC+0q3WslFK6DxF8QXAhzmanHTpk\nk57+L5SBUkSHDrZ7waoeNtAeVd9uFfGiohzL744ty/IpKTlOy5azKC/vhqYFoaZdDbSqzOMeNjCX\nusstw1E9cF7yv4viqo9hv4Hd+zVSU5+ksjIEg8FAy5a5lJXNQ9PiwLLQKiZmPgZDJCkp92N7pzJw\noL/dqlr9oud4kevS5UKMxsI6jxXcgwi+ILgQ57PTQLDbG+o9TCazpc1viuX1AcD1wDwgBmhBdXWU\n5fjrUHl5NKqTzd1o2mbKyu5CtTK7k0BWkMjtdM2HFtRfbvk9MI+HoeoKm6MKKS9vA1wClHD69KXA\nBJt3LeP06WNkZ3fA8U7l3/++krouenXd/Yh9c35xu+APHjyY4OBg/Pz8CAgIYMWKFe4eUhC8nuzs\nSGyFMjs7khkzNmI2P255vgBVUdMH8ENJdjzqYvAekAHMwX4dbIjl8XX48xBTeJvLLM9uwb7c8kng\nAiCdIBaRALyB/YYka1G1O/r5P8T+viAEaFeniNd30bMV97i4UmbPHiT2zXnG7YJvMBhYunQpoaGh\n7h5KEHwG+4VRbTh+fCdVVRdhFdV1wAzL4+G0aJFEefkBVG/KAqAH9gIcDBThzxb+j6uJRTnt+j1E\nf+ApoCfKfd9LIPN4CUjEKub6VuOlqEla2/PnYX9fUMw111Rb2hA7l6HbintUVIistPUAbhd8TdOo\nrq529zCC4HHOpgbfcWFUVtYITKZZwDisjQysgtuyZSTl5UlYBVffHlx/vJcW/EoiH2EE2qLuC/Qz\nhANGlNjPYxgwGnXXsAw4hf1W48ss77I9fy7qziIAP78sbrklnDfeGC4Zuo9xXjL8+++/H4PBwB13\n3MHtt9/u7iEFwSMkJq4hNbUt4E96egDl5Z/z4Yf/V+ex4eFhREf3tlsYdfp0F5RfHw38jvLvwwGN\n4uJg7DPuzsALQCcCWMOdfEJHqJmYreuSsA94jf0og8d2/mA2yh5S7REgwTKOyvZbtjxAWdlUoAug\nMWKErHz1Vdwu+MuXLycqKgqTycTEiRPp1q0b/fr1q/f4qKgQd4fkEiRO1+LNcZ44Yeahh1I5eDCY\nrl2LWbAgnoiIsFrHfPVVNqA6RYLGjz++WvO5Tpww88ADKWzapAF5DBgQRocO2PnfaguRGTaP56E8\n/BIgG3v5Pgp0pxWf8hc+IRi4FPtLQk9Urm4GdhDCAn4BuqPyfNsjL0c1QHseqEB1vDcAd9Kp0zx+\n/fVxpkxJZd++X8jP38vhw0amTl1d5/dwNnjzv7ktvhKnM7hd8KOiogCIiIjg5ptvZvv27WcUfF/w\n9XzFf5Q4XUNCwqqaUsmtW4P57rt/sHHjBMLDw2r62eTktKO6ugu2QlpcHMK+fUctG4Cssus5k5Ky\nBPgNeBWIRAl4H+yFuDd6GwNYgLVBcQkBHGIqM7kQ5ei3pnb1zS6gCEjmJ9Qq226Wcxc5HKkvv7rC\n8rt1nLCwzlRV+TN//nASElaSnj6DzEx9Edi5Z/re/m+u40txOoNbBb+0tJTq6mratGnDqVOn+O67\n75g6dao7hxSEs8IZ3z0jwx94B1XXspOsrF4MGrSEjRsn2PWzUatHrUJaWdmKQYOWEh3d27K61FbM\nNVT3Guvyp8DAX6ioGIO9ZBuAdJSFcyd+7Gc0/ehOUU0bYw1l7tyLtQ/O98AxgviI7aisvrvlqF7A\nXtQU7oVAS9RkrS781cDdNWfu3n1pzfcgPW58H7cKfn5+PlOnTsVgMFBVVcWIESPo37+/O4cUhLPC\nmRYHJtNhVCHjcvQtQbKyNJKSllJQYFuHPgyVscehWo/dR1bWf8nK8kdNehage/JwANs+OFBJUFB7\n2rV7kZycSNS0613AZtQdwAECuZ1EVnARqqmZ7eWjI+oeIBz4BQMvk4Ta+1XP6kNRtf0VwHPAEWAp\nEAXMx8+vkN69+9K5cxHwHtnZkbJdYBPErYLfuXNnUlJS3DmEIDQKZ7LWdu3iLJOr9hOnq1ZVomm7\nsWb1eulxgeXnRqADavJ1OJCEqoTRd4PNRU3QLgBKKSp63tJL/hmU4P8XmI4BE4MZSD920hvV0cYP\ne1PGhDJqnuIOVOb+PKp/zjKgHFWRU2zzGb5HZfnqDDExc9mwwdrKWL/zueOObTV3PrJIyveRlbZC\ns8aZrLVbt5Ns365qz21lVrUviEOJahCqQUEgyjL5K9a+MwuAKajFSvYNz2AkyqefZxnNgLJdDgNt\n8eN1pjKDi6gkFHUPEYqa2l2GtbNlJvAmM1C1OXrzhDCUPbPaMsYCwsJ2YzYPx/Hi1a5dnN1nru/O\nR6pzfBsRfKFZo2etGRmtMZn2kZFhJCHhUzsv33qMH/v3P83p00bUQqRhwHpU24NiVLZ+P8qqWYeq\naTegxHYJyj5xXK2q/x6MkvA2qJLMSIJYwiNs4VpqbyLeFdiBWob1Bd1YxR2oOwioPX2rHgcGmtiy\nZQJJSUvZtCnHIvzqmG7dTtl9L+LXN01E8IVmjb5w6J57PmbHji5kZYWwY0cOP/zwHuXlFwD5XHNN\nMG+8MYJZs75jx47nsbY+eB3ohxL7oSj/Xq+jz0ZZKmGWn0eAVjiuVlVoKKPmYcBAC/L4Cw/QAlUh\nb9s8Qd9E/DCQg4G5bEU1MzuFaocQgrJwnkDdfRxH1c8vY8CAkJrPW1BgJimpfntG/PqmiQi+0OQ4\n212nTCYzX32VhbWG/l8cP/4Mutilpi6jRYuNZGWFY93cIwNV6a7L8XxUhYttHf0ylKWi96XRPfVi\nlOtegrJgwlGZfSEB/I9HeYCXUFOqtvcDbVDSvhmYxzxURq8Bn6CqbabYjP0CMBbdVoqN3cE//zm+\n5jM3tEJW/PqmiQi+0OQ42z1RZ8zYSEVFP6zyGoKj9ZKSchz4BiW5T6KyedvVqi+gJktt31cIvI+q\njLH11FehBD8Y/QLhzxbGE04HrF1yjlG7q2UJLfgHPwArLec5iTJ42tuN3bZtB6677hNLtY2Z5OTx\ntS56Z7owSsuEpokIvtDkcPSfMzL87TbpePLJK5k792ebTUayUTX2+i5MjguTTKiKmj4o774QVSrp\n2Opgj8P7TqKmWG1ragpQ9fVRwDEMHORmRtGXHXRB1ebonW3uRuX/eqOFDYRyqtfrxBauo23bzhQV\n7aBduzgOHy6nqGgnaq5AjT1oUIsGBVs2C29+iOALTQ6r/1wIrGXv3kPs2KGqY9LTNdasmUVl5eya\n15V4ZwMXofZvLUJZNsrDV4+fw75LDdiLeybKdHkS1aYsFHjA8vM9VK+aXqhFVOpcLXmTKXSnDfZe\n/RKsPStPAj8Bb/MhsbGZpG+6tdbn7dv3LYqKpqAvuwoK+onk5IRaxzkiE7PNDz9PByAIrsRkMlNe\nXkFY2AcEBLwCDKWiwr7LTGVlZ8vjVNRkaxFqknMsSozboiY6W6BE+yrss/k+KL9cL4FcjppwDUDZ\nQS1R+XmY5fho1H+1/6F8/+W04s88yqNcBfzB4ewRqPqeA0ApLXibn4AJhIZ2r/Mzq5LKcJTFNJKe\nPS8/45yFjtFYiLrEgEzMNg8kwxe8nrOZhH300S9Yt05tuWetbdFw3A5Q/QxGTWr2xl5y+6Hq4/X3\nV1PbqgkDLkbZKDp6q4IdDsf/BDwGrCKA9fyFJfRE4xDKvsHh6L3Ad0AyM7Dtf1lYmFHnZ7auE1DH\nXXjh6TN9nTXIxGzzQwRf8HrOxmv+8UfbLvB6bcsAVHVMIcq6Kbc8PoaycRzr1k/avL8Y5bnvRmX9\nB1CLqqC21/+75Zi7gVmoydTjwP34cZw7uI8LqKpZLfsAyux5DGsPnP8BvxDLWraj7gqWoyZ9Aykp\naUtBgbnWxc5RuBcsGElVVcPfq0zMNj9E8AWv5+y8Zj1710V4p817v0cJcg9U1h2E6lLZBiW90ZZj\nJlse630oo4CHULZJAaqRWm/LuZdg7SMfClwLfInK9A1AJa14iGmspSWq4YFt82Mj8E/LmX/AwFv8\nRHT0ajiul4BqqN2n/CkqaklS0sZaIu0o3BERvtHhUTj/iOALXo+zi4BMJjMtWxZjbTlcSfv2p+jQ\noYqYmFOsWxeNmjgNQVXbPIG1cuYN1H+HauADVNOx6Vjl+VUgFtXorA/KytmL/cbeT6IuAN2Bv+HP\nVhKYRDBVhKHW49ree8SiMv1iYBVhFF74MqN676CkJJy0tGVAlkMMS2RiVWgUIviC1+Ho2c+ceSWO\nXrPtMR06ZAOB/PCDH2ZzT/SOM4GBc7jiigt4440refTRNahM3LZ2XpffdcCzNs/PcXjdgLqABKP6\n4kRZXg/HWk+Ti6rq6QQYaM10pvI6ccAh1KXAcQeqXagcftfVf+XzVbNqPv+QIWmoLQhXO8QQjtFo\nbvwXLDRbRPAFr8Pesy9g69YFREf3tpuwTUhYaXPMv7AX8uXAXVRUXEpqan9++WW+peVwFUpiQdkx\noGrsq7AX1hhqL3tqgbXR2WzUHMCtKBtHb5u8kAC+ZzjzuAhqvPo4y1nuxtp4YR/wHZEcaD+Jrz+c\nbPf5rXc09s3aYmN3kJw8HkE4V0TwBa/D3rNfR1bWk2RlWSds580bxKZNucBnqLr2AOBDy/GjUR0r\n56JWn75ETs5L2Fe5t8Bq59S1+2suKovXNwjMBfoC/0JZOlGoOv3lqHmAUYCBEN5iCjsJwbbxMDxt\n+WlEratNAl7hCoYOfYCvLRuB22Jt1uaPyTSXdu3i6NbtVJ2rZQXhbBDBF7wG3aZRu0Ppq17bYJt9\n792rcdllb1NW9kfURGknVL2LrdeeidqnNQJrWwMsP09TO6PvgrU12V7L43hUnX4R9nbPMlRWfxKV\nvz+PH/uJJ5o+VDAX1SvT9uzdUReAjqgan9dYQlBQFR9+OK7O70GqZwR3IYIveA22Vg5otG37EuXl\nJzl9Wm8ZUMC+fXuprn4Re4G3ldeLUNOhQ1HevGPVTjFqktb2Ob2RgV4FvxtlEd2Ftbe8fv5y1F3E\nt0B/gunNFPbQBWtdTrHD2fcAJ4C5PI1a2KURGvqi6744QXASEXzBa3AsvywpiaK6uhxYCORjMBRS\nXd0fewFuR+3e7yFY+9G/i/1WIUUoF/0pVO69H2XLrEb5+WGoCdqXUP89CrHtUaNkPRQD2VxPF66h\niDhUBX6X0UVuAAAgAElEQVRLyxHxWHtiHgR+IJhv+NYS0xLgd/r0EWtGOP9IawXB45w4YSYhYSUH\nDuzFdql/dfUhVOVLS+AhNK071kVSYF3s9CyqNn4Z1lYJuhV0G9bSS/189wDXAPcB/ijhH2455/2o\n7cCfQOXl01F2zyrURSKHAJ5kEg9yDUX0Rjn8D6LuJf6Gala8C7WI6oer/8qivbsIC/sSVc4ZCEzn\nxIm62yQIgjtxe4b/zTffMGfOHDRNY9y4cUyaNMndQwpegG3ZZExMHgZDJdnZHepsjfDQQ6kWK8ex\nX/x0rJt+L0fVzt+OypL1TUNOowTeH+XNL0CJ/SFUZh6GyvR160evrNmCEnRQUv0qtdsi2/aoAX/2\ncieP0Qkl246LqK6wRP078D3h7G8/hc/evIvw8DAGDowmJcW6w5T0rRE8gVsFv7q6mtmzZ7N48WKi\no6O59dZbufHGG+neXbKbpo6jH6+EfDTp6Rrl5e/QokXrmjr7I0d0K0fvF78ENXG6DjWRql8AilCZ\nfABqQrYUJdIXUbsscwLwIsryse1cqS+gMqIyeX3B1KWoUk1be2h/zWMD+dxOEp1Qa2mzqb2Iapcl\nor/zHPA05GrMmbOURYuM0rdG8ArcKvi//fYbRqORjh07AjBs2DDS0tJE8JsBGRn+WCtfirGVx82b\n8ygq6g74k54eQIcOv6AmQnWhPWY51rZ0ch5KmF8GrkZZO9NRG4E4ZubBKHHvbvndtsGZXrnTEmuZ\nZXeUXD+OtavNVmASBhYxhme4kZyada/hqBoix0VUR4BlrETdbahY9JWxUnkjeANuFfzc3Fw6dOhQ\n87h9+/Zs377dnUMKHka3cvbu3U99PeSLiqqwzchPnnyRsLBXMJs7oCpkOlN7pWs0ag9Z2wqd5aiL\nQ0vs5fc3lGUzHXVn8aHl+TxUM7OHgR9Qwr4AlZf/EVv7BvJoxTKmMZN5DiPehSoYnYOq9P8dSOav\nQDLWuxn1WcW6EbwJtwq+pmkNH+RAVFSIGyJxPRJn3Uyd+rnFyvkMW8E2GNqiaUtQAh2DdW/YYIqK\nqhk6tA2pqX6orQIN1M6hg1Btix07YZajBFvvn3MUlbFnoCyhLOy3F1mG6pXzrOW5EahGadZiSj/2\n8Sce4BKUfeM4Iqj7h2LgZ+C+las5tKyYgwdX07GjCU2rICtrNV27lrBgwUgiIs7/34ov/H36Qozg\nO3E6g1sFPyYmhqysrJrHubm5REdHn/E9vtDlLyrKN7oReiLOffuCUNKod3pUQqtpLVFTnX1Q2fda\nrFn+cNLSnqBt21YUFenyOgwl4hEosR9qeY/tRWAr1lYJlUCZ5fl01DKnO1HzAbaSHYK6IDjePagW\nyi1ZwZ9ZSRRK7B0bJ+9EXbIOA/P4KzCPqsX1t2uuqjr/f9O+8PfpCzGCb8XpDG4V/EsuuYQjR45w\n7NgxoqKiWLNmDa+99po7hxRsUOWOq5zaOMRVqD4wBajVrh+ibJTTqHLIO1HSeT3wH2xFt7z8QgwG\n6ySpyqFjUcuWdGtoKMoaCkNV1kSiBL81+mbg1sVYRcBbqAzfceGVfZ8cP78f8av+HxN5kWiUQXQZ\nSuyHYu/qlwDbCWAZe1AXDqSDpeAzuFXw/f39mTVrFvfddx+apnHrrbfKhO15xFrueP42qU5OHszW\nrQvIyrLtJvMSyh/XbZxAVAXMIpS9UwSUU1b2V1Stew+sE7ftUNXtF6JEvhR1Z6B78Ccs4ziutu1v\nGddoOacR5d/HoCqBVJ+cmBgThTmnmcrrNVO8rbGK/TqsG5OUAIb7JnHqxLWQ0s0ynvj0gu/g9jr8\nAQMGMGDAAHcPI9TBwYPBOL9xyJlxdpvB8PAwoqN7k5VlK8DtgV9RkqnbOONQojsCdVF4EXVR6Im6\nIPwN6wVjBipTvxh157Ac1YuyBEgE3qb2att1KMG3nW69HVXW+RtgwJ9D9Mx5mb5Qk9lXAL9YzqqL\n/fdArrErr//8ExVVgRQUmJESS8EXkdYKTZiuXYvZurXhjUOcwXGbQb2WPiOjNSbTXiIiutC9eyXJ\nyYOJicnDXoBboeri11HbT9d/jwIWo5oRnLa8pxRVNtkVtSh8JKoLpm255nLUBWW25Ryhlvd84zBW\nBaq0cwYQTkve5EFeJghVgW9bxT8Pa9u0X4G+H3zMjcNGEGbZSUpKLAVfRQS/CbNgQTxlZWfORPXM\nvS7hts3gHfvc/PBDMWbzg+gymZX1Pjt2BLFmzVpURfoLqJbCe1GLnlRFTm0/HcvvIVgbmC1Bib6+\n4YgJlYNrqDsAx7qZwyg//3eU7/8ZyhKy9sDx89tD9+4XcOD3KQzj31yE2tMqHzUlbHvGCNQ9QC6w\ns/f/MX2YbR2/IPguIvhNmIiIhjNRxxWxWVnL2bFjZK1NRxy3GSwpsb0AFKJE/lkqKx27wFdYfgaj\nJmuXo7L3LagFSgtQ3vpfLOcyoKpt9K0D9bJJtayp9sYkO1Fi74eaatVQ62BNGAwLMBgKCAgopLz8\nSTJ/f5XH+DctsC/UdOyGvxdY3fmv9L7iYj4Su0ZoQojgN3McM3clzKvsNh3ZsuUFIiO70bLlLMrK\nugIFVFaWohqSrUMJdBeH8/RC9YzvgzJJ/FENyu5CyeoW1MpWfd1qqOW9GsqnL0RV46iWxMHBwbRu\nncHJkyGcPDkLuBJ1FzAFa0Y/E1sZ17TOaFoorQK2MLG8KxEUcrHlXbaRhqIWUYUDu6KiefS7//FE\neIQLvl1B8C5E8Jsp1s1Gcqg94Wm/yjUnpx05OX7AH1AZ9RSsbvdc6l4odQRrqeQIm2MvRpVq9gJS\nULtP9QdmWc5/EjVluhblxa8FWtO2bQGXXRZJaupk9L48+lixsVmUlMTY1PAb0Dcab8F7TDr1Fn1Q\nS7JKLKPbRhqGarV22YrV3DZgoAu+XUHwTkTwmxG2lTbHj+8kK+shlOwtIyTkFOXlBygr80ctWtJ3\nnApFudlTsIq33mCgN9YLg75QKhrlpV+OfR6t7yk7EiXYek2+vvq1o+U1nWLgH+hZe1aWRnb2U8BS\nlGe/gKCgIMLDs4mIMFJdfYCiIpvaevZzEy25iHIuR80QBAKnLL+/iJrizQUyW7Zk8jdb6Ny1G4LQ\nlBHBb0bY+/WjgPctrxRQXNwG5YPbNv3VO0teQG3bR29yZrtQqhglpzEoJ9w2j24DVGP1823PV4i6\nSNger0u09ThNuxp1UVCWTXi4ucZ6ggJiY+cSHd2bnN8/5s8nV9ADlc3bVuC8iqrSH44ylHq/9TZT\n7ri7MV+rIPgMIvhNGMeVthkZAdgLbQFK0O+zPLbfzi8oKJLQ0AxycsB+Zep2AgK+oby8M0pC26EE\nXpU8qvLKu7GuUf0NdRFohVoEFYSSXF2GQ1H5ti7HJSg7Zz72F4GdqBYIYfj5taekRP8cAOFEtA1j\n4P4JtDhZXNPw7ANq32f8AmwA/mgptxSE5oIIfhPmgQdSSElR1S7p6RrR0S9gK6ABAcFUVtq2Frbv\nHBMenkV09CXk5NyAEu8yoAXV1X+mvPxfqIlafU3qfJTYg8r030bZNDstj5+ynONFVEbvKO6foTJ6\n2wtBEQbDU5bM/iQwGVXeeSctWhykqKhnTbxBPMWf9syhB8qmsZ3yrVXT87fnmPlIoku+Y0HwJUTw\nmzCbNtlPvppMkSi/3AAcoqoqFNiBVWSHoiZXewO7aN26DXv2bANyUPZNN5QL/jFK7HeiRPtt7Gvs\ny7BfHDUHqxVkQElxLPbibsDffzdVVXrXSwNt215ARUUwpaW23n4psbFzadu2M3v2DMOfF3iAp4lE\nTfmWoNbTrkXdY4xCFYgagX1Az7feZqRYOEIzRQTfx6ivxUFdzzvWo1RXF6AmX5cBT6BpytYJCHgG\n6EBl5WGUtbINuICMjN/RtBmo0kt9kdW/UBuRLMde1Gdh3SzcfkMSgyHC0iq72CaeocTEvMjp07EY\nDCauvroNYCQ19YGacw4atJStW49SWmr9DLGxOaSnTyMh4VP279nCgzxNR2qvvT2NMpbyLaMa3nqb\nv4rQC80cEXwfo74WB5s2VWI2twRuID09FFjK1Ve3JDX1JZS1coyIiALy8x0nTcMJDu6C2TwRtcI1\nEHgMNUmql17G2hyvi7njxGtfVG/6LNRCKqtI33ijgV275pKV1cVyvjhiY/ewceM9hIeH1bSgLSgw\n06KF/cpgs7mQMWPmUlDQifDwTFauHMnRjAx6bnyEKyiqKcB0XHt7CDUzkN82lAlfbpIKHEFABN/n\naKjFgV4yefhwW7p00YBpNa/17fsOO3a8YKmpt9opRUU5qOy8LepPwlY+9SZluoAXYW2LYOuOH0Jd\nWKpQ3S5fom3bKAYNakFy8jAAkpI2cvhwT4uYj6/VfK2uHjXh4WGkp0+refzh669wdO7zRKNaIDiu\nHNCAHwEzEP3W20yXrF4QahDB9zEcWxyoCpnaJZNGYxHHjkXYvZafH0OfPhXk5LRCTZqGAK2orn4I\nlQ8/i6qksfXWT6ImVZfTqlUZoaEZ5OYuQS2YeslyjmKUp1+NukNQq2mvu+49OwFvTMOxXdu28Wn8\nYDprGiGWT51nGd22Z/1m4GhQax7/+nvJ6gXBARF8H0H36A8caENs7BxLk7MqysurSE21XgDCwvYw\ncGABTz55Bbfeuhpb8TYai9i06TQwFaugv4+qfAFlyVyCtY/8LtS+sGHAnUREzKWg4EJUnxtFQMBz\nVFY+jaqLWYtqoaA2B8/OjnTJZ1/98VL2JD7MGyhhn24T/VxUfVBHVGfLuLfe5nHJ6gWhTkTwfQTH\nJmeXXvoe0IKjR1sTGzuXdu3i6NbtFMnJdxIeHkZCwkoyMyej574xMb9SXh5JUVE7lH0TjxLyAlQd\n/nKs1TS6NdQH+CcBAZFERBwnK2sq6uKgoQt8dXVny/kqsDY8U6tnjcZKwPle+nXxzovPU/LmK/S0\njOLY2bIT6rK0u20od4lXLwhnRATfwzgrho7e/Y8/+mE2Wy8AV11l3c3KZDKzaVMlqi7+LgCOH99h\n6UNjK+h3Uv8kbAVwjKFDI/jww7sZMiSN48f15z9E7Vg1nerqcMv5PrR7f1jYaZKTbwZqTzQ7s/NW\nWspnbEmYQBuse8sOpfZWJzuB61es5o/SA0cQGkQE38M4K4a1vXt9az8A+92sZszYiNlchbJWQoAi\nqqvD7I4PCqrA3382JSX+qBW2O7H37gNRxY7v2Yy/FvssXu+pY8DP7xjV1db4Bg4MqLlwOV6sGtp5\n6z8LF3D0bzO4Cvu2CMtRl6VZqLqhA35+jFi3kd59Lz/j+QRBUIjgexhnxVDV1VtLFsvL29h590Zj\nUc3dwvr1oNab2u4rOwfb3HjIEPjii3KsneGvB55BbczdApVPG2p8+OTkwWza9CVms2MBJIDGLbdE\n1Cqp1HG8WNW389aWDRv45s7RdEXtcWVfza9G247aB6vlW28zQ7x6QTgr3Cb48+fP5z//+Q/t2rUD\nIDExUfa2rQNnxdCxZLGumvWkJFuf374vjiqvXEZY2GkGDgwgOXkQ69dX2RwTDlyFqrixdrLU4wkP\nD2PgQH9SUmwXQe0gOrqamJh8gHptKceLVV07b+kWzlUood9B7f2xvgcKWrfmwY1SgSMI54JbM/yJ\nEycyceJEdw7h8zgjhnWhXwD0rP6OO7ZZetvrXWRKsG5Q0gb4BX//Mq65xkhy8gjCw8MID8+yW8Wq\nO+V610nHeGrHOr5mgjgl5X7qs6XOtAfsrm3bWDlyCK0qKojCauH0x9pBPwzVmu1SaYsgCI3CrYKv\nVmoKZ+JMYujMhO6jj37BunVKbK37wd4DDMVgeBlNe9Hy2giqql4lNXUKLVooQV65cpRlFWsHNO0A\nXbv2IC5udZ2LovRY580bVBNTUtIGkpMHn7VHD8q++e6uMcRpGq1QbdG+xv5+oyewBwiZ+wp/u39S\ng+cUBOHMuFXwP/74Y1JSUrj44ot54oknCAkJcedwTQZd6Otql+B4cfjxRz9sxTYg4DQXX/yZpea+\nu4PnrpqS6YLctavRbhWrM9Q1yWw0ak7ZUjqrP17KvsSHa/bK0uvpY7G3cHYAwdNncKeIvSC4hEYJ\n/sSJE8nPz6/1fGJiInfffTcPP/wwBoOB119/nblz5zJnzpwGzxkV5RsXBXfFeeKEmZtu+pjMTH17\nQGs1TFZWeK1xDYYT2MpkSEgxv/zyIACjRy+289z1n3FxpWcV/4kTZh56KJWDB4PZv78a2wtMVlY4\n69Zdz5Qpyzl4MJiuXUtYsGAkERG1z3/49995+7rr0PLyuBb7GYYY1KaFy1BtEQ4FBjLhhx+49Mor\nnY7zfNDc/z5diS/ECL4TpzM0SvA/+OADp467/fbbmTx5slPH5uUVNyak84Le7MsdJCSsIjPTdutA\na7uE2NiCWuNefXUbUlP1LpXFXHGFH6NHL7HYQBUMHfoemZlhnDixj4gII927L2X27EHk5RU7vQYg\nIWGVzWSw/d61sbEFVFX5M3/+8Jrjq6pq/ztu2bCBNXeOph2qyfJOVF2QXsV/AHVZy42MYsSaL7nN\nMinrTX8P7vx3dyW+EKcvxAi+FaczuM3SycvLIyoqCoAvv/ySuLg4dw3VpFB2i307ML1dgu0Eqi7W\nmZnRxMbutWm10NbOchk1ailpabcAt9Qay9k1APYe/TDCwl6hS5cLnZpkLjSZ+PD2MZz67Rcisd9A\nUe+8vxXVxrjrW2/zkEzKCoLbcJvgv/zyy+zevRs/Pz86duzI888/766hmhQxMXnArVhbIvzGpk33\n1Mq8HVst6CtthwxJw9kJVGcnW+1LR0MZOLA9ixbd2OBnKTSZePe6fgSeyKcDEIf9fUsUkI5aQnaD\nbDcoCG7HbYKfnJzsrlM3aQyGSlS/GmXRXH55O6daLehi7Wxd/9kce7alo4UmEx/eOoLAHdvpgloC\n1prabYx/B4au38TAmwf4xG2zIPg6stLWy8jO7oCavtQff1bncfWJta04x8WVMnt2/eLsrJCfqXTU\nkS0bNrDqztG0R7VAsF3nexfWNsbfA/1XrJa2CIJwHhHB9zIam3XbinNDE05nI+QNUWgy8cn/3U7Z\nT/8jGrVm19a+CQcWoBZRHbj8Sh5Y/gmh4REuGVsQBOcQwT+POFMV8+STV7J1q76l31FmzhxV57lc\nKdaNpdBk4tUr+hB56iRdgQyUjWNr32QBtGzFdau/kKxeEDyECP55xJmqmLlzfyYr60nAQGmpxpw5\nS1m0yOiJcJ1CX0TVDvsKnGewt286zn1FFlAJgodp8oJfV1ataZzzhhyNwZmqmHNpU+AJ0lI+Iz1h\nAj1QmyN2wN7CuQDV1fJXoK9U4AiCV9DkBb+urBo46w05XIEz/vzZVNl4gkKTifWJD3MkdY1da4SZ\n2Fs4+1FCP12EXhC8hiYv+PVnzOc/i3amKuZcu2eeD45mZLDwun6EVVfVqqmPRO2EG4US+/C/PSdZ\nvSB4GU1e8OvOmM+u2ZercGai1ZsmY3UKTSa+eHgSpWnrCUNtObgT1XxZb41QBJQBmZf2JfG/n0kF\njiB4IU1e8OvPmL0zi/Y2jmZk8J8Bf2RuRTnLgenozZZVa4RoVEYfcNUfeOCj/4jQC4IX0+QFv76M\n2duyaG9k17ZtpA4dRFfq3ua8N/CjwY/79hwQoRcEH6DJC75w9hzNyODzUX/C73guc4FXULZNMbW3\nHBz6xUYRe0HwEUTwBTv0rL43qtfNEdTq2GUooX8JaAvkRbfn9tVfyN6yguBDiOALgEXoRw7hgooK\nLgGGoerrXwKmAGtRk7SFLVpy7efrZbWsIPggIvjNHL0C50Taeru6erXHFrRHZfeHUZ0tbxOhFwSf\nRQTfDTi7k5SnSUv5jG8SJhAFGKlrjy3YB1RFRnHXmi/FvhEEH0cEvwHqEu+GthNzdicpT1JoMvFz\nwgQ6A0+gsnjbCdm9wGZUVi/2jSA0DUTwG6Au8f7sswlnfI8398PRWyPkfLWei4BAVKTxKBvnJJCN\n2nLwZulXLwhNCj9PB+DtnIt4G42FqDwZvKkfztGMDN699CJOpK7huYoKgoBjqEjDgDtRi6hCbhzC\ntL2H+OOAgZ4MVxAEFyMZfgOcSzMzb+yHU2gy8emga7m2vAwT1qx+CfA00BXYHxjI7d9tFa9eEJoo\njRL8devWMX/+fDIyMlixYgV9+vSpee2dd97hk08+wd/fn6eeeor+/fs3OlhPcC7i7S39cMwnTvDf\ne+6hcPO3VBYXM1vTMAAfY83qpwFzAgOpvOkW7ntjviyiEoQmTKMEPy4ujvnz5/P000/bPZ+RkUFq\naipr164lJyeHiRMnsn79egwGQz1n8l68RbzPlqMZGSy89goiNI0eqEVU24FLUTX2r6I6XO5v2Yp7\nf9sjQi8IzYBGCX63burWX9M0u+fT0tKIj48nICCATp06YTQa+e2337jssssaM5zgJLp9E6VpdrtQ\nPY0S/FDADJyIiua2z9eL2AtCM8EtHn5ubi59+/atedy+fXtyc3PdMZTgQKHJxL8HX0e306VUYF9b\n3xVYDByL7ci9GzeL0AtCM6NBwZ84cSL5+fm1nk9MTGTw4MF1vscx4wectnMaqnH3FrwtTvOJE6Q8\n8ACZa9Yws6LCzqvXM/x9QI9Ro3j4/fcJi/Ausfe277M+JE7X4Qsxgu/E6QwNCv4HH3xw1ieNiYkh\nOzu75nFOTg7R0dFOvTcvr/isxzvfREWFeE2cu7Zt48sx8XQ9Xcpx7FfMDgNeADqixF7fW7aiyru+\nZ2/6Ps+ExOk6fCFG8K04ncFldfi2Wf3gwYNZu3Yt5eXlHD16lCNHjnDppZe6aijBhi/HxDP7dCn3\no1bM7sW6AiAU8IvtyIC9h5h+vEi2HBSEZk6jPPyvvvqK2bNnU1BQwOTJk+nZsyfvvvsuPXr0YOjQ\noQwbNoyAgACeeeYZn6zQ8WaOZmSQOm443U6X2vn03VFtEsqBnE6duCPtO/HqBUEAwKDVZbh7EF+5\nffJUnIUmE9/OeIyDa1fzXEUFy1BdLXWffhYQGhZG62uu488fLaGiKtAjcZ4NvnTbLHG6Bl+IEXwr\nTmeQlbY+gi702qYNtDSb6YZ9D5xS4ECrIG5eta6m/01YhG/8sQqCcH4QwfcRvp3xGPemfGqXydv2\nwJkT25G/pO/2ZIiCIHg5IvhejJ7Vhx4+hHbogJ1X3xO1G1V7Pz+yYzowdOUazwUqCIJPIILvxdhm\n9Y419dlhYcQMHMz1ya/JpKwgCE4hgu9l7Nq2jS9G/wljWRm5wALgblRN/SthYXTv0o1CYxfGiNAL\ngnCWiOB7GV+OiefFsrKaTH4ZkIry6SMHDub6RYs9GZ4gCD6MCL6X0a3stJ1XHwKYgoJYPGQo1ye/\n5sHIBEHwdUTwPYztxGyh0cjuwBZo5dYMvxioHjKU4ZLZC4LQSETwPYxduWX6z/x94CCe+vF7jGVl\nHDcYaHX9QMZIZi8IggsQwfcwoYcP2Vk4nQsLuftonidDEgShiSKbmJ9HCk0mPk+4l2+H3MDnCfdQ\nWGCi0Gi02e4cCo1dPBihIAhNGcnwzyOO9s1iDFyf/DqLMVg8/C4yMSsIgtsQwT+PONo3oYcPERoe\nIROygiCcF8TSOY+IfSMIgieRDN8NOJZaXp/8OqHhEWLfCILgUUTw3UBdXv3wRYvFvhEEwaOIpeMG\n6vLqBUEQPI0IvhsQr14QBG9ELB03IF69IAjeSKMEf926dcyfP5+MjAxWrFhBnz59ADh27Bjx8fF0\n69YNgMsuu4xnn3220cH6CuLVC4LgjTRK8OPi4pg/fz5PP/10rdcuuOACVq5c2ZjTC4IgCC6kUYKv\nZ/CapjVwpCAIguBp3DZpm5mZydixYxk/fjw//fSTu4YRBEEQnKTBDH/ixInk5+fXej4xMZHBgwfX\n+Z7o6Gi+/vprQkND2blzJw8//DBr1qyhTZs2DQYUFRXiRNjnD/OJE6Q+9BDBBw9S3LUr8QsWAN4X\nZ31InK5F4nQdvhAj+E6cztCg4H/wwQdnfdLAwEBCQ0MB6NOnD507d+bQoUM1k7pnIi+v+KzHcyef\nJ0yyLqLaupXFZZVM/OwTr4uzLqKiQiROFyJxug5fiBF8K05ncJmlY+vjm0wmqqurATh69ChHjhyh\nc+fOrhrqvCKLqARBaCo0atL2q6++Yvbs2RQUFDB58mR69uzJu+++y08//cTf//53AgIC8PPz4/nn\nn6dt27auivm8Umg0oqX/XLPloCyiEgTBV2mU4N90003cdNNNtZ4fMmQIQ4YMacypvQZZRCUIQlNB\nVto2gCyiEgShqSC9dARBEJoJzVLw69pbVhAEoanTLC2d+vrVC4IgNGWaZYYvpZaCIDRHmqXgS796\nQRCaI03e0qlrf1kptRQEoTnS5AW/Pr9ePHtBEJobTd7SEb9eEARB0eQFX/x6QRAERZO3dMSvFwRB\nUDR5wZfWCIIgCIomb+kIgiAIChF8QRCEZoIIviAIQjNBBF8QBKGZIIIvCILQTBDBFwRBaCY0SvCT\nk5MZOnQoo0aNYtq0aZSUlNS89s477zBkyBCGDh3Kd9991+hABUEQhMbRKMHv378/a9asISUlBaPR\nyDvvvAPA/v37SU1NZe3atSxatIjnnnsOTdMaOJsgCILgThol+Ndeey1+fuoUffv2JScnB4ANGzYQ\nHx9PQEAAnTp1wmg08ttvvzU+WkEQBOGccZmHv2LFCgYOHAhAbm4uHTp0qHmtffv25ObmumooQRAE\n4RxosLXCxIkTyc/Pr/V8YmIigwcPBmDBggUEBgYyfPhwgDrtG4PBUOs5QRAE4fzRoOB/8MEHZ3x9\n5cqVbNq0iSVLltQ8FxMTQ3Z2ds3jnJwcoqOjnQooKirEqeM8jcTpWiRO1+ILcfpCjOA7cTpDoyyd\nb775hnfffZcFCxbQokWLmucHDx7M2rVrKS8v5+jRoxw5coRLL7200cEKgiAI545Ba0T5zJAhQ6io\nqIMzjrUAAATvSURBVCAsLAyAyy67jGeffRZQZZkrVqwgICCAp556iv79+7skYEEQBOHcaJTgC4Ig\nCL6DrLQVBEFoJojgC4IgNBNE8AVBEJoJXiv47733Hj179sRsNns6lDp58803GTlyJKNHj+b+++8n\nLy/P0yHVyZn6HXkT69atY/jw4fTq1YudO3d6Ohw7vvnmG/70pz9xyy23sHDhQk+HUy8zZ87k2muv\nZcSIEZ4OpV5ycnKYMGEC8fHxjBgxwq6c25soLy/ntttuY/To0YwYMYL58+d7OqR6qa6uZsyYMUye\nPLnhgzUvJDs7W7vvvvu0QYMGaQUFBZ4Op05KSkpqfl+yZIn29NNPezCa+tm8ebNWVVWlaZqmvfzy\ny9orr7zi4YjqJiMjQzt48KA2fvx4bceOHZ4Op4aqqirtpptu0jIzM7Xy8nJt5MiR2v79+z0dVp1s\n3bpV27VrlzZ8+HBPh1Ivx48f13bt2qVpmvo/NGTIEK/9Pk+dOqVpmqZVVlZqt912m/brr796OKK6\n+eCDD7Tp06drDz74YIPHemWGP2fOHJKSkjwdxhlp06ZNze+lpaU1PYW8jfr6HXkb3bp1o0uXLl7X\nZO+3337DaDTSsWNHAgMDGTZsGGlpaZ4Oq0769etH27ZtPR3GGYmKiqJXr16A+j/UvXt3jh8/7uGo\n6iYoKAhQ2X5lZaWHo6mbnJwcNm3axG233ebU8Q2utD3fbNiwgQ4dOnDRRRd5OpQGef3110lJSSEk\nJMRrb01tWbFiBcOGDfN0GD5FXX2htm/f7sGImg6ZmZns2bPHaxdlVldXM3bsWI4cOcKf//xnr4xT\nT46Li4udOt4jgl9ff55HH32Ud955h/fff7/mOU9mfA31EUpMTCQxMZGFCxfy0UcfMW3aNA9EeXb9\njjzp7zoTp7fhbXccTYWTJ0/yyCOPMHPmTLu7ZW/Cz8+Pzz77jJKSEh566CH2799Pjx49PB1WDV9/\n/TWRkZH06tWLLVu2OPUejwh+ff159u3bx7Fjxxg1ahSappGbm8u4ceP473//S7t27c5zlA33EdIZ\nPnw4Dz74oMcE/1z6HXkCZ79PbyImJoasrKyax7m5uU73hRLqprKykkceeYRRo0Zx0003eTqcBgkO\nDuYPf/gD3377rVcJ/s8//8yGDRvYtGkTZWVlnDx5kqSkJJKTk+t9j1cZz3FxcWzevJm0tDQ2bNhA\n+/btWblypUfEviEOHz5c83taWhrdunXzYDT1U1+/I2/Gm7LqSy65hCNHjnDs2DHKy8tZs2YNN954\no6fDqhdv+u7qY+bMmfTo0YN77rnH06HUi8lkqrFJTp8+zQ8//OB1/8cfe+wxvv76a9LS0njttdf4\n4x//eEaxBy/08G0xGAxe+wf86quvcvDgQfz8/IiNjeW5557zdEh18sILL1BRUcF9990H2Pc78ia+\n+uorZs+eTUFBAZMnT6Znz568++67ng4Lf39/Zs2axX333Yemadx66610797d02HVyfTp09myZQtm\ns5kbbriBadOmMW7cOE+HZce2bdtYvXo1cXFxjB49GoPBQGJiIgMGDPB0aHbk5eXxxBNPUF1dTXV1\nNfHx8TX7ffgy0ktHEAShmeBVlo4gCILgPkTwBUEQmgki+IIgCM0EEXxBEIRmggi+IAhCM0EEXxAE\noZkggi8IgtBMEMEXBEFoJvw//5K32R/vBHAAAAAASUVORK5CYII=\n",
+            "text/plain": [
+              "\u003cmatplotlib.figure.Figure at 0x7f5be3c99f50\u003e"
+            ]
+          },
+          "metadata": {
+            "tags": []
+          },
+          "output_type": "display_data"
+        },
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "Current loss:  9.48636\n"
+          ]
+        }
+      ],
+      "source": [
+        "import matplotlib.pyplot as plt\n",
+        "\n",
+        "plt.scatter(inputs, outputs, c='b')\n",
+        "plt.scatter(inputs, model(inputs), c='r')\n",
+        "plt.show()\n",
+        "\n",
+        "print('Current loss: '),\n",
+        "print(loss(model(inputs), outputs).numpy())"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "sSDP-yeq_4jE"
+      },
+      "source": [
+        "### Define a training loop\n",
+        "\n",
+        "We now have our network and our training data. Let's train it, i.e., use the training data to update the model's variables (`W` and `b`) so that the loss goes down using [gradient descent](https://en.wikipedia.org/wiki/Gradient_descent). There are many variants of the gradient descent scheme that are captured in `tf.train.Optimizer` implementations. We'd highly recommend using those implementations, but in the spirit of building from first principles, in this particular example we will implement the basic math ourselves."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "MBIACgdnA55X"
+      },
+      "outputs": [],
+      "source": [
+        "def train(model, inputs, outputs, learning_rate):\n",
+        "  with tf.GradientTape() as t:\n",
+        "    current_loss = loss(model(inputs), outputs)\n",
+        "  dW, db = t.gradient(current_loss, [model.W, model.b])\n",
+        "  model.W.assign_sub(learning_rate * dW)\n",
+        "  model.b.assign_sub(learning_rate * db)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "RwWPaJryD2aN"
+      },
+      "source": [
+        "Finally, let's repeatedly run through the training data and see how `W` and `b` evolve."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          },
+          "height": 446
+        },
+        "colab_type": "code",
+        "executionInfo": {
+          "elapsed": 569,
+          "status": "ok",
+          "timestamp": 1527005915434,
+          "user": {
+            "displayName": "",
+            "photoUrl": "",
+            "userId": ""
+          },
+          "user_tz": 420
+        },
+        "id": "XdfkR223D9dW",
+        "outputId": "c43591ae-d5ac-4f2b-a8e7-bfce607e0919"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "Epoch  0: W=5.00 b=0.00, loss=9.48636\n",
+            "Epoch  1: W=4.58 b=0.42, loss=6.28101\n",
+            "Epoch  2: W=4.24 b=0.76, loss=4.29357\n",
+            "Epoch  3: W=3.98 b=1.02, loss=3.06128\n",
+            "Epoch  4: W=3.78 b=1.23, loss=2.29721\n",
+            "Epoch  5: W=3.61 b=1.39, loss=1.82345\n",
+            "Epoch  6: W=3.49 b=1.52, loss=1.52970\n",
+            "Epoch  7: W=3.38 b=1.62, loss=1.34756\n",
+            "Epoch  8: W=3.30 b=1.70, loss=1.23463\n",
+            "Epoch  9: W=3.24 b=1.76, loss=1.16460\n"
+          ]
+        },
+        {
+          "data": {
+            "image/png": "iVBORw0KGgoAAAANSUhEUgAAAW0AAAEDCAYAAAD+/1UIAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3Xl4VOXdPvD7zJZ9XwmELQkQIAELsiTsi6xiEBGXAiIW\nbV8WBY2K0tLa4lbsr283qxURtIoioAi8SpFNg6whi0FJKAoJBgLZt5k5c87vj5OZLIRkgEnOGXJ/\nritXJsmZyT0sN1+enPOMIMuyDCIicgs6tQMQEZHzWNpERG6EpU1E5EZY2kREboSlTUTkRljaRERu\nxODMQePGjYOvry90Oh0MBgM2b97c1rmIiKgZTpW2IAjYuHEjAgIC2joPERG1wKnlEVmWIUlSW2ch\nIqJWCM5cETl+/HgEBARAEATMmTMH9957b3tkIyKiJpxaHvnggw8QFhaG4uJiLFiwAD179sTgwYPb\nOhsRETXh1PJIWFgYACA4OBgTJ05EVlZWi8fL3t6AIADdugFvvglYrTeflIiIWl8eqampgSRJ8PHx\nQXV1NR5++GEsXrwYI0aMuPadCgtRvfoFeL2zDkJtLWxdu6NqRSrMs+8DDE4N9y4XFuaHoqIKVb73\ntTCTc7SYCdBmLmZyjlYzOaPVSfvy5ct44IEHkJKSgjlz5mDcuHEtFzYAREai6oWXUHwkA9WPPApd\n4QX4L/sVgpMGwWPTvwFRdCocERE15tQPIm9Ew3/FdBcK4P3ntfB89x0IVivEmFhUP/kMzCmzAL2+\nLb79VbT6LysztU6LmQBt5mIm52g1kzPa5YpIKaozKl9+DcWHT6Jm7gLof/wB/r98BEGjh8Fj28cA\nTyckInJKu17GLnWJRuXaP6P40AnUPDgP+jN58F+0AEFjhsO0fRvLm4ioFarsPSJ1647KP/0VxWnH\nUTvnAehPf4+AhfMQNG4ETDs/A/hiOkREzVJ1wyipR09U/OV1lHx9FLX3zIH+uxwEPPQAAieMgunz\nXSxvIqImNLHLny0mDhV/fxMlB4+g9u57YMjORMDcOQicNAamPV+wvImI6miitO1scb1Q8fo6lOz/\nBrUzZsJ4Mh0B99+DwKnjYdy7h+VNRNftL395DR999IHj4+XLl2DVqlWOj//61/+HDz/8txrRboim\nStvO1iceFf96B8V702CeNgPG48cQOGcmAu+cBOOBfSxvInJa//6JyM7OAKBsfldWVorc3FzH17Oz\nM5GQMECteNdNk6VtZ+vXH+Vvv4uSPQdhnjwVxiPfIPCeGQhImQpj2ldqxyMiN5CQMBBZWZkAgLNn\nz6Bnzxj4+PigsrISVqsVP/74A+Liequc0nnqXFN+ncSEASjf8AEMJ0/A+9UX4bH7c5hSpsIycjSq\nnloJcdhwtSMSkRN8Vj8Pj+3bXPqY5jtTULX699f8emhoKPR6Ay5duoisrEz075+I6uoyZGdnwsfH\nBzExsTCotL3GjdD0pN2UOPBnKH/vI5Ts2gPL2PEwHdyPoBmTEDD7LhiOHlY7HhFpVGJiIrKyMpCd\nrZT2gAEDkJWVgaws91oaAdxk0m5KHHQ7yjZtheHIYfi8sgam/Xth2r8X5vETUZ26EuJtg9SOSETN\nqFr9+xan4rbSr18isrIy8d//KssjHh4y/vnPf8HX1wfTpt3V7nluhltN2k2JQ4aibPMnKP1kFyzJ\nI+GxZzeCJo2F/8/vhSHzpNrxiEgjEhIGIC3tIPz9/SEIAgICAlBZWYHs7Cz075+gdrzr4talbWcd\nnoyyrTtQuuUzWIYlweOL/0PQhFHwn3c/9HU/gCCijismJhbl5WXo3z+x0ef8/Pzg7+9er33bLrv8\ntStZhvHAPvi8/AcYjx0BAJin3wWPXz+Hom69lRdn0Ait7jTGTM7RYi5mco5WMznjlpi0GxEEWEeP\nRemO3Sj9YAusPxsEj88+AYYMQeCEUfBc/xaEinK1UxIR3ZBbr7TtBAHWcRNQuutLlG7aCsycCUNO\nNvxSn0BIQm/4Ll8CQ/pxXqhDRG7l1i1tO0GAdex4YMsWFJ88hapnV0EKCYHXu+8gaNJYBI4fCc+3\n/wWhvEztpERErbr1S7sBKSIS1U88heKjmSj9YAvM02bAcOpb+D29HCGJveH7xGIYThzj9E1EmtWh\nSttBp4N13ASUv/2uMn2v/DWk0DB4vbcBQZPHIWjcCHiue5PTNxFpTscs7QakiEhUP/4kio9koHTT\nVpinzYD++1Pwe2aFMn0//j8wHD/K6ZuINKHDl7aDTgfr2PHK9J2eg8rnfgMpNBxe/96IoCnjETQ2\nmdM3kZsqLPwJ8+bNUTuGS7C0myFFRKJm2QoUHzmpTN/T74L+9HfK9J3QC77LfgXDsSOcvonciKCh\nazRuBku7Jfbpe91GXEk/hcrnV0MKj4DX++8iaOoEZfp+6w0IZaVqJyWiVoiiiD/8YTXmz78fy5Yt\ng9lsVjvSDbn1roi8BpddASVJMB7YB6+N62Ha9RkEUYTs5QXzXXejZu5DEAcPcfqqS61elcVMztFi\nLq1nWr3aA9u3u3afujvvFLF6dcsFXFj4E2bPnoF//GMd+vdPwJ/+9CI6dYrGfff93KVZbkbHvSKy\nrel0sI4Zh/K3NuDKye9Q+fxvIYVHwPOD9xA0bSKCxiTB861/cvom0piIiEjH5lAzZsxAZmaGyolu\njFtuzaoVcng4apY+gZrFy2A8uB+eG9fDY+d2+D37FHx/92uYZ8xEzbwF1zV9E93KVq82tzoVt5Wm\na9ru+leSk7Yr6HSwjh6Lin+9o0zfq34HKSISnpv+XTd9D4fnv16HUFqidlKiDquw8Cd8+202AGDH\njh1ITByocqIbw9J2MTk8HDVLHkfxN+ko3fwpau+6G/q8XPitTEVIYm/4LXkMhiOHeeYJUTvr3r0H\ndu36DPPn34+ysjKkpNyjdqQbwh9EtgOhqAiem/4Nz41vw3D2vwAAsU88DAsfxpVREyH16KlKruZo\n/QdZWqLFXMzkHK1mcgYn7XYgh4WhZvEylBw6gdKPt6M25W7oz+QBTz2FkKEDETR6OLxf/gMMWRmc\nwImoRfxBZHvS6WAdORrWkaNReeUKQr/eA/Omj2A6sA8+a1+Gz9qXYYvuCvOUabBMvRPWIcMAN3qV\naCJqe2wElcghIcDChSifcS+EygoYv/wPPHZ+BtPuz+H9xj/g/cY/IAUHwzxpKixTpsMyeizg5aV2\nbCJSGUtbA2RfP1hmzIRlxkzAYoHx64NKgf/fDni9/y683n8Xsrc3LGMnwDx1OiwTJ0EODFI7NhGp\ngKWtNSYTrGPHKy/c8PJaGE4cg8euHTDt3A6PHZ/CY8enkA0GWJNGKgU+ZRqkTlFqpyaidsLS1jKd\nDuLgIRAHD0HV86uhP/09PHZ9BtPO7TAd2AvTgb3AMytg/dkgmKdMh2XqnbDF9VI7NRG1IZ494i4E\nAbbefVD9+JMo/WI/rqTnoOLFV2EZOQaGjJPw/cNvEZw8GEFJg+Dz+9XKHuCSpHZqItVVVlZi69bN\nbfb406dPQGVlJQDgypXLGDnydmRlZTT4+kSUl7vuxcSdLm1JkjBz5kw89thjLvvmdOOkzl1Qu/BR\nlH38Ka7knEH5X/8J89Q7oS/Ih/f/voagKeMRPDAevqlPwLjvS8BiUTsykSoqKsqxdetHzX5NcsFg\n07dvArKzMwEA2dmZ6NWrD7KylI/PnfsRgYFB8Pf3v+nvY+d0aW/YsAExMTEu+8bkOnJQMMz33o/y\n9e/h8qmzKHvnfdTOeQCCuRZe699C4L0pCOkbA79fPgLT9m1A3VRA1BG8/vpfceFCAR5++EH8/e//\ni/T045g3bx5++9vnMX/+fVe9QML777+Lt99+EwBQUJCPFSuW4pFH5mHx4kU4d+7Hqx4/ISHRUdpZ\nWZmYM+dBfPttfYknJCS69Pk4taZdWFiI/fv347HHHsPbb7/t0gDkYt7esEyZBsuUaYAowvhNGky7\nPoPHzs/g+fGH8Pz4Q8geHrCMGQfLlOkw3zEFcmio2qmpAwke1L/Zzxcfz3bJ8U398pdL8MMP/8W6\nde8BANLTjyMrKwsbNnyIyMhIFBb+dM0XSHjllTVITV2Jzp27ICcnG2vXvoQ///kfjY7p3z8R69e/\nBQA4depbPPLIY/joo38DUEo8IWGAUzmd5VRpr1mzBqmpqaio0NZln9QKgwHWEaNgHTEKVb9/GYas\nDOUslF074PH5Lnh8vgu+Oh2sQ4fDMnU6zFOmA2HN/wUhupUkJiYiMjKyxWNqamqQnZ2BVauehn23\nD1EUrzqub99+yM39HrW1tbDZbPD09ERUVGcUFOQjOzsD99/v2j27Wy3tffv2ITQ0FPHx8Th8+LDT\nD+zsdfTtqcNnGj9SeVv7CpCbC3zyCYStW2E6lAbToa/hu+pZICEBYWPGAGPGAKNGARqZwrX4ewdo\nM5fmMzWzxAAAYde68/Ue34TFUg69XufIEBjoDS8vL8fHklQNQajPaDQCOp0JwcHeCAgIwPbtn7by\nHfzQvXs37N//OQYMSEBYmB+GDBmMrKxjKC8vw6Br/E/hRrVa2idOnMCXX36J/fv3w2w2o6qqCqmp\nqXjllVdavJ8WN2NhpgYCI4H5jwLzH4Vw8SI8Pt8Jj53bYUr7CsjKAv7yFwCAGN8X1uHJsCSPhHVY\nMuQwZ/+quI4Wf+8AbeZipqvV1sqoqKh0ZCgtrQZQ31GSZMLly1dw5kwBPD09sXv3HgwbloSaGhkR\nEZ3w4YdbMXbsBABAXl4uYmPjrvoeffr0w7p1b2PhwkdRVFSBbt164YUXViE+vp/Tz93Zf2xbLe3l\ny5dj+fLlAIAjR45g3bp1rRY2uRc5IgK18xagdt4ChPmbUPrFPhi/Pghj2tcwHjsMw6kceK1TfjAj\n9u4D6/BkWJNHwjJ8BOTwcJXTE7XM3z8ACQkDMH/+fRg6NAnDhyc3+rrBYMCCBY9g0aL5iIrqjG7d\nuju+9utfv4A//vElvPPOOthsIsaPv6PZ0k5IGIDNmzehXz/llXF69+6DoqIizJgx0+XP57q2ZrWX\n9uuvv97qsfzXvnVukcligSH9BEyHvlKK/OhhCNXVji+Lcb1gHT4C1uQRsCaNgBTR8jqhSzJphBZz\nMZNztJrJGdxPW0VumclqheHkCRgPfQ3T1wdhOHIYuqr6UwjFmFhYk0Y43lxxib0Wf50AbeZiJudo\nNZMzeBk7XR+jEeLtQyHePhQ1S5crJZ55UllKSTsI4+Fv4LVxPbw2rgcA2Lr3UNbD65ZUpM5d1M1P\n5OZY2nRzjEaIg26HOOh21Cx5HBBFGLIylBI/9BWMh9Lg9d4GeL23AQBg69odluQR9SUe3VXlJ0Dk\nXlja5FoGA8TbBkG8bRBq/mcpYLPB8G0WjF9/VV/iddvNAoAtuiusSSNgsS+ndO3mvi+TTdQOWNrU\ntvR6iIkDISYORM0vFwM2G/Q538KUdtAxjXtu+jc8NylXkNk6d3Gsh1uSRkDq3kPlJ0CkLSxtal96\nPWwJiahJSETNo/8DSBL0p3Ial/hHH8Dzow8AALZOUcCY0fDqkwAxcQDEhETI/gEqPwki9bC0SV06\nHWz9+qOmX3/U/OKXSol//x2MaV/BlKYsqeD99+GL9x13EXv0VKb3hAF1RT5Aefk2og6ApU3aotPB\nFt8Xtvi+qF24CJBlhJUWonx/GgyZGcpb1kl4frIF+GSL4262LtH1JZ44AGLiwDY5Z5zcT2VlJXbv\n/j/MnHlPm32PNWt+i+TkkRg9elybfQ87ljZpmyAAvXrBHNQJ5pRZyudkGbr8844CN2RmwJhxEh67\nPoPHrs8cd7WFR9SXeMJAiIkDIHWJ5g86Oxj7ftpNS1uSJOh07vc6MCxtcj+CACm6KyzRXWGZdqfj\n07qLhTBknmwwkWfA4z9fwOM/XziOkYKCHAVuf7N17wm44V9edzVokE+znz9+vMolxzfVcD9tvV4P\nLy9vREVF4ttvc/Dqq39Gaurj2LBhEwBlL+3a2hosWPALFBTk47XXXkFZWSk8PT2Rmvocunbtds3v\nc/ToYXz44fsoKSnG4sVPIClphFP5rhdLm24ZUkQkLBMnwzJxsuNzwpUrMGTVl7gh82T962va7+fr\nBzEh0bE+LiYOhC02DjDwr8etoOF+2unpx5Ga+gTWrn0VRqPfTe+l3VBh4U/429/eRH7+eSxd+hg2\nbdoGo9Ho8ufDP5V0S5NDQmAdMw7WMfVrjUJ5GQzZWfVTeVYGjIcPwXTo6/r7eXlB7NvfsT4uJg6A\n2DseMJnUeBq3FGcn5Bs9vjV9+/ZDVFRUi5exO7uXdkPjxk0EAHTpEo2oqM748ccfmt1c6maxtKnD\nkf0DHOeCO1RVwZCT3WAiz4AhIx3G40fr72c0QozvpxR4/0Rg2CAIIZ2VnQ65Tu42PD09Hbf1ej1s\ntvrXibRYzAAAWZbg5+fveLUbZzSd2K81wd8sljYRAPj4OPZUcTCbYfgup9FZK4Zvs2HMPOk4JBSA\n5B8AW2wsbDFxsMX1glj33tajJ+Dh0f7PhRrx9vZGdd3OlE33xwsKCkZpaQnKy8vh6emJtLSvMGxY\nEry9fdCpUxT27v1Pq3tp2+3d+x9MnjwNFy4U4MKFghbXv28GS5voWjw8IA64DeKA2+o/Z7VCn3sa\nhqwM+F/4EeaMbOjP5MKQlQnjieON7i7rdJC6doMYG+codFtsHMTYXsqLSXA6bxcN99M2mTwQHBzs\n+Jor9tK2i47uhsWLF6GkpBhPPbWyTdazAW7Nqipmco4WMwFNcokidOd+hOFMLvS5udCfyVXKPS8X\nustFV93XMZ3H1he5LTbupqdzLf5aMZNzuDUrUXsyGCD1jIGlZwzQ4OwVABBKS6DPy4U+LxeGuvf6\nvNOtT+f2Iq9bcuF0TgBLm6jNyYFBEAcPgTh4CMwNvyCK0J/7oa7E86DPO11X7KeVc8sbnF8O1E3n\nccpSixjXS1lyccF0Ts7bsGEd9u79DwRBgCzLEAQBY8dOwNy5C9otA5dHVMRMztFiJqBtc101neee\nVpZczv4XgtXa6FjHdB7XCx59eqEyJBK26GhIXaJh69IVcmioqhO6Fn//tJrJGZy0iTTIqem8bu3c\nUFfoHrs/B3Z/Dt+mj+XlBVvnLkqJR3eFFN0VtrpCl6KjIUV2AvT6dnx2dDNY2kTuxGCArWcsbD1j\ngTumNPqSUFKM0MorKMv8Dvr8c9Dln4f+/Pm69z/CkJfb7EPKBgOkqM6wdbFP59GOYpeio2HrHM3l\nFw1haRPdIuSgYKBXN1iir3FaWmUl9PnnlUI/fx76/PPQ5Z9zFLvx0NcQrrFaaguPUAo8uiukLg0K\nvW5al32d+6893TyWNlFH4esLW5942PrEN/91sxm6CwV1ZX4e+vPn6m+fOwdDxkkYjx9r9q5SYKBS\n4F2i69bT64sdiX0A2YNLMC7C0iYihYcHpB49IfXo2fzXbTboLhbWTen1yy/224b/5kHIzmz2rqF6\nPaTQMEjhEZAiIhq/D49s9DG8vdvwSbo/ljYROUevhxTVGVJUZ4hDh139dVmGUFzcYPlFKXPv4iKI\n5wuUrXPP5ELIymjx20h+/pDCwyFFRCrvHcVu/1wEpIhIyMHBHXJLXZY2EbmGIEAOCYEYEgI0uPTf\nO8wPpQ1OrxMqK6C7dBG6ixfr3hdCd+lS3fv6z+v/e+aaa+xA3Q9Qw8KbTO0RjlJvWPJosEmUu2Np\nE1G7kn39YPP1U86AaYnVCt2Vy1eVeePCvwjD96cgZKS3+FBSQGD91B4RAXTtAm9PX0jBIZCCgyEH\nBUMKCoYcEgIpKFjTJc/SJiJtMhohRXZSziNviSxDqChvMq03md7r3gy5px13a/71cOoe0ttbKfSg\nukIPaVDswcH1X6u7LQcHQ/bxbZeLmFjaROTeBAGyfwBs/gHKKw61xGKB7nIRQsQqlJ45D11JMYSS\nYuiuXGl0Wygpga6kGIYzeRCqnXsRBtlobDSty0H1hS4FBSsTfXDj4pcDAq97XZ6lTUQdh8kEKaoz\nEOYHa9dezt3HbFYKvbgYuuIrSrHbbxcX15e9/eOfLsBwKseph5Z1OsiBgZCCQ4AG/wtoCUubiKgl\nHh7KEk1kJ9icvY8oQigtVQq9boq/ZvHX3XYWS5uIyNUMBsihobCFhgJOvkxkmJMP3fFOciQicmMs\nbSIiN8LSJiJyIyxtIiI30uoPIi0WCx588EFYrVbYbDZMmjQJixcvbo9sRETURKulbTKZsGHDBnh5\necFms+H+++/HqFGjkJiY2B75iIioAaeWR7y8vAAoU7coim0aiIiIrs2p0pYkCSkpKUhOTkZycjKn\nbCIilTh1cY1Op8O2bdtQWVmJX/3qV8jLy0NsbAs7dHXvjmDp6i0Vi49nN3t48KD+zX7epcfrhKsy\nqZoHuCqT6nmaZNJEngaZNJPH7tyPmsrD42+N41tzXVdE+vr6YsiQITh48GDLpQ1Ar7t6t6trvkR8\nM8e2xfFNM6mdp2kmLeRpmEkreeyZtJSnxfuolMd+/FX3UznPVffVQJ5GH2skj7MEWW5hl3EAxcXF\nMBqN8PPzQ21tLRYuXIhFixZh9OjRLT5wUYNNz7UgLMyPmZzATM7TYi5mco5WMzmj1Um7qKgIzzzz\nDCRJgiRJmDp1aquFTUREbaPV0u7duze2bt3aHlmIiKgVvCKSiMiNsLSJiNwIS5uIyI2wtImI3AhL\nm4jIjbC0iYjcCEubiMiNsLSJiNwIS5uIyI2wtImI3AhLm4jIjbC0iYjcCEubiMiNsLSJiNwIS5uI\nyI2wtImI3AhLm4jIjbC0iYjcCEubiMiNsLSJiNwIS5uIyI2wtImI3AhLm4jIjbC0iYjcCEubiMiN\nsLSJiNwIS5uIyI2wtImI3AhLm4jIjbC0iYjcCEubiMiNsLSJiNwIS5uIyI2wtImI3AhLm4jIjbC0\niYjciKG1AwoLC5GamorLly9Dr9dj9uzZmDdvXntkIyKiJlotbb1ej2effRbx8fGoqqrC3XffjeTk\nZMTExLRHPiIiaqDV5ZGwsDDEx8cDAHx8fBATE4NLly61eTAiIrrada1p5+fn47vvvkNiYmJb5SEi\noha0ujxiV1VVhaVLl2LlypXw8fFp8dju3QFJuvqY48ermj1+0KDmH8+Vx+t0V2dSMw+AqzKpnadp\nJi3kaZhJK3nszp1r9tOq5eHxt8bxrXGqtEVRxNKlS3HXXXdhwoQJTj2wTnf1EB8W5neNY5t/DFcf\n3zST2nmaZtJCnoaZtJLHnklLeVq6j1p57Mc3vZ/aeZre1kKehh9rJY+zBFmW5dYOSk1NRVBQEJ59\n9lmnH7ioqOKGArWVsDA/ZnICMzlPi7mYyTlazeSMVte0jx8/ju3bt+Obb75BSkoKZs6ciQMHDtx0\nQCIiun6tLo8MGjQIp06dao8sRETUCl4RSUTkRljaRERuhKVNRORGWNpERG6EpU1E5EacviKSiIiu\nnyQBZWVASYmA4uL6t5ISwfG5khIBn37q3OOxtImInGSxoFHRNizgpkWsfAyUlgqQJMFlGVjaRNTh\nyDJQWYmrCvdaU7D981VVzpWvXi8jKEhGaKiMuDgJQUEygoOVt6Ag1L2XHe+DgmQAvk49NkubiG4Z\nNTVAUZGAixcFXLqkw6VLyu2iIuVj5fMCLl8GLBbnLhv39lZKtUcPqVHR1pdw4/INCZHh5wcIrhuu\nG2FpE5GmSZIyEV+6JDhK2F7IDd8uXtShvLzlpvTwkBERIWPgQMDfX2yxfO23vbza6Yk6iaVNRKqo\nqUGjwm1cwvVTcVGRAFFsuYxDQiR07izhtttkhIfLiIiQEB5uvy3X3Zbg769MwMqGUTXt9Exdi6VN\nRC4likBhoYD8fB3OnxdQUQGcPevRYEpWStn5qVhCeLjUqIAblnJYmAyjsZ2enAawtInoutTUABcu\nCDh/Xof8fB3y8+23laK+cEGAzda0kE2OW9eaiusnYuVzbbku7M5Y2kTUSFkZGpVw49sCLl9u/po8\nQZARGSnjZz+T0KWL/U1GfLwnPD2rEBGhnE3RkabitsDSJupAZFlZR25Ywsq0XH+7oqL58dZolNG5\ns4z4eBFdusjo0kVCdLTkuB0VJcNkuvp+YWGeKCqS2viZdRwsbaJbiNUKnDvXtJDrlzIKCgSYzc2X\nso+P3KiEu3SxfywhOlpZtmjppdeofbC0idyMLAM//SQgL0+H3FwdzpzRIS9PeV9QAEhS8xdphIZK\niI9X1pPrC7m+mAMDuYbsDljaRBpVXQ2cOaOUccNyzsvTobr66naNiJCQlARERFgbTczR0TI6d5bg\n7a3CkyCXY2kTqcg+Nefm1heyfWrOz796LcLTU0bPnhJiYxu/xcQoZ1so5x/XqvBMqL2wtInagX1q\nbljK9um5uak5MlLCyJEiYmIal3OXLlxX7uhY2kQuIsvK+csNJ2b7W0HBtafmuDjJUc72277O7R1E\nHRBLm+g6WSzA99/rcOkScOKEqdH03NzU3KmTMjU3XMqIi5PQuTOnZrp+LG2iFtTUADk5OmRm6pGV\npbw/dUoHq9Vezh4AAC+va681c2omV2JpE9WprASys/XIzKwv6dOndY0uyfbwkJGQIKF/fxsGDzYh\nIqIasbGcmqn9sLSpQyopAbKylIJW3utx5kzj1vX2ljF4sA2JiRISEpT3cXGS4zLssDATiopsKqSn\njoylTbe8S5cEx9KGvaTPnWtc0AEBMkaOFJGQICEx0YbERBt69uT0TNrD0qZbhv3sjYblnJmpQ2Fh\n4+YNDZUwbpyIxESbo6S7dpV5NSC5BZY2uSVZBn74QXAUs30N+sqVxgUdFSVh8mRrgwlaQmQkC5rc\nF0ubNM9mA06f1jUq56ws/VWb6HfrJiEpyepYg05IkBAWJquUmqhtsLRJc8xmID1dj6+/1iMtTY/j\nx4Hqah/H1wVBeYXrCRPqp+f+/W0IDFQxNFE7YWmT6mprgRMnlIJOS9Pj2DE9amvrp+h+/YCEBKtj\nDbpfPxvPfaYOi6VN7a6mBjh+vL6kjx/XO/Z4FgQZfftKSE62YfhwG4YPF9G7NzdBIrJjaVObq64G\njh2rL+lQSYIFAAANpklEQVQTJ/SwWOpLun9/CUlJNiQl2TBsmIigIJUDE2kYS5tcrqrq6pK2X/at\n09WXdHKyiKFDuRZNdD1Y2nTTKiuBo0ftJW1AeroOolhf0omJ9klaKemAAJUDE7kxljZdt8pK4MgR\n+9kdBmRk1Je0Xi9jwAAJw4crk/SQITb4+6scmOgW0mppr1y5Evv27UNISAi2b9/eHplIYyoqgMOH\n6yfpjIz6TZT0ehkDB0pIShKRnGzDkCE8s4OoLbVa2nfffTfmzp2L1NTU9shDGlBeDnzzjVLQaWnK\nFYeSpJS0wSDjttskJCeLGD6cJU3U3lot7cGDB6OgoKA9spBKZBk4eVKHXbsMOHgQSE/3dZS00ajs\ndGc/u+P2223w8WnlAYmozXBNu4OyWoFDh/TYtcuAXbsMuHBB2bPDaARuv92G5GSlpAcPtvFVvIk0\npM1KOyzMr60e+oZ19EzV1cAXXwBbtwLbtyt7SgNAYCAwdy6QkgJMmgT4+BigtX/Ptfh7B2gzFzM5\nR4uZnNFmfzOLiira6qFvSFiYX4fMVFICfPGFATt3GrBvnwE1NcqyR2SkhAULREydKiIpyebY2N/H\np2P+Ot0ILeZiJudoNZMznCptWeZOae7kwgUBu3YpRZ2Wpnec6REba8PUqUpRDxwocYN/IjfUammv\nWLEChw8fRmlpKcaMGYMlS5Zg1qxZ7ZGNrkNurg47dypFnZ6ud3z+ttuUop4yRUSvXpKKCYnIFVot\n7bVr17ZHDrpOkqSc8WEv6rw8paj1euVls+xFHRXF/yUR3Uq09dMmapHVCqSl1Z/x8dNPyvqGl5eM\nKVOsmDpVxB13cMMlolsZS1vjqquBvXuVaXr3bgNKS5X16cBAGffeqxT1mDEiT8sj6iBY2hpUUgJ8\n/rlS1Pv315/xERUlYdYspaiHDas/44OIOg6WtkYUFAiOZY+GZ3z06lX/g8SBAyW+IC1RB8fSVtGp\nU8C775qwc6cBJ0/Wn/Hxs5/ZT82zIjaWP0gkonos7XZWWgp89JER775rxKlTAOABg0HGqFH1Z3x0\n6sSiJqLmsbTbgSwDR4/qsGGDCZ9+akBtrQCjUcbMmcCECTWYOFHkq7cQkVNY2m2orEyZqjduNOLU\nKWX5o0cPCXPnmjFnjoi+fX1RVCSqnJKI3AlL28VkGTh2TIeNG0345BPlzA+jUcZdd1kxd64VI0bY\nePk4Ed0wlraLlJUBmzcbsWFD/VTdrZuEuXMtuP9+K8LCuE5NRDePpX0TZBk4cUJZq962TZmqDQYZ\nM2YoU/XIkZyqici1WNo3oLy8fqrOyWk8Vd93nxXh4ZyqiahtsLSdJMtAeroOGzYYsW2bEdXVylQ9\nfboV8+ZZMWoUp2oianss7VZUVChT9caNRmRnK1N11671U3VEBKdqImo/LO1m2F/oduNGI7ZsUaZq\nvV7GtGnKVD16NKdqIlIHS7uBysr6qTorq36q/vnPlTNAOFUTkdpY2gAyMpS16o8/rp+qp05Vpuox\nYzhVE5F2dNjSrqwEtmxRzgDJzFSm6i5dJCxdasEDD1gRGcmpmoi0p8OVdmamDu+8o6xVV1UpU/Xk\nyVbMn69M1Xp9649BRKSWDlHalZXAtm3A3//u7dgCtXNnCYsXK1M1d9UjIndxS5d2eTnwxhsmvP66\nCeXlgE6nw+TJylr12LGcqonI/dySpV1ZCbz1lgl/+5sJpaUCQkIkrF4tICWliq9OTkRu7ZYq7epq\nYN06I/72NxOuXNEhMFDGc8+ZsXChBT16+KGoiIVNRO7tlijt2lpgwwYj/vxnE4qKdPD3l5Gaasai\nRRb4+6udjojIddy6tM1m4N13lbIuLNTBx0fG8uVmPPaYha8EQ0S3JLcsbasVeP99I/70JxMKCnTw\n9paxZIkZv/qVFSEhXAIholuXW5W2KAIffWTA2rUeOHdOB09PGY89ZsGSJRa+yAARdQhuUdo2G7Bl\niwF//KMHzp7VwWSS8cgjFixbZuF+IETUoWi6tCUJ+PRTA1591YTcXD2MRhkPPWTB449beOoeEXVI\nmixtSQJ27lTK+tQpPfR6GT//uVLWXbuyrImo49JUacsy8MUXerz8sgeys/XQ6WTMmWPF8uVm9OjB\nsiYi0kRpyzKwd69S1unpegiCjLvvtuLJJ82IjWVZExHZqVrasgwcPKiU9dGjykYgM2ZY8eSTFvTp\nI6kZjYhIk1Qr7UOH9HjpJRMOHVIiTJlixVNPWdC/P8uaiOha2r20jx7V4aWXPHDwoPKtJ04UkZpq\nxoABLGsiotY49UJaBw4cwOTJkzFp0iS88cYbN/SNTpzQ4b77vDBtmg8OHjRgzBgRu3ZV4b33aljY\nREROanXSliQJL7zwAtavX4/w8HDcc889GD9+PGJiYpz6BllZOrzyigc+/1z5ViNGiEhNtWDYMNvN\nJSci6oBaLe3MzEx069YNnTt3BgBMmzYNe/bsabW0c3J0ePVVE3bsMAIAhg4V8fTTFowYwbImIrpR\nrZb2xYsX0alTJ8fHERERyMrKavE+990HfPihN2RZwKBBNjz9tBmjR9sgCDcfmIioI2u1tGX5+s+T\n3rQJGDBAwtNPmzF+PMuaiMhVWi3tyMhIXLhwwfHxxYsXER4e3uJ9lJ7XA/C+yXiuFRbmp3aEqzCT\nc7SYCdBmLmZyjhYzOaPVs0cSEhJw7tw5FBQUwGKxYMeOHRg/fnx7ZCMioiZanbT1ej1WrVqFhx9+\nGLIs45577nH6zBEiInItQb6RRWsiIlKFUxfXEBGRNrC0iYjcCEubiMiNuHTDqAMHDmDNmjWQZRmz\nZs3CokWLXPnwN2TlypXYt28fQkJCsH37drXjAAAKCwuRmpqKy5cvQ6/XY/bs2Zg3b56qmSwWCx58\n8EFYrVbYbDZMmjQJixcvVjWTnSRJmDVrFiIiIvD666+rHQfjxo2Dr68vdDodDAYDNm/erHYkVFRU\n4LnnnkNubi50Oh3WrFmDAQMGqJrp7NmzeOKJJyAIAmRZxvnz57Fs2TLV/6yvX78emzdvhiAI6NWr\nF1588UWYTCZVM73zzjuOP0et9oHsIjabTZ4wYYKcn58vWywWecaMGXJeXp6rHv6GHT16VM7JyZGn\nT5+udhSHS5cuyTk5ObIsy3JlZaV8xx13aOLXqrq6WpZlWRZFUZ49e7ackZGhciLF22+/La9YsUJ+\n9NFH1Y4iy7Isjxs3Ti4tLVU7RiNPP/20vHnzZlmWZdlqtcoVFRUqJ2rMZrPJycnJ8oULF1TNUVhY\nKI8bN042m82yLMvysmXL5K1bt6qa6fTp0/L06dNls9ksi6IoP/TQQ/KPP/54zeNdtjzScI8So9Ho\n2KNEbYMHD4a/v7/aMRoJCwtDfHw8AMDHxwcxMTG4dOmSyqkALy8vAMrULYqiymkUhYWF2L9/P2bP\nnq12FAdZliFJ2tmZsrKyEseOHcOsWbMAAAaDAb6+viqnaiwtLQ1du3ZttCWGWiRJQk1NDURRRG1t\nbasXC7a1M2fOYODAgTCZTNDr9bj99tuxe/fuax7vstJubo8SLRSR1uXn5+O7775DYmKi2lEgSRJS\nUlKQnJyM5ORkTWRas2YNUlNTIWhoLwRBELBw4ULMmjULH374odpxkJ+fj6CgIDz77LOYOXMmVq1a\nhdraWrVjNbJz505MmzZN7RiIiIjAggULMGbMGIwaNQp+fn5ISkpSNVNcXByOHj2KsrIy1NTU4MCB\nA/jpp5+uebzLSlvm6d7XraqqCkuXLsXKlSvh4+OjdhzodDps27YNBw4cQEZGBvLy8lTNs2/fPoSG\nhiI+Pl5Tf74++OADbNmyBW+++Sbee+89HDt2TNU8oigiJycHDzzwALZu3QpPT88b3ve+LVitVnz5\n5ZeYMmWK2lFQXl6OPXv2YO/evTh48CCqq6tV/1lXTEwMfvGLX2DBggVYtGgR+vTpA4Ph2j9udFlp\n38geJR2ZKIpYunQp7rrrLkyYMEHtOI34+vpiyJAhOHjwoKo5Tpw4gS+//BLjx4/HihUrcPjwYaSm\npqqaCVCWtwAgODgYEydObHXXy7YWGRmJyMhIJCQkAAAmTZqEnJwcVTM1dODAAfTr1w/BwcFqR0Fa\nWhqio6MRGBgIvV6PiRMnIj09Xe1YmDVrFrZs2YKNGzciICAA3bp1u+axLittLe9RoqUpzW7lypWI\njY3F/Pnz1Y4CACguLkZFRQUAoLa2FocOHULPnj1VzbR8+XLs27cPe/bswWuvvYahQ4filVdeUTVT\nTU0NqqqqAADV1dX46quvEBcXp2qm0NBQdOrUCWfPngUAfPPNN5raamLHjh2YPn262jEAAFFRUcjI\nyIDZbIYsy5r5tSouLgYAXLhwAbt3727x18tlp/xpdY8S+4RWWlqKMWPGYMmSJY4f2Kjl+PHj2L59\nO3r16oWUlBQIgoAnnngCo0aNUi1TUVERnnnmGUiSBEmSMHXqVIwePVq1PFp1+fJlLF68GIIgwGaz\n4c4778SIESPUjoXnn38eTz75JERRRHR0NF588UW1IwFQBoC0tDT87ne/UzsKACAxMRGTJk1CSkoK\nDAYD+vbti3vvvVftWFiyZAnKyspgMBjwm9/8Bn5+196BkHuPEBG5EV4RSUTkRljaRERuhKVNRORG\nWNpERG6EpU1E5EZY2kREboSlTUTkRljaRERu5P8D+7Wym3BFpegAAAAASUVORK5CYII=\n",
+            "text/plain": [
+              "\u003cmatplotlib.figure.Figure at 0x7f5be4b8ec50\u003e"
+            ]
+          },
+          "metadata": {
+            "tags": []
+          },
+          "output_type": "display_data"
+        }
+      ],
+      "source": [
+        "model = Model()\n",
+        "\n",
+        "# Collect the history of W-values and b-values to plot later\n",
+        "Ws, bs = [], []\n",
+        "epochs = range(10)\n",
+        "for epoch in epochs:\n",
+        "  Ws.append(model.W.numpy())\n",
+        "  bs.append(model.b.numpy())\n",
+        "  current_loss = loss(model(inputs), outputs)\n",
+        "\n",
+        "  train(model, inputs, outputs, learning_rate=0.1)\n",
+        "  print('Epoch %2d: W=%1.2f b=%1.2f, loss=%2.5f' %\n",
+        "        (epoch, Ws[-1], bs[-1], current_loss))\n",
+        "\n",
+        "# Let's plot it all\n",
+        "plt.plot(epochs, Ws, 'r',\n",
+        "         epochs, bs, 'b')\n",
+        "plt.plot([TRUE_W] * len(epochs), 'r--',\n",
+        "         [TRUE_b] * len(epochs), 'b--')\n",
+        "plt.legend(['W', 'b', 'true W', 'true_b'])\n",
+        "plt.show()\n",
+        "  "
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "vPnIVuaSJwWz"
+      },
+      "source": [
+        "## Next Steps\n",
+        "\n",
+        "In this tutorial we covered `Variable`s and built and trained a simple linear model using the TensorFlow primitives discussed so far.\n",
+        "\n",
+        "In theory, this is pretty much all you need to use TensorFlow for your machine learning research.\n",
+        "In practice, particularly for neural networks, the higher level APIs like `tf.keras` will be much more convenient since it provides higher level building blocks (called \"layers\"), utilities to save and restore state, a suite of loss functions, a suite of optimization strategies etc. \n",
+        "\n",
+        "The [next tutorial](TODO) will cover these higher level APIs."
+      ]
+    }
+  ],
+  "metadata": {
+    "colab": {
+      "collapsed_sections": [],
+      "default_view": {},
+      "name": "Training Models",
+      "provenance": [],
+      "version": "0.3.2",
+      "views": {}
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 0
+}
diff --git a/tensorflow/contrib/eager/python/examples/notebooks/4_high_level.ipynb b/tensorflow/contrib/eager/python/examples/notebooks/4_high_level.ipynb
new file mode 100644
index 0000000000000000000000000000000000000000..5749f22ac58e0a012ed7e3fec4dfe2913d3f8273
--- /dev/null
+++ b/tensorflow/contrib/eager/python/examples/notebooks/4_high_level.ipynb
@@ -0,0 +1,551 @@
+{
+  "cells": [
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "pwX7Fii1rwsJ"
+      },
+      "outputs": [],
+      "source": [
+        "import tensorflow as tf\n",
+        "tf.enable_eager_execution()\n",
+        "tfe = tf.contrib.eager\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "UEu3q4jmpKVT"
+      },
+      "source": [
+        "# High level API\n",
+        "\n",
+        "We recommend using `tf.keras` as a high-level API for building neural networks. That said, most TensorFlow APIs are usable with eager execution.\n",
+        "\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "zSFfVVjkrrsI"
+      },
+      "source": [
+        "## Layers: common sets of useful operations\n",
+        "\n",
+        "Most of the time when writing code for machine learning models you want to operate at a higher level of abstraction than individual operations and manipulation of individual variables.\n",
+        "\n",
+        "Many machine learning models are expressible as the composition and stacking of relatively simple layers, and TensorFlow provides both a set of many common layers as a well as easy ways for you to write your own application-specific layers either from scratch or as the composition of existing layers.\n",
+        "\n",
+        "TensorFlow includes the full [Keras](https://keras.io) API in the tf.keras package, and the Keras layers are very useful when building your own models.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "8PyXlPl-4TzQ"
+      },
+      "outputs": [],
+      "source": [
+        "# In the tf.keras.layers package, layers are objects. To construct a layer,\n",
+        "# simply construct the object. Most layers take as a first argument the number\n",
+        "# of output dimensions / channels.\n",
+        "layer = tf.keras.layers.Dense(100)\n",
+        "# The number of input dimensions is often unnecessary, as it can be inferred\n",
+        "# the first time the layer is used, but it can be provided if you want to \n",
+        "# specify it manually, which is useful in some complex models.\n",
+        "layer = tf.keras.layers.Dense(10, input_shape=(None, 5))"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "Fn69xxPO5Psr"
+      },
+      "source": [
+        "The full list of pre-existing layers can be seen in [the documentation](https://www.tensorflow.org/api_docs/python/tf/keras/layers). It includes Dense (a fully-connected layer),\n",
+        "Conv2D, LSTM, BatchNormalization, Dropout, and many others."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 3,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          },
+          "height": 204
+        },
+        "colab_type": "code",
+        "executionInfo": {
+          "elapsed": 244,
+          "status": "ok",
+          "timestamp": 1527783641557,
+          "user": {
+            "displayName": "",
+            "photoUrl": "",
+            "userId": ""
+          },
+          "user_tz": 420
+        },
+        "id": "E3XKNknP5Mhb",
+        "outputId": "c5d52434-d980-4488-efa7-5660819d0207"
+      },
+      "outputs": [
+        {
+          "data": {
+            "text/plain": [
+              "\u003ctf.Tensor: id=30, shape=(10, 10), dtype=float32, numpy=\n",
+              "array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],\n",
+              "       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],\n",
+              "       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],\n",
+              "       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],\n",
+              "       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],\n",
+              "       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],\n",
+              "       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],\n",
+              "       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],\n",
+              "       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],\n",
+              "       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]], dtype=float32)\u003e"
+            ]
+          },
+          "execution_count": 3,
+          "metadata": {
+            "tags": []
+          },
+          "output_type": "execute_result"
+        }
+      ],
+      "source": [
+        "# To use a layer, simply call it.\n",
+        "layer(tf.zeros([10, 5]))"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 4,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          },
+          "height": 221
+        },
+        "colab_type": "code",
+        "executionInfo": {
+          "elapsed": 320,
+          "status": "ok",
+          "timestamp": 1527783642457,
+          "user": {
+            "displayName": "",
+            "photoUrl": "",
+            "userId": ""
+          },
+          "user_tz": 420
+        },
+        "id": "Wt_Nsv-L5t2s",
+        "outputId": "f0d96dce-0128-4080-bfe2-0ee6fbc0ad90"
+      },
+      "outputs": [
+        {
+          "data": {
+            "text/plain": [
+              "[\u003ctf.Variable 'dense_1/kernel:0' shape=(5, 10) dtype=float32, numpy=\n",
+              " array([[ 0.43788117, -0.62099844, -0.30525017, -0.59352523,  0.1783089 ,\n",
+              "          0.47078604, -0.23620895, -0.30482283,  0.01366901, -0.1288507 ],\n",
+              "        [ 0.18407935, -0.56550485,  0.54180616, -0.42254075,  0.3702994 ,\n",
+              "          0.36705834, -0.29678228,  0.36660975,  0.36717761,  0.46269661],\n",
+              "        [ 0.1709305 , -0.11529458,  0.32710236,  0.46300393, -0.62802851,\n",
+              "          0.51641601,  0.39624029,  0.26918125, -0.25196898,  0.21353298],\n",
+              "        [ 0.35752094,  0.44161648,  0.61500639, -0.12653333,  0.41629118,\n",
+              "          0.36193585,  0.066082  , -0.59253877,  0.47318751,  0.17115968],\n",
+              "        [-0.22554061, -0.17727301,  0.5525015 ,  0.3678053 , -0.00454676,\n",
+              "          0.24066836, -0.53640735,  0.13792562, -0.10727292,  0.59708995]], dtype=float32)\u003e,\n",
+              " \u003ctf.Variable 'dense_1/bias:0' shape=(10,) dtype=float32, numpy=array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.], dtype=float32)\u003e]"
+            ]
+          },
+          "execution_count": 4,
+          "metadata": {
+            "tags": []
+          },
+          "output_type": "execute_result"
+        }
+      ],
+      "source": [
+        "# Layers have many useful methods. For example, you can inspect all variables\n",
+        "# in a layer by calling layer.variables. In this case a fully-connected layer\n",
+        "# will have variables for weights and biases.\n",
+        "layer.variables"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 5,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          },
+          "height": 221
+        },
+        "colab_type": "code",
+        "executionInfo": {
+          "elapsed": 226,
+          "status": "ok",
+          "timestamp": 1527783643252,
+          "user": {
+            "displayName": "",
+            "photoUrl": "",
+            "userId": ""
+          },
+          "user_tz": 420
+        },
+        "id": "6ilvKjz8_4MQ",
+        "outputId": "f647fced-c2d7-41a3-c237-242036784665"
+      },
+      "outputs": [
+        {
+          "data": {
+            "text/plain": [
+              "(\u003ctf.Variable 'dense_1/kernel:0' shape=(5, 10) dtype=float32, numpy=\n",
+              " array([[ 0.43788117, -0.62099844, -0.30525017, -0.59352523,  0.1783089 ,\n",
+              "          0.47078604, -0.23620895, -0.30482283,  0.01366901, -0.1288507 ],\n",
+              "        [ 0.18407935, -0.56550485,  0.54180616, -0.42254075,  0.3702994 ,\n",
+              "          0.36705834, -0.29678228,  0.36660975,  0.36717761,  0.46269661],\n",
+              "        [ 0.1709305 , -0.11529458,  0.32710236,  0.46300393, -0.62802851,\n",
+              "          0.51641601,  0.39624029,  0.26918125, -0.25196898,  0.21353298],\n",
+              "        [ 0.35752094,  0.44161648,  0.61500639, -0.12653333,  0.41629118,\n",
+              "          0.36193585,  0.066082  , -0.59253877,  0.47318751,  0.17115968],\n",
+              "        [-0.22554061, -0.17727301,  0.5525015 ,  0.3678053 , -0.00454676,\n",
+              "          0.24066836, -0.53640735,  0.13792562, -0.10727292,  0.59708995]], dtype=float32)\u003e,\n",
+              " \u003ctf.Variable 'dense_1/bias:0' shape=(10,) dtype=float32, numpy=array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.], dtype=float32)\u003e)"
+            ]
+          },
+          "execution_count": 5,
+          "metadata": {
+            "tags": []
+          },
+          "output_type": "execute_result"
+        }
+      ],
+      "source": [
+        "# The variables are also accessible through nice accessors\n",
+        "layer.kernel, layer.bias"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "O0kDbE54-5VS"
+      },
+      "source": [
+        "## Implementing custom layers\n",
+        "The best way to implement your own layer is extending the tf.keras.Layer class and implementing:\n",
+        "  *  `__init__` , where you can do all input-independent initialization\n",
+        "  * `build`, where you know the shapes of the input tensors and can do the rest of the initialization\n",
+        "  * `call`, where you do the forward computation\n",
+        "\n",
+        "Note that you don't have to wait until `build` is called to create your variables, you can also create them in `__init__`. However, the advantage of creating them in `build` is that it enables late variable creation based on the shape of the inputs the layer will operate on. On the other hand, creating variables in `__init__` would mean that shapes required to create the variables will need to be explicitly specified."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 7,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          },
+          "height": 391
+        },
+        "colab_type": "code",
+        "executionInfo": {
+          "elapsed": 251,
+          "status": "ok",
+          "timestamp": 1527783661512,
+          "user": {
+            "displayName": "",
+            "photoUrl": "",
+            "userId": ""
+          },
+          "user_tz": 420
+        },
+        "id": "5Byl3n1k5kIy",
+        "outputId": "6e7f9285-649a-4132-82ce-73ea92f15862"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "tf.Tensor(\n",
+            "[[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]\n",
+            " [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]\n",
+            " [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]\n",
+            " [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]\n",
+            " [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]\n",
+            " [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]\n",
+            " [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]\n",
+            " [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]\n",
+            " [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]\n",
+            " [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]], shape=(10, 10), dtype=float32)\n",
+            "[\u003ctf.Variable 'my_dense_layer_1/kernel:0' shape=(5, 10) dtype=float32, numpy=\n",
+            "array([[-0.4011991 ,  0.22458655, -0.33237562, -0.25117266,  0.33528614,\n",
+            "        -0.01392961,  0.58580834, -0.16346583,  0.28465688, -0.47191954],\n",
+            "       [-0.52922136,  0.22416979, -0.58209574, -0.60914612,  0.05226624,\n",
+            "        -0.18325993,  0.5591442 , -0.24718609,  0.37148207,  0.40475875],\n",
+            "       [ 0.16912812, -0.47618777, -0.38989353,  0.30105609, -0.08085585,\n",
+            "         0.44758242,  0.545829  ,  0.51421839,  0.11063248,  0.20159996],\n",
+            "       [ 0.34073615, -0.59835428,  0.06498981, -0.44489855, -0.34302285,\n",
+            "         0.20969599,  0.35527444, -0.03173476, -0.22227573,  0.09303057],\n",
+            "       [ 0.41764337, -0.06435019, -0.52509922, -0.39957345,  0.56811184,\n",
+            "         0.23481232, -0.61666459,  0.31144124, -0.11532354, -0.42421889]], dtype=float32)\u003e]\n"
+          ]
+        }
+      ],
+      "source": [
+        "class MyDenseLayer(tf.keras.layers.Layer):\n",
+        "  def __init__(self, num_outputs):\n",
+        "    super(MyDenseLayer, self).__init__()\n",
+        "    self.num_outputs = num_outputs\n",
+        "    \n",
+        "  def build(self, input_shape):\n",
+        "    self.kernel = self.add_variable(\"kernel\", \n",
+        "                                    shape=[input_shape[-1].value, \n",
+        "                                           self.num_outputs])\n",
+        "    \n",
+        "  def call(self, input):\n",
+        "    return tf.matmul(input, self.kernel)\n",
+        "  \n",
+        "layer = MyDenseLayer(10)\n",
+        "print(layer(tf.zeros([10, 5])))\n",
+        "print(layer.variables)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "tk8E2vY0-z4Z"
+      },
+      "source": [
+        "Note that you don't have to wait until `build` is called to create your variables, you can also create them in `__init__`.\n",
+        "\n",
+        "Overall code is easier to read and maintain if it uses standard layers whenever possible, as other readers will be familiar with the behavior of standard layers. If you want to use a layer which is not present in tf.keras.layers or tf.contrib.layers, consider filing a [github issue](http://github.com/tensorflow/tensorflow/issues/new) or, even better, sending us a pull request!"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "Qhg4KlbKrs3G"
+      },
+      "source": [
+        "## Models: composing layers\n",
+        "\n",
+        "Many interesting layer-like things in machine learning models are implemented by composing existing layers. For example, each residual block in a resnet is a composition of convolutions, batch normalizations, and a shortcut.\n",
+        "\n",
+        "The main class used when creating a layer-like thing which contains other layers is tf.keras.Model. Implementing one is done by inheriting from tf.keras.Model."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 9,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          },
+          "height": 190
+        },
+        "colab_type": "code",
+        "executionInfo": {
+          "elapsed": 420,
+          "status": "ok",
+          "timestamp": 1527783698512,
+          "user": {
+            "displayName": "",
+            "photoUrl": "",
+            "userId": ""
+          },
+          "user_tz": 420
+        },
+        "id": "N30DTXiRASlb",
+        "outputId": "a8b23a8e-5cf9-4bbf-f93b-6c763d74e2b3"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "tf.Tensor(\n",
+            "[[[[ 0.  0.  0.]\n",
+            "   [ 0.  0.  0.]\n",
+            "   [ 0.  0.  0.]]\n",
+            "\n",
+            "  [[ 0.  0.  0.]\n",
+            "   [ 0.  0.  0.]\n",
+            "   [ 0.  0.  0.]]]], shape=(1, 2, 3, 3), dtype=float32)\n",
+            "['resnet_identity_block_1/conv2d_3/kernel:0', 'resnet_identity_block_1/conv2d_3/bias:0', 'resnet_identity_block_1/batch_normalization_3/gamma:0', 'resnet_identity_block_1/batch_normalization_3/beta:0', 'resnet_identity_block_1/conv2d_4/kernel:0', 'resnet_identity_block_1/conv2d_4/bias:0', 'resnet_identity_block_1/batch_normalization_4/gamma:0', 'resnet_identity_block_1/batch_normalization_4/beta:0', 'resnet_identity_block_1/conv2d_5/kernel:0', 'resnet_identity_block_1/conv2d_5/bias:0', 'resnet_identity_block_1/batch_normalization_5/gamma:0', 'resnet_identity_block_1/batch_normalization_5/beta:0', 'resnet_identity_block_1/batch_normalization_3/moving_mean:0', 'resnet_identity_block_1/batch_normalization_3/moving_variance:0', 'resnet_identity_block_1/batch_normalization_4/moving_mean:0', 'resnet_identity_block_1/batch_normalization_4/moving_variance:0', 'resnet_identity_block_1/batch_normalization_5/moving_mean:0', 'resnet_identity_block_1/batch_normalization_5/moving_variance:0']\n"
+          ]
+        }
+      ],
+      "source": [
+        "class ResnetIdentityBlock(tf.keras.Model):\n",
+        "  def __init__(self, kernel_size, filters):\n",
+        "    super(ResnetIdentityBlock, self).__init__(name='')\n",
+        "    filters1, filters2, filters3 = filters\n",
+        "\n",
+        "    self.conv2a = tf.keras.layers.Conv2D(filters1, (1, 1))\n",
+        "    self.bn2a = tf.keras.layers.BatchNormalization()\n",
+        "\n",
+        "    self.conv2b = tf.keras.layers.Conv2D(filters2, kernel_size, padding='same')\n",
+        "    self.bn2b = tf.keras.layers.BatchNormalization()\n",
+        "\n",
+        "    self.conv2c = tf.keras.layers.Conv2D(filters3, (1, 1))\n",
+        "    self.bn2c = tf.keras.layers.BatchNormalization()\n",
+        "\n",
+        "  def call(self, input_tensor, training=False):\n",
+        "    x = self.conv2a(input_tensor)\n",
+        "    x = self.bn2a(x, training=training)\n",
+        "    x = tf.nn.relu(x)\n",
+        "\n",
+        "    x = self.conv2b(x)\n",
+        "    x = self.bn2b(x, training=training)\n",
+        "    x = tf.nn.relu(x)\n",
+        "\n",
+        "    x = self.conv2c(x)\n",
+        "    x = self.bn2c(x, training=training)\n",
+        "\n",
+        "    x += input_tensor\n",
+        "    return tf.nn.relu(x)\n",
+        "\n",
+        "    \n",
+        "block = ResnetIdentityBlock(1, [1, 2, 3])\n",
+        "print(block(tf.zeros([1, 2, 3, 3])))\n",
+        "print([x.name for x in block.variables])"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "wYfucVw65PMj"
+      },
+      "source": [
+        "Much of the time, however, models which compose many layers simply call one layer after the other. This can be done in very little code using tf.keras.Sequential"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          },
+          "base_uri": "https://localhost:8080/",
+          "height": 153
+        },
+        "colab_type": "code",
+        "executionInfo": {
+          "elapsed": 361,
+          "status": "ok",
+          "timestamp": 1526674830777,
+          "user": {
+            "displayName": "Alexandre Passos",
+            "photoUrl": "//lh4.googleusercontent.com/-kmTTWXEgAPw/AAAAAAAAAAI/AAAAAAAAAC0/q_DoOzKGwds/s50-c-k-no/photo.jpg",
+            "userId": "108023195365833072773"
+          },
+          "user_tz": 420
+        },
+        "id": "L9frk7Ur4uvJ",
+        "outputId": "882e9076-b6d9-4380-bb1e-7c6b57d54c39"
+      },
+      "outputs": [
+        {
+          "data": {
+            "text/plain": [
+              "\u003ctf.Tensor: id=1423, shape=(1, 2, 3, 3), dtype=float32, numpy=\n",
+              "array([[[[0., 0., 0.],\n",
+              "         [0., 0., 0.],\n",
+              "         [0., 0., 0.]],\n",
+              "\n",
+              "        [[0., 0., 0.],\n",
+              "         [0., 0., 0.],\n",
+              "         [0., 0., 0.]]]], dtype=float32)\u003e"
+            ]
+          },
+          "execution_count": 26,
+          "metadata": {
+            "tags": []
+          },
+          "output_type": "execute_result"
+        }
+      ],
+      "source": [
+        " my_seq = tf.keras.Sequential([tf.keras.layers.Conv2D(1, (1, 1)),\n",
+        "                               tf.keras.layers.BatchNormalization(),\n",
+        "                               tf.keras.layers.Conv2D(2, 1, \n",
+        "                                                      padding='same'),\n",
+        "                               tf.keras.layers.BatchNormalization(),\n",
+        "                               tf.keras.layers.Conv2D(3, (1, 1)),\n",
+        "                               tf.keras.layers.BatchNormalization()])\n",
+        "my_seq(tf.zeros([1, 2, 3, 3]))"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "c5YwYcnuK-wc"
+      },
+      "source": [
+        "# Next steps\n",
+        "\n",
+        "Now you can go back to the previous notebook and adapt the linear regression example to use layers and models to be better structured."
+      ]
+    }
+  ],
+  "metadata": {
+    "colab": {
+      "collapsed_sections": [],
+      "default_view": {},
+      "name": "4 - High level API - TensorFlow Eager.ipynb",
+      "provenance": [],
+      "version": "0.3.2",
+      "views": {}
+    },
+    "kernelspec": {
+      "display_name": "Python 3",
+      "name": "python3"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 0
+}
diff --git a/tensorflow/contrib/eager/python/examples/resnet50/resnet50_test.py b/tensorflow/contrib/eager/python/examples/resnet50/resnet50_test.py
index 2d51cfdeee3f0b45514af0895366417158b01614..b14ef1df8ff4c660b9b6f2abfd5df6572d10b1e8 100644
--- a/tensorflow/contrib/eager/python/examples/resnet50/resnet50_test.py
+++ b/tensorflow/contrib/eager/python/examples/resnet50/resnet50_test.py
@@ -49,15 +49,17 @@ def random_batch(batch_size, data_format):
   return images, one_hot
 
 
-def train_one_step(model, images, labels, optimizer):
-
+def compute_gradients(model, images, labels):
   with tf.GradientTape() as tape:
     logits = model(images, training=True)
     loss = tf.losses.softmax_cross_entropy(
         logits=logits, onehot_labels=labels)
     tf.contrib.summary.scalar(name='loss', tensor=loss)
-  grads = tape.gradient(loss, model.variables)
-  optimizer.apply_gradients(zip(grads, model.variables))
+  return tape.gradient(loss, model.variables)
+
+
+def apply_gradients(model, optimizer, gradients):
+  optimizer.apply_gradients(zip(gradients, model.variables))
 
 
 class ResNet50Test(tf.test.TestCase):
@@ -114,7 +116,8 @@ class ResNet50Test(tf.test.TestCase):
       with tf.device(device), tfe.execution_mode(execution_mode):
         optimizer = tf.train.GradientDescentOptimizer(0.1)
         images, labels = random_batch(2, data_format)
-        train_one_step(model, images, labels, optimizer)
+        apply_gradients(model, optimizer,
+                        compute_gradients(model, images, labels))
         self.assertEqual(320, len(model.variables))
         tfe.async_wait()
     events = summary_test_util.events_from_logdir(logdir)
@@ -138,14 +141,16 @@ class ResNet50Test(tf.test.TestCase):
       # garbage to be collected. The hope is that this is a build-only effect,
       # and a subsequent training loop will create nothing which needs to be
       # collected.
-      train_one_step(model, images, labels, optimizer)
+      apply_gradients(model, optimizer,
+                      compute_gradients(model, images, labels))
       gc.collect()
       previous_gc_debug_flags = gc.get_debug()
       gc.set_debug(gc.DEBUG_SAVEALL)
       for _ in range(2):
         # Run twice to ensure that garbage that is created on the first
         # iteration is no longer accessible.
-        train_one_step(model, images, labels, optimizer)
+        apply_gradients(model, optimizer,
+                        compute_gradients(model, images, labels))
       gc.collect()
       # There should be no garbage requiring collection.
       self.assertEqual(0, len(gc.garbage))
@@ -180,9 +185,7 @@ class ResNet50Benchmarks(tf.test.Benchmark):
           return (16, 32, 64)
 
       if tf.DeviceSpec.from_string(device.name).device_type == 'TPU':
-        # TODO(iga): Training fails with batch size of 16, probably because of
-        # no layout optimizations with op-by-op mode. Investigate more.
-        return (8,)
+        return (32,)
     return (16, 32)
 
   def _report(self, label, start, num_iters, device, batch_size, data_format):
@@ -248,18 +251,21 @@ class ResNet50Benchmarks(tf.test.Benchmark):
       device, data_format = device_and_format
       for batch_size in self._train_batch_sizes():
         (images, labels) = random_batch(batch_size, data_format)
-        num_burn = 3
-        num_iters = 10
         model = resnet50.ResNet50(data_format)
+        optimizer = tf.train.GradientDescentOptimizer(0.1)
+        apply_grads = apply_gradients
         if defun:
           model.call = tfe.defun(model.call, compiled=compiled)
-        optimizer = tf.train.GradientDescentOptimizer(0.1)
+          apply_grads = tfe.defun(apply_gradients, compiled=compiled)
 
+        num_burn = 3
+        num_iters = 10
         with tf.device(device):
           iterator = make_iterator((images, labels))
           for _ in xrange(num_burn):
             (images, labels) = iterator.next()
-            train_one_step(model, images, labels, optimizer)
+            apply_grads(model, optimizer,
+                        compute_gradients(model, images, labels))
           if execution_mode:
             tfe.async_wait()
           self._force_device_sync()
@@ -268,7 +274,8 @@ class ResNet50Benchmarks(tf.test.Benchmark):
           start = time.time()
           for _ in xrange(num_iters):
             (images, labels) = iterator.next()
-            train_one_step(model, images, labels, optimizer)
+            apply_grads(model, optimizer,
+                        compute_gradients(model, images, labels))
           if execution_mode:
             tfe.async_wait()
           self._force_device_sync()
diff --git a/tensorflow/contrib/eager/python/examples/rnn_colorbot/rnn_colorbot.py b/tensorflow/contrib/eager/python/examples/rnn_colorbot/rnn_colorbot.py
index 492adbe1d80941f9df96d6636e4933d11239408e..5ee2176154ec7011dcb3d7b384a86213e778014f 100644
--- a/tensorflow/contrib/eager/python/examples/rnn_colorbot/rnn_colorbot.py
+++ b/tensorflow/contrib/eager/python/examples/rnn_colorbot/rnn_colorbot.py
@@ -152,7 +152,7 @@ class RNNColorbot(tf.keras.Model):
     self.label_dimension = label_dimension
     self.keep_prob = keep_prob
 
-    self.cells = self._add_cells(
+    self.cells = tf.contrib.checkpoint.List(
         [tf.nn.rnn_cell.BasicLSTMCell(size) for size in rnn_cell_sizes])
     self.relu = layers.Dense(
         label_dimension, activation=tf.nn.relu, name="relu")
@@ -204,14 +204,6 @@ class RNNColorbot(tf.keras.Model):
     hidden_states = tf.gather_nd(chars, indices)
     return self.relu(hidden_states)
 
-  def _add_cells(self, cells):
-    # "Magic" required for keras.Model classes to track all the variables in
-    # a list of layers.Layer objects.
-    # TODO(ashankar): Figure out API so user code doesn't have to do this.
-    for i, c in enumerate(cells):
-      setattr(self, "cell-%d" % i, c)
-    return cells
-
 
 def loss(labels, predictions):
   """Computes mean squared loss."""
diff --git a/tensorflow/contrib/eager/python/examples/rnn_ptb/rnn_ptb.py b/tensorflow/contrib/eager/python/examples/rnn_ptb/rnn_ptb.py
index 74701b2f4f7448c5f6c2c1bd7c67a8ee112f4115..c2340a293a80924f2dfa90e2fb23134b0f1feb6b 100644
--- a/tensorflow/contrib/eager/python/examples/rnn_ptb/rnn_ptb.py
+++ b/tensorflow/contrib/eager/python/examples/rnn_ptb/rnn_ptb.py
@@ -50,7 +50,7 @@ class RNN(tf.keras.Model):
   def __init__(self, hidden_dim, num_layers, keep_ratio):
     super(RNN, self).__init__()
     self.keep_ratio = keep_ratio
-    self.cells = self._add_cells([
+    self.cells = tf.contrib.checkpoint.List([
         tf.nn.rnn_cell.BasicLSTMCell(num_units=hidden_dim)
         for _ in range(num_layers)
     ])
@@ -74,14 +74,6 @@ class RNN(tf.keras.Model):
     # tuple (output, output_states).
     return [input_seq]
 
-  def _add_cells(self, cells):
-    # "Magic" required for keras.Model classes to track all the variables in
-    # a list of Layer objects.
-    # TODO(ashankar): Figure out API so user code doesn't have to do this.
-    for i, c in enumerate(cells):
-      setattr(self, "cell-%d" % i, c)
-    return cells
-
 
 class Embedding(layers.Layer):
   """An Embedding layer."""
diff --git a/tensorflow/contrib/eager/python/metrics_impl.py b/tensorflow/contrib/eager/python/metrics_impl.py
index 1ae6415d5ecb03ef97cdf734c808e3f728dafcb0..c947ed9dcc415670a820f8a5cd9eaaf07334cfc3 100644
--- a/tensorflow/contrib/eager/python/metrics_impl.py
+++ b/tensorflow/contrib/eager/python/metrics_impl.py
@@ -25,6 +25,7 @@ from tensorflow.python.eager import function
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import init_ops
 from tensorflow.python.ops import math_ops
@@ -367,6 +368,9 @@ class Accuracy(Mean):
     Returns:
       The arguments, for easy chaining.
     """
+    check_ops.assert_equal(
+        array_ops.shape(labels), array_ops.shape(predictions),
+        message="Shapes of labels and predictions are unequal")
     matches = math_ops.equal(labels, predictions)
     matches = math_ops.cast(matches, dtypes.float64)
     super(Accuracy, self).call(matches, weights=weights)
diff --git a/tensorflow/contrib/eager/python/metrics_test.py b/tensorflow/contrib/eager/python/metrics_test.py
index 98a98a8d358d8a7d7a06505ed1a7d4c0ff1e18f4..02ee05487515b81bfae70d02c1dfdb6d816b77c7 100644
--- a/tensorflow/contrib/eager/python/metrics_test.py
+++ b/tensorflow/contrib/eager/python/metrics_test.py
@@ -26,6 +26,7 @@ from tensorflow.contrib.summary import summary_test_util
 from tensorflow.python.eager import context
 from tensorflow.python.eager import test
 from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import errors
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import test_util
 from tensorflow.python.ops import array_ops
@@ -117,6 +118,11 @@ class MetricsTest(test.TestCase):
     self.assertEqual(dtypes.float64, m.dtype)
     self.assertEqual(dtypes.float64, m.result().dtype)
 
+  def testAccuracyDifferentShapes(self):
+    m = metrics.Accuracy()
+    with self.assertRaises(errors.InvalidArgumentError):
+      m([[0], [0]], [0, 1])
+
   def testWeightedAccuracy(self):
     m = metrics.Accuracy()
     # 1 correct, total weight of 2
diff --git a/tensorflow/contrib/eager/python/saver_test.py b/tensorflow/contrib/eager/python/saver_test.py
index 4032e755f6e7dea9dcb42587f14e8386e5db2338..90a3711475719a7f991473c6c9067da1e76ab9f2 100644
--- a/tensorflow/contrib/eager/python/saver_test.py
+++ b/tensorflow/contrib/eager/python/saver_test.py
@@ -60,15 +60,9 @@ class SaverTest(test.TestCase):
 
   def testSameNameNoClobbering(self):
     with ops.device(self._dev()):
-      # Note that this test purposefully uses Graphs rather than
-      # IsolateTest. Users are more likely to accidentally create the same
-      # variable name this way.
-      first_graph = ops.Graph()
-      with first_graph.as_default():
-        v1_first_graph = resource_variable_ops.ResourceVariable(1.0, name='v1')
-      with ops.Graph().as_default():
-        v1_second_graph = resource_variable_ops.ResourceVariable(2.0, name='v1')
-        saver = _saver.Saver([v1_first_graph, v1_second_graph])
+      v1 = resource_variable_ops.ResourceVariable(1.0, name='v1')
+      v2 = resource_variable_ops.ResourceVariable(2.0, name='v1')
+      saver = _saver.Saver([v1, v2])
       ckpt_prefix = os.path.join(test.get_temp_dir(), 'ckpt')
       with self.assertRaisesRegexp(ValueError, 'v1'):
         saver.save(ckpt_prefix)
@@ -126,12 +120,11 @@ class SaverTest(test.TestCase):
       saver = _saver.Saver([v1])
       saver.save(ckpt_prefix)
 
-      with ops.Graph().as_default():
-        saver = _saver.Saver([v1])
-        with _saver.restore_variables_on_create(ckpt_prefix):
-          # Value is from checkpoint, but not from argument.
-          ret, _ = model(2.0)
-          self.assertEqual(ret.numpy(), 1.0)
+      saver = _saver.Saver([v1])
+      with _saver.restore_variables_on_create(ckpt_prefix):
+        # Value is from checkpoint, but not from argument.
+        ret, _ = model(2.0)
+        self.assertEqual(ret.numpy(), 1.0)
 
   def testRestoreNotFound(self):
     with ops.device(self._dev()):
@@ -184,17 +177,17 @@ class SaverTest(test.TestCase):
           4, model(array_ops.constant(2, dtype=dtypes.float32)).numpy())
 
       # reset the graph and reload on create, so that 1 + 2 = 3
-      with ops.Graph().as_default():
-        with _saver.restore_variables_on_create(ckpt_prefix):
-          @graph_callable.graph_callable(
-              [graph_callable.ShapeAndDtype(shape=(), dtype=dtypes.float32)])
-          def model2(x):
-            v = variable_scope.get_variable(
-                'v', initializer=init_ops.zeros_initializer(), shape=())
-            return v + x
-
-          self.assertEqual(
-              3, model2(array_ops.constant(2, dtype=dtypes.float32)).numpy())
+      ops.reset_default_graph()
+      with _saver.restore_variables_on_create(ckpt_prefix):
+        @graph_callable.graph_callable(
+            [graph_callable.ShapeAndDtype(shape=(), dtype=dtypes.float32)])
+        def model2(x):
+          v = variable_scope.get_variable(
+              'v', initializer=init_ops.zeros_initializer(), shape=())
+          return v + x
+
+        self.assertEqual(
+            3, model2(array_ops.constant(2, dtype=dtypes.float32)).numpy())
 
 
 class GetOptimizerTests(test.TestCase):
diff --git a/tensorflow/contrib/eager/python/tfe.py b/tensorflow/contrib/eager/python/tfe.py
index 5826700c73e255198e9a6974ca240ba55e438a26..fee9db46fa4f79d7dd613436726e8ddad51faf1c 100644
--- a/tensorflow/contrib/eager/python/tfe.py
+++ b/tensorflow/contrib/eager/python/tfe.py
@@ -115,6 +115,7 @@ from tensorflow.python.eager.execution_callbacks import seterr
 from tensorflow.python.framework.ops import enable_eager_execution
 from tensorflow.python.framework.ops import eager_run as run
 from tensorflow.python.framework.test_util import run_in_graph_and_eager_modes as run_test_in_graph_and_eager_modes
+from tensorflow.python.framework.test_util import run_all_in_graph_and_eager_modes as run_all_tests_in_graph_and_eager_modes
 from tensorflow.python.ops.custom_gradient import custom_gradient
 from tensorflow.python.ops.resource_variable_ops import ResourceVariable as Variable
 from tensorflow.python.ops.variable_scope import EagerVariableStore
diff --git a/tensorflow/contrib/estimator/BUILD b/tensorflow/contrib/estimator/BUILD
index d5d2abf8c4c82374842ed2e10a849765a6dddd3b..1937ffb583bc727df76470d072b35fb3c9acaa88 100644
--- a/tensorflow/contrib/estimator/BUILD
+++ b/tensorflow/contrib/estimator/BUILD
@@ -312,6 +312,7 @@ py_test(
         "//tensorflow/python:sparse_tensor",
         "//tensorflow/python:string_ops",
         "//tensorflow/python:training",
+        "//tensorflow/python:variables",
         "//tensorflow/python/estimator:metric_keys",
         "//tensorflow/python/estimator:model_fn",
         "//tensorflow/python/estimator:prediction_keys",
@@ -340,6 +341,7 @@ py_test(
     size = "medium",
     srcs = ["python/estimator/hooks_test.py"],
     srcs_version = "PY2AND3",
+    tags = ["notsan"],
     deps = [
         ":hooks",
         "//tensorflow/python:client_testlib",
diff --git a/tensorflow/contrib/estimator/python/estimator/head.py b/tensorflow/contrib/estimator/python/estimator/head.py
index 8b97f86db19a1bc2d9f17c9935e6678844daf177..9594e5132fd20dadea118fd1dd6768feb7fd7fff 100644
--- a/tensorflow/contrib/estimator/python/estimator/head.py
+++ b/tensorflow/contrib/estimator/python/estimator/head.py
@@ -529,6 +529,7 @@ def multi_label_head(n_classes,
   applications, the shape is `[batch_size, n_classes]`.
 
   Labels can be:
+
   * A multi-hot tensor of shape `[D0, D1, ... DN, n_classes]`
   * An integer `SparseTensor` of class indices. The `dense_shape` must be
     `[D0, D1, ... DN, ?]` and the values within `[0, n_classes)`.
@@ -845,6 +846,7 @@ class _MultiLabelHead(head_lib._Head):  # pylint:disable=protected-access
         train_op = train_op_fn(regularized_training_loss)
       else:
         raise ValueError('train_op_fn and optimizer cannot both be None.')
+      train_op = head_lib._append_update_ops(train_op)  # pylint:disable=protected-access
       # Only summarize mean_loss for SUM reduction to preserve backwards
       # compatibility. Otherwise skip it to avoid unnecessary computation.
       if self._loss_reduction == losses.Reduction.SUM:
diff --git a/tensorflow/contrib/estimator/python/estimator/head_test.py b/tensorflow/contrib/estimator/python/estimator/head_test.py
index d6c158608b5c564f24bc90583084306aa7084742..b2b57fa06ba818d4455871fe57dde5ce287b39a2 100644
--- a/tensorflow/contrib/estimator/python/estimator/head_test.py
+++ b/tensorflow/contrib/estimator/python/estimator/head_test.py
@@ -36,6 +36,7 @@ from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import string_ops
+from tensorflow.python.ops import variables
 from tensorflow.python.ops.losses import losses
 from tensorflow.python.platform import test
 from tensorflow.python.saved_model import signature_constants
@@ -989,6 +990,34 @@ class MultiLabelHead(test.TestCase):
           six.b('{0:s}{1:.3f}'.format(expected_train_result, expected_loss)),
           train_result)
 
+  def test_train_with_update_ops(self):
+    head = head_lib.multi_label_head(n_classes=2)
+
+    with ops.Graph().as_default():
+      w = variables.Variable(1)
+      update_op = w.assign_add(1)
+      ops.add_to_collection(ops.GraphKeys.UPDATE_OPS, update_op)
+
+      t = variables.Variable('')
+      expected_train_result = b'my_train_op'
+      def _train_op_fn(loss):
+        del loss
+        return t.assign(expected_train_result)
+
+      spec = head.create_estimator_spec(
+          features={'x': np.array(((42,),), dtype=np.int32)},
+          mode=model_fn.ModeKeys.TRAIN,
+          logits=np.array([[-10., 10.], [-15., 10.]], dtype=np.float32),
+          labels=np.array([[1, 0], [1, 1]], dtype=np.int64),
+          train_op_fn=_train_op_fn)
+
+      with self.test_session() as sess:
+        _initialize_variables(self, spec.scaffold)
+        sess.run(spec.train_op)
+        w_value, t_value = sess.run([w, t])
+        self.assertEqual(2, w_value)
+        self.assertEqual(expected_train_result, t_value)
+
   def test_train_with_regularization_losses(self):
     head = head_lib.multi_label_head(
         n_classes=2, loss_reduction=losses.Reduction.SUM_OVER_BATCH_SIZE)
diff --git a/tensorflow/contrib/estimator/python/estimator/hooks.py b/tensorflow/contrib/estimator/python/estimator/hooks.py
index 4808b9ee30e10047aaf3d33f74457b2717c87a13..ddd6aa442f82bad2d4714dbcdc85b20b34773068 100644
--- a/tensorflow/contrib/estimator/python/estimator/hooks.py
+++ b/tensorflow/contrib/estimator/python/estimator/hooks.py
@@ -72,7 +72,7 @@ class InMemoryEvaluatorHook(training.SessionRunHook):
       estimator: A `tf.estimator.Estimator` instance to call evaluate.
       input_fn:  Equivalent to the `input_fn` arg to `estimator.evaluate`. A
         function that constructs the input data for evaluation.
-        See @{$get_started/premade_estimators#create_input_functions} for more
+        See @{$premade_estimators#create_input_functions} for more
         information. The function should construct and return one of
         the following:
 
diff --git a/tensorflow/contrib/factorization/python/ops/factorization_ops.py b/tensorflow/contrib/factorization/python/ops/factorization_ops.py
index 5cef4068ed119d5dbccd585c5b4e5e28840d2cc7..8f73274c2a0ebbdc41ce6a647a8a5650694c9a23 100644
--- a/tensorflow/contrib/factorization/python/ops/factorization_ops.py
+++ b/tensorflow/contrib/factorization/python/ops/factorization_ops.py
@@ -197,7 +197,8 @@ class WALSModel(object):
                row_weights=1,
                col_weights=1,
                use_factors_weights_cache=True,
-               use_gramian_cache=True):
+               use_gramian_cache=True,
+               use_scoped_vars=False):
     """Creates model for WALS matrix factorization.
 
     Args:
@@ -239,6 +240,8 @@ class WALSModel(object):
         weights cache to take effect.
       use_gramian_cache: When True, the Gramians will be cached on the workers
         before the updates start. Defaults to True.
+      use_scoped_vars: When True, the factor and weight vars will also be nested
+        in a tf.name_scope.
     """
     self._input_rows = input_rows
     self._input_cols = input_cols
@@ -251,25 +254,46 @@ class WALSModel(object):
         regularization * linalg_ops.eye(self._n_components)
         if regularization is not None else None)
     assert (row_weights is None) == (col_weights is None)
-    self._row_weights = WALSModel._create_weights(
-        row_weights, self._input_rows, self._num_row_shards, "row_weights")
-    self._col_weights = WALSModel._create_weights(
-        col_weights, self._input_cols, self._num_col_shards, "col_weights")
     self._use_factors_weights_cache = use_factors_weights_cache
     self._use_gramian_cache = use_gramian_cache
-    self._row_factors = self._create_factors(
-        self._input_rows, self._n_components, self._num_row_shards, row_init,
-        "row_factors")
-    self._col_factors = self._create_factors(
-        self._input_cols, self._n_components, self._num_col_shards, col_init,
-        "col_factors")
+
+    if use_scoped_vars:
+      with ops.name_scope("row_weights"):
+        self._row_weights = WALSModel._create_weights(
+            row_weights, self._input_rows, self._num_row_shards, "row_weights")
+      with ops.name_scope("col_weights"):
+        self._col_weights = WALSModel._create_weights(
+            col_weights, self._input_cols, self._num_col_shards, "col_weights")
+      with ops.name_scope("row_factors"):
+        self._row_factors = self._create_factors(
+            self._input_rows, self._n_components, self._num_row_shards,
+            row_init, "row_factors")
+      with ops.name_scope("col_factors"):
+        self._col_factors = self._create_factors(
+            self._input_cols, self._n_components, self._num_col_shards,
+            col_init, "col_factors")
+    else:
+      self._row_weights = WALSModel._create_weights(
+          row_weights, self._input_rows, self._num_row_shards, "row_weights")
+      self._col_weights = WALSModel._create_weights(
+          col_weights, self._input_cols, self._num_col_shards, "col_weights")
+      self._row_factors = self._create_factors(
+          self._input_rows, self._n_components, self._num_row_shards, row_init,
+          "row_factors")
+      self._col_factors = self._create_factors(
+          self._input_cols, self._n_components, self._num_col_shards, col_init,
+          "col_factors")
+
     self._row_gramian = self._create_gramian(self._n_components, "row_gramian")
     self._col_gramian = self._create_gramian(self._n_components, "col_gramian")
-    self._row_update_prep_gramian = self._prepare_gramian(
-        self._col_factors, self._col_gramian)
-    self._col_update_prep_gramian = self._prepare_gramian(
-        self._row_factors, self._row_gramian)
-    self._create_transient_vars()
+    with ops.name_scope("row_prepare_gramian"):
+      self._row_update_prep_gramian = self._prepare_gramian(
+          self._col_factors, self._col_gramian)
+    with ops.name_scope("col_prepare_gramian"):
+      self._col_update_prep_gramian = self._prepare_gramian(
+          self._row_factors, self._row_gramian)
+    with ops.name_scope("transient_vars"):
+      self._create_transient_vars()
 
   @property
   def row_factors(self):
diff --git a/tensorflow/contrib/feature_column/python/feature_column/sequence_feature_column.py b/tensorflow/contrib/feature_column/python/feature_column/sequence_feature_column.py
index 555beddeaab419bcb23d06f960d370b706d744c8..b588f75efe9d0bbf8213a89978a627c0a0ccf554 100644
--- a/tensorflow/contrib/feature_column/python/feature_column/sequence_feature_column.py
+++ b/tensorflow/contrib/feature_column/python/feature_column/sequence_feature_column.py
@@ -346,7 +346,8 @@ def sequence_numeric_column(
     key,
     shape=(1,),
     default_value=0.,
-    dtype=dtypes.float32):
+    dtype=dtypes.float32,
+    normalizer_fn=None):
   """Returns a feature column that represents sequences of numeric data.
 
   Example:
@@ -370,6 +371,12 @@ def sequence_numeric_column(
     default_value: A single value compatible with `dtype` that is used for
       padding the sparse data into a dense `Tensor`.
     dtype: The type of values.
+    normalizer_fn: If not `None`, a function that can be used to normalize the
+      value of the tensor after `default_value` is applied for parsing.
+      Normalizer function takes the input `Tensor` as its argument, and returns
+      the output `Tensor`. (e.g. lambda x: (x - 3.0) / 4.2). Please note that
+      even though the most common use case of this function is normalization, it
+      can be used for any kind of Tensorflow transformations.
 
   Returns:
     A `_SequenceNumericColumn`.
@@ -383,12 +390,16 @@ def sequence_numeric_column(
   if not (dtype.is_integer or dtype.is_floating):
     raise ValueError('dtype must be convertible to float. '
                      'dtype: {}, key: {}'.format(dtype, key))
+  if normalizer_fn is not None and not callable(normalizer_fn):
+    raise TypeError(
+        'normalizer_fn must be a callable. Given: {}'.format(normalizer_fn))
 
   return _SequenceNumericColumn(
       key,
       shape=shape,
       default_value=default_value,
-      dtype=dtype)
+      dtype=dtype,
+      normalizer_fn=normalizer_fn)
 
 
 def _assert_all_equal_and_return(tensors, name=None):
@@ -407,7 +418,7 @@ class _SequenceNumericColumn(
     fc._SequenceDenseColumn,
     collections.namedtuple(
         '_SequenceNumericColumn',
-        ['key', 'shape', 'default_value', 'dtype'])):
+        ['key', 'shape', 'default_value', 'dtype', 'normalizer_fn'])):
   """Represents sequences of numeric data."""
 
   @property
@@ -419,7 +430,10 @@ class _SequenceNumericColumn(
     return {self.key: parsing_ops.VarLenFeature(self.dtype)}
 
   def _transform_feature(self, inputs):
-    return inputs.get(self.key)
+    input_tensor = inputs.get(self.key)
+    if self.normalizer_fn is not None:
+      input_tensor = self.normalizer_fn(input_tensor)
+    return input_tensor
 
   @property
   def _variable_shape(self):
diff --git a/tensorflow/contrib/feature_column/python/feature_column/sequence_feature_column_test.py b/tensorflow/contrib/feature_column/python/feature_column/sequence_feature_column_test.py
index 88f5d535162939e063eb1e7f43d495137c5adef4..89b5f4c4137f6c42417f539a578fd8b11f8b235d 100644
--- a/tensorflow/contrib/feature_column/python/feature_column/sequence_feature_column_test.py
+++ b/tensorflow/contrib/feature_column/python/feature_column/sequence_feature_column_test.py
@@ -28,6 +28,7 @@ from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import errors
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import sparse_tensor
+from tensorflow.python.ops import sparse_ops
 from tensorflow.python.platform import test
 from tensorflow.python.training import monitored_session
 
@@ -670,6 +671,7 @@ class SequenceNumericColumnTest(test.TestCase):
     self.assertEqual((1,), a.shape)
     self.assertEqual(0., a.default_value)
     self.assertEqual(dtypes.float32, a.dtype)
+    self.assertIsNone(a.normalizer_fn)
 
   def test_shape_saved_as_tuple(self):
     a = sfc.sequence_numeric_column('aaa', shape=[1, 2])
@@ -688,6 +690,10 @@ class SequenceNumericColumnTest(test.TestCase):
         ValueError, 'dtype must be convertible to float'):
       sfc.sequence_numeric_column('aaa', dtype=dtypes.string)
 
+  def test_normalizer_fn_must_be_callable(self):
+    with self.assertRaisesRegexp(TypeError, 'must be a callable'):
+      sfc.sequence_numeric_column('aaa', normalizer_fn='NotACallable')
+
   def test_get_sequence_dense_tensor(self):
     sparse_input = sparse_tensor.SparseTensorValue(
         # example 0, values [[0.], [1]]
@@ -708,6 +714,41 @@ class SequenceNumericColumnTest(test.TestCase):
       self.assertAllEqual(
           expected_dense_tensor, dense_tensor.eval(session=sess))
 
+  def test_get_sequence_dense_tensor_with_normalizer_fn(self):
+
+    def _increment_two(input_sparse_tensor):
+      return sparse_ops.sparse_add(
+          input_sparse_tensor,
+          sparse_tensor.SparseTensor(((0, 0), (1, 1)), (2.0, 2.0), (2, 2))
+      )
+
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, values [[0.], [1]]
+        # example 1, [[10.]]
+        indices=((0, 0), (0, 1), (1, 0)),
+        values=(0., 1., 10.),
+        dense_shape=(2, 2))
+
+    # Before _increment_two:
+    #   [[0.], [1.]],
+    #   [[10.], [0.]],
+    # After _increment_two:
+    #   [[2.], [1.]],
+    #   [[10.], [2.]],
+    expected_dense_tensor = [
+        [[2.], [1.]],
+        [[10.], [2.]],
+    ]
+    numeric_column = sfc.sequence_numeric_column(
+        'aaa', normalizer_fn=_increment_two)
+
+    dense_tensor, _ = numeric_column._get_sequence_dense_tensor(
+        _LazyBuilder({'aaa': sparse_input}))
+
+    with monitored_session.MonitoredSession() as sess:
+      self.assertAllEqual(
+          expected_dense_tensor, dense_tensor.eval(session=sess))
+
   def test_get_sequence_dense_tensor_with_shape(self):
     """Tests get_sequence_dense_tensor with shape !=(1,)."""
     sparse_input = sparse_tensor.SparseTensorValue(
diff --git a/tensorflow/contrib/ffmpeg/__init__.py b/tensorflow/contrib/ffmpeg/__init__.py
index daba965a98893b992abdc598ec713f13020d6e91..484ffee3e7afe55c63cab2a463454353b2663e18 100644
--- a/tensorflow/contrib/ffmpeg/__init__.py
+++ b/tensorflow/contrib/ffmpeg/__init__.py
@@ -28,7 +28,6 @@ from __future__ import print_function
 from tensorflow.contrib.ffmpeg.ffmpeg_ops import decode_audio
 from tensorflow.contrib.ffmpeg.ffmpeg_ops import decode_video
 from tensorflow.contrib.ffmpeg.ffmpeg_ops import encode_audio
-from tensorflow.contrib.ffmpeg.ffmpeg_ops import decode_video
 
 from tensorflow.python.util.all_util import remove_undocumented
 
diff --git a/tensorflow/contrib/ffmpeg/ffmpeg_ops.py b/tensorflow/contrib/ffmpeg/ffmpeg_ops.py
index 020b5c99c61019254bef0b1dff6bc5901c92758a..b1b5126d9e9e5196a1733b80e0778e53cef7f774 100644
--- a/tensorflow/contrib/ffmpeg/ffmpeg_ops.py
+++ b/tensorflow/contrib/ffmpeg/ffmpeg_ops.py
@@ -21,7 +21,6 @@ from __future__ import print_function
 from tensorflow.contrib.ffmpeg.ops import gen_decode_audio_op_py
 from tensorflow.contrib.ffmpeg.ops import gen_decode_video_op_py
 from tensorflow.contrib.ffmpeg.ops import gen_encode_audio_op_py
-from tensorflow.contrib.ffmpeg.ops import gen_decode_video_op_py
 from tensorflow.contrib.util import loader
 from tensorflow.python.framework import ops
 from tensorflow.python.platform import resource_loader
diff --git a/tensorflow/contrib/framework/__init__.py b/tensorflow/contrib/framework/__init__.py
index 10d1ecc738de6777784200ba934a521dff592e28..dc49383c5c300e82839c478e097074b3e8776b3b 100644
--- a/tensorflow/contrib/framework/__init__.py
+++ b/tensorflow/contrib/framework/__init__.py
@@ -119,14 +119,13 @@ from tensorflow.python.framework.smart_cond import smart_cond
 from tensorflow.python.framework.smart_cond import smart_constant_value
 from tensorflow.python.framework.tensor_spec import BoundedTensorSpec
 from tensorflow.python.framework.tensor_spec import TensorSpec
-from tensorflow.python.ops.array_ops import broadcast_to
 from tensorflow.python.ops.init_ops import convolutional_delta_orthogonal
 from tensorflow.python.ops.init_ops import convolutional_orthogonal_1d
 from tensorflow.python.ops.init_ops import convolutional_orthogonal_2d
 from tensorflow.python.ops.init_ops import convolutional_orthogonal_3d
 from tensorflow.python.util.all_util import remove_undocumented
 
-_allowed_symbols = ['nest', 'broadcast_to']
+_allowed_symbols = ['nest']
 _nest_allowed_symbols = [
     'assert_same_structure',
     'is_sequence',
diff --git a/tensorflow/contrib/fused_conv/BUILD b/tensorflow/contrib/fused_conv/BUILD
index 0eb6889db1fae1c74aeb4392441b308392b091a5..0f0813c07f8bd330b089780064e02f8dfe7d49f6 100644
--- a/tensorflow/contrib/fused_conv/BUILD
+++ b/tensorflow/contrib/fused_conv/BUILD
@@ -75,6 +75,7 @@ tf_kernel_library(
         "//tensorflow/core/kernels:gpu_util_hdrs",
         "//tensorflow/core/kernels:ops_util_hdrs",
         "//third_party/eigen3",
+        "@local_config_cuda//cuda:cudnn_header",
     ],
     alwayslink = 1,
 )
@@ -94,6 +95,7 @@ tf_custom_op_library(
         "//tensorflow/core/kernels:conv_ops_gpu_hdrs",
         "//tensorflow/core/kernels:gpu_util_hdrs",
         "//tensorflow/core/kernels:ops_util_hdrs",
+        "@local_config_cuda//cuda:cudnn_header",
     ],
 )
 
diff --git a/tensorflow/contrib/fused_conv/python/ops/fused_conv2d_bias_activation_op_test.py b/tensorflow/contrib/fused_conv/python/ops/fused_conv2d_bias_activation_op_test.py
index 3d0ed899322c26bf4ae428930899d7a5885e9f21..a955e21b72e765f751318c7927f9644481fe7933 100644
--- a/tensorflow/contrib/fused_conv/python/ops/fused_conv2d_bias_activation_op_test.py
+++ b/tensorflow/contrib/fused_conv/python/ops/fused_conv2d_bias_activation_op_test.py
@@ -21,6 +21,8 @@ from __future__ import print_function
 import numpy as np
 
 from tensorflow.contrib.fused_conv.python.ops import fused_conv2d_bias_activation_op
+from tensorflow.core.protobuf import config_pb2
+from tensorflow.core.protobuf import rewriter_config_pb2
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import errors_impl
@@ -33,6 +35,13 @@ from tensorflow.python.platform import test
 from tensorflow.python.platform import tf_logging
 
 
+def NoMemoryOptimizationConfig():
+  config = config_pb2.ConfigProto()
+  config.graph_options.rewrite_options.memory_optimization = (
+      rewriter_config_pb2.RewriterConfig.OFF)
+  return config
+
+
 def GetShrunkInceptionShapes(shrink=10):
   """Iterator for smaller versions of convolution shapes in 2015 Inception.
 
@@ -193,7 +202,8 @@ class FusedConv2DBiasActivationTest(test.TestCase):
     # This is to guarantee that there is always negative values after
     # bias add so that we can test whether relu works correctly.
     x3 = bias
-    with self.test_session(use_gpu=True):
+    # TODO(b/79323979): re-enable memory optimization after this bug is fixed.
+    with self.test_session(use_gpu=True, config=NoMemoryOptimizationConfig()):
       t1 = constant_op.constant(x1, shape=tensor_in_sizes, dtype=dtype)
       t2 = constant_op.constant(x2, shape=filter_in_sizes, dtype=dtype)
       fused_t2 = t2
@@ -241,7 +251,9 @@ class FusedConv2DBiasActivationTest(test.TestCase):
     x3 = np.random.rand(*[filter_in_sizes[-1]]).astype(np.float32)
 
     def _SetupVal(data_format, use_gpu):
-      with self.test_session(use_gpu=use_gpu):
+      # TODO(b/79323979): re-enable memory optimization after this bug is fixed.
+      with self.test_session(
+          use_gpu=use_gpu, config=NoMemoryOptimizationConfig()):
         t1 = constant_op.constant(x1, shape=tensor_in_sizes)
         t2 = constant_op.constant(x2, shape=filter_in_sizes)
         t3 = constant_op.constant(x3, shape=[filter_in_sizes[-1]])
@@ -289,8 +301,8 @@ class FusedConv2DBiasActivationTest(test.TestCase):
           conv = tensors[i]
           value = values[i]
           ref_value = ref_values[i]
-          print("expected = ", ref_value)
-          print("actual = ", value)
+          tf_logging.info("expected = ", ref_value)
+          tf_logging.info("actual = ", value)
           tol = 1e-5
           if value.dtype == np.float16:
             tol = 1e-3
@@ -831,7 +843,8 @@ class FusedConvInt8Tests(test.TestCase):
                                                 vertical_stride, padding_type)
     output_width = CalculateConvolvedOutputDim(input_width, filter_width,
                                                horizontal_stride, padding_type)
-    print("output_height=", output_height, ", output_width=", output_width)
+    tf_logging.info("output_height=", output_height, ", output_width=", 
+			                 output_width)
 
     side_input, _, _ = gen_array_ops.quantize_v2(
         random_ops.random_uniform(
@@ -864,10 +877,12 @@ class FusedConvInt8Tests(test.TestCase):
         conv_input_scale, conv_input, kernel, padding_type, strides,
         side_input_scale, side_input, biases)
 
-    with self.test_session(use_gpu=True) as sess:
+    # TODO(b/79323979): re-enable memory optimization after this bug is fixed.
+    with self.test_session(
+        use_gpu=True, config=NoMemoryOptimizationConfig()) as sess:
       actual_y, expected_y = sess.run([actual, expected])
-      print("actual_y = ", actual_y)
-      print("expected_y = ", expected_y)
+      tf_logging.info("actual_y = ", actual_y)
+      tf_logging.info("expected_y = ", expected_y)
       self.assertTrue(np.array_equal(actual_y, expected_y))
 
   def testFusedConvInt8(self):
diff --git a/tensorflow/contrib/gdr/gdr_server_lib.cc b/tensorflow/contrib/gdr/gdr_server_lib.cc
index 1f9dd0decb84cf9b7b703f18c061d3c0c7a1cb25..9025c992a4467f521d6d8d514e6a5e92f5492947 100644
--- a/tensorflow/contrib/gdr/gdr_server_lib.cc
+++ b/tensorflow/contrib/gdr/gdr_server_lib.cc
@@ -57,7 +57,7 @@ Status GdrServer::Init() {
         new GdrWorker(env, remote_memory_manager_.get()));
   };
   TF_RETURN_IF_ERROR(
-      GrpcServer::Init(nullptr, rendezvous_mgr_func, worker_func));
+      GrpcServer::Init(nullptr, rendezvous_mgr_func, nullptr, worker_func));
 
   return remote_memory_manager_->Init();
 }
diff --git a/tensorflow/contrib/graph_editor/transform.py b/tensorflow/contrib/graph_editor/transform.py
index 592d37b432ee605d74162e0b8ec6ccdf426c45d1..026a3d1200033400472c4fd763a244c04b284a9b 100644
--- a/tensorflow/contrib/graph_editor/transform.py
+++ b/tensorflow/contrib/graph_editor/transform.py
@@ -189,9 +189,6 @@ def copy_op_handler(info, op, new_inputs, copy_shape=True, nodedef_fn=None):
   if op._original_op:
     op_._original_op = op._original_op
 
-  # Add op to the graph
-  info.graph_._add_op(op_)
-
   return op_, op_.outputs
 
 
@@ -492,7 +489,7 @@ class Transformer(object):
       t_ = info.transformed_ts[t]
       consumer_op_ = info.transformed_ops[consumer_op]
       t_index_ = list(consumer_op_.inputs).index(tmp_t_)
-      consumer_op_._update_input(t_index_, t_, update_dtype=False)  # pylint: disable=protected-access
+      consumer_op_._update_input(t_index_, t_)  # pylint: disable=protected-access
 
   def _connect_control_inputs(self, info):
     """Connect the previously copied ops."""
diff --git a/tensorflow/contrib/integrate/python/ops/odes.py b/tensorflow/contrib/integrate/python/ops/odes.py
index b4a99867ed46897f60be3f230838c3f576d5455e..61f78febfc07bb4e677259366a81c16b2b585244 100644
--- a/tensorflow/contrib/integrate/python/ops/odes.py
+++ b/tensorflow/contrib/integrate/python/ops/odes.py
@@ -28,7 +28,6 @@ from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import control_flow_ops
-from tensorflow.python.ops import functional_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import tensor_array_ops
 
@@ -279,13 +278,27 @@ def _assert_increasing(t):
   return ops.control_dependencies([assert_increasing])
 
 
-def _check_input_types(t, y0):
+def _check_input_types(y0, t, dt=None):
   if not (y0.dtype.is_floating or y0.dtype.is_complex):
     raise TypeError('`y0` must have a floating point or complex floating '
                     'point dtype')
   if not t.dtype.is_floating:
     raise TypeError('`t` must have a floating point dtype')
 
+  if dt is not None and not dt.dtype.is_floating:
+    raise TypeError('`dt` must have a floating point dtype')
+
+
+def _check_input_sizes(t, dt):
+  if len(t.get_shape().as_list()) > 1:
+    raise ValueError('t must be a 1D tensor')
+
+  if len(dt.get_shape().as_list()) > 1:
+    raise ValueError('t must be a 1D tensor')
+
+  if t.get_shape()[0] != dt.get_shape()[0] + 1:
+    raise ValueError('t and dt have incompatible lengths, must be N and N-1')
+
 
 def _dopri5(func,
             y0,
@@ -510,7 +523,7 @@ def odeint(func,
     # avoiding the need to pack/unpack in user functions.
     y0 = ops.convert_to_tensor(y0, name='y0')
     t = ops.convert_to_tensor(t, preferred_dtype=dtypes.float64, name='t')
-    _check_input_types(t, y0)
+    _check_input_types(y0, t)
 
     error_dtype = abs(y0).dtype
     rtol = ops.convert_to_tensor(rtol, dtype=error_dtype, name='rtol')
@@ -530,24 +543,74 @@ def odeint(func,
 class _FixedGridIntegrator(six.with_metaclass(abc.ABCMeta)):
   """Base class for fixed-grid ODE integrators."""
 
-  def integrate(self, evol_func, y0, time_grid):
-    time_delta_grid = time_grid[1:] - time_grid[:-1]
-
-    scan_func = self._make_scan_func(evol_func)
+  def integrate(self, evol_func, y0, time_grid, dt_grid, steps_on_intervals):
+    """Returns integrated values of differential equation on the `time grid`.
+
+    Numerically integrates differential equation defined via time derivative
+    evaluator `evol_func` using fixed time steps specified in dt_grid.
+
+    Args:
+      evol_func: Callable, evaluates time derivative of y at a given time.
+      y0: N-D Tensor holds initial values of the solution.
+      time_grid: 1-D Tensor holding the time points at which the solution
+        will be recorded, must have a floating dtype.
+      dt_grid: 1-D Tensor holds fixed time steps to be used on time_grid
+        intervals. Must be a floating dtype and have one less element than that
+        of the time_grid.
+      steps_on_intervals: 1-D Tensor of integer dtype, must have the same size
+        as dt_grid. Specifies number of steps needed for every interval. Assumes
+        steps_on_intervals * dt_grid == time intervals.
+
+    Returns:
+      (N+1)-D tensor, where the first dimension corresponds to different
+      time points. Contains the solved value of y for each desired time point in
+      `t`, with the initial value `y0` being the first element along the first
+      dimension.
+    """
 
-    y_grid = functional_ops.scan(scan_func, (time_grid[:-1], time_delta_grid),
-                                 y0)
-    return array_ops.concat([[y0], y_grid], axis=0)
+    iteration_func = self._make_iteration_func(evol_func, dt_grid)
+    integrate_interval = self._make_interval_integrator(iteration_func,
+                                                        steps_on_intervals)
 
-  def _make_scan_func(self, evol_func):
+    num_times = array_ops.size(time_grid)
+    current_time = time_grid[0]
+    solution_array = tensor_array_ops.TensorArray(y0.dtype, num_times)
+    solution_array = solution_array.write(0, y0)
 
-    def scan_func(y, t_and_dt):
-      t, dt = t_and_dt
+    solution_array, _, _, _ = control_flow_ops.while_loop(
+        lambda _, __, ___, i: i < num_times,
+        integrate_interval,
+        (solution_array, y0, current_time, 1)
+    )
+    solution_array = solution_array.stack()
+    solution_array.set_shape(time_grid.get_shape().concatenate(y0.get_shape()))
+    return solution_array
+
+  def _make_iteration_func(self, evol_func, dt_grid):
+    """Returns a function that builds operations of a single time step."""
+
+    def iteration_func(y, t, dt_step, interval_step):
+      """Performs a single time step advance."""
+      dt = dt_grid[interval_step - 1]
       dy = self._step_func(evol_func, t, dt, y)
       dy = math_ops.cast(dy, dtype=y.dtype)
-      return y + dy
+      return y + dy, t + dt, dt_step + 1, interval_step
+
+    return iteration_func
+
+  def _make_interval_integrator(self, iteration_func, interval_sizes):
+    """Returns a function that builds operations for interval integration."""
 
-    return scan_func
+    def integrate_interval(solution_array, y, t, interval_num):
+      """Integrates y with fixed time step on interval `interval_num`."""
+      y, t, _, _ = control_flow_ops.while_loop(
+          lambda _, __, j, interval_num: j < interval_sizes[interval_num - 1],
+          iteration_func,
+          (y, t, 0, interval_num)
+      )
+      return solution_array.write(interval_num, y), y, t, interval_num + 1
+
+    return integrate_interval
 
   @abc.abstractmethod
   def _step_func(self, evol_func, t, dt, y):
@@ -555,6 +618,7 @@ class _FixedGridIntegrator(six.with_metaclass(abc.ABCMeta)):
 
 
 class _MidpointFixedGridIntegrator(_FixedGridIntegrator):
+  """Fixed grid integrator implementing midpoint scheme."""
 
   def _step_func(self, evol_func, t, dt, y):
     dt_cast = math_ops.cast(dt, y.dtype)
@@ -563,6 +627,7 @@ class _MidpointFixedGridIntegrator(_FixedGridIntegrator):
 
 
 class _RK4FixedGridIntegrator(_FixedGridIntegrator):
+  """Fixed grid integrator implementing RK4 scheme."""
 
   def _step_func(self, evol_func, t, dt, y):
     k1 = evol_func(y, t)
@@ -575,7 +640,7 @@ class _RK4FixedGridIntegrator(_FixedGridIntegrator):
     return math_ops.add_n([k1, 2 * k2, 2 * k3, k4]) * (dt_cast / 6)
 
 
-def odeint_fixed(func, y0, t, method='rk4', name=None):
+def odeint_fixed(func, y0, t, dt=None, method='rk4', name=None):
   """ODE integration on a fixed grid (with no step size control).
 
   Useful in certain scenarios to avoid the overhead of adaptive step size
@@ -590,6 +655,14 @@ def odeint_fixed(func, y0, t, method='rk4', name=None):
       `y`. The initial time point should be the first element of this sequence,
       and each time must be larger than the previous time. May have any floating
       point dtype.
+    dt: 0-D or 1-D Tensor providing time step suggestion to be used on time
+      integration intervals in `t`. 1-D Tensor should provide values
+      for all intervals, must have 1 less element than that of `t`.
+      If given a 0-D Tensor, the value is interpreted as time step suggestion
+      same for all intervals. If passed None, then time step is set to be the
+      t[1:] - t[:-1]. Defaults to None. The actual step size is obtained by
+      insuring an integer number of steps per interval, potentially reducing the
+      time step.
     method: One of 'midpoint' or 'rk4'.
     name: Optional name for the resulting operation.
 
@@ -602,16 +675,29 @@ def odeint_fixed(func, y0, t, method='rk4', name=None):
   Raises:
     ValueError: Upon caller errors.
   """
-  with ops.name_scope(name, 'odeint_fixed', [y0, t]):
+  with ops.name_scope(name, 'odeint_fixed', [y0, t, dt]):
     t = ops.convert_to_tensor(t, preferred_dtype=dtypes.float64, name='t')
     y0 = ops.convert_to_tensor(y0, name='y0')
-    _check_input_types(t, y0)
+
+    intervals = t[1:] - t[:-1]
+    if dt is None:
+      dt = intervals
+    dt = ops.convert_to_tensor(dt, preferred_dtype=dtypes.float64, name='dt')
+
+    steps_on_intervals = math_ops.ceil(intervals / dt)
+    dt = intervals / steps_on_intervals
+    steps_on_intervals = math_ops.cast(steps_on_intervals, dtype=dtypes.int32)
+
+    _check_input_types(y0, t, dt)
+    _check_input_sizes(t, dt)
 
     with _assert_increasing(t):
       with ops.name_scope(method):
         if method == 'midpoint':
-          return _MidpointFixedGridIntegrator().integrate(func, y0, t)
+          return _MidpointFixedGridIntegrator().integrate(func, y0, t, dt,
+                                                          steps_on_intervals)
         elif method == 'rk4':
-          return _RK4FixedGridIntegrator().integrate(func, y0, t)
+          return _RK4FixedGridIntegrator().integrate(func, y0, t, dt,
+                                                     steps_on_intervals)
         else:
           raise ValueError('method not supported: {!s}'.format(method))
diff --git a/tensorflow/contrib/integrate/python/ops/odes_test.py b/tensorflow/contrib/integrate/python/ops/odes_test.py
index 3ec01212d25ca8dc6e13f340177a5e85138868d5..c7b4e2faa84e1a87cb1904b22eb0008ab1ee4be6 100644
--- a/tensorflow/contrib/integrate/python/ops/odes_test.py
+++ b/tensorflow/contrib/integrate/python/ops/odes_test.py
@@ -242,40 +242,56 @@ class InterpolationTest(test.TestCase):
 
 class OdeIntFixedTest(test.TestCase):
 
-  def _test_integrate_sine(self, method):
+  def _test_integrate_sine(self, method, t, dt=None):
 
     def evol_func(y, t):
       del t
       return array_ops.stack([y[1], -y[0]])
 
     y0 = [0., 1.]
-    time_grid = np.linspace(0., 10., 200)
-    y_grid = odes.odeint_fixed(evol_func, y0, time_grid, method=method)
+    y_grid = odes.odeint_fixed(evol_func, y0, t, dt, method=method)
 
     with self.test_session() as sess:
       y_grid_array = sess.run(y_grid)
 
     np.testing.assert_allclose(
-        y_grid_array[:, 0], np.sin(time_grid), rtol=1e-2, atol=1e-2)
+        y_grid_array[:, 0], np.sin(t), rtol=1e-2, atol=1e-2)
 
-  def _test_integrate_gaussian(self, method):
+  def _test_integrate_gaussian(self, method, t, dt=None):
 
     def evol_func(y, t):
       return -math_ops.cast(t, dtype=y.dtype) * y[0]
 
     y0 = [1.]
-    time_grid = np.linspace(0., 2., 100)
-    y_grid = odes.odeint_fixed(evol_func, y0, time_grid, method=method)
+    y_grid = odes.odeint_fixed(evol_func, y0, t, dt, method=method)
 
     with self.test_session() as sess:
       y_grid_array = sess.run(y_grid)
 
     np.testing.assert_allclose(
-        y_grid_array[:, 0], np.exp(-time_grid**2 / 2), rtol=1e-2, atol=1e-2)
+        y_grid_array[:, 0], np.exp(-t**2 / 2), rtol=1e-2, atol=1e-2)
+
+  def _test_integrate_sine_all(self, method):
+    uniform_time_grid = np.linspace(0., 10., 200)
+    non_uniform_time_grid = np.asarray([0.0, 0.4, 4.7, 5.2, 7.0])
+    uniform_dt = 0.02
+    non_uniform_dt = np.asarray([0.01, 0.001, 0.05, 0.03])
+    self._test_integrate_sine(method, uniform_time_grid)
+    self._test_integrate_sine(method, non_uniform_time_grid, uniform_dt)
+    self._test_integrate_sine(method, non_uniform_time_grid, non_uniform_dt)
+
+  def _test_integrate_gaussian_all(self, method):
+    uniform_time_grid = np.linspace(0., 2., 100)
+    non_uniform_time_grid = np.asarray([0.0, 0.1, 0.7, 1.2, 2.0])
+    uniform_dt = 0.01
+    non_uniform_dt = np.asarray([0.01, 0.001, 0.1, 0.03])
+    self._test_integrate_gaussian(method, uniform_time_grid)
+    self._test_integrate_gaussian(method, non_uniform_time_grid, uniform_dt)
+    self._test_integrate_gaussian(method, non_uniform_time_grid, non_uniform_dt)
 
   def _test_everything(self, method):
-    self._test_integrate_sine(method)
-    self._test_integrate_gaussian(method)
+    self._test_integrate_sine_all(method)
+    self._test_integrate_gaussian_all(method)
 
   def test_midpoint(self):
     self._test_everything('midpoint')
@@ -283,6 +299,21 @@ class OdeIntFixedTest(test.TestCase):
   def test_rk4(self):
     self._test_everything('rk4')
 
+  def test_dt_size_exceptions(self):
+    times = np.linspace(0., 2., 100)
+    dt = np.ones(99) * 0.01
+    dt_wrong_length = np.asarray([0.01, 0.001, 0.1, 0.03])
+    dt_wrong_dim = np.expand_dims(np.linspace(0., 2., 99), axis=0)
+    times_wrong_dim = np.expand_dims(np.linspace(0., 2., 100), axis=0)
+    with self.assertRaises(ValueError):
+      self._test_integrate_gaussian('midpoint', times, dt_wrong_length)
+
+    with self.assertRaises(ValueError):
+      self._test_integrate_gaussian('midpoint', times, dt_wrong_dim)
+
+    with self.assertRaises(ValueError):
+      self._test_integrate_gaussian('midpoint', times_wrong_dim, dt)
+
 
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/contrib/kafka/kernels/kafka_dataset_ops.cc b/tensorflow/contrib/kafka/kernels/kafka_dataset_ops.cc
index a4cd4a2cc4b99b5906185bd2b942ed15c1ddf5e4..2638b25ec424b5b4ef556ff769e94e64da32fec2 100644
--- a/tensorflow/contrib/kafka/kernels/kafka_dataset_ops.cc
+++ b/tensorflow/contrib/kafka/kernels/kafka_dataset_ops.cc
@@ -64,7 +64,7 @@ class KafkaDatasetOp : public DatasetOpKernel {
           eof_(eof),
           timeout_(timeout) {}
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
       return std::unique_ptr<IteratorBase>(
           new Iterator({this, strings::StrCat(prefix, "::Kafka")}));
@@ -81,7 +81,7 @@ class KafkaDatasetOp : public DatasetOpKernel {
       return *shapes;
     }
 
-    string DebugString() override { return "KafkaDatasetOp::Dataset"; }
+    string DebugString() const override { return "KafkaDatasetOp::Dataset"; }
 
    protected:
     Status AsGraphDefInternal(DatasetGraphDefBuilder* b,
diff --git a/tensorflow/contrib/kfac/README.md b/tensorflow/contrib/kfac/README.md
index 762a2f0b57e95e2fef3dd177070701afb410e93a..102626925db560e47cdc73eb1e25e08836cb4fba 100644
--- a/tensorflow/contrib/kfac/README.md
+++ b/tensorflow/contrib/kfac/README.md
@@ -1,5 +1,10 @@
 # K-FAC: Kronecker-Factored Approximate Curvature
 
+# <font color="red", size=10><u>WARNING: </u></font>
+# ==third_party/tensorflow/contrib/kfac is deprecated. This will be==
+# ==removed on 15-07-2018. <!-- STY:begin_strip_and_replace -->Please import third_party/tensorflow_kfac.==
+# ==<!-- STY:end_strip_and_replace Please check https://github.com/tensorflow/kfac. -->==
+
 **K-FAC in TensorFlow** is an implementation of [K-FAC][kfac-paper], an
 approximate second-order optimization method, in TensorFlow. When applied to
 feedforward and convolutional neural networks, K-FAC can converge `>3.5x`
diff --git a/tensorflow/contrib/kfac/python/ops/optimizer.py b/tensorflow/contrib/kfac/python/ops/optimizer.py
index b7f63d8d94a7a427eb57afefeda3939f0c530f8e..03b9da793307b966632789fd11162306e6cd19f9 100644
--- a/tensorflow/contrib/kfac/python/ops/optimizer.py
+++ b/tensorflow/contrib/kfac/python/ops/optimizer.py
@@ -18,6 +18,8 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import warnings
+
 # pylint disable=long-line
 from tensorflow.contrib.kfac.python.ops import curvature_matrix_vector_products as cmvp
 from tensorflow.contrib.kfac.python.ops import estimator as est
@@ -107,6 +109,10 @@ class KfacOptimizer(gradient_descent.GradientDescentOptimizer):
       ValueError: If momentum is non-zero and momentum_type is not 'regular'
           or 'adam'.
     """
+    warnings.warn(
+        "third_party.tensorflow.contrib.kfac is deprecated."
+        "This will be removed on 15-07-2018. Check README for further details.",
+        DeprecationWarning)
     # Parameters to be passed to the Fisher estimator:
     self._variables = var_list or tf_variables.trainable_variables
     self._cov_ema_decay = cov_ema_decay
diff --git a/tensorflow/contrib/layers/__init__.py b/tensorflow/contrib/layers/__init__.py
index 00f03a111ae8be7f49761ef5fb5a82810bcca182..bc3359693562deb1229a78a2db5c256c76f7fd8d 100644
--- a/tensorflow/contrib/layers/__init__.py
+++ b/tensorflow/contrib/layers/__init__.py
@@ -19,6 +19,8 @@ See the @{$python/contrib.layers} guide.
 @@avg_pool2d
 @@avg_pool3d
 @@batch_norm
+@@convolution
+@@convolution1d
 @@convolution2d
 @@convolution3d
 @@conv2d_in_plane
diff --git a/tensorflow/contrib/layers/python/layers/layers.py b/tensorflow/contrib/layers/python/layers/layers.py
index 2f3e57653c5d6d949c4dcc91635690322b7f90c4..b6d63c9640611abdda65f1205f544ee505dae1f0 100644
--- a/tensorflow/contrib/layers/python/layers/layers.py
+++ b/tensorflow/contrib/layers/python/layers/layers.py
@@ -57,10 +57,10 @@ from tensorflow.python.training import moving_averages
 __all__ = [
     'avg_pool2d', 'avg_pool3d', 'batch_norm', 'bias_add', 'conv2d', 'conv3d',
     'conv2d_in_plane', 'conv2d_transpose', 'conv3d_transpose', 'convolution',
-    'convolution2d', 'convolution2d_in_plane', 'convolution2d_transpose',
-    'convolution3d', 'convolution3d_transpose', 'dense_to_sparse',
-    'dropout', 'elu', 'flatten', 'fully_connected', 'GDN', 'gdn',
-    'images_to_sequence', 'layer_norm', 'linear', 'pool', 'max_pool2d',
+    'convolution1d', 'convolution2d', 'convolution2d_in_plane',
+    'convolution2d_transpose', 'convolution3d', 'convolution3d_transpose',
+    'dense_to_sparse', 'dropout', 'elu', 'flatten', 'fully_connected', 'GDN',
+    'gdn', 'images_to_sequence', 'layer_norm', 'linear', 'pool', 'max_pool2d',
     'max_pool3d', 'one_hot_encoding', 'relu', 'relu6', 'repeat',
     'scale_gradient', 'separable_conv2d', 'separable_convolution2d',
     'sequence_to_images', 'softmax', 'spatial_softmax', 'stack', 'unit_norm',
@@ -2022,6 +2022,7 @@ class GDN(base.Layer):
 
     def beta_initializer(shape, dtype=None, partition_info=None):
       del partition_info  # unused
+      pedestal = array_ops.constant(self._reparam_offset**2, dtype=self.dtype)
       return math_ops.sqrt(array_ops.ones(shape, dtype=dtype) + pedestal)
 
     def gamma_initializer(shape, dtype=None, partition_info=None):
@@ -2029,6 +2030,7 @@ class GDN(base.Layer):
       assert len(shape) == 2
       assert shape[0] == shape[1]
       eye = linalg_ops.eye(shape[0], dtype=dtype)
+      pedestal = array_ops.constant(self._reparam_offset**2, dtype=self.dtype)
       return math_ops.sqrt(self._gamma_init * eye + pedestal)
 
     beta = self.add_variable(
diff --git a/tensorflow/contrib/layers/python/layers/layers_test.py b/tensorflow/contrib/layers/python/layers/layers_test.py
index 56e9194cebbe46907707f7ac0996f9a56fb53c0f..c5c7269b1f15849956e90654e3bcf8ab0eebc393 100644
--- a/tensorflow/contrib/layers/python/layers/layers_test.py
+++ b/tensorflow/contrib/layers/python/layers/layers_test.py
@@ -1312,6 +1312,29 @@ class ConvolutionInPlaneTest(test.TestCase):
 
       self.assertAllClose(result, expected, rtol=1e-5, atol=1e-5)
 
+  def testConv1dShape(self):
+    width = 7
+    with self.test_session():
+      images = random_ops.random_uniform((5, width, 3), seed=1)
+      output = layers_lib.convolution1d(images, 32, 3)
+      self.assertEqual(output.op.name, 'Conv/Relu')
+      self.assertListEqual(output.get_shape().as_list(), [5, width, 32])
+
+  def testConvInferSpatialDims(self):
+    depth, height, width = 7, 9, 11
+    with self.test_session():
+      images = np.random.uniform(size=(5, width, 4)).astype(np.float32)
+      output = layers_lib.convolution(images, 32, [3])
+      self.assertListEqual(output.get_shape().as_list(), [5, width, 32])
+      images = np.random.uniform(size=(5, height, width, 4)).astype(np.float32)
+      output = layers_lib.convolution(images, 32, [3, 3])
+      self.assertListEqual(output.get_shape().as_list(), [5, height, width, 32])
+      images = np.random.uniform(size=(5, depth, height, width,
+                                       4)).astype(np.float32)
+      output = layers_lib.convolution(images, 32, [3, 3, 3])
+      self.assertListEqual(output.get_shape().as_list(),
+                           [5, depth, height, width, 32])
+
 
 class DenseToSparseTest(test.TestCase):
 
diff --git a/tensorflow/contrib/layers/python/layers/rev_block_lib.py b/tensorflow/contrib/layers/python/layers/rev_block_lib.py
index 8ed9f446bcd5f222f486e43125dafc595852e5ce..0e35b1aa8bf682c1b4f7e8d974d3e8fad69e33cb 100644
--- a/tensorflow/contrib/layers/python/layers/rev_block_lib.py
+++ b/tensorflow/contrib/layers/python/layers/rev_block_lib.py
@@ -46,6 +46,7 @@ from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import variable_scope
 from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.util import nest
+from tensorflow.python.util import tf_inspect
 
 __all__ = ["rev_block", "RevBlock", "recompute_grad"]
 
@@ -449,6 +450,15 @@ def recompute_grad(fn, use_data_dep=_USE_DEFAULT, tupleize_grads=False):
   `variable_scope(name, use_resource=True), which are the default in Eager mode
   and when running on TPU.
 
+  Warning: Because the function will be called again on the backwards pass, the
+  user should be careful to not use ops in their function that mutate state or
+  have randomness (for example, batch normalization or dropout). If the function
+  does have such operations, it is recommended that the function take the
+  `is_recomputing` keyword argument which will be `False` on the forward pass
+  and `True` on the backwards pass so that it can disable state changes when
+  `is_recomputing=True` (for example, not updating the moving averages in batch
+  normalization).
+
   Args:
     fn: a function that takes Tensors (all as positional arguments) and returns
       a tuple of Tensors.
@@ -482,6 +492,7 @@ def _is_on_tpu():
 
 def _recompute_grad(fn, args, use_data_dep=_USE_DEFAULT, tupleize_grads=False):
   """See recompute_grad."""
+  has_is_recompute_kwarg = "is_recomputing" in tf_inspect.getargspec(fn).args
   for arg in args:
     if not isinstance(arg, framework_ops.Tensor):
       raise ValueError("All inputs to function must be Tensors")
@@ -496,7 +507,10 @@ def _recompute_grad(fn, args, use_data_dep=_USE_DEFAULT, tupleize_grads=False):
     vs = variable_scope.get_variable_scope()
     arg_scope = contrib_framework_ops.current_arg_scope()
     with backprop.GradientTape() as tape:
-      outputs = fn(*args)
+      fn_kwargs = {}
+      if has_is_recompute_kwarg:
+        fn_kwargs["is_recomputing"] = False
+      outputs = fn(*args, **fn_kwargs)
     original_vars = set(tape.watched_variables())
 
     # Backward pass
@@ -516,7 +530,10 @@ def _recompute_grad(fn, args, use_data_dep=_USE_DEFAULT, tupleize_grads=False):
         with contrib_framework_ops.arg_scope(arg_scope):
           with variable_scope.variable_scope(vs, reuse=True):
             with backprop.GradientTape() as tape:
-              outputs = fn(*inputs)
+              fn_kwargs = {}
+              if has_is_recompute_kwarg:
+                fn_kwargs["is_recomputing"] = True
+              outputs = fn(*inputs, **fn_kwargs)
             recompute_vars = set(tape.watched_variables())
             if original_vars != recompute_vars:
               raise ValueError(_WRONG_VARS_ERR)
diff --git a/tensorflow/contrib/layers/python/layers/rev_block_lib_test.py b/tensorflow/contrib/layers/python/layers/rev_block_lib_test.py
index 997f53b9e1bbf9ac151cadd4a9f8e79c2e0ebca2..bc09ba8d439808c1582f207a99504012afcf33a6 100644
--- a/tensorflow/contrib/layers/python/layers/rev_block_lib_test.py
+++ b/tensorflow/contrib/layers/python/layers/rev_block_lib_test.py
@@ -21,9 +21,11 @@ from __future__ import print_function
 from tensorflow.contrib.layers.python.layers import layers
 from tensorflow.contrib.layers.python.layers import rev_block_lib
 from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
 from tensorflow.python.framework import random_seed
 from tensorflow.python.layers import convolutional
 from tensorflow.python.layers import core as core_layers
+from tensorflow.python.layers import normalization as normalization_layers
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import gradients_impl
 from tensorflow.python.ops import init_ops
@@ -342,6 +344,34 @@ class RecomputeTest(test.TestCase):
     for grad in grads:
       self.assertTrue(grad is not None)
 
+  def testWithIsRecomputeKwarg(self):
+
+    kwarg_values = []
+
+    @rev_block_lib.recompute_grad
+    def layer_with_recompute(inputs, is_recomputing=False):
+      kwarg_values.append(is_recomputing)
+      out = core_layers.dense(inputs, 2)
+      out = normalization_layers.batch_normalization(out, training=True)
+      if is_recomputing:
+        # Ensure that the updates are not duplicated by popping off the latest
+        # 2 additions.
+        update_ops = ops.get_collection_ref(ops.GraphKeys.UPDATE_OPS)
+        update_ops.pop()
+        update_ops.pop()
+      return out
+
+    x = array_ops.ones((2, 4), dtypes.float32)
+    with variable_scope.variable_scope("layer1", use_resource=True):
+      y = layer_with_recompute(x)
+    loss = math_ops.reduce_sum(y)
+    tvars = variables.trainable_variables()
+    gradients_impl.gradients(loss, [x] + tvars)
+
+    update_ops = ops.get_collection(ops.GraphKeys.UPDATE_OPS)
+    self.assertEqual(2, len(update_ops))
+    self.assertEqual([False, True], kwarg_values)
+
 
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/contrib/learn/BUILD b/tensorflow/contrib/learn/BUILD
index 0fdbe8f6308e30db2043c400f37d7dcb6058d1f2..b56a88659bbd4467600788fc8e3e9dbf38ce8244 100644
--- a/tensorflow/contrib/learn/BUILD
+++ b/tensorflow/contrib/learn/BUILD
@@ -284,6 +284,7 @@ py_test(
     tags = [
         "manual",
         "noasan",  # times out
+        "optonly",  # test is flaky without optimization.
     ],
     deps = [
         ":learn",
diff --git a/tensorflow/contrib/learn/python/learn/estimators/linear.py b/tensorflow/contrib/learn/python/learn/estimators/linear.py
index 70b70af98c51dcb991c19152607272673953ee2a..e100bc7a1e7be4896e9ab1c965775b5185b38897 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/linear.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/linear.py
@@ -31,7 +31,6 @@ import six
 from tensorflow.contrib import layers
 from tensorflow.contrib.framework import deprecated
 from tensorflow.contrib.framework import deprecated_arg_values
-from tensorflow.python.training import training_util
 from tensorflow.contrib.layers.python.layers import feature_column
 from tensorflow.contrib.learn.python.learn.estimators import estimator
 from tensorflow.contrib.learn.python.learn.estimators import head as head_lib
@@ -51,6 +50,7 @@ from tensorflow.python.ops import variable_scope
 from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.training import session_run_hook
 from tensorflow.python.training import training as train
+from tensorflow.python.training import training_util
 
 
 # The default learning rate of 0.2 is a historical artifact of the initial
@@ -244,7 +244,9 @@ def sdca_model_fn(features, labels, mode, params):
   parent_scope = "linear"
 
   with variable_scope.variable_scope(
-      values=features.values(), name_or_scope=parent_scope) as scope:
+      values=features.values(),
+      name_or_scope=parent_scope,
+      partitioner=optimizer.partitioner) as scope:
     features = features.copy()
     features.update(layers.transform_features(features, feature_columns))
     logits, columns_to_variables, bias = (
diff --git a/tensorflow/contrib/learn/python/learn/estimators/linear_test.py b/tensorflow/contrib/learn/python/learn/estimators/linear_test.py
index 0a863f0e20c05d3372ffd8f7677cd518390ecc9d..597ca4e86dbf66c86182f14a2a364b662d52fb0a 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/linear_test.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/linear_test.py
@@ -43,6 +43,7 @@ from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import sparse_tensor
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import partitioned_variables
 from tensorflow.python.platform import test
 from tensorflow.python.training import ftrl
 from tensorflow.python.training import input as input_lib
@@ -966,6 +967,63 @@ class LinearClassifierTest(test.TestCase):
     scores = classifier.evaluate(input_fn=input_fn, steps=1)
     self.assertGreater(scores['accuracy'], 0.9)
 
+  def testSdcaOptimizerPartitionedVariables(self):
+    """Tests LinearClassifier with SDCAOptimizer with partitioned variables."""
+
+    def input_fn():
+      return {
+          'example_id':
+              constant_op.constant(['1', '2', '3']),
+          'price':
+              constant_op.constant([[0.6], [0.8], [0.3]]),
+          'sq_footage':
+              constant_op.constant([[900.0], [700.0], [600.0]]),
+          'country':
+              sparse_tensor.SparseTensor(
+                  values=['IT', 'US', 'GB'],
+                  indices=[[0, 0], [1, 3], [2, 1]],
+                  dense_shape=[3, 5]),
+          'weights':
+              constant_op.constant([[3.0], [1.0], [1.0]])
+      }, constant_op.constant([[1], [0], [1]])
+
+    price = feature_column_lib.real_valued_column('price')
+    sq_footage_bucket = feature_column_lib.bucketized_column(
+        feature_column_lib.real_valued_column('sq_footage'),
+        boundaries=[650.0, 800.0])
+    country = feature_column_lib.sparse_column_with_hash_bucket(
+        'country', hash_bucket_size=5)
+    sq_footage_country = feature_column_lib.crossed_column(
+        [sq_footage_bucket, country], hash_bucket_size=10)
+
+    sdca_optimizer = sdca_optimizer_lib.SDCAOptimizer(
+        example_id_column='example_id',
+        partitioner=partitioned_variables.fixed_size_partitioner(
+            num_shards=2, axis=0))
+
+    tf_config = {
+        'cluster': {
+            run_config.TaskType.PS: ['fake_ps_0', 'fake_ps_1']
+        }
+    }
+    with test.mock.patch.dict('os.environ',
+                              {'TF_CONFIG': json.dumps(tf_config)}):
+      config = run_config.RunConfig()
+      # Because we did not start a distributed cluster, we need to pass an
+      # empty ClusterSpec, otherwise the device_setter will look for
+      # distributed jobs, such as "/job:ps" which are not present.
+      config._cluster_spec = server_lib.ClusterSpec({})
+
+    classifier = linear.LinearClassifier(
+        feature_columns=[price, sq_footage_bucket, country, sq_footage_country],
+        weight_column_name='weights',
+        optimizer=sdca_optimizer,
+        config=config)
+    classifier.fit(input_fn=input_fn, steps=50)
+    scores = classifier.evaluate(input_fn=input_fn, steps=1)
+    print('all scores = {}'.format(scores))
+    self.assertGreater(scores['accuracy'], 0.9)
+
   def testEval(self):
     """Tests that eval produces correct metrics.
     """
@@ -1540,6 +1598,60 @@ class LinearRegressorTest(test.TestCase):
     loss = regressor.evaluate(input_fn=input_fn, steps=1)['loss']
     self.assertLess(loss, 0.05)
 
+  def testSdcaOptimizerPartitionedVariables(self):
+    """Tests LinearRegressor with SDCAOptimizer with partitioned variables."""
+
+    def input_fn():
+      return {
+          'example_id':
+              constant_op.constant(['1', '2', '3']),
+          'price':
+              constant_op.constant([0.6, 0.8, 0.3]),
+          'sq_footage':
+              constant_op.constant([[900.0], [700.0], [600.0]]),
+          'country':
+              sparse_tensor.SparseTensor(
+                  values=['IT', 'US', 'GB'],
+                  indices=[[0, 0], [1, 3], [2, 1]],
+                  dense_shape=[3, 5]),
+          'weights':
+              constant_op.constant([[3.0], [5.0], [7.0]])
+      }, constant_op.constant([[1.55], [-1.25], [-3.0]])
+
+    price = feature_column_lib.real_valued_column('price')
+    sq_footage_bucket = feature_column_lib.bucketized_column(
+        feature_column_lib.real_valued_column('sq_footage'),
+        boundaries=[650.0, 800.0])
+    country = feature_column_lib.sparse_column_with_hash_bucket(
+        'country', hash_bucket_size=5)
+    sq_footage_country = feature_column_lib.crossed_column(
+        [sq_footage_bucket, country], hash_bucket_size=10)
+    sdca_optimizer = sdca_optimizer_lib.SDCAOptimizer(
+        example_id_column='example_id', symmetric_l2_regularization=1.0,
+        partitioner=partitioned_variables.fixed_size_partitioner(
+            num_shards=2, axis=0))
+    tf_config = {
+        'cluster': {
+            run_config.TaskType.PS: ['fake_ps_0', 'fake_ps_1']
+        }
+    }
+    with test.mock.patch.dict('os.environ',
+                              {'TF_CONFIG': json.dumps(tf_config)}):
+      config = run_config.RunConfig()
+      # Because we did not start a distributed cluster, we need to pass an
+      # empty ClusterSpec, otherwise the device_setter will look for
+      # distributed jobs, such as "/job:ps" which are not present.
+      config._cluster_spec = server_lib.ClusterSpec({})
+
+    regressor = linear.LinearRegressor(
+        feature_columns=[price, sq_footage_bucket, country, sq_footage_country],
+        weight_column_name='weights',
+        optimizer=sdca_optimizer,
+        config=config)
+    regressor.fit(input_fn=input_fn, steps=20)
+    loss = regressor.evaluate(input_fn=input_fn, steps=1)['loss']
+    self.assertLess(loss, 0.05)
+
   def testSdcaOptimizerSparseFeaturesWithL1Reg(self):
     """Tests LinearClassifier with SDCAOptimizer and sparse features."""
 
diff --git a/tensorflow/contrib/learn/python/learn/experiment.py b/tensorflow/contrib/learn/python/learn/experiment.py
index 541da9061732ad271f6d5456446a9c30b81e58dd..f8a3709ee57a32734afa7ac8133271c75d152b2c 100644
--- a/tensorflow/contrib/learn/python/learn/experiment.py
+++ b/tensorflow/contrib/learn/python/learn/experiment.py
@@ -505,7 +505,7 @@ class Experiment(object):
     eval_result = None
     last_warning_time = 0
     while (not predicate_fn or predicate_fn(
-        eval_result, checkpoint_path=previous_path if eval_result else None)):
+        eval_result, checkpoint_path=previous_path)):
       # Exit if we have already reached number of steps to train.
       if self._has_training_stopped(eval_result):
         logging.info("Exiting continuous eval, global_step=%s >= "
diff --git a/tensorflow/contrib/learn/python/learn/experiment_test.py b/tensorflow/contrib/learn/python/learn/experiment_test.py
index d10927a0cdd5c67c8d2a8e569153235ee175ec4d..fb16c94c29660e2777942ea9cf30da51dbf90571 100644
--- a/tensorflow/contrib/learn/python/learn/experiment_test.py
+++ b/tensorflow/contrib/learn/python/learn/experiment_test.py
@@ -500,7 +500,7 @@ class ExperimentTest(test.TestCase):
       noop_hook = _NoopHook()
 
       def _predicate_fn(eval_result, checkpoint_path):
-        self.assertEqual(not eval_result,
+        self.assertEqual(eval_result is None,
                          checkpoint_path is None)
         return est.eval_count < 3  # pylint: disable=cell-var-from-loop
 
diff --git a/tensorflow/contrib/linear_optimizer/python/kernel_tests/sdca_ops_test.py b/tensorflow/contrib/linear_optimizer/python/kernel_tests/sdca_ops_test.py
index b5741967ab52568725d7c9f03a0cc0b0f63f7459..ef0e08a777779e04f70d11fe83280ccaf1c178fd 100644
--- a/tensorflow/contrib/linear_optimizer/python/kernel_tests/sdca_ops_test.py
+++ b/tensorflow/contrib/linear_optimizer/python/kernel_tests/sdca_ops_test.py
@@ -35,6 +35,8 @@ from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import gen_sdca_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import parsing_ops
+from tensorflow.python.ops import partitioned_variables
+from tensorflow.python.ops import variable_scope
 from tensorflow.python.ops import variables as variables_lib
 from tensorflow.python.platform import googletest
 
@@ -132,15 +134,22 @@ def make_random_examples_and_variables_dicts(num_examples, dim, num_non_zero):
   return examples_dict, variables_dict
 
 
-def make_variable_dict(max_age, max_gender):
+def make_variable_dict(max_age, max_gender, partitioned=False):
   # TODO(sibyl-toe9oF2e):  Figure out how to derive max_age & max_gender from
   # examples_dict.
-  age_weights = variables_lib.Variable(
-      array_ops.zeros(
-          [max_age + 1], dtype=dtypes.float32))
-  gender_weights = variables_lib.Variable(
-      array_ops.zeros(
-          [max_gender + 1], dtype=dtypes.float32))
+  partitioner = None
+  if partitioned:
+    partitioner = partitioned_variables.fixed_size_partitioner(num_shards=2,
+                                                               axis=0)
+  with variable_scope.variable_scope(
+      name_or_scope='variables',
+      partitioner=partitioner):
+    age_weights = variables_lib.Variable(
+        array_ops.zeros(
+            [max_age + 1], dtype=dtypes.float32))
+    gender_weights = variables_lib.Variable(
+        array_ops.zeros(
+            [max_gender + 1], dtype=dtypes.float32))
   return dict(
       sparse_features_weights=[age_weights, gender_weights],
       dense_features_weights=[])
@@ -265,6 +274,54 @@ class SdcaWithLogisticLossTest(SdcaModelTest):
         self.assertAllClose(
             0.01, lr.approximate_duality_gap().eval(), rtol=1e-2, atol=1e-2)
 
+  def testPartitionedPrimals(self):
+    # Setup test data
+    example_protos = [
+        make_example_proto({
+            'age': [0],
+            'gender': [0]
+        }, 0),
+        make_example_proto({
+            'age': [1],
+            'gender': [1]
+        }, 1),
+    ]
+    example_weights = [1.0, 1.0]
+    for num_shards in _SHARD_NUMBERS:
+      with self._single_threaded_test_session():
+        examples = make_example_dict(example_protos, example_weights)
+        variables = make_variable_dict(1, 1, partitioned=True)
+        options = dict(
+            symmetric_l2_regularization=1,
+            symmetric_l1_regularization=0,
+            num_table_shards=num_shards,
+            loss_type='logistic_loss')
+
+        lr = SdcaModel(examples, variables, options)
+        variables_lib.global_variables_initializer().run()
+        unregularized_loss = lr.unregularized_loss(examples)
+        loss = lr.regularized_loss(examples)
+        predictions = lr.predictions(examples)
+        self.assertAllClose(0.693147, unregularized_loss.eval())
+        self.assertAllClose(0.693147, loss.eval())
+        train_op = lr.minimize()
+        for _ in range(_MAX_ITERATIONS):
+          train_op.run()
+        lr.update_weights(train_op).run()
+        # The high tolerance in unregularized_loss comparisons is due to the
+        # fact that it's possible to trade off unregularized_loss vs.
+        # regularization and still have a sum that is quite close to the
+        # optimal regularized_loss value.  SDCA's duality gap only ensures that
+        # the regularized_loss is within 0.01 of optimal.
+        # 0.525457 is the optimal regularized_loss.
+        # 0.411608 is the unregularized_loss at that optimum.
+        self.assertAllClose(0.411608, unregularized_loss.eval(), atol=0.05)
+        self.assertAllClose(0.525457, loss.eval(), atol=0.01)
+        predicted_labels = get_binary_predictions_for_logistic(predictions)
+        self.assertAllEqual([0, 1], predicted_labels.eval())
+        self.assertAllClose(
+            0.01, lr.approximate_duality_gap().eval(), rtol=1e-2, atol=1e-2)
+
   def testSparseRandom(self):
     dim = 20
     num_examples = 1000
@@ -320,7 +377,10 @@ class SdcaWithLogisticLossTest(SdcaModelTest):
         train_op.run()
 
   def testDistributedSimple(self):
-    # Setup test data
+    # Distributed SDCA may not converge if the workers update concurrently the
+    # same example. In this test the examples are partitioned across workers.
+    # The examples are the same for all workers, just the example_ids are
+    # different.
     example_protos = [
         make_example_proto({
             'age': [0],
@@ -332,13 +392,19 @@ class SdcaWithLogisticLossTest(SdcaModelTest):
         }, 1),
     ]
     example_weights = [1.0, 1.0]
+    examples = make_example_dict(example_protos, example_weights)
+    example_ids = array_ops.placeholder(
+        dtypes.string, shape=(len(example_weights),))
+    examples['example_ids'] = example_ids
+    variables = make_variable_dict(1, 1)
     for num_shards in _SHARD_NUMBERS:
       for num_loss_partitions in _NUM_LOSS_PARTITIONS:
         with self._single_threaded_test_session():
-          examples = make_example_dict(example_protos, example_weights)
-          variables = make_variable_dict(1, 1)
           options = dict(
-              symmetric_l2_regularization=1,
+              # Keep the same solution as for TestSimple: since the number of
+              # examples is multplied by num_loss_partitions, multiply also
+              # L2 by the same value.
+              symmetric_l2_regularization=num_loss_partitions,
               symmetric_l1_regularization=0,
               loss_type='logistic_loss',
               num_table_shards=num_shards,
@@ -354,32 +420,30 @@ class SdcaWithLogisticLossTest(SdcaModelTest):
 
           train_op = lr.minimize()
 
-          def minimize():
+          def minimize(worker_id):
             with self._single_threaded_test_session():
+              feed_dict = {example_ids: [
+                  str(i + worker_id*len(example_weights)) for i in range(
+                      len(example_weights))]}
               for _ in range(_MAX_ITERATIONS):
-                train_op.run()  # pylint: disable=cell-var-from-loop
+                train_op.run(feed_dict=feed_dict)  # pylint: disable=cell-var-from-loop
 
           threads = []
-          for _ in range(num_loss_partitions):
-            threads.append(threading.Thread(target=minimize))
+          for worker_id in range(num_loss_partitions):
+            threads.append(threading.Thread(target=minimize, args=(worker_id,)))
             threads[-1].start()
 
           for t in threads:
             t.join()
-          lr.update_weights(train_op).run()
-
-          # The high tolerance in unregularized_loss comparisons is due to the
-          # fact that it's possible to trade off unregularized_loss vs.
-          # regularization and still have a sum that is quite close to the
-          # optimal regularized_loss value.  SDCA's duality gap only ensures
-          # that the regularized_loss is within 0.01 of optimal.
-          # 0.525457 is the optimal regularized_loss.
-          # 0.411608 is the unregularized_loss at that optimum.
-          self.assertAllClose(0.411608, unregularized_loss.eval(), atol=0.05)
-          self.assertAllClose(0.525457, loss.eval(), atol=0.01)
+          lr.update_weights(train_op).run(feed_dict={
+              example_ids: [str(i) for i in range(len(example_weights))]})
+
+          # Test only the unregularized loss because the optimal value of the
+          # regularized loss depends on num_loss_partitions.
+          self.assertAllClose(0.411608, unregularized_loss.eval(), atol=0.02)
           predicted_labels = get_binary_predictions_for_logistic(predictions)
           self.assertAllEqual([0, 1], predicted_labels.eval())
-          self.assertTrue(lr.approximate_duality_gap().eval() < 0.02)
+          self.assertNear(0.0, lr.approximate_duality_gap().eval(), 0.02)
 
   def testSimpleNoL2(self):
     # Same as test above (so comments from above apply) but without an L2.
diff --git a/tensorflow/contrib/linear_optimizer/python/ops/sdca_ops.py b/tensorflow/contrib/linear_optimizer/python/ops/sdca_ops.py
index f980746a19fb8e0a02b9d023c127da7ab33e457f..0047d5753a773ce814d685f89da9ae6b04d21cb6 100644
--- a/tensorflow/contrib/linear_optimizer/python/ops/sdca_ops.py
+++ b/tensorflow/contrib/linear_optimizer/python/ops/sdca_ops.py
@@ -22,12 +22,14 @@ import collections
 from six.moves import range
 
 from tensorflow.contrib.linear_optimizer.python.ops.sharded_mutable_dense_hashtable import ShardedMutableDenseHashTable
+from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.framework.ops import internal_convert_to_tensor
 from tensorflow.python.framework.ops import name_scope
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import control_flow_ops
+from tensorflow.python.ops import data_flow_ops
 from tensorflow.python.ops import gen_sdca_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import nn_ops
@@ -43,9 +45,6 @@ __all__ = ['SdcaModel']
 class SdcaModel(object):
   """Stochastic dual coordinate ascent solver for linear models.
 
-    This class currently only supports a single machine (multi-threaded)
-    implementation. We expect the weights and duals to fit in a single machine.
-
     Loss functions supported:
 
      * Binary logistic loss
@@ -182,18 +181,41 @@ class SdcaModel(object):
 
   # TODO(sibyl-Aix6ihai): Use optimizer interface to make use of slot creation logic.
   def _create_slots(self):
-    # Make internal variables which have the updates before applying L1
-    # regularization.
+    """Make unshrinked internal variables (slots)."""
+    # Unshrinked variables have the updates before applying L1 regularization.
+    # Each unshrinked slot variable is either a `Variable` or list of
+    # `Variable`, depending on the value of its corresponding primary variable.
+    # We avoid using `PartitionedVariable` for the unshrinked slots since we do
+    # not need any of the extra information.
     self._slots = collections.defaultdict(list)
     for name in ['sparse_features_weights', 'dense_features_weights']:
       for var in self._variables[name]:
-        with ops.device(var.device):
-          # TODO(andreasst): remove SDCAOptimizer suffix once bug 30843109 is
-          # fixed
-          self._slots['unshrinked_' + name].append(
-              var_ops.Variable(
-                  array_ops.zeros_like(var.initialized_value(), dtypes.float32),
-                  name=var.op.name + '_unshrinked/SDCAOptimizer'))
+        # Our primary variable may be either a PartitionedVariable, or a list
+        # of Variables (each representing a partition).
+        if (isinstance(var, var_ops.PartitionedVariable) or
+            isinstance(var, list)):
+          var_list = []
+          # pylint: disable=protected-access
+          for v in var:
+            with ops.colocate_with(v):
+              # TODO(andreasst): remove SDCAOptimizer suffix once bug 30843109
+              # is fixed.
+              slot_var = var_ops.Variable(
+                  initial_value=array_ops.zeros_like(v.initialized_value(),
+                                                     dtypes.float32),
+                  name=v.op.name + '_unshrinked/SDCAOptimizer')
+              var_list.append(slot_var)
+          self._slots['unshrinked_' + name].append(var_list)
+          # pylint: enable=protected-access
+        else:
+          with ops.device(var.device):
+            # TODO(andreasst): remove SDCAOptimizer suffix once bug 30843109 is
+            # fixed.
+            self._slots['unshrinked_' + name].append(
+                var_ops.Variable(
+                    array_ops.zeros_like(var.initialized_value(),
+                                         dtypes.float32),
+                    name=var.op.name + '_unshrinked/SDCAOptimizer'))
 
   def _assertSpecified(self, items, check_in):
     for x in items:
@@ -205,16 +227,25 @@ class SdcaModel(object):
       if not isinstance(check_in[x], list):
         raise ValueError(x + ' must be a list.')
 
+  def _var_to_list(self, var):
+    """Wraps var in a list if it is not a list or PartitionedVariable."""
+    if not (isinstance(var, list) or
+            isinstance(var, var_ops.PartitionedVariable)):
+      var = [var]
+    return var
+
   def _l1_loss(self):
     """Computes the (un-normalized) l1 loss of the model."""
     with name_scope('sdca/l1_loss'):
       sums = []
       for name in ['sparse_features_weights', 'dense_features_weights']:
-        for weights in self._convert_n_to_tensor(self._variables[name]):
-          with ops.device(weights.device):
-            sums.append(
-                math_ops.reduce_sum(
-                    math_ops.abs(math_ops.cast(weights, dtypes.float64))))
+        for var in self._variables[name]:
+          for v in self._var_to_list(var):
+            weights = internal_convert_to_tensor(v)
+            with ops.device(weights.device):
+              sums.append(
+                  math_ops.reduce_sum(
+                      math_ops.abs(math_ops.cast(weights, dtypes.float64))))
       # SDCA L1 regularization cost is: l1 * sum(|weights|)
       return self._options['symmetric_l1_regularization'] * math_ops.add_n(sums)
 
@@ -223,17 +254,37 @@ class SdcaModel(object):
     with name_scope('sdca/l2_loss'):
       sums = []
       for name in ['sparse_features_weights', 'dense_features_weights']:
-        for weights in self._convert_n_to_tensor(self._variables[name]):
-          with ops.device(weights.device):
-            sums.append(
-                math_ops.reduce_sum(
-                    math_ops.square(math_ops.cast(weights, dtypes.float64))))
+        for var in self._variables[name]:
+          for v in self._var_to_list(var):
+            weights = internal_convert_to_tensor(v)
+            with ops.device(weights.device):
+              sums.append(math_ops.reduce_sum(math_ops.square(math_ops.cast(
+                  weights, dtypes.float64))))
       # SDCA L2 regularization cost is: l2 * sum(weights^2) / 2
       return l2 * math_ops.add_n(sums) / 2.0
 
   def _convert_n_to_tensor(self, input_list, as_ref=False):
     """Converts input list to a set of tensors."""
-    return [internal_convert_to_tensor(x, as_ref=as_ref) for x in input_list]
+    # input_list can be a list of Variables (that are implicitly partitioned),
+    # in which case the underlying logic in internal_convert_to_tensor will not
+    # concatenate the partitions together.  This method takes care of the
+    # concatenating (we only allow partitioning on the first axis).
+    output_list = []
+    for x in input_list:
+      tensor_to_convert = x
+      if isinstance(x, list) or isinstance(x, var_ops.PartitionedVariable):
+        # We only allow for partitioning on the first axis.
+        tensor_to_convert = array_ops.concat(x, axis=0)
+      output_list.append(internal_convert_to_tensor(
+          tensor_to_convert, as_ref=as_ref))
+    return output_list
+
+  def _get_first_dimension_size_statically(self, w, num_partitions):
+    """Compute the static size of the first dimension for a sharded variable."""
+    dim_0_size = w[0].get_shape()[0]
+    for p in range(1, num_partitions):
+      dim_0_size += w[p].get_shape()[0]
+    return dim_0_size
 
   def _linear_predictions(self, examples):
     """Returns predictions of the form w*x."""
@@ -286,6 +337,28 @@ class SdcaModel(object):
         result = math_ops.sigmoid(result)
     return result
 
+  def _get_partitioned_update_ops(self,
+                                  v_num,
+                                  num_partitions_by_var,
+                                  p_assignments_by_var,
+                                  gather_ids_by_var,
+                                  weights,
+                                  full_update,
+                                  p_assignments,
+                                  num_partitions):
+    """Get updates for partitioned variables."""
+    num_partitions = num_partitions_by_var[v_num]
+    p_assignments = p_assignments_by_var[v_num]
+    gather_ids = gather_ids_by_var[v_num]
+    updates = data_flow_ops.dynamic_partition(
+        full_update, p_assignments, num_partitions)
+    update_ops = []
+    for p in range(num_partitions):
+      with ops.colocate_with(weights[p]):
+        result = state_ops.scatter_add(weights[p], gather_ids[p], updates[p])
+      update_ops.append(result)
+    return update_ops
+
   def minimize(self, global_step=None, name=None):
     """Add operations to train a linear model by minimizing the loss function.
 
@@ -318,18 +391,89 @@ class SdcaModel(object):
       # Solver returns example_state_update, new delta sparse_feature_weights
       # and delta dense_feature_weights.
 
-      weights_tensor = self._convert_n_to_tensor(self._slots[
-          'unshrinked_sparse_features_weights'])
       sparse_weights = []
       sparse_indices = []
-      for w, i in zip(weights_tensor, sparse_feature_indices):
-        # Find the feature ids to lookup in the variables.
-        with ops.device(w.device):
-          sparse_indices.append(
-              math_ops.cast(
-                  array_ops.unique(math_ops.cast(i, dtypes.int32))[0],
-                  dtypes.int64))
-          sparse_weights.append(array_ops.gather(w, sparse_indices[-1]))
+      # If we have partitioned variables, keep a few lists of Tensors around
+      # that we need for the assign_add after the op call to
+      # gen_sdca_ops.sdca_optimizer().
+      num_partitions_by_var = []
+      p_assignments_by_var = []
+      gather_ids_by_var = []
+      for w, i in zip(self._slots['unshrinked_sparse_features_weights'],
+                      sparse_feature_indices):
+        # Append the sparse_indices (in full-variable space).
+        sparse_idx = math_ops.cast(
+            array_ops.unique(math_ops.cast(i, dtypes.int32))[0],
+            dtypes.int64)
+        sparse_indices.append(sparse_idx)
+        if isinstance(w, list) or isinstance(w, var_ops.PartitionedVariable):
+          num_partitions = len(w)
+          flat_ids = array_ops.reshape(sparse_idx, [-1])
+          # We use div partitioning, which is easiest to support downstream.
+          # Compute num_total_ids as the sum of dim-0 of w, then assign
+          # to partitions based on a constant number of ids per partition.
+          # Optimize if we already know the full shape statically.
+          dim_0_size = self._get_first_dimension_size_statically(
+              w, num_partitions)
+
+          if dim_0_size.value:
+            num_total_ids = constant_op.constant(dim_0_size.value,
+                                                 flat_ids.dtype)
+          else:
+            dim_0_sizes = []
+            for p in range(num_partitions):
+              if w[p].get_shape()[0].value is not None:
+                dim_0_sizes.append(w[p].get_shape()[0].value)
+              else:
+                with ops.colocate_with(w[p]):
+                  dim_0_sizes.append(array_ops.shape(w[p])[0])
+            num_total_ids = math_ops.reduce_sum(
+                math_ops.cast(array_ops.stack(dim_0_sizes), flat_ids.dtype))
+          ids_per_partition = num_total_ids // num_partitions
+          extras = num_total_ids % num_partitions
+
+          p_assignments = math_ops.maximum(
+              flat_ids // (ids_per_partition + 1),
+              (flat_ids - extras) // ids_per_partition)
+
+          # Emulate a conditional using a boolean indicator tensor
+          new_ids = array_ops.where(p_assignments < extras,
+                                    flat_ids % (ids_per_partition + 1),
+                                    (flat_ids - extras) % ids_per_partition)
+
+          # Cast partition assignments to int32 for use in dynamic_partition.
+          # There really should not be more than 2^32 partitions.
+          p_assignments = math_ops.cast(p_assignments, dtypes.int32)
+          # Partition list of ids based on assignments into num_partitions
+          # separate lists.
+          gather_ids = data_flow_ops.dynamic_partition(new_ids,
+                                                       p_assignments,
+                                                       num_partitions)
+          # Append these to the lists for use in the later update.
+          num_partitions_by_var.append(num_partitions)
+          p_assignments_by_var.append(p_assignments)
+          gather_ids_by_var.append(gather_ids)
+
+          # Gather the weights from each partition.
+          partition_gathered_weights = []
+          for p in range(num_partitions):
+            with ops.colocate_with(w[p]):
+              partition_gathered_weights.append(
+                  array_ops.gather(w[p], gather_ids[p]))
+
+          # Stitch the weights back together in the same order they were before
+          # we dynamic_partitioned them.
+          condition_indices = data_flow_ops.dynamic_partition(
+              math_ops.range(array_ops.shape(new_ids)[0]),
+              p_assignments, num_partitions)
+          batch_gathered_weights = data_flow_ops.dynamic_stitch(
+              condition_indices, partition_gathered_weights)
+        else:
+          w_as_tensor = internal_convert_to_tensor(w)
+          with ops.device(w_as_tensor.device):
+            batch_gathered_weights = array_ops.gather(
+                w_as_tensor, sparse_idx)
+        sparse_weights.append(batch_gathered_weights)
 
       # pylint: disable=protected-access
       esu, sfw, dfw = gen_sdca_ops.sdca_optimizer(
@@ -355,12 +499,25 @@ class SdcaModel(object):
       with ops.control_dependencies([esu]):
         update_ops = [self._hashtable.insert(example_ids_hashed, esu)]
         # Update the weights before the proximal step.
-        for w, i, u in zip(self._slots['unshrinked_sparse_features_weights'],
-                           sparse_indices, sfw):
-          update_ops.append(state_ops.scatter_add(w, i, u))
+        for v_num, (w, i, u) in enumerate(
+            zip(self._slots['unshrinked_sparse_features_weights'],
+                sparse_indices, sfw)):
+          if (isinstance(w, var_ops.PartitionedVariable) or
+              isinstance(w, list)):
+            update_ops += self._get_partitioned_update_ops(
+                v_num, num_partitions_by_var, p_assignments_by_var,
+                gather_ids_by_var, w, u, p_assignments, num_partitions)
+          else:
+            update_ops.append(state_ops.scatter_add(w, i, u))
         for w, u in zip(self._slots['unshrinked_dense_features_weights'], dfw):
-          update_ops.append(w.assign_add(u))
-
+          if (isinstance(w, var_ops.PartitionedVariable) or
+              isinstance(w, list)):
+            split_updates = array_ops.split(
+                u, num_or_size_splits=[v.shape.as_list()[0] for v in w])
+            for v, split_update in zip(w, split_updates):
+              update_ops.append(state_ops.assign_add(v, split_update))
+          else:
+            update_ops.append(state_ops.assign_add(w, u))
       if not global_step:
         return control_flow_ops.group(*update_ops)
       with ops.control_dependencies(update_ops):
@@ -385,21 +542,22 @@ class SdcaModel(object):
       for name in ['sparse_features_weights', 'dense_features_weights']:
         for var, slot_var in zip(self._variables[name],
                                  self._slots['unshrinked_' + name]):
-          update_ops.append(var.assign(slot_var))
+          for v, sv in zip(self._var_to_list(var), self._var_to_list(slot_var)):
+            update_ops.append(v.assign(sv))
 
     # Apply proximal step.
     with ops.control_dependencies(update_ops):
       update_ops = []
       for name in ['sparse_features_weights', 'dense_features_weights']:
         for var in self._variables[name]:
-          with ops.device(var.device):
-            # pylint: disable=protected-access
-            update_ops.append(
-                gen_sdca_ops.sdca_shrink_l1(
-                    self._convert_n_to_tensor(
-                        [var], as_ref=True),
-                    l1=self._symmetric_l1_regularization(),
-                    l2=self._symmetric_l2_regularization()))
+          for v in self._var_to_list(var):
+            with ops.device(v.device):
+              # pylint: disable=protected-access
+              update_ops.append(
+                  gen_sdca_ops.sdca_shrink_l1(
+                      self._convert_n_to_tensor([v], as_ref=True),
+                      l1=self._symmetric_l1_regularization(),
+                      l2=self._symmetric_l2_regularization()))
       return control_flow_ops.group(*update_ops)
 
   def approximate_duality_gap(self):
diff --git a/tensorflow/contrib/linear_optimizer/python/sdca_estimator.py b/tensorflow/contrib/linear_optimizer/python/sdca_estimator.py
index d4e54c82f988e0adcd16aad29702ee9f8b16aea3..200e7de6b95f17672c6ef51f887b15f9d185f775 100644
--- a/tensorflow/contrib/linear_optimizer/python/sdca_estimator.py
+++ b/tensorflow/contrib/linear_optimizer/python/sdca_estimator.py
@@ -116,6 +116,7 @@ def sdca_model_fn(features, labels, mode, params, config=None):
   num_loss_partitions = params["num_loss_partitions"]
   weight_column_name = params["weight_column_name"]
   update_weights_hook = params.get("update_weights_hook", None)
+  partitioner = params["partitioner"]
 
   loss_type = None
   if isinstance(head, head_lib._BinarySvmHead):  # pylint: disable=protected-access
@@ -136,12 +137,14 @@ def sdca_model_fn(features, labels, mode, params, config=None):
       example_id_column=example_id_column,
       num_loss_partitions=n_loss_partitions,
       symmetric_l1_regularization=l1_regularization,
-      symmetric_l2_regularization=l2_regularization)
+      symmetric_l2_regularization=l2_regularization,
+      partitioner=partitioner)
 
   parent_scope = "linear"
 
   with variable_scope.variable_scope(
-      values=features.values(), name_or_scope=parent_scope) as scope:
+      values=features.values(), name_or_scope=parent_scope,
+      partitioner=partitioner) as scope:
     features = features.copy()
     features.update(layers.transform_features(features, feature_columns))
     logits, columns_to_variables, bias = (
@@ -213,7 +216,8 @@ class _SDCAEstimator(estimator.Estimator):
                l2_regularization=1.0,
                num_loss_partitions=None,
                config=None,
-               feature_engineering_fn=None):
+               feature_engineering_fn=None,
+               partitioner=None):
     """Construct a `_SDCAEstimator` estimator object.
 
     Args:
@@ -241,6 +245,8 @@ class _SDCAEstimator(estimator.Estimator):
       feature_engineering_fn: Feature engineering function. Takes features and
         labels which are the output of `input_fn` and returns features and
         labels which will be fed into the model.
+      partitioner: Variable partitioner for the primal weights (`div`
+        partitioning strategy will be used).
 
     Returns:
       A `_SDCAEstimator` estimator.
@@ -267,6 +273,7 @@ class _SDCAEstimator(estimator.Estimator):
         "l2_regularization": l2_regularization,
         "weight_column_name": weight_column_name,
         "update_weights_hook": _SdcaUpdateWeightsHook(),
+        "partitioner": partitioner,
     }
 
     super(_SDCAEstimator, self).__init__(
@@ -336,7 +343,8 @@ class SDCALogisticClassifier(_SDCAEstimator):
                l2_regularization=1.0,
                num_loss_partitions=None,
                config=None,
-               feature_engineering_fn=None):
+               feature_engineering_fn=None,
+               partitioner=None):
     """Construct a `SDCALogisticClassifier` object.
 
     Args:
@@ -361,6 +369,8 @@ class SDCALogisticClassifier(_SDCAEstimator):
       feature_engineering_fn: Feature engineering function. Takes features and
         labels which are the output of `input_fn` and returns features and
         labels which will be fed into the model.
+      partitioner: Variable partitioner for the primal weights (`div`
+        partitioning strategy will be used).
 
     Returns:
       A `SDCALogisiticClassifier` estimator.
@@ -376,7 +386,8 @@ class SDCALogisticClassifier(_SDCAEstimator):
         l2_regularization=l2_regularization,
         num_loss_partitions=num_loss_partitions,
         config=config,
-        feature_engineering_fn=None)
+        feature_engineering_fn=None,
+        partitioner=partitioner)
 
   def predict_classes(self, input_fn=None):
     """Runs inference to determine the predicted class.
@@ -463,7 +474,8 @@ class SDCALinearRegressor(_SDCAEstimator):
                l2_regularization=1.0,
                num_loss_partitions=None,
                config=None,
-               feature_engineering_fn=None):
+               feature_engineering_fn=None,
+               partitioner=None):
     """Construct a `SDCALinearRegressor` estimator object.
 
 
@@ -489,6 +501,8 @@ class SDCALinearRegressor(_SDCAEstimator):
       feature_engineering_fn: Feature engineering function. Takes features and
         labels which are the output of `input_fn` and returns features and
         labels which will be fed into the model.
+      partitioner: Variable partitioner for the primal weights (`div`
+        partitioning strategy will be used).
 
     Returns:
       A `SDCALinearRegressor` estimator.
@@ -503,7 +517,8 @@ class SDCALinearRegressor(_SDCAEstimator):
         l2_regularization=l2_regularization,
         num_loss_partitions=num_loss_partitions,
         config=config,
-        feature_engineering_fn=None)
+        feature_engineering_fn=None,
+        partitioner=partitioner)
 
   def predict_scores(self, input_fn):
     """Returns predicted scores for given features.
diff --git a/tensorflow/contrib/linear_optimizer/python/sdca_estimator_test.py b/tensorflow/contrib/linear_optimizer/python/sdca_estimator_test.py
index bed3d5139fcbf9d9e8b85605c752736f26af6793..647667188238dc18b137eaad98356a79b3a549b4 100644
--- a/tensorflow/contrib/linear_optimizer/python/sdca_estimator_test.py
+++ b/tensorflow/contrib/linear_optimizer/python/sdca_estimator_test.py
@@ -25,6 +25,7 @@ from tensorflow.contrib.linear_optimizer.python import sdca_estimator
 from tensorflow.core.protobuf import config_pb2
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import sparse_tensor
+from tensorflow.python.ops import partitioned_variables
 from tensorflow.python.platform import test
 
 
@@ -273,6 +274,47 @@ class SDCALogisticClassifierTest(test.TestCase):
       metrics = classifier.evaluate(input_fn=input_fn, steps=1)
       self.assertGreater(metrics['accuracy'], 0.9)
 
+  def testPartitionedMixedFeatures(self):
+    """Tests SDCALogisticClassifier with a mix of features (partitioned)."""
+
+    def input_fn():
+      return {
+          'example_id':
+              constant_op.constant(['1', '2', '3']),
+          'price':
+              constant_op.constant([[0.6], [0.8], [0.3]]),
+          'sq_footage':
+              constant_op.constant([900.0, 700.0, 600.0]),
+          'country':
+              sparse_tensor.SparseTensor(
+                  values=['IT', 'US', 'GB'],
+                  indices=[[0, 0], [1, 3], [2, 1]],
+                  dense_shape=[3, 5]),
+          'weights':
+              constant_op.constant([[3.0], [1.0], [1.0]])
+      }, constant_op.constant([[1], [0], [1]])
+
+    with self._single_threaded_test_session():
+      price = feature_column_lib.real_valued_column('price')
+      sq_footage_bucket = feature_column_lib.bucketized_column(
+          feature_column_lib.real_valued_column('sq_footage'),
+          boundaries=[650.0, 800.0])
+      country = feature_column_lib.sparse_column_with_hash_bucket(
+          'country', hash_bucket_size=5)
+      sq_footage_country = feature_column_lib.crossed_column(
+          [sq_footage_bucket, country], hash_bucket_size=10)
+      classifier = sdca_estimator.SDCALogisticClassifier(
+          example_id_column='example_id',
+          feature_columns=[
+              price, sq_footage_bucket, country, sq_footage_country
+          ],
+          weight_column_name='weights',
+          partitioner=partitioned_variables.fixed_size_partitioner(
+              num_shards=2, axis=0))
+      classifier.fit(input_fn=input_fn, steps=50)
+      metrics = classifier.evaluate(input_fn=input_fn, steps=1)
+      self.assertGreater(metrics['accuracy'], 0.9)
+
 
 class SDCALinearRegressorTest(test.TestCase):
 
@@ -350,6 +392,48 @@ class SDCALinearRegressorTest(test.TestCase):
       loss = regressor.evaluate(input_fn=input_fn, steps=1)['loss']
       self.assertLess(loss, 0.05)
 
+  def testMixedFeaturesArbitraryWeightsPartitioned(self):
+    """Tests SDCALinearRegressor works with a mix of features (partitioned)."""
+
+    def input_fn():
+      return {
+          'example_id':
+              constant_op.constant(['1', '2', '3']),
+          'price':
+              constant_op.constant([[0.6], [0.8], [0.3]]),
+          'sq_footage':
+              constant_op.constant([[900.0], [700.0], [600.0]]),
+          'country':
+              sparse_tensor.SparseTensor(
+                  values=['IT', 'US', 'GB'],
+                  indices=[[0, 0], [1, 3], [2, 1]],
+                  dense_shape=[3, 5]),
+          'weights':
+              constant_op.constant([[3.0], [5.0], [7.0]])
+      }, constant_op.constant([[1.55], [-1.25], [-3.0]])
+
+    with self._single_threaded_test_session():
+      price = feature_column_lib.real_valued_column('price')
+      sq_footage_bucket = feature_column_lib.bucketized_column(
+          feature_column_lib.real_valued_column('sq_footage'),
+          boundaries=[650.0, 800.0])
+      country = feature_column_lib.sparse_column_with_hash_bucket(
+          'country', hash_bucket_size=5)
+      sq_footage_country = feature_column_lib.crossed_column(
+          [sq_footage_bucket, country], hash_bucket_size=10)
+      regressor = sdca_estimator.SDCALinearRegressor(
+          example_id_column='example_id',
+          feature_columns=[
+              price, sq_footage_bucket, country, sq_footage_country
+          ],
+          l2_regularization=1.0,
+          weight_column_name='weights',
+          partitioner=partitioned_variables.fixed_size_partitioner(
+              num_shards=2, axis=0))
+      regressor.fit(input_fn=input_fn, steps=20)
+      loss = regressor.evaluate(input_fn=input_fn, steps=1)['loss']
+      self.assertLess(loss, 0.05)
+
   def testSdcaOptimizerSparseFeaturesWithL1Reg(self):
     """SDCALinearRegressor works with sparse features and L1 regularization."""
 
diff --git a/tensorflow/contrib/linear_optimizer/python/sdca_optimizer.py b/tensorflow/contrib/linear_optimizer/python/sdca_optimizer.py
index 12039ecc6f357af07e0c2a08e17d46396f3ad386..9872c6f97c879d8994b6c26e65df33e368a0603e 100644
--- a/tensorflow/contrib/linear_optimizer/python/sdca_optimizer.py
+++ b/tensorflow/contrib/linear_optimizer/python/sdca_optimizer.py
@@ -64,7 +64,8 @@ class SDCAOptimizer(object):
   of workers running the train steps. It defaults to 1 (single machine).
   `num_table_shards` defines the number of shards for the internal state
   table, typically set to match the number of parameter servers for large
-  data sets.
+  data sets. You can also specify a `partitioner` object to partition the primal
+  weights during training (`div` partitioning strategy will be used).
   """
 
   def __init__(self,
@@ -73,13 +74,15 @@ class SDCAOptimizer(object):
                num_table_shards=None,
                symmetric_l1_regularization=0.0,
                symmetric_l2_regularization=1.0,
-               adaptive=True):
+               adaptive=True,
+               partitioner=None):
     self._example_id_column = example_id_column
     self._num_loss_partitions = num_loss_partitions
     self._num_table_shards = num_table_shards
     self._symmetric_l1_regularization = symmetric_l1_regularization
     self._symmetric_l2_regularization = symmetric_l2_regularization
     self._adaptive = adaptive
+    self._partitioner = partitioner
 
   def get_name(self):
     return 'SDCAOptimizer'
@@ -108,6 +111,10 @@ class SDCAOptimizer(object):
   def adaptive(self):
     return self._adaptive
 
+  @property
+  def partitioner(self):
+    return self._partitioner
+
   def get_train_step(self, columns_to_variables, weight_column_name, loss_type,
                      features, targets, global_step):
     """Returns the training operation of an SdcaModel optimizer."""
@@ -175,10 +182,12 @@ class SDCAOptimizer(object):
           sparse_feature_column = _dense_tensor_to_sparse_feature_column(
               dense_bucket_tensor)
           sparse_feature_with_values.append(sparse_feature_column)
-          # For bucketized columns, the variables list contains exactly one
-          # element.
-          sparse_feature_with_values_weights.append(
-              columns_to_variables[column][0])
+          # If a partitioner was used during variable creation, we will have a
+          # list of Variables here larger than 1.
+          vars_to_append = columns_to_variables[column][0]
+          if len(columns_to_variables[column]) > 1:
+            vars_to_append = columns_to_variables[column]
+          sparse_feature_with_values_weights.append(vars_to_append)
         elif isinstance(
             column,
             (
@@ -226,8 +235,12 @@ class SDCAOptimizer(object):
                                             array_ops.shape(ids)[0]), [-1])
           sparse_feature_with_values.append(
               SparseFeatureColumn(example_ids_filtered, reproject_ids, weights))
-          sparse_feature_with_values_weights.append(
-              columns_to_variables[column][0])
+          # If a partitioner was used during variable creation, we will have a
+          # list of Variables here larger than 1.
+          vars_to_append = columns_to_variables[column][0]
+          if len(columns_to_variables[column]) > 1:
+            vars_to_append = columns_to_variables[column]
+          sparse_feature_with_values_weights.append(vars_to_append)
         else:
           raise ValueError('SDCAOptimizer does not support column type %s.' %
                            type(column).__name__)
diff --git a/tensorflow/contrib/lite/Android.bp b/tensorflow/contrib/lite/Android.bp
index bd470696c5879822f9a75b77b3ec132500cd0d34..ff3d18b4b117863ab483b4ff9c5dac71fb5379ee 100644
--- a/tensorflow/contrib/lite/Android.bp
+++ b/tensorflow/contrib/lite/Android.bp
@@ -45,7 +45,7 @@ cc_library_static {
         "graph_info.cc",
         "interpreter.cc",
         "model.cc",
-        "op_resolver.cc",	
+        "op_resolver.cc",
         "nnapi_delegate.cc",
         "optional_debug_tools.cc",
         "simple_memory_arena.cc",
diff --git a/tensorflow/contrib/lite/BUILD b/tensorflow/contrib/lite/BUILD
index 55b984f260ec49ab9b52be6402885a46226cba70..9c804d27854b8004d34c65691b48ca2b0d3bbf7c 100644
--- a/tensorflow/contrib/lite/BUILD
+++ b/tensorflow/contrib/lite/BUILD
@@ -90,6 +90,16 @@ cc_library(
     deps = [":context"],
 )
 
+cc_library(
+    name = "kernel_api",
+    hdrs = [
+        "builtin_op_data.h",
+        "builtin_ops.h",
+        "context.h",
+        "context_util.h",
+    ],
+)
+
 exports_files(["builtin_ops.h"])
 
 cc_library(
diff --git a/tensorflow/contrib/lite/arena_planner_test.cc b/tensorflow/contrib/lite/arena_planner_test.cc
index a8a8755e2c9e81474f2ff9cd2b85c0eb3d5c3441..16171df10a7b18b22919c6e54fe3d1e8e0120f69 100644
--- a/tensorflow/contrib/lite/arena_planner_test.cc
+++ b/tensorflow/contrib/lite/arena_planner_test.cc
@@ -209,11 +209,8 @@ TEST_F(ArenaPlannerTest, ZeroSizedTensors) {
   TestGraph graph({1}, {{{1}, {2}, {}}}, {2});
   (*graph.tensors())[1].bytes = 0;
   SetGraph(&graph);
-  // TODO(ahentz): this is currently broken because the arena finds two
-  // allocations with the same offset and returns an error.
-  ASSERT_FALSE(planner_->ExecuteAllocations(0, 10) == kTfLiteOk);
-  // EXPECT_EQ(GetOffset(1), 0);
-  // EXPECT_EQ(GetOffset(2), GetOffsetAfter(1));
+  ASSERT_EQ(planner_->ExecuteAllocations(0, 10), kTfLiteOk);
+  EXPECT_EQ((*graph_->tensors())[1].data.raw, nullptr);
 }
 
 TEST_F(ArenaPlannerTest, SimpleGraph) {
diff --git a/tensorflow/contrib/lite/build_def.bzl b/tensorflow/contrib/lite/build_def.bzl
index 9bfc0a0fbeff38fb77b6d67c1a2df37a6807528c..974e6c5d98e5691a3733495e915d919c4bf57d3a 100644
--- a/tensorflow/contrib/lite/build_def.bzl
+++ b/tensorflow/contrib/lite/build_def.bzl
@@ -204,7 +204,9 @@ def generated_test_models():
         "conv",
         "depthwiseconv",
         "div",
+        "equal",
         "exp",
+        "expand_dims",
         "floor",
         "fully_connected",
         "fused_batch_norm",
@@ -212,18 +214,22 @@ def generated_test_models():
         "global_batch_norm",
         "greater",
         "greater_equal",
-        "l2_pool",
         "l2norm",
+        "l2_pool",
         "less",
         "less_equal",
         "local_response_norm",
         "log_softmax",
+        "log",
+        # TODO(b/110143200): Enable after resolving issues with LSTM conversion.
+        # "lstm",
         "max_pool",
         "maximum",
         "mean",
         "minimum",
         "mul",
         "neg",
+        "not_equal",
         "pad",
         "padv2",
         # "prelu",
@@ -238,11 +244,13 @@ def generated_test_models():
         "softmax",
         "space_to_batch_nd",
         "space_to_depth",
+        "sparse_to_dense",
         "split",
         "squeeze",
         "strided_slice",
         "strided_slice_1d_exhaustive",
         "sub",
+        "tile",
         "topk",
         "transpose",
         "transpose_conv",
diff --git a/tensorflow/contrib/lite/builtin_op_data.h b/tensorflow/contrib/lite/builtin_op_data.h
index 8660c653ae4c0c69e4f5ad8fae739c8c1db7414c..c1cc4476fbd45fa6b3f5b3a1ed2cba39cc2ad54b 100644
--- a/tensorflow/contrib/lite/builtin_op_data.h
+++ b/tensorflow/contrib/lite/builtin_op_data.h
@@ -148,10 +148,20 @@ typedef struct {
   float beta;
 } TfLiteLocalResponseNormParams;
 
+typedef enum {
+  kTfLiteLSTMFullKernel = 0,
+  kTfLiteLSTMBasicKernel
+} TfLiteLSTMKernelType;
+
 typedef struct {
+  // Parameters for LSTM version 1.
   TfLiteFusedActivation activation;
   float cell_clip;
   float proj_clip;
+
+  // Parameters for LSTM version 2.
+  // kTfLiteLSTMBasicKernel is only supported in version 2 or above.
+  TfLiteLSTMKernelType kernel_type;
 } TfLiteLSTMParams;
 
 typedef struct {
@@ -236,6 +246,10 @@ typedef struct {
   int stride_height;
 } TfLiteTransposeConvParams;
 
+typedef struct {
+  bool validate_indices;
+} TfLiteSparseToDenseParams;
+
 #ifdef __cplusplus
 }  // extern "C"
 #endif  // __cplusplus
diff --git a/tensorflow/contrib/lite/builtin_ops.h b/tensorflow/contrib/lite/builtin_ops.h
index 7e285186f45a61a451fd7328b061e16059049ea5..aef9a92883f18dabfc36058507d739856c3c2af7 100644
--- a/tensorflow/contrib/lite/builtin_ops.h
+++ b/tensorflow/contrib/lite/builtin_ops.h
@@ -17,7 +17,7 @@ limitations under the License.
 #define TENSORFLOW_CONTRIB_LITE_BUILTIN_OPS_H_
 
 // DO NOT EDIT MANUALLY: This file is automatically generated by
-// `schema_builtin_ops_header_generator.py`.
+// `schema/builtin_ops_header/generator.cc`.
 
 #ifdef __cplusplus
 extern "C" {
@@ -93,10 +93,15 @@ typedef enum {
   kTfLiteBuiltinSlice = 65,
   kTfLiteBuiltinSin = 66,
   kTfLiteBuiltinTransposeConv = 67,
+  kTfLiteBuiltinSparseToDense = 68,
+  kTfLiteBuiltinTile = 69,
+  kTfLiteBuiltinExpandDims = 70,
+  kTfLiteBuiltinEqual = 71,
+  kTfLiteBuiltinNotEqual = 72,
+  kTfLiteBuiltinLog = 73,
 } TfLiteBuiltinOperator;
 
 #ifdef __cplusplus
 }  // extern "C"
 #endif  // __cplusplus
 #endif  // TENSORFLOW_CONTRIB_LITE_BUILTIN_OPS_H_
-}
diff --git a/tensorflow/contrib/lite/context_util.h b/tensorflow/contrib/lite/context_util.h
new file mode 100644
index 0000000000000000000000000000000000000000..abe802e34214caf4d5063da827b3aca4a82aa56d
--- /dev/null
+++ b/tensorflow/contrib/lite/context_util.h
@@ -0,0 +1,48 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+// This provides a few C++ helpers that are useful for manipulating C structures
+// in C++.
+#ifndef TENSORFLOW_CONTRIB_LITE_CONTEXT_UTIL_H_
+#define TENSORFLOW_CONTRIB_LITE_CONTEXT_UTIL_H_
+
+#include "tensorflow/contrib/lite/context.h"
+
+namespace tflite {
+
+// Provide a range iterable wrapper for TfLiteIntArray* (C lists that TfLite
+// C api uses. Can't use the google array_view, since we can't depend on even
+// absl for embedded device reasons.
+class TfLiteIntArrayView {
+ public:
+  // Construct a view of a TfLiteIntArray*. Note, `int_array` should be non-null
+  // and this view does not take ownership of it.
+  explicit TfLiteIntArrayView(const TfLiteIntArray* int_array)
+      : int_array_(int_array) {}
+
+  TfLiteIntArrayView(const TfLiteIntArrayView&) = default;
+  TfLiteIntArrayView& operator=(const TfLiteIntArrayView& rhs) = default;
+
+  typedef const int* const_iterator;
+  const_iterator begin() const { return int_array_->data; }
+  const_iterator end() const { return &int_array_->data[int_array_->size]; }
+  size_t size() const { return end() - begin(); }
+
+ private:
+  const TfLiteIntArray* int_array_;
+};
+
+}  // namespace tflite
+
+#endif  // TENSORFLOW_CONTRIB_LITE_CONTEXT_UTIL_H_
diff --git a/tensorflow/contrib/lite/delegates/nnapi/BUILD b/tensorflow/contrib/lite/delegates/nnapi/BUILD
new file mode 100644
index 0000000000000000000000000000000000000000..35a8f6ca4166e373ea1a0af5d4a013327b30d2b6
--- /dev/null
+++ b/tensorflow/contrib/lite/delegates/nnapi/BUILD
@@ -0,0 +1,31 @@
+package(default_visibility = [
+    "//visibility:public",
+])
+
+load("//tensorflow:tensorflow.bzl", "tf_cc_test")
+
+licenses(["notice"])  # Apache 2.0
+
+cc_library(
+    name = "nnapi_delegate",
+    srcs = ["nnapi_delegate.cc"],
+    hdrs = ["nnapi_delegate.h"],
+    deps = [
+        "//tensorflow/contrib/lite:framework",
+        "//tensorflow/contrib/lite:kernel_api",
+        "//tensorflow/contrib/lite/kernels:kernel_util",
+        "//tensorflow/contrib/lite/nnapi:nnapi_lib",
+    ],
+)
+
+tf_cc_test(
+    name = "nnapi_delegate_test",
+    size = "small",
+    srcs = ["nnapi_delegate_test.cc"],
+    deps = [
+        ":nnapi_delegate",
+        "//tensorflow/contrib/lite:framework",
+        "//tensorflow/contrib/lite/kernels:test_util",
+        "@com_google_googletest//:gtest",
+    ],
+)
diff --git a/tensorflow/contrib/lite/delegates/nnapi/nnapi_delegate.cc b/tensorflow/contrib/lite/delegates/nnapi/nnapi_delegate.cc
new file mode 100644
index 0000000000000000000000000000000000000000..0731d14419d2dec2ea5efa48ef5d4b7728af635f
--- /dev/null
+++ b/tensorflow/contrib/lite/delegates/nnapi/nnapi_delegate.cc
@@ -0,0 +1,464 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <cstdarg>
+#include <iostream>
+#include <memory>
+#include <vector>
+
+#include "tensorflow/contrib/lite/allocation.h"
+#include "tensorflow/contrib/lite/builtin_op_data.h"
+#include "tensorflow/contrib/lite/builtin_ops.h"
+#include "tensorflow/contrib/lite/context.h"
+#include "tensorflow/contrib/lite/context_util.h"
+#include "tensorflow/contrib/lite/delegates/nnapi/nnapi_delegate.h"
+#include "tensorflow/contrib/lite/kernels/kernel_util.h"
+#include "tensorflow/contrib/lite/nnapi/NeuralNetworksShim.h"
+
+namespace tflite {
+namespace {
+
+// TODO(b/80621585): Consider printing error string, but don't for now to
+// minimize binary size.
+#define CHECK_NN(context, code)                                           \
+  if (code != ANEURALNETWORKS_NO_ERROR) {                                 \
+    context->ReportError(context, "NN API returned error (%d).\n", code); \
+    return kTfLiteError;                                                  \
+  }
+
+// RAII NN API Model Destructor for use with std::unique_ptr
+struct NNFreeModel {
+  void operator()(ANeuralNetworksModel* model) {
+    ANeuralNetworksModel_free(model);
+  }
+};
+// RAII NN API Compilation Destructor for use with std::unique_ptr
+struct NNFreeCompilation {
+  void operator()(ANeuralNetworksCompilation* model) {
+    ANeuralNetworksCompilation_free(model);
+  }
+};
+
+// Track tensor indices to NN API tensor indices mapping.
+class OperandMapping {
+ public:
+  // Given a TFLite index return the ANN index. If it doesn't exist
+  // return -1.
+  int lite_index_to_ann(int index) const {
+    if (index < lite_tensor_to_ann_tensor_.size())
+      return lite_tensor_to_ann_tensor_[index];
+    else
+      return -1;
+  }
+
+  // NN API uses non tensor operands instead of structs. This creates one
+  // and returns the index. It uses a std::vector and resizes it as needed
+  // keeping -1 to unmapped values. Intermediate tensors likely will not
+  // be mapped.
+  int add_new_non_tensor_operand() { return next_ann_tensor_index_++; }
+
+  // Add a new mapping from `tflite_index` and return the NN API tensor index.
+  int add_new_ann_tensor_index(int tflite_index) {
+    if (tflite_index >= lite_tensor_to_ann_tensor_.size()) {
+      lite_tensor_to_ann_tensor_.resize(tflite_index + 1);
+    }
+    int new_tensor_index = next_ann_tensor_index_++;
+    lite_tensor_to_ann_tensor_[tflite_index] = new_tensor_index;
+    return new_tensor_index;
+  }
+
+ private:
+  // Next index of ann tensor
+  int next_ann_tensor_index_ = 0;
+
+  // Mapping from lite index. Use a std::vector for speed and code size
+  // rather than a map.
+  std::vector<int> lite_tensor_to_ann_tensor_;
+};
+
+// Abstract builder for building an op in the NN API graph. This handles
+// the disparity between TFLite and NN API operand types. NN API has singular
+// operands for both tensors and parameters, and TFLite separates the two.
+class NNAPIOpBuilder {
+ public:
+  NNAPIOpBuilder(TfLiteContext* context, OperandMapping* tensor_mapping,
+                 ANeuralNetworksModel* nn_model)
+      : context_(context),
+        operand_mapping_(tensor_mapping),
+        nn_model_(nn_model) {}
+
+  TfLiteStatus AddScalarInt32Operand(int value) {
+    ANeuralNetworksOperandType operand_type{.type = ANEURALNETWORKS_INT32};
+    CHECK_NN(context_,
+             ANeuralNetworksModel_addOperand(nn_model_, &operand_type));
+    int ann_operand = operand_mapping_->add_new_non_tensor_operand();
+    CHECK_NN(context_, ANeuralNetworksModel_setOperandValue(
+                           nn_model_, ann_operand, &value, sizeof(int32_t)));
+    augmented_inputs_.push_back(ann_operand);
+    return kTfLiteOk;
+  }
+
+  TfLiteStatus AddTensorInput(int tensor_index) {
+    int ann_index;
+    TF_LITE_ENSURE_STATUS(AddTensor(tensor_index, &ann_index));
+    augmented_inputs_.push_back(ann_index);
+    return kTfLiteOk;
+  }
+
+  TfLiteStatus AddTensorOutput(int tensor_index) {
+    int ann_index;
+    TF_LITE_ENSURE_STATUS(AddTensor(tensor_index, &ann_index));
+    augmented_outputs_.push_back(ann_index);
+    return kTfLiteOk;
+  }
+
+  // Adds a new NN API tensor that shadows the TF Lite tensor `tensor_index`.
+  // This returns the NN API tensor index corresponding to the created tensor.
+  // If another caller previously created a NN API tensor for `tensor_index`
+  // then the existing one is returned.
+  TfLiteStatus AddTensor(int tensor_index, int* ann_tensor_index_out) {
+    int ann_tensor_index = operand_mapping_->lite_index_to_ann(tensor_index);
+    if (ann_tensor_index != -1) {
+      *ann_tensor_index_out = ann_tensor_index;
+      return kTfLiteOk;
+    }
+    // Allocate a new tensor index
+    ann_tensor_index = operand_mapping_->add_new_ann_tensor_index(tensor_index);
+
+    // Parameters needed for new type.
+    int32_t nn_type = 0;
+    float scale = 0.0f;
+    int32_t zeroPoint = 0;
+    TfLiteTensor* tensor = &context_->tensors[tensor_index];
+    switch (tensor->type) {
+      case kTfLiteNoType:
+        // Tensors added during initialization of Ops don't have a type yet and
+        // should not be registered with the NNAPI.
+        *ann_tensor_index_out = -1;
+        return kTfLiteOk;
+      case kTfLiteFloat32:
+        nn_type = ANEURALNETWORKS_TENSOR_FLOAT32;
+        scale = 0.f;
+        break;
+      case kTfLiteUInt8:
+        nn_type = ANEURALNETWORKS_TENSOR_QUANT8_ASYMM;
+        scale = tensor->params.scale;
+        zeroPoint = tensor->params.zero_point;
+        break;
+      case kTfLiteInt32:
+        nn_type = ANEURALNETWORKS_TENSOR_INT32;
+        scale = 0.f;
+        zeroPoint = 0;
+        break;
+      default:
+        context_->ReportError(context_, "Logic error in NN API Delegate.\n");
+        return kTfLiteError;
+    }
+
+    ANeuralNetworksOperandType operand_type{
+        nn_type, static_cast<uint32_t>(tensor->dims->size),
+        reinterpret_cast<uint32_t*>(tensor->dims->data), scale, zeroPoint};
+    CHECK_NN(context_,
+             ANeuralNetworksModel_addOperand(nn_model_, &operand_type));
+
+    if (tensor->allocation_type == kTfLiteMmapRo) {
+      // TODO(b/80630405): Use NNAPIAllocation.
+      CHECK_NN(context_, ANeuralNetworksModel_setOperandValue(
+                             nn_model_, ann_tensor_index, tensor->data.raw,
+                             tensor->bytes));
+    }
+
+    *ann_tensor_index_out = ann_tensor_index;
+    return kTfLiteOk;
+  }
+
+  // Finish emitting the op (of type `type`) into the NN API.
+  TfLiteStatus FinalizeAddOperation(ANeuralNetworksOperationType type) {
+    // Actually add a NN API operation
+    CHECK_NN(context_, ANeuralNetworksModel_addOperation(
+                           nn_model_, type,
+                           static_cast<uint32_t>(augmented_inputs_.size()),
+                           augmented_inputs_.data(),
+                           static_cast<uint32_t>(augmented_outputs_.size()),
+                           augmented_outputs_.data()));
+    augmented_outputs_.clear();
+    augmented_outputs_.clear();
+    return kTfLiteOk;
+  }
+
+ private:
+  // TfLiteContext for error handling. Must be named context for macros to
+  // work.
+  TfLiteContext* context_;
+
+  // Tracks relationship between indices
+  OperandMapping* operand_mapping_;
+
+  // The model
+  ANeuralNetworksModel* nn_model_;
+
+  // Inputs and outputs for the current op. These are augmented in the sense
+  // that NN API uses operands for all arguments, not just tensors, unlike
+  // TensorFlow lite.
+  std::vector<uint32_t> augmented_inputs_;
+  std::vector<uint32_t> augmented_outputs_;
+};
+
+// The kernel that represents the subgraph of TF Lite being run on NN API.
+class NNAPIDelegateKernel {
+ public:
+  NNAPIDelegateKernel() = default;
+
+  typedef ANeuralNetworksOperationType (*MappingFn)(TfLiteContext*,
+                                                    NNAPIOpBuilder* builder,
+                                                    TfLiteNode* node);
+
+  // Return a function that knows how to translate a node into its operands
+  // when called. You can use this function to see if a node is supported
+  // (i.e. that MappingFn is not nullptr).
+  MappingFn Map(TfLiteContext* context, int builtin_code, TfLiteNode* node) {
+    switch (builtin_code) {
+      case kTfLiteBuiltinAdd:
+        return [](TfLiteContext* context, NNAPIOpBuilder* builder,
+                  TfLiteNode* node) -> ANeuralNetworksOperationType {
+          auto builtin = reinterpret_cast<TfLiteAddParams*>(node->builtin_data);
+          builder->AddScalarInt32Operand(builtin->activation);
+          return ANEURALNETWORKS_ADD;
+        };
+        break;
+      case kTfLiteBuiltinAveragePool2d:
+        return [](TfLiteContext* context, NNAPIOpBuilder* builder,
+                  TfLiteNode* node) -> ANeuralNetworksOperationType {
+          auto builtin =
+              reinterpret_cast<TfLitePoolParams*>(node->builtin_data);
+          builder->AddScalarInt32Operand(builtin->padding);
+          builder->AddScalarInt32Operand(builtin->stride_width);
+          builder->AddScalarInt32Operand(builtin->stride_height);
+          builder->AddScalarInt32Operand(builtin->filter_width);
+          builder->AddScalarInt32Operand(builtin->filter_height);
+          builder->AddScalarInt32Operand(builtin->activation);
+          return ANEURALNETWORKS_AVERAGE_POOL_2D;
+        };
+        break;
+      default:
+        return nullptr;
+    }
+  }
+
+  // Initialize the kernel (a NN model).
+  TfLiteStatus Init(TfLiteContext* context,
+                    const TfLiteDelegateParams* params) {
+    for (auto node_index : TfLiteIntArrayView(params->nodes_to_replace)) {
+      nodes_.push_back(node_index);
+    }
+
+    if (!nn_model_) {
+      ANeuralNetworksModel* model;
+      CHECK_NN(context, ANeuralNetworksModel_create(&model));
+      nn_model_.reset(model);
+
+      TF_LITE_ENSURE_STATUS(
+          BuildGraph(context, params->input_tensors, params->output_tensors));
+    }
+
+    if (!nn_compilation_) {
+      ANeuralNetworksCompilation* compilation;
+      CHECK_NN(context, ANeuralNetworksCompilation_create(nn_model_.get(),
+                                                          &compilation));
+      CHECK_NN(context, ANeuralNetworksCompilation_finish(compilation));
+      nn_compilation_.reset(compilation);
+    }
+    return kTfLiteOk;
+  }
+
+  TfLiteStatus Invoke(TfLiteContext* context, TfLiteNode* node) {
+    ANeuralNetworksExecution* execution = nullptr;
+    CHECK_NN(context, ANeuralNetworksExecution_create(nn_compilation_.get(),
+                                                      &execution));
+
+    // Set the input tensor buffers. Note: we access tflite tensors using
+    // absolute indices but NN api indices inputs by relative indices.
+    int relative_input_index = 0;
+    for (auto absolute_input_index : TfLiteIntArrayView(node->inputs)) {
+      TfLiteTensor* tensor = &context->tensors[absolute_input_index];
+      CHECK_NN(context, ANeuralNetworksExecution_setInput(
+                            execution, relative_input_index, nullptr,
+                            tensor->data.raw, tensor->bytes));
+      relative_input_index++;
+    }
+
+    // Set the output tensor buffers.
+    int relative_output_index = 0;
+    for (auto output_index : TfLiteIntArrayView(node->outputs)) {
+      TfLiteTensor* tensor = &context->tensors[output_index];
+      CHECK_NN(context, ANeuralNetworksExecution_setOutput(
+                            execution, relative_output_index, nullptr,
+                            tensor->data.raw, tensor->bytes));
+      relative_output_index++;
+    }
+    // Invoke ANN in blocking fashion.
+    ANeuralNetworksEvent* event = nullptr;
+    CHECK_NN(context, ANeuralNetworksExecution_startCompute(execution, &event));
+    CHECK_NN(context, ANeuralNetworksEvent_wait(event));
+    ANeuralNetworksEvent_free(event);
+    ANeuralNetworksExecution_free(execution);
+
+    return kTfLiteOk;
+  }
+
+ private:
+  // ANN API state.
+  std::unique_ptr<ANeuralNetworksModel, NNFreeModel> nn_model_;
+  std::unique_ptr<ANeuralNetworksCompilation, NNFreeCompilation>
+      nn_compilation_;
+  // Node indices that this delegate is responsible for. Indices here
+  // indexes into the nodes array in the TfLiteContext.
+  std::vector<int> nodes_;
+  // Track indices we use
+  OperandMapping operand_mapping_;
+
+  TfLiteStatus AddOpsAndTensors(TfLiteContext* context) {
+    // The operand builder allows creating a single op. We create it at this
+    // reduced power position rather than in the for loop to avoid reallocating
+    // the vectors.
+    NNAPIOpBuilder builder(context, &operand_mapping_, nn_model_.get());
+    // Add Tensors
+    // allocate outside to avoid realloc
+    for (auto node_index : nodes_) {
+      // Obtain the op and registration.
+      TfLiteNode* node;
+      TfLiteRegistration* reg;
+      context->GetNodeAndRegistration(context, node_index, &node, &reg);
+      // Map inputs to NN API tensor indices.
+      for (auto input_index : TfLiteIntArrayView(node->inputs)) {
+        TF_LITE_ENSURE_STATUS(builder.AddTensorInput(input_index));
+      }
+      // Get op type and operands
+      int nn_op_type =
+          Map(context, reg->builtin_code, node)(context, &builder, node);
+      // Map outputs to NN API tensor indices.
+      for (auto output_index : TfLiteIntArrayView(node->outputs)) {
+        TF_LITE_ENSURE_STATUS(builder.AddTensorOutput(output_index));
+      }
+
+      builder.FinalizeAddOperation(nn_op_type);
+    }
+    return kTfLiteOk;
+  }
+
+  TfLiteStatus BuildGraph(TfLiteContext* context,
+                          const TfLiteIntArray* input_tensors,
+                          const TfLiteIntArray* output_tensors) {
+    // Build the ops and tensors.
+    TF_LITE_ENSURE_STATUS(AddOpsAndTensors(context));
+    // Map input and output tensor indices to ANN
+    std::vector<uint32_t> inputs;
+    inputs.reserve(input_tensors->size);
+    std::vector<uint32_t> outputs;
+    outputs.reserve(output_tensors->size);
+    // Make the TensorFlow lite inputs and outputs to ann_indices.
+    for (int i : TfLiteIntArrayView(input_tensors))
+      inputs.push_back(operand_mapping_.lite_index_to_ann(i));
+    for (int i : TfLiteIntArrayView(output_tensors))
+      outputs.push_back(operand_mapping_.lite_index_to_ann(i));
+    // Tell ANN to declare inputs/outputs
+    CHECK_NN(context, ANeuralNetworksModel_identifyInputsAndOutputs(
+                          nn_model_.get(), inputs.size(), inputs.data(),
+                          outputs.size(), outputs.data()));
+    // Finalize the model
+    CHECK_NN(context, ANeuralNetworksModel_finish(nn_model_.get()));
+
+    return kTfLiteOk;
+  }
+};
+
+}  // namespace
+
+// Return a NN API Delegate struct that can check for support of ops.
+TfLiteDelegate* NnApiDelegate() {
+  static TfLiteDelegate delegate = {
+      .data_ = nullptr,
+      .Prepare = [](TfLiteContext* context,
+                    TfLiteDelegate* delegate) -> TfLiteStatus {
+        // Do not check nodes_ if NN API is unavailable.
+        if (!NNAPIExists()) return kTfLiteOk;
+
+        std::vector<int> supported_nodes(1);
+        // We don't care about all nodes_, we only care about ones in the
+        // current plan.
+        TfLiteIntArray* plan;
+        TF_LITE_ENSURE_STATUS(context->GetExecutionPlan(context, &plan));
+        int total_supported_nodes = 0;
+        // Check for every node if it is supported
+        // TODO(b/80625235): Fix this to do more careful checking of versioning.
+        for (int node_index : TfLiteIntArrayView(plan)) {
+          TfLiteNode* node;
+          TfLiteRegistration* registration;
+          TF_LITE_ENSURE_STATUS(context->GetNodeAndRegistration(
+              context, node_index, &node, &registration));
+          NNAPIDelegateKernel dummy_kernel;
+          if (dummy_kernel.Map(context, registration->builtin_code, node)) {
+            supported_nodes.push_back(node_index);
+          }
+          total_supported_nodes += 1;
+        }
+        // Put the size at the beginning of the array.
+        supported_nodes[0] = supported_nodes.size() - 1;
+
+        // NN API Delegate Registration (the pseudo kernel that will invoke NN
+        // API subgraphs)
+        static const TfLiteRegistration nnapi_delegate_kernel = {
+            .init = [](TfLiteContext* context, const char* buffer,
+                       size_t length) -> void* {
+              const TfLiteDelegateParams* params =
+                  reinterpret_cast<const TfLiteDelegateParams*>(buffer);
+              NNAPIDelegateKernel* kernel_state = new NNAPIDelegateKernel;
+              kernel_state->Init(context, params);
+              return kernel_state;
+            },
+
+            .free = [](TfLiteContext* context, void* buffer) -> void {
+              delete reinterpret_cast<NNAPIDelegateKernel*>(buffer);
+            },
+
+            .prepare = [](TfLiteContext* context,
+                          TfLiteNode* node) -> TfLiteStatus {
+              // Since the underlying resize happened ahead of delegation
+              // worked. This does nothing.
+              return kTfLiteOk;
+            },
+
+            .invoke = [](TfLiteContext* context,
+                         TfLiteNode* node) -> TfLiteStatus {
+              NNAPIDelegateKernel* state =
+                  reinterpret_cast<NNAPIDelegateKernel*>(node->user_data);
+              return state->Invoke(context, node);
+            },
+
+            .builtin_code = kTfLiteBuiltinDelegate,
+        };
+
+        // Request TFLite to partition the graph and make kernels
+        // for each independent subgraph a new nnapi_delegate_kernel.
+        context->ReplaceSubgraphsWithDelegateKernels(
+            context, nnapi_delegate_kernel,
+            reinterpret_cast<TfLiteIntArray*>(supported_nodes.data()),
+            delegate);
+        return kTfLiteOk;
+      }};
+
+  return &delegate;
+}
+
+}  // namespace tflite
diff --git a/tensorflow/contrib/lite/delegates/nnapi/nnapi_delegate.h b/tensorflow/contrib/lite/delegates/nnapi/nnapi_delegate.h
new file mode 100644
index 0000000000000000000000000000000000000000..44cca2fd285370d700525f98ba33c861fb97be1e
--- /dev/null
+++ b/tensorflow/contrib/lite/delegates/nnapi/nnapi_delegate.h
@@ -0,0 +1,31 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#ifndef TENSORFLOW_CONTRIB_LITE_DELEGATES_NNAPI_NNAPI_DELEGATE_H_
+#define TENSORFLOW_CONTRIB_LITE_DELEGATES_NNAPI_NNAPI_DELEGATE_H_
+
+#include "tensorflow/contrib/lite/context.h"
+
+namespace tflite {
+
+// Return a delegate that can be used to use the NN API.
+// e.g.
+//   NnApiDelegate* delegate = NnApiDelegate();
+//   interpreter->ModifyGraphWithDelegate(&delegate);
+// NnApiDelegate() returns a singleton, so you should not free this
+// pointer or worry about its lifetime.
+TfLiteDelegate* NnApiDelegate();
+}  // namespace tflite
+
+#endif  // TENSORFLOW_CONTRIB_LITE_DELEGATES_NNAPI_NNAPI_DELEGATE_H_
diff --git a/tensorflow/contrib/lite/delegates/nnapi/nnapi_delegate_test.cc b/tensorflow/contrib/lite/delegates/nnapi/nnapi_delegate_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..ff2e721423f07889f36746a2889afcc3369f28fc
--- /dev/null
+++ b/tensorflow/contrib/lite/delegates/nnapi/nnapi_delegate_test.cc
@@ -0,0 +1,82 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include "tensorflow/contrib/lite/delegates/nnapi/nnapi_delegate.h"
+#include <gtest/gtest.h>
+#include "tensorflow/contrib/lite/interpreter.h"
+#include "tensorflow/contrib/lite/kernels/test_util.h"
+#include "tensorflow/contrib/lite/model.h"
+
+namespace tflite {
+namespace {
+
+using ::testing::ElementsAreArray;
+
+class FloatAddOpModel : public SingleOpModel {
+ public:
+  FloatAddOpModel(const TensorData& input1, const TensorData& input2,
+                  const TensorData& output,
+                  ActivationFunctionType activation_type) {
+    this->SetApplyDelegate([](Interpreter* interpreter) {
+      interpreter->ModifyGraphWithDelegate(NnApiDelegate());
+    });
+    input1_ = AddInput(input1);
+    input2_ = AddInput(input2);
+    output_ = AddOutput(output);
+    SetBuiltinOp(BuiltinOperator_ADD, BuiltinOptions_AddOptions,
+                 CreateAddOptions(builder_, activation_type).Union());
+    BuildInterpreter({GetShape(input1_), GetShape(input2_)});
+  }
+
+  int input1() { return input1_; }
+  int input2() { return input2_; }
+
+  std::vector<float> GetOutput() { return ExtractVector<float>(output_); }
+
+ protected:
+  int input1_;
+  int input2_;
+  int output_;
+};
+
+// Do a test with the NN API using no activation.
+TEST(NNAPIDelegate, AddWithNoActivation) {
+  FloatAddOpModel m({TensorType_FLOAT32, {1, 2, 2, 1}},
+                    {TensorType_FLOAT32, {1, 2, 2, 1}},
+                    {TensorType_FLOAT32, {}}, ActivationFunctionType_NONE);
+  m.PopulateTensor<float>(m.input1(), {-2.0, 0.2, 0.7, 0.8});
+  m.PopulateTensor<float>(m.input2(), {0.1, 0.2, 0.3, 0.5});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutput(), ElementsAreArray({-1.9, 0.4, 1.0, 1.3}));
+}
+
+// Do a test with the NN api with relu.
+TEST(NNAPIDelegate, AddWithRelu) {
+  FloatAddOpModel m({TensorType_FLOAT32, {1, 2, 2, 1}},
+                    {TensorType_FLOAT32, {1, 2, 2, 1}},
+                    {TensorType_FLOAT32, {}}, ActivationFunctionType_RELU);
+  m.PopulateTensor<float>(m.input1(), {-2.0, 0.2, 0.7, 0.8});
+  m.PopulateTensor<float>(m.input2(), {0.1, 0.2, 0.3, 0.5});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutput(), ElementsAreArray({0.0, 0.4, 1.0, 1.3}));
+}
+
+}  // namespace
+}  // namespace tflite
+
+int main(int argc, char** argv) {
+  ::tflite::LogToStderr();
+  ::testing::InitGoogleTest(&argc, argv);
+  return RUN_ALL_TESTS();
+}
diff --git a/tensorflow/contrib/lite/download_dependencies.sh b/tensorflow/contrib/lite/download_dependencies.sh
index 436c3e1d4cad5e6ee355d7e9cf8ee7da1a8385ce..840015a7fad173dbd2ea353786871dd4e89abb98 100755
--- a/tensorflow/contrib/lite/download_dependencies.sh
+++ b/tensorflow/contrib/lite/download_dependencies.sh
@@ -30,9 +30,7 @@ if [ ! -f $BZL_FILE_PATH ]; then
 fi
 
 EIGEN_URL="$(grep -o 'http.*bitbucket.org/eigen/eigen/get/.*tar\.gz' "${BZL_FILE_PATH}" | grep -v mirror.bazel | head -n1)"
-# TODO (yongtang): Replace the following with 'https://mirror.bazel.build/github.com/google/gemmlowp/.*zip' once
-# the archive has been propagated in mirror.bazel.build.
-GEMMLOWP_URL="$(grep -o 'https://github.com/google/gemmlowp/.*zip' "${BZL_FILE_PATH}" | head -n1)"
+GEMMLOWP_URL="$(grep -o 'https://mirror.bazel.build/github.com/google/gemmlowp/.*zip' "${BZL_FILE_PATH}" | head -n1)"
 GOOGLETEST_URL="https://github.com/google/googletest/archive/release-1.8.0.tar.gz"
 ABSL_URL="$(grep -o 'https://github.com/abseil/abseil-cpp/.*tar.gz' "${BZL_FILE_PATH}" | head -n1)"
 NEON_2_SSE_URL="https://github.com/intel/ARM_NEON_2_x86_SSE/archive/master.zip"
diff --git a/tensorflow/contrib/lite/examples/label_image/BUILD b/tensorflow/contrib/lite/examples/label_image/BUILD
index 9322e186a280e932a2441ab16ac8579d9ab67ee2..c61445114ecc6dfbe4f2b6ab666b28a8aa746be3 100644
--- a/tensorflow/contrib/lite/examples/label_image/BUILD
+++ b/tensorflow/contrib/lite/examples/label_image/BUILD
@@ -53,19 +53,18 @@ cc_library(
     ],
 )
 
-# TODO(ahentz): Test disabled as it has a memory leek from read_bmp
-# cc_test(
-#     name = "label_image_test",
-#     srcs = [
-#         "get_top_n.h",
-#         "get_top_n_impl.h",
-#         "label_image_test.cc",
-#     ],
-#     data = [
-#         "testdata/grace_hopper.bmp",
-#     ],
-#     deps = [
-#         ":bitmap_helpers",
-#         "//testing/base/public:gunit",
-#     ],
-# )
+cc_test(
+    name = "label_image_test",
+    srcs = [
+        "get_top_n.h",
+        "get_top_n_impl.h",
+        "label_image_test.cc",
+    ],
+    data = [
+        "testdata/grace_hopper.bmp",
+    ],
+    deps = [
+        ":bitmap_helpers",
+        "@com_google_googletest//:gtest",
+    ],
+)
diff --git a/tensorflow/contrib/lite/examples/label_image/bitmap_helpers.cc b/tensorflow/contrib/lite/examples/label_image/bitmap_helpers.cc
index 0b38cd38c83927c65d251b9356301b6bef7521f2..2735d1f5ea4e2a104f71a3a6f874d9acb2f48142 100644
--- a/tensorflow/contrib/lite/examples/label_image/bitmap_helpers.cc
+++ b/tensorflow/contrib/lite/examples/label_image/bitmap_helpers.cc
@@ -28,8 +28,9 @@ limitations under the License.
 namespace tflite {
 namespace label_image {
 
-uint8_t* decode_bmp(const uint8_t* input, int row_size, uint8_t* const output,
-                    int width, int height, int channels, bool top_down) {
+std::vector<uint8_t> decode_bmp(const uint8_t* input, int row_size, int width,
+                                int height, int channels, bool top_down) {
+  std::vector<uint8_t> output(height * width * channels);
   for (int i = 0; i < height; i++) {
     int src_pos;
     int dst_pos;
@@ -66,12 +67,11 @@ uint8_t* decode_bmp(const uint8_t* input, int row_size, uint8_t* const output,
       }
     }
   }
-
   return output;
 }
 
-uint8_t* read_bmp(const std::string& input_bmp_name, int* width, int* height,
-                  int* channels, Settings* s) {
+std::vector<uint8_t> read_bmp(const std::string& input_bmp_name, int* width,
+                              int* height, int* channels, Settings* s) {
   int begin, end;
 
   std::ifstream file(input_bmp_name, std::ios::in | std::ios::binary);
@@ -87,14 +87,15 @@ uint8_t* read_bmp(const std::string& input_bmp_name, int* width, int* height,
 
   if (s->verbose) LOG(INFO) << "len: " << len << "\n";
 
-  const uint8_t* img_bytes = new uint8_t[len];
+  std::vector<uint8_t> img_bytes(len);
   file.seekg(0, std::ios::beg);
-  file.read((char*)img_bytes, len);
+  file.read(reinterpret_cast<char*>(img_bytes.data()), len);
   const int32_t header_size =
-      *(reinterpret_cast<const int32_t*>(img_bytes + 10));
-  *width = *(reinterpret_cast<const int32_t*>(img_bytes + 18));
-  *height = *(reinterpret_cast<const int32_t*>(img_bytes + 22));
-  const int32_t bpp = *(reinterpret_cast<const int32_t*>(img_bytes + 28));
+      *(reinterpret_cast<const int32_t*>(img_bytes.data() + 10));
+  *width = *(reinterpret_cast<const int32_t*>(img_bytes.data() + 18));
+  *height = *(reinterpret_cast<const int32_t*>(img_bytes.data() + 22));
+  const int32_t bpp =
+      *(reinterpret_cast<const int32_t*>(img_bytes.data() + 28));
   *channels = bpp / 8;
 
   if (s->verbose)
@@ -110,10 +111,9 @@ uint8_t* read_bmp(const std::string& input_bmp_name, int* width, int* height,
   bool top_down = (*height < 0);
 
   // Decode image, allocating tensor once the image size is known
-  uint8_t* output = new uint8_t[abs(*height) * *width * *channels];
   const uint8_t* bmp_pixels = &img_bytes[header_size];
-  return decode_bmp(bmp_pixels, row_size, output, *width, abs(*height),
-                    *channels, top_down);
+  return decode_bmp(bmp_pixels, row_size, *width, abs(*height), *channels,
+                    top_down);
 }
 
 }  // namespace label_image
diff --git a/tensorflow/contrib/lite/examples/label_image/bitmap_helpers.h b/tensorflow/contrib/lite/examples/label_image/bitmap_helpers.h
index 97343dde6b31694e5b2de20b35a7083fb8fe4a0e..5fc75b1f7274c14d49e4a26d6ce4902c037afa6b 100644
--- a/tensorflow/contrib/lite/examples/label_image/bitmap_helpers.h
+++ b/tensorflow/contrib/lite/examples/label_image/bitmap_helpers.h
@@ -22,8 +22,8 @@ limitations under the License.
 namespace tflite {
 namespace label_image {
 
-uint8_t* read_bmp(const std::string& input_bmp_name, int* width, int* height,
-                  int* channels, Settings* s);
+std::vector<uint8_t> read_bmp(const std::string& input_bmp_name, int* width,
+                              int* height, int* channels, Settings* s);
 
 template <class T>
 void resize(T* out, uint8_t* in, int image_height, int image_width,
diff --git a/tensorflow/contrib/lite/examples/label_image/label_image.cc b/tensorflow/contrib/lite/examples/label_image/label_image.cc
index 966fcd2a31fd4d4ff2c3e91633550a8effa81ee8..86d7d1cc4a625243791d5e7d5b746526a58efb6d 100644
--- a/tensorflow/contrib/lite/examples/label_image/label_image.cc
+++ b/tensorflow/contrib/lite/examples/label_image/label_image.cc
@@ -138,8 +138,8 @@ void RunInference(Settings* s) {
   int image_width = 224;
   int image_height = 224;
   int image_channels = 3;
-  uint8_t* in = read_bmp(s->input_bmp_name, &image_width, &image_height,
-                         &image_channels, s);
+  std::vector<uint8_t> in = read_bmp(s->input_bmp_name, &image_width,
+                                     &image_height, &image_channels, s);
 
   int input = interpreter->inputs()[0];
   if (s->verbose) LOG(INFO) << "input: " << input << "\n";
@@ -168,12 +168,12 @@ void RunInference(Settings* s) {
   switch (interpreter->tensor(input)->type) {
     case kTfLiteFloat32:
       s->input_floating = true;
-      resize<float>(interpreter->typed_tensor<float>(input), in, image_height,
-                    image_width, image_channels, wanted_height, wanted_width,
-                    wanted_channels, s);
+      resize<float>(interpreter->typed_tensor<float>(input), in.data(),
+                    image_height, image_width, image_channels, wanted_height,
+                    wanted_width, wanted_channels, s);
       break;
     case kTfLiteUInt8:
-      resize<uint8_t>(interpreter->typed_tensor<uint8_t>(input), in,
+      resize<uint8_t>(interpreter->typed_tensor<uint8_t>(input), in.data(),
                       image_height, image_width, image_channels, wanted_height,
                       wanted_width, wanted_channels, s);
       break;
diff --git a/tensorflow/contrib/lite/examples/label_image/label_image_test.cc b/tensorflow/contrib/lite/examples/label_image/label_image_test.cc
index ce35483f76e8f40ced79e1ee30774c62d0eba94e..de7de21f7741d3d46cb96e793e8bc4bfb21384fe 100644
--- a/tensorflow/contrib/lite/examples/label_image/label_image_test.cc
+++ b/tensorflow/contrib/lite/examples/label_image/label_image_test.cc
@@ -27,20 +27,20 @@ namespace label_image {
 
 TEST(LabelImageTest, GraceHopper) {
   std::string lena_file =
-      "tensorflow/contrib/lite/examples/label_image/testdata/grace_hopper.bmp";
+      "tensorflow/contrib/lite/examples/label_image/testdata/"
+      "grace_hopper.bmp";
   int height, width, channels;
   Settings s;
-  uint8_t *data;
-
-  data = read_bmp(lena_file, &width, &height, &channels, &s);
+  std::vector<uint8_t> input =
+      read_bmp(lena_file, &width, &height, &channels, &s);
   ASSERT_EQ(height, 606);
   ASSERT_EQ(width, 517);
   ASSERT_EQ(channels, 3);
 
-  uint8_t *out = new uint8_t[606 * 517 * 3];
-  downsize<uint8_t>(out, data, 606, 517, 3, 214, 214, 3, &s);
-  ASSERT_EQ(out[0], 0x15);
-  ASSERT_EQ(out[214 * 214 * 3 - 1], 0x12);
+  std::vector<uint8_t> output(606 * 517 * 3);
+  resize<uint8_t>(output.data(), input.data(), 606, 517, 3, 214, 214, 3, &s);
+  ASSERT_EQ(output[0], 0x15);
+  ASSERT_EQ(output[214 * 214 * 3 - 1], 0x11);
 }
 
 TEST(LabelImageTest, GetTopN) {
diff --git a/tensorflow/contrib/lite/examples/minimal/minimal.cc b/tensorflow/contrib/lite/examples/minimal/minimal.cc
index 106e3b027055b67092f653c6bcdc4827b56bdbaa..8b0ace96ccaf06ac1cbdc2ea95ac6e92ef886993 100644
--- a/tensorflow/contrib/lite/examples/minimal/minimal.cc
+++ b/tensorflow/contrib/lite/examples/minimal/minimal.cc
@@ -38,7 +38,7 @@ using namespace tflite;
 
 int main(int argc, char *argv[]) {
   if(argc != 2) {
-    fprintf(stderr, "Usage: %s <model>\n");
+    fprintf(stderr, "minimal <tflite model>\n");
     return 1;
   }
   const char* filename = argv[1];
diff --git a/tensorflow/contrib/lite/g3doc/ops_versioning.md b/tensorflow/contrib/lite/g3doc/ops_versioning.md
new file mode 100644
index 0000000000000000000000000000000000000000..bd2f797e6c5b05f52bec9fc34f1b8011aca70330
--- /dev/null
+++ b/tensorflow/contrib/lite/g3doc/ops_versioning.md
@@ -0,0 +1,206 @@
+# TensorFlow Lite Ops Versioning
+
+This document describes TensorFlow Lite's op versioning schema. Op
+versioning enables developers to add new functionalities and parameters into
+existing ops. In addition, it guarantees the following:
+
+*   Backward compatibility: New TensorFlow Lite implementation should
+    handle an old model file.
+*   Forward compatibility: Old TensorFlow Lite implementation should
+    handle a new model file produced by new version of TOCO, as long as no new
+    features are used.
+*   Forward in-compatibility detection: If an old TensorFlow Lite implementation
+    reads a new model that contains a new version of an op which isn't
+    supported, it should report the error.
+
+## Example: Adding Dilation into Convolution
+
+The remainder of this document explains op versioning in TFLite by showing how
+to add dilation parameters to the convolution operation.
+
+Knowledge of dilation is not required to understand this document. Note that:
+
+*   2 new integer parameters will be added: `dilation_width_factor` and
+    `dilation_height_factor`.
+*   Old convolution kernels that don't support dilation are equivalent to
+    setting the dilation factors to 1.
+
+### Change FlatBuffer Schema
+
+To add new parameters into an op, change the options table in
+`lite/schema/schema.fbs`.
+
+For example, the options table of convolution looks like this:
+
+```
+table Conv2DOptions {
+  padding:Padding;
+  stride_w:int;
+  stride_h:int;
+  fused_activation_function:ActivationFunctionType;
+}
+```
+
+When adding new parameters:
+
+*   Add comments indicating which parameters are supported by which version.
+*   When the new implementation gets the default values for newly added
+    parameters, it should work exactly the same as the old implementation.
+
+The table will be like this after the new parameters are added:
+
+```
+table Conv2DOptions {
+  // Parameters supported by version 1:
+  padding:Padding;
+  stride_w:int;
+  stride_h:int;
+  fused_activation_function:ActivationFunctionType;
+
+  // Parameters supported by version 2:
+  dilation_width_factor:int = 1;
+  dilation_height_factor:int = 1;
+}
+```
+
+### Change C Structures and Kernel Implementation
+
+In TensorFlow Lite, the kernel implementation is decoupled from
+FlatBuffer definition. The kernels read the parameter from C structures defined
+in `lite/builtin_op_data.h`.
+
+The original convolution parameter is as follows:
+
+```
+typedef struct {
+  TfLitePadding padding;
+  int stride_width;
+  int stride_height;
+  TfLiteFusedActivation activation;
+} TfLiteConvParams;
+```
+
+As with the FlatBuffer schema, add comments indicating which parameters are
+supported starting from which version. The result is seen below:
+
+```
+typedef struct {
+  // Parameters supported by version 1: TfLitePadding padding; int
+  stride_width;
+  int stride_height;
+  TfLiteFusedActivation activation;
+
+  // Parameters supported by version 2:
+  int dilation_width_factor;
+  int dilation_height_factor;
+} TfLiteConvParams;
+```
+
+Please also change the kernel implementation to read the newly added parameters
+from the C structures. The details are omitted here.
+
+### Change the FlatBuffer Reading Code
+
+The logic to read FlatBuffer and produce C structure is in `lite/model.cc`.
+
+Update the file to handle the new parameters, as shown below:
+
+```
+case BuiltinOperator_CONV_2D: {
+  TfLiteConvParams* params = MallocPOD<TfLiteConvParams>();
+  if (auto* conv_params = op->builtin_options_as_Conv2DOptions()) {
+    params->padding = parse_padding(conv_params->padding());
+    params->stride_width = conv_params->stride_w();
+    params->stride_height = conv_params->stride_h();
+    params->activation =
+        parse_activation(conv_params->fused_activation_function());
+    params->dilation_width_factor = conv_params->dilation_width_factor();
+    params->dilation_height_factor = conv_params->dilation_height_factor();
+  }
+  *builtin_data = reinterpret_cast<void*>(params);
+  break;
+}
+```
+
+It's not required to check the op version here. When the new implementation
+reads an old model file where dilation factors are missing, it will use 1 as
+the default value, and the new kernel will work consistently with the old
+kernel.
+
+### Change Kernel Registration
+
+The MutableOpResolver (defined in `lite/op_resolver.h`) provides a few functions
+to register op kernels. The minimum and maximum version are 1 by default:
+
+```
+void AddBuiltin(tflite::BuiltinOperator op, TfLiteRegistration* registration,
+                int min_version = 1, int max_version = 1);
+void AddCustom(const char* name, TfLiteRegistration* registration,
+               int min_version = 1, int max_version = 1);
+```
+
+The built-in ops are registered in `lite/kernels/register.cc`. In this example,
+we implemented a new op kernel which can handle `Conv2D` version 1 and 2, so we
+need to change this line:
+
+```
+AddBuiltin(BuiltinOperator_CONV_2D, Register_CONV_2D());
+```
+
+to:
+
+```
+AddBuiltin(BuiltinOperator_CONV_2D, Register_CONV_2D(), 1, 2);
+```
+
+### Change TOCO TFLite exporter
+
+The last step is to make TOCO populate the minimum version that's required to
+execute the op. In this example, it means:
+
+*   Populate version=1 when dilation factors are all 1.
+*   Populate version=2 otherwise.
+
+To do this, you need to override `GetVersion` function for the operator class in
+`lite/toco/tflite/operator.cc`.
+
+For ops with only one version, the `GetVersion` function is defined as:
+
+```
+int GetVersion(const Operator& op) const override { return 1; }
+```
+
+When supporting multiple versions, check the parameters and determine the
+version for the op, as shown in the following example:
+
+```
+int GetVersion(const Operator& op) const override {
+  const auto& conv_op = static_cast<const ConvOperator&>(op);
+  if (conv_op.dilation_width_factor != 1 ||
+      conv_op.dilation_height_factor != 1) {
+    return 2;
+  }
+  return 1;
+}
+```
+
+### Delegation Implementation
+
+TensorFlow Lite provides a delegation API which enables delegating ops to
+hardware backends. In Delegate's `Prepare` function, check if the version
+is supported for every node in Delegation code.
+
+```
+const int kMinVersion = 1;
+TfLiteNode* node;
+TfLiteRegistration;
+context->GetNodeAndRegistration(context, node_index, &node, &registration);
+
+if (registration->version > kMinVersion) {
+  // Reject the node if the version isn't supported.
+}
+```
+
+This is required even if the delegation only supports version 1 ops, so the
+delegation can detect incompatibility when getting a higher version op.
+
diff --git a/tensorflow/contrib/lite/g3doc/tf_ops_compatibility.md b/tensorflow/contrib/lite/g3doc/tf_ops_compatibility.md
index d8c46e633151cba94ff3d2a3c8b0ab5c230f245e..965273f0f04d33b52903c0551fff3533c31d3bd8 100644
--- a/tensorflow/contrib/lite/g3doc/tf_ops_compatibility.md
+++ b/tensorflow/contrib/lite/g3doc/tf_ops_compatibility.md
@@ -95,11 +95,7 @@ Here is a list of TensorFlow operations that are usually removed from the graph:
 *   [tf.divide](https://www.tensorflow.org/api_docs/python/tf/divide)
 *   [tf.fake_quant_with_min_max_args](https://www.tensorflow.org/api_docs/python/tf/fake_quant_with_min_max_args)
 *   [tf.fake_quant_with_min_max_vars](https://www.tensorflow.org/api_docs/python/tf/fake_quant_with_min_max_vars)
-*   [tf.greater](https://www.tensorflow.org/api_docs/python/tf/greater)
-*   [tf.greater_equal](https://www.tensorflow.org/api_docs/python/tf/greater_equal)
 *   [tf.identity](https://www.tensorflow.org/api_docs/python/tf/identity)
-*   [tf.less](https://www.tensorflow.org/api_docs/python/tf/less)
-*   [tf.less_equal](https://www.tensorflow.org/api_docs/python/tf/less_equal)
 *   [tf.maximum](https://www.tensorflow.org/api_docs/python/tf/maximum)
 *   [tf.minimum](https://www.tensorflow.org/api_docs/python/tf/minimum)
 *   [tf.multiply](https://www.tensorflow.org/api_docs/python/tf/multiply)
@@ -257,6 +253,19 @@ Options {
 }
 ```
 
+**EQUAL**
+
+```
+Inputs {
+  0: a tensor
+  1: a tensor
+}
+Outputs {
+  0: a tensor of type bool, true whenever an element of the first tensor is
+  equal to the corresponding element of the second tensor.
+}
+```
+
 **EXP**
 
 ```
@@ -420,6 +429,17 @@ Outputs {
 }
 ```
 
+**LOG**
+
+```
+Inputs {
+  0: a tensor
+}
+Outputs {
+  0: a tensor equivalent to log(input)
+}
+```
+
 **LOG_SOFTMAX**
 
 ```
@@ -503,6 +523,19 @@ Options {
 }
 ```
 
+**NOT_EQUAL**
+
+```
+Inputs {
+  0: a tensor
+  1: a tensor
+}
+Outputs {
+  0: a tensor of type bool, true whenever an element of the first tensor is not
+  equal to the corresponding element of the second tensor.
+}
+```
+
 **RELU**
 
 ```
@@ -607,6 +640,21 @@ Outputs {
 }
 ```
 
+**SPARSE_TO_DENSE**
+
+```
+Inputs {
+  0: 0D or 1D or 2D tensor
+  1: 1D tensor
+  2: 0D or 1D tensor
+  3: 0D tensor
+  4: a boolean value
+}
+Outputs {
+  0: Dense Tensor of shape output_shape. Has the same type as sparse_values.
+}
+```
+
 **SPLIT**
 
 ```
diff --git a/tensorflow/contrib/lite/interpreter_test.cc b/tensorflow/contrib/lite/interpreter_test.cc
index 453c1ada1cf6263be14a3b170f209e3a30580cc3..4c78466480bdcfb42b1f582ecc7c185201a81b05 100644
--- a/tensorflow/contrib/lite/interpreter_test.cc
+++ b/tensorflow/contrib/lite/interpreter_test.cc
@@ -211,7 +211,7 @@ TEST(BasicInterpreter, CheckArenaAllocation) {
   TfLiteRegistration reg = {nullptr, nullptr, nullptr, nullptr};
 
   std::vector<int> sizes{2048, 4096, 1023, 2047, 1021,
-                         2047, 1023, 2046, 1021, 2048};
+                         2047, 1023, 2046, 0,    2048};
   for (int i = 0; i < sizes.size(); ++i) {
     interpreter.SetTensorParametersReadWrite(i, kTfLiteUInt8, "", {sizes[i]},
                                              quant);
@@ -228,6 +228,7 @@ TEST(BasicInterpreter, CheckArenaAllocation) {
 
   ASSERT_EQ(interpreter.tensor(0)->data.raw, interpreter.tensor(4)->data.raw);
   ASSERT_EQ(interpreter.tensor(1)->data.raw, interpreter.tensor(7)->data.raw);
+  ASSERT_EQ(interpreter.tensor(8)->data.raw, nullptr);
 
   ASSERT_LT(interpreter.tensor(4)->data.raw, interpreter.tensor(1)->data.raw);
   ASSERT_LT(interpreter.tensor(6)->data.raw, interpreter.tensor(1)->data.raw);
@@ -314,6 +315,18 @@ TEST(BasicInterpreter, ResizingTensors) {
   EXPECT_EQ(tensor->bytes, 8 * sizeof(float));
   ASSERT_EQ(interpreter.AllocateTensors(), kTfLiteOk);
 
+  ASSERT_EQ(interpreter.ResizeInputTensor(t, {}), kTfLiteOk);
+  EXPECT_EQ(tensor->bytes, 1 * sizeof(float));
+  ASSERT_EQ(interpreter.AllocateTensors(), kTfLiteOk);
+
+  ASSERT_EQ(interpreter.ResizeInputTensor(t, {0}), kTfLiteOk);
+  EXPECT_EQ(tensor->bytes, 0);
+  ASSERT_EQ(interpreter.AllocateTensors(), kTfLiteOk);
+
+  ASSERT_EQ(interpreter.ResizeInputTensor(t, {1, 2, 0}), kTfLiteOk);
+  EXPECT_EQ(tensor->bytes, 0);
+  ASSERT_EQ(interpreter.AllocateTensors(), kTfLiteOk);
+
   // TODO(ahentz): We shouldn't have to force reallocation, but
   // ResizeInputTensor doesn't realloc dynamic tensors. Also note that
   // TfLiteTensorRealloc(tensor->bytes, tensor) is a no-op.
diff --git a/tensorflow/contrib/lite/java/src/main/java/org/tensorflow/lite/Interpreter.java b/tensorflow/contrib/lite/java/src/main/java/org/tensorflow/lite/Interpreter.java
index 644ce4cb3e0beaed2b9ae542cdacbb912ab0f010..fd1f0ffa68eeca7b5866b146ecaa1f9216ef377d 100644
--- a/tensorflow/contrib/lite/java/src/main/java/org/tensorflow/lite/Interpreter.java
+++ b/tensorflow/contrib/lite/java/src/main/java/org/tensorflow/lite/Interpreter.java
@@ -17,6 +17,7 @@ package org.tensorflow.lite;
 
 import java.io.File;
 import java.nio.ByteBuffer;
+import java.nio.MappedByteBuffer;
 import java.util.HashMap;
 import java.util.Map;
 import org.checkerframework.checker.nullness.qual.NonNull;
@@ -103,6 +104,27 @@ public final class Interpreter implements AutoCloseable {
     wrapper = new NativeInterpreterWrapper(byteBuffer, numThreads);
   }
 
+  /**
+   * Initializes a {@code Interpreter} with a {@code MappedByteBuffer} to the model file.
+   *
+   * <p>The {@code MappedByteBuffer} should remain unchanged after the construction of a {@code
+   * Interpreter}.
+   */
+  public Interpreter(@NonNull MappedByteBuffer mappedByteBuffer) {
+    wrapper = new NativeInterpreterWrapper(mappedByteBuffer);
+  }
+
+  /**
+   * Initializes a {@code Interpreter} with a {@code MappedByteBuffer} to the model file and
+   * specifies the number of threads used for inference.
+   *
+   * <p>The {@code MappedByteBuffer} should remain unchanged after the construction of a {@code
+   * Interpreter}.
+   */
+  public Interpreter(@NonNull MappedByteBuffer mappedByteBuffer, int numThreads) {
+    wrapper = new NativeInterpreterWrapper(mappedByteBuffer, numThreads);
+  }
+
   /**
    * Runs model inference if the model takes only one input, and provides only one output.
    *
@@ -231,5 +253,14 @@ public final class Interpreter implements AutoCloseable {
     wrapper = null;
   }
 
+  @Override
+  protected void finalize() throws Throwable {
+    try {
+      close();
+    } finally {
+      super.finalize();
+    }
+  }
+
   NativeInterpreterWrapper wrapper;
 }
diff --git a/tensorflow/contrib/lite/java/src/main/native/tensor_jni.cc b/tensorflow/contrib/lite/java/src/main/native/tensor_jni.cc
index 005dca0253d2c30d56a15adf6e2b371d43f50945..9e9387da86ebde7d711a7ce967461e370c95bc3e 100644
--- a/tensorflow/contrib/lite/java/src/main/native/tensor_jni.cc
+++ b/tensorflow/contrib/lite/java/src/main/native/tensor_jni.cc
@@ -43,31 +43,27 @@ size_t writeOneDimensionalArray(JNIEnv* env, jobject object, TfLiteType type,
   }
   switch (type) {
     case kTfLiteFloat32: {
-      jfloatArray a = static_cast<jfloatArray>(array);
-      jfloat* values = env->GetFloatArrayElements(a, nullptr);
-      memcpy(dst, values, to_copy);
-      env->ReleaseFloatArrayElements(a, values, JNI_ABORT);
+      jfloatArray float_array = static_cast<jfloatArray>(array);
+      jfloat* float_dst = static_cast<jfloat*>(dst);
+      env->GetFloatArrayRegion(float_array, 0, num_elements, float_dst);
       return to_copy;
     }
     case kTfLiteInt32: {
-      jintArray a = static_cast<jintArray>(array);
-      jint* values = env->GetIntArrayElements(a, nullptr);
-      memcpy(dst, values, to_copy);
-      env->ReleaseIntArrayElements(a, values, JNI_ABORT);
+      jintArray int_array = static_cast<jintArray>(array);
+      jint* int_dst = static_cast<jint*>(dst);
+      env->GetIntArrayRegion(int_array, 0, num_elements, int_dst);
       return to_copy;
     }
     case kTfLiteInt64: {
-      jlongArray a = static_cast<jlongArray>(array);
-      jlong* values = env->GetLongArrayElements(a, nullptr);
-      memcpy(dst, values, to_copy);
-      env->ReleaseLongArrayElements(a, values, JNI_ABORT);
+      jlongArray long_array = static_cast<jlongArray>(array);
+      jlong* long_dst = static_cast<jlong*>(dst);
+      env->GetLongArrayRegion(long_array, 0, num_elements, long_dst);
       return to_copy;
     }
     case kTfLiteUInt8: {
-      jbyteArray a = static_cast<jbyteArray>(array);
-      jbyte* values = env->GetByteArrayElements(a, nullptr);
-      memcpy(dst, values, to_copy);
-      env->ReleaseByteArrayElements(a, values, JNI_ABORT);
+      jbyteArray byte_array = static_cast<jbyteArray>(array);
+      jbyte* byte_dst = static_cast<jbyte*>(dst);
+      env->GetByteArrayRegion(byte_array, 0, num_elements, byte_dst);
       return to_copy;
     }
     default: {
diff --git a/tensorflow/contrib/lite/kernels/Android.bp b/tensorflow/contrib/lite/kernels/Android.bp
index 59262d398ef173008d18ed912fc64d7ea5cb97d9..0e89cc33f6e7438f38d121cac2639965b35f9e51 100644
--- a/tensorflow/contrib/lite/kernels/Android.bp
+++ b/tensorflow/contrib/lite/kernels/Android.bp
@@ -49,10 +49,11 @@ cc_library_static {
         "depthwise_conv.cc",
         "dequantize.cc",
         "div.cc",
-        "elementwise.cc", 	
+        "elementwise.cc",
         "embedding_lookup.cc",
         "embedding_lookup_sparse.cc",
         "exp.cc",
+        "expand_dims.cc",
         "floor.cc",
         "fully_connected.cc",
         "gather.cc",
@@ -76,11 +77,13 @@ cc_library_static {
         "slice.cc",
         "space_to_batch_nd.cc",
         "space_to_depth.cc",
+        "sparse_to_dense.cc",
         "split.cc",
         "squeeze.cc",
         "strided_slice.cc",
         "sub.cc",
         "svdf.cc",
+        "tile.cc",
         "topk_v2.cc",
         "transpose.cc",
         "transpose_conv.cc",
diff --git a/tensorflow/contrib/lite/kernels/BUILD b/tensorflow/contrib/lite/kernels/BUILD
index b7291dd379a6c09a70a78de7bc6c2f217b293b26..cf5d0b4ce9cb3c516c185f31fea12db70a2c3bdb 100644
--- a/tensorflow/contrib/lite/kernels/BUILD
+++ b/tensorflow/contrib/lite/kernels/BUILD
@@ -147,6 +147,7 @@ cc_library(
         "embedding_lookup.cc",
         "embedding_lookup_sparse.cc",
         "exp.cc",
+        "expand_dims.cc",
         "floor.cc",
         "fully_connected.cc",
         "gather.cc",
@@ -170,11 +171,13 @@ cc_library(
         "slice.cc",
         "space_to_batch_nd.cc",
         "space_to_depth.cc",
+        "sparse_to_dense.cc",
         "split.cc",
         "squeeze.cc",
         "strided_slice.cc",
         "sub.cc",
         "svdf.cc",
+        "tile.cc",
         "topk_v2.cc",
         "transpose.cc",
         "transpose_conv.cc",
@@ -857,6 +860,20 @@ tf_cc_test(
     ],
 )
 
+tf_cc_test(
+    name = "tile_test",
+    size = "small",
+    srcs = ["tile_test.cc"],
+    tags = ["tflite_not_portable_ios"],
+    deps = [
+        ":builtin_ops",
+        "//tensorflow/contrib/lite:builtin_op_data",
+        "//tensorflow/contrib/lite:framework",
+        "//tensorflow/contrib/lite/kernels:test_util",
+        "@com_google_googletest//:gtest",
+    ],
+)
+
 tf_cc_test(
     name = "comparisons_test",
     size = "small",
@@ -934,6 +951,34 @@ tf_cc_test(
     ],
 )
 
+tf_cc_test(
+    name = "expand_dims_test",
+    size = "small",
+    srcs = ["expand_dims_test.cc"],
+    tags = ["tflite_not_portable_ios"],
+    deps = [
+        ":builtin_ops",
+        "//tensorflow/contrib/lite:builtin_op_data",
+        "//tensorflow/contrib/lite:framework",
+        "//tensorflow/contrib/lite/kernels:test_util",
+        "@com_google_googletest//:gtest",
+    ],
+)
+
+tf_cc_test(
+    name = "sparse_to_dense_test",
+    size = "small",
+    srcs = ["sparse_to_dense_test.cc"],
+    tags = ["tflite_not_portable_ios"],
+    deps = [
+        ":builtin_ops",
+        "//tensorflow/contrib/lite:builtin_op_data",
+        "//tensorflow/contrib/lite:framework",
+        "//tensorflow/contrib/lite/kernels:test_util",
+        "@com_google_googletest//:gtest",
+    ],
+)
+
 filegroup(
     name = "all_files",
     srcs = glob(
diff --git a/tensorflow/contrib/lite/kernels/add.cc b/tensorflow/contrib/lite/kernels/add.cc
index 7ca1e35489cba3b5d2567bc04e532fedf8a527a7..443ce8924a43669fb264e19561c733d7e3436cb0 100644
--- a/tensorflow/contrib/lite/kernels/add.cc
+++ b/tensorflow/contrib/lite/kernels/add.cc
@@ -126,16 +126,19 @@ void EvalAddQuantized(TfLiteContext* context, TfLiteNode* node,
 
   int32 input1_multiplier;
   int input1_shift;
-  QuantizeMultiplierSmallerThanOne(real_input1_multiplier, &input1_multiplier,
-                                   &input1_shift);
+  QuantizeMultiplierSmallerThanOneExp(real_input1_multiplier,
+                                      &input1_multiplier, &input1_shift);
+  input1_shift *= -1;
   int32 input2_multiplier;
   int input2_shift;
-  QuantizeMultiplierSmallerThanOne(real_input2_multiplier, &input2_multiplier,
-                                   &input2_shift);
+  QuantizeMultiplierSmallerThanOneExp(real_input2_multiplier,
+                                      &input2_multiplier, &input2_shift);
+  input2_shift *= -1;
   int32 output_multiplier;
   int output_shift;
-  QuantizeMultiplierSmallerThanOne(real_output_multiplier, &output_multiplier,
-                                   &output_shift);
+  QuantizeMultiplierSmallerThanOneExp(real_output_multiplier,
+                                      &output_multiplier, &output_shift);
+  output_shift *= -1;
 
   int32 output_activation_min, output_activation_max;
   CalculateActivationRangeUint8(params->activation, output,
diff --git a/tensorflow/contrib/lite/kernels/basic_rnn.cc b/tensorflow/contrib/lite/kernels/basic_rnn.cc
index 7dc0c5656dca02a86339c558f4fe2babb4961695..c09b15b3d263d6cd639234590c99a50a9a48f4a7 100644
--- a/tensorflow/contrib/lite/kernels/basic_rnn.cc
+++ b/tensorflow/contrib/lite/kernels/basic_rnn.cc
@@ -36,7 +36,7 @@ constexpr int kOutputTensor = 1;
 
 void* Init(TfLiteContext* context, const char* buffer, size_t length) {
   auto* scratch_tensor_index = new int;
-  context->AddTensors(context, /*tensors_to_add=*/2, scratch_tensor_index);
+  context->AddTensors(context, /*tensors_to_add=*/3, scratch_tensor_index);
   return scratch_tensor_index;
 }
 
@@ -91,7 +91,7 @@ TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
   if (input->type == kTfLiteFloat32 && input_weights->type == kTfLiteUInt8) {
     int* scratch_tensor_index = reinterpret_cast<int*>(node->user_data);
     TfLiteIntArrayFree(node->temporaries);
-    node->temporaries = TfLiteIntArrayCreate(2);
+    node->temporaries = TfLiteIntArrayCreate(3);
     node->temporaries->data[0] = *scratch_tensor_index;
     TfLiteTensor* input_quantized = GetTemporary(context, node, /*index=*/0);
     input_quantized->type = kTfLiteUInt8;
@@ -114,6 +114,16 @@ TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
                         context->ResizeTensor(context, hidden_state_quantized,
                                               hidden_state_quantized_size));
     }
+    node->temporaries->data[2] = *scratch_tensor_index + 2;
+    TfLiteTensor* scaling_factors = GetTemporary(context, node, /*index=*/2);
+    scaling_factors->type = kTfLiteFloat32;
+    scaling_factors->allocation_type = kTfLiteArenaRw;
+    TfLiteIntArray* scaling_factors_size = TfLiteIntArrayCreate(1);
+    scaling_factors_size->data[0] = batch_size;
+    if (!TfLiteIntArrayEqual(scaling_factors->dims, scaling_factors_size)) {
+      TF_LITE_ENSURE_OK(context, context->ResizeTensor(context, scaling_factors,
+                                                       scaling_factors_size));
+    }
   }
 
   return kTfLiteOk;
@@ -145,14 +155,14 @@ TfLiteStatus EvalFloat(const TfLiteTensor* input,
   return kTfLiteOk;
 }
 
-TfLiteStatus EvalQuantized(const TfLiteTensor* input,
-                           const TfLiteTensor* input_weights,
-                           const TfLiteTensor* recurrent_weights,
-                           const TfLiteTensor* bias,
-                           const TfLiteRNNParams* params,
-                           TfLiteTensor* input_scratch,
-                           TfLiteTensor* hidden_state_scratch,
-                           TfLiteTensor* hidden_state, TfLiteTensor* output) {
+TfLiteStatus EvalHybrid(const TfLiteTensor* input,
+                        const TfLiteTensor* input_weights,
+                        const TfLiteTensor* recurrent_weights,
+                        const TfLiteTensor* bias, const TfLiteRNNParams* params,
+                        TfLiteTensor* input_scratch,
+                        TfLiteTensor* hidden_state_scratch,
+                        TfLiteTensor* scaling_factors,
+                        TfLiteTensor* hidden_state, TfLiteTensor* output) {
   const int batch_size = input->dims->data[0];
   const int num_units = input_weights->dims->data[0];
   const int input_size = input->dims->data[1];
@@ -176,12 +186,14 @@ TfLiteStatus EvalQuantized(const TfLiteTensor* input,
       reinterpret_cast<int8_t*>(input_scratch->data.uint8);
   int8_t* quantized_hidden_state_ptr =
       reinterpret_cast<int8_t*>(hidden_state_scratch->data.uint8);
+  float* scaling_factors_ptr = scaling_factors->data.f;
 
   kernel_utils::RnnBatchStep(
       input_ptr_batch, input_weights_ptr, input_weights_scale,
       recurrent_weights_ptr, recurrent_weights_scale, bias_ptr, input_size,
       num_units, batch_size, params->activation, quantized_input_ptr,
-      quantized_hidden_state_ptr, hidden_state_ptr_batch, output_ptr_batch);
+      quantized_hidden_state_ptr, scaling_factors_ptr, hidden_state_ptr_batch,
+      output_ptr_batch);
   return kTfLiteOk;
 }
 
@@ -205,9 +217,10 @@ TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
       // TODO(mirkov): implement eval with quantized inputs as well.
       TfLiteTensor* input_quantized = GetTemporary(context, node, 0);
       TfLiteTensor* hidden_state_quantized = GetTemporary(context, node, 1);
-      return EvalQuantized(input, input_weights, recurrent_weights, bias,
-                           params, input_quantized, hidden_state_quantized,
-                           hidden_state, output);
+      TfLiteTensor* scaling_factors = GetTemporary(context, node, 2);
+      return EvalHybrid(input, input_weights, recurrent_weights, bias, params,
+                        input_quantized, hidden_state_quantized,
+                        scaling_factors, hidden_state, output);
     }
     default:
       context->ReportError(context, "Type %d not currently supported.",
diff --git a/tensorflow/contrib/lite/kernels/comparisons.cc b/tensorflow/contrib/lite/kernels/comparisons.cc
index 3b81062cd42f04582b33ea919ef2742d3d869c22..f678f48fa5bbbcece6c5b87030d951783378d78f 100644
--- a/tensorflow/contrib/lite/kernels/comparisons.cc
+++ b/tensorflow/contrib/lite/kernels/comparisons.cc
@@ -23,6 +23,7 @@ namespace tflite {
 namespace ops {
 namespace builtin {
 namespace comparisons {
+namespace {
 
 constexpr int kInputTensor1 = 0;
 constexpr int kInputTensor2 = 1;
@@ -67,6 +68,57 @@ TfLiteStatus ComparisonPrepare(TfLiteContext* context, TfLiteNode* node) {
             GetTensorData<type>(input2), GetTensorDims(input2), \
             GetTensorData<bool>(output), GetTensorDims(output));
 
+TfLiteStatus EqualEval(TfLiteContext* context, TfLiteNode* node) {
+  const TfLiteTensor* input1 = GetInput(context, node, kInputTensor1);
+  const TfLiteTensor* input2 = GetInput(context, node, kInputTensor2);
+  TfLiteTensor* output = GetOutput(context, node, kOutputTensor);
+  bool requires_broadcast = !HaveSameShapes(input1, input2);
+  // TODO(renjieliu): Support quantized data.
+  switch (input1->type) {
+    case kTfLiteFloat32:
+      TF_LITE_COMPARISON(float, Equal, requires_broadcast);
+      break;
+    case kTfLiteInt32:
+      TF_LITE_COMPARISON(int32_t, Equal, requires_broadcast);
+      break;
+    case kTfLiteInt64:
+      TF_LITE_COMPARISON(int64_t, Equal, requires_broadcast);
+      break;
+    default:
+      context->ReportError(context,
+                           "Does not support type %d, requires float|int",
+                           input1->type);
+      return kTfLiteError;
+  }
+  return kTfLiteOk;
+}
+
+// TODO(renjieliu): Refactor the logic to avoid duplications.
+TfLiteStatus NotEqualEval(TfLiteContext* context, TfLiteNode* node) {
+  const TfLiteTensor* input1 = GetInput(context, node, kInputTensor1);
+  const TfLiteTensor* input2 = GetInput(context, node, kInputTensor2);
+  TfLiteTensor* output = GetOutput(context, node, kOutputTensor);
+  bool requires_broadcast = !HaveSameShapes(input1, input2);
+  // TODO(renjieliu): Support quantized data.
+  switch (input1->type) {
+    case kTfLiteFloat32:
+      TF_LITE_COMPARISON(float, NotEqual, requires_broadcast);
+      break;
+    case kTfLiteInt32:
+      TF_LITE_COMPARISON(int32_t, NotEqual, requires_broadcast);
+      break;
+    case kTfLiteInt64:
+      TF_LITE_COMPARISON(int64_t, NotEqual, requires_broadcast);
+      break;
+    default:
+      context->ReportError(context,
+                           "Does not support type %d, requires float|int",
+                           input1->type);
+      return kTfLiteError;
+  }
+  return kTfLiteOk;
+}
+
 TfLiteStatus GreaterEval(TfLiteContext* context, TfLiteNode* node) {
   const TfLiteTensor* input1 = GetInput(context, node, kInputTensor1);
   const TfLiteTensor* input2 = GetInput(context, node, kInputTensor2);
@@ -167,8 +219,22 @@ TfLiteStatus LessEqualEval(TfLiteContext* context, TfLiteNode* node) {
   return kTfLiteOk;
 }
 
+}  // namespace
 }  // namespace comparisons
 
+TfLiteRegistration* Register_EQUAL() {
+  static TfLiteRegistration r = {
+      nullptr, nullptr, comparisons::ComparisonPrepare, comparisons::EqualEval};
+  return &r;
+}
+
+TfLiteRegistration* Register_NOT_EQUAL() {
+  static TfLiteRegistration r = {nullptr, nullptr,
+                                 comparisons::ComparisonPrepare,
+                                 comparisons::NotEqualEval};
+  return &r;
+}
+
 TfLiteRegistration* Register_GREATER() {
   static TfLiteRegistration r = {nullptr, nullptr,
                                  comparisons::ComparisonPrepare,
diff --git a/tensorflow/contrib/lite/kernels/comparisons_test.cc b/tensorflow/contrib/lite/kernels/comparisons_test.cc
index 835d238d36d1757a27119ae24b3c07232e9d3dc0..bb02e1c812fdc40bf515f1f978e9e39b5a16a4ea 100644
--- a/tensorflow/contrib/lite/kernels/comparisons_test.cc
+++ b/tensorflow/contrib/lite/kernels/comparisons_test.cc
@@ -21,18 +21,17 @@ limitations under the License.
 namespace tflite {
 namespace {
 
-using ::testing::ElementsAreArray;
+using ::testing::ElementsAre;
 
-class GreaterOpModel : public SingleOpModel {
+class ComparisonOpModel : public SingleOpModel {
  public:
-  GreaterOpModel(std::initializer_list<int> input1_shape,
-                 std::initializer_list<int> input2_shape,
-                 TensorType input_type) {
+  ComparisonOpModel(std::initializer_list<int> input1_shape,
+                    std::initializer_list<int> input2_shape,
+                    TensorType input_type, BuiltinOperator op) {
     input1_ = AddInput(input_type);
     input2_ = AddInput(input_type);
     output_ = AddOutput(TensorType_BOOL);
-    SetBuiltinOp(BuiltinOperator_GREATER, BuiltinOptions_GreaterOptions,
-                 CreateGreaterOptions(builder_).Union());
+    ConfigureBuiltinOp(op);
     BuildInterpreter({input1_shape, input2_shape});
   }
 
@@ -46,245 +45,313 @@ class GreaterOpModel : public SingleOpModel {
   int input1_;
   int input2_;
   int output_;
+
+  void ConfigureBuiltinOp(BuiltinOperator op) {
+    switch (op) {
+      case BuiltinOperator_EQUAL: {
+        SetBuiltinOp(op, BuiltinOptions_EqualOptions,
+                     CreateEqualOptions(builder_).Union());
+        break;
+      }
+      case BuiltinOperator_NOT_EQUAL: {
+        SetBuiltinOp(op, BuiltinOptions_NotEqualOptions,
+                     CreateNotEqualOptions(builder_).Union());
+        break;
+      }
+      case BuiltinOperator_GREATER: {
+        SetBuiltinOp(op, BuiltinOptions_GreaterOptions,
+                     CreateGreaterOptions(builder_).Union());
+        break;
+      }
+      case BuiltinOperator_GREATER_EQUAL: {
+        SetBuiltinOp(op, BuiltinOptions_GreaterEqualOptions,
+                     CreateGreaterEqualOptions(builder_).Union());
+        break;
+      }
+      case BuiltinOperator_LESS: {
+        SetBuiltinOp(op, BuiltinOptions_LessOptions,
+                     CreateLessOptions(builder_).Union());
+        break;
+      }
+      case BuiltinOperator_LESS_EQUAL: {
+        SetBuiltinOp(op, BuiltinOptions_LessEqualOptions,
+                     CreateLessEqualOptions(builder_).Union());
+        break;
+      }
+      default: { FAIL() << "We shouldn't get here."; }
+    }
+  }
 };
 
-TEST(ComparisonsTest, GreaterFloat) {
-  GreaterOpModel model({1, 1, 1, 4}, {1, 1, 1, 4}, TensorType_FLOAT32);
+TEST(ComparisonsTest, EqualFloat) {
+  ComparisonOpModel model({1, 1, 1, 4}, {1, 1, 1, 4}, TensorType_FLOAT32,
+                          BuiltinOperator_EQUAL);
   model.PopulateTensor<float>(model.input1(), {0.1, 0.9, 0.7, 0.3});
   model.PopulateTensor<float>(model.input2(), {0.1, 0.2, 0.6, 0.5});
   model.Invoke();
 
-  EXPECT_THAT(model.GetOutput(), ElementsAreArray({false, true, true, false}));
-  EXPECT_THAT(model.GetOutputShape(), ElementsAreArray({1, 1, 1, 4}));
+  EXPECT_THAT(model.GetOutput(), ElementsAre(true, false, false, false));
+  EXPECT_THAT(model.GetOutputShape(), ElementsAre(1, 1, 1, 4));
 }
 
-TEST(ComparisonsTest, GreaterInt) {
-  GreaterOpModel model({1, 1, 1, 4}, {1, 1, 1, 4}, TensorType_INT32);
+TEST(ComparisonsTest, EqualInt) {
+  ComparisonOpModel model({1, 1, 1, 4}, {1, 1, 1, 4}, TensorType_INT32,
+                          BuiltinOperator_EQUAL);
   model.PopulateTensor<int>(model.input1(), {-1, 9, 7, 3});
   model.PopulateTensor<int>(model.input2(), {1, 2, 7, 5});
   model.Invoke();
 
-  EXPECT_THAT(model.GetOutput(), ElementsAreArray({false, true, false, false}));
-  EXPECT_THAT(model.GetOutputShape(), ElementsAreArray({1, 1, 1, 4}));
+  EXPECT_THAT(model.GetOutput(), ElementsAre(false, false, true, false));
+  EXPECT_THAT(model.GetOutputShape(), ElementsAre(1, 1, 1, 4));
 }
 
-TEST(ComparisonsTest, GreaterBroadcast) {
-  GreaterOpModel model({1, 1, 1, 4}, {1, 1, 1, 1}, TensorType_INT32);
+TEST(ComparisonsTest, EqualBroadcast) {
+  ComparisonOpModel model({1, 1, 1, 4}, {1, 1, 1, 1}, TensorType_INT32,
+                          BuiltinOperator_EQUAL);
   model.PopulateTensor<int>(model.input1(), {-1, 9, 7, 3});
   model.PopulateTensor<int>(model.input2(), {7});
   model.Invoke();
 
-  EXPECT_THAT(model.GetOutput(), ElementsAreArray({false, true, false, false}));
-  EXPECT_THAT(model.GetOutputShape(), ElementsAreArray({1, 1, 1, 4}));
+  EXPECT_THAT(model.GetOutput(), ElementsAre(false, false, true, false));
+  EXPECT_THAT(model.GetOutputShape(), ElementsAre(1, 1, 1, 4));
 }
 
-TEST(ComparisonsTest, GreaterBroadcastTwoD) {
-  GreaterOpModel model({1, 1, 2, 4}, {1, 1, 1, 4}, TensorType_INT32);
+TEST(ComparisonsTest, EqualBroadcastTwoD) {
+  ComparisonOpModel model({1, 1, 2, 4}, {1, 1, 1, 4}, TensorType_INT32,
+                          BuiltinOperator_EQUAL);
   model.PopulateTensor<int>(model.input1(), {-1, 9, 7, 3, 2, 4, 2, 8});
   model.PopulateTensor<int>(model.input2(), {7, 1, 2, 4});
   model.Invoke();
 
-  EXPECT_THAT(model.GetOutput(), ElementsAreArray({false, true, true, false,
-                                                   false, true, false, true}));
-  EXPECT_THAT(model.GetOutputShape(), ElementsAreArray({1, 1, 2, 4}));
+  EXPECT_THAT(model.GetOutput(), ElementsAre(false, false, false, false, false,
+                                             false, true, false));
+  EXPECT_THAT(model.GetOutputShape(), ElementsAre(1, 1, 2, 4));
 }
 
-class GreaterEqualOpModel : public SingleOpModel {
- public:
-  GreaterEqualOpModel(std::initializer_list<int> input1_shape,
-                      std::initializer_list<int> input2_shape,
-                      TensorType input_type) {
-    input1_ = AddInput(input_type);
-    input2_ = AddInput(input_type);
-    output_ = AddOutput(TensorType_BOOL);
-    SetBuiltinOp(BuiltinOperator_GREATER_EQUAL,
-                 BuiltinOptions_GreaterEqualOptions,
-                 CreateGreaterEqualOptions(builder_).Union());
-    BuildInterpreter({input1_shape, input2_shape});
-  }
+TEST(ComparisonsTest, NotEqualFloat) {
+  ComparisonOpModel model({1, 1, 1, 4}, {1, 1, 1, 4}, TensorType_FLOAT32,
+                          BuiltinOperator_NOT_EQUAL);
+  model.PopulateTensor<float>(model.input1(), {0.1, 0.9, 0.7, 0.3});
+  model.PopulateTensor<float>(model.input2(), {0.1, 0.2, 0.6, 0.5});
+  model.Invoke();
 
-  int input1() { return input1_; }
-  int input2() { return input2_; }
+  EXPECT_THAT(model.GetOutput(), ElementsAre(false, true, true, true));
+  EXPECT_THAT(model.GetOutputShape(), ElementsAre(1, 1, 1, 4));
+}
 
-  std::vector<bool> GetOutput() { return ExtractVector<bool>(output_); }
-  std::vector<int> GetOutputShape() { return GetTensorShape(output_); }
+TEST(ComparisonsTest, NotEqualInt) {
+  ComparisonOpModel model({1, 1, 1, 4}, {1, 1, 1, 4}, TensorType_INT32,
+                          BuiltinOperator_NOT_EQUAL);
+  model.PopulateTensor<int>(model.input1(), {-1, 9, 7, 3});
+  model.PopulateTensor<int>(model.input2(), {1, 2, 7, 5});
+  model.Invoke();
 
- private:
-  int input1_;
-  int input2_;
-  int output_;
-};
+  EXPECT_THAT(model.GetOutput(), ElementsAre(true, true, false, true));
+  EXPECT_THAT(model.GetOutputShape(), ElementsAre(1, 1, 1, 4));
+}
+
+TEST(ComparisonsTest, NotEqualBroadcast) {
+  ComparisonOpModel model({1, 1, 1, 4}, {1, 1, 1, 1}, TensorType_INT32,
+                          BuiltinOperator_NOT_EQUAL);
+  model.PopulateTensor<int>(model.input1(), {-1, 9, 7, 3});
+  model.PopulateTensor<int>(model.input2(), {7});
+  model.Invoke();
+
+  EXPECT_THAT(model.GetOutput(), ElementsAre(true, true, false, true));
+  EXPECT_THAT(model.GetOutputShape(), ElementsAre(1, 1, 1, 4));
+}
+
+TEST(ComparisonsTest, NotEqualBroadcastTwoD) {
+  ComparisonOpModel model({1, 1, 2, 4}, {1, 1, 1, 4}, TensorType_INT32,
+                          BuiltinOperator_NOT_EQUAL);
+  model.PopulateTensor<int>(model.input1(), {-1, 9, 7, 3, 2, 4, 2, 8});
+  model.PopulateTensor<int>(model.input2(), {7, 1, 2, 4});
+  model.Invoke();
+
+  EXPECT_THAT(model.GetOutput(),
+              ElementsAre(true, true, true, true, true, true, false, true));
+  EXPECT_THAT(model.GetOutputShape(), ElementsAre(1, 1, 2, 4));
+}
+
+TEST(ComparisonsTest, GreaterFloat) {
+  ComparisonOpModel model({1, 1, 1, 4}, {1, 1, 1, 4}, TensorType_FLOAT32,
+                          BuiltinOperator_GREATER);
+  model.PopulateTensor<float>(model.input1(), {0.1, 0.9, 0.7, 0.3});
+  model.PopulateTensor<float>(model.input2(), {0.1, 0.2, 0.6, 0.5});
+  model.Invoke();
+
+  EXPECT_THAT(model.GetOutput(), ElementsAre(false, true, true, false));
+  EXPECT_THAT(model.GetOutputShape(), ElementsAre(1, 1, 1, 4));
+}
+
+TEST(ComparisonsTest, GreaterInt) {
+  ComparisonOpModel model({1, 1, 1, 4}, {1, 1, 1, 4}, TensorType_INT32,
+                          BuiltinOperator_GREATER);
+  model.PopulateTensor<int>(model.input1(), {-1, 9, 7, 3});
+  model.PopulateTensor<int>(model.input2(), {1, 2, 7, 5});
+  model.Invoke();
+
+  EXPECT_THAT(model.GetOutput(), ElementsAre(false, true, false, false));
+  EXPECT_THAT(model.GetOutputShape(), ElementsAre(1, 1, 1, 4));
+}
+
+TEST(ComparisonsTest, GreaterBroadcast) {
+  ComparisonOpModel model({1, 1, 1, 4}, {1, 1, 1, 1}, TensorType_INT32,
+                          BuiltinOperator_GREATER);
+  model.PopulateTensor<int>(model.input1(), {-1, 9, 7, 3});
+  model.PopulateTensor<int>(model.input2(), {7});
+  model.Invoke();
+
+  EXPECT_THAT(model.GetOutput(), ElementsAre(false, true, false, false));
+  EXPECT_THAT(model.GetOutputShape(), ElementsAre(1, 1, 1, 4));
+}
+
+TEST(ComparisonsTest, GreaterBroadcastTwoD) {
+  ComparisonOpModel model({1, 1, 2, 4}, {1, 1, 1, 4}, TensorType_INT32,
+                          BuiltinOperator_GREATER);
+  model.PopulateTensor<int>(model.input1(), {-1, 9, 7, 3, 2, 4, 2, 8});
+  model.PopulateTensor<int>(model.input2(), {7, 1, 2, 4});
+  model.Invoke();
+
+  EXPECT_THAT(model.GetOutput(),
+              ElementsAre(false, true, true, false, false, true, false, true));
+  EXPECT_THAT(model.GetOutputShape(), ElementsAre(1, 1, 2, 4));
+}
 
 TEST(ComparisonsTest, GreaterEqualFloat) {
-  GreaterEqualOpModel model({1, 1, 1, 4}, {1, 1, 1, 4}, TensorType_FLOAT32);
+  ComparisonOpModel model({1, 1, 1, 4}, {1, 1, 1, 4}, TensorType_FLOAT32,
+                          BuiltinOperator_GREATER_EQUAL);
   model.PopulateTensor<float>(model.input1(), {0.1, 0.9, 0.7, 0.3});
   model.PopulateTensor<float>(model.input2(), {0.1, 0.2, 0.6, 0.5});
   model.Invoke();
 
-  EXPECT_THAT(model.GetOutput(), ElementsAreArray({true, true, true, false}));
-  EXPECT_THAT(model.GetOutputShape(), ElementsAreArray({1, 1, 1, 4}));
+  EXPECT_THAT(model.GetOutput(), ElementsAre(true, true, true, false));
+  EXPECT_THAT(model.GetOutputShape(), ElementsAre(1, 1, 1, 4));
 }
 
 TEST(ComparisonsTest, GreaterEqualInt) {
-  GreaterEqualOpModel model({1, 1, 1, 4}, {1, 1, 1, 4}, TensorType_INT32);
+  ComparisonOpModel model({1, 1, 1, 4}, {1, 1, 1, 4}, TensorType_INT32,
+                          BuiltinOperator_GREATER_EQUAL);
   model.PopulateTensor<int>(model.input1(), {-1, 9, 7, 3});
   model.PopulateTensor<int>(model.input2(), {1, 2, 7, 5});
   model.Invoke();
 
-  EXPECT_THAT(model.GetOutput(), ElementsAreArray({false, true, true, false}));
-  EXPECT_THAT(model.GetOutputShape(), ElementsAreArray({1, 1, 1, 4}));
+  EXPECT_THAT(model.GetOutput(), ElementsAre(false, true, true, false));
+  EXPECT_THAT(model.GetOutputShape(), ElementsAre(1, 1, 1, 4));
 }
 
 TEST(ComparisonsTest, GreaterEqualBroadcast) {
-  GreaterEqualOpModel model({1, 1, 1, 4}, {1, 1, 1, 1}, TensorType_INT32);
+  ComparisonOpModel model({1, 1, 1, 4}, {1, 1, 1, 1}, TensorType_INT32,
+                          BuiltinOperator_GREATER_EQUAL);
   model.PopulateTensor<int>(model.input1(), {-1, 9, 7, 3});
   model.PopulateTensor<int>(model.input2(), {7});
   model.Invoke();
 
-  EXPECT_THAT(model.GetOutput(), ElementsAreArray({false, true, true, false}));
-  EXPECT_THAT(model.GetOutputShape(), ElementsAreArray({1, 1, 1, 4}));
+  EXPECT_THAT(model.GetOutput(), ElementsAre(false, true, true, false));
+  EXPECT_THAT(model.GetOutputShape(), ElementsAre(1, 1, 1, 4));
 }
 
 TEST(ComparisonsTest, GreaterEqualBroadcastTwoD) {
-  GreaterEqualOpModel model({1, 1, 2, 4}, {1, 1, 1, 4}, TensorType_INT32);
+  ComparisonOpModel model({1, 1, 2, 4}, {1, 1, 1, 4}, TensorType_INT32,
+                          BuiltinOperator_GREATER_EQUAL);
   model.PopulateTensor<int>(model.input1(), {-1, 9, 7, 3, 2, 4, 2, 8});
   model.PopulateTensor<int>(model.input2(), {7, 1, 2, 4});
   model.Invoke();
 
-  EXPECT_THAT(model.GetOutput(), ElementsAreArray({false, true, true, false,
-                                                   false, true, true, true}));
-  EXPECT_THAT(model.GetOutputShape(), ElementsAreArray({1, 1, 2, 4}));
+  EXPECT_THAT(model.GetOutput(),
+              ElementsAre(false, true, true, false, false, true, true, true));
+  EXPECT_THAT(model.GetOutputShape(), ElementsAre(1, 1, 2, 4));
 }
 
-class LessOpModel : public SingleOpModel {
- public:
-  LessOpModel(std::initializer_list<int> input1_shape,
-              std::initializer_list<int> input2_shape, TensorType input_type) {
-    input1_ = AddInput(input_type);
-    input2_ = AddInput(input_type);
-    output_ = AddOutput(TensorType_BOOL);
-    SetBuiltinOp(BuiltinOperator_LESS, BuiltinOptions_LessOptions,
-                 CreateLessOptions(builder_).Union());
-    BuildInterpreter({input1_shape, input2_shape});
-  }
-
-  int input1() { return input1_; }
-  int input2() { return input2_; }
-
-  std::vector<bool> GetOutput() { return ExtractVector<bool>(output_); }
-  std::vector<int> GetOutputShape() { return GetTensorShape(output_); }
-
- private:
-  int input1_;
-  int input2_;
-  int output_;
-};
 
 TEST(ComparisonsTest, LessFloat) {
-  LessOpModel model({1, 1, 1, 4}, {1, 1, 1, 4}, TensorType_FLOAT32);
+  ComparisonOpModel model({1, 1, 1, 4}, {1, 1, 1, 4}, TensorType_FLOAT32,
+                          BuiltinOperator_LESS);
   model.PopulateTensor<float>(model.input1(), {0.1, 0.9, 0.7, 0.3});
   model.PopulateTensor<float>(model.input2(), {0.1, 0.2, 0.6, 0.5});
   model.Invoke();
 
-  EXPECT_THAT(model.GetOutput(), ElementsAreArray({false, false, false, true}));
-  EXPECT_THAT(model.GetOutputShape(), ElementsAreArray({1, 1, 1, 4}));
+  EXPECT_THAT(model.GetOutput(), ElementsAre(false, false, false, true));
+  EXPECT_THAT(model.GetOutputShape(), ElementsAre(1, 1, 1, 4));
 }
 
 TEST(ComparisonsTest, LessInt) {
-  LessOpModel model({1, 1, 1, 4}, {1, 1, 1, 4}, TensorType_INT32);
+  ComparisonOpModel model({1, 1, 1, 4}, {1, 1, 1, 4}, TensorType_INT32,
+                          BuiltinOperator_LESS);
   model.PopulateTensor<int>(model.input1(), {-1, 9, 7, 3});
   model.PopulateTensor<int>(model.input2(), {1, 2, 6, 5});
   model.Invoke();
 
-  EXPECT_THAT(model.GetOutput(), ElementsAreArray({true, false, false, true}));
-  EXPECT_THAT(model.GetOutputShape(), ElementsAreArray({1, 1, 1, 4}));
+  EXPECT_THAT(model.GetOutput(), ElementsAre(true, false, false, true));
+  EXPECT_THAT(model.GetOutputShape(), ElementsAre(1, 1, 1, 4));
 }
 
 TEST(ComparisonsTest, LessBroadcast) {
-  LessOpModel model({1, 1, 1, 4}, {1, 1, 1, 1}, TensorType_INT32);
+  ComparisonOpModel model({1, 1, 1, 4}, {1, 1, 1, 1}, TensorType_INT32,
+                          BuiltinOperator_LESS);
   model.PopulateTensor<int>(model.input1(), {-1, 9, 7, 3});
   model.PopulateTensor<int>(model.input2(), {7});
   model.Invoke();
 
-  EXPECT_THAT(model.GetOutput(), ElementsAreArray({true, false, false, true}));
-  EXPECT_THAT(model.GetOutputShape(), ElementsAreArray({1, 1, 1, 4}));
+  EXPECT_THAT(model.GetOutput(), ElementsAre(true, false, false, true));
+  EXPECT_THAT(model.GetOutputShape(), ElementsAre(1, 1, 1, 4));
 }
 
 TEST(ComparisonsTest, LessBroadcastTwoD) {
-  LessOpModel model({1, 1, 2, 4}, {1, 1, 1, 4}, TensorType_INT32);
+  ComparisonOpModel model({1, 1, 2, 4}, {1, 1, 1, 4}, TensorType_INT32,
+                          BuiltinOperator_LESS);
   model.PopulateTensor<int>(model.input1(), {-1, 9, 7, 3, 2, 4, 6, 8});
   model.PopulateTensor<int>(model.input2(), {7, 1, 2, 4});
   model.Invoke();
 
-  EXPECT_THAT(model.GetOutput(), ElementsAreArray({true, false, false, true,
-                                                   true, false, false, false}));
-  EXPECT_THAT(model.GetOutputShape(), ElementsAreArray({1, 1, 2, 4}));
+  EXPECT_THAT(model.GetOutput(),
+              ElementsAre(true, false, false, true, true, false, false, false));
+  EXPECT_THAT(model.GetOutputShape(), ElementsAre(1, 1, 2, 4));
 }
 
-class LessEqualOpModel : public SingleOpModel {
- public:
-  LessEqualOpModel(std::initializer_list<int> input1_shape,
-                   std::initializer_list<int> input2_shape,
-                   TensorType input_type) {
-    input1_ = AddInput(input_type);
-    input2_ = AddInput(input_type);
-    output_ = AddOutput(TensorType_BOOL);
-    SetBuiltinOp(BuiltinOperator_LESS_EQUAL, BuiltinOptions_LessEqualOptions,
-                 CreateLessEqualOptions(builder_).Union());
-    BuildInterpreter({input1_shape, input2_shape});
-  }
-
-  int input1() { return input1_; }
-  int input2() { return input2_; }
-
-  std::vector<bool> GetOutput() { return ExtractVector<bool>(output_); }
-  std::vector<int> GetOutputShape() { return GetTensorShape(output_); }
-
- private:
-  int input1_;
-  int input2_;
-  int output_;
-};
-
 TEST(ComparisonsTest, LessEqualFloat) {
-  LessEqualOpModel model({1, 1, 1, 4}, {1, 1, 1, 4}, TensorType_FLOAT32);
+  ComparisonOpModel model({1, 1, 1, 4}, {1, 1, 1, 4}, TensorType_FLOAT32,
+                          BuiltinOperator_LESS_EQUAL);
   model.PopulateTensor<float>(model.input1(), {0.1, 0.9, 0.7, 0.3});
   model.PopulateTensor<float>(model.input2(), {0.1, 0.2, 0.6, 0.5});
   model.Invoke();
 
-  EXPECT_THAT(model.GetOutput(), ElementsAreArray({true, false, false, true}));
-  EXPECT_THAT(model.GetOutputShape(), ElementsAreArray({1, 1, 1, 4}));
+  EXPECT_THAT(model.GetOutput(), ElementsAre(true, false, false, true));
+  EXPECT_THAT(model.GetOutputShape(), ElementsAre(1, 1, 1, 4));
 }
 
 TEST(ComparisonsTest, LessEqualInt) {
-  LessEqualOpModel model({1, 1, 1, 4}, {1, 1, 1, 4}, TensorType_INT32);
+  ComparisonOpModel model({1, 1, 1, 4}, {1, 1, 1, 4}, TensorType_INT32,
+                          BuiltinOperator_LESS_EQUAL);
   model.PopulateTensor<int>(model.input1(), {-1, 9, 7, 3});
   model.PopulateTensor<int>(model.input2(), {1, 2, 7, 5});
   model.Invoke();
 
-  EXPECT_THAT(model.GetOutput(), ElementsAreArray({true, false, true, true}));
-  EXPECT_THAT(model.GetOutputShape(), ElementsAreArray({1, 1, 1, 4}));
+  EXPECT_THAT(model.GetOutput(), ElementsAre(true, false, true, true));
+  EXPECT_THAT(model.GetOutputShape(), ElementsAre(1, 1, 1, 4));
 }
 
 TEST(ComparisonsTest, LessEqualBroadcast) {
-  LessEqualOpModel model({1, 1, 1, 4}, {1, 1, 1, 1}, TensorType_INT32);
+  ComparisonOpModel model({1, 1, 1, 4}, {1, 1, 1, 1}, TensorType_INT32,
+                          BuiltinOperator_LESS_EQUAL);
   model.PopulateTensor<int>(model.input1(), {-1, 9, 7, 3});
   model.PopulateTensor<int>(model.input2(), {7});
   model.Invoke();
 
-  EXPECT_THAT(model.GetOutput(), ElementsAreArray({true, false, true, true}));
-  EXPECT_THAT(model.GetOutputShape(), ElementsAreArray({1, 1, 1, 4}));
+  EXPECT_THAT(model.GetOutput(), ElementsAre(true, false, true, true));
+  EXPECT_THAT(model.GetOutputShape(), ElementsAre(1, 1, 1, 4));
 }
 
 TEST(ComparisonsTest, LessEqualBroadcastTwoD) {
-  LessEqualOpModel model({1, 1, 2, 4}, {1, 1, 1, 4}, TensorType_INT32);
+  ComparisonOpModel model({1, 1, 2, 4}, {1, 1, 1, 4}, TensorType_INT32,
+                          BuiltinOperator_LESS_EQUAL);
   model.PopulateTensor<int>(model.input1(), {-1, 9, 7, 3, 2, 4, 2, 8});
   model.PopulateTensor<int>(model.input2(), {7, 1, 2, 4});
   model.Invoke();
 
-  EXPECT_THAT(model.GetOutput(), ElementsAreArray({true, false, false, true,
-                                                   true, false, true, false}));
-  EXPECT_THAT(model.GetOutputShape(), ElementsAreArray({1, 1, 2, 4}));
+  EXPECT_THAT(model.GetOutput(),
+              ElementsAre(true, false, false, true, true, false, true, false));
+  EXPECT_THAT(model.GetOutputShape(), ElementsAre(1, 1, 2, 4));
 }
 
 }  // namespace
diff --git a/tensorflow/contrib/lite/kernels/conv.cc b/tensorflow/contrib/lite/kernels/conv.cc
index 0b35a220e783572985206e918f3fcd8361f16790..14b399ef96eab1d5066a22a7eb95ab061e8ba2bc 100644
--- a/tensorflow/contrib/lite/kernels/conv.cc
+++ b/tensorflow/contrib/lite/kernels/conv.cc
@@ -134,7 +134,9 @@ static TfLiteStatus AllocateTemporaryTensorsIfRequired(TfLiteContext* context,
   // optimized_ops.h, in order to avoid a DCHECK(!im2col_data).
   data->need_im2col =
       (params->stride_width != 1 || params->stride_height != 1 ||
-       filter_width != 1 || filter_height != 1);
+       params->dilation_width_factor != 1 ||
+       params->dilation_height_factor != 1 || filter_width != 1 ||
+       filter_height != 1);
   // If we're using the optimized multithreaded EigenTensor implementation of
   // convolution, it expects the filter weights to be transposed compared to
   // the normal TF Lite buffer format. Typical TF Lite weights are
@@ -254,8 +256,10 @@ TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
     double real_multiplier = 0.0;
     TF_LITE_ENSURE_STATUS(GetQuantizedConvolutionMultipler(
         context, input, filter, bias, output, &real_multiplier));
-    QuantizeMultiplierSmallerThanOne(real_multiplier, &data->output_multiplier,
-                                     &data->output_shift);
+    TF_LITE_ENSURE(context, real_multiplier < 1.0);
+    QuantizeMultiplierSmallerThanOneExp(
+        real_multiplier, &data->output_multiplier, &data->output_shift);
+    data->output_shift *= -1;
     CalculateActivationRangeUint8(params->activation, output,
                                   &data->output_activation_min,
                                   &data->output_activation_max);
diff --git a/tensorflow/contrib/lite/kernels/depthwise_conv.cc b/tensorflow/contrib/lite/kernels/depthwise_conv.cc
index abb2549f85164e40875456fb732716e53b263127..a308de055f49eddba99d02e264fad11409a799f4 100644
--- a/tensorflow/contrib/lite/kernels/depthwise_conv.cc
+++ b/tensorflow/contrib/lite/kernels/depthwise_conv.cc
@@ -151,8 +151,9 @@ TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
     double real_multiplier = 0.0;
     TF_LITE_ENSURE_STATUS(GetQuantizedConvolutionMultipler(
         context, input, filter, bias, output, &real_multiplier));
-    QuantizeMultiplierSmallerThanOne(real_multiplier, &data->output_multiplier,
-                                     &data->output_shift);
+    int exponent;
+    QuantizeMultiplier(real_multiplier, &data->output_multiplier, &exponent);
+    data->output_shift = -exponent;
     CalculateActivationRangeUint8(params->activation, output,
                                   &data->output_activation_min,
                                   &data->output_activation_max);
diff --git a/tensorflow/contrib/lite/kernels/depthwise_conv_test.cc b/tensorflow/contrib/lite/kernels/depthwise_conv_test.cc
index 1439c8bce14ad127ed68dc54991aed8b8bb39383..c00cafb9fbfaf53d4dc301ccd3f21a6c6fd892e6 100644
--- a/tensorflow/contrib/lite/kernels/depthwise_conv_test.cc
+++ b/tensorflow/contrib/lite/kernels/depthwise_conv_test.cc
@@ -47,12 +47,6 @@ class BaseDepthwiseConvolutionOpModel : public SingleOpModel {
     }
 
     output_ = AddOutput(output);
-    if (input.type != TensorType_FLOAT32) {
-      // The following is required by quantized inference. It is the unittest's
-      // responsibility to make sure the output scale falls into the correct
-      // range.
-      CHECK_LT(GetScale(input_) * GetScale(filter_), GetScale(output_));
-    }
 
     int input_depth = GetShape(input_)[3];
     int output_depth = GetShape(filter_)[3];
@@ -176,6 +170,43 @@ TEST(QuantizedDepthwiseConvolutionOpTest, SimpleTestQuantized) {
                              }));
 }
 
+TEST(QuantizedDepthwiseConvolutionOpTest,
+     SimpleTestQuantizedFilterMultiplierGreaterThan1) {
+  QuantizedDepthwiseConvolutionOpModel quant_op(
+      {TensorType_UINT8, {1, 3, 2, 2}, -63.5, 64},
+      {TensorType_UINT8, {1, 2, 2, 4}, -128.5, 128},
+      {TensorType_UINT8, {}, -127, 128});
+  DepthwiseConvolutionOpModel float_op({TensorType_FLOAT32, {1, 3, 2, 2}},
+                                       {TensorType_FLOAT32, {1, 2, 2, 4}},
+                                       {TensorType_FLOAT32, {}});
+
+  std::initializer_list<float> input = {
+      1, 2, 7,  8,   // column 1
+      3, 4, 9,  10,  // column 2
+      5, 6, 11, 12,  // column 3
+  };
+  std::initializer_list<float> filter = {
+      1,  2,   3,   4,    //
+      -9, 10,  -11, 12,   //
+      5,  6,   7,   8,    //
+      13, -14, 15,  -16,  //
+  };
+  std::initializer_list<float> bias = {1, 2, 3, 4};
+
+  quant_op.SetInput(input);
+  quant_op.SetFilter(filter);
+  quant_op.SetBias(bias);
+  quant_op.Invoke();
+
+  float_op.SetInput(input);
+  float_op.SetFilter(filter);
+  float_op.SetBias(bias);
+  float_op.Invoke();
+
+  EXPECT_THAT(quant_op.GetDequantizedOutput(),
+              ElementsAreArray(ArrayFloatNear(float_op.GetOutput(), 1)));
+}
+
 }  // namespace
 }  // namespace tflite
 
diff --git a/tensorflow/contrib/lite/kernels/elementwise.cc b/tensorflow/contrib/lite/kernels/elementwise.cc
index 0bd504695074011efd946f4c4d1f8d4854e82730..98c21ce9d390aaa1f3cb5fdb8f31cbffb1b81d6a 100644
--- a/tensorflow/contrib/lite/kernels/elementwise.cc
+++ b/tensorflow/contrib/lite/kernels/elementwise.cc
@@ -23,7 +23,7 @@ namespace ops {
 namespace builtin {
 namespace elementwise {
 
-TfLiteStatus SinPrepare(TfLiteContext* context, TfLiteNode* node) {
+TfLiteStatus GenericPrepare(TfLiteContext* context, TfLiteNode* node) {
   TF_LITE_ENSURE_EQ(context, NumInputs(node), 1);
   TF_LITE_ENSURE_EQ(context, NumOutputs(node), 1);
   const TfLiteTensor* input = GetInput(context, node, 0);
@@ -35,7 +35,8 @@ TfLiteStatus SinPrepare(TfLiteContext* context, TfLiteNode* node) {
                                TfLiteIntArrayCopy(input->dims));
 }
 
-TfLiteStatus SinEval(TfLiteContext* context, TfLiteNode* node) {
+inline TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node,
+                         float float_func(float)) {
   const TfLiteTensor* input = GetInput(context, node, 0);
   TfLiteTensor* output = GetOutput(context, node, 0);
   switch (input->type) {
@@ -44,7 +45,7 @@ TfLiteStatus SinEval(TfLiteContext* context, TfLiteNode* node) {
       const float* in = GetTensorData<float>(input);
       const float* in_end = in + elements;
       float* out = output->data.f;
-      for (; in < in_end; in++, out++) *out = std::sin(*in);
+      for (; in < in_end; in++, out++) *out = float_func(*in);
       return kTfLiteOk;
     }
     default: {
@@ -55,14 +56,28 @@ TfLiteStatus SinEval(TfLiteContext* context, TfLiteNode* node) {
   }
 }
 
+TfLiteStatus SinEval(TfLiteContext* context, TfLiteNode* node) {
+  return Eval(context, node, std::sin);
+}
+
+TfLiteStatus LogEval(TfLiteContext* context, TfLiteNode* node) {
+  return Eval(context, node, std::log);
+}
+
 }  // namespace elementwise
 
 TfLiteRegistration* Register_SIN() {
-  static TfLiteRegistration r = {nullptr, nullptr, elementwise::SinPrepare,
+  static TfLiteRegistration r = {nullptr, nullptr, elementwise::GenericPrepare,
                                  elementwise::SinEval};
   return &r;
 }
 
+TfLiteRegistration* Register_LOG() {
+  static TfLiteRegistration r = {nullptr, nullptr, elementwise::GenericPrepare,
+                                 elementwise::LogEval};
+  return &r;
+}
+
 }  // namespace builtin
 }  // namespace ops
 }  // namespace tflite
diff --git a/tensorflow/contrib/lite/kernels/elementwise_test.cc b/tensorflow/contrib/lite/kernels/elementwise_test.cc
index 412ffb04b90fbc24d232d25d2a86ce639752c3e8..10e88d5a31868eeb5f65c7ade1f1c73827dea24a 100644
--- a/tensorflow/contrib/lite/kernels/elementwise_test.cc
+++ b/tensorflow/contrib/lite/kernels/elementwise_test.cc
@@ -24,12 +24,13 @@ namespace {
 
 using ::testing::ElementsAreArray;
 
-class SinOpModel : public SingleOpModel {
+class ElementWiseOpModel : public SingleOpModel {
  public:
-  SinOpModel(std::initializer_list<int> input_shape) {
+  ElementWiseOpModel(BuiltinOperator op,
+                     std::initializer_list<int> input_shape) {
     input_ = AddInput(TensorType_FLOAT32);
     output_ = AddOutput(TensorType_FLOAT32);
-    SetBuiltinOp(BuiltinOperator_SIN, BuiltinOptions_NONE, 0);
+    SetBuiltinOp(op, BuiltinOptions_NONE, 0);
     BuildInterpreter({input_shape});
   }
 
@@ -42,7 +43,7 @@ class SinOpModel : public SingleOpModel {
 };
 
 TEST(ElementWise, Sin) {
-  SinOpModel m({1, 1, 4, 1});
+  ElementWiseOpModel m(BuiltinOperator_SIN, {1, 1, 4, 1});
   m.PopulateTensor<float>(m.input(), {0, 3.1415926, -3.1415926, 1});
   m.Invoke();
   EXPECT_THAT(m.ExtractVector<float>(m.output()),
@@ -50,6 +51,15 @@ TEST(ElementWise, Sin) {
   EXPECT_THAT(m.GetTensorShape(m.output()), ElementsAreArray({1, 1, 4, 1}));
 }
 
+TEST(ElementWise, Log) {
+  ElementWiseOpModel m(BuiltinOperator_LOG, {1, 1, 4, 1});
+  m.PopulateTensor<float>(m.input(), {1, 3.1415926, 1, 1});
+  m.Invoke();
+  EXPECT_THAT(m.ExtractVector<float>(m.output()),
+              ElementsAreArray(ArrayFloatNear({0, 1.14473, 0, 0})));
+  EXPECT_THAT(m.GetTensorShape(m.output()), ElementsAreArray({1, 1, 4, 1}));
+}
+
 }  // namespace
 }  // namespace tflite
 
diff --git a/tensorflow/contrib/lite/kernels/embedding_lookup.cc b/tensorflow/contrib/lite/kernels/embedding_lookup.cc
index 7539c0b30ded921df957217bebdc7b20ea4b40b4..9410bead5e7a68363d034c22fb2c0eff9f060ef1 100644
--- a/tensorflow/contrib/lite/kernels/embedding_lookup.cc
+++ b/tensorflow/contrib/lite/kernels/embedding_lookup.cc
@@ -24,7 +24,8 @@ limitations under the License.
 // Output:
 //   Output.dim[0] == Tensor[0].dim[0], num of lookups
 //   Output.dim[1] == Tensor[1].dim[1],  num of items per row
-//   Each item in output is a raw bytes copy of corresponding item in input.
+//   Each item in output is a raw bytes copy of the corresponding item in input,
+//   or a dequantized value in the case of a uint8 input.
 //   When indices are out of bound, the ops will not succeed.
 //
 
@@ -69,11 +70,9 @@ TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
   return context->ResizeTensor(context, output, outputSize);
 }
 
-TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
-  TfLiteTensor* output = GetOutput(context, node, 0);
-  const TfLiteTensor* lookup = GetInput(context, node, 0);
-  const TfLiteTensor* value = GetInput(context, node, 1);
-
+TfLiteStatus EvalFloat(TfLiteContext* context, TfLiteNode* node,
+                       const TfLiteTensor* lookup, const TfLiteTensor* value,
+                       TfLiteTensor* output) {
   const int row_size = SizeOfDimension(value, 0);
   const int row_bytes = value->bytes / row_size;
 
@@ -91,6 +90,52 @@ TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
   return kTfLiteOk;
 }
 
+TfLiteStatus EvalHybrid(TfLiteContext* context, TfLiteNode* node,
+                        const TfLiteTensor* lookup, const TfLiteTensor* value,
+                        TfLiteTensor* output) {
+  const int row_size = SizeOfDimension(value, 0);
+  const double scaling_factor = 1.0 / value->params.scale;
+
+  // col_size after we flatten tensor into 2D.
+  int col_size = 1;
+  for (int i = 1; i < NumDimensions(value); i++) {
+    col_size *= SizeOfDimension(value, i);
+  }
+
+  for (int i = 0; i < SizeOfDimension(lookup, 0); i++) {
+    int idx = lookup->data.i32[i];
+    if (idx >= row_size || idx < 0) {
+      context->ReportError(context, "Embedding Lookup: index out of bounds.");
+      return kTfLiteError;
+    } else {
+      // Dequantize embedding values.
+      // TODO(alanchiao): refactor scalar multiply into separate function
+      // for ease of adding a neon equivalent if ever necessary.
+      for (int j = 0; j < col_size; j++) {
+        output->data.f[j + i * col_size] =
+            value->data.uint8[j + idx * col_size] * scaling_factor;
+      }
+    }
+  }
+
+  return kTfLiteOk;
+}
+
+TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
+  const TfLiteTensor* lookup = GetInput(context, node, 0);
+  const TfLiteTensor* value = GetInput(context, node, 1);
+  TfLiteTensor* output = GetOutput(context, node, 0);
+  switch (value->type) {
+    case kTfLiteFloat32:
+      return EvalFloat(context, node, lookup, value, output);
+    case kTfLiteUInt8:
+      return EvalHybrid(context, node, lookup, value, output);
+    default:
+      context->ReportError(context, "Type not currently supported.");
+      return kTfLiteError;
+  }
+}
+
 }  // namespace embedding_lookup
 
 TfLiteRegistration* Register_EMBEDDING_LOOKUP() {
diff --git a/tensorflow/contrib/lite/kernels/embedding_lookup_test.cc b/tensorflow/contrib/lite/kernels/embedding_lookup_test.cc
index 9b501878f196216a61568bfa36e6615f4dd07478..04657fd86323ef1c58d069c06097c7665f55cc87 100644
--- a/tensorflow/contrib/lite/kernels/embedding_lookup_test.cc
+++ b/tensorflow/contrib/lite/kernels/embedding_lookup_test.cc
@@ -7,13 +7,14 @@ You may obtain a copy of the License at
     http://www.apache.org/licenses/LICENSE-2.0
 
 Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
+distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License
+for the specific language governing permissions and limitations under the
+License.
 ==============================================================================*/
 // Unit test for TFLite Lookup op.
 
+#include <initializer_list>
 #include <iomanip>
 #include <vector>
 
@@ -29,12 +30,13 @@ namespace {
 
 using ::testing::ElementsAreArray;
 
-class EmbeddingLookupOpModel : public SingleOpModel {
+class BaseEmbeddingLookupOpModel : public SingleOpModel {
  public:
-  EmbeddingLookupOpModel(std::initializer_list<int> index_shape,
-                         std::initializer_list<int> weight_shape) {
+  BaseEmbeddingLookupOpModel(std::initializer_list<int> index_shape,
+                             std::initializer_list<int> weight_shape,
+                             TensorType weight_type = TensorType_FLOAT32) {
     input_ = AddInput(TensorType_INT32);
-    weight_ = AddInput(TensorType_FLOAT32);
+    weight_ = AddInput(weight_type);
     output_ = AddOutput(TensorType_FLOAT32);
     SetBuiltinOp(BuiltinOperator_EMBEDDING_LOOKUP, BuiltinOptions_NONE, 0);
     BuildInterpreter({index_shape, weight_shape});
@@ -44,6 +46,18 @@ class EmbeddingLookupOpModel : public SingleOpModel {
     PopulateTensor(input_, data);
   }
 
+  std::vector<float> GetOutput() { return ExtractVector<float>(output_); }
+
+ protected:
+  int input_;
+  int weight_;
+  int output_;
+};
+
+class EmbeddingLookupOpModel : public BaseEmbeddingLookupOpModel {
+ public:
+  using BaseEmbeddingLookupOpModel::BaseEmbeddingLookupOpModel;
+
   void Set3DWeightMatrix(const std::function<float(int, int, int)>& function) {
     TfLiteTensor* tensor = interpreter_->tensor(weight_);
     int rows = tensor->dims->data[0];
@@ -57,20 +71,25 @@ class EmbeddingLookupOpModel : public SingleOpModel {
       }
     }
   }
+};
 
-  std::vector<float> GetOutput() { return ExtractVector<float>(output_); }
+class HybridEmbeddingLookupOpModel : public BaseEmbeddingLookupOpModel {
+ public:
+  HybridEmbeddingLookupOpModel(std::initializer_list<int> index_shape,
+                               std::initializer_list<int> weight_shape)
+      : BaseEmbeddingLookupOpModel(index_shape, weight_shape,
+                                   TensorType_UINT8) {}
 
- private:
-  int input_;
-  int weight_;
-  int output_;
+  void SetWeight(std::initializer_list<float> data) {
+    SymmetricQuantizeAndPopulate(weight_, data);
+  }
 };
 
 // TODO(ahentz): write more tests that exercise the details of the op, such as
 // lookup errors and variable input shapes.
 TEST(EmbeddingLookupOpTest, SimpleTest) {
   EmbeddingLookupOpModel m({3}, {3, 2, 4});
-  m.PopulateTensor<int>(0, {1, 0, 2});
+  m.SetInput({1, 0, 2});
   m.Set3DWeightMatrix(
       [](int i, int j, int k) { return i + j / 10.0f + k / 100.0f; });
 
@@ -84,6 +103,69 @@ TEST(EmbeddingLookupOpTest, SimpleTest) {
               })));
 }
 
+TEST(HybridEmbeddingLookupHybridOpTest, Simple2DTest) {
+  HybridEmbeddingLookupOpModel m({3}, {3, 8});
+  m.SetInput({1, 0, 2});
+  m.SetWeight({
+      0.00, 0.01, 0.02, 0.03, 0.10, 0.11, 0.12, 0.13,  // Row 0
+      1.00, 1.01, 1.02, 1.03, 1.10, 1.11, 1.12, 1.13,  // Row 1
+      2.00, 2.01, 2.02, 2.03, 2.10, 2.11, 2.12, 2.13,  // Row 2
+  });
+
+  m.Invoke();
+
+  EXPECT_THAT(m.GetOutput(),
+              ElementsAreArray(ArrayFloatNear(
+                  {
+                      1.00, 1.01, 1.02, 1.03, 1.10, 1.11, 1.12, 1.13,  // Row 1
+                      0.00, 0.01, 0.02, 0.03, 0.10, 0.11, 0.12, 0.13,  // Row 0
+                      2.00, 2.01, 2.02, 2.03, 2.10, 2.11, 2.12, 2.13,  // Row 2
+                  },
+                  7.41e-03)));
+}
+
+TEST(HybridEmbeddingLookupHybridOpTest, Simple3DTest) {
+  HybridEmbeddingLookupOpModel m({3}, {3, 2, 4});
+  m.SetInput({1, 0, 2});
+  m.SetWeight({
+      0.00, 0.01, 0.02, 0.03, 0.10, 0.11, 0.12, 0.13,  // Row 0
+      1.00, 1.01, 1.02, 1.03, 1.10, 1.11, 1.12, 1.13,  // Row 1
+      2.00, 2.01, 2.02, 2.03, 2.10, 2.11, 2.12, 2.13,  // Row 2
+  });
+
+  m.Invoke();
+
+  EXPECT_THAT(m.GetOutput(),
+              ElementsAreArray(ArrayFloatNear(
+                  {
+                      1.00, 1.01, 1.02, 1.03, 1.10, 1.11, 1.12, 1.13,  // Row 1
+                      0.00, 0.01, 0.02, 0.03, 0.10, 0.11, 0.12, 0.13,  // Row 0
+                      2.00, 2.01, 2.02, 2.03, 2.10, 2.11, 2.12, 2.13,  // Row 2
+                  },
+                  7.41e-03)));
+}
+
+TEST(HybridEmbeddingLookupHybridOpTest, Simple4DTest) {
+  HybridEmbeddingLookupOpModel m({3}, {3, 2, 2, 2});
+  m.SetInput({1, 0, 2});
+  m.SetWeight({
+      0.00, 0.01, 0.02, 0.03, 0.10, 0.11, 0.12, 0.13,  // Row 0
+      1.00, 1.01, 1.02, 1.03, 1.10, 1.11, 1.12, 1.13,  // Row 1
+      2.00, 2.01, 2.02, 2.03, 2.10, 2.11, 2.12, 2.13,  // Row 2
+  });
+
+  m.Invoke();
+
+  EXPECT_THAT(m.GetOutput(),
+              ElementsAreArray(ArrayFloatNear(
+                  {
+                      1.00, 1.01, 1.02, 1.03, 1.10, 1.11, 1.12, 1.13,  // Row 1
+                      0.00, 0.01, 0.02, 0.03, 0.10, 0.11, 0.12, 0.13,  // Row 0
+                      2.00, 2.01, 2.02, 2.03, 2.10, 2.11, 2.12, 2.13,  // Row 2
+                  },
+                  7.41e-03)));
+}
+
 }  // namespace
 }  // namespace tflite
 
diff --git a/tensorflow/contrib/lite/kernels/expand_dims.cc b/tensorflow/contrib/lite/kernels/expand_dims.cc
new file mode 100644
index 0000000000000000000000000000000000000000..ed33012864354cd93eac2344f75d7eca302c8952
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/expand_dims.cc
@@ -0,0 +1,113 @@
+
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <string.h>
+#include <vector>
+#include "tensorflow/contrib/lite/builtin_op_data.h"
+#include "tensorflow/contrib/lite/context.h"
+#include "tensorflow/contrib/lite/kernels/internal/reference/reference_ops.h"
+#include "tensorflow/contrib/lite/kernels/internal/tensor.h"
+#include "tensorflow/contrib/lite/kernels/kernel_util.h"
+#include "tensorflow/contrib/lite/kernels/op_macros.h"
+namespace tflite {
+namespace ops {
+namespace builtin {
+namespace expand_dims {
+constexpr int kInput = 0;
+constexpr int kAxis = 1;
+constexpr int kOutput = 0;
+
+namespace {
+TfLiteStatus ExpandTensorDim(TfLiteContext* context, const TfLiteTensor& input,
+                             int axis, TfLiteTensor* output) {
+  const TfLiteIntArray& input_dims = *input.dims;
+  if (axis < 0) {
+    axis = input_dims.size + 1 + axis;
+  }
+  TF_LITE_ENSURE(context, axis <= input_dims.size);
+
+  TfLiteIntArray* output_dims = TfLiteIntArrayCreate(input_dims.size + 1);
+  for (int i = 0; i < output_dims->size; ++i) {
+    if (i < axis) {
+      output_dims->data[i] = input_dims.data[i];
+    } else if (i == axis) {
+      output_dims->data[i] = 1;
+    } else {
+      output_dims->data[i] = input_dims.data[i - 1];
+    }
+  }
+
+  return context->ResizeTensor(context, output, output_dims);
+}
+
+TfLiteStatus GetAxisValueFromTensor(TfLiteContext* context,
+                                    const TfLiteTensor& axis, int* axis_value) {
+  TF_LITE_ENSURE_EQ(context, NumElements(&axis), 1);
+  switch (axis.type) {
+    case kTfLiteInt32:
+      *axis_value = *GetTensorData<int32_t>(&axis);
+      return kTfLiteOk;
+    case kTfLiteInt64:
+      *axis_value = *GetTensorData<int64_t>(&axis);
+      return kTfLiteOk;
+    default:
+      return kTfLiteError;
+  }
+}
+
+}  // namespace
+
+TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
+  TF_LITE_ENSURE_EQ(context, NumInputs(node), 2);
+  TF_LITE_ENSURE_EQ(context, NumOutputs(node), 1);
+  const TfLiteTensor* input = GetInput(context, node, kInput);
+  const TfLiteTensor* axis = GetInput(context, node, kAxis);
+  TfLiteTensor* output = GetOutput(context, node, 0);
+  output->type = input->type;
+  if (IsConstantTensor(axis)) {
+    int axis_value;
+    TF_LITE_ENSURE_OK(context,
+                      GetAxisValueFromTensor(context, *axis, &axis_value));
+    return ExpandTensorDim(context, *input, axis_value, output);
+  }
+  SetTensorToDynamic(output);
+  return kTfLiteOk;
+}
+
+TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
+  // Just copy input to output.
+  const TfLiteTensor* input = GetInput(context, node, kInput);
+  TfLiteTensor* output = GetOutput(context, node, 0);
+  const TfLiteTensor* axis = GetInput(context, node, kAxis);
+  if (IsDynamicTensor(output)) {
+    int axis_value;
+    TF_LITE_ENSURE_OK(context,
+                      GetAxisValueFromTensor(context, *axis, &axis_value));
+    TF_LITE_ENSURE_OK(context,
+                      ExpandTensorDim(context, *input, axis_value, output));
+  }
+  memcpy(output->data.raw, input->data.raw, input->bytes);
+  return kTfLiteOk;
+}
+
+}  // namespace expand_dims
+TfLiteRegistration* Register_EXPAND_DIMS() {
+  static TfLiteRegistration r = {nullptr, nullptr, expand_dims::Prepare,
+                                 expand_dims::Eval};
+  return &r;
+}
+}  // namespace builtin
+}  // namespace ops
+}  // namespace tflite
diff --git a/tensorflow/contrib/lite/kernels/expand_dims_test.cc b/tensorflow/contrib/lite/kernels/expand_dims_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..b755e8ce293442813b26ec3177162a3c95af2f89
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/expand_dims_test.cc
@@ -0,0 +1,83 @@
+
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <gtest/gtest.h>
+#include "tensorflow/contrib/lite/builtin_op_data.h"
+#include "tensorflow/contrib/lite/interpreter.h"
+#include "tensorflow/contrib/lite/kernels/register.h"
+#include "tensorflow/contrib/lite/kernels/test_util.h"
+#include "tensorflow/contrib/lite/model.h"
+
+namespace tflite {
+namespace {
+
+using ::testing::ElementsAreArray;
+
+class ExpandDimsOpModel : public SingleOpModel {
+ public:
+  ExpandDimsOpModel(std::initializer_list<int> input_shape,
+                    TensorType input_type) {
+    input_ = AddInput(input_type);
+    axis_ = AddInput(TensorType_INT32);
+    output_ = AddOutput(input_type);
+    SetBuiltinOp(BuiltinOperator_EXPAND_DIMS, BuiltinOptions_ExpandDimsOptions,
+                 0);
+    BuildInterpreter({input_shape, {1}});
+  }
+  void SetInputFloat(std::initializer_list<float> data) {
+    PopulateTensor<float>(input_, data);
+  }
+  void SetAxis(int axis) { PopulateTensor<int32>(axis_, {axis}); }
+  std::vector<float> GetValuesFloat() { return ExtractVector<float>(output_); }
+  std::vector<int> GetOutputShape() { return GetTensorShape(output_); }
+
+ protected:
+  int input_;
+  int axis_;
+  int output_;
+};
+
+TEST(ExpandDimsOpTest, DifferentAxis) {
+  ExpandDimsOpModel m({2, 2}, TensorType_FLOAT32);
+  const auto values = {-1.f, 1.f, -2.f, 2.f};
+  m.SetInputFloat(values);
+  m.SetAxis(0);
+  m.Invoke();
+  EXPECT_THAT(m.GetValuesFloat(), ElementsAreArray(values));
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({1, 2, 2}));
+
+  m.SetAxis(1);
+  m.Invoke();
+  EXPECT_THAT(m.GetValuesFloat(), ElementsAreArray(values));
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({2, 1, 2}));
+
+  m.SetAxis(2);
+  m.Invoke();
+  EXPECT_THAT(m.GetValuesFloat(), ElementsAreArray(values));
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({2, 2, 1}));
+
+  m.SetAxis(-1);
+  m.Invoke();
+  EXPECT_THAT(m.GetValuesFloat(), ElementsAreArray(values));
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({2, 2, 1}));
+}
+}  // namespace
+}  // namespace tflite
+
+int main(int argc, char** argv) {
+  ::tflite::LogToStderr();
+  ::testing::InitGoogleTest(&argc, argv);
+  return RUN_ALL_TESTS();
+}
diff --git a/tensorflow/contrib/lite/kernels/fully_connected.cc b/tensorflow/contrib/lite/kernels/fully_connected.cc
index 3374923e6e0f2f4520d2d67e698ae5b4fd8e5443..f6fc0f5b6ad12d58c541efc6eae566ab4b8327f4 100644
--- a/tensorflow/contrib/lite/kernels/fully_connected.cc
+++ b/tensorflow/contrib/lite/kernels/fully_connected.cc
@@ -101,16 +101,15 @@ TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
     input_size *= input->dims->data[i];
   }
 
+  TF_LITE_ENSURE_EQ(context, NumDimensions(filter), 2);
   const int batch_size = input_size / filter->dims->data[1];
   const int num_units = filter->dims->data[0];
 
-  TF_LITE_ASSERT_EQ(input_size, batch_size * filter->dims->data[1]);
+  TF_LITE_ENSURE_EQ(context, input_size, batch_size * filter->dims->data[1]);
   if (bias) {
     TF_LITE_ENSURE_EQ(context, NumElements(bias), SizeOfDimension(filter, 0));
   }
 
-  TF_LITE_ENSURE_EQ(context, NumDimensions(filter), 2);
-
   // Note that quantized inference requires that all tensors have their
   // parameters set. This is usually done during quantized training.
   TfLiteType data_type = input->type;
@@ -118,8 +117,10 @@ TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
     double real_multiplier = 0.0;
     TF_LITE_ENSURE_STATUS(GetQuantizedConvolutionMultipler(
         context, input, filter, bias, output, &real_multiplier));
-    QuantizeMultiplierSmallerThanOne(real_multiplier, &data->output_multiplier,
-                                     &data->output_shift);
+    TF_LITE_ENSURE(context, real_multiplier < 1.0);
+    QuantizeMultiplierSmallerThanOneExp(
+        real_multiplier, &data->output_multiplier, &data->output_shift);
+    data->output_shift *= -1;
     CalculateActivationRangeUint8(params->activation, output,
                                   &data->output_activation_min,
                                   &data->output_activation_max);
@@ -218,11 +219,8 @@ TfLiteStatus EvalPieQuantized(TfLiteContext* context, TfLiteNode* node,
     tensor_utils::ZeroVector(output->data.f, batch_size * num_units);
   }
 
-  // TODO(mirkov): change std::minmax_element with a vectorized call.
-  auto minmax_element =
-      std::minmax_element(input->data.f, input->data.f + total_input_size);
   // Save matrix multiplication computation for all zero input.
-  if (*minmax_element.first == 0.0 && *minmax_element.second == 0.0) {
+  if (tensor_utils::IsZeroVector(input->data.f, total_input_size)) {
     tensor_utils::ApplyActivationToVector(output->data.f,
                                           batch_size * num_units,
                                           params->activation, output->data.f);
diff --git a/tensorflow/contrib/lite/kernels/internal/BUILD b/tensorflow/contrib/lite/kernels/internal/BUILD
index aabbb0685c5f9160cd863952cb47526215dc31a6..75298b995d6184985efc76c60c2f5541e9cbea40 100644
--- a/tensorflow/contrib/lite/kernels/internal/BUILD
+++ b/tensorflow/contrib/lite/kernels/internal/BUILD
@@ -420,6 +420,15 @@ cc_library(
     }),
 )
 
+cc_library(
+    name = "test_util",
+    srcs = ["test_util.cc"],
+    hdrs = ["test_util.h"],
+    deps = [
+        ":types",
+    ],
+)
+
 cc_test(
     name = "tensor_utils_test",
     srcs = ["tensor_utils_test.cc"],
@@ -440,6 +449,84 @@ cc_test(
     ],
 )
 
+cc_test(
+    name = "depthwiseconv_float_test",
+    srcs = ["depthwiseconv_float_test.cc"],
+    deps = [
+        ":optimized_base",
+        ":reference_base",
+        ":test_util",
+        ":types",
+        "@com_google_googletest//:gtest_main",
+    ],
+)
+
+cc_test(
+    name = "depthwiseconv_quantized_test",
+    srcs = ["depthwiseconv_quantized_test.cc"],
+    deps = [
+        ":optimized_base",
+        ":reference_base",
+        ":test_util",
+        ":types",
+        "@com_google_googletest//:gtest_main",
+    ],
+)
+
+cc_test(
+    name = "resize_bilinear_test",
+    srcs = ["resize_bilinear_test.cc"],
+    tags = ["tflite_not_portable"],
+    deps = [
+        ":optimized_base",
+        ":reference_base",
+        ":test_util",
+        ":types",
+        "@com_google_googletest//:gtest_main",
+    ],
+)
+
+cc_test(
+    name = "softmax_quantized_test",
+    timeout = "long",
+    srcs = [
+        "softmax_quantized_test.cc",
+    ],
+    deps = [
+        ":optimized_base",
+        ":quantization_util",
+        ":reference_base",
+        ":test_util",
+        "@com_google_googletest//:gtest_main",
+    ],
+)
+
+cc_test(
+    name = "logsoftmax_quantized_test",
+    timeout = "long",
+    srcs = [
+        "logsoftmax_quantized_test.cc",
+    ],
+    tags = ["tflite_not_portable"],
+    deps = [
+        ":optimized_base",
+        ":quantization_util",
+        ":reference_base",
+        ":test_util",
+        "@com_google_googletest//:gtest_main",
+    ],
+)
+
+cc_test(
+    name = "log_quantized_test",
+    srcs = ["log_quantized_test.cc"],
+    deps = [
+        ":optimized_base",
+        ":reference_base",
+        "@com_google_googletest//:gtest_main",
+    ],
+)
+
 cc_library(
     name = "cpu_check",
     hdrs = [
diff --git a/tensorflow/contrib/lite/kernels/internal/common.h b/tensorflow/contrib/lite/kernels/internal/common.h
index 4d3803a07fbdaf9d0d27664037c219715884ee76..3543b6f79e1633b4ffa0872b2999161a519c8cad 100644
--- a/tensorflow/contrib/lite/kernels/internal/common.h
+++ b/tensorflow/contrib/lite/kernels/internal/common.h
@@ -89,12 +89,12 @@ float ActivationFunction(float x) {
                                       output_activation_max);
 }
 
-inline int32 MultiplyByQuantizedMultiplierSmallerThanOne(
-    int32 x, int32 quantized_multiplier, int right_shift) {
+inline int32 MultiplyByQuantizedMultiplierSmallerThanOneExp(
+    int32 x, int32 quantized_multiplier, int left_shift) {
   using gemmlowp::RoundingDivideByPOT;
   using gemmlowp::SaturatingRoundingDoublingHighMul;
   return RoundingDivideByPOT(
-      SaturatingRoundingDoublingHighMul(x, quantized_multiplier), right_shift);
+      SaturatingRoundingDoublingHighMul(x, quantized_multiplier), -left_shift);
 }
 
 inline int32 MultiplyByQuantizedMultiplierGreaterThanOne(
diff --git a/tensorflow/contrib/lite/kernels/internal/depthwiseconv_float_test.cc b/tensorflow/contrib/lite/kernels/internal/depthwiseconv_float_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..844ee6a53dd65b81f21ae1ef5b6d04192744a304
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/internal/depthwiseconv_float_test.cc
@@ -0,0 +1,162 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <algorithm>
+#include <cmath>
+#include <vector>
+
+#include <gtest/gtest.h>
+#include "tensorflow/contrib/lite/kernels/internal/test_util.h"
+#include "tensorflow/contrib/lite/kernels/internal/types.h"
+
+#define ALLOW_SLOW_GENERIC_DEPTHWISECONV_FALLBACK
+#include "tensorflow/contrib/lite/kernels/internal/optimized/depthwiseconv_float.h"
+#include "tensorflow/contrib/lite/kernels/internal/reference/depthwiseconv_float.h"
+
+namespace tflite {
+namespace {
+
+// Runs the DepthwiseConv and compares against the reference implementation.
+template <FusedActivationFunctionType Ac>
+void TestOneDepthwiseConv(const float* input_data, const Dims<4>& input_dims,
+                          const float* filter_data, const Dims<4>& filter_dims,
+                          const float* bias_data, const Dims<4>& bias_dims,
+                          int stride, int pad_width, int pad_height,
+                          int depth_multiplier, const Dims<4>& output_dims) {
+  const int output_buffer_size = RequiredBufferSizeForDims(output_dims);
+  std::vector<float> output_data(output_buffer_size);
+  std::vector<float> reference_output_data(output_buffer_size);
+  reference_ops::DepthwiseConv<Ac>(input_data, input_dims, filter_data,
+                                   filter_dims, bias_data, bias_dims, stride,
+                                   pad_width, pad_height, depth_multiplier,
+                                   reference_output_data.data(), output_dims);
+  optimized_ops::DepthwiseConv<Ac>(input_data, input_dims, filter_data,
+                                   filter_dims, bias_data, bias_dims, stride,
+                                   pad_width, pad_height, depth_multiplier,
+                                   output_data.data(), output_dims);
+  double sum_abs_diff = 0;
+  float max_abs_val = 0;
+  for (int i = 0; i < output_buffer_size; i++) {
+    sum_abs_diff += std::abs(output_data[i] - reference_output_data[i]);
+    max_abs_val = std::max(max_abs_val, std::abs(reference_output_data[i]));
+  }
+  if (sum_abs_diff != 0.f) {
+    const float mean_diff =
+        static_cast<float>(sum_abs_diff / output_buffer_size);
+    const float relative_error = std::abs(mean_diff) / max_abs_val;
+    ASSERT_LT(relative_error, 1e-5f);
+  }
+}
+
+void TestOneDepthwiseConv(FusedActivationFunctionType Ac,
+                          const float* input_data, const Dims<4>& input_dims,
+                          const float* filter_data, const Dims<4>& filter_dims,
+                          const float* bias_data, const Dims<4>& bias_dims,
+                          int stride, int pad_width, int pad_height,
+                          int depth_multiplier, const Dims<4>& output_dims) {
+#define TOCO_HANDLE_CASE(AC_TYPE)                                            \
+  if (AC_TYPE == Ac) {                                                       \
+    TestOneDepthwiseConv<AC_TYPE>(input_data, input_dims, filter_data,       \
+                                  filter_dims, bias_data, bias_dims, stride, \
+                                  pad_width, pad_height, depth_multiplier,   \
+                                  output_dims);                              \
+    return;                                                                  \
+  }
+  TOCO_HANDLE_CASE(FusedActivationFunctionType::kNone)
+  TOCO_HANDLE_CASE(FusedActivationFunctionType::kRelu)
+  TOCO_HANDLE_CASE(FusedActivationFunctionType::kRelu1)
+  TOCO_HANDLE_CASE(FusedActivationFunctionType::kRelu6)
+#undef TOCO_HANDLE_CASE
+}
+
+// This function picks some random DepthwiseConv params, which may or may not
+// be legal. If they're not legal, it returns false. If they're legal,
+// it runs the DepthwiseConv test and returns true. This allows the caller
+// to loop until a test has been run.
+bool TryTestOneDepthwiseConv() {
+  // We have to pick a lot of positive values, where we are particularly
+  // interested in small values because they are most likely to be special
+  // cases in optimized implementations, and secondarily because they allow
+  // tests to run fast, which means we can run more tests and get more
+  // coverage.
+  const int batch = ExponentialRandomPositiveInt(0.9f, 3, 20);
+  const int input_depth = ExponentialRandomPositiveInt(0.9f, 6, 50);
+  const int input_width = ExponentialRandomPositiveInt(0.9f, 20, 200);
+  const int input_height = ExponentialRandomPositiveInt(0.9f, 20, 200);
+  const int filter_width = ExponentialRandomPositiveInt(0.9f, 4, 10);
+  const int filter_height = ExponentialRandomPositiveInt(0.9f, 4, 10);
+  const int depth_multiplier = ExponentialRandomPositiveInt(0.8f, 6, 50);
+  const int stride = ExponentialRandomPositiveInt(0.9f, 3, 8);
+  const int output_depth = input_depth * depth_multiplier;
+  // The optimized DepthwiseConv implementation currently uses a fixed-size
+  // accumulator buffer on the stack, with that size. This currently means
+  // that it does not support larger output depths. It CHECK's for it,
+  // so it's safe in the sense that if a larger output depth was encountered,
+  // it would explicitly fail. We just need to adjust our testing to that
+  // constraint.
+  const int kMaxSupportedOutputDepth = 1024;
+  if (output_depth > kMaxSupportedOutputDepth) {
+    return false;
+  }
+  const auto ac = RandomElement(std::vector<FusedActivationFunctionType>(
+      {FusedActivationFunctionType::kNone, FusedActivationFunctionType::kRelu,
+       FusedActivationFunctionType::kRelu6,
+       FusedActivationFunctionType::kRelu1}));
+  Dims<4> input_dims_inference =
+      MakeDimsForInference(input_depth, input_width, input_height, batch);
+  Dims<4> output_dims_inference;
+  int pad_width, pad_height;
+  const auto padding_type =
+      UniformRandomInt(0, 1) ? PaddingType::kSame : PaddingType::kValid;
+  if (!ComputeConvSizes(input_dims_inference, output_depth, filter_width,
+                        filter_height, stride, padding_type,
+                        &output_dims_inference, &pad_width, &pad_height)) {
+    return false;
+  }
+  Dims<4> filter_dims_inference =
+      MakeDimsForInference(output_depth, filter_width, filter_height, 1);
+  Dims<4> bias_dims_inference = MakeDimsForInference(output_depth, 1, 1, 1);
+  const int input_buffer_size = RequiredBufferSizeForDims(input_dims_inference);
+  const int filter_buffer_size =
+      RequiredBufferSizeForDims(filter_dims_inference);
+  std::vector<float> input_data(input_buffer_size);
+  std::vector<float> filter_data(filter_buffer_size);
+  std::vector<float> bias_data(output_depth);
+  const float input_amplitude = 1.f;
+  const float filter_amplitude = 1.f;
+  const float bias_amplitude =
+      filter_width * filter_height * input_amplitude * filter_amplitude;
+  FillRandom(&input_data, -input_amplitude, input_amplitude);
+  FillRandom(&filter_data, -filter_amplitude, filter_amplitude);
+  FillRandom(&bias_data, -bias_amplitude, bias_amplitude);
+  TestOneDepthwiseConv(ac, input_data.data(), input_dims_inference,
+                       filter_data.data(), filter_dims_inference,
+                       bias_data.data(), bias_dims_inference, stride, pad_width,
+                       pad_height, depth_multiplier, output_dims_inference);
+  return true;
+}
+
+void TestOneDepthwiseConv() {
+  while (!TryTestOneDepthwiseConv()) {
+  }
+}
+
+TEST(TestDepthwiseConv, TestDepthwiseConv) {
+  const int kTestsToRun = 100 * 1000;
+  for (int i = 0; i < kTestsToRun; i++) {
+    TestOneDepthwiseConv();
+  }
+}
+}  // namespace
+}  // namespace tflite
diff --git a/tensorflow/contrib/lite/kernels/internal/depthwiseconv_quantized_test.cc b/tensorflow/contrib/lite/kernels/internal/depthwiseconv_quantized_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..2c0fc8433e18fb7f7f89c17380210d94b39ffc94
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/internal/depthwiseconv_quantized_test.cc
@@ -0,0 +1,330 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <sys/types.h>
+#include <algorithm>
+#include <cmath>
+#include <cstdint>
+#include <cstdlib>
+#include <iterator>
+#include <limits>
+#include <vector>
+
+#include <gtest/gtest.h>
+#include "tensorflow/contrib/lite/kernels/internal/test_util.h"
+#include "tensorflow/contrib/lite/kernels/internal/types.h"
+
+#define ALLOW_SLOW_GENERIC_DEPTHWISECONV_FALLBACK
+#include "tensorflow/contrib/lite/kernels/internal/optimized/depthwiseconv_uint8.h"
+#include "tensorflow/contrib/lite/kernels/internal/reference/depthwiseconv_uint8.h"
+
+namespace tflite {
+namespace {
+
+// Runs the DepthwiseConv and compares against the reference implementation.
+template <FusedActivationFunctionType Ac>
+int TestOneDepthwiseConvWithGivenOutputShift(
+    const std::uint8_t* input_data, const Dims<4>& input_dims,
+    std::int32_t input_offset, const std::uint8_t* filter_data,
+    const Dims<4>& filter_dims, std::int32_t filter_offset,
+    const std::int32_t* bias_data, const Dims<4>& bias_dims, int stride,
+    int pad_width, int pad_height, int depth_multiplier,
+    std::int32_t output_offset, std::int32_t output_multiplier,
+    int output_shift, std::int32_t output_activation_min,
+    std::int32_t output_activation_max, const Dims<4>& output_dims) {
+  const int output_buffer_size = RequiredBufferSizeForDims(output_dims);
+  std::vector<std::uint8_t> output_data(output_buffer_size);
+  std::vector<std::uint8_t> reference_output_data(output_buffer_size);
+  reference_ops::DepthwiseConv<Ac>(
+      input_data, input_dims, input_offset, filter_data, filter_dims,
+      filter_offset, bias_data, bias_dims, stride, pad_width, pad_height,
+      depth_multiplier, output_offset, output_multiplier, output_shift,
+      output_activation_min, output_activation_max,
+      reference_output_data.data(), output_dims);
+  optimized_ops::DepthwiseConv<Ac>(
+      input_data, input_dims, input_offset, filter_data, filter_dims,
+      filter_offset, bias_data, bias_dims, stride, pad_width, pad_height,
+      depth_multiplier, output_offset, output_multiplier, output_shift,
+      output_activation_min, output_activation_max, output_data.data(),
+      output_dims);
+  int saturated_min = 0;
+  int saturated_max = 0;
+  std::vector<int> diff(output_buffer_size);
+  std::int64_t sum_diff = 0;
+  std::int64_t sum_abs_diff = 0;
+  for (int i = 0; i < output_buffer_size; i++) {
+    diff[i] = static_cast<int>(output_data[i]) -
+              static_cast<int>(reference_output_data[i]);
+    sum_diff += diff[i];
+    sum_abs_diff += std::abs(diff[i]);
+    saturated_min += output_data[i] == output_activation_min;
+    saturated_max += output_data[i] == output_activation_max;
+  }
+  // These stats help understand test failures.
+  std::sort(std::begin(diff), std::end(diff));
+  const int min_diff = diff.front();
+  const int max_diff = diff.back();
+  const int median_diff = diff[diff.size() / 2];
+  const float mean_diff = static_cast<float>(sum_diff) / output_buffer_size;
+  const float mean_abs_diff =
+      static_cast<float>(sum_abs_diff) / output_buffer_size;
+  // Normally we should require bit-for-bit exact results. Unfortunately a bug
+  // in the Intel arm_neon_sse.h translation header that we use for x86 tests
+  // causes 1-bit inaccuracy in
+  // the vqrdmulh_n_s32 intrinsic, which causes off-by-1 errors in quantized
+  // DepthwiseConv ops. So we have to live with a few off-by-one errors for now,
+  // yet still ensure that no more than a small minority of values are wrong.
+  EXPECT_TRUE(std::abs(mean_diff) < 1e-5f && mean_abs_diff < 1e-5f &&
+              std::abs(median_diff) == 0 && std::abs(min_diff) <= 1 &&
+              std::abs(max_diff) <= 1);
+  if (saturated_min > 2 * saturated_max) {
+    return -1;
+  }
+  if (saturated_max > 2 * saturated_min) {
+    return 1;
+  }
+  return 0;
+}
+
+// The point of this function is that we can't practically know which
+// output_shift value to pass to test DepthwiseConv. It's not easy to guess (we
+// could do some
+// statistics for large size, but they would be fragile at smaller sizes), and
+// guessing wrong would mean that all the values get saturated so the test
+// becomes
+// vacuous. So we just bisect our way to reasonable output_shift values.
+template <FusedActivationFunctionType Ac>
+void TestOneDepthwiseConvBisectOutputShift(
+    const std::uint8_t* input_data, const Dims<4>& input_dims,
+    std::int32_t input_offset, const std::uint8_t* filter_data,
+    const Dims<4>& filter_dims, std::int32_t filter_offset,
+    const std::int32_t* bias_data, const Dims<4>& bias_dims, int stride,
+    int pad_width, int pad_height, int depth_multiplier,
+    std::int32_t output_offset, std::int32_t output_multiplier,
+    int output_activation_bisect_start, int output_activation_bisect_end,
+    std::int32_t output_activation_min, std::int32_t output_activation_max,
+    const Dims<4>& output_dims) {
+  ASSERT_LT(output_activation_bisect_start, output_activation_bisect_end)
+      << "Bisection failed ?!?!";
+  int output_shift_bisect_midpoint =
+      (output_activation_bisect_start + output_activation_bisect_end) / 2;
+  int bisect_result = TestOneDepthwiseConvWithGivenOutputShift<Ac>(
+      input_data, input_dims, input_offset, filter_data, filter_dims,
+      filter_offset, bias_data, bias_dims, stride, pad_width, pad_height,
+      depth_multiplier, output_offset, output_multiplier,
+      output_shift_bisect_midpoint, output_activation_min,
+      output_activation_max, output_dims);
+  // At this point we know that the test succeeded (otherwise it would have
+  // aborted).
+  if (bisect_result == 0) {
+    // The result isn't particularly saturated on one or the other side.
+    // All good, we're done.
+    return;
+  }
+  if (output_activation_bisect_start == output_activation_bisect_end - 1) {
+    // There is still some saturation on one side, but the bisection is
+    // finished anyways. We're done; nothing more we can do about it. This
+    // happens
+    // in particular when using an activation with a narrow range.
+    return;
+  }
+  // Continue the bisection based on the present result.
+  int new_output_activation_bisect_start = bisect_result == 1
+                                               ? output_shift_bisect_midpoint
+                                               : output_activation_bisect_start;
+  int new_output_activation_bisect_end = bisect_result == 1
+                                             ? output_activation_bisect_end
+                                             : output_shift_bisect_midpoint;
+  TestOneDepthwiseConvBisectOutputShift<Ac>(
+      input_data, input_dims, input_offset, filter_data, filter_dims,
+      filter_offset, bias_data, bias_dims, stride, pad_width, pad_height,
+      depth_multiplier, output_offset, output_multiplier,
+      new_output_activation_bisect_start, new_output_activation_bisect_end,
+      output_activation_min, output_activation_max, output_dims);
+}
+
+template <FusedActivationFunctionType Ac>
+void TestOneDepthwiseConv(
+    const std::uint8_t* input_data, const Dims<4>& input_dims,
+    std::int32_t input_offset, const std::uint8_t* filter_data,
+    const Dims<4>& filter_dims, std::int32_t filter_offset,
+    const std::int32_t* bias_data, const Dims<4>& bias_dims, int stride,
+    int pad_width, int pad_height, int depth_multiplier,
+    std::int32_t output_offset, std::int32_t output_multiplier,
+    std::int32_t output_activation_min, std::int32_t output_activation_max,
+    const Dims<4>& output_dims) {
+  TestOneDepthwiseConvBisectOutputShift<Ac>(
+      input_data, input_dims, input_offset, filter_data, filter_dims,
+      filter_offset, bias_data, bias_dims, stride, pad_width, pad_height,
+      depth_multiplier, output_offset, output_multiplier, 0, 32,
+      output_activation_min, output_activation_max, output_dims);
+}
+
+void TestOneDepthwiseConv(
+    FusedActivationFunctionType Ac, const std::uint8_t* input_data,
+    const Dims<4>& input_dims, std::int32_t input_offset,
+    const std::uint8_t* filter_data, const Dims<4>& filter_dims,
+    std::int32_t filter_offset, const std::int32_t* bias_data,
+    const Dims<4>& bias_dims, int stride, int pad_width, int pad_height,
+    int depth_multiplier, std::int32_t output_offset,
+    std::int32_t output_multiplier, std::int32_t output_activation_min,
+    std::int32_t output_activation_max, const Dims<4>& output_dims) {
+#define TOCO_HANDLE_CASE(AC_TYPE)                                           \
+  if (AC_TYPE == Ac) {                                                      \
+    TestOneDepthwiseConv<AC_TYPE>(                                          \
+        input_data, input_dims, input_offset, filter_data, filter_dims,     \
+        filter_offset, bias_data, bias_dims, stride, pad_width, pad_height, \
+        depth_multiplier, output_offset, output_multiplier,                 \
+        output_activation_min, output_activation_max, output_dims);         \
+    return;                                                                 \
+  }
+  TOCO_HANDLE_CASE(FusedActivationFunctionType::kNone)
+  TOCO_HANDLE_CASE(FusedActivationFunctionType::kRelu)
+  TOCO_HANDLE_CASE(FusedActivationFunctionType::kRelu1)
+  TOCO_HANDLE_CASE(FusedActivationFunctionType::kRelu6)
+#undef TOCO_HANDLE_CASE
+}
+
+bool TryTestDepthwiseConv(int batch, int input_depth, int input_width,
+                          int input_height, int filter_width, int filter_height,
+                          int depth_multiplier, int stride,
+                          PaddingType padding_type) {
+  const int output_depth = input_depth * depth_multiplier;
+  // The optimized DepthwiseConv implementation currently uses a fixed-size
+  // accumulator buffer on the stack, with that size. This currently means
+  // that it does not support larger output depths. It CHECK's for it,
+  // so it's safe in the sense that if a larger output depth was encountered,
+  // it would explicitly fail. We just need to adjust our testing to that
+  // constraint.
+  const int kMaxSupportedOutputDepth = 1024;
+  if (output_depth > kMaxSupportedOutputDepth) {
+    return false;
+  }
+  const auto ac = RandomElement(std::vector<FusedActivationFunctionType>(
+      {FusedActivationFunctionType::kNone, FusedActivationFunctionType::kRelu,
+       FusedActivationFunctionType::kRelu6,
+       FusedActivationFunctionType::kRelu1}));
+  int output_activation_min = 0;
+  int output_activation_max = 255;
+  if (ac != FusedActivationFunctionType::kNone && UniformRandomInt(0, 1)) {
+    output_activation_min = UniformRandomInt(0, 50);
+    output_activation_max = UniformRandomInt(200, 255);
+  }
+  const std::int32_t output_multiplier =
+      UniformRandomInt(1 << 29, std::numeric_limits<std::int32_t>::max());
+  const std::int32_t input_offset = UniformRandomInt(-256, 0);
+  const std::int32_t filter_offset = UniformRandomInt(-256, 0);
+  const std::int32_t output_offset = UniformRandomInt(-256, 0);
+  Dims<4> input_dims_inference =
+      MakeDimsForInference(input_depth, input_width, input_height, batch);
+  Dims<4> output_dims_inference;
+  int pad_width, pad_height;
+  if (!ComputeConvSizes(input_dims_inference, output_depth, filter_width,
+                        filter_height, stride, padding_type,
+                        &output_dims_inference, &pad_width, &pad_height)) {
+    return false;
+  }
+  Dims<4> filter_dims_inference =
+      MakeDimsForInference(output_depth, filter_width, filter_height, 1);
+  Dims<4> bias_dims_inference = MakeDimsForInference(output_depth, 1, 1, 1);
+  const int input_buffer_size = RequiredBufferSizeForDims(input_dims_inference);
+  const int filter_buffer_size =
+      RequiredBufferSizeForDims(filter_dims_inference);
+  std::vector<std::uint8_t> input_data(input_buffer_size);
+  std::vector<std::uint8_t> filter_data(filter_buffer_size);
+  std::vector<std::int32_t> bias_data(output_depth);
+  FillRandom(&input_data);
+  FillRandom(&filter_data);
+  FillRandom(&bias_data, -10000, 10000);
+  TestOneDepthwiseConv(ac, input_data.data(), input_dims_inference,
+                       input_offset, filter_data.data(), filter_dims_inference,
+                       filter_offset, bias_data.data(), bias_dims_inference,
+                       stride, pad_width, pad_height, depth_multiplier,
+                       output_offset, output_multiplier, output_activation_min,
+                       output_activation_max, output_dims_inference);
+  return true;
+}
+
+// This function picks some random DepthwiseConv params, which may or may not
+// be legal. If they're not legal, it returns false. If they're legal,
+// it runs the DepthwiseConv test and returns true. This allows the caller
+// to loop until a test has been run.
+bool TryTestOneDepthwiseConv() {
+  // We have to pick a lot of positive values, where we are particularly
+  // interested in small values because they are most likely to be special
+  // cases in optimized implementations, and secondarily because they allow
+  // tests to run fast, which means we can run more tests and get more
+  // coverage.
+  const int batch = ExponentialRandomPositiveInt(0.9f, 3, 20);
+  const int input_depth = ExponentialRandomPositiveInt(0.9f, 6, 50);
+  const int input_width = ExponentialRandomPositiveInt(0.9f, 20, 200);
+  const int input_height = ExponentialRandomPositiveInt(0.9f, 20, 200);
+  const int filter_width = ExponentialRandomPositiveInt(0.9f, 4, 10);
+  const int filter_height = ExponentialRandomPositiveInt(0.9f, 4, 10);
+  const int depth_multiplier = ExponentialRandomPositiveInt(0.8f, 6, 50);
+  const int stride = ExponentialRandomPositiveInt(0.9f, 3, 8);
+  const auto padding_type =
+      UniformRandomInt(0, 1) ? PaddingType::kSame : PaddingType::kValid;
+
+  return TryTestDepthwiseConv(batch, input_depth, input_width, input_height,
+                              filter_width, filter_height, depth_multiplier,
+                              stride, padding_type);
+}
+
+// Tests parameters for the 3x3 filter kernel.
+bool TryTestOneDepthwiseConv3x3Filter() {
+  const int batch = ExponentialRandomPositiveInt(0.9f, 3, 20);
+  const int input_depth = 8 * ExponentialRandomPositiveInt(0.9f, 10, 50);
+  const int input_width = ExponentialRandomPositiveInt(0.9f, 20, 200);
+  const int input_height = ExponentialRandomPositiveInt(0.9f, 20, 200);
+  const int filter_width = 3;
+  const int filter_height = 3;
+  const int depth_multiplier = 1;
+  const int stride = UniformRandomInt(1, 2);
+  // Although the kernel supports only kValid padding, we test that kSame
+  // is using the correct code path.
+  const auto padding_type =
+      UniformRandomInt(0, 1) ? PaddingType::kSame : PaddingType::kValid;
+
+  return TryTestDepthwiseConv(batch, input_depth, input_width, input_height,
+                              filter_width, filter_height, depth_multiplier,
+                              stride, padding_type);
+}
+
+void TestOneDepthwiseConv() {
+  while (!TryTestOneDepthwiseConv()) {
+  }
+}
+
+void TestOneDepthwiseConv3x3Filter() {
+  while (!TryTestOneDepthwiseConv3x3Filter()) {
+  }
+}
+
+TEST(TestDepthwiseConv, TestDepthwiseConv) {
+  const int kTestsToRun = 10 * 1000;
+  for (int i = 0; i < kTestsToRun; i++) {
+    TestOneDepthwiseConv();
+  }
+}
+
+TEST(TestDepthwiseConv3x3Filter, TestDepthwiseConv) {
+  const int kTestsToRun = 3 * 1000;
+  for (int i = 0; i < kTestsToRun; i++) {
+    TestOneDepthwiseConv3x3Filter();
+  }
+}
+
+}  // namespace
+}  // namespace tflite
diff --git a/tensorflow/contrib/lite/kernels/internal/kernel_utils.cc b/tensorflow/contrib/lite/kernels/internal/kernel_utils.cc
index 5f9cfc450db1c25ff604f99f93481e5ca590a5a2..36c25388e8bde721d7644dc83d5b7c490d37b4d3 100644
--- a/tensorflow/contrib/lite/kernels/internal/kernel_utils.cc
+++ b/tensorflow/contrib/lite/kernels/internal/kernel_utils.cc
@@ -52,21 +52,19 @@ void RnnBatchStep(const float* input_ptr_batch, const int8_t* input_weights_ptr,
                   TfLiteFusedActivation activation,
                   int8_t* quantized_input_ptr_batch,
                   int8_t* quantized_hidden_state_ptr_batch,
-                  float* hidden_state_ptr_batch, float* output_ptr_batch) {
+                  float* scaling_factors, float* hidden_state_ptr_batch,
+                  float* output_ptr_batch) {
   // Output = bias
   tensor_utils::VectorBatchVectorAssign(bias_ptr, num_units, batch_size,
                                         output_ptr_batch);
 
-  // TODO(mirkov): change std::minmax_element with a vectorized call.
-  auto minmax_element = std::minmax_element(
-      input_ptr_batch, input_ptr_batch + batch_size * input_size);
-
   // Save quantization and matmul computation for all zero input.
-  if (!(*minmax_element.first == 0.0 && *minmax_element.second == 0.0)) {
+  if (!tensor_utils::IsZeroVector(input_ptr_batch, batch_size * input_size)) {
     // Quantize input from float to uint8 + quantization params (scaling
     // factor).
     float unused_min, unused_max;
-    float* scaling_factors = new float[batch_size];
+    // TODO(mirkov,raziel): replace this for-loop with a MACRO (or function)
+    // whichever is faster.
     for (int b = 0; b < batch_size; ++b) {
       const int offset = b * input_size;
       tensor_utils::SymmetricQuantizeFloats(
@@ -80,16 +78,13 @@ void RnnBatchStep(const float* input_ptr_batch, const int8_t* input_weights_ptr,
     tensor_utils::MatrixBatchVectorMultiplyAccumulate(
         input_weights_ptr, num_units, input_size, quantized_input_ptr_batch,
         scaling_factors, batch_size, output_ptr_batch, /*result_stride=*/1);
-    delete[] scaling_factors;
   }
 
-  minmax_element = std::minmax_element(
-      hidden_state_ptr_batch, hidden_state_ptr_batch + batch_size * num_units);
   // Save quantization and matmul computation for all zero input.
-  if (!(*minmax_element.first == 0.0 && *minmax_element.second == 0.0)) {
+  if (!tensor_utils::IsZeroVector(hidden_state_ptr_batch,
+                                  batch_size * num_units)) {
     // Quantize hidden_state
     float unused_min, unused_max;
-    float* scaling_factors = new float[batch_size];
     for (int b = 0; b < batch_size; ++b) {
       const int offset = b * num_units;
       tensor_utils::SymmetricQuantizeFloats(
@@ -104,7 +99,6 @@ void RnnBatchStep(const float* input_ptr_batch, const int8_t* input_weights_ptr,
         recurrent_weights_ptr, num_units, num_units,
         quantized_hidden_state_ptr_batch, scaling_factors, batch_size,
         output_ptr_batch, /*result_stride=*/1);
-    delete[] scaling_factors;
   }
 
   // Output = activation(Output) and update hidden_state
@@ -155,6 +149,7 @@ void LstmStep(
         input_to_input_weights_ptr, n_cell, n_input, input_ptr_batch, n_batch,
         input_gate_scratch, /*result_stride=*/1);
   }
+
   tensor_utils::MatrixBatchVectorMultiplyAccumulate(
       input_to_forget_weights_ptr, n_cell, n_input, input_ptr_batch, n_batch,
       forget_gate_scratch, /*result_stride=*/1);
@@ -169,8 +164,7 @@ void LstmStep(
   if (!use_cifg) {
     tensor_utils::MatrixBatchVectorMultiplyAccumulate(
         recurrent_to_input_weights_ptr, n_cell, n_output, output_state_ptr,
-        n_batch, input_gate_scratch,
-        /*result_stride=*/1);
+        n_batch, input_gate_scratch, /*result_stride=*/1);
   }
   tensor_utils::MatrixBatchVectorMultiplyAccumulate(
       recurrent_to_forget_weights_ptr, n_cell, n_output, output_state_ptr,
@@ -261,5 +255,263 @@ void LstmStep(
                            output_state_ptr);
 }
 
+// TODO(alanchiao): move this to tensor_utils.
+void VectorMultiply(const int8_t* vector, const int v_size, const float scale,
+                    float* result) {
+  for (int i = 0; i < v_size; ++i) {
+    *result++ = scale * *vector++;
+  }
+}
+
+void LstmStep(
+    const float* input_ptr_batch, const int8_t* input_to_input_weights_ptr,
+    float input_to_input_weights_scale,
+    const int8_t* input_to_forget_weights_ptr,
+    float input_to_forget_weights_scale,
+    const int8_t* input_to_cell_weights_ptr, float input_to_cell_weights_scale,
+    const int8_t* input_to_output_weights_ptr,
+    float input_to_output_weights_scale,
+    const int8_t* recurrent_to_input_weights_ptr,
+    float recurrent_to_input_weights_scale,
+    const int8_t* recurrent_to_forget_weights_ptr,
+    float recurrent_to_forget_weights_scale,
+    const int8_t* recurrent_to_cell_weights_ptr,
+    float recurrent_to_cell_weights_scale,
+    const int8_t* recurrent_to_output_weights_ptr,
+    float recurrent_to_output_weights_scale,
+    const int8_t* cell_to_input_weights_ptr, float cell_to_input_weights_scale,
+    const int8_t* cell_to_forget_weights_ptr,
+    float cell_to_forget_weights_scale,
+    const int8_t* cell_to_output_weights_ptr,
+    float cell_to_output_weights_scale, const float* input_gate_bias_ptr,
+    const float* forget_gate_bias_ptr, const float* cell_bias_ptr,
+    const float* output_gate_bias_ptr, const int8_t* projection_weights_ptr,
+    float projection_weights_scale, const float* projection_bias_ptr,
+    const TfLiteLSTMParams* params, int n_batch, int n_cell, int n_input,
+    int n_output, float* input_gate_scratch, float* forget_gate_scratch,
+    float* cell_scratch, float* output_gate_scratch, float* scaling_factors,
+    float* product_scaling_factors, float* recovered_cell_weights,
+    int8_t* quantized_input_ptr_batch, int8_t* quantized_output_state_ptr,
+    int8_t* quantized_cell_state_ptr, float* output_state_ptr,
+    float* cell_state_ptr, float* output_ptr_batch) {
+  // Since we have already checked that weights are all there or none, we can
+  // check the existense of only one to the get the condition.
+  const bool use_cifg = (input_to_input_weights_ptr == nullptr);
+  const bool use_peephole = (cell_to_output_weights_ptr != nullptr);
+  // Initialize scratch buffers with bias.
+  if (!use_cifg) {
+    tensor_utils::VectorBatchVectorAssign(input_gate_bias_ptr, n_cell, n_batch,
+                                          input_gate_scratch);
+  }
+  tensor_utils::VectorBatchVectorAssign(forget_gate_bias_ptr, n_cell, n_batch,
+                                        forget_gate_scratch);
+  tensor_utils::VectorBatchVectorAssign(cell_bias_ptr, n_cell, n_batch,
+                                        cell_scratch);
+  tensor_utils::VectorBatchVectorAssign(output_gate_bias_ptr, n_cell, n_batch,
+                                        output_gate_scratch);
+
+  if (!tensor_utils::IsZeroVector(input_ptr_batch, n_batch * n_input)) {
+    // Save quantization and matmul computation for all zero input.
+    float unused_min, unused_max;
+    for (int b = 0; b < n_batch; ++b) {
+      const int offset = b * n_input;
+      tensor_utils::SymmetricQuantizeFloats(
+          input_ptr_batch + offset, n_input, quantized_input_ptr_batch + offset,
+          &unused_min, &unused_max, &scaling_factors[b]);
+    }
+    // For each batch and cell: compute input_weight * input.
+    if (!use_cifg) {
+      for (int b = 0; b < n_batch; ++b) {
+        product_scaling_factors[b] =
+            scaling_factors[b] * input_to_input_weights_scale;
+      }
+      tensor_utils::MatrixBatchVectorMultiplyAccumulate(
+          input_to_input_weights_ptr, n_cell, n_input,
+          quantized_input_ptr_batch, product_scaling_factors, n_batch,
+          input_gate_scratch, /*result_stride=*/1);
+    }
+
+    for (int b = 0; b < n_batch; ++b) {
+      product_scaling_factors[b] =
+          scaling_factors[b] * input_to_forget_weights_scale;
+    }
+    tensor_utils::MatrixBatchVectorMultiplyAccumulate(
+        input_to_forget_weights_ptr, n_cell, n_input, quantized_input_ptr_batch,
+        product_scaling_factors, n_batch, forget_gate_scratch,
+        /*result_stride=*/1);
+
+    for (int b = 0; b < n_batch; ++b) {
+      product_scaling_factors[b] =
+          scaling_factors[b] * input_to_cell_weights_scale;
+    }
+    tensor_utils::MatrixBatchVectorMultiplyAccumulate(
+        input_to_cell_weights_ptr, n_cell, n_input, quantized_input_ptr_batch,
+        product_scaling_factors, n_batch, cell_scratch, /*result_stride=*/1);
+
+    for (int b = 0; b < n_batch; ++b) {
+      product_scaling_factors[b] =
+          scaling_factors[b] * input_to_output_weights_scale;
+    }
+    tensor_utils::MatrixBatchVectorMultiplyAccumulate(
+        input_to_output_weights_ptr, n_cell, n_input, quantized_input_ptr_batch,
+        product_scaling_factors, n_batch, output_gate_scratch,
+        /*result_stride=*/1);
+  }
+
+  if (!tensor_utils::IsZeroVector(output_state_ptr, n_batch * n_output)) {
+    // Save quantization and matmul computation for all zero input.
+    float unused_min, unused_max;
+    for (int b = 0; b < n_batch; ++b) {
+      const int offset = b * n_output;
+      tensor_utils::SymmetricQuantizeFloats(output_state_ptr + offset, n_output,
+                                            quantized_output_state_ptr + offset,
+                                            &unused_min, &unused_max,
+                                            &scaling_factors[b]);
+    }
+    // For each batch and cell: compute recurrent_weight * output_state.
+    if (!use_cifg) {
+      for (int b = 0; b < n_batch; ++b) {
+        product_scaling_factors[b] =
+            scaling_factors[b] * recurrent_to_input_weights_scale;
+      }
+      tensor_utils::MatrixBatchVectorMultiplyAccumulate(
+          recurrent_to_input_weights_ptr, n_cell, n_output,
+          quantized_output_state_ptr, product_scaling_factors, n_batch,
+          input_gate_scratch, /*result_stride=*/1);
+    }
+
+    for (int b = 0; b < n_batch; ++b) {
+      product_scaling_factors[b] =
+          scaling_factors[b] * recurrent_to_forget_weights_scale;
+    }
+    tensor_utils::MatrixBatchVectorMultiplyAccumulate(
+        recurrent_to_forget_weights_ptr, n_cell, n_output,
+        quantized_output_state_ptr, product_scaling_factors, n_batch,
+        forget_gate_scratch, /*result_stride=*/1);
+
+    for (int b = 0; b < n_batch; ++b) {
+      product_scaling_factors[b] =
+          scaling_factors[b] * recurrent_to_cell_weights_scale;
+    }
+    tensor_utils::MatrixBatchVectorMultiplyAccumulate(
+        recurrent_to_cell_weights_ptr, n_cell, n_output,
+        quantized_output_state_ptr, product_scaling_factors, n_batch,
+        cell_scratch, /*result_stride=*/1);
+
+    for (int b = 0; b < n_batch; ++b) {
+      product_scaling_factors[b] =
+          scaling_factors[b] * recurrent_to_output_weights_scale;
+    }
+    tensor_utils::MatrixBatchVectorMultiplyAccumulate(
+        recurrent_to_output_weights_ptr, n_cell, n_output,
+        quantized_output_state_ptr, product_scaling_factors, n_batch,
+        output_gate_scratch, /*result_stride=*/1);
+  }
+
+  // Save quantization and matmul computation for all zero input.
+  bool is_cell_state_all_zeros =
+      tensor_utils::IsZeroVector(cell_state_ptr, n_batch * n_cell);
+
+  // For each batch and cell: update input gate.
+  if (!use_cifg) {
+    if (use_peephole && !is_cell_state_all_zeros) {
+      VectorMultiply(cell_to_input_weights_ptr, n_cell,
+                     1. / cell_to_input_weights_scale, recovered_cell_weights);
+      tensor_utils::VectorBatchVectorCwiseProductAccumulate(
+          recovered_cell_weights, n_cell, cell_state_ptr, n_batch,
+          input_gate_scratch);
+    }
+    tensor_utils::ApplySigmoidToVector(input_gate_scratch, n_cell * n_batch,
+                                       input_gate_scratch);
+  }
+
+  // For each batch and cell: update forget gate.
+  if (use_peephole && !is_cell_state_all_zeros) {
+    VectorMultiply(cell_to_forget_weights_ptr, n_cell,
+                   1. / cell_to_forget_weights_scale, recovered_cell_weights);
+    tensor_utils::VectorBatchVectorCwiseProductAccumulate(
+        recovered_cell_weights, n_cell, cell_state_ptr, n_batch,
+        forget_gate_scratch);
+  }
+  tensor_utils::ApplySigmoidToVector(forget_gate_scratch, n_cell * n_batch,
+                                     forget_gate_scratch);
+
+  // For each batch and cell: update the cell.
+  tensor_utils::VectorVectorCwiseProduct(forget_gate_scratch, cell_state_ptr,
+                                         n_batch * n_cell, cell_state_ptr);
+  tensor_utils::ApplyActivationToVector(cell_scratch, n_batch * n_cell,
+                                        params->activation, cell_scratch);
+  if (use_cifg) {
+    tensor_utils::Sub1Vector(forget_gate_scratch, n_batch * n_cell,
+                             forget_gate_scratch);
+    tensor_utils::VectorVectorCwiseProductAccumulate(
+        cell_scratch, forget_gate_scratch, n_batch * n_cell, cell_state_ptr);
+  } else {
+    tensor_utils::VectorVectorCwiseProductAccumulate(
+        cell_scratch, input_gate_scratch, n_batch * n_cell, cell_state_ptr);
+  }
+  if (params->cell_clip > 0.0) {
+    tensor_utils::ClipVector(cell_state_ptr, n_batch * n_cell,
+                             params->cell_clip, cell_state_ptr);
+  }
+
+  is_cell_state_all_zeros =
+      tensor_utils::IsZeroVector(cell_state_ptr, n_batch * n_cell);
+  // For each batch and cell: update the output gate.
+  if (use_peephole && !is_cell_state_all_zeros) {
+    VectorMultiply(cell_to_output_weights_ptr, n_cell,
+                   1. / cell_to_output_weights_scale, recovered_cell_weights);
+    tensor_utils::VectorBatchVectorCwiseProductAccumulate(
+        recovered_cell_weights, n_cell, cell_state_ptr, n_batch,
+        output_gate_scratch);
+  }
+  tensor_utils::ApplySigmoidToVector(output_gate_scratch, n_batch * n_cell,
+                                     output_gate_scratch);
+  tensor_utils::ApplyActivationToVector(cell_state_ptr, n_batch * n_cell,
+                                        params->activation, cell_scratch);
+  tensor_utils::VectorVectorCwiseProduct(output_gate_scratch, cell_scratch,
+                                         n_batch * n_cell, output_gate_scratch);
+
+  // For each batch: update the projection and output_state.
+  const bool use_projection_weight = (projection_weights_ptr != nullptr);
+  const bool use_projection_bias = (projection_bias_ptr != nullptr);
+  if (use_projection_weight) {
+    if (use_projection_bias) {
+      tensor_utils::VectorBatchVectorAssign(projection_bias_ptr, n_output,
+                                            n_batch, output_ptr_batch);
+    } else {
+      tensor_utils::ZeroVector(output_ptr_batch, n_batch * n_output);
+    }
+    if (!tensor_utils::IsZeroVector(output_gate_scratch, n_batch * n_cell)) {
+      // Save quantization and matmul computation for all zero input.
+      float unused_min, unused_max;
+      for (int b = 0; b < n_batch; ++b) {
+        const int offset = b * n_cell;
+        tensor_utils::SymmetricQuantizeFloats(
+            output_gate_scratch + offset, n_cell,
+            quantized_cell_state_ptr + offset, &unused_min, &unused_max,
+            &scaling_factors[b]);
+      }
+      for (int b = 0; b < n_batch; ++b) {
+        product_scaling_factors[b] =
+            scaling_factors[b] * projection_weights_scale;
+      }
+      tensor_utils::MatrixBatchVectorMultiplyAccumulate(
+          projection_weights_ptr, n_output, n_cell, quantized_cell_state_ptr,
+          product_scaling_factors, n_batch, output_ptr_batch,
+          /*result_stride=*/1);
+    }
+    if (params->proj_clip > 0.0) {
+      tensor_utils::ClipVector(output_ptr_batch, n_batch * n_output,
+                               params->proj_clip, output_ptr_batch);
+    }
+  } else {
+    tensor_utils::CopyVector(output_gate_scratch, n_batch * n_output,
+                             output_ptr_batch);
+  }
+  tensor_utils::CopyVector(output_ptr_batch, n_batch * n_output,
+                           output_state_ptr);
+}
+
 }  // namespace kernel_utils
 }  // namespace tflite
diff --git a/tensorflow/contrib/lite/kernels/internal/kernel_utils.h b/tensorflow/contrib/lite/kernels/internal/kernel_utils.h
index cbfbcbeefcd34fa732799d89f52791b18855857d..2a11b37a6069367e8232350c2fc68d4c385e14ba 100644
--- a/tensorflow/contrib/lite/kernels/internal/kernel_utils.h
+++ b/tensorflow/contrib/lite/kernels/internal/kernel_utils.h
@@ -41,6 +41,9 @@ void RnnBatchStep(const float* input_ptr_batch, const float* input_weights_ptr,
 // values of hidden_state_ptr_batch and input_ptr_batch, respectively.
 // These temporary storages are expected to be preallocated to the same size as
 // the respective pointers.
+// An additional preallocated temporary storage 'scaling_factors' (of size
+// batch_size) is used to store the scaling factors of the quantization (used
+// for recovery).
 // {input,recurrent}_weights_scale params are used for dequantization/recovery.
 void RnnBatchStep(const float* input_ptr_batch, const int8_t* input_weights_ptr,
                   float input_weights_scale,
@@ -50,7 +53,8 @@ void RnnBatchStep(const float* input_ptr_batch, const int8_t* input_weights_ptr,
                   TfLiteFusedActivation activation,
                   int8_t* quantized_input_ptr_batch,
                   int8_t* quantized_hidden_state_ptr_batch,
-                  float* hidden_state_ptr_batch, float* output_ptr_batch);
+                  float* scaling_factors, float* hidden_state_ptr_batch,
+                  float* output_ptr_batch);
 
 // Performs an LSTM batch inference step for input specified by input_ptr_batch.
 // The LSTM cell is specified by the pointers to its weights (*_weights_ptr) and
@@ -88,6 +92,89 @@ void LstmStep(
     float* forget_gate_scratch, float* cell_scratch, float* output_gate_scratch,
     float* output_ptr_batch);
 
+// Same as above but with quantized weight matrices. In detail:
+// Input of size 'n_batch * n_input':
+//   input_ptr_batch
+//
+// LSTM weights:
+// Quantized input weights of size 'n_cell * n_input':
+//   input_to_input_weights            - optional (can be nullptr)
+//   input_to_forget_weights
+//   input_to_cell_weights
+//   input_to_input_weights
+// Quantized recurrent weights of size 'n_cell * n_output':
+//   recurrent_to_input_weights        - optional
+//   recurrent_to_forget_weights
+//   recurrent_to_cell_weights
+//   recurrent_to_input_weights
+// Quantized peephole weights of size 'n_cell', representing diagonal matrices.
+//   cell_to_input_weights             - optional
+//   cell_to_cell_weights              - optional
+//   cell_to_output_weights            - optional
+// Quantized projection weights of size 'n_output * n_cell'
+//   projection_weights_ptr            - optional
+// Weight scales (scalars) for each of the weights above.
+//   input_to_input_weights_scale      - optional
+//   input_to_forget_weights_scale
+//   input_to_cell_weights_scale
+//   input_to_output_weights_scale
+//   recurrent_to_input_weights_scale  - optional
+//   recurrent_to_forget_weights_scale
+//   recurrent_to_cell_weights_scale
+//   recurrent_to_output_weights_scale
+//   cell_to_input_weights_scale,
+//   cell_to_forget_weights_scale,
+//   cell_to_output_weights_scale,
+//   projection_weights_scale          - optional
+// Gate biases of size 'n_cell':
+//   input_gate_bias_ptr               - optional
+//   forget_gate_bias_ptr
+//   cell_gate_bias_ptr
+//   output_gate_bias_ptr
+//
+// Temporary pre-allocated storage for quantized values:
+//   quantized_input_ptr_batch (same size as input_ptr_batch)
+//   quantized_output_state_ptr (same size as output_state_ptr)
+//   quantized_cell_state_ptr (same size as cell_state_ptr)
+// Temporary pre-allocated storage for recovered values:
+//   recovered_cell_weights (same size as cell_to_*_weights)
+//
+// Outputs:
+//   output_state_ptr - size 'n_batch * n_output'
+//   cell_state_ptr   - size 'n_batch * n_cell'
+//   output_ptr_batch - size 'n_batch * n_output'
+void LstmStep(
+    const float* input_ptr_batch, const int8_t* input_to_input_weights_ptr,
+    float input_to_input_weights_scale,
+    const int8_t* input_to_forget_weights_ptr,
+    float input_to_forget_weights_scale,
+    const int8_t* input_to_cell_weights_ptr, float input_to_cell_weights_scale,
+    const int8_t* input_to_output_weights_ptr,
+    float input_to_output_weights_scale,
+    const int8_t* recurrent_to_input_weights_ptr,
+    float recurrent_to_input_weights_scale,
+    const int8_t* recurrent_to_forget_weights_ptr,
+    float recurrent_to_forget_weights_scale,
+    const int8_t* recurrent_to_cell_weights_ptr,
+    float recurrent_to_cell_weights_scale,
+    const int8_t* recurrent_to_output_weights_ptr,
+    float recurrent_to_output_weights_scale,
+    const int8_t* cell_to_input_weights_ptr, float cell_to_input_weights_scale,
+    const int8_t* cell_to_forget_weights_ptr,
+    float cell_to_forget_weights_scale,
+    const int8_t* cell_to_output_weights_ptr,
+    float cell_to_output_weights_scale, const float* input_gate_bias_ptr,
+    const float* forget_gate_bias_ptr, const float* cell_bias_ptr,
+    const float* output_gate_bias_ptr, const int8_t* projection_weights_ptr,
+    float projection_weights_scale, const float* projection_bias_ptr,
+    const TfLiteLSTMParams* params, int n_batch, int n_cell, int n_input,
+    int n_output, float* input_gate_scratch, float* forget_gate_scratch,
+    float* cell_scratch, float* output_gate_scratch, float* scaling_factors,
+    float* product_scaling_factors, float* recovered_cell_weights,
+    int8_t* quantized_input_ptr_batch, int8_t* quantized_output_state_ptr,
+    int8_t* quantized_cell_state_ptr, float* output_state_ptr,
+    float* cell_state_ptr, float* output_ptr_batch);
+
 }  // namespace kernel_utils
 }  // namespace tflite
 #endif  // TENSORFLOW_CONTRIB_LITE_KERNELS_INTERNAL_KERNEL_UTILS_H_
diff --git a/tensorflow/contrib/lite/kernels/internal/log_quantized_test.cc b/tensorflow/contrib/lite/kernels/internal/log_quantized_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..7e9ff5242a43a8b54e0e6ae167cdcf7a341c918e
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/internal/log_quantized_test.cc
@@ -0,0 +1,333 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <algorithm>
+#include <cmath>
+#include <cstdlib>
+#include <functional>
+#include <iterator>
+#include <limits>
+#include <random>
+#include <sstream>
+#include <string>
+#include <vector>
+
+#define GEMMLOWP_ENABLE_FIXEDPOINT_CONSTANTS_CHECKS
+
+#include <gmock/gmock.h>
+#include <gtest/gtest.h>
+#include "tensorflow/contrib/lite/kernels/internal/optimized/optimized_ops.h"
+#include "tensorflow/contrib/lite/kernels/internal/reference/reference_ops.h"
+
+namespace {
+
+class NumberGenerator {
+ public:
+  std::vector<int> RandomIntVector(int n, int min_val, int max_val) {
+    std::vector<int> vec(n);
+    double scale = static_cast<double>(max_val + 1 - min_val) / engine_.max();
+    for (auto& it : vec) {
+      it = min_val + std::floor(engine_() * scale);
+    }
+    return vec;
+  }
+
+  std::mt19937 engine_;
+};
+
+class LogQuantizedTest : public ::testing::Test {
+ public:
+  NumberGenerator generator_;
+};
+
+// input_integer_bits <= 30.  output_integer_bits > 0.
+inline int32 LogPositiveValuesViaFloat(int32 input_val, int input_integer_bits,
+                                       int output_integer_bits) {
+  const double float_log_sum_of_exps = std::log(
+      static_cast<double>(input_val) * 0.5 / (1 << (30 - input_integer_bits)));
+  static constexpr double min_int =
+      static_cast<double>(std::numeric_limits<int32>::min());
+  static constexpr double max_int =
+      static_cast<double>(std::numeric_limits<int32>::max());
+  double double_result = tflite::TfLiteRound(float_log_sum_of_exps *
+                                             (1 << (31 - output_integer_bits)));
+  return static_cast<std::int32_t>(
+      std::min(max_int, std::max(min_int, double_result)));
+}
+
+void CheckOutputData(const std::vector<int32>& test_output,
+                     const std::vector<int32>& reference_output,
+                     const std::vector<int32>& test_input,
+                     const string& check_label, int input_integer_bits,
+                     int output_integer_bits, int tolerance) {
+  // In the special case of small input, specifically raw value of 5, a rounding
+  // up leads to difference in the output.  We do not aim to be accurate for
+  // very small input values, and there should be sufficient input fractional
+  // bits that this is a small input.
+  static constexpr double error_from_rounding_up = 0.0224585;
+  const int n = test_output.size();
+  ASSERT_EQ(n, reference_output.size());
+  for (int i = 0; i < n; ++i) {
+    // Adjust tolerance when input <= 5*2^-(31-input_integer_bits).
+    const int adjusted_tolerance =
+        test_input[i] > 5
+            ? tolerance
+            : std::max(tolerance, static_cast<int>(std::ceil(
+                                      error_from_rounding_up *
+                                      (1 << (31 - output_integer_bits)))));
+    ASSERT_LE(std::abs(test_output[i] - reference_output[i]),
+              adjusted_tolerance)
+        << "Failure in \"" << check_label << "\" at i=" << i
+        << ", test_input[i]=" << test_input[i] << "="
+        << static_cast<double>(test_input[i]) / (1 << (31 - input_integer_bits))
+        << ", test_output[i]=" << test_output[i] << "="
+        << static_cast<double>(test_output[i]) /
+               (1 << (31 - output_integer_bits))
+        << ", reference_output[i]=" << reference_output[i] << "="
+        << static_cast<double>(reference_output[i]) /
+               (1 << (31 - output_integer_bits))
+        << ", difference[i]=" << std::abs(reference_output[i] - test_output[i])
+        << "="
+        << static_cast<double>(std::abs(reference_output[i] - test_output[i])) /
+               (1 << (31 - output_integer_bits))
+        << "; tolerance=" << tolerance
+        << ", adj tolerance=" << adjusted_tolerance;
+  }
+}
+
+void RightShiftVector(const std::vector<int32>& shifts,
+                      std::vector<int32>* vec) {
+  const int n = vec->size();
+  ASSERT_EQ(n, shifts.size());
+  for (int i = 0; i < n; ++i) {
+    vec->at(i) = std::max(1, vec->at(i) >> shifts[i]);
+  }
+}
+
+template <int OutputIntegerBits, int InputIntegerBits>
+void RunSingleTest(const std::vector<int32>& test_input,
+                   const string& check_label, int tolerance) {
+  const int n = test_input.size();
+  std::vector<int32> float_gen_output(n, 0);
+  std::vector<int32> reference_output(n, 0);
+  std::vector<int32> optimized_output(n, 0);
+
+  // Workaround the stupid things that intelligent humans do.
+  // Consequence of __builtin_clz(0u) may equal 31 instead of 32.
+  std::vector<int32> fudged_input(n, 0);
+  for (int i = 0; i < n; ++i) {
+    fudged_input[i] = std::max(test_input[i], 2);
+  }
+
+  for (int i = 0; i < n; ++i) {
+    reference_output[i] =
+        tflite::reference_ops::log_x_for_x_greater_than_or_equal_to_1_impl<
+            OutputIntegerBits, InputIntegerBits>(
+            gemmlowp::FixedPoint<int32, InputIntegerBits>::FromRaw(
+                fudged_input[i]))
+            .raw();
+    optimized_output[i] =
+        tflite::optimized_ops::log_x_for_x_greater_than_or_equal_to_1_impl<
+            OutputIntegerBits, InputIntegerBits>(
+            gemmlowp::FixedPoint<int32, InputIntegerBits>::FromRaw(
+                fudged_input[i]))
+            .raw();
+    float_gen_output[i] = LogPositiveValuesViaFloat(
+        fudged_input[i], InputIntegerBits, OutputIntegerBits);
+  }
+  // Note that first check is intolerant.
+  {
+    std::ostringstream label;
+    label << check_label << " / optimized vs reference / InputIntegerBits="
+          << InputIntegerBits << ", OutputIntegerBits=" << OutputIntegerBits;
+    CheckOutputData(
+        optimized_output, reference_output, test_input, label.str(),
+        InputIntegerBits, OutputIntegerBits, 0);
+  }
+  {
+    std::ostringstream label;
+    label << check_label << " / reference vs float-gen / InputIntegerBits="
+          << InputIntegerBits << ", OutputIntegerBits=" << OutputIntegerBits;
+    CheckOutputData(
+        reference_output, float_gen_output, test_input, label.str(),
+        InputIntegerBits, OutputIntegerBits, tolerance);
+  }
+  {
+    std::ostringstream label;
+    label << check_label << " optimized vs float-gen / InputIntegerBits="
+          << InputIntegerBits << ", OutputIntegerBits=" << OutputIntegerBits;
+    CheckOutputData(
+        optimized_output, float_gen_output, test_input, label.str(),
+        InputIntegerBits, OutputIntegerBits, tolerance);
+  }
+}
+
+template <int OutputIntegerBits>
+void RunSingleTest(const std::vector<int32>& test_input, int input_integer_bits,
+                   const string& check_label, int tolerance) {
+#define INPUT_CASE(K)                                                   \
+  case K:                                                               \
+    return RunSingleTest<OutputIntegerBits, K>(test_input, check_label, \
+                                               tolerance)
+  switch (input_integer_bits) {
+    INPUT_CASE(0);
+    INPUT_CASE(1);
+    INPUT_CASE(2);
+    INPUT_CASE(3);
+    INPUT_CASE(4);
+    INPUT_CASE(5);
+    INPUT_CASE(6);
+    INPUT_CASE(7);
+    INPUT_CASE(8);
+    INPUT_CASE(9);
+    INPUT_CASE(10);
+    INPUT_CASE(11);
+    INPUT_CASE(12);
+    INPUT_CASE(13);
+    INPUT_CASE(14);
+    INPUT_CASE(15);
+    INPUT_CASE(16);
+    INPUT_CASE(17);
+    INPUT_CASE(18);
+    INPUT_CASE(19);
+    INPUT_CASE(20);
+    INPUT_CASE(21);
+    INPUT_CASE(22);
+    INPUT_CASE(23);
+    INPUT_CASE(24);
+    INPUT_CASE(25);
+    INPUT_CASE(26);
+    INPUT_CASE(27);
+    INPUT_CASE(28);
+    INPUT_CASE(29);
+    default:
+      ASSERT_LE(input_integer_bits, 30)
+                << "Input integer bits not handled: " << input_integer_bits;
+  }
+#undef INPUT_CASE
+}
+
+void RunSingleTest(const std::vector<int32>& test_input, int input_integer_bits,
+                   int output_integer_bits, const string& check_label,
+                   int tolerance) {
+#define OUTPUT_CASE(K)                                                   \
+  case K:                                                                \
+    return RunSingleTest<K>(test_input, input_integer_bits, check_label, \
+                            tolerance)
+  switch (output_integer_bits) {
+    OUTPUT_CASE(0);
+    OUTPUT_CASE(1);
+    OUTPUT_CASE(2);
+    OUTPUT_CASE(3);
+    OUTPUT_CASE(4);
+    OUTPUT_CASE(5);
+    OUTPUT_CASE(6);
+    OUTPUT_CASE(7);
+    OUTPUT_CASE(8);
+    OUTPUT_CASE(9);
+    OUTPUT_CASE(10);
+    OUTPUT_CASE(11);
+    OUTPUT_CASE(12);
+    OUTPUT_CASE(13);
+    OUTPUT_CASE(14);
+    OUTPUT_CASE(15);
+    OUTPUT_CASE(16);
+    OUTPUT_CASE(17);
+    OUTPUT_CASE(18);
+    OUTPUT_CASE(19);
+    OUTPUT_CASE(20);
+    OUTPUT_CASE(21);
+    OUTPUT_CASE(22);
+    OUTPUT_CASE(23);
+    OUTPUT_CASE(24);
+    OUTPUT_CASE(25);
+    OUTPUT_CASE(26);
+    OUTPUT_CASE(27);
+    OUTPUT_CASE(28);
+    OUTPUT_CASE(29);
+    default:
+      ASSERT_LE(input_integer_bits, 30)
+                << "Input integer bits not handled: " << input_integer_bits;
+  }
+#undef OUTPUT_CASE
+}
+
+void RunUniformTest(int test_size, int input_integer_bits,
+                    int output_integer_bits, const string& check_label,
+                    int tolerance, NumberGenerator* generator) {
+  std::vector<int> test_data = generator->RandomIntVector(
+      test_size, 2, std::numeric_limits<int>::max() - 1);
+  test_data[0] = 2;
+  test_data[1] = 3;
+  test_data[2] = 4;
+  test_data[3] = std::numeric_limits<int32>::max() - 2;
+  test_data[4] = std::numeric_limits<int32>::max() - 1;
+  test_data[5] = std::numeric_limits<int32>::max();
+
+  RunSingleTest(test_data, input_integer_bits, output_integer_bits,
+                check_label + " / uniform test", tolerance);
+}
+
+void RunUniformShiftUniformTest(int test_size, int input_integer_bits,
+                                int output_integer_bits,
+                                const string& check_label, int tolerance,
+                                NumberGenerator* generator) {
+  std::vector<int> test_data = generator->RandomIntVector(
+      test_size, 2, std::numeric_limits<int>::max() - 1);
+  std::vector<int> shifts = generator->RandomIntVector(test_size, 0, 29);
+  RightShiftVector(shifts, &test_data);
+
+  RunSingleTest(test_data, input_integer_bits, output_integer_bits,
+                check_label + " / shifted test", tolerance);
+}
+
+TEST_F(LogQuantizedTest, VariedIntegerBits) {
+  static constexpr int kVariations = 250;
+  static constexpr int kRunSize = 250;
+  static constexpr int kIntegerTolerance = 8;
+  static constexpr double kOutputFloatTolerance = 7.0e-7;
+
+  std::vector<int> input_integer_bits =
+      generator_.RandomIntVector(kVariations, 0, 24);
+  std::vector<int> output_integer_bits =
+      generator_.RandomIntVector(kVariations, 1, 10);
+
+  for (int i = 0; i < kVariations; ++i) {
+    int var_output_integer_bits = output_integer_bits[i];
+    int tolerance =
+        std::max(1.0 * kIntegerTolerance,
+                 (1 << (31 - var_output_integer_bits)) * kOutputFloatTolerance);
+
+    RunUniformTest(kRunSize, input_integer_bits[i], var_output_integer_bits,
+                   "VariedIntegerBits", tolerance, &generator_);
+    RunUniformShiftUniformTest(kRunSize, input_integer_bits[i],
+                               var_output_integer_bits, "VariedIntegerBits",
+                               tolerance, &generator_);
+  }
+}
+
+TEST_F(LogQuantizedTest, SelectedIntegerBits) {
+  static constexpr int kInputBits = 12;
+  static constexpr int kOutputBits = 5;
+  static constexpr int kRunSize = 100000;
+  static constexpr int kIntegerTolerance = 4;
+
+  RunUniformTest(kRunSize, kInputBits, kOutputBits, "SelectedIntegerBits",
+                 kIntegerTolerance, &generator_);
+  RunUniformShiftUniformTest(kRunSize, kInputBits, kOutputBits,
+                             "SelectedIntegerBits", kIntegerTolerance,
+                             &generator_);
+}
+
+}  // namespace
diff --git a/tensorflow/contrib/lite/kernels/internal/logsoftmax_quantized_test.cc b/tensorflow/contrib/lite/kernels/internal/logsoftmax_quantized_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..e786f785abe3aa66a9fb243dd4f332ca91676863
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/internal/logsoftmax_quantized_test.cc
@@ -0,0 +1,242 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <algorithm>
+#include <cmath>
+#include <cstdlib>
+#include <functional>
+#include <iterator>
+#include <limits>
+#include <random>
+#include <string>
+#include <vector>
+
+#include <gtest/gtest.h>
+#include "tensorflow/contrib/lite/kernels/internal/optimized/optimized_ops.h"
+#include "tensorflow/contrib/lite/kernels/internal/quantization_util.h"
+#include "tensorflow/contrib/lite/kernels/internal/reference/reference_ops.h"
+#include "tensorflow/contrib/lite/kernels/internal/test_util.h"
+
+namespace tflite {
+namespace {
+
+void RunLogSoftmaxFloatReference(const uint8* input_data,
+                                 const Dims<4>& dims_common, int32 input_offset,
+                                 const double input_scale, int stride,
+                                 float beta, uint8* reference_output_data) {
+  const int ref_buffer_size = RequiredBufferSizeForDims(dims_common);
+  std::vector<float> reference_dequant_data(ref_buffer_size);
+  std::vector<float> reference_output_float_data(ref_buffer_size);
+
+  // Reference data generated via Dequant of input into float, and then applying
+  // float LogSoftmax.
+  reference_ops::Dequantize(input_data, dims_common, input_offset, input_scale,
+                            reference_dequant_data.data(), dims_common);
+  optimized_ops::LogSoftmax(reference_dequant_data.data(), dims_common,
+                            reference_output_float_data.data(), dims_common);
+  // Work with quantized scaling for LogSoftmax, under which 255 represents 0,
+  // and -16 gets nudged up to 0.
+  for (int i = 0; i < ref_buffer_size; i++) {
+    reference_output_data[i] = std::max(
+        0, static_cast<int>(
+               255 + std::round(16.0f * reference_output_float_data[i])));
+  }
+}
+
+void CheckOutputData(const uint8* test_output, const uint8* reference_output,
+                     const Dims<4>& dims_common, const string& check_label,
+                     bool be_exacting) {
+  const int buffer_size = RequiredBufferSizeForDims(dims_common);
+  // While calculating some metrics in floating point, we work with quantized
+  // scaling.
+  std::vector<int> diff(buffer_size);
+  int64_t sum_diff = 0;
+  int64_t sum_abs_diff = 0;
+  for (int i = 0; i < buffer_size; i++) {
+    diff[i] = static_cast<int>(test_output[i]) - reference_output[i];
+    sum_diff += diff[i];
+    sum_abs_diff += std::abs(diff[i]);
+  }
+  // These stats help understand test failures.
+  std::sort(std::begin(diff), std::end(diff));
+  const int min_diff = diff.front();
+  const int max_diff = diff.back();
+  const int median_diff = diff[diff.size() / 2];
+  const float mean_diff = static_cast<float>(sum_diff) / buffer_size;
+  const float mean_abs_diff = static_cast<float>(sum_abs_diff) / buffer_size;
+  // We either check for bit exactness (against the reference quantized version)
+  // or for general accuracy, allowing off-by-one (against the float reference).
+  if (be_exacting) {
+    ASSERT_TRUE(std::abs(min_diff) == 0 && std::abs(max_diff) == 0)
+        << check_label << ": "
+        << "std::abs(min_diff)=" << std::abs(min_diff)
+        << ", std::abs(max_diff)=" << std::abs(max_diff);
+  } else {
+    // For small numbers of samples, the estimates of the means vary more.
+    // Rather than widen the tolerances, we skip the smaller tests.
+    ASSERT_TRUE(((std::abs(mean_diff) < 2e-2f && mean_abs_diff < 3e-2f) ||
+                 buffer_size < 10000) &&
+                std::abs(median_diff) == 0 && std::abs(min_diff) <= 1 &&
+                std::abs(max_diff) <= 1)
+        << check_label << ": "
+        << "buffer_size=" << buffer_size << ", mean_diff=" << mean_diff
+        << ", mean_abs_diff=" << mean_abs_diff
+        << ", median_diff=" << median_diff << ", min_diff=" << min_diff
+        << ", max_diff=" << max_diff;
+  }
+}
+
+// Runs the LogSoftmax and compares against the float reference implementation
+// and the quantized reference implementation.
+void RunOneLogSoftmaxTest(const uint8* input_data, const Dims<4>& dims_common,
+                          int32 input_offset, const double input_scale,
+                          int stride, float beta) {
+  const int buffer_size = RequiredBufferSizeForDims(dims_common);
+  std::vector<uint8> optimized_logsoftmax_output(buffer_size);
+  std::vector<uint8> reference_float_logsoftmax_output(buffer_size);
+  std::vector<uint8> reference_quant_logsoftmax_output(buffer_size);
+
+  RunLogSoftmaxFloatReference(input_data, dims_common, input_offset,
+                              input_scale, stride, beta,
+                              reference_float_logsoftmax_output.data());
+
+  int32 input_beta_multiplier;
+  int input_beta_left_shift;
+  int32 reverse_scaling_divisor;
+  int reverse_scaling_right_shift;
+  static const int kScaledDiffIntegerBits = 5;
+  tflite::PreprocessLogSoftmaxScalingExp(
+      beta, input_scale, kScaledDiffIntegerBits, &input_beta_multiplier,
+      &input_beta_left_shift, &reverse_scaling_divisor,
+      &reverse_scaling_right_shift);
+  reverse_scaling_right_shift *= -1;
+  // diff_min has a negative value, and is used to limit the maximum magnitude
+  // of the diffs, which are <= 0.
+  const int diff_min = -tflite::CalculateInputRadius(kScaledDiffIntegerBits,
+                                                     input_beta_left_shift);
+
+  optimized_ops::LogSoftmax(input_data, dims_common, input_beta_multiplier,
+                            input_beta_left_shift, reverse_scaling_divisor,
+                            reverse_scaling_right_shift, diff_min,
+                            optimized_logsoftmax_output.data(), dims_common);
+  reference_ops::LogSoftmax(
+      input_data, dims_common, input_beta_multiplier, input_beta_left_shift,
+      reverse_scaling_divisor, reverse_scaling_right_shift, diff_min,
+      reference_quant_logsoftmax_output.data(), dims_common);
+
+  CheckOutputData(optimized_logsoftmax_output.data(),
+                  reference_float_logsoftmax_output.data(), dims_common,
+                  "Optimized vs float reference", false);
+  CheckOutputData(optimized_logsoftmax_output.data(),
+                  reference_quant_logsoftmax_output.data(), dims_common,
+                  "Optimized vs quant reference", true);
+  CheckOutputData(reference_quant_logsoftmax_output.data(),
+                  reference_float_logsoftmax_output.data(), dims_common,
+                  "Quant reference vs float reference", false);
+}
+
+// This function picks some random LogSoftmax params, which are checked for
+// desirability.  If not acceptable, it returns false. If they're OK,
+// it runs the LogSoftmax test and returns true. This allows the caller
+// to loop until a test has been run.
+//
+// Currently we do not reject for any reason.
+bool TryOneUniformLogSoftmax() {
+  // We pick mostly positive values, on the whole emphasizing smaller values and
+  // therefore faster tests.  We test a wider range of depths.  In the case of
+  // LogSoftmax, the width and height really just create test repetitions.
+  const int batch = ExponentialRandomPositiveInt(0.9f, 3, 20);
+  const int input_depth = ExponentialRandomPositiveInt(0.75f, 175, 500);
+  const int input_width = ExponentialRandomPositiveInt(0.8f, 20, 200);
+  const int input_height = ExponentialRandomPositiveInt(0.8f, 20, 200);
+  const int stride = ExponentialRandomPositiveInt(0.9f, 3, 8);
+  const double input_scale = std::pow(10.0, UniformRandomFloat(-2.0, 1.0));
+  const int32 input_offset = UniformRandomInt(-256, 0);
+  static constexpr float beta = 1.0f;
+
+  Dims<4> dims_common =
+      MakeDimsForInference(input_depth, input_width, input_height, batch);
+  const int buffer_size = RequiredBufferSizeForDims(dims_common);
+
+  std::vector<uint8> input_data(buffer_size);
+  FillRandom(&input_data);
+  RunOneLogSoftmaxTest(input_data.data(), dims_common, input_offset,
+                       input_scale, stride, beta);
+  return true;
+}
+
+// See TryOneUniformLogSoftmax() for a general description.
+//
+// Tests with "skyscraper" input patterns are included for two reasons. (a)
+// Bimodal distributions are potentially challenging and perhaps more
+// realistic than simple uniform random inputs.  (b) Some implementations of
+// LogSoftmax may adapt as they traverse the depth, and so we test handling of
+// cases where relatively small values are encountered at the beginning and end.
+bool TryOneSkyscraperLogSoftmax(bool small_depth) {
+  // We pick mostly positive values, on the whole emphasizing smaller values and
+  // therefore faster tests.  We test a wider range of depths.  In the case of
+  // LogSoftmax, the width and height really just create test repetitions.
+  const int batch = ExponentialRandomPositiveInt(0.9f, 3, 20);
+  const int input_depth = small_depth
+                              ? ExponentialRandomPositiveInt(0.75f, 40, 500)
+                              : ExponentialRandomPositiveInt(0.75f, 175, 500);
+  const int input_width = ExponentialRandomPositiveInt(0.7f, 20, 200);
+  const int input_height = ExponentialRandomPositiveInt(0.7f, 20, 200);
+  const int stride = ExponentialRandomPositiveInt(0.9f, 3, 8);
+  const double input_scale = std::pow(10.0, UniformRandomFloat(-2.0, 1.0));
+  const int32 input_offset = UniformRandomInt(-256, 0);
+  static constexpr float beta = 1.0f;
+  // Extra parameters for skyscraper input patterns.
+  const double middle_proportion =
+      ExponentialRandomPositiveFloat(0.65f, 0.1, 1.0);
+  const int middle_min = UniformRandomInt(0, 255);
+  const int sides_max = UniformRandomInt(0, middle_min);
+
+  Dims<4> dims_common =
+      MakeDimsForInference(input_depth, input_width, input_height, batch);
+  const int buffer_size = RequiredBufferSizeForDims(dims_common);
+
+  std::vector<uint8> input_data(buffer_size);
+  FillRandomSkyscraper(&input_data, input_depth, middle_proportion, middle_min,
+                       sides_max);
+  RunOneLogSoftmaxTest(input_data.data(), dims_common, input_offset,
+                       input_scale, stride, beta);
+  return true;
+}
+
+TEST(TestQuantizedLogSoftmax, UniformLogSoftmaxTests) {
+  const int kTestsToRun = 1000;
+  for (int i = 0; i < kTestsToRun; i++) {
+    while (!TryOneUniformLogSoftmax()) {
+    }
+  }
+}
+
+TEST(TestQuantizedLogSoftmax, SkyscraperLogSoftmaxTests) {
+  const int kTestsToRun = 1000;
+  for (int i = 0; i < kTestsToRun; i++) {
+    while (!TryOneSkyscraperLogSoftmax(false)) {
+    }
+  }
+}
+
+TEST(TestQuantizedLogSoftmax, SmallSkyscraperLogSoftmaxTests) {
+  const int kTestsToRun = 1000;
+  for (int i = 0; i < kTestsToRun; i++) {
+    while (!TryOneSkyscraperLogSoftmax(true)) {
+    }
+  }
+}
+}  // namespace
+}  // namespace tflite
diff --git a/tensorflow/contrib/lite/kernels/internal/optimized/depthwiseconv_uint8.h b/tensorflow/contrib/lite/kernels/internal/optimized/depthwiseconv_uint8.h
index dd6932ffe7b7a6f1101f146ce6472b0df4cbda3b..3fd00c89308d3b163111fda004287c715259352f 100644
--- a/tensorflow/contrib/lite/kernels/internal/optimized/depthwiseconv_uint8.h
+++ b/tensorflow/contrib/lite/kernels/internal/optimized/depthwiseconv_uint8.h
@@ -1691,14 +1691,20 @@ inline void DepthwiseConv(const uint8* input_data, const Dims<4>& input_dims,
   const int filter_width = ArraySize(filter_dims, 1);
   const int output_height = ArraySize(output_dims, 2);
   const int output_width = ArraySize(output_dims, 1);
+#ifdef USE_NEON
+  const bool shift_left = (output_shift <= 0);
+  const int32 multiplier_power_of_two = shift_left ? (1 << -output_shift) : 1;
+#endif
   TFLITE_DCHECK(output_depth == input_depth * depth_multiplier);
 
-#ifdef __aarch64__
+// Enable for arm64 except for the Nvidia Linux 4 Tegra (L4T) running on
+// Jetson TX-2. This compiler does not support the offsetof() macro.
+#if defined(__aarch64__) && !defined(GOOGLE_L4T)
   // Call kernel optimized for depthwise convolutions using 3x3 filters if
   // parameters are supported.
-  if (Fast3x3FilterKernelSupported(input_dims, filter_dims, stride_width,
-                                   stride_height, pad_width, pad_height,
-                                   depth_multiplier, output_dims)) {
+  if (Fast3x3FilterKernelSupported(
+          input_dims, filter_dims, stride_width, stride_height, pad_width,
+          pad_height, depth_multiplier, output_dims, output_shift)) {
     DepthwiseConv3x3Filter(input_data, input_dims, input_offset, filter_data,
                            filter_dims, filter_offset, bias_data, bias_dims,
                            stride_width, stride_height, pad_width, pad_height,
@@ -1833,12 +1839,20 @@ inline void DepthwiseConv(const uint8* input_data, const Dims<4>& input_dims,
             acc[j] = vld1q_s32(acc_buffer + i + 4 * j);
           }
 
-          // Fixed-point multiplication.
-          for (int j = 0; j < 4; j++) {
-            acc[j] = vqrdmulhq_n_s32(acc[j], output_multiplier);
-          }
-          for (int j = 0; j < 4; j++) {
-            acc[j] = RoundingDivideByPOT(acc[j], output_shift);
+          if (!shift_left) {
+            // Fixed-point multiplication.
+            for (int j = 0; j < 4; j++) {
+              acc[j] = vqrdmulhq_n_s32(acc[j], output_multiplier);
+            }
+            for (int j = 0; j < 4; j++) {
+              acc[j] = RoundingDivideByPOT(acc[j], output_shift);
+            }
+          } else {
+            // Fixed-point multiplication.
+            for (int j = 0; j < 4; j++) {
+              acc[j] = vmulq_n_s32(acc[j], multiplier_power_of_two);
+              acc[j] = vqrdmulhq_n_s32(acc[j], output_multiplier);
+            }
           }
           // Add the output offset.
           for (int j = 0; j < 4; j++) {
@@ -1870,12 +1884,21 @@ inline void DepthwiseConv(const uint8* input_data, const Dims<4>& input_dims,
         for (; i <= num_output_values - 8; i += 8) {
           int32x4_t acc0 = vld1q_s32(acc_buffer + i);
           int32x4_t acc1 = vld1q_s32(acc_buffer + i + 4);
-          // Fixed-point multiplication.
-          acc0 = vqrdmulhq_n_s32(acc0, output_multiplier);
-          acc1 = vqrdmulhq_n_s32(acc1, output_multiplier);
-          // Rounding right shift.
-          acc0 = RoundingDivideByPOT(acc0, output_shift);
-          acc1 = RoundingDivideByPOT(acc1, output_shift);
+          if (!shift_left) {
+            // Fixed-point multiplication.
+            acc0 = vqrdmulhq_n_s32(acc0, output_multiplier);
+            acc1 = vqrdmulhq_n_s32(acc1, output_multiplier);
+            // Rounding right shift.
+            acc0 = RoundingDivideByPOT(acc0, output_shift);
+            acc1 = RoundingDivideByPOT(acc1, output_shift);
+          } else {
+            // Fixed-point multiplication.
+            acc0 = vmulq_n_s32(acc0, multiplier_power_of_two);
+            acc0 = vqrdmulhq_n_s32(acc0, output_multiplier);
+
+            acc1 = vmulq_n_s32(acc1, multiplier_power_of_two);
+            acc1 = vqrdmulhq_n_s32(acc1, output_multiplier);
+          }
           // Add the output offset.
           acc0 = vaddq_s32(acc0, output_offset_vec);
           acc1 = vaddq_s32(acc1, output_offset_vec);
@@ -1899,10 +1922,16 @@ inline void DepthwiseConv(const uint8* input_data, const Dims<4>& input_dims,
         // that will have to go through the very slow scalar code.
         for (; i <= num_output_values - 4; i += 4) {
           int32x4_t acc = vld1q_s32(acc_buffer + i);
-          // Fixed-point multiplication.
-          acc = vqrdmulhq_n_s32(acc, output_multiplier);
-          // Rounding right shift.
-          acc = RoundingDivideByPOT(acc, output_shift);
+          if (!shift_left) {
+            // Fixed-point multiplication.
+            acc = vqrdmulhq_n_s32(acc, output_multiplier);
+            // Rounding right shift.
+            acc = RoundingDivideByPOT(acc, output_shift);
+          } else {
+            // Fixed-point multiplication.
+            acc = vmulq_n_s32(acc, multiplier_power_of_two);
+            acc = vqrdmulhq_n_s32(acc, output_multiplier);
+          }
           // Add the output offset.
           acc = vaddq_s32(acc, output_offset_vec);
           // Apply the activation function.
@@ -1923,8 +1952,8 @@ inline void DepthwiseConv(const uint8* input_data, const Dims<4>& input_dims,
         // Handle leftover values, one by one. This is very slow.
         for (; i < num_output_values; i++) {
           int32 acc = acc_buffer[i];
-          acc = MultiplyByQuantizedMultiplierSmallerThanOne(
-              acc, output_multiplier, output_shift);
+          acc = MultiplyByQuantizedMultiplier(acc, output_multiplier,
+                                              -output_shift);
           acc += output_offset;
           acc = std::max(acc, output_activation_min);
           acc = std::min(acc, output_activation_max);
diff --git a/tensorflow/contrib/lite/kernels/internal/optimized/depthwiseconv_uint8_3x3_filter.h b/tensorflow/contrib/lite/kernels/internal/optimized/depthwiseconv_uint8_3x3_filter.h
index 55e0d5c3aa9ebb8b46403550e190b00a54cb53e5..a7b0d805a3acd35b592a35ba4266dfff4eb992cd 100644
--- a/tensorflow/contrib/lite/kernels/internal/optimized/depthwiseconv_uint8_3x3_filter.h
+++ b/tensorflow/contrib/lite/kernels/internal/optimized/depthwiseconv_uint8_3x3_filter.h
@@ -23,3848 +23,2912 @@ limitations under the License.
 namespace tflite {
 namespace optimized_ops {
 
-#ifdef __aarch64__
-
-inline void preload_l1_keep(const uint8* ptr) {
-#ifdef GEMMLOWP_ARM_64
-  asm volatile("prfm pldl1keep, [%[ptr]]\n" ::[ptr] "r"(ptr) :);
-#else
-  gemmlowp::Prefetch(ptr);
-#endif
-}
-
-// Implementation of quantized DepthwiseConv for 3x3 filters.
-
-// Below are helper structs to remove the use of arrays.
-// There is an llvm bug that causes significant slowdown when using arrays for
-// NEON intrinsics vector data types.
-// See: https://bugs.llvm.org/show_bug.cgi?id=34945
-
-struct Int32x8 {
-  int32x4_t low, high;
-};
-
-struct Filter3x3x8 {
-  int16x8_t f0, f1, f2, f3, f4, f5, f6, f7, f8;
-};
-
-// Loads 3x3 filter of depth 8 and adds filter offsets.
-inline Filter3x3x8 Load3x3Filter(const uint8* filter_ptr, int32 filter_offset,
-                                 int output_depth) {
-  Filter3x3x8 filter;
-
-  uint8x8_t temp_u8_0, temp_u8_1, temp_u8_2, temp_u8_3, temp_u8_4, temp_u8_5,
-      temp_u8_6, temp_u8_7, temp_u8_8;
-  int16x8_t filter_offset_vec = vdupq_n_s16(filter_offset);
-
-  temp_u8_0 = vld1_u8(filter_ptr + 0 * output_depth);
-  temp_u8_1 = vld1_u8(filter_ptr + 1 * output_depth);
-  temp_u8_2 = vld1_u8(filter_ptr + 2 * output_depth);
-  temp_u8_3 = vld1_u8(filter_ptr + 3 * output_depth);
-  temp_u8_4 = vld1_u8(filter_ptr + 4 * output_depth);
-  temp_u8_5 = vld1_u8(filter_ptr + 5 * output_depth);
-  temp_u8_6 = vld1_u8(filter_ptr + 6 * output_depth);
-  temp_u8_7 = vld1_u8(filter_ptr + 7 * output_depth);
-  temp_u8_8 = vld1_u8(filter_ptr + 8 * output_depth);
-
-  filter.f0 = vreinterpretq_s16_u16(vmovl_u8(temp_u8_0));
-  filter.f1 = vreinterpretq_s16_u16(vmovl_u8(temp_u8_1));
-  filter.f2 = vreinterpretq_s16_u16(vmovl_u8(temp_u8_2));
-  filter.f3 = vreinterpretq_s16_u16(vmovl_u8(temp_u8_3));
-  filter.f4 = vreinterpretq_s16_u16(vmovl_u8(temp_u8_4));
-  filter.f5 = vreinterpretq_s16_u16(vmovl_u8(temp_u8_5));
-  filter.f6 = vreinterpretq_s16_u16(vmovl_u8(temp_u8_6));
-  filter.f7 = vreinterpretq_s16_u16(vmovl_u8(temp_u8_7));
-  filter.f8 = vreinterpretq_s16_u16(vmovl_u8(temp_u8_8));
-
-  filter.f0 = vaddq_s16(filter.f0, filter_offset_vec);
-  filter.f1 = vaddq_s16(filter.f1, filter_offset_vec);
-  filter.f2 = vaddq_s16(filter.f2, filter_offset_vec);
-  filter.f3 = vaddq_s16(filter.f3, filter_offset_vec);
-  filter.f4 = vaddq_s16(filter.f4, filter_offset_vec);
-  filter.f5 = vaddq_s16(filter.f5, filter_offset_vec);
-  filter.f6 = vaddq_s16(filter.f6, filter_offset_vec);
-  filter.f7 = vaddq_s16(filter.f7, filter_offset_vec);
-  filter.f8 = vaddq_s16(filter.f8, filter_offset_vec);
-
-  return filter;
-}
-
-// Applies activation, offset and downquantize on a set of accumulator
-// registers that correspond to a 2x2 output of depth 8.
-// Stores results to output.
-inline void DownquantizeAndStore2x2Output(
-    Int32x8 acc_0, Int32x8 acc_1, Int32x8 acc_2, Int32x8 acc_3,
-    int32 output_offset, int32 output_multiplier, int output_shift,
-    int32 output_activation_min, int32 output_activation_max, uint8* output_ptr,
-    int output_depth, int output_width) {
-  using gemmlowp::RoundingDivideByPOT;
-  const int32x4_t output_offset_vec = vdupq_n_s32(output_offset);
-  const int32x4_t output_activation_min_vec =
-      vdupq_n_s32(output_activation_min);
-  const int32x4_t output_activation_max_vec =
-      vdupq_n_s32(output_activation_max);
-
-  // Fixed-point multiplication.
-  acc_0.low = vqrdmulhq_n_s32(acc_0.low, output_multiplier);
-  acc_0.high = vqrdmulhq_n_s32(acc_0.high, output_multiplier);
-  acc_1.low = vqrdmulhq_n_s32(acc_1.low, output_multiplier);
-  acc_1.high = vqrdmulhq_n_s32(acc_1.high, output_multiplier);
-  acc_2.low = vqrdmulhq_n_s32(acc_2.low, output_multiplier);
-  acc_2.high = vqrdmulhq_n_s32(acc_2.high, output_multiplier);
-  acc_3.low = vqrdmulhq_n_s32(acc_3.low, output_multiplier);
-  acc_3.high = vqrdmulhq_n_s32(acc_3.high, output_multiplier);
-
-  acc_0.low = RoundingDivideByPOT(acc_0.low, output_shift);
-  acc_0.high = RoundingDivideByPOT(acc_0.high, output_shift);
-  acc_1.low = RoundingDivideByPOT(acc_1.low, output_shift);
-  acc_1.high = RoundingDivideByPOT(acc_1.high, output_shift);
-  acc_2.low = RoundingDivideByPOT(acc_2.low, output_shift);
-  acc_2.high = RoundingDivideByPOT(acc_2.high, output_shift);
-  acc_3.low = RoundingDivideByPOT(acc_3.low, output_shift);
-  acc_3.high = RoundingDivideByPOT(acc_3.high, output_shift);
-
-  // Add the output offset.
-  acc_0.low = vaddq_s32(acc_0.low, output_offset_vec);
-  acc_0.high = vaddq_s32(acc_0.high, output_offset_vec);
-  acc_1.low = vaddq_s32(acc_1.low, output_offset_vec);
-  acc_1.high = vaddq_s32(acc_1.high, output_offset_vec);
-  acc_2.low = vaddq_s32(acc_2.low, output_offset_vec);
-  acc_2.high = vaddq_s32(acc_2.high, output_offset_vec);
-  acc_3.low = vaddq_s32(acc_3.low, output_offset_vec);
-  acc_3.high = vaddq_s32(acc_3.high, output_offset_vec);
-
-  // Apply the activation function.
-  acc_0.low = vmaxq_s32(acc_0.low, output_activation_min_vec);
-  acc_0.high = vmaxq_s32(acc_0.high, output_activation_min_vec);
-  acc_1.low = vmaxq_s32(acc_1.low, output_activation_min_vec);
-  acc_1.high = vmaxq_s32(acc_1.high, output_activation_min_vec);
-  acc_2.low = vmaxq_s32(acc_2.low, output_activation_min_vec);
-  acc_2.high = vmaxq_s32(acc_2.high, output_activation_min_vec);
-  acc_3.low = vmaxq_s32(acc_3.low, output_activation_min_vec);
-  acc_3.high = vmaxq_s32(acc_3.high, output_activation_min_vec);
-
-  acc_0.low = vminq_s32(acc_0.low, output_activation_max_vec);
-  acc_0.high = vminq_s32(acc_0.high, output_activation_max_vec);
-  acc_1.low = vminq_s32(acc_1.low, output_activation_max_vec);
-  acc_1.high = vminq_s32(acc_1.high, output_activation_max_vec);
-  acc_2.low = vminq_s32(acc_2.low, output_activation_max_vec);
-  acc_2.high = vminq_s32(acc_2.high, output_activation_max_vec);
-  acc_3.low = vminq_s32(acc_3.low, output_activation_max_vec);
-  acc_3.high = vminq_s32(acc_3.high, output_activation_max_vec);
-
-  // Saturating cast to uint8 and store to destination.
-  int16x4_t acc_0_low_s16 = vqmovn_s32(acc_0.low);
-  int16x4_t acc_0_high_s16 = vqmovn_s32(acc_0.high);
-  int16x4_t acc_1_low_s16 = vqmovn_s32(acc_1.low);
-  int16x4_t acc_1_high_s16 = vqmovn_s32(acc_1.high);
-  int16x4_t acc_2_low_s16 = vqmovn_s32(acc_2.low);
-  int16x4_t acc_2_high_s16 = vqmovn_s32(acc_2.high);
-  int16x4_t acc_3_low_s16 = vqmovn_s32(acc_3.low);
-  int16x4_t acc_3_high_s16 = vqmovn_s32(acc_3.high);
-
-  int16x8_t res_0_s16 = vcombine_s16(acc_0_low_s16, acc_0_high_s16);
-  int16x8_t res_1_s16 = vcombine_s16(acc_1_low_s16, acc_1_high_s16);
-  int16x8_t res_2_s16 = vcombine_s16(acc_2_low_s16, acc_2_high_s16);
-  int16x8_t res_3_s16 = vcombine_s16(acc_3_low_s16, acc_3_high_s16);
-
-  uint8x8_t res_0_u8 = vqmovun_s16(res_0_s16);
-  uint8x8_t res_1_u8 = vqmovun_s16(res_1_s16);
-  uint8x8_t res_2_u8 = vqmovun_s16(res_2_s16);
-  uint8x8_t res_3_u8 = vqmovun_s16(res_3_s16);
-
-  vst1_u8(output_ptr, res_0_u8);
-  vst1_u8(output_ptr + output_depth, res_1_u8);
-  vst1_u8(output_ptr + output_depth * output_width, res_2_u8);
-  vst1_u8(output_ptr + output_depth * output_width + output_depth, res_3_u8);
-}
-
-inline void DownquantizeAndStore(Int32x8 acc, int32 output_offset,
-                                 int32 output_multiplier, int output_shift,
-                                 int32 output_activation_min,
-                                 int32 output_activation_max,
-                                 uint8* output_ptr) {
-  using gemmlowp::RoundingDivideByPOT;
-  const int32x4_t output_offset_vec = vdupq_n_s32(output_offset);
-  const int32x4_t output_activation_min_vec =
-      vdupq_n_s32(output_activation_min);
-  const int32x4_t output_activation_max_vec =
-      vdupq_n_s32(output_activation_max);
-
-  acc.low = vqrdmulhq_n_s32(acc.low, output_multiplier);
-  acc.high = vqrdmulhq_n_s32(acc.high, output_multiplier);
-
-  acc.low = RoundingDivideByPOT(acc.low, output_shift);
-  acc.high = RoundingDivideByPOT(acc.high, output_shift);
-
-  acc.low = vaddq_s32(acc.low, output_offset_vec);
-  acc.high = vaddq_s32(acc.high, output_offset_vec);
-
-  acc.low = vmaxq_s32(acc.low, output_activation_min_vec);
-  acc.high = vmaxq_s32(acc.high, output_activation_min_vec);
-
-  acc.low = vminq_s32(acc.low, output_activation_max_vec);
-  acc.high = vminq_s32(acc.high, output_activation_max_vec);
-
-  int16x4_t acc_low_s16 = vqmovn_s32(acc.low);
-  int16x4_t acc_high_s16 = vqmovn_s32(acc.high);
-
-  int16x8_t res_s16 = vcombine_s16(acc_low_s16, acc_high_s16);
-  uint8x8_t res_u8 = vqmovun_s16(res_s16);
-  vst1_u8(output_ptr, res_u8);
-}
-
-inline void DownquantizeAndStore2Output(
-    Int32x8 acc_0, Int32x8 acc_1, int32 output_offset, int32 output_multiplier,
-    int output_shift, int32 output_activation_min, int32 output_activation_max,
-    uint8* output_ptr, int output_ptr_offset) {
-  {
-    using gemmlowp::RoundingDivideByPOT;
-    const int32x4_t output_offset_vec = vdupq_n_s32(output_offset);
-    const int32x4_t output_activation_min_vec =
-        vdupq_n_s32(output_activation_min);
-    const int32x4_t output_activation_max_vec =
-        vdupq_n_s32(output_activation_max);
-
-    // Fixed-point multiplication.
-    acc_0.low = vqrdmulhq_n_s32(acc_0.low, output_multiplier);
-    acc_0.high = vqrdmulhq_n_s32(acc_0.high, output_multiplier);
-    acc_1.low = vqrdmulhq_n_s32(acc_1.low, output_multiplier);
-    acc_1.high = vqrdmulhq_n_s32(acc_1.high, output_multiplier);
-
-    acc_0.low = RoundingDivideByPOT(acc_0.low, output_shift);
-    acc_0.high = RoundingDivideByPOT(acc_0.high, output_shift);
-    acc_1.low = RoundingDivideByPOT(acc_1.low, output_shift);
-    acc_1.high = RoundingDivideByPOT(acc_1.high, output_shift);
-
-    // Add the output offset.
-    acc_0.low = vaddq_s32(acc_0.low, output_offset_vec);
-    acc_0.high = vaddq_s32(acc_0.high, output_offset_vec);
-    acc_1.low = vaddq_s32(acc_1.low, output_offset_vec);
-    acc_1.high = vaddq_s32(acc_1.high, output_offset_vec);
-
-    // Apply the activation function.
-    acc_0.low = vmaxq_s32(acc_0.low, output_activation_min_vec);
-    acc_0.high = vmaxq_s32(acc_0.high, output_activation_min_vec);
-    acc_1.low = vmaxq_s32(acc_1.low, output_activation_min_vec);
-    acc_1.high = vmaxq_s32(acc_1.high, output_activation_min_vec);
-
-    acc_0.low = vminq_s32(acc_0.low, output_activation_max_vec);
-    acc_0.high = vminq_s32(acc_0.high, output_activation_max_vec);
-    acc_1.low = vminq_s32(acc_1.low, output_activation_max_vec);
-    acc_1.high = vminq_s32(acc_1.high, output_activation_max_vec);
-  }
-
-  // Saturating cast to uint8 and store to destination.
-  int16x8_t res_0_s16;
-  {
-    int16x4_t acc_0_low_s16 = vqmovn_s32(acc_0.low);
-    int16x4_t acc_0_high_s16 = vqmovn_s32(acc_0.high);
-    res_0_s16 = vcombine_s16(acc_0_low_s16, acc_0_high_s16);
-  }
-
-  int16x8_t res_1_s16;
-  {
-    int16x4_t acc_1_low_s16 = vqmovn_s32(acc_1.low);
-    int16x4_t acc_1_high_s16 = vqmovn_s32(acc_1.high);
-    res_1_s16 = vcombine_s16(acc_1_low_s16, acc_1_high_s16);
-  }
-
-  uint8x8_t res_0_u8 = vqmovun_s16(res_0_s16);
-  uint8x8_t res_1_u8 = vqmovun_s16(res_1_s16);
-  vst1_u8(output_ptr, res_0_u8);
-  vst1_u8(output_ptr + output_ptr_offset, res_1_u8);
-}
-
-// Performs multiply accumulate on 3 inputs of depth 8.
-inline Int32x8 MultiplyAccumulateRow(Int32x8 accum, int16x8_t f0, int16x8_t f1,
-                                     int16x8_t f2, int16x8_t i0, int16x8_t i1,
-                                     int16x8_t i2) {
-  accum.low = vmlal_s16(accum.low, vget_low_s16(f0), vget_low_s16(i0));
-  accum.high = vmlal_s16(accum.high, vget_high_s16(f0), vget_high_s16(i0));
-  accum.low = vmlal_s16(accum.low, vget_low_s16(f1), vget_low_s16(i1));
-  accum.high = vmlal_s16(accum.high, vget_high_s16(f1), vget_high_s16(i1));
-  accum.low = vmlal_s16(accum.low, vget_low_s16(f2), vget_low_s16(i2));
-  accum.high = vmlal_s16(accum.high, vget_high_s16(f2), vget_high_s16(i2));
-  return accum;
-}
-
-// Performs multiply accumulate on 3 inputs of depth 8.
-inline Int32x8 MultiplyAccumulate3x3Filter(const Filter3x3x8& f, int16x8_t i0,
-                                           int16x8_t i1, int16x8_t i2,
-                                           int16x8_t i3, int16x8_t i4,
-                                           int16x8_t i5, int16x8_t i6,
-                                           int16x8_t i7, int16x8_t i8,
-                                           Int32x8 accum) {
-  accum.low = vmlal_s16(accum.low, vget_low_s16(f.f0), vget_low_s16(i0));
-  accum.high = vmlal_s16(accum.high, vget_high_s16(f.f0), vget_high_s16(i0));
-  accum.low = vmlal_s16(accum.low, vget_low_s16(f.f1), vget_low_s16(i1));
-  accum.high = vmlal_s16(accum.high, vget_high_s16(f.f1), vget_high_s16(i1));
-  accum.low = vmlal_s16(accum.low, vget_low_s16(f.f2), vget_low_s16(i2));
-  accum.high = vmlal_s16(accum.high, vget_high_s16(f.f2), vget_high_s16(i2));
-  accum.low = vmlal_s16(accum.low, vget_low_s16(f.f3), vget_low_s16(i3));
-  accum.high = vmlal_s16(accum.high, vget_high_s16(f.f3), vget_high_s16(i3));
-  accum.low = vmlal_s16(accum.low, vget_low_s16(f.f4), vget_low_s16(i4));
-  accum.high = vmlal_s16(accum.high, vget_high_s16(f.f4), vget_high_s16(i4));
-  accum.low = vmlal_s16(accum.low, vget_low_s16(f.f5), vget_low_s16(i5));
-  accum.high = vmlal_s16(accum.high, vget_high_s16(f.f5), vget_high_s16(i5));
-  accum.low = vmlal_s16(accum.low, vget_low_s16(f.f6), vget_low_s16(i6));
-  accum.high = vmlal_s16(accum.high, vget_high_s16(f.f6), vget_high_s16(i6));
-  accum.low = vmlal_s16(accum.low, vget_low_s16(f.f7), vget_low_s16(i7));
-  accum.high = vmlal_s16(accum.high, vget_high_s16(f.f7), vget_high_s16(i7));
-  accum.low = vmlal_s16(accum.low, vget_low_s16(f.f8), vget_low_s16(i8));
-  accum.high = vmlal_s16(accum.high, vget_high_s16(f.f8), vget_high_s16(i8));
-  return accum;
-}
-
-inline void DotProductAndStore(const Filter3x3x8& filter, int16x8_t i0,
-                               int16x8_t i1, int16x8_t i2, int16x8_t i3,
-                               int16x8_t i4, int16x8_t i5, int16x8_t i6,
-                               int16x8_t i7, int16x8_t i8,
-                               const int32* bias_ptr, int32 output_offset,
-                               int32 output_multiplier, int output_shift,
-                               int32 output_activation_min,
-                               int32 output_activation_max, uint8* output_ptr) {
-  Int32x8 acc;
-  acc.low = vld1q_s32(bias_ptr);
-  acc.high = vld1q_s32(bias_ptr + 4);
-
-  acc = MultiplyAccumulate3x3Filter(filter, i0, i1, i2, i3, i4, i5, i6, i7, i8,
-                                    acc);
-
-  DownquantizeAndStore(acc, output_offset, output_multiplier, output_shift,
-                       output_activation_min, output_activation_max,
-                       output_ptr);
-}
-
-// Performs multiply-accumulate on a 3x4 input for 2 horizontal outputs.
-inline void DotProductAndStore2xStride1(
-    const Filter3x3x8& filter, int16x8_t i0, int16x8_t i1, int16x8_t i2,
-    int16x8_t i3, int16x8_t i4, int16x8_t i5, int16x8_t i6, int16x8_t i7,
-    int16x8_t i8, int16x8_t i9, int16x8_t i10, int16x8_t i11,
-    const int32* bias_ptr, int32 output_offset, int32 output_multiplier,
-    int output_shift, int32 output_activation_min, int32 output_activation_max,
-    uint8* output_ptr, int output_ptr_offset) {
-  Int32x8 acc_0, acc_1;
-  acc_0.low = vld1q_s32(bias_ptr);
-  acc_1.low = vld1q_s32(bias_ptr);
-  acc_0.high = vld1q_s32(bias_ptr + 4);
-  acc_1.high = vld1q_s32(bias_ptr + 4);
-
-  acc_0 = MultiplyAccumulate3x3Filter(filter, i0, i1, i2, i4, i5, i6, i8, i9,
-                                      i10, acc_0);
-  acc_1 = MultiplyAccumulate3x3Filter(filter, i1, i2, i3, i5, i6, i7, i9, i10,
-                                      i11, acc_1);
-  DownquantizeAndStore2Output(acc_0, acc_1, output_offset, output_multiplier,
-                              output_shift, output_activation_min,
-                              output_activation_max, output_ptr,
-                              output_ptr_offset);
-}
-
-// Performs multiply-accumulate on a 4x3 input for 2 vertical outputs.
-inline void DotProductAndStore2yStride1(
-    const Filter3x3x8& filter, int16x8_t i0, int16x8_t i1, int16x8_t i2,
-    int16x8_t i3, int16x8_t i4, int16x8_t i5, int16x8_t i6, int16x8_t i7,
-    int16x8_t i8, int16x8_t i9, int16x8_t i10, int16x8_t i11,
-    const int32* bias_ptr, int32 output_offset, int32 output_multiplier,
-    int output_shift, int32 output_activation_min, int32 output_activation_max,
-    uint8* output_ptr, int output_ptr_offset) {
-  Int32x8 acc_0, acc_1;
-  acc_0.low = vld1q_s32(bias_ptr);
-  acc_1.low = vld1q_s32(bias_ptr);
-  acc_0.high = vld1q_s32(bias_ptr + 4);
-  acc_1.high = vld1q_s32(bias_ptr + 4);
-
-  acc_0 = MultiplyAccumulate3x3Filter(filter, i0, i1, i2, i3, i4, i5, i6, i7,
-                                      i8, acc_0);
-  acc_1 = MultiplyAccumulate3x3Filter(filter, i3, i4, i5, i6, i7, i8, i9, i10,
-                                      i11, acc_1);
-  DownquantizeAndStore2Output(acc_0, acc_1, output_offset, output_multiplier,
-                              output_shift, output_activation_min,
-                              output_activation_max, output_ptr,
-                              output_ptr_offset);
-}
-
-// A kernel that is optimized on the number of output cells in the x and y
-// direction, and the stride. Assumes 3x3 filters of 8 depth.
-template <int kFixedOutputY, int kFixedOutputX, int kFixedStrideWidth,
-          int kFixedStrideHeight>
-struct ConvKernel3x3FilterDepth8 {};
-
-template <>
-struct ConvKernel3x3FilterDepth8<8, 8, 1, 1> {
-  static inline void Run(const uint8* input_ptr, int input_depth,
-                         int32 input_offset, int input_row_size,
-                         const uint8* filter_ptr, int32 filter_offset,
-                         const int32* bias_ptr, int32 output_offset,
-                         int32 output_multiplier, int output_shift,
-                         int32 output_activation_min,
-                         int32 output_activation_max, uint8* output_ptr,
-                         int output_depth, int output_width) {
-    Filter3x3x8 filter = Load3x3Filter(filter_ptr, filter_offset, output_depth);
-
-    const int16x8_t input_offset_vec = vdupq_n_s16(input_offset);
-    const int output_row_size = output_depth * output_width;
-
-    // To process 8x8 outputs using a 3x3 filter, we require 10x10 inputs.
-    // Load inputs for the first 2 filters on the top left, then slide to
-    // the right, down, left, down, right, etc. in a snake-like path. This
-    // minimizes the total number of loads.
-    //
-    //        INPUT                          OUTPUT
-    //   |\----------------\               |\------------\
-    //   | \                \              | \            \
-    //   |  \----------------\             |  \------------\
-    //   |  | 0    ...     9 |             |  | 0  ...   7 |
-    //   |  | 10   ...    19 |     --->    |  | 8  ...  15 |
-    //   |  | 20   ...    29 |              \ | .. ...  .. |
-    //    \ | ..   ...    .. |               \| 56 ...  63 |
-    //     \| 90   ...   109 |                |------------|
-    //      |----------------|
-    //
-    // The first set of loads corresponds to:
-    //
-    //        INPUT                          OUTPUT
-    //   |\-----------------                |\-----------
-    //   | \                                | \
-    //   |  \-----------------              |  \----------
-    //   |  | 0  1   2  3 ...               |  | 0  1 ...
-    //   |  | 10 11 12 13 ...     --->      |  | ..   ...
-    //   |  | 20 21 22 23 ...                  | ..   ...
-    //   |  | ..   ...    ...
-    //
-    // The next set of loads correspond to a sliding window to the right.
-    // It loads inputs 4, 5, 14, 15, 23, 24 and keeps 2, 3, 12, 13, and 22:
-    //
-    //        INPUT                          OUTPUT
-    //   |\-------------------                |\-------------
-    //   | \                                  | \
-    //   |  \-------------------              |  \------------
-    //   |  | .. 2  3   4  5 ...              |  | .. 2  3 ...
-    //   |  | .. 12 13 14 15 ...     --->     |  | ..      ...
-    //   |  | .. 21 22 23 24 ...                 | ..      ...
-    //   |  | ..    ...      ...
-    //
-    // And so on...
-
-    int16x8_t input_0, input_1, input_2, input_3, input_4, input_5, input_6,
-        input_7, input_8, input_9, input_10, input_11;
-
-    // Load inputs for 1x2 outputs starting from the top left. Referring to the
-    // indexes in the diagram above, this corresponds to outputs (0) and (1).
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3;
-
-      const uint8* ptr = input_ptr;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-      temp_2 = vld1_u8(ptr + 2 * input_depth);
-      temp_3 = vld1_u8(ptr + 3 * input_depth);
-
-      input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_2 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-
-      input_0 = vaddq_s16(input_0, input_offset_vec);
-      input_1 = vaddq_s16(input_1, input_offset_vec);
-      input_2 = vaddq_s16(input_2, input_offset_vec);
-      input_3 = vaddq_s16(input_3, input_offset_vec);
-
-      ptr += input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-      temp_2 = vld1_u8(ptr + 2 * input_depth);
-      temp_3 = vld1_u8(ptr + 3 * input_depth);
-
-      input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_6 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_7 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-
-      input_4 = vaddq_s16(input_4, input_offset_vec);
-      input_5 = vaddq_s16(input_5, input_offset_vec);
-      input_6 = vaddq_s16(input_6, input_offset_vec);
-      input_7 = vaddq_s16(input_7, input_offset_vec);
-
-      ptr += input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-      temp_2 = vld1_u8(ptr + 2 * input_depth);
-      temp_3 = vld1_u8(ptr + 3 * input_depth);
-
-      input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_9 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_10 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_11 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-
-      input_8 = vaddq_s16(input_8, input_offset_vec);
-      input_9 = vaddq_s16(input_9, input_offset_vec);
-      input_10 = vaddq_s16(input_10, input_offset_vec);
-      input_11 = vaddq_s16(input_11, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_0, input_1, input_2, input_3, input_4, input_5, input_6,
-        input_7, input_8, input_9, input_10, input_11, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max, output_ptr, output_depth);
-
-    // Slide to the right for outputs x = [2, 3], y = 0. Referring to the
-    // indexes in the diagram above, this corresponds to outputs (2) and (3).
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3, temp_4, temp_5;
-
-      const uint8* ptr = input_ptr + 4 * input_depth;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_2 = vld1_u8(ptr);
-      temp_3 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_4 = vld1_u8(ptr);
-      temp_5 = vld1_u8(ptr + input_depth);
-
-      input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-      input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-      input_9 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-
-      input_0 = vaddq_s16(input_0, input_offset_vec);
-      input_1 = vaddq_s16(input_1, input_offset_vec);
-      input_4 = vaddq_s16(input_4, input_offset_vec);
-      input_5 = vaddq_s16(input_5, input_offset_vec);
-      input_8 = vaddq_s16(input_8, input_offset_vec);
-      input_9 = vaddq_s16(input_9, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_2, input_3, input_0, input_1, input_6, input_7, input_4,
-        input_5, input_10, input_11, input_8, input_9, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max, output_ptr + 2 * output_depth, output_depth);
-
-    // Slide to the right again for outputs x = [4, 5], y = 0. Referring to the
-    // indexes in the diagram above, this corresponds to outputs (4) and (5).
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3, temp_4, temp_5;
-
-      const uint8* ptr = input_ptr + 6 * input_depth;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_2 = vld1_u8(ptr);
-      temp_3 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_4 = vld1_u8(ptr);
-      temp_5 = vld1_u8(ptr + input_depth);
-
-      input_2 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_6 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_7 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-      input_10 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-      input_11 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-
-      input_2 = vaddq_s16(input_2, input_offset_vec);
-      input_3 = vaddq_s16(input_3, input_offset_vec);
-      input_6 = vaddq_s16(input_6, input_offset_vec);
-      input_7 = vaddq_s16(input_7, input_offset_vec);
-      input_10 = vaddq_s16(input_10, input_offset_vec);
-      input_11 = vaddq_s16(input_11, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_0, input_1, input_2, input_3, input_4, input_5, input_6,
-        input_7, input_8, input_9, input_10, input_11, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max, output_ptr + 4 * output_depth, output_depth);
-
-    // Slide to the right one last time for outputs x = [6, 7], y = 0.
-    // Referring to the indexes in the diagram above, this corresponds to
-    // outputs (6) and (7).
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3, temp_4, temp_5;
-
-      const uint8* ptr = input_ptr + 8 * input_depth;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_2 = vld1_u8(ptr);
-      temp_3 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_4 = vld1_u8(ptr);
-      temp_5 = vld1_u8(ptr + input_depth);
-
-      input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-      input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-      input_9 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-
-      input_0 = vaddq_s16(input_0, input_offset_vec);
-      input_1 = vaddq_s16(input_1, input_offset_vec);
-      input_4 = vaddq_s16(input_4, input_offset_vec);
-      input_5 = vaddq_s16(input_5, input_offset_vec);
-      input_8 = vaddq_s16(input_8, input_offset_vec);
-      input_9 = vaddq_s16(input_9, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_2, input_3, input_0, input_1, input_6, input_7, input_4,
-        input_5, input_10, input_11, input_8, input_9, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max, output_ptr + 6 * output_depth, output_depth);
-
-    // Slide to down for outputs x = [6, 7], y = 1. Referring to the indexes in
-    // the diagram above, this corresponds to outputs (14) and (15).
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3;
-
-      const uint8* ptr = input_ptr + 6 * input_depth + 3 * input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-      temp_2 = vld1_u8(ptr + 2 * input_depth);
-      temp_3 = vld1_u8(ptr + 3 * input_depth);
-
-      input_2 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-
-      input_2 = vaddq_s16(input_2, input_offset_vec);
-      input_3 = vaddq_s16(input_3, input_offset_vec);
-      input_0 = vaddq_s16(input_0, input_offset_vec);
-      input_1 = vaddq_s16(input_1, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_6, input_7, input_4, input_5, input_10, input_11, input_8,
-        input_9, input_2, input_3, input_0, input_1, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max, output_ptr + 6 * output_depth + output_row_size,
-        output_depth);
-
-    // Slide left for outputs x = [4, 5], y = 1. Referring to the indexes in
-    // the diagram above, this corresponds to outputs (12) and (13).
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3, temp_4, temp_5;
-
-      const uint8* ptr = input_ptr + 4 * input_depth + input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_2 = vld1_u8(ptr);
-      temp_3 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_4 = vld1_u8(ptr);
-      temp_5 = vld1_u8(ptr + input_depth);
-
-      input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_9 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-      input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-      input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-
-      input_4 = vaddq_s16(input_4, input_offset_vec);
-      input_5 = vaddq_s16(input_5, input_offset_vec);
-      input_8 = vaddq_s16(input_8, input_offset_vec);
-      input_9 = vaddq_s16(input_9, input_offset_vec);
-      input_0 = vaddq_s16(input_0, input_offset_vec);
-      input_1 = vaddq_s16(input_1, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_4, input_5, input_6, input_7, input_8, input_9, input_10,
-        input_11, input_0, input_1, input_2, input_3, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max, output_ptr + 4 * output_depth + output_row_size,
-        output_depth);
-
-    // Slide left again for outputs x = [2, 3], y = 1. Referring to the indexes
-    // in the diagram above, this corresponds to outputs (10) and (11).
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3, temp_4, temp_5;
-
-      const uint8* ptr = input_ptr + 2 * input_depth + input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_2 = vld1_u8(ptr);
-      temp_3 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_4 = vld1_u8(ptr);
-      temp_5 = vld1_u8(ptr + input_depth);
-
-      input_6 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_7 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_10 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_11 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-      input_2 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-      input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-
-      input_6 = vaddq_s16(input_6, input_offset_vec);
-      input_7 = vaddq_s16(input_7, input_offset_vec);
-      input_10 = vaddq_s16(input_10, input_offset_vec);
-      input_11 = vaddq_s16(input_11, input_offset_vec);
-      input_2 = vaddq_s16(input_2, input_offset_vec);
-      input_3 = vaddq_s16(input_3, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_6, input_7, input_4, input_5, input_10, input_11, input_8,
-        input_9, input_2, input_3, input_0, input_1, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max, output_ptr + 2 * output_depth + output_row_size,
-        output_depth);
-
-    // Slide left one more time for outputs x = [0, 1], y = 1. Referring to the
-    // indexes in the diagram above, this corresponds to outputs (8) and (9).
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3, temp_4, temp_5;
-
-      const uint8* ptr = input_ptr + input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_2 = vld1_u8(ptr);
-      temp_3 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_4 = vld1_u8(ptr);
-      temp_5 = vld1_u8(ptr + input_depth);
-
-      input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_9 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-      input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-      input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-
-      input_4 = vaddq_s16(input_4, input_offset_vec);
-      input_5 = vaddq_s16(input_5, input_offset_vec);
-      input_8 = vaddq_s16(input_8, input_offset_vec);
-      input_9 = vaddq_s16(input_9, input_offset_vec);
-      input_0 = vaddq_s16(input_0, input_offset_vec);
-      input_1 = vaddq_s16(input_1, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_4, input_5, input_6, input_7, input_8, input_9, input_10,
-        input_11, input_0, input_1, input_2, input_3, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max, output_ptr + output_row_size, output_depth);
-
-    // Slide down for outputs x = [0, 1], y = 2. Referring to the
-    // indexes in the diagram above, this corresponds to outputs (16) and (17).
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3;
-
-      const uint8* ptr = input_ptr + 4 * input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-      temp_2 = vld1_u8(ptr + 2 * input_depth);
-      temp_3 = vld1_u8(ptr + 3 * input_depth);
-
-      input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_6 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_7 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-
-      input_4 = vaddq_s16(input_4, input_offset_vec);
-      input_5 = vaddq_s16(input_5, input_offset_vec);
-      input_6 = vaddq_s16(input_6, input_offset_vec);
-      input_7 = vaddq_s16(input_7, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_8, input_9, input_10, input_11, input_0, input_1, input_2,
-        input_3, input_4, input_5, input_6, input_7, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max, output_ptr + 2 * output_row_size, output_depth);
-
-    // Slide right for outputs x = [2, 3], y = 2. Referring to the
-    // indexes in the diagram above, this corresponds to outputs (18) and (19).
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3, temp_4, temp_5;
-
-      const uint8* ptr = input_ptr + 4 * input_depth + 2 * input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_2 = vld1_u8(ptr);
-      temp_3 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_4 = vld1_u8(ptr);
-      temp_5 = vld1_u8(ptr + input_depth);
-
-      input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_9 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-      input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-      input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-
-      input_8 = vaddq_s16(input_8, input_offset_vec);
-      input_9 = vaddq_s16(input_9, input_offset_vec);
-      input_0 = vaddq_s16(input_0, input_offset_vec);
-      input_1 = vaddq_s16(input_1, input_offset_vec);
-      input_4 = vaddq_s16(input_4, input_offset_vec);
-      input_5 = vaddq_s16(input_5, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_10, input_11, input_8, input_9, input_2, input_3, input_0,
-        input_1, input_6, input_7, input_4, input_5, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max,
-        output_ptr + 2 * output_depth + 2 * output_row_size, output_depth);
-
-    // Slide right for outputs x = [4, 5], y = 2. Referring to the
-    // indexes in the diagram above, this corresponds to outputs (20) and (21).
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3, temp_4, temp_5;
-
-      const uint8* ptr = input_ptr + 6 * input_depth + 2 * input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_2 = vld1_u8(ptr);
-      temp_3 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_4 = vld1_u8(ptr);
-      temp_5 = vld1_u8(ptr + input_depth);
-
-      input_10 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_11 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_2 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-      input_6 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-      input_7 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-
-      input_10 = vaddq_s16(input_10, input_offset_vec);
-      input_11 = vaddq_s16(input_11, input_offset_vec);
-      input_2 = vaddq_s16(input_2, input_offset_vec);
-      input_3 = vaddq_s16(input_3, input_offset_vec);
-      input_6 = vaddq_s16(input_6, input_offset_vec);
-      input_7 = vaddq_s16(input_7, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_8, input_9, input_10, input_11, input_0, input_1, input_2,
-        input_3, input_4, input_5, input_6, input_7, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max,
-        output_ptr + 4 * output_depth + 2 * output_row_size, output_depth);
-
-    // Slide right one more time for outputs x = [6, 7], y = 2. Referring to the
-    // indexes in the diagram above, this corresponds to outputs (22) and (23).
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3, temp_4, temp_5;
-
-      const uint8* ptr = input_ptr + 8 * input_depth + 2 * input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_2 = vld1_u8(ptr);
-      temp_3 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_4 = vld1_u8(ptr);
-      temp_5 = vld1_u8(ptr + input_depth);
-
-      input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_9 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-      input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-      input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-
-      input_8 = vaddq_s16(input_8, input_offset_vec);
-      input_9 = vaddq_s16(input_9, input_offset_vec);
-      input_0 = vaddq_s16(input_0, input_offset_vec);
-      input_1 = vaddq_s16(input_1, input_offset_vec);
-      input_4 = vaddq_s16(input_4, input_offset_vec);
-      input_5 = vaddq_s16(input_5, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_10, input_11, input_8, input_9, input_2, input_3, input_0,
-        input_1, input_6, input_7, input_4, input_5, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max,
-        output_ptr + 6 * output_depth + 2 * output_row_size, output_depth);
-
-    // Slide down for outputs x = [6, 7], y = 3. Referring to the indexes in
-    // the diagram above, this corresponds to outputs (30) and (31).
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3;
-
-      const uint8* ptr = input_ptr + 6 * input_depth + 5 * input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-      temp_2 = vld1_u8(ptr + 2 * input_depth);
-      temp_3 = vld1_u8(ptr + 3 * input_depth);
-
-      input_10 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_11 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_9 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-
-      input_10 = vaddq_s16(input_10, input_offset_vec);
-      input_11 = vaddq_s16(input_11, input_offset_vec);
-      input_8 = vaddq_s16(input_8, input_offset_vec);
-      input_9 = vaddq_s16(input_9, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_2, input_3, input_0, input_1, input_6, input_7, input_4,
-        input_5, input_10, input_11, input_8, input_9, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max,
-        output_ptr + 6 * output_depth + 3 * output_row_size, output_depth);
-
-    // Slide left for outputs x = [4, 5], y = 3. Referring to the indexes in
-    // the diagram above, this corresponds to outputs (28) and (29).
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3, temp_4, temp_5;
-
-      const uint8* ptr = input_ptr + 4 * input_depth + 3 * input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_2 = vld1_u8(ptr);
-      temp_3 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_4 = vld1_u8(ptr);
-      temp_5 = vld1_u8(ptr + input_depth);
-
-      input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-      input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-      input_9 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-
-      input_0 = vaddq_s16(input_0, input_offset_vec);
-      input_1 = vaddq_s16(input_1, input_offset_vec);
-      input_4 = vaddq_s16(input_4, input_offset_vec);
-      input_5 = vaddq_s16(input_5, input_offset_vec);
-      input_8 = vaddq_s16(input_8, input_offset_vec);
-      input_9 = vaddq_s16(input_9, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_0, input_1, input_2, input_3, input_4, input_5, input_6,
-        input_7, input_8, input_9, input_10, input_11, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max,
-        output_ptr + 4 * output_depth + 3 * output_row_size, output_depth);
-
-    // Slide left for outputs x = [2, 3], y = 3. Referring to the indexes in
-    // the diagram above, this corresponds to outputs (26) and (27).
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3, temp_4, temp_5;
-
-      const uint8* ptr = input_ptr + 2 * input_depth + 3 * input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_2 = vld1_u8(ptr);
-      temp_3 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_4 = vld1_u8(ptr);
-      temp_5 = vld1_u8(ptr + input_depth);
-
-      input_2 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_6 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_7 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-      input_10 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-      input_11 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-
-      input_2 = vaddq_s16(input_2, input_offset_vec);
-      input_3 = vaddq_s16(input_3, input_offset_vec);
-      input_6 = vaddq_s16(input_6, input_offset_vec);
-      input_7 = vaddq_s16(input_7, input_offset_vec);
-      input_10 = vaddq_s16(input_10, input_offset_vec);
-      input_11 = vaddq_s16(input_11, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_2, input_3, input_0, input_1, input_6, input_7, input_4,
-        input_5, input_10, input_11, input_8, input_9, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max,
-        output_ptr + 2 * output_depth + 3 * output_row_size, output_depth);
-
-    // Slide left one more time for outputs x = [0, 1], y = 3. Referring to the
-    // indexes in the diagram above, this corresponds to outputs (24) and (25).
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3, temp_4, temp_5;
-
-      const uint8* ptr = input_ptr + 3 * input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_2 = vld1_u8(ptr);
-      temp_3 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_4 = vld1_u8(ptr);
-      temp_5 = vld1_u8(ptr + input_depth);
-
-      input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-      input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-      input_9 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-
-      input_0 = vaddq_s16(input_0, input_offset_vec);
-      input_1 = vaddq_s16(input_1, input_offset_vec);
-      input_4 = vaddq_s16(input_4, input_offset_vec);
-      input_5 = vaddq_s16(input_5, input_offset_vec);
-      input_8 = vaddq_s16(input_8, input_offset_vec);
-      input_9 = vaddq_s16(input_9, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_0, input_1, input_2, input_3, input_4, input_5, input_6,
-        input_7, input_8, input_9, input_10, input_11, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max, output_ptr + 3 * output_row_size, output_depth);
-
-    // Slide down for outputs x = [0, 1], y = 4. Referring to the indexes in
-    // the diagram above, this corresponds to outputs (32) and (33).
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3;
-
-      const uint8* ptr = input_ptr + 6 * input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-      temp_2 = vld1_u8(ptr + 2 * input_depth);
-      temp_3 = vld1_u8(ptr + 3 * input_depth);
-
-      input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_2 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-
-      input_0 = vaddq_s16(input_0, input_offset_vec);
-      input_1 = vaddq_s16(input_1, input_offset_vec);
-      input_2 = vaddq_s16(input_2, input_offset_vec);
-      input_3 = vaddq_s16(input_3, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_4, input_5, input_6, input_7, input_8, input_9, input_10,
-        input_11, input_0, input_1, input_2, input_3, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max, output_ptr + 4 * output_row_size, output_depth);
-
-    // Slide right for outputs x = [2, 3], y = 4. Referring to the indexes in
-    // the diagram above, this corresponds to outputs (34) and (35).
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3, temp_4, temp_5;
-
-      const uint8* ptr = input_ptr + 4 * input_depth + 4 * input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_2 = vld1_u8(ptr);
-      temp_3 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_4 = vld1_u8(ptr);
-      temp_5 = vld1_u8(ptr + input_depth);
-
-      input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_9 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-      input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-      input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-
-      input_4 = vaddq_s16(input_4, input_offset_vec);
-      input_5 = vaddq_s16(input_5, input_offset_vec);
-      input_8 = vaddq_s16(input_8, input_offset_vec);
-      input_9 = vaddq_s16(input_9, input_offset_vec);
-      input_0 = vaddq_s16(input_0, input_offset_vec);
-      input_1 = vaddq_s16(input_1, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_6, input_7, input_4, input_5, input_10, input_11, input_8,
-        input_9, input_2, input_3, input_0, input_1, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max,
-        output_ptr + 2 * output_depth + 4 * output_row_size, output_depth);
-
-    // Slide right for outputs x = [4, 5], y = 4. Referring to the indexes in
-    // the diagram above, this corresponds to outputs (36) and (37).
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3, temp_4, temp_5;
-
-      const uint8* ptr = input_ptr + 6 * input_depth + 4 * input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_2 = vld1_u8(ptr);
-      temp_3 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_4 = vld1_u8(ptr);
-      temp_5 = vld1_u8(ptr + input_depth);
-
-      input_6 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_7 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_10 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_11 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-      input_2 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-      input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-
-      input_6 = vaddq_s16(input_6, input_offset_vec);
-      input_7 = vaddq_s16(input_7, input_offset_vec);
-      input_10 = vaddq_s16(input_10, input_offset_vec);
-      input_11 = vaddq_s16(input_11, input_offset_vec);
-      input_2 = vaddq_s16(input_2, input_offset_vec);
-      input_3 = vaddq_s16(input_3, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_4, input_5, input_6, input_7, input_8, input_9, input_10,
-        input_11, input_0, input_1, input_2, input_3, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max,
-        output_ptr + 4 * output_depth + 4 * output_row_size, output_depth);
-
-    // Slide right one more time for outputs x = [6, 7], y = 4. Referring to the
-    // indexes in the diagram above, this corresponds to outputs (38) and (39).
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3, temp_4, temp_5;
-
-      const uint8* ptr = input_ptr + 8 * input_depth + 4 * input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_2 = vld1_u8(ptr);
-      temp_3 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_4 = vld1_u8(ptr);
-      temp_5 = vld1_u8(ptr + input_depth);
-
-      input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_9 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-      input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-      input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-
-      input_4 = vaddq_s16(input_4, input_offset_vec);
-      input_5 = vaddq_s16(input_5, input_offset_vec);
-      input_8 = vaddq_s16(input_8, input_offset_vec);
-      input_9 = vaddq_s16(input_9, input_offset_vec);
-      input_0 = vaddq_s16(input_0, input_offset_vec);
-      input_1 = vaddq_s16(input_1, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_6, input_7, input_4, input_5, input_10, input_11, input_8,
-        input_9, input_2, input_3, input_0, input_1, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max,
-        output_ptr + 6 * output_depth + 4 * output_row_size, output_depth);
-
-    // Slide down for outputs x = [6, 7], y = 5. Referring to the  indexes in
-    // the diagram above, this corresponds to outputs (46) and (47).
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3;
-
-      const uint8* ptr = input_ptr + 6 * input_depth + 7 * input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-      temp_2 = vld1_u8(ptr + 2 * input_depth);
-      temp_3 = vld1_u8(ptr + 3 * input_depth);
-
-      input_6 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_7 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-
-      input_6 = vaddq_s16(input_6, input_offset_vec);
-      input_7 = vaddq_s16(input_7, input_offset_vec);
-      input_4 = vaddq_s16(input_4, input_offset_vec);
-      input_5 = vaddq_s16(input_5, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_10, input_11, input_8, input_9, input_2, input_3, input_0,
-        input_1, input_6, input_7, input_4, input_5, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max,
-        output_ptr + 6 * output_depth + 5 * output_row_size, output_depth);
-
-    // Slide left for outputs x = [4, 5], y = 5. Referring to the  indexes in
-    // the diagram above, this corresponds to outputs (44) and (45).
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3, temp_4, temp_5;
-
-      const uint8* ptr = input_ptr + 4 * input_depth + 5 * input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_2 = vld1_u8(ptr);
-      temp_3 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_4 = vld1_u8(ptr);
-      temp_5 = vld1_u8(ptr + input_depth);
-
-      input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_9 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-      input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-      input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-
-      input_8 = vaddq_s16(input_8, input_offset_vec);
-      input_9 = vaddq_s16(input_9, input_offset_vec);
-      input_0 = vaddq_s16(input_0, input_offset_vec);
-      input_1 = vaddq_s16(input_1, input_offset_vec);
-      input_4 = vaddq_s16(input_4, input_offset_vec);
-      input_5 = vaddq_s16(input_5, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_8, input_9, input_10, input_11, input_0, input_1, input_2,
-        input_3, input_4, input_5, input_6, input_7, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max,
-        output_ptr + 4 * output_depth + 5 * output_row_size, output_depth);
-
-    // Slide left for outputs x = [2, 3], y = 5. Referring to the  indexes in
-    // the diagram above, this corresponds to outputs (42) and (43).
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3, temp_4, temp_5;
-
-      const uint8* ptr = input_ptr + 2 * input_depth + 5 * input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_2 = vld1_u8(ptr);
-      temp_3 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_4 = vld1_u8(ptr);
-      temp_5 = vld1_u8(ptr + input_depth);
-
-      input_10 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_11 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_2 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-      input_6 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-      input_7 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-
-      input_10 = vaddq_s16(input_10, input_offset_vec);
-      input_11 = vaddq_s16(input_11, input_offset_vec);
-      input_2 = vaddq_s16(input_2, input_offset_vec);
-      input_3 = vaddq_s16(input_3, input_offset_vec);
-      input_6 = vaddq_s16(input_6, input_offset_vec);
-      input_7 = vaddq_s16(input_7, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_10, input_11, input_8, input_9, input_2, input_3, input_0,
-        input_1, input_6, input_7, input_4, input_5, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max,
-        output_ptr + 2 * output_depth + 5 * output_row_size, output_depth);
-
-    // Slide left one more time for outputs x = [0, 1], y = 5. Referring to the
-    // indexes in the diagram above, this corresponds to outputs (40) and (41).
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3, temp_4, temp_5;
-
-      const uint8* ptr = input_ptr + 5 * input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_2 = vld1_u8(ptr);
-      temp_3 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_4 = vld1_u8(ptr);
-      temp_5 = vld1_u8(ptr + input_depth);
-
-      input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_9 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-      input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-      input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-
-      input_8 = vaddq_s16(input_8, input_offset_vec);
-      input_9 = vaddq_s16(input_9, input_offset_vec);
-      input_0 = vaddq_s16(input_0, input_offset_vec);
-      input_1 = vaddq_s16(input_1, input_offset_vec);
-      input_4 = vaddq_s16(input_4, input_offset_vec);
-      input_5 = vaddq_s16(input_5, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_8, input_9, input_10, input_11, input_0, input_1, input_2,
-        input_3, input_4, input_5, input_6, input_7, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max, output_ptr + 5 * output_row_size, output_depth);
-
-    // Slide down for outputs x = [0, 1], y = 6. Referring to the  indexes in
-    // the diagram above, this corresponds to outputs (48) and (49).
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3;
-
-      const uint8* ptr = input_ptr + 8 * input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-      temp_2 = vld1_u8(ptr + 2 * input_depth);
-      temp_3 = vld1_u8(ptr + 3 * input_depth);
-
-      input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_9 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_10 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_11 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-
-      input_8 = vaddq_s16(input_8, input_offset_vec);
-      input_9 = vaddq_s16(input_9, input_offset_vec);
-      input_10 = vaddq_s16(input_10, input_offset_vec);
-      input_11 = vaddq_s16(input_11, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_0, input_1, input_2, input_3, input_4, input_5, input_6,
-        input_7, input_8, input_9, input_10, input_11, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max, output_ptr + 6 * output_row_size, output_depth);
-
-    // Slide right for outputs x = [2, 3], y = 6. Referring to the  indexes in
-    // the diagram above, this corresponds to outputs (50) and (51).
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3, temp_4, temp_5;
-
-      const uint8* ptr = input_ptr + 4 * input_depth + 6 * input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_2 = vld1_u8(ptr);
-      temp_3 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_4 = vld1_u8(ptr);
-      temp_5 = vld1_u8(ptr + input_depth);
-
-      input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-      input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-      input_9 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-
-      input_0 = vaddq_s16(input_0, input_offset_vec);
-      input_1 = vaddq_s16(input_1, input_offset_vec);
-      input_4 = vaddq_s16(input_4, input_offset_vec);
-      input_5 = vaddq_s16(input_5, input_offset_vec);
-      input_8 = vaddq_s16(input_8, input_offset_vec);
-      input_9 = vaddq_s16(input_9, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_2, input_3, input_0, input_1, input_6, input_7, input_4,
-        input_5, input_10, input_11, input_8, input_9, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max,
-        output_ptr + 2 * output_depth + 6 * output_row_size, output_depth);
-
-    // Slide right for outputs x = [4, 5], y = 6. Referring to the  indexes in
-    // the diagram above, this corresponds to outputs (52) and (53).
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3, temp_4, temp_5;
-
-      const uint8* ptr = input_ptr + 6 * input_depth + 6 * input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_2 = vld1_u8(ptr);
-      temp_3 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_4 = vld1_u8(ptr);
-      temp_5 = vld1_u8(ptr + input_depth);
-
-      input_2 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_6 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_7 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-      input_10 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-      input_11 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-
-      input_2 = vaddq_s16(input_2, input_offset_vec);
-      input_3 = vaddq_s16(input_3, input_offset_vec);
-      input_6 = vaddq_s16(input_6, input_offset_vec);
-      input_7 = vaddq_s16(input_7, input_offset_vec);
-      input_10 = vaddq_s16(input_10, input_offset_vec);
-      input_11 = vaddq_s16(input_11, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_0, input_1, input_2, input_3, input_4, input_5, input_6,
-        input_7, input_8, input_9, input_10, input_11, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max,
-        output_ptr + 4 * output_depth + 6 * output_row_size, output_depth);
-
-    // Slide right one more time for outputs x = [6, 7], y = 6. Referring to the
-    // indexes in the diagram above, this corresponds to outputs (54) and (55).
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3, temp_4, temp_5;
-
-      const uint8* ptr = input_ptr + 8 * input_depth + 6 * input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_2 = vld1_u8(ptr);
-      temp_3 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_4 = vld1_u8(ptr);
-      temp_5 = vld1_u8(ptr + input_depth);
-
-      input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-      input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-      input_9 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-
-      input_0 = vaddq_s16(input_0, input_offset_vec);
-      input_1 = vaddq_s16(input_1, input_offset_vec);
-      input_4 = vaddq_s16(input_4, input_offset_vec);
-      input_5 = vaddq_s16(input_5, input_offset_vec);
-      input_8 = vaddq_s16(input_8, input_offset_vec);
-      input_9 = vaddq_s16(input_9, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_2, input_3, input_0, input_1, input_6, input_7, input_4,
-        input_5, input_10, input_11, input_8, input_9, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max,
-        output_ptr + 6 * output_depth + 6 * output_row_size, output_depth);
-
-    // Slide down for outputs x = [6, 7], y = 7. Referring to the indexes in the
-    // diagram above, this corresponds to outputs (62) and (63).
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3;
-
-      const uint8* ptr = input_ptr + 6 * input_depth + 9 * input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-      temp_2 = vld1_u8(ptr + 2 * input_depth);
-      temp_3 = vld1_u8(ptr + 3 * input_depth);
-
-      input_2 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-
-      input_2 = vaddq_s16(input_2, input_offset_vec);
-      input_3 = vaddq_s16(input_3, input_offset_vec);
-      input_0 = vaddq_s16(input_0, input_offset_vec);
-      input_1 = vaddq_s16(input_1, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_6, input_7, input_4, input_5, input_10, input_11, input_8,
-        input_9, input_2, input_3, input_0, input_1, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max,
-        output_ptr + 6 * output_depth + 7 * output_row_size, output_depth);
-
-    // Slide left for outputs x = [4, 5], y = 7. Referring to the indexes in the
-    // diagram above, this corresponds to outputs (60) and (61).
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3, temp_4, temp_5;
-
-      const uint8* ptr = input_ptr + 4 * input_depth + 7 * input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_2 = vld1_u8(ptr);
-      temp_3 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_4 = vld1_u8(ptr);
-      temp_5 = vld1_u8(ptr + input_depth);
-
-      input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_9 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-      input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-      input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-
-      input_4 = vaddq_s16(input_4, input_offset_vec);
-      input_5 = vaddq_s16(input_5, input_offset_vec);
-      input_8 = vaddq_s16(input_8, input_offset_vec);
-      input_9 = vaddq_s16(input_9, input_offset_vec);
-      input_0 = vaddq_s16(input_0, input_offset_vec);
-      input_1 = vaddq_s16(input_1, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_4, input_5, input_6, input_7, input_8, input_9, input_10,
-        input_11, input_0, input_1, input_2, input_3, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max,
-        output_ptr + 4 * output_depth + 7 * output_row_size, output_depth);
-
-    // Slide left for outputs x = [2, 3], y = 7. Referring to the indexes in the
-    // diagram above, this corresponds to outputs (58) and (59).
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3, temp_4, temp_5;
-
-      const uint8* ptr = input_ptr + 2 * input_depth + 7 * input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_2 = vld1_u8(ptr);
-      temp_3 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_4 = vld1_u8(ptr);
-      temp_5 = vld1_u8(ptr + input_depth);
-
-      input_6 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_7 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_10 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_11 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-      input_2 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-      input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-
-      input_6 = vaddq_s16(input_6, input_offset_vec);
-      input_7 = vaddq_s16(input_7, input_offset_vec);
-      input_10 = vaddq_s16(input_10, input_offset_vec);
-      input_11 = vaddq_s16(input_11, input_offset_vec);
-      input_2 = vaddq_s16(input_2, input_offset_vec);
-      input_3 = vaddq_s16(input_3, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_6, input_7, input_4, input_5, input_10, input_11, input_8,
-        input_9, input_2, input_3, input_0, input_1, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max,
-        output_ptr + 2 * output_depth + 7 * output_row_size, output_depth);
-
-    // Slide left one more time for outputs x = [0, 1], y = 7. Referring to the
-    // indexes in the diagram above, this corresponds to outputs (56) and (57).
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3, temp_4, temp_5;
-
-      const uint8* ptr = input_ptr + 7 * input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_2 = vld1_u8(ptr);
-      temp_3 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_4 = vld1_u8(ptr);
-      temp_5 = vld1_u8(ptr + input_depth);
-
-      input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_9 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-      input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-      input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-
-      input_4 = vaddq_s16(input_4, input_offset_vec);
-      input_5 = vaddq_s16(input_5, input_offset_vec);
-      input_8 = vaddq_s16(input_8, input_offset_vec);
-      input_9 = vaddq_s16(input_9, input_offset_vec);
-      input_0 = vaddq_s16(input_0, input_offset_vec);
-      input_1 = vaddq_s16(input_1, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_4, input_5, input_6, input_7, input_8, input_9, input_10,
-        input_11, input_0, input_1, input_2, input_3, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max, output_ptr + 7 * output_row_size, output_depth);
-  }
-};
-
-template <>
-struct ConvKernel3x3FilterDepth8<4, 4, 1, 1> {
-  static inline void Run(const uint8* input_ptr, int input_depth,
-                         int32 input_offset, int input_row_size,
-                         const uint8* filter_ptr, int32 filter_offset,
-                         const int32* bias_ptr, int32 output_offset,
-                         int32 output_multiplier, int output_shift,
-                         int32 output_activation_min,
-                         int32 output_activation_max, uint8* output_ptr,
-                         int output_depth, int output_width) {
-    Filter3x3x8 filter = Load3x3Filter(filter_ptr, filter_offset, output_depth);
-
-    const int16x8_t input_offset_vec = vdupq_n_s16(input_offset);
-    const int output_row_size = output_depth * output_width;
-
-    // To process 4x4 outputs using a 3x3 filter, we require 6x6 inputs.
-    // Load inputs for the first 2 filters on the top left, then slide to
-    // the right, down, left, down, right, etc. in a snake-like path. This
-    // minimizes the total number of loads.
-    int16x8_t input_0, input_1, input_2, input_3, input_4, input_5, input_6,
-        input_7, input_8, input_9, input_10, input_11;
-
-    // Load inputs for 1x2 outputs starting from the top left.
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3;
-
-      const uint8* ptr = input_ptr;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-      temp_2 = vld1_u8(ptr + 2 * input_depth);
-      temp_3 = vld1_u8(ptr + 3 * input_depth);
-
-      input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_2 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-
-      input_0 = vaddq_s16(input_0, input_offset_vec);
-      input_1 = vaddq_s16(input_1, input_offset_vec);
-      input_2 = vaddq_s16(input_2, input_offset_vec);
-      input_3 = vaddq_s16(input_3, input_offset_vec);
-
-      ptr += input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-      temp_2 = vld1_u8(ptr + 2 * input_depth);
-      temp_3 = vld1_u8(ptr + 3 * input_depth);
-
-      input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_6 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_7 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-
-      input_4 = vaddq_s16(input_4, input_offset_vec);
-      input_5 = vaddq_s16(input_5, input_offset_vec);
-      input_6 = vaddq_s16(input_6, input_offset_vec);
-      input_7 = vaddq_s16(input_7, input_offset_vec);
-
-      ptr += input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-      temp_2 = vld1_u8(ptr + 2 * input_depth);
-      temp_3 = vld1_u8(ptr + 3 * input_depth);
-
-      input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_9 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_10 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_11 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-
-      input_8 = vaddq_s16(input_8, input_offset_vec);
-      input_9 = vaddq_s16(input_9, input_offset_vec);
-      input_10 = vaddq_s16(input_10, input_offset_vec);
-      input_11 = vaddq_s16(input_11, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_0, input_1, input_2, input_3, input_4, input_5, input_6,
-        input_7, input_8, input_9, input_10, input_11, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max, output_ptr, output_depth);
-
-    // Now load 1x2 inputs on the top right.
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3, temp_4, temp_5;
-
-      const uint8* ptr = input_ptr + 4 * input_depth;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_2 = vld1_u8(ptr);
-      temp_3 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_4 = vld1_u8(ptr);
-      temp_5 = vld1_u8(ptr + input_depth);
-
-      input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-      input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-      input_9 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-
-      input_0 = vaddq_s16(input_0, input_offset_vec);
-      input_1 = vaddq_s16(input_1, input_offset_vec);
-      input_4 = vaddq_s16(input_4, input_offset_vec);
-      input_5 = vaddq_s16(input_5, input_offset_vec);
-      input_8 = vaddq_s16(input_8, input_offset_vec);
-      input_9 = vaddq_s16(input_9, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_2, input_3, input_0, input_1, input_6, input_7, input_4,
-        input_5, input_10, input_11, input_8, input_9, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max, output_ptr + 2 * output_depth, output_depth);
-
-    // Now load next inputs when sliding window down.
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3;
-
-      const uint8* ptr = input_ptr + 2 * input_depth + 3 * input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-      temp_2 = vld1_u8(ptr + 2 * input_depth);
-      temp_3 = vld1_u8(ptr + 3 * input_depth);
-
-      input_2 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-
-      input_2 = vaddq_s16(input_2, input_offset_vec);
-      input_3 = vaddq_s16(input_3, input_offset_vec);
-      input_0 = vaddq_s16(input_0, input_offset_vec);
-      input_1 = vaddq_s16(input_1, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_6, input_7, input_4, input_5, input_10, input_11, input_8,
-        input_9, input_2, input_3, input_0, input_1, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max, output_ptr + 2 * output_depth + output_row_size,
-        output_depth);
-
-    // Now load next inputs when sliding window left.
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3, temp_4, temp_5;
-
-      const uint8* ptr = input_ptr + input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_2 = vld1_u8(ptr);
-      temp_3 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_4 = vld1_u8(ptr);
-      temp_5 = vld1_u8(ptr + input_depth);
-
-      input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_9 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-      input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-      input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-
-      input_4 = vaddq_s16(input_4, input_offset_vec);
-      input_5 = vaddq_s16(input_5, input_offset_vec);
-      input_8 = vaddq_s16(input_8, input_offset_vec);
-      input_9 = vaddq_s16(input_9, input_offset_vec);
-      input_0 = vaddq_s16(input_0, input_offset_vec);
-      input_1 = vaddq_s16(input_1, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_4, input_5, input_6, input_7, input_8, input_9, input_10,
-        input_11, input_0, input_1, input_2, input_3, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max, output_ptr + output_row_size, output_depth);
-
-    // Now load next inputs when sliding window down.
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3;
-
-      const uint8* ptr = input_ptr + 4 * input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-      temp_2 = vld1_u8(ptr + 2 * input_depth);
-      temp_3 = vld1_u8(ptr + 3 * input_depth);
-
-      input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_6 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_7 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-
-      input_4 = vaddq_s16(input_4, input_offset_vec);
-      input_5 = vaddq_s16(input_5, input_offset_vec);
-      input_6 = vaddq_s16(input_6, input_offset_vec);
-      input_7 = vaddq_s16(input_7, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_8, input_9, input_10, input_11, input_0, input_1, input_2,
-        input_3, input_4, input_5, input_6, input_7, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max, output_ptr + 2 * output_row_size, output_depth);
-
-    // Now load next inputs when sliding window right.
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3, temp_4, temp_5;
-
-      const uint8* ptr = input_ptr + 4 * input_depth + 2 * input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_2 = vld1_u8(ptr);
-      temp_3 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_4 = vld1_u8(ptr);
-      temp_5 = vld1_u8(ptr + input_depth);
-
-      input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_9 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-      input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-      input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-
-      input_8 = vaddq_s16(input_8, input_offset_vec);
-      input_9 = vaddq_s16(input_9, input_offset_vec);
-      input_0 = vaddq_s16(input_0, input_offset_vec);
-      input_1 = vaddq_s16(input_1, input_offset_vec);
-      input_4 = vaddq_s16(input_4, input_offset_vec);
-      input_5 = vaddq_s16(input_5, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_10, input_11, input_8, input_9, input_2, input_3, input_0,
-        input_1, input_6, input_7, input_4, input_5, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max,
-        output_ptr + 2 * output_depth + 2 * output_row_size, output_depth);
-
-    // Now load next inputs when sliding window down.
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3;
-
-      const uint8* ptr = input_ptr + 2 * input_depth + 5 * input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-      temp_2 = vld1_u8(ptr + 2 * input_depth);
-      temp_3 = vld1_u8(ptr + 3 * input_depth);
-
-      input_10 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_11 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_9 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-
-      input_10 = vaddq_s16(input_10, input_offset_vec);
-      input_11 = vaddq_s16(input_11, input_offset_vec);
-      input_8 = vaddq_s16(input_8, input_offset_vec);
-      input_9 = vaddq_s16(input_9, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_2, input_3, input_0, input_1, input_6, input_7, input_4,
-        input_5, input_10, input_11, input_8, input_9, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max,
-        output_ptr + 2 * output_depth + 3 * output_row_size, output_depth);
-
-    // Now load next inputs when sliding window left.
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3, temp_4, temp_5;
-
-      const uint8* ptr = input_ptr + 3 * input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_2 = vld1_u8(ptr);
-      temp_3 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_4 = vld1_u8(ptr);
-      temp_5 = vld1_u8(ptr + input_depth);
-
-      input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-      input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-      input_9 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-
-      input_0 = vaddq_s16(input_0, input_offset_vec);
-      input_1 = vaddq_s16(input_1, input_offset_vec);
-      input_4 = vaddq_s16(input_4, input_offset_vec);
-      input_5 = vaddq_s16(input_5, input_offset_vec);
-      input_8 = vaddq_s16(input_8, input_offset_vec);
-      input_9 = vaddq_s16(input_9, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_0, input_1, input_2, input_3, input_4, input_5, input_6,
-        input_7, input_8, input_9, input_10, input_11, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max, output_ptr + 3 * output_row_size, output_depth);
-  }
-};
-
-template <>
-struct ConvKernel3x3FilterDepth8<4, 2, 1, 1> {
-  static inline void Run(const uint8* input_ptr, int input_depth,
-                         int32 input_offset, int input_row_size,
-                         const uint8* filter_ptr, int32 filter_offset,
-                         const int32* bias_ptr, int32 output_offset,
-                         int32 output_multiplier, int output_shift,
-                         int32 output_activation_min,
-                         int32 output_activation_max, uint8* output_ptr,
-                         int output_depth, int output_width) {
-    Filter3x3x8 filter = Load3x3Filter(filter_ptr, filter_offset, output_depth);
-
-    const int16x8_t input_offset_vec = vdupq_n_s16(input_offset);
-    const int output_row_size = output_depth * output_width;
-
-    int16x8_t input_0, input_1, input_2, input_3, input_4, input_5, input_6,
-        input_7, input_8, input_9, input_10, input_11;
-
-    // Load inputs for 1x2 outputs starting from the top.
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3;
-
-      const uint8* ptr = input_ptr;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-      temp_2 = vld1_u8(ptr + 2 * input_depth);
-      temp_3 = vld1_u8(ptr + 3 * input_depth);
-
-      input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_2 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-
-      input_0 = vaddq_s16(input_0, input_offset_vec);
-      input_1 = vaddq_s16(input_1, input_offset_vec);
-      input_2 = vaddq_s16(input_2, input_offset_vec);
-      input_3 = vaddq_s16(input_3, input_offset_vec);
-
-      ptr += input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-      temp_2 = vld1_u8(ptr + 2 * input_depth);
-      temp_3 = vld1_u8(ptr + 3 * input_depth);
-
-      input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_6 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_7 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-
-      input_4 = vaddq_s16(input_4, input_offset_vec);
-      input_5 = vaddq_s16(input_5, input_offset_vec);
-      input_6 = vaddq_s16(input_6, input_offset_vec);
-      input_7 = vaddq_s16(input_7, input_offset_vec);
-
-      ptr += input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-      temp_2 = vld1_u8(ptr + 2 * input_depth);
-      temp_3 = vld1_u8(ptr + 3 * input_depth);
-
-      input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_9 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_10 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_11 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-
-      input_8 = vaddq_s16(input_8, input_offset_vec);
-      input_9 = vaddq_s16(input_9, input_offset_vec);
-      input_10 = vaddq_s16(input_10, input_offset_vec);
-      input_11 = vaddq_s16(input_11, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_0, input_1, input_2, input_3, input_4, input_5, input_6,
-        input_7, input_8, input_9, input_10, input_11, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max, output_ptr, output_depth);
-
-    output_ptr += output_row_size;
-
-    // Now load next inputs one row down.
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3;
-
-      const uint8* ptr = input_ptr + 3 * input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-      temp_2 = vld1_u8(ptr + 2 * input_depth);
-      temp_3 = vld1_u8(ptr + 3 * input_depth);
-
-      input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_2 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-
-      input_0 = vaddq_s16(input_0, input_offset_vec);
-      input_1 = vaddq_s16(input_1, input_offset_vec);
-      input_2 = vaddq_s16(input_2, input_offset_vec);
-      input_3 = vaddq_s16(input_3, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_4, input_5, input_6, input_7, input_8, input_9, input_10,
-        input_11, input_0, input_1, input_2, input_3, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max, output_ptr, output_depth);
-
-    output_ptr += output_row_size;
-
-    // Now load next row.
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3;
-
-      const uint8* ptr = input_ptr + 4 * input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-      temp_2 = vld1_u8(ptr + 2 * input_depth);
-      temp_3 = vld1_u8(ptr + 3 * input_depth);
-
-      input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_6 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_7 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-
-      input_4 = vaddq_s16(input_4, input_offset_vec);
-      input_5 = vaddq_s16(input_5, input_offset_vec);
-      input_6 = vaddq_s16(input_6, input_offset_vec);
-      input_7 = vaddq_s16(input_7, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_8, input_9, input_10, input_11, input_0, input_1, input_2,
-        input_3, input_4, input_5, input_6, input_7, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max, output_ptr, output_depth);
-
-    output_ptr += output_row_size;
-
-    // Now load last row.
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3;
-
-      const uint8* ptr = input_ptr + 5 * input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-      temp_2 = vld1_u8(ptr + 2 * input_depth);
-      temp_3 = vld1_u8(ptr + 3 * input_depth);
-
-      input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_9 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_10 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_11 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-
-      input_8 = vaddq_s16(input_8, input_offset_vec);
-      input_9 = vaddq_s16(input_9, input_offset_vec);
-      input_10 = vaddq_s16(input_10, input_offset_vec);
-      input_11 = vaddq_s16(input_11, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_0, input_1, input_2, input_3, input_4, input_5, input_6,
-        input_7, input_8, input_9, input_10, input_11, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max, output_ptr, output_depth);
-  }
-};
-
-template <>
-struct ConvKernel3x3FilterDepth8<4, 1, 1, 1> {
-  static inline void Run(const uint8* input_ptr, int input_depth,
-                         int32 input_offset, int input_row_size,
-                         const uint8* filter_ptr, int32 filter_offset,
-                         const int32* bias_ptr, int32 output_offset,
-                         int32 output_multiplier, int output_shift,
-                         int32 output_activation_min,
-                         int32 output_activation_max, uint8* output_ptr,
-                         int output_depth, int output_width) {
-    Filter3x3x8 filter = Load3x3Filter(filter_ptr, filter_offset, output_depth);
-
-    const int16x8_t input_offset_vec = vdupq_n_s16(input_offset);
-    const int output_row_size = output_depth * output_width;
-
-    int16x8_t input_0, input_1, input_2, input_3, input_4, input_5, input_6,
-        input_7, input_8, input_9, input_10, input_11;
-
-    // Load inputs for 2x1 outputs starting from the top.
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3, temp_4, temp_5;
-
-      const uint8* ptr = input_ptr;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-      temp_2 = vld1_u8(ptr + 2 * input_depth);
-      ptr += input_row_size;
-      temp_3 = vld1_u8(ptr);
-      temp_4 = vld1_u8(ptr + input_depth);
-      temp_5 = vld1_u8(ptr + 2 * input_depth);
-
-      input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_2 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-      input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-      input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-
-      input_0 = vaddq_s16(input_0, input_offset_vec);
-      input_1 = vaddq_s16(input_1, input_offset_vec);
-      input_2 = vaddq_s16(input_2, input_offset_vec);
-      input_3 = vaddq_s16(input_3, input_offset_vec);
-      input_4 = vaddq_s16(input_4, input_offset_vec);
-      input_5 = vaddq_s16(input_5, input_offset_vec);
-
-      ptr += input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-      temp_2 = vld1_u8(ptr + 2 * input_depth);
-      ptr += input_row_size;
-      temp_3 = vld1_u8(ptr);
-      temp_4 = vld1_u8(ptr + input_depth);
-      temp_5 = vld1_u8(ptr + 2 * input_depth);
-
-      input_6 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_7 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_9 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-      input_10 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-      input_11 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-
-      input_6 = vaddq_s16(input_6, input_offset_vec);
-      input_7 = vaddq_s16(input_7, input_offset_vec);
-      input_8 = vaddq_s16(input_8, input_offset_vec);
-      input_9 = vaddq_s16(input_9, input_offset_vec);
-      input_10 = vaddq_s16(input_10, input_offset_vec);
-      input_11 = vaddq_s16(input_11, input_offset_vec);
-    }
-
-    DotProductAndStore2yStride1(
-        filter, input_0, input_1, input_2, input_3, input_4, input_5, input_6,
-        input_7, input_8, input_9, input_10, input_11, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max, output_ptr, output_row_size);
-
-    // Load inputs for bottom 2 rows.
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3, temp_4, temp_5;
-
-      const uint8* ptr = input_ptr + 4 * input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-      temp_2 = vld1_u8(ptr + 2 * input_depth);
-      ptr += input_row_size;
-      temp_3 = vld1_u8(ptr);
-      temp_4 = vld1_u8(ptr + input_depth);
-      temp_5 = vld1_u8(ptr + 2 * input_depth);
-
-      input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_2 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-      input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-      input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-
-      input_0 = vaddq_s16(input_0, input_offset_vec);
-      input_1 = vaddq_s16(input_1, input_offset_vec);
-      input_2 = vaddq_s16(input_2, input_offset_vec);
-      input_3 = vaddq_s16(input_3, input_offset_vec);
-      input_4 = vaddq_s16(input_4, input_offset_vec);
-      input_5 = vaddq_s16(input_5, input_offset_vec);
-    }
-
-    DotProductAndStore2yStride1(
-        filter, input_6, input_7, input_8, input_9, input_10, input_11, input_0,
-        input_1, input_2, input_3, input_4, input_5, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max, output_ptr + 2 * output_row_size,
-        output_row_size);
-  }
-};
-
-template <>
-struct ConvKernel3x3FilterDepth8<2, 2, 1, 1> {
-  static inline void Run(const uint8* input_ptr, int input_depth,
-                         int32 input_offset, int input_row_size,
-                         const uint8* filter_ptr, int32 filter_offset,
-                         const int32* bias_ptr, int32 output_offset,
-                         int32 output_multiplier, int output_shift,
-                         int32 output_activation_min,
-                         int32 output_activation_max, uint8* output_ptr,
-                         int output_depth, int output_width) {
-    Filter3x3x8 filter = Load3x3Filter(filter_ptr, filter_offset, output_depth);
-
-    Int32x8 acc_0, acc_1, acc_2, acc_3;
-
-    acc_0.low = vld1q_s32(bias_ptr);
-    acc_1.low = vld1q_s32(bias_ptr);
-    acc_2.low = vld1q_s32(bias_ptr);
-    acc_3.low = vld1q_s32(bias_ptr);
-
-    bias_ptr += 4;
-    acc_0.high = vld1q_s32(bias_ptr);
-    acc_1.high = vld1q_s32(bias_ptr);
-    acc_2.high = vld1q_s32(bias_ptr);
-    acc_3.high = vld1q_s32(bias_ptr);
-
-    const int16x8_t input_offset_vec = vdupq_n_s16(input_offset);
-
-    // Add scope for input registers to help the compiler know that it is
-    // not needed.
-    {
-      // To process 2x2 outputs using a 3x3 filter, we require 4x4 inputs.
-      // Load inputs for the top two filters first.
-      int16x8_t input_0, input_1, input_2, input_3, input_4, input_5, input_6,
-          input_7, input_8, input_9, input_10, input_11;
-
-      const uint8* ptr = input_ptr;
-
-      // Load top 3 rows.
-      {
-        uint8x8_t temp_0, temp_1, temp_2, temp_3;
-
-        temp_0 = vld1_u8(ptr);
-        temp_1 = vld1_u8(ptr + input_depth);
-        temp_2 = vld1_u8(ptr + 2 * input_depth);
-        temp_3 = vld1_u8(ptr + 3 * input_depth);
-
-        input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-        input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-        input_2 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-        input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-
-        input_0 = vaddq_s16(input_0, input_offset_vec);
-        input_1 = vaddq_s16(input_1, input_offset_vec);
-        input_2 = vaddq_s16(input_2, input_offset_vec);
-        input_3 = vaddq_s16(input_3, input_offset_vec);
-
-        ptr += input_row_size;
-        temp_0 = vld1_u8(ptr);
-        temp_1 = vld1_u8(ptr + input_depth);
-        temp_2 = vld1_u8(ptr + 2 * input_depth);
-        temp_3 = vld1_u8(ptr + 3 * input_depth);
-
-        input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-        input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-        input_6 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-        input_7 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-
-        input_4 = vaddq_s16(input_4, input_offset_vec);
-        input_5 = vaddq_s16(input_5, input_offset_vec);
-        input_6 = vaddq_s16(input_6, input_offset_vec);
-        input_7 = vaddq_s16(input_7, input_offset_vec);
-
-        ptr += input_row_size;
-        temp_0 = vld1_u8(ptr);
-        temp_1 = vld1_u8(ptr + input_depth);
-        temp_2 = vld1_u8(ptr + 2 * input_depth);
-        temp_3 = vld1_u8(ptr + 3 * input_depth);
-
-        input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-        input_9 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-        input_10 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-        input_11 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-
-        input_8 = vaddq_s16(input_8, input_offset_vec);
-        input_9 = vaddq_s16(input_9, input_offset_vec);
-        input_10 = vaddq_s16(input_10, input_offset_vec);
-        input_11 = vaddq_s16(input_11, input_offset_vec);
-      }
+// Enable for arm64 except for the Nvidia Linux 4 Tegra (L4T) running on
+// Jetson TX-2. This compiler does not support the offsetof() macro.
+#if defined(__aarch64__) && !defined(GOOGLE_L4T)
 
-      // Multiply-accum for top-left output.
-      acc_0 = MultiplyAccumulate3x3Filter(filter, input_0, input_1, input_2,
-                                          input_4, input_5, input_6, input_8,
-                                          input_9, input_10, acc_0);
-
-      // Multiply-accum for top-right output.
-      acc_1 = MultiplyAccumulate3x3Filter(filter, input_1, input_2, input_3,
-                                          input_5, input_6, input_7, input_9,
-                                          input_10, input_11, acc_1);
-
-      // Now load the bottom row.
-      {
-        uint8x8_t temp_0, temp_1, temp_2, temp_3;
-
-        ptr += input_row_size;
-        temp_0 = vld1_u8(ptr);
-        temp_1 = vld1_u8(ptr + input_depth);
-        temp_2 = vld1_u8(ptr + 2 * input_depth);
-        temp_3 = vld1_u8(ptr + 3 * input_depth);
-
-        input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-        input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-        input_2 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-        input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-
-        input_0 = vaddq_s16(input_0, input_offset_vec);
-        input_1 = vaddq_s16(input_1, input_offset_vec);
-        input_2 = vaddq_s16(input_2, input_offset_vec);
-        input_3 = vaddq_s16(input_3, input_offset_vec);
-      }
-
-      // Multiply-accum for bottom-left output.
-      acc_2 = MultiplyAccumulate3x3Filter(filter, input_4, input_5, input_6,
-                                          input_8, input_9, input_10, input_0,
-                                          input_1, input_2, acc_2);
-
-      // Multiply-accum for bottom-right output.
-      acc_3 = MultiplyAccumulate3x3Filter(filter, input_5, input_6, input_7,
-                                          input_9, input_10, input_11, input_1,
-                                          input_2, input_3, acc_3);
-    }
-
-    DownquantizeAndStore2x2Output(acc_0, acc_1, acc_2, acc_3, output_offset,
-                                  output_multiplier, output_shift,
-                                  output_activation_min, output_activation_max,
-                                  output_ptr, output_depth, output_width);
-  }
-};
-
-template <>
-struct ConvKernel3x3FilterDepth8<2, 4, 1, 1> {
-  static inline void Run(const uint8* input_ptr, int input_depth,
-                         int32 input_offset, int input_row_size,
-                         const uint8* filter_ptr, int32 filter_offset,
-                         const int32* bias_ptr, int32 output_offset,
-                         int32 output_multiplier, int output_shift,
-                         int32 output_activation_min,
-                         int32 output_activation_max, uint8* output_ptr,
-                         int output_depth, int output_width) {
-    Filter3x3x8 filter = Load3x3Filter(filter_ptr, filter_offset, output_depth);
-
-    const int16x8_t input_offset_vec = vdupq_n_s16(input_offset);
-    const int output_row_size = output_depth * output_width;
-
-    int16x8_t input_0, input_1, input_2, input_3, input_4, input_5, input_6,
-        input_7, input_8, input_9, input_10, input_11;
-
-    // Load inputs for 1x2 outputs starting from the top left.
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3;
-
-      const uint8* ptr = input_ptr;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-      temp_2 = vld1_u8(ptr + 2 * input_depth);
-      temp_3 = vld1_u8(ptr + 3 * input_depth);
-
-      input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_2 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-
-      input_0 = vaddq_s16(input_0, input_offset_vec);
-      input_1 = vaddq_s16(input_1, input_offset_vec);
-      input_2 = vaddq_s16(input_2, input_offset_vec);
-      input_3 = vaddq_s16(input_3, input_offset_vec);
-
-      ptr += input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-      temp_2 = vld1_u8(ptr + 2 * input_depth);
-      temp_3 = vld1_u8(ptr + 3 * input_depth);
-
-      input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_6 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_7 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-
-      input_4 = vaddq_s16(input_4, input_offset_vec);
-      input_5 = vaddq_s16(input_5, input_offset_vec);
-      input_6 = vaddq_s16(input_6, input_offset_vec);
-      input_7 = vaddq_s16(input_7, input_offset_vec);
-
-      ptr += input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-      temp_2 = vld1_u8(ptr + 2 * input_depth);
-      temp_3 = vld1_u8(ptr + 3 * input_depth);
-
-      input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_9 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_10 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_11 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-
-      input_8 = vaddq_s16(input_8, input_offset_vec);
-      input_9 = vaddq_s16(input_9, input_offset_vec);
-      input_10 = vaddq_s16(input_10, input_offset_vec);
-      input_11 = vaddq_s16(input_11, input_offset_vec);
-    }
+// clang-format gets confused with this file and ends up formatting lines to
+// be larger than 80 characters. Turn off here and back on at the end of the
+// file.
 
-    DotProductAndStore2xStride1(
-        filter, input_0, input_1, input_2, input_3, input_4, input_5, input_6,
-        input_7, input_8, input_9, input_10, input_11, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max, output_ptr, output_depth);
-
-    // Now load 1x2 inputs on the top right.
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3, temp_4, temp_5;
-
-      const uint8* ptr = input_ptr + 4 * input_depth;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_2 = vld1_u8(ptr);
-      temp_3 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_4 = vld1_u8(ptr);
-      temp_5 = vld1_u8(ptr + input_depth);
-
-      input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-      input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-      input_9 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-
-      input_0 = vaddq_s16(input_0, input_offset_vec);
-      input_1 = vaddq_s16(input_1, input_offset_vec);
-      input_4 = vaddq_s16(input_4, input_offset_vec);
-      input_5 = vaddq_s16(input_5, input_offset_vec);
-      input_8 = vaddq_s16(input_8, input_offset_vec);
-      input_9 = vaddq_s16(input_9, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_2, input_3, input_0, input_1, input_6, input_7, input_4,
-        input_5, input_10, input_11, input_8, input_9, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max, output_ptr + 2 * output_depth, output_depth);
-
-    // Now load next inputs when sliding window down.
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3;
-
-      const uint8* ptr = input_ptr + 2 * input_depth + 3 * input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-      temp_2 = vld1_u8(ptr + 2 * input_depth);
-      temp_3 = vld1_u8(ptr + 3 * input_depth);
-
-      input_2 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-
-      input_2 = vaddq_s16(input_2, input_offset_vec);
-      input_3 = vaddq_s16(input_3, input_offset_vec);
-      input_0 = vaddq_s16(input_0, input_offset_vec);
-      input_1 = vaddq_s16(input_1, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_6, input_7, input_4, input_5, input_10, input_11, input_8,
-        input_9, input_2, input_3, input_0, input_1, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max, output_ptr + 2 * output_depth + output_row_size,
-        output_depth);
-
-    // Now load next inputs when sliding window left.
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3, temp_4, temp_5;
-
-      const uint8* ptr = input_ptr + input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_2 = vld1_u8(ptr);
-      temp_3 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_4 = vld1_u8(ptr);
-      temp_5 = vld1_u8(ptr + input_depth);
-
-      input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_9 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-      input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-      input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-
-      input_4 = vaddq_s16(input_4, input_offset_vec);
-      input_5 = vaddq_s16(input_5, input_offset_vec);
-      input_8 = vaddq_s16(input_8, input_offset_vec);
-      input_9 = vaddq_s16(input_9, input_offset_vec);
-      input_0 = vaddq_s16(input_0, input_offset_vec);
-      input_1 = vaddq_s16(input_1, input_offset_vec);
-    }
+// clang-format off
 
-    DotProductAndStore2xStride1(
-        filter, input_4, input_5, input_6, input_7, input_8, input_9, input_10,
-        input_11, input_0, input_1, input_2, input_3, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max, output_ptr + output_row_size, output_depth);
-  }
-};
-
-template <>
-struct ConvKernel3x3FilterDepth8<1, 4, 1, 1> {
-  static inline void Run(const uint8* input_ptr, int input_depth,
-                         int32 input_offset, int input_row_size,
-                         const uint8* filter_ptr, int32 filter_offset,
-                         const int32* bias_ptr, int32 output_offset,
-                         int32 output_multiplier, int output_shift,
-                         int32 output_activation_min,
-                         int32 output_activation_max, uint8* output_ptr,
-                         int output_depth, int output_width) {
-    Filter3x3x8 filter = Load3x3Filter(filter_ptr, filter_offset, output_depth);
-
-    const int16x8_t input_offset_vec = vdupq_n_s16(input_offset);
-
-    int16x8_t input_0, input_1, input_2, input_3, input_4, input_5, input_6,
-        input_7, input_8, input_9, input_10, input_11;
-
-    // Load inputs for 1x2 outputs starting from the left.
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3;
-
-      const uint8* ptr = input_ptr;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-      temp_2 = vld1_u8(ptr + 2 * input_depth);
-      temp_3 = vld1_u8(ptr + 3 * input_depth);
-
-      input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_2 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-
-      input_0 = vaddq_s16(input_0, input_offset_vec);
-      input_1 = vaddq_s16(input_1, input_offset_vec);
-      input_2 = vaddq_s16(input_2, input_offset_vec);
-      input_3 = vaddq_s16(input_3, input_offset_vec);
-
-      ptr += input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-      temp_2 = vld1_u8(ptr + 2 * input_depth);
-      temp_3 = vld1_u8(ptr + 3 * input_depth);
-
-      input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_6 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_7 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-
-      input_4 = vaddq_s16(input_4, input_offset_vec);
-      input_5 = vaddq_s16(input_5, input_offset_vec);
-      input_6 = vaddq_s16(input_6, input_offset_vec);
-      input_7 = vaddq_s16(input_7, input_offset_vec);
-
-      ptr += input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-      temp_2 = vld1_u8(ptr + 2 * input_depth);
-      temp_3 = vld1_u8(ptr + 3 * input_depth);
-
-      input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_9 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_10 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_11 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-
-      input_8 = vaddq_s16(input_8, input_offset_vec);
-      input_9 = vaddq_s16(input_9, input_offset_vec);
-      input_10 = vaddq_s16(input_10, input_offset_vec);
-      input_11 = vaddq_s16(input_11, input_offset_vec);
-    }
-
-    DotProductAndStore2xStride1(
-        filter, input_0, input_1, input_2, input_3, input_4, input_5, input_6,
-        input_7, input_8, input_9, input_10, input_11, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max, output_ptr, output_depth);
-
-    // Now load 1x2 inputs on the right.
-    {
-      uint8x8_t temp_0, temp_1, temp_2, temp_3, temp_4, temp_5;
-
-      const uint8* ptr = input_ptr + input_depth * 4;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_2 = vld1_u8(ptr);
-      temp_3 = vld1_u8(ptr + input_depth);
-
-      ptr += input_row_size;
-      temp_4 = vld1_u8(ptr);
-      temp_5 = vld1_u8(ptr + input_depth);
-
-      input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-      input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-      input_9 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-
-      input_0 = vaddq_s16(input_0, input_offset_vec);
-      input_1 = vaddq_s16(input_1, input_offset_vec);
-      input_4 = vaddq_s16(input_4, input_offset_vec);
-      input_5 = vaddq_s16(input_5, input_offset_vec);
-      input_8 = vaddq_s16(input_8, input_offset_vec);
-      input_9 = vaddq_s16(input_9, input_offset_vec);
-    }
+#define DEPTHWISECONV_SHUFFLE_WORKSPACE_SIZE 10 * 10 * 64
 
-    DotProductAndStore2xStride1(
-        filter, input_2, input_3, input_0, input_1, input_6, input_7, input_4,
-        input_5, input_10, input_11, input_8, input_9, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max, output_ptr + 2 * output_depth, output_depth);
-  }
+// Encapsulates constant parameters used in DepthwiseConv.
+// 64-bit is used for types that will be added to 64-bit addresses in asm.
+struct DepthwiseConvParams {
+  int64_t input_depth;
+  int64_t input_row_size;
+  int64_t output_depth;
+  int64_t output_row_size;
+  int64_t filter_row_size;
+  int32 input_offset;
+  int32 output_offset;
+  int32 filter_offset;
+  int32 output_multiplier;
+  int32 output_activation_min;
+  int32 output_activation_max;
+  int32 output_shift;
+  int32 input_width;
+  int32 input_height;
+  int32 stride_width;
+  int32 stride_height;
+  int32 output_width;
+  int32 output_height;
 };
 
-template <>
-struct ConvKernel3x3FilterDepth8<2, 1, 1, 1> {
-  static inline void Run(const uint8* input_ptr, int input_depth,
-                         int32 input_offset, int input_row_size,
-                         const uint8* filter_ptr, int32 filter_offset,
-                         const int32* bias_ptr, int32 output_offset,
-                         int32 output_multiplier, int output_shift,
-                         int32 output_activation_min,
-                         int32 output_activation_max, uint8* output_ptr,
-                         int output_depth, int output_width) {
-    Filter3x3x8 filter = Load3x3Filter(filter_ptr, filter_offset, output_depth);
-
-    // To process 2x1 outputs using a 3x3 filter, we require 4x3 inputs.
-    // Load all inputs at the beginning.
-    int16x8_t input_0, input_1, input_2, input_3, input_4, input_5, input_6,
-        input_7, input_8, input_9, input_10, input_11;
-
-    // Load inputs for 1x2 outputs starting from the top left.
-    {
-      const int16x8_t input_offset_vec = vdupq_n_s16(input_offset);
-      uint8x8_t temp_0, temp_1, temp_2, temp_3, temp_4, temp_5;
-
-      const uint8* ptr = input_ptr;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-      temp_2 = vld1_u8(ptr + 2 * input_depth);
-      ptr += input_row_size;
-      temp_3 = vld1_u8(ptr);
-      temp_4 = vld1_u8(ptr + input_depth);
-      temp_5 = vld1_u8(ptr + 2 * input_depth);
-
-      input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_2 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-      input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-      input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-
-      input_0 = vaddq_s16(input_0, input_offset_vec);
-      input_1 = vaddq_s16(input_1, input_offset_vec);
-      input_2 = vaddq_s16(input_2, input_offset_vec);
-      input_3 = vaddq_s16(input_3, input_offset_vec);
-      input_4 = vaddq_s16(input_4, input_offset_vec);
-      input_5 = vaddq_s16(input_5, input_offset_vec);
-
-      ptr += input_row_size;
-      temp_0 = vld1_u8(ptr);
-      temp_1 = vld1_u8(ptr + input_depth);
-      temp_2 = vld1_u8(ptr + 2 * input_depth);
-      ptr += input_row_size;
-      temp_3 = vld1_u8(ptr);
-      temp_4 = vld1_u8(ptr + input_depth);
-      temp_5 = vld1_u8(ptr + 2 * input_depth);
-
-      input_6 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-      input_7 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-      input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-      input_9 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-      input_10 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-      input_11 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-
-      input_6 = vaddq_s16(input_6, input_offset_vec);
-      input_7 = vaddq_s16(input_7, input_offset_vec);
-      input_8 = vaddq_s16(input_8, input_offset_vec);
-      input_9 = vaddq_s16(input_9, input_offset_vec);
-      input_10 = vaddq_s16(input_10, input_offset_vec);
-      input_11 = vaddq_s16(input_11, input_offset_vec);
-    }
-
-    DotProductAndStore2yStride1(
-        filter, input_0, input_1, input_2, input_3, input_4, input_5, input_6,
-        input_7, input_8, input_9, input_10, input_11, bias_ptr, output_offset,
-        output_multiplier, output_shift, output_activation_min,
-        output_activation_max, output_ptr, output_depth * output_width);
-  }
-};
+#define STR(s) STR_UNEXPANDED(s)
+#define STR_UNEXPANDED(s) #s
+
+// Represents the number of bytes offset from the start of the
+// DepthwiseConvParams struct. This is used in the asm to load parameters.
+// Keep these values in sync with the static_asserts below.
+#define OFFSET_INPUT_DEPTH 0
+#define OFFSET_INPUT_ROW_SIZE 8
+#define OFFSET_OUTPUT_DEPTH 16
+#define OFFSET_OUTPUT_ROW_SIZE 24
+#define OFFSET_FILTER_ROW_SIZE 32
+#define OFFSET_INPUT_OFFSET 40
+#define OFFSET_OUTPUT_OFFSET 44
+#define OFFSET_FILTER_OFFSET 48
+#define OFFSET_OUTPUT_MULTIPLIER 52
+#define OFFSET_OUTPUT_ACTIVATION_MIN 56
+#define OFFSET_OUTPUT_ACTIVATION_MAX 60
+#define OFFSET_OUTPUT_SHIFT 64
+#define OFFSET_INPUT_WIDTH 68
+#define OFFSET_INPUT_HEIGHT 72
+#define OFFSET_STRIDE_WIDTH 76
+#define OFFSET_STRIDE_HEIGHT 80
+#define OFFSET_OUTPUT_WIDTH 84
+#define OFFSET_OUTPUT_HEIGHT 88
+
+static_assert(offsetof(DepthwiseConvParams, input_depth) ==
+                  OFFSET_INPUT_DEPTH, "");
+static_assert(offsetof(DepthwiseConvParams, input_row_size) ==
+                  OFFSET_INPUT_ROW_SIZE, "");
+static_assert(offsetof(DepthwiseConvParams, output_depth) ==
+                  OFFSET_OUTPUT_DEPTH, "");
+static_assert(offsetof(DepthwiseConvParams, output_row_size) ==
+                  OFFSET_OUTPUT_ROW_SIZE, "");
+static_assert(offsetof(DepthwiseConvParams, filter_row_size) ==
+                  OFFSET_FILTER_ROW_SIZE, "");
+static_assert(offsetof(DepthwiseConvParams, input_offset) ==
+                  OFFSET_INPUT_OFFSET, "");
+static_assert(offsetof(DepthwiseConvParams, output_offset) ==
+                  OFFSET_OUTPUT_OFFSET, "");
+static_assert(offsetof(DepthwiseConvParams, filter_offset) ==
+                  OFFSET_FILTER_OFFSET, "");
+static_assert(offsetof(DepthwiseConvParams, output_multiplier) ==
+                  OFFSET_OUTPUT_MULTIPLIER, "");
+static_assert(offsetof(DepthwiseConvParams, output_activation_min) ==
+                  OFFSET_OUTPUT_ACTIVATION_MIN, "");
+static_assert(offsetof(DepthwiseConvParams, output_activation_max) ==
+                  OFFSET_OUTPUT_ACTIVATION_MAX, "");
+static_assert(offsetof(DepthwiseConvParams, output_shift) ==
+                  OFFSET_OUTPUT_SHIFT, "");
+static_assert(offsetof(DepthwiseConvParams, input_width) ==
+                  OFFSET_INPUT_WIDTH, "");
+static_assert(offsetof(DepthwiseConvParams, input_height) ==
+                  OFFSET_INPUT_HEIGHT, "");
+static_assert(offsetof(DepthwiseConvParams, stride_width) ==
+                  OFFSET_STRIDE_WIDTH, "");
+static_assert(offsetof(DepthwiseConvParams, stride_height) ==
+                  OFFSET_STRIDE_HEIGHT, "");
+static_assert(offsetof(DepthwiseConvParams, output_width) ==
+                  OFFSET_OUTPUT_WIDTH, "");
+static_assert(offsetof(DepthwiseConvParams, output_height) ==
+                  OFFSET_OUTPUT_HEIGHT, "");
+
+template <int32 kDepth, int32 kStrideWidth, int32 kStrideHeight>
+struct DepthwiseConvWindow {};
 
 template <>
-struct ConvKernel3x3FilterDepth8<4, 2, 2, 2> {
-  static inline void Run(const uint8* input_ptr, int input_depth,
-                         int32 input_offset, int input_row_size,
-                         const uint8* filter_ptr, int32 filter_offset,
-                         const int32* bias_ptr, int32 output_offset,
-                         int32 output_multiplier, int output_shift,
-                         int32 output_activation_min,
-                         int32 output_activation_max, uint8* output_ptr,
-                         int output_depth, int output_width) {
-    const int output_row_size = output_depth * output_width;
-
-    Filter3x3x8 filter = Load3x3Filter(filter_ptr, filter_offset, output_depth);
-
-    Int32x8 acc_0, acc_1;
-    acc_0.low = vld1q_s32(bias_ptr);
-    acc_1.low = vld1q_s32(bias_ptr);
-    acc_0.high = vld1q_s32(bias_ptr + 4);
-    acc_1.high = vld1q_s32(bias_ptr + 4);
-
-    const int16x8_t input_offset_vec = vdupq_n_s16(input_offset);
-
-    int16x8_t input_0, input_1, input_2, input_3, input_4, input_5, input_6,
-        input_7, input_8, input_9;
-
-    const uint8* ptr = input_ptr;
-    uint8x8_t temp_0, temp_1, temp_2, temp_3, temp_4;
-
-    // Load first 2 rows.
-    temp_0 = vld1_u8(ptr);
-    temp_1 = vld1_u8(ptr + input_depth);
-    temp_2 = vld1_u8(ptr + 2 * input_depth);
-    temp_3 = vld1_u8(ptr + 3 * input_depth);
-    temp_4 = vld1_u8(ptr + 4 * input_depth);
-
-    input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-    input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-    input_2 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-    input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-    input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-
-    input_0 = vaddq_s16(input_0, input_offset_vec);
-    input_1 = vaddq_s16(input_1, input_offset_vec);
-    input_2 = vaddq_s16(input_2, input_offset_vec);
-    input_3 = vaddq_s16(input_3, input_offset_vec);
-    input_4 = vaddq_s16(input_4, input_offset_vec);
-
-    ptr += input_row_size;
-    temp_0 = vld1_u8(ptr);
-    temp_1 = vld1_u8(ptr + input_depth);
-    temp_2 = vld1_u8(ptr + 2 * input_depth);
-    temp_3 = vld1_u8(ptr + 3 * input_depth);
-    temp_4 = vld1_u8(ptr + 4 * input_depth);
-
-    input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-    input_6 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-    input_7 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-    input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-    input_9 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-
-    input_5 = vaddq_s16(input_5, input_offset_vec);
-    input_6 = vaddq_s16(input_6, input_offset_vec);
-    input_7 = vaddq_s16(input_7, input_offset_vec);
-    input_8 = vaddq_s16(input_8, input_offset_vec);
-    input_9 = vaddq_s16(input_9, input_offset_vec);
-
-    acc_0 = MultiplyAccumulateRow(acc_0, filter.f0, filter.f1, filter.f2,
-                                  input_0, input_1, input_2);
-
-    acc_1 = MultiplyAccumulateRow(acc_1, filter.f0, filter.f1, filter.f2,
-                                  input_2, input_3, input_4);
-
-    acc_0 = MultiplyAccumulateRow(acc_0, filter.f3, filter.f4, filter.f5,
-                                  input_5, input_6, input_7);
-
-    acc_1 = MultiplyAccumulateRow(acc_1, filter.f3, filter.f4, filter.f5,
-                                  input_7, input_8, input_9);
-
-    // Load next 2 rows.
-    ptr += input_row_size;
-    temp_0 = vld1_u8(ptr);
-    temp_1 = vld1_u8(ptr + input_depth);
-    temp_2 = vld1_u8(ptr + 2 * input_depth);
-    temp_3 = vld1_u8(ptr + 3 * input_depth);
-    temp_4 = vld1_u8(ptr + 4 * input_depth);
-
-    input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-    input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-    input_2 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-    input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-    input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-
-    input_0 = vaddq_s16(input_0, input_offset_vec);
-    input_1 = vaddq_s16(input_1, input_offset_vec);
-    input_2 = vaddq_s16(input_2, input_offset_vec);
-    input_3 = vaddq_s16(input_3, input_offset_vec);
-    input_4 = vaddq_s16(input_4, input_offset_vec);
-
-    ptr += input_row_size;
-    temp_0 = vld1_u8(ptr);
-    temp_1 = vld1_u8(ptr + input_depth);
-    temp_2 = vld1_u8(ptr + 2 * input_depth);
-    temp_3 = vld1_u8(ptr + 3 * input_depth);
-    temp_4 = vld1_u8(ptr + 4 * input_depth);
-
-    input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-    input_6 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-    input_7 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-    input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-    input_9 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-
-    input_5 = vaddq_s16(input_5, input_offset_vec);
-    input_6 = vaddq_s16(input_6, input_offset_vec);
-    input_7 = vaddq_s16(input_7, input_offset_vec);
-    input_8 = vaddq_s16(input_8, input_offset_vec);
-    input_9 = vaddq_s16(input_9, input_offset_vec);
-
-    acc_0 = MultiplyAccumulateRow(acc_0, filter.f6, filter.f7, filter.f8,
-                                  input_0, input_1, input_2);
-
-    acc_1 = MultiplyAccumulateRow(acc_1, filter.f6, filter.f7, filter.f8,
-                                  input_2, input_3, input_4);
-
-    DownquantizeAndStore2Output(
-        acc_0, acc_1, output_offset, output_multiplier, output_shift,
-        output_activation_min, output_activation_max, output_ptr, output_depth);
-
-    output_ptr += output_row_size;
-
-    // Moving onto the next row of outputs.
-    acc_0.low = vld1q_s32(bias_ptr);
-    acc_1.low = vld1q_s32(bias_ptr);
-    acc_0.high = vld1q_s32(bias_ptr + 4);
-    acc_1.high = vld1q_s32(bias_ptr + 4);
-
-    acc_0 = MultiplyAccumulateRow(acc_0, filter.f0, filter.f1, filter.f2,
-                                  input_0, input_1, input_2);
-
-    acc_1 = MultiplyAccumulateRow(acc_1, filter.f0, filter.f1, filter.f2,
-                                  input_2, input_3, input_4);
-
-    acc_0 = MultiplyAccumulateRow(acc_0, filter.f3, filter.f4, filter.f5,
-                                  input_5, input_6, input_7);
-
-    acc_1 = MultiplyAccumulateRow(acc_1, filter.f3, filter.f4, filter.f5,
-                                  input_7, input_8, input_9);
-
-    // Load next 2 rows.
-    ptr += input_row_size;
-    temp_0 = vld1_u8(ptr);
-    temp_1 = vld1_u8(ptr + input_depth);
-    temp_2 = vld1_u8(ptr + 2 * input_depth);
-    temp_3 = vld1_u8(ptr + 3 * input_depth);
-    temp_4 = vld1_u8(ptr + 4 * input_depth);
-
-    input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-    input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-    input_2 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-    input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-    input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-
-    input_0 = vaddq_s16(input_0, input_offset_vec);
-    input_1 = vaddq_s16(input_1, input_offset_vec);
-    input_2 = vaddq_s16(input_2, input_offset_vec);
-    input_3 = vaddq_s16(input_3, input_offset_vec);
-    input_4 = vaddq_s16(input_4, input_offset_vec);
-
-    ptr += input_row_size;
-    temp_0 = vld1_u8(ptr);
-    temp_1 = vld1_u8(ptr + input_depth);
-    temp_2 = vld1_u8(ptr + 2 * input_depth);
-    temp_3 = vld1_u8(ptr + 3 * input_depth);
-    temp_4 = vld1_u8(ptr + 4 * input_depth);
-
-    input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-    input_6 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-    input_7 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-    input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-    input_9 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-
-    input_5 = vaddq_s16(input_5, input_offset_vec);
-    input_6 = vaddq_s16(input_6, input_offset_vec);
-    input_7 = vaddq_s16(input_7, input_offset_vec);
-    input_8 = vaddq_s16(input_8, input_offset_vec);
-    input_9 = vaddq_s16(input_9, input_offset_vec);
-
-    acc_0 = MultiplyAccumulateRow(acc_0, filter.f6, filter.f7, filter.f8,
-                                  input_0, input_1, input_2);
-
-    acc_1 = MultiplyAccumulateRow(acc_1, filter.f6, filter.f7, filter.f8,
-                                  input_2, input_3, input_4);
-
-    DownquantizeAndStore2Output(
-        acc_0, acc_1, output_offset, output_multiplier, output_shift,
-        output_activation_min, output_activation_max, output_ptr, output_depth);
-
-    output_ptr += output_row_size;
-
-    // Moving onto the next row of outputs.
-    acc_0.low = vld1q_s32(bias_ptr);
-    acc_1.low = vld1q_s32(bias_ptr);
-    acc_0.high = vld1q_s32(bias_ptr + 4);
-    acc_1.high = vld1q_s32(bias_ptr + 4);
-
-    acc_0 = MultiplyAccumulateRow(acc_0, filter.f0, filter.f1, filter.f2,
-                                  input_0, input_1, input_2);
-
-    acc_1 = MultiplyAccumulateRow(acc_1, filter.f0, filter.f1, filter.f2,
-                                  input_2, input_3, input_4);
-
-    acc_0 = MultiplyAccumulateRow(acc_0, filter.f3, filter.f4, filter.f5,
-                                  input_5, input_6, input_7);
-
-    acc_1 = MultiplyAccumulateRow(acc_1, filter.f3, filter.f4, filter.f5,
-                                  input_7, input_8, input_9);
-
-    // Load next 2 rows.
-    ptr += input_row_size;
-    temp_0 = vld1_u8(ptr);
-    temp_1 = vld1_u8(ptr + input_depth);
-    temp_2 = vld1_u8(ptr + 2 * input_depth);
-    temp_3 = vld1_u8(ptr + 3 * input_depth);
-    temp_4 = vld1_u8(ptr + 4 * input_depth);
-
-    input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-    input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-    input_2 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-    input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-    input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-
-    input_0 = vaddq_s16(input_0, input_offset_vec);
-    input_1 = vaddq_s16(input_1, input_offset_vec);
-    input_2 = vaddq_s16(input_2, input_offset_vec);
-    input_3 = vaddq_s16(input_3, input_offset_vec);
-    input_4 = vaddq_s16(input_4, input_offset_vec);
-
-    ptr += input_row_size;
-    temp_0 = vld1_u8(ptr);
-    temp_1 = vld1_u8(ptr + input_depth);
-    temp_2 = vld1_u8(ptr + 2 * input_depth);
-    temp_3 = vld1_u8(ptr + 3 * input_depth);
-    temp_4 = vld1_u8(ptr + 4 * input_depth);
-
-    input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-    input_6 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-    input_7 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-    input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-    input_9 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-
-    input_5 = vaddq_s16(input_5, input_offset_vec);
-    input_6 = vaddq_s16(input_6, input_offset_vec);
-    input_7 = vaddq_s16(input_7, input_offset_vec);
-    input_8 = vaddq_s16(input_8, input_offset_vec);
-    input_9 = vaddq_s16(input_9, input_offset_vec);
-
-    acc_0 = MultiplyAccumulateRow(acc_0, filter.f6, filter.f7, filter.f8,
-                                  input_0, input_1, input_2);
-
-    acc_1 = MultiplyAccumulateRow(acc_1, filter.f6, filter.f7, filter.f8,
-                                  input_2, input_3, input_4);
-
-    DownquantizeAndStore2Output(
-        acc_0, acc_1, output_offset, output_multiplier, output_shift,
-        output_activation_min, output_activation_max, output_ptr, output_depth);
-
-    output_ptr += output_row_size;
-
-    // Moving onto the next row of outputs.
-    acc_0.low = vld1q_s32(bias_ptr);
-    acc_1.low = vld1q_s32(bias_ptr);
-    acc_0.high = vld1q_s32(bias_ptr + 4);
-    acc_1.high = vld1q_s32(bias_ptr + 4);
-
-    acc_0 = MultiplyAccumulateRow(acc_0, filter.f0, filter.f1, filter.f2,
-                                  input_0, input_1, input_2);
-
-    acc_1 = MultiplyAccumulateRow(acc_1, filter.f0, filter.f1, filter.f2,
-                                  input_2, input_3, input_4);
-
-    acc_0 = MultiplyAccumulateRow(acc_0, filter.f3, filter.f4, filter.f5,
-                                  input_5, input_6, input_7);
-
-    acc_1 = MultiplyAccumulateRow(acc_1, filter.f3, filter.f4, filter.f5,
-                                  input_7, input_8, input_9);
-
-    // Load last row.
-    ptr += input_row_size;
-    temp_0 = vld1_u8(ptr);
-    temp_1 = vld1_u8(ptr + input_depth);
-    temp_2 = vld1_u8(ptr + 2 * input_depth);
-    temp_3 = vld1_u8(ptr + 3 * input_depth);
-    temp_4 = vld1_u8(ptr + 4 * input_depth);
-
-    input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-    input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-    input_2 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-    input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-    input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-
-    input_0 = vaddq_s16(input_0, input_offset_vec);
-    input_1 = vaddq_s16(input_1, input_offset_vec);
-    input_2 = vaddq_s16(input_2, input_offset_vec);
-    input_3 = vaddq_s16(input_3, input_offset_vec);
-    input_4 = vaddq_s16(input_4, input_offset_vec);
-
-    acc_0 = MultiplyAccumulateRow(acc_0, filter.f6, filter.f7, filter.f8,
-                                  input_0, input_1, input_2);
-
-    acc_1 = MultiplyAccumulateRow(acc_1, filter.f6, filter.f7, filter.f8,
-                                  input_2, input_3, input_4);
-
-    DownquantizeAndStore2Output(
-        acc_0, acc_1, output_offset, output_multiplier, output_shift,
-        output_activation_min, output_activation_max, output_ptr, output_depth);
+struct DepthwiseConvWindow<8, 1, 1> {
+ public:
+  static inline void Run(const uint8* input_ptr, const uint8* filter_ptr,
+                  const int32* bias_ptr, uint8* output_ptr, int64_t input_depth,
+                  int64_t input_row_size, int32 output_window_height,
+                  int32 output_window_width,
+                  const DepthwiseConvParams* params_ptr) {
+    const int64_t input_width_increment = 2 * input_depth;
+    const int64_t input_height_increment = 2 * input_row_size;
+    const int64_t output_height_increment = 2 * params_ptr->output_row_size;
+
+#define DEPTHWISECONV_LABEL_HEIGHT_2_LOOP "1"
+#define DEPTHWISECONV_LABEL_HEIGHT_2_WIDTH_2_LOOP "2"
+#define DEPTHWISECONV_LABEL_HEIGHT_2_WIDTH_1_LEFTOVER "3"
+#define DEPTHWISECONV_LABEL_HEIGHT_2_WIDTH_2_LEFTOVER "4"
+#define DEPTHWISECONV_LABEL_HEIGHT_2_WIDTH_2_AFTER_LOOP "5"
+#define DEPTHWISECONV_LABEL_HEIGHT_2_AFTER_LOOP "6"
+#define DEPTHWISECONV_LABEL_HEIGHT_1 "7"
+#define DEPTHWISECONV_LABEL_HEIGHT_1_WIDTH_2_LOOP "8"
+#define DEPTHWISECONV_LABEL_HEIGHT_1_WIDTH_1_LEFTOVER "9"
+#define DEPTHWISECONV_LABEL_HEIGHT_1_WIDTH_2_LEFTOVER "10"
+#define DEPTHWISECONV_LABEL_HEIGHT_1_END "11"
+
+    asm volatile(
+        // Performs depthwise convolutions for a window specified by
+        // |output_window_height| and |output_window_width|. The inner-most loop
+        // processes 2x2 outputs, and any leftovers at the end.
+        //
+        // Algorithm works as follows:
+        //
+        //   1. Load filters of 8 depth (8x3x3). Registers v0--v8 hold filter
+        //      values.
+        //   2. For 2 output heights at a time:
+        //        i.  For 2 output widths at a time, load inputs for a 2x1 (2
+        //            height, 1 width) output window (4x3 input window).
+        //            Registers v9--v20 hold input values. Mul-add with
+        //            accumulators v21--v24. Then run activation, downquantize
+        //            and store. Repeat for the next 2x1 output window,
+        //            leveraging overlapping inputs.
+        //        ii. Handle single leftover width if exists.
+        //   3. Handle single leftover height if exists.
+        //        i.  For 2 output widths at a time, load inputs for a 1x2 (1
+        //            height, 2 width) output window (3x4 input window).
+        //            Registers v9--v20 hold input values. Mul-add with
+        //            accumulators v21--v24. Then run activation, downquantize
+        //            and store. Repeat for the next 1x2 output window,
+        //            leveraging overlapping inputs.
+        //        ii. Handle single leftover width if exists.
+        //
+        // Loads are placed as soon as the register is no longer needed and
+        // interleaved with arithmetic operations to take advantage of
+        // dual-issue pipelines. We also add input offsets as far from the loads
+        // as possible to give loads enough cycles to fetch data from memory.
+
+        // Set "constant" registers. These registers may be replaced with temp
+        // values from time to time when there are not enough NEON registers.
+        // We use x9--x15 general purpose registers as they are caller-saved
+        // temporary registers (see http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf).  // NOLINT
+        "ldr w9, [%[params_ptr], #" STR(OFFSET_INPUT_OFFSET) "]\n"
+        "ldr x3, [%[params_ptr], #" STR(OFFSET_OUTPUT_DEPTH) "]\n"
+        "cmp %w[output_window_height], #2\n"
+        "dup v26.8h, w9\n"
+        "ldr w9, [%[params_ptr], #" STR(OFFSET_OUTPUT_MULTIPLIER) "]\n"
+        "ldr w2, [%[params_ptr], #" STR(OFFSET_OUTPUT_OFFSET) "]\n"
+        "dup v27.4s, w9\n"
+        "ldr w9, [%[params_ptr], #" STR(OFFSET_OUTPUT_SHIFT) "]\n"
+        "dup v29.4s, w2\n"
+        "ldr w4, [%[params_ptr], #" STR(OFFSET_OUTPUT_ACTIVATION_MIN) "]\n"
+        "dup v30.4s, w4\n"
+        "ldr w0, [%[params_ptr], #" STR(OFFSET_OUTPUT_ACTIVATION_MAX) "]\n"
+        "dup v31.4s, w0\n"
+        "neg w9, w9\n"
+        "dup v28.4s, w9\n"
+        "ldr w9, [%[params_ptr], #" STR(OFFSET_FILTER_OFFSET) "]\n"
+        "add x10, %[bias_ptr], #16\n"
+        "ldr x1, [%[params_ptr], #" STR(OFFSET_OUTPUT_ROW_SIZE) "]\n"
+        "dup v9.8h, w9\n"
+
+        // Load filters and add offsets.
+        "ld1 {v0.8b}, [%[filter_ptr]], x3\n"
+        "ld1 {v1.8b}, [%[filter_ptr]], x3\n"
+        "uaddw v0.8h, v9.8h, v0.8b\n"
+        "ld1 {v2.8b}, [%[filter_ptr]], x3\n"
+        "uaddw v1.8h, v9.8h, v1.8b\n"
+        "ld1 {v3.8b}, [%[filter_ptr]], x3\n"
+        "uaddw v2.8h, v9.8h, v2.8b\n"
+        "ld1 {v4.8b}, [%[filter_ptr]], x3\n"
+        "uaddw v3.8h, v9.8h, v3.8b\n"
+        "ld1 {v5.8b}, [%[filter_ptr]], x3\n"
+        "uaddw v4.8h, v9.8h, v4.8b\n"
+        "ld1 {v6.8b}, [%[filter_ptr]], x3\n"
+        "uaddw v5.8h, v9.8h, v5.8b\n"
+        "ld1 {v7.8b}, [%[filter_ptr]], x3\n"
+        "uaddw v6.8h, v9.8h, v6.8b\n"
+        "ld1 {v8.8b}, [%[filter_ptr]], x3\n"
+        "uaddw v7.8h, v9.8h, v7.8b\n"
+        "uaddw v8.8h, v9.8h, v8.8b\n"
+
+        "blt " DEPTHWISECONV_LABEL_HEIGHT_2_AFTER_LOOP "f\n"
+
+        //"loop_%=:\n"
+        DEPTHWISECONV_LABEL_HEIGHT_2_LOOP ":\n"
+          // This loop processes 2x2 outputs. To avoid register exhaustion,
+          // inputs for the left 2 outputs are loaded first, then the right
+          // two outputs.
+          "mov x11, %[input_ptr]\n"
+          "mov x12, x11\n"
+          "ld1 {v9.8b}, [x12], %[input_depth]\n"
+          "add x13, x11, %[input_row_size]\n"
+          "ld1 {v10.8b}, [x12], %[input_depth]\n"
+          "add x14, x13, %[input_row_size]\n"
+          "ld1 {v11.8b}, [x12], %[input_depth]\n"
+          "add x15, x14, %[input_row_size]\n"
+          "ld1 {v12.8b}, [x13], %[input_depth]\n"
+          "mov w5, %w[output_window_width]\n"
+          "ld1 {v13.8b}, [x13], %[input_depth]\n"
+          "mov x6, %[output_ptr]\n"
+          "ld1 {v14.8b}, [x13], %[input_depth]\n"
+          "add x7, %[output_ptr], x1\n"
+          "ld1 {v15.8b}, [x14], %[input_depth]\n"
+          // The height 2 / width 2 loop loads an extra 2x1 outputs (2 height,
+          // 1 width) in anticipation for the next iteration. Make sure
+          // |output_window_width| is large enough to handle the additional
+          // loads, otherwise jump to specific the appropriate label to handle
+          // smaller widths.
+          "cmp w5, #2\n"
+          "uaddw v9.8h, v26.8h, v9.8b\n"
+          "ld1 {v16.8b}, [x14], %[input_depth]\n"
+          "uaddw v10.8h, v26.8h, v10.8b\n"
+          "ld1 {v17.8b}, [x14], %[input_depth]\n"
+          "uaddw v11.8h, v26.8h, v11.8b\n"
+          "ld1 {v18.8b}, [x15], %[input_depth]\n"
+          "uaddw v12.8h, v26.8h, v12.8b\n"
+          "ld1 {v19.8b}, [x15], %[input_depth]\n"
+          "uaddw v13.8h, v26.8h, v13.8b\n"
+          "ld1 {v20.8b}, [x15], %[input_depth]\n"
+          "uaddw v14.8h, v26.8h, v14.8b\n"
+          "ld1 {v21.4s}, [%[bias_ptr]]\n"
+          "uaddw v15.8h, v26.8h, v15.8b\n"
+          "ld1 {v22.4s}, [x10]\n"
+          "uaddw v16.8h, v26.8h, v16.8b\n"
+          "ld1 {v23.4s}, [%[bias_ptr]]\n"
+          "uaddw v17.8h, v26.8h, v17.8b\n"
+          "ld1 {v24.4s}, [x10]\n"
+          "uaddw v18.8h, v26.8h, v18.8b\n"
+          "uaddw v19.8h, v26.8h, v19.8b\n"
+          "uaddw v20.8h, v26.8h, v20.8b\n"
+
+          "beq " DEPTHWISECONV_LABEL_HEIGHT_2_WIDTH_2_LEFTOVER "f\n"
+          "cmp w5, #1\n"
+          "beq " DEPTHWISECONV_LABEL_HEIGHT_2_WIDTH_1_LEFTOVER "f\n"
+
+          //"loop_%=:\n"
+          DEPTHWISECONV_LABEL_HEIGHT_2_WIDTH_2_LOOP ":\n"
+            // Mul-add left outputs.
+            "smlal v21.4s, v0.4h, v9.4h\n"
+            "subs w5, w5, #2\n"
+            "smlal2 v22.4s, v0.8h, v9.8h\n"
+            "cmp w5, #3\n"
+            "smlal v23.4s, v0.4h, v12.4h\n"
+            "ld1 {v9.8b}, [x12]\n"
+            "smlal2 v24.4s, v0.8h, v12.8h\n"
+            "smlal v21.4s, v1.4h, v10.4h\n"
+            "smlal2 v22.4s, v1.8h, v10.8h\n"
+            "smlal v23.4s, v1.4h, v13.4h\n"
+            "smlal2 v24.4s, v1.8h, v13.8h\n"
+            "smlal v21.4s, v2.4h, v11.4h\n"
+            "smlal2 v22.4s, v2.8h, v11.8h\n"
+            "smlal v23.4s, v2.4h, v14.4h\n"
+            "smlal2 v24.4s, v2.8h, v14.8h\n"
+            "smlal v21.4s, v3.4h, v12.4h\n"
+            "smlal2 v22.4s, v3.8h, v12.8h\n"
+            "ld1 {v12.8b}, [x13]\n"
+            "smlal v23.4s, v3.4h, v15.4h\n"
+            "smlal2 v24.4s, v3.8h, v15.8h\n"
+            "smlal v21.4s, v4.4h, v13.4h\n"
+            "smlal2 v22.4s, v4.8h, v13.8h\n"
+            "smlal v23.4s, v4.4h, v16.4h\n"
+            "smlal2 v24.4s, v4.8h, v16.8h\n"
+            "smlal v21.4s, v5.4h, v14.4h\n"
+            "smlal2 v22.4s, v5.8h, v14.8h\n"
+            "smlal v23.4s, v5.4h, v17.4h\n"
+            "smlal2 v24.4s, v5.8h, v17.8h\n"
+            "smlal v21.4s, v6.4h, v15.4h\n"
+            "smlal2 v22.4s, v6.8h, v15.8h\n"
+            "ld1 {v15.8b}, [x14]\n"
+            "smlal v23.4s, v6.4h, v18.4h\n"
+            "smlal2 v24.4s, v6.8h, v18.8h\n"
+            "ld1 {v18.8b}, [x15]\n"
+            "smlal v21.4s, v7.4h, v16.4h\n"
+            "smlal2 v22.4s, v7.8h, v16.8h\n"
+            "smlal v23.4s, v7.4h, v19.4h\n"
+            "smlal2 v24.4s, v7.8h, v19.8h\n"
+            "smlal v21.4s, v8.4h, v17.4h\n"
+            "smlal2 v22.4s, v8.8h, v17.8h\n"
+            "smlal v23.4s, v8.4h, v20.4h\n"
+            "smlal2 v24.4s, v8.8h, v20.8h\n"
+
+            "sqrdmulh v21.4s, v21.4s, v27.4s\n"
+            "sqrdmulh v22.4s, v22.4s, v27.4s\n"
+            "sqrdmulh v23.4s, v23.4s, v27.4s\n"
+            "sqrdmulh v24.4s, v24.4s, v27.4s\n"
+            "and v25.16b, v21.16b, v28.16b\n"
+            "and v29.16b, v22.16b, v28.16b\n"
+            "and v30.16b, v23.16b, v28.16b\n"
+            "and v31.16b, v24.16b, v28.16b\n"
+            "sshr v25.4s, v25.4s, #31\n"
+            "sshr v29.4s, v29.4s, #31\n"
+            "sshr v30.4s, v30.4s, #31\n"
+            "sshr v31.4s, v31.4s, #31\n"
+            "sqadd v21.4s, v21.4s, v25.4s\n"
+            "sqadd v22.4s, v22.4s, v29.4s\n"
+            "dup v29.4s, w2\n"
+            "sqadd v23.4s, v23.4s, v30.4s\n"
+            "dup v30.4s, w4\n"
+            "sqadd v24.4s, v24.4s, v31.4s\n"
+            "dup v31.4s, w0\n"
+            "srshl v21.4s, v21.4s, v28.4s\n"
+            "srshl v22.4s, v22.4s, v28.4s\n"
+            "srshl v23.4s, v23.4s, v28.4s\n"
+            "srshl v24.4s, v24.4s, v28.4s\n"
+            "add v21.4s, v21.4s, v29.4s\n"
+            "add v22.4s, v22.4s, v29.4s\n"
+            "add v23.4s, v23.4s, v29.4s\n"
+            "add v24.4s, v24.4s, v29.4s\n"
+            "smax v21.4s, v21.4s, v30.4s\n"
+            "smax v22.4s, v22.4s, v30.4s\n"
+            "smax v23.4s, v23.4s, v30.4s\n"
+            "smax v24.4s, v24.4s, v30.4s\n"
+            "smin v21.4s, v21.4s, v31.4s\n"
+            "smin v22.4s, v22.4s, v31.4s\n"
+            "smin v23.4s, v23.4s, v31.4s\n"
+            "smin v24.4s, v24.4s, v31.4s\n"
+            "sqxtn v21.4h, v21.4s\n"
+            "sqxtn v23.4h, v23.4s\n"
+            "sqxtn2 v21.8h, v22.4s\n"
+            "ld1 {v22.4s}, [x10]\n"
+            "sqxtn2 v23.8h, v24.4s\n"
+            "ld1 {v24.4s}, [x10]\n"
+            "sqxtun v21.8b, v21.8h\n"
+            "sqxtun v23.8b, v23.8h\n"
+            "uaddw v9.8h, v26.8h, v9.8b\n"
+            "st1 {v21.8b}, [x6], x3\n"
+            "uaddw v12.8h, v26.8h, v12.8b\n"
+            "st1 {v23.8b}, [x7], x3\n"
+            "uaddw v15.8h, v26.8h, v15.8b\n"
+            "ld1 {v21.4s}, [%[bias_ptr]]\n"
+            "uaddw v18.8h, v26.8h, v18.8b\n"
+            "ld1 {v23.4s}, [%[bias_ptr]]\n"
+
+            // Mul-add right outputs.
+            "smlal v21.4s, v0.4h, v10.4h\n"
+            "add x11, x11, %[input_width_increment]\n"
+            "smlal2 v22.4s, v0.8h, v10.8h\n"
+            "mov x12, x11\n"
+            "smlal v23.4s, v0.4h, v13.4h\n"
+            "add x13, x11, %[input_row_size]\n"
+            "smlal2 v24.4s, v0.8h, v13.8h\n"
+            "add x14, x13, %[input_row_size]\n"
+            "smlal v21.4s, v1.4h, v11.4h\n"
+            "add x15, x14, %[input_row_size]\n"
+            "smlal2 v22.4s, v1.8h, v11.8h\n"
+            "smlal v23.4s, v1.4h, v14.4h\n"
+            "smlal2 v24.4s, v1.8h, v14.8h\n"
+            "smlal v21.4s, v2.4h, v9.4h\n"
+            "smlal2 v22.4s, v2.8h, v9.8h\n"
+            "ld1 {v9.8b}, [x12], %[input_depth]\n"
+            "smlal v23.4s, v2.4h, v12.4h\n"
+            "ld1 {v10.8b}, [x12], %[input_depth]\n"
+            "smlal2 v24.4s, v2.8h, v12.8h\n"
+            "ld1 {v11.8b}, [x12], %[input_depth]\n"
+            "smlal v21.4s, v3.4h, v13.4h\n"
+            "smlal2 v22.4s, v3.8h, v13.8h\n"
+            "smlal v23.4s, v3.4h, v16.4h\n"
+            "smlal2 v24.4s, v3.8h, v16.8h\n"
+            "smlal v21.4s, v4.4h, v14.4h\n"
+            "smlal2 v22.4s, v4.8h, v14.8h\n"
+            "smlal v23.4s, v4.4h, v17.4h\n"
+            "smlal2 v24.4s, v4.8h, v17.8h\n"
+            "smlal v21.4s, v5.4h, v12.4h\n"
+            "smlal2 v22.4s, v5.8h, v12.8h\n"
+            "ld1 {v12.8b}, [x13], %[input_depth]\n"
+            "smlal v23.4s, v5.4h, v15.4h\n"
+            "ld1 {v13.8b}, [x13], %[input_depth]\n"
+            "smlal2 v24.4s, v5.8h, v15.8h\n"
+            "ld1 {v14.8b}, [x13], %[input_depth]\n"
+            "smlal v21.4s, v6.4h, v16.4h\n"
+            "smlal2 v22.4s, v6.8h, v16.8h\n"
+            "smlal v23.4s, v6.4h, v19.4h\n"
+            "smlal2 v24.4s, v6.8h, v19.8h\n"
+            "smlal v21.4s, v7.4h, v17.4h\n"
+            "smlal2 v22.4s, v7.8h, v17.8h\n"
+            "smlal v23.4s, v7.4h, v20.4h\n"
+            "smlal2 v24.4s, v7.8h, v20.8h\n"
+            "smlal v21.4s, v8.4h, v15.4h\n"
+            "smlal2 v22.4s, v8.8h, v15.8h\n"
+            "ld1 {v15.8b}, [x14], %[input_depth]\n"
+            "smlal v23.4s, v8.4h, v18.4h\n"
+            "ld1 {v16.8b}, [x14], %[input_depth]\n"
+            "smlal2 v24.4s, v8.8h, v18.8h\n"
+            "ld1 {v17.8b}, [x14], %[input_depth]\n"
+
+            "sqrdmulh v21.4s, v21.4s, v27.4s\n"
+            "ld1 {v18.8b}, [x15], %[input_depth]\n"
+            "sqrdmulh v22.4s, v22.4s, v27.4s\n"
+            "ld1 {v19.8b}, [x15], %[input_depth]\n"
+            "sqrdmulh v23.4s, v23.4s, v27.4s\n"
+            "ld1 {v20.8b}, [x15], %[input_depth]\n"
+            "sqrdmulh v24.4s, v24.4s, v27.4s\n"
+            "and v25.16b, v21.16b, v28.16b\n"
+            "and v29.16b, v22.16b, v28.16b\n"
+            "and v30.16b, v23.16b, v28.16b\n"
+            "and v31.16b, v24.16b, v28.16b\n"
+            "sshr v25.4s, v25.4s, #31\n"
+            "sshr v29.4s, v29.4s, #31\n"
+            "sshr v30.4s, v30.4s, #31\n"
+            "sshr v31.4s, v31.4s, #31\n"
+            "sqadd v21.4s, v21.4s, v25.4s\n"
+            "sqadd v22.4s, v22.4s, v29.4s\n"
+            "dup v29.4s, w2\n"
+            "sqadd v23.4s, v23.4s, v30.4s\n"
+            "dup v30.4s, w4\n"
+            "sqadd v24.4s, v24.4s, v31.4s\n"
+            "dup v31.4s, w0\n"
+            "srshl v21.4s, v21.4s, v28.4s\n"
+            "srshl v22.4s, v22.4s, v28.4s\n"
+            "srshl v23.4s, v23.4s, v28.4s\n"
+            "srshl v24.4s, v24.4s, v28.4s\n"
+            "add v21.4s, v21.4s, v29.4s\n"
+            "add v22.4s, v22.4s, v29.4s\n"
+            "add v23.4s, v23.4s, v29.4s\n"
+            "add v24.4s, v24.4s, v29.4s\n"
+            "smax v21.4s, v21.4s, v30.4s\n"
+            "smax v22.4s, v22.4s, v30.4s\n"
+            "smax v23.4s, v23.4s, v30.4s\n"
+            "smax v24.4s, v24.4s, v30.4s\n"
+            "smin v21.4s, v21.4s, v31.4s\n"
+            "smin v22.4s, v22.4s, v31.4s\n"
+            "smin v23.4s, v23.4s, v31.4s\n"
+            "smin v24.4s, v24.4s, v31.4s\n"
+            "sqxtn v21.4h, v21.4s\n"
+            "sqxtn v23.4h, v23.4s\n"
+            "sqxtn2 v21.8h, v22.4s\n"
+            "ld1 {v22.4s}, [x10]\n"
+            "sqxtn2 v23.8h, v24.4s\n"
+            "ld1 {v24.4s}, [x10]\n"
+            "sqxtun v21.8b, v21.8h\n"
+            "sqxtun v23.8b, v23.8h\n"
+            "uaddw v9.8h, v26.8h, v9.8b\n"
+            "st1 {v21.8b}, [x6], x3\n"
+            "uaddw v10.8h, v26.8h, v10.8b\n"
+            "st1 {v23.8b}, [x7], x3\n"
+            "uaddw v11.8h, v26.8h, v11.8b\n"
+            "uaddw v12.8h, v26.8h, v12.8b\n"
+            "uaddw v13.8h, v26.8h, v13.8b\n"
+            "uaddw v14.8h, v26.8h, v14.8b\n"
+            "uaddw v15.8h, v26.8h, v15.8b\n"
+            "ld1 {v21.4s}, [%[bias_ptr]]\n"
+            "uaddw v16.8h, v26.8h, v16.8b\n"
+            "ld1 {v23.4s}, [%[bias_ptr]]\n"
+            "uaddw v17.8h, v26.8h, v17.8b\n"
+            "uaddw v18.8h, v26.8h, v18.8b\n"
+            "uaddw v19.8h, v26.8h, v19.8b\n"
+            "uaddw v20.8h, v26.8h, v20.8b\n"
+
+            "bge " DEPTHWISECONV_LABEL_HEIGHT_2_WIDTH_2_LOOP "b\n"
+
+          // At this point, there will be one of 2 width or 1 width leftover,
+          // not both.
+          "cmp w5, #2\n"
+          "blt " DEPTHWISECONV_LABEL_HEIGHT_2_WIDTH_1_LEFTOVER "f\n"
+
+          // Handle last 2 columns if exists.
+          DEPTHWISECONV_LABEL_HEIGHT_2_WIDTH_2_LEFTOVER ":\n"
+          // Mul-add left outputs.
+          "smlal v21.4s, v0.4h, v9.4h\n"
+          "smlal2 v22.4s, v0.8h, v9.8h\n"
+          "smlal v23.4s, v0.4h, v12.4h\n"
+          "ld1 {v9.8b}, [x12]\n"
+          "smlal2 v24.4s, v0.8h, v12.8h\n"
+          "smlal v21.4s, v1.4h, v10.4h\n"
+          "smlal2 v22.4s, v1.8h, v10.8h\n"
+          "smlal v23.4s, v1.4h, v13.4h\n"
+          "smlal2 v24.4s, v1.8h, v13.8h\n"
+          "smlal v21.4s, v2.4h, v11.4h\n"
+          "smlal2 v22.4s, v2.8h, v11.8h\n"
+          "smlal v23.4s, v2.4h, v14.4h\n"
+          "smlal2 v24.4s, v2.8h, v14.8h\n"
+          "smlal v21.4s, v3.4h, v12.4h\n"
+          "smlal2 v22.4s, v3.8h, v12.8h\n"
+          "ld1 {v12.8b}, [x13]\n"
+          "smlal v23.4s, v3.4h, v15.4h\n"
+          "smlal2 v24.4s, v3.8h, v15.8h\n"
+          "smlal v21.4s, v4.4h, v13.4h\n"
+          "smlal2 v22.4s, v4.8h, v13.8h\n"
+          "smlal v23.4s, v4.4h, v16.4h\n"
+          "smlal2 v24.4s, v4.8h, v16.8h\n"
+          "smlal v21.4s, v5.4h, v14.4h\n"
+          "smlal2 v22.4s, v5.8h, v14.8h\n"
+          "smlal v23.4s, v5.4h, v17.4h\n"
+          "smlal2 v24.4s, v5.8h, v17.8h\n"
+          "smlal v21.4s, v6.4h, v15.4h\n"
+          "smlal2 v22.4s, v6.8h, v15.8h\n"
+          "ld1 {v15.8b}, [x14]\n"
+          "smlal v23.4s, v6.4h, v18.4h\n"
+          "smlal2 v24.4s, v6.8h, v18.8h\n"
+          "ld1 {v18.8b}, [x15]\n"
+          "smlal v21.4s, v7.4h, v16.4h\n"
+          "smlal2 v22.4s, v7.8h, v16.8h\n"
+          "smlal v23.4s, v7.4h, v19.4h\n"
+          "smlal2 v24.4s, v7.8h, v19.8h\n"
+          "smlal v21.4s, v8.4h, v17.4h\n"
+          "smlal2 v22.4s, v8.8h, v17.8h\n"
+          "smlal v23.4s, v8.4h, v20.4h\n"
+          "smlal2 v24.4s, v8.8h, v20.8h\n"
+
+          "sqrdmulh v21.4s, v21.4s, v27.4s\n"
+          "sqrdmulh v22.4s, v22.4s, v27.4s\n"
+          "sqrdmulh v23.4s, v23.4s, v27.4s\n"
+          "sqrdmulh v24.4s, v24.4s, v27.4s\n"
+          "and v25.16b, v21.16b, v28.16b\n"
+          "and v29.16b, v22.16b, v28.16b\n"
+          "and v30.16b, v23.16b, v28.16b\n"
+          "and v31.16b, v24.16b, v28.16b\n"
+          "sshr v25.4s, v25.4s, #31\n"
+          "sshr v29.4s, v29.4s, #31\n"
+          "sshr v30.4s, v30.4s, #31\n"
+          "sshr v31.4s, v31.4s, #31\n"
+          "sqadd v21.4s, v21.4s, v25.4s\n"
+          "sqadd v22.4s, v22.4s, v29.4s\n"
+          "dup v29.4s, w2\n"
+          "sqadd v23.4s, v23.4s, v30.4s\n"
+          "dup v30.4s, w4\n"
+          "sqadd v24.4s, v24.4s, v31.4s\n"
+          "dup v31.4s, w0\n"
+          "srshl v21.4s, v21.4s, v28.4s\n"
+          "srshl v22.4s, v22.4s, v28.4s\n"
+          "srshl v23.4s, v23.4s, v28.4s\n"
+          "srshl v24.4s, v24.4s, v28.4s\n"
+          "add v21.4s, v21.4s, v29.4s\n"
+          "add v22.4s, v22.4s, v29.4s\n"
+          "add v23.4s, v23.4s, v29.4s\n"
+          "add v24.4s, v24.4s, v29.4s\n"
+          "smax v21.4s, v21.4s, v30.4s\n"
+          "smax v22.4s, v22.4s, v30.4s\n"
+          "smax v23.4s, v23.4s, v30.4s\n"
+          "smax v24.4s, v24.4s, v30.4s\n"
+          "smin v21.4s, v21.4s, v31.4s\n"
+          "smin v22.4s, v22.4s, v31.4s\n"
+          "smin v23.4s, v23.4s, v31.4s\n"
+          "smin v24.4s, v24.4s, v31.4s\n"
+          "sqxtn v21.4h, v21.4s\n"
+          "sqxtn v23.4h, v23.4s\n"
+          "sqxtn2 v21.8h, v22.4s\n"
+          "ld1 {v22.4s}, [x10]\n"
+          "sqxtn2 v23.8h, v24.4s\n"
+          "ld1 {v24.4s}, [x10]\n"
+          "sqxtun v21.8b, v21.8h\n"
+          "sqxtun v23.8b, v23.8h\n"
+          "uaddw v9.8h, v26.8h, v9.8b\n"
+          "st1 {v21.8b}, [x6], x3\n"
+          "uaddw v12.8h, v26.8h, v12.8b\n"
+          "st1 {v23.8b}, [x7], x3\n"
+          "uaddw v15.8h, v26.8h, v15.8b\n"
+          "ld1 {v21.4s}, [%[bias_ptr]]\n"
+          "uaddw v18.8h, v26.8h, v18.8b\n"
+          "ld1 {v23.4s}, [%[bias_ptr]]\n"
+
+          // Mul-add right outputs.
+          "smlal v21.4s, v0.4h, v10.4h\n"
+          "smlal2 v22.4s, v0.8h, v10.8h\n"
+          "smlal v23.4s, v0.4h, v13.4h\n"
+          "smlal2 v24.4s, v0.8h, v13.8h\n"
+          "smlal v21.4s, v1.4h, v11.4h\n"
+          "smlal2 v22.4s, v1.8h, v11.8h\n"
+          "smlal v23.4s, v1.4h, v14.4h\n"
+          "smlal2 v24.4s, v1.8h, v14.8h\n"
+          "smlal v21.4s, v2.4h, v9.4h\n"
+          "smlal2 v22.4s, v2.8h, v9.8h\n"
+          "smlal v23.4s, v2.4h, v12.4h\n"
+          "smlal2 v24.4s, v2.8h, v12.8h\n"
+          "smlal v21.4s, v3.4h, v13.4h\n"
+          "smlal2 v22.4s, v3.8h, v13.8h\n"
+          "smlal v23.4s, v3.4h, v16.4h\n"
+          "smlal2 v24.4s, v3.8h, v16.8h\n"
+          "smlal v21.4s, v4.4h, v14.4h\n"
+          "smlal2 v22.4s, v4.8h, v14.8h\n"
+          "smlal v23.4s, v4.4h, v17.4h\n"
+          "smlal2 v24.4s, v4.8h, v17.8h\n"
+          "smlal v21.4s, v5.4h, v12.4h\n"
+          "smlal2 v22.4s, v5.8h, v12.8h\n"
+          "smlal v23.4s, v5.4h, v15.4h\n"
+          "smlal2 v24.4s, v5.8h, v15.8h\n"
+          "smlal v21.4s, v6.4h, v16.4h\n"
+          "smlal2 v22.4s, v6.8h, v16.8h\n"
+          "smlal v23.4s, v6.4h, v19.4h\n"
+          "smlal2 v24.4s, v6.8h, v19.8h\n"
+          "smlal v21.4s, v7.4h, v17.4h\n"
+          "smlal2 v22.4s, v7.8h, v17.8h\n"
+          "smlal v23.4s, v7.4h, v20.4h\n"
+          "smlal2 v24.4s, v7.8h, v20.8h\n"
+          "smlal v21.4s, v8.4h, v15.4h\n"
+          "smlal2 v22.4s, v8.8h, v15.8h\n"
+          "smlal v23.4s, v8.4h, v18.4h\n"
+          "smlal2 v24.4s, v8.8h, v18.8h\n"
+
+          "sqrdmulh v21.4s, v21.4s, v27.4s\n"
+          "sqrdmulh v22.4s, v22.4s, v27.4s\n"
+          "sqrdmulh v23.4s, v23.4s, v27.4s\n"
+          "sqrdmulh v24.4s, v24.4s, v27.4s\n"
+          "and v25.16b, v21.16b, v28.16b\n"
+          "and v29.16b, v22.16b, v28.16b\n"
+          "and v30.16b, v23.16b, v28.16b\n"
+          "and v31.16b, v24.16b, v28.16b\n"
+          "sshr v25.4s, v25.4s, #31\n"
+          "sshr v29.4s, v29.4s, #31\n"
+          "sshr v30.4s, v30.4s, #31\n"
+          "sshr v31.4s, v31.4s, #31\n"
+          "sqadd v21.4s, v21.4s, v25.4s\n"
+          "sqadd v22.4s, v22.4s, v29.4s\n"
+          "dup v29.4s, w2\n"
+          "sqadd v23.4s, v23.4s, v30.4s\n"
+          "dup v30.4s, w4\n"
+          "sqadd v24.4s, v24.4s, v31.4s\n"
+          "dup v31.4s, w0\n"
+          "srshl v21.4s, v21.4s, v28.4s\n"
+          "srshl v22.4s, v22.4s, v28.4s\n"
+          "srshl v23.4s, v23.4s, v28.4s\n"
+          "srshl v24.4s, v24.4s, v28.4s\n"
+          "add v21.4s, v21.4s, v29.4s\n"
+          "add v22.4s, v22.4s, v29.4s\n"
+          "add v23.4s, v23.4s, v29.4s\n"
+          "add v24.4s, v24.4s, v29.4s\n"
+          "smax v21.4s, v21.4s, v30.4s\n"
+          "smax v22.4s, v22.4s, v30.4s\n"
+          "smax v23.4s, v23.4s, v30.4s\n"
+          "smax v24.4s, v24.4s, v30.4s\n"
+          "smin v21.4s, v21.4s, v31.4s\n"
+          "smin v22.4s, v22.4s, v31.4s\n"
+          "smin v23.4s, v23.4s, v31.4s\n"
+          "smin v24.4s, v24.4s, v31.4s\n"
+          "sqxtn v21.4h, v21.4s\n"
+          "sqxtn v23.4h, v23.4s\n"
+          "sqxtn2 v21.8h, v22.4s\n"
+          "sqxtn2 v23.8h, v24.4s\n"
+          "sqxtun v21.8b, v21.8h\n"
+          "sqxtun v23.8b, v23.8h\n"
+          "st1 {v21.8b}, [x6], x3\n"
+          "st1 {v23.8b}, [x7], x3\n"
+          "b " DEPTHWISECONV_LABEL_HEIGHT_2_WIDTH_2_AFTER_LOOP "f\n"
+
+          DEPTHWISECONV_LABEL_HEIGHT_2_WIDTH_1_LEFTOVER ":\n"
+          "smlal v21.4s, v0.4h, v9.4h\n"
+          "smlal2 v22.4s, v0.8h, v9.8h\n"
+          "smlal v23.4s, v0.4h, v12.4h\n"
+          "smlal2 v24.4s, v0.8h, v12.8h\n"
+          "smlal v21.4s, v1.4h, v10.4h\n"
+          "smlal2 v22.4s, v1.8h, v10.8h\n"
+          "smlal v23.4s, v1.4h, v13.4h\n"
+          "smlal2 v24.4s, v1.8h, v13.8h\n"
+          "smlal v21.4s, v2.4h, v11.4h\n"
+          "smlal2 v22.4s, v2.8h, v11.8h\n"
+          "smlal v23.4s, v2.4h, v14.4h\n"
+          "smlal2 v24.4s, v2.8h, v14.8h\n"
+          "smlal v21.4s, v3.4h, v12.4h\n"
+          "smlal2 v22.4s, v3.8h, v12.8h\n"
+          "smlal v23.4s, v3.4h, v15.4h\n"
+          "smlal2 v24.4s, v3.8h, v15.8h\n"
+          "smlal v21.4s, v4.4h, v13.4h\n"
+          "smlal2 v22.4s, v4.8h, v13.8h\n"
+          "smlal v23.4s, v4.4h, v16.4h\n"
+          "smlal2 v24.4s, v4.8h, v16.8h\n"
+          "smlal v21.4s, v5.4h, v14.4h\n"
+          "smlal2 v22.4s, v5.8h, v14.8h\n"
+          "smlal v23.4s, v5.4h, v17.4h\n"
+          "smlal2 v24.4s, v5.8h, v17.8h\n"
+          "smlal v21.4s, v6.4h, v15.4h\n"
+          "smlal2 v22.4s, v6.8h, v15.8h\n"
+          "smlal v23.4s, v6.4h, v18.4h\n"
+          "smlal2 v24.4s, v6.8h, v18.8h\n"
+          "smlal v21.4s, v7.4h, v16.4h\n"
+          "smlal2 v22.4s, v7.8h, v16.8h\n"
+          "smlal v23.4s, v7.4h, v19.4h\n"
+          "smlal2 v24.4s, v7.8h, v19.8h\n"
+          "smlal v21.4s, v8.4h, v17.4h\n"
+          "smlal2 v22.4s, v8.8h, v17.8h\n"
+          "smlal v23.4s, v8.4h, v20.4h\n"
+          "smlal2 v24.4s, v8.8h, v20.8h\n"
+
+          "sqrdmulh v21.4s, v21.4s, v27.4s\n"
+          "sqrdmulh v22.4s, v22.4s, v27.4s\n"
+          "sqrdmulh v23.4s, v23.4s, v27.4s\n"
+          "sqrdmulh v24.4s, v24.4s, v27.4s\n"
+          "and v9.16b, v21.16b, v28.16b\n"
+          "and v12.16b, v22.16b, v28.16b\n"
+          "and v15.16b, v23.16b, v28.16b\n"
+          "and v18.16b, v24.16b, v28.16b\n"
+          "sshr v9.4s, v9.4s, #31\n"
+          "sshr v12.4s, v12.4s, #31\n"
+          "sshr v15.4s, v15.4s, #31\n"
+          "sshr v18.4s, v18.4s, #31\n"
+          "sqadd v21.4s, v21.4s, v9.4s\n"
+          "sqadd v22.4s, v22.4s, v12.4s\n"
+          "sqadd v23.4s, v23.4s, v15.4s\n"
+          "sqadd v24.4s, v24.4s, v18.4s\n"
+          "srshl v21.4s, v21.4s, v28.4s\n"
+          "srshl v22.4s, v22.4s, v28.4s\n"
+          "srshl v23.4s, v23.4s, v28.4s\n"
+          "srshl v24.4s, v24.4s, v28.4s\n"
+          "add v21.4s, v21.4s, v29.4s\n"
+          "add v22.4s, v22.4s, v29.4s\n"
+          "add v23.4s, v23.4s, v29.4s\n"
+          "add v24.4s, v24.4s, v29.4s\n"
+          "smax v21.4s, v21.4s, v30.4s\n"
+          "smax v22.4s, v22.4s, v30.4s\n"
+          "smax v23.4s, v23.4s, v30.4s\n"
+          "smax v24.4s, v24.4s, v30.4s\n"
+          "smin v21.4s, v21.4s, v31.4s\n"
+          "smin v22.4s, v22.4s, v31.4s\n"
+          "smin v23.4s, v23.4s, v31.4s\n"
+          "smin v24.4s, v24.4s, v31.4s\n"
+          "sqxtn v21.4h, v21.4s\n"
+          "sqxtn v23.4h, v23.4s\n"
+          "sqxtn2 v21.8h, v22.4s\n"
+          "sqxtn2 v23.8h, v24.4s\n"
+          "sqxtun v21.8b, v21.8h\n"
+          "sqxtun v23.8b, v23.8h\n"
+          "st1 {v21.8b}, [x6], x3\n"
+          "st1 {v23.8b}, [x7], x3\n"
+
+          DEPTHWISECONV_LABEL_HEIGHT_2_WIDTH_2_AFTER_LOOP ":\n"
+          "subs %w[output_window_height], %w[output_window_height], #2\n"
+          "add %[input_ptr], %[input_ptr], %[input_height_increment]\n"
+          "cmp %w[output_window_height], #2\n"
+          "add %[output_ptr], %[output_ptr], %[output_height_increment]\n"
+          "bge " DEPTHWISECONV_LABEL_HEIGHT_2_LOOP "b\n"
+
+        DEPTHWISECONV_LABEL_HEIGHT_2_AFTER_LOOP ":\n"
+        "cmp %w[output_window_height], #1\n"
+        "blt " DEPTHWISECONV_LABEL_HEIGHT_1_END "f\n"
+
+        DEPTHWISECONV_LABEL_HEIGHT_1 ":\n"
+        "mov x12, %[input_ptr]\n"
+        "ld1 {v9.8b}, [x12], %[input_depth]\n"
+        "add x13, %[input_ptr], %[input_row_size]\n"
+        "ld1 {v10.8b}, [x12], %[input_depth]\n"
+        "add x14, x13, %[input_row_size]\n"
+        "ld1 {v11.8b}, [x12], %[input_depth]\n"
+        "add x15, x14, %[input_row_size]\n"
+        "mov w5, %w[output_window_width]\n"
+        "ld1 {v13.8b}, [x13], %[input_depth]\n"
+        "mov x6, %[output_ptr]\n"
+        "ld1 {v14.8b}, [x13], %[input_depth]\n"
+        "add x7, %[output_ptr], x1\n"
+        "ld1 {v15.8b}, [x13], %[input_depth]\n"
+        // The height 1 / width 2 loop loads an extra 1x1 output in anticipation
+        // for the next iteration. Make sure |output_window_width| is large
+        // enough to handle the additional load, otherwise jump to the
+        // appropriate label to handle smaller widths.
+        "cmp w5, #2\n"
+        "ld1 {v17.8b}, [x14], %[input_depth]\n"
+        "ld1 {v18.8b}, [x14], %[input_depth]\n"
+        "ld1 {v19.8b}, [x14], %[input_depth]\n"
+        "ld1 {v21.4s}, [%[bias_ptr]]\n"
+        "ld1 {v22.4s}, [x10]\n"
+        "ld1 {v23.4s}, [%[bias_ptr]]\n"
+        "ld1 {v24.4s}, [x10]\n"
+
+        "uaddw v9.8h, v26.8h, v9.8b\n"
+        "uaddw v10.8h, v26.8h, v10.8b\n"
+        "uaddw v11.8h, v26.8h, v11.8b\n"
+        "uaddw v13.8h, v26.8h, v13.8b\n"
+        "uaddw v14.8h, v26.8h, v14.8b\n"
+        "uaddw v15.8h, v26.8h, v15.8b\n"
+        "uaddw v17.8h, v26.8h, v17.8b\n"
+        "uaddw v18.8h, v26.8h, v18.8b\n"
+        "uaddw v19.8h, v26.8h, v19.8b\n"
+
+        "beq " DEPTHWISECONV_LABEL_HEIGHT_1_WIDTH_2_LEFTOVER "f\n"
+        "cmp w5, #1\n"
+        "beq " DEPTHWISECONV_LABEL_HEIGHT_1_WIDTH_1_LEFTOVER "f\n"
+
+        //"loop_%=:\n"
+        DEPTHWISECONV_LABEL_HEIGHT_1_WIDTH_2_LOOP ":\n"
+          // Load inputs for 3x4 input window which corresponds to a 1x2 output
+          // window.
+          "smlal v21.4s, v0.4h, v9.4h\n"
+          "ld1 {v12.8b}, [x12]\n"
+          "smlal2 v22.4s, v0.8h, v9.8h\n"
+          "ld1 {v16.8b}, [x13]\n"
+          "smlal v23.4s, v0.4h, v10.4h\n"
+          "ld1 {v20.8b}, [x14]\n"
+          "smlal2 v24.4s, v0.8h, v10.8h\n"
+          "subs w5, w5, #2\n"
+          "smlal v21.4s, v1.4h, v10.4h\n"
+          "cmp w5, #3\n"
+          "smlal2 v22.4s, v1.8h, v10.8h\n"
+          "add %[input_ptr], %[input_ptr], %[input_width_increment]\n"
+          "smlal v23.4s, v1.4h, v11.4h\n"
+          "mov x12, %[input_ptr]\n"
+          "smlal2 v24.4s, v1.8h, v11.8h\n"
+          "ld1 {v9.8b}, [x12], %[input_depth]\n"
+          "smlal v21.4s, v2.4h, v11.4h\n"
+          "ld1 {v10.8b}, [x12], %[input_depth]\n"
+          "uaddw v12.8h, v26.8h, v12.8b\n"
+          "smlal2 v22.4s, v2.8h, v11.8h\n"
+          "ld1 {v11.8b}, [x12], %[input_depth]\n"
+          "add x13, %[input_ptr], %[input_row_size]\n"
+          "smlal v23.4s, v2.4h, v12.4h\n"
+          "add x14, x13, %[input_row_size]\n"
+          "smlal2 v24.4s, v2.8h, v12.8h\n"
+          "smlal v21.4s, v3.4h, v13.4h\n"
+          "add x15, x14, %[input_row_size]\n"
+          "smlal2 v22.4s, v3.8h, v13.8h\n"
+          "ld1 {v13.8b}, [x13], %[input_depth]\n"
+          "smlal v23.4s, v3.4h, v14.4h\n"
+          "smlal2 v24.4s, v3.8h, v14.8h\n"
+          "smlal v21.4s, v4.4h, v14.4h\n"
+          "smlal2 v22.4s, v4.8h, v14.8h\n"
+          "ld1 {v14.8b}, [x13], %[input_depth]\n"
+          "smlal v23.4s, v4.4h, v15.4h\n"
+          "smlal2 v24.4s, v4.8h, v15.8h\n"
+          "smlal v21.4s, v5.4h, v15.4h\n"
+          "uaddw v16.8h, v26.8h, v16.8b\n"
+          "smlal2 v22.4s, v5.8h, v15.8h\n"
+          "ld1 {v15.8b}, [x13], %[input_depth]\n"
+          "smlal v23.4s, v5.4h, v16.4h\n"
+          "smlal2 v24.4s, v5.8h, v16.8h\n"
+          "smlal v21.4s, v6.4h, v17.4h\n"
+          "smlal2 v22.4s, v6.8h, v17.8h\n"
+          "ld1 {v17.8b}, [x14], %[input_depth]\n"
+          "smlal v23.4s, v6.4h, v18.4h\n"
+          "smlal2 v24.4s, v6.8h, v18.8h\n"
+          "smlal v21.4s, v7.4h, v18.4h\n"
+          "smlal2 v22.4s, v7.8h, v18.8h\n"
+          "ld1 {v18.8b}, [x14], %[input_depth]\n"
+          "smlal v23.4s, v7.4h, v19.4h\n"
+          "smlal2 v24.4s, v7.8h, v19.8h\n"
+          "smlal v21.4s, v8.4h, v19.4h\n"
+          "uaddw v20.8h, v26.8h, v20.8b\n"
+          "smlal2 v22.4s, v8.8h, v19.8h\n"
+          "ld1 {v19.8b}, [x14], %[input_depth]\n"
+          "smlal v23.4s, v8.4h, v20.4h\n"
+          "smlal2 v24.4s, v8.8h, v20.8h\n"
+
+          "sqrdmulh v21.4s, v21.4s, v27.4s\n"
+          "sqrdmulh v22.4s, v22.4s, v27.4s\n"
+          "sqrdmulh v23.4s, v23.4s, v27.4s\n"
+          "sqrdmulh v24.4s, v24.4s, v27.4s\n"
+          "and v25.16b, v21.16b, v28.16b\n"
+          "and v29.16b, v22.16b, v28.16b\n"
+          "and v30.16b, v23.16b, v28.16b\n"
+          "and v31.16b, v24.16b, v28.16b\n"
+          "sshr v25.4s, v25.4s, #31\n"
+          "sshr v29.4s, v29.4s, #31\n"
+          "sshr v30.4s, v30.4s, #31\n"
+          "sshr v31.4s, v31.4s, #31\n"
+          "sqadd v21.4s, v21.4s, v25.4s\n"
+          "sqadd v22.4s, v22.4s, v29.4s\n"
+          "dup v29.4s, w2\n"
+          "sqadd v23.4s, v23.4s, v30.4s\n"
+          "dup v30.4s, w4\n"
+          "sqadd v24.4s, v24.4s, v31.4s\n"
+          "dup v31.4s, w0\n"
+          "srshl v21.4s, v21.4s, v28.4s\n"
+          "srshl v22.4s, v22.4s, v28.4s\n"
+          "srshl v23.4s, v23.4s, v28.4s\n"
+          "srshl v24.4s, v24.4s, v28.4s\n"
+          "add v21.4s, v21.4s, v29.4s\n"
+          "add v22.4s, v22.4s, v29.4s\n"
+          "add v23.4s, v23.4s, v29.4s\n"
+          "add v24.4s, v24.4s, v29.4s\n"
+          "smax v21.4s, v21.4s, v30.4s\n"
+          "smax v22.4s, v22.4s, v30.4s\n"
+          "smax v23.4s, v23.4s, v30.4s\n"
+          "smax v24.4s, v24.4s, v30.4s\n"
+          "smin v21.4s, v21.4s, v31.4s\n"
+          "smin v22.4s, v22.4s, v31.4s\n"
+          "smin v23.4s, v23.4s, v31.4s\n"
+          "smin v24.4s, v24.4s, v31.4s\n"
+          "sqxtn v21.4h, v21.4s\n"
+          "sqxtn v23.4h, v23.4s\n"
+          "sqxtn2 v21.8h, v22.4s\n"
+          "ld1 {v22.4s}, [x10]\n"
+          "sqxtn2 v23.8h, v24.4s\n"
+          "ld1 {v24.4s}, [x10]\n"
+          "sqxtun v21.8b, v21.8h\n"
+          "sqxtun v23.8b, v23.8h\n"
+          "uaddw v9.8h, v26.8h, v9.8b\n"
+          "st1 {v21.8b}, [%[output_ptr]], x3\n"
+          "uaddw v10.8h, v26.8h, v10.8b\n"
+          "st1 {v23.8b}, [%[output_ptr]], x3\n"
+          "uaddw v11.8h, v26.8h, v11.8b\n"
+          "uaddw v12.8h, v26.8h, v12.8b\n"
+          "uaddw v13.8h, v26.8h, v13.8b\n"
+          "uaddw v14.8h, v26.8h, v14.8b\n"
+          "uaddw v15.8h, v26.8h, v15.8b\n"
+          "ld1 {v21.4s}, [%[bias_ptr]]\n"
+          "uaddw v16.8h, v26.8h, v16.8b\n"
+          "ld1 {v23.4s}, [%[bias_ptr]]\n"
+          "uaddw v17.8h, v26.8h, v17.8b\n"
+          "uaddw v18.8h, v26.8h, v18.8b\n"
+          "uaddw v19.8h, v26.8h, v19.8b\n"
+          "uaddw v20.8h, v26.8h, v20.8b\n"
+
+          "bge " DEPTHWISECONV_LABEL_HEIGHT_1_WIDTH_2_LOOP "b\n"
+
+        // At this point, there will be one of 2 width or 1 width leftover,
+        // not both.
+        "cmp w5, #2\n"
+        "blt " DEPTHWISECONV_LABEL_HEIGHT_1_WIDTH_1_LEFTOVER "f\n"
+
+        // Handle last two horizontal outputs if exists.
+        DEPTHWISECONV_LABEL_HEIGHT_1_WIDTH_2_LEFTOVER ":\n"
+        "smlal v21.4s, v0.4h, v9.4h\n"
+        "ld1 {v12.8b}, [x12], %[input_depth]\n"
+        "smlal2 v22.4s, v0.8h, v9.8h\n"
+        "ld1 {v16.8b}, [x13], %[input_depth]\n"
+        "smlal v23.4s, v0.4h, v10.4h\n"
+        "ld1 {v20.8b}, [x14], %[input_depth]\n"
+        "smlal2 v24.4s, v0.8h, v10.8h\n"
+        "smlal v21.4s, v1.4h, v10.4h\n"
+        "smlal2 v22.4s, v1.8h, v10.8h\n"
+        "smlal v23.4s, v1.4h, v11.4h\n"
+        "smlal2 v24.4s, v1.8h, v11.8h\n"
+        "smlal v21.4s, v2.4h, v11.4h\n"
+        "uaddw v12.8h, v26.8h, v12.8b\n"
+        "smlal2 v22.4s, v2.8h, v11.8h\n"
+        "smlal v23.4s, v2.4h, v12.4h\n"
+        "smlal2 v24.4s, v2.8h, v12.8h\n"
+        "smlal v21.4s, v3.4h, v13.4h\n"
+        "smlal2 v22.4s, v3.8h, v13.8h\n"
+        "smlal v23.4s, v3.4h, v14.4h\n"
+        "smlal2 v24.4s, v3.8h, v14.8h\n"
+        "smlal v21.4s, v4.4h, v14.4h\n"
+        "smlal2 v22.4s, v4.8h, v14.8h\n"
+        "smlal v23.4s, v4.4h, v15.4h\n"
+        "smlal2 v24.4s, v4.8h, v15.8h\n"
+        "smlal v21.4s, v5.4h, v15.4h\n"
+        "uaddw v16.8h, v26.8h, v16.8b\n"
+        "smlal2 v22.4s, v5.8h, v15.8h\n"
+        "smlal v23.4s, v5.4h, v16.4h\n"
+        "smlal2 v24.4s, v5.8h, v16.8h\n"
+        "smlal v21.4s, v6.4h, v17.4h\n"
+        "smlal2 v22.4s, v6.8h, v17.8h\n"
+        "smlal v23.4s, v6.4h, v18.4h\n"
+        "smlal2 v24.4s, v6.8h, v18.8h\n"
+        "smlal v21.4s, v7.4h, v18.4h\n"
+        "smlal2 v22.4s, v7.8h, v18.8h\n"
+        "smlal v23.4s, v7.4h, v19.4h\n"
+        "smlal2 v24.4s, v7.8h, v19.8h\n"
+        "smlal v21.4s, v8.4h, v19.4h\n"
+        "uaddw v20.8h, v26.8h, v20.8b\n"
+        "smlal2 v22.4s, v8.8h, v19.8h\n"
+        "smlal v23.4s, v8.4h, v20.4h\n"
+        "smlal2 v24.4s, v8.8h, v20.8h\n"
+
+        "sqrdmulh v21.4s, v21.4s, v27.4s\n"
+        "sqrdmulh v22.4s, v22.4s, v27.4s\n"
+        "sqrdmulh v23.4s, v23.4s, v27.4s\n"
+        "sqrdmulh v24.4s, v24.4s, v27.4s\n"
+        "and v25.16b, v21.16b, v28.16b\n"
+        "and v29.16b, v22.16b, v28.16b\n"
+        "and v30.16b, v23.16b, v28.16b\n"
+        "and v31.16b, v24.16b, v28.16b\n"
+        "sshr v25.4s, v25.4s, #31\n"
+        "sshr v29.4s, v29.4s, #31\n"
+        "sshr v30.4s, v30.4s, #31\n"
+        "sshr v31.4s, v31.4s, #31\n"
+        "sqadd v21.4s, v21.4s, v25.4s\n"
+        "sqadd v22.4s, v22.4s, v29.4s\n"
+        "dup v29.4s, w2\n"
+        "sqadd v23.4s, v23.4s, v30.4s\n"
+        "dup v30.4s, w4\n"
+        "sqadd v24.4s, v24.4s, v31.4s\n"
+        "dup v31.4s, w0\n"
+        "srshl v21.4s, v21.4s, v28.4s\n"
+        "srshl v22.4s, v22.4s, v28.4s\n"
+        "srshl v23.4s, v23.4s, v28.4s\n"
+        "srshl v24.4s, v24.4s, v28.4s\n"
+        "add v21.4s, v21.4s, v29.4s\n"
+        "add v22.4s, v22.4s, v29.4s\n"
+        "add v23.4s, v23.4s, v29.4s\n"
+        "add v24.4s, v24.4s, v29.4s\n"
+        "smax v21.4s, v21.4s, v30.4s\n"
+        "smax v22.4s, v22.4s, v30.4s\n"
+        "smax v23.4s, v23.4s, v30.4s\n"
+        "smax v24.4s, v24.4s, v30.4s\n"
+        "smin v21.4s, v21.4s, v31.4s\n"
+        "smin v22.4s, v22.4s, v31.4s\n"
+        "smin v23.4s, v23.4s, v31.4s\n"
+        "smin v24.4s, v24.4s, v31.4s\n"
+        "sqxtn v21.4h, v21.4s\n"
+        "sqxtn v23.4h, v23.4s\n"
+        "sqxtn2 v21.8h, v22.4s\n"
+        "sqxtn2 v23.8h, v24.4s\n"
+        "sqxtun v21.8b, v21.8h\n"
+        "sqxtun v23.8b, v23.8h\n"
+        "st1 {v21.8b}, [%[output_ptr]], x3\n"
+        "st1 {v23.8b}, [%[output_ptr]], x3\n"
+        "b " DEPTHWISECONV_LABEL_HEIGHT_1_END "f\n"
+
+        // Handle bottom right output if exists.
+        DEPTHWISECONV_LABEL_HEIGHT_1_WIDTH_1_LEFTOVER ":\n"
+        "smlal v21.4s, v0.4h, v9.4h\n"
+        "smlal2 v22.4s, v0.8h, v9.8h\n"
+        "smlal v21.4s, v1.4h, v10.4h\n"
+        "smlal2 v22.4s, v1.8h, v10.8h\n"
+        "smlal v21.4s, v2.4h, v11.4h\n"
+        "smlal2 v22.4s, v2.8h, v11.8h\n"
+        "smlal v21.4s, v3.4h, v13.4h\n"
+        "smlal2 v22.4s, v3.8h, v13.8h\n"
+        "smlal v21.4s, v4.4h, v14.4h\n"
+        "smlal2 v22.4s, v4.8h, v14.8h\n"
+        "smlal v21.4s, v5.4h, v15.4h\n"
+        "smlal2 v22.4s, v5.8h, v15.8h\n"
+        "smlal v21.4s, v6.4h, v17.4h\n"
+        "smlal2 v22.4s, v6.8h, v17.8h\n"
+        "smlal v21.4s, v7.4h, v18.4h\n"
+        "smlal2 v22.4s, v7.8h, v18.8h\n"
+        "smlal v21.4s, v8.4h, v19.4h\n"
+        "smlal2 v22.4s, v8.8h, v19.8h\n"
+
+        "sqrdmulh v21.4s, v21.4s, v27.4s\n"
+        "sqrdmulh v22.4s, v22.4s, v27.4s\n"
+        "and v9.16b, v21.16b, v28.16b\n"
+        "and v12.16b, v22.16b, v28.16b\n"
+        "sshr v9.4s, v9.4s, #31\n"
+        "sshr v12.4s, v12.4s, #31\n"
+        "sqadd v21.4s, v21.4s, v9.4s\n"
+        "sqadd v22.4s, v22.4s, v12.4s\n"
+        "srshl v21.4s, v21.4s, v28.4s\n"
+        "srshl v22.4s, v22.4s, v28.4s\n"
+        "add v21.4s, v21.4s, v29.4s\n"
+        "add v22.4s, v22.4s, v29.4s\n"
+        "smax v21.4s, v21.4s, v30.4s\n"
+        "smax v22.4s, v22.4s, v30.4s\n"
+        "smin v21.4s, v21.4s, v31.4s\n"
+        "smin v22.4s, v22.4s, v31.4s\n"
+        "sqxtn v21.4h, v21.4s\n"
+        "sqxtn2 v21.8h, v22.4s\n"
+        "sqxtun v21.8b, v21.8h\n"
+        "st1 {v21.8b}, [%[output_ptr]]\n"
+        DEPTHWISECONV_LABEL_HEIGHT_1_END ":\n"
+    :
+    // Outputs.
+    [filter_ptr] "+r"(filter_ptr), [input_ptr] "+r"(input_ptr),
+    [output_ptr] "+r"(output_ptr),
+    [output_window_height] "+r"(output_window_height)
+    :
+    // Inputs.
+    [bias_ptr] "r"(bias_ptr), [input_row_size] "r"(input_row_size),
+    [input_depth] "r"(input_depth),
+    [output_window_width] "r"(output_window_width),
+    [input_width_increment] "r"(input_width_increment),
+    [input_height_increment] "r"(input_height_increment),
+    [output_height_increment] "r"(output_height_increment),
+    [params_ptr] "r"(params_ptr)
+    :
+    // Clobbers.
+    "cc", "memory",
+    // We use these NEON registers.
+    "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9",
+    "v10", "v11", "v12", "v13", "v14", "v15", "v16", "v17", "v18", "v19",
+    "v20", "v21", "v22", "v23", "v24", "v25", "v26", "v27", "v28", "v29",
+    "v30", "v31",
+    // We use these general-purpose registers.
+    "x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7",
+    "x9", "x10", "x11", "x12", "x13", "x14", "x15");
+#undef DEPTHWISECONV_LABEL_HEIGHT_2_LOOP
+#undef DEPTHWISECONV_LABEL_HEIGHT_2_WIDTH_2_LOOP
+#undef DEPTHWISECONV_LABEL_HEIGHT_2_WIDTH_1_LEFTOVER
+#undef DEPTHWISECONV_LABEL_HEIGHT_2_WIDTH_2_LEFTOVER
+#undef DEPTHWISECONV_LABEL_HEIGHT_2_WIDTH_2_AFTER_LOOP
+#undef DEPTHWISECONV_LABEL_HEIGHT_2_AFTER_LOOP
+#undef DEPTHWISECONV_LABEL_HEIGHT_1
+#undef DEPTHWISECONV_LABEL_HEIGHT_1_WIDTH_2_LOOP
+#undef DEPTHWISECONV_LABEL_HEIGHT_1_WIDTH_1_LEFTOVER
+#undef DEPTHWISECONV_LABEL_HEIGHT_1_WIDTH_2_LEFTOVER
+#undef DEPTHWISECONV_LABEL_HEIGHT_1_END
   }
 };
 
 template <>
-struct ConvKernel3x3FilterDepth8<4, 4, 2, 2> {
-  static inline void Run(const uint8* input_ptr, int input_depth,
-                         int32 input_offset, int input_row_size,
-                         const uint8* filter_ptr, int32 filter_offset,
-                         const int32* bias_ptr, int32 output_offset,
-                         int32 output_multiplier, int output_shift,
-                         int32 output_activation_min,
-                         int32 output_activation_max, uint8* output_ptr,
-                         int output_depth, int output_width) {
-    // Reuse 4x2 kernel twice.
-    ConvKernel3x3FilterDepth8<4, 2, 2, 2>::Run(
-        input_ptr, input_depth, input_offset, input_row_size, filter_ptr,
-        filter_offset, bias_ptr, output_offset, output_multiplier, output_shift,
-        output_activation_min, output_activation_max, output_ptr, output_depth,
-        output_width);
-
-    ConvKernel3x3FilterDepth8<4, 2, 2, 2>::Run(
-        input_ptr + 4 * input_depth, input_depth, input_offset, input_row_size,
-        filter_ptr, filter_offset, bias_ptr, output_offset, output_multiplier,
-        output_shift, output_activation_min, output_activation_max,
-        output_ptr + 2 * output_depth, output_depth, output_width);
+struct DepthwiseConvWindow<8, 2, 2> {
+  static inline void Run(const uint8* input_ptr, const uint8* filter_ptr,
+                  const int32* bias_ptr, uint8* output_ptr, int64_t input_depth,
+                  int64_t input_row_size, int32 output_window_height,
+                  int32 output_window_width,
+                  const DepthwiseConvParams* params_ptr) {
+    const int64_t input_width_increment = 4 * input_depth;
+    const int64_t input_height_increment = 4 * input_row_size;
+    const int64_t output_height_increment = 2 * params_ptr->output_row_size;
+
+#define DEPTHWISECONV_LABEL_HEIGHT_2_LOOP "1"
+#define DEPTHWISECONV_LABEL_HEIGHT_2_WIDTH_2_LOOP "2"
+#define DEPTHWISECONV_LABEL_HEIGHT_2_WIDTH_1_LEFTOVER "3"
+#define DEPTHWISECONV_LABEL_HEIGHT_2_WIDTH_2_LEFTOVER "4"
+#define DEPTHWISECONV_LABEL_HEIGHT_2_WIDTH_2_AFTER_LOOP "5"
+#define DEPTHWISECONV_LABEL_HEIGHT_2_AFTER_LOOP "6"
+#define DEPTHWISECONV_LABEL_HEIGHT_1 "7"
+#define DEPTHWISECONV_LABEL_HEIGHT_1_WIDTH_2_LOOP "8"
+#define DEPTHWISECONV_LABEL_HEIGHT_1_WIDTH_1_LEFTOVER "9"
+#define DEPTHWISECONV_LABEL_HEIGHT_1_WIDTH_2_LEFTOVER "10"
+#define DEPTHWISECONV_LABEL_HEIGHT_1_END "11"
+
+    asm volatile(
+        // Performs depthwise convolutions for a window specified by
+        // |output_window_height| and |output_window_width|. The inner-most loop
+        // processes 2x2 outputs, and any leftovers at the end.
+        //
+        // Algorithm works as follows:
+        //
+        //   1. Load filters of 8 depth (8x3x3). Registers v0--v8 hold filter
+        //      values.
+        //   2. For 2 output heights at a time:
+        //        i.  For 2 output widths at a time at stride 2, a 5x5 input
+        //            window is required. To avoid register exhaustion, we load
+        //            the first 2 rows of the 5x5 input window into registers
+        //            v9--v18, and use the same registers to load the next 2
+        //            rows, and finally v9--v13 to load the last row.
+        //            Accumulators for all 2x2 outputs are reserved by registers
+        //            v21-v22 (top left output), v23-v24 (top right output),
+        //            v19-v20 (bottom left output), v25-v26 (bottom right
+        //            output).
+        //        ii. Handle single leftover width if exists.
+        //   3. Handle single leftover height if exists.
+        //        i.  For 2 output widths at a time at stride 2, load inputs for
+        //            a 1x2 (1 height, 2 width) output window (3x5 input
+        //            window). Registers v9--v24 hold input values. Mul-add with
+        //            accumulators v24--v27.
+        //        ii. Handle single leftover width if exists.
+        //
+        // Loads are placed as soon as the register is no longer needed and
+        // interleaved with arithmetic operations to take advantage of
+        // dual-issue pipelines. We also add input offsets as far from the loads
+        // as possible to give loads enough cycles to fetch data from memory.
+
+        // Set "constant" registers. These registers may be replaced with temp
+        // values from time to time when there are not enough NEON registers.
+        // We use x9--x15 general purpose registers as they are caller-saved
+        // temporary registers (see http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf).  // NOLINT
+        "ldr w9, [%[params_ptr], #" STR(OFFSET_OUTPUT_SHIFT) "]\n"
+        "ldr w0, [%[params_ptr], #" STR(OFFSET_INPUT_OFFSET) "]\n"
+        "cmp %w[output_window_height], #2\n"
+        "dup v28.8h, w0\n"
+        "neg w9, w9\n"
+        "ldr w1, [%[params_ptr], #" STR(OFFSET_OUTPUT_MULTIPLIER) "]\n"
+        "dup v26.4s, w9\n"
+        "ldr w2, [%[params_ptr], #" STR(OFFSET_OUTPUT_OFFSET) "]\n"
+        "dup v27.4s, w1\n"
+        "ldr w3, [%[params_ptr], #" STR(OFFSET_OUTPUT_ACTIVATION_MIN) "]\n"
+        "dup v29.4s, w2\n"
+        "ldr w4, [%[params_ptr], #" STR(OFFSET_OUTPUT_ACTIVATION_MAX) "]\n"
+        "dup v30.4s, w3\n"
+        "ldr x5, [%[params_ptr], #" STR(OFFSET_OUTPUT_DEPTH) "]\n"
+        "dup v31.4s, w4\n"
+        "ldr x19, [%[params_ptr], #" STR(OFFSET_OUTPUT_ROW_SIZE) "]\n"
+        "ldr w20, [%[params_ptr], #" STR(OFFSET_FILTER_OFFSET) "]\n"
+
+        // Load filters and add offsets.
+        "add x10, %[bias_ptr], #16\n"
+        "ld1 {v0.8b}, [%[filter_ptr]], x5\n"
+        "dup v9.8h, w20\n"
+        "ld1 {v1.8b}, [%[filter_ptr]], x5\n"
+        "uaddw v0.8h, v9.8h, v0.8b\n"
+        "ld1 {v2.8b}, [%[filter_ptr]], x5\n"
+        "uaddw v1.8h, v9.8h, v1.8b\n"
+        "ld1 {v3.8b}, [%[filter_ptr]], x5\n"
+        "uaddw v2.8h, v9.8h, v2.8b\n"
+        "ld1 {v4.8b}, [%[filter_ptr]], x5\n"
+        "uaddw v3.8h, v9.8h, v3.8b\n"
+        "ld1 {v5.8b}, [%[filter_ptr]], x5\n"
+        "uaddw v4.8h, v9.8h, v4.8b\n"
+        "ld1 {v6.8b}, [%[filter_ptr]], x5\n"
+        "uaddw v5.8h, v9.8h, v5.8b\n"
+        "ld1 {v7.8b}, [%[filter_ptr]], x5\n"
+        "uaddw v6.8h, v9.8h, v6.8b\n"
+        "ld1 {v8.8b}, [%[filter_ptr]]\n"
+        "uaddw v7.8h, v9.8h, v7.8b\n"
+        "uaddw v8.8h, v9.8h, v8.8b\n"
+
+        "blt " DEPTHWISECONV_LABEL_HEIGHT_2_AFTER_LOOP "f\n"
+
+        //"loop_%=:\n"
+        DEPTHWISECONV_LABEL_HEIGHT_2_LOOP ":\n"
+          // Load the first two rows of the 5x5 input window, then reuse the
+          // same registers to load subsequent rows as they become available.
+          "mov x11, %[input_ptr]\n"
+          "mov x12, x11\n"
+          "add x13, x12, %[input_row_size]\n"
+          "ld1 {v9.8b}, [x12], %[input_depth]\n"
+          "mov w14, %w[output_window_width]\n"
+          "ld1 {v10.8b}, [x12], %[input_depth]\n"
+          // The height 2 / width 2 loop loads an extra 1 output horizontally in
+          // anticipation for the next iteration. Make sure
+          // |output_window_width| is large enough to handle the additional
+          // load, otherwise jump to the appropriate label to handle smaller
+          // widths.
+          "cmp w14, #2\n"
+          "ld1 {v11.8b}, [x12], %[input_depth]\n"
+          "add x15, x13, %[input_row_size]\n"
+          "ld1 {v14.8b}, [x13], %[input_depth]\n"
+          "mov x6, %[output_ptr]\n"
+          "ld1 {v15.8b}, [x13], %[input_depth]\n"
+          "add x7, %[output_ptr], x19\n"
+          "ld1 {v16.8b}, [x13], %[input_depth]\n"
+          "ld1 {v21.4s}, [%[bias_ptr]]\n"
+          "ld1 {v22.4s}, [x10]\n"
+          "ld1 {v23.4s}, [%[bias_ptr]]\n"
+          "uaddw v9.8h, v28.8h, v9.8b\n"
+          "ld1 {v24.4s}, [x10]\n"
+          "uaddw v10.8h, v28.8h, v10.8b\n"
+          "ld1 {v19.4s}, [%[bias_ptr]]\n"
+          "uaddw v11.8h, v28.8h, v11.8b\n"
+          "ld1 {v20.4s}, [x10]\n"
+          "uaddw v14.8h, v28.8h, v14.8b\n"
+          "ld1 {v25.4s}, [%[bias_ptr]]\n"
+          "uaddw v15.8h, v28.8h, v15.8b\n"
+          "ld1 {v26.4s}, [x10]\n"
+          "uaddw v16.8h, v28.8h, v16.8b\n"
+
+          "beq " DEPTHWISECONV_LABEL_HEIGHT_2_WIDTH_2_LEFTOVER "f\n"
+          "cmp w14, #1\n"
+          "beq " DEPTHWISECONV_LABEL_HEIGHT_2_WIDTH_1_LEFTOVER "f\n"
+
+          //"loop_%=:\n"
+          DEPTHWISECONV_LABEL_HEIGHT_2_WIDTH_2_LOOP ":\n"
+            "smlal v21.4s, v0.4h, v9.4h\n"
+            "ld1 {v12.8b}, [x12], %[input_depth]\n"
+            "smlal2 v22.4s, v0.8h, v9.8h\n"
+            "ld1 {v13.8b}, [x12]\n"
+            "add x12, x15, %[input_row_size]\n"
+            "smlal v23.4s, v0.4h, v11.4h\n"
+            "ld1 {v17.8b}, [x13], %[input_depth]\n"
+            "smlal2 v24.4s, v0.8h, v11.8h\n"
+            "ld1 {v18.8b}, [x13]\n"
+            "add x13, x12, %[input_row_size]\n"
+            "smlal v21.4s, v1.4h, v10.4h\n"
+            "ld1 {v9.8b}, [x15], %[input_depth]\n"
+            "smlal2 v22.4s, v1.8h, v10.8h\n"
+            "ld1 {v10.8b}, [x15], %[input_depth]\n"
+            "smlal v21.4s, v2.4h, v11.4h\n"
+            "smlal2 v22.4s, v2.8h, v11.8h\n"
+            "ld1 {v11.8b}, [x15], %[input_depth]\n"
+            "smlal v21.4s, v3.4h, v14.4h\n"
+            "smlal2 v22.4s, v3.8h, v14.8h\n"
+            "ld1 {v14.8b}, [x12], %[input_depth]\n"
+            "smlal v23.4s, v3.4h, v16.4h\n"
+            "subs w14, w14, #2\n"
+            "smlal2 v24.4s, v3.8h, v16.8h\n"
+            "cmp w14, #3\n"
+            "smlal v21.4s, v4.4h, v15.4h\n"
+            "uaddw v12.8h, v28.8h, v12.8b\n"
+            "smlal2 v22.4s, v4.8h, v15.8h\n"
+            "ld1 {v15.8b}, [x12], %[input_depth]\n"
+            "smlal v21.4s, v5.4h, v16.4h\n"
+            "uaddw v13.8h, v28.8h, v13.8b\n"
+            "smlal2 v22.4s, v5.8h, v16.8h\n"
+            "ld1 {v16.8b}, [x12], %[input_depth]\n"
+            "smlal v23.4s, v1.4h, v12.4h\n"
+            "uaddw v17.8h, v28.8h, v17.8b\n"
+            "smlal2 v24.4s, v1.8h, v12.8h\n"
+            "ld1 {v12.8b}, [x15], %[input_depth]\n"
+            "smlal v23.4s, v2.4h, v13.4h\n"
+            "uaddw v18.8h, v28.8h, v18.8b\n"
+            "smlal2 v24.4s, v2.8h, v13.8h\n"
+            "ld1 {v13.8b}, [x15]\n"
+            "smlal v23.4s, v4.4h, v17.4h\n"
+            "uaddw v9.8h, v28.8h, v9.8b\n"
+            "smlal2 v24.4s, v4.8h, v17.8h\n"
+            "ld1 {v17.8b}, [x12], %[input_depth]\n"
+            "smlal v23.4s, v5.4h, v18.4h\n"
+            "uaddw v10.8h, v28.8h, v10.8b\n"
+            "smlal2 v24.4s, v5.8h, v18.8h\n"
+            "ld1 {v18.8b}, [x12]\n"
+
+            "smlal v21.4s, v6.4h, v9.4h\n"
+            "smlal2 v22.4s, v6.8h, v9.8h\n"
+            "smlal v19.4s, v0.4h, v9.4h\n"
+            "uaddw v11.8h, v28.8h, v11.8b\n"
+            "smlal2 v20.4s, v0.8h, v9.8h\n"
+            "ld1 {v9.8b}, [x13], %[input_depth]\n"
+            "smlal v23.4s, v6.4h, v11.4h\n"
+            "smlal2 v24.4s, v6.8h, v11.8h\n"
+            "smlal v21.4s, v7.4h, v10.4h\n"
+            "smlal2 v22.4s, v7.8h, v10.8h\n"
+            "uaddw v12.8h, v28.8h, v12.8b\n"
+            "smlal v19.4s, v1.4h, v10.4h\n"
+            "smlal2 v20.4s, v1.8h, v10.8h\n"
+            "ld1 {v10.8b}, [x13], %[input_depth]\n"
+            "smlal v23.4s, v7.4h, v12.4h\n"
+            "smlal2 v24.4s, v7.8h, v12.8h\n"
+            "smlal v25.4s, v1.4h, v12.4h\n"
+            "smlal2 v26.4s, v1.8h, v12.8h\n"
+            "smlal v21.4s, v8.4h, v11.4h\n"
+            "smlal2 v22.4s, v8.8h, v11.8h\n"
+            "add x11, x11, %[input_width_increment]\n"
+            "smlal v19.4s, v2.4h, v11.4h\n"
+            "mov x12, x11\n"
+            "smlal2 v20.4s, v2.8h, v11.8h\n"
+            "uaddw v13.8h, v28.8h, v13.8b\n"
+            "smlal v25.4s, v0.4h, v11.4h\n"
+            "smlal2 v26.4s, v0.8h, v11.8h\n"
+            "ld1 {v11.8b}, [x13], %[input_depth]\n"
+            "smlal v23.4s, v8.4h, v13.4h\n"
+            "ld1 {v12.8b}, [x13], %[input_depth]\n"
+            "smlal2 v24.4s, v8.8h, v13.8h\n"
+            "smlal v25.4s, v2.4h, v13.4h\n"
+            "smlal2 v26.4s, v2.8h, v13.8h\n"
+            "ld1 {v13.8b}, [x13]\n"
+            "add x13, x12, %[input_row_size]\n"
+            "add x15, x13, %[input_row_size]\n"
+
+            "dup v28.4s, w9\n"
+            "sqrdmulh v21.4s, v21.4s, v27.4s\n"
+            "sqrdmulh v22.4s, v22.4s, v27.4s\n"
+            "sqrdmulh v23.4s, v23.4s, v27.4s\n"
+            "sqrdmulh v24.4s, v24.4s, v27.4s\n"
+            "and v27.16b, v21.16b, v28.16b\n"
+            "and v29.16b, v22.16b, v28.16b\n"
+            "and v30.16b, v23.16b, v28.16b\n"
+            "and v31.16b, v24.16b, v28.16b\n"
+            "sshr v27.4s, v27.4s, #31\n"
+            "sshr v29.4s, v29.4s, #31\n"
+            "sshr v30.4s, v30.4s, #31\n"
+            "sshr v31.4s, v31.4s, #31\n"
+            "sqadd v21.4s, v21.4s, v27.4s\n"
+            "dup v27.4s, w1\n"
+            "sqadd v22.4s, v22.4s, v29.4s\n"
+            "dup v29.4s, w2\n"
+            "sqadd v23.4s, v23.4s, v30.4s\n"
+            "dup v30.4s, w3\n"
+            "sqadd v24.4s, v24.4s, v31.4s\n"
+            "dup v31.4s, w4\n"
+            "srshl v21.4s, v21.4s, v28.4s\n"
+            "srshl v22.4s, v22.4s, v28.4s\n"
+            "srshl v23.4s, v23.4s, v28.4s\n"
+            "srshl v24.4s, v24.4s, v28.4s\n"
+            "dup v28.8h, w0\n"
+            "add v21.4s, v21.4s, v29.4s\n"
+            "add v22.4s, v22.4s, v29.4s\n"
+            "add v23.4s, v23.4s, v29.4s\n"
+            "add v24.4s, v24.4s, v29.4s\n"
+            "smax v21.4s, v21.4s, v30.4s\n"
+            "smax v22.4s, v22.4s, v30.4s\n"
+            "smax v23.4s, v23.4s, v30.4s\n"
+            "smax v24.4s, v24.4s, v30.4s\n"
+            "smin v21.4s, v21.4s, v31.4s\n"
+            "smin v22.4s, v22.4s, v31.4s\n"
+            "smin v23.4s, v23.4s, v31.4s\n"
+            "smin v24.4s, v24.4s, v31.4s\n"
+            "sqxtn v21.4h, v21.4s\n"
+            "sqxtn v23.4h, v23.4s\n"
+            "sqxtn2 v21.8h, v22.4s\n"
+            "ld1 {v22.4s}, [x10]\n"
+            "sqxtn2 v23.8h, v24.4s\n"
+            "ld1 {v24.4s}, [x10]\n"
+            "sqxtun v21.8b, v21.8h\n"
+            "sqxtun v23.8b, v23.8h\n"
+            "uaddw v9.8h, v28.8h, v9.8b\n"
+            "st1 {v21.8b}, [x6], x5\n"
+            "uaddw v10.8h, v28.8h, v10.8b\n"
+            "st1 {v23.8b}, [x6], x5\n"
+            "uaddw v11.8h, v28.8h, v11.8b\n"
+
+            "smlal v19.4s, v6.4h, v9.4h\n"
+            "smlal2 v20.4s, v6.8h, v9.8h\n"
+            "ld1 {v9.8b}, [x12], %[input_depth]\n"
+            "smlal v25.4s, v6.4h, v11.4h\n"
+            "smlal2 v26.4s, v6.8h, v11.8h\n"
+            "smlal v19.4s, v7.4h, v10.4h\n"
+            "uaddw v12.8h, v28.8h, v12.8b\n"
+            "smlal2 v20.4s, v7.8h, v10.8h\n"
+            "ld1 {v10.8b}, [x12], %[input_depth]\n"
+            "smlal v25.4s, v7.4h, v12.4h\n"
+            "smlal2 v26.4s, v7.8h, v12.8h\n"
+            "smlal v19.4s, v8.4h, v11.4h\n"
+            "uaddw v13.8h, v28.8h, v13.8b\n"
+            "smlal2 v20.4s, v8.8h, v11.8h\n"
+            "ld1 {v11.8b}, [x12], %[input_depth]\n"
+            "smlal v25.4s, v8.4h, v13.4h\n"
+            "uaddw v14.8h, v28.8h, v14.8b\n"
+            "smlal2 v26.4s, v8.8h, v13.8h\n"
+            "uaddw v16.8h, v28.8h, v16.8b\n"
+            "smlal v19.4s, v3.4h, v14.4h\n"
+            "uaddw v15.8h, v28.8h, v15.8b\n"
+            "smlal2 v20.4s, v3.8h, v14.8h\n"
+            "ld1 {v14.8b}, [x13], %[input_depth]\n"
+            "smlal v25.4s, v3.4h, v16.4h\n"
+            "ld1 {v21.4s}, [%[bias_ptr]]\n"
+            "smlal2 v26.4s, v3.8h, v16.8h\n"
+            "ld1 {v23.4s}, [%[bias_ptr]]\n"
+            "smlal v19.4s, v4.4h, v15.4h\n"
+            "uaddw v17.8h, v28.8h, v17.8b\n"
+            "smlal2 v20.4s, v4.8h, v15.8h\n"
+            "ld1 {v15.8b}, [x13], %[input_depth]\n"
+            "smlal v25.4s, v4.4h, v17.4h\n"
+            "smlal2 v26.4s, v4.8h, v17.8h\n"
+            "smlal v19.4s, v5.4h, v16.4h\n"
+            "uaddw v18.8h, v28.8h, v18.8b\n"
+            "smlal2 v20.4s, v5.8h, v16.8h\n"
+            "ld1 {v16.8b}, [x13], %[input_depth]\n"
+            "smlal v25.4s, v5.4h, v18.4h\n"
+            "smlal2 v26.4s, v5.8h, v18.8h\n"
+
+            "dup v28.4s, w9\n"
+            "sqrdmulh v19.4s, v19.4s, v27.4s\n"
+            "sqrdmulh v20.4s, v20.4s, v27.4s\n"
+            "sqrdmulh v25.4s, v25.4s, v27.4s\n"
+            "sqrdmulh v26.4s, v26.4s, v27.4s\n"
+            "and v27.16b, v19.16b, v28.16b\n"
+            "and v29.16b, v20.16b, v28.16b\n"
+            "and v30.16b, v25.16b, v28.16b\n"
+            "and v31.16b, v26.16b, v28.16b\n"
+            "sshr v27.4s, v27.4s, #31\n"
+            "sshr v29.4s, v29.4s, #31\n"
+            "sshr v30.4s, v30.4s, #31\n"
+            "sshr v31.4s, v31.4s, #31\n"
+            "sqadd v19.4s, v19.4s, v27.4s\n"
+            "dup v27.4s, w1\n"
+            "sqadd v20.4s, v20.4s, v29.4s\n"
+            "dup v29.4s, w2\n"
+            "sqadd v25.4s, v25.4s, v30.4s\n"
+            "dup v30.4s, w3\n"
+            "sqadd v26.4s, v26.4s, v31.4s\n"
+            "dup v31.4s, w4\n"
+            "srshl v19.4s, v19.4s, v28.4s\n"
+            "srshl v20.4s, v20.4s, v28.4s\n"
+            "srshl v25.4s, v25.4s, v28.4s\n"
+            "srshl v26.4s, v26.4s, v28.4s\n"
+            "dup v28.8h, w0\n"
+            "add v19.4s, v19.4s, v29.4s\n"
+            "add v20.4s, v20.4s, v29.4s\n"
+            "add v25.4s, v25.4s, v29.4s\n"
+            "add v26.4s, v26.4s, v29.4s\n"
+            "smax v19.4s, v19.4s, v30.4s\n"
+            "smax v20.4s, v20.4s, v30.4s\n"
+            "smax v25.4s, v25.4s, v30.4s\n"
+            "smax v26.4s, v26.4s, v30.4s\n"
+            "smin v19.4s, v19.4s, v31.4s\n"
+            "smin v20.4s, v20.4s, v31.4s\n"
+            "smin v25.4s, v25.4s, v31.4s\n"
+            "smin v26.4s, v26.4s, v31.4s\n"
+            "sqxtn v19.4h, v19.4s\n"
+            "sqxtn v25.4h, v25.4s\n"
+            "sqxtn2 v19.8h, v20.4s\n"
+            "ld1 {v20.4s}, [x10]\n"
+            "sqxtn2 v25.8h, v26.4s\n"
+            "ld1 {v26.4s}, [x10]\n"
+            "sqxtun v19.8b, v19.8h\n"
+            "sqxtun v25.8b, v25.8h\n"
+            "uaddw v9.8h, v28.8h, v9.8b\n"
+            "st1 {v19.8b}, [x7], x5\n"
+            "uaddw v10.8h, v28.8h, v10.8b\n"
+            "st1 {v25.8b}, [x7], x5\n"
+            "uaddw v11.8h, v28.8h, v11.8b\n"
+            "ld1 {v19.4s}, [%[bias_ptr]]\n"
+            "uaddw v14.8h, v28.8h, v14.8b\n"
+            "ld1 {v25.4s}, [%[bias_ptr]]\n"
+            "uaddw v15.8h, v28.8h, v15.8b\n"
+            "uaddw v16.8h, v28.8h, v16.8b\n"
+
+            "bge " DEPTHWISECONV_LABEL_HEIGHT_2_WIDTH_2_LOOP "b\n"
+
+          // At this point, there will be one of 2 width or 1 width leftover,
+          // not both.
+          "cmp w14, #2\n"
+          "blt " DEPTHWISECONV_LABEL_HEIGHT_2_WIDTH_1_LEFTOVER "f\n"
+
+          // Handle last 2 columns if exists.
+          DEPTHWISECONV_LABEL_HEIGHT_2_WIDTH_2_LEFTOVER ":\n"
+          "smlal v21.4s, v0.4h, v9.4h\n"
+          "ld1 {v12.8b}, [x12], %[input_depth]\n"
+          "smlal2 v22.4s, v0.8h, v9.8h\n"
+          "ld1 {v13.8b}, [x12]\n"
+          "add x12, x15, %[input_row_size]\n"
+          "smlal v23.4s, v0.4h, v11.4h\n"
+          "ld1 {v17.8b}, [x13], %[input_depth]\n"
+          "smlal2 v24.4s, v0.8h, v11.8h\n"
+          "ld1 {v18.8b}, [x13]\n"
+          "add x13, x12, %[input_row_size]\n"
+          "smlal v21.4s, v1.4h, v10.4h\n"
+          "ld1 {v9.8b}, [x15], %[input_depth]\n"
+          "smlal2 v22.4s, v1.8h, v10.8h\n"
+          "ld1 {v10.8b}, [x15], %[input_depth]\n"
+          "smlal v21.4s, v2.4h, v11.4h\n"
+          "smlal2 v22.4s, v2.8h, v11.8h\n"
+          "ld1 {v11.8b}, [x15], %[input_depth]\n"
+          "smlal v21.4s, v3.4h, v14.4h\n"
+          "smlal2 v22.4s, v3.8h, v14.8h\n"
+          "ld1 {v14.8b}, [x12], %[input_depth]\n"
+          "smlal v23.4s, v3.4h, v16.4h\n"
+          "smlal2 v24.4s, v3.8h, v16.8h\n"
+          "smlal v21.4s, v4.4h, v15.4h\n"
+          "uaddw v12.8h, v28.8h, v12.8b\n"
+          "smlal2 v22.4s, v4.8h, v15.8h\n"
+          "ld1 {v15.8b}, [x12], %[input_depth]\n"
+          "smlal v21.4s, v5.4h, v16.4h\n"
+          "uaddw v13.8h, v28.8h, v13.8b\n"
+          "smlal2 v22.4s, v5.8h, v16.8h\n"
+          "ld1 {v16.8b}, [x12], %[input_depth]\n"
+          "smlal v23.4s, v1.4h, v12.4h\n"
+          "uaddw v17.8h, v28.8h, v17.8b\n"
+          "smlal2 v24.4s, v1.8h, v12.8h\n"
+          "ld1 {v12.8b}, [x15], %[input_depth]\n"
+          "smlal v23.4s, v2.4h, v13.4h\n"
+          "uaddw v18.8h, v28.8h, v18.8b\n"
+          "smlal2 v24.4s, v2.8h, v13.8h\n"
+          "ld1 {v13.8b}, [x15]\n"
+          "smlal v23.4s, v4.4h, v17.4h\n"
+          "uaddw v9.8h, v28.8h, v9.8b\n"
+          "smlal2 v24.4s, v4.8h, v17.8h\n"
+          "ld1 {v17.8b}, [x12], %[input_depth]\n"
+          "smlal v23.4s, v5.4h, v18.4h\n"
+          "uaddw v10.8h, v28.8h, v10.8b\n"
+          "smlal2 v24.4s, v5.8h, v18.8h\n"
+          "ld1 {v18.8b}, [x12]\n"
+
+          "smlal v21.4s, v6.4h, v9.4h\n"
+          "smlal2 v22.4s, v6.8h, v9.8h\n"
+          "smlal v19.4s, v0.4h, v9.4h\n"
+          "uaddw v11.8h, v28.8h, v11.8b\n"
+          "smlal2 v20.4s, v0.8h, v9.8h\n"
+          "ld1 {v9.8b}, [x13], %[input_depth]\n"
+          "smlal v23.4s, v6.4h, v11.4h\n"
+          "smlal2 v24.4s, v6.8h, v11.8h\n"
+          "smlal v21.4s, v7.4h, v10.4h\n"
+          "smlal2 v22.4s, v7.8h, v10.8h\n"
+          "uaddw v12.8h, v28.8h, v12.8b\n"
+          "smlal v19.4s, v1.4h, v10.4h\n"
+          "smlal2 v20.4s, v1.8h, v10.8h\n"
+          "ld1 {v10.8b}, [x13], %[input_depth]\n"
+          "smlal v23.4s, v7.4h, v12.4h\n"
+          "smlal2 v24.4s, v7.8h, v12.8h\n"
+          "smlal v25.4s, v1.4h, v12.4h\n"
+          "smlal2 v26.4s, v1.8h, v12.8h\n"
+          "smlal v21.4s, v8.4h, v11.4h\n"
+          "smlal2 v22.4s, v8.8h, v11.8h\n"
+          "smlal v19.4s, v2.4h, v11.4h\n"
+          "smlal2 v20.4s, v2.8h, v11.8h\n"
+          "uaddw v13.8h, v28.8h, v13.8b\n"
+          "smlal v25.4s, v0.4h, v11.4h\n"
+          "smlal2 v26.4s, v0.8h, v11.8h\n"
+          "ld1 {v11.8b}, [x13], %[input_depth]\n"
+          "smlal v23.4s, v8.4h, v13.4h\n"
+          "ld1 {v12.8b}, [x13], %[input_depth]\n"
+          "smlal2 v24.4s, v8.8h, v13.8h\n"
+          "smlal v25.4s, v2.4h, v13.4h\n"
+          "smlal2 v26.4s, v2.8h, v13.8h\n"
+          "ld1 {v13.8b}, [x13]\n"
+
+          "dup v28.4s, w9\n"
+          "sqrdmulh v21.4s, v21.4s, v27.4s\n"
+          "sqrdmulh v22.4s, v22.4s, v27.4s\n"
+          "sqrdmulh v23.4s, v23.4s, v27.4s\n"
+          "sqrdmulh v24.4s, v24.4s, v27.4s\n"
+          "and v27.16b, v21.16b, v28.16b\n"
+          "and v29.16b, v22.16b, v28.16b\n"
+          "and v30.16b, v23.16b, v28.16b\n"
+          "and v31.16b, v24.16b, v28.16b\n"
+          "sshr v27.4s, v27.4s, #31\n"
+          "sshr v29.4s, v29.4s, #31\n"
+          "sshr v30.4s, v30.4s, #31\n"
+          "sshr v31.4s, v31.4s, #31\n"
+          "sqadd v21.4s, v21.4s, v27.4s\n"
+          "dup v27.4s, w1\n"
+          "sqadd v22.4s, v22.4s, v29.4s\n"
+          "dup v29.4s, w2\n"
+          "sqadd v23.4s, v23.4s, v30.4s\n"
+          "dup v30.4s, w3\n"
+          "sqadd v24.4s, v24.4s, v31.4s\n"
+          "dup v31.4s, w4\n"
+          "srshl v21.4s, v21.4s, v28.4s\n"
+          "srshl v22.4s, v22.4s, v28.4s\n"
+          "srshl v23.4s, v23.4s, v28.4s\n"
+          "srshl v24.4s, v24.4s, v28.4s\n"
+          "dup v28.8h, w0\n"
+          "add v21.4s, v21.4s, v29.4s\n"
+          "add v22.4s, v22.4s, v29.4s\n"
+          "add v23.4s, v23.4s, v29.4s\n"
+          "add v24.4s, v24.4s, v29.4s\n"
+          "smax v21.4s, v21.4s, v30.4s\n"
+          "smax v22.4s, v22.4s, v30.4s\n"
+          "smax v23.4s, v23.4s, v30.4s\n"
+          "smax v24.4s, v24.4s, v30.4s\n"
+          "smin v21.4s, v21.4s, v31.4s\n"
+          "smin v22.4s, v22.4s, v31.4s\n"
+          "smin v23.4s, v23.4s, v31.4s\n"
+          "smin v24.4s, v24.4s, v31.4s\n"
+          "sqxtn v21.4h, v21.4s\n"
+          "sqxtn v23.4h, v23.4s\n"
+          "sqxtn2 v21.8h, v22.4s\n"
+          "ld1 {v22.4s}, [x10]\n"
+          "sqxtn2 v23.8h, v24.4s\n"
+          "ld1 {v24.4s}, [x10]\n"
+          "sqxtun v21.8b, v21.8h\n"
+          "sqxtun v23.8b, v23.8h\n"
+          "uaddw v9.8h, v28.8h, v9.8b\n"
+          "st1 {v21.8b}, [x6], x5\n"
+          "uaddw v10.8h, v28.8h, v10.8b\n"
+          "st1 {v23.8b}, [x6]\n"
+          "uaddw v11.8h, v28.8h, v11.8b\n"
+
+          "smlal v19.4s, v6.4h, v9.4h\n"
+          "smlal2 v20.4s, v6.8h, v9.8h\n"
+          "smlal v25.4s, v6.4h, v11.4h\n"
+          "smlal2 v26.4s, v6.8h, v11.8h\n"
+          "smlal v19.4s, v7.4h, v10.4h\n"
+          "uaddw v12.8h, v28.8h, v12.8b\n"
+          "smlal2 v20.4s, v7.8h, v10.8h\n"
+          "smlal v25.4s, v7.4h, v12.4h\n"
+          "smlal2 v26.4s, v7.8h, v12.8h\n"
+          "smlal v19.4s, v8.4h, v11.4h\n"
+          "uaddw v13.8h, v28.8h, v13.8b\n"
+          "smlal2 v20.4s, v8.8h, v11.8h\n"
+          "smlal v25.4s, v8.4h, v13.4h\n"
+          "uaddw v14.8h, v28.8h, v14.8b\n"
+          "smlal2 v26.4s, v8.8h, v13.8h\n"
+          "uaddw v16.8h, v28.8h, v16.8b\n"
+          "smlal v19.4s, v3.4h, v14.4h\n"
+          "uaddw v15.8h, v28.8h, v15.8b\n"
+          "smlal2 v20.4s, v3.8h, v14.8h\n"
+          "smlal v25.4s, v3.4h, v16.4h\n"
+          "smlal2 v26.4s, v3.8h, v16.8h\n"
+          "smlal v19.4s, v4.4h, v15.4h\n"
+          "uaddw v17.8h, v28.8h, v17.8b\n"
+          "smlal2 v20.4s, v4.8h, v15.8h\n"
+          "smlal v25.4s, v4.4h, v17.4h\n"
+          "smlal2 v26.4s, v4.8h, v17.8h\n"
+          "smlal v19.4s, v5.4h, v16.4h\n"
+          "uaddw v18.8h, v28.8h, v18.8b\n"
+          "smlal2 v20.4s, v5.8h, v16.8h\n"
+          "smlal v25.4s, v5.4h, v18.4h\n"
+          "smlal2 v26.4s, v5.8h, v18.8h\n"
+
+          "dup v28.4s, w9\n"
+          "sqrdmulh v19.4s, v19.4s, v27.4s\n"
+          "sqrdmulh v20.4s, v20.4s, v27.4s\n"
+          "sqrdmulh v25.4s, v25.4s, v27.4s\n"
+          "sqrdmulh v26.4s, v26.4s, v27.4s\n"
+          "and v27.16b, v19.16b, v28.16b\n"
+          "and v29.16b, v20.16b, v28.16b\n"
+          "and v30.16b, v25.16b, v28.16b\n"
+          "and v31.16b, v26.16b, v28.16b\n"
+          "sshr v27.4s, v27.4s, #31\n"
+          "sshr v29.4s, v29.4s, #31\n"
+          "sshr v30.4s, v30.4s, #31\n"
+          "sshr v31.4s, v31.4s, #31\n"
+          "sqadd v19.4s, v19.4s, v27.4s\n"
+          "dup v27.4s, w1\n"
+          "sqadd v20.4s, v20.4s, v29.4s\n"
+          "dup v29.4s, w2\n"
+          "sqadd v25.4s, v25.4s, v30.4s\n"
+          "dup v30.4s, w3\n"
+          "sqadd v26.4s, v26.4s, v31.4s\n"
+          "dup v31.4s, w4\n"
+          "srshl v19.4s, v19.4s, v28.4s\n"
+          "srshl v20.4s, v20.4s, v28.4s\n"
+          "srshl v25.4s, v25.4s, v28.4s\n"
+          "srshl v26.4s, v26.4s, v28.4s\n"
+          "dup v28.8h, w0\n"
+          "add v19.4s, v19.4s, v29.4s\n"
+          "add v20.4s, v20.4s, v29.4s\n"
+          "add v25.4s, v25.4s, v29.4s\n"
+          "add v26.4s, v26.4s, v29.4s\n"
+          "smax v19.4s, v19.4s, v30.4s\n"
+          "smax v20.4s, v20.4s, v30.4s\n"
+          "smax v25.4s, v25.4s, v30.4s\n"
+          "smax v26.4s, v26.4s, v30.4s\n"
+          "smin v19.4s, v19.4s, v31.4s\n"
+          "smin v20.4s, v20.4s, v31.4s\n"
+          "smin v25.4s, v25.4s, v31.4s\n"
+          "smin v26.4s, v26.4s, v31.4s\n"
+          "sqxtn v19.4h, v19.4s\n"
+          "sqxtn v25.4h, v25.4s\n"
+          "sqxtn2 v19.8h, v20.4s\n"
+          "sqxtn2 v25.8h, v26.4s\n"
+          "sqxtun v19.8b, v19.8h\n"
+          "sqxtun v25.8b, v25.8h\n"
+          "st1 {v19.8b}, [x7], x5\n"
+          "st1 {v25.8b}, [x7]\n"
+          "b " DEPTHWISECONV_LABEL_HEIGHT_2_WIDTH_2_AFTER_LOOP "f\n"
+
+          // Handle last column if exists.
+          DEPTHWISECONV_LABEL_HEIGHT_2_WIDTH_1_LEFTOVER ":\n"
+          // Registers v9, v10, v11, v14, v15, and v16 have already been loaded
+          // with the correct values at this point. This corresponds to the
+          // first two input rows of the top left output. Now load the last
+          // input row for this output. Once these inputs are no longer needed,
+          // load the input rows for the bottom left output.
+          "add x12, x15, %[input_row_size]\n"
+          "add x13, x12, %[input_row_size]\n"
+
+          "ld1 {v12.8b}, [x15], %[input_depth]\n"
+          "smlal v21.4s, v0.4h, v9.4h\n"
+          "ld1 {v13.8b}, [x15], %[input_depth]\n"
+          "smlal2 v22.4s, v0.8h, v9.8h\n"
+          "ld1 {v17.8b}, [x15]\n"
+          "smlal v21.4s, v1.4h, v10.4h\n"
+          "ld1 {v9.8b}, [x12], %[input_depth]\n"
+          "smlal2 v22.4s, v1.8h, v10.8h\n"
+          "ld1 {v10.8b}, [x12], %[input_depth]\n"
+          "smlal v21.4s, v2.4h, v11.4h\n"
+          "smlal2 v22.4s, v2.8h, v11.8h\n"
+          "ld1 {v11.8b}, [x12]\n"
+          "smlal v21.4s, v3.4h, v14.4h\n"
+          "smlal2 v22.4s, v3.8h, v14.8h\n"
+          "ld1 {v14.8b}, [x13], %[input_depth]\n"
+          "smlal v21.4s, v4.4h, v15.4h\n"
+          "smlal2 v22.4s, v4.8h, v15.8h\n"
+          "ld1 {v15.8b}, [x13], %[input_depth]\n"
+          "smlal v21.4s, v5.4h, v16.4h\n"
+          "uaddw v12.8h, v28.8h, v12.8b\n"
+          "smlal2 v22.4s, v5.8h, v16.8h\n"
+          "uaddw v13.8h, v28.8h, v13.8b\n"
+          "ld1 {v16.8b}, [x13]\n"
+
+          "smlal v21.4s, v6.4h, v12.4h\n"
+          "smlal2 v22.4s, v6.8h, v12.8h\n"
+          "smlal v23.4s, v0.4h, v12.4h\n"
+          "uaddw v17.8h, v28.8h, v17.8b\n"
+          "smlal2 v24.4s, v0.8h, v12.8h\n"
+          "smlal v21.4s, v7.4h, v13.4h\n"
+          "smlal2 v22.4s, v7.8h, v13.8h\n"
+          "smlal v23.4s, v1.4h, v13.4h\n"
+          "smlal2 v24.4s, v1.8h, v13.8h\n"
+          "smlal v21.4s, v8.4h, v17.4h\n"
+          "smlal2 v22.4s, v8.8h, v17.8h\n"
+          "smlal v23.4s, v2.4h, v17.4h\n"
+          "smlal2 v24.4s, v2.8h, v17.8h\n"
+
+          "dup v26.4s, w9\n"
+          "sqrdmulh v21.4s, v21.4s, v27.4s\n"
+          "sqrdmulh v22.4s, v22.4s, v27.4s\n"
+          "and v18.16b, v21.16b, v26.16b\n"
+          "and v19.16b, v22.16b, v26.16b\n"
+          "sshr v18.4s, v18.4s, #31\n"
+          "sshr v19.4s, v19.4s, #31\n"
+          "sqadd v21.4s, v21.4s, v18.4s\n"
+          "sqadd v22.4s, v22.4s, v19.4s\n"
+          "srshl v21.4s, v21.4s, v26.4s\n"
+          "srshl v22.4s, v22.4s, v26.4s\n"
+          "add v21.4s, v21.4s, v29.4s\n"
+          "add v22.4s, v22.4s, v29.4s\n"
+          "smax v21.4s, v21.4s, v30.4s\n"
+          "smax v22.4s, v22.4s, v30.4s\n"
+          "smin v21.4s, v21.4s, v31.4s\n"
+          "smin v22.4s, v22.4s, v31.4s\n"
+          "sqxtn v21.4h, v21.4s\n"
+          "sqxtn2 v21.8h, v22.4s\n"
+          "sqxtun v21.8b, v21.8h\n"
+          "uaddw v9.8h, v28.8h, v9.8b\n"
+          "st1 {v21.8b}, [x6]\n"
+          "uaddw v10.8h, v28.8h, v10.8b\n"
+
+          "smlal v23.4s, v3.4h, v9.4h\n"
+          "uaddw v11.8h, v28.8h, v11.8b\n"
+          "smlal2 v24.4s, v3.8h, v9.8h\n"
+          "uaddw v14.8h, v28.8h, v14.8b\n"
+          "smlal v23.4s, v4.4h, v10.4h\n"
+          "uaddw v15.8h, v28.8h, v15.8b\n"
+          "smlal2 v24.4s, v4.8h, v10.8h\n"
+          "uaddw v16.8h, v28.8h, v16.8b\n"
+          "smlal v23.4s, v5.4h, v11.4h\n"
+          "smlal2 v24.4s, v5.8h, v11.8h\n"
+
+          "smlal v23.4s, v6.4h, v14.4h\n"
+          "smlal2 v24.4s, v6.8h, v14.8h\n"
+          "smlal v23.4s, v7.4h, v15.4h\n"
+          "smlal2 v24.4s, v7.8h, v15.8h\n"
+          "smlal v23.4s, v8.4h, v16.4h\n"
+          "smlal2 v24.4s, v8.8h, v16.8h\n"
+
+          "sqrdmulh v23.4s, v23.4s, v27.4s\n"
+          "sqrdmulh v24.4s, v24.4s, v27.4s\n"
+          "and v18.16b, v23.16b, v26.16b\n"
+          "and v19.16b, v24.16b, v26.16b\n"
+          "sshr v18.4s, v18.4s, #31\n"
+          "sshr v19.4s, v19.4s, #31\n"
+          "sqadd v23.4s, v23.4s, v18.4s\n"
+          "sqadd v24.4s, v24.4s, v19.4s\n"
+          "srshl v23.4s, v23.4s, v26.4s\n"
+          "srshl v24.4s, v24.4s, v26.4s\n"
+          "add v23.4s, v23.4s, v29.4s\n"
+          "add v24.4s, v24.4s, v29.4s\n"
+          "smax v23.4s, v23.4s, v30.4s\n"
+          "smax v24.4s, v24.4s, v30.4s\n"
+          "smin v23.4s, v23.4s, v31.4s\n"
+          "smin v24.4s, v24.4s, v31.4s\n"
+          "sqxtn v23.4h, v23.4s\n"
+          "sqxtn2 v23.8h, v24.4s\n"
+          "sqxtun v23.8b, v23.8h\n"
+          "st1 {v23.8b}, [x7]\n"
+
+          DEPTHWISECONV_LABEL_HEIGHT_2_WIDTH_2_AFTER_LOOP ":\n"
+          "subs %w[output_window_height], %w[output_window_height], #2\n"
+          "add %[input_ptr], %[input_ptr], %[input_height_increment]\n"
+          "cmp %w[output_window_height], #2\n"
+          "add %[output_ptr], %[output_ptr], %[output_height_increment]\n"
+          "bge " DEPTHWISECONV_LABEL_HEIGHT_2_LOOP "b\n"
+
+        DEPTHWISECONV_LABEL_HEIGHT_2_AFTER_LOOP ":\n"
+        "cmp %w[output_window_height], #1\n"
+        "blt " DEPTHWISECONV_LABEL_HEIGHT_1_END "f\n"
+
+        DEPTHWISECONV_LABEL_HEIGHT_1 ":\n"
+        "mov x11, %[input_ptr]\n"
+        "mov x12, x11\n"
+        "add x13, x12, %[input_row_size]\n"
+        "ld1 {v9.8b}, [x12], %[input_depth]\n"
+        "add x15, x13, %[input_row_size]\n"
+        "ld1 {v10.8b}, [x12], %[input_depth]\n"
+        "mov x6, %[output_ptr]\n"
+        "ld1 {v11.8b}, [x12], %[input_depth]\n"
+        "mov w14, %w[output_window_width]\n"
+        // The height 1 / width 2 loop loads an extra 1x1 output in anticipation
+        // for the next iteration. Make sure |output_window_width| is large
+        // enough to handle the additional load, otherwise jump to the
+        // appropriate label to handle smaller widths.
+        "cmp w14, #2\n"
+        "ld1 {v12.8b}, [x13], %[input_depth]\n"
+        "ld1 {v13.8b}, [x13], %[input_depth]\n"
+        "ld1 {v14.8b}, [x13], %[input_depth]\n"
+        "ld1 {v15.8b}, [x15], %[input_depth]\n"
+        "ld1 {v16.8b}, [x15], %[input_depth]\n"
+        "ld1 {v17.8b}, [x15], %[input_depth]\n"
+
+        "uaddw v9.8h, v28.8h, v9.8b\n"
+        "ld1 {v24.4s}, [%[bias_ptr]]\n"
+        "uaddw v10.8h, v28.8h, v10.8b\n"
+        "ld1 {v25.4s}, [x10]\n"
+        "uaddw v11.8h, v28.8h, v11.8b\n"
+        "ld1 {v26.4s}, [%[bias_ptr]]\n"
+        "ld1 {v27.4s}, [x10]\n"
+        "uaddw v12.8h, v28.8h, v12.8b\n"
+        "uaddw v13.8h, v28.8h, v13.8b\n"
+        "uaddw v14.8h, v28.8h, v14.8b\n"
+        "uaddw v15.8h, v28.8h, v15.8b\n"
+        "uaddw v16.8h, v28.8h, v16.8b\n"
+        "uaddw v17.8h, v28.8h, v17.8b\n"
+
+        "beq " DEPTHWISECONV_LABEL_HEIGHT_1_WIDTH_2_LEFTOVER "f\n"
+        "cmp w14, #1\n"
+        "beq " DEPTHWISECONV_LABEL_HEIGHT_1_WIDTH_1_LEFTOVER "f\n"
+
+        //"loop_%=:\n"
+        DEPTHWISECONV_LABEL_HEIGHT_1_WIDTH_2_LOOP ":\n"
+          "smlal v24.4s, v0.4h, v9.4h\n"
+          "ld1 {v18.8b}, [x12], %[input_depth]\n"
+          "smlal2 v25.4s, v0.8h, v9.8h\n"
+          "ld1 {v19.8b}, [x12]\n"
+          "smlal v26.4s, v0.4h, v11.4h\n"
+          "ld1 {v20.8b}, [x13], %[input_depth]\n"
+          "smlal2 v27.4s, v0.8h, v11.8h\n"
+          "ld1 {v21.8b}, [x13]\n"
+          "smlal v24.4s, v1.4h, v10.4h\n"
+          "ld1 {v22.8b}, [x15], %[input_depth]\n"
+          "smlal2 v25.4s, v1.8h, v10.8h\n"
+          "ld1 {v23.8b}, [x15]\n"
+          "smlal v24.4s, v2.4h, v11.4h\n"
+          "subs w14, w14, #2\n"
+          "smlal2 v25.4s, v2.8h, v11.8h\n"
+          "cmp w14, #3\n"
+          "smlal v24.4s, v3.4h, v12.4h\n"
+          "add x11, x11, %[input_width_increment]\n"
+          "smlal2 v25.4s, v3.8h, v12.8h\n"
+          "mov x12, x11\n"
+          "smlal v26.4s, v3.4h, v14.4h\n"
+          "add x13, x12, %[input_row_size]\n"
+          "smlal2 v27.4s, v3.8h, v14.8h\n"
+          "add x15, x13, %[input_row_size]\n"
+          "smlal v24.4s, v4.4h, v13.4h\n"
+          "ld1 {v9.8b}, [x12], %[input_depth]\n"
+          "smlal2 v25.4s, v4.8h, v13.8h\n"
+          "ld1 {v10.8b}, [x12], %[input_depth]\n"
+          "smlal v24.4s, v5.4h, v14.4h\n"
+          "ld1 {v11.8b}, [x12], %[input_depth]\n"
+          "smlal2 v25.4s, v5.8h, v14.8h\n"
+          "ld1 {v12.8b}, [x13], %[input_depth]\n"
+          "smlal v24.4s, v6.4h, v15.4h\n"
+          "ld1 {v13.8b}, [x13], %[input_depth]\n"
+          "smlal2 v25.4s, v6.8h, v15.8h\n"
+          "ld1 {v14.8b}, [x13], %[input_depth]\n"
+          "smlal v26.4s, v6.4h, v17.4h\n"
+          "ld1 {v15.8b}, [x15], %[input_depth]\n"
+          "smlal2 v27.4s, v6.8h, v17.8h\n"
+          "smlal v24.4s, v7.4h, v16.4h\n"
+          "smlal2 v25.4s, v7.8h, v16.8h\n"
+          "ld1 {v16.8b}, [x15], %[input_depth]\n"
+          "smlal v24.4s, v8.4h, v17.4h\n"
+          "uaddw v18.8h, v28.8h, v18.8b\n"
+          "smlal2 v25.4s, v8.8h, v17.8h\n"
+          "ld1 {v17.8b}, [x15], %[input_depth]\n"
+          "uaddw v19.8h, v28.8h, v19.8b\n"
+
+          "smlal v26.4s, v1.4h, v18.4h\n"
+          "uaddw v20.8h, v28.8h, v20.8b\n"
+          "smlal2 v27.4s, v1.8h, v18.8h\n"
+          "smlal v26.4s, v2.4h, v19.4h\n"
+          "uaddw v21.8h, v28.8h, v21.8b\n"
+          "smlal2 v27.4s, v2.8h, v19.8h\n"
+          "smlal v26.4s, v4.4h, v20.4h\n"
+          "smlal v26.4s, v5.4h, v21.4h\n"
+          "smlal2 v27.4s, v4.8h, v20.8h\n"
+          "uaddw v22.8h, v28.8h, v22.8b\n"
+          "smlal2 v27.4s, v5.8h, v21.8h\n"
+          "uaddw v23.8h, v28.8h, v23.8b\n"
+          "smlal v26.4s, v7.4h, v22.4h\n"
+          "smlal2 v27.4s, v7.8h, v22.8h\n"
+          "smlal v26.4s, v8.4h, v23.4h\n"
+          "smlal2 v27.4s, v8.8h, v23.8h\n"
+
+          "dup v28.4s, w1\n"
+          "dup v29.4s, w9\n"
+          "sqrdmulh v24.4s, v24.4s, v28.4s\n"
+          "sqrdmulh v25.4s, v25.4s, v28.4s\n"
+          "sqrdmulh v26.4s, v26.4s, v28.4s\n"
+          "sqrdmulh v27.4s, v27.4s, v28.4s\n"
+          "dup v28.4s, w2\n"
+          "and v30.16b, v24.16b, v29.16b\n"
+          "and v31.16b, v25.16b, v29.16b\n"
+          "sshr v30.4s, v30.4s, #31\n"
+          "sshr v31.4s, v31.4s, #31\n"
+          "sqadd v24.4s, v24.4s, v30.4s\n"
+          "sqadd v25.4s, v25.4s, v31.4s\n"
+          "and v30.16b, v26.16b, v29.16b\n"
+          "and v31.16b, v27.16b, v29.16b\n"
+          "sshr v30.4s, v30.4s, #31\n"
+          "sshr v31.4s, v31.4s, #31\n"
+          "sqadd v26.4s, v26.4s, v30.4s\n"
+          "dup v30.4s, w3\n"
+          "sqadd v27.4s, v27.4s, v31.4s\n"
+          "dup v31.4s, w4\n"
+          "srshl v24.4s, v24.4s, v29.4s\n"
+          "srshl v25.4s, v25.4s, v29.4s\n"
+          "srshl v26.4s, v26.4s, v29.4s\n"
+          "srshl v27.4s, v27.4s, v29.4s\n"
+          "add v24.4s, v24.4s, v28.4s\n"
+          "add v25.4s, v25.4s, v28.4s\n"
+          "add v26.4s, v26.4s, v28.4s\n"
+          "add v27.4s, v27.4s, v28.4s\n"
+          "dup v28.8h, w0\n"
+          "smax v24.4s, v24.4s, v30.4s\n"
+          "smax v25.4s, v25.4s, v30.4s\n"
+          "smax v26.4s, v26.4s, v30.4s\n"
+          "smax v27.4s, v27.4s, v30.4s\n"
+          "smin v24.4s, v24.4s, v31.4s\n"
+          "smin v25.4s, v25.4s, v31.4s\n"
+          "smin v26.4s, v26.4s, v31.4s\n"
+          "smin v27.4s, v27.4s, v31.4s\n"
+          "sqxtn v24.4h, v24.4s\n"
+          "sqxtn v26.4h, v26.4s\n"
+          "sqxtn2 v24.8h, v25.4s\n"
+          "ld1 {v25.4s}, [x10]\n"
+          "sqxtn2 v26.8h, v27.4s\n"
+          "ld1 {v27.4s}, [x10]\n"
+          "sqxtun v24.8b, v24.8h\n"
+          "sqxtun v26.8b, v26.8h\n"
+          "uaddw v9.8h, v28.8h, v9.8b\n"
+          "st1 {v24.8b}, [x6], x5\n"
+          "uaddw v10.8h, v28.8h, v10.8b\n"
+          "st1 {v26.8b}, [x6], x5\n"
+          "uaddw v11.8h, v28.8h, v11.8b\n"
+          "uaddw v12.8h, v28.8h, v12.8b\n"
+          "uaddw v13.8h, v28.8h, v13.8b\n"
+          "uaddw v14.8h, v28.8h, v14.8b\n"
+          "ld1 {v24.4s}, [%[bias_ptr]]\n"
+          "uaddw v15.8h, v28.8h, v15.8b\n"
+          "ld1 {v26.4s}, [%[bias_ptr]]\n"
+          "uaddw v16.8h, v28.8h, v16.8b\n"
+          "uaddw v17.8h, v28.8h, v17.8b\n"
+
+          "bge " DEPTHWISECONV_LABEL_HEIGHT_1_WIDTH_2_LOOP "b\n"
+
+        // At this point, there will be one of 2 width or 1 width leftover,
+        // not both.
+        "cmp w14, #2\n"
+        "blt " DEPTHWISECONV_LABEL_HEIGHT_1_WIDTH_1_LEFTOVER "f\n"
+
+        // Handle last two horizontal outputs if exists.
+        DEPTHWISECONV_LABEL_HEIGHT_1_WIDTH_2_LEFTOVER ":\n"
+        "smlal v24.4s, v0.4h, v9.4h\n"
+        "ld1 {v18.8b}, [x12], %[input_depth]\n"
+        "smlal2 v25.4s, v0.8h, v9.8h\n"
+        "ld1 {v19.8b}, [x12]\n"
+        "smlal v26.4s, v0.4h, v11.4h\n"
+        "ld1 {v20.8b}, [x13], %[input_depth]\n"
+        "smlal2 v27.4s, v0.8h, v11.8h\n"
+        "ld1 {v21.8b}, [x13]\n"
+        "smlal v24.4s, v1.4h, v10.4h\n"
+        "ld1 {v22.8b}, [x15], %[input_depth]\n"
+        "smlal2 v25.4s, v1.8h, v10.8h\n"
+        "ld1 {v23.8b}, [x15]\n"
+        "smlal v24.4s, v2.4h, v11.4h\n"
+        "smlal2 v25.4s, v2.8h, v11.8h\n"
+        "smlal v24.4s, v3.4h, v12.4h\n"
+        "smlal2 v25.4s, v3.8h, v12.8h\n"
+        "smlal v26.4s, v3.4h, v14.4h\n"
+        "smlal2 v27.4s, v3.8h, v14.8h\n"
+        "smlal v24.4s, v4.4h, v13.4h\n"
+        "smlal2 v25.4s, v4.8h, v13.8h\n"
+        "smlal v24.4s, v5.4h, v14.4h\n"
+        "smlal2 v25.4s, v5.8h, v14.8h\n"
+        "smlal v24.4s, v6.4h, v15.4h\n"
+        "smlal2 v25.4s, v6.8h, v15.8h\n"
+        "smlal v26.4s, v6.4h, v17.4h\n"
+        "smlal2 v27.4s, v6.8h, v17.8h\n"
+        "smlal v24.4s, v7.4h, v16.4h\n"
+        "smlal2 v25.4s, v7.8h, v16.8h\n"
+        "smlal v24.4s, v8.4h, v17.4h\n"
+        "uaddw v18.8h, v28.8h, v18.8b\n"
+        "smlal2 v25.4s, v8.8h, v17.8h\n"
+        "uaddw v19.8h, v28.8h, v19.8b\n"
+
+        "smlal v26.4s, v1.4h, v18.4h\n"
+        "uaddw v20.8h, v28.8h, v20.8b\n"
+        "smlal2 v27.4s, v1.8h, v18.8h\n"
+        "smlal v26.4s, v2.4h, v19.4h\n"
+        "uaddw v21.8h, v28.8h, v21.8b\n"
+        "smlal2 v27.4s, v2.8h, v19.8h\n"
+        "smlal v26.4s, v4.4h, v20.4h\n"
+        "smlal v26.4s, v5.4h, v21.4h\n"
+        "smlal2 v27.4s, v4.8h, v20.8h\n"
+        "uaddw v22.8h, v28.8h, v22.8b\n"
+        "smlal2 v27.4s, v5.8h, v21.8h\n"
+        "uaddw v23.8h, v28.8h, v23.8b\n"
+        "smlal v26.4s, v7.4h, v22.4h\n"
+        "smlal2 v27.4s, v7.8h, v22.8h\n"
+        "smlal v26.4s, v8.4h, v23.4h\n"
+        "smlal2 v27.4s, v8.8h, v23.8h\n"
+
+        "dup v28.4s, w1\n"
+        "dup v29.4s, w9\n"
+        "sqrdmulh v24.4s, v24.4s, v28.4s\n"
+        "sqrdmulh v25.4s, v25.4s, v28.4s\n"
+        "sqrdmulh v26.4s, v26.4s, v28.4s\n"
+        "sqrdmulh v27.4s, v27.4s, v28.4s\n"
+        "dup v28.4s, w2\n"
+        "and v30.16b, v24.16b, v29.16b\n"
+        "and v31.16b, v25.16b, v29.16b\n"
+        "sshr v30.4s, v30.4s, #31\n"
+        "sshr v31.4s, v31.4s, #31\n"
+        "sqadd v24.4s, v24.4s, v30.4s\n"
+        "sqadd v25.4s, v25.4s, v31.4s\n"
+        "and v30.16b, v26.16b, v29.16b\n"
+        "and v31.16b, v27.16b, v29.16b\n"
+        "sshr v30.4s, v30.4s, #31\n"
+        "sshr v31.4s, v31.4s, #31\n"
+        "sqadd v26.4s, v26.4s, v30.4s\n"
+        "dup v30.4s, w3\n"
+        "sqadd v27.4s, v27.4s, v31.4s\n"
+        "dup v31.4s, w4\n"
+        "srshl v24.4s, v24.4s, v29.4s\n"
+        "srshl v25.4s, v25.4s, v29.4s\n"
+        "srshl v26.4s, v26.4s, v29.4s\n"
+        "srshl v27.4s, v27.4s, v29.4s\n"
+        "add v24.4s, v24.4s, v28.4s\n"
+        "add v25.4s, v25.4s, v28.4s\n"
+        "add v26.4s, v26.4s, v28.4s\n"
+        "add v27.4s, v27.4s, v28.4s\n"
+        "dup v28.8h, w0\n"
+        "smax v24.4s, v24.4s, v30.4s\n"
+        "smax v25.4s, v25.4s, v30.4s\n"
+        "smax v26.4s, v26.4s, v30.4s\n"
+        "smax v27.4s, v27.4s, v30.4s\n"
+        "smin v24.4s, v24.4s, v31.4s\n"
+        "smin v25.4s, v25.4s, v31.4s\n"
+        "smin v26.4s, v26.4s, v31.4s\n"
+        "smin v27.4s, v27.4s, v31.4s\n"
+        "sqxtn v24.4h, v24.4s\n"
+        "sqxtn v26.4h, v26.4s\n"
+        "sqxtn2 v24.8h, v25.4s\n"
+        "sqxtn2 v26.8h, v27.4s\n"
+        "sqxtun v24.8b, v24.8h\n"
+        "sqxtun v26.8b, v26.8h\n"
+        "st1 {v24.8b}, [x6], x5\n"
+        "st1 {v26.8b}, [x6]\n"
+        "b " DEPTHWISECONV_LABEL_HEIGHT_1_END "f\n"
+
+        // Handle bottom right output if exists.
+        DEPTHWISECONV_LABEL_HEIGHT_1_WIDTH_1_LEFTOVER ":\n"
+        "dup v26.4s, w9\n"
+        "dup v27.4s, w1\n"
+        "dup v29.4s, w2\n"
+
+        "smlal v24.4s, v0.4h, v9.4h\n"
+        "smlal2 v25.4s, v0.8h, v9.8h\n"
+        "smlal v24.4s, v1.4h, v10.4h\n"
+        "smlal2 v25.4s, v1.8h, v10.8h\n"
+        "smlal v24.4s, v2.4h, v11.4h\n"
+        "smlal2 v25.4s, v2.8h, v11.8h\n"
+        "smlal v24.4s, v3.4h, v12.4h\n"
+        "smlal2 v25.4s, v3.8h, v12.8h\n"
+        "smlal v24.4s, v4.4h, v13.4h\n"
+        "smlal2 v25.4s, v4.8h, v13.8h\n"
+        "smlal v24.4s, v5.4h, v14.4h\n"
+        "smlal2 v25.4s, v5.8h, v14.8h\n"
+        "smlal v24.4s, v6.4h, v15.4h\n"
+        "smlal2 v25.4s, v6.8h, v15.8h\n"
+        "smlal v24.4s, v7.4h, v16.4h\n"
+        "smlal2 v25.4s, v7.8h, v16.8h\n"
+        "smlal v24.4s, v8.4h, v17.4h\n"
+        "smlal2 v25.4s, v8.8h, v17.8h\n"
+
+        "sqrdmulh v24.4s, v24.4s, v27.4s\n"
+        "sqrdmulh v25.4s, v25.4s, v27.4s\n"
+        "and v18.16b, v24.16b, v26.16b\n"
+        "and v19.16b, v25.16b, v26.16b\n"
+        "sshr v18.4s, v18.4s, #31\n"
+        "sshr v19.4s, v19.4s, #31\n"
+        "sqadd v24.4s, v24.4s, v18.4s\n"
+        "sqadd v25.4s, v25.4s, v19.4s\n"
+        "srshl v24.4s, v24.4s, v26.4s\n"
+        "srshl v25.4s, v25.4s, v26.4s\n"
+        "add v24.4s, v24.4s, v29.4s\n"
+        "add v25.4s, v25.4s, v29.4s\n"
+        "smax v24.4s, v24.4s, v30.4s\n"
+        "smax v25.4s, v25.4s, v30.4s\n"
+        "smin v24.4s, v24.4s, v31.4s\n"
+        "smin v25.4s, v25.4s, v31.4s\n"
+        "sqxtn v24.4h, v24.4s\n"
+        "sqxtn2 v24.8h, v25.4s\n"
+        "sqxtun v24.8b, v24.8h\n"
+        "st1 {v24.8b}, [x6]\n"
+
+        DEPTHWISECONV_LABEL_HEIGHT_1_END ":\n"
+    :
+    // Outputs.
+    [filter_ptr] "+r"(filter_ptr), [input_ptr] "+r"(input_ptr),
+    [output_ptr] "+r"(output_ptr),
+    [output_window_height] "+r"(output_window_height)
+    :
+    // Inputs.
+    [bias_ptr] "r"(bias_ptr), [input_row_size] "r"(input_row_size),
+    [input_depth] "r"(input_depth),
+    [output_window_width] "r"(output_window_width),
+    [input_width_increment] "r"(input_width_increment),
+    [input_height_increment] "r"(input_height_increment),
+    [output_height_increment] "r"(output_height_increment),
+    [params_ptr] "r"(params_ptr)
+    :
+    // Clobbers.
+    "cc", "memory",
+    // We use these NEON registers.
+    "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9",
+    "v10", "v11", "v12", "v13", "v14", "v15", "v16", "v17", "v18", "v19",
+    "v20", "v21", "v22", "v23", "v24", "v25", "v26", "v27", "v28", "v29",
+    "v30", "v31",
+    // We use these general-purpose registers.
+    "x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7",
+    "x9", "x10", "x11", "x12", "x13", "x14", "x15",
+    "x19", "x20");
+#undef DEPTHWISECONV_LABEL_HEIGHT_2_LOOP
+#undef DEPTHWISECONV_LABEL_HEIGHT_2_WIDTH_2_LOOP
+#undef DEPTHWISECONV_LABEL_HEIGHT_2_WIDTH_1_LEFTOVER
+#undef DEPTHWISECONV_LABEL_HEIGHT_2_WIDTH_2_LEFTOVER
+#undef DEPTHWISECONV_LABEL_HEIGHT_2_WIDTH_2_AFTER_LOOP
+#undef DEPTHWISECONV_LABEL_HEIGHT_2_AFTER_LOOP
+#undef DEPTHWISECONV_LABEL_HEIGHT_1
+#undef DEPTHWISECONV_LABEL_HEIGHT_1_WIDTH_2_LOOP
+#undef DEPTHWISECONV_LABEL_HEIGHT_1_WIDTH_1_LEFTOVER
+#undef DEPTHWISECONV_LABEL_HEIGHT_1_WIDTH_2_LEFTOVER
+#undef DEPTHWISECONV_LABEL_HEIGHT_1_END
   }
 };
 
-template <>
-struct ConvKernel3x3FilterDepth8<4, 1, 2, 2> {
-  static inline void Run(const uint8* input_ptr, int input_depth,
-                         int32 input_offset, int input_row_size,
-                         const uint8* filter_ptr, int32 filter_offset,
-                         const int32* bias_ptr, int32 output_offset,
-                         int32 output_multiplier, int output_shift,
-                         int32 output_activation_min,
-                         int32 output_activation_max, uint8* output_ptr,
-                         int output_depth, int output_width) {
-    const int output_row_size = output_depth * output_width;
-
-    Filter3x3x8 filter = Load3x3Filter(filter_ptr, filter_offset, output_depth);
-
-    const int16x8_t input_offset_vec = vdupq_n_s16(input_offset);
-    int16x8_t input_0, input_1, input_2, input_3, input_4, input_5, input_6,
-        input_7, input_8;
-    uint8x8_t temp_0, temp_1, temp_2, temp_3, temp_4, temp_5, temp_6, temp_7,
-        temp_8;
-
-    const uint8* ptr = input_ptr;
-
-    // Load all inputs for top output.
-    temp_0 = vld1_u8(ptr);
-    temp_1 = vld1_u8(ptr + input_depth);
-    temp_2 = vld1_u8(ptr + 2 * input_depth);
-    ptr += input_row_size;
-    temp_3 = vld1_u8(ptr);
-    temp_4 = vld1_u8(ptr + input_depth);
-    temp_5 = vld1_u8(ptr + 2 * input_depth);
-    ptr += input_row_size;
-    temp_6 = vld1_u8(ptr);
-    temp_7 = vld1_u8(ptr + input_depth);
-    temp_8 = vld1_u8(ptr + 2 * input_depth);
-
-    input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-    input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-    input_2 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-    input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-    input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-    input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-    input_6 = vreinterpretq_s16_u16(vmovl_u8(temp_6));
-    input_7 = vreinterpretq_s16_u16(vmovl_u8(temp_7));
-    input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_8));
-
-    input_0 = vaddq_s16(input_0, input_offset_vec);
-    input_1 = vaddq_s16(input_1, input_offset_vec);
-    input_2 = vaddq_s16(input_2, input_offset_vec);
-    input_3 = vaddq_s16(input_3, input_offset_vec);
-    input_4 = vaddq_s16(input_4, input_offset_vec);
-    input_5 = vaddq_s16(input_5, input_offset_vec);
-    input_6 = vaddq_s16(input_6, input_offset_vec);
-    input_7 = vaddq_s16(input_7, input_offset_vec);
-    input_8 = vaddq_s16(input_8, input_offset_vec);
-
-    DotProductAndStore(
-        filter, input_0, input_1, input_2, input_3, input_4, input_5, input_6,
-        input_7, input_8, bias_ptr, output_offset, output_multiplier,
-        output_shift, output_activation_min, output_activation_max, output_ptr);
-
-    // Second output.
-    output_ptr += output_row_size;
-
-    ptr += input_row_size;
-    temp_0 = vld1_u8(ptr);
-    temp_1 = vld1_u8(ptr + input_depth);
-    temp_2 = vld1_u8(ptr + 2 * input_depth);
-    ptr += input_row_size;
-    temp_3 = vld1_u8(ptr);
-    temp_4 = vld1_u8(ptr + input_depth);
-    temp_5 = vld1_u8(ptr + 2 * input_depth);
-
-    input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-    input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-    input_2 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-    input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-    input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-    input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-
-    input_0 = vaddq_s16(input_0, input_offset_vec);
-    input_1 = vaddq_s16(input_1, input_offset_vec);
-    input_2 = vaddq_s16(input_2, input_offset_vec);
-    input_3 = vaddq_s16(input_3, input_offset_vec);
-    input_4 = vaddq_s16(input_4, input_offset_vec);
-    input_5 = vaddq_s16(input_5, input_offset_vec);
-
-    DotProductAndStore(
-        filter, input_6, input_7, input_8, input_0, input_1, input_2, input_3,
-        input_4, input_5, bias_ptr, output_offset, output_multiplier,
-        output_shift, output_activation_min, output_activation_max, output_ptr);
-
-    // Third output.
-    output_ptr += output_row_size;
-
-    ptr += input_row_size;
-    temp_6 = vld1_u8(ptr);
-    temp_7 = vld1_u8(ptr + input_depth);
-    temp_8 = vld1_u8(ptr + 2 * input_depth);
-    ptr += input_row_size;
-    temp_0 = vld1_u8(ptr);
-    temp_1 = vld1_u8(ptr + input_depth);
-    temp_2 = vld1_u8(ptr + 2 * input_depth);
-
-    input_6 = vreinterpretq_s16_u16(vmovl_u8(temp_6));
-    input_7 = vreinterpretq_s16_u16(vmovl_u8(temp_7));
-    input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_8));
-    input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-    input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-    input_2 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-
-    input_6 = vaddq_s16(input_6, input_offset_vec);
-    input_7 = vaddq_s16(input_7, input_offset_vec);
-    input_8 = vaddq_s16(input_8, input_offset_vec);
-    input_0 = vaddq_s16(input_0, input_offset_vec);
-    input_1 = vaddq_s16(input_1, input_offset_vec);
-    input_2 = vaddq_s16(input_2, input_offset_vec);
-
-    DotProductAndStore(
-        filter, input_3, input_4, input_5, input_6, input_7, input_8, input_0,
-        input_1, input_2, bias_ptr, output_offset, output_multiplier,
-        output_shift, output_activation_min, output_activation_max, output_ptr);
-
-    // Fourth output.
-    output_ptr += output_row_size;
-
-    ptr += input_row_size;
-    temp_3 = vld1_u8(ptr);
-    temp_4 = vld1_u8(ptr + input_depth);
-    temp_5 = vld1_u8(ptr + 2 * input_depth);
-    ptr += input_row_size;
-    temp_6 = vld1_u8(ptr);
-    temp_7 = vld1_u8(ptr + input_depth);
-    temp_8 = vld1_u8(ptr + 2 * input_depth);
-
-    input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-    input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-    input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-    input_6 = vreinterpretq_s16_u16(vmovl_u8(temp_6));
-    input_7 = vreinterpretq_s16_u16(vmovl_u8(temp_7));
-    input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_8));
-
-    input_3 = vaddq_s16(input_3, input_offset_vec);
-    input_4 = vaddq_s16(input_4, input_offset_vec);
-    input_5 = vaddq_s16(input_5, input_offset_vec);
-    input_6 = vaddq_s16(input_6, input_offset_vec);
-    input_7 = vaddq_s16(input_7, input_offset_vec);
-    input_8 = vaddq_s16(input_8, input_offset_vec);
-
-    DotProductAndStore(
-        filter, input_0, input_1, input_2, input_3, input_4, input_5, input_6,
-        input_7, input_8, bias_ptr, output_offset, output_multiplier,
-        output_shift, output_activation_min, output_activation_max, output_ptr);
-  }
-};
-
-template <>
-struct ConvKernel3x3FilterDepth8<2, 2, 2, 2> {
-  static inline void Run(const uint8* input_ptr, int input_depth,
-                         int32 input_offset, int input_row_size,
-                         const uint8* filter_ptr, int32 filter_offset,
-                         const int32* bias_ptr, int32 output_offset,
-                         int32 output_multiplier, int output_shift,
-                         int32 output_activation_min,
-                         int32 output_activation_max, uint8* output_ptr,
-                         int output_depth, int output_width) {
-    Filter3x3x8 filter = Load3x3Filter(filter_ptr, filter_offset, output_depth);
-
-    Int32x8 acc_0, acc_1, acc_2, acc_3;
-    acc_0.low = vld1q_s32(bias_ptr);
-    acc_1.low = vld1q_s32(bias_ptr);
-    acc_2.low = vld1q_s32(bias_ptr);
-    acc_3.low = vld1q_s32(bias_ptr);
-
-    bias_ptr += 4;
-    acc_0.high = vld1q_s32(bias_ptr);
-    acc_1.high = vld1q_s32(bias_ptr);
-    acc_2.high = vld1q_s32(bias_ptr);
-    acc_3.high = vld1q_s32(bias_ptr);
-
-    const int16x8_t input_offset_vec = vdupq_n_s16(input_offset);
-
-    // Add scope for input registers to help the compiler know that it is
-    // not needed.
-    {
-      // To process 2x2 outputs using a 3x3 filter at stride 2, we require
-      // 5x5 inputs. We load the first 5x2 inputs at a time.
-      int16x8_t input_0, input_1, input_2, input_3, input_4, input_5, input_6,
-          input_7, input_8, input_9;
-
-      const uint8* ptr = input_ptr;
-
-      // Load inputs.
-      {
-        uint8x8_t temp_0, temp_1, temp_2, temp_3, temp_4;
-
-        temp_0 = vld1_u8(ptr);
-        temp_1 = vld1_u8(ptr + input_depth);
-        temp_2 = vld1_u8(ptr + 2 * input_depth);
-        temp_3 = vld1_u8(ptr + 3 * input_depth);
-        temp_4 = vld1_u8(ptr + 4 * input_depth);
-
-        input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-        input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-        input_2 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-        input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-        input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-
-        input_0 = vaddq_s16(input_0, input_offset_vec);
-        input_1 = vaddq_s16(input_1, input_offset_vec);
-        input_2 = vaddq_s16(input_2, input_offset_vec);
-        input_3 = vaddq_s16(input_3, input_offset_vec);
-        input_4 = vaddq_s16(input_4, input_offset_vec);
-
-        ptr += input_row_size;
-        temp_0 = vld1_u8(ptr);
-        temp_1 = vld1_u8(ptr + input_depth);
-        temp_2 = vld1_u8(ptr + 2 * input_depth);
-        temp_3 = vld1_u8(ptr + 3 * input_depth);
-        temp_4 = vld1_u8(ptr + 4 * input_depth);
-
-        input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-        input_6 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-        input_7 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-        input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-        input_9 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-
-        input_5 = vaddq_s16(input_5, input_offset_vec);
-        input_6 = vaddq_s16(input_6, input_offset_vec);
-        input_7 = vaddq_s16(input_7, input_offset_vec);
-        input_8 = vaddq_s16(input_8, input_offset_vec);
-        input_9 = vaddq_s16(input_9, input_offset_vec);
-      }
-
-      acc_0 = MultiplyAccumulateRow(acc_0, filter.f0, filter.f1, filter.f2,
-                                    input_0, input_1, input_2);
-
-      acc_1 = MultiplyAccumulateRow(acc_1, filter.f0, filter.f1, filter.f2,
-                                    input_2, input_3, input_4);
-
-      acc_0 = MultiplyAccumulateRow(acc_0, filter.f3, filter.f4, filter.f5,
-                                    input_5, input_6, input_7);
-
-      acc_1 = MultiplyAccumulateRow(acc_1, filter.f3, filter.f4, filter.f5,
-                                    input_7, input_8, input_9);
-
-      // Load next inputs.
-      {
-        uint8x8_t temp_0, temp_1, temp_2, temp_3, temp_4;
-
-        ptr += input_row_size;
-        temp_0 = vld1_u8(ptr);
-        temp_1 = vld1_u8(ptr + input_depth);
-        temp_2 = vld1_u8(ptr + 2 * input_depth);
-        temp_3 = vld1_u8(ptr + 3 * input_depth);
-        temp_4 = vld1_u8(ptr + 4 * input_depth);
-
-        input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-        input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-        input_2 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-        input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-        input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-
-        input_0 = vaddq_s16(input_0, input_offset_vec);
-        input_1 = vaddq_s16(input_1, input_offset_vec);
-        input_2 = vaddq_s16(input_2, input_offset_vec);
-        input_3 = vaddq_s16(input_3, input_offset_vec);
-        input_4 = vaddq_s16(input_4, input_offset_vec);
-
-        ptr += input_row_size;
-        temp_0 = vld1_u8(ptr);
-        temp_1 = vld1_u8(ptr + input_depth);
-        temp_2 = vld1_u8(ptr + 2 * input_depth);
-        temp_3 = vld1_u8(ptr + 3 * input_depth);
-        temp_4 = vld1_u8(ptr + 4 * input_depth);
-
-        input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-        input_6 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-        input_7 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-        input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-        input_9 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-
-        input_5 = vaddq_s16(input_5, input_offset_vec);
-        input_6 = vaddq_s16(input_6, input_offset_vec);
-        input_7 = vaddq_s16(input_7, input_offset_vec);
-        input_8 = vaddq_s16(input_8, input_offset_vec);
-        input_9 = vaddq_s16(input_9, input_offset_vec);
-      }
-
-      acc_0 = MultiplyAccumulateRow(acc_0, filter.f6, filter.f7, filter.f8,
-                                    input_0, input_1, input_2);
-
-      acc_1 = MultiplyAccumulateRow(acc_1, filter.f6, filter.f7, filter.f8,
-                                    input_2, input_3, input_4);
-
-      // Moving onto the two bottom outputs.
-      acc_2 = MultiplyAccumulateRow(acc_2, filter.f0, filter.f1, filter.f2,
-                                    input_0, input_1, input_2);
-
-      acc_3 = MultiplyAccumulateRow(acc_3, filter.f0, filter.f1, filter.f2,
-                                    input_2, input_3, input_4);
-
-      acc_2 = MultiplyAccumulateRow(acc_2, filter.f3, filter.f4, filter.f5,
-                                    input_5, input_6, input_7);
-
-      acc_3 = MultiplyAccumulateRow(acc_3, filter.f3, filter.f4, filter.f5,
-                                    input_7, input_8, input_9);
-
-      // Load last input row.
-      {
-        uint8x8_t temp_0, temp_1, temp_2, temp_3, temp_4;
+enum class EdgeType { kCorner, kHorizontal, kVertical, kCenter };
 
-        ptr += input_row_size;
-        temp_0 = vld1_u8(ptr);
-        temp_1 = vld1_u8(ptr + input_depth);
-        temp_2 = vld1_u8(ptr + 2 * input_depth);
-        temp_3 = vld1_u8(ptr + 3 * input_depth);
-        temp_4 = vld1_u8(ptr + 4 * input_depth);
-
-        input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-        input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-        input_2 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-        input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-        input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-
-        input_0 = vaddq_s16(input_0, input_offset_vec);
-        input_1 = vaddq_s16(input_1, input_offset_vec);
-        input_2 = vaddq_s16(input_2, input_offset_vec);
-        input_3 = vaddq_s16(input_3, input_offset_vec);
-        input_4 = vaddq_s16(input_4, input_offset_vec);
-      }
-
-      acc_2 = MultiplyAccumulateRow(acc_2, filter.f6, filter.f7, filter.f8,
-                                    input_0, input_1, input_2);
-
-      acc_3 = MultiplyAccumulateRow(acc_3, filter.f6, filter.f7, filter.f8,
-                                    input_2, input_3, input_4);
-    }
-
-    DownquantizeAndStore2x2Output(acc_0, acc_1, acc_2, acc_3, output_offset,
-                                  output_multiplier, output_shift,
-                                  output_activation_min, output_activation_max,
-                                  output_ptr, output_depth, output_width);
-  }
-};
+template <EdgeType kEdgeType, int kPadWidth, int kPadHeight>
+struct DepthwiseConvPartial {};
 
 template <>
-struct ConvKernel3x3FilterDepth8<2, 4, 2, 2> {
-  static inline void Run(const uint8* input_ptr, int input_depth,
-                         int32 input_offset, int input_row_size,
-                         const uint8* filter_ptr, int32 filter_offset,
-                         const int32* bias_ptr, int32 output_offset,
-                         int32 output_multiplier, int output_shift,
-                         int32 output_activation_min,
-                         int32 output_activation_max, uint8* output_ptr,
-                         int output_depth, int output_width) {
-    // Reuse 2x2 kernel twice.
-    ConvKernel3x3FilterDepth8<2, 2, 2, 2>::Run(
-        input_ptr, input_depth, input_offset, input_row_size, filter_ptr,
-        filter_offset, bias_ptr, output_offset, output_multiplier, output_shift,
-        output_activation_min, output_activation_max, output_ptr, output_depth,
-        output_width);
-
-    ConvKernel3x3FilterDepth8<2, 2, 2, 2>::Run(
-        input_ptr + 4 * input_depth, input_depth, input_offset, input_row_size,
-        filter_ptr, filter_offset, bias_ptr, output_offset, output_multiplier,
-        output_shift, output_activation_min, output_activation_max,
-        output_ptr + 2 * output_depth, output_depth, output_width);
+struct DepthwiseConvPartial<EdgeType::kCenter, 1, 1> {
+  static inline void Run(const uint8* input_ptr, const uint8* filter_ptr,
+                         const int32* bias_ptr, uint8* output_ptr,
+                         const DepthwiseConvParams* params_ptr) {
+#define DEPTHWISECONV_LABEL_DEPTH_8_LOOP "1"
+#define DEPTHWISECONV_LABEL_DEPTH_8_AFTER_LOOP "2"
+    asm volatile(
+        // Performs depthwise convolutions for an input window of size 1x1 and
+        // padding of 1 across the full depth. Expects |input_ptr| and
+        // |filter_ptr| to be pointing to the 1x1 input and filter values.
+        "ld1 {v8.8b}, [%[input_ptr]], #8\n"
+        "ldr w9, [%[params_ptr], #" STR(OFFSET_INPUT_OFFSET) "]\n"
+        "ldr x11, [%[params_ptr], #" STR(OFFSET_OUTPUT_DEPTH) "]\n"
+        "ldr w10, [%[params_ptr], #" STR(OFFSET_OUTPUT_MULTIPLIER) "]\n"
+        "dup v26.8h, w9\n"
+        "ldr w9, [%[params_ptr], #" STR(OFFSET_OUTPUT_OFFSET) "]\n"
+        "dup v27.4s, w10\n"
+        "ld1 {v0.8b}, [%[filter_ptr]], #8\n"
+        "cmp x11, #16\n"
+        "ldr w10, [%[params_ptr], #" STR(OFFSET_OUTPUT_SHIFT) "]\n"
+        "dup v28.4s, w9\n"
+        "ldr w9, [%[params_ptr], #" STR(OFFSET_OUTPUT_ACTIVATION_MIN) "]\n"
+        "neg w10, w10\n"
+        "dup v29.4s, w10\n"
+        "ldr w10, [%[params_ptr], #" STR(OFFSET_OUTPUT_ACTIVATION_MAX) "]\n"
+        "dup v30.4s, w9\n"
+        "ldr w9, [%[params_ptr], #" STR(OFFSET_FILTER_OFFSET) "]\n"
+        "dup v31.4s, w10\n"
+        "dup v25.8h, w9\n"
+
+        "ld1 {v16.4s}, [%[bias_ptr]], #16\n"
+        "uaddw v8.8h, v26.8h, v8.8b\n"
+        "ld1 {v17.4s}, [%[bias_ptr]], #16\n"
+        "uaddw v0.8h, v25.8h, v0.8b\n"
+
+        "blt " DEPTHWISECONV_LABEL_DEPTH_8_AFTER_LOOP "f\n"
+
+        //"loop_%=:\n"
+        DEPTHWISECONV_LABEL_DEPTH_8_LOOP ":\n"
+          "smlal v16.4s, v0.4h, v8.4h\n"
+          "subs x11, x11, #8\n"
+          "smlal2 v17.4s, v0.8h, v8.8h\n"
+          "ld1 {v8.8b}, [%[input_ptr]], #8\n"
+          "cmp x11, #16\n"
+          "ld1 {v0.8b}, [%[filter_ptr]], #8\n"
+
+          "sqrdmulh v16.4s, v16.4s, v27.4s\n"
+          "sqrdmulh v17.4s, v17.4s, v27.4s\n"
+          "and v18.16b, v16.16b, v29.16b\n"
+          "and v19.16b, v17.16b, v29.16b\n"
+          "sshr v18.4s, v18.4s, #31\n"
+          "sshr v19.4s, v19.4s, #31\n"
+          "sqadd v16.4s, v16.4s, v18.4s\n"
+          "sqadd v17.4s, v17.4s, v19.4s\n"
+          "srshl v16.4s, v16.4s, v29.4s\n"
+          "srshl v17.4s, v17.4s, v29.4s\n"
+          "add v16.4s, v16.4s, v28.4s\n"
+          "add v17.4s, v17.4s, v28.4s\n"
+          "smax v16.4s, v16.4s, v30.4s\n"
+          "smax v17.4s, v17.4s, v30.4s\n"
+          "smin v16.4s, v16.4s, v31.4s\n"
+          "smin v17.4s, v17.4s, v31.4s\n"
+          "sqxtn v16.4h, v16.4s\n"
+          "sqxtn2 v16.8h, v17.4s\n"
+          "sqxtun v16.8b, v16.8h\n"
+          "st1 {v16.8b}, [%[output_ptr]], #8\n"
+          "uaddw v8.8h, v26.8h, v8.8b\n"
+          "ld1 {v16.4s}, [%[bias_ptr]], #16\n"
+          "uaddw v0.8h, v25.8h, v0.8b\n"
+          "ld1 {v17.4s}, [%[bias_ptr]], #16\n"
+
+          "bge " DEPTHWISECONV_LABEL_DEPTH_8_LOOP "b\n"
+
+        DEPTHWISECONV_LABEL_DEPTH_8_AFTER_LOOP ":\n"
+        "smlal v16.4s, v0.4h, v8.4h\n"
+        "smlal2 v17.4s, v0.8h, v8.8h\n"
+
+        "sqrdmulh v16.4s, v16.4s, v27.4s\n"
+        "sqrdmulh v17.4s, v17.4s, v27.4s\n"
+        "and v18.16b, v16.16b, v29.16b\n"
+        "and v19.16b, v17.16b, v29.16b\n"
+        "sshr v18.4s, v18.4s, #31\n"
+        "sshr v19.4s, v19.4s, #31\n"
+        "sqadd v16.4s, v16.4s, v18.4s\n"
+        "sqadd v17.4s, v17.4s, v19.4s\n"
+        "srshl v16.4s, v16.4s, v29.4s\n"
+        "srshl v17.4s, v17.4s, v29.4s\n"
+
+        "add v16.4s, v16.4s, v28.4s\n"
+        "add v17.4s, v17.4s, v28.4s\n"
+        "smax v16.4s, v16.4s, v30.4s\n"
+        "smax v17.4s, v17.4s, v30.4s\n"
+        "smin v16.4s, v16.4s, v31.4s\n"
+        "smin v17.4s, v17.4s, v31.4s\n"
+        "sqxtn v16.4h, v16.4s\n"
+        "sqxtn2 v16.8h, v17.4s\n"
+        "sqxtun v16.8b, v16.8h\n"
+        "st1 {v16.8b}, [%[output_ptr]]\n"
+        :
+        // Outputs.
+        [filter_ptr] "+r"(filter_ptr), [input_ptr] "+r"(input_ptr),
+        [output_ptr] "+r"(output_ptr), [bias_ptr] "+r"(bias_ptr)
+        :
+        // Inputs.
+        [params_ptr] "r"(params_ptr)
+        :
+        // Clobbers.
+        "cc", "memory",
+        // We use these NEON registers.
+        "v0", "v8", "v16", "v17", "v18", "v19", "v25", "v26", "v27", "v28",
+        "v29", "v30", "v31",
+        // We use these general-purpose registers.
+        "x9", "x10", "x11");
+#undef DEPTHWISECONV_LABEL_DEPTH_8_LOOP
+#undef DEPTHWISECONV_LABEL_DEPTH_8_AFTER_LOOP
   }
 };
 
 template <>
-struct ConvKernel3x3FilterDepth8<2, 1, 2, 2> {
-  static inline void Run(const uint8* input_ptr, int input_depth,
-                         int32 input_offset, int input_row_size,
-                         const uint8* filter_ptr, int32 filter_offset,
-                         const int32* bias_ptr, int32 output_offset,
-                         int32 output_multiplier, int output_shift,
-                         int32 output_activation_min,
-                         int32 output_activation_max, uint8* output_ptr,
-                         int output_depth, int output_width) {
-    const int output_row_size = output_depth * output_width;
-
-    Filter3x3x8 filter = Load3x3Filter(filter_ptr, filter_offset, output_depth);
-
-    const int16x8_t input_offset_vec = vdupq_n_s16(input_offset);
-    int16x8_t input_0, input_1, input_2, input_3, input_4, input_5, input_6,
-        input_7, input_8;
-    uint8x8_t temp_0, temp_1, temp_2, temp_3, temp_4, temp_5, temp_6, temp_7,
-        temp_8;
-
-    const uint8* ptr = input_ptr;
-
-    // Load all inputs for top output.
-    temp_0 = vld1_u8(ptr);
-    temp_1 = vld1_u8(ptr + input_depth);
-    temp_2 = vld1_u8(ptr + 2 * input_depth);
-    ptr += input_row_size;
-    temp_3 = vld1_u8(ptr);
-    temp_4 = vld1_u8(ptr + input_depth);
-    temp_5 = vld1_u8(ptr + 2 * input_depth);
-    ptr += input_row_size;
-    temp_6 = vld1_u8(ptr);
-    temp_7 = vld1_u8(ptr + input_depth);
-    temp_8 = vld1_u8(ptr + 2 * input_depth);
-
-    input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-    input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-    input_2 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-    input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-    input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-    input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-    input_6 = vreinterpretq_s16_u16(vmovl_u8(temp_6));
-    input_7 = vreinterpretq_s16_u16(vmovl_u8(temp_7));
-    input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_8));
-
-    input_0 = vaddq_s16(input_0, input_offset_vec);
-    input_1 = vaddq_s16(input_1, input_offset_vec);
-    input_2 = vaddq_s16(input_2, input_offset_vec);
-    input_3 = vaddq_s16(input_3, input_offset_vec);
-    input_4 = vaddq_s16(input_4, input_offset_vec);
-    input_5 = vaddq_s16(input_5, input_offset_vec);
-    input_6 = vaddq_s16(input_6, input_offset_vec);
-    input_7 = vaddq_s16(input_7, input_offset_vec);
-    input_8 = vaddq_s16(input_8, input_offset_vec);
-
-    DotProductAndStore(
-        filter, input_0, input_1, input_2, input_3, input_4, input_5, input_6,
-        input_7, input_8, bias_ptr, output_offset, output_multiplier,
-        output_shift, output_activation_min, output_activation_max, output_ptr);
-
-    // Second output.
-    output_ptr += output_row_size;
-
-    ptr += input_row_size;
-    temp_0 = vld1_u8(ptr);
-    temp_1 = vld1_u8(ptr + input_depth);
-    temp_2 = vld1_u8(ptr + 2 * input_depth);
-    ptr += input_row_size;
-    temp_3 = vld1_u8(ptr);
-    temp_4 = vld1_u8(ptr + input_depth);
-    temp_5 = vld1_u8(ptr + 2 * input_depth);
-
-    input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-    input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-    input_2 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-    input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-    input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-    input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-
-    input_0 = vaddq_s16(input_0, input_offset_vec);
-    input_1 = vaddq_s16(input_1, input_offset_vec);
-    input_2 = vaddq_s16(input_2, input_offset_vec);
-    input_3 = vaddq_s16(input_3, input_offset_vec);
-    input_4 = vaddq_s16(input_4, input_offset_vec);
-    input_5 = vaddq_s16(input_5, input_offset_vec);
-
-    DotProductAndStore(
-        filter, input_6, input_7, input_8, input_0, input_1, input_2, input_3,
-        input_4, input_5, bias_ptr, output_offset, output_multiplier,
-        output_shift, output_activation_min, output_activation_max, output_ptr);
+struct DepthwiseConvPartial<EdgeType::kCorner, 1, 1> {
+  static inline void Run(const uint8* input_ptr, const uint8* filter_ptr,
+                         const int32* bias_ptr, uint8* output_ptr,
+                         const DepthwiseConvParams* params_ptr) {
+#define DEPTHWISECONV_LABEL_DEPTH_8_LOOP "1"
+#define DEPTHWISECONV_LABEL_DEPTH_8_AFTER_LOOP "2"
+    asm volatile(
+        // Performs depthwise convolutions for an input window of size 2x2 and
+        // padding of 1 across the full depth. Expects |input_ptr| and
+        // |filter_ptr| to be pointing to the beginning of the 2x2 input and
+        // filter values.
+
+        // Load input and filter values.
+        "ldr x15, [%[params_ptr], #" STR(OFFSET_OUTPUT_DEPTH) "]\n"
+        "ldr x9, [%[params_ptr], #" STR(OFFSET_INPUT_ROW_SIZE) "]\n"
+        "cmp x15, #16\n"
+        "add x12, %[input_ptr], x15\n"
+        "add x13, %[input_ptr], x9\n"
+        "ld1 {v8.8b}, [%[input_ptr]], #8\n"
+        "add x14, x13, x15\n"
+        "ld1 {v9.8b}, [x12], #8\n"
+        "ldr x6, [%[params_ptr], #" STR(OFFSET_FILTER_ROW_SIZE) "]\n"
+
+        "add x9, %[filter_ptr], x15\n"
+        "ld1 {v10.8b}, [x13], #8\n"
+        "add x10, %[filter_ptr], x6\n"
+        "ld1 {v11.8b}, [x14], #8\n"
+        "ld1 {v0.8b}, [%[filter_ptr]], #8\n"
+        "add x11, x10, x15\n"
+        "ld1 {v1.8b}, [x9], #8\n"
+        "ld1 {v2.8b}, [x10], #8\n"
+        "ld1 {v3.8b}, [x11], #8\n"
+
+        // Load constants.
+        "ldr w6, [%[params_ptr], #" STR(OFFSET_INPUT_OFFSET) "]\n"
+        "ldr w7, [%[params_ptr], #" STR(OFFSET_OUTPUT_MULTIPLIER) "]\n"
+        "dup v26.8h, w6\n"
+        "ldr w6, [%[params_ptr], #" STR(OFFSET_OUTPUT_OFFSET) "]\n"
+        "dup v27.4s, w7\n"
+        "ldr w7, [%[params_ptr], #" STR(OFFSET_OUTPUT_SHIFT) "]\n"
+        "dup v28.4s, w6\n"
+        "ldr w6, [%[params_ptr], #" STR(OFFSET_OUTPUT_ACTIVATION_MIN) "]\n"
+        "neg w7, w7\n"
+        "dup v29.4s, w7\n"
+        "ldr w7, [%[params_ptr], #" STR(OFFSET_OUTPUT_ACTIVATION_MAX) "]\n"
+        "dup v30.4s, w6\n"
+        "ldr w6, [%[params_ptr], #" STR(OFFSET_FILTER_OFFSET) "]\n"
+        "dup v31.4s, w7\n"
+        "dup v25.8h, w6\n"
+
+        // Add input and filter offsets.
+        "uaddw v8.8h, v26.8h, v8.8b\n"
+        "ld1 {v16.4s}, [%[bias_ptr]], #16\n"
+        "uaddw v9.8h, v26.8h, v9.8b\n"
+        "ld1 {v17.4s}, [%[bias_ptr]], #16\n"
+        "uaddw v10.8h, v26.8h, v10.8b\n"
+        "uaddw v11.8h, v26.8h, v11.8b\n"
+
+        "uaddw v0.8h, v25.8h, v0.8b\n"
+        "uaddw v1.8h, v25.8h, v1.8b\n"
+        "uaddw v2.8h, v25.8h, v2.8b\n"
+        "uaddw v3.8h, v25.8h, v3.8b\n"
+
+        "blt " DEPTHWISECONV_LABEL_DEPTH_8_AFTER_LOOP "f\n"
+
+        //"loop_%=:\n"
+        DEPTHWISECONV_LABEL_DEPTH_8_LOOP ":\n"
+          "smlal v16.4s, v0.4h, v8.4h\n"
+          "subs x15, x15, #8\n"
+          "smlal2 v17.4s, v0.8h, v8.8h\n"
+          "ld1 {v8.8b}, [%[input_ptr]], #8\n"
+          "cmp x15, #16\n"
+          "ld1 {v0.8b}, [%[filter_ptr]], #8\n"
+          "smlal v16.4s, v1.4h, v9.4h\n"
+          "smlal2 v17.4s, v1.8h, v9.8h\n"
+          "ld1 {v9.8b}, [x12], #8\n"
+          "smlal v16.4s, v2.4h, v10.4h\n"
+          "ld1 {v1.8b}, [x9], #8\n"
+          "smlal2 v17.4s, v2.8h, v10.8h\n"
+          "ld1 {v10.8b}, [x13], #8\n"
+          "smlal v16.4s, v3.4h, v11.4h\n"
+          "ld1 {v2.8b}, [x10], #8\n"
+          "smlal2 v17.4s, v3.8h, v11.8h\n"
+          "ld1 {v11.8b}, [x14], #8\n"
+          "ld1 {v3.8b}, [x11], #8\n"
+
+          "sqrdmulh v16.4s, v16.4s, v27.4s\n"
+          "sqrdmulh v17.4s, v17.4s, v27.4s\n"
+          "and v18.16b, v16.16b, v29.16b\n"
+          "and v19.16b, v17.16b, v29.16b\n"
+          "sshr v18.4s, v18.4s, #31\n"
+          "sshr v19.4s, v19.4s, #31\n"
+          "sqadd v16.4s, v16.4s, v18.4s\n"
+          "sqadd v17.4s, v17.4s, v19.4s\n"
+          "srshl v16.4s, v16.4s, v29.4s\n"
+          "srshl v17.4s, v17.4s, v29.4s\n"
+          "add v16.4s, v16.4s, v28.4s\n"
+          "add v17.4s, v17.4s, v28.4s\n"
+          "smax v16.4s, v16.4s, v30.4s\n"
+          "smax v17.4s, v17.4s, v30.4s\n"
+          "smin v16.4s, v16.4s, v31.4s\n"
+          "smin v17.4s, v17.4s, v31.4s\n"
+          "sqxtn v16.4h, v16.4s\n"
+          "sqxtn2 v16.8h, v17.4s\n"
+          "sqxtun v16.8b, v16.8h\n"
+          "st1 {v16.8b}, [%[output_ptr]], #8\n"
+          "uaddw v8.8h, v26.8h, v8.8b\n"
+          "ld1 {v16.4s}, [%[bias_ptr]], #16\n"
+          "uaddw v9.8h, v26.8h, v9.8b\n"
+          "ld1 {v17.4s}, [%[bias_ptr]], #16\n"
+          "uaddw v10.8h, v26.8h, v10.8b\n"
+          "uaddw v11.8h, v26.8h, v11.8b\n"
+          "uaddw v0.8h, v25.8h, v0.8b\n"
+          "uaddw v1.8h, v25.8h, v1.8b\n"
+          "uaddw v2.8h, v25.8h, v2.8b\n"
+          "uaddw v3.8h, v25.8h, v3.8b\n"
+
+          "bge " DEPTHWISECONV_LABEL_DEPTH_8_LOOP "b\n"
+
+        DEPTHWISECONV_LABEL_DEPTH_8_AFTER_LOOP ":\n"
+        "smlal v16.4s, v0.4h, v8.4h\n"
+        "smlal2 v17.4s, v0.8h, v8.8h\n"
+        "smlal v16.4s, v1.4h, v9.4h\n"
+        "smlal2 v17.4s, v1.8h, v9.8h\n"
+        "smlal v16.4s, v2.4h, v10.4h\n"
+        "smlal2 v17.4s, v2.8h, v10.8h\n"
+        "smlal v16.4s, v3.4h, v11.4h\n"
+        "smlal2 v17.4s, v3.8h, v11.8h\n"
+
+        "sqrdmulh v16.4s, v16.4s, v27.4s\n"
+        "sqrdmulh v17.4s, v17.4s, v27.4s\n"
+        "and v18.16b, v16.16b, v29.16b\n"
+        "and v19.16b, v17.16b, v29.16b\n"
+        "sshr v18.4s, v18.4s, #31\n"
+        "sshr v19.4s, v19.4s, #31\n"
+        "sqadd v16.4s, v16.4s, v18.4s\n"
+        "sqadd v17.4s, v17.4s, v19.4s\n"
+        "srshl v16.4s, v16.4s, v29.4s\n"
+        "srshl v17.4s, v17.4s, v29.4s\n"
+
+        "add v16.4s, v16.4s, v28.4s\n"
+        "add v17.4s, v17.4s, v28.4s\n"
+        "smax v16.4s, v16.4s, v30.4s\n"
+        "smax v17.4s, v17.4s, v30.4s\n"
+        "smin v16.4s, v16.4s, v31.4s\n"
+        "smin v17.4s, v17.4s, v31.4s\n"
+        "sqxtn v16.4h, v16.4s\n"
+        "sqxtn2 v16.8h, v17.4s\n"
+        "sqxtun v16.8b, v16.8h\n"
+        "st1 {v16.8b}, [%[output_ptr]]\n"
+        :
+        // Outputs.
+        [filter_ptr] "+r"(filter_ptr), [input_ptr] "+r"(input_ptr),
+        [output_ptr] "+r"(output_ptr), [bias_ptr] "+r"(bias_ptr)
+        :
+        // Inputs.
+        [params_ptr] "r"(params_ptr)
+        :
+        // Clobbers.
+        "cc", "memory",
+        // We use these NEON registers.
+        "v0", "v1", "v2", "v3", "v8", "v9", "v10", "v11", "v16", "v17", "v18",
+        "v19", "v25", "v26", "v27", "v28", "v29", "v30", "v31",
+        // We use these general-purpose registers.
+        "x6", "x7", "x9", "x10", "x11", "x12", "x13", "x14", "x15");
+#undef DEPTHWISECONV_LABEL_DEPTH_8_LOOP
+#undef DEPTHWISECONV_LABEL_DEPTH_8_AFTER_LOOP
   }
 };
 
 template <>
-struct ConvKernel3x3FilterDepth8<1, 2, 2, 2> {
-  static inline void Run(const uint8* input_ptr, int input_depth,
-                         int32 input_offset, int input_row_size,
-                         const uint8* filter_ptr, int32 filter_offset,
-                         const int32* bias_ptr, int32 output_offset,
-                         int32 output_multiplier, int output_shift,
-                         int32 output_activation_min,
-                         int32 output_activation_max, uint8* output_ptr,
-                         int output_depth, int output_width) {
-    Filter3x3x8 filter = Load3x3Filter(filter_ptr, filter_offset, output_depth);
-
-    const int16x8_t input_offset_vec = vdupq_n_s16(input_offset);
-    int16x8_t input_0, input_1, input_2, input_3, input_4, input_5, input_6,
-        input_7, input_8;
-    uint8x8_t temp_0, temp_1, temp_2, temp_3, temp_4, temp_5, temp_6, temp_7,
-        temp_8;
-
-    const uint8* ptr = input_ptr;
-
-    // Load all inputs for top output.
-    temp_0 = vld1_u8(ptr);
-    temp_1 = vld1_u8(ptr + input_depth);
-    temp_2 = vld1_u8(ptr + 2 * input_depth);
-    ptr += input_row_size;
-    temp_3 = vld1_u8(ptr);
-    temp_4 = vld1_u8(ptr + input_depth);
-    temp_5 = vld1_u8(ptr + 2 * input_depth);
-    ptr += input_row_size;
-    temp_6 = vld1_u8(ptr);
-    temp_7 = vld1_u8(ptr + input_depth);
-    temp_8 = vld1_u8(ptr + 2 * input_depth);
-
-    input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-    input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-    input_2 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-    input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-    input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-    input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-    input_6 = vreinterpretq_s16_u16(vmovl_u8(temp_6));
-    input_7 = vreinterpretq_s16_u16(vmovl_u8(temp_7));
-    input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_8));
-
-    input_0 = vaddq_s16(input_0, input_offset_vec);
-    input_1 = vaddq_s16(input_1, input_offset_vec);
-    input_2 = vaddq_s16(input_2, input_offset_vec);
-    input_3 = vaddq_s16(input_3, input_offset_vec);
-    input_4 = vaddq_s16(input_4, input_offset_vec);
-    input_5 = vaddq_s16(input_5, input_offset_vec);
-    input_6 = vaddq_s16(input_6, input_offset_vec);
-    input_7 = vaddq_s16(input_7, input_offset_vec);
-    input_8 = vaddq_s16(input_8, input_offset_vec);
-
-    DotProductAndStore(
-        filter, input_0, input_1, input_2, input_3, input_4, input_5, input_6,
-        input_7, input_8, bias_ptr, output_offset, output_multiplier,
-        output_shift, output_activation_min, output_activation_max, output_ptr);
-
-    // Second output.
-    output_ptr += output_depth;
-
-    ptr = input_ptr + 3 * input_depth;
-    temp_0 = vld1_u8(ptr);
-    temp_1 = vld1_u8(ptr + input_depth);
-    ptr += input_row_size;
-    temp_3 = vld1_u8(ptr);
-    temp_4 = vld1_u8(ptr + input_depth);
-    ptr += input_row_size;
-    temp_6 = vld1_u8(ptr);
-    temp_7 = vld1_u8(ptr + input_depth);
-
-    input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-    input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-    input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-    input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-    input_6 = vreinterpretq_s16_u16(vmovl_u8(temp_6));
-    input_7 = vreinterpretq_s16_u16(vmovl_u8(temp_7));
-
-    input_0 = vaddq_s16(input_0, input_offset_vec);
-    input_1 = vaddq_s16(input_1, input_offset_vec);
-    input_3 = vaddq_s16(input_3, input_offset_vec);
-    input_4 = vaddq_s16(input_4, input_offset_vec);
-    input_6 = vaddq_s16(input_6, input_offset_vec);
-    input_7 = vaddq_s16(input_7, input_offset_vec);
-
-    DotProductAndStore(
-        filter, input_2, input_0, input_1, input_5, input_3, input_4, input_8,
-        input_6, input_7, bias_ptr, output_offset, output_multiplier,
-        output_shift, output_activation_min, output_activation_max, output_ptr);
+struct DepthwiseConvPartial<EdgeType::kHorizontal, 1, 1> {
+  static inline void Run(const uint8* input_ptr, const uint8* filter_ptr,
+                         const int32* bias_ptr, uint8* output_ptr,
+                         const DepthwiseConvParams* params_ptr) {
+#define DEPTHWISECONV_LABEL_DEPTH_8_LOOP "1"
+#define DEPTHWISECONV_LABEL_DEPTH_8_AFTER_LOOP "2"
+    asm volatile(
+        // Performs depthwise convolutions for an input window of size 2x3 and
+        // padding of 1 across the full depth. Expects |input_ptr| and
+        // |filter_ptr| to be pointing to the beginning of the 2x3 input and
+        // filter values.
+
+        // Load input and filter values.
+        "ldr x7, [%[params_ptr], #" STR(OFFSET_INPUT_DEPTH) "]\n"
+        "mov x12, %[input_ptr]\n"
+        "ldr x11, [%[params_ptr], #" STR(OFFSET_INPUT_ROW_SIZE) "]\n"
+        "mov x9, %[filter_ptr]\n"
+        "ldr x14, [%[params_ptr], #" STR(OFFSET_FILTER_ROW_SIZE) "]\n"
+        "add x13, x12, x11\n"
+        "ldr x15, [%[params_ptr], #" STR(OFFSET_OUTPUT_DEPTH) "]\n"
+
+        "ld1 {v8.8b}, [x12], x7\n"
+        "add x10, x9, x14\n"
+        "ld1 {v9.8b}, [x12], x7\n"
+        "cmp x15, #16\n"
+        "ld1 {v10.8b}, [x12]\n"
+        "add %[input_ptr], %[input_ptr], #8\n"
+        "ld1 {v11.8b}, [x13], x7\n"
+        "add %[filter_ptr], %[filter_ptr], #8\n"
+        "ld1 {v12.8b}, [x13], x7\n"
+        "ld1 {v13.8b}, [x13]\n"
+
+        "ld1 {v0.8b}, [x9], x7\n"
+        "ld1 {v1.8b}, [x9], x7\n"
+        "ld1 {v2.8b}, [x9]\n"
+        "ld1 {v3.8b}, [x10], x7\n"
+        "ld1 {v4.8b}, [x10], x7\n"
+        "ld1 {v5.8b}, [x10]\n"
+
+        // Load constants.
+        "ldr w12, [%[params_ptr], #" STR(OFFSET_INPUT_OFFSET) "]\n"
+        "ldr w13, [%[params_ptr], #" STR(OFFSET_OUTPUT_MULTIPLIER) "]\n"
+        "dup v26.8h, w12\n"
+        "ldr w12, [%[params_ptr], #" STR(OFFSET_OUTPUT_OFFSET) "]\n"
+        "dup v27.4s, w13\n"
+        "ldr w13, [%[params_ptr], #" STR(OFFSET_OUTPUT_SHIFT) "]\n"
+        "dup v28.4s, w12\n"
+        "ldr w12, [%[params_ptr], #" STR(OFFSET_OUTPUT_ACTIVATION_MIN) "]\n"
+        "neg w13, w13\n"
+        "dup v29.4s, w13\n"
+        "ldr w13, [%[params_ptr], #" STR(OFFSET_OUTPUT_ACTIVATION_MAX) "]\n"
+        "dup v30.4s, w12\n"
+        "ldr w12, [%[params_ptr], #" STR(OFFSET_FILTER_OFFSET) "]\n"
+        "dup v31.4s, w13\n"
+        "dup v25.8h, w12\n"
+
+        // Add input and filter offsets.
+        "uaddw v8.8h, v26.8h, v8.8b\n"
+        "ld1 {v16.4s}, [%[bias_ptr]], #16\n"
+        "uaddw v9.8h, v26.8h, v9.8b\n"
+        "ld1 {v17.4s}, [%[bias_ptr]], #16\n"
+        "uaddw v10.8h, v26.8h, v10.8b\n"
+        "uaddw v11.8h, v26.8h, v11.8b\n"
+        "uaddw v12.8h, v26.8h, v12.8b\n"
+        "uaddw v13.8h, v26.8h, v13.8b\n"
+
+        "uaddw v0.8h, v25.8h, v0.8b\n"
+        "uaddw v1.8h, v25.8h, v1.8b\n"
+        "uaddw v2.8h, v25.8h, v2.8b\n"
+        "uaddw v3.8h, v25.8h, v3.8b\n"
+        "uaddw v4.8h, v25.8h, v4.8b\n"
+        "uaddw v5.8h, v25.8h, v5.8b\n"
+
+        "blt " DEPTHWISECONV_LABEL_DEPTH_8_AFTER_LOOP "f\n"
+
+        //"loop_%=:\n"
+        DEPTHWISECONV_LABEL_DEPTH_8_LOOP ":\n"
+          "mov x12, %[input_ptr]\n"
+          "subs x15, x15, #8\n"
+          "add x13, x12, x11\n"
+          "cmp x15, #16\n"
+          "add %[input_ptr], %[input_ptr], #8\n"
+
+          "smlal v16.4s, v0.4h, v8.4h\n"
+          "mov x9, %[filter_ptr]\n"
+          "smlal2 v17.4s, v0.8h, v8.8h\n"
+          "ld1 {v8.8b}, [x12], x7\n"
+          "smlal v16.4s, v1.4h, v9.4h\n"
+          "add x10, x9, x14\n"
+          "smlal2 v17.4s, v1.8h, v9.8h\n"
+          "ld1 {v9.8b}, [x12], x7\n"
+          "smlal v16.4s, v2.4h, v10.4h\n"
+          "add %[filter_ptr], %[filter_ptr], #8\n"
+          "smlal2 v17.4s, v2.8h, v10.8h\n"
+          "ld1 {v10.8b}, [x12]\n"
+          "smlal v16.4s, v3.4h, v11.4h\n"
+          "ld1 {v0.8b}, [x9], x7\n"
+          "smlal2 v17.4s, v3.8h, v11.8h\n"
+          "ld1 {v11.8b}, [x13], x7\n"
+          "smlal v16.4s, v4.4h, v12.4h\n"
+          "ld1 {v1.8b}, [x9], x7\n"
+          "smlal2 v17.4s, v4.8h, v12.8h\n"
+          "ld1 {v12.8b}, [x13], x7\n"
+          "smlal v16.4s, v5.4h, v13.4h\n"
+          "ld1 {v2.8b}, [x9]\n"
+          "smlal2 v17.4s, v5.8h, v13.8h\n"
+          "ld1 {v13.8b}, [x13]\n"
+
+          "sqrdmulh v16.4s, v16.4s, v27.4s\n"
+          "ld1 {v3.8b}, [x10], x7\n"
+          "sqrdmulh v17.4s, v17.4s, v27.4s\n"
+          "ld1 {v4.8b}, [x10], x7\n"
+          "and v18.16b, v16.16b, v29.16b\n"
+          "ld1 {v5.8b}, [x10]\n"
+          "and v19.16b, v17.16b, v29.16b\n"
+          "sshr v18.4s, v18.4s, #31\n"
+          "sshr v19.4s, v19.4s, #31\n"
+          "sqadd v16.4s, v16.4s, v18.4s\n"
+          "sqadd v17.4s, v17.4s, v19.4s\n"
+          "srshl v16.4s, v16.4s, v29.4s\n"
+          "srshl v17.4s, v17.4s, v29.4s\n"
+          "add v16.4s, v16.4s, v28.4s\n"
+          "add v17.4s, v17.4s, v28.4s\n"
+          "smax v16.4s, v16.4s, v30.4s\n"
+          "smax v17.4s, v17.4s, v30.4s\n"
+          "smin v16.4s, v16.4s, v31.4s\n"
+          "smin v17.4s, v17.4s, v31.4s\n"
+          "sqxtn v16.4h, v16.4s\n"
+          "sqxtn2 v16.8h, v17.4s\n"
+          "sqxtun v16.8b, v16.8h\n"
+          "uaddw v8.8h, v26.8h, v8.8b\n"
+          "st1 {v16.8b}, [%[output_ptr]], #8\n"
+          "uaddw v9.8h, v26.8h, v9.8b\n"
+          "uaddw v10.8h, v26.8h, v10.8b\n"
+          "uaddw v11.8h, v26.8h, v11.8b\n"
+          "uaddw v12.8h, v26.8h, v12.8b\n"
+          "uaddw v13.8h, v26.8h, v13.8b\n"
+
+          "uaddw v0.8h, v25.8h, v0.8b\n"
+          "uaddw v1.8h, v25.8h, v1.8b\n"
+          "uaddw v2.8h, v25.8h, v2.8b\n"
+          "ld1 {v16.4s}, [%[bias_ptr]], #16\n"
+          "uaddw v3.8h, v25.8h, v3.8b\n"
+          "ld1 {v17.4s}, [%[bias_ptr]], #16\n"
+          "uaddw v4.8h, v25.8h, v4.8b\n"
+          "uaddw v5.8h, v25.8h, v5.8b\n"
+
+          "bge " DEPTHWISECONV_LABEL_DEPTH_8_LOOP "b\n"
+
+        DEPTHWISECONV_LABEL_DEPTH_8_AFTER_LOOP ":\n"
+        "smlal v16.4s, v0.4h, v8.4h\n"
+        "smlal2 v17.4s, v0.8h, v8.8h\n"
+        "smlal v16.4s, v1.4h, v9.4h\n"
+        "smlal2 v17.4s, v1.8h, v9.8h\n"
+        "smlal v16.4s, v2.4h, v10.4h\n"
+        "smlal2 v17.4s, v2.8h, v10.8h\n"
+        "smlal v16.4s, v3.4h, v11.4h\n"
+        "smlal2 v17.4s, v3.8h, v11.8h\n"
+        "smlal v16.4s, v4.4h, v12.4h\n"
+        "smlal2 v17.4s, v4.8h, v12.8h\n"
+        "smlal v16.4s, v5.4h, v13.4h\n"
+        "smlal2 v17.4s, v5.8h, v13.8h\n"
+
+        "sqrdmulh v16.4s, v16.4s, v27.4s\n"
+        "sqrdmulh v17.4s, v17.4s, v27.4s\n"
+        "and v18.16b, v16.16b, v29.16b\n"
+        "and v19.16b, v17.16b, v29.16b\n"
+        "sshr v18.4s, v18.4s, #31\n"
+        "sshr v19.4s, v19.4s, #31\n"
+        "sqadd v16.4s, v16.4s, v18.4s\n"
+        "sqadd v17.4s, v17.4s, v19.4s\n"
+        "srshl v16.4s, v16.4s, v29.4s\n"
+        "srshl v17.4s, v17.4s, v29.4s\n"
+        "add v16.4s, v16.4s, v28.4s\n"
+        "add v17.4s, v17.4s, v28.4s\n"
+        "smax v16.4s, v16.4s, v30.4s\n"
+        "smax v17.4s, v17.4s, v30.4s\n"
+        "smin v16.4s, v16.4s, v31.4s\n"
+        "smin v17.4s, v17.4s, v31.4s\n"
+        "sqxtn v16.4h, v16.4s\n"
+        "sqxtn2 v16.8h, v17.4s\n"
+        "sqxtun v16.8b, v16.8h\n"
+        "st1 {v16.8b}, [%[output_ptr]]\n"
+        :
+        // Outputs.
+        [filter_ptr] "+r"(filter_ptr), [input_ptr] "+r"(input_ptr),
+        [output_ptr] "+r"(output_ptr), [bias_ptr] "+r"(bias_ptr)
+        :
+        // Inputs.
+        [params_ptr] "r"(params_ptr)
+        :
+        // Clobbers.
+        "cc", "memory",
+        // We use these NEON registers.
+        "v0", "v1", "v2", "v3", "v4", "v5", "v8", "v9", "v10", "v11", "v12",
+        "v13", "v16", "v17", "v18", "v19", "v25", "v26", "v27", "v28", "v29",
+        "v30", "v31",
+        // We use these general-purpose registers.
+        "x7", "x9", "x10", "x11", "x12", "x13", "x14", "x15");
+#undef DEPTHWISECONV_LABEL_DEPTH_8_LOOP
+#undef DEPTHWISECONV_LABEL_DEPTH_8_AFTER_LOOP
   }
 };
 
 template <>
-struct ConvKernel3x3FilterDepth8<1, 4, 2, 2> {
-  static inline void Run(const uint8* input_ptr, int input_depth,
-                         int32 input_offset, int input_row_size,
-                         const uint8* filter_ptr, int32 filter_offset,
-                         const int32* bias_ptr, int32 output_offset,
-                         int32 output_multiplier, int output_shift,
-                         int32 output_activation_min,
-                         int32 output_activation_max, uint8* output_ptr,
-                         int output_depth, int output_width) {
-    Filter3x3x8 filter = Load3x3Filter(filter_ptr, filter_offset, output_depth);
-
-    const int16x8_t input_offset_vec = vdupq_n_s16(input_offset);
-    int16x8_t input_0, input_1, input_2, input_3, input_4, input_5, input_6,
-        input_7, input_8;
-    uint8x8_t temp_0, temp_1, temp_2, temp_3, temp_4, temp_5, temp_6, temp_7,
-        temp_8;
-
-    const uint8* ptr = input_ptr;
-
-    // Load all inputs for top output.
-    temp_0 = vld1_u8(ptr);
-    temp_1 = vld1_u8(ptr + input_depth);
-    temp_2 = vld1_u8(ptr + 2 * input_depth);
-    ptr += input_row_size;
-    temp_3 = vld1_u8(ptr);
-    temp_4 = vld1_u8(ptr + input_depth);
-    temp_5 = vld1_u8(ptr + 2 * input_depth);
-    ptr += input_row_size;
-    temp_6 = vld1_u8(ptr);
-    temp_7 = vld1_u8(ptr + input_depth);
-    temp_8 = vld1_u8(ptr + 2 * input_depth);
-
-    input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-    input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-    input_2 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-    input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-    input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-    input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-    input_6 = vreinterpretq_s16_u16(vmovl_u8(temp_6));
-    input_7 = vreinterpretq_s16_u16(vmovl_u8(temp_7));
-    input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_8));
-
-    input_0 = vaddq_s16(input_0, input_offset_vec);
-    input_1 = vaddq_s16(input_1, input_offset_vec);
-    input_2 = vaddq_s16(input_2, input_offset_vec);
-    input_3 = vaddq_s16(input_3, input_offset_vec);
-    input_4 = vaddq_s16(input_4, input_offset_vec);
-    input_5 = vaddq_s16(input_5, input_offset_vec);
-    input_6 = vaddq_s16(input_6, input_offset_vec);
-    input_7 = vaddq_s16(input_7, input_offset_vec);
-    input_8 = vaddq_s16(input_8, input_offset_vec);
-
-    DotProductAndStore(
-        filter, input_0, input_1, input_2, input_3, input_4, input_5, input_6,
-        input_7, input_8, bias_ptr, output_offset, output_multiplier,
-        output_shift, output_activation_min, output_activation_max, output_ptr);
-
-    // Second output.
-    output_ptr += output_depth;
-
-    ptr = input_ptr + 3 * input_depth;
-    temp_0 = vld1_u8(ptr);
-    temp_1 = vld1_u8(ptr + input_depth);
-    ptr += input_row_size;
-    temp_3 = vld1_u8(ptr);
-    temp_4 = vld1_u8(ptr + input_depth);
-    ptr += input_row_size;
-    temp_6 = vld1_u8(ptr);
-    temp_7 = vld1_u8(ptr + input_depth);
-
-    input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-    input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-    input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-    input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-    input_6 = vreinterpretq_s16_u16(vmovl_u8(temp_6));
-    input_7 = vreinterpretq_s16_u16(vmovl_u8(temp_7));
-
-    input_0 = vaddq_s16(input_0, input_offset_vec);
-    input_1 = vaddq_s16(input_1, input_offset_vec);
-    input_3 = vaddq_s16(input_3, input_offset_vec);
-    input_4 = vaddq_s16(input_4, input_offset_vec);
-    input_6 = vaddq_s16(input_6, input_offset_vec);
-    input_7 = vaddq_s16(input_7, input_offset_vec);
-
-    DotProductAndStore(
-        filter, input_2, input_0, input_1, input_5, input_3, input_4, input_8,
-        input_6, input_7, bias_ptr, output_offset, output_multiplier,
-        output_shift, output_activation_min, output_activation_max, output_ptr);
-
-    // Third output.
-    output_ptr += output_depth;
-
-    ptr = input_ptr + 5 * input_depth;
-    temp_2 = vld1_u8(ptr);
-    temp_0 = vld1_u8(ptr + input_depth);
-    ptr += input_row_size;
-    temp_5 = vld1_u8(ptr);
-    temp_3 = vld1_u8(ptr + input_depth);
-    ptr += input_row_size;
-    temp_8 = vld1_u8(ptr);
-    temp_6 = vld1_u8(ptr + input_depth);
-
-    input_2 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-    input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-    input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-    input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-    input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_8));
-    input_6 = vreinterpretq_s16_u16(vmovl_u8(temp_6));
-
-    input_2 = vaddq_s16(input_2, input_offset_vec);
-    input_0 = vaddq_s16(input_0, input_offset_vec);
-    input_5 = vaddq_s16(input_5, input_offset_vec);
-    input_3 = vaddq_s16(input_3, input_offset_vec);
-    input_8 = vaddq_s16(input_8, input_offset_vec);
-    input_6 = vaddq_s16(input_6, input_offset_vec);
-
-    DotProductAndStore(
-        filter, input_1, input_2, input_0, input_4, input_5, input_3, input_7,
-        input_8, input_6, bias_ptr, output_offset, output_multiplier,
-        output_shift, output_activation_min, output_activation_max, output_ptr);
-
-    // Fourth output.
-    output_ptr += output_depth;
-
-    ptr = input_ptr + 7 * input_depth;
-    temp_1 = vld1_u8(ptr);
-    temp_2 = vld1_u8(ptr + input_depth);
-    ptr += input_row_size;
-    temp_4 = vld1_u8(ptr);
-    temp_5 = vld1_u8(ptr + input_depth);
-    ptr += input_row_size;
-    temp_7 = vld1_u8(ptr);
-    temp_8 = vld1_u8(ptr + input_depth);
-
-    input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-    input_2 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-    input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-    input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-    input_7 = vreinterpretq_s16_u16(vmovl_u8(temp_7));
-    input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_8));
-
-    input_1 = vaddq_s16(input_1, input_offset_vec);
-    input_2 = vaddq_s16(input_2, input_offset_vec);
-    input_4 = vaddq_s16(input_4, input_offset_vec);
-    input_5 = vaddq_s16(input_5, input_offset_vec);
-    input_7 = vaddq_s16(input_7, input_offset_vec);
-    input_8 = vaddq_s16(input_8, input_offset_vec);
-
-    DotProductAndStore(
-        filter, input_0, input_1, input_2, input_3, input_4, input_5, input_6,
-        input_7, input_8, bias_ptr, output_offset, output_multiplier,
-        output_shift, output_activation_min, output_activation_max, output_ptr);
+struct DepthwiseConvPartial<EdgeType::kVertical, 1, 1> {
+  static inline void Run(const uint8* input_ptr, const uint8* filter_ptr,
+                         const int32* bias_ptr, uint8* output_ptr,
+                         const DepthwiseConvParams* params_ptr) {
+#define DEPTHWISECONV_LABEL_DEPTH_8_LOOP "1"
+#define DEPTHWISECONV_LABEL_DEPTH_8_AFTER_LOOP "2"
+    asm volatile(
+        // Performs depthwise convolutions for an input window of size 3x2 and
+        // padding of 1 across the full depth. Expects |input_ptr| and
+        // |filter_ptr| to be pointing to the beginning of the 3x2 input and
+        // filter values.
+
+        // Load input and filter values.
+        "ldr x6, [%[params_ptr], #" STR(OFFSET_INPUT_DEPTH) "]\n"
+        "mov x12, %[input_ptr]\n"
+        "ldr x11, [%[params_ptr], #" STR(OFFSET_INPUT_ROW_SIZE) "]\n"
+        "mov x7, %[filter_ptr]\n"
+        "ldr x5, [%[params_ptr], #" STR(OFFSET_FILTER_ROW_SIZE) "]\n"
+        "add x13, x12, x11\n"
+        "ldr x15, [%[params_ptr], #" STR(OFFSET_OUTPUT_DEPTH) "]\n"
+        "add x14, x13, x11\n"
+
+        "ld1 {v8.8b}, [x12], x6\n"
+        "add x9, x7, x5\n"
+        "ld1 {v9.8b}, [x12]\n"
+        "cmp x15, #16\n"
+        "add x10, x9, x5\n"
+        "ld1 {v10.8b}, [x13], x6\n"
+        "add %[input_ptr], %[input_ptr], #8\n"
+        "ld1 {v11.8b}, [x13]\n"
+        "add %[filter_ptr], %[filter_ptr], #8\n"
+        "ld1 {v12.8b}, [x14], x6\n"
+        "ld1 {v13.8b}, [x14]\n"
+
+        "ld1 {v0.8b}, [x7], x6\n"
+        "ld1 {v1.8b}, [x7]\n"
+        "ld1 {v2.8b}, [x9], x6\n"
+        "ld1 {v3.8b}, [x9]\n"
+        "ld1 {v4.8b}, [x10], x6\n"
+        "ld1 {v5.8b}, [x10]\n"
+
+        // Load constants.
+        "ldr w12, [%[params_ptr], #" STR(OFFSET_INPUT_OFFSET) "]\n"
+        "ldr w13, [%[params_ptr], #" STR(OFFSET_OUTPUT_MULTIPLIER) "]\n"
+        "dup v26.8h, w12\n"
+        "ldr w12, [%[params_ptr], #" STR(OFFSET_OUTPUT_OFFSET) "]\n"
+        "dup v27.4s, w13\n"
+        "ldr w13, [%[params_ptr], #" STR(OFFSET_OUTPUT_SHIFT) "]\n"
+        "dup v28.4s, w12\n"
+        "ldr w12, [%[params_ptr], #" STR(OFFSET_OUTPUT_ACTIVATION_MIN) "]\n"
+        "neg w13, w13\n"
+        "dup v29.4s, w13\n"
+        "ldr w13, [%[params_ptr], #" STR(OFFSET_OUTPUT_ACTIVATION_MAX) "]\n"
+        "dup v30.4s, w12\n"
+        "ldr w12, [%[params_ptr], #" STR(OFFSET_FILTER_OFFSET) "]\n"
+        "dup v31.4s, w13\n"
+        "dup v25.8h, w12\n"
+
+        // Add input and filter offsets.
+        "uaddw v8.8h, v26.8h, v8.8b\n"
+        "ld1 {v16.4s}, [%[bias_ptr]], #16\n"
+        "uaddw v9.8h, v26.8h, v9.8b\n"
+        "ld1 {v17.4s}, [%[bias_ptr]], #16\n"
+        "uaddw v10.8h, v26.8h, v10.8b\n"
+        "uaddw v11.8h, v26.8h, v11.8b\n"
+        "uaddw v12.8h, v26.8h, v12.8b\n"
+        "uaddw v13.8h, v26.8h, v13.8b\n"
+
+        "uaddw v0.8h, v25.8h, v0.8b\n"
+        "uaddw v1.8h, v25.8h, v1.8b\n"
+        "uaddw v2.8h, v25.8h, v2.8b\n"
+        "uaddw v3.8h, v25.8h, v3.8b\n"
+        "uaddw v4.8h, v25.8h, v4.8b\n"
+        "uaddw v5.8h, v25.8h, v5.8b\n"
+
+        "blt " DEPTHWISECONV_LABEL_DEPTH_8_AFTER_LOOP "f\n"
+
+        //"loop_%=:\n"
+        DEPTHWISECONV_LABEL_DEPTH_8_LOOP ":\n"
+          "mov x12, %[input_ptr]\n"
+          "subs x15, x15, #8\n"
+          "add x13, x12, x11\n"
+          "cmp x15, #16\n"
+          "add x14, x13, x11\n"
+          "add %[input_ptr], %[input_ptr], #8\n"
+
+          "smlal v16.4s, v0.4h, v8.4h\n"
+          "mov x7, %[filter_ptr]\n"
+          "smlal2 v17.4s, v0.8h, v8.8h\n"
+          "ld1 {v8.8b}, [x12], x6\n"
+          "smlal v16.4s, v1.4h, v9.4h\n"
+          "add x9, x7, x5\n"
+          "smlal2 v17.4s, v1.8h, v9.8h\n"
+          "add x10, x9, x5\n"
+          "ld1 {v9.8b}, [x12]\n"
+          "smlal v16.4s, v2.4h, v10.4h\n"
+          "add %[filter_ptr], %[filter_ptr], #8\n"
+          "smlal2 v17.4s, v2.8h, v10.8h\n"
+          "ld1 {v10.8b}, [x13], x6\n"
+          "smlal v16.4s, v3.4h, v11.4h\n"
+          "ld1 {v0.8b}, [x7], x6\n"
+          "smlal2 v17.4s, v3.8h, v11.8h\n"
+          "ld1 {v11.8b}, [x13]\n"
+          "smlal v16.4s, v4.4h, v12.4h\n"
+          "ld1 {v1.8b}, [x7]\n"
+          "smlal2 v17.4s, v4.8h, v12.8h\n"
+          "ld1 {v12.8b}, [x14], x6\n"
+          "smlal v16.4s, v5.4h, v13.4h\n"
+          "ld1 {v2.8b}, [x9], x6\n"
+          "smlal2 v17.4s, v5.8h, v13.8h\n"
+          "ld1 {v13.8b}, [x14]\n"
+
+          "sqrdmulh v16.4s, v16.4s, v27.4s\n"
+          "ld1 {v3.8b}, [x9]\n"
+          "sqrdmulh v17.4s, v17.4s, v27.4s\n"
+          "ld1 {v4.8b}, [x10], x6\n"
+          "and v18.16b, v16.16b, v29.16b\n"
+          "ld1 {v5.8b}, [x10]\n"
+          "and v19.16b, v17.16b, v29.16b\n"
+          "sshr v18.4s, v18.4s, #31\n"
+          "sshr v19.4s, v19.4s, #31\n"
+          "sqadd v16.4s, v16.4s, v18.4s\n"
+          "sqadd v17.4s, v17.4s, v19.4s\n"
+          "srshl v16.4s, v16.4s, v29.4s\n"
+          "srshl v17.4s, v17.4s, v29.4s\n"
+          "add v16.4s, v16.4s, v28.4s\n"
+          "add v17.4s, v17.4s, v28.4s\n"
+          "smax v16.4s, v16.4s, v30.4s\n"
+          "smax v17.4s, v17.4s, v30.4s\n"
+          "smin v16.4s, v16.4s, v31.4s\n"
+          "smin v17.4s, v17.4s, v31.4s\n"
+          "sqxtn v16.4h, v16.4s\n"
+          "sqxtn2 v16.8h, v17.4s\n"
+          "sqxtun v16.8b, v16.8h\n"
+          "uaddw v8.8h, v26.8h, v8.8b\n"
+          "st1 {v16.8b}, [%[output_ptr]], #8\n"
+          "uaddw v9.8h, v26.8h, v9.8b\n"
+          "uaddw v10.8h, v26.8h, v10.8b\n"
+          "uaddw v11.8h, v26.8h, v11.8b\n"
+          "uaddw v12.8h, v26.8h, v12.8b\n"
+          "uaddw v13.8h, v26.8h, v13.8b\n"
+
+          "uaddw v0.8h, v25.8h, v0.8b\n"
+          "uaddw v1.8h, v25.8h, v1.8b\n"
+          "uaddw v2.8h, v25.8h, v2.8b\n"
+          "ld1 {v16.4s}, [%[bias_ptr]], #16\n"
+          "uaddw v3.8h, v25.8h, v3.8b\n"
+          "ld1 {v17.4s}, [%[bias_ptr]], #16\n"
+          "uaddw v4.8h, v25.8h, v4.8b\n"
+          "uaddw v5.8h, v25.8h, v5.8b\n"
+
+          "bge " DEPTHWISECONV_LABEL_DEPTH_8_LOOP "b\n"
+
+        DEPTHWISECONV_LABEL_DEPTH_8_AFTER_LOOP ":\n"
+        "smlal v16.4s, v0.4h, v8.4h\n"
+        "smlal2 v17.4s, v0.8h, v8.8h\n"
+        "smlal v16.4s, v1.4h, v9.4h\n"
+        "smlal2 v17.4s, v1.8h, v9.8h\n"
+        "smlal v16.4s, v2.4h, v10.4h\n"
+        "smlal2 v17.4s, v2.8h, v10.8h\n"
+        "smlal v16.4s, v3.4h, v11.4h\n"
+        "smlal2 v17.4s, v3.8h, v11.8h\n"
+        "smlal v16.4s, v4.4h, v12.4h\n"
+        "smlal2 v17.4s, v4.8h, v12.8h\n"
+        "smlal v16.4s, v5.4h, v13.4h\n"
+        "smlal2 v17.4s, v5.8h, v13.8h\n"
+
+        "sqrdmulh v16.4s, v16.4s, v27.4s\n"
+        "sqrdmulh v17.4s, v17.4s, v27.4s\n"
+        "and v18.16b, v16.16b, v29.16b\n"
+        "and v19.16b, v17.16b, v29.16b\n"
+        "sshr v18.4s, v18.4s, #31\n"
+        "sshr v19.4s, v19.4s, #31\n"
+        "sqadd v16.4s, v16.4s, v18.4s\n"
+        "sqadd v17.4s, v17.4s, v19.4s\n"
+        "srshl v16.4s, v16.4s, v29.4s\n"
+        "srshl v17.4s, v17.4s, v29.4s\n"
+        "add v16.4s, v16.4s, v28.4s\n"
+        "add v17.4s, v17.4s, v28.4s\n"
+        "smax v16.4s, v16.4s, v30.4s\n"
+        "smax v17.4s, v17.4s, v30.4s\n"
+        "smin v16.4s, v16.4s, v31.4s\n"
+        "smin v17.4s, v17.4s, v31.4s\n"
+        "sqxtn v16.4h, v16.4s\n"
+        "sqxtn2 v16.8h, v17.4s\n"
+        "sqxtun v16.8b, v16.8h\n"
+        "st1 {v16.8b}, [%[output_ptr]]\n"
+        :
+        // Outputs.
+        [filter_ptr] "+r"(filter_ptr), [input_ptr] "+r"(input_ptr),
+        [output_ptr] "+r"(output_ptr), [bias_ptr] "+r"(bias_ptr)
+        :
+        // Inputs.
+        [params_ptr] "r"(params_ptr)
+        :
+        // Clobbers.
+        "cc", "memory",
+        // We use these NEON registers.
+        "v0", "v1", "v2", "v3", "v4", "v5", "v8", "v9", "v10", "v11", "v12",
+        "v13", "v16", "v17", "v18", "v19", "v25", "v26", "v27", "v28", "v29",
+        "v30", "v31",
+        // We use these general-purpose registers.
+        "x5", "x6", "x7", "x9", "x10", "x11", "x12", "x13", "x14", "x15");
+#undef DEPTHWISECONV_LABEL_DEPTH_8_LOOP
+#undef DEPTHWISECONV_LABEL_DEPTH_8_AFTER_LOOP
   }
 };
 
-template <int kFixedStrideWidth, int kFixedStrideHeight>
-struct ConvKernel3x3FilterDepth8<1, 1, kFixedStrideWidth, kFixedStrideHeight> {
-  static inline void Run(const uint8* input_ptr, int input_depth,
-                         int32 input_offset, int input_row_size,
-                         const uint8* filter_ptr, int32 filter_offset,
-                         const int32* bias_ptr, int32 output_offset,
-                         int32 output_multiplier, int output_shift,
-                         int32 output_activation_min,
-                         int32 output_activation_max, uint8* output_ptr,
-                         int output_depth, int output_width) {
-    Filter3x3x8 filter = Load3x3Filter(filter_ptr, filter_offset, output_depth);
-
-    int16x8_t input_0, input_1, input_2, input_3, input_4, input_5, input_6,
-        input_7, input_8;
-
-    uint8x8_t temp_0 = vld1_u8(input_ptr);
-    uint8x8_t temp_1 = vld1_u8(input_ptr + input_depth);
-    uint8x8_t temp_2 = vld1_u8(input_ptr + 2 * input_depth);
-
-    input_ptr += input_row_size;
-    uint8x8_t temp_3 = vld1_u8(input_ptr);
-    uint8x8_t temp_4 = vld1_u8(input_ptr + input_depth);
-    uint8x8_t temp_5 = vld1_u8(input_ptr + 2 * input_depth);
-
-    input_ptr += input_row_size;
-    uint8x8_t temp_6 = vld1_u8(input_ptr);
-    uint8x8_t temp_7 = vld1_u8(input_ptr + input_depth);
-    uint8x8_t temp_8 = vld1_u8(input_ptr + 2 * input_depth);
-
-    input_0 = vreinterpretq_s16_u16(vmovl_u8(temp_0));
-    input_1 = vreinterpretq_s16_u16(vmovl_u8(temp_1));
-    input_2 = vreinterpretq_s16_u16(vmovl_u8(temp_2));
-    input_3 = vreinterpretq_s16_u16(vmovl_u8(temp_3));
-    input_4 = vreinterpretq_s16_u16(vmovl_u8(temp_4));
-    input_5 = vreinterpretq_s16_u16(vmovl_u8(temp_5));
-    input_6 = vreinterpretq_s16_u16(vmovl_u8(temp_6));
-    input_7 = vreinterpretq_s16_u16(vmovl_u8(temp_7));
-    input_8 = vreinterpretq_s16_u16(vmovl_u8(temp_8));
-
-    const int16x8_t input_offset_vec = vdupq_n_s16(input_offset);
-    input_0 = vaddq_s16(input_0, input_offset_vec);
-    input_1 = vaddq_s16(input_1, input_offset_vec);
-    input_2 = vaddq_s16(input_2, input_offset_vec);
-    input_3 = vaddq_s16(input_3, input_offset_vec);
-    input_4 = vaddq_s16(input_4, input_offset_vec);
-    input_5 = vaddq_s16(input_5, input_offset_vec);
-    input_6 = vaddq_s16(input_6, input_offset_vec);
-    input_7 = vaddq_s16(input_7, input_offset_vec);
-    input_8 = vaddq_s16(input_8, input_offset_vec);
-
-    DotProductAndStore(
-        filter, input_0, input_1, input_2, input_3, input_4, input_5, input_6,
-        input_7, input_8, bias_ptr, output_offset, output_multiplier,
-        output_shift, output_activation_min, output_activation_max, output_ptr);
-  }
-};
-
-inline void ShuffleInput(const uint8* input_ptr, int input_depth,
-                         int input_width, int input_height, int output_depth,
-                         int output_width, int output_height,
-                         uint8* output_ptr) {
-  const int input_row_size = input_depth * input_width;
-
-  for (int y = 0; y < output_height; y++) {
+#undef OFFSET_INPUT_DEPTH
+#undef OFFSET_INPUT_ROW_SIZE
+#undef OFFSET_OUTPUT_DEPTH
+#undef OFFSET_OUTPUT_ROW_SIZE
+#undef OFFSET_INPUT_OFFSET
+#undef OFFSET_OUTPUT_OFFSET
+#undef OFFSET_FILTER_OFFSET
+#undef OFFSET_OUTPUT_MULTIPLIER
+#undef OFFSET_OUTPUT_ACTIVATION_MIN
+#undef OFFSET_OUTPUT_ACTIVATION_MAX
+#undef OFFSET_OUTPUT_SHIFT
+#undef OFFSET_INPUT_WIDTH
+#undef OFFSET_INPUT_HEIGHT
+#undef OFFSET_OUTPUT_WIDTH
+#undef OFFSET_OUTPUT_HEIGHT
+#undef STR
+#undef STR_UNEXPANDED
+
+// Copies a subset of the input designated by |input_ptr| into |output_ptr|
+// with the specified output dimensions. Supports output depths of 64 only as
+// this is the cache line size.
+inline void ShuffleInput(const uint8* input_ptr, int64_t input_depth,
+                         int32 input_width, int32 input_height,
+                         int64_t output_depth, int32 output_width,
+                         int32 output_height, uint8* output_ptr) {
+  const int64_t input_row_size = input_depth * input_width;
+  for (int32 y = 0; y < output_height; y++) {
     const uint8* ptr = input_ptr;
-    for (int x = 0; x < output_width; x++) {
+    for (int32 x = 0; x < output_width; x++) {
       memcpy(output_ptr, ptr, output_depth);
       output_ptr += output_depth;
       ptr += input_depth;
@@ -3873,561 +2937,262 @@ inline void ShuffleInput(const uint8* input_ptr, int input_depth,
   }
 }
 
-template <int kFixedHeight, int kFixedStrideWidth, int kFixedStrideHeight>
-struct ConvRow3x3FilterDepth8 {};
-
-template <int kFixedStrideWidth, int kFixedStrideHeight>
-struct ConvRow3x3FilterDepth8<1, kFixedStrideWidth, kFixedStrideHeight> {
-  static inline void Run(const uint8* input_data, int start_x, int start_y,
-                         int input_depth, int input_width, int input_height,
-                         int input_row_size, int32 input_offset,
-                         const uint8* filter_data, int32 filter_offset,
-                         const int32* bias_data, int32 output_offset,
-                         int32 output_multiplier, int output_shift,
-                         int32 output_activation_min,
-                         int32 output_activation_max, uint8* output_data,
-                         int output_depth, int output_width,
-                         uint8* shuffle_workspace) {
-    int out_x = start_x;
-
-    // 1x4 at a time.
-    for (; out_x <= output_width - 4; out_x += 4) {
-      const int32* bias_ptr = bias_data;
-      const uint8* filter_ptr = filter_data;
-
-      const uint8* input_ptr = input_data;
-      uint8* output_ptr = output_data;
-
-      for (int depth = 0; depth <= output_depth - 8; depth += 8) {
-        ConvKernel3x3FilterDepth8<1, 4, kFixedStrideWidth, kFixedStrideHeight>::
-            Run(input_ptr, input_depth, input_offset, input_row_size,
-                filter_ptr, filter_offset, bias_ptr, output_offset,
-                output_multiplier, output_shift, output_activation_min,
-                output_activation_max, output_ptr, output_depth, output_width);
-
-        input_ptr += 8;
-        output_ptr += 8;
-        filter_ptr += 8;
-        bias_ptr += 8;
-      }
-
-      input_data += 4 * kFixedStrideWidth * input_depth;
-      output_data += 4 * output_depth;
-    }
-
-    // 1x1 at a time.
-    for (; out_x < output_width; out_x++) {
-      const int32* bias_ptr = bias_data;
-      const uint8* filter_ptr = filter_data;
-
-      const uint8* input_ptr = input_data;
-      uint8* output_ptr = output_data;
-
-      for (int depth = 0; depth <= output_depth - 8; depth += 8) {
-        ConvKernel3x3FilterDepth8<1, 1, kFixedStrideWidth, kFixedStrideHeight>::
-            Run(input_ptr, input_depth, input_offset, input_row_size,
-                filter_ptr, filter_offset, bias_ptr, output_offset,
-                output_multiplier, output_shift, output_activation_min,
-                output_activation_max, output_ptr, output_depth, output_width);
-
-        input_ptr += 8;
-        output_ptr += 8;
-        filter_ptr += 8;
-        bias_ptr += 8;
-      }
+// Calculates the input size depending on stride and output.
+inline int32 get_shuffle_input_size(int32 stride, int32 output) {
+  return stride * (output - 1) + 3;
+}
 
-      input_data += kFixedStrideWidth * input_depth;
-      output_data += output_depth;
-    }
+// Indicates the input and output dimensions used when shuffling input
+// activations.
+struct ShuffleParams {
+  int32 output_width;
+  int32 output_height;
+  int32 input_width;
+  int32 input_height;
+
+  ShuffleParams() = default;
+  ShuffleParams(int32 output_width, int32 output_height, int32 stride_width,
+                int32 stride_height)
+  : output_width(output_width)
+  , output_height(output_height)
+  , input_width(get_shuffle_input_size(stride_width, output_width))
+  , input_height(get_shuffle_input_size(stride_height, output_height)) {
   }
 };
 
-template <int kFixedStrideWidth, int kFixedStrideHeight>
-struct ConvRow3x3FilterDepth8<2, kFixedStrideWidth, kFixedStrideHeight> {
-  static inline void Run(const uint8* input_data, int start_x, int start_y,
-                         int input_depth, int input_width, int input_height,
-                         int input_row_size, int32 input_offset,
-                         const uint8* filter_data, int32 filter_offset,
-                         const int32* bias_data, int32 output_offset,
-                         int32 output_multiplier, int output_shift,
-                         int32 output_activation_min,
-                         int32 output_activation_max, uint8* output_data,
-                         int output_depth, int output_width,
-                         uint8* shuffle_workspace) {
-    int out_x = start_x;
-
-    // 2x4 at a time.
-    for (; out_x <= output_width - 4; out_x += 4) {
-      const int32* bias_ptr = bias_data;
-      const uint8* filter_ptr = filter_data;
-
-      const uint8* input_ptr = input_data;
-      uint8* output_ptr = output_data;
-
-      for (int depth = 0; depth <= output_depth - 8; depth += 8) {
-        ConvKernel3x3FilterDepth8<2, 4, kFixedStrideWidth, kFixedStrideHeight>::
-            Run(input_ptr, input_depth, input_offset, input_row_size,
-                filter_ptr, filter_offset, bias_ptr, output_offset,
-                output_multiplier, output_shift, output_activation_min,
-                output_activation_max, output_ptr, output_depth, output_width);
-
-        input_ptr += 8;
-        output_ptr += 8;
-        filter_ptr += 8;
-        bias_ptr += 8;
-      }
-
-      input_data += 4 * kFixedStrideWidth * input_depth;
-      output_data += 4 * output_depth;
-    }
-
-    // 2x2 at a time.
-    for (; out_x <= output_width - 2; out_x += 2) {
-      const int32* bias_ptr = bias_data;
-      const uint8* filter_ptr = filter_data;
-
-      const uint8* input_ptr = input_data;
-      uint8* output_ptr = output_data;
-
-      for (int depth = 0; depth <= output_depth - 8; depth += 8) {
-        ConvKernel3x3FilterDepth8<2, 2, kFixedStrideWidth, kFixedStrideHeight>::
-            Run(input_ptr, input_depth, input_offset, input_row_size,
-                filter_ptr, filter_offset, bias_ptr, output_offset,
-                output_multiplier, output_shift, output_activation_min,
-                output_activation_max, output_ptr, output_depth, output_width);
-
-        input_ptr += 8;
-        output_ptr += 8;
-        filter_ptr += 8;
-        bias_ptr += 8;
-      }
-
-      input_data += 2 * kFixedStrideWidth * input_depth;
-      output_data += 2 * output_depth;
-    }
-
-    // 2x1 at a time.
-    for (; out_x < output_width; out_x++) {
-      const int32* bias_ptr = bias_data;
-      const uint8* filter_ptr = filter_data;
-
-      const uint8* input_ptr = input_data;
-      uint8* output_ptr = output_data;
-
-      for (int depth = 0; depth <= output_depth - 8; depth += 8) {
-        ConvKernel3x3FilterDepth8<2, 1, kFixedStrideWidth, kFixedStrideHeight>::
-            Run(input_ptr, input_depth, input_offset, input_row_size,
-                filter_ptr, filter_offset, bias_ptr, output_offset,
-                output_multiplier, output_shift, output_activation_min,
-                output_activation_max, output_ptr, output_depth, output_width);
-
-        input_ptr += 8;
-        output_ptr += 8;
-        filter_ptr += 8;
-        bias_ptr += 8;
-      }
-
-      input_data += kFixedStrideWidth * input_depth;
-      output_data += output_depth;
+template <int32 kStrideWidth, int32 kStrideHeight>
+struct DepthwiseConvThroughDepth {
+  // Runs the DepthwiseConvWindow kernels through the depth dimension from
+  // |start_depth| to |end_depth|. Keep this not inlined to maintain a small
+  // binary size. We use a DepthwiseConvParams struct for read only params
+  // to minimize call overhead.
+  static __attribute__((noinline)) void Run(const uint8* input_ptr,
+      const uint8* filter_ptr, const int32* bias_ptr, uint8* output_ptr,
+      int64_t start_depth, int64_t end_depth, int64_t input_depth,
+      int64_t input_row_size, int32 output_window_height,
+      int32 output_window_width, const DepthwiseConvParams& params) {
+    for (; start_depth <= end_depth - 8; start_depth += 8) {
+      DepthwiseConvWindow<8, kStrideWidth, kStrideHeight>::Run(
+          input_ptr, filter_ptr, bias_ptr, output_ptr, input_depth,
+          input_row_size, output_window_height, output_window_width, &params);
+      input_ptr += 8;
+      output_ptr += 8;
+      filter_ptr += 8;
+      bias_ptr += 8;
     }
   }
 };
 
-template <>
-struct ConvRow3x3FilterDepth8<4, 1, 1> {
-  static inline void Run(const uint8* input_data, int start_x, int start_y,
-                         int input_depth, int input_width, int input_height,
-                         int input_row_size, int32 input_offset,
-                         const uint8* filter_data, int32 filter_offset,
-                         const int32* bias_data, int32 output_offset,
-                         int32 output_multiplier, int output_shift,
-                         int32 output_activation_min,
-                         int32 output_activation_max, uint8* output_data,
-                         int output_depth, int output_width,
-                         uint8* shuffle_workspace) {
-    int out_x = start_x;
-
-    // 4x4 at a time.
-    for (; out_x <= output_width - 4; out_x += 4) {
-      const int32* bias_ptr = bias_data;
-      const uint8* filter_ptr = filter_data;
-
-      const uint8* input_ptr = input_data;
-      uint8* output_ptr = output_data;
-
-      for (int depth = 0; depth <= output_depth - 8; depth += 8) {
-        ConvKernel3x3FilterDepth8<4, 4, 1, 1>::Run(
-            input_ptr, input_depth, input_offset, input_row_size, filter_ptr,
-            filter_offset, bias_ptr, output_offset, output_multiplier,
-            output_shift, output_activation_min, output_activation_max,
-            output_ptr, output_depth, output_width);
-
-        input_ptr += 8;
-        output_ptr += 8;
-        filter_ptr += 8;
-        bias_ptr += 8;
-      }
+template <int32 kStrideWidth, int32 kStrideHeight>
+struct DepthwiseConvMultiRow {
+  using ConvKernel = DepthwiseConvThroughDepth<kStrideWidth, kStrideHeight>;
 
-      input_data += 4 * input_depth;
-      output_data += 4 * output_depth;
-    }
-
-    // Handle the rest of the right side.
-    // 4x2 at a time.
-    for (; out_x <= output_width - 2; out_x += 2) {
-      const int32* bias_ptr = bias_data;
-      const uint8* filter_ptr = filter_data;
-
-      const uint8* input_ptr = input_data;
-      uint8* output_ptr = output_data;
-
-      for (int depth = 0; depth <= output_depth - 8; depth += 8) {
-        ConvKernel3x3FilterDepth8<4, 2, 1, 1>::Run(
-            input_ptr, input_depth, input_offset, input_row_size, filter_ptr,
-            filter_offset, bias_ptr, output_offset, output_multiplier,
-            output_shift, output_activation_min, output_activation_max,
-            output_ptr, output_depth, output_width);
-
-        input_ptr += 8;
-        output_ptr += 8;
-        filter_ptr += 8;
-        bias_ptr += 8;
-      }
-
-      input_data += 2 * input_depth;
-      output_data += 2 * output_depth;
-    }
-
-    // 4x1 at a time.
-    for (; out_x < output_width; out_x++) {
-      const int32* bias_ptr = bias_data;
-      const uint8* filter_ptr = filter_data;
-
-      const uint8* input_ptr = input_data;
-      uint8* output_ptr = output_data;
-
-      for (int depth = 0; depth <= output_depth - 8; depth += 8) {
-        ConvKernel3x3FilterDepth8<4, 1, 1, 1>::Run(
-            input_ptr, input_depth, input_offset, input_row_size, filter_ptr,
-            filter_offset, bias_ptr, output_offset, output_multiplier,
-            output_shift, output_activation_min, output_activation_max,
-            output_ptr, output_depth, output_width);
-
-        input_ptr += 8;
-        output_ptr += 8;
-        filter_ptr += 8;
-        bias_ptr += 8;
-      }
-
-      input_data += input_depth;
-      output_data += output_depth;
-    }
-  }
-};
-
-template <>
-struct ConvRow3x3FilterDepth8<4, 2, 2> {
-  // The buffer size of the shuffled input.
-  static inline constexpr int ShuffleWorkspaceSize() { return 64 * 9 * 9; }
-
-  static inline void Run(const uint8* input_data, int start_x, int start_y,
-                         int input_depth, int input_width, int input_height,
-                         int input_row_size, int32 input_offset,
-                         const uint8* filter_data, int32 filter_offset,
-                         const int32* bias_data, int32 output_offset,
-                         int32 output_multiplier, int output_shift,
-                         int32 output_activation_min,
-                         int32 output_activation_max, uint8* output_data,
-                         int output_depth, int output_width,
+  static inline void Run(const uint8* input_data, int32 start_x, int32 end_x,
+                         const uint8* filter_data, const int32* bias_data,
+                         uint8* output_data, const DepthwiseConvParams& params,
+                         const ShuffleParams& shuffle_params,
                          uint8* shuffle_workspace) {
-    // Branch and cache misses increase substantially with stride 2 kernels.
-    // Adding prefetching reduces latency by as much as 2x.
-    const int i0 = 0;
-    const int i1 = input_depth;
-    const int i2 = 2 * input_depth;
-    const int i3 = 3 * input_depth;
-    const int i4 = 4 * input_depth;
-    const int i5 = 5 * input_depth;
-    const int i6 = 6 * input_depth;
-    const int i7 = 7 * input_depth;
-    const int i8 = 8 * input_depth;
-
-#define DEPTHWISECONV_PRELOAD_ROW(input_ptr, i)         \
-  preload_l1_keep(input_ptr + i * input_row_size + i0); \
-  preload_l1_keep(input_ptr + i * input_row_size + i1); \
-  preload_l1_keep(input_ptr + i * input_row_size + i2); \
-  preload_l1_keep(input_ptr + i * input_row_size + i3); \
-  preload_l1_keep(input_ptr + i * input_row_size + i4); \
-  preload_l1_keep(input_ptr + i * input_row_size + i5); \
-  preload_l1_keep(input_ptr + i * input_row_size + i6); \
-  preload_l1_keep(input_ptr + i * input_row_size + i7); \
-  preload_l1_keep(input_ptr + i * input_row_size + i8);
-
-    int out_x = start_x;
-    // 4x4 at a time.
-    for (; out_x <= output_width - 4; out_x += 4) {
-      const int32* bias_ptr = bias_data;
-      const uint8* filter_ptr = filter_data;
-
-      const uint8* input_ptr = input_data;
-      uint8* output_ptr = output_data;
-
-      int depth = 0;
-      for (; depth <= output_depth - 64; depth += 64) {
-        // Preload 9x9 input.
-        DEPTHWISECONV_PRELOAD_ROW(input_ptr, 0);
-        DEPTHWISECONV_PRELOAD_ROW(input_ptr, 1);
-        DEPTHWISECONV_PRELOAD_ROW(input_ptr, 2);
-        DEPTHWISECONV_PRELOAD_ROW(input_ptr, 3);
-        DEPTHWISECONV_PRELOAD_ROW(input_ptr, 4);
-        DEPTHWISECONV_PRELOAD_ROW(input_ptr, 5);
-        DEPTHWISECONV_PRELOAD_ROW(input_ptr, 6);
-        DEPTHWISECONV_PRELOAD_ROW(input_ptr, 7);
-        DEPTHWISECONV_PRELOAD_ROW(input_ptr, 8);
-
-        // For a large input window (64x9x9) that is small enough to fit in L1
-        // cache, copy the input into a separate buffer and run the kernel on
-        // this new buffer. This reduces the likelihood of cache misses when
-        // the kernel is loading input data. If this size is ever changed,
-        // update the ShuffleWorkspaceSize() function to return the new size.
-        ShuffleInput(input_ptr, input_depth, input_width, input_height, 64, 9,
-                     9, shuffle_workspace);
-        const uint8* shuffled_ptr = &shuffle_workspace[0];
-
-        for (int micro_depth = 0; micro_depth <= 64 - 8; micro_depth += 8) {
-          ConvKernel3x3FilterDepth8<4, 4, 2, 2>::Run(
-              shuffled_ptr, 64, input_offset, 64 * 9, filter_ptr, filter_offset,
-              bias_ptr, output_offset, output_multiplier, output_shift,
-              output_activation_min, output_activation_max, output_ptr,
-              output_depth, output_width);
-
-          shuffled_ptr += 8;
-          output_ptr += 8;
-          filter_ptr += 8;
-          bias_ptr += 8;
+    TFLITE_DCHECK(shuffle_params.input_height ==
+        get_shuffle_input_size(kStrideHeight, shuffle_params.output_height));
+    TFLITE_DCHECK(shuffle_params.input_width ==
+        get_shuffle_input_size(kStrideWidth, shuffle_params.output_width));
+    TFLITE_DCHECK(64 * shuffle_params.input_width * shuffle_params.input_height
+                  <= DEPTHWISECONV_SHUFFLE_WORKSPACE_SIZE);
+
+    int32 out_x = start_x;
+
+    // Run shuffling on inputs with sufficiently large depth and width. When
+    // these parameters are large enough, more time is taken to load inputs
+    // from memory. At this point, it becomes useful to prefetch and
+    // preshuffle the input data to maximize locality.
+    if (params.output_depth > 64 ||
+        (params.output_depth <= 64 && params.input_width > 150)) {
+      for (; out_x <= (end_x - shuffle_params.output_width);
+             out_x += shuffle_params.output_width) {
+        const uint8* input_ptr = input_data;
+        const int32* bias_ptr = bias_data;
+        const uint8* filter_ptr = filter_data;
+        uint8* output_ptr = output_data;
+        int64_t depth = 0;
+        const int64_t shuffle_row_size = 64 * shuffle_params.input_width;
+
+        for (; depth <= params.output_depth - 64; depth += 64) {
+          // Preload.
+          const uint8* h_ptr = input_ptr;
+          for (int32 i = 0; i < shuffle_params.input_height; i++) {
+            const uint8* ptr = h_ptr;
+            for (int32 j = 0; j < shuffle_params.input_width; j++) {
+              asm volatile("prfm pldl1keep, [%[ptr]]\n" ::[ptr] "r"(ptr) :);
+              ptr += params.input_depth;
+            }
+            h_ptr += params.input_row_size;
+          }
+
+          // For a large enough input, shuffle into buckets.
+          ShuffleInput(input_ptr, params.input_depth, params.input_width,
+                       params.input_height, 64, shuffle_params.input_width,
+                       shuffle_params.input_height, shuffle_workspace);
+          ConvKernel::Run(shuffle_workspace, filter_ptr, bias_ptr, output_ptr,
+                          0, 64, 64, shuffle_row_size,
+                          shuffle_params.output_height,
+                          shuffle_params.output_width, params);
+          input_ptr += 64;
+          output_ptr += 64;
+          filter_ptr += 64;
+          bias_ptr += 64;
         }
-        input_ptr += 64;
-      }
-
-      // Preload 9x9 input one more time for the rest of the depth.
-      DEPTHWISECONV_PRELOAD_ROW(input_ptr, 0);
-      DEPTHWISECONV_PRELOAD_ROW(input_ptr, 1);
-      DEPTHWISECONV_PRELOAD_ROW(input_ptr, 2);
-      DEPTHWISECONV_PRELOAD_ROW(input_ptr, 3);
-      DEPTHWISECONV_PRELOAD_ROW(input_ptr, 4);
-      DEPTHWISECONV_PRELOAD_ROW(input_ptr, 5);
-      DEPTHWISECONV_PRELOAD_ROW(input_ptr, 6);
-      DEPTHWISECONV_PRELOAD_ROW(input_ptr, 7);
-      DEPTHWISECONV_PRELOAD_ROW(input_ptr, 8);
-
-      for (; depth <= output_depth - 8; depth += 8) {
-        ConvKernel3x3FilterDepth8<4, 4, 2, 2>::Run(
-            input_ptr, input_depth, input_offset, input_row_size, filter_ptr,
-            filter_offset, bias_ptr, output_offset, output_multiplier,
-            output_shift, output_activation_min, output_activation_max,
-            output_ptr, output_depth, output_width);
-
-        input_ptr += 8;
-        output_ptr += 8;
-        filter_ptr += 8;
-        bias_ptr += 8;
-      }
-
-      input_data += 4 * 2 * input_depth;
-      output_data += 4 * output_depth;
-    }
-
-#undef DEPTHWISECONV_PRELOAD_ROW
-
-    // Handle the rest of the right side.
-    // 4x2 at a time.
-    for (; out_x <= output_width - 2; out_x += 2) {
-      const int32* bias_ptr = bias_data;
-      const uint8* filter_ptr = filter_data;
 
-      const uint8* input_ptr = input_data;
-      uint8* output_ptr = output_data;
+        // Preload.
+        const uint8* h_ptr = input_ptr;
+        for (int32 i = 0; i < shuffle_params.input_height; i++) {
+          const uint8* ptr = h_ptr;
+          for (int32 j = 0; j < shuffle_params.input_width; j++) {
+            asm volatile("prfm pldl1keep, [%[ptr]]\n" ::[ptr] "r"(ptr) :);
+            ptr += params.input_depth;
+          }
+          h_ptr += params.input_row_size;
+        }
 
-      for (int depth = 0; depth <= output_depth - 8; depth += 8) {
-        ConvKernel3x3FilterDepth8<4, 2, 2, 2>::Run(
-            input_ptr, input_depth, input_offset, input_row_size, filter_ptr,
-            filter_offset, bias_ptr, output_offset, output_multiplier,
-            output_shift, output_activation_min, output_activation_max,
-            output_ptr, output_depth, output_width);
+        // Handle leftover depth.
+        ConvKernel::Run(input_ptr, filter_ptr, bias_ptr, output_ptr,
+                        depth, params.output_depth, params.input_depth,
+                        params.input_row_size, shuffle_params.output_height,
+                        shuffle_params.output_width, params);
 
-        input_ptr += 8;
-        output_ptr += 8;
-        filter_ptr += 8;
-        bias_ptr += 8;
+        input_data +=
+            shuffle_params.output_width * kStrideWidth * params.input_depth;
+        output_data += shuffle_params.output_width * params.output_depth;
       }
-
-      input_data += 2 * 2 * input_depth;
-      output_data += 2 * output_depth;
     }
 
-    // 4x1 at a time.
-    for (; out_x < output_width; out_x++) {
-      const int32* bias_ptr = bias_data;
-      const uint8* filter_ptr = filter_data;
-
-      const uint8* input_ptr = input_data;
-      uint8* output_ptr = output_data;
-
-      for (int depth = 0; depth <= output_depth - 8; depth += 8) {
-        ConvKernel3x3FilterDepth8<4, 1, 2, 2>::Run(
-            input_ptr, input_depth, input_offset, input_row_size, filter_ptr,
-            filter_offset, bias_ptr, output_offset, output_multiplier,
-            output_shift, output_activation_min, output_activation_max,
-            output_ptr, output_depth, output_width);
-
-        input_ptr += 8;
-        output_ptr += 8;
-        filter_ptr += 8;
-        bias_ptr += 8;
-      }
-
-      input_data += 2 * input_depth;
-      output_data += output_depth;
+    const int32 output_leftover_width = end_x - out_x;
+    if (output_leftover_width > 0) {
+      ConvKernel::Run(input_data, filter_data, bias_data, output_data, 0,
+                      params.output_depth, params.input_depth,
+                      params.input_row_size, shuffle_params.output_height,
+                      output_leftover_width, params);
     }
   }
 };
 
-template <>
-struct ConvRow3x3FilterDepth8<8, 2, 2> {
-  static inline void Run(const uint8* input_data, int start_x, int start_y,
-                         int input_depth, int input_width, int input_height,
-                         int input_row_size, int32 input_offset,
-                         const uint8* filter_data, int32 filter_offset,
-                         const int32* bias_data, int32 output_offset,
-                         int32 output_multiplier, int output_shift,
-                         int32 output_activation_min,
-                         int32 output_activation_max, uint8* output_data,
-                         int output_depth, int output_width,
-                         uint8* shuffle_workspace) {
-    // Reuse 4 row kernels twice.
-    ConvRow3x3FilterDepth8<4, 2, 2>::Run(
-        input_data, start_x, start_y, input_depth, input_width, input_height,
-        input_row_size, input_offset, filter_data, filter_offset, bias_data,
-        output_offset, output_multiplier, output_shift, output_activation_min,
-        output_activation_max, output_data, output_depth, output_width,
-        shuffle_workspace);
-
-    ConvRow3x3FilterDepth8<4, 2, 2>::Run(
-        input_data + 2 * 4 * input_row_size, start_x, start_y + 4, input_depth,
-        input_width, input_height, input_row_size, input_offset, filter_data,
-        filter_offset, bias_data, output_offset, output_multiplier,
-        output_shift, output_activation_min, output_activation_max,
-        output_data + 4 * output_depth * output_width, output_depth,
-        output_width, shuffle_workspace);
+// Processes the borders of the input for pad_width and pad_height = 1.
+// Calls 4 asm kernels:
+//   * 1x1 input shape.
+//   * Corner edges.
+//   * Horizontal edges.
+//   * Vertical edges.
+inline void DepthwiseConvHandlePadding(const uint8* input_data,
+    const uint8* filter_data, const int32* bias_data, uint8* output_data,
+    const DepthwiseConvParams& params) {
+  if (params.input_width == 1 && params.input_height == 1) {
+    const uint8* filter_ptr = filter_data + params.filter_row_size
+        + params.output_depth;
+    DepthwiseConvPartial<EdgeType::kCenter, 1, 1>::Run(input_data, filter_ptr,
+        bias_data, output_data, &params);
+    return;
   }
-};
 
-template <>
-struct ConvRow3x3FilterDepth8<8, 1, 1> {
-  // The buffer size of the shuffled input.
-  static inline constexpr int ShuffleWorkspaceSize() { return 64 * 10 * 10; }
-
-  static inline void Run(const uint8* input_data, int start_x, int start_y,
-                         int input_depth, int input_width, int input_height,
-                         int input_row_size, int32 input_offset,
-                         const uint8* filter_data, int32 filter_offset,
-                         const int32* bias_data, int32 output_offset,
-                         int32 output_multiplier, int output_shift,
-                         int32 output_activation_min,
-                         int32 output_activation_max, uint8* output_data,
-                         int output_depth, int output_width,
-                         uint8* shuffle_workspace) {
-    int out_x = start_x;
-    // 8x8 at a time.
-    for (; out_x <= output_width - 8; out_x += 8) {
-      const int32* bias_ptr = bias_data;
-      const uint8* filter_ptr = filter_data;
-
-      const uint8* input_ptr = input_data;
-      uint8* output_ptr = output_data;
-
-      int depth = 0;
-      for (; depth <= output_depth - 64; depth += 64) {
-        // For a large input window (64x10x10) that is small enough to fit in L1
-        // cache, copy the input into a separate buffer and run the kernel on
-        // this new buffer. This reduces the likelihood of cache misses when
-        // the kernel is loading input data. If the size of the input window
-        // changes, update the function ShuffleWorkspaceSize() with the new
-        // size.
-        ShuffleInput(input_ptr, input_depth, input_width, input_height, 64, 10,
-                     10, shuffle_workspace);
-        const uint8* shuffled_ptr = shuffle_workspace;
-
-        for (int micro_depth = 0; micro_depth <= 64 - 8; micro_depth += 8) {
-          ConvKernel3x3FilterDepth8<8, 8, 1, 1>::Run(
-              shuffled_ptr, 64, input_offset, 64 * 10, filter_ptr,
-              filter_offset, bias_ptr, output_offset, output_multiplier,
-              output_shift, output_activation_min, output_activation_max,
-              output_ptr, output_depth, output_width);
-
-          shuffled_ptr += 8;
-          output_ptr += 8;
-          filter_ptr += 8;
-          bias_ptr += 8;
-        }
-        input_ptr += 64;
-      }
+  const int32 out_x_start_corner = 0;
+  const int32 out_x_end_corner = params.output_width - 1;
+  const int32 out_y_start_corner = 0;
+  const int32 out_y_end_corner = params.output_height - 1;
+
+  // Handle top row.
+  const uint8* input_ptr = input_data;
+  const uint8* filter_ptr = filter_data + params.filter_row_size
+      + params.output_depth;
+  uint8* output_ptr = output_data;
+
+  DepthwiseConvPartial<EdgeType::kCorner, 1, 1>::Run(input_ptr, filter_ptr,
+      bias_data, output_ptr, &params);
+
+  input_ptr += (params.stride_width - 1) * params.input_depth;
+  filter_ptr = filter_data + params.filter_row_size;
+  output_ptr += params.output_depth;
+
+  for (int32 out_x = out_x_start_corner + 1; out_x < out_x_end_corner;
+           out_x++) {
+    DepthwiseConvPartial<EdgeType::kHorizontal, 1, 1>::Run(
+        input_ptr, filter_ptr, bias_data, output_ptr, &params);
+    input_ptr += params.stride_width * params.input_depth;
+    output_ptr += params.output_depth;
+  }
 
-      for (; depth <= output_depth - 8; depth += 8) {
-        ConvKernel3x3FilterDepth8<8, 8, 1, 1>::Run(
-            input_ptr, input_depth, input_offset, input_row_size, filter_ptr,
-            filter_offset, bias_ptr, output_offset, output_multiplier,
-            output_shift, output_activation_min, output_activation_max,
-            output_ptr, output_depth, output_width);
-
-        input_ptr += 8;
-        output_ptr += 8;
-        filter_ptr += 8;
-        bias_ptr += 8;
-      }
+  DepthwiseConvPartial<EdgeType::kCorner, 1, 1>::Run(input_ptr, filter_ptr,
+      bias_data, output_ptr, &params);
 
-      input_data += 8 * input_depth;
-      output_data += 8 * output_depth;
-    }
+  // Handle left side.
+  input_ptr = input_data + (params.stride_width - 1) * params.input_row_size;
+  filter_ptr = filter_data + params.input_depth;
+  output_ptr = output_data + params.output_row_size;
 
-    // Handle the rest of the right side by re-using 4 row kernels twice.
-    ConvRow3x3FilterDepth8<4, 1, 1>::Run(
-        input_data, out_x, start_y, input_depth, input_width, input_height,
-        input_row_size, input_offset, filter_data, filter_offset, bias_data,
-        output_offset, output_multiplier, output_shift, output_activation_min,
-        output_activation_max, output_data, output_depth, output_width,
-        shuffle_workspace);
-
-    ConvRow3x3FilterDepth8<4, 1, 1>::Run(
-        input_data + 4 * input_row_size, out_x, start_y + 4, input_depth,
-        input_width, input_height, input_row_size, input_offset, filter_data,
-        filter_offset, bias_data, output_offset, output_multiplier,
-        output_shift, output_activation_min, output_activation_max,
-        output_data + 4 * output_depth * output_width, output_depth,
-        output_width, shuffle_workspace);
+  for (int32 out_y = out_y_start_corner + 1; out_y < out_y_end_corner;
+           out_y++) {
+    DepthwiseConvPartial<EdgeType::kVertical, 1, 1>::Run(
+        input_ptr, filter_ptr, bias_data, output_ptr, &params);
+    input_ptr += params.stride_width * params.input_row_size;
+    output_ptr += params.output_row_size;
   }
-};
 
-inline bool Fast3x3FilterKernelSupported(const Dims<4>& input_dims,
-                                         const Dims<4>& filter_dims,
-                                         int stride_width, int stride_height,
-                                         int pad_width, int pad_height,
-                                         int depth_multiplier,
-                                         const Dims<4>& output_dims) {
-  const int input_height = ArraySize(input_dims, 2);
-  const int input_width = ArraySize(input_dims, 1);
-  const int input_depth = ArraySize(input_dims, 0);
-  const int filter_height = ArraySize(filter_dims, 2);
-  const int filter_width = ArraySize(filter_dims, 1);
-  const int output_height = ArraySize(output_dims, 2);
-  const int output_width = ArraySize(output_dims, 1);
-
-  bool supported = filter_width == 3 && filter_height == 3 &&
-                   depth_multiplier == 1 &&
-                   (stride_width == 1 || stride_width == 2) &&
-                   (stride_height == 1 || stride_height == 2) &&
-                   (stride_width == stride_height) && pad_width == 0 &&
-                   pad_height == 0 && (input_depth % 8) == 0;
+  // Handle right side.
+  input_ptr = input_data + (params.input_width - 2) * params.input_depth
+      + (params.stride_width - 1) * params.input_row_size;
+  filter_ptr = filter_data;
+  output_ptr = output_data + params.output_row_size +
+      (params.output_width - 1) * params.output_depth;
+
+  for (int32 out_y = out_y_start_corner + 1; out_y < out_y_end_corner;
+         out_y++) {
+    DepthwiseConvPartial<EdgeType::kVertical, 1, 1>::Run(
+        input_ptr, filter_ptr, bias_data, output_ptr, &params);
+    input_ptr += params.stride_width * params.input_row_size;
+    output_ptr += params.output_row_size;
+  }
+
+  // Handle bottom row.
+  input_ptr = input_data + (params.input_height - 2) * params.input_row_size;
+  filter_ptr = filter_data + params.output_depth;
+  output_ptr = output_data +
+      (params.output_height - 1) * params.output_row_size;
+
+  DepthwiseConvPartial<EdgeType::kCorner, 1, 1>::Run(input_ptr, filter_ptr,
+      bias_data, output_ptr, &params);
+
+  input_ptr += (params.stride_width == 1) ? 0 : params.input_depth;
+  filter_ptr = filter_data;
+  output_ptr += params.output_depth;
+
+  for (int32 out_x = out_x_start_corner + 1; out_x < out_x_end_corner;
+           out_x++) {
+    DepthwiseConvPartial<EdgeType::kHorizontal, 1, 1>::Run(
+        input_ptr, filter_ptr, bias_data, output_ptr, &params);
+    input_ptr += params.stride_width * params.input_depth;
+    output_ptr += params.output_depth;
+  }
+
+  DepthwiseConvPartial<EdgeType::kCorner, 1, 1>::Run(input_ptr, filter_ptr,
+      bias_data, output_ptr, &params);
+}
+
+inline bool Fast3x3FilterKernelSupported(
+    const Dims<4>& input_dims, const Dims<4>& filter_dims, int32 stride_width,
+    int32 stride_height, int32 pad_width, int32 pad_height,
+    int32 depth_multiplier, const Dims<4>& output_dims, int32 output_shift) {
+  const int32 input_height = ArraySize(input_dims, 2);
+  const int32 input_width = ArraySize(input_dims, 1);
+  const int32 input_depth = ArraySize(input_dims, 0);
+  const int32 filter_height = ArraySize(filter_dims, 2);
+  const int32 filter_width = ArraySize(filter_dims, 1);
+  const int32 output_height = ArraySize(output_dims, 2);
+  const int32 output_width = ArraySize(output_dims, 1);
+
+  bool supported =
+      filter_width == 3 && filter_height == 3 && depth_multiplier == 1 &&
+      (stride_width == 1 || stride_width == 2) &&
+      (stride_height == 1 || stride_height == 2) &&
+      (stride_width == stride_height) && (pad_width == 0 || pad_width == 1) &&
+      (pad_height == 0 || pad_height == 1) && (pad_width == pad_height) &&
+      (input_depth % 8) == 0 && (output_shift > 0);
 
   if (!supported) {
     return false;
@@ -4436,145 +3201,193 @@ inline bool Fast3x3FilterKernelSupported(const Dims<4>& input_dims,
   // Handle case where padding is zero but padding type is not kValid.
   // This would require special boundary case handling that is not supported.
 
-  const int out_x = output_width - 1;
-  const int out_y = output_height - 1;
+  const int32 out_x = output_width - 1;
+  const int32 out_y = output_height - 1;
 
-  const int in_x_origin = (out_x * stride_width) - pad_width;
-  const int in_y_origin = (out_y * stride_height) - pad_height;
+  const int32 in_x_origin = (out_x * stride_width) - pad_width;
+  const int32 in_y_origin = (out_y * stride_height) - pad_height;
 
-  const int in_x_end = in_x_origin + filter_width;
-  const int in_y_end = in_y_origin + filter_height;
+  const int32 in_x_end = in_x_origin + filter_width;
+  const int32 in_y_end = in_y_origin + filter_height;
 
   // Supported only if filter on the right and bottom boundary lies completely
-  // within the input.
-  return in_x_end <= input_width && in_y_end <= input_height;
+  // within the input if padding is zero.
+  if (pad_width == 0 && pad_height == 0) {
+    return in_x_end <= input_width && in_y_end <= input_height;
+  }
+
+  // Else if padding is 1, supported if bottom right filter lies +1 past input
+  // width and height.
+  supported = in_x_end <= (input_width + 1) && in_y_end <= (input_height + 1);
+
+  if (!supported) {
+    return false;
+  }
+
+  // Shapes with width 1 and height > 1, and vice versa are not supported yet.
+  if (input_width == 1) {
+    supported = (input_width == input_height);
+  } else if (input_height == 1) {
+    supported = (input_width == input_height);
+  }
+  return supported;
 }
 
 inline void DepthwiseConv3x3Filter(
     const uint8* input_data, const Dims<4>& input_dims, int32 input_offset,
     const uint8* filter_data, const Dims<4>& filter_dims, int32 filter_offset,
-    const int32* bias_data, const Dims<4>& bias_dims, int stride_width,
-    int stride_height, int pad_width, int pad_height, int depth_multiplier,
-    int32 output_offset, int32 output_multiplier, int output_shift,
-    int32 output_activation_min, int32 output_activation_max,
-    uint8* output_data, const Dims<4>& output_dims) {
-  const int batches = MatchingArraySize(input_dims, 3, output_dims, 3);
-  const int output_depth = MatchingArraySize(filter_dims, 0, output_dims, 0);
-  const int input_height = ArraySize(input_dims, 2);
-  const int input_width = ArraySize(input_dims, 1);
-  const int input_depth = ArraySize(input_dims, 0);
-  const int filter_height = ArraySize(filter_dims, 2);
-  const int filter_width = ArraySize(filter_dims, 1);
-  const int output_height = ArraySize(output_dims, 2);
-  const int output_width = ArraySize(output_dims, 1);
-
-  // Algorithm assumes below constraints. It is optimized for depth multiplier
-  // of 1, 3x3 filter, no padding and strides 1 and 2.
-  TFLITE_DCHECK(output_depth == input_depth * depth_multiplier);
+    const int32* bias_data, const Dims<4>& bias_dims, int32 stride_width,
+    int32 stride_height, int32 pad_width, int32 pad_height,
+    int32 depth_multiplier, int32 output_offset, int32 output_multiplier,
+    int32 output_shift, int32 output_activation_min,
+    int32 output_activation_max, uint8* output_data,
+    const Dims<4>& output_dims) {
+  DepthwiseConvParams params;
+  params.input_depth = ArraySize(input_dims, 0);
+  params.input_width = ArraySize(input_dims, 1);
+  params.input_height = ArraySize(input_dims, 2);
+  params.input_row_size = params.input_depth * params.input_width;
+  params.input_offset = input_offset;
+  params.stride_width = stride_width;
+  params.stride_height = stride_height;
+  params.output_depth = MatchingArraySize(filter_dims, 0, output_dims, 0);
+  params.output_width = ArraySize(output_dims, 1);
+  params.output_height = ArraySize(output_dims, 2);
+  params.output_row_size = params.output_depth * params.output_width;
+  params.output_offset = output_offset;
+  params.filter_offset = filter_offset;
+  params.output_multiplier = output_multiplier;
+  params.output_shift = output_shift;
+  params.output_activation_min = output_activation_min;
+  params.output_activation_max = output_activation_max;
+
+  const int32 filter_height = ArraySize(filter_dims, 2);
+  const int32 filter_width = ArraySize(filter_dims, 1);
+  params.filter_row_size = params.output_depth * filter_width;
+
+  // Algorithm assumes below constraints. It is optimized for depth
+  // multiplier of 1, 3x3 filter, no padding and strides 1 and 2.
+  TFLITE_DCHECK(params.output_depth == params.input_depth * depth_multiplier);
   TFLITE_DCHECK(depth_multiplier == 1);
   TFLITE_DCHECK(filter_height == 3);
   TFLITE_DCHECK(filter_width == 3);
-  TFLITE_DCHECK(pad_height == 0);
-  TFLITE_DCHECK(pad_width == 0);
   TFLITE_DCHECK(stride_height == 1 || stride_height == 2);
   TFLITE_DCHECK(stride_width == 1 || stride_width == 2);
   TFLITE_DCHECK(stride_width == stride_height);
+  TFLITE_DCHECK(pad_height == 0 || pad_height == 1);
+  TFLITE_DCHECK(pad_width == 0 || pad_width == 1);
+  TFLITE_DCHECK(pad_width == pad_height);
+
+  const int32 batches = MatchingArraySize(input_dims, 3, output_dims, 3);
+  const int64_t input_batch_size = params.input_row_size * params.input_height;
+  const int64_t output_batch_size =
+      params.output_row_size * params.output_height;
+
+  ShuffleParams one_row_shuffle_params, two_row_shuffle_params,
+      four_row_shuffle_params, eight_row_shuffle_params;
+  if (stride_width == 1) {
+    one_row_shuffle_params = ShuffleParams(30, 1, 1, 1);
+    two_row_shuffle_params = ShuffleParams(22, 2, 1, 1);
+    four_row_shuffle_params = ShuffleParams(14, 4, 1, 1);
+    eight_row_shuffle_params = ShuffleParams(8, 8, 1, 1);
+  } else {
+    one_row_shuffle_params = ShuffleParams(14, 1, 2, 2);
+    two_row_shuffle_params = ShuffleParams(8, 2, 2, 2);
+    four_row_shuffle_params = ShuffleParams(4, 4, 2, 2);
+    eight_row_shuffle_params = ShuffleParams(2, 8, 2, 2);
+  }
 
-  const int input_row_size = input_depth * (input_width + 2 * pad_width);
-  const int output_row_size = output_depth * output_width;
-  const int input_batch_size = input_row_size * (input_height + 2 * pad_height);
-  const int output_batch_size = output_depth * output_width * output_height;
-
-  using conv_row_func_t = decltype(&ConvRow3x3FilterDepth8<1, 1, 1>::Run);
-  conv_row_func_t conv_1_output_row = ConvRow3x3FilterDepth8<1, 1, 1>::Run;
-  conv_row_func_t conv_2_output_rows = ConvRow3x3FilterDepth8<2, 1, 1>::Run;
-  conv_row_func_t conv_4_output_rows = ConvRow3x3FilterDepth8<4, 1, 1>::Run;
-  conv_row_func_t conv_8_output_rows = ConvRow3x3FilterDepth8<8, 1, 1>::Run;
-
+  using conv_multirow_func_t = decltype(&DepthwiseConvMultiRow<1, 1>::Run);
+  conv_multirow_func_t conv_multirow_func = DepthwiseConvMultiRow<1, 1>::Run;
   if (stride_width == 2) {
-    conv_1_output_row = ConvRow3x3FilterDepth8<1, 2, 2>::Run;
-    conv_2_output_rows = ConvRow3x3FilterDepth8<2, 2, 2>::Run;
-    conv_4_output_rows = ConvRow3x3FilterDepth8<4, 2, 2>::Run;
-    conv_8_output_rows = ConvRow3x3FilterDepth8<8, 2, 2>::Run;
+    conv_multirow_func = DepthwiseConvMultiRow<2, 2>::Run;
   }
 
   // Allocate maximum memory needed for shuffled input.
   // TODO(mariewhite): The size of this workspace is small enough to be
   // allocated on the stack. Eventually we will want to move it to the heap
-  // and have it allocated outside of this function, like the im2col_array used
-  // in gemmlowp.
-#define DEPTHWISECONV_SHUFFLE_WORKSPACE_SIZE 10 * 10 * 64
+  // and have it allocated outside of this function, like the im2col_array
+  // used in gemmlowp.
   uint8 shuffle_workspace[DEPTHWISECONV_SHUFFLE_WORKSPACE_SIZE];
 
-  // Make sure the kernels using this buffer will not run out of bounds.
-  static_assert(ConvRow3x3FilterDepth8<8, 1, 1>::ShuffleWorkspaceSize() <=
-                    DEPTHWISECONV_SHUFFLE_WORKSPACE_SIZE,
-                "Shuffle workspace size is too small.");
-  static_assert(ConvRow3x3FilterDepth8<4, 2, 2>::ShuffleWorkspaceSize() <=
-                    DEPTHWISECONV_SHUFFLE_WORKSPACE_SIZE,
-                "Shuffle workspace size is too small.");
-
-#undef DEPTHWISECONV_SHUFFLE_WORKSPACE_SIZE
-
-  for (int b = 0; b < batches; ++b) {
+  for (int32 b = 0; b < batches; ++b) {
     const uint8* input_ptr = input_data + b * input_batch_size;
     uint8* output_ptr = output_data + b * output_batch_size;
 
-    int out_y = 0;
+    int32 out_x = 0;
+    int32 out_y = 0;
+    int32 end_x = params.output_width;
+    int32 end_y = params.output_height;
+
+    if (pad_width == 1 && pad_height == 1) {
+      DepthwiseConvHandlePadding(input_ptr, filter_data, bias_data, output_ptr,
+                                 params);
+
+      // Update extents now that the edges have been handled.
+      out_x = 1;
+      end_x = params.output_width - 1;
+      out_y = 1;
+      end_y = params.output_height - 1;
+      const int in_x = (out_x * stride_width) - pad_width;
+      const int in_y = (out_y * stride_height) - pad_height;
+      input_ptr += in_y * params.input_row_size + in_x * params.input_depth;
+      output_ptr += out_y * params.output_row_size
+          + out_x * params.output_depth;
+    }
+
+    // Shuffling shapes that maximize width over the shuffle workspace size
+    // perform better since the inputs are closer together, minimizing
+    // shuffling time.
+    //
+    // If the input shape has width large enough for the 2 row kernels,
+    // we prefer to use this. The innermost loop of the kernels handle
+    // 2 height x 2 width so this is the fastest path.
+    //
+    // If the input shape has smaller width but larger height, shuffling is
+    // still useful and can benefit from kernels 4 row and 8 row kernels.
 
     // Handle 8 rows at a time.
-    for (; out_y <= output_height - 8; out_y += 8) {
-      conv_8_output_rows(input_ptr, 0, out_y, input_depth, input_width,
-                         input_height, input_row_size, input_offset,
-                         filter_data, filter_offset, bias_data, output_offset,
-                         output_multiplier, output_shift, output_activation_min,
-                         output_activation_max, output_ptr, output_depth,
-                         output_width, shuffle_workspace);
-
-      input_ptr += 8 * stride_height * input_row_size;
-      output_ptr += 8 * output_row_size;
+    if (params.input_width < four_row_shuffle_params.input_width) {
+      for (; out_y <= end_y - 8; out_y += 8) {
+        conv_multirow_func(input_ptr, out_x, end_x, filter_data, bias_data,
+                           output_ptr, params, eight_row_shuffle_params,
+                           shuffle_workspace);
+        input_ptr += 8 * stride_height * params.input_row_size;
+        output_ptr += 8 * params.output_row_size;
+      }
     }
 
     // Handle 4 rows at a time.
-    for (; out_y <= output_height - 4; out_y += 4) {
-      conv_4_output_rows(input_ptr, 0, out_y, input_depth, input_width,
-                         input_height, input_row_size, input_offset,
-                         filter_data, filter_offset, bias_data, output_offset,
-                         output_multiplier, output_shift, output_activation_min,
-                         output_activation_max, output_ptr, output_depth,
-                         output_width, shuffle_workspace);
-
-      input_ptr += 4 * stride_height * input_row_size;
-      output_ptr += 4 * output_row_size;
+    if (params.input_width < two_row_shuffle_params.input_width) {
+      for (; out_y <= end_y - 4; out_y += 4) {
+        conv_multirow_func(input_ptr, out_x, end_x, filter_data, bias_data,
+                           output_ptr, params, four_row_shuffle_params,
+                           shuffle_workspace);
+        input_ptr += 4 * stride_height * params.input_row_size;
+        output_ptr += 4 * params.output_row_size;
+      }
     }
 
     // Handle 2 rows at a time.
-    for (; out_y <= output_height - 2; out_y += 2) {
-      conv_2_output_rows(input_ptr, 0, out_y, input_depth, input_width,
-                         input_height, input_row_size, input_offset,
-                         filter_data, filter_offset, bias_data, output_offset,
-                         output_multiplier, output_shift, output_activation_min,
-                         output_activation_max, output_ptr, output_depth,
-                         output_width, shuffle_workspace);
-
-      input_ptr += 2 * stride_height * input_row_size;
-      output_ptr += 2 * output_row_size;
+    for (; out_y <= end_y - 2; out_y += 2) {
+      conv_multirow_func(input_ptr, out_x, end_x, filter_data, bias_data,
+                         output_ptr, params, two_row_shuffle_params,
+                         shuffle_workspace);
+      input_ptr += 2 * stride_height * params.input_row_size;
+      output_ptr += 2 * params.output_row_size;
     }
 
     // Handle one row at a time.
-    for (; out_y < output_height; out_y++) {
-      conv_1_output_row(input_ptr, 0, out_y, input_depth, input_width,
-                        input_height, input_row_size, input_offset, filter_data,
-                        filter_offset, bias_data, output_offset,
-                        output_multiplier, output_shift, output_activation_min,
-                        output_activation_max, output_ptr, output_depth,
-                        output_width, shuffle_workspace);
-
-      input_ptr += stride_height * input_row_size;
-      output_ptr += output_row_size;
+    for (; out_y < end_y; out_y++) {
+      conv_multirow_func(input_ptr, out_x, end_x, filter_data, bias_data,
+                         output_ptr, params, one_row_shuffle_params,
+                         shuffle_workspace);
+      input_ptr += stride_height * params.input_row_size;
+      output_ptr += params.output_row_size;
     }
   }
 }
+// clang-format on
 
 #endif  // __aarch64__
 
diff --git a/tensorflow/contrib/lite/kernels/internal/optimized/neon_tensor_utils.cc b/tensorflow/contrib/lite/kernels/internal/optimized/neon_tensor_utils.cc
index 08f7cfa5a5f9453cd187164078898e754126da52..38ad32c734a2286c7d23162810625169a4d8df43 100644
--- a/tensorflow/contrib/lite/kernels/internal/optimized/neon_tensor_utils.cc
+++ b/tensorflow/contrib/lite/kernels/internal/optimized/neon_tensor_utils.cc
@@ -352,6 +352,30 @@ void NeonSub1Vector(const float* vector, int v_size, float* result) {
   }
 }
 
+bool NeonIsZeroVector(const float* vector, int v_size) {
+  // If v_size is not divisible by kFloatWeightsPerNeonLane, we cannot
+  // use the main vectorized loop, and we need to process sequentially.
+  // postamble_start shows the start index where this should happen.
+  const int postamble_start =
+      v_size - (v_size & (kFloatWeightsPerNeonLane - 1));
+
+  const float32x4_t zero_x4_float = vmovq_n_f32(0.0f);
+  for (int v = 0; v < postamble_start; v += kFloatWeightsPerNeonLane) {
+    const float32x4_t i_x4_float = vld1q_f32(vector + v);
+    uint32x4_t cmp_result = vceqq_f32(i_x4_float, zero_x4_float);
+    if (vgetq_lane_u32(cmp_result, 0) == 0) return false;
+    if (vgetq_lane_u32(cmp_result, 1) == 0) return false;
+    if (vgetq_lane_u32(cmp_result, 2) == 0) return false;
+    if (vgetq_lane_u32(cmp_result, 3) == 0) return false;
+  }
+
+  // Postamble loop
+  for (int v = postamble_start; v < v_size; ++v) {
+    if (vector[v] != 0.0) return false;
+  }
+  return true;
+}
+
 void NeonClipVector(const float* vector, int v_size, float abs_limit,
                     float* result) {
   // If v_size is not divisible by kWeightsPerNeonLane, we cannot use the main
diff --git a/tensorflow/contrib/lite/kernels/internal/optimized/neon_tensor_utils.h b/tensorflow/contrib/lite/kernels/internal/optimized/neon_tensor_utils.h
index 9e60d0657b49ed5ee1f2e999acb86ebee0eab972..7a5a8fc54123946229963abd1720030d0bb358bf 100644
--- a/tensorflow/contrib/lite/kernels/internal/optimized/neon_tensor_utils.h
+++ b/tensorflow/contrib/lite/kernels/internal/optimized/neon_tensor_utils.h
@@ -100,6 +100,11 @@ void ZeroVector(float* vector, int v_size) {
 
 float Clip(float f, float abs_limit) { return PortableClip(f, abs_limit); }
 
+// Check if all entries of a vector are zero.
+bool IsZeroVector(const float* vector, int v_size) {
+  return NEON_OR_PORTABLE(IsZeroVector, vector, v_size);
+}
+
 void ClipVector(const float* vector, int v_size, float abs_limit,
                 float* result) {
   NEON_OR_PORTABLE(ClipVector, vector, v_size, abs_limit, result);
diff --git a/tensorflow/contrib/lite/kernels/internal/optimized/optimized_ops.h b/tensorflow/contrib/lite/kernels/internal/optimized/optimized_ops.h
index b33e4b5017d6ace16799fdea9b4d08dc6c110660..be4825f2507ccb583298af6cd824dd6da1fc47c7 100644
--- a/tensorflow/contrib/lite/kernels/internal/optimized/optimized_ops.h
+++ b/tensorflow/contrib/lite/kernels/internal/optimized/optimized_ops.h
@@ -51,6 +51,13 @@ using reference_ops::LessEqual;
 using reference_ops::RankOneSelect;
 using reference_ops::Select;
 
+// TODO(b/80247582) Remove this constant.
+// This will be phased out as the shifts are revised with more thought. Use of a
+// constant enables us to track progress on this work.
+//
+// Used mainly to convert from old-style shifts (right) to new-style (left).
+static constexpr int kReverseShift = -1;
+
 // Make a local VectorMap typedef allowing to map a float array
 // as a Eigen vector expression. The std::conditional here is to
 // construct the suitable Eigen type for the constness of the
@@ -140,6 +147,45 @@ MatrixMap<Scalar> MapAsMatrixWithGivenNumberOfRows(Scalar* data,
   return MatrixMap<Scalar>(data, rows, cols);
 }
 
+// This is like the template-parameter version, except that the power-of-two is
+// passed as a function parameter. The template version is to be preferred,
+// since some target hardware optimizations depend on the range of the exponent.
+template <typename IntegerType>
+IntegerType SaturatingRoundingMultiplyByPOTParam(IntegerType x, int exponent) {
+  if (exponent == 0) {
+    return x;
+  }
+  using ScalarIntegerType =
+      typename gemmlowp::FixedPointRawTypeTraits<IntegerType>::ScalarRawType;
+  const IntegerType min =
+      gemmlowp::Dup<IntegerType>(std::numeric_limits<ScalarIntegerType>::min());
+  const IntegerType max =
+      gemmlowp::Dup<IntegerType>(std::numeric_limits<ScalarIntegerType>::max());
+  const int ScalarIntegerTypeBits = 8 * sizeof(ScalarIntegerType);
+
+  const std::int32_t threshold =
+      ((1 << (ScalarIntegerTypeBits - 1 - exponent)) - 1);
+  const IntegerType positive_mask =
+      gemmlowp::MaskIfGreaterThan(x, gemmlowp::Dup<IntegerType>(threshold));
+  const IntegerType negative_mask =
+      gemmlowp::MaskIfLessThan(x, gemmlowp::Dup<IntegerType>(-threshold));
+
+  IntegerType result = gemmlowp::ShiftLeft(x, exponent);
+  result = gemmlowp::SelectUsingMask(positive_mask, max, result);
+  result = gemmlowp::SelectUsingMask(negative_mask, min, result);
+  return result;
+}
+
+// This is like the template-parameter version, except that the power-of-two is
+// passed as a function parameter. See raw-integer version for further comments.
+template <typename tRawType, int tIntegerBits>
+gemmlowp::FixedPoint<tRawType, tIntegerBits>
+SaturatingRoundingMultiplyByPOTParam(
+    gemmlowp::FixedPoint<tRawType, tIntegerBits> a, int exponent) {
+  return gemmlowp::FixedPoint<tRawType, tIntegerBits>::FromRaw(
+      SaturatingRoundingMultiplyByPOTParam(a.raw(), exponent));
+}
+
 // DO NOT USE THIS STRUCT FOR NEW FUNCTIONALITY BEYOND IMPLEMENTING ELEMENT-WISE
 // BROADCASTING.
 //
@@ -1036,10 +1082,10 @@ struct GemmlowpOutputPipeline {
       gemmlowp::OutputStageQuantizeDownInt32ToUint8ScaleByFixedPoint,
       gemmlowp::OutputStageClamp, gemmlowp::OutputStageSaturatingCastToUint8>
       Pipeline;
-  static Pipeline Make(const int32* bias_data, int output_rows,
-                       int32 output_offset, int32 output_multiplier,
-                       int output_shift, int32 output_activation_min,
-                       int32 output_activation_max) {
+  static Pipeline MakeExp(const int32* bias_data, int output_rows,
+                          int32 output_offset, int32 output_multiplier,
+                          int output_left_shift, int32 output_activation_min,
+                          int32 output_activation_max) {
     ColVectorMap bias_vector(bias_data, output_rows);
     gemmlowp::OutputStageBiasAddition<ColVectorMap> bias_addition_stage;
     bias_addition_stage.bias_vector = bias_vector;
@@ -1047,7 +1093,7 @@ struct GemmlowpOutputPipeline {
         quantize_down_stage;
     quantize_down_stage.result_offset_after_shift = output_offset;
     quantize_down_stage.result_fixedpoint_multiplier = output_multiplier;
-    quantize_down_stage.result_shift = output_shift;
+    quantize_down_stage.result_shift = -output_left_shift;
     gemmlowp::OutputStageClamp clamp_stage;
     clamp_stage.min = output_activation_min;
     clamp_stage.max = output_activation_max;
@@ -1100,8 +1146,8 @@ inline void FullyConnected(const uint8* input_data, const Dims<4>& input_dims,
       input_data, filter_cols, batches, filter_cols);
   gemmlowp::MatrixMap<uint8, gemmlowp::MapOrder::ColMajor> output_matrix(
       output_data, output_rows, batches, output_rows);
-  const auto& output_pipeline = GemmlowpOutputPipeline::Make(
-      bias_data, output_rows, output_offset, output_multiplier, output_shift,
+  const auto& output_pipeline = GemmlowpOutputPipeline::MakeExp(
+      bias_data, output_rows, output_offset, output_multiplier, -output_shift,
       output_activation_min, output_activation_max);
   gemmlowp::GemmWithOutputPipeline<uint8, uint8,
                                    gemmlowp::L8R8WithLhsNonzeroBitDepthParams>(
@@ -1730,6 +1776,100 @@ inline void ExtractPatchIntoBufferColumn(
   }
 }
 
+template <typename T>
+void DilatedIm2col(const T* input_data, const Dims<4>& input_dims,
+                   const Dims<4>& filter_dims, int stride_width,
+                   int stride_height, int dilation_width_factor,
+                   int dilation_height_factor, int pad_width, int pad_height,
+                   const Dims<4>& output_dims, uint8 byte_zero,
+                   T* im2col_data) {
+  // For dilated convolution, the input pixels are not contiguous therefore we
+  // can't use the same opitimizations as Im2Col(). Though note this code would
+  // work fine for the non-dilated case too (though likely a bit slower).
+  gemmlowp::ScopedProfilingLabel label("DilatedIm2col");
+  TFLITE_DCHECK(dilation_width_factor != 1 || dilation_height_factor != 1);
+  TFLITE_DCHECK(IsPackedWithoutStrides(input_dims));
+  TFLITE_DCHECK(IsPackedWithoutStrides(filter_dims));
+  TFLITE_DCHECK(IsPackedWithoutStrides(output_dims));
+  TFLITE_DCHECK(im2col_data);
+  const int batches = MatchingArraySize(input_dims, 3, output_dims, 3);
+  const int input_height = ArraySize(input_dims, 2);
+  const int input_width = ArraySize(input_dims, 1);
+  const int input_depth = MatchingArraySize(input_dims, 0, filter_dims, 0);
+  const int filter_height = ArraySize(filter_dims, 2);
+  const int filter_width = ArraySize(filter_dims, 1);
+  const int output_height = ArraySize(output_dims, 2);
+  const int output_width = ArraySize(output_dims, 1);
+  MatchingArraySize(output_dims, 0, filter_dims, 3);
+
+  // Construct the MxN sized im2col matrix.
+  // The rows M, are sub-ordered B x H x W
+  Dims<4> row_dims;
+  row_dims.sizes[0] = output_width;
+  row_dims.sizes[1] = output_height;
+  row_dims.sizes[2] = batches;
+  row_dims.sizes[3] = 1;
+  ComputeStrides(&row_dims);
+
+  // The columns, N, are sub-ordered Kh x Kw x Din
+  Dims<4> col_dims;
+  col_dims.sizes[0] = input_depth;
+  col_dims.sizes[1] = filter_width;
+  col_dims.sizes[2] = filter_height;
+  col_dims.sizes[3] = 1;
+  ComputeStrides(&col_dims);
+
+  // Use dimensions M and N to construct dims for indexing directly into im2col
+  Dims<4> im2col_dims;
+  im2col_dims.sizes[0] = col_dims.strides[3];
+  im2col_dims.sizes[1] = row_dims.strides[3];
+  im2col_dims.sizes[2] = 1;
+  im2col_dims.sizes[3] = 1;
+  ComputeStrides(&im2col_dims);
+
+  // Loop through the output rows (B x H x W)
+  for (int batch = 0; batch < batches; ++batch) {
+    for (int out_y = 0; out_y < output_height; ++out_y) {
+      for (int out_x = 0; out_x < output_width; ++out_x) {
+        // Each row is an output pixel. Arrange the input data into this row in
+        // an order we can conveniently multiply with the filter data.
+        int row_offset = Offset(row_dims, out_x, out_y, batch, 0);
+        const int in_x_origin = (out_x * stride_width) - pad_width;
+        const int in_y_origin = (out_y * stride_height) - pad_height;
+        // Loop through all the pixels of the filter (Kh x Kw)
+        for (int filter_y = 0; filter_y < filter_height; ++filter_y) {
+          const int in_y = in_y_origin + dilation_height_factor * filter_y;
+          if ((in_y >= 0) && (in_y < input_height)) {
+            // Filter row is within the input data.
+            // Loop through all the filter pixels in this row.
+            for (int filter_x = 0; filter_x < filter_width; ++filter_x) {
+              const int in_x = in_x_origin + dilation_width_factor * filter_x;
+              int col_offset = Offset(col_dims, 0, filter_x, filter_y, 0);
+              T* dst = im2col_data +
+                       Offset(im2col_dims, col_offset, row_offset, 0, 0);
+              if ((in_x >= 0) && (in_x < input_width)) {
+                // Filter pixel is within the input, copy the data.
+                T const* src =
+                    input_data + Offset(input_dims, 0, in_x, in_y, batch);
+                memcpy(dst, src, input_depth * sizeof(T));
+              } else {
+                // Filter pixel is outside the input, zero it out.
+                memset(dst, byte_zero, input_depth * sizeof(T));
+              }
+            }
+          } else {
+            // Filter row is outside the input, zero out the entire im2col row.
+            int col_offset = Offset(col_dims, 0, 0, filter_y, 0);
+            T* dst =
+                im2col_data + Offset(im2col_dims, col_offset, row_offset, 0, 0);
+            memset(dst, byte_zero, filter_width * input_depth * sizeof(T));
+          }
+        }
+      }
+    }
+  }
+}
+
 template <typename T>
 void Im2col(const T* input_data, const Dims<4>& input_dims, int stride_width,
             int stride_height, int pad_width, int pad_height, int kheight,
@@ -1770,74 +1910,6 @@ void Im2col(const T* input_data, const Dims<4>& input_dims, int stride,
          kwidth, byte_zero, output_data, output_dims);
 }
 
-inline void DilatedConv(const float* input_data, const Dims<4>& input_dims,
-                        const float* filter_data, const Dims<4>& filter_dims,
-                        const float* bias_data, const Dims<4>& bias_dims,
-                        int stride_width, int stride_height,
-                        int dilation_width_factor, int dilation_height_factor,
-                        int pad_width, int pad_height,
-                        float output_activation_min,
-                        float output_activation_max, float* output_data,
-                        const Dims<4>& output_dims, float* im2col_data,
-                        const Dims<4>& im2col_dims) {
-  gemmlowp::ScopedProfilingLabel label("DilatedConv");
-  // This is a copy of the reference Conv implementation. We do not currently
-  // have an optimized path for dilation.
-  (void)im2col_data;  // only used in optimized code.
-  (void)im2col_dims;  // only used in optimized code.
-  const int batches = MatchingArraySize(input_dims, 3, output_dims, 3);
-  const int input_depth = MatchingArraySize(input_dims, 0, filter_dims, 0);
-  const int output_depth = MatchingArraySize(filter_dims, 3, output_dims, 0);
-  if (bias_data) {
-    TFLITE_DCHECK_EQ(ArraySize(filter_dims, 3), ArraySize(bias_dims, 0));
-  }
-  const int input_height = ArraySize(input_dims, 2);
-  const int input_width = ArraySize(input_dims, 1);
-  const int filter_height = ArraySize(filter_dims, 2);
-  const int filter_width = ArraySize(filter_dims, 1);
-  const int output_height = ArraySize(output_dims, 2);
-  const int output_width = ArraySize(output_dims, 1);
-  for (int batch = 0; batch < batches; ++batch) {
-    for (int out_y = 0; out_y < output_height; ++out_y) {
-      for (int out_x = 0; out_x < output_width; ++out_x) {
-        for (int out_channel = 0; out_channel < output_depth; ++out_channel) {
-          const int in_x_origin = (out_x * stride_width) - pad_width;
-          const int in_y_origin = (out_y * stride_height) - pad_height;
-          float total = 0.f;
-          for (int filter_y = 0; filter_y < filter_height; ++filter_y) {
-            for (int filter_x = 0; filter_x < filter_width; ++filter_x) {
-              for (int in_channel = 0; in_channel < input_depth; ++in_channel) {
-                const int in_x = in_x_origin + dilation_width_factor * filter_x;
-                const int in_y =
-                    in_y_origin + dilation_height_factor * filter_y;
-                // If the location is outside the bounds of the input image,
-                // use zero as a default value.
-                if ((in_x >= 0) && (in_x < input_width) && (in_y >= 0) &&
-                    (in_y < input_height)) {
-                  float input_value = input_data[Offset(input_dims, in_channel,
-                                                        in_x, in_y, batch)];
-                  float filter_value =
-                      filter_data[Offset(filter_dims, in_channel, filter_x,
-                                         filter_y, out_channel)];
-                  total += (input_value * filter_value);
-                }
-              }
-            }
-          }
-          float bias_value = 0.0f;
-          if (bias_data) {
-            bias_value = bias_data[Offset(bias_dims, out_channel, 0, 0, 0)];
-          }
-          output_data[Offset(output_dims, out_channel, out_x, out_y, batch)] =
-              ActivationFunctionWithMinMax(total + bias_value,
-                                           output_activation_min,
-                                           output_activation_max);
-        }
-      }
-    }
-  }
-}
-
 inline void Conv(const float* input_data, const Dims<4>& input_dims,
                  const float* filter_data, const Dims<4>& filter_dims,
                  const float* bias_data, const Dims<4>& bias_dims,
@@ -1846,29 +1918,32 @@ inline void Conv(const float* input_data, const Dims<4>& input_dims,
                  float output_activation_min, float output_activation_max,
                  float* output_data, const Dims<4>& output_dims,
                  float* im2col_data, const Dims<4>& im2col_dims) {
-  if ((dilation_width_factor != 1) || (dilation_height_factor != 1)) {
-    return DilatedConv(input_data, input_dims, filter_data, filter_dims,
-                       bias_data, bias_dims, stride_width, stride_height,
-                       dilation_width_factor, dilation_height_factor, pad_width,
-                       pad_height, output_activation_min, output_activation_max,
-                       output_data, output_dims, im2col_data, im2col_dims);
-  }
-
   (void)im2col_data;
   (void)im2col_dims;
   gemmlowp::ScopedProfilingLabel label("Conv");
 
+  // A float set to 0x00000000h == 0.0f
+  const uint8 float_zero_byte = 0x00;
   const float* gemm_input_data = nullptr;
   const Dims<4>* gemm_input_dims = nullptr;
   const int filter_width = ArraySize(filter_dims, 1);
   const int filter_height = ArraySize(filter_dims, 2);
+  const bool need_dilated_im2col =
+      dilation_width_factor != 1 || dilation_height_factor != 1;
   const bool need_im2col = stride_width != 1 || stride_height != 1 ||
                            filter_width != 1 || filter_height != 1;
-  if (need_im2col) {
+  if (need_dilated_im2col) {
+    DilatedIm2col(input_data, input_dims, filter_dims, stride_width,
+                  stride_height, dilation_width_factor, dilation_height_factor,
+                  pad_width, pad_height, output_dims, float_zero_byte,
+                  im2col_data);
+    gemm_input_data = im2col_data;
+    gemm_input_dims = &im2col_dims;
+  } else if (need_im2col) {
     TFLITE_DCHECK(im2col_data);
     Im2col(input_data, input_dims, stride_width, stride_height, pad_width,
-           pad_height, filter_height, filter_width, 0, im2col_data,
-           im2col_dims);
+           pad_height, filter_height, filter_width, float_zero_byte,
+           im2col_data, im2col_dims);
     gemm_input_data = im2col_data;
     gemm_input_dims = &im2col_dims;
   } else {
@@ -1979,11 +2054,23 @@ inline void Conv(const uint8* input_data, const Dims<4>& input_dims,
   }
 
   const int gemm_input_rows = gemm_input_dims->sizes[0];
-  const int gemm_input_cols = FlatSizeSkipDim(*gemm_input_dims, 0);
+  // Using FlatSizeSkipDim causes segfault in some contexts (see b/79927784).
+  // The root cause has not yet been identified though. Same applies below for
+  // the other calls commented out. This is a partial rollback of cl/196819423.
+  // const int gemm_input_cols = FlatSizeSkipDim(*gemm_input_dims, 0);
+  const int gemm_input_cols = gemm_input_dims->sizes[1] *
+                              gemm_input_dims->sizes[2] *
+                              gemm_input_dims->sizes[3];
   const int filter_rows = filter_dims.sizes[3];
-  const int filter_cols = FlatSizeSkipDim(filter_dims, 3);
+  // See b/79927784.
+  // const int filter_cols = FlatSizeSkipDim(filter_dims, 3);
+  const int filter_cols =
+      filter_dims.sizes[0] * filter_dims.sizes[1] * filter_dims.sizes[2];
   const int output_rows = output_dims.sizes[0];
-  const int output_cols = FlatSizeSkipDim(output_dims, 0);
+  // See b/79927784.
+  // const int output_cols = FlatSizeSkipDim(output_dims, 0);
+  const int output_cols =
+      output_dims.sizes[1] * output_dims.sizes[2] * output_dims.sizes[3];
   TFLITE_DCHECK_EQ(output_rows, filter_rows);
   TFLITE_DCHECK_EQ(output_cols, gemm_input_cols);
   TFLITE_DCHECK_EQ(filter_cols, gemm_input_rows);
@@ -1997,8 +2084,8 @@ inline void Conv(const uint8* input_data, const Dims<4>& input_dims,
       gemm_input_data, gemm_input_rows, gemm_input_cols);
   gemmlowp::MatrixMap<uint8, gemmlowp::MapOrder::ColMajor> output_matrix(
       output_data, output_rows, output_cols);
-  const auto& output_pipeline = GemmlowpOutputPipeline::Make(
-      bias_data, output_rows, output_offset, output_multiplier, output_shift,
+  const auto& output_pipeline = GemmlowpOutputPipeline::MakeExp(
+      bias_data, output_rows, output_offset, output_multiplier, -output_shift,
       output_activation_min, output_activation_max);
   gemmlowp::GemmWithOutputPipeline<uint8, uint8,
                                    gemmlowp::L8R8WithLhsNonzeroBitDepthParams>(
@@ -2155,8 +2242,8 @@ void ConvAsGemm(const uint8* input_data, const Dims<4>& input_dims,
       input_data, filter_cols, output_cols, filter_cols);
   gemmlowp::MatrixMap<uint8, gemmlowp::MapOrder::ColMajor> output_matrix(
       output_data, output_rows, output_cols, output_rows);
-  const auto& output_pipeline = GemmlowpOutputPipeline::Make(
-      bias_data, output_rows, output_offset, output_multiplier, output_shift,
+  const auto& output_pipeline = GemmlowpOutputPipeline::MakeExp(
+      bias_data, output_rows, output_offset, output_multiplier, -output_shift,
       output_activation_min, output_activation_max);
   gemmlowp::GemmWithOutputPipeline<uint8, uint8,
                                    gemmlowp::L8R8WithLhsNonzeroBitDepthParams>(
@@ -2300,8 +2387,9 @@ void L2Normalization(const float* input_data, const Dims<4>& input_dims,
   }
 }
 
-inline void GetInvSqrtQuantizedMultiplier(int32 input, int32* output_inv_sqrt,
-                                          int* output_shift) {
+inline void GetInvSqrtQuantizedMultiplierExp(int32 input,
+                                             int32* output_inv_sqrt,
+                                             int* output_shift) {
   *output_shift = 11;
   while (input >= (1 << 29)) {
     input /= 4;
@@ -2343,6 +2431,7 @@ inline void GetInvSqrtQuantizedMultiplier(int32 input, int32* output_inv_sqrt,
     *output_inv_sqrt <<= -*output_shift;
     *output_shift = 0;
   }
+  *output_shift *= kReverseShift;
 }
 
 inline void L2Normalization(const uint8* input_data, const Dims<4>& input_dims,
@@ -2353,24 +2442,27 @@ inline void L2Normalization(const uint8* input_data, const Dims<4>& input_dims,
   TFLITE_DCHECK(IsPackedWithoutStrides(output_dims));
   const int depth = MatchingArraySize(input_dims, 0, output_dims, 0);
   const int outer_size = MatchingFlatSizeSkipDim(input_dims, 0, output_dims);
-  TFLITE_DCHECK_EQ(outer_size, 1);
-  int32 square_l2_norm = 0;
-  for (int i = 0; i < depth; i++) {
-    int32 diff = input_data[i] - input_zero_point;
-    square_l2_norm += diff * diff;
-  }
-  int32 inv_l2norm_multiplier;
-  int inv_l2norm_shift;
-  GetInvSqrtQuantizedMultiplier(square_l2_norm, &inv_l2norm_multiplier,
-                                &inv_l2norm_shift);
-
-  for (int i = 0; i < depth; i++) {
-    int32 diff = input_data[i] - input_zero_point;
-    int32 rescaled_diff = MultiplyByQuantizedMultiplierSmallerThanOne(
-        128 * diff, inv_l2norm_multiplier, inv_l2norm_shift);
-    int32 unclamped_output_val = 128 + rescaled_diff;
-    int32 output_val = std::min(255, std::max(0, unclamped_output_val));
-    output_data[i] = static_cast<uint8>(output_val);
+  for (int i = 0; i < outer_size; ++i) {
+    int32 square_l2_norm = 0;
+    for (int c = 0; c < depth; c++) {
+      int32 diff = input_data[c] - input_zero_point;
+      square_l2_norm += diff * diff;
+    }
+    int32 inv_l2norm_multiplier;
+    int inv_l2norm_shift;
+    GetInvSqrtQuantizedMultiplierExp(square_l2_norm, &inv_l2norm_multiplier,
+                                     &inv_l2norm_shift);
+
+    for (int c = 0; c < depth; c++) {
+      int32 diff = *input_data - input_zero_point;
+      int32 rescaled_diff = MultiplyByQuantizedMultiplierSmallerThanOneExp(
+          128 * diff, inv_l2norm_multiplier, inv_l2norm_shift);
+      int32 unclamped_output_val = 128 + rescaled_diff;
+      int32 output_val = std::min(255, std::max(0, unclamped_output_val));
+      *output_data = static_cast<uint8>(output_val);
+      ++input_data;
+      ++output_data;
+    }
   }
 }
 
@@ -2506,14 +2598,19 @@ inline void AddElementwise(int size, int left_shift, const uint8* input1_data,
     const int32 input2_val = input2_offset + input2_data[i];
     const int32 shifted_input1_val = input1_val * (1 << left_shift);
     const int32 shifted_input2_val = input2_val * (1 << left_shift);
-    const int32 scaled_input1_val = MultiplyByQuantizedMultiplierSmallerThanOne(
-        shifted_input1_val, input1_multiplier, input1_shift);
-    const int32 scaled_input2_val = MultiplyByQuantizedMultiplierSmallerThanOne(
-        shifted_input2_val, input2_multiplier, input2_shift);
+    const int32 scaled_input1_val =
+        MultiplyByQuantizedMultiplierSmallerThanOneExp(
+            shifted_input1_val, input1_multiplier,
+            kReverseShift * input1_shift);
+    const int32 scaled_input2_val =
+        MultiplyByQuantizedMultiplierSmallerThanOneExp(
+            shifted_input2_val, input2_multiplier,
+            kReverseShift * input2_shift);
     const int32 raw_sum = scaled_input1_val + scaled_input2_val;
-    const int32 raw_output = MultiplyByQuantizedMultiplierSmallerThanOne(
-                                 raw_sum, output_multiplier, output_shift) +
-                             output_offset;
+    const int32 raw_output =
+        MultiplyByQuantizedMultiplierSmallerThanOneExp(
+            raw_sum, output_multiplier, kReverseShift * output_shift) +
+        output_offset;
     const int32 clamped_output = std::min(
         output_activation_max, std::max(output_activation_min, raw_output));
     output_data[i] = static_cast<uint8>(clamped_output);
@@ -2732,15 +2829,17 @@ inline void BroadcastAdd(int left_shift, const uint8* input1_data,
           const int32 shifted_input1_val = input1_val * (1 << left_shift);
           const int32 shifted_input2_val = input2_val * (1 << left_shift);
           const int32 scaled_input1_val =
-              MultiplyByQuantizedMultiplierSmallerThanOne(
-                  shifted_input1_val, input1_multiplier, input1_shift);
+              MultiplyByQuantizedMultiplierSmallerThanOneExp(
+                  shifted_input1_val, input1_multiplier,
+                  kReverseShift * input1_shift);
           const int32 scaled_input2_val =
-              MultiplyByQuantizedMultiplierSmallerThanOne(
-                  shifted_input2_val, input2_multiplier, input2_shift);
+              MultiplyByQuantizedMultiplierSmallerThanOneExp(
+                  shifted_input2_val, input2_multiplier,
+                  kReverseShift * input2_shift);
           const int32 raw_sum = scaled_input1_val + scaled_input2_val;
           const int32 raw_output =
-              MultiplyByQuantizedMultiplierSmallerThanOne(
-                  raw_sum, output_multiplier, output_shift) +
+              MultiplyByQuantizedMultiplierSmallerThanOneExp(
+                  raw_sum, output_multiplier, kReverseShift * output_shift) +
               output_offset;
           const int32 clamped_output =
               std::min(output_activation_max,
@@ -3081,9 +3180,9 @@ inline void BroadcastMul(const uint8* input1_data, const Dims<4>& input1_dims,
           const int32 input2_val =
               input2_offset + input2_data[SubscriptToIndex(desc2, c, x, y, b)];
           const int32 unclamped_result =
-              output_offset +
-              MultiplyByQuantizedMultiplierSmallerThanOne(
-                  input1_val * input2_val, output_multiplier, output_shift);
+              output_offset + MultiplyByQuantizedMultiplierSmallerThanOneExp(
+                                  input1_val * input2_val, output_multiplier,
+                                  kReverseShift * output_shift);
           const int32 clamped_output =
               std::min(output_activation_max,
                        std::max(output_activation_min, unclamped_result));
@@ -3265,15 +3364,17 @@ inline void BroadcastSub(int left_shift, const uint8* input1_data,
           const int32 shifted_input1_val = input1_val * (1 << left_shift);
           const int32 shifted_input2_val = input2_val * (1 << left_shift);
           const int32 scaled_input1_val =
-              MultiplyByQuantizedMultiplierSmallerThanOne(
-                  shifted_input1_val, input1_multiplier, input1_shift);
+              MultiplyByQuantizedMultiplierSmallerThanOneExp(
+                  shifted_input1_val, input1_multiplier,
+                  kReverseShift * input1_shift);
           const int32 scaled_input2_val =
-              MultiplyByQuantizedMultiplierSmallerThanOne(
-                  shifted_input2_val, input2_multiplier, input2_shift);
+              MultiplyByQuantizedMultiplierSmallerThanOneExp(
+                  shifted_input2_val, input2_multiplier,
+                  kReverseShift * input2_shift);
           const int32 raw_sub = scaled_input1_val - scaled_input2_val;
           const int32 raw_output =
-              MultiplyByQuantizedMultiplierSmallerThanOne(
-                  raw_sub, output_multiplier, output_shift) +
+              MultiplyByQuantizedMultiplierSmallerThanOneExp(
+                  raw_sub, output_multiplier, kReverseShift * output_shift) +
               output_offset;
           const int32 clamped_output =
               std::min(output_activation_max,
@@ -4556,6 +4657,119 @@ inline void LogSoftmax(const float* input_data, const Dims<4>& input_dims,
   }
 }
 
+template <int OutputIntegerBits, int InputIntegerBits>
+inline gemmlowp::FixedPoint<int32, OutputIntegerBits>
+log_x_for_x_greater_than_or_equal_to_1_impl(
+    gemmlowp::FixedPoint<int32, InputIntegerBits> input_val) {
+  // assert(__builtin_clz(0u) >= std::numeric_limits<uint32>::digits - 1);
+  // assert(__builtin_clz(0u) <= std::numeric_limits<uint32>::digits);
+  using FixedPoint0 = gemmlowp::FixedPoint<int32, 0>;
+  // The reason for accumulating the result with an extra bit of headroom is
+  // that z_pow_2_adj * log_2 might be saturated, and adding num_scaled *
+  // recip_denom will otherwise introduce an error.
+  static constexpr int kAccumIntegerBits = OutputIntegerBits + 1;
+  using FixedPointAccum = gemmlowp::FixedPoint<int32, kAccumIntegerBits>;
+
+  const FixedPoint0 log_2 = GEMMLOWP_CHECKED_FIXEDPOINT_CONSTANT(
+      FixedPoint0, 1488522236, std::log(2.0));
+  const FixedPoint0 sqrt_sqrt_half = GEMMLOWP_CHECKED_FIXEDPOINT_CONSTANT(
+      FixedPoint0, 1805811301, std::sqrt(std::sqrt(0.5)));
+  const FixedPoint0 sqrt_half = GEMMLOWP_CHECKED_FIXEDPOINT_CONSTANT(
+      FixedPoint0, 1518500250, std::sqrt(0.5));
+  const FixedPoint0 one_quarter =
+      GEMMLOWP_CHECKED_FIXEDPOINT_CONSTANT(FixedPoint0, 536870912, 1.0 / 4.0);
+
+  const FixedPoint0 alpha_n = GEMMLOWP_CHECKED_FIXEDPOINT_CONSTANT(
+      FixedPoint0, 117049297, 11.0 / 240.0 * std::sqrt(std::sqrt(2.0)));
+  const FixedPoint0 alpha_d = GEMMLOWP_CHECKED_FIXEDPOINT_CONSTANT(
+      FixedPoint0, 127690142, 1.0 / 20.0 * std::sqrt(std::sqrt(2.0)));
+  const FixedPoint0 alpha_i = GEMMLOWP_CHECKED_FIXEDPOINT_CONSTANT(
+      FixedPoint0, 1057819769,
+      2.0 / std::sqrt(std::sqrt(2.0)) - std::sqrt(std::sqrt(2.0)));
+  const FixedPoint0 alpha_f = GEMMLOWP_CHECKED_FIXEDPOINT_CONSTANT(
+      FixedPoint0, 638450708, 1.0 / 4.0 * std::sqrt(std::sqrt(2.0)));
+
+  const FixedPointAccum shifted_quarter =
+      gemmlowp::Rescale<kAccumIntegerBits>(one_quarter);
+
+  // Reinterpret the input value as Q0.31, because we will figure out the
+  // required shift "ourselves" instead of using, say, Rescale.
+  FixedPoint0 z_a = FixedPoint0::FromRaw(input_val.raw());
+  // z_a_pow_2 = input_integer_bits - z_a_headroom;
+  int z_a_headroom_plus_1 = __builtin_clz(static_cast<uint32>(z_a.raw()));
+  FixedPoint0 r_a_tmp =
+      SaturatingRoundingMultiplyByPOTParam(z_a, (z_a_headroom_plus_1 - 1));
+  const int32 r_a_raw =
+      SaturatingRoundingMultiplyByPOTParam((r_a_tmp * sqrt_half).raw(), 1);
+  // z_pow_2_adj = max(z_pow_2_a - 0.75, z_pow_2_b - 0.25);
+  // z_pow_2_adj = max(InputIntegerBits - z_a_headroom_plus_1 + 0.25,
+  //                   InputIntegerBits - z_b_headroom - 0.25);
+  const FixedPointAccum z_a_pow_2_adj = SaturatingAddNonGemmlowp(
+      FixedPointAccum::FromRaw(SaturatingRoundingMultiplyByPOTParam(
+          InputIntegerBits - z_a_headroom_plus_1, 31 - kAccumIntegerBits)),
+      shifted_quarter);
+
+  // z_b is treated like z_a, but premultiplying by sqrt(0.5).
+  FixedPoint0 z_b = z_a * sqrt_half;
+  int z_b_headroom = __builtin_clz(static_cast<uint32>(z_b.raw())) - 1;
+  const int32 r_b_raw =
+      SaturatingRoundingMultiplyByPOTParam(z_a.raw(), z_b_headroom);
+  const FixedPointAccum z_b_pow_2_adj = SaturatingSub(
+      FixedPointAccum::FromRaw(SaturatingRoundingMultiplyByPOTParam(
+          InputIntegerBits - z_b_headroom, 31 - kAccumIntegerBits)),
+      shifted_quarter);
+
+  const FixedPoint0 r = FixedPoint0::FromRaw(std::min(r_a_raw, r_b_raw));
+  const FixedPointAccum z_pow_2_adj = FixedPointAccum::FromRaw(
+      std::max(z_a_pow_2_adj.raw(), z_b_pow_2_adj.raw()));
+
+  const FixedPoint0 p = gemmlowp::RoundingHalfSum(r, sqrt_sqrt_half);
+  FixedPoint0 q = r - sqrt_sqrt_half;
+  q = q + q;
+
+  const FixedPoint0 common_sq = q * q;
+  const FixedPoint0 num = q * r + q * common_sq * alpha_n;
+  const FixedPoint0 denom_minus_one_0 =
+      p * (alpha_i + q + alpha_d * common_sq) + alpha_f * q;
+  const FixedPoint0 recip_denom =
+      one_over_one_plus_x_for_x_in_0_1(denom_minus_one_0);
+
+  const FixedPointAccum num_scaled = gemmlowp::Rescale<kAccumIntegerBits>(num);
+  return gemmlowp::Rescale<OutputIntegerBits>(z_pow_2_adj * log_2 +
+                                              num_scaled * recip_denom);
+}
+
+// Minimum output bits to accommodate log of maximum input range.  It actually
+// does not matter if one considers, say, [-64,64] or [-64,64).
+//
+// For example, run this through Octave:
+// [0:127; ...
+//  ceil(log(abs( log(2.^(0:127))+1 ))/log(2)); ...
+//  ceil(log(abs( log(2.^(0:127))+1 ))/log(2))]
+constexpr int min_log_x_output_bits(int input_bits) {
+  return input_bits > 90
+             ? 7
+             : input_bits > 44
+                   ? 6
+                   : input_bits > 21
+                         ? 5
+                         : input_bits > 10
+                               ? 4
+                               : input_bits > 4 ? 3 : input_bits > 1 ? 2 : 1;
+}
+
+template <int OutputIntegerBits, int InputIntegerBits>
+inline gemmlowp::FixedPoint<int32, OutputIntegerBits>
+log_x_for_x_greater_than_or_equal_to_1(
+    gemmlowp::FixedPoint<int32, InputIntegerBits> input_val) {
+  static_assert(
+      OutputIntegerBits >= min_log_x_output_bits(InputIntegerBits),
+      "Output integer bits must be sufficent to accommodate logs of inputs.");
+  return log_x_for_x_greater_than_or_equal_to_1_impl<OutputIntegerBits,
+                                                     InputIntegerBits>(
+      input_val);
+}
+
 // Currently just a copy of the reference code.
 inline void LogSoftmax(const uint8* input_data, const Dims<4>& input_dims,
                        int32 input_multiplier, int32 input_left_shift,
@@ -4601,13 +4815,10 @@ inline void LogSoftmax(const uint8* input_data, const Dims<4>& input_dims,
       }
     }
 
-    // TODO(b/77858996): Implement fixed-point log().
-    // Not a fully-quantized implementation: floating-point log().
-    const float float_log_sum_of_exps =
-        std::log(static_cast<float>(sum_of_exps.raw()) /
-                 (1 << (31 - kAccumulationIntegerBits)));
-    const int32 fixed_log_sum_of_exps = static_cast<int32>(TfLiteRound(
-        float_log_sum_of_exps * (1 << (31 - kScaledDiffIntegerBits))));
+    const int32 fixed_log_sum_of_exps =
+        log_x_for_x_greater_than_or_equal_to_1<kScaledDiffIntegerBits>(
+            sum_of_exps)
+            .raw();
 
     // rescaled_diff_min is smallest representable in
     // Q(kScaledDiffIntegerBits).(31-kScaledDiffIntegerBits) plus the
@@ -4618,9 +4829,9 @@ inline void LogSoftmax(const uint8* input_data, const Dims<4>& input_dims,
         fixed_log_sum_of_exps + std::numeric_limits<int32>::lowest();
     const int adjusted_diff_min =
         std::max(diff_min - 1,  // Note use of > below instead of >= above.
-                 MultiplyByQuantizedMultiplierSmallerThanOne(
+                 MultiplyByQuantizedMultiplierSmallerThanOneExp(
                      rescaled_diff_min, reverse_scaling_divisor,
-                     reverse_scaling_right_shift));
+                     kReverseShift * reverse_scaling_right_shift));
 
     for (int c = 0; c < depth; ++c) {
       int32 input_diff = static_cast<int32>(block_input_data[c]) - max_in_row;
@@ -5513,6 +5724,46 @@ inline void ResizeBilinearGeneric(const float* input_data,
   }
 }
 
+template <typename T>
+inline void ResizeBilinearGenericSmallChannel(
+    const T* input_data, const Dims<4>& input_dims, T* output_data,
+    const Dims<4>& output_dims, int32 batches, int32 input_height,
+    int32 input_width, int32 depth, int32 output_height, int32 output_width,
+    float height_scale, float width_scale) {
+  memset(output_data, 0,
+         batches * output_height * output_width * depth * sizeof(T));
+
+  T* output_ptr = &output_data[0];
+  for (int b = 0; b < batches; ++b) {
+    for (int y = 0; y < output_height; ++y) {
+      float input_y = y * height_scale;
+      int32 y0 = static_cast<int32>(std::floor(input_y));
+      int32 y1 = std::min(y0 + 1, input_height - 1);
+      for (int x = 0; x < output_width; ++x) {
+        float input_x = x * width_scale;
+        int32 x0 = static_cast<int32>(input_x);
+        int32 x1 = std::min(x0 + 1, input_width - 1);
+
+        int32 input_offset[4] = {
+            Offset(input_dims, 0, x0, y0, b), Offset(input_dims, 0, x1, y0, b),
+            Offset(input_dims, 0, x0, y1, b), Offset(input_dims, 0, x1, y1, b)};
+        float scale[4] = {(1 - (input_y - y0)) * (1 - (input_x - x0)),
+                          (1 - (input_y - y0)) * (input_x - x0),
+                          (input_y - y0) * (1 - (input_x - x0)),
+                          (input_y - y0) * (input_x - x0)};
+
+        for (int d = 0; d < depth; d++) {
+          const T* input_ptr = &input_data[d];
+          *output_ptr++ = static_cast<T>(input_ptr[input_offset[0]] * scale[0] +
+                                         input_ptr[input_offset[1]] * scale[1] +
+                                         input_ptr[input_offset[2]] * scale[2] +
+                                         input_ptr[input_offset[3]] * scale[3]);
+        }
+      }
+    }
+  }
+}
+
 inline void ResizeBilinear(const float* input_data, const Dims<4>& input_dims,
                            const int32* output_size_data,
                            const Dims<4>& output_size_dims, float* output_data,
@@ -5553,6 +5804,41 @@ inline void ResizeBilinear(const float* input_data, const Dims<4>& input_dims,
   }
 }
 
+// TODO(prabhumk): This is not a real quantized bilinear. It does not use int8
+// or int16 arithmetic.
+inline void ResizeBilinear(const uint8* input_data, const Dims<4>& input_dims,
+                           const int32* output_size_data,
+                           const Dims<4>& output_size_dims, uint8* output_data,
+                           const Dims<4>& output_dims, bool align_corners) {
+  gemmlowp::ScopedProfilingLabel label("ResizeBilinear");
+  int32 batches = MatchingArraySize(input_dims, 3, output_dims, 3);
+  int32 input_height = ArraySize(input_dims, 2);
+  int32 input_width = ArraySize(input_dims, 1);
+  int32 depth = MatchingArraySize(input_dims, 0, output_dims, 0);
+
+  TFLITE_DCHECK_EQ(ArraySize(output_size_dims, 3), 1);
+  TFLITE_DCHECK_EQ(ArraySize(output_size_dims, 2), 1);
+  TFLITE_DCHECK_EQ(ArraySize(output_size_dims, 1), 1);
+  TFLITE_DCHECK_EQ(ArraySize(output_size_dims, 0), 2);
+  int32 output_height = output_size_data[Offset(output_size_dims, 0, 0, 0, 0)];
+  int32 output_width = output_size_data[Offset(output_size_dims, 1, 0, 0, 0)];
+
+  float height_scale =
+      (align_corners && output_height > 1)
+          ? (static_cast<float>(input_height - 1) / (output_height - 1))
+          : (static_cast<float>(input_height) / output_height);
+
+  float width_scale =
+      (align_corners && output_width > 1)
+          ? (static_cast<float>(input_width - 1) / (output_width - 1))
+          : (static_cast<float>(input_width) / output_width);
+
+  ResizeBilinearGenericSmallChannel<uint8>(
+      input_data, input_dims, output_data, output_dims, batches, input_height,
+      input_width, depth, output_height, output_width, height_scale,
+      width_scale);
+}
+
 // legacy, for compatibility with old checked-in code
 inline void ResizeBilinear(const float* input_data, const Dims<4>& input_dims,
                            const int32* output_size_data,
@@ -5562,6 +5848,15 @@ inline void ResizeBilinear(const float* input_data, const Dims<4>& input_dims,
                  output_data, output_dims, /*align_corners=*/false);
 }
 
+// legacy, for compatibility with old checked-in code
+inline void ResizeBilinear(const uint8* input_data, const Dims<4>& input_dims,
+                           const int32* output_size_data,
+                           const Dims<4>& output_size_dims, uint8* output_data,
+                           const Dims<4>& output_dims) {
+  ResizeBilinear(input_data, input_dims, output_size_data, output_size_dims,
+                 output_data, output_dims, /*align_corners=*/false);
+}
+
 template <typename T>
 inline void SpaceToBatchND(const T* input_data, const Dims<4>& input_dims,
                            const int32* block_shape_data,
@@ -6080,8 +6375,8 @@ inline void TransposeConv(const float* input_data, const Dims<4>& input_dims,
   // To optimize, start by using the conv code with transposed weights for the
   // case of stride_height = stride_width = 1.
   const int batches = MatchingArraySize(input_dims, 3, output_dims, 3);
-  const int input_depth = MatchingArraySize(input_dims, 0, filter_dims, 3);
-  const int output_depth = MatchingArraySize(filter_dims, 0, output_dims, 0);
+  const int input_depth = MatchingArraySize(input_dims, 0, filter_dims, 0);
+  const int output_depth = MatchingArraySize(filter_dims, 3, output_dims, 0);
   const int input_height = ArraySize(input_dims, 2);
   const int input_width = ArraySize(input_dims, 1);
   const int filter_height = ArraySize(filter_dims, 2);
@@ -6128,8 +6423,8 @@ inline void TransposeConv(const float* input_data, const Dims<4>& input_dims,
                   float input_value = input_data[Offset(input_dims, in_channel,
                                                         in_x, in_y, batch)];
                   float filter_value =
-                      filter_data[Offset(filter_dims, out_channel, filter_x,
-                                         filter_y, in_channel)];
+                      filter_data[Offset(filter_dims, in_channel, filter_x,
+                                         filter_y, out_channel)];
                   output_data[Offset(output_dims, out_channel, out_x, out_y,
                                      batch)] += input_value * filter_value;
                 }
diff --git a/tensorflow/contrib/lite/kernels/internal/optimized/tensor_utils_impl.h b/tensorflow/contrib/lite/kernels/internal/optimized/tensor_utils_impl.h
index d570dadd86b4dc7c3abe341a4955320367330b9c..f14667090f5c3867c7992211272063239f3b92aa 100644
--- a/tensorflow/contrib/lite/kernels/internal/optimized/tensor_utils_impl.h
+++ b/tensorflow/contrib/lite/kernels/internal/optimized/tensor_utils_impl.h
@@ -127,6 +127,10 @@ void PortableZeroVector(float* vector, int v_size);
 // Limit a float input f between +abs_limit and -abs_limit.
 float PortableClip(float f, float abs_limit);
 
+// Check if all entries of a vector are zero.
+bool PortableIsZeroVector(const float* vector, int v_size);
+bool NeonIsZeroVector(const float* vector, int v_size);
+
 // Symmetric quantizer.
 void PortableSymmetricQuantizeFloats(const float* values, const int size,
                                      int8_t* quantized_values, float* min,
diff --git a/tensorflow/contrib/lite/kernels/internal/quantization_util.cc b/tensorflow/contrib/lite/kernels/internal/quantization_util.cc
index b0951aac8cbb98a181d9dcaef88770fadfc74f62..57ee859115cddbcbccae24ff639e848340d8e2ee 100644
--- a/tensorflow/contrib/lite/kernels/internal/quantization_util.cc
+++ b/tensorflow/contrib/lite/kernels/internal/quantization_util.cc
@@ -48,15 +48,15 @@ void QuantizeMultiplierGreaterThanOne(double double_multiplier,
   TFLITE_CHECK_GE(*left_shift, 0);
 }
 
-void QuantizeMultiplierSmallerThanOne(double double_multiplier,
-                                      int32_t* quantized_multiplier,
-                                      int* right_shift) {
+void QuantizeMultiplierSmallerThanOneExp(double double_multiplier,
+                                         int32_t* quantized_multiplier,
+                                         int* left_shift) {
   TFLITE_CHECK_LT(double_multiplier, 1.);
   TFLITE_CHECK_GT(double_multiplier, 0.);
   int shift;
   QuantizeMultiplier(double_multiplier, quantized_multiplier, &shift);
   TFLITE_CHECK_LE(shift, 0);
-  *right_shift = -shift;
+  *left_shift = shift;
 }
 
 void PreprocessSoftmaxScaling(double beta, double input_scale,
@@ -78,20 +78,21 @@ void PreprocessSoftmaxScaling(double beta, double input_scale,
                                    quantized_multiplier, left_shift);
 }
 
-void PreprocessLogSoftmaxScaling(double beta, double input_scale,
-                                 int input_integer_bits,
-                                 int32_t* quantized_multiplier, int* left_shift,
-                                 int32_t* reverse_scaling_divisor,
-                                 int* reverse_scaling_right_shift) {
+void PreprocessLogSoftmaxScalingExp(double beta, double input_scale,
+                                    int input_integer_bits,
+                                    int32_t* quantized_multiplier,
+                                    int* left_shift,
+                                    int32_t* reverse_scaling_divisor,
+                                    int* reverse_scaling_left_shift) {
   PreprocessSoftmaxScaling(beta, input_scale, input_integer_bits,
                            quantized_multiplier, left_shift);
 
   // Also calculate what amounts to the inverse scaling factor for the input.
   const double real_reverse_scaling_divisor =
       (1 << (31 - *left_shift)) / static_cast<double>(*quantized_multiplier);
-  tflite::QuantizeMultiplierSmallerThanOne(real_reverse_scaling_divisor,
-                                           reverse_scaling_divisor,
-                                           reverse_scaling_right_shift);
+  tflite::QuantizeMultiplierSmallerThanOneExp(real_reverse_scaling_divisor,
+                                              reverse_scaling_divisor,
+                                              reverse_scaling_left_shift);
 }
 
 int CalculateInputRadius(int input_integer_bits, int input_left_shift) {
diff --git a/tensorflow/contrib/lite/kernels/internal/quantization_util.h b/tensorflow/contrib/lite/kernels/internal/quantization_util.h
index 4a217515f142b2451ebd61e423871b95cdc09748..182ee782c76fcccedc99327d47805b49bfb8580d 100644
--- a/tensorflow/contrib/lite/kernels/internal/quantization_util.h
+++ b/tensorflow/contrib/lite/kernels/internal/quantization_util.h
@@ -167,9 +167,9 @@ IntOut SafeCast(FloatIn x) {
 // this is intended as a RIGHT-shift.
 //
 // Restricted to the case where the multiplier < 1 (and non-negative).
-void QuantizeMultiplierSmallerThanOne(double double_multiplier,
-                                      int32_t* quantized_multiplier,
-                                      int* right_shift);
+void QuantizeMultiplierSmallerThanOneExp(double double_multiplier,
+                                         int32_t* quantized_multiplier,
+                                         int* left_shift);
 
 // Decompose a double multiplier into a Q0.31 int32 representation of its
 // significand, and shift representation of its exponent.
@@ -197,11 +197,12 @@ void PreprocessSoftmaxScaling(double beta, double input_scale,
                               int input_integer_bits,
                               int32_t* quantized_multiplier, int* left_shift);
 // Like PreprocessSoftmaxScaling, but inverse scaling factors also calculated.
-void PreprocessLogSoftmaxScaling(double beta, double input_scale,
-                                 int input_integer_bits,
-                                 int32_t* quantized_multiplier, int* left_shift,
-                                 int32_t* reverse_scaling_divisor,
-                                 int* reverse_scaling_right_shift);
+void PreprocessLogSoftmaxScalingExp(double beta, double input_scale,
+                                    int input_integer_bits,
+                                    int32_t* quantized_multiplier,
+                                    int* left_shift,
+                                    int32_t* reverse_scaling_divisor,
+                                    int* reverse_scaling_left_shift);
 // Calculate the largest input that will result in a within-bounds intermediate
 // result within MultiplyByQuantizedMultiplierGreaterThanOne.  In other words,
 // it must not overflow before we reduce the value by multiplication by the
diff --git a/tensorflow/contrib/lite/kernels/internal/quantization_util_test.cc b/tensorflow/contrib/lite/kernels/internal/quantization_util_test.cc
index 2d74b3d3849812a2dc95fabcd680aa280c99ca55..94773b47d3817d7ed7240f74545ad04e7fa4bd52 100644
--- a/tensorflow/contrib/lite/kernels/internal/quantization_util_test.cc
+++ b/tensorflow/contrib/lite/kernels/internal/quantization_util_test.cc
@@ -196,21 +196,21 @@ TEST(QuantizationUtilTest, ChooseQuantizationParamsInvalidRange) {
   EXPECT_DEATH(ChooseQuantizationParams<uint8>(10.0, -30.0), "");
 }
 
-TEST(QuantizationUtilTest, QuantizeMultiplierSmallerThanOne) {
+TEST(QuantizationUtilTest, QuantizeMultiplierSmallerThanOneExp) {
   auto quantize = [](double d) {
     int32_t q;
     int s;
-    QuantizeMultiplierSmallerThanOne(d, &q, &s);
+    QuantizeMultiplierSmallerThanOneExp(d, &q, &s);
     return std::pair<int32_t, int>{q, s};
   };
 
   EXPECT_DEATH(quantize(-0.1), "");
   EXPECT_DEATH(quantize(0.0), "");
-  EXPECT_THAT(quantize(0.25), Pair(1073741824, 1));
+  EXPECT_THAT(quantize(0.25), Pair(1073741824, -1));
 
   // Around 0.5 we can see the change in exponent and how we try hard to
   // void hitting max int32.
-  EXPECT_THAT(quantize(0.50 - 5e-9), Pair(2147483627, 1));
+  EXPECT_THAT(quantize(0.50 - 5e-9), Pair(2147483627, -1));
   EXPECT_THAT(quantize(0.50 - 1e-10), Pair(1073741824, 0));
   EXPECT_THAT(quantize(0.50), Pair(1073741824, 0));
 
diff --git a/tensorflow/contrib/lite/kernels/internal/reference/depthwiseconv_uint8.h b/tensorflow/contrib/lite/kernels/internal/reference/depthwiseconv_uint8.h
index e9b6baeaee87d22aef238410bc9f447509a81c47..d57739279f44ad2a9fff2bd4dca21047b8147f2a 100644
--- a/tensorflow/contrib/lite/kernels/internal/reference/depthwiseconv_uint8.h
+++ b/tensorflow/contrib/lite/kernels/internal/reference/depthwiseconv_uint8.h
@@ -76,8 +76,8 @@ inline void DepthwiseConv(const uint8* input_data, const Dims<4>& input_dims,
             if (bias_data) {
               acc += bias_data[Offset(bias_dims, oc, 0, 0, 0)];
             }
-            acc = MultiplyByQuantizedMultiplierSmallerThanOne(
-                acc, output_multiplier, output_shift);
+            acc = MultiplyByQuantizedMultiplier(acc, output_multiplier,
+                                                -output_shift);
             acc += output_offset;
             acc = std::max(acc, output_activation_min);
             acc = std::min(acc, output_activation_max);
diff --git a/tensorflow/contrib/lite/kernels/internal/reference/portable_tensor_utils.cc b/tensorflow/contrib/lite/kernels/internal/reference/portable_tensor_utils.cc
index 2607adc0c18aeaa8dc2061e0e95a307205700a08..f8c6f341f7e61529bbbac592f9caf115f6121e0c 100644
--- a/tensorflow/contrib/lite/kernels/internal/reference/portable_tensor_utils.cc
+++ b/tensorflow/contrib/lite/kernels/internal/reference/portable_tensor_utils.cc
@@ -29,9 +29,18 @@ float PortableClip(float f, float abs_limit) {
   return result;
 }
 
+bool PortableIsZeroVector(const float* vector, int v_size) {
+  for (int i = 0; i < v_size; ++i) {
+    if (*vector++ != 0.0f) return false;
+  }
+  return true;
+}
+
 void PortableSymmetricQuantizeFloats(const float* values, const int size,
-                                     int8_t* quantized_values, float* min,
-                                     float* max, float* scaling_factor) {
+                                     int8_t* quantized_values,
+                                     float* __restrict__ min,
+                                     float* __restrict__ max,
+                                     float* __restrict__ scaling_factor) {
   auto minmax = std::minmax_element(values, values + size);
   *min = *minmax.first;
   *max = *minmax.second;
@@ -71,13 +80,14 @@ void PortableMatrixBatchVectorMultiplyAccumulate(const float* matrix,
 
 void PortableMatrixBatchVectorMultiplyAccumulate(
     const int8_t* __restrict__ matrix, const int m_rows, const int m_cols,
-    const int8_t* __restrict__ vectors, const float* scaling_factors,
-    int n_batch, float* __restrict__ result, int result_stride) {
+    const int8_t* __restrict__ vectors,
+    const float* __restrict__ scaling_factors, int n_batch,
+    float* __restrict__ result, int result_stride) {
   int batch, row, col;
   for (batch = 0; batch < n_batch; ++batch, vectors += m_cols) {
     const float batch_scaling_factor_inv = 1.0 / scaling_factors[batch];
     // Get the address of the first row.
-    int8_t* row_ptr = (int8_t*)matrix;  // NOLINT
+    const int8_t* row_ptr = matrix;
     for (row = 0; row < m_rows; ++row, result += result_stride) {
       // Initialize the dot product sum for the row to 0.
       int32_t dotprod = 0;
diff --git a/tensorflow/contrib/lite/kernels/internal/reference/portable_tensor_utils.h b/tensorflow/contrib/lite/kernels/internal/reference/portable_tensor_utils.h
index 1757a9f5e5299401fb9fef1d38870bd4e63ba3c1..d2e1fecd25cf3d11d3daffcc566dc1d5df97128c 100644
--- a/tensorflow/contrib/lite/kernels/internal/reference/portable_tensor_utils.h
+++ b/tensorflow/contrib/lite/kernels/internal/reference/portable_tensor_utils.h
@@ -25,6 +25,8 @@ namespace tensor_utils {
 // Limit a float input f between +abs_limit and -abs_limit.
 float PortableClip(float f, float abs_limit);
 
+bool PortableIsZeroVector(const float* vector, int v_size);
+
 void PortableSymmetricQuantizeFloats(const float* values, const int size,
                                      int8_t* quantized_values, float* min,
                                      float* max, float* scaling_factor);
@@ -112,6 +114,10 @@ void PortableReductionSumVector(const float* input_vector, float* output_vector,
 
 float Clip(float f, float abs_limit) { return PortableClip(f, abs_limit); }
 
+bool IsZeroVector(const float* vector, int v_size) {
+  return PortableIsZeroVector(vector, v_size);
+}
+
 void SymmetricQuantizeFloats(const float* values, const int size,
                              int8_t* quantized_values, float* min, float* max,
                              float* scaling_factor) {
diff --git a/tensorflow/contrib/lite/kernels/internal/reference/reference_ops.h b/tensorflow/contrib/lite/kernels/internal/reference/reference_ops.h
index 15fac237edae79ea8d778b6e28694acb8de48668..07c215ebcb6f65b33df7086a2ffc1c3d1f04c605 100644
--- a/tensorflow/contrib/lite/kernels/internal/reference/reference_ops.h
+++ b/tensorflow/contrib/lite/kernels/internal/reference/reference_ops.h
@@ -33,8 +33,131 @@ limitations under the License.
 #include "tensorflow/contrib/lite/kernels/internal/types.h"
 
 namespace tflite {
+
+// TODO(b/77858996): Add these to gemmlowp.
+template <typename IntegerType>
+IntegerType SaturatingAddNonGemmlowp(IntegerType a, IntegerType b) {
+  static_assert(std::is_same<IntegerType, void>::value, "unimplemented");
+  return a;
+}
+
+template <>
+inline std::int32_t SaturatingAddNonGemmlowp(std::int32_t a, std::int32_t b) {
+  std::int64_t a64 = a;
+  std::int64_t b64 = b;
+  std::int64_t sum = a64 + b64;
+  return static_cast<std::int32_t>(std::min(
+      static_cast<std::int64_t>(std::numeric_limits<std::int32_t>::max()),
+      std::max(
+          static_cast<std::int64_t>(std::numeric_limits<std::int32_t>::min()),
+          sum)));
+}
+
+template <typename tRawType, int tIntegerBits>
+gemmlowp::FixedPoint<tRawType, tIntegerBits> SaturatingAddNonGemmlowp(
+    gemmlowp::FixedPoint<tRawType, tIntegerBits> a,
+    gemmlowp::FixedPoint<tRawType, tIntegerBits> b) {
+  return gemmlowp::FixedPoint<tRawType, tIntegerBits>::FromRaw(
+      SaturatingAddNonGemmlowp(a.raw(), b.raw()));
+}
+
+template <typename IntegerType>
+IntegerType SaturatingSub(IntegerType a, IntegerType b) {
+  static_assert(std::is_same<IntegerType, void>::value, "unimplemented");
+  return a;
+}
+
+template <>
+inline std::int16_t SaturatingSub(std::int16_t a, std::int16_t b) {
+  std::int32_t a32 = a;
+  std::int32_t b32 = b;
+  std::int32_t diff = a32 - b32;
+  return static_cast<std::int16_t>(std::min(32767, std::max(-32768, diff)));
+}
+
+template <>
+inline std::int32_t SaturatingSub(std::int32_t a, std::int32_t b) {
+  std::int64_t a64 = a;
+  std::int64_t b64 = b;
+  std::int64_t diff = a64 - b64;
+  return static_cast<std::int32_t>(std::min(
+      static_cast<std::int64_t>(std::numeric_limits<std::int32_t>::max()),
+      std::max(
+          static_cast<std::int64_t>(std::numeric_limits<std::int32_t>::min()),
+          diff)));
+}
+
+template <typename tRawType, int tIntegerBits>
+gemmlowp::FixedPoint<tRawType, tIntegerBits> SaturatingSub(
+    gemmlowp::FixedPoint<tRawType, tIntegerBits> a,
+    gemmlowp::FixedPoint<tRawType, tIntegerBits> b) {
+  return gemmlowp::FixedPoint<tRawType, tIntegerBits>::FromRaw(
+      SaturatingSub(a.raw(), b.raw()));
+}
+// End section to be moved to gemmlowp.
+
 namespace reference_ops {
 
+// TODO(b/80247582) Remove this constant.
+// This will be phased out as the shifts are revised with more thought. Use of a
+// constant enables us to track progress on this work.
+//
+// Used mainly to convert from old-style shifts (right) to new-style (left).
+static constexpr int kReverseShift = -1;
+
+template <typename T>
+int CountLeadingZeros(T integer_input) {
+  static_assert(std::is_unsigned<T>::value,
+                "Only unsigned integer types handled.");
+  if (integer_input == 0) {
+    return std::numeric_limits<T>::digits;
+  }
+  const T one_in_leading_positive = static_cast<T>(1)
+                                    << (std::numeric_limits<T>::digits - 1);
+  int leading_zeros = 0;
+  while (integer_input < one_in_leading_positive) {
+    integer_input <<= 1;
+    ++leading_zeros;
+  }
+  return leading_zeros;
+}
+
+template <typename IntegerType>
+IntegerType SaturatingRoundingMultiplyByPOTParam(IntegerType x, int exponent) {
+  if (exponent == 0) {
+    return x;
+  }
+  using ScalarIntegerType =
+      typename gemmlowp::FixedPointRawTypeTraits<IntegerType>::ScalarRawType;
+  const IntegerType min =
+      gemmlowp::Dup<IntegerType>(std::numeric_limits<ScalarIntegerType>::min());
+  const IntegerType max =
+      gemmlowp::Dup<IntegerType>(std::numeric_limits<ScalarIntegerType>::max());
+  const int ScalarIntegerTypeBits = 8 * sizeof(ScalarIntegerType);
+
+  const std::int32_t threshold =
+      ((1 << (ScalarIntegerTypeBits - 1 - exponent)) - 1);
+  const IntegerType positive_mask =
+      gemmlowp::MaskIfGreaterThan(x, gemmlowp::Dup<IntegerType>(threshold));
+  const IntegerType negative_mask =
+      gemmlowp::MaskIfLessThan(x, gemmlowp::Dup<IntegerType>(-threshold));
+
+  IntegerType result = gemmlowp::ShiftLeft(x, exponent);
+  result = gemmlowp::SelectUsingMask(positive_mask, max, result);
+  result = gemmlowp::SelectUsingMask(negative_mask, min, result);
+  return result;
+}
+
+// If we want to leave IntegerBits fixed, then multiplication
+// by a power of two has to be saturating/rounding, not exact anymore.
+template <typename tRawType, int tIntegerBits>
+gemmlowp::FixedPoint<tRawType, tIntegerBits>
+SaturatingRoundingMultiplyByPOTParam(
+    gemmlowp::FixedPoint<tRawType, tIntegerBits> a, int exponent) {
+  return gemmlowp::FixedPoint<tRawType, tIntegerBits>::FromRaw(
+      SaturatingRoundingMultiplyByPOTParam(a.raw(), exponent));
+}
+
 // DO NOT USE THIS STRUCT FOR NEW FUNCTIONALITY BEYOND IMPLEMENTING ELEMENT-WISE
 // BROADCASTING.
 //
@@ -291,8 +414,8 @@ inline void Conv(const uint8* input_data, const Dims<4>& input_dims,
           if (bias_data) {
             acc += bias_data[Offset(bias_dims, out_channel, 0, 0, 0)];
           }
-          acc = MultiplyByQuantizedMultiplierSmallerThanOne(
-              acc, output_multiplier, output_shift);
+          acc = MultiplyByQuantizedMultiplierSmallerThanOneExp(
+              acc, output_multiplier, kReverseShift * output_shift);
           acc += output_offset;
           acc = std::max(acc, output_activation_min);
           acc = std::min(acc, output_activation_max);
@@ -515,8 +638,8 @@ inline void FullyConnected(const uint8* input_data, const Dims<4>& input_dims,
       if (bias_data) {
         acc += bias_data[Offset(bias_dims, out_c, 0, 0, 0)];
       }
-      acc = MultiplyByQuantizedMultiplierSmallerThanOne(acc, output_multiplier,
-                                                        output_shift);
+      acc = MultiplyByQuantizedMultiplierSmallerThanOneExp(
+          acc, output_multiplier, kReverseShift * output_shift);
       acc += output_offset;
       acc = std::max(acc, output_activation_min);
       acc = std::min(acc, output_activation_max);
@@ -845,8 +968,9 @@ void L2Normalization(const float* input_data, const Dims<4>& input_dims,
   }
 }
 
-inline void GetInvSqrtQuantizedMultiplier(int32 input, int32* output_inv_sqrt,
-                                          int* output_shift) {
+inline void GetInvSqrtQuantizedMultiplierExp(int32 input,
+                                             int32* output_inv_sqrt,
+                                             int* output_shift) {
   *output_shift = 11;
   while (input >= (1 << 29)) {
     input /= 4;
@@ -888,6 +1012,7 @@ inline void GetInvSqrtQuantizedMultiplier(int32 input, int32* output_inv_sqrt,
     *output_inv_sqrt <<= -*output_shift;
     *output_shift = 0;
   }
+  *output_shift *= kReverseShift;
 }
 
 inline void L2Normalization(const uint8* input_data, const Dims<4>& input_dims,
@@ -895,25 +1020,28 @@ inline void L2Normalization(const uint8* input_data, const Dims<4>& input_dims,
                             const Dims<4>& output_dims) {
   const int depth = MatchingArraySize(input_dims, 0, output_dims, 0);
   const int outer_size = MatchingFlatSizeSkipDim(input_dims, 0, output_dims);
-  TFLITE_DCHECK_EQ(outer_size, 1);
-  int32 square_l2_norm = 0;
-  for (int i = 0; i < depth; i++) {
-    int32 diff = input_data[Offset(input_dims, i, 0, 0, 0)] - input_zero_point;
-    square_l2_norm += diff * diff;
-  }
-  int32 inv_l2norm_multiplier;
-  int inv_l2norm_shift;
-  GetInvSqrtQuantizedMultiplier(square_l2_norm, &inv_l2norm_multiplier,
-                                &inv_l2norm_shift);
-
-  for (int i = 0; i < depth; i++) {
-    int32 diff = input_data[Offset(input_dims, i, 0, 0, 0)] - input_zero_point;
-    int32 rescaled_diff = MultiplyByQuantizedMultiplierSmallerThanOne(
-        128 * diff, inv_l2norm_multiplier, inv_l2norm_shift);
-    int32 unclamped_output_val = 128 + rescaled_diff;
-    int32 output_val = std::min(255, std::max(0, unclamped_output_val));
-    output_data[Offset(output_dims, i, 0, 0, 0)] =
-        static_cast<uint8>(output_val);
+  for (int i = 0; i < outer_size; ++i) {
+    int32 square_l2_norm = 0;
+    for (int c = 0; c < depth; c++) {
+      int32 diff =
+          input_data[Offset(input_dims, c, i, 0, 0)] - input_zero_point;
+      square_l2_norm += diff * diff;
+    }
+    int32 inv_l2norm_multiplier;
+    int inv_l2norm_shift;
+    GetInvSqrtQuantizedMultiplierExp(square_l2_norm, &inv_l2norm_multiplier,
+                                     &inv_l2norm_shift);
+
+    for (int c = 0; c < depth; c++) {
+      int32 diff =
+          input_data[Offset(input_dims, c, i, 0, 0)] - input_zero_point;
+      int32 rescaled_diff = MultiplyByQuantizedMultiplierSmallerThanOneExp(
+          128 * diff, inv_l2norm_multiplier, inv_l2norm_shift);
+      int32 unclamped_output_val = 128 + rescaled_diff;
+      int32 output_val = std::min(255, std::max(0, unclamped_output_val));
+      output_data[Offset(output_dims, c, i, 0, 0)] =
+          static_cast<uint8>(output_val);
+    }
   }
 }
 
@@ -979,15 +1107,17 @@ inline void Add(int left_shift, const uint8* input1_data,
           const int32 shifted_input1_val = input1_val * (1 << left_shift);
           const int32 shifted_input2_val = input2_val * (1 << left_shift);
           const int32 scaled_input1_val =
-              MultiplyByQuantizedMultiplierSmallerThanOne(
-                  shifted_input1_val, input1_multiplier, input1_shift);
+              MultiplyByQuantizedMultiplierSmallerThanOneExp(
+                  shifted_input1_val, input1_multiplier,
+                  kReverseShift * input1_shift);
           const int32 scaled_input2_val =
-              MultiplyByQuantizedMultiplierSmallerThanOne(
-                  shifted_input2_val, input2_multiplier, input2_shift);
+              MultiplyByQuantizedMultiplierSmallerThanOneExp(
+                  shifted_input2_val, input2_multiplier,
+                  kReverseShift * input2_shift);
           const int32 raw_sum = scaled_input1_val + scaled_input2_val;
           const int32 raw_output =
-              MultiplyByQuantizedMultiplierSmallerThanOne(
-                  raw_sum, output_multiplier, output_shift) +
+              MultiplyByQuantizedMultiplierSmallerThanOneExp(
+                  raw_sum, output_multiplier, kReverseShift * output_shift) +
               output_offset;
           const int32 clamped_output =
               std::min(output_activation_max,
@@ -1133,15 +1263,17 @@ inline void BroadcastAdd(int left_shift, const uint8* input1_data,
           const int32 shifted_input1_val = input1_val * (1 << left_shift);
           const int32 shifted_input2_val = input2_val * (1 << left_shift);
           const int32 scaled_input1_val =
-              MultiplyByQuantizedMultiplierSmallerThanOne(
-                  shifted_input1_val, input1_multiplier, input1_shift);
+              MultiplyByQuantizedMultiplierSmallerThanOneExp(
+                  shifted_input1_val, input1_multiplier,
+                  kReverseShift * input1_shift);
           const int32 scaled_input2_val =
-              MultiplyByQuantizedMultiplierSmallerThanOne(
-                  shifted_input2_val, input2_multiplier, input2_shift);
+              MultiplyByQuantizedMultiplierSmallerThanOneExp(
+                  shifted_input2_val, input2_multiplier,
+                  kReverseShift * input2_shift);
           const int32 raw_sum = scaled_input1_val + scaled_input2_val;
           const int32 raw_output =
-              MultiplyByQuantizedMultiplierSmallerThanOne(
-                  raw_sum, output_multiplier, output_shift) +
+              MultiplyByQuantizedMultiplierSmallerThanOneExp(
+                  raw_sum, output_multiplier, kReverseShift * output_shift) +
               output_offset;
           const int32 clamped_output =
               std::min(output_activation_max,
@@ -1186,15 +1318,17 @@ inline void BroadcastAddFivefold(
             const int32 shifted_input1_val = input1_val * (1 << left_shift);
             const int32 shifted_input2_val = input2_val * (1 << left_shift);
             const int32 scaled_input1_val =
-                MultiplyByQuantizedMultiplierSmallerThanOne(
-                    shifted_input1_val, input1_multiplier, input1_shift);
+                MultiplyByQuantizedMultiplierSmallerThanOneExp(
+                    shifted_input1_val, input1_multiplier,
+                    kReverseShift * input1_shift);
             const int32 scaled_input2_val =
-                MultiplyByQuantizedMultiplierSmallerThanOne(
-                    shifted_input2_val, input2_multiplier, input2_shift);
+                MultiplyByQuantizedMultiplierSmallerThanOneExp(
+                    shifted_input2_val, input2_multiplier,
+                    kReverseShift * input2_shift);
             const int32 raw_sum = scaled_input1_val + scaled_input2_val;
             const int32 raw_output =
-                MultiplyByQuantizedMultiplierSmallerThanOne(
-                    raw_sum, output_multiplier, output_shift) +
+                MultiplyByQuantizedMultiplierSmallerThanOneExp(
+                    raw_sum, output_multiplier, kReverseShift * output_shift) +
                 output_offset;
             const int32 clamped_output =
                 std::min(output_activation_max,
@@ -1374,9 +1508,9 @@ inline void BroadcastMul(const uint8* input1_data, const Dims<4>& input1_dims,
           const int32 input2_val =
               input2_offset + input2_data[SubscriptToIndex(desc2, c, x, y, b)];
           const int32 unclamped_result =
-              output_offset +
-              MultiplyByQuantizedMultiplierSmallerThanOne(
-                  input1_val * input2_val, output_multiplier, output_shift);
+              output_offset + MultiplyByQuantizedMultiplierSmallerThanOneExp(
+                                  input1_val * input2_val, output_multiplier,
+                                  kReverseShift * output_shift);
           const int32 clamped_output =
               std::min(output_activation_max,
                        std::max(output_activation_min, unclamped_result));
@@ -1590,15 +1724,17 @@ inline void BroadcastSub(int left_shift, const uint8* input1_data,
           const int32 shifted_input1_val = input1_val * (1 << left_shift);
           const int32 shifted_input2_val = input2_val * (1 << left_shift);
           const int32 scaled_input1_val =
-              MultiplyByQuantizedMultiplierSmallerThanOne(
-                  shifted_input1_val, input1_multiplier, input1_shift);
+              MultiplyByQuantizedMultiplierSmallerThanOneExp(
+                  shifted_input1_val, input1_multiplier,
+                  kReverseShift * input1_shift);
           const int32 scaled_input2_val =
-              MultiplyByQuantizedMultiplierSmallerThanOne(
-                  shifted_input2_val, input2_multiplier, input2_shift);
+              MultiplyByQuantizedMultiplierSmallerThanOneExp(
+                  shifted_input2_val, input2_multiplier,
+                  kReverseShift * input2_shift);
           const int32 raw_sub = scaled_input1_val - scaled_input2_val;
           const int32 raw_output =
-              MultiplyByQuantizedMultiplierSmallerThanOne(
-                  raw_sub, output_multiplier, output_shift) +
+              MultiplyByQuantizedMultiplierSmallerThanOneExp(
+                  raw_sub, output_multiplier, kReverseShift * output_shift) +
               output_offset;
           const int32 clamped_output =
               std::min(output_activation_max,
@@ -2639,6 +2775,121 @@ inline void LogSoftmax(const float* input_data, const Dims<4>& input_dims,
   }
 }
 
+// Although currently the name of this function says that it cannot handle
+// values less than 1, in practice it can handle as low as 1/x_max, where
+// x_max is the largest representable input.  In other words, the output range
+// is symmetric.
+template <int OutputIntegerBits, int InputIntegerBits>
+inline gemmlowp::FixedPoint<int32, OutputIntegerBits>
+log_x_for_x_greater_than_or_equal_to_1_impl(
+    gemmlowp::FixedPoint<int32, InputIntegerBits> input_val) {
+  using FixedPoint0 = gemmlowp::FixedPoint<int32, 0>;
+  // The reason for accumulating the result with an extra bit of headroom is
+  // that z_pow_2_adj * log_2 might be saturated, and adding num_scaled *
+  // recip_denom will otherwise introduce an error.
+  static constexpr int kAccumIntegerBits = OutputIntegerBits + 1;
+  using FixedPointAccum = gemmlowp::FixedPoint<int32, kAccumIntegerBits>;
+
+  const FixedPoint0 log_2 = GEMMLOWP_CHECKED_FIXEDPOINT_CONSTANT(
+      FixedPoint0, 1488522236, std::log(2.0));
+  const FixedPoint0 sqrt_sqrt_half = GEMMLOWP_CHECKED_FIXEDPOINT_CONSTANT(
+      FixedPoint0, 1805811301, std::sqrt(std::sqrt(0.5)));
+  const FixedPoint0 sqrt_half = GEMMLOWP_CHECKED_FIXEDPOINT_CONSTANT(
+      FixedPoint0, 1518500250, std::sqrt(0.5));
+  const FixedPoint0 one_quarter =
+      GEMMLOWP_CHECKED_FIXEDPOINT_CONSTANT(FixedPoint0, 536870912, 1.0 / 4.0);
+
+  const FixedPoint0 alpha_n = GEMMLOWP_CHECKED_FIXEDPOINT_CONSTANT(
+      FixedPoint0, 117049297, 11.0 / 240.0 * std::sqrt(std::sqrt(2.0)));
+  const FixedPoint0 alpha_d = GEMMLOWP_CHECKED_FIXEDPOINT_CONSTANT(
+      FixedPoint0, 127690142, 1.0 / 20.0 * std::sqrt(std::sqrt(2.0)));
+  const FixedPoint0 alpha_i = GEMMLOWP_CHECKED_FIXEDPOINT_CONSTANT(
+      FixedPoint0, 1057819769,
+      2.0 / std::sqrt(std::sqrt(2.0)) - std::sqrt(std::sqrt(2.0)));
+  const FixedPoint0 alpha_f = GEMMLOWP_CHECKED_FIXEDPOINT_CONSTANT(
+      FixedPoint0, 638450708, 1.0 / 4.0 * std::sqrt(std::sqrt(2.0)));
+
+  const FixedPointAccum shifted_quarter =
+      gemmlowp::Rescale<kAccumIntegerBits>(one_quarter);
+
+  // Reinterpret the input value as Q0.31, because we will figure out the
+  // required shift "ourselves" instead of using, say, Rescale.
+  FixedPoint0 z_a = FixedPoint0::FromRaw(input_val.raw());
+  // z_a_pow_2 = input_integer_bits - z_a_headroom;
+  int z_a_headroom_plus_1 = CountLeadingZeros(static_cast<uint32>(z_a.raw()));
+  FixedPoint0 r_a_tmp =
+      SaturatingRoundingMultiplyByPOTParam(z_a, (z_a_headroom_plus_1 - 1));
+  const int32 r_a_raw =
+      SaturatingRoundingMultiplyByPOTParam((r_a_tmp * sqrt_half).raw(), 1);
+  // z_pow_2_adj = max(z_pow_2_a - 0.75, z_pow_2_b - 0.25);
+  // z_pow_2_adj = max(InputIntegerBits - z_a_headroom_plus_1 + 0.25,
+  //                   InputIntegerBits - z_b_headroom - 0.25);
+  const FixedPointAccum z_a_pow_2_adj = SaturatingAddNonGemmlowp(
+      FixedPointAccum::FromRaw(SaturatingRoundingMultiplyByPOTParam(
+          InputIntegerBits - z_a_headroom_plus_1, 31 - kAccumIntegerBits)),
+      shifted_quarter);
+
+  // z_b is treated like z_a, but premultiplying by sqrt(0.5).
+  FixedPoint0 z_b = z_a * sqrt_half;
+  int z_b_headroom = CountLeadingZeros(static_cast<uint32>(z_b.raw())) - 1;
+  const int32 r_b_raw =
+      SaturatingRoundingMultiplyByPOTParam(z_a.raw(), z_b_headroom);
+  const FixedPointAccum z_b_pow_2_adj = SaturatingSub(
+      FixedPointAccum::FromRaw(SaturatingRoundingMultiplyByPOTParam(
+          InputIntegerBits - z_b_headroom, 31 - kAccumIntegerBits)),
+      shifted_quarter);
+
+  const FixedPoint0 r = FixedPoint0::FromRaw(std::min(r_a_raw, r_b_raw));
+  const FixedPointAccum z_pow_2_adj = FixedPointAccum::FromRaw(
+      std::max(z_a_pow_2_adj.raw(), z_b_pow_2_adj.raw()));
+
+  const FixedPoint0 p = gemmlowp::RoundingHalfSum(r, sqrt_sqrt_half);
+  FixedPoint0 q = r - sqrt_sqrt_half;
+  q = q + q;
+
+  const FixedPoint0 common_sq = q * q;
+  const FixedPoint0 num = q * r + q * common_sq * alpha_n;
+  const FixedPoint0 denom_minus_one_0 =
+      p * (alpha_i + q + alpha_d * common_sq) + alpha_f * q;
+  const FixedPoint0 recip_denom =
+      one_over_one_plus_x_for_x_in_0_1(denom_minus_one_0);
+
+  const FixedPointAccum num_scaled = gemmlowp::Rescale<kAccumIntegerBits>(num);
+  return gemmlowp::Rescale<OutputIntegerBits>(z_pow_2_adj * log_2 +
+                                              num_scaled * recip_denom);
+}
+
+// Minimum output bits to accommodate log of maximum input range.  It actually
+// does not matter if one considers, say, [-64,64] or [-64,64).
+//
+// For example, run this through Octave:
+// [0:127; ...
+//  ceil(log(abs( log(2.^(0:127))+1 ))/log(2)); ...
+//  ceil(log(abs( log(2.^(0:127))+1 ))/log(2))]
+constexpr int min_log_x_output_bits(int input_bits) {
+  return input_bits > 90
+             ? 7
+             : input_bits > 44
+                   ? 6
+                   : input_bits > 21
+                         ? 5
+                         : input_bits > 10
+                               ? 4
+                               : input_bits > 4 ? 3 : input_bits > 1 ? 2 : 1;
+}
+
+template <int OutputIntegerBits, int InputIntegerBits>
+inline gemmlowp::FixedPoint<int32, OutputIntegerBits>
+log_x_for_x_greater_than_or_equal_to_1(
+    gemmlowp::FixedPoint<int32, InputIntegerBits> input_val) {
+  static_assert(
+      OutputIntegerBits >= min_log_x_output_bits(InputIntegerBits),
+      "Output integer bits must be sufficent to accommodate logs of inputs.");
+  return log_x_for_x_greater_than_or_equal_to_1_impl<OutputIntegerBits,
+                                                     InputIntegerBits>(
+      input_val);
+}
+
 inline void LogSoftmax(const uint8* input_data, const Dims<4>& input_dims,
                        int32 input_multiplier, int32 input_left_shift,
                        int32 reverse_scaling_divisor,
@@ -2681,13 +2932,10 @@ inline void LogSoftmax(const uint8* input_data, const Dims<4>& input_dims,
       }
     }
 
-    // TODO(b/77858996): Implement fixed-point log().
-    // Not a fully-quantized implementation: floating-point log().
-    const float float_log_sum_of_exps =
-        std::log(static_cast<float>(sum_of_exps.raw()) /
-                 (1 << (31 - kAccumulationIntegerBits)));
-    const int32 fixed_log_sum_of_exps = static_cast<int32>(TfLiteRound(
-        float_log_sum_of_exps * (1 << (31 - kScaledDiffIntegerBits))));
+    const int32 fixed_log_sum_of_exps =
+        log_x_for_x_greater_than_or_equal_to_1<kScaledDiffIntegerBits>(
+            sum_of_exps)
+            .raw();
 
     // rescaled_diff_min is smallest representable in
     // Q(kScaledDiffIntegerBits).(31-kScaledDiffIntegerBits) plus the
@@ -2698,9 +2946,9 @@ inline void LogSoftmax(const uint8* input_data, const Dims<4>& input_dims,
         fixed_log_sum_of_exps + std::numeric_limits<int32>::lowest();
     const int adjusted_diff_min =
         std::max(diff_min - 1,  // Note use of > below instead of >= above.
-                 MultiplyByQuantizedMultiplierSmallerThanOne(
+                 MultiplyByQuantizedMultiplierSmallerThanOneExp(
                      rescaled_diff_min, reverse_scaling_divisor,
-                     reverse_scaling_right_shift));
+                     kReverseShift * reverse_scaling_right_shift));
 
     for (int c = 0; c < depth; ++c) {
       int32 input_diff =
@@ -2956,9 +3204,10 @@ inline void Gather(const T* input_data, const Dims<4>& input_dims,
   }
 }
 
-inline void ResizeBilinear(const float* input_data, const Dims<4>& input_dims,
+template <typename T>
+inline void ResizeBilinear(const T* input_data, const Dims<4>& input_dims,
                            const int32* output_size_data,
-                           const Dims<4>& output_size_dims, float* output_data,
+                           const Dims<4>& output_size_dims, T* output_data,
                            const Dims<4>& output_dims, bool align_corners) {
   int32 batches = MatchingArraySize(input_dims, 3, output_dims, 3);
   int32 input_height = ArraySize(input_dims, 2);
@@ -2990,15 +3239,15 @@ inline void ResizeBilinear(const float* input_data, const Dims<4>& input_dims,
         int32 x0 = static_cast<int32>(std::floor(input_x));
         int32 x1 = std::min(x0 + 1, input_width - 1);
         for (int c = 0; c < depth; ++c) {
-          float interpolation = input_data[Offset(input_dims, c, x0, y0, b)] *
-                                    (1 - (input_y - y0)) *
-                                    (1 - (input_x - x0)) +
-                                input_data[Offset(input_dims, c, x0, y1, b)] *
-                                    (input_y - y0) * (1 - (input_x - x0)) +
-                                input_data[Offset(input_dims, c, x1, y0, b)] *
-                                    (1 - (input_y - y0)) * (input_x - x0) +
-                                input_data[Offset(input_dims, c, x1, y1, b)] *
-                                    (input_y - y0) * (input_x - x0);
+          T interpolation =
+              static_cast<T>(input_data[Offset(input_dims, c, x0, y0, b)] *
+                                 (1 - (input_y - y0)) * (1 - (input_x - x0)) +
+                             input_data[Offset(input_dims, c, x0, y1, b)] *
+                                 (input_y - y0) * (1 - (input_x - x0)) +
+                             input_data[Offset(input_dims, c, x1, y0, b)] *
+                                 (1 - (input_y - y0)) * (input_x - x0) +
+                             input_data[Offset(input_dims, c, x1, y1, b)] *
+                                 (input_y - y0) * (input_x - x0));
           output_data[Offset(output_dims, c, x, y, b)] = interpolation;
         }
       }
@@ -3011,8 +3260,18 @@ inline void ResizeBilinear(const float* input_data, const Dims<4>& input_dims,
                            const int32* output_size_data,
                            const Dims<4>& output_size_dims, float* output_data,
                            const Dims<4>& output_dims) {
-  ResizeBilinear(input_data, input_dims, output_size_data, output_size_dims,
-                 output_data, output_dims, /*align_corners=*/false);
+  ResizeBilinear<float>(input_data, input_dims, output_size_data,
+                        output_size_dims, output_data, output_dims,
+                        /*align_corners=*/false);
+}
+
+inline void ResizeBilinear(const uint8* input_data, const Dims<4>& input_dims,
+                           const int32* output_size_data,
+                           const Dims<4>& output_size_dims, uint8* output_data,
+                           const Dims<4>& output_dims) {
+  ResizeBilinear<uint8>(input_data, input_dims, output_size_data,
+                        output_size_dims, output_data, output_dims,
+                        /*align_corners=*/false);
 }
 
 template <typename T>
@@ -3259,63 +3518,124 @@ inline void Exp(const T* input_data, const size_t num_elements,
   }
 }
 
+// A generic reduce method that can be used for reduce_sum, reduce_mean, etc.
+// It takes a reducer function as input and returns false when numeric overflow
+// is detected.
+// This method iterates through input data and reduce elements along the
+// dimensions given in axis.
+template <typename In, typename Out>
+inline bool Reduce(const In* input_data, const int* input_dims,
+                   const int* output_dims, const int input_num_dims,
+                   const int output_num_dims, const int* axis,
+                   const int num_axis, int* input_iter,
+                   Out reducer(Out current, const In in, bool* overflow),
+                   Out* output_data) {
+  // Reset input iterator.
+  TFLITE_DCHECK(input_num_dims > 0);
+  for (int idx = 0; idx < input_num_dims; ++idx) {
+    input_iter[idx] = 0;
+  }
+  // Iterate through input_data.
+  do {
+    size_t input_offset =
+        ReducedOutputOffset(input_num_dims, input_dims, input_iter, 0, nullptr);
+    size_t output_offset = ReducedOutputOffset(input_num_dims, input_dims,
+                                               input_iter, num_axis, axis);
+    bool overflow = false;
+    output_data[output_offset] = reducer(output_data[output_offset],
+                                         input_data[input_offset], &overflow);
+    if (overflow) return false;
+  } while (NextIndex(input_num_dims, input_dims, input_iter));
+  return true;
+}
+
+inline bool ResolveAxis(const int num_dims, const int* axis, const int num_axis,
+                        int* out_axis, int* out_num_axis) {
+  *out_num_axis = 0;  // Just in case.
+  // o(n^2) is fine since out_num_axis should be really small, mostly <= 4
+  for (int idx = 0; idx < num_axis; ++idx) {
+    // Handle negative index.
+    int current = axis[idx] < 0 ? (axis[idx] + num_dims) : axis[idx];
+    TFLITE_DCHECK(current >= 0 && current < num_dims);
+    bool is_dup = false;
+    for (int j = 0; j < *out_num_axis; ++j) {
+      if (out_axis[j] == current) {
+        is_dup = true;
+        break;
+      }
+    }
+    if (!is_dup) {
+      out_axis[*out_num_axis] = current;
+      *out_num_axis += 1;
+    }
+  }
+  return true;
+}
+
+// This method expects that output_data has been initialized.
+template <typename In, typename Out>
+inline bool ReduceSumImpl(const In* input_data, const int* input_dims,
+                          const int* output_dims, const int input_num_dims,
+                          const int output_num_dims, const int* axis,
+                          const int num_axis, int* input_iter,
+                          Out* output_data) {
+  auto reducer = [](Out current, const In in, bool* overflow) -> Out {
+    const Out actual_in = static_cast<Out>(in);
+    return current + actual_in;
+  };
+  return Reduce<In, Out>(input_data, input_dims, output_dims, input_num_dims,
+                         output_num_dims, axis, num_axis, input_iter, reducer,
+                         output_data);
+}
+
+// Computes the mean of elements across dimensions given in axis.
+// It does so in two stages, first calculates the sum of elements along the axis
+// then divides it by the number of element in axis.
 template <typename T, typename U>
 inline bool Mean(const T* input_data, const int* input_dims,
                  const int input_num_dims, T* output_data,
                  const int* output_dims, const int output_num_dims,
                  const int* axis, const int num_axis_dimensions, bool keep_dims,
                  int* temp_index, int* resolved_axis, U* temp_sum) {
-  // resets output data.
+  // Reset output data.
   size_t num_outputs = 1;
   for (int idx = 0; idx < output_num_dims; ++idx) {
-    num_outputs *= static_cast<size_t>(output_dims[idx]);
+    size_t current = static_cast<size_t>(output_dims[idx]);
+    // Overflow prevention.
+    if (num_outputs > std::numeric_limits<size_t>::max() / current) {
+      return false;
+    }
+    num_outputs *= current;
   }
   for (size_t idx = 0; idx < num_outputs; ++idx) {
     output_data[idx] = T();
     temp_sum[idx] = U();
   }
-  // resets temp index.
-  for (int idx = 0; idx < input_num_dims; ++idx) {
-    temp_index[idx] = 0;
-  }
-  // resolves axis.
+
+  // Resolve axis.
   int num_resolved_axis = 0;
-  for (int idx = 0; idx < num_axis_dimensions; ++idx) {
-    int current = axis[idx];
-    TFLITE_DCHECK(current < input_num_dims && current + input_num_dims >= 0);
-    if (current < 0) {
-      current += input_num_dims;
-    }
-    bool is_dup = false;
-    for (int j = 0; j < num_resolved_axis; ++j) {
-      if (resolved_axis[j] == current) {
-        is_dup = true;
-        break;
-      }
-    }
-    if (!is_dup) {
-      resolved_axis[num_resolved_axis++] = current;
-    }
+  if (!ResolveAxis(input_num_dims, axis, num_axis_dimensions, resolved_axis,
+                   &num_resolved_axis)) {
+    return false;
   }
-  // iterates through input_data.
-  for (bool has_next = true; has_next;
-       has_next = NextIndex(input_num_dims, input_dims, temp_index)) {
-    size_t input_offset =
-        ReducedOutputOffset(input_num_dims, input_dims, temp_index, 0, nullptr);
-    size_t output_offset =
-        ReducedOutputOffset(input_num_dims, input_dims, temp_index,
-                            num_resolved_axis, resolved_axis);
-    temp_sum[output_offset] += static_cast<U>(input_data[input_offset]);
-  }
-  // takes average by num of elements added to get mean.
-  size_t num_elements_in_axis = 1;
+
+  if (!ReduceSumImpl<T, U>(input_data, input_dims, output_dims, input_num_dims,
+                           output_num_dims, resolved_axis, num_resolved_axis,
+                           temp_index, temp_sum)) {
+    return false;
+  }
+
+  // Calculate mean by dividing output_data by num of aggregated element.
+  U num_elements_in_axis = 1;
   for (int idx = 0; idx < num_resolved_axis; ++idx) {
     size_t current = static_cast<size_t>(input_dims[resolved_axis[idx]]);
+    // Overflow prevention.
     if (current > (std::numeric_limits<U>::max() / num_elements_in_axis)) {
       return false;
     }
     num_elements_in_axis *= current;
   }
+
   if (num_elements_in_axis > 0) {
     for (size_t idx = 0; idx < num_outputs; ++idx) {
       output_data[idx] =
@@ -3503,8 +3823,8 @@ inline void TransposeConv(const float* input_data, const Dims<4>& input_dims,
                           int pad_height, float* output_data,
                           const Dims<4>& output_dims) {
   const int batches = MatchingArraySize(input_dims, 3, output_dims, 3);
-  const int input_depth = MatchingArraySize(input_dims, 0, filter_dims, 3);
-  const int output_depth = MatchingArraySize(filter_dims, 0, output_dims, 0);
+  const int input_depth = MatchingArraySize(input_dims, 0, filter_dims, 0);
+  const int output_depth = MatchingArraySize(filter_dims, 3, output_dims, 0);
   const int input_height = ArraySize(input_dims, 2);
   const int input_width = ArraySize(input_dims, 1);
   const int filter_height = ArraySize(filter_dims, 2);
@@ -3544,8 +3864,8 @@ inline void TransposeConv(const float* input_data, const Dims<4>& input_dims,
                   float input_value = input_data[Offset(input_dims, in_channel,
                                                         in_x, in_y, batch)];
                   float filter_value =
-                      filter_data[Offset(filter_dims, out_channel, filter_x,
-                                         filter_y, in_channel)];
+                      filter_data[Offset(filter_dims, in_channel, filter_x,
+                                         filter_y, out_channel)];
                   output_data[Offset(output_dims, out_channel, out_x, out_y,
                                      batch)] += input_value * filter_value;
                 }
@@ -3558,6 +3878,16 @@ inline void TransposeConv(const float* input_data, const Dims<4>& input_dims,
   }
 }
 
+template <typename T>
+inline bool EqualFn(T lhs, T rhs) {
+  return lhs == rhs;
+}
+
+template <typename T>
+inline bool NotEqualFn(T lhs, T rhs) {
+  return lhs != rhs;
+}
+
 template <typename T>
 inline bool GreaterFn(T lhs, T rhs) {
   return lhs > rhs;
@@ -3604,10 +3934,14 @@ inline void Comparison(int left_shift, const T* input1_data,
     const int32 input2_val = input2_offset + input2_data[i];
     const int32 shifted_input1_val = input1_val * (1 << left_shift);
     const int32 shifted_input2_val = input2_val * (1 << left_shift);
-    const int32 scaled_input1_val = MultiplyByQuantizedMultiplierSmallerThanOne(
-        shifted_input1_val, input1_multiplier, input1_shift);
-    const int32 scaled_input2_val = MultiplyByQuantizedMultiplierSmallerThanOne(
-        shifted_input2_val, input2_multiplier, input2_shift);
+    const int32 scaled_input1_val =
+        MultiplyByQuantizedMultiplierSmallerThanOneExp(
+            shifted_input1_val, input1_multiplier,
+            kReverseShift * input1_shift);
+    const int32 scaled_input2_val =
+        MultiplyByQuantizedMultiplierSmallerThanOneExp(
+            shifted_input2_val, input2_multiplier,
+            kReverseShift * input2_shift);
     output_data[i] = F(scaled_input1_val, scaled_input2_val);
   }
 }
@@ -3656,11 +3990,13 @@ inline void BroadcastComparison(int left_shift, const T* input1_data,
           const int32 shifted_input1_val = input1_val * (1 << left_shift);
           const int32 shifted_input2_val = input2_val * (1 << left_shift);
           const int32 scaled_input1_val =
-              MultiplyByQuantizedMultiplierSmallerThanOne(
-                  shifted_input1_val, input1_multiplier, input1_shift);
+              MultiplyByQuantizedMultiplierSmallerThanOneExp(
+                  shifted_input1_val, input1_multiplier,
+                  kReverseShift * input1_shift);
           const int32 scaled_input2_val =
-              MultiplyByQuantizedMultiplierSmallerThanOne(
-                  shifted_input2_val, input2_multiplier, input2_shift);
+              MultiplyByQuantizedMultiplierSmallerThanOneExp(
+                  shifted_input2_val, input2_multiplier,
+                  kReverseShift * input2_shift);
           output_data[Offset(output_dims, c, x, y, b)] =
               F(scaled_input1_val, scaled_input2_val);
         }
@@ -3715,6 +4051,8 @@ inline void BroadcastComparison(int left_shift, const T* input1_data,
                                      input2_offset, input2_multiplier,        \
                                      input2_shift, output_data, output_dims); \
   }
+TFLITE_COMPARISON_OP(Equal);
+TFLITE_COMPARISON_OP(NotEqual);
 TFLITE_COMPARISON_OP(Greater);
 TFLITE_COMPARISON_OP(GreaterEqual);
 TFLITE_COMPARISON_OP(Less);
@@ -3754,6 +4092,42 @@ inline void RankOneSelect(const D* input_condition_data,
   }
 }
 
+// For easy implementation, the indices is always a vector of size-4 vectors.
+template <typename T, typename I>
+inline void SparseToDense(const std::vector<std::vector<I>>& indices,
+                          const T* values, T default_value, T* output_data,
+                          const Dims<4>& output_dims, bool value_is_scalar) {
+  const int value_count = indices.size();
+
+  // First fill the output_data with default value.
+  const int num_elements = FlatSize(output_dims);
+  for (int i = 0; i < num_elements; ++i) {
+    output_data[i] = default_value;
+  }
+
+  // Special handle for value is scalar case to avoid checking the boolean
+  // condition within the loop every time.
+  if (value_is_scalar) {
+    for (int i = 0; i < value_count; ++i) {
+      const std::vector<I>& index = indices[i];
+      TFLITE_DCHECK_EQ(index.size(), 4);
+      const T value = *values;  // just use the first value.
+      output_data[Offset(output_dims, index[3], index[2], index[1], index[0])] =
+          value;
+    }
+    return;
+  }
+
+  // Go through the values and indices to fill the sparse values.
+  for (int i = 0; i < value_count; ++i) {
+    const std::vector<I>& index = indices[i];
+    TFLITE_DCHECK_EQ(index.size(), 4);
+    const T value = values[i];
+    output_data[Offset(output_dims, index[3], index[2], index[1], index[0])] =
+        value;
+  }
+}
+
 }  // namespace reference_ops
 }  // namespace tflite
 
diff --git a/tensorflow/contrib/lite/kernels/internal/resize_bilinear_test.cc b/tensorflow/contrib/lite/kernels/internal/resize_bilinear_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..3d8765f11b2941ef5871c7db8e3582e506713aa6
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/internal/resize_bilinear_test.cc
@@ -0,0 +1,136 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <algorithm>
+#include <cmath>
+#include <vector>
+
+#include <gtest/gtest.h>
+#include "tensorflow/contrib/lite/kernels/internal/optimized/optimized_ops.h"
+#include "tensorflow/contrib/lite/kernels/internal/reference/reference_ops.h"
+#include "tensorflow/contrib/lite/kernels/internal/test_util.h"
+#include "tensorflow/contrib/lite/kernels/internal/types.h"
+
+namespace tflite {
+namespace {
+template <typename T>
+void TestOneResizeBilinear(int batch, int depth, int input_width,
+                           int input_height, int output_width,
+                           int output_height, float error_threshold) {
+  Dims<4> input_dims_inference =
+      MakeDimsForInference(depth, input_width, input_height, batch);
+  Dims<4> output_dims_inference =
+      MakeDimsForInference(depth, output_width, output_height, batch);
+
+  const int input_buffer_size = RequiredBufferSizeForDims(input_dims_inference);
+  const int output_buffer_size =
+      RequiredBufferSizeForDims(output_dims_inference);
+
+  std::vector<T> input_data(input_buffer_size, 0);
+  std::vector<T> reference_output_data(output_buffer_size, 0);
+  // Initialize the output data with something other than zero, so we can catch
+  // issue with kernels failing to initialize the output.
+  std::vector<T> output_data(output_buffer_size, 3);
+
+  const T min_amplitude = static_cast<T>(0);
+  const T max_amplitude = static_cast<T>(255);
+  FillRandom(&input_data, min_amplitude, max_amplitude);
+
+  Dims<4> output_size_dims = MakeDimsForInference(2, 1, 1, 1);
+  std::vector<int32> output_size_data = {output_height, output_width};
+
+  reference_ops::ResizeBilinear(
+      input_data.data(), input_dims_inference, output_size_data.data(),
+      output_size_dims, reference_output_data.data(), output_dims_inference);
+  optimized_ops::ResizeBilinear(input_data.data(), input_dims_inference,
+                                output_size_data.data(), output_size_dims,
+                                output_data.data(), output_dims_inference);
+
+  double sum_diff = 0;
+  float max_abs_val = 0;
+  for (int i = 0; i < output_buffer_size; i++) {
+    sum_diff += std::abs(static_cast<float>(output_data[i]) -
+                         static_cast<float>(reference_output_data[i]));
+    max_abs_val = std::max(
+        max_abs_val, std::abs(static_cast<float>(reference_output_data[i])));
+  }
+
+  if (sum_diff != 0.f) {
+    const float mean_diff = static_cast<float>(sum_diff / output_buffer_size);
+    const float relative_error = std::abs(mean_diff) / max_abs_val;
+    ASSERT_LT(relative_error, error_threshold);
+  }
+}
+
+TEST(ResizeBilinear, TestResizeBilinear8Bit) {
+  const int kTestsToRun = 100 * 1000;
+  for (int i = 0; i < kTestsToRun; i++) {
+    const int batch = ExponentialRandomPositiveInt(0.9f, 3, 20);
+    const int depth = ExponentialRandomPositiveInt(0.9f, 6, 50);
+    const int input_width = ExponentialRandomPositiveInt(0.9f, 20, 200);
+    const int input_height = ExponentialRandomPositiveInt(0.9f, 20, 200);
+    const int output_width = ExponentialRandomPositiveInt(0.9f, 20, 200);
+    const int output_height = ExponentialRandomPositiveInt(0.9f, 20, 200);
+
+    TestOneResizeBilinear<uint8>(batch, depth, input_width, input_height,
+                                 output_width, output_height, 0.025);
+  }
+}
+
+TEST(ResizeBilinear2x2, TestResizeBilinear8Bit) {
+  const int kTestsToRun = 100 * 1000;
+  for (int i = 0; i < kTestsToRun; i++) {
+    const int batch = ExponentialRandomPositiveInt(0.9f, 3, 20);
+    const int depth = ExponentialRandomPositiveInt(0.9f, 6, 50);
+    const int input_width = ExponentialRandomPositiveInt(0.9f, 20, 200);
+    const int input_height = ExponentialRandomPositiveInt(0.9f, 20, 200);
+    const int output_width = input_width * 2;
+    const int output_height = input_height * 2;
+
+    TestOneResizeBilinear<uint8>(batch, depth, input_width, input_height,
+                                 output_width, output_height, 1e-5);
+  }
+}
+
+TEST(ResizeBilinear, TestResizeBilinear) {
+  const int kTestsToRun = 100 * 1000;
+  for (int i = 0; i < kTestsToRun; i++) {
+    const int batch = ExponentialRandomPositiveInt(0.9f, 3, 20);
+    const int depth = ExponentialRandomPositiveInt(0.9f, 6, 50);
+    const int input_width = ExponentialRandomPositiveInt(0.9f, 20, 200);
+    const int input_height = ExponentialRandomPositiveInt(0.9f, 20, 200);
+    const int output_width = ExponentialRandomPositiveInt(0.9f, 20, 200);
+    const int output_height = ExponentialRandomPositiveInt(0.9f, 20, 200);
+
+    TestOneResizeBilinear<float>(batch, depth, input_width, input_height,
+                                 output_width, output_height, 1e-5);
+  }
+}
+
+TEST(ResizeBilinear2x2, TestResizeBilinear) {
+  const int kTestsToRun = 100 * 1000;
+  for (int i = 0; i < kTestsToRun; i++) {
+    const int batch = ExponentialRandomPositiveInt(0.9f, 3, 20);
+    const int depth = ExponentialRandomPositiveInt(0.9f, 6, 50);
+    const int input_width = ExponentialRandomPositiveInt(0.9f, 20, 200);
+    const int input_height = ExponentialRandomPositiveInt(0.9f, 20, 200);
+    const int output_width = input_width * 2;
+    const int output_height = input_height * 2;
+
+    TestOneResizeBilinear<float>(batch, depth, input_width, input_height,
+                                 output_width, output_height, 1e-5);
+  }
+}
+}  // namespace
+}  // namespace tflite
diff --git a/tensorflow/contrib/lite/kernels/internal/softmax_quantized_test.cc b/tensorflow/contrib/lite/kernels/internal/softmax_quantized_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..d781a7b642036f3c5ddaa366f257fe26511c83c3
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/internal/softmax_quantized_test.cc
@@ -0,0 +1,227 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <algorithm>
+#include <cmath>
+#include <cstdlib>
+#include <functional>
+#include <iterator>
+#include <limits>
+#include <random>
+#include <string>
+#include <vector>
+
+#include <gtest/gtest.h>
+#include "tensorflow/contrib/lite/kernels/internal/optimized/optimized_ops.h"
+#include "tensorflow/contrib/lite/kernels/internal/quantization_util.h"
+#include "tensorflow/contrib/lite/kernels/internal/reference/reference_ops.h"
+#include "tensorflow/contrib/lite/kernels/internal/test_util.h"
+
+namespace tflite {
+namespace {
+
+void RunSoftmaxFloatReference(const uint8* input_data,
+                              const Dims<4>& dims_common, int32 input_offset,
+                              const double input_scale, int stride, float beta,
+                              uint8* reference_output_data) {
+  const int ref_buffer_size = RequiredBufferSizeForDims(dims_common);
+  std::vector<float> reference_dequant_data(ref_buffer_size);
+  std::vector<float> reference_output_float_data(ref_buffer_size);
+
+  // Reference data generated via Dequant of input into float, and then applying
+  // float Softmax.
+  reference_ops::Dequantize(input_data, dims_common, input_offset, input_scale,
+                            reference_dequant_data.data(), dims_common);
+  optimized_ops::Softmax(reference_dequant_data.data(), dims_common, beta,
+                         reference_output_float_data.data(), dims_common);
+  // Work with quantized scaling for Softmax, under which 256 represents 1, but
+  // we limit this to 255.
+  for (int i = 0; i < ref_buffer_size; i++) {
+    reference_output_data[i] = std::min(
+        255,
+        static_cast<int>(std::round(256.0f * reference_output_float_data[i])));
+  }
+}
+
+void CheckOutputData(const uint8* test_output, const uint8* reference_output,
+                     const Dims<4>& dims_common, const string& check_label,
+                     bool be_exacting) {
+  const int buffer_size = RequiredBufferSizeForDims(dims_common);
+  // While calculating some metrics in floating point, we work with quantized
+  // scaling.
+  std::vector<int> diff(buffer_size);
+  int64_t sum_diff = 0;
+  int64_t sum_abs_diff = 0;
+  for (int i = 0; i < buffer_size; i++) {
+    diff[i] = static_cast<int>(test_output[i]) - reference_output[i];
+    sum_diff += diff[i];
+    sum_abs_diff += std::abs(diff[i]);
+  }
+  // These stats help understand test failures.
+  std::sort(std::begin(diff), std::end(diff));
+  const int min_diff = diff.front();
+  const int max_diff = diff.back();
+  const int median_diff = diff[diff.size() / 2];
+  const float mean_diff = static_cast<float>(sum_diff) / buffer_size;
+  const float mean_abs_diff = static_cast<float>(sum_abs_diff) / buffer_size;
+  // We either check for bit exactness (against the reference quantized version)
+  // or for general accuracy, allowing off-by-one (against the float reference).
+  if (be_exacting) {
+    ASSERT_TRUE(std::abs(min_diff) == 0 && std::abs(max_diff) == 0);
+  } else {
+    // For small numbers of samples, the estimates of the means vary more.
+    // Rather than widen the tolerances, we skip the smaller tests.
+    ASSERT_TRUE(((std::abs(mean_diff) < 2e-2f && mean_abs_diff < 3e-2f) ||
+                 buffer_size < 10000) &&
+                std::abs(median_diff) == 0 && std::abs(min_diff) <= 1 &&
+                std::abs(max_diff) <= 1);
+  }
+}
+
+// Runs the Softmax and compares against the float reference implementation and
+// the quantized reference implementation.
+void RunOneSoftmaxTest(const uint8* input_data, const Dims<4>& dims_common,
+                       int32 input_offset, const double input_scale, int stride,
+                       float beta) {
+  const int buffer_size = RequiredBufferSizeForDims(dims_common);
+  std::vector<uint8> optimized_softmax_output(buffer_size);
+  std::vector<uint8> reference_float_softmax_output(buffer_size);
+  std::vector<uint8> reference_quant_softmax_output(buffer_size);
+
+  RunSoftmaxFloatReference(input_data, dims_common, input_offset, input_scale,
+                           stride, beta, reference_float_softmax_output.data());
+
+  int32 input_beta_multiplier;
+  int input_beta_left_shift;
+  static const int kScaledDiffIntegerBits = 5;
+  tflite::PreprocessSoftmaxScaling(beta, input_scale, kScaledDiffIntegerBits,
+                                   &input_beta_multiplier,
+                                   &input_beta_left_shift);
+  // diff_min has a negative value, and is used to limit the maximum magnitude
+  // of the diffs, which are <= 0.
+  const int diff_min = -tflite::CalculateInputRadius(kScaledDiffIntegerBits,
+                                                     input_beta_left_shift);
+
+  optimized_ops::Softmax(input_data, dims_common, input_beta_multiplier,
+                         input_beta_left_shift, diff_min,
+                         optimized_softmax_output.data(), dims_common);
+  reference_ops::Softmax(input_data, dims_common, input_beta_multiplier,
+                         input_beta_left_shift, diff_min,
+                         reference_quant_softmax_output.data(), dims_common);
+
+  CheckOutputData(optimized_softmax_output.data(),
+                  reference_float_softmax_output.data(), dims_common,
+                  "Optimized vs float reference", false);
+  CheckOutputData(optimized_softmax_output.data(),
+                  reference_quant_softmax_output.data(), dims_common,
+                  "Optimized vs quant reference", true);
+  CheckOutputData(reference_quant_softmax_output.data(),
+                  reference_float_softmax_output.data(), dims_common,
+                  "Quant reference vs float reference", false);
+}
+
+// This function picks some random Softmax params, which are checked for
+// desirability.  If not acceptable, it returns false. If they're OK,
+// it runs the Softmax test and returns true. This allows the caller
+// to loop until a test has been run.
+//
+// Currently we do not reject for any reason.
+bool TryOneUniformSoftmax() {
+  // We pick mostly positive values, on the whole emphasizing smaller values and
+  // therefore faster tests.  We test a wider range of depths.  In the case of
+  // Softmax, the width and height really just create test repetitions.
+  const int batch = ExponentialRandomPositiveInt(0.9f, 3, 20);
+  const int input_depth = ExponentialRandomPositiveInt(0.75f, 175, 500);
+  const int input_width = ExponentialRandomPositiveInt(0.8f, 20, 200);
+  const int input_height = ExponentialRandomPositiveInt(0.8f, 20, 200);
+  const int stride = ExponentialRandomPositiveInt(0.9f, 3, 8);
+  const double input_scale = std::pow(10.0, UniformRandomFloat(-2.0, 1.0));
+  const int32 input_offset = UniformRandomInt(-256, 0);
+  const float beta = 1.0f + ExponentialRandomPositiveFloat(0.9f, 2, 10);
+
+  Dims<4> dims_common =
+      MakeDimsForInference(input_depth, input_width, input_height, batch);
+  const int buffer_size = RequiredBufferSizeForDims(dims_common);
+
+  std::vector<uint8> input_data(buffer_size);
+  FillRandom(&input_data);
+  RunOneSoftmaxTest(input_data.data(), dims_common, input_offset, input_scale,
+                    stride, beta);
+  return true;
+}
+
+// See TryOneUniformSoftmax() for a general description.
+//
+// Tests with "skyscraper" input patterns are included for two reasons. (a)
+// Bimodal distributions are potentially challenging and perhaps more
+// realistic than simple uniform random inputs.  (b) Some implementations of
+// Softmax may adapt as they traverse the depth, and so we test handling of
+// cases where relatively small values are encountered at the beginning and end.
+bool TryOneSkyscraperSoftmax(bool small_depth) {
+  // We pick mostly positive values, on the whole emphasizing smaller values and
+  // therefore faster tests.  We test a wider range of depths.  In the case of
+  // Softmax, the width and height really just create test repetitions.
+  const int batch = ExponentialRandomPositiveInt(0.9f, 3, 20);
+  const int input_depth = small_depth
+                              ? ExponentialRandomPositiveInt(0.75f, 40, 500)
+                              : ExponentialRandomPositiveInt(0.75f, 175, 500);
+  const int input_width = ExponentialRandomPositiveInt(0.7f, 20, 200);
+  const int input_height = ExponentialRandomPositiveInt(0.7f, 20, 200);
+  const int stride = ExponentialRandomPositiveInt(0.9f, 3, 8);
+  const double input_scale = std::pow(10.0, UniformRandomFloat(-2.0, 1.0));
+  const int32 input_offset = UniformRandomInt(-256, 0);
+  const float beta = 1.0f + ExponentialRandomPositiveFloat(0.9f, 2, 10);
+  // Extra parameters for skyscraper input patterns.
+  const double middle_proportion =
+      ExponentialRandomPositiveFloat(0.65f, 0.1, 1.0);
+  const int middle_min = UniformRandomInt(0, 255);
+  const int sides_max = UniformRandomInt(0, middle_min);
+
+  Dims<4> dims_common =
+      MakeDimsForInference(input_depth, input_width, input_height, batch);
+  const int buffer_size = RequiredBufferSizeForDims(dims_common);
+
+  std::vector<uint8> input_data(buffer_size);
+  FillRandomSkyscraper(&input_data, input_depth, middle_proportion, middle_min,
+                       sides_max);
+  RunOneSoftmaxTest(input_data.data(), dims_common, input_offset, input_scale,
+                    stride, beta);
+  return true;
+}
+
+TEST(TestQuantizedSoftmax, UniformSoftmaxTests) {
+  const int kTestsToRun = 1000;
+  for (int i = 0; i < kTestsToRun; i++) {
+    while (!TryOneUniformSoftmax()) {
+    }
+  }
+}
+
+TEST(TestQuantizedSoftmax, SkyscraperSoftmaxTests) {
+  const int kTestsToRun = 1000;
+  for (int i = 0; i < kTestsToRun; i++) {
+    while (!TryOneSkyscraperSoftmax(false)) {
+    }
+  }
+}
+
+TEST(TestQuantizedSoftmax, SmallSkyscraperSoftmaxTests) {
+  const int kTestsToRun = 1000;
+  for (int i = 0; i < kTestsToRun; i++) {
+    while (!TryOneSkyscraperSoftmax(true)) {
+    }
+  }
+}
+}  // namespace
+}  // namespace tflite
diff --git a/tensorflow/contrib/lite/kernels/internal/tensor_utils.h b/tensorflow/contrib/lite/kernels/internal/tensor_utils.h
index e1c9ccd84b09fdca0241192ef4dff62ded096433..5160e22307ae0894fabd0e9c4f7b9cd38b00840e 100644
--- a/tensorflow/contrib/lite/kernels/internal/tensor_utils.h
+++ b/tensorflow/contrib/lite/kernels/internal/tensor_utils.h
@@ -23,6 +23,9 @@ namespace tensor_utils {
 // Limit a float input f between +abs_limit and -abs_limit.
 float Clip(float f, float abs_limit);
 
+// Checks if all entries of vector are zero.
+bool IsZeroVector(const float* vector, int v_size);
+
 // Quantizes a buffer of floating point values using a symmetric quantization
 // (i.e. linear quantization without an offset) to 8-bit signed integers.
 // It also outputs the range (min, max) of the floating point buffer, and the
diff --git a/tensorflow/contrib/lite/kernels/internal/tensor_utils_test.cc b/tensorflow/contrib/lite/kernels/internal/tensor_utils_test.cc
index 3d8a2eada0c30132b5646327da5cd9fd80ccd39f..14ee528394b6872d9e79969db0e431658277f56b 100644
--- a/tensorflow/contrib/lite/kernels/internal/tensor_utils_test.cc
+++ b/tensorflow/contrib/lite/kernels/internal/tensor_utils_test.cc
@@ -32,6 +32,25 @@ TEST(uKernels, ClipTest) {
                   {0.0, -0.5, 1.0, -1.5, 2.0, -2.0, 2.0, -2.0, 2.0, -2.0})));
 }
 
+TEST(uKernels, IsZeroTest) {
+  constexpr int kVectorSize = 21;
+  static float zeros[kVectorSize] = {0.0};
+  EXPECT_TRUE(IsZeroVector(zeros, kVectorSize));
+
+  static float nonzeros[kVectorSize] = {
+      1e-6,  1e-7,  1e-8,  1e-9,  1e-10, 1e-11, 1e-12,
+      1e-13, 1e-14, 1e-15, 1e-16, 1e-17, 1e-18, 1e-19,
+      1e-20, 1e-21, 1e-22, 1e-23, 1e-24, 1e-25, 1e-26};
+  EXPECT_FALSE(IsZeroVector(nonzeros, kVectorSize));
+}
+
+TEST(uKernels, GeneratedIsZeroTest) {
+  constexpr int kVectorSize = 39;
+  std::vector<float> input(kVectorSize);
+  ZeroVector(input.data(), kVectorSize);
+  EXPECT_TRUE(IsZeroVector(input.data(), kVectorSize));
+}
+
 TEST(uKernels, SymmetricQuantizeFloatsTest) {
   constexpr int kVectorSize = 9;
   static float input[kVectorSize] = {-640, -635.0, -630, 10.0,  2.0,
diff --git a/tensorflow/contrib/lite/kernels/internal/test_util.cc b/tensorflow/contrib/lite/kernels/internal/test_util.cc
new file mode 100644
index 0000000000000000000000000000000000000000..9b1fd9b344d99103ee2a1b5b95fd697ccf4a64d0
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/internal/test_util.cc
@@ -0,0 +1,121 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include "tensorflow/contrib/lite/kernels/internal/test_util.h"
+
+#include <cmath>
+#include <iterator>
+
+namespace tflite {
+
+Dims<4> MakeDimsForInference(int depth, int width, int height, int batch) {
+  Dims<4> result;
+  int cum_prod = 1;
+
+  result.sizes[0] = depth;
+  result.strides[0] = cum_prod;
+  cum_prod *= result.sizes[0];
+
+  result.sizes[1] = width;
+  result.strides[1] = cum_prod;
+  cum_prod *= result.sizes[1];
+
+  result.sizes[2] = height;
+  result.strides[2] = cum_prod;
+  cum_prod *= result.sizes[2];
+
+  result.sizes[3] = batch;
+  result.strides[3] = cum_prod;
+
+  return result;
+}
+
+// this is a copied from an internal function in propagate_fixed_sizes.cc
+bool ComputeConvSizes(Dims<4> input_dims, int output_depth, int filter_width,
+                      int filter_height, int stride, PaddingType padding_type,
+                      Dims<4>* output_dims, int* pad_width, int* pad_height) {
+  const int input_width = ArraySize(input_dims, 1);
+  const int input_height = ArraySize(input_dims, 2);
+  const int batch = ArraySize(input_dims, 3);
+
+  int output_height = 0;
+  int output_width = 0;
+  if (padding_type == PaddingType::kValid) {
+    output_height = (input_height + stride - filter_height) / stride;
+    output_width = (input_width + stride - filter_width) / stride;
+  } else if (padding_type == PaddingType::kSame) {
+    output_height = (input_height + stride - 1) / stride;
+    output_width = (input_width + stride - 1) / stride;
+  } else {
+    return false;
+  }
+
+  if (output_width <= 0 || output_height <= 0) {
+    return false;
+  }
+
+  *pad_height =
+      ((output_height - 1) * stride + filter_height - input_height) / 2;
+  *pad_width = ((output_width - 1) * stride + filter_width - input_width) / 2;
+  *output_dims =
+      MakeDimsForInference(output_depth, output_width, output_height, batch);
+  return true;
+}
+
+std::mt19937& RandomEngine() {
+  static std::mt19937 engine;
+  return engine;
+}
+
+int UniformRandomInt(int min, int max) {
+  std::uniform_int_distribution<int> dist(min, max);
+  return dist(RandomEngine());
+}
+
+float UniformRandomFloat(float min, float max) {
+  std::uniform_real_distribution<float> dist(min, max);
+  return dist(RandomEngine());
+}
+
+int ExponentialRandomPositiveInt(float percentile, int percentile_val,
+                                 int max_val) {
+  const float lambda =
+      -std::log(1.f - percentile) / static_cast<float>(percentile_val);
+  std::exponential_distribution<float> dist(lambda);
+  float val;
+  do {
+    val = dist(RandomEngine());
+  } while (!val || !std::isfinite(val) || val > max_val);
+  return static_cast<int>(std::ceil(val));
+}
+
+float ExponentialRandomPositiveFloat(float percentile, float percentile_val,
+                                     float max_val) {
+  const float lambda =
+      -std::log(1.f - percentile) / static_cast<float>(percentile_val);
+  std::exponential_distribution<float> dist(lambda);
+  float val;
+  do {
+    val = dist(RandomEngine());
+  } while (!std::isfinite(val) || val > max_val);
+  return val;
+}
+
+void FillRandom(std::vector<float>* vec, float min, float max) {
+  std::uniform_real_distribution<float> dist(min, max);
+  auto gen = std::bind(dist, RandomEngine());
+  std::generate(std::begin(*vec), std::end(*vec), gen);
+}
+
+}  // namespace tflite
diff --git a/tensorflow/contrib/lite/kernels/internal/test_util.h b/tensorflow/contrib/lite/kernels/internal/test_util.h
new file mode 100644
index 0000000000000000000000000000000000000000..26078cef49a7868c64ff0095898eebe5e4de8751
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/internal/test_util.h
@@ -0,0 +1,104 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#ifndef TENSORFLOW_CONTRIB_LITE_KERNELS_INTERNAL_TEST_UTIL_H_
+#define TENSORFLOW_CONTRIB_LITE_KERNELS_INTERNAL_TEST_UTIL_H_
+
+#include <algorithm>
+#include <functional>
+#include <iterator>
+#include <limits>
+#include <random>
+#include <vector>
+
+#include "tensorflow/contrib/lite/kernels/internal/types.h"
+
+namespace tflite {
+
+// Creates a Dims struct from a set of dimensions.
+Dims<4> MakeDimsForInference(int depth, int width, int height, int batch);
+
+// Computes output and padding dimensions.
+bool ComputeConvSizes(Dims<4> input_dims, int output_depth, int filter_width,
+                      int filter_height, int stride, PaddingType padding_type,
+                      Dims<4>* output_dims, int* pad_width, int* pad_height);
+
+// Returns a mt19937 random engine.
+std::mt19937& RandomEngine();
+
+// Returns a random integer uniformly distributed between |min| and |max|.
+int UniformRandomInt(int min, int max);
+
+// Returns a random float uniformly distributed between |min| and |max|.
+float UniformRandomFloat(float min, float max);
+
+// Returns a random element in |v|.
+template <typename T>
+const T& RandomElement(const std::vector<T>& v) {
+  return v[UniformRandomInt(0, v.size() - 1)];
+}
+
+// Returns a random exponentially distributed integer.
+int ExponentialRandomPositiveInt(float percentile, int percentile_val,
+                                 int max_val);
+
+// Returns a random exponentially distributed float.
+float ExponentialRandomPositiveFloat(float percentile, float percentile_val,
+                                     float max_val);
+
+// Fills a vector with random floats between |min| and |max|.
+void FillRandom(std::vector<float>* vec, float min, float max);
+
+// Fills a vector with random numbers between |min| and |max|.
+template <typename T>
+void FillRandom(std::vector<T>* vec, T min, T max) {
+  std::uniform_int_distribution<T> dist(min, max);
+  auto gen = std::bind(dist, RandomEngine());
+  std::generate(std::begin(*vec), std::end(*vec), gen);
+}
+
+// Fills a vector with random numbers.
+template <typename T>
+void FillRandom(std::vector<T>* vec) {
+  FillRandom(vec, std::numeric_limits<T>::min(), std::numeric_limits<T>::max());
+}
+
+template <typename T>
+void FillRandom(typename std::vector<T>::iterator begin_it,
+                typename std::vector<T>::iterator end_it, T min, T max) {
+  std::uniform_int_distribution<T> dist(min, max);
+  auto gen = std::bind(dist, RandomEngine());
+  std::generate(begin_it, end_it, gen);
+}
+
+// Fill with a "skyscraper" pattern, in which there is a central section (across
+// the depth) with higher values than the surround.
+template <typename T>
+void FillRandomSkyscraper(std::vector<T>* vec, int depth,
+                          double middle_proportion, uint8 middle_min,
+                          uint8 sides_max) {
+  for (auto base_it = std::begin(*vec); base_it != std::end(*vec);
+       base_it += depth) {
+    auto left_it = base_it + std::ceil(0.5 * depth * (1.0 - middle_proportion));
+    auto right_it =
+        base_it + std::ceil(0.5 * depth * (1.0 + middle_proportion));
+    FillRandom(base_it, left_it, std::numeric_limits<T>::min(), sides_max);
+    FillRandom(left_it, right_it, middle_min, std::numeric_limits<T>::max());
+    FillRandom(right_it, base_it + depth, std::numeric_limits<T>::min(),
+               sides_max);
+  }
+}
+
+}  // namespace tflite
+#endif  // TENSORFLOW_CONTRIB_LITE_KERNELS_INTERNAL_TEST_UTIL_H_
diff --git a/tensorflow/contrib/lite/kernels/internal/types.h b/tensorflow/contrib/lite/kernels/internal/types.h
index 43c68832785ac87f51c298370b50dc722167dc7f..3ecef15271c6eb9dd9d6dd370377fddda2723fcf 100644
--- a/tensorflow/contrib/lite/kernels/internal/types.h
+++ b/tensorflow/contrib/lite/kernels/internal/types.h
@@ -1,4 +1,4 @@
-/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
 
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
@@ -15,11 +15,15 @@ limitations under the License.
 #ifndef TENSORFLOW_CONTRIB_LITE_KERNELS_INTERNAL_TYPES_H_
 #define TENSORFLOW_CONTRIB_LITE_KERNELS_INTERNAL_TYPES_H_
 
+#include <cstring>
+#include <iterator>
+
 #include "tensorflow/contrib/lite/kernels/internal/compatibility.h"
 
 namespace tflite {
 
 enum class FusedActivationFunctionType : uint8 { kNone, kRelu6, kRelu1, kRelu };
+enum class PaddingType { kNone, kSame, kValid };
 
 // Quantization parameters, determining the mapping of quantized values
 // to real values (i.e. determining how quantized values are mathematically
@@ -43,6 +47,121 @@ struct Dims {
   int strides[N];
 };
 
+class RuntimeShape {
+ public:
+  // Shapes with dimensions up to 4 are stored directly in the structure, while
+  // larger shapes are separately allocated.
+  static constexpr int kMaxSmallSize = 4;
+
+  RuntimeShape() : size_(0) {}
+
+  explicit RuntimeShape(int dimensions_count) : size_(dimensions_count) {
+    if (dimensions_count > kMaxSmallSize) {
+      dims_pointer_ = new int32[dimensions_count];
+    }
+  }
+
+  RuntimeShape(int dimensions_count, const int32* dims_data) : size_(0) {
+    ReplaceWith(dimensions_count, dims_data);
+  }
+
+  ~RuntimeShape() {
+    if (size_ > kMaxSmallSize) {
+      delete[] dims_pointer_;
+    }
+  }
+
+  inline int32 DimensionsCount() const { return size_; }
+  inline int32 Dims(int i) const {
+    TFLITE_DCHECK_GE(i, 0);
+    TFLITE_DCHECK_LT(i, size_);
+    return size_ > kMaxSmallSize ? dims_pointer_[i] : dims_[i];
+  }
+  inline void SetDim(int i, int32 val) {
+    TFLITE_DCHECK_GE(i, 0);
+    TFLITE_DCHECK_LT(i, size_);
+    if (size_ > kMaxSmallSize) {
+      dims_pointer_[i] = val;
+    } else {
+      dims_[i] = val;
+    }
+  }
+  inline int32* DimsData() {
+    return size_ > kMaxSmallSize ? dims_pointer_ : dims_;
+  }
+  inline const int32* DimsData() const {
+    return size_ > kMaxSmallSize ? dims_pointer_ : dims_;
+  }
+
+  inline void Resize(int dimensions_count) {
+    if (size_ > kMaxSmallSize) {
+      delete[] dims_pointer_;
+    }
+    size_ = dimensions_count;
+    if (dimensions_count > kMaxSmallSize) {
+      dims_pointer_ = new int32[dimensions_count];
+    }
+  }
+
+  inline void ReplaceWith(int dimensions_count, const int32* dims_data) {
+    Resize(dimensions_count);
+    int32* dst_dims = DimsData();
+    std::memcpy(dst_dims, dims_data, dimensions_count * sizeof(int32));
+  }
+
+  template <typename T>
+  inline void BuildFrom(const T& src_iterable) {
+    const int dimensions_count =
+        std::distance(src_iterable.begin(), src_iterable.end());
+    Resize(dimensions_count);
+    int32* data = DimsData();
+    for (auto it : src_iterable) {
+      *data = it;
+      ++data;
+    }
+  }
+
+  inline void BuildFrom(const std::initializer_list<int> init_list) {
+    BuildFrom<const std::initializer_list<int>>(init_list);
+  }
+
+  // Returns the total count of elements, that is the size when flattened into a
+  // vector.
+  inline int FlatSize() const {
+    int buffer_size = 1;
+    const int* dims_data = DimsData();
+    for (int i = 0; i < size_; i++) {
+      const int dim = dims_data[i];
+      TFLITE_DCHECK_GE(dim, 1);
+      buffer_size *= dim;
+    }
+    return buffer_size;
+  }
+
+ private:
+  int32 size_;
+  union {
+    int32 dims_[kMaxSmallSize];
+    int32* dims_pointer_;
+  };
+};
+
+// Converts inference-style shape to legacy tflite::Dims<4>.
+inline tflite::Dims<4> ToRuntimeDims(const tflite::RuntimeShape& array_shape) {
+  tflite::Dims<4> result;
+  const int dimensions_count = array_shape.DimensionsCount();
+  TFLITE_CHECK_LE(dimensions_count, 4);
+  int cum_prod = 1;
+  for (int i = 0; i < 4; i++) {
+    const int new_dim =
+        (i < dimensions_count) ? array_shape.Dims(dimensions_count - 1 - i) : 1;
+    result.sizes[i] = new_dim;
+    result.strides[i] = cum_prod;
+    cum_prod *= new_dim;
+  }
+  return result;
+}
+
 // Gets next index to iterate through a multidimensional array.
 inline bool NextIndex(const int num_dims, const int* dims, int* current) {
   TFLITE_DCHECK_GT(num_dims, 0);
@@ -259,6 +378,14 @@ bool IsPackedWithoutStrides(const Dims<N>& dims) {
   return true;
 }
 
+template <int N>
+void ComputeStrides(Dims<N>* dims) {
+  dims->strides[0] = 1;
+  for (int d = 1; d < N; d++) {
+    dims->strides[d] = dims->strides[d - 1] * dims->sizes[d - 1];
+  }
+}
+
 }  // namespace tflite
 
 #endif  // TENSORFLOW_CONTRIB_LITE_KERNELS_INTERNAL_TYPES_H_
diff --git a/tensorflow/contrib/lite/kernels/kernel_util.cc b/tensorflow/contrib/lite/kernels/kernel_util.cc
index 239b533a17efaa25d632c140c17e9f35d01a80ef..184028427fb193aa99cf155961c16eda1298e326 100644
--- a/tensorflow/contrib/lite/kernels/kernel_util.cc
+++ b/tensorflow/contrib/lite/kernels/kernel_util.cc
@@ -37,7 +37,6 @@ TfLiteStatus GetQuantizedConvolutionMultipler(TfLiteContext* context,
   TF_LITE_ENSURE(context, std::abs(input_product_scale - bias_scale) <=
                               1e-6 * std::min(input_product_scale, bias_scale));
   TF_LITE_ENSURE(context, input_product_scale >= 0);
-  TF_LITE_ENSURE(context, input_product_scale < output_scale);
 
   *multiplier = input_product_scale / output_scale;
 
diff --git a/tensorflow/contrib/lite/kernels/l2norm_test.cc b/tensorflow/contrib/lite/kernels/l2norm_test.cc
index 11cc666bad61b5753d933b98510ee8d4a093644e..070ed60040997f18f7e8053acc9532adc2377400 100644
--- a/tensorflow/contrib/lite/kernels/l2norm_test.cc
+++ b/tensorflow/contrib/lite/kernels/l2norm_test.cc
@@ -67,7 +67,7 @@ class L2NormOpModel : public SingleOpModel {
   int output_;
 };
 
-TEST(L2NormOpTest, SimpleTest) {
+TEST(L2NormOpTest, SimpleFloatTest) {
   L2NormOpModel m({1, 1, 1, 6}, TensorType_FLOAT32,
                   ActivationFunctionType_NONE);
   m.SetInput({-1.1, 0.6, 0.7, 1.2, -0.7, 0.1});
@@ -76,7 +76,7 @@ TEST(L2NormOpTest, SimpleTest) {
               ElementsAreArray({-0.55, 0.3, 0.35, 0.6, -0.35, 0.05}));
 }
 
-TEST(L2NormOpTest, MultipleBatchesTest) {
+TEST(L2NormOpTest, MultipleBatchFloatTest) {
   L2NormOpModel m({3, 1, 1, 6}, TensorType_FLOAT32,
                   ActivationFunctionType_NONE);
   m.SetInput({
@@ -105,6 +105,32 @@ TEST(L2NormOpTest, SimpleUint8Test) {
                   ArrayFloatNear({-0.55, 0.3, 0.35, 0.6, -0.35, 0.05}, 0.1)));
 }
 
+TEST(L2NormOpTest, MultipleBatchUint8Test) {
+  L2NormOpModel m({3, 1, 1, 6}, TensorType_UINT8, ActivationFunctionType_NONE);
+
+  m.QuantizeAndPopulate<uint8_t>(m.input(),
+                                 {
+                                     -1.1, 0.6, 0.7, 1.2, -0.7, 0.1,  // batch 1
+                                     -1.1, 0.6, 0.7, 1.2, -0.7, 0.1,  // batch 2
+                                     -1.1, 0.6, 0.7, 1.2, -0.7, 0.1,  // batch 3
+                                 });
+  m.Invoke();
+  EXPECT_THAT(m.GetOutput<uint8_t>(),
+              ElementsAreArray({
+                  58, 166, 173, 205, 83, 134,  // batch 1
+                  58, 166, 173, 205, 83, 134,  // batch 2
+                  58, 166, 173, 205, 83, 134,  // batch 3
+              }));
+  EXPECT_THAT(m.GetDequantizedOutput(),
+              ElementsAreArray(ArrayFloatNear(
+                  {
+                      -0.55, 0.3, 0.35, 0.6, -0.35, 0.05,  // batch 1
+                      -0.55, 0.3, 0.35, 0.6, -0.35, 0.05,  // batch 2
+                      -0.55, 0.3, 0.35, 0.6, -0.35, 0.05,  // batch 3
+                  },
+                  0.1)));
+}
+
 }  // namespace
 }  // namespace tflite
 
diff --git a/tensorflow/contrib/lite/kernels/lstm.cc b/tensorflow/contrib/lite/kernels/lstm.cc
index 990b3da0554ebcb13f995fa281ed04f8c7c6d7ea..eb26a02455ce2afccaa081a72d93a9ceeca746cc 100644
--- a/tensorflow/contrib/lite/kernels/lstm.cc
+++ b/tensorflow/contrib/lite/kernels/lstm.cc
@@ -25,6 +25,8 @@ limitations under the License.
 #include "tensorflow/contrib/lite/context.h"
 #include "tensorflow/contrib/lite/kernels/activation_functor.h"
 #include "tensorflow/contrib/lite/kernels/internal/kernel_utils.h"
+#include "tensorflow/contrib/lite/kernels/internal/optimized/optimized_ops.h"
+#include "tensorflow/contrib/lite/kernels/internal/tensor.h"
 #include "tensorflow/contrib/lite/kernels/internal/tensor_utils.h"
 #include "tensorflow/contrib/lite/kernels/kernel_util.h"
 #include "tensorflow/contrib/lite/kernels/op_macros.h"
@@ -34,6 +36,17 @@ namespace ops {
 namespace builtin {
 namespace lstm {
 
+struct OpData {
+  // Which kernel type to use. Full kernel (18-inputs) or basic kernel
+  // (5-inputs).
+  TfLiteLSTMKernelType kernel_type;
+  // Only used by full kernel.
+  int scratch_tensor_index;
+};
+
+// For full inputs kernel (18-inputs).
+namespace full {
+
 // Input Tensors of size {n_batch, n_input}
 constexpr int kInputTensor = 0;
 
@@ -71,20 +84,18 @@ constexpr int kCellStateTensor = 1;
 constexpr int kOutputTensor = 2;
 
 void* Init(TfLiteContext* context, const char* buffer, size_t length) {
-  auto* scratch_tensor_index = new int;
-  context->AddTensors(context, 1, scratch_tensor_index);
-  return scratch_tensor_index;
-}
-
-void Free(TfLiteContext* context, void* buffer) {
-  delete reinterpret_cast<int*>(buffer);
+  auto* op_data = new OpData;
+  op_data->kernel_type = kTfLiteLSTMFullKernel;
+  context->AddTensors(context, /*tensors_to_add=*/7,
+                      &op_data->scratch_tensor_index);
+  return op_data;
 }
 
 // Check that input tensor dimensions matches with each other.
 TfLiteStatus CheckInputTensorDimensions(TfLiteContext* context,
                                         TfLiteNode* node, int n_input,
                                         int n_output, int n_cell) {
-  auto* params = reinterpret_cast<TfLiteLSTMParams*>(node->builtin_data);
+  const auto* params = reinterpret_cast<TfLiteLSTMParams*>(node->builtin_data);
 
   // Making sure clipping parameters have valid values.
   // == 0 means no clipping
@@ -94,7 +105,7 @@ TfLiteStatus CheckInputTensorDimensions(TfLiteContext* context,
 
   const TfLiteTensor* input_to_input_weights =
       GetOptionalInputTensor(context, node, kInputToInputWeightsTensor);
-  if (input_to_input_weights) {
+  if (input_to_input_weights != nullptr) {
     TF_LITE_ENSURE_EQ(context, input_to_input_weights->dims->size, 2);
     TF_LITE_ENSURE_EQ(context, input_to_input_weights->dims->data[0], n_cell);
     TF_LITE_ENSURE_EQ(context, input_to_input_weights->dims->data[1], n_input);
@@ -114,7 +125,7 @@ TfLiteStatus CheckInputTensorDimensions(TfLiteContext* context,
 
   const TfLiteTensor* recurrent_to_input_weights =
       GetOptionalInputTensor(context, node, kRecurrentToInputWeightsTensor);
-  if (recurrent_to_input_weights) {
+  if (recurrent_to_input_weights != nullptr) {
     TF_LITE_ENSURE_EQ(context, recurrent_to_input_weights->dims->size, 2);
     TF_LITE_ENSURE_EQ(context, recurrent_to_input_weights->dims->data[0],
                       n_cell);
@@ -204,7 +215,7 @@ TfLiteStatus CheckInputTensorDimensions(TfLiteContext* context,
 
   const TfLiteTensor* projection_weights =
       GetOptionalInputTensor(context, node, kProjectionWeightsTensor);
-  if (projection_weights) {
+  if (projection_weights != nullptr) {
     TF_LITE_ENSURE_EQ(context, projection_weights->dims->size, 2);
     TF_LITE_ENSURE_EQ(context, projection_weights->dims->data[0], n_output);
     TF_LITE_ENSURE_EQ(context, projection_weights->dims->data[1], n_cell);
@@ -212,7 +223,7 @@ TfLiteStatus CheckInputTensorDimensions(TfLiteContext* context,
 
   const TfLiteTensor* projection_bias =
       GetOptionalInputTensor(context, node, kProjectionBiasTensor);
-  if (projection_bias) {
+  if (projection_bias != nullptr) {
     TF_LITE_ENSURE_EQ(context, projection_bias->dims->size, 1);
     TF_LITE_ENSURE_EQ(context, projection_bias->dims->data[0], n_output);
   }
@@ -233,7 +244,7 @@ TfLiteStatus CheckInputTensorDimensions(TfLiteContext* context,
 // Allocate a temporary scratch tensor. Also check that the sizes of the input
 // tensors match each other.
 TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
-  int* scratch_tensor_index = reinterpret_cast<int*>(node->user_data);
+  OpData* op_data = reinterpret_cast<OpData*>(node->user_data);
 
   // Check we have all the inputs and outputs we need.
   TF_LITE_ENSURE_EQ(context, node->inputs->size, 18);
@@ -242,6 +253,7 @@ TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
   // Inferring batch size, number of outputs and number of cells from the
   // input tensors.
   const TfLiteTensor* input = GetInput(context, node, kInputTensor);
+  TF_LITE_ENSURE_EQ(context, input->type, kTfLiteFloat32);
   TF_LITE_ENSURE(context, input->dims->size > 1);
   const int n_batch = input->dims->data[0];
   const int n_input = input->dims->data[1];
@@ -286,86 +298,148 @@ TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
   TF_LITE_ENSURE_OK(context,
                     context->ResizeTensor(context, cell_state, cell_size));
 
-  // Create a scratch buffer tensor.
+  // Mark state tensors as persistent tensors.
+  output_state->allocation_type = kTfLiteArenaRwPersistent;
+  cell_state->allocation_type = kTfLiteArenaRwPersistent;
+
+  // The weights are of consistent type, so it suffices to check one.
+  // TODO(mirkov): create a utility/macro for this check, so all Ops can use it.
+  const bool is_hybrid_op = (input_to_output_weights->type == kTfLiteUInt8 &&
+                             input->type == kTfLiteFloat32);
+
   TfLiteIntArrayFree(node->temporaries);
-  node->temporaries = TfLiteIntArrayCreate(1);
-  node->temporaries->data[0] = *scratch_tensor_index;
+  if (is_hybrid_op) {
+    node->temporaries = TfLiteIntArrayCreate(7);
+  } else {
+    node->temporaries = TfLiteIntArrayCreate(1);
+  }
+  node->temporaries->data[0] = op_data->scratch_tensor_index;
+
+  // Create a scratch buffer tensor.
   TfLiteTensor* scratch_buffer = GetTemporary(context, node, /*index=*/0);
   scratch_buffer->type = input->type;
   scratch_buffer->allocation_type = kTfLiteArenaRw;
 
-  // Mark state tensors as persistent tensors.
-  output_state->allocation_type = kTfLiteArenaRwPersistent;
-  cell_state->allocation_type = kTfLiteArenaRwPersistent;
-
   const TfLiteTensor* input_to_input_weights =
       GetOptionalInputTensor(context, node, kInputToInputWeightsTensor);
   const bool use_cifg = (input_to_input_weights == nullptr);
+  TfLiteIntArray* scratch_buffer_size = TfLiteIntArrayCreate(2);
+  scratch_buffer_size->data[0] = n_batch;
   if (use_cifg) {
-    TfLiteIntArray* scratch_buffer_size = TfLiteIntArrayCreate(2);
-    scratch_buffer_size->data[0] = n_batch;
     // Reserving space for Cell, Forget, Output gates
     scratch_buffer_size->data[1] = n_cell * 3;
-    TF_LITE_ENSURE_OK(context, context->ResizeTensor(context, scratch_buffer,
-                                                     scratch_buffer_size));
   } else {
-    TfLiteIntArray* scratch_buffer_size = TfLiteIntArrayCreate(2);
-    scratch_buffer_size->data[0] = n_batch;
     // Reserving space for Input, Cell, Forget, Output gates
     scratch_buffer_size->data[1] = n_cell * 4;
-    TF_LITE_ENSURE_OK(context, context->ResizeTensor(context, scratch_buffer,
-                                                     scratch_buffer_size));
+  }
+  TF_LITE_ENSURE_OK(context, context->ResizeTensor(context, scratch_buffer,
+                                                   scratch_buffer_size));
+
+  if (is_hybrid_op) {
+    // Allocate temporary tensors to store quantized values of input,
+    // output_state and cell_state tensors.
+    node->temporaries->data[1] = op_data->scratch_tensor_index + 1;
+    TfLiteTensor* input_quantized = GetTemporary(context, node, /*index=*/1);
+    input_quantized->type = kTfLiteUInt8;
+    input_quantized->allocation_type = kTfLiteArenaRw;
+    if (!TfLiteIntArrayEqual(input_quantized->dims, input->dims)) {
+      TfLiteIntArray* input_quantized_size = TfLiteIntArrayCopy(input->dims);
+      TF_LITE_ENSURE_OK(context, context->ResizeTensor(context, input_quantized,
+                                                       input_quantized_size));
+    }
+    node->temporaries->data[2] = op_data->scratch_tensor_index + 2;
+    TfLiteTensor* output_state_quantized =
+        GetTemporary(context, node, /*index=*/2);
+    output_state_quantized->type = kTfLiteUInt8;
+    output_state_quantized->allocation_type = kTfLiteArenaRw;
+    if (!TfLiteIntArrayEqual(output_state_quantized->dims,
+                             output_state->dims)) {
+      TfLiteIntArray* output_state_quantized_size =
+          TfLiteIntArrayCopy(output_state->dims);
+      TF_LITE_ENSURE_OK(context,
+                        context->ResizeTensor(context, output_state_quantized,
+                                              output_state_quantized_size));
+    }
+    node->temporaries->data[3] = op_data->scratch_tensor_index + 3;
+    TfLiteTensor* cell_state_quantized =
+        GetTemporary(context, node, /*index=*/3);
+    cell_state_quantized->type = kTfLiteUInt8;
+    cell_state_quantized->allocation_type = kTfLiteArenaRw;
+    if (!TfLiteIntArrayEqual(cell_state_quantized->dims, cell_state->dims)) {
+      TfLiteIntArray* cell_state_quantized_size =
+          TfLiteIntArrayCopy(cell_state->dims);
+      TF_LITE_ENSURE_OK(context,
+                        context->ResizeTensor(context, cell_state_quantized,
+                                              cell_state_quantized_size));
+    }
+
+    // Allocate temporary tensors to store scaling factors and product scaling
+    // factors. The latter is a convenience storage which allows to quantize
+    // a vector once (which produces the scaling factors) and multiply it with
+    // different matrices (which requires multiplying the scaling factors with
+    // the scaling factor of the matrix).
+    node->temporaries->data[4] = op_data->scratch_tensor_index + 4;
+    TfLiteTensor* scaling_factors = GetTemporary(context, node, /*index=*/4);
+    scaling_factors->type = kTfLiteFloat32;
+    scaling_factors->allocation_type = kTfLiteArenaRw;
+    TfLiteIntArray* scaling_factors_size = TfLiteIntArrayCreate(1);
+    scaling_factors_size->data[0] = n_batch;
+    if (!TfLiteIntArrayEqual(scaling_factors->dims, scaling_factors_size)) {
+      TF_LITE_ENSURE_OK(context, context->ResizeTensor(context, scaling_factors,
+                                                       scaling_factors_size));
+    }
+    node->temporaries->data[5] = op_data->scratch_tensor_index + 5;
+    TfLiteTensor* prod_scaling_factors =
+        GetTemporary(context, node, /*index=*/5);
+    prod_scaling_factors->type = kTfLiteFloat32;
+    prod_scaling_factors->allocation_type = kTfLiteArenaRw;
+    TfLiteIntArray* prod_scaling_factors_size = TfLiteIntArrayCreate(1);
+    prod_scaling_factors_size->data[0] = n_batch;
+    if (!TfLiteIntArrayEqual(prod_scaling_factors->dims,
+                             prod_scaling_factors_size)) {
+      TF_LITE_ENSURE_OK(context,
+                        context->ResizeTensor(context, prod_scaling_factors,
+                                              prod_scaling_factors_size));
+    }
+
+    // Allocate a temporary tensor to store the recovered cell weights. Since
+    // this is used for diagonal matrices, only need to store n_cell values.
+    node->temporaries->data[6] = op_data->scratch_tensor_index + 6;
+    TfLiteTensor* recovered_cell_weights =
+        GetTemporary(context, node, /*index=*/6);
+    recovered_cell_weights->type = kTfLiteFloat32;
+    recovered_cell_weights->allocation_type = kTfLiteArenaRw;
+    TfLiteIntArray* recovered_cell_weights_size = TfLiteIntArrayCreate(1);
+    recovered_cell_weights_size->data[0] = n_cell;
+    if (!TfLiteIntArrayEqual(recovered_cell_weights->dims,
+                             recovered_cell_weights_size)) {
+      TF_LITE_ENSURE_OK(context,
+                        context->ResizeTensor(context, recovered_cell_weights,
+                                              recovered_cell_weights_size));
+    }
   }
   return kTfLiteOk;
 }
 
 // The LSTM Op engine.
-TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
-  auto* params = reinterpret_cast<TfLiteLSTMParams*>(node->builtin_data);
-  const TfLiteTensor* input = GetInput(context, node, kInputTensor);
-
-  const TfLiteTensor* input_to_input_weights =
-      GetOptionalInputTensor(context, node, kInputToInputWeightsTensor);
-  const TfLiteTensor* input_to_forget_weights =
-      GetInput(context, node, kInputToForgetWeightsTensor);
-  const TfLiteTensor* input_to_cell_weights =
-      GetInput(context, node, kInputToCellWeightsTensor);
-  const TfLiteTensor* input_to_output_weights =
-      GetInput(context, node, kInputToOutputWeightsTensor);
-
-  const TfLiteTensor* recurrent_to_input_weights =
-      GetOptionalInputTensor(context, node, kRecurrentToInputWeightsTensor);
-  const TfLiteTensor* recurrent_to_forget_weights =
-      GetInput(context, node, kRecurrentToForgetWeightsTensor);
-  const TfLiteTensor* recurrent_to_cell_weights =
-      GetInput(context, node, kRecurrentToCellWeightsTensor);
-  const TfLiteTensor* recurrent_to_output_weights =
-      GetInput(context, node, kRecurrentToOutputWeightsTensor);
-
-  const TfLiteTensor* cell_to_input_weights =
-      GetOptionalInputTensor(context, node, kCellToInputWeightsTensor);
-  const TfLiteTensor* cell_to_forget_weights =
-      GetOptionalInputTensor(context, node, kCellToForgetWeightsTensor);
-  const TfLiteTensor* cell_to_output_weights =
-      GetOptionalInputTensor(context, node, kCellToOutputWeightsTensor);
-
-  const TfLiteTensor* input_gate_bias =
-      GetOptionalInputTensor(context, node, kInputGateBiasTensor);
-  const TfLiteTensor* forget_gate_bias =
-      GetInput(context, node, kForgetGateBiasTensor);
-  const TfLiteTensor* cell_bias = GetInput(context, node, kCellGateBiasTensor);
-  const TfLiteTensor* output_gate_bias =
-      GetInput(context, node, kOutputGateBiasTensor);
-
-  const TfLiteTensor* projection_weights =
-      GetOptionalInputTensor(context, node, kProjectionWeightsTensor);
-  const TfLiteTensor* projection_bias =
-      GetOptionalInputTensor(context, node, kProjectionBiasTensor);
-
-  TfLiteTensor* output_state = GetOutput(context, node, kOutputStateTensor);
-  TfLiteTensor* cell_state = GetOutput(context, node, kCellStateTensor);
-  TfLiteTensor* output = GetOutput(context, node, kOutputTensor);
-
+TfLiteStatus EvalFloat(
+    const TfLiteTensor* input, const TfLiteTensor* input_to_input_weights,
+    const TfLiteTensor* input_to_forget_weights,
+    const TfLiteTensor* input_to_cell_weights,
+    const TfLiteTensor* input_to_output_weights,
+    const TfLiteTensor* recurrent_to_input_weights,
+    const TfLiteTensor* recurrent_to_forget_weights,
+    const TfLiteTensor* recurrent_to_cell_weights,
+    const TfLiteTensor* recurrent_to_output_weights,
+    const TfLiteTensor* cell_to_input_weights,
+    const TfLiteTensor* cell_to_forget_weights,
+    const TfLiteTensor* cell_to_output_weights,
+    const TfLiteTensor* input_gate_bias, const TfLiteTensor* forget_gate_bias,
+    const TfLiteTensor* cell_bias, const TfLiteTensor* output_gate_bias,
+    const TfLiteTensor* projection_weights, const TfLiteTensor* projection_bias,
+    const TfLiteLSTMParams* params, TfLiteTensor* scratch_buffer,
+    TfLiteTensor* output_state, TfLiteTensor* cell_state,
+    TfLiteTensor* output) {
   const int n_batch = input->dims->data[0];
   const int n_input = input->dims->data[1];
   // n_cell and n_output will be the same size when there is no projection.
@@ -377,9 +451,6 @@ TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
   const bool use_cifg = (input_to_input_weights == nullptr);
   const bool use_peephole = (cell_to_output_weights != nullptr);
 
-  // Index the scratch buffers pointers to the global scratch buffer.
-  TfLiteTensor* scratch_buffer = GetTemporary(context, node, /*index=*/0);
-
   float* input_gate_scratch = nullptr;
   float* cell_scratch = nullptr;
   float* forget_gate_scratch = nullptr;
@@ -447,6 +518,421 @@ TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
   return kTfLiteOk;
 }
 
+TfLiteStatus EvalHybrid(
+    const TfLiteTensor* input, const TfLiteTensor* input_to_input_weights,
+    const TfLiteTensor* input_to_forget_weights,
+    const TfLiteTensor* input_to_cell_weights,
+    const TfLiteTensor* input_to_output_weights,
+    const TfLiteTensor* recurrent_to_input_weights,
+    const TfLiteTensor* recurrent_to_forget_weights,
+    const TfLiteTensor* recurrent_to_cell_weights,
+    const TfLiteTensor* recurrent_to_output_weights,
+    const TfLiteTensor* cell_to_input_weights,
+    const TfLiteTensor* cell_to_forget_weights,
+    const TfLiteTensor* cell_to_output_weights,
+    const TfLiteTensor* input_gate_bias, const TfLiteTensor* forget_gate_bias,
+    const TfLiteTensor* cell_bias, const TfLiteTensor* output_gate_bias,
+    const TfLiteTensor* projection_weights, const TfLiteTensor* projection_bias,
+    const TfLiteLSTMParams* params, TfLiteTensor* scratch_buffer,
+    TfLiteTensor* scaling_factors, TfLiteTensor* prod_scaling_factors,
+    TfLiteTensor* recovered_cell_weights, TfLiteTensor* input_quantized,
+    TfLiteTensor* output_state_quantized, TfLiteTensor* cell_state_quantized,
+    TfLiteTensor* output_state, TfLiteTensor* cell_state,
+    TfLiteTensor* output) {
+  const int n_batch = input->dims->data[0];
+  const int n_input = input->dims->data[1];
+  // n_cell and n_output will be the same size when there is no projection.
+  const int n_cell = input_to_output_weights->dims->data[0];
+  const int n_output = recurrent_to_output_weights->dims->data[1];
+
+  // Since we have already checked that weights are all there or none, we can
+  // check the existence of only one to get the condition.
+  const bool use_cifg = (input_to_input_weights == nullptr);
+  const bool use_peephole = (cell_to_output_weights != nullptr);
+
+  float* input_gate_scratch = nullptr;
+  float* cell_scratch = nullptr;
+  float* forget_gate_scratch = nullptr;
+  float* output_gate_scratch = nullptr;
+  if (use_cifg) {
+    cell_scratch = scratch_buffer->data.f;
+    forget_gate_scratch = scratch_buffer->data.f + n_cell * n_batch;
+    output_gate_scratch = scratch_buffer->data.f + 2 * n_cell * n_batch;
+  } else {
+    input_gate_scratch = scratch_buffer->data.f;
+    cell_scratch = scratch_buffer->data.f + n_cell * n_batch;
+    forget_gate_scratch = scratch_buffer->data.f + 2 * n_cell * n_batch;
+    output_gate_scratch = scratch_buffer->data.f + 3 * n_cell * n_batch;
+  }
+
+  // Check optional tensors, the respective pointers can be null.
+  int8_t* input_to_input_weights_ptr = nullptr;
+  float input_to_input_weights_scale = 1.0f;
+  int8_t* recurrent_to_input_weights_ptr = nullptr;
+  float recurrent_to_input_weights_scale = 1.0f;
+  float* input_gate_bias_ptr = nullptr;
+  if (!use_cifg) {
+    input_to_input_weights_ptr =
+        reinterpret_cast<int8_t*>(input_to_input_weights->data.uint8);
+    recurrent_to_input_weights_ptr =
+        reinterpret_cast<int8_t*>(recurrent_to_input_weights->data.uint8);
+    input_gate_bias_ptr = input_gate_bias->data.f;
+    input_to_input_weights_scale = input_to_input_weights->params.scale;
+    recurrent_to_input_weights_scale = recurrent_to_input_weights->params.scale;
+  }
+
+  int8_t* cell_to_input_weights_ptr = nullptr;
+  int8_t* cell_to_forget_weights_ptr = nullptr;
+  int8_t* cell_to_output_weights_ptr = nullptr;
+  float cell_to_input_weights_scale = 1.0f;
+  float cell_to_forget_weights_scale = 1.0f;
+  float cell_to_output_weights_scale = 1.0f;
+  if (use_peephole) {
+    if (!use_cifg) {
+      cell_to_input_weights_ptr =
+          reinterpret_cast<int8_t*>(cell_to_input_weights->data.uint8);
+      cell_to_input_weights_scale = cell_to_input_weights->params.scale;
+    }
+    cell_to_forget_weights_ptr =
+        reinterpret_cast<int8_t*>(cell_to_forget_weights->data.uint8);
+    cell_to_output_weights_ptr =
+        reinterpret_cast<int8_t*>(cell_to_output_weights->data.uint8);
+    cell_to_forget_weights_scale = cell_to_forget_weights->params.scale;
+    cell_to_output_weights_scale = cell_to_output_weights->params.scale;
+  }
+
+  const int8_t* projection_weights_ptr =
+      (projection_weights == nullptr)
+          ? nullptr
+          : reinterpret_cast<int8_t*>(projection_weights->data.uint8);
+  const float projection_weights_scale =
+      (projection_weights == nullptr) ? 1.0f : projection_weights->params.scale;
+  const float* projection_bias_ptr =
+      (projection_bias == nullptr) ? nullptr : projection_bias->data.f;
+
+  // Required tensors, pointers are non-null.
+  const float* input_ptr_batch = input->data.f;
+  const int8_t* input_to_forget_weights_ptr =
+      reinterpret_cast<int8_t*>(input_to_forget_weights->data.uint8);
+  const float input_to_forget_weights_scale =
+      input_to_forget_weights->params.scale;
+  const int8_t* input_to_cell_weights_ptr =
+      reinterpret_cast<int8_t*>(input_to_cell_weights->data.uint8);
+  const float input_to_cell_weights_scale = input_to_cell_weights->params.scale;
+  const int8_t* input_to_output_weights_ptr =
+      reinterpret_cast<int8_t*>(input_to_output_weights->data.uint8);
+  const float input_to_output_weights_scale =
+      input_to_output_weights->params.scale;
+  const int8_t* recurrent_to_forget_weights_ptr =
+      reinterpret_cast<int8_t*>(recurrent_to_forget_weights->data.uint8);
+  const float recurrent_to_forget_weights_scale =
+      recurrent_to_forget_weights->params.scale;
+  const int8_t* recurrent_to_cell_weights_ptr =
+      reinterpret_cast<int8_t*>(recurrent_to_cell_weights->data.uint8);
+  const float recurrent_to_cell_weights_scale =
+      recurrent_to_cell_weights->params.scale;
+  const int8_t* recurrent_to_output_weights_ptr =
+      reinterpret_cast<int8_t*>(recurrent_to_output_weights->data.uint8);
+  const float recurrent_to_output_weights_scale =
+      recurrent_to_output_weights->params.scale;
+  const float* forget_gate_bias_ptr = forget_gate_bias->data.f;
+  const float* cell_bias_ptr = cell_bias->data.f;
+  const float* output_gate_bias_ptr = output_gate_bias->data.f;
+
+  float* output_state_ptr = output_state->data.f;
+  float* cell_state_ptr = cell_state->data.f;
+  float* output_ptr_batch = output->data.f;
+
+  // Temporary storage for quantized values and scaling factors.
+  int8_t* quantized_input_ptr =
+      reinterpret_cast<int8_t*>(input_quantized->data.uint8);
+  int8_t* quantized_output_state_ptr =
+      reinterpret_cast<int8_t*>(output_state_quantized->data.uint8);
+  int8_t* quantized_cell_state_ptr =
+      reinterpret_cast<int8_t*>(cell_state_quantized->data.uint8);
+  float* scaling_factors_ptr = scaling_factors->data.f;
+  float* prod_scaling_factors_ptr = prod_scaling_factors->data.f;
+  float* recovered_cell_weights_ptr = recovered_cell_weights->data.f;
+
+  kernel_utils::LstmStep(
+      input_ptr_batch, input_to_input_weights_ptr, input_to_input_weights_scale,
+      input_to_forget_weights_ptr, input_to_forget_weights_scale,
+      input_to_cell_weights_ptr, input_to_cell_weights_scale,
+      input_to_output_weights_ptr, input_to_output_weights_scale,
+      recurrent_to_input_weights_ptr, recurrent_to_input_weights_scale,
+      recurrent_to_forget_weights_ptr, recurrent_to_forget_weights_scale,
+      recurrent_to_cell_weights_ptr, recurrent_to_cell_weights_scale,
+      recurrent_to_output_weights_ptr, recurrent_to_output_weights_scale,
+      cell_to_input_weights_ptr, cell_to_input_weights_scale,
+      cell_to_forget_weights_ptr, cell_to_forget_weights_scale,
+      cell_to_output_weights_ptr, cell_to_output_weights_scale,
+      input_gate_bias_ptr, forget_gate_bias_ptr, cell_bias_ptr,
+      output_gate_bias_ptr, projection_weights_ptr, projection_weights_scale,
+      projection_bias_ptr, params, n_batch, n_cell, n_input, n_output,
+      input_gate_scratch, forget_gate_scratch, cell_scratch,
+      output_gate_scratch, scaling_factors_ptr, prod_scaling_factors_ptr,
+      recovered_cell_weights_ptr, quantized_input_ptr,
+      quantized_output_state_ptr, quantized_cell_state_ptr, output_state_ptr,
+      cell_state_ptr, output_ptr_batch);
+
+  return kTfLiteOk;
+}
+
+TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
+  const auto* params = reinterpret_cast<TfLiteLSTMParams*>(node->builtin_data);
+  const TfLiteTensor* input = GetInput(context, node, kInputTensor);
+
+  const TfLiteTensor* input_to_input_weights =
+      GetOptionalInputTensor(context, node, kInputToInputWeightsTensor);
+  const TfLiteTensor* input_to_forget_weights =
+      GetInput(context, node, kInputToForgetWeightsTensor);
+  const TfLiteTensor* input_to_cell_weights =
+      GetInput(context, node, kInputToCellWeightsTensor);
+  const TfLiteTensor* input_to_output_weights =
+      GetInput(context, node, kInputToOutputWeightsTensor);
+
+  const TfLiteTensor* recurrent_to_input_weights =
+      GetOptionalInputTensor(context, node, kRecurrentToInputWeightsTensor);
+  const TfLiteTensor* recurrent_to_forget_weights =
+      GetInput(context, node, kRecurrentToForgetWeightsTensor);
+  const TfLiteTensor* recurrent_to_cell_weights =
+      GetInput(context, node, kRecurrentToCellWeightsTensor);
+  const TfLiteTensor* recurrent_to_output_weights =
+      GetInput(context, node, kRecurrentToOutputWeightsTensor);
+
+  const TfLiteTensor* cell_to_input_weights =
+      GetOptionalInputTensor(context, node, kCellToInputWeightsTensor);
+  const TfLiteTensor* cell_to_forget_weights =
+      GetOptionalInputTensor(context, node, kCellToForgetWeightsTensor);
+  const TfLiteTensor* cell_to_output_weights =
+      GetOptionalInputTensor(context, node, kCellToOutputWeightsTensor);
+
+  const TfLiteTensor* input_gate_bias =
+      GetOptionalInputTensor(context, node, kInputGateBiasTensor);
+  const TfLiteTensor* forget_gate_bias =
+      GetInput(context, node, kForgetGateBiasTensor);
+  const TfLiteTensor* cell_bias = GetInput(context, node, kCellGateBiasTensor);
+  const TfLiteTensor* output_gate_bias =
+      GetInput(context, node, kOutputGateBiasTensor);
+
+  const TfLiteTensor* projection_weights =
+      GetOptionalInputTensor(context, node, kProjectionWeightsTensor);
+  const TfLiteTensor* projection_bias =
+      GetOptionalInputTensor(context, node, kProjectionBiasTensor);
+
+  // Index the scratch buffers pointers to the global scratch buffer.
+  TfLiteTensor* scratch_buffer = GetTemporary(context, node, /*index=*/0);
+
+  TfLiteTensor* output_state = GetOutput(context, node, kOutputStateTensor);
+  TfLiteTensor* cell_state = GetOutput(context, node, kCellStateTensor);
+  TfLiteTensor* output = GetOutput(context, node, kOutputTensor);
+
+  // TODO(mirkov): add a check that weights are all uint8s or all floats.
+  switch (input_to_output_weights->type) {
+    case kTfLiteFloat32: {
+      return EvalFloat(input, input_to_input_weights, input_to_forget_weights,
+                       input_to_cell_weights, input_to_output_weights,
+                       recurrent_to_input_weights, recurrent_to_forget_weights,
+                       recurrent_to_cell_weights, recurrent_to_output_weights,
+                       cell_to_input_weights, cell_to_forget_weights,
+                       cell_to_output_weights, input_gate_bias,
+                       forget_gate_bias, cell_bias, output_gate_bias,
+                       projection_weights, projection_bias, params,
+                       scratch_buffer, output_state, cell_state, output);
+    }
+    case kTfLiteUInt8: {
+      TfLiteTensor* input_quantized = GetTemporary(context, node, /*index=*/1);
+      TfLiteTensor* output_state_quantized =
+          GetTemporary(context, node, /*index=*/2);
+      TfLiteTensor* cell_state_quantized =
+          GetTemporary(context, node, /*index=*/3);
+      TfLiteTensor* scaling_factors = GetTemporary(context, node, /*index=*/4);
+      TfLiteTensor* prod_scaling_factors =
+          GetTemporary(context, node, /*index=*/5);
+      TfLiteTensor* recovered_cell_weights =
+          GetTemporary(context, node, /*index=*/6);
+      return EvalHybrid(
+          input, input_to_input_weights, input_to_forget_weights,
+          input_to_cell_weights, input_to_output_weights,
+          recurrent_to_input_weights, recurrent_to_forget_weights,
+          recurrent_to_cell_weights, recurrent_to_output_weights,
+          cell_to_input_weights, cell_to_forget_weights, cell_to_output_weights,
+          input_gate_bias, forget_gate_bias, cell_bias, output_gate_bias,
+          projection_weights, projection_bias, params, scratch_buffer,
+          scaling_factors, prod_scaling_factors, recovered_cell_weights,
+          input_quantized, output_state_quantized, cell_state_quantized,
+          output_state, cell_state, output);
+    }
+    default:
+      context->ReportError(context, "Type %d is not currently supported.",
+                           input_to_output_weights->type);
+      return kTfLiteError;
+  }
+  return kTfLiteOk;
+}
+
+}  // namespace full
+
+// For basic kernel (5-inputs).
+namespace basic {
+
+enum InputTensor {
+  kInputData = 0,
+  kInputPrevActivation = 1,
+  kInputWeights = 2,
+  kInputBiases = 3,
+  kInputPrevState = 4,
+  kInputNum = 5,
+};
+
+enum OutputTensor {
+  kOutputActivation = 0,
+  kOutputState = 1,
+  kOutputConcatTemp = 2,
+  kOutputActivationTemp = 3,
+  kOutputNum = 4,
+};
+
+void* Init(TfLiteContext* context, const char* buffer, size_t length) {
+  auto* op_data = new OpData;
+  op_data->kernel_type = kTfLiteLSTMBasicKernel;
+  // `scratch_tensor_index` is unused in this kernel.
+  op_data->scratch_tensor_index = -1;
+  return op_data;
+}
+
+TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
+  TF_LITE_ENSURE(context, node->inputs->size == kInputNum);
+  TF_LITE_ENSURE(context, node->outputs->size == kOutputNum);
+
+  // Only Float32 is supported currently.
+  // TODO(ycling): Implement quantize uint8 support.
+  for (int index = 0; index < node->inputs->size; ++index) {
+    TfLiteTensor* tensor = &context->tensors[node->inputs->data[index]];
+    TF_LITE_ENSURE_EQ(context, tensor->type, kTfLiteFloat32);
+  }
+
+  const TfLiteTensor* input = GetInput(context, node, kInputData);
+  const TfLiteTensor* prev_activation =
+      GetInput(context, node, kInputPrevActivation);
+  const TfLiteTensor* weights = GetInput(context, node, kInputWeights);
+  const TfLiteTensor* bias = GetInput(context, node, kInputBiases);
+  const TfLiteTensor* prev_state = GetInput(context, node, kInputPrevState);
+
+  TF_LITE_ENSURE_EQ(context, input->dims->size, 2);
+  const int num_batches = input->dims->data[0];
+
+  TF_LITE_ENSURE_EQ(context, prev_activation->dims->size, 2);
+  TF_LITE_ENSURE_EQ(context, prev_activation->dims->data[0], num_batches);
+
+  TF_LITE_ENSURE_EQ(context, weights->dims->size, 2);
+  TF_LITE_ENSURE_EQ(context, bias->dims->size, 1);
+
+  TF_LITE_ENSURE_EQ(context, prev_state->dims->size, 2);
+  TF_LITE_ENSURE_EQ(context, prev_state->dims->data[0], num_batches);
+
+  TfLiteTensor* activation_out = GetOutput(context, node, kOutputActivation);
+  TfLiteTensor* state_out = GetOutput(context, node, kOutputState);
+  TfLiteTensor* concat_temp = GetOutput(context, node, kOutputConcatTemp);
+  TfLiteTensor* activation_temp =
+      GetOutput(context, node, kOutputActivationTemp);
+
+  TF_LITE_ENSURE_OK(context, context->ResizeTensor(
+                                 context, activation_out,
+                                 TfLiteIntArrayCopy(prev_activation->dims)));
+  TF_LITE_ENSURE_OK(
+      context, context->ResizeTensor(context, state_out,
+                                     TfLiteIntArrayCopy(prev_state->dims)));
+  TfLiteIntArray* concat_temp_size = TfLiteIntArrayCreate(2);
+  concat_temp_size->data[0] = num_batches;
+  concat_temp_size->data[1] = weights->dims->data[1];
+  TF_LITE_ENSURE_OK(
+      context, context->ResizeTensor(context, concat_temp, concat_temp_size));
+  TfLiteIntArray* activation_temp_size = TfLiteIntArrayCreate(2);
+  activation_temp_size->data[0] = num_batches;
+  activation_temp_size->data[1] = weights->dims->data[0];
+  TF_LITE_ENSURE_OK(context, context->ResizeTensor(context, activation_temp,
+                                                   activation_temp_size));
+
+  // Set the state tensors as persistent.
+  for (auto index : {kInputPrevActivation, kInputPrevState}) {
+    TfLiteTensor* tensor = &context->tensors[node->inputs->data[index]];
+    tensor->allocation_type = kTfLiteArenaRwPersistent;
+  }
+  return kTfLiteOk;
+}
+
+TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
+  const TfLiteTensor* input = GetInput(context, node, kInputData);
+  const TfLiteTensor* prev_activation =
+      GetInput(context, node, kInputPrevActivation);
+  const TfLiteTensor* weights = GetInput(context, node, kInputWeights);
+  const TfLiteTensor* bias = GetInput(context, node, kInputBiases);
+  const TfLiteTensor* prev_state = GetInput(context, node, kInputPrevState);
+
+  TfLiteTensor* activation_out = GetOutput(context, node, kOutputActivation);
+  TfLiteTensor* state_out = GetOutput(context, node, kOutputState);
+  TfLiteTensor* concat_temp = GetOutput(context, node, kOutputConcatTemp);
+  TfLiteTensor* activation_temp =
+      GetOutput(context, node, kOutputActivationTemp);
+
+  optimized_ops::LstmCell(
+      // Inputs.
+      GetTensorData<float>(input), GetTensorDims(input),
+      GetTensorData<float>(prev_activation), GetTensorDims(prev_activation),
+      GetTensorData<float>(weights), GetTensorDims(weights),
+      GetTensorData<float>(bias), GetTensorDims(bias),
+      GetTensorData<float>(prev_state), GetTensorDims(prev_state),
+      // Outputs.
+      GetTensorData<float>(state_out), GetTensorDims(state_out),
+      GetTensorData<float>(activation_out), GetTensorDims(activation_out),
+      GetTensorData<float>(concat_temp), GetTensorDims(concat_temp),
+      GetTensorData<float>(activation_temp), GetTensorDims(activation_temp));
+
+  // TODO(ycling): Investigate if this copy can be avoided with the 5-inputs
+  // LSTM kernel.
+  memcpy(prev_activation->data.raw, activation_out->data.raw,
+         activation_out->bytes);
+  memcpy(prev_state->data.raw, state_out->data.raw, state_out->bytes);
+
+  return kTfLiteOk;
+}
+
+}  // namespace basic
+
+void* Init(TfLiteContext* context, const char* buffer, size_t length) {
+  const auto* params = reinterpret_cast<const TfLiteLSTMParams*>(buffer);
+  switch (params->kernel_type) {
+    case kTfLiteLSTMFullKernel:
+      return full::Init(context, buffer, length);
+    case kTfLiteLSTMBasicKernel:
+      return basic::Init(context, buffer, length);
+  }
+}
+void Free(TfLiteContext* context, void* buffer) {
+  delete reinterpret_cast<OpData*>(buffer);
+}
+
+TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
+  const auto* op_data = reinterpret_cast<const OpData*>(node->user_data);
+  switch (op_data->kernel_type) {
+    case kTfLiteLSTMFullKernel:
+      return full::Prepare(context, node);
+    case kTfLiteLSTMBasicKernel:
+      return basic::Prepare(context, node);
+  }
+}
+
+TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
+  const auto* op_data = reinterpret_cast<const OpData*>(node->user_data);
+  switch (op_data->kernel_type) {
+    case kTfLiteLSTMFullKernel:
+      return full::Eval(context, node);
+    case kTfLiteLSTMBasicKernel:
+      return basic::Eval(context, node);
+  }
+}
+
 }  // namespace lstm
 
 TfLiteRegistration* Register_LSTM() {
diff --git a/tensorflow/contrib/lite/kernels/lstm_test.cc b/tensorflow/contrib/lite/kernels/lstm_test.cc
index d81220d8d30793616444c03e8647b0877a39a4d9..6da29a4a923f16f7b5ad382f51cfd820783504cd 100644
--- a/tensorflow/contrib/lite/kernels/lstm_test.cc
+++ b/tensorflow/contrib/lite/kernels/lstm_test.cc
@@ -14,7 +14,6 @@ limitations under the License.
 ==============================================================================*/
 // Unit test for TFLite LSTM op.
 
-#include <iomanip>
 #include <memory>
 #include <vector>
 
@@ -35,7 +34,8 @@ class LSTMOpModel : public SingleOpModel {
   LSTMOpModel(int n_batch, int n_input, int n_cell, int n_output, bool use_cifg,
               bool use_peephole, bool use_projection_weights,
               bool use_projection_bias, float cell_clip, float proj_clip,
-              const std::vector<std::vector<int>>& input_shapes)
+              const std::vector<std::vector<int>>& input_shapes,
+              const TensorType& weight_type = TensorType_FLOAT32)
       : n_batch_(n_batch),
         n_input_(n_input),
         n_cell_(n_cell),
@@ -45,31 +45,31 @@ class LSTMOpModel : public SingleOpModel {
     if (use_cifg) {
       input_to_input_weights_ = AddNullInput();
     } else {
-      input_to_input_weights_ = AddInput(TensorType_FLOAT32);
+      input_to_input_weights_ = AddInput(weight_type);
     }
 
-    input_to_forget_weights_ = AddInput(TensorType_FLOAT32);
-    input_to_cell_weights_ = AddInput(TensorType_FLOAT32);
-    input_to_output_weights_ = AddInput(TensorType_FLOAT32);
+    input_to_forget_weights_ = AddInput(weight_type);
+    input_to_cell_weights_ = AddInput(weight_type);
+    input_to_output_weights_ = AddInput(weight_type);
 
     if (use_cifg) {
       recurrent_to_input_weights_ = AddNullInput();
     } else {
-      recurrent_to_input_weights_ = AddInput(TensorType_FLOAT32);
+      recurrent_to_input_weights_ = AddInput(weight_type);
     }
 
-    recurrent_to_forget_weights_ = AddInput(TensorType_FLOAT32);
-    recurrent_to_cell_weights_ = AddInput(TensorType_FLOAT32);
-    recurrent_to_output_weights_ = AddInput(TensorType_FLOAT32);
+    recurrent_to_forget_weights_ = AddInput(weight_type);
+    recurrent_to_cell_weights_ = AddInput(weight_type);
+    recurrent_to_output_weights_ = AddInput(weight_type);
 
     if (use_peephole) {
       if (use_cifg) {
         cell_to_input_weights_ = AddNullInput();
       } else {
-        cell_to_input_weights_ = AddInput(TensorType_FLOAT32);
+        cell_to_input_weights_ = AddInput(weight_type);
       }
-      cell_to_forget_weights_ = AddInput(TensorType_FLOAT32);
-      cell_to_output_weights_ = AddInput(TensorType_FLOAT32);
+      cell_to_forget_weights_ = AddInput(weight_type);
+      cell_to_output_weights_ = AddInput(weight_type);
     } else {
       cell_to_input_weights_ = AddNullInput();
       cell_to_forget_weights_ = AddNullInput();
@@ -86,7 +86,7 @@ class LSTMOpModel : public SingleOpModel {
     output_gate_bias_ = AddInput(TensorType_FLOAT32);
 
     if (use_projection_weights) {
-      projection_weights_ = AddInput(TensorType_FLOAT32);
+      projection_weights_ = AddInput(weight_type);
       if (use_projection_bias) {
         projection_bias_ = AddInput(TensorType_FLOAT32);
       } else {
@@ -192,8 +192,9 @@ class LSTMOpModel : public SingleOpModel {
                    zero_buffer.get() + zero_buffer_size);
   }
 
-  void SetInput(int offset, float* begin, float* end) {
-    PopulateTensor(input_, offset, begin, end);
+  void SetInput(int offset, const float* begin, const float* end) {
+    PopulateTensor(input_, offset, const_cast<float*>(begin),
+                   const_cast<float*>(end));
   }
 
   std::vector<float> GetOutput() { return ExtractVector<float>(output_); }
@@ -203,7 +204,7 @@ class LSTMOpModel : public SingleOpModel {
   int num_cells() { return n_cell_; }
   int num_batches() { return n_batch_; }
 
- private:
+ protected:
   int input_;
   int input_to_input_weights_;
   int input_to_forget_weights_;
@@ -237,7 +238,182 @@ class LSTMOpModel : public SingleOpModel {
   int n_output_;
 };
 
-TEST(LSTMOpTest, BlackBoxTestNoCifgNoPeepholeNoProjectionNoClipping) {
+class HybridLSTMOpModel : public LSTMOpModel {
+ public:
+  HybridLSTMOpModel(int n_batch, int n_input, int n_cell, int n_output,
+                    bool use_cifg, bool use_peephole,
+                    bool use_projection_weights, bool use_projection_bias,
+                    float cell_clip, float proj_clip,
+                    const std::vector<std::vector<int>>& input_shapes)
+      : LSTMOpModel(n_batch, n_input, n_cell, n_output, use_cifg, use_peephole,
+                    use_projection_weights, use_projection_bias, cell_clip,
+                    proj_clip, input_shapes, TensorType_UINT8) {}
+
+  void SetInputToInputWeights(std::initializer_list<float> f) {
+    SymmetricQuantizeAndPopulate(input_to_input_weights_, f);
+  }
+
+  void SetInputToForgetWeights(std::initializer_list<float> f) {
+    SymmetricQuantizeAndPopulate(input_to_forget_weights_, f);
+  }
+
+  void SetInputToCellWeights(std::initializer_list<float> f) {
+    SymmetricQuantizeAndPopulate(input_to_cell_weights_, f);
+  }
+
+  void SetInputToOutputWeights(std::initializer_list<float> f) {
+    SymmetricQuantizeAndPopulate(input_to_output_weights_, f);
+  }
+
+  void SetRecurrentToInputWeights(std::initializer_list<float> f) {
+    SymmetricQuantizeAndPopulate(recurrent_to_input_weights_, f);
+  }
+
+  void SetRecurrentToForgetWeights(std::initializer_list<float> f) {
+    SymmetricQuantizeAndPopulate(recurrent_to_forget_weights_, f);
+  }
+
+  void SetRecurrentToCellWeights(std::initializer_list<float> f) {
+    SymmetricQuantizeAndPopulate(recurrent_to_cell_weights_, f);
+  }
+
+  void SetRecurrentToOutputWeights(std::initializer_list<float> f) {
+    SymmetricQuantizeAndPopulate(recurrent_to_output_weights_, f);
+  }
+
+  void SetCellToInputWeights(std::initializer_list<float> f) {
+    SymmetricQuantizeAndPopulate(cell_to_input_weights_, f);
+  }
+
+  void SetCellToForgetWeights(std::initializer_list<float> f) {
+    SymmetricQuantizeAndPopulate(cell_to_forget_weights_, f);
+  }
+
+  void SetCellToOutputWeights(std::initializer_list<float> f) {
+    SymmetricQuantizeAndPopulate(cell_to_output_weights_, f);
+  }
+
+  void SetProjectionWeights(std::initializer_list<float> f) {
+    SymmetricQuantizeAndPopulate(projection_weights_, f);
+  }
+};
+
+class BaseLstmTest : public ::testing::Test {
+ protected:
+  // Weights of the LSTM model. Some are optional.
+  std::initializer_list<float> input_to_input_weights_;
+  std::initializer_list<float> input_to_cell_weights_;
+  std::initializer_list<float> input_to_forget_weights_;
+  std::initializer_list<float> input_to_output_weights_;
+  std::initializer_list<float> input_gate_bias_;
+  std::initializer_list<float> cell_gate_bias_;
+  std::initializer_list<float> forget_gate_bias_;
+  std::initializer_list<float> output_gate_bias_;
+  std::initializer_list<float> recurrent_to_input_weights_;
+  std::initializer_list<float> recurrent_to_cell_weights_;
+  std::initializer_list<float> recurrent_to_forget_weights_;
+  std::initializer_list<float> recurrent_to_output_weights_;
+  std::initializer_list<float> cell_to_input_weights_;
+  std::initializer_list<float> cell_to_forget_weights_;
+  std::initializer_list<float> cell_to_output_weights_;
+  std::initializer_list<float> projection_weights_;
+
+  // LSTM input is stored as num_batch x num_inputs vector.
+  std::vector<std::vector<float>> lstm_input_;
+  // LSTM output is stored as num_batch x num_outputs vector.
+  std::vector<std::vector<float>> lstm_golden_output_;
+
+  // Compares output up to tolerance to the result of the lstm given the input.
+  void VerifyGoldens(const std::vector<std::vector<float>>& input,
+                     const std::vector<std::vector<float>>& output,
+                     LSTMOpModel* lstm, float tolerance = 1e-5) {
+    const int num_batches = input.size();
+    EXPECT_GT(num_batches, 0);
+    const int num_inputs = lstm->num_inputs();
+    EXPECT_GT(num_inputs, 0);
+    const int input_sequence_size = input[0].size() / num_inputs;
+    EXPECT_GT(input_sequence_size, 0);
+    for (int i = 0; i < input_sequence_size; ++i) {
+      for (int b = 0; b < num_batches; ++b) {
+        const float* batch_start = input[b].data() + i * num_inputs;
+        const float* batch_end = batch_start + num_inputs;
+
+        lstm->SetInput(b * lstm->num_inputs(), batch_start, batch_end);
+      }
+
+      lstm->Invoke();
+
+      const int num_outputs = lstm->num_outputs();
+      std::vector<float> expected;
+      for (int b = 0; b < num_batches; ++b) {
+        const float* golden_start_batch = output[b].data() + i * num_outputs;
+        const float* golden_end_batch = golden_start_batch + num_outputs;
+        expected.insert(expected.end(), golden_start_batch, golden_end_batch);
+      }
+      EXPECT_THAT(lstm->GetOutput(),
+                  ElementsAreArray(ArrayFloatNear(expected, tolerance)));
+      for (int i = 0; i < num_outputs; ++i) {
+        std::cout << lstm->GetOutput()[i] << ", ";
+      }
+      std::cout << std::endl;
+      for (int i = 0; i < num_outputs; ++i) {
+        std::cout << expected[i] << ", ";
+      }
+      std::cout << std::endl;
+    }
+  }
+};
+
+class NoCifgNoPeepholeNoProjectionNoClippingLstmTest : public BaseLstmTest {
+  void SetUp() override {
+    input_to_input_weights_ = {-0.45018822, -0.02338299, -0.0870589,
+                               -0.34550029, 0.04266912,  -0.15680569,
+                               -0.34856534, 0.43890524};
+    input_to_cell_weights_ = {-0.50013041, 0.1370284,  0.11810488, 0.2013163,
+                              -0.20583314, 0.44344562, 0.22077113, -0.29909778};
+    input_to_forget_weights_ = {0.09701663,  0.20334584,  -0.50592935,
+                                -0.31343272, -0.40032279, 0.44781327,
+                                0.01387155,  -0.35593212};
+    input_to_output_weights_ = {-0.25065863, -0.28290087, 0.04613829,
+                                0.40525138,  0.44272184,  0.03897077,
+                                -0.1556896,  0.19487578};
+    input_gate_bias_ = {0., 0., 0., 0.};
+    cell_gate_bias_ = {0., 0., 0., 0.};
+    forget_gate_bias_ = {1., 1., 1., 1.};
+    output_gate_bias_ = {0., 0., 0., 0.};
+
+    recurrent_to_input_weights_ = {
+        -0.0063535,  -0.2042388,  0.31454784,  -0.35746509,
+        0.28902304,  0.08183324,  -0.16555229, 0.02286911,
+        -0.13566875, 0.03034258,  0.48091322,  -0.12528998,
+        0.24077177,  -0.51332325, -0.33502164, 0.10629296};
+
+    recurrent_to_cell_weights_ = {
+        -0.3407414,  0.24443203,  -0.2078532,  0.26320225,
+        0.05695659,  -0.00123841, -0.4744786,  -0.35869038,
+        -0.06418842, -0.13502428, -0.501764,   0.22830659,
+        -0.46367589, 0.26016325,  -0.03894562, -0.16368064};
+
+    recurrent_to_forget_weights_ = {
+        -0.48684245, -0.06655136, 0.42224967,  0.2112639,
+        0.27654213,  0.20864892,  -0.07646349, 0.45877004,
+        0.00141793,  -0.14609534, 0.36447752,  0.09196436,
+        0.28053468,  0.01560611,  -0.20127171, -0.01140004};
+
+    recurrent_to_output_weights_ = {
+        0.43385774,  -0.17194885, 0.2718237,  0.09215671,
+        0.24107647,  -0.39835793, 0.18212086, 0.01301402,
+        0.48572797,  -0.50656658, 0.20047462, -0.20607421,
+        -0.51818722, -0.15390486, 0.0468148,  0.39922136};
+
+    lstm_input_ = {{2., 3., 3., 4., 1., 1.}};
+    lstm_golden_output_ = {{-0.02973187, 0.1229473, 0.20885126, -0.15358765,
+                            -0.03716109, 0.12507336, 0.41193449, -0.20860538,
+                            -0.15053082, 0.09120187, 0.24278517, -0.12222792}};
+  }
+};
+
+TEST_F(NoCifgNoPeepholeNoProjectionNoClippingLstmTest, LstmBlackBoxTest) {
   const int n_batch = 1;
   const int n_input = 2;
   // n_cell and n_output have the same size when there is no projection.
@@ -257,10 +433,10 @@ TEST(LSTMOpTest, BlackBoxTestNoCifgNoPeepholeNoProjectionNoClipping) {
                        {n_cell, n_input},  // input_to_cell_weight tensor
                        {n_cell, n_input},  // input_to_output_weight tensor
 
-                       {n_cell, n_output},  // recurrent_to_input_weight tensor
-                       {n_cell, n_output},  // recurrent_to_forget_weight tensor
-                       {n_cell, n_output},  // recurrent_to_cell_weight tensor
-                       {n_cell, n_output},  // recurrent_to_output_weight tensor
+                       {n_cell, n_output},  // recurrent_to_input_weight_tensor
+                       {n_cell, n_output},  // recurrent_to_forget_weight_tensor
+                       {n_cell, n_output},  // recurrent_to_cell_weight_tensor
+                       {n_cell, n_output},  // recurrent_to_output_weight_tensor
 
                        {0},  // cell_to_input_weight tensor
                        {0},  // cell_to_forget_weight tensor
@@ -275,79 +451,137 @@ TEST(LSTMOpTest, BlackBoxTestNoCifgNoPeepholeNoProjectionNoClipping) {
                        {0},     // projection_bias tensor
                    });
 
-  lstm.SetInputToInputWeights({-0.45018822, -0.02338299, -0.0870589,
-                               -0.34550029, 0.04266912, -0.15680569,
-                               -0.34856534, 0.43890524});
-
-  lstm.SetInputToCellWeights({-0.50013041, 0.1370284, 0.11810488, 0.2013163,
-                              -0.20583314, 0.44344562, 0.22077113,
-                              -0.29909778});
-
-  lstm.SetInputToForgetWeights({0.09701663, 0.20334584, -0.50592935,
-                                -0.31343272, -0.40032279, 0.44781327,
-                                0.01387155, -0.35593212});
-
-  lstm.SetInputToOutputWeights({-0.25065863, -0.28290087, 0.04613829,
-                                0.40525138, 0.44272184, 0.03897077, -0.1556896,
-                                0.19487578});
+  lstm.SetInputToInputWeights(input_to_input_weights_);
+  lstm.SetInputToCellWeights(input_to_cell_weights_);
+  lstm.SetInputToForgetWeights(input_to_forget_weights_);
+  lstm.SetInputToOutputWeights(input_to_output_weights_);
 
-  lstm.SetInputGateBias({0., 0., 0., 0.});
+  lstm.SetInputGateBias(input_gate_bias_);
+  lstm.SetCellBias(cell_gate_bias_);
+  lstm.SetForgetGateBias(forget_gate_bias_);
+  lstm.SetOutputGateBias(output_gate_bias_);
 
-  lstm.SetCellBias({0., 0., 0., 0.});
+  lstm.SetRecurrentToInputWeights(recurrent_to_input_weights_);
+  lstm.SetRecurrentToCellWeights(recurrent_to_cell_weights_);
+  lstm.SetRecurrentToForgetWeights(recurrent_to_forget_weights_);
+  lstm.SetRecurrentToOutputWeights(recurrent_to_output_weights_);
 
-  lstm.SetForgetGateBias({1., 1., 1., 1.});
-
-  lstm.SetOutputGateBias({0., 0., 0., 0.});
-
-  lstm.SetRecurrentToInputWeights(
-      {-0.0063535, -0.2042388, 0.31454784, -0.35746509, 0.28902304, 0.08183324,
-       -0.16555229, 0.02286911, -0.13566875, 0.03034258, 0.48091322,
-       -0.12528998, 0.24077177, -0.51332325, -0.33502164, 0.10629296});
-
-  lstm.SetRecurrentToCellWeights(
-      {-0.3407414, 0.24443203, -0.2078532, 0.26320225, 0.05695659, -0.00123841,
-       -0.4744786, -0.35869038, -0.06418842, -0.13502428, -0.501764, 0.22830659,
-       -0.46367589, 0.26016325, -0.03894562, -0.16368064});
+  // Resetting cell_state and output_state
+  lstm.ResetCellState();
+  lstm.ResetOutputState();
 
-  lstm.SetRecurrentToForgetWeights(
-      {-0.48684245, -0.06655136, 0.42224967, 0.2112639, 0.27654213, 0.20864892,
-       -0.07646349, 0.45877004, 0.00141793, -0.14609534, 0.36447752, 0.09196436,
-       0.28053468, 0.01560611, -0.20127171, -0.01140004});
+  VerifyGoldens(lstm_input_, lstm_golden_output_, &lstm);
+}
 
-  lstm.SetRecurrentToOutputWeights(
-      {0.43385774, -0.17194885, 0.2718237, 0.09215671, 0.24107647, -0.39835793,
-       0.18212086, 0.01301402, 0.48572797, -0.50656658, 0.20047462, -0.20607421,
-       -0.51818722, -0.15390486, 0.0468148, 0.39922136});
+TEST_F(NoCifgNoPeepholeNoProjectionNoClippingLstmTest, HybridLstmBlackBoxTest) {
+  const int n_batch = 1;
+  const int n_input = 2;
+  // n_cell and n_output have the same size when there is no projection.
+  const int n_cell = 4;
+  const int n_output = 4;
 
-  static float lstm_input[] = {2., 3., 3., 4., 1., 1.};
-  static float lstm_golden_output[] = {-0.02973187, 0.1229473,   0.20885126,
-                                       -0.15358765, -0.03716109, 0.12507336,
-                                       0.41193449,  -0.20860538, -0.15053082,
-                                       0.09120187,  0.24278517,  -0.12222792};
+  HybridLSTMOpModel lstm(
+      n_batch, n_input, n_cell, n_output,
+      /*use_cifg=*/false, /*use_peephole=*/false,
+      /*use_projection_weights=*/false,
+      /*use_projection_bias=*/false, /*cell_clip=*/0.0, /*proj_clip=*/0.0,
+      {
+          {n_batch, n_input},  // input tensor
+
+          {n_cell, n_input},  // input_to_input_weight tensor
+          {n_cell, n_input},  // input_to_forget_weight tensor
+          {n_cell, n_input},  // input_to_cell_weight tensor
+          {n_cell, n_input},  // input_to_output_weight tensor
+
+          {n_cell, n_output},  // recurrent_to_input_weight tensor
+          {n_cell, n_output},  // recurrent_to_forget_weight tensor
+          {n_cell, n_output},  // recurrent_to_cell_weight tensor
+          {n_cell, n_output},  // recurrent_to_output_weight tensor
+
+          {0},  // cell_to_input_weight tensor
+          {0},  // cell_to_forget_weight tensor
+          {0},  // cell_to_output_weight tensor
+
+          {n_cell},  // input_gate_bias tensor
+          {n_cell},  // forget_gate_bias tensor
+          {n_cell},  // cell_bias tensor
+          {n_cell},  // output_gate_bias tensor
+
+          {0, 0},  // projection_weight tensor
+          {0},     // projection_bias tensor
+      });
+
+  lstm.SetInputToInputWeights(input_to_input_weights_);
+  lstm.SetInputToCellWeights(input_to_cell_weights_);
+  lstm.SetInputToForgetWeights(input_to_forget_weights_);
+  lstm.SetInputToOutputWeights(input_to_output_weights_);
+
+  lstm.SetInputGateBias(input_gate_bias_);
+  lstm.SetCellBias(cell_gate_bias_);
+  lstm.SetForgetGateBias(forget_gate_bias_);
+  lstm.SetOutputGateBias(output_gate_bias_);
+
+  lstm.SetRecurrentToInputWeights(recurrent_to_input_weights_);
+  lstm.SetRecurrentToCellWeights(recurrent_to_cell_weights_);
+  lstm.SetRecurrentToForgetWeights(recurrent_to_forget_weights_);
+  lstm.SetRecurrentToOutputWeights(recurrent_to_output_weights_);
 
   // Resetting cell_state and output_state
   lstm.ResetCellState();
   lstm.ResetOutputState();
 
-  const int input_sequence_size =
-      sizeof(lstm_input) / sizeof(float) / (lstm.num_inputs());
-  for (int i = 0; i < input_sequence_size; i++) {
-    float* batch0_start = lstm_input + i * lstm.num_inputs();
-    float* batch0_end = batch0_start + lstm.num_inputs();
+  VerifyGoldens(lstm_input_, lstm_golden_output_, &lstm,
+                /*tolerance=*/0.0157651);
+}
 
-    lstm.SetInput(0, batch0_start, batch0_end);
+class CifgNoPeepholeNoProjectionNoClippingLstmTest : public BaseLstmTest {
+  void SetUp() override {
+    input_to_cell_weights_ = {-0.49770179, -0.27711356, -0.09624726,
+                              0.05100781,  0.04717243,  0.48944736,
+                              -0.38535351, -0.17212132};
 
-    lstm.Invoke();
+    input_to_forget_weights_ = {-0.55291498, -0.42866567, 0.13056988,
+                                -0.3633365,  -0.22755712, 0.28253698,
+                                0.24407166,  0.33826375};
 
-    float* golden_start = lstm_golden_output + i * lstm.num_outputs();
-    float* golden_end = golden_start + lstm.num_outputs();
-    std::vector<float> expected;
-    expected.insert(expected.end(), golden_start, golden_end);
-    EXPECT_THAT(lstm.GetOutput(), ElementsAreArray(ArrayFloatNear(expected)));
+    input_to_output_weights_ = {0.10725588,  -0.02335852, -0.55932593,
+                                -0.09426838, -0.44257352, 0.54939759,
+                                0.01533556,  0.42751634};
+    cell_gate_bias_ = {0., 0., 0., 0.};
+    forget_gate_bias_ = {1., 1., 1., 1.};
+    output_gate_bias_ = {0., 0., 0., 0.};
+
+    recurrent_to_cell_weights_ = {
+        0.54066205,  -0.32668582, -0.43562764, -0.56094903,
+        0.42957711,  0.01841056,  -0.32764608, -0.33027974,
+        -0.10826075, 0.20675004,  0.19069612,  -0.03026325,
+        -0.54532051, 0.33003211,  0.44901288,  0.21193194};
+
+    recurrent_to_forget_weights_ = {
+        -0.13832897, -0.0515101,  -0.2359007, -0.16661474,
+        -0.14340827, 0.36986142,  0.23414481, 0.55899,
+        0.10798943,  -0.41174671, 0.17751795, -0.34484994,
+        -0.35874045, -0.11352962, 0.27268326, 0.54058349};
+
+    recurrent_to_output_weights_ = {
+        0.41613156, 0.42610586,  -0.16495961, -0.5663873,
+        0.30579174, -0.05115908, -0.33941799, 0.23364776,
+        0.11178309, 0.09481031,  -0.26424935, 0.46261835,
+        0.50248802, 0.26114327,  -0.43736315, 0.33149987};
+
+    cell_to_forget_weights_ = {0.47485286, -0.51955009, -0.24458408,
+                               0.31544167};
+    cell_to_output_weights_ = {-0.17135078, 0.82760304, 0.85573703,
+                               -0.77109635};
+
+    lstm_input_ = {{2., 3., 3., 4., 1., 1.}};
+    lstm_golden_output_ = {{-0.36444446, -0.00352185, 0.12886585, -0.05163646,
+                            -0.42312205, -0.01218222, 0.24201041, -0.08124574,
+                            -0.358325, -0.04621704, 0.21641694, -0.06471302}};
   }
-}
+};
 
-TEST(LSTMOpTest, BlackBoxTestWithCifgWithPeepholeNoProjectionNoClipping) {
+TEST_F(CifgNoPeepholeNoProjectionNoClippingLstmTest, LstmBlackBoxTest) {
   const int n_batch = 1;
   const int n_input = 2;
   // n_cell and n_output have the same size when there is no projection.
@@ -385,74 +619,689 @@ TEST(LSTMOpTest, BlackBoxTestWithCifgWithPeepholeNoProjectionNoClipping) {
                        {0},     // projection_bias tensor
                    });
 
-  lstm.SetInputToCellWeights({-0.49770179, -0.27711356, -0.09624726, 0.05100781,
-                              0.04717243, 0.48944736, -0.38535351,
-                              -0.17212132});
-
-  lstm.SetInputToForgetWeights({-0.55291498, -0.42866567, 0.13056988,
-                                -0.3633365, -0.22755712, 0.28253698, 0.24407166,
-                                0.33826375});
-
-  lstm.SetInputToOutputWeights({0.10725588, -0.02335852, -0.55932593,
-                                -0.09426838, -0.44257352, 0.54939759,
-                                0.01533556, 0.42751634});
-
-  lstm.SetCellBias({0., 0., 0., 0.});
+  lstm.SetInputToCellWeights(input_to_cell_weights_);
+  lstm.SetInputToForgetWeights(input_to_forget_weights_);
+  lstm.SetInputToOutputWeights(input_to_output_weights_);
 
-  lstm.SetForgetGateBias({1., 1., 1., 1.});
+  lstm.SetCellBias(cell_gate_bias_);
+  lstm.SetForgetGateBias(forget_gate_bias_);
+  lstm.SetOutputGateBias(output_gate_bias_);
 
-  lstm.SetOutputGateBias({0., 0., 0., 0.});
+  lstm.SetRecurrentToCellWeights(recurrent_to_cell_weights_);
+  lstm.SetRecurrentToForgetWeights(recurrent_to_forget_weights_);
+  lstm.SetRecurrentToOutputWeights(recurrent_to_output_weights_);
 
-  lstm.SetRecurrentToCellWeights(
-      {0.54066205, -0.32668582, -0.43562764, -0.56094903, 0.42957711,
-       0.01841056, -0.32764608, -0.33027974, -0.10826075, 0.20675004,
-       0.19069612, -0.03026325, -0.54532051, 0.33003211, 0.44901288,
-       0.21193194});
+  lstm.SetCellToForgetWeights(cell_to_forget_weights_);
+  lstm.SetCellToOutputWeights(cell_to_output_weights_);
 
-  lstm.SetRecurrentToForgetWeights(
-      {-0.13832897, -0.0515101, -0.2359007, -0.16661474, -0.14340827,
-       0.36986142, 0.23414481, 0.55899, 0.10798943, -0.41174671, 0.17751795,
-       -0.34484994, -0.35874045, -0.11352962, 0.27268326, 0.54058349});
+  // Resetting cell_state and output_state
+  lstm.ResetCellState();
+  lstm.ResetOutputState();
 
-  lstm.SetRecurrentToOutputWeights(
-      {0.41613156, 0.42610586, -0.16495961, -0.5663873, 0.30579174, -0.05115908,
-       -0.33941799, 0.23364776, 0.11178309, 0.09481031, -0.26424935, 0.46261835,
-       0.50248802, 0.26114327, -0.43736315, 0.33149987});
+  VerifyGoldens(lstm_input_, lstm_golden_output_, &lstm);
+}
 
-  lstm.SetCellToForgetWeights(
-      {0.47485286, -0.51955009, -0.24458408, 0.31544167});
-  lstm.SetCellToOutputWeights(
-      {-0.17135078, 0.82760304, 0.85573703, -0.77109635});
+TEST_F(CifgNoPeepholeNoProjectionNoClippingLstmTest, HybridLstmBlackBoxTest) {
+  const int n_batch = 1;
+  const int n_input = 2;
+  // n_cell and n_output have the same size when there is no projection.
+  const int n_cell = 4;
+  const int n_output = 4;
 
-  static float lstm_input[] = {2., 3., 3., 4., 1., 1.};
-  static float lstm_golden_output[] = {-0.36444446, -0.00352185, 0.12886585,
-                                       -0.05163646, -0.42312205, -0.01218222,
-                                       0.24201041,  -0.08124574, -0.358325,
-                                       -0.04621704, 0.21641694,  -0.06471302};
+  HybridLSTMOpModel lstm(
+      n_batch, n_input, n_cell, n_output,
+      /*use_cifg=*/true, /*use_peephole=*/true,
+      /*use_projection_weights=*/false,
+      /*use_projection_bias=*/false,
+      /*cell_clip=*/0.0, /*proj_clip=*/0.0,
+      {
+          {n_batch, n_input},  // input tensor
+
+          {0, 0},             // input_to_input_weight tensor
+          {n_cell, n_input},  // input_to_forget_weight tensor
+          {n_cell, n_input},  // input_to_cell_weight tensor
+          {n_cell, n_input},  // input_to_output_weight tensor
+
+          {0, 0},              // recurrent_to_input_weight tensor
+          {n_cell, n_output},  // recurrent_to_forget_weight tensor
+          {n_cell, n_output},  // recurrent_to_cell_weight tensor
+          {n_cell, n_output},  // recurrent_to_output_weight tensor
+
+          {0},       // cell_to_input_weight tensor
+          {n_cell},  // cell_to_forget_weight tensor
+          {n_cell},  // cell_to_output_weight tensor
+
+          {0},       // input_gate_bias tensor
+          {n_cell},  // forget_gate_bias tensor
+          {n_cell},  // cell_bias tensor
+          {n_cell},  // output_gate_bias tensor
+
+          {0, 0},  // projection_weight tensor
+          {0},     // projection_bias tensor
+      });
+
+  lstm.SetInputToCellWeights(input_to_cell_weights_);
+  lstm.SetInputToForgetWeights(input_to_forget_weights_);
+  lstm.SetInputToOutputWeights(input_to_output_weights_);
+
+  lstm.SetCellBias(cell_gate_bias_);
+  lstm.SetForgetGateBias(forget_gate_bias_);
+  lstm.SetOutputGateBias(output_gate_bias_);
+
+  lstm.SetRecurrentToCellWeights(recurrent_to_cell_weights_);
+  lstm.SetRecurrentToForgetWeights(recurrent_to_forget_weights_);
+  lstm.SetRecurrentToOutputWeights(recurrent_to_output_weights_);
+
+  lstm.SetCellToForgetWeights(cell_to_forget_weights_);
+  lstm.SetCellToOutputWeights(cell_to_output_weights_);
 
   // Resetting cell_state and output_state
   lstm.ResetCellState();
   lstm.ResetOutputState();
 
-  const int input_sequence_size =
-      sizeof(lstm_input) / sizeof(float) / (lstm.num_inputs());
-  for (int i = 0; i < input_sequence_size; i++) {
-    float* batch0_start = lstm_input + i * lstm.num_inputs();
-    float* batch0_end = batch0_start + lstm.num_inputs();
-
-    lstm.SetInput(0, batch0_start, batch0_end);
-
-    lstm.Invoke();
+  VerifyGoldens(lstm_input_, lstm_golden_output_, &lstm, /*tolerance=*/0.03573);
+}
 
-    float* golden_start = lstm_golden_output + i * lstm.num_outputs();
-    float* golden_end = golden_start + lstm.num_outputs();
-    std::vector<float> expected;
-    expected.insert(expected.end(), golden_start, golden_end);
-    EXPECT_THAT(lstm.GetOutput(), ElementsAreArray(ArrayFloatNear(expected)));
+class NoCifgPeepholeProjectionClippingLstmTest : public BaseLstmTest {
+  void SetUp() override {
+    input_to_input_weights_ = {
+        0.021393683,  0.06124551,    0.046905167,  -0.014657677,  -0.03149463,
+        0.09171803,   0.14647801,    0.10797193,   -0.0057968358, 0.0019193048,
+        -0.2726754,   0.10154029,    -0.018539885, 0.080349885,   -0.10262385,
+        -0.022599787, -0.09121155,   -0.008675967, -0.045206103,  -0.0821282,
+        -0.008045952, 0.015478081,   0.055217247,  0.038719587,   0.044153627,
+        -0.06453243,  0.05031825,    -0.046935108, -0.008164439,  0.014574226,
+        -0.1671009,   -0.15519552,   -0.16819797,  -0.13971269,   -0.11953059,
+        0.25005487,   -0.22790983,   0.009855087,  -0.028140958,  -0.11200698,
+        0.11295408,   -0.0035217577, 0.054485075,  0.05184695,    0.064711206,
+        0.10989193,   0.11674786,    0.03490607,   0.07727357,    0.11390585,
+        -0.1863375,   -0.1034451,    -0.13945189,  -0.049401227,  -0.18767063,
+        0.042483903,  0.14233552,    0.13832581,   0.18350165,    0.14545603,
+        -0.028545704, 0.024939531,   0.050929718,  0.0076203286,  -0.0029723682,
+        -0.042484224, -0.11827596,   -0.09171104,  -0.10808628,   -0.16327988,
+        -0.2273378,   -0.0993647,    -0.017155107, 0.0023917493,  0.049272764,
+        0.0038534778, 0.054764505,   0.089753784,  0.06947234,    0.08014476,
+        -0.04544234,  -0.0497073,    -0.07135631,  -0.048929106,  -0.004042012,
+        -0.009284026, 0.018042054,   0.0036860977, -0.07427302,   -0.11434604,
+        -0.018995456, 0.031487543,   0.012834908,  0.019977754,   0.044256654,
+        -0.39292613,  -0.18519334,   -0.11651281,  -0.06809892,   0.011373677};
+
+    input_to_forget_weights_ = {
+        -0.0018401089, -0.004852237, 0.03698424,    0.014181704,
+        0.028273236,   -0.016726194, -0.05249759,   -0.10204261,
+        0.00861066,    -0.040979505, -0.009899187,  0.01923892,
+        -0.028177269,  -0.08535103,  -0.14585495,   0.10662567,
+        -0.01909731,   -0.017883534, -0.0047269356, -0.045103323,
+        0.0030784295,  0.076784775,  0.07463696,    0.094531395,
+        0.0814421,     -0.12257899,  -0.033945758,  -0.031303465,
+        0.045630626,   0.06843887,   -0.13492945,   -0.012480007,
+        -0.0811829,    -0.07224499,  -0.09628791,   0.045100946,
+        0.0012300825,  0.013964662,  0.099372394,   0.02543059,
+        0.06958324,    0.034257296,  0.0482646,     0.06267997,
+        0.052625068,   0.12784666,   0.07077897,    0.025725935,
+        0.04165009,    0.07241905,   0.018668644,   -0.037377294,
+        -0.06277783,   -0.08833636,  -0.040120605,  -0.011405586,
+        -0.007808335,  -0.010301386, -0.005102167,  0.027717464,
+        0.05483423,    0.11449111,   0.11289652,    0.10939839,
+        0.13396506,    -0.08402166,  -0.01901462,   -0.044678304,
+        -0.07720565,   0.014350063,  -0.11757958,   -0.0652038,
+        -0.08185733,   -0.076754324, -0.092614375,  0.10405491,
+        0.052960336,   0.035755895,  0.035839386,   -0.012540553,
+        0.036881298,   0.02913376,   0.03420159,    0.05448447,
+        -0.054523353,  0.02582715,   0.02327355,    -0.011857179,
+        -0.0011980024, -0.034641717, -0.026125094,  -0.17582615,
+        -0.15923657,   -0.27486774,  -0.0006143371, 0.0001771948,
+        -8.470171e-05, 0.02651807,   0.045790765,   0.06956496};
+
+    input_to_cell_weights_ = {
+        -0.04580283,   -0.09549462,   -0.032418985,  -0.06454633,
+        -0.043528453,  0.043018587,   -0.049152344,  -0.12418144,
+        -0.078985475,  -0.07596889,   0.019484362,   -0.11434962,
+        -0.0074034138, -0.06314844,   -0.092981495,  0.0062155537,
+        -0.025034338,  -0.0028890965, 0.048929527,   0.06235075,
+        0.10665918,    -0.032036792,  -0.08505916,   -0.10843358,
+        -0.13002433,   -0.036816437,  -0.02130134,   -0.016518239,
+        0.0047691227,  -0.0025825808, 0.066017866,   0.029991534,
+        -0.10652836,   -0.1037554,    -0.13056071,   -0.03266643,
+        -0.033702414,  -0.006473424,  -0.04611692,   0.014419339,
+        -0.025174323,  0.0396852,     0.081777506,   0.06157468,
+        0.10210095,    -0.009658194,  0.046511717,   0.03603906,
+        0.0069369148,  0.015960095,   -0.06507666,   0.09551598,
+        0.053568836,   0.06408714,    0.12835667,    -0.008714329,
+        -0.20211966,   -0.12093674,   0.029450472,   0.2849013,
+        -0.029227901,  0.1164364,     -0.08560263,   0.09941786,
+        -0.036999565,  -0.028842626,  -0.0033637602, -0.017012902,
+        -0.09720865,   -0.11193351,   -0.029155117,  -0.017936034,
+        -0.009768936,  -0.04223324,   -0.036159635,  0.06505112,
+        -0.021742892,  -0.023377212,  -0.07221364,   -0.06430552,
+        0.05453865,    0.091149814,   0.06387331,    0.007518393,
+        0.055960953,   0.069779344,   0.046411168,   0.10509911,
+        0.07463894,    0.0075130584,  0.012850982,   0.04555431,
+        0.056955688,   0.06555285,    0.050801456,   -0.009862683,
+        0.00826772,    -0.026555609,  -0.0073611983, -0.0014897042};
+
+    input_to_output_weights_ = {
+        -0.0998932,   -0.07201956,  -0.052803773,  -0.15629593,  -0.15001918,
+        -0.07650751,  0.02359855,   -0.075155355,  -0.08037709,  -0.15093534,
+        0.029517552,  -0.04751393,  0.010350531,   -0.02664851,  -0.016839722,
+        -0.023121163, 0.0077019283, 0.012851257,   -0.05040649,  -0.0129761,
+        -0.021737747, -0.038305793, -0.06870586,   -0.01481247,  -0.001285394,
+        0.10124236,   0.083122835,  0.053313006,   -0.062235646, -0.075637154,
+        -0.027833903, 0.029774971,  0.1130802,     0.09218906,   0.09506135,
+        -0.086665764, -0.037162706, -0.038880914,  -0.035832845, -0.014481564,
+        -0.09825003,  -0.12048569,  -0.097665586,  -0.05287633,  -0.0964047,
+        -0.11366429,  0.035777505,  0.13568819,    0.052451383,  0.050649304,
+        0.05798951,   -0.021852335, -0.099848844,  0.014740475,  -0.078897946,
+        0.04974699,   0.014160473,  0.06973932,    0.04964942,   0.033364646,
+        0.08190124,   0.025535367,  0.050893165,   0.048514254,  0.06945813,
+        -0.078907564, -0.06707616,  -0.11844508,   -0.09986688,  -0.07509403,
+        0.06263226,   0.14925587,   0.20188436,    0.12098451,   0.14639415,
+        0.0015017595, -0.014267382, -0.03417257,   0.012711468,  0.0028300495,
+        -0.024758482, -0.05098548,  -0.0821182,    0.014225672,  0.021544158,
+        0.08949725,   0.07505268,   -0.0020780868, 0.04908258,   0.06476295,
+        -0.022907063, 0.027562456,  0.040185735,   0.019567577,  -0.015598739,
+        -0.049097303, -0.017121866, -0.083368234,  -0.02332002,  -0.0840956};
+
+    input_gate_bias_ = {0.02234832,   0.14757581,  0.18176508,  0.10380666,
+                        0.053110216,  -0.06928846, -0.13942584, -0.11816189,
+                        0.19483899,   0.03652339,  -0.10250295, 0.036714908,
+                        -0.18426876,  0.036065217, 0.21810818,  0.02383196,
+                        -0.043370757, 0.08690144,  -0.04444982, 0.00030581196};
+
+    forget_gate_bias_ = {0.035185695, -0.042891346, -0.03032477, 0.23027696,
+                         0.11098921,  0.15378423,   0.09263801,  0.09790885,
+                         0.09508917,  0.061199076,  0.07665568,  -0.015443159,
+                         -0.03499149, 0.046190713,  0.08895977,  0.10899629,
+                         0.40694186,  0.06030037,   0.012413437, -0.06108739};
+
+    cell_gate_bias_ = {-0.024379363, 0.0055531194, 0.23377132,   0.033463873,
+                       -0.1483596,   -0.10639995,  -0.091433935, 0.058573797,
+                       -0.06809782,  -0.07889636,  -0.043246906, -0.09829136,
+                       -0.4279842,   0.034901652,  0.18797937,   0.0075234566,
+                       0.016178843,  0.1749513,    0.13975595,   0.92058027};
+
+    output_gate_bias_ = {0.046159424, -0.0012809046, 0.03563469,   0.12648113,
+                         0.027195795, 0.35373217,    -0.018957434, 0.008907322,
+                         -0.0762701,  0.12018895,    0.04216877,   0.0022856654,
+                         0.040952638, 0.3147856,     0.08225149,   -0.057416286,
+                         -0.14995944, -0.008040261,  0.13208859,   0.029760877};
+
+    recurrent_to_input_weights_ = {
+        -0.001374326,   -0.078856036,   0.10672688,    0.029162422,
+        -0.11585556,    0.02557986,     -0.13446963,   -0.035785314,
+        -0.01244275,    0.025961924,    -0.02337298,   -0.044228926,
+        -0.055839065,   -0.046598054,   -0.010546039,  -0.06900766,
+        0.027239809,    0.022582639,    -0.013296484,  -0.05459212,
+        0.08981,        -0.045407712,   0.08682226,    -0.06867011,
+        -0.14390695,    -0.02916037,    0.000996957,   0.091420636,
+        0.14283475,     -0.07390571,    -0.06402044,   0.062524505,
+        -0.093129106,   0.04860203,     -0.08364217,   -0.08119002,
+        0.009352075,    0.22920375,     0.0016303885,  0.11583097,
+        -0.13732095,    0.012405723,    -0.07551853,   0.06343048,
+        0.12162708,     -0.031923793,   -0.014335606,  0.01790974,
+        -0.10650317,    -0.0724401,     0.08554849,    -0.05727212,
+        0.06556731,     -0.042729504,   -0.043227166,  0.011683251,
+        -0.013082158,   -0.029302018,   -0.010899579,  -0.062036745,
+        -0.022509435,   -0.00964907,    -0.01567329,   0.04260106,
+        -0.07787477,    -0.11576462,    0.017356863,   0.048673786,
+        -0.017577527,   -0.05527947,    -0.082487635,  -0.040137455,
+        -0.10820036,    -0.04666372,    0.022746278,   -0.07851417,
+        0.01068115,     0.032956902,    0.022433773,   0.0026891115,
+        0.08944216,     -0.0685835,     0.010513544,   0.07228705,
+        0.02032331,     -0.059686817,   -0.0005566496, -0.086984694,
+        0.040414046,    -0.1380399,     0.094208956,   -0.05722982,
+        0.012092817,    -0.04989123,    -0.086576,     -0.003399834,
+        -0.04696032,    -0.045747425,   0.10091314,    0.048676282,
+        -0.029037097,   0.031399418,    -0.0040285117, 0.047237843,
+        0.09504992,     0.041799378,    -0.049185462,  -0.031518843,
+        -0.10516937,    0.026374253,    0.10058866,    -0.0033195973,
+        -0.041975245,   0.0073591834,   0.0033782164,  -0.004325073,
+        -0.10167381,    0.042500053,    -0.01447153,   0.06464186,
+        -0.017142897,   0.03312627,     0.009205989,   0.024138335,
+        -0.011337001,   0.035530265,    -0.010912711,  0.0706555,
+        -0.005894094,   0.051841937,    -0.1401738,    -0.02351249,
+        0.0365468,      0.07590991,     0.08838724,    0.021681072,
+        -0.10086113,    0.019608743,    -0.06195883,   0.077335775,
+        0.023646897,    -0.095322326,   0.02233014,    0.09756986,
+        -0.048691444,   -0.009579111,   0.07595467,    0.11480546,
+        -0.09801813,    0.019894179,    0.08502348,    0.004032281,
+        0.037211012,    0.068537936,    -0.048005626,  -0.091520436,
+        -0.028379958,   -0.01556313,    0.06554592,    -0.045599163,
+        -0.01672207,    -0.020169014,   -0.011877351,  -0.20212261,
+        0.010889619,    0.0047078193,   0.038385306,   0.08540671,
+        -0.017140968,   -0.0035865551,  0.016678626,   0.005633034,
+        0.015963363,    0.00871737,     0.060130805,   0.028611384,
+        0.10109069,     -0.015060172,   -0.07894427,   0.06401885,
+        0.011584063,    -0.024466386,   0.0047652307,  -0.09041358,
+        0.030737216,    -0.0046374933,  0.14215417,    -0.11823516,
+        0.019899689,    0.006106124,    -0.027092824,  0.0786356,
+        0.05052217,     -0.058925,      -0.011402121,  -0.024987547,
+        -0.0013661642,  -0.06832946,    -0.015667673,  -0.1083353,
+        -0.00096863037, -0.06988685,    -0.053350925,  -0.027275559,
+        -0.033664223,   -0.07978348,    -0.025200296,  -0.017207067,
+        -0.058403496,   -0.055697463,   0.005798788,   0.12965427,
+        -0.062582195,   0.0013350133,   -0.10482091,   0.0379771,
+        0.072521195,    -0.0029455067,  -0.13797039,   -0.03628521,
+        0.013806405,    -0.017858358,   -0.01008298,   -0.07700066,
+        -0.017081132,   0.019358726,    0.0027079724,  0.004635139,
+        0.062634714,    -0.02338735,    -0.039547626,  -0.02050681,
+        0.03385117,     -0.083611414,   0.002862572,   -0.09421313,
+        0.058618143,    -0.08598433,    0.00972939,    0.023867095,
+        -0.053934585,   -0.023203006,   0.07452513,    -0.048767887,
+        -0.07314807,    -0.056307215,   -0.10433547,   -0.06440842,
+        0.04328182,     0.04389765,     -0.020006588,  -0.09076438,
+        -0.11652589,    -0.021705797,   0.03345259,    -0.010329105,
+        -0.025767034,   0.013057034,    -0.07316461,   -0.10145612,
+        0.06358255,     0.18531723,     0.07759293,    0.12006465,
+        0.1305557,      0.058638252,    -0.03393652,   0.09622831,
+        -0.16253184,    -2.4580743e-06, 0.079869635,   -0.070196845,
+        -0.005644518,   0.06857898,     -0.12598175,   -0.035084512,
+        0.03156317,     -0.12794146,    -0.031963028,  0.04692781,
+        0.030070418,    0.0071660685,   -0.095516115,  -0.004643372,
+        0.040170413,    -0.062104587,   -0.0037324072, 0.0554317,
+        0.08184801,     -0.019164372,   0.06791302,    0.034257166,
+        -0.10307039,    0.021943003,    0.046745934,   0.0790918,
+        -0.0265588,     -0.007824208,   0.042546265,   -0.00977924,
+        -0.0002440307,  -0.017384544,   -0.017990116,  0.12252321,
+        -0.014512694,   -0.08251313,    0.08861942,    0.13589665,
+        0.026351685,    0.012641483,    0.07466548,    0.044301085,
+        -0.045414884,   -0.051112458,   0.03444247,    -0.08502782,
+        -0.04106223,    -0.028126027,   0.028473156,   0.10467447};
+
+    recurrent_to_cell_weights_ = {
+        -0.037322544,   0.018592842,   0.0056175636,  -0.06253426,
+        0.055647098,    -0.05713207,   -0.05626563,   0.005559383,
+        0.03375411,     -0.025757805,  -0.088049285,  0.06017052,
+        -0.06570978,    0.007384076,   0.035123326,   -0.07920549,
+        0.053676967,    0.044480428,   -0.07663568,   0.0071805613,
+        0.08089997,     0.05143358,    0.038261272,   0.03339287,
+        -0.027673481,   0.044746667,   0.028349208,   0.020090483,
+        -0.019443132,   -0.030755889,  -0.0040000007, 0.04465846,
+        -0.021585021,   0.0031670958,  0.0053199246,  -0.056117613,
+        -0.10893326,    0.076739706,   -0.08509834,   -0.027997585,
+        0.037871376,    0.01449768,    -0.09002357,   -0.06111149,
+        -0.046195522,   0.0422062,     -0.005683705,  -0.1253618,
+        -0.012925729,   -0.04890792,   0.06985068,    0.037654128,
+        0.03398274,     -0.004781977,  0.007032333,   -0.031787455,
+        0.010868644,    -0.031489216,  0.09525667,    0.013939797,
+        0.0058680447,   0.0167067,     0.02668468,    -0.04797466,
+        -0.048885044,   -0.12722108,   0.035304096,   0.06554885,
+        0.00972396,     -0.039238118,  -0.05159735,   -0.11329045,
+        0.1613692,      -0.03750952,   0.06529313,    -0.071974665,
+        -0.11769596,    0.015524369,   -0.0013754242, -0.12446318,
+        0.02786344,     -0.014179351,  0.005264273,   0.14376344,
+        0.015983658,    0.03406988,    -0.06939408,   0.040699873,
+        0.02111075,     0.09669095,    0.041345075,   -0.08316494,
+        -0.07684199,    -0.045768797,  0.032298047,   -0.041805092,
+        0.0119405,      0.0061010392,  0.12652606,    0.0064572375,
+        -0.024950314,   0.11574242,    0.04508852,    -0.04335324,
+        0.06760663,     -0.027437469,  0.07216407,    0.06977076,
+        -0.05438599,    0.034033038,   -0.028602652,  0.05346137,
+        0.043184172,    -0.037189785,  0.10420091,    0.00882477,
+        -0.054019816,   -0.074273005,  -0.030617684,  -0.0028467078,
+        0.024302477,    -0.0038869337, 0.005332455,   0.0013399826,
+        0.04361412,     -0.007001822,  0.09631092,    -0.06702025,
+        -0.042049985,   -0.035070654,  -0.04103342,   -0.10273396,
+        0.0544271,      0.037184782,   -0.13150354,   -0.0058036847,
+        -0.008264958,   0.042035464,   0.05891794,    0.029673764,
+        0.0063542654,   0.044788733,   0.054816857,   0.062257513,
+        -0.00093483756, 0.048938446,   -0.004952862,  -0.007730018,
+        -0.04043371,    -0.017094059,  0.07229206,    -0.023670016,
+        -0.052195564,   -0.025616996,  -0.01520939,   0.045104615,
+        -0.007376126,   0.003533447,   0.006570588,   0.056037236,
+        0.12436656,     0.051817212,   0.028532185,   -0.08686856,
+        0.11868599,     0.07663395,    -0.07323171,   0.03463402,
+        -0.050708205,   -0.04458982,   -0.11590894,   0.021273347,
+        0.1251325,      -0.15313013,   -0.12224372,   0.17228661,
+        0.023029093,    0.086124025,   0.006445803,   -0.03496501,
+        0.028332196,    0.04449512,    -0.042436164,  -0.026587414,
+        -0.006041347,   -0.09292539,   -0.05678812,   0.03897832,
+        0.09465633,     0.008115513,   -0.02171956,   0.08304309,
+        0.071401566,    0.019622514,   0.032163795,   -0.004167056,
+        0.02295182,     0.030739572,   0.056506045,   0.004612461,
+        0.06524936,     0.059999723,   0.046395954,   -0.0045512207,
+        -0.1335546,     -0.030136576,  0.11584653,    -0.014678886,
+        0.0020118146,   -0.09688814,   -0.0790206,    0.039770417,
+        -0.0329582,     0.07922767,    0.029322514,   0.026405897,
+        0.04207835,     -0.07073373,   0.063781224,   0.0859677,
+        -0.10925287,    -0.07011058,   0.048005477,   0.03438226,
+        -0.09606514,    -0.006669445,  -0.043381985,  0.04240257,
+        -0.06955775,    -0.06769346,   0.043903265,   -0.026784198,
+        -0.017840602,   0.024307009,   -0.040079936,  -0.019946516,
+        0.045318738,    -0.12233574,   0.026170589,   0.0074471775,
+        0.15978073,     0.10185836,    0.10298046,    -0.015476589,
+        -0.039390966,   -0.072174534,  0.0739445,     -0.1211869,
+        -0.0347889,     -0.07943156,   0.014809798,   -0.12412325,
+        -0.0030663363,  0.039695457,   0.0647603,     -0.08291318,
+        -0.018529687,   -0.004423833,  0.0037507233,  0.084633216,
+        -0.01514876,    -0.056505352,  -0.012800942,  -0.06994386,
+        0.012962922,    -0.031234352,  0.07029052,    0.016418684,
+        0.03618972,     0.055686004,   -0.08663945,   -0.017404709,
+        -0.054761406,   0.029065743,   0.052404847,   0.020238016,
+        0.0048197987,   -0.0214882,    0.07078733,    0.013016777,
+        0.06262858,     0.009184685,   0.020785125,   -0.043904778,
+        -0.0270329,     -0.03299152,   -0.060088247,  -0.015162964,
+        -0.001828936,   0.12642565,    -0.056757294,  0.013586685,
+        0.09232601,     -0.035886683,  0.06000002,    0.05229691,
+        -0.052580316,   -0.082029596,  -0.010794592,  0.012947712,
+        -0.036429964,   -0.085508935,  -0.13127148,   -0.017744139,
+        0.031502828,    0.036232427,   -0.031581745,  0.023051167,
+        -0.05325106,    -0.03421577,   0.028793324,   -0.034633752,
+        -0.009881397,   -0.043551125,  -0.018609839,  0.0019097115,
+        -0.008799762,   0.056595087,   0.0022273948,  0.055752404};
+
+    recurrent_to_forget_weights_ = {
+        -0.057784554,  -0.026057621,  -0.068447545,   -0.022581743,
+        0.14811787,    0.10826372,    0.09471067,     0.03987225,
+        -0.0039523416, 0.00030638507, 0.053185795,    0.10572994,
+        0.08414449,    -0.022036452,  -0.00066928595, -0.09203576,
+        0.032950465,   -0.10985798,   -0.023809856,   0.0021431844,
+        -0.02196096,   -0.00326074,   0.00058621005,  -0.074678116,
+        -0.06193199,   0.055729095,   0.03736828,     0.020123724,
+        0.061878487,   -0.04729229,   0.034919553,    -0.07585433,
+        -0.04421272,   -0.044019096,  0.085488975,    0.04058006,
+        -0.06890133,   -0.030951202,  -0.024628663,   -0.07672815,
+        0.034293607,   0.08556707,    -0.05293577,    -0.033561368,
+        -0.04899627,   0.0241671,     0.015736353,    -0.095442444,
+        -0.029564252,  0.016493602,   -0.035026584,   0.022337519,
+        -0.026871363,  0.004780428,   0.0077918363,   -0.03601621,
+        0.016435321,   -0.03263031,   -0.09543275,    -0.047392778,
+        0.013454138,   0.028934088,   0.01685226,     -0.086110644,
+        -0.046250615,  -0.01847454,   0.047608484,    0.07339695,
+        0.034546845,   -0.04881143,   0.009128804,    -0.08802852,
+        0.03761666,    0.008096139,   -0.014454086,   0.014361001,
+        -0.023502491,  -0.0011840804, -0.07607001,    0.001856849,
+        -0.06509276,   -0.006021153,  -0.08570962,    -0.1451793,
+        0.060212336,   0.055259194,   0.06974018,     0.049454916,
+        -0.027794661,  -0.08077226,   -0.016179763,   0.1169753,
+        0.17213494,    -0.0056326236, -0.053934924,   -0.0124349,
+        -0.11520337,   0.05409887,    0.088759385,    0.0019655675,
+        0.0042065294,  0.03881498,    0.019844765,    0.041858196,
+        -0.05695512,   0.047233116,   0.038937137,    -0.06542224,
+        0.014429736,   -0.09719407,   0.13908425,     -0.05379757,
+        0.012321099,   0.082840554,   -0.029899208,   0.044217527,
+        0.059855383,   0.07711018,    -0.045319796,   0.0948846,
+        -0.011724666,  -0.0033288454, -0.033542685,   -0.04764985,
+        -0.13873616,   0.040668588,   0.034832682,    -0.015319203,
+        -0.018715994,  0.046002675,   0.0599172,      -0.043107376,
+        0.0294216,     -0.002314414,  -0.022424703,   0.0030315618,
+        0.0014641669,  0.0029166266,  -0.11878115,    0.013738511,
+        0.12375372,    -0.0006038222, 0.029104086,    0.087442465,
+        0.052958444,   0.07558703,    0.04817258,     0.044462286,
+        -0.015213451,  -0.08783778,   -0.0561384,     -0.003008196,
+        0.047060397,   -0.002058388,  0.03429439,     -0.018839769,
+        0.024734668,   0.024614193,   -0.042046934,   0.09597743,
+        -0.0043254104, 0.04320769,    0.0064070094,   -0.0019131786,
+        -0.02558259,   -0.022822596,  -0.023273505,   -0.02464396,
+        -0.10991725,   -0.006240552,  0.0074488563,   0.024044557,
+        0.04383914,    -0.046476185,  0.028658995,    0.060410924,
+        0.050786525,   0.009452605,   -0.0073054377,  -0.024810238,
+        0.0052906186,  0.0066939713,  -0.0020913032,  0.014515517,
+        0.015898481,   0.021362653,   -0.030262267,   0.016587038,
+        -0.011442813,  0.041154444,   -0.007631438,   -0.03423484,
+        -0.010977775,  0.036152758,   0.0066366293,   0.11915515,
+        0.02318443,    -0.041350313,  0.021485701,    -0.10906167,
+        -0.028218046,  -0.00954771,   0.020531068,    -0.11995105,
+        -0.03672871,   0.024019798,   0.014255957,    -0.05221243,
+        -0.00661567,   -0.04630967,   0.033188973,    0.10107534,
+        -0.014027541,  0.030796422,   -0.10270911,    -0.035999842,
+        0.15443139,    0.07684145,    0.036571592,    -0.035900835,
+        -0.0034699554, 0.06209149,    0.015920248,    -0.031122351,
+        -0.03858649,   0.01849943,    0.13872518,     0.01503974,
+        0.069941424,   -0.06948533,   -0.0088794185,  0.061282158,
+        -0.047401894,  0.03100163,    -0.041533746,   -0.10430945,
+        0.044574402,   -0.01425562,   -0.024290353,   0.034563623,
+        0.05866852,    0.023947537,   -0.09445152,    0.035450947,
+        0.02247216,    -0.0042998926, 0.061146557,    -0.10250651,
+        0.020881841,   -0.06747029,   0.10062043,     -0.0023941975,
+        0.03532124,    -0.016341697,  0.09685456,     -0.016764693,
+        0.051808182,   0.05875331,    -0.04536488,    0.001626336,
+        -0.028892258,  -0.01048663,   -0.009793449,   -0.017093895,
+        0.010987891,   0.02357273,    -0.00010856845, 0.0099760275,
+        -0.001845119,  -0.03551521,   0.0018358806,   0.05763657,
+        -0.01769146,   0.040995963,   0.02235177,     -0.060430344,
+        0.11475477,    -0.023854522,  0.10071741,     0.0686208,
+        -0.014250481,  0.034261297,   0.047418304,    0.08562733,
+        -0.030519066,  0.0060542435,  0.014653856,    -0.038836084,
+        0.04096551,    0.032249358,   -0.08355519,    -0.026823482,
+        0.056386515,   -0.010401743,  -0.028396193,   0.08507674,
+        0.014410365,   0.020995233,   0.17040324,     0.11511526,
+        0.02459721,    0.0066619175,  0.025853224,    -0.023133837,
+        -0.081302024,  0.017264642,   -0.009585969,   0.09491168,
+        -0.051313367,  0.054532815,   -0.014298593,   0.10657464,
+        0.007076659,   0.10964551,    0.0409152,      0.008275321,
+        -0.07283536,   0.07937492,    0.04192024,     -0.1075027};
+
+    recurrent_to_output_weights_ = {
+        0.025825322,   -0.05813119,   0.09495884,     -0.045984812,
+        -0.01255415,   -0.0026479573, -0.08196161,    -0.054914974,
+        -0.0046604523, -0.029587349,  -0.044576716,   -0.07480124,
+        -0.082868785,  0.023254942,   0.027502948,    -0.0039728214,
+        -0.08683098,   -0.08116779,   -0.014675607,   -0.037924774,
+        -0.023314456,  -0.007401714,  -0.09255757,    0.029460307,
+        -0.08829125,   -0.005139627,  -0.08989442,    -0.0555066,
+        0.13596267,    -0.025062224,  -0.048351806,   -0.03850004,
+        0.07266485,    -0.022414139,  0.05940088,     0.075114764,
+        0.09597592,    -0.010211725,  -0.0049794707,  -0.011523867,
+        -0.025980417,  0.072999895,   0.11091378,     -0.081685916,
+        0.014416728,   0.043229222,   0.034178585,    -0.07530371,
+        0.035837382,   -0.085607,     -0.007721233,   -0.03287832,
+        -0.043848954,  -0.06404588,   -0.06632928,    -0.073643476,
+        0.008214239,   -0.045984086,  0.039764922,    0.03474462,
+        0.060612556,   -0.080590084,  0.049127717,    0.04151091,
+        -0.030063879,  0.008801774,   -0.023021035,   -0.019558564,
+        0.05158114,    -0.010947698,  -0.011825728,   0.0075720972,
+        0.0699727,     -0.0039981045, 0.069350146,    0.08799282,
+        0.016156472,   0.035502106,   0.11695009,     0.006217345,
+        0.13392477,    -0.037875112,  0.025745004,    0.08940699,
+        -0.00924166,   0.0046702605,  -0.036598757,   -0.08811812,
+        0.10522024,    -0.032441203,  0.008176899,    -0.04454919,
+        0.07058152,    0.0067963637,  0.039206743,    0.03259838,
+        0.03725492,    -0.09515802,   0.013326398,    -0.052055415,
+        -0.025676316,  0.03198509,    -0.015951829,   -0.058556724,
+        0.036879618,   0.043357447,   0.028362012,    -0.05908629,
+        0.0059240665,  -0.04995891,   -0.019187413,   0.0276265,
+        -0.01628143,   0.0025863599,  0.08800015,     0.035250366,
+        -0.022165963,  -0.07328642,   -0.009415526,   -0.07455109,
+        0.11690406,    0.0363299,     0.07411125,     0.042103454,
+        -0.009660886,  0.019076364,   0.018299393,    -0.046004917,
+        0.08891175,    0.0431396,     -0.026327137,   -0.051502608,
+        0.08979574,    -0.051670972,  0.04940282,     -0.07491107,
+        -0.021240504,  0.022596184,   -0.034280192,   0.060163025,
+        -0.058211457,  -0.051837247,  -0.01349775,    -0.04639988,
+        -0.035936575,  -0.011681591,  0.064818054,    0.0073146066,
+        -0.021745546,  -0.043124277,  -0.06471268,    -0.07053354,
+        -0.029321948,  -0.05330136,   0.016933719,    -0.053782392,
+        0.13747959,    -0.1361751,    -0.11569455,    0.0033329215,
+        0.05693899,    -0.053219706,  0.063698,       0.07977434,
+        -0.07924483,   0.06936997,    0.0034815092,   -0.007305279,
+        -0.037325785,  -0.07251102,   -0.033633437,   -0.08677009,
+        0.091591336,   -0.14165086,   0.021752775,    0.019683983,
+        0.0011612234,  -0.058154266,  0.049996935,    0.0288841,
+        -0.0024567875, -0.14345716,   0.010955264,    -0.10234828,
+        0.1183656,     -0.0010731248, -0.023590032,   -0.072285876,
+        -0.0724771,    -0.026382286,  -0.0014920527,  0.042667855,
+        0.0018776858,  0.02986552,    0.009814309,    0.0733756,
+        0.12289186,    0.018043943,   -0.0458958,     0.049412545,
+        0.033632483,   0.05495232,    0.036686596,    -0.013781798,
+        -0.010036754,  0.02576849,    -0.08307328,    0.010112348,
+        0.042521734,   -0.05869831,   -0.071689695,   0.03876447,
+        -0.13275425,   -0.0352966,    -0.023077697,   0.10285965,
+        0.084736146,   0.15568255,    -0.00040734606, 0.027835453,
+        -0.10292561,   -0.032401145,  0.10053256,     -0.026142767,
+        -0.08271222,   -0.0030240538, -0.016368777,   0.1070414,
+        0.042672627,   0.013456989,   -0.0437609,     -0.022309763,
+        0.11576483,    0.04108048,    0.061026827,    -0.0190714,
+        -0.0869359,    0.037901703,   0.0610107,      0.07202949,
+        0.01675338,    0.086139716,   -0.08795751,    -0.014898893,
+        -0.023771819,  -0.01965048,   0.007955471,    -0.043740474,
+        0.03346837,    -0.10549954,   0.090567775,    0.042013682,
+        -0.03176985,   0.12569028,    -0.02421228,    -0.029526481,
+        0.023851605,   0.031539805,   0.05292009,     -0.02344001,
+        -0.07811758,   -0.08834428,   0.10094801,     0.16594367,
+        -0.06861939,   -0.021256343,  -0.041093912,   -0.06669611,
+        0.035498552,   0.021757556,   -0.09302526,    -0.015403468,
+        -0.06614931,   -0.051798206,  -0.013874718,   0.03630673,
+        0.010412845,   -0.08077351,   0.046185967,    0.0035662893,
+        0.03541868,    -0.094149634,  -0.034814864,   0.003128424,
+        -0.020674974,  -0.03944324,   -0.008110165,   -0.11113267,
+        0.08484226,    0.043586485,   0.040582247,    0.0968012,
+        -0.065249965,  -0.028036479,  0.0050708856,   0.0017462453,
+        0.0326779,     0.041296225,   0.09164146,     -0.047743853,
+        -0.015952192,  -0.034451712,  0.084197424,    -0.05347844,
+        -0.11768019,   0.085926116,   -0.08251791,    -0.045081906,
+        0.0948852,     0.068401024,   0.024856757,    0.06978981,
+        -0.057309967,  -0.012775832,  -0.0032452994,  0.01977615,
+        -0.041040014,  -0.024264973,  0.063464895,    0.05431621,
+    };
+
+    cell_to_input_weights_ = {
+        0.040369894, 0.030746894,  0.24704495,  0.018586371,  -0.037586458,
+        -0.15312155, -0.11812848,  -0.11465643, 0.20259799,   0.11418174,
+        -0.10116027, -0.011334949, 0.12411352,  -0.076769054, -0.052169047,
+        0.21198851,  -0.38871562,  -0.09061183, -0.09683246,  -0.21929175};
+
+    cell_to_forget_weights_ = {
+        -0.01998659,  -0.15568835,  -0.24248174,   -0.012770197, 0.041331276,
+        -0.072311886, -0.052123554, -0.0066330447, -0.043891653, 0.036225766,
+        -0.047248036, 0.021479502,  0.033189066,   0.11952997,   -0.020432774,
+        0.64658105,   -0.06650122,  -0.03467612,   0.095340036,  0.23647355};
+
+    cell_to_output_weights_ = {
+        0.08286371,  -0.08261836, -0.51210177, 0.002913762, 0.17764764,
+        -0.5495371,  -0.08460716, -0.24552552, 0.030037103, 0.04123544,
+        -0.11940523, 0.007358328, 0.1890978,   0.4833202,   -0.34441817,
+        0.36312827,  -0.26375428, 0.1457655,   -0.19724406, 0.15548733};
+
+    projection_weights_ = {
+        -0.009802181, 0.09401916,   0.0717386,     -0.13895074,
+        0.09641832,   0.060420845,  0.08539281,    0.054285463,
+        0.061395317,  0.034448683,  -0.042991187,  0.019801661,
+        -0.16840284,  -0.015726732, -0.23041931,   -0.024478018,
+        -0.10959692,  -0.013875541, 0.18600968,    -0.061274476,
+        0.0138165,    -0.08160894,  -0.07661644,   0.032372914,
+        0.16169067,   0.22465782,   -0.03993472,   -0.004017731,
+        0.08633481,   -0.28869787,  0.08682067,    0.17240396,
+        0.014975425,  0.056431185,  0.031037588,   0.16702051,
+        0.0077946745, 0.15140012,   0.29405436,    0.120285,
+        -0.188994,    -0.027265169, 0.043389652,   -0.022061434,
+        0.014777949,  -0.20203483,  0.094781205,   0.19100232,
+        0.13987629,   -0.036132768, -0.06426278,   -0.05108664,
+        0.13221376,   0.009441198,  -0.16715929,   0.15859416,
+        -0.040437475, 0.050779544,  -0.022187516,  0.012166504,
+        0.027685808,  -0.07675938,  -0.0055694645, -0.09444123,
+        0.0046453946, 0.050794356,  0.10770313,    -0.20790008,
+        -0.07149004,  -0.11425117,  0.008225835,   -0.035802525,
+        0.14374903,   0.15262283,   0.048710253,   0.1847461,
+        -0.007487823, 0.11000021,   -0.09542012,   0.22619456,
+        -0.029149994, 0.08527916,   0.009043713,   0.0042746216,
+        0.016261552,  0.022461696,  0.12689082,    -0.043589946,
+        -0.12035478,  -0.08361797,  -0.050666027,  -0.1248618,
+        -0.1275799,   -0.071875185, 0.07377272,    0.09944291,
+        -0.18897448,  -0.1593054,   -0.06526116,   -0.040107165,
+        -0.004618631, -0.067624845, -0.007576253,  0.10727444,
+        0.041546922,  -0.20424393,  0.06907816,    0.050412357,
+        0.00724631,   0.039827548,  0.12449835,    0.10747581,
+        0.13708383,   0.09134148,   -0.12617786,   -0.06428341,
+        0.09956831,   0.1208086,    -0.14676677,   -0.0727722,
+        0.1126304,    0.010139365,  0.015571211,   -0.038128063,
+        0.022913318,  -0.042050496, 0.16842307,    -0.060597885,
+        0.10531834,   -0.06411776,  -0.07451711,   -0.03410368,
+        -0.13393489,  0.06534304,   0.003620307,   0.04490757,
+        0.05970546,   0.05197996,   0.02839995,    0.10434969,
+        -0.013699693, -0.028353551, -0.07260381,   0.047201227,
+        -0.024575593, -0.036445823, 0.07155557,    0.009672501,
+        -0.02328883,  0.009533515,  -0.03606021,   -0.07421458,
+        -0.028082801, -0.2678904,   -0.13221288,   0.18419984,
+        -0.13012612,  -0.014588381, -0.035059117,  -0.04824723,
+        0.07830115,   -0.056184657, 0.03277091,    0.025466874,
+        0.14494097,   -0.12522776,  -0.098633975,  -0.10766018,
+        -0.08317623,  0.08594209,   0.07749552,    0.039474737,
+        0.1776665,    -0.07409566,  -0.0477268,    0.29323658,
+        0.10801441,   0.1154011,    0.013952499,   0.10739139,
+        0.10708251,   -0.051456142, 0.0074137426,  -0.10430189,
+        0.10034707,   0.045594677,  0.0635285,     -0.0715442,
+        -0.089667566, -0.10811871,  0.00026344223, 0.08298446,
+        -0.009525053, 0.006585689,  -0.24567553,   -0.09450807,
+        0.09648481,   0.026996298,  -0.06419476,   -0.04752702,
+        -0.11063944,  -0.23441927,  -0.17608605,   -0.052156363,
+        0.067035615,  0.19271925,   -0.0032889997, -0.043264326,
+        0.09663576,   -0.057112187, -0.10100678,   0.0628376,
+        0.04447668,   0.017961001,  -0.10094388,   -0.10190601,
+        0.18335468,   0.10494553,   -0.052095775,  -0.0026118709,
+        0.10539724,   -0.04383912,  -0.042349473,  0.08438151,
+        -0.1947263,   0.02251204,   0.11216432,    -0.10307853,
+        0.17351969,   -0.039091777, 0.08066188,    -0.00561982,
+        0.12633002,   0.11335965,   -0.0088127935, -0.019777594,
+        0.06864014,   -0.059751723, 0.016233567,   -0.06894641,
+        -0.28651384,  -0.004228674, 0.019708522,   -0.16305895,
+        -0.07468996,  -0.0855457,   0.099339016,   -0.07580735,
+        -0.13775392,  0.08434318,   0.08330512,    -0.12131499,
+        0.031935584,  0.09180414,   -0.08876437,   -0.08049874,
+        0.008753825,  0.03498998,   0.030215185,   0.03907079,
+        0.089751154,  0.029194152,  -0.03337423,   -0.019092513,
+        0.04331237,   0.04299654,   -0.036394123,  -0.12915532,
+        0.09793732,   0.07512415,   -0.11319543,   -0.032502122,
+        0.15661901,   0.07671967,   -0.005491124,  -0.19379048,
+        -0.218606,    0.21448623,   0.017840758,   0.1416943,
+        -0.07051762,  0.19488361,   0.02664691,    -0.18104725,
+        -0.09334311,  0.15026465,   -0.15493552,   -0.057762887,
+        -0.11604192,  -0.262013,    -0.01391798,   0.012185008,
+        0.11156489,   -0.07483202,  0.06693364,    -0.26151478,
+        0.046425626,  0.036540434,  -0.16435726,   0.17338543,
+        -0.21401681,  -0.11385144,  -0.08283257,   -0.069031075,
+        0.030635102,  0.010969227,  0.11109743,    0.010919218,
+        0.027526086,  0.13519906,   0.01891392,    -0.046839405,
+        -0.040167913, 0.017953383,  -0.09700955,   0.0061885654,
+        -0.07000971,  0.026893595,  -0.038844477,  0.14543656};
+
+    lstm_input_ = {
+        {// Batch0: 4 (input_sequence_size) * 5 (n_input)
+         0.787926, 0.151646, 0.071352, 0.118426, 0.458058,   // step 0
+         0.596268, 0.998386, 0.568695, 0.864524, 0.571277,   // step 1
+         0.073204, 0.296072, 0.743333, 0.069199, 0.045348,   // step 2
+         0.867394, 0.291279, 0.013714, 0.482521, 0.626339},  // step 3
+
+        {// Batch1: 4 (input_sequence_size) * 5 (n_input)
+         0.295743, 0.544053, 0.690064, 0.858138, 0.497181,  // step 0
+         0.642421, 0.524260, 0.134799, 0.003639, 0.162482,  // step 1
+         0.640394, 0.930399, 0.050782, 0.432485, 0.988078,  // step 2
+         0.082922, 0.563329, 0.865614, 0.333232, 0.259916}  // step 3
+    };
+
+    lstm_golden_output_ = {
+        {// Batch0: 4 (input_sequence_size) * 16 (n_output)
+         -0.00396806, 0.029352,     -0.00279226, 0.0159977,   -0.00835576,
+         -0.0211779,  0.0283512,    -0.0114597,  0.00907307,  -0.0244004,
+         -0.0152191,  -0.0259063,   0.00914318,  0.00415118,  0.017147,
+         0.0134203,   -0.0166936,   0.0381209,   0.000889694, 0.0143363,
+         -0.0328911,  -0.0234288,   0.0333051,   -0.012229,   0.0110322,
+         -0.0457725,  -0.000832209, -0.0202817,  0.0327257,   0.0121308,
+         0.0155969,   0.0312091,    -0.0213783,  0.0350169,   0.000324794,
+         0.0276012,   -0.0263374,   -0.0371449,  0.0446149,   -0.0205474,
+         0.0103729,   -0.0576349,   -0.0150052,  -0.0292043,  0.0376827,
+         0.0136115,   0.0243435,    0.0354492,   -0.0189322,  0.0464512,
+         -0.00251373, 0.0225745,    -0.0308346,  -0.0317124,  0.0460407,
+         -0.0189395,  0.0149363,    -0.0530162,  -0.0150767,  -0.0340193,
+         0.0286833,   0.00824207,   0.0264887,   0.0305169},
+        {// Batch1: 4 (input_sequence_size) * 16 (n_output)
+         -0.013869,    0.0287268,   -0.00334693, 0.00733398,  -0.0287926,
+         -0.0186926,   0.0193662,   -0.0115437,  0.00422612,  -0.0345232,
+         0.00223253,   -0.00957321, 0.0210624,   0.013331,    0.0150954,
+         0.02168,      -0.0141913,  0.0322082,   0.00227024,  0.0260507,
+         -0.0188721,   -0.0296489,  0.0399134,   -0.0160509,  0.0116039,
+         -0.0447318,   -0.0150515,  -0.0277406,  0.0316596,   0.0118233,
+         0.0214762,    0.0293641,   -0.0204549,  0.0450315,   -0.00117378,
+         0.0167673,    -0.0375007,  -0.0238314,  0.038784,    -0.0174034,
+         0.0131743,    -0.0506589,  -0.0048447,  -0.0240239,  0.0325789,
+         0.00790065,   0.0220157,   0.0333314,   -0.0264787,  0.0387855,
+         -0.000764675, 0.0217599,   -0.037537,   -0.0335206,  0.0431679,
+         -0.0211424,   0.010203,    -0.062785,   -0.00832363, -0.025181,
+         0.0412031,    0.0118723,   0.0239643,   0.0394009}};
   }
-}
+};
 
-TEST(LSTMOpTest, BlackBoxTestWithPeepholeWithProjectionNoClipping) {
+TEST_F(NoCifgPeepholeProjectionClippingLstmTest, LstmBlackBoxTest) {
   const int n_batch = 2;
   const int n_input = 5;
   const int n_cell = 20;
@@ -489,588 +1338,98 @@ TEST(LSTMOpTest, BlackBoxTestWithPeepholeWithProjectionNoClipping) {
                        {0},                 // projection_bias tensor
                    });
 
-  lstm.SetInputToInputWeights(
-      {0.021393683,  0.06124551,    0.046905167,  -0.014657677,  -0.03149463,
-       0.09171803,   0.14647801,    0.10797193,   -0.0057968358, 0.0019193048,
-       -0.2726754,   0.10154029,    -0.018539885, 0.080349885,   -0.10262385,
-       -0.022599787, -0.09121155,   -0.008675967, -0.045206103,  -0.0821282,
-       -0.008045952, 0.015478081,   0.055217247,  0.038719587,   0.044153627,
-       -0.06453243,  0.05031825,    -0.046935108, -0.008164439,  0.014574226,
-       -0.1671009,   -0.15519552,   -0.16819797,  -0.13971269,   -0.11953059,
-       0.25005487,   -0.22790983,   0.009855087,  -0.028140958,  -0.11200698,
-       0.11295408,   -0.0035217577, 0.054485075,  0.05184695,    0.064711206,
-       0.10989193,   0.11674786,    0.03490607,   0.07727357,    0.11390585,
-       -0.1863375,   -0.1034451,    -0.13945189,  -0.049401227,  -0.18767063,
-       0.042483903,  0.14233552,    0.13832581,   0.18350165,    0.14545603,
-       -0.028545704, 0.024939531,   0.050929718,  0.0076203286,  -0.0029723682,
-       -0.042484224, -0.11827596,   -0.09171104,  -0.10808628,   -0.16327988,
-       -0.2273378,   -0.0993647,    -0.017155107, 0.0023917493,  0.049272764,
-       0.0038534778, 0.054764505,   0.089753784,  0.06947234,    0.08014476,
-       -0.04544234,  -0.0497073,    -0.07135631,  -0.048929106,  -0.004042012,
-       -0.009284026, 0.018042054,   0.0036860977, -0.07427302,   -0.11434604,
-       -0.018995456, 0.031487543,   0.012834908,  0.019977754,   0.044256654,
-       -0.39292613,  -0.18519334,   -0.11651281,  -0.06809892,   0.011373677});
-
-  lstm.SetInputToForgetWeights(
-      {-0.0018401089, -0.004852237,  0.03698424,   0.014181704,   0.028273236,
-       -0.016726194,  -0.05249759,   -0.10204261,  0.00861066,    -0.040979505,
-       -0.009899187,  0.01923892,    -0.028177269, -0.08535103,   -0.14585495,
-       0.10662567,    -0.01909731,   -0.017883534, -0.0047269356, -0.045103323,
-       0.0030784295,  0.076784775,   0.07463696,   0.094531395,   0.0814421,
-       -0.12257899,   -0.033945758,  -0.031303465, 0.045630626,   0.06843887,
-       -0.13492945,   -0.012480007,  -0.0811829,   -0.07224499,   -0.09628791,
-       0.045100946,   0.0012300825,  0.013964662,  0.099372394,   0.02543059,
-       0.06958324,    0.034257296,   0.0482646,    0.06267997,    0.052625068,
-       0.12784666,    0.07077897,    0.025725935,  0.04165009,    0.07241905,
-       0.018668644,   -0.037377294,  -0.06277783,  -0.08833636,   -0.040120605,
-       -0.011405586,  -0.007808335,  -0.010301386, -0.005102167,  0.027717464,
-       0.05483423,    0.11449111,    0.11289652,   0.10939839,    0.13396506,
-       -0.08402166,   -0.01901462,   -0.044678304, -0.07720565,   0.014350063,
-       -0.11757958,   -0.0652038,    -0.08185733,  -0.076754324,  -0.092614375,
-       0.10405491,    0.052960336,   0.035755895,  0.035839386,   -0.012540553,
-       0.036881298,   0.02913376,    0.03420159,   0.05448447,    -0.054523353,
-       0.02582715,    0.02327355,    -0.011857179, -0.0011980024, -0.034641717,
-       -0.026125094,  -0.17582615,   -0.15923657,  -0.27486774,   -0.0006143371,
-       0.0001771948,  -8.470171e-05, 0.02651807,   0.045790765,   0.06956496});
-
-  lstm.SetInputToCellWeights(
-      {-0.04580283,   -0.09549462,   -0.032418985,  -0.06454633,
-       -0.043528453,  0.043018587,   -0.049152344,  -0.12418144,
-       -0.078985475,  -0.07596889,   0.019484362,   -0.11434962,
-       -0.0074034138, -0.06314844,   -0.092981495,  0.0062155537,
-       -0.025034338,  -0.0028890965, 0.048929527,   0.06235075,
-       0.10665918,    -0.032036792,  -0.08505916,   -0.10843358,
-       -0.13002433,   -0.036816437,  -0.02130134,   -0.016518239,
-       0.0047691227,  -0.0025825808, 0.066017866,   0.029991534,
-       -0.10652836,   -0.1037554,    -0.13056071,   -0.03266643,
-       -0.033702414,  -0.006473424,  -0.04611692,   0.014419339,
-       -0.025174323,  0.0396852,     0.081777506,   0.06157468,
-       0.10210095,    -0.009658194,  0.046511717,   0.03603906,
-       0.0069369148,  0.015960095,   -0.06507666,   0.09551598,
-       0.053568836,   0.06408714,    0.12835667,    -0.008714329,
-       -0.20211966,   -0.12093674,   0.029450472,   0.2849013,
-       -0.029227901,  0.1164364,     -0.08560263,   0.09941786,
-       -0.036999565,  -0.028842626,  -0.0033637602, -0.017012902,
-       -0.09720865,   -0.11193351,   -0.029155117,  -0.017936034,
-       -0.009768936,  -0.04223324,   -0.036159635,  0.06505112,
-       -0.021742892,  -0.023377212,  -0.07221364,   -0.06430552,
-       0.05453865,    0.091149814,   0.06387331,    0.007518393,
-       0.055960953,   0.069779344,   0.046411168,   0.10509911,
-       0.07463894,    0.0075130584,  0.012850982,   0.04555431,
-       0.056955688,   0.06555285,    0.050801456,   -0.009862683,
-       0.00826772,    -0.026555609,  -0.0073611983, -0.0014897042});
-
-  lstm.SetInputToOutputWeights(
-      {-0.0998932,   -0.07201956,  -0.052803773,  -0.15629593,  -0.15001918,
-       -0.07650751,  0.02359855,   -0.075155355,  -0.08037709,  -0.15093534,
-       0.029517552,  -0.04751393,  0.010350531,   -0.02664851,  -0.016839722,
-       -0.023121163, 0.0077019283, 0.012851257,   -0.05040649,  -0.0129761,
-       -0.021737747, -0.038305793, -0.06870586,   -0.01481247,  -0.001285394,
-       0.10124236,   0.083122835,  0.053313006,   -0.062235646, -0.075637154,
-       -0.027833903, 0.029774971,  0.1130802,     0.09218906,   0.09506135,
-       -0.086665764, -0.037162706, -0.038880914,  -0.035832845, -0.014481564,
-       -0.09825003,  -0.12048569,  -0.097665586,  -0.05287633,  -0.0964047,
-       -0.11366429,  0.035777505,  0.13568819,    0.052451383,  0.050649304,
-       0.05798951,   -0.021852335, -0.099848844,  0.014740475,  -0.078897946,
-       0.04974699,   0.014160473,  0.06973932,    0.04964942,   0.033364646,
-       0.08190124,   0.025535367,  0.050893165,   0.048514254,  0.06945813,
-       -0.078907564, -0.06707616,  -0.11844508,   -0.09986688,  -0.07509403,
-       0.06263226,   0.14925587,   0.20188436,    0.12098451,   0.14639415,
-       0.0015017595, -0.014267382, -0.03417257,   0.012711468,  0.0028300495,
-       -0.024758482, -0.05098548,  -0.0821182,    0.014225672,  0.021544158,
-       0.08949725,   0.07505268,   -0.0020780868, 0.04908258,   0.06476295,
-       -0.022907063, 0.027562456,  0.040185735,   0.019567577,  -0.015598739,
-       -0.049097303, -0.017121866, -0.083368234,  -0.02332002,  -0.0840956});
-
-  lstm.SetInputGateBias(
-      {0.02234832,  0.14757581,   0.18176508,  0.10380666,  0.053110216,
-       -0.06928846, -0.13942584,  -0.11816189, 0.19483899,  0.03652339,
-       -0.10250295, 0.036714908,  -0.18426876, 0.036065217, 0.21810818,
-       0.02383196,  -0.043370757, 0.08690144,  -0.04444982, 0.00030581196});
-
-  lstm.SetForgetGateBias({0.035185695, -0.042891346, -0.03032477, 0.23027696,
-                          0.11098921,  0.15378423,   0.09263801,  0.09790885,
-                          0.09508917,  0.061199076,  0.07665568,  -0.015443159,
-                          -0.03499149, 0.046190713,  0.08895977,  0.10899629,
-                          0.40694186,  0.06030037,   0.012413437, -0.06108739});
-
-  lstm.SetCellBias({-0.024379363, 0.0055531194, 0.23377132,   0.033463873,
-                    -0.1483596,   -0.10639995,  -0.091433935, 0.058573797,
-                    -0.06809782,  -0.07889636,  -0.043246906, -0.09829136,
-                    -0.4279842,   0.034901652,  0.18797937,   0.0075234566,
-                    0.016178843,  0.1749513,    0.13975595,   0.92058027});
-
-  lstm.SetOutputGateBias(
-      {0.046159424,  -0.0012809046, 0.03563469,   0.12648113, 0.027195795,
-       0.35373217,   -0.018957434,  0.008907322,  -0.0762701, 0.12018895,
-       0.04216877,   0.0022856654,  0.040952638,  0.3147856,  0.08225149,
-       -0.057416286, -0.14995944,   -0.008040261, 0.13208859, 0.029760877});
-
-  lstm.SetRecurrentToInputWeights(
-      {-0.001374326,   -0.078856036,   0.10672688,    0.029162422,
-       -0.11585556,    0.02557986,     -0.13446963,   -0.035785314,
-       -0.01244275,    0.025961924,    -0.02337298,   -0.044228926,
-       -0.055839065,   -0.046598054,   -0.010546039,  -0.06900766,
-       0.027239809,    0.022582639,    -0.013296484,  -0.05459212,
-       0.08981,        -0.045407712,   0.08682226,    -0.06867011,
-       -0.14390695,    -0.02916037,    0.000996957,   0.091420636,
-       0.14283475,     -0.07390571,    -0.06402044,   0.062524505,
-       -0.093129106,   0.04860203,     -0.08364217,   -0.08119002,
-       0.009352075,    0.22920375,     0.0016303885,  0.11583097,
-       -0.13732095,    0.012405723,    -0.07551853,   0.06343048,
-       0.12162708,     -0.031923793,   -0.014335606,  0.01790974,
-       -0.10650317,    -0.0724401,     0.08554849,    -0.05727212,
-       0.06556731,     -0.042729504,   -0.043227166,  0.011683251,
-       -0.013082158,   -0.029302018,   -0.010899579,  -0.062036745,
-       -0.022509435,   -0.00964907,    -0.01567329,   0.04260106,
-       -0.07787477,    -0.11576462,    0.017356863,   0.048673786,
-       -0.017577527,   -0.05527947,    -0.082487635,  -0.040137455,
-       -0.10820036,    -0.04666372,    0.022746278,   -0.07851417,
-       0.01068115,     0.032956902,    0.022433773,   0.0026891115,
-       0.08944216,     -0.0685835,     0.010513544,   0.07228705,
-       0.02032331,     -0.059686817,   -0.0005566496, -0.086984694,
-       0.040414046,    -0.1380399,     0.094208956,   -0.05722982,
-       0.012092817,    -0.04989123,    -0.086576,     -0.003399834,
-       -0.04696032,    -0.045747425,   0.10091314,    0.048676282,
-       -0.029037097,   0.031399418,    -0.0040285117, 0.047237843,
-       0.09504992,     0.041799378,    -0.049185462,  -0.031518843,
-       -0.10516937,    0.026374253,    0.10058866,    -0.0033195973,
-       -0.041975245,   0.0073591834,   0.0033782164,  -0.004325073,
-       -0.10167381,    0.042500053,    -0.01447153,   0.06464186,
-       -0.017142897,   0.03312627,     0.009205989,   0.024138335,
-       -0.011337001,   0.035530265,    -0.010912711,  0.0706555,
-       -0.005894094,   0.051841937,    -0.1401738,    -0.02351249,
-       0.0365468,      0.07590991,     0.08838724,    0.021681072,
-       -0.10086113,    0.019608743,    -0.06195883,   0.077335775,
-       0.023646897,    -0.095322326,   0.02233014,    0.09756986,
-       -0.048691444,   -0.009579111,   0.07595467,    0.11480546,
-       -0.09801813,    0.019894179,    0.08502348,    0.004032281,
-       0.037211012,    0.068537936,    -0.048005626,  -0.091520436,
-       -0.028379958,   -0.01556313,    0.06554592,    -0.045599163,
-       -0.01672207,    -0.020169014,   -0.011877351,  -0.20212261,
-       0.010889619,    0.0047078193,   0.038385306,   0.08540671,
-       -0.017140968,   -0.0035865551,  0.016678626,   0.005633034,
-       0.015963363,    0.00871737,     0.060130805,   0.028611384,
-       0.10109069,     -0.015060172,   -0.07894427,   0.06401885,
-       0.011584063,    -0.024466386,   0.0047652307,  -0.09041358,
-       0.030737216,    -0.0046374933,  0.14215417,    -0.11823516,
-       0.019899689,    0.006106124,    -0.027092824,  0.0786356,
-       0.05052217,     -0.058925,      -0.011402121,  -0.024987547,
-       -0.0013661642,  -0.06832946,    -0.015667673,  -0.1083353,
-       -0.00096863037, -0.06988685,    -0.053350925,  -0.027275559,
-       -0.033664223,   -0.07978348,    -0.025200296,  -0.017207067,
-       -0.058403496,   -0.055697463,   0.005798788,   0.12965427,
-       -0.062582195,   0.0013350133,   -0.10482091,   0.0379771,
-       0.072521195,    -0.0029455067,  -0.13797039,   -0.03628521,
-       0.013806405,    -0.017858358,   -0.01008298,   -0.07700066,
-       -0.017081132,   0.019358726,    0.0027079724,  0.004635139,
-       0.062634714,    -0.02338735,    -0.039547626,  -0.02050681,
-       0.03385117,     -0.083611414,   0.002862572,   -0.09421313,
-       0.058618143,    -0.08598433,    0.00972939,    0.023867095,
-       -0.053934585,   -0.023203006,   0.07452513,    -0.048767887,
-       -0.07314807,    -0.056307215,   -0.10433547,   -0.06440842,
-       0.04328182,     0.04389765,     -0.020006588,  -0.09076438,
-       -0.11652589,    -0.021705797,   0.03345259,    -0.010329105,
-       -0.025767034,   0.013057034,    -0.07316461,   -0.10145612,
-       0.06358255,     0.18531723,     0.07759293,    0.12006465,
-       0.1305557,      0.058638252,    -0.03393652,   0.09622831,
-       -0.16253184,    -2.4580743e-06, 0.079869635,   -0.070196845,
-       -0.005644518,   0.06857898,     -0.12598175,   -0.035084512,
-       0.03156317,     -0.12794146,    -0.031963028,  0.04692781,
-       0.030070418,    0.0071660685,   -0.095516115,  -0.004643372,
-       0.040170413,    -0.062104587,   -0.0037324072, 0.0554317,
-       0.08184801,     -0.019164372,   0.06791302,    0.034257166,
-       -0.10307039,    0.021943003,    0.046745934,   0.0790918,
-       -0.0265588,     -0.007824208,   0.042546265,   -0.00977924,
-       -0.0002440307,  -0.017384544,   -0.017990116,  0.12252321,
-       -0.014512694,   -0.08251313,    0.08861942,    0.13589665,
-       0.026351685,    0.012641483,    0.07466548,    0.044301085,
-       -0.045414884,   -0.051112458,   0.03444247,    -0.08502782,
-       -0.04106223,    -0.028126027,   0.028473156,   0.10467447});
-
-  lstm.SetRecurrentToForgetWeights(
-      {-0.057784554,  -0.026057621,  -0.068447545,   -0.022581743,
-       0.14811787,    0.10826372,    0.09471067,     0.03987225,
-       -0.0039523416, 0.00030638507, 0.053185795,    0.10572994,
-       0.08414449,    -0.022036452,  -0.00066928595, -0.09203576,
-       0.032950465,   -0.10985798,   -0.023809856,   0.0021431844,
-       -0.02196096,   -0.00326074,   0.00058621005,  -0.074678116,
-       -0.06193199,   0.055729095,   0.03736828,     0.020123724,
-       0.061878487,   -0.04729229,   0.034919553,    -0.07585433,
-       -0.04421272,   -0.044019096,  0.085488975,    0.04058006,
-       -0.06890133,   -0.030951202,  -0.024628663,   -0.07672815,
-       0.034293607,   0.08556707,    -0.05293577,    -0.033561368,
-       -0.04899627,   0.0241671,     0.015736353,    -0.095442444,
-       -0.029564252,  0.016493602,   -0.035026584,   0.022337519,
-       -0.026871363,  0.004780428,   0.0077918363,   -0.03601621,
-       0.016435321,   -0.03263031,   -0.09543275,    -0.047392778,
-       0.013454138,   0.028934088,   0.01685226,     -0.086110644,
-       -0.046250615,  -0.01847454,   0.047608484,    0.07339695,
-       0.034546845,   -0.04881143,   0.009128804,    -0.08802852,
-       0.03761666,    0.008096139,   -0.014454086,   0.014361001,
-       -0.023502491,  -0.0011840804, -0.07607001,    0.001856849,
-       -0.06509276,   -0.006021153,  -0.08570962,    -0.1451793,
-       0.060212336,   0.055259194,   0.06974018,     0.049454916,
-       -0.027794661,  -0.08077226,   -0.016179763,   0.1169753,
-       0.17213494,    -0.0056326236, -0.053934924,   -0.0124349,
-       -0.11520337,   0.05409887,    0.088759385,    0.0019655675,
-       0.0042065294,  0.03881498,    0.019844765,    0.041858196,
-       -0.05695512,   0.047233116,   0.038937137,    -0.06542224,
-       0.014429736,   -0.09719407,   0.13908425,     -0.05379757,
-       0.012321099,   0.082840554,   -0.029899208,   0.044217527,
-       0.059855383,   0.07711018,    -0.045319796,   0.0948846,
-       -0.011724666,  -0.0033288454, -0.033542685,   -0.04764985,
-       -0.13873616,   0.040668588,   0.034832682,    -0.015319203,
-       -0.018715994,  0.046002675,   0.0599172,      -0.043107376,
-       0.0294216,     -0.002314414,  -0.022424703,   0.0030315618,
-       0.0014641669,  0.0029166266,  -0.11878115,    0.013738511,
-       0.12375372,    -0.0006038222, 0.029104086,    0.087442465,
-       0.052958444,   0.07558703,    0.04817258,     0.044462286,
-       -0.015213451,  -0.08783778,   -0.0561384,     -0.003008196,
-       0.047060397,   -0.002058388,  0.03429439,     -0.018839769,
-       0.024734668,   0.024614193,   -0.042046934,   0.09597743,
-       -0.0043254104, 0.04320769,    0.0064070094,   -0.0019131786,
-       -0.02558259,   -0.022822596,  -0.023273505,   -0.02464396,
-       -0.10991725,   -0.006240552,  0.0074488563,   0.024044557,
-       0.04383914,    -0.046476185,  0.028658995,    0.060410924,
-       0.050786525,   0.009452605,   -0.0073054377,  -0.024810238,
-       0.0052906186,  0.0066939713,  -0.0020913032,  0.014515517,
-       0.015898481,   0.021362653,   -0.030262267,   0.016587038,
-       -0.011442813,  0.041154444,   -0.007631438,   -0.03423484,
-       -0.010977775,  0.036152758,   0.0066366293,   0.11915515,
-       0.02318443,    -0.041350313,  0.021485701,    -0.10906167,
-       -0.028218046,  -0.00954771,   0.020531068,    -0.11995105,
-       -0.03672871,   0.024019798,   0.014255957,    -0.05221243,
-       -0.00661567,   -0.04630967,   0.033188973,    0.10107534,
-       -0.014027541,  0.030796422,   -0.10270911,    -0.035999842,
-       0.15443139,    0.07684145,    0.036571592,    -0.035900835,
-       -0.0034699554, 0.06209149,    0.015920248,    -0.031122351,
-       -0.03858649,   0.01849943,    0.13872518,     0.01503974,
-       0.069941424,   -0.06948533,   -0.0088794185,  0.061282158,
-       -0.047401894,  0.03100163,    -0.041533746,   -0.10430945,
-       0.044574402,   -0.01425562,   -0.024290353,   0.034563623,
-       0.05866852,    0.023947537,   -0.09445152,    0.035450947,
-       0.02247216,    -0.0042998926, 0.061146557,    -0.10250651,
-       0.020881841,   -0.06747029,   0.10062043,     -0.0023941975,
-       0.03532124,    -0.016341697,  0.09685456,     -0.016764693,
-       0.051808182,   0.05875331,    -0.04536488,    0.001626336,
-       -0.028892258,  -0.01048663,   -0.009793449,   -0.017093895,
-       0.010987891,   0.02357273,    -0.00010856845, 0.0099760275,
-       -0.001845119,  -0.03551521,   0.0018358806,   0.05763657,
-       -0.01769146,   0.040995963,   0.02235177,     -0.060430344,
-       0.11475477,    -0.023854522,  0.10071741,     0.0686208,
-       -0.014250481,  0.034261297,   0.047418304,    0.08562733,
-       -0.030519066,  0.0060542435,  0.014653856,    -0.038836084,
-       0.04096551,    0.032249358,   -0.08355519,    -0.026823482,
-       0.056386515,   -0.010401743,  -0.028396193,   0.08507674,
-       0.014410365,   0.020995233,   0.17040324,     0.11511526,
-       0.02459721,    0.0066619175,  0.025853224,    -0.023133837,
-       -0.081302024,  0.017264642,   -0.009585969,   0.09491168,
-       -0.051313367,  0.054532815,   -0.014298593,   0.10657464,
-       0.007076659,   0.10964551,    0.0409152,      0.008275321,
-       -0.07283536,   0.07937492,    0.04192024,     -0.1075027});
-
-  lstm.SetRecurrentToCellWeights(
-      {-0.037322544,   0.018592842,   0.0056175636,  -0.06253426,
-       0.055647098,    -0.05713207,   -0.05626563,   0.005559383,
-       0.03375411,     -0.025757805,  -0.088049285,  0.06017052,
-       -0.06570978,    0.007384076,   0.035123326,   -0.07920549,
-       0.053676967,    0.044480428,   -0.07663568,   0.0071805613,
-       0.08089997,     0.05143358,    0.038261272,   0.03339287,
-       -0.027673481,   0.044746667,   0.028349208,   0.020090483,
-       -0.019443132,   -0.030755889,  -0.0040000007, 0.04465846,
-       -0.021585021,   0.0031670958,  0.0053199246,  -0.056117613,
-       -0.10893326,    0.076739706,   -0.08509834,   -0.027997585,
-       0.037871376,    0.01449768,    -0.09002357,   -0.06111149,
-       -0.046195522,   0.0422062,     -0.005683705,  -0.1253618,
-       -0.012925729,   -0.04890792,   0.06985068,    0.037654128,
-       0.03398274,     -0.004781977,  0.007032333,   -0.031787455,
-       0.010868644,    -0.031489216,  0.09525667,    0.013939797,
-       0.0058680447,   0.0167067,     0.02668468,    -0.04797466,
-       -0.048885044,   -0.12722108,   0.035304096,   0.06554885,
-       0.00972396,     -0.039238118,  -0.05159735,   -0.11329045,
-       0.1613692,      -0.03750952,   0.06529313,    -0.071974665,
-       -0.11769596,    0.015524369,   -0.0013754242, -0.12446318,
-       0.02786344,     -0.014179351,  0.005264273,   0.14376344,
-       0.015983658,    0.03406988,    -0.06939408,   0.040699873,
-       0.02111075,     0.09669095,    0.041345075,   -0.08316494,
-       -0.07684199,    -0.045768797,  0.032298047,   -0.041805092,
-       0.0119405,      0.0061010392,  0.12652606,    0.0064572375,
-       -0.024950314,   0.11574242,    0.04508852,    -0.04335324,
-       0.06760663,     -0.027437469,  0.07216407,    0.06977076,
-       -0.05438599,    0.034033038,   -0.028602652,  0.05346137,
-       0.043184172,    -0.037189785,  0.10420091,    0.00882477,
-       -0.054019816,   -0.074273005,  -0.030617684,  -0.0028467078,
-       0.024302477,    -0.0038869337, 0.005332455,   0.0013399826,
-       0.04361412,     -0.007001822,  0.09631092,    -0.06702025,
-       -0.042049985,   -0.035070654,  -0.04103342,   -0.10273396,
-       0.0544271,      0.037184782,   -0.13150354,   -0.0058036847,
-       -0.008264958,   0.042035464,   0.05891794,    0.029673764,
-       0.0063542654,   0.044788733,   0.054816857,   0.062257513,
-       -0.00093483756, 0.048938446,   -0.004952862,  -0.007730018,
-       -0.04043371,    -0.017094059,  0.07229206,    -0.023670016,
-       -0.052195564,   -0.025616996,  -0.01520939,   0.045104615,
-       -0.007376126,   0.003533447,   0.006570588,   0.056037236,
-       0.12436656,     0.051817212,   0.028532185,   -0.08686856,
-       0.11868599,     0.07663395,    -0.07323171,   0.03463402,
-       -0.050708205,   -0.04458982,   -0.11590894,   0.021273347,
-       0.1251325,      -0.15313013,   -0.12224372,   0.17228661,
-       0.023029093,    0.086124025,   0.006445803,   -0.03496501,
-       0.028332196,    0.04449512,    -0.042436164,  -0.026587414,
-       -0.006041347,   -0.09292539,   -0.05678812,   0.03897832,
-       0.09465633,     0.008115513,   -0.02171956,   0.08304309,
-       0.071401566,    0.019622514,   0.032163795,   -0.004167056,
-       0.02295182,     0.030739572,   0.056506045,   0.004612461,
-       0.06524936,     0.059999723,   0.046395954,   -0.0045512207,
-       -0.1335546,     -0.030136576,  0.11584653,    -0.014678886,
-       0.0020118146,   -0.09688814,   -0.0790206,    0.039770417,
-       -0.0329582,     0.07922767,    0.029322514,   0.026405897,
-       0.04207835,     -0.07073373,   0.063781224,   0.0859677,
-       -0.10925287,    -0.07011058,   0.048005477,   0.03438226,
-       -0.09606514,    -0.006669445,  -0.043381985,  0.04240257,
-       -0.06955775,    -0.06769346,   0.043903265,   -0.026784198,
-       -0.017840602,   0.024307009,   -0.040079936,  -0.019946516,
-       0.045318738,    -0.12233574,   0.026170589,   0.0074471775,
-       0.15978073,     0.10185836,    0.10298046,    -0.015476589,
-       -0.039390966,   -0.072174534,  0.0739445,     -0.1211869,
-       -0.0347889,     -0.07943156,   0.014809798,   -0.12412325,
-       -0.0030663363,  0.039695457,   0.0647603,     -0.08291318,
-       -0.018529687,   -0.004423833,  0.0037507233,  0.084633216,
-       -0.01514876,    -0.056505352,  -0.012800942,  -0.06994386,
-       0.012962922,    -0.031234352,  0.07029052,    0.016418684,
-       0.03618972,     0.055686004,   -0.08663945,   -0.017404709,
-       -0.054761406,   0.029065743,   0.052404847,   0.020238016,
-       0.0048197987,   -0.0214882,    0.07078733,    0.013016777,
-       0.06262858,     0.009184685,   0.020785125,   -0.043904778,
-       -0.0270329,     -0.03299152,   -0.060088247,  -0.015162964,
-       -0.001828936,   0.12642565,    -0.056757294,  0.013586685,
-       0.09232601,     -0.035886683,  0.06000002,    0.05229691,
-       -0.052580316,   -0.082029596,  -0.010794592,  0.012947712,
-       -0.036429964,   -0.085508935,  -0.13127148,   -0.017744139,
-       0.031502828,    0.036232427,   -0.031581745,  0.023051167,
-       -0.05325106,    -0.03421577,   0.028793324,   -0.034633752,
-       -0.009881397,   -0.043551125,  -0.018609839,  0.0019097115,
-       -0.008799762,   0.056595087,   0.0022273948,  0.055752404});
-
-  lstm.SetRecurrentToOutputWeights({
-      0.025825322,   -0.05813119,  0.09495884,   -0.045984812,   -0.01255415,
-      -0.0026479573, -0.08196161,  -0.054914974, -0.0046604523,  -0.029587349,
-      -0.044576716,  -0.07480124,  -0.082868785, 0.023254942,    0.027502948,
-      -0.0039728214, -0.08683098,  -0.08116779,  -0.014675607,   -0.037924774,
-      -0.023314456,  -0.007401714, -0.09255757,  0.029460307,    -0.08829125,
-      -0.005139627,  -0.08989442,  -0.0555066,   0.13596267,     -0.025062224,
-      -0.048351806,  -0.03850004,  0.07266485,   -0.022414139,   0.05940088,
-      0.075114764,   0.09597592,   -0.010211725, -0.0049794707,  -0.011523867,
-      -0.025980417,  0.072999895,  0.11091378,   -0.081685916,   0.014416728,
-      0.043229222,   0.034178585,  -0.07530371,  0.035837382,    -0.085607,
-      -0.007721233,  -0.03287832,  -0.043848954, -0.06404588,    -0.06632928,
-      -0.073643476,  0.008214239,  -0.045984086, 0.039764922,    0.03474462,
-      0.060612556,   -0.080590084, 0.049127717,  0.04151091,     -0.030063879,
-      0.008801774,   -0.023021035, -0.019558564, 0.05158114,     -0.010947698,
-      -0.011825728,  0.0075720972, 0.0699727,    -0.0039981045,  0.069350146,
-      0.08799282,    0.016156472,  0.035502106,  0.11695009,     0.006217345,
-      0.13392477,    -0.037875112, 0.025745004,  0.08940699,     -0.00924166,
-      0.0046702605,  -0.036598757, -0.08811812,  0.10522024,     -0.032441203,
-      0.008176899,   -0.04454919,  0.07058152,   0.0067963637,   0.039206743,
-      0.03259838,    0.03725492,   -0.09515802,  0.013326398,    -0.052055415,
-      -0.025676316,  0.03198509,   -0.015951829, -0.058556724,   0.036879618,
-      0.043357447,   0.028362012,  -0.05908629,  0.0059240665,   -0.04995891,
-      -0.019187413,  0.0276265,    -0.01628143,  0.0025863599,   0.08800015,
-      0.035250366,   -0.022165963, -0.07328642,  -0.009415526,   -0.07455109,
-      0.11690406,    0.0363299,    0.07411125,   0.042103454,    -0.009660886,
-      0.019076364,   0.018299393,  -0.046004917, 0.08891175,     0.0431396,
-      -0.026327137,  -0.051502608, 0.08979574,   -0.051670972,   0.04940282,
-      -0.07491107,   -0.021240504, 0.022596184,  -0.034280192,   0.060163025,
-      -0.058211457,  -0.051837247, -0.01349775,  -0.04639988,    -0.035936575,
-      -0.011681591,  0.064818054,  0.0073146066, -0.021745546,   -0.043124277,
-      -0.06471268,   -0.07053354,  -0.029321948, -0.05330136,    0.016933719,
-      -0.053782392,  0.13747959,   -0.1361751,   -0.11569455,    0.0033329215,
-      0.05693899,    -0.053219706, 0.063698,     0.07977434,     -0.07924483,
-      0.06936997,    0.0034815092, -0.007305279, -0.037325785,   -0.07251102,
-      -0.033633437,  -0.08677009,  0.091591336,  -0.14165086,    0.021752775,
-      0.019683983,   0.0011612234, -0.058154266, 0.049996935,    0.0288841,
-      -0.0024567875, -0.14345716,  0.010955264,  -0.10234828,    0.1183656,
-      -0.0010731248, -0.023590032, -0.072285876, -0.0724771,     -0.026382286,
-      -0.0014920527, 0.042667855,  0.0018776858, 0.02986552,     0.009814309,
-      0.0733756,     0.12289186,   0.018043943,  -0.0458958,     0.049412545,
-      0.033632483,   0.05495232,   0.036686596,  -0.013781798,   -0.010036754,
-      0.02576849,    -0.08307328,  0.010112348,  0.042521734,    -0.05869831,
-      -0.071689695,  0.03876447,   -0.13275425,  -0.0352966,     -0.023077697,
-      0.10285965,    0.084736146,  0.15568255,   -0.00040734606, 0.027835453,
-      -0.10292561,   -0.032401145, 0.10053256,   -0.026142767,   -0.08271222,
-      -0.0030240538, -0.016368777, 0.1070414,    0.042672627,    0.013456989,
-      -0.0437609,    -0.022309763, 0.11576483,   0.04108048,     0.061026827,
-      -0.0190714,    -0.0869359,   0.037901703,  0.0610107,      0.07202949,
-      0.01675338,    0.086139716,  -0.08795751,  -0.014898893,   -0.023771819,
-      -0.01965048,   0.007955471,  -0.043740474, 0.03346837,     -0.10549954,
-      0.090567775,   0.042013682,  -0.03176985,  0.12569028,     -0.02421228,
-      -0.029526481,  0.023851605,  0.031539805,  0.05292009,     -0.02344001,
-      -0.07811758,   -0.08834428,  0.10094801,   0.16594367,     -0.06861939,
-      -0.021256343,  -0.041093912, -0.06669611,  0.035498552,    0.021757556,
-      -0.09302526,   -0.015403468, -0.06614931,  -0.051798206,   -0.013874718,
-      0.03630673,    0.010412845,  -0.08077351,  0.046185967,    0.0035662893,
-      0.03541868,    -0.094149634, -0.034814864, 0.003128424,    -0.020674974,
-      -0.03944324,   -0.008110165, -0.11113267,  0.08484226,     0.043586485,
-      0.040582247,   0.0968012,    -0.065249965, -0.028036479,   0.0050708856,
-      0.0017462453,  0.0326779,    0.041296225,  0.09164146,     -0.047743853,
-      -0.015952192,  -0.034451712, 0.084197424,  -0.05347844,    -0.11768019,
-      0.085926116,   -0.08251791,  -0.045081906, 0.0948852,      0.068401024,
-      0.024856757,   0.06978981,   -0.057309967, -0.012775832,   -0.0032452994,
-      0.01977615,    -0.041040014, -0.024264973, 0.063464895,    0.05431621,
-  });
-
-  lstm.SetCellToInputWeights(
-      {0.040369894, 0.030746894,  0.24704495,  0.018586371,  -0.037586458,
-       -0.15312155, -0.11812848,  -0.11465643, 0.20259799,   0.11418174,
-       -0.10116027, -0.011334949, 0.12411352,  -0.076769054, -0.052169047,
-       0.21198851,  -0.38871562,  -0.09061183, -0.09683246,  -0.21929175});
-
-  lstm.SetCellToForgetWeights(
-      {-0.01998659,  -0.15568835,  -0.24248174,   -0.012770197, 0.041331276,
-       -0.072311886, -0.052123554, -0.0066330447, -0.043891653, 0.036225766,
-       -0.047248036, 0.021479502,  0.033189066,   0.11952997,   -0.020432774,
-       0.64658105,   -0.06650122,  -0.03467612,   0.095340036,  0.23647355});
-
-  lstm.SetCellToOutputWeights(
-      {0.08286371,  -0.08261836, -0.51210177, 0.002913762, 0.17764764,
-       -0.5495371,  -0.08460716, -0.24552552, 0.030037103, 0.04123544,
-       -0.11940523, 0.007358328, 0.1890978,   0.4833202,   -0.34441817,
-       0.36312827,  -0.26375428, 0.1457655,   -0.19724406, 0.15548733});
-
-  lstm.SetProjectionWeights(
-      {-0.009802181,  0.09401916,    0.0717386,     -0.13895074,  0.09641832,
-       0.060420845,   0.08539281,    0.054285463,   0.061395317,  0.034448683,
-       -0.042991187,  0.019801661,   -0.16840284,   -0.015726732, -0.23041931,
-       -0.024478018,  -0.10959692,   -0.013875541,  0.18600968,   -0.061274476,
-       0.0138165,     -0.08160894,   -0.07661644,   0.032372914,  0.16169067,
-       0.22465782,    -0.03993472,   -0.004017731,  0.08633481,   -0.28869787,
-       0.08682067,    0.17240396,    0.014975425,   0.056431185,  0.031037588,
-       0.16702051,    0.0077946745,  0.15140012,    0.29405436,   0.120285,
-       -0.188994,     -0.027265169,  0.043389652,   -0.022061434, 0.014777949,
-       -0.20203483,   0.094781205,   0.19100232,    0.13987629,   -0.036132768,
-       -0.06426278,   -0.05108664,   0.13221376,    0.009441198,  -0.16715929,
-       0.15859416,    -0.040437475,  0.050779544,   -0.022187516, 0.012166504,
-       0.027685808,   -0.07675938,   -0.0055694645, -0.09444123,  0.0046453946,
-       0.050794356,   0.10770313,    -0.20790008,   -0.07149004,  -0.11425117,
-       0.008225835,   -0.035802525,  0.14374903,    0.15262283,   0.048710253,
-       0.1847461,     -0.007487823,  0.11000021,    -0.09542012,  0.22619456,
-       -0.029149994,  0.08527916,    0.009043713,   0.0042746216, 0.016261552,
-       0.022461696,   0.12689082,    -0.043589946,  -0.12035478,  -0.08361797,
-       -0.050666027,  -0.1248618,    -0.1275799,    -0.071875185, 0.07377272,
-       0.09944291,    -0.18897448,   -0.1593054,    -0.06526116,  -0.040107165,
-       -0.004618631,  -0.067624845,  -0.007576253,  0.10727444,   0.041546922,
-       -0.20424393,   0.06907816,    0.050412357,   0.00724631,   0.039827548,
-       0.12449835,    0.10747581,    0.13708383,    0.09134148,   -0.12617786,
-       -0.06428341,   0.09956831,    0.1208086,     -0.14676677,  -0.0727722,
-       0.1126304,     0.010139365,   0.015571211,   -0.038128063, 0.022913318,
-       -0.042050496,  0.16842307,    -0.060597885,  0.10531834,   -0.06411776,
-       -0.07451711,   -0.03410368,   -0.13393489,   0.06534304,   0.003620307,
-       0.04490757,    0.05970546,    0.05197996,    0.02839995,   0.10434969,
-       -0.013699693,  -0.028353551,  -0.07260381,   0.047201227,  -0.024575593,
-       -0.036445823,  0.07155557,    0.009672501,   -0.02328883,  0.009533515,
-       -0.03606021,   -0.07421458,   -0.028082801,  -0.2678904,   -0.13221288,
-       0.18419984,    -0.13012612,   -0.014588381,  -0.035059117, -0.04824723,
-       0.07830115,    -0.056184657,  0.03277091,    0.025466874,  0.14494097,
-       -0.12522776,   -0.098633975,  -0.10766018,   -0.08317623,  0.08594209,
-       0.07749552,    0.039474737,   0.1776665,     -0.07409566,  -0.0477268,
-       0.29323658,    0.10801441,    0.1154011,     0.013952499,  0.10739139,
-       0.10708251,    -0.051456142,  0.0074137426,  -0.10430189,  0.10034707,
-       0.045594677,   0.0635285,     -0.0715442,    -0.089667566, -0.10811871,
-       0.00026344223, 0.08298446,    -0.009525053,  0.006585689,  -0.24567553,
-       -0.09450807,   0.09648481,    0.026996298,   -0.06419476,  -0.04752702,
-       -0.11063944,   -0.23441927,   -0.17608605,   -0.052156363, 0.067035615,
-       0.19271925,    -0.0032889997, -0.043264326,  0.09663576,   -0.057112187,
-       -0.10100678,   0.0628376,     0.04447668,    0.017961001,  -0.10094388,
-       -0.10190601,   0.18335468,    0.10494553,    -0.052095775, -0.0026118709,
-       0.10539724,    -0.04383912,   -0.042349473,  0.08438151,   -0.1947263,
-       0.02251204,    0.11216432,    -0.10307853,   0.17351969,   -0.039091777,
-       0.08066188,    -0.00561982,   0.12633002,    0.11335965,   -0.0088127935,
-       -0.019777594,  0.06864014,    -0.059751723,  0.016233567,  -0.06894641,
-       -0.28651384,   -0.004228674,  0.019708522,   -0.16305895,  -0.07468996,
-       -0.0855457,    0.099339016,   -0.07580735,   -0.13775392,  0.08434318,
-       0.08330512,    -0.12131499,   0.031935584,   0.09180414,   -0.08876437,
-       -0.08049874,   0.008753825,   0.03498998,    0.030215185,  0.03907079,
-       0.089751154,   0.029194152,   -0.03337423,   -0.019092513, 0.04331237,
-       0.04299654,    -0.036394123,  -0.12915532,   0.09793732,   0.07512415,
-       -0.11319543,   -0.032502122,  0.15661901,    0.07671967,   -0.005491124,
-       -0.19379048,   -0.218606,     0.21448623,    0.017840758,  0.1416943,
-       -0.07051762,   0.19488361,    0.02664691,    -0.18104725,  -0.09334311,
-       0.15026465,    -0.15493552,   -0.057762887,  -0.11604192,  -0.262013,
-       -0.01391798,   0.012185008,   0.11156489,    -0.07483202,  0.06693364,
-       -0.26151478,   0.046425626,   0.036540434,   -0.16435726,  0.17338543,
-       -0.21401681,   -0.11385144,   -0.08283257,   -0.069031075, 0.030635102,
-       0.010969227,   0.11109743,    0.010919218,   0.027526086,  0.13519906,
-       0.01891392,    -0.046839405,  -0.040167913,  0.017953383,  -0.09700955,
-       0.0061885654,  -0.07000971,   0.026893595,   -0.038844477, 0.14543656});
-
-  static float lstm_input[][20] = {
-      {// Batch0: 4 (input_sequence_size) * 5 (n_input)
-       0.787926, 0.151646, 0.071352, 0.118426, 0.458058, 0.596268, 0.998386,
-       0.568695, 0.864524, 0.571277, 0.073204, 0.296072, 0.743333, 0.069199,
-       0.045348, 0.867394, 0.291279, 0.013714, 0.482521, 0.626339},
-
-      {// Batch1: 4 (input_sequence_size) * 5 (n_input)
-       0.295743, 0.544053, 0.690064, 0.858138, 0.497181, 0.642421, 0.524260,
-       0.134799, 0.003639, 0.162482, 0.640394, 0.930399, 0.050782, 0.432485,
-       0.988078, 0.082922, 0.563329, 0.865614, 0.333232, 0.259916}};
-
-  static float lstm_golden_output[][64] = {
-      {// Batch0: 4 (input_sequence_size) * 16 (n_output)
-       -0.00396806, 0.029352,     -0.00279226, 0.0159977,   -0.00835576,
-       -0.0211779,  0.0283512,    -0.0114597,  0.00907307,  -0.0244004,
-       -0.0152191,  -0.0259063,   0.00914318,  0.00415118,  0.017147,
-       0.0134203,   -0.0166936,   0.0381209,   0.000889694, 0.0143363,
-       -0.0328911,  -0.0234288,   0.0333051,   -0.012229,   0.0110322,
-       -0.0457725,  -0.000832209, -0.0202817,  0.0327257,   0.0121308,
-       0.0155969,   0.0312091,    -0.0213783,  0.0350169,   0.000324794,
-       0.0276012,   -0.0263374,   -0.0371449,  0.0446149,   -0.0205474,
-       0.0103729,   -0.0576349,   -0.0150052,  -0.0292043,  0.0376827,
-       0.0136115,   0.0243435,    0.0354492,   -0.0189322,  0.0464512,
-       -0.00251373, 0.0225745,    -0.0308346,  -0.0317124,  0.0460407,
-       -0.0189395,  0.0149363,    -0.0530162,  -0.0150767,  -0.0340193,
-       0.0286833,   0.00824207,   0.0264887,   0.0305169},
-      {// Batch1: 4 (input_sequence_size) * 16 (n_output)
-       -0.013869,    0.0287268,   -0.00334693, 0.00733398,  -0.0287926,
-       -0.0186926,   0.0193662,   -0.0115437,  0.00422612,  -0.0345232,
-       0.00223253,   -0.00957321, 0.0210624,   0.013331,    0.0150954,
-       0.02168,      -0.0141913,  0.0322082,   0.00227024,  0.0260507,
-       -0.0188721,   -0.0296489,  0.0399134,   -0.0160509,  0.0116039,
-       -0.0447318,   -0.0150515,  -0.0277406,  0.0316596,   0.0118233,
-       0.0214762,    0.0293641,   -0.0204549,  0.0450315,   -0.00117378,
-       0.0167673,    -0.0375007,  -0.0238314,  0.038784,    -0.0174034,
-       0.0131743,    -0.0506589,  -0.0048447,  -0.0240239,  0.0325789,
-       0.00790065,   0.0220157,   0.0333314,   -0.0264787,  0.0387855,
-       -0.000764675, 0.0217599,   -0.037537,   -0.0335206,  0.0431679,
-       -0.0211424,   0.010203,    -0.062785,   -0.00832363, -0.025181,
-       0.0412031,    0.0118723,   0.0239643,   0.0394009}};
+  lstm.SetInputToInputWeights(input_to_input_weights_);
+  lstm.SetInputToCellWeights(input_to_cell_weights_);
+  lstm.SetInputToForgetWeights(input_to_forget_weights_);
+  lstm.SetInputToOutputWeights(input_to_output_weights_);
+
+  lstm.SetInputGateBias(input_gate_bias_);
+  lstm.SetCellBias(cell_gate_bias_);
+  lstm.SetForgetGateBias(forget_gate_bias_);
+  lstm.SetOutputGateBias(output_gate_bias_);
+
+  lstm.SetRecurrentToInputWeights(recurrent_to_input_weights_);
+  lstm.SetRecurrentToCellWeights(recurrent_to_cell_weights_);
+  lstm.SetRecurrentToForgetWeights(recurrent_to_forget_weights_);
+  lstm.SetRecurrentToOutputWeights(recurrent_to_output_weights_);
+
+  lstm.SetCellToInputWeights(cell_to_input_weights_);
+  lstm.SetCellToForgetWeights(cell_to_forget_weights_);
+  lstm.SetCellToOutputWeights(cell_to_output_weights_);
+
+  lstm.SetProjectionWeights(projection_weights_);
 
   // Resetting cell_state and output_state
   lstm.ResetCellState();
   lstm.ResetOutputState();
 
-  const int input_sequence_size =
-      sizeof(lstm_input[0]) / sizeof(float) / (lstm.num_inputs());
-  for (int i = 0; i < input_sequence_size; i++) {
-    float* batch0_start = lstm_input[0] + i * lstm.num_inputs();
-    float* batch0_end = batch0_start + lstm.num_inputs();
+  VerifyGoldens(lstm_input_, lstm_golden_output_, &lstm);
+}
 
-    lstm.SetInput(0, batch0_start, batch0_end);
+TEST_F(NoCifgPeepholeProjectionClippingLstmTest, HybridLstmBlackBoxTest) {
+  const int n_batch = 2;
+  const int n_input = 5;
+  const int n_cell = 20;
+  const int n_output = 16;
 
-    float* batch1_start = lstm_input[1] + i * lstm.num_inputs();
-    float* batch1_end = batch1_start + lstm.num_inputs();
-    lstm.SetInput(lstm.num_inputs(), batch1_start, batch1_end);
+  HybridLSTMOpModel lstm(
+      n_batch, n_input, n_cell, n_output,
+      /*use_cifg=*/false, /*use_peephole=*/true,
+      /*use_projection_weights=*/true,
+      /*use_projection_bias=*/false,
+      /*cell_clip=*/0.0, /*proj_clip=*/0.0,
+      {
+          {n_batch, n_input},  // input tensor
+
+          {n_cell, n_input},  // input_to_input_weight tensor
+          {n_cell, n_input},  // input_to_forget_weight tensor
+          {n_cell, n_input},  // input_to_cell_weight tensor
+          {n_cell, n_input},  // input_to_output_weight tensor
+
+          {n_cell, n_output},  // recurrent_to_input_weight tensor
+          {n_cell, n_output},  // recurrent_to_forget_weight tensor
+          {n_cell, n_output},  // recurrent_to_cell_weight tensor
+          {n_cell, n_output},  // recurrent_to_output_weight tensor
+
+          {n_cell},  // cell_to_input_weight tensor
+          {n_cell},  // cell_to_forget_weight tensor
+          {n_cell},  // cell_to_output_weight tensor
+
+          {n_cell},  // input_gate_bias tensor
+          {n_cell},  // forget_gate_bias tensor
+          {n_cell},  // cell_bias tensor
+          {n_cell},  // output_gate_bias tensor
+
+          {n_output, n_cell},  // projection_weight tensor
+          {0},                 // projection_bias tensor
+      });
+
+  lstm.SetInputToInputWeights(input_to_input_weights_);
+  lstm.SetInputToCellWeights(input_to_cell_weights_);
+  lstm.SetInputToForgetWeights(input_to_forget_weights_);
+  lstm.SetInputToOutputWeights(input_to_output_weights_);
+
+  lstm.SetInputGateBias(input_gate_bias_);
+  lstm.SetCellBias(cell_gate_bias_);
+  lstm.SetForgetGateBias(forget_gate_bias_);
+  lstm.SetOutputGateBias(output_gate_bias_);
+
+  lstm.SetRecurrentToInputWeights(recurrent_to_input_weights_);
+  lstm.SetRecurrentToCellWeights(recurrent_to_cell_weights_);
+  lstm.SetRecurrentToForgetWeights(recurrent_to_forget_weights_);
+  lstm.SetRecurrentToOutputWeights(recurrent_to_output_weights_);
+
+  lstm.SetCellToInputWeights(cell_to_input_weights_);
+  lstm.SetCellToForgetWeights(cell_to_forget_weights_);
+  lstm.SetCellToOutputWeights(cell_to_output_weights_);
+
+  lstm.SetProjectionWeights(projection_weights_);
 
-    lstm.Invoke();
+  // Resetting cell_state and output_state
+  lstm.ResetCellState();
+  lstm.ResetOutputState();
 
-    float* golden_start_batch0 = lstm_golden_output[0] + i * lstm.num_outputs();
-    float* golden_end_batch0 = golden_start_batch0 + lstm.num_outputs();
-    float* golden_start_batch1 = lstm_golden_output[1] + i * lstm.num_outputs();
-    float* golden_end_batch1 = golden_start_batch1 + lstm.num_outputs();
-    std::vector<float> expected;
-    expected.insert(expected.end(), golden_start_batch0, golden_end_batch0);
-    expected.insert(expected.end(), golden_start_batch1, golden_end_batch1);
-    EXPECT_THAT(lstm.GetOutput(), ElementsAreArray(ArrayFloatNear(expected)));
-  }
+  VerifyGoldens(lstm_input_, lstm_golden_output_, &lstm, /*tolerance=*/0.00467);
 }
 
 }  // namespace
diff --git a/tensorflow/contrib/lite/kernels/mul.cc b/tensorflow/contrib/lite/kernels/mul.cc
index 62f4e94a386fbbc6987e8a6dc1a9a47ce3349cbb..b69a221447db963bcd3a7e6a69f132fe3767bfd1 100644
--- a/tensorflow/contrib/lite/kernels/mul.cc
+++ b/tensorflow/contrib/lite/kernels/mul.cc
@@ -120,8 +120,9 @@ void EvalQuantized(TfLiteContext* context, TfLiteNode* node,
 
   double real_multiplier =
       input1->params.scale * input2->params.scale / output->params.scale;
-  QuantizeMultiplierSmallerThanOne(real_multiplier, &output_multiplier,
-                                   &output_shift);
+  QuantizeMultiplierSmallerThanOneExp(real_multiplier, &output_multiplier,
+                                      &output_shift);
+  output_shift *= -1;
 
   int32 output_activation_min, output_activation_max;
   CalculateActivationRangeUint8(params->activation, output,
diff --git a/tensorflow/contrib/lite/kernels/register.cc b/tensorflow/contrib/lite/kernels/register.cc
index 66069996049b1a415a6397f7321970eb90d0c050..fca040649d8ff76b42db1962135cd4315e961ec0 100644
--- a/tensorflow/contrib/lite/kernels/register.cc
+++ b/tensorflow/contrib/lite/kernels/register.cc
@@ -73,6 +73,7 @@ TfLiteRegistration* Register_SQUEEZE();
 TfLiteRegistration* Register_STRIDED_SLICE();
 TfLiteRegistration* Register_EXP();
 TfLiteRegistration* Register_TOPK_V2();
+TfLiteRegistration* Register_LOG();
 TfLiteRegistration* Register_LOG_SOFTMAX();
 TfLiteRegistration* Register_CAST();
 TfLiteRegistration* Register_DEQUANTIZE();
@@ -85,11 +86,16 @@ TfLiteRegistration* Register_GREATER_EQUAL();
 TfLiteRegistration* Register_LESS();
 TfLiteRegistration* Register_LESS_EQUAL();
 TfLiteRegistration* Register_FLOOR();
+TfLiteRegistration* Register_TILE();
 TfLiteRegistration* Register_NEG();
 TfLiteRegistration* Register_SELECT();
 TfLiteRegistration* Register_SLICE();
 TfLiteRegistration* Register_SIN();
 TfLiteRegistration* Register_TRANSPOSE_CONV();
+TfLiteRegistration* Register_EXPAND_DIMS();
+TfLiteRegistration* Register_SPARSE_TO_DENSE();
+TfLiteRegistration* Register_EQUAL();
+TfLiteRegistration* Register_NOT_EQUAL();
 
 BuiltinOpResolver::BuiltinOpResolver() {
   AddBuiltin(BuiltinOperator_RELU, Register_RELU());
@@ -123,7 +129,8 @@ BuiltinOpResolver::BuiltinOpResolver() {
   AddBuiltin(BuiltinOperator_L2_NORMALIZATION, Register_L2_NORMALIZATION());
   AddBuiltin(BuiltinOperator_LOCAL_RESPONSE_NORMALIZATION,
              Register_LOCAL_RESPONSE_NORMALIZATION());
-  AddBuiltin(BuiltinOperator_LSTM, Register_LSTM());
+  AddBuiltin(BuiltinOperator_LSTM, Register_LSTM(), /* min_version */ 1,
+             /* max_version */ 2);
   AddBuiltin(BuiltinOperator_BIDIRECTIONAL_SEQUENCE_LSTM,
              Register_BIDIRECTIONAL_SEQUENCE_LSTM());
   AddBuiltin(BuiltinOperator_UNIDIRECTIONAL_SEQUENCE_LSTM,
@@ -144,6 +151,7 @@ BuiltinOpResolver::BuiltinOpResolver() {
   AddBuiltin(BuiltinOperator_STRIDED_SLICE, Register_STRIDED_SLICE());
   AddBuiltin(BuiltinOperator_EXP, Register_EXP());
   AddBuiltin(BuiltinOperator_TOPK_V2, Register_TOPK_V2());
+  AddBuiltin(BuiltinOperator_LOG, Register_LOG());
   AddBuiltin(BuiltinOperator_LOG_SOFTMAX, Register_LOG_SOFTMAX());
   AddBuiltin(BuiltinOperator_CAST, Register_CAST());
   AddBuiltin(BuiltinOperator_DEQUANTIZE, Register_DEQUANTIZE());
@@ -161,6 +169,11 @@ BuiltinOpResolver::BuiltinOpResolver() {
   AddBuiltin(BuiltinOperator_SLICE, Register_SLICE());
   AddBuiltin(BuiltinOperator_SIN, Register_SIN());
   AddBuiltin(BuiltinOperator_TRANSPOSE_CONV, Register_TRANSPOSE_CONV());
+  AddBuiltin(BuiltinOperator_TILE, Register_TILE());
+  AddBuiltin(BuiltinOperator_EXPAND_DIMS, Register_EXPAND_DIMS());
+  AddBuiltin(BuiltinOperator_SPARSE_TO_DENSE, Register_SPARSE_TO_DENSE());
+  AddBuiltin(BuiltinOperator_EQUAL, Register_EQUAL());
+  AddBuiltin(BuiltinOperator_NOT_EQUAL, Register_NOT_EQUAL());
 
 #if 0
   // TODO(andrewharp, ahentz): Move these somewhere more appropriate so that
diff --git a/tensorflow/contrib/lite/kernels/resize_bilinear.cc b/tensorflow/contrib/lite/kernels/resize_bilinear.cc
index f2092eaa36db32ebbc959ac23365bb13dd034e68..86c4cd3ee88013ca4174f444d0388bc036d9cde6 100644
--- a/tensorflow/contrib/lite/kernels/resize_bilinear.cc
+++ b/tensorflow/contrib/lite/kernels/resize_bilinear.cc
@@ -61,12 +61,10 @@ TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
   TF_LITE_ENSURE_EQ(context, NumDimensions(input), 4);
   TF_LITE_ENSURE_EQ(context, NumDimensions(size), 1);
 
-  // TODO(ahentz): Our current implementations only support float32.
-  TF_LITE_ENSURE_EQ(context, input->type, kTfLiteFloat32);
   TF_LITE_ENSURE_EQ(context, size->type, kTfLiteInt32);
   // ResizeBilinear creates a float tensor even when the input is made of
   // integers.
-  output->type = kTfLiteFloat32;
+  output->type = input->type;
 
   if (!IsConstantTensor(size)) {
     SetTensorToDynamic(output);
@@ -90,17 +88,24 @@ TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
   }
 
   if (output->type == kTfLiteFloat32) {
-#define TF_LITE_RESIZE_BILINEAR(type)                                       \
-  type::ResizeBilinear(GetTensorData<float>(input), GetTensorDims(input),   \
-                       GetTensorData<int32>(size), GetTensorDims(size),     \
-                       GetTensorData<float>(output), GetTensorDims(output), \
+#define TF_LITE_RESIZE_BILINEAR(type, datatype)                                \
+  type::ResizeBilinear(GetTensorData<datatype>(input), GetTensorDims(input),   \
+                       GetTensorData<int32>(size), GetTensorDims(size),        \
+                       GetTensorData<datatype>(output), GetTensorDims(output), \
                        params->align_corners)
 
     if (kernel_type == kReference) {
-      TF_LITE_RESIZE_BILINEAR(reference_ops);
+      TF_LITE_RESIZE_BILINEAR(reference_ops, float);
     }
     if (kernel_type == kGenericOptimized || kernel_type == kNeonOptimized) {
-      TF_LITE_RESIZE_BILINEAR(optimized_ops);
+      TF_LITE_RESIZE_BILINEAR(optimized_ops, float);
+    }
+  } else if (output->type == kTfLiteUInt8) {
+    if (kernel_type == kReference) {
+      TF_LITE_RESIZE_BILINEAR(reference_ops, uint8_t);
+    }
+    if (kernel_type == kGenericOptimized || kernel_type == kNeonOptimized) {
+      TF_LITE_RESIZE_BILINEAR(optimized_ops, uint8_t);
     }
 #undef TF_LITE_RESIZE_BILINEAR
   } else {
diff --git a/tensorflow/contrib/lite/kernels/resize_bilinear_test.cc b/tensorflow/contrib/lite/kernels/resize_bilinear_test.cc
index 4e03f3820a5c14ee1692c553db61e385716b1723..10caffea03ebcec7862df1627541ac3d076b04e4 100644
--- a/tensorflow/contrib/lite/kernels/resize_bilinear_test.cc
+++ b/tensorflow/contrib/lite/kernels/resize_bilinear_test.cc
@@ -22,6 +22,7 @@ namespace tflite {
 namespace {
 
 using ::testing::ElementsAreArray;
+using uint8 = std::uint8_t;
 
 class ResizeBilinearOpModel : public SingleOpModel {
  public:
@@ -34,7 +35,7 @@ class ResizeBilinearOpModel : public SingleOpModel {
     } else {
       size_ = AddInput({TensorType_INT32, {2}});
     }
-    output_ = AddOutput(TensorType_FLOAT32);  // Always float.
+    output_ = AddOutput(input.type);
     SetBuiltinOp(BuiltinOperator_RESIZE_BILINEAR,
                  BuiltinOptions_ResizeBilinearOptions,
                  CreateResizeBilinearOptions(builder_).Union());
@@ -45,12 +46,16 @@ class ResizeBilinearOpModel : public SingleOpModel {
     }
   }
 
-  void SetInput(std::initializer_list<float> data) {
+  template <typename T>
+  void SetInput(std::initializer_list<T> data) {
     PopulateTensor(input_, data);
   }
   void SetSize(std::initializer_list<int> data) { PopulateTensor(size_, data); }
 
-  std::vector<float> GetOutput() { return ExtractVector<float>(output_); }
+  template <typename T>
+  std::vector<T> GetOutput() {
+    return ExtractVector<T>(output_);
+  }
 
  private:
   int input_;
@@ -60,60 +65,121 @@ class ResizeBilinearOpModel : public SingleOpModel {
 
 TEST(ResizeBilinearOpTest, HorizontalResize) {
   ResizeBilinearOpModel m({TensorType_FLOAT32, {1, 1, 2, 1}});
-  m.SetInput({3, 6});
+  m.SetInput<float>({3, 6});
   m.SetSize({1, 3});
   m.Invoke();
-  EXPECT_THAT(m.GetOutput(), ElementsAreArray(ArrayFloatNear({3, 5, 6})));
+  EXPECT_THAT(m.GetOutput<float>(),
+              ElementsAreArray(ArrayFloatNear({3, 5, 6})));
 
   ResizeBilinearOpModel const_m({TensorType_FLOAT32, {1, 1, 2, 1}}, {1, 3});
-  const_m.SetInput({3, 6});
+  const_m.SetInput<float>({3, 6});
+  const_m.Invoke();
+  EXPECT_THAT(const_m.GetOutput<float>(),
+              ElementsAreArray(ArrayFloatNear({3, 5, 6})));
+}
+
+TEST(ResizeBilinearOpTest, HorizontalResize8Bit) {
+  ResizeBilinearOpModel m({TensorType_UINT8, {1, 1, 2, 1}});
+  m.SetInput<uint8>({3, 6});
+  m.SetSize({1, 3});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutput<uint8>(),
+              ElementsAreArray(ArrayFloatNear({3, 5, 6})));
+
+  ResizeBilinearOpModel const_m({TensorType_UINT8, {1, 1, 2, 1}}, {1, 3});
+  const_m.SetInput<uint8>({3, 6});
   const_m.Invoke();
-  EXPECT_THAT(const_m.GetOutput(), ElementsAreArray(ArrayFloatNear({3, 5, 6})));
+  EXPECT_THAT(const_m.GetOutput<uint8>(),
+              ElementsAreArray(ArrayFloatNear({3, 5, 6})));
 }
 
 TEST(ResizeBilinearOpTest, VerticalResize) {
   ResizeBilinearOpModel m({TensorType_FLOAT32, {1, 2, 1, 1}});
-  m.SetInput({3, 9});
+  m.SetInput<float>({3, 9});
   m.SetSize({3, 1});
   m.Invoke();
-  EXPECT_THAT(m.GetOutput(), ElementsAreArray(ArrayFloatNear({3, 7, 9})));
+  EXPECT_THAT(m.GetOutput<float>(),
+              ElementsAreArray(ArrayFloatNear({3, 7, 9})));
 
   ResizeBilinearOpModel const_m({TensorType_FLOAT32, {1, 2, 1, 1}}, {3, 1});
-  const_m.SetInput({3, 9});
+  const_m.SetInput<float>({3, 9});
+  const_m.Invoke();
+  EXPECT_THAT(const_m.GetOutput<float>(),
+              ElementsAreArray(ArrayFloatNear({3, 7, 9})));
+}
+
+TEST(ResizeBilinearOpTest, VerticalResize8Bit) {
+  ResizeBilinearOpModel m({TensorType_UINT8, {1, 2, 1, 1}});
+  m.SetInput<uint8>({3, 9});
+  m.SetSize({3, 1});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutput<uint8>(),
+              ElementsAreArray(ArrayFloatNear({3, 7, 9})));
+
+  ResizeBilinearOpModel const_m({TensorType_UINT8, {1, 2, 1, 1}}, {3, 1});
+  const_m.SetInput<uint8>({3, 9});
   const_m.Invoke();
-  EXPECT_THAT(const_m.GetOutput(), ElementsAreArray(ArrayFloatNear({3, 7, 9})));
+  EXPECT_THAT(const_m.GetOutput<uint8>(),
+              ElementsAreArray(ArrayFloatNear({3, 7, 9})));
 }
 
 TEST(ResizeBilinearOpTest, TwoDimensionalResize) {
   ResizeBilinearOpModel m({TensorType_FLOAT32, {1, 2, 2, 1}});
-  m.SetInput({
+  m.SetInput<float>({
       3, 6,  //
       9, 12  //
   });
   m.SetSize({3, 3});
   m.Invoke();
-  EXPECT_THAT(m.GetOutput(), ElementsAreArray(ArrayFloatNear({
-                                 3, 5, 6,    //
-                                 7, 9, 10,   //
-                                 9, 11, 12,  //
-                             })));
+  EXPECT_THAT(m.GetOutput<float>(), ElementsAreArray(ArrayFloatNear({
+                                        3, 5, 6,    //
+                                        7, 9, 10,   //
+                                        9, 11, 12,  //
+                                    })));
 
   ResizeBilinearOpModel const_m({TensorType_FLOAT32, {1, 2, 2, 1}}, {3, 3});
-  const_m.SetInput({
+  const_m.SetInput<float>({
       3, 6,  //
       9, 12  //
   });
   const_m.Invoke();
-  EXPECT_THAT(const_m.GetOutput(), ElementsAreArray(ArrayFloatNear({
-                                       3, 5, 6,    //
-                                       7, 9, 10,   //
-                                       9, 11, 12,  //
-                                   })));
+  EXPECT_THAT(const_m.GetOutput<float>(), ElementsAreArray(ArrayFloatNear({
+                                              3, 5, 6,    //
+                                              7, 9, 10,   //
+                                              9, 11, 12,  //
+                                          })));
+}
+
+TEST(ResizeBilinearOpTest, TwoDimensionalResize8Bit) {
+  ResizeBilinearOpModel m({TensorType_UINT8, {1, 2, 2, 1}});
+  m.SetInput<uint8>({
+      3, 6,  //
+      9, 12  //
+  });
+  m.SetSize({3, 3});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutput<uint8>(), ElementsAreArray(ArrayFloatNear({
+                                        3, 5, 6,    //
+                                        7, 9, 10,   //
+                                        9, 11, 12,  //
+                                    })));
+
+  ResizeBilinearOpModel const_m({TensorType_UINT8, {1, 2, 2, 1}}, {3, 3});
+  const_m.SetInput<uint8>({
+      3, 6,  //
+      9, 12  //
+  });
+  const_m.Invoke();
+  EXPECT_THAT(const_m.GetOutput<uint8>(), ElementsAreArray(ArrayFloatNear({
+                                              3, 5, 6,    //
+                                              7, 9, 10,   //
+                                              9, 11, 12,  //
+                                          })));
 }
 
 TEST(ResizeBilinearOpTest, TwoDimensionalResizeWithTwoBatches) {
   ResizeBilinearOpModel m({TensorType_FLOAT32, {2, 2, 2, 1}});
-  m.SetInput({
+  m.SetInput<float>({
       3, 6,   //
       9, 12,  //
       4, 10,  //
@@ -121,60 +187,123 @@ TEST(ResizeBilinearOpTest, TwoDimensionalResizeWithTwoBatches) {
   });
   m.SetSize({3, 3});
   m.Invoke();
-  EXPECT_THAT(m.GetOutput(), ElementsAreArray(ArrayFloatNear({
-                                 3, 5, 6,     //
-                                 7, 9, 10,    //
-                                 9, 11, 12,   //
-                                 4, 8, 10,    //
-                                 8, 12, 14,   //
-                                 10, 14, 16,  //
-                             })));
+  EXPECT_THAT(m.GetOutput<float>(), ElementsAreArray(ArrayFloatNear({
+                                        3, 5, 6,     //
+                                        7, 9, 10,    //
+                                        9, 11, 12,   //
+                                        4, 8, 10,    //
+                                        8, 12, 14,   //
+                                        10, 14, 16,  //
+                                    })));
 
   ResizeBilinearOpModel const_m({TensorType_FLOAT32, {2, 2, 2, 1}}, {3, 3});
-  const_m.SetInput({
+  const_m.SetInput<float>({
       3, 6,   //
       9, 12,  //
       4, 10,  //
       10, 16  //
   });
   const_m.Invoke();
-  EXPECT_THAT(const_m.GetOutput(), ElementsAreArray(ArrayFloatNear({
-                                       3, 5, 6,     //
-                                       7, 9, 10,    //
-                                       9, 11, 12,   //
-                                       4, 8, 10,    //
-                                       8, 12, 14,   //
-                                       10, 14, 16,  //
-                                   })));
+  EXPECT_THAT(const_m.GetOutput<float>(), ElementsAreArray(ArrayFloatNear({
+                                              3, 5, 6,     //
+                                              7, 9, 10,    //
+                                              9, 11, 12,   //
+                                              4, 8, 10,    //
+                                              8, 12, 14,   //
+                                              10, 14, 16,  //
+                                          })));
 }
 
 TEST(ResizeBilinearOpTest, ThreeDimensionalResize) {
   ResizeBilinearOpModel m({TensorType_FLOAT32, {1, 2, 2, 2}});
-  m.SetInput({
+  m.SetInput<float>({
       3, 4, 6, 10,    //
       9, 10, 12, 16,  //
   });
   m.SetSize({3, 3});
   m.Invoke();
-  EXPECT_THAT(m.GetOutput(), ElementsAreArray(ArrayFloatNear({
-                                 3, 4, 5, 8, 6, 10,      //
-                                 7, 8, 9, 12, 10, 14,    //
-                                 9, 10, 11, 14, 12, 16,  //
-                             })));
+  EXPECT_THAT(m.GetOutput<float>(), ElementsAreArray(ArrayFloatNear({
+                                        3, 4, 5, 8, 6, 10,      //
+                                        7, 8, 9, 12, 10, 14,    //
+                                        9, 10, 11, 14, 12, 16,  //
+                                    })));
 
   ResizeBilinearOpModel const_m({TensorType_FLOAT32, {1, 2, 2, 2}}, {3, 3});
-  const_m.SetInput({
+  const_m.SetInput<float>({
       3, 4, 6, 10,    //
       9, 10, 12, 16,  //
   });
   const_m.Invoke();
-  EXPECT_THAT(const_m.GetOutput(), ElementsAreArray(ArrayFloatNear({
-                                       3, 4, 5, 8, 6, 10,      //
-                                       7, 8, 9, 12, 10, 14,    //
-                                       9, 10, 11, 14, 12, 16,  //
-                                   })));
+  EXPECT_THAT(const_m.GetOutput<float>(), ElementsAreArray(ArrayFloatNear({
+                                              3, 4, 5, 8, 6, 10,      //
+                                              7, 8, 9, 12, 10, 14,    //
+                                              9, 10, 11, 14, 12, 16,  //
+                                          })));
+}
+
+TEST(ResizeBilinearOpTest, TwoDimensionalResizeWithTwoBatches8Bit) {
+  ResizeBilinearOpModel m({TensorType_UINT8, {2, 2, 2, 1}});
+  m.SetInput<uint8>({
+      3, 6,   //
+      9, 12,  //
+      4, 10,  //
+      10, 16  //
+  });
+  m.SetSize({3, 3});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutput<uint8>(), ElementsAreArray(ArrayFloatNear({
+                                        3, 5, 6,     //
+                                        7, 9, 10,    //
+                                        9, 11, 12,   //
+                                        4, 8, 10,    //
+                                        8, 12, 14,   //
+                                        10, 13, 16,  //
+                                    })));
+
+  ResizeBilinearOpModel const_m({TensorType_UINT8, {2, 2, 2, 1}}, {3, 3});
+  const_m.SetInput<uint8>({
+      3, 6,   //
+      9, 12,  //
+      4, 10,  //
+      10, 16  //
+  });
+  const_m.Invoke();
+  EXPECT_THAT(const_m.GetOutput<uint8>(), ElementsAreArray(ArrayFloatNear({
+                                              3, 5, 6,     //
+                                              7, 9, 10,    //
+                                              9, 11, 12,   //
+                                              4, 8, 10,    //
+                                              8, 12, 14,   //
+                                              10, 13, 16,  //
+                                          })));
 }
 
+TEST(ResizeBilinearOpTest, ThreeDimensionalResize8Bit) {
+  ResizeBilinearOpModel m({TensorType_UINT8, {1, 2, 2, 2}});
+  m.SetInput<uint8>({
+      3, 4, 6, 10,    //
+      9, 10, 12, 16,  //
+  });
+  m.SetSize({3, 3});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutput<uint8>(), ElementsAreArray(ArrayFloatNear({
+                                        3, 4, 5, 8, 6, 10,      //
+                                        7, 8, 9, 12, 10, 14,    //
+                                        9, 10, 11, 13, 12, 16,  //
+                                    })));
+
+  ResizeBilinearOpModel const_m({TensorType_UINT8, {1, 2, 2, 2}}, {3, 3});
+  const_m.SetInput<uint8>({
+      3, 4, 6, 10,    //
+      9, 10, 12, 16,  //
+  });
+  const_m.Invoke();
+  EXPECT_THAT(const_m.GetOutput<uint8>(), ElementsAreArray(ArrayFloatNear({
+                                              3, 4, 5, 8, 6, 10,      //
+                                              7, 8, 9, 12, 10, 14,    //
+                                              9, 10, 11, 13, 12, 16,  //
+                                          })));
+}
 }  // namespace
 }  // namespace tflite
 
diff --git a/tensorflow/contrib/lite/kernels/sparse_to_dense.cc b/tensorflow/contrib/lite/kernels/sparse_to_dense.cc
new file mode 100644
index 0000000000000000000000000000000000000000..404c32ad9ca8b9f1e467b747708ccb451f2a5118
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/sparse_to_dense.cc
@@ -0,0 +1,275 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <unistd.h>
+#include <cassert>
+#include <cmath>
+#include <cstdio>
+#include <cstdlib>
+#include <iostream>
+#include <limits>
+
+#include "tensorflow/contrib/lite/builtin_op_data.h"
+#include "tensorflow/contrib/lite/context.h"
+#include "tensorflow/contrib/lite/kernels/internal/reference/reference_ops.h"
+#include "tensorflow/contrib/lite/kernels/internal/tensor.h"
+#include "tensorflow/contrib/lite/kernels/kernel_util.h"
+#include "tensorflow/contrib/lite/kernels/op_macros.h"
+#include "tensorflow/contrib/lite/kernels/padding.h"
+
+namespace tflite {
+namespace ops {
+namespace builtin {
+namespace sparse_to_dense {
+
+constexpr int kIndicesTensor = 0;
+constexpr int kOutputShapeTensor = 1;
+constexpr int kValueInputTensor = 2;
+constexpr int kDefaultValueTensor = 3;
+constexpr int kOutputTensor = 0;
+
+constexpr int kMaxDimensions = 4;
+
+template <typename T>
+TfLiteStatus Resize(TfLiteContext* context, const TfLiteTensor* output_shape,
+                    TfLiteTensor* output) {
+  const int output_dimensions = NumElements(output_shape);
+  TfLiteIntArray* output_shape_array = TfLiteIntArrayCreate(output_dimensions);
+  for (int i = 0; i < output_dimensions; ++i) {
+    output_shape_array->data[i] = GetTensorData<T>(output_shape)[i];
+  }
+
+  return context->ResizeTensor(context, output, output_shape_array);
+}
+
+TfLiteStatus CheckDimensionsMatch(TfLiteContext* context,
+                                  const TfLiteTensor* indices,
+                                  const TfLiteTensor* output_shape,
+                                  const TfLiteTensor* values) {
+  switch (NumDimensions(indices)) {
+    case 0:
+    case 1: {
+      if (NumDimensions(values) == 0) {
+        TF_LITE_ENSURE_EQ(context, NumElements(indices), NumElements(values));
+      }
+      TF_LITE_ENSURE_EQ(context, NumElements(output_shape), 1);
+      break;
+    }
+    case 2: {
+      TF_LITE_ENSURE_EQ(context, SizeOfDimension(indices, 1),
+                        NumElements(output_shape));
+      if (NumDimensions(values) == 0)
+        TF_LITE_ENSURE_EQ(context, SizeOfDimension(indices, 0),
+                          NumElements(values));
+      break;
+    }
+    default:
+      context->ReportError(
+          context, "Wrong indices dimensions %d, should be less than 3.",
+          NumDimensions(indices));
+      return kTfLiteError;
+  }
+  return kTfLiteOk;
+}
+
+// Convert indices into a vector of 4-d vectors.
+// TODO(renjieliu): Revisit here to improve the performance, since multiple
+// allocations of std::vectors will be quite slow on phones.
+template <typename T>
+TfLiteStatus GetIndicesVector(TfLiteContext* context,
+                              const TfLiteTensor* indices,
+                              const int num_indices,
+                              std::vector<std::vector<T>>* indices_vector) {
+  // Note because TfLite will reverse the dimensions, so pad zeros upfront.
+  switch (NumDimensions(indices)) {
+    case 0:
+    case 1: {
+      const auto indices_data = GetTensorData<T>(indices);
+      for (int i = 0; i < num_indices; ++i) {
+        std::vector<T> index({0, 0, 0, indices_data[i]});
+        indices_vector->push_back(index);
+      }
+      break;
+    }
+    case 2: {
+      const int true_dimensions = SizeOfDimension(indices, 1);
+      TF_LITE_ENSURE(context, true_dimensions <= kMaxDimensions);
+      for (int i = 0; i < num_indices; ++i) {
+        std::vector<T> index;
+        index.reserve(kMaxDimensions);
+        // Fill the index with 1 up to kMaxDimensions - true_dimensions to
+        // satisfy the needs for 4-dimension index.
+        for (int j = 0; j < kMaxDimensions - true_dimensions; ++j) {
+          index.push_back(0);
+        }
+        for (int j = 0; j < true_dimensions; ++j) {
+          index.push_back(GetTensorData<T>(indices)[i * true_dimensions + j]);
+        }
+
+        indices_vector->push_back(index);
+      }
+      break;
+    }
+    default:
+      context->ReportError(context,
+                           "Indices dimensions problem, got %d dimensions",
+                           NumDimensions(indices));
+      return kTfLiteError;
+  }
+  return kTfLiteOk;
+}
+
+TfLiteStatus ResizeOutputShape(TfLiteContext* context,
+                               const TfLiteTensor* output_shape,
+                               TfLiteTensor* output) {
+  if (output_shape->type == kTfLiteInt32) {
+    return Resize<int32_t>(context, output_shape, output);
+  } else if (output_shape->type == kTfLiteInt64) {
+    return Resize<int64_t>(context, output_shape, output);
+  } else {
+    context->ReportError(context, "Dense shape type %d not supported.",
+                         output_shape->type);
+    return kTfLiteError;
+  }
+}
+
+TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
+  TF_LITE_ENSURE_EQ(context, NumInputs(node), 4);
+  TF_LITE_ENSURE_EQ(context, NumOutputs(node), 1);
+
+  const TfLiteTensor* indices = GetInput(context, node, kIndicesTensor);
+  const TfLiteTensor* output_shape =
+      GetInput(context, node, kOutputShapeTensor);
+  const TfLiteTensor* values = GetInput(context, node, kValueInputTensor);
+  const TfLiteTensor* default_value =
+      GetInput(context, node, kDefaultValueTensor);
+
+  // TODO(renjieliu): Handle validate_indices.
+
+  // Indices can be 0-D, 1-D or 2-D.
+  TF_LITE_ASSERT(NumDimensions(indices) >= 0);
+  TF_LITE_ENSURE(context, NumDimensions(indices) < 3);
+  TF_LITE_ASSERT(NumDimensions(output_shape) >= 0);
+  TF_LITE_ENSURE_EQ(context, NumDimensions(output_shape), 1);
+  // Values can be 0-D or 1-D.
+  TF_LITE_ASSERT(NumDimensions(values) >= 0);
+  TF_LITE_ENSURE(context, NumDimensions(values) < 2);
+
+  TF_LITE_ENSURE_EQ(context, NumElements(default_value), 1);
+
+  TF_LITE_ENSURE(
+      context, indices->type == kTfLiteInt32 || indices->type == kTfLiteInt64);
+  TF_LITE_ENSURE(context, output_shape->type == kTfLiteInt32 ||
+                              output_shape->type == kTfLiteInt64);
+  TF_LITE_ENSURE_EQ(context, values->type, default_value->type);
+
+  // Ensure dimensions match.
+  TF_LITE_ENSURE_OK(
+      context, CheckDimensionsMatch(context, indices, output_shape, values));
+
+  TfLiteTensor* output = GetOutput(context, node, kOutputTensor);
+  TF_LITE_ENSURE_EQ(context, NumDimensions(output_shape), 1);
+
+  if (!IsConstantTensor(output_shape)) {
+    SetTensorToDynamic(output);
+    return kTfLiteOk;
+  }
+  return ResizeOutputShape(context, output_shape, output);
+}
+
+template <typename T, typename I>
+TfLiteStatus SparseToDenseImpl(TfLiteContext* context, TfLiteNode* node) {
+  const TfLiteTensor* indices = GetInput(context, node, kIndicesTensor);
+  const TfLiteTensor* output_shape =
+      GetInput(context, node, kOutputShapeTensor);
+  const TfLiteTensor* values = GetInput(context, node, kValueInputTensor);
+  const TfLiteTensor* default_value =
+      GetInput(context, node, kDefaultValueTensor);
+  TfLiteTensor* output = GetOutput(context, node, kOutputTensor);
+
+  if (IsDynamicTensor(output)) {
+    TF_LITE_ENSURE_OK(context,
+                      ResizeOutputShape(context, output_shape, output));
+  }
+
+  const int num_indices = SizeOfDimension(indices, 0);
+  const bool value_is_scalar = NumDimensions(values) == 0;
+  std::vector<std::vector<I>> indices_vector;
+  indices_vector.reserve(num_indices);
+  TF_LITE_ENSURE_OK(context, GetIndicesVector<I>(context, indices, num_indices,
+                                                 &indices_vector));
+  reference_ops::SparseToDense(indices_vector, GetTensorData<T>(values),
+                               *GetTensorData<T>(default_value),
+                               GetTensorData<T>(output), GetTensorDims(output),
+                               value_is_scalar);
+  return kTfLiteOk;
+}
+
+TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
+  const TfLiteTensor* indices = GetInput(context, node, kIndicesTensor);
+  const TfLiteTensor* values = GetInput(context, node, kValueInputTensor);
+
+  // Currently only supports float32 and int32.
+  switch (values->type) {
+    case kTfLiteFloat32: {
+      switch (indices->type) {
+        case kTfLiteInt32: {
+          return SparseToDenseImpl<float, int32_t>(context, node);
+        }
+        case kTfLiteInt64: {
+          return SparseToDenseImpl<float, int64_t>(context, node);
+        }
+        default:
+          context->ReportError(
+              context, "Type %d is currently not supported by sparse to dense.",
+              indices->type);
+          return kTfLiteError;
+      }
+      break;
+    }
+    case kTfLiteInt32: {
+      switch (indices->type) {
+        case kTfLiteInt32: {
+          return SparseToDenseImpl<int32_t, int32_t>(context, node);
+        }
+        case kTfLiteInt64: {
+          return SparseToDenseImpl<int32_t, int64_t>(context, node);
+        }
+        default:
+          context->ReportError(
+              context, "Type %d is currently not supported by sparse to dense.",
+              indices->type);
+          return kTfLiteError;
+      }
+      break;
+    }
+    default:
+      context->ReportError(
+          context, "Type %d is currently not supported by sparse to dense.",
+          values->type);
+      return kTfLiteError;
+  }
+}
+
+}  // namespace sparse_to_dense
+
+TfLiteRegistration* Register_SPARSE_TO_DENSE() {
+  static TfLiteRegistration r = {nullptr, nullptr, sparse_to_dense::Prepare,
+                                 sparse_to_dense::Eval};
+  return &r;
+}
+
+}  // namespace builtin
+}  // namespace ops
+}  // namespace tflite
diff --git a/tensorflow/contrib/lite/kernels/sparse_to_dense_test.cc b/tensorflow/contrib/lite/kernels/sparse_to_dense_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..a51ec17afcefd791680d7aa42cef467f481f6dbc
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/sparse_to_dense_test.cc
@@ -0,0 +1,155 @@
+
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <cstdarg>
+#include <gtest/gtest.h>
+#include "tensorflow/contrib/lite/interpreter.h"
+#include "tensorflow/contrib/lite/kernels/register.h"
+#include "tensorflow/contrib/lite/kernels/test_util.h"
+#include "tensorflow/contrib/lite/model.h"
+
+namespace tflite {
+namespace {
+
+using ::testing::ElementsAreArray;
+
+template <typename T>
+class SparseToDenseOpModel : public SingleOpModel {
+ public:
+  SparseToDenseOpModel(std::initializer_list<int> indices_shape,
+                       std::initializer_list<int> output_shape_shape,
+                       std::initializer_list<int> values_shape, T default_value,
+                       TensorType tensor_index_type,
+                       TensorType tensor_input_type) {
+    indices_ = AddInput(tensor_index_type);
+    output_shape_ = AddInput(TensorType_INT32);
+    values_ = AddInput(tensor_input_type);
+    default_value_ = AddInput(tensor_input_type);
+    output_ = AddOutput(tensor_input_type);
+
+    SetBuiltinOp(BuiltinOperator_SPARSE_TO_DENSE,
+                 BuiltinOptions_SparseToDenseOptions,
+                 CreateSparseToDenseOptions(builder_, false).Union());
+    BuildInterpreter({indices_shape, output_shape_shape, values_shape, {1}});
+
+    PopulateTensor<T>(default_value_, {default_value});
+  }
+
+  int indices() { return indices_; }
+  int output_shape() { return output_shape_; }
+  int values() { return values_; }
+
+  std::vector<T> GetOutput() { return ExtractVector<T>(output_); }
+  std::vector<int> GetOutputShape() { return GetTensorShape(output_); }
+
+ private:
+  int indices_;
+  int output_shape_;
+  int values_;
+  int default_value_;
+  int output_;
+};
+
+TEST(SparseToDenseOpModelTest, ZeroDimensionTest) {
+  SparseToDenseOpModel<float> m({1}, {1}, {1}, 0, TensorType_INT32,
+                                TensorType_FLOAT32);
+  m.PopulateTensor<int32_t>(m.indices(), {3});
+  m.PopulateTensor<int32_t>(m.output_shape(), {5});
+  m.PopulateTensor<float>(m.values(), {7});
+  m.Invoke();
+
+  EXPECT_THAT(m.GetOutput(), ElementsAreArray({0, 0, 0, 7, 0}));
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({5}));
+}
+
+TEST(SparseToDenseOpModelTest, OneDimensionTest) {
+  SparseToDenseOpModel<float> m({3}, {1}, {3}, 0, TensorType_INT32,
+                                TensorType_FLOAT32);
+  m.PopulateTensor<int32_t>(m.indices(), {1, 3, 5});
+  m.PopulateTensor<int32_t>(m.output_shape(), {7});
+  m.PopulateTensor<float>(m.values(), {2, 4, 6});
+  m.Invoke();
+
+  EXPECT_THAT(m.GetOutput(), ElementsAreArray({0, 2, 0, 4, 0, 6, 0}));
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({7}));
+}
+
+TEST(SparseToDenseOpModelTest, TwoDimensionsTest) {
+  SparseToDenseOpModel<float> m({3, 3}, {3}, {3}, 0, TensorType_INT32,
+                                TensorType_FLOAT32);
+  m.PopulateTensor<int32_t>(m.indices(), {0, 0, 0, 1, 2, 1, 2, 0, 1});
+  m.PopulateTensor<int32_t>(m.output_shape(), {3, 3, 3});
+  m.PopulateTensor<float>(m.values(), {2, 4, 6});
+  m.Invoke();
+
+  EXPECT_THAT(m.GetOutput(),
+              ElementsAreArray({2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+                                0, 0, 4, 0, 0, 6, 0, 0, 0, 0, 0, 0, 0}));
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({3, 3, 3}));
+}
+
+TEST(SparseToDenseOpModelTest, DefaultValueTest) {
+  SparseToDenseOpModel<float> m({3, 3}, {3}, {3}, -1, TensorType_INT32,
+                                TensorType_FLOAT32);
+  m.PopulateTensor<int32_t>(m.indices(), {0, 0, 0, 1, 2, 1, 2, 0, 1});
+  m.PopulateTensor<int32_t>(m.output_shape(), {3, 3, 3});
+  m.PopulateTensor<float>(m.values(), {2, 4, 6});
+  m.Invoke();
+
+  EXPECT_THAT(
+      m.GetOutput(),
+      ElementsAreArray({2,  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+                        -1, -1, 4,  -1, -1, 6,  -1, -1, -1, -1, -1, -1, -1}));
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({3, 3, 3}));
+}
+
+TEST(SparseToDenseOpModelTest, IntegerValueTest) {
+  SparseToDenseOpModel<int32_t> m({3, 3}, {3}, {3}, -1, TensorType_INT32,
+                                  TensorType_INT32);
+  m.PopulateTensor<int32_t>(m.indices(), {0, 0, 0, 1, 2, 1, 2, 0, 1});
+  m.PopulateTensor<int32_t>(m.output_shape(), {3, 3, 3});
+  m.PopulateTensor<int32_t>(m.values(), {2, 4, 6});
+  m.Invoke();
+
+  EXPECT_THAT(
+      m.GetOutput(),
+      ElementsAreArray({2,  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+                        -1, -1, 4,  -1, -1, 6,  -1, -1, -1, -1, -1, -1, -1}));
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({3, 3, 3}));
+}
+
+TEST(SparseToDenseOpModelTest, Int64IndexTest) {
+  SparseToDenseOpModel<float> m({3, 3}, {3}, {3}, -1, TensorType_INT64,
+                                TensorType_FLOAT32);
+  m.PopulateTensor<int64_t>(m.indices(), {0, 0, 0, 1, 2, 1, 2, 0, 1});
+  m.PopulateTensor<int32_t>(m.output_shape(), {3, 3, 3});
+  m.PopulateTensor<float>(m.values(), {2, 4, 6});
+  m.Invoke();
+
+  EXPECT_THAT(
+      m.GetOutput(),
+      ElementsAreArray({2,  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+                        -1, -1, 4,  -1, -1, 6,  -1, -1, -1, -1, -1, -1, -1}));
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({3, 3, 3}));
+}
+
+}  // namespace
+}  // namespace tflite
+
+int main(int argc, char** argv) {
+  ::tflite::LogToStderr();
+  ::testing::InitGoogleTest(&argc, argv);
+  return RUN_ALL_TESTS();
+}
diff --git a/tensorflow/contrib/lite/kernels/sub.cc b/tensorflow/contrib/lite/kernels/sub.cc
index d788159a8d80e6479024b7b75624839387a461c7..a8b803589962032db3ed579d31e8b736c3afada0 100644
--- a/tensorflow/contrib/lite/kernels/sub.cc
+++ b/tensorflow/contrib/lite/kernels/sub.cc
@@ -126,16 +126,19 @@ void EvalQuantized(TfLiteContext* context, TfLiteNode* node,
 
   int32 input1_multiplier;
   int input1_shift;
-  QuantizeMultiplierSmallerThanOne(real_input1_multiplier, &input1_multiplier,
-                                   &input1_shift);
+  QuantizeMultiplierSmallerThanOneExp(real_input1_multiplier,
+                                      &input1_multiplier, &input1_shift);
+  input1_shift *= -1;
   int32 input2_multiplier;
   int input2_shift;
-  QuantizeMultiplierSmallerThanOne(real_input2_multiplier, &input2_multiplier,
-                                   &input2_shift);
+  QuantizeMultiplierSmallerThanOneExp(real_input2_multiplier,
+                                      &input2_multiplier, &input2_shift);
+  input2_shift *= -1;
   int32 output_multiplier;
   int output_shift;
-  QuantizeMultiplierSmallerThanOne(real_output_multiplier, &output_multiplier,
-                                   &output_shift);
+  QuantizeMultiplierSmallerThanOneExp(real_output_multiplier,
+                                      &output_multiplier, &output_shift);
+  output_shift *= -1;
 
   int32 output_activation_min, output_activation_max;
   CalculateActivationRangeUint8(params->activation, output,
@@ -175,7 +178,7 @@ TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
                                output);
   } else {
     context->ReportError(
-        context, "output type %d is not support, requires float|uint8 types.",
+        context, "output type %d is not supported, requires float|uint8 types.",
         output->type);
     return kTfLiteError;
   }
diff --git a/tensorflow/contrib/lite/kernels/test_util.cc b/tensorflow/contrib/lite/kernels/test_util.cc
index 1a01ee093626c08badd65858fc16ad44e69e4912..d23ec201b41887b0682242687fc938d76d058c44 100644
--- a/tensorflow/contrib/lite/kernels/test_util.cc
+++ b/tensorflow/contrib/lite/kernels/test_util.cc
@@ -112,6 +112,12 @@ void SingleOpModel::BuildInterpreter(
     if (shape.empty()) continue;
     CHECK(interpreter_->ResizeInputTensor(input_idx, shape) == kTfLiteOk);
   }
+
+  // Modify delegate with function.
+  if (apply_delegate_fn_) {
+    apply_delegate_fn_(interpreter_.get());
+  }
+
   CHECK(interpreter_->AllocateTensors() == kTfLiteOk)
       << "Cannot allocate tensors";
 }
diff --git a/tensorflow/contrib/lite/kernels/test_util.h b/tensorflow/contrib/lite/kernels/test_util.h
index 55edc97d19fa75bedb6c0928fcf9c7be5f434522..db80c0082c394a2cb2f9388d3db5bd1a7cbe6266 100644
--- a/tensorflow/contrib/lite/kernels/test_util.h
+++ b/tensorflow/contrib/lite/kernels/test_util.h
@@ -114,6 +114,13 @@ class SingleOpModel {
   SingleOpModel() {}
   ~SingleOpModel() {}
 
+  // Set a function callback that is run right after graph is prepared
+  // that allows applying external delegates. This is useful for testing
+  // other runtimes like NN API or GPU.
+  void SetApplyDelegate(std::function<void(Interpreter*)> apply_delegate_fn) {
+    apply_delegate_fn_ = apply_delegate_fn;
+  }
+
   // Copying or assignment is disallowed to simplify ownership semantics.
   SingleOpModel(const SingleOpModel&) = delete;
   SingleOpModel& operator=(const SingleOpModel&) = delete;
@@ -317,6 +324,9 @@ class SingleOpModel {
   std::vector<flatbuffers::Offset<Operator>> operators_;
   std::vector<flatbuffers::Offset<Buffer>> buffers_;
   std::map<string, std::function<TfLiteRegistration*()>> custom_registrations_;
+  // A function pointer that gets called after the interpreter is created but
+  // before evaluation happens. This is useful for applying a delegate.
+  std::function<void(Interpreter*)> apply_delegate_fn_;
 };
 
 // Base class for single op unit tests.
diff --git a/tensorflow/contrib/lite/kernels/tile.cc b/tensorflow/contrib/lite/kernels/tile.cc
new file mode 100644
index 0000000000000000000000000000000000000000..af77f074742eb3fef10a74616ff679255911fbb2
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/tile.cc
@@ -0,0 +1,194 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <string.h>
+#include <vector>
+#include "tensorflow/contrib/lite/builtin_op_data.h"
+#include "tensorflow/contrib/lite/context.h"
+#include "tensorflow/contrib/lite/kernels/internal/reference/reference_ops.h"
+#include "tensorflow/contrib/lite/kernels/internal/tensor.h"
+#include "tensorflow/contrib/lite/kernels/kernel_util.h"
+#include "tensorflow/contrib/lite/kernels/op_macros.h"
+namespace tflite {
+namespace ops {
+namespace builtin {
+namespace tile {
+
+constexpr int kInputTensor = 0;
+constexpr int kInputMultipliers = 1;
+constexpr int kOutputTensor = 0;
+
+namespace {
+template <typename T>
+TfLiteIntArray* MultiplyShapeDims(const TfLiteIntArray& shape,
+                                  const TfLiteTensor* multipliers,
+                                  int num_dimensions) {
+  const T* multipliers_v = GetTensorData<T>(multipliers);
+
+  TfLiteIntArray* output_shape = TfLiteIntArrayCreate(num_dimensions);
+  for (int i = 0; i < num_dimensions; ++i) {
+    output_shape->data[i] = shape.data[i] * multipliers_v[i];
+  }
+  return output_shape;
+}
+
+TfLiteStatus ResizeOutput(TfLiteContext* context, TfLiteNode* node) {
+  const TfLiteTensor* input = GetInput(context, node, kInputTensor);
+  TfLiteTensor* output = GetOutput(context, node, kOutputTensor);
+  const TfLiteTensor* multipliers = GetInput(context, node, kInputMultipliers);
+
+  const int num_dimensions = NumDimensions(input);
+  const int num_multipliers = NumElements(multipliers);
+  TF_LITE_ENSURE_EQ(context, num_dimensions, num_multipliers);
+  switch (multipliers->type) {
+    case kTfLiteInt32:
+      return context->ResizeTensor(
+          context, output,
+          MultiplyShapeDims<int32_t>(*input->dims, multipliers,
+                                     num_dimensions));
+    case kTfLiteInt64:
+      return context->ResizeTensor(
+          context, output,
+          MultiplyShapeDims<int64_t>(*input->dims, multipliers,
+                                     num_dimensions));
+    default:
+      context->ReportError(context, "Tile not supported multiply tensor type.");
+      return kTfLiteError;
+  }
+}
+
+template <typename T>
+void CopyMultipleTimes(const T* in_data, int32_t in_size, int32_t multiplier,
+                       T* out_data) {
+  for (int i = 0; i < multiplier; ++i) {
+    const T* in_end = in_data + in_size;
+    T* new_out_data = std::copy(in_data, in_end, out_data);
+    in_data = out_data;
+    out_data = new_out_data;
+  }
+}
+
+template <typename T, typename M>
+std::pair<int, int> TileOneDimension(const TfLiteIntArray& in_dimensions,
+                                     const T* in_data, const M* multipliers,
+                                     T* out_data, int dimension) {
+  const int dimension_size = in_dimensions.data[dimension];
+  if (dimension == in_dimensions.size - 1) {
+    CopyMultipleTimes(in_data, dimension_size, multipliers[dimension],
+                      out_data);
+    return std::make_pair(dimension_size,
+                          dimension_size * multipliers[dimension]);
+  }
+  int total_stride_size = 0, total_tiled_stride_size = 0;
+  const T* copy_from_data = in_data;
+  T* copy_to_data = out_data;
+  for (int i = 0; i < dimension_size; ++i) {
+    int stride_size = 0, tiled_stride_size = 0;
+    std::tie(stride_size, tiled_stride_size) =
+        TileOneDimension(in_dimensions, copy_from_data, multipliers,
+                         copy_to_data, dimension + 1);
+    copy_from_data += stride_size;
+    copy_to_data += tiled_stride_size;
+    total_stride_size += stride_size;
+    total_tiled_stride_size += tiled_stride_size;
+  }
+  CopyMultipleTimes(out_data, total_tiled_stride_size,
+                    multipliers[dimension] - 1,
+                    out_data + total_tiled_stride_size);
+  return std::make_pair(total_stride_size,
+                        total_tiled_stride_size * multipliers[dimension]);
+}
+
+template <typename T>
+void Tile(const TfLiteIntArray& in_dimensions, const TfLiteTensor* in_data,
+          const TfLiteTensor* multipliers, TfLiteTensor* out_data) {
+  // Doing recursively tiling from top to down dimension.
+  switch (multipliers->type) {
+    case kTfLiteInt32:
+      TileOneDimension(in_dimensions, GetTensorData<T>(in_data),
+                       GetTensorData<int32_t>(multipliers),
+                       GetTensorData<T>(out_data), 0);
+      break;
+    case kTfLiteInt64:
+      TileOneDimension(in_dimensions, GetTensorData<T>(in_data),
+                       GetTensorData<int64_t>(multipliers),
+                       GetTensorData<T>(out_data), 0);
+      break;
+    default:
+      break;
+  }
+}
+}  // namespace
+
+TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
+  TF_LITE_ENSURE_EQ(context, NumInputs(node), 2);
+  TF_LITE_ENSURE_EQ(context, NumOutputs(node), 1);
+
+  const TfLiteTensor* input = GetInput(context, node, kInputTensor);
+
+  TfLiteTensor* output = GetOutput(context, node, kOutputTensor);
+  TF_LITE_ENSURE_EQ(context, input->type, output->type);
+
+  const TfLiteTensor* multipliers = GetInput(context, node, kInputMultipliers);
+  // Only int32 and int64 multipliers type is supported.
+  TF_LITE_ENSURE_MSG(context,
+                     (multipliers->type == kTfLiteInt32) ||
+                         (multipliers->type == kTfLiteInt64),
+                     "Tile only supports int32 and int64 mutlipliers.");
+
+  if (IsConstantTensor(multipliers)) {
+    TF_LITE_ENSURE_OK(context, ResizeOutput(context, node));
+  } else {
+    SetTensorToDynamic(output);
+  }
+  return kTfLiteOk;
+}
+
+TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
+  const TfLiteTensor* input = GetInput(context, node, kInputTensor);
+  TfLiteTensor* output = GetOutput(context, node, kOutputTensor);
+  const TfLiteTensor* multipliers = GetInput(context, node, kInputMultipliers);
+
+  if (IsDynamicTensor(output)) {
+    TF_LITE_ENSURE_OK(context, ResizeOutput(context, node));
+  }
+
+  switch (output->type) {
+    case kTfLiteFloat32:
+      Tile<float>(*(input->dims), input, multipliers, output);
+      break;
+    case kTfLiteUInt8:
+      Tile<uint8_t>(*(input->dims), input, multipliers, output);
+      break;
+    case kTfLiteInt32:
+      Tile<int32_t>(*(input->dims), input, multipliers, output);
+      break;
+    case kTfLiteInt64:
+      Tile<int64_t>(*(input->dims), input, multipliers, output);
+      break;
+    default:
+      context->ReportError(context, "Type is currently not supported by Tile.");
+      return kTfLiteError;
+  }
+  return kTfLiteOk;
+}
+
+}  // namespace tile
+TfLiteRegistration* Register_TILE() {
+  static TfLiteRegistration r = {nullptr, nullptr, tile::Prepare, tile::Eval};
+  return &r;
+}
+}  // namespace builtin
+}  // namespace ops
+}  // namespace tflite
diff --git a/tensorflow/contrib/lite/kernels/tile_test.cc b/tensorflow/contrib/lite/kernels/tile_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..a134a75d56ae03a5d03a3cdf632146474b863971
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/tile_test.cc
@@ -0,0 +1,256 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <gtest/gtest.h>
+#include "tensorflow/contrib/lite/builtin_op_data.h"
+#include "tensorflow/contrib/lite/interpreter.h"
+#include "tensorflow/contrib/lite/kernels/register.h"
+#include "tensorflow/contrib/lite/kernels/test_util.h"
+#include "tensorflow/contrib/lite/model.h"
+
+namespace tflite {
+namespace {
+
+using ::testing::ElementsAreArray;
+class TileOpModel : public SingleOpModel {
+ public:
+  TileOpModel(std::initializer_list<int> input_shape, TensorType input_type,
+              TensorType multiply_type) {
+    input_ = AddInput(input_type);
+    multipliers_ = AddInput(TensorType_INT32);
+    output_ = AddOutput(input_type);
+    SetBuiltinOp(BuiltinOperator_TILE, BuiltinOptions_TileOptions, 0);
+    BuildInterpreter({input_shape, {static_cast<int>(input_shape.size())}});
+  }
+
+  void SetInputFloat(std::initializer_list<float> data) {
+    PopulateTensor<float>(input_, data);
+  }
+
+  void SetInputUInt8(std::initializer_list<uint8> data) {
+    PopulateTensor<uint8>(input_, data);
+  }
+
+  void SetInputInt32(std::initializer_list<int32> data) {
+    PopulateTensor<int32>(input_, data);
+  }
+
+  void SetInputInt64(std::initializer_list<int64_t> data) {
+    PopulateTensor<int64_t>(input_, data);
+  }
+
+  void SetMultipliers(std::initializer_list<int32> data) {
+    PopulateTensor<int32>(multipliers_, data);
+  }
+
+  std::vector<float> GetOutputFloat() { return ExtractVector<float>(output_); }
+
+  std::vector<uint8> GetOutputUInt8() { return ExtractVector<uint8>(output_); }
+
+  std::vector<int32> GetOutputInt32() { return ExtractVector<int32>(output_); }
+
+  std::vector<int64_t> GetOutputInt64() {
+    return ExtractVector<int64_t>(output_);
+  }
+
+  std::vector<int> GetOutputShape() { return GetTensorShape(output_); }
+
+ protected:
+  int input_;
+  int multipliers_;
+  int output_;
+};
+
+TEST(TileTest, Float32Vector) {
+  TileOpModel m({3}, TensorType_FLOAT32, TensorType_INT32);
+  m.SetInputFloat({1.f, 2.f, 3.f});
+  m.SetMultipliers({2});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputFloat(),
+              ElementsAreArray({1.f, 2.f, 3.f, 1.f, 2.f, 3.f}));
+}
+
+TEST(TileTest, Float32Matrix) {
+  TileOpModel m({2, 3}, TensorType_FLOAT32, TensorType_INT32);
+  m.SetInputFloat({
+      11.f,
+      12.f,
+      13.f,
+      21.f,
+      22.f,
+      23.f,
+  });
+  m.SetMultipliers({2, 1});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputFloat(), ElementsAreArray({
+                                      11.f,
+                                      12.f,
+                                      13.f,
+                                      21.f,
+                                      22.f,
+                                      23.f,
+                                      11.f,
+                                      12.f,
+                                      13.f,
+                                      21.f,
+                                      22.f,
+                                      23.f,
+                                  }));
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({4, 3}));
+}
+
+TEST(TileTest, Float32HighDimension) {
+  TileOpModel m({1, 2, 3}, TensorType_FLOAT32, TensorType_INT32);
+  m.SetInputFloat({
+      11.f,
+      12.f,
+      13.f,
+      21.f,
+      22.f,
+      23.f,
+  });
+  m.SetMultipliers({2, 3, 1});
+  m.Invoke();
+  EXPECT_THAT(
+      m.GetOutputFloat(),
+      ElementsAreArray({11.f, 12.f, 13.f, 21.f, 22.f, 23.f, 11.f, 12.f, 13.f,
+                        21.f, 22.f, 23.f, 11.f, 12.f, 13.f, 21.f, 22.f, 23.f,
+                        11.f, 12.f, 13.f, 21.f, 22.f, 23.f, 11.f, 12.f, 13.f,
+                        21.f, 22.f, 23.f, 11.f, 12.f, 13.f, 21.f, 22.f, 23.f}));
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({2, 6, 3}));
+}
+
+TEST(TileTest, Uint8Matrix) {
+  TileOpModel m({2, 3}, TensorType_UINT8, TensorType_INT32);
+  m.SetInputUInt8({
+      11,
+      12,
+      13,
+      21,
+      22,
+      23,
+  });
+  m.SetMultipliers({2, 1});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputUInt8(), ElementsAreArray({
+                                      11,
+                                      12,
+                                      13,
+                                      21,
+                                      22,
+                                      23,
+                                      11,
+                                      12,
+                                      13,
+                                      21,
+                                      22,
+                                      23,
+                                  }));
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({4, 3}));
+}
+
+TEST(TileTest, Int32Matrix) {
+  TileOpModel m({2, 3}, TensorType_INT32, TensorType_INT32);
+  m.SetInputInt32({
+      11,
+      12,
+      13,
+      21,
+      22,
+      23,
+  });
+  m.SetMultipliers({2, 1});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputInt32(), ElementsAreArray({
+                                      11,
+                                      12,
+                                      13,
+                                      21,
+                                      22,
+                                      23,
+                                      11,
+                                      12,
+                                      13,
+                                      21,
+                                      22,
+                                      23,
+                                  }));
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({4, 3}));
+}
+
+TEST(TileTest, Int64Matrix) {
+  TileOpModel m({2, 3}, TensorType_INT64, TensorType_INT32);
+  m.SetInputInt64({
+      11,
+      12,
+      13,
+      21,
+      22,
+      23,
+  });
+  m.SetMultipliers({2, 1});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputInt64(), ElementsAreArray({
+                                      11,
+                                      12,
+                                      13,
+                                      21,
+                                      22,
+                                      23,
+                                      11,
+                                      12,
+                                      13,
+                                      21,
+                                      22,
+                                      23,
+                                  }));
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({4, 3}));
+}
+
+TEST(TileTest, Int64Matrix64Multipliers) {
+  TileOpModel m({2, 3}, TensorType_INT64, TensorType_INT64);
+  m.SetInputInt64({
+      11,
+      12,
+      13,
+      21,
+      22,
+      23,
+  });
+  m.SetMultipliers({2, 1});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputInt64(), ElementsAreArray({
+                                      11,
+                                      12,
+                                      13,
+                                      21,
+                                      22,
+                                      23,
+                                      11,
+                                      12,
+                                      13,
+                                      21,
+                                      22,
+                                      23,
+                                  }));
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({4, 3}));
+}
+}  // namespace
+}  // namespace tflite
+
+int main(int argc, char** argv) {
+  ::tflite::LogToStderr();
+  ::testing::InitGoogleTest(&argc, argv);
+  return RUN_ALL_TESTS();
+}
diff --git a/tensorflow/contrib/lite/kernels/transpose_conv.cc b/tensorflow/contrib/lite/kernels/transpose_conv.cc
index 3c99661029ed1ac881536f83519dcec355c60d50..e83b1ec9879d3c360203a52835d8486d0a9b81bb 100644
--- a/tensorflow/contrib/lite/kernels/transpose_conv.cc
+++ b/tensorflow/contrib/lite/kernels/transpose_conv.cc
@@ -79,7 +79,7 @@ TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
   // Ensure that weights and inputs have the same channel dimension.
   // Note: TOCO will reorder weights in the following format: OHWI.
   TF_LITE_ENSURE_EQ(context, SizeOfDimension(input, 3),
-                    SizeOfDimension(weights, 0));
+                    SizeOfDimension(weights, 3));
 
   if (!IsConstantTensor(output_shape)) {
     SetTensorToDynamic(output);
diff --git a/tensorflow/contrib/lite/kernels/transpose_conv_test.cc b/tensorflow/contrib/lite/kernels/transpose_conv_test.cc
index 52be08934997f484337e4a3592bc7af832601695..55df8971806ed0baae9f5bcaebd24fb8065ec300 100644
--- a/tensorflow/contrib/lite/kernels/transpose_conv_test.cc
+++ b/tensorflow/contrib/lite/kernels/transpose_conv_test.cc
@@ -88,10 +88,10 @@ TEST(TransposeConvOpModelTest, SimpleTest) {
 // And filter value is derived by:
 // filter = tf.reshape(tf.transpose(filter, perm=[3, 0, 1, 2]), shape=[18, 1])
 TEST(TransposeConvOpModelTest, TwoFiltersTest) {
-  TransposeConvOpModel m({1, 4, 4, 2}, {2, 3, 3, 1}, Padding_SAME, 1, 1);
+  TransposeConvOpModel m({1, 4, 4, 2}, {1, 3, 3, 2}, Padding_SAME, 1, 1);
   m.PopulateTensor<int>(m.output_shape(), {1, 4, 4, 1});
-  m.PopulateTensor<float>(m.filter(), {1, 3, 5, 7, 9, 11, 13, 15, 17, 2, 4, 6,
-                                       8, 10, 12, 14, 16, 18});
+  m.PopulateTensor<float>(m.filter(), {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
+                                       13, 14, 15, 16, 17, 18});
   m.PopulateTensor<float>(
       m.input(),
       {1,  2,  3,  4,  5,  6,  7,  8,  9,  10, 11, 12, 13, 14, 15, 16,
@@ -117,10 +117,10 @@ TEST(TransposeConvOpModelTest, TwoFiltersTest) {
 // And filter value is derived by:
 // filter = tf.reshape(tf.transpose(filter, perm=[3, 0, 1, 2]), shape=[1, 18])
 TEST(TransposeConvOpModelTest, PaddingValidTest) {
-  TransposeConvOpModel m({1, 4, 4, 2}, {2, 3, 3, 1}, Padding_VALID, 1, 1);
+  TransposeConvOpModel m({1, 4, 4, 2}, {1, 3, 3, 2}, Padding_VALID, 1, 1);
   m.PopulateTensor<int>(m.output_shape(), {1, 6, 6, 1});
-  m.PopulateTensor<float>(m.filter(), {1, 3, 5, 7, 9, 11, 13, 15, 17, 2, 4, 6,
-                                       8, 10, 12, 14, 16, 18});
+  m.PopulateTensor<float>(m.filter(), {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
+                                       13, 14, 15, 16, 17, 18});
   m.PopulateTensor<float>(
       m.input(),
       {1,  2,  3,  4,  5,  6,  7,  8,  9,  10, 11, 12, 13, 14, 15, 16,
@@ -171,10 +171,10 @@ TEST(TransposeConvOpModelTest, StrideValidTest) {
 //     [1, 2, 2, 1 ],
 //     "VALID")
 TEST(TransposeConvOpModelTest, MultiChannelTest) {
-  TransposeConvOpModel m({1, 2, 2, 1}, {1, 3, 3, 2}, Padding_VALID, 2, 2);
+  TransposeConvOpModel m({1, 2, 2, 1}, {2, 3, 3, 1}, Padding_VALID, 2, 2);
   m.PopulateTensor<int>(m.output_shape(), {1, 5, 5, 2});
-  m.PopulateTensor<float>(m.filter(), {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
-                                       13, 14, 15, 16, 17, 18});
+  m.PopulateTensor<float>(m.filter(), {1, 3, 5, 7, 9, 11, 13, 15, 17, 2, 4, 6,
+                                       8, 10, 12, 14, 16, 18});
   m.PopulateTensor<float>(m.input(), {1, 2, 3, 4});
   m.Invoke();
 
diff --git a/tensorflow/contrib/lite/kernels/unidirectional_sequence_rnn.cc b/tensorflow/contrib/lite/kernels/unidirectional_sequence_rnn.cc
index 8429dba54bd1806125aadc2119ca59c1bd42ce89..164a0cbd08d6ce82a413f12ba6b1703087a30aba 100644
--- a/tensorflow/contrib/lite/kernels/unidirectional_sequence_rnn.cc
+++ b/tensorflow/contrib/lite/kernels/unidirectional_sequence_rnn.cc
@@ -41,7 +41,7 @@ constexpr int kOutputTensor = 1;
 
 void* Init(TfLiteContext* context, const char* buffer, size_t length) {
   auto* scratch_tensor_index = new int;
-  context->AddTensors(context, /*tensors_to_add=*/2, scratch_tensor_index);
+  context->AddTensors(context, /*tensors_to_add=*/3, scratch_tensor_index);
   return scratch_tensor_index;
 }
 
@@ -102,7 +102,7 @@ TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
   if (input->type == kTfLiteFloat32 && input_weights->type == kTfLiteUInt8) {
     int* scratch_tensor_index = reinterpret_cast<int*>(node->user_data);
     TfLiteIntArrayFree(node->temporaries);
-    node->temporaries = TfLiteIntArrayCreate(2);
+    node->temporaries = TfLiteIntArrayCreate(3);
     node->temporaries->data[0] = *scratch_tensor_index;
     TfLiteTensor* input_quantized = GetTemporary(context, node, /*index=*/0);
     input_quantized->type = kTfLiteUInt8;
@@ -125,6 +125,16 @@ TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
                         context->ResizeTensor(context, hidden_state_quantized,
                                               hidden_state_quantized_size));
     }
+    node->temporaries->data[2] = *scratch_tensor_index + 2;
+    TfLiteTensor* scaling_factors = GetTemporary(context, node, /*index=*/2);
+    scaling_factors->type = kTfLiteFloat32;
+    scaling_factors->allocation_type = kTfLiteArenaRw;
+    TfLiteIntArray* scaling_factors_size = TfLiteIntArrayCreate(1);
+    scaling_factors_size->data[0] = batch_size;
+    if (!TfLiteIntArrayEqual(scaling_factors->dims, scaling_factors_size)) {
+      TF_LITE_ENSURE_OK(context, context->ResizeTensor(context, scaling_factors,
+                                                       scaling_factors_size));
+    }
   }
   return kTfLiteOk;
 }
@@ -187,14 +197,12 @@ TfLiteStatus EvalFloat(const TfLiteTensor* input,
   return kTfLiteOk;
 }
 
-TfLiteStatus EvalQuantized(const TfLiteTensor* input,
-                           const TfLiteTensor* input_weights,
-                           const TfLiteTensor* recurrent_weights,
-                           const TfLiteTensor* bias,
-                           const TfLiteSequenceRNNParams* params,
-                           TfLiteTensor* input_scratch,
-                           TfLiteTensor* hidden_state_scratch,
-                           TfLiteTensor* hidden_state, TfLiteTensor* output) {
+TfLiteStatus EvalHybrid(
+    const TfLiteTensor* input, const TfLiteTensor* input_weights,
+    const TfLiteTensor* recurrent_weights, const TfLiteTensor* bias,
+    const TfLiteSequenceRNNParams* params, TfLiteTensor* input_scratch,
+    TfLiteTensor* hidden_state_scratch, TfLiteTensor* scaling_factors,
+    TfLiteTensor* hidden_state, TfLiteTensor* output) {
   const bool time_major = params->time_major;
   const int batch_size =
       (time_major) ? input->dims->data[1] : input->dims->data[0];
@@ -218,6 +226,7 @@ TfLiteStatus EvalQuantized(const TfLiteTensor* input,
       reinterpret_cast<int8_t*>(input_scratch->data.uint8);
   int8_t* quantized_hidden_state_ptr =
       reinterpret_cast<int8_t*>(hidden_state_scratch->data.uint8);
+  float* scaling_factors_ptr = scaling_factors->data.f;
 
   if (time_major) {
     // Initialize the pointer to hidden state.
@@ -233,7 +242,8 @@ TfLiteStatus EvalQuantized(const TfLiteTensor* input,
           input_ptr_batch, input_weights_ptr, input_weights_scale,
           recurrent_weights_ptr, recurrent_weights_scale, bias_ptr, input_size,
           num_units, batch_size, params->activation, quantized_input_ptr,
-          quantized_hidden_state_ptr, hidden_state_ptr_batch, output_ptr_batch);
+          quantized_hidden_state_ptr, scaling_factors_ptr,
+          hidden_state_ptr_batch, output_ptr_batch);
     }
   } else {
     // For each batch
@@ -252,7 +262,7 @@ TfLiteStatus EvalQuantized(const TfLiteTensor* input,
             recurrent_weights_ptr, recurrent_weights_scale, bias_ptr,
             input_size, num_units, /*batch_size=*/1, params->activation,
             quantized_input_ptr, quantized_hidden_state_ptr,
-            hidden_state_ptr_batch, output_ptr_batch);
+            scaling_factors_ptr, hidden_state_ptr_batch, output_ptr_batch);
       }
     }
   }
@@ -278,9 +288,10 @@ TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
       // TODO(mirkov): implement eval with quantized inputs as well.
       TfLiteTensor* input_quantized = GetTemporary(context, node, 0);
       TfLiteTensor* hidden_state_quantized = GetTemporary(context, node, 1);
-      return EvalQuantized(input, input_weights, recurrent_weights, bias,
-                           params, input_quantized, hidden_state_quantized,
-                           hidden_state, output);
+      TfLiteTensor* scaling_factors = GetTemporary(context, node, 2);
+      return EvalHybrid(input, input_weights, recurrent_weights, bias, params,
+                        input_quantized, hidden_state_quantized,
+                        scaling_factors, hidden_state, output);
     }
     default:
       context->ReportError(context, "Type %d not currently supported.",
diff --git a/tensorflow/contrib/lite/model.cc b/tensorflow/contrib/lite/model.cc
index 80fcb28bc7f6c09c7b979fcefcbc25deef583a00..039f32b38eb29068b223dd63355c66295301beba 100644
--- a/tensorflow/contrib/lite/model.cc
+++ b/tensorflow/contrib/lite/model.cc
@@ -322,12 +322,6 @@ TfLiteStatus ParseOpData(const Operator* op, BuiltinOperator op_type,
 
   *builtin_data = nullptr;
   switch (op_type) {
-    case BuiltinOperator_CALL:
-      // TODO(aselle): Implement call in BuiltinOptions, but nullptrs are
-      // ok for now, since there is no call implementation either.
-      break;
-    case BuiltinOperator_CUSTOM:
-      break;
     case BuiltinOperator_CONV_2D: {
       TfLiteConvParams* params = MallocPOD<TfLiteConvParams>();
       if (auto* conv_params = op->builtin_options_as_Conv2DOptions()) {
@@ -343,21 +337,6 @@ TfLiteStatus ParseOpData(const Operator* op, BuiltinOperator op_type,
       *builtin_data = reinterpret_cast<void*>(params);
       break;
     }
-    case BuiltinOperator_TANH:
-    case BuiltinOperator_LOGISTIC:
-    case BuiltinOperator_RELU:
-    case BuiltinOperator_RELU_N1_TO_1:
-    case BuiltinOperator_RELU6:
-    case BuiltinOperator_CONCAT_EMBEDDINGS:
-    case BuiltinOperator_EXP:
-    case BuiltinOperator_TOPK_V2:
-    case BuiltinOperator_LOG_SOFTMAX:
-    case BuiltinOperator_DEQUANTIZE:
-    case BuiltinOperator_PRELU:
-    case BuiltinOperator_FLOOR:
-    case BuiltinOperator_NEG:
-    case BuiltinOperator_SIN:
-      break;
     case BuiltinOperator_CAST: {
       TfLiteCastParams* params = MallocPOD<TfLiteCastParams>();
       if (auto* schema_params = op->builtin_options_as_CastOptions()) {
@@ -445,9 +424,6 @@ TfLiteStatus ParseOpData(const Operator* op, BuiltinOperator op_type,
       *builtin_data = reinterpret_cast<void*>(params);
       break;
     }
-    case BuiltinOperator_EMBEDDING_LOOKUP:
-      // no-op.
-      break;
     case BuiltinOperator_EMBEDDING_LOOKUP_SPARSE: {
       TfLiteEmbeddingLookupSparseParams* params =
           MallocPOD<TfLiteEmbeddingLookupSparseParams>();
@@ -558,6 +534,14 @@ TfLiteStatus ParseOpData(const Operator* op, BuiltinOperator op_type,
             parse_activation(lstm_params->fused_activation_function());
         params->cell_clip = lstm_params->cell_clip();
         params->proj_clip = lstm_params->proj_clip();
+        switch (lstm_params->kernel_type()) {
+          case LSTMKernelType_FULL:
+            params->kernel_type = kTfLiteLSTMFullKernel;
+            break;
+          case LSTMKernelType_BASIC:
+            params->kernel_type = kTfLiteLSTMBasicKernel;
+            break;
+        }
       }
       *builtin_data = reinterpret_cast<void*>(params);
       break;
@@ -571,12 +555,6 @@ TfLiteStatus ParseOpData(const Operator* op, BuiltinOperator op_type,
       *builtin_data = reinterpret_cast<void*>(params);
       break;
     }
-    case BuiltinOperator_PAD: {
-      break;
-    }
-    case BuiltinOperator_PADV2: {
-      break;
-    }
     case BuiltinOperator_RESHAPE: {
       auto* params = MallocPOD<TfLiteReshapeParams>();
       if (auto* schema_params = op->builtin_options_as_ReshapeOptions()) {
@@ -616,15 +594,6 @@ TfLiteStatus ParseOpData(const Operator* op, BuiltinOperator op_type,
       *builtin_data = reinterpret_cast<void*>(params);
       break;
     }
-    case BuiltinOperator_SPACE_TO_BATCH_ND: {
-      break;
-    }
-    case BuiltinOperator_BATCH_TO_SPACE_ND: {
-      break;
-    }
-    case BuiltinOperator_TRANSPOSE: {
-      break;
-    }
     case BuiltinOperator_MEAN: {
       auto* params = MallocPOD<TfLiteMeanParams>();
       if (auto* schema_params = op->builtin_options_as_MeanOptions()) {
@@ -664,10 +633,6 @@ TfLiteStatus ParseOpData(const Operator* op, BuiltinOperator op_type,
       *builtin_data = reinterpret_cast<void*>(params);
       break;
     }
-    case BuiltinOperator_MAXIMUM:
-    case BuiltinOperator_MINIMUM: {
-      break;
-    }
     case BuiltinOperator_ARG_MAX: {
       auto* params = MallocPOD<TfLiteArgMaxParams>();
       if (auto* schema_params = op->builtin_options_as_ArgMaxOptions()) {
@@ -677,16 +642,6 @@ TfLiteStatus ParseOpData(const Operator* op, BuiltinOperator op_type,
       *builtin_data = reinterpret_cast<void*>(params);
       break;
     }
-    case BuiltinOperator_GREATER:
-    case BuiltinOperator_GREATER_EQUAL:
-    case BuiltinOperator_LESS:
-    case BuiltinOperator_LESS_EQUAL:
-    case BuiltinOperator_SELECT: {
-      break;
-    }
-    case BuiltinOperator_SLICE: {
-      break;
-    }
     case BuiltinOperator_TRANSPOSE_CONV: {
       TfLiteTransposeConvParams* params =
           MallocPOD<TfLiteTransposeConvParams>();
@@ -699,11 +654,61 @@ TfLiteStatus ParseOpData(const Operator* op, BuiltinOperator op_type,
       *builtin_data = reinterpret_cast<void*>(params);
       break;
     }
+    case BuiltinOperator_SPARSE_TO_DENSE: {
+      TfLiteSparseToDenseParams* params =
+          MallocPOD<TfLiteSparseToDenseParams>();
+      if (auto* sparse_to_dense_params =
+              op->builtin_options_as_SparseToDenseOptions()) {
+        params->validate_indices = sparse_to_dense_params->validate_indices();
+      }
+      *builtin_data = reinterpret_cast<void*>(params);
+      break;
+    }
     case BuiltinOperator_DELEGATE: {
       // TODO(ycling): Revisit when supporting saving delegated models.
       error_reporter->Report("DELEGATE op shouldn't exist in model.");
       return kTfLiteError;
     }
+
+    // Below are the ops with no builtin_data strcture.
+    case BuiltinOperator_BATCH_TO_SPACE_ND:
+    // TODO(aselle): Implement call in BuiltinOptions, but nullptrs are
+    // ok for now, since there is no call implementation either.
+    case BuiltinOperator_CALL:
+    case BuiltinOperator_CONCAT_EMBEDDINGS:
+    case BuiltinOperator_CUSTOM:
+    case BuiltinOperator_DEQUANTIZE:
+    case BuiltinOperator_EMBEDDING_LOOKUP:
+    case BuiltinOperator_EQUAL:
+    case BuiltinOperator_EXP:
+    case BuiltinOperator_EXPAND_DIMS:
+    case BuiltinOperator_FLOOR:
+    case BuiltinOperator_GREATER:
+    case BuiltinOperator_GREATER_EQUAL:
+    case BuiltinOperator_LESS:
+    case BuiltinOperator_LESS_EQUAL:
+    case BuiltinOperator_LOG:
+    case BuiltinOperator_LOGISTIC:
+    case BuiltinOperator_LOG_SOFTMAX:
+    case BuiltinOperator_MAXIMUM:
+    case BuiltinOperator_MINIMUM:
+    case BuiltinOperator_NEG:
+    case BuiltinOperator_NOT_EQUAL:
+    case BuiltinOperator_PAD:
+    case BuiltinOperator_PADV2:
+    case BuiltinOperator_PRELU:
+    case BuiltinOperator_RELU:
+    case BuiltinOperator_RELU6:
+    case BuiltinOperator_RELU_N1_TO_1:
+    case BuiltinOperator_SELECT:
+    case BuiltinOperator_SIN:
+    case BuiltinOperator_SLICE:
+    case BuiltinOperator_SPACE_TO_BATCH_ND:
+    case BuiltinOperator_TANH:
+    case BuiltinOperator_TILE:
+    case BuiltinOperator_TOPK_V2:
+    case BuiltinOperator_TRANSPOSE:
+      break;
   }
   return kTfLiteOk;
 }
diff --git a/tensorflow/contrib/lite/models/smartreply/predictor_test.cc b/tensorflow/contrib/lite/models/smartreply/predictor_test.cc
index e6c8d966f1aff5a867f9469f8fcdec526df84763..c7e08814fdf502f1ecfea60af3385fc7aa6055fa 100644
--- a/tensorflow/contrib/lite/models/smartreply/predictor_test.cc
+++ b/tensorflow/contrib/lite/models/smartreply/predictor_test.cc
@@ -35,8 +35,8 @@ const char kModelName[] = "smartreply_ondevice_model.bin";
 const char kSamples[] = "smartreply_samples.tsv";
 
 string TestDataPath() {
-  return string(StrCat(tensorflow::testing::TensorFlowSrcRoot(), "/",
-                       "contrib/lite/models/testdata/"));
+  return string(absl::StrCat(tensorflow::testing::TensorFlowSrcRoot(), "/",
+                             "contrib/lite/models/testdata/"));
 }
 
 MATCHER_P(IncludeAnyResponesIn, expected_response, "contains the response") {
@@ -55,7 +55,7 @@ class PredictorTest : public ::testing::Test {
  protected:
   PredictorTest() {
     model_ = tflite::FlatBufferModel::BuildFromFile(
-        StrCat(TestDataPath(), "/", kModelName).c_str());
+        absl::StrCat(TestDataPath(), "/", kModelName).c_str());
     CHECK(model_);
   }
   ~PredictorTest() override {}
@@ -121,7 +121,7 @@ TEST_F(PredictorTest, BatchTest) {
   int total_triggers = 0;
 
   string line;
-  std::ifstream fin(StrCat(TestDataPath(), "/", kSamples));
+  std::ifstream fin(absl::StrCat(TestDataPath(), "/", kSamples));
   while (std::getline(fin, line)) {
     const std::vector<string> fields = absl::StrSplit(line, '\t');
     if (fields.empty()) {
diff --git a/tensorflow/contrib/lite/nnapi_delegate.cc b/tensorflow/contrib/lite/nnapi_delegate.cc
index 21f1df0be320ba433d623641dc6abdf7d1f3ab69..c71ad1d37937228662798a7370d66147191032ec 100644
--- a/tensorflow/contrib/lite/nnapi_delegate.cc
+++ b/tensorflow/contrib/lite/nnapi_delegate.cc
@@ -159,7 +159,6 @@ uint32_t addTensorOperands(tflite::Interpreter* interpreter,
         nn_type, static_cast<uint32_t>(tensor->dims->size),
         reinterpret_cast<uint32_t*>(tensor->dims->data), scale, zeroPoint};
     CHECK_NN(ANeuralNetworksModel_addOperand(nn_model, &operand_type));
-
     // TODO(aselle): Based on Michael's suggestion, limiting this to read
     // only memory
     if (tensor->allocation_type == kTfLiteMmapRo) {
@@ -172,7 +171,12 @@ uint32_t addTensorOperands(tflite::Interpreter* interpreter,
         CHECK_NN(ANeuralNetworksModel_setOperandValue(
             nn_model, next_id, tensor->data.raw, tensor->bytes));
       }
+    } else if (tensor->bytes == 0) {
+      // These size 0 tensors are optional tensors reserved.
+      CHECK_NN(
+          ANeuralNetworksModel_setOperandValue(nn_model, next_id, nullptr, 0));
     }
+
     ++next_id;
   }
   return next_id;
@@ -181,7 +185,9 @@ uint32_t addTensorOperands(tflite::Interpreter* interpreter,
 // Adds the operations and their parameters to the NN API model.
 // 'next-id' is the operand ID of the next operand of the model.
 void AddOpsAndParams(tflite::Interpreter* interpreter,
-                     ANeuralNetworksModel* nn_model, uint32_t next_id) {
+                     ANeuralNetworksModel* nn_model, uint32_t next_id,
+                     std::vector<int>* model_state_inputs,
+                     std::vector<int>* model_state_outputs) {
   for (size_t i = 0; i < interpreter->nodes_size(); i++) {
     const auto* node_and_registration = interpreter->node_and_registration(i);
     const TfLiteNode& node = node_and_registration->first;
@@ -192,6 +198,8 @@ void AddOpsAndParams(tflite::Interpreter* interpreter,
     // Add the parameters.
     std::vector<uint32_t> augmented_inputs(
         node.inputs->data, node.inputs->data + node.inputs->size);
+    std::vector<uint32_t> augmented_outputs(
+        node.outputs->data, node.outputs->data + node.outputs->size);
 
     auto add_scalar_int32 = [&nn_model, &augmented_inputs,
                              &next_id](int value) {
@@ -211,15 +219,29 @@ void AddOpsAndParams(tflite::Interpreter* interpreter,
       augmented_inputs.push_back(next_id++);
     };
 
+    // Handle state tensors of RNN, LSTM, SVDF.
+    // For each state_out tensor, a corresponding state_in operand needs to be
+    // created for NNAPI.
     auto duplicate_state_tensor_float32 =
-        [interpreter, &nn_model, &augmented_inputs](int tensor_id) {
+        [interpreter, &nn_model, &next_id, &augmented_inputs,
+         &model_state_inputs, &model_state_outputs](int tensor_id) {
           const TfLiteTensor* tensor = interpreter->tensor(tensor_id);
-          CHECK_NN(ANeuralNetworksModel_setOperandValue(
-              nn_model, tensor_id, tensor->data.raw, tensor->bytes));
-          augmented_inputs.push_back(tensor_id);
+          ANeuralNetworksOperandType operand_type{
+              ANEURALNETWORKS_TENSOR_FLOAT32,
+              static_cast<uint32_t>(tensor->dims->size),
+              reinterpret_cast<uint32_t*>(tensor->dims->data),
+              tensor->params.scale, tensor->params.zero_point};
+          CHECK_NN(ANeuralNetworksModel_addOperand(nn_model, &operand_type));
+          augmented_inputs.push_back(next_id);
+          model_state_inputs->push_back(next_id);
+          model_state_outputs->push_back(tensor_id);
+          next_id++;
         };
 
-    auto add_add_params = [&add_scalar_int32]() { add_scalar_int32(0); };
+    auto add_add_params = [&add_scalar_int32](void* data) {
+      auto* builtin = reinterpret_cast<TfLiteAddParams*>(data);
+      add_scalar_int32(builtin->activation);
+    };
 
     auto add_pooling_params = [&add_scalar_int32](void* data) {
       auto builtin = reinterpret_cast<TfLitePoolParams*>(data);
@@ -279,39 +301,62 @@ void AddOpsAndParams(tflite::Interpreter* interpreter,
       add_scalar_float32(builtin->proj_clip);
     };
 
+    // LSTM in NNAPI requires scratch tensor as an output operand.
+    auto add_lstm_scratch_tensor_float32 = [interpreter, &node, &nn_model,
+                                            &next_id, &augmented_outputs]() {
+      int scratch_buffer_index = node.temporaries->data[0];
+      const TfLiteTensor* tensor = interpreter->tensor(scratch_buffer_index);
+      ANeuralNetworksOperandType operand_type{
+          ANEURALNETWORKS_TENSOR_FLOAT32,
+          static_cast<uint32_t>(tensor->dims->size),
+          reinterpret_cast<uint32_t*>(tensor->dims->data), tensor->params.scale,
+          tensor->params.zero_point};
+      CHECK_NN(ANeuralNetworksModel_addOperand(nn_model, &operand_type));
+      augmented_outputs.insert(augmented_outputs.begin(), next_id++);
+    };
+
     auto add_mean_params = [&add_scalar_int32](void* data) {
       auto builtin = reinterpret_cast<TfLiteMeanParams*>(data);
       add_scalar_int32(builtin->keep_dims);
     };
 
-#if 0
-    auto add_reshape_params = [&](void* data) {
-      auto builtin = reinterpret_cast<TfLiteReshapeParams*>(data);
-      uint32_t tensor_size_shape = builtin->num_dimensions;
-      ANeuralNetworksOperandType operand_type{
-          ANEURALNETWORKS_TENSOR_INT32,
-          {static_cast<uint32_t>(1),
-           reinterpret_cast<uint32_t*>(&tensor_size_shape)},
-          0,
-          0};
-      CHECK_NN(ANeuralNetworksModel_addOperand(nn_model, &operand_type))
-      CHECK_NN(ANeuralNetworksModel_setOperandValue(
-          nn_model, next_id, builtin->shape,
-          sizeof(int) * builtin->num_dimensions));
-      augmented_inputs.push_back(next_id++);
+    auto add_svdf_params = [&add_scalar_int32](void* data) {
+      auto builtin = reinterpret_cast<TfLiteSVDFParams*>(data);
+      add_scalar_int32(builtin->rank);
+      add_scalar_int32(builtin->activation);
     };
-#endif
+
+    auto add_rnn_params = [&add_scalar_int32](void* data) {
+      auto builtin = reinterpret_cast<TfLiteRNNParams*>(data);
+      add_scalar_int32(builtin->activation);
+    };
+
+    // Handle optional input tensors.
+    auto add_optional_tensors = [&nn_model, &augmented_inputs,
+                                 &next_id](int nn_type) {
+      for (size_t idx = 0; idx < augmented_inputs.size(); idx++) {
+        if (augmented_inputs[idx] == kOptionalTensor) {
+          const std::vector<uint32_t> dim = {0, 0};
+          ANeuralNetworksOperandType operand_type{nn_type, 2, dim.data(), 0, 0};
+          CHECK_NN(ANeuralNetworksModel_addOperand(nn_model, &operand_type))
+          CHECK_NN(ANeuralNetworksModel_setOperandValue(nn_model, next_id,
+                                                        nullptr, 0))
+          augmented_inputs[idx] = next_id++;
+        }
+      }
+    };
+
     int nnapi_version = 10;
     ANeuralNetworksOperationType nn_op_type;
 
     switch (builtin) {
       case tflite::BuiltinOperator_ADD:
         nn_op_type = ANEURALNETWORKS_ADD;
-        add_add_params();
+        add_add_params(node.builtin_data);
         break;
       case tflite::BuiltinOperator_MUL:
         nn_op_type = ANEURALNETWORKS_MUL;
-        add_add_params();
+        add_add_params(node.builtin_data);
         break;
       case tflite::BuiltinOperator_AVERAGE_POOL_2D:
         add_pooling_params(node.builtin_data);
@@ -370,13 +415,31 @@ void AddOpsAndParams(tflite::Interpreter* interpreter,
         break;
       case tflite::BuiltinOperator_LSTM: {
         duplicate_state_tensor_float32(
-            node.outputs->data[/*kOutputStateTensor*/ 1]);
+            node.outputs->data[/*kOutputStateTensor*/ 0]);
         duplicate_state_tensor_float32(
-            node.outputs->data[/*kCellStateTensor*/ 2]);
+            node.outputs->data[/*kCellStateTensor*/ 1]);
         add_lstm_params(node.builtin_data);
+        add_lstm_scratch_tensor_float32();
+        add_optional_tensors(ANEURALNETWORKS_TENSOR_FLOAT32);
         nn_op_type = ANEURALNETWORKS_LSTM;
         break;
       }
+      case tflite::BuiltinOperator_SVDF: {
+        duplicate_state_tensor_float32(node.outputs->data[/*kStateTensor*/ 0]);
+        add_svdf_params(node.builtin_data);
+        nn_op_type = ANEURALNETWORKS_SVDF;
+        break;
+      }
+      case tflite::BuiltinOperator_RNN: {
+        duplicate_state_tensor_float32(
+            node.outputs->data[/*kHiddenStateTensor*/ 0]);
+        add_rnn_params(node.builtin_data);
+        nn_op_type = ANEURALNETWORKS_RNN;
+        break;
+      }
+      case tflite::BuiltinOperator_EMBEDDING_LOOKUP:
+        nn_op_type = ANEURALNETWORKS_EMBEDDING_LOOKUP;
+        break;
       case tflite::BuiltinOperator_PAD:
         nnapi_version = 11;  // require NNAPI 1.1
         nn_op_type = ANEURALNETWORKS_PAD;
@@ -396,12 +459,9 @@ void AddOpsAndParams(tflite::Interpreter* interpreter,
         break;
       case tflite::BuiltinOperator_CONCAT_EMBEDDINGS:
       case tflite::BuiltinOperator_LSH_PROJECTION:
-      case tflite::BuiltinOperator_SVDF:
       case tflite::BuiltinOperator_HASHTABLE_LOOKUP:
-      case tflite::BuiltinOperator_RNN:
       case tflite::BuiltinOperator_BIDIRECTIONAL_SEQUENCE_RNN:
       case tflite::BuiltinOperator_UNIDIRECTIONAL_SEQUENCE_RNN:
-      case tflite::BuiltinOperator_EMBEDDING_LOOKUP:
       case tflite::BuiltinOperator_EMBEDDING_LOOKUP_SPARSE:
       case tflite::BuiltinOperator_BIDIRECTIONAL_SEQUENCE_LSTM:
       case tflite::BuiltinOperator_UNIDIRECTIONAL_SEQUENCE_LSTM:
@@ -437,7 +497,13 @@ void AddOpsAndParams(tflite::Interpreter* interpreter,
       case tflite::BuiltinOperator_SELECT:
       case tflite::BuiltinOperator_SLICE:
       case tflite::BuiltinOperator_SIN:
+      case tflite::BuiltinOperator_LOG:
       case tflite::BuiltinOperator_TRANSPOSE_CONV:
+      case tflite::BuiltinOperator_TILE:
+      case tflite::BuiltinOperator_EXPAND_DIMS:
+      case tflite::BuiltinOperator_SPARSE_TO_DENSE:
+      case tflite::BuiltinOperator_EQUAL:
+      case tflite::BuiltinOperator_NOT_EQUAL:
         FATAL("Op code %d is currently not delegated to NNAPI", builtin);
         nn_op_type = -1;  // set to invalid
         break;
@@ -454,8 +520,9 @@ void AddOpsAndParams(tflite::Interpreter* interpreter,
     // Add the operation.
     CHECK_NN(ANeuralNetworksModel_addOperation(
         nn_model, nn_op_type, static_cast<uint32_t>(augmented_inputs.size()),
-        augmented_inputs.data(), static_cast<uint32_t>(node.outputs->size),
-        reinterpret_cast<uint32_t*>(node.outputs->data)));
+        augmented_inputs.data(),
+        static_cast<uint32_t>(augmented_outputs.size()),
+        reinterpret_cast<uint32_t*>(augmented_outputs.data())));
   }
 }
 
@@ -479,12 +546,25 @@ TfLiteStatus NNAPIDelegate::BuildGraph(Interpreter* interpreter) {
     }
 
     uint32_t next_id = addTensorOperands(interpreter, nn_model_, skip_list);
-    AddOpsAndParams(interpreter, nn_model_, next_id);
+    AddOpsAndParams(interpreter, nn_model_, next_id, &model_states_inputs_,
+                    &model_states_outputs_);
+
+    std::vector<int> augmented_inputs = interpreter->inputs();
+    std::vector<int> augmented_outputs = interpreter->outputs();
+
+    // All state tensors input/output need to be treated as model input/output.
+    augmented_inputs.insert(augmented_inputs.end(),
+                            model_states_inputs_.begin(),
+                            model_states_inputs_.end());
+    augmented_outputs.insert(augmented_outputs.end(),
+                             model_states_outputs_.begin(),
+                             model_states_outputs_.end());
+
     CHECK_NN(ANeuralNetworksModel_identifyInputsAndOutputs(
-        nn_model_, static_cast<uint32_t>(interpreter->inputs().size()),
-        reinterpret_cast<const uint32_t*>(interpreter->inputs().data()),
-        static_cast<uint32_t>(interpreter->outputs().size()),
-        reinterpret_cast<const uint32_t*>(interpreter->outputs().data())));
+        nn_model_, static_cast<uint32_t>(augmented_inputs.size()),
+        reinterpret_cast<const uint32_t*>(augmented_inputs.data()),
+        static_cast<uint32_t>(augmented_outputs.size()),
+        reinterpret_cast<const uint32_t*>(augmented_outputs.data())));
     CHECK_NN(ANeuralNetworksModel_finish(nn_model_));
   }
   if (!nn_compiled_model_) {
@@ -511,6 +591,7 @@ TfLiteStatus NNAPIDelegate::Invoke(Interpreter* interpreter) {
     CHECK_NN(ANeuralNetworksExecution_setInput(
         execution, i, nullptr, tensor->data.raw, tensor->bytes));
   }
+
   // Tell nn api where to place final data.
   for (size_t i = 0; i < interpreter->outputs().size(); i++) {
     int output = interpreter->outputs()[i];
@@ -518,6 +599,24 @@ TfLiteStatus NNAPIDelegate::Invoke(Interpreter* interpreter) {
     CHECK_NN(ANeuralNetworksExecution_setOutput(
         execution, i, nullptr, tensor->data.raw, tensor->bytes));
   }
+
+  // The state_out of previous invocation need to be mapped to state_in of
+  // current invocation.
+  for (size_t i = 0; i < model_states_outputs_.size(); i++) {
+    int state_tensor_idx = model_states_outputs_[i];
+    TfLiteTensor* tensor = interpreter->tensor(state_tensor_idx);
+    // Here we are using a deep copy for state_in tensors so that we are not
+    // reading and writing into the same buffer during a invocation.
+    // TODO(miaowang): using double shared buffer to minimize the copies.
+    CHECK_NN(ANeuralNetworksExecution_setInput(
+        execution, i + interpreter->inputs().size(), nullptr, tensor->data.raw,
+        tensor->bytes));
+    // Tell NNAPI where to output the state_out.
+    CHECK_NN(ANeuralNetworksExecution_setOutput(
+        execution, i + interpreter->outputs().size(), nullptr, tensor->data.raw,
+        tensor->bytes));
+  }
+
   // Currently use blocking compute.
   ANeuralNetworksEvent* event = nullptr;
   CHECK_NN(ANeuralNetworksExecution_startCompute(execution, &event));
diff --git a/tensorflow/contrib/lite/nnapi_delegate.h b/tensorflow/contrib/lite/nnapi_delegate.h
index e98000929a1168c786f6c18f498f9d1d72311ada..94dea4f9b23f208fddbacd3c77d889ea753a8a1d 100644
--- a/tensorflow/contrib/lite/nnapi_delegate.h
+++ b/tensorflow/contrib/lite/nnapi_delegate.h
@@ -59,6 +59,14 @@ class NNAPIDelegate {
   ANeuralNetworksModel* nn_model_ = nullptr;
   // The NN API compilation handle
   ANeuralNetworksCompilation* nn_compiled_model_ = nullptr;
+
+  // List of state tensors for LSTM, RNN, SVDF.
+  // NN API does not allow ops to maintain states across multiple
+  // invocations. We need to manually create state input tensors from
+  // corresponding state output tensors of TFLite operations, and map them
+  // correctly.
+  std::vector<int> model_states_inputs_;
+  std::vector<int> model_states_outputs_;
 };
 
 }  // namespace tflite
diff --git a/tensorflow/contrib/lite/op_resolver.h b/tensorflow/contrib/lite/op_resolver.h
index 38a27069421586f28a5fbe4c7880a28f80548b98..9d7e3f20854a3596181ffa885cc17cfdbd16356e 100644
--- a/tensorflow/contrib/lite/op_resolver.h
+++ b/tensorflow/contrib/lite/op_resolver.h
@@ -18,6 +18,7 @@ limitations under the License.
 #include <unordered_map>
 #include "tensorflow/contrib/lite/context.h"
 #include "tensorflow/contrib/lite/schema/schema_generated.h"
+#include "tensorflow/contrib/lite/util.h"
 
 namespace tflite {
 
@@ -55,8 +56,7 @@ struct OperatorKeyHasher {
   size_t operator()(const T& x) const {
     size_t a = ValueHasher<typename T::first_type>()(x.first);
     size_t b = ValueHasher<typename T::second_type>()(x.second);
-    // Hash combinator used by TensorFlow core.
-    return a ^ (b + 0x9e3779b97f4a7800ULL + (a << 10) + (a >> 4));
+    return CombineHashes({a, b});
   }
 };
 }  // namespace op_resolver_hasher
diff --git a/tensorflow/contrib/lite/profiling/BUILD b/tensorflow/contrib/lite/profiling/BUILD
index 15999e5d4188db1e191936ae6d84faf8cce5ca6e..a162b87b8f98576ec7c3b2623d1d34f2baef6cce 100644
--- a/tensorflow/contrib/lite/profiling/BUILD
+++ b/tensorflow/contrib/lite/profiling/BUILD
@@ -2,9 +2,11 @@ package(default_visibility = ["//visibility:public"])
 
 licenses(["notice"])  # Apache 2.0
 
+load("//tensorflow/contrib/lite:build_def.bzl", "tflite_copts")
+
 common_copts = [
     "-Wall",
-]
+] + tflite_copts()
 
 cc_library(
     name = "profiler",
@@ -29,6 +31,43 @@ cc_library(
     name = "profile_buffer",
     hdrs = ["profile_buffer.h"],
     copts = common_copts,
+    deps = [":time"],
+)
+
+cc_library(
+    name = "time",
+    srcs = ["time.cc"],
+    hdrs = ["time.h"],
+    copts = common_copts,
+)
+
+cc_library(
+    name = "profile_summarizer",
+    srcs = ["profile_summarizer.cc"],
+    hdrs = ["profile_summarizer.h"],
+    copts = common_copts,
+    deps = [
+        ":profiler",
+        "//tensorflow/contrib/lite:framework",
+        "//tensorflow/contrib/lite/schema:schema_fbs",
+        "//tensorflow/core:stats_calculator_portable",
+    ],
+)
+
+cc_test(
+    name = "profile_summarizer_test",
+    srcs = ["profile_summarizer_test.cc"],
+    copts = common_copts,
+    deps = [
+        ":profile_summarizer",
+        "//tensorflow/contrib/lite:framework",
+        "//tensorflow/contrib/lite:schema_fbs_version",
+        "//tensorflow/contrib/lite/kernels:builtin_ops",
+        "//tensorflow/contrib/lite/kernels:kernel_util",
+        "//tensorflow/contrib/lite/kernels:test_util",
+        "//tensorflow/contrib/lite/testing:util",
+        "@com_google_googletest//:gtest",
+    ],
 )
 
 cc_test(
diff --git a/tensorflow/contrib/lite/profiling/profile_buffer.h b/tensorflow/contrib/lite/profiling/profile_buffer.h
index 299b2a9cad161ce05ba68f39cf612f9866a0b656..65d86dce47f397c7dad6cc2beb8ffa1f95b29d45 100644
--- a/tensorflow/contrib/lite/profiling/profile_buffer.h
+++ b/tensorflow/contrib/lite/profiling/profile_buffer.h
@@ -18,6 +18,8 @@ limitations under the License.
 #include <cstddef>
 #include <cstdint>
 
+#include "tensorflow/contrib/lite/profiling/time.h"
+
 namespace tflite {
 namespace profiling {
 
@@ -74,7 +76,7 @@ class ProfileBuffer {
     if (!enabled_) {
       return kInvalidEventHandle;
     }
-    uint64_t timestamp = NowMicros();
+    uint64_t timestamp = time::NowMicros();
     int index = current_index_ % event_buffer_.size();
     event_buffer_[index].tag = tag;
     event_buffer_[index].event_type = event_type;
@@ -103,7 +105,7 @@ class ProfileBuffer {
     }
 
     int event_index = event_handle % max_size;
-    event_buffer_[event_index].end_timestamp_us = NowMicros();
+    event_buffer_[event_index].end_timestamp_us = time::NowMicros();
   }
 
   // Returns the size of the buffer.
@@ -134,12 +136,6 @@ class ProfileBuffer {
   }
 
  private:
-  static uint64_t NowMicros() {
-    // TODO(shashishekhar): Refactor this to a separate file.
-    struct timeval tv;
-    gettimeofday(&tv, nullptr);
-    return static_cast<uint64_t>(tv.tv_sec) * 1000000 + tv.tv_usec;
-  }
   bool enabled_;
   uint32_t current_index_;
   std::vector<ProfileEvent> event_buffer_;
diff --git a/tensorflow/contrib/lite/profiling/profile_summarizer.cc b/tensorflow/contrib/lite/profiling/profile_summarizer.cc
new file mode 100644
index 0000000000000000000000000000000000000000..45388b500c7897c8b33b49eb6ab4e9f8c4fdb37c
--- /dev/null
+++ b/tensorflow/contrib/lite/profiling/profile_summarizer.cc
@@ -0,0 +1,148 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/contrib/lite/profiling/profile_summarizer.h"
+
+#include <sstream>
+
+#include "tensorflow/contrib/lite/schema/schema_generated.h"
+
+namespace tflite {
+namespace profiling {
+namespace {
+
+using Detail = tensorflow::StatsCalculator::Detail;
+
+struct OperatorDetails {
+  std::string name;
+  std::vector<std::string> inputs;
+  std::vector<std::string> outputs;
+};
+
+std::string GetTensorName(const tflite::Interpreter& interpreter,
+                          int tensor_index) {
+  const auto tensor = interpreter.tensor(tensor_index);
+  if (tensor == nullptr || tensor->name == nullptr) {
+    return "Unknown";
+  }
+  return tensor->name;
+}
+std::vector<std::string> GetTensorNames(const tflite::Interpreter& interpreter,
+                                        const TfLiteIntArray* tensor_indices) {
+  std::vector<std::string> tensors;
+  tensors.reserve(tensor_indices->size);
+  for (int i = 0; i < tensor_indices->size; i++) {
+    tensors.push_back(GetTensorName(interpreter, tensor_indices->data[i]));
+  }
+  return tensors;
+}
+
+std::string ToString(const std::vector<std::string>& str_vector) {
+  std::stringstream stream;
+  stream << "[";
+  bool first = true;
+  for (const auto& s : str_vector) {
+    if (!first) {
+      stream << ", ";
+    } else {
+      first = false;
+    }
+    stream << s;
+  }
+  stream << "]";
+  return stream.str();
+}
+
+OperatorDetails GetOperatorDetails(const tflite::Interpreter& interpreter,
+                                   int node_index) {
+  auto node_reg = interpreter.node_and_registration(node_index);
+  auto inputs = node_reg->first.inputs;
+  auto outputs = node_reg->first.outputs;
+  int code = node_reg->second.builtin_code;
+  const char* op_name = nullptr;
+  if (code == tflite::BuiltinOperator_CUSTOM) {
+    const char* custom_name = node_reg->second.custom_name;
+    op_name = custom_name ? custom_name : "UnknownCustomOp";
+  } else {
+    op_name = tflite::EnumNamesBuiltinOperator()[code];
+  }
+  OperatorDetails details;
+  details.name = op_name;
+  details.inputs = GetTensorNames(interpreter, inputs);
+  details.outputs = GetTensorNames(interpreter, outputs);
+  return details;
+}
+
+tensorflow::StatSummarizerOptions GetProfileSummarizerOptions() {
+  auto options = tensorflow::StatSummarizerOptions();
+  options.show_summary = true;
+  options.show_memory = false;
+  return options;
+}
+
+}  // namespace
+
+ProfileSummarizer::ProfileSummarizer()
+    : stats_calculator_(
+          new ::tensorflow::StatsCalculator(GetProfileSummarizerOptions())) {}
+
+void ProfileSummarizer::ProcessProfiles(
+    const std::vector<const ProfileEvent*>& profile_stats,
+    const tflite::Interpreter& interpreter) {
+  std::vector<const ProfileEvent*> events;
+  std::copy_if(profile_stats.begin(), profile_stats.end(),
+               std::back_inserter(events), [](const ProfileEvent* e) {
+                 return e->event_type ==
+                            ProfileEvent::EventType::OPERATOR_INVOKE_EVENT &&
+                        e->end_timestamp_us >= e->begin_timestamp_us;
+               });
+  // Sort with begin_time.
+  std::sort(events.begin(), events.end(),
+            [](const ProfileEvent* const& a, const ProfileEvent* const& b) {
+              return a->begin_timestamp_us < b->begin_timestamp_us;
+            });
+  if (events.empty()) {
+    return;
+  }
+
+  int64_t base_start_us = events[0]->begin_timestamp_us;
+  int node_num = 0;
+  int64_t curr_total_us = 0;
+  std::map<std::string, Detail> details;
+  for (auto event : events) {
+    auto op_details = GetOperatorDetails(interpreter, event->event_metadata);
+    auto node_name = ToString(op_details.outputs);
+    auto result = details.emplace(node_name, Detail());
+    Detail* detail = &(result.first->second);
+    detail->start_us.UpdateStat(event->begin_timestamp_us - base_start_us);
+    int64_t node_exec_time =
+        event->end_timestamp_us - event->begin_timestamp_us;
+    detail->rel_end_us.UpdateStat(node_exec_time);
+    curr_total_us += node_exec_time;
+    ++node_num;
+
+    if (result.second) {
+      detail->name = node_name;
+      detail->type = op_details.name;
+      detail->run_order = node_num;
+      detail->times_called = 0;
+    }
+    ++detail->times_called;
+  }
+  stats_calculator_->UpdateDetails(details);
+  stats_calculator_->UpdateRunTotalUs(curr_total_us);
+}
+}  // namespace profiling
+}  // namespace tflite
diff --git a/tensorflow/contrib/lite/profiling/profile_summarizer.h b/tensorflow/contrib/lite/profiling/profile_summarizer.h
new file mode 100644
index 0000000000000000000000000000000000000000..a529ff87428d70d002241311d7f70f185521020f
--- /dev/null
+++ b/tensorflow/contrib/lite/profiling/profile_summarizer.h
@@ -0,0 +1,55 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CONTRIB_LITE_PROFILING_PROFILE_SUMMARIZER_H_
+#define TENSORFLOW_CONTRIB_LITE_PROFILING_PROFILE_SUMMARIZER_H_
+
+#include <vector>
+
+#include "tensorflow/contrib/lite/interpreter.h"
+#include "tensorflow/contrib/lite/profiling/profiler.h"
+#include "tensorflow/core/util/stats_calculator.h"
+
+namespace tflite {
+namespace profiling {
+
+// Creates a summary of operator invocations in the interpreter.
+class ProfileSummarizer {
+ public:
+  ProfileSummarizer();
+  virtual ~ProfileSummarizer() {}
+
+  // Process profile events to update statistics for operator invocations.
+  void ProcessProfiles(const std::vector<const ProfileEvent*>& profile_stats,
+                       const tflite::Interpreter& interpreter);
+
+  // Returns a string detailing the accumulated runtime stats in a tab-separated
+  // format which can be pasted into a spreadsheet for further analysis.
+  std::string GetOutputString() const {
+    return stats_calculator_->GetOutputString();
+  }
+
+  std::string GetShortSummary() const {
+    return stats_calculator_->GetShortSummary();
+  }
+
+ private:
+  std::unique_ptr<tensorflow::StatsCalculator> stats_calculator_;
+};
+
+}  // namespace profiling
+}  // namespace tflite
+
+#endif  // TENSORFLOW_CONTRIB_LITE_PROFILING_PROFILE_SUMMARIZER_H_
diff --git a/tensorflow/contrib/lite/profiling/profile_summarizer_test.cc b/tensorflow/contrib/lite/profiling/profile_summarizer_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..35cf780713b93db559f86dcaf62e1ac004b5049a
--- /dev/null
+++ b/tensorflow/contrib/lite/profiling/profile_summarizer_test.cc
@@ -0,0 +1,116 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include <string>
+#include <vector>
+
+#include <gmock/gmock.h>
+#include <gtest/gtest.h>
+#include "tensorflow/contrib/lite/context.h"
+#include "tensorflow/contrib/lite/kernels/kernel_util.h"
+#include "tensorflow/contrib/lite/kernels/test_util.h"
+#include "tensorflow/contrib/lite/model.h"
+#include "tensorflow/contrib/lite/profiling/profile_summarizer.h"
+#include "tensorflow/contrib/lite/testing/util.h"
+#include "tensorflow/contrib/lite/version.h"
+
+namespace tflite {
+namespace profiling {
+
+namespace {
+
+TfLiteStatus SimpleOpEval(TfLiteContext* context, TfLiteNode* node) {
+  const TfLiteTensor* input1 = tflite::GetInput(context, node, /*index=*/0);
+  const TfLiteTensor* input2 = tflite::GetInput(context, node, /*index=*/1);
+
+  TfLiteTensor* output = GetOutput(context, node, /*index=*/0);
+
+  int32_t* output_data = output->data.i32;
+  *output_data = *(input1->data.i32) + *(input2->data.i32);
+  return kTfLiteOk;
+}
+
+TfLiteRegistration* RegisterSimpleOp() {
+  static TfLiteRegistration registration = {nullptr,
+                                            nullptr,
+                                            nullptr,
+                                            SimpleOpEval,
+                                            tflite::BuiltinOperator_CUSTOM,
+                                            "SimpleOpEval",
+                                            1};
+  return &registration;
+}
+
+class SimpleOpModel : public SingleOpModel {
+ public:
+  void Init();
+  tflite::Interpreter* GetInterpreter() { return interpreter_.get(); }
+  void SetInputs(int32_t x, int32_t y) {
+    PopulateTensor(inputs_[0], {x});
+    PopulateTensor(inputs_[1], {y});
+  }
+  int32_t GetOutput() { return ExtractVector<int32_t>(output_)[0]; }
+
+ private:
+  int inputs_[2];
+  int output_;
+};
+
+void SimpleOpModel::Init() {
+  inputs_[0] = AddInput({TensorType_INT32, {1}});
+  inputs_[1] = AddInput({TensorType_INT32, {1}});
+  output_ = AddOutput({TensorType_INT32, {}});
+  SetCustomOp("SimpleAdd", {}, RegisterSimpleOp);
+  BuildInterpreter({GetShape(inputs_[0]), GetShape(inputs_[1])});
+}
+
+TEST(ProfileSummarizerTest, Empty) {
+  ProfileSummarizer summarizer;
+  std::string output = summarizer.GetOutputString();
+  EXPECT_GT(output.size(), 0);
+}
+
+#ifdef TFLITE_PROFILING_ENABLED
+TEST(ProfileSummarizerTest, Interpreter) {
+  Profiler profiler;
+  SimpleOpModel m;
+  m.Init();
+  auto interpreter = m.GetInterpreter();
+  interpreter->SetProfiler(&profiler);
+  profiler.StartProfiling();
+  m.SetInputs(1, 2);
+  m.Invoke();
+  // 3 = 1 + 2
+  EXPECT_EQ(m.GetOutput(), 3);
+  profiler.StopProfiling();
+  ProfileSummarizer summarizer;
+  auto events = profiler.GetProfileEvents();
+  EXPECT_EQ(1, events.size());
+  summarizer.ProcessProfiles(profiler.GetProfileEvents(), *interpreter);
+  auto output = summarizer.GetOutputString();
+  // TODO(shashishekhar): Add a better test here.
+  ASSERT_TRUE(output.find("SimpleOp") != std::string::npos) << output;
+}
+#endif
+
+}  // namespace
+}  // namespace profiling
+}  // namespace tflite
+
+int main(int argc, char** argv) {
+  ::tflite::LogToStderr();
+  ::testing::InitGoogleTest(&argc, argv);
+  return RUN_ALL_TESTS();
+}
diff --git a/tensorflow/contrib/lite/profiling/time.cc b/tensorflow/contrib/lite/profiling/time.cc
new file mode 100644
index 0000000000000000000000000000000000000000..446660bb747cd6e3b694669b64ac1d95cf415fbe
--- /dev/null
+++ b/tensorflow/contrib/lite/profiling/time.cc
@@ -0,0 +1,29 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include "tensorflow/contrib/lite/profiling/time.h"
+
+#include <sys/time.h>
+
+namespace tflite {
+namespace profiling {
+namespace time {
+uint64_t NowMicros() {
+  struct timeval tv;
+  gettimeofday(&tv, nullptr);
+  return static_cast<uint64_t>(tv.tv_sec) * 1000000 + tv.tv_usec;
+}
+}  // namespace time
+}  // namespace profiling
+}  // namespace tflite
diff --git a/tensorflow/contrib/lite/profiling/time.h b/tensorflow/contrib/lite/profiling/time.h
new file mode 100644
index 0000000000000000000000000000000000000000..cc2ec319b8a95b3efa0aab0ac9f97a88bf7b5536
--- /dev/null
+++ b/tensorflow/contrib/lite/profiling/time.h
@@ -0,0 +1,27 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#ifndef TENSORFLOW_CONTRIB_LITE_PROFILING_TIME_H_
+#define TENSORFLOW_CONTRIB_LITE_PROFILING_TIME_H_
+
+#include <cstdint>
+
+namespace tflite {
+namespace profiling {
+namespace time {
+uint64_t NowMicros();
+}  // namespace time
+}  // namespace profiling
+}  // namespace tflite
+#endif  // TENSORFLOW_CONTRIB_LITE_PROFILING_TIME_H_
diff --git a/tensorflow/contrib/lite/python/BUILD b/tensorflow/contrib/lite/python/BUILD
index 4920e83970d1cb7f60a38f95ea05986d52b0bbe7..27909a9458f6b09f96cb556a5254f01e54f46e05 100644
--- a/tensorflow/contrib/lite/python/BUILD
+++ b/tensorflow/contrib/lite/python/BUILD
@@ -36,6 +36,16 @@ py_test(
     ],
 )
 
+py_binary(
+    name = "tflite_convert",
+    srcs = ["tflite_convert.py"],
+    srcs_version = "PY2AND3",
+    visibility = ["//visibility:public"],
+    deps = [
+        ":lite",
+    ],
+)
+
 py_library(
     name = "lite",
     srcs = ["lite.py"],
@@ -45,7 +55,23 @@ py_library(
         ":convert",
         ":convert_saved_model",
         ":interpreter",
+        ":lite_constants",
         ":op_hint",
+        "//tensorflow/python:graph_util",
+        "//tensorflow/python/saved_model:constants",
+        "//tensorflow/python/saved_model:loader",
+        "//tensorflow/python/tools:freeze_graph_lib",
+    ],
+)
+
+py_test(
+    name = "lite_test",
+    srcs = ["lite_test.py"],
+    data = [":interpreter_test_data"],
+    srcs_version = "PY2AND3",
+    tags = ["no_windows"],
+    deps = [
+        ":lite",
     ],
 )
 
@@ -111,9 +137,9 @@ py_library(
     visibility = ["//visibility:public"],
     deps = [
         ":convert",
-        ":lite_constants",
         "//tensorflow/contrib/saved_model:saved_model_py",
         "//tensorflow/python:graph_util",
+        "//tensorflow/python:platform",
         "//tensorflow/python/tools:freeze_graph_lib",
     ],
 )
@@ -150,20 +176,3 @@ py_test(
         "//tensorflow/python/saved_model",
     ],
 )
-
-py_binary(
-    name = "convert_saved_model_to_frozen_graph",
-    srcs = ["convert_saved_model_to_frozen_graph.py"],
-    srcs_version = "PY2AND3",
-    deps = [
-        ":convert_saved_model",
-    ],
-)
-
-# Transitive dependencies of this target will be included in the pip package.
-py_library(
-    name = "tf_lite_py_pip",
-    deps = [
-        ":convert_saved_model",
-    ],
-)
diff --git a/tensorflow/contrib/lite/python/convert.py b/tensorflow/contrib/lite/python/convert.py
index c0926d2f33c0bbc5111e6df90dbd759172021f95..c038c88945b71f30bf091a1098dcf853f5415b1b 100644
--- a/tensorflow/contrib/lite/python/convert.py
+++ b/tensorflow/contrib/lite/python/convert.py
@@ -111,38 +111,75 @@ def tensor_name(x):
   return x.name.split(":")[0]
 
 
-def toco_convert(input_data,
-                 input_tensors,
-                 output_tensors,
-                 inference_type=lite_constants.FLOAT,
-                 input_format=lite_constants.TENSORFLOW_GRAPHDEF,
-                 output_format=lite_constants.TFLITE,
-                 quantized_input_stats=None,
-                 drop_control_dependency=True,
-                 allow_custom_ops=False):
-  """Convert a model using TOCO from `input_format` to `output_format`.
+def build_toco_convert_protos(input_tensors,
+                              output_tensors,
+                              inference_type=lite_constants.FLOAT,
+                              inference_input_type=None,
+                              input_format=lite_constants.TENSORFLOW_GRAPHDEF,
+                              output_format=lite_constants.TFLITE,
+                              quantized_input_stats=None,
+                              default_ranges_stats=None,
+                              drop_control_dependency=True,
+                              reorder_across_fake_quant=False,
+                              allow_custom_ops=False,
+                              change_concat_input_ranges=False,
+                              quantize_weights=False,
+                              dump_graphviz_dir=None,
+                              dump_graphviz_video=False):
+  """Builds protocol buffers describing a conversion of a model using TOCO.
 
   Typically this is to convert from TensorFlow GraphDef to TFLite, in which
   case the default `input_format` and `output_format` are sufficient.
 
   Args:
-    input_data: Input data (i.e. often `sess.graph_def`).
     input_tensors: List of input tensors. Type and shape are computed using
       `foo.get_shape()` and `foo.dtype`.
     output_tensors: List of output tensors (only .name is used from this).
-    inference_type: Currently must be `{FLOAT, QUANTIZED_UINT8}`.
-    input_format: Type of data to read (currently must be TENSORFLOW_GRAPHDEF).
-    output_format: Type of data to write (currently must be TFLITE or
-      GRAPHVIZ_DOT)
-    quantized_input_stats: For each member of input_tensors the mean and
-      std deviation of training data. Only needed if `inference_type` is
-      `QUANTIZED_UINT8`.
-    drop_control_dependency: Drops control dependencies silently. This is due
-      to tf lite not supporting control dependencies.
+    inference_type: Target data type of arrays in the output file. Currently
+      must be `{FLOAT, QUANTIZED_UINT8}`.  (default FLOAT)
+    inference_input_type: Target data type of input arrays. Allows for a
+      different type for input arrays in the case of quantization. Currently
+      must be `{FLOAT, QUANTIZED_UINT8}`. (default `inference_type`)
+    input_format: Type of data to read Currently must be
+      `{TENSORFLOW_GRAPHDEF}`. (default TENSORFLOW_GRAPHDEF)
+    output_format: Output file format. Currently must be `{TFLITE,
+      GRAPHVIZ_DOT}`. (default TFLITE)
+    quantized_input_stats: List of tuples of integers representing the mean and
+      standard deviation. Each tuple maps to the corresponding input tensor.
+      Only need if `inference_type` is `QUANTIZED_UINT8`. (default None)
+    default_ranges_stats: Tuple of integers representing (min, max) range values
+      for all arrays without a specified range. Intended for experimenting with
+      quantization via "dummy quantization". (default None)
+    drop_control_dependency: Boolean indicating whether to drop control
+      dependencies silently. This is due to TFLite not supporting control
+      dependencies. (default True)
+    reorder_across_fake_quant: Boolean indicating whether to reorder FakeQuant
+      nodes in unexpected locations. Used when the location of the FakeQuant
+      nodes is preventing graph transformations necessary to convert the graph.
+      Results in a graph that differs from the quantized training graph,
+      potentially causing differing arithmetic behavior. (default False)
+    allow_custom_ops: Boolean indicating whether to allow custom operations.
+      When false any unknown operation is an error. When true, custom ops are
+      created for any op that is unknown. The developer will need to provide
+      these to the TensorFlow Lite runtime with a custom resolver.
+      (default False)
+    change_concat_input_ranges: Boolean to change behavior of min/max ranges for
+      inputs and outputs of the concat operator for quantized models. Changes
+      the ranges of concat operator overlap when true. (default False)
+    quantize_weights: Boolean indicating whether to store weights as quantized
+      weights followed by dequantize operations. Computation is still done in
+      float, but reduces model size (at the cost of accuracy and latency).
+      (default False)
+    dump_graphviz_dir: Full filepath of folder to dump the graphs at various
+      stages of processing GraphViz .dot files. Preferred over
+      --output_format=GRAPHVIZ_DOT in order to keep the requirements of the
+      output file. (default None)
+    dump_graphviz_video: Boolean indicating whether to dump the graph after
+      every graph transformation. (default False)
 
   Returns:
-    The converted data. For example if tflite was the destination, then
-    this will be a tflite flatbuffer in a bytes array.
+    model_flags, toco_flags: two protocol buffers describing the conversion
+    process.
 
   Raises:
     ValueError: If the input tensor type is unknown
@@ -152,10 +189,21 @@ def toco_convert(input_data,
   toco = _toco_flags_pb2.TocoFlags()
   toco.input_format = input_format
   toco.output_format = output_format
-  toco.drop_control_dependency = drop_control_dependency
-  model = _model_flags_pb2.ModelFlags()
   toco.inference_type = inference_type
+  if inference_input_type:
+    toco.inference_input_type = inference_input_type
+  toco.drop_control_dependency = drop_control_dependency
+  toco.reorder_across_fake_quant = reorder_across_fake_quant
   toco.allow_custom_ops = allow_custom_ops
+  toco.quantize_weights = quantize_weights
+  if default_ranges_stats:
+    toco.default_ranges_min = default_ranges_stats[0]
+    toco.default_ranges_max = default_ranges_stats[1]
+  if dump_graphviz_dir:
+    toco.dump_graphviz_dir = dump_graphviz_dir
+  toco.dump_graphviz_include_video = dump_graphviz_video
+  model = _model_flags_pb2.ModelFlags()
+  model.change_concat_input_ranges = change_concat_input_ranges
   for idx, input_tensor in enumerate(input_tensors):
     if input_tensor.dtype == _dtypes.float32:
       tflite_input_type = lite_constants.FLOAT
@@ -163,6 +211,8 @@ def toco_convert(input_data,
       tflite_input_type = lite_constants.INT32
     elif input_tensor.dtype == _dtypes.int64:
       tflite_input_type = lite_constants.INT64
+    elif input_tensor.dtype == _dtypes.uint8:
+      tflite_input_type = lite_constants.QUANTIZED_UINT8
     # TODO(aselle): Insert strings when they are available
     else:
       raise ValueError("Tensors %s not known type %r" % (input_tensor.name,
@@ -180,10 +230,35 @@ def toco_convert(input_data,
 
   for output_tensor in output_tensors:
     model.output_arrays.append(tensor_name(output_tensor))
+  return model, toco
+
+
+def toco_convert(input_data, input_tensors, output_tensors, *args, **kwargs):
+  """"Convert a model using TOCO.
+
+  Typically this function is used to convert from TensorFlow GraphDef to TFLite.
+  Conversion can be customized by providing arguments that are forwarded to
+  `build_toco_convert_protos` (see documentation for details).
+
+  Args:
+    input_data: Input data (i.e. often `sess.graph_def`),
+    input_tensors: List of input tensors. Type and shape are computed using
+      `foo.get_shape()` and `foo.dtype`.
+    output_tensors: List of output tensors (only .name is used from this).
+    *args: See `build_toco_convert_protos`,
+    **kwargs: See `build_toco_convert_protos`.
 
-  # TODO(aselle): Consider handling the case of allowing quantized
-  # inputs to be converted to float (via the toco.inference_input_type field).
-  data = toco_convert_protos(model.SerializeToString(),
-                             toco.SerializeToString(),
+  Returns:
+    The converted data. For example if TFLite was the destination, then
+    this will be a tflite flatbuffer in a bytes array.
+
+  Raises:
+    Defined in `build_toco_convert_protos`.
+  """
+  model_flags, toco_flags = build_toco_convert_protos(input_tensors,
+                                                      output_tensors,
+                                                      *args, **kwargs)
+  data = toco_convert_protos(model_flags.SerializeToString(),
+                             toco_flags.SerializeToString(),
                              input_data.SerializeToString())
   return data
diff --git a/tensorflow/contrib/lite/python/convert_saved_model.py b/tensorflow/contrib/lite/python/convert_saved_model.py
index a7eddf3408f54dff5fa49ff6fa7b61cd0b8a22e4..1553464b9fe30f596c151bcc67efe891bb913ba3 100644
--- a/tensorflow/contrib/lite/python/convert_saved_model.py
+++ b/tensorflow/contrib/lite/python/convert_saved_model.py
@@ -18,34 +18,14 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-from tensorflow.contrib.lite.python import convert
-from tensorflow.contrib.lite.python import lite_constants
-from tensorflow.contrib.lite.toco import model_flags_pb2
-from tensorflow.contrib.saved_model.python.saved_model import reader
-from tensorflow.contrib.saved_model.python.saved_model import signature_def_utils
+from tensorflow.contrib.lite.python.convert import tensor_name
 from tensorflow.core.framework import types_pb2
 from tensorflow.python.client import session
 from tensorflow.python.framework import graph_util as tf_graph_util
 from tensorflow.python.framework import ops
-from tensorflow.python.platform import gfile
 from tensorflow.python.platform import tf_logging as logging
+from tensorflow.python.saved_model import constants
 from tensorflow.python.saved_model import loader
-from tensorflow.python.saved_model import signature_constants
-from tensorflow.python.saved_model import tag_constants
-
-
-def _write_and_flush_file(file_path, data_str):
-  """Writes data to file path.
-
-  Args:
-    file_path: Full path of the file to store data in.
-    data_str: Data represented as a string.
-
-  Returns: None.
-  """
-  with gfile.Open(file_path, "wb") as data_file:
-    data_file.write(data_str)
-    data_file.flush()
 
 
 def _log_tensor_details(tensor_info):
@@ -77,21 +57,8 @@ def _get_meta_graph_def(saved_model_dir, tag_set):
   Raises:
     ValueError: No valid MetaGraphDef for given tag_set.
   """
-  saved_model = reader.read_saved_model(saved_model_dir)
-  tag_sets = []
-  result_meta_graph_def = None
-  for meta_graph_def in saved_model.meta_graphs:
-    meta_graph_tag_set = set(meta_graph_def.meta_info_def.tags)
-    tag_sets.append(meta_graph_tag_set)
-    if meta_graph_tag_set == tag_set:
-      result_meta_graph_def = meta_graph_def
-  logging.info("The given saved_model contains the following tags: %s",
-               tag_sets)
-  if result_meta_graph_def is not None:
-    return result_meta_graph_def
-  else:
-    raise ValueError("No valid MetaGraphDef for this tag_set '{}'. Possible "
-                     "values are '{}'. ".format(tag_set, tag_sets))
+  with session.Session(graph=ops.Graph()) as sess:
+    return loader.load(sess, tag_set, saved_model_dir)
 
 
 def _get_signature_def(meta_graph, signature_key):
@@ -110,15 +77,13 @@ def _get_signature_def(meta_graph, signature_key):
   signature_def_map = meta_graph.signature_def
   signature_def_keys = set(signature_def_map.keys())
   logging.info(
-      "The given saved_model MetaGraphDef contains SignatureDefs with the "
+      "The given SavedModel MetaGraphDef contains SignatureDefs with the "
       "following keys: %s", signature_def_keys)
   if signature_key not in signature_def_keys:
-    raise ValueError("No '{}' in the saved_model\'s SignatureDefs. Possible "
-                     "values are '{}'. ".format(signature_key,
-                                                signature_def_keys))
-  signature_def = signature_def_utils.get_signature_def_by_key(
-      meta_graph, signature_key)
-  return signature_def
+    raise ValueError("No '{}' in the SavedModel\'s SignatureDefs. Possible "
+                     "values are '{}'.".format(signature_key,
+                                               ",".join(signature_def_keys)))
+  return signature_def_map[signature_key]
 
 
 def _get_inputs_outputs(signature_def):
@@ -170,29 +135,10 @@ def _get_tensors(graph, signature_def_tensor_names=None,
   """
   tensors = []
   if user_tensor_names:
-    # Get the list of all of the tensors with and without the tensor index.
-    all_tensor_names = [
-        tensor.name for op in graph.get_operations() for tensor in op.outputs
-    ]
-    all_tensor_names_only = [name.split(":")[0] for name in all_tensor_names]
-
     # Sort the tensor names.
     user_tensor_names = sorted(user_tensor_names)
 
-    # Get the tensors associated with the tensor names.
-    tensors = []
-    invalid_tensors = []
-    for name in user_tensor_names:
-      if name not in all_tensor_names_only:
-        invalid_tensors.append(name)
-      else:
-        idx = all_tensor_names_only.index(name)
-        tensors.append(graph.get_tensor_by_name(all_tensor_names[idx]))
-
-    # Throw ValueError if any user input names are not valid tensors.
-    if invalid_tensors:
-      raise ValueError("Invalid tensors '{}' were found.".format(
-          ",".join(invalid_tensors)))
+    tensors = get_tensors_from_tensor_names(graph, user_tensor_names)
   elif signature_def_tensor_names:
     tensors = [
         graph.get_tensor_by_name(name)
@@ -207,25 +153,74 @@ def _get_tensors(graph, signature_def_tensor_names=None,
   return tensors
 
 
-def _freeze_saved_model(saved_model_dir, input_arrays, input_shapes,
-                        output_arrays, tag_set, signature_key, batch_size):
+def get_tensors_from_tensor_names(graph, tensor_names):
+  """Gets the Tensors associated with the `tensor_names` in the provided graph.
+
+  Args:
+    graph: TensorFlow Graph.
+    tensor_names: List of strings that represent names of tensors in the graph.
+
+  Returns:
+    A list of Tensor objects in the same order the names are provided.
+
+  Raises:
+    ValueError:
+      tensor_names contains an invalid tensor name.
+  """
+  # Get the list of all of the tensors.
+  tensor_name_to_tensor = {
+      tensor_name(tensor): tensor for op in graph.get_operations()
+      for tensor in op.values()
+  }
+
+  # Get the tensors associated with tensor_names.
+  tensors = []
+  invalid_tensors = []
+  for name in tensor_names:
+    tensor = tensor_name_to_tensor.get(name)
+    if tensor is None:
+      invalid_tensors.append(name)
+    else:
+      tensors.append(tensor)
+
+  # Throw ValueError if any user input names are not valid tensors.
+  if invalid_tensors:
+    raise ValueError("Invalid tensors '{}' were found.".format(
+        ",".join(invalid_tensors)))
+  return tensors
+
+
+def set_tensor_shapes(tensors, shapes):
+  """Sets Tensor shape for each tensor if the shape is defined.
+
+  Args:
+    tensors: TensorFlow ops.Tensor.
+    shapes: Dict of strings representing input tensor names to list of
+      integers representing input shapes (e.g., {"foo": : [1, 16, 16, 3]}).
+  """
+  if shapes:
+    for tensor in tensors:
+      shape = shapes.get(tensor_name(tensor))
+      if shape is not None:
+        tensor.set_shape(shape)
+
+
+def freeze_saved_model(saved_model_dir, input_arrays, input_shapes,
+                       output_arrays, tag_set, signature_key):
   """Converts a SavedModel to a frozen graph.
 
   Args:
     saved_model_dir: SavedModel directory to convert.
     input_arrays: List of input tensors to freeze graph with. Uses input arrays
-      from SignatureDef when none are provided. (default None)
-    input_shapes: Map of strings representing input tensor names to list of
+      from SignatureDef when none are provided.
+    input_shapes: Dict of strings representing input tensor names to list of
       integers representing input shapes (e.g., {"foo": : [1, 16, 16, 3]}).
       Automatically determined when input shapes is None (e.g., {"foo" : None}).
-      (default None)
     output_arrays: List of output tensors to freeze graph with. Uses output
-      arrays from SignatureDef when none are provided. (default None)
+      arrays from SignatureDef when none are provided.
     tag_set: Set of tags identifying the MetaGraphDef within the SavedModel to
-      analyze. All tags in the tag set must be present. (default "serve")
+      analyze. All tags in the tag set must be present.
     signature_key: Key identifying SignatureDef containing inputs and outputs.
-    batch_size: Batch size for the model. Replaces the first dimension of an
-      input size array if undefined. (default 1)
 
   Returns:
     frozen_graph_def: Frozen GraphDef.
@@ -236,186 +231,32 @@ def _freeze_saved_model(saved_model_dir, input_arrays, input_shapes,
     ValueError:
       SavedModel doesn't contain a MetaGraphDef identified by tag_set.
       signature_key is not in the MetaGraphDef.
+      assets/ directory is in the MetaGraphDef.
       input_shapes does not match the length of input_arrays.
-      input_shapes has a None value after the 1st dimension.
       input_arrays or output_arrays are not valid.
-      Unable to load Session.
   """
-  # Set default values for inputs if they are set to None.
-  if signature_key is None:
-    signature_key = signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY
-  if tag_set is None:
-    tag_set = set([tag_constants.SERVING])
-  if batch_size is None:
-    batch_size = 1
-
   # Read SignatureDef.
   meta_graph = _get_meta_graph_def(saved_model_dir, tag_set)
   signature_def = _get_signature_def(meta_graph, signature_key)
   inputs, outputs = _get_inputs_outputs(signature_def)
 
+  # Check SavedModel for assets directory.
+  collection_def = meta_graph.collection_def
+  if constants.ASSETS_KEY in collection_def:
+    raise ValueError("SavedModels with assets/ directory are not supported.")
+
   graph = ops.Graph()
   with session.Session(graph=graph) as sess:
-    # TODO(nupurgarg): Throw ValueError if SavedModel has assets/ directory.
     loader.load(sess, meta_graph.meta_info_def.tags, saved_model_dir)
 
     # Gets input and output tensors.
     # TODO(zhixianyan): Use TFLite supported Op list to filter outputs.
     in_tensors = _get_tensors(graph, inputs, input_arrays)
     out_tensors = _get_tensors(graph, outputs, output_arrays)
-
-    # Gets fully defined tensor shape. An input tensor with None in the first
-    # dimension, e.g. (None, 224, 224, 3), is replaced with the batch_size.
-    # Shapes with None after the first dimension result in a ValueError.
-    # TODO(zhixianyan): Add supports for input tensor with more None in shape.
-    for tensor in in_tensors:
-      if (input_shapes and tensor.name in input_shapes and
-          input_shapes[tensor.name] is not None):
-        shape = input_shapes[tensor.name]
-      else:
-        shape = tensor.get_shape().as_list()
-
-      if None in shape[1:]:
-        raise ValueError(
-            "None is only supported in the 1st dimension. Tensor '{0}' has "
-            "invalid shape '{1}'.".format(tensor.name, shape))
-      elif shape[0] is None:
-        shape[0] = batch_size
-      tensor.set_shape(shape)
+    set_tensor_shapes(in_tensors, input_shapes)
 
     output_names = [node.split(":")[0] for node in outputs]
     frozen_graph_def = tf_graph_util.convert_variables_to_constants(
         sess, graph.as_graph_def(), output_names)
 
     return frozen_graph_def, in_tensors, out_tensors
-  raise ValueError("Unable to load Session.")
-
-
-def saved_model_to_frozen_graphdef(
-    saved_model_dir,
-    output_file_model,
-    output_file_flags,
-    input_arrays=None,
-    input_shapes=None,
-    output_arrays=None,
-    tag_set=None,
-    signature_key=signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY,
-    batch_size=1):
-  """Converts a SavedModel to a frozen graph. Writes graph to tmp directory.
-
-  Stores frozen graph and command line flags in the tmp directory.
-
-  Args:
-    saved_model_dir: SavedModel directory to convert.
-    output_file_model: Full file path to save frozen graph.
-    output_file_flags: Full file path to save ModelFlags.
-    input_arrays: List of input tensors to freeze graph with. Uses input arrays
-      from SignatureDef when none are provided. (default None)
-    input_shapes: Map of strings representing input tensor names to list of
-      integers representing input shapes (e.g., {"foo": : [1, 16, 16, 3]}).
-      Automatically determined when input shapes is None (e.g., {"foo" : None}).
-      (default None)
-    output_arrays: List of output tensors to freeze graph with. Uses output
-      arrays from SignatureDef when none are provided. (default None)
-    tag_set: Set of tags identifying the MetaGraphDef within the SavedModel to
-      analyze. All tags in the tag set must be present. (default "serve")
-    signature_key: Key identifying SignatureDef containing inputs and outputs.
-    batch_size: Batch size for the model. Replaces the first dimension of an
-      input size array if undefined. (default 1)
-
-  Returns: None.
-
-  Raises:
-    ValueError: Unable to convert to frozen graph.
-  """
-  frozen_graph_def, in_tensors, out_tensors = _freeze_saved_model(
-      saved_model_dir, input_arrays, input_shapes, output_arrays, tag_set,
-      signature_key, batch_size)
-
-  # Initialize model flags.
-  model = model_flags_pb2.ModelFlags()
-
-  for input_tensor in in_tensors:
-    input_array = model.input_arrays.add()
-    input_array.name = convert.tensor_name(input_tensor)
-    input_array.shape.dims.extend(map(int, input_tensor.get_shape()))
-
-  for output_tensor in out_tensors:
-    model.output_arrays.append(convert.tensor_name(output_tensor))
-
-  # Write model and ModelFlags to file. ModelFlags contain input array and
-  # output array information that is parsed from the SignatureDef and used for
-  # analysis by TOCO.
-  _write_and_flush_file(output_file_model, frozen_graph_def.SerializeToString())
-  _write_and_flush_file(output_file_flags, model.SerializeToString())
-
-
-def tflite_from_saved_model(
-    saved_model_dir,
-    output_file=None,
-    input_arrays=None,
-    input_shapes=None,
-    output_arrays=None,
-    tag_set=None,
-    signature_key=signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY,
-    batch_size=1,
-    inference_type=lite_constants.FLOAT,
-    input_format=lite_constants.TENSORFLOW_GRAPHDEF,
-    output_format=lite_constants.TFLITE,
-    quantized_input_stats=None,
-    drop_control_dependency=True):
-  """Converts a SavedModel to TFLite FlatBuffer.
-
-  Args:
-    saved_model_dir: SavedModel directory to convert.
-    output_file: File path to write result TFLite FlatBuffer.
-    input_arrays: List of input tensors to freeze graph with. Uses input arrays
-      from SignatureDef when none are provided. (default None)
-    input_shapes: Map of strings representing input tensor names to list of
-      integers representing input shapes (e.g., {"foo": : [1, 16, 16, 3]}).
-      Automatically determined when input shapes is None (e.g., {"foo" : None}).
-      (default None)
-    output_arrays: List of output tensors to freeze graph with. Uses output
-      arrays from SignatureDef when none are provided. (default None)
-    tag_set: Set of tags identifying the MetaGraphDef within the SavedModel to
-      analyze. All tags in the tag set must be present. (default "serve")
-    signature_key: Key identifying SignatureDef containing inputs and outputs.
-    batch_size: Batch size for the model. Replaces the first dimension of an
-      input size array if undefined. (default 1)
-    inference_type: Currently must be `{FLOAT, QUANTIZED_UINT8}`.
-    input_format: Type of data to read (currently must be TENSORFLOW_GRAPHDEF).
-    output_format: Type of data to write (currently must be TFLITE or
-      GRAPHVIZ_DOT)
-    quantized_input_stats: For each member of input_tensors the mean and
-      std deviation of training data. Only needed if `inference_type` is
-      `QUANTIZED_UINT8`.
-    drop_control_dependency: Drops control dependencies silently. This is due
-      to tf lite not supporting control dependencies.
-
-  Returns:
-    The converted data. For example if tflite was the destination, then
-    this will be a tflite flatbuffer in a bytes array.
-
-  Raises:
-    ValueError: Unable to convert to frozen graph.
-  """
-  frozen_graph_def, in_tensors, out_tensors = _freeze_saved_model(
-      saved_model_dir, input_arrays, input_shapes, output_arrays, tag_set,
-      signature_key, batch_size)
-
-  result = convert.toco_convert(
-      input_data=frozen_graph_def,
-      input_tensors=in_tensors,
-      output_tensors=out_tensors,
-      inference_type=inference_type,
-      input_format=input_format,
-      output_format=output_format,
-      quantized_input_stats=quantized_input_stats,
-      drop_control_dependency=drop_control_dependency)
-
-  if output_file is not None:
-    with gfile.Open(output_file, "wb") as f:
-      f.write(result)
-    logging.info("Successfully converted to: %s", output_file)
-
-  return result
diff --git a/tensorflow/contrib/lite/python/convert_saved_model_test.py b/tensorflow/contrib/lite/python/convert_saved_model_test.py
index db95fc8ad7f94b52d33c72f6ec5819bdfe8cf05f..92c4ebb2465c2abaa1cefd020e69b2f7ad6a54a5 100644
--- a/tensorflow/contrib/lite/python/convert_saved_model_test.py
+++ b/tensorflow/contrib/lite/python/convert_saved_model_test.py
@@ -25,12 +25,12 @@ from __future__ import print_function
 
 import os
 from tensorflow.contrib.lite.python import convert_saved_model
-from tensorflow.contrib.lite.toco import model_flags_pb2 as _model_flags_pb2
 from tensorflow.python import keras
 from tensorflow.python.client import session
 from tensorflow.python.estimator import estimator_lib as estimator
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
+from tensorflow.python.framework import tensor_shape
 from tensorflow.python.framework import test_util
 from tensorflow.python.layers import layers
 from tensorflow.python.ops import array_ops
@@ -38,13 +38,68 @@ from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import nn
 from tensorflow.python.ops import random_ops
 from tensorflow.python.ops.losses import losses
-from tensorflow.python.platform import gfile
 from tensorflow.python.platform import test
 from tensorflow.python.saved_model import saved_model
+from tensorflow.python.saved_model import signature_constants
+from tensorflow.python.saved_model import tag_constants
 from tensorflow.python.training import training as train
 
 
-class ConvertSavedModelTestBasicGraph(test_util.TensorFlowTestCase):
+class TensorFunctionsTest(test_util.TensorFlowTestCase):
+
+  def testGetTensorsValid(self):
+    in_tensor = array_ops.placeholder(
+        shape=[1, 16, 16, 3], dtype=dtypes.float32)
+    _ = in_tensor + in_tensor
+    sess = session.Session()
+
+    tensors = convert_saved_model.get_tensors_from_tensor_names(
+        sess.graph, ["Placeholder"])
+    self.assertEqual("Placeholder:0", tensors[0].name)
+
+  def testGetTensorsInvalid(self):
+    in_tensor = array_ops.placeholder(
+        shape=[1, 16, 16, 3], dtype=dtypes.float32)
+    _ = in_tensor + in_tensor
+    sess = session.Session()
+
+    with self.assertRaises(ValueError) as error:
+      convert_saved_model.get_tensors_from_tensor_names(sess.graph,
+                                                        ["invalid-input"])
+    self.assertEqual("Invalid tensors 'invalid-input' were found.",
+                     str(error.exception))
+
+  def testSetTensorShapeValid(self):
+    tensor = array_ops.placeholder(shape=[None, 3, 5], dtype=dtypes.float32)
+    self.assertEqual([None, 3, 5], tensor.shape.as_list())
+
+    convert_saved_model.set_tensor_shapes([tensor], {"Placeholder": [5, 3, 5]})
+    self.assertEqual([5, 3, 5], tensor.shape.as_list())
+
+  def testSetTensorShapeNoneValid(self):
+    tensor = array_ops.placeholder(dtype=dtypes.float32)
+    self.assertEqual(None, tensor.shape)
+
+    convert_saved_model.set_tensor_shapes([tensor], {"Placeholder": [1, 3, 5]})
+    self.assertEqual([1, 3, 5], tensor.shape.as_list())
+
+  def testSetTensorShapeInvalid(self):
+    tensor = array_ops.placeholder(shape=[None, 3, 5], dtype=dtypes.float32)
+    self.assertEqual([None, 3, 5], tensor.shape.as_list())
+
+    convert_saved_model.set_tensor_shapes([tensor],
+                                          {"invalid-input": [5, 3, 5]})
+    self.assertEqual([None, 3, 5], tensor.shape.as_list())
+
+  def testSetTensorShapeEmpty(self):
+    tensor = array_ops.placeholder(shape=[None, 3, 5], dtype=dtypes.float32)
+    self.assertEqual([None, 3, 5], tensor.shape.as_list())
+
+    convert_saved_model.set_tensor_shapes([tensor], {})
+    self.assertEqual([None, 3, 5], tensor.shape.as_list())
+
+
+class FreezeSavedModelTest(test_util.TensorFlowTestCase):
 
   def _createSimpleSavedModel(self, shape):
     """Create a simple SavedModel on the fly."""
@@ -57,82 +112,167 @@ class ConvertSavedModelTestBasicGraph(test_util.TensorFlowTestCase):
       saved_model.simple_save(sess, saved_model_dir, inputs, outputs)
     return saved_model_dir
 
+  def _createSavedModelTwoInputArrays(self, shape):
+    """Create a simple SavedModel."""
+    saved_model_dir = os.path.join(self.get_temp_dir(), "simple_savedmodel")
+    with session.Session() as sess:
+      in_tensor_1 = array_ops.placeholder(
+          shape=shape, dtype=dtypes.float32, name="inputB")
+      in_tensor_2 = array_ops.placeholder(
+          shape=shape, dtype=dtypes.float32, name="inputA")
+      out_tensor = in_tensor_1 + in_tensor_2
+      inputs = {"x": in_tensor_1, "y": in_tensor_2}
+      outputs = {"z": out_tensor}
+      saved_model.simple_save(sess, saved_model_dir, inputs, outputs)
+    return saved_model_dir
+
+  def _getArrayNames(self, tensors):
+    return [tensor.name for tensor in tensors]
+
+  def _getArrayShapes(self, tensors):
+    dims = []
+    for tensor in tensors:
+      dim_tensor = []
+      for dim in tensor.shape:
+        if isinstance(dim, tensor_shape.Dimension):
+          dim_tensor.append(dim.value)
+        else:
+          dim_tensor.append(dim)
+      dims.append(dim_tensor)
+    return dims
+
+  def _convertSavedModel(self,
+                         saved_model_dir,
+                         input_arrays=None,
+                         input_shapes=None,
+                         output_arrays=None,
+                         tag_set=None,
+                         signature_key=None):
+    if tag_set is None:
+      tag_set = set([tag_constants.SERVING])
+    if signature_key is None:
+      signature_key = signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY
+    graph_def, in_tensors, out_tensors = convert_saved_model.freeze_saved_model(
+        saved_model_dir=saved_model_dir,
+        input_arrays=input_arrays,
+        input_shapes=input_shapes,
+        output_arrays=output_arrays,
+        tag_set=tag_set,
+        signature_key=signature_key)
+    return graph_def, in_tensors, out_tensors
+
   def testSimpleSavedModel(self):
-    """Test a simple SavedModel created on the fly."""
-    # Create a simple SavedModel
+    """Test a SavedModel."""
     saved_model_dir = self._createSimpleSavedModel(shape=[1, 16, 16, 3])
-    # Convert to tflite
-    result = convert_saved_model.tflite_from_saved_model(
-        saved_model_dir=saved_model_dir)
-    self.assertTrue(result)
+    _, in_tensors, out_tensors = self._convertSavedModel(saved_model_dir)
+
+    self.assertEqual(self._getArrayNames(out_tensors), ["add:0"])
+    self.assertEqual(self._getArrayNames(in_tensors), ["Placeholder:0"])
+    self.assertEqual(self._getArrayShapes(in_tensors), [[1, 16, 16, 3]])
 
   def testSimpleSavedModelWithNoneBatchSizeInShape(self):
-    """Test a simple SavedModel, with None in input tensor's shape."""
+    """Test a SavedModel with None in input tensor's shape."""
     saved_model_dir = self._createSimpleSavedModel(shape=[None, 16, 16, 3])
-    result = convert_saved_model.tflite_from_saved_model(
-        saved_model_dir=saved_model_dir)
-    self.assertTrue(result)
+    _, in_tensors, out_tensors = self._convertSavedModel(saved_model_dir)
 
-  def testSimpleSavedModelWithMoreNoneInShape(self):
-    """Test a simple SavedModel, fail as more None in input shape."""
-    saved_model_dir = self._createSimpleSavedModel(shape=[None, 16, None, 3])
-    # Convert to tflite: this should raise ValueError, as 3rd dim is None.
-    with self.assertRaises(ValueError):
-      convert_saved_model.tflite_from_saved_model(
-          saved_model_dir=saved_model_dir)
+    self.assertEqual(self._getArrayNames(out_tensors), ["add:0"])
+    self.assertEqual(self._getArrayNames(in_tensors), ["Placeholder:0"])
+    self.assertEqual(self._getArrayShapes(in_tensors), [[None, 16, 16, 3]])
 
-  def testSimpleSavedModelWithWrongSignatureKey(self):
-    """Test a simple SavedModel, fail as given signature is invalid."""
+  def testSimpleSavedModelWithInvalidSignatureKey(self):
+    """Test a SavedModel that fails due to an invalid signature_key."""
     saved_model_dir = self._createSimpleSavedModel(shape=[1, 16, 16, 3])
-    # Convert to tflite: this should raise ValueError, as
-    # signature_key does not exit in the saved_model.
-    with self.assertRaises(ValueError):
-      convert_saved_model.tflite_from_saved_model(
-          saved_model_dir=saved_model_dir, signature_key="wrong-key")
-
-  def testSimpleSavedModelWithWrongOutputArray(self):
-    """Test a simple SavedModel, fail as given output_arrays is invalid."""
-    # Create a simple SavedModel
+    with self.assertRaises(ValueError) as error:
+      self._convertSavedModel(saved_model_dir, signature_key="invalid-key")
+    self.assertEqual(
+        "No 'invalid-key' in the SavedModel's SignatureDefs. "
+        "Possible values are 'serving_default'.", str(error.exception))
+
+  def testSimpleSavedModelWithInvalidOutputArray(self):
+    """Test a SavedModel that fails due to invalid output arrays."""
     saved_model_dir = self._createSimpleSavedModel(shape=[1, 16, 16, 3])
-    # Convert to tflite: this should raise ValueError, as
-    # output_arrays is not valid for the saved_model.
-    with self.assertRaises(ValueError):
-      convert_saved_model.tflite_from_saved_model(
-          saved_model_dir=saved_model_dir, output_arrays=["wrong-output"])
+    with self.assertRaises(ValueError) as error:
+      self._convertSavedModel(saved_model_dir, output_arrays=["invalid-output"])
+    self.assertEqual("Invalid tensors 'invalid-output' were found.",
+                     str(error.exception))
 
   def testSimpleSavedModelWithWrongInputArrays(self):
-    """Test a simple SavedModel, fail as given input_arrays is invalid."""
+    """Test a SavedModel that fails due to invalid input arrays."""
     saved_model_dir = self._createSimpleSavedModel(shape=[1, 16, 16, 3])
-    # Checks invalid input_arrays.
-    with self.assertRaises(ValueError):
-      convert_saved_model.tflite_from_saved_model(
-          saved_model_dir=saved_model_dir, input_arrays=["wrong-input"])
-    # Checks valid and invalid input_arrays.
-    with self.assertRaises(ValueError):
-      convert_saved_model.tflite_from_saved_model(
-          saved_model_dir=saved_model_dir,
-          input_arrays=["Placeholder", "wrong-input"])
+
+    # Check invalid input_arrays.
+    with self.assertRaises(ValueError) as error:
+      self._convertSavedModel(saved_model_dir, input_arrays=["invalid-input"])
+    self.assertEqual("Invalid tensors 'invalid-input' were found.",
+                     str(error.exception))
+
+    # Check valid and invalid input_arrays.
+    with self.assertRaises(ValueError) as error:
+      self._convertSavedModel(
+          saved_model_dir, input_arrays=["Placeholder", "invalid-input"])
+    self.assertEqual("Invalid tensors 'invalid-input' were found.",
+                     str(error.exception))
 
   def testSimpleSavedModelWithCorrectArrays(self):
-    """Test a simple SavedModel, with correct input_arrays and output_arrays."""
+    """Test a SavedModel with correct input_arrays and output_arrays."""
     saved_model_dir = self._createSimpleSavedModel(shape=[None, 16, 16, 3])
-    result = convert_saved_model.tflite_from_saved_model(
+    _, in_tensors, out_tensors = self._convertSavedModel(
         saved_model_dir=saved_model_dir,
         input_arrays=["Placeholder"],
         output_arrays=["add"])
-    self.assertTrue(result)
+
+    self.assertEqual(self._getArrayNames(out_tensors), ["add:0"])
+    self.assertEqual(self._getArrayNames(in_tensors), ["Placeholder:0"])
+    self.assertEqual(self._getArrayShapes(in_tensors), [[None, 16, 16, 3]])
 
   def testSimpleSavedModelWithCorrectInputArrays(self):
-    """Test a simple SavedModel, with correct input_arrays and input_shapes."""
+    """Test a SavedModel with correct input_arrays and input_shapes."""
     saved_model_dir = self._createSimpleSavedModel(shape=[1, 16, 16, 3])
-    result = convert_saved_model.tflite_from_saved_model(
+    _, in_tensors, out_tensors = self._convertSavedModel(
         saved_model_dir=saved_model_dir,
         input_arrays=["Placeholder"],
         input_shapes={"Placeholder": [1, 16, 16, 3]})
-    self.assertTrue(result)
+
+    self.assertEqual(self._getArrayNames(out_tensors), ["add:0"])
+    self.assertEqual(self._getArrayNames(in_tensors), ["Placeholder:0"])
+    self.assertEqual(self._getArrayShapes(in_tensors), [[1, 16, 16, 3]])
+
+  def testTwoInputArrays(self):
+    """Test a simple SavedModel."""
+    saved_model_dir = self._createSavedModelTwoInputArrays(shape=[1, 16, 16, 3])
+
+    _, in_tensors, out_tensors = self._convertSavedModel(
+        saved_model_dir=saved_model_dir, input_arrays=["inputB", "inputA"])
+
+    self.assertEqual(self._getArrayNames(out_tensors), ["add:0"])
+    self.assertEqual(self._getArrayNames(in_tensors), ["inputA:0", "inputB:0"])
+    self.assertEqual(
+        self._getArrayShapes(in_tensors), [[1, 16, 16, 3], [1, 16, 16, 3]])
+
+  def testSubsetInputArrays(self):
+    """Test a SavedModel with a subset of the input array names of the model."""
+    saved_model_dir = self._createSavedModelTwoInputArrays(shape=[1, 16, 16, 3])
+
+    # Check case where input shape is given.
+    _, in_tensors, out_tensors = self._convertSavedModel(
+        saved_model_dir=saved_model_dir,
+        input_arrays=["inputA"],
+        input_shapes={"inputA": [1, 16, 16, 3]})
+
+    self.assertEqual(self._getArrayNames(out_tensors), ["add:0"])
+    self.assertEqual(self._getArrayNames(in_tensors), ["inputA:0"])
+    self.assertEqual(self._getArrayShapes(in_tensors), [[1, 16, 16, 3]])
+
+    # Check case where input shape is None.
+    _, in_tensors, out_tensors = self._convertSavedModel(
+        saved_model_dir=saved_model_dir, input_arrays=["inputA"])
+
+    self.assertEqual(self._getArrayNames(out_tensors), ["add:0"])
+    self.assertEqual(self._getArrayNames(in_tensors), ["inputA:0"])
+    self.assertEqual(self._getArrayShapes(in_tensors), [[1, 16, 16, 3]])
 
   def testMultipleMetaGraphDef(self):
-    """Test saved model with multiple MetaGraphDef."""
+    """Test saved model with multiple MetaGraphDefs."""
     saved_model_dir = os.path.join(self.get_temp_dir(), "savedmodel_two_mgd")
     builder = saved_model.builder.SavedModelBuilder(saved_model_dir)
     with session.Session(graph=ops.Graph()) as sess:
@@ -161,91 +301,13 @@ class ConvertSavedModelTestBasicGraph(test_util.TensorFlowTestCase):
       builder.save(True)
 
     # Convert to tflite
-    convert_saved_model.tflite_from_saved_model(
+    _, in_tensors, out_tensors = self._convertSavedModel(
         saved_model_dir=saved_model_dir,
         tag_set=set([saved_model.tag_constants.SERVING, "additional_test_tag"]))
 
-
-class ConvertSavedModelTestBasicGraphToText(test_util.TensorFlowTestCase):
-
-  def _createSimpleSavedModel(self, shape):
-    """Create a simple SavedModel."""
-    saved_model_dir = os.path.join(self.get_temp_dir(), "simple_savedmodel")
-    with session.Session() as sess:
-      in_tensor_1 = array_ops.placeholder(
-          shape=shape, dtype=dtypes.float32, name="inputB")
-      in_tensor_2 = array_ops.placeholder(
-          shape=shape, dtype=dtypes.float32, name="inputA")
-      out_tensor = in_tensor_1 + in_tensor_2
-      inputs = {"x": in_tensor_1, "y": in_tensor_2}
-      outputs = {"z": out_tensor}
-      saved_model.simple_save(sess, saved_model_dir, inputs, outputs)
-    return saved_model_dir
-
-  def _getInputArrayNames(self, model_proto):
-    return [data.name for data in model_proto.input_arrays]
-
-  def _getInputArrayShapes(self, model_proto):
-    return [
-        [dim for dim in data.shape.dims] for data in model_proto.input_arrays
-    ]
-
-  def _get_model_flags_proto_from_file(self, filename):
-    proto = _model_flags_pb2.ModelFlags()
-    with gfile.Open(filename, "rb") as output_file:
-      proto.ParseFromString(output_file.read())
-      output_file.close()
-    return proto
-
-  def testSimpleSavedModel(self):
-    """Test a simple SavedModel."""
-    saved_model_dir = self._createSimpleSavedModel(shape=[1, 16, 16, 3])
-    output_file_model = os.path.join(self.get_temp_dir(), "model.pb")
-    output_file_flags = os.path.join(self.get_temp_dir(), "model.pbtxt")
-
-    convert_saved_model.saved_model_to_frozen_graphdef(
-        saved_model_dir=saved_model_dir,
-        output_file_model=output_file_model,
-        output_file_flags=output_file_flags,
-        input_arrays=["inputB", "inputA"])
-
-    proto = self._get_model_flags_proto_from_file(output_file_flags)
-    self.assertEqual(proto.output_arrays, ["add"])
-    self.assertEqual(self._getInputArrayNames(proto), ["inputA", "inputB"])
-    self.assertEqual(
-        self._getInputArrayShapes(proto), [[1, 16, 16, 3], [1, 16, 16, 3]])
-
-  def testSimpleSavedModelWithDifferentInputNames(self):
-    """Test a simple SavedModel."""
-    saved_model_dir = self._createSimpleSavedModel(shape=[1, 16, 16, 3])
-    output_file_model = os.path.join(self.get_temp_dir(), "model.pb")
-    output_file_flags = os.path.join(self.get_temp_dir(), "model.pbtxt")
-
-    # Check case where input shape is given.
-    convert_saved_model.saved_model_to_frozen_graphdef(
-        saved_model_dir=saved_model_dir,
-        output_file_model=output_file_model,
-        output_file_flags=output_file_flags,
-        input_arrays=["inputA"],
-        input_shapes={"inputA": [1, 16, 16, 3]})
-
-    proto = self._get_model_flags_proto_from_file(output_file_flags)
-    self.assertEqual(proto.output_arrays, ["add"])
-    self.assertEqual(self._getInputArrayNames(proto), ["inputA"])
-    self.assertEqual(self._getInputArrayShapes(proto), [[1, 16, 16, 3]])
-
-    # Check case where input shape is None.
-    convert_saved_model.saved_model_to_frozen_graphdef(
-        saved_model_dir=saved_model_dir,
-        output_file_model=output_file_model,
-        output_file_flags=output_file_flags,
-        input_arrays=["inputA"],
-        input_shapes={"inputA": None})
-
-    proto = self._get_model_flags_proto_from_file(output_file_flags)
-    self.assertEqual(proto.output_arrays, ["add"])
-    self.assertEqual(self._getInputArrayNames(proto), ["inputA"])
-    self.assertEqual(self._getInputArrayShapes(proto), [[1, 16, 16, 3]])
+    self.assertEqual(self._getArrayNames(out_tensors), ["add:0"])
+    self.assertEqual(self._getArrayNames(in_tensors), ["Placeholder:0"])
+    self.assertEqual(self._getArrayShapes(in_tensors), [[1, 28, 28]])
 
 
 class Model(keras.Model):
@@ -354,7 +416,7 @@ def dummy_input_fn():
   return image, labels
 
 
-class ConvertSavedModelTestTrainGraph(test_util.TensorFlowTestCase):
+class FreezeSavedModelTestTrainGraph(test_util.TensorFlowTestCase):
 
   def testTrainedMnistSavedModel(self):
     """Test mnist SavedModel, trained with dummy data and small steps."""
@@ -379,13 +441,16 @@ class ConvertSavedModelTestTrainGraph(test_util.TensorFlowTestCase):
     # Convert to tflite and test output
     saved_model_name = os.listdir(saved_model_dir)[0]
     saved_model_final_dir = os.path.join(saved_model_dir, saved_model_name)
-    output_file = os.path.join(saved_model_dir, saved_model_final_dir + ".lite")
+
     # TODO(zhixianyan): no need to limit output_arrays to `Softmax'
     # once b/74205001 fixed and argmax implemented in tflite.
-    result = convert_saved_model.tflite_from_saved_model(
+    result = convert_saved_model.freeze_saved_model(
         saved_model_dir=saved_model_final_dir,
+        input_arrays=None,
+        input_shapes=None,
         output_arrays=["Softmax"],
-        output_file=output_file)
+        tag_set=set([tag_constants.SERVING]),
+        signature_key=signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY)
 
     self.assertTrue(result)
 
diff --git a/tensorflow/contrib/lite/python/convert_saved_model_to_frozen_graph.py b/tensorflow/contrib/lite/python/convert_saved_model_to_frozen_graph.py
deleted file mode 100644
index 4d9782f4a6a9e853c3afdbd97d4264a818937e63..0000000000000000000000000000000000000000
--- a/tensorflow/contrib/lite/python/convert_saved_model_to_frozen_graph.py
+++ /dev/null
@@ -1,106 +0,0 @@
-# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Python console command for generating frozen models from SavedModels.
-
-This exists to add SavedModel compatibility to TOCO.
-"""
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import argparse
-import sys
-from tensorflow.contrib.lite.python.convert_saved_model import saved_model_to_frozen_graphdef
-from tensorflow.python.platform import app
-
-FLAGS = None
-
-
-def execute(unused_args):
-  """Calls function to convert the SavedModel to a frozen graph."""
-  # Error handling.
-  if FLAGS.input_shapes and not FLAGS.input_arrays:
-    raise ValueError("Input shapes requires input arrays to be specified.")
-
-  # Calls saved_model_to_frozen_graphdef function to generate frozen graph.
-  input_arrays = (FLAGS.input_arrays.split(",") if FLAGS.input_arrays else None)
-  input_shapes = None
-  if FLAGS.input_shapes:
-    input_shapes = {
-        input_arrays[idx]: shape.split(",")
-        for idx, shape in enumerate(FLAGS.input_shapes.split(":"))
-    }
-  output_arrays = (
-      FLAGS.output_arrays.split(",") if FLAGS.output_arrays else None)
-  tag_set = set(FLAGS.tag_set.split(",")) if FLAGS.tag_set else None
-
-  saved_model_to_frozen_graphdef(
-      saved_model_dir=FLAGS.saved_model_directory,
-      output_file_model=FLAGS.output_file_model,
-      output_file_flags=FLAGS.output_file_flags,
-      input_arrays=input_arrays,
-      input_shapes=input_shapes,
-      output_arrays=output_arrays,
-      tag_set=tag_set,
-      signature_key=FLAGS.signature_key,
-      batch_size=FLAGS.batch_size)
-
-
-def main():
-  global FLAGS
-  # Parses flags.
-  parser = argparse.ArgumentParser(
-      description="Invoke SavedModel to frozen model converter.")
-  parser.add_argument(
-      "saved_model_directory",
-      type=str,
-      help="Full path to directory containing the SavedModel.")
-  parser.add_argument(
-      "output_file_model",
-      type=str,
-      help="Full file path to save frozen graph.")
-  parser.add_argument(
-      "output_file_flags", type=str, help="Full file path to save ModelFlags.")
-  parser.add_argument(
-      "--input_arrays",
-      type=str,
-      help="Name of the input arrays, comma-separated.")
-  parser.add_argument(
-      "--input_shapes",
-      type=str,
-      help="Shapes corresponding to --input_arrays, colon-separated.")
-  parser.add_argument(
-      "--output_arrays",
-      type=str,
-      help="Name of the output arrays, comma-separated.")
-  parser.add_argument(
-      "--tag_set", type=str, help="Name of output arrays, comma-separated.")
-  parser.add_argument(
-      "--signature_key",
-      type=str,
-      help="Key identifying SignatureDef containing inputs and outputs.")
-  parser.add_argument(
-      "--batch_size",
-      type=int,
-      help="Batch size for the model. Replaces the first dimension of an "
-      "input size array if undefined.")
-
-  FLAGS, unparsed = parser.parse_known_args()
-
-  app.run(main=execute, argv=[sys.argv[0]] + unparsed)
-
-
-if __name__ == "__main__":
-  main()
diff --git a/tensorflow/contrib/lite/python/interpreter.py b/tensorflow/contrib/lite/python/interpreter.py
index 5fbc55145217dd8a4e3eec4108e18d1c2be5c883..779bda4c9d05fd056d6a262412fdcf0d47e7c57c 100644
--- a/tensorflow/contrib/lite/python/interpreter.py
+++ b/tensorflow/contrib/lite/python/interpreter.py
@@ -54,7 +54,7 @@ class Interpreter(object):
     elif model_content and not model_path:
       self._interpreter = (
           _interpreter_wrapper.InterpreterWrapper_CreateWrapperCPPFromBuffer(
-              model_content, len(model_content)))
+              model_content))
       if not self._interpreter:
         raise ValueError(
             'Failed to create model from {} bytes'.format(len(model_content)))
diff --git a/tensorflow/contrib/lite/python/interpreter_wrapper/BUILD b/tensorflow/contrib/lite/python/interpreter_wrapper/BUILD
index 453eda6e7345762666917fd501b69c7181c349e8..12ab38847dc0f838ae2c6bf80ed80805285e4b8b 100644
--- a/tensorflow/contrib/lite/python/interpreter_wrapper/BUILD
+++ b/tensorflow/contrib/lite/python/interpreter_wrapper/BUILD
@@ -15,7 +15,7 @@ cc_library(
         "//tensorflow/contrib/lite/kernels:builtin_ops",
         "//tensorflow/core:lib",
         "//tensorflow/python:numpy_lib",
-        "//util/python:python_headers",
+        "//third_party/python_runtime:headers",
         "@com_google_absl//absl/memory",
     ],
 )
@@ -27,6 +27,6 @@ tf_py_wrap_cc(
     ],
     deps = [
         ":interpreter_wrapper_lib",
-        "//util/python:python_headers",
+        "//third_party/python_runtime:headers",
     ],
 )
diff --git a/tensorflow/contrib/lite/python/interpreter_wrapper/interpreter_wrapper.cc b/tensorflow/contrib/lite/python/interpreter_wrapper/interpreter_wrapper.cc
index 16f4f30b94313453c3a4f0496a6ac5649847f076..5f304ad45d400b13e20bda8184b5b40cfe13f6c2 100644
--- a/tensorflow/contrib/lite/python/interpreter_wrapper/interpreter_wrapper.cc
+++ b/tensorflow/contrib/lite/python/interpreter_wrapper/interpreter_wrapper.cc
@@ -42,6 +42,8 @@ std::unique_ptr<tflite::Interpreter> CreateInterpreter(
     return nullptr;
   }
 
+  tensorflow::ImportNumpy();
+
   std::unique_ptr<tflite::Interpreter> interpreter;
   tflite::InterpreterBuilder(*model, resolver)(&interpreter);
   if (interpreter) {
@@ -331,9 +333,14 @@ InterpreterWrapper* InterpreterWrapper::CreateWrapperCPPFromFile(
 }
 
 InterpreterWrapper* InterpreterWrapper::CreateWrapperCPPFromBuffer(
-    const char* data, size_t len) {
+    PyObject* data) {
+  char * buf = nullptr;
+  Py_ssize_t length;
+  if (PY_TO_CPPSTRING(data, &buf, &length) == -1) {
+    return nullptr;
+  }
   std::unique_ptr<tflite::FlatBufferModel> model =
-      tflite::FlatBufferModel::BuildFromBuffer(data, len);
+      tflite::FlatBufferModel::BuildFromBuffer(buf, length);
   return model ? new InterpreterWrapper(std::move(model)) : nullptr;
 }
 
diff --git a/tensorflow/contrib/lite/python/interpreter_wrapper/interpreter_wrapper.h b/tensorflow/contrib/lite/python/interpreter_wrapper/interpreter_wrapper.h
index 0972c572595f5044a305a81afaccbea5f131247c..c02aa3804367f787016ef78fc8557005507f051b 100644
--- a/tensorflow/contrib/lite/python/interpreter_wrapper/interpreter_wrapper.h
+++ b/tensorflow/contrib/lite/python/interpreter_wrapper/interpreter_wrapper.h
@@ -19,6 +19,8 @@ limitations under the License.
 #include <string>
 #include <vector>
 
+// Place `<locale>` before <Python.h> to avoid build failures in macOS.
+#include <locale>
 #include <Python.h>
 
 // We forward declare TFLite classes here to avoid exposing them to SWIG.
@@ -40,8 +42,7 @@ class InterpreterWrapper {
   static InterpreterWrapper* CreateWrapperCPPFromFile(const char* model_path);
 
   // SWIG caller takes ownership of pointer.
-  static InterpreterWrapper* CreateWrapperCPPFromBuffer(const char* data,
-                                                        size_t len);
+  static InterpreterWrapper* CreateWrapperCPPFromBuffer(PyObject* data);
 
   ~InterpreterWrapper();
   bool AllocateTensors();
diff --git a/tensorflow/contrib/lite/python/lite.py b/tensorflow/contrib/lite/python/lite.py
index 86b25e68acaf5d74e3dd11784446e7bda3d329ee..876ffbbffa5a47b91d1318baf431050fe364aac0 100644
--- a/tensorflow/contrib/lite/python/lite.py
+++ b/tensorflow/contrib/lite/python/lite.py
@@ -16,23 +16,376 @@
 
 EXPERIMENTAL: APIs here are unstable and likely to change without notice.
 
+@@TocoConverter
 @@toco_convert
 @@toco_convert_protos
-@@tflite_from_saved_model
 @@Interpreter
 @@OpHint
 @@convert_op_hints_to_stubs
 
+@@FLOAT
+@@QUANTIZED_UINT8
+@@TFLITE
+@@GRAPHVIZ_DOT
+
 """
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-# pylint: disable=unused-import
+from six import PY3
+
+from google.protobuf import text_format as _text_format
+from google.protobuf.message import DecodeError
+from tensorflow.contrib.lite.python import lite_constants as constants
+from tensorflow.contrib.lite.python.convert import build_toco_convert_protos  # pylint: disable=unused-import
+from tensorflow.contrib.lite.python.convert import tensor_name
 from tensorflow.contrib.lite.python.convert import toco_convert
-from tensorflow.contrib.lite.python.convert import toco_convert_protos
-from tensorflow.contrib.lite.python.convert_saved_model import tflite_from_saved_model
-from tensorflow.contrib.lite.python.interpreter import Interpreter
-from tensorflow.contrib.lite.python.op_hint import convert_op_hints_to_stubs
-from tensorflow.contrib.lite.python.op_hint import OpHint
-# pylint: enable=unused-import
+from tensorflow.contrib.lite.python.convert import toco_convert_protos  # pylint: disable=unused-import
+from tensorflow.contrib.lite.python.convert_saved_model import freeze_saved_model
+from tensorflow.contrib.lite.python.convert_saved_model import get_tensors_from_tensor_names
+from tensorflow.contrib.lite.python.convert_saved_model import set_tensor_shapes
+from tensorflow.contrib.lite.python.interpreter import Interpreter  # pylint: disable=unused-import
+from tensorflow.contrib.lite.python.op_hint import convert_op_hints_to_stubs  # pylint: disable=unused-import
+from tensorflow.contrib.lite.python.op_hint import OpHint  # pylint: disable=unused-import
+from tensorflow.core.framework import graph_pb2 as _graph_pb2
+from tensorflow.python.client import session as _session
+from tensorflow.python.framework import graph_util as tf_graph_util
+from tensorflow.python.framework.importer import import_graph_def
+from tensorflow.python.ops.variables import global_variables_initializer
+from tensorflow.python.saved_model import signature_constants
+from tensorflow.python.saved_model import tag_constants
+
+
+class TocoConverter(object):
+  """Convert a TensorFlow model into `output_format` using TOCO.
+
+  This is used to convert from a TensorFlow GraphDef or SavedModel into either a
+  TFLite FlatBuffer or graph visualization.
+
+  Attributes:
+
+    inference_type: Target data type of arrays in the output file. Currently
+      must be `{FLOAT, QUANTIZED_UINT8}`.  (default FLOAT)
+    inference_input_type: Target data type of input arrays. Allows for a
+      different type for input arrays in the case of quantization. Currently
+      must be `{FLOAT, QUANTIZED_UINT8}`. (default `inference_type`)
+    output_format: Output file format. Currently must be `{TFLITE,
+      GRAPHVIZ_DOT}`. (default TFLITE)
+    quantized_input_stats: Dict of strings representing input tensor names
+      mapped to tuple of integers representing the mean and standard deviation
+      of the training data (e.g., {"foo" : (0., 1.)}). Only need if
+      `inference_type` is `QUANTIZED_UINT8`. (default {})
+    default_ranges_stats: Tuple of integers representing (min, max) range values
+      for all arrays without a specified range. Intended for experimenting with
+      quantization via "dummy quantization". (default None)
+    drop_control_dependency: Boolean indicating whether to drop control
+      dependencies silently. This is due to TFLite not supporting control
+      dependencies. (default True)
+    reorder_across_fake_quant: Boolean indicating whether to reorder FakeQuant
+      nodes in unexpected locations. Used when the location of the FakeQuant
+      nodes is preventing graph transformations necessary to convert the graph.
+      Results in a graph that differs from the quantized training graph,
+      potentially causing differing arithmetic behavior. (default False)
+    change_concat_input_ranges: Boolean to change behavior of min/max ranges for
+      inputs and outputs of the concat operator for quantized models. Changes
+      the ranges of concat operator overlap when true. (default False)
+    allow_custom_ops: Boolean indicating whether to allow custom operations.
+      When false any unknown operation is an error. When true, custom ops are
+      created for any op that is unknown. The developer will need to provide
+      these to the TensorFlow Lite runtime with a custom resolver.
+      (default False)
+    quantize_weights: Boolean indicating whether to store weights as quantized
+      weights followed by dequantize operations. Computation is still done in
+      float, but reduces model size (at the cost of accuracy and latency).
+      (default False)
+    dump_graphviz_dir: Full filepath of folder to dump the graphs at various
+      stages of processing GraphViz .dot files. Preferred over
+      --output_format=GRAPHVIZ_DOT in order to keep the requirements of the
+      output file. (default None)
+    dump_graphviz_video: Boolean indicating whether to dump the graph after
+      every graph transformation. (default False)
+
+  Example usage:
+
+    # Converting a GraphDef from session.
+    converter = lite.TocoConverter.from_session(sess, in_tensors, out_tensors)
+    tflite_model = converter.convert()
+    open("converted_model.tflite", "wb").write(tflite_model)
+
+    # Converting a GraphDef from file.
+    converter = lite.TocoConverter.from_frozen_graph(
+      graph_def_file, input_arrays, output_arrays)
+    tflite_model = converter.convert()
+    open("converted_model.tflite", "wb").write(tflite_model)
+
+    # Converting a SavedModel.
+    converter = lite.TocoConverter.from_saved_model(saved_model_dir)
+    tflite_model = converter.convert()
+  """
+
+  def __init__(self, graph_def, input_tensors, output_tensors):
+    """Constructor for TocoConverter.
+
+    Args:
+
+      graph_def: TensorFlow GraphDef.
+      input_tensors: List of input tensors. Type and shape are computed using
+        `foo.get_shape()` and `foo.dtype`.
+      output_tensors: List of output tensors (only .name is used from this).
+    """
+    self._graph_def = graph_def
+    self._input_tensors = input_tensors
+    self._output_tensors = output_tensors
+    self.inference_type = constants.FLOAT
+    self.inference_input_type = None
+    self.output_format = constants.TFLITE
+    self.quantized_input_stats = {}
+    self.default_ranges_stats = None
+    self.drop_control_dependency = True
+    self.reorder_across_fake_quant = False
+    self.change_concat_input_ranges = False
+    self.allow_custom_ops = False
+    self.quantize_weights = False
+    self.dump_graphviz_dir = None
+    self.dump_graphviz_video = False
+
+  @classmethod
+  def from_session(cls, sess, input_tensors, output_tensors):
+    """Creates a TocoConverter class from a TensorFlow Session.
+
+    Args:
+      sess: TensorFlow Session.
+      input_tensors: List of input tensors. Type and shape are computed using
+        `foo.get_shape()` and `foo.dtype`.
+      output_tensors: List of output tensors (only .name is used from this).
+
+    Returns:
+      TocoConverter class.
+    """
+    graph_def = _freeze_graph(sess, output_tensors)
+    return cls(graph_def, input_tensors, output_tensors)
+
+  @classmethod
+  def from_frozen_graph(cls,
+                        graph_def_file,
+                        input_arrays,
+                        output_arrays,
+                        input_shapes=None):
+    """Creates a TocoConverter class from a file containing a frozen GraphDef.
+
+    Args:
+      graph_def_file: Full filepath of file containing TensorFlow GraphDef.
+      input_arrays: List of input tensors to freeze graph with.
+      output_arrays: List of output tensors to freeze graph with.
+      input_shapes: Dict of strings representing input tensor names to list of
+        integers representing input shapes (e.g., {"foo" : [1, 16, 16, 3]}).
+        Automatically determined when input shapes is None (e.g., {"foo" :
+        None}). (default None)
+
+    Returns:
+      TocoConverter class.
+
+    Raises:
+      ValueError:
+        Unable to parse input file.
+        The graph is not frozen.
+        input_arrays or output_arrays contains an invalid tensor name.
+    """
+    with _session.Session() as sess:
+      sess.run(global_variables_initializer())
+
+      # Read GraphDef from file.
+      graph_def = _graph_pb2.GraphDef()
+      with open(graph_def_file, "rb") as f:
+        file_content = f.read()
+      try:
+        graph_def.ParseFromString(file_content)
+      except (_text_format.ParseError, DecodeError):
+        try:
+          print("Ignore 'tcmalloc: large alloc' warnings.")
+
+          if not isinstance(file_content, str):
+            if PY3:
+              file_content = file_content.decode('utf-8')
+            else:
+              file_content = file_content.encode('utf-8')
+          _text_format.Merge(file_content, graph_def)
+        except (_text_format.ParseError, DecodeError):
+          raise ValueError(
+              "Unable to parse input file '{}'.".format(graph_def_file))
+      sess.graph.as_default()
+      import_graph_def(graph_def, name="")
+
+      # Get input and output tensors.
+      input_tensors = get_tensors_from_tensor_names(sess.graph, input_arrays)
+      output_tensors = get_tensors_from_tensor_names(sess.graph, output_arrays)
+      set_tensor_shapes(input_tensors, input_shapes)
+
+      # Check if graph is frozen.
+      if not _is_frozen_graph(sess):
+        raise ValueError("Please freeze the graph using freeze_graph.py.")
+
+      # Create TocoConverter class.
+      return cls(sess.graph_def, input_tensors, output_tensors)
+
+  @classmethod
+  def from_saved_model(cls,
+                       saved_model_dir,
+                       input_arrays=None,
+                       input_shapes=None,
+                       output_arrays=None,
+                       tag_set=None,
+                       signature_key=None):
+    """Creates a TocoConverter class from a SavedModel.
+
+    Args:
+      saved_model_dir: SavedModel directory to convert.
+      input_arrays: List of input tensors to freeze graph with. Uses input
+        arrays from SignatureDef when none are provided. (default None)
+      input_shapes: Dict of strings representing input tensor names to list of
+        integers representing input shapes (e.g., {"foo" : [1, 16, 16, 3]}).
+        Automatically determined when input shapes is None (e.g., {"foo" :
+        None}). (default None)
+      output_arrays: List of output tensors to freeze graph with. Uses output
+        arrays from SignatureDef when none are provided. (default None)
+      tag_set: Set of tags identifying the MetaGraphDef within the SavedModel to
+        analyze. All tags in the tag set must be present. (default set("serve"))
+      signature_key: Key identifying SignatureDef containing inputs and outputs.
+        (default DEFAULT_SERVING_SIGNATURE_DEF_KEY)
+
+    Returns:
+      TocoConverter class.
+    """
+    if tag_set is None:
+      tag_set = set([tag_constants.SERVING])
+    if signature_key is None:
+      signature_key = signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY
+
+    result = freeze_saved_model(saved_model_dir, input_arrays, input_shapes,
+                                output_arrays, tag_set, signature_key)
+    return cls(
+        graph_def=result[0], input_tensors=result[1], output_tensors=result[2])
+
+  def convert(self):
+    """Converts a TensorFlow GraphDef based on instance variables.
+
+    Returns:
+      The converted data in serialized format. Either a TFLite Flatbuffer or a
+      Graphviz graph depending on value in `output_format`.
+
+    Raises:
+      ValueError:
+        Input shape is not specified.
+        None value for dimension in input_tensor.
+    """
+    # Checks dimensions in input tensor.
+    for tensor in self._input_tensors:
+      if not tensor.get_shape():
+        raise ValueError("Provide an input shape for input array '{0}'.".format(
+            tensor_name(tensor)))
+      shape = tensor.get_shape().as_list()
+      if None in shape[1:]:
+        raise ValueError(
+            "None is only supported in the 1st dimension. Tensor '{0}' has "
+            "invalid shape '{1}'.".format(tensor_name(tensor), shape))
+      elif shape[0] is None:
+        self._set_batch_size(batch_size=1)
+
+    # Get quantization stats. Ensures there is one stat per name if the stats
+    # are specified.
+    if self.quantized_input_stats:
+      quantized_stats = []
+      invalid_stats = []
+      for tensor in self._input_tensors:
+        name = tensor_name(tensor)
+        if name in self.quantized_input_stats:
+          quantized_stats.append(self.quantized_input_stats[name])
+        else:
+          invalid_stats.append(name)
+
+      if invalid_stats:
+        raise ValueError("Quantization input stats are not available for input "
+                         "tensors '{0}'.".format(",".join(invalid_stats)))
+    else:
+      quantized_stats = None
+
+    # Converts model.
+    result = toco_convert(
+        input_data=self._graph_def,
+        input_tensors=self._input_tensors,
+        output_tensors=self._output_tensors,
+        inference_type=self.inference_type,
+        inference_input_type=self.inference_input_type,
+        input_format=constants.TENSORFLOW_GRAPHDEF,
+        output_format=self.output_format,
+        quantized_input_stats=quantized_stats,
+        default_ranges_stats=self.default_ranges_stats,
+        drop_control_dependency=self.drop_control_dependency,
+        reorder_across_fake_quant=self.reorder_across_fake_quant,
+        change_concat_input_ranges=self.change_concat_input_ranges,
+        allow_custom_ops=self.allow_custom_ops,
+        quantize_weights=self.quantize_weights,
+        dump_graphviz_dir=self.dump_graphviz_dir,
+        dump_graphviz_video=self.dump_graphviz_video)
+    return result
+
+  def get_input_arrays(self):
+    """Returns a list of the names of the input tensors.
+
+    Returns:
+      List of strings.
+    """
+    return [tensor_name(tensor) for tensor in self._input_tensors]
+
+  def _set_batch_size(self, batch_size):
+    """Sets the first dimension of the input tensor to `batch_size`.
+
+    Args:
+      batch_size: Batch size for the model. Replaces the first dimension of an
+        input size array if undefined. (default 1)
+    """
+    for tensor in self._input_tensors:
+      shape = tensor.get_shape().as_list()
+      shape[0] = batch_size
+      tensor.set_shape(shape)
+
+
+def _is_frozen_graph(sess):
+  """Determines if the graph is frozen.
+
+  Determines if a graph has previously been frozen by checking for any
+  operations of type Variable*. If variables are found, the graph is not frozen.
+
+  Args:
+    sess: TensorFlow Session.
+
+  Returns:
+    Bool.
+  """
+  for op in sess.graph.get_operations():
+    if op.type.startswith("Variable"):
+      return False
+  return True
+
+
+def _freeze_graph(sess, output_tensors):
+  """Returns a frozen GraphDef.
+
+  Freezes a graph with Variables in it. Otherwise the existing GraphDef is
+  returned.
+
+  Args:
+    sess: TensorFlow Session.
+    output_tensors: List of output tensors (only .name is used from this).
+
+  Returns:
+    Frozen GraphDef.
+  """
+  if not _is_frozen_graph(sess):
+    sess.run(global_variables_initializer())
+    output_arrays = [tensor_name(tensor) for tensor in output_tensors]
+    return tf_graph_util.convert_variables_to_constants(sess, sess.graph_def,
+                                                        output_arrays)
+  else:
+    return sess.graph_def
diff --git a/tensorflow/contrib/lite/python/lite_test.py b/tensorflow/contrib/lite/python/lite_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..8c9d2c1651dd2d0b3cd27cf638c04429e3131efb
--- /dev/null
+++ b/tensorflow/contrib/lite/python/lite_test.py
@@ -0,0 +1,622 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for lite.py."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import os
+import numpy as np
+
+from tensorflow.contrib.lite.python import lite
+from tensorflow.contrib.lite.python import lite_constants
+from tensorflow.contrib.lite.python.interpreter import Interpreter
+from tensorflow.python.client import session
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import test_util
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import variable_scope
+from tensorflow.python.platform import gfile
+from tensorflow.python.platform import test
+from tensorflow.python.saved_model import saved_model
+from tensorflow.python.training.training_util import write_graph
+
+
+class FromSessionTest(test_util.TensorFlowTestCase):
+
+  def testFloat(self):
+    in_tensor = array_ops.placeholder(
+        shape=[1, 16, 16, 3], dtype=dtypes.float32)
+    out_tensor = in_tensor + in_tensor
+    sess = session.Session()
+
+    # Convert model and ensure model is not None.
+    converter = lite.TocoConverter.from_session(sess, [in_tensor], [out_tensor])
+    tflite_model = converter.convert()
+    self.assertTrue(tflite_model)
+
+    # Check values from converted model.
+    interpreter = Interpreter(model_content=tflite_model)
+    interpreter.allocate_tensors()
+
+    input_details = interpreter.get_input_details()
+    self.assertEqual(1, len(input_details))
+    self.assertEqual('Placeholder', input_details[0]['name'])
+    self.assertEqual(np.float32, input_details[0]['dtype'])
+    self.assertTrue(([1, 16, 16, 3] == input_details[0]['shape']).all())
+    self.assertEqual((0., 0.), input_details[0]['quantization'])
+
+    output_details = interpreter.get_output_details()
+    self.assertEqual(1, len(output_details))
+    self.assertEqual('add', output_details[0]['name'])
+    self.assertEqual(np.float32, output_details[0]['dtype'])
+    self.assertTrue(([1, 16, 16, 3] == output_details[0]['shape']).all())
+    self.assertEqual((0., 0.), output_details[0]['quantization'])
+
+  def testQuantization(self):
+    in_tensor_1 = array_ops.placeholder(
+        shape=[1, 16, 16, 3], dtype=dtypes.float32, name='inputA')
+    in_tensor_2 = array_ops.placeholder(
+        shape=[1, 16, 16, 3], dtype=dtypes.float32, name='inputB')
+    out_tensor = array_ops.fake_quant_with_min_max_args(
+        in_tensor_1 + in_tensor_2, min=0., max=1., name='output')
+    sess = session.Session()
+
+    # Convert model and ensure model is not None.
+    converter = lite.TocoConverter.from_session(
+        sess, [in_tensor_1, in_tensor_2], [out_tensor])
+    converter.inference_type = lite_constants.QUANTIZED_UINT8
+    converter.quantized_input_stats = {
+        'inputA': (0., 1.),
+        'inputB': (0., 1.)
+    }  # mean, std_dev
+    tflite_model = converter.convert()
+    self.assertTrue(tflite_model)
+
+    # Check values from converted model.
+    interpreter = Interpreter(model_content=tflite_model)
+    interpreter.allocate_tensors()
+
+    input_details = interpreter.get_input_details()
+    self.assertEqual(2, len(input_details))
+    self.assertEqual('inputA', input_details[0]['name'])
+    self.assertEqual(np.uint8, input_details[0]['dtype'])
+    self.assertTrue(([1, 16, 16, 3] == input_details[0]['shape']).all())
+    self.assertEqual((1., 0.),
+                     input_details[0]['quantization'])  # scale, zero_point
+
+    self.assertEqual('inputB', input_details[1]['name'])
+    self.assertEqual(np.uint8, input_details[1]['dtype'])
+    self.assertTrue(([1, 16, 16, 3] == input_details[1]['shape']).all())
+    self.assertEqual((1., 0.),
+                     input_details[1]['quantization'])  # scale, zero_point
+
+    output_details = interpreter.get_output_details()
+    self.assertEqual(1, len(output_details))
+    self.assertEqual('output', output_details[0]['name'])
+    self.assertEqual(np.uint8, output_details[0]['dtype'])
+    self.assertTrue(([1, 16, 16, 3] == output_details[0]['shape']).all())
+    self.assertTrue(output_details[0]['quantization'][0] > 0)  # scale
+
+  def testQuantizationInvalid(self):
+    in_tensor_1 = array_ops.placeholder(
+        shape=[1, 16, 16, 3], dtype=dtypes.float32, name='inputA')
+    in_tensor_2 = array_ops.placeholder(
+        shape=[1, 16, 16, 3], dtype=dtypes.float32, name='inputB')
+    out_tensor = array_ops.fake_quant_with_min_max_args(
+        in_tensor_1 + in_tensor_2, min=0., max=1., name='output')
+    sess = session.Session()
+
+    # Convert model and ensure model is not None.
+    converter = lite.TocoConverter.from_session(
+        sess, [in_tensor_1, in_tensor_2], [out_tensor])
+    converter.inference_type = lite_constants.QUANTIZED_UINT8
+    converter.quantized_input_stats = {'inputA': (0., 1.)}  # mean, std_dev
+    with self.assertRaises(ValueError) as error:
+      converter.convert()
+    self.assertEqual(
+        'Quantization input stats are not available for input tensors '
+        '\'inputB\'.', str(error.exception))
+
+  def testSizeNoneInvalid(self):
+    in_tensor = array_ops.placeholder(dtype=dtypes.float32)
+    out_tensor = in_tensor + in_tensor
+    sess = session.Session()
+
+    # Test invalid shape. None after 1st dimension.
+    converter = lite.TocoConverter.from_session(sess, [in_tensor], [out_tensor])
+    with self.assertRaises(ValueError) as error:
+      converter.convert()
+    self.assertEqual('Provide an input shape for input array \'Placeholder\'.',
+                     str(error.exception))
+
+  def testBatchSizeInvalid(self):
+    in_tensor = array_ops.placeholder(
+        shape=[1, None, 16, 3], dtype=dtypes.float32)
+    out_tensor = in_tensor + in_tensor
+    sess = session.Session()
+
+    # Test invalid shape. None after 1st dimension.
+    converter = lite.TocoConverter.from_session(sess, [in_tensor], [out_tensor])
+    with self.assertRaises(ValueError) as error:
+      converter.convert()
+    self.assertEqual(
+        'None is only supported in the 1st dimension. Tensor '
+        '\'Placeholder\' has invalid shape \'[1, None, 16, 3]\'.',
+        str(error.exception))
+
+  def testBatchSizeValid(self):
+    in_tensor = array_ops.placeholder(
+        shape=[None, 16, 16, 3], dtype=dtypes.float32)
+    out_tensor = in_tensor + in_tensor
+    sess = session.Session()
+
+    # Convert model and ensure model is not None.
+    converter = lite.TocoConverter.from_session(sess, [in_tensor], [out_tensor])
+    tflite_model = converter.convert()
+    self.assertTrue(tflite_model)
+
+    # Check values from converted model.
+    interpreter = Interpreter(model_content=tflite_model)
+    interpreter.allocate_tensors()
+
+    input_details = interpreter.get_input_details()
+    self.assertEqual(1, len(input_details))
+    self.assertEqual('Placeholder', input_details[0]['name'])
+    self.assertEqual(np.float32, input_details[0]['dtype'])
+    self.assertTrue(([1, 16, 16, 3] == input_details[0]['shape']).all())
+    self.assertEqual((0., 0.), input_details[0]['quantization'])
+
+    output_details = interpreter.get_output_details()
+    self.assertEqual(1, len(output_details))
+    self.assertEqual('add', output_details[0]['name'])
+    self.assertEqual(np.float32, output_details[0]['dtype'])
+    self.assertTrue(([1, 16, 16, 3] == output_details[0]['shape']).all())
+    self.assertEqual((0., 0.), output_details[0]['quantization'])
+
+  def testFreezeGraph(self):
+    in_tensor = array_ops.placeholder(
+        shape=[1, 16, 16, 3], dtype=dtypes.float32)
+    var = variable_scope.get_variable(
+        'weights', shape=[1, 16, 16, 3], dtype=dtypes.float32)
+    out_tensor = in_tensor + var
+    sess = session.Session()
+
+    # Convert model and ensure model is not None.
+    converter = lite.TocoConverter.from_session(sess, [in_tensor], [out_tensor])
+    tflite_model = converter.convert()
+    self.assertTrue(tflite_model)
+
+    # Check values from converted model.
+    interpreter = Interpreter(model_content=tflite_model)
+    interpreter.allocate_tensors()
+
+    input_details = interpreter.get_input_details()
+    self.assertEqual(1, len(input_details))
+    self.assertEqual('Placeholder', input_details[0]['name'])
+    self.assertEqual(np.float32, input_details[0]['dtype'])
+    self.assertTrue(([1, 16, 16, 3] == input_details[0]['shape']).all())
+    self.assertEqual((0., 0.), input_details[0]['quantization'])
+
+    output_details = interpreter.get_output_details()
+    self.assertEqual(1, len(output_details))
+    self.assertEqual('add', output_details[0]['name'])
+    self.assertEqual(np.float32, output_details[0]['dtype'])
+    self.assertTrue(([1, 16, 16, 3] == output_details[0]['shape']).all())
+    self.assertEqual((0., 0.), output_details[0]['quantization'])
+
+  # TODO(nupurgarg): Verify value of contents in GraphViz.
+  def testGraphviz(self):
+    in_tensor = array_ops.placeholder(
+        shape=[1, 16, 16, 3], dtype=dtypes.float32)
+    out_tensor = in_tensor + in_tensor
+    sess = session.Session()
+
+    # Convert model and ensure model is not None.
+    converter = lite.TocoConverter.from_session(sess, [in_tensor], [out_tensor])
+    converter.output_format = lite_constants.GRAPHVIZ_DOT
+    graphviz_output = converter.convert()
+    self.assertTrue(graphviz_output)
+
+  # TODO(nupurgarg): Verify value of contents in GraphViz.
+  def testDumpGraphviz(self):
+    in_tensor = array_ops.placeholder(
+        shape=[1, 16, 16, 3], dtype=dtypes.float32)
+    out_tensor = in_tensor + in_tensor
+    sess = session.Session()
+
+    # Convert model and ensure model is not None.
+    converter = lite.TocoConverter.from_session(sess, [in_tensor], [out_tensor])
+    graphviz_dir = self.get_temp_dir()
+    converter.dump_graphviz_dir = graphviz_dir
+    tflite_model = converter.convert()
+    self.assertTrue(tflite_model)
+
+    # Ensure interpreter is able to allocate and check graphviz data.
+    interpreter = Interpreter(model_content=tflite_model)
+    interpreter.allocate_tensors()
+
+    num_items_graphviz = len(os.listdir(graphviz_dir))
+    self.assertTrue(num_items_graphviz)
+
+    # Convert model and ensure model is not None.
+    converter = lite.TocoConverter.from_session(sess, [in_tensor], [out_tensor])
+    graphviz_dir = self.get_temp_dir()
+    converter.dump_graphviz_dir = graphviz_dir
+    converter.dump_graphviz_video = True
+    tflite_model = converter.convert()
+    self.assertTrue(tflite_model)
+
+    # Ensure graphviz folder has more data after using video flag.
+    num_items_graphviz_video = len(os.listdir(graphviz_dir))
+    self.assertTrue(num_items_graphviz_video > num_items_graphviz)
+
+  def testInferenceInputType(self):
+    in_tensor = array_ops.placeholder(shape=[1, 16, 16, 3], dtype=dtypes.uint8)
+    out_tensor = in_tensor + in_tensor
+    sess = session.Session()
+
+    # Convert model and ensure model is not None.
+    converter = lite.TocoConverter.from_session(sess, [in_tensor], [out_tensor])
+    converter.inference_input_type = lite_constants.QUANTIZED_UINT8
+    tflite_model = converter.convert()
+    self.assertTrue(tflite_model)
+
+    # Check values from converted model.
+    interpreter = Interpreter(model_content=tflite_model)
+    interpreter.allocate_tensors()
+
+    input_details = interpreter.get_input_details()
+    self.assertEqual(1, len(input_details))
+    self.assertEqual('Placeholder', input_details[0]['name'])
+    self.assertEqual(np.uint8, input_details[0]['dtype'])
+    self.assertTrue(([1, 16, 16, 3] == input_details[0]['shape']).all())
+    self.assertEqual((0., 0.), input_details[0]['quantization'])
+
+    output_details = interpreter.get_output_details()
+    self.assertEqual(1, len(output_details))
+    self.assertEqual('add', output_details[0]['name'])
+    self.assertEqual(np.uint8, output_details[0]['dtype'])
+    self.assertTrue(([1, 16, 16, 3] == output_details[0]['shape']).all())
+    self.assertEqual((0., 0.), input_details[0]['quantization'])
+
+  def testDefaultRangesStats(self):
+    in_tensor = array_ops.placeholder(
+        shape=[1, 16, 16, 3], dtype=dtypes.float32)
+    out_tensor = in_tensor + in_tensor
+    sess = session.Session()
+
+    # Convert model and ensure model is not None.
+    converter = lite.TocoConverter.from_session(sess, [in_tensor], [out_tensor])
+    converter.inference_type = lite_constants.QUANTIZED_UINT8
+    converter.quantized_input_stats = {'Placeholder': (0., 1.)}  # mean, std_dev
+    converter.default_ranges_stats = (0, 6)  # min, max
+    tflite_model = converter.convert()
+    self.assertTrue(tflite_model)
+
+    # Check values from converted model.
+    interpreter = Interpreter(model_content=tflite_model)
+    interpreter.allocate_tensors()
+
+    input_details = interpreter.get_input_details()
+    self.assertEqual(1, len(input_details))
+    self.assertEqual('Placeholder', input_details[0]['name'])
+    self.assertEqual(np.uint8, input_details[0]['dtype'])
+    self.assertTrue(([1, 16, 16, 3] == input_details[0]['shape']).all())
+    self.assertEqual((1., 0.), input_details[0]['quantization'])
+
+    output_details = interpreter.get_output_details()
+    self.assertEqual(1, len(output_details))
+    self.assertEqual('add', output_details[0]['name'])
+    self.assertEqual(np.uint8, output_details[0]['dtype'])
+    self.assertTrue(([1, 16, 16, 3] == output_details[0]['shape']).all())
+    self.assertTrue(output_details[0]['quantization'][0] > 0)  # scale
+
+  def testQuantizeWeights(self):
+    np.random.seed(0)
+    # We need the tensor to have more than 1024 elements for quantize_weights
+    # to kick in. Thus, the [33, 33] shape.
+    in_tensor_1 = array_ops.placeholder(
+        shape=[33, 33], dtype=dtypes.float32, name='inputA')
+    in_tensor_2 = constant_op.constant(
+        np.random.uniform(low=-10., high=10., size=(33, 33)),
+        shape=[33, 33],
+        dtype=dtypes.float32,
+        name='inputB')
+    out_tensor = math_ops.matmul(in_tensor_1, in_tensor_2, name='output')
+    sess = session.Session()
+
+    # Convert float model.
+    float_converter = lite.TocoConverter.from_session(sess, [in_tensor_1],
+                                                      [out_tensor])
+    float_tflite = float_converter.convert()
+    self.assertTrue(float_tflite)
+
+    # Convert quantized weights model.
+    quantized_weights_converter = lite.TocoConverter.from_session(
+        sess, [in_tensor_1], [out_tensor])
+    quantized_weights_converter.quantize_weights = True
+    quantized_weights_tflite = quantized_weights_converter.convert()
+    self.assertTrue(quantized_weights_tflite)
+
+    # Ensure that the quantized weights tflite model is smaller.
+    self.assertTrue(len(quantized_weights_tflite) < len(float_tflite))
+
+
+class FromFrozenGraphFile(test_util.TensorFlowTestCase):
+
+  def testFloat(self):
+    in_tensor = array_ops.placeholder(
+        shape=[1, 16, 16, 3], dtype=dtypes.float32)
+    _ = in_tensor + in_tensor
+    sess = session.Session()
+
+    # Write graph to file.
+    graph_def_file = os.path.join(self.get_temp_dir(), 'model.pb')
+    write_graph(sess.graph_def, '', graph_def_file, False)
+
+    # Convert model and ensure model is not None.
+    converter = lite.TocoConverter.from_frozen_graph(graph_def_file,
+                                                     ['Placeholder'], ['add'])
+    tflite_model = converter.convert()
+    self.assertTrue(tflite_model)
+
+    # Check values from converted model.
+    interpreter = Interpreter(model_content=tflite_model)
+    interpreter.allocate_tensors()
+
+    input_details = interpreter.get_input_details()
+    self.assertEqual(1, len(input_details))
+    self.assertEqual('Placeholder', input_details[0]['name'])
+    self.assertEqual(np.float32, input_details[0]['dtype'])
+    self.assertTrue(([1, 16, 16, 3] == input_details[0]['shape']).all())
+    self.assertEqual((0., 0.), input_details[0]['quantization'])
+
+    output_details = interpreter.get_output_details()
+    self.assertEqual(1, len(output_details))
+    self.assertEqual('add', output_details[0]['name'])
+    self.assertEqual(np.float32, output_details[0]['dtype'])
+    self.assertTrue(([1, 16, 16, 3] == output_details[0]['shape']).all())
+    self.assertEqual((0., 0.), output_details[0]['quantization'])
+
+  def testFloatWithShapesArray(self):
+    in_tensor = array_ops.placeholder(
+        shape=[1, 16, 16, 3], dtype=dtypes.float32)
+    _ = in_tensor + in_tensor
+    sess = session.Session()
+
+    # Write graph to file.
+    graph_def_file = os.path.join(self.get_temp_dir(), 'model.pb')
+    write_graph(sess.graph_def, '', graph_def_file, False)
+
+    # Convert model and ensure model is not None.
+    converter = lite.TocoConverter.from_frozen_graph(
+        graph_def_file, ['Placeholder'], ['add'],
+        input_shapes={'Placeholder': [1, 16, 16, 3]})
+    tflite_model = converter.convert()
+    self.assertTrue(tflite_model)
+
+    # Check values from converted model.
+    interpreter = Interpreter(model_content=tflite_model)
+    interpreter.allocate_tensors()
+
+    input_details = interpreter.get_input_details()
+    self.assertEqual(1, len(input_details))
+    self.assertTrue(([1, 16, 16, 3] == input_details[0]['shape']).all())
+
+  def testFreezeGraph(self):
+    in_tensor = array_ops.placeholder(
+        shape=[1, 16, 16, 3], dtype=dtypes.float32)
+    var = variable_scope.get_variable(
+        'weights', shape=[1, 16, 16, 3], dtype=dtypes.float32)
+    _ = in_tensor + var
+    sess = session.Session()
+
+    # Write graph to file.
+    graph_def_file = os.path.join(self.get_temp_dir(), 'model.pb')
+    write_graph(sess.graph_def, '', graph_def_file, False)
+
+    # Ensure the graph with variables cannot be converted.
+    with self.assertRaises(ValueError) as error:
+      lite.TocoConverter.from_frozen_graph(graph_def_file, ['Placeholder'],
+                                           ['add'])
+    self.assertEqual('Please freeze the graph using freeze_graph.py.',
+                     str(error.exception))
+
+  def testPbtxt(self):
+    in_tensor = array_ops.placeholder(
+        shape=[1, 16, 16, 3], dtype=dtypes.float32)
+    _ = in_tensor + in_tensor
+    sess = session.Session()
+
+    # Write graph to file.
+    graph_def_file = os.path.join(self.get_temp_dir(), 'model.pbtxt')
+    write_graph(sess.graph_def, '', graph_def_file, True)
+
+    # Convert model and ensure model is not None.
+    converter = lite.TocoConverter.from_frozen_graph(graph_def_file,
+                                                     ['Placeholder'], ['add'])
+    tflite_model = converter.convert()
+    self.assertTrue(tflite_model)
+
+    # Check values from converted model.
+    interpreter = Interpreter(model_content=tflite_model)
+    interpreter.allocate_tensors()
+
+    input_details = interpreter.get_input_details()
+    self.assertEqual(1, len(input_details))
+    self.assertEqual('Placeholder', input_details[0]['name'])
+    self.assertEqual(np.float32, input_details[0]['dtype'])
+    self.assertTrue(([1, 16, 16, 3] == input_details[0]['shape']).all())
+    self.assertEqual((0., 0.), input_details[0]['quantization'])
+
+    output_details = interpreter.get_output_details()
+    self.assertEqual(1, len(output_details))
+    self.assertEqual('add', output_details[0]['name'])
+    self.assertEqual(np.float32, output_details[0]['dtype'])
+    self.assertTrue(([1, 16, 16, 3] == output_details[0]['shape']).all())
+    self.assertEqual((0., 0.), output_details[0]['quantization'])
+
+  def testInvalidFile(self):
+    graph_def_file = os.path.join(self.get_temp_dir(), 'invalid_file')
+    with gfile.Open(graph_def_file, 'wb') as temp_file:
+      temp_file.write('bad data')
+      temp_file.flush()
+
+    # Attempts to convert the invalid model.
+    with self.assertRaises(ValueError) as error:
+      lite.TocoConverter.from_frozen_graph(graph_def_file, ['Placeholder'],
+                                           ['add'])
+    self.assertEqual(
+        'Unable to parse input file \'{}\'.'.format(graph_def_file),
+        str(error.exception))
+
+
+class FromSavedModelTest(test_util.TensorFlowTestCase):
+
+  def _createSavedModel(self, shape):
+    """Create a simple SavedModel."""
+    saved_model_dir = os.path.join(self.get_temp_dir(), 'simple_savedmodel')
+    with session.Session() as sess:
+      in_tensor_1 = array_ops.placeholder(
+          shape=shape, dtype=dtypes.float32, name='inputB')
+      in_tensor_2 = array_ops.placeholder(
+          shape=shape, dtype=dtypes.float32, name='inputA')
+      out_tensor = in_tensor_1 + in_tensor_2
+      inputs = {'x': in_tensor_1, 'y': in_tensor_2}
+      outputs = {'z': out_tensor}
+      saved_model.simple_save(sess, saved_model_dir, inputs, outputs)
+    return saved_model_dir
+
+  def testSimpleModel(self):
+    """Test a SavedModel."""
+    saved_model_dir = self._createSavedModel(shape=[1, 16, 16, 3])
+
+    # Convert model and ensure model is not None.
+    converter = lite.TocoConverter.from_saved_model(saved_model_dir)
+    tflite_model = converter.convert()
+    self.assertTrue(tflite_model)
+
+    interpreter = Interpreter(model_content=tflite_model)
+    interpreter.allocate_tensors()
+
+    input_details = interpreter.get_input_details()
+    self.assertEqual(2, len(input_details))
+    self.assertEqual('inputA', input_details[0]['name'])
+    self.assertEqual(np.float32, input_details[0]['dtype'])
+    self.assertTrue(([1, 16, 16, 3] == input_details[0]['shape']).all())
+    self.assertEqual((0., 0.), input_details[0]['quantization'])
+
+    self.assertEqual('inputB', input_details[1]['name'])
+    self.assertEqual(np.float32, input_details[1]['dtype'])
+    self.assertTrue(([1, 16, 16, 3] == input_details[1]['shape']).all())
+    self.assertEqual((0., 0.), input_details[1]['quantization'])
+
+    output_details = interpreter.get_output_details()
+    self.assertEqual(1, len(output_details))
+    self.assertEqual('add', output_details[0]['name'])
+    self.assertEqual(np.float32, output_details[0]['dtype'])
+    self.assertTrue(([1, 16, 16, 3] == output_details[0]['shape']).all())
+    self.assertEqual((0., 0.), output_details[0]['quantization'])
+
+  def testNoneBatchSize(self):
+    """Test a SavedModel, with None in input tensor's shape."""
+    saved_model_dir = self._createSavedModel(shape=[None, 16, 16, 3])
+
+    converter = lite.TocoConverter.from_saved_model(saved_model_dir)
+    tflite_model = converter.convert()
+    self.assertTrue(tflite_model)
+
+    # Check values from converted model.
+    interpreter = Interpreter(model_content=tflite_model)
+    interpreter.allocate_tensors()
+
+    input_details = interpreter.get_input_details()
+    self.assertEqual(2, len(input_details))
+    self.assertEqual('inputA', input_details[0]['name'])
+    self.assertEqual(np.float32, input_details[0]['dtype'])
+    self.assertTrue(([1, 16, 16, 3] == input_details[0]['shape']).all())
+    self.assertEqual((0., 0.), input_details[0]['quantization'])
+
+    self.assertEqual('inputB', input_details[1]['name'])
+    self.assertEqual(np.float32, input_details[1]['dtype'])
+    self.assertTrue(([1, 16, 16, 3] == input_details[1]['shape']).all())
+    self.assertEqual((0., 0.), input_details[1]['quantization'])
+
+    output_details = interpreter.get_output_details()
+    self.assertEqual(1, len(output_details))
+    self.assertEqual('add', output_details[0]['name'])
+    self.assertEqual(np.float32, output_details[0]['dtype'])
+    self.assertTrue(([1, 16, 16, 3] == output_details[0]['shape']).all())
+    self.assertEqual((0., 0.), output_details[0]['quantization'])
+
+  def testOrderInputArrays(self):
+    """Test a SavedModel ordering of input arrays."""
+    saved_model_dir = self._createSavedModel(shape=[1, 16, 16, 3])
+
+    converter = lite.TocoConverter.from_saved_model(
+        saved_model_dir, input_arrays=['inputB', 'inputA'])
+    tflite_model = converter.convert()
+    self.assertTrue(tflite_model)
+
+    # Check values from converted model.
+    interpreter = Interpreter(model_content=tflite_model)
+    interpreter.allocate_tensors()
+
+    input_details = interpreter.get_input_details()
+    self.assertEqual(2, len(input_details))
+    self.assertEqual('inputA', input_details[0]['name'])
+    self.assertEqual(np.float32, input_details[0]['dtype'])
+    self.assertTrue(([1, 16, 16, 3] == input_details[0]['shape']).all())
+    self.assertEqual((0., 0.), input_details[0]['quantization'])
+
+    self.assertEqual('inputB', input_details[1]['name'])
+    self.assertEqual(np.float32, input_details[1]['dtype'])
+    self.assertTrue(([1, 16, 16, 3] == input_details[1]['shape']).all())
+    self.assertEqual((0., 0.), input_details[1]['quantization'])
+
+    output_details = interpreter.get_output_details()
+    self.assertEqual(1, len(output_details))
+    self.assertEqual('add', output_details[0]['name'])
+    self.assertEqual(np.float32, output_details[0]['dtype'])
+    self.assertTrue(([1, 16, 16, 3] == output_details[0]['shape']).all())
+    self.assertEqual((0., 0.), output_details[0]['quantization'])
+
+  def testSubsetInputArrays(self):
+    """Test a SavedModel with a subset of the input array names of the model."""
+    saved_model_dir = self._createSavedModel(shape=[1, 16, 16, 3])
+
+    # Check case where input shape is given.
+    converter = lite.TocoConverter.from_saved_model(
+        saved_model_dir,
+        input_arrays=['inputA'],
+        input_shapes={'inputA': [1, 16, 16, 3]})
+
+    tflite_model = converter.convert()
+    self.assertTrue(tflite_model)
+
+    # Check case where input shape is None.
+    converter = lite.TocoConverter.from_saved_model(
+        saved_model_dir, input_arrays=['inputA'], input_shapes={'inputA': None})
+
+    tflite_model = converter.convert()
+    self.assertTrue(tflite_model)
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/contrib/lite/python/tflite_convert.py b/tensorflow/contrib/lite/python/tflite_convert.py
new file mode 100644
index 0000000000000000000000000000000000000000..f497533bed054d260aefc7b3fe67ae655c7cbcda
--- /dev/null
+++ b/tensorflow/contrib/lite/python/tflite_convert.py
@@ -0,0 +1,364 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Python command line interface for running TOCO."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import argparse
+import os
+import sys
+
+from tensorflow.contrib.lite.python import lite
+from tensorflow.contrib.lite.toco import toco_flags_pb2 as _toco_flags_pb2
+from tensorflow.contrib.lite.toco import types_pb2 as _types_pb2
+from tensorflow.python.platform import app
+
+
+def _parse_array(values):
+  if values:
+    return values.split(",")
+
+
+def _parse_int_array(values):
+  if values:
+    return [int(val) for val in values.split(",")]
+
+
+def _parse_set(values):
+  if values:
+    return set(values.split(","))
+
+
+def _get_toco_converter(flags):
+  """Makes a TocoConverter object based on the flags provided.
+
+  Args:
+    flags: argparse.Namespace object containing TFLite flags.
+
+  Returns:
+    TocoConverter object.
+  """
+  # Parse input and output arrays.
+  input_arrays = _parse_array(flags.input_arrays)
+  input_shapes = None
+  if flags.input_shapes:
+    input_shapes_list = [
+        _parse_int_array(shape) for shape in flags.input_shapes.split(":")
+    ]
+    input_shapes = dict(zip(input_arrays, input_shapes_list))
+  output_arrays = _parse_array(flags.output_arrays)
+
+  converter_kwargs = {
+      "input_arrays": input_arrays,
+      "input_shapes": input_shapes,
+      "output_arrays": output_arrays
+  }
+
+  # Create TocoConverter.
+  if flags.graph_def_file:
+    converter_fn = lite.TocoConverter.from_frozen_graph
+    converter_kwargs["graph_def_file"] = flags.graph_def_file
+  elif flags.saved_model_dir:
+    converter_fn = lite.TocoConverter.from_saved_model
+    converter_kwargs["saved_model_dir"] = flags.saved_model_dir
+    converter_kwargs["tag_set"] = _parse_set(flags.saved_model_tag_set)
+    converter_kwargs["signature_key"] = flags.saved_model_signature_key
+
+  return converter_fn(**converter_kwargs)
+
+
+def _convert_model(flags):
+  """Calls function to convert the TensorFlow model into a TFLite model.
+
+  Args:
+    flags: argparse.Namespace object.
+
+  Raises:
+    ValueError: Invalid flags.
+  """
+  # Create converter.
+  converter = _get_toco_converter(flags)
+  if flags.inference_type:
+    converter.inference_type = _types_pb2.IODataType.Value(flags.inference_type)
+  if flags.inference_input_type:
+    converter.inference_input_type = _types_pb2.IODataType.Value(
+        flags.inference_input_type)
+  if flags.output_format:
+    converter.output_format = _toco_flags_pb2.FileFormat.Value(
+        flags.output_format)
+
+  if flags.mean_values and flags.std_dev_values:
+    input_arrays = converter.get_input_arrays()
+    std_dev_values = _parse_int_array(flags.std_dev_values)
+    mean_values = _parse_int_array(flags.mean_values)
+    quant_stats = zip(mean_values, std_dev_values)
+    if ((not flags.input_arrays and len(input_arrays) > 1) or
+        (len(input_arrays) != len(quant_stats))):
+      raise ValueError("Mismatching --input_arrays, --std_dev_values, and "
+                       "--mean_values. The flags must have the same number of "
+                       "items. The current input arrays are '{0}'. "
+                       "--input_arrays must be present when specifying "
+                       "--std_dev_values and --mean_values with multiple input "
+                       "tensors in order to map between names and "
+                       "values.".format(",".join(input_arrays)))
+    converter.quantized_input_stats = dict(zip(input_arrays, quant_stats))
+  if (flags.default_ranges_min is not None) and (flags.default_ranges_max is
+                                                 not None):
+    converter.default_ranges_stats = (flags.default_ranges_min,
+                                      flags.default_ranges_max)
+
+  if flags.drop_control_dependency:
+    converter.drop_control_dependency = flags.drop_control_dependency
+  if flags.reorder_across_fake_quant:
+    converter.reorder_across_fake_quant = flags.reorder_across_fake_quant
+  if flags.change_concat_input_ranges:
+    converter.change_concat_input_ranges = flags.change_concat_input_ranges
+  if flags.allow_custom_ops:
+    converter.allow_custom_ops = flags.allow_custom_ops
+  if flags.quantize_weights:
+    converter.quantize_weights = flags.quantize_weights
+  if flags.dump_graphviz_dir:
+    converter.dump_graphviz_dir = flags.dump_graphviz_dir
+  if flags.dump_graphviz_video:
+    converter.dump_graphviz_vode = flags.dump_graphviz_video
+
+  # Convert model.
+  output_data = converter.convert()
+  with open(flags.output_file, "wb") as f:
+    f.write(output_data)
+
+
+def _check_flags(flags, unparsed):
+  """Checks the parsed and unparsed flags to ensure they are valid.
+
+  Raises an error if previously support unparsed flags are found. Raises an
+  error for parsed flags that don't meet the required conditions.
+
+  Args:
+    flags: argparse.Namespace object containing TFLite flags.
+    unparsed: List of unparsed flags.
+
+  Raises:
+    ValueError: Invalid flags.
+  """
+
+  # Check unparsed flags for common mistakes based on previous TOCO.
+  def _get_message_unparsed(flag, orig_flag, new_flag):
+    if flag.startswith(orig_flag):
+      return "\n  Use {0} instead of {1}".format(new_flag, orig_flag)
+    return ""
+
+  if unparsed:
+    output = ""
+    for flag in unparsed:
+      output += _get_message_unparsed(flag, "--input_file", "--graph_def_file")
+      output += _get_message_unparsed(flag, "--savedmodel_directory",
+                                      "--saved_model_dir")
+      output += _get_message_unparsed(flag, "--std_value", "--std_dev_values")
+      output += _get_message_unparsed(flag, "--batch_size", "--input_shapes")
+      output += _get_message_unparsed(flag, "--dump_graphviz",
+                                      "--dump_graphviz_dir")
+    if output:
+      raise ValueError(output)
+
+  # Check that flags are valid.
+  if flags.graph_def_file and (not flags.input_arrays or
+                               not flags.output_arrays):
+    raise ValueError("--input_arrays and --output_arrays are required with "
+                     "--graph_def_file")
+
+  if flags.input_shapes:
+    if not flags.input_arrays:
+      raise ValueError("--input_shapes must be used with --input_arrays")
+    if flags.input_shapes.count(":") != flags.input_arrays.count(","):
+      raise ValueError("--input_shapes and --input_arrays must have the same "
+                       "number of items")
+
+  if flags.std_dev_values or flags.mean_values:
+    if bool(flags.std_dev_values) != bool(flags.mean_values):
+      raise ValueError("--std_dev_values and --mean_values must be used "
+                       "together")
+    if flags.std_dev_values.count(",") != flags.mean_values.count(","):
+      raise ValueError("--std_dev_values, --mean_values must have the same "
+                       "number of items")
+
+  if (flags.default_ranges_min is None) != (flags.default_ranges_max is None):
+    raise ValueError("--default_ranges_min and --default_ranges_max must be "
+                     "used together")
+
+
+def run_main(_):
+  """Main in toco_convert.py."""
+  parser = argparse.ArgumentParser(
+      description=("Command line tool to run TensorFlow Lite Optimizing "
+                   "Converter (TOCO)."))
+
+  # Output file flag.
+  parser.add_argument(
+      "--output_file",
+      type=str,
+      help="Full filepath of the output file.",
+      required=True)
+
+  # Input file flags.
+  input_file_group = parser.add_mutually_exclusive_group(required=True)
+  input_file_group.add_argument(
+      "--graph_def_file",
+      type=str,
+      help="Full filepath of file containing TensorFlow GraphDef.")
+  input_file_group.add_argument(
+      "--saved_model_dir",
+      type=str,
+      help="Full filepath of directory containing the SavedModel.")
+
+  # Model format flags.
+  parser.add_argument(
+      "--output_format",
+      type=str.upper,
+      choices=["TFLITE", "GRAPHVIZ_DOT"],
+      help="Output file format.")
+  parser.add_argument(
+      "--inference_type",
+      type=str.upper,
+      choices=["FLOAT", "QUANTIZED_UINT8"],
+      help="Target data type of arrays in the output file.")
+  parser.add_argument(
+      "--inference_input_type",
+      type=str.upper,
+      choices=["FLOAT", "QUANTIZED_UINT8"],
+      help=("Target data type of input arrays. Allows for a different type for "
+            "input arrays in the case of quantization."))
+
+  # Input and output arrays flags.
+  parser.add_argument(
+      "--input_arrays",
+      type=str,
+      help="Names of the output arrays, comma-separated.")
+  parser.add_argument(
+      "--input_shapes",
+      type=str,
+      help="Shapes corresponding to --input_arrays, colon-separated.")
+  parser.add_argument(
+      "--output_arrays",
+      type=str,
+      help="Names of the output arrays, comma-separated.")
+
+  # SavedModel related flags.
+  parser.add_argument(
+      "--saved_model_tag_set",
+      type=str,
+      help=("Comma-separated set of tags identifying the MetaGraphDef within "
+            "the SavedModel to analyze. All tags must be present. "
+            "(default \"serve\")"))
+  parser.add_argument(
+      "--saved_model_signature_key",
+      type=str,
+      help=("Key identifying the SignatureDef containing inputs and outputs. "
+            "(default DEFAULT_SERVING_SIGNATURE_DEF_KEY)"))
+
+  # Quantization flags.
+  parser.add_argument(
+      "--std_dev_values",
+      type=str,
+      help=("Standard deviation of training data for each input tensor, "
+            "comma-separated. Used for quantization. (default None)"))
+  parser.add_argument(
+      "--mean_values",
+      type=str,
+      help=("Mean of training data for each input tensor, comma-separated. "
+            "Used for quantization. (default None)"))
+  parser.add_argument(
+      "--default_ranges_min",
+      type=int,
+      help=("Default value for min bound of min/max range values used for all "
+            "arrays without a specified range, Intended for experimenting with "
+            "quantization via \"dummy quantization\". (default None)"))
+  parser.add_argument(
+      "--default_ranges_max",
+      type=int,
+      help=("Default value for max bound of min/max range values used for all "
+            "arrays without a specified range, Intended for experimenting with "
+            "quantization via \"dummy quantization\". (default None)"))
+  parser.add_argument(
+      "--quantize_weights",
+      type=bool,
+      help=("Store float weights as quantized weights followed by dequantize "
+            "operations. Inference is still done in FLOAT, but reduces model "
+            "size (at the cost of accuracy and latency)."))
+
+  # Graph manipulation flags.
+  parser.add_argument(
+      "--drop_control_dependency",
+      action="store_true",
+      help=("Boolean indicating whether to drop control dependencies silently. "
+            "This is due to TensorFlow not supporting control dependencies. "
+            "(default True)"))
+  parser.add_argument(
+      "--reorder_across_fake_quant",
+      action="store_true",
+      help=("Boolean indicating whether to reorder FakeQuant nodes in "
+            "unexpected locations. Used when the location of the FakeQuant "
+            "nodes is preventing graph transformations necessary to convert "
+            "the graph. Results in a graph that differs from the quantized "
+            "training graph, potentially causing differing arithmetic "
+            "behavior. (default False)"))
+  parser.add_argument(
+      "--change_concat_input_ranges",
+      action="store_true",
+      help=("Boolean to change behavior of min/max ranges for inputs and "
+            "outputs of the concat operator for quantized models. Changes the "
+            "ranges of concat operator overlap when true. (default False)"))
+  parser.add_argument(
+      "--allow_custom_ops",
+      action="store_true",
+      help=("Boolean indicating whether to allow custom operations. When false "
+            "any unknown operation is an error. When true, custom ops are "
+            "created for any op that is unknown. The developer will need to "
+            "provide these to the TensorFlow Lite runtime with a custom "
+            "resolver. (default False)"))
+
+  # Logging flags.
+  parser.add_argument(
+      "--dump_graphviz_dir",
+      type=str,
+      help=("Full filepath of folder to dump the graphs at various stages of "
+            "processing GraphViz .dot files. Preferred over --output_format="
+            "GRAPHVIZ_DOT in order to keep the requirements of the output "
+            "file."))
+  parser.add_argument(
+      "--dump_graphviz_video",
+      action="store_true",
+      help=("Boolean indicating whether to dump the graph after every graph "
+            "transformation"))
+
+  tflite_flags, unparsed = parser.parse_known_args(args=sys.argv[1:])
+  try:
+    _check_flags(tflite_flags, unparsed)
+  except ValueError as e:
+    parser.print_usage()
+    file_name = os.path.basename(sys.argv[0])
+    sys.stderr.write("{0}: error: {1}\n".format(file_name, str(e)))
+    sys.exit(1)
+  _convert_model(tflite_flags)
+
+
+def main():
+  app.run(main=run_main, argv=sys.argv[:1])
+
+
+if __name__ == "__main__":
+  main()
diff --git a/tensorflow/contrib/lite/schema/builtin_ops_header/generator.cc b/tensorflow/contrib/lite/schema/builtin_ops_header/generator.cc
index ac408d2f94b98d505afe4c951d7cc2ff960606fb..9dc8daa227dd68ccde2efa4013ac4465a72e6bb0 100644
--- a/tensorflow/contrib/lite/schema/builtin_ops_header/generator.cc
+++ b/tensorflow/contrib/lite/schema/builtin_ops_header/generator.cc
@@ -39,7 +39,7 @@ limitations under the License.
 #define TENSORFLOW_CONTRIB_LITE_BUILTIN_OPS_H_
 
 // DO NOT EDIT MANUALLY: This file is automatically generated by
-// `schema_builtin_ops_header_generator.py`.
+// `schema/builtin_ops_header/generator.cc`.
 
 #ifdef __cplusplus
 extern "C" {
@@ -57,7 +57,6 @@ const char* kFileFooter =
 }  // extern "C"
 #endif  // __cplusplus
 #endif  // TENSORFLOW_CONTRIB_LITE_BUILTIN_OPS_H_
-}
 )";
 }  // anonymous namespace
 
diff --git a/tensorflow/contrib/lite/schema/schema.fbs b/tensorflow/contrib/lite/schema/schema.fbs
index 8bdeb035f5a778fa3b0d85da36d6b8d6721445ea..ee5208df1456d01f1a99ecc69722f5fb4ab0763a 100644
--- a/tensorflow/contrib/lite/schema/schema.fbs
+++ b/tensorflow/contrib/lite/schema/schema.fbs
@@ -145,6 +145,12 @@ enum BuiltinOperator : byte {
   SLICE = 65,
   SIN = 66,
   TRANSPOSE_CONV = 67,
+  SPARSE_TO_DENSE = 68,
+  TILE = 69,
+  EXPAND_DIMS = 70,
+  EQUAL = 71,
+  NOT_EQUAL = 72,
+  LOG = 73,
 }
 
 // Options for the builtin operators.
@@ -198,6 +204,11 @@ union BuiltinOptions {
   SelectOptions,
   SliceOptions,
   TransposeConvOptions,
+  SparseToDenseOptions,
+  TileOptions,
+  ExpandDimsOptions,
+  EqualOptions,
+  NotEqualOptions,
 }
 
 enum Padding : byte { SAME, VALID }
@@ -309,11 +320,23 @@ table LocalResponseNormalizationOptions {
   beta:float;
 }
 
+enum LSTMKernelType : byte {
+  // Full LSTM kernel which supports peephole and projection.
+  FULL = 0,
+  // Basic LSTM kernels. Equivalent to TensorFlow BasicLSTMCell.
+  BASIC = 1,
+}
+
 // An implementation of TensorFlow LSTMCell and CoupledInputForgetGateLSTMCell
 table LSTMOptions {
+  // Parameters for LSTM version 1 or above.
   fused_activation_function:ActivationFunctionType;
   cell_clip: float; // Optional, 0.0 means no clipping
   proj_clip: float; // Optional, 0.0 means no clipping
+
+  // Parameters for LSTM version 2 or above.
+  // Basic kernel is only supported in version 2 or above.
+  kernel_type: LSTMKernelType = FULL;
 }
 
 table ResizeBilinearOptions {
@@ -419,6 +442,9 @@ table DequantizeOptions {
 table MaximumMinimumOptions {
 }
 
+table TileOptions {
+}
+
 table ArgMaxOptions {
   output_type : TensorType;
 }
@@ -450,6 +476,19 @@ table TransposeConvOptions {
   stride_h:int;
 }
 
+table ExpandDimsOptions {
+}
+
+table SparseToDenseOptions {
+  validate_indices:bool;
+}
+
+table EqualOptions {
+}
+
+table NotEqualOptions {
+}
+
 // An OperatorCode can be an enum value (BuiltinOperator) if the operator is a
 // builtin, or a string if the operator is custom.
 table OperatorCode {
diff --git a/tensorflow/contrib/lite/schema/schema_generated.h b/tensorflow/contrib/lite/schema/schema_generated.h
index 35c34f53a6bf9716941f623b43f238c681252747..887e47ed1ea309d025d4be8745ffb8da06e8ee6b 100755
--- a/tensorflow/contrib/lite/schema/schema_generated.h
+++ b/tensorflow/contrib/lite/schema/schema_generated.h
@@ -151,6 +151,9 @@ struct DequantizeOptionsT;
 struct MaximumMinimumOptions;
 struct MaximumMinimumOptionsT;
 
+struct TileOptions;
+struct TileOptionsT;
+
 struct ArgMaxOptions;
 struct ArgMaxOptionsT;
 
@@ -178,6 +181,18 @@ struct SliceOptionsT;
 struct TransposeConvOptions;
 struct TransposeConvOptionsT;
 
+struct ExpandDimsOptions;
+struct ExpandDimsOptionsT;
+
+struct SparseToDenseOptions;
+struct SparseToDenseOptionsT;
+
+struct EqualOptions;
+struct EqualOptionsT;
+
+struct NotEqualOptions;
+struct NotEqualOptionsT;
+
 struct OperatorCode;
 struct OperatorCodeT;
 
@@ -305,11 +320,17 @@ enum BuiltinOperator {
   BuiltinOperator_SLICE = 65,
   BuiltinOperator_SIN = 66,
   BuiltinOperator_TRANSPOSE_CONV = 67,
+  BuiltinOperator_SPARSE_TO_DENSE = 68,
+  BuiltinOperator_TILE = 69,
+  BuiltinOperator_EXPAND_DIMS = 70,
+  BuiltinOperator_EQUAL = 71,
+  BuiltinOperator_NOT_EQUAL = 72,
+  BuiltinOperator_LOG = 73,
   BuiltinOperator_MIN = BuiltinOperator_ADD,
-  BuiltinOperator_MAX = BuiltinOperator_TRANSPOSE_CONV
+  BuiltinOperator_MAX = BuiltinOperator_LOG
 };
 
-inline BuiltinOperator (&EnumValuesBuiltinOperator())[67] {
+inline BuiltinOperator (&EnumValuesBuiltinOperator())[73] {
   static BuiltinOperator values[] = {
     BuiltinOperator_ADD,
     BuiltinOperator_AVERAGE_POOL_2D,
@@ -377,7 +398,13 @@ inline BuiltinOperator (&EnumValuesBuiltinOperator())[67] {
     BuiltinOperator_SELECT,
     BuiltinOperator_SLICE,
     BuiltinOperator_SIN,
-    BuiltinOperator_TRANSPOSE_CONV
+    BuiltinOperator_TRANSPOSE_CONV,
+    BuiltinOperator_SPARSE_TO_DENSE,
+    BuiltinOperator_TILE,
+    BuiltinOperator_EXPAND_DIMS,
+    BuiltinOperator_EQUAL,
+    BuiltinOperator_NOT_EQUAL,
+    BuiltinOperator_LOG
   };
   return values;
 }
@@ -452,6 +479,12 @@ inline const char **EnumNamesBuiltinOperator() {
     "SLICE",
     "SIN",
     "TRANSPOSE_CONV",
+    "SPARSE_TO_DENSE",
+    "TILE",
+    "EXPAND_DIMS",
+    "EQUAL",
+    "NOT_EQUAL",
+    "LOG",
     nullptr
   };
   return names;
@@ -513,11 +546,16 @@ enum BuiltinOptions {
   BuiltinOptions_SelectOptions = 47,
   BuiltinOptions_SliceOptions = 48,
   BuiltinOptions_TransposeConvOptions = 49,
+  BuiltinOptions_SparseToDenseOptions = 50,
+  BuiltinOptions_TileOptions = 51,
+  BuiltinOptions_ExpandDimsOptions = 52,
+  BuiltinOptions_EqualOptions = 53,
+  BuiltinOptions_NotEqualOptions = 54,
   BuiltinOptions_MIN = BuiltinOptions_NONE,
-  BuiltinOptions_MAX = BuiltinOptions_TransposeConvOptions
+  BuiltinOptions_MAX = BuiltinOptions_NotEqualOptions
 };
 
-inline BuiltinOptions (&EnumValuesBuiltinOptions())[50] {
+inline BuiltinOptions (&EnumValuesBuiltinOptions())[55] {
   static BuiltinOptions values[] = {
     BuiltinOptions_NONE,
     BuiltinOptions_Conv2DOptions,
@@ -568,7 +606,12 @@ inline BuiltinOptions (&EnumValuesBuiltinOptions())[50] {
     BuiltinOptions_LessEqualOptions,
     BuiltinOptions_SelectOptions,
     BuiltinOptions_SliceOptions,
-    BuiltinOptions_TransposeConvOptions
+    BuiltinOptions_TransposeConvOptions,
+    BuiltinOptions_SparseToDenseOptions,
+    BuiltinOptions_TileOptions,
+    BuiltinOptions_ExpandDimsOptions,
+    BuiltinOptions_EqualOptions,
+    BuiltinOptions_NotEqualOptions
   };
   return values;
 }
@@ -625,6 +668,11 @@ inline const char **EnumNamesBuiltinOptions() {
     "SelectOptions",
     "SliceOptions",
     "TransposeConvOptions",
+    "SparseToDenseOptions",
+    "TileOptions",
+    "ExpandDimsOptions",
+    "EqualOptions",
+    "NotEqualOptions",
     nullptr
   };
   return names;
@@ -835,6 +883,26 @@ template<> struct BuiltinOptionsTraits<TransposeConvOptions> {
   static const BuiltinOptions enum_value = BuiltinOptions_TransposeConvOptions;
 };
 
+template<> struct BuiltinOptionsTraits<SparseToDenseOptions> {
+  static const BuiltinOptions enum_value = BuiltinOptions_SparseToDenseOptions;
+};
+
+template<> struct BuiltinOptionsTraits<TileOptions> {
+  static const BuiltinOptions enum_value = BuiltinOptions_TileOptions;
+};
+
+template<> struct BuiltinOptionsTraits<ExpandDimsOptions> {
+  static const BuiltinOptions enum_value = BuiltinOptions_ExpandDimsOptions;
+};
+
+template<> struct BuiltinOptionsTraits<EqualOptions> {
+  static const BuiltinOptions enum_value = BuiltinOptions_EqualOptions;
+};
+
+template<> struct BuiltinOptionsTraits<NotEqualOptions> {
+  static const BuiltinOptions enum_value = BuiltinOptions_NotEqualOptions;
+};
+
 struct BuiltinOptionsUnion {
   BuiltinOptions type;
   void *value;
@@ -1258,6 +1326,46 @@ struct BuiltinOptionsUnion {
     return type == BuiltinOptions_TransposeConvOptions ?
       reinterpret_cast<const TransposeConvOptionsT *>(value) : nullptr;
   }
+  SparseToDenseOptionsT *AsSparseToDenseOptions() {
+    return type == BuiltinOptions_SparseToDenseOptions ?
+      reinterpret_cast<SparseToDenseOptionsT *>(value) : nullptr;
+  }
+  const SparseToDenseOptionsT *AsSparseToDenseOptions() const {
+    return type == BuiltinOptions_SparseToDenseOptions ?
+      reinterpret_cast<const SparseToDenseOptionsT *>(value) : nullptr;
+  }
+  TileOptionsT *AsTileOptions() {
+    return type == BuiltinOptions_TileOptions ?
+      reinterpret_cast<TileOptionsT *>(value) : nullptr;
+  }
+  const TileOptionsT *AsTileOptions() const {
+    return type == BuiltinOptions_TileOptions ?
+      reinterpret_cast<const TileOptionsT *>(value) : nullptr;
+  }
+  ExpandDimsOptionsT *AsExpandDimsOptions() {
+    return type == BuiltinOptions_ExpandDimsOptions ?
+      reinterpret_cast<ExpandDimsOptionsT *>(value) : nullptr;
+  }
+  const ExpandDimsOptionsT *AsExpandDimsOptions() const {
+    return type == BuiltinOptions_ExpandDimsOptions ?
+      reinterpret_cast<const ExpandDimsOptionsT *>(value) : nullptr;
+  }
+  EqualOptionsT *AsEqualOptions() {
+    return type == BuiltinOptions_EqualOptions ?
+      reinterpret_cast<EqualOptionsT *>(value) : nullptr;
+  }
+  const EqualOptionsT *AsEqualOptions() const {
+    return type == BuiltinOptions_EqualOptions ?
+      reinterpret_cast<const EqualOptionsT *>(value) : nullptr;
+  }
+  NotEqualOptionsT *AsNotEqualOptions() {
+    return type == BuiltinOptions_NotEqualOptions ?
+      reinterpret_cast<NotEqualOptionsT *>(value) : nullptr;
+  }
+  const NotEqualOptionsT *AsNotEqualOptions() const {
+    return type == BuiltinOptions_NotEqualOptions ?
+      reinterpret_cast<const NotEqualOptionsT *>(value) : nullptr;
+  }
 };
 
 bool VerifyBuiltinOptions(flatbuffers::Verifier &verifier, const void *obj, BuiltinOptions type);
@@ -1365,6 +1473,35 @@ inline const char *EnumNameLSHProjectionType(LSHProjectionType e) {
   return EnumNamesLSHProjectionType()[index];
 }
 
+enum LSTMKernelType {
+  LSTMKernelType_FULL = 0,
+  LSTMKernelType_BASIC = 1,
+  LSTMKernelType_MIN = LSTMKernelType_FULL,
+  LSTMKernelType_MAX = LSTMKernelType_BASIC
+};
+
+inline LSTMKernelType (&EnumValuesLSTMKernelType())[2] {
+  static LSTMKernelType values[] = {
+    LSTMKernelType_FULL,
+    LSTMKernelType_BASIC
+  };
+  return values;
+}
+
+inline const char **EnumNamesLSTMKernelType() {
+  static const char *names[] = {
+    "FULL",
+    "BASIC",
+    nullptr
+  };
+  return names;
+}
+
+inline const char *EnumNameLSTMKernelType(LSTMKernelType e) {
+  const size_t index = static_cast<int>(e);
+  return EnumNamesLSTMKernelType()[index];
+}
+
 enum CombinerType {
   CombinerType_SUM = 0,
   CombinerType_MEAN = 1,
@@ -2802,10 +2939,12 @@ struct LSTMOptionsT : public flatbuffers::NativeTable {
   ActivationFunctionType fused_activation_function;
   float cell_clip;
   float proj_clip;
+  LSTMKernelType kernel_type;
   LSTMOptionsT()
       : fused_activation_function(ActivationFunctionType_NONE),
         cell_clip(0.0f),
-        proj_clip(0.0f) {
+        proj_clip(0.0f),
+        kernel_type(LSTMKernelType_FULL) {
   }
 };
 
@@ -2814,7 +2953,8 @@ struct LSTMOptions FLATBUFFERS_FINAL_CLASS : private flatbuffers::Table {
   enum {
     VT_FUSED_ACTIVATION_FUNCTION = 4,
     VT_CELL_CLIP = 6,
-    VT_PROJ_CLIP = 8
+    VT_PROJ_CLIP = 8,
+    VT_KERNEL_TYPE = 10
   };
   ActivationFunctionType fused_activation_function() const {
     return static_cast<ActivationFunctionType>(GetField<int8_t>(VT_FUSED_ACTIVATION_FUNCTION, 0));
@@ -2825,11 +2965,15 @@ struct LSTMOptions FLATBUFFERS_FINAL_CLASS : private flatbuffers::Table {
   float proj_clip() const {
     return GetField<float>(VT_PROJ_CLIP, 0.0f);
   }
+  LSTMKernelType kernel_type() const {
+    return static_cast<LSTMKernelType>(GetField<int8_t>(VT_KERNEL_TYPE, 0));
+  }
   bool Verify(flatbuffers::Verifier &verifier) const {
     return VerifyTableStart(verifier) &&
            VerifyField<int8_t>(verifier, VT_FUSED_ACTIVATION_FUNCTION) &&
            VerifyField<float>(verifier, VT_CELL_CLIP) &&
            VerifyField<float>(verifier, VT_PROJ_CLIP) &&
+           VerifyField<int8_t>(verifier, VT_KERNEL_TYPE) &&
            verifier.EndTable();
   }
   LSTMOptionsT *UnPack(const flatbuffers::resolver_function_t *_resolver = nullptr) const;
@@ -2849,6 +2993,9 @@ struct LSTMOptionsBuilder {
   void add_proj_clip(float proj_clip) {
     fbb_.AddElement<float>(LSTMOptions::VT_PROJ_CLIP, proj_clip, 0.0f);
   }
+  void add_kernel_type(LSTMKernelType kernel_type) {
+    fbb_.AddElement<int8_t>(LSTMOptions::VT_KERNEL_TYPE, static_cast<int8_t>(kernel_type), 0);
+  }
   explicit LSTMOptionsBuilder(flatbuffers::FlatBufferBuilder &_fbb)
         : fbb_(_fbb) {
     start_ = fbb_.StartTable();
@@ -2865,10 +3012,12 @@ inline flatbuffers::Offset<LSTMOptions> CreateLSTMOptions(
     flatbuffers::FlatBufferBuilder &_fbb,
     ActivationFunctionType fused_activation_function = ActivationFunctionType_NONE,
     float cell_clip = 0.0f,
-    float proj_clip = 0.0f) {
+    float proj_clip = 0.0f,
+    LSTMKernelType kernel_type = LSTMKernelType_FULL) {
   LSTMOptionsBuilder builder_(_fbb);
   builder_.add_proj_clip(proj_clip);
   builder_.add_cell_clip(cell_clip);
+  builder_.add_kernel_type(kernel_type);
   builder_.add_fused_activation_function(fused_activation_function);
   return builder_.Finish();
 }
@@ -4131,6 +4280,46 @@ inline flatbuffers::Offset<MaximumMinimumOptions> CreateMaximumMinimumOptions(
 
 flatbuffers::Offset<MaximumMinimumOptions> CreateMaximumMinimumOptions(flatbuffers::FlatBufferBuilder &_fbb, const MaximumMinimumOptionsT *_o, const flatbuffers::rehasher_function_t *_rehasher = nullptr);
 
+struct TileOptionsT : public flatbuffers::NativeTable {
+  typedef TileOptions TableType;
+  TileOptionsT() {
+  }
+};
+
+struct TileOptions FLATBUFFERS_FINAL_CLASS : private flatbuffers::Table {
+  typedef TileOptionsT NativeTableType;
+  bool Verify(flatbuffers::Verifier &verifier) const {
+    return VerifyTableStart(verifier) &&
+           verifier.EndTable();
+  }
+  TileOptionsT *UnPack(const flatbuffers::resolver_function_t *_resolver = nullptr) const;
+  void UnPackTo(TileOptionsT *_o, const flatbuffers::resolver_function_t *_resolver = nullptr) const;
+  static flatbuffers::Offset<TileOptions> Pack(flatbuffers::FlatBufferBuilder &_fbb, const TileOptionsT* _o, const flatbuffers::rehasher_function_t *_rehasher = nullptr);
+};
+
+struct TileOptionsBuilder {
+  flatbuffers::FlatBufferBuilder &fbb_;
+  flatbuffers::uoffset_t start_;
+  explicit TileOptionsBuilder(flatbuffers::FlatBufferBuilder &_fbb)
+        : fbb_(_fbb) {
+    start_ = fbb_.StartTable();
+  }
+  TileOptionsBuilder &operator=(const TileOptionsBuilder &);
+  flatbuffers::Offset<TileOptions> Finish() {
+    const auto end = fbb_.EndTable(start_);
+    auto o = flatbuffers::Offset<TileOptions>(end);
+    return o;
+  }
+};
+
+inline flatbuffers::Offset<TileOptions> CreateTileOptions(
+    flatbuffers::FlatBufferBuilder &_fbb) {
+  TileOptionsBuilder builder_(_fbb);
+  return builder_.Finish();
+}
+
+flatbuffers::Offset<TileOptions> CreateTileOptions(flatbuffers::FlatBufferBuilder &_fbb, const TileOptionsT *_o, const flatbuffers::rehasher_function_t *_rehasher = nullptr);
+
 struct ArgMaxOptionsT : public flatbuffers::NativeTable {
   typedef ArgMaxOptions TableType;
   TensorType output_type;
@@ -4543,6 +4732,180 @@ inline flatbuffers::Offset<TransposeConvOptions> CreateTransposeConvOptions(
 
 flatbuffers::Offset<TransposeConvOptions> CreateTransposeConvOptions(flatbuffers::FlatBufferBuilder &_fbb, const TransposeConvOptionsT *_o, const flatbuffers::rehasher_function_t *_rehasher = nullptr);
 
+struct ExpandDimsOptionsT : public flatbuffers::NativeTable {
+  typedef ExpandDimsOptions TableType;
+  ExpandDimsOptionsT() {
+  }
+};
+
+struct ExpandDimsOptions FLATBUFFERS_FINAL_CLASS : private flatbuffers::Table {
+  typedef ExpandDimsOptionsT NativeTableType;
+  bool Verify(flatbuffers::Verifier &verifier) const {
+    return VerifyTableStart(verifier) &&
+           verifier.EndTable();
+  }
+  ExpandDimsOptionsT *UnPack(const flatbuffers::resolver_function_t *_resolver = nullptr) const;
+  void UnPackTo(ExpandDimsOptionsT *_o, const flatbuffers::resolver_function_t *_resolver = nullptr) const;
+  static flatbuffers::Offset<ExpandDimsOptions> Pack(flatbuffers::FlatBufferBuilder &_fbb, const ExpandDimsOptionsT* _o, const flatbuffers::rehasher_function_t *_rehasher = nullptr);
+};
+
+struct ExpandDimsOptionsBuilder {
+  flatbuffers::FlatBufferBuilder &fbb_;
+  flatbuffers::uoffset_t start_;
+  explicit ExpandDimsOptionsBuilder(flatbuffers::FlatBufferBuilder &_fbb)
+        : fbb_(_fbb) {
+    start_ = fbb_.StartTable();
+  }
+  ExpandDimsOptionsBuilder &operator=(const ExpandDimsOptionsBuilder &);
+  flatbuffers::Offset<ExpandDimsOptions> Finish() {
+    const auto end = fbb_.EndTable(start_);
+    auto o = flatbuffers::Offset<ExpandDimsOptions>(end);
+    return o;
+  }
+};
+
+inline flatbuffers::Offset<ExpandDimsOptions> CreateExpandDimsOptions(
+    flatbuffers::FlatBufferBuilder &_fbb) {
+  ExpandDimsOptionsBuilder builder_(_fbb);
+  return builder_.Finish();
+}
+
+flatbuffers::Offset<ExpandDimsOptions> CreateExpandDimsOptions(flatbuffers::FlatBufferBuilder &_fbb, const ExpandDimsOptionsT *_o, const flatbuffers::rehasher_function_t *_rehasher = nullptr);
+
+struct SparseToDenseOptionsT : public flatbuffers::NativeTable {
+  typedef SparseToDenseOptions TableType;
+  bool validate_indices;
+  SparseToDenseOptionsT()
+      : validate_indices(false) {
+  }
+};
+
+struct SparseToDenseOptions FLATBUFFERS_FINAL_CLASS : private flatbuffers::Table {
+  typedef SparseToDenseOptionsT NativeTableType;
+  enum {
+    VT_VALIDATE_INDICES = 4
+  };
+  bool validate_indices() const {
+    return GetField<uint8_t>(VT_VALIDATE_INDICES, 0) != 0;
+  }
+  bool Verify(flatbuffers::Verifier &verifier) const {
+    return VerifyTableStart(verifier) &&
+           VerifyField<uint8_t>(verifier, VT_VALIDATE_INDICES) &&
+           verifier.EndTable();
+  }
+  SparseToDenseOptionsT *UnPack(const flatbuffers::resolver_function_t *_resolver = nullptr) const;
+  void UnPackTo(SparseToDenseOptionsT *_o, const flatbuffers::resolver_function_t *_resolver = nullptr) const;
+  static flatbuffers::Offset<SparseToDenseOptions> Pack(flatbuffers::FlatBufferBuilder &_fbb, const SparseToDenseOptionsT* _o, const flatbuffers::rehasher_function_t *_rehasher = nullptr);
+};
+
+struct SparseToDenseOptionsBuilder {
+  flatbuffers::FlatBufferBuilder &fbb_;
+  flatbuffers::uoffset_t start_;
+  void add_validate_indices(bool validate_indices) {
+    fbb_.AddElement<uint8_t>(SparseToDenseOptions::VT_VALIDATE_INDICES, static_cast<uint8_t>(validate_indices), 0);
+  }
+  explicit SparseToDenseOptionsBuilder(flatbuffers::FlatBufferBuilder &_fbb)
+        : fbb_(_fbb) {
+    start_ = fbb_.StartTable();
+  }
+  SparseToDenseOptionsBuilder &operator=(const SparseToDenseOptionsBuilder &);
+  flatbuffers::Offset<SparseToDenseOptions> Finish() {
+    const auto end = fbb_.EndTable(start_);
+    auto o = flatbuffers::Offset<SparseToDenseOptions>(end);
+    return o;
+  }
+};
+
+inline flatbuffers::Offset<SparseToDenseOptions> CreateSparseToDenseOptions(
+    flatbuffers::FlatBufferBuilder &_fbb,
+    bool validate_indices = false) {
+  SparseToDenseOptionsBuilder builder_(_fbb);
+  builder_.add_validate_indices(validate_indices);
+  return builder_.Finish();
+}
+
+flatbuffers::Offset<SparseToDenseOptions> CreateSparseToDenseOptions(flatbuffers::FlatBufferBuilder &_fbb, const SparseToDenseOptionsT *_o, const flatbuffers::rehasher_function_t *_rehasher = nullptr);
+
+struct EqualOptionsT : public flatbuffers::NativeTable {
+  typedef EqualOptions TableType;
+  EqualOptionsT() {
+  }
+};
+
+struct EqualOptions FLATBUFFERS_FINAL_CLASS : private flatbuffers::Table {
+  typedef EqualOptionsT NativeTableType;
+  bool Verify(flatbuffers::Verifier &verifier) const {
+    return VerifyTableStart(verifier) &&
+           verifier.EndTable();
+  }
+  EqualOptionsT *UnPack(const flatbuffers::resolver_function_t *_resolver = nullptr) const;
+  void UnPackTo(EqualOptionsT *_o, const flatbuffers::resolver_function_t *_resolver = nullptr) const;
+  static flatbuffers::Offset<EqualOptions> Pack(flatbuffers::FlatBufferBuilder &_fbb, const EqualOptionsT* _o, const flatbuffers::rehasher_function_t *_rehasher = nullptr);
+};
+
+struct EqualOptionsBuilder {
+  flatbuffers::FlatBufferBuilder &fbb_;
+  flatbuffers::uoffset_t start_;
+  explicit EqualOptionsBuilder(flatbuffers::FlatBufferBuilder &_fbb)
+        : fbb_(_fbb) {
+    start_ = fbb_.StartTable();
+  }
+  EqualOptionsBuilder &operator=(const EqualOptionsBuilder &);
+  flatbuffers::Offset<EqualOptions> Finish() {
+    const auto end = fbb_.EndTable(start_);
+    auto o = flatbuffers::Offset<EqualOptions>(end);
+    return o;
+  }
+};
+
+inline flatbuffers::Offset<EqualOptions> CreateEqualOptions(
+    flatbuffers::FlatBufferBuilder &_fbb) {
+  EqualOptionsBuilder builder_(_fbb);
+  return builder_.Finish();
+}
+
+flatbuffers::Offset<EqualOptions> CreateEqualOptions(flatbuffers::FlatBufferBuilder &_fbb, const EqualOptionsT *_o, const flatbuffers::rehasher_function_t *_rehasher = nullptr);
+
+struct NotEqualOptionsT : public flatbuffers::NativeTable {
+  typedef NotEqualOptions TableType;
+  NotEqualOptionsT() {
+  }
+};
+
+struct NotEqualOptions FLATBUFFERS_FINAL_CLASS : private flatbuffers::Table {
+  typedef NotEqualOptionsT NativeTableType;
+  bool Verify(flatbuffers::Verifier &verifier) const {
+    return VerifyTableStart(verifier) &&
+           verifier.EndTable();
+  }
+  NotEqualOptionsT *UnPack(const flatbuffers::resolver_function_t *_resolver = nullptr) const;
+  void UnPackTo(NotEqualOptionsT *_o, const flatbuffers::resolver_function_t *_resolver = nullptr) const;
+  static flatbuffers::Offset<NotEqualOptions> Pack(flatbuffers::FlatBufferBuilder &_fbb, const NotEqualOptionsT* _o, const flatbuffers::rehasher_function_t *_rehasher = nullptr);
+};
+
+struct NotEqualOptionsBuilder {
+  flatbuffers::FlatBufferBuilder &fbb_;
+  flatbuffers::uoffset_t start_;
+  explicit NotEqualOptionsBuilder(flatbuffers::FlatBufferBuilder &_fbb)
+        : fbb_(_fbb) {
+    start_ = fbb_.StartTable();
+  }
+  NotEqualOptionsBuilder &operator=(const NotEqualOptionsBuilder &);
+  flatbuffers::Offset<NotEqualOptions> Finish() {
+    const auto end = fbb_.EndTable(start_);
+    auto o = flatbuffers::Offset<NotEqualOptions>(end);
+    return o;
+  }
+};
+
+inline flatbuffers::Offset<NotEqualOptions> CreateNotEqualOptions(
+    flatbuffers::FlatBufferBuilder &_fbb) {
+  NotEqualOptionsBuilder builder_(_fbb);
+  return builder_.Finish();
+}
+
+flatbuffers::Offset<NotEqualOptions> CreateNotEqualOptions(flatbuffers::FlatBufferBuilder &_fbb, const NotEqualOptionsT *_o, const flatbuffers::rehasher_function_t *_rehasher = nullptr);
+
 struct OperatorCodeT : public flatbuffers::NativeTable {
   typedef OperatorCode TableType;
   BuiltinOperator builtin_code;
@@ -4821,6 +5184,21 @@ struct Operator FLATBUFFERS_FINAL_CLASS : private flatbuffers::Table {
   const TransposeConvOptions *builtin_options_as_TransposeConvOptions() const {
     return builtin_options_type() == BuiltinOptions_TransposeConvOptions ? static_cast<const TransposeConvOptions *>(builtin_options()) : nullptr;
   }
+  const SparseToDenseOptions *builtin_options_as_SparseToDenseOptions() const {
+    return builtin_options_type() == BuiltinOptions_SparseToDenseOptions ? static_cast<const SparseToDenseOptions *>(builtin_options()) : nullptr;
+  }
+  const TileOptions *builtin_options_as_TileOptions() const {
+    return builtin_options_type() == BuiltinOptions_TileOptions ? static_cast<const TileOptions *>(builtin_options()) : nullptr;
+  }
+  const ExpandDimsOptions *builtin_options_as_ExpandDimsOptions() const {
+    return builtin_options_type() == BuiltinOptions_ExpandDimsOptions ? static_cast<const ExpandDimsOptions *>(builtin_options()) : nullptr;
+  }
+  const EqualOptions *builtin_options_as_EqualOptions() const {
+    return builtin_options_type() == BuiltinOptions_EqualOptions ? static_cast<const EqualOptions *>(builtin_options()) : nullptr;
+  }
+  const NotEqualOptions *builtin_options_as_NotEqualOptions() const {
+    return builtin_options_type() == BuiltinOptions_NotEqualOptions ? static_cast<const NotEqualOptions *>(builtin_options()) : nullptr;
+  }
   const flatbuffers::Vector<uint8_t> *custom_options() const {
     return GetPointer<const flatbuffers::Vector<uint8_t> *>(VT_CUSTOM_OPTIONS);
   }
@@ -5043,6 +5421,26 @@ template<> inline const TransposeConvOptions *Operator::builtin_options_as<Trans
   return builtin_options_as_TransposeConvOptions();
 }
 
+template<> inline const SparseToDenseOptions *Operator::builtin_options_as<SparseToDenseOptions>() const {
+  return builtin_options_as_SparseToDenseOptions();
+}
+
+template<> inline const TileOptions *Operator::builtin_options_as<TileOptions>() const {
+  return builtin_options_as_TileOptions();
+}
+
+template<> inline const ExpandDimsOptions *Operator::builtin_options_as<ExpandDimsOptions>() const {
+  return builtin_options_as_ExpandDimsOptions();
+}
+
+template<> inline const EqualOptions *Operator::builtin_options_as<EqualOptions>() const {
+  return builtin_options_as_EqualOptions();
+}
+
+template<> inline const NotEqualOptions *Operator::builtin_options_as<NotEqualOptions>() const {
+  return builtin_options_as_NotEqualOptions();
+}
+
 struct OperatorBuilder {
   flatbuffers::FlatBufferBuilder &fbb_;
   flatbuffers::uoffset_t start_;
@@ -6008,6 +6406,7 @@ inline void LSTMOptions::UnPackTo(LSTMOptionsT *_o, const flatbuffers::resolver_
   { auto _e = fused_activation_function(); _o->fused_activation_function = _e; };
   { auto _e = cell_clip(); _o->cell_clip = _e; };
   { auto _e = proj_clip(); _o->proj_clip = _e; };
+  { auto _e = kernel_type(); _o->kernel_type = _e; };
 }
 
 inline flatbuffers::Offset<LSTMOptions> LSTMOptions::Pack(flatbuffers::FlatBufferBuilder &_fbb, const LSTMOptionsT* _o, const flatbuffers::rehasher_function_t *_rehasher) {
@@ -6021,11 +6420,13 @@ inline flatbuffers::Offset<LSTMOptions> CreateLSTMOptions(flatbuffers::FlatBuffe
   auto _fused_activation_function = _o->fused_activation_function;
   auto _cell_clip = _o->cell_clip;
   auto _proj_clip = _o->proj_clip;
+  auto _kernel_type = _o->kernel_type;
   return tflite::CreateLSTMOptions(
       _fbb,
       _fused_activation_function,
       _cell_clip,
-      _proj_clip);
+      _proj_clip,
+      _kernel_type);
 }
 
 inline ResizeBilinearOptionsT *ResizeBilinearOptions::UnPack(const flatbuffers::resolver_function_t *_resolver) const {
@@ -6643,6 +7044,29 @@ inline flatbuffers::Offset<MaximumMinimumOptions> CreateMaximumMinimumOptions(fl
       _fbb);
 }
 
+inline TileOptionsT *TileOptions::UnPack(const flatbuffers::resolver_function_t *_resolver) const {
+  auto _o = new TileOptionsT();
+  UnPackTo(_o, _resolver);
+  return _o;
+}
+
+inline void TileOptions::UnPackTo(TileOptionsT *_o, const flatbuffers::resolver_function_t *_resolver) const {
+  (void)_o;
+  (void)_resolver;
+}
+
+inline flatbuffers::Offset<TileOptions> TileOptions::Pack(flatbuffers::FlatBufferBuilder &_fbb, const TileOptionsT* _o, const flatbuffers::rehasher_function_t *_rehasher) {
+  return CreateTileOptions(_fbb, _o, _rehasher);
+}
+
+inline flatbuffers::Offset<TileOptions> CreateTileOptions(flatbuffers::FlatBufferBuilder &_fbb, const TileOptionsT *_o, const flatbuffers::rehasher_function_t *_rehasher) {
+  (void)_rehasher;
+  (void)_o;
+  struct _VectorArgs { flatbuffers::FlatBufferBuilder *__fbb; const TileOptionsT* __o; const flatbuffers::rehasher_function_t *__rehasher; } _va = { &_fbb, _o, _rehasher}; (void)_va;
+  return tflite::CreateTileOptions(
+      _fbb);
+}
+
 inline ArgMaxOptionsT *ArgMaxOptions::UnPack(const flatbuffers::resolver_function_t *_resolver) const {
   auto _o = new ArgMaxOptionsT();
   UnPackTo(_o, _resolver);
@@ -6862,6 +7286,101 @@ inline flatbuffers::Offset<TransposeConvOptions> CreateTransposeConvOptions(flat
       _stride_h);
 }
 
+inline ExpandDimsOptionsT *ExpandDimsOptions::UnPack(const flatbuffers::resolver_function_t *_resolver) const {
+  auto _o = new ExpandDimsOptionsT();
+  UnPackTo(_o, _resolver);
+  return _o;
+}
+
+inline void ExpandDimsOptions::UnPackTo(ExpandDimsOptionsT *_o, const flatbuffers::resolver_function_t *_resolver) const {
+  (void)_o;
+  (void)_resolver;
+}
+
+inline flatbuffers::Offset<ExpandDimsOptions> ExpandDimsOptions::Pack(flatbuffers::FlatBufferBuilder &_fbb, const ExpandDimsOptionsT* _o, const flatbuffers::rehasher_function_t *_rehasher) {
+  return CreateExpandDimsOptions(_fbb, _o, _rehasher);
+}
+
+inline flatbuffers::Offset<ExpandDimsOptions> CreateExpandDimsOptions(flatbuffers::FlatBufferBuilder &_fbb, const ExpandDimsOptionsT *_o, const flatbuffers::rehasher_function_t *_rehasher) {
+  (void)_rehasher;
+  (void)_o;
+  struct _VectorArgs { flatbuffers::FlatBufferBuilder *__fbb; const ExpandDimsOptionsT* __o; const flatbuffers::rehasher_function_t *__rehasher; } _va = { &_fbb, _o, _rehasher}; (void)_va;
+  return tflite::CreateExpandDimsOptions(
+      _fbb);
+}
+
+inline SparseToDenseOptionsT *SparseToDenseOptions::UnPack(const flatbuffers::resolver_function_t *_resolver) const {
+  auto _o = new SparseToDenseOptionsT();
+  UnPackTo(_o, _resolver);
+  return _o;
+}
+
+inline void SparseToDenseOptions::UnPackTo(SparseToDenseOptionsT *_o, const flatbuffers::resolver_function_t *_resolver) const {
+  (void)_o;
+  (void)_resolver;
+  { auto _e = validate_indices(); _o->validate_indices = _e; };
+}
+
+inline flatbuffers::Offset<SparseToDenseOptions> SparseToDenseOptions::Pack(flatbuffers::FlatBufferBuilder &_fbb, const SparseToDenseOptionsT* _o, const flatbuffers::rehasher_function_t *_rehasher) {
+  return CreateSparseToDenseOptions(_fbb, _o, _rehasher);
+}
+
+inline flatbuffers::Offset<SparseToDenseOptions> CreateSparseToDenseOptions(flatbuffers::FlatBufferBuilder &_fbb, const SparseToDenseOptionsT *_o, const flatbuffers::rehasher_function_t *_rehasher) {
+  (void)_rehasher;
+  (void)_o;
+  struct _VectorArgs { flatbuffers::FlatBufferBuilder *__fbb; const SparseToDenseOptionsT* __o; const flatbuffers::rehasher_function_t *__rehasher; } _va = { &_fbb, _o, _rehasher}; (void)_va;
+  auto _validate_indices = _o->validate_indices;
+  return tflite::CreateSparseToDenseOptions(
+      _fbb,
+      _validate_indices);
+}
+
+inline EqualOptionsT *EqualOptions::UnPack(const flatbuffers::resolver_function_t *_resolver) const {
+  auto _o = new EqualOptionsT();
+  UnPackTo(_o, _resolver);
+  return _o;
+}
+
+inline void EqualOptions::UnPackTo(EqualOptionsT *_o, const flatbuffers::resolver_function_t *_resolver) const {
+  (void)_o;
+  (void)_resolver;
+}
+
+inline flatbuffers::Offset<EqualOptions> EqualOptions::Pack(flatbuffers::FlatBufferBuilder &_fbb, const EqualOptionsT* _o, const flatbuffers::rehasher_function_t *_rehasher) {
+  return CreateEqualOptions(_fbb, _o, _rehasher);
+}
+
+inline flatbuffers::Offset<EqualOptions> CreateEqualOptions(flatbuffers::FlatBufferBuilder &_fbb, const EqualOptionsT *_o, const flatbuffers::rehasher_function_t *_rehasher) {
+  (void)_rehasher;
+  (void)_o;
+  struct _VectorArgs { flatbuffers::FlatBufferBuilder *__fbb; const EqualOptionsT* __o; const flatbuffers::rehasher_function_t *__rehasher; } _va = { &_fbb, _o, _rehasher}; (void)_va;
+  return tflite::CreateEqualOptions(
+      _fbb);
+}
+
+inline NotEqualOptionsT *NotEqualOptions::UnPack(const flatbuffers::resolver_function_t *_resolver) const {
+  auto _o = new NotEqualOptionsT();
+  UnPackTo(_o, _resolver);
+  return _o;
+}
+
+inline void NotEqualOptions::UnPackTo(NotEqualOptionsT *_o, const flatbuffers::resolver_function_t *_resolver) const {
+  (void)_o;
+  (void)_resolver;
+}
+
+inline flatbuffers::Offset<NotEqualOptions> NotEqualOptions::Pack(flatbuffers::FlatBufferBuilder &_fbb, const NotEqualOptionsT* _o, const flatbuffers::rehasher_function_t *_rehasher) {
+  return CreateNotEqualOptions(_fbb, _o, _rehasher);
+}
+
+inline flatbuffers::Offset<NotEqualOptions> CreateNotEqualOptions(flatbuffers::FlatBufferBuilder &_fbb, const NotEqualOptionsT *_o, const flatbuffers::rehasher_function_t *_rehasher) {
+  (void)_rehasher;
+  (void)_o;
+  struct _VectorArgs { flatbuffers::FlatBufferBuilder *__fbb; const NotEqualOptionsT* __o; const flatbuffers::rehasher_function_t *__rehasher; } _va = { &_fbb, _o, _rehasher}; (void)_va;
+  return tflite::CreateNotEqualOptions(
+      _fbb);
+}
+
 inline OperatorCodeT *OperatorCode::UnPack(const flatbuffers::resolver_function_t *_resolver) const {
   auto _o = new OperatorCodeT();
   UnPackTo(_o, _resolver);
@@ -7244,6 +7763,26 @@ inline bool VerifyBuiltinOptions(flatbuffers::Verifier &verifier, const void *ob
       auto ptr = reinterpret_cast<const TransposeConvOptions *>(obj);
       return verifier.VerifyTable(ptr);
     }
+    case BuiltinOptions_SparseToDenseOptions: {
+      auto ptr = reinterpret_cast<const SparseToDenseOptions *>(obj);
+      return verifier.VerifyTable(ptr);
+    }
+    case BuiltinOptions_TileOptions: {
+      auto ptr = reinterpret_cast<const TileOptions *>(obj);
+      return verifier.VerifyTable(ptr);
+    }
+    case BuiltinOptions_ExpandDimsOptions: {
+      auto ptr = reinterpret_cast<const ExpandDimsOptions *>(obj);
+      return verifier.VerifyTable(ptr);
+    }
+    case BuiltinOptions_EqualOptions: {
+      auto ptr = reinterpret_cast<const EqualOptions *>(obj);
+      return verifier.VerifyTable(ptr);
+    }
+    case BuiltinOptions_NotEqualOptions: {
+      auto ptr = reinterpret_cast<const NotEqualOptions *>(obj);
+      return verifier.VerifyTable(ptr);
+    }
     default: return false;
   }
 }
@@ -7458,6 +7997,26 @@ inline void *BuiltinOptionsUnion::UnPack(const void *obj, BuiltinOptions type, c
       auto ptr = reinterpret_cast<const TransposeConvOptions *>(obj);
       return ptr->UnPack(resolver);
     }
+    case BuiltinOptions_SparseToDenseOptions: {
+      auto ptr = reinterpret_cast<const SparseToDenseOptions *>(obj);
+      return ptr->UnPack(resolver);
+    }
+    case BuiltinOptions_TileOptions: {
+      auto ptr = reinterpret_cast<const TileOptions *>(obj);
+      return ptr->UnPack(resolver);
+    }
+    case BuiltinOptions_ExpandDimsOptions: {
+      auto ptr = reinterpret_cast<const ExpandDimsOptions *>(obj);
+      return ptr->UnPack(resolver);
+    }
+    case BuiltinOptions_EqualOptions: {
+      auto ptr = reinterpret_cast<const EqualOptions *>(obj);
+      return ptr->UnPack(resolver);
+    }
+    case BuiltinOptions_NotEqualOptions: {
+      auto ptr = reinterpret_cast<const NotEqualOptions *>(obj);
+      return ptr->UnPack(resolver);
+    }
     default: return nullptr;
   }
 }
@@ -7660,6 +8219,26 @@ inline flatbuffers::Offset<void> BuiltinOptionsUnion::Pack(flatbuffers::FlatBuff
       auto ptr = reinterpret_cast<const TransposeConvOptionsT *>(value);
       return CreateTransposeConvOptions(_fbb, ptr, _rehasher).Union();
     }
+    case BuiltinOptions_SparseToDenseOptions: {
+      auto ptr = reinterpret_cast<const SparseToDenseOptionsT *>(value);
+      return CreateSparseToDenseOptions(_fbb, ptr, _rehasher).Union();
+    }
+    case BuiltinOptions_TileOptions: {
+      auto ptr = reinterpret_cast<const TileOptionsT *>(value);
+      return CreateTileOptions(_fbb, ptr, _rehasher).Union();
+    }
+    case BuiltinOptions_ExpandDimsOptions: {
+      auto ptr = reinterpret_cast<const ExpandDimsOptionsT *>(value);
+      return CreateExpandDimsOptions(_fbb, ptr, _rehasher).Union();
+    }
+    case BuiltinOptions_EqualOptions: {
+      auto ptr = reinterpret_cast<const EqualOptionsT *>(value);
+      return CreateEqualOptions(_fbb, ptr, _rehasher).Union();
+    }
+    case BuiltinOptions_NotEqualOptions: {
+      auto ptr = reinterpret_cast<const NotEqualOptionsT *>(value);
+      return CreateNotEqualOptions(_fbb, ptr, _rehasher).Union();
+    }
     default: return 0;
   }
 }
@@ -7862,6 +8441,26 @@ inline BuiltinOptionsUnion::BuiltinOptionsUnion(const BuiltinOptionsUnion &u) FL
       value = new TransposeConvOptionsT(*reinterpret_cast<TransposeConvOptionsT *>(u.value));
       break;
     }
+    case BuiltinOptions_SparseToDenseOptions: {
+      value = new SparseToDenseOptionsT(*reinterpret_cast<SparseToDenseOptionsT *>(u.value));
+      break;
+    }
+    case BuiltinOptions_TileOptions: {
+      value = new TileOptionsT(*reinterpret_cast<TileOptionsT *>(u.value));
+      break;
+    }
+    case BuiltinOptions_ExpandDimsOptions: {
+      value = new ExpandDimsOptionsT(*reinterpret_cast<ExpandDimsOptionsT *>(u.value));
+      break;
+    }
+    case BuiltinOptions_EqualOptions: {
+      value = new EqualOptionsT(*reinterpret_cast<EqualOptionsT *>(u.value));
+      break;
+    }
+    case BuiltinOptions_NotEqualOptions: {
+      value = new NotEqualOptionsT(*reinterpret_cast<NotEqualOptionsT *>(u.value));
+      break;
+    }
     default:
       break;
   }
@@ -8114,6 +8713,31 @@ inline void BuiltinOptionsUnion::Reset() {
       delete ptr;
       break;
     }
+    case BuiltinOptions_SparseToDenseOptions: {
+      auto ptr = reinterpret_cast<SparseToDenseOptionsT *>(value);
+      delete ptr;
+      break;
+    }
+    case BuiltinOptions_TileOptions: {
+      auto ptr = reinterpret_cast<TileOptionsT *>(value);
+      delete ptr;
+      break;
+    }
+    case BuiltinOptions_ExpandDimsOptions: {
+      auto ptr = reinterpret_cast<ExpandDimsOptionsT *>(value);
+      delete ptr;
+      break;
+    }
+    case BuiltinOptions_EqualOptions: {
+      auto ptr = reinterpret_cast<EqualOptionsT *>(value);
+      delete ptr;
+      break;
+    }
+    case BuiltinOptions_NotEqualOptions: {
+      auto ptr = reinterpret_cast<NotEqualOptionsT *>(value);
+      delete ptr;
+      break;
+    }
     default: break;
   }
   value = nullptr;
diff --git a/tensorflow/contrib/lite/simple_memory_arena.cc b/tensorflow/contrib/lite/simple_memory_arena.cc
index 2f2004f56bcad5b56f9dd6d4bc824ec14d79e795..4eaf6f1bfe76efc1e6737d03d58be9bc87bb849d 100644
--- a/tensorflow/contrib/lite/simple_memory_arena.cc
+++ b/tensorflow/contrib/lite/simple_memory_arena.cc
@@ -36,6 +36,12 @@ TfLiteStatus SimpleMemoryArena::Allocate(TfLiteContext* context,
                                          ArenaAlloc* new_alloc) {
   TF_LITE_ENSURE(context, alignment < arena_alignment_);
 
+  if (size == 0) {
+    new_alloc->offset = 0;
+    new_alloc->size = 0;
+    return kTfLiteOk;
+  }
+
   size_t current_top = 0;
 
   if (!allocs_.empty()) {
@@ -75,6 +81,10 @@ TfLiteStatus SimpleMemoryArena::Allocate(TfLiteContext* context,
 
 TfLiteStatus SimpleMemoryArena::Deallocate(TfLiteContext* context,
                                            const ArenaAlloc& alloc) {
+  if (alloc.size == 0) {
+    return kTfLiteOk;
+  }
+
   int erased_allocs_count = 0;
   auto it = allocs_.begin();
   while (it != allocs_.end()) {
@@ -122,7 +132,11 @@ TfLiteStatus SimpleMemoryArena::ResolveAlloc(TfLiteContext* context,
                                              char** output_ptr) {
   TF_LITE_ENSURE(context, committed_);
   TF_LITE_ENSURE(context, output_ptr != nullptr);
-  *output_ptr = underlying_buffer_aligned_ptr_ + alloc.offset;
+  if (alloc.size == 0) {
+    *output_ptr = nullptr;
+  } else {
+    *output_ptr = underlying_buffer_aligned_ptr_ + alloc.offset;
+  }
   return kTfLiteOk;
 }
 
diff --git a/tensorflow/contrib/lite/simple_memory_arena.h b/tensorflow/contrib/lite/simple_memory_arena.h
index 5faf78b59e3755d22e4e866d433e622baa6c66c1..f738315cf2f91403f9dcb6fa9e66b40bd70495aa 100644
--- a/tensorflow/contrib/lite/simple_memory_arena.h
+++ b/tensorflow/contrib/lite/simple_memory_arena.h
@@ -39,7 +39,8 @@ struct ArenaAlloc {
 // This small class is responsible for allocating, deallocating and reusing
 // dynamic memory from a common underlying buffer. The arena can be used in
 // scenarios when the pattern of memory allocations and deallocations is
-// repetitive, e.g. running NN inference in multiple iterations.
+// repetitive, e.g. running NN inference in multiple iterations. Note that
+// zero-sized allocations are explicitly allowed, and will resolve to null.
 class SimpleMemoryArena {
  public:
   explicit SimpleMemoryArena(size_t arena_alignment)
diff --git a/tensorflow/contrib/lite/simple_memory_arena_test.cc b/tensorflow/contrib/lite/simple_memory_arena_test.cc
index 4444f642eb75c563c57762d095e454ac63d836c6..60d4d5e768aeda958574422e1c36a7cc2f6a1429 100644
--- a/tensorflow/contrib/lite/simple_memory_arena_test.cc
+++ b/tensorflow/contrib/lite/simple_memory_arena_test.cc
@@ -43,6 +43,47 @@ TEST(SimpleMemoryArenaTest, BasicArenaOperations) {
   EXPECT_EQ(allocs[5].offset, 1024);
 }
 
+TEST(SimpleMemoryArenaTest, BasicZeroAlloc) {
+  TfLiteContext context;
+  SimpleMemoryArena arena(64);
+  ArenaAlloc alloc;
+
+  // Zero-sized allocs should have a 0 offset and size.
+  ASSERT_EQ(arena.Allocate(&context, 32, 0, &alloc), kTfLiteOk);
+  EXPECT_EQ(alloc.offset, 0);
+  EXPECT_EQ(alloc.size, 0);
+
+  // Deallocation of zero-sized allocs should always succeed (even redundantly).
+  ASSERT_EQ(arena.Deallocate(&context, alloc), kTfLiteOk);
+  ASSERT_EQ(arena.Deallocate(&context, alloc), kTfLiteOk);
+
+  // The zero-sized alloc should resolve to null.
+  char* resolved_ptr = nullptr;
+  ASSERT_EQ(arena.Commit(&context), kTfLiteOk);
+  ASSERT_EQ(arena.ResolveAlloc(&context, alloc, &resolved_ptr), kTfLiteOk);
+  EXPECT_EQ(resolved_ptr, nullptr);
+}
+
+TEST(SimpleMemoryArenaTest, InterleavedZeroAlloc) {
+  TfLiteContext context;
+  SimpleMemoryArena arena(64);
+  ArenaAlloc allocs[4];
+
+  // Interleave some zero and non-zero-sized allocations and deallocations.
+  ASSERT_EQ(arena.Allocate(&context, 32, 2047, &allocs[0]), kTfLiteOk);
+  ASSERT_EQ(arena.Allocate(&context, 32, 0, &allocs[1]), kTfLiteOk);
+  ASSERT_EQ(arena.Allocate(&context, 32, 1023, &allocs[2]), kTfLiteOk);
+  ASSERT_EQ(arena.Deallocate(&context, allocs[1]), kTfLiteOk);
+  ASSERT_EQ(arena.Deallocate(&context, allocs[2]), kTfLiteOk);
+  ASSERT_EQ(arena.Allocate(&context, 32, 2047, &allocs[3]), kTfLiteOk);
+
+  // Deallocation of a zero-sized alloc should not impact the allocator offsets.
+  EXPECT_EQ(allocs[0].offset, 0);
+  EXPECT_EQ(allocs[1].offset, 0);
+  EXPECT_EQ(allocs[2].offset, 2048);
+  EXPECT_EQ(allocs[3].offset, 2048);
+}
+
 TEST(SimpleMemoryArenaTest, TestAfterClear) {
   TfLiteContext context;
   SimpleMemoryArena arena(64);
diff --git a/tensorflow/contrib/lite/testing/BUILD b/tensorflow/contrib/lite/testing/BUILD
index 74fc32a12b12ec3bca81590a74b81bc3caff0d96..b823c97f38e7660652aa0ce3538b11de59dc9aea 100644
--- a/tensorflow/contrib/lite/testing/BUILD
+++ b/tensorflow/contrib/lite/testing/BUILD
@@ -20,11 +20,15 @@ load(
     size = "large",
     srcs = ["generated_examples_zip_test.cc"],
     args = [
-        "--zip_file_path=$(location :zip_%s)" % test_name,
-        # TODO(angerson) We may be able to add an external unzip binary instead
-        # of relying on an existing one for OSS builds.
-        "--unzip_binary_path=/usr/bin/unzip",
-    ],
+    ] + select({
+        "//tensorflow:android": [],
+        "//conditions:default": [
+            "--zip_file_path=$(location :zip_%s)" % test_name,
+            # TODO(angerson) We may be able to add an external unzip binary instead
+            # of relying on an existing one for OSS builds.
+            "--unzip_binary_path=/usr/bin/unzip",
+        ],
+    }),
     data = [
         ":zip_%s" % test_name,
     ],
@@ -155,6 +159,7 @@ cc_library(
     deps = [
         ":split",
         ":test_runner",
+        "//tensorflow/contrib/lite:builtin_op_data",
         "//tensorflow/contrib/lite:framework",
         "//tensorflow/contrib/lite/kernels:builtin_ops",
     ],
diff --git a/tensorflow/contrib/lite/testing/generate_examples.py b/tensorflow/contrib/lite/testing/generate_examples.py
index 0e036bda92e4c42056f3f5df9fd8bcddcd932c13..f5e25784fa17209af7cfb06d32aeea2b9b947196 100644
--- a/tensorflow/contrib/lite/testing/generate_examples.py
+++ b/tensorflow/contrib/lite/testing/generate_examples.py
@@ -58,10 +58,11 @@ from tensorflow.python.ops import rnn
 parser = argparse.ArgumentParser(description="Script to generate TFLite tests.")
 parser.add_argument("output_path",
                     help="Directory where the outputs will be go.")
-parser.add_argument("--zip_to_output",
-                    type=str,
-                    help="Particular zip to output.",
-                    required=False)
+parser.add_argument(
+    "--zip_to_output",
+    type=str,
+    help="Particular zip to output.",
+    required=True)
 parser.add_argument("--toco",
                     type=str,
                     help="Path to toco tool.",
@@ -97,8 +98,6 @@ KNOWN_BUGS = {
     r"fully_connected.*transpose_.=True": "67586970",
     # Softmax graphs are too complex.
     r"softmax.*dim=0": "67749831",
-    # SpaceToDepth only supports float32.
-    r"space_to_depth.*(float16|int32|uint8|int64)": "68018134",
     # BatchToSpaceND only supports 4D tensors.
     r"batch_to_space_nd.*input_shape=\[8,2,2,2,1,1\]": "70594733",
     # Div will use floordiv.
@@ -118,6 +117,8 @@ class ExtraTocoOptions(object):
     self.allow_custom_ops = False
     # Rnn states that are used to support rnn / lstm cells.
     self.rnn_states = None
+    # Split the LSTM inputs from 5 inoputs to 18 inputs for TFLite.
+    self.split_tflite_lstm_inputs = None
 
 
 def toco_options(data_types,
@@ -146,14 +147,20 @@ def toco_options(data_types,
        " --inference_type=%s" % inference_type +
        " --input_format=TENSORFLOW_GRAPHDEF" + " --output_format=TFLITE" +
        " --input_arrays=%s" % ",".join(input_arrays) +
-       " --input_shapes=%s" % shape_str +
        " --output_arrays=%s" % ",".join(output_arrays))
+  if shape_str:
+    s += (" --input_shapes=%s" % shape_str)
   if extra_toco_options.drop_control_dependency:
     s += " --drop_control_dependency"
   if extra_toco_options.allow_custom_ops:
     s += " --allow_custom_ops"
   if extra_toco_options.rnn_states:
     s += (" --rnn_states='" + extra_toco_options.rnn_states + "'")
+  if extra_toco_options.split_tflite_lstm_inputs is not None:
+    if extra_toco_options.split_tflite_lstm_inputs:
+      s += " --split_tflite_lstm_inputs=true"
+    else:
+      s += " --split_tflite_lstm_inputs=false"
   return s
 
 
@@ -238,6 +245,19 @@ def create_tensor_data(dtype, shape, min_value=-100, max_value=100):
   return value.astype(dtype)
 
 
+def create_scalar_data(dtype, min_value=-100, max_value=100):
+  """Build scalar tensor data range from min_value to max_value exclusively."""
+
+  if dtype in _TF_TYPE_INFO:
+    dtype = _TF_TYPE_INFO[dtype][0]
+
+  if dtype in (tf.float32, tf.float16):
+    value = (max_value - min_value) * np.random.random() + min_value
+  elif dtype in (tf.int32, tf.uint8, tf.int64):
+    value = np.random.randint(min_value, max_value + 1)
+  return np.array(value, dtype=dtype)
+
+
 def freeze_graph(session, outputs):
   """Freeze the current graph.
 
@@ -385,7 +405,7 @@ def make_zip_of_tests(zip_path,
   for parameters in test_parameters:
     keys = parameters.keys()
     for curr in itertools.product(*parameters.values()):
-      label = zip_path.replace(".zip", "") + (",".join(
+      label = zip_path.replace(".zip", "_") + (",".join(
           "%s=%r" % z for z in sorted(zip(keys, curr))).replace(" ", ""))
       if label[0] == "/":
         label = label[1:]
@@ -447,6 +467,11 @@ def make_zip_of_tests(zip_path,
             sess,
             tf.global_variables() + inputs +
             outputs) if use_frozen_graph else sess.graph_def
+
+        if "split_tflite_lstm_inputs" in param_dict_real:
+          extra_toco_options.split_tflite_lstm_inputs = param_dict_real[
+              "split_tflite_lstm_inputs"]
+
         tflite_model_binary, toco_log = toco_convert(
             graph_def.SerializeToString(), input_tensors, output_tensors,
             extra_toco_options)
@@ -730,65 +755,83 @@ def make_binary_op_tests(zip_path, binary_operator):
   make_zip_of_tests(zip_path, test_parameters, build_graph, build_inputs)
 
 
-def make_mean_tests(zip_path):
-  """Make a set of tests to do mean."""
+def make_reduce_tests(reduce_op):
+  """Make a set of tests to do reduce operation.
 
-  test_parameters = [{
-      "input_dtype": [tf.float32, tf.int32, tf.int64],
-      "input_shape": [[3, 2, 4]],
-      "axis": [
-          None, 0, 1, 2, [0, 1], [0, 2], [1, 2], [0, 1, 2], [1, 0], [2, 0],
-          [2, 1], [2, 1, 0], [2, 0, 1], -1, -2, -3, [1, -1], [0, -1], [-1, 0],
-          [-1, -2, -3], [0, 0, 0], [2, 2, 0], [1, 0, -3, -3]
-      ],
-      "const_axis": [True, False],
-      "keepdims": [True, False],
-  }, {
-      "input_dtype": [tf.float32],
-      "input_shape": [[1, 8, 8, 3]],
-      "axis": [
-          None, 0, 1, 2, 3, [1, 2], [0, 3], [1, 2, 3], [0, 1, 2, 3],
-          [3, 2, 1, 0], [3, 1, 0, 2], [2, 0], [3, 0], [3, 1], [1, 0], -1, -2,
-          -3, -4, [0, -2], [2, 3, -1, 0], [3, 1, 2, -3], [3, -4], [2, 2, 2],
-          [2, 2, 3], [-3, -3, -4], [-3, 2, 1]
-      ],
-      "const_axis": [True, False],
-      "keepdims": [True, False],
-  }]
+  Args:
+    reduce_op: TensorFlow reduce operation to test, i.e. `tf.reduce_mean`.
 
-  def build_graph(parameters):
-    """Build the mean op testing graph."""
-    input_tensor = tf.placeholder(
-        dtype=parameters["input_dtype"],
-        name="input",
-        shape=parameters["input_shape"])
+  Returns:
+    a function representing the true generator with `reduce_op_in` curried.
+  """
 
-    # Get axis as either a placeholder or constants.
-    if parameters["const_axis"]:
-      axis = parameters["axis"]
-      input_tensors = [input_tensor]
-    else:
-      if isinstance(parameters["axis"], list):
-        shape = [len(parameters["axis"])]
+  def f(zip_path):
+    """Actual function that generates examples."""
+
+    test_parameters = [{
+        "input_dtype": [tf.float32, tf.int32, tf.int64],
+        "input_shape": [[3, 2, 4]],
+        "axis": [
+            None, 0, 1, 2, [0, 1], [0, 2], [1, 2], [0, 1, 2], [1, 0], [2, 0],
+            [2, 1], [2, 1, 0], [2, 0, 1], -1, -2, -3, [1, -1], [0, -1], [-1, 0],
+            [-1, -2, -3], [0, 0, 0], [2, 2, 0], [1, 0, -3, -3]
+        ],
+        "const_axis": [True, False],
+        "keepdims": [True, False],
+    }, {
+        "input_dtype": [tf.float32],
+        "input_shape": [[1, 8, 8, 3]],
+        "axis": [
+            None, 0, 1, 2, 3, [1, 2], [0, 3], [1, 2, 3], [0, 1, 2, 3],
+            [3, 2, 1, 0], [3, 1, 0, 2], [2, 0], [3, 0], [3, 1], [1, 0], -1, -2,
+            -3, -4, [0, -2], [2, 3, -1, 0], [3, 1, 2, -3], [3, -4], [2, 2, 2],
+            [2, 2, 3], [-3, -3, -4], [-3, 2, 1]
+        ],
+        "const_axis": [True, False],
+        "keepdims": [True, False],
+    }]
+
+    def build_graph(parameters):
+      """Build the mean op testing graph."""
+      input_tensor = tf.placeholder(
+          dtype=parameters["input_dtype"],
+          name="input",
+          shape=parameters["input_shape"])
+
+      # Get axis as either a placeholder or constants.
+      if parameters["const_axis"]:
+        axis = parameters["axis"]
+        input_tensors = [input_tensor]
       else:
-        shape = [0]  # shape for None or integers.
-      axis = tf.placeholder(dtype=tf.int32, name="axis", shape=shape)
-      input_tensors = [input_tensor, axis]
+        if isinstance(parameters["axis"], list):
+          shape = [len(parameters["axis"])]
+        else:
+          shape = [0]  # shape for None or integers.
+        axis = tf.placeholder(dtype=tf.int32, name="axis", shape=shape)
+        input_tensors = [input_tensor, axis]
 
-    out = tf.reduce_mean(
-        input_tensor, axis=axis, keepdims=parameters["keepdims"])
-    return input_tensors, [out]
+      out = reduce_op(
+          input_tensor, axis=axis, keepdims=parameters["keepdims"])
+      return input_tensors, [out]
 
-  def build_inputs(parameters, sess, inputs, outputs):
-    values = [
-        create_tensor_data(parameters["input_dtype"], parameters["input_shape"])
-    ]
-    if not parameters["const_axis"]:
-      if parameters["axis"]:
-        values.append(np.array(parameters["axis"]))
-    return values, sess.run(outputs, feed_dict=dict(zip(inputs, values)))
+    def build_inputs(parameters, sess, inputs, outputs):
+      values = [
+          create_tensor_data(parameters["input_dtype"],
+                             parameters["input_shape"])]
+      if not parameters["const_axis"]:
+        if parameters["axis"]:
+          values.append(np.array(parameters["axis"]))
+      return values, sess.run(outputs, feed_dict=dict(zip(inputs, values)))
 
-  make_zip_of_tests(zip_path, test_parameters, build_graph, build_inputs)
+    make_zip_of_tests(zip_path, test_parameters, build_graph, build_inputs)
+
+  return f
+
+
+def make_mean_tests(zip_path):
+  """Make a set of tests to do mean."""
+
+  return make_reduce_tests(tf.reduce_mean)(zip_path)
 
 
 def make_exp_tests(zip_path):
@@ -1577,7 +1620,7 @@ def make_space_to_depth_tests(zip_path):
   """Make a set of tests to do space_to_depth."""
 
   test_parameters = [{
-      "dtype": [tf.float32, tf.float16, tf.int32, tf.uint8, tf.int64],
+      "dtype": [tf.float32, tf.int32, tf.uint8, tf.int64],
       "input_shape": [[2, 12, 24, 1]],
       "block_size": [2, 3, 4],
   }]
@@ -1987,6 +2030,7 @@ def make_lstm_tests(zip_path):
           "time_step_size": [1],
           "input_vec_size": [3],
           "num_cells": [4],
+          "split_tflite_lstm_inputs": [True, False],
       },
   ]
 
@@ -2121,6 +2165,74 @@ def make_arg_max_tests(zip_path):
   make_zip_of_tests(zip_path, test_parameters, build_graph, build_inputs)
 
 
+def make_equal_tests(zip_path):
+  """Make a set of tests to do equal."""
+
+  test_parameters = [{
+      "input_dtype": [tf.float32, tf.int32, tf.int64],
+      "input_shape_pair": [([1, 1, 1, 3], [1, 1, 1, 3]),
+                           ([2, 3, 4, 5], [2, 3, 4, 5]), ([2, 3, 3], [2, 3]),
+                           ([5, 5], [1]), ([10], [2, 4, 10])],
+  }]
+
+  def build_graph(parameters):
+    """Build the equal op testing graph."""
+    input_value1 = tf.placeholder(
+        dtype=parameters["input_dtype"],
+        name="input1",
+        shape=parameters["input_shape_pair"][0])
+    input_value2 = tf.placeholder(
+        dtype=parameters["input_dtype"],
+        name="input2",
+        shape=parameters["input_shape_pair"][1])
+    out = tf.equal(input_value1, input_value2)
+    return [input_value1, input_value2], [out]
+
+  def build_inputs(parameters, sess, inputs, outputs):
+    input_value1 = create_tensor_data(parameters["input_dtype"],
+                                      parameters["input_shape_pair"][0])
+    input_value2 = create_tensor_data(parameters["input_dtype"],
+                                      parameters["input_shape_pair"][1])
+    return [input_value1, input_value2], sess.run(
+        outputs, feed_dict=dict(zip(inputs, [input_value1, input_value2])))
+
+  make_zip_of_tests(zip_path, test_parameters, build_graph, build_inputs)
+
+
+def make_not_equal_tests(zip_path):
+  """Make a set of tests to do not equal."""
+
+  test_parameters = [{
+      "input_dtype": [tf.float32, tf.int32, tf.int64],
+      "input_shape_pair": [([1, 1, 1, 3], [1, 1, 1, 3]),
+                           ([2, 3, 4, 5], [2, 3, 4, 5]), ([2, 3, 3], [2, 3]),
+                           ([5, 5], [1]), ([10], [2, 4, 10])],
+  }]
+
+  def build_graph(parameters):
+    """Build the not euqal op testing graph."""
+    input_value1 = tf.placeholder(
+        dtype=parameters["input_dtype"],
+        name="input1",
+        shape=parameters["input_shape_pair"][0])
+    input_value2 = tf.placeholder(
+        dtype=parameters["input_dtype"],
+        name="input2",
+        shape=parameters["input_shape_pair"][1])
+    out = tf.not_equal(input_value1, input_value2)
+    return [input_value1, input_value2], [out]
+
+  def build_inputs(parameters, sess, inputs, outputs):
+    input_value1 = create_tensor_data(parameters["input_dtype"],
+                                      parameters["input_shape_pair"][0])
+    input_value2 = create_tensor_data(parameters["input_dtype"],
+                                      parameters["input_shape_pair"][1])
+    return [input_value1, input_value2], sess.run(
+        outputs, feed_dict=dict(zip(inputs, [input_value1, input_value2])))
+
+  make_zip_of_tests(zip_path, test_parameters, build_graph, build_inputs)
+
+
 def make_greater_tests(zip_path):
   """Make a set of tests to do greater."""
 
@@ -2308,30 +2420,44 @@ def make_neg_tests(zip_path):
   make_zip_of_tests(zip_path, test_parameters, build_graph, build_inputs)
 
 
-def make_sin_tests(zip_path):
-  """Make a set of tests to do sin."""
+def _make_elementwise_tests(op):
+  """Make a set of tests to do element-wise operations."""
 
-  test_parameters = [{
-      "input_dtype": [tf.float32],
-      "input_shape": [[1], [1, 2], [5, 6, 7, 8], [3, 4, 5, 6]],
-  }]
+  def f(zip_path):
+    """Actual function that generates examples."""
+    test_parameters = [{
+        "input_dtype": [tf.float32],
+        "input_shape": [[1], [1, 2], [5, 6, 7, 8], [3, 4, 5, 6]],
+    }]
 
-  def build_graph(parameters):
-    """Build the sin op testing graph."""
-    input_value = tf.placeholder(
-        dtype=parameters["input_dtype"],
-        name="input1",
-        shape=parameters["input_shape"])
-    out = tf.sin(input_value)
-    return [input_value], [out]
+    def build_graph(parameters):
+      """Build the sin op testing graph."""
+      input_value = tf.placeholder(
+          dtype=parameters["input_dtype"],
+          name="input1",
+          shape=parameters["input_shape"])
+      out = op(input_value)
+      return [input_value], [out]
 
-  def build_inputs(parameters, sess, inputs, outputs):
-    input_value = create_tensor_data(parameters["input_dtype"],
-                                     parameters["input_shape"])
-    return [input_value], sess.run(
-        outputs, feed_dict={inputs[0]: input_value})
+    def build_inputs(parameters, sess, inputs, outputs):
+      input_value = create_tensor_data(parameters["input_dtype"],
+                                       parameters["input_shape"])
+      return [input_value], sess.run(
+          outputs, feed_dict={inputs[0]: input_value})
 
-  make_zip_of_tests(zip_path, test_parameters, build_graph, build_inputs)
+    make_zip_of_tests(zip_path, test_parameters, build_graph, build_inputs)
+
+  return f
+
+
+def make_sin_tests(zip_path):
+  """Make a set of tests to do sin."""
+  return _make_elementwise_tests(tf.sin)(zip_path)
+
+
+def make_log_tests(zip_path):
+  """Make a set of tests to do log."""
+  return _make_elementwise_tests(tf.log)(zip_path)
 
 
 def make_where_tests(zip_path):
@@ -2485,6 +2611,134 @@ def make_transpose_conv_tests(zip_path):
   make_zip_of_tests(zip_path, test_parameters, build_graph, build_inputs)
 
 
+def make_tile_tests(zip_path):
+  """Make a set of tests to do tile."""
+  test_parameters = [{
+      "input_dtype": [tf.float32, tf.int32],
+      "input_shape": [[3, 2, 1], [2, 2, 2]],
+      "multiplier_dtype": [tf.int32, tf.int64],
+      "multiplier_shape": [[3]]
+  }]
+
+  def build_graph(parameters):
+    """Build the tile op testing graph."""
+    input_value = tf.placeholder(
+        dtype=parameters["input_dtype"],
+        shape=parameters["input_shape"],
+        name="input")
+    multiplier_value = tf.placeholder(
+        dtype=parameters["multiplier_dtype"],
+        shape=parameters["multiplier_shape"],
+        name="multiplier")
+    out = tf.tile(input_value, multiplier_value)
+    return [input_value, multiplier_value], [out]
+
+  def build_inputs(parameters, sess, inputs, outputs):
+    input_value = create_tensor_data(parameters["input_dtype"],
+                                     parameters["input_shape"])
+    multipliers_value = create_tensor_data(parameters["multiplier_dtype"],
+                                           parameters["multiplier_shape"])
+    return [input_value, multipliers_value], sess.run(
+        outputs,
+        feed_dict={
+            inputs[0]: input_value,
+            inputs[1]: multipliers_value
+        })
+
+  make_zip_of_tests(zip_path, test_parameters, build_graph, build_inputs)
+
+
+def make_expand_dims_tests(zip_path):
+  """Make a set of tests to do expand_dims."""
+
+  test_parameters = [{
+      "input_type": [tf.float32, tf.int32],
+      "input_shape": [[3, 4], [10, 10, 3]],
+      "axis_value": [0, 1, 2, -1, -2],
+  }]
+
+  def build_graph(parameters):
+    """Build the where op testing graph."""
+    input_value = tf.placeholder(
+        dtype=parameters["input_type"],
+        name="input",
+        shape=parameters["input_shape"])
+    axis_value = tf.placeholder(dtype=tf.int32, name="axis", shape=[1])
+    out = tf.expand_dims(input_value, axis=axis_value)
+    return [input_value, axis_value], [out]
+
+  def build_inputs(parameters, sess, inputs, outputs):
+    input_value = create_tensor_data(parameters["input_type"],
+                                     parameters["input_shape"])
+    axis_value = np.array([parameters["axis_value"]], dtype=np.int32)
+    return [input_value, axis_value], sess.run(
+        outputs, feed_dict=dict(zip(inputs, [input_value, axis_value])))
+
+  make_zip_of_tests(zip_path, test_parameters, build_graph, build_inputs)
+
+
+def make_sparse_to_dense_tests(zip_path):
+  """Make a set of tests to do sparse to dense."""
+
+  test_parameters = [{
+      "value_dtype": [tf.float32, tf.int32],
+      "index_dtype": [tf.int32, tf.int64],
+      "value_count": [1, 3, 6, 8],
+      "dense_shape": [[15], [3, 10], [4, 4, 4, 4], [7, 10, 9]],
+      "default_value": [0, -1],
+      "value_is_scalar": [True, False],
+  }]
+
+  # Return a single value for 1-D dense shape, but a tuple for other shapes.
+  def generate_index(dense_shape):
+    if len(dense_shape) == 1:
+      return np.random.randint(dense_shape[0])
+    else:
+      index = []
+      for shape in dense_shape:
+        index.append(np.random.randint(shape))
+      return tuple(index)
+
+  def build_graph(parameters):
+    """Build the sparse_to_dense op testing graph."""
+    dense_shape = parameters["dense_shape"]
+
+    # Special handle for value_is_scalar case.
+    # value_count must be 1.
+    if parameters["value_is_scalar"] and parameters["value_count"] == 1:
+      value = tf.placeholder(
+          name="value", dtype=parameters["value_dtype"], shape=())
+    else:
+      value = tf.placeholder(
+          name="value",
+          dtype=parameters["value_dtype"],
+          shape=[parameters["value_count"]])
+    indices = set()
+    while len(indices) < parameters["value_count"]:
+      indices.add(generate_index(dense_shape))
+    indices = tf.constant(tuple(indices), dtype=parameters["index_dtype"])
+    # TODO(renjieliu): Add test for validate_indices case.
+    out = tf.sparse_to_dense(
+        indices,
+        dense_shape,
+        value,
+        parameters["default_value"],
+        validate_indices=False)
+
+    return [value], [out]
+
+  def build_inputs(parameters, sess, inputs, outputs):
+    if parameters["value_is_scalar"] and parameters["value_count"] == 1:
+      input_value = create_scalar_data(parameters["value_dtype"])
+    else:
+      input_value = create_tensor_data(parameters["value_dtype"],
+                                       [parameters["value_count"]])
+    return [input_value], sess.run(
+        outputs, feed_dict=dict(zip(inputs, [input_value])))
+
+  make_zip_of_tests(zip_path, test_parameters, build_graph, build_inputs)
+
+
 # Toco binary path provided by the generate rule.
 bin_path = None
 
diff --git a/tensorflow/contrib/lite/testing/generated_examples_zip_test.cc b/tensorflow/contrib/lite/testing/generated_examples_zip_test.cc
index 2f069ff8e79b4a08824121c49e9327619cfeb858..8a59d756f8dbbcefc930b5285c1ced8ce6b08845 100644
--- a/tensorflow/contrib/lite/testing/generated_examples_zip_test.cc
+++ b/tensorflow/contrib/lite/testing/generated_examples_zip_test.cc
@@ -36,7 +36,12 @@ bool FLAGS_ignore_known_bugs = true;
 // TODO(b/71769302) zip_files_dir should have a more accurate default, if
 // possible
 string* FLAGS_zip_file_path = new string("./");
+#ifndef __ANDROID__
 string* FLAGS_unzip_binary_path = new string("/usr/bin/unzip");
+#else
+string* FLAGS_unzip_binary_path = new string("/system/bin/unzip");
+#endif
+bool FLAGS_use_nnapi = false;
 }  // namespace
 
 // TensorFlow system environment for file system called.
@@ -48,7 +53,7 @@ tensorflow::Env* env = tensorflow::Env::Default();
 // TODO(ahentz): make sure we clean this list up frequently.
 std::map<string, string> kBrokenTests = {
     // Add only supports float32. (and "constant" tests use Add)
-    {R"(^\/adda.*int32)", "68808744"},
+    {R"(^\/add_a.*int32)", "68808744"},
     {R"(^\/constant.*int32)", "68808744"},
     {R"(^\/mul.*int32)", "68808744"},
     {R"(^\/div.*int32)", "68808744"},
@@ -61,25 +66,25 @@ std::map<string, string> kBrokenTests = {
      "70527055"},
 
     // L2Norm only supports tensors with 4D or fewer.
-    {R"(^\/l2normdim=.*,epsilon=.*,input_shape=\[.,.,.,.,.*\])", "67963684"},
+    {R"(^\/l2norm_dim=.*,epsilon=.*,input_shape=\[.,.,.,.,.*\])", "67963684"},
 
     // SpaceToBatchND only supports 4D tensors.
     {R"(^\/space_to_batch_nd.*input_shape=\[1,4,4,4,1,1\])", "70848787"},
 
     // L2Norm only works for dim=-1.
-    {R"(^\/l2normdim=-2,epsilon=.*,input_shape=\[.,.\])", "67963812"},
-    {R"(^\/l2normdim=0,epsilon=.*,input_shape=\[.,.\])", "67963812"},
-    {R"(^\/l2normdim=-2,epsilon=.*,input_shape=\[3,15,14,3\])", "67963812"},
-    {R"(^\/l2normdim=-2,epsilon=.*,input_shape=\[1,3,4,3\])", "67963812"},
-    {R"(^\/l2normdim=2,epsilon=.*,input_shape=\[3,15,14,3\])", "67963812"},
-    {R"(^\/l2normdim=2,epsilon=.*,input_shape=\[1,3,4,3\])", "67963812"},
-    {R"(^\/l2normdim=0,epsilon=.*,input_shape=\[3,15,14,3\])", "67963812"},
-    {R"(^\/l2normdim=0,epsilon=.*,input_shape=\[1,3,4,3\])", "67963812"},
-    {R"(^\/l2normdim=1,epsilon=.*,input_shape=\[3,15,14,3\])", "67963812"},
-    {R"(^\/l2normdim=1,epsilon=.*,input_shape=\[1,3,4,3\])", "67963812"},
-    {R"(^\/l2normdim=\[2,3\],epsilon=.*,input_shape=\[3,15,14,3\])",
+    {R"(^\/l2norm_dim=-2,epsilon=.*,input_shape=\[.,.\])", "67963812"},
+    {R"(^\/l2norm_dim=0,epsilon=.*,input_shape=\[.,.\])", "67963812"},
+    {R"(^\/l2norm_dim=-2,epsilon=.*,input_shape=\[3,15,14,3\])", "67963812"},
+    {R"(^\/l2norm_dim=-2,epsilon=.*,input_shape=\[1,3,4,3\])", "67963812"},
+    {R"(^\/l2norm_dim=2,epsilon=.*,input_shape=\[3,15,14,3\])", "67963812"},
+    {R"(^\/l2norm_dim=2,epsilon=.*,input_shape=\[1,3,4,3\])", "67963812"},
+    {R"(^\/l2norm_dim=0,epsilon=.*,input_shape=\[3,15,14,3\])", "67963812"},
+    {R"(^\/l2norm_dim=0,epsilon=.*,input_shape=\[1,3,4,3\])", "67963812"},
+    {R"(^\/l2norm_dim=1,epsilon=.*,input_shape=\[3,15,14,3\])", "67963812"},
+    {R"(^\/l2norm_dim=1,epsilon=.*,input_shape=\[1,3,4,3\])", "67963812"},
+    {R"(^\/l2norm_dim=\[2,3\],epsilon=.*,input_shape=\[3,15,14,3\])",
      "67963812"},
-    {R"(^\/l2normdim=\[2,3\],epsilon=.*,input_shape=\[1,3,4,3\])", "67963812"},
+    {R"(^\/l2norm_dim=\[2,3\],epsilon=.*,input_shape=\[1,3,4,3\])", "67963812"},
 
     // ResizeBilinear looks completely incompatible with Tensorflow
     {R"(^\/resize_bilinear.*dtype=tf.int32)", "72401107"},
@@ -212,7 +217,7 @@ TEST_P(OpsTest, RunZipTests) {
 
   std::ifstream tflite_stream(tflite_test_case);
   ASSERT_TRUE(tflite_stream.is_open()) << tflite_test_case;
-  tflite::testing::TfLiteDriver test_driver(/*use_nnapi=*/true);
+  tflite::testing::TfLiteDriver test_driver(FLAGS_use_nnapi);
   test_driver.SetModelBaseDir(tflite_dir);
 
   string bug_number;
@@ -273,7 +278,10 @@ int main(int argc, char** argv) {
                        "Required: Location of the test zip file."),
       tensorflow::Flag("unzip_binary_path",
                        tflite::testing::FLAGS_unzip_binary_path,
-                       "Required: Location of a suitable unzip binary.")};
+                       "Required: Location of a suitable unzip binary."),
+      tensorflow::Flag("use_nnapi", &tflite::testing::FLAGS_use_nnapi,
+                       "Whether to enable the NNAPI delegate")};
+
   bool success = tensorflow::Flags::Parse(&argc, argv, flags);
   if (!success || (argc == 2 && !strcmp(argv[1], "--helpfull"))) {
     fprintf(stderr, "%s", tensorflow::Flags::Usage(argv[0], flags).c_str());
@@ -281,6 +289,8 @@ int main(int argc, char** argv) {
   }
 
   ::tflite::LogToStderr();
+  // TODO(mikie): googletest arguments do not work - maybe the tensorflow flags
+  // parser removes them?
   ::testing::InitGoogleTest(&argc, argv);
   return RUN_ALL_TESTS();
 }
diff --git a/tensorflow/contrib/lite/testing/tflite_driver.cc b/tensorflow/contrib/lite/testing/tflite_driver.cc
index 75ac24719aa8fad960ae06d006eda386d44d721a..f518bf864c6a71679400f0013bbcd40142bb8ca1 100644
--- a/tensorflow/contrib/lite/testing/tflite_driver.cc
+++ b/tensorflow/contrib/lite/testing/tflite_driver.cc
@@ -16,6 +16,7 @@ limitations under the License.
 
 #include <iostream>
 
+#include "tensorflow/contrib/lite/builtin_op_data.h"
 #include "tensorflow/contrib/lite/testing/split.h"
 
 namespace tflite {
@@ -143,6 +144,7 @@ void TfLiteDriver::AllocateTensors() {
       Invalidate("Failed to allocate tensors");
       return;
     }
+    ResetLSTMStateTensors();
     must_allocate_tensors_ = false;
   }
 }
@@ -161,6 +163,7 @@ void TfLiteDriver::LoadModel(const string& bin_file_path) {
     Invalidate("Failed build interpreter");
     return;
   }
+  interpreter_->UseNNAPI(use_nnapi_);
 
   must_allocate_tensors_ = true;
 }
@@ -281,5 +284,36 @@ bool TfLiteDriver::CheckResults() {
   return success;
 }
 
+void TfLiteDriver::ResetLSTMStateTensors() {
+  // This is a workaround for initializing state tensors for LSTM.
+  // TODO(ycling): Refactoring and find a better way to initialize state
+  // tensors. Maybe write the reset instructions into the test data.
+  for (auto node_index : interpreter_->execution_plan()) {
+    const auto& node_and_reg = interpreter_->node_and_registration(node_index);
+    const auto& node = node_and_reg->first;
+    const auto& registration = node_and_reg->second;
+
+    if (registration.builtin_code == tflite::BuiltinOperator_LSTM) {
+      const auto* params =
+          reinterpret_cast<const TfLiteLSTMParams*>(node.builtin_data);
+      if (params->kernel_type == kTfLiteLSTMFullKernel &&
+          node.outputs->size >= 2) {
+        // The first 2 outputs of LSTM are state tensors.
+        for (int i = 0; i < 2; ++i) {
+          int node_index = node.outputs->data[i];
+          ResetTensor(node_index);
+        }
+      } else if (params->kernel_type == kTfLiteLSTMBasicKernel &&
+                 node.inputs->size == 5) {
+        // The 2th and 5th inputs are state tensors.
+        for (int i : {1, 4}) {
+          int node_index = node.inputs->data[i];
+          ResetTensor(node_index);
+        }
+      }
+    }
+  }
+}
+
 }  // namespace testing
 }  // namespace tflite
diff --git a/tensorflow/contrib/lite/testing/tflite_driver.h b/tensorflow/contrib/lite/testing/tflite_driver.h
index 02b7de1534e648734d7bc53154afa42f2ef256b4..5493ba3631b0423942cc9c4f98fbd6393a404060 100644
--- a/tensorflow/contrib/lite/testing/tflite_driver.h
+++ b/tensorflow/contrib/lite/testing/tflite_driver.h
@@ -48,6 +48,8 @@ class TfLiteDriver : public TestRunner {
   string ReadOutput(int id) override { return "no-op"; }
 
  private:
+  void ResetLSTMStateTensors();
+
   class Expectation;
 
   bool use_nnapi_ = false;
diff --git a/tensorflow/contrib/lite/tflite_static.bp b/tensorflow/contrib/lite/tflite_static.bp
index d873ee234e802b62bed08a3bfa8dbca4069d9269..5a78ace359bb99b25f374262b6f689a4ad88cbd1 100644
--- a/tensorflow/contrib/lite/tflite_static.bp
+++ b/tensorflow/contrib/lite/tflite_static.bp
@@ -24,10 +24,10 @@ cc_library_static {
         "error_reporter.cc",
 	"graph_info.cc",
         "interpreter.cc",
-        "model.cc",	
+        "model.cc",
         "nnapi_delegate.cc",
         "optional_debug_tools.cc",
-        "op_resolver.cc",		
+        "op_resolver.cc",
         "simple_memory_arena.cc",
         "string_util.cc",
         "util.cc",
@@ -52,6 +52,7 @@ cc_library_static {
         "kernels/embedding_lookup.cc",
         "kernels/embedding_lookup_sparse.cc",
         "kernels/exp.cc",
+        "kernels/expand_dims.cc",
         "kernels/floor.cc",
         "kernels/fully_connected.cc",
         "kernels/gather.cc",
@@ -71,15 +72,17 @@ cc_library_static {
         "kernels/register.cc",
         "kernels/reshape.cc",
         "kernels/resize_bilinear.cc",
-        "kernels/select.cc",	
+        "kernels/select.cc",
         "kernels/skip_gram.cc",
         "kernels/slice.cc",
         "kernels/space_to_batch_nd.cc",
         "kernels/space_to_depth.cc",
+        "kernels/sparse_to_dense.cc",
         "kernels/squeeze.cc",
         "kernels/strided_slice.cc",
         "kernels/sub.cc",
         "kernels/svdf.cc",
+        "kernels/tile.cc",
         "kernels/transpose.cc",
         "kernels/transpose_conv.cc",
         "kernels/unidirectional_sequence_lstm.cc",
diff --git a/tensorflow/contrib/lite/toco/BUILD b/tensorflow/contrib/lite/toco/BUILD
index b8acc9a8e0361a4c38fcbe2f16be172e637b95c6..0789dc99286361183aea4c95db98c11ff700ea79 100644
--- a/tensorflow/contrib/lite/toco/BUILD
+++ b/tensorflow/contrib/lite/toco/BUILD
@@ -245,6 +245,7 @@ cc_library(
         "graph_transformations/quantization_util.cc",
         "graph_transformations/quantization_util.h",
         "graph_transformations/quantize.cc",
+        "graph_transformations/quantize_weights.cc",
         "graph_transformations/read_fake_quant_min_max.cc",
         "graph_transformations/remove_final_dequantize_op.cc",
         "graph_transformations/remove_tensorflow_assert.cc",
@@ -373,6 +374,7 @@ tf_cc_test(
         ":toco_tooling",
         "//tensorflow/core:framework",
         "//tensorflow/core:graph",
+        "//tensorflow/core:lib",
         "//tensorflow/core:protos_all_cc",
         "@com_google_googletest//:gtest_main",
     ],
@@ -410,6 +412,7 @@ tf_cc_test(
     deps = [
         ":model",
         ":tooling_util",
+        "//tensorflow/core:lib",
         "@com_google_googletest//:gtest_main",
     ],
 )
diff --git a/tensorflow/contrib/lite/toco/args.h b/tensorflow/contrib/lite/toco/args.h
index 6c0311af0a926711955caaa1c7507d7c52c77069..9f5ca66d050f0ead9b8856c77dba8d9bbd182d10 100644
--- a/tensorflow/contrib/lite/toco/args.h
+++ b/tensorflow/contrib/lite/toco/args.h
@@ -234,6 +234,7 @@ struct ParsedTocoFlags {
   Arg<bool> drop_fake_quant = Arg<bool>(false);
   Arg<bool> reorder_across_fake_quant = Arg<bool>(false);
   Arg<bool> allow_custom_ops = Arg<bool>(false);
+  Arg<bool> quantize_weights = Arg<bool>(false);
   // Deprecated flags
   Arg<string> input_type;
   Arg<string> input_types;
@@ -242,6 +243,7 @@ struct ParsedTocoFlags {
   Arg<bool> propagate_fake_quant_num_bits = Arg<bool>(false);
   Arg<bool> allow_nudging_weights_to_use_fast_gemm_kernel = Arg<bool>(false);
   Arg<int64> dedupe_array_min_size_bytes = Arg<int64>(64);
+  Arg<bool> split_tflite_lstm_inputs = Arg<bool>(true);
 };
 
 }  // namespace toco
diff --git a/tensorflow/contrib/lite/toco/dump_graphviz.cc b/tensorflow/contrib/lite/toco/dump_graphviz.cc
index 6e5927295fe2a9d237683155c9bca90048d478c4..878bda36ef3900d6d8c509aca40cee834cefe514 100644
--- a/tensorflow/contrib/lite/toco/dump_graphviz.cc
+++ b/tensorflow/contrib/lite/toco/dump_graphviz.cc
@@ -16,8 +16,6 @@ limitations under the License.
 
 #include <cmath>
 #include <memory>
-#include <set>
-#include <unordered_set>
 #include <vector>
 
 #include "absl/strings/str_replace.h"
@@ -134,6 +132,12 @@ void AppendArrayVal(string* string, Array const& array, int index) {
       return;
     }
     AppendF(string, "%d", data[index]);
+  } else if (array.buffer->type == ArrayDataType::kBool) {
+    const auto& data = array.GetBuffer<ArrayDataType::kBool>().data;
+    if (index >= data.size()) {
+      return;
+    }
+    AppendF(string, "%d", data[index]);
   }
 }
 
@@ -142,6 +146,7 @@ NodeProperties GetPropertiesForArray(const Model& model,
   NodeProperties node_properties;
   node_properties.color = GetColorForArray(model, array_name);
   node_properties.label = absl::StrReplaceAll(array_name, {{"/", "/\\n"}});
+  node_properties.log2_buffer_size = 0.0f;
 
   // Append array shape to the label.
   auto& array = model.GetArray(array_name);
@@ -161,9 +166,12 @@ NodeProperties GetPropertiesForArray(const Model& model,
     }
     node_properties.label += "]";
 
-    int buffer_size = RequiredBufferSizeForShape(array.shape());
-    node_properties.log2_buffer_size =
-        std::log2(static_cast<float>(buffer_size));
+    int buffer_size = 0;
+    if (IsValid(array.shape())) {
+      buffer_size = RequiredBufferSizeForShape(array.shape());
+      node_properties.log2_buffer_size =
+          std::log2(static_cast<float>(buffer_size));
+    }
 
     if (array.buffer) {
       const auto& array = model.GetArray(array_name);
@@ -196,8 +204,6 @@ NodeProperties GetPropertiesForArray(const Model& model,
         AppendF(&node_properties.label, "}");
       }
     }
-  } else {
-    node_properties.log2_buffer_size = 0.0f;
   }
 
   if (array.minmax) {
@@ -304,7 +310,15 @@ void DumpGraphviz(const Model& model, string* output_file_contents) {
   constexpr char kRNNBackEdgeFormat[] =
       "\t \"%s\" -> \"%s\" [color=\"#0F9D58\"];\n";
 
-  std::set<string> already_added_arrays;
+  for (const auto& array_kv : model.GetArrayMap()) {
+    // Add node for array.
+    const string& array_name = array_kv.first;
+    const auto& array_properties = GetPropertiesForArray(model, array_name);
+    AppendF(output_file_contents, kNodeFormat, array_name,
+            array_properties.label, "octagon",
+            array_properties.color.FillColorString().c_str(),
+            array_properties.color.TextColorString().c_str());
+  }
   for (int op_index = 0; op_index < model.operators.size(); op_index++) {
     const Operator& op = *model.operators[op_index];
     // Add node for operator.
@@ -313,20 +327,13 @@ void DumpGraphviz(const Model& model, string* output_file_contents) {
     AppendF(output_file_contents, kNodeFormat, operator_id, op_properties.label,
             "box", op_properties.color.FillColorString().c_str(),
             op_properties.color.TextColorString().c_str());
-    // Add nodes and edges for all inputs of the operator.
+    // Add edges for all inputs of the operator.
     for (const auto& input : op.inputs) {
       if (!model.HasArray(input)) {
         // Arrays should _always_ exist. Except, perhaps, during development.
         continue;
       }
       auto array_properties = GetPropertiesForArray(model, input);
-      if (!already_added_arrays.count(input)) {
-        AppendF(output_file_contents, kNodeFormat, input,
-                array_properties.label, "octagon",
-                array_properties.color.FillColorString().c_str(),
-                array_properties.color.TextColorString().c_str());
-      }
-
       // Draw lines that transport more data thicker (Otherwise, where would the
       // data fit? right?).
       float line_width =
@@ -342,22 +349,14 @@ void DumpGraphviz(const Model& model, string* output_file_contents) {
       }
       AppendF(output_file_contents, kEdgeFormat, input, operator_id, line_width,
               weight);
-      already_added_arrays.insert(input);
     }
-    // Add nodes and edges for all outputs of the operator.
+    // Add edges for all outputs of the operator.
     for (const auto& output : op.outputs) {
       if (!model.HasArray(output)) {
         // Arrays should _always_ exist. Except, perhaps, during development.
         continue;
       }
       auto array_properties = GetPropertiesForArray(model, output);
-      if (!already_added_arrays.count(output)) {
-        AppendF(output_file_contents, kNodeFormat, output,
-                array_properties.label, "octagon",
-                array_properties.color.FillColorString().c_str(),
-                array_properties.color.TextColorString().c_str());
-      }
-
       // See comments above regarding weight and line_width calculations.
       float line_width =
           std::max(0.5f, array_properties.log2_buffer_size / 3.0f);
@@ -367,7 +366,6 @@ void DumpGraphviz(const Model& model, string* output_file_contents) {
       }
       AppendF(output_file_contents, kEdgeFormat, operator_id, output,
               line_width, weight);
-      already_added_arrays.insert(output);
     }
   }
 
diff --git a/tensorflow/contrib/lite/toco/export_tensorflow.cc b/tensorflow/contrib/lite/toco/export_tensorflow.cc
index f5157149afca17383a8625c489f15a23ce6dd224..c7c80ab21cc88ecb94e5234750431130dbd8400c 100644
--- a/tensorflow/contrib/lite/toco/export_tensorflow.cc
+++ b/tensorflow/contrib/lite/toco/export_tensorflow.cc
@@ -494,7 +494,7 @@ void ConvertTransposeConvOperator(const Model& model,
   const auto& weights_array = model.GetArray(weights_array_name);
   CHECK(weights_array.buffer->type == ArrayDataType::kFloat);
   ConvertFloatTensorConst(model, weights_array_name, AxesOrder::kOHWI,
-                          AxesOrder::kHWIO, tensorflow_graph);
+                          AxesOrder::kHWOI, tensorflow_graph);
   auto& strides = (*conv2d_op->mutable_attr())["strides"];
   strides.mutable_list()->add_i(1);
   strides.mutable_list()->add_i(src_op.stride_height);
@@ -1728,6 +1728,25 @@ void ConvertComparisonOperator(const Model& model, const Operator& src_op,
   (*comparison_op->mutable_attr())["T"].set_type(data_type);
 }
 
+void ConvertSparseToDenseOperator(const Model& model,
+                                  const SparseToDenseOperator& src_op,
+                                  const char* op_name,
+                                  GraphDef* tensorflow_graph) {
+  auto* sparse_to_dense_op = tensorflow_graph->add_node();
+  sparse_to_dense_op->set_op(op_name);
+  sparse_to_dense_op->set_name(src_op.outputs[0]);
+  CHECK_EQ(src_op.inputs.size(), 4);
+  for (int i = 0; i < 4; ++i) {
+    *sparse_to_dense_op->add_input() = src_op.inputs[i];
+  }
+  const auto data_type = GetTensorFlowDataType(model, src_op.inputs[3]);
+  (*sparse_to_dense_op->mutable_attr())["T"].set_type(data_type);
+  const auto index_type = GetTensorFlowDataType(model, src_op.inputs[0]);
+  (*sparse_to_dense_op->mutable_attr())["Tindices"].set_type(index_type);
+  (*sparse_to_dense_op->mutable_attr())["Tindices"].set_b(
+      src_op.validate_indices);
+}
+
 void ConvertOperator(const Model& model, const Operator& src_op,
                      GraphDef* tensorflow_graph) {
   if (src_op.fused_activation_function != FusedActivationFunctionType::kNone) {
@@ -1919,6 +1938,10 @@ void ConvertOperator(const Model& model, const Operator& src_op,
     ConvertRandomUniformOperator(
         model, static_cast<const RandomUniformOperator&>(src_op),
         tensorflow_graph);
+  } else if (src_op.type == OperatorType::kTensorFlowEqual) {
+    ConvertComparisonOperator(model, src_op, "Equal", tensorflow_graph);
+  } else if (src_op.type == OperatorType::kTensorFlowNotEqual) {
+    ConvertComparisonOperator(model, src_op, "NotEqual", tensorflow_graph);
   } else if (src_op.type == OperatorType::kTensorFlowGreater) {
     ConvertComparisonOperator(model, src_op, "Greater", tensorflow_graph);
   } else if (src_op.type == OperatorType::kTensorFlowGreaterEqual) {
diff --git a/tensorflow/contrib/lite/toco/g3doc/cmdline_examples.md b/tensorflow/contrib/lite/toco/g3doc/cmdline_examples.md
index 7680cdd344814bf6cbc7bbe11c915f220642d55d..8e93f02ef109f7bccd07ce54baff3d0bb4ae50c7 100644
--- a/tensorflow/contrib/lite/toco/g3doc/cmdline_examples.md
+++ b/tensorflow/contrib/lite/toco/g3doc/cmdline_examples.md
@@ -26,8 +26,6 @@ Table of contents:
     *   [Convert a TensorFlow Lite FlatBuffer back into TensorFlow GraphDef
         format](#to-graphdef)
 *   [Logging](#logging)
-    *   [Standard logging](#standard-logging)
-    *   [Verbose logging](#verbose-logging)
     *   [Graph "video" logging](#graph-video-logging)
 *   [Graph visualizations](#graph-visualizations)
     *   [Using --output_format=GRAPHVIZ_DOT](#using-output-formatgraphviz-dot)
@@ -277,49 +275,6 @@ bazel run --config=opt \
 
 ## Logging
 
-### Standard logging
-
-The converter generates some informative log messages during processing. The
-easiest way to view them is to add `--logtostderr` to command lines as seen in
-the following example.
-
-```
-curl https://storage.googleapis.com/download.tensorflow.org/models/mobilenet_v1_0.50_128_frozen.tgz \
-  | tar xzv -C /tmp
-bazel run --config=opt \
-  //tensorflow/contrib/lite/toco:toco -- \
-  --input_file=/tmp/mobilenet_v1_0.50_128/frozen_graph.pb \
-  --output_file=/tmp/foo.tflite \
-  --input_format=TENSORFLOW_GRAPHDEF \
-  --output_format=TFLITE \
-  --inference_type=FLOAT \
-  --input_shape=1,128,128,3 \
-  --input_array=input \
-  --output_array=MobilenetV1/Predictions/Reshape_1 \
-  --logtostderr
-```
-
-After some initialization messages, we get the following informative messages:
-
-```
-I1101 21:51:33.297475    5339 graph_transformations.cc:39] Before general graph transformations: 416 operators, 583 arrays (0 quantized)
-I1101 21:51:33.308972    5339 graph_transformations.cc:39] After general graph transformations pass 1: 31 operators, 89 arrays (0 quantized)
-I1101 21:51:33.309204    5339 graph_transformations.cc:39] Before dequantization graph transformations: 31 operators, 89 arrays (0 quantized)
-I1101 21:51:33.309368    5339 allocate_transient_arrays.cc:312] Total transient array allocated size: 1048576 bytes, theoretical optimal value: 786432 bytes.
-I1101 21:51:33.309484    5339 toco_tooling.cc:249] Estimated count of arithmetic ops: 0.099218 billion (note that a multiply-add is counted as 2 ops).
-```
-
-### Verbose logging
-
-For debugging purposes, the converter supports two levels of verbose logging,
-which can be set by passing a `--v=` flag:
-
-*   For `--v=1`, the converter generates text dumps of the graph at various
-    points during processing as well as log messages about every graph
-    transformation that took place.
-*   For `--v=2`, the converter additionally generates log messages about graph
-    transformations that were considered but not performed.
-
 ### Graph "video" logging
 
 When `--dump_graphviz=` is used (see the section on [graph
diff --git a/tensorflow/contrib/lite/toco/g3doc/cmdline_reference.md b/tensorflow/contrib/lite/toco/g3doc/cmdline_reference.md
index 9e99287f828c22aa81eb216c087f3261e378fc14..8085ae07489816c38677ff792e7ac71f1a75fa71 100644
--- a/tensorflow/contrib/lite/toco/g3doc/cmdline_reference.md
+++ b/tensorflow/contrib/lite/toco/g3doc/cmdline_reference.md
@@ -203,17 +203,11 @@ have.
     graph transformations on them, at the cost of no longer faithfully matching
     inference and training arithmetic.
 
-## Logging flags
-
-The following are standard Google logging flags:
+*   `--quantize_weights`. Type: boolean. Default: false. Store weights as
+    quantized weights followed by dequantize operations. Computation is still
+    done in float, but reduces model size (at the cost of accuracy and latency).
 
-*   `--logtostderr` redirects Google logging to standard error, typically making
-    it visible in a terminal.
-*   `--v` sets verbose logging levels (for debugging purposes). Defined levels:
-    *   `--v=1`: log all graph transformations that did make a change on the
-        graph.
-    *   `--v=2`: log all graph transformations that did *not* make a change on
-        the graph.
+## Logging flags
 
 The following flags allow to generate graph visualizations of the actual graph
 at various points during transformations:
diff --git a/tensorflow/contrib/lite/toco/g3doc/python_api.md b/tensorflow/contrib/lite/toco/g3doc/python_api.md
index f0fd638a618c75c75d336a746f9b1d8dccaea470..a7841a685528fb18bb08f1943278339a2daec16a 100644
--- a/tensorflow/contrib/lite/toco/g3doc/python_api.md
+++ b/tensorflow/contrib/lite/toco/g3doc/python_api.md
@@ -1,69 +1,202 @@
-# TensorFlow Lite Optimizing Converter (TOCO) Python API reference
+# TensorFlow Lite Optimizing Converter & Interpreter Python API reference
 
-This page provides examples on how to use TOCO via the Python API. It is
-complemented by the following documents:
+This page provides examples on how to use TOCO and the TensorFlow Lite
+interpreter via the Python API. It is complemented by the following documents:
 
 *   [README](../README.md)
 *   [Command-line examples](cmdline_examples.md)
 *   [Command-line glossary](cmdline_reference.md)
 
+Table of contents:
+
+*   [High-level overview](#high-level-overview)
+*   [API](#api)
+*   [Basic examples](#basic)
+    *   [Exporting a GraphDef from tf.Session](#basic-graphdef-sess)
+    *   [Exporting a GraphDef from file](#basic-graphdef-file)
+    *   [Exporting a SavedModel](#basic-savedmodel)
+*   [Complex examples](#complex)
+    *   [Exporting a quantized GraphDef](#complex-quant)
+*   [TensorFlow Lite Python interpreter](#interpreter)
+    *   [Using the interpreter from a model file](#interpreter-file)
+    *   [Using the interpreter from model data](#interpreter-data)
+
 ## High-level overview
 
 While the TensorFlow Lite Optimizing Converter can be used from the command
-line, it is often convenient to use it as part of Python model build and
+line, it is often convenient to use it as part of a Python model build and
 training script. This is so that conversion can be part of your model
 development pipeline. This allows you to know early and often that you are
 designing a model that can be targeted to devices with mobile.
 
 ## API
 
-In Python you can run `help(tf.contrib.lite)` to get documentation on functions.
-In particular, `tf.contrib.lite.toco_convert` presents a simple API and
-`tf.contrib.lite.toco_from_protos` allows more detailed control of TOCO using
-the protobuf interface to TOCO.
+The API for converting TensorFlow models to TensorFlow Lite is
+`tf.contrib.lite.TocoConverter`. The API for calling the Python intepreter is
+`tf.contrib.lite.Interpreter`.
+
+`TocoConverter` provides class methods based on the original format of the
+model. `TocoConverter.from_session()` is available for GraphDefs.
+`TocoConverter.from_saved_model()` is available for SavedModels. Example usages
+for simple float-point models are shown in [Basic Examples](#basic). Examples
+usages for more complex models is shown in [Complex Examples](#complex).
+
+**NOTE**: Currently, `TocoConverter` will cause a fatal error to the Python
+interpreter when the conversion fails. This will be remedied as soon as
+possible.
+
+## Basic examples <a name="basic"></a>
+
+The following section shows examples of how to convert a basic float-point model
+from each of the supported data formats into a TensorFlow Lite FlatBuffers.
 
-## Example
+### Exporting a GraphDef from tf.Session <a name="basic-graphdef-sess"></a>
 
-In particular, here we show creating a simple model and converting it to a
-TensorFlow Lite Model.
+The following example shows how to convert a TensorFlow GraphDef into a
+TensorFlow Lite FlatBuffer from a `tf.Session` object.
 
 ```python
 import tensorflow as tf
 
 img = tf.placeholder(name="img", dtype=tf.float32, shape=(1, 64, 64, 3))
-val = img + tf.constant([1., 2., 3.]) + tf.constant([1., 4., 4.])
+var = tf.get_variable("weights", dtype=tf.float32, shape=(1, 64, 64, 3))
+val = img + var
 out = tf.identity(val, name="out")
+
 with tf.Session() as sess:
-  tflite_model = tf.contrib.lite.toco_convert(sess.graph_def, [img], [out])
-  open("test.tflite", "wb").write(tflite_model)
+  converter = tf.contrib.lite.TocoConverter.from_session(sess, [img], [out])
+  tflite_model = converter.convert()
+  open("converted_model.tflite", "wb").write(tflite_model)
 ```
 
-**NOTE** Currently, the TOCO command will cause a fatal error to the Python
-interpreter when TOCO conversion fails. This will be remedied as soon as
-possible.
+### Exporting a GraphDef from file <a name="basic-graphdef-file"></a>
+
+The following example shows how to convert a TensorFlow GraphDef into a
+TensorFlow Lite FlatBuffer when the GraphDef is stored in a file. Both `.pb` and
+`.pbtxt` files are accepted.
+
+The example uses
+[Mobilenet_1.0_224](https://storage.googleapis.com/download.tensorflow.org/models/mobilenet_v1_1.0_224_frozen.tgz).
+The function only supports GraphDefs frozen via
+[freeze_graph.py](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/tools/freeze_graph.py).
+
+```python
+import tensorflow as tf
+
+graph_def_file = "/path/to/Downloads/mobilenet_v1_1.0_224/frozen_graph.pb"
+input_arrays = ["input"]
+output_arrays = ["MobilenetV1/Predictions/Softmax"]
+
+converter = tf.contrib.lite.TocoConverter.from_frozen_graph(
+  graph_def_file, input_arrays, output_arrays)
+tflite_model = converter.convert()
+open("converted_model.tflite", "wb").write(tflite_model)
+```
+
+### Exporting a SavedModel <a name="basic-savedmodel"></a>
+
+The following example shows how to convert a SavedModel into a TensorFlow Lite
+FlatBuffer.
+
+```python
+import tensorflow as tf
+
+converter = tf.contrib.lite.TocoConverter.from_saved_model(saved_model_dir)
+tflite_model = converter.convert()
+open("converted_model.tflite", "wb").write(tflite_model)
+```
+
+For more complex SavedModels, the optional parameters that can be passed into
+`TocoConverter.from_saved_model()` are `input_arrays`, `input_shapes`,
+`output_arrays`, `tag_set` and `signature_key`. Details of each parameter are
+available by running `help(tf.contrib.lite.TocoConverter)`.
 
-## Example 2: Export with variables
+## Complex examples <a name="complex"></a>
 
-If a model has variables, they need to be turned into constants. This process is
-known as freezing, and it can actually be accomplished with
+For models where the default value of the attributes is not sufficient, the
+attribute's values should be set before calling `convert()`. In order to call
+any constants use `tf.contrib.lite.constants.<CONSTANT_NAME>` as seen below with
+`QUANTIZED_UINT8`. Run `help(tf.contrib.lite.TocoConverter)` in the Python
+terminal for detailed documentation on the attributes.
+
+Although the examples are demonstrated on GraphDefs containing only constants.
+The same logic can be applied irrespective of the input data format.
+
+### Exporting a quantized GraphDef <a name="complex-quant"></a>
+
+The following example shows how to convert a quantized model into a TensorFlow
+Lite FlatBuffer.
 
 ```python
 import tensorflow as tf
 
 img = tf.placeholder(name="img", dtype=tf.float32, shape=(1, 64, 64, 3))
-var = tf.get_variable("weights", dtype=tf.float32, shape=(1,64,64,3))
-val = img + var
+const = tf.constant([1., 2., 3.]) + tf.constant([1., 4., 4.])
+val = img + const
+out = tf.fake_quant_with_min_max_args(val, min=0., max=1., name="output")
+
+with tf.Session() as sess:
+  converter = tf.contrib.lite.TocoConverter.from_session(sess, [img], [out])
+  converter.inference_type = tf.contrib.lite.constants.QUANTIZED_UINT8
+  input_arrays = converter.get_input_arrays()
+  converter.quantized_input_stats = {input_arrays[0] : (0., 1.)}  # mean, std_dev
+  tflite_model = converter.convert()
+  open("converted_model.tflite", "wb").write(tflite_model)
+```
+
+## TensorFlow Lite Python interpreter <a name="interpreter"></a>
+
+### Using the interpreter from a model file <a name="interpreter-file"></a>
+
+The following example shows how to use the TensorFlow Lite Python interpreter
+when provided a TensorFlow Lite FlatBuffer file. The example also demonstrates
+how to run inference on random input data. Run
+`help(tf.contrib.lite.Interpreter)` in the Python terminal to get detailed
+documentation on the interpreter.
+
+```python
+import numpy as np
+import tensorflow as tf
 
-def canonical_name(x):
-  return x.name.split(":")[0]
+# Load TFLite model and allocate tensors.
+interpreter = tf.contrib.lite.Interpreter(model_path="converted_model.tflite")
+interpreter.allocate_tensors()
 
+# Get input and output tensors.
+input_details = interpreter.get_input_details()
+output_details = interpreter.get_output_details()
+
+# Test model on random input data.
+input_shape = input_details[0]['shape']
+input_data = np.array(np.random.random_sample(input_shape), dtype=np.float32)
+interpreter.set_tensor(input_details[0]['index'], input_data)
+
+interpreter.invoke()
+output_data = interpreter.get_tensor(output_details[0]['index'])
+print(output_data)
+```
+
+### Using the interpreter from model data <a name="interpreter-data"></a>
+
+The following example shows how to use the TensorFlow Lite Python interpreter
+when starting with the TensorFlow Lite Flatbuffer model previously loaded. This
+example shows an end-to-end use case, starting from building the TensorFlow
+model.
+
+```python
+import numpy as np
+import tensorflow as tf
+
+img = tf.placeholder(name="img", dtype=tf.float32, shape=(1, 64, 64, 3))
+const = tf.constant([1., 2., 3.]) + tf.constant([1., 4., 4.])
+val = img + const
 out = tf.identity(val, name="out")
+
 with tf.Session() as sess:
-  sess.run(tf.global_variables_initializer())
-  out_tensors = [out]
-  frozen_graphdef = tf.graph_util.convert_variables_to_constants(
-      sess, sess.graph_def, map(canonical_name, out_tensors))
-  tflite_model = tf.contrib.lite.toco_convert(
-      frozen_graphdef, [img], out_tensors)
-  open("converted_model.tflite", "wb").write(tflite_model)
+  converter = tf.contrib.lite.TocoConverter.from_session(sess, [img], [out])
+  tflite_model = converter.convert()
+
+# Load TFLite model and allocate tensors.
+interpreter = tf.contrib.lite.Interpreter(model_content=tflite_model)
+interpreter.allocate_tensors()
 ```
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/create_im2col_arrays.cc b/tensorflow/contrib/lite/toco/graph_transformations/create_im2col_arrays.cc
index 076415ece8c1039caa32e947fe54ab3e101bec9e..8ca2cd66ac6377a70a4c504ac006dc0388b88bf7 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/create_im2col_arrays.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/create_im2col_arrays.cc
@@ -46,8 +46,9 @@ bool CreateIm2colArrays::Run(Model* model, std::size_t op_index) {
   const int kheight = weights_shape.dims(1);
   const int kwidth = weights_shape.dims(2);
   if (kwidth == 1 && kheight == 1 && conv_op->stride_width == 1 &&
-      conv_op->stride_height == 1) {
-    // 1x1 unstrided conv does not need an im2col array.
+      conv_op->stride_height == 1 && conv_op->dilation_width_factor == 1 &&
+      conv_op->dilation_height_factor == 1) {
+    // 1x1 unstrided undilated conv does not need an im2col array.
     return false;
   }
 
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/graph_transformations.h b/tensorflow/contrib/lite/toco/graph_transformations/graph_transformations.h
index 8da242aa9c2ca4917a681c95c3eded894664c046..1bc7557d46cfa5e1b27468d2da271e75fd491d36 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/graph_transformations.h
+++ b/tensorflow/contrib/lite/toco/graph_transformations/graph_transformations.h
@@ -139,6 +139,7 @@ DECLARE_GRAPH_TRANSFORMATION(PropagateFakeQuantNumBits);
 DECLARE_GRAPH_TRANSFORMATION(PropagateFixedSizes)
 DECLARE_GRAPH_TRANSFORMATION(HardcodeMinMax)
 DECLARE_GRAPH_TRANSFORMATION(Quantize)
+DECLARE_GRAPH_TRANSFORMATION(QuantizeWeights)
 DECLARE_GRAPH_TRANSFORMATION(RemoveFinalDequantizeOp)
 DECLARE_GRAPH_TRANSFORMATION(RemoveTensorFlowAssert)
 DECLARE_GRAPH_TRANSFORMATION(RemoveTensorFlowIdentity)
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/hardcode_min_max.cc b/tensorflow/contrib/lite/toco/graph_transformations/hardcode_min_max.cc
index d63ee7c9519d169a2f44ec1afe81125217db8976..bda6dce22be0f0ca83eb8339ad17573b0267c18c 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/hardcode_min_max.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/hardcode_min_max.cc
@@ -362,6 +362,8 @@ bool HardcodeMinMax::Run(Model* model, std::size_t op_index) {
       changed = HardcodeMinMaxForAverageOrMaxPool(model, op);
       break;
 
+    case OperatorType::kResizeBilinear:
+    case OperatorType::kSlice:
     case OperatorType::kStridedSlice:
     case OperatorType::kSqueeze:
     case OperatorType::kTensorFlowReshape:
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/identify_dilated_conv.cc b/tensorflow/contrib/lite/toco/graph_transformations/identify_dilated_conv.cc
index ae3301f467de5714230e731b4bab87ddc1637201..d49857cfc22ecaf5feb06b39a42187f8adb61d50 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/identify_dilated_conv.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/identify_dilated_conv.cc
@@ -90,12 +90,13 @@ bool IdentifyDilatedConv::Run(Model* model, std::size_t op_index) {
   }
 
   // Conv Op
-  ConvOperator* conv_op = dynamic_cast<ConvOperator*>(
-      has_expand_op ? GetOpWithInput(*model, post_stb_op->outputs[0])
-                    : GetOpWithInput(*model, stb_op->outputs[0]));
-  if (!conv_op || conv_op->type != OperatorType::kConv) {
+  const string& input_of_conv_op =
+      has_expand_op ? post_stb_op->outputs[0] : stb_op->outputs[0];
+  auto* conv_base_op = GetOpWithInput(*model, input_of_conv_op);
+  if (conv_base_op->type != OperatorType::kConv) {
     return false;
   }
+  auto* conv_op = static_cast<ConvOperator*>(conv_base_op);
   if (conv_op->inputs.size() != 2) {
     // The conv op must only have weights, no bias.
     return false;
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/identify_lstm_merge_inputs.cc b/tensorflow/contrib/lite/toco/graph_transformations/identify_lstm_merge_inputs.cc
index 3f768bfee12ebe31ebeb72855eb67ec03d5bcf8c..5b6a984ee143a6007471b165510030cd3ad3f73c 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/identify_lstm_merge_inputs.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/identify_lstm_merge_inputs.cc
@@ -33,9 +33,10 @@ bool MergeLstmCellInputs::Run(Model* model, std::size_t op_index) {
     return false;
   }
 
-  // Already a compact LstmCell with LstmCellOperator::NUM_INPUTS of inputs,
-  // do not need to merge cell inputs.
-  if (src_op->inputs.size() == LstmCellOperator::NUM_INPUTS) {
+  // Already a compact LstmCell. Do not need to merge cell inputs.
+  const auto* src_lstm_op = static_cast<LstmCellOperator*>(src_op);
+  if (src_lstm_op->kernel_type != LstmCellOperator::KERNEL_FULL ||
+      src_lstm_op->inputs.size() != kExtendedLstmInputCount) {
     return false;
   }
 
@@ -136,6 +137,7 @@ bool MergeLstmCellInputs::Run(Model* model, std::size_t op_index) {
 
   // Emplace a new LSTM cell operator (use basic 5 inputs kernel).
   auto lstm_cell_op = absl::make_unique<LstmCellOperator>();
+  lstm_cell_op->kernel_type = LstmCellOperator::KERNEL_BASIC;
 
   // Compact LstmCell's 5 inputs.
   lstm_cell_op->inputs.resize(LstmCellOperator::NUM_INPUTS);
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/identify_lstm_split_inputs.cc b/tensorflow/contrib/lite/toco/graph_transformations/identify_lstm_split_inputs.cc
index 8e66323bd769ca166d6b521c5b7b2f1cb944b0a2..e6e3dfa1de9c9fdd5e759fd547d11a7b8c95d837 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/identify_lstm_split_inputs.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/identify_lstm_split_inputs.cc
@@ -33,9 +33,10 @@ bool SplitLstmCellInputs::Run(Model* model, std::size_t op_index) {
     return false;
   }
 
-  // Already an extended LstmCell with kExtendedLstmInputCount of inputs,
-  // do not need to split cell inputs.
-  if (curr_op->inputs.size() == kExtendedLstmInputCount) {
+  const auto* curr_lstm_op = static_cast<LstmCellOperator*>(curr_op);
+  // Already an extended LstmCell. Do not need to split cell inputs.
+  if (curr_lstm_op->kernel_type != LstmCellOperator::KERNEL_BASIC ||
+      curr_lstm_op->inputs.size() != LstmCellOperator::NUM_INPUTS) {
     return false;
   }
 
@@ -56,6 +57,7 @@ bool SplitLstmCellInputs::Run(Model* model, std::size_t op_index) {
 
   // Emplace a new LstmCell operator with extended inputs (kernel/lstm.cc).
   auto lstm_cell_op = absl::make_unique<LstmCellOperator>();
+  lstm_cell_op->kernel_type = LstmCellOperator::KERNEL_FULL;
   lstm_cell_op->inputs.resize(kExtendedLstmInputCount);
   int num_input = model->GetArray(curr_op->inputs[LstmCellOperator::DATA_INPUT])
                       .shape()
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/propagate_array_data_types.cc b/tensorflow/contrib/lite/toco/graph_transformations/propagate_array_data_types.cc
index 6342cf3e8af4d85ad869a5d60a63d62ca2b00588..92d283ca2cc7069f4b80c95ffdadcad81a884cbf 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/propagate_array_data_types.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/propagate_array_data_types.cc
@@ -60,6 +60,8 @@ bool PropagateArrayDataTypes::Run(Model* model, std::size_t op_index) {
     case OperatorType::kTensorFlowLessEqual:
     case OperatorType::kTensorFlowGreater:
     case OperatorType::kTensorFlowGreaterEqual:
+    case OperatorType::kTensorFlowEqual:
+    case OperatorType::kTensorFlowNotEqual:
       // These operators unconditionally produce bool outputs
       SetDataTypeForAllOutputs(model, op, ArrayDataType::kBool);
       break;
@@ -163,6 +165,16 @@ bool PropagateArrayDataTypes::Run(Model* model, std::size_t op_index) {
       SetDataTypeForAllOutputs(model, op, data_type_x);
       break;
     }
+    case OperatorType::kSparseToDense: {
+      // Select produces outputs with the same type as their 3rd input
+      CHECK_EQ(op->inputs.size(), 4);
+      const ArrayDataType data_type = model->GetArray(op->inputs[2]).data_type;
+      const ArrayDataType data_type_default =
+          model->GetArray(op->inputs[3]).data_type;
+      CHECK(data_type == data_type_default);
+      SetDataTypeForAllOutputs(model, op, data_type);
+      break;
+    }
     default: {
       // These operators produce outputs with the same type as their 1st input
       CHECK_GT(op->inputs.size(), 0);
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/propagate_fixed_sizes.cc b/tensorflow/contrib/lite/toco/graph_transformations/propagate_fixed_sizes.cc
index 9d1d27f3ef01a572c2ae232b1f172a8e05374381..170a499d4eeea6f38704dadd5274e52da7ae2817 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/propagate_fixed_sizes.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/propagate_fixed_sizes.cc
@@ -278,7 +278,7 @@ void ProcessTransposeConvOperator(Model* model, TransposeConvOperator* op) {
       << "TransposeConv input shape must have 4 dimensions. Input \""
       << op->inputs[TransposeConvOperator::WEIGHTS] << "\" had shape "
       << toco::ShapeToString(weights_shape) << ".";
-  CHECK_EQ(input_shape.dims(3), weights_shape.dims(0))
+  CHECK_EQ(input_shape.dims(3), weights_shape.dims(3))
       << "Input shape depth and weight depth do not agree";
 
   // Set the output shape according to the specified output shape.
@@ -1477,6 +1477,34 @@ void ProcessArgMaxOperator(Model* model, ArgMaxOperator* op) {
   *output_array.mutable_shape()->mutable_dims() = output_dims;
 }
 
+void ProcessSparseToDenseOperator(Model* model, SparseToDenseOperator* op) {
+  CHECK_EQ(op->inputs.size(), 4);
+
+  const Array& output_shape_array = model->GetArray(op->inputs[1]);
+  if (!output_shape_array.has_shape()) return;
+  CHECK_EQ(output_shape_array.shape().dimensions_count(), 1);
+
+  // Output should not go over four dimensions.
+  CHECK_LE(output_shape_array.shape().dims(0), 4);
+
+  const string& output_name = op->outputs[0];
+  Array& output_array = model->GetArray(output_name);
+  if (output_array.has_shape()) return;
+
+  CHECK(output_shape_array.data_type == ArrayDataType::kInt32 ||
+        output_shape_array.data_type == ArrayDataType::kInt64);
+  if (output_shape_array.data_type == ArrayDataType::kInt32) {
+    *output_array.mutable_shape()->mutable_dims() =
+        output_shape_array.GetBuffer<ArrayDataType::kInt32>().data;
+  } else {
+    const std::vector<int64>& output_shape_data =
+        output_shape_array.GetBuffer<ArrayDataType::kInt64>().data;
+    std::copy(
+        output_shape_data.begin(), output_shape_data.end(),
+        std::back_inserter(*output_array.mutable_shape()->mutable_dims()));
+  }
+}
+
 }  // namespace
 
 bool PropagateFixedSizes::Run(Model* model, std::size_t op_index) {
@@ -1535,6 +1563,8 @@ bool PropagateFixedSizes::Run(Model* model, std::size_t op_index) {
     case OperatorType::kTensorFlowMaximum:
     case OperatorType::kTensorFlowMinimum:
     case OperatorType::kTensorFlowGreaterEqual:
+    case OperatorType::kTensorFlowEqual:
+    case OperatorType::kTensorFlowNotEqual:
       ProcessSimpleBinaryOperator(model, op);
       break;
     case OperatorType::kAddN:
@@ -1700,6 +1730,10 @@ bool PropagateFixedSizes::Run(Model* model, std::size_t op_index) {
       CHECK_EQ(op->inputs.size(), 1);
       ProcessOpWithShapeInput(model, op);
       break;
+    case OperatorType::kSparseToDense:
+      ProcessSparseToDenseOperator(model,
+                                   static_cast<SparseToDenseOperator*>(op));
+      break;
     default:
       // Unimplemented, another graph transformation should drop it.
       LOG(FATAL) << "Unhandled operator type " << OperatorTypeName(op->type);
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/quantize.cc b/tensorflow/contrib/lite/toco/graph_transformations/quantize.cc
index 142841fcc460e8a5e9e4f2333496f4ece2557275..eca2c701f8bbf889088794c939af7082db0734dd 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/quantize.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/quantize.cc
@@ -45,12 +45,14 @@ bool SupportsQuantization(const Operator& op) {
          type == OperatorType::kTensorFlowMinimum ||
          type == OperatorType::kTensorFlowMaximum ||
          type == OperatorType::kLogistic || type == OperatorType::kSoftmax ||
-         type == OperatorType::kLogSoftmax ||
+         type == OperatorType::kLogSoftmax || type == OperatorType::kSlice ||
+         type == OperatorType::kResizeBilinear ||
          type == OperatorType::kTensorFlowSplit || type == OperatorType::kSub ||
          type == OperatorType::kSqueeze || type == OperatorType::kPad ||
          type == OperatorType::kPadV2 ||
          type == OperatorType::kTensorFlowReshape ||
          type == OperatorType::kTanh || type == OperatorType::kMul ||
+         type == OperatorType::kSpaceToBatchND ||
          type == OperatorType::kSpaceToDepth ||
          type == OperatorType::kStridedSlice ||
          type == OperatorType::kDepthToSpace ||
@@ -60,7 +62,7 @@ bool SupportsQuantization(const Operator& op) {
          type == OperatorType::kTensorFlowGreaterEqual ||
          type == OperatorType::kTensorFlowLess ||
          type == OperatorType::kTensorFlowLessEqual ||
-         type == OperatorType::kSelect;
+         type == OperatorType::kSelect || type == OperatorType::kArgMax;
 }
 
 const MinMax& GetOrComputeMinMax(Model* model, const string& array_name) {
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/quantize_weights.cc b/tensorflow/contrib/lite/toco/graph_transformations/quantize_weights.cc
new file mode 100644
index 0000000000000000000000000000000000000000..88ea0945e7dd15ba325d34ea3fdbf34ff7d91381
--- /dev/null
+++ b/tensorflow/contrib/lite/toco/graph_transformations/quantize_weights.cc
@@ -0,0 +1,108 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <iterator>
+#include <string>
+#include <vector>
+
+#include "tensorflow/contrib/lite/toco/graph_transformations/graph_transformations.h"
+#include "tensorflow/contrib/lite/toco/graph_transformations/quantization_util.h"
+#include "tensorflow/contrib/lite/toco/model.h"
+#include "tensorflow/contrib/lite/toco/tooling_util.h"
+
+namespace toco {
+
+namespace {
+
+// The minimum number of elements a weights array must have to be quantized
+// by this transformation.
+// TODO(suharshs): Make this minimum size configurable.
+const int kWeightsMinSize = 1024;
+
+// Gets the quantization params from the float array.
+void GetQuantizationParamsFromArray(const Array& array,
+                                    QuantizationParams* params) {
+  const std::vector<float>& float_vals =
+      array.GetBuffer<ArrayDataType::kFloat>().data;
+  auto minmax = std::minmax_element(float_vals.begin(), float_vals.end());
+  MinMax toco_minmax;
+  toco_minmax.min = *minmax.first;
+  toco_minmax.max = *minmax.second;
+  GetQuantizationParams(ArrayDataType::kUint8, toco_minmax, params);
+}
+
+}  // namespace
+
+bool QuantizeWeights::Run(Model* model, std::size_t op_index) {
+  const auto op_it = model->operators.begin() + op_index;
+  Operator* op = op_it->get();
+
+  // Get the weights tensor, if the current operator has one.
+  int weights_index;
+  if (op->type == OperatorType::kConv ||
+      op->type == OperatorType::kDepthwiseConv ||
+      op->type == OperatorType::kFullyConnected) {
+    weights_index = 1;
+  } else if (op->type == OperatorType::kLstmCell) {
+    weights_index = LstmCellOperator::WEIGHTS_INPUT;
+  } else {
+    return false;
+  }
+
+  // Return early if the array isn't a constant param, this can happen in early
+  // transformation passes until transpose operations following the weight array
+  // are resolved.
+  const string weights = op->inputs[weights_index];
+  if (!IsConstantParameterArray(*model, weights)) {
+    return false;
+  }
+
+  // Return early if the weight tensor is not type float.
+  Array& weights_array = model->GetArray(weights);
+  if (weights_array.data_type != ArrayDataType::kFloat) {
+    return false;
+  }
+
+  // Return early if the tensor is too small. Small tensors don't take up too
+  // much space and can result in bad quantization results.
+  if (weights_array.GetBuffer<ArrayDataType::kFloat>().data.size() <
+      kWeightsMinSize) {
+    return false;
+  }
+
+  // Quantize the weight tensor to type kUint8.
+  QuantizationParams params;
+  GetQuantizationParamsFromArray(weights_array, &params);
+  QuantizeArray(this, model, weights, ArrayDataType::kUint8, params);
+
+  // Insert a Dequantize operation after the quantized weights tensor.
+  auto* dequantize_op = new DequantizeOperator;
+  model->operators.emplace(op_it, dequantize_op);
+
+  // Create a new intermediate tensor to connect the Dequantize op to the
+  // original op.
+  const string dequantized_output =
+      AvailableArrayName(*model, weights + "_dequantized");
+  Array& dequantized_output_array = model->GetOrCreateArray(dequantized_output);
+  dequantized_output_array.data_type = ArrayDataType::kFloat;
+
+  // Connect up the new Dequantize op with the weights and original op.
+  op->inputs[weights_index] = dequantized_output;
+  dequantize_op->inputs = {weights};
+  dequantize_op->outputs = {dequantized_output};
+
+  return true;
+}
+
+}  // namespace toco
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/tests/BUILD b/tensorflow/contrib/lite/toco/graph_transformations/tests/BUILD
index 8dcd4adc90b188c745cadb9815c3c46383705833..95e8433be2a332cfce5175f4f65ea0b83d5638c5 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/tests/BUILD
+++ b/tensorflow/contrib/lite/toco/graph_transformations/tests/BUILD
@@ -8,8 +8,8 @@ load(
 )
 
 tf_cc_test(
-    name = "resolve_constant_concatenation_test",
-    srcs = ["resolve_constant_concatenation_test.cc"],
+    name = "lstm_utils_test",
+    srcs = ["lstm_utils_test.cc"],
     deps = [
         "//tensorflow/contrib/lite/toco:graph_transformations",
         "//tensorflow/contrib/lite/toco:model",
@@ -19,8 +19,20 @@ tf_cc_test(
 )
 
 tf_cc_test(
-    name = "lstm_utils_test",
-    srcs = ["lstm_utils_test.cc"],
+    name = "quantize_weights_test",
+    srcs = ["quantize_weights_test.cc"],
+    deps = [
+        "//tensorflow/contrib/lite/toco:graph_transformations",
+        "//tensorflow/contrib/lite/toco:model",
+        "//tensorflow/contrib/lite/toco:tooling_util",
+        "@com_google_absl//absl/memory",
+        "@com_google_googletest//:gtest_main",
+    ],
+)
+
+tf_cc_test(
+    name = "resolve_constant_concatenation_test",
+    srcs = ["resolve_constant_concatenation_test.cc"],
     deps = [
         "//tensorflow/contrib/lite/toco:graph_transformations",
         "//tensorflow/contrib/lite/toco:model",
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/tests/quantize_weights_test.cc b/tensorflow/contrib/lite/toco/graph_transformations/tests/quantize_weights_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..c05eb0929fd775d315fa735b4c9842a7fc024fa8
--- /dev/null
+++ b/tensorflow/contrib/lite/toco/graph_transformations/tests/quantize_weights_test.cc
@@ -0,0 +1,167 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <math.h>
+#include <string>
+#include <vector>
+
+#include <gmock/gmock.h>
+#include <gtest/gtest.h>
+#include "absl/memory/memory.h"
+#include "tensorflow/contrib/lite/toco/graph_transformations/graph_transformations.h"
+#include "tensorflow/contrib/lite/toco/model.h"
+#include "tensorflow/contrib/lite/toco/tooling_util.h"
+
+namespace toco {
+
+class QuantizeWeightsTest : public ::testing::Test {
+ protected:
+  QuantizeWeightsTest() {}
+
+  // The name of the weights input array.
+  const string kWeightsName = "weights";
+  // The zero_point of the values in the input array.
+  const int kZeroPoint = 128;
+
+  // Prepare a hypothetical TOCO model of a quantizable fully connected float
+  // layer.
+  void PrepareModel(Model* model, int elements_per_dim) {
+    std::vector<string> fc_input_names = {"inputs", kWeightsName};
+
+    const int kDim = 4;
+    const int buf_size = std::pow(elements_per_dim, static_cast<double>(kDim));
+    auto in_buf = absl::make_unique<float[]>(buf_size);
+    // Initialize the array with values from -128.0 to 127.0, since these values
+    // should be exactly representable by quantization.
+    for (int i = 0; i < buf_size; i++) {
+      in_buf[i] = static_cast<float>(i % 256 - kZeroPoint);
+    }
+
+    for (const string& fc_input_name : fc_input_names) {
+      Array& in_array = model->GetOrCreateArray(fc_input_name);
+      in_array.data_type = ArrayDataType::kFloat;
+
+      // Initialize shape for the input array.
+      Shape* in_array_shape = in_array.mutable_shape();
+      std::vector<int>* in_array_shape_dim = in_array_shape->mutable_dims();
+      in_array_shape_dim->resize(kDim, elements_per_dim);
+      auto& in_array_buffer =
+          in_array.GetMutableBuffer<ArrayDataType::kFloat>();
+      in_array_buffer.data.resize(buf_size);
+      float* buf_ptr =
+          in_array.GetMutableBuffer<ArrayDataType::kFloat>().data.data();
+      std::copy(in_buf.get(), in_buf.get() + buf_size, buf_ptr);
+    }
+
+    auto* fc_op = new FullyConnectedOperator;
+    fc_op->inputs = fc_input_names;
+    fc_op->outputs = {"fc_op_outputs"};
+    Array& out_array = model->GetOrCreateArray(fc_op->outputs[0]);
+    out_array.data_type = ArrayDataType::kFloat;
+    Shape* out_array_shape = out_array.mutable_shape();
+    std::vector<int>* out_array_shape_dim = out_array_shape->mutable_dims();
+    out_array_shape_dim->resize(kDim, elements_per_dim);
+    model->operators.push_back(std::unique_ptr<Operator>(fc_op));
+  }
+};
+
+TEST_F(QuantizeWeightsTest, QuantizedFullyConnected) {
+  // Test that weight arrays that are large enough are quantized.
+  Model model;
+  // 6 elements per dim gives us 1296 elements, which is sufficient to be
+  // quantized.
+  PrepareModel(&model, 6);
+
+  // Check the state of the graph before the transformation.
+  const auto& float_array_map = model.GetArrayMap();
+  EXPECT_EQ(float_array_map.size(), 3);
+  // Before the transformation, all arrays should be type float.
+  for (const auto& element : float_array_map) {
+    EXPECT_EQ(element.second->data_type, ArrayDataType::kFloat);
+  }
+  const std::vector<float> float_weight_vals =
+      model.GetArray(kWeightsName).GetBuffer<ArrayDataType::kFloat>().data;
+
+  // Invoke the transformation.
+  GraphTransformationsSet graph_transformation_set;
+  graph_transformation_set.Add(new toco::QuantizeWeights);
+  (*graph_transformation_set.begin())->Run(&model, /*op_index=*/0);
+
+  // Check the state of the graph after the transformation.
+  const auto& quantized_array_map = model.GetArrayMap();
+  EXPECT_EQ(quantized_array_map.size(), 4);
+  // After the transformation, three arrays should be type float and one array
+  // should be uint8.
+  int num_float = 0;
+  int num_uint8 = 0;
+  for (const auto& element : quantized_array_map) {
+    if (element.second->data_type == ArrayDataType::kFloat) {
+      num_float++;
+    } else if (element.second->data_type == ArrayDataType::kUint8) {
+      num_uint8++;
+    } else {
+      FAIL() << "Unexpected array type.";
+    }
+  }
+  EXPECT_EQ(num_float, 3);
+  EXPECT_EQ(num_uint8, 1);
+  // Ensure that the values were quantized correctly.
+  const std::vector<uint8>& quantized_weight_vals =
+      model.GetArray(kWeightsName).GetBuffer<ArrayDataType::kUint8>().data;
+  for (int i = 0; i < quantized_weight_vals.size(); i++) {
+    EXPECT_EQ(quantized_weight_vals[i], float_weight_vals[i] + kZeroPoint);
+  }
+
+  // Ensure that a Dequantize operator has been inserted before the
+  // FullyConnectedLayer.
+  EXPECT_EQ(model.operators[0]->type, OperatorType::kDequantize);
+}
+
+TEST_F(QuantizeWeightsTest, NotQuantizedFullyConnected) {
+  // Test that weight arrays that are too small are left untouched.
+  Model model;
+  // 5 elements per dim gives us 625 elements, which is NOT sufficient to be
+  // quantized.
+  PrepareModel(&model, 5);
+
+  // Check the state of the graph before the transformation.
+  const auto& float_array_map = model.GetArrayMap();
+  EXPECT_EQ(float_array_map.size(), 3);
+  // Before the transformation, all arrays should be type float.
+  for (auto it = float_array_map.begin(); it != float_array_map.end(); it++) {
+    EXPECT_EQ(it->second->data_type, ArrayDataType::kFloat);
+  }
+  std::vector<float> float_weight_vals =
+      model.GetArray(kWeightsName).GetBuffer<ArrayDataType::kFloat>().data;
+
+  // Invoke the transformation.
+  GraphTransformationsSet graph_transformation_set;
+  graph_transformation_set.Add(new toco::QuantizeWeights);
+  (*graph_transformation_set.begin())->Run(&model, /*op_index=*/0);
+
+  // Check the state of the graph after the transformation.
+  const auto& post_array_map = model.GetArrayMap();
+  EXPECT_EQ(post_array_map.size(), 3);
+  for (auto it = post_array_map.begin(); it != post_array_map.end(); it++) {
+    EXPECT_EQ(it->second->data_type, ArrayDataType::kFloat);
+  }
+  // Ensure that the values remain unchanged.
+  std::vector<float> const& quantized_weight_vals =
+      model.GetArray(kWeightsName).GetBuffer<ArrayDataType::kFloat>().data;
+  for (int i = 0; i < quantized_weight_vals.size(); i++) {
+    EXPECT_EQ(quantized_weight_vals[i], float_weight_vals[i]);
+  }
+}
+
+}  // namespace toco
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/tests/resolve_constant_concatenation_test.cc b/tensorflow/contrib/lite/toco/graph_transformations/tests/resolve_constant_concatenation_test.cc
index 3a1d175b9823f085c9b8730caba8bedd7eb87d52..66cfed4ac26969729d1881f11ba6ae74d9817fb5 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/tests/resolve_constant_concatenation_test.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/tests/resolve_constant_concatenation_test.cc
@@ -12,9 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
-#include <memory>
 #include <string>
-#include <unordered_map>
 #include <vector>
 
 #include <gmock/gmock.h>
@@ -126,7 +124,7 @@ class ResolveConstantConcatenationTest : public ::testing::Test {
       Array& in_array = model->GetOrCreateArray(concat_input_name);
       in_array.data_type = ArrayDataType::kFloat;
 
-      // Initialize shape for the input  array.
+      // Initialize shape for the input array.
       Shape* in_array_shape = in_array.mutable_shape();
       std::vector<int>* in_array_shape_dim = in_array_shape->mutable_dims();
       for (int i = 0; i < kDim; i++) {
diff --git a/tensorflow/contrib/lite/toco/import_tensorflow.cc b/tensorflow/contrib/lite/toco/import_tensorflow.cc
index ea051bb84ac1b70612397b7a929cf9c5d82c59de..cd4f034dfea57b6d379b67a90ba4fa3fe3d615d5 100644
--- a/tensorflow/contrib/lite/toco/import_tensorflow.cc
+++ b/tensorflow/contrib/lite/toco/import_tensorflow.cc
@@ -31,7 +31,6 @@ limitations under the License.
 #include "tensorflow/contrib/lite/toco/model_flags.pb.h"
 #include "tensorflow/contrib/lite/toco/tensorflow_graph_matching/resolve_cluster.h"
 #include "tensorflow/contrib/lite/toco/tensorflow_util.h"
-#include "tensorflow/contrib/lite/toco/toco_port.h"
 #include "tensorflow/contrib/lite/toco/tooling_util.h"
 #include "tensorflow/core/common_runtime/device_factory.h"
 #include "tensorflow/core/common_runtime/function.h"
@@ -44,6 +43,7 @@ limitations under the License.
 #include "tensorflow/core/framework/tensor_shape.pb.h"
 #include "tensorflow/core/framework/types.pb.h"
 #include "tensorflow/core/graph/graph_constructor.h"
+#include "tensorflow/core/lib/core/status.h"
 #include "tensorflow/core/platform/logging.h"
 #include "tensorflow/core/public/session_options.h"
 #include "tensorflow/core/public/version.h"
@@ -63,8 +63,6 @@ using tensorflow::TensorShapeProto;
 
 namespace toco {
 
-using port::Status;
-
 namespace {
 bool HasAttr(const NodeDef& node, const string& attr_name) {
   return node.attr().count(attr_name) > 0;
@@ -130,6 +128,42 @@ const AttrValue::ListValue& GetListAttr(const NodeDef& node,
   return attr.list();
 }
 
+tensorflow::Status CheckOptionalAttr(const NodeDef& node,
+                                     const string& attr_name,
+                                     const string& expected_value) {
+  if (HasAttr(node, attr_name)) {
+    const string& value = GetStringAttr(node, attr_name);
+    if (value != expected_value) {
+      return tensorflow::errors::InvalidArgument(
+          "Unexpected value for attribute '" + attr_name + "'. Expected '" +
+          expected_value + "'");
+    }
+  }
+  return tensorflow::Status::OK();
+}
+
+tensorflow::Status CheckOptionalAttr(
+    const NodeDef& node, const string& attr_name,
+    const tensorflow::DataType& expected_value) {
+  if (HasAttr(node, attr_name)) {
+    const tensorflow::DataType& value = GetDataTypeAttr(node, attr_name);
+    if (value != expected_value) {
+      return tensorflow::errors::InvalidArgument(
+          "Unexpected value for attribute '" + attr_name + "'. Expected '" +
+          tensorflow::DataType_Name(expected_value) + "'");
+    }
+  }
+  return tensorflow::Status::OK();
+}
+
+template <typename T1, typename T2>
+tensorflow::Status ExpectValue(const T1& v1, const T2& v2,
+                               const string& description) {
+  if (v1 == v2) return tensorflow::Status::OK();
+  return tensorflow::errors::InvalidArgument(absl::StrCat(
+      "Unexpected ", description, ": got ", v1, ", expected ", v2));
+}
+
 ArrayDataType ConvertDataType(tensorflow::DataType dtype) {
   if (dtype == DT_UINT8)
     return ArrayDataType::kUint8;
@@ -148,9 +182,10 @@ ArrayDataType ConvertDataType(tensorflow::DataType dtype) {
   return ArrayDataType::kNone;
 }
 
-Status ImportShape(const TFLITE_PROTO_NS::RepeatedPtrField<
-                       tensorflow::TensorShapeProto_Dim>& input_dims,
-                   int* input_flat_size, Shape* shape) {
+tensorflow::Status ImportShape(
+    const TFLITE_PROTO_NS::RepeatedPtrField<tensorflow::TensorShapeProto_Dim>&
+        input_dims,
+    int* input_flat_size, Shape* shape) {
   std::vector<int> input_dims_only_sizes;
   for (auto& d : input_dims) {
     if (d.size() == 0) {
@@ -160,23 +195,24 @@ Status ImportShape(const TFLITE_PROTO_NS::RepeatedPtrField<
       // For now, tweaking this to record a 0-D shape instead.
       shape->mutable_dims()->clear();
       if (input_flat_size != nullptr) *input_flat_size = 0;
-      return Status::OK();
+      return tensorflow::Status::OK();
     }
     // TensorFlow's shapes use int64s, while TOCO uses ints.
     if (d.size() > std::numeric_limits<int>::max()) {
-      return Status(false, "Shape element overflows");
+      return tensorflow::errors::InvalidArgument("Shape element overflows");
     }
 
     input_dims_only_sizes.push_back(d.size());
   }
   *shape->mutable_dims() = input_dims_only_sizes;
 
-  if (input_flat_size == nullptr) return Status::OK();
+  if (input_flat_size == nullptr) return tensorflow::Status::OK();
 
   return NumElements(input_dims_only_sizes, input_flat_size);
 }
 
-Status ImportFloatArray(const TensorProto& input_tensor, Array* output_array) {
+tensorflow::Status ImportFloatArray(const TensorProto& input_tensor,
+                                    Array* output_array) {
   CHECK_EQ(input_tensor.dtype(), DT_FLOAT);
   const auto& input_shape = input_tensor.tensor_shape();
   CHECK_LE(input_shape.dim_size(), 4);
@@ -203,18 +239,18 @@ Status ImportFloatArray(const TensorProto& input_tensor, Array* output_array) {
     toco::port::CopyToBuffer(input_tensor.tensor_content(),
                              reinterpret_cast<char*>(output_float_data.data()));
   } else {
-    return Status(
-        false,
+    return tensorflow::errors::InvalidArgument(
         absl::StrCat("Neither input_content (",
                      input_tensor.tensor_content().size() / sizeof(float),
                      ") nor float_val (", input_tensor.float_val_size(),
                      ") have the right dimensions (", input_flat_size,
                      ") for this float tensor"));
   }
-  return Status::OK();
+  return tensorflow::Status::OK();
 }
 
-Status ImportQuint8Array(const TensorProto& input_tensor, Array* output_array) {
+tensorflow::Status ImportQuint8Array(const TensorProto& input_tensor,
+                                     Array* output_array) {
   CHECK_EQ(input_tensor.dtype(), DT_QUINT8);
   const auto& input_shape = input_tensor.tensor_shape();
   CHECK_LE(input_shape.dim_size(), 4);
@@ -236,18 +272,18 @@ Status ImportQuint8Array(const TensorProto& input_tensor, Array* output_array) {
     toco::port::CopyToBuffer(input_tensor.tensor_content(),
                              reinterpret_cast<char*>(output_int_data.data()));
   } else {
-    return Status(
-        false,
+    return tensorflow::errors::InvalidArgument(
         absl::StrCat("Neither input_content (",
                      input_tensor.tensor_content().size() / sizeof(uint8_t),
                      ") nor int_val (", input_tensor.int_val_size(),
                      ") have the right dimensions (", input_flat_size,
                      ") for this uint8 tensor"));
   }
-  return Status::OK();
+  return tensorflow::Status::OK();
 }
 
-Status ImportInt32Array(const TensorProto& input_tensor, Array* output_array) {
+tensorflow::Status ImportInt32Array(const TensorProto& input_tensor,
+                                    Array* output_array) {
   CHECK_EQ(input_tensor.dtype(), DT_INT32);
   const auto& input_shape = input_tensor.tensor_shape();
   CHECK_LE(input_shape.dim_size(), 4);
@@ -269,18 +305,17 @@ Status ImportInt32Array(const TensorProto& input_tensor, Array* output_array) {
     toco::port::CopyToBuffer(input_tensor.tensor_content(),
                              reinterpret_cast<char*>(output_int_data.data()));
   } else {
-    return Status(
-        false,
-        absl::StrCat("Neither input_content (",
-                     input_tensor.tensor_content().size() / sizeof(int32),
-                     ") nor int_val (", input_tensor.int_val_size(),
-                     ") have the right dimensions (", input_flat_size,
-                     ") for this int32 tensor"));
+    return tensorflow::errors::InvalidArgument(absl::StrCat(
+        "Neither input_content (",
+        input_tensor.tensor_content().size() / sizeof(int32), ") nor int_val (",
+        input_tensor.int_val_size(), ") have the right dimensions (",
+        input_flat_size, ") for this int32 tensor"));
   }
-  return Status::OK();
+  return tensorflow::Status::OK();
 }
 
-Status ImportInt64Array(const TensorProto& input_tensor, Array* output_array) {
+tensorflow::Status ImportInt64Array(const TensorProto& input_tensor,
+                                    Array* output_array) {
   CHECK_EQ(input_tensor.dtype(), DT_INT64);
   const auto& input_shape = input_tensor.tensor_shape();
   CHECK_LE(input_shape.dim_size(), 4);
@@ -302,18 +337,18 @@ Status ImportInt64Array(const TensorProto& input_tensor, Array* output_array) {
     toco::port::CopyToBuffer(input_tensor.tensor_content(),
                              reinterpret_cast<char*>(output_int_data.data()));
   } else {
-    return Status(
-        false,
+    return tensorflow::errors::InvalidArgument(
         absl::StrCat("Neither input_content (",
                      input_tensor.tensor_content().size() / sizeof(int64),
                      ") nor int64_val (", input_tensor.int64_val_size(),
                      ") have the right dimensions (", input_flat_size,
                      ") for this int64 tensor"));
   }
-  return Status::OK();
+  return tensorflow::Status::OK();
 }
 
-Status ImportBoolArray(const TensorProto& input_tensor, Array* output_array) {
+tensorflow::Status ImportBoolArray(const TensorProto& input_tensor,
+                                   Array* output_array) {
   CHECK_EQ(input_tensor.dtype(), DT_BOOL);
   const auto& input_shape = input_tensor.tensor_shape();
   CHECK_LE(input_shape.dim_size(), 4);
@@ -343,19 +378,19 @@ Status ImportBoolArray(const TensorProto& input_tensor, Array* output_array) {
     // So far only encountered that in an array with 1 entry, let's
     // require that until we encounter a graph where that's not the case.
     if (output_bool_data.size() != 1) {
-      return Status(
-          false, absl::StrCat("Neither input_content (",
-                              input_tensor.tensor_content().size(),
-                              ") nor bool_val (", input_tensor.bool_val_size(),
-                              ") have the right dimensions (", input_flat_size,
-                              ") for this bool tensor"));
+      return tensorflow::errors::InvalidArgument(absl::StrCat(
+          "Neither input_content (", input_tensor.tensor_content().size(),
+          ") nor bool_val (", input_tensor.bool_val_size(),
+          ") have the right dimensions (", input_flat_size,
+          ") for this bool tensor"));
     }
     output_bool_data[0] = false;
   }
-  return Status::OK();
+  return tensorflow::Status::OK();
 }
 
-Status ImportStringArray(const TensorProto& input_tensor, Array* output_array) {
+tensorflow::Status ImportStringArray(const TensorProto& input_tensor,
+                                     Array* output_array) {
   CHECK_EQ(input_tensor.dtype(), DT_STRING);
   const auto& input_shape = input_tensor.tensor_shape();
   CHECK_LE(input_shape.dim_size(), 4);
@@ -365,9 +400,9 @@ Status ImportStringArray(const TensorProto& input_tensor, Array* output_array) {
   if (!status.ok()) return status;
 
   if (input_flat_size != input_tensor.string_val_size()) {
-    return Status(false,
-                  "Input_content string_val doesn't have the right dimensions "
-                  "for this string tensor");
+    return tensorflow::errors::InvalidArgument(
+        "Input_content string_val doesn't have the right dimensions "
+        "for this string tensor");
   }
 
   auto& output_string_data =
@@ -377,7 +412,7 @@ Status ImportStringArray(const TensorProto& input_tensor, Array* output_array) {
   for (int i = 0; i < input_flat_size; ++i) {
     output_string_data[i] = input_tensor.string_val(i);
   }
-  return Status::OK();
+  return tensorflow::Status::OK();
 }
 
 // Count the number of inputs of a given node. If
@@ -417,14 +452,14 @@ string CreateConstArray(Model* model, string const& name,
   return array_name;
 }
 
-Status ConvertConstOperator(const NodeDef& node,
-                            const TensorFlowImportFlags& tf_import_flags,
-                            Model* model) {
+tensorflow::Status ConvertConstOperator(
+    const NodeDef& node, const TensorFlowImportFlags& tf_import_flags,
+    Model* model) {
   CHECK_EQ(node.op(), "Const");
   const auto& tensor = GetTensorAttr(node, "value");
   const auto dtype = GetDataTypeAttr(node, "dtype");
 
-  Status status = Status::OK();
+  tensorflow::Status status = tensorflow::Status::OK();
 
   auto& array = model->GetOrCreateArray(node.name());
   switch (dtype) {
@@ -460,24 +495,21 @@ Status ConvertConstOperator(const NodeDef& node,
       array.GetMutableBuffer<ArrayDataType::kNone>();
       break;
   }
-  if (!status.ok()) {
-    status.AppendMessage(" (while processing node '" + node.name() + "')");
-  }
-  return status;
+  TF_RETURN_WITH_CONTEXT_IF_ERROR(
+      status, " (while processing node '" + node.name() + "')");
+  return tensorflow::Status::OK();
 }
 
-void ConvertConvOperator(const NodeDef& node,
-                         const TensorFlowImportFlags& tf_import_flags,
-                         Model* model) {
+tensorflow::Status ConvertConvOperator(
+    const NodeDef& node, const TensorFlowImportFlags& tf_import_flags,
+    Model* model) {
   CHECK_EQ(node.op(), "Conv2D");
   CheckInputsCount(node, tf_import_flags, 2);
 
   // We only support NHWC, which is the default data_format.
   // So if data_format is not defined, we're all good.
-  if (HasAttr(node, "data_format")) {
-    CHECK_EQ(GetStringAttr(node, "data_format"), "NHWC");
-  }
-  CHECK_EQ(GetDataTypeAttr(node, "T"), DT_FLOAT);
+  TF_RETURN_IF_ERROR(CheckOptionalAttr(node, "data_format", "NHWC"));
+  TF_RETURN_IF_ERROR(CheckOptionalAttr(node, "T", DT_FLOAT));
 
   const auto& input_name = node.input(0);
   const auto& weights_name = node.input(1);
@@ -502,27 +534,26 @@ void ConvertConvOperator(const NodeDef& node,
   auto* conv = new ConvOperator;
   conv->inputs = {input_name, reordered_weights_name};
   conv->outputs = {node.name()};
+  if (!HasAttr(node, "strides")) {
+    return tensorflow::errors::InvalidArgument("Missing attribute 'strides'");
+  }
   const auto& strides = GetListAttr(node, "strides");
-  CHECK_EQ(strides.i_size(), 4);
-  CHECK_EQ(strides.i(0), 1);
-  CHECK_EQ(strides.i(3), 1);
+  TF_RETURN_IF_ERROR(ExpectValue(strides.i_size(), 4, "number of strides"));
+  TF_RETURN_IF_ERROR(ExpectValue(strides.i(0), 1, "strides(0)"));
+  TF_RETURN_IF_ERROR(ExpectValue(strides.i(3), 1, "strides(3)"));
   conv->stride_height = strides.i(1);
   conv->stride_width = strides.i(2);
   if (HasAttr(node, "dilations")) {
     const auto& dilations = GetListAttr(node, "dilations");
-    CHECK_EQ(dilations.i_size(), 4);
-    CHECK_EQ(dilations.i(0), 1)
-        << "Can only import Conv ops with dilation along the height (1st) or "
-           "width (2nd) axis. TensorFlow op \""
-        << node.name() << "\" had dilations:[ " << dilations.i(0) << ", "
-        << dilations.i(1) << ", " << dilations.i(2) << ", " << dilations.i(3)
-        << "].";
-    CHECK_EQ(dilations.i(3), 1)
-        << "Can only import Conv ops with dilation along the height (1st) or "
-           "width (2nd) axis. TensorFlow op \""
-        << node.name() << "\" had dilations:[ " << dilations.i(0) << ", "
-        << dilations.i(1) << ", " << dilations.i(2) << ", " << dilations.i(3)
-        << "].";
+    TF_RETURN_IF_ERROR(
+        ExpectValue(dilations.i_size(), 4, "number of dilations"));
+    if (dilations.i(0) != 1 || dilations.i(3) != 1) {
+      return tensorflow::errors::InvalidArgument(absl::StrCat(
+          "Can only import Conv ops with dilation along the height "
+          "(1st) or width (2nd) axis. TensorFlow op \"",
+          node.name(), "\" had dilations:[ ", dilations.i(0), ", ",
+          dilations.i(1), ", ", dilations.i(2), ", ", dilations.i(3), "]."));
+    }
     conv->dilation_height_factor = dilations.i(1);
     conv->dilation_width_factor = dilations.i(2);
   } else {
@@ -535,9 +566,12 @@ void ConvertConvOperator(const NodeDef& node,
   } else if (padding == "VALID") {
     conv->padding.type = PaddingType::kValid;
   } else {
-    LOG(FATAL) << "Bad padding (only SAME and VALID are supported)";
+    return tensorflow::errors::InvalidArgument(
+        "Bad padding (only SAME and VALID are supported)");
   }
   model->operators.emplace_back(conv);
+
+  return tensorflow::Status::OK();
 }
 
 void ConvertDepthwiseConvOperator(const NodeDef& node,
@@ -614,7 +648,14 @@ void ConvertSpaceToDepthOperator(const NodeDef& node,
   CHECK_EQ(node.op(), "SpaceToDepth");
   CheckInputsCount(node, tf_import_flags, 1);
 
-  CHECK_EQ(GetDataTypeAttr(node, "T"), DT_FLOAT);
+  tensorflow::DataType dtype = GetDataTypeAttr(node, "T");
+  if (dtype != DT_FLOAT && dtype != DT_UINT8 && dtype != DT_INT32 &&
+      dtype != DT_INT64) {
+    const auto* enum_descriptor = tensorflow::DataType_descriptor();
+    LOG(FATAL) << "TFLite does not support SpaceToDepth with type T:"
+               << enum_descriptor->FindValueByNumber(dtype)->name() << ". "
+               << "T must be one of {DT_FLOAT, DT_INT8, DT_INT32, DT_INT64}.";
+  }
   auto* op = new SpaceToDepthOperator;
   op->inputs.push_back(node.input(0));
   op->outputs.push_back(node.name());
@@ -656,81 +697,6 @@ void ConvertRandomUniform(const NodeDef& node,
   model->operators.emplace_back(std::move(op));
 }
 
-void ConvertReluOperator(const NodeDef& node,
-                         const TensorFlowImportFlags& tf_import_flags,
-                         Model* model) {
-  CHECK_EQ(node.op(), "Relu");
-  CheckInputsCount(node, tf_import_flags, 1);
-  const auto& input_name = node.input(0);
-  auto* relu = new ReluOperator;
-  relu->inputs.push_back(input_name);
-  relu->outputs.push_back(node.name());
-  model->operators.emplace_back(relu);
-}
-
-void ConvertRelu6Operator(const NodeDef& node,
-                          const TensorFlowImportFlags& tf_import_flags,
-                          Model* model) {
-  CHECK_EQ(node.op(), "Relu6");
-  CheckInputsCount(node, tf_import_flags, 1);
-
-  const auto& input_name = node.input(0);
-  auto* op = new Relu6Operator;
-  op->inputs.push_back(input_name);
-  op->outputs.push_back(node.name());
-  model->operators.emplace_back(op);
-}
-
-void ConvertLogOperator(const NodeDef& node,
-                        const TensorFlowImportFlags& tf_import_flags,
-                        Model* model) {
-  CHECK_EQ(node.op(), "Log");
-  CheckInputsCount(node, tf_import_flags, 1);
-
-  auto op = absl::make_unique<LogOperator>();
-  op->inputs.push_back(node.input(0));
-  op->outputs.push_back(node.name());
-  model->operators.emplace_back(std::move(op));
-}
-
-void ConvertLogisticOperator(const NodeDef& node,
-                             const TensorFlowImportFlags& tf_import_flags,
-                             Model* model) {
-  CHECK_EQ(node.op(), "Sigmoid");
-  CheckInputsCount(node, tf_import_flags, 1);
-
-  const auto& input_name = node.input(0);
-  auto* op = new LogisticOperator;
-  op->inputs.push_back(input_name);
-  op->outputs.push_back(node.name());
-  model->operators.emplace_back(op);
-}
-
-void ConvertTanhOperator(const NodeDef& node,
-                         const TensorFlowImportFlags& tf_import_flags,
-                         Model* model) {
-  CHECK_EQ(node.op(), "Tanh");
-  CheckInputsCount(node, tf_import_flags, 1);
-
-  const auto& input_name = node.input(0);
-  auto* op = new TanhOperator;
-  op->inputs.push_back(input_name);
-  op->outputs.push_back(node.name());
-  model->operators.emplace_back(op);
-}
-
-void ConvertDivOperator(const NodeDef& node,
-                        const TensorFlowImportFlags& tf_import_flags,
-                        Model* model) {
-  CHECK(node.op() == "Div" || node.op() == "RealDiv");
-  CheckInputsCount(node, tf_import_flags, 2);
-  auto* op = new DivOperator;
-  op->inputs.push_back(node.input(0));
-  op->inputs.push_back(node.input(1));
-  op->outputs.push_back(node.name());
-  model->operators.emplace_back(op);
-}
-
 void ConvertIdentityOperator(const NodeDef& node,
                              const TensorFlowImportFlags& tf_import_flags,
                              Model* model) {
@@ -787,38 +753,6 @@ void ConvertFakeQuantWithMinMaxVars(
   model->operators.emplace_back(op);
 }
 
-void ConvertNegOperator(const NodeDef& node,
-                        const TensorFlowImportFlags& tf_import_flags,
-                        Model* model) {
-  CHECK_EQ(node.op(), "Neg");
-  CheckInputsCount(node, tf_import_flags, 1);
-  auto* op = new NegOperator;
-  op->inputs.push_back(node.input(0));
-  op->outputs.push_back(node.name());
-  model->operators.emplace_back(op);
-}
-
-void ConvertRsqrtOperator(const NodeDef& node,
-                          const TensorFlowImportFlags& tf_import_flags,
-                          Model* model) {
-  CHECK_EQ(node.op(), "Rsqrt");
-  CheckInputsCount(node, tf_import_flags, 1);
-  auto* op = new TensorFlowRsqrtOperator;
-  op->inputs.push_back(node.input(0));
-  op->outputs.push_back(node.name());
-  model->operators.emplace_back(op);
-}
-
-void ConvertSqrtOperator(const NodeDef& node,
-                         const TensorFlowImportFlags& tf_import_flags,
-                         Model* model) {
-  CHECK_EQ(node.op(), "Sqrt");
-  CheckInputsCount(node, tf_import_flags, 1);
-  auto* op = new TensorFlowSqrtOperator;
-  op->inputs.push_back(node.input(0));
-  op->outputs.push_back(node.name());
-  model->operators.emplace_back(op);
-}
 
 void ConvertSqueezeOperator(const NodeDef& node,
                             const TensorFlowImportFlags& tf_import_flags,
@@ -840,66 +774,6 @@ void ConvertSqueezeOperator(const NodeDef& node,
   model->operators.emplace_back(op);
 }
 
-void ConvertSquareOperator(const NodeDef& node,
-                           const TensorFlowImportFlags& tf_import_flags,
-                           Model* model) {
-  CHECK_EQ(node.op(), "Square");
-  CheckInputsCount(node, tf_import_flags, 1);
-  auto* op = new TensorFlowSquareOperator;
-  op->inputs.push_back(node.input(0));
-  op->outputs.push_back(node.name());
-  model->operators.emplace_back(op);
-}
-
-void ConvertAddOperator(const NodeDef& node,
-                        const TensorFlowImportFlags& tf_import_flags,
-                        Model* model) {
-  CHECK_EQ(node.op(), "Add");
-  CheckInputsCount(node, tf_import_flags, 2);
-  auto* op = new AddOperator;
-  op->inputs.push_back(node.input(0));
-  op->inputs.push_back(node.input(1));
-  op->outputs.push_back(node.name());
-  model->operators.emplace_back(op);
-}
-
-void ConvertAddNOperator(const NodeDef& node,
-                         const TensorFlowImportFlags& tf_import_flags,
-                         Model* model) {
-  CHECK_EQ(node.op(), "AddN");
-  const int num_inputs = GetInputsCount(node, tf_import_flags);
-  auto* op = new AddNOperator;
-  for (int i = 0; i < num_inputs; ++i) {
-    op->inputs.push_back(node.input(i));
-  }
-  op->outputs.push_back(node.name());
-  model->operators.emplace_back(op);
-}
-
-void ConvertMulOperator(const NodeDef& node,
-                        const TensorFlowImportFlags& tf_import_flags,
-                        Model* model) {
-  CHECK_EQ(node.op(), "Mul");
-  CheckInputsCount(node, tf_import_flags, 2);
-  auto* op = new MulOperator;
-  op->inputs.push_back(node.input(0));
-  op->inputs.push_back(node.input(1));
-  op->outputs.push_back(node.name());
-  model->operators.emplace_back(op);
-}
-
-void ConvertSubOperator(const NodeDef& node,
-                        const TensorFlowImportFlags& tf_import_flags,
-                        Model* model) {
-  CHECK_EQ(node.op(), "Sub");
-  CheckInputsCount(node, tf_import_flags, 2);
-  auto* op = new SubOperator;
-  op->inputs.push_back(node.input(0));
-  op->inputs.push_back(node.input(1));
-  op->outputs.push_back(node.name());
-  model->operators.emplace_back(op);
-}
-
 void ConvertSumOperator(const NodeDef& node,
                         const TensorFlowImportFlags& tf_import_flags,
                         Model* model) {
@@ -915,67 +789,6 @@ void ConvertSumOperator(const NodeDef& node,
   }
 }
 
-void ConvertTileOperator(const NodeDef& node,
-                         const TensorFlowImportFlags& tf_import_flags,
-                         Model* model) {
-  CHECK_EQ(node.op(), "Tile");
-  CheckInputsCount(node, tf_import_flags, 2);
-  auto* op = new TensorFlowTileOperator;
-  op->inputs.push_back(node.input(0));
-  op->inputs.push_back(node.input(1));
-  op->outputs.push_back(node.name());
-  model->operators.emplace_back(op);
-}
-
-void ConvertSliceOperator(const NodeDef& node,
-                          const TensorFlowImportFlags& tf_import_flags,
-                          Model* model) {
-  CHECK_EQ(node.op(), "Slice");
-  CheckInputsCount(node, tf_import_flags, 3);
-  auto* op = new SliceOperator;
-  for (int i = 0; i < 3; ++i) {
-    op->inputs.push_back(node.input(i));
-  }
-  op->outputs.push_back(node.name());
-  model->operators.emplace_back(op);
-}
-
-void ConvertPadOperator(const NodeDef& node,
-                        const TensorFlowImportFlags& tf_import_flags,
-                        Model* model) {
-  CHECK_EQ(node.op(), "Pad");
-  CheckInputsCount(node, tf_import_flags, 2);
-  auto* op = new PadOperator;
-  op->inputs.push_back(node.input(0));
-  op->inputs.push_back(node.input(1));
-  op->outputs.push_back(node.name());
-  model->operators.emplace_back(op);
-}
-
-void ConvertPadV2Operator(const NodeDef& node,
-                          const TensorFlowImportFlags& tf_import_flags,
-                          Model* model) {
-  CHECK_EQ(node.op(), "PadV2");
-  CheckInputsCount(node, tf_import_flags, 3);
-  auto* op = new PadV2Operator;
-  op->inputs.push_back(node.input(0));
-  op->inputs.push_back(node.input(1));
-  op->inputs.push_back(node.input(2));
-  op->outputs.push_back(node.name());
-  model->operators.emplace_back(op);
-}
-
-void ConvertShapeOperator(const NodeDef& node,
-                          const TensorFlowImportFlags& tf_import_flags,
-                          Model* model) {
-  CHECK_EQ(node.op(), "Shape");
-  CheckInputsCount(node, tf_import_flags, 1);
-  auto* op = new TensorFlowShapeOperator;
-  op->inputs.push_back(node.input(0));
-  op->outputs.push_back(node.name());
-  model->operators.emplace_back(op);
-}
-
 void ConvertSplitOperator(const NodeDef& node,
                           const TensorFlowImportFlags& tf_import_flags,
                           Model* model) {
@@ -993,18 +806,6 @@ void ConvertSplitOperator(const NodeDef& node,
   model->operators.emplace_back(op);
 }
 
-void ConvertMergeOperator(const NodeDef& node,
-                          const TensorFlowImportFlags& tf_import_flags,
-                          Model* model) {
-  CHECK_EQ(node.op(), "Merge");
-  CheckInputsCount(node, tf_import_flags, 2);
-  auto* op = new TensorFlowMergeOperator;
-  op->inputs.push_back(node.input(0));
-  op->inputs.push_back(node.input(1));
-  op->outputs.push_back(node.name());
-  model->operators.emplace_back(op);
-}
-
 void ConvertSwitchOperator(const NodeDef& node,
                            const TensorFlowImportFlags& tf_import_flags,
                            Model* model) {
@@ -1034,18 +835,6 @@ void ConvertSoftmaxOperator(const NodeDef& node,
   model->operators.emplace_back(softmax);
 }
 
-void ConvertLogSoftmaxOperator(const NodeDef& node,
-                               const TensorFlowImportFlags& tf_import_flags,
-                               Model* model) {
-  CHECK_EQ(node.op(), "LogSoftmax");
-  CheckInputsCount(node, tf_import_flags, 1);
-  const auto& input_name = node.input(0);
-  auto* log_softmax = new LogSoftmaxOperator;
-  log_softmax->inputs.push_back(input_name);
-  log_softmax->outputs.push_back(node.name());
-  model->operators.emplace_back(log_softmax);
-}
-
 void ConvertLRNOperator(const NodeDef& node,
                         const TensorFlowImportFlags& tf_import_flags,
                         Model* model) {
@@ -1142,17 +931,6 @@ void ConvertAvgPoolOperator(const NodeDef& node,
   model->operators.emplace_back(avgpool);
 }
 
-void ConvertReshapeOperator(const NodeDef& node,
-                            const TensorFlowImportFlags& tf_import_flags,
-                            Model* model) {
-  CHECK_EQ(node.op(), "Reshape");
-  CheckInputsCount(node, tf_import_flags, 2);
-  auto* op = new TensorFlowReshapeOperator;
-  op->inputs.push_back(node.input(0));
-  op->inputs.push_back(node.input(1));
-  op->outputs.push_back(node.name());
-  model->operators.emplace_back(op);
-}
 
 void ConvertBatchMatMulOperator(const NodeDef& node,
                                 const TensorFlowImportFlags& tf_import_flags,
@@ -1215,24 +993,12 @@ void ConvertConcatOperator(const NodeDef& node,
   model->operators.emplace_back(op);
 }
 
-void ConvertAllOperator(const NodeDef& node,
-                        const TensorFlowImportFlags& tf_import_flags,
-                        Model* model) {
-  CHECK_EQ(node.op(), "All");
-  auto* op = new TensorFlowAllOperator;
-  const int num_inputs = GetInputsCount(node, tf_import_flags);
-  for (int i = 0; i < num_inputs; ++i) {
-    op->inputs.push_back(node.input(i));
-  }
-  op->outputs.push_back(node.name());
-  model->operators.emplace_back(op);
-}
-
-void ConvertAssertOperator(const NodeDef& node,
+// This method supports simple operators without additional attributes.
+template <typename Op>
+void ConvertSimpleOperator(const NodeDef& node,
                            const TensorFlowImportFlags& tf_import_flags,
                            Model* model) {
-  CHECK_EQ(node.op(), "Assert");
-  auto* op = new TensorFlowAssertOperator;
+  auto* op = new Op;
   const int num_inputs = GetInputsCount(node, tf_import_flags);
   for (int i = 0; i < num_inputs; ++i) {
     op->inputs.push_back(node.input(i));
@@ -1241,69 +1007,13 @@ void ConvertAssertOperator(const NodeDef& node,
   model->operators.emplace_back(op);
 }
 
-void ConvertLessOperator(const NodeDef& node,
-                         const TensorFlowImportFlags& tf_import_flags,
-                         Model* model) {
-  CHECK_EQ(node.op(), "Less");
-  auto* op = new TensorFlowLessOperator;
-  const int num_inputs = GetInputsCount(node, tf_import_flags);
-  for (int i = 0; i < num_inputs; ++i) {
-    op->inputs.push_back(node.input(i));
-  }
-  op->outputs.push_back(node.name());
-  model->operators.emplace_back(op);
-}
-
-void ConvertLessEqualOperator(const NodeDef& node,
-                              const TensorFlowImportFlags& tf_import_flags,
-                              Model* model) {
-  CHECK_EQ(node.op(), "LessEqual");
-  auto* op = new TensorFlowLessEqualOperator;
-  const int num_inputs = GetInputsCount(node, tf_import_flags);
-  for (int i = 0; i < num_inputs; ++i) {
-    op->inputs.push_back(node.input(i));
-  }
-  op->outputs.push_back(node.name());
-  model->operators.emplace_back(op);
-}
-
-void ConvertSinOperator(const NodeDef& node,
-                        const TensorFlowImportFlags& tf_import_flags,
-                        Model* model) {
-  CHECK_EQ(node.op(), "Sin");
-  auto* op = new SinOperator;
-  const int num_inputs = GetInputsCount(node, tf_import_flags);
-  for (int i = 0; i < num_inputs; ++i) {
-    op->inputs.push_back(node.input(i));
-  }
-  op->outputs.push_back(node.name());
-  model->operators.emplace_back(op);
-}
-
-void ConvertGreaterOperator(const NodeDef& node,
-                            const TensorFlowImportFlags& tf_import_flags,
-                            Model* model) {
-  CHECK_EQ(node.op(), "Greater");
-  auto* op = new TensorFlowGreaterOperator;
-  const int num_inputs = GetInputsCount(node, tf_import_flags);
-  for (int i = 0; i < num_inputs; ++i) {
-    op->inputs.push_back(node.input(i));
-  }
-  op->outputs.push_back(node.name());
-  model->operators.emplace_back(op);
-}
-
-void ConvertGreaterEqualOperator(const NodeDef& node,
-                                 const TensorFlowImportFlags& tf_import_flags,
-                                 Model* model) {
-  CHECK_EQ(node.op(), "GreaterEqual");
-  auto* op = new TensorFlowGreaterEqualOperator;
-  const int num_inputs = GetInputsCount(node, tf_import_flags);
-  for (int i = 0; i < num_inputs; ++i) {
-    op->inputs.push_back(node.input(i));
-  }
-  op->outputs.push_back(node.name());
-  model->operators.emplace_back(op);
+// This method supports simple operators without additional attributes.
+template <typename Op, unsigned int NumInputs>
+void ConvertSimpleOperator(const NodeDef& node,
+                           const TensorFlowImportFlags& tf_import_flags,
+                           Model* model) {
+  CheckInputsCount(node, tf_import_flags, NumInputs);
+  ConvertSimpleOperator<Op>(node, tf_import_flags, model);
 }
 
 void ConvertMaxOperator(const NodeDef& node,
@@ -1336,29 +1046,6 @@ void ConvertMinOperator(const NodeDef& node,
   }
 }
 
-void ConvertMaximumOperator(const NodeDef& node,
-                            const TensorFlowImportFlags& tf_import_flags,
-                            Model* model) {
-  CHECK_EQ(node.op(), "Maximum");
-  CheckInputsCount(node, tf_import_flags, 2);
-  auto* op = new TensorFlowMaximumOperator;
-  op->inputs.push_back(node.input(0));
-  op->inputs.push_back(node.input(1));
-  op->outputs.push_back(node.name());
-  model->operators.emplace_back(op);
-}
-
-void ConvertMinimumOperator(const NodeDef& node,
-                            const TensorFlowImportFlags& tf_import_flags,
-                            Model* model) {
-  CHECK_EQ(node.op(), "Minimum");
-  CheckInputsCount(node, tf_import_flags, 2);
-  auto* op = new TensorFlowMinimumOperator;
-  op->inputs.push_back(node.input(0));
-  op->inputs.push_back(node.input(1));
-  op->outputs.push_back(node.name());
-  model->operators.emplace_back(op);
-}
 
 void ConvertUnsupportedOperator(const NodeDef& node,
                                 const TensorFlowImportFlags& tf_import_flags,
@@ -1387,19 +1074,6 @@ void ConvertUnsupportedOperator(const NodeDef& node,
   }
 }
 
-void ConvertSelectOperator(const NodeDef& node,
-                           const TensorFlowImportFlags& tf_import_flags,
-                           Model* model) {
-  CheckInputsCount(node, tf_import_flags, 3);
-
-  auto* op = new SelectOperator;
-  for (const auto& input : node.input()) {
-    op->inputs.push_back(input);
-  }
-  op->outputs.push_back(node.name());
-  model->operators.emplace_back(op);
-}
-
 void ConvertStridedSliceOperator(const NodeDef& node,
                                  const TensorFlowImportFlags& tf_import_flags,
                                  Model* model) {
@@ -1678,17 +1352,6 @@ void ConvertBatchToSpaceNDOperator(const NodeDef& node,
   model->operators.emplace_back(op);
 }
 
-void ConvertExpOperator(const NodeDef& node,
-                        const TensorFlowImportFlags& tf_import_flags,
-                        Model* model) {
-  CHECK_EQ(node.op(), "Exp");
-  CheckInputsCount(node, tf_import_flags, 1);
-  auto* op = new ExpOperator;
-  op->inputs.push_back(node.input(0));
-  op->outputs.push_back(node.name());
-  model->operators.emplace_back(op);
-}
-
 void ConvertMeanOperator(const NodeDef& node,
                          const TensorFlowImportFlags& tf_import_flags,
                          Model* model) {
@@ -1779,11 +1442,13 @@ void ConvertTransposeConvOperator(const NodeDef& node,
   if (existing_transpose) {
     CHECK(existing_transpose->type == OperatorType::kTranspose);
   } else {
-    // Transpose weights from HWIO order to OHWI order, which is more efficient
-    // for computation
+    // Transpose weights from HWOI order to OHWI order, which is more efficient
+    // for computation. (Note that TensorFlow considers the order as HWIO
+    // because they consider this a backward conv, inverting the sense of
+    // input/output.)
     TransposeOperator* transpose = new TransposeOperator;
     string perm_array = CreateConstArray<ArrayDataType::kInt32>(
-        model, node.name() + "_transpose_perm", {3, 0, 1, 2});
+        model, node.name() + "_transpose_perm", {2, 0, 1, 3});
     transpose->inputs = {weights_name, perm_array};
     transpose->outputs = {transposed_weights_name};
     model->operators.emplace_back(transpose);
@@ -1802,53 +1467,6 @@ void ConvertTransposeConvOperator(const NodeDef& node,
   model->operators.emplace_back(op);
 }
 
-void ConvertExpandDimsOperator(const NodeDef& node,
-                               const TensorFlowImportFlags& tf_import_flags,
-                               Model* model) {
-  CHECK_EQ(node.op(), "ExpandDims");
-  CheckInputsCount(node, tf_import_flags, 2);
-  auto* op = new ExpandDimsOperator;
-  op->inputs.push_back(node.input(0));
-  op->inputs.push_back(node.input(1));
-  op->outputs.push_back(node.name());
-  model->operators.emplace_back(op);
-}
-
-void ConvertFillOperator(const NodeDef& node,
-                         const TensorFlowImportFlags& tf_import_flags,
-                         Model* model) {
-  CHECK_EQ(node.op(), "Fill");
-  CheckInputsCount(node, tf_import_flags, 2);
-  auto* op = new FillOperator;
-  op->inputs.push_back(node.input(0));
-  op->inputs.push_back(node.input(1));
-  op->outputs.push_back(node.name());
-  model->operators.emplace_back(op);
-}
-
-void ConvertFloorDivOperator(const NodeDef& node,
-                             const TensorFlowImportFlags& tf_import_flags,
-                             Model* model) {
-  CHECK_EQ(node.op(), "FloorDiv");
-  CheckInputsCount(node, tf_import_flags, 2);
-  auto* op = new FloorDivOperator;
-  op->inputs.push_back(node.input(0));
-  op->inputs.push_back(node.input(1));
-  op->outputs.push_back(node.name());
-  model->operators.emplace_back(op);
-}
-
-void ConvertFloorModOperator(const NodeDef& node,
-                             const TensorFlowImportFlags& tf_import_flags,
-                             Model* model) {
-  CHECK_EQ(node.op(), "FloorMod");
-  CheckInputsCount(node, tf_import_flags, 2);
-  auto* op = new FloorModOperator;
-  op->inputs.push_back(node.input(0));
-  op->inputs.push_back(node.input(1));
-  op->outputs.push_back(node.name());
-  model->operators.emplace_back(op);
-}
 
 void ConvertRangeOperator(const NodeDef& node,
                           const TensorFlowImportFlags& tf_import_flags,
@@ -1869,17 +1487,6 @@ void ConvertRangeOperator(const NodeDef& node,
   model->operators.emplace_back(op);
 }
 
-void ConvertRankOperator(const NodeDef& node,
-                         const TensorFlowImportFlags& tf_import_flags,
-                         Model* model) {
-  CHECK_EQ(node.op(), "Rank");
-  CheckInputsCount(node, tf_import_flags, 1);
-  auto* op = new RankOperator;
-  op->inputs.push_back(node.input(0));
-  op->outputs.push_back(node.name());
-  model->operators.emplace_back(op);
-}
-
 void ConvertStackOperator(const NodeDef& node,
                           const TensorFlowImportFlags& tf_import_flags,
                           Model* model) {
@@ -1900,17 +1507,6 @@ void ConvertStackOperator(const NodeDef& node,
   model->operators.emplace_back(op);
 }
 
-void ConvertTransposeOperator(const NodeDef& node,
-                              const TensorFlowImportFlags& tf_import_flags,
-                              Model* model) {
-  CHECK_EQ(node.op(), "Transpose");
-  CheckInputsCount(node, tf_import_flags, 2);
-  auto* op = new TransposeOperator;
-  op->inputs.push_back(node.input(0));
-  op->inputs.push_back(node.input(1));
-  op->outputs.push_back(node.name());
-  model->operators.emplace_back(op);
-}
 
 // Some TensorFlow ops only occur in graph cycles, representing
 // control flow. We do not currently support control flow, so we wouldn't
@@ -2133,18 +1729,36 @@ void ConvertDynamicStitchOperator(const NodeDef& node,
   model->operators.emplace_back(op.release());
 }
 
+void ConvertSparseToDenseOperator(const NodeDef& node,
+                                  const TensorFlowImportFlags& tf_import_flags,
+                                  Model* model) {
+  CHECK_EQ(node.op(), "SparseToDense");
+  CheckInputsCount(node, tf_import_flags, 4);
+
+  auto* op = new SparseToDenseOperator;
+  for (const string& input : node.input()) {
+    op->inputs.push_back(input);
+  }
+  op->outputs.push_back(node.name());
+
+  op->validate_indices = HasAttr(node, "validate_indices")
+                             ? GetBoolAttr(node, "validate_indices")
+                             : true;
+  model->operators.emplace_back(op);
+}
+
 }  // namespace
 
 namespace internal {
-Status ImportTensorFlowNode(const tensorflow::NodeDef& node,
-                            const TensorFlowImportFlags& tf_import_flags,
-                            Model* model) {
+tensorflow::Status ImportTensorFlowNode(
+    const tensorflow::NodeDef& node,
+    const TensorFlowImportFlags& tf_import_flags, Model* model) {
   // TODO(ahentz): Historically these functions all CHECK-fail on error. We've
   // been slowly converting them to return Status.
   if (node.op() == "Const") {
     return ConvertConstOperator(node, tf_import_flags, model);
   } else if (node.op() == "Conv2D") {
-    ConvertConvOperator(node, tf_import_flags, model);
+    return ConvertConvOperator(node, tf_import_flags, model);
   } else if (node.op() == "Conv2DBackpropInput") {
     ConvertTransposeConvOperator(node, tf_import_flags, model);
   } else if (node.op() == "DepthwiseConv2dNative") {
@@ -2156,25 +1770,26 @@ Status ImportTensorFlowNode(const tensorflow::NodeDef& node,
   } else if (node.op() == "BiasAdd") {
     ConvertBiasAddOperator(node, tf_import_flags, model);
   } else if (node.op() == "Relu") {
-    ConvertReluOperator(node, tf_import_flags, model);
+    ConvertSimpleOperator<ReluOperator, 1>(node, tf_import_flags, model);
   } else if (node.op() == "Relu6") {
-    ConvertRelu6Operator(node, tf_import_flags, model);
+    ConvertSimpleOperator<Relu6Operator, 1>(node, tf_import_flags, model);
   } else if (node.op() == "Sigmoid") {
-    ConvertLogisticOperator(node, tf_import_flags, model);
+    ConvertSimpleOperator<LogisticOperator, 1>(node, tf_import_flags, model);
   } else if (node.op() == "Tanh") {
-    ConvertTanhOperator(node, tf_import_flags, model);
+    ConvertSimpleOperator<TanhOperator, 1>(node, tf_import_flags, model);
   } else if (node.op() == "MaxPool") {
     ConvertMaxPoolOperator(node, tf_import_flags, model);
   } else if (node.op() == "AvgPool") {
     ConvertAvgPoolOperator(node, tf_import_flags, model);
   } else if (node.op() == "Reshape") {
-    ConvertReshapeOperator(node, tf_import_flags, model);
+    ConvertSimpleOperator<TensorFlowReshapeOperator, 2>(node, tf_import_flags,
+                                                        model);
   } else if (node.op() == "BatchMatMul") {
     ConvertBatchMatMulOperator(node, tf_import_flags, model);
   } else if (node.op() == "MatMul") {
     ConvertMatMulOperator(node, tf_import_flags, model);
   } else if (node.op() == "Div" || node.op() == "RealDiv") {
-    ConvertDivOperator(node, tf_import_flags, model);
+    ConvertSimpleOperator<DivOperator, 2>(node, tf_import_flags, model);
   } else if (node.op() == "Identity" || node.op() == "CheckNumerics" ||
              node.op() == "StopGradient") {
     ConvertIdentityOperator(node, tf_import_flags, model);
@@ -2183,27 +1798,31 @@ Status ImportTensorFlowNode(const tensorflow::NodeDef& node,
   } else if (node.op() == "FakeQuantWithMinMaxArgs") {
     ConvertFakeQuantWithMinMaxArgs(node, tf_import_flags, model);
   } else if (node.op() == "Neg") {
-    ConvertNegOperator(node, tf_import_flags, model);
+    ConvertSimpleOperator<NegOperator, 1>(node, tf_import_flags, model);
   } else if (node.op() == "Rsqrt") {
-    ConvertRsqrtOperator(node, tf_import_flags, model);
+    ConvertSimpleOperator<TensorFlowRsqrtOperator, 1>(node, tf_import_flags,
+                                                      model);
   } else if (node.op() == "Squeeze") {
     ConvertSqueezeOperator(node, tf_import_flags, model);
   } else if (node.op() == "Sqrt") {
-    ConvertSqrtOperator(node, tf_import_flags, model);
+    ConvertSimpleOperator<TensorFlowSqrtOperator, 1>(node, tf_import_flags,
+                                                     model);
   } else if (node.op() == "Square") {
-    ConvertSquareOperator(node, tf_import_flags, model);
+    ConvertSimpleOperator<TensorFlowSquareOperator, 1>(node, tf_import_flags,
+                                                       model);
   } else if (node.op() == "Add") {
-    ConvertAddOperator(node, tf_import_flags, model);
+    ConvertSimpleOperator<AddOperator, 2>(node, tf_import_flags, model);
   } else if (node.op() == "AddN") {
-    ConvertAddNOperator(node, tf_import_flags, model);
+    ConvertSimpleOperator<AddNOperator>(node, tf_import_flags, model);
   } else if (node.op() == "Mul") {
-    ConvertMulOperator(node, tf_import_flags, model);
+    ConvertSimpleOperator<MulOperator, 2>(node, tf_import_flags, model);
   } else if (node.op() == "Sub") {
-    ConvertSubOperator(node, tf_import_flags, model);
+    ConvertSimpleOperator<SubOperator, 2>(node, tf_import_flags, model);
   } else if (node.op() == "Sum") {
     ConvertSumOperator(node, tf_import_flags, model);
   } else if (node.op() == "Tile") {
-    ConvertTileOperator(node, tf_import_flags, model);
+    ConvertSimpleOperator<TensorFlowTileOperator, 2>(node, tf_import_flags,
+                                                     model);
   } else if (node.op() == "Concat" || node.op() == "ConcatV2") {
     ConvertConcatOperator(node, tf_import_flags, model);
   } else if (node.op() == "LRN") {
@@ -2211,41 +1830,50 @@ Status ImportTensorFlowNode(const tensorflow::NodeDef& node,
   } else if (node.op() == "Softmax") {
     ConvertSoftmaxOperator(node, tf_import_flags, model);
   } else if (node.op() == "Log") {
-    ConvertLogOperator(node, tf_import_flags, model);
+    ConvertSimpleOperator<LogOperator, 1>(node, tf_import_flags, model);
   } else if (node.op() == "LogSoftmax") {
-    ConvertLogSoftmaxOperator(node, tf_import_flags, model);
+    ConvertSimpleOperator<LogSoftmaxOperator, 1>(node, tf_import_flags, model);
   } else if (node.op() == "All") {
-    ConvertAllOperator(node, tf_import_flags, model);
+    ConvertSimpleOperator<TensorFlowAllOperator>(node, tf_import_flags, model);
   } else if (node.op() == "Assert") {
-    ConvertAssertOperator(node, tf_import_flags, model);
+    ConvertSimpleOperator<TensorFlowAssertOperator>(node, tf_import_flags,
+                                                    model);
   } else if (node.op() == "Less") {
-    ConvertLessOperator(node, tf_import_flags, model);
+    ConvertSimpleOperator<TensorFlowLessOperator, 2>(node, tf_import_flags,
+                                                     model);
   } else if (node.op() == "LessEqual") {
-    ConvertLessEqualOperator(node, tf_import_flags, model);
+    ConvertSimpleOperator<TensorFlowLessEqualOperator, 2>(node, tf_import_flags,
+                                                          model);
   } else if (node.op() == "Greater") {
-    ConvertGreaterOperator(node, tf_import_flags, model);
+    ConvertSimpleOperator<TensorFlowGreaterOperator, 2>(node, tf_import_flags,
+                                                        model);
   } else if (node.op() == "GreaterEqual") {
-    ConvertGreaterEqualOperator(node, tf_import_flags, model);
+    ConvertSimpleOperator<TensorFlowGreaterEqualOperator, 2>(
+        node, tf_import_flags, model);
   } else if (node.op() == "Max") {
     ConvertMaxOperator(node, tf_import_flags, model);
   } else if (node.op() == "Min") {
     ConvertMinOperator(node, tf_import_flags, model);
   } else if (node.op() == "Maximum") {
-    ConvertMaximumOperator(node, tf_import_flags, model);
+    ConvertSimpleOperator<TensorFlowMaximumOperator, 2>(node, tf_import_flags,
+                                                        model);
   } else if (node.op() == "Minimum") {
-    ConvertMinimumOperator(node, tf_import_flags, model);
+    ConvertSimpleOperator<TensorFlowMinimumOperator, 2>(node, tf_import_flags,
+                                                        model);
   } else if (node.op() == "Merge") {
-    ConvertMergeOperator(node, tf_import_flags, model);
+    ConvertSimpleOperator<TensorFlowMergeOperator, 2>(node, tf_import_flags,
+                                                      model);
   } else if (node.op() == "Pad") {
-    ConvertPadOperator(node, tf_import_flags, model);
+    ConvertSimpleOperator<PadOperator, 2>(node, tf_import_flags, model);
   } else if (node.op() == "PadV2") {
-    ConvertPadV2Operator(node, tf_import_flags, model);
+    ConvertSimpleOperator<PadV2Operator, 3>(node, tf_import_flags, model);
   } else if (node.op() == "StridedSlice") {
     ConvertStridedSliceOperator(node, tf_import_flags, model);
   } else if (node.op() == "Shape") {
-    ConvertShapeOperator(node, tf_import_flags, model);
+    ConvertSimpleOperator<TensorFlowShapeOperator, 1>(node, tf_import_flags,
+                                                      model);
   } else if (node.op() == "Slice") {
-    ConvertSliceOperator(node, tf_import_flags, model);
+    ConvertSimpleOperator<SliceOperator, 3>(node, tf_import_flags, model);
   } else if (node.op() == "Split") {
     ConvertSplitOperator(node, tf_import_flags, model);
   } else if (node.op() == "Switch") {
@@ -2282,25 +1910,25 @@ Status ImportTensorFlowNode(const tensorflow::NodeDef& node,
   } else if (node.op() == "NextIteration") {
     ConvertOperatorSpecialCasedAsRNNBackEdge(node, tf_import_flags, model);
   } else if (node.op() == "ExpandDims") {
-    ConvertExpandDimsOperator(node, tf_import_flags, model);
+    ConvertSimpleOperator<ExpandDimsOperator, 2>(node, tf_import_flags, model);
   } else if (node.op() == "Fill") {
-    ConvertFillOperator(node, tf_import_flags, model);
+    ConvertSimpleOperator<FillOperator, 2>(node, tf_import_flags, model);
   } else if (node.op() == "FloorDiv") {
-    ConvertFloorDivOperator(node, tf_import_flags, model);
+    ConvertSimpleOperator<FloorDivOperator, 2>(node, tf_import_flags, model);
   } else if (node.op() == "FloorMod") {
-    ConvertFloorModOperator(node, tf_import_flags, model);
+    ConvertSimpleOperator<FloorModOperator, 2>(node, tf_import_flags, model);
   } else if (node.op() == "Range") {
     ConvertRangeOperator(node, tf_import_flags, model);
   } else if (node.op() == "Rank") {
-    ConvertRankOperator(node, tf_import_flags, model);
+    ConvertSimpleOperator<RankOperator, 1>(node, tf_import_flags, model);
   } else if (node.op() == "Stack" || node.op() == "Pack") {
     ConvertStackOperator(node, tf_import_flags, model);
   } else if (node.op() == "Transpose") {
-    ConvertTransposeOperator(node, tf_import_flags, model);
+    ConvertSimpleOperator<TransposeOperator, 2>(node, tf_import_flags, model);
   } else if (node.op() == "ArgMax") {
     ConvertArgMaxOperator(node, tf_import_flags, model);
   } else if (node.op() == "Exp") {
-    ConvertExpOperator(node, tf_import_flags, model);
+    ConvertSimpleOperator<ExpOperator, 1>(node, tf_import_flags, model);
   } else if (node.op() == "TopK" || node.op() == "TopKV2") {
     ConvertTopKV2Operator(node, tf_import_flags, model);
   } else if (node.op() == "DynamicPartition") {
@@ -2311,13 +1939,23 @@ Status ImportTensorFlowNode(const tensorflow::NodeDef& node,
   } else if (node.op() == "RandomUniform") {
     ConvertRandomUniform(node, tf_import_flags, model);
   } else if (node.op() == "Sin") {
-    ConvertSinOperator(node, tf_import_flags, model);
+    ConvertSimpleOperator<SinOperator, 1>(node, tf_import_flags, model);
+  } else if (node.op() == "Log") {
+    ConvertSimpleOperator<LogOperator, 1>(node, tf_import_flags, model);
   } else if (node.op() == "Select") {
-    ConvertSelectOperator(node, tf_import_flags, model);
+    ConvertSimpleOperator<SelectOperator, 3>(node, tf_import_flags, model);
+  } else if (node.op() == "SparseToDense") {
+    ConvertSparseToDenseOperator(node, tf_import_flags, model);
+  } else if (node.op() == "Equal") {
+    ConvertSimpleOperator<TensorFlowEqualOperator, 2>(node, tf_import_flags,
+                                                      model);
+  } else if (node.op() == "NotEqual") {
+    ConvertSimpleOperator<TensorFlowNotEqualOperator, 2>(node, tf_import_flags,
+                                                         model);
   } else {
     ConvertUnsupportedOperator(node, tf_import_flags, model);
   }
-  return Status::OK();
+  return tensorflow::Status::OK();
 }
 }  // namespace internal
 
diff --git a/tensorflow/contrib/lite/toco/import_tensorflow_test.cc b/tensorflow/contrib/lite/toco/import_tensorflow_test.cc
index 835676662b9cb7ed20e578e2a35747a64ba443dc..d18c329a43411236f8fd5446998c168803b9373a 100644
--- a/tensorflow/contrib/lite/toco/import_tensorflow_test.cc
+++ b/tensorflow/contrib/lite/toco/import_tensorflow_test.cc
@@ -21,10 +21,10 @@ limitations under the License.
 #include "tensorflow/core/framework/node_def.pb.h"
 #include "tensorflow/core/framework/node_def_builder.h"
 #include "tensorflow/core/framework/tensor_shape.pb.h"
+#include "tensorflow/core/lib/core/status.h"
 
 namespace toco {
 
-using port::Status;
 using tensorflow::AttrValue;
 using tensorflow::DT_BOOL;
 using tensorflow::DT_FLOAT;
@@ -33,6 +33,7 @@ using tensorflow::DT_INT64;
 using tensorflow::DT_QUINT8;
 using tensorflow::DT_STRING;
 using tensorflow::NodeDef;
+using tensorflow::Status;
 
 namespace internal {
 Status ImportTensorFlowNode(const NodeDef&, const TensorFlowImportFlags&,
@@ -117,9 +118,10 @@ TEST_P(ShapeImportTest, ShapeElementIsNegative) {
   NodeDef node;
   BuildConstNode({1, -2, 10}, GetParam(), 0, &node);
   auto status = ImportNode(node);
-  EXPECT_EQ(status.error_message(),
-            "Tensor shape should not include negative values (while processing "
-            "node 'Node1')");
+  EXPECT_EQ(
+      status.error_message(),
+      "Tensor shape should not include negative values\n\t (while processing "
+      "node 'Node1')");
 }
 INSTANTIATE_TEST_CASE_P(ShapeElementIsNegative, ShapeImportTest,
                         ::testing::ValuesIn(TestTypes()));
@@ -129,7 +131,7 @@ TEST_P(ShapeImportTest, ShapeElementTooLarge) {
   BuildConstNode({3000000000}, GetParam(), 0, &node);
   auto status = ImportNode(node);
   EXPECT_EQ(status.error_message(),
-            "Shape element overflows (while processing node 'Node1')");
+            "Shape element overflows\n\t (while processing node 'Node1')");
 }
 INSTANTIATE_TEST_CASE_P(ShapeElementTooLarge, ShapeImportTest,
                         ::testing::ValuesIn(TestTypes()));
@@ -139,7 +141,7 @@ TEST_P(ShapeImportTest, ShapeTooLarge) {
   BuildConstNode({1000000, 2000000, 2000000, 2000000}, GetParam(), 0, &node);
   auto status = ImportNode(node);
   EXPECT_EQ(status.error_message(),
-            "Tensor shape is too large (while processing node 'Node1')");
+            "Tensor shape is too large\n\t (while processing node 'Node1')");
 }
 INSTANTIATE_TEST_CASE_P(ShapeTooLarge, ShapeImportTest,
                         ::testing::ValuesIn(TestTypes()));
@@ -148,11 +150,11 @@ TEST_P(ShapeImportTest, ValidShapeButZeroElements) {
   NodeDef node;
   BuildConstNode({1, 2, 2, 2}, GetParam(), 0, &node);
   auto status = ImportNode(node);
-  EXPECT_THAT(
-      status.error_message(),
-      ::testing::MatchesRegex(
-          "Neither input_content .0. nor .*_val .0. have the right "
-          "dimensions .8. for this .* tensor .while processing node 'Node1'."));
+  EXPECT_THAT(status.error_message(),
+              ::testing::MatchesRegex(
+                  "Neither input_content .0. nor .*_val .0. have the right "
+                  "dimensions .8. for this .* tensor\n\t .while processing "
+                  "node 'Node1'."));
 }
 INSTANTIATE_TEST_CASE_P(ValidShapeButZeroElements, ShapeImportTest,
                         ::testing::ValuesIn(TestTypes()));
diff --git a/tensorflow/contrib/lite/toco/model.h b/tensorflow/contrib/lite/toco/model.h
index d878ac54e4d819efc1b0951acbbab23b3387eac5..2f43adb07b1c9dc9645942ce6ec868595704baa5 100644
--- a/tensorflow/contrib/lite/toco/model.h
+++ b/tensorflow/contrib/lite/toco/model.h
@@ -135,6 +135,9 @@ enum class OperatorType {
   // special nodes in the graph to shuffle axes.
   kReorderAxes,
   kSelect,
+  kSparseToDense,
+  kTensorFlowEqual,
+  kTensorFlowNotEqual,
 };
 
 // Helper to deal with TensorFlow arrays using a different ordering of
@@ -152,6 +155,7 @@ enum class AxesOrder {
   k1HWO,     // Our standard for DepthwiseConv weights
   kHWIM,     // TensorFlow DepthwiseConv weights
   kNHWC,     // TensorFlow activations
+  kHWOI,     // TensorFlow back-prop conv weights
 };
 
 // The type of the scalars in an array.
@@ -526,7 +530,15 @@ struct LstmCellOperator : Operator {
     ACTIV_TEMP = 3,
     NUM_OUTPUTS = 4
   };
-  LstmCellOperator() : Operator(OperatorType::kLstmCell) {}
+  enum KernelType {
+    KERNEL_BASIC = 0,
+    KERNEL_FULL = 1,
+  };
+
+  LstmCellOperator()
+      : Operator(OperatorType::kLstmCell), kernel_type(KERNEL_BASIC) {}
+
+  KernelType kernel_type;
 };
 
 // Element-wise multiplication operator.
@@ -1349,6 +1361,22 @@ struct TensorFlowGreaterEqualOperator : Operator {
       : Operator(OperatorType::kTensorFlowGreaterEqual) {}
 };
 
+// TensorFlow Equal equivalent. Refer to TensorFlow documentation for
+// details.
+// Not fully supported, just a placeholder to handle TensorFlow graphs and
+// support graph transformations to other operator types by matching sub-graphs.
+// Typically, this is only used as an input to an Assert node, so can be
+// removed as an unused node as we drop Assert nodes.
+struct TensorFlowEqualOperator : Operator {
+  TensorFlowEqualOperator() : Operator(OperatorType::kTensorFlowEqual) {}
+};
+
+// TensorFlow Not Equal equivalent. Refer to TensorFlow documentation for
+// details.
+struct TensorFlowNotEqualOperator : Operator {
+  TensorFlowNotEqualOperator() : Operator(OperatorType::kTensorFlowNotEqual) {}
+};
+
 // Global max reduction: computes the max of all of entries in the input array.
 // Thus the output is "0-dimensional": it consists of a single scalar value.
 //
@@ -1598,13 +1626,26 @@ struct DynamicStitchOperator : Operator {
   int num_partitions;
 };
 
+// SparseToDense operator:
+//
+// Inputs:
+// Inputs[0]: required: sparse_indices.
+// Inputs[1]: required: output_shape.
+// Inputs[2]: required: sparse_values.
+//
+// TensorFlow equivalent: SparseToDense.
+struct SparseToDenseOperator : Operator {
+  SparseToDenseOperator() : Operator(OperatorType::kSparseToDense) {}
+  bool validate_indices;
+};
+
 // Alloc's are used for transient arrays only. An Alloc specifies which interval
 // of the "transient_data" workspace buffer passed to inference functions, is to
 // be used for the transient array at hand. The 'start' and 'end' values are
 // offsets from the start of the workspace buffer, expressed in bytes.
 struct Alloc {
-  int start = 0;
-  int end = 0;
+  int64 start = 0;
+  int64 end = 0;
 };
 
 inline bool operator<(const Alloc& a, const Alloc& b) {
diff --git a/tensorflow/contrib/lite/toco/model_cmdline_flags.cc b/tensorflow/contrib/lite/toco/model_cmdline_flags.cc
index f875c85d1a7447432ea0f3a7d68b028e52cb78d4..4c9f1aa4b0274b5123bb3baa9b9fca1463bda4c3 100644
--- a/tensorflow/contrib/lite/toco/model_cmdline_flags.cc
+++ b/tensorflow/contrib/lite/toco/model_cmdline_flags.cc
@@ -48,7 +48,7 @@ bool ParseModelFlagsFromCommandLineFlags(
            "that information from the input file."),
       Flag("input_arrays", parsed_flags.input_arrays.bind(),
            parsed_flags.input_arrays.default_value(),
-           "Names of the output arrays, comma-separated. If not specified, "
+           "Names of the input arrays, comma-separated. If not specified, "
            "will try to read that information from the input file."),
       Flag("output_array", parsed_flags.output_array.bind(),
            parsed_flags.output_array.default_value(),
@@ -83,7 +83,7 @@ bool ParseModelFlagsFromCommandLineFlags(
            "Deprecated: use --input_data_types instead. Input array type, if "
            "not already provided in the graph. "
            "Typically needs to be specified when passing arbitrary arrays "
-           "to --input_array."),
+           "to --input_arrays."),
       Flag("input_data_types", parsed_flags.input_data_types.bind(),
            parsed_flags.input_data_types.default_value(),
            "Input arrays types, comma-separated, if not already provided in "
diff --git a/tensorflow/contrib/lite/toco/python/BUILD b/tensorflow/contrib/lite/toco/python/BUILD
index 6c4f8e12cdd5b3222997c4a2b0ac243cc74324e0..93fe756a55d378fa205ff88be5e18aff586e5dca 100644
--- a/tensorflow/contrib/lite/toco/python/BUILD
+++ b/tensorflow/contrib/lite/toco/python/BUILD
@@ -12,10 +12,11 @@ cc_library(
     deps = [
         "//tensorflow/contrib/lite/toco:model_flags_proto_cc",
         "//tensorflow/contrib/lite/toco:toco_flags_proto_cc",
+        "//tensorflow/contrib/lite/toco:toco_graphviz_dump_options",
         "//tensorflow/contrib/lite/toco:toco_port",
         "//tensorflow/contrib/lite/toco:toco_tooling",
         "//tensorflow/core:lib",
-        "//util/python:python_headers",
+        "//third_party/python_runtime:headers",
     ],
 )
 
@@ -26,7 +27,7 @@ tf_py_wrap_cc(
         ":toco_python_api",
         "//tensorflow/contrib/lite/toco:model_flags_proto_cc",
         "//tensorflow/contrib/lite/toco:toco_flags_proto_cc",
-        "//util/python:python_headers",
+        "//third_party/python_runtime:headers",
         "@com_google_absl//absl/strings",
     ],
 )
@@ -41,12 +42,6 @@ py_binary(
     ],
 )
 
-py_binary(
-    name = "toco_wrapper",
-    srcs = ["toco_wrapper.py"],
-    srcs_version = "PY2AND3",
-)
-
 tf_py_test(
     name = "toco_from_protos_test",
     srcs = ["toco_from_protos_test.py"],
diff --git a/tensorflow/contrib/lite/toco/python/toco_python_api.cc b/tensorflow/contrib/lite/toco/python/toco_python_api.cc
index 5b1db852b4f8e89c1a591cfe18a0ab0aa2db04c9..d93e104038741e6e59608f04115854d611f1f9ae 100644
--- a/tensorflow/contrib/lite/toco/python/toco_python_api.cc
+++ b/tensorflow/contrib/lite/toco/python/toco_python_api.cc
@@ -19,6 +19,7 @@ limitations under the License.
 #include "tensorflow/contrib/lite/toco/model_flags.pb.h"
 #include "tensorflow/contrib/lite/toco/python/toco_python_api.h"
 #include "tensorflow/contrib/lite/toco/toco_flags.pb.h"
+#include "tensorflow/contrib/lite/toco/toco_graphviz_dump_options.h"
 #include "tensorflow/contrib/lite/toco/toco_port.h"
 #include "tensorflow/contrib/lite/toco/toco_tooling.h"
 #include "tensorflow/contrib/lite/toco/toco_types.h"
@@ -62,7 +63,7 @@ PyObject* TocoConvert(PyObject* model_flags_proto_txt_raw,
   std::string input_contents_txt = ConvertArg(input_contents_txt_raw, &error);
   if (error) return nullptr;
 
-  // Use toco to produce new outputs
+  // Use TOCO to produce new outputs.
   toco::ModelFlags model_flags;
   if (!model_flags.ParseFromString(model_flags_proto_txt)) {
     LOG(FATAL) << "Model proto failed to parse." << std::endl;
@@ -71,6 +72,16 @@ PyObject* TocoConvert(PyObject* model_flags_proto_txt_raw,
   if (!toco_flags.ParseFromString(toco_flags_proto_txt)) {
     LOG(FATAL) << "Toco proto failed to parse." << std::endl;
   }
+
+  auto& dump_options = *GraphVizDumpOptions::singleton();
+  if (toco_flags.has_dump_graphviz_dir()) {
+    dump_options.dump_graphviz = toco_flags.dump_graphviz_dir();
+  }
+  if (toco_flags.has_dump_graphviz_include_video()) {
+    dump_options.dump_graphviz_video = toco_flags.dump_graphviz_include_video();
+  }
+
+  // Convert model.
   std::unique_ptr<toco::Model> model =
       toco::Import(toco_flags, model_flags, input_contents_txt);
   toco::Transform(toco_flags, model.get());
diff --git a/tensorflow/contrib/lite/toco/python/toco_python_api.h b/tensorflow/contrib/lite/toco/python/toco_python_api.h
index 9af38e937c29804f950ea53ee86b70e3ccb02360..7e8ad9c1dafa68dd91e4a0eb3bfb742207878c59 100644
--- a/tensorflow/contrib/lite/toco/python/toco_python_api.h
+++ b/tensorflow/contrib/lite/toco/python/toco_python_api.h
@@ -15,8 +15,8 @@ limitations under the License.
 #ifndef _THIRD_PARTY_TENSORFLOW_CONTRIB_LITE_TOCO_PYTHON_TOCO_PYTHON_API_H_
 #define _THIRD_PARTY_TENSORFLOW_CONTRIB_LITE_TOCO_PYTHON_TOCO_PYTHON_API_H_
 
-#include <string>
 #include <Python.h>
+#include <string>
 
 namespace toco {
 
diff --git a/tensorflow/contrib/lite/toco/python/toco_wrapper.py b/tensorflow/contrib/lite/toco/python/toco_wrapper.py
deleted file mode 100644
index 6d6b500d7eccd353f566a4bad76df35e0e849d95..0000000000000000000000000000000000000000
--- a/tensorflow/contrib/lite/toco/python/toco_wrapper.py
+++ /dev/null
@@ -1,40 +0,0 @@
-# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Wrapper for runninmg toco binary embedded in pip site-package.
-
-NOTE: this mainly exists since PIP setup.py cannot install binaries to bin/.
-It can only install Python "console-scripts." This will work as a console
-script. See tools/pip_package/setup.py (search for CONSOLE_SCRIPTS).
-"""
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import sys
-
-
-def main():
-  # Pip installs the binary in aux-bin off of main site-package install.
-  # Just find it and exec, passing all arguments in the process.
-  # TODO(aselle): it is unfortunate to use all of tensorflow to lookup binary.
-  print("""TOCO from pip install is currently not working on command line.
-Please use the python TOCO API or use
-bazel run tensorflow/contrib/lite:toco -- <args> from a TensorFlow source dir.
-""")
-  sys.exit(1)
-  # TODO(aselle): Replace this when we find a way to run toco without
-  # blowing up executable size.
-  # binary = os.path.join(tf.__path__[0], 'aux-bin/toco')
-  # os.execvp(binary, sys.argv)
diff --git a/tensorflow/contrib/lite/toco/tflite/export.cc b/tensorflow/contrib/lite/toco/tflite/export.cc
index 5daa703c80b3b5d9152c5d21976260f21679a3f2..a2d753657b0bf6c88f5c94a20a1240fb7c13a37c 100644
--- a/tensorflow/contrib/lite/toco/tflite/export.cc
+++ b/tensorflow/contrib/lite/toco/tflite/export.cc
@@ -316,6 +316,7 @@ void Export(
   auto op_codes = ExportOperatorCodes(model, ops_by_type, operators_map,
                                       &builder, &error_summary);
   const string fake_quant_operation_name = "FAKE_QUANT";
+
   if (error_summary.count(fake_quant_operation_name) != 0) {
     LOG(ERROR)
         << fake_quant_operation_name
@@ -327,6 +328,21 @@ void Export(
     error_summary.erase(fake_quant_operation_name);
   }
   if (!allow_custom_ops && !error_summary.empty()) {
+    // Remove ExpandDims and ReorderAxes from unimplemented list unless they
+    // compose the list. Both ops are removed during graph transformations.
+    // However, if an op is unimplemented earlier in the model, the graph
+    // transformation is unable to run because the output shape is not defined.
+    // This causes unnecessary confusion during model conversion time.
+    std::set<string> error_summary_final;
+    for (const auto& op_type : error_summary) {
+      if (op_type != "ReorderAxes" && op_type != "ExpandDims") {
+        error_summary_final.insert(op_type);
+      }
+    }
+    if (error_summary_final.empty()) {
+      error_summary_final = error_summary;
+    }
+
     LOG(QFATAL)
         << "Some of the operators in the model are not supported by "
            "the standard TensorFlow Lite runtime. If you have a custom "
@@ -334,7 +350,7 @@ void Export(
            "--allow_custom_ops, or by setting allow_custom_ops=True "
            "when calling tf.contrib.lite.toco_convert(). Here is a list "
            "of operators for which  you will need custom implementations: "
-        << absl::StrJoin(error_summary, ", ") << ".";
+        << absl::StrJoin(error_summary_final, ", ") << ".";
   }
 
   auto ops =
diff --git a/tensorflow/contrib/lite/toco/tflite/export.h b/tensorflow/contrib/lite/toco/tflite/export.h
index 90abfb94d8d091525cc6ce7b12e2e29c7e648160..098d2163e6c2fe26f3cb9cdf9959df62a1a4baf0 100644
--- a/tensorflow/contrib/lite/toco/tflite/export.h
+++ b/tensorflow/contrib/lite/toco/tflite/export.h
@@ -17,6 +17,7 @@ limitations under the License.
 
 #include "tensorflow/contrib/lite/toco/model.h"
 #include "tensorflow/contrib/lite/toco/tflite/operator.h"
+#include "tensorflow/contrib/lite/util.h"
 
 namespace toco {
 
@@ -72,22 +73,10 @@ struct OperatorKey {
 
   struct Hash {
     size_t operator()(const OperatorKey& key) const {
-      return CombineHashes({std::hash<size_t>()(static_cast<size_t>(key.type)),
-                            std::hash<std::string>()(key.custom_code),
-                            std::hash<int>()(key.version)});
-    }
-
-   private:
-    // TODO(ycling): Refactoring and extract this function into a common
-    // utility module.
-    static size_t CombineHashes(std::initializer_list<size_t> hashes) {
-      size_t result = 0;
-      // Hash combiner used by TensorFlow core.
-      for (size_t hash : hashes) {
-        result = result ^ (hash + 0x9e3779b97f4a7800ULL + (result << 10) +
-                           (result >> 4));
-      }
-      return result;
+      return ::tflite::CombineHashes(
+          {std::hash<size_t>()(static_cast<size_t>(key.type)),
+           std::hash<std::string>()(key.custom_code),
+           std::hash<int>()(key.version)});
     }
   };
 };
diff --git a/tensorflow/contrib/lite/toco/tflite/import.cc b/tensorflow/contrib/lite/toco/tflite/import.cc
index c0e7ab2ef57ed8edf1b7cda08c64f6ae66172af3..cb44a5e6d7356a1cf5597bbe48565c5b1e1949a6 100644
--- a/tensorflow/contrib/lite/toco/tflite/import.cc
+++ b/tensorflow/contrib/lite/toco/tflite/import.cc
@@ -113,15 +113,35 @@ void ImportOperators(
                  << operators_table.size();
     }
     string opname = operators_table.at(index);
+
+    // Find and use the appropriate operator deserialization factory.
+    std::unique_ptr<Operator> new_op = nullptr;
     if (ops_by_name.count(opname) == 0) {
-      LOG(FATAL) << "Op '" << opname << "' not supported";
+      string effective_opname = "TENSORFLOW_UNSUPPORTED";
+      if (ops_by_name.count(effective_opname) == 0) {
+        LOG(FATAL) << "Internal logic error: TENSORFLOW_UNSUPPORTED not found.";
+      }
+      new_op = ops_by_name.at(effective_opname)
+                   ->Deserialize(input_op->builtin_options(),
+                                 input_op->custom_options());
+      if (new_op->type == OperatorType::kTensorFlowUnsupported) {
+        auto* unsupported_op =
+            static_cast<TensorFlowUnsupportedOperator*>(new_op.get());
+        unsupported_op->tensorflow_op = opname;
+        // TODO(b/109932940): Remove this when quantized is removed.
+        // For now, we assume all ops are quantized.
+        unsupported_op->quantized = true;
+      } else {
+        LOG(FATAL) << "Expected a TensorFlowUnsupportedOperator";
+      }
+    } else {
+      new_op = ops_by_name.at(opname)->Deserialize(input_op->builtin_options(),
+                                                   input_op->custom_options());
     }
-
-    auto new_op = ops_by_name.at(opname)->Deserialize(
-        input_op->builtin_options(), input_op->custom_options());
     model->operators.emplace_back(new_op.release());
     auto* op = model->operators.back().get();
 
+    // Make sure all the inputs and outputs are hooked up.
     auto inputs = input_op->inputs();
     for (int i = 0; i < inputs->Length(); i++) {
       auto input_index = inputs->Get(i);
diff --git a/tensorflow/contrib/lite/toco/tflite/operator.cc b/tensorflow/contrib/lite/toco/tflite/operator.cc
index 6922e5055a602b8d2eb43f88cde15b0d505eac40..7490ab960b9b0c62bef4c343927664ac6ae4eb9d 100644
--- a/tensorflow/contrib/lite/toco/tflite/operator.cc
+++ b/tensorflow/contrib/lite/toco/tflite/operator.cc
@@ -507,6 +507,22 @@ class Pad : public BuiltinOperator<PadOperator, ::tflite::PadOptions,
   int GetVersion(const Operator& op) const override { return 1; }
 };
 
+class Tile
+    : public BuiltinOperator<TensorFlowTileOperator, ::tflite::TileOptions,
+                             ::tflite::BuiltinOptions_TileOptions> {
+  using BuiltinOperator::BuiltinOperator;
+
+  flatbuffers::Offset<TfLiteOptions> WriteOptions(
+      const TocoOperator& op,
+      flatbuffers::FlatBufferBuilder* builder) const override {
+    return ::tflite::CreateTileOptions(*builder);
+  }
+
+  void ReadOptions(const TfLiteOptions& options,
+                   TocoOperator* op) const override {}
+  int GetVersion(const Operator& op) const override { return 1; }
+};
+
 class PadV2 : public BuiltinOperator<PadV2Operator, ::tflite::PadV2Options,
                                      ::tflite::BuiltinOptions_PadV2Options> {
  public:
@@ -610,11 +626,21 @@ class Lstm : public BuiltinOperator<LstmCellOperator, ::tflite::LSTMOptions,
   flatbuffers::Offset<TfLiteOptions> WriteOptions(
       const TocoOperator& op,
       flatbuffers::FlatBufferBuilder* builder) const override {
+    ::tflite::LSTMKernelType kernel_type;
+    switch (op.kernel_type) {
+      case LstmCellOperator::KERNEL_BASIC:
+        kernel_type = ::tflite::LSTMKernelType_BASIC;
+        break;
+      case LstmCellOperator::KERNEL_FULL:
+        kernel_type = ::tflite::LSTMKernelType_FULL;
+        break;
+    }
+
     // Current toco converter only supports tanh, no clip.
     return ::tflite::CreateLSTMOptions(*builder, /*fused_activation_function=*/
                                        ::tflite::ActivationFunctionType_TANH,
                                        /*cell_clip=*/0.0,
-                                       /*proj_clip=*/0.0);
+                                       /*proj_clip=*/0.0, kernel_type);
   }
 
   void ReadOptions(const TfLiteOptions& options,
@@ -622,9 +648,26 @@ class Lstm : public BuiltinOperator<LstmCellOperator, ::tflite::LSTMOptions,
     // Only support tanh activation, so check that tflite type is tanh.
     CHECK(options.fused_activation_function() ==
           ::tflite::ActivationFunctionType_TANH);
+
+    switch (options.kernel_type()) {
+      case ::tflite::LSTMKernelType_BASIC:
+        op->kernel_type = LstmCellOperator::KERNEL_BASIC;
+        break;
+      case ::tflite::LSTMKernelType_FULL:
+        op->kernel_type = LstmCellOperator::KERNEL_FULL;
+        break;
+    }
   }
 
-  int GetVersion(const Operator& op) const override { return 1; }
+  int GetVersion(const Operator& op) const override {
+    const auto& lstm_op = static_cast<const LstmCellOperator&>(op);
+    switch (lstm_op.kernel_type) {
+      case LstmCellOperator::KERNEL_FULL:
+        return 1;
+      case LstmCellOperator::KERNEL_BASIC:
+        return 2;
+    }
+  }
 };
 
 class Mean : public BuiltinOperator<MeanOperator, ::tflite::MeanOptions,
@@ -794,6 +837,45 @@ class TransposeConv
   int GetVersion(const Operator& op) const override { return 1; }
 };
 
+class SparseToDense
+    : public BuiltinOperator<SparseToDenseOperator,
+                             ::tflite::SparseToDenseOptions,
+                             ::tflite::BuiltinOptions_SparseToDenseOptions> {
+ public:
+  using BuiltinOperator::BuiltinOperator;
+
+  flatbuffers::Offset<TfLiteOptions> WriteOptions(
+      const TocoOperator& op,
+      flatbuffers::FlatBufferBuilder* builder) const override {
+    return ::tflite::CreateSparseToDenseOptions(*builder, op.validate_indices);
+  }
+
+  void ReadOptions(const TfLiteOptions& options,
+                   TocoOperator* op) const override {
+    op->validate_indices = options.validate_indices();
+  }
+
+  int GetVersion(const Operator& op) const override { return 1; }
+};
+
+class ExpandDims
+    : public BuiltinOperator<ExpandDimsOperator, ::tflite::ExpandDimsOptions,
+                             ::tflite::BuiltinOptions_ExpandDimsOptions> {
+ public:
+  using BuiltinOperator::BuiltinOperator;
+
+  flatbuffers::Offset<TfLiteOptions> WriteOptions(
+      const TocoOperator& op,
+      flatbuffers::FlatBufferBuilder* builder) const override {
+    return ::tflite::CreateExpandDimsOptions(*builder);
+  }
+
+  void ReadOptions(const TfLiteOptions& options,
+                   TocoOperator* op) const override {}
+
+  int GetVersion(const Operator& op) const override { return 1; }
+};
+
 class TensorFlowUnsupported : public BaseOperator {
  public:
   using BaseOperator::BaseOperator;
@@ -976,8 +1058,14 @@ std::vector<std::unique_ptr<BaseOperator>> BuildOperatorList() {
       new Cast(::tflite::BuiltinOperator_CAST, OperatorType::kCast));
   ops.emplace_back(
       new ArgMax(::tflite::BuiltinOperator_ARG_MAX, OperatorType::kArgMax));
+  ops.emplace_back(
+      new Tile(::tflite::BuiltinOperator_TILE, OperatorType::kTensorFlowTile));
+  ops.emplace_back(new ExpandDims(::tflite::BuiltinOperator_EXPAND_DIMS,
+                                  OperatorType::kExpandDims));
   ops.emplace_back(new TransposeConv(::tflite::BuiltinOperator_TRANSPOSE_CONV,
                                      OperatorType::kTransposeConv));
+  ops.emplace_back(new SparseToDense(::tflite::BuiltinOperator_SPARSE_TO_DENSE,
+                                     OperatorType::kSparseToDense));
 
   // Custom Operators.
   ops.emplace_back(
@@ -1024,12 +1112,18 @@ std::vector<std::unique_ptr<BaseOperator>> BuildOperatorList() {
       "LESS", OperatorType::kTensorFlowLess));
   ops.emplace_back(new SimpleOperator<TensorFlowLessEqualOperator>(
       "LESS_EQUAL", OperatorType::kTensorFlowLessEqual));
+  ops.emplace_back(new SimpleOperator<TensorFlowEqualOperator>(
+      "EQUAL", OperatorType::kTensorFlowEqual));
+  ops.emplace_back(new SimpleOperator<TensorFlowNotEqualOperator>(
+      "NOT_EQUAL", OperatorType::kTensorFlowNotEqual));
   ops.emplace_back(new SimpleOperator<NegOperator>("NEG", OperatorType::kNeg));
   ops.emplace_back(
       new SimpleOperator<SelectOperator>("SELECT", OperatorType::kSelect));
   ops.emplace_back(
       new SimpleOperator<SliceOperator>("SLICE", OperatorType::kSlice));
+  // Element-wise operator
   ops.emplace_back(new SimpleOperator<SinOperator>("SIN", OperatorType::kSin));
+  ops.emplace_back(new SimpleOperator<LogOperator>("LOG", OperatorType::kLog));
 
   return ops;
 }
diff --git a/tensorflow/contrib/lite/toco/tflite/operator_test.cc b/tensorflow/contrib/lite/toco/tflite/operator_test.cc
index fe594c6da9826ab904d162c9e28e1455b1bf69f6..03bb20b3208196e964d950c0f0954d1fc0ba9e86 100644
--- a/tensorflow/contrib/lite/toco/tflite/operator_test.cc
+++ b/tensorflow/contrib/lite/toco/tflite/operator_test.cc
@@ -74,8 +74,10 @@ class OperatorTest : public ::testing::Test {
     auto new_toco_op = op.Deserialize(output_options->builtin_options(),
                                       output_options->custom_options());
 
-    CHECK(dynamic_cast<T*>(new_toco_op.get()))
-        << "Cannot cast " << HelpfulOperatorTypeName(*new_toco_op) << " to "
+    CHECK(new_toco_op->type == toco_op.type)
+        << "The type of the serialized and deserialized"
+        << HelpfulOperatorTypeName(*new_toco_op)
+        << " does not match the type of the original "
         << HelpfulOperatorTypeName(toco_op);
 
     return std::unique_ptr<T>(dynamic_cast<T*>(new_toco_op.release()));
@@ -119,6 +121,11 @@ TEST_F(OperatorTest, SimpleOperators) {
   CheckSimpleOperator<SelectOperator>("SELECT", OperatorType::kSelect);
   CheckSimpleOperator<SliceOperator>("SLICE", OperatorType::kSlice);
   CheckSimpleOperator<SinOperator>("SIN", OperatorType::kSin);
+  CheckSimpleOperator<TensorFlowEqualOperator>("EQUAL",
+                                               OperatorType::kTensorFlowEqual);
+  CheckSimpleOperator<TensorFlowNotEqualOperator>(
+      "NOT_EQUAL", OperatorType::kTensorFlowNotEqual);
+  CheckSimpleOperator<LogOperator>("LOG", OperatorType::kLog);
 }
 
 TEST_F(OperatorTest, BuiltinAdd) {
@@ -420,6 +427,15 @@ TEST_F(OperatorTest, BuiltinTransposeConv) {
   EXPECT_EQ(op.padding.type, output_toco_op->padding.type);
 }
 
+TEST_F(OperatorTest, BuiltinSparseToDense) {
+  SparseToDenseOperator op;
+  op.validate_indices = false;
+  std::unique_ptr<toco::SparseToDenseOperator> output_toco_op =
+      SerializeAndDeserialize(
+          GetOperator("SPARSE_TO_DENSE", OperatorType::kSparseToDense), op);
+  EXPECT_EQ(op.validate_indices, output_toco_op->validate_indices);
+}
+
 TEST_F(OperatorTest, TensorFlowUnsupported) {
   TensorFlowUnsupportedOperator op;
   op.tensorflow_op = "MyCustomUnsupportedOp";
diff --git a/tensorflow/contrib/lite/toco/toco_cmdline_flags.cc b/tensorflow/contrib/lite/toco/toco_cmdline_flags.cc
index 7786a4ada335abc9a01a0a6e423125f2d67957c2..87a1e429b928bf59cb14597980602953732a7659 100644
--- a/tensorflow/contrib/lite/toco/toco_cmdline_flags.cc
+++ b/tensorflow/contrib/lite/toco/toco_cmdline_flags.cc
@@ -153,6 +153,16 @@ bool ParseTocoFlagsFromCommandLineFlags(
            parsed_flags.dedupe_array_min_size_bytes.default_value(),
            "Minimum size of constant arrays to deduplicate; arrays smaller "
            "will not be deduplicated."),
+      Flag("split_tflite_lstm_inputs",
+           parsed_flags.split_tflite_lstm_inputs.bind(),
+           parsed_flags.split_tflite_lstm_inputs.default_value(),
+           "Split the LSTM inputs from 5 tensors to 18 tensors for TFLite. "
+           "Ignored if the output format is not TFLite."),
+      Flag("quantize_weights", parsed_flags.quantize_weights.bind(),
+           parsed_flags.quantize_weights.default_value(),
+           "Store weights as quantized weights followed by dequantize "
+           "operations. Computation is still done in float, but reduces model "
+           "size (at the cost of accuracy and latency)."),
   };
   bool asked_for_help =
       *argc == 2 && (!strcmp(argv[1], "--help") || !strcmp(argv[1], "-help"));
@@ -245,6 +255,8 @@ void ReadTocoFlagsFromCommandLineFlags(const ParsedTocoFlags& parsed_toco_flags,
   READ_TOCO_FLAG(allow_nudging_weights_to_use_fast_gemm_kernel,
                  FlagRequirement::kNone);
   READ_TOCO_FLAG(dedupe_array_min_size_bytes, FlagRequirement::kNone);
+  READ_TOCO_FLAG(split_tflite_lstm_inputs, FlagRequirement::kNone);
+  READ_TOCO_FLAG(quantize_weights, FlagRequirement::kNone);
 
   // Deprecated flag handling.
   if (parsed_toco_flags.input_type.specified()) {
@@ -278,6 +290,11 @@ void ReadTocoFlagsFromCommandLineFlags(const ParsedTocoFlags& parsed_toco_flags,
     QCHECK(toco::IODataType_Parse(input_types[0], &input_type));
     toco_flags->set_inference_input_type(input_type);
   }
+  if (parsed_toco_flags.quantize_weights.value()) {
+    QCHECK_NE(toco_flags->inference_type(), IODataType::QUANTIZED_UINT8)
+        << "quantize_weights is not supported with inference_type "
+           "QUANTIZED_UINT8.";
+  }
 
 #undef READ_TOCO_FLAG
 #undef PARSE_TOCO_FLAG
diff --git a/tensorflow/contrib/lite/toco/toco_flags.proto b/tensorflow/contrib/lite/toco/toco_flags.proto
index 8589ca361dae2561207f9fa0c57b3240240c08d6..ad4e94ded9f9730842a257e065d9aec2b1cbfac8 100644
--- a/tensorflow/contrib/lite/toco/toco_flags.proto
+++ b/tensorflow/contrib/lite/toco/toco_flags.proto
@@ -37,7 +37,7 @@ enum FileFormat {
 // of as properties of models, instead describing how models are to be
 // processed in the context of the present tooling job.
 //
-// Next ID to use: 19.
+// Next ID to use: 21.
 message TocoFlags {
   // Input file format
   optional FileFormat input_format = 1;
@@ -165,4 +165,22 @@ message TocoFlags {
   // Minimum size of constant arrays to deduplicate; arrays smaller will not be
   // deduplicated.
   optional int64 dedupe_array_min_size_bytes = 18 [default = 64];
+
+  // Split the LSTM inputs from 5 tensors to 18 tensors for TFLite.
+  // Ignored if the output format is not TFLite.
+  optional bool split_tflite_lstm_inputs = 19 [default = true];
+
+  // Store weights as quantized weights followed by dequantize operations.
+  // Computation is still done in float, but reduces model size (at the cost of
+  // accuracy and latency).
+  optional bool quantize_weights = 20 [default = false];
+
+  // Full filepath of folder to dump the graphs at various stages of processing
+  // GraphViz .dot files. Preferred over --output_format=GRAPHVIZ_DOT in order
+  // to keep the requirements of the output file.
+  optional string dump_graphviz_dir = 24;
+
+  // Boolean indicating whether to dump the graph after every graph
+  // transformation.
+  optional bool dump_graphviz_include_video = 25;
 }
diff --git a/tensorflow/contrib/lite/toco/toco_port.cc b/tensorflow/contrib/lite/toco/toco_port.cc
index a1c8696cd06a30bfe8661bb70aa4f2d6d175aac3..de76fd4032d24eff8a6c2fd0c16a911b9c00186b 100644
--- a/tensorflow/contrib/lite/toco/toco_port.cc
+++ b/tensorflow/contrib/lite/toco/toco_port.cc
@@ -16,8 +16,16 @@ limitations under the License.
 
 #include "tensorflow/contrib/lite/toco/toco_port.h"
 #include "tensorflow/contrib/lite/toco/toco_types.h"
+#include "tensorflow/core/lib/core/errors.h"
+#include "tensorflow/core/lib/core/status.h"
 #include "tensorflow/core/platform/logging.h"
 
+#if defined(__ANDROID__) && defined(__ARM_ARCH_7A__)
+namespace std {
+double round(double x) { return ::round(x); }
+}  // namespace std
+#endif
+
 namespace toco {
 namespace port {
 void CopyToBuffer(const string& src, char* dest) {
@@ -55,8 +63,12 @@ void CheckInitGoogleIsDone(const char* message) {
 namespace file {
 
 // Conversion to our wrapper Status.
-Status ToStatus(const ::util::Status& uts) {
-  return Status(uts.ok(), uts.error_message());
+tensorflow::Status ToStatus(const ::util::Status& uts) {
+  if (!uts.ok()) {
+    return tensorflow::Status(tensorflow::errors::Code(uts.error_code()),
+                              uts.error_message());
+  }
+  return tensorflow::Status::OK();
 }
 
 // Conversion to our wrapper Options.
@@ -65,7 +77,7 @@ toco::port::file::Options ToOptions(const ::file::Options& options) {
   return Options();
 }
 
-Status Writable(const string& filename) {
+tensorflow::Status Writable(const string& filename) {
   File* f = nullptr;
   const auto status = ::file::Open(filename, "w", &f, ::file::Defaults());
   if (f) {
@@ -74,22 +86,24 @@ Status Writable(const string& filename) {
   return ToStatus(status);
 }
 
-Status Readable(const string& filename, const file::Options& options) {
+tensorflow::Status Readable(const string& filename,
+                            const file::Options& options) {
   return ToStatus(::file::Readable(filename, ::file::Defaults()));
 }
 
-Status Exists(const string& filename, const file::Options& options) {
+tensorflow::Status Exists(const string& filename,
+                          const file::Options& options) {
   auto status = ::file::Exists(filename, ::file::Defaults());
   return ToStatus(status);
 }
 
-Status GetContents(const string& filename, string* contents,
-                   const file::Options& options) {
+tensorflow::Status GetContents(const string& filename, string* contents,
+                               const file::Options& options) {
   return ToStatus(::file::GetContents(filename, contents, ::file::Defaults()));
 }
 
-Status SetContents(const string& filename, const string& contents,
-                   const file::Options& options) {
+tensorflow::Status SetContents(const string& filename, const string& contents,
+                               const file::Options& options) {
   return ToStatus(::file::SetContents(filename, contents, ::file::Defaults()));
 }
 
@@ -133,37 +147,42 @@ void CheckInitGoogleIsDone(const char* message) {
 
 namespace file {
 
-Status Writable(const string& filename) {
+tensorflow::Status Writable(const string& filename) {
   FILE* f = fopen(filename.c_str(), "w");
   if (f) {
     fclose(f);
-    return Status(true, "");
+    return tensorflow::Status::OK();
   }
-  return Status(false, "not writable");
+  return tensorflow::errors::NotFound("not writable");
 }
 
-Status Readable(const string& filename, const file::Options& options) {
+tensorflow::Status Readable(const string& filename,
+                            const file::Options& options) {
   FILE* f = fopen(filename.c_str(), "r");
   if (f) {
     fclose(f);
-    return Status(true, "");
+    return tensorflow::Status::OK();
   }
-  return Status(false, "not readable");
+  return tensorflow::errors::NotFound("not readable");
 }
 
-Status Exists(const string& filename, const file::Options& options) {
+tensorflow::Status Exists(const string& filename,
+                          const file::Options& options) {
   struct stat statbuf;
   int ret = stat(filename.c_str(), &statbuf);
-  return Status(ret != -1, "");
+  if (ret == -1) {
+    return tensorflow::errors::NotFound("file doesn't exist");
+  }
+  return tensorflow::Status::OK();
 }
 
-Status GetContents(const string& path, string* output,
-                   const file::Options& options) {
+tensorflow::Status GetContents(const string& path, string* output,
+                               const file::Options& options) {
   output->clear();
 
   int fd = open(path.c_str(), O_RDONLY);
   if (fd == -1) {
-    return Status(false, "can't open() for read");
+    return tensorflow::errors::NotFound("can't open() for read");
   }
 
   // Direct read, for speed.
@@ -174,25 +193,25 @@ Status GetContents(const string& path, string* output,
     if (size == 0) {
       // Done.
       close(fd);
-      return Status(true, "");
+      return tensorflow::Status::OK();
     } else if (size == -1) {
       // Error.
       close(fd);
-      return Status(false, "error during read()");
+      return tensorflow::errors::Internal("error during read()");
     } else {
       output->append(buffer, size);
     }
   }
 
   CHECK(0);
-  return Status(false, "internal error");
+  return tensorflow::errors::Internal("internal error");
 }
 
-Status SetContents(const string& filename, const string& contents,
-                   const file::Options& options) {
+tensorflow::Status SetContents(const string& filename, const string& contents,
+                               const file::Options& options) {
   int fd = open(filename.c_str(), O_WRONLY | O_CREAT, 0664);
   if (fd == -1) {
-    return Status(false, "can't open() for write");
+    return tensorflow::errors::Internal("can't open() for write");
   }
 
   size_t i = 0;
@@ -201,13 +220,13 @@ Status SetContents(const string& filename, const string& contents,
     ssize_t written = write(fd, &contents[i], to_write);
     if (written == -1) {
       close(fd);
-      return Status(false, "write() error");
+      return tensorflow::errors::Internal("write() error");
     }
     i += written;
   }
   close(fd);
 
-  return Status(true, "");
+  return tensorflow::Status::OK();
 }
 
 string JoinPath(const string& base, const string& filename) {
diff --git a/tensorflow/contrib/lite/toco/toco_port.h b/tensorflow/contrib/lite/toco/toco_port.h
index 906792ef569e5b8dd2a40f6cf683fa8a35946012..17f82b9dd7dcc633aa204038b6d965f4eb6967bb 100644
--- a/tensorflow/contrib/lite/toco/toco_port.h
+++ b/tensorflow/contrib/lite/toco/toco_port.h
@@ -21,6 +21,7 @@ limitations under the License.
 #include <string>
 #include "google/protobuf/text_format.h"
 #include "tensorflow/contrib/lite/toco/format_port.h"
+#include "tensorflow/core/lib/core/status.h"
 #include "tensorflow/core/platform/logging.h"
 #include "tensorflow/core/platform/platform.h"
 #if defined(PLATFORM_GOOGLE)
@@ -33,28 +34,26 @@ limitations under the License.
 #define TFLITE_PROTO_NS google::protobuf
 #endif
 
-namespace toco {
-namespace port {
-
-class Status {
- public:
-  static Status OK() { return Status(true, ""); }
-
-  // Create a failed status with no message.
-  Status() {}
-
-  Status(bool ok, const string& message) : ok_(ok), message_(message) {}
-
-  void AppendMessage(const string& message) { message_ += message; }
+#ifdef __ANDROID__
+#include <sstream>
+namespace std {
 
-  bool ok() const { return ok_; }
+template <typename T>
+std::string to_string(T value)
+{
+    std::ostringstream os ;
+    os << value ;
+    return os.str() ;
+}
 
-  const string error_message() const { return message_; }
+#ifdef __ARM_ARCH_7A__
+double round(double x);
+#endif
+}
+#endif
 
- private:
-  bool ok_ = false;
-  string message_;
-};
+namespace toco {
+namespace port {
 
 void InitGoogle(const char* usage, int* argc, char*** argv, bool remove_flags);
 void CheckInitGoogleIsDone(const char* message);
@@ -65,14 +64,14 @@ inline Options Defaults() {
   Options o;
   return o;
 }
-Status GetContents(const string& filename, string* contents,
-                   const Options& options);
-Status SetContents(const string& filename, const string& contents,
-                   const Options& options);
+tensorflow::Status GetContents(const string& filename, string* contents,
+                               const Options& options);
+tensorflow::Status SetContents(const string& filename, const string& contents,
+                               const Options& options);
 string JoinPath(const string& base, const string& filename);
-Status Writable(const string& filename);
-Status Readable(const string& filename, const Options& options);
-Status Exists(const string& filename, const Options& options);
+tensorflow::Status Writable(const string& filename);
+tensorflow::Status Readable(const string& filename, const Options& options);
+tensorflow::Status Exists(const string& filename, const Options& options);
 }  // namespace file
 
 // Copy `src` string to `dest`. User must ensure `dest` has enough space.
diff --git a/tensorflow/contrib/lite/toco/toco_tooling.cc b/tensorflow/contrib/lite/toco/toco_tooling.cc
index b5531ca2f4785e0c95703f95977be93a0ba2a8e2..1fe76f8163cdf23b27f8baaf2d9c6d99b1aa3747 100644
--- a/tensorflow/contrib/lite/toco/toco_tooling.cc
+++ b/tensorflow/contrib/lite/toco/toco_tooling.cc
@@ -263,12 +263,15 @@ void Transform(const TocoFlags& toco_flags, Model* model) {
     if (!toco_flags.debug_disable_recurrent_cell_fusion()) {
       transformations.Add(new IdentifyLstmCell);
     }
-    if (output_format == TFLITE) {
+    if (output_format == TFLITE && toco_flags.split_tflite_lstm_inputs()) {
       transformations.Add(new toco::SplitLstmCellInputs);
     } else {
       transformations.Add(new toco::MergeLstmCellInputs);
     }
   }
+  if (toco_flags.quantize_weights()) {
+    transformations.Add(new QuantizeWeights);
+  }
   transformations.Add(new ResolveConstantConcatenation);
   RunGraphTransformations(model, "general graph transformations",
                           transformations);
diff --git a/tensorflow/contrib/lite/toco/tooling_util.cc b/tensorflow/contrib/lite/toco/tooling_util.cc
index 1e6314f2dc78297c8bdacb19cf89292603695e3f..92bab5246cb85052b5e0216f1cb8a04736ae7a79 100644
--- a/tensorflow/contrib/lite/toco/tooling_util.cc
+++ b/tensorflow/contrib/lite/toco/tooling_util.cc
@@ -30,7 +30,7 @@ limitations under the License.
 #include "tensorflow/contrib/lite/toco/dump_graphviz.h"
 #include "tensorflow/contrib/lite/toco/model_flags.pb.h"
 #include "tensorflow/contrib/lite/toco/toco_graphviz_dump_options.h"
-#include "tensorflow/contrib/lite/toco/toco_port.h"
+#include "tensorflow/core/lib/core/status.h"
 #include "tensorflow/core/platform/logging.h"
 
 namespace toco {
@@ -393,6 +393,9 @@ const char* OperatorTypeName(OperatorType type) {
     HANDLE_OPERATORTYPENAME_CASE(DynamicPartition)
     HANDLE_OPERATORTYPENAME_CASE(DynamicStitch)
     HANDLE_OPERATORTYPENAME_CASE(Select)
+    HANDLE_OPERATORTYPENAME_CASE(SparseToDense)
+    HANDLE_OPERATORTYPENAME_CASE(TensorFlowEqual)
+    HANDLE_OPERATORTYPENAME_CASE(TensorFlowNotEqual)
     default:
       LOG(FATAL) << "Unhandled op type";
 #undef HANDLE_OPERATORTYPENAME_CASE
@@ -582,6 +585,13 @@ void UnextendShape(Shape* shape, int new_shape_size) {
   shape_dims.erase(shape_dims.begin(), shape_dims.begin() + size_reduction);
 }
 
+bool IsValid(const Shape& shape) {
+  for (int i = 0; i < shape.dimensions_count(); ++i) {
+    if (shape.dims(i) < 1) return false;
+  }
+  return true;
+}
+
 void CheckShapeDimensions(const Shape& shape) {
   for (int i = 0; i < shape.dimensions_count(); ++i) {
     CHECK_GE(shape.dims()[i], 1) << "shape has dimension 0 at index << " << i
@@ -1862,18 +1872,15 @@ void GetShuffleShape(AxesOrder input_axes_order, AxesOrder output_axes_order,
              output_axes_order == AxesOrder::kHWIO) {
     // 3210 <- 3210
     // HWIO <- OHWI
-    (*shuffle)[0] = 1;
-    (*shuffle)[1] = 2;
-    (*shuffle)[2] = 3;
-    (*shuffle)[3] = 0;
+    *shuffle = {1, 2, 3, 0};
   } else if (input_axes_order == AxesOrder::kHWIO &&
              output_axes_order == AxesOrder::kOHWI) {
     // 3210 <- 3210
     // OHWI <- HWIO
-    (*shuffle)[0] = 3;
-    (*shuffle)[1] = 0;
-    (*shuffle)[2] = 1;
-    (*shuffle)[3] = 2;
+    *shuffle = {3, 0, 1, 2};
+  } else if (input_axes_order == AxesOrder::kOHWI &&
+             output_axes_order == AxesOrder::kHWOI) {
+    *shuffle = {1, 2, 0, 3};
   } else {
     LOG(FATAL) << "Bad shuffle";
   }
@@ -2019,6 +2026,8 @@ int AxesCount(AxesOrder axes_order) {
       return 4;
     case AxesOrder::kNHWC:
       return 4;
+    case AxesOrder::kHWOI:
+      return 4;
     default:
       LOG(FATAL) << "Bad AxesOrder";
       return 0;
diff --git a/tensorflow/contrib/lite/toco/tooling_util.h b/tensorflow/contrib/lite/toco/tooling_util.h
index 1f596ca8e5a28f17e816c33eea03725d16f7ce12..7681ce9d39ec56f9447896682b52bd4efb1d0e54 100644
--- a/tensorflow/contrib/lite/toco/tooling_util.h
+++ b/tensorflow/contrib/lite/toco/tooling_util.h
@@ -26,14 +26,15 @@ limitations under the License.
 #include "absl/strings/string_view.h"
 #include "tensorflow/core/platform/logging.h"
 #if TOCO_SUPPORT_PORTABLE_PROTOS
-#include "third_party/protobuf/src/google/protobuf/text_format.h"
+#include "third_party/protobuf/include/google/protobuf/text_format.h"
 #endif  // TOCO_SUPPORT_PORTABLE_PROTOS
 #include "tensorflow/contrib/lite/toco/model.h"
 #include "tensorflow/contrib/lite/toco/model_flags.pb.h"
 #include "tensorflow/contrib/lite/toco/runtime/types.h"
 #include "tensorflow/contrib/lite/toco/toco_flags.pb.h"
-#include "tensorflow/contrib/lite/toco/toco_port.h"
 #include "tensorflow/contrib/lite/toco/types.pb.h"
+#include "tensorflow/core/lib/core/errors.h"
+#include "tensorflow/core/lib/core/status.h"
 
 // TODO(aselle): Replace with using a container specific hash override instead.
 namespace std {
@@ -112,7 +113,9 @@ void ExtendShape(Shape* shape, int new_shape_size);
 // TODO(b/36075966): Clean up when dims superseded by array shape.
 void UnextendShape(Shape* shape, int new_shape_size);
 
-// Checks (using CHECK) that all dimensions of 'shape' are at least 1.
+// Checks that all dimensions of 'shape' are at least 1.
+bool IsValid(const Shape& shape);
+// Same as above, but reports error using CHECK.
 void CheckShapeDimensions(const Shape& shape);
 
 // Given two shapes with potentially different dimensionality and dimension
@@ -315,7 +318,7 @@ void UseArraysExtraInfo(Model* model, bool quantize_output);
 // doesn't have enough range to represent the sum of elements, an error is
 // returned.
 template <typename T, typename U>
-port::Status NumElements(const std::vector<T>& shape, U* num_elements) {
+tensorflow::Status NumElements(const std::vector<T>& shape, U* num_elements) {
   static_assert(
       std::numeric_limits<T>::max() <= std::numeric_limits<uint64_t>::max(),
       "vector type exceed capabilities of NumElements");
@@ -326,17 +329,17 @@ port::Status NumElements(const std::vector<T>& shape, U* num_elements) {
       // TensorFlow's shapes sometimes include -1 to represent an "unknown"
       // size but TOCO isn't able to create arrays of unknown sizes and will
       // crash in RequiredBufferSizeForShape().
-      return port::Status(false,
-                          "Tensor shape should not include negative values");
+      return tensorflow::errors::InvalidArgument(
+          "Tensor shape should not include negative values");
     }
     if (static_cast<uint64_t>(dim) >
         std::numeric_limits<U>::max() / *num_elements) {
       *num_elements = 0;
-      return port::Status(false, "Tensor shape is too large");
+      return tensorflow::errors::InvalidArgument("Tensor shape is too large");
     }
     *num_elements *= dim;
   }
-  return port::Status::OK();
+  return tensorflow::Status::OK();
 }
 
 }  // namespace toco
diff --git a/tensorflow/contrib/lite/toco/tooling_util_test.cc b/tensorflow/contrib/lite/toco/tooling_util_test.cc
index 87fd30db2cf54824a3c34ed875291d898f1a9e38..a683867374c8b8dcb274478adf6b5fa0691d1c5a 100644
--- a/tensorflow/contrib/lite/toco/tooling_util_test.cc
+++ b/tensorflow/contrib/lite/toco/tooling_util_test.cc
@@ -18,6 +18,7 @@ limitations under the License.
 #include <gtest/gtest.h>
 #include "tensorflow/contrib/lite/toco/model.h"
 #include "tensorflow/contrib/lite/toco/tooling_util.h"
+#include "tensorflow/core/lib/core/status.h"
 
 namespace toco {
 
@@ -99,7 +100,7 @@ static const char kLargeTensorMessage[] = "Tensor shape is too large";
 
 TEST(NumElementsTest, Int) {
   int count;
-  port::Status status = port::Status::OK();
+  tensorflow::Status status = tensorflow::Status::OK();
 
   status = NumElements(std::vector<int>{1024, 1024, 2047}, &count);
   EXPECT_TRUE(status.ok());
@@ -114,7 +115,7 @@ TEST(NumElementsTest, Int) {
 
 TEST(NumElementsTest, Int32) {
   int32_t count;
-  port::Status status = port::Status::OK();
+  tensorflow::Status status = tensorflow::Status::OK();
 
   status = NumElements(std::vector<int32_t>{1024, 1024, 2047}, &count);
   EXPECT_TRUE(status.ok());
@@ -129,7 +130,7 @@ TEST(NumElementsTest, Int32) {
 
 TEST(NumElementsTest, Int64) {
   int64_t count;
-  port::Status status = port::Status::OK();
+  tensorflow::Status status = tensorflow::Status::OK();
 
   status = NumElements(std::vector<int64_t>{16777216, 16777216, 32767}, &count);
   EXPECT_TRUE(status.ok());
@@ -144,7 +145,7 @@ TEST(NumElementsTest, Int64) {
 
 TEST(NumElementsTest, UnsignedInt32) {
   uint32_t count;
-  port::Status status = port::Status::OK();
+  tensorflow::Status status = tensorflow::Status::OK();
 
   status = NumElements(std::vector<uint32_t>{1024, 2048, 2047}, &count);
   EXPECT_TRUE(status.ok());
@@ -159,7 +160,7 @@ TEST(NumElementsTest, UnsignedInt32) {
 
 TEST(NumElementsTest, UnsignedInt64) {
   uint64_t count;
-  port::Status status = port::Status::OK();
+  tensorflow::Status status = tensorflow::Status::OK();
 
   status =
       NumElements(std::vector<uint64_t>{16777216, 16777216, 65535}, &count);
diff --git a/tensorflow/contrib/lite/tools/BUILD b/tensorflow/contrib/lite/tools/BUILD
index 824a164651073bac846a514505726a8ee85cc41d..5913847329eeae7373d0d21834dd37327e4068c4 100644
--- a/tensorflow/contrib/lite/tools/BUILD
+++ b/tensorflow/contrib/lite/tools/BUILD
@@ -7,6 +7,8 @@ licenses(["notice"])  # Apache 2.0
 load("//tensorflow/contrib/lite:special_rules.bzl", "tflite_portable_test_suite")
 load("//tensorflow:tensorflow.bzl", "tf_cc_binary")
 
+common_copts = ["-Wall"]
+
 py_binary(
     name = "visualize",
     srcs = ["visualize.py"],
@@ -28,34 +30,6 @@ tf_cc_binary(
     ],
 )
 
-tf_cc_binary(
-    name = "benchmark_model",
-    srcs = ["benchmark_model.cc"],
-    linkopts = select({
-        "//tensorflow:android": [
-            "-pie",
-            "-landroid",
-            "-lm",
-            "-z defs",
-            "-Wl,--exclude-libs,ALL",  # Exclude syms in all libs from auto export
-        ],
-        "//conditions:default": [],
-    }),
-    deps = [
-        "//tensorflow/contrib/lite:framework",
-        "//tensorflow/contrib/lite:string_util",
-        "//tensorflow/contrib/lite/kernels:builtin_ops",
-    ] + select({
-        "//tensorflow:android": [
-            "//tensorflow/core:android_tensorflow_lib",
-        ],
-        "//conditions:default": [
-            "//tensorflow/core:framework_internal",
-            "//tensorflow/core:lib",
-        ],
-    }),
-)
-
 cc_library(
     name = "gen_op_registration",
     srcs = ["gen_op_registration.cc"],
diff --git a/tensorflow/contrib/lite/tools/benchmark/BUILD b/tensorflow/contrib/lite/tools/benchmark/BUILD
new file mode 100644
index 0000000000000000000000000000000000000000..8857062c000201e1077469fc36e3bf2760924a30
--- /dev/null
+++ b/tensorflow/contrib/lite/tools/benchmark/BUILD
@@ -0,0 +1,89 @@
+package(default_visibility = [
+    "//visibility:public",
+])
+
+licenses(["notice"])  # Apache 2.0
+
+load("//tensorflow/contrib/lite:special_rules.bzl", "tflite_portable_test_suite")
+load("//tensorflow/contrib/lite:build_def.bzl", "tflite_linkopts")
+load("//tensorflow/contrib/lite:build_def.bzl", "tflite_copts")
+
+common_copts = ["-Wall"] + tflite_copts()
+
+cc_binary(
+    name = "benchmark_model",
+    srcs = [
+        "benchmark_main.cc",
+        "logging.h",
+    ],
+    copts = common_copts,
+    linkopts = tflite_linkopts() + select({
+        "//tensorflow:android": [
+            "-pie",  # Android 5.0 and later supports only PIE
+            "-lm",  # some builtin ops, e.g., tanh, need -lm
+        ],
+        "//conditions:default": [],
+    }),
+    deps = [
+        ":benchmark_tflite_model_lib",
+    ],
+)
+
+cc_library(
+    name = "command_line_flags",
+    srcs = ["command_line_flags.cc"],
+    hdrs = ["command_line_flags.h"],
+    copts = common_copts,
+)
+
+cc_test(
+    name = "command_line_flags_test",
+    srcs = ["command_line_flags_test.cc"],
+    copts = common_copts,
+    visibility = ["//visibility:private"],
+    deps = [
+        ":command_line_flags",
+        "//tensorflow/contrib/lite/testing:util",
+        "@com_google_googletest//:gtest",
+    ],
+)
+
+cc_library(
+    name = "benchmark_tflite_model_lib",
+    srcs = [
+        "benchmark_tflite_model.cc",
+        "logging.h",
+    ],
+    hdrs = ["benchmark_tflite_model.h"],
+    copts = common_copts,
+    deps = [
+        ":benchmark_model_lib",
+        "//tensorflow/contrib/lite:framework",
+        "//tensorflow/contrib/lite:string_util",
+        "//tensorflow/contrib/lite/kernels:builtin_ops",
+        "//tensorflow/contrib/lite/profiling:profile_summarizer",
+        "//tensorflow/contrib/lite/profiling:profiler",
+    ],
+)
+
+cc_library(
+    name = "benchmark_model_lib",
+    srcs = [
+        "benchmark_model.cc",
+        "logging.h",
+    ],
+    hdrs = ["benchmark_model.h"],
+    copts = common_copts,
+    deps = [
+        ":command_line_flags",
+        "//tensorflow/contrib/lite:framework",
+        "//tensorflow/contrib/lite:string_util",
+        "//tensorflow/contrib/lite/kernels:builtin_ops",
+        "//tensorflow/contrib/lite/profiling:profile_summarizer",
+        "//tensorflow/contrib/lite/profiling:profiler",
+        "//tensorflow/contrib/lite/profiling:time",
+        "//tensorflow/core:stats_calculator_portable",
+    ],
+)
+
+tflite_portable_test_suite()
diff --git a/tensorflow/contrib/lite/tools/benchmark/README.md b/tensorflow/contrib/lite/tools/benchmark/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..c10826afff6d5569545d4b7df73c88d24d9dcd1a
--- /dev/null
+++ b/tensorflow/contrib/lite/tools/benchmark/README.md
@@ -0,0 +1,154 @@
+# TFLite Model Benchmark Tool
+
+## Description
+
+A simple C++ binary to benchmark a TFLite model and its individual operators,
+both on desktop machines and on Android.
+
+## To build/install/run
+
+### On Android:
+
+(0) Refer to https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/android to edit the `WORKSPACE` to configure the android NDK/SDK.
+
+(1) Build for your specific platform, e.g.:
+
+```
+bazel build -c opt \
+  --config=android_arm \
+  --cxxopt='--std=c++11' \
+  tensorflow/contrib/lite/tools/benchmark:benchmark_model
+```
+
+(2) Connect your phone. Push the binary to your phone with adb push
+     (make the directory if required):
+
+```
+adb push bazel-bin/tensorflow/contrib/lite/tools/benchmark/benchmark_model /data/local/tmp
+```
+
+(3) Make the binary executable.
+
+```
+adb shell chmod +x /data/local/tmp/benchmark_model
+```
+
+(4) Push the compute graph that you need to test. For example:
+
+```
+adb push mobilenet_quant_v1_224.tflite /data/local/tmp
+```
+
+(5) Run the benchmark. For example:
+
+```
+adb shell /data/local/tmp/benchmark_model \
+  --graph=/data/local/tmp/mobilenet_quant_v1_224.tflite \
+  --input_layer="Placeholder" \
+  --input_layer_shape="1,224,224,3" \
+  --num_threads=4
+```
+
+### On desktop:
+(1) build the binary
+
+```
+bazel build -c opt tensorflow/contrib/lite/tools/benchmark:benchmark_model
+```
+
+(2) Run on your compute graph, similar to the Android case but without the need of adb shell.
+For example:
+
+```
+bazel-bin/tensorflow/contrib/lite/tools/benchmark/benchmark_model \
+  --graph=mobilenet_quant_v1_224.tflite \
+  --input_layer="Placeholder" \
+  --input_layer_shape="1,224,224,3" \
+  --num_threads=4
+```
+
+The MobileNet graph used as an example here may be downloaded from
+https://storage.googleapis.com/download.tensorflow.org/models/tflite/mobilenet_v1_224_android_quant_2017_11_08.zip
+
+## Profiling model operators
+The benchmark model binary also allows you to profile operators and give execution times of each operator. To do this,
+compile the binary with a compiler flag that enables profiling to be compiled in. Pass **--copt=-DTFLITE_PROFILING_ENABLED**
+to compile benchmark with profiling support.
+For example, to compile with profiling support on Android, add this flag to the previous command:
+
+```
+bazel build -c opt \
+  --config=android_arm \
+  --cxxopt='--std=c++11' \
+  --copt=-DTFLITE_PROFILING_ENABLED \
+  tensorflow/contrib/lite/tools/benchmark:benchmark_model
+```
+This compiles TFLite with profiling enabled, now you can run the benchmark binary like before. The binary will produce detailed statistics for each operation similar to those shown below:
+
+```
+
+============================== Run Order ==============================
+	             [node type]	  [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
+	                 CONV_2D	    0.000	    4.269	    4.269	  0.107%	  0.107%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_0/Relu6]
+	       DEPTHWISE_CONV_2D	    4.270	    2.150	    2.150	  0.054%	  0.161%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_1_depthwise/Relu6]
+	                 CONV_2D	    6.421	    6.107	    6.107	  0.153%	  0.314%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_1_pointwise/Relu6]
+	       DEPTHWISE_CONV_2D	   12.528	    1.366	    1.366	  0.034%	  0.348%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_2_depthwise/Relu6]
+	                 CONV_2D	   13.895	    4.195	    4.195	  0.105%	  0.454%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_2_pointwise/Relu6]
+	       DEPTHWISE_CONV_2D	   18.091	    1.260	    1.260	  0.032%	  0.485%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_3_depthwise/Relu6]
+	                 CONV_2D	   19.352	    6.652	    6.652	  0.167%	  0.652%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_3_pointwise/Relu6]
+	       DEPTHWISE_CONV_2D	   26.005	    0.698	    0.698	  0.018%	  0.670%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_4_depthwise/Relu6]
+	                 CONV_2D	   26.703	    3.344	    3.344	  0.084%	  0.754%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_4_pointwise/Relu6]
+	       DEPTHWISE_CONV_2D	   30.047	    0.646	    0.646	  0.016%	  0.770%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_5_depthwise/Relu6]
+	                 CONV_2D	   30.694	    5.800	    5.800	  0.145%	  0.915%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_5_pointwise/Relu6]
+	       DEPTHWISE_CONV_2D	   36.495	    0.331	    0.331	  0.008%	  0.924%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_6_depthwise/Relu6]
+	                 CONV_2D	   36.826	    2.838	    2.838	  0.071%	  0.995%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_6_pointwise/Relu6]
+	       DEPTHWISE_CONV_2D	   39.665	    0.439	    0.439	  0.011%	  1.006%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_7_depthwise/Relu6]
+	                 CONV_2D	   40.105	    5.293	    5.293	  0.133%	  1.139%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_7_pointwise/Relu6]
+	       DEPTHWISE_CONV_2D	   45.399	    0.352	    0.352	  0.009%	  1.147%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_8_depthwise/Relu6]
+	                 CONV_2D	   45.752	    5.322	    5.322	  0.133%	  1.281%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_8_pointwise/Relu6]
+	       DEPTHWISE_CONV_2D	   51.075	    0.357	    0.357	  0.009%	  1.290%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_9_depthwise/Relu6]
+	                 CONV_2D	   51.432	    5.693	    5.693	  0.143%	  1.433%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_9_pointwise/Relu6]
+	       DEPTHWISE_CONV_2D	   57.126	    0.366	    0.366	  0.009%	  1.442%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_10_depthwise/Relu6]
+	                 CONV_2D	   57.493	    5.472	    5.472	  0.137%	  1.579%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_10_pointwise/Relu6]
+	       DEPTHWISE_CONV_2D	   62.966	    0.364	    0.364	  0.009%	  1.588%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_11_depthwise/Relu6]
+	                 CONV_2D	   63.330	    5.404	    5.404	  0.136%	  1.724%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_11_pointwise/Relu6]
+	       DEPTHWISE_CONV_2D	   68.735	    0.155	    0.155	  0.004%	  1.728%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_12_depthwise/Relu6]
+	                 CONV_2D	   68.891	    2.970	    2.970	  0.074%	  1.802%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_12_pointwise/Relu6]
+	       DEPTHWISE_CONV_2D	   71.862	    0.206	    0.206	  0.005%	  1.807%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_13_depthwise/Relu6]
+	                 CONV_2D	   72.069	    5.888	    5.888	  0.148%	  1.955%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_13_pointwise/Relu6]
+	         AVERAGE_POOL_2D	   77.958	    0.036	    0.036	  0.001%	  1.956%	     0.000	        0	[MobilenetV1/Logits/AvgPool_1a/AvgPool]
+	                 CONV_2D	   77.994	    1.445	    1.445	  0.036%	  1.992%	     0.000	        0	[MobilenetV1/Logits/Conv2d_1c_1x1/BiasAdd]
+	                 RESHAPE	   79.440	    0.002	    0.002	  0.000%	  1.992%	     0.000	        0	[MobilenetV1/Predictions/Reshape]
+	                 SOFTMAX	   79.443	    0.029	    0.029	  0.001%	  1.993%	     0.000	        0	[MobilenetV1/Predictions/Softmax]
+
+============================== Top by Computation Time ==============================
+	             [node type]	  [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
+	                 CONV_2D	   19.352	    6.652	    6.652	  0.167%	  0.167%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_3_pointwise/Relu6]
+	                 CONV_2D	    6.421	    6.107	    6.107	  0.153%	  0.320%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_1_pointwise/Relu6]
+	                 CONV_2D	   72.069	    5.888	    5.888	  0.148%	  0.468%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_13_pointwise/Relu6]
+	                 CONV_2D	   30.694	    5.800	    5.800	  0.145%	  0.613%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_5_pointwise/Relu6]
+	                 CONV_2D	   51.432	    5.693	    5.693	  0.143%	  0.756%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_9_pointwise/Relu6]
+	                 CONV_2D	   57.493	    5.472	    5.472	  0.137%	  0.893%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_10_pointwise/Relu6]
+	                 CONV_2D	   63.330	    5.404	    5.404	  0.136%	  1.029%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_11_pointwise/Relu6]
+	                 CONV_2D	   45.752	    5.322	    5.322	  0.133%	  1.162%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_8_pointwise/Relu6]
+	                 CONV_2D	   40.105	    5.293	    5.293	  0.133%	  1.295%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_7_pointwise/Relu6]
+	                 CONV_2D	    0.000	    4.269	    4.269	  0.107%	  1.402%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_0/Relu6]
+
+Number of nodes executed: 31
+============================== Summary by node type ==============================
+	             [Node type]	  [count]	  [avg ms]	    [avg %]	    [cdf %]	  [mem KB]	[times called]
+	                 CONV_2D	       15	     1.406	    89.270%	    89.270%	     0.000	        0
+	       DEPTHWISE_CONV_2D	       13	     0.169	    10.730%	   100.000%	     0.000	        0
+	                 SOFTMAX	        1	     0.000	     0.000%	   100.000%	     0.000	        0
+	                 RESHAPE	        1	     0.000	     0.000%	   100.000%	     0.000	        0
+	         AVERAGE_POOL_2D	        1	     0.000	     0.000%	   100.000%	     0.000	        0
+
+Timings (microseconds): count=50 first=79449 curr=81350 min=77385 max=88213 avg=79732 std=1929
+Memory (bytes): count=0
+31 nodes observed
+
+
+Average inference timings in us: Warmup: 83235, Init: 38467, no stats: 79760.9
+```
+
+
diff --git a/tensorflow/contrib/lite/tools/benchmark/benchmark_main.cc b/tensorflow/contrib/lite/tools/benchmark/benchmark_main.cc
new file mode 100644
index 0000000000000000000000000000000000000000..372d31e838e5666df492ee3156022249a2d97691
--- /dev/null
+++ b/tensorflow/contrib/lite/tools/benchmark/benchmark_main.cc
@@ -0,0 +1,37 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/contrib/lite/tools/benchmark/benchmark_tflite_model.h"
+#include "tensorflow/contrib/lite/tools/benchmark/logging.h"
+
+namespace tflite {
+namespace benchmark {
+
+int Main(int argc, char** argv) {
+#ifdef TFLITE_CUSTOM_OPS_HEADER
+  TFLITE_LOG(INFO) << "STARTING with custom ops!";
+#else
+  TFLITE_LOG(INFO) << "STARTING!";
+#endif
+  BenchmarkTfLiteModel benchmark;
+  BenchmarkLoggingListener listener;
+  benchmark.AddListener(&listener);
+  benchmark.Run(argc, argv);
+  return 0;
+}
+}  // namespace benchmark
+}  // namespace tflite
+
+int main(int argc, char** argv) { return tflite::benchmark::Main(argc, argv); }
diff --git a/tensorflow/contrib/lite/tools/benchmark/benchmark_model.cc b/tensorflow/contrib/lite/tools/benchmark/benchmark_model.cc
new file mode 100644
index 0000000000000000000000000000000000000000..a8a9a6112c1ec050be8d0bcfe9dc5f00df40d3ff
--- /dev/null
+++ b/tensorflow/contrib/lite/tools/benchmark/benchmark_model.cc
@@ -0,0 +1,139 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/contrib/lite/tools/benchmark/benchmark_model.h"
+
+#include <time.h>
+
+#include <iostream>
+#include <sstream>
+
+#include "tensorflow/contrib/lite/profiling/time.h"
+#include "tensorflow/contrib/lite/tools/benchmark/logging.h"
+
+namespace {
+void SleepForSeconds(double sleep_seconds) {
+  if (sleep_seconds <= 0.0) {
+    return;
+  }
+  // Convert the run_delay string into a timespec.
+  timespec req;
+  req.tv_sec = static_cast<time_t>(sleep_seconds);
+  req.tv_nsec = (sleep_seconds - req.tv_sec) * 1000000000;
+  // If requested, sleep between runs for an arbitrary amount of time.
+  // This can be helpful to determine the effect of mobile processor
+  // scaling and thermal throttling.
+#ifdef PLATFORM_WINDOWS
+  Sleep(sleep_seconds * 1000);
+#else
+  nanosleep(&req, nullptr);
+#endif
+}
+
+}  // namespace
+
+namespace tflite {
+namespace benchmark {
+using tensorflow::Stat;
+
+void BenchmarkLoggingListener::OnBenchmarkEnd(const BenchmarkResults &results) {
+  auto inference_us = results.inference_time_us();
+  auto init_us = results.startup_latency_us();
+  auto warmup_us = results.warmup_time_us();
+  TFLITE_LOG(INFO) << "Average inference timings in us: "
+                   << "Warmup: " << warmup_us.avg() << ", "
+                   << "Init: " << init_us << ", "
+                   << "no stats: " << inference_us.avg();
+}
+
+std::vector<Flag> BenchmarkModel::GetFlags() {
+  return {
+      Flag("num_runs", &params_.num_runs, "number of runs"),
+      Flag("run_delay", &params_.run_delay, "delay between runs in seconds"),
+      Flag("num_threads", &params_.num_threads, "number of threads"),
+      Flag("benchmark_name", &params_.benchmark_name, "benchmark name"),
+      Flag("output_prefix", &params_.output_prefix, "benchmark output prefix"),
+      Flag("warmup_runs", &params_.warmup_runs,
+           "how many runs to initialize model"),
+  };
+}
+
+void BenchmarkModel::LogFlags() {
+  TFLITE_LOG(INFO) << "Num runs: [" << params_.num_runs << "]";
+  TFLITE_LOG(INFO) << "Inter-run delay (seconds): [" << params_.run_delay
+                   << "]";
+  TFLITE_LOG(INFO) << "Num threads: [" << params_.num_threads << "]";
+  TFLITE_LOG(INFO) << "Benchmark name: [" << params_.benchmark_name << "]";
+  TFLITE_LOG(INFO) << "Output prefix: [" << params_.output_prefix << "]";
+  TFLITE_LOG(INFO) << "Warmup runs: [" << params_.warmup_runs << "]";
+}
+
+Stat<int64_t> BenchmarkModel::Run(int num_times, RunType run_type) {
+  Stat<int64_t> run_stats;
+  TFLITE_LOG(INFO) << "Running benchmark for " << num_times << " iterations ";
+  for (int run = 0; run < num_times; run++) {
+    listeners_.OnSingleRunStart(run_type);
+    int64_t start_us = profiling::time::NowMicros();
+    RunImpl();
+    int64_t end_us = profiling::time::NowMicros();
+    listeners_.OnSingleRunEnd();
+
+    run_stats.UpdateStat(end_us - start_us);
+    SleepForSeconds(params_.run_delay);
+  }
+
+  std::stringstream stream;
+  run_stats.OutputToStream(&stream);
+  TFLITE_LOG(INFO) << stream.str() << std::endl;
+
+  return run_stats;
+}
+
+void BenchmarkModel::Run(int argc, char **argv) {
+  if (!ParseFlags(argc, argv)) {
+    return;
+  }
+
+  LogFlags();
+
+  listeners_.OnBenchmarkStart(params_);
+  int64_t initialization_start_us = profiling::time::NowMicros();
+  Init();
+  int64_t initialization_end_us = profiling::time::NowMicros();
+  int64_t startup_latency_us = initialization_end_us - initialization_start_us;
+  TFLITE_LOG(INFO) << "Initialized session in " << startup_latency_us / 1e3
+                   << "ms";
+
+  uint64_t input_bytes = ComputeInputBytes();
+  Stat<int64_t> warmup_time_us = Run(params_.warmup_runs, WARMUP);
+  Stat<int64_t> inference_time_us = Run(params_.num_runs, REGULAR);
+  listeners_.OnBenchmarkEnd(
+      {startup_latency_us, input_bytes, warmup_time_us, inference_time_us});
+}
+
+bool BenchmarkModel::ParseFlags(int argc, char **argv) {
+  auto flag_list = GetFlags();
+  const bool parse_result =
+      Flags::Parse(&argc, const_cast<const char **>(argv), flag_list);
+  if (!parse_result) {
+    std::string usage = Flags::Usage(argv[0], flag_list);
+    TFLITE_LOG(ERROR) << usage;
+    return false;
+  }
+  return ValidateFlags();
+}
+
+}  // namespace benchmark
+}  // namespace tflite
diff --git a/tensorflow/contrib/lite/tools/benchmark/benchmark_model.h b/tensorflow/contrib/lite/tools/benchmark/benchmark_model.h
new file mode 100644
index 0000000000000000000000000000000000000000..d48f693693c2cee0cd2e2a6f2b4c590998feffb3
--- /dev/null
+++ b/tensorflow/contrib/lite/tools/benchmark/benchmark_model.h
@@ -0,0 +1,161 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CONTRIB_LITE_TOOLS_BENCHMARK_MODEL_H_
+#define TENSORFLOW_CONTRIB_LITE_TOOLS_BENCHMARK_MODEL_H_
+
+#include <cmath>
+#include <limits>
+#include <ostream>
+#include <string>
+#include <unordered_set>
+#include <vector>
+
+#include "tensorflow/contrib/lite/tools/benchmark/command_line_flags.h"
+#include "tensorflow/core/util/stats_calculator.h"
+
+namespace tflite {
+namespace benchmark {
+
+enum RunType {
+  WARMUP,
+  REGULAR,
+};
+
+class BenchmarkResults {
+ public:
+  BenchmarkResults(int64_t startup_latency_us, uint64_t input_bytes,
+                   tensorflow::Stat<int64_t> warmup_time_us,
+                   tensorflow::Stat<int64_t> inference_time_us)
+      : startup_latency_us_(startup_latency_us),
+        input_bytes_(input_bytes),
+        warmup_time_us_(warmup_time_us),
+        inference_time_us_(inference_time_us) {}
+
+  tensorflow::Stat<int64_t> inference_time_us() const {
+    return inference_time_us_;
+  }
+  tensorflow::Stat<int64_t> warmup_time_us() const { return warmup_time_us_; }
+  int64_t startup_latency_us() const { return startup_latency_us_; }
+  uint64_t input_bytes() const { return input_bytes_; }
+  double throughput_MB_per_second() const {
+    double bytes_per_sec = (input_bytes_ * inference_time_us_.count() * 1e6) /
+                           inference_time_us_.sum();
+    return bytes_per_sec / (1024.0 * 1024.0);
+  }
+
+ private:
+  int64_t startup_latency_us_;
+  uint64_t input_bytes_;
+  tensorflow::Stat<int64_t> warmup_time_us_;
+  tensorflow::Stat<int64_t> inference_time_us_;
+};
+
+struct BenchmarkParams {
+  BenchmarkParams()
+      : num_runs(50), warmup_runs(1), run_delay(-1.0), num_threads(1) {}
+  int num_runs;
+  int warmup_runs;
+  float run_delay;
+  int num_threads;
+  std::string benchmark_name;
+  std::string output_prefix;
+};
+
+class BenchmarkListener {
+ public:
+  virtual void OnBenchmarkStart(const BenchmarkParams& params) {}
+  virtual void OnSingleRunStart(RunType runType) {}
+  virtual void OnSingleRunEnd() {}
+  virtual void OnBenchmarkEnd(const BenchmarkResults& results) {}
+  virtual ~BenchmarkListener() {}
+};
+
+// A listener that forwards its method calls to a collection of listeners.
+class BenchmarkListeners : public BenchmarkListener {
+ public:
+  // Added a listener to the listener collection.
+  // |listener| is not owned by the instance of |BenchmarkListeners|.
+  // |listener| should not be null and should outlast the instance of
+  // |BenchmarkListeners|.
+  void AddListener(BenchmarkListener* listener) {
+    listeners_.push_back(listener);
+  }
+
+  void OnBenchmarkStart(const BenchmarkParams& params) override {
+    for (auto listener : listeners_) {
+      listener->OnBenchmarkStart(params);
+    }
+  }
+
+  void OnSingleRunStart(RunType runType) override {
+    for (auto listener : listeners_) {
+      listener->OnSingleRunStart(runType);
+    }
+  }
+
+  void OnSingleRunEnd() override {
+    for (auto listener : listeners_) {
+      listener->OnSingleRunEnd();
+    }
+  }
+
+  void OnBenchmarkEnd(const BenchmarkResults& results) override {
+    for (auto listener : listeners_) {
+      listener->OnBenchmarkEnd(results);
+    }
+  }
+
+  ~BenchmarkListeners() {}
+
+ private:
+  // Use vector so listeners are invoked in the order they are added.
+  std::vector<BenchmarkListener*> listeners_;
+};
+
+// Benchmark listener that just logs the results of benchmark run.
+class BenchmarkLoggingListener : public BenchmarkListener {
+  void OnBenchmarkEnd(const BenchmarkResults& results) override;
+};
+
+// Benchmarks a model.
+//
+// Subclasses need to implement initialization and running of the model.
+// The results can be collected by adding BenchmarkListener(s).
+class BenchmarkModel {
+ public:
+  virtual ~BenchmarkModel() {}
+  bool ParseFlags(int argc, char** argv);
+  virtual void Init() = 0;
+  void Run(int argc, char** argv);
+  void AddListener(BenchmarkListener* listener) {
+    listeners_.AddListener(listener);
+  }
+
+ protected:
+  virtual void LogFlags();
+  virtual bool ValidateFlags() { return true; }
+  virtual std::vector<Flag> GetFlags();
+  virtual uint64_t ComputeInputBytes() = 0;
+  virtual tensorflow::Stat<int64_t> Run(int num_times, RunType run_type);
+  virtual void RunImpl() = 0;
+  BenchmarkParams params_;
+  BenchmarkListeners listeners_;
+};
+
+}  // namespace benchmark
+}  // namespace tflite
+
+#endif  // TENSORFLOW_CONTRIB_LITE_TOOLS_BENCHMARK_BENCHMARK_MODEL_H_
diff --git a/tensorflow/contrib/lite/tools/benchmark/benchmark_tflite_model.cc b/tensorflow/contrib/lite/tools/benchmark/benchmark_tflite_model.cc
new file mode 100644
index 0000000000000000000000000000000000000000..5f803cec197858953180d379c763ed7ebd34ee1d
--- /dev/null
+++ b/tensorflow/contrib/lite/tools/benchmark/benchmark_tflite_model.cc
@@ -0,0 +1,304 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/contrib/lite/tools/benchmark/benchmark_tflite_model.h"
+
+#include <cstdarg>
+#include <cstdlib>
+#include <iostream>
+#include <memory>
+#include <string>
+#include <unordered_set>
+#include <vector>
+
+#include "tensorflow/contrib/lite/kernels/register.h"
+#include "tensorflow/contrib/lite/model.h"
+#include "tensorflow/contrib/lite/op_resolver.h"
+#include "tensorflow/contrib/lite/string_util.h"
+#include "tensorflow/contrib/lite/tools/benchmark/logging.h"
+
+#ifdef TFLITE_CUSTOM_OPS_HEADER
+void RegisterSelectedOps(::tflite::MutableOpResolver* resolver);
+#endif
+
+namespace tflite {
+namespace benchmark {
+
+void ProfilingListener::SetInterpreter(tflite::Interpreter* interpreter) {
+  TFLITE_BENCHMARK_CHECK(interpreter);
+  interpreter_ = interpreter;
+  interpreter_->SetProfiler(&profiler_);
+}
+
+void ProfilingListener::OnSingleRunStart(RunType run_type) {
+  if (run_type == REGULAR) {
+    profiler_.Reset();
+    profiler_.StartProfiling();
+  }
+}
+
+void ProfilingListener::OnBenchmarkEnd(const BenchmarkResults& results) {
+  if (has_profiles_) {
+    TFLITE_LOG(INFO) << summarizer_.GetOutputString();
+  }
+}
+
+void ProfilingListener::OnSingleRunEnd() {
+  profiler_.StopProfiling();
+  auto profile_events = profiler_.GetProfileEvents();
+  has_profiles_ = !profile_events.empty();
+  summarizer_.ProcessProfiles(profile_events, *interpreter_);
+}
+
+namespace {
+
+std::vector<std::string> Split(const std::string& str, const char delim) {
+  std::istringstream input(str);
+  std::vector<std::string> results;
+  std::string item;
+  while (std::getline(input, item, delim)) {
+    results.push_back(item);
+  }
+  return results;
+}
+
+template <typename T>
+bool SplitAndParse(const std::string& str, char delim, std::vector<T>* values) {
+  std::istringstream input(str);
+  bool first = true;
+  while (!input.eof()) {
+    if (!first) {
+      char c;
+      input >> c;
+      if (c != delim) {
+        return false;
+      }
+    } else {
+      first = false;
+    }
+    T val;
+    input >> val;
+    if (!input.eof() && !input.good()) {
+      return false;
+    }
+    values->push_back(val);
+  }
+  return true;
+}
+
+template <typename T>
+void FillRandomValue(T* ptr, const std::vector<int>& sizes,
+                     const std::function<T()>& random_func) {
+  int num_elements = 1;
+  for (int dim : sizes) {
+    num_elements *= dim;
+  }
+  for (int i = 0; i < num_elements; ++i) {
+    *ptr++ = random_func();
+  }
+}
+
+void FillRandomString(tflite::DynamicBuffer* buffer,
+                      const std::vector<int>& sizes,
+                      const std::function<string()>& random_func) {
+  int num_elements = 1;
+  for (int dim : sizes) {
+    num_elements *= dim;
+  }
+  for (int i = 0; i < num_elements; ++i) {
+    auto str = random_func();
+    buffer->AddString(str.data(), str.length());
+  }
+}
+
+bool PopulateInputLayerInfo(
+    const string& names_string, const string& shapes_string,
+    std::vector<BenchmarkTfLiteModel::InputLayerInfo>* info) {
+  std::vector<std::string> names = Split(names_string, ',');
+  std::vector<std::string> shapes = Split(shapes_string, ':');
+
+  if (names.size() != shapes.size()) {
+    TFLITE_LOG(ERROR) << "The number of items in"
+                      << " --input_layer_shape (" << shapes_string << ", with "
+                      << shapes.size() << " items)"
+                      << " must match the number of items in"
+                      << " --input_layer (" << names_string << ", with "
+                      << names.size() << " items)."
+                      << " For example --input_layer=input1,input2"
+                      << " --input_layer_shape=1,224,224,4:1,20";
+    return false;
+  }
+
+  for (int i = 0; i < names.size(); ++i) {
+    info->push_back(BenchmarkTfLiteModel::InputLayerInfo());
+    BenchmarkTfLiteModel::InputLayerInfo& input = info->back();
+
+    input.name = names[i];
+
+    TFLITE_BENCHMARK_CHECK(SplitAndParse(shapes[i], ',', &input.shape))
+        << "Incorrect size string specified: " << shapes[i];
+    for (int dim : input.shape) {
+      if (dim == -1) {
+        TFLITE_LOG(ERROR)
+            << "Any unknown sizes in the shapes (-1's) must be replaced"
+            << " with the size you want to benchmark with.";
+        return false;
+      }
+    }
+  }
+
+  return true;
+}
+
+}  // namespace
+
+std::vector<Flag> BenchmarkTfLiteModel::GetFlags() {
+  std::vector<Flag> flags = BenchmarkTfLiteModel::BenchmarkModel::GetFlags();
+  std::vector<Flag> specific_flags = {
+      Flag("graph", &graph, "graph file name"),
+      Flag("input_layer", &input_layer_string, "input layer names"),
+      Flag("input_layer_shape", &input_layer_shape_string, "input layer shape"),
+      Flag("use_nnapi", &use_nnapi, "use nnapi api")};
+
+  flags.insert(flags.end(), specific_flags.begin(), specific_flags.end());
+  return flags;
+}
+
+void BenchmarkTfLiteModel::LogFlags() {
+  BenchmarkModel::LogFlags();
+  TFLITE_LOG(INFO) << "Graph: [" << graph << "]";
+  TFLITE_LOG(INFO) << "Input layers: [" << input_layer_string << "]";
+  TFLITE_LOG(INFO) << "Input shapes: [" << input_layer_shape_string << "]";
+  TFLITE_LOG(INFO) << "Use nnapi : [" << use_nnapi << "]";
+}
+
+bool BenchmarkTfLiteModel::ValidateFlags() {
+  if (graph.empty()) {
+    TFLITE_LOG(ERROR)
+        << "Please specify the name of your TF Lite input file with --graph";
+    return false;
+  }
+  return PopulateInputLayerInfo(input_layer_string, input_layer_shape_string,
+                                &inputs);
+}
+
+uint64_t BenchmarkTfLiteModel::ComputeInputBytes() {
+  TFLITE_BENCHMARK_CHECK(interpreter);
+  uint64_t total_input_bytes = 0;
+  for (int input : interpreter->inputs()) {
+    auto* t = interpreter->tensor(input);
+    total_input_bytes += t->bytes;
+  }
+  return total_input_bytes;
+}
+
+void BenchmarkTfLiteModel::Init() {
+  model = tflite::FlatBufferModel::BuildFromFile(graph.c_str());
+  if (!model) {
+    TFLITE_LOG(FATAL) << "Failed to mmap model " << graph;
+  }
+  TFLITE_LOG(INFO) << "Loaded model " << graph;
+  model->error_reporter();
+  TFLITE_LOG(INFO) << "resolved reporter";
+
+#ifdef TFLITE_CUSTOM_OPS_HEADER
+  tflite::MutableOpResolver resolver;
+  RegisterSelectedOps(&resolver);
+#else
+  tflite::ops::builtin::BuiltinOpResolver resolver;
+#endif
+
+  tflite::InterpreterBuilder(*model, resolver)(&interpreter);
+  if (!interpreter) {
+    TFLITE_LOG(FATAL) << "Failed to construct interpreter";
+  }
+  profiling_listener_.SetInterpreter(interpreter.get());
+
+  if (params_.num_threads != -1) {
+    interpreter->SetNumThreads(params_.num_threads);
+  }
+
+  interpreter->UseNNAPI(use_nnapi);
+  auto interpreter_inputs = interpreter->inputs();
+
+  if (!inputs.empty()) {
+    TFLITE_BENCHMARK_CHECK_EQ(inputs.size(), interpreter_inputs.size())
+        << "Inputs mismatch: Model inputs #:" << interpreter_inputs.size()
+        << " expected: " << inputs.size();
+  }
+
+  // TFLITE_BENCHMARK_CHECK that all names and types match
+  for (int j = 0; j < inputs.size(); ++j) {
+    const InputLayerInfo& input = inputs[j];
+    int i = interpreter_inputs[j];
+    TfLiteTensor* t = interpreter->tensor(i);
+    TFLITE_BENCHMARK_CHECK_EQ(t->name, input.name)
+        << "Tensor # " << i << " is named " << t->name << " but flags call it "
+        << input.name;
+  }
+
+  // Resize all non-string tensors.
+  for (int j = 0; j < inputs.size(); ++j) {
+    const InputLayerInfo& input = inputs[j];
+    int i = interpreter_inputs[j];
+    TfLiteTensor* t = interpreter->tensor(i);
+    if (t->type != kTfLiteString) {
+      interpreter->ResizeInputTensor(i, input.shape);
+    }
+  }
+
+  if (interpreter->AllocateTensors() != kTfLiteOk) {
+    TFLITE_LOG(FATAL) << "Failed to allocate tensors!";
+  }
+
+  // Set the values of the input tensors.
+  for (int j = 0; j < inputs.size(); ++j) {
+    const InputLayerInfo& input = inputs[j];
+    int i = interpreter_inputs[j];
+    TfLiteTensor* t = interpreter->tensor(i);
+    std::vector<int> sizes = input.shape;
+
+    // TODO(ahentz): below we ignore the O-th dimension (number of batches).
+    if (t->type == kTfLiteFloat32) {
+      FillRandomValue<float>(
+          interpreter->typed_tensor<float>(i),
+          std::vector<int>(sizes.begin() + 1, sizes.end()),
+          []() { return static_cast<float>(rand()) / RAND_MAX - 0.5f; });
+    } else if (t->type == kTfLiteUInt8) {
+      FillRandomValue<uint8_t>(
+          interpreter->typed_tensor<uint8_t>(i),
+          std::vector<int>(sizes.begin() + 1, sizes.end()),
+          []() { return static_cast<uint8_t>(rand()) % 255; });
+    } else if (t->type == kTfLiteString) {
+      tflite::DynamicBuffer buffer;
+      FillRandomString(&buffer, sizes, []() {
+        return "we're have some friends over saturday to hang out in the yard";
+      });
+      buffer.WriteToTensor(interpreter->tensor(i));
+    } else {
+      TFLITE_LOG(FATAL) << "Don't know how to populate tensor " << t->name
+                        << " of type " << t->type;
+    }
+  }
+}
+
+void BenchmarkTfLiteModel::RunImpl() {
+  if (interpreter->Invoke() != kTfLiteOk) {
+    TFLITE_LOG(FATAL) << "Failed to invoke!";
+  }
+}
+
+}  // namespace benchmark
+}  // namespace tflite
diff --git a/tensorflow/contrib/lite/tools/benchmark/benchmark_tflite_model.h b/tensorflow/contrib/lite/tools/benchmark/benchmark_tflite_model.h
new file mode 100644
index 0000000000000000000000000000000000000000..ffb93da964b2da0328616e749abd9c5a84189468
--- /dev/null
+++ b/tensorflow/contrib/lite/tools/benchmark/benchmark_tflite_model.h
@@ -0,0 +1,86 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CONTRIB_LITE_TOOLS_BENCHMARK_TFLITE_MODEL_H_
+#define TENSORFLOW_CONTRIB_LITE_TOOLS_BENCHMARK_TFLITE_MODEL_H_
+
+#include <memory>
+#include <string>
+#include <vector>
+
+#include "tensorflow/contrib/lite/model.h"
+#include "tensorflow/contrib/lite/profiling/profile_summarizer.h"
+#include "tensorflow/contrib/lite/tools/benchmark/benchmark_model.h"
+
+namespace tflite {
+namespace benchmark {
+
+// Dumps profiling events if profiling is enabled
+class ProfilingListener : public BenchmarkListener {
+ public:
+  explicit ProfilingListener() : interpreter_(nullptr), has_profiles_(false) {}
+
+  void SetInterpreter(Interpreter* interpreter);
+
+  void OnSingleRunStart(RunType run_type) override;
+
+  void OnSingleRunEnd() override;
+
+  void OnBenchmarkEnd(const BenchmarkResults& results) override;
+
+ private:
+  Interpreter* interpreter_;
+  profiling::Profiler profiler_;
+  profiling::ProfileSummarizer summarizer_;
+  bool has_profiles_;
+};
+
+// Benchmarks a TFLite model by running tflite interpreter.
+class BenchmarkTfLiteModel : public BenchmarkModel {
+ public:
+  BenchmarkTfLiteModel() : use_nnapi(false) {
+    AddListener(&profiling_listener_);
+  }
+
+  std::vector<Flag> GetFlags() override;
+  void LogFlags() override;
+  bool ValidateFlags() override;
+  uint64_t ComputeInputBytes() override;
+  void Init() override;
+  void RunImpl() override;
+  virtual ~BenchmarkTfLiteModel() {}
+
+  struct InputLayerInfo {
+    std::string name;
+    std::vector<int> shape;
+  };
+
+ private:
+  std::unique_ptr<tflite::FlatBufferModel> model;
+  std::unique_ptr<tflite::Interpreter> interpreter;
+  std::string graph;
+  std::string input_layer_string;
+  std::string input_layer_type_string;
+  std::string input_layer_shape_string;
+  std::string input_layer_values_string;
+  std::vector<InputLayerInfo> inputs;
+  bool use_nnapi;
+  ProfilingListener profiling_listener_;
+};
+
+}  // namespace benchmark
+}  // namespace tflite
+
+#endif  // TENSORFLOW_CONTRIB_LITE_TOOLS_BENCHMARK_BENCHMARK_TFLITE_MODEL_H_
diff --git a/tensorflow/contrib/lite/tools/benchmark/command_line_flags.cc b/tensorflow/contrib/lite/tools/benchmark/command_line_flags.cc
new file mode 100644
index 0000000000000000000000000000000000000000..8195fc44beb288eec3c020791b47eefa01536fb7
--- /dev/null
+++ b/tensorflow/contrib/lite/tools/benchmark/command_line_flags.cc
@@ -0,0 +1,194 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+    http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/contrib/lite/tools/benchmark/command_line_flags.h"
+
+#include <cstring>
+#include <sstream>
+#include <string>
+#include <vector>
+
+namespace tflite {
+namespace {
+
+template <typename T>
+std::string ToString(T val) {
+  std::ostringstream stream;
+  stream << val;
+  return stream.str();
+}
+
+bool ParseFlag(const std::string& arg, const std::string& flag,
+               const std::function<bool(const std::string&)>& parse_func,
+               bool* value_parsing_ok) {
+  *value_parsing_ok = true;
+  std::string flag_prefix = "--" + flag + "=";
+  if (arg.find(flag_prefix) != 0) {
+    return false;
+  }
+  bool has_value = arg.size() >= flag_prefix.size();
+  *value_parsing_ok = has_value;
+  if (has_value) {
+    *value_parsing_ok = parse_func(arg.substr(flag_prefix.size()));
+  }
+  return true;
+}
+
+template <typename T>
+bool ParseFlag(const std::string& flag_value, T* value) {
+  std::istringstream stream(flag_value);
+  T read_value;
+  stream >> read_value;
+  if (!stream.eof() && !stream.good()) {
+    return false;
+  }
+  *value = read_value;
+  return true;
+}
+
+bool ParseBoolFlag(const std::string& flag_value, bool* value) {
+  if (flag_value != "true" && flag_value != "false") {
+    return false;
+  }
+
+  *value = (flag_value == "true");
+  return true;
+}
+
+bool ParseStringFlag(const std::string& flag_value, std::string* value) {
+  *value = flag_value;
+  return true;
+}
+
+}  // namespace
+
+Flag::Flag(const char* name, int32_t* dst, const std::string& usage_text)
+    : name_(name),
+      type_(TYPE_INT32),
+      value_hook_([dst](const std::string& flag_value) {
+        return ParseFlag<int32_t>(flag_value, dst);
+      }),
+      default_for_display_(ToString(*dst)),
+      usage_text_(usage_text) {}
+
+Flag::Flag(const char* name, int64_t* dst, const std::string& usage_text)
+    : name_(name),
+      type_(TYPE_INT64),
+      value_hook_([dst](const std::string& flag_value) {
+        return ParseFlag<int64_t>(flag_value, dst);
+      }),
+      default_for_display_(ToString(*dst)),
+      usage_text_(usage_text) {}
+
+Flag::Flag(const char* name, float* dst, const std::string& usage_text)
+    : name_(name),
+      type_(TYPE_FLOAT),
+      value_hook_([dst](const std::string& flag_value) {
+        return ParseFlag<float>(flag_value, dst);
+      }),
+      default_for_display_(ToString(*dst)),
+      usage_text_(usage_text) {}
+
+Flag::Flag(const char* name, bool* dst, const std::string& usage_text)
+    : name_(name),
+      type_(TYPE_BOOL),
+      value_hook_([dst](const std::string& flag_value) {
+        return ParseBoolFlag(flag_value, dst);
+      }),
+      default_for_display_((*dst) ? "true" : "false"),
+      usage_text_(usage_text) {}
+
+Flag::Flag(const char* name, std::string* dst, const std::string& usage_text)
+    : name_(name),
+      type_(TYPE_STRING),
+      value_hook_([dst](const std::string& flag_value) {
+        return ParseStringFlag(flag_value, dst);
+      }),
+      default_for_display_(*dst),
+      usage_text_(usage_text) {}
+
+bool Flag::Parse(const std::string& arg, bool* value_parsing_ok) const {
+  return ParseFlag(arg, name_, value_hook_, value_parsing_ok);
+}
+
+std::string Flag::GetTypeName() const {
+  switch (type_) {
+    case TYPE_INT32:
+      return "int32";
+    case TYPE_INT64:
+      return "int64";
+    case TYPE_FLOAT:
+      return "float";
+    case TYPE_BOOL:
+      return "bool";
+    case TYPE_STRING:
+      return "string";
+  }
+
+  return "unknown";
+}
+
+/*static*/ bool Flags::Parse(int* argc, const char** argv,
+                             const std::vector<Flag>& flag_list) {
+  bool result = true;
+  std::vector<const char*> unknown_flags;
+  for (int i = 1; i < *argc; ++i) {
+    if (std::string(argv[i]) == "--") {
+      while (i < *argc) {
+        unknown_flags.push_back(argv[i]);
+        ++i;
+      }
+      break;
+    }
+
+    bool was_found = false;
+    for (const Flag& flag : flag_list) {
+      bool value_parsing_ok;
+      was_found = flag.Parse(argv[i], &value_parsing_ok);
+      if (!value_parsing_ok) {
+        result = false;
+      }
+      if (was_found) {
+        break;
+      }
+    }
+    if (!was_found) {
+      unknown_flags.push_back(argv[i]);
+    }
+  }
+  int dst = 1;  // Skip argv[0]
+  for (auto f : unknown_flags) {
+    argv[dst++] = f;
+  }
+  argv[dst++] = nullptr;
+  *argc = unknown_flags.size() + 1;
+  return result && (*argc < 2 || std::strcmp(argv[1], "--help") != 0);
+}
+
+/*static*/ std::string Flags::Usage(const std::string& cmdline,
+                                    const std::vector<Flag>& flag_list) {
+  std::ostringstream usage_text;
+  usage_text << "usage: " << cmdline << "\n";
+  if (!flag_list.empty()) {
+    usage_text << "Flags:\n";
+  }
+
+  for (const Flag& flag : flag_list) {
+    auto type_name = flag.GetTypeName();
+    usage_text << "\t";
+    usage_text << "--" << flag.name_ << "=" << flag.default_for_display_;
+    usage_text << "\t" << type_name << "\t" << flag.usage_text_ << "\n";
+  }
+  return usage_text.str();
+}
+
+}  // namespace tflite
diff --git a/tensorflow/contrib/lite/tools/benchmark/command_line_flags.h b/tensorflow/contrib/lite/tools/benchmark/command_line_flags.h
new file mode 100644
index 0000000000000000000000000000000000000000..36f9e64767315a317338bc4d2db2ec2d43bee875
--- /dev/null
+++ b/tensorflow/contrib/lite/tools/benchmark/command_line_flags.h
@@ -0,0 +1,112 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CONTRIB_LITE_TOOLS_COMMAND_LINE_FLAGS_H_
+#define TENSORFLOW_CONTRIB_LITE_TOOLS_COMMAND_LINE_FLAGS_H_
+
+#include <functional>
+#include <string>
+#include <vector>
+
+namespace tflite {
+// A simple command-line argument parsing module.
+// Dependency free simplified port of core/util/command_line_flags.
+// This class is written for benchmarks and uses inefficient string
+// concatenation. This was written to avoid dependency on tensorflow/core/util
+// which transitively brings in a lot of other dependencies that are not
+// necessary for tflite benchmarking code.
+// The recommended way of using it is with local variables and an initializer
+// list of Flag objects, for example:
+//
+// int some_int = 10;
+// bool some_switch = false;
+// std::string some_name = "something";
+// std::vector<tensorFlow::Flag> flag_list = {
+//   Flag("some_int", &some_int, "an integer that affects X"),
+//   Flag("some_switch", &some_switch, "a bool that affects Y"),
+//   Flag("some_name", &some_name, "a std::string that affects Z")
+// };
+// // Get usage message before ParseFlags() to capture default values.
+// std::string usage = Flag::Usage(argv[0], flag_list);
+// bool parsed_values_ok = Flags::Parse(&argc, argv, flag_list);
+//
+// tensorflow::port::InitMain(usage.c_str(), &argc, &argv);
+// if (argc != 1 || !parsed_values_ok) {
+//    ...output usage and error message...
+// }
+//
+// The argc and argv values are adjusted by the Parse function so all that
+// remains is the program name (at argv[0]) and any unknown arguments fill the
+// rest of the array. This means you can check for flags that weren't understood
+// by seeing if argv is greater than 1.
+// The result indicates if there were any errors parsing the values that were
+// passed to the command-line switches. For example, --some_int=foo would return
+// false because the argument is expected to be an integer.
+//
+// NOTE: Unlike gflags-style libraries, this library is intended to be
+// used in the `main()` function of your binary. It does not handle
+// flag definitions that are scattered around the source code.
+
+// A description of a single command line flag, holding its name, type, usage
+// text, and a pointer to the corresponding variable.
+class Flag {
+ public:
+  Flag(const char* name, int32_t* dst, const std::string& usage_text);
+  Flag(const char* name, int64_t* dst, const std::string& usage_text);
+  Flag(const char* name, bool* dst, const std::string& usage_text);
+  Flag(const char* name, std::string* dst, const std::string& usage_text);
+  Flag(const char* name, float* dst, const std::string& usage_text);
+
+ private:
+  friend class Flags;
+
+  bool Parse(const std::string& arg, bool* value_parsing_ok) const;
+
+  std::string name_;
+  enum {
+    TYPE_INT32,
+    TYPE_INT64,
+    TYPE_BOOL,
+    TYPE_STRING,
+    TYPE_FLOAT,
+  } type_;
+
+  std::string GetTypeName() const;
+
+  std::function<bool(const std::string&)> value_hook_;
+  std::string default_for_display_;
+
+  std::string usage_text_;
+};
+
+class Flags {
+ public:
+  // Parse the command line represented by argv[0, ..., (*argc)-1] to find flag
+  // instances matching flags in flaglist[].  Update the variables associated
+  // with matching flags, and remove the matching arguments from (*argc, argv).
+  // Return true iff all recognized flag values were parsed correctly, and the
+  // first remaining argument is not "--help".
+  static bool Parse(int* argc, const char** argv,
+                    const std::vector<Flag>& flag_list);
+
+  // Return a usage message with command line cmdline, and the
+  // usage_text strings in flag_list[].
+  static std::string Usage(const std::string& cmdline,
+                           const std::vector<Flag>& flag_list);
+};
+
+}  // namespace tflite
+
+#endif  // TENSORFLOW_CONTRIB_LITE_TOOLS_BENCHMARK_COMMAND_LINE_FLAGS_H_
diff --git a/tensorflow/contrib/lite/tools/benchmark/command_line_flags_test.cc b/tensorflow/contrib/lite/tools/benchmark/command_line_flags_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..620d61b027d30044ba9d449a8e308375f72ad76f
--- /dev/null
+++ b/tensorflow/contrib/lite/tools/benchmark/command_line_flags_test.cc
@@ -0,0 +1,166 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/contrib/lite/tools/benchmark/command_line_flags.h"
+#include <gmock/gmock.h>
+#include <gtest/gtest.h>
+#include "tensorflow/contrib/lite/testing/util.h"
+
+namespace tflite {
+namespace {
+
+TEST(CommandLineFlagsTest, BasicUsage) {
+  int some_int32 = 10;
+  int64_t some_int64 = 21474836470;  // max int32 is 2147483647
+  bool some_switch = false;
+  std::string some_name = "something_a";
+  float some_float = -23.23f;
+  const char* argv_strings[] = {"program_name",
+                                "--some_int32=20",
+                                "--some_int64=214748364700",
+                                "--some_switch=true",
+                                "--some_name=somethingelse",
+                                "--some_float=42.0"};
+  int argc = 6;
+  bool parsed_ok =
+      Flags::Parse(&argc, reinterpret_cast<const char**>(argv_strings),
+                   {
+                       Flag("some_int32", &some_int32, "some int32"),
+                       Flag("some_int64", &some_int64, "some int64"),
+                       Flag("some_switch", &some_switch, "some switch"),
+                       Flag("some_name", &some_name, "some name"),
+                       Flag("some_float", &some_float, "some float"),
+                   });
+
+  EXPECT_EQ(true, parsed_ok);
+  EXPECT_EQ(20, some_int32);
+  EXPECT_EQ(214748364700, some_int64);
+  EXPECT_EQ(true, some_switch);
+  EXPECT_EQ("somethingelse", some_name);
+  EXPECT_NEAR(42.0f, some_float, 1e-5f);
+  EXPECT_EQ(argc, 1);
+}
+
+TEST(CommandLineFlagsTest, EmptyStringFlag) {
+  int argc = 2;
+  std::string some_string = "invalid";
+  const char* argv_strings[] = {"program_name", "--some_string="};
+  bool parsed_ok =
+      Flags::Parse(&argc, reinterpret_cast<const char**>(argv_strings),
+                   {Flag("some_string", &some_string, "some string")});
+
+  EXPECT_EQ(true, parsed_ok);
+  EXPECT_EQ(some_string, "");
+  EXPECT_EQ(argc, 1);
+}
+
+TEST(CommandLineFlagsTest, BadIntValue) {
+  int some_int = 10;
+  int argc = 2;
+  const char* argv_strings[] = {"program_name", "--some_int=notanumber"};
+  bool parsed_ok =
+      Flags::Parse(&argc, reinterpret_cast<const char**>(argv_strings),
+                   {Flag("some_int", &some_int, "some int")});
+
+  EXPECT_EQ(false, parsed_ok);
+  EXPECT_EQ(10, some_int);
+  EXPECT_EQ(argc, 1);
+}
+
+TEST(CommandLineFlagsTest, BadBoolValue) {
+  bool some_switch = false;
+  int argc = 2;
+  const char* argv_strings[] = {"program_name", "--some_switch=notabool"};
+  bool parsed_ok =
+      Flags::Parse(&argc, reinterpret_cast<const char**>(argv_strings),
+                   {Flag("some_switch", &some_switch, "some switch")});
+
+  EXPECT_EQ(false, parsed_ok);
+  EXPECT_EQ(false, some_switch);
+  EXPECT_EQ(argc, 1);
+}
+
+TEST(CommandLineFlagsTest, BadFloatValue) {
+  float some_float = -23.23f;
+  int argc = 2;
+  const char* argv_strings[] = {"program_name", "--some_float=notanumber"};
+  bool parsed_ok =
+      Flags::Parse(&argc, reinterpret_cast<const char**>(argv_strings),
+                   {Flag("some_float", &some_float, "some float")});
+
+  EXPECT_EQ(false, parsed_ok);
+  EXPECT_NEAR(-23.23f, some_float, 1e-5f);
+  EXPECT_EQ(argc, 1);
+}
+
+// Return whether str==pat, but allowing any whitespace in pat
+// to match zero or more whitespace characters in str.
+static bool MatchWithAnyWhitespace(const std::string& str,
+                                   const std::string& pat) {
+  bool matching = true;
+  int pat_i = 0;
+  for (int str_i = 0; str_i != str.size() && matching; str_i++) {
+    if (isspace(str[str_i])) {
+      matching = (pat_i != pat.size() && isspace(pat[pat_i]));
+    } else {
+      while (pat_i != pat.size() && isspace(pat[pat_i])) {
+        pat_i++;
+      }
+      matching = (pat_i != pat.size() && str[str_i] == pat[pat_i++]);
+    }
+  }
+  while (pat_i != pat.size() && isspace(pat[pat_i])) {
+    pat_i++;
+  }
+  return (matching && pat_i == pat.size());
+}
+
+TEST(CommandLineFlagsTest, UsageString) {
+  int some_int = 10;
+  int64_t some_int64 = 21474836470;  // max int32 is 2147483647
+  bool some_switch = false;
+  std::string some_name = "something";
+  // Don't test float in this case, because precision is hard to predict and
+  // match against, and we don't want a flakey test.
+  const std::string tool_name = "some_tool_name";
+  std::string usage = Flags::Usage(
+      tool_name + " <flags>", {Flag("some_int", &some_int, "some int"),
+                               Flag("some_int64", &some_int64, "some int64"),
+                               Flag("some_switch", &some_switch, "some switch"),
+                               Flag("some_name", &some_name, "some name")});
+  // Match the usage message, being sloppy about whitespace.
+  const char* expected_usage =
+      " usage: some_tool_name <flags>\n"
+      "Flags:\n"
+      "--some_int=10\tint32\tsome int\n"
+      "--some_int64=21474836470\tint64\tsome int64\n"
+      "--some_switch=false\tbool\tsome switch\n"
+      "--some_name=something\tstring\tsome name\n";
+  ASSERT_EQ(MatchWithAnyWhitespace(usage, expected_usage), true) << usage;
+
+  // Again but with no flags.
+  usage = Flags::Usage(tool_name, {});
+  ASSERT_EQ(MatchWithAnyWhitespace(usage, " usage: some_tool_name\n"), true)
+      << usage;
+}
+
+}  // namespace
+}  // namespace tflite
+
+int main(int argc, char** argv) {
+  ::tflite::LogToStderr();
+  ::testing::InitGoogleTest(&argc, argv);
+  return RUN_ALL_TESTS();
+}
diff --git a/tensorflow/contrib/lite/tools/benchmark/logging.h b/tensorflow/contrib/lite/tools/benchmark/logging.h
new file mode 100644
index 0000000000000000000000000000000000000000..9e9292e2feacf0eff0751534f02cdacd21c9b0dd
--- /dev/null
+++ b/tensorflow/contrib/lite/tools/benchmark/logging.h
@@ -0,0 +1,76 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CONTRIB_LITE_TOOLS_LOGGING_H_
+#define TENSORFLOW_CONTRIB_LITE_TOOLS_LOGGING_H_
+
+// LOG and CHECK macros for benchmarks.
+
+#include <cstdlib>
+#include <iostream>
+#include <sstream>
+
+namespace tflite {
+namespace logging {
+// A wrapper that logs to stderr.
+//
+// Used for TFLITE_LOG and TFLITE_BENCHMARK_CHECK macros.
+class LoggingWrapper {
+ public:
+  enum class LogSeverity : int {
+    INFO = 0,
+    WARN = 1,
+    ERROR = 2,
+    FATAL = 3,
+  };
+  LoggingWrapper(LogSeverity severity)
+      : severity_(severity), should_log_(true) {}
+  LoggingWrapper(LogSeverity severity, bool log)
+      : severity_(severity), should_log_(log) {}
+  std::stringstream& Stream() { return stream_; }
+  ~LoggingWrapper() {
+    if (should_log_) {
+      std::cerr << stream_.str() << std::endl;
+      if (severity_ == LogSeverity::FATAL) {
+        std::flush(std::cerr);
+        std::abort();
+      }
+    }
+  }
+
+ private:
+  std::stringstream stream_;
+  LogSeverity severity_;
+  bool should_log_;
+};
+
+}  // namespace logging
+
+}  // namespace tflite
+
+#define TFLITE_LOG(severity)                                  \
+  tflite::logging::LoggingWrapper(                            \
+      tflite::logging::LoggingWrapper::LogSeverity::severity) \
+      .Stream()
+
+#define TFLITE_BENCHMARK_CHECK(condition)                  \
+  tflite::logging::LoggingWrapper(                         \
+      tflite::logging::LoggingWrapper::LogSeverity::FATAL, \
+      (condition) ? false : true)                          \
+      .Stream()
+
+#define TFLITE_BENCHMARK_CHECK_EQ(a, b) TFLITE_BENCHMARK_CHECK(a == b)
+
+#endif  // TENSORFLOW_CONTRIB_LITE_TOOLS_BENCHMARK_LOGGING_H_
diff --git a/tensorflow/contrib/lite/tools/benchmark_model.cc b/tensorflow/contrib/lite/tools/benchmark_model.cc
deleted file mode 100644
index 869c531b3e3db37f634761e7b25d4ffa1e8304a7..0000000000000000000000000000000000000000
--- a/tensorflow/contrib/lite/tools/benchmark_model.cc
+++ /dev/null
@@ -1,475 +0,0 @@
-/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-==============================================================================*/
-
-#include <cstdarg>
-#include <cstdlib>
-#include <iostream>
-#include <memory>
-#include <string>
-#include <unordered_set>
-#include <vector>
-
-#include "tensorflow/contrib/lite/kernels/register.h"
-#include "tensorflow/contrib/lite/model.h"
-#include "tensorflow/contrib/lite/op_resolver.h"
-#include "tensorflow/contrib/lite/string_util.h"
-#include "tensorflow/core/lib/strings/str_util.h"
-#include "tensorflow/core/platform/env.h"
-#include "tensorflow/core/platform/init_main.h"
-#include "tensorflow/core/platform/logging.h"
-#include "tensorflow/core/util/command_line_flags.h"
-
-#ifdef TFLITE_CUSTOM_OPS_HEADER
-void RegisterSelectedOps(::tflite::MutableOpResolver* resolver);
-#endif
-
-namespace tflite {
-
-using ::tensorflow::Env;
-using ::tensorflow::str_util::Split;
-using ::tensorflow::str_util::SplitAndParseAsFloats;
-using ::tensorflow::str_util::SplitAndParseAsInts;
-
-struct InputLayerInfo {
-  string name;
-  TfLiteType data_type;
-  std::vector<int> shape;
-  // Note that initialization_values is currently unused.
-  std::vector<float> initialization_values;
-};
-
-template <typename T>
-void FillRandomValue(T* ptr, const std::vector<int>& sizes,
-                     const std::function<T()>& random_func) {
-  int num_elements = 1;
-  for (int dim : sizes) {
-    num_elements *= dim;
-  }
-  for (int i = 0; i < num_elements; ++i) {
-    *ptr++ = random_func();
-  }
-}
-
-void FillRandomString(tflite::DynamicBuffer* buffer,
-                      const std::vector<int>& sizes,
-                      const std::function<string()>& random_func) {
-  int num_elements = 1;
-  for (int dim : sizes) {
-    num_elements *= dim;
-  }
-  for (int i = 0; i < num_elements; ++i) {
-    auto str = random_func();
-    buffer->AddString(str.data(), str.length());
-  }
-}
-
-TfLiteType TfLiteTypeFromString(const string& input_layer_type) {
-  if (input_layer_type == "string")
-    return kTfLiteString;
-  else if (input_layer_type == "float")
-    return kTfLiteFloat32;
-  else if (input_layer_type == "uint8")
-    return kTfLiteUInt8;
-  else if (input_layer_type == "int32")
-    return kTfLiteInt32;
-  else if (input_layer_type == "int64")
-    return kTfLiteInt64;
-  else
-    return kTfLiteNoType;
-}
-
-std::vector<int> ShapeFromTfLiteTensor(TfLiteTensor* t) {
-  std::vector<int> result;
-  result.reserve(t->dims->size);
-  for (int i = 0; i < t->dims->size; ++i) {
-    result.push_back(t->dims->data[i]);
-  }
-  CHECK(!result.empty()) << "Found no shapes in model";
-  return result;
-}
-
-bool CreateInterpreter(const string& graph,
-                       std::unique_ptr<FlatBufferModel>* model,
-                       std::unique_ptr<Interpreter>* interpreter) {
-  *model = tflite::FlatBufferModel::BuildFromFile(graph.c_str());
-  if (!model) {
-    std::cerr << "Failed to load model " << graph << std::endl;
-    return false;
-  }
-
-#ifdef TFLITE_CUSTOM_OPS_HEADER
-  tflite::MutableOpResolver resolver;
-  RegisterSelectedOps(&resolver);
-#else
-  tflite::ops::builtin::BuiltinOpResolver resolver;
-#endif
-
-  tflite::InterpreterBuilder(*(model->get()), resolver)(interpreter);
-  if (!(*interpreter)) {
-    std::cerr << "Failed to construct interpreter" << std::endl;
-    return false;
-  }
-
-  return true;
-}
-
-bool PrepareInterpreter(const std::vector<InputLayerInfo> inputs,
-                        int num_threads, bool use_nnapi,
-                        Interpreter* interpreter) {
-  if (num_threads != -1) {
-    interpreter->SetNumThreads(num_threads);
-  }
-
-  interpreter->UseNNAPI(use_nnapi);
-
-  // Check that all names and types match
-  for (const InputLayerInfo& input : inputs) {
-    for (int i : interpreter->inputs()) {
-      TfLiteTensor* t = interpreter->tensor(i);
-      CHECK_EQ(t->name, input.name)
-          << "Tensor # " << i << " is named " << t->name
-          << " but flags call it " << input.name;
-      CHECK_EQ(t->type, input.data_type)
-          << "Could not match the type of input tensor " << t->name;
-    }
-  }
-
-  // Resize all non-string tensors.
-  for (const InputLayerInfo& input : inputs) {
-    for (int i : interpreter->inputs()) {
-      TfLiteTensor* t = interpreter->tensor(i);
-      if (t->type != kTfLiteString) {
-        interpreter->ResizeInputTensor(i, input.shape);
-      }
-    }
-  }
-
-  if (interpreter->AllocateTensors() != kTfLiteOk) {
-    std::cerr << "Failed to allocate tensors!" << std::endl;
-    return false;
-  }
-
-  // Set the values of the input tensors.
-  for (int i : interpreter->inputs()) {
-    TfLiteTensor* t = interpreter->tensor(i);
-    std::vector<int> sizes = ShapeFromTfLiteTensor(t);
-
-    // TODO(ahentz): below we ignore the O-th dimension (number of batches).
-    if (t->type == kTfLiteFloat32) {
-      FillRandomValue<float>(
-          interpreter->typed_tensor<float>(i),
-          std::vector<int>(sizes.begin() + 1, sizes.end()),
-          []() { return static_cast<float>(rand()) / RAND_MAX - 0.5f; });
-    } else if (t->type == kTfLiteUInt8) {
-      FillRandomValue<uint8_t>(
-          interpreter->typed_tensor<uint8_t>(i),
-          std::vector<int>(sizes.begin() + 1, sizes.end()),
-          []() { return static_cast<uint8_t>(rand()) % 255; });
-    } else if (t->type == kTfLiteString) {
-      tflite::DynamicBuffer buffer;
-      FillRandomString(&buffer, sizes, []() {
-        return "we're have some friends over saturday to hang out in the yard";
-      });
-      buffer.WriteToTensor(interpreter->tensor(i));
-    } else {
-      std::cerr << "Don't know how to populate tensor " << t->name
-                << " of type " << t->type << std::endl;
-      return false;
-    }
-  }
-  return true;
-}
-
-bool PopulateInputLayerInfo(const string& names_string,
-                            const string& shapes_string,
-                            const string& types_string,
-                            const string& values_string,
-                            std::vector<InputLayerInfo>* info) {
-  std::vector<string> names = Split(names_string, ',');
-  std::vector<string> shapes = Split(shapes_string, ':');
-  std::vector<string> types = Split(types_string, ',');
-  std::vector<string> values = Split(values_string, ':');
-
-  if (names.size() != shapes.size()) {
-    LOG(ERROR) << "The number of items in"
-               << " --input_layer_shape (" << shapes_string << ", with "
-               << shapes.size() << " items)"
-               << " must match the number of items in"
-               << " --input_layer (" << names_string << ", with "
-               << names.size() << " items)."
-               << " For example --input_layer=input1,input2"
-               << " --input_layer_shape=1,224,224,4:1,20";
-    return false;
-  }
-  if (names.size() != types.size()) {
-    LOG(ERROR) << "The number of items in"
-               << " --input_layer_type (" << types_string << ", with "
-               << types.size() << " items)"
-               << " must match the number of items in"
-               << " --input_layer (" << names_string << ", with "
-               << names.size() << " items)."
-               << " For example --input_layer=input1,input2"
-               << " --input_layer_type=float,int";
-    return false;
-  }
-
-  for (int i = 0; i < names.size(); ++i) {
-    info->push_back(InputLayerInfo());
-    InputLayerInfo& input = info->back();
-
-    input.name = names[i];
-
-    input.data_type = TfLiteTypeFromString(types[i]);
-    CHECK(input.data_type != kTfLiteNoType)
-        << types[i] << " was an invalid type";
-
-    CHECK(SplitAndParseAsInts(shapes[i], ',', &input.shape))
-        << "Incorrect size string specified: " << shapes[i];
-    for (int dim : input.shape) {
-      if (dim == -1) {
-        LOG(ERROR) << "Any unknown sizes in the shapes (-1's) must be replaced"
-                   << " with the size you want to benchmark with.";
-        return false;
-      }
-    }
-
-    if (i < values.size()) {
-      CHECK(SplitAndParseAsFloats(values[i], ',', &input.initialization_values))
-          << "Incorrect initialization values string specified: " << values[i];
-    }
-  }
-
-  return true;
-}
-
-bool RunBenchmark(Interpreter* interpreter, int64_t* inference_time_us) {
-  const int64_t start_time = Env::Default()->NowMicros();
-
-  if (interpreter->Invoke() != kTfLiteOk) {
-    std::cerr << "Failed to invoke!";
-    return false;
-  }
-
-  const int64_t end_time = Env::Default()->NowMicros();
-  *inference_time_us = end_time - start_time;
-  return true;
-}
-
-class Latencies {
- public:
-  void AddMeasurement(int64_t time_us) {
-    max_ = std::max(time_us, max_);
-    min_ = std::min(time_us, min_);
-    ++count_;
-    sum_ += time_us;
-    squared_sum_ += static_cast<double>(time_us) * time_us;
-  }
-
-  double avg() const {
-    if (count_ == 0) return std::numeric_limits<int64_t>::quiet_NaN();
-    return static_cast<double>(sum_) / count_;
-  }
-
-  int64_t std_deviation() const {
-    if (count_ == 0 || min_ == max_) return 0;
-    return sqrt(squared_sum_ / count_ - avg() * avg());
-  }
-
-  void OutputToStream(std::ostream* stream) const {
-    *stream << "count=" << count_;
-    if (count_ == 0) return;
-    *stream << " min=" << min_ << " max=" << max_;
-    *stream << " avg=" << avg() << " std=" << std_deviation();
-  }
-
- private:
-  int64_t count_ = 0;
-  int64_t min_ = std::numeric_limits<int64_t>::max();
-  int64_t max_ = std::numeric_limits<int64_t>::min();
-  int64_t sum_ = 0;
-  double squared_sum_ = 0;
-};
-
-bool TimeMultipleRuns(Interpreter* interpreter, double sleep_seconds,
-                      int num_runs, int64* total_time_us) {
-  // Convert the run_delay string into a timespec.
-  timespec req;
-  req.tv_sec = static_cast<time_t>(sleep_seconds);
-  req.tv_nsec = (sleep_seconds - req.tv_sec) * 1000000000;
-
-  *total_time_us = 0;
-
-  std::cout << "Running benchmark for " << num_runs
-            << " iterations: " << std::endl;
-
-  Latencies latencies;
-  for (int i = 0; i < num_runs; ++i) {
-    int64_t time_us;
-    bool run_status = RunBenchmark(interpreter, &time_us);
-    latencies.AddMeasurement(time_us);
-    *total_time_us += time_us;
-    if (!run_status) {
-      std::cout << "Failed on run " << i << std::endl;
-      return false;
-    }
-
-    // If requested, sleep between runs for an arbitrary amount of time.
-    // This can be helpful to determine the effect of mobile processor
-    // scaling and thermal throttling.
-    if (sleep_seconds > 0.0) {
-#ifdef PLATFORM_WINDOWS
-      Sleep(sleep_seconds * 1000);
-#else
-      nanosleep(&req, nullptr);
-#endif
-    }
-  }
-  latencies.OutputToStream(&std::cout);
-  std::cout << std::endl;
-
-  return true;
-}
-
-int Main(int argc, char** argv) {
-  using tensorflow::Flag;
-  using tensorflow::Flags;
-
-  string graph;               // e.g.: /data/local/tmp/tfl_inception-v1_model.fb
-  string input_layer_string;  // e.g.: input
-  string input_layer_shape_string;  // e.g.: 1,224,224,3
-  string input_layer_type_string;   // e.g.: float
-  string input_layer_values_string;
-  string output_layer_string;  // e.g.: output
-  int num_runs = 50;
-  string run_delay = "-1.0";
-  int num_threads = 1;
-  string benchmark_name = "";
-  string output_prefix = "";
-  int warmup_runs = 1;
-  bool use_nnapi = false;
-
-  std::vector<Flag> flag_list = {
-      Flag("graph", &graph, "graph file name"),
-      // All the following flags are optional, but can be used in order
-      // to benchmark different input shapes.
-      Flag("input_layer", &input_layer_string, "input layer names"),
-      Flag("input_layer_shape", &input_layer_shape_string, "input layer shape"),
-      Flag("input_layer_type", &input_layer_type_string, "input layer type"),
-      Flag("input_layer_values", &input_layer_values_string,
-           "values to initialize the inputs with"),
-      Flag("output_layer", &output_layer_string, "output layer name"),
-      Flag("num_runs", &num_runs, "number of runs"),
-      Flag("run_delay", &run_delay, "delay between runs in seconds"),
-      Flag("num_threads", &num_threads, "number of threads"),
-      Flag("benchmark_name", &benchmark_name, "benchmark name"),
-      Flag("output_prefix", &output_prefix, "benchmark output prefix"),
-      Flag("warmup_runs", &warmup_runs, "how many runs to initialize model"),
-      Flag("use_nnapi", &use_nnapi, "use nnapi api"),
-  };
-  string usage = Flags::Usage(argv[0], flag_list);
-  const bool parse_result = Flags::Parse(&argc, argv, flag_list);
-  tensorflow::port::InitMain(argv[0], &argc, &argv);
-
-  if (!parse_result) {
-    std::cerr << usage << std::endl;
-    return -1;
-  }
-
-  std::cout << "Graph: [" << graph << "]" << std::endl;
-  if (!input_layer_string.empty()) {
-    std::cout << "Input layers: [" << input_layer_string << "]" << std::endl;
-    std::cout << "Input shapes: [" << input_layer_shape_string << "]"
-              << std::endl;
-    std::cout << "Input types: [" << input_layer_type_string << "]"
-              << std::endl;
-  }
-  if (!output_layer_string.empty()) {
-    std::cout << "Output layers: [" << output_layer_string << "]" << std::endl;
-  }
-  std::cout << "Num runs: [" << num_runs << "]" << std::endl;
-  std::cout << "Inter-run delay (seconds): [" << run_delay << "]" << std::endl;
-  std::cout << "Num threads: [" << num_threads << "]" << std::endl;
-  if (!benchmark_name.empty()) {
-    std::cout << "Benchmark name: [" << benchmark_name << "]" << std::endl;
-    std::cout << "Output prefix: [" << output_prefix << "]" << std::endl;
-  }
-  std::cout << "Warmup runs: [" << warmup_runs << "]" << std::endl;
-  std::cout << "Use nnapi : [" << use_nnapi << "]" << std::endl;
-
-  if (graph.empty()) {
-    std::cout
-        << "Please specify the name of your TF Lite input file with --graph"
-        << std::endl;
-    return -1;
-  }
-
-  std::vector<InputLayerInfo> inputs;
-  if (!PopulateInputLayerInfo(input_layer_string, input_layer_shape_string,
-                              input_layer_type_string,
-                              input_layer_values_string, &inputs)) {
-    return -1;
-  }
-
-  int64 initialization_start_us = Env::Default()->NowMicros();
-
-  std::unique_ptr<tflite::FlatBufferModel> model;
-  std::unique_ptr<tflite::Interpreter> interpreter;
-  if (!CreateInterpreter(graph, &model, &interpreter)) {
-    return -1;
-  }
-  if (!PrepareInterpreter(inputs, num_threads, use_nnapi, interpreter.get())) {
-    return -1;
-  }
-
-  int64 initialization_end_us = Env::Default()->NowMicros();
-
-  const double initialization_time_s =
-      (initialization_end_us - initialization_start_us) / 1000000.0f;
-  std::cout << "Initialized session in " << initialization_time_s << "s"
-            << std::endl;
-
-  const double sleep_seconds = std::strtod(run_delay.c_str(), nullptr);
-
-  // If requested, run through the graph first to preinitialize everything
-  // before the benchmarking runs.
-  int64 warmup_time_us = 0;
-  if (warmup_runs > 0) {
-    if (!TimeMultipleRuns(interpreter.get(), sleep_seconds, warmup_runs,
-                          &warmup_time_us)) {
-      std::cerr << "Warmup failed" << std::endl;
-      return -1;
-    }
-  }
-
-  // Capture overall inference time without stat logging overhead. This is the
-  // timing data that can be compared to other libaries.
-  int64 no_stat_time_us = 0;
-  if (!TimeMultipleRuns(interpreter.get(), sleep_seconds, num_runs,
-                        &no_stat_time_us)) {
-    std::cerr << "Timing failed." << std::endl;
-    return -1;
-  }
-
-  std::cout << "Average inference timings in us: " << no_stat_time_us / num_runs
-            << " , Warmup: "
-            << (warmup_runs > 0 ? warmup_time_us / warmup_runs : 0) << ", "
-            << std::endl;
-
-  return 0;
-}
-
-}  // namespace tflite
-
-int main(int argc, char** argv) { return ::tflite::Main(argc, argv); }
diff --git a/tensorflow/contrib/lite/tools/verifier_test.cc b/tensorflow/contrib/lite/tools/verifier_test.cc
index ce8a7857d2dd66b12e9ea970911ef1dd01e4550e..ad7d59ecb41a0c81a6a4d8edae5fa6b4b5a7bede 100644
--- a/tensorflow/contrib/lite/tools/verifier_test.cc
+++ b/tensorflow/contrib/lite/tools/verifier_test.cc
@@ -41,7 +41,7 @@ class TfLiteFlatbufferModelBuilder {
   }
 
   TfLiteFlatbufferModelBuilder(const std::vector<BuiltinOperator>& builtin_ops,
-                               const std::vector<string>& custom_ops) {
+                               const std::vector<std::string>& custom_ops) {
     buffers_.push_back(
         CreateBuffer(builder_, builder_.CreateVector(std::vector<uint8_t>{})));
 
@@ -194,8 +194,8 @@ TEST(VerifyModel, TensorBufferIsNotValid) {
                       /*operators=*/0, builder.CreateString("Main"))});
 
   auto buffers = builder.CreateVector(std::vector<Offset<Buffer>>{
-      CreateBuffer(builder,
-                   builder.CreateVector(std::vector<uint8>{1, 2, 3, 4, 5, 6})),
+      CreateBuffer(builder, builder.CreateVector(
+                                std::vector<uint8_t>{1, 2, 3, 4, 5, 6})),
   });
 
   auto model = CreateModel(builder, TFLITE_SCHEMA_VERSION, /*operator_codes=*/0,
diff --git a/tensorflow/contrib/lite/util.cc b/tensorflow/contrib/lite/util.cc
index fb4af07d060cac3a6a4e01c7d625b6db5241f10d..8ccb65c24fd64f05d7e2c888f7932e586c1e11ec 100644
--- a/tensorflow/contrib/lite/util.cc
+++ b/tensorflow/contrib/lite/util.cc
@@ -38,4 +38,14 @@ bool EqualArrayAndTfLiteIntArray(const TfLiteIntArray* a, const int b_size,
   return true;
 }
 
+size_t CombineHashes(std::initializer_list<size_t> hashes) {
+  size_t result = 0;
+  // Hash combiner used by TensorFlow core.
+  for (size_t hash : hashes) {
+    result = result ^
+             (hash + 0x9e3779b97f4a7800ULL + (result << 10) + (result >> 4));
+  }
+  return result;
+}
+
 }  // namespace tflite
diff --git a/tensorflow/contrib/lite/util.h b/tensorflow/contrib/lite/util.h
index a34db35823104414cce028b9119397da085d05b1..89d9b4f5cffa99e708f391fd8fe19208009b5e79 100644
--- a/tensorflow/contrib/lite/util.h
+++ b/tensorflow/contrib/lite/util.h
@@ -35,6 +35,8 @@ TfLiteIntArray* ConvertArrayToTfLiteIntArray(const int rank, const int* dims);
 bool EqualArrayAndTfLiteIntArray(const TfLiteIntArray* a, const int b_size,
                                  const int* b);
 
+size_t CombineHashes(std::initializer_list<size_t> hashes);
+
 }  // namespace tflite
 
 #endif  // TENSORFLOW_CONTRIB_LITE_UTIL_H_
diff --git a/tensorflow/contrib/lookup/lookup_ops_test.py b/tensorflow/contrib/lookup/lookup_ops_test.py
index 5d4682ec9f4b8c5864383bd1d2f4c0b41a11baad..5a080cceabb55c307dcd1a457a9e30d24e0bd172 100644
--- a/tensorflow/contrib/lookup/lookup_ops_test.py
+++ b/tensorflow/contrib/lookup/lookup_ops_test.py
@@ -24,6 +24,7 @@ import six
 
 from tensorflow.contrib import lookup
 from tensorflow.python.client import session
+from tensorflow.python.eager import context
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import errors_impl
@@ -1396,15 +1397,22 @@ class KeyValueTensorInitializerTest(test.TestCase):
 
 class IndexTableFromTensor(test.TestCase):
 
+  @test_util.run_in_graph_and_eager_modes()
   def test_index_table_from_tensor_with_tensor_init(self):
-    with self.test_session():
+    table = lookup.index_table_from_tensor(
+        mapping=("brain", "salad", "surgery"), num_oov_buckets=1)
+
+    if not context.executing_eagerly():
+      with self.assertRaises(errors_impl.OpError):
+        self.evaluate(table.lookup(
+            constant_op.constant(("salad", "surgery", "tarkus"))))
+    else:
+      # Reinitializing a table in eager should work.
       table = lookup.index_table_from_tensor(
           mapping=("brain", "salad", "surgery"), num_oov_buckets=1)
-      ids = table.lookup(constant_op.constant(("salad", "surgery", "tarkus")))
-
-      self.assertRaises(errors_impl.OpError, ids.eval)
-      lookup_ops.tables_initializer().run()
-      self.assertAllEqual((1, 2, 3), ids.eval())
+    self.evaluate(lookup_ops.tables_initializer())
+    ids = table.lookup(constant_op.constant(("salad", "surgery", "tarkus")))
+    self.assertAllEqual((1, 2, 3), self.evaluate(ids))
 
   def test_int32_index_table_from_tensor_with_tensor_init(self):
     with self.test_session():
diff --git a/tensorflow/contrib/makefile/download_dependencies.sh b/tensorflow/contrib/makefile/download_dependencies.sh
index eff9081e35c285027c764c5bdbaf14f78bc5f512..48953e2e3843ff92744514d28bd725cc0d72f3a8 100755
--- a/tensorflow/contrib/makefile/download_dependencies.sh
+++ b/tensorflow/contrib/makefile/download_dependencies.sh
@@ -27,9 +27,7 @@ if [ ! -f $BZL_FILE_PATH ]; then
 fi
 
 EIGEN_URL="$(grep -o 'http.*bitbucket.org/eigen/eigen/get/.*tar\.gz' "${BZL_FILE_PATH}" | grep -v mirror.bazel | head -n1)"
-# TODO (yongtang): Replace the following with 'https://mirror.bazel.build/github.com/google/gemmlowp/.*zip' once
-# the archive has been propagated in mirror.bazel.build.
-GEMMLOWP_URL="$(grep -o 'https://github.com/google/gemmlowp/.*zip' "${BZL_FILE_PATH}" | head -n1)"
+GEMMLOWP_URL="$(grep -o 'https://mirror.bazel.build/github.com/google/gemmlowp/.*zip' "${BZL_FILE_PATH}" | head -n1)"
 GOOGLETEST_URL="https://github.com/google/googletest/archive/release-1.8.0.tar.gz"
 NSYNC_URL="$(grep -o 'https://mirror.bazel.build/github.com/google/nsync/.*tar\.gz' "${BZL_FILE_PATH}" | head -n1)"
 PROTOBUF_URL="$(grep -o 'https://mirror.bazel.build/github.com/google/protobuf/.*tar\.gz' "${BZL_FILE_PATH}" | head -n1)"
diff --git a/tensorflow/contrib/metrics/python/ops/metric_ops.py b/tensorflow/contrib/metrics/python/ops/metric_ops.py
index 00a933e5e0c537033573b225d43581f74557b240..a6be2084aae6bb05f958929b45977ed21b570603 100644
--- a/tensorflow/contrib/metrics/python/ops/metric_ops.py
+++ b/tensorflow/contrib/metrics/python/ops/metric_ops.py
@@ -1544,7 +1544,7 @@ def precision_recall_at_equal_thresholds(labels,
     result: A named tuple (See PrecisionRecallData within the implementation of
       this function) with properties that are variables of shape
       `[num_thresholds]`. The names of the properties are tp, fp, tn, fn,
-      precision, recall, thresholds.
+      precision, recall, thresholds. Types are same as that of predictions.
     update_op: An op that accumulates values.
 
   Raises:
@@ -1570,7 +1570,6 @@ def precision_recall_at_equal_thresholds(labels,
 
   check_ops.assert_type(labels, dtypes.bool)
 
-  dtype = predictions.dtype
   with variable_scope.variable_scope(name,
                                      'precision_recall_at_equal_thresholds',
                                      (labels, predictions, weights)):
@@ -1592,11 +1591,16 @@ def precision_recall_at_equal_thresholds(labels,
 
     predictions.get_shape().assert_is_compatible_with(labels.get_shape())
 
-    # We cast to float to ensure we have 0.0 or 1.0.
-    f_labels = math_ops.cast(labels, dtype)
+    # It's important we aggregate using float64 since we're accumulating a lot
+    # of 1.0's for the true/false labels, and accumulating to float32 will
+    # be quite inaccurate even with just a modest amount of values (~20M).
+    # We use float64 instead of integer primarily since GPU scatter kernel
+    # only support floats.
+    agg_dtype = dtypes.float64
 
-    # Get weighted true/false labels.
-    true_labels = f_labels * weights
+    f_labels = math_ops.cast(labels, agg_dtype)
+    weights = math_ops.cast(weights, agg_dtype)
+    true_labels = f_labels  * weights
     false_labels = (1.0 - f_labels) * weights
 
     # Flatten predictions and labels.
@@ -1638,9 +1642,9 @@ def precision_recall_at_equal_thresholds(labels,
 
     with ops.name_scope('variables'):
       tp_buckets_v = metrics_impl.metric_variable(
-          [num_thresholds], dtype, name='tp_buckets')
+          [num_thresholds], agg_dtype, name='tp_buckets')
       fp_buckets_v = metrics_impl.metric_variable(
-          [num_thresholds], dtype, name='fp_buckets')
+          [num_thresholds], agg_dtype, name='fp_buckets')
 
     with ops.name_scope('update_op'):
       update_tp = state_ops.scatter_add(
@@ -1660,18 +1664,21 @@ def precision_recall_at_equal_thresholds(labels,
     fn = tp[0] - tp
 
     # We use a minimum to prevent division by 0.
-    epsilon = 1e-7
+    epsilon = ops.convert_to_tensor(1e-7, dtype=agg_dtype)
     precision = tp / math_ops.maximum(epsilon, tp + fp)
     recall = tp / math_ops.maximum(epsilon, tp + fn)
 
+    # Convert all tensors back to predictions' dtype (as per function contract).
+    out_dtype = predictions.dtype
+    _convert = lambda tensor: math_ops.cast(tensor, out_dtype)
     result = PrecisionRecallData(
-        tp=tp,
-        fp=fp,
-        tn=tn,
-        fn=fn,
-        precision=precision,
-        recall=recall,
-        thresholds=math_ops.lin_space(0.0, 1.0, num_thresholds))
+        tp=_convert(tp),
+        fp=_convert(fp),
+        tn=_convert(tn),
+        fn=_convert(fn),
+        precision=_convert(precision),
+        recall=_convert(recall),
+        thresholds=_convert(math_ops.lin_space(0.0, 1.0, num_thresholds)))
     update_op = control_flow_ops.group(update_tp, update_fp)
     return result, update_op
 
@@ -2496,7 +2503,7 @@ def _compute_recall_at_precision(tp, fp, fn, precision, name):
     name: An optional variable_scope name.
 
   Returns:
-    The recall at a the given `precision`.
+    The recall at a given `precision`.
   """
   precisions = math_ops.div(tp, tp + fp + _EPSILON)
   tf_index = math_ops.argmin(
diff --git a/tensorflow/contrib/metrics/python/ops/metric_ops_test.py b/tensorflow/contrib/metrics/python/ops/metric_ops_test.py
index 76420db8bda39435bcc2be2fd3d8c3467d6753e2..b13f08a37d9e856d56903324fc6e7cf1457bb191 100644
--- a/tensorflow/contrib/metrics/python/ops/metric_ops_test.py
+++ b/tensorflow/contrib/metrics/python/ops/metric_ops_test.py
@@ -2333,47 +2333,24 @@ class StreamingPrecisionRecallAtEqualThresholdsTest(test.TestCase):
     np.random.seed(1)
     ops.reset_default_graph()
 
-  def _testResultsEqual(self, expected_dict, gotten_result):
+  def _testResultsEqual(self, expected_dict, gotten_result, eps=None):
     """Tests that 2 results (dicts) represent the same data.
 
     Args:
       expected_dict: A dictionary with keys that are the names of properties
         of PrecisionRecallData and whose values are lists of floats.
       gotten_result: A PrecisionRecallData object.
+      eps: Epsilon value to use for testing output values. If unspecified, use
+        default from assertAllClose.
     """
     gotten_dict = {k: t.eval() for k, t in gotten_result._asdict().items()}
     self.assertItemsEqual(list(expected_dict.keys()), list(gotten_dict.keys()))
 
     for key, expected_values in expected_dict.items():
-      self.assertAllClose(expected_values, gotten_dict[key])
-
-  def _testCase(self, predictions, labels, expected_result, weights=None):
-    """Performs a test given a certain scenario of labels, predictions, weights.
-
-    Args:
-      predictions: The predictions tensor. Of type float32.
-      labels: The labels tensor. Of type bool.
-      expected_result: The expected result (dict) that maps to tensors.
-      weights: Optional weights tensor.
-    """
-    with self.test_session() as sess:
-      predictions_tensor = constant_op.constant(
-          predictions, dtype=dtypes_lib.float32)
-      labels_tensor = constant_op.constant(labels, dtype=dtypes_lib.bool)
-      weights_tensor = None
-      if weights:
-        weights_tensor = constant_op.constant(weights, dtype=dtypes_lib.float32)
-      gotten_result, update_op = (
-          metric_ops.precision_recall_at_equal_thresholds(
-              labels=labels_tensor,
-              predictions=predictions_tensor,
-              weights=weights_tensor,
-              num_thresholds=3))
-
-      sess.run(variables.local_variables_initializer())
-      sess.run(update_op)
-
-      self._testResultsEqual(expected_result, gotten_result)
+      if eps is not None:
+        self.assertAllClose(expected_values, gotten_dict[key], atol=eps)
+      else:
+        self.assertAllClose(expected_values, gotten_dict[key])
 
   def testVars(self):
     metric_ops.precision_recall_at_equal_thresholds(
@@ -2414,6 +2391,78 @@ class StreamingPrecisionRecallAtEqualThresholdsTest(test.TestCase):
       for _ in range(3):
         self._testResultsEqual(initial_result, result)
 
+  def testLargeCase(self):
+    self.skipTest("Test consistently timing out")
+    shape = [32, 512, 256, 1]
+    predictions = random_ops.random_uniform(
+        shape, 0.0, 1.0, dtype=dtypes_lib.float32)
+    labels = math_ops.greater(random_ops.random_uniform(shape, 0.0, 1.0), 0.5)
+
+    result, update_op = metric_ops.precision_recall_at_equal_thresholds(
+        labels=labels, predictions=predictions, num_thresholds=201)
+    # Run many updates, enough to cause highly inaccurate values if the
+    # code used float32 for accumulation.
+    num_updates = 71
+
+    with self.test_session() as sess:
+      sess.run(variables.local_variables_initializer())
+      for _ in xrange(num_updates):
+        sess.run(update_op)
+
+      prdata = sess.run(result)
+
+      # Since we use random values, we won't know the tp/fp/tn/fn values, but
+      # tp and fp at threshold 0 should be the total number of positive and
+      # negative labels, hence their sum should be total number of pixels.
+      expected_value = 1.0 * np.product(shape) * num_updates
+      got_value = prdata.tp[0] + prdata.fp[0]
+      # They should be at least within 1.
+      self.assertNear(got_value, expected_value, 1.0)
+
+  def _testCase(self,
+                predictions,
+                labels,
+                expected_result,
+                dtype=dtypes_lib.float32,
+                eps=None,
+                weights=None):
+    """Performs a test given a certain scenario of labels, predictions, weights.
+
+    Args:
+      predictions: The predictions tensor. Of type dtype.
+      labels: The labels tensor. Of type bool.
+      expected_result: The expected result (dict) that maps to tensors.
+      dtype: Data type to use for predictions and weights tensor. Default
+        is float32.
+      eps: Epsilon value to use for testing output values. If unspecified, use
+        default from assertAllClose.
+      weights: Optional weights tensor.
+    """
+    with self.test_session() as sess:
+      predictions_tensor = constant_op.constant(predictions, dtype=dtype)
+      labels_tensor = constant_op.constant(labels, dtype=dtypes_lib.bool)
+      weights_tensor = None
+      if weights:
+        weights_tensor = constant_op.constant(weights, dtype=dtype)
+      gotten_result, update_op = (
+          metric_ops.precision_recall_at_equal_thresholds(
+              labels=labels_tensor,
+              predictions=predictions_tensor,
+              weights=weights_tensor,
+              num_thresholds=3))
+      self.assertEqual(gotten_result.tp.dtype, dtype)
+      self.assertEqual(gotten_result.fp.dtype, dtype)
+      self.assertEqual(gotten_result.tn.dtype, dtype)
+      self.assertEqual(gotten_result.fn.dtype, dtype)
+      self.assertEqual(gotten_result.precision.dtype, dtype)
+      self.assertEqual(gotten_result.recall.dtype, dtype)
+      self.assertEqual(gotten_result.thresholds.dtype, dtype)
+
+      sess.run(variables.local_variables_initializer())
+      sess.run(update_op)
+
+      self._testResultsEqual(expected_result, gotten_result, eps=eps)
+
   def testAllTruePositives(self):
     self._testCase(
         [[1]], [[True]], {
@@ -2489,6 +2538,35 @@ class StreamingPrecisionRecallAtEqualThresholdsTest(test.TestCase):
         },
         weights=[[0.0, 0.5, 2.0, 0.0, 0.5, 1.0]])
 
+  def testFloat64(self):
+    self._testCase(
+        [[0.2, 0.3, 0.4, 0.6, 0.7, 0.8]],
+        [[True, False, False, True, True, True]], {
+            'tp': [4, 3, 0],
+            'fp': [2, 0, 0],
+            'tn': [0, 2, 2],
+            'fn': [0, 1, 4],
+            'precision': [2.0 / 3.0, 1.0, 0.0],
+            'recall': [1.0, 0.75, 0.0],
+            'thresholds': [0.0, 0.5, 1.0],
+        },
+        dtype=dtypes_lib.float64)
+
+  def testFloat16(self):
+    self._testCase(
+        [[0.2, 0.3, 0.4, 0.6, 0.7, 0.8]],
+        [[True, False, False, True, True, True]], {
+            'tp': [4, 3, 0],
+            'fp': [2, 0, 0],
+            'tn': [0, 2, 2],
+            'fn': [0, 1, 4],
+            'precision': [2.0 / 3.0, 1.0, 0.0],
+            'recall': [1.0, 0.75, 0.0],
+            'thresholds': [0.0, 0.5, 1.0],
+        },
+        dtype=dtypes_lib.float16,
+        eps=1e-3)
+
 
 class StreamingSpecificityAtSensitivityTest(test.TestCase):
 
@@ -7101,6 +7179,14 @@ class CohenKappaTest(test.TestCase):
     with self.assertRaises(ValueError):
       metrics.cohen_kappa(labels, invalid_predictions, 3)
 
+  def testConditionalPackingOptimization(self):
+    placeholder = array_ops.placeholder(dtypes_lib.float32, [None])
+    values, update_op = metric_ops.streaming_concat(placeholder)
+    with self.test_session() as sess:
+      sess.run(variables.local_variables_initializer())
+      for feed in range(10):
+        sess.run(update_op, feed_dict={placeholder: [feed]})
+        print(sess.run(values))
 
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/contrib/mixed_precision/python/loss_scale_optimizer.py b/tensorflow/contrib/mixed_precision/python/loss_scale_optimizer.py
index e4e5ccc33472ad5a12bd8111fb1ff6ebbd6f45f9..ef34f7bf7bf3eba047b50ce8abf883b0ed741a63 100644
--- a/tensorflow/contrib/mixed_precision/python/loss_scale_optimizer.py
+++ b/tensorflow/contrib/mixed_precision/python/loss_scale_optimizer.py
@@ -26,26 +26,32 @@ from tensorflow.python.training import optimizer
 
 
 class LossScaleOptimizer(optimizer.Optimizer):
+  # TODO(jamesqin): move mixed precision training explanation to __init__
+  # docstring.
   """An optimizer that applies loss scaling in backprop.
 
-  This class is useful for mixed precision training on GPUs (or other potential
-  accelerators), which is an approach to improve compute throughput without loss
-  of model quality.
-
-  The commmon configuration of mixed precision models is the following:
-  * variables are kept in high precision (e.g. float32).
-  * computations are done in lower precision (e.g. float16). variables are
-    casted to lower precision before they're used.
-  * (in training), final gradients are casted back to variable precision and get
-    applied.
-
-  Because computations happen in lower precision, gradients in the backprop pass
-  might underflow in the smaller dynamic range, causing a model to converge at a
-  suboptimal level. This optimizer multiplies the loss by a factor before
-  backprop starts to prevent underflow. Before gradients are applied, they are
-  casted to higher precision and down-scaled by the same factor, so
-  mathematically the variable updates are no different from regular
-  same-precision training.
+  This class is useful for "mixed precision training" on GPUs (or other
+  potential accelerators), an approach to improve compute throughput without
+  compromising model quality.
+
+  The canonical way to perform mixed precision training is the following:
+  * Model variables are kept in high precision (e.g. float32).
+  * Computations are done in lower precision (e.g. float16), which enjoys
+    performance speedup by virtue of hardware support. Variables are casted to
+    lower precision before they're used.
+  * Final gradients are casted back to high precision dtype, then used to update
+    variables.
+
+  The side-effect of performing computation in lower precision, is that it comes
+  with smaller numerical range. During backproping, small gradients might
+  underflow in the reduced numerical range, causing a model to converge at
+  suboptimal level.
+
+  To prevent underflow, this optimizer multiplies the loss by a factor before
+  backprop starts. Consequently, the gradients are linearly scaled up by the
+  same factor, thus not falling into the underflow zone. After that, to perserve
+  the correctness of backprop, the gradients are down-scaled by the same factor,
+  casted to the (higher) variable precision, then applied on the variables.
 
   See [Nvidia's manual on mixed precision training](
   https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html)
diff --git a/tensorflow/contrib/nccl/kernels/nccl_manager_test.cc b/tensorflow/contrib/nccl/kernels/nccl_manager_test.cc
index 4d8d922cb42d2974dab32cf4562bee3993bef098..5144f7c38c8650ebfced1dfcc9378263ebaad8c0 100644
--- a/tensorflow/contrib/nccl/kernels/nccl_manager_test.cc
+++ b/tensorflow/contrib/nccl/kernels/nccl_manager_test.cc
@@ -171,8 +171,7 @@ class NcclManagerTest : public ::testing::Test {
 
  private:
   static Allocator* GpuAllocator(BaseGPUDevice* device) {
-    return device->GetStepAllocator(AllocatorAttributes(),
-                                    nullptr /* step_resource_manager */);
+    return device->GetAllocator(AllocatorAttributes());
   }
 
   static se::DeviceMemory<Scalar> AsDeviceMemory(const Scalar* cuda_memory) {
diff --git a/tensorflow/contrib/opt/BUILD b/tensorflow/contrib/opt/BUILD
index 13aa1d7e7a11877373a848c1ba865aa418790cd0..114b344d38413208755a47f36f45badc1a5ecaa9 100644
--- a/tensorflow/contrib/opt/BUILD
+++ b/tensorflow/contrib/opt/BUILD
@@ -28,6 +28,7 @@ py_library(
         "python/training/reg_adagrad_optimizer.py",
         "python/training/sign_decay.py",
         "python/training/variable_clipping_optimizer.py",
+        "python/training/weight_decay_optimizers.py",
     ],
     srcs_version = "PY2AND3",
     deps = [
@@ -194,6 +195,25 @@ py_test(
     ],
 )
 
+py_test(
+    name = "weight_decay_optimizers_test",
+    srcs = ["python/training/weight_decay_optimizers_test.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":opt_py",
+        "//tensorflow/python:array_ops",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:constant_op",
+        "//tensorflow/python:dtypes",
+        "//tensorflow/python:framework_ops",
+        "//tensorflow/python:math_ops",
+        "//tensorflow/python:resource_variable_ops",
+        "//tensorflow/python:session",
+        "//tensorflow/python:variables",
+        "//third_party/py/numpy",
+    ],
+)
+
 tf_py_test(
     name = "drop_stale_gradient_optimizer_test",
     srcs = ["python/training/drop_stale_gradient_optimizer_test.py"],
diff --git a/tensorflow/contrib/opt/__init__.py b/tensorflow/contrib/opt/__init__.py
index 4c13c8e247185213b798eb733ddcf65a07a8f64d..5df5d35f8e4f8fcc2c5aa09bd8f3254e16e3a74f 100644
--- a/tensorflow/contrib/opt/__init__.py
+++ b/tensorflow/contrib/opt/__init__.py
@@ -27,6 +27,7 @@ from tensorflow.contrib.opt.python.training.lazy_adam_optimizer import *
 from tensorflow.contrib.opt.python.training.moving_average_optimizer import *
 from tensorflow.contrib.opt.python.training.multitask_optimizer_wrapper import *
 from tensorflow.contrib.opt.python.training.nadam_optimizer import *
+from tensorflow.contrib.opt.python.training.weight_decay_optimizers import *
 from tensorflow.contrib.opt.python.training.powersign import *
 from tensorflow.contrib.opt.python.training.variable_clipping_optimizer import *
 from tensorflow.contrib.opt.python.training.elastic_average_optimizer import *
@@ -46,6 +47,10 @@ _allowed_symbols = [
     'LazyAdamOptimizer',
     'NadamOptimizer',
     'MovingAverageOptimizer',
+    'MomentumWOptimizer',
+    'AdamWOptimizer',
+    'DecoupledWeightDecayExtension',
+    'extend_with_decoupled_weight_decay',
     'ScipyOptimizerInterface',
     'VariableClippingOptimizer',
     'MultitaskOptimizerWrapper',
diff --git a/tensorflow/contrib/opt/python/training/adamax_test.py b/tensorflow/contrib/opt/python/training/adamax_test.py
index bc92a7006f1a0a56adafc486a75afa94e965cb2c..915e6504e1e59ff247a2715820bc31a4d4cc1944 100644
--- a/tensorflow/contrib/opt/python/training/adamax_test.py
+++ b/tensorflow/contrib/opt/python/training/adamax_test.py
@@ -198,11 +198,11 @@ class AdaMaxOptimizerTest(test.TestCase):
         self.assertTrue(beta1_power is not None)
         self.assertIn(beta1_power, opt_variables)
 
-        with ops.Graph().as_default():
-          # Shouldn't return non-slot variables from other graphs.
-          self.assertEqual(0, len(opt.variables()))
-
         if not context.executing_eagerly():
+          with ops.Graph().as_default():
+            # Shouldn't return non-slot variables from other graphs.
+            self.assertEqual(0, len(opt.variables()))
+
           self.evaluate(variables.global_variables_initializer())
           # Fetch params to validate initial values
           self.assertAllClose([1.0, 2.0], self.evaluate(var0))
@@ -224,8 +224,10 @@ class AdaMaxOptimizerTest(test.TestCase):
           var1_np, m1, v1 = adamax_update_numpy(var1_np, grads1_np, t, m1, v1)
 
           # Validate updated params
-          self.assertAllCloseAccordingToType(var0_np, self.evaluate(var0))
-          self.assertAllCloseAccordingToType(var1_np, self.evaluate(var1))
+          self.assertAllCloseAccordingToType(var0_np, self.evaluate(var0),
+                                             rtol=1e-2)
+          self.assertAllCloseAccordingToType(var1_np, self.evaluate(var1),
+                                             rtol=1e-2)
           if use_resource:
             self.assertEqual("var0_%d/AdaMax:0" % (i,),
                              opt.get_slot(var=var0, name="m").name)
diff --git a/tensorflow/contrib/opt/python/training/weight_decay_optimizers.py b/tensorflow/contrib/opt/python/training/weight_decay_optimizers.py
new file mode 100644
index 0000000000000000000000000000000000000000..8aa40aeb45d4ec15140bdfc5ebd824e8aa08d8d9
--- /dev/null
+++ b/tensorflow/contrib/opt/python/training/weight_decay_optimizers.py
@@ -0,0 +1,326 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+
+"""Base class to make optimizers weight decay ready."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.python.framework import ops
+from tensorflow.python.training import optimizer
+from tensorflow.python.ops import control_flow_ops
+from tensorflow.python.training import adam
+from tensorflow.python.training import momentum as momentum_opt
+from tensorflow.python.util.tf_export import tf_export
+from tensorflow.python.ops import state_ops
+from tensorflow.python.ops import resource_variable_ops
+
+
+class DecoupledWeightDecayExtension(object):
+  """This class allows to extend optimizers with decoupled weight decay.
+
+  It implements the decoupled weight decay described by Loshchilov & Hutter
+  (https://arxiv.org/pdf/1711.05101.pdf), in which the weight decay is
+  decoupled from the optimization steps w.r.t. to the loss function.
+  For SGD variants, this simplifies hyperparameter search since it decouples
+  the settings of weight decay and learning rate.
+  For adaptive gradient algorithms, it regularizes variables with large
+  gradients more than L2 regularization would, which was shown to yield better
+  training loss and generalization error in the paper above.
+
+  This class alone is not an optimizer but rather extends existing
+  optimizers with decoupled weight decay. We explicitly define the two examples
+  used in the above paper (SGDW and AdamW), but in general this can extend
+  any OptimizerX by using
+  `extend_with_weight_decay(OptimizerX, weight_decay=weight_decay)`.
+  In order for it to work, it must be the first class the Optimizer with
+  weight decay inherits from, e.g.
+
+  ```python
+  class AdamWOptimizer(DecoupledWeightDecayExtension, adam.AdamOptimizer):
+    def __init__(self, weight_decay, *args, **kwargs):
+      super(AdamWOptimizer, self).__init__(weight_decay, *args, **kwargs).
+  ```
+
+  Note that this extension decays weights BEFORE applying the update based
+  on the gradient, i.e. this extension only has the desired behaviour for
+  optimizers which do not depend on the value of'var' in the update step!
+  """
+
+  def __init__(self, weight_decay, **kwargs):
+    """Construct the extension class that adds weight decay to an optimizer.
+
+    Args:
+      weight_decay: A `Tensor` or a floating point value, the factor by which
+        a variable is decayed in the update step.
+      decay_var_list: Optional list or tuple or set of `Variable` objects to
+        decay.
+    """
+    self._decay_var_list = None  # is set in minimize or apply_gradients
+    self._weight_decay = weight_decay
+    # The tensors are initialized in call to _prepare
+    self._weight_decay_tensor = None
+    super(DecoupledWeightDecayExtension, self).__init__(**kwargs)
+
+  def minimize(self, loss, global_step=None, var_list=None,
+               gate_gradients=optimizer.Optimizer.GATE_OP,
+               aggregation_method=None, colocate_gradients_with_ops=False,
+               name=None, grad_loss=None, decay_var_list=None):
+    """Add operations to minimize `loss` by updating `var_list` with decay.
+
+    This function is the same as Optimizer.minimize except that it allows to
+    specify the variables that should be decayed using decay_var_list.
+    If decay_var_list is None, all variables in var_list are decayed.
+
+    For more information see the documentation of Optimizer.minimize.
+    """
+    self._decay_var_list = set(decay_var_list) if decay_var_list else False
+    return super(DecoupledWeightDecayExtension, self).minimize(
+        loss, global_step=global_step, var_list=var_list,
+        gate_gradients=gate_gradients, aggregation_method=aggregation_method,
+        colocate_gradients_with_ops=colocate_gradients_with_ops, name=name,
+        grad_loss=grad_loss)
+
+  def apply_gradients(self, grads_and_vars, global_step=None, name=None,
+                      decay_var_list=None):
+    """Apply gradients to variables and decay the variables.
+
+    This function is the same as Optimizer.apply_gradients except that it
+    allows to specify the variables that should be decayed using
+    decay_var_list. If decay_var_list is None, all variables in var_list
+    are decayed.
+
+    For more information see the documentation of Optimizer.apply_gradients.
+    """
+    self._decay_var_list = set(decay_var_list) if decay_var_list else False
+    return super(DecoupledWeightDecayExtension, self).apply_gradients(
+        grads_and_vars, global_step=global_step, name=name)
+
+  def _prepare(self):
+    weight_decay = self._weight_decay
+    if callable(weight_decay):
+      weight_decay = weight_decay()
+    self._weight_decay_tensor = ops.convert_to_tensor(
+        weight_decay, name="weight_decay")
+    # Call the optimizers _prepare function.
+    super(DecoupledWeightDecayExtension, self)._prepare()
+
+  def _decay_weights_op(self, var):
+    if not self._decay_var_list or var in self._decay_var_list:
+      return var.assign_sub(self._weight_decay * var, self._use_locking)
+    return control_flow_ops.no_op()
+
+  def _decay_weights_sparse_op(self, var, indices, scatter_add):
+    if not self._decay_var_list or var in self._decay_var_list:
+      return scatter_add(var, indices, -self._weight_decay * var,
+                         self._use_locking)
+    return control_flow_ops.no_op()
+
+  # Here, we overwrite the apply functions that the base optimizer calls.
+  # super().apply_x resolves to the apply_x function of the BaseOptimizer.
+  def _apply_dense(self, grad, var):
+    with ops.control_dependencies([self._decay_weights_op(var)]):
+      return super(DecoupledWeightDecayExtension, self)._apply_dense(grad, var)
+
+  def _resource_apply_dense(self, grad, var):
+    with ops.control_dependencies([self._decay_weights_op(var)]):
+      return super(DecoupledWeightDecayExtension, self)._resource_apply_dense(
+          grad, var)
+
+  def _apply_sparse(self, grad, var):
+    scatter_add = state_ops.scatter_add
+    decay_op = self._decay_weights_sparse_op(var, grad.indices, scatter_add)
+    with ops.control_dependencies([decay_op]):
+      return super(DecoupledWeightDecayExtension, self)._apply_sparse(
+          grad, var)
+
+  def _resource_scatter_add(self, x, i, v, _=None):
+    # last argument allows for one overflow argument, to have the same function
+    # signature as state_ops.scatter_add
+    with ops.control_dependencies(
+        [resource_variable_ops.resource_scatter_add(x.handle, i, v)]):
+      return x.value()
+
+  def _resource_apply_sparse(self, grad, var, indices):
+    scatter_add = self._resource_scatter_add
+    decay_op = self._decay_weights_sparse_op(var, indices, scatter_add)
+    with ops.control_dependencies([decay_op]):
+      return super(DecoupledWeightDecayExtension, self)._resource_apply_sparse(
+          grad, var, indices)
+
+
+def extend_with_decoupled_weight_decay(base_optimizer):
+  """Factory function returning an optimizer class with decoupled weight decay.
+
+  Returns an optimizer class. An instance of the returned class computes the
+  update step of `base_optimizer` and additionally decays the weights.
+  E.g., the class returned by
+  `extend_with_decoupled_weight_decay(tf.train.AdamOptimizer)` is equivalent to
+  `tf.contrib.opt.AdamWOptimizer`.
+
+  The API of the new optimizer class slightly differs from the API of the
+  base optimizer:
+  - The first argument to the constructor is the weight decay rate.
+  - `minimize` and `apply_gradients` accept the optional keyword argument
+    `decay_var_list`, which specifies the variables that should be decayed.
+    If `None`, all variables that are optimized are decayed.
+
+  Usage example:
+  ```python
+  # MyAdamW is a new class
+  MyAdamW = extend_with_decoupled_weight_decay(tf.train.AdamOptimizer)
+  # Create a MyAdamW object
+  optimizer = MyAdamW(weight_decay=0.001, learning_rate=0.001)
+  sess.run(optimizer.minimize(loss, decay_variables=[var1, var2]))
+
+  Note that this extension decays weights BEFORE applying the update based
+  on the gradient, i.e. this extension only has the desired behaviour for
+  optimizers which do not depend on the value of'var' in the update step!
+  ```
+
+  Args:
+    base_optimizer: An optimizer class that inherits from tf.train.Optimizer.
+
+  Returns:
+    A new optimizer class that inherits from DecoupledWeightDecayExtension
+    and base_optimizer.
+  """
+  class OptimizerWithDecoupledWeightDecay(DecoupledWeightDecayExtension,
+                                          base_optimizer):
+    """Base_optimizer with decoupled weight decay.
+
+    This class computes the update step of `base_optimizer` and
+    additionally decays the variable with the weight decay being decoupled from
+    the optimization steps w.r.t. to the loss function, as described by
+    Loshchilov & Hutter (https://arxiv.org/pdf/1711.05101.pdf).
+    For SGD variants, this simplifies hyperparameter search since
+    it decouples the settings of weight decay and learning rate.
+    For adaptive gradient algorithms, it regularizes variables with large
+    gradients more than L2 regularization would, which was shown to yield
+    better training loss and generalization error in the paper above.
+    """
+
+    def __init__(self, weight_decay, *args, **kwargs):
+      # super delegation is necessary here
+      # pylint: disable=useless-super-delegation
+      super(OptimizerWithDecoupledWeightDecay, self).__init__(
+          weight_decay, *args, **kwargs)
+      # pylint: enable=useless-super-delegation
+
+  return OptimizerWithDecoupledWeightDecay
+
+
+@tf_export("contrib.opt.MomentumWOptimizer")
+class MomentumWOptimizer(DecoupledWeightDecayExtension,
+                         momentum_opt.MomentumOptimizer):
+  """Optimizer that implements the Momentum algorithm with weight_decay.
+
+  This is an implementation of the SGDW optimizer described in "Fixing
+  Weight Decay Regularization in Adam" by Loshchilov & Hutter
+  (https://arxiv.org/abs/1711.05101)
+  ([pdf])(https://arxiv.org/pdf/1711.05101.pdf).
+  It computes the update step of `train.MomentumOptimizer` and additionally
+  decays the variable. Note that this is different from adding
+  L2 regularization on the variables to the loss. Decoupling the weight decay
+  from other hyperparameters (in particular the learning rate) simplifies
+  hyperparameter search.
+
+  For further information see the documentation of the Momentum Optimizer.
+
+  Note that this optimizer can also be instantiated as
+  ```python
+  extend_with_weight_decay(tf.train.MomentumOptimizer,
+                           weight_decay=weight_decay)
+  ```
+  """
+
+  def __init__(self, weight_decay, learning_rate, momentum,
+               use_locking=False, name="MomentumW", use_nesterov=False):
+    """Construct a new MomentumW optimizer.
+
+    For further information see the documentation of the Momentum Optimizer.
+
+    Args:
+      weight_decay:  A `Tensor` or a floating point value.  The weight decay.
+      learning_rate: A `Tensor` or a floating point value.  The learning rate.
+      momentum: A `Tensor` or a floating point value.  The momentum.
+      use_locking: If `True` use locks for update operations.
+      name: Optional name prefix for the operations created when applying
+        gradients.  Defaults to "Momentum".
+      use_nesterov: If `True` use Nesterov Momentum.
+        See [Sutskever et al., 2013](
+        http://jmlr.org/proceedings/papers/v28/sutskever13.pdf).
+        This implementation always computes gradients at the value of the
+        variable(s) passed to the optimizer. Using Nesterov Momentum makes the
+        variable(s) track the values called `theta_t + mu*v_t` in the paper.
+
+    @compatibility(eager)
+    When eager execution is enabled, learning_rate, weight_decay and momentum
+    can each be a callable that takes no arguments and returns the actual value
+    to use. This can be useful for changing these values across different
+    invocations of optimizer functions.
+    @end_compatibility
+    """
+    super(MomentumWOptimizer, self).__init__(
+        weight_decay, learning_rate=learning_rate, momentum=momentum,
+        use_locking=use_locking, name=name, use_nesterov=use_nesterov)
+
+
+@tf_export("contrib.opt.AdamWOptimizer")
+class AdamWOptimizer(DecoupledWeightDecayExtension, adam.AdamOptimizer):
+  """Optimizer that implements the Adam algorithm with weight decay.
+
+  This is an implementation of the AdamW optimizer described in "Fixing
+  Weight Decay Regularization in Adam" by Loshchilov & Hutter
+  (https://arxiv.org/abs/1711.05101)
+  ([pdf])(https://arxiv.org/pdf/1711.05101.pdf).
+
+  It computes the update step of `train.AdamOptimizer` and additionally decays
+  the variable. Note that this is different from adding L2 regularization on
+  the variables to the loss: it regularizes variables with large
+  gradients more than L2 regularization would, which was shown to yield better
+  training loss and generalization error in the paper above.
+
+  For further information see the documentation of the Adam Optimizer.
+
+  Note that this optimizer can also be instantiated as
+  ```python
+  extend_with_weight_decay(tf.train.AdamOptimizer, weight_decay=weight_decay)
+  ```
+  """
+
+  def __init__(self, weight_decay, learning_rate=0.001, beta1=0.9, beta2=0.999,
+               epsilon=1e-8, use_locking=False, name="AdamW"):
+    """Construct a new AdamW optimizer.
+
+    For further information see the documentation of the Adam Optimizer.
+
+    Args:
+      weight_decay:  A `Tensor` or a floating point value.  The weight decay.
+      learning_rate: A Tensor or a floating point value.  The learning rate.
+      beta1: A float value or a constant float tensor.
+        The exponential decay rate for the 1st moment estimates.
+      beta2: A float value or a constant float tensor.
+        The exponential decay rate for the 2nd moment estimates.
+      epsilon: A small constant for numerical stability. This epsilon is
+        "epsilon hat" in the Kingma and Ba paper (in the formula just before
+        Section 2.1), not the epsilon in Algorithm 1 of the paper.
+      use_locking: If True use locks for update operations.
+      name: Optional name for the operations created when applying gradients.
+        Defaults to "Adam".
+    """
+    super(AdamWOptimizer, self).__init__(
+        weight_decay, learning_rate=learning_rate, beta1=beta1, beta2=beta2,
+        epsilon=epsilon, use_locking=use_locking, name=name)
diff --git a/tensorflow/contrib/opt/python/training/weight_decay_optimizers_test.py b/tensorflow/contrib/opt/python/training/weight_decay_optimizers_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..74d1cdbbdac8724518937d141a976abf9fec6ce3
--- /dev/null
+++ b/tensorflow/contrib/opt/python/training/weight_decay_optimizers_test.py
@@ -0,0 +1,190 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for optimizers with weight decay."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.python.eager import context
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
+from tensorflow.python.framework import test_util
+from tensorflow.python.ops import resource_variable_ops
+from tensorflow.python.ops import variables
+from tensorflow.python.platform import test
+from tensorflow.python.training import adam
+from tensorflow.contrib.opt.python.training import weight_decay_optimizers
+
+WEIGHT_DECAY = 0.01
+
+
+def adamw_update_numpy(param, g_t, t, m, v, lr=0.001, beta1=0.9,
+                       beta2=0.999, epsilon=1e-8):
+  lr_t = lr * np.sqrt(1 - beta2**t) / (1 - beta1**t)
+
+  m_t = beta1 * m + (1 - beta1) * g_t
+  v_t = beta2 * v + (1 - beta2) * g_t * g_t
+
+  param_t = (param - lr_t * m_t / (np.sqrt(v_t) + epsilon) -
+             (param * WEIGHT_DECAY))
+  return param_t, m_t, v_t
+
+
+def momentumw_update_numpy(param, g_t, m, lr=0.001, momentum=0.9, **_):
+  # v, t are not needed for momentum optimizer
+  m = momentum * m + g_t
+  param_t = param - lr * m - param * WEIGHT_DECAY
+  return param_t, m, None
+
+
+class WeightDecayOptimizerTest(test.TestCase):
+
+  def doTest(self, optimizer, update_fn, optimizer_name, slot_name,
+             use_resource=False, do_sparse=False):
+    for i, dtype in enumerate([dtypes.half, dtypes.float32, dtypes.float64]):
+      with self.test_session(graph=ops.Graph()):
+        # Initialize variables for numpy implementation.
+        m0, v0, m1, v1 = 0.0, 0.0, 0.0, 0.0
+        var0_np = np.array([1.0, 2.0], dtype=dtype.as_numpy_dtype)
+        grads0_np = np.array([0.1, 0.1], dtype=dtype.as_numpy_dtype)
+        var1_np = np.array([3.0, 4.0], dtype=dtype.as_numpy_dtype)
+        grads1_np = np.array([0.01, 0.01], dtype=dtype.as_numpy_dtype)
+
+        if use_resource:
+          var0 = resource_variable_ops.ResourceVariable(
+              var0_np, name="var0_%d" % i)
+          var1 = resource_variable_ops.ResourceVariable(
+              var1_np, name="var1_%d" % i)
+        else:
+          var0 = variables.Variable(var0_np)
+          var1 = variables.Variable(var1_np)
+
+        if do_sparse:
+          grads0_np_indices = np.array([0, 1], dtype=np.int32)
+          grads0 = ops.IndexedSlices(constant_op.constant(grads0_np),
+                                     constant_op.constant(grads0_np_indices),
+                                     constant_op.constant([2]))
+          grads1_np_indices = np.array([0, 1], dtype=np.int32)
+          grads1 = ops.IndexedSlices(constant_op.constant(grads1_np),
+                                     constant_op.constant(grads1_np_indices),
+                                     constant_op.constant([2]))
+        else:
+          grads0 = constant_op.constant(grads0_np)
+          grads1 = constant_op.constant(grads1_np)
+
+        opt = optimizer()
+        update = opt.apply_gradients(zip([grads0, grads1], [var0, var1]))
+
+
+        if not context.executing_eagerly():
+          with ops.Graph().as_default():
+            # Shouldn't return non-slot variables from other graphs.
+            self.assertEqual(0, len(opt.variables()))
+          self.evaluate(variables.global_variables_initializer())
+          # Fetch params to validate initial values
+          self.assertAllClose([1.0, 2.0], self.evaluate(var0))
+          self.assertAllClose([3.0, 4.0], self.evaluate(var1))
+
+        # Run 3 steps of the optimizer
+        for t in range(1, 4):
+          if not context.executing_eagerly():
+            self.evaluate(update)
+          elif t > 1:
+            opt.apply_gradients(zip([grads0, grads1], [var0, var1]))
+
+          var0_np, m0, v0 = update_fn(var0_np, grads0_np, t=t, m=m0, v=v0)
+          var1_np, m1, v1 = update_fn(var1_np, grads1_np, t=t, m=m1, v=v1)
+
+          # Validate updated params
+          self.assertAllCloseAccordingToType(var0_np, self.evaluate(var0))
+          self.assertAllCloseAccordingToType(var1_np, self.evaluate(var1))
+          if use_resource:
+            self.assertEqual("var0_%d/%s:0" % (i, optimizer_name),
+                             opt.get_slot(var=var0, name=slot_name).name)
+
+
+class AdamWOptimizerTest(WeightDecayOptimizerTest):
+
+  @staticmethod
+  def get_optimizer():
+    return weight_decay_optimizers.AdamWOptimizer(WEIGHT_DECAY)
+
+  def testSparse(self):
+    self.doTest(self.get_optimizer, adamw_update_numpy, "AdamW", "m",
+                use_resource=False, do_sparse=True)
+
+  def testResourceSparse(self):
+    self.doTest(self.get_optimizer, adamw_update_numpy, "AdamW", "m",
+                use_resource=True, do_sparse=True)
+
+  def testBasic(self):
+    self.doTest(self.get_optimizer, adamw_update_numpy, "AdamW", "m",
+                use_resource=False)
+
+  @test_util.run_in_graph_and_eager_modes(reset_test=True)
+  def testResourceBasic(self):
+    self.doTest(self.get_optimizer, adamw_update_numpy, "AdamW", "m",
+                use_resource=True)
+
+
+class MomentumWOptimizerTest(WeightDecayOptimizerTest):
+
+  @staticmethod
+  def get_optimizer():
+    return weight_decay_optimizers.MomentumWOptimizer(WEIGHT_DECAY, 0.001, 0.9)
+
+  def testSparse(self):
+    self.doTest(self.get_optimizer, momentumw_update_numpy, "MomentumW",
+                "momentum", use_resource=False, do_sparse=True)
+
+  def testResourceSparse(self):
+    self.doTest(self.get_optimizer, momentumw_update_numpy, "MomentumW",
+                "momentum", use_resource=True, do_sparse=True)
+
+  def testBasic(self):
+    self.doTest(self.get_optimizer, momentumw_update_numpy, "MomentumW",
+                "momentum", use_resource=False)
+
+  @test_util.run_in_graph_and_eager_modes(reset_test=True)
+  def testResourceBasic(self):
+    self.doTest(self.get_optimizer, momentumw_update_numpy, "MomentumW",
+                "momentum", use_resource=True)
+
+
+class ExtendWithWeightDecayTest(WeightDecayOptimizerTest):
+
+  @staticmethod
+  def get_optimizer():
+    AdamW = weight_decay_optimizers.extend_with_decoupled_weight_decay(
+        adam.AdamOptimizer)
+    return AdamW(WEIGHT_DECAY)
+
+  def testBasic(self):
+    self.doTest(self.get_optimizer, adamw_update_numpy, "Adam", "m",
+                use_resource=False)
+
+  @test_util.run_in_graph_and_eager_modes(reset_test=True)
+  def testResourceBasic(self):
+    self.doTest(self.get_optimizer, adamw_update_numpy, "Adam", "m",
+                use_resource=True)
+
+
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/contrib/optimizer_v2/momentum_test.py b/tensorflow/contrib/optimizer_v2/momentum_test.py
index 26724f66c2a1db1d01577b31b739af18f51d3976..24cdab462665adc6297b0e0821455a545c3880af 100644
--- a/tensorflow/contrib/optimizer_v2/momentum_test.py
+++ b/tensorflow/contrib/optimizer_v2/momentum_test.py
@@ -134,7 +134,6 @@ class MomentumOptimizerTest(test.TestCase):
     with context.eager_mode():
       self.doTestBasic(use_resource=True, use_callable_params=True)
 
-  @test_util.run_in_graph_and_eager_modes(reset_test=True)
   def testVariablesAcrossGraphs(self):
     optimizer = momentum_lib.MomentumOptimizer(0.01, 0.5)
     with ops.Graph().as_default():
@@ -142,10 +141,7 @@ class MomentumOptimizerTest(test.TestCase):
           [1.0, 2.0], dtype=dtypes.float32, name="var0")
       var1 = resource_variable_ops.ResourceVariable(
           [3.0, 4.0], dtype=dtypes.float32, name="var1")
-      if context.executing_eagerly():
-        loss = lambda: math_ops.reduce_sum(var0 + var1)
-      else:
-        loss = math_ops.reduce_sum(var0 + var1)
+      loss = math_ops.reduce_sum(var0 + var1)
       optimizer.minimize(loss)
       optimizer_variables = optimizer.variables()
       self.assertStartsWith(optimizer_variables[0].name, "var0")
@@ -157,10 +153,7 @@ class MomentumOptimizerTest(test.TestCase):
           [1.0, 2.0], dtype=dtypes.float32, name="var2")
       var3 = resource_variable_ops.ResourceVariable(
           [3.0, 4.0], dtype=dtypes.float32, name="var3")
-      if context.executing_eagerly():
-        loss = lambda: math_ops.reduce_sum(var2 + var3)
-      else:
-        loss = math_ops.reduce_sum(var2 + var3)
+      loss = math_ops.reduce_sum(var2 + var3)
       optimizer.minimize(loss)
       optimizer_variables = optimizer.variables()
       self.assertStartsWith(optimizer_variables[0].name, "var2")
diff --git a/tensorflow/contrib/periodic_resample/BUILD b/tensorflow/contrib/periodic_resample/BUILD
index 6ca7fe8b6e59b0dc24be76262d4f54f387e53e48..976b312e8345a801ad07f622b6117b88af2cf603 100644
--- a/tensorflow/contrib/periodic_resample/BUILD
+++ b/tensorflow/contrib/periodic_resample/BUILD
@@ -6,12 +6,13 @@ exports_files(["LICENSE"])
 
 load(
     "//tensorflow:tensorflow.bzl",
-    "py_test",
+    "tf_cc_test",
     "tf_gen_op_libs",
     "tf_custom_op_library",
     "tf_custom_op_py_library",
     "tf_gen_op_wrapper_py",
 )
+load("//tensorflow:tensorflow.bzl", "py_test")
 
 cc_library(
     name = "all_ops",
@@ -84,6 +85,20 @@ py_test(
         ":init_py",
         "//tensorflow/contrib/util:util_py",
         "//tensorflow/python:framework_test_lib",
+        "//tensorflow/python:gradient_checker",
+    ],
+)
+
+tf_cc_test(
+    name = "periodic_resample_op_cc_test",
+    size = "small",
+    srcs = [
+        "ops/array_ops_test.cc",
+    ],
+    deps = [
+        ":all_ops",
+        "//tensorflow/core:test_main",
+        "//tensorflow/core:testlib",
     ],
 )
 
diff --git a/tensorflow/contrib/periodic_resample/kernels/periodic_resample_op.cc b/tensorflow/contrib/periodic_resample/kernels/periodic_resample_op.cc
index e18923c8aae74c66ce78f98eb5e615e99463af74..514689cf4543cd08632bd0321a78fa933c456467 100644
--- a/tensorflow/contrib/periodic_resample/kernels/periodic_resample_op.cc
+++ b/tensorflow/contrib/periodic_resample/kernels/periodic_resample_op.cc
@@ -22,4 +22,9 @@ namespace tensorflow {
 REGISTER_KERNEL_BUILDER(Name("PeriodicResample").Device(DEVICE_CPU),
                         PeriodicResampleOp);
 
+
+REGISTER_KERNEL_BUILDER(Name("PeriodicResampleOpGrad")
+                            .Device(DEVICE_CPU),
+                        PeriodicResampleOpGrad);
+
 }  // namespace tensorflow
diff --git a/tensorflow/contrib/periodic_resample/kernels/periodic_resample_op.h b/tensorflow/contrib/periodic_resample/kernels/periodic_resample_op.h
index 3ab588c45881c8f93b4c1bcdf7ccde39086a1ed7..42fba81a5cb9490c093062048f269704a110756a 100644
--- a/tensorflow/contrib/periodic_resample/kernels/periodic_resample_op.h
+++ b/tensorflow/contrib/periodic_resample/kernels/periodic_resample_op.h
@@ -25,92 +25,202 @@
 #include "tensorflow/core/framework/shape_inference.h"
 #include "tensorflow/core/framework/tensor_shape.h"
 #include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/util/work_sharder.h"
 
 namespace {
 
-template <class IndexVecT, class IndexT>
-IndexT compute_input_index(
-    IndexVecT* target_dimensions, const IndexT& output_index,
-    const IndexVecT& original_dimensions, const int& adjustable_dimension,
-    const std::vector<tensorflow::int64>& dimension_ceiling,
-    const std::vector<tensorflow::int64>& cumulative_dimensions, IndexT* result,
-    std::vector<IndexT>* output_indices, const int& rank) {
-  *result = 0;
-  output_indices->clear();
+// Computes input tensor index for given output index during forward
+// propagation through periodic_resample operation.
+class InputIndexer {
+ public:
+  InputIndexer(const std::vector<tensorflow::int64>& output_dimensions,
+               const tensorflow::TensorShape& input_shape,
+               int adjustable_dimension)
+      : output_dimensions_(output_dimensions),
+        adjustable_dimension_(adjustable_dimension),
+        rank_(input_shape.dims()),
+        linear_output_index_(0),
+        linear_input_index_(0),
+        adjustable_dimension_carriage_sum_(0) {
+    auto input_dimensions = TensorShapeToVector(input_shape);
+    // factors by which input_dimensions increases/decreases w.r.t.
+    // output_dimensions
+    dimension_ceiling_ =
+        ComputeDimensionCeiling(output_dimensions, input_dimensions);
+    cumulative_dimensions_ = ComputeCumulativeDimensions();
+
+    output_indices_.resize(output_dimensions_.size());
+    input_indices_.resize(output_dimensions_.size());
+
+    // Compute index_factors
+    index_factors_.resize(rank_);
+    tensorflow::int64 last_index_factor = 1;
+    for (auto r = rank_ - 1; r >= 0; --r) {
+      index_factors_[r] = last_index_factor;
+      last_index_factor *= input_dimensions[r];
+    }
+  }
+
+  tensorflow::int64 linear_input_index() const { return linear_input_index_; }
+
+  void MoveToOutputIndex(tensorflow::int64 output_index);
+  void IncrementOutputIndex();
+
+ private:
+  void RecomputeInputAdjustableDimensionIndex() {
+    tensorflow::int64 index = adjustable_dimension_carriage_sum_;
+    index *= output_dimensions_[adjustable_dimension_];
+    index += output_indices_[adjustable_dimension_];
+    input_indices_[adjustable_dimension_] = index;
+  }
+
+  std::vector<tensorflow::int64> TensorShapeToVector(
+      const tensorflow::TensorShape& tensor_shape);
+
+  std::vector<tensorflow::int64> ComputeDimensionCeiling(
+      const std::vector<tensorflow::int64>& output_dimensions,
+      const std::vector<tensorflow::int64>& input_dimensions);
+
+  std::vector<tensorflow::int64> ComputeCumulativeDimensions();
+
+  const std::vector<tensorflow::int64> output_dimensions_;
+  std::vector<tensorflow::int64> dimension_ceiling_;
+  std::vector<tensorflow::int64> index_factors_;
+  std::vector<tensorflow::int64> cumulative_dimensions_;
+  std::vector<tensorflow::int64> output_indices_;
+  std::vector<tensorflow::int64> input_indices_;
+
+  const int adjustable_dimension_;
+  const int rank_;
+  tensorflow::int64 linear_output_index_;
+  tensorflow::int64 linear_input_index_;
+  tensorflow::int64 adjustable_dimension_carriage_sum_;
+};
+
+void InputIndexer::MoveToOutputIndex(tensorflow::int64 output_index) {
+  linear_output_index_ = output_index;
+  linear_input_index_ = 0;
 
   // un-rasterize the output index
   auto last_reduced_i = output_index;
-  for (auto r = rank - 1; r >= 0; --r) {
-    (*output_indices)[r] = last_reduced_i % (*target_dimensions)[r];
+  for (auto r = rank_ - 1; r >= 0; --r) {
+    output_indices_[r] = last_reduced_i % output_dimensions_[r];
     last_reduced_i =
-        (last_reduced_i - (*output_indices)[r]) / (*target_dimensions)[r];
+        (last_reduced_i - output_indices_[r]) / output_dimensions_[r];
   }
 
+  tensorflow::int64 carriage_sum = 0;
+  for (int qi = 0; qi < rank_; ++qi) {
+    if (qi == adjustable_dimension_) continue;
+    carriage_sum += cumulative_dimensions_[qi] *
+                    (output_indices_[qi] % dimension_ceiling_[qi]);
+  }
+  adjustable_dimension_carriage_sum_ = carriage_sum;
+
   // rasterize the input index
-  IndexT last_index_factor = 1;
-  for (auto r = rank - 1; r >= 0; --r) {
-    IndexT index = 0;
-    if (r != adjustable_dimension)
-      index = (*output_indices)[r] / dimension_ceiling[r];
-    else {
-      for (int qi = 0; qi < rank; ++qi) {
-        if (qi == adjustable_dimension) continue;
-        index += cumulative_dimensions[qi] *
-                 ((*output_indices)[qi] % dimension_ceiling[qi]);
-      }
-      index *= (*target_dimensions)[adjustable_dimension];
-      index += (*output_indices)[r];
+  for (auto r = rank_ - 1; r >= 0; --r) {
+    if (r != adjustable_dimension_) {
+      input_indices_[r] = output_indices_[r] / dimension_ceiling_[r];
+    } else {
+      RecomputeInputAdjustableDimensionIndex();
     }
-    *result += last_index_factor * index;
-    last_index_factor *= original_dimensions[r];
   }
+  for (auto r = rank_ - 1; r >= 0; --r) {
+    linear_input_index_ += index_factors_[r] * input_indices_[r];
+  }
+}
+
+void InputIndexer::IncrementOutputIndex() {
+  linear_output_index_++;
+  for (auto r = rank_ - 1; r >= 0; --r) {
+    auto old_carriage_sum_increment =
+        cumulative_dimensions_[r] *
+        (output_indices_[r] % dimension_ceiling_[r]);
+    output_indices_[r] = (output_indices_[r] + 1) % output_dimensions_[r];
+    if (r != adjustable_dimension_) {
+      auto new_input_index = output_indices_[r] / dimension_ceiling_[r];
+      linear_input_index_ +=
+          (new_input_index - input_indices_[r]) * index_factors_[r];
+
+      input_indices_[r] = new_input_index;
+
+      auto new_carriage_sum_increment =
+          cumulative_dimensions_[r] *
+          (output_indices_[r] % dimension_ceiling_[r]);
 
-  return *result;
+      adjustable_dimension_carriage_sum_ = adjustable_dimension_carriage_sum_ -
+                                           old_carriage_sum_increment +
+                                           new_carriage_sum_increment;
+    }
+
+    if (output_indices_[r] != 0) {
+      // No more carries to higher indices.
+      break;
+    }
+  }
+  auto old_adjustable_dimension_input_index =
+      input_indices_[adjustable_dimension_];
+  RecomputeInputAdjustableDimensionIndex();
+  linear_input_index_ += (input_indices_[adjustable_dimension_] -
+                           old_adjustable_dimension_input_index) *
+                          index_factors_[adjustable_dimension_];
 }
 
-template <class InputDataT,
-          class IndexVecT>  // both types are needed here b/c IndexVecT and
-                            // InputDataT are not related
-                            void
-                            fill_periodic_tensor(
-                                tensorflow::OpKernelContext* context,
-                                const IndexVecT& desired_shape,
-                                const tensorflow::Tensor& input_tensor) {
-  // input is a strided array (last index is fastest, C-ordered)
-  auto input = input_tensor.flat<InputDataT>();
-  const int rank = input_tensor.dims();
-  // original and target dimensions
-  std::vector<tensorflow::int64> original_dimensions(rank),
-      target_dimensions(rank);
-  tensorflow::int64 total_size(input_tensor.NumElements()), new_sliced_size(1);
-  // factors by which original_dimensions increases/decreases w.r.t.
-  // target_dimensions
-  std::vector<tensorflow::int64> dimension_ceiling(rank),
-      cumulative_dimensions(rank);
-  // index of adjustable dimension
-  int adjustable_dimension;
-  tensorflow::TensorShape output_shape;
+std::vector<tensorflow::int64> InputIndexer::TensorShapeToVector(
+    const tensorflow::TensorShape& tensor_shape) {
+  std::vector<tensorflow::int64> result(tensor_shape.dims());
+  int count = 0;
+  for (const auto dim_info : tensor_shape) {
+    result[count] = dim_info.size;
+    ++count;
+  }
+  return result;
+}
 
-  // requires that the rank of the input tensor and length of the desired shape
-  // are equal
-  OP_REQUIRES(context, rank == desired_shape.size(),
-              tensorflow::errors::InvalidArgument(
-                  "periodic_resample expects the rank of the input tensor, ",
-                  rank, ", to be the same as the length of the desired shape, ",
-                  desired_shape.size(), "."));
+std::vector<tensorflow::int64> InputIndexer::ComputeDimensionCeiling(
+    const std::vector<tensorflow::int64>& output_dimensions,
+    const std::vector<tensorflow::int64>& input_dimensions) {
+  std::vector<tensorflow::int64> dimension_ceiling(input_dimensions.size());
+  for (size_t i = 0; i < input_dimensions.size(); ++i) {
+    dimension_ceiling[i] = (output_dimensions[i] + input_dimensions[i] - 1) /
+        input_dimensions[i];
+  }
+  return dimension_ceiling;
+}
 
-  bool found = false;
-  const auto& input_tensor_shape = input_tensor.shape();
+std::vector<tensorflow::int64> InputIndexer::ComputeCumulativeDimensions() {
+  std::vector<tensorflow::int64> cumulative_dimensions(rank_);
+  int count = 0;
+  for (int i = 0; i < rank_; ++i) {
+    if (count == 0) {
+      cumulative_dimensions[count] = 1;
+    } else {
+      cumulative_dimensions[count] =
+          cumulative_dimensions[count - 1] * dimension_ceiling_[count - 1];
+    }
+    ++count;
+  }
+  return cumulative_dimensions;
+}
 
+template <typename IndexVecT>
+void process_desired_shape(tensorflow::OpKernelContext* context,
+                           const tensorflow::TensorShape& input_tensor_shape,
+                           const IndexVecT& desired_shape,
+                           int* adjustable_dimension,
+                           std::vector<tensorflow::int64>* target_dimensions,
+                           tensorflow::int64* output_size) {
+  tensorflow::int64 new_sliced_size = 1;
+  bool found = false;
+  const int rank = input_tensor_shape.dims();
   for (int i = 0; i < rank; ++i) {
-    // if (desired_shape(i) < 1) {
     if (desired_shape[i] < 1) {
       // only one index can be adjustable
       OP_REQUIRES(context, !found,
                   tensorflow::errors::InvalidArgument(
                       "periodic_resample expects only "
                       "one index to be marked as adjustable."));
-      adjustable_dimension = i;
+      *adjustable_dimension = i;
       found = true;
     } else {
       OP_REQUIRES(
@@ -122,9 +232,8 @@ template <class InputDataT,
               i, " input tensor has size ", input_tensor_shape.dim_size(i),
               ", desired shape has size ", desired_shape[i], "."));
 
-      // target_dimensions[i] = desired_shape(i);
-      target_dimensions[i] = desired_shape[i];
-      new_sliced_size *= target_dimensions[i];
+      (*target_dimensions)[i] = desired_shape[i];
+      new_sliced_size *= (*target_dimensions)[i];
     }
   }
   // at least one index needs to be adjustable
@@ -132,26 +241,50 @@ template <class InputDataT,
               tensorflow::errors::InvalidArgument(
                   "periodic_resample expects at least "
                   "one index to be marked as adjustable."));
+  (*target_dimensions)[*adjustable_dimension] =
+      input_tensor_shape.num_elements() / new_sliced_size;
 
-  int count = 0;
-  for (const auto dim_info : input_tensor.shape()) {
-    original_dimensions[count] = dim_info.size;
-    ++count;
-  }
+  *output_size = new_sliced_size * (*target_dimensions)[*adjustable_dimension];
+}
 
-  target_dimensions[adjustable_dimension] = total_size / new_sliced_size;
+// Heuristic number based on measurements on
+// Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz
+const tensorflow::int64 costPerFillIndex = 35;
 
-  count = 0;
-  for (int i = 0; i < input_tensor.shape().dims(); ++i) {
-    dimension_ceiling[count] = tensorflow::int64(std::ceil(
-        float(target_dimensions[count]) / float(original_dimensions[count])));
-    if (count == 0)
-      cumulative_dimensions[count] = 1;
-    else
-      cumulative_dimensions[count] =
-          cumulative_dimensions[count - 1] * dimension_ceiling[count - 1];
-    ++count;
-  }
+enum class Mode {
+  kForward,
+  kGradient
+};
+
+// Computes either periodic_resample operation output or gradients for it,
+// depending on |mode|.
+// |original_shape| is always shape of input to periodic_resample operation.
+// |source_tensor| is either source for periodic_resample (for forward mode)
+//     or gradients tensor.
+// |desired_shape| is always shape, provided by user, to which forward
+//     propagation attempts resample input tensor.
+template <class InputDataT, Mode mode>
+void
+do_periodic_resample_op(tensorflow::OpKernelContext* context,
+                        const tensorflow::TensorShape& original_shape,
+                        const tensorflow::PartialTensorShape& desired_shape,
+                        const tensorflow::Tensor& source_tensor) {
+  const int rank = source_tensor.dims();
+
+  // requires that the rank of the input tensor and length of the desired shape
+  // are equal
+  OP_REQUIRES(context, rank == desired_shape.dims(),
+              tensorflow::errors::InvalidArgument(
+                  "periodic_resample expects the rank of the input tensor, ",
+                  rank, ", to be the same as the length of the desired shape, ",
+                  desired_shape.dims(), "."));
+
+  std::vector<tensorflow::int64> target_dimensions(rank);
+  tensorflow::int64 new_size = 0;
+  // index of adjustable dimension
+  int adjustable_dimension = 0;
+  process_desired_shape(context, original_shape, desired_shape.dim_sizes(),
+                        &adjustable_dimension, &target_dimensions, &new_size);
 
   // ensure that the new dimension is greater than zero
   OP_REQUIRES(context, target_dimensions[adjustable_dimension] > 0,
@@ -160,11 +293,14 @@ template <class InputDataT,
                   "adjustable dimension, ",
                   adjustable_dimension, ", isn't greater than zero, ",
                   target_dimensions[adjustable_dimension], "."));
-  for (int i = 0; i < rank; ++i) {
-    output_shape.AddDim(target_dimensions[i]);
+  tensorflow::TensorShape output_shape;
+  if (mode == Mode::kForward) {
+    for (int i = 0; i < rank; ++i) {
+      output_shape.AddDim(target_dimensions[i]);
+    }
+  } else {
+    output_shape = original_shape;
   }
-  const auto new_size =
-      new_sliced_size * target_dimensions[adjustable_dimension];
 
   // Create an output tensor and attach it to the current context
   tensorflow::Tensor* output_tensor = nullptr;
@@ -172,47 +308,73 @@ template <class InputDataT,
                  context->allocate_output(0, output_shape, &output_tensor));
   auto output = output_tensor->flat<InputDataT>();
 
-  // memory is allocated for these variables outside the inner loop for
-  // efficiency (although, I could create a separate class scope for
-  // this purpose instead)
-  tensorflow::int64 result = 0;
-  std::vector<tensorflow::int64> output_indices(target_dimensions.size());
+  // input is a strided array (last index is fastest, C-ordered)
+  auto input = source_tensor.flat<InputDataT>();
 
   // Fill output tensor with periodically resampled input tensor values
-  for (tensorflow::int64 output_index = 0; output_index < new_size;
-       ++output_index) {
-    output(output_index) = input(compute_input_index(
-        &target_dimensions, output_index, original_dimensions,
-        adjustable_dimension, dimension_ceiling, cumulative_dimensions, &result,
-        &output_indices, rank));
-  }
+  InputIndexer input_indexer(target_dimensions, original_shape,
+                             adjustable_dimension);
+
+  auto worker_threads = *(context->device()->tensorflow_cpu_worker_threads());
+  auto fill_output_tensor = [&input_indexer, &output, &input](
+      tensorflow::int64 start, tensorflow::int64 limit) {
+    InputIndexer local_indexer(input_indexer);
+    local_indexer.MoveToOutputIndex(start);
+    for (tensorflow::int64 output_index = start; output_index < limit;
+         ++output_index) {
+      if (mode == Mode::kForward) {
+        output(output_index) = input(local_indexer.linear_input_index());
+      } else {
+        output(local_indexer.linear_input_index()) = input(output_index);
+      }
+      local_indexer.IncrementOutputIndex();
+    }
+  };
+  ::tensorflow::Shard(worker_threads.num_threads, worker_threads.workers,
+                      new_size, costPerFillIndex, fill_output_tensor);
 }
 
+#define DATA_TYPE_SWITCH(data_type, context, CASE)                            \
+  switch (data_type) {                                                        \
+    CASE(float)                                                               \
+    CASE(double)                                                              \
+    CASE(tensorflow::int32)                                                   \
+    CASE(tensorflow::int64)                                                   \
+    default:                                                                  \
+      context->CtxFailure(__FILE__, __LINE__,                                 \
+          tensorflow::errors::InvalidArgument(                                \
+              "Unsuppored tensor elements type"));                            \
+      break;                                                                  \
+  }
+
 void create_output_tensor(
     tensorflow::OpKernelContext* context,
     const tensorflow::Tensor& input_tensor,
     const tensorflow::DataType& input_tensor_type,
-    const tensorflow::PartialTensorShape& desired_shape_tensor) {
-  auto desired_shape = desired_shape_tensor.dim_sizes();
-
-  // obligatory type switch
-  switch (input_tensor_type) {
-    case tensorflow::DataTypeToEnum<float>::value:
-      fill_periodic_tensor<float>(context, desired_shape, input_tensor);
+    const tensorflow::PartialTensorShape& desired_shape) {
+#define CASE(type)                                                            \
+    case tensorflow::DataTypeToEnum<type>::value:                             \
+      do_periodic_resample_op<type, Mode::kForward>(                          \
+          context, input_tensor.shape(), desired_shape, input_tensor);        \
       break;
-    case tensorflow::DataTypeToEnum<double>::value:
-      fill_periodic_tensor<double>(context, desired_shape, input_tensor);
-      break;
-    case tensorflow::DataTypeToEnum<tensorflow::int32>::value:
-      fill_periodic_tensor<tensorflow::int32>(context, desired_shape,
-                                              input_tensor);
-      break;
-    case tensorflow::DataTypeToEnum<tensorflow::int64>::value:
-      fill_periodic_tensor<tensorflow::int64>(context, desired_shape,
-                                              input_tensor);
+
+  DATA_TYPE_SWITCH(input_tensor_type, context, CASE);
+#undef CASE
+}
+
+void create_grad_tensor(tensorflow::OpKernelContext* context,
+                        const tensorflow::Tensor& grad_tensor,
+                        const tensorflow::DataType& grad_tensor_type,
+                        const tensorflow::TensorShape& original_shape,
+                        const tensorflow::PartialTensorShape& desired_shape) {
+#define CASE(type)                                                            \
+    case tensorflow::DataTypeToEnum<type>::value:                             \
+      do_periodic_resample_op<type, Mode::kGradient>(                         \
+          context, original_shape, desired_shape, grad_tensor);               \
       break;
-    default:;
-  }
+
+  DATA_TYPE_SWITCH(grad_tensor_type, context, CASE);
+#undef CASE
 }
 
 }  // namespace
@@ -238,4 +400,25 @@ class PeriodicResampleOp : public tensorflow::OpKernel {
   tensorflow::PartialTensorShape desired_shape;
 };
 
+class PeriodicResampleOpGrad : public tensorflow::OpKernel {
+ public:
+  explicit PeriodicResampleOpGrad(tensorflow::OpKernelConstruction* context)
+      : tensorflow::OpKernel(context) {
+    OP_REQUIRES_OK(context,
+                   context->GetAttr("original_shape", &original_shape));
+    OP_REQUIRES_OK(context, context->GetAttr("desired_shape", &desired_shape));
+  }
+
+  void Compute(tensorflow::OpKernelContext* context) override {
+    const tensorflow::Tensor& grad_tensor = context->input(0);
+    const tensorflow::DataType grad_tensor_type = context->input_dtype(0);
+    create_grad_tensor(context, grad_tensor, grad_tensor_type, original_shape,
+                       desired_shape);
+  }
+
+ private:
+  tensorflow::TensorShape original_shape;
+  tensorflow::PartialTensorShape desired_shape;
+};
+
 #endif  // TENSORFLOW_KERNELS_PERIODICRESAMPLE_OP_H_
diff --git a/tensorflow/contrib/periodic_resample/ops/array_ops.cc b/tensorflow/contrib/periodic_resample/ops/array_ops.cc
index 82bd79695646e3673c2c78ad99dd2bd200fc2fbf..fd38cd09b4d0939d7955f7839763a8e955b71fa5 100644
--- a/tensorflow/contrib/periodic_resample/ops/array_ops.cc
+++ b/tensorflow/contrib/periodic_resample/ops/array_ops.cc
@@ -26,7 +26,42 @@ REGISTER_OP("PeriodicResample")
     .Input("values: T")
     .Attr("shape: shape")
     .Output("output: T")
-    .SetShapeFn(shape_inference::ExplicitShape)
+    .SetShapeFn([](shape_inference::InferenceContext* c) {
+      tensorflow::PartialTensorShape desired_shape;
+      TF_RETURN_IF_ERROR(c->GetAttr("shape", &desired_shape));
+      shape_inference::ShapeHandle input_tensor_shape = c->input(0);
+      shape_inference::DimensionHandle num_input_elements =
+          c->NumElements(input_tensor_shape);
+      shape_inference::ShapeHandle result_shape_handle;
+      if (!shape_inference::InferenceContext::ValueKnown(num_input_elements)) {
+        TF_RETURN_IF_ERROR(c->MakeShapeFromPartialTensorShape(
+            desired_shape, &result_shape_handle));
+      } else {
+        const int rank = c->Rank(input_tensor_shape);
+        std::vector<tensorflow::int64> target_dimensions(rank);
+        tensorflow::int64 new_sliced_size = 1;
+        int adjustable_dimension = 0;
+        for (int i = 0; i < rank; ++i) {
+          if (desired_shape.dim_size(i) < 1) {
+            adjustable_dimension = i;
+          } else {
+            target_dimensions[i] = desired_shape.dim_size(i);
+            new_sliced_size *= target_dimensions[i];
+          }
+        }
+        target_dimensions[adjustable_dimension] =
+            shape_inference::InferenceContext::Value(
+                num_input_elements) / new_sliced_size;
+        tensorflow::TensorShape result_shape;
+        for (int i = 0; i < rank; ++i) {
+          result_shape.AddDim(target_dimensions[i]);
+        }
+        TF_RETURN_IF_ERROR(c->MakeShapeFromTensorShape(
+            result_shape, &result_shape_handle));
+      }
+      c->set_output(0, result_shape_handle);
+      return Status::OK();
+    })
     .Doc(R"doc(
 Periodically resample elements of a tensor to conform to `shape`.
 
@@ -101,4 +136,20 @@ output: Periodically resampled tensor that has dimensions specified as in
 
 )doc");
 
+
+REGISTER_OP("PeriodicResampleOpGrad")
+    .Attr("T: numbertype")
+    .Input("grad: T")
+    .Attr("original_shape: shape")
+    .Attr("desired_shape: shape")
+    .Output("grad_values: T")
+    .SetShapeFn([](shape_inference::InferenceContext* c) {
+      tensorflow::TensorShape original_shape;
+      TF_RETURN_IF_ERROR(c->GetAttr("original_shape", &original_shape));
+      shape_inference::ShapeHandle s;
+      TF_RETURN_IF_ERROR(c->MakeShapeFromTensorShape(original_shape, &s));
+      c->set_output(0, s);
+      return Status::OK();
+});
+
 }  // namespace tensorflow
diff --git a/tensorflow/contrib/periodic_resample/ops/array_ops_test.cc b/tensorflow/contrib/periodic_resample/ops/array_ops_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..55edf76fcd3eed461e1465b569e1c2e9e2facbc0
--- /dev/null
+++ b/tensorflow/contrib/periodic_resample/ops/array_ops_test.cc
@@ -0,0 +1,40 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/framework/node_def_builder.h"
+#include "tensorflow/core/framework/shape_inference_testutil.h"
+#include "tensorflow/core/framework/tensor_testutil.h"
+#include "tensorflow/core/lib/core/status_test_util.h"
+#include "tensorflow/core/platform/test.h"
+
+namespace tensorflow {
+
+TEST(ArrayOpsTest, PeriodicResample_ShapeFn) {
+  ShapeInferenceTestOp op("PeriodicResample");
+  // Case 1: output shape can be fully inferreed.
+  PartialTensorShape shape({4, 4, -1});
+  TensorShapeProto shape_proto;
+  shape.AsProto(&shape_proto);
+
+  TF_ASSERT_OK(NodeDefBuilder("test", "PeriodicResample")
+                   .Input({"values", 0, DT_INT32})
+                   .Attr("shape", shape_proto)
+                   .Finalize(&op.node_def));
+  INFER_OK(op, "[2,2,4]", "[4,4,1]");
+  // Case 2: output shape can not be inferred - report desired shape.
+  INFER_OK(op, "[2,2,?]", "[4,4,?]");
+}
+
+}  // end namespace tensorflow
diff --git a/tensorflow/contrib/periodic_resample/python/kernel_tests/periodic_resample_op_test.py b/tensorflow/contrib/periodic_resample/python/kernel_tests/periodic_resample_op_test.py
index a25de55e18b223db2b724aafb54b18d8f48a5baa..31a6fe1d94b8a972087e00cf7c676105b0f1129b 100644
--- a/tensorflow/contrib/periodic_resample/python/kernel_tests/periodic_resample_op_test.py
+++ b/tensorflow/contrib/periodic_resample/python/kernel_tests/periodic_resample_op_test.py
@@ -21,8 +21,11 @@ from __future__ import print_function
 import numpy
 
 from tensorflow.contrib.periodic_resample import periodic_resample
+from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import errors_impl
 from tensorflow.python.framework import test_util
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import gradient_checker
 from tensorflow.python.ops import variables
 from tensorflow.python.platform import googletest
 
@@ -93,7 +96,6 @@ class PeriodicResampleTest(test_util.TensorFlowTestCase):
   def testPeriodicResampleErrors(self):
     input_tensor = numpy.zeros(shape=[1, 2, 2, 4])
     with self.test_session():
-      variables.global_variables_initializer().run()
       with self.assertRaisesWithPredicateMatch(
           errors_impl.InvalidArgumentError,
           'Dimension 3 input tensor has size 4, desired shape has size 1'):
@@ -103,6 +105,29 @@ class PeriodicResampleTest(test_util.TensorFlowTestCase):
           '4, to be the same as the length of the desired shape, 3'):
         periodic_resample(input_tensor, [None, 4, 4]).eval()
 
+  def testPeriodicResampleGradient(self):
+    desired_shape = numpy.array([4, 4, None])
+    result_shape = (4, 4, 1)
+    input_shape = (2, 2, 4)
+    with self.test_session() as sess:
+      x = array_ops.placeholder(dtypes.float32, shape=input_shape)
+      output = periodic_resample(x, desired_shape)
+      error = gradient_checker.compute_gradient_error(
+          x, input_shape, output, result_shape)
+      self.assertLess(error, 1e-4)
+
+  def testPeriodicResampleShapeInference(self):
+    with self.test_session() as sess:
+      # Case 1: output shape can be fully inferreed.
+      x = array_ops.placeholder(dtypes.float32, shape=(2, 2, 4))
+      output = periodic_resample(x, [4, 4, None])
+      self.assertEqual(output.shape, [4, 4, 1])
+      # Case 2: output shape can not be inferred - report desired shape.
+      x = array_ops.placeholder(dtypes.float32, shape=(2, 2, None))
+      output = periodic_resample(x, [4, 4, None])
+      self.assertTrue(output.shape.is_compatible_with([4, 4, None]))
+      self.assertEqual(output.shape[2].value, None)
+
 
 if __name__ == '__main__':
   googletest.main()
diff --git a/tensorflow/contrib/periodic_resample/python/ops/periodic_resample_op.py b/tensorflow/contrib/periodic_resample/python/ops/periodic_resample_op.py
index 348623d8f8d0c2ed60f559eca281343722038100..470e300ccbe7108fd49718341f4a522683366fe3 100644
--- a/tensorflow/contrib/periodic_resample/python/ops/periodic_resample_op.py
+++ b/tensorflow/contrib/periodic_resample/python/ops/periodic_resample_op.py
@@ -21,11 +21,17 @@ from __future__ import print_function
 # pylint: disable=unused-import
 from tensorflow.contrib.periodic_resample.python.ops import gen_periodic_resample_op
 
-from tensorflow.contrib.periodic_resample.python.ops.gen_periodic_resample_op import periodic_resample
+from tensorflow.contrib.periodic_resample.python.ops.gen_periodic_resample_op import periodic_resample, periodic_resample_op_grad
 
 from tensorflow.contrib.util import loader
+from tensorflow.python.framework import ops
 from tensorflow.python.platform import resource_loader
 # pylint: enable=unused-import
 
 _periodic_resample_op = loader.load_op_library(
     resource_loader.get_path_to_datafile('_periodic_resample_op.so'))
+
+@ops.RegisterGradient("PeriodicResample")
+def _periodic_resample_grad_cc(op, grad):
+  return periodic_resample_op_grad(
+      grad, op.inputs[0].shape, op.get_attr('shape'))
diff --git a/tensorflow/contrib/predictor/contrib_estimator_predictor.py b/tensorflow/contrib/predictor/contrib_estimator_predictor.py
index b7a98c68e2343e9c8bb4b41556dc96bfe4ef444c..af3b2ad1b531b835f484a155efcc57bbe634f2df 100644
--- a/tensorflow/contrib/predictor/contrib_estimator_predictor.py
+++ b/tensorflow/contrib/predictor/contrib_estimator_predictor.py
@@ -34,7 +34,8 @@ class ContribEstimatorPredictor(predictor.Predictor):
                prediction_input_fn,
                input_alternative_key=None,
                output_alternative_key=None,
-               graph=None):
+               graph=None,
+               config=None):
     """Initialize a `ContribEstimatorPredictor`.
 
     Args:
@@ -48,6 +49,7 @@ class ContribEstimatorPredictor(predictor.Predictor):
         multi-headed models.
       graph: Optional. The Tensorflow `graph` in which prediction should be
         done.
+      config: `ConfigProto` proto used to configure the session.
     """
     self._graph = graph or ops.Graph()
     with self._graph.as_default():
@@ -58,6 +60,7 @@ class ContribEstimatorPredictor(predictor.Predictor):
       checkpoint_path = saver.latest_checkpoint(estimator.model_dir)
       self._session = monitored_session.MonitoredSession(
           session_creator=monitored_session.ChiefSessionCreator(
+              config=config,
               checkpoint_filename_with_path=checkpoint_path))
 
     input_alternative_key = (
diff --git a/tensorflow/contrib/predictor/core_estimator_predictor.py b/tensorflow/contrib/predictor/core_estimator_predictor.py
index d78d94c2699b14c80e7decee2181d190a6d91f99..a725072e72df2db64cde5ea31ab16e7c2dc5d2ce 100644
--- a/tensorflow/contrib/predictor/core_estimator_predictor.py
+++ b/tensorflow/contrib/predictor/core_estimator_predictor.py
@@ -51,7 +51,8 @@ class CoreEstimatorPredictor(predictor.Predictor):
                estimator,
                serving_input_receiver_fn,
                output_key=None,
-               graph=None):
+               graph=None,
+               config=None):
     """Initialize a `CoreEstimatorPredictor`.
 
     Args:
@@ -62,6 +63,7 @@ class CoreEstimatorPredictor(predictor.Predictor):
         `None`, then `DEFAULT_SERVING_SIGNATURE_DEF_KEY` is used.
       graph: Optional. The Tensorflow `graph` in which prediction should be
         done.
+      config: `ConfigProto` proto used to configure the session.
     """
     self._graph = graph or ops.Graph()
     with self._graph.as_default():
@@ -71,6 +73,7 @@ class CoreEstimatorPredictor(predictor.Predictor):
       checkpoint_dir = estimator.model_dir
       self._session = monitored_session.MonitoredSession(
           session_creator=monitored_session.ChiefSessionCreator(
+              config=config,
               checkpoint_dir=checkpoint_dir))
 
     feed_tensor_info = signature_def.inputs
diff --git a/tensorflow/contrib/predictor/predictor_factories.py b/tensorflow/contrib/predictor/predictor_factories.py
index 6e77e934fe19851eea9ed0b74eb7aecc76f6237a..f275bc15adfa0a51a48964dff8edddbd45500e45 100644
--- a/tensorflow/contrib/predictor/predictor_factories.py
+++ b/tensorflow/contrib/predictor/predictor_factories.py
@@ -30,7 +30,8 @@ def from_contrib_estimator(estimator,
                            prediction_input_fn,
                            input_alternative_key=None,
                            output_alternative_key=None,
-                           graph=None):
+                           graph=None,
+                           config=None):
   """Constructs a `Predictor` from a `tf.contrib.learn.Estimator`.
 
   Args:
@@ -44,6 +45,7 @@ def from_contrib_estimator(estimator,
       multi-headed models.
     graph: Optional. The Tensorflow `graph` in which prediction should be
       done.
+    config: `ConfigProto` proto used to configure the session.
 
   Returns:
     An initialized `Predictor`.
@@ -62,13 +64,15 @@ def from_contrib_estimator(estimator,
       prediction_input_fn,
       input_alternative_key=input_alternative_key,
       output_alternative_key=output_alternative_key,
-      graph=graph)
+      graph=graph,
+      config=config)
 
 
 def from_estimator(estimator,
                    serving_input_receiver_fn,
                    output_key=None,
-                   graph=None):
+                   graph=None,
+                   config=None):
   """Constructs a `Predictor` from a `tf.python.estimator.Estimator`.
 
   Args:
@@ -79,6 +83,7 @@ def from_estimator(estimator,
       `None`, then `DEFAULT_SERVING_SIGNATURE_DEF_KEY` is used.
     graph: Optional. The Tensorflow `graph` in which prediction should be
       done.
+    config: `ConfigProto` proto used to configure the session.
 
   Returns:
     An initialized `Predictor`.
@@ -93,14 +98,19 @@ def from_estimator(estimator,
                     'tf.contrib.learn.Estimator. You likely want to call '
                     'from_contrib_estimator.')
   return core_estimator_predictor.CoreEstimatorPredictor(
-      estimator, serving_input_receiver_fn, output_key=output_key, graph=graph)
+      estimator,
+      serving_input_receiver_fn,
+      output_key=output_key,
+      graph=graph,
+      config=config)
 
 
 def from_saved_model(export_dir,
                      signature_def_key=None,
                      signature_def=None,
                      tags=None,
-                     graph=None):
+                     graph=None,
+                     config=None):
   """Constructs a `Predictor` from a `SavedModel` on disk.
 
   Args:
@@ -115,6 +125,7 @@ def from_saved_model(export_dir,
       `SignatureDef`. Defaults to `DEFAULT_TAGS`.
     graph: Optional. The Tensorflow `graph` in which prediction should be
       done.
+    config: `ConfigProto` proto used to configure the session.
 
   Returns:
     An initialized `Predictor`.
@@ -128,4 +139,5 @@ def from_saved_model(export_dir,
       signature_def_key=signature_def_key,
       signature_def=signature_def,
       tags=tags,
-      graph=graph)
+      graph=graph,
+      config=config)
diff --git a/tensorflow/contrib/predictor/predictor_factories_test.py b/tensorflow/contrib/predictor/predictor_factories_test.py
index 578d9424b25dd38f1d77a267d1fdf1ff9ff2da88..a2ef1dc3af0986afacf646f0dc04b7ef857a7f93 100644
--- a/tensorflow/contrib/predictor/predictor_factories_test.py
+++ b/tensorflow/contrib/predictor/predictor_factories_test.py
@@ -20,6 +20,7 @@ from __future__ import print_function
 
 from tensorflow.contrib.predictor import predictor_factories
 from tensorflow.contrib.predictor import testing_common
+from tensorflow.core.protobuf import config_pb2
 from tensorflow.python.platform import test
 
 MODEL_DIR_NAME = 'contrib/predictor/test_export_dir'
@@ -41,6 +42,11 @@ class PredictorFactoriesTest(test.TestCase):
     """Test loading from_saved_model with tags."""
     predictor_factories.from_saved_model(self._export_dir, tags='serve')
 
+  def testFromSavedModelWithSessionConfig(self):
+    """Test loading from_saved_model with session config."""
+    predictor_factories.from_saved_model(
+        self._export_dir, config=config_pb2.ConfigProto())
+
   def testFromSavedModelWithBadTags(self):
     """Test that loading fails for bad tags."""
     bad_tags_regex = ('.*? could not be found in SavedModel')
@@ -53,6 +59,13 @@ class PredictorFactoriesTest(test.TestCase):
     predictor_factories.from_contrib_estimator(
         estimator, input_fn, output_alternative_key='sum')
 
+  def testFromContribEstimatorWithSessionConfig(self):
+    estimator = testing_common.get_arithmetic_estimator(core=False)
+    input_fn = testing_common.get_arithmetic_input_fn(core=False)
+    predictor_factories.from_contrib_estimator(
+        estimator, input_fn, output_alternative_key='sum',
+        config=config_pb2.ConfigProto())
+
   def testFromContribEstimatorWithCoreEstimatorRaises(self):
     estimator = testing_common.get_arithmetic_estimator(core=True)
     input_fn = testing_common.get_arithmetic_input_fn(core=True)
@@ -64,6 +77,12 @@ class PredictorFactoriesTest(test.TestCase):
     input_fn = testing_common.get_arithmetic_input_fn(core=True)
     predictor_factories.from_estimator(estimator, input_fn)
 
+  def testFromCoreEstimatorWithSessionConfig(self):
+    estimator = testing_common.get_arithmetic_estimator(core=True)
+    input_fn = testing_common.get_arithmetic_input_fn(core=True)
+    predictor_factories.from_estimator(
+        estimator, input_fn, config=config_pb2.ConfigProto())
+
   def testFromCoreEstimatorWithContribEstimatorRaises(self):
     estimator = testing_common.get_arithmetic_estimator(core=False)
     input_fn = testing_common.get_arithmetic_input_fn(core=False)
diff --git a/tensorflow/contrib/predictor/saved_model_predictor.py b/tensorflow/contrib/predictor/saved_model_predictor.py
index 0dbca0f8136e4e618234101ee41c80bc085511c0..95da6d04edc5214d1b5c1851c4ab05c6d7080b9b 100644
--- a/tensorflow/contrib/predictor/saved_model_predictor.py
+++ b/tensorflow/contrib/predictor/saved_model_predictor.py
@@ -121,7 +121,8 @@ class SavedModelPredictor(predictor.Predictor):
                input_names=None,
                output_names=None,
                tags=None,
-               graph=None):
+               graph=None,
+               config=None):
     """Initialize a `CoreEstimatorPredictor`.
 
     Args:
@@ -142,6 +143,7 @@ class SavedModelPredictor(predictor.Predictor):
         the correct `SignatureDef`. Defaults to `DEFAULT_TAGS`.
       graph: Optional. The Tensorflow `graph` in which prediction should be
         done.
+      config: `ConfigProto` proto used to configure the session.
     Raises:
       ValueError: If more than one of signature_def_key OR signature_def OR
         (input_names AND output_names) is specified.
@@ -152,7 +154,7 @@ class SavedModelPredictor(predictor.Predictor):
     self._graph = graph or ops.Graph()
 
     with self._graph.as_default():
-      self._session = session.Session()
+      self._session = session.Session(config=config)
       loader.load(self._session, tags.split(','), export_dir)
 
     if input_names is None:
diff --git a/tensorflow/contrib/quantize/README.md b/tensorflow/contrib/quantize/README.md
index c83623ec947c1550991352a9dd9a5c6ee9282290..27a933c0f945e53a1838aefd30aed82fadbbc146 100644
--- a/tensorflow/contrib/quantize/README.md
+++ b/tensorflow/contrib/quantize/README.md
@@ -6,7 +6,7 @@ inference. The details of the transformation implemented in this package is
 described here [1].
 
 This is done using the
-[fake quantization op](https://www.tensorflow.org/versions/r0.12/api_docs/python/array_ops/fake_quantization).
+[fake quantization op](https://www.tensorflow.org/api_guides/python/array_ops#Fake_quantization).
 
 Literature has shown that fixed point networks provide comparable performance to
 floating point networks [2]. This is achieved by modeling the quantization
diff --git a/tensorflow/contrib/receptive_field/README.md b/tensorflow/contrib/receptive_field/README.md
index 3ff85faf611afad71b6e6203453bbe97c56f9242..79b015a9163f5727caa40b54579c71e57621c92f 100644
--- a/tensorflow/contrib/receptive_field/README.md
+++ b/tensorflow/contrib/receptive_field/README.md
@@ -6,6 +6,32 @@ region your output features depend on. Better yet, using the parameters computed
 by the library, you can easily find the exact image region which is used to
 compute each convnet feature.
 
+This library can be used to compute receptive field parameters of popular
+convnets:
+
+<center>
+
+convnet model       | receptive field | effective stride | effective padding
+:-----------------: | :-------------: | :--------------: | :---------------:
+alexnet_v2          | 195             | 32               | 64
+vgg_16              | 212             | 32               | 90
+inception_v2        | 699             | 32               | 318
+inception_v3        | 1311            | 32               | 618
+inception_v4        | 2071            | 32               | 998
+inception_resnet_v2 | 3039            | 32               | 1482
+mobilenet_v1        | 315             | 32               | 126
+mobilenet_v1_075    | 315             | 32               | 126
+resnet_v1_50        | 483             | 32               | 241
+resnet_v1_101       | 1027            | 32               | 513
+resnet_v1_152       | 1507            | 32               | 753
+resnet_v1_200       | 1763            | 32               | 881
+
+</center>
+
+A comprehensive table with pre-computed receptive field parameters for different
+end-points, input resolutions, and other variants of these networks can be found
+[here](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/receptive_field/RECEPTIVE_FIELD_TABLE.md).
+
 ## Basic usage
 
 The main function to be called is `compute_receptive_field_from_graph_def`,
@@ -96,9 +122,9 @@ The script will write to stdout the receptive field parameters for many variants
 of several popular convnets: AlexNet, VGG, ResNet, Inception, Mobilenet. They
 are also written to the file `/tmp/rf_benchmark_results.csv`.
 
-TODO: include here a plot for receptive field sizes of different convnets.
-
-TODO: include table/link to pre-computed RF parameters.
+A comprehensive table with pre-computed receptive field parameters for different
+networks can be found
+[here](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/receptive_field/RECEPTIVE_FIELD_TABLE.md).
 
 ## Compute RF parameters from a graph pbtxt
 
diff --git a/tensorflow/contrib/receptive_field/RECEPTIVE_FIELD_TABLE.md b/tensorflow/contrib/receptive_field/RECEPTIVE_FIELD_TABLE.md
new file mode 100644
index 0000000000000000000000000000000000000000..736fbef6e7c66176e74144115f0b1acd6bf6cd2f
--- /dev/null
+++ b/tensorflow/contrib/receptive_field/RECEPTIVE_FIELD_TABLE.md
@@ -0,0 +1,629 @@
+# Pre-computed receptive field parameters
+
+## Table with results
+
+The table below presents the receptive field parameters for several popular
+convolutional neural networks. These are computed using the models from the
+[TF-Slim
+repository](https://github.com/tensorflow/models/tree/master/research/slim),
+by using the [rf_benchmark
+script](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/receptive_field/python/util/examples/rf_benchmark.py).
+
+Questions? See the [FAQ](#faq).
+
+CNN                            | resolution | end-point            | RF   | effective stride | effective padding
+:----------------------------: | :--------: | :------------------: | :--: | :--------------: | :---------------:
+alexnet_v2                     | None       | alexnet_v2/conv1     | 11   | 4                | 0
+alexnet_v2                     | None       | alexnet_v2/pool1     | 19   | 8                | 0
+alexnet_v2                     | None       | alexnet_v2/conv2     | 51   | 8                | 16
+alexnet_v2                     | None       | alexnet_v2/conv3     | 99   | 16               | 32
+alexnet_v2                     | None       | alexnet_v2/conv4     | 131  | 16               | 48
+alexnet_v2                     | None       | alexnet_v2/conv5     | 163  | 16               | 64
+alexnet_v2                     | None       | alexnet_v2/pool5     | 195  | 32               | 64
+alexnet_v2                     | 224        | alexnet_v2/conv1     | 11   | 4                | 0
+alexnet_v2                     | 224        | alexnet_v2/pool1     | 19   | 8                | 0
+alexnet_v2                     | 224        | alexnet_v2/conv2     | 51   | 8                | 16
+alexnet_v2                     | 224        | alexnet_v2/conv3     | 99   | 16               | 32
+alexnet_v2                     | 224        | alexnet_v2/conv4     | 131  | 16               | 48
+alexnet_v2                     | 224        | alexnet_v2/conv5     | 163  | 16               | 64
+alexnet_v2                     | 224        | alexnet_v2/pool5     | 195  | 32               | 64
+alexnet_v2                     | 321        | alexnet_v2/conv1     | 11   | 4                | 0
+alexnet_v2                     | 321        | alexnet_v2/pool1     | 19   | 8                | 0
+alexnet_v2                     | 321        | alexnet_v2/conv2     | 51   | 8                | 16
+alexnet_v2                     | 321        | alexnet_v2/conv3     | 99   | 16               | 32
+alexnet_v2                     | 321        | alexnet_v2/conv4     | 131  | 16               | 48
+alexnet_v2                     | 321        | alexnet_v2/conv5     | 163  | 16               | 64
+alexnet_v2                     | 321        | alexnet_v2/pool5     | 195  | 32               | 64
+vgg_a                          | None       | vgg_a/conv1/conv1_1  | 3    | 1                | 1
+vgg_a                          | None       | vgg_a/pool1          | 4    | 2                | 1
+vgg_a                          | None       | vgg_a/conv2/conv2_1  | 8    | 2                | 3
+vgg_a                          | None       | vgg_a/pool2          | 10   | 4                | 3
+vgg_a                          | None       | vgg_a/conv3/conv3_1  | 18   | 4                | 7
+vgg_a                          | None       | vgg_a/conv3/conv3_2  | 26   | 4                | 11
+vgg_a                          | None       | vgg_a/pool3          | 30   | 8                | 11
+vgg_a                          | None       | vgg_a/conv4/conv4_1  | 46   | 8                | 19
+vgg_a                          | None       | vgg_a/conv4/conv4_2  | 62   | 8                | 27
+vgg_a                          | None       | vgg_a/pool4          | 70   | 16               | 27
+vgg_a                          | None       | vgg_a/conv5/conv5_1  | 102  | 16               | 43
+vgg_a                          | None       | vgg_a/conv5/conv5_2  | 134  | 16               | 59
+vgg_a                          | None       | vgg_a/pool5          | 150  | 32               | 59
+vgg_a                          | 224        | vgg_a/conv1/conv1_1  | 3    | 1                | 1
+vgg_a                          | 224        | vgg_a/pool1          | 4    | 2                | 1
+vgg_a                          | 224        | vgg_a/conv2/conv2_1  | 8    | 2                | 3
+vgg_a                          | 224        | vgg_a/pool2          | 10   | 4                | 3
+vgg_a                          | 224        | vgg_a/conv3/conv3_1  | 18   | 4                | 7
+vgg_a                          | 224        | vgg_a/conv3/conv3_2  | 26   | 4                | 11
+vgg_a                          | 224        | vgg_a/pool3          | 30   | 8                | 11
+vgg_a                          | 224        | vgg_a/conv4/conv4_1  | 46   | 8                | 19
+vgg_a                          | 224        | vgg_a/conv4/conv4_2  | 62   | 8                | 27
+vgg_a                          | 224        | vgg_a/pool4          | 70   | 16               | 27
+vgg_a                          | 224        | vgg_a/conv5/conv5_1  | 102  | 16               | 43
+vgg_a                          | 224        | vgg_a/conv5/conv5_2  | 134  | 16               | 59
+vgg_a                          | 224        | vgg_a/pool5          | 150  | 32               | 59
+vgg_a                          | 321        | vgg_a/conv1/conv1_1  | 3    | 1                | 1
+vgg_a                          | 321        | vgg_a/pool1          | 4    | 2                | 1
+vgg_a                          | 321        | vgg_a/conv2/conv2_1  | 8    | 2                | 3
+vgg_a                          | 321        | vgg_a/pool2          | 10   | 4                | 3
+vgg_a                          | 321        | vgg_a/conv3/conv3_1  | 18   | 4                | 7
+vgg_a                          | 321        | vgg_a/conv3/conv3_2  | 26   | 4                | 11
+vgg_a                          | 321        | vgg_a/pool3          | 30   | 8                | 11
+vgg_a                          | 321        | vgg_a/conv4/conv4_1  | 46   | 8                | 19
+vgg_a                          | 321        | vgg_a/conv4/conv4_2  | 62   | 8                | 27
+vgg_a                          | 321        | vgg_a/pool4          | 70   | 16               | 27
+vgg_a                          | 321        | vgg_a/conv5/conv5_1  | 102  | 16               | 43
+vgg_a                          | 321        | vgg_a/conv5/conv5_2  | 134  | 16               | 59
+vgg_a                          | 321        | vgg_a/pool5          | 150  | 32               | 59
+vgg_16                         | None       | vgg_16/conv1/conv1_1 | 3    | 1                | 1
+vgg_16                         | None       | vgg_16/pool1         | 6    | 2                | 2
+vgg_16                         | None       | vgg_16/conv2/conv2_1 | 10   | 2                | 4
+vgg_16                         | None       | vgg_16/pool2         | 16   | 4                | 6
+vgg_16                         | None       | vgg_16/conv3/conv3_1 | 24   | 4                | 10
+vgg_16                         | None       | vgg_16/conv3/conv3_2 | 32   | 4                | 14
+vgg_16                         | None       | vgg_16/pool3         | 44   | 8                | 18
+vgg_16                         | None       | vgg_16/conv4/conv4_1 | 60   | 8                | 26
+vgg_16                         | None       | vgg_16/conv4/conv4_2 | 76   | 8                | 34
+vgg_16                         | None       | vgg_16/pool4         | 100  | 16               | 42
+vgg_16                         | None       | vgg_16/conv5/conv5_1 | 132  | 16               | 58
+vgg_16                         | None       | vgg_16/conv5/conv5_2 | 164  | 16               | 74
+vgg_16                         | None       | vgg_16/pool5         | 212  | 32               | 90
+vgg_16                         | 224        | vgg_16/conv1/conv1_1 | 3    | 1                | 1
+vgg_16                         | 224        | vgg_16/pool1         | 6    | 2                | 2
+vgg_16                         | 224        | vgg_16/conv2/conv2_1 | 10   | 2                | 4
+vgg_16                         | 224        | vgg_16/pool2         | 16   | 4                | 6
+vgg_16                         | 224        | vgg_16/conv3/conv3_1 | 24   | 4                | 10
+vgg_16                         | 224        | vgg_16/conv3/conv3_2 | 32   | 4                | 14
+vgg_16                         | 224        | vgg_16/pool3         | 44   | 8                | 18
+vgg_16                         | 224        | vgg_16/conv4/conv4_1 | 60   | 8                | 26
+vgg_16                         | 224        | vgg_16/conv4/conv4_2 | 76   | 8                | 34
+vgg_16                         | 224        | vgg_16/pool4         | 100  | 16               | 42
+vgg_16                         | 224        | vgg_16/conv5/conv5_1 | 132  | 16               | 58
+vgg_16                         | 224        | vgg_16/conv5/conv5_2 | 164  | 16               | 74
+vgg_16                         | 224        | vgg_16/pool5         | 212  | 32               | 90
+vgg_16                         | 321        | vgg_16/conv1/conv1_1 | 3    | 1                | 1
+vgg_16                         | 321        | vgg_16/pool1         | 6    | 2                | 2
+vgg_16                         | 321        | vgg_16/conv2/conv2_1 | 10   | 2                | 4
+vgg_16                         | 321        | vgg_16/pool2         | 16   | 4                | 6
+vgg_16                         | 321        | vgg_16/conv3/conv3_1 | 24   | 4                | 10
+vgg_16                         | 321        | vgg_16/conv3/conv3_2 | 32   | 4                | 14
+vgg_16                         | 321        | vgg_16/pool3         | 44   | 8                | 18
+vgg_16                         | 321        | vgg_16/conv4/conv4_1 | 60   | 8                | 26
+vgg_16                         | 321        | vgg_16/conv4/conv4_2 | 76   | 8                | 34
+vgg_16                         | 321        | vgg_16/pool4         | 100  | 16               | 42
+vgg_16                         | 321        | vgg_16/conv5/conv5_1 | 132  | 16               | 58
+vgg_16                         | 321        | vgg_16/conv5/conv5_2 | 164  | 16               | 74
+vgg_16                         | 321        | vgg_16/pool5         | 212  | 32               | 90
+inception_v2                   | None       | Conv2d_1a_7x7        | 7    | 2                | None
+inception_v2                   | None       | MaxPool_2a_3x3       | 11   | 4                | None
+inception_v2                   | None       | Conv2d_2b_1x1        | 11   | 4                | None
+inception_v2                   | None       | Conv2d_2c_3x3        | 19   | 4                | None
+inception_v2                   | None       | MaxPool_3a_3x3       | 27   | 8                | None
+inception_v2                   | None       | Mixed_3b             | 59   | 8                | None
+inception_v2                   | None       | Mixed_3c             | 91   | 8                | None
+inception_v2                   | None       | Mixed_4a             | 123  | 16               | None
+inception_v2                   | None       | Mixed_4b             | 187  | 16               | None
+inception_v2                   | None       | Mixed_4c             | 251  | 16               | None
+inception_v2                   | None       | Mixed_4d             | 315  | 16               | None
+inception_v2                   | None       | Mixed_4e             | 379  | 16               | None
+inception_v2                   | None       | Mixed_5a             | 443  | 32               | None
+inception_v2                   | None       | Mixed_5b             | 571  | 32               | None
+inception_v2                   | None       | Mixed_5c             | 699  | 32               | None
+inception_v2                   | 224        | Conv2d_1a_7x7        | 7    | 2                | 2
+inception_v2                   | 224        | MaxPool_2a_3x3       | 11   | 4                | 2
+inception_v2                   | 224        | Conv2d_2b_1x1        | 11   | 4                | 2
+inception_v2                   | 224        | Conv2d_2c_3x3        | 19   | 4                | 6
+inception_v2                   | 224        | MaxPool_3a_3x3       | 27   | 8                | 6
+inception_v2                   | 224        | Mixed_3b             | 59   | 8                | 22
+inception_v2                   | 224        | Mixed_3c             | 91   | 8                | 38
+inception_v2                   | 224        | Mixed_4a             | 123  | 16               | 46
+inception_v2                   | 224        | Mixed_4b             | 187  | 16               | 78
+inception_v2                   | 224        | Mixed_4c             | 251  | 16               | 110
+inception_v2                   | 224        | Mixed_4d             | 315  | 16               | 142
+inception_v2                   | 224        | Mixed_4e             | 379  | 16               | 174
+inception_v2                   | 224        | Mixed_5a             | 443  | 32               | 190
+inception_v2                   | 224        | Mixed_5b             | 571  | 32               | 254
+inception_v2                   | 224        | Mixed_5c             | 699  | 32               | 318
+inception_v2                   | 321        | Conv2d_1a_7x7        | 7    | 2                | 3
+inception_v2                   | 321        | MaxPool_2a_3x3       | 11   | 4                | 5
+inception_v2                   | 321        | Conv2d_2b_1x1        | 11   | 4                | 5
+inception_v2                   | 321        | Conv2d_2c_3x3        | 19   | 4                | 9
+inception_v2                   | 321        | MaxPool_3a_3x3       | 27   | 8                | 13
+inception_v2                   | 321        | Mixed_3b             | 59   | 8                | 29
+inception_v2                   | 321        | Mixed_3c             | 91   | 8                | 45
+inception_v2                   | 321        | Mixed_4a             | 123  | 16               | 61
+inception_v2                   | 321        | Mixed_4b             | 187  | 16               | 93
+inception_v2                   | 321        | Mixed_4c             | 251  | 16               | 125
+inception_v2                   | 321        | Mixed_4d             | 315  | 16               | 157
+inception_v2                   | 321        | Mixed_4e             | 379  | 16               | 189
+inception_v2                   | 321        | Mixed_5a             | 443  | 32               | 221
+inception_v2                   | 321        | Mixed_5b             | 571  | 32               | 285
+inception_v2                   | 321        | Mixed_5c             | 699  | 32               | 349
+inception_v2-no-separable-conv | None       | Conv2d_1a_7x7        | 7    | 2                | None
+inception_v2-no-separable-conv | None       | MaxPool_2a_3x3       | 11   | 4                | None
+inception_v2-no-separable-conv | None       | Conv2d_2b_1x1        | 11   | 4                | None
+inception_v2-no-separable-conv | None       | Conv2d_2c_3x3        | 19   | 4                | None
+inception_v2-no-separable-conv | None       | MaxPool_3a_3x3       | 27   | 8                | None
+inception_v2-no-separable-conv | None       | Mixed_3b             | 59   | 8                | None
+inception_v2-no-separable-conv | None       | Mixed_3c             | 91   | 8                | None
+inception_v2-no-separable-conv | None       | Mixed_4a             | 123  | 16               | None
+inception_v2-no-separable-conv | None       | Mixed_4b             | 187  | 16               | None
+inception_v2-no-separable-conv | None       | Mixed_4c             | 251  | 16               | None
+inception_v2-no-separable-conv | None       | Mixed_4d             | 315  | 16               | None
+inception_v2-no-separable-conv | None       | Mixed_4e             | 379  | 16               | None
+inception_v2-no-separable-conv | None       | Mixed_5a             | 443  | 32               | None
+inception_v2-no-separable-conv | None       | Mixed_5b             | 571  | 32               | None
+inception_v2-no-separable-conv | None       | Mixed_5c             | 699  | 32               | None
+inception_v2-no-separable-conv | 224        | Conv2d_1a_7x7        | 7    | 2                | 2
+inception_v2-no-separable-conv | 224        | MaxPool_2a_3x3       | 11   | 4                | 2
+inception_v2-no-separable-conv | 224        | Conv2d_2b_1x1        | 11   | 4                | 2
+inception_v2-no-separable-conv | 224        | Conv2d_2c_3x3        | 19   | 4                | 6
+inception_v2-no-separable-conv | 224        | MaxPool_3a_3x3       | 27   | 8                | 6
+inception_v2-no-separable-conv | 224        | Mixed_3b             | 59   | 8                | 22
+inception_v2-no-separable-conv | 224        | Mixed_3c             | 91   | 8                | 38
+inception_v2-no-separable-conv | 224        | Mixed_4a             | 123  | 16               | 46
+inception_v2-no-separable-conv | 224        | Mixed_4b             | 187  | 16               | 78
+inception_v2-no-separable-conv | 224        | Mixed_4c             | 251  | 16               | 110
+inception_v2-no-separable-conv | 224        | Mixed_4d             | 315  | 16               | 142
+inception_v2-no-separable-conv | 224        | Mixed_4e             | 379  | 16               | 174
+inception_v2-no-separable-conv | 224        | Mixed_5a             | 443  | 32               | 190
+inception_v2-no-separable-conv | 224        | Mixed_5b             | 571  | 32               | 254
+inception_v2-no-separable-conv | 224        | Mixed_5c             | 699  | 32               | 318
+inception_v2-no-separable-conv | 321        | Conv2d_1a_7x7        | 7    | 2                | 3
+inception_v2-no-separable-conv | 321        | MaxPool_2a_3x3       | 11   | 4                | 5
+inception_v2-no-separable-conv | 321        | Conv2d_2b_1x1        | 11   | 4                | 5
+inception_v2-no-separable-conv | 321        | Conv2d_2c_3x3        | 19   | 4                | 9
+inception_v2-no-separable-conv | 321        | MaxPool_3a_3x3       | 27   | 8                | 13
+inception_v2-no-separable-conv | 321        | Mixed_3b             | 59   | 8                | 29
+inception_v2-no-separable-conv | 321        | Mixed_3c             | 91   | 8                | 45
+inception_v2-no-separable-conv | 321        | Mixed_4a             | 123  | 16               | 61
+inception_v2-no-separable-conv | 321        | Mixed_4b             | 187  | 16               | 93
+inception_v2-no-separable-conv | 321        | Mixed_4c             | 251  | 16               | 125
+inception_v2-no-separable-conv | 321        | Mixed_4d             | 315  | 16               | 157
+inception_v2-no-separable-conv | 321        | Mixed_4e             | 379  | 16               | 189
+inception_v2-no-separable-conv | 321        | Mixed_5a             | 443  | 32               | 221
+inception_v2-no-separable-conv | 321        | Mixed_5b             | 571  | 32               | 285
+inception_v2-no-separable-conv | 321        | Mixed_5c             | 699  | 32               | 349
+inception_v3                   | None       | Conv2d_1a_3x3        | 3    | 2                | 0
+inception_v3                   | None       | Conv2d_2a_3x3        | 7    | 2                | 0
+inception_v3                   | None       | Conv2d_2b_3x3        | 11   | 2                | 2
+inception_v3                   | None       | MaxPool_3a_3x3       | 15   | 4                | 2
+inception_v3                   | None       | Conv2d_3b_1x1        | 15   | 4                | 2
+inception_v3                   | None       | Conv2d_4a_3x3        | 23   | 4                | 2
+inception_v3                   | None       | MaxPool_5a_3x3       | 31   | 8                | 2
+inception_v3                   | None       | Mixed_5b             | 63   | 8                | 18
+inception_v3                   | None       | Mixed_5c             | 95   | 8                | 34
+inception_v3                   | None       | Mixed_5d             | 127  | 8                | 50
+inception_v3                   | None       | Mixed_6a             | 159  | 16               | 58
+inception_v3                   | None       | Mixed_6b             | 351  | 16               | 154
+inception_v3                   | None       | Mixed_6c             | 543  | 16               | 250
+inception_v3                   | None       | Mixed_6d             | 735  | 16               | 346
+inception_v3                   | None       | Mixed_6e             | 927  | 16               | 442
+inception_v3                   | None       | Mixed_7a             | 1055 | 32               | 490
+inception_v3                   | None       | Mixed_7b             | 1183 | 32               | 554
+inception_v3                   | None       | Mixed_7c             | 1311 | 32               | 618
+inception_v3                   | 224        | Conv2d_1a_3x3        | 3    | 2                | 0
+inception_v3                   | 224        | Conv2d_2a_3x3        | 7    | 2                | 0
+inception_v3                   | 224        | Conv2d_2b_3x3        | 11   | 2                | 2
+inception_v3                   | 224        | MaxPool_3a_3x3       | 15   | 4                | 2
+inception_v3                   | 224        | Conv2d_3b_1x1        | 15   | 4                | 2
+inception_v3                   | 224        | Conv2d_4a_3x3        | 23   | 4                | 2
+inception_v3                   | 224        | MaxPool_5a_3x3       | 31   | 8                | 2
+inception_v3                   | 224        | Mixed_5b             | 63   | 8                | 18
+inception_v3                   | 224        | Mixed_5c             | 95   | 8                | 34
+inception_v3                   | 224        | Mixed_5d             | 127  | 8                | 50
+inception_v3                   | 224        | Mixed_6a             | 159  | 16               | 58
+inception_v3                   | 224        | Mixed_6b             | 351  | 16               | 154
+inception_v3                   | 224        | Mixed_6c             | 543  | 16               | 250
+inception_v3                   | 224        | Mixed_6d             | 735  | 16               | 346
+inception_v3                   | 224        | Mixed_6e             | 927  | 16               | 442
+inception_v3                   | 224        | Mixed_7a             | 1055 | 32               | 490
+inception_v3                   | 224        | Mixed_7b             | 1183 | 32               | 554
+inception_v3                   | 224        | Mixed_7c             | 1311 | 32               | 618
+inception_v3                   | 321        | Conv2d_1a_3x3        | 3    | 2                | 0
+inception_v3                   | 321        | Conv2d_2a_3x3        | 7    | 2                | 0
+inception_v3                   | 321        | Conv2d_2b_3x3        | 11   | 2                | 2
+inception_v3                   | 321        | MaxPool_3a_3x3       | 15   | 4                | 2
+inception_v3                   | 321        | Conv2d_3b_1x1        | 15   | 4                | 2
+inception_v3                   | 321        | Conv2d_4a_3x3        | 23   | 4                | 2
+inception_v3                   | 321        | MaxPool_5a_3x3       | 31   | 8                | 2
+inception_v3                   | 321        | Mixed_5b             | 63   | 8                | 18
+inception_v3                   | 321        | Mixed_5c             | 95   | 8                | 34
+inception_v3                   | 321        | Mixed_5d             | 127  | 8                | 50
+inception_v3                   | 321        | Mixed_6a             | 159  | 16               | 58
+inception_v3                   | 321        | Mixed_6b             | 351  | 16               | 154
+inception_v3                   | 321        | Mixed_6c             | 543  | 16               | 250
+inception_v3                   | 321        | Mixed_6d             | 735  | 16               | 346
+inception_v3                   | 321        | Mixed_6e             | 927  | 16               | 442
+inception_v3                   | 321        | Mixed_7a             | 1055 | 32               | 490
+inception_v3                   | 321        | Mixed_7b             | 1183 | 32               | 554
+inception_v3                   | 321        | Mixed_7c             | 1311 | 32               | 618
+inception_v4                   | None       | Conv2d_1a_3x3        | 3    | 2                | 0
+inception_v4                   | None       | Conv2d_2a_3x3        | 7    | 2                | 0
+inception_v4                   | None       | Conv2d_2b_3x3        | 11   | 2                | 2
+inception_v4                   | None       | Mixed_3a             | 15   | 4                | 2
+inception_v4                   | None       | Mixed_4a             | 47   | 4                | 14
+inception_v4                   | None       | Mixed_5a             | 55   | 8                | 14
+inception_v4                   | None       | Mixed_5b             | 87   | 8                | 30
+inception_v4                   | None       | Mixed_5c             | 119  | 8                | 46
+inception_v4                   | None       | Mixed_5d             | 151  | 8                | 62
+inception_v4                   | None       | Mixed_5e             | 183  | 8                | 78
+inception_v4                   | None       | Mixed_6a             | 215  | 16               | 86
+inception_v4                   | None       | Mixed_6b             | 407  | 16               | 182
+inception_v4                   | None       | Mixed_6c             | 599  | 16               | 278
+inception_v4                   | None       | Mixed_6d             | 791  | 16               | 374
+inception_v4                   | None       | Mixed_6e             | 983  | 16               | 470
+inception_v4                   | None       | Mixed_6f             | 1175 | 16               | 566
+inception_v4                   | None       | Mixed_6g             | 1367 | 16               | 662
+inception_v4                   | None       | Mixed_6h             | 1559 | 16               | 758
+inception_v4                   | None       | Mixed_7a             | 1687 | 32               | 806
+inception_v4                   | None       | Mixed_7b             | 1815 | 32               | 870
+inception_v4                   | None       | Mixed_7c             | 1943 | 32               | 934
+inception_v4                   | None       | Mixed_7d             | 2071 | 32               | 998
+inception_v4                   | 224        | Conv2d_1a_3x3        | 3    | 2                | 0
+inception_v4                   | 224        | Conv2d_2a_3x3        | 7    | 2                | 0
+inception_v4                   | 224        | Conv2d_2b_3x3        | 11   | 2                | 2
+inception_v4                   | 224        | Mixed_3a             | 15   | 4                | 2
+inception_v4                   | 224        | Mixed_4a             | 47   | 4                | 14
+inception_v4                   | 224        | Mixed_5a             | 55   | 8                | 14
+inception_v4                   | 224        | Mixed_5b             | 87   | 8                | 30
+inception_v4                   | 224        | Mixed_5c             | 119  | 8                | 46
+inception_v4                   | 224        | Mixed_5d             | 151  | 8                | 62
+inception_v4                   | 224        | Mixed_5e             | 183  | 8                | 78
+inception_v4                   | 224        | Mixed_6a             | 215  | 16               | 86
+inception_v4                   | 224        | Mixed_6b             | 407  | 16               | 182
+inception_v4                   | 224        | Mixed_6c             | 599  | 16               | 278
+inception_v4                   | 224        | Mixed_6d             | 791  | 16               | 374
+inception_v4                   | 224        | Mixed_6e             | 983  | 16               | 470
+inception_v4                   | 224        | Mixed_6f             | 1175 | 16               | 566
+inception_v4                   | 224        | Mixed_6g             | 1367 | 16               | 662
+inception_v4                   | 224        | Mixed_6h             | 1559 | 16               | 758
+inception_v4                   | 224        | Mixed_7a             | 1687 | 32               | 806
+inception_v4                   | 224        | Mixed_7b             | 1815 | 32               | 870
+inception_v4                   | 224        | Mixed_7c             | 1943 | 32               | 934
+inception_v4                   | 224        | Mixed_7d             | 2071 | 32               | 998
+inception_v4                   | 321        | Conv2d_1a_3x3        | 3    | 2                | 0
+inception_v4                   | 321        | Conv2d_2a_3x3        | 7    | 2                | 0
+inception_v4                   | 321        | Conv2d_2b_3x3        | 11   | 2                | 2
+inception_v4                   | 321        | Mixed_3a             | 15   | 4                | 2
+inception_v4                   | 321        | Mixed_4a             | 47   | 4                | 14
+inception_v4                   | 321        | Mixed_5a             | 55   | 8                | 14
+inception_v4                   | 321        | Mixed_5b             | 87   | 8                | 30
+inception_v4                   | 321        | Mixed_5c             | 119  | 8                | 46
+inception_v4                   | 321        | Mixed_5d             | 151  | 8                | 62
+inception_v4                   | 321        | Mixed_5e             | 183  | 8                | 78
+inception_v4                   | 321        | Mixed_6a             | 215  | 16               | 86
+inception_v4                   | 321        | Mixed_6b             | 407  | 16               | 182
+inception_v4                   | 321        | Mixed_6c             | 599  | 16               | 278
+inception_v4                   | 321        | Mixed_6d             | 791  | 16               | 374
+inception_v4                   | 321        | Mixed_6e             | 983  | 16               | 470
+inception_v4                   | 321        | Mixed_6f             | 1175 | 16               | 566
+inception_v4                   | 321        | Mixed_6g             | 1367 | 16               | 662
+inception_v4                   | 321        | Mixed_6h             | 1559 | 16               | 758
+inception_v4                   | 321        | Mixed_7a             | 1687 | 32               | 806
+inception_v4                   | 321        | Mixed_7b             | 1815 | 32               | 870
+inception_v4                   | 321        | Mixed_7c             | 1943 | 32               | 934
+inception_v4                   | 321        | Mixed_7d             | 2071 | 32               | 998
+inception_resnet_v2            | None       | Conv2d_1a_3x3        | 3    | 2                | 0
+inception_resnet_v2            | None       | Conv2d_2a_3x3        | 7    | 2                | 0
+inception_resnet_v2            | None       | Conv2d_2b_3x3        | 11   | 2                | 2
+inception_resnet_v2            | None       | MaxPool_3a_3x3       | 15   | 4                | 2
+inception_resnet_v2            | None       | Conv2d_3b_1x1        | 15   | 4                | 2
+inception_resnet_v2            | None       | Conv2d_4a_3x3        | 23   | 4                | 2
+inception_resnet_v2            | None       | MaxPool_5a_3x3       | 31   | 8                | 2
+inception_resnet_v2            | None       | Mixed_5b             | 63   | 8                | 18
+inception_resnet_v2            | None       | Mixed_6a             | 415  | 16               | 186
+inception_resnet_v2            | None       | PreAuxLogits         | 2335 | 16               | 1146
+inception_resnet_v2            | None       | Mixed_7a             | 2399 | 32               | 1162
+inception_resnet_v2            | None       | Conv2d_7b_1x1        | 3039 | 32               | 1482
+inception_resnet_v2            | 224        | Conv2d_1a_3x3        | 3    | 2                | 0
+inception_resnet_v2            | 224        | Conv2d_2a_3x3        | 7    | 2                | 0
+inception_resnet_v2            | 224        | Conv2d_2b_3x3        | 11   | 2                | 2
+inception_resnet_v2            | 224        | MaxPool_3a_3x3       | 15   | 4                | 2
+inception_resnet_v2            | 224        | Conv2d_3b_1x1        | 15   | 4                | 2
+inception_resnet_v2            | 224        | Conv2d_4a_3x3        | 23   | 4                | 2
+inception_resnet_v2            | 224        | MaxPool_5a_3x3       | 31   | 8                | 2
+inception_resnet_v2            | 224        | Mixed_5b             | 63   | 8                | 18
+inception_resnet_v2            | 224        | Mixed_6a             | 415  | 16               | 186
+inception_resnet_v2            | 224        | PreAuxLogits         | 2335 | 16               | 1146
+inception_resnet_v2            | 224        | Mixed_7a             | 2399 | 32               | 1162
+inception_resnet_v2            | 224        | Conv2d_7b_1x1        | 3039 | 32               | 1482
+inception_resnet_v2            | 321        | Conv2d_1a_3x3        | 3    | 2                | 0
+inception_resnet_v2            | 321        | Conv2d_2a_3x3        | 7    | 2                | 0
+inception_resnet_v2            | 321        | Conv2d_2b_3x3        | 11   | 2                | 2
+inception_resnet_v2            | 321        | MaxPool_3a_3x3       | 15   | 4                | 2
+inception_resnet_v2            | 321        | Conv2d_3b_1x1        | 15   | 4                | 2
+inception_resnet_v2            | 321        | Conv2d_4a_3x3        | 23   | 4                | 2
+inception_resnet_v2            | 321        | MaxPool_5a_3x3       | 31   | 8                | 2
+inception_resnet_v2            | 321        | Mixed_5b             | 63   | 8                | 18
+inception_resnet_v2            | 321        | Mixed_6a             | 415  | 16               | 186
+inception_resnet_v2            | 321        | PreAuxLogits         | 2335 | 16               | 1146
+inception_resnet_v2            | 321        | Mixed_7a             | 2399 | 32               | 1162
+inception_resnet_v2            | 321        | Conv2d_7b_1x1        | 3039 | 32               | 1482
+inception_resnet_v2-same       | None       | Conv2d_1a_3x3        | 3    | 2                | None
+inception_resnet_v2-same       | None       | Conv2d_2a_3x3        | 7    | 2                | None
+inception_resnet_v2-same       | None       | Conv2d_2b_3x3        | 11   | 2                | None
+inception_resnet_v2-same       | None       | MaxPool_3a_3x3       | 15   | 4                | None
+inception_resnet_v2-same       | None       | Conv2d_3b_1x1        | 15   | 4                | None
+inception_resnet_v2-same       | None       | Conv2d_4a_3x3        | 23   | 4                | None
+inception_resnet_v2-same       | None       | MaxPool_5a_3x3       | 31   | 8                | None
+inception_resnet_v2-same       | None       | Mixed_5b             | 63   | 8                | None
+inception_resnet_v2-same       | None       | Mixed_6a             | 415  | 16               | None
+inception_resnet_v2-same       | None       | PreAuxLogits         | 2335 | 16               | None
+inception_resnet_v2-same       | None       | Mixed_7a             | 2399 | 32               | None
+inception_resnet_v2-same       | None       | Conv2d_7b_1x1        | 3039 | 32               | None
+inception_resnet_v2-same       | 224        | Conv2d_1a_3x3        | 3    | 2                | 0
+inception_resnet_v2-same       | 224        | Conv2d_2a_3x3        | 7    | 2                | 2
+inception_resnet_v2-same       | 224        | Conv2d_2b_3x3        | 11   | 2                | 4
+inception_resnet_v2-same       | 224        | MaxPool_3a_3x3       | 15   | 4                | 4
+inception_resnet_v2-same       | 224        | Conv2d_3b_1x1        | 15   | 4                | 4
+inception_resnet_v2-same       | 224        | Conv2d_4a_3x3        | 23   | 4                | 8
+inception_resnet_v2-same       | 224        | MaxPool_5a_3x3       | 31   | 8                | 8
+inception_resnet_v2-same       | 224        | Mixed_5b             | 63   | 8                | 24
+inception_resnet_v2-same       | 224        | Mixed_6a             | 415  | 16               | 192
+inception_resnet_v2-same       | 224        | PreAuxLogits         | 2335 | 16               | 1152
+inception_resnet_v2-same       | 224        | Mixed_7a             | 2399 | 32               | 1168
+inception_resnet_v2-same       | 224        | Conv2d_7b_1x1        | 3039 | 32               | 1488
+inception_resnet_v2-same       | 321        | Conv2d_1a_3x3        | 3    | 2                | 1
+inception_resnet_v2-same       | 321        | Conv2d_2a_3x3        | 7    | 2                | 3
+inception_resnet_v2-same       | 321        | Conv2d_2b_3x3        | 11   | 2                | 5
+inception_resnet_v2-same       | 321        | MaxPool_3a_3x3       | 15   | 4                | 7
+inception_resnet_v2-same       | 321        | Conv2d_3b_1x1        | 15   | 4                | 7
+inception_resnet_v2-same       | 321        | Conv2d_4a_3x3        | 23   | 4                | 11
+inception_resnet_v2-same       | 321        | MaxPool_5a_3x3       | 31   | 8                | 15
+inception_resnet_v2-same       | 321        | Mixed_5b             | 63   | 8                | 31
+inception_resnet_v2-same       | 321        | Mixed_6a             | 415  | 16               | 207
+inception_resnet_v2-same       | 321        | PreAuxLogits         | 2335 | 16               | 1167
+inception_resnet_v2-same       | 321        | Mixed_7a             | 2399 | 32               | 1199
+inception_resnet_v2-same       | 321        | Conv2d_7b_1x1        | 3039 | 32               | 1519
+mobilenet_v1                   | None       | Conv2d_0             | 3    | 2                | None
+mobilenet_v1                   | None       | Conv2d_1_pointwise   | 7    | 2                | None
+mobilenet_v1                   | None       | Conv2d_2_pointwise   | 11   | 4                | None
+mobilenet_v1                   | None       | Conv2d_3_pointwise   | 19   | 4                | None
+mobilenet_v1                   | None       | Conv2d_4_pointwise   | 27   | 8                | None
+mobilenet_v1                   | None       | Conv2d_5_pointwise   | 43   | 8                | None
+mobilenet_v1                   | None       | Conv2d_6_pointwise   | 59   | 16               | None
+mobilenet_v1                   | None       | Conv2d_7_pointwise   | 91   | 16               | None
+mobilenet_v1                   | None       | Conv2d_8_pointwise   | 123  | 16               | None
+mobilenet_v1                   | None       | Conv2d_9_pointwise   | 155  | 16               | None
+mobilenet_v1                   | None       | Conv2d_10_pointwise  | 187  | 16               | None
+mobilenet_v1                   | None       | Conv2d_11_pointwise  | 219  | 16               | None
+mobilenet_v1                   | None       | Conv2d_12_pointwise  | 251  | 32               | None
+mobilenet_v1                   | None       | Conv2d_13_pointwise  | 315  | 32               | None
+mobilenet_v1                   | 224        | Conv2d_0             | 3    | 2                | 0
+mobilenet_v1                   | 224        | Conv2d_1_pointwise   | 7    | 2                | 2
+mobilenet_v1                   | 224        | Conv2d_2_pointwise   | 11   | 4                | 2
+mobilenet_v1                   | 224        | Conv2d_3_pointwise   | 19   | 4                | 6
+mobilenet_v1                   | 224        | Conv2d_4_pointwise   | 27   | 8                | 6
+mobilenet_v1                   | 224        | Conv2d_5_pointwise   | 43   | 8                | 14
+mobilenet_v1                   | 224        | Conv2d_6_pointwise   | 59   | 16               | 14
+mobilenet_v1                   | 224        | Conv2d_7_pointwise   | 91   | 16               | 30
+mobilenet_v1                   | 224        | Conv2d_8_pointwise   | 123  | 16               | 46
+mobilenet_v1                   | 224        | Conv2d_9_pointwise   | 155  | 16               | 62
+mobilenet_v1                   | 224        | Conv2d_10_pointwise  | 187  | 16               | 78
+mobilenet_v1                   | 224        | Conv2d_11_pointwise  | 219  | 16               | 94
+mobilenet_v1                   | 224        | Conv2d_12_pointwise  | 251  | 32               | 94
+mobilenet_v1                   | 224        | Conv2d_13_pointwise  | 315  | 32               | 126
+mobilenet_v1                   | 321        | Conv2d_0             | 3    | 2                | 1
+mobilenet_v1                   | 321        | Conv2d_1_pointwise   | 7    | 2                | 3
+mobilenet_v1                   | 321        | Conv2d_2_pointwise   | 11   | 4                | 5
+mobilenet_v1                   | 321        | Conv2d_3_pointwise   | 19   | 4                | 9
+mobilenet_v1                   | 321        | Conv2d_4_pointwise   | 27   | 8                | 13
+mobilenet_v1                   | 321        | Conv2d_5_pointwise   | 43   | 8                | 21
+mobilenet_v1                   | 321        | Conv2d_6_pointwise   | 59   | 16               | 29
+mobilenet_v1                   | 321        | Conv2d_7_pointwise   | 91   | 16               | 45
+mobilenet_v1                   | 321        | Conv2d_8_pointwise   | 123  | 16               | 61
+mobilenet_v1                   | 321        | Conv2d_9_pointwise   | 155  | 16               | 77
+mobilenet_v1                   | 321        | Conv2d_10_pointwise  | 187  | 16               | 93
+mobilenet_v1                   | 321        | Conv2d_11_pointwise  | 219  | 16               | 109
+mobilenet_v1                   | 321        | Conv2d_12_pointwise  | 251  | 32               | 125
+mobilenet_v1                   | 321        | Conv2d_13_pointwise  | 315  | 32               | 157
+mobilenet_v1_075               | None       | Conv2d_0             | 3    | 2                | None
+mobilenet_v1_075               | None       | Conv2d_1_pointwise   | 7    | 2                | None
+mobilenet_v1_075               | None       | Conv2d_2_pointwise   | 11   | 4                | None
+mobilenet_v1_075               | None       | Conv2d_3_pointwise   | 19   | 4                | None
+mobilenet_v1_075               | None       | Conv2d_4_pointwise   | 27   | 8                | None
+mobilenet_v1_075               | None       | Conv2d_5_pointwise   | 43   | 8                | None
+mobilenet_v1_075               | None       | Conv2d_6_pointwise   | 59   | 16               | None
+mobilenet_v1_075               | None       | Conv2d_7_pointwise   | 91   | 16               | None
+mobilenet_v1_075               | None       | Conv2d_8_pointwise   | 123  | 16               | None
+mobilenet_v1_075               | None       | Conv2d_9_pointwise   | 155  | 16               | None
+mobilenet_v1_075               | None       | Conv2d_10_pointwise  | 187  | 16               | None
+mobilenet_v1_075               | None       | Conv2d_11_pointwise  | 219  | 16               | None
+mobilenet_v1_075               | None       | Conv2d_12_pointwise  | 251  | 32               | None
+mobilenet_v1_075               | None       | Conv2d_13_pointwise  | 315  | 32               | None
+mobilenet_v1_075               | 224        | Conv2d_0             | 3    | 2                | 0
+mobilenet_v1_075               | 224        | Conv2d_1_pointwise   | 7    | 2                | 2
+mobilenet_v1_075               | 224        | Conv2d_2_pointwise   | 11   | 4                | 2
+mobilenet_v1_075               | 224        | Conv2d_3_pointwise   | 19   | 4                | 6
+mobilenet_v1_075               | 224        | Conv2d_4_pointwise   | 27   | 8                | 6
+mobilenet_v1_075               | 224        | Conv2d_5_pointwise   | 43   | 8                | 14
+mobilenet_v1_075               | 224        | Conv2d_6_pointwise   | 59   | 16               | 14
+mobilenet_v1_075               | 224        | Conv2d_7_pointwise   | 91   | 16               | 30
+mobilenet_v1_075               | 224        | Conv2d_8_pointwise   | 123  | 16               | 46
+mobilenet_v1_075               | 224        | Conv2d_9_pointwise   | 155  | 16               | 62
+mobilenet_v1_075               | 224        | Conv2d_10_pointwise  | 187  | 16               | 78
+mobilenet_v1_075               | 224        | Conv2d_11_pointwise  | 219  | 16               | 94
+mobilenet_v1_075               | 224        | Conv2d_12_pointwise  | 251  | 32               | 94
+mobilenet_v1_075               | 224        | Conv2d_13_pointwise  | 315  | 32               | 126
+mobilenet_v1_075               | 321        | Conv2d_0             | 3    | 2                | 1
+mobilenet_v1_075               | 321        | Conv2d_1_pointwise   | 7    | 2                | 3
+mobilenet_v1_075               | 321        | Conv2d_2_pointwise   | 11   | 4                | 5
+mobilenet_v1_075               | 321        | Conv2d_3_pointwise   | 19   | 4                | 9
+mobilenet_v1_075               | 321        | Conv2d_4_pointwise   | 27   | 8                | 13
+mobilenet_v1_075               | 321        | Conv2d_5_pointwise   | 43   | 8                | 21
+mobilenet_v1_075               | 321        | Conv2d_6_pointwise   | 59   | 16               | 29
+mobilenet_v1_075               | 321        | Conv2d_7_pointwise   | 91   | 16               | 45
+mobilenet_v1_075               | 321        | Conv2d_8_pointwise   | 123  | 16               | 61
+mobilenet_v1_075               | 321        | Conv2d_9_pointwise   | 155  | 16               | 77
+mobilenet_v1_075               | 321        | Conv2d_10_pointwise  | 187  | 16               | 93
+mobilenet_v1_075               | 321        | Conv2d_11_pointwise  | 219  | 16               | 109
+mobilenet_v1_075               | 321        | Conv2d_12_pointwise  | 251  | 32               | 125
+mobilenet_v1_075               | 321        | Conv2d_13_pointwise  | 315  | 32               | 157
+resnet_v1_50                   | None       | resnet_v1_50/block1  | 35   | 8                | None
+resnet_v1_50                   | None       | resnet_v1_50/block2  | 99   | 16               | None
+resnet_v1_50                   | None       | resnet_v1_50/block3  | 291  | 32               | None
+resnet_v1_50                   | None       | resnet_v1_50/block4  | 483  | 32               | None
+resnet_v1_50                   | 224        | resnet_v1_50/block1  | 35   | 8                | 15
+resnet_v1_50                   | 224        | resnet_v1_50/block2  | 99   | 16               | 47
+resnet_v1_50                   | 224        | resnet_v1_50/block3  | 291  | 32               | 143
+resnet_v1_50                   | 224        | resnet_v1_50/block4  | 483  | 32               | 239
+resnet_v1_50                   | 321        | resnet_v1_50/block1  | 35   | 8                | 17
+resnet_v1_50                   | 321        | resnet_v1_50/block2  | 99   | 16               | 49
+resnet_v1_50                   | 321        | resnet_v1_50/block3  | 291  | 32               | 145
+resnet_v1_50                   | 321        | resnet_v1_50/block4  | 483  | 32               | 241
+resnet_v1_101                  | None       | resnet_v1_101/block1 | 35   | 8                | None
+resnet_v1_101                  | None       | resnet_v1_101/block2 | 99   | 16               | None
+resnet_v1_101                  | None       | resnet_v1_101/block3 | 835  | 32               | None
+resnet_v1_101                  | None       | resnet_v1_101/block4 | 1027 | 32               | None
+resnet_v1_101                  | 224        | resnet_v1_101/block1 | 35   | 8                | 15
+resnet_v1_101                  | 224        | resnet_v1_101/block2 | 99   | 16               | 47
+resnet_v1_101                  | 224        | resnet_v1_101/block3 | 835  | 32               | 415
+resnet_v1_101                  | 224        | resnet_v1_101/block4 | 1027 | 32               | 511
+resnet_v1_101                  | 321        | resnet_v1_101/block1 | 35   | 8                | 17
+resnet_v1_101                  | 321        | resnet_v1_101/block2 | 99   | 16               | 49
+resnet_v1_101                  | 321        | resnet_v1_101/block3 | 835  | 32               | 417
+resnet_v1_101                  | 321        | resnet_v1_101/block4 | 1027 | 32               | 513
+resnet_v1_152                  | None       | resnet_v1_152/block1 | 35   | 8                | None
+resnet_v1_152                  | None       | resnet_v1_152/block2 | 163  | 16               | None
+resnet_v1_152                  | None       | resnet_v1_152/block3 | 1315 | 32               | None
+resnet_v1_152                  | None       | resnet_v1_152/block4 | 1507 | 32               | None
+resnet_v1_152                  | 224        | resnet_v1_152/block1 | 35   | 8                | 15
+resnet_v1_152                  | 224        | resnet_v1_152/block2 | 163  | 16               | 79
+resnet_v1_152                  | 224        | resnet_v1_152/block3 | 1315 | 32               | 655
+resnet_v1_152                  | 224        | resnet_v1_152/block4 | 1507 | 32               | 751
+resnet_v1_152                  | 321        | resnet_v1_152/block1 | 35   | 8                | 17
+resnet_v1_152                  | 321        | resnet_v1_152/block2 | 163  | 16               | 81
+resnet_v1_152                  | 321        | resnet_v1_152/block3 | 1315 | 32               | 657
+resnet_v1_152                  | 321        | resnet_v1_152/block4 | 1507 | 32               | 753
+resnet_v1_200                  | None       | resnet_v1_200/block1 | 35   | 8                | None
+resnet_v1_200                  | None       | resnet_v1_200/block2 | 419  | 16               | None
+resnet_v1_200                  | None       | resnet_v1_200/block3 | 1571 | 32               | None
+resnet_v1_200                  | None       | resnet_v1_200/block4 | 1763 | 32               | None
+resnet_v1_200                  | 224        | resnet_v1_200/block1 | 35   | 8                | 15
+resnet_v1_200                  | 224        | resnet_v1_200/block2 | 419  | 16               | 207
+resnet_v1_200                  | 224        | resnet_v1_200/block3 | 1571 | 32               | 783
+resnet_v1_200                  | 224        | resnet_v1_200/block4 | 1763 | 32               | 879
+resnet_v1_200                  | 321        | resnet_v1_200/block1 | 35   | 8                | 17
+resnet_v1_200                  | 321        | resnet_v1_200/block2 | 419  | 16               | 209
+resnet_v1_200                  | 321        | resnet_v1_200/block3 | 1571 | 32               | 785
+resnet_v1_200                  | 321        | resnet_v1_200/block4 | 1763 | 32               | 881
+resnet_v2_50                   | None       | resnet_v2_50/block1  | 35   | 8                | None
+resnet_v2_50                   | None       | resnet_v2_50/block2  | 99   | 16               | None
+resnet_v2_50                   | None       | resnet_v2_50/block3  | 291  | 32               | None
+resnet_v2_50                   | None       | resnet_v2_50/block4  | 483  | 32               | None
+resnet_v2_50                   | 224        | resnet_v2_50/block1  | 35   | 8                | 15
+resnet_v2_50                   | 224        | resnet_v2_50/block2  | 99   | 16               | 47
+resnet_v2_50                   | 224        | resnet_v2_50/block3  | 291  | 32               | 143
+resnet_v2_50                   | 224        | resnet_v2_50/block4  | 483  | 32               | 239
+resnet_v2_50                   | 321        | resnet_v2_50/block1  | 35   | 8                | 17
+resnet_v2_50                   | 321        | resnet_v2_50/block2  | 99   | 16               | 49
+resnet_v2_50                   | 321        | resnet_v2_50/block3  | 291  | 32               | 145
+resnet_v2_50                   | 321        | resnet_v2_50/block4  | 483  | 32               | 241
+resnet_v2_101                  | None       | resnet_v2_101/block1 | 35   | 8                | None
+resnet_v2_101                  | None       | resnet_v2_101/block2 | 99   | 16               | None
+resnet_v2_101                  | None       | resnet_v2_101/block3 | 835  | 32               | None
+resnet_v2_101                  | None       | resnet_v2_101/block4 | 1027 | 32               | None
+resnet_v2_101                  | 224        | resnet_v2_101/block1 | 35   | 8                | 15
+resnet_v2_101                  | 224        | resnet_v2_101/block2 | 99   | 16               | 47
+resnet_v2_101                  | 224        | resnet_v2_101/block3 | 835  | 32               | 415
+resnet_v2_101                  | 224        | resnet_v2_101/block4 | 1027 | 32               | 511
+resnet_v2_101                  | 321        | resnet_v2_101/block1 | 35   | 8                | 17
+resnet_v2_101                  | 321        | resnet_v2_101/block2 | 99   | 16               | 49
+resnet_v2_101                  | 321        | resnet_v2_101/block3 | 835  | 32               | 417
+resnet_v2_101                  | 321        | resnet_v2_101/block4 | 1027 | 32               | 513
+resnet_v2_152                  | None       | resnet_v2_152/block1 | 35   | 8                | None
+resnet_v2_152                  | None       | resnet_v2_152/block2 | 163  | 16               | None
+resnet_v2_152                  | None       | resnet_v2_152/block3 | 1315 | 32               | None
+resnet_v2_152                  | None       | resnet_v2_152/block4 | 1507 | 32               | None
+resnet_v2_152                  | 224        | resnet_v2_152/block1 | 35   | 8                | 15
+resnet_v2_152                  | 224        | resnet_v2_152/block2 | 163  | 16               | 79
+resnet_v2_152                  | 224        | resnet_v2_152/block3 | 1315 | 32               | 655
+resnet_v2_152                  | 224        | resnet_v2_152/block4 | 1507 | 32               | 751
+resnet_v2_152                  | 321        | resnet_v2_152/block1 | 35   | 8                | 17
+resnet_v2_152                  | 321        | resnet_v2_152/block2 | 163  | 16               | 81
+resnet_v2_152                  | 321        | resnet_v2_152/block3 | 1315 | 32               | 657
+resnet_v2_152                  | 321        | resnet_v2_152/block4 | 1507 | 32               | 753
+resnet_v2_200                  | None       | resnet_v2_200/block1 | 35   | 8                | None
+resnet_v2_200                  | None       | resnet_v2_200/block2 | 419  | 16               | None
+resnet_v2_200                  | None       | resnet_v2_200/block3 | 1571 | 32               | None
+resnet_v2_200                  | None       | resnet_v2_200/block4 | 1763 | 32               | None
+resnet_v2_200                  | 224        | resnet_v2_200/block1 | 35   | 8                | 15
+resnet_v2_200                  | 224        | resnet_v2_200/block2 | 419  | 16               | 207
+resnet_v2_200                  | 224        | resnet_v2_200/block3 | 1571 | 32               | 783
+resnet_v2_200                  | 224        | resnet_v2_200/block4 | 1763 | 32               | 879
+resnet_v2_200                  | 321        | resnet_v2_200/block1 | 35   | 8                | 17
+resnet_v2_200                  | 321        | resnet_v2_200/block2 | 419  | 16               | 209
+resnet_v2_200                  | 321        | resnet_v2_200/block3 | 1571 | 32               | 785
+resnet_v2_200                  | 321        | resnet_v2_200/block4 | 1763 | 32               | 881
+
+## FAQ
+
+### What does a resolution of 'None' mean?
+
+In this case, the input resolution is undefined. For most models, the receptive
+field parameters can be computed even without knowing the input resolution.
+
+### For some networks, effective_padding shows as 'None' (eg, for Inception_v2 or Mobilenet_v1 when input size is not specified). Why is that?
+
+This means that the padding for these networks depends on the input size. So,
+unless we know exactly the input image dimensionality to be used, it is not
+possible to determine the padding applied at the different layers. Look at the
+other entries where the input size is fixed; for those cases, effective_padding
+is not None.
+
+This happens due to Tensorflow's implementation of the 'SAME' padding mode,
+which may depend on the input feature map size to a given layer. For background
+on this, see [these notes from the TF
+documentation](https://www.tensorflow.org/versions/master/api_guides/python/nn#Notes_on_SAME_Convolution_Padding).
+
+Also, note that in this case the program is not able to check if the network is
+aligned (ie, it could be that the different paths from input to output have
+receptive fields which are not consistently centered at the same position in the
+input image).
+
+So you should be aware that such networks might not be aligned -- the program
+has no way of checking it when the padding cannot be determined.
+
+### The receptive field parameters for network X seem different from what I expected... maybe your calculation is incorrect?
+
+First, note that the results presented here are based on the tensorflow
+implementations from the [TF-Slim model
+library](https://github.com/tensorflow/models/tree/master/research/slim).
+
+So, it is possible that due to some implementation details the RF parameters are
+different.
+
+One common case of confusion is the TF-Slim Resnet implementation, which applies
+stride in the last residual unit of each block, instead of at the input
+activations in the first residual unit of each block (which is what is described
+in the Resnet paper) -- see [this
+comment](https://github.com/tensorflow/models/blob/master/research/slim/nets/resnet_utils.py#L30).
+This makes the stride with respect to each convolution block potentially
+different. In this case, though, note that a
+[flag](https://github.com/tensorflow/models/blob/master/research/slim/nets/resnet_v1.py#L150)
+may be used to recover the original striding convention.
+
+Second, it could be that we have a bug somewhere. While we include [many
+tests](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/receptive_field/python/util/receptive_field_test.py)
+in our library, it is always possible that we missed something. If you suspect
+this is happening, please file a GitHub issue
+[here](https://github.com/tensorflow/tensorflow/issues).
diff --git a/tensorflow/contrib/receptive_field/python/util/examples/csv_to_markdown_table.py b/tensorflow/contrib/receptive_field/python/util/examples/csv_to_markdown_table.py
new file mode 100644
index 0000000000000000000000000000000000000000..4495d74bbf66fa461a05f38b430dd404d7da4b08
--- /dev/null
+++ b/tensorflow/contrib/receptive_field/python/util/examples/csv_to_markdown_table.py
@@ -0,0 +1,82 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Simple script to convert CSV output from rf_benchmark to Markdown format.
+
+The input CSV should have the following fields:
+- CNN
+- input resolution
+- end_point
+- RF size hor
+- RF size ver
+- effective stride hor
+- effective stride ver
+- effective padding hor
+- effective padding ver
+
+Since usually in all cases the parameters in the horizontal and vertical
+directions are the same, this is assumed by this script, which only prints one
+of them to the Markdown file.
+"""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import argparse
+import csv
+import sys
+
+from tensorflow.python.platform import app
+
+cmd_args = None
+
+
+def main(unused_argv):
+  with open(cmd_args.markdown_path, 'w') as f:
+    # Write table header and field size.
+    f.write('CNN | resolution | end-point | RF | effective stride | '
+            'effective padding|\n')
+    f.write(
+        ':--------------------: | :----------: | :---------------: | :-----: |'
+        ' :----: | :----:|\n')
+    with open(cmd_args.csv_path) as csvfile:
+      reader = csv.DictReader(csvfile)
+      for row in reader:
+        # Make sure horizontal and parameters are the same.
+        assert row['RF size hor'] == row['RF size ver']
+        assert row['effective stride hor'] == row['effective stride ver']
+        assert row['effective padding hor'] == row['effective padding ver']
+
+        f.write('%s|%s|%s|%s|%s|%s\n' %
+                (row['CNN'], row['input resolution'], row['end_point'],
+                 row['RF size hor'], row['effective stride hor'],
+                 row['effective padding hor']))
+
+
+if __name__ == '__main__':
+  parser = argparse.ArgumentParser()
+  parser.register('type', 'bool', lambda v: v.lower() == 'true')
+  parser.add_argument(
+      '--csv_path',
+      type=str,
+      default='/tmp/rf.csv',
+      help='Path where CSV output of rf_benchmark was saved.')
+  parser.add_argument(
+      '--markdown_path',
+      type=str,
+      default='/tmp/rf.md',
+      help='Path where Markdown output will be saved.')
+  cmd_args, unparsed = parser.parse_known_args()
+  app.run(main=main, argv=[sys.argv[0]] + unparsed)
diff --git a/tensorflow/contrib/receptive_field/python/util/parse_layer_parameters.py b/tensorflow/contrib/receptive_field/python/util/parse_layer_parameters.py
index bc383a803496380aaba4d0248d2b7f93253b2b50..0e3c46f17d2e2a277418d39e31927db73a509670 100644
--- a/tensorflow/contrib/receptive_field/python/util/parse_layer_parameters.py
+++ b/tensorflow/contrib/receptive_field/python/util/parse_layer_parameters.py
@@ -27,7 +27,7 @@ from tensorflow.python.platform import tf_logging as logging
 _UNCHANGED_RF_LAYER_OPS = [
     "Add", "BiasAdd", "Cast", "Ceil", "ConcatV2", "Const", "Floor",
     "FusedBatchNorm", "Identity", "Log", "Mul", "Pow", "RealDiv", "Relu",
-    "Relu6", "Round", "Rsqrt", "Softplus", "Sub", "VariableV2"
+    "Relu6", "Round", "Rsqrt", "Softplus", "Sub", "VariableV2", "LRN"
 ]
 
 # Different ways in which padding modes may be spelled.
diff --git a/tensorflow/contrib/rnn/BUILD b/tensorflow/contrib/rnn/BUILD
index 43c0f7595590802aa80e1012967d377a6ab83d29..4eb5c920b3517a8968ff730003e786ae2a9c9e26 100644
--- a/tensorflow/contrib/rnn/BUILD
+++ b/tensorflow/contrib/rnn/BUILD
@@ -193,6 +193,10 @@ tf_py_test(
         "//tensorflow/python:variable_scope",
         "//tensorflow/python:variables",
     ],
+    tags = [
+        "manual",
+        "notap",
+    ],
 )
 
 cuda_py_tests(
diff --git a/tensorflow/contrib/signal/BUILD b/tensorflow/contrib/signal/BUILD
index fdecceff526a860a274354e53e824b98d11418a6..6bd58c4d322c04d4d14d04678e24a05c0f876208 100644
--- a/tensorflow/contrib/signal/BUILD
+++ b/tensorflow/contrib/signal/BUILD
@@ -1,4 +1,4 @@
-package(default_visibility = ["//tensorflow:__subpackages__"])
+package(default_visibility = ["//tensorflow:internal"])
 
 licenses(["notice"])  # Apache 2.0
 
diff --git a/tensorflow/contrib/slim/python/slim/evaluation_test.py b/tensorflow/contrib/slim/python/slim/evaluation_test.py
index 94fc12ca814721acf62f16b72ffa50473043cc8b..3d0308aaf3da3b5b16fd22a2905db36917e8c97b 100644
--- a/tensorflow/contrib/slim/python/slim/evaluation_test.py
+++ b/tensorflow/contrib/slim/python/slim/evaluation_test.py
@@ -26,7 +26,6 @@ import time
 import numpy as np
 
 from tensorflow.contrib.framework.python.ops import variables as variables_lib
-from tensorflow.contrib.metrics.python.ops import metric_ops
 from tensorflow.contrib.slim.python.slim import evaluation
 from tensorflow.contrib.training.python.training import evaluation as evaluation_lib
 from tensorflow.core.protobuf import saver_pb2
@@ -37,6 +36,7 @@ from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import errors
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import metrics
 from tensorflow.python.ops import variables
 from tensorflow.python.platform import flags
 from tensorflow.python.platform import gfile
@@ -89,8 +89,8 @@ class EvaluationTest(test.TestCase):
     self._predictions, self._scale = TestModel(self._inputs)
 
   def testFinalOpsOnEvaluationLoop(self):
-    value_op, update_op = metric_ops.streaming_accuracy(self._predictions,
-                                                        self._labels)
+    value_op, update_op = metrics.accuracy(
+        labels=self._labels, predictions=self._predictions)
     init_op = control_flow_ops.group(variables.global_variables_initializer(),
                                      variables.local_variables_initializer())
     # Create checkpoint and log directories:
@@ -136,9 +136,10 @@ class EvaluationTest(test.TestCase):
     self.assertTrue(obj.hook_was_run)
 
   def _create_names_to_metrics(self, predictions, labels):
-    accuracy0, update_op0 = metric_ops.streaming_accuracy(predictions, labels)
-    accuracy1, update_op1 = metric_ops.streaming_accuracy(predictions + 1,
-                                                          labels)
+    accuracy0, update_op0 = metrics.accuracy(
+        labels=labels, predictions=predictions)
+    accuracy1, update_op1 = metrics.accuracy(
+        labels=labels, predictions=predictions + 1)
 
     names_to_values = {'Accuracy': accuracy0, 'Another_accuracy': accuracy1}
     names_to_updates = {'Accuracy': update_op0, 'Another_accuracy': update_op1}
@@ -198,8 +199,8 @@ class EvaluationTest(test.TestCase):
     predictions_limited = input.limit_epochs(self._predictions, num_epochs=1)
     labels_limited = input.limit_epochs(self._labels, num_epochs=1)
 
-    value_op, update_op = metric_ops.streaming_accuracy(
-        predictions_limited, labels_limited)
+    value_op, update_op = metrics.accuracy(
+        labels=labels_limited, predictions=predictions_limited)
 
     init_op = control_flow_ops.group(variables.global_variables_initializer(),
                                      variables.local_variables_initializer())
@@ -260,8 +261,8 @@ class SingleEvaluationTest(test.TestCase):
     self._prepareCheckpoint(checkpoint_path)
 
     # Next, determine the metric to evaluate:
-    value_op, update_op = metric_ops.streaming_accuracy(self._predictions,
-                                                        self._labels)
+    value_op, update_op = metrics.accuracy(
+        labels=self._labels, predictions=self._predictions)
 
     # Run the evaluation and verify the results:
     accuracy_value = evaluation.evaluate_once(
@@ -276,8 +277,8 @@ class SingleEvaluationTest(test.TestCase):
     self._prepareCheckpoint(checkpoint_path)
 
     # Next, determine the metric to evaluate:
-    value_op, update_op = metric_ops.streaming_accuracy(self._predictions,
-                                                        self._labels)
+    value_op, update_op = metrics.accuracy(
+        labels=self._labels, predictions=self._predictions)
 
     dumping_root = os.path.join(self.get_temp_dir(), 'tfdbg_dump_dir')
     dumping_hook = hooks.DumpingDebugHook(dumping_root, log_usage=False)
diff --git a/tensorflow/contrib/solvers/python/ops/linear_equations.py b/tensorflow/contrib/solvers/python/ops/linear_equations.py
index 9305c6a11c4ec898c82553773e8e7277a54ab82e..85918bf8506623cf5e0c9106ae9ed80e233f5a7d 100644
--- a/tensorflow/contrib/solvers/python/ops/linear_equations.py
+++ b/tensorflow/contrib/solvers/python/ops/linear_equations.py
@@ -28,7 +28,6 @@ from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import linalg_ops
 from tensorflow.python.ops import math_ops
-from tensorflow.python.ops import linalg_ops
 
 
 def conjugate_gradient(operator,
diff --git a/tensorflow/contrib/stat_summarizer/BUILD b/tensorflow/contrib/stat_summarizer/BUILD
index 30be14c10cd8576ded75b8489cc89d439a9cc282..0b8fc0cdc66ae41807cce92776ada263675b1f94 100644
--- a/tensorflow/contrib/stat_summarizer/BUILD
+++ b/tensorflow/contrib/stat_summarizer/BUILD
@@ -31,5 +31,8 @@ tf_py_test(
         "//tensorflow/python:math_ops",
         "//tensorflow/python:variables",
     ],
-    tags = ["no_windows"],
+    tags = [
+        "no_windows",
+        "notap",  # TODO(b/80546574): test is flaky
+    ],
 )
diff --git a/tensorflow/contrib/tensor_forest/client/eval_metrics.py b/tensorflow/contrib/tensor_forest/client/eval_metrics.py
index e893e1d1c836cc7feef15757dde79d0db362cbaf..d8236a0a6fa6d0d0e383e454eb0146bb10b6f49d 100644
--- a/tensorflow/contrib/tensor_forest/client/eval_metrics.py
+++ b/tensorflow/contrib/tensor_forest/client/eval_metrics.py
@@ -21,10 +21,10 @@ import numpy as np
 
 from tensorflow.contrib import losses
 from tensorflow.contrib.learn.python.learn.estimators import prediction_key
-from tensorflow.contrib.metrics.python.ops import metric_ops
 
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import metrics
 from tensorflow.python.ops import nn
 
 INFERENCE_PROB_NAME = prediction_key.PredictionKey.PROBABILITIES
@@ -38,12 +38,13 @@ def _top_k_generator(k):
     targets = math_ops.to_int32(targets)
     if targets.get_shape().ndims > 1:
       targets = array_ops.squeeze(targets, axis=[1])
-    return metric_ops.streaming_mean(nn.in_top_k(probabilities, targets, k))
+    return metrics.mean(nn.in_top_k(probabilities, targets, k))
   return _top_k
 
 
 def _accuracy(predictions, targets, weights=None):
-  return metric_ops.streaming_accuracy(predictions, targets, weights=weights)
+  return metrics.accuracy(
+      labels=targets, predictions=predictions, weights=weights)
 
 
 def _r2(probabilities, targets, weights=None):
@@ -53,7 +54,7 @@ def _r2(probabilities, targets, weights=None):
   squares_residuals = math_ops.reduce_sum(
       math_ops.square(targets - probabilities), 0)
   score = 1 - math_ops.reduce_sum(squares_residuals / squares_total)
-  return metric_ops.streaming_mean(score, weights=weights)
+  return metrics.mean(score, weights=weights)
 
 
 def _squeeze_and_onehot(targets, depth):
@@ -62,7 +63,7 @@ def _squeeze_and_onehot(targets, depth):
 
 
 def _sigmoid_entropy(probabilities, targets, weights=None):
-  return metric_ops.streaming_mean(
+  return metrics.mean(
       losses.sigmoid_cross_entropy(probabilities,
                                    _squeeze_and_onehot(
                                        targets,
@@ -71,7 +72,7 @@ def _sigmoid_entropy(probabilities, targets, weights=None):
 
 
 def _softmax_entropy(probabilities, targets, weights=None):
-  return metric_ops.streaming_mean(
+  return metrics.mean(
       losses.sparse_softmax_cross_entropy(probabilities,
                                           math_ops.to_int32(targets)),
       weights=weights)
@@ -82,7 +83,7 @@ def _predictions(predictions, unused_targets, **unused_kwargs):
 
 
 def _class_log_loss(probabilities, targets, weights=None):
-  return metric_ops.streaming_mean(
+  return metrics.mean(
       losses.log_loss(probabilities,
                       _squeeze_and_onehot(targets,
                                           array_ops.shape(probabilities)[1])),
@@ -90,34 +91,36 @@ def _class_log_loss(probabilities, targets, weights=None):
 
 
 def _precision(predictions, targets, weights=None):
-  return metric_ops.streaming_precision(predictions, targets, weights=weights)
+  return metrics.precision(
+      labels=targets, predictions=predictions, weights=weights)
 
 
 def _precision_at_thresholds(predictions, targets, weights=None):
-  return metric_ops.streaming_precision_at_thresholds(
-      array_ops.slice(predictions, [0, 1], [-1, 1]),
-      targets,
-      np.arange(
-          0, 1, 0.01, dtype=np.float32),
+  return metrics.precision_at_thresholds(
+      labels=targets,
+      predictions=array_ops.slice(predictions, [0, 1], [-1, 1]),
+      thresholds=np.arange(0, 1, 0.01, dtype=np.float32),
       weights=weights)
 
 
 def _recall(predictions, targets, weights=None):
-  return metric_ops.streaming_recall(predictions, targets, weights=weights)
+  return metrics.recall(
+      labels=targets, predictions=predictions, weights=weights)
 
 
 def _recall_at_thresholds(predictions, targets, weights=None):
-  return metric_ops.streaming_recall_at_thresholds(
-      array_ops.slice(predictions, [0, 1], [-1, 1]),
-      targets,
-      np.arange(
-          0, 1, 0.01, dtype=np.float32),
+  return metrics.recall_at_thresholds(
+      labels=targets,
+      predictions=array_ops.slice(predictions, [0, 1], [-1, 1]),
+      thresholds=np.arange(0, 1, 0.01, dtype=np.float32),
       weights=weights)
 
 
 def _auc(probs, targets, weights=None):
-  return metric_ops.streaming_auc(array_ops.slice(probs, [0, 1], [-1, 1]),
-                                  targets, weights=weights)
+  return metrics.auc(
+      labels=targets,
+      predictions=array_ops.slice(probs, [0, 1], [-1, 1]),
+      weights=weights)
 
 
 _EVAL_METRICS = {
diff --git a/tensorflow/contrib/tensorboard/db/summary_db_writer.cc b/tensorflow/contrib/tensorboard/db/summary_db_writer.cc
index 630c0607ae21d0276a9dd0507346d5dc4ed9f4a9..cfdc884277a025aa11995d329389f3748b17490c 100644
--- a/tensorflow/contrib/tensorboard/db/summary_db_writer.cc
+++ b/tensorflow/contrib/tensorboard/db/summary_db_writer.cc
@@ -14,6 +14,8 @@ limitations under the License.
 ==============================================================================*/
 #include "tensorflow/contrib/tensorboard/db/summary_db_writer.h"
 
+#include <deque>
+
 #include "tensorflow/contrib/tensorboard/db/summary_converter.h"
 #include "tensorflow/core/framework/graph.pb.h"
 #include "tensorflow/core/framework/node_def.pb.h"
@@ -66,14 +68,9 @@ const char* kImagePluginName = "images";
 const char* kAudioPluginName = "audio";
 const char* kHistogramPluginName = "histograms";
 
-const int kScalarSlots = 10000;
-const int kImageSlots = 10;
-const int kAudioSlots = 10;
-const int kHistogramSlots = 1;
-const int kTensorSlots = 10;
-
 const int64 kReserveMinBytes = 32;
 const double kReserveMultiplier = 1.5;
+const int64 kPreallocateRows = 1000;
 
 // Flush is a misnomer because what we're actually doing is having lots
 // of commits inside any SqliteTransaction that writes potentially
@@ -139,22 +136,6 @@ void PatchPluginName(SummaryMetadata* metadata, const char* name) {
   }
 }
 
-int GetSlots(const Tensor& t, const SummaryMetadata& metadata) {
-  if (metadata.plugin_data().plugin_name() == kScalarPluginName) {
-    return kScalarSlots;
-  } else if (metadata.plugin_data().plugin_name() == kImagePluginName) {
-    return kImageSlots;
-  } else if (metadata.plugin_data().plugin_name() == kAudioPluginName) {
-    return kAudioSlots;
-  } else if (metadata.plugin_data().plugin_name() == kHistogramPluginName) {
-    return kHistogramSlots;
-  } else if (t.dims() == 0 && t.dtype() != DT_STRING) {
-    return kScalarSlots;
-  } else {
-    return kTensorSlots;
-  }
-}
-
 Status SetDescription(Sqlite* db, int64 id, const StringPiece& markdown) {
   const char* sql = R"sql(
     INSERT OR REPLACE INTO Descriptions (id, description) VALUES (?, ?)
@@ -481,24 +462,6 @@ class RunMetadata {
     return insert.StepAndReset();
   }
 
-  Status GetIsWatching(Sqlite* db, bool* is_watching)
-      SQLITE_TRANSACTIONS_EXCLUDED(*db) LOCKS_EXCLUDED(mu_) {
-    mutex_lock lock(mu_);
-    if (experiment_id_ == kAbsent) {
-      *is_watching = true;
-      return Status::OK();
-    }
-    const char* sql = R"sql(
-      SELECT is_watching FROM Experiments WHERE experiment_id = ?
-    )sql";
-    SqliteStatement stmt;
-    TF_RETURN_IF_ERROR(db->Prepare(sql, &stmt));
-    stmt.BindInt(1, experiment_id_);
-    TF_RETURN_IF_ERROR(stmt.StepOnce());
-    *is_watching = stmt.ColumnInt(0) != 0;
-    return Status::OK();
-  }
-
  private:
   Status InitializeUser(Sqlite* db, uint64 now) EXCLUSIVE_LOCKS_REQUIRED(mu_) {
     if (user_id_ != kAbsent || user_name_.empty()) return Status::OK();
@@ -659,43 +622,15 @@ class RunMetadata {
 
 /// \brief Tensor writer for a single series, e.g. Tag.
 ///
-/// This class can be used to write an infinite stream of Tensors to the
-/// database in a fixed block of contiguous disk space. This is
-/// accomplished using Algorithm R reservoir sampling.
-///
-/// The reservoir consists of a fixed number of rows, which are inserted
-/// using ZEROBLOB upon receiving the first sample, which is used to
-/// predict how big the other ones are likely to be. This is done
-/// transactionally in a way that tries to be mindful of other processes
-/// that might be trying to access the same DB.
-///
-/// Once the reservoir fills up, rows are replaced at random, and writes
-/// gradually become no-ops. This allows long training to go fast
-/// without configuration. The exception is when someone is actually
-/// looking at TensorBoard. When that happens, the "keep last" behavior
-/// is turned on and Append() will always result in a write.
-///
-/// If no one is watching training, this class still holds on to the
-/// most recent "dangling" Tensor, so if Finish() is called, the most
-/// recent training state can be written to disk.
-///
-/// The randomly selected sampling points should be consistent across
-/// multiple instances.
-///
 /// This class is thread safe.
 class SeriesWriter {
  public:
-  SeriesWriter(int64 series, int slots, RunMetadata* meta)
-      : series_{series},
-        slots_{slots},
-        meta_{meta},
-        rng_{std::mt19937_64::default_seed} {
+  SeriesWriter(int64 series, RunMetadata* meta) : series_{series}, meta_{meta} {
     DCHECK(series_ > 0);
-    DCHECK(slots_ > 0);
   }
 
   Status Append(Sqlite* db, int64 step, uint64 now, double computed_time,
-                Tensor t) SQLITE_TRANSACTIONS_EXCLUDED(*db)
+                const Tensor& t) SQLITE_TRANSACTIONS_EXCLUDED(*db)
       LOCKS_EXCLUDED(mu_) {
     mutex_lock lock(mu_);
     if (rowids_.empty()) {
@@ -705,41 +640,20 @@ class SeriesWriter {
         return s;
       }
     }
-    DCHECK(rowids_.size() == slots_);
-    int64 rowid;
-    size_t i = count_;
-    if (i < slots_) {
-      rowid = last_rowid_ = rowids_[i];
-    } else {
-      i = rng_() % (i + 1);
-      if (i < slots_) {
-        rowid = last_rowid_ = rowids_[i];
-      } else {
-        bool keep_last;
-        TF_RETURN_IF_ERROR(meta_->GetIsWatching(db, &keep_last));
-        if (!keep_last) {
-          ++count_;
-          dangling_tensor_.reset(new Tensor(std::move(t)));
-          dangling_step_ = step;
-          dangling_computed_time_ = computed_time;
-          return Status::OK();
-        }
-        rowid = last_rowid_;
-      }
-    }
+    int64 rowid = rowids_.front();
     Status s = Write(db, rowid, step, computed_time, t);
     if (s.ok()) {
       ++count_;
-      dangling_tensor_.reset();
     }
+    rowids_.pop_front();
     return s;
   }
 
   Status Finish(Sqlite* db) SQLITE_TRANSACTIONS_EXCLUDED(*db)
       LOCKS_EXCLUDED(mu_) {
     mutex_lock lock(mu_);
-    // Short runs: Delete unused pre-allocated Tensors.
-    if (count_ < rowids_.size()) {
+    // Delete unused pre-allocated Tensors.
+    if (!rowids_.empty()) {
       SqliteTransaction txn(*db);
       const char* sql = R"sql(
         DELETE FROM Tensors WHERE rowid = ?
@@ -747,19 +661,13 @@ class SeriesWriter {
       SqliteStatement deleter;
       TF_RETURN_IF_ERROR(db->Prepare(sql, &deleter));
       for (size_t i = count_; i < rowids_.size(); ++i) {
-        deleter.BindInt(1, rowids_[i]);
+        deleter.BindInt(1, rowids_.front());
         TF_RETURN_IF_ERROR(deleter.StepAndReset());
+        rowids_.pop_front();
       }
       TF_RETURN_IF_ERROR(txn.Commit());
       rowids_.clear();
     }
-    // Long runs: Make last sample be the very most recent one.
-    if (dangling_tensor_) {
-      DCHECK(last_rowid_ != kAbsent);
-      TF_RETURN_IF_ERROR(Write(db, last_rowid_, dangling_step_,
-                               dangling_computed_time_, *dangling_tensor_));
-      dangling_tensor_.reset();
-    }
     return Status::OK();
   }
 
@@ -783,7 +691,6 @@ class SeriesWriter {
 
   Status Update(Sqlite* db, int64 step, double computed_time, const Tensor& t,
                 const StringPiece& data, int64 rowid) {
-    // TODO(jart): How can we ensure reservoir fills on replace?
     const char* sql = R"sql(
       UPDATE OR REPLACE
         Tensors
@@ -878,7 +785,7 @@ class SeriesWriter {
     // TODO(jart): Maybe preallocate index pages by setting step. This
     //             is tricky because UPDATE OR REPLACE can have a side
     //             effect of deleting preallocated rows.
-    for (int64 i = 0; i < slots_; ++i) {
+    for (int64 i = 0; i < kPreallocateRows; ++i) {
       insert.BindInt(1, series_);
       insert.BindInt(2, reserved_bytes);
       TF_RETURN_WITH_CONTEXT_IF_ERROR(insert.StepAndReset(), "i=", i);
@@ -902,16 +809,10 @@ class SeriesWriter {
 
   mutex mu_;
   const int64 series_;
-  const int slots_;
   RunMetadata* const meta_;
-  std::mt19937_64 rng_ GUARDED_BY(mu_);
   uint64 count_ GUARDED_BY(mu_) = 0;
-  int64 last_rowid_ GUARDED_BY(mu_) = kAbsent;
-  std::vector<int64> rowids_ GUARDED_BY(mu_);
+  std::deque<int64> rowids_ GUARDED_BY(mu_);
   uint64 unflushed_bytes_ GUARDED_BY(mu_) = 0;
-  std::unique_ptr<Tensor> dangling_tensor_ GUARDED_BY(mu_);
-  int64 dangling_step_ GUARDED_BY(mu_) = 0;
-  double dangling_computed_time_ GUARDED_BY(mu_) = 0.0;
 
   TF_DISALLOW_COPY_AND_ASSIGN(SeriesWriter);
 };
@@ -928,10 +829,10 @@ class RunWriter {
   explicit RunWriter(RunMetadata* meta) : meta_{meta} {}
 
   Status Append(Sqlite* db, int64 tag_id, int64 step, uint64 now,
-                double computed_time, Tensor t, int slots)
+                double computed_time, const Tensor& t)
       SQLITE_TRANSACTIONS_EXCLUDED(*db) LOCKS_EXCLUDED(mu_) {
-    SeriesWriter* writer = GetSeriesWriter(tag_id, slots);
-    return writer->Append(db, step, now, computed_time, std::move(t));
+    SeriesWriter* writer = GetSeriesWriter(tag_id);
+    return writer->Append(db, step, now, computed_time, t);
   }
 
   Status Finish(Sqlite* db) SQLITE_TRANSACTIONS_EXCLUDED(*db)
@@ -948,11 +849,11 @@ class RunWriter {
   }
 
  private:
-  SeriesWriter* GetSeriesWriter(int64 tag_id, int slots) LOCKS_EXCLUDED(mu_) {
+  SeriesWriter* GetSeriesWriter(int64 tag_id) LOCKS_EXCLUDED(mu_) {
     mutex_lock sl(mu_);
     auto spot = series_writers_.find(tag_id);
     if (spot == series_writers_.end()) {
-      SeriesWriter* writer = new SeriesWriter(tag_id, slots, meta_);
+      SeriesWriter* writer = new SeriesWriter(tag_id, meta_);
       series_writers_[tag_id].reset(writer);
       return writer;
     } else {
@@ -1082,8 +983,7 @@ class SummaryDbWriter : public SummaryWriterInterface {
     TF_RETURN_IF_ERROR(
         meta_.GetTagId(db_, now, computed_time, tag, &tag_id, metadata));
     TF_RETURN_WITH_CONTEXT_IF_ERROR(
-        run_.Append(db_, tag_id, step, now, computed_time, t,
-                    GetSlots(t, metadata)),
+        run_.Append(db_, tag_id, step, now, computed_time, t),
         meta_.user_name(), "/", meta_.experiment_name(), "/", meta_.run_name(),
         "/", tag, "@", step);
     return Status::OK();
@@ -1155,8 +1055,7 @@ class SummaryDbWriter : public SummaryWriterInterface {
     int64 tag_id;
     TF_RETURN_IF_ERROR(meta_.GetTagId(db_, now, e->wall_time(), s->tag(),
                                       &tag_id, s->metadata()));
-    return run_.Append(db_, tag_id, e->step(), now, e->wall_time(), t,
-                       GetSlots(t, s->metadata()));
+    return run_.Append(db_, tag_id, e->step(), now, e->wall_time(), t);
   }
 
   // TODO(jart): Refactor Summary -> Tensor logic into separate file.
@@ -1169,8 +1068,7 @@ class SummaryDbWriter : public SummaryWriterInterface {
     PatchPluginName(s->mutable_metadata(), kScalarPluginName);
     TF_RETURN_IF_ERROR(meta_.GetTagId(db_, now, e->wall_time(), s->tag(),
                                       &tag_id, s->metadata()));
-    return run_.Append(db_, tag_id, e->step(), now, e->wall_time(),
-                       std::move(t), kScalarSlots);
+    return run_.Append(db_, tag_id, e->step(), now, e->wall_time(), t);
   }
 
   Status MigrateHistogram(const Event* e, Summary::Value* s, uint64 now) {
@@ -1201,8 +1099,7 @@ class SummaryDbWriter : public SummaryWriterInterface {
     PatchPluginName(s->mutable_metadata(), kHistogramPluginName);
     TF_RETURN_IF_ERROR(meta_.GetTagId(db_, now, e->wall_time(), s->tag(),
                                       &tag_id, s->metadata()));
-    return run_.Append(db_, tag_id, e->step(), now, e->wall_time(),
-                       std::move(t), kHistogramSlots);
+    return run_.Append(db_, tag_id, e->step(), now, e->wall_time(), t);
   }
 
   Status MigrateImage(const Event* e, Summary::Value* s, uint64 now) {
@@ -1216,8 +1113,7 @@ class SummaryDbWriter : public SummaryWriterInterface {
     PatchPluginName(s->mutable_metadata(), kImagePluginName);
     TF_RETURN_IF_ERROR(meta_.GetTagId(db_, now, e->wall_time(), s->tag(),
                                       &tag_id, s->metadata()));
-    return run_.Append(db_, tag_id, e->step(), now, e->wall_time(),
-                       std::move(t), kImageSlots);
+    return run_.Append(db_, tag_id, e->step(), now, e->wall_time(), t);
   }
 
   Status MigrateAudio(const Event* e, Summary::Value* s, uint64 now) {
@@ -1230,8 +1126,7 @@ class SummaryDbWriter : public SummaryWriterInterface {
     PatchPluginName(s->mutable_metadata(), kAudioPluginName);
     TF_RETURN_IF_ERROR(meta_.GetTagId(db_, now, e->wall_time(), s->tag(),
                                       &tag_id, s->metadata()));
-    return run_.Append(db_, tag_id, e->step(), now, e->wall_time(),
-                       std::move(t), kAudioSlots);
+    return run_.Append(db_, tag_id, e->step(), now, e->wall_time(), t);
   }
 
   Env* const env_;
diff --git a/tensorflow/contrib/tensorboard/db/summary_db_writer_test.cc b/tensorflow/contrib/tensorboard/db/summary_db_writer_test.cc
index 2044692b6e746bc317843d715fa17ab5ec0bf99d..2e8d4109dd624ab66d774668ad04def9a7d3cdf2 100644
--- a/tensorflow/contrib/tensorboard/db/summary_db_writer_test.cc
+++ b/tensorflow/contrib/tensorboard/db/summary_db_writer_test.cc
@@ -189,7 +189,7 @@ TEST_F(SummaryDbWriterTest, TensorsWritten_RowsGetInitialized) {
   ASSERT_EQ(1LL, QueryInt("SELECT COUNT(*) FROM Experiments"));
   ASSERT_EQ(1LL, QueryInt("SELECT COUNT(*) FROM Runs"));
   ASSERT_EQ(1LL, QueryInt("SELECT COUNT(*) FROM Tags"));
-  ASSERT_EQ(10000LL, QueryInt("SELECT COUNT(*) FROM Tensors"));
+  ASSERT_EQ(1000LL, QueryInt("SELECT COUNT(*) FROM Tensors"));
 
   int64 user_id = QueryInt("SELECT user_id FROM Users");
   int64 experiment_id = QueryInt("SELECT experiment_id FROM Experiments");
@@ -238,7 +238,7 @@ TEST_F(SummaryDbWriterTest, EmptyParentNames_NoParentsCreated) {
   ASSERT_EQ(0LL, QueryInt("SELECT COUNT(*) FROM Experiments"));
   ASSERT_EQ(0LL, QueryInt("SELECT COUNT(*) FROM Runs"));
   ASSERT_EQ(1LL, QueryInt("SELECT COUNT(*) FROM Tags"));
-  ASSERT_EQ(10000LL, QueryInt("SELECT COUNT(*) FROM Tensors"));
+  ASSERT_EQ(1000LL, QueryInt("SELECT COUNT(*) FROM Tensors"));
 }
 
 TEST_F(SummaryDbWriterTest, WriteEvent_Scalar) {
@@ -255,7 +255,7 @@ TEST_F(SummaryDbWriterTest, WriteEvent_Scalar) {
   TF_ASSERT_OK(writer_->WriteEvent(std::move(e)));
   TF_ASSERT_OK(writer_->Flush());
   ASSERT_EQ(2LL, QueryInt("SELECT COUNT(*) FROM Tags"));
-  ASSERT_EQ(20000LL, QueryInt("SELECT COUNT(*) FROM Tensors"));
+  ASSERT_EQ(2000LL, QueryInt("SELECT COUNT(*) FROM Tensors"));
   int64 tag1_id = QueryInt("SELECT tag_id FROM Tags WHERE tag_name = 'π'");
   int64 tag2_id = QueryInt("SELECT tag_id FROM Tags WHERE tag_name = 'φ'");
   EXPECT_GT(tag1_id, 0LL);
diff --git a/tensorflow/contrib/tensorrt/BUILD b/tensorflow/contrib/tensorrt/BUILD
index 7a8a71ac7f491ec48a47ae1ea1aff750a587beaa..a5d8b061b6b26f9d05be40a1162481ae219b0e9c 100644
--- a/tensorflow/contrib/tensorrt/BUILD
+++ b/tensorflow/contrib/tensorrt/BUILD
@@ -192,7 +192,7 @@ tf_py_wrap_cc(
         ":trt_conversion",
         ":trt_engine_op_kernel",
         "//tensorflow/core:framework_lite",
-        "//util/python:python_headers",
+        "//third_party/python_runtime:headers",
     ],
 )
 
@@ -303,7 +303,7 @@ tf_cuda_library(
     ],
     deps = [
         "//tensorflow/core:framework_lite",
-        "//tensorflow/core:platform_base",
+        "//tensorflow/core:lib_proto_parsing",
     ] + if_tensorrt([
         "@local_config_tensorrt//:nv_infer",
     ]),
diff --git a/tensorflow/contrib/tensorrt/convert/convert_graph.cc b/tensorflow/contrib/tensorrt/convert/convert_graph.cc
index b7b26cfb1c05ae74e932c8b9cb2479cfca308514..da4dd5a14cd74591fc9df63cd5868044e4e369ec 100644
--- a/tensorflow/contrib/tensorrt/convert/convert_graph.cc
+++ b/tensorflow/contrib/tensorrt/convert/convert_graph.cc
@@ -91,8 +91,11 @@ void GetSubGraphIncomingEdges(const tensorflow::Graph& graph,
       if (!subgraph_node_ids.count(edge->src()->id()) &&
           !edge->src()->IsSource() && !edge->IsControlEdge()) {
         incoming_edges->insert(edge);
+        VLOG(2) << "INCOMING " << edge->src()->name() << " -> " << node->name()
+                << " Y, ";
       } else {
-        VLOG(2) << node->name() << " -> " << edge->src()->name() << " N, ";
+        VLOG(2) << "INCOMING " << edge->src()->name() << " -> " << node->name()
+                << " N, ";
       }
     }
   }
@@ -106,10 +109,12 @@ void GetSubGraphOutgoingEdges(const tensorflow::Graph& graph,
     for (const tensorflow::Edge* edge : node->out_edges()) {
       if (!subgraph_node_ids.count(edge->dst()->id()) &&
           !edge->dst()->IsSink() && !edge->IsControlEdge()) {
-        VLOG(2) << node->name() << " -> " << edge->dst()->name() << " Y, ";
+        VLOG(2) << "OUTGOING " << node->name() << " -> " << edge->dst()->name()
+                << " Y, ";
         outgoing_edges->insert(edge);
       } else {
-        VLOG(2) << node->name() << " -> " << edge->dst()->name() << " N, ";
+        VLOG(2) << "OUTGOING " << node->name() << " -> " << edge->dst()->name()
+                << " N, ";
       }
     }
   }
@@ -181,29 +186,27 @@ struct ConvertGraphParams {
 static tensorflow::Status FillSubGraphEdgeSets(ConvertGraphParams* p) {
   GetSubGraphIncomingEdges(p->graph, p->subgraph_node_ids,
                            &p->subgraph_incoming_edges);
+
+  std::set<std::pair<int, int>> unique_tensors;
+  // Add only unique input source nodes. If output of an outside node is shared
+  // between multiple nodes inside the engine, only one edge should be created
   for (const tensorflow::Edge* edge : p->subgraph_incoming_edges) {
-    p->subgraph_inputs.push_back({edge->src()->id(), edge->src_output()});
-  }
-  auto output_name_to_index_map = BuildTensorNameMap(p->output_names);
-  std::set<std::pair<int, int>> subgraph_outputs_set;
-  // Collect outputs referenced from output_names
-  for (int node_id : p->subgraph_node_ids) {
-    tensorflow::Node* node = p->graph.FindNodeId(node_id);
-    if (output_name_to_index_map.count(node->name())) {
-      for (int index : output_name_to_index_map.at(node->name())) {
-        subgraph_outputs_set.insert({node_id, index});
-      }
-    }
+    unique_tensors.insert({edge->src()->id(), edge->src_output()});
   }
+  p->subgraph_inputs.insert(p->subgraph_inputs.begin(), unique_tensors.begin(),
+                            unique_tensors.end());
   GetSubGraphOutgoingEdges(p->graph, p->subgraph_node_ids,
                            &p->subgraph_outgoing_edges);
+  unique_tensors.clear();
+  // Similar to above, if multiple ouside nodes are sharing the output of an
+  // internal node only one output port should be created and shared between
+  // outputs
   for (const tensorflow::Edge* edge : p->subgraph_outgoing_edges) {
-    subgraph_outputs_set.insert({edge->src()->id(), edge->src_output()});
+    unique_tensors.insert({edge->src()->id(), edge->src_output()});
   }
-  p->subgraph_outputs.reserve(subgraph_outputs_set.size());
+  p->subgraph_outputs.reserve(unique_tensors.size());
   p->subgraph_outputs.insert(p->subgraph_outputs.begin(),
-                             subgraph_outputs_set.begin(),
-                             subgraph_outputs_set.end());
+                             unique_tensors.begin(), unique_tensors.end());
   return tensorflow::Status::OK();
 }
 
@@ -225,7 +228,6 @@ tensorflow::Status GetCalibNode(ConvertGraphParams* params) {
   for (auto in_edge :
        params->subgraph_incoming_edges) {  // loop over incoming edges and
                                            // attach them to calib node
-    // tensorflow::Node* src_node = in_edge->src();
     auto src_output = in_edge->src_output();
     auto dst_node = in_edge->dst();
     auto dst_input = in_edge->dst_input();
@@ -257,19 +259,24 @@ tensorflow::Status ConvertSubGraphToTensorRT(ConvertGraphParams* params) {
   for (size_t i = 0; i < params->subgraph_inputs.size(); ++i) {
     subgraph_edge_to_input_map.insert({params->subgraph_inputs.at(i), i});
   }
+  std::set<std::pair<int, int>> unique_tensors;
   for (const tensorflow::Edge* edge : params->subgraph_incoming_edges) {
     std::pair<int, int> old_src = {edge->src()->id(), edge->src_output()};
+    if (unique_tensors.count(old_src)) continue;
+    unique_tensors.insert(old_src);
     int new_src_output = subgraph_edge_to_input_map.at(old_src);
     params->graph.AddEdge(edge->src(), edge->src_output(), trt_node,
                           new_src_output);
+    VLOG(1) << "Wire " << edge->src()->name() << ":" << edge->src_output()
+            << " -> " << trt_node->name() << ":" << new_src_output;
     params->graph.RemoveEdge(edge);
   }
-
-  VLOG(2) << "new wiring edges: " << trt_node->in_edges().size();
-  for (const tensorflow::Edge* edge : trt_node->in_edges()) {
-    VLOG(2) << edge->src()->name() << " port: " << edge->src_output();
+  if (VLOG_IS_ON(2)) {
+    VLOG(2) << "new edge count: " << trt_node->in_edges().size();
+    for (const tensorflow::Edge* edge : trt_node->in_edges()) {
+      VLOG(2) << edge->src()->name() << " port: " << edge->src_output();
+    }
   }
-
   TF_RETURN_IF_ERROR(status);
 
   // Re-map outgoing edges to use the new TRT node instead of the orig subgraph
@@ -283,6 +290,8 @@ tensorflow::Status ConvertSubGraphToTensorRT(ConvertGraphParams* params) {
     int new_src_output = subgraph_edge_to_output_map.at(old_src);
     TF_RETURN_IF_ERROR(params->graph.UpdateEdge(
         trt_node, new_src_output, edge->dst(), edge->dst_input()));
+    VLOG(1) << "Wire " << trt_node->name() << ":" << new_src_output << " -> "
+            << edge->dst()->name() << ":" << edge->dst_input();
   }
   // Remove the original subgraph
   for (int node_id : params->subgraph_node_ids) {
@@ -317,9 +326,12 @@ tensorflow::Status ConvertCalibGraphToInferGraph(
       tensorflow::GraphConstructorOptions(), graph_def, &graph));
   //  get calib nodes
   std::vector<tensorflow::Node*> calib_nodes;
-  for (auto node : graph.op_nodes()) {
+  std::vector<tensorflow::Node*> topo_order;
+  tensorflow::GetPostOrder(graph, &topo_order);
+  for (auto rit = topo_order.rbegin(); rit != topo_order.rend(); ++rit) {
+    auto node = *rit;
     if (node->type_string() == "TRTCalibOp") {
-      VLOG(1) << "Found Calib Node";
+      VLOG(1) << "Found Calib Node " << node->name();
       calib_nodes.push_back(node);
     }
   }
diff --git a/tensorflow/contrib/tensorrt/convert/convert_nodes.cc b/tensorflow/contrib/tensorrt/convert/convert_nodes.cc
index 32b211dcd1e282d334327b83a27f9401de7f310a..4e4d295538edadd26a347a38ec141737f097f26f 100644
--- a/tensorflow/contrib/tensorrt/convert/convert_nodes.cc
+++ b/tensorflow/contrib/tensorrt/convert/convert_nodes.cc
@@ -362,10 +362,11 @@ void ReorderCKtoKC(const TRT_ShapedWeights& iweights,
       break;
     }
     case tensorflow::DataType::DT_HALF: {
-      Reorder2({k, c}, static_cast<Eigen::half const*>(iweights.GetValues()),
-               istrides, static_cast<Eigen::half*>(
-                             const_cast<void*>(oweights->GetValues())),
-               ostrides);
+      Reorder2(
+          {k, c}, static_cast<Eigen::half const*>(iweights.GetValues()),
+          istrides,
+          static_cast<Eigen::half*>(const_cast<void*>(oweights->GetValues())),
+          ostrides);
       break;
     }
     default:
@@ -1179,9 +1180,9 @@ tensorflow::Status BinaryTensorOpTensor(
   CHECK_EQ_TYPE(tensor_r->getType(), dtype);
   auto op_pair = ops.find(node_def.op());
   if (op_pair == ops.end())
-    return tensorflow::errors::Unimplemented("binary op: " + node_def.op() +
-                                             " not supported at: " +
-                                             node_def.name());
+    return tensorflow::errors::Unimplemented(
+        "binary op: " + node_def.op() +
+        " not supported at: " + node_def.name());
 
   nvinfer1::IElementWiseLayer* layer = ctx.network()->addElementWise(
       *const_cast<nvinfer1::ITensor*>(tensor_l),
@@ -2138,9 +2139,7 @@ void Converter::register_op_converters() {
 }
 
 }  // namespace
-tensorflow::Status GetTensorRTGraph(tensorrt::convert::SubGraphParams& s) {
-  return tensorflow::errors::Unimplemented("Not implemented yet");
-}
+
 tensorflow::Status ConvertCalibrationNodeToEngineNode(
     tensorflow::Graph& graph, tensorflow::Node* c_node) {
   const auto ndef = c_node->def();
@@ -2164,9 +2163,23 @@ tensorflow::Status ConvertCalibrationNodeToEngineNode(
   for (auto n : graph.op_nodes()) {
     node_maps.insert({n->name(), n});
   }
+  std::set<int> subgraph_ids;
+  for (const auto internal_node : segment_nodes) {
+    subgraph_ids.insert(node_maps.at(internal_node)->id());
+  }
+  if (VLOG_IS_ON(2)) {
+    string node_names = StrCat(c_node->name(), " segment nodes= ");
+
+    for (const auto& node_name : segment_nodes) {
+      StrAppend(&node_names, node_name, ", ");
+    }
+    VLOG(2) << node_names;
+  }
+
   VLOG(1) << "Output Nodes:";
   std::vector<tensorflow::DataType> out_types;
   std::vector<const tensorflow::Edge*> out_edges;
+
   for (auto& i : output_nodes) {
     auto node_port = tensorflow::str_util::Split(i, ":");
     VLOG(1) << " " << i << " in graph " << node_maps.count(i);
@@ -2186,18 +2199,24 @@ tensorflow::Status ConvertCalibrationNodeToEngineNode(
         out_types.push_back(out_node->output_type(0));
       }
       for (auto out_edge : out_node->out_edges()) {
+        if (subgraph_ids.count(out_edge->dst()->id()))
+          continue;  // skip internal edges;
         if (out_edge->src_output() == port) {
           out_edges.push_back(out_edge);
-          break;
+          VLOG(1) << "OUTPUT EDGE " << out_edge->src()->name() << ":"
+                  << out_edge->src_output() << " -> " << out_edge->dst()->name()
+                  << ":" << out_edge->dst_input();
         }
       }
     } else {
       LOG(WARNING) << " couldn't find output node " << out_node_name;
     }
   }
-  VLOG(1) << "Input Nodes:";
-  for (auto& i : input_names) {
-    VLOG(1) << " " << i << " in graph " << node_maps.count(i);
+  if (VLOG_IS_ON(1)) {
+    VLOG(1) << c_node->name() << " Input Nodes:";
+    for (auto& i : input_names) {
+      VLOG(1) << " Input " << i << " in graph " << node_maps.count(i);
+    }
   }
   auto trt_rm = tensorflow::tensorrt::TRTResourceManager::instance();
   auto resmgr = trt_rm->getManager("TRTCalibOps");
@@ -2231,14 +2250,24 @@ tensorflow::Status ConvertCalibrationNodeToEngineNode(
   calib_res->builder_ = nullptr;
   tensorflow::NodeDefBuilder op_builder(engine_name, "TRTEngineOp");
   std::vector<tensorflow::NodeDefBuilder::NodeOut> income_edges;
+  income_edges.resize(c_node->num_inputs());
   for (const auto in_edge : c_node->in_edges()) {
     auto src = in_edge->src();
     int dest_port = in_edge->dst_input();
-    income_edges.emplace_back(src->name(), in_edge->src_output(),
-                              c_node->input_type(dest_port));
+    VLOG(1) << "Incoming connection " << src->name() << ":"
+            << in_edge->src_output() << " -> " << c_node->name() << ":"
+            << dest_port;
+    income_edges.at(dest_port) = {src->name(), in_edge->src_output(),
+                                  c_node->input_type(dest_port)};
   }
   tensorflow::gtl::ArraySlice<tensorflow::NodeDefBuilder::NodeOut> input_list(
       income_edges);
+  if (VLOG_IS_ON(2)) {
+    for (const auto& inp : input_list) {
+      VLOG(2) << " Input from inputlist " << inp.node << ":" << inp.index << " "
+              << tensorflow::DataTypeString(inp.data_type);
+    }
+  }
   op_builder.Input(input_list);
   tensorflow::NodeDef engine_node;
   const char* engine_plan_data = static_cast<const char*>(engine_plan->data());
@@ -2255,13 +2284,26 @@ tensorflow::Status ConvertCalibrationNodeToEngineNode(
   }
   auto trt_engine_node = graph.AddNode(engine_node, &status);
   TF_RETURN_IF_ERROR(status);
-  for (size_t i = 0; i < out_edges.size(); i++) {
-    VLOG(1) << "Connecting trt_engine_node output " << i << " with "
-            << out_edges.at(i)->dst()->name() << " port "
-            << out_edges.at(i)->dst_input();
-    TF_RETURN_IF_ERROR(graph.UpdateEdge(trt_engine_node, i,
-                                        out_edges.at(i)->dst(),
-                                        out_edges.at(i)->dst_input()));
+  std::map<string, int> port_map;
+  for (size_t t = 0; t < output_nodes.size(); t++) {
+    port_map.insert({output_nodes.at(t), t});
+  }
+  for (auto& i : out_edges) {
+    string s(i->src()->name());
+    if (i->src_output()) StrAppend(&s, ":", i->src_output());
+    int out_port = port_map.at(s);
+    VLOG(1) << "Connecting " << trt_engine_node->name() << ":" << out_port
+            << " -> " << i->dst()->name() << ":" << i->dst_input();
+    TF_RETURN_IF_ERROR(
+        graph.UpdateEdge(trt_engine_node, out_port, i->dst(), i->dst_input()));
+  }
+  for (const auto ed : trt_engine_node->in_edges()) {
+    VLOG(1) << "In Edge  " << ed->src()->name() << ":" << ed->src_output()
+            << " -> " << ed->dst()->name() << ":" << ed->dst_input();
+  }
+  for (const auto ed : trt_engine_node->out_edges()) {
+    VLOG(1) << "Out Edge " << ed->src()->name() << ":" << ed->src_output()
+            << " -> " << ed->dst()->name() << ":" << ed->dst_input();
   }
   VLOG(1) << "Segment nodes:";
   for (auto& i : segment_nodes) {
@@ -2332,6 +2374,7 @@ tensorflow::Status ConvertSubgraph(
     std::vector<string>* output_names,
     std::vector<tensorflow::DataType>* output_dtypes,
     const string& engine_name) {
+  std::set<string> added_tensors;
   for (const std::pair<int, int>& input : s.input_inds) {
     VLOG(2) << "parsing input. Node id= " << input.first;
     int node_id = input.first;
@@ -2374,7 +2417,6 @@ tensorflow::Status ConvertSubgraph(
 
     auto op_info = op_info_vec.at(shape_inference_output_idx);
     tensorflow::DataType tf_dtype = op_info.dtype();
-    input_dtypes->push_back(tf_dtype);
 
     nvinfer1::DataType dtype(nvinfer1::DataType::kFLOAT);
     auto type_status = ConvertDType(tf_dtype, &dtype);
@@ -2410,8 +2452,10 @@ tensorflow::Status ConvertSubgraph(
     if (output_idx != 0) {
       input_tensor_name = StrCat(node_name, ":", output_idx);
     }
-
+    if (added_tensors.count(input_tensor_name)) continue;
+    added_tensors.insert(input_tensor_name);
     input_names->push_back(input_tensor_name);
+    input_dtypes->push_back(tf_dtype);
     nvinfer1::ITensor* input_tensor = converter.network()->addInput(
         input_tensor_name.c_str(), dtype, input_dim_pseudo_chw);
 
@@ -2435,6 +2479,7 @@ tensorflow::Status ConvertSubgraph(
 
   // Gather output metadata
   int trt_engine_op_output_idx = 0;
+  added_tensors.clear();
   for (const std::pair<int, int>& output : s.output_inds) {
     int node_id = output.first;
     int output_idx = output.second;
@@ -2451,6 +2496,8 @@ tensorflow::Status ConvertSubgraph(
     if (output_idx != 0)
       tensorflow::strings::StrAppend(&tensor_name, ":", output_idx);
     VLOG(2) << "Output tensor name: " << tensor_name;
+    if (added_tensors.count(tensor_name)) continue;
+    added_tensors.insert(tensor_name);
     output_names->push_back(tensor_name);
     auto tensor_or_weights = converter.get_tensor(tensor_name);
     if (!tensor_or_weights.is_tensor()) {
@@ -2534,7 +2581,7 @@ tensorflow::Status InjectCalibrationNode(tensorrt::convert::SubGraphParams& s) {
   // Build the TRT op
   // TODO(sami,ben,jie): proper naming!
   tensorflow::NodeDefBuilder op_builder(calib_op_name, "TRTCalibOp");
-  SetInputList(s, &op_builder, &input_names, &input_dtypes);
+  TF_RETURN_IF_ERROR(SetInputList(s, &op_builder, &input_names, &input_dtypes));
 
   std::vector<string> segment_names;
   segment_names.reserve(s.subgraph_node_ids.size());
@@ -2632,7 +2679,7 @@ tensorflow::Status ConvertSubGraphToTensorRTNodeDef(
 
   // Build the TRT op
   tensorflow::NodeDefBuilder op_builder(engine_name, "TRTEngineOp");
-  SetInputList(s, &op_builder, &input_names, &input_dtypes);
+  TF_RETURN_IF_ERROR(SetInputList(s, &op_builder, &input_names, &input_dtypes));
 
   VLOG(0) << "Finished op preparation";
 
diff --git a/tensorflow/contrib/tensorrt/segment/segment_test.cc b/tensorflow/contrib/tensorrt/segment/segment_test.cc
index 2de3923b06a8ddf89c7e6f922138a85f55a618d6..f5b2d258d70d5577a9d68f2d9f6d6e678ede97ce 100644
--- a/tensorflow/contrib/tensorrt/segment/segment_test.cc
+++ b/tensorflow/contrib/tensorrt/segment/segment_test.cc
@@ -275,13 +275,13 @@ TEST_F(SegmentTest, Multiple) {
   // Expect two subgraphs
   EXPECT_EQ(segments.size(), 2);
 
-  std::vector<string> expected0{"add0", "add1", "add2", "add3"};
+  std::vector<string> expected0{"add6", "add8"};
   for (const auto& ex : expected0) {
     EXPECT_TRUE(segments[0].first.find(ex) != segments[0].first.end())
         << "Missing expected node " << ex;
   }
 
-  std::vector<string> expected1{"add6", "add8"};
+  std::vector<string> expected1{"add0", "add1", "add2", "add3"};
   for (const auto& ex : expected1) {
     EXPECT_TRUE(segments[1].first.find(ex) != segments[1].first.end())
         << "Missing expected node " << ex;
diff --git a/tensorflow/contrib/tpu/profiler/BUILD b/tensorflow/contrib/tpu/profiler/BUILD
index dbf1ab6bbf0ddc7429d8e19279451eb862981e0c..3b2d7adfff6b8de3145a73756a8b5306445034c5 100644
--- a/tensorflow/contrib/tpu/profiler/BUILD
+++ b/tensorflow/contrib/tpu/profiler/BUILD
@@ -53,7 +53,7 @@ tf_cc_binary(
         "//tensorflow/core:lib",
         "//tensorflow/core/distributed_runtime/rpc:grpc_util",
         "//tensorflow/core/platform/cloud:gcs_file_system",
-        "@grpc//:grpc++_unsecure",
+        "@grpc//:grpc++",
     ],
 )
 
diff --git a/tensorflow/contrib/tpu/profiler/capture_tpu_profile.cc b/tensorflow/contrib/tpu/profiler/capture_tpu_profile.cc
index 816897499b7a49365060c026d60b977990f3ecdc..f80f5652af79d410946971573ae160fdd0b85f6d 100644
--- a/tensorflow/contrib/tpu/profiler/capture_tpu_profile.cc
+++ b/tensorflow/contrib/tpu/profiler/capture_tpu_profile.cc
@@ -18,7 +18,7 @@ limitations under the License.
 // Initiates a TPU profiling on the TPUProfiler service at service_addr,
 // receives and dumps the profile data to a tensorboard log directory.
 
-#include "grpc++/grpc++.h"
+#include "grpcpp/grpcpp.h"
 
 #include <cstdio>
 #include <ctime>
@@ -79,7 +79,9 @@ ProfileRequest PopulateProfileRequest(int duration_ms,
     request.set_repository_root(repository_root);
     request.set_session_id(session_id);
   }
+  request.add_tools("op_profile");
   request.add_tools("input_pipeline");
+  request.add_tools("memory_viewer");
   request.add_tools("overview_page");
   *request.mutable_opts() = opts;
   std::cout << "Limiting the number of trace events to " << kMaxEvents
diff --git a/tensorflow/contrib/tpu/profiler/dump_tpu_profile.cc b/tensorflow/contrib/tpu/profiler/dump_tpu_profile.cc
index 73d941e5e99801d239441e438e9d640478c4e6b6..98cc31f18d2d34765f2c123c3d34207802541036 100644
--- a/tensorflow/contrib/tpu/profiler/dump_tpu_profile.cc
+++ b/tensorflow/contrib/tpu/profiler/dump_tpu_profile.cc
@@ -38,6 +38,7 @@ namespace {
 using ::tensorflow::io::JoinPath;
 using ::tensorflow::protobuf::util::JsonOptions;
 using ::tensorflow::protobuf::util::MessageToJsonString;
+using ::tensorflow::str_util::EndsWith;
 using ::tensorflow::strings::StrCat;
 
 constexpr char kGraphRunPrefix[] = "tpu_profiler.hlo_graph.";
@@ -46,6 +47,9 @@ constexpr char kJsonTraceFileName[] = "trace.json.gz";
 constexpr char kProfilePluginDirectory[] = "plugins/profile/";
 constexpr char kProtoTraceFileName[] = "trace";
 
+constexpr char kFlatProfilerFileName[] = "flat_profiler.pb";
+constexpr char kTfStatsHelperSuffix[] = "tf_stats_helper_result";
+
 Status WriteGzippedDataToFile(const string& filename, const string& data) {
   std::unique_ptr<WritableFile> file;
   TF_RETURN_IF_ERROR(Env::Default()->NewWritableFile(filename, &file));
@@ -107,6 +111,10 @@ Status DumpToolDataToLogDirectory(StringPiece run_dir,
                                   const string& host_prefix,
                                   const tensorflow::ProfileToolData& tool,
                                   std::ostream* os) {
+  // Don't save the intermediate results for combining the per host tool data.
+  if (EndsWith(tool.name(), kFlatProfilerFileName) ||
+      EndsWith(tool.name(), kTfStatsHelperSuffix))
+    return Status::OK();
   string path = JoinPath(run_dir, StrCat(host_prefix, tool.name()));
   TF_RETURN_IF_ERROR(WriteStringToFile(Env::Default(), path, tool.data()));
   if (os) {
diff --git a/tensorflow/contrib/tpu/profiler/tf_op_stats.proto b/tensorflow/contrib/tpu/profiler/tf_op_stats.proto
index b9ac1a550c87e055fd5d555c346d5ec545bbe634..2b13343efa4e82386cb9259432b854be3ec821f7 100644
--- a/tensorflow/contrib/tpu/profiler/tf_op_stats.proto
+++ b/tensorflow/contrib/tpu/profiler/tf_op_stats.proto
@@ -87,6 +87,8 @@ message StepInfoResult {
   optional uint64 wait_duration_ps = 5;
   // The time spent on cross-replica-sum in picoseconds.
   optional uint64 crs_duration_ps = 6;
+  // Percentage of unit b time spent on infeed.
+  optional double unit_b_infeed_percent = 7;
 }
 
 // Result proto for a sequence of steps.
diff --git a/tensorflow/contrib/tpu/profiler/tpu_profiler.proto b/tensorflow/contrib/tpu/profiler/tpu_profiler.proto
index 7be694e866729c58efae4ccf7932dd929c03ed91..f0fca63db0bca80cdaa27e491b2a03ae2246c007 100644
--- a/tensorflow/contrib/tpu/profiler/tpu_profiler.proto
+++ b/tensorflow/contrib/tpu/profiler/tpu_profiler.proto
@@ -68,7 +68,8 @@ message ProfileRequest {
 }
 
 message ProfileToolData {
-  // The tool's name which this data is associated. (e.g. "input_pipeline".)
+  // The file name which this data is associated (e.g. "input_pipeline.json",
+  // "cluster_xxx.memory_viewer.json").
   string name = 1;
 
   // The data payload (likely json) for the specific tool.
diff --git a/tensorflow/contrib/tpu/profiler/tpu_profiler_analysis.proto b/tensorflow/contrib/tpu/profiler/tpu_profiler_analysis.proto
index 8b0bbde98e6a1dee8ade789328f3ba0624049562..d3c34bfd490080b86cf3d8b893c550f3a87bbbed 100644
--- a/tensorflow/contrib/tpu/profiler/tpu_profiler_analysis.proto
+++ b/tensorflow/contrib/tpu/profiler/tpu_profiler_analysis.proto
@@ -38,6 +38,9 @@ message EnumProfileSessionsAndToolsResponse {
 message ProfileSessionDataRequest {
   string repository_root = 1;
   string session_id = 2;
+  // Which host the data is associated. if empty, data from all hosts are
+  // aggregated.
+  string host_name = 5;
   // Which tool
   string tool_name = 3;
   // Tool's specific parameters. e.g. TraceViewer's viewport etc
diff --git a/tensorflow/contrib/tpu/python/tpu/tpu.py b/tensorflow/contrib/tpu/python/tpu/tpu.py
index e2f57ce9c503f454d2bf1eae895922a4a4d26ced..cd0fd6ae8a2b35efa85bb4583ed3846a1a33395f 100644
--- a/tensorflow/contrib/tpu/python/tpu/tpu.py
+++ b/tensorflow/contrib/tpu/python/tpu/tpu.py
@@ -21,6 +21,7 @@ from __future__ import print_function
 
 from six.moves import xrange  # pylint: disable=redefined-builtin
 
+from tensorflow.contrib.framework.python.framework import experimental
 from tensorflow.contrib.tpu.python.ops import tpu_ops
 from tensorflow.contrib.tpu.python.tpu import tpu_function
 
@@ -125,7 +126,19 @@ class TPUReplicateContext(control_flow_ops.XLAControlFlowContext):
   outside the replicated computation.
   """
 
-  def __init__(self, name, num_replicas):
+  def __init__(self, name, num_replicas, pivot):
+    """Builds a new TPUReplicateContext.
+
+    Args:
+      name: a unique name for the context, used to populate the `_tpu_replicate`
+        attribute.
+      num_replicas: an integer that gives the number of replicas for the
+        computation.
+      pivot: a pivot node. Nodes in the TPUReplicateContext that do not have any
+        inputs will have a control dependency on the pivot node. This ensures
+        that nodes are correctly included in any enclosing control flow
+        contexts.
+    """
     super(TPUReplicateContext, self).__init__()
     self._num_replicas = num_replicas
     self._outer_device_function_stack = None
@@ -137,6 +150,7 @@ class TPUReplicateContext(control_flow_ops.XLAControlFlowContext):
     self._host_compute_core = []
     self._name = name
     self._unsupported_ops = []
+    self._pivot = pivot
 
   def report_unsupported_operations(self):
     if self._unsupported_ops:
@@ -261,9 +275,6 @@ class TPUReplicateContext(control_flow_ops.XLAControlFlowContext):
       self._outer_device_function_stack = list(graph._device_function_stack)  # pylint: disable=protected-access
     super(TPUReplicateContext, self).Enter()
 
-  def Exit(self):
-    super(TPUReplicateContext, self).Exit()
-
   def HostComputeCore(self):
     return self._host_compute_core
 
@@ -299,10 +310,64 @@ class TPUReplicateContext(control_flow_ops.XLAControlFlowContext):
       op.graph.prevent_feeding(op)
       op.graph.prevent_fetching(op)
 
+    # Remove any control edges from outer control flow contexts. These may cause
+    # mismatched frame errors.
+    control_inputs, external_inputs = self._RemoveExternalControlEdges(op)
+
+    if not op.inputs:
+      # Add a control edge from the control pivot to this op.
+      if not control_inputs:
+        # pylint: disable=protected-access
+        op._add_control_input(self.GetControlPivot())
+        # pylint: enable=protected-access
+    else:
+      for index in xrange(len(op.inputs)):
+        x = op.inputs[index]
+        real_x = self.AddValue(x)
+        if real_x != x:
+          op._update_input(index, real_x)  # pylint: disable=protected-access
+
+    if external_inputs:
+      # Use an identity to pull control inputs as data inputs. Note that we
+      # ignore ops which don't have outputs. TODO(phawkins): fix that.
+      with ops.control_dependencies(None):
+        self.Enter()
+        external_inputs = [
+            array_ops.identity(x.outputs[0]).op
+            for x in external_inputs
+            if x.outputs
+        ]
+        self.Exit()
+      # pylint: disable=protected-access
+      op._add_control_inputs(external_inputs)
+      # pylint: enable=protected-access
+
+    # Mark op's outputs as seen by this context and any outer contexts.
+    output_names = [x.name for x in op.outputs]
+    context = self
+    while context is not None:
+      # pylint: disable=protected-access
+      context._values.update(output_names)
+      context = context._outer_context
+      # pylint: enable=protected-access
+
+    if self._outer_context:
+      self._outer_context.AddInnerOp(op)
+
   def AddValue(self, val):
+    if val.name in self._values:
+      # Use the real value if it comes from outer context.
+      result = self._external_values.get(val.name)
+      return val if result is None else result
+
     result = val
+    self._values.add(val.name)
     if self._outer_context:
       result = self._outer_context.AddValue(val)
+      self._values.add(result.name)
+
+    self._external_values[val.name] = result
+
     return result
 
   def AddInnerOp(self, op):
@@ -318,17 +383,30 @@ class TPUReplicateContext(control_flow_ops.XLAControlFlowContext):
     # grad_state should be as if this is the top-level gradient state.
     return None
 
+  @property
+  def back_prop(self):
+    """Forwards to the enclosing while context, if any."""
+    if self.GetWhileContext():
+      return self.GetWhileContext().back_prop
+    return False
 
-def outside_compilation(computation, args=None):
+  def GetControlPivot(self):
+    return self._pivot
+
+
+def outside_compilation(computation, *args, **kwargs):
   """Builds part of a computation outside any current TPU replicate scope.
 
   Args:
     computation: A Python function that builds the computation to
       place on the host.
-    args: Inputs to pass to computation.
+    *args: the positional arguments for the computation.
+    **kwargs: the keyword arguments for the computation.
+
   Returns:
     The Tensors returned by computation.
   """
+  args = [] if args is None else args
   graph = ops.get_default_graph()
 
   # If we are in a TPUReplicateContext, signal that we are now
@@ -340,7 +418,7 @@ def outside_compilation(computation, args=None):
       context._EnterOutsideCompilationScope()  # pylint: disable=protected-access
     context = context.outer_context
 
-  retval = computation(*args)
+  retval = computation(*args, **kwargs)
 
   # If we are in a TPUReplicateContext, signal that we are no longer
   # outside_compilation
@@ -501,7 +579,9 @@ def split_compile_and_replicate(computation,
         tpu_ops.tpu_replicated_input(replicas, name="input{}".format(i)))
 
   cluster_name = graph.unique_name("cluster")
-  context = TPUReplicateContext(name=cluster_name, num_replicas=num_replicas)
+  pivot = control_flow_ops.no_op(name=cluster_name + "/pivot")
+  context = TPUReplicateContext(
+      name=cluster_name, num_replicas=num_replicas, pivot=pivot)
   try:
     context.Enter()
 
@@ -543,10 +623,16 @@ def split_compile_and_replicate(computation,
 
       vscope.set_use_resource(saved_use_resource)
 
+    # If the computation returns `None`, make it an empty tuple.
+    if outputs is None:
+      outputs = tuple()
     # If the computation only returned one value, makes it a tuple.
     if not isinstance(outputs, (list, tuple)):
       outputs = (outputs,)
 
+    # Append `no_op` here so that fetching any return value of this function
+    # will trigger TPUExecute node.
+    outputs += (control_flow_ops.no_op(),)
     try:
       with ops.device(core(0)):
         outputs = [
@@ -578,6 +664,7 @@ def split_compile_and_replicate(computation,
       with ops.device(t.device if t.device else core(0)):
         new_output_tensors.append(array_ops.identity(t))
     output_tensors = new_output_tensors
+    context.ExitResult(output_tensors)
   finally:
     context.report_unsupported_operations()
     context.Exit()
@@ -867,3 +954,152 @@ def rewrite(computation,
       device_assignment=device_assignment,
       name=name)[0]
   # pylint: enable=indexing-exception
+
+  # Operations that indicate some error in the user's inference graph.
+_BLACKLISTED_INFERENCE_OPS = set([
+    "ReadVariableOp",
+    "AssignVariableOp",
+    "AssignAddVariableOp",
+    "AssignSubVariableOp",
+    "VarHandleOp",
+    "Variable",
+    "VariableV2",
+])
+
+
+class _TPUInferenceContext(control_flow_ops.XLAControlFlowContext):
+  """A `ControlFlowContext` for nodes inside a TPU inference computation.
+
+  The primary role of `TPUReplicateContext` is to sanity check operators inside
+  a tpu.rewrite_for_inference() computation.
+  """
+
+  def __init__(self, name):
+    super(_TPUInferenceContext, self).__init__()
+    self._name = name
+
+  def AddOp(self, op):
+    self._AddOpInternal(op)
+
+  def _AddOpInternal(self, op):
+    # pylint: disable=protected-access
+    if op.type in _BLACKLISTED_INFERENCE_OPS:
+      raise NotImplementedError(
+          "Operation of type %s (%s) is not supported on the TPU for inference."
+          " Execution will fail if this op is used in the graph. Make sure your"
+          " variables are using variable_scope." % (op.type, op.name))
+    if self._outer_context:
+      self._outer_context.AddInnerOp(op)
+
+  def AddValue(self, val):
+    result = val
+    if self._outer_context:
+      result = self._outer_context.AddValue(val)
+    return result
+
+  def AddInnerOp(self, op):
+    self._AddOpInternal(op)
+
+  @property
+  def grad_state(self):
+    return None
+
+
+@experimental
+def validate_inference_rewrite_for_variables(graph):
+  """Validates whether rewrite_for_inference() 'worked' for variables.
+
+     The rewrite_for_inference() method is supposed to append
+     GuaranteeConstOps after ReadVariableOps, but this mechanism works only
+     if you are using tf.get_variable() to create and access variables in your
+     tpu computation. This validation method can be called immediately after
+     calling tpu.rewrite_for_inference() to check whether GuaranteeConstOps
+     where added to the graph.
+
+     Typical usages:
+       tpu.validate_inference_rewrite_for_variables(tf.get_default_graph())
+
+       tpu.validate_inference_rewrite_for_variables(sess.graph)
+
+  Args:
+    graph: The graph which needs to be validated.
+  Raises:
+    RuntimeError: if validation failed.
+  """
+  if not any([x.type == "GuaranteeConst" for x in graph.get_operations()]):
+    raise RuntimeError(
+        "No GuaranteeConst ops found in the graph after "
+        "running tpu.rewrite_for_inference(...). Please "
+        "check that you are using tf.get_variable() to "
+        "create and access variables in your tpu "
+        "computation.")
+
+
+@experimental
+def rewrite_for_inference(computation,
+                          inputs=None,
+                          infeed_queue=None,
+                          device_assignment=None,
+                          name=None):
+  """Rewrites `computation` for inference on a TPU system.
+
+     Other than 'rewriting' the computation to run on a TPU, if using variables
+     in your computation, it moves the ReadVariableOps outside the TPU
+     computation, and adds GuaranteeConst ops just after the ReadVariableOps.
+     This mechanism works only if you are using tf.get_variable() to create and
+     access variables in your tpu computation. You can validate whether
+     this worked, by calling validate_inference_rewrite_for_variables() method
+     immediately after this method to check whether GuaranteeConstOps where
+     added to the graph.
+
+  Args:
+    computation: A Python function that builds a computation to apply
+      to the input. If the function takes n inputs, 'inputs' should be
+      a list of n tensors. If the function returns m outputs, rewrite
+      will return a list of m tensors.
+    inputs: A list of input tensors or `None` (equivalent to an empty list).
+    infeed_queue: If not `None`, the `InfeedQueue` from which to append a tuple
+      of arguments as inputs to `computation`.
+    device_assignment: if not `None`, a `DeviceAssignment` describing the
+      mapping between logical cores in the computation with physical cores in
+      the TPU topology. May be omitted for a single-core computation, in which
+      case the core attached to task 0, TPU device 0 is used.
+    name: The name of the operator.
+  Returns:
+    A list of output tensors.
+  """
+
+  def guarantee_const_getter(getter, name, *args, **kwargs):
+    with ops.control_dependencies(None):
+      return array_ops.guarantee_const(
+          getter(name, *args, **kwargs), name=name + "/GuaranteeConst")
+
+  def wrapped_computation(*args, **kwargs):
+    """Execute computation under `_TPUInferenceContext`."""
+    context = _TPUInferenceContext(
+        name=ops.get_default_graph().unique_name("rewrite_for_inference"))
+    try:
+      context.Enter()
+
+      vscope = variable_scope.get_variable_scope()
+      prev_custom_getter = vscope.custom_getter
+      prev_caching_device = vscope.caching_device
+      vscope.set_custom_getter(guarantee_const_getter)
+      vscope.set_caching_device(lambda op: op.device)
+
+      result = computation(*args, **kwargs)
+
+      vscope.set_custom_getter(prev_custom_getter)
+      vscope.set_caching_device(prev_caching_device)
+    finally:
+      context.Exit()
+    return result
+
+  # pylint: disable=undefined-variable
+  return rewrite(
+      wrapped_computation,
+      inputs=inputs,
+      infeed_queue=infeed_queue,
+      device_assignment=device_assignment,
+      name=name)
+  # pylint: enable=undefined-variable
diff --git a/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py b/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py
index 77d117ba78276aa35e55eb3a524575acc1cf607b..e94bd78833f6cbe9adb1b6ca3f29a88bd8a53f64 100644
--- a/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py
+++ b/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py
@@ -46,6 +46,8 @@ from tensorflow.core.protobuf import config_pb2
 from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python.estimator import estimator as estimator_lib
 from tensorflow.python.estimator import model_fn as model_fn_lib
+from tensorflow.python.estimator import util as estimator_util
+from tensorflow.python.estimator.export import export_output as export_output_lib
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import errors
@@ -61,6 +63,7 @@ from tensorflow.python.ops import summary_ops_v2 as contrib_summary
 from tensorflow.python.ops import variable_scope
 from tensorflow.python.ops import variables
 from tensorflow.python.platform import tf_logging as logging
+from tensorflow.python.saved_model import tag_constants
 from tensorflow.python.summary import summary
 from tensorflow.python.training import basic_session_run_hooks
 from tensorflow.python.training import evaluation
@@ -71,6 +74,7 @@ from tensorflow.python.util import function_utils
 from tensorflow.python.util import nest
 from tensorflow.python.util import tf_inspect
 
+
 _INITIAL_LOSS = 1e7
 _ZERO_LOSS = 0.
 _TPU_ESTIMATOR = 'tpu_estimator'
@@ -81,6 +85,7 @@ _CROSS_REPLICA_SUM_OP = 'CrossReplicaSum'
 _ONE_GIGABYTE = 1024 * 1024 * 1024
 _TPU_ENQUEUE_OPS = '_tpu_enqueue_ops'
 _TPU_TRAIN_OP = '_tpu_train_op'
+_REWRITE_FOR_INFERENCE_MODE = '_rewrite_for_inference'
 
 _RESERVED_PARAMS_KEYS = [_BATCH_SIZE_KEY, _CTX_KEY]
 
@@ -117,6 +122,33 @@ def _create_global_step(graph):
 
 
 def _create_or_get_iterations_per_loop():
+  """Creates or gets the iterations_per_loop variable.
+
+  In TPUEstimator, the user provided computation, the model_fn, is wrapped
+  inside a tf.while_loop for peak performance. The iterations of the loop are
+  specified by this variable, which adjusts its value on the CPU after each TPU
+  program execution and before the next TPU execution.
+
+  The purpose of using a variable, rather then a constant, is to allow
+  TPUEstimator adapt the TPU training iterations according to the final steps
+  specified by users. For example, if the user sets the iterations_per_loop as 4
+  in TPUConfig and steps as 10 in TPUEstimator.train(), the iterations_per_loop
+  variable will have the following value before each TPU training.
+
+      - 1-th TPU execution: iterations_per_loop = 4
+      - 2-th TPU execution: iterations_per_loop = 4
+      - 3-th TPU execution: iterations_per_loop = 2
+
+  As model_fn increases the global step once per train_op invocation, the global
+  step is 10 after all TPU executions, matching the steps=10 inputs passed in by
+  users.
+
+  Returns:
+    A TF non-trainable resource variable.
+
+  Raises:
+    RuntimeError: If multi iterations_per_loop variables were found.
+  """
   graph = ops.get_default_graph()
   collection_name = '{}_{}'.format(_TPU_ESTIMATOR, _ITERATIONS_PER_LOOP_VAR)
   iter_vars = graph.get_collection(collection_name)
@@ -383,20 +415,21 @@ class TPUInfeedOutfeedSessionHook(session_run_hook.SessionRunHook):
       return
 
     def _cancel_session():
-      # Close the session to avoid the main thread from hanging. If input
-      # pipeline triggers any error, the infeed thread dies but the main thread
-      # for TPU computation waits for the infeed enqueue forever. Close the
-      # Session to cancel the main thread Session.run execution.
-      #
-      # We sleep for a few seconds before closing to give some time
-      # for the TPU compilation error, if any, propagating, from TPU to CPU
-      # host. Compilation errors should be reported by the main thread so that
-      # the program can be interrupted and users can take action.  Due to a race
-      # condition, the infeed thread might see an error first.  Closing the
-      # session here immediately would result in a session cancellation
-      # exception in the main thread, instead of the expected compile error.
-      # User code that depends on having the proper exception type will
-      # therefore be confused.
+      """Close the session to avoid the main thread from hanging.
+
+      If input pipeline triggers any error, the infeed thread dies but the main
+      thread for TPU computation waits for the infeed enqueue forever. Close the
+      Session to cancel the main thread Session.run execution.
+
+      We sleep for a few seconds before closing to give some time for the TPU
+      compilation error, if any, propagating, from TPU to CPU host. Compilation
+      errors should be reported by the main thread so that the program can be
+      interrupted and users can take action.  Due to a race condition, the
+      infeed thread might see an error first.  Closing the session here
+      immediately would result in a session cancellation exception in the main
+      thread, instead of the expected compile error.  User code that depends on
+      having the proper exception type will therefore be confused.
+      """
       time.sleep(5)
 
       # If the main session is still running, the infeed/outfeed errors are
@@ -716,6 +749,15 @@ def generate_per_host_enqueue_ops_fn_for_host(
     tpu_ordinal_function = None
 
   def enqueue_ops_fn():
+    """A Fn returning the TPU infeed enqueue ops.
+
+    By providing as a Fn, it can be invoked inside the tf.while_loop such that
+    the input pipeline for multiple iterations can be executed by one
+    Session.run call.
+
+    Returns:
+      list of dict of ops.
+    """
     with ops.device(device):
       num_of_replicas_per_host = ctx.num_of_replicas_per_host
       # Convert user input to features and labels.  If the user returns a
@@ -1090,10 +1132,16 @@ class _InputPipeline(object):
     return enqueue_ops, all_hooks, run_infeed_loop_on_coordinator
 
   def _validate_input_pipeline(self):
-    # Perform some sanity checks to log user friendly information. We should
-    # error out to give users better error message. But, if
-    # _WRAP_INPUT_FN_INTO_WHILE_LOOP is False (legacy behavior), we cannot break
-    # user code, so, log a warning.
+    """Validates the input pipeline.
+
+    Perform some sanity checks to log user friendly information. We should
+    error out to give users better error message. But, if
+    _WRAP_INPUT_FN_INTO_WHILE_LOOP is False (legacy behavior), we cannot break
+    user code, so, log a warning.
+
+    Raises:
+      RuntimeError: If the validation failed.
+    """
     if ops.get_default_graph().get_collection(ops.GraphKeys.QUEUE_RUNNERS):
       err_msg = ('Input pipeline contains one or more QueueRunners. '
                  'It could be slow and not scalable. Please consider '
@@ -1264,13 +1312,11 @@ class _ModelFnWrapper(object):
             'estimator_spec used by TPU prediction must have type'
             '`TPUEstimatorSpec`. Got {}'.format(type(tpu_estimator_spec)))
 
+      self._verify_tpu_spec_predictions(tpu_estimator_spec.predictions)
+
       captured_scaffold_fn.capture(tpu_estimator_spec.scaffold_fn)
       to_record = {}
       identity_fn = lambda **kwargs: kwargs
-      # TODO(xiejw): Adds validation for prediction dictionrary.
-      # TODO(xiejw): Adds support for single tensor as predictions.
-      if not isinstance(tpu_estimator_spec.predictions, dict):
-        raise TypeError('TPUEstimatorSpec.predictions must be dict of Tensors.')
       to_record['predictions'] = [identity_fn, tpu_estimator_spec.predictions]
       to_record['signals'] = [identity_fn, stopping_signals]
       if tpu_estimator_spec.host_call is not None:
@@ -1282,8 +1328,70 @@ class _ModelFnWrapper(object):
 
     return predict_step, host_calls, captured_scaffold_fn
 
+  def _verify_tpu_spec_predictions(self, predictions):
+    """Validates TPUEstimatorSpec.predictions dict."""
+    # TODO(xiejw): Adds validation for prediction dictionrary.
+    # TODO(xiejw): Adds support for single tensor as predictions.
+    if not isinstance(predictions, dict):
+      raise TypeError('TPUEstimatorSpec.predictions must be dict of Tensors.')
+
+    for (key, tensor) in predictions.items():
+      if tensor.shape[0].value is None:
+        raise ValueError(
+            'The tensor with key ({}) in TPUEstimatorSpec.predictions has '
+            'dynamic shape (should be static). Tensor: {}'.format(
+                key, tensor))
+    return predictions
+
+  def _validate_model_features_and_labels(self,
+                                          features,
+                                          labels,
+                                          is_export_mode):
+    """Validates that the features and labels for the model function are valid.
+
+    A valid features/labels object is the one with:
+    - Type: Tensor or a dictionary of Tensors
+    - Static shape if is_export_mode is False.
+
+    Args:
+      features: the features that would be input to the model function.
+      labels: the labels that would be input to the model function.
+      is_export_mode: boolean value specifying if in export mode.
+
+    Raises:
+      TypeError: If features/labels are not of the correct type.
+      ValueError: If features/labels have dynamic shape.
+    """
+
+    def validate(obj, obj_name):
+      """Helper validate function."""
+      if not isinstance(obj, ops.Tensor) and not isinstance(obj, dict):
+        raise TypeError(
+            'The {} to the model returned by input_fn must be either a Tensor '
+            'or a dictionary of Tensors. {}: {}'.format(obj_name, obj_name,
+                                                        obj))
+      if is_export_mode or self._ctx.is_running_on_cpu(is_export_mode):
+        return
+      if isinstance(obj, ops.Tensor):
+        if not obj.get_shape().is_fully_defined():
+          raise ValueError(
+              'The {} to the model returned by input_fn must have static shape.'
+              ' Tensor: {}'.format(obj_name, obj))
+      else:
+        for (key, tensor) in obj.items():
+          if not tensor.get_shape().is_fully_defined():
+            raise ValueError(
+                'The {} to the model returned by input_fn must have static '
+                'shape. Key: \'{}\', Tensor: {}'.format(
+                    obj_name, key, tensor))
+
+    validate(features, 'features')
+    if labels is not None:
+      validate(labels, 'labels')
+
   def _call_model_fn(self, features, labels, is_export_mode=False):
     """Calls the model_fn with required parameters."""
+    self._validate_model_features_and_labels(features, labels, is_export_mode)
     model_fn_args = function_utils.fn_args(self._model_fn)
     kwargs = {}
 
@@ -1760,8 +1868,40 @@ class TPUEstimator(estimator_lib.Estimator):
   Exporting
   =========
 
-  Exporting `SavedModel` support on TPU is not yet implemented. So,
-  `export_savedmodel` is executed on CPU, even if `use_tpu` is true.
+  `export_savedmodel` exports 2 metagraphs, one with `tag_constants.SERVING`,
+  and another with `tag_constants.SERVING` and `tag_constants.TPU`.
+  At serving time, these tags are used to select metagraph to load.
+
+  Before running the graph on TPU, TPU system needs to be initialized. If
+  TensorFlow Serving model-server is used, this is done automatically. If
+  not, please call `session.run(tpu.initialize_system())`.
+
+  `tpu.outside_compilation` can be used to wrap TPU incompatible ops in
+  `model_fn`.
+
+  Example:
+  ----------------
+
+  ```
+  def model_fn(features, labels, mode, config, params):
+    ...
+    logits = ...
+    export_outputs = {
+      'logits': export_output_lib.PredictOutput(
+        {'logits': logits})
+    }
+
+    def host_call(logits):
+      class_ids = math_ops.argmax(logits)
+      classes = string_ops.as_string(class_ids)
+      export_outputs['classes'] =
+        export_output_lib.ClassificationOutput(classes=classes)
+
+    tpu.outside_compilation(host_call, logits)
+
+    ...
+  ```
+
   """
 
   def __init__(self,
@@ -1775,13 +1915,15 @@ class TPUEstimator(estimator_lib.Estimator):
                predict_batch_size=None,
                batch_axis=None,
                eval_on_tpu=True,
+               export_to_tpu=True,
                warm_start_from=None):
     """Constructs an `TPUEstimator` instance.
 
     Args:
       model_fn: Model function as required by `Estimator`. For training, the
         returned `EstimatorSpec` cannot have hooks as it is not supported in
-        `TPUEstimator`.
+        `TPUEstimator`. Instead, the user can pass the training hooks as
+        an argument to `TPUEstimator.train()`.
       model_dir: Directory to save model parameters, graph and etc. This can
         also be used to load checkpoints from the directory into a estimator to
         continue training a previously saved model. If `None`, the model_dir in
@@ -1817,6 +1959,8 @@ class TPUEstimator(estimator_lib.Estimator):
         False or `PER_HOST_V2`, batch_axis is ignored.
       eval_on_tpu: If False, evaluation runs on CPU or GPU. In this case, the
         model_fn must return `EstimatorSpec` when called with `mode` as `EVAL`.
+      export_to_tpu: If True, `export_savedmodel()` exports a metagraph for
+        serving on TPU besides the one on CPU.
       warm_start_from: Optional string filepath to a checkpoint or SavedModel to
                        warm-start from, or a `tf.estimator.WarmStartSettings`
                        object to fully configure warm-starting.  If the string
@@ -1888,8 +2032,120 @@ class TPUEstimator(estimator_lib.Estimator):
         use_tpu,
         eval_on_tpu)
 
+    self._export_to_tpu = export_to_tpu
+
     self._is_input_fn_invoked = None
 
+  def _add_meta_graph_for_mode(self,
+                               builder,
+                               input_receiver_fn_map,
+                               checkpoint_path,
+                               strip_default_attrs,
+                               save_variables=True,
+                               mode=model_fn_lib.ModeKeys.PREDICT,
+                               export_tags=None):
+    if mode != model_fn_lib.ModeKeys.PREDICT:
+      raise NotImplementedError(
+          'TPUEstimator only handles mode PREDICT for export_savedmodel(); '
+          'got {}.'.format(mode))
+
+    super(TPUEstimator, self)._add_meta_graph_for_mode(builder,
+                                                       input_receiver_fn_map,
+                                                       checkpoint_path,
+                                                       strip_default_attrs,
+                                                       save_variables,
+                                                       mode=mode)
+
+    if self._export_to_tpu:
+      input_receiver_fn_map = {_REWRITE_FOR_INFERENCE_MODE:
+                               input_receiver_fn_map[mode]}
+      export_tags = [tag_constants.SERVING, tag_constants.TPU]
+      mode = _REWRITE_FOR_INFERENCE_MODE
+      (super(TPUEstimator, self).
+       _add_meta_graph_for_mode(builder,
+                                input_receiver_fn_map,
+                                checkpoint_path,
+                                strip_default_attrs,
+                                save_variables=False,
+                                mode=mode,
+                                export_tags=export_tags))
+
+  def _call_model_fn(self, features, labels, mode, config):
+    if mode == _REWRITE_FOR_INFERENCE_MODE:
+      return self._call_model_fn_for_inference(features, labels, mode, config)
+    else:
+      return super(TPUEstimator, self)._call_model_fn(
+          features, labels, mode, config)
+
+  def _call_model_fn_for_inference(self, features, labels, mode, config):
+    """Wraps `_call_model_fn` for `export_savedmodel`."""
+    if mode != _REWRITE_FOR_INFERENCE_MODE:
+      raise ValueError('mode must be {}; '
+                       'got {}.'.format(_REWRITE_FOR_INFERENCE_MODE, mode))
+
+    capture = _CapturedObject()
+
+    def computation():
+      """Compute tpu tensors used in export_outputs.
+
+      Passed to rewrite_for_inference so that model_fn will be called under
+      the rewriting contexts. Only tpu tensors are returned, but export_outputs
+      and scaffold are captured.
+
+      Returns:
+         A list of Tensors used in export_outputs and not marked for
+         outside_compilation.
+      """
+      # We should only call model fn once and it should be inside `computation`
+      # so that building the graph will happen under `rewrite_for_inference`.
+      mode = model_fn_lib.ModeKeys.PREDICT
+      estimator_spec = self._call_model_fn(features, labels, mode, config)
+
+      # We pick the TPU tensors out from `export_output` and later return them
+      # from `computation` for rewriting.
+      tensors_dict = collections.OrderedDict(
+          (k, _export_output_to_tensors(v))
+          for k, v in six.iteritems(estimator_spec.export_outputs)
+      )
+      tensors = nest.flatten(tensors_dict)
+      tpu_tensors = [t for t in tensors if _is_tpu_tensor(t)]
+
+      # We cannot return anything other than `tpu_tensors` here so we capture
+      # the rest for later use.
+      capture.capture((estimator_spec, tensors_dict, tensors))
+      return tpu_tensors
+
+    tpu_tensors_on_cpu = tpu.rewrite_for_inference(computation)
+    estimator_spec, tensors_dict, tensors = capture.get()
+
+    # Reconstruct `tensors`, but with `tpu_tensors` replaced with
+    # `tpu_tensors_on_cpu`.
+    new_tensors = []
+    for t in tensors:
+      if _is_tpu_tensor(t):
+        new_tensors.append(tpu_tensors_on_cpu.pop(0))
+      elif t is None:
+        new_tensors.append(None)
+      else:
+        # Only fetching `tpu_tensors_on_cpu` does not trigger
+        # TPU computation and blocks, so we add the control dependency here.
+        control_inputs = (tpu_tensors_on_cpu
+                          if isinstance(tpu_tensors_on_cpu, (list, tuple))
+                          else (tpu_tensors_on_cpu,))
+        with ops.control_dependencies(control_inputs):
+          new_tensors.append(array_ops.identity(t))
+
+    # Reconstruct `tensors_dict`.
+    new_tensors_dict = nest.pack_sequence_as(tensors_dict, new_tensors)
+    # Reconstruct `export_outputs`.
+    export_outputs = estimator_spec.export_outputs
+    new_export_outputs = collections.OrderedDict(
+        (k, _clone_export_output_with_tensors(export_outputs[k], v))
+        for k, v in six.iteritems(new_tensors_dict)
+    )
+
+    return estimator_spec._replace(export_outputs=new_export_outputs)
+
   def _create_global_step(self, graph):
     """Creates a global step suitable for TPUs.
 
@@ -2072,11 +2328,11 @@ class TPUEstimator(estimator_lib.Estimator):
           if shutdown_mode:
             if shutdown_mode == 'shutdown_worker':
               finalizer_hooks = [
-                  session_support.ShutdownLameWorkers(timeout_ms=1000),
+                  session_support.ShutdownLameWorkers(timeout_ms=60*1000),
               ]
             elif shutdown_mode == 'shutdown_computation':
               finalizer_hooks = [
-                  session_support.RestartComputation(timeout_ms=1000),
+                  session_support.RestartComputation(timeout_ms=60*1000),
               ]
             else:
               raise ValueError('Unknown TF_TPU_GRACEFUL_SHUTDOWN_MODE "%s"' %
@@ -2265,6 +2521,76 @@ class TPUEstimator(estimator_lib.Estimator):
     return _model_fn
 
 
+def _is_tpu_tensor(tensor):
+  if not isinstance(tensor, ops.Tensor):
+    return False
+  try:
+    tensor.op.get_attr(tpu._OUTSIDE_COMPILATION_ATTR)  # pylint: disable=protected-access
+  except ValueError:
+    return True
+  else:
+    return False
+
+
+def _export_output_to_tensors(export_output):
+  """Get a list of `Tensors` used in `export_output`.
+
+  Args:
+    export_output: an `ExportOutput` object such as `ClassificationOutput`,
+            `RegressionOutput`, or `PredictOutput`.
+  Returns:
+    a list of tensors used in export_output.
+
+  Raises:
+    ValueError: if `export_output` is not one of `ClassificationOutput`,
+        `RegressionOutput`, or `PredictOutput`.
+  """
+  if isinstance(export_output, export_output_lib.ClassificationOutput):
+    return [export_output.scores, export_output.classes]
+  elif isinstance(export_output, export_output_lib.RegressionOutput):
+    return [export_output.value]
+  elif isinstance(export_output, export_output_lib.PredictOutput):
+    return export_output.outputs.values()
+  else:
+    raise ValueError(
+        '`export_output` must be have type `ClassificationOutput`, '
+        '`RegressionOutput`, or `PredictOutput`; got {}.'.format(export_output))
+
+
+def _clone_export_output_with_tensors(export_output, tensors):
+  """Clones `export_output` but with new `tensors`.
+
+  Args:
+    export_output: an `ExportOutput` object such as `ClassificationOutput`,
+            `RegressionOutput`, or `PredictOutput`.
+    tensors: a list of `Tensors` used to construct a new `export_output`.
+
+  Returns:
+    A dict similar to `export_output` but with `tensors`.
+
+  Raises:
+    ValueError: if `export_output` is not one of `ClassificationOutput`,
+        `RegressionOutput`, or `PredictOutput`.
+  """
+  if isinstance(export_output, export_output_lib.ClassificationOutput):
+    if len(tensors) != 2:
+      raise ValueError('tensors must be of length 2; '
+                       'got {}.'.format(len(tensors)))
+    return export_output_lib.ClassificationOutput(*tensors)
+  elif isinstance(export_output, export_output_lib.RegressionOutput):
+    if len(tensors) != 1:
+      raise ValueError('tensors must be of length 1; '
+                       'got {}'.format(len(tensors)))
+    return export_output_lib.RegressionOutput(*tensors)
+  elif isinstance(export_output, export_output_lib.PredictOutput):
+    return export_output_lib.PredictOutput(
+        dict(zip(export_output.outputs.keys(), tensors)))
+  else:
+    raise ValueError(
+        '`export_output` must be have type `ClassificationOutput`, '
+        '`RegressionOutput`, or `PredictOutput`; got {}.'.format(export_output))
+
+
 def _eval_on_tpu_system(ctx, model_fn_wrapper, dequeue_fn):
   """Executes `model_fn_wrapper` multiple times on all TPU shards."""
   iterations_per_loop_var = _create_or_get_iterations_per_loop()
@@ -2412,7 +2738,7 @@ class _CapturedObject(object):
   def capture(self, o):
     if self._captured:
       raise RuntimeError(
-          'InternalError: Object can be captured only. Please file bug .')
+          'InternalError: Object can capture only once. Please file bug.')
 
     self._captured = True
     self._object = o
@@ -2421,7 +2747,7 @@ class _CapturedObject(object):
     if not self._captured:
       raise RuntimeError(
           'InternalError: Object is not captured properly before `get`. '
-          'Please file bug .')
+          'Please file bug.')
     return self._object
 
 
@@ -2522,7 +2848,8 @@ class _Inputs(object):
     """
     iterator = self._dataset.make_initializable_iterator()
     # pylint: disable=protected-access
-    hook = estimator_lib._DatasetInitializerHook(iterator)
+    hook = estimator_util._DatasetInitializerHook(iterator)
+    # pylint: enable=protected-access
     self._iterator = iterator
     return hook
 
@@ -2668,6 +2995,7 @@ class _StopSignals(object):
 
   @staticmethod
   def should_stop(scalar_stopping_signal):
+    """Detects whether scalar_stopping_signal indicates stopping."""
     if isinstance(scalar_stopping_signal, ops.Tensor):
       # STOPPING_SIGNAL is a constant True. Here, the logical_and is just the TF
       # way to express the bool check whether scalar_stopping_signal is True.
@@ -2831,4 +3159,3 @@ def _add_item_to_params(params, key, value):
   else:
     # Now params is Python dict.
     params[key] = value
-
diff --git a/tensorflow/contrib/tpu/python/tpu/tpu_test.py b/tensorflow/contrib/tpu/python/tpu/tpu_test.py
index c3882b8a27bc835f906c47dc5219f280c53800b8..6bdaa528f9f946ae4b9813d554409da2406b1f8d 100644
--- a/tensorflow/contrib/tpu/python/tpu/tpu_test.py
+++ b/tensorflow/contrib/tpu/python/tpu/tpu_test.py
@@ -26,6 +26,7 @@ from tensorflow.contrib.tpu.python.tpu import training_loop
 from tensorflow.python.framework import dtypes
 from tensorflow.python.layers import convolutional
 from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import control_flow_util
 from tensorflow.python.ops import math_ops
 
@@ -37,7 +38,8 @@ class TPUContextTest(test.TestCase):
   def testIsInContext(self):
     """Test that control_flow_util can check that we're in a TPU context."""
     z1 = array_ops.identity(1)
-    context = tpu.TPUReplicateContext(b"context", 1)
+    pivot = control_flow_ops.no_op()
+    context = tpu.TPUReplicateContext(b"context", 1, pivot=pivot)
     context.Enter()
     z2 = array_ops.identity(1)
     context.Exit()
diff --git a/tensorflow/contrib/training/python/training/tensor_queue_dataset.py b/tensorflow/contrib/training/python/training/tensor_queue_dataset.py
index 409aba817c1ec37003eb98f000f6cf8918234c5d..a2444934bc21d58ed57d15494b3548a31ce3a2df 100644
--- a/tensorflow/contrib/training/python/training/tensor_queue_dataset.py
+++ b/tensorflow/contrib/training/python/training/tensor_queue_dataset.py
@@ -18,6 +18,7 @@ from __future__ import division
 from __future__ import print_function
 
 from tensorflow.python.data.ops import dataset_ops
+from tensorflow.python.data.util import convert
 from tensorflow.python.data.util import nest
 from tensorflow.python.data.util import sparse
 from tensorflow.python.framework import dtypes
@@ -45,14 +46,14 @@ class _PrependFromQueueAndPaddedBatchDataset(dataset_ops.Dataset):
     self._input_dataset = input_dataset
     self._batch_size = ops.convert_to_tensor(
         batch_size, dtype=dtypes.int64, name="batch_size")
-    # pylint: disable=protected-access
     if padded_shapes is None:
       self._padded_shapes = nest.map_structure(
-          dataset_ops._partial_shape_to_tensor, input_dataset.output_shapes)
+          convert.partial_shape_to_tensor, input_dataset.output_shapes)
     else:
       self._padded_shapes = nest.map_structure_up_to(
-          input_dataset.output_shapes, dataset_ops._partial_shape_to_tensor,
+          input_dataset.output_shapes, convert.partial_shape_to_tensor,
           padded_shapes)
+    # pylint: disable=protected-access
     padding_values = (
         padding_values if padding_values is not None else
         dataset_ops._default_padding(input_dataset))
diff --git a/tensorflow/contrib/verbs/BUILD b/tensorflow/contrib/verbs/BUILD
index 9720fd6e8657de18cf8d7565f834568ae52fdbda..1b45584dcb84fe62de6cc14017e7cae575f99b2f 100644
--- a/tensorflow/contrib/verbs/BUILD
+++ b/tensorflow/contrib/verbs/BUILD
@@ -58,7 +58,7 @@ cc_library(
         "//tensorflow/core/distributed_runtime/rpc:async_service_interface",
         "//tensorflow/core/distributed_runtime/rpc:grpc_call",
         "//tensorflow/core/distributed_runtime/rpc:grpc_util",
-        "@grpc//:grpc++_unsecure",
+        "@grpc//:grpc++",
     ],
     alwayslink = 1,
 )
@@ -69,7 +69,7 @@ cc_library(
     hdrs = ["grpc_verbs_service_impl.h"],
     deps = [
         ":verbs_service_proto_cc",
-        "@grpc//:grpc++_unsecure",
+        "@grpc//:grpc++",
     ],
 )
 
diff --git a/tensorflow/contrib/verbs/grpc_verbs_service.cc b/tensorflow/contrib/verbs/grpc_verbs_service.cc
index 742f946c9536973eb8a6a11afda1b32ae4a7726b..af29abd91feda22824e57c19c13a3f48fb1d61b7 100644
--- a/tensorflow/contrib/verbs/grpc_verbs_service.cc
+++ b/tensorflow/contrib/verbs/grpc_verbs_service.cc
@@ -15,9 +15,9 @@ limitations under the License.
 
 #ifdef TENSORFLOW_USE_VERBS
 
-#include "grpc++/alarm.h"
-#include "grpc++/grpc++.h"
-#include "grpc++/server_builder.h"
+#include "grpcpp/alarm.h"
+#include "grpcpp/grpcpp.h"
+#include "grpcpp/server_builder.h"
 
 #include "tensorflow/contrib/verbs/grpc_verbs_service.h"
 #include "tensorflow/core/distributed_runtime/rpc/grpc_util.h"
diff --git a/tensorflow/contrib/verbs/grpc_verbs_service_impl.cc b/tensorflow/contrib/verbs/grpc_verbs_service_impl.cc
index 991f9a9d8bdf883b1b68bfa1fb6af7bf51b7e66a..4da7b59c69c88a4d04be37543aae7f03decd2c52 100644
--- a/tensorflow/contrib/verbs/grpc_verbs_service_impl.cc
+++ b/tensorflow/contrib/verbs/grpc_verbs_service_impl.cc
@@ -15,14 +15,14 @@ limitations under the License.
 
 #include "tensorflow/contrib/verbs/grpc_verbs_service_impl.h"
 
-#include "grpc++/impl/codegen/async_stream.h"
-#include "grpc++/impl/codegen/async_unary_call.h"
-#include "grpc++/impl/codegen/channel_interface.h"
-#include "grpc++/impl/codegen/client_unary_call.h"
-#include "grpc++/impl/codegen/method_handler_impl.h"
-#include "grpc++/impl/codegen/rpc_service_method.h"
-#include "grpc++/impl/codegen/service_type.h"
-#include "grpc++/impl/codegen/sync_stream.h"
+#include "grpcpp/impl/codegen/async_stream.h"
+#include "grpcpp/impl/codegen/async_unary_call.h"
+#include "grpcpp/impl/codegen/channel_interface.h"
+#include "grpcpp/impl/codegen/client_unary_call.h"
+#include "grpcpp/impl/codegen/method_handler_impl.h"
+#include "grpcpp/impl/codegen/rpc_service_method.h"
+#include "grpcpp/impl/codegen/service_type.h"
+#include "grpcpp/impl/codegen/sync_stream.h"
 
 namespace tensorflow {
 
diff --git a/tensorflow/contrib/verbs/grpc_verbs_service_impl.h b/tensorflow/contrib/verbs/grpc_verbs_service_impl.h
index 1f0f10517e98a32ae882c027330091928f1a6ee2..abe5e08b07cd71b7ca28321e6eb2cf0eec5d1b0f 100644
--- a/tensorflow/contrib/verbs/grpc_verbs_service_impl.h
+++ b/tensorflow/contrib/verbs/grpc_verbs_service_impl.h
@@ -16,14 +16,14 @@ limitations under the License.
 #ifndef TENSORFLOW_CONTRIB_GRPC_VERBS_SERVICE_IMPL_H_
 #define TENSORFLOW_CONTRIB_GRPC_VERBS_SERVICE_IMPL_H_
 
-#include "grpc++/impl/codegen/async_stream.h"
-#include "grpc++/impl/codegen/async_unary_call.h"
-#include "grpc++/impl/codegen/proto_utils.h"
-#include "grpc++/impl/codegen/rpc_method.h"
-#include "grpc++/impl/codegen/service_type.h"
-#include "grpc++/impl/codegen/status.h"
-#include "grpc++/impl/codegen/stub_options.h"
-#include "grpc++/impl/codegen/sync_stream.h"
+#include "grpcpp/impl/codegen/async_stream.h"
+#include "grpcpp/impl/codegen/async_unary_call.h"
+#include "grpcpp/impl/codegen/proto_utils.h"
+#include "grpcpp/impl/codegen/rpc_method.h"
+#include "grpcpp/impl/codegen/service_type.h"
+#include "grpcpp/impl/codegen/status.h"
+#include "grpcpp/impl/codegen/stub_options.h"
+#include "grpcpp/impl/codegen/sync_stream.h"
 
 #include "tensorflow/contrib/verbs/verbs_service.pb.h"
 
diff --git a/tensorflow/core/BUILD b/tensorflow/core/BUILD
index a576e360973381d6df3ed399b207a987006d58c7..b6b48a077cdafe12aeb1e4e0988493692c82eace 100644
--- a/tensorflow/core/BUILD
+++ b/tensorflow/core/BUILD
@@ -72,24 +72,23 @@ licenses(["notice"])  # Apache 2.0
 
 load(
     "//tensorflow:tensorflow.bzl",
+    "cc_header_only_library",
     "full_path",
     "if_android",
-    "if_not_android_mips_and_mips64",
     "if_ios",
     "if_linux_x86_64",
     "if_mobile",
     "if_not_mobile",
-    "if_windows",
     "if_not_windows",
-    "tf_copts",
+    "if_windows",
     "tf_cc_test",
     "tf_cc_tests",
+    "tf_copts",
     "tf_cuda_library",
     "tf_gen_op_libs",
     "tf_generate_proto_text_sources",
     "tf_genrule_cmd_append_to_srcs",
     "tf_opts_nortti_if_android",
-    "cc_header_only_library",
 )
 load("//tensorflow:tensorflow.bzl", "tf_cc_test_mkl")
 load("//tensorflow:tensorflow.bzl", "tf_cc_test_gpu")
@@ -101,49 +100,48 @@ load("//tensorflow:tensorflow.bzl", "tf_cuda_only_cc_test")
 # For platform specific build config
 load(
     "//tensorflow/core:platform/default/build_config.bzl",
-    "tf_platform_hdrs",
-    "tf_platform_srcs",
-    "tf_proto_library",
-    "tf_proto_library_cc",
     "tf_additional_all_protos",
+    "tf_additional_cloud_kernel_deps",
+    "tf_additional_cloud_op_deps",
     "tf_additional_core_deps",
+    "tf_additional_cupti_wrapper_deps",
+    "tf_additional_device_tracer_cuda_deps",
+    "tf_additional_device_tracer_deps",
+    "tf_additional_device_tracer_srcs",
+    "tf_additional_gdr_lib_defines",
+    "tf_additional_human_readable_json_deps",
     "tf_additional_lib_defines",
     "tf_additional_lib_deps",
     "tf_additional_lib_hdrs",
     "tf_additional_lib_srcs",
-    "tf_additional_framework_hdrs",
-    "tf_additional_framework_srcs",
-    "tf_additional_minimal_lib_srcs",
-    "tf_additional_proto_hdrs",
-    "tf_additional_proto_srcs",
-    "tf_additional_cupti_wrapper_deps",
     "tf_additional_libdevice_data",
     "tf_additional_libdevice_deps",
     "tf_additional_libdevice_srcs",
+    "tf_additional_minimal_lib_srcs",
+    "tf_additional_mpi_lib_defines",
+    "tf_additional_proto_hdrs",
+    "tf_additional_proto_srcs",
     "tf_additional_test_deps",
     "tf_additional_test_srcs",
-    "tf_kernel_tests_linkstatic",
-    "tf_additional_cloud_op_deps",
-    "tf_additional_cloud_kernel_deps",
-    "tf_lib_proto_parsing_deps",
     "tf_additional_verbs_lib_defines",
-    "tf_additional_mpi_lib_defines",
-    "tf_additional_gdr_lib_defines",
-    "tf_additional_device_tracer_srcs",
-    "tf_additional_device_tracer_deps",
-    "tf_additional_device_tracer_cuda_deps",
-    "tf_pyclif_proto_library",
     "tf_jspb_proto_library",
+    "tf_kernel_tests_linkstatic",
+    "tf_lib_proto_parsing_deps",
     "tf_nano_proto_library",
+    "tf_platform_hdrs",
+    "tf_platform_srcs",
+    "tf_proto_library",
+    "tf_proto_library_cc",
     "tf_protos_all",
     "tf_protos_all_impl",
     "tf_protos_grappler",
     "tf_protos_grappler_impl",
+    "tf_pyclif_proto_library",
 )
 load(
     "//tensorflow/core:platform/default/build_config_root.bzl",
-    "tf_cuda_tests_tags",
     "if_static",
+    "tf_cuda_tests_tags",
 )
 load("@local_config_cuda//cuda:build_defs.bzl", "if_cuda")
 load("@io_bazel_rules_closure//closure:defs.bzl", "closure_proto_library")
@@ -295,43 +293,18 @@ cc_library(
     ],
 )
 
-PLATFORM_BASE_HDRS = [
-    "platform/env_time.h",
-    "platform/logging.h",
-    "platform/macros.h",
-    "platform/types.h",
-    "platform/byte_order.h",
-]
-
-PLATFORM_OTHER_HDRS = [
-    "platform/abi.h",
-    "platform/stacktrace.h",
-    "platform/stacktrace_handler.h",
-    "platform/context.h",
-    "platform/cpu_info.h",
-    "platform/cpu_feature_guard.h",
-    "platform/dynamic_annotations.h",
-    "platform/error.h",
-    "platform/env.h",
-    "platform/file_system.h",
-    "platform/file_system_helper.h",
-    "platform/fingerprint.h",
-    "platform/init_main.h",
-    "platform/mem.h",
-    "platform/mutex.h",
-    "platform/net.h",
-    "platform/notification.h",
-    "platform/null_file_system.h",
-    "platform/prefetch.h",
-    "platform/profile_utils/clock_cycle_profiler.h",
-    "platform/profile_utils/cpu_utils.h",
-    "platform/protobuf.h",
-    "platform/strong_hash.h",
-    "platform/subprocess.h",
-    "platform/thread_annotations.h",
-]
+filegroup(
+    name = "platform_base_hdrs",
+    srcs = [
+        "platform/byte_order.h",
+        "platform/env_time.h",
+        "platform/logging.h",
+        "platform/macros.h",
+        "platform/types.h",
+    ],
+    visibility = ["//visibility:private"],
+)
 
-# Smaller platform libraries that don't depend on "lib" or "lib_internal".
 cc_library(
     name = "platform_base",
     srcs = tf_platform_hdrs([
@@ -343,16 +316,275 @@ cc_library(
     ]) + [
         "platform/env_time.cc",
     ],
-    hdrs = PLATFORM_BASE_HDRS,
+    hdrs = [":platform_base_hdrs"],
     copts = tf_copts(),
-    # TODO(ahentz): remove use of this library so we can move it into 'platform'
     tags = ["avoid_dep"],
+    visibility = ["//tensorflow/core:__subpackages__"],
     deps = [
         ":lib_platform",
         "//tensorflow/core/platform/default/build_config:base",
     ],
 )
 
+filegroup(
+    name = "platform_port_hdrs",
+    srcs = [
+        "platform/cpu_info.h",
+        "platform/dynamic_annotations.h",
+        "platform/init_main.h",
+        "platform/mem.h",
+        "platform/mutex.h",
+        "platform/thread_annotations.h",
+    ],
+    visibility = ["//visibility:private"],
+)
+
+# Headers that are not exported as part of ":lib".
+filegroup(
+    name = "platform_port_internal_hdrs",
+    srcs = [
+        "platform/demangle.h",
+        "platform/host_info.h",
+        "platform/snappy.h",
+    ],
+    visibility = ["//visibility:private"],
+)
+
+cc_library(
+    name = "platform_port",
+    srcs = tf_platform_hdrs([
+        "cpu_info.h",
+        "dynamic_annotations.h",
+        "thread_annotations.h",
+        "mutex.h",
+    ]) + tf_platform_srcs([
+        "port.cc",
+    ]) + [
+        "platform/cpu_info.cc",
+    ],
+    hdrs = [
+        ":platform_port_hdrs",
+        ":platform_port_internal_hdrs",
+    ],
+    copts = tf_copts(),
+    visibility = ["//tensorflow/core:__subpackages__"],
+    deps = [
+        ":lib_platform",
+        ":platform_base",
+        "//tensorflow/core/platform/default/build_config:port",
+        "@snappy",
+    ],
+)
+
+filegroup(
+    name = "platform_protobuf_hdrs",
+    srcs = [
+        "platform/protobuf.h",
+    ],
+    visibility = ["//visibility:private"],
+)
+
+# Headers that are not exported as part of ":lib".
+filegroup(
+    name = "platform_protobuf_internal_hdrs",
+    srcs = [
+        "platform/protobuf_internal.h",
+    ],
+    visibility = ["//visibility:private"],
+)
+
+cc_library(
+    name = "platform_protobuf",
+    srcs = tf_platform_hdrs([
+        "protobuf.h",
+    ]) + tf_platform_srcs([
+        "protobuf.cc",
+    ]) + [
+        "platform/protobuf_util.cc",
+        "lib/core/status.h",
+    ],
+    hdrs = [
+        ":platform_protobuf_hdrs",
+        ":platform_protobuf_internal_hdrs",
+    ],
+    copts = tf_copts(),
+    visibility = ["//tensorflow/core:__subpackages__"],
+    deps = [
+        ":lib_platform",
+        ":platform_base",
+        ":platform_port",
+        "//tensorflow/core/platform/default/build_config:protobuf",
+        "@protobuf_archive//:protobuf",
+    ],
+)
+
+cc_library(
+    name = "human_readable_json",
+    srcs = tf_platform_srcs(["human_readable_json.cc"]),
+    hdrs = ["platform/human_readable_json.h"],
+    copts = tf_copts(),
+    visibility = ["//visibility:public"],
+    deps = [
+        ":lib",
+        ":lib_internal",
+    ] + tf_additional_human_readable_json_deps(),
+)
+
+filegroup(
+    name = "platform_env_hdrs",
+    srcs = [
+        "platform/env.h",
+        "platform/file_statistics.h",
+        "platform/file_system.h",
+    ],
+    visibility = ["//visibility:private"],
+)
+
+# Headers that are not exported as part of ":lib".
+filegroup(
+    name = "platform_env_internal_hdrs",
+    srcs = [
+        "platform/load_library.h",
+    ],
+    visibility = ["//visibility:private"],
+)
+
+cc_library(
+    name = "platform_env",
+    srcs = tf_platform_srcs([
+        "env.cc",
+        "load_library.cc",
+    ]) + tf_platform_hdrs([
+        "wide_char.h",
+    ]) + [
+        "platform/env.cc",
+        "platform/file_system.cc",
+    ],
+    hdrs = [
+        ":platform_env_hdrs",
+        ":platform_env_internal_hdrs",
+    ],
+    copts = tf_copts(),
+    visibility = ["//tensorflow/core:__subpackages__"],
+    deps = [
+        ":error_codes_proto_cc",
+        ":lib",
+        ":lib_internal",
+        ":lib_platform",
+        ":platform_base",
+        ":platform_port",
+        ":platform_protobuf",
+        "//tensorflow/core/platform/default/build_config:env",
+    ],
+)
+
+filegroup(
+    name = "platform_file_system_hdrs",
+    srcs = [
+        "platform/file_system_helper.h",
+        "platform/null_file_system.h",
+    ],
+    visibility = ["//visibility:private"],
+)
+
+cc_library(
+    name = "platform_file_system",
+    srcs = tf_platform_srcs([
+    ]) + tf_platform_hdrs([
+        "windows_file_system.h",
+    ]) + [
+        "platform/file_system_helper.cc",
+    ],
+    hdrs = [
+        ":platform_file_system_hdrs",
+    ],
+    copts = tf_copts(),
+    visibility = ["//tensorflow/core:__subpackages__"],
+    deps = [
+        ":lib",
+        ":lib_platform",
+        ":platform_env",
+    ],
+)
+
+filegroup(
+    name = "platform_other_hdrs",
+    srcs = [
+        "platform/abi.h",
+        "platform/context.h",
+        "platform/cpu_feature_guard.h",
+        "platform/error.h",
+        "platform/fingerprint.h",
+        "platform/net.h",
+        "platform/notification.h",
+        "platform/prefetch.h",
+        "platform/profile_utils/android_armv7a_cpu_utils_helper.h",
+        "platform/profile_utils/clock_cycle_profiler.h",
+        "platform/profile_utils/cpu_utils.h",
+        "platform/profile_utils/i_cpu_utils_helper.h",
+        "platform/stacktrace.h",
+        "platform/stacktrace_handler.h",
+        "platform/strong_hash.h",
+        "platform/subprocess.h",
+    ],
+    visibility = ["//visibility:private"],
+)
+
+# Headers that are not exported as part of ":lib".
+filegroup(
+    name = "platform_other_internal_hdrs",
+    srcs = [
+        "platform/denormal.h",
+        "platform/setround.h",
+        "platform/tracing.h",
+    ],
+    visibility = ["//visibility:private"],
+)
+
+cc_library(
+    name = "platform_other",
+    srcs = tf_platform_srcs([
+        "subprocess.cc",
+        "net.cc",
+        "tracing.cc",
+    ]) + tf_platform_hdrs([
+        "tracing.h",
+        "error.h",
+        "context.h",
+        "fingerprint.h",
+        "notification.h",
+        "stacktrace.h",
+        "strong_hash.h",
+        "subprocess.h",
+        "tracing_impl.h",
+    ]) + [
+        "platform/cpu_feature_guard.cc",
+        "platform/setround.cc",
+        "platform/tracing.cc",
+        "platform/denormal.cc",
+        "platform/profile_utils/android_armv7a_cpu_utils_helper.cc",
+        "platform/profile_utils/clock_cycle_profiler.cc",
+        "platform/profile_utils/cpu_utils.cc",
+    ],
+    hdrs = [
+        ":platform_other_hdrs",
+        ":platform_other_internal_hdrs",
+    ],
+    copts = tf_copts(),
+    visibility = ["//tensorflow/core:__subpackages__"],
+    deps = [
+        ":lib",
+        ":lib_platform",
+        ":platform_base",
+        ":platform_env",
+        ":platform_port",
+        ":platform_protobuf",
+        "//tensorflow/core/platform/default/build_config:other",
+        "//tensorflow/core/platform/default/build_config:platformlib",
+        "//tensorflow/core/platform/default/build_config:port",
+    ],
+)
+
 # Minimal lib so that tools used for mobile compilation
 # don't have to depend on lib/platformlib.
 cc_library(
@@ -386,8 +618,7 @@ cc_library(
 # tf_cc_test and tf_cc_binary will include the necessary symbols.
 cc_library(
     name = "lib",
-    hdrs = PLATFORM_BASE_HDRS +
-           PLATFORM_OTHER_HDRS + [
+    hdrs = [
         "lib/bfloat16/bfloat16.h",
         "lib/core/arena.h",
         "lib/core/bitmap.h",
@@ -434,6 +665,12 @@ cc_library(
         "lib/strings/str_util.h",
         "lib/strings/strcat.h",
         "lib/strings/stringprintf.h",
+        ":platform_base_hdrs",
+        ":platform_env_hdrs",
+        ":platform_file_system_hdrs",
+        ":platform_other_hdrs",
+        ":platform_port_hdrs",
+        ":platform_protobuf_hdrs",
     ],
     visibility = ["//visibility:public"],
     deps = [
@@ -607,6 +844,7 @@ tf_cuda_library(
         "util/sparse/group_iterator.h",
         "util/sparse/sparse_tensor.h",
         "util/stat_summarizer.h",
+        "util/stat_summarizer_options.h",
         "util/stream_executor_util.h",
         "util/strided_slice_op.h",
         "util/tensor_format.h",
@@ -631,6 +869,18 @@ tf_cuda_library(
     deps = [":framework_internal"],
 )
 
+cc_library(
+    name = "stats_calculator_portable",
+    srcs = [
+        "util/stat_summarizer_options.h",
+        "util/stats_calculator.cc",
+    ],
+    hdrs = [
+        "util/stats_calculator.h",
+    ],
+    copts = tf_copts(),
+)
+
 cc_library(
     name = "overflow",
     hdrs = ["util/overflow.h"],
@@ -640,6 +890,12 @@ cc_library(
     ],
 )
 
+cc_library(
+    name = "exec_on_stall",
+    hdrs = ["util/exec_on_stall.h"],
+    deps = [":framework_lite"],
+)
+
 cc_library(
     name = "ptr_util",
     hdrs = ["util/ptr_util.h"],
@@ -1111,6 +1367,7 @@ cc_library(
         ":shape_inference_testutil",
         ":tensor_testutil",
         ":test",
+        ":testlib_ops",
         "//tensorflow/cc:scope",
         "//tensorflow/core/kernels:constant_op",
         "//tensorflow/core/kernels:ops_testutil",
@@ -1118,6 +1375,18 @@ cc_library(
     ],
 )
 
+cc_library(
+    name = "testlib_ops",
+    testonly = 1,
+    srcs = ["common_runtime/testlib_ops.cc"],
+    linkstatic = 1,  # Seems to be needed since alwayslink is broken in bazel
+    deps = [
+        "//tensorflow/core:framework",
+        "//tensorflow/core:lib",
+    ],
+    alwayslink = 1,
+)
+
 # This is a link-only library to provide a DirectSession
 # implementation of the Session interface.
 tf_cuda_library(
@@ -1179,6 +1448,7 @@ filegroup(
             "lib/png/**/*",
             "lib/gif/**/*",
             "util/events_writer.*",
+            "util/stats_calculator.*",
             "util/reporter.*",
             "platform/**/cuda_libdevice_path.*",
             "platform/default/test_benchmark.*",
@@ -1262,6 +1532,7 @@ cc_library(
     visibility = ["//visibility:public"],
     deps = [
         ":protos_all_cc_impl",
+        ":stats_calculator_portable",
         "//third_party/eigen3",
         "@double_conversion//:double-conversion",
         "@nsync//:nsync_cpp",
@@ -1302,6 +1573,7 @@ cc_library(
     visibility = ["//visibility:public"],
     deps = [
         ":protos_all_cc_impl",
+        ":stats_calculator_portable",
         "//third_party/eigen3",
         "@double_conversion//:double-conversion",
         "@nsync//:nsync_cpp",
@@ -1767,9 +2039,8 @@ cc_library(
             "platform/**/cuda_libdevice_path.cc",
             "platform/**/device_tracer.cc",
             "platform/**/logging.cc",
+            "platform/**/human_readable_json.cc",
             "platform/abi.cc",
-            "platform/variant_coding.cc",
-            "platform/**/variant_cord_coding.cc",
         ],
     ) + tf_additional_lib_srcs(
         exclude = [
@@ -1781,9 +2052,8 @@ cc_library(
             "platform/**/env_time.cc",
             "platform/**/device_tracer.cc",
             "platform/**/logging.cc",
+            "platform/**/human_readable_json.cc",
             "platform/abi.cc",
-            "platform/variant_coding.cc",
-            "platform/**/variant_cord_coding.cc",
         ] +
         # Protobuf deps already included through the ":lib_proto_parsing"
         # dependency.
@@ -2033,7 +2303,6 @@ cc_library(
 )
 
 FRAMEWORK_INTERNAL_PRIVATE_HEADERS = [
-    "platform/variant_coding.h",
     "graph/edgeset.h",
     "graph/graph.h",
     "graph/graph_def_builder.h",
@@ -2074,14 +2343,13 @@ FRAMEWORK_INTERNAL_PUBLIC_HEADERS = [
     "framework/tracking_allocator.h",  # only needed for tests
     "framework/unique_tensor_references.h",
     "framework/variant.h",
-    "platform/variant_coding.h",
     "util/command_line_flags.h",
     "util/env_var.h",
     "util/equal_graph_def.h",
     "util/presized_cuckoo_map.h",
     "util/tensor_slice_set.h",
     "util/tensor_slice_util.h",
-] + tf_additional_framework_hdrs()
+]
 
 tf_cuda_library(
     name = "framework_internal",
@@ -2123,9 +2391,7 @@ cc_header_only_library(
 
 tf_cuda_library(
     name = "framework_internal_impl",
-    srcs = FRAMEWORK_INTERNAL_PRIVATE_HEADERS + [
-        "platform/variant_coding.cc",
-    ] + glob(
+    srcs = FRAMEWORK_INTERNAL_PRIVATE_HEADERS + glob(
         [
             "example/**/*.cc",
             "framework/**/*.cc",
@@ -2159,7 +2425,7 @@ tf_cuda_library(
             "util/memmapped_file_system.cc",
             "util/memmapped_file_system_writer.cc",
         ],
-    }) + tf_additional_framework_srcs(),
+    }),
     hdrs = FRAMEWORK_INTERNAL_PUBLIC_HEADERS,
     copts = tf_copts(),
     linkopts = select({
@@ -2368,6 +2634,7 @@ CORE_CPU_LIB_HEADERS = CORE_CPU_BASE_HDRS + [
     "common_runtime/dma_helper.h",
     "common_runtime/eigen_thread_pool.h",
     "common_runtime/executor.h",
+    "common_runtime/executor_factory.h",
     "common_runtime/graph_optimizer.h",
     "common_runtime/local_device.h",
     "common_runtime/lower_if_op.h",
@@ -2417,6 +2684,7 @@ tf_cuda_library(
         "common_runtime/device_resolver_local.cc",
         "common_runtime/device_set.cc",
         "common_runtime/executor.cc",
+        "common_runtime/executor_factory.cc",
         "common_runtime/function.cc",
         "common_runtime/graph_optimizer.cc",
         "common_runtime/graph_runner.cc",
@@ -2524,6 +2792,7 @@ cc_library(
     ],
     visibility = [
         "//tensorflow/compiler:__subpackages__",
+        "//tensorflow/core/kernels:__subpackages__",
         "//tensorflow/core/profiler:__subpackages__",
     ],
     deps = [":lib_internal"],
@@ -2997,6 +3266,18 @@ tf_cc_test(
     ],
 )
 
+tf_cc_test(
+    name = "exec_on_stall_test",
+    size = "small",
+    srcs = ["util/exec_on_stall_test.cc"],
+    deps = [
+        ":exec_on_stall",
+        ":framework_lite",
+        ":test",
+        ":test_main",
+    ],
+)
+
 tf_cc_test(
     name = "lib_jpeg_jpeg_mem_unittest",
     srcs = ["lib/jpeg/jpeg_mem_unittest.cc"],
@@ -3760,6 +4041,31 @@ tf_cc_test(
     ],
 )
 
+tf_cc_test(
+    name = "common_runtime_executor_test",
+    size = "small",
+    srcs = ["common_runtime/executor_test.cc"],
+    linkstatic = tf_kernel_tests_linkstatic(),
+    deps = [
+        ":core",
+        ":core_cpu",
+        ":core_cpu_internal",
+        ":framework",
+        ":framework_internal",
+        ":lib",
+        ":lib_internal",
+        ":protos_all_cc",
+        ":test",
+        ":test_main",
+        ":testlib",
+        "//tensorflow/core/kernels:array",
+        "//tensorflow/core/kernels:control_flow_ops",
+        "//tensorflow/core/kernels:math",
+        "//tensorflow/core/kernels:random_ops",
+        "//tensorflow/core/kernels:state",
+    ],
+)
+
 tf_cc_test(
     name = "common_runtime_function_test",
     size = "small",
diff --git a/tensorflow/core/api_def/base_api/api_def_AnonymousIterator.pbtxt b/tensorflow/core/api_def/base_api/api_def_AnonymousIterator.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..d8c2ed40a324d4854d83c471e8eef50e50277b93
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_AnonymousIterator.pbtxt
@@ -0,0 +1,13 @@
+op {
+  graph_op_name: "AnonymousIterator"
+  out_arg {
+    name: "handle"
+    description: <<END
+A handle to the iterator that can be passed to a "MakeIterator" or
+"IteratorGetNext" op. In contrast to Iterator, AnonymousIterator prevents
+resource sharing by name, and does not keep a reference to the resource
+container.
+END
+  }
+  summary: "A container for an iterator resource."
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_BatchDatasetV2.pbtxt b/tensorflow/core/api_def/base_api/api_def_BatchDatasetV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..0c5b1eb45af6812bdd35e2fef43ac8c02a5b9388
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_BatchDatasetV2.pbtxt
@@ -0,0 +1,18 @@
+op {
+  graph_op_name: "BatchDatasetV2"
+  visibility: HIDDEN
+  in_arg {
+    name: "batch_size"
+    description: <<END
+A scalar representing the number of elements to accumulate in a batch.
+END
+  }
+  in_arg {
+    name: "drop_remainder"
+    description: <<END
+A scalar representing whether the last batch should be dropped in case its size
+is smaller than desired.
+END
+  }
+  summary: "Creates a dataset that batches `batch_size` elements from `input_dataset`."
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_BatchFunction.pbtxt b/tensorflow/core/api_def/base_api/api_def_BatchFunction.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..09eff6177b1bcb544803b0806070d7b04c5e93c6
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_BatchFunction.pbtxt
@@ -0,0 +1,128 @@
+op {
+  graph_op_name: "BatchFunction"
+  in_arg {
+    name: "in_tensors"
+    description: <<END
+The tensors to be batched.
+END
+  }
+  in_arg {
+    name: "captured_tensors"
+    description: <<END
+The tensors which are captured in the function, and don't need
+to be batched.
+END
+  }
+  out_arg {
+    name: "out_tensors"
+    description: <<END
+The output tensors.
+END
+  }
+  attr {
+    name: "num_batch_threads"
+    description: <<END
+Number of scheduling threads for processing batches of work.
+Determines the number of batches processed in parallel.
+END
+  }
+  attr {
+    name: "max_batch_size"
+    description: <<END
+Batch sizes will never be bigger than this.
+END
+  }
+  attr {
+    name: "batch_timeout_micros"
+    description: <<END
+Maximum number of microseconds to wait before outputting
+an incomplete batch.
+END
+  }
+  attr {
+    name: "max_enqueued_batches"
+    description: <<END
+Maximum number of batches enqueued. Default: 10.
+END
+  }
+  attr {
+    name: "allowed_batch_sizes"
+    description: <<END
+Optional list of allowed batch sizes. If left empty, does
+nothing. Otherwise, supplies a list of batch sizes, causing the op to pad
+batches up to one of those sizes. The entries must increase monotonically, and
+the final entry must equal max_batch_size.
+END
+  }
+  attr {
+    name: "container"
+    description: <<END
+Controls the scope of sharing of this batch.
+END
+  }
+  attr {
+    name: "shared_name"
+    description: <<END
+Concurrently running instances of batch in the same device with the
+same container and shared_name will batch their elements together. If left
+empty, the op name will be used as the shared name.
+END
+  }
+  attr {
+    name: "Tin"
+    description: <<END
+the types of tensors to be batched.
+END
+  }
+  attr {
+    name: "Tcaptured"
+    description: <<END
+the types of the captured tensors.
+END
+  }
+  attr {
+    name: "Tout"
+    description: <<END
+the types of the output tensors.
+END
+  }
+  summary: "Batches all the inputs tensors to the computation done by the function."
+  description: <<END
+So, for example, in the following code
+
+  ```python
+
+  # This input will be captured.
+  y = tf.placeholder_with_default(1.0, shape=[])
+
+  @tf.Defun(tf.float32)
+  def computation(a):
+    return tf.matmul(a, a) + y
+
+  b = gen_batch_ops.batch_function(
+          f=computation
+          in_tensors=[a],
+          captured_tensors=computation.captured_inputs,
+          Tout=[o.type for o in computation.definition.signature.output_arg],
+          num_batch_threads=1,
+          max_batch_size=10,
+          batch_timeout_micros=100000,  # 100ms
+          allowed_batch_sizes=[3, 10],
+          batching_queue="")
+
+If more than one session.run call is simultaneously trying to compute `b`
+the values of `a` will be gathered, non-deterministically concatenated
+along the first axis, and only one thread will run the computation.
+
+Assumes that all arguments of the function are Tensors which will be batched
+along their first dimension.
+
+Arguments that are captured, are not batched. The session.run call which does
+the concatenation, will use the values of the captured tensors available to it.
+Therefore, typical uses of captured tensors should involve values which remain
+unchanged across session.run calls. Inference is a good example of this.
+
+SparseTensor is not supported. The return value of the decorated function
+must be a Tensor or a list/tuple of Tensors.
+END
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_BesselI0e.pbtxt b/tensorflow/core/api_def/base_api/api_def_BesselI0e.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..08313cebb9937b94ba203cd9492e68cab2ee8e48
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_BesselI0e.pbtxt
@@ -0,0 +1,10 @@
+op {
+  graph_op_name: "BesselI0e"
+  summary: "Computes the Bessel i0e function of `x` element-wise."
+  description: <<END
+Exponentially scaled modified Bessel function of order 0 defined as
+`bessel_i0e(x) = exp(-abs(x)) bessel_i0(x)`.
+
+This function is faster and numerically stabler than `bessel_i0(x)`.
+END
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_BesselI1e.pbtxt b/tensorflow/core/api_def/base_api/api_def_BesselI1e.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..3e46a9506f55b67140e7062915243519d039ace0
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_BesselI1e.pbtxt
@@ -0,0 +1,10 @@
+op {
+  graph_op_name: "BesselI1e"
+  summary: "Computes the Bessel i1e function of `x` element-wise."
+  description: <<END
+Exponentially scaled modified Bessel function of order 0 defined as
+`bessel_i1e(x) = exp(-abs(x)) bessel_i1(x)`.
+
+This function is faster and numerically stabler than `bessel_i1(x)`.
+END
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_CollectiveBcastRecv.pbtxt b/tensorflow/core/api_def/base_api/api_def_CollectiveBcastRecv.pbtxt
index 88049bca365f4a738cac9975d0e14340e1ae401d..988bf0a0f82796021fe68f1e96d4f72e18edffa4 100644
--- a/tensorflow/core/api_def/base_api/api_def_CollectiveBcastRecv.pbtxt
+++ b/tensorflow/core/api_def/base_api/api_def_CollectiveBcastRecv.pbtxt
@@ -1,5 +1,5 @@
 op {
   graph_op_name: "CollectiveBcastRecv"
-  visibility: SKIP
   summary: "Receives a tensor value broadcast from another device."
+  visibility: HIDDEN
 }
diff --git a/tensorflow/core/api_def/base_api/api_def_CollectiveBcastSend.pbtxt b/tensorflow/core/api_def/base_api/api_def_CollectiveBcastSend.pbtxt
index 7ff70f5b178af117e694cf7e998423b5ea58ac5a..d212f6dce723e13c26de776633a5998584f6421e 100644
--- a/tensorflow/core/api_def/base_api/api_def_CollectiveBcastSend.pbtxt
+++ b/tensorflow/core/api_def/base_api/api_def_CollectiveBcastSend.pbtxt
@@ -1,5 +1,5 @@
 op {
   graph_op_name: "CollectiveBcastSend"
-  visibility: SKIP
   summary: "Broadcasts a tensor value to one or more other devices."
+  visibility: HIDDEN
 }
diff --git a/tensorflow/core/api_def/base_api/api_def_CollectiveReduce.pbtxt b/tensorflow/core/api_def/base_api/api_def_CollectiveReduce.pbtxt
index 10d9771d46d408d9c0414dab4ae5954a75dfc47e..fdd9443ba5a4d72f4fdce2b673ef936c8149a2cf 100644
--- a/tensorflow/core/api_def/base_api/api_def_CollectiveReduce.pbtxt
+++ b/tensorflow/core/api_def/base_api/api_def_CollectiveReduce.pbtxt
@@ -1,5 +1,5 @@
 op {
   graph_op_name: "CollectiveReduce"
-  visibility: SKIP
   summary: "Mutually reduces multiple tensors of identical type and shape."
+  visibility: HIDDEN
 }
diff --git a/tensorflow/core/api_def/base_api/api_def_DatasetToGraph.pbtxt b/tensorflow/core/api_def/base_api/api_def_DatasetToGraph.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..55dd6179dd60f6811557eaa15d55b47f322b90c1
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_DatasetToGraph.pbtxt
@@ -0,0 +1,20 @@
+op {
+  graph_op_name: "DatasetToGraph"
+  visibility: HIDDEN
+  in_arg {
+    name: "input_dataset"
+    description: <<END
+A variant tensor representing the dataset to return the graph representation for.
+END
+  }
+  out_arg {
+    name: "graph"
+    description: <<END
+The graph representation of the dataset (as serialized GraphDef).
+END
+  }
+  summary: "Returns a serialized GraphDef representing `input_dataset`."
+  description: <<END
+Returns a graph representation for `input_dataset`.
+END
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_FakeParam.pbtxt b/tensorflow/core/api_def/base_api/api_def_FakeParam.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..d110aba42b27a012a0f2a2d24fef89258e350a56
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_FakeParam.pbtxt
@@ -0,0 +1,24 @@
+op {
+  graph_op_name: "FakeParam"
+  visibility: SKIP
+  out_arg {
+    name: "output"
+    description: <<END
+    \"Fake\" output value. This should not be consumed by another op.
+END
+  }
+  attr { name: "dtype"  description: "The type of the output." }
+  attr {
+    name: "shape"
+    description: <<END
+    The purported shape of the output. This is only used for shape inference;
+    the output will not necessarily have this shape. Can be a partial shape.
+END
+  }
+  summary: <<END
+  This op is used as a placeholder in If branch functions. It doesn't provide a
+  valid output when run, so must either be removed (e.g. replaced with a
+  function input) or guaranteed not to be used (e.g. if mirroring an
+  intermediate output needed for the gradient computation of the other branch).
+END
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_FeatureStatsDataset.pbtxt b/tensorflow/core/api_def/base_api/api_def_FeatureStatsDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..ffd01ba5cc36bf7f45ea9928df88fb31fc4f1daa
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_FeatureStatsDataset.pbtxt
@@ -0,0 +1,3 @@
+op {
+  graph_op_name: "FeatureStatsDataset"
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_IdentityDataset.pbtxt b/tensorflow/core/api_def/base_api/api_def_IdentityDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..ff2854fd2c1ddfa54f8c6370e73ef827ac15266a
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_IdentityDataset.pbtxt
@@ -0,0 +1,14 @@
+op {
+  graph_op_name: "IdentityDataset"
+  visibility: HIDDEN
+  in_arg {
+    name: "input_dataset"
+    description: <<END
+A variant tensor representing the input dataset.
+END
+  }
+  summary: "A placeholder for input pipeline graph optimizations."
+  description: <<END
+A placeholder for input pipeline graph optimizations.
+END
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_MatrixTriangularSolve.pbtxt b/tensorflow/core/api_def/base_api/api_def_MatrixTriangularSolve.pbtxt
index a2bfcdc66e48183edf054a3acbcaa462c80901fe..e90de741093e86a8ae8f623eb4978cbb0568e1b2 100644
--- a/tensorflow/core/api_def/base_api/api_def_MatrixTriangularSolve.pbtxt
+++ b/tensorflow/core/api_def/base_api/api_def_MatrixTriangularSolve.pbtxt
@@ -32,7 +32,7 @@ Boolean indicating whether to solve with `matrix` or its (block-wise)
          adjoint.
 
 @compatibility(numpy)
-Equivalent to np.linalg.triangular_solve
+Equivalent to scipy.linalg.solve_triangular
 @end_compatibility
 END
   }
diff --git a/tensorflow/core/api_def/base_api/api_def_OptimizeDataset.pbtxt b/tensorflow/core/api_def/base_api/api_def_OptimizeDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..f26eb6e3c344877d22608907933f8f3fefac75b7
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_OptimizeDataset.pbtxt
@@ -0,0 +1,20 @@
+op {
+  graph_op_name: "OptimizeDataset"
+  visibility: HIDDEN
+  in_arg {
+    name: "input_dataset"
+    description: <<END
+A variant tensor representing the input dataset.
+END
+  }
+  in_arg {
+    name: "optimizations"
+    description: <<END
+A `tf.string` vector `tf.Tensor` identifying optimizations to use.
+END
+  }
+  summary: "Creates a dataset by applying optimizations to `input_dataset`."
+  description: <<END
+Creates a dataset by applying optimizations to `input_dataset`.
+END
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_PaddedBatchDatasetV2.pbtxt b/tensorflow/core/api_def/base_api/api_def_PaddedBatchDatasetV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..9fefc0c41863ed5c4f3df54d927287ecf82e27b8
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_PaddedBatchDatasetV2.pbtxt
@@ -0,0 +1,35 @@
+op {
+  graph_op_name: "PaddedBatchDatasetV2"
+  visibility: HIDDEN
+  in_arg {
+    name: "batch_size"
+    description: <<END
+A scalar representing the number of elements to accumulate in a
+batch.
+END
+  }
+  in_arg {
+    name: "drop_remainder"
+    description: <<END
+A scalar representing whether the last batch should be dropped in case its size
+is smaller than desired.
+END
+  }
+  in_arg {
+    name: "padded_shapes"
+    description: <<END
+A list of int64 tensors representing the desired padded shapes
+of the corresponding output components. These shapes may be partially
+specified, using `-1` to indicate that a particular dimension should be
+padded to the maximum size of all batch elements.
+END
+  }
+  in_arg {
+    name: "padding_values"
+    description: <<END
+A list of scalars containing the padding value to use for
+each of the outputs.
+END
+  }
+  summary: "Creates a dataset that batches and pads `batch_size` elements from the input."
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_QuantizeAndDequantizeV2.pbtxt b/tensorflow/core/api_def/base_api/api_def_QuantizeAndDequantizeV2.pbtxt
index 1fc9c9034a1f56ea319f4d3e6df0ded47fda4200..9b500d0b58d2dad182a7069824a55ee953fbda05 100644
--- a/tensorflow/core/api_def/base_api/api_def_QuantizeAndDequantizeV2.pbtxt
+++ b/tensorflow/core/api_def/base_api/api_def_QuantizeAndDequantizeV2.pbtxt
@@ -9,21 +9,24 @@ END
   in_arg {
     name: "input_min"
     description: <<END
-If range_given, this is the min of the range, otherwise this input
-will be ignored.
+If `range_given == True`, this specifies the minimum input value that needs to
+be represented, otherwise it is determined from the min value of the `input`
+tensor.
 END
   }
   in_arg {
     name: "input_max"
     description: <<END
-If range_given, this is the max of the range, otherwise this input
-will be ignored.
+If `range_given == True`, this specifies the maximum input value that needs to
+be represented, otherwise it is determined from the max value of the `input`
+tensor.
 END
   }
   attr {
     name: "signed_input"
     description: <<END
-If the quantization is signed or unsigned.
+Whether the quantization is signed or unsigned. (actually this parameter should
+have been called <b>`signed_output`</b>)
 END
   }
   attr {
@@ -35,59 +38,60 @@ END
   attr {
     name: "range_given"
     description: <<END
-If the range is given or should be computed from the tensor.
+Whether the range is given or should be determined from the `input` tensor.
 END
   }
   summary: "Quantizes then dequantizes a tensor."
   description: <<END
 This op simulates the precision loss from the quantized forward pass by:
+
 1. Quantizing the tensor to fixed point numbers, which should match the target
    quantization method when it is used in inference.
 2. Dequantizing it back to floating point numbers for the following ops, most
    likely matmul.
 
-There are different ways to quantize. This version does not use the full range
-of the output type, choosing to elide the lowest possible value for symmetry
-(e.g., output range is -127 to 127, not -128 to 127 for signed 8 bit
-quantization), so that 0.0 maps to 0.
-
-To perform this op, we first find the range of values in our tensor. The range
-we use is always centered on 0, so we find m such that
-
-1. m = max(abs(input_min), abs(input_max)) if range_given is true,
-2. m = max(abs(min_elem(input)), abs(max_elem(input))) otherwise.
-
-Our input tensor range is then [-m, m].
+There are different ways to quantize. This version uses only scaling, so 0.0
+maps to 0.
 
-Next, we choose our fixed-point quantization buckets, [min_fixed, max_fixed].
-If signed_input is true, this is
+From the specified 'num_bits' in the quantized output type, it determines
+minimum and maximum representable quantized values.
 
-  [min_fixed, max_fixed ] =
-      [-(1 << (num_bits - 1) - 1), (1 << (num_bits - 1)) - 1].
+e.g.
 
-Otherwise, if signed_input is false, the fixed-point range is
+*   [-128, 127] for signed, num_bits = 8, or
+*   [0, 255] for unsigned, num_bits = 8.
 
-  [min_fixed, max_fixed] = [0, (1 << num_bits) - 1].
+If range_given == False, the initial input_min, input_max will be determined
+automatically as the minimum and maximum values in the input tensor, otherwise
+the specified values of input_min, input_max are used.
 
-From this we compute our scaling factor, s:
+Note: If the input_min, input_max are specified, they do not need to equal the
+actual minimum and maximum values in the tensor. e.g. in some cases it may be
+beneficial to specify these values such that the low probability extremes of the
+input distribution are clipped.
 
-  s = (max_fixed - min_fixed) / (2 * m).
+This op determines the maximum scale_factor that would map the initial
+[input_min, input_max] range to a range that lies within the representable
+quantized range.
 
-Now we can quantize and dequantize the elements of our tensor.  An element e
-is transformed into e':
+It determines the scale from one of input_min and input_max, then updates the
+other one to maximize the respresentable range.
 
-  e' = (e * s).round_to_nearest() / s.
+e.g.
 
-Note that we have a different number of buckets in the signed vs. unsigned
-cases.  For example, if num_bits == 8, we get 254 buckets in the signed case
-vs. 255 in the unsigned case.
+*   if the output is signed, num_bits = 8, [input_min, input_max] = [-10.0,
+    5.0]: it would use a scale_factor of -128 / -10.0 = 12.8 In this case, it
+    would update input_max to be 127 / 12.8 = 9.921875
+*   if the output is signed, num_bits = 8, [input_min, input_max] = [-10.0,
+    10.0]: it would use a scale_factor of 127 / 10.0 = 12.7 In this case, it
+    would update input_min to be 128.0 / 12.7 = -10.07874
+*   if the output is unsigned, input_min is forced to be 0, and only the
+    specified input_max is used.
 
-For example, suppose num_bits = 8 and m = 1.  Then
+After determining the scale_factor and updating the input range, it applies the
+following to each value in the 'input' tensor.
 
-  [min_fixed, max_fixed] = [-127, 127], and
-  s = (127 + 127) / 2 = 127.
+output = round(clamp(value, input_min, input_max) * scale_factor) / scale_factor.
 
-Given the vector {-1, -0.5, 0, 0.3}, this is quantized to
-{-127, -63, 0, 38}, and dequantized to {-1, -63.0/127, 0, 38.0/127}.
 END
 }
diff --git a/tensorflow/core/api_def/base_api/api_def_ReduceJoin.pbtxt b/tensorflow/core/api_def/base_api/api_def_ReduceJoin.pbtxt
index ca7e0d3beefa219a468135ec53b2564f0d0287c7..d13866ddaa1308c21f398bdf8b584f567d4d7567 100644
--- a/tensorflow/core/api_def/base_api/api_def_ReduceJoin.pbtxt
+++ b/tensorflow/core/api_def/base_api/api_def_ReduceJoin.pbtxt
@@ -38,7 +38,9 @@ END
 Computes the string join across dimensions in the given string Tensor of shape
 `[d_0, d_1, ..., d_n-1]`.  Returns a new Tensor created by joining the input
 strings with the given separator (default: empty string).  Negative indices are
-counted backwards from the end, with `-1` being equivalent to `n - 1`.
+counted backwards from the end, with `-1` being equivalent to `n - 1`.  If
+indices are not specified, joins across all dimensions beginning from `n - 1`
+through `0`.
 
 For example:
 
@@ -51,9 +53,10 @@ tf.reduce_join(a, -1) = tf.reduce_join(a, 1) ==> ["ab", "cd"]
 tf.reduce_join(a, 0, keep_dims=True) ==> [["ac", "bd"]]
 tf.reduce_join(a, 1, keep_dims=True) ==> [["ab"], ["cd"]]
 tf.reduce_join(a, 0, separator=".") ==> ["a.c", "b.d"]
-tf.reduce_join(a, [0, 1]) ==> ["acbd"]
-tf.reduce_join(a, [1, 0]) ==> ["abcd"]
-tf.reduce_join(a, []) ==> ["abcd"]
+tf.reduce_join(a, [0, 1]) ==> "acbd"
+tf.reduce_join(a, [1, 0]) ==> "abcd"
+tf.reduce_join(a, []) ==> [["a", "b"], ["c", "d"]]
+tf.reduce_join(a) = tf.reduce_join(a, [1, 0]) ==> "abcd"
 ```
 END
 }
diff --git a/tensorflow/core/api_def/base_api/api_def_ResourceScatterNdAdd.pbtxt b/tensorflow/core/api_def/base_api/api_def_ResourceScatterNdAdd.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..3b3a274df5d51706a07b9165df0bb5bfda8de776
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_ResourceScatterNdAdd.pbtxt
@@ -0,0 +1,69 @@
+op {
+  graph_op_name: "ResourceScatterNdAdd"
+  in_arg {
+    name: "ref"
+    description: <<END
+A resource handle. Must be from a VarHandleOp.
+END
+  }
+  in_arg {
+    name: "indices"
+    description: <<END
+A Tensor. Must be one of the following types: int32, int64.
+A tensor of indices into ref.
+END
+  }
+  in_arg {
+    name: "updates"
+    description: <<END
+A Tensor. Must have the same type as ref. A tensor of
+values to add to ref.
+END
+  }
+  attr {
+    name: "use_locking"
+    description: <<END
+An optional bool. Defaults to True. If True, the assignment will
+be protected by a lock; otherwise the behavior is undefined,
+but may exhibit less contention.
+END
+  }
+  summary: "Adds sparse `updates` to individual values or slices within a given"
+  description: <<END
+variable according to `indices`.
+
+`ref` is a `Tensor` with rank `P` and `indices` is a `Tensor` of rank `Q`.
+
+`indices` must be integer tensor, containing indices into `ref`.
+It must be shape `[d_0, ..., d_{Q-2}, K]` where `0 < K <= P`.
+
+The innermost dimension of `indices` (with length `K`) corresponds to
+indices into elements (if `K = P`) or slices (if `K < P`) along the `K`th
+dimension of `ref`.
+
+`updates` is `Tensor` of rank `Q-1+P-K` with shape:
+
+```
+[d_0, ..., d_{Q-2}, ref.shape[K], ..., ref.shape[P-1]].
+```
+
+For example, say we want to update 4 scattered elements to a rank-1 tensor to
+8 elements. In Python, that update would look like this:
+
+```python
+    ref = tfe.Variable([1, 2, 3, 4, 5, 6, 7, 8])
+    indices = tf.constant([[4], [3], [1] ,[7]])
+    updates = tf.constant([9, 10, 11, 12])
+    update = tf.scatter_nd_add(ref, indices, updates)
+    with tf.Session() as sess:
+      print sess.run(update)
+```
+
+The resulting update to ref would look like this:
+
+    [1, 12, 3, 14, 14, 6, 7, 20]
+
+See @{tf.scatter_nd} for more details about how to make updates to
+slices.
+END
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_Selu.pbtxt b/tensorflow/core/api_def/base_api/api_def_Selu.pbtxt
index cbe76de415125663ff47d3f0fac99f27ad029086..985f09312f515ccaa9e1075cfe483a7a5b0a5049 100644
--- a/tensorflow/core/api_def/base_api/api_def_Selu.pbtxt
+++ b/tensorflow/core/api_def/base_api/api_def_Selu.pbtxt
@@ -4,6 +4,10 @@ op {
   description: <<END
 if < 0, `scale * features` otherwise.
 
+To be used together with
+`initializer = tf.variance_scaling_initializer(factor=1.0, mode='FAN_IN')`.
+For correct dropout, use `tf.contrib.nn.alpha_dropout`.
+
 See [Self-Normalizing Neural Networks](https://arxiv.org/abs/1706.02515)
 END
 }
diff --git a/tensorflow/core/api_def/base_api/api_def_SparseMatMul.pbtxt b/tensorflow/core/api_def/base_api/api_def_SparseMatMul.pbtxt
index 58f2ede62984073d5944226f4b58bc95818b3f32..fe568df388073c715a5003a1c1a97eaf41450417 100644
--- a/tensorflow/core/api_def/base_api/api_def_SparseMatMul.pbtxt
+++ b/tensorflow/core/api_def/base_api/api_def_SparseMatMul.pbtxt
@@ -3,9 +3,11 @@ op {
   summary: "Multiply matrix \"a\" by matrix \"b\"."
   description: <<END
 The inputs must be two-dimensional matrices and the inner dimension of "a" must
-match the outer dimension of "b". This op is optimized for the case where at
-least one of "a" or "b" is sparse. The breakeven for using this versus a dense
-matrix multiply on one platform was 30% zero values in the sparse matrix.
+match the outer dimension of "b". Both "a" and "b" must be `Tensor`s not
+`SparseTensor`s.  This op is optimized for the case where at least one of "a" or
+"b" is sparse, in the sense that they have a large proportion of zero values.
+The breakeven for using this versus a dense matrix multiply on one platform was
+30% zero values in the sparse matrix.
 
 The gradient computation of this operation will only take advantage of sparsity
 in the input gradient when that gradient comes from a Relu.
diff --git a/tensorflow/core/api_def/base_api/api_def_StringSplitV2.pbtxt b/tensorflow/core/api_def/base_api/api_def_StringSplitV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..6e13d0d0497a6154a0ab9d9b0730ef8b95b5c492
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_StringSplitV2.pbtxt
@@ -0,0 +1,48 @@
+op {
+  graph_op_name: "StringSplitV2"
+  in_arg {
+    name: "input"
+    description: <<END
+`1-D` string `Tensor`, the strings to split.
+END
+  }
+  in_arg {
+    name: "sep"
+    description: <<END
+`0-D` string `Tensor`, the delimiter character.
+END
+  }
+  attr {
+    name: "maxsplit"
+    description: <<END
+An `int`. If `maxsplit > 0`, limit of the split of the result.
+END
+  }
+  summary: "Split elements of `source` based on `sep` into a `SparseTensor`."
+  description: <<END
+Let N be the size of source (typically N will be the batch size). Split each
+element of `source` based on `sep` and return a `SparseTensor`
+containing the split tokens. Empty tokens are ignored.
+
+For example, N = 2, source[0] is 'hello world' and source[1] is 'a b c',
+then the output will be
+```
+st.indices = [0, 0;
+              0, 1;
+              1, 0;
+              1, 1;
+              1, 2]
+st.shape = [2, 3]
+st.values = ['hello', 'world', 'a', 'b', 'c']
+```
+
+If `sep` is given, consecutive delimiters are not grouped together and are
+deemed to delimit empty strings. For example, source of `"1<>2<><>3"` and
+sep of `"<>"` returns `["1", "2", "", "3"]`. If `sep` is None or an empty
+string, consecutive whitespace are regarded as a single separator, and the
+result will contain no empty strings at the startor end if the string has
+leading or trailing whitespace.
+
+Note that the above mentioned behavior matches python's str.split.
+END
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_TensorArrayGradWithShape.pbtxt b/tensorflow/core/api_def/base_api/api_def_TensorArrayGradWithShape.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..dd37b94ffa6bc86f043cdca6255f8d3031a6c742
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_TensorArrayGradWithShape.pbtxt
@@ -0,0 +1,40 @@
+op {
+  graph_op_name: "TensorArrayGradWithShape"
+  endpoint {
+    name: "TensorArrayGradWithShape"
+  }
+  in_arg {
+    name: "handle"
+    description: <<END
+The handle to the forward TensorArray.
+END
+  }
+  in_arg {
+    name: "flow_in"
+    description: <<END
+A float scalar that enforces proper chaining of operations.
+END
+  }
+  in_arg {
+    name: "shape_to_prepend"
+    description: <<END
+An int32 vector representing a shape. Elements in the gradient accumulator will
+have shape which is this shape_to_prepend value concatenated with shape of the
+elements in the TensorArray corresponding to the input handle.
+END
+  }
+  attr {
+    name: "source"
+    description: <<END
+The gradient source string, used to decide which gradient TensorArray
+to return.
+END
+  }
+  summary: "Creates a TensorArray for storing multiple gradients of values in the given handle."
+  description: <<END
+Similar to TensorArrayGradV3. However it creates an accumulator with an
+expanded shape compared to the input TensorArray whose gradient is being
+computed. This enables multiple gradients for the same TensorArray to be
+calculated using the same accumulator.
+END
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_AnonymousIterator.pbtxt b/tensorflow/core/api_def/python_api/api_def_AnonymousIterator.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..98b7def4d67a74d3be13544014d7bbfe044ddd8c
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_AnonymousIterator.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "AnonymousIterator"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_BesselI0e.pbtxt b/tensorflow/core/api_def/python_api/api_def_BesselI0e.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..7965af4916e7b8f590bd22452459410075c37cf8
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_BesselI0e.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "BesselI0e"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_BesselI1e.pbtxt b/tensorflow/core/api_def/python_api/api_def_BesselI1e.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..dffd296f6d8288356add56f8fbff01bfc4c9213a
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_BesselI1e.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "BesselI1e"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_BroadcastTo.pbtxt b/tensorflow/core/api_def/python_api/api_def_BroadcastTo.pbtxt
deleted file mode 100644
index 083eeced81dfc04d01c8721e3fb65727ef13176a..0000000000000000000000000000000000000000
--- a/tensorflow/core/api_def/python_api/api_def_BroadcastTo.pbtxt
+++ /dev/null
@@ -1,4 +0,0 @@
-op {
-  graph_op_name: "BroadcastTo"
-  visibility: HIDDEN
-}
diff --git a/tensorflow/core/api_def/python_api/api_def_CollectiveBcastRecv.pbtxt b/tensorflow/core/api_def/python_api/api_def_CollectiveBcastRecv.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..78034ccffdfad011394dcc7c97f81b3eb1c65039
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_CollectiveBcastRecv.pbtxt
@@ -0,0 +1,6 @@
+op {
+  graph_op_name: "CollectiveBcastRecv"
+  endpoint {
+    name: "collective.broadcast_recv"
+  }
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_CollectiveBcastSend.pbtxt b/tensorflow/core/api_def/python_api/api_def_CollectiveBcastSend.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..9d6b2f83fe012f1a33c22b1442270351bbdaee97
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_CollectiveBcastSend.pbtxt
@@ -0,0 +1,6 @@
+op {
+  graph_op_name: "CollectiveBcastSend"
+  endpoint {
+    name: "collective.broadcast_send"
+  }
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_CollectiveReduce.pbtxt b/tensorflow/core/api_def/python_api/api_def_CollectiveReduce.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..27ae8a833a26f22cbb653bc99345de4a514803dc
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_CollectiveReduce.pbtxt
@@ -0,0 +1,6 @@
+op {
+  graph_op_name: "CollectiveReduce"
+  endpoint {
+    name: "collective.all_reduce"
+  }
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_FakeParam.pbtxt b/tensorflow/core/api_def/python_api/api_def_FakeParam.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..57fa8ff5b982a4c32968b01e3555ed8e821fa60a
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_FakeParam.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "FakeParam"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_FeatureStatsDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_FeatureStatsDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..7f721f4fb7da0a8344e87dd6f96cfcf7d4d0449e
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_FeatureStatsDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "FeatureStatsDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceScatterNdAdd.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceScatterNdAdd.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..ffef3ab52266865a2f7fb86b647379e72b586f1e
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceScatterNdAdd.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceScatterNdAdd"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ScatterNdAdd.pbtxt b/tensorflow/core/api_def/python_api/api_def_ScatterNdAdd.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..f6c8af5c33ebcef2dc81d53ee2e9bd9267ff3706
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ScatterNdAdd.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ScatterNdAdd"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_StringSplitV2.pbtxt b/tensorflow/core/api_def/python_api/api_def_StringSplitV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..0e8576fb01a347e0efdd88f9eac47d66cc6e36e8
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_StringSplitV2.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "StringSplitV2"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_TensorArrayGradWithShape.pbtxt b/tensorflow/core/api_def/python_api/api_def_TensorArrayGradWithShape.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..5d76c112a00e736d2f7da55cf2c0f6f644d911a8
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_TensorArrayGradWithShape.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "TensorArrayGradWithShape"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/common_runtime/build_graph_options.cc b/tensorflow/core/common_runtime/build_graph_options.cc
index a9dc6ca6cda9443ae9737267aca5f361e492d22d..00f7a8e6452f9cc234c5868437d40ccc99dbaf87 100644
--- a/tensorflow/core/common_runtime/build_graph_options.cc
+++ b/tensorflow/core/common_runtime/build_graph_options.cc
@@ -32,6 +32,9 @@ string BuildGraphOptions::DebugString() const {
   for (auto& s : callable_options.target()) {
     strings::StrAppend(&rv, s, ", ");
   }
+  if (collective_graph_key != kNoCollectiveGraphKey) {
+    strings::StrAppend(&rv, "\ncollective_graph_key: ", collective_graph_key);
+  }
   return rv;
 }
 
diff --git a/tensorflow/core/common_runtime/build_graph_options.h b/tensorflow/core/common_runtime/build_graph_options.h
index 5ca170e922ce348fb3f76b1129b22ca01804054c..3d0f242ea5177fd5a99a925f998ec5252a313327 100644
--- a/tensorflow/core/common_runtime/build_graph_options.h
+++ b/tensorflow/core/common_runtime/build_graph_options.h
@@ -31,6 +31,9 @@ struct BuildGraphOptions {
   // TODO(mrry): Remove this when the distributed runtime supports Arg/Retval.
   bool use_function_convention = false;
 
+  static const int64 kNoCollectiveGraphKey = 0;
+  int64 collective_graph_key = kNoCollectiveGraphKey;
+
   string DebugString() const;
 };
 
diff --git a/tensorflow/core/common_runtime/collective_executor_mgr.cc b/tensorflow/core/common_runtime/collective_executor_mgr.cc
index e07829b286741e18db21e3c491973ec8f4b973dc..4f03a5e13ad59b59c7675ac344b2894b19c3be22 100644
--- a/tensorflow/core/common_runtime/collective_executor_mgr.cc
+++ b/tensorflow/core/common_runtime/collective_executor_mgr.cc
@@ -25,11 +25,11 @@ namespace tensorflow {
 
 CollectiveExecutorMgr::CollectiveExecutorMgr(
     const ConfigProto& config, const DeviceMgr* dev_mgr,
-    DeviceResolverInterface* dev_resolver,
-    ParamResolverInterface* param_resolver)
+    std::unique_ptr<DeviceResolverInterface> dev_resolver,
+    std::unique_ptr<ParamResolverInterface> param_resolver)
     : dev_mgr_(dev_mgr),
-      dev_resolver_(dev_resolver),
-      param_resolver_(param_resolver) {}
+      dev_resolver_(std::move(dev_resolver)),
+      param_resolver_(std::move(param_resolver)) {}
 
 CollectiveExecutorMgr::~CollectiveExecutorMgr() {
   for (auto iter : executor_table_) {
@@ -45,9 +45,7 @@ CollectiveExecutor* CollectiveExecutorMgr::FindOrCreate(int64 step_id) {
     if (it != executor_table_.end()) {
       ce = it->second;
     } else {
-      CollectiveRemoteAccessLocal* rma = new CollectiveRemoteAccessLocal(
-          dev_mgr_, dev_resolver_.get(), step_id);
-      ce = new BaseCollectiveExecutor(this, rma, step_id, dev_mgr_);
+      ce = Create(step_id);
       executor_table_[step_id] = ce;
     }
     ce->Ref();
@@ -55,6 +53,12 @@ CollectiveExecutor* CollectiveExecutorMgr::FindOrCreate(int64 step_id) {
   return ce;
 }
 
+CollectiveExecutor* CollectiveExecutorMgr::Create(int64 step_id) {
+  CollectiveRemoteAccessLocal* rma =
+      new CollectiveRemoteAccessLocal(dev_mgr_, dev_resolver_.get(), step_id);
+  return new BaseCollectiveExecutor(this, rma, step_id, dev_mgr_);
+}
+
 void CollectiveExecutorMgr::Cleanup(int64 step_id) {
   CollectiveExecutor* ce = nullptr;
   {
diff --git a/tensorflow/core/common_runtime/collective_executor_mgr.h b/tensorflow/core/common_runtime/collective_executor_mgr.h
index 4b42e2b4d16c5804e0660079c7a149442b47edb0..9de6ab8968325d5414b714b3f6eb5d34abf16f4a 100644
--- a/tensorflow/core/common_runtime/collective_executor_mgr.h
+++ b/tensorflow/core/common_runtime/collective_executor_mgr.h
@@ -25,8 +25,8 @@ class DeviceMgr;
 class CollectiveExecutorMgr : public CollectiveExecutorMgrInterface {
  public:
   CollectiveExecutorMgr(const ConfigProto& config, const DeviceMgr* dev_mgr,
-                        DeviceResolverInterface* dev_resolver,
-                        ParamResolverInterface* param_resolver);
+                        std::unique_ptr<DeviceResolverInterface> dev_resolver,
+                        std::unique_ptr<ParamResolverInterface> param_resolver);
 
   virtual ~CollectiveExecutorMgr();
 
@@ -56,11 +56,16 @@ class CollectiveExecutorMgr : public CollectiveExecutorMgrInterface {
   void RetireStepId(int64 graph_key, int64 step_id) override {}
 
  protected:
+  // Called by FindOrCreate when table entry does not yet exist.
+  virtual CollectiveExecutor* Create(int64 step_id);
+
   const DeviceMgr* dev_mgr_;
   std::unique_ptr<DeviceResolverInterface> dev_resolver_;
   std::unique_ptr<ParamResolverInterface> param_resolver_;
   CollectiveRemoteAccess* remote_access_;
   string task_name_;
+
+ private:
   mutex exec_mu_;
   // Map from step_id to CollectiveExecutor
   gtl::FlatMap<int64, CollectiveExecutor*> executor_table_ GUARDED_BY(exec_mu_);
diff --git a/tensorflow/core/common_runtime/collective_executor_mgr_test.cc b/tensorflow/core/common_runtime/collective_executor_mgr_test.cc
index 34c9163d6a40ba47323afc306cc2803b643e1d8b..91994c57311f95a669949a38c161f7d3acf5f54d 100644
--- a/tensorflow/core/common_runtime/collective_executor_mgr_test.cc
+++ b/tensorflow/core/common_runtime/collective_executor_mgr_test.cc
@@ -40,10 +40,13 @@ class CollectiveExecutorMgrTest : public ::testing::Test {
     device_count->insert({"CPU", NUM_DEVS});
     TF_CHECK_OK(DeviceFactory::AddDevices(options, task_name, &devices_));
     device_mgr_.reset(new DeviceMgr(devices_));
-    DeviceResolverLocal* drl = new DeviceResolverLocal(device_mgr_.get());
-    cme_.reset(new CollectiveExecutorMgr(
-        cp, device_mgr_.get(), drl,
-        new CollectiveParamResolverLocal(device_mgr_.get(), drl, task_name)));
+    std::unique_ptr<DeviceResolverInterface> drl(
+        new DeviceResolverLocal(device_mgr_.get()));
+    std::unique_ptr<ParamResolverInterface> prl(
+        new CollectiveParamResolverLocal(device_mgr_.get(), drl.get(),
+                                         task_name));
+    cme_.reset(new CollectiveExecutorMgr(cp, device_mgr_.get(), std::move(drl),
+                                         std::move(prl)));
   }
 
   std::unique_ptr<CollectiveExecutorMgr> cme_;
diff --git a/tensorflow/core/common_runtime/collective_param_resolver_local.h b/tensorflow/core/common_runtime/collective_param_resolver_local.h
index 3a871f962dfbefd0e06d0e594be6dca9170a2089..43c404f2ec000e5628f3ead1d2748715d527b5ca 100644
--- a/tensorflow/core/common_runtime/collective_param_resolver_local.h
+++ b/tensorflow/core/common_runtime/collective_param_resolver_local.h
@@ -201,7 +201,7 @@ class CollectiveParamResolverLocal : public ParamResolverInterface {
       LOCKS_EXCLUDED(irec->out_mu);
 
   const DeviceMgr* dev_mgr_;
-  DeviceResolverInterface* dev_resolver_;
+  DeviceResolverInterface* dev_resolver_;  // Not owned.
   string task_name_;
   mutex group_mu_;
   gtl::FlatMap<int32, std::unique_ptr<GroupRec>> group_table_
diff --git a/tensorflow/core/common_runtime/direct_session.cc b/tensorflow/core/common_runtime/direct_session.cc
index 0afbd02e866c279134d5c3576127f6d174068f60..5cef93c605bebc0a74f5053d8aaec8deb1ecec6c 100644
--- a/tensorflow/core/common_runtime/direct_session.cc
+++ b/tensorflow/core/common_runtime/direct_session.cc
@@ -19,15 +19,19 @@ limitations under the License.
 #include <string>
 #include <vector>
 
+#include "tensorflow/core/common_runtime/collective_executor_mgr.h"
+#include "tensorflow/core/common_runtime/collective_param_resolver_local.h"
 #include "tensorflow/core/common_runtime/constant_folding.h"
 #include "tensorflow/core/common_runtime/debugger_state_interface.h"
 #include "tensorflow/core/common_runtime/device_factory.h"
+#include "tensorflow/core/common_runtime/device_resolver_local.h"
 #include "tensorflow/core/common_runtime/executor.h"
 #include "tensorflow/core/common_runtime/function.h"
 #include "tensorflow/core/common_runtime/graph_optimizer.h"
 #include "tensorflow/core/common_runtime/memory_types.h"
 #include "tensorflow/core/common_runtime/optimization_registry.h"
 #include "tensorflow/core/common_runtime/process_util.h"
+#include "tensorflow/core/common_runtime/scoped_allocator_mgr.h"
 #include "tensorflow/core/common_runtime/step_stats_collector.h"
 #include "tensorflow/core/framework/function.h"
 #include "tensorflow/core/framework/graph.pb_text.h"
@@ -443,6 +447,20 @@ Status DirectSession::RunInternal(int64 step_id, const RunOptions& run_options,
   // Create a run state and start execution.
   RunState run_state(step_id, &devices_);
   run_state.rendez = new IntraProcessRendezvous(device_mgr_.get());
+  // Set up for collectives if the RunOption declares a key.
+  if (run_options.experimental().collective_graph_key() > 0) {
+    if (!collective_executor_mgr_) {
+      std::unique_ptr<DeviceResolverInterface> drl(
+          new DeviceResolverLocal(device_mgr_.get()));
+      std::unique_ptr<ParamResolverInterface> cprl(
+          new CollectiveParamResolverLocal(device_mgr_.get(), drl.get(),
+                                           "/job:localhost/replica:0/task:0"));
+      collective_executor_mgr_.reset(new CollectiveExecutorMgr(
+          options_.config, device_mgr_.get(), std::move(drl), std::move(cprl)));
+    }
+    run_state.collective_executor.reset(new CollectiveExecutor::Handle(
+        collective_executor_mgr_->FindOrCreate(step_id), true /*inherit_ref*/));
+  }
 
   // Start parallel Executors.
   const size_t num_executors = executors_and_keys->items.size();
@@ -459,6 +477,9 @@ Status DirectSession::RunInternal(int64 step_id, const RunOptions& run_options,
   args.step_id = step_id;
   args.call_frame = call_frame;
   args.rendezvous = run_state.rendez;
+  args.collective_executor =
+      (run_state.collective_executor ? run_state.collective_executor->get()
+                                     : nullptr);
   CancellationManager step_cancellation_manager;
   args.cancellation_manager = &step_cancellation_manager;
   args.session_state = &session_state_;
@@ -768,6 +789,10 @@ Status DirectSession::PRunSetup(const std::vector<string>& input_names,
 
   args.rendezvous = run_state->rendez;
   args.cancellation_manager = cancellation_manager_;
+  // Note that Collectives are not supported in partial runs
+  // because RunOptions is not passed in so we can't know whether
+  // their use is intended.
+  args.collective_executor = nullptr;
   args.runner = [this, pool](Executor::Args::Closure c) {
     SchedClosure(pool, std::move(c));
   };
@@ -1518,11 +1543,13 @@ DirectSession::RunState::RunState(
     const std::vector<string>& pending_input_names,
     const std::vector<string>& pending_output_names, int64 step_id,
     const std::vector<Device*>* devices)
-    : step_container(step_id, [devices](const string& name) {
+    : step_container(step_id, [devices, step_id](const string& name) {
         for (auto d : *devices) {
           if (!d->resource_manager()->Cleanup(name).ok()) {
             // Do nothing...
           }
+          ScopedAllocatorMgr* sam = d->GetScopedAllocatorMgr();
+          if (sam) sam->Cleanup(step_id);
         }
       }) {
   // Initially all the feeds and fetches are pending.
diff --git a/tensorflow/core/common_runtime/direct_session.h b/tensorflow/core/common_runtime/direct_session.h
index 6f9c1b980b30213373d82ded639fc2e614b3752d..72a2be48162dec295d0c8e02630116ced95182ad 100644
--- a/tensorflow/core/common_runtime/direct_session.h
+++ b/tensorflow/core/common_runtime/direct_session.h
@@ -33,6 +33,7 @@ limitations under the License.
 #include "tensorflow/core/common_runtime/rendezvous_mgr.h"
 #include "tensorflow/core/common_runtime/session_factory.h"
 #include "tensorflow/core/framework/cancellation.h"
+#include "tensorflow/core/framework/collective.h"
 #include "tensorflow/core/framework/graph.pb.h"
 #include "tensorflow/core/framework/session_state.h"
 #include "tensorflow/core/framework/tensor.h"
@@ -175,6 +176,7 @@ class DirectSession : public Session {
     mutex mu_;
     Status status GUARDED_BY(mu_);
     IntraProcessRendezvous* rendez = nullptr;
+    std::unique_ptr<CollectiveExecutor::Handle> collective_executor;
     std::unique_ptr<StepStatsCollector> collector;
     Notification executors_done;
     std::unordered_map<string, bool> pending_inputs;   // true if fed
@@ -352,6 +354,7 @@ class DirectSession : public Session {
 
   DirectSessionFactory* const factory_;  // not owned
   CancellationManager* cancellation_manager_;
+  std::unique_ptr<CollectiveExecutorMgrInterface> collective_executor_mgr_;
 
   // Map of placed stateful nodes, i.e. nodes for which is_stateful()
   // is true, such as "params" and "queue" nodes.  Once placed these
diff --git a/tensorflow/core/common_runtime/eager/execute.cc b/tensorflow/core/common_runtime/eager/execute.cc
index ce989f4b4eb9b42ae26f5b9fb2e0ac0a32d37c00..c619857b7803379df2deedda67aad4e75d3c2101 100644
--- a/tensorflow/core/common_runtime/eager/execute.cc
+++ b/tensorflow/core/common_runtime/eager/execute.cc
@@ -66,6 +66,88 @@ int StepStatsDeviceIndex(StepStats* step_stats, EagerContext* ctx,
   return 0;
 }
 
+// This function expects *handle to point to an existing tensor handle. The
+// function will (maybe) update the *handle to be pointed to the newly copied
+// tensor handle.
+//
+// The passed in *handle will be Unreffed if it is replaced.
+Status MaybeCopyInputToExpectedDevice(EagerOperation* op, int i,
+                                      const Device* expected_device,
+                                      RunMetadata* run_metadata,
+                                      TensorHandle** handle) {
+  EagerContext* ctx = op->EagerContext();
+  Device* handle_device = nullptr;
+  TF_RETURN_IF_ERROR((*handle)->Device(&handle_device));
+  const Device* actual_device =
+      handle_device == nullptr ? ctx->HostCPU() : handle_device;
+
+  if (expected_device != actual_device) {
+    switch (ctx->GetDevicePlacementPolicy()) {
+      case DEVICE_PLACEMENT_SILENT_FOR_INT32:
+        // TODO(xpan): See if we could bubble python related error up
+        // to python level.
+        if ((*handle)->dtype == DT_INT32) {
+          // Note: enabling silent copies of int32 tensors to match behavior
+          // of graph mode.
+          break;
+        }
+        TF_FALLTHROUGH_INTENDED;
+      case DEVICE_PLACEMENT_EXPLICIT:
+        return errors::InvalidArgument(
+            "Tensors on conflicting devices:"
+            " cannot compute ",
+            op->Name(), " as input #", i, " was expected to be on ",
+            expected_device->name(), " but is actually on ",
+            actual_device->name(), " (operation running on ",
+            op->Device()->name(), ")",
+            " Tensors can be copied explicitly using .gpu() or .cpu() "
+            "methods,"
+            " or transparently copied by using tf.enable_eager_execution("
+            "device_policy=tfe.DEVICE_PLACEMENT_SILENT). Copying tensors "
+            "between devices"
+            " may slow down your model");
+      case DEVICE_PLACEMENT_WARN:
+        LOG(WARNING) << "before computing " << op->Name() << " input #" << i
+                     << " was expected to be on " << expected_device->name()
+                     << " but is actually on " << actual_device->name()
+                     << " (operation running on " << op->Device()->name()
+                     << "). This triggers a copy which can be a performance "
+                        "bottleneck.";
+        break;
+      case DEVICE_PLACEMENT_SILENT:  // Do nothing.
+        break;
+    }
+    // We are only here if the policy is warn or silent copies, so we should
+    // trigger a copy.
+    auto pre_time = Env::Default()->NowMicros();
+    TensorHandle* result_handle;
+    Status status = EagerCopyToDevice(
+        *handle, ctx, expected_device->name().c_str(), &result_handle);
+    if (run_metadata != nullptr) {
+      auto* step_stats = run_metadata->mutable_step_stats();
+      MaybeInitializeStepStats(step_stats, ctx);
+      // Record the sending on the source device for now.
+      int device_idx = StepStatsDeviceIndex(step_stats, ctx, handle_device);
+      auto* dev_stats = step_stats->mutable_dev_stats(device_idx);
+      auto* node_stats = dev_stats->add_node_stats();
+      node_stats->set_node_name("_Send");
+      node_stats->set_all_start_micros(pre_time);
+      node_stats->set_op_end_rel_micros(Env::Default()->NowMicros() - pre_time);
+    }
+    if (!status.ok()) {
+      if (result_handle != nullptr) result_handle->Unref();
+      return errors::Internal("Failed copying input tensor from ",
+                              actual_device->name(), " to ",
+                              expected_device->name(), " in order to run ",
+                              op->Name(), ": ", status.error_message());
+    }
+
+    (*handle)->Unref();
+    *handle = result_handle;
+  }
+  return Status::OK();
+}
+
 Status ValidateInputTypeAndPlacement(EagerContext* ctx, Device* op_device,
                                      EagerOperation* op, const OpKernel* kernel,
                                      RunMetadata* run_metadata) {
@@ -78,76 +160,9 @@ Status ValidateInputTypeAndPlacement(EagerContext* ctx, Device* op_device,
   for (int i = 0; i < op->Inputs().size(); ++i) {
     const Device* expected_device =
         memtypes[i] == HOST_MEMORY ? host_device : op_device;
-    TensorHandle* handle = op->Inputs()[i];
-    Device* handle_device = nullptr;
-    TF_RETURN_IF_ERROR(handle->Device(&handle_device));
-    const Device* actual_device =
-        handle_device == nullptr ? host_device : handle_device;
-    if (expected_device != actual_device) {
-      switch (ctx->GetDevicePlacementPolicy()) {
-        case DEVICE_PLACEMENT_SILENT_FOR_INT32:
-          // TODO(xpan): See if we could bubble python related error up
-          // to python level.
-          if (handle->dtype == DT_INT32) {
-            // Note: enabling silent copies of int32 tensors to match behavior
-            // of graph mode.
-            break;
-          }
-          TF_FALLTHROUGH_INTENDED;
-        case DEVICE_PLACEMENT_EXPLICIT:
-          return errors::InvalidArgument(
-              "Tensors on conflicting devices:"
-              " cannot compute ",
-              op->Name(), " as input #", i, " was expected to be on ",
-              expected_device->name(), " but is actually on ",
-              actual_device->name(), " (operation running on ",
-              op_device->name(), ")",
-              " Tensors can be copied explicitly using .gpu() or .cpu() "
-              "methods,"
-              " or transparently copied by using tf.enable_eager_execution("
-              "device_policy=tfe.DEVICE_PLACEMENT_SILENT). Copying tensors "
-              "between devices"
-              " may slow down your model");
-        case DEVICE_PLACEMENT_WARN:
-          LOG(WARNING) << "before computing " << op->Name() << " input #" << i
-                       << " was expected to be on " << expected_device->name()
-                       << " but is actually on " << actual_device->name()
-                       << " (operation running on " << op_device->name()
-                       << "). This triggers a copy which can be a performance "
-                          "bottleneck.";
-          break;
-        case DEVICE_PLACEMENT_SILENT:  // Do nothing.
-          break;
-      }
-      // We are only here if the policy is warn or silent copies, so we should
-      // trigger a copy.
-      auto pre_time = Env::Default()->NowMicros();
-      TensorHandle* copied_tensor = nullptr;
-      Status status = EagerCopyToDevice(
-          handle, ctx, expected_device->name().c_str(), &copied_tensor);
-      if (run_metadata != nullptr) {
-        auto* step_stats = run_metadata->mutable_step_stats();
-        MaybeInitializeStepStats(step_stats, ctx);
-        // Record the sending on the source device for now.
-        int device_idx = StepStatsDeviceIndex(step_stats, ctx, handle_device);
-        auto* dev_stats = step_stats->mutable_dev_stats(device_idx);
-        auto* node_stats = dev_stats->add_node_stats();
-        node_stats->set_node_name("_Send");
-        node_stats->set_all_start_micros(pre_time);
-        node_stats->set_op_end_rel_micros(Env::Default()->NowMicros() -
-                                          pre_time);
-      }
-      if (!status.ok()) {
-        if (copied_tensor != nullptr) copied_tensor->Unref();
-        return errors::Internal("Failed copying input tensor from ",
-                                actual_device->name(), " to ",
-                                expected_device->name(), " in order to run ",
-                                op->Name(), ": ", status.error_message());
-      }
-      handle->Unref();
-      handle = copied_tensor;
-      (*op->MutableInputs())[i] = copied_tensor;
-    }
+    TF_RETURN_IF_ERROR(MaybeCopyInputToExpectedDevice(
+        op, i, expected_device, run_metadata, &((*op->MutableInputs())[i])));
+    tensorflow::TensorHandle* handle = op->Inputs()[i];
     if (handle->dtype != kernel->input_type(i)) {
       return errors::InvalidArgument(
           "cannot compute ", op->Name(), " as input #", i,
@@ -192,8 +207,8 @@ Status SelectDevice(const NodeDef& ndef, EagerContext* ctx, Device** device) {
 // Resource4> as the input params to the synthesized function.
 //
 // It populates `const_input_types`, `arg_input_types` and
-// `op_input_to_func_input` based on the reordering results, that the caller can
-// use them to build an XlaLaunch. On error, it returns NULL, and sets
+// `op_input_to_func_input` based on the reordering results, that the caller
+// can use them to build an XlaLaunch. On error, it returns NULL, and sets
 // `status` accordingly.
 const FunctionDef* OpToFunction(TFE_Op* op,
                                 std::vector<TF_DataType>* const_input_types,
@@ -221,8 +236,8 @@ const FunctionDef* OpToFunction(TFE_Op* op,
   const std::unordered_set<string> const_inputs(
       *XlaOpRegistry::CompileTimeConstantInputs(op->operation.Name()));
 
-  // First add place holders for the input args, so that we can refer to them by
-  // position in the next loop. Also tally up the resource inputs.
+  // First add place holders for the input args, so that we can refer to them
+  // by position in the next loop. Also tally up the resource inputs.
   int num_resource_inputs = 0;
   for (int i = 0; i < op_def.input_arg_size(); ++i) {
     if (op_def.input_arg(i).type() == DT_RESOURCE) {
@@ -336,8 +351,9 @@ std::unique_ptr<TFE_Op> BuildXlaLaunch(TFE_Op* op, TF_Status* status) {
                         &op_input_to_func_input, status);
     if (!status.ok()) return nullptr;
   } else {
-    // TODO(hongm): XlaOpRegistry::CompileTimeConstantInputs() does not work for
-    // functions, so we need to find another way to handle constant inputs.
+    // TODO(hongm): XlaOpRegistry::CompileTimeConstantInputs() does not work
+    // for functions, so we need to find another way to handle constant
+    // inputs.
     for (int i = const_input_types.size();
          i < fdef->signature().input_arg_size(); ++i) {
       VLOG(1) << "Adding Targs from input arg " << i;
@@ -348,8 +364,9 @@ std::unique_ptr<TFE_Op> BuildXlaLaunch(TFE_Op* op, TF_Status* status) {
   DCHECK(fdef != nullptr);
 
   // Copy inputs and their devices.
-  // Since input param reordering may have occurred between `op` and `launch_op`
-  // via `op_input_to_func_input`, adjust the actual inputs accordingly.
+  // Since input param reordering may have occurred between `op` and
+  // `launch_op` via `op_input_to_func_input`, adjust the actual inputs
+  // accordingly.
   *launch_op->operation.MutableInputs() = op->operation.Inputs();
   for (TensorHandle* h : launch_op->operation.Inputs()) {
     h->Ref();
@@ -545,24 +562,24 @@ Status EagerLocalExecute(EagerOperation* op,
 Status EagerRemoteExecute(EagerOperation* op, eager::EagerClient* eager_client,
                           uint64 context_id, TensorHandle** retvals,
                           int* num_retvals) {
-  // All tensors must be on the same device.
-  // TODO(nareshmodi): handle silent copies
   eager::EnqueueRequest request;
   eager::EnqueueResponse response;
 
   auto* remote_op = request.add_queue()->mutable_operation();
 
-  for (auto* input : op->Inputs()) {
+  for (int i = 0; i < op->Inputs().size(); i++) {
     tensorflow::Device* input_device;
-    TF_RETURN_IF_ERROR(input->Device(&input_device));
+    TF_RETURN_IF_ERROR(op->Inputs()[i]->Device(&input_device));
     if (op->Device() != input_device) {
-      return tensorflow::errors::InvalidArgument(
-          "Ops and inputs are not on the same device. Use "
-          "TFE_TensorHandleCopyToDevice to get ops on the same "
-          "device. Expected device: ",
-          op->Device()->name(), ", Actual device: ", input_device->name());
+      // TODO(b/110044833): It's possible the same tensor gets copied to the
+      // remote device repeatedly.
+      TF_RETURN_IF_ERROR(MaybeCopyInputToExpectedDevice(
+          op, i, op->Device(), /* run_metadata= */ nullptr,
+          &(*op->MutableInputs())[i]));
     }
 
+    tensorflow::TensorHandle* input = op->Inputs()[i];
+
     tensorflow::uint64 op_id;
     int32 output_num;
     TF_RETURN_IF_ERROR(input->RemoteAddress(&op_id, &output_num));
diff --git a/tensorflow/core/common_runtime/eager/kernel_and_device.cc b/tensorflow/core/common_runtime/eager/kernel_and_device.cc
index a63b2b97112318b5639ea927e24336a2aa12bd63..b410ea175b331b522e8a66a6c9d2d8dd1e798bdc 100644
--- a/tensorflow/core/common_runtime/eager/kernel_and_device.cc
+++ b/tensorflow/core/common_runtime/eager/kernel_and_device.cc
@@ -19,6 +19,7 @@ limitations under the License.
 #include "tensorflow/core/common_runtime/rendezvous_mgr.h"
 #include "tensorflow/core/framework/allocator.h"
 #include "tensorflow/core/framework/node_def.pb.h"
+#include "tensorflow/core/framework/resource_mgr.h"
 #include "tensorflow/core/framework/step_stats.pb.h"
 #include "tensorflow/core/lib/core/errors.h"
 #include "tensorflow/core/lib/gtl/map_util.h"
@@ -78,6 +79,7 @@ Status KernelAndDevice::Run(std::vector<Tensor>* input_tensors,
   params.function_library = flib_;
   params.slice_reader_cache = &slice_reader_cache_;
   params.rendezvous = rendez_;
+  params.cancellation_manager = &cm_;
   if (stats != nullptr) {
     params.track_allocations = true;
   }
@@ -86,6 +88,11 @@ Status KernelAndDevice::Run(std::vector<Tensor>* input_tensors,
       [](std::function<void()> f) { f(); };
   params.runner = &runner;
 
+  ScopedStepContainer step_container(0, [this](const string& name) {
+    device_->resource_manager()->Cleanup(name).IgnoreError();
+  });
+  params.step_container = &step_container;
+
   OpKernelContext context(&params);
 
   if (kernel_->def().op() == "_Recv") {
diff --git a/tensorflow/core/common_runtime/eager/kernel_and_device.h b/tensorflow/core/common_runtime/eager/kernel_and_device.h
index f78d197fd55551f53de5ba98c74a13454ece38b1..c41a0972b1cfd4a5c6f505d9dc2724d3313de9b7 100644
--- a/tensorflow/core/common_runtime/eager/kernel_and_device.h
+++ b/tensorflow/core/common_runtime/eager/kernel_and_device.h
@@ -22,6 +22,7 @@ limitations under the License.
 #include <unordered_map>
 
 #include "tensorflow/core/common_runtime/device.h"
+#include "tensorflow/core/framework/cancellation.h"
 #include "tensorflow/core/framework/node_def.pb.h"
 #include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/framework/types.h"
@@ -76,6 +77,11 @@ class KernelAndDevice {
   const DataTypeVector& output_dtypes() { return output_dtypes_; }
 
  private:
+  // TODO(apassos) Consider a shared cancellation manager. Note that this
+  // cancellation manager is not useful to actually cancel anything, and is
+  // provided here only for the few kernels which can't handle one being
+  // missing.
+  CancellationManager cm_;
   std::unique_ptr<OpKernel> kernel_;
   Device* device_;
   FunctionLibraryRuntime* flib_;
diff --git a/tensorflow/core/common_runtime/executor.cc b/tensorflow/core/common_runtime/executor.cc
index 802bfee890556435cd03b16295ab16a24bf33392..f7f2cdc14fde6666791e3ff91dcdcd66622fe3f1 100644
--- a/tensorflow/core/common_runtime/executor.cc
+++ b/tensorflow/core/common_runtime/executor.cc
@@ -23,11 +23,13 @@ limitations under the License.
 #include <vector>
 
 #include "tensorflow/core/common_runtime/costmodel_manager.h"
+#include "tensorflow/core/common_runtime/executor_factory.h"
 #include "tensorflow/core/common_runtime/pending_counts.h"
 #include "tensorflow/core/common_runtime/step_stats_collector.h"
 #include "tensorflow/core/framework/allocation_description.pb.h"
 #include "tensorflow/core/framework/allocator.h"
 #include "tensorflow/core/framework/cancellation.h"
+#include "tensorflow/core/framework/collective.h"
 #include "tensorflow/core/framework/control_flow.h"
 #include "tensorflow/core/framework/device_attributes.pb.h"
 #include "tensorflow/core/framework/graph.pb.h"
@@ -592,7 +594,8 @@ char* GraphView::InitializeNode(char* ptr, const Node* n) {
           }
         }
       }
-      if (fwd_status.ok() && forward_from[i] == -1) {
+      if (fwd_status.ok() &&
+          forward_from[i] == OpKernelContext::Params::kNoReservation) {
         DCHECK_EQ(forward_input.size() % 2, 0);
         for (int j = 0; j < forward_input.size(); j += 2) {
           if (forward_input[j + 1] == i) {
@@ -770,7 +773,8 @@ void GraphView::SetScopedAllocatorAttrs(
                 << use_node->name();
         continue;
       }
-      // There should be exactly one output using ScopedAllocation.
+      // There can be more than one output using ScopedAllocation, but this
+      // analysis assumes they use the same ScopedAllocator.
       for (const auto& e : use_node->out_edges()) {
         if (!e->IsControlEdge()) {
           AllocatorAttributes attr;
@@ -887,6 +891,11 @@ Status InferAllocAttr(const Node* n, const Node* dst,
               << " remote type " << parsed_dst_name.type;
     }
   }
+  if (n->IsCollective()) {
+    // We'll make the sweeping assumption that any collective op is going
+    // to be involved in network i/o.
+    attr->set_nic_compatible(true);
+  }
   return s;
 }
 
@@ -1289,6 +1298,7 @@ class ExecutorState {
   int64 step_id_;
   // Not owned.
   Rendezvous* rendezvous_;
+  CollectiveExecutor* collective_executor_ = nullptr;
   SessionState* session_state_;
   TensorStore* tensor_store_;
   // Step-local container.
@@ -1411,6 +1421,7 @@ ExecutorState::ExecutorState(const Executor::Args& args, ExecutorImpl* impl)
       log_memory_(LogMemory::IsEnabled()),
       step_id_(args.step_id),
       rendezvous_(args.rendezvous),
+      collective_executor_(args.collective_executor),
       session_state_(args.session_state),
       tensor_store_(args.tensor_store),
       step_container_(args.step_container),
@@ -1621,6 +1632,7 @@ void ExecutorState::Process(TaggedNode tagged_node, int64 scheduled_usec) {
   params.log_memory = log_memory_;
   params.record_tensor_accesses = impl_->device_record_tensor_accesses_;
   params.rendezvous = rendezvous_;
+  params.collective_executor = collective_executor_;
   params.session_state = session_state_;
   params.tensor_store = tensor_store_;
   params.cancellation_manager = cancellation_manager_;
@@ -2180,6 +2192,9 @@ bool ExecutorState::NodeDone(const Status& s, const Node* node,
     if (rendezvous_) {
       rendezvous_->StartAbort(s);
     }
+    if (collective_executor_) {
+      collective_executor_->StartAbort(s);
+    }
     if (cancellation_manager_) {
       cancellation_manager_->StartCancel();
     }
@@ -2750,4 +2765,30 @@ Status CreateNonCachedKernel(Device* device, FunctionLibraryRuntime* flib,
 
 void DeleteNonCachedKernel(OpKernel* kernel) { delete kernel; }
 
+namespace {
+
+class DefaultExecutorRegistrar {
+ public:
+  DefaultExecutorRegistrar() {
+    Factory* factory = new Factory;
+    ExecutorFactory::Register("", factory);
+    ExecutorFactory::Register("DEFAULT", factory);
+  }
+
+ private:
+  class Factory : public ExecutorFactory {
+    Status NewExecutor(const LocalExecutorParams& params,
+                       std::unique_ptr<const Graph> graph,
+                       std::unique_ptr<Executor>* out_executor) override {
+      Executor* ret = nullptr;
+      TF_RETURN_IF_ERROR(NewLocalExecutor(params, std::move(graph), &ret));
+      out_executor->reset(ret);
+      return Status::OK();
+    }
+  };
+};
+static DefaultExecutorRegistrar registrar;
+
+}  // namespace
+
 }  // namespace tensorflow
diff --git a/tensorflow/core/common_runtime/executor.h b/tensorflow/core/common_runtime/executor.h
index adf80a2417e2a86e874dd1d1068a1bbb611ff882..e5d7b7c53c759ee225c54048b586cab12e9d78d6 100644
--- a/tensorflow/core/common_runtime/executor.h
+++ b/tensorflow/core/common_runtime/executor.h
@@ -89,6 +89,7 @@ class Executor {
     SessionState* session_state = nullptr;
     TensorStore* tensor_store = nullptr;
     ScopedStepContainer* step_container = nullptr;
+    CollectiveExecutor* collective_executor = nullptr;
 
     // If true, calls Sync() on the device.
     bool sync_on_finish = false;
diff --git a/tensorflow/core/common_runtime/executor_factory.cc b/tensorflow/core/common_runtime/executor_factory.cc
new file mode 100644
index 0000000000000000000000000000000000000000..ee7c7c3a7380c526f3e7e5f533b4856fe3165c65
--- /dev/null
+++ b/tensorflow/core/common_runtime/executor_factory.cc
@@ -0,0 +1,85 @@
+/* Copyright 2015 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/common_runtime/executor_factory.h"
+
+#include <unordered_map>
+
+#include "tensorflow/core/graph/graph.h"
+#include "tensorflow/core/lib/core/errors.h"
+#include "tensorflow/core/lib/strings/str_util.h"
+#include "tensorflow/core/platform/logging.h"
+#include "tensorflow/core/platform/mutex.h"
+#include "tensorflow/core/platform/types.h"
+
+namespace tensorflow {
+namespace {
+
+static mutex executor_factory_lock(LINKER_INITIALIZED);
+
+typedef std::unordered_map<string, ExecutorFactory*> ExecutorFactories;
+ExecutorFactories* executor_factories() {
+  static ExecutorFactories* factories = new ExecutorFactories;
+  return factories;
+}
+
+}  // namespace
+
+void ExecutorFactory::Register(const string& executor_type,
+                               ExecutorFactory* factory) {
+  mutex_lock l(executor_factory_lock);
+  if (!executor_factories()->insert({executor_type, factory}).second) {
+    LOG(FATAL) << "Two executor factories are being registered "
+               << "under" << executor_type;
+  }
+}
+
+namespace {
+const string RegisteredFactoriesErrorMessageLocked()
+    SHARED_LOCKS_REQUIRED(executor_factory_lock) {
+  std::vector<string> factory_types;
+  for (const auto& executor_factory : *executor_factories()) {
+    factory_types.push_back(executor_factory.first);
+  }
+  return strings::StrCat("Registered factories are {",
+                         str_util::Join(factory_types, ", "), "}.");
+}
+}  // namespace
+
+Status ExecutorFactory::GetFactory(const string& executor_type,
+                                   ExecutorFactory** out_factory) {
+  tf_shared_lock l(executor_factory_lock);
+
+  auto iter = executor_factories()->find(executor_type);
+  if (iter == executor_factories()->end()) {
+    return errors::NotFound(
+        "No executor factory registered for the given executor type: ",
+        executor_type, " ", RegisteredFactoriesErrorMessageLocked());
+  }
+
+  *out_factory = iter->second;
+  return Status::OK();
+}
+
+Status NewExecutor(const string& executor_type,
+                   const LocalExecutorParams& params,
+                   std::unique_ptr<const Graph> graph,
+                   std::unique_ptr<Executor>* out_executor) {
+  ExecutorFactory* factory = nullptr;
+  TF_RETURN_IF_ERROR(ExecutorFactory::GetFactory(executor_type, &factory));
+  return factory->NewExecutor(params, std::move(graph), out_executor);
+}
+
+}  // namespace tensorflow
diff --git a/tensorflow/core/common_runtime/executor_factory.h b/tensorflow/core/common_runtime/executor_factory.h
new file mode 100644
index 0000000000000000000000000000000000000000..f81bb080eb3aad5fde94c0c5953a67202cb8fa29
--- /dev/null
+++ b/tensorflow/core/common_runtime/executor_factory.h
@@ -0,0 +1,51 @@
+/* Copyright 2015 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CORE_COMMON_RUNTIME_EXECUTOR_FACTORY_H_
+#define TENSORFLOW_CORE_COMMON_RUNTIME_EXECUTOR_FACTORY_H_
+
+#include <string>
+
+#include "tensorflow/core/lib/core/errors.h"
+#include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/lib/gtl/array_slice.h"
+#include "tensorflow/core/platform/types.h"
+
+namespace tensorflow {
+
+class Executor;
+class Graph;
+struct LocalExecutorParams;
+
+class ExecutorFactory {
+ public:
+  virtual Status NewExecutor(const LocalExecutorParams& params,
+                             std::unique_ptr<const Graph> graph,
+                             std::unique_ptr<Executor>* out_executor) = 0;
+  virtual ~ExecutorFactory() {}
+
+  static void Register(const string& executor_type, ExecutorFactory* factory);
+  static Status GetFactory(const string& executor_type,
+                           ExecutorFactory** out_factory);
+};
+
+Status NewExecutor(const string& executor_type,
+                   const LocalExecutorParams& params,
+                   std::unique_ptr<const Graph> graph,
+                   std::unique_ptr<Executor>* out_executor);
+
+}  // namespace tensorflow
+
+#endif  // TENSORFLOW_CORE_COMMON_RUNTIME_EXECUTOR_FACTORY_H_
diff --git a/tensorflow/core/distributed_runtime/executor_test.cc b/tensorflow/core/common_runtime/executor_test.cc
similarity index 85%
rename from tensorflow/core/distributed_runtime/executor_test.cc
rename to tensorflow/core/common_runtime/executor_test.cc
index e34224205bac48a2dba1bf8cb07f9c623cd38281..7697103faf9bfa7a3fdbdbc0c3286d07d257d817 100644
--- a/tensorflow/core/distributed_runtime/executor_test.cc
+++ b/tensorflow/core/common_runtime/executor_test.cc
@@ -410,4 +410,73 @@ TEST_F(ExecutorTest, RecvInvalidRefDtype) {
   rendez->Unref();
 }
 
+// Create a graph that is 'depth' deep. At each level, fan-in and fan-out a
+// maximum of 'width' nodes. All nodes are no-ops and all dependencies are
+// control dependencies.
+static void BM_executor(int iters, int width, int depth) {
+#ifdef PLATFORM_GOOGLE
+  BenchmarkUseRealTime();
+#endif  // PLATFORM_GOOGLE
+  Graph* g = new Graph(OpRegistry::Global());
+  random::PhiloxRandom philox(1729, 17);
+  random::SimplePhilox rand(&philox);
+  uint64 cur = 0;
+  uint32 r = 1 + rand.Rand32() % width;
+  std::vector<Node*> ready_nodes;
+  for (int i = 0; i < r; ++i) {
+    ready_nodes.push_back(test::graph::NoOp(g, {}));
+    ++cur;
+  }
+  for (int i = 0; i < depth; ++i) {
+    std::random_shuffle(ready_nodes.begin(), ready_nodes.end());
+    r = 1 + rand.Rand32() % (ready_nodes.size());
+    std::vector<Node*> control_inputs;
+    for (int j = 0; j < r; ++j) {
+      control_inputs.push_back(ready_nodes.back());
+      ready_nodes.pop_back();
+    }
+    Node* n = test::graph::NoOp(g, control_inputs);
+    ++cur;
+    r = 1 + rand.Rand32() % width;
+    for (int j = 0; j < r; ++j) {
+      ready_nodes.push_back(test::graph::NoOp(g, {n}));
+      ++cur;
+    }
+  }
+#ifdef PLATFORM_GOOGLE
+  SetBenchmarkLabel(strings::StrCat("Nodes = ", cur));
+  SetBenchmarkItemsProcessed(cur * static_cast<int64>(iters));
+#endif  // PLATFORM_GOOGLE
+  test::Benchmark("cpu", g).Run(iters);
+}
+
+// Tall skinny graphs
+BENCHMARK(BM_executor)->ArgPair(16, 1024);
+BENCHMARK(BM_executor)->ArgPair(32, 8192);
+
+// Short fat graphs
+BENCHMARK(BM_executor)->ArgPair(1024, 16);
+BENCHMARK(BM_executor)->ArgPair(8192, 32);
+
+// Tall fat graph
+BENCHMARK(BM_executor)->ArgPair(1024, 1024);
+
+static void BM_FeedInputFetchOutput(int iters) {
+  Graph* g = new Graph(OpRegistry::Global());
+  // z = x + y: x and y are provided as benchmark inputs.  z is the
+  // output of the benchmark.  Conceptually, the caller is ALICE, the
+  // benchmark is BOB.
+  Node* x = test::graph::Recv(g, "x", "float", ALICE, 1, BOB);
+  Node* y = test::graph::Recv(g, "y", "float", ALICE, 1, BOB);
+  Node* sum = test::graph::Add(g, x, y);
+  Node* z = test::graph::Send(g, sum, "z", BOB, 1, ALICE);
+  Tensor val(DT_FLOAT, TensorShape({}));
+  val.scalar<float>()() = 3.14;
+#ifdef PLATFORM_GOOGLE
+  SetBenchmarkItemsProcessed(static_cast<int64>(iters));
+#endif  // PLATFORM_GOOGLE
+  test::Benchmark("cpu", g).RunWithArgs({{x, val}, {y, val}}, {z}, iters);
+}
+BENCHMARK(BM_FeedInputFetchOutput);
+
 }  // namespace tensorflow
diff --git a/tensorflow/core/common_runtime/function.cc b/tensorflow/core/common_runtime/function.cc
index d05564e9c496095b4ff8882309068e0c87f2f482..68d37ddbcd8cc7d4dc055d323734721f3d8dd840 100644
--- a/tensorflow/core/common_runtime/function.cc
+++ b/tensorflow/core/common_runtime/function.cc
@@ -20,9 +20,11 @@ limitations under the License.
 
 #include "tensorflow/core/common_runtime/device.h"
 #include "tensorflow/core/common_runtime/executor.h"
+#include "tensorflow/core/common_runtime/executor_factory.h"
 #include "tensorflow/core/common_runtime/graph_optimizer.h"
 #include "tensorflow/core/common_runtime/memory_types.h"
 #include "tensorflow/core/common_runtime/rendezvous_mgr.h"
+#include "tensorflow/core/framework/collective.h"
 #include "tensorflow/core/framework/function.h"
 #include "tensorflow/core/framework/node_def.pb.h"
 #include "tensorflow/core/framework/node_def_util.h"
@@ -214,6 +216,7 @@ class FunctionLibraryRuntimeImpl : public FunctionLibraryRuntime {
     const FunctionLibraryDefinition* overlay_lib = nullptr;  // Not owned.
     FunctionBody* func_graph = nullptr;
     Executor* exec = nullptr;
+    string executor_type;
 
     ~Item() {
       delete this->func_graph;
@@ -548,6 +551,7 @@ Status FunctionLibraryRuntimeImpl::Instantiate(
       item->func_graph = fbody;
       item->overlay_lib = options.overlay_lib;
       item->instantiation_counter = 1;
+      item->executor_type = options.executor_type;
       items_.emplace(next_handle_, std::unique_ptr<Item>(item));
       next_handle_++;
     }
@@ -622,10 +626,12 @@ void PruneFunctionBody(Graph* g) {
 Status FunctionLibraryRuntimeImpl::CreateItem(Handle handle, Item** item) {
   const FunctionBody* fbody;
   const FunctionLibraryDefinition* lib_def;
+  string executor_type;
   {
     mutex_lock l(mu_);
     fbody = (*item)->func_graph;
     lib_def = (*item)->overlay_lib;
+    executor_type = (*item)->executor_type;
   }
   if (!lib_def) {
     lib_def = base_lib_def_;
@@ -655,17 +661,14 @@ Status FunctionLibraryRuntimeImpl::CreateItem(Handle handle, Item** item) {
     DeleteNonCachedKernel(kernel);
   };
   Graph* graph = g.get();
-  Executor* exec;
-  TF_RETURN_IF_ERROR(NewLocalExecutor(params, std::move(g), &exec));
-
+  std::unique_ptr<Executor> exec;
+  TF_RETURN_IF_ERROR(NewExecutor(executor_type, params, std::move(g), &exec));
   {
     // Guard item since it is already inserted in items_.
     mutex_lock l(mu_);
-    if ((*item)->exec) {
-      delete exec;
-    } else {
+    if ((*item)->exec == nullptr) {
       (*item)->graph = graph;
-      (*item)->exec = exec;
+      (*item)->exec = exec.release();
     }
   }
   return Status::OK();
@@ -809,6 +812,7 @@ void FunctionLibraryRuntimeImpl::Run(const Options& opts, Handle handle,
   exec_args->cancellation_manager = run_opts.cancellation_manager;
   exec_args->step_container = run_opts.step_container;
   exec_args->runner = *run_opts.runner;
+  exec_args->collective_executor = run_opts.collective_executor;
 
   Item* item = nullptr;
   Status s = GetOrCreateItem(handle, &item);
@@ -896,6 +900,7 @@ void FunctionLibraryRuntimeImpl::Run(const Options& opts, Handle handle,
   exec_args->rendezvous = run_opts.rendezvous;
   exec_args->stats_collector = run_opts.stats_collector;
   exec_args->cancellation_manager = run_opts.cancellation_manager;
+  exec_args->collective_executor = run_opts.collective_executor;
   exec_args->step_container = run_opts.step_container;
   exec_args->runner = *run_opts.runner;
   exec_args->call_frame = frame;
diff --git a/tensorflow/core/common_runtime/function_test.cc b/tensorflow/core/common_runtime/function_test.cc
index 61b2f0e60f7ea6ca7f7b36f21845766399489795..1e837e9a7e7f7bf5c394a2725b47f4053ed911a5 100644
--- a/tensorflow/core/common_runtime/function_test.cc
+++ b/tensorflow/core/common_runtime/function_test.cc
@@ -25,6 +25,7 @@ limitations under the License.
 #include "tensorflow/core/common_runtime/device.h"
 #include "tensorflow/core/common_runtime/device_factory.h"
 #include "tensorflow/core/common_runtime/executor.h"
+#include "tensorflow/core/common_runtime/executor_factory.h"
 #include "tensorflow/core/common_runtime/function_testlib.h"
 #include "tensorflow/core/common_runtime/rendezvous_mgr.h"
 #include "tensorflow/core/common_runtime/step_stats_collector.h"
@@ -531,6 +532,69 @@ TEST_F(FunctionLibraryRuntimeTest, StateHandle) {
   }
 }
 
+namespace {
+class DummyExecutorRegistrar {
+ public:
+  DummyExecutorRegistrar() {
+    ExecutorFactory::Register("DUMMY", new Factory());
+  }
+
+ private:
+  class Factory : public ExecutorFactory {
+    Status NewExecutor(const LocalExecutorParams& params,
+                       std::unique_ptr<const Graph> graph,
+                       std::unique_ptr<Executor>* out_executor) override {
+      return errors::Internal("This is a dummy.");
+    }
+  };
+};
+static DummyExecutorRegistrar registrar;
+}  // namespace
+
+TEST_F(FunctionLibraryRuntimeTest, ExecutorFactory) {
+  Init({test::function::XTimesTwo()});
+
+  auto x = test::AsTensor<float>({1, 2, 3, 4});
+  Tensor y;
+
+  // Test that the default executor works.
+  {
+    FunctionLibraryRuntime::InstantiateOptions options;
+    options.executor_type = "";
+    TF_CHECK_OK(InstantiateAndRun(flr0_, "XTimesTwo", {{"T", DT_FLOAT}},
+                                  options, {x}, {&y}));
+    test::ExpectTensorEqual<float>(y, test::AsTensor<float>({2, 4, 6, 8}));
+  }
+
+  // Test the explicit registration for the default executor.
+  {
+    FunctionLibraryRuntime::InstantiateOptions options;
+    options.executor_type = "DEFAULT";
+    TF_CHECK_OK(InstantiateAndRun(flr0_, "XTimesTwo", {{"T", DT_FLOAT}},
+                                  options, {x}, {&y}));
+    test::ExpectTensorEqual<float>(y, test::AsTensor<float>({2, 4, 6, 8}));
+  }
+
+  // Test that a non-default executor factory can be invoked.
+  {
+    FunctionLibraryRuntime::InstantiateOptions options;
+    options.executor_type = "DUMMY";
+    HasError(InstantiateAndRun(flr0_, "XTimesTwo", {{"T", DT_FLOAT}}, options,
+                               {x}, {&y}),
+             "Internal: This is a dummy.");
+  }
+
+  // Test that non-existent exector types trigger an error.
+  {
+    FunctionLibraryRuntime::InstantiateOptions options;
+    options.executor_type = "UNKNOWN_EXECUTOR";
+    HasError(InstantiateAndRun(flr0_, "XTimesTwo", {{"T", DT_FLOAT}}, options,
+                               {x}, {&y}),
+             "Not found: No executor factory registered for the given executor "
+             "type: UNKNOWN_EXECUTOR");
+  }
+}
+
 TEST_F(FunctionLibraryRuntimeTest, ExpandInlineFunctions) {
   Init({test::function::XTimesTwo(), test::function::XTimesFour(),
         test::function::XTimes16()});
@@ -803,7 +867,7 @@ TEST_F(FunctionLibraryRuntimeTest, OptimizeGraph) {
     Scope s = Scope::NewRootScope();
     auto x = ops::_Arg(s.WithOpName("x"), DT_FLOAT, 0);
     auto x4_x2_scale = ops::Const<float>(
-        s.WithOpName("x4/x2/scale/_12__cf__6")
+        s.WithOpName("x4/x2/scale/_12__cf__10")
             .WithDevice("/job:localhost/replica:0/task:0/device:CPU:0"),
         2.0f);
     auto x4_x2_y = ops::Mul(s.WithOpName("x4/x2/y"), x, x4_x2_scale);
@@ -845,7 +909,7 @@ TEST_F(FunctionLibraryRuntimeTest, ManySwapsNodeDef) {
   ASSERT_TRUE(g != nullptr);
   OptimizeGraph(flr0_, &g);
   const char* e0 = R"P(
-(n3:float, n2:float) -> (n3:float) {
+(n2:float, n3:float) -> (n2:float) {
 }
 )P";
   EXPECT_EQ(e0, DebugString(g.get()));
@@ -913,7 +977,7 @@ TEST_F(FunctionLibraryRuntimeTest, Error_NotFound) {
            "Not found: Function Foo is not defined.");
 }
 
-TEST_F(FunctionLibraryRuntimeTest, Error_InstantiaionError) {
+TEST_F(FunctionLibraryRuntimeTest, Error_InstantiationError) {
   auto bad_x_times_two = FDH::Define(
       // Name
       "XTimesTwo",
@@ -1009,13 +1073,13 @@ TEST_F(FunctionLibraryRuntimeTest, Gradient_XTimesTwo) {
     auto x = ops::_Arg(s.WithOpName("x"), DT_FLOAT, 0);
     auto func0 = ops::_Arg(s.WithOpName("Func/_0"), DT_FLOAT, 1);
     auto scale = ops::Const(
-        s.WithOpName("scale/_6__cf__11")
+        s.WithOpName("scale/_6__cf__15")
             .WithDevice("/job:localhost/replica:0/task:0/device:CPU:0"),
         2.0f);
     auto func1_gx = ops::Mul(s.WithOpName("Func/_1/gx"), func0, scale);
     auto func1_sx = ops::Shape(s.WithOpName("Func/_1/sx"), x);
     auto const0 = ops::Const(
-        s.WithOpName("Func/_1/sy/_5__cf__10")
+        s.WithOpName("Func/_1/sy/_5__cf__14")
             .WithDevice("/job:localhost/replica:0/task:0/device:CPU:0"),
         0, {0});
     auto func1_rx = ops::internal::BroadcastGradientArgs(
diff --git a/tensorflow/core/common_runtime/gpu/gpu_device.cc b/tensorflow/core/common_runtime/gpu/gpu_device.cc
index cf5d11ec8b44ad3eef1a93d79757e78e74085565..bee562763654f3c2b7f5d6e95a969e48e6a0f3a3 100644
--- a/tensorflow/core/common_runtime/gpu/gpu_device.cc
+++ b/tensorflow/core/common_runtime/gpu/gpu_device.cc
@@ -770,7 +770,7 @@ int64 MinSystemMemory(int64 available_memory) {
   } else {
     // max(300 MiB, 0.05 * available_memory)
     min_system_memory =
-        std::max(314572800LL, static_cast<int64>(available_memory * 0.05));
+        std::max(int64{314572800}, static_cast<int64>(available_memory * 0.05));
   }
 #if defined(__GNUC__) && defined(__OPTIMIZE__)
 // Do nothing
diff --git a/tensorflow/core/common_runtime/graph_runner.cc b/tensorflow/core/common_runtime/graph_runner.cc
index adf2ef6f4468ddb82283c85bf0d4854a4c488d2d..0a1797fa19bb7e04d2cd7a6492d436d514b95c8f 100644
--- a/tensorflow/core/common_runtime/graph_runner.cc
+++ b/tensorflow/core/common_runtime/graph_runner.cc
@@ -176,6 +176,9 @@ Status GraphRunner::Run(Graph* graph, FunctionLibraryRuntime* function_library,
   args.step_id = LogMemory::CONSTANT_FOLDING_STEP_ID;
   args.runner = runner;
   args.rendezvous = rendez;
+  // NOTE: Use of graph runner is limited to single-device executions
+  // so a CollectiveExecutor should never be required.
+  args.collective_executor = nullptr;
 
   // Run the graph.
   TF_RETURN_IF_ERROR(executor->Run(args));
diff --git a/tensorflow/core/common_runtime/kernel_benchmark_testlib.cc b/tensorflow/core/common_runtime/kernel_benchmark_testlib.cc
index 7de1b80e2d6b3f2e36b3fdd56fc30e962e38bada..1f585a8c24801e9139cab5cc650fce19dd97e05e 100644
--- a/tensorflow/core/common_runtime/kernel_benchmark_testlib.cc
+++ b/tensorflow/core/common_runtime/kernel_benchmark_testlib.cc
@@ -18,6 +18,7 @@ limitations under the License.
 #include <vector>
 #include "tensorflow/core/common_runtime/device.h"
 #include "tensorflow/core/common_runtime/device_factory.h"
+#include "tensorflow/core/common_runtime/executor_factory.h"
 #include "tensorflow/core/common_runtime/local_device.h"
 #include "tensorflow/core/framework/op.h"
 #include "tensorflow/core/framework/op_kernel.h"
@@ -43,7 +44,7 @@ namespace test {
 // TODO(hongm): Convert `g` and `init` to using std::unique_ptr.
 Benchmark::Benchmark(const string& device, Graph* g,
                      const SessionOptions* options, Graph* init,
-                     Rendezvous* rendez) {
+                     Rendezvous* rendez, const char* executor_type) {
   SessionOptions default_options;
   if (!options) {
     options = &default_options;
@@ -86,23 +87,26 @@ Benchmark::Benchmark(const string& device, Graph* g,
   };
 
   if (init) {
-    Executor* init_exec;
-    TF_CHECK_OK(
-        NewLocalExecutor(params, std::unique_ptr<Graph>(init), &init_exec));
+    std::unique_ptr<Executor> init_exec;
+    TF_CHECK_OK(NewExecutor(executor_type, params, std::unique_ptr<Graph>(init),
+                            &init_exec));
     Executor::Args args;
     args.rendezvous = rendez_;
     args.runner = runner;
     TF_CHECK_OK(init_exec->Run(args));
-    delete init_exec;
   }
 
-  TF_CHECK_OK(NewLocalExecutor(params, std::unique_ptr<Graph>(g), &exec_));
+  TF_CHECK_OK(
+      NewExecutor(executor_type, params, std::unique_ptr<Graph>(g), &exec_));
 }
 
 Benchmark::~Benchmark() {
   if (device_) {
     rendez_->Unref();
-    delete exec_;
+    // We delete `exec_` before `device_` because the `exec_` destructor may
+    // run kernel destructors that may attempt to access state borrowed from
+    // `device_`, such as the resource manager.
+    exec_.reset();
     delete device_;
     delete pool_;
   }
diff --git a/tensorflow/core/common_runtime/kernel_benchmark_testlib.h b/tensorflow/core/common_runtime/kernel_benchmark_testlib.h
index 3a7b3a5acecee161f28a1625f782534a50089abc..995a15a299d74002e953116622f9729252ab21cc 100644
--- a/tensorflow/core/common_runtime/kernel_benchmark_testlib.h
+++ b/tensorflow/core/common_runtime/kernel_benchmark_testlib.h
@@ -39,7 +39,7 @@ class Benchmark {
   // "init", and one reference on "rendez" (if not null).
   Benchmark(const string& device, Graph* g,
             const SessionOptions* options = nullptr, Graph* init = nullptr,
-            Rendezvous* rendez = nullptr);
+            Rendezvous* rendez = nullptr, const char* executor_type = "");
   ~Benchmark();
 
   // Executes the graph for "iters" times.
@@ -57,7 +57,7 @@ class Benchmark {
   thread::ThreadPool* pool_ = nullptr;
   Device* device_ = nullptr;
   Rendezvous* rendez_ = nullptr;
-  Executor* exec_ = nullptr;
+  std::unique_ptr<Executor> exec_;
 
   TF_DISALLOW_COPY_AND_ASSIGN(Benchmark);
 };
diff --git a/tensorflow/core/common_runtime/mkl_cpu_allocator.h b/tensorflow/core/common_runtime/mkl_cpu_allocator.h
index 245320c8964e1767af3da5b6a4e11fdbb67164f6..29f702699f4370d2a36bc807410078b2e09abd5a 100644
--- a/tensorflow/core/common_runtime/mkl_cpu_allocator.h
+++ b/tensorflow/core/common_runtime/mkl_cpu_allocator.h
@@ -29,7 +29,9 @@ limitations under the License.
 #include "tensorflow/core/lib/strings/str_util.h"
 #include "tensorflow/core/platform/mem.h"
 
+#ifndef DO_NOT_USE_ML
 #include "i_malloc.h"
+#endif
 
 #ifdef _WIN32
 typedef unsigned int uint;
@@ -97,14 +99,14 @@ class MklCPUAllocator : public VisitableAllocator {
     VLOG(1) << "MklCPUAllocator: Setting max_mem_bytes: " << max_mem_bytes;
     allocator_ = new BFCAllocator(new MklSubAllocator, max_mem_bytes,
                                   kAllowGrowth, kName);
-
+#ifndef DO_NOT_USE_ML
     // For redirecting all allocations from MKL to this allocator
     // From: http://software.intel.com/en-us/node/528565
     i_malloc = MallocHook;
     i_calloc = CallocHook;
     i_realloc = ReallocHook;
     i_free = FreeHook;
-
+#endif
     return Status::OK();
   }
 
diff --git a/tensorflow/core/common_runtime/renamed_device.h b/tensorflow/core/common_runtime/renamed_device.h
index fe4df1c106c5a86d4a9cdb73bafed7f4431e76a0..103eee03b36415b49c8b03fc267f948af2719225 100644
--- a/tensorflow/core/common_runtime/renamed_device.h
+++ b/tensorflow/core/common_runtime/renamed_device.h
@@ -58,11 +58,6 @@ class RenamedDevice : public Device {
     return underlying_->GetAllocator(attr);
   }
 
-  Allocator* GetStepAllocator(AllocatorAttributes attr,
-                              ResourceMgr* step_resource_manager) override {
-    return underlying_->GetStepAllocator(attr, step_resource_manager);
-  }
-
   const Eigen::ThreadPoolDevice* eigen_cpu_device() override {
     return underlying_->eigen_cpu_device();
   }
diff --git a/tensorflow/core/common_runtime/scoped_allocator_mgr.cc b/tensorflow/core/common_runtime/scoped_allocator_mgr.cc
index c045596a69b60d244d2667e36e45c75f63d56a39..8ac6adc2e48a072ce1ae68e5141b54332fc54814 100644
--- a/tensorflow/core/common_runtime/scoped_allocator_mgr.cc
+++ b/tensorflow/core/common_runtime/scoped_allocator_mgr.cc
@@ -160,13 +160,18 @@ Status ScopedAllocatorMgr::AddScopedAllocator(
                                  expected_call_count);
 }
 
-void ScopedAllocatorMgr::PopulateFields(
+/*static*/
+size_t ScopedAllocatorMgr::PopulateFields(
     int32 scope_id, const gtl::ArraySlice<TensorShape>& shapes,
     const DataType dtype, std::vector<ScopedAllocator::Field>* fields) {
   const int32 num_fields = static_cast<int32>(shapes.size());
   fields->resize(num_fields);
   size_t offset = 0;
   for (int32 i = 0; i < num_fields; ++i) {
+    size_t overshoot = offset % Allocator::kAllocatorAlignment;
+    if (overshoot > 0) {
+      offset += (Allocator::kAllocatorAlignment - overshoot);
+    }
     size_t bytes = shapes[i].num_elements() * DataTypeSize(dtype);
     (*fields)[i].scope_id = scope_id + 1 + i;
     (*fields)[i].bytes = bytes;
@@ -175,11 +180,8 @@ void ScopedAllocatorMgr::PopulateFields(
             << " bytes=" << (*fields)[i].bytes
             << " offset=" << (*fields)[i].offset;
     offset += bytes;
-    size_t overshoot = offset % Allocator::kAllocatorAlignment;
-    if (overshoot > 0) {
-      offset += (Allocator::kAllocatorAlignment - overshoot);
-    }
   }
+  return offset;
 }
 
 }  // namespace tensorflow
diff --git a/tensorflow/core/common_runtime/scoped_allocator_mgr.h b/tensorflow/core/common_runtime/scoped_allocator_mgr.h
index effc5f2d775336621a783d83b7dd5eece6d42292..8c5e853472d1317dcc5e13583d4c4ea1f4a91956 100644
--- a/tensorflow/core/common_runtime/scoped_allocator_mgr.h
+++ b/tensorflow/core/common_runtime/scoped_allocator_mgr.h
@@ -89,10 +89,13 @@ class ScopedAllocatorMgr {
 
   // Populate the bytes and offset members of Field.  Instance allocaters get
   // consecutive scope_id values following that of the base ScopedAllocator.
-  static void PopulateFields(int32 scope_id,
-                             const gtl::ArraySlice<TensorShape>& shapes,
-                             const DataType dtype,
-                             std::vector<ScopedAllocator::Field>* fields);
+  // Returns the total number of bytes required to be allocated in the
+  // backing tensor, for convenience.  (The same value can be obtained
+  // by summing offset and bytes in the last field.)
+  static size_t PopulateFields(int32 scope_id,
+                               const gtl::ArraySlice<TensorShape>& shapes,
+                               const DataType dtype,
+                               std::vector<ScopedAllocator::Field>* fields);
 
   const string& device_name() const { return device_name_; }
 
diff --git a/tensorflow/core/distributed_runtime/rpc/grpc_testlib_ops.cc b/tensorflow/core/common_runtime/testlib_ops.cc
similarity index 84%
rename from tensorflow/core/distributed_runtime/rpc/grpc_testlib_ops.cc
rename to tensorflow/core/common_runtime/testlib_ops.cc
index 5597ee7a76a55f125dd0db82eceb58f5e922ab13..a0139c3ee5d33d55223398ec0c144da3d8446fa4 100644
--- a/tensorflow/core/distributed_runtime/rpc/grpc_testlib_ops.cc
+++ b/tensorflow/core/common_runtime/testlib_ops.cc
@@ -13,6 +13,7 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
+#include "tensorflow/core/framework/common_shape_fns.h"
 #include "tensorflow/core/framework/op.h"
 #include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/platform/macros.h"
@@ -21,8 +22,12 @@ namespace tensorflow {
 namespace test {
 
 // ErrorOp::Compute returns an error.
-REGISTER_OP("Error").Input("in: T").Output("out: T").Attr("T: type").Attr(
-    "message: string");
+REGISTER_OP("Error")
+    .Input("in: T")
+    .Output("out: T")
+    .Attr("T: type")
+    .Attr("message: string")
+    .SetShapeFn(shape_inference::UnknownShape);
 class ErrorOp : public OpKernel {
  public:
   explicit ErrorOp(OpKernelConstruction* ctx) : OpKernel(ctx) {
@@ -41,7 +46,8 @@ REGISTER_KERNEL_BUILDER(Name("Error").Device(DEVICE_CPU), ErrorOp);
 REGISTER_OP("InvalidRefType")
     .Output("out: Ref(TIn)")
     .Attr("TIn: type")
-    .Attr("TOut: type");
+    .Attr("TOut: type")
+    .SetShapeFn(shape_inference::UnknownShape);
 class InvalidRefType : public OpKernel {
  public:
   explicit InvalidRefType(OpKernelConstruction* ctx) : OpKernel(ctx) {
@@ -63,8 +69,12 @@ REGISTER_KERNEL_BUILDER(Name("InvalidRefType").Device(DEVICE_CPU),
 
 // DelayOp::AsyncCompute sleeps for "micros"-econd and then returns
 // its input.
-REGISTER_OP("Delay").Input("in: T").Output("out: T").Attr("T: type").Attr(
-    "micros: int");
+REGISTER_OP("Delay")
+    .Input("in: T")
+    .Output("out: T")
+    .Attr("T: type")
+    .Attr("micros: int")
+    .SetShapeFn(shape_inference::UnchangedShape);
 class DelayOp : public AsyncOpKernel {
  public:
   explicit DelayOp(OpKernelConstruction* ctx) : AsyncOpKernel(ctx) {
diff --git a/tensorflow/core/debug/BUILD b/tensorflow/core/debug/BUILD
index 1528c7f130657cd14c43578a548106dc855ab3fd..50f8a307d8626e4fa1a00adffd0cfa7db43b2a81 100644
--- a/tensorflow/core/debug/BUILD
+++ b/tensorflow/core/debug/BUILD
@@ -42,7 +42,7 @@ load(
 # Check that tensorflow/core:tensorflow does not depend on grpc.
 check_deps(
     name = "core_tensorflow_check_deps",
-    disallowed_deps = ["@grpc//:grpc++_unsecure"],
+    disallowed_deps = ["@grpc//:grpc++"],
     deps = ["//tensorflow/core:tensorflow"],
 )
 
@@ -150,7 +150,7 @@ tf_cuda_library(
         "//tensorflow/core:lib_internal",
         "//tensorflow/core:proto_text",
         "//tensorflow/core:protos_all_cc",
-        "@grpc//:grpc++_unsecure",
+        "@grpc//:grpc++",
     ],
     alwayslink = 1,
 )
@@ -170,7 +170,7 @@ tf_cuda_library(
         "//tensorflow/core:lib",
         "//tensorflow/core:lib_internal",
         "//tensorflow/core:protos_all_cc",
-        "@grpc//:grpc++_unsecure",
+        "@grpc//:grpc++",
     ],
     alwayslink = 1,
 )
diff --git a/tensorflow/core/debug/debug_grpc_testlib.h b/tensorflow/core/debug/debug_grpc_testlib.h
index 58361bf78f43f3715f7489eb32c342680b58b45a..8d3c9ff57577239ef6ef6f996e530cf5cdde4747 100644
--- a/tensorflow/core/debug/debug_grpc_testlib.h
+++ b/tensorflow/core/debug/debug_grpc_testlib.h
@@ -19,7 +19,7 @@ limitations under the License.
 #include <atomic>
 #include <unordered_set>
 
-#include "grpc++/grpc++.h"
+#include "grpcpp/grpcpp.h"
 #include "tensorflow/core/debug/debug_io_utils.h"
 #include "tensorflow/core/debug/debug_service.grpc.pb.h"
 #include "tensorflow/core/framework/tensor.h"
diff --git a/tensorflow/core/debug/debug_io_utils.cc b/tensorflow/core/debug/debug_io_utils.cc
index 03a011f79e1871073b15bece1a0b468bb269899f..9e8002d490cf01d63c883a8ccc3823c009bd254c 100644
--- a/tensorflow/core/debug/debug_io_utils.cc
+++ b/tensorflow/core/debug/debug_io_utils.cc
@@ -23,7 +23,7 @@ limitations under the License.
 #include <vector>
 
 #ifndef PLATFORM_WINDOWS
-#include "grpc++/create_channel.h"
+#include "grpcpp/create_channel.h"
 #else
 // winsock2.h is used in grpc, so Ws2_32.lib is needed
 #pragma comment(lib, "Ws2_32.lib")
diff --git a/tensorflow/core/distributed_runtime/BUILD b/tensorflow/core/distributed_runtime/BUILD
index 18b7069dbe5e86660f6e709042d5fa4db656c19c..c6db2aec066e5494f245ba26c8a1838049575a10 100644
--- a/tensorflow/core/distributed_runtime/BUILD
+++ b/tensorflow/core/distributed_runtime/BUILD
@@ -145,9 +145,11 @@ tf_cc_test(
     deps = [
         ":session_mgr",
         ":worker_env",
+        "//tensorflow/core:framework",
         "//tensorflow/core:protos_all_cc",
         "//tensorflow/core:test",
         "//tensorflow/core:test_main",
+        "//tensorflow/core:testlib",
         "//tensorflow/core/distributed_runtime/rpc:rpc_rendezvous_mgr",
     ],
 )
@@ -226,6 +228,17 @@ tf_cc_test(
     ],
 )
 
+cc_library(
+    name = "cancellable_call",
+    hdrs = ["cancellable_call.h"],
+    deps = [
+        ":call_options",
+        ":worker_cache",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:lib",
+    ],
+)
+
 tf_cc_test(
     name = "tensor_coding_test",
     size = "small",
@@ -392,6 +405,7 @@ cc_library(
     hdrs = ["master_env.h"],
     deps = [
         ":worker_cache",
+        "//tensorflow/core:framework",
         "//tensorflow/core:protos_all_cc",
         "//tensorflow/core:session_options",
     ],
@@ -452,11 +466,46 @@ cc_library(
     ],
 )
 
+cc_library(
+    name = "rpc_collective_executor_mgr",
+    srcs = ["rpc_collective_executor_mgr.cc"],
+    hdrs = ["rpc_collective_executor_mgr.h"],
+    deps = [
+        ":base_rendezvous_mgr",
+        ":collective_param_resolver_distributed",
+        ":collective_rma_distributed",
+        ":device_resolver_distributed",
+        ":worker_cache",
+        "//tensorflow/core:core_cpu_internal",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:lib_internal",
+        "//tensorflow/core:worker_proto_cc",
+    ],
+)
+
+tf_cc_test(
+    name = "rpc_collective_executor_mgr_test",
+    srcs = ["rpc_collective_executor_mgr_test.cc"],
+    deps = [
+        ":collective_param_resolver_distributed",
+        ":device_resolver_distributed",
+        ":rpc_collective_executor_mgr",
+        "//tensorflow/core:core_cpu_internal",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:session_options",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
+    ],
+)
+
 cc_library(
     name = "collective_rma_distributed",
     srcs = ["collective_rma_distributed.cc"],
     hdrs = ["collective_rma_distributed.h"],
     deps = [
+        ":cancellable_call",
         ":worker_cache",
         "//tensorflow/core:core_cpu_internal",
         "//tensorflow/core:framework",
@@ -492,6 +541,7 @@ cc_library(
     hdrs = ["collective_param_resolver_distributed.h"],
     deps = [
         ":call_options",
+        ":cancellable_call",
         ":device_resolver_distributed",
         ":worker_cache",
         "//tensorflow/core:core_cpu_internal",
@@ -561,17 +611,19 @@ tf_cc_test(
     ],
 )
 
-# TODO(mrry): Move executor_test.cc to ../common_runtime when once it no longer depends
-# on grpc_testlib.
-tf_cuda_cc_tests(
-    name = "executor_tests",
+tf_cuda_cc_test(
+    name = "master_test",
     size = "medium",
     srcs = [
-        "executor_test.cc",
-        #"master_test.cc",  # TODO(b/27683709): Re-enable when not flaky.
+        "master_test.cc",
     ],
     linkstatic = tf_kernel_tests_linkstatic(),
-    tags = tf_cuda_tests_tags(),
+    tags = tf_cuda_tests_tags() + [
+        "manual",  # TODO(b/27683709): Re-enable when not flaky.
+        "notap",  # TODO(b/27683709): Re-enable when not flaky.
+        "noguitar",  # TODO(b/27683709): Re-enable when not flaky.
+        "nooss",  # TODO(b/27683709): Re-enable when not flaky.
+    ],
     deps = [
         ":master",
         ":remote_device",
@@ -588,6 +640,7 @@ tf_cuda_cc_tests(
         "//tensorflow/core:test_main",
         "//tensorflow/core:testlib",
         "//tensorflow/core/distributed_runtime/rpc:grpc_channel",
+        "//tensorflow/core/distributed_runtime/rpc:grpc_master_service_impl",
         "//tensorflow/core/distributed_runtime/rpc:grpc_testlib",
         "//tensorflow/core/distributed_runtime/rpc:grpc_util",
         "//tensorflow/core/distributed_runtime/rpc:grpc_worker_cache",
@@ -596,7 +649,7 @@ tf_cuda_cc_tests(
         "//tensorflow/core/kernels:dense_update_ops",
         "//tensorflow/core/kernels:identity_op",
         "//tensorflow/core/kernels:variable_ops",
-        "@grpc//:grpc++_unsecure",
+        "@grpc//:grpc++",
     ],
 )
 
@@ -629,7 +682,7 @@ tf_cuda_cc_test(
         "//tensorflow/core/distributed_runtime/rpc:grpc_testlib",
         "//tensorflow/core/distributed_runtime/rpc:grpc_util",
         "//tensorflow/core/distributed_runtime/rpc:grpc_worker_cache",
-        "@grpc//:grpc++_unsecure",
+        "@grpc//:grpc++",
     ],
 )
 
@@ -648,10 +701,10 @@ tf_cuda_cc_test(
         "//tensorflow/core:tensorflow",
         "//tensorflow/core:test",
         "//tensorflow/core:test_main",
+        "//tensorflow/core:testlib",
         "//tensorflow/core/distributed_runtime:server_lib",
         "//tensorflow/core/distributed_runtime/rpc:grpc_server_lib",
         "//tensorflow/core/distributed_runtime/rpc:grpc_session",
-        "//tensorflow/core/distributed_runtime/rpc:grpc_testlib_ops",
         "//tensorflow/core/kernels:aggregate_ops",
         "//tensorflow/core/kernels:array",
     ],
diff --git a/tensorflow/core/distributed_runtime/cancellable_call.h b/tensorflow/core/distributed_runtime/cancellable_call.h
new file mode 100644
index 0000000000000000000000000000000000000000..05089c7d153db94d898b06686add90df20155e0c
--- /dev/null
+++ b/tensorflow/core/distributed_runtime/cancellable_call.h
@@ -0,0 +1,65 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#ifndef TENSORFLOW_CORE_DISTRIBUTED_RUNTIME_CANCELLABLE_CALL_H_
+#define TENSORFLOW_CORE_DISTRIBUTED_RUNTIME_CANCELLABLE_CALL_H_
+
+#include <string>
+#include "tensorflow/core/distributed_runtime/call_options.h"
+#include "tensorflow/core/distributed_runtime/worker_cache.h"
+#include "tensorflow/core/framework/cancellation.h"
+#include "tensorflow/core/platform/mutex.h"
+
+namespace tensorflow {
+
+// Supports client side cancellation of WorkerInterface calls via
+// registration with a CancellationManager.
+class CancellableCall {
+ public:
+  CancellableCall(CancellationManager* cancel_mgr, const string& remote_worker,
+                  WorkerCacheInterface* wc)
+      : cancel_mgr_(cancel_mgr),
+        remote_worker_(remote_worker),
+        wc_(wc),
+        wi_(wc_->CreateWorker(remote_worker_)) {}
+
+  virtual ~CancellableCall() { wc_->ReleaseWorker(remote_worker_, wi_); }
+
+  virtual void IssueCall(const StatusCallback& done) = 0;
+
+  void Start(const StatusCallback& done) {
+    CancellationToken token = cancel_mgr_->get_cancellation_token();
+    const bool not_yet_cancelled = cancel_mgr_->RegisterCallback(
+        token, [this, token]() { opts_.StartCancel(); });
+    if (not_yet_cancelled) {
+      IssueCall([this, token, done](const Status& s) {
+        cancel_mgr_->DeregisterCallback(token);
+        done(s);
+      });
+    } else {
+      done(errors::Cancelled("RPC Request was cancelled"));
+    }
+  }
+
+ protected:
+  mutable mutex mu_;
+  CancellationManager* const cancel_mgr_;  // Not owned
+  const string remote_worker_;
+  WorkerCacheInterface* const wc_;  // Not owned
+  WorkerInterface* const wi_;       // Owned by wc_, must be released.
+  CallOptions opts_;
+};
+
+}  // namespace tensorflow
+#endif  // TENSORFLOW_CORE_DISTRIBUTED_RUNTIME_CANCELLABLE_CALL_H_
diff --git a/tensorflow/core/distributed_runtime/collective_param_resolver_distributed.cc b/tensorflow/core/distributed_runtime/collective_param_resolver_distributed.cc
index 7a93b54eae386fa8c8e253b1649636acbcf6f067..612ac14e22b1e8cfde82d20f0aecd8bba9df7b26 100644
--- a/tensorflow/core/distributed_runtime/collective_param_resolver_distributed.cc
+++ b/tensorflow/core/distributed_runtime/collective_param_resolver_distributed.cc
@@ -14,55 +14,13 @@ limitations under the License.
 ==============================================================================*/
 #include "tensorflow/core/distributed_runtime/collective_param_resolver_distributed.h"
 
-#include "tensorflow/core/distributed_runtime/call_options.h"
+#include "tensorflow/core/distributed_runtime/cancellable_call.h"
 #include "tensorflow/core/distributed_runtime/device_resolver_distributed.h"
 #include "tensorflow/core/distributed_runtime/worker_cache.h"
 #include "tensorflow/core/protobuf/config.pb.h"
 
-// TODO(tucker): When we're ready to enable collectives this const will
-// transition to a settable config member.
-static const char FLAGS_collective_group_leader[] =
-    "/job:worker/replica:0/task:0";
-
 namespace tensorflow {
 namespace {
-// Supports client side cancellation of WorkerInterface calls via
-// registration with a CancellationManager.  Note that ParamResolverInterface
-// calls are done on behalf of an Op execution which needs to abort if the
-// step in which it executes is cancelled.
-class CancellableCall {
- public:
-  CancellableCall(CancellationManager* cancel_mgr, const string& remote_worker,
-                  WorkerCacheInterface* wc)
-      : cancel_mgr_(cancel_mgr), remote_worker_(remote_worker), wc_(wc) {
-    wi_ = wc_->CreateWorker(remote_worker_);
-  }
-  virtual ~CancellableCall() { wc_->ReleaseWorker(remote_worker_, wi_); }
-
-  virtual void IssueCall(const StatusCallback& done) = 0;
-
-  void Start(const StatusCallback& done) {
-    CancellationToken token = cancel_mgr_->get_cancellation_token();
-    const bool not_yet_cancelled = cancel_mgr_->RegisterCallback(
-        token, [this, token]() { opts_.StartCancel(); });
-    if (not_yet_cancelled) {
-      IssueCall([this, token, done](const Status& s) {
-        cancel_mgr_->DeregisterCallback(token);
-        done(s);
-      });
-    } else {
-      done(errors::Cancelled("RPC Request was cancelled"));
-    }
-  }
-
- protected:
-  mutable mutex mu_;
-  CancellationManager* cancel_mgr_;  // Not owned
-  const string remote_worker_;
-  WorkerCacheInterface* wc_;  // Not owned
-  WorkerInterface* wi_;       // Owned by wc_, must be released.
-  CallOptions opts_;
-};
 
 class CompleteGroupCall : public CancellableCall {
  public:
@@ -126,9 +84,9 @@ CollectiveParamResolverDistributed::CollectiveParamResolverDistributed(
     const string& task_name)
     : CollectiveParamResolverLocal(dev_mgr, dev_resolver, task_name),
       worker_cache_(worker_cache),
-      group_leader_(task_name == FLAGS_collective_group_leader
+      group_leader_(task_name == config.experimental().collective_group_leader()
                         ? ""
-                        : FLAGS_collective_group_leader) {}
+                        : config.experimental().collective_group_leader()) {}
 
 void CollectiveParamResolverDistributed::CompleteParamsAsync(
     const string& device, CollectiveParams* cp, CancellationManager* cancel_mgr,
diff --git a/tensorflow/core/distributed_runtime/collective_param_resolver_distributed_test.cc b/tensorflow/core/distributed_runtime/collective_param_resolver_distributed_test.cc
index 95a010286d6ce27c74a2e0eefa23fd4bf2bd3313..4eed856759ae7ea2a982e1604ecbc0237e304731 100644
--- a/tensorflow/core/distributed_runtime/collective_param_resolver_distributed_test.cc
+++ b/tensorflow/core/distributed_runtime/collective_param_resolver_distributed_test.cc
@@ -147,10 +147,9 @@ class DeviceResDistTest : public ::testing::Test {
     ConfigProto config;
     for (int w = 0; w < num_workers; ++w) {
       string name = strings::StrCat("/job:worker/replica:0/task:", w);
-      // TODO(tucker): When config option becomes available, set here.
-      // if (w == 0) {
-      //   config.set_collective_group_leader(name);
-      // }
+      if (w == 0) {
+        config.mutable_experimental()->set_collective_group_leader(name);
+      }
       DefineWorker(config, name, device_type, num_devices);
     }
   }
diff --git a/tensorflow/core/distributed_runtime/collective_rma_distributed.cc b/tensorflow/core/distributed_runtime/collective_rma_distributed.cc
index c15878bfd3a2ba579e18969953c4f08ad350045e..d4c47cab49d2eaf70d3a79ed853dc443a0a8b261 100644
--- a/tensorflow/core/distributed_runtime/collective_rma_distributed.cc
+++ b/tensorflow/core/distributed_runtime/collective_rma_distributed.cc
@@ -19,6 +19,7 @@ limitations under the License.
 #include "tensorflow/core/common_runtime/device_mgr.h"
 #include "tensorflow/core/common_runtime/dma_helper.h"
 #include "tensorflow/core/common_runtime/process_util.h"
+#include "tensorflow/core/distributed_runtime/cancellable_call.h"
 #include "tensorflow/core/distributed_runtime/worker_cache.h"
 #include "tensorflow/core/platform/protobuf_internal.h"
 #include "tensorflow/core/protobuf/transport_options.pb.h"
@@ -28,45 +29,6 @@ namespace tensorflow {
 
 namespace {
 
-// Supports client side cancellation of WorkerInterface calls via
-// registration with a CancellationManager.
-//
-// TODO(tucker): Maybe unify this with CancellableCall in
-// collective_param_resolver_distributed.cc.
-class CancellableCall {
- public:
-  CancellableCall(CancellationManager* cancel_mgr, const string& remote_worker,
-                  WorkerCacheInterface* wc)
-      : cancel_mgr_(cancel_mgr), remote_worker_(remote_worker), wc_(wc) {
-    wi_ = wc_->CreateWorker(remote_worker_);
-  }
-  virtual ~CancellableCall() { wc_->ReleaseWorker(remote_worker_, wi_); }
-
-  virtual void IssueCall(const StatusCallback& done) = 0;
-
-  void Start(const StatusCallback& done) {
-    CancellationToken token = cancel_mgr_->get_cancellation_token();
-    const bool not_yet_cancelled = cancel_mgr_->RegisterCallback(
-        token, [this, token]() { opts_.StartCancel(); });
-    if (not_yet_cancelled) {
-      IssueCall([this, token, done](const Status& s) {
-        cancel_mgr_->DeregisterCallback(token);
-        done(s);
-      });
-    } else {
-      done(errors::Cancelled("RPC Request was cancelled"));
-    }
-  }
-
- protected:
-  mutable mutex mu_;
-  CancellationManager* cancel_mgr_;  // Not owned
-  const string remote_worker_;
-  WorkerCacheInterface* wc_;  // Not owned
-  WorkerInterface* wi_;       // Owned by wc_, must be released.
-  CallOptions opts_;
-};
-
 class RecvBufCall : public CancellableCall {
  public:
   RecvBufCall(int64 step_id, const string& peer_device, const string& peer_task,
@@ -119,7 +81,7 @@ void CollectiveRemoteAccessDistributed::RecvFromPeer(
   };
   State* state = new State;
 
-  // Logic to be executed on the RecvBufferAsync callback.
+  // Logic to be executed on the RecvBufAsync callback.
   auto recv_buf_callback = [this, state, peer_task, to_device, to_alloc_attr,
                             to_device_ctx, to_tensor, done](const Status& s) {
     if (s.ok()) {
diff --git a/tensorflow/core/distributed_runtime/eager/BUILD b/tensorflow/core/distributed_runtime/eager/BUILD
index f3922dde74ae271465d800262ee20defa508151d..dc02d1b9bf82161632edec296e5c8405f4297ede 100644
--- a/tensorflow/core/distributed_runtime/eager/BUILD
+++ b/tensorflow/core/distributed_runtime/eager/BUILD
@@ -65,8 +65,8 @@ cc_library(
         "//tensorflow/core/distributed_runtime:worker_env",
         "//tensorflow/core/distributed_runtime/eager:remote_tensor_handle",
         "//tensorflow/core/distributed_runtime/rpc:rpc_rendezvous_mgr",
-        "@grpc//:grpc++_unsecure",
-        "@grpc//:grpc_unsecure",
+        "@grpc",
+        "@grpc//:grpc++",
     ],
 )
 
diff --git a/tensorflow/core/distributed_runtime/graph_mgr.cc b/tensorflow/core/distributed_runtime/graph_mgr.cc
index 8447c55bf43d8b2f7b40685932474b075d6cf696..e2f13df19f0da2f78b79447eb513b11ef5e01afa 100644
--- a/tensorflow/core/distributed_runtime/graph_mgr.cc
+++ b/tensorflow/core/distributed_runtime/graph_mgr.cc
@@ -17,6 +17,7 @@ limitations under the License.
 
 #include <vector>
 
+#include "tensorflow/core/common_runtime/build_graph_options.h"
 #include "tensorflow/core/common_runtime/constant_folding.h"
 #include "tensorflow/core/common_runtime/debugger_state_interface.h"
 #include "tensorflow/core/common_runtime/device.h"
@@ -30,6 +31,7 @@ limitations under the License.
 #include "tensorflow/core/common_runtime/step_stats_collector.h"
 #include "tensorflow/core/distributed_runtime/rendezvous_mgr_interface.h"
 #include "tensorflow/core/framework/cancellation.h"
+#include "tensorflow/core/framework/collective.h"
 #include "tensorflow/core/framework/log_memory.h"
 #include "tensorflow/core/framework/node_def.pb.h"
 #include "tensorflow/core/framework/node_def_util.h"
@@ -118,9 +120,11 @@ Status GraphMgr::DecorateAndPublishGraphForDebug(
 Status GraphMgr::InitItem(const string& session, const GraphDef& gdef,
                           const GraphOptions& graph_options,
                           const DebugOptions& debug_options,
+                          int64 collective_graph_key,
                           DistributedFunctionLibraryRuntime* cluster_flr,
                           Item* item) {
   item->session = session;
+  item->collective_graph_key = collective_graph_key;
   item->lib_def.reset(
       new FunctionLibraryDefinition(OpRegistry::Global(), gdef.library()));
 
@@ -280,11 +284,12 @@ Status GraphMgr::InitItem(const string& session, const GraphDef& gdef,
 Status GraphMgr::Register(const string& session, const GraphDef& gdef,
                           const GraphOptions& graph_options,
                           const DebugOptions& debug_options,
+                          int64 collective_graph_key,
                           DistributedFunctionLibraryRuntime* cluster_flr,
                           string* handle) {
   Item* item = new Item;
-  Status s =
-      InitItem(session, gdef, graph_options, debug_options, cluster_flr, item);
+  Status s = InitItem(session, gdef, graph_options, debug_options,
+                      collective_graph_key, cluster_flr, item);
   if (!s.ok()) {
     item->Unref();
     return s;
@@ -415,7 +420,12 @@ void GraphMgr::ExecuteAsync(const string& handle, const int64 step_id,
 
   RemoteRendezvous* rendezvous = worker_env_->rendezvous_mgr->Find(step_id);
   Status s = rendezvous->Initialize(session);
-
+  CollectiveExecutor::Handle* ce_handle =
+      item->collective_graph_key != BuildGraphOptions::kNoCollectiveGraphKey
+          ? new CollectiveExecutor::Handle(
+                worker_env_->collective_executor_mgr->FindOrCreate(step_id),
+                true)
+          : nullptr;
   // Sends values specified by the caller.
   if (s.ok()) {
     std::vector<string> keys;
@@ -431,22 +441,25 @@ void GraphMgr::ExecuteAsync(const string& handle, const int64 step_id,
 
   if (!s.ok()) {
     done(s);
+    delete ce_handle;
     item->Unref();
     rendezvous->Unref();
     return;
   }
 
-  StartParallelExecutors(handle, step_id, item, rendezvous, collector,
-                         cost_graph, cancellation_manager,
-                         [item, rendezvous, done](const Status& s) {
+  StartParallelExecutors(handle, step_id, item, rendezvous, ce_handle,
+                         collector, cost_graph, cancellation_manager,
+                         [item, rendezvous, ce_handle, done](const Status& s) {
                            done(s);
                            rendezvous->Unref();
                            item->Unref();
+                           delete ce_handle;
                          });
 }
 
 void GraphMgr::StartParallelExecutors(const string& handle, int64 step_id,
                                       Item* item, Rendezvous* rendezvous,
+                                      CollectiveExecutor::Handle* ce_handle,
                                       StepStatsCollector* collector,
                                       CostGraphDef* cost_graph,
                                       CancellationManager* cancellation_manager,
@@ -471,6 +484,7 @@ void GraphMgr::StartParallelExecutors(const string& handle, int64 step_id,
     args.step_id = ++next_id_;
   }
   args.rendezvous = rendezvous;
+  args.collective_executor = ce_handle ? ce_handle->get() : nullptr;
   args.cancellation_manager = cancellation_manager;
   args.stats_collector = collector;
   args.step_container = step_container;
diff --git a/tensorflow/core/distributed_runtime/graph_mgr.h b/tensorflow/core/distributed_runtime/graph_mgr.h
index cc35264b8fe0b6decc325dab793c6a5fe6ad097f..5196046c1969207825ff8d05f7a25d72a2f75980 100644
--- a/tensorflow/core/distributed_runtime/graph_mgr.h
+++ b/tensorflow/core/distributed_runtime/graph_mgr.h
@@ -25,6 +25,7 @@ limitations under the License.
 #include "tensorflow/core/distributed_runtime/message_wrappers.h"
 #include "tensorflow/core/distributed_runtime/worker_env.h"
 #include "tensorflow/core/framework/cancellation.h"
+#include "tensorflow/core/framework/collective.h"
 #include "tensorflow/core/framework/cost_graph.pb.h"
 #include "tensorflow/core/framework/function.h"
 #include "tensorflow/core/lib/core/refcount.h"
@@ -75,7 +76,7 @@ class GraphMgr {
   // reference to cluster_flr to do cross process function calls.
   Status Register(const string& session, const GraphDef& gdef,
                   const GraphOptions& graph_options,
-                  const DebugOptions& debug_options,
+                  const DebugOptions& debug_options, int64 collective_graph_key,
                   DistributedFunctionLibraryRuntime* cluster_flr,
                   string* handle);
 
@@ -138,6 +139,8 @@ class GraphMgr {
     // Used to deregister a cost model when cost model is required in graph
     // manager.
     GraphMgr* graph_mgr;
+
+    int64 collective_graph_key;
   };
 
   const WorkerEnv* worker_env_;  // Not owned.
@@ -161,6 +164,7 @@ class GraphMgr {
 
   void StartParallelExecutors(const string& handle, int64 step_id, Item* item,
                               Rendezvous* rendezvous,
+                              CollectiveExecutor::Handle* ce_handle,
                               StepStatsCollector* collector,
                               CostGraphDef* cost_graph,
                               CancellationManager* cancellation_manager,
@@ -175,7 +179,7 @@ class GraphMgr {
 
   Status InitItem(const string& session, const GraphDef& gdef,
                   const GraphOptions& graph_options,
-                  const DebugOptions& debug_options,
+                  const DebugOptions& debug_options, int64 collective_graph_key,
                   DistributedFunctionLibraryRuntime* cluster_flr, Item* item);
 
   Status DecorateAndPublishGraphForDebug(const DebugOptions& debug_options,
diff --git a/tensorflow/core/distributed_runtime/local_master.h b/tensorflow/core/distributed_runtime/local_master.h
index cad6babad82b9b2ac2953f5497e46bb471699b10..b9c76d0f1d834a6c8e1d659ef280537052701b3e 100644
--- a/tensorflow/core/distributed_runtime/local_master.h
+++ b/tensorflow/core/distributed_runtime/local_master.h
@@ -79,7 +79,7 @@ class LocalMaster : public MasterInterface {
                      RunCallableResponse* response) override;
   Status ReleaseCallable(CallOptions* call_options,
                          const ReleaseCallableRequest* request,
-                         ReleaseCallableResponse* response);
+                         ReleaseCallableResponse* response) override;
 
   // Registers the mapping from the given `target` to the given `master`.
   //
diff --git a/tensorflow/core/distributed_runtime/master.cc b/tensorflow/core/distributed_runtime/master.cc
index 4f9d84d158f81d179f29dec2784409cec0f55beb..a48f734d3e286587e437f899360ac3b1d22dc24c 100644
--- a/tensorflow/core/distributed_runtime/master.cc
+++ b/tensorflow/core/distributed_runtime/master.cc
@@ -473,7 +473,7 @@ void Master::PartialRunSetup(const PartialRunSetupRequest* req,
     return;
   }
 
-  SchedClosure([this, session, req, resp, done]() {
+  SchedClosure([session, req, resp, done]() {
     Status s = session->PartialRunSetup(req, resp);
     session->Unref();
     done(s);
@@ -628,7 +628,7 @@ void Master::MakeCallable(const MakeCallableRequest* req,
   }
 
   SchedClosure(std::bind(
-      [this, session, req, resp](MyClosure done) {
+      [session, req, resp](MyClosure done) {
         Status s = session->MakeCallable(*req, resp);
         session->Unref();
         done(s);
@@ -645,7 +645,7 @@ void Master::RunCallable(CallOptions* opts, const RunCallableRequest* req,
   }
 
   SchedClosure(std::bind(
-      [this, session, opts, req, resp](MyClosure done) {
+      [session, opts, req, resp](MyClosure done) {
         Status s = session->RunCallable(opts, *req, resp);
         session->Unref();
         done(s);
@@ -662,7 +662,7 @@ void Master::ReleaseCallable(const ReleaseCallableRequest* req,
   }
 
   SchedClosure(std::bind(
-      [this, session, req, resp](MyClosure done) {
+      [session, req, resp](MyClosure done) {
         Status s = session->ReleaseCallable(*req, resp);
         session->Unref();
         done(s);
diff --git a/tensorflow/core/distributed_runtime/master_env.h b/tensorflow/core/distributed_runtime/master_env.h
index 16f4d93c8b4a753f42c74fc51c6196b2d229ee1d..da26c42aca20c0ad9c874069f5415a42304cab24 100644
--- a/tensorflow/core/distributed_runtime/master_env.h
+++ b/tensorflow/core/distributed_runtime/master_env.h
@@ -26,6 +26,7 @@ limitations under the License.
 
 namespace tensorflow {
 
+class CollectiveExecutorMgrInterface;
 class Device;
 class DeviceSet;
 class Env;
@@ -90,6 +91,10 @@ struct MasterEnv {
   std::function<Status(const WorkerCacheFactoryOptions&,
                        WorkerCacheInterface**)>
       worker_cache_factory;
+
+  // Generates per-step CollectiveExecutors and has access to utilities
+  // supporting collective operations.
+  CollectiveExecutorMgrInterface* collective_executor_mgr = nullptr;
 };
 
 }  // end namespace tensorflow
diff --git a/tensorflow/core/distributed_runtime/master_session.cc b/tensorflow/core/distributed_runtime/master_session.cc
index bd70eca3f6528e24e25988a7799ad2962073c8ba..d34ca53f73f7eb9a6407729ad55545ab3526c462 100644
--- a/tensorflow/core/distributed_runtime/master_session.cc
+++ b/tensorflow/core/distributed_runtime/master_session.cc
@@ -27,6 +27,7 @@ limitations under the License.
 #include "tensorflow/core/distributed_runtime/worker_cache.h"
 #include "tensorflow/core/distributed_runtime/worker_interface.h"
 #include "tensorflow/core/framework/allocation_description.pb.h"
+#include "tensorflow/core/framework/collective.h"
 #include "tensorflow/core/framework/cost_graph.pb.h"
 #include "tensorflow/core/framework/node_def.pb.h"
 #include "tensorflow/core/framework/node_def_util.h"
@@ -69,6 +70,7 @@ class MasterSession::ReffedClientGraph : public core::RefCounted {
                     bool is_partial, WorkerCacheInterface* worker_cache,
                     bool should_deregister)
       : session_handle_(handle),
+        bg_opts_(bopts),
         client_graph_(std::move(cg)),
         session_opts_(session_opts),
         is_partial_(is_partial),
@@ -100,6 +102,8 @@ class MasterSession::ReffedClientGraph : public core::RefCounted {
 
   const CallableOptions& callable_options() { return callable_opts_; }
 
+  const BuildGraphOptions& build_graph_options() { return bg_opts_; }
+
   std::unique_ptr<ProfileHandler> GetProfileHandler(uint64 step,
                                                     int64 execution_count,
                                                     const RunOptions& ropts) {
@@ -156,8 +160,7 @@ class MasterSession::ReffedClientGraph : public core::RefCounted {
         LoggingResponse* resp = new LoggingResponse;
         p.worker->LoggingAsync(
             &req, resp,
-            [step_id, ss, resp, &scoped_mu, &waiting_for,
-             &all_done](const Status& s) {
+            [step_id, ss, resp, &scoped_mu, &all_done](const Status& s) {
               {
                 mutex_lock l(scoped_mu);
                 if (s.ok()) {
@@ -226,6 +229,7 @@ class MasterSession::ReffedClientGraph : public core::RefCounted {
 
  private:
   const string session_handle_;
+  const BuildGraphOptions bg_opts_;
   const std::unique_ptr<ClientGraph> client_graph_;
   const SessionOptions session_opts_;
   const bool is_partial_;
@@ -445,6 +449,7 @@ Status MasterSession::ReffedClientGraph::DoRegisterPartitions(
     *c->req.mutable_graph_options() = session_opts_.config.graph_options();
     *c->req.mutable_debug_options() =
         callable_opts_.run_options().debug_options();
+    c->req.set_collective_graph_key(bg_opts_.collective_graph_key);
     VLOG(2) << "Register " << c->req.graph_def().DebugString();
     auto cb = [c, &done](const Status& s) {
       c->status = s;
@@ -1066,6 +1071,9 @@ void BuildBuildGraphOptions(const RunStepRequestWrapper& req,
     *callable_opts->mutable_run_options()->mutable_debug_options() =
         req.options().debug_options();
   }
+
+  opts->collective_graph_key =
+      req.options().experimental().collective_graph_key();
 }
 
 void BuildBuildGraphOptions(const PartialRunSetupRequest& req,
@@ -1103,6 +1111,10 @@ uint64 HashBuildGraphOptions(const BuildGraphOptions& opts) {
     h = Hash64(watch_summary.c_str(), watch_summary.size(), h);
   }
 
+  if (opts.collective_graph_key != BuildGraphOptions::kNoCollectiveGraphKey) {
+    h = Hash64Combine(opts.collective_graph_key, h);
+  }
+
   return h;
 }
 
@@ -1119,6 +1131,9 @@ string BuildGraphOptionsString(const BuildGraphOptions& opts) {
   for (const string& name : opts.callable_options.fetch()) {
     strings::StrAppend(&buf, " FeE: ", name);
   }
+  if (opts.collective_graph_key != BuildGraphOptions::kNoCollectiveGraphKey) {
+    strings::StrAppend(&buf, "\nGK: ", opts.collective_graph_key);
+  }
   strings::StrAppend(&buf, "\n");
   return buf;
 }
@@ -1207,7 +1222,7 @@ Status MasterSession::CreateWorkerSessions(
   std::vector<WorkerGroup> workers(worker_names.size());
 
   // Release the workers.
-  auto cleanup = gtl::MakeCleanup([this, &workers, worker_cache] {
+  auto cleanup = gtl::MakeCleanup([&workers, worker_cache] {
     for (auto&& worker_group : workers) {
       if (worker_group.worker != nullptr) {
         worker_cache->ReleaseWorker(*worker_group.name, worker_group.worker);
@@ -1289,7 +1304,7 @@ Status MasterSession::DeleteWorkerSessions() {
   std::vector<WorkerGroup> workers(worker_names.size());
 
   // Release the workers.
-  auto cleanup = gtl::MakeCleanup([this, &workers, worker_cache] {
+  auto cleanup = gtl::MakeCleanup([&workers, worker_cache] {
     for (auto&& worker_group : workers) {
       if (worker_group.worker != nullptr) {
         worker_cache->ReleaseWorker(*worker_group.name, worker_group.worker);
@@ -1431,11 +1446,35 @@ void MasterSession::ClearRunsTable(std::vector<ReffedClientGraph*>* to_unref,
   rcg_map->clear();
 }
 
-namespace {
-uint64 MakeStepId() {
-  return (random::New64() & ((1uLL << 56) - 1)) | (1uLL << 56);
+uint64 MasterSession::NewStepId(int64 graph_key) {
+  if (graph_key == BuildGraphOptions::kNoCollectiveGraphKey) {
+    // StepId must leave the most-significant 7 bits empty for future use.
+    return random::New64() & (((1uLL << 56) - 1) | (1uLL << 56));
+  } else {
+    uint64 step_id = env_->collective_executor_mgr->NextStepId(graph_key);
+    int32 retry_count = 0;
+    while (step_id == CollectiveExecutor::kInvalidId) {
+      Notification note;
+      Status status;
+      env_->collective_executor_mgr->RefreshStepIdSequenceAsync(
+          graph_key, [&status, &note](const Status& s) {
+            status = s;
+            note.Notify();
+          });
+      note.WaitForNotification();
+      if (!status.ok()) {
+        LOG(ERROR) << "Bad status from "
+                      "collective_executor_mgr->RefreshStepIdSequence: "
+                   << status << ".  Retrying.";
+        int64 delay_micros = std::min(60000000LL, 1000000LL * ++retry_count);
+        Env::Default()->SleepForMicroseconds(delay_micros);
+      } else {
+        step_id = env_->collective_executor_mgr->NextStepId(graph_key);
+      }
+    }
+    return step_id;
+  }
 }
-}  // namespace
 
 Status MasterSession::PartialRunSetup(const PartialRunSetupRequest* req,
                                       PartialRunSetupResponse* resp) {
@@ -1457,15 +1496,13 @@ Status MasterSession::PartialRunSetup(const PartialRunSetupRequest* req,
   // Prepare.
   BuildGraphOptions opts;
   BuildBuildGraphOptions(*req, &opts);
-  int64 count;
+  int64 count = 0;
   TF_RETURN_IF_ERROR(StartStep(opts, true, &rcg, &count));
-  // Keeps the highest 8 bits 0x01: we reserve some bits of the
-  // step_id for future use.
-  const uint64 step_id = MakeStepId();
-  TRACEPRINTF("stepid %llu", step_id);
 
   rcg->Ref();
-  RunState* run_state = new RunState(inputs, outputs, rcg, step_id, count);
+  RunState* run_state =
+      new RunState(inputs, outputs, rcg,
+                   NewStepId(BuildGraphOptions::kNoCollectiveGraphKey), count);
   {
     mutex_lock l(mu_);
     partial_runs_.emplace(
@@ -1567,6 +1604,13 @@ Status MasterSession::DoPartialRun(CallOptions* opts,
     }
     run_state = it->second.get();
   }
+  // CollectiveOps are not supported in partial runs.
+  if (req.options().experimental().collective_graph_key() !=
+      BuildGraphOptions::kNoCollectiveGraphKey) {
+    return errors::InvalidArgument(
+        "PartialRun does not support Collective ops.  collective_graph_key "
+        "must be kNoCollectiveGraphKey.");
+  }
 
   // If this is the first partial run, initialize the PerStepState.
   if (!run_state->step_started) {
@@ -1744,7 +1788,11 @@ Status MasterSession::PostRunCleanup(MasterSession::ReffedClientGraph* rcg,
   Status s = run_status;
   if (s.ok()) {
     pss->end_micros = Env::Default()->NowMicros();
-
+    if (rcg->build_graph_options().collective_graph_key !=
+        BuildGraphOptions::kNoCollectiveGraphKey) {
+      env_->collective_executor_mgr->RetireStepId(
+          rcg->build_graph_options().collective_graph_key, step_id);
+    }
     // Schedule post-processing and cleanup to be done asynchronously.
     rcg->ProcessStats(step_id, pss, ph.get(), run_options, out_run_metadata);
   } else if (errors::IsCancelled(s)) {
@@ -1802,7 +1850,7 @@ Status MasterSession::DoRunWithLocalExecution(
 
   // Keeps the highest 8 bits 0x01: we reserve some bits of the
   // step_id for future use.
-  const uint64 step_id = MakeStepId();
+  uint64 step_id = NewStepId(bgopts.collective_graph_key);
   TRACEPRINTF("stepid %llu", step_id);
 
   std::unique_ptr<ProfileHandler> ph;
@@ -1866,9 +1914,8 @@ Status MasterSession::DoRunCallable(CallOptions* opts, ReffedClientGraph* rcg,
   // Prepare.
   int64 count = rcg->get_and_increment_execution_count();
 
-  // Keeps the highest 8 bits 0x01: we reserve some bits of the
-  // step_id for future use.
-  const uint64 step_id = MakeStepId();
+  const uint64 step_id =
+      NewStepId(rcg->build_graph_options().collective_graph_key);
   TRACEPRINTF("stepid %llu", step_id);
 
   const RunOptions& run_options = rcg->callable_options().run_options();
diff --git a/tensorflow/core/distributed_runtime/master_session.h b/tensorflow/core/distributed_runtime/master_session.h
index ec34e20b79afe9e3b0eef3aa70639ef411977e6f..449a6d3e3c01afab10efa5c1d72a4d0dbb75989c 100644
--- a/tensorflow/core/distributed_runtime/master_session.h
+++ b/tensorflow/core/distributed_runtime/master_session.h
@@ -141,6 +141,8 @@ class MasterSession : public core::RefCounted {
 
   std::atomic<int64> partial_run_handle_counter_ = {0};
 
+  uint64 NewStepId(int64 graph_key);
+
   mutex mu_;
   std::unique_ptr<GraphExecutionState> execution_state_ GUARDED_BY(mu_);
   int64 graph_version_;
@@ -175,6 +177,7 @@ class MasterSession : public core::RefCounted {
     std::unordered_map<string, bool> pending_outputs;  // true if fetched
     ReffedClientGraph* rcg = nullptr;
     uint64 step_id;
+    int64 collective_graph_key;
     int64 count = 0;
     PerStepState pss;
     std::unique_ptr<ProfileHandler> ph;
diff --git a/tensorflow/core/distributed_runtime/master_test.cc b/tensorflow/core/distributed_runtime/master_test.cc
index f2c1f3489c388d6a5fff729b1c8f98136532105c..62b18a45b1e16f6c20bb468868c5b5690f86d90f 100644
--- a/tensorflow/core/distributed_runtime/master_test.cc
+++ b/tensorflow/core/distributed_runtime/master_test.cc
@@ -18,9 +18,10 @@ limitations under the License.
 #include <map>
 #include <memory>
 
-#include "grpc++/grpc++.h"
+#include "grpcpp/grpcpp.h"
 
 #include "tensorflow/core/distributed_runtime/rpc/grpc_channel.h"
+#include "tensorflow/core/distributed_runtime/rpc/grpc_master_service_impl.h"
 #include "tensorflow/core/distributed_runtime/rpc/grpc_testlib.h"
 #include "tensorflow/core/distributed_runtime/rpc/grpc_util.h"
 #include "tensorflow/core/framework/allocator.h"
@@ -37,7 +38,6 @@ limitations under the License.
 #include "tensorflow/core/platform/test.h"
 #include "tensorflow/core/platform/types.h"
 #include "tensorflow/core/protobuf/master.pb.h"
-#include "tensorflow/core/protobuf/master_service.grpc.pb.h"
 
 namespace tensorflow {
 
diff --git a/tensorflow/core/distributed_runtime/remote_device_test.cc b/tensorflow/core/distributed_runtime/remote_device_test.cc
index 778060daafa5718f386b61411471930b1590d08b..a04e79328b0b53445d935f1b2b51219685f5873e 100644
--- a/tensorflow/core/distributed_runtime/remote_device_test.cc
+++ b/tensorflow/core/distributed_runtime/remote_device_test.cc
@@ -49,8 +49,9 @@ class RemoteDeviceTest : public ::testing::Test {
     TF_CHECK_OK(spec.AddHostPortsJob("localhost", {hostport}));
     ChannelCreationFunction channel_func =
         ConvertToChannelCreationFunction(NewHostPortGrpcChannel);
-    worker_cache_.reset(
-        NewGrpcWorkerCache(NewGrpcChannelCache(spec, channel_func)));
+    std::shared_ptr<GrpcChannelCache> channel_cache(
+        NewGrpcChannelCache(spec, channel_func));
+    worker_cache_.reset(NewGrpcWorkerCache(channel_cache));
     remote_name_ = "/job:localhost/replica:0/task:0";
     wi_ = worker_cache_->CreateWorker(remote_name_);
   }
diff --git a/tensorflow/core/distributed_runtime/rpc/BUILD b/tensorflow/core/distributed_runtime/rpc/BUILD
index 40028ee241b6e78634c63de04508ba4dca947b7a..882271e3f577298e91bac56dc40c71e07edf43cb 100644
--- a/tensorflow/core/distributed_runtime/rpc/BUILD
+++ b/tensorflow/core/distributed_runtime/rpc/BUILD
@@ -41,8 +41,8 @@ cc_library(
     srcs = ["grpc_util.cc"],
     hdrs = ["grpc_util.h"],
     deps = [
-        "@grpc//:grpc_unsecure",
-        "@grpc//:grpc++_unsecure",
+        "@grpc",
+        "@grpc//:grpc++",
         "//tensorflow/core:lib",
         # Required to be able to overload TensorResponse parsing.
         "//tensorflow/core/distributed_runtime:tensor_coding",
@@ -56,7 +56,7 @@ cc_library(
     deps = [
         ":grpc_util",
         "//tensorflow/core:lib",
-        "@grpc//:grpc++_unsecure",
+        "@grpc//:grpc++",
     ],
 )
 
@@ -70,7 +70,7 @@ cc_library(
         "//tensorflow/core:lib",
         "//tensorflow/core/distributed_runtime:call_options",
         "//tensorflow/core/distributed_runtime:tensor_coding",
-        "@grpc//:grpc++_unsecure",
+        "@grpc//:grpc++",
     ],
 )
 
@@ -90,7 +90,7 @@ cc_library(
         "//tensorflow/core/distributed_runtime:tensor_coding",
         "//tensorflow/core/distributed_runtime:worker_cache_logger",
         "//tensorflow/core/distributed_runtime:worker_interface",
-        "@grpc//:grpc++_unsecure",
+        "@grpc//:grpc++",
     ],
 )
 
@@ -103,7 +103,7 @@ cc_library(
         "//tensorflow/core:framework",
         "//tensorflow/core:lib",
         "//tensorflow/core:lib_internal",
-        "@grpc//:grpc++_unsecure",
+        "@grpc//:grpc++",
     ],
 )
 
@@ -118,7 +118,7 @@ cc_library(
         "//tensorflow/core:lib",
         "//tensorflow/core:protos_all_cc",
         "//tensorflow/core:worker_proto_cc",
-        "@grpc//:grpc++_unsecure",
+        "@grpc//:grpc++",
     ],
 )
 
@@ -129,7 +129,7 @@ cc_library(
     deps = [
         "//tensorflow/core:lib",
         "//tensorflow/core:lib_internal",
-        "@grpc//:grpc++_unsecure",
+        "@grpc//:grpc++",
     ],
 )
 
@@ -180,7 +180,7 @@ tf_cuda_library(
         "//tensorflow/core/distributed_runtime:worker_cache",
         "//tensorflow/core/distributed_runtime:worker_env",
         "//tensorflow/core/distributed_runtime:worker_session",
-        "@grpc//:grpc++_unsecure",
+        "@grpc//:grpc++",
     ],
 )
 
@@ -192,7 +192,7 @@ cc_library(
         ":grpc_util",
         "//tensorflow/core:worker_proto_cc",
         "//tensorflow/core/distributed_runtime:tensor_coding",
-        "@grpc//:grpc++_unsecure",
+        "@grpc//:grpc++",
     ],
 )
 
@@ -225,7 +225,7 @@ cc_library(
         "//tensorflow/core:lib_internal",
         "//tensorflow/core:master_proto_cc",
         "//tensorflow/core/distributed_runtime:master",
-        "@grpc//:grpc++_unsecure",
+        "@grpc//:grpc++",
     ],
     alwayslink = 1,
 )
@@ -236,7 +236,7 @@ cc_library(
     hdrs = ["grpc_master_service_impl.h"],
     deps = [
         "//tensorflow/core:master_proto_cc",
-        "@grpc//:grpc++_unsecure",
+        "@grpc//:grpc++",
     ],
 )
 
@@ -274,16 +274,19 @@ cc_library(
         "//tensorflow/core:framework",
         "//tensorflow/core:framework_internal",
         "//tensorflow/core:lib",
+        "//tensorflow/core/distributed_runtime:collective_param_resolver_distributed",
+        "//tensorflow/core/distributed_runtime:device_resolver_distributed",
         "//tensorflow/core/distributed_runtime:graph_mgr",
         "//tensorflow/core/distributed_runtime:local_master",
         "//tensorflow/core/distributed_runtime:master",
         "//tensorflow/core/distributed_runtime:master_env",
         "//tensorflow/core/distributed_runtime:master_session",
+        "//tensorflow/core/distributed_runtime:rpc_collective_executor_mgr",
         "//tensorflow/core/distributed_runtime:server_lib",
         "//tensorflow/core/distributed_runtime:session_mgr",
         "//tensorflow/core/distributed_runtime:worker_env",
-        "@grpc//:grpc++_unsecure",
-        "@grpc//:grpc_unsecure",
+        "@grpc",
+        "@grpc//:grpc++",
     ],
     alwayslink = 1,
 )
@@ -310,22 +313,10 @@ tf_cc_binary(
         "//tensorflow/core:protos_all_cc",
         "//tensorflow/core/distributed_runtime:server_lib",
         "//tensorflow/core/kernels:data_flow",
-        "@grpc//:grpc++_unsecure",
+        "@grpc//:grpc++",
     ],
 )
 
-tf_cuda_library(
-    name = "grpc_testlib_ops",
-    testonly = 1,
-    srcs = ["grpc_testlib_ops.cc"],
-    linkstatic = 1,  # Seems to be needed since alwayslink is broken in bazel
-    deps = [
-        "//tensorflow/core:framework",
-        "//tensorflow/core:lib",
-    ],
-    alwayslink = 1,
-)
-
 tf_cc_binary(
     name = "grpc_testlib_server",
     testonly = 1,
@@ -334,11 +325,11 @@ tf_cc_binary(
     ],
     deps = [
         ":grpc_server_lib",
-        ":grpc_testlib_ops",
         "//tensorflow/core:core_cpu",
         "//tensorflow/core:framework_internal",
         "//tensorflow/core:lib",
         "//tensorflow/core:protos_all_cc",
+        "//tensorflow/core:testlib",
         "//tensorflow/core/distributed_runtime:server_lib",
         "//tensorflow/core/kernels:constant_op",
         "//tensorflow/core/kernels:cwise_op",
@@ -347,7 +338,7 @@ tf_cc_binary(
         "//tensorflow/core/kernels:matmul_op",
         "//tensorflow/core/kernels:reduction_ops",
         "//tensorflow/core/kernels:variable_ops",
-        "@grpc//:grpc++_unsecure",
+        "@grpc//:grpc++",
     ],
 )
 
@@ -362,12 +353,12 @@ tf_cuda_library(
     visibility = ["//tensorflow:__subpackages__"],
     deps = [
         ":grpc_session",
-        ":grpc_testlib_ops",
         "//tensorflow/core:core_cpu",
         "//tensorflow/core:framework",
         "//tensorflow/core:lib",
         "//tensorflow/core:protos_all_cc",
         "//tensorflow/core:test",
+        "//tensorflow/core:testlib",
     ],
     alwayslink = 1,
 )
@@ -441,7 +432,7 @@ tf_cc_test(
         "//tensorflow/core:test_main",
         "//tensorflow/core:testlib",
         "//tensorflow/core:worker_proto_cc",
-        "@grpc//:grpc++_unsecure",
+        "@grpc//:grpc++",
     ],
 )
 
@@ -454,8 +445,8 @@ tf_cc_test(
         "//tensorflow/core:test",
         "//tensorflow/core:test_main",
         "//tensorflow/core:worker_proto_cc",
-        "@grpc//:grpc++_unsecure",
-        "@grpc//:grpc_unsecure",
+        "@grpc",
+        "@grpc//:grpc++",
     ],
 )
 
diff --git a/tensorflow/core/distributed_runtime/rpc/eager/BUILD b/tensorflow/core/distributed_runtime/rpc/eager/BUILD
index 1a3bd9d6bf0e4240b0a9326e27b45c37ac25c723..a5472159ccb7fdafa4da7eed1601f7e3c96f8654 100644
--- a/tensorflow/core/distributed_runtime/rpc/eager/BUILD
+++ b/tensorflow/core/distributed_runtime/rpc/eager/BUILD
@@ -12,7 +12,7 @@ cc_library(
     hdrs = ["grpc_eager_service.h"],
     deps = [
         "//tensorflow/core:eager_service_proto_cc",
-        "@grpc//:grpc++_unsecure",
+        "@grpc//:grpc++",
     ],
 )
 
@@ -29,7 +29,7 @@ cc_library(
         "//tensorflow/core/distributed_runtime/rpc:grpc_state",
         "//tensorflow/core/distributed_runtime/rpc:grpc_util",
         "//tensorflow/core/distributed_runtime/rpc/eager:grpc_eager_service",
-        "@grpc//:grpc++_unsecure",
+        "@grpc//:grpc++",
     ],
 )
 
@@ -48,7 +48,7 @@ cc_library(
         "//tensorflow/core/distributed_runtime/rpc:grpc_util",
         "//tensorflow/core/distributed_runtime/rpc:grpc_worker_cache",
         "//tensorflow/core/distributed_runtime/rpc:grpc_worker_service",
-        "@grpc//:grpc++_unsecure",
+        "@grpc//:grpc++",
     ],
 )
 
diff --git a/tensorflow/core/distributed_runtime/rpc/eager/eager_grpc_server_lib.h b/tensorflow/core/distributed_runtime/rpc/eager/eager_grpc_server_lib.h
index f5dc4c831d04a00115eb011b6b0eb458ff16416c..9b863ccee506d0c5dad32d2eef7d28a865e4a2b0 100644
--- a/tensorflow/core/distributed_runtime/rpc/eager/eager_grpc_server_lib.h
+++ b/tensorflow/core/distributed_runtime/rpc/eager/eager_grpc_server_lib.h
@@ -74,7 +74,7 @@ class EagerGrpcServer : public GrpcServer {
           this->eager_service_.reset(
               new eager::GrpcEagerServiceImpl(worker_env, server_builder));
         },
-        nullptr));
+        nullptr, nullptr));
 
     worker_session_ = WorkerSession::CreateWithBorrowedDeviceMgr(
         "", worker_name_,
diff --git a/tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_client.cc b/tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_client.cc
index 4786c43ee2c5118858bbedbb65d0f0a7e9b2bdc9..b23466037f46200109a0c42be66ed0a8e32ef1e7 100644
--- a/tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_client.cc
+++ b/tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_client.cc
@@ -15,7 +15,7 @@ limitations under the License.
 
 #include "tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_client.h"
 
-#include "grpc++/generic/generic_stub.h"
+#include "grpcpp/generic/generic_stub.h"
 #include "tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_service.h"
 #include "tensorflow/core/distributed_runtime/rpc/grpc_client_cq_tag.h"
 #include "tensorflow/core/distributed_runtime/rpc/grpc_state.h"
diff --git a/tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_service.cc b/tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_service.cc
index 3fd7deaa868a9d0541bef31bd64eee4dc34acc96..39ab6856c53e86dbe4263e0054111fe3a60ebfaf 100644
--- a/tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_service.cc
+++ b/tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_service.cc
@@ -15,14 +15,14 @@ limitations under the License.
 
 #include "tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_service.h"
 
-#include "grpc++/impl/codegen/async_stream.h"
-#include "grpc++/impl/codegen/async_unary_call.h"
-#include "grpc++/impl/codegen/channel_interface.h"
-#include "grpc++/impl/codegen/client_unary_call.h"
-#include "grpc++/impl/codegen/method_handler_impl.h"
-#include "grpc++/impl/codegen/rpc_service_method.h"
-#include "grpc++/impl/codegen/service_type.h"
-#include "grpc++/impl/codegen/sync_stream.h"
+#include "grpcpp/impl/codegen/async_stream.h"
+#include "grpcpp/impl/codegen/async_unary_call.h"
+#include "grpcpp/impl/codegen/channel_interface.h"
+#include "grpcpp/impl/codegen/client_unary_call.h"
+#include "grpcpp/impl/codegen/method_handler_impl.h"
+#include "grpcpp/impl/codegen/rpc_service_method.h"
+#include "grpcpp/impl/codegen/service_type.h"
+#include "grpcpp/impl/codegen/sync_stream.h"
 
 namespace tensorflow {
 namespace eager {
diff --git a/tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_service.h b/tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_service.h
index d7b192ac857a423af81c544bb57c672ae82129d9..66458186ada4bba1433b1f95b0a3ac7ea699117a 100644
--- a/tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_service.h
+++ b/tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_service.h
@@ -16,14 +16,14 @@ limitations under the License.
 #ifndef TENSORFLOW_CORE_DISTRIBUTED_RUNTIME_RPC_EAGER_GRPC_EAGER_SERVICE_H_
 #define TENSORFLOW_CORE_DISTRIBUTED_RUNTIME_RPC_EAGER_GRPC_EAGER_SERVICE_H_
 
-#include "grpc++/impl/codegen/async_stream.h"
-#include "grpc++/impl/codegen/async_unary_call.h"
-#include "grpc++/impl/codegen/proto_utils.h"
-#include "grpc++/impl/codegen/rpc_method.h"
-#include "grpc++/impl/codegen/service_type.h"
-#include "grpc++/impl/codegen/status.h"
-#include "grpc++/impl/codegen/stub_options.h"
-#include "grpc++/impl/codegen/sync_stream.h"
+#include "grpcpp/impl/codegen/async_stream.h"
+#include "grpcpp/impl/codegen/async_unary_call.h"
+#include "grpcpp/impl/codegen/proto_utils.h"
+#include "grpcpp/impl/codegen/rpc_method.h"
+#include "grpcpp/impl/codegen/service_type.h"
+#include "grpcpp/impl/codegen/status.h"
+#include "grpcpp/impl/codegen/stub_options.h"
+#include "grpcpp/impl/codegen/sync_stream.h"
 
 #include "tensorflow/core/protobuf/eager_service.pb.h"
 
diff --git a/tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_service_impl.h b/tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_service_impl.h
index 65550caf646283b2a116373f7c54c28efdea6150..e94aedf535f6c2502721064aa94465a078463b31 100644
--- a/tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_service_impl.h
+++ b/tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_service_impl.h
@@ -16,9 +16,9 @@ limitations under the License.
 #ifndef TENSORFLOW_CORE_DISTRIBUTED_RUNTIME_RPC_EAGER_GRPC_EAGER_SERVICE_IMPL_H_
 #define TENSORFLOW_CORE_DISTRIBUTED_RUNTIME_RPC_EAGER_GRPC_EAGER_SERVICE_IMPL_H_
 
-#include "grpc++/alarm.h"
-#include "grpc++/completion_queue.h"
-#include "grpc++/server_builder.h"
+#include "grpcpp/alarm.h"
+#include "grpcpp/completion_queue.h"
+#include "grpcpp/server_builder.h"
 #include "tensorflow/core/distributed_runtime/eager/eager_service_impl.h"
 #include "tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_service.h"
 #include "tensorflow/core/distributed_runtime/rpc/grpc_call.h"
diff --git a/tensorflow/core/distributed_runtime/rpc/grpc_call.h b/tensorflow/core/distributed_runtime/rpc/grpc_call.h
index ecad1274cc14c7f03eddf6fbb806e886b0c7d0b2..90666def6027f1444f78ae4fa95e478d269aaf77 100644
--- a/tensorflow/core/distributed_runtime/rpc/grpc_call.h
+++ b/tensorflow/core/distributed_runtime/rpc/grpc_call.h
@@ -20,9 +20,9 @@ limitations under the License.
 #include "tensorflow/core/platform/macros.h"
 #include "tensorflow/core/platform/mutex.h"
 
-#include "grpc++/grpc++.h"
-#include "grpc++/impl/codegen/service_type.h"
-#include "grpc++/server_builder.h"
+#include "grpcpp/grpcpp.h"
+#include "grpcpp/impl/codegen/service_type.h"
+#include "grpcpp/server_builder.h"
 
 namespace tensorflow {
 
diff --git a/tensorflow/core/distributed_runtime/rpc/grpc_channel.cc b/tensorflow/core/distributed_runtime/rpc/grpc_channel.cc
index 613188244fcb196a2bca7307d536a652a0f7f551..0ebc084cb67f9d272845f71f7b2db73e0d6aa176 100644
--- a/tensorflow/core/distributed_runtime/rpc/grpc_channel.cc
+++ b/tensorflow/core/distributed_runtime/rpc/grpc_channel.cc
@@ -19,7 +19,7 @@ limitations under the License.
 #include <map>
 #include <unordered_map>
 
-#include "grpc++/create_channel.h"
+#include "grpcpp/create_channel.h"
 
 #include "tensorflow/core/lib/core/errors.h"
 #include "tensorflow/core/lib/core/status.h"
diff --git a/tensorflow/core/distributed_runtime/rpc/grpc_channel.h b/tensorflow/core/distributed_runtime/rpc/grpc_channel.h
index 48b9d958aa921b0e758fc17a0f4da7c3a13e6c16..4861cdb691cefefb375144ed7bb64b54c1c7c0e1 100644
--- a/tensorflow/core/distributed_runtime/rpc/grpc_channel.h
+++ b/tensorflow/core/distributed_runtime/rpc/grpc_channel.h
@@ -22,7 +22,7 @@ limitations under the License.
 #include <string>
 #include <vector>
 
-#include "grpc++/grpc++.h"
+#include "grpcpp/grpcpp.h"
 
 #include "tensorflow/core/distributed_runtime/rpc/grpc_util.h"
 
diff --git a/tensorflow/core/distributed_runtime/rpc/grpc_client_cq_tag.h b/tensorflow/core/distributed_runtime/rpc/grpc_client_cq_tag.h
index d367b83ee7fac5001bd83737531689b64a7e3774..6e7f5dbd1367d2c4d62aaf5ea2f1828ddada0932 100644
--- a/tensorflow/core/distributed_runtime/rpc/grpc_client_cq_tag.h
+++ b/tensorflow/core/distributed_runtime/rpc/grpc_client_cq_tag.h
@@ -16,7 +16,7 @@ limitations under the License.
 #ifndef TENSORFLOW_CORE_DISTRIBUTED_RUNTIME_RPC_GRPC_CLIENT_CQ_TAG_H_
 #define TENSORFLOW_CORE_DISTRIBUTED_RUNTIME_RPC_GRPC_CLIENT_CQ_TAG_H_
 
-#include "grpc++/grpc++.h"
+#include "grpcpp/grpcpp.h"
 
 #include "tensorflow/core/distributed_runtime/rpc/grpc_util.h"
 #include "tensorflow/core/lib/core/status.h"
diff --git a/tensorflow/core/distributed_runtime/rpc/grpc_master_service.cc b/tensorflow/core/distributed_runtime/rpc/grpc_master_service.cc
index e025e555dd065674edff3492df1cc4bd040fc741..127dea2882ad2c71b68ec4fd5cc6b59285768a39 100644
--- a/tensorflow/core/distributed_runtime/rpc/grpc_master_service.cc
+++ b/tensorflow/core/distributed_runtime/rpc/grpc_master_service.cc
@@ -30,8 +30,8 @@ limitations under the License.
 // RunGraph on workers.
 #include "tensorflow/core/distributed_runtime/rpc/grpc_master_service.h"
 
-#include "grpc++/alarm.h"
-#include "grpc++/server_builder.h"
+#include "grpcpp/alarm.h"
+#include "grpcpp/server_builder.h"
 
 #include "tensorflow/core/distributed_runtime/master.h"
 #include "tensorflow/core/distributed_runtime/rpc/async_service_interface.h"
diff --git a/tensorflow/core/distributed_runtime/rpc/grpc_master_service_impl.cc b/tensorflow/core/distributed_runtime/rpc/grpc_master_service_impl.cc
index 85adfd2c762db0ba451d0dd0e7160d16e80054fd..770a0fcf14fd9907457639cea03a8fbe07312341 100644
--- a/tensorflow/core/distributed_runtime/rpc/grpc_master_service_impl.cc
+++ b/tensorflow/core/distributed_runtime/rpc/grpc_master_service_impl.cc
@@ -15,14 +15,14 @@ limitations under the License.
 
 #include "tensorflow/core/distributed_runtime/rpc/grpc_master_service_impl.h"
 
-#include "grpc++/impl/codegen/async_stream.h"
-#include "grpc++/impl/codegen/async_unary_call.h"
-#include "grpc++/impl/codegen/channel_interface.h"
-#include "grpc++/impl/codegen/client_unary_call.h"
-#include "grpc++/impl/codegen/method_handler_impl.h"
-#include "grpc++/impl/codegen/rpc_service_method.h"
-#include "grpc++/impl/codegen/service_type.h"
-#include "grpc++/impl/codegen/sync_stream.h"
+#include "grpcpp/impl/codegen/async_stream.h"
+#include "grpcpp/impl/codegen/async_unary_call.h"
+#include "grpcpp/impl/codegen/channel_interface.h"
+#include "grpcpp/impl/codegen/client_unary_call.h"
+#include "grpcpp/impl/codegen/method_handler_impl.h"
+#include "grpcpp/impl/codegen/rpc_service_method.h"
+#include "grpcpp/impl/codegen/service_type.h"
+#include "grpcpp/impl/codegen/sync_stream.h"
 
 namespace tensorflow {
 
diff --git a/tensorflow/core/distributed_runtime/rpc/grpc_master_service_impl.h b/tensorflow/core/distributed_runtime/rpc/grpc_master_service_impl.h
index 8f1b589698276d5df7aa0245d57bc5bdb4a9f0db..751f2633e752c26be716f9f7337ec46a17a6e265 100644
--- a/tensorflow/core/distributed_runtime/rpc/grpc_master_service_impl.h
+++ b/tensorflow/core/distributed_runtime/rpc/grpc_master_service_impl.h
@@ -16,14 +16,14 @@ limitations under the License.
 #ifndef TENSORFLOW_CORE_DISTRIBUTED_RUNTIME_RPC_GRPC_MASTER_SERVICE_IMPL_H_
 #define TENSORFLOW_CORE_DISTRIBUTED_RUNTIME_RPC_GRPC_MASTER_SERVICE_IMPL_H_
 
-#include "grpc++/impl/codegen/async_stream.h"
-#include "grpc++/impl/codegen/async_unary_call.h"
-#include "grpc++/impl/codegen/proto_utils.h"
-#include "grpc++/impl/codegen/rpc_method.h"
-#include "grpc++/impl/codegen/service_type.h"
-#include "grpc++/impl/codegen/status.h"
-#include "grpc++/impl/codegen/stub_options.h"
-#include "grpc++/impl/codegen/sync_stream.h"
+#include "grpcpp/impl/codegen/async_stream.h"
+#include "grpcpp/impl/codegen/async_unary_call.h"
+#include "grpcpp/impl/codegen/proto_utils.h"
+#include "grpcpp/impl/codegen/rpc_method.h"
+#include "grpcpp/impl/codegen/service_type.h"
+#include "grpcpp/impl/codegen/status.h"
+#include "grpcpp/impl/codegen/stub_options.h"
+#include "grpcpp/impl/codegen/sync_stream.h"
 
 #include "tensorflow/core/protobuf/master.pb.h"
 
diff --git a/tensorflow/core/distributed_runtime/rpc/grpc_remote_worker.cc b/tensorflow/core/distributed_runtime/rpc/grpc_remote_worker.cc
index 1acf1fb4fc1ea9b4f516f5bba0e8a2b1d4f72e25..6008462d0448130ed05393dd438d01002d243167 100644
--- a/tensorflow/core/distributed_runtime/rpc/grpc_remote_worker.cc
+++ b/tensorflow/core/distributed_runtime/rpc/grpc_remote_worker.cc
@@ -17,8 +17,8 @@ limitations under the License.
 
 #include <utility>
 
-#include "grpc++/generic/generic_stub.h"
-#include "grpc++/grpc++.h"
+#include "grpcpp/generic/generic_stub.h"
+#include "grpcpp/grpcpp.h"
 
 #include "tensorflow/core/common_runtime/process_util.h"
 #include "tensorflow/core/distributed_runtime/rpc/grpc_client_cq_tag.h"
diff --git a/tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc b/tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc
index e5ffb4ed2fdfdd84a2566643983b2da783df385b..e7914740ae0ae3468657f2fa91b8a2f492adf508 100644
--- a/tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc
+++ b/tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc
@@ -19,14 +19,16 @@ limitations under the License.
 #include <limits>
 #include <memory>
 
-#include "grpc++/grpc++.h"
-#include "grpc++/security/credentials.h"
-#include "grpc++/server_builder.h"
 #include "grpc/support/alloc.h"
+#include "grpcpp/grpcpp.h"
+#include "grpcpp/security/credentials.h"
+#include "grpcpp/server_builder.h"
 
 #include "tensorflow/core/common_runtime/device_factory.h"
 #include "tensorflow/core/common_runtime/device_mgr.h"
 #include "tensorflow/core/common_runtime/process_util.h"
+#include "tensorflow/core/distributed_runtime/collective_param_resolver_distributed.h"
+#include "tensorflow/core/distributed_runtime/device_resolver_distributed.h"
 #include "tensorflow/core/distributed_runtime/graph_mgr.h"
 #include "tensorflow/core/distributed_runtime/local_master.h"
 #include "tensorflow/core/distributed_runtime/master.h"
@@ -38,6 +40,7 @@ limitations under the License.
 #include "tensorflow/core/distributed_runtime/rpc/grpc_worker_cache.h"
 #include "tensorflow/core/distributed_runtime/rpc/grpc_worker_service.h"
 #include "tensorflow/core/distributed_runtime/rpc/rpc_rendezvous_mgr.h"
+#include "tensorflow/core/distributed_runtime/rpc_collective_executor_mgr.h"
 #include "tensorflow/core/distributed_runtime/server_lib.h"
 #include "tensorflow/core/distributed_runtime/worker_env.h"
 #include "tensorflow/core/framework/op.h"
@@ -106,6 +109,7 @@ GrpcServer::~GrpcServer() {
 Status GrpcServer::Init(
     ServiceInitFunction service_func,
     const RendezvousMgrCreationFunction& rendezvous_mgr_func,
+    const CollectiveMgrCreationFunction& collective_mgr_func,
     const WorkerCreationFunction& worker_func,
     const StatsPublisherFactory& stats_factory) {
   mutex_lock l(mu_);
@@ -204,6 +208,26 @@ Status GrpcServer::Init(
       WorkerCacheFactory(worker_cache_factory_options, &worker_cache));
   CHECK_NE(nullptr, worker_cache);
 
+  if (collective_mgr_func) {
+    worker_env_.collective_executor_mgr =
+        collective_mgr_func(config, &worker_env_, worker_cache);
+    if (!worker_env_.collective_executor_mgr) {
+      return errors::Internal(
+          "collective_mgr_func did not return CollectiveExecutorMgr");
+    }
+  } else {
+    std::unique_ptr<DeviceResolverDistributed> dev_resolver(
+        new DeviceResolverDistributed(worker_env_.device_mgr, worker_cache,
+                                      default_worker_name));
+    std::unique_ptr<CollectiveParamResolverDistributed> param_resolver(
+        new CollectiveParamResolverDistributed(config, worker_env_.device_mgr,
+                                               dev_resolver.get(), worker_cache,
+                                               default_worker_name));
+    worker_env_.collective_executor_mgr = new RpcCollectiveExecutorMgr(
+        config, worker_env_.device_mgr, std::move(dev_resolver),
+        std::move(param_resolver), worker_cache, default_worker_name);
+  }
+
   // Set up worker environment.
   worker_env_.session_mgr = new SessionMgr(
       &worker_env_, SessionMgr::WorkerNameFromServerDef(server_def_),
@@ -246,18 +270,29 @@ Status GrpcServer::Init(
 Status GrpcServer::Init(
     ServiceInitFunction service_func,
     const RendezvousMgrCreationFunction& rendezvous_mgr_func,
+    const CollectiveMgrCreationFunction& collective_mgr_func,
     const WorkerCreationFunction& worker_func) {
-  return Init(std::move(service_func), rendezvous_mgr_func, worker_func,
-              CreateNoOpStatsPublisher);
+  return Init(std::move(service_func), rendezvous_mgr_func, collective_mgr_func,
+              worker_func, CreateNoOpStatsPublisher);
+}
+
+Status GrpcServer::Init(
+    ServiceInitFunction service_func,
+    const RendezvousMgrCreationFunction& rendezvous_mgr_func,
+    const CollectiveMgrCreationFunction& collective_mgr_func) {
+  return Init(std::move(service_func), rendezvous_mgr_func, collective_mgr_func,
+              nullptr);
 }
 
+
 Status GrpcServer::Init(
     ServiceInitFunction service_func,
     const RendezvousMgrCreationFunction& rendezvous_mgr_func) {
-  return Init(service_func, rendezvous_mgr_func, nullptr);
+  return Init(std::move(service_func), rendezvous_mgr_func, nullptr,
+              nullptr);
 }
 
-Status GrpcServer::Init() { return Init(nullptr, nullptr, nullptr); }
+Status GrpcServer::Init() { return Init(nullptr, nullptr, nullptr, nullptr); }
 
 Status GrpcServer::ParseChannelSpec(const WorkerCacheFactoryOptions& options,
                                     GrpcChannelSpec* channel_spec) {
@@ -403,7 +438,7 @@ Status GrpcServer::Create(const ServerDef& server_def, Env* env,
   std::unique_ptr<GrpcServer> ret(
       new GrpcServer(server_def, env == nullptr ? Env::Default() : env));
   ServiceInitFunction service_func = nullptr;
-  TF_RETURN_IF_ERROR(ret->Init(service_func, NewRpcRendezvousMgr));
+  TF_RETURN_IF_ERROR(ret->Init(service_func, NewRpcRendezvousMgr, nullptr));
   *out_server = std::move(ret);
   return Status::OK();
 }
diff --git a/tensorflow/core/distributed_runtime/rpc/grpc_server_lib.h b/tensorflow/core/distributed_runtime/rpc/grpc_server_lib.h
index 0122df178ad65023b9a0b2e2c4c0fd17f9922e1e..9e53330f85b09fa9c7fde5814a2e814233f8f0c3 100644
--- a/tensorflow/core/distributed_runtime/rpc/grpc_server_lib.h
+++ b/tensorflow/core/distributed_runtime/rpc/grpc_server_lib.h
@@ -18,8 +18,8 @@ limitations under the License.
 
 #include <memory>
 
-#include "grpc++/grpc++.h"
-#include "grpc++/security/credentials.h"
+#include "grpcpp/grpcpp.h"
+#include "grpcpp/security/credentials.h"
 
 #include "tensorflow/core/common_runtime/process_util.h"
 #include "tensorflow/core/common_runtime/stats_publisher_interface.h"
@@ -29,6 +29,7 @@ limitations under the License.
 #include "tensorflow/core/distributed_runtime/server_lib.h"
 #include "tensorflow/core/distributed_runtime/session_mgr.h"
 #include "tensorflow/core/distributed_runtime/worker_env.h"
+#include "tensorflow/core/framework/collective.h"
 #include "tensorflow/core/framework/op.h"
 #include "tensorflow/core/platform/env.h"
 
@@ -41,6 +42,11 @@ class Master;
 typedef std::function<RendezvousMgrInterface*(const WorkerEnv*)>
     RendezvousMgrCreationFunction;
 
+// function that creates a CollectiveExecutorMgr.
+typedef std::function<CollectiveExecutorMgrInterface*(
+    const ConfigProto&, const WorkerEnv*, WorkerCacheInterface*)>
+    CollectiveMgrCreationFunction;
+
 // function that registers a service to the server. The service needs to
 // be registered before builder.BuildAndStart().
 typedef std::function<void(const WorkerEnv*, ::grpc::ServerBuilder*)>
@@ -71,13 +77,19 @@ class GrpcServer : public ServerInterface {
  protected:
   Status Init(ServiceInitFunction service_func,
               const RendezvousMgrCreationFunction& rendezvous_mgr_func,
+              const CollectiveMgrCreationFunction& collective_mgr_func,
               const WorkerCreationFunction& worker_func,
               const StatsPublisherFactory& stats_factory);
 
   Status Init(ServiceInitFunction service_func,
               const RendezvousMgrCreationFunction& rendezvous_mgr_func,
+              const CollectiveMgrCreationFunction& collective_mgr_func,
               const WorkerCreationFunction& worker_func);
 
+  Status Init(ServiceInitFunction service_func,
+              const RendezvousMgrCreationFunction& rendezvous_mgr_func,
+              const CollectiveMgrCreationFunction& collective_mgr_func);
+    
   Status Init(ServiceInitFunction service_func,
               const RendezvousMgrCreationFunction& rendezvous_mgr_func);
 
diff --git a/tensorflow/core/distributed_runtime/rpc/grpc_state.h b/tensorflow/core/distributed_runtime/rpc/grpc_state.h
index 59dbb7ae04f63989204ba24755700fb40a22b943..61c5bc285f2f2e38a39737408a446a84b8442690 100644
--- a/tensorflow/core/distributed_runtime/rpc/grpc_state.h
+++ b/tensorflow/core/distributed_runtime/rpc/grpc_state.h
@@ -18,8 +18,8 @@ limitations under the License.
 
 #include <utility>
 
-#include "grpc++/generic/generic_stub.h"
-#include "grpc++/grpc++.h"
+#include "grpcpp/generic/generic_stub.h"
+#include "grpcpp/grpcpp.h"
 
 #include "tensorflow/core/distributed_runtime/call_options.h"
 #include "tensorflow/core/distributed_runtime/rpc/grpc_client_cq_tag.h"
diff --git a/tensorflow/core/distributed_runtime/rpc/grpc_tensor_coding.cc b/tensorflow/core/distributed_runtime/rpc/grpc_tensor_coding.cc
index e51894b4c756b6f4cfc09fe0adf57e06cb22ee0f..d0684f18335a0f16dff92167a3b3498f68063f08 100644
--- a/tensorflow/core/distributed_runtime/rpc/grpc_tensor_coding.cc
+++ b/tensorflow/core/distributed_runtime/rpc/grpc_tensor_coding.cc
@@ -14,8 +14,8 @@ limitations under the License.
 ==============================================================================*/
 
 #include "tensorflow/core/distributed_runtime/rpc/grpc_tensor_coding.h"
-#include "grpc++/support/byte_buffer.h"
-#include "grpc++/support/slice.h"
+#include "grpcpp/support/byte_buffer.h"
+#include "grpcpp/support/slice.h"
 #include "tensorflow/core/common_runtime/dma_helper.h"
 #include "tensorflow/core/framework/tensor.h"
 #include "tensorflow/core/framework/tensor.pb.h"
diff --git a/tensorflow/core/distributed_runtime/rpc/grpc_tensor_coding_test.cc b/tensorflow/core/distributed_runtime/rpc/grpc_tensor_coding_test.cc
index 71f69e90244a6465c1d1759334c6ab8924cdcf96..7cace573e87637f2766a61a970660299f863cecc 100644
--- a/tensorflow/core/distributed_runtime/rpc/grpc_tensor_coding_test.cc
+++ b/tensorflow/core/distributed_runtime/rpc/grpc_tensor_coding_test.cc
@@ -15,8 +15,8 @@ limitations under the License.
 
 #include "tensorflow/core/distributed_runtime/rpc/grpc_tensor_coding.h"
 
-#include "grpc++/support/byte_buffer.h"
-#include "grpc++/support/slice.h"
+#include "grpcpp/support/byte_buffer.h"
+#include "grpcpp/support/slice.h"
 #include "tensorflow/core/framework/tensor.h"
 #include "tensorflow/core/framework/tensor_testutil.h"
 #include "tensorflow/core/lib/gtl/inlined_vector.h"
diff --git a/tensorflow/core/distributed_runtime/rpc/grpc_tensorflow_server.cc b/tensorflow/core/distributed_runtime/rpc/grpc_tensorflow_server.cc
index f247322bc49fa30260dcb53e649f3ddf81dc6ce9..e52b2574117d87d8a79b7d2d612ed7722b203871 100644
--- a/tensorflow/core/distributed_runtime/rpc/grpc_tensorflow_server.cc
+++ b/tensorflow/core/distributed_runtime/rpc/grpc_tensorflow_server.cc
@@ -16,9 +16,9 @@ limitations under the License.
 #include <iostream>
 #include <vector>
 
-#include "grpc++/grpc++.h"
-#include "grpc++/security/credentials.h"
-#include "grpc++/server_builder.h"
+#include "grpcpp/grpcpp.h"
+#include "grpcpp/security/credentials.h"
+#include "grpcpp/server_builder.h"
 
 #include "tensorflow/core/distributed_runtime/server_lib.h"
 
diff --git a/tensorflow/core/distributed_runtime/rpc/grpc_testlib.cc b/tensorflow/core/distributed_runtime/rpc/grpc_testlib.cc
index 89f83f9f24d570d96704ea0b2d09da13147b1d6c..a8508d2d4f377eb2c9a9a8b2341be9031565c7cf 100644
--- a/tensorflow/core/distributed_runtime/rpc/grpc_testlib.cc
+++ b/tensorflow/core/distributed_runtime/rpc/grpc_testlib.cc
@@ -17,6 +17,7 @@ limitations under the License.
 
 #include "tensorflow/core/distributed_runtime/rpc/grpc_session.h"
 #include "tensorflow/core/lib/strings/str_util.h"
+#include "tensorflow/core/platform/env.h"
 #include "tensorflow/core/util/device_name_utils.h"
 
 namespace tensorflow {
@@ -50,9 +51,14 @@ Status TestCluster::MakeTestCluster(const SessionOptions& options, int n,
   }
 
   for (int i = 0; i < n; ++i) {
+    string server_file =
+        strings::StrCat(testing::TensorFlowSrcRoot(),
+                        "/core/distributed_runtime/rpc/grpc_testlib_server");
+    if (!options.env->FileExists(server_file).ok()) {
+      return errors::Internal("Could not find grpc_testlib_server");
+    }
     const std::vector<string> argv(
-        {strings::StrCat(testing::TensorFlowSrcRoot(),
-                         "/core/distributed_runtime/rpc/grpc_testlib_server"),
+        {server_file,
          /* see grpc_testlib_server.cc for flags */
          tf_jobs, "--tf_job=localhost", strings::StrCat("--tf_task=", i),
          strings::StrCat("--num_cpus=", num_cpus),
diff --git a/tensorflow/core/distributed_runtime/rpc/grpc_testlib_server.cc b/tensorflow/core/distributed_runtime/rpc/grpc_testlib_server.cc
index e718db251c3e788ad6480a2e8487b86bb6c48793..33cbadda0a18edf02eca6de8180c7cae2c835b2f 100644
--- a/tensorflow/core/distributed_runtime/rpc/grpc_testlib_server.cc
+++ b/tensorflow/core/distributed_runtime/rpc/grpc_testlib_server.cc
@@ -15,9 +15,9 @@ limitations under the License.
 
 #include <vector>
 
-#include "grpc++/grpc++.h"
-#include "grpc++/security/credentials.h"
-#include "grpc++/server_builder.h"
+#include "grpcpp/grpcpp.h"
+#include "grpcpp/security/credentials.h"
+#include "grpcpp/server_builder.h"
 
 #include "tensorflow/core/distributed_runtime/server_lib.h"
 
diff --git a/tensorflow/core/distributed_runtime/rpc/grpc_util.h b/tensorflow/core/distributed_runtime/rpc/grpc_util.h
index 4b58781b543981c0176cdec1296f40b770829e44..45259aa2ece9698d7ffb5a850b716de442f7497f 100644
--- a/tensorflow/core/distributed_runtime/rpc/grpc_util.h
+++ b/tensorflow/core/distributed_runtime/rpc/grpc_util.h
@@ -18,9 +18,9 @@ limitations under the License.
 
 #include <memory>
 
-#include "grpc++/grpc++.h"
-#include "grpc++/impl/codegen/proto_utils.h"
-#include "grpc++/support/byte_buffer.h"
+#include "grpcpp/grpcpp.h"
+#include "grpcpp/impl/codegen/proto_utils.h"
+#include "grpcpp/support/byte_buffer.h"
 #include "tensorflow/core/distributed_runtime/tensor_coding.h"
 #include "tensorflow/core/lib/core/status.h"
 #include "tensorflow/core/lib/strings/stringprintf.h"
diff --git a/tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc b/tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc
index 2e7b11196383e80005a774320a90cb6ae05bab51..61f536961783dbcf95ccc61a8d52765158f006d5 100644
--- a/tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc
+++ b/tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc
@@ -17,8 +17,8 @@ limitations under the License.
 
 #include <deque>
 
-#include "grpc++/alarm.h"
-#include "grpc++/server_builder.h"
+#include "grpcpp/alarm.h"
+#include "grpcpp/server_builder.h"
 
 #include "tensorflow/core/common_runtime/buf_rendezvous.h"
 #include "tensorflow/core/common_runtime/device.h"
@@ -513,8 +513,8 @@ void GrpcWorker::RecvBufAsync(CallOptions* opts, const RecvBufRequest* request,
   CollectiveRemoteAccess* rma = ce_handle.get()->remote_access();
   rma->buf_rendezvous()->ConsumeBuf(
       request->buf_rendezvous_key(),
-      [this, opts, request, response, done](const Status& status,
-                                            BufRendezvous::Hook* hook) {
+      [this, request, response, done](const Status& status,
+                                      BufRendezvous::Hook* hook) {
         Status s = status;
         if (s.ok()) {
           if (!DMAHelper::CanUseDMA(hook->prod_value)) {
diff --git a/tensorflow/core/distributed_runtime/rpc/grpc_worker_service_impl.cc b/tensorflow/core/distributed_runtime/rpc/grpc_worker_service_impl.cc
index 38cc2b81d30e5dd70b16692995eb35c045c8229e..72b5e77f1c12c3794332a13fccca6098f0419c58 100644
--- a/tensorflow/core/distributed_runtime/rpc/grpc_worker_service_impl.cc
+++ b/tensorflow/core/distributed_runtime/rpc/grpc_worker_service_impl.cc
@@ -15,14 +15,14 @@ limitations under the License.
 
 #include "tensorflow/core/distributed_runtime/rpc/grpc_worker_service_impl.h"
 
-#include "grpc++/impl/codegen/async_stream.h"
-#include "grpc++/impl/codegen/async_unary_call.h"
-#include "grpc++/impl/codegen/channel_interface.h"
-#include "grpc++/impl/codegen/client_unary_call.h"
-#include "grpc++/impl/codegen/method_handler_impl.h"
-#include "grpc++/impl/codegen/rpc_service_method.h"
-#include "grpc++/impl/codegen/service_type.h"
-#include "grpc++/impl/codegen/sync_stream.h"
+#include "grpcpp/impl/codegen/async_stream.h"
+#include "grpcpp/impl/codegen/async_unary_call.h"
+#include "grpcpp/impl/codegen/channel_interface.h"
+#include "grpcpp/impl/codegen/client_unary_call.h"
+#include "grpcpp/impl/codegen/method_handler_impl.h"
+#include "grpcpp/impl/codegen/rpc_service_method.h"
+#include "grpcpp/impl/codegen/service_type.h"
+#include "grpcpp/impl/codegen/sync_stream.h"
 
 namespace tensorflow {
 
diff --git a/tensorflow/core/distributed_runtime/rpc/grpc_worker_service_impl.h b/tensorflow/core/distributed_runtime/rpc/grpc_worker_service_impl.h
index da270835bd1ab82fd79787378168bc36d4dc9da2..7915c3aafd8a97de2830962d2851b247e7d4db4a 100644
--- a/tensorflow/core/distributed_runtime/rpc/grpc_worker_service_impl.h
+++ b/tensorflow/core/distributed_runtime/rpc/grpc_worker_service_impl.h
@@ -16,15 +16,15 @@ limitations under the License.
 #ifndef TENSORFLOW_CORE_DISTRIBUTED_RUNTIME_RPC_GRPC_WORKER_SERVICE_IMPL_H_
 #define TENSORFLOW_CORE_DISTRIBUTED_RUNTIME_RPC_GRPC_WORKER_SERVICE_IMPL_H_
 
-#include "grpc++/impl/codegen/async_stream.h"
-#include "grpc++/impl/codegen/async_unary_call.h"
-#include "grpc++/impl/codegen/proto_utils.h"
-#include "grpc++/impl/codegen/rpc_method.h"
-#include "grpc++/impl/codegen/service_type.h"
-#include "grpc++/impl/codegen/status.h"
-#include "grpc++/impl/codegen/stub_options.h"
-#include "grpc++/impl/codegen/sync_stream.h"
-#include "grpc++/support/byte_buffer.h"
+#include "grpcpp/impl/codegen/async_stream.h"
+#include "grpcpp/impl/codegen/async_unary_call.h"
+#include "grpcpp/impl/codegen/proto_utils.h"
+#include "grpcpp/impl/codegen/rpc_method.h"
+#include "grpcpp/impl/codegen/service_type.h"
+#include "grpcpp/impl/codegen/status.h"
+#include "grpcpp/impl/codegen/stub_options.h"
+#include "grpcpp/impl/codegen/sync_stream.h"
+#include "grpcpp/support/byte_buffer.h"
 
 #include "tensorflow/core/distributed_runtime/rpc/grpc_util.h"
 #include "tensorflow/core/distributed_runtime/tensor_coding.h"
diff --git a/tensorflow/core/distributed_runtime/rpc_collective_executor_mgr.cc b/tensorflow/core/distributed_runtime/rpc_collective_executor_mgr.cc
new file mode 100644
index 0000000000000000000000000000000000000000..5eeed6e382a9e90a96d0f96aee1703f873fb513d
--- /dev/null
+++ b/tensorflow/core/distributed_runtime/rpc_collective_executor_mgr.cc
@@ -0,0 +1,142 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include "tensorflow/core/distributed_runtime/rpc_collective_executor_mgr.h"
+
+#include "tensorflow/core/common_runtime/base_collective_executor.h"
+#include "tensorflow/core/common_runtime/collective_executor_mgr.h"
+#include "tensorflow/core/common_runtime/collective_rma_local.h"
+#include "tensorflow/core/distributed_runtime/collective_param_resolver_distributed.h"
+#include "tensorflow/core/distributed_runtime/collective_rma_distributed.h"
+#include "tensorflow/core/distributed_runtime/device_resolver_distributed.h"
+#include "tensorflow/core/distributed_runtime/worker_cache.h"
+#include "tensorflow/core/lib/random/random.h"
+
+namespace tensorflow {
+
+RpcCollectiveExecutorMgr::RpcCollectiveExecutorMgr(
+    const ConfigProto& config, const DeviceMgr* dev_mgr,
+    std::unique_ptr<DeviceResolverDistributed> dev_resolver,
+    std::unique_ptr<CollectiveParamResolverDistributed> param_resolver,
+    WorkerCacheInterface* worker_cache, const string& task_name)
+    : CollectiveExecutorMgr(config, dev_mgr, std::move(dev_resolver),
+                            std::move(param_resolver)),
+      worker_cache_(worker_cache),
+      task_name_(task_name) {
+  group_leader_ = (task_name == config.experimental().collective_group_leader())
+                      ? ""
+                      : config.experimental().collective_group_leader();
+}
+
+RpcCollectiveExecutorMgr::~RpcCollectiveExecutorMgr() {
+  for (auto it : sequence_table_) {
+    delete it.second;
+  }
+}
+
+CollectiveExecutor* RpcCollectiveExecutorMgr::Create(int64 step_id) {
+  CollectiveRemoteAccessDistributed* rma =
+      new CollectiveRemoteAccessDistributed(dev_mgr_, dev_resolver_.get(),
+                                            worker_cache_, step_id);
+  return new BaseCollectiveExecutor(this, rma, step_id, dev_mgr_);
+}
+
+namespace {
+// StepId must leave the most-significant 7 bits empty for future use.
+static const int64 kStepIdMask = (((1uLL << 56) - 1) | (1uLL << 56));
+
+int64 NewRandomStepId() {
+  int64 step_id = random::New64();
+  // Leave MS 8 bits clear for future use.
+  step_id &= kStepIdMask;
+  return step_id;
+}
+}  // namespace
+
+void RpcCollectiveExecutorMgr::RefreshStepIdSequenceAsync(
+    int64 graph_key, const StatusCallback& done) {
+  if (group_leader_.empty()) {
+    mutex_lock l(sequence_mu_);
+    GraphKeySequence* gks = nullptr;
+    auto it = sequence_table_.find(graph_key);
+    if (it == sequence_table_.end()) {
+      gks = new GraphKeySequence(graph_key);
+      sequence_table_[graph_key] = gks;
+    } else {
+      gks = it->second;
+    }
+    gks->next_step_id_ = NewRandomStepId();
+    done(Status::OK());
+  } else {
+    WorkerInterface* wi = worker_cache_->CreateWorker(group_leader_);
+    GetStepSequenceRequest* req = new GetStepSequenceRequest;
+    GetStepSequenceResponse* resp = new GetStepSequenceResponse;
+    req->add_graph_key(graph_key);
+    wi->GetStepSequenceAsync(
+        req, resp, [this, req, resp, done](const Status& s) {
+          if (!s.ok()) {
+            LOG(ERROR) << "Bad response [" << s
+                       << "] from GetStepSequenceAsync call to "
+                       << group_leader_;
+            done(s);
+          } else {
+            done(UpdateStepSequences(*resp));
+          }
+          delete req;
+          delete resp;
+        });
+  }
+}
+
+Status RpcCollectiveExecutorMgr::UpdateStepSequences(
+    const GetStepSequenceResponse& resp) {
+  mutex_lock l(sequence_mu_);
+  for (const StepSequence& ss : resp.step_sequence()) {
+    GraphKeySequence* gks = nullptr;
+    auto it = sequence_table_.find(ss.graph_key());
+    if (it == sequence_table_.end()) {
+      gks = new GraphKeySequence(ss.graph_key());
+      sequence_table_[ss.graph_key()] = gks;
+    } else {
+      gks = it->second;
+    }
+    gks->next_step_id_ = ss.next_step_id();
+  }
+  return Status::OK();
+}
+
+int64 RpcCollectiveExecutorMgr::NextStepId(int64 graph_key) {
+  mutex_lock l(sequence_mu_);
+  auto it = sequence_table_.find(graph_key);
+  if (it != sequence_table_.end()) {
+    return it->second->next_step_id_;
+  }
+  return CollectiveExecutor::kInvalidId;
+}
+
+void RpcCollectiveExecutorMgr::RetireStepId(int64 graph_key, int64 step_id) {
+  mutex_lock l(sequence_mu_);
+  auto it = sequence_table_.find(graph_key);
+  if (it != sequence_table_.end()) {
+    if (step_id == it->second->next_step_id_) {
+      it->second->next_step_id_ = (it->second->next_step_id_ + 1) & kStepIdMask;
+    } else {
+      it->second->next_step_id_ = CollectiveExecutor::kInvalidId;
+    }
+  } else {
+    LOG(ERROR) << "Failed to find graph_key " << graph_key << " to retire.";
+  }
+}
+
+}  // namespace tensorflow
diff --git a/tensorflow/core/distributed_runtime/rpc_collective_executor_mgr.h b/tensorflow/core/distributed_runtime/rpc_collective_executor_mgr.h
new file mode 100644
index 0000000000000000000000000000000000000000..e9f3f0ebe894d7d166736174a504eca47f0057b0
--- /dev/null
+++ b/tensorflow/core/distributed_runtime/rpc_collective_executor_mgr.h
@@ -0,0 +1,79 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#ifndef TENSORFLOW_CORE_DISTRIBUTED_RUNTIME_RPC_COLLECTIVE_EXECUTOR_MGR_H_
+#define TENSORFLOW_CORE_DISTRIBUTED_RUNTIME_RPC_COLLECTIVE_EXECUTOR_MGR_H_
+
+#include "tensorflow/core/common_runtime/collective_executor_mgr.h"
+#include "tensorflow/core/framework/collective.h"
+
+namespace tensorflow {
+class CollectiveParamResolverDistributed;
+class ConfigProto;
+class DeviceMgr;
+class DeviceResolverDistributed;
+class WorkerCacheInterface;
+class StepSequenceRequest;
+class StepSequenceResponse;
+
+// An implementation of CollectiveExecutorMgr for a distributed environment
+// that uses WorkerInterface::RecvBufAsync to route data transfers over RPCs.
+//
+// In some execution environments it may be possible to implement a
+// higher-performance solution and use it in place of this class.
+class RpcCollectiveExecutorMgr : public CollectiveExecutorMgr {
+ public:
+  RpcCollectiveExecutorMgr(
+      const ConfigProto& config, const DeviceMgr* dev_mgr,
+      std::unique_ptr<DeviceResolverDistributed> dev_resolver,
+      std::unique_ptr<CollectiveParamResolverDistributed> param_resolver,
+      WorkerCacheInterface* worker_cache, const string& task_name);
+
+  virtual ~RpcCollectiveExecutorMgr();
+
+  void RefreshStepIdSequenceAsync(int64 graph_key,
+                                  const StatusCallback& done) override;
+
+  int64 NextStepId(int64 graph_key) override;
+
+  void RetireStepId(int64 graph_key, int64 step_id) override;
+
+ protected:
+  CollectiveExecutor* Create(int64 step_id) override;
+
+  WorkerCacheInterface* const worker_cache_;  // Not owned.
+  const string task_name_;
+  string group_leader_;
+  friend class RpcCollectiveExecutorMgrTest;
+
+ private:
+  Status UpdateStepSequences(const GetStepSequenceResponse& resp);
+
+  // This class maintains the step_id sequencing for a single
+  // collective_graph_key.
+  struct GraphKeySequence {
+    explicit GraphKeySequence(int64 k)
+        : graph_key_(k), next_step_id_(CollectiveExecutor::kInvalidId) {}
+
+    const int64 graph_key_;
+    int64 next_step_id_;
+  };
+
+  mutex sequence_mu_;
+  gtl::FlatMap<int64, GraphKeySequence*> sequence_table_
+      GUARDED_BY(sequence_mu_);
+};
+
+}  // namespace tensorflow
+#endif  // TENSORFLOW_CORE_DISTRIBUTED_RUNTIME_RPC_COLLECTIVE_EXECUTOR_MGR_H_
diff --git a/tensorflow/core/distributed_runtime/rpc_collective_executor_mgr_test.cc b/tensorflow/core/distributed_runtime/rpc_collective_executor_mgr_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..37b83d82bee8ab51244d6e8072f9a5e8cf9acbee
--- /dev/null
+++ b/tensorflow/core/distributed_runtime/rpc_collective_executor_mgr_test.cc
@@ -0,0 +1,124 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <stdlib.h>
+#include <string>
+#include <vector>
+
+#include "tensorflow/core/common_runtime/device.h"
+#include "tensorflow/core/common_runtime/device_factory.h"
+#include "tensorflow/core/common_runtime/device_mgr.h"
+#include "tensorflow/core/distributed_runtime/collective_param_resolver_distributed.h"
+#include "tensorflow/core/distributed_runtime/device_resolver_distributed.h"
+#include "tensorflow/core/distributed_runtime/rpc_collective_executor_mgr.h"
+#include "tensorflow/core/framework/collective.h"
+#include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/lib/core/status_test_util.h"
+#include "tensorflow/core/platform/test.h"
+#include "tensorflow/core/public/session_options.h"
+
+namespace tensorflow {
+#define NUM_DEVS 3
+
+class RpcCollectiveExecutorMgrTest : public ::testing::Test {
+ protected:
+  RpcCollectiveExecutorMgrTest() {
+    string task_name = "/job:localhost/replica:0/task:0";
+    SessionOptions options;
+    options.config.mutable_experimental()->set_collective_group_leader(
+        task_name);
+    WorkerCacheInterface* worker_cache = nullptr;
+    auto* device_count = options.config.mutable_device_count();
+    device_count->insert({"CPU", NUM_DEVS});
+    TF_CHECK_OK(DeviceFactory::AddDevices(options, task_name, &devices_));
+    device_mgr_.reset(new DeviceMgr(devices_));
+    std::unique_ptr<DeviceResolverDistributed> dr(new DeviceResolverDistributed(
+        device_mgr_.get(), worker_cache, task_name));
+    std::unique_ptr<CollectiveParamResolverDistributed> cpr(
+        new CollectiveParamResolverDistributed(options.config,
+                                               device_mgr_.get(), dr.get(),
+                                               worker_cache, task_name));
+    // This CME is the group leader.
+    cme_.reset(new RpcCollectiveExecutorMgr(options.config, device_mgr_.get(),
+                                            std::move(dr), std::move(cpr),
+                                            worker_cache, task_name));
+  }
+
+  std::unique_ptr<RpcCollectiveExecutorMgr> cme_;
+  std::vector<Device*> devices_;
+  std::unique_ptr<DeviceMgr> device_mgr_;
+};
+
+TEST_F(RpcCollectiveExecutorMgrTest, FindOrCreate) {
+  CollectiveExecutor::Handle* h =
+      new CollectiveExecutor::Handle(cme_->FindOrCreate(1), true);
+  EXPECT_TRUE(h->get());
+  CollectiveExecutor::Handle* h2 =
+      new CollectiveExecutor::Handle(cme_->FindOrCreate(1), true);
+  EXPECT_EQ(h->get(), h2->get());
+  CollectiveExecutor* ce = h->get();
+  delete h;
+  delete h2;
+  CollectiveExecutor* ce2 = cme_->FindOrCreate(1);
+  EXPECT_EQ(ce, ce2);
+  ce2->Unref();
+  cme_->Cleanup(1);
+}
+
+TEST_F(RpcCollectiveExecutorMgrTest, NextStepId) {
+  int64 x = cme_->NextStepId(7);
+  EXPECT_EQ(x, CollectiveExecutor::kInvalidId);
+  // Calling Refresh should generate a valid id.
+  {
+    Notification note;
+    Status status;
+    cme_->RefreshStepIdSequenceAsync(7,
+                                     [this, &status, &note](const Status& s) {
+                                       status = s;
+                                       note.Notify();
+                                     });
+    EXPECT_TRUE(status.ok());
+  }
+  x = cme_->NextStepId(7);
+  EXPECT_NE(x, CollectiveExecutor::kInvalidId);
+  // Should keep returning same number.
+  EXPECT_EQ(x, cme_->NextStepId(7));
+  EXPECT_EQ(x, cme_->NextStepId(7));
+  // Retire on a different graph_key should have no effect.
+  cme_->RetireStepId(6, x);
+  EXPECT_EQ(x, cme_->NextStepId(7));
+  // Retire on same graph_key should advance.
+  cme_->RetireStepId(7, x);
+  int64 y = cme_->NextStepId(7);
+  EXPECT_EQ((x + 1) & (((1uLL << 56) - 1) | (1uLL << 56)), y);
+  // Calling refresh should jump to a different point in the random space.
+  {
+    Notification note;
+    Status status;
+    cme_->RefreshStepIdSequenceAsync(7,
+                                     [this, &status, &note](const Status& s) {
+                                       status = s;
+                                       note.Notify();
+                                     });
+
+    note.WaitForNotification();
+    EXPECT_TRUE(status.ok());
+  }
+  int64 z = cme_->NextStepId(7);
+  // z should not be equal to or a successor of y.
+  EXPECT_NE(y, z);
+  EXPECT_GT(llabs(y - z), 3);
+}
+
+}  // namespace tensorflow
diff --git a/tensorflow/core/distributed_runtime/worker.cc b/tensorflow/core/distributed_runtime/worker.cc
index 4e6500fbc6baff76228b5b2e8f4445ad970acf54..1ea19c48f09170e6044eb9c72b5090dfc2feb703 100644
--- a/tensorflow/core/distributed_runtime/worker.cc
+++ b/tensorflow/core/distributed_runtime/worker.cc
@@ -18,6 +18,7 @@ limitations under the License.
 #include "tensorflow/core/common_runtime/collective_executor_mgr.h"
 #include "tensorflow/core/common_runtime/device_mgr.h"
 #include "tensorflow/core/common_runtime/process_util.h"
+#include "tensorflow/core/common_runtime/scoped_allocator_mgr.h"
 #include "tensorflow/core/common_runtime/step_stats_collector.h"
 #include "tensorflow/core/distributed_runtime/rendezvous_mgr_interface.h"
 #include "tensorflow/core/distributed_runtime/tensor_coding.h"
@@ -72,7 +73,8 @@ void Worker::RegisterGraphAsync(const RegisterGraphRequest* request,
     s = session->graph_mgr->Register(
         request->session_handle(), request->graph_def(),
         request->graph_options(), request->debug_options(),
-        session->cluster_flr.get(), response->mutable_graph_handle());
+        request->collective_graph_key(), session->cluster_flr.get(),
+        response->mutable_graph_handle());
   }
   done(s);
 }
@@ -315,6 +317,12 @@ void Worker::CleanupGraphAsync(const CleanupGraphRequest* request,
   if (env_->collective_executor_mgr) {
     env_->collective_executor_mgr->Cleanup(step_id);
   }
+  for (Device* d : env_->local_devices) {
+    ScopedAllocatorMgr* sam = d->GetScopedAllocatorMgr();
+    if (sam) {
+      sam->Cleanup(step_id);
+    }
+  }
   done(Status::OK());
 }
 
diff --git a/tensorflow/core/framework/common_shape_fns.cc b/tensorflow/core/framework/common_shape_fns.cc
index 71a31b0e75471c0965c46ac68f1e0728bfc35261..6da0da14f047e1f2bd5ec13f9967b22be2d8f7d6 100644
--- a/tensorflow/core/framework/common_shape_fns.cc
+++ b/tensorflow/core/framework/common_shape_fns.cc
@@ -40,8 +40,8 @@ Status GetWindowedOutputSizeVerboseV2(int64 input_size, int64 filter_size,
     case Padding::SAME:
       *output_size = (input_size + stride - 1) / stride;
       const int64 padding_needed =
-          std::max(0LL, (*output_size - 1) * stride + effective_filter_size -
-                            input_size);
+          std::max(int64{0}, (*output_size - 1) * stride +
+                                 effective_filter_size - input_size);
       // For odd values of total padding, add more padding at the 'right'
       // side of the given dimension.
       *padding_before = padding_needed / 2;
@@ -303,6 +303,9 @@ Status MakeShapeFromFormat(TensorFormat format, DimensionOrConstant N,
   if (format == FORMAT_NCHW_VECT_C) {
     dims_actual[GetTensorInnerFeatureDimIndex(num_dims, format)] =
         context->MakeDim(4);
+  } else if (format == FORMAT_NHWC_VECT_W) {
+    dims_actual[GetTensorInnerWidthDimIndex(num_dims, format)] =
+        context->MakeDim(4);
   }
   for (int spatial_dim = 0; spatial_dim < spatial.size(); spatial_dim++) {
     dims_actual[GetTensorSpatialDimIndex(num_dims, format, spatial_dim)] =
diff --git a/tensorflow/core/framework/cost_graph.proto b/tensorflow/core/framework/cost_graph.proto
index 19d765cd32e05a356e68dfc65c0765f4f846dbf3..cc6bc84d6964613a238fb398c944f218b1c0f50e 100644
--- a/tensorflow/core/framework/cost_graph.proto
+++ b/tensorflow/core/framework/cost_graph.proto
@@ -69,6 +69,9 @@ message CostGraphDef {
 
     // Ids of the control inputs for this node.
     repeated int32 control_input = 8;
+
+    // Are the costs inaccurate?
+    bool inaccurate = 17;
   }
   repeated Node node = 1;
 }
diff --git a/tensorflow/core/framework/dataset.h b/tensorflow/core/framework/dataset.h
index 8624af9bf56e7ef4c4ff1223e7d70394bc1b180d..d8618f391e3fc271e278f40d5274040e0d791c41 100644
--- a/tensorflow/core/framework/dataset.h
+++ b/tensorflow/core/framework/dataset.h
@@ -351,6 +351,10 @@ class IteratorBase {
   // in the outputs of this iterator.
   virtual const std::vector<PartialTensorShape>& output_shapes() const = 0;
 
+  // Performs initialization that needs to happen outside of a constructor to
+  // properly propagate errors.
+  virtual Status Initialize(IteratorContext* ctx) { return Status::OK(); }
+
   // Saves the state of this iterator.
   virtual Status Save(OpKernelContext* ctx, IteratorStateWriter* writer) {
     return SaveInternal(writer);
@@ -402,12 +406,13 @@ class DatasetBase : public core::RefCounted {
   // iterator will traverse all elements in this dataset from the
   // start.
   //
-  // Ownership of the created iterator will be transferred to the caller.
-  //
   // The prefix identifies the sequence of iterators leading up to the newly
   // created iterator.
-  virtual std::unique_ptr<IteratorBase> MakeIterator(
-      const string& prefix) const = 0;
+  Status MakeIterator(IteratorContext* ctx, const string& prefix,
+                      std::unique_ptr<IteratorBase>* iterator) const {
+    *iterator = MakeIteratorInternal(prefix);
+    return (*iterator)->Initialize(ctx);
+  }
 
   // Returns a vector of DataType values, representing the respective
   // element types of each tuple component in the outputs of this
@@ -420,7 +425,7 @@ class DatasetBase : public core::RefCounted {
   virtual const std::vector<PartialTensorShape>& output_shapes() const = 0;
 
   // A human-readable debug string for this dataset.
-  virtual string DebugString() = 0;
+  virtual string DebugString() const = 0;
 
   // Serializes the dataset and writes it to the `writer`.
   virtual Status Save(OpKernelContext* ctx, IteratorStateWriter* writer) const {
@@ -451,6 +456,11 @@ class DatasetBase : public core::RefCounted {
                                     Node** node) const {
     return errors::Unimplemented("AsGraphDefInternal");
   }
+
+  virtual std::unique_ptr<IteratorBase> MakeIteratorInternal(
+      const string& prefix) const = 0;
+
+  friend class DatasetToGraphOp;  // For access to graph related members.
 };
 
 // Base-class for datasets that are built by ops.
@@ -576,6 +586,23 @@ class DatasetOpKernel : public OpKernel {
     *output = argument_t->scalar<T>()();
     return Status::OK();
   }
+
+  template <typename T>
+  Status ParseVectorArgument(OpKernelContext* ctx,
+                             const StringPiece& argument_name,
+                             std::vector<T>* output) {
+    const Tensor* argument_t;
+    TF_RETURN_IF_ERROR(ctx->input(argument_name, &argument_t));
+    if (!TensorShapeUtils::IsVector(argument_t->shape())) {
+      return errors::InvalidArgument(argument_name, " must be a vector");
+    }
+    int size = argument_t->vec<T>().size();
+    output->reserve(size);
+    for (int i = 0; i < size; ++i) {
+      output->push_back(argument_t->vec<T>()(i));
+    }
+    return Status::OK();
+  }
 };
 
 // Encapsulates the work required to plug unary Datasets into the core
diff --git a/tensorflow/core/framework/device_base.h b/tensorflow/core/framework/device_base.h
index 223b74857d04593b27f7b66efb0cadbad7e3cf4c..b59ced869d5b72dbf5e8b322a249c939188bbae9 100644
--- a/tensorflow/core/framework/device_base.h
+++ b/tensorflow/core/framework/device_base.h
@@ -169,13 +169,10 @@ class DeviceBase {
     return nullptr;
   }
 
-  // Return the Allocator implementation to use based on the allocator
-  // attributes requested and the supplied resource manager. By
-  // default this ignores the resource manager and calls the base
-  // implementation but devices can override if they want to consult
-  // the resource manager when choosing the allocator.
-  virtual Allocator* GetStepAllocator(AllocatorAttributes attr,
-                                      ResourceMgr* /*step_resource_manager*/) {
+  // DEPRECATED: Use `this->GetAllocator()` or `this->GetScopedAllocator()`.
+  // This method is provided for backwards compatibility, and will be removed
+  // in a future release.
+  Allocator* GetStepAllocator(AllocatorAttributes attr, ResourceMgr*) {
     return GetAllocator(attr);
   }
 
@@ -189,6 +186,10 @@ class DeviceBase {
 
   virtual ScopedAllocatorMgr* GetScopedAllocatorMgr() const { return nullptr; }
 
+  const bool has_eigen_cpu_device() const {
+    return (eigen_cpu_device_ != nullptr);
+  }
+
   virtual const Eigen::ThreadPoolDevice* eigen_cpu_device() {
     CHECK(eigen_cpu_device_ != nullptr);
     return eigen_cpu_device_;
diff --git a/tensorflow/core/framework/function.cc b/tensorflow/core/framework/function.cc
index 647c66099cfa99e0d7f83782f9ba77a8fb021b1c..88d9d65f5ad22332c8876a5566884f7831cf3997 100644
--- a/tensorflow/core/framework/function.cc
+++ b/tensorflow/core/framework/function.cc
@@ -815,6 +815,10 @@ string Canonicalize(const string& funcname, AttrSlice attrs,
     entries.push_back(
         strings::StrCat("_state_handle", "=", options.state_handle));
   }
+  if (!options.executor_type.empty()) {
+    entries.push_back(
+        strings::StrCat("_executor_type", "=", options.executor_type));
+  }
   std::sort(entries.begin(), entries.end());
   return strings::StrCat(funcname, "[", str_util::Join(entries, ","), "]");
 }
diff --git a/tensorflow/core/framework/function.h b/tensorflow/core/framework/function.h
index e00399f97de42ca6c683202fdec9142310fa6e2d..8e607b927c6dc6818fbc6cb772413692b7c4c450 100644
--- a/tensorflow/core/framework/function.h
+++ b/tensorflow/core/framework/function.h
@@ -33,6 +33,7 @@ limitations under the License.
 namespace tensorflow {
 
 class CancellationManager;
+class CollectiveExecutor;
 class GraphDef;
 class OpKernel;
 class ProcessFunctionLibraryRuntime;
@@ -449,6 +450,12 @@ class FunctionLibraryRuntime {
     // state (in stateful kernels); and two functions with different
     // values for `state_handle` will have independent state.
     string state_handle;
+
+    // This interface is EXPERIMENTAL and subject to change.
+    //
+    // Instatiates the function using an executor of the given type. If empty,
+    // the default TensorFlow executor will be used.
+    string executor_type;
   };
   typedef uint64 Handle;
   virtual Status Instantiate(const string& function_name, AttrSlice attrs,
@@ -484,6 +491,7 @@ class FunctionLibraryRuntime {
     int64 step_id = 0;
     Rendezvous* rendezvous = nullptr;
     CancellationManager* cancellation_manager = nullptr;
+    CollectiveExecutor* collective_executor = nullptr;
     ScopedStepContainer* step_container = nullptr;
     StepStatsCollector* stats_collector = nullptr;
 
diff --git a/tensorflow/core/framework/op.cc b/tensorflow/core/framework/op.cc
index 0873d4e47bd406e3234d8f18724b45a4e3aee4b1..b8309eafb05251235bfaaa7b5489cac06f0024dc 100644
--- a/tensorflow/core/framework/op.cc
+++ b/tensorflow/core/framework/op.cc
@@ -97,7 +97,7 @@ Status OpRegistry::LookUp(const string& op_type_name,
         "Make sure the Op and Kernel are registered in the "
         "binary running in this process. Note that if you "
         "are loading a saved graph which used ops from "
-        "tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done"
+        "tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done "
         "before importing the graph, as contrib ops are lazily registered "
         "when the module is first accessed.");
     VLOG(1) << status.ToString();
@@ -256,7 +256,7 @@ Status OpListOpRegistry::LookUp(const string& op_type_name,
         "Make sure the Op and Kernel are registered in the "
         "binary running in this process. Note that if you "
         "are loading a saved graph which used ops from "
-        "tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done"
+        "tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done "
         "before importing the graph, as contrib ops are lazily registered "
         "when the module is first accessed.");
   }
diff --git a/tensorflow/core/framework/op_kernel.cc b/tensorflow/core/framework/op_kernel.cc
index c71bcb26abc025e1ab5b990403dde5396f1ef508..a0f449d64f3d2d1bb462d8c41af699829a882558 100644
--- a/tensorflow/core/framework/op_kernel.cc
+++ b/tensorflow/core/framework/op_kernel.cc
@@ -13,12 +13,14 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
+#define EIGEN_USE_THREADS
 #include "tensorflow/core/framework/op_kernel.h"
 
 #include <unordered_map>
 #include <utility>
 #include <vector>
 
+#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
 #include "tensorflow/core/framework/attr_value_util.h"
 #include "tensorflow/core/framework/device_attributes.pb.h"
 #include "tensorflow/core/framework/graph.pb_text.h"
@@ -40,6 +42,7 @@ limitations under the License.
 #include "tensorflow/core/platform/logging.h"
 #include "tensorflow/core/platform/mutex.h"
 #include "tensorflow/core/platform/types.h"
+#include "tensorflow/core/util/ptr_util.h"
 
 namespace tensorflow {
 
@@ -270,6 +273,19 @@ OpKernelContext::OpKernelContext(Params* params, int num_outputs)
   if (params_->record_tensor_accesses) {
     referenced_tensors_.Init();
   }
+  if (params->device->has_eigen_cpu_device()) {
+    int64 block_size = -1, output_size = -1, num_threads = 1;
+    const Eigen::ThreadPoolDevice* thread_pool =
+        params_->device->eigen_cpu_device();
+    AttrSlice attributes(op_kernel().def());
+    if (GetNodeAttr(attributes, "_block_size", &block_size) == Status::OK() &&
+        GetNodeAttr(attributes, "_output_size", &output_size) == Status::OK()) {
+      num_threads = std::min(Eigen::divup(output_size, block_size),
+                             static_cast<int64>(thread_pool->numThreads()));
+      eigen_cpu_device_ = MakeUnique<Eigen::ThreadPoolDevice>(
+          thread_pool->getPool(), num_threads);
+    }
+  }
 }
 
 OpKernelContext::~OpKernelContext() {
@@ -283,13 +299,13 @@ OpKernelContext::~OpKernelContext() {
 
 Allocator* OpKernelContext::get_allocator(AllocatorAttributes attr) {
   Allocator* allocator = nullptr;
-  if (attr.scope_id > 0) {
+  if (TF_PREDICT_FALSE(attr.scope_id > 0)) {
     allocator = params_->device->GetScopedAllocator(attr, step_id());
     CHECK(allocator);
   } else {
-    allocator = params_->device->GetStepAllocator(attr, resource_manager());
+    allocator = params_->device->GetAllocator(attr);
   }
-  if (track_allocations()) {
+  if (TF_PREDICT_FALSE(track_allocations())) {
     mutex_lock lock(mu_);
     for (const auto& wrapped : wrapped_allocators_) {
       if (wrapped.first == allocator) {
@@ -1120,6 +1136,16 @@ void LogAllRegisteredKernels() {
   }
 }
 
+std::vector<KernelDef> GetAllRegisteredKernels() {
+  const KernelRegistry* const typed_registry = GlobalKernelRegistryTyped();
+  std::vector<KernelDef> kernels;
+  kernels.reserve(typed_registry->size());
+  for (const auto& p : *typed_registry) {
+    kernels.emplace_back(p.second.def);
+  }
+  return kernels;
+}
+
 string KernelsRegisteredForOp(StringPiece op_name) {
   string ret;
   for (const auto& key_registration : *GlobalKernelRegistryTyped()) {
@@ -1273,59 +1299,51 @@ const Eigen::SyclDevice& OpKernelContext::eigen_device() const {
 }
 #endif
 
-namespace {
-template <class OpKernelT>
-void CtxFailureInternal(OpKernelT* op_kernel, const char* file, int line,
-                        const Status& s) {
-  const string logging_prefix =
-      file == nullptr ? "CtxFailure: "
-                      : strings::StrCat("CtxFailure at ", io::Basename(file),
-                                        ":", line, ": ");
-
-  if (errors::IsOutOfRange(s)) {
-    // VLOG OutOfRange errors. Dataset ops create OutOfRange errors when they
-    // reach end-of-sequence.
-    VLOG(1) << logging_prefix << s;
-  } else {
-    LOG(WARNING) << logging_prefix << s;
-  }
-  op_kernel->SetStatus(s);
-}
-}  // anonymous namespace
-
 void OpKernelConstruction::CtxFailure(const Status& s) {
-  CtxFailureInternal(this, nullptr, 0, s);
+  VLOG(1) << s;
+  SetStatus(s);
 }
 
 void OpKernelConstruction::CtxFailureWithWarning(const Status& s) {
-  CtxFailureInternal(this, nullptr, 0, s);
+  LOG(WARNING) << s;
+  SetStatus(s);
 }
 
 void OpKernelConstruction::CtxFailure(const char* file, int line,
                                       const Status& s) {
-  CtxFailureInternal(this, file, line, s);
+  VLOG(1) << "OP_REQUIRES failed at " << io::Basename(file) << ":" << line
+          << " : " << s;
+  SetStatus(s);
 }
 
 void OpKernelConstruction::CtxFailureWithWarning(const char* file, int line,
                                                  const Status& s) {
-  CtxFailureInternal(this, file, line, s);
+  LOG(WARNING) << "OP_REQUIRES failed at " << io::Basename(file) << ":" << line
+               << " : " << s;
+  SetStatus(s);
 }
 
 void OpKernelContext::CtxFailure(const Status& s) {
-  CtxFailureInternal(this, nullptr, 0, s);
+  VLOG(1) << s;
+  SetStatus(s);
 }
 
 void OpKernelContext::CtxFailureWithWarning(const Status& s) {
-  CtxFailureInternal(this, nullptr, 0, s);
+  LOG(WARNING) << s;
+  SetStatus(s);
 }
 
 void OpKernelContext::CtxFailure(const char* file, int line, const Status& s) {
-  CtxFailureInternal(this, file, line, s);
+  VLOG(1) << "OP_REQUIRES failed at " << io::Basename(file) << ":" << line
+          << " : " << s;
+  SetStatus(s);
 }
 
 void OpKernelContext::CtxFailureWithWarning(const char* file, int line,
                                             const Status& s) {
-  CtxFailureInternal(this, file, line, s);
+  LOG(WARNING) << "OP_REQUIRES failed at " << io::Basename(file) << ":" << line
+               << " : " << s;
+  SetStatus(s);
 }
 
 }  // namespace tensorflow
diff --git a/tensorflow/core/framework/op_kernel.h b/tensorflow/core/framework/op_kernel.h
index f577664709c064762d478b17cd09552418479d3f..d307078e63638a9ba77ee4f1b2c94e6256209329 100644
--- a/tensorflow/core/framework/op_kernel.h
+++ b/tensorflow/core/framework/op_kernel.h
@@ -13,8 +13,8 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#ifndef TENSORFLOW_FRAMEWORK_OP_KERNEL_H_
-#define TENSORFLOW_FRAMEWORK_OP_KERNEL_H_
+#ifndef TENSORFLOW_CORE_FRAMEWORK_OP_KERNEL_H_
+#define TENSORFLOW_CORE_FRAMEWORK_OP_KERNEL_H_
 
 #include <functional>
 
@@ -24,6 +24,7 @@ limitations under the License.
 #include "tensorflow/core/framework/cancellation.h"
 #include "tensorflow/core/framework/control_flow.h"
 #include "tensorflow/core/framework/device_base.h"
+#include "tensorflow/core/framework/kernel_def.pb.h"
 #include "tensorflow/core/framework/kernel_def_builder.h"
 #include "tensorflow/core/framework/node_def_util.h"
 #include "tensorflow/core/framework/op.h"  // TODO(b/62899350): Remove
@@ -1003,6 +1004,7 @@ class OpKernelContext {
   // OpKernels can use these eigen devices to carry out their
   // numerical computation.
   const Eigen::ThreadPoolDevice& eigen_cpu_device() const {
+    if (eigen_cpu_device_ != nullptr) return *eigen_cpu_device_;
     return *device()->eigen_cpu_device();
   }
   const Eigen::GpuDevice& eigen_gpu_device() const {
@@ -1138,6 +1140,7 @@ class OpKernelContext {
   mutable mutex mu_;  // mutable so const accessors can acquire the lock
   gtl::InlinedVector<WrappedAllocator, 4> wrapped_allocators_ GUARDED_BY(mu_);
   gtl::InlinedVector<TensorValue, 4> outputs_;
+  std::unique_ptr<Eigen::ThreadPoolDevice> eigen_cpu_device_;
 
   // Constructed only if <params->record_tensor_accesses>.
   ManualConstructor<UniqueTensorReferences> referenced_tensors_ GUARDED_BY(mu_);
@@ -1303,6 +1306,9 @@ Status FindKernelDef(const DeviceType& device_type, const NodeDef& node_def,
 // missing kernel errors.
 void LogAllRegisteredKernels();
 
+// Gets a vector of all registered kernels.
+std::vector<KernelDef> GetAllRegisteredKernels();
+
 namespace kernel_factory {
 
 class OpKernelRegistrar {
@@ -1572,4 +1578,4 @@ inline void OpOutputList::set_ref(int i, mutex* mu, Tensor* tensor_for_ref) {
 
 }  // namespace tensorflow
 
-#endif  // TENSORFLOW_FRAMEWORK_OP_KERNEL_H_
+#endif  // TENSORFLOW_CORE_FRAMEWORK_OP_KERNEL_H_
diff --git a/tensorflow/core/framework/op_kernel_test.cc b/tensorflow/core/framework/op_kernel_test.cc
index bcd409e5c54b7d63137dd9d236d21bb3ec7b4f56..50319ca5764f846cf915e21caecb1a68f9febefb 100644
--- a/tensorflow/core/framework/op_kernel_test.cc
+++ b/tensorflow/core/framework/op_kernel_test.cc
@@ -964,5 +964,27 @@ void BM_SelectInputRange(int iters) {
 BENCHMARK(BM_ConcatInputRange);
 BENCHMARK(BM_SelectInputRange);
 
+TEST(RegisteredKernels, CanCallGetAllRegisteredKernels) {
+  auto all_registered_kernels = GetAllRegisteredKernels();
+  auto has_name_test1 = [](const KernelDef& k) { return k.op() == "Test1"; };
+
+  // Verify we can find the "Test1" op registered above
+  auto test1_it = std::find_if(all_registered_kernels.begin(),
+                               all_registered_kernels.end(), has_name_test1);
+  ASSERT_NE(test1_it, all_registered_kernels.end());
+  EXPECT_EQ(test1_it->device_type(), "CPU");
+
+  // Verify there was just one kernel
+  ++test1_it;
+  EXPECT_EQ(
+      std::find_if(test1_it, all_registered_kernels.end(), has_name_test1),
+      all_registered_kernels.end());
+}
+
+// Simple test just to check we can call LogAllRegisteredKernels
+TEST(RegisteredKernels, CanLogAllRegisteredKernels) {
+  tensorflow::LogAllRegisteredKernels();
+}
+
 }  // namespace
 }  // namespace tensorflow
diff --git a/tensorflow/core/framework/resource_handle.cc b/tensorflow/core/framework/resource_handle.cc
index 39ef82765f5deda70d68c4ccab77ac7258a3a4e6..fc3a329b3b6852274b2359058f0a38d17f0be21c 100644
--- a/tensorflow/core/framework/resource_handle.cc
+++ b/tensorflow/core/framework/resource_handle.cc
@@ -66,4 +66,29 @@ string ProtoDebugString(const ResourceHandle& handle) {
   return handle.DebugString();
 }
 
+void EncodeResourceHandleList(const ResourceHandle* p, int64 n,
+                              std::unique_ptr<port::StringListEncoder> e) {
+  ResourceHandleProto proto;
+  for (int i = 0; i < n; ++i) {
+    p[i].AsProto(&proto);
+    e->Append(proto);
+  }
+  e->Finalize();
+}
+
+bool DecodeResourceHandleList(std::unique_ptr<port::StringListDecoder> d,
+                              ResourceHandle* ps, int64 n) {
+  std::vector<uint32> sizes(n);
+  if (!d->ReadSizes(&sizes)) return false;
+
+  ResourceHandleProto proto;
+  for (int i = 0; i < n; ++i) {
+    if (!proto.ParseFromArray(d->Data(sizes[i]), sizes[i])) {
+      return false;
+    }
+    ps[i].FromProto(proto);
+  }
+  return true;
+}
+
 }  // namespace tensorflow
diff --git a/tensorflow/core/framework/resource_handle.h b/tensorflow/core/framework/resource_handle.h
index 06df1b9046da2c99ef30b1a69763f4e18e404f94..db213669a3f30b3b5587a4d587e2bfb039dacdda 100644
--- a/tensorflow/core/framework/resource_handle.h
+++ b/tensorflow/core/framework/resource_handle.h
@@ -16,6 +16,7 @@ limitations under the License.
 #ifndef TENSORFLOW_FRAMEWORK_RESOURCE_HANDLE_H_
 #define TENSORFLOW_FRAMEWORK_RESOURCE_HANDLE_H_
 
+#include "tensorflow/core/platform/tensor_coding.h"
 #include "tensorflow/core/platform/types.h"
 
 namespace tensorflow {
@@ -77,6 +78,14 @@ class ResourceHandle {
 // For backwards compatibility for when this was a proto
 string ProtoDebugString(const ResourceHandle& handle);
 
+// Encodes a list of ResourceHandle protos in the given StringListEncoder.
+void EncodeResourceHandleList(const ResourceHandle* p, int64 n,
+                              std::unique_ptr<port::StringListEncoder> e);
+
+// Decodes a list of ResourceHandle protos from the given StringListDecoder.
+bool DecodeResourceHandleList(std::unique_ptr<port::StringListDecoder> d,
+                              ResourceHandle* ps, int64 n);
+
 }  // namespace tensorflow
 
 #endif  // TENSORFLOW_FRAMEWORK_RESOURCE_HANDLE_H_
diff --git a/tensorflow/core/framework/resource_mgr.cc b/tensorflow/core/framework/resource_mgr.cc
index 78574bc0b133fd63944eac17707fbf937b8af59a..21fc6c1bd5877333252e671e908b86aa8bfd6ae7 100644
--- a/tensorflow/core/framework/resource_mgr.cc
+++ b/tensorflow/core/framework/resource_mgr.cc
@@ -138,16 +138,13 @@ string ResourceMgr::DebugString() const {
 
 Status ResourceMgr::DoCreate(const string& container, TypeIndex type,
                              const string& name, ResourceBase* resource) {
-  {
-    mutex_lock l(mu_);
-    Container** b = &containers_[container];
-    if (*b == nullptr) {
-      *b = new Container;
-    }
-    if ((*b)->insert({{type.hash_code(), name}, resource}).second) {
-      TF_RETURN_IF_ERROR(InsertDebugTypeName(type.hash_code(), type.name()));
-      return Status::OK();
-    }
+  Container** b = &containers_[container];
+  if (*b == nullptr) {
+    *b = new Container;
+  }
+  if ((*b)->insert({{type.hash_code(), name}, resource}).second) {
+    TF_RETURN_IF_ERROR(InsertDebugTypeName(type.hash_code(), type.name()));
+    return Status::OK();
   }
   resource->Unref();
   return errors::AlreadyExists("Resource ", container, "/", name, "/",
@@ -157,7 +154,6 @@ Status ResourceMgr::DoCreate(const string& container, TypeIndex type,
 Status ResourceMgr::DoLookup(const string& container, TypeIndex type,
                              const string& name,
                              ResourceBase** resource) const {
-  tf_shared_lock l(mu_);
   const Container* b = gtl::FindPtrOrNull(containers_, container);
   if (b == nullptr) {
     return errors::NotFound("Container ", container,
diff --git a/tensorflow/core/framework/resource_mgr.h b/tensorflow/core/framework/resource_mgr.h
index 621da5b83861956b50e863e2faab5b68a18995bb..33d4cb77ff8a958f06f7b9d9e657f879c5221a60 100644
--- a/tensorflow/core/framework/resource_mgr.h
+++ b/tensorflow/core/framework/resource_mgr.h
@@ -131,6 +131,10 @@ class ResourceMgr {
   // "*resource". Otherwise, invokes creator() to create the resource.
   // The caller takes the ownership of one ref on "*resource".
   //
+  // WARNING: creator() must not call any methods on ResourceMgr during its
+  // execution, because a non-reentrant lock is held during the creator() call
+  // in order to guarantee atomicity of LookupOrCreate().
+  //
   // REQUIRES: std::is_base_of<ResourceBase, T>
   // REQUIRES: resource != nullptr
   template <typename T>
@@ -174,10 +178,19 @@ class ResourceMgr {
   mutable mutex mu_;
   std::unordered_map<string, Container*> containers_ GUARDED_BY(mu_);
 
+  template <typename T>
+  Status LookupInternal(const string& container, const string& name,
+                        T** resource) const
+      SHARED_LOCKS_REQUIRED(mu_) TF_MUST_USE_RESULT;
+
   Status DoCreate(const string& container, TypeIndex type, const string& name,
-                  ResourceBase* resource) TF_MUST_USE_RESULT;
+                  ResourceBase* resource)
+      EXCLUSIVE_LOCKS_REQUIRED(mu_) TF_MUST_USE_RESULT;
+
   Status DoLookup(const string& container, TypeIndex type, const string& name,
-                  ResourceBase** resource) const TF_MUST_USE_RESULT;
+                  ResourceBase** resource) const
+      SHARED_LOCKS_REQUIRED(mu_) TF_MUST_USE_RESULT;
+
   Status DoDelete(const string& container, uint64 type_hash_code,
                   const string& resource_name,
                   const string& type_name) TF_MUST_USE_RESULT;
@@ -362,6 +375,7 @@ Status ResourceMgr::Create(const string& container, const string& name,
                            T* resource) {
   CheckDeriveFromResourceBase<T>();
   CHECK(resource != nullptr);
+  mutex_lock l(mu_);
   return DoCreate(container, MakeTypeIndex<T>(), name, resource);
 }
 
@@ -369,6 +383,13 @@ template <typename T>
 Status ResourceMgr::Lookup(const string& container, const string& name,
                            T** resource) const {
   CheckDeriveFromResourceBase<T>();
+  tf_shared_lock l(mu_);
+  return LookupInternal(container, name, resource);
+}
+
+template <typename T>
+Status ResourceMgr::LookupInternal(const string& container, const string& name,
+                                   T** resource) const {
   ResourceBase* found = nullptr;
   Status s = DoLookup(container, MakeTypeIndex<T>(), name, &found);
   if (s.ok()) {
@@ -383,21 +404,23 @@ template <typename T>
 Status ResourceMgr::LookupOrCreate(const string& container, const string& name,
                                    T** resource,
                                    std::function<Status(T**)> creator) {
-  Status s;
+  CheckDeriveFromResourceBase<T>();
   *resource = nullptr;
-  while (*resource == nullptr) {
-    s = Lookup(container, name, resource);
-    if (s.ok()) break;
-    s = creator(resource);
-    if (!s.ok()) break;
-    s = Create(container, name, *resource);
-    if (s.ok()) {
-      (*resource)->Ref();
-      break;
-    }
-    // Rare event. Concurrent racy creation. Redo the lookup.
-    *resource = nullptr;
+  Status s;
+  {
+    tf_shared_lock l(mu_);
+    s = LookupInternal(container, name, resource);
+    if (s.ok()) return s;
+  }
+  mutex_lock l(mu_);
+  s = LookupInternal(container, name, resource);
+  if (s.ok()) return s;
+  TF_RETURN_IF_ERROR(creator(resource));
+  s = DoCreate(container, MakeTypeIndex<T>(), name, *resource);
+  if (!s.ok()) {
+    return errors::Internal("LookupOrCreate failed unexpectedly");
   }
+  (*resource)->Ref();
   return s;
 }
 
diff --git a/tensorflow/core/framework/resource_mgr_test.cc b/tensorflow/core/framework/resource_mgr_test.cc
index 798220d4c35c502df61c93b78ccd100d7c4b5ad5..7c7f0af0ce46abbde5b66facf4d33db47f9773b8 100644
--- a/tensorflow/core/framework/resource_mgr_test.cc
+++ b/tensorflow/core/framework/resource_mgr_test.cc
@@ -20,6 +20,7 @@ limitations under the License.
 #include "tensorflow/core/framework/node_def_util.h"
 #include "tensorflow/core/lib/core/errors.h"
 #include "tensorflow/core/lib/core/status_test_util.h"
+#include "tensorflow/core/lib/core/threadpool.h"
 #include "tensorflow/core/lib/strings/str_util.h"
 #include "tensorflow/core/lib/strings/strcat.h"
 #include "tensorflow/core/platform/test.h"
@@ -124,7 +125,7 @@ TEST(ResourceMgrTest, Basic) {
   TF_CHECK_OK(rm.Cleanup("bar"));
 }
 
-TEST(ResourceMgr, CreateOrLookup) {
+TEST(ResourceMgrTest, CreateOrLookup) {
   ResourceMgr rm;
   EXPECT_EQ("R/cat", LookupOrCreate<Resource>(&rm, "foo", "bar", "cat"));
   EXPECT_EQ("R/cat", LookupOrCreate<Resource>(&rm, "foo", "bar", "dog"));
@@ -136,6 +137,30 @@ TEST(ResourceMgr, CreateOrLookup) {
   HasError(FindErr<Other>(rm, "foo", "bar"), "Not found: Resource foo/bar");
 }
 
+TEST(ResourceMgrTest, CreateOrLookupRaceCondition) {
+  ResourceMgr rm;
+  std::atomic<int> atomic_int(0);
+  {
+    thread::ThreadPool threads(Env::Default(), "racing_creates", 2);
+    for (int i = 0; i < 2; i++) {
+      threads.Schedule([&rm, &atomic_int] {
+        Resource* r;
+        TF_CHECK_OK(rm.LookupOrCreate<Resource>(
+            "container", "resource-name", &r, [&atomic_int](Resource** ret) {
+              // Maximize chance of encountering race condition if one exists.
+              Env::Default()->SleepForMicroseconds(1 * 1000 * 1000);
+              atomic_int += 1;
+              *ret = new Resource("label");
+              return Status::OK();
+            }));
+        r->Unref();
+      });
+    }
+  }
+  // Resource creator function should always run exactly once.
+  EXPECT_EQ(1, atomic_int);
+}
+
 Status ComputePolicy(const string& attr_container,
                      const string& attr_shared_name,
                      bool use_node_name_as_default, string* result) {
diff --git a/tensorflow/core/framework/tensor.cc b/tensorflow/core/framework/tensor.cc
index 167e0eaa6e187c9a340ad18bf7a2e02da646512a..384a42fc112007964f1798669361bdfdcb3e919d 100644
--- a/tensorflow/core/framework/tensor.cc
+++ b/tensorflow/core/framework/tensor.cc
@@ -51,7 +51,6 @@ limitations under the License.
 #include "tensorflow/core/platform/protobuf.h"
 #include "tensorflow/core/platform/tensor_coding.h"
 #include "tensorflow/core/platform/types.h"
-#include "tensorflow/core/platform/variant_coding.h"
 
 namespace tensorflow {
 
@@ -207,7 +206,8 @@ struct Helper<ResourceHandle> {
   // "out", which is usually the TensorProto::tensor_content.
   template <typename Destination>
   static void Encode(TensorBuffer* in, int64 n, Destination* out) {
-    port::EncodeResourceHandleList(in->base<const ResourceHandle>(), n, out);
+    EncodeResourceHandleList(in->base<const ResourceHandle>(), n,
+                             port::NewStringListEncoder(out));
   }
 
   // Decodes "n" elements of type string from "in" and constructs a
@@ -217,7 +217,8 @@ struct Helper<ResourceHandle> {
   static TensorBuffer* Decode(Allocator* a, const Source& in, int64 n) {
     auto* buf = new Buffer<ResourceHandle>(a, n);
     ResourceHandle* ps = buf->template base<ResourceHandle>();
-    if (ps == nullptr || !port::DecodeResourceHandleList(in, ps, n)) {
+    if (ps == nullptr ||
+        !DecodeResourceHandleList(port::NewStringListDecoder(in), ps, n)) {
       buf->Unref();
       return nullptr;
     }
@@ -237,7 +238,8 @@ struct Helper<Variant> {
   // "out", which is usually the TensorProto::tensor_content.
   template <typename Destination>
   static void Encode(TensorBuffer* in, int64 n, Destination* out) {
-    port::EncodeVariantList(in->base<const Variant>(), n, out);
+    EncodeVariantList(in->base<const Variant>(), n,
+                      port::NewStringListEncoder(out));
   }
 
   // Decodes "n" elements of type Variant from "in" and constructs a
@@ -247,7 +249,8 @@ struct Helper<Variant> {
   static TensorBuffer* Decode(Allocator* a, const Source& in, int64 n) {
     auto* buf = new Buffer<Variant>(a, n);
     Variant* ps = buf->template base<Variant>();
-    if (ps == nullptr || !port::DecodeVariantList(in, ps, n)) {
+    if (ps == nullptr ||
+        !DecodeVariantList(port::NewStringListDecoder(in), ps, n)) {
       buf->Unref();
       return nullptr;
     }
diff --git a/tensorflow/core/framework/tensor.h b/tensorflow/core/framework/tensor.h
index 58fbced606c9888241a038bba9613d9dc51f6021..d2f2609d3b80a70d8df7b080941f46781ed1f6b2 100644
--- a/tensorflow/core/framework/tensor.h
+++ b/tensorflow/core/framework/tensor.h
@@ -484,6 +484,7 @@ class Tensor {
   friend class TensorTestHelper;      // For access to set_shape
   friend class OpKernelContext;       // For access to RefCountIsOne().
   friend class ScopedAllocator;       // For access to buf_.
+  friend class XlaTensor;             // For access to RefCountIsOne().
   friend class XlaTensorBuffer;  // For access to the private constructor taking
                                  // the buffer
   template <typename Device, typename T>
diff --git a/tensorflow/core/framework/tensor_util.cc b/tensorflow/core/framework/tensor_util.cc
index 8e3ac25512edd283152b72cb4d56a7fed8191428..65f6dc1c00b5123287212eae39dc607ad8f68e29 100644
--- a/tensorflow/core/framework/tensor_util.cc
+++ b/tensorflow/core/framework/tensor_util.cc
@@ -168,5 +168,14 @@ Status Split(const Tensor& tensor, const gtl::ArraySlice<int64>& sizes,
   return Status::OK();
 }
 
+namespace internal {
+void SetTensorProtoShape(std::vector<size_t> shape,
+                         TensorShapeProto* shape_proto) {
+  for (auto dim : shape) {
+    shape_proto->mutable_dim()->Add()->set_size(dim);
+  }
+}
+}  // namespace internal
+
 }  // namespace tensor
 }  // namespace tensorflow
diff --git a/tensorflow/core/framework/tensor_util.h b/tensorflow/core/framework/tensor_util.h
index 6c218b69e07a0ea9b19792e2da9f2032a272e293..43d2d95311225e72e7ca5229ec275a3840e89b0d 100644
--- a/tensorflow/core/framework/tensor_util.h
+++ b/tensorflow/core/framework/tensor_util.h
@@ -17,6 +17,7 @@ limitations under the License.
 #define TENSORFLOW_FRAMEWORK_TENSOR_UTIL_H_
 
 #include "tensorflow/core/framework/tensor.h"
+#include "tensorflow/core/framework/tensor_shape.pb.h"
 
 #include <vector>
 namespace tensorflow {
@@ -54,6 +55,108 @@ Status Concat(const gtl::ArraySlice<Tensor>& tensors,
 Status Split(const Tensor& tensor, const gtl::ArraySlice<int64>& sizes,
              std::vector<Tensor>* result) TF_MUST_USE_RESULT;
 
+namespace internal {
+void SetTensorProtoShape(std::vector<size_t> shape,
+                         TensorShapeProto* shape_proto);
+
+// Defines value type dependent methods to manipulate `TensorProto`.
+// Class specializations has to define following methods:
+//   static DataType GetDataType()
+//   static void AddValue(Type value, TensorProto* proto)
+template <typename Type>
+class TensorProtoHelper : public std::false_type {};
+
+template <>
+class TensorProtoHelper<string> : public std::true_type {
+ public:
+  static DataType GetDataType() { return DataType::DT_STRING; }
+  static void AddValue(const string& value, TensorProto* proto) {
+    *proto->mutable_string_val()->Add() = value;
+  }
+};
+
+template <>
+class TensorProtoHelper<int32> : public std::true_type {
+ public:
+  static DataType GetDataType() { return DataType::DT_INT32; }
+  static void AddValue(int32 value, TensorProto* proto) {
+    proto->mutable_int_val()->Add(value);
+  }
+};
+
+template <>
+class TensorProtoHelper<int64> : public std::true_type {
+ public:
+  static DataType GetDataType() { return DataType::DT_INT64; }
+  static void AddValue(int64 value, TensorProto* proto) {
+    proto->mutable_int64_val()->Add(value);
+  }
+};
+
+template <>
+class TensorProtoHelper<uint32> : public std::true_type {
+ public:
+  static DataType GetDataType() { return DataType::DT_UINT32; }
+  static void AddValue(uint32 value, TensorProto* proto) {
+    proto->mutable_uint32_val()->Add(value);
+  }
+};
+
+template <>
+class TensorProtoHelper<uint64> : public std::true_type {
+ public:
+  static DataType GetDataType() { return DataType::DT_UINT64; }
+  static void AddValue(uint64 value, TensorProto* proto) {
+    proto->mutable_uint64_val()->Add(value);
+  }
+};
+
+template <>
+class TensorProtoHelper<float> : public std::true_type {
+ public:
+  static DataType GetDataType() { return DataType::DT_FLOAT; }
+  static void AddValue(float value, TensorProto* proto) {
+    proto->mutable_float_val()->Add(value);
+  }
+};
+
+template <>
+class TensorProtoHelper<double> : public std::true_type {
+ public:
+  static DataType GetDataType() { return DataType::DT_DOUBLE; }
+  static void AddValue(double value, TensorProto* proto) {
+    proto->mutable_double_val()->Add(value);
+  }
+};
+
+template <>
+class TensorProtoHelper<bool> : public std::true_type {
+ public:
+  static DataType GetDataType() { return DataType::DT_BOOL; }
+  static void AddValue(bool value, TensorProto* proto) {
+    proto->mutable_bool_val()->Add(value);
+  }
+};
+}  // namespace internal
+
+// Creates a 'TensorProto' with specified shape and values.
+// The dtype and a field to represent data values of the returned 'TensorProto'
+// are determined based on type of the 'values' parameter.
+template <typename Type>
+typename std::enable_if<internal::TensorProtoHelper<Type>::value,
+                        TensorProto>::type
+CreateTensorProto(const std::vector<Type>& values,
+                  const std::vector<size_t>& shape) {
+  TensorProto tensor;
+  using TypeHelper = internal::TensorProtoHelper<Type>;
+  tensor.set_dtype(TypeHelper::GetDataType());
+  internal::SetTensorProtoShape(shape, tensor.mutable_tensor_shape());
+  for (const auto& value : values) {
+    TypeHelper::AddValue(value, &tensor);
+  }
+  return tensor;
+}
+
 }  // namespace tensor
 }  // namespace tensorflow
 
diff --git a/tensorflow/core/framework/tensor_util_test.cc b/tensorflow/core/framework/tensor_util_test.cc
index 69eb8363b2ceee89a58e7bc8844684dd200fbb27..2b4e1cad2fa24c00f1efc703cd040a105fa68bfe 100644
--- a/tensorflow/core/framework/tensor_util_test.cc
+++ b/tensorflow/core/framework/tensor_util_test.cc
@@ -226,5 +226,145 @@ TEST(TensorUtil, ConcatSplitStrings) {
   }
 }
 
+TEST(TensorProtoUtil, CreatesStringTensorProto) {
+  std::vector<string> values{"a", "b", "c"};
+  std::vector<size_t> shape{1, 3};
+
+  auto proto = tensor::CreateTensorProto(values, shape);
+
+  EXPECT_EQ(proto.DebugString(),
+            "dtype: DT_STRING\n"
+            "tensor_shape {\n"
+            "  dim {\n"
+            "    size: 1\n"
+            "  }\n"
+            "  dim {\n"
+            "    size: 3\n"
+            "  }\n"
+            "}\n"
+            "string_val: \"a\"\n"
+            "string_val: \"b\"\n"
+            "string_val: \"c\"\n");
+}
+
+TEST(TensorProtoUtil, CreatesInt32TensorProto) {
+  std::vector<int32> values{1, 2};
+  std::vector<size_t> shape{2};
+
+  auto proto = tensor::CreateTensorProto(values, shape);
+
+  EXPECT_EQ(proto.DebugString(),
+            "dtype: DT_INT32\n"
+            "tensor_shape {\n"
+            "  dim {\n"
+            "    size: 2\n"
+            "  }\n"
+            "}\n"
+            "int_val: 1\n"
+            "int_val: 2\n");
+}
+
+TEST(TensorProtoUtil, CreatesInt64TensorProto) {
+  std::vector<int64> values{1, 2};
+  std::vector<size_t> shape{2};
+
+  auto proto = tensor::CreateTensorProto(values, shape);
+
+  EXPECT_EQ(proto.DebugString(),
+            "dtype: DT_INT64\n"
+            "tensor_shape {\n"
+            "  dim {\n"
+            "    size: 2\n"
+            "  }\n"
+            "}\n"
+            "int64_val: 1\n"
+            "int64_val: 2\n");
+}
+
+TEST(TensorProtoUtil, CreatesUInt32TensorProto) {
+  std::vector<uint32> values{1, 2};
+  std::vector<size_t> shape{2};
+
+  auto proto = tensor::CreateTensorProto(values, shape);
+
+  EXPECT_EQ(proto.DebugString(),
+            "dtype: DT_UINT32\n"
+            "tensor_shape {\n"
+            "  dim {\n"
+            "    size: 2\n"
+            "  }\n"
+            "}\n"
+            "uint32_val: 1\n"
+            "uint32_val: 2\n");
+}
+
+TEST(TensorProtoUtil, CreatesUInt64TensorProto) {
+  std::vector<uint64> values{1, 2};
+  std::vector<size_t> shape{2};
+
+  auto proto = tensor::CreateTensorProto(values, shape);
+
+  EXPECT_EQ(proto.DebugString(),
+            "dtype: DT_UINT64\n"
+            "tensor_shape {\n"
+            "  dim {\n"
+            "    size: 2\n"
+            "  }\n"
+            "}\n"
+            "uint64_val: 1\n"
+            "uint64_val: 2\n");
+}
+
+TEST(TensorProtoUtil, CreatesFloatTensorProto) {
+  std::vector<float> values{1.1, 2.2};
+  std::vector<size_t> shape{2};
+
+  auto proto = tensor::CreateTensorProto(values, shape);
+
+  EXPECT_EQ(proto.DebugString(),
+            "dtype: DT_FLOAT\n"
+            "tensor_shape {\n"
+            "  dim {\n"
+            "    size: 2\n"
+            "  }\n"
+            "}\n"
+            "float_val: 1.1\n"
+            "float_val: 2.2\n");
+}
+
+TEST(TensorProtoUtil, CreatesDoubleTensorProto) {
+  std::vector<double> values{1.1, 2.2};
+  std::vector<size_t> shape{2};
+
+  auto proto = tensor::CreateTensorProto(values, shape);
+
+  EXPECT_EQ(proto.DebugString(),
+            "dtype: DT_DOUBLE\n"
+            "tensor_shape {\n"
+            "  dim {\n"
+            "    size: 2\n"
+            "  }\n"
+            "}\n"
+            "double_val: 1.1\n"
+            "double_val: 2.2\n");
+}
+
+TEST(TensorProtoUtil, CreatesBoolTensorProto) {
+  std::vector<bool> values{true, false};
+  std::vector<size_t> shape{2};
+
+  auto proto = tensor::CreateTensorProto(values, shape);
+
+  EXPECT_EQ(proto.DebugString(),
+            "dtype: DT_BOOL\n"
+            "tensor_shape {\n"
+            "  dim {\n"
+            "    size: 2\n"
+            "  }\n"
+            "}\n"
+            "bool_val: true\n"
+            "bool_val: false\n");
+}
+
 }  // namespace
 }  // namespace tensorflow
diff --git a/tensorflow/core/framework/variable.proto b/tensorflow/core/framework/variable.proto
index 93ae423babb93704fca6eb52de0f96ba56204884..66ba4cba7d83d9de88625af91a1b2f657a1bfb41 100644
--- a/tensorflow/core/framework/variable.proto
+++ b/tensorflow/core/framework/variable.proto
@@ -26,6 +26,9 @@ message VariableDef {
 
   // Whether to represent this as a ResourceVariable.
   bool is_resource = 5;
+
+  // Whether this variable should be trained.
+  bool trainable = 7;
 }
 
 message SaveSliceInfoDef {
diff --git a/tensorflow/core/framework/variant.cc b/tensorflow/core/framework/variant.cc
index 6ad2fafee778de05da68799344618d94e5780176..5a507804b0b65cb5e1fb0c667ee0796aa131dd75 100644
--- a/tensorflow/core/framework/variant.cc
+++ b/tensorflow/core/framework/variant.cc
@@ -16,6 +16,7 @@ limitations under the License.
 #include "tensorflow/core/framework/variant.h"
 #include "tensorflow/core/framework/tensor.pb.h"
 #include "tensorflow/core/framework/variant_encode_decode.h"
+#include "tensorflow/core/framework/variant_op_registry.h"
 #include "tensorflow/core/framework/variant_tensor_data.h"
 #include "tensorflow/core/lib/core/errors.h"
 #include "tensorflow/core/lib/gtl/map_util.h"
@@ -73,4 +74,36 @@ bool DecodeVariant(const string& buf, VariantTensorDataProto* value) {
   return value->ParseFromString(buf);
 }
 
+void EncodeVariantList(const Variant* variant_array, int64 n,
+                       std::unique_ptr<port::StringListEncoder> e) {
+  for (int i = 0; i < n; ++i) {
+    string s;
+    variant_array[i].Encode(&s);
+    e->Append(s);
+  }
+  e->Finalize();
+}
+
+bool DecodeVariantList(std::unique_ptr<port::StringListDecoder> d,
+                       Variant* variant_array, int64 n) {
+  std::vector<uint32> sizes(n);
+  if (!d->ReadSizes(&sizes)) return false;
+
+  for (int i = 0; i < n; ++i) {
+    if (variant_array[i].is_empty()) {
+      variant_array[i] = VariantTensorDataProto();
+    }
+    string str(d->Data(sizes[i]), sizes[i]);
+    if (!variant_array[i].Decode(str)) return false;
+    if (!DecodeUnaryVariant(&variant_array[i])) {
+      LOG(ERROR) << "Could not decode variant with type_name: \""
+                 << variant_array[i].TypeName()
+                 << "\".  Perhaps you forgot to register a "
+                    "decoder via REGISTER_UNARY_VARIANT_DECODE_FUNCTION?";
+      return false;
+    }
+  }
+  return true;
+}
+
 }  // end namespace tensorflow
diff --git a/tensorflow/core/framework/variant_encode_decode.h b/tensorflow/core/framework/variant_encode_decode.h
index 5a84f9d94385a7048a0f4adfe78e1805b367f02d..ded04b2a30f571747ff62a126e47ceac94b6b693 100644
--- a/tensorflow/core/framework/variant_encode_decode.h
+++ b/tensorflow/core/framework/variant_encode_decode.h
@@ -259,6 +259,16 @@ void EncodeVariant(const VariantTensorDataProto& value, string* buf);
 template <>
 bool DecodeVariant(const string& buf, VariantTensorDataProto* value);
 
+// Encodes an array of Variant objects in to the given StringListEncoder.
+// `variant_array` is assumed to point to an array of `n` Variant objects.
+void EncodeVariantList(const Variant* variant_array, int64 n,
+                       std::unique_ptr<port::StringListEncoder> e);
+
+// Decodes an array of Variant objects from the given StringListDecoder.
+// `variant_array` is assumed to point to an array of `n` Variant objects.
+bool DecodeVariantList(std::unique_ptr<port::StringListDecoder> d,
+                       Variant* variant_array, int64 n);
+
 }  // end namespace tensorflow
 
 #endif  // TENSORFLOW_FRAMEWORK_VARIANT_ENCODE_DECODE_H_
diff --git a/tensorflow/core/graph/algorithm_test.cc b/tensorflow/core/graph/algorithm_test.cc
index 99ced0c0f5daa7c722aa4060e9a954855411010b..f67d5a2fd2bd9d4928cd9d3190e307931a02c851 100644
--- a/tensorflow/core/graph/algorithm_test.cc
+++ b/tensorflow/core/graph/algorithm_test.cc
@@ -144,8 +144,8 @@ TEST(AlgorithmTest, ReversePostOrderStable) {
     std::vector<Node*> order;
 
     // Test reverse post order generates expected ordering.
-    GetReversePostOrder(g, &order, /*stable_comparator=*/NodeComparatorID());
-    EXPECT_TRUE(ExpectBefore({{"t3", "t2"}}, order, &error));
+    GetReversePostOrder(g, &order, /*stable_comparator=*/NodeComparatorName());
+    EXPECT_TRUE(ExpectBefore({{"t2", "t3"}}, order, &error));
   }
 }
 }  // namespace
diff --git a/tensorflow/core/graph/control_flow.cc b/tensorflow/core/graph/control_flow.cc
index 30ff19cd7eae794e0e9875ca0825b647b44b02af..fea25560d8da36ce55f2cdaa2459fdf513723e08 100644
--- a/tensorflow/core/graph/control_flow.cc
+++ b/tensorflow/core/graph/control_flow.cc
@@ -24,8 +24,8 @@ limitations under the License.
 
 namespace tensorflow {
 
-Status BuildControlFlowInfo(const Graph* g,
-                            std::vector<ControlFlowInfo>* info) {
+Status BuildControlFlowInfo(const Graph* g, std::vector<ControlFlowInfo>* info,
+                            std::vector<string>* unreachable_nodes) {
   info->clear();
   info->resize(g->num_node_ids());
 
@@ -114,6 +114,13 @@ Status BuildControlFlowInfo(const Graph* g,
       }
     }
   }
+  if (unreachable_nodes) {
+    for (const Node* node : g->op_nodes()) {
+      if (!parent_nodes[node->id()]) {
+        unreachable_nodes->push_back(node->name());
+      }
+    }
+  }
   return Status::OK();
 }
 
diff --git a/tensorflow/core/graph/control_flow.h b/tensorflow/core/graph/control_flow.h
index 79e2be0d4b9db6dd70d339ee07faf25c85376386..8605d57c1413f04020e3f381815e4eb38ccc336b 100644
--- a/tensorflow/core/graph/control_flow.h
+++ b/tensorflow/core/graph/control_flow.h
@@ -33,11 +33,15 @@ struct ControlFlowInfo {
 // Clear and populate `info` with each node's frame and the level it belongs to.
 // We check the well-formedness of the graph: All inputs to a node must come
 // from the same frame and have the same "static" iteration level.
+// If `unreachable_nodes` is set, return names of nodes unreachable from the
+// source node. We cannot build ControlFlowInfo for such nodes. They might be
+// pruned later.
 //
 // NOTE(yuanbyu): For now, we require all sends/recvs have iteration level 0.
 // This essentially means there can't be multiple serial Nexts in an iteration,
 // which all sane front-ends should satisfy.
-Status BuildControlFlowInfo(const Graph* g, std::vector<ControlFlowInfo>* info);
+Status BuildControlFlowInfo(const Graph* g, std::vector<ControlFlowInfo>* info,
+                            std::vector<string>* unreachable_nodes = nullptr);
 
 }  // namespace tensorflow
 
diff --git a/tensorflow/core/graph/graph.cc b/tensorflow/core/graph/graph.cc
index 71d0637dc238428a5cf5a3cdb2b2634f5684349a..0f748515efc22e68a3059327fcde0d2c3b4badc2 100644
--- a/tensorflow/core/graph/graph.cc
+++ b/tensorflow/core/graph/graph.cc
@@ -80,6 +80,9 @@ const std::unordered_map<string, Node::NodeClass>& Node::kNodeClassTable =
         {"Shape", NC_METADATA},
         {"Rank", NC_METADATA},
         {"_ScopedAllocator", NC_SCOPED_ALLOCATOR},
+        {"CollectiveReduce", NC_COLLECTIVE},
+        {"CollectiveBcastSend", NC_COLLECTIVE},
+        {"CollectiveBcastRecv", NC_COLLECTIVE},
     });
 
 #undef REF_CLASS
diff --git a/tensorflow/core/graph/graph.h b/tensorflow/core/graph/graph.h
index 83a69e6b2d83314467ab74924488adcae2195717..33fb7cb57a2cd75de2358f7d92b6463c489dbaca 100644
--- a/tensorflow/core/graph/graph.h
+++ b/tensorflow/core/graph/graph.h
@@ -163,6 +163,7 @@ class Node {
   bool IsHostSend() const { return class_ == NC_HOST_SEND; }
   bool IsHostRecv() const { return class_ == NC_HOST_RECV; }
   bool IsScopedAllocator() const { return class_ == NC_SCOPED_ALLOCATOR; }
+  bool IsCollective() const { return class_ == NC_COLLECTIVE; }
 
   bool IsMetadata() const { return class_ == NC_METADATA; }
 
@@ -235,6 +236,7 @@ class Node {
     NC_DELETE_SESSION_TENSOR,
     NC_METADATA,
     NC_SCOPED_ALLOCATOR,
+    NC_COLLECTIVE,
     NC_OTHER  // Not a special kind of node
   };
 
diff --git a/tensorflow/core/graph/graph_constructor.cc b/tensorflow/core/graph/graph_constructor.cc
index 2fd32c0bd4319ffd345769c0ce6504c20679f0f6..0967492d92bfe0a6e703fd3fac920744e143a6e9 100644
--- a/tensorflow/core/graph/graph_constructor.cc
+++ b/tensorflow/core/graph/graph_constructor.cc
@@ -278,8 +278,9 @@ class GraphConstructor {
   // name, the value is the new unique name.
   std::unordered_map<string, string> uniquified_names_;
 
-  // Index of NodeDefs in node_defs_ with all inputs already converted.
-  std::vector<int> ready_;
+  // Index of NodeDefs in node_defs_ with all inputs already converted. We use a
+  // (sorted) set so nodes are created in the order defined in the GraphDef.
+  std::set<int> ready_;
 
   // Mapping between index within node_defs_ and the number of inputs that
   // still need to be converted.
@@ -520,7 +521,7 @@ Status GraphConstructor::InitFromEdges() {
       }
     }
     if (pending_count == 0) {
-      ready_.push_back(n);
+      ready_.insert(n);
     }
     pending_count_.push_back(pending_count);
   }
@@ -884,12 +885,12 @@ namespace {
 
 void UpdatePendingCountAndReady(
     const std::vector<gtl::InlinedVector<int, 4>>& outputs, int o,
-    std::vector<int>* pending_count, std::vector<int>* ready) {
+    std::vector<int>* pending_count, std::set<int>* ready) {
   for (size_t i = 0; i < outputs[o].size(); ++i) {
     const int output = outputs[o][i];
     (*pending_count)[output]--;
     if ((*pending_count)[output] == 0) {
-      ready->push_back(output);
+      ready->insert(output);
     }
   }
 }
@@ -913,8 +914,8 @@ Status GraphConstructor::Convert() {
   // inputs, pending_counts_ with the number of inputs for each node and
   // outputs_ with the outputs of each node).
   while (!ready_.empty()) {
-    int o = ready_.back();
-    ready_.pop_back();
+    int o = *ready_.begin();
+    ready_.erase(ready_.begin());
     ++processed;
     inputs.clear();
     bool has_data_back_edge = false;
diff --git a/tensorflow/core/graph/graph_constructor_test.cc b/tensorflow/core/graph/graph_constructor_test.cc
index c54b4fa269eaf240fa47d80b5e46991bcbcb8e6f..630987019078474e909ec46dd40ca731b822c8e2 100644
--- a/tensorflow/core/graph/graph_constructor_test.cc
+++ b/tensorflow/core/graph/graph_constructor_test.cc
@@ -3170,7 +3170,7 @@ TEST_F(GraphConstructorTest, ImportGraphDef_UnknownOps) {
       {"Make sure the Op and Kernel are registered in the "
        "binary running in this process. Note that if you "
        "are loading a saved graph which used ops from "
-       "tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done"
+       "tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done "
        "before importing the graph, as contrib ops are lazily registered "
        "when the module is first accessed."});
 }
diff --git a/tensorflow/core/graph/graph_partition_test.cc b/tensorflow/core/graph/graph_partition_test.cc
index 83b24cafe2cb364b2afd5dcb6533bf662dc40a1b..f44ed47a6e94acdce66c36902cbcf2fdfb041447 100644
--- a/tensorflow/core/graph/graph_partition_test.cc
+++ b/tensorflow/core/graph/graph_partition_test.cc
@@ -329,11 +329,11 @@ TEST_F(GraphPartitionTest, CrossDeviceControl_MultiUse) {
   string b = "/job:a/replica:0/task:0/cpu:1";
   a1 = FloatInput(scope_a_.WithOpName("A1"));
   auto c = Const(scope_a_.WithOpName("A1/_0").WithControlDependencies(a1), {});
-  _Send(scope_a_.WithOpName("A1/_1"), c, "edge_1_A1", a, 82, b);
+  _Send(scope_a_.WithOpName("A1/_1"), c, "edge_3_A1", a, 82, b);
   ExpectMatchA();
 
   auto recv =
-      _Recv(scope_b_.WithOpName("A1/_2"), DT_FLOAT, "edge_1_A1", a, 82, b);
+      _Recv(scope_b_.WithOpName("A1/_2"), DT_FLOAT, "edge_3_A1", a, 82, b);
   auto id = Identity(scope_b_.WithOpName("A1/_3"), recv);
   b1 = FloatInput(scope_b_.WithOpName("B1"));
   Combine(scope_b_.WithOpName("B2").WithControlDependencies(id), b1, b1);
@@ -353,18 +353,18 @@ TEST_F(GraphPartitionTest, CrossDevice_DataControl) {
   string a = "/job:a/replica:0/task:0/cpu:0";
   string b = "/job:a/replica:0/task:0/cpu:1";
   a1 = FloatInput(scope_a_.WithOpName("A1"));
-  auto c = Const(scope_a_.WithOpName("A1/_0").WithControlDependencies(a1), {});
+  _Send(scope_a_.WithOpName("A1/_0"), a1, "edge_1_A1", a, 82, b);
+  auto c = Const(scope_a_.WithOpName("A1/_2").WithControlDependencies(a1), {});
   // NOTE: Send 0 A1/_1 -> A1/_2 is not necessarily needed. We could
   // use A1/_0 -> A1/_4 as the control as a minor optimization.
-  _Send(scope_a_.WithOpName("A1/_1"), c, "edge_1_A1", a, 82, b);
-  _Send(scope_a_.WithOpName("A1/_4"), a1, "edge_2_A1", a, 82, b);
+  _Send(scope_a_.WithOpName("A1/_3"), c, "edge_3_A1", a, 82, b);
   ExpectMatchA();
 
   auto recv1 =
-      _Recv(scope_b_.WithOpName("A1/_2"), DT_FLOAT, "edge_1_A1", a, 82, b);
-  auto id1 = Identity(scope_b_.WithOpName("A1/_3"), recv1);
+      _Recv(scope_b_.WithOpName("A1/_4"), DT_FLOAT, "edge_3_A1", a, 82, b);
+  auto id1 = Identity(scope_b_.WithOpName("A1/_5"), recv1);
   auto recv2 =
-      _Recv(scope_b_.WithOpName("A1/_5"), DT_FLOAT, "edge_2_A1", a, 82, b);
+      _Recv(scope_b_.WithOpName("A1/_1"), DT_FLOAT, "edge_1_A1", a, 82, b);
   b1 = FloatInput(scope_b_.WithOpName("B1"));
   Combine(scope_b_.WithOpName("B2"), recv2, b1);
   FloatInput(scope_b_.WithOpName("B3").WithControlDependencies(id1));
diff --git a/tensorflow/core/graph/optimizer_cse_test.cc b/tensorflow/core/graph/optimizer_cse_test.cc
index 21a63662cf22c465968b1bf1d347454f2af20745..c1f93ce05ae99fef05d6a16815c3886643d17e26 100644
--- a/tensorflow/core/graph/optimizer_cse_test.cc
+++ b/tensorflow/core/graph/optimizer_cse_test.cc
@@ -115,8 +115,8 @@ TEST_F(OptimizerCSETest, Simple) {
       "node { name: 'D' op: 'Mul' attr { key: 'T' value { type: DT_FLOAT } }"
       " input: ['A', 'B'] }");
   EXPECT_EQ(DoCSE(),
-            "A(Input);B(Input);D(Mul)|"
-            "A->D;B->D:1");
+            "A(Input);B(Input);C(Mul)|"
+            "A->C;B->C:1");
 }
 
 TEST_F(OptimizerCSETest, Simple_ThreeEquivalent) {
@@ -130,8 +130,8 @@ TEST_F(OptimizerCSETest, Simple_ThreeEquivalent) {
       "node { name: 'E' op: 'Mul' attr { key: 'T' value { type: DT_FLOAT } }"
       " input: ['A', 'B'] }");
   EXPECT_EQ(DoCSE(),
-            "A(Input);B(Input);E(Mul)|"
-            "A->E;B->E:1");
+            "A(Input);B(Input);C(Mul)|"
+            "A->C;B->C:1");
 }
 
 TEST_F(OptimizerCSETest, Simple_WithFixups) {
@@ -145,8 +145,8 @@ TEST_F(OptimizerCSETest, Simple_WithFixups) {
       "node { name: 'E' op: 'Mul' attr { key: 'T' value { type: DT_FLOAT } }"
       " input: ['C', 'D'] }");
   EXPECT_EQ(DoCSE(),
-            "A(Input);B(Input);D(Mul);E(Mul)|"
-            "A->D;B->D:1;D->E;D->E:1");
+            "A(Input);B(Input);C(Mul);E(Mul)|"
+            "A->C;B->C:1;C->E;C->E:1");
 }
 
 TEST_F(OptimizerCSETest, Simple_Commutative) {
@@ -158,8 +158,8 @@ TEST_F(OptimizerCSETest, Simple_Commutative) {
       "node { name: 'D' op: 'Mul' attr { key: 'T' value { type: DT_FLOAT } }"
       " input: ['B', 'A'] }");
   EXPECT_EQ(DoCSE(),
-            "A(Input);B(Input);D(Mul)|"
-            "A->D:1;B->D");
+            "A(Input);B(Input);C(Mul)|"
+            "A->C;B->C:1");
 }
 
 static bool IsNotMultiply(const Node* n) { return n->type_string() != "Mul"; }
@@ -210,8 +210,8 @@ TEST_F(OptimizerCSETest, Simple_SameOps_SameAttrs1) {
       " input: ['A', 'B'] attr { key: 'shape'"
       "    value { shape: { dim: { size: 37 name: 'SAME_NAME' } } } } }");
   EXPECT_EQ(DoCSE(),
-            "A(Input);B(Input);D(Mul)|"
-            "A->D;B->D:1");
+            "A(Input);B(Input);C(Mul)|"
+            "A->C;B->C:1");
 }
 
 TEST_F(OptimizerCSETest, Simple_SameOps_SameAttrs2) {
@@ -229,8 +229,8 @@ TEST_F(OptimizerCSETest, Simple_SameOps_SameAttrs2) {
       "    attr { key: 't' value { type: DT_INT32 } }"
       "    attr { key: 'a' value { i: 3 } } }");
   EXPECT_EQ(DoCSE(),
-            "A(Input);B(Input);D(Mul)|"
-            "A->D;B->D:1");
+            "A(Input);B(Input);C(Mul)|"
+            "A->C;B->C:1");
 }
 
 TEST_F(OptimizerCSETest, SameConstants) {
@@ -249,8 +249,8 @@ TEST_F(OptimizerCSETest, SameConstants) {
       "node { name: 'D' op: 'Mul' attr { key: 'T' value { type: DT_INT32 } }"
       " input: ['A', 'B'] }");
   EXPECT_EQ(DoCSE(),
-            "B(Const);D(Mul)|"
-            "B->D;B->D:1");
+            "A(Const);D(Mul)|"
+            "A->D;A->D:1");
 }
 
 TEST_F(OptimizerCSETest, DifferentConstants) {
@@ -338,8 +338,8 @@ TEST_F(OptimizerCSETest, Constant_Dedup) {
             "n/_0(Const);n/_1(Const);n/_2(Const);n/_3(Const);"
             "n/_4(Const);n/_5(Const);n/_6(Const);n/_7(Const)|");
   // In theory, there are 2^4 possible correct output of CSE.  In this
-  // test, it happens to eliminate the first 4 nodes.
-  EXPECT_EQ(DoCSE(), "n/_4(Const);n/_5(Const);n/_6(Const);n/_7(Const)|");
+  // test, it happens to eliminate the last 4 nodes.
+  EXPECT_EQ(DoCSE(), "n/_0(Const);n/_1(Const);n/_2(Const);n/_3(Const)|");
 }
 
 static void BM_CSE(int iters, int op_nodes) {
diff --git a/tensorflow/core/grappler/clusters/BUILD b/tensorflow/core/grappler/clusters/BUILD
index 30c6126fbb58c16813f923cec2f9551482ade6df..d0b2cf01be221df46837cad9fccce84e386a6901 100644
--- a/tensorflow/core/grappler/clusters/BUILD
+++ b/tensorflow/core/grappler/clusters/BUILD
@@ -20,6 +20,9 @@ tf_cuda_library(
     name = "utils",
     srcs = ["utils.cc"],
     hdrs = ["utils.h"],
+    cuda_deps = [
+        "@local_config_cuda//cuda:cudnn_header",
+    ],
     visibility = ["//visibility:public"],
     deps = [
         "//third_party/eigen3",
diff --git a/tensorflow/core/grappler/costs/BUILD b/tensorflow/core/grappler/costs/BUILD
index 35f11eac2955e504c4edcecbb2cde78727343528..b05406829915ae64476f7263fb573111050d097a 100644
--- a/tensorflow/core/grappler/costs/BUILD
+++ b/tensorflow/core/grappler/costs/BUILD
@@ -129,6 +129,9 @@ tf_cuda_library(
     name = "utils",
     srcs = ["utils.cc"],
     hdrs = ["utils.h"],
+    cuda_deps = [
+        "@local_config_cuda//cuda:cudnn_header",
+    ],
     visibility = ["//visibility:public"],
     deps = [
         "//third_party/eigen3",
diff --git a/tensorflow/core/grappler/costs/analytical_cost_estimator.cc b/tensorflow/core/grappler/costs/analytical_cost_estimator.cc
index c8ba4dfbdadf50eab22ee2f4af898fe949572c66..a60e3c7a9fd33d121714f1c43f6b5d083f6193f1 100644
--- a/tensorflow/core/grappler/costs/analytical_cost_estimator.cc
+++ b/tensorflow/core/grappler/costs/analytical_cost_estimator.cc
@@ -98,6 +98,7 @@ Status AnalyticalCostEstimator::PredictCosts(const GraphDef& optimized_graph,
           node_costs.compute_time.asMicroSeconds().count());
       cost_node->set_memory_time(
           node_costs.memory_time.asMicroSeconds().count());
+      cost_node->set_inaccurate(node_costs.inaccurate);
       for (const auto& output : op_context.op_info.outputs()) {
         auto output_info = cost_node->add_output_info();
         output_info->set_dtype(output.dtype());
diff --git a/tensorflow/core/grappler/costs/graph_properties.cc b/tensorflow/core/grappler/costs/graph_properties.cc
index 69b759473588e7c39bf74a151db3500032bdf71a..d9a08d42dbccddf1ac5fb166427d94d466963dab 100644
--- a/tensorflow/core/grappler/costs/graph_properties.cc
+++ b/tensorflow/core/grappler/costs/graph_properties.cc
@@ -19,6 +19,7 @@ limitations under the License.
 #include <unordered_map>
 #include <unordered_set>
 #include "tensorflow/core/framework/common_shape_fns.h"
+#include "tensorflow/core/framework/function.pb.h"
 #include "tensorflow/core/framework/node_def_util.h"
 #include "tensorflow/core/framework/tensor_shape.pb.h"
 #include "tensorflow/core/framework/versions.pb.h"
@@ -425,6 +426,13 @@ class SymbolicShapeRefiner {
     return it->second.inference_context.get();
   }
 
+  // Forward the shapes from the function's fanin to the function body,
+  // then call PropagateShapes.
+  // Returns an error if 'node' is not a function node.
+  Status UpdateFunction(const NodeDef* node, bool* refined) {
+    return UpdateNode(node, refined);
+  }
+
   Status UpdateNode(const NodeDef* node, bool* refined) {
     NodeContext* node_context = GetNodeContext(node);
     if (node_context == nullptr) {
@@ -677,10 +685,16 @@ class SymbolicShapeRefiner {
     return true;
   }
 
+  Status AddFunction(const NodeDef* node) { return Status::OK(); }
+
   Status AddNode(const NodeDef* node) {
     NodeContext& node_ctx = node_to_context_[node];
     TF_RETURN_IF_ERROR(function_library_.LookUp(node->op(), &node_ctx.op_data));
 
+    if (node_ctx.op_data->is_function_op) {
+      TF_RETURN_IF_ERROR(AddFunction(node));
+    }
+
     TF_RETURN_IF_ERROR(InOutTypesForNode(*node, node_ctx.op_data->op_def,
                                          &node_ctx.input_types,
                                          &node_ctx.output_types));
@@ -1069,8 +1083,13 @@ Status GraphProperties::UpdateShapes(
     TF_RETURN_IF_ERROR(
         UpdateEnqueue(n, resource_handles, shape_refiner, new_shapes));
   } else {
-    // Rely on regular TF shape refinement for all the other nodes.
-    TF_RETURN_IF_ERROR(shape_refiner->UpdateNode(n, new_shapes));
+    auto c = shape_refiner->GetNodeContext(n);
+    if (c && c->op_data && c->op_data->is_function_op) {
+      TF_RETURN_IF_ERROR(shape_refiner->UpdateFunction(n, new_shapes));
+    } else {
+      // Rely on regular TF shape refinement for all the other nodes.
+      TF_RETURN_IF_ERROR(shape_refiner->UpdateNode(n, new_shapes));
+    }
   }
   return Status::OK();
 }
diff --git a/tensorflow/core/grappler/costs/op_level_cost_estimator.cc b/tensorflow/core/grappler/costs/op_level_cost_estimator.cc
index b8e337582c93a8c9f178f1bd441d408c4250dc4d..b994d263973c0fbf97cb42f3d3bdd800d114187b 100644
--- a/tensorflow/core/grappler/costs/op_level_cost_estimator.cc
+++ b/tensorflow/core/grappler/costs/op_level_cost_estimator.cc
@@ -45,6 +45,7 @@ constexpr char kIdentityN[] = "IdentityN";
 constexpr char kRefIdentity[] = "RefIdentity";
 constexpr char kNoOp[] = "NoOp";
 constexpr char kReshape[] = "Reshape";
+constexpr char kSqueeze[] = "Squeeze";
 constexpr char kRecv[] = "_Recv";
 constexpr char kSend[] = "_Send";
 constexpr char kBatchMatMul[] = "BatchMatMul";
@@ -232,6 +233,7 @@ OpLevelCostEstimator::OpLevelCostEstimator() {
       {kStopGradient, wrap(&OpLevelCostEstimator::PredictIdentity)},
       {kPreventGradient, wrap(&OpLevelCostEstimator::PredictIdentity)},
       {kReshape, wrap(&OpLevelCostEstimator::PredictIdentity)},
+      {kSqueeze, wrap(&OpLevelCostEstimator::PredictIdentity)},
       {kRecv, wrap(&OpLevelCostEstimator::PredictIdentity)},
       {kSend, wrap(&OpLevelCostEstimator::PredictIdentity)},
 
diff --git a/tensorflow/core/grappler/op_types.cc b/tensorflow/core/grappler/op_types.cc
index 92581942cbdbb0b76679297c48bc081c73fda833..2a47a4c4958c7ea1b4ec91bd75d71b088519c45d 100644
--- a/tensorflow/core/grappler/op_types.cc
+++ b/tensorflow/core/grappler/op_types.cc
@@ -78,6 +78,12 @@ bool IsCheckNumerics(const NodeDef& node) {
   return node.op() == "CheckNumerics";
 }
 
+bool IsCollective(const NodeDef& node) {
+  return node.op() == "CollectiveReduce" ||
+         node.op() == "CollectiveBcastSend" ||
+         node.op() == "CollectiveBcastRecv";
+}
+
 bool IsComplex(const NodeDef& node) { return node.op() == "Complex"; }
 
 bool IsComplexAbs(const NodeDef& node) { return node.op() == "ComplexAbs"; }
@@ -203,6 +209,8 @@ bool IsMax(const NodeDef& node) { return node.op() == "Max"; }
 
 bool IsMaximum(const NodeDef& node) { return node.op() == "Maximum"; }
 
+bool IsMaxPoolGrad(const NodeDef& node) { return node.op() == "MaxPoolGrad"; }
+
 bool IsMean(const NodeDef& node) { return node.op() == "Mean"; }
 
 bool IsMerge(const NodeDef& node) {
@@ -445,6 +453,10 @@ bool IsFreeOfSideEffect(const NodeDef& node) {
       return false;
     }
   }
+  // Queue ops modify the queue which is a side effect.
+  if (node.op().find("Queue") != std::string::npos) {
+    return false;
+  }
   return !ModifiesInputsInPlace(node);
 }
 
diff --git a/tensorflow/core/grappler/op_types.h b/tensorflow/core/grappler/op_types.h
index 9d91ba1ba53acb5af6ee4e78ade1fb37cb66283e..e7f39981c00dc29e330876338b82f908c3a35e07 100644
--- a/tensorflow/core/grappler/op_types.h
+++ b/tensorflow/core/grappler/op_types.h
@@ -38,6 +38,7 @@ bool IsBiasAddGrad(const NodeDef& node);
 bool IsBitcast(const NodeDef& node);
 bool IsCast(const NodeDef& node);
 bool IsCheckNumerics(const NodeDef& node);
+bool IsCollective(const NodeDef& node);
 bool IsComplex(const NodeDef& node);
 bool IsComplexAbs(const NodeDef& node);
 bool IsConj(const NodeDef& node);
@@ -78,6 +79,7 @@ bool IsLogicalNot(const NodeDef& node);
 bool IsLogicalOr(const NodeDef& node);
 bool IsMax(const NodeDef& node);
 bool IsMaximum(const NodeDef& node);
+bool IsMaxPoolGrad(const NodeDef& node);
 bool IsMean(const NodeDef& node);
 bool IsMerge(const NodeDef& node);
 bool IsMin(const NodeDef& node);
diff --git a/tensorflow/core/grappler/optimizers/BUILD b/tensorflow/core/grappler/optimizers/BUILD
index 104a0428ce83705abf47f2124e1618ddbb9d08fe..2073c2968bb13275712acc4c6904bbc2af6753fe 100644
--- a/tensorflow/core/grappler/optimizers/BUILD
+++ b/tensorflow/core/grappler/optimizers/BUILD
@@ -328,11 +328,13 @@ tf_cuda_cc_test(
         ":model_pruner",
         "//tensorflow/cc:cc_ops",
         "//tensorflow/core:protos_all_cc",
+        "//tensorflow/core:tensor_testutil",
         "//tensorflow/core:test",
         "//tensorflow/core:test_main",
         "//tensorflow/core/grappler:grappler_item",
         "//tensorflow/core/grappler:utils",
         "//tensorflow/core/grappler/inputs:trivial_test_graph_input_yielder",
+        "//tensorflow/core/grappler/utils:grappler_test",
         "//tensorflow/core/grappler/utils:topological_sort",
     ],
 )
@@ -517,6 +519,7 @@ cc_library(
         ":memory_optimizer",
         ":model_pruner",
         ":remapper",
+        ":scoped_allocator_optimizer",
         ":shape_optimizer",
         "//tensorflow/core:core_cpu_base",
         "//tensorflow/core:framework",
@@ -695,6 +698,7 @@ tf_cuda_cc_test(
         "//tensorflow/core:test",
         "//tensorflow/core:test_main",
         "//tensorflow/core:testlib",
+        "//tensorflow/core/grappler:devices",
         "//tensorflow/core/grappler:grappler_item",
         "//tensorflow/core/grappler:utils",
         "//tensorflow/core/grappler/inputs:trivial_test_graph_input_yielder",
@@ -761,3 +765,49 @@ tf_cuda_cc_test(
         "//tensorflow/core/grappler/utils:grappler_test",
     ],
 )
+
+cc_library(
+    name = "scoped_allocator_optimizer",
+    srcs = ["scoped_allocator_optimizer.cc"],
+    hdrs = [
+        "scoped_allocator_optimizer.h",
+    ],
+    visibility = ["//visibility:public"],
+    deps = [
+        ":graph_optimizer",
+        "//tensorflow/core:core_cpu_lib",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:lib_internal",
+        "//tensorflow/core:protos_all_cc",
+        "//tensorflow/core/grappler:grappler_item",
+        "//tensorflow/core/grappler:op_types",
+        "//tensorflow/core/grappler:utils",
+        "//tensorflow/core/grappler/costs:graph_properties",
+        "//tensorflow/core/grappler/utils:frame",
+    ],
+)
+
+tf_cc_test(
+    name = "scoped_allocator_optimizer_test",
+    size = "small",
+    srcs = ["scoped_allocator_optimizer_test.cc"],
+    tags = [
+        "nomsan",
+    ],
+    deps = [
+        ":scoped_allocator_optimizer",
+        "//tensorflow/cc:cc_ops",
+        "//tensorflow/core:all_kernels",
+        "//tensorflow/core:core_cpu",
+        "//tensorflow/core:direct_session",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:protos_all_cc",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
+        "//tensorflow/core:testlib",
+        "//tensorflow/core/grappler:grappler_item",
+        "//tensorflow/core/grappler:utils",
+        "//tensorflow/core/grappler/inputs:trivial_test_graph_input_yielder",
+    ],
+)
diff --git a/tensorflow/core/grappler/optimizers/arithmetic_optimizer.cc b/tensorflow/core/grappler/optimizers/arithmetic_optimizer.cc
index 060e4200af230267e6c3d18d2fd1495c785bb673..51110b4bda7b7cf5858d066df77dd2c64737f4eb 100644
--- a/tensorflow/core/grappler/optimizers/arithmetic_optimizer.cc
+++ b/tensorflow/core/grappler/optimizers/arithmetic_optimizer.cc
@@ -101,38 +101,6 @@ bool ValuesFromConstNode(const NodeDef& node, std::vector<T>* values) {
   return false;
 }
 
-template <typename T>
-bool IsInnerMatrixTranspose(const std::vector<T>& perm) {
-  const T n = perm.size();
-  if (n < 2) {
-    return false;
-  }
-  for (T i = 0; i < n - 2; ++i) {
-    if (perm[i] != i) {
-      return false;
-    }
-  }
-  return perm[n - 1] == n - 2 && perm[n - 2] == n - 1;
-}
-
-bool IsInnerMatrixTransposeNode(const NodeDef& transpose_node,
-                                const NodeMap* node_map) {
-  if (transpose_node.op() != "Transpose" &&
-      transpose_node.op() != "ConjugateTranspose") {
-    return false;
-  }
-  const NodeDef* perm_node = node_map->GetNode(transpose_node.input(1));
-  std::vector<int> perm32;
-  if (ValuesFromConstNode(*perm_node, &perm32)) {
-    return IsInnerMatrixTranspose(perm32);
-  }
-  std::vector<int64> perm64;
-  if (ValuesFromConstNode(*perm_node, &perm64)) {
-    return IsInnerMatrixTranspose(perm64);
-  }
-  return false;
-}
-
 bool MaybeAddControlInput(const string& new_input, NodeDef* node,
                           GraphDef* graph, NodeMap* node_map) {
   bool already_exists = false;
@@ -155,12 +123,6 @@ void SetDataTypeToAttr(DataType dtype, const string& attr_name, NodeDef* node) {
   (*node->mutable_attr())[attr_name].set_type(dtype);
 }
 
-void FlipBooleanAttr(const string& attr_name, NodeDef* node) {
-  const bool old_value =
-      !node->attr().count(attr_name) ? false : node->attr().at(attr_name).b();
-  (*node->mutable_attr())[attr_name].set_b(!old_value);
-}
-
 string SourceDataTypeAttrName(const NodeDef& node) {
   if (node.op() == "Bitcast") {
     return "T";
@@ -194,57 +156,6 @@ void SetSourceDataType(DataType dtype, NodeDef* node) {
   SetDataTypeToAttr(dtype, SourceDataTypeAttrName(*node), node);
 }
 
-bool IsNumberType(DataType dtype) { return kNumberTypes.Contains(dtype); }
-
-// Returns whether `reshape` is an identity op. The tensor that `reshape`
-// reshapes is the `output_pos`-th output of node `input`.
-bool ReshapeIsIdentity(const NodeDef& reshape, const NodeDef& input,
-                       const int output_pos,
-                       const GraphProperties& graph_properties) {
-  const std::vector<OpInfo::TensorProperties>& reshape_props =
-      graph_properties.GetOutputProperties(reshape.name());
-  const std::vector<OpInfo::TensorProperties>& input_props =
-      graph_properties.GetOutputProperties(input.name());
-  if (reshape_props.empty() || input_props.size() <= output_pos) {
-    return false;
-  }
-
-  const PartialTensorShape& src_shape = input_props[output_pos].shape();
-  const PartialTensorShape& dst_shape = reshape_props[0].shape();
-
-  if (src_shape.unknown_rank() || dst_shape.unknown_rank()) {
-    return false;
-  }
-
-  if (!dst_shape.IsCompatibleWith(src_shape)) {
-    return false;
-  }
-
-  // Returns false when src_shape or dst_shape has >=2 dimensions with unknown
-  // sizes.
-  auto num_unknown_dim_sizes = [](const PartialTensorShape& partial_shape) {
-    auto dim_sizes = partial_shape.dim_sizes();
-    return std::count_if(dim_sizes.begin(), dim_sizes.end(),
-                         [](int dim) { return dim < 0; });
-  };
-  int src_num_unknown_dim_sizes = num_unknown_dim_sizes(src_shape);
-  int dst_num_unknown_dim_sizes = num_unknown_dim_sizes(dst_shape);
-  if (src_num_unknown_dim_sizes > 1 || dst_num_unknown_dim_sizes > 1) {
-    return false;
-  }
-
-  // If dst_num_unknown_dim_sizes != src_num_unknown_dim_sizes we would weaken
-  // shape inference in subsequent passes if we removed this reshape.
-  if (src_num_unknown_dim_sizes != dst_num_unknown_dim_sizes) {
-    return false;
-  }
-
-  // Remove the reshape if both are fully defined or partially defined and the
-  // unknown or symbolic shape appears on the same dimension, i.e., if
-  // IsIdenticalTo returns true.
-  return dst_shape.IsIdenticalTo(src_shape);
-}
-
 NodeDef* GetTailOfValuePreservingChain(
     const NodeDef& node, const NodeMap& node_map,
     const std::unordered_set<string>& nodes_to_preserve) {
@@ -1162,10 +1073,8 @@ class MinimizeBroadcasts : public ArithmeticNodesGroupOptimizerStage {
 class RemoveIdentityTranspose : public ArithmeticOptimizerStage {
  public:
   explicit RemoveIdentityTranspose(const GraphOptimizerContext& ctx,
-                                   const ArithmeticOptimizerContext& ctx_ext,
-                                   RewriterConfig::Toggle opt_level)
-      : ArithmeticOptimizerStage("RemoveIdentityTranspose", ctx, ctx_ext),
-        opt_level_(opt_level) {}
+                                   const ArithmeticOptimizerContext& ctx_ext)
+      : ArithmeticOptimizerStage("RemoveIdentityTranspose", ctx, ctx_ext) {}
   ~RemoveIdentityTranspose() override = default;
 
   bool IsSupported(const NodeDef* node) const override {
@@ -1260,8 +1169,47 @@ class RemoveIdentityTranspose : public ArithmeticOptimizerStage {
     }
     return true;
   }
+};
+
+// An involution is an element-wise function f(x) that is its own inverse,
+// i.e. f(f(x)) = x. If we can find a chain of ops
+//   f->op1->op2->...opn->f
+// where op1 through opn preserve the values of their inputs, we can remove
+// the two instances of the involution from the graph, since they cancel
+// each other.
+class RemoveInvolution : public ArithmeticOptimizerStage {
+ public:
+  explicit RemoveInvolution(const GraphOptimizerContext& ctx,
+                            const ArithmeticOptimizerContext& ctx_ext)
+      : ArithmeticOptimizerStage("RemoveInvolution", ctx, ctx_ext) {}
+  ~RemoveInvolution() override = default;
+
+  bool IsSupported(const NodeDef* node) const override {
+    return IsInvolution(*node);
+  }
 
-  RewriterConfig::Toggle opt_level_;
+  Status TrySimplify(NodeDef* node, string* simplified_node_name) override {
+    NodeDef* tail = GetTailOfValuePreservingChain(*node, *ctx().node_map,
+                                                  *ctx().nodes_to_preserve);
+
+    NodeDef* involution;
+    TF_RETURN_IF_ERROR(GetInputNode(tail->input(0), &involution));
+
+    if (involution->op() == node->op()) {
+      // Skip both *node and *involution since they cancel each other.
+      if (tail == node) {
+        // The two nodes to eliminate are adjacent.
+        *simplified_node_name = involution->input(0);
+      } else {
+        tail->set_input(0, involution->input(0));
+        ctx().node_map->UpdateInput(tail->name(), involution->name(),
+                                    involution->input(0));
+        *simplified_node_name = node->input(0);
+      }
+    }
+
+    return Status::OK();
+  }
 };
 
 // Remove redundant Bitcasts.
@@ -1819,6 +1767,723 @@ class SqrtDivToRsqrtMulStage : public ArithmeticOptimizerStage {
   }
 };
 
+// Bypass redundant reshape nodes:
+//
+//   Reshape                    Reshape  <-+
+//      ^                                  |
+//      |                                  |
+//   Reshape       becomes      Reshape    |
+//      ^                                  |
+//      |                                  |
+//    input                      input  ---+
+class RemoveRedundantReshape : public ArithmeticOptimizerStage {
+ public:
+  explicit RemoveRedundantReshape(const GraphOptimizerContext& ctx,
+                                  const ArithmeticOptimizerContext& ctx_ext)
+      : ArithmeticOptimizerStage("RemoveRedundantReshape", ctx, ctx_ext) {}
+  ~RemoveRedundantReshape() override = default;
+
+  bool IsSupported(const NodeDef* node) const override {
+    return IsReshape(*node);
+  }
+
+  Status TrySimplify(NodeDef* node, string* simplified_node_name) override {
+    NodeDef* input;
+    TF_RETURN_IF_ERROR(GetInputNode(node->input(0), &input));
+
+    // 1. Bypass reshape followed by reshape.
+    if (IsReshape(*input) && !HasControlInputs(*input)) {
+      node->set_input(0, input->input(0));
+      ctx().node_map->UpdateInput(node->name(), input->name(), input->input(0));
+      *simplified_node_name = node->name();
+      AddToOptimizationQueue(node);
+      return Status::OK();
+    }
+
+    // 2. If the reshape is a no-op, forward its input to its consumers, unless
+    // it anchors a control dependency since we want to make sure that control
+    // dependency is triggered.
+    if (ReshapeIsIdentity(*node) && !HasControlInputs(*node)) {
+      *simplified_node_name = node->input(0);
+      return Status::OK();
+    }
+
+    return Status::OK();
+  }
+
+ private:
+  // Returns whether `reshape` is an identity op.
+  bool ReshapeIsIdentity(const NodeDef& reshape) {
+    OpInfo::TensorProperties reshape_props;
+    OpInfo::TensorProperties input_props;
+
+    if (!GetTensorProperties(reshape.name(), &reshape_props).ok() ||
+        !GetTensorProperties(reshape.input(0), &input_props).ok()) {
+      return false;
+    }
+
+    return ShapesSymbolicallyEqual(input_props.shape(), reshape_props.shape());
+  }
+};
+
+// Reorder Cast and Transpose if beneficial.
+//
+// A common pattern after the layout optimizer is casting an uint8 NHWC
+// image to float before transposing it to NCHW. It is beneficial to reorder
+// the cast and the transpose to make the transpose process smaller amount
+// of data. This optimization converts
+//   Transpose(Cast(image, dst_type), perm)
+// to
+//   Cast(Transpose(image, perm), dst_type)
+// when sizeof(image.type) < sizeof(dst_type).
+//
+// TODO(jingyue): This optimization can be generalized to a cast followed by
+// a chain of ops that merely reorder elements (e.g. Reshape and
+// DepthToSpace).
+class ReorderCastAndTranspose : public ArithmeticOptimizerStage {
+ public:
+  explicit ReorderCastAndTranspose(const GraphOptimizerContext& ctx,
+                                   const ArithmeticOptimizerContext& ctx_ext)
+      : ArithmeticOptimizerStage("ReorderCastAndTranspose", ctx, ctx_ext) {}
+  ~ReorderCastAndTranspose() override = default;
+
+  bool IsSupported(const NodeDef* node) const override {
+    return IsTranspose(*node) && NodeIsOnCpuOrGpu(node);
+  }
+
+  Status TrySimplify(NodeDef* node, string* simplified_node_name) override {
+    const NodeDef* transpose = node;
+
+    // Verify that input to Transpose is the Cast op.
+    NodeDef* cast;
+    TF_RETURN_IF_ERROR(GetInputNode(transpose->input(0), &cast));
+    if (!IsCast(*cast)) return Status::OK();
+
+    // Input to the Cast-Transpose chain.
+    NodeDef* input;
+    TF_RETURN_IF_ERROR(GetInputNode(cast->input(0), &input));
+
+    const DataType src_type = GetSourceDataType(*cast);
+    const DataType dst_type = GetDestinationDataType(*cast);
+
+    const string src_type_name = DataTypeString(src_type);
+    const string dst_type_name = DataTypeString(dst_type);
+
+    // Check if nodes were not already optimized.
+    const string optimized_cast_name =
+        OptimizedNodeName(ParseNodeScopeAndName(cast->name()), dst_type_name);
+    const string optimized_transpose_name = OptimizedNodeName(
+        ParseNodeScopeAndName(transpose->name()), src_type_name);
+
+    bool is_already_optimized =
+        ctx().node_map->NodeExists(optimized_transpose_name) ||
+        ctx().node_map->NodeExists(optimized_cast_name);
+
+    if (IsNumberType(src_type) && IsNumberType(dst_type) &&
+        DataTypeSize(src_type) < DataTypeSize(dst_type) &&
+        !is_already_optimized) {
+      NodeDef* new_transpose = AddCopyNode(optimized_transpose_name, transpose);
+      (*new_transpose->mutable_attr())["T"].set_type(src_type);
+      new_transpose->set_input(0, cast->input(0));
+
+      ctx().node_map->AddOutput(input->name(), new_transpose->name());
+      ctx().node_map->AddOutput(NodeName(new_transpose->input(1)),
+                                new_transpose->name());
+
+      NodeDef* new_cast = AddCopyNode(optimized_cast_name, cast);
+      new_cast->set_input(0, new_transpose->name());
+      ctx().node_map->AddOutput(new_transpose->name(), new_cast->name());
+
+      AddToOptimizationQueue(new_transpose);
+      ForwardControlDependencies(new_transpose, {cast, node});
+
+      *simplified_node_name = new_cast->name();
+    }
+
+    return Status::OK();
+  }
+
+ private:
+  // This optimization can be dangerous on devices other than CPU and
+  // GPU. The transpose might not be implemented for image.type, or
+  // might be slower with image.type than with dst_type.
+  bool NodeIsOnCpuOrGpu(const NodeDef* node) const {
+    using str_util::StrContains;
+
+    string task;
+    string device;
+
+    return DeviceNameUtils::SplitDeviceName(node->device(), &task, &device) &&
+           (StrContains(device, DEVICE_CPU) || StrContains(device, DEVICE_GPU));
+  }
+
+  bool IsNumberType(DataType dtype) { return kNumberTypes.Contains(dtype); }
+};
+
+// Fold a multiply of a scalar into the following convolution. This folding
+// can jump across nodes that merely reorders data (such as reshape and
+// transpose). For example, we can optimize
+//
+//
+//         Conv2D                             Conv2D
+//        /      \                           /      \
+//    Transpose  weights*       ->     Transpose    Mul
+//       |                                |        /   \
+//      Mul                               |    weights  scale
+//     /   \                              |
+//   input  scale**                     input
+//
+//  *) weights must be a const
+// **) scale must be a const scalar
+//
+// When `weights` and `scale` are constant, `Mul` in the optimized graph can be
+// constant-folded, also weights tend to be smaller than the activations.
+//
+// TODO(jingyue): Fold scalar multiplies to Conv?DBackpropFilter and
+// Conv?DBackpropInput.
+class FoldMultiplyIntoConv : public ArithmeticOptimizerStage {
+ public:
+  explicit FoldMultiplyIntoConv(const GraphOptimizerContext& ctx,
+                                const ArithmeticOptimizerContext& ctx_ext)
+      : ArithmeticOptimizerStage("FoldMultiplyIntoConv", ctx, ctx_ext) {}
+  ~FoldMultiplyIntoConv() override = default;
+
+  bool IsSupported(const NodeDef* node) const override {
+    return IsConv2D(*node) || IsConv3D(*node);
+  }
+
+  Status TrySimplify(NodeDef* node, string* simplified_node_name) override {
+#define TF_RETURN_IF_TRUE(...) \
+  if ((__VA_ARGS__)) return Status::OK()
+
+    NodeDef* conv = node;
+
+    NodeDef* weights;
+    TF_RETURN_IF_ERROR(GetInputNode(conv->input(1), &weights));
+
+    // Fold the multiply to conv only when the weights are constant, so the
+    // multiply can be constant-folded.
+    //
+    // TODO(jingyue): When the weights aren't constant, this should also help
+    // performance a bit and memory usage a lot, since the weights tend to be
+    // smaller than the activations.
+    TF_RETURN_IF_TRUE(!IsConstant(*weights));
+
+    // Verify that this node was not already optimized.
+    const string scaled_weights_node_name =
+        OptimizedNodeName(ParseNodeScopeAndName(weights->name()),
+                          strings::StrCat("scaled", "_", conv->name()));
+
+    TF_RETURN_IF_TRUE(ctx().node_map->NodeExists(scaled_weights_node_name));
+
+    // Find the tail of value preserving chain entering the Conv node.
+    NodeDef* tail = GetTailOfValuePreservingChain(*conv, *ctx().node_map,
+                                                  *ctx().nodes_to_preserve);
+
+    NodeDef* source;
+    TF_RETURN_IF_ERROR(GetInputNode(tail->input(0), &source));
+
+    // Check that value preserving chain is the only consumer of the Mul output.
+    TF_RETURN_IF_TRUE(!IsMul(*source));
+    TF_RETURN_IF_TRUE(NumNonControlOutputs(*source, *ctx().node_map) != 1);
+
+    const NodeDef* mul = source;
+
+    // TODO(jingyue): handle the case where `scale` is 0-th operand.
+    NodeDef* scale;  // scalar multiplier fot the input tensor
+    NodeDef* input;
+    TF_RETURN_IF_ERROR(GetInputNode(mul->input(1), &scale));
+    TF_RETURN_IF_ERROR(GetInputNode(mul->input(0), &input));
+
+    // Check that 'scale * weight' can be const folded.
+    TF_RETURN_IF_TRUE(!IsConstant(*scale));
+    TF_RETURN_IF_TRUE(scale->attr().at("dtype").type() !=
+                      weights->attr().at("dtype").type());
+
+    // Check that `scale` is a scalar.
+    const TensorProto& scale_tensor = scale->attr().at("value").tensor();
+    bool scale_is_a_scalar = scale_tensor.has_tensor_shape() &&
+                             scale_tensor.tensor_shape().dim_size() == 0;
+    TF_RETURN_IF_TRUE(!scale_is_a_scalar);
+
+    // At this point all preconditions are met, and we safely do the rewrite.
+    VLOG(3) << "Fold multiply into conv: conv=" << conv->name()
+            << " mul=" << mul->name() << " weights=" << weights->name();
+
+    // Create new node `scaled_weights`.
+    NodeDef* scaled_weights = AddEmptyNode(scaled_weights_node_name);
+    scaled_weights->set_op("Mul");
+    scaled_weights->set_device(weights->device());
+    (*scaled_weights->mutable_attr())["T"] = weights->attr().at("dtype");
+    AddToOptimizationQueue(scaled_weights);
+
+    // Link in its inputs.
+    scaled_weights->add_input(conv->input(1));
+    ctx().node_map->AddOutput(weights->name(), scaled_weights->name());
+    scaled_weights->add_input(mul->input(1));
+    ctx().node_map->AddOutput(scale->name(), scaled_weights->name());
+    ForwardControlDependencies(scaled_weights, {source});
+
+    // Update `conv`'s weights to `scaled_weights`.
+    conv->set_input(1, scaled_weights->name());
+    ctx().node_map->UpdateInput(conv->name(), weights->name(),
+                                scaled_weights->name());
+    AddToOptimizationQueue(conv);
+
+    // Update `tail` node to bypass `mul` because it's folded to the weights.
+    tail->set_input(0, mul->input(0));
+    ctx().node_map->UpdateInput(tail->name(), mul->name(), input->name());
+    AddToOptimizationQueue(tail);
+    *simplified_node_name = conv->name();
+
+    return Status::OK();
+#undef TF_RETURN_IF_TRUE
+  }
+};
+
+// Fold Transpose into matrix multiplication.
+class FoldTransposeIntoMatMul : public ArithmeticOptimizerStage {
+ public:
+  explicit FoldTransposeIntoMatMul(const GraphOptimizerContext& ctx,
+                                   const ArithmeticOptimizerContext& ctx_ext)
+      : ArithmeticOptimizerStage("FoldTransposeIntoMatMul", ctx, ctx_ext) {}
+  ~FoldTransposeIntoMatMul() override = default;
+
+  bool IsSupported(const NodeDef* node) const override {
+    return IsMatMul(*node);
+  }
+
+  Status TrySimplify(NodeDef* node, string* simplified_node_name) override {
+    const NodeScopeAndName matmul = ParseNodeScopeAndName(node->name());
+    const string optimized_node_name = OptimizedNodeName(matmul);
+    if (ctx().node_map->NodeExists(optimized_node_name)) return Status::OK();
+
+    NodeDef* a;
+    NodeDef* b;
+    TF_RETURN_IF_ERROR(GetInputNode(node->input(0), &a));
+    TF_RETURN_IF_ERROR(GetInputNode(node->input(1), &b));
+
+    bool is_complex = false;
+    if (node->op() != "SparseMatMul") {
+      const DataType type = GetDataTypeFromAttr(*node, "T");
+      is_complex = (type == DT_COMPLEX64) || (type == DT_COMPLEX128);
+    }
+
+    const std::set<string> foldable_transpose_ops =
+        !is_complex ? std::set<string>{"ConjugateTranspose", "Transpose"}
+                    : (node->op() == "BatchMatMul"
+                           ? std::set<string>{"ConjugateTranspose"}
+                           : std::set<string>{"Transpose"});
+
+    const bool a_is_foldable = foldable_transpose_ops.count(a->op()) > 0 &&
+                               IsInnerMatrixTransposeNode(*a, ctx().node_map);
+    const bool b_is_foldable = foldable_transpose_ops.count(b->op()) > 0 &&
+                               IsInnerMatrixTransposeNode(*b, ctx().node_map);
+    if (!a_is_foldable && !b_is_foldable) return Status::OK();
+
+    NodeDef* new_op = AddCopyNode(optimized_node_name, node);
+
+    if (a_is_foldable) {
+      const string attr_a =
+          node->op() == "BatchMatMul" ? "adj_x" : "transpose_a";
+      FlipBooleanAttr(attr_a, new_op);
+      new_op->set_input(0, a->input(0));
+      ctx().node_map->UpdateInput(new_op->name(), a->name(), a->input(0));
+    }
+
+    if (b_is_foldable) {
+      const string attr_b =
+          node->op() == "BatchMatMul" ? "adj_y" : "transpose_b";
+      FlipBooleanAttr(attr_b, new_op);
+      new_op->set_input(1, b->input(0));
+      ctx().node_map->UpdateInput(new_op->name(), b->name(), b->input(0));
+    }
+
+    std::vector<const NodeDef*> deps_to_forward = {node};
+    if (a_is_foldable) deps_to_forward.push_back(a);
+    if (b_is_foldable) deps_to_forward.push_back(b);
+    ForwardControlDependencies(new_op, deps_to_forward);
+
+    return Status::OK();
+  }
+
+ private:
+  void FlipBooleanAttr(const string& attr_name, NodeDef* node) {
+    const bool old_value =
+        !node->attr().count(attr_name) ? false : node->attr().at(attr_name).b();
+    (*node->mutable_attr())[attr_name].set_b(!old_value);
+  }
+
+  template <typename T>
+  bool IsInnerMatrixTranspose(const std::vector<T>& perm) {
+    const T n = perm.size();
+    if (n < 2) {
+      return false;
+    }
+    for (T i = 0; i < n - 2; ++i) {
+      if (perm[i] != i) {
+        return false;
+      }
+    }
+    return perm[n - 1] == n - 2 && perm[n - 2] == n - 1;
+  }
+
+  bool IsInnerMatrixTransposeNode(const NodeDef& transpose_node,
+                                  const NodeMap* node_map) {
+    if (transpose_node.op() != "Transpose" &&
+        transpose_node.op() != "ConjugateTranspose") {
+      return false;
+    }
+    const NodeDef* perm_node = node_map->GetNode(transpose_node.input(1));
+    std::vector<int> perm32;
+    if (ValuesFromConstNode(*perm_node, &perm32)) {
+      return IsInnerMatrixTranspose(perm32);
+    }
+    std::vector<int64> perm64;
+    if (ValuesFromConstNode(*perm_node, &perm64)) {
+      return IsInnerMatrixTranspose(perm64);
+    }
+    return false;
+  }
+};
+
+// Fold Transpose into matrix multiplication.
+class FoldConjugateIntoTranspose : public ArithmeticOptimizerStage {
+ public:
+  explicit FoldConjugateIntoTranspose(const GraphOptimizerContext& ctx,
+                                      const ArithmeticOptimizerContext& ctx_ext)
+      : ArithmeticOptimizerStage("FoldConjugateIntoTranspose", ctx, ctx_ext) {}
+  ~FoldConjugateIntoTranspose() override = default;
+
+  bool IsSupported(const NodeDef* node) const override {
+    return IsConj(*node) || IsTranspose(*node) || IsConjugateTranspose(*node);
+  }
+
+  Status TrySimplify(NodeDef* node, string* simplified_node_name) override {
+    const NodeScopeAndName matmul = ParseNodeScopeAndName(node->name());
+    const string optimized_node_name = OptimizedNodeName(matmul);
+    if (ctx().node_map->NodeExists(optimized_node_name)) return Status::OK();
+
+    NodeDef* input;
+    TF_RETURN_IF_ERROR(GetInputNode(node->input(0), &input));
+
+    const NodeDef* transpose_op = node->op() == "Conj" ? input : node;
+    const NodeDef* conj_op = node->op() == "Conj" ? node : input;
+
+    if ((IsTranspose(*transpose_op) || IsConjugateTranspose(*transpose_op)) &&
+        IsConj(*conj_op)) {
+      NodeDef* new_op = AddCopyNode(optimized_node_name, transpose_op);
+
+      // Flip the type of transpose op to absorb the conjugation.
+      new_op->set_op(transpose_op->op() == "Transpose" ? "ConjugateTranspose"
+                                                       : "Transpose");
+      new_op->set_input(0, input->input(0));
+      ctx().node_map->UpdateInput(new_op->name(), node->name(),
+                                  input->input(0));
+      ForwardControlDependencies(new_op, {node, input});
+      *simplified_node_name = new_op->name();
+    }
+
+    return Status::OK();
+  }
+};
+
+// Replace Mul node with identical inputs with a Square.
+class ReplaceMulWithSquare : public ArithmeticOptimizerStage {
+ public:
+  explicit ReplaceMulWithSquare(const GraphOptimizerContext& ctx,
+                                const ArithmeticOptimizerContext& ctx_ext)
+      : ArithmeticOptimizerStage("ReplaceMulWithSquare", ctx, ctx_ext) {}
+  ~ReplaceMulWithSquare() override = default;
+
+  bool IsSupported(const NodeDef* node) const override {
+    return IsMul(*node) && node->input(0) == node->input(1);
+  }
+
+  Status TrySimplify(NodeDef* node, string* simplified_node_name) override {
+    const NodeScopeAndName mul = ParseNodeScopeAndName(node->name());
+    const string optimized_node_name = OptimizedNodeName(mul);
+    if (ctx().node_map->NodeExists(optimized_node_name)) return Status::OK();
+
+    const DataType type = GetDataTypeFromAttr(*node, "T");
+    bool is_complex = (type == DT_COMPLEX64) || (type == DT_COMPLEX128);
+
+    string task;
+    string device;
+    bool is_on_cpu =
+        DeviceNameUtils::SplitDeviceName(node->device(), &task, &device) &&
+        str_util::StrContains(device, DEVICE_CPU);
+
+    if (!is_complex || is_on_cpu) {
+      NodeDef* new_square_node = AddCopyNode(optimized_node_name, node);
+      new_square_node->set_op("Square");
+      for (int i = 1; i < new_square_node->input_size(); ++i) {
+        new_square_node->set_input(i - 1, new_square_node->input(i));
+      }
+      new_square_node->mutable_input()->RemoveLast();
+      for (const string& input : new_square_node->input()) {
+        ctx().node_map->AddOutput(NodeName(input), new_square_node->name());
+      }
+      *simplified_node_name = new_square_node->name();
+    }
+
+    return Status::OK();
+  }
+};
+
+// Simplify aggregation (e.g. AddN) nodes:
+//
+// 1. Discard aggregate nodes with a single input and no control dependencies.
+//
+// 2. Try to rewrite aggregations of N >= 2 identical terms (possibly due to
+//    deduping or other rewrites) so we can get rid of the sum entirely.
+//
+//    The expression (using AddN as an example of an aggregate op):
+//      AddN(x, x, x, ... ,x)
+//           <-- N terms -->
+//    can be rewritten to:
+//      Mul(Const(N), x))
+//
+class SimplifyAggregation : public ArithmeticOptimizerStage {
+ public:
+  explicit SimplifyAggregation(const GraphOptimizerContext& ctx,
+                               const ArithmeticOptimizerContext& ctx_ext)
+      : ArithmeticOptimizerStage("SimplifyAggregation", ctx, ctx_ext) {}
+  ~SimplifyAggregation() override = default;
+
+  bool IsSupported(const NodeDef* node) const override {
+    return IsAggregate(*node) && NumNonControlInputs(*node) > 0;
+  }
+
+  Status TrySimplify(NodeDef* node, string* simplified_node_name) override {
+    // 1. Discard aggregate nodes with a single input and no control deps.
+    if (node->input_size() == 1) {
+      *simplified_node_name = node->input(0);
+      return Status::OK();
+    }
+
+    // 2. Rewrite aggregations of N >= 2 identical terms.
+
+    // All non-control inputs must be identical.
+    bool all_equal = true;
+    int num_inputs = 1;
+    for (int i = 1; i < node->input_size(); ++i) {
+      if (IsControlInput(node->input(i))) break;
+      ++num_inputs;
+      if (node->input(i) != node->input(0)) {
+        all_equal = false;
+        break;
+      }
+    }
+    if (!all_equal) return Status::OK();
+
+    // And node should not be optimized earlier.
+    const NodeScopeAndName node_scope_and_name =
+        ParseNodeScopeAndName(node->name());
+    const string optimized_const_name =
+        OptimizedNodeName(node_scope_and_name, "Const");
+    const string optimized_mul_name =
+        OptimizedNodeName(node_scope_and_name, "Mul");
+
+    bool is_already_optimized =
+        ctx().node_map->NodeExists(optimized_const_name) ||
+        ctx().node_map->NodeExists(optimized_mul_name);
+
+    if (is_already_optimized) return Status::OK();
+
+    // At this point all preconditions are met, and we safely do the rewrite.
+    VLOG(3) << "Simplify aggregation with identical inputs: node="
+            << node->name() << " num_inputs=" << num_inputs;
+
+    // 1. Create constant node with value N.
+    const auto type = GetDataTypeFromAttr(*node, "T");
+    Tensor t(type, TensorShape({}));
+    Status status = SetTensorValue(type, num_inputs, &t);
+    if (!status.ok()) {
+      return errors::Internal("Failed to create const node: ",
+                              status.error_message());
+    }
+
+    TensorValue value(&t);
+    NodeDef* new_const_node = AddEmptyNode(optimized_const_name);
+    status = ConstantFolding::CreateNodeDef(new_const_node->name(), value,
+                                            new_const_node);
+    if (!status.ok()) {
+      return errors::Internal("Failed to create const node: ",
+                              status.error_message());
+    }
+    new_const_node->set_device(node->device());
+    MaybeAddControlInput(NodeName(node->input(0)), new_const_node,
+                         ctx().optimized_graph, ctx().node_map);
+    AddToOptimizationQueue(new_const_node);
+
+    // 2. Replace the aggregate node with Mul(Const(N), x).
+    NodeDef* new_mul_node = AddEmptyNode(optimized_mul_name);
+    new_mul_node->set_op("Mul");
+    new_mul_node->set_device(node->device());
+    SetDataTypeToAttr(type, "T", new_mul_node);
+    new_mul_node->add_input(new_const_node->name());
+    ctx().node_map->AddOutput(new_const_node->name(), new_mul_node->name());
+    new_mul_node->add_input(node->input(0));
+    ctx().node_map->AddOutput(node->input(0), new_mul_node->name());
+
+    ForwardControlDependencies(new_mul_node, {node});
+    *simplified_node_name = new_mul_node->name();
+
+    return Status::OK();
+  }
+};
+
+class ConvertPowStage : public ArithmeticOptimizerStage {
+ public:
+  explicit ConvertPowStage(const GraphOptimizerContext& ctx,
+                           const ArithmeticOptimizerContext& ctx_ext)
+      : ArithmeticOptimizerStage("ConvertPow", ctx, ctx_ext) {}
+
+  bool IsSupported(const NodeDef* node) const override {
+    return IsPow(*node) &&
+           ctx().graph_properties->GetInputProperties(node->name()).size() == 2;
+  }
+
+  Status TrySimplify(NodeDef* node, string* simplified_node_name) override {
+    const auto& p = ctx().graph_properties->GetInputProperties(node->name())[1];
+    for (int i = 0; i < p.shape().dim_size(); ++i) {
+      if (p.shape().dim(i).size() < 0) {
+        // skip if p is is not fully defined.
+        return Status::OK();
+      }
+    }
+    if (TensorShape::IsValid(p.shape()) && p.has_value()) {
+      Tensor pow(p.dtype(), p.shape());
+      if (!pow.FromProto(p.value())) {
+        return errors::InvalidArgument("Cannot parse tensor from proto: ",
+                                       p.value().DebugString());
+      }
+
+      complex128 prev, curr;
+      for (int i = 0; i < pow.NumElements(); ++i) {
+        TF_RETURN_IF_ERROR(GetElement(pow, i, &curr));
+        if (i != 0 && curr != prev) {
+          // pow has different values on different elements. Skip.
+          return Status::OK();
+        }
+        prev = curr;
+      }
+      NodeDef *x, *y;
+      TF_RETURN_IF_ERROR(GetInputNode(node->input(0), &x));
+      TF_RETURN_IF_ERROR(GetInputNode(node->input(1), &y));
+      if (curr == complex128(2, 0)) {
+        node->set_op("Square");
+        node->set_input(1, AsControlDependency(y->name()));
+        AddToOptimizationQueue(node);
+        AddToOptimizationQueue(y);
+      } else if (curr == complex128(1, 0)) {
+        node->set_op("Identity");
+        node->set_input(1, AsControlDependency(y->name()));
+        AddToOptimizationQueue(node);
+        AddToOptimizationQueue(y);
+      } else if (curr == complex128(0.5, 0)) {
+        node->set_op("Sqrt");
+        node->set_input(1, AsControlDependency(y->name()));
+        AddToOptimizationQueue(node);
+        AddToOptimizationQueue(y);
+      } else if (curr == complex128(0, 0)) {
+        const auto& b =
+            ctx().graph_properties->GetInputProperties(node->name())[0];
+        for (int i = 0; i < b.shape().dim_size(); ++i) {
+          if (b.shape().dim(i).size() < 0) {
+            // skip if b is is not fully defined.
+            return Status::OK();
+          }
+        }
+        if (TensorShape::IsValid(b.shape()) && b.has_value()) {
+          Tensor base(b.dtype(), b.shape());
+          if (!base.FromProto(b.value())) {
+            return errors::InvalidArgument("Cannot parse tensor from proto: ",
+                                           b.value().DebugString());
+          }
+          node->set_op("Const");
+          Tensor c(base.dtype(), base.shape());
+          for (int i = 0; i < c.NumElements(); ++i) {
+            TF_RETURN_IF_ERROR(SetElementToOne(i, &c));
+          }
+          (*node->mutable_attr())["dtype"].set_type(base.dtype());
+          c.AsProtoTensorContent(
+              (*node->mutable_attr())["value"].mutable_tensor());
+          node->mutable_attr()->erase("T");
+          node->set_input(0, AsControlDependency(x->name()));
+          node->set_input(1, AsControlDependency(y->name()));
+          AddToOptimizationQueue(node);
+          AddToOptimizationQueue(x);
+          AddToOptimizationQueue(y);
+        }
+      } else if (curr == complex128(-0.5, 0)) {
+        node->set_op("Rsqrt");
+        node->set_input(1, AsControlDependency(y->name()));
+        AddToOptimizationQueue(node);
+        AddToOptimizationQueue(y);
+      } else if (curr == complex128(-1, 0)) {
+        node->set_op("Reciprocal");
+        node->set_input(1, AsControlDependency(y->name()));
+        AddToOptimizationQueue(node);
+        AddToOptimizationQueue(y);
+      }
+    }
+    return Status::OK();
+  }
+
+ private:
+  Status GetElement(const Tensor& t, int i, complex128* element) {
+    switch (t.dtype()) {
+      case DT_INT32:
+        *element = complex128(t.flat<int32>()(i));
+        return Status::OK();
+      case DT_INT64:
+        *element = complex128(t.flat<int64>()(i));
+        return Status::OK();
+      case DT_FLOAT:
+        *element = complex128(t.flat<float>()(i));
+        return Status::OK();
+      case DT_DOUBLE:
+        *element = complex128(t.flat<double>()(i));
+        return Status::OK();
+      case DT_COMPLEX64:
+        *element = complex128(t.flat<complex64>()(i));
+        return Status::OK();
+      case DT_COMPLEX128:
+        *element = t.flat<complex128>()(i);
+        return Status::OK();
+      default:
+        return errors::InvalidArgument("Invalid data type: ", t.dtype());
+    }
+  }
+
+  Status SetElementToOne(int i, Tensor* t) {
+    switch (t->dtype()) {
+      case DT_INT32:
+        t->flat<int32>()(i) = 1;
+        return Status::OK();
+      case DT_INT64:
+        t->flat<int64>()(i) = 1L;
+        return Status::OK();
+      case DT_FLOAT:
+        t->flat<float>()(i) = 1.0f;
+        return Status::OK();
+      case DT_DOUBLE:
+        t->flat<double>()(i) = 1.0;
+        return Status::OK();
+      case DT_COMPLEX64:
+        t->flat<complex64>()(i) = complex64(1);
+        return Status::OK();
+      case DT_COMPLEX128:
+        t->flat<complex128>()(i) = complex128(1);
+        return Status::OK();
+      default:
+        return errors::InvalidArgument("Invalid data type: ", t->dtype());
+    }
+  }
+};
+
 }  // namespace
 
 class UniqueNodes {
@@ -1917,33 +2582,6 @@ bool UniqueNodes::SameNode(const NodeDef& node1, const NodeDef& node2) const {
   return true;
 }
 
-NodeDef* ArithmeticOptimizer::AddNode(const NodeDef& node, StringPiece suffix,
-                                      bool copy_node) {
-  return AddNode(OptimizedNodeName(node, suffix), copy_node ? &node : nullptr);
-}
-
-NodeDef* ArithmeticOptimizer::AddNode(const string& name,
-                                      const NodeDef* node_to_copy) {
-  NodeDef* new_node = optimized_graph_->add_node();
-  node_map_->AddNode(NodeName(name), new_node);
-  if (node_to_copy != nullptr) {
-    *new_node = *node_to_copy;
-  }
-  new_node->set_name(name);
-  return new_node;
-}
-
-string ArithmeticOptimizer::OptimizedNodeName(const NodeDef& node,
-                                              StringPiece suffix) const {
-  return AddPrefixToNodeName(strings::StrCat(node.name(), "_", suffix),
-                             kArithmeticOptimizer);
-}
-
-bool ArithmeticOptimizer::OptimizedNodeExists(const NodeDef& node,
-                                              StringPiece suffix) const {
-  return node_map_->NodeExists(OptimizedNodeName(node, suffix));
-}
-
 namespace {
 
 bool FeedsInPlaceOp(const SimpleGraphView& graph_view, const NodeDef& node) {
@@ -2067,380 +2705,6 @@ void ArithmeticOptimizer::ForwardControlDependencies(
   DedupControlInputs(target_node);
 }
 
-// TODO(ezhulenev): extract each individual simplify rewrite into separate
-// ArithmeticOptimizerStage
-string ArithmeticOptimizer::TrySimplifyAndReplaceUses(
-    const NodeDef* node, SetVector<NodeDef*>* nodes_to_simplify) {
-  // Remove involutions applied twice.
-  if (IsInvolution(*node)) {
-    // An involution is an element-wise function f(x) that is its own inverse,
-    // i.e. f(f(x)) = x. If we can find a chain of ops
-    //   f->op1->op2->...opn->f
-    // where op1 through opn preserve the values of their inputs, we can remove
-    // the two instances of the involution from the graph, since they cancel
-    // each other.
-    NodeDef* tail =
-        GetTailOfValuePreservingChain(*node, *node_map_, nodes_to_preserve_);
-    NodeDef* involution = node_map_->GetNode(tail->input(0));
-    if (involution->op() == node->op()) {
-      // Skip both *node and *involution since they cancel each other.
-      if (tail == node) {
-        // The two nodes to eliminate are adjacent.
-        return involution->input(0);
-      } else {
-        tail->set_input(0, involution->input(0));
-        node_map_->UpdateInput(tail->name(), involution->name(),
-                               involution->input(0));
-        return node->input(0);
-      }
-    }
-  }
-
-  if (node->op() == "Reshape") {
-    //   Reshape
-    //      ^
-    //      |
-    //   Reshape
-    //      ^
-    //      |
-    //    input
-    //
-    // becomes
-    //
-    //   Reshape <-+
-    //             |
-    //   Reshape   |
-    //      ^      |
-    //      |      |
-    //    input ---+
-    NodeDef* reshape = const_cast<NodeDef*>(node);
-    int output_pos = 0;
-    string input_node_name = ParseNodeName(reshape->input(0), &output_pos);
-    const NodeDef* input = node_map_->GetNode(input_node_name);
-    if (input->op() == "Reshape" && !HasControlInputs(*input)) {
-      reshape->set_input(0, input->input(0));
-      node_map_->UpdateInput(reshape->name(), input->name(), input->input(0));
-      nodes_to_simplify->PushBack(reshape);
-      return reshape->name();
-    }
-
-    // If the reshape is a no-op, forward its input to its consumers, unless it
-    // anchors a control dependency since we want to make sure that control
-    // dependency is triggered.
-    if (ReshapeIsIdentity(*reshape, *input, output_pos, *graph_properties_) &&
-        !HasControlInputs(*reshape)) {
-      return reshape->input(0);
-    }
-  }
-
-  if (node->op() == "Transpose") {
-    // Reorder Cast and Transpose if beneficial.
-    //
-    // A common pattern after the layout optimizer is casting an uint8 NHWC
-    // image to float before transposing it to NCHW. It is beneficial to reorder
-    // the cast and the transpose to make the transpose process smaller amount
-    // of data. This optimization converts
-    //   Transpose(Cast(image, dst_type), perm)
-    // to
-    //   Cast(Transpose(image, perm), dst_type)
-    // when sizeof(image.type) < sizeof(dst_type).
-    //
-    // TODO(jingyue): This optimization can be generalized to a cast followed by
-    // a chain of ops that merely reorder elements (e.g. Reshape and
-    // DepthToSpace).
-    const NodeDef* transpose = node;
-    string dontcare;
-    string device;
-    // This optimization can be dangerous on devices other than CPU and GPU. The
-    // transpose might not be implemented for image.type, or might be slower
-    // with image.type than with dst_type.
-    if (DeviceNameUtils::SplitDeviceName(transpose->device(), &dontcare,
-                                         &device) &&
-        (str_util::StrContains(device, DEVICE_CPU) ||
-         str_util::StrContains(device, DEVICE_GPU))) {
-      const NodeDef* cast = node_map_->GetNode(transpose->input(0));
-      if (cast->op() == "Cast") {
-        const NodeDef* input = node_map_->GetNode(cast->input(0));
-        const DataType src_type = GetSourceDataType(*cast);
-        const DataType dst_type = GetDestinationDataType(*cast);
-        if (IsNumberType(src_type) && IsNumberType(dst_type) &&
-            DataTypeSize(src_type) < DataTypeSize(dst_type) &&
-            !OptimizedNodeExists(*cast, DataTypeString(dst_type)) &&
-            !OptimizedNodeExists(*transpose, DataTypeString(src_type))) {
-          NodeDef* new_transpose = AddNode(*transpose, DataTypeString(src_type),
-                                           /*copy_node=*/true);
-          (*new_transpose->mutable_attr())["T"].set_type(src_type);
-          new_transpose->set_input(0, cast->input(0));
-          node_map_->AddOutput(input->name(), new_transpose->name());
-          node_map_->AddOutput(NodeName(new_transpose->input(1)),
-                               new_transpose->name());
-
-          NodeDef* new_cast =
-              AddNode(*cast, DataTypeString(dst_type), /*copy_node=*/true);
-          new_cast->set_input(0, new_transpose->name());
-          node_map_->AddOutput(new_transpose->name(), new_cast->name());
-
-          nodes_to_simplify->PushBack(new_transpose);
-          ForwardControlDependencies(new_transpose, {cast, node});
-          return new_cast->name();
-        }
-      }
-    }
-  }
-
-  // Fold a multiply of a scalar into the following convolution. This folding
-  // can jump across nodes that merely reorders data (such as reshape and
-  // transpose). For example, we can optimize
-  //
-  //
-  //         Conv2D
-  //        /      \
-  //    Transpose  weights
-  //       |
-  //      Mul
-  //     /   \
-  //   inputs 255.0
-  //
-  // to
-  //
-  //         Conv2D
-  //        /      \
-  //    Transpose   Mul
-  //       |       /   \
-  //       |   weights  255.0
-  //       |
-  //     inputs
-  //
-  // when `weights` are constant. `Mul` in the optimized graph can be
-  // constant-folded.
-  //
-  // TODO(jingyue): Fold scalar multiplies to Conv?DBackpropFilter and
-  // Conv?DBackpropInput.
-  if (node->op() == "Conv2D" || node->op() == "Conv3D") {
-    NodeDef* conv = const_cast<NodeDef*>(node);
-    const NodeDef* weights = node_map_->GetNode(NodeName(conv->input(1)));
-    // Fold the multiply to conv only when the weights are constant, so the
-    // multiply can be constant-folded. TODO(jingyue): When the weights aren't
-    // constant, this should also help performance a bit and memory usage a lot,
-    // since the weights tend to be smaller than the activations.
-    if (weights->op() == "Const" &&
-        !OptimizedNodeExists(*weights, StrCat("scaled_", conv->name()))) {
-      const NodeDef* source = node_map_->GetNode(
-          GetTailOfValuePreservingChain(*node, *node_map_, nodes_to_preserve_)
-              ->input(0));
-      if (source->op() == "Mul" &&
-          node_map_->GetOutputs(source->name()).size() == 1) {
-        const NodeDef* mul = source;
-        // `scale` is the scalar multiplier, and `other` is the other operand.
-        // TODO(jingyue): handle the case where `scale` is 0-th operand.
-        const NodeDef* scale = node_map_->GetNode(mul->input(1));
-        const NodeDef* other = node_map_->GetNode(mul->input(0));
-        if (scale->op() == "Const" && scale->attr().at("dtype").type() ==
-                                          weights->attr().at("dtype").type()) {
-          const TensorProto& scale_tensor = scale->attr().at("value").tensor();
-          // Test whether `scale` is a scalar.
-          if (scale_tensor.has_tensor_shape() &&
-              scale_tensor.tensor_shape().dim_size() == 0) {
-            // Create new node `scaled_weights`.
-            NodeDef* scaled_weights = AddNode(
-                *weights, StrCat("scaled_", conv->name()), /*copy_node=*/false);
-            scaled_weights->set_op("Mul");
-            scaled_weights->set_device(weights->device());
-            (*scaled_weights->mutable_attr())["T"] =
-                weights->attr().at("dtype");
-            nodes_to_simplify->PushBack(scaled_weights);
-
-            // Link in its inputs.
-            scaled_weights->add_input(conv->input(1));
-            node_map_->AddOutput(weights->name(), scaled_weights->name());
-            scaled_weights->add_input(mul->input(1));
-            node_map_->AddOutput(scale->name(), scaled_weights->name());
-            ForwardControlDependencies(scaled_weights, {source});
-
-            // Update `conv`'s weights to `scaled_weights`.
-            conv->set_input(1, scaled_weights->name());
-            node_map_->UpdateInput(conv->name(), weights->name(),
-                                   scaled_weights->name());
-            nodes_to_simplify->PushBack(conv);
-
-            // Update `mul`'s consumer to bypass `mul` because it's folded to
-            // the weights.
-            CHECK_EQ(node_map_->GetOutputs(mul->name()).size(), 1);
-            NodeDef* consumer_of_mul =
-                *node_map_->GetOutputs(mul->name()).begin();
-            consumer_of_mul->set_input(0, mul->input(0));
-            node_map_->UpdateInput(consumer_of_mul->name(), mul->name(),
-                                   other->name());
-            nodes_to_simplify->PushBack(consumer_of_mul);
-            return conv->name();
-          }
-        }
-      }
-    }
-  }
-
-  if (node->op() == "Mul" && node->input(0) == node->input(1) &&
-      !OptimizedNodeExists(*node, "square")) {
-    const DataType type = GetDataTypeFromAttr(*node, "T");
-    bool is_complex = (type == DT_COMPLEX64) || (type == DT_COMPLEX128);
-    string dontcare;
-    string device;
-    bool is_on_cpu =
-        DeviceNameUtils::SplitDeviceName(node->device(), &dontcare, &device) &&
-        str_util::StrContains(device, DEVICE_CPU);
-    if (!is_complex || is_on_cpu) {
-      NodeDef* new_square_node = AddNode(*node, "square", /*copy_node=*/true);
-      new_square_node->set_op("Square");
-      for (int i = 1; i < new_square_node->input_size(); ++i) {
-        new_square_node->set_input(i - 1, new_square_node->input(i));
-      }
-      new_square_node->mutable_input()->RemoveLast();
-      for (const string& input : new_square_node->input()) {
-        node_map_->AddOutput(NodeName(input), new_square_node->name());
-      }
-      return new_square_node->name();
-    }
-  }
-
-  if (IsAggregate(*node) && NumNonControlInputs(*node) > 0) {
-    // Discard aggregate nodes with a single input and no control dependencies.
-    if (node->input_size() == 1) {
-      return node->input(0);
-    }
-
-    // Try to rewrite aggregations of N >= 2 identical terms (possibly due
-    // to deduping or other rewrites) so we can get rid of the sum entirely.
-    // The expression (using AddN as an example of an aggregate op):
-    //   AddN(x, x, x, ... ,x)
-    //        <-- N terms -->
-    // can be rewritten to
-    //   Mul(Const(N), x))
-    //
-    bool all_equal = true;
-    int num_inputs = 1;
-    for (int i = 1; i < node->input_size(); ++i) {
-      if (IsControlInput(node->input(i))) {
-        break;
-      }
-      ++num_inputs;
-      if (node->input(i) != node->input(0)) {
-        all_equal = false;
-        break;
-      }
-    }
-    if (all_equal && !OptimizedNodeExists(*node, "const") &&
-        !OptimizedNodeExists(*node, "mul")) {
-      // 1. Create constant node with value N.
-      const auto type = GetDataTypeFromAttr(*node, "T");
-      Tensor t(type, TensorShape({}));
-      Status status = SetTensorValue(type, num_inputs, &t);
-      if (!status.ok()) {
-        LOG(WARNING) << "Failed to create const node: "
-                     << status.error_message();
-        return "";
-      }
-      TensorValue value(&t);
-      NodeDef* new_const_node = AddNode(*node, "const", /*copy_node=*/false);
-      status = ConstantFolding::CreateNodeDef(new_const_node->name(), value,
-                                              new_const_node);
-      if (!status.ok()) {
-        LOG(WARNING) << "Failed to create const node: "
-                     << status.error_message();
-        return "";
-      }
-      new_const_node->set_device(node->device());
-      MaybeAddControlInput(NodeName(node->input(0)), new_const_node,
-                           optimized_graph_, node_map_.get());
-      nodes_to_simplify->PushBack(new_const_node);
-
-      // 2. Replace the aggregate node with Mul(Const(N), x).
-      NodeDef* new_mul_node = AddNode(*node, "mul", /*copy_node=*/false);
-      new_mul_node->set_op("Mul");
-      new_mul_node->set_device(node->device());
-      SetDataTypeToAttr(type, "T", new_mul_node);
-      new_mul_node->add_input(new_const_node->name());
-      node_map_->AddOutput(new_const_node->name(), new_mul_node->name());
-      new_mul_node->add_input(node->input(0));
-      node_map_->AddOutput(node->input(0), new_mul_node->name());
-
-      ForwardControlDependencies(new_mul_node, {node});
-      return new_mul_node->name();
-    }
-  }
-
-  // Fold Transpose into matrix multiplication.
-  if ((node->op() == "MatMul" || node->op() == "SparseMatMul" ||
-       node->op() == "BatchMatMul") &&
-      !OptimizedNodeExists(*node, "fused")) {
-    const NodeDef* a = node_map_->GetNode(node->input(0));
-    const NodeDef* b = node_map_->GetNode(node->input(1));
-    bool is_complex = false;
-    if (node->op() != "SparseMatMul") {
-      const DataType type = GetDataTypeFromAttr(*node, "T");
-      is_complex = (type == DT_COMPLEX64) || (type == DT_COMPLEX128);
-    }
-    const std::set<string> foldable_transpose_ops =
-        !is_complex ? std::set<string>{"ConjugateTranspose", "Transpose"}
-                    : (node->op() == "BatchMatMul"
-                           ? std::set<string>{"ConjugateTranspose"}
-                           : std::set<string>{"Transpose"});
-    const bool a_is_foldable = foldable_transpose_ops.count(a->op()) > 0 &&
-                               IsInnerMatrixTransposeNode(*a, node_map_.get());
-    const bool b_is_foldable = foldable_transpose_ops.count(b->op()) > 0 &&
-                               IsInnerMatrixTransposeNode(*b, node_map_.get());
-    if (a_is_foldable || b_is_foldable) {
-      NodeDef* new_op = AddNode(*node, "fused", /*copy_node=*/true);
-      if (a_is_foldable) {
-        const string attr_a =
-            node->op() == "BatchMatMul" ? "adj_x" : "transpose_a";
-        FlipBooleanAttr(attr_a, new_op);
-        new_op->set_input(0, a->input(0));
-        node_map_->UpdateInput(new_op->name(), a->name(), a->input(0));
-      }
-      if (b_is_foldable) {
-        const string attr_b =
-            node->op() == "BatchMatMul" ? "adj_y" : "transpose_b";
-        FlipBooleanAttr(attr_b, new_op);
-        new_op->set_input(1, b->input(0));
-        node_map_->UpdateInput(new_op->name(), b->name(), b->input(0));
-      }
-      std::vector<const NodeDef*> deps_to_forward({node});
-      if (a_is_foldable) {
-        deps_to_forward.push_back(a);
-      }
-      if (b_is_foldable) {
-        deps_to_forward.push_back(b);
-      }
-      ForwardControlDependencies(new_op, deps_to_forward);
-    }
-  }
-
-  // Fold Conj into Transpose or ConjugateTranspose.
-  if ((node->op() == "Conj" || node->op() == "Transpose" ||
-       node->op() == "ConjugateTranspose") &&
-      !OptimizedNodeExists(*node, "fused")) {
-    const NodeDef* input = node_map_->GetNode(node->input(0));
-    const NodeDef* transpose_op = node->op() == "Conj" ? input : node;
-    const NodeDef* conj_op = node->op() == "Conj" ? node : input;
-
-    if ((transpose_op->op() == "Transpose" ||
-         transpose_op->op() == "ConjugateTranspose") &&
-        conj_op->op() == "Conj") {
-      NodeDef* new_op =
-          AddNode(OptimizedNodeName(*node, "fused"), transpose_op);
-      // Flip the type of transpose op to absorb the conjugation.
-      new_op->set_op(transpose_op->op() == "Transpose" ? "ConjugateTranspose"
-                                                       : "Transpose");
-      new_op->set_input(0, input->input(0));
-      node_map_->UpdateInput(new_op->name(), node->name(), input->input(0));
-      ForwardControlDependencies(new_op, {node, input});
-      return new_op->name();
-    }
-  }
-
-  return "";
-}
-
 Status ArithmeticOptimizer::SimplifyArithmeticOps(bool can_use_shapes) {
   SetVector<NodeDef*> nodes_to_simplify;
   nodes_to_simplify.Reserve(optimized_graph_->node_size());
@@ -2458,26 +2722,43 @@ Status ArithmeticOptimizer::SimplifyArithmeticOps(bool can_use_shapes) {
 
   if (options_.combine_add_to_addn && can_use_shapes)
     pipeline.AddStage<AddOpsRewriteStage>(ctx, ctx_ext);
+  if (options_.fold_conjugate_into_transpose)
+    pipeline.AddStage<FoldConjugateIntoTranspose>(ctx, ctx_ext);
+  if (options_.fold_multiply_into_conv)
+    pipeline.AddStage<FoldMultiplyIntoConv>(ctx, ctx_ext);
+  if (options_.fold_transpose_into_matmul)
+    pipeline.AddStage<FoldTransposeIntoMatMul>(ctx, ctx_ext);
   if (options_.hoist_common_factor_out_of_aggregation && can_use_shapes)
     pipeline.AddStage<HoistCommonFactorOutOfAggregation>(ctx, ctx_ext);
   if (options_.minimize_broadcasts && can_use_shapes)
     pipeline.AddStage<MinimizeBroadcasts>(ctx, ctx_ext);
   if (options_.remove_identity_transpose && can_use_shapes)
-    pipeline.AddStage<RemoveIdentityTranspose>(ctx, ctx_ext, opt_level_);
+    pipeline.AddStage<RemoveIdentityTranspose>(ctx, ctx_ext);
+  if (options_.remove_involution)
+    pipeline.AddStage<RemoveInvolution>(ctx, ctx_ext);
   if (options_.remove_redundant_bitcast)
     pipeline.AddStage<RemoveRedundantBitcastStage>(ctx, ctx_ext);
   if (options_.remove_redundant_cast)
     pipeline.AddStage<RemoveRedundantCastStage>(ctx, ctx_ext);
+  if (options_.remove_redundant_reshape)
+    pipeline.AddStage<RemoveRedundantReshape>(ctx, ctx_ext);
   if (options_.remove_negation)
     pipeline.AddStage<RemoveNegationStage>(ctx, ctx_ext);
+  if (options_.replace_mul_with_square)
+    pipeline.AddStage<ReplaceMulWithSquare>(ctx, ctx_ext);
   if (options_.remove_logical_not)
     pipeline.AddStage<RemoveLogicalNotStage>(ctx, ctx_ext);
+  if (options_.reorder_cast_and_transpose)
+    pipeline.AddStage<ReorderCastAndTranspose>(ctx, ctx_ext);
+  if (options_.simplify_aggregation)
+    pipeline.AddStage<SimplifyAggregation>(ctx, ctx_ext);
   if (options_.hoist_cwise_unary_chains)
     pipeline.AddStage<HoistCWiseUnaryChainsStage>(ctx, ctx_ext);
   if (options_.convert_sqrt_div_to_rsqrt_mul)
     pipeline.AddStage<SqrtDivToRsqrtMulStage>(ctx, ctx_ext);
   if (options_.remove_idempotent)
     pipeline.AddStage<RemoveIdempotentStage>(ctx, ctx_ext);
+  if (options_.convert_pow) pipeline.AddStage<ConvertPowStage>(ctx, ctx_ext);
 
   VLOG(1) << "Run " << pipeline.NumStages() << " arithmetic optimizer stages: "
           << str_util::Join(pipeline.StageNames(), ", ");
@@ -2485,19 +2766,11 @@ Status ArithmeticOptimizer::SimplifyArithmeticOps(bool can_use_shapes) {
   while (!nodes_to_simplify.Empty()) {
     NodeDef* node = nodes_to_simplify.PopBack();
 
-    // TODO(ezhulenev): move all rewrites into separate stages
     string simplified_tensor = "";
-    if (options_.enable_try_simplify_and_replace) {
-      simplified_tensor = TrySimplifyAndReplaceUses(node, &nodes_to_simplify);
-    }
+    bool optimized = pipeline.PassThroughAllStages(node, &simplified_tensor);
 
-    // if it was not simplified try to run it through all configured stages
-    if (!stop(simplified_tensor)) {
-      bool optimized = pipeline.PassThroughAllStages(node, &simplified_tensor);
-      if (!optimized) {
-        continue;
-      }
-    }
+    // If the node was not optimized by any of the stages, go to the next one.
+    if (!optimized) continue;
 
     // re-wire consumers of an old node to the new one
     if (NodeName(simplified_tensor) != node->name()) {
diff --git a/tensorflow/core/grappler/optimizers/arithmetic_optimizer.h b/tensorflow/core/grappler/optimizers/arithmetic_optimizer.h
index 8e1b3eda3b8f18b9c4403ac857c637fccaf220e5..40c5e9fc56a50ee3a47791fd3d2b61315e432c19 100644
--- a/tensorflow/core/grappler/optimizers/arithmetic_optimizer.h
+++ b/tensorflow/core/grappler/optimizers/arithmetic_optimizer.h
@@ -54,21 +54,27 @@ class ArithmeticOptimizer : public GraphOptimizer {
 
   // Granular control for arithmetic optimizer stages
   struct ArithmeticOptimizerOptions {
-    // TODO(ezhulenev): flag do disable TrySimplifyAndReplaceUses in tests.
-    // Remove when all optimizers will be migrated to separate stages.
-    bool dedup_computations = true;
-    bool enable_try_simplify_and_replace = true;
     bool combine_add_to_addn = true;
+    bool convert_sqrt_div_to_rsqrt_mul = true;
+    bool dedup_computations = true;
+    bool fold_conjugate_into_transpose = true;
+    bool fold_multiply_into_conv = true;
+    bool fold_transpose_into_matmul = true;
     bool hoist_common_factor_out_of_aggregation = true;
+    bool hoist_cwise_unary_chains = false;
     bool minimize_broadcasts = true;
+    bool remove_idempotent = true;
     bool remove_identity_transpose = true;
+    bool remove_involution = true;
+    bool remove_logical_not = true;
+    bool remove_negation = true;
     bool remove_redundant_bitcast = true;
     bool remove_redundant_cast = true;
-    bool remove_negation = true;
-    bool hoist_cwise_unary_chains = false;
-    bool convert_sqrt_div_to_rsqrt_mul = false;
-    bool remove_idempotent = true;
-    bool remove_logical_not = true;
+    bool remove_redundant_reshape = true;
+    bool reorder_cast_and_transpose = true;
+    bool replace_mul_with_square = true;
+    bool simplify_aggregation = true;
+    bool convert_pow = true;
 
     // Choose which arithmetic optimizer stages will be enabled for a given
     // optimization level by default.
@@ -79,21 +85,6 @@ class ArithmeticOptimizer : public GraphOptimizer {
     }
   };
 
-  // Returns true is a node with given name and the optimizer prefix already
-  // exists.
-  string OptimizedNodeName(const NodeDef& node, StringPiece suffix) const;
-  bool OptimizedNodeExists(const NodeDef& node, StringPiece suffix) const;
-
-  // Creates a new node in the graph, with name equal to that of node, prefixed
-  // with "ArithmeticOptimizer/" and the given suffix. Also updates node_map_,
-  // and optionally copies node into the new node if copy_node is true.
-  NodeDef* AddNode(const NodeDef& node, StringPiece suffix, bool copy_node);
-
-  // Creates a new node in the graph, prefixed with "ArithmeticOptimizer/",
-  // updates node_map_, and optionally copies *node_to_copy into the new
-  // node, if node_to_copy is not nullptr.
-  NodeDef* AddNode(const string& name, const NodeDef* node_to_copy);
-
   // Returns true if it is safe to dedup node from the graph.
   bool CanDedup(const NodeDef& node) const;
 
diff --git a/tensorflow/core/grappler/optimizers/arithmetic_optimizer_test.cc b/tensorflow/core/grappler/optimizers/arithmetic_optimizer_test.cc
index 64fdc8a83b1ad104d14d2998dfb3c23df05b0ece..ff96cb64801ee47ec02c6a1c551ab7e94599cf81 100644
--- a/tensorflow/core/grappler/optimizers/arithmetic_optimizer_test.cc
+++ b/tensorflow/core/grappler/optimizers/arithmetic_optimizer_test.cc
@@ -40,21 +40,37 @@ constexpr char kHoistFactorOptimizerMul[] =
 constexpr char kHoistFactorOptimizerAdd[] =
     "ArithmeticOptimizer/HoistCommonFactor_Add_";
 
-// Optimized name of outer Mul node by HoistCommonFactorOutOfAggregation
+constexpr char kSimplifyAggregationConst[] =
+    "ArithmeticOptimizer/SimplifyAggregation_Const_";
+
+constexpr char kSimplifyAggregationMul[] =
+    "ArithmeticOptimizer/SimplifyAggregation_Mul_";
+
+// Optimized name of outer Mul node by HoistCommonFactorOutOfAggregation.
 string HoistMulName(const string& name) {
   return AddPrefixToNodeName(name, kHoistFactorOptimizerMul, "");
 }
 
-// Optimized name of outer Div node by HoistCommonFactorOutOfAggregation
+// Optimized name of outer Div node by HoistCommonFactorOutOfAggregation.
 string HoistDivName(const string& name) {
   return AddPrefixToNodeName(name, kHoistFactorOptimizerDiv, "");
 }
 
-// Optimized name of inner Add node by HoistCommonFactorOutOfAggregation
+// Optimized name of inner Add node by HoistCommonFactorOutOfAggregation.
 string HoistAddName(const string& name) {
   return AddPrefixToNodeName(name, kHoistFactorOptimizerAdd, "");
 }
 
+// Optimized name of Const node by SimplifyAggregation.
+string AggregationConstName(const string& name) {
+  return AddPrefixToNodeName(name, kSimplifyAggregationConst, "");
+}
+
+// Optimized name of Mul node by SimplifyAggregation.
+string AggregationMulName(const string& name) {
+  return AddPrefixToNodeName(name, kSimplifyAggregationMul, "");
+}
+
 string OptimizedName(const string& name) {
   return AddPrefixToNodeName(name, kArithmeticOptimizer);
 }
@@ -97,12 +113,22 @@ class ArithmeticOptimizerTest : public GrapplerTest {
   }
 
   // Run ArithmeticOptimizer twice to make sure the rewrite is idempotent.
+  // Optionally run a constant folding pass before pruning.
   void OptimizeTwiceAndPrune(ArithmeticOptimizer* optimizer, GrapplerItem* item,
-                             GraphDef* output) {
+                             GraphDef* output, bool const_folding = false) {
     TF_EXPECT_OK(optimizer->Optimize(nullptr, *item, output));
+
     item->graph.Swap(output);
     output->Clear();
     TF_EXPECT_OK(optimizer->Optimize(nullptr, *item, output));
+
+    if (const_folding) {
+      item->graph.Swap(output);
+      output->Clear();
+      TF_EXPECT_OK(ConstantFolding(/*cpu_device=*/nullptr)
+                       .Optimize(nullptr, *item, output));
+    }
+
     item->graph.Swap(output);
     output->Clear();
     TF_EXPECT_OK(ModelPruner().Optimize(nullptr, *item, output));
@@ -113,14 +139,25 @@ class ArithmeticOptimizerTest : public GrapplerTest {
   void DisableAllStages(ArithmeticOptimizer* optimizer) {
     ArithmeticOptimizer::ArithmeticOptimizerOptions options;
     options.dedup_computations = false;
-    options.enable_try_simplify_and_replace = false;
     options.combine_add_to_addn = false;
+    options.convert_sqrt_div_to_rsqrt_mul = false;
+    options.fold_conjugate_into_transpose = false;
+    options.fold_multiply_into_conv = false;
+    options.fold_transpose_into_matmul = false;
     options.hoist_common_factor_out_of_aggregation = false;
+    options.hoist_cwise_unary_chains = false;
     options.minimize_broadcasts = false;
     options.remove_identity_transpose = false;
+    options.remove_involution = false;
+    options.remove_idempotent = false;
     options.remove_redundant_bitcast = false;
     options.remove_redundant_cast = false;
+    options.remove_redundant_reshape = false;
     options.remove_negation = false;
+    options.remove_logical_not = false;
+    options.reorder_cast_and_transpose = false;
+    options.replace_mul_with_square = false;
+    options.simplify_aggregation = false;
     optimizer->options_ = options;
   }
 
@@ -133,6 +170,21 @@ class ArithmeticOptimizerTest : public GrapplerTest {
     optimizer->options_.combine_add_to_addn = true;
   }
 
+  void EnableOnlyFoldConjugateIntoTranspose(ArithmeticOptimizer* optimizer) {
+    DisableAllStages(optimizer);
+    optimizer->options_.fold_conjugate_into_transpose = true;
+  }
+
+  void EnableOnlyFoldMultipleIntoConv(ArithmeticOptimizer* optimizer) {
+    DisableAllStages(optimizer);
+    optimizer->options_.fold_multiply_into_conv = true;
+  }
+
+  void EnableOnlyFoldTransposeIntoMatMul(ArithmeticOptimizer* optimizer) {
+    DisableAllStages(optimizer);
+    optimizer->options_.fold_transpose_into_matmul = true;
+  }
+
   void EnableOnlyHoistCommonFactor(ArithmeticOptimizer* optimizer) {
     DisableAllStages(optimizer);
     optimizer->options_.hoist_common_factor_out_of_aggregation = true;
@@ -148,6 +200,11 @@ class ArithmeticOptimizerTest : public GrapplerTest {
     optimizer->options_.remove_identity_transpose = true;
   }
 
+  void EnableOnlyRemoveInvolution(ArithmeticOptimizer* optimizer) {
+    DisableAllStages(optimizer);
+    optimizer->options_.remove_involution = true;
+  }
+
   void EnableOnlyRemoveRedundantBitcast(ArithmeticOptimizer* optimizer) {
     DisableAllStages(optimizer);
     optimizer->options_.remove_redundant_bitcast = true;
@@ -158,11 +215,26 @@ class ArithmeticOptimizerTest : public GrapplerTest {
     optimizer->options_.remove_redundant_cast = true;
   }
 
+  void EnableOnlyRemoveRedundantReshape(ArithmeticOptimizer* optimizer) {
+    DisableAllStages(optimizer);
+    optimizer->options_.remove_redundant_reshape = true;
+  }
+
   void EnableOnlyRemoveNegation(ArithmeticOptimizer* optimizer) {
     DisableAllStages(optimizer);
     optimizer->options_.remove_negation = true;
   }
 
+  void EnableOnlyReorderCastAndTranspose(ArithmeticOptimizer* optimizer) {
+    DisableAllStages(optimizer);
+    optimizer->options_.reorder_cast_and_transpose = true;
+  }
+
+  void EnableOnlyReplaceMulWithSquare(ArithmeticOptimizer* optimizer) {
+    DisableAllStages(optimizer);
+    optimizer->options_.replace_mul_with_square = true;
+  }
+
   void EnableOnlyHoistCWiseUnaryChains(ArithmeticOptimizer* optimizer) {
     DisableAllStages(optimizer);
     optimizer->options_.hoist_cwise_unary_chains = true;
@@ -173,6 +245,11 @@ class ArithmeticOptimizerTest : public GrapplerTest {
     optimizer->options_.convert_sqrt_div_to_rsqrt_mul = true;
   }
 
+  void EnableOnlyConvertPow(ArithmeticOptimizer* optimizer) {
+    DisableAllStages(optimizer);
+    optimizer->options_.convert_pow = true;
+  }
+
   void EnableOnlyRemoveIdempotent(ArithmeticOptimizer* optimizer) {
     DisableAllStages(optimizer);
     optimizer->options_.remove_idempotent = true;
@@ -182,6 +259,11 @@ class ArithmeticOptimizerTest : public GrapplerTest {
     DisableAllStages(optimizer);
     optimizer->options_.remove_logical_not = true;
   }
+
+  void EnableOnlySimplifyAggregation(ArithmeticOptimizer* optimizer) {
+    DisableAllStages(optimizer);
+    optimizer->options_.simplify_aggregation = true;
+  }
 };
 
 TEST_F(ArithmeticOptimizerTest, NoOp) {
@@ -307,131 +389,142 @@ TEST_F(ArithmeticOptimizerTest, OpDedupCommutative) {
   test::ExpectTensorNear<float>(tensors_expected[0], tensors[0], 1e-6);
 }
 
-TEST_F(ArithmeticOptimizerTest, MulToSquare) {
+TEST_F(ArithmeticOptimizerTest, ReplaceMulWithSquare) {
   tensorflow::Scope s = tensorflow::Scope::NewRootScope();
   Output c = ops::Const(s.WithOpName("c"), {1.0f, 2.0f}, {1, 2});
   Output d = ops::Const(s.WithOpName("d"), {3.0f, 4.0f}, {1, 2});
   Output mul = ops::Mul(s.WithControlDependencies(d).WithOpName("mul"), c, c);
   Output id = ops::Identity(s.WithOpName("id"), mul);
+
   GrapplerItem item;
+  item.fetch = {"id"};
   TF_CHECK_OK(s.ToGraphDef(&item.graph));
-  std::vector<string> fetch = {"id"};
-  auto tensors_expected = EvaluateNodes(item.graph, fetch);
+  auto tensors_expected = EvaluateNodes(item.graph, item.fetch);
   EXPECT_EQ(1, tensors_expected.size());
 
-  ArithmeticOptimizer optimizer;
   GraphDef output;
-  Status status = optimizer.Optimize(nullptr, item, &output);
-  TF_EXPECT_OK(status);
+  ArithmeticOptimizer optimizer;
+  EnableOnlyReplaceMulWithSquare(&optimizer);
+  OptimizeAndPrune(&optimizer, &item, &output);
 
-  EXPECT_EQ(5, output.node_size());
-  EXPECT_EQ("id", output.node(3).name());
-  EXPECT_EQ(OptimizedName("mul_square"), output.node(3).input(0));
-  EXPECT_EQ("Square", output.node(4).op());
-  EXPECT_EQ(OptimizedName("mul_square"), output.node(4).name());
-  EXPECT_EQ(2, output.node(4).input_size());
-  EXPECT_EQ("c", output.node(4).input(0));
-  EXPECT_EQ("^d", output.node(4).input(1));
+  EXPECT_EQ(4, output.node_size());
 
-  auto tensors = EvaluateNodes(output, fetch);
+  NodeMap node_map(&output);
+  const string p = "ArithmeticOptimizer/ReplaceMulWithSquare";
+  const NodeDef* square_node = node_map.GetNode(strings::StrCat(p, "_", "mul"));
+
+  ASSERT_NE(square_node, nullptr);
+  EXPECT_EQ("Square", square_node->op());
+  EXPECT_EQ("c", square_node->input(0));
+  EXPECT_EQ("^d", square_node->input(1));
+
+  auto tensors = EvaluateNodes(output, item.fetch);
   EXPECT_EQ(1, tensors.size());
   test::ExpectTensorNear<float>(tensors_expected[0], tensors[0], 1e-6);
 }
 
-TEST_F(ArithmeticOptimizerTest, SimplifyInvolutionsReal) {
+TEST_F(ArithmeticOptimizerTest, RemoveInvolution_AdjacentNodes) {
   tensorflow::Scope s = tensorflow::Scope::NewRootScope();
-  Output c = ops::Const(s.WithOpName("c"), {1.0f, 2.0f}, {1, 2});
-  Output neg1 = ops::Neg(s.WithOpName("neg1"), c);
-  Output neg2 = ops::Neg(s.WithOpName("neg2"), neg1);
-  Output recip1 = ops::Reciprocal(s.WithOpName("recip1"), neg2);
-  Output recip2 = ops::Reciprocal(s.WithOpName("recip2"), recip1);
-  Output id = ops::Identity(s.WithOpName("id"), recip2);
+
+  auto c = ops::Const(s.WithOpName("c"), {1.0f, 2.0f}, {1, 2});
+  auto neg1 = ops::Neg(s.WithOpName("neg1"), c);
+  auto neg2 = ops::Neg(s.WithOpName("neg2"), neg1);
+  auto recip1 = ops::Reciprocal(s.WithOpName("recip1"), neg2);
+  auto recip2 = ops::Reciprocal(s.WithOpName("recip2"), recip1);
+  auto id = ops::Identity(s.WithOpName("id"), recip2);
+
   GrapplerItem item;
+  item.fetch = {"id"};
   TF_CHECK_OK(s.ToGraphDef(&item.graph));
-  std::vector<string> fetch = {"id"};
-  auto tensors_expected = EvaluateNodes(item.graph, fetch);
+  auto tensors_expected = EvaluateNodes(item.graph, item.fetch);
   EXPECT_EQ(1, tensors_expected.size());
 
-  ArithmeticOptimizer optimizer;
   GraphDef output;
-  Status status = optimizer.Optimize(nullptr, item, &output);
-  TF_EXPECT_OK(status);
+  ArithmeticOptimizer optimizer;
+  EnableOnlyRemoveInvolution(&optimizer);
+  OptimizeAndPrune(&optimizer, &item, &output);
 
-  EXPECT_EQ(6, output.node_size());
+  // Negation and Reciprocal nodes cancelled each other.
+  EXPECT_EQ(2, output.node_size());
+  EXPECT_EQ("id", output.node(1).name());
   EXPECT_EQ("c", output.node(1).input(0));
-  EXPECT_EQ("c", output.node(3).input(0));
-  EXPECT_EQ("c", output.node(5).input(0));
 
-  auto tensors = EvaluateNodes(output, fetch);
+  auto tensors = EvaluateNodes(output, item.fetch);
   EXPECT_EQ(1, tensors.size());
   test::ExpectTensorNear<float>(tensors_expected[0], tensors[0], 1e-6);
 }
 
-TEST_F(ArithmeticOptimizerTest, SimplifyInvolutionsWithChain) {
+TEST_F(ArithmeticOptimizerTest, RemoveInvolution_AroundValuePreservingChain) {
   tensorflow::Scope s = tensorflow::Scope::NewRootScope();
-  Output c = ops::Const(s.WithOpName("c"), {1.0f, 2.0f}, {1, 2});
-  Output recip1 = ops::Reciprocal(s.WithOpName("recip1"), c);
-  Output id1 = ops::Identity(s.WithOpName("id1"), recip1);
-  Output squeeze = ops::Squeeze(s.WithOpName("squeeze"), id1);
-  Output recip2 = ops::Reciprocal(s.WithOpName("recip2"), squeeze);
-  Output id2 = ops::Identity(s.WithOpName("id2"), recip2);
+
+  auto c = ops::Const(s.WithOpName("c"), {1.0f, 2.0f}, {1, 2});
+  auto recip1 = ops::Reciprocal(s.WithOpName("recip1"), c);
+  auto id1 = ops::Identity(s.WithOpName("id1"), recip1);
+  auto squeeze = ops::Squeeze(s.WithOpName("squeeze"), id1);
+  auto recip2 = ops::Reciprocal(s.WithOpName("recip2"), squeeze);
+  auto id2 = ops::Identity(s.WithOpName("id2"), recip2);
+
+  std::vector<string> fetch = {"id2"};
+
   GrapplerItem item;
+  item.fetch = fetch;
   TF_CHECK_OK(s.ToGraphDef(&item.graph));
-  std::vector<string> fetch = {"id2"};
   auto tensors_expected = EvaluateNodes(item.graph, fetch);
   EXPECT_EQ(1, tensors_expected.size());
 
-  ArithmeticOptimizer optimizer;
   GraphDef output;
-  Status status = optimizer.Optimize(nullptr, item, &output);
-  TF_EXPECT_OK(status);
-  // Run the optimizer twice to make sure the rewrite is idempotent.
-  item.graph.Swap(&output);
-  status = optimizer.Optimize(nullptr, item, &output);
-  TF_EXPECT_OK(status);
+  ArithmeticOptimizer optimizer;
+  EnableOnlyRemoveInvolution(&optimizer);
+  OptimizeTwiceAndPrune(&optimizer, &item, &output);
 
-  EXPECT_EQ(6, output.node_size());
-  EXPECT_EQ("squeeze", output.node(5).input(0));
-  EXPECT_EQ("c", output.node(2).input(0));
+  // Check that Reciprocal nodes were removed from the graph.
+  EXPECT_EQ(3, output.node_size());
+
+  // And const directly flows into squeeze.
+  int found = 0;
+  for (const NodeDef& node : output.node()) {
+    if (node.name() == "squeeze") {
+      EXPECT_EQ("c", node.input(0));
+      found++;
+    } else if (node.name() == "id2") {
+      EXPECT_EQ("squeeze", node.input(0));
+      found++;
+    }
+  }
+  EXPECT_EQ(2, found);
 
   auto tensors = EvaluateNodes(output, fetch);
   EXPECT_EQ(1, tensors.size());
   test::ExpectTensorNear<float>(tensors_expected[0], tensors[0], 1e-6);
 }
 
-TEST_F(ArithmeticOptimizerTest, SimplifyInvolutionsWithControlChain) {
+TEST_F(ArithmeticOptimizerTest, RemoveInvolution_SkipControlDependencies) {
   tensorflow::Scope s = tensorflow::Scope::NewRootScope();
-  Output c = ops::Const(s.WithOpName("c"), {1.0f, 2.0f}, {1, 2});
-  Output recip1 = ops::Reciprocal(s.WithOpName("recip1"), c);
-  Output id1 = ops::Identity(s.WithOpName("id1"), recip1);
-  Output squeeze = ops::Squeeze(s.WithOpName("squeeze"), id1);
-  Output recip2 = ops::Reciprocal(
+
+  auto c = ops::Const(s.WithOpName("c"), {1.0f, 2.0f}, {1, 2});
+  auto recip1 = ops::Reciprocal(s.WithOpName("recip1"), c);
+  auto id1 = ops::Identity(s.WithOpName("id1"), recip1);
+  auto squeeze = ops::Squeeze(s.WithOpName("squeeze"), id1);
+  auto recip2 = ops::Reciprocal(
       s.WithOpName("recip2").WithControlDependencies(squeeze), c);
-  Output id2 = ops::Identity(s.WithOpName("id2"), recip2);
+  auto id2 = ops::Identity(s.WithOpName("id2"), recip2);
+
+  std::vector<string> fetch = {"id2"};
+
   GrapplerItem item;
+  item.fetch = fetch;
   TF_CHECK_OK(s.ToGraphDef(&item.graph));
 
-  std::vector<string> fetch = {"id2"};
   auto tensors_expected = EvaluateNodes(item.graph, fetch);
   EXPECT_EQ(1, tensors_expected.size());
 
-  ArithmeticOptimizer optimizer;
   GraphDef output;
-  Status status = optimizer.Optimize(nullptr, item, &output);
-  TF_EXPECT_OK(status);
+  ArithmeticOptimizer optimizer;
+  EnableOnlyRemoveInvolution(&optimizer);
+  OptimizeTwice(&optimizer, &item, &output);  // do not prune in this test
 
   // The optimizer should be a noop.
-  EXPECT_EQ(item.graph.node_size(), output.node_size());
-  for (int i = 0; i < item.graph.node_size(); ++i) {
-    const NodeDef& original = item.graph.node(i);
-    const NodeDef& optimized = output.node(i);
-    EXPECT_EQ(original.name(), optimized.name());
-    EXPECT_EQ(original.op(), optimized.op());
-    EXPECT_EQ(original.input_size(), optimized.input_size());
-    for (int j = 0; j < original.input_size(); ++j) {
-      EXPECT_EQ(original.input(j), optimized.input(j));
-    }
-  }
+  VerifyGraphsMatch(item.graph, output, __LINE__);
 
   auto tensors = EvaluateNodes(output, fetch);
   EXPECT_EQ(1, tensors.size());
@@ -445,10 +538,10 @@ TEST_F(ArithmeticOptimizerTest, TrivialSumsSimple) {
   Output id = ops::Identity(s.WithOpName("id"), add);
 
   GrapplerItem item;
+  item.fetch = {"id"};
   TF_CHECK_OK(s.ToGraphDef(&item.graph));
 
-  std::vector<string> fetch = {"id"};
-  auto tensors_expected = EvaluateNodes(item.graph, fetch);
+  auto tensors_expected = EvaluateNodes(item.graph, item.fetch);
   EXPECT_EQ(1, tensors_expected.size());
 
   ArithmeticOptimizer optimizer;
@@ -458,22 +551,25 @@ TEST_F(ArithmeticOptimizerTest, TrivialSumsSimple) {
 
   EXPECT_EQ(5, output.node_size());
 
-  const NodeDef* new_const = node_map.GetNode(OptimizedName("add_const"));
+  const string optimized_const_name = AggregationConstName("add");
+  const string optimized_mul_name = AggregationMulName("add");
+
+  const NodeDef* new_const = node_map.GetNode(optimized_const_name);
   ASSERT_NE(new_const, nullptr);
   EXPECT_EQ("^x", new_const->input(0));
   EXPECT_EQ(std::string("\0\0\0@", 4),
             new_const->attr().at("value").tensor().tensor_content());
 
-  const NodeDef* new_mul = node_map.GetNode(OptimizedName("add_mul"));
+  const NodeDef* new_mul = node_map.GetNode(optimized_mul_name);
   ASSERT_NE(new_mul, nullptr);
-  EXPECT_EQ(OptimizedName("add_const"), new_mul->input(0));
+  EXPECT_EQ(optimized_const_name, new_mul->input(0));
   EXPECT_EQ("x", new_mul->input(1));
 
   const NodeDef* new_id = node_map.GetNode("id");
   ASSERT_NE(new_id, nullptr);
-  EXPECT_EQ(OptimizedName("add_mul"), new_id->input(0));
+  EXPECT_EQ(optimized_mul_name, new_id->input(0));
 
-  auto tensors = EvaluateNodes(output, fetch);
+  auto tensors = EvaluateNodes(output, item.fetch);
   EXPECT_EQ(1, tensors.size());
   test::ExpectTensorNear<float>(tensors_expected[0], tensors[0], 1e-6);
 }
@@ -499,21 +595,24 @@ TEST_F(ArithmeticOptimizerTest, TrivialSumsSimpleWithControlDep) {
 
   EXPECT_EQ(6, output.node_size());
 
-  const NodeDef* new_const = node_map.GetNode(OptimizedName("add_const"));
+  const string optimized_const_name = AggregationConstName("add");
+  const string optimized_mul_name = AggregationMulName("add");
+
+  const NodeDef* new_const = node_map.GetNode(optimized_const_name);
   ASSERT_NE(new_const, nullptr);
   EXPECT_EQ("^x", new_const->input(0));
   EXPECT_EQ(std::string("\0\0\0@", 4),
             new_const->attr().at("value").tensor().tensor_content());
 
-  const NodeDef* new_mul = node_map.GetNode(OptimizedName("add_mul"));
+  const NodeDef* new_mul = node_map.GetNode(optimized_mul_name);
   ASSERT_NE(new_mul, nullptr);
-  EXPECT_EQ(OptimizedName("add_const"), new_mul->input(0));
+  EXPECT_EQ(optimized_const_name, new_mul->input(0));
   EXPECT_EQ("x", new_mul->input(1));
   EXPECT_EQ("^y", new_mul->input(2));
 
   const NodeDef* new_id = node_map.GetNode("id");
   ASSERT_NE(new_id, nullptr);
-  EXPECT_EQ(OptimizedName("add_mul"), new_id->input(0));
+  EXPECT_EQ(optimized_mul_name, new_id->input(0));
 
   auto tensors = EvaluateNodes(output, fetch);
   EXPECT_EQ(1, tensors.size());
@@ -578,24 +677,24 @@ TEST_F(ArithmeticOptimizerTest, TrivialSumsRepeatedAdd) {
   ASSERT_NE(add_4_node, nullptr);
   EXPECT_EQ("Add", add_4_node->op());
   EXPECT_EQ(2, add_4_node->input_size());
-  EXPECT_EQ(OptimizedName("Add_const"), add_4_node->input(0));
-  EXPECT_EQ(OptimizedName("Add_1_const"), add_4_node->input(1));
+  EXPECT_EQ(AggregationConstName("Add"), add_4_node->input(0));
+  EXPECT_EQ(AggregationConstName("Add_1"), add_4_node->input(1));
 
   const NodeDef* add_5_node = node_map.GetNode(HoistAddName("Add_5"));
   ASSERT_NE(add_5_node, nullptr);
   EXPECT_EQ("Add", add_5_node->op());
   EXPECT_EQ(2, add_5_node->input_size());
-  EXPECT_EQ(OptimizedName("Add_const"), add_5_node->input(0));
-  EXPECT_EQ(OptimizedName("Add_1_const"), add_5_node->input(1));
+  EXPECT_EQ(AggregationConstName("Add"), add_5_node->input(0));
+  EXPECT_EQ(AggregationConstName("Add_1"), add_5_node->input(1));
 
-  const NodeDef* add_const_node = node_map.GetNode(OptimizedName("Add_const"));
+  const NodeDef* add_const_node = node_map.GetNode(AggregationConstName("Add"));
   ASSERT_NE(add_const_node, nullptr);
   EXPECT_EQ("Const", add_const_node->op());
   EXPECT_EQ(1, add_const_node->input_size());
   EXPECT_EQ("^Placeholder", add_const_node->input(0));
 
   const NodeDef* add_1_const_node =
-      node_map.GetNode(OptimizedName("Add_1_const"));
+      node_map.GetNode(AggregationConstName("Add_1"));
   ASSERT_NE(add_1_const_node, nullptr);
   EXPECT_EQ("Const", add_1_const_node->op());
   EXPECT_EQ(1, add_1_const_node->input_size());
@@ -762,11 +861,14 @@ TEST_F(ArithmeticOptimizerTest, FuseConjAndTranspose) {
   Output perm = ops::Const(s.WithOpName("perm"), {1, 0}, {2});
   Output conj = ops::Conj(s.WithOpName("conj"), z);
   Output transp = ops::Transpose(s.WithOpName("trans"), conj, perm);
+
   GrapplerItem item;
+  item.fetch = {"trans"};
   TF_CHECK_OK(s.ToGraphDef(&item.graph));
-  std::vector<string> fetch = {"trans"};
-  auto tensors_expected = EvaluateNodes(item.graph, fetch);
+
+  auto tensors_expected = EvaluateNodes(item.graph, item.fetch);
   EXPECT_EQ(1, tensors_expected.size());
+
   ArithmeticOptimizer optimizer;
   GraphDef output;
   OptimizeTwice(&optimizer, &item, &output);
@@ -774,20 +876,23 @@ TEST_F(ArithmeticOptimizerTest, FuseConjAndTranspose) {
 
   EXPECT_EQ(7, output.node_size());
 
-  const NodeDef* trans_fused_node =
-      node_map.GetNode(OptimizedName("trans_fused"));
+  const string p = "ArithmeticOptimizer/FoldConjugateIntoTranspose";
+  const string optimized_name = strings::StrCat(p, "_", "trans");
+
+  const NodeDef* trans_fused_node = node_map.GetNode(optimized_name);
   ASSERT_NE(trans_fused_node, nullptr);
   EXPECT_EQ("ConjugateTranspose", trans_fused_node->op());
   EXPECT_EQ("z", trans_fused_node->input(0));
   EXPECT_EQ("perm", trans_fused_node->input(1));
 
-  auto tensors = EvaluateNodes(output, fetch);
+  auto tensors = EvaluateNodes(output, item.fetch);
   EXPECT_EQ(1, tensors.size());
   test::ExpectTensorEqual<complex64>(tensors_expected[0], tensors[0]);
 }
 
 TEST_F(ArithmeticOptimizerTest, FuseConjAndConjugateTranspose) {
   tensorflow::Scope s = tensorflow::Scope::NewRootScope();
+
   Output re = ops::Const(s.WithOpName("re"), {1.0f, 2.0f, 3.0f, 4.0f}, {2, 2});
   Output im = ops::Const(s.WithOpName("im"), {5.0f, 6.0f, 7.0f, 8.0f}, {2, 2});
   Output z = ops::Complex(s.WithOpName("z"), re, im);
@@ -795,10 +900,12 @@ TEST_F(ArithmeticOptimizerTest, FuseConjAndConjugateTranspose) {
   Output conj = ops::Conj(s.WithOpName("conj"), z);
   Output transp =
       ops::ConjugateTranspose(s.WithOpName("conjugate_trans"), conj, perm);
+
   GrapplerItem item;
+  item.fetch = {"conjugate_trans"};
   TF_CHECK_OK(s.ToGraphDef(&item.graph));
-  std::vector<string> fetch = {"conjugate_trans"};
-  auto tensors_expected = EvaluateNodes(item.graph, fetch);
+
+  auto tensors_expected = EvaluateNodes(item.graph, item.fetch);
   EXPECT_EQ(1, tensors_expected.size());
 
   ArithmeticOptimizer optimizer;
@@ -808,12 +915,16 @@ TEST_F(ArithmeticOptimizerTest, FuseConjAndConjugateTranspose) {
 
   EXPECT_EQ(7, output.node_size());
 
-  const NodeDef* conjugate_trans_fused_node =
-      node_map.GetNode(OptimizedName("conjugate_trans_fused"));
+  const string p = "ArithmeticOptimizer/FoldConjugateIntoTranspose";
+  const string optimized_name = strings::StrCat(p, "_", "conjugate_trans");
+
+  const NodeDef* conjugate_trans_fused_node = node_map.GetNode(optimized_name);
+  ASSERT_NE(conjugate_trans_fused_node, nullptr);
   EXPECT_EQ("Transpose", conjugate_trans_fused_node->op());
   EXPECT_EQ("z", conjugate_trans_fused_node->input(0));
   EXPECT_EQ("perm", conjugate_trans_fused_node->input(1));
-  auto tensors = EvaluateNodes(output, fetch);
+
+  auto tensors = EvaluateNodes(output, item.fetch);
   EXPECT_EQ(1, tensors.size());
   test::ExpectTensorEqual<complex64>(tensors_expected[0], tensors[0]);
 }
@@ -826,10 +937,12 @@ TEST_F(ArithmeticOptimizerTest, FuseTransposeAndConj) {
   Output perm = ops::Const(s.WithOpName("perm"), {1, 0}, {2});
   Output trans = ops::Transpose(s.WithOpName("trans"), z, perm);
   Output conj = ops::Conj(s.WithOpName("conj"), trans);
+
   GrapplerItem item;
+  item.fetch = {"conj"};
   TF_CHECK_OK(s.ToGraphDef(&item.graph));
-  std::vector<string> fetch = {"conj"};
-  auto tensors_expected = EvaluateNodes(item.graph, fetch);
+
+  auto tensors_expected = EvaluateNodes(item.graph, item.fetch);
   EXPECT_EQ(1, tensors_expected.size());
 
   ArithmeticOptimizer optimizer;
@@ -839,12 +952,16 @@ TEST_F(ArithmeticOptimizerTest, FuseTransposeAndConj) {
 
   EXPECT_EQ(7, output.node_size());
 
-  const NodeDef* conj_fused_node =
-      node_map.GetNode(OptimizedName("conj_fused"));
+  const string p = "ArithmeticOptimizer/FoldConjugateIntoTranspose";
+  const string optimized_name = strings::StrCat(p, "_", "conj");
+
+  const NodeDef* conj_fused_node = node_map.GetNode(optimized_name);
+  ASSERT_NE(conj_fused_node, nullptr);
   EXPECT_EQ("ConjugateTranspose", conj_fused_node->op());
   EXPECT_EQ("z", conj_fused_node->input(0));
   EXPECT_EQ("perm", conj_fused_node->input(1));
-  auto tensors = EvaluateNodes(output, fetch);
+
+  auto tensors = EvaluateNodes(output, item.fetch);
   EXPECT_EQ(1, tensors.size());
   test::ExpectTensorEqual<complex64>(tensors_expected[0], tensors[0]);
 }
@@ -852,38 +969,45 @@ TEST_F(ArithmeticOptimizerTest, FuseTransposeAndConj) {
 TEST_F(ArithmeticOptimizerTest, FoldTransposeIntoMatMul) {
   for (const string matmul_type : {"MatMul", "SparseMatMul", "BatchMatMul"}) {
     tensorflow::Scope s = tensorflow::Scope::NewRootScope();
+
     Output a = ops::Const(s.WithOpName("a"), {1.0f, 2.0f, 3.0f, 4.0f}, {2, 2});
     Output b = ops::Const(s.WithOpName("b"), {5.0f, 6.0f, 7.0f, 8.0f}, {2, 2});
     Output perm = ops::Const(s.WithOpName("perm"), {1, 0}, {2});
     Output trans_a = ops::Transpose(s.WithOpName("trans_a"), a, perm);
     Output trans_b = ops::Transpose(s.WithOpName("trans_b"), b, perm);
+
+    auto matmul_op = s.WithOpName("matmul");
     if (matmul_type == "MatMul") {
-      Output matmul = ops::MatMul(s.WithOpName("matmul"), trans_a, trans_b);
+      Output matmul = ops::MatMul(matmul_op, trans_a, trans_b);
     } else if (matmul_type == "SparseMatMul") {
-      Output matmul =
-          ops::SparseMatMul(s.WithOpName("matmul"), trans_a, trans_b);
+      Output matmul = ops::SparseMatMul(matmul_op, trans_a, trans_b);
     } else if (matmul_type == "BatchMatMul") {
-      Output matmul =
-          ops::BatchMatMul(s.WithOpName("matmul"), trans_a, trans_b);
+      Output matmul = ops::BatchMatMul(matmul_op, trans_a, trans_b);
     }
+
     GrapplerItem item;
+    item.fetch = {"matmul"};
     TF_CHECK_OK(s.ToGraphDef(&item.graph));
-    std::vector<string> fetch = {"matmul"};
-    auto tensors_expected = EvaluateNodes(item.graph, fetch);
+
+    auto tensors_expected = EvaluateNodes(item.graph, item.fetch);
     EXPECT_EQ(1, tensors_expected.size());
 
     ArithmeticOptimizer optimizer;
+    EnableOnlyFoldTransposeIntoMatMul(&optimizer);
     GraphDef output;
     OptimizeTwice(&optimizer, &item, &output);
     NodeMap node_map(&output);
 
     EXPECT_EQ(7, output.node_size());
 
-    const NodeDef* matmul_fused_node =
-        node_map.GetNode(OptimizedName("matmul_fused"));
+    const string p = "ArithmeticOptimizer/FoldTransposeIntoMatMul";
+    const string optimized_name = strings::StrCat(p, "_", "matmul");
+
+    const NodeDef* matmul_fused_node = node_map.GetNode(optimized_name);
     ASSERT_NE(matmul_fused_node, nullptr);
     EXPECT_EQ("a", matmul_fused_node->input(0));
     EXPECT_EQ("b", matmul_fused_node->input(1));
+
     if (matmul_type == "BatchMatMul") {
       EXPECT_TRUE(matmul_fused_node->attr().at("adj_x").b());
       EXPECT_TRUE(matmul_fused_node->attr().at("adj_y").b());
@@ -891,7 +1015,8 @@ TEST_F(ArithmeticOptimizerTest, FoldTransposeIntoMatMul) {
       EXPECT_TRUE(matmul_fused_node->attr().at("transpose_a").b());
       EXPECT_TRUE(matmul_fused_node->attr().at("transpose_b").b());
     }
-    auto tensors = EvaluateNodes(output, fetch);
+
+    auto tensors = EvaluateNodes(output, item.fetch);
     EXPECT_EQ(1, tensors.size());
     test::ExpectTensorNear<float>(tensors_expected[0], tensors[0], 1e-6);
   }
@@ -899,6 +1024,7 @@ TEST_F(ArithmeticOptimizerTest, FoldTransposeIntoMatMul) {
 
 TEST_F(ArithmeticOptimizerTest, FoldConjugateTransposeIntoBatchMatMul) {
   tensorflow::Scope s = tensorflow::Scope::NewRootScope();
+
   Output re_a =
       ops::Const(s.WithOpName("re_a"), {1.0f, 2.0f, 3.0f, 4.0f}, {2, 2});
   Output im_a =
@@ -913,29 +1039,37 @@ TEST_F(ArithmeticOptimizerTest, FoldConjugateTransposeIntoBatchMatMul) {
   Output trans_a = ops::ConjugateTranspose(s.WithOpName("trans_a"), a, perm);
   Output trans_b = ops::ConjugateTranspose(s.WithOpName("trans_b"), b, perm);
   Output matmul = ops::BatchMatMul(s.WithOpName("matmul"), trans_a, trans_b);
+
   GrapplerItem item;
+  item.fetch = {"matmul"};
   TF_CHECK_OK(s.ToGraphDef(&item.graph));
-  std::vector<string> fetch = {"matmul"};
-  auto tensors_expected = EvaluateNodes(item.graph, fetch);
+
+  auto tensors_expected = EvaluateNodes(item.graph, item.fetch);
   EXPECT_EQ(1, tensors_expected.size());
 
   ArithmeticOptimizer optimizer;
   GraphDef output;
-  Status status = optimizer.Optimize(nullptr, item, &output);
-  TF_EXPECT_OK(status);
+  OptimizeTwice(&optimizer, &item, &output);
 
-  EXPECT_EQ(11, output.node_size());
-  EXPECT_EQ(OptimizedName("matmul_fused"), output.node(10).name());
-  EXPECT_EQ("a", output.node(10).input(0));
-  EXPECT_EQ("b", output.node(10).input(1));
-  EXPECT_TRUE(output.node(10).attr().at("adj_x").b());
-  EXPECT_TRUE(output.node(10).attr().at("adj_y").b());
-  auto tensors = EvaluateNodes(output, fetch);
+  NodeMap node_map(&output);
+  ASSERT_EQ(11, output.node_size());
+
+  const string p = "ArithmeticOptimizer/FoldTransposeIntoMatMul";
+  const string optimized_name = strings::StrCat(p, "_", "matmul");
+
+  const NodeDef* optimized_matmul = node_map.GetNode(optimized_name);
+  ASSERT_NE(optimized_matmul, nullptr);
+  EXPECT_EQ("a", optimized_matmul->input(0));
+  EXPECT_EQ("b", optimized_matmul->input(1));
+  EXPECT_TRUE(optimized_matmul->attr().at("adj_x").b());
+  EXPECT_TRUE(optimized_matmul->attr().at("adj_y").b());
+
+  auto tensors = EvaluateNodes(output, item.fetch);
   EXPECT_EQ(1, tensors.size());
   test::ExpectTensorNear<complex64>(tensors_expected[0], tensors[0], 1e-6);
 }
 
-TEST_F(ArithmeticOptimizerTest, IdentityReshape) {
+TEST_F(ArithmeticOptimizerTest, RemoveRedundantReshape_IdentityReshape) {
   tensorflow::Scope s = tensorflow::Scope::NewRootScope();
   Output inputs =
       ops::Placeholder(s, DT_FLOAT, ops::Placeholder::Shape({-1, 3, 28, 28}));
@@ -957,11 +1091,11 @@ TEST_F(ArithmeticOptimizerTest, IdentityReshape) {
   auto tensors_expected =
       EvaluateNodes(item.graph, item.fetch, {{"Placeholder", x_t}});
   EXPECT_EQ(1, tensors_expected.size());
-  GraphDef output;
-  TF_EXPECT_OK(ArithmeticOptimizer().Optimize(nullptr, item, &output));
 
-  item.graph.Swap(&output);
-  TF_EXPECT_OK(ModelPruner().Optimize(nullptr, item, &output));
+  GraphDef output;
+  ArithmeticOptimizer optimizer;
+  EnableOnlyRemoveRedundantReshape(&optimizer);
+  OptimizeTwiceAndPrune(&optimizer, &item, &output);
 
   EXPECT_EQ(0, CountOpNodes(output, "Reshape"));
   auto tensors = EvaluateNodes(output, item.fetch, {{"Placeholder", x_t}});
@@ -969,7 +1103,49 @@ TEST_F(ArithmeticOptimizerTest, IdentityReshape) {
   test::ExpectTensorNear<float>(tensors_expected[0], tensors[0], 1e-6);
 }
 
-TEST_F(ArithmeticOptimizerTest, NotAssumeValidFeeds) {
+TEST_F(ArithmeticOptimizerTest,
+       RemoveRedundantReshape_IdentityReshapeBetweenSymbolicShapes) {
+  tensorflow::Scope s = tensorflow::Scope::NewRootScope();
+  Output inputs =
+      ops::Placeholder(s, DT_FLOAT, ops::Placeholder::Shape({-1, 3, -1, -1}));
+  Output inputs_shape = ops::Shape(s, inputs);
+  // The target shape of the reshape is the concatenation of `batch_size`, 3,
+  // `height, and `width`.
+  Output batch_size = ops::Slice(s, inputs_shape, ops::Const(s, {0}, {1}),
+                                 ops::Const(s, {1}, {1}));
+  Output height = ops::Slice(s, inputs_shape, ops::Const(s, {2}, {1}),
+                             ops::Const(s, {1}, {1}));
+  Output width = ops::Slice(s, inputs_shape, ops::Const(s, {3}, {1}),
+                            ops::Const(s, {1}, {1}));
+  Output target_shape =
+      ops::Concat(s.WithOpName("target_shape"),
+                  {batch_size, ops::Const(s, {3}, {1}), height, width},
+                  ops::Const(s, {0}, {}));
+  Output reshape = ops::Reshape(s, inputs, target_shape);
+  Output outputs = ops::Identity(s.WithOpName("outputs"), reshape);
+
+  auto x_t = GenerateRandomTensor<DT_FLOAT>(TensorShape({3, 3, 28, 28}));
+  GrapplerItem item;
+  item.fetch = {"outputs"};
+  item.feed = {{"Placeholder", x_t}};
+  TF_CHECK_OK(s.ToGraphDef(&item.graph));
+
+  auto tensors_expected = EvaluateNodes(item.graph, item.fetch, item.feed);
+  EXPECT_EQ(1, tensors_expected.size());
+
+  GraphDef output;
+  // Assume valid feed shape in aggressive mode.
+  ArithmeticOptimizer optimizer(RewriterConfig::AGGRESSIVE);
+  EnableOnlyRemoveRedundantReshape(&optimizer);
+  OptimizeTwiceAndPrune(&optimizer, &item, &output);
+
+  EXPECT_EQ(0, CountOpNodes(output, "Reshape"));
+  auto tensors = EvaluateNodes(output, item.fetch, item.feed);
+  EXPECT_EQ(1, tensors.size());
+  test::ExpectTensorNear<float>(tensors_expected[0], tensors[0], 1e-6);
+}
+
+TEST_F(ArithmeticOptimizerTest, RemoveRedundantReshape_NotAssumeValidFeeds) {
   tensorflow::Scope s = tensorflow::Scope::NewRootScope();
   Output inputs =
       ops::Placeholder(s, DT_FLOAT, ops::Placeholder::Shape({4, 3, 28, 28}));
@@ -987,10 +1163,9 @@ TEST_F(ArithmeticOptimizerTest, NotAssumeValidFeeds) {
   EXPECT_EQ(1, tensors_expected.size());
 
   GraphDef output;
-  TF_EXPECT_OK(ArithmeticOptimizer().Optimize(nullptr, item, &output));
-
-  item.graph.Swap(&output);
-  TF_EXPECT_OK(ModelPruner().Optimize(nullptr, item, &output));
+  ArithmeticOptimizer optimizer;
+  EnableOnlyRemoveRedundantReshape(&optimizer);
+  OptimizeTwiceAndPrune(&optimizer, &item, &output);
 
   // The reshape is preserved because the shape of the placeholder can be
   // different from the shape of the actual feed.
@@ -1001,7 +1176,8 @@ TEST_F(ArithmeticOptimizerTest, NotAssumeValidFeeds) {
   test::ExpectTensorNear<float>(tensors_expected[0], tensors[0], 1e-6);
 }
 
-TEST_F(ArithmeticOptimizerTest, AssumeValidFeedsInAggressiveMode) {
+TEST_F(ArithmeticOptimizerTest,
+       RemoveRedundantReshape_AssumeValidFeedsInAggressiveMode) {
   tensorflow::Scope s = tensorflow::Scope::NewRootScope();
   Output inputs =
       ops::Placeholder(s, DT_FLOAT, ops::Placeholder::Shape({4, 3, 28, 28}));
@@ -1017,12 +1193,11 @@ TEST_F(ArithmeticOptimizerTest, AssumeValidFeedsInAggressiveMode) {
 
   auto tensors_expected = EvaluateNodes(item.graph, item.fetch, item.feed);
   EXPECT_EQ(1, tensors_expected.size());
-  GraphDef output;
-  TF_EXPECT_OK(ArithmeticOptimizer(RewriterConfig::AGGRESSIVE)
-                   .Optimize(nullptr, item, &output));
 
-  item.graph.Swap(&output);
-  TF_EXPECT_OK(ModelPruner().Optimize(nullptr, item, &output));
+  GraphDef output;
+  ArithmeticOptimizer optimizer(RewriterConfig::AGGRESSIVE);
+  EnableOnlyRemoveRedundantReshape(&optimizer);
+  OptimizeTwiceAndPrune(&optimizer, &item, &output);
 
   EXPECT_EQ(0, CountOpNodes(output, "Reshape"));
   auto tensors = EvaluateNodes(output, item.fetch, item.feed);
@@ -1030,7 +1205,7 @@ TEST_F(ArithmeticOptimizerTest, AssumeValidFeedsInAggressiveMode) {
   test::ExpectTensorNear<float>(tensors_expected[0], tensors[0], 1e-6);
 }
 
-TEST_F(ArithmeticOptimizerTest, NotIdentityReshape) {
+TEST_F(ArithmeticOptimizerTest, RemoveRedundantReshape_NotIdentityReshape) {
   // Reshape from [-1,3,28,28] to [8,-1,28,28] is not identity, because it can
   // be from [4,3,28,28] to [8,6,28,28].
   tensorflow::Scope s = tensorflow::Scope::NewRootScope();
@@ -1046,11 +1221,11 @@ TEST_F(ArithmeticOptimizerTest, NotIdentityReshape) {
   item.feed = {{"Placeholder", x_t}};
   auto tensors_expected = EvaluateNodes(item.graph, item.fetch, item.feed);
   EXPECT_EQ(1, tensors_expected.size());
-  GraphDef output;
-  TF_EXPECT_OK(ArithmeticOptimizer().Optimize(nullptr, item, &output));
 
-  item.graph.Swap(&output);
-  TF_EXPECT_OK(ModelPruner().Optimize(nullptr, item, &output));
+  GraphDef output;
+  ArithmeticOptimizer optimizer;
+  EnableOnlyRemoveRedundantReshape(&optimizer);
+  OptimizeTwiceAndPrune(&optimizer, &item, &output);
 
   EXPECT_EQ(1, CountOpNodes(output, "Reshape"));
   auto tensors = EvaluateNodes(output, item.fetch, item.feed);
@@ -1058,7 +1233,8 @@ TEST_F(ArithmeticOptimizerTest, NotIdentityReshape) {
   test::ExpectTensorNear<float>(tensors_expected[0], tensors[0], 1e-6);
 }
 
-TEST_F(ArithmeticOptimizerTest, NotIdentityReshapeTooManyUnknownDimSizes) {
+TEST_F(ArithmeticOptimizerTest,
+       RemoveRedundantReshape_NotIdentityReshapeTooManyUnknownDimSizes) {
   tensorflow::Scope s = tensorflow::Scope::NewRootScope();
   Output inputs =
       ops::Placeholder(s, DT_FLOAT, ops::Placeholder::Shape({4, 3}));
@@ -1068,16 +1244,16 @@ TEST_F(ArithmeticOptimizerTest, NotIdentityReshapeTooManyUnknownDimSizes) {
   GrapplerItem item;
   item.fetch = {"outputs"};
   TF_CHECK_OK(s.ToGraphDef(&item.graph));
-  GraphDef output;
-  TF_EXPECT_OK(ArithmeticOptimizer().Optimize(nullptr, item, &output));
 
-  item.graph.Swap(&output);
-  TF_EXPECT_OK(ModelPruner().Optimize(nullptr, item, &output));
+  GraphDef output;
+  ArithmeticOptimizer optimizer;
+  EnableOnlyRemoveRedundantReshape(&optimizer);
+  OptimizeTwiceAndPrune(&optimizer, &item, &output);
 
   EXPECT_EQ(1, CountOpNodes(output, "Reshape"));
 }
 
-TEST_F(ArithmeticOptimizerTest, CombineReshapes) {
+TEST_F(ArithmeticOptimizerTest, RemoveRedundantReshape_CombineReshapes) {
   // Converts an NCHW_VECT_C tensor to NHWC and then flattens it to 2D. The two
   // reshapes should be combined.
   tensorflow::Scope s = tensorflow::Scope::NewRootScope();
@@ -1102,11 +1278,11 @@ TEST_F(ArithmeticOptimizerTest, CombineReshapes) {
   item.feed = {{"nchw_vect_c", x_t}};
   auto tensors_expected = EvaluateNodes(item.graph, item.fetch, item.feed);
   EXPECT_EQ(1, tensors_expected.size());
-  GraphDef output;
-  TF_EXPECT_OK(ArithmeticOptimizer().Optimize(nullptr, item, &output));
 
-  item.graph.Swap(&output);
-  TF_EXPECT_OK(ModelPruner().Optimize(nullptr, item, &output));
+  GraphDef output;
+  ArithmeticOptimizer optimizer;
+  EnableOnlyRemoveRedundantReshape(&optimizer);
+  OptimizeTwiceAndPrune(&optimizer, &item, &output);
 
   EXPECT_EQ(1, CountOpNodes(output, "Reshape"));
   auto tensors = EvaluateNodes(output, item.fetch, item.feed);
@@ -1378,18 +1554,24 @@ TEST_F(ArithmeticOptimizerTest, FoldMulToTransposeConv) {
   TF_CHECK_OK(s.ToGraphDef(&item.graph));
 
   GraphDef output;
-  TF_EXPECT_OK(ArithmeticOptimizer().Optimize(nullptr, item, &output));
-
-  item.graph.Swap(&output);
-  TF_EXPECT_OK(ModelPruner().Optimize(nullptr, item, &output));
+  ArithmeticOptimizer optimizer;
+  EnableOnlyFoldMultipleIntoConv(&optimizer);
+  OptimizeTwiceAndPrune(&optimizer, &item, &output);
 
   NodeMap node_map(&output);
+
   // `conv` is now a folded convolution with scaled weights.
   const NodeDef* folded_conv = node_map.GetNode(conv.node()->name());
-  CHECK_EQ(node_map.GetNode(NodeName(folded_conv->input(1)))->op(), "Mul");
+  ASSERT_NE(folded_conv, nullptr);
+
+  const NodeDef* folded_conv_weights = node_map.GetNode(folded_conv->input(1));
+  ASSERT_NE(folded_conv_weights, nullptr);
+  EXPECT_EQ("Mul", folded_conv_weights->op());
+
   // Its input should be a transpose of `inputs`.
   const NodeDef* transpose = node_map.GetNode(NodeName(folded_conv->input(0)));
-  CHECK_EQ(NodeName(transpose->input(0)), inputs.node()->name());
+  ASSERT_NE(transpose, nullptr);
+  EXPECT_EQ("inputs", transpose->input(0));
 }
 
 TEST_F(ArithmeticOptimizerTest, NotFoldMulAcrossPreservedTranspose) {
@@ -1472,6 +1654,7 @@ TEST_F(ArithmeticOptimizerTest, OptimizeCastMulTransposeConv) {
   //     =>
   //   Conv2D(Cast(Transpose(I)), W*S)
   tensorflow::Scope s = tensorflow::Scope::NewRootScope().WithDevice("/gpu:0");
+
   Output inputs =
       ops::Placeholder(s, DT_UINT8, ops::Placeholder::Shape({8, 28, 28, 3}));
   Output cast = ops::Cast(s, inputs, DT_FLOAT);
@@ -1489,28 +1672,32 @@ TEST_F(ArithmeticOptimizerTest, OptimizeCastMulTransposeConv) {
   TF_CHECK_OK(s.ToGraphDef(&item.graph));
 
   GraphDef output;
-  TF_EXPECT_OK(ArithmeticOptimizer().Optimize(nullptr, item, &output));
+  ArithmeticOptimizer optimizer;  // all optimization stages are on
+  OptimizeTwiceAndPrune(&optimizer, &item, &output, /*const_folding=*/true);
 
-  // Run the optimizer twice to make sure the rewrite is idempotent.
-  item.graph.Swap(&output);
-  TF_EXPECT_OK(ArithmeticOptimizer().Optimize(nullptr, item, &output));
+  NodeMap node_map(&output);
 
-  item.graph.Swap(&output);
-  TF_EXPECT_OK(
-      ConstantFolding(/*cpu_device=*/nullptr).Optimize(nullptr, item, &output));
+  // Expected names for reordered cast and transpose.
+  const string p = "ArithmeticOptimizer/ReorderCastAndTranspose_";
+  const string optimized_cast_name = strings::StrCat(p, "float_Cast");
+  const string optimized_transpose_name = strings::StrCat(p, "uint8_Transpose");
 
-  item.graph.Swap(&output);
-  TF_EXPECT_OK(ModelPruner().Optimize(nullptr, item, &output));
+  // Expected names for folded multiply and conv.
+  const string optimized_weights =
+      "ArithmeticOptimizer/FoldMultiplyIntoConv_scaled_Conv2D_weights";
 
-  NodeMap node_map(&output);
-  const NodeDef* inputs_node = CHECK_NOTNULL(node_map.GetNode("Placeholder"));
-  const NodeDef* transpose_node =
-      CHECK_NOTNULL(node_map.GetNode(OptimizedName("Transpose_uint8")));
-  const NodeDef* cast_node =
-      CHECK_NOTNULL(node_map.GetNode(OptimizedName("Cast_float")));
-  const NodeDef* weights_node =
-      CHECK_NOTNULL(node_map.GetNode(OptimizedName("weights_scaled_Conv2D")));
-  const NodeDef* conv_node = CHECK_NOTNULL(node_map.GetNode("Conv2D"));
+  const NodeDef* inputs_node = node_map.GetNode("Placeholder");
+  const NodeDef* transpose_node = node_map.GetNode(optimized_transpose_name);
+  const NodeDef* cast_node = node_map.GetNode(optimized_cast_name);
+
+  const NodeDef* weights_node = node_map.GetNode(optimized_weights);
+  const NodeDef* conv_node = node_map.GetNode("Conv2D");
+
+  ASSERT_NE(inputs_node, nullptr);
+  ASSERT_NE(transpose_node, nullptr);
+  ASSERT_NE(cast_node, nullptr);
+  ASSERT_NE(weights_node, nullptr);
+  ASSERT_NE(conv_node, nullptr);
 
   EXPECT_EQ(output.node_size(), 7);
   EXPECT_EQ(transpose_node->input(0), inputs_node->name());
@@ -1542,23 +1729,27 @@ TEST_F(ArithmeticOptimizerTest, OptimizeMultipleMulTransposeConv) {
   TF_CHECK_OK(s.ToGraphDef(&item.graph));
 
   GraphDef output;
-  TF_EXPECT_OK(ArithmeticOptimizer().Optimize(nullptr, item, &output));
+  ArithmeticOptimizer optimizer;
+  EnableOnlyFoldMultipleIntoConv(&optimizer);
+  OptimizeTwiceAndPrune(&optimizer, &item, &output, /*const_folding=*/true);
 
-  item.graph.Swap(&output);
-  TF_EXPECT_OK(
-      ConstantFolding(/*cpu_device=*/nullptr).Optimize(nullptr, item, &output));
+  NodeMap node_map(&output);
 
-  item.graph.Swap(&output);
-  TF_EXPECT_OK(ModelPruner().Optimize(nullptr, item, &output));
+  using strings::StrCat;
+  const string p = "ArithmeticOptimizer/FoldMultiplyIntoConv_";
+  const string optimized_weights = StrCat(p, "scaled_Conv2D_weights");
+  const string optimized_weights_1 = StrCat(p, "scaled_Conv2D_1_weights_1");
 
-  NodeMap node_map(&output);
-  const NodeDef* weights_node =
-      CHECK_NOTNULL(node_map.GetNode(OptimizedName("weights_scaled_Conv2D")));
-  const NodeDef* conv_node = CHECK_NOTNULL(node_map.GetNode("Conv2D"));
+  const NodeDef* weights_node = node_map.GetNode(optimized_weights);
+  const NodeDef* weights_node_1 = node_map.GetNode(optimized_weights_1);
+  const NodeDef* conv_node = node_map.GetNode("Conv2D");
+  const NodeDef* conv_node_1 = node_map.GetNode("Conv2D_1");
+
+  ASSERT_NE(weights_node, nullptr);
+  ASSERT_NE(weights_node_1, nullptr);
+  ASSERT_NE(conv_node, nullptr);
+  ASSERT_NE(conv_node_1, nullptr);
 
-  const NodeDef* weights_node_1 =
-      CHECK_NOTNULL(node_map.GetNode(OptimizedName("weights_scaled_Conv2D_1")));
-  const NodeDef* conv_node_1 = CHECK_NOTNULL(node_map.GetNode("Conv2D_1"));
   EXPECT_EQ(conv_node->input(1), weights_node->name());
   EXPECT_EQ(conv_node_1->input(1), weights_node_1->name());
 }
@@ -2243,6 +2434,58 @@ TEST_F(ArithmeticOptimizerTest, ConvertSqrtDivToRsqrtMul) {
   }
 }
 
+TEST_F(ArithmeticOptimizerTest, ConvertPow) {
+  tensorflow::Scope s = tensorflow::Scope::NewRootScope();
+  auto x = ops::Const(s.WithOpName("x"), {1.0f, 2.0f}, {1, 2});
+  auto y2 = ops::Const(s.WithOpName("y2"), {2.0f, 2.0f}, {1, 2});
+  auto y1 = ops::Const(s.WithOpName("y1"), {1.0f, 1.0f}, {1, 2});
+  auto yPoint5 = ops::Const(s.WithOpName("y.5"), {0.5f, 0.5f}, {1, 2});
+  auto y0 = ops::Const(s.WithOpName("y0"), {0.0f, 0.0f}, {1, 2});
+  auto y_Point5 = ops::Const(s.WithOpName("y_.5"), {-0.5f, -0.5f}, {1, 2});
+  auto y_1 = ops::Const(s.WithOpName("y_1"), {-1.0f, -1.0f}, {1, 2});
+  auto y = ops::Const(s.WithOpName("y"), {3.0f, 4.0f}, {1, 2});
+  Output out2 = ops::Pow(s.WithOpName("out2"), x, y2);
+  Output out1 = ops::Pow(s.WithOpName("out1"), x, y1);
+  Output outPoint5 = ops::Pow(s.WithOpName("out.5"), x, yPoint5);
+  Output out0 = ops::Pow(s.WithOpName("out0"), x, y0);
+  Output out_Point5 = ops::Pow(s.WithOpName("out_.5"), x, y_Point5);
+  Output out_1 = ops::Pow(s.WithOpName("out_1"), x, y_1);
+  Output out = ops::Pow(s.WithOpName("out"), x, y);
+
+  GrapplerItem item;
+  item.fetch = {"out2", "out1", "out.5", "out0", "out_.5", "out_1", "out"};
+  TF_CHECK_OK(s.ToGraphDef(&item.graph));
+  auto tensors_expected = EvaluateNodes(item.graph, item.fetch);
+  EXPECT_EQ(7, tensors_expected.size());
+
+  GraphDef got;
+  ArithmeticOptimizer optimizer;
+  EnableOnlyConvertPow(&optimizer);
+  OptimizeAndPrune(&optimizer, &item, &got);
+  auto tensors = EvaluateNodes(got, item.fetch);
+  EXPECT_EQ(7, tensors.size());
+
+  GraphDef want;
+  AddNode("x", "Const", {}, {}, &want);
+  AddNode("y2", "Const", {}, {}, &want);
+  AddNode("y1", "Const", {}, {}, &want);
+  AddNode("y.5", "Const", {}, {}, &want);
+  AddNode("y0", "Const", {}, {}, &want);
+  AddNode("y_.5", "Const", {}, {}, &want);
+  AddNode("y_1", "Const", {}, {}, &want);
+  AddNode("y", "Const", {}, {}, &want);
+  AddNode("out2", "Square", {"x", AsControlDependency("y2")}, {}, &want);
+  AddNode("out1", "Identity", {"x", AsControlDependency("y1")}, {}, &want);
+  AddNode("out.5", "Sqrt", {"x", AsControlDependency("y.5")}, {}, &want);
+  AddNode("out0", "Const",
+          {AsControlDependency("x"), AsControlDependency("y0")}, {}, &want);
+  AddNode("out_.5", "Rsqrt", {"x", AsControlDependency("y_.5")}, {}, &want);
+  AddNode("out_1", "Reciprocal", {"x", AsControlDependency("y_1")}, {}, &want);
+  AddNode("out", "Pow", {"x", "y"}, {}, &want);
+
+  CompareGraphs(want, got);
+}
+
 TEST_F(ArithmeticOptimizerTest, MinimizeBroadcasts_SimpleSwap) {
   tensorflow::Scope s = tensorflow::Scope::NewRootScope();
 
@@ -2777,7 +3020,7 @@ TEST_F(ArithmeticOptimizerTest, RemoveLogicalNot) {
   ArithmeticOptimizer optimizer;
   EnableOnlyRemoveLogicalNot(&optimizer);
   OptimizeTwice(&optimizer, &item, &output);
-  LOG(INFO) << output.DebugString();
+
   int found = 0;
   for (const NodeDef& node : output.node()) {
     if (node.name() == "id_not_eq") {
diff --git a/tensorflow/core/grappler/optimizers/constant_folding.cc b/tensorflow/core/grappler/optimizers/constant_folding.cc
index 8bdb164b0371c0326b268d5649565d3b51d25645..f4b384ec1e9d054a9cd9a736dab401aaacc8854a 100644
--- a/tensorflow/core/grappler/optimizers/constant_folding.cc
+++ b/tensorflow/core/grappler/optimizers/constant_folding.cc
@@ -1631,39 +1631,187 @@ Status ConstantFolding::ReplaceOperationWithConstant(
   return Status::OK();
 }
 
-Status ConstantFolding::SimplifyGraph(GraphDef* optimized_graph,
-                                      GraphProperties* properties,
-                                      bool use_shape_info) {
+Status ConstantFolding::SimplifyGraph(bool use_shape_info,
+                                      GraphDef* optimized_graph,
+                                      GraphProperties* properties) {
   for (int i = 0; i < optimized_graph->node_size(); ++i) {
-    TF_RETURN_IF_ERROR(SimplifyNode(optimized_graph->mutable_node(i),
-                                    optimized_graph, properties,
-                                    use_shape_info));
+    TF_RETURN_IF_ERROR(SimplifyNode(use_shape_info,
+                                    optimized_graph->mutable_node(i),
+                                    optimized_graph, properties));
   }
   return Status::OK();
 }
 
-Status ConstantFolding::SimplifyNode(NodeDef* node, GraphDef* optimized_graph,
-                                     GraphProperties* properties,
-                                     bool use_shape_info) {
-  if (IsSplit(*node) && node->attr().at("num_split").i() == 1) {
-    ReplaceOperationWithIdentity(1, *properties, node, optimized_graph);
+Status ConstantFolding::SimplifyNode(bool use_shape_info, NodeDef* node,
+                                     GraphDef* optimized_graph,
+                                     GraphProperties* properties) {
+  if (RemoveSplitOrSplitV(*properties, optimized_graph, node)) {
     return Status::OK();
   }
 
-  if (IsSplitV(*node) && node->attr().at("num_split").i() == 1) {
-    ReplaceOperationWithIdentity(0, *properties, node, optimized_graph);
+  bool remove_shuffle_transpose_successful = false;
+  Status remove_shuffle_transpose_status =
+      RemoveShuffleOrTranspose(*properties, use_shape_info, optimized_graph,
+                               node, &remove_shuffle_transpose_successful);
+  if (!remove_shuffle_transpose_status.ok()) {
+    return remove_shuffle_transpose_status;
+  } else if (remove_shuffle_transpose_successful) {
+    return Status::OK();
+  }
+
+  if (RemoveRandomShuffle(*properties, use_shape_info, optimized_graph, node)) {
+    return Status::OK();
+  }
+
+  bool remove_reverse_successful = false;
+  Status remove_reverse_status =
+      RemoveReverse(*properties, use_shape_info, optimized_graph, node,
+                    &remove_reverse_successful);
+  if (!remove_reverse_status.ok()) {
+    return remove_reverse_status;
+  } else if (remove_reverse_successful) {
+    return Status::OK();
+  }
+
+  bool simplify_slice_successful = false;
+  Status simplify_slice_status =
+      SimplifySlice(*properties, use_shape_info, optimized_graph, node,
+                    &simplify_slice_successful);
+  if (!simplify_slice_status.ok()) {
+    return simplify_slice_status;
+  } else if (simplify_slice_successful) {
+    return Status::OK();
+  }
+
+  bool simplify_strided_slice_successful = false;
+  Status simplify_strided_slice_status =
+      SimplifyStridedSlice(*properties, use_shape_info, optimized_graph, node,
+                           &simplify_strided_slice_successful);
+  if (!simplify_strided_slice_status.ok()) {
+    return simplify_strided_slice_status;
+  } else if (simplify_strided_slice_successful) {
+    return Status::OK();
+  }
+
+  bool simplify_tile_successful = false;
+  Status simplify_tile_status =
+      SimplifyTile(*properties, use_shape_info, optimized_graph, node,
+                   &simplify_tile_successful);
+  if (!simplify_tile_status.ok()) {
+    return simplify_tile_status;
+  } else if (simplify_tile_successful) {
+    return Status::OK();
+  }
+
+  bool simplify_pad_successful = false;
+  Status simplify_pad_status =
+      SimplifyPad(*properties, use_shape_info, optimized_graph, node,
+                  &simplify_pad_successful);
+  if (!simplify_pad_status.ok()) {
+    return simplify_pad_status;
+  } else if (simplify_pad_successful) {
+    return Status::OK();
+  }
+
+  if (SimplifySqueeze(*properties, use_shape_info, optimized_graph, node)) {
+    return Status::OK();
+  }
+
+  if (SimplifyPack(optimized_graph, node)) {
+    graph_modified_ = true;
+    return Status::OK();
+  }
+
+  if (MoveConstantsPastEnter(optimized_graph, node)) {
+    graph_modified_ = true;
+    return Status::OK();
+  }
+
+  if (SimplifySwitch(optimized_graph, node)) {
+    graph_modified_ = true;
+    return Status::OK();
+  }
+
+  if (SimplifyReduction(*properties, node)) {
+    graph_modified_ = true;
+    return Status::OK();
+  }
+
+  if (SimplifyReshape(*properties, use_shape_info, node)) {
+    graph_modified_ = true;
+    return Status::OK();
+  }
+
+  bool arithmetic_simplification_succeed = false;
+  Status simplify_arithmetic_status =
+      SimplifyArithmeticOperations(*properties, use_shape_info, optimized_graph,
+                                   node, &arithmetic_simplification_succeed);
+  if (!simplify_arithmetic_status.ok()) {
+    return simplify_arithmetic_status;
+  } else if (arithmetic_simplification_succeed) {
+    graph_modified_ = true;
+    return Status::OK();
+  }
+
+  if (ReduceDivToReciprocalMul(optimized_graph, node)) {
+    graph_modified_ = true;
+    return Status::OK();
+  }
+
+  if (ConstantPushDown(node)) {
+    graph_modified_ = true;
+    return Status::OK();
+  }
+
+  if (MulConvPushDown(node, *properties)) {
+    graph_modified_ = true;
+    return Status::OK();
+  }
+
+  if (PartialConstPropThroughIdentityN(node)) {
+    graph_modified_ = true;
+    return Status::OK();
+  }
+
+  if (PartialAssocOpConstFolding(optimized_graph, properties, node)) {
+    graph_modified_ = true;
+    return Status::OK();
+  }
+
+  if (PartialConcatConstFolding(optimized_graph, properties, node)) {
+    graph_modified_ = true;
     return Status::OK();
   }
 
-  // Remove Shuffle or Transpose op over dimensions of size 1.
+  return Status::OK();
+}
+
+bool ConstantFolding::RemoveSplitOrSplitV(const GraphProperties& properties,
+                                          GraphDef* optimized_graph,
+                                          NodeDef* node) {
+  if (IsSplit(*node) && node->attr().at("num_split").i() == 1) {
+    ReplaceOperationWithIdentity(1, properties, node, optimized_graph);
+    return true;
+  }
+
+  if (IsSplitV(*node) && node->attr().at("num_split").i() == 1) {
+    ReplaceOperationWithIdentity(0, properties, node, optimized_graph);
+    return true;
+  }
+  return false;
+}
+
+Status ConstantFolding::RemoveShuffleOrTranspose(
+    const GraphProperties& properties, bool use_shape_info,
+    GraphDef* optimized_graph, NodeDef* node, bool* success) {
   if (use_shape_info && (IsShuffle(*node) || IsTranspose(*node)) &&
-      properties->GetInputProperties(node->name()).size() >= 2) {
-    const auto& shape = properties->GetInputProperties(node->name())[0].shape();
+      properties.GetInputProperties(node->name()).size() >= 2) {
+    const auto& shape = properties.GetInputProperties(node->name())[0].shape();
     if (shape.unknown_rank()) {
       // Not optimizable.
       return Status::OK();
     }
-    const auto& p = properties->GetInputProperties(node->name())[1];
+    const auto& p = properties.GetInputProperties(node->name())[1];
     if (TensorShape::IsValid(p.shape()) && p.has_value()) {
       Tensor perm(p.dtype(), p.shape());
       if (!perm.FromProto(p.value())) {
@@ -1690,34 +1838,45 @@ Status ConstantFolding::SimplifyNode(NodeDef* node, GraphDef* optimized_graph,
         replaceable &= shape.dim(j).size() == 1 || j == permutation[j];
       }
       if (replaceable) {
-        ReplaceOperationWithIdentity(0, *properties, node, optimized_graph);
+        ReplaceOperationWithIdentity(0, properties, node, optimized_graph);
+        *success = true;
         return Status::OK();
       }
     }
   }
-
-  // Remove RandomShuffle op if it is scalar or first dimension is of size 1.
+  *success = false;
+  return Status::OK();
+}
+bool ConstantFolding::RemoveRandomShuffle(const GraphProperties& properties,
+                                          bool use_shape_info,
+                                          GraphDef* optimized_graph,
+                                          NodeDef* node) {
   if (use_shape_info && IsRandomShuffle(*node) &&
-      !properties->GetInputProperties(node->name()).empty()) {
-    const auto& shape = properties->GetInputProperties(node->name())[0].shape();
+      !properties.GetInputProperties(node->name()).empty()) {
+    const auto& shape = properties.GetInputProperties(node->name())[0].shape();
     // The node is replaceable iff
     // unknown_rank == false && (dim_size == 0 || first dim is of size 1)
     if (!shape.unknown_rank() &&
         (shape.dim_size() == 0 || shape.dim(0).size() == 1)) {
-      ReplaceOperationWithIdentity(0, *properties, node, optimized_graph);
-      return Status::OK();
+      ReplaceOperationWithIdentity(0, properties, node, optimized_graph);
+      return true;
     }
   }
+  return false;
+}
 
-  // Remove Reverse op over dimensions with size 1.
+Status ConstantFolding::RemoveReverse(const GraphProperties& properties,
+                                      bool use_shape_info,
+                                      GraphDef* optimized_graph, NodeDef* node,
+                                      bool* success) {
   if (use_shape_info && node->op() == "ReverseV2" &&
-      properties->GetInputProperties(node->name()).size() >= 2) {
-    const auto& shape = properties->GetInputProperties(node->name())[0].shape();
+      properties.GetInputProperties(node->name()).size() >= 2) {
+    const auto& shape = properties.GetInputProperties(node->name())[0].shape();
     if (shape.unknown_rank()) {
       // Not optimizable.
       return Status::OK();
     }
-    const auto& a = properties->GetInputProperties(node->name())[1];
+    const auto& a = properties.GetInputProperties(node->name())[1];
     if (TensorShape::IsValid(a.shape()) && a.has_value()) {
       Tensor axis(a.dtype(), a.shape());
       if (!axis.FromProto(a.value())) {
@@ -1746,17 +1905,25 @@ Status ConstantFolding::SimplifyNode(NodeDef* node, GraphDef* optimized_graph,
                        target_axes.find(j) == target_axes.end();
       }
       if (replaceable) {
-        ReplaceOperationWithIdentity(0, *properties, node, optimized_graph);
+        ReplaceOperationWithIdentity(0, properties, node, optimized_graph);
+        *success = true;
         return Status::OK();
       }
     }
   }
+  *success = false;
+  return Status::OK();
+}
 
+Status ConstantFolding::SimplifySlice(const GraphProperties& properties,
+                                      bool use_shape_info,
+                                      GraphDef* optimized_graph, NodeDef* node,
+                                      bool* success) {
   if (use_shape_info && IsSlice(*node) &&
-      properties->GetInputProperties(node->name()).size() == 3) {
-    const auto& input = properties->GetInputProperties(node->name())[0];
-    const auto& b = properties->GetInputProperties(node->name())[1];
-    const auto& s = properties->GetInputProperties(node->name())[2];
+      properties.GetInputProperties(node->name()).size() == 3) {
+    const auto& input = properties.GetInputProperties(node->name())[0];
+    const auto& b = properties.GetInputProperties(node->name())[1];
+    const auto& s = properties.GetInputProperties(node->name())[2];
     if (TensorShape::IsValid(b.shape()) && b.has_value() &&
         TensorShape::IsValid(s.shape()) && s.has_value()) {
       Tensor begin(b.dtype(), b.shape());
@@ -1787,30 +1954,38 @@ Status ConstantFolding::SimplifyNode(NodeDef* node, GraphDef* optimized_graph,
         }
       }
       if (replaceable) {
-        ReplaceOperationWithIdentity(0, *properties, node, optimized_graph);
+        ReplaceOperationWithIdentity(0, properties, node, optimized_graph);
+        *success = true;
         return Status::OK();
       }
     }
   }
+  *success = false;
+  return Status::OK();
+}
 
+Status ConstantFolding::SimplifyStridedSlice(const GraphProperties& properties,
+                                             bool use_shape_info,
+                                             GraphDef* optimized_graph,
+                                             NodeDef* node, bool* success) {
   if (use_shape_info && IsStridedSlice(*node) &&
-      properties->GetInputProperties(node->name()).size() == 4) {
+      properties.GetInputProperties(node->name()).size() == 4) {
     if (node->attr().at("new_axis_mask").i() != 0 ||
         node->attr().at("shrink_axis_mask").i() != 0) {
       // Skip nodes with new/shrink axis mask, since they involve dimension
       // changes.
       return Status::OK();
     }
-    const auto& input = properties->GetInputProperties(node->name())[0];
+    const auto& input = properties.GetInputProperties(node->name())[0];
     for (int j = 0; j < input.shape().dim_size(); ++j) {
       // Skip if input shape is not fully determined.
       if (input.shape().dim(j).size() < 0) {
         return Status::OK();
       }
     }
-    const auto& b = properties->GetInputProperties(node->name())[1];
-    const auto& e = properties->GetInputProperties(node->name())[2];
-    const auto& s = properties->GetInputProperties(node->name())[3];
+    const auto& b = properties.GetInputProperties(node->name())[1];
+    const auto& e = properties.GetInputProperties(node->name())[2];
+    const auto& s = properties.GetInputProperties(node->name())[3];
     if (TensorShape::IsValid(b.shape()) && b.has_value() &&
         TensorShape::IsValid(e.shape()) && e.has_value() &&
         TensorShape::IsValid(s.shape()) && s.has_value()) {
@@ -1879,15 +2054,23 @@ Status ConstantFolding::SimplifyNode(NodeDef* node, GraphDef* optimized_graph,
             (end_mask & 1 << i || e == input.shape().dim(j).size()) && s == 1;
       }
       if (replaceable) {
-        ReplaceOperationWithIdentity(0, *properties, node, optimized_graph);
+        ReplaceOperationWithIdentity(0, properties, node, optimized_graph);
+        *success = true;
         return Status::OK();
       }
     }
   }
+  *success = false;
+  return Status::OK();
+}
 
+Status ConstantFolding::SimplifyTile(const GraphProperties& properties,
+                                     bool use_shape_info,
+                                     GraphDef* optimized_graph, NodeDef* node,
+                                     bool* success) {
   if (use_shape_info && IsTile(*node) &&
-      properties->GetInputProperties(node->name()).size() == 2) {
-    const auto& m = properties->GetInputProperties(node->name())[1];
+      properties.GetInputProperties(node->name()).size() == 2) {
+    const auto& m = properties.GetInputProperties(node->name())[1];
     if (TensorShape::IsValid(m.shape()) && m.has_value()) {
       Tensor multiplies(m.dtype(), m.shape());
       if (!multiplies.FromProto(m.value())) {
@@ -1907,15 +2090,23 @@ Status ConstantFolding::SimplifyNode(NodeDef* node, GraphDef* optimized_graph,
         }
       }
       if (replaceable) {
-        ReplaceOperationWithIdentity(0, *properties, node, optimized_graph);
+        ReplaceOperationWithIdentity(0, properties, node, optimized_graph);
+        *success = true;
         return Status::OK();
       }
     }
   }
+  *success = false;
+  return Status::OK();
+}
 
+Status ConstantFolding::SimplifyPad(const GraphProperties& properties,
+                                    bool use_shape_info,
+                                    GraphDef* optimized_graph, NodeDef* node,
+                                    bool* success) {
   if (use_shape_info && IsPad(*node) &&
-      properties->GetInputProperties(node->name()).size() >= 2) {
-    const auto& p = properties->GetInputProperties(node->name())[1];
+      properties.GetInputProperties(node->name()).size() >= 2) {
+    const auto& p = properties.GetInputProperties(node->name())[1];
     if (TensorShape::IsValid(p.shape()) && p.has_value()) {
       Tensor paddings(p.dtype(), p.shape());
       if (!paddings.FromProto(p.value())) {
@@ -1931,18 +2122,26 @@ Status ConstantFolding::SimplifyNode(NodeDef* node, GraphDef* optimized_graph,
         replaceable &= flatten(j) == 0;
       }
       if (replaceable) {
-        ReplaceOperationWithIdentity(0, *properties, node, optimized_graph);
+        ReplaceOperationWithIdentity(0, properties, node, optimized_graph);
+        *success = true;
         return Status::OK();
       }
     }
   }
+  *success = false;
+  return Status::OK();
+}
 
+bool ConstantFolding::SimplifySqueeze(const GraphProperties& properties,
+                                      bool use_shape_info,
+                                      GraphDef* optimized_graph,
+                                      NodeDef* node) {
   if (use_shape_info && IsSqueeze(*node) &&
-      !properties->GetInputProperties(node->name()).empty()) {
+      !properties.GetInputProperties(node->name()).empty()) {
     // https://www.tensorflow.org/api_docs/python/tf/squeeze mentions it's
     // error to squeeze a dimension that is not 1, so we only need to check
     // whether the input has > 1 size for each dimension.
-    const auto& shape = properties->GetInputProperties(node->name())[0].shape();
+    const auto& shape = properties.GetInputProperties(node->name())[0].shape();
     // The node is replaceable iff
     // unknown_rank == false && (dim_size == 0 || all dims have size > 1)
     bool replaceable = !shape.unknown_rank();
@@ -1950,11 +2149,14 @@ Status ConstantFolding::SimplifyNode(NodeDef* node, GraphDef* optimized_graph,
       replaceable &= shape.dim(j).size() > 1;
     }
     if (replaceable) {
-      ReplaceOperationWithIdentity(0, *properties, node, optimized_graph);
-      return Status::OK();
+      ReplaceOperationWithIdentity(0, properties, node, optimized_graph);
+      return true;
     }
   }
+  return false;
+}
 
+bool ConstantFolding::SimplifyPack(GraphDef* optimized_graph, NodeDef* node) {
   if (IsPack(*node) && NumNonControlInputs(*node) == 1 &&
       !OptimizedNodeExists(*node, "_const_axis")) {
     // Create constant axis node.
@@ -1965,11 +2167,11 @@ Status ConstantFolding::SimplifyNode(NodeDef* node, GraphDef* optimized_graph,
     if (!SetTensorValue(DT_INT32, axis, &axis_t).ok() ||
         !CreateNodeDef(axis_node->name(), TensorValue(&axis_t), axis_node)
              .ok()) {
-      return Status::OK();
+      return false;
     }
     // Add a control dependency to make sure axis_node is in the right frame.
     const string ctrl_dep = ConstantFolding::AddControlDependency(
-        node->input(0), graph_, node_map_.get());
+        node->input(0), optimized_graph, node_map_.get());
     axis_node->add_input(ctrl_dep);
     axis_node->set_device(node->device());
     node->set_op("ExpandDims");
@@ -1984,15 +2186,17 @@ Status ConstantFolding::SimplifyNode(NodeDef* node, GraphDef* optimized_graph,
     if (node->input_size() > 2) {
       node->mutable_input()->SwapElements(1, node->input_size() - 1);
     }
-    graph_modified_ = true;
-    return Status::OK();
+    return true;
   }
+  return false;
+}
 
-  // Move constants past Enter.
+bool ConstantFolding::MoveConstantsPastEnter(GraphDef* optimized_graph,
+                                             NodeDef* node) {
   if (IsEnter(*node) && node->input_size() > 0) {
     if (node->attr().count("is_constant") == 0 ||
         !node->attr().at("is_constant").b()) {
-      return Status::OK();
+      return false;
     }
     const string& node_name = node->name();
     const NodeDef* input = node_map_->GetNode(node->input(0));
@@ -2029,28 +2233,14 @@ Status ConstantFolding::SimplifyNode(NodeDef* node, GraphDef* optimized_graph,
             }
           }
         }
-        graph_modified_ = true;
-        return Status::OK();
+        return true;
       }
     }
   }
+  return false;
+}
 
-  // Switch(x, x) will always feed false to its false branch and true to
-  // its true branch. By rewriting the graph a bit, we can propagate these
-  // constants down the two output branches, and just use control dependencies
-  // to trigger the selected one at runtime. For example,
-  //
-  //     +------+
-  // x-->|Switch|-->a  (in practice there may be multiple consumers of each
-  // x-->|      |-->b   output branch.)
-  //     +------+
-  //
-  // Is rewritten as
-  //
-  //     +------+
-  // x-->|Switch|-->Identity--^>Const(false)-->a
-  // x-->|      |-->Identity--^>Const(true)-->b
-  //     +------+
+bool ConstantFolding::SimplifySwitch(GraphDef* optimized_graph, NodeDef* node) {
   if (node->op() == "Switch" && node->input(0) == node->input(1) &&
       !OptimizedNodeExists(*node, "_const_false") &&
       !OptimizedNodeExists(*node, "_const_true")) {
@@ -2087,7 +2277,7 @@ Status ConstantFolding::SimplifyNode(NodeDef* node, GraphDef* optimized_graph,
       false_node->set_name(OptimizedNodeName(*node, "_const_false"));
       if (!CreateNodeDef(false_node->name(), TensorValue(&false_t), false_node)
                .ok()) {
-        return Status::OK();
+        return false;
       }
       false_node->set_device(node->device());
 
@@ -2095,7 +2285,7 @@ Status ConstantFolding::SimplifyNode(NodeDef* node, GraphDef* optimized_graph,
       true_node->set_name(OptimizedNodeName(*node, "_const_true"));
       if (!CreateNodeDef(true_node->name(), TensorValue(&true_t), true_node)
                .ok()) {
-        return Status::OK();
+        return false;
       }
       true_node->set_device(node->device());
 
@@ -2129,11 +2319,15 @@ Status ConstantFolding::SimplifyNode(NodeDef* node, GraphDef* optimized_graph,
           }
         }
       }
-      graph_modified_ = true;
-      return Status::OK();
+      return true;
     }
   }
-  if (IsSimplifiableReduction(*node, *properties)) {
+  return false;
+}
+
+bool ConstantFolding::SimplifyReduction(const GraphProperties& properties,
+                                        NodeDef* node) {
+  if (IsSimplifiableReduction(*node, properties)) {
     // Replace the reduction node with an identity node, that can be further
     // optimized by the model pruner.
     DataType output_type;
@@ -2147,66 +2341,26 @@ Status ConstantFolding::SimplifyNode(NodeDef* node, GraphDef* optimized_graph,
     node->clear_attr();
     (*node->mutable_attr())["T"].set_type(output_type);
     *node->mutable_input(1) = AsControlDependency(node->input(1));
-    graph_modified_ = true;
-    return Status::OK();
-  }
-  if (use_shape_info && IsSimplifiableReshape(*node, *properties)) {
-    DataType output_type = node->attr().at("T").type();
-    node->set_op("Identity");
-    node->clear_attr();
-    (*node->mutable_attr())["T"].set_type(output_type);
-    *node->mutable_input(1) = AsControlDependency(node->input(1));
-    graph_modified_ = true;
-    return Status::OK();
-  }
-
-  bool arithmetic_simplification_succeed = false;
-  Status simplify_arithmetic_status = SimplifyArithmeticOperations(
-      optimized_graph, properties, node, use_shape_info,
-      &arithmetic_simplification_succeed);
-  if (!simplify_arithmetic_status.ok()) {
-    return simplify_arithmetic_status;
-  } else if (arithmetic_simplification_succeed) {
-    graph_modified_ = true;
-    return Status::OK();
-  }
-
-  if (ReduceDivToReciprocalMul(optimized_graph, node)) {
-    graph_modified_ = true;
-    return Status::OK();
-  }
-
-  if (ConstantPushDown(node)) {
-    graph_modified_ = true;
-    return Status::OK();
-  }
-
-  if (MulConvPushDown(node, *properties)) {
-    graph_modified_ = true;
-    return Status::OK();
-  }
-
-  if (PartialConstPropThroughIdentityN(node)) {
-    graph_modified_ = true;
-    return Status::OK();
-  }
-
-  if (PartialAssocOpConstFolding(optimized_graph, properties, node)) {
-    graph_modified_ = true;
-    return Status::OK();
-  }
-
-  if (PartialConcatConstFolding(optimized_graph, properties, node)) {
-    graph_modified_ = true;
-    return Status::OK();
+    return true;
   }
+  return false;
+}
 
-  return Status::OK();
+bool ConstantFolding::SimplifyReshape(const GraphProperties& properties,
+                                      bool use_shape_info, NodeDef* node) {
+  if (!use_shape_info) return false;
+  if (!IsSimplifiableReshape(*node, properties)) return false;
+  DataType output_type = node->attr().at("T").type();
+  node->set_op("Identity");
+  node->clear_attr();
+  (*node->mutable_attr())["T"].set_type(output_type);
+  *node->mutable_input(1) = AsControlDependency(node->input(1));
+  return true;
 }
 
 Status ConstantFolding::SimplifyArithmeticOperations(
-    GraphDef* optimized_graph, GraphProperties* properties, NodeDef* node,
-    bool use_shape_info, bool* success) {
+    const GraphProperties& properties, bool use_shape_info,
+    GraphDef* optimized_graph, NodeDef* node, bool* success) {
   const bool is_mul = IsMul(*node) || IsLogicalAnd(*node);
   const bool is_matmul = IsMatMul(*node);
   const bool is_add = IsAdd(*node) || IsBiasAdd(*node) || IsLogicalOr(*node);
@@ -2215,8 +2369,8 @@ Status ConstantFolding::SimplifyArithmeticOperations(
   // Simplify arithmetic operations with ones or zeros.
   if (use_shape_info &&
       (is_mul || is_matmul || is_add || is_sub || is_any_div) &&
-      properties->HasInputProperties(node->name()) &&
-      properties->HasOutputProperties(node->name())) {
+      properties.HasInputProperties(node->name()) &&
+      properties.HasOutputProperties(node->name())) {
     const NodeDef* x = node_map_->GetNode(node->input(0));
     const NodeDef* y = node_map_->GetNode(node->input(1));
     if (x == nullptr || y == nullptr) {
@@ -2224,19 +2378,19 @@ Status ConstantFolding::SimplifyArithmeticOperations(
                                      node->DebugString());
     }
     const TensorShapeProto& output_shape =
-        properties->GetOutputProperties(node->name())[0].shape();
+        properties.GetOutputProperties(node->name())[0].shape();
 
     // Simplify element-wise multiplication by ones or addition/subtraction
     // of zeros.
     const TensorShapeProto& y_shape =
-        properties->GetInputProperties(node->name())[1].shape();
+        properties.GetInputProperties(node->name())[1].shape();
     const bool x_is_zero = IsZeros(*x);
     const bool x_is_one = x_is_zero ? false : IsOnes(*x);
     const bool y_matches_output_shape = ShapesEqual(output_shape, y_shape);
     if (y_matches_output_shape &&
         ((is_mul && x_is_one) || (is_add && x_is_zero))) {
       // 1 * y = y or 0 + y = y.
-      ReplaceOperationWithSnapshot(1, *properties, node, optimized_graph);
+      ReplaceOperationWithSnapshot(1, properties, node, optimized_graph);
       *success = true;
       return Status::OK();
     }
@@ -2259,14 +2413,14 @@ Status ConstantFolding::SimplifyArithmeticOperations(
     }
 
     const TensorShapeProto& x_shape =
-        properties->GetInputProperties(node->name())[0].shape();
+        properties.GetInputProperties(node->name())[0].shape();
     const bool y_is_zero = IsZeros(*y);
     const bool y_is_one = y_is_zero ? false : IsOnes(*y);
     const bool x_matches_output_shape = ShapesEqual(output_shape, x_shape);
     if (x_matches_output_shape && (((is_mul || is_any_div) && y_is_one) ||
                                    ((is_add || is_sub) && y_is_zero))) {
       // x * 1 = x or x / 1 = x or x +/- 0 = x
-      ReplaceOperationWithSnapshot(0, *properties, node, optimized_graph);
+      ReplaceOperationWithSnapshot(0, properties, node, optimized_graph);
       *success = true;
       return Status::OK();
     }
@@ -2276,9 +2430,8 @@ Status ConstantFolding::SimplifyArithmeticOperations(
     const PartialTensorShape shp(output_shape);
     if (shp.IsFullyDefined() && IsLogicalOr(*node) && (y_is_one || x_is_one)) {
       bool replace_succeed = false;
-      Status replace_op_status =
-          ReplaceOperationWithConstant(1, *properties, output_shape, node,
-                                       optimized_graph, &replace_succeed);
+      Status replace_op_status = ReplaceOperationWithConstant(
+          1, properties, output_shape, node, optimized_graph, &replace_succeed);
       if (!replace_op_status.ok()) {
         return replace_op_status;
       } else if (replace_succeed) {
@@ -2296,7 +2449,7 @@ Status ConstantFolding::SimplifyArithmeticOperations(
       if (shp.IsFullyDefined()) {
         bool replace_succeed = false;
         Status replace_op_status =
-            ReplaceOperationWithConstant(0, *properties, output_shape, node,
+            ReplaceOperationWithConstant(0, properties, output_shape, node,
                                          optimized_graph, &replace_succeed);
         if (!replace_op_status.ok()) {
           return replace_op_status;
@@ -2309,11 +2462,11 @@ Status ConstantFolding::SimplifyArithmeticOperations(
       // matches the output shape and thus forward the corresponding zero
       // input.
       if ((is_mul || is_any_div) && x_is_zero && x_matches_output_shape) {
-        ReplaceOperationWithIdentity(0, *properties, node, optimized_graph);
+        ReplaceOperationWithIdentity(0, properties, node, optimized_graph);
         *success = true;
         return Status::OK();
       } else if (is_mul && y_is_zero && y_matches_output_shape) {
-        ReplaceOperationWithIdentity(1, *properties, node, optimized_graph);
+        ReplaceOperationWithIdentity(1, properties, node, optimized_graph);
         *success = true;
         return Status::OK();
       }
@@ -2855,7 +3008,7 @@ Status ConstantFolding::RunOptimizationPass(Cluster* cluster,
   TF_RETURN_IF_ERROR(FoldGraph(optimized_graph));
   node_map_.reset(new NodeMap(optimized_graph));
   TF_RETURN_IF_ERROR(
-      SimplifyGraph(optimized_graph, &properties, can_use_shape_info));
+      SimplifyGraph(can_use_shape_info, optimized_graph, &properties));
 
   return Status::OK();
 }
diff --git a/tensorflow/core/grappler/optimizers/constant_folding.h b/tensorflow/core/grappler/optimizers/constant_folding.h
index e477934f308b243ac6273fdd79d334445d4254c8..b42d5f201eabb7f1697473997ffec2509e1e1118 100644
--- a/tensorflow/core/grappler/optimizers/constant_folding.h
+++ b/tensorflow/core/grappler/optimizers/constant_folding.h
@@ -97,10 +97,10 @@ class ConstantFolding : public GraphOptimizer {
                                const GraphProperties& properties) const;
   bool IsSimplifiableReshape(const NodeDef& node,
                              const GraphProperties& properties) const;
-  Status SimplifyGraph(GraphDef* output, GraphProperties* properties,
-                       bool use_shape_info);
-  Status SimplifyNode(NodeDef* node, GraphDef* optimized_graph,
-                      GraphProperties* properties, bool use_shape_info);
+  Status SimplifyGraph(bool use_shape_info, GraphDef* optimized_graph,
+                       GraphProperties* properties);
+  Status SimplifyNode(bool use_shape_info, NodeDef* node,
+                      GraphDef* optimized_graph, GraphProperties* properties);
 
   Status RunOptimizationPass(Cluster* cluster, const GrapplerItem& item,
                              GraphDef* output);
@@ -134,11 +134,81 @@ class ConstantFolding : public GraphOptimizer {
   // Simplifies arithmetic operations with ones or zeros. Returns the status,
   // and updates the success input argument that denotes if any simplification
   // was applied.
-  Status SimplifyArithmeticOperations(GraphDef* optimized_graph,
-                                      GraphProperties* properties,
-                                      NodeDef* node, bool use_shape_info,
+  Status SimplifyArithmeticOperations(const GraphProperties& properties,
+                                      bool use_shape_info,
+                                      GraphDef* optimized_graph, NodeDef* node,
                                       bool* success);
 
+  // Simplifies a Reshape operation to an Identity operation if applicable.
+  bool SimplifyReshape(const GraphProperties& properties, bool use_shape_info,
+                       NodeDef* node);
+
+  // Simplifies a Reduction operation to an Identity operation if applicable.
+  bool SimplifyReduction(const GraphProperties& properties, NodeDef* node);
+
+  // Switch(x, x) will always feed false to its false branch and true to
+  // its true branch. By rewriting the graph a bit, we can propagate these
+  // constants down the two output branches, and just use control dependencies
+  // to trigger the selected one at runtime. For example,
+  //
+  //     +------+
+  // x-->|Switch|-->a  (in practice there may be multiple consumers of each
+  // x-->|      |-->b   output branch.)
+  //     +------+
+  //
+  // Is rewritten as
+  //
+  //     +------+
+  // x-->|Switch|-->Identity--^>Const(false)-->a
+  // x-->|      |-->Identity--^>Const(true)-->b
+  //     +------+
+  bool SimplifySwitch(GraphDef* optimized_graph, NodeDef* node);
+
+  // Moves constants past Enter node if applicable.
+  bool MoveConstantsPastEnter(GraphDef* optimized_graph, NodeDef* node);
+
+  // Simplifies Pack operation if applicable.
+  bool SimplifyPack(GraphDef* optimized_graph, NodeDef* node);
+
+  // Simplifies a Squeeze operation to an Identity operation if applicable.
+  bool SimplifySqueeze(const GraphProperties& properties, bool use_shape_info,
+                       GraphDef* optimized_graph, NodeDef* node);
+
+  // Simplifies a Pad operation to an Identity operation if applicable.
+  Status SimplifyPad(const GraphProperties& properties, bool use_shape_info,
+                     GraphDef* optimized_graph, NodeDef* node, bool* success);
+
+  // Simplifies a Tile operation to an Identity operation if applicable.
+  Status SimplifyTile(const GraphProperties& properties, bool use_shape_info,
+                      GraphDef* optimized_graph, NodeDef* node, bool* success);
+
+  // Simplifies a StridedSlice operation to an Identity operation if applicable.
+  Status SimplifyStridedSlice(const GraphProperties& properties,
+                              bool use_shape_info, GraphDef* optimized_graph,
+                              NodeDef* node, bool* success);
+
+  // Simplifies a Slice operation to an Identity operation if applicable.
+  Status SimplifySlice(const GraphProperties& properties, bool use_shape_info,
+                       GraphDef* optimized_graph, NodeDef* node, bool* success);
+
+  // Removes Reverse op over dimensions with size 1.
+  Status RemoveReverse(const GraphProperties& properties, bool use_shape_info,
+                       GraphDef* optimized_graph, NodeDef* node, bool* success);
+
+  // Removes RandomShuffle op if it is scalar or first dimension is of size 1.
+  bool RemoveRandomShuffle(const GraphProperties& properties,
+                           bool use_shape_info, GraphDef* optimized_graph,
+                           NodeDef* node);
+
+  // Removes Shuffle or Transpose op over dimensions of size 1.
+  Status RemoveShuffleOrTranspose(const GraphProperties& properties,
+                                  bool use_shape_info,
+                                  GraphDef* optimized_graph, NodeDef* node,
+                                  bool* success);
+
+  // Removes Split or SplitV node if possible.
+  bool RemoveSplitOrSplitV(const GraphProperties& properties,
+                           GraphDef* optimized_graph, NodeDef* node);
   // Points to an externally provided device or to owned_device_;
   RewriterConfig::Toggle opt_level_;
   DeviceBase* cpu_device_;
diff --git a/tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.h b/tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.h
index 3148a5f809f0dffe333058ea08bc1a6118e31306..0b8e0b692aeeb5ab71b55c72f5320579ff9ac80e 100644
--- a/tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.h
+++ b/tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.h
@@ -50,7 +50,7 @@ class CustomGraphOptimizerRegistrar {
 
 #define REGISTER_GRAPH_OPTIMIZER_AS(MyCustomGraphOptimizerClass, name) \
   namespace {                                                          \
-  static CustomGraphOptimizerRegistrar                                 \
+  static ::tensorflow::grappler::CustomGraphOptimizerRegistrar         \
       MyCustomGraphOptimizerClass##_registrar(                         \
           []() { return new MyCustomGraphOptimizerClass; }, (name));   \
   }  // namespace
diff --git a/tensorflow/core/grappler/optimizers/data/BUILD b/tensorflow/core/grappler/optimizers/data/BUILD
new file mode 100644
index 0000000000000000000000000000000000000000..08fc9d84da232577734d28a93b925e34f7b14726
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/data/BUILD
@@ -0,0 +1,112 @@
+licenses(["notice"])  # Apache 2.0
+
+load("//tensorflow:tensorflow.bzl", "tf_cc_test")
+load("//tensorflow/core:platform/default/build_config.bzl", "tf_protos_all")
+
+cc_library(
+    name = "graph_utils",
+    srcs = ["graph_utils.cc"],
+    hdrs = [
+        "graph_utils.h",
+    ],
+    visibility = ["//visibility:public"],
+    deps = [
+        "//tensorflow/core:framework",
+        "//tensorflow/core:lib",
+        "//tensorflow/core/grappler:graph_view",
+        "//tensorflow/core/grappler:grappler_item",
+        "//tensorflow/core/grappler:grappler_item_builder",
+        "//tensorflow/core/grappler:utils",
+        "//tensorflow/core/grappler/clusters:virtual_cluster",
+        "//tensorflow/core/grappler/optimizers:meta_optimizer",
+    ] + tf_protos_all(),
+)
+
+tf_cc_test(
+    name = "graph_utils_test",
+    srcs = ["graph_utils_test.cc"],
+    visibility = ["//visibility:public"],
+    deps = [
+        ":graph_utils",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
+    ],
+)
+
+cc_library(
+    name = "map_and_batch_fusion",
+    srcs = ["map_and_batch_fusion.cc"],
+    hdrs = [
+        "map_and_batch_fusion.h",
+    ],
+    visibility = ["//visibility:public"],
+    deps = [
+        ":graph_utils",
+        "//tensorflow/core:lib",
+        "//tensorflow/core/grappler:graph_view",
+        "//tensorflow/core/grappler:grappler_item",
+        "//tensorflow/core/grappler:op_types",
+        "//tensorflow/core/grappler:utils",
+        "//tensorflow/core/grappler/clusters:cluster",
+        "//tensorflow/core/grappler/optimizers:custom_graph_optimizer",
+        "//tensorflow/core/grappler/optimizers:custom_graph_optimizer_registry",
+    ] + tf_protos_all(),
+)
+
+tf_cc_test(
+    name = "map_and_batch_fusion_test",
+    srcs = ["map_and_batch_fusion_test.cc"],
+    visibility = ["//visibility:public"],
+    deps = [
+        ":graph_utils",
+        ":map_and_batch_fusion",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
+        "//tensorflow/core/grappler:grappler_item",
+    ],
+)
+
+cc_library(
+    name = "shuffle_and_repeat_fusion",
+    srcs = ["shuffle_and_repeat_fusion.cc"],
+    hdrs = [
+        "shuffle_and_repeat_fusion.h",
+    ],
+    visibility = ["//visibility:public"],
+    deps = [
+        ":graph_utils",
+        "//tensorflow/core:lib",
+        "//tensorflow/core/grappler:graph_view",
+        "//tensorflow/core/grappler:grappler_item",
+        "//tensorflow/core/grappler:op_types",
+        "//tensorflow/core/grappler:utils",
+        "//tensorflow/core/grappler/clusters:cluster",
+        "//tensorflow/core/grappler/optimizers:custom_graph_optimizer",
+        "//tensorflow/core/grappler/optimizers:custom_graph_optimizer_registry",
+    ] + tf_protos_all(),
+)
+
+tf_cc_test(
+    name = "shuffle_and_repeat_fusion_test",
+    srcs = ["shuffle_and_repeat_fusion_test.cc"],
+    visibility = ["//visibility:public"],
+    deps = [
+        ":graph_utils",
+        ":shuffle_and_repeat_fusion",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
+        "//tensorflow/core/grappler:grappler_item",
+    ],
+)
+
+cc_library(
+    name = "data",
+    visibility = ["//visibility:public"],
+    deps = [
+        ":map_and_batch_fusion",
+        ":shuffle_and_repeat_fusion",
+    ],
+    alwayslink = 1,
+)
diff --git a/tensorflow/core/grappler/optimizers/data/graph_utils.cc b/tensorflow/core/grappler/optimizers/data/graph_utils.cc
new file mode 100644
index 0000000000000000000000000000000000000000..aece142f7ad67305b46fca421ddae8e30746c895
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/data/graph_utils.cc
@@ -0,0 +1,226 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/grappler/optimizers/data/graph_utils.h"
+
+#include "tensorflow/core/framework/device_base.h"
+#include "tensorflow/core/grappler/clusters/virtual_cluster.h"
+#include "tensorflow/core/grappler/graph_view.h"
+#include "tensorflow/core/grappler/grappler_item.h"
+#include "tensorflow/core/grappler/grappler_item_builder.h"
+#include "tensorflow/core/grappler/optimizers/meta_optimizer.h"
+#include "tensorflow/core/util/ptr_util.h"
+
+namespace tensorflow {
+namespace grappler {
+namespace graph_utils {
+namespace {
+
+constexpr char kConstOpName[] = "Const";
+
+int FindNodeWithPredicate(const std::function<bool(const NodeDef&)>& predicate,
+                          const GraphDef& graph) {
+  for (int i = 0; i < graph.node_size(); ++i) {
+    if (predicate(graph.node(i))) {
+      return i;
+    }
+  }
+  return -1;
+}
+
+std::vector<int> CreateNameIndex(const GraphDef& graph) {
+  std::map<string, int> names;
+  for (int i = 0; i < graph.node_size(); ++i) {
+    names[graph.node(i).name()] = i;
+  }
+  std::vector<int> index(graph.node_size());
+  int i = 0;
+  for (const auto& pair : names) {
+    index[i++] = pair.second;
+  }
+  return index;
+}
+
+std::vector<int> CreateInputIndex(const NodeDef& node) {
+  std::map<string, int> inputs;
+  for (int i = 0; i < node.input_size(); ++i) {
+    inputs[node.input(i)] = i;
+  }
+  std::vector<int> index(node.input_size());
+  int i = 0;
+  for (const auto& pair : inputs) {
+    index[i++] = pair.second;
+  }
+  return index;
+}
+
+Status AddScalarConstNodeHelper(
+    DataType dtype, const std::function<void(TensorProto*)>& add_value,
+    GraphDef* graph, NodeDef** result) {
+  NodeDef* node = graph->add_node();
+  node->set_op(kConstOpName);
+  SetUniqueName(kConstOpName, graph, node);
+  (*node->mutable_attr())["dtype"].set_type(dtype);
+  std::unique_ptr<tensorflow::TensorProto> tensor =
+      tensorflow::MakeUnique<tensorflow::TensorProto>();
+  std::unique_ptr<tensorflow::TensorShapeProto> tensor_shape =
+      tensorflow::MakeUnique<tensorflow::TensorShapeProto>();
+  tensor->set_allocated_tensor_shape(tensor_shape.release());
+  tensor->set_dtype(dtype);
+  add_value(tensor.get());
+  (*node->mutable_attr())["value"].set_allocated_tensor(tensor.release());
+  *result = node;
+  return Status::OK();
+}
+
+}  // namespace
+
+Status AddNode(const string& name, const string& op,
+               const std::vector<string>& inputs,
+               const std::vector<std::pair<string, AttrValue>>& attributes,
+               GraphDef* graph, NodeDef** result) {
+  NodeDef* node = graph->add_node();
+  if (!name.empty()) {
+    node->set_name(name);
+  } else {
+    SetUniqueName(op, graph, node);
+  }
+  node->set_op(op);
+  for (const string& input : inputs) {
+    node->add_input(input);
+  }
+  for (auto attr : attributes) {
+    (*node->mutable_attr())[attr.first] = attr.second;
+  }
+  *result = node;
+  return Status::OK();
+}
+
+template <>
+Status AddScalarConstNode(bool v, GraphDef* graph, NodeDef** result) {
+  return AddScalarConstNodeHelper(
+      DT_BOOL, [v](TensorProto* proto) { proto->add_bool_val(v); }, graph,
+      result);
+}
+
+template <>
+Status AddScalarConstNode(double v, GraphDef* graph, NodeDef** result) {
+  return AddScalarConstNodeHelper(
+      DT_DOUBLE, [v](TensorProto* proto) { proto->add_double_val(v); }, graph,
+      result);
+}
+
+template <>
+Status AddScalarConstNode(float v, GraphDef* graph, NodeDef** result) {
+  return AddScalarConstNodeHelper(
+      DT_FLOAT, [v](TensorProto* proto) { proto->add_float_val(v); }, graph,
+      result);
+}
+
+template <>
+Status AddScalarConstNode(int v, GraphDef* graph, NodeDef** result) {
+  return AddScalarConstNodeHelper(
+      DT_INT32, [v](TensorProto* proto) { proto->add_int_val(v); }, graph,
+      result);
+}
+
+template <>
+Status AddScalarConstNode(int64 v, GraphDef* graph, NodeDef** result) {
+  return AddScalarConstNodeHelper(
+      DT_INT64, [v](TensorProto* proto) { proto->add_int64_val(v); }, graph,
+      result);
+}
+
+template <>
+Status AddScalarConstNode(StringPiece v, GraphDef* graph, NodeDef** result) {
+  return AddScalarConstNodeHelper(
+      DT_STRING,
+      [v](TensorProto* proto) { proto->add_string_val(v.data(), v.size()); },
+      graph, result);
+}
+
+bool Compare(const GraphDef& g1, const GraphDef& g2) {
+  if (g1.node_size() != g2.node_size()) {
+    return false;
+  }
+  std::vector<int> name_index1 = CreateNameIndex(g1);
+  std::vector<int> name_index2 = CreateNameIndex(g2);
+  for (int i = 0; i < g1.node_size(); ++i) {
+    int idx1 = name_index1[i];
+    int idx2 = name_index2[i];
+    if (g1.node(idx1).op() != g2.node(idx2).op()) {
+      return false;
+    }
+    if (g1.node(idx1).name() != g2.node(idx2).name()) {
+      return false;
+    }
+    if (g1.node(idx1).input_size() != g2.node(idx2).input_size()) {
+      return false;
+    }
+    std::vector<int> input_index1 = CreateInputIndex(g1.node(idx1));
+    std::vector<int> input_index2 = CreateInputIndex(g2.node(idx2));
+    for (int j = 0; j < g1.node(idx1).input_size(); ++j) {
+      if (!IsSameInput(g1.node(idx1).input(input_index1[j]),
+                       g2.node(idx2).input(input_index2[j]))) {
+        return false;
+      }
+    }
+  }
+  return true;
+}
+
+bool ContainsNodeWithName(const string& name, const GraphDef& graph) {
+  return FindNodeWithName(name, graph) != -1;
+}
+
+bool ContainsNodeWithOp(const string& op, const GraphDef& graph) {
+  return FindNodeWithOp(op, graph) != -1;
+}
+
+Status DeleteNodes(const std::set<string>& nodes_to_delete, GraphDef* graph) {
+  int last = graph->node_size() - 1;
+  for (int i = graph->node_size() - 1; i >= 0; --i) {
+    const NodeDef& node = graph->node(i);
+    if (nodes_to_delete.find(node.name()) != nodes_to_delete.end()) {
+      graph->mutable_node()->SwapElements(i, last);
+      last--;
+    }
+  }
+  graph->mutable_node()->DeleteSubrange(last + 1,
+                                        graph->node_size() - last - 1);
+  return Status::OK();
+}
+
+int FindNodeWithName(const string& name, const GraphDef& graph) {
+  return FindNodeWithPredicate(
+      [name](const NodeDef& node) { return node.name() == name; }, graph);
+}
+
+int FindNodeWithOp(const string& op, const GraphDef& graph) {
+  return FindNodeWithPredicate(
+      [op](const NodeDef& node) { return node.op() == op; }, graph);
+}
+
+void SetUniqueName(const string& op, GraphDef* graph, NodeDef* node) {
+  int id = graph->node_size();
+  while (ContainsNodeWithName(strings::StrCat(op, "/_", id), *graph)) {
+    ++id;
+  }
+  node->set_name(strings::StrCat(op, "/_", id));
+}
+
+}  // end namespace graph_utils
+}  // end namespace grappler
+}  // end namespace tensorflow
diff --git a/tensorflow/core/grappler/optimizers/data/graph_utils.h b/tensorflow/core/grappler/optimizers/data/graph_utils.h
new file mode 100644
index 0000000000000000000000000000000000000000..3d2467031f53e6dde2f80d2ea81c92cd57e8eddc
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/data/graph_utils.h
@@ -0,0 +1,85 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CORE_GRAPPLER_OPTIMIZERS_DATA_GRAPH_UTILS_H_
+#define TENSORFLOW_CORE_GRAPPLER_OPTIMIZERS_DATA_GRAPH_UTILS_H_
+
+#include "tensorflow/core/framework/attr_value.pb.h"
+#include "tensorflow/core/framework/graph.pb.h"
+#include "tensorflow/core/framework/node_def.pb.h"
+#include "tensorflow/core/framework/tensor.pb.h"
+#include "tensorflow/core/framework/tensor_shape.pb.h"
+#include "tensorflow/core/framework/types.h"
+#include "tensorflow/core/grappler/utils.h"
+#include "tensorflow/core/lib/core/errors.h"
+
+namespace tensorflow {
+namespace grappler {
+namespace graph_utils {
+
+// Adds a node to the graph.
+Status AddNode(const string& name, const string& op,
+               const std::vector<string>& inputs,
+               const std::vector<std::pair<string, AttrValue>>& attributes,
+               GraphDef* graph, NodeDef** result);
+
+// Adds a Const node with the given value to the graph.
+template <typename T>
+Status AddScalarConstNode(T v, GraphDef* graph, NodeDef** result) {
+  return errors::Unimplemented("Type %s is not supported.",
+                               DataTypeToEnum<T>::value);
+}
+template <>
+Status AddScalarConstNode(bool v, GraphDef* graph, NodeDef** result);
+template <>
+Status AddScalarConstNode(double v, GraphDef* graph, NodeDef** result);
+template <>
+Status AddScalarConstNode(float v, GraphDef* graph, NodeDef** result);
+template <>
+Status AddScalarConstNode(int v, GraphDef* graph, NodeDef** result);
+template <>
+Status AddScalarConstNode(int64 v, GraphDef* graph, NodeDef** result);
+template <>
+Status AddScalarConstNode(StringPiece v, GraphDef* graph, NodeDef** result);
+
+// Checks whether the two graphs are the same.
+bool Compare(const GraphDef& g1, const GraphDef& g2);
+
+// Checks whether the graph contains a node with the given name.
+bool ContainsNodeWithName(const string& name, const GraphDef& graph);
+
+// Checks whether the graph contains a node with the given op.
+bool ContainsNodeWithOp(const string& op, const GraphDef& graph);
+
+// Deletes nodes from the graph.
+Status DeleteNodes(const std::set<string>& nodes_to_delete, GraphDef* graph);
+
+// Returns the index of the node with the given name or -1 if the node does
+// not exist.
+int FindNodeWithName(const string& name, const GraphDef& graph);
+
+// Returns the index of a node with the given op or -1 if no such  node
+// exists.
+int FindNodeWithOp(const string& op, const GraphDef& graph);
+
+// Sets the node name using the op name as a prefix while guaranteeing the name
+// is unique across the graph.
+void SetUniqueName(const string& op, GraphDef* graph, NodeDef* node);
+
+}  // end namespace graph_utils
+}  // end namespace grappler
+}  // end namespace tensorflow
+
+#endif  // TENSORFLOW_CORE_GRAPPLER_OPTIMIZERS_DATA_GRAPH_UTILS_H_
diff --git a/tensorflow/core/grappler/optimizers/data/graph_utils_test.cc b/tensorflow/core/grappler/optimizers/data/graph_utils_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..00f66c9bc176c7e2dd0bb9439b61e826b16b9c87
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/data/graph_utils_test.cc
@@ -0,0 +1,157 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/grappler/optimizers/data/graph_utils.h"
+
+#include "tensorflow/core/lib/core/status_test_util.h"
+#include "tensorflow/core/platform/test.h"
+
+namespace tensorflow {
+namespace grappler {
+namespace graph_utils {
+namespace {
+
+class GraphUtilsTest : public ::testing::Test {};
+
+TEST_F(GraphUtilsTest, AddScalarConstNodeBool) {
+  GraphDef graph;
+  NodeDef* bool_node;
+  TF_EXPECT_OK(AddScalarConstNode<bool>(true, &graph, &bool_node));
+  EXPECT_TRUE(ContainsNodeWithName(bool_node->name(), graph));
+  EXPECT_EQ(bool_node->attr().at("value").tensor().bool_val(0), true);
+}
+
+TEST_F(GraphUtilsTest, AddScalarConstNodeDouble) {
+  GraphDef graph;
+  NodeDef* double_node;
+  TF_EXPECT_OK(AddScalarConstNode<double>(3.14, &graph, &double_node));
+  EXPECT_TRUE(ContainsNodeWithName(double_node->name(), graph));
+  EXPECT_FLOAT_EQ(double_node->attr().at("value").tensor().double_val(0), 3.14);
+}
+
+TEST_F(GraphUtilsTest, AddScalarConstNodeFloat) {
+  GraphDef graph;
+  NodeDef* float_node;
+  TF_EXPECT_OK(AddScalarConstNode<float>(3.14, &graph, &float_node));
+  EXPECT_TRUE(ContainsNodeWithName(float_node->name(), graph));
+  EXPECT_FLOAT_EQ(float_node->attr().at("value").tensor().float_val(0), 3.14);
+}
+
+TEST_F(GraphUtilsTest, AddScalarConstNodeInt) {
+  GraphDef graph;
+  NodeDef* int_node;
+  TF_EXPECT_OK(AddScalarConstNode<int>(42, &graph, &int_node));
+  EXPECT_TRUE(ContainsNodeWithName(int_node->name(), graph));
+  EXPECT_EQ(int_node->attr().at("value").tensor().int_val(0), 42);
+}
+
+TEST_F(GraphUtilsTest, AddScalarConstNodeInt64) {
+  GraphDef graph;
+  NodeDef* int64_node;
+  TF_EXPECT_OK(AddScalarConstNode<int64>(42, &graph, &int64_node));
+  EXPECT_TRUE(ContainsNodeWithName(int64_node->name(), graph));
+  EXPECT_EQ(int64_node->attr().at("value").tensor().int64_val(0), 42);
+}
+
+TEST_F(GraphUtilsTest, AddScalarConstNodeString) {
+  GraphDef graph;
+  NodeDef* string_node;
+  TF_EXPECT_OK(AddScalarConstNode<StringPiece>("hello", &graph, &string_node));
+  EXPECT_TRUE(ContainsNodeWithName(string_node->name(), graph));
+  EXPECT_EQ(string_node->attr().at("value").tensor().string_val(0), "hello");
+}
+
+TEST_F(GraphUtilsTest, Compare) {
+  GraphDef graphA;
+  GraphDef graphB;
+  EXPECT_TRUE(Compare(graphA, graphB));
+
+  NodeDef* nodeA;
+  TF_EXPECT_OK(AddNode("A", "OpA", {}, {}, &graphA, &nodeA));
+  NodeDef* nodeB;
+  TF_EXPECT_OK(AddNode("B", "OpB", {"A"}, {}, &graphA, &nodeB));
+  EXPECT_FALSE(Compare(graphA, graphB));
+
+  graphB.mutable_node()->CopyFrom(graphA.node());
+  EXPECT_TRUE(Compare(graphA, graphB));
+}
+
+TEST_F(GraphUtilsTest, ContainsNodeWithName) {
+  GraphDef graph;
+  EXPECT_TRUE(!ContainsNodeWithName("A", graph));
+
+  NodeDef* node;
+  TF_EXPECT_OK(AddNode("A", "OpA", {}, {}, &graph, &node));
+  EXPECT_TRUE(ContainsNodeWithName("A", graph));
+
+  TF_EXPECT_OK(DeleteNodes({"A"}, &graph));
+  EXPECT_TRUE(!ContainsNodeWithName("A", graph));
+}
+
+TEST_F(GraphUtilsTest, ContainsNodeWithOp) {
+  GraphDef graph;
+  EXPECT_TRUE(!ContainsNodeWithOp("OpA", graph));
+
+  NodeDef* node;
+  TF_EXPECT_OK(AddNode("A", "OpA", {}, {}, &graph, &node));
+  EXPECT_TRUE(ContainsNodeWithOp("OpA", graph));
+
+  TF_EXPECT_OK(DeleteNodes({"A"}, &graph));
+  EXPECT_TRUE(!ContainsNodeWithOp("OpA", graph));
+}
+
+TEST_F(GraphUtilsTest, FindNodeWithName) {
+  GraphDef graph;
+  EXPECT_EQ(FindNodeWithName("A", graph), -1);
+
+  NodeDef* node;
+  TF_EXPECT_OK(AddNode("A", "OpA", {}, {}, &graph, &node));
+  EXPECT_NE(FindNodeWithName("A", graph), -1);
+
+  TF_EXPECT_OK(DeleteNodes({"A"}, &graph));
+  EXPECT_EQ(FindNodeWithName("A", graph), -1);
+}
+
+TEST_F(GraphUtilsTest, FindNodeWithOp) {
+  GraphDef graph;
+  EXPECT_EQ(FindNodeWithOp("OpA", graph), -1);
+
+  NodeDef* node;
+  TF_EXPECT_OK(AddNode("A", "OpA", {}, {}, &graph, &node));
+  EXPECT_NE(FindNodeWithOp("OpA", graph), -1);
+
+  TF_EXPECT_OK(DeleteNodes({"A"}, &graph));
+  EXPECT_EQ(FindNodeWithOp("OpA", graph), -1);
+}
+
+TEST_F(GraphUtilsTest, SetUniqueName) {
+  GraphDef graph;
+
+  NodeDef* node1;
+  TF_EXPECT_OK(AddNode("", "A", {}, {}, &graph, &node1));
+  NodeDef* node2;
+  TF_EXPECT_OK(AddNode("", "A", {}, {}, &graph, &node2));
+  EXPECT_NE(node1->name(), node2->name());
+
+  TF_EXPECT_OK(DeleteNodes({node1->name()}, &graph));
+  NodeDef* node3;
+  TF_EXPECT_OK(AddNode("", "A", {}, {}, &graph, &node3));
+  EXPECT_NE(node2->name(), node3->name());
+}
+
+}  // namespace
+}  // namespace graph_utils
+}  // namespace grappler
+}  // namespace tensorflow
diff --git a/tensorflow/core/grappler/optimizers/data/map_and_batch_fusion.cc b/tensorflow/core/grappler/optimizers/data/map_and_batch_fusion.cc
new file mode 100644
index 0000000000000000000000000000000000000000..1e8cbb9784e6d4c28a536bddb6dfd8d080562aee
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/data/map_and_batch_fusion.cc
@@ -0,0 +1,141 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/grappler/optimizers/data/map_and_batch_fusion.h"
+
+#include "tensorflow/core/framework/attr_value.pb.h"
+#include "tensorflow/core/framework/node_def.pb.h"
+#include "tensorflow/core/grappler/clusters/cluster.h"
+#include "tensorflow/core/grappler/graph_view.h"
+#include "tensorflow/core/grappler/grappler_item.h"
+#include "tensorflow/core/grappler/op_types.h"
+#include "tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.h"
+#include "tensorflow/core/grappler/optimizers/data/graph_utils.h"
+#include "tensorflow/core/grappler/utils.h"
+#include "tensorflow/core/platform/protobuf.h"
+
+namespace tensorflow {
+namespace grappler {
+namespace {
+
+constexpr char kFusedOpName[] = "MapAndBatchDatasetV2";
+
+}  // namespace
+
+Status MapAndBatchFusion::Optimize(Cluster* cluster, const GrapplerItem& item,
+                                   GraphDef* output) {
+  *output = item.graph;
+  GraphView graph(output);
+  std::set<string> nodes_to_delete;
+  for (const NodeDef& node : item.graph.node()) {
+    if (node.op() != "BatchDataset" && node.op() != "BatchDatasetV2") {
+      continue;
+    }
+
+    // Use a more descriptive variable name now that we know the node type.
+    const NodeDef batch_node(node);
+    GraphView::InputPort input_port = graph.GetInputPort(batch_node.name(), 0);
+    NodeDef* node2 = graph.GetRegularFanin(input_port).node;
+    if (node2->op() != "MapDataset" && node2->op() != "ParallelMapDataset") {
+      continue;
+    }
+
+    NodeDef* new_node = output->add_node();
+    new_node->set_op(kFusedOpName);
+    graph_utils::SetUniqueName(kFusedOpName, output, new_node);
+
+    // Use a more descriptive variable name now that we know the node type.
+    NodeDef* map_node = node2;
+    // Set the `input` input argument.
+    new_node->add_input(map_node->input(0));
+
+    // Set the `other_arguments` input arguments.
+    int num_other_args;
+    if (map_node->op() == "ParallelMapDataset") {
+      num_other_args = map_node->input_size() - 2;
+    } else {
+      num_other_args = map_node->input_size() - 1;
+    }
+    for (int i = 0; i < num_other_args; i++) {
+      new_node->add_input(map_node->input(i + 1));
+    }
+
+    // Set the `batch_size` input argument.
+    new_node->add_input(batch_node.input(1));
+
+    // Set the `num_parallel_calls` input argument.
+    if (map_node->op() == "ParallelMapDataset") {
+      // The type of the `num_parallel_calls` argument in ParallelMapDataset
+      // and MapAndBatchDataset is different (int32 and int64 respectively)
+      // so we cannot reuse the same Const node and thus create a new one.
+      NodeDef* v = graph.GetNode(map_node->input(map_node->input_size() - 1));
+      NodeDef* tmp;
+      TF_RETURN_IF_ERROR(graph_utils::AddScalarConstNode<int64>(
+          v->attr().at("value").tensor().int_val(0), output, &tmp));
+      new_node->add_input(tmp->name());
+    } else {
+      NodeDef* tmp;
+      TF_RETURN_IF_ERROR(
+          graph_utils::AddScalarConstNode<int64>(1, output, &tmp));
+      new_node->add_input(tmp->name());
+    }
+
+    // Set the `drop_remainder` input argument.
+    if (batch_node.op() == "BatchDatasetV2") {
+      new_node->add_input(batch_node.input(2));
+    } else {
+      NodeDef* tmp;
+      TF_RETURN_IF_ERROR(
+          graph_utils::AddScalarConstNode<bool>(false, output, &tmp));
+      new_node->add_input(tmp->name());
+    }
+
+    // Set `f` and `Targuments` attributes.
+    for (auto key : {"f", "Targuments"}) {
+      (*new_node->mutable_attr())[key] = map_node->attr().at(key);
+    }
+    // Set `output_types` and `output_shapes` attributes.
+    for (auto key : {"output_shapes", "output_types"}) {
+      (*new_node->mutable_attr())[key] = batch_node.attr().at(key);
+    }
+
+    // Mark the `Map` and `Batch` nodes for removal.
+    nodes_to_delete.insert(map_node->name());
+    nodes_to_delete.insert(batch_node.name());
+
+    // Update the input of the outputs of the `Batch` node to use
+    // `MapAndBatch`.
+    GraphView::OutputPort output_port =
+        graph.GetOutputPort(batch_node.name(), 0);
+    auto fanout = graph.GetFanout(output_port);
+    for (auto it = fanout.begin(); it != fanout.end(); ++it) {
+      NodeDef* node = it->node;
+      node->set_input(0, new_node->name());
+    }
+  }
+  TF_RETURN_IF_ERROR(graph_utils::DeleteNodes(nodes_to_delete, output));
+  return Status::OK();
+}
+
+void MapAndBatchFusion::Feedback(Cluster* cluster, const GrapplerItem& item,
+                                 const GraphDef& optimize_output,
+                                 double result) {
+  // no-op
+}
+
+REGISTER_GRAPH_OPTIMIZER_AS(MapAndBatchFusion, "map_and_batch_fusion");
+
+}  // end namespace grappler
+}  // end namespace tensorflow
diff --git a/tensorflow/core/grappler/optimizers/data/map_and_batch_fusion.h b/tensorflow/core/grappler/optimizers/data/map_and_batch_fusion.h
new file mode 100644
index 0000000000000000000000000000000000000000..2c64831105295391f77e7e8be554b25fa85a5779
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/data/map_and_batch_fusion.h
@@ -0,0 +1,46 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CORE_GRAPPLER_OPTIMIZERS_DATA_MAP_AND_BATCH_FUSION_H_
+#define TENSORFLOW_CORE_GRAPPLER_OPTIMIZERS_DATA_MAP_AND_BATCH_FUSION_H_
+
+#include "tensorflow/core/grappler/optimizers/custom_graph_optimizer.h"
+
+namespace tensorflow {
+namespace grappler {
+
+class MapAndBatchFusion : public CustomGraphOptimizer {
+ public:
+  MapAndBatchFusion() = default;
+  ~MapAndBatchFusion() override = default;
+
+  string name() const override { return "map_and_batch_fusion"; };
+
+  Status Init(
+      const tensorflow::RewriterConfig_CustomGraphOptimizer* config) override {
+    return Status::OK();
+  }
+
+  Status Optimize(Cluster* cluster, const GrapplerItem& item,
+                  GraphDef* output) override;
+
+  void Feedback(Cluster* cluster, const GrapplerItem& item,
+                const GraphDef& optimize_output, double result) override;
+};
+
+}  // end namespace grappler
+}  // end namespace tensorflow
+
+#endif  // TENSORFLOW_CORE_GRAPPLER_OPTIMIZERS_DATA_MAP_AND_BATCH_FUSION_H_
diff --git a/tensorflow/core/grappler/optimizers/data/map_and_batch_fusion_test.cc b/tensorflow/core/grappler/optimizers/data/map_and_batch_fusion_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..3c1d8d5359ae8dd229e4e68b8b06667507d07a2f
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/data/map_and_batch_fusion_test.cc
@@ -0,0 +1,341 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/grappler/optimizers/data/map_and_batch_fusion.h"
+
+#include "tensorflow/core/framework/attr_value_util.h"
+#include "tensorflow/core/grappler/grappler_item.h"
+#include "tensorflow/core/grappler/optimizers/data/graph_utils.h"
+#include "tensorflow/core/lib/core/status_test_util.h"
+#include "tensorflow/core/platform/test.h"
+
+namespace tensorflow {
+namespace grappler {
+namespace {
+
+TEST(MapAndBatchFusionTest, FuseMapAndBatchNodesIntoOne) {
+  GrapplerItem item;
+  GraphDef *graph = &item.graph;
+  NodeDef *start_node;
+  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(0, graph, &start_node));
+  NodeDef *stop_node;
+  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(10, graph, &stop_node));
+  NodeDef *step_node;
+  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(1, graph, &step_node));
+
+  std::vector<string> range_inputs(3);
+  range_inputs[0] = start_node->name();
+  range_inputs[1] = stop_node->name();
+  range_inputs[2] = step_node->name();
+  std::vector<std::pair<string, AttrValue>> range_attrs;
+  NodeDef *range_node;
+  TF_ASSERT_OK(graph_utils::AddNode("", "RangeDataset", range_inputs,
+                                    range_attrs, graph, &range_node));
+  NodeDef *captured_input_node;
+  TF_ASSERT_OK(graph_utils::AddScalarConstNode<StringPiece>(
+      "hello", graph, &captured_input_node));
+
+  NodeDef *map_node;
+  {
+    std::vector<string> map_inputs(2);
+    map_inputs[0] = range_node->name();
+    map_inputs[1] = captured_input_node->name();
+    std::vector<std::pair<string, AttrValue>> map_attrs(2);
+    AttrValue f_attr;
+    SetAttrValue("f", &f_attr);
+    map_attrs[0] = std::make_pair("f", f_attr);
+    AttrValue args_attr;
+    SetAttrValue("Targuments", &args_attr);
+    map_attrs[1] = std::make_pair("Targuments", args_attr);
+    TF_ASSERT_OK(graph_utils::AddNode("", "MapDataset", map_inputs, map_attrs,
+                                      graph, &map_node));
+  }
+
+  NodeDef *batch_size_node;
+  TF_ASSERT_OK(
+      graph_utils::AddScalarConstNode<int64>(5, graph, &batch_size_node));
+  NodeDef *batch_node;
+  {
+    std::vector<string> batch_inputs(2);
+    batch_inputs[0] = map_node->name();
+    batch_inputs[1] = batch_size_node->name();
+    std::vector<std::pair<string, AttrValue>> batch_attrs(2);
+    AttrValue shapes_attr;
+    SetAttrValue("output_shapes", &shapes_attr);
+    batch_attrs[0] = std::make_pair("output_shapes", shapes_attr);
+    AttrValue types_attr;
+    SetAttrValue("output_types", &types_attr);
+    batch_attrs[1] = std::make_pair("output_types", types_attr);
+    TF_ASSERT_OK(graph_utils::AddNode("", "BatchDataset", batch_inputs,
+                                      batch_attrs, graph, &batch_node));
+  }
+
+  MapAndBatchFusion optimizer;
+  GraphDef output;
+  TF_ASSERT_OK(optimizer.Optimize(nullptr, item, &output));
+
+  EXPECT_FALSE(graph_utils::ContainsNodeWithName(map_node->name(), output));
+  EXPECT_FALSE(graph_utils::ContainsNodeWithName(batch_node->name(), output));
+  EXPECT_TRUE(graph_utils::ContainsNodeWithOp("MapAndBatchDatasetV2", output));
+  NodeDef map_and_batch_node =
+      output.node(graph_utils::FindNodeWithOp("MapAndBatchDatasetV2", output));
+  EXPECT_EQ(map_and_batch_node.input_size(), 5);
+  EXPECT_EQ(map_and_batch_node.input(0), map_node->input(0));
+  EXPECT_EQ(map_and_batch_node.input(1), map_node->input(1));
+  EXPECT_EQ(map_and_batch_node.input(2), batch_node->input(1));
+  NodeDef num_parallel_calls_node = output.node(
+      graph_utils::FindNodeWithName(map_and_batch_node.input(3), output));
+  EXPECT_EQ(num_parallel_calls_node.attr().at("value").tensor().int64_val(0),
+            1);
+  NodeDef drop_remainder_node = output.node(
+      graph_utils::FindNodeWithName(map_and_batch_node.input(4), output));
+  EXPECT_EQ(drop_remainder_node.attr().at("value").tensor().bool_val(0), false);
+  EXPECT_TRUE(AreAttrValuesEqual(map_and_batch_node.attr().at("f"),
+                                 map_node->attr().at("f")));
+  EXPECT_TRUE(AreAttrValuesEqual(map_and_batch_node.attr().at("Targuments"),
+                                 map_node->attr().at("Targuments")));
+  EXPECT_TRUE(AreAttrValuesEqual(map_and_batch_node.attr().at("output_shapes"),
+                                 batch_node->attr().at("output_shapes")));
+  EXPECT_TRUE(AreAttrValuesEqual(map_and_batch_node.attr().at("output_types"),
+                                 batch_node->attr().at("output_types")));
+}
+
+TEST(MapAndBatchFusionTest, FuseMapAndBatchV2NodesIntoOne) {
+  GrapplerItem item;
+  GraphDef *graph = &item.graph;
+  NodeDef *start_node;
+  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(0, graph, &start_node));
+  NodeDef *stop_node;
+  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(10, graph, &stop_node));
+  NodeDef *step_node;
+  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(1, graph, &step_node));
+
+  std::vector<string> range_inputs(3);
+  range_inputs[0] = start_node->name();
+  range_inputs[1] = stop_node->name();
+  range_inputs[2] = step_node->name();
+  std::vector<std::pair<string, AttrValue>> range_attrs;
+  NodeDef *range_node;
+  TF_ASSERT_OK(graph_utils::AddNode("", "RangeDataset", range_inputs,
+                                    range_attrs, graph, &range_node));
+  NodeDef *captured_input_node;
+  TF_ASSERT_OK(graph_utils::AddScalarConstNode<StringPiece>(
+      "hello", graph, &captured_input_node));
+
+  NodeDef *map_node;
+  {
+    std::vector<string> map_inputs(2);
+    map_inputs[0] = range_node->name();
+    map_inputs[1] = captured_input_node->name();
+    std::vector<std::pair<string, AttrValue>> map_attrs(2);
+    AttrValue f_attr;
+    SetAttrValue("f", &f_attr);
+    map_attrs[0] = std::make_pair("f", f_attr);
+    AttrValue args_attr;
+    SetAttrValue("Targuments", &args_attr);
+    map_attrs[1] = std::make_pair("Targuments", args_attr);
+    TF_ASSERT_OK(graph_utils::AddNode("", "MapDataset", map_inputs, map_attrs,
+                                      graph, &map_node));
+  }
+
+  NodeDef *batch_size_node;
+  TF_ASSERT_OK(
+      graph_utils::AddScalarConstNode<int64>(5, graph, &batch_size_node));
+  NodeDef *drop_remainder_node;
+  TF_ASSERT_OK(
+      graph_utils::AddScalarConstNode<bool>(true, graph, &drop_remainder_node));
+  NodeDef *batch_node;
+  {
+    std::vector<string> batch_inputs(3);
+    batch_inputs[0] = map_node->name();
+    batch_inputs[1] = batch_size_node->name();
+    batch_inputs[2] = drop_remainder_node->name();
+    std::vector<std::pair<string, AttrValue>> batch_attrs(2);
+    AttrValue shapes_attr;
+    SetAttrValue("output_shapes", &shapes_attr);
+    batch_attrs[0] = std::make_pair("output_shapes", shapes_attr);
+    AttrValue types_attr;
+    SetAttrValue("output_types", &types_attr);
+    batch_attrs[1] = std::make_pair("output_types", types_attr);
+    TF_ASSERT_OK(graph_utils::AddNode("", "BatchDatasetV2", batch_inputs,
+                                      batch_attrs, graph, &batch_node));
+  }
+
+  MapAndBatchFusion optimizer;
+  GraphDef output;
+  TF_ASSERT_OK(optimizer.Optimize(nullptr, item, &output));
+
+  EXPECT_FALSE(graph_utils::ContainsNodeWithName(map_node->name(), output));
+  EXPECT_FALSE(graph_utils::ContainsNodeWithName(batch_node->name(), output));
+  EXPECT_TRUE(graph_utils::ContainsNodeWithOp("MapAndBatchDatasetV2", output));
+  NodeDef map_and_batch_node =
+      output.node(graph_utils::FindNodeWithOp("MapAndBatchDatasetV2", output));
+  EXPECT_EQ(map_and_batch_node.input_size(), 5);
+  EXPECT_EQ(map_and_batch_node.input(0), map_node->input(0));
+  EXPECT_EQ(map_and_batch_node.input(1), map_node->input(1));
+  EXPECT_EQ(map_and_batch_node.input(2), batch_node->input(1));
+  NodeDef num_parallel_calls_node = output.node(
+      graph_utils::FindNodeWithName(map_and_batch_node.input(3), output));
+  EXPECT_EQ(num_parallel_calls_node.attr().at("value").tensor().int64_val(0),
+            1);
+  EXPECT_EQ(map_and_batch_node.input(4), batch_node->input(2));
+  EXPECT_TRUE(AreAttrValuesEqual(map_and_batch_node.attr().at("f"),
+                                 map_node->attr().at("f")));
+  EXPECT_TRUE(AreAttrValuesEqual(map_and_batch_node.attr().at("Targuments"),
+                                 map_node->attr().at("Targuments")));
+  EXPECT_TRUE(AreAttrValuesEqual(map_and_batch_node.attr().at("output_shapes"),
+                                 batch_node->attr().at("output_shapes")));
+  EXPECT_TRUE(AreAttrValuesEqual(map_and_batch_node.attr().at("output_types"),
+                                 batch_node->attr().at("output_types")));
+}
+
+TEST(MapAndBatchFusionTest, FuseParallelMapAndBatchNodesIntoOne) {
+  GrapplerItem item;
+  GraphDef *graph = &item.graph;
+  NodeDef *start_node;
+  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(0, graph, &start_node));
+  NodeDef *stop_node;
+  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(10, graph, &stop_node));
+  NodeDef *step_node;
+  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(1, graph, &step_node));
+
+  std::vector<string> range_inputs(3);
+  range_inputs[0] = start_node->name();
+  range_inputs[1] = stop_node->name();
+  range_inputs[2] = step_node->name();
+  std::vector<std::pair<string, AttrValue>> range_attrs;
+  NodeDef *range_node;
+  TF_ASSERT_OK(graph_utils::AddNode("", "RangeDataset", range_inputs,
+                                    range_attrs, graph, &range_node));
+  NodeDef *captured_input_node;
+  TF_ASSERT_OK(graph_utils::AddScalarConstNode<StringPiece>(
+      "hello", graph, &captured_input_node));
+  NodeDef *num_parallel_calls_node;
+  TF_ASSERT_OK(
+      graph_utils::AddScalarConstNode<int>(2, graph, &num_parallel_calls_node));
+
+  NodeDef *map_node;
+  {
+    std::vector<string> map_inputs(3);
+    map_inputs[0] = range_node->name();
+    map_inputs[1] = captured_input_node->name();
+    map_inputs[2] = num_parallel_calls_node->name();
+    std::vector<std::pair<string, AttrValue>> map_attrs(2);
+    AttrValue f_attr;
+    SetAttrValue("f", &f_attr);
+    map_attrs[0] = std::make_pair("f", f_attr);
+    AttrValue args_attr;
+    SetAttrValue("Targuments", &args_attr);
+    map_attrs[1] = std::make_pair("Targuments", args_attr);
+    TF_ASSERT_OK(graph_utils::AddNode("", "ParallelMapDataset", map_inputs,
+                                      map_attrs, graph, &map_node));
+  }
+
+  NodeDef *batch_size_node;
+  TF_ASSERT_OK(
+      graph_utils::AddScalarConstNode<int64>(5, graph, &batch_size_node));
+  NodeDef *batch_node;
+  {
+    std::vector<string> batch_inputs(2);
+    batch_inputs[0] = map_node->name();
+    batch_inputs[1] = batch_size_node->name();
+    std::vector<std::pair<string, AttrValue>> batch_attrs(2);
+    AttrValue shapes_attr;
+    SetAttrValue("output_shapes", &shapes_attr);
+    batch_attrs[0] = std::make_pair("output_shapes", shapes_attr);
+    AttrValue types_attr;
+    SetAttrValue("output_types", &types_attr);
+    batch_attrs[1] = std::make_pair("output_types", types_attr);
+    TF_ASSERT_OK(graph_utils::AddNode("", "BatchDataset", batch_inputs,
+                                      batch_attrs, graph, &batch_node));
+  }
+
+  MapAndBatchFusion optimizer;
+  GraphDef output;
+  TF_ASSERT_OK(optimizer.Optimize(nullptr, item, &output));
+
+  EXPECT_FALSE(graph_utils::ContainsNodeWithName(map_node->name(), output));
+  EXPECT_FALSE(graph_utils::ContainsNodeWithName(batch_node->name(), output));
+  EXPECT_TRUE(graph_utils::ContainsNodeWithOp("MapAndBatchDatasetV2", output));
+  NodeDef map_and_batch_node =
+      output.node(graph_utils::FindNodeWithOp("MapAndBatchDatasetV2", output));
+  EXPECT_EQ(map_and_batch_node.input_size(), 5);
+  EXPECT_EQ(map_and_batch_node.input(0), map_node->input(0));
+  EXPECT_EQ(map_and_batch_node.input(1), map_node->input(1));
+  EXPECT_EQ(map_and_batch_node.input(2), batch_node->input(1));
+  NodeDef num_parallel_calls_node2 = output.node(
+      graph_utils::FindNodeWithName(map_and_batch_node.input(3), output));
+  EXPECT_EQ(num_parallel_calls_node2.attr().at("value").tensor().int64_val(0),
+            2);
+  NodeDef drop_remainder_node = output.node(
+      graph_utils::FindNodeWithName(map_and_batch_node.input(4), output));
+  EXPECT_EQ(drop_remainder_node.attr().at("value").tensor().bool_val(0), false);
+  EXPECT_TRUE(AreAttrValuesEqual(map_and_batch_node.attr().at("f"),
+                                 map_node->attr().at("f")));
+  EXPECT_TRUE(AreAttrValuesEqual(map_and_batch_node.attr().at("Targuments"),
+                                 map_node->attr().at("Targuments")));
+  EXPECT_TRUE(AreAttrValuesEqual(map_and_batch_node.attr().at("output_shapes"),
+                                 batch_node->attr().at("output_shapes")));
+  EXPECT_TRUE(AreAttrValuesEqual(map_and_batch_node.attr().at("output_types"),
+                                 batch_node->attr().at("output_types")));
+}
+
+TEST(MapAndBatchFusionTest, NoChange) {
+  GrapplerItem item;
+  GraphDef *graph = &item.graph;
+
+  NodeDef *start_node;
+  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(0, graph, &start_node));
+  NodeDef *stop_node;
+  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(10, graph, &stop_node));
+  NodeDef *step_node;
+  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(1, graph, &step_node));
+
+  std::vector<string> range_inputs(3);
+  range_inputs[0] = start_node->name();
+  range_inputs[1] = stop_node->name();
+  range_inputs[2] = step_node->name();
+  std::vector<std::pair<string, AttrValue>> range_attrs;
+  NodeDef *range_node;
+  TF_ASSERT_OK(graph_utils::AddNode("", "RangeDataset", range_inputs,
+                                    range_attrs, graph, &range_node));
+
+  NodeDef *batch_size_node;
+  TF_ASSERT_OK(
+      graph_utils::AddScalarConstNode<int64>(5, graph, &batch_size_node));
+  std::vector<string> batch_inputs(2);
+  batch_inputs[0] = range_node->name();
+  batch_inputs[1] = batch_size_node->name();
+  std::vector<std::pair<string, AttrValue>> batch_attrs(2);
+  AttrValue shapes_attr;
+  SetAttrValue("output_shapes", &shapes_attr);
+  batch_attrs[0] = std::make_pair("output_shapes", shapes_attr);
+  AttrValue types_attr;
+  SetAttrValue("output_types", &types_attr);
+  batch_attrs[1] = std::make_pair("output_types", types_attr);
+  NodeDef *batch_node;
+  TF_ASSERT_OK(graph_utils::AddNode("", "BatchDataset", batch_inputs,
+                                    batch_attrs, graph, &batch_node));
+
+  MapAndBatchFusion optimizer;
+  GraphDef output;
+  TF_ASSERT_OK(optimizer.Optimize(nullptr, item, &output));
+
+  EXPECT_TRUE(graph_utils::Compare(*graph, output));
+}
+
+}  // namespace
+}  // namespace grappler
+}  // namespace tensorflow
diff --git a/tensorflow/core/grappler/optimizers/data/shuffle_and_repeat_fusion.cc b/tensorflow/core/grappler/optimizers/data/shuffle_and_repeat_fusion.cc
new file mode 100644
index 0000000000000000000000000000000000000000..0df73b33ed7ede35526bd3f6f679287b65072c3b
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/data/shuffle_and_repeat_fusion.cc
@@ -0,0 +1,112 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/grappler/optimizers/data/shuffle_and_repeat_fusion.h"
+
+#include "tensorflow/core/framework/attr_value.pb.h"
+#include "tensorflow/core/framework/node_def.pb.h"
+#include "tensorflow/core/grappler/clusters/cluster.h"
+#include "tensorflow/core/grappler/graph_view.h"
+#include "tensorflow/core/grappler/grappler_item.h"
+#include "tensorflow/core/grappler/op_types.h"
+#include "tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.h"
+#include "tensorflow/core/grappler/optimizers/data/graph_utils.h"
+#include "tensorflow/core/grappler/utils.h"
+#include "tensorflow/core/platform/protobuf.h"
+
+namespace tensorflow {
+namespace grappler {
+namespace {
+
+constexpr char kFusedOpName[] = "ShuffleAndRepeatDataset";
+
+}  // namespace
+
+Status ShuffleAndRepeatFusion::Optimize(Cluster* cluster,
+                                        const GrapplerItem& item,
+                                        GraphDef* output) {
+  *output = item.graph;
+  GraphView graph(output);
+  std::set<string> nodes_to_delete;
+  for (const NodeDef& node : item.graph.node()) {
+    if (node.op() != "RepeatDataset") {
+      continue;
+    }
+
+    // Use a more descriptive variable name now that we know the node type.
+    const NodeDef repeat_node(node);
+    GraphView::InputPort input_port = graph.GetInputPort(repeat_node.name(), 0);
+    NodeDef* node2 = graph.GetRegularFanin(input_port).node;
+    if (node2->op() != "ShuffleDataset") {
+      continue;
+    }
+
+    NodeDef* new_node = output->add_node();
+    new_node->set_op(kFusedOpName);
+    graph_utils::SetUniqueName(kFusedOpName, output, new_node);
+
+    // Use a more descriptive variable name now that we know the node type.
+    NodeDef* shuffle_node = node2;
+
+    // Set the `input` input argument.
+    new_node->add_input(shuffle_node->input(0));
+
+    // Set the `buffer_size` input argument.
+    new_node->add_input(shuffle_node->input(1));
+
+    // Set the `seed` input argument.
+    new_node->add_input(shuffle_node->input(2));
+
+    // Set the `seed2` input argument.
+    new_node->add_input(shuffle_node->input(3));
+
+    // Set the `count` input argument.
+    new_node->add_input(repeat_node.input(1));
+
+    // Set `output_types` and `output_shapes` attributes.
+    for (auto key : {"output_shapes", "output_types"}) {
+      (*new_node->mutable_attr())[key] = repeat_node.attr().at(key);
+    }
+
+    // Mark the `Shuffle` and `Repeat` nodes for removal.
+    nodes_to_delete.insert(shuffle_node->name());
+    nodes_to_delete.insert(repeat_node.name());
+
+    // Update the input of the outputs of the `Repeat` node to use
+    // `ShuffleAndRepeat`.
+    GraphView::OutputPort output_port =
+        graph.GetOutputPort(repeat_node.name(), 0);
+    auto fanout = graph.GetFanout(output_port);
+    for (auto it = fanout.begin(); it != fanout.end(); ++it) {
+      NodeDef* node = it->node;
+      node->set_input(0, new_node->name());
+    }
+  }
+  TF_RETURN_IF_ERROR(graph_utils::DeleteNodes(nodes_to_delete, output));
+  return Status::OK();
+}
+
+void ShuffleAndRepeatFusion::Feedback(Cluster* cluster,
+                                      const GrapplerItem& item,
+                                      const GraphDef& optimize_output,
+                                      double result) {
+  // no-op
+}
+
+REGISTER_GRAPH_OPTIMIZER_AS(ShuffleAndRepeatFusion,
+                            "shuffle_and_repeat_fusion");
+
+}  // end namespace grappler
+}  // end namespace tensorflow
diff --git a/tensorflow/core/grappler/optimizers/data/shuffle_and_repeat_fusion.h b/tensorflow/core/grappler/optimizers/data/shuffle_and_repeat_fusion.h
new file mode 100644
index 0000000000000000000000000000000000000000..c8fa53edce38531671aa481c1dffbc5b8a28046b
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/data/shuffle_and_repeat_fusion.h
@@ -0,0 +1,46 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CORE_GRAPPLER_OPTIMIZERS_DATA_SHUFFLE_AND_REPEAT_FUSION_H_
+#define TENSORFLOW_CORE_GRAPPLER_OPTIMIZERS_DATA_SHUFFLE_AND_REPEAT_FUSION_H_
+
+#include "tensorflow/core/grappler/optimizers/custom_graph_optimizer.h"
+
+namespace tensorflow {
+namespace grappler {
+
+class ShuffleAndRepeatFusion : public CustomGraphOptimizer {
+ public:
+  ShuffleAndRepeatFusion() = default;
+  ~ShuffleAndRepeatFusion() override = default;
+
+  string name() const override { return "shuffle_and_repeat_fusion"; };
+
+  Status Init(
+      const tensorflow::RewriterConfig_CustomGraphOptimizer* config) override {
+    return Status::OK();
+  }
+
+  Status Optimize(Cluster* cluster, const GrapplerItem& item,
+                  GraphDef* output) override;
+
+  void Feedback(Cluster* cluster, const GrapplerItem& item,
+                const GraphDef& optimize_output, double result) override;
+};
+
+}  // end namespace grappler
+}  // end namespace tensorflow
+
+#endif  // TENSORFLOW_CORE_GRAPPLER_OPTIMIZERS_DATA_SHUFFLE_AND_REPEAT_FUSION_H_
diff --git a/tensorflow/core/grappler/optimizers/data/shuffle_and_repeat_fusion_test.cc b/tensorflow/core/grappler/optimizers/data/shuffle_and_repeat_fusion_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..e89675efb772e970ffc8d3489e5ab1fc25722782
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/data/shuffle_and_repeat_fusion_test.cc
@@ -0,0 +1,149 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/grappler/optimizers/data/shuffle_and_repeat_fusion.h"
+
+#include "tensorflow/core/framework/attr_value_util.h"
+#include "tensorflow/core/grappler/grappler_item.h"
+#include "tensorflow/core/grappler/optimizers/data/graph_utils.h"
+#include "tensorflow/core/lib/core/status_test_util.h"
+#include "tensorflow/core/platform/test.h"
+
+namespace tensorflow {
+namespace grappler {
+namespace {
+
+TEST(ShuffleAndRepeatFusionTest, FuseShuffleAndRepeatNodesIntoOne) {
+  GrapplerItem item;
+  GraphDef *graph = &item.graph;
+
+  std::vector<std::pair<string, AttrValue>> common_attrs(2);
+  AttrValue shapes_attr;
+  SetAttrValue("output_shapes", &shapes_attr);
+  common_attrs[0] = std::make_pair("output_shapes", shapes_attr);
+  AttrValue types_attr;
+  SetAttrValue("output_types", &types_attr);
+  common_attrs[1] = std::make_pair("output_types", types_attr);
+
+  NodeDef *start_node;
+  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(0, graph, &start_node));
+  NodeDef *stop_node;
+  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(10, graph, &stop_node));
+  NodeDef *step_node;
+  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(1, graph, &step_node));
+
+  std::vector<string> range_inputs(3);
+  range_inputs[0] = start_node->name();
+  range_inputs[1] = stop_node->name();
+  range_inputs[2] = step_node->name();
+  NodeDef *range_node;
+  TF_ASSERT_OK(graph_utils::AddNode("", "RangeDataset", range_inputs,
+                                    common_attrs, graph, &range_node));
+
+  NodeDef *buffer_size_node;
+  TF_ASSERT_OK(
+      graph_utils::AddScalarConstNode<int64>(128, graph, &buffer_size_node));
+  NodeDef *seed_node;
+  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(-1, graph, &seed_node));
+  NodeDef *seed2_node;
+  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(-1, graph, &seed2_node));
+  std::vector<string> shuffle_inputs(4);
+  shuffle_inputs[0] = range_node->name();
+  shuffle_inputs[1] = buffer_size_node->name();
+  shuffle_inputs[2] = seed_node->name();
+  shuffle_inputs[3] = seed2_node->name();
+  NodeDef *shuffle_node;
+  TF_ASSERT_OK(graph_utils::AddNode("", "ShuffleDataset", shuffle_inputs,
+                                    common_attrs, graph, &shuffle_node));
+
+  NodeDef *count_node;
+  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(-1, graph, &count_node));
+  std::vector<string> repeat_inputs(2);
+  repeat_inputs[0] = shuffle_node->name();
+  repeat_inputs[1] = count_node->name();
+  NodeDef *repeat_node;
+  TF_ASSERT_OK(graph_utils::AddNode("", "RepeatDataset", repeat_inputs,
+                                    common_attrs, graph, &repeat_node));
+
+  ShuffleAndRepeatFusion optimizer;
+  GraphDef output;
+  TF_ASSERT_OK(optimizer.Optimize(nullptr, item, &output));
+
+  EXPECT_FALSE(graph_utils::ContainsNodeWithName(shuffle_node->name(), output));
+  EXPECT_FALSE(graph_utils::ContainsNodeWithName(repeat_node->name(), output));
+  EXPECT_TRUE(
+      graph_utils::ContainsNodeWithOp("ShuffleAndRepeatDataset", output));
+  NodeDef shuffle_and_repeat_node = output.node(
+      graph_utils::FindNodeWithOp("ShuffleAndRepeatDataset", output));
+  EXPECT_EQ(shuffle_and_repeat_node.input_size(), 5);
+  EXPECT_EQ(shuffle_and_repeat_node.input(0), shuffle_node->input(0));
+  EXPECT_EQ(shuffle_and_repeat_node.input(1), shuffle_node->input(1));
+  EXPECT_EQ(shuffle_and_repeat_node.input(2), shuffle_node->input(2));
+  EXPECT_EQ(shuffle_and_repeat_node.input(3), shuffle_node->input(3));
+  EXPECT_EQ(shuffle_and_repeat_node.input(4), repeat_node->input(1));
+  EXPECT_TRUE(
+      AreAttrValuesEqual(shuffle_and_repeat_node.attr().at("output_shapes"),
+                         repeat_node->attr().at("output_shapes")));
+  EXPECT_TRUE(
+      AreAttrValuesEqual(shuffle_and_repeat_node.attr().at("output_types"),
+                         repeat_node->attr().at("output_types")));
+}
+
+TEST(ShuffleAndRepeatFusionTest, NoChange) {
+  GrapplerItem item;
+  GraphDef *graph = &item.graph;
+
+  std::vector<std::pair<string, AttrValue>> common_attrs(2);
+  AttrValue shapes_attr;
+  SetAttrValue("output_shapes", &shapes_attr);
+  common_attrs[0] = std::make_pair("output_shapes", shapes_attr);
+  AttrValue types_attr;
+  SetAttrValue("output_types", &types_attr);
+  common_attrs[1] = std::make_pair("output_types", types_attr);
+
+  NodeDef *start_node;
+  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(0, graph, &start_node));
+  NodeDef *stop_node;
+  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(10, graph, &stop_node));
+  NodeDef *step_node;
+  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(1, graph, &step_node));
+
+  std::vector<string> range_inputs(3);
+  range_inputs[0] = start_node->name();
+  range_inputs[1] = stop_node->name();
+  range_inputs[2] = step_node->name();
+  NodeDef *range_node;
+  TF_ASSERT_OK(graph_utils::AddNode("", "RangeDataset", range_inputs,
+                                    common_attrs, graph, &range_node));
+
+  NodeDef *count_node;
+  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(-1, graph, &count_node));
+  std::vector<string> repeat_inputs(2);
+  repeat_inputs[0] = range_node->name();
+  repeat_inputs[1] = count_node->name();
+  NodeDef *repeat_node;
+  TF_ASSERT_OK(graph_utils::AddNode("", "RepeatDataset", repeat_inputs,
+                                    common_attrs, graph, &repeat_node));
+
+  ShuffleAndRepeatFusion optimizer;
+  GraphDef output;
+  TF_ASSERT_OK(optimizer.Optimize(nullptr, item, &output));
+
+  EXPECT_TRUE(graph_utils::Compare(*graph, output));
+}
+
+}  // namespace
+}  // namespace grappler
+}  // namespace tensorflow
diff --git a/tensorflow/core/grappler/optimizers/dependency_optimizer.cc b/tensorflow/core/grappler/optimizers/dependency_optimizer.cc
index 200454b5222c9520faabf83378c7a9d23b665436..78a6d0d8359fd142afd9e2639088b3e8d9f4418b 100644
--- a/tensorflow/core/grappler/optimizers/dependency_optimizer.cc
+++ b/tensorflow/core/grappler/optimizers/dependency_optimizer.cc
@@ -557,6 +557,92 @@ void DependencyOptimizer::BuildNodeToIdx() {
   }
 }
 
+// Suppose there are cross-device control inputs to node C from multiple nodes
+// that are located on another device, e.g., we have control edges:
+// A->C, B->C
+// where A and B are on device X and C is on device Y.
+// We can reduce cross-device communication by introducing an intermediate
+// NoOp node C' on device X and rewriting the control edges to:
+// A->C', B->C', C' -> C
+void DependencyOptimizer::GroupCrossDeviceControlEdges() {
+  const int num_nodes = optimized_graph_->node_size();
+  for (int i = 0; i < num_nodes; ++i) {
+    NodeDef* node = optimized_graph_->mutable_node(i);
+    if (node->device().empty()) continue;
+
+    // Creates new noop nodes for devices on which multiple control inputs are
+    // located.
+
+    // Map keyed by device name to the newly introduced Noop node for that
+    // device. A nullptr value means that we have only seen a single node on
+    // that device.
+    std::map<string, NodeDef*> noops;
+    int num_noops = 0;
+    for (int j = 0; j < node->input_size(); ++j) {
+      if (IsControlInput(node->input(j))) {
+        const NodeDef* input = node_map_->GetNode(node->input(j));
+        if (input != nullptr && !input->device().empty() &&
+            input->device() != node->device()) {
+          auto emplace_result = noops.emplace(input->device(), nullptr);
+          if (!emplace_result.second &&
+              emplace_result.first->second == nullptr) {
+            // This is the second cross-device control input from the same
+            // device. Creates an intermediate noop node on that device.
+            string group_name;
+            NodeDef* noop;
+            // Creates a fresh node name; there may be conflicting names from
+            // a previous iteration of the optimizer.
+            do {
+              group_name = AddPrefixToNodeName(
+                  node->name(),
+                  strings::StrCat("GroupCrossDeviceControlEdges_", num_noops));
+              noop = node_map_->GetNode(group_name);
+              ++num_noops;
+            } while (noop != nullptr);
+            noop = optimized_graph_->add_node();
+            noop->set_name(group_name);
+            noop->set_device(input->device());
+            noop->set_op("NoOp");
+            node_map_->AddNode(noop->name(), noop);
+            emplace_result.first->second = noop;
+          }
+        }
+      }
+    }
+
+    // Reroute existing control edges to go via the newly introduced NoOp nodes.
+    int pos = 0;
+    while (pos < node->input_size()) {
+      const string& input_name = node->input(pos);
+      if (IsControlInput(input_name)) {
+        NodeDef* input = node_map_->GetNode(input_name);
+        if (input == nullptr) {
+          ++pos;
+        } else {
+          auto it = noops.find(input->device());
+          if (it == noops.end() || it->second == nullptr) {
+            ++pos;
+          } else {
+            node->mutable_input()->SwapElements(pos, node->input_size() - 1);
+            node->mutable_input()->RemoveLast();
+            it->second->add_input(AsControlDependency(*input));
+            node_map_->UpdateOutput(input_name, node->name(),
+                                    it->second->name());
+          }
+        }
+      } else {
+        ++pos;
+      }
+    }
+    for (const auto& entry : noops) {
+      if (entry.second) {
+        node->add_input(AsControlDependency(*entry.second));
+        node_map_->AddOutput(entry.second->name(), node->name());
+      }
+    }
+  }
+}
+
 Status DependencyOptimizer::Optimize(Cluster* cluster, const GrapplerItem& item,
                                      GraphDef* optimized_graph) {
   optimized_graph_ = optimized_graph;
@@ -588,6 +674,8 @@ Status DependencyOptimizer::Optimize(Cluster* cluster, const GrapplerItem& item,
 
     // Dedup control inputs.
     CleanControlInputs();
+
+    GroupCrossDeviceControlEdges();
   }
 
   return Status::OK();
diff --git a/tensorflow/core/grappler/optimizers/dependency_optimizer.h b/tensorflow/core/grappler/optimizers/dependency_optimizer.h
index b4db98125aa740b5d261e8f9ad0ea5bfd8102877..c97ff23e88867d01cd0f9442c3f5ad2414976e91 100644
--- a/tensorflow/core/grappler/optimizers/dependency_optimizer.h
+++ b/tensorflow/core/grappler/optimizers/dependency_optimizer.h
@@ -30,7 +30,8 @@ namespace grappler {
 class DependencyOptimizer : public GraphOptimizer {
  public:
   DependencyOptimizer() {}
-  explicit DependencyOptimizer(RewriterConfig::Toggle opt_level) {}
+  explicit DependencyOptimizer(RewriterConfig::Toggle opt_level)
+      : opt_level_(opt_level) {}
   ~DependencyOptimizer() override {}
 
   string name() const override { return "dependency_optimizer"; };
@@ -61,7 +62,11 @@ class DependencyOptimizer : public GraphOptimizer {
   Status TransitiveReduction();
   // Main driver of dependency optimizations.
   Status OptimizeDependencies();
+  // Replaces multiple cross-device control edges from the same device with a
+  // single control edge.
+  void GroupCrossDeviceControlEdges();
 
+  RewriterConfig::Toggle opt_level_;
   bool fetch_nodes_known_;
   std::unordered_set<string> nodes_to_preserve_;
   std::unique_ptr<NodeMap> node_map_;
diff --git a/tensorflow/core/grappler/optimizers/dependency_optimizer_test.cc b/tensorflow/core/grappler/optimizers/dependency_optimizer_test.cc
index 6a297da52d075ea9bdae4584b7646ee44b950012..0ae3b4ec3450ad3c9e97c622acb5e45747e06bf6 100644
--- a/tensorflow/core/grappler/optimizers/dependency_optimizer_test.cc
+++ b/tensorflow/core/grappler/optimizers/dependency_optimizer_test.cc
@@ -16,11 +16,13 @@ limitations under the License.
 #include "tensorflow/core/grappler/optimizers/dependency_optimizer.h"
 #include "tensorflow/cc/ops/standard_ops.h"
 #include "tensorflow/core/framework/node_def.pb.h"
+#include "tensorflow/core/framework/tensor_testutil.h"
 #include "tensorflow/core/grappler/grappler_item.h"
 #include "tensorflow/core/grappler/inputs/trivial_test_graph_input_yielder.h"
 #include "tensorflow/core/grappler/optimizers/constant_folding.h"
 #include "tensorflow/core/grappler/optimizers/model_pruner.h"
 #include "tensorflow/core/grappler/utils.h"
+#include "tensorflow/core/grappler/utils/grappler_test.h"
 #include "tensorflow/core/grappler/utils/topological_sort.h"
 #include "tensorflow/core/lib/core/status_test_util.h"
 #include "tensorflow/core/platform/test.h"
@@ -29,7 +31,7 @@ namespace tensorflow {
 namespace grappler {
 namespace {
 
-class DependencyOptimizerTest : public ::testing::Test {};
+class DependencyOptimizerTest : public GrapplerTest {};
 
 void VerifyGraphsEqual(const GraphDef& original_graph,
                        const GraphDef& optimized_graph, const string& func) {
@@ -722,6 +724,68 @@ TEST_F(DependencyOptimizerTest, RemoveGreaterEqualWithNoOp) {
   EXPECT_EQ(3, count);
 }
 
+TEST_F(DependencyOptimizerTest, GroupCrossDeviceControlDeps) {
+  GrapplerItem item;
+  {
+    tensorflow::Scope s = tensorflow::Scope::NewRootScope();
+    Output a = ops::RandomUniform(s.WithOpName("a").WithDevice("/CPU:1"),
+                                  {1, 2}, DT_FLOAT);
+    Output b = ops::RandomUniform(s.WithOpName("b").WithDevice("/CPU:2"),
+                                  {1, 2}, DT_FLOAT);
+    Output c = ops::RandomUniform(s.WithOpName("c").WithDevice("/CPU:1"),
+                                  {1, 2}, DT_FLOAT);
+    Output d = ops::RandomUniform(s.WithOpName("d").WithDevice("/CPU:3"),
+                                  {1, 2}, DT_FLOAT);
+    Output e = ops::RandomUniform(s.WithOpName("e").WithDevice("/CPU:0"),
+                                  {1, 2}, DT_FLOAT);
+    // Node with cross-device dependencies.
+    auto fetch = ops::Identity(
+        s.WithOpName("f")
+            .WithControlDependencies({a.op(), b.op(), c.op(), d.op()})
+            .WithDevice("/GPU:0"),
+        {e});
+
+    TF_CHECK_OK(s.ToGraphDef(&item.graph));
+    item.fetch.push_back("f");
+  }
+
+  GraphDef expected;
+  {
+    tensorflow::Scope s = tensorflow::Scope::NewRootScope();
+    Output a = ops::RandomUniform(s.WithOpName("a").WithDevice("/CPU:1"),
+                                  {1, 2}, DT_FLOAT);
+    Output b = ops::RandomUniform(s.WithOpName("b").WithDevice("/CPU:2"),
+                                  {1, 2}, DT_FLOAT);
+    Output c = ops::RandomUniform(s.WithOpName("c").WithDevice("/CPU:1"),
+                                  {1, 2}, DT_FLOAT);
+    Output d = ops::RandomUniform(s.WithOpName("d").WithDevice("/CPU:3"),
+                                  {1, 2}, DT_FLOAT);
+    Output e = ops::RandomUniform(s.WithOpName("e").WithDevice("/CPU:0"),
+                                  {1, 2}, DT_FLOAT);
+    auto noop = ops::NoOp(s.WithOpName("GroupCrossDeviceControlEdges_0/f")
+                              .WithDevice("/CPU:1")
+                              .WithControlDependencies({a.op(), c.op()}));
+    auto fetch =
+        ops::Identity(s.WithOpName("f")
+                          .WithControlDependencies({b.op(), d.op(), noop})
+                          .WithDevice("/GPU:0"),
+                      {e});
+
+    TF_CHECK_OK(s.ToGraphDef(&expected));
+  }
+
+  DependencyOptimizer optimizer;
+  GraphDef output;
+  TF_EXPECT_OK(optimizer.Optimize(nullptr, item, &output));
+  CompareGraphs(expected, output);
+
+  // Run the optimizer again to verify idempotence.
+  item.graph.Swap(&output);
+  output.Clear();
+  TF_EXPECT_OK(optimizer.Optimize(nullptr, item, &output));
+  CompareGraphs(expected, output);
+}
+
 }  // namespace
 }  // namespace grappler
 }  // namespace tensorflow
diff --git a/tensorflow/core/grappler/optimizers/function_optimizer.cc b/tensorflow/core/grappler/optimizers/function_optimizer.cc
index 611d871eeabb87746abc1e83ea1c195a634217bd..b0d689c2dde4bcf570ef64f7b75d26b68825124d 100644
--- a/tensorflow/core/grappler/optimizers/function_optimizer.cc
+++ b/tensorflow/core/grappler/optimizers/function_optimizer.cc
@@ -610,6 +610,9 @@ Status InlineFunction(const NodeDef& func_node, const FunctionDef& func,
       // Turn input placeholders into identity nodes.
       CHECK_EQ(0, func_body_node.input_size());
       func_body_node.set_op("Identity");
+      (*func_body_node.mutable_attr())["T"] = func_body_node.attr().at("dtype");
+      func_body_node.mutable_attr()->erase("dtype");
+      func_body_node.mutable_attr()->erase("shape");
       int input_idx = input_placeholders_idx[func_body_node.name()];
       func_body_node.add_input(
           strings::StrCat(func_inputs->name(), ":", input_idx));
@@ -659,7 +662,7 @@ Status InlineFunction(const NodeDef& func_node, const FunctionDef& func,
 
 Status InlineSymbolicGradient(const NodeDef& node,
                               FunctionOptimizerContext* ctx,
-                              GraphDef* inlined_graph) {
+                              GraphDef* optimized_graph) {
   VLOG(2) << "Inline symbolic gradient: " << SummarizeNodeDef(node);
 
   GraphDef graph_def;
@@ -747,7 +750,7 @@ Status InlineSymbolicGradient(const NodeDef& node,
       }
     }
     inlined_node.set_device(node.device());
-    inlined_graph->add_node()->Swap(&inlined_node);
+    optimized_graph->add_node()->Swap(&inlined_node);
   }
 
   return Status::OK();
@@ -775,32 +778,62 @@ Status FunctionOptimizer::Optimize(Cluster* cluster, const GrapplerItem& item,
   for (const NodeDef& node : item.graph.node()) {
     const string func_name = node.op();
 
+    // Each node optimization can modify optimized graph only by adding new
+    // nodes, we can check node size to make sure that graph was not modified.
+    const int num_nodes_before = optimized_graph->node_size();
+    const auto is_graph_modified = [&]() {
+      int num_nodes = optimized_graph->node_size();
+      CHECK_GE(num_nodes, num_nodes_before) << "Nodes should not be removed";
+      return num_nodes > num_nodes_before;
+    };
+
+    // Add a copy of an input graph node to the optimized graph.
+    const auto add_node_copy = [&]() { *optimized_graph->add_node() = node; };
+
+// Skip errors if optimized graph was not modified before error happened.
+#define TF_SKIP_ERROR_IF_GRAPH_UNMODIFIED(...)                     \
+  do {                                                             \
+    const Status _status = (__VA_ARGS__);                          \
+    if (TF_PREDICT_FALSE(!_status.ok() && is_graph_modified()))    \
+      return _status;                                              \
+    if (TF_PREDICT_FALSE(!_status.ok() && !is_graph_modified())) { \
+      VLOG(3) << "Skip error: " << _status.error_message();        \
+      add_node_copy();                                             \
+    }                                                              \
+  } while (0)
+
+    // 1. Inline symbolic gradients into the optimized graph.
     if (func_name == "SymbolicGradient" && inline_gradients) {
       // Inline symbolic gradients only if the corresponding function is inlined
       const auto* f_attr = gtl::FindOrNull(node.attr(), "f");
       string f_name = f_attr != nullptr ? f_attr->func().name() : "";
       if (ctx.IsInlinedFunction(f_name)) {
-        TF_RETURN_IF_ERROR(InlineSymbolicGradient(node, &ctx, optimized_graph));
+        TF_SKIP_ERROR_IF_GRAPH_UNMODIFIED(
+            InlineSymbolicGradient(node, &ctx, optimized_graph));
         continue;
       }
     }
 
+    // 2. Check if a node op is a function call.
     const FunctionDef* func = ctx.function_library().Find(func_name);
     if (func != nullptr) {
+      // 2a. Inline it if it's allowed to do so.
       if (inline_func && ctx.IsInlinedFunction(func_name)) {
         // Inline function body into the optimized graph}
-        TF_RETURN_IF_ERROR(InlineFunction(node, *func, ctx, optimized_graph));
+        TF_SKIP_ERROR_IF_GRAPH_UNMODIFIED(
+            InlineFunction(node, *func, ctx, optimized_graph));
         continue;
       }
 
       // Do not specialize if function has custom gradient.
       const string grad_func = ctx.function_library().FindGradient(func_name);
 
+      // 2b. Specialize it to it's instantiation context if can't be inlined.
       if (specialize_func && grad_func.empty() &&
           (IsParametrized(*func) || HasTrulyConstInputs(node, ctx))) {
         // TODO(ezhulenev): Specialize function call if input has a known shape.
         // Specialize function body for its instantiation attributes and inputs.
-        TF_RETURN_IF_ERROR(
+        TF_SKIP_ERROR_IF_GRAPH_UNMODIFIED(
             SpecializeFunction(node, *func, &ctx, optimized_graph));
         continue;
       }
@@ -808,7 +841,9 @@ Status FunctionOptimizer::Optimize(Cluster* cluster, const GrapplerItem& item,
 
     // If we reached this point, node was not handled by any of the stages
     // (inline, specialize), simply add a copy to the graph.
-    *optimized_graph->add_node() = node;
+    add_node_copy();
+
+#undef TF_SKIP_ERROR_IF_GRAPH_UNMODIFIED
   }
 
   *optimized_graph->mutable_versions() = item.graph.versions();
diff --git a/tensorflow/core/grappler/optimizers/function_optimizer_test.cc b/tensorflow/core/grappler/optimizers/function_optimizer_test.cc
index 0aaf57e947f2c27fbfbedf55866c887e5ca1377b..d043f6129d483b316b14fe3f23a82abc8bcda5a6 100644
--- a/tensorflow/core/grappler/optimizers/function_optimizer_test.cc
+++ b/tensorflow/core/grappler/optimizers/function_optimizer_test.cc
@@ -111,6 +111,82 @@ TEST_F(FunctionOptimizerTest, InlineFunction_SimpleFunction) {
   test::ExpectTensorEqual<float>(tensors_expected[0], tensors[0]);
 }
 
+TEST_F(FunctionOptimizerTest, InlineFunction_SkipErrorsIfGraphNotModified) {
+  using test::function::NDef;
+
+  FunctionOptimizer optimizer(RewriterConfig::DEFAULT);
+
+  // Standard XTimesTwo() function.
+  FunctionDef x_times_two = test::function::XTimesTwo();
+
+  // Function with sequence of tensors as an input (currently not supported).
+  FunctionDef my_identity_n = FunctionDefHelper::Create(
+      // Name
+      "MyIdentityN",
+      // Args
+      {"x: N*T"},
+      // Return values
+      {"out: N*T"},
+      // Attrs
+      {"N:int", "T:{float, double, int32, int64}"},
+      // Nodes (just forward inputs through IdentityN)
+      {
+          {{"Id"}, "IdentityN", {"x"}, {{"T", "$T"}, {"N", "$N"}}},
+      },
+      // Output mapping
+      {{"out", "Id:output:0"}});
+
+  GrapplerItem item;
+  item.graph = test::function::GDef(
+      {NDef("x", "Placeholder", {}, {{"dtype", DT_FLOAT}}, kDevice),
+       NDef("y1", "XTimesTwo", {"x"}, {{"T", DT_FLOAT}}, kDevice),
+       NDef("y2", "MyIdentityN", {"x"}, {{"T", DT_FLOAT}, {"N", 1}}, kDevice),
+       NDef("z1", "Identity", {"y1:0"}, {{"T", DT_FLOAT}}, kDevice),
+       NDef("z2", "Identity", {"y2:0"}, {{"T", DT_FLOAT}}, kDevice)},
+      // FunctionLib
+      {x_times_two, my_identity_n});
+
+  GraphDef output;
+  TF_EXPECT_OK(optimizer.Optimize(nullptr, item, &output));
+
+  // Verify that only MyIdentityN is in the function library after optimization.
+  ASSERT_EQ(1, output.library().function().size());
+  EXPECT_EQ("MyIdentityN", output.library().function(0).signature().name());
+
+  // And that XTimesTwo was successfully inlined.
+  int found = 0;
+  for (const NodeDef& node : output.node()) {
+    if (node.name() == "y1/inlined_inputs") {
+      found++;
+      EXPECT_EQ("IdentityN", node.op());
+      EXPECT_EQ(kDevice, node.device());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("x", node.input(0));
+    } else if (node.name() == "y1") {
+      found++;
+      EXPECT_EQ("IdentityN", node.op());
+      EXPECT_EQ(kDevice, node.device());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("y1/y", node.input(0));
+    } else if (node.name() == "y2") {
+      found++;
+      EXPECT_EQ("MyIdentityN", node.op());
+      EXPECT_EQ(kDevice, node.device());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("x", node.input(0));
+    }
+  }
+  EXPECT_EQ(3, found);
+
+  Tensor pi = test::AsScalar<float>(3.14f);
+  item.fetch = {"z1"};
+  item.feed.emplace_back("x", pi);
+  auto tensors_expected = EvaluateFetchNodes(item);
+  GrapplerItem optimized(item, std::move(output));
+  auto tensors = EvaluateFetchNodes(optimized);
+  test::ExpectTensorEqual<float>(tensors_expected[0], tensors[0]);
+}
+
 TEST_F(FunctionOptimizerTest, InlineFunction_FixedTypeFunction) {
   using test::function::NDef;
 
diff --git a/tensorflow/core/grappler/optimizers/graph_optimizer_stage.h b/tensorflow/core/grappler/optimizers/graph_optimizer_stage.h
index b0ec967473bbec782fdcfe7dfed9848a791d26fc..2fbdd76a775b59d6f45d6c9edbd1c1f8e66f58ee 100644
--- a/tensorflow/core/grappler/optimizers/graph_optimizer_stage.h
+++ b/tensorflow/core/grappler/optimizers/graph_optimizer_stage.h
@@ -240,6 +240,25 @@ class GraphOptimizerStagePipeline {
     return false;
   }
 
+  // Pass a node through all registered optimizer stages, until break predicate
+  // is true or a stage fails.
+  //
+  // Returns any stage failure status, or else Status::OK().
+  Status PassThroughAllStagesWithStatus(NodeDef* node, Result* result) {
+    for (auto& stage : stages_) {
+      if (!stage->IsSupported(node)) {
+        continue;
+      }
+      const Status stage_status = stage->TrySimplify(node, result);
+      if (!stage_status.ok()) {
+        return stage_status;
+      } else if (break_predicate_(*result)) {
+        break;
+      }
+    }
+    return Status::OK();
+  }
+
   std::size_t NumStages() { return stages_.size(); }
 
   std::vector<string> StageNames() {
diff --git a/tensorflow/core/grappler/optimizers/layout_optimizer.cc b/tensorflow/core/grappler/optimizers/layout_optimizer.cc
index e08ab1eb673e12180147a11a596c97f682a12f18..3251e7cb1027a184917218f2a5a4560fa0dee43c 100644
--- a/tensorflow/core/grappler/optimizers/layout_optimizer.cc
+++ b/tensorflow/core/grappler/optimizers/layout_optimizer.cc
@@ -499,6 +499,7 @@ class NodeProcessor : public GraphProcessor {
       UpdateAttrDataFormat();
       UpdateAttrKSize();
       UpdateAttrStrides();
+      UpdateAttrDilations();
       UpdateAttrShape();
       TF_RETURN_IF_ERROR(AddLayoutTransposeToInputs());
       TF_RETURN_IF_ERROR(AddLayoutTransposeToOutputs());
@@ -742,6 +743,13 @@ class NodeProcessor : public GraphProcessor {
     }
   }
 
+  void UpdateAttrDilations() {
+    if (node_->attr().find("dilations") != node_->attr().end()) {
+      auto list = node_->mutable_attr()->at("dilations").mutable_list();
+      UpdateTuple(list);
+    }
+  }
+
   void UpdateAttrDataFormat() {
     if (node_->attr().find("data_format") != node_->attr().end()) {
       if (node_->attr().at("data_format").s().compare("NHWC") == 0) {
diff --git a/tensorflow/core/grappler/optimizers/layout_optimizer_test.cc b/tensorflow/core/grappler/optimizers/layout_optimizer_test.cc
index dad49cd74f8d26fde58c8d11c8707a7c62e94dab..20e47c1b26b173c18eefd01ba7bdb87781a4c59b 100644
--- a/tensorflow/core/grappler/optimizers/layout_optimizer_test.cc
+++ b/tensorflow/core/grappler/optimizers/layout_optimizer_test.cc
@@ -87,12 +87,13 @@ class LayoutOptimizerTest : public GrapplerTest {
 
   Output SimpleConv2DBackpropInput(tensorflow::Scope* s, int input_size,
                                    int filter_size, const string& padding) {
-    return SimpleConv2DBackpropInput(s, input_size, filter_size, padding, true);
+    return SimpleConv2DBackpropInput(s, input_size, filter_size, padding, true,
+                                     true);
   }
 
   Output SimpleConv2DBackpropInput(tensorflow::Scope* s, int input_size,
                                    int filter_size, const string& padding,
-                                   bool const_input_size) {
+                                   bool const_input_size, bool dilated) {
     int batch_size = 128;
     int input_height = input_size;
     int input_width = input_size;
@@ -123,14 +124,18 @@ class LayoutOptimizerTest : public GrapplerTest {
     Output conv_backprop_input;
     Output input_sizes_i =
         ops::Identity(s->WithOpName("InputSizesIdentity"), input_sizes);
+    ops::Conv2DBackpropInput::Attrs attrs;
+    if (dilated) {
+      attrs = attrs.Dilations({1, 2, 2, 1});
+    }
     if (const_input_size) {
       conv_backprop_input = ops::Conv2DBackpropInput(
           s->WithOpName("Conv2DBackpropInput"), input_sizes, filter, output,
-          {1, stride, stride, 1}, padding);
+          {1, stride, stride, 1}, padding, attrs);
     } else {
       conv_backprop_input = ops::Conv2DBackpropInput(
           s->WithOpName("Conv2DBackpropInput"), input_sizes_i, filter, output,
-          {1, stride, stride, 1}, padding);
+          {1, stride, stride, 1}, padding, attrs);
     }
     return conv_backprop_input;
   }
@@ -216,7 +221,7 @@ TEST_F(LayoutOptimizerTest, Conv2DBackpropInput) {
 
 TEST_F(LayoutOptimizerTest, Conv2DBackpropInputNonConstInputSizes) {
   tensorflow::Scope s = tensorflow::Scope::NewRootScope();
-  auto conv = SimpleConv2DBackpropInput(&s, 7, 2, "SAME", false);
+  auto conv = SimpleConv2DBackpropInput(&s, 7, 2, "SAME", false, false);
   Output fetch = ops::Identity(s.WithOpName("Fetch"), {conv});
   GrapplerItem item;
   TF_CHECK_OK(s.ToGraphDef(&item.graph));
diff --git a/tensorflow/core/grappler/optimizers/meta_optimizer.cc b/tensorflow/core/grappler/optimizers/meta_optimizer.cc
index a92727535d7f20de9dd3adbefac6c047b10b6359..143d9dc1c636940d5658c1d2838e717c569d0126 100644
--- a/tensorflow/core/grappler/optimizers/meta_optimizer.cc
+++ b/tensorflow/core/grappler/optimizers/meta_optimizer.cc
@@ -29,6 +29,7 @@ limitations under the License.
 #include "tensorflow/core/grappler/optimizers/memory_optimizer.h"
 #include "tensorflow/core/grappler/optimizers/model_pruner.h"
 #include "tensorflow/core/grappler/optimizers/remapper.h"
+#include "tensorflow/core/grappler/optimizers/scoped_allocator_optimizer.h"
 #include "tensorflow/core/grappler/optimizers/shape_optimizer.h"
 #include "tensorflow/core/grappler/utils/colocation.h"
 #include "tensorflow/core/grappler/utils/functions.h"
@@ -88,6 +89,8 @@ std::unique_ptr<GraphOptimizer> MetaOptimizer::MakeNewOptimizer(
   MK_OPT("loop", new LoopOptimizer(cfg_.loop_optimization()));
   MK_OPT("dependency", new DependencyOptimizer(cfg_.dependency_optimization()));
   MK_OPT("debug_stripper", new DebugStripper());
+  MK_OPT("scoped_allocator",
+         new ScopedAllocatorOptimizer(cfg_.scoped_allocator_opts()));
 
   return std::unique_ptr<GraphOptimizer>();
 }
@@ -145,6 +148,10 @@ Status MetaOptimizer::InitializeOptimizers(
     optimizers->emplace_back(
         new AutoParallel(cfg_.auto_parallel().num_replicas()));
   }
+  if (cfg_.scoped_allocator_optimization()) {
+    optimizers->emplace_back(
+        new ScopedAllocatorOptimizer(cfg_.scoped_allocator_opts()));
+  }
   return Status::OK();
 }
 
@@ -210,6 +217,8 @@ Status MetaOptimizer::OptimizeGraph(Cluster* cluster, const GrapplerItem& item,
 
   bool is_optimized = false;
   GraphOptimizationResult optimization_result(item.id);
+  GraphOptimizer* fusion_optimizer = nullptr;
+  GraphOptimizer* sa_optimizer = nullptr;
 
   for (int iteration = 0; iteration < NumIterations(cfg_); ++iteration) {
     VLOG(4) << "Starting optimization iteration " << iteration + 1;
@@ -217,34 +226,41 @@ Status MetaOptimizer::OptimizeGraph(Cluster* cluster, const GrapplerItem& item,
     for (const auto& optimizer : optimizers) {
       // Some optimizers can run only once.
       if (iteration > 0 && IsRunOnceOptimizer(optimizer->name())) continue;
-
-      uint64 start_us = Env::Default()->NowMicros();
-      // This swaps the current optimized_graph into optimized item and
-      // resets optimized_graph to an empty graph.
-      optimized_graph->Swap(&optimized_item.graph);
-      *optimized_graph = GraphDef();
-      Status status =
-          optimizer->Optimize(cluster, optimized_item, optimized_graph);
-      uint64 end_us = Env::Default()->NowMicros();
-
-      string result;
-      if (!status.ok()) {
-        optimized_graph->Swap(&optimized_item.graph);
-        result = status.ToString();
-      } else {
-        is_optimized = true;
-        float duration_ms = (end_us - start_us) / 1000.0f;
-        result = strings::StrCat(
-            PrintSizesBeforeAfter(optimized_item.graph, *optimized_graph),
-            ", time = ", duration_ms, "ms.");
+      // Some must run only on the last iteration.
+      if (optimizer->name() == "scoped_allocator_optimizer") {
+        if (sa_optimizer == nullptr) sa_optimizer = optimizer.get();
+        continue;
+      }
+      if (optimizer->name() == "xla-fusion") {
+        if (fusion_optimizer == nullptr) fusion_optimizer = optimizer.get();
+        continue;
       }
-      VLOG(4) << optimizer->name() << ": " << result;
 
-      OptimizerResult optimizer_result{optimizer->name(), result};
-      optimization_result.results.push_back(optimizer_result);
+      Status status = RunOptimizer(optimizer.get(), cluster, &optimized_item,
+                                   optimized_graph, &optimization_result);
+      if (status.ok()) is_optimized = true;
     }
   }
 
+  // Run fusion optimizer if requested after all other optimizers since: 1) it
+  // doesn't need to be called more than once. 2) we don't want subsequent
+  // optimization passes to break the fusion clusters. We could potentially
+  // encapsulate the fusion clusters right away, but that will prevent a lot of
+  // optimizations from taking place since we don't have shape inference for
+  // functions, and we can't optimize across function boundaries.
+  if (fusion_optimizer != nullptr) {
+    Status status = RunOptimizer(fusion_optimizer, cluster, &optimized_item,
+                                 optimized_graph, &optimization_result);
+    if (status.ok()) is_optimized = true;
+  }
+
+  // ScopedAllocatorOptimizer must run last.
+  if (sa_optimizer != nullptr) {
+    Status status = RunOptimizer(sa_optimizer, cluster, &optimized_item,
+                                 optimized_graph, &optimization_result);
+    if (status.ok()) is_optimized = true;
+  }
+
   // Record graph optimization result.
   optimization_results_.push_back(optimization_result);
 
@@ -259,6 +275,35 @@ Status MetaOptimizer::OptimizeGraph(Cluster* cluster, const GrapplerItem& item,
   return Status::OK();
 }
 
+Status MetaOptimizer::RunOptimizer(
+    GraphOptimizer* optimizer, Cluster* cluster, GrapplerItem* optimized_item,
+    GraphDef* optimized_graph, GraphOptimizationResult* optimization_result) {
+  uint64 start_us = Env::Default()->NowMicros();
+  // This swaps the current optimized_graph into optimized item and
+  // resets optimized_graph to an empty graph.
+  optimized_graph->Swap(&optimized_item->graph);
+  *optimized_graph = GraphDef();
+  Status status =
+      optimizer->Optimize(cluster, *optimized_item, optimized_graph);
+  uint64 end_us = Env::Default()->NowMicros();
+
+  string result;
+  if (!status.ok()) {
+    optimized_graph->Swap(&optimized_item->graph);
+    result = status.ToString();
+  } else {
+    float duration_ms = (end_us - start_us) / 1000.0f;
+    result = strings::StrCat(
+        PrintSizesBeforeAfter(optimized_item->graph, *optimized_graph),
+        ", time = ", duration_ms, "ms.");
+  }
+  VLOG(4) << optimizer->name() << ": " << result;
+
+  OptimizerResult optimizer_result{optimizer->name(), result};
+  optimization_result->results.push_back(optimizer_result);
+  return status;
+}
+
 Status MetaOptimizer::Optimize(Cluster* cluster, const GrapplerItem& item,
                                GraphDef* optimized_graph) {
   optimization_results_.clear();
@@ -361,6 +406,7 @@ bool MetaOptimizerEnabled(const RewriterConfig& cfg) {
          cfg.auto_parallel().enable() ||
          cfg.memory_optimization() != RewriterConfig::NO_MEM_OPT ||
          cfg.debug_stripper() == RewriterConfig::ON ||
+         cfg.scoped_allocator_optimization() == RewriterConfig::ON ||
          !cfg.optimizers().empty() || !cfg.custom_optimizers().empty();
 }
 
diff --git a/tensorflow/core/grappler/optimizers/meta_optimizer.h b/tensorflow/core/grappler/optimizers/meta_optimizer.h
index e736dd174ed96cd615bf7eb1e477a461c2568de3..151a54cbdfd32b399403487a5144095289b656a7 100644
--- a/tensorflow/core/grappler/optimizers/meta_optimizer.h
+++ b/tensorflow/core/grappler/optimizers/meta_optimizer.h
@@ -72,6 +72,10 @@ class MetaOptimizer : public GraphOptimizer {
     std::vector<OptimizerResult> results;
   };
 
+  Status RunOptimizer(GraphOptimizer* optimizer, Cluster* cluster,
+                      GrapplerItem* optimized_item, GraphDef* optimized_graph,
+                      GraphOptimizationResult* optimization_result);
+
   std::vector<GraphOptimizationResult> optimization_results_;
 };
 
diff --git a/tensorflow/core/grappler/optimizers/remapper.cc b/tensorflow/core/grappler/optimizers/remapper.cc
index 2a628712930aad52bab821f26f227287591c9c36..03e36a7b9cd9332ae2ed00da9ab7dca56eb73bcc 100644
--- a/tensorflow/core/grappler/optimizers/remapper.cc
+++ b/tensorflow/core/grappler/optimizers/remapper.cc
@@ -22,16 +22,78 @@ limitations under the License.
 #include "tensorflow/core/grappler/op_types.h"
 #include "tensorflow/core/grappler/optimizers/constant_folding.h"
 #include "tensorflow/core/grappler/utils.h"
+#include "tensorflow/core/platform/logging.h"
 
 namespace tensorflow {
 namespace grappler {
 
 void AddBatchNormNodes(GraphDef* optimized_graph, const NodeDef& fused_node) {
   const string& x = fused_node.input(0);
-  const string& scale = fused_node.input(1);
-  const string& offset = fused_node.input(2);
-  const string& mean = fused_node.input(3);
-  const string& variance = fused_node.input(4);
+  string scale = fused_node.input(1);
+  string offset = fused_node.input(2);
+  string mean = fused_node.input(3);
+  string variance = fused_node.input(4);
+
+  if (fused_node.attr().at("data_format").s() == "NCHW") {
+    // Need to reshape the last 4 inputs
+    NodeDef* new_shape = optimized_graph->add_node();
+    new_shape->set_name(AddPrefixToNodeName("NCHWShape", fused_node.name()));
+    new_shape->set_op("Const");
+    new_shape->set_device(fused_node.device());
+    *new_shape->add_input() = AsControlDependency(scale);
+    (*new_shape->mutable_attr())["dtype"].set_type(DT_INT32);
+    Tensor t(DT_INT32, {4});
+    t.flat<int32>()(0) = 1;
+    t.flat<int32>()(1) = -1;
+    t.flat<int32>()(2) = 1;
+    t.flat<int32>()(3) = 1;
+    t.AsProtoTensorContent(
+        (*new_shape->mutable_attr())["value"].mutable_tensor());
+
+    NodeDef* reshaped_scale = optimized_graph->add_node();
+    reshaped_scale->set_name(
+        AddPrefixToNodeName("NCHWShapedScale", fused_node.name()));
+    reshaped_scale->set_op("Reshape");
+    reshaped_scale->set_device(fused_node.device());
+    *reshaped_scale->add_input() = scale;
+    *reshaped_scale->add_input() = new_shape->name();
+    (*reshaped_scale->mutable_attr())["T"] = fused_node.attr().at("T");
+    (*reshaped_scale->mutable_attr())["Tshape"].set_type(DT_INT32);
+    scale = reshaped_scale->name();
+
+    NodeDef* reshaped_offset = optimized_graph->add_node();
+    reshaped_offset->set_name(
+        AddPrefixToNodeName("NCHWShapedOffset", fused_node.name()));
+    reshaped_offset->set_op("Reshape");
+    reshaped_offset->set_device(fused_node.device());
+    *reshaped_offset->add_input() = offset;
+    *reshaped_offset->add_input() = new_shape->name();
+    (*reshaped_offset->mutable_attr())["T"] = fused_node.attr().at("T");
+    (*reshaped_offset->mutable_attr())["Tshape"].set_type(DT_INT32);
+    offset = reshaped_offset->name();
+
+    NodeDef* reshaped_mean = optimized_graph->add_node();
+    reshaped_mean->set_name(
+        AddPrefixToNodeName("NCHWShapedMean", fused_node.name()));
+    reshaped_mean->set_op("Reshape");
+    reshaped_mean->set_device(fused_node.device());
+    *reshaped_mean->add_input() = mean;
+    *reshaped_mean->add_input() = new_shape->name();
+    (*reshaped_mean->mutable_attr())["T"] = fused_node.attr().at("T");
+    (*reshaped_mean->mutable_attr())["Tshape"].set_type(DT_INT32);
+    mean = reshaped_mean->name();
+
+    NodeDef* reshaped_variance = optimized_graph->add_node();
+    reshaped_variance->set_name(
+        AddPrefixToNodeName("NCHWShapedVariance", fused_node.name()));
+    reshaped_variance->set_op("Reshape");
+    reshaped_variance->set_device(fused_node.device());
+    *reshaped_variance->add_input() = variance;
+    *reshaped_variance->add_input() = new_shape->name();
+    (*reshaped_variance->mutable_attr())["T"] = fused_node.attr().at("T");
+    (*reshaped_variance->mutable_attr())["Tshape"].set_type(DT_INT32);
+    variance = reshaped_variance->name();
+  }
 
   float epsilon = 0.0f;
   if (fused_node.attr().count("epsilon")) {
@@ -118,20 +180,16 @@ Status Remapper::Optimize(Cluster* /*cluster*/, const GrapplerItem& item,
       optimizable &= (node.attr().count("is_training") == 0 ||
                       !node.attr().at("is_training").b());
       if (optimizable) {
-        std::unordered_set<int> const_inputs;
-        for (const string& input : node.input()) {
-          int pos;
-          const string input_node = ParseNodeName(input, &pos);
-          if (properties.HasInputProperties(input_node)) {
-            const auto& props = properties.GetInputProperties(input_node);
-            if (props.size() > pos && props[pos].has_value()) {
-              const_inputs.insert(pos);
-            }
+        int const_inputs = 0;
+        const auto& props = properties.GetInputProperties(node.name());
+        for (const auto& prop : props) {
+          if (prop.has_value()) {
+            const_inputs += 1;
           }
         }
         // TODO(bsteiner): use the cost model to compare the cost of fused batch
         // norm against that of the optimized form.
-        optimizable = (const_inputs.size() >= 4);
+        optimizable = (const_inputs >= 4);
       }
       if (optimizable) {
         for (GraphView::Edge edge : graph.GetFanoutEdges(node, false)) {
@@ -143,6 +201,7 @@ Status Remapper::Optimize(Cluster* /*cluster*/, const GrapplerItem& item,
         }
       }
       if (optimizable) {
+        VLOG(1) << "Optimizing fused batch norm node " << node.DebugString();
         AddBatchNormNodes(optimized_graph, node);
         continue;
       }
diff --git a/tensorflow/core/grappler/optimizers/remapper_test.cc b/tensorflow/core/grappler/optimizers/remapper_test.cc
index 291585c5382173f706a421183b18aea70a30c771..4cbf0d8d6f11ea77cfd3973e3ff0c109d48c0273 100644
--- a/tensorflow/core/grappler/optimizers/remapper_test.cc
+++ b/tensorflow/core/grappler/optimizers/remapper_test.cc
@@ -16,6 +16,7 @@ limitations under the License.
 #include "tensorflow/core/grappler/optimizers/remapper.h"
 #include "tensorflow/cc/ops/standard_ops.h"
 #include "tensorflow/core/framework/tensor_testutil.h"
+#include "tensorflow/core/grappler/devices.h"
 #include "tensorflow/core/grappler/grappler_item.h"
 #include "tensorflow/core/grappler/utils/grappler_test.h"
 #include "tensorflow/core/platform/test.h"
@@ -54,5 +55,41 @@ TEST_F(RemapperTest, FusedBatchNorm) {
   test::ExpectTensorNear<float>(tensors_expected[0], tensors[0], 1e-6);
 }
 
+TEST_F(RemapperTest, FusedBatchNormNCHW) {
+  tensorflow::Scope s = tensorflow::Scope::NewRootScope();
+  Output dflt =
+      ops::Const(s.WithOpName("dflt"), {3.14f, 2.7f, 1.0f, 2.0f, 3.0f, 100.0f},
+                 {1, 3, 1, 2});
+  Output x = ops::PlaceholderWithDefault(s.WithOpName("x"), dflt, {1, 3, 1, 2});
+  Output scale = ops::Const(s.WithOpName("scale"), {0.3f, 7.0f, 123.0f}, {3});
+  Output offset =
+      ops::Const(s.WithOpName("offset"), {0.123f, 2.1f, 0.55f}, {3});
+  Output mean = ops::Const(s.WithOpName("mean"), {7.3f, 8.3f, 3.1f}, {3});
+  Output variance =
+      ops::Const(s.WithOpName("variance"), {0.57f, 1.0f, 2.0f}, {3});
+  ops::FusedBatchNorm::Attrs attr;
+  attr = attr.IsTraining(false);
+  attr = attr.DataFormat("NCHW");
+  ops::FusedBatchNorm bn(s.WithOpName("batch_norm").WithDevice("/device:GPU:0"),
+                         x, scale, offset, mean, variance, attr);
+
+  GrapplerItem item;
+  TF_CHECK_OK(s.ToGraphDef(&item.graph));
+  item.fetch = {"batch_norm"};
+
+  Remapper optimizer(RewriterConfig::ON);
+  GraphDef output;
+  TF_CHECK_OK(optimizer.Optimize(nullptr, item, &output));
+
+  if (GetNumAvailableGPUs() > 0) {
+    // NCHW batch norm is only supported on GPU.
+    auto tensors_expected = EvaluateNodes(item.graph, item.fetch);
+    EXPECT_EQ(1, tensors_expected.size());
+    auto tensors = EvaluateNodes(output, item.fetch);
+    EXPECT_EQ(1, tensors.size());
+    test::ExpectTensorNear<float>(tensors_expected[0], tensors[0], 1e-6);
+  }
+}
+
 }  // namespace grappler
 }  // namespace tensorflow
diff --git a/tensorflow/core/grappler/optimizers/scoped_allocator_optimizer.cc b/tensorflow/core/grappler/optimizers/scoped_allocator_optimizer.cc
new file mode 100644
index 0000000000000000000000000000000000000000..cceef4098df67e69e01e5cf7ab4ad3cefc146bbd
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/scoped_allocator_optimizer.cc
@@ -0,0 +1,929 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include "tensorflow/core/grappler/optimizers/scoped_allocator_optimizer.h"
+
+#include "tensorflow/core/common_runtime/scoped_allocator.h"
+#include "tensorflow/core/common_runtime/scoped_allocator_mgr.h"
+#include "tensorflow/core/framework/graph.pb.h"
+#include "tensorflow/core/framework/node_def_builder.h"
+#include "tensorflow/core/framework/node_def_util.h"
+#include "tensorflow/core/graph/graph.h"
+#include "tensorflow/core/grappler/costs/graph_properties.h"
+#include "tensorflow/core/grappler/grappler_item.h"
+#include "tensorflow/core/grappler/op_types.h"
+#include "tensorflow/core/grappler/utils/frame.h"
+#include "tensorflow/core/lib/gtl/inlined_vector.h"
+
+// Like TF_RETURN_IF_ERROR, but also logs a WARNING.
+#define LOG_WARNING_AND_RETURN_IF_ERROR(...)            \
+  do {                                                  \
+    const ::tensorflow::Status _status = (__VA_ARGS__); \
+    if (TF_PREDICT_FALSE(!_status.ok())) {              \
+      LOG(WARNING) << "error: " << _status;             \
+      return _status;                                   \
+    }                                                   \
+  } while (0)
+
+namespace tensorflow {
+namespace grappler {
+
+namespace {
+// Node names often have some kind of name_scope prefix, with slashes,
+// and a _nn numeric suffix.  Returns true if the main part of the node_name
+// matches op_name, i.e. it looks from the name like this node is
+// of that op type.
+bool HasOpName(const string& node_name, const string& op_name) {
+  size_t begin = node_name.rfind("/");
+  if (begin == string::npos) {
+    begin = 0;
+  } else {
+    ++begin;
+  }
+  size_t end = node_name.rfind("_");
+  if (end != string::npos) {
+    size_t p = end + 1;
+    while (p < node_name.size()) {
+      if (!isdigit(node_name[p])) {
+        end = node_name.size();
+        break;
+      }
+      ++p;
+    }
+  } else {
+    end = node_name.size();
+  }
+  return node_name.substr(begin, end - begin) == op_name;
+}
+
+// After shape inference has been done each op should be annotated
+// with its output shape(s).  This function iterates over a collection
+// of ops that are a potential application of a ScopedAllocator.  It
+// verifies whether they all have the same output type and if so
+// gathers a vector of their output shapes.  It returns an error if
+// any of the ops doesn't have type or shape data, or if it has more
+// than one output, of if the output type of all ops is not the same.
+// If it returns OK then *type and *shapes should be correctly populated.
+Status CheckTypesAndGetShapes(const GraphProperties& graph_properties,
+                              const std::vector<NodeDef*>& ops, DataType* type,
+                              std::vector<TensorShape>* shapes) {
+  VLOG(1) << "CheckTypesAndGetShapes";
+  *type = DT_INVALID;
+  for (NodeDef* n : ops) {
+    AttrSlice n_attrs = AttrSlice(*n);
+    DataType dtype;
+    LOG_WARNING_AND_RETURN_IF_ERROR(GetNodeAttr(n_attrs, "T", &dtype));
+    VLOG(2) << "op " << n->name() << " has type " << dtype << " shapes.size() "
+            << shapes->size();
+    if (!graph_properties.HasOutputProperties(n->name())) {
+      LOG(ERROR) << "Node " << n->DebugString() << " lacks output shape.";
+      return errors::Internal("Node ", n->name(), " lacks output shape.");
+    }
+    const std::vector<OpInfo::TensorProperties>& prop_list =
+        graph_properties.GetOutputProperties(n->name());
+    if (prop_list.size() != 1) {
+      return errors::Internal("Node ", n->name(),
+                              " does not have exactly one output as expected "
+                              "by ScopedAllocatorOptimizer");
+    }
+    const OpInfo::TensorProperties& props = prop_list[0];
+    if (shapes->empty()) {
+      *type = props.dtype();
+    } else if (*type != props.dtype()) {
+      return errors::Internal("Group ops don't all have same type");
+    } else if (!TensorShape::IsValid(props.shape())) {
+      return errors::Internal("Complete shape not known for ", n->name());
+    }
+    VLOG(2) << "Adding shape " << props.shape().DebugString();
+    shapes->push_back(TensorShape(props.shape()));
+  }
+  return Status::OK();
+}
+
+// Describes an existing input edge in the graph.
+struct InputDesc {
+  NodeDef* from_node_def;
+  int output_slot;
+  NodeDef* to_node_def;
+  InputDesc(NodeDef* f, int os, NodeDef* t)
+      : from_node_def(f), output_slot(os), to_node_def(t) {}
+};
+
+// Populates *inputs with all of the non-control inputs of ops.
+// Returns error if it fails to find exactly one input for each op,
+// or if some input is not of type dtype.
+Status GetInputs(NodeMap* node_map, const std::vector<NodeDef*>& ops,
+                 DataType dtype, std::vector<InputDesc>* inputs) {
+  VLOG(1) << "Getinputs";
+  for (NodeDef* n : ops) {
+    NodeDef* inode = nullptr;
+    int position = 0;
+    VLOG(2) << "for node " << n->name();
+    for (const auto& input_name : n->input()) {
+      if (!IsControlInput(input_name)) {
+        if (inode) {
+          return errors::Internal("Found more than one input for node ",
+                                  n->name());
+        }
+        ParseNodeName(input_name, &position);
+        inode = node_map->GetNode(input_name);
+        CHECK(inode) << input_name;
+        VLOG(2) << "inode " << inode->DebugString();
+      }
+    }
+    AttrSlice inode_attrs = AttrSlice(*inode);
+    DataType inode_dtype;
+    LOG_WARNING_AND_RETURN_IF_ERROR(
+        GetNodeAttr(inode_attrs, "T", &inode_dtype));
+    if (inode_dtype != dtype) {
+      return errors::Internal("ScopedAllocatorOptimizer expected input type ",
+                              dtype, " but found ", inode_dtype);
+    }
+    // inputs->push_back(InputDesc(inode, position, n));
+    inputs->emplace_back(inode, position, n);
+  }
+  return Status::OK();
+}
+
+// Remove the NodeDef nd from node_map and graph.  It must be the case
+// that nd no longer has any input or output edges, though that is not
+// checked.
+void RemoveNode(NodeDef* nd, GraphDef* graph, NodeMap* node_map) {
+  node_map->RemoveNode(nd->name());
+  // TODO(tucker): The efficiency of this routine is poor.
+  // Change to accumulate and do a bulk removal, maybe refactoring
+  // some code from dependency_optimizer.
+  protobuf::RepeatedPtrField<NodeDef>* nodes = graph->mutable_node();
+  for (int i = 0; i < nodes->size(); ++i) {
+    if (nd->name() == (*nodes)[i].name()) {
+      nodes->SwapElements(i, nodes->size() - 1);
+      nodes->RemoveLast();
+      return;
+    }
+  }
+  LOG(FATAL) << "Failed to find node " << nd->name() << " in graph";
+}
+
+// Removes a named edge from between two nodes.
+Status RemoveEdge(const string& input_edge_name, const string& from_node_name,
+                  NodeDef* to_node, NodeMap* node_map) {
+  if (node_map) {
+    node_map->RemoveOutput(from_node_name, to_node->name());
+  }
+  protobuf::RepeatedPtrField<string>* inputs = to_node->mutable_input();
+  int edge_index = -1;
+  for (edge_index = 0; edge_index < inputs->size(); ++edge_index) {
+    VLOG(2) << " consider edge " << (*inputs)[edge_index];
+    if ((*inputs)[edge_index] == input_edge_name) {
+      break;
+    }
+  }
+  if (edge_index >= inputs->size()) {
+    return errors::Internal("Could not find input name ", input_edge_name,
+                            " at node ", to_node->name());
+  }
+  inputs->DeleteSubrange(edge_index, 1);
+  return Status::OK();
+}
+}  // namespace
+
+void ScopedAllocatorOptimizer::ExtendNodeAttr(StringPiece name,
+                                              const std::vector<int32>& values,
+                                              NodeDef* node_def) {
+  if (HasNodeAttr(*node_def, name)) {
+    VLOG(2) << "extending";
+    AttrValue* existing = &(*node_def->mutable_attr())[name.ToString()];
+    for (int32 i : values) {
+      existing->mutable_list()->add_i(i);
+    }
+  } else {
+    VLOG(2) << "setting new attr value";
+    AddNodeAttr(name, values, node_def);
+  }
+}
+
+class UnaryElementwiseRewriter : public ScopedAllocatorOptimizer::Rewriter {
+ public:
+  ~UnaryElementwiseRewriter() override {}
+
+  // Return non-OK if any input is already committed to a ScopedAllocator.
+  Status CheckExistingScopedAllocator(const std::vector<InputDesc>& inputs) {
+    for (const InputDesc& nd : inputs) {
+      VLOG(2) << "get attrs for " << nd.from_node_def->name();
+      AttrSlice n_attrs = AttrSlice(*nd.from_node_def);
+      int sa_id;
+      Status ss = GetNodeAttr(n_attrs, "sa_id", &sa_id);
+      if (ss.ok()) {
+        LOG(INFO) << "Abandoning PARewriter because input "
+                  << nd.from_node_def->name() << " is already assigned "
+                  << "to ScopedAllocator " << sa_id;
+        return errors::Internal(
+            "Abandoning PARewriter because input ", nd.from_node_def->name(),
+            " is already assigned to ScopedAllocator ", sa_id);
+      }
+    }
+    return Status::OK();
+  }
+
+  // Return non-OK if any input is a member of op_set.
+  Status CheckInternalDataDependency(const std::set<string>& op_set,
+                                     const std::vector<InputDesc>& inputs) {
+    for (const InputDesc& nd : inputs) {
+      if (op_set.find(nd.from_node_def->name()) != op_set.end()) {
+        if (nd.output_slot != tensorflow::Graph::kControlSlot) {
+          return errors::Internal("Data edge exists bewtween ",
+                                  nd.from_node_def->name(),
+                                  " and another "
+                                  "node in the set");
+        }
+      }
+    }
+    return Status::OK();
+  }
+
+  // Remove all control edges between members of ops.
+  void ClearInternalControlInputs(const std::set<string>& op_set,
+                                  const std::vector<NodeDef*>& ops,
+                                  NodeMap* node_map) {
+    for (NodeDef* n : ops) {
+      for (const auto& input_name : n->input()) {
+        if (IsControlInput(input_name)) {
+          int position = 0;
+          string input_node_name = ParseNodeName(input_name, &position);
+          CHECK_EQ(position, -1);
+          if (op_set.find(input_node_name) != op_set.end()) {
+            // This is an internal control edge.  Remove it.
+            VLOG(1) << "Remove control output from " << input_node_name
+                    << " via edge " << input_name << " to " << n->name();
+            TF_CHECK_OK(RemoveEdge(input_name, input_node_name, n, node_map));
+          }
+        }
+      }
+    }
+  }
+
+  // Examine the input set of an op set, gathering their shapes and types
+  // and checking whether there are any considerations that prevent use
+  // of a single ScopedAllocator for all of those inputs.
+  Status AnalyzeInputs(ScopedAllocatorOptimizer* sa_opti, NodeMap* node_map,
+                       const std::vector<NodeDef*>& ops,
+                       const std::set<string>& op_instance_names,
+                       string* device_name, DataType* dtype,
+                       std::vector<TensorShape>* input_shapes,
+                       std::vector<InputDesc>* inputs, TensorShape* sa_shape) {
+    CHECK(graph_properties_);
+    LOG_WARNING_AND_RETURN_IF_ERROR(
+        CheckTypesAndGetShapes(*graph_properties_, ops, dtype, input_shapes));
+    LOG_WARNING_AND_RETURN_IF_ERROR(
+        GetInputs(sa_opti->node_map(), ops, *dtype, inputs));
+    LOG_WARNING_AND_RETURN_IF_ERROR(CheckExistingScopedAllocator(*inputs));
+    LOG_WARNING_AND_RETURN_IF_ERROR(
+        CheckInternalDataDependency(op_instance_names, *inputs));
+    ClearInternalControlInputs(op_instance_names, ops, node_map);
+    *device_name = ops[0]->device();
+    CHECK(!device_name->empty());
+    CHECK(!input_shapes->empty());
+    CHECK_EQ(0, Allocator::kAllocatorAlignment % DataTypeSize(*dtype))
+        << "ScopedAllocatorOptimizer only applies to types that evenly "
+        << "divide kAllocatorAlignment";
+    std::vector<ScopedAllocator::Field> sa_fields;
+    // Calculate the field embedding boundaries and thereby the
+    // required size of the backing tensor.
+    int64 num_bytes = ScopedAllocatorMgr::PopulateFields(
+        0 /*scope_id*/, *input_shapes, *dtype, &sa_fields);
+    int64 num_elts = num_bytes / DataTypeSize(*dtype);
+    VLOG(2) << "num_bytes " << num_bytes << " num_elts=" << num_elts;
+    *sa_shape = TensorShape({num_elts});
+    return Status::OK();
+  }
+
+  // Build the ScopedAllocator node that will be assigned to allocate
+  // the output tensors of the input node set.
+  Status ConstructScopedAllocatorNode(
+      ScopedAllocatorOptimizer* sa_opti, GraphDef* graph, NodeMap* node_map,
+      const std::vector<NodeDef*>& ops, const string& device_name,
+      DataType dtype, int sa_id, const string& sa_name,
+      const std::vector<TensorShape>& input_shapes,
+      const std::vector<InputDesc>& inputs, const TensorShape& sa_shape) {
+    VLOG(2) << "ConstructScopedAllocatorNode " << sa_name;
+    NodeDefBuilder sa_builder(sa_name, "_ScopedAllocator");
+    sa_builder.Device(device_name);
+    sa_builder.Attr("sa_name", sa_name);
+    sa_builder.Attr("T", dtype);
+    sa_builder.Attr("id", sa_id);
+    sa_builder.Attr("shapes", input_shapes);
+    sa_builder.Attr("shape", sa_shape);
+    sa_builder.Attr("expected_call_count", static_cast<int64>(ops.size()));
+    NodeDef* sa_node = graph->add_node();
+    LOG_WARNING_AND_RETURN_IF_ERROR(sa_builder.Finalize(sa_node));
+    node_map->AddNode(sa_name, sa_node);
+
+    // Add control edges from the ScopedAllocatorOp to all of the
+    // input nodes and mark them for allocation from backing tensor.
+    for (int i = 0; i < inputs.size(); ++i) {
+      auto& nd = inputs[i];
+      VLOG(2) << "To input " << i << ": " << nd.from_node_def->name()
+              << " add control input "
+              << "^" << sa_name;
+      nd.from_node_def->add_input(strings::StrCat("^", sa_name));
+      // This attribute says: allocate output_slot from
+      // ScopedAllocator instance sa_id + 1 + i.
+      ScopedAllocatorOptimizer::ExtendNodeAttr("_scoped_allocator",
+                                               {nd.output_slot, sa_id + 1 + i},
+                                               nd.from_node_def);
+      node_map->AddOutput(sa_name, nd.from_node_def->name());
+    }
+    return Status::OK();
+  }
+
+  Status BuildSAConcatNode(GraphDef* graph, NodeMap* node_map,
+                           const std::vector<NodeDef*>& ops,
+                           const std::set<string>& op_instance_names,
+                           const string& device_name, DataType dtype, int sa_id,
+                           const string& sa_name, const string& sac_name,
+                           const TensorShape& sa_shape,
+                           std::vector<NodeDefBuilder::NodeOut>* sac_inputs) {
+    VLOG(2) << "BuildSAConcatNode " << sac_name;
+    std::set<string> sac_ctl_inputs;
+    for (int i = 0; i < ops.size(); ++i) {
+      NodeDef* old_op = ops[i];
+      for (const string& old_op_input : old_op->input()) {
+        int position = 0;
+        string input_name = ParseNodeName(old_op_input, &position);
+        if (position == -1) {
+          // A control input: drop if from another member of the op set.
+          if (op_instance_names.find(old_op_input) == op_instance_names.end()) {
+            sac_ctl_inputs.insert(old_op_input);
+          }
+        } else {
+          // TODO(tucker): remove redundant check.
+          // A data input: illegal if from another member of the op set.
+          if (op_instance_names.find(old_op_input) != op_instance_names.end()) {
+            LOG(ERROR) << "Data edge between " << old_op_input << " and "
+                       << old_op->name() << " cannot build ScopedAllocator.";
+            return errors::Internal("Data edge between ", old_op_input, " and ",
+                                    old_op->name(),
+                                    " cannot build ScopedAllocator.");
+          }
+          sac_inputs->push_back(
+              NodeDefBuilder::NodeOut(old_op_input, 0, dtype));
+        }
+        VLOG(3) << "from op " << i << ": " << old_op->name()
+                << " sac_inputs append " << old_op_input;
+      }
+    }
+    NodeDefBuilder sac_builder(sac_name, "_ScopedAllocatorConcat");
+    VLOG(2) << "New sac_name " << sac_name << " shape "
+            << sa_shape.DebugString();
+    sac_builder.Device(device_name);
+    sac_builder.Attr("sa_name", sa_name);
+    sac_builder.Attr("id", sa_id);
+    sac_builder.Attr("T", dtype);
+    sac_builder.Attr("shape", sa_shape);
+    sac_builder.Attr("N", static_cast<int>(sac_inputs->size()));
+    sac_builder.Input(NodeDefBuilder::NodeOut(sa_name, 0, dtype));
+    sac_builder.Input(*sac_inputs);
+    NodeDef* sac_node = graph->add_node();
+    LOG_WARNING_AND_RETURN_IF_ERROR(sac_builder.Finalize(sac_node));
+    node_map->AddNode(sac_name, sac_node);
+    node_map->AddOutput(sa_name, sac_name);
+
+    // Attach the old control inputs to the new sac node.
+    for (const string& ctl_input : sac_ctl_inputs) {
+      sac_node->add_input(ctl_input);
+    }
+    return Status::OK();
+  }
+
+  Status BuildReplacementOp(GraphDef* graph, NodeMap* node_map,
+                            const std::vector<NodeDef*>& ops,
+                            const string& device_name, DataType dtype,
+                            const string& op_name, const string& sac_name,
+                            const string& sa_op_name) {
+    VLOG(2) << "BuildReplacementOp " << sa_op_name;
+    NodeDefBuilder op_builder(sa_op_name, op_name);
+    op_builder.Device(device_name);
+
+    // Transfer the Node Attr from the first replaced Node to the new
+    // Node.  TODO(tucker): In principle we should verify that
+    // the Attr are consistent and compatible across all op instances.
+    // Unfortunately that will probably require op-specific tests, so
+    // punt on that for the time being.
+    AttrSlice first_slice(*ops[0]);
+    for (auto& it : first_slice) {
+      op_builder.Attr(it.first, it.second);
+    }
+    op_builder.Attr("_forward_input", {0, 0});
+    op_builder.Input(sac_name, 0, dtype);
+    NodeDef* sa_op_node = graph->add_node();
+    LOG_WARNING_AND_RETURN_IF_ERROR(op_builder.Finalize(sa_op_node));
+    node_map->AddNode(sa_op_name, sa_op_node);
+    node_map->AddOutput(sac_name, sa_op_name);
+    return Status::OK();
+  }
+
+  Status BuildSplitNode(GraphDef* graph, NodeMap* node_map,
+                        const std::vector<NodeDef*>& ops,
+                        const std::vector<TensorShape>& input_shapes,
+                        const std::vector<NodeDefBuilder::NodeOut>& sac_inputs,
+                        const string& device_name, DataType dtype,
+                        const string& op_name, int sa_id,
+                        const string& sas_name, const string& sa_name,
+                        const string& sa_op_name) {
+    VLOG(2) << "new ScopedAllocatorSplit " << sas_name;
+    NodeDefBuilder sas_builder(sas_name, "_ScopedAllocatorSplit");
+    sas_builder.Device(device_name);
+    sas_builder.Attr("sa_name", sa_name);
+    sas_builder.Attr("id", sa_id);
+    sas_builder.Attr("T", dtype);
+    sas_builder.Attr("shapes", input_shapes);
+    std::vector<NodeDefBuilder::NodeOut> sas_inputs = sac_inputs;
+    sas_builder.Attr("N", static_cast<int>(sas_inputs.size()));
+    sas_builder.Input(NodeDefBuilder::NodeOut(sa_op_name, 0, dtype));
+    sas_builder.Input(sas_inputs);
+    NodeDef* sas_node = graph->add_node();
+    LOG_WARNING_AND_RETURN_IF_ERROR(sas_builder.Finalize(sas_node));
+    node_map->AddNode(sas_name, sas_node);
+    node_map->AddOutput(sa_op_name, sas_name);
+    return Status::OK();
+  }
+
+  // After the new ScopedAllocator and its corresponding Concat and
+  // Split nodes have been built, and a new single Op instance
+  // constructed, rewire the graph: Remove input edges to the old Op
+  // nodes and replace the old Op node outputs with the corresponding
+  // ScopedAllocatorSplit node outputs.  After this the old Op nodes
+  // should no longer have any input or output edges and they can be
+  // removed from the graph.
+  Status RewireSubgraph(GraphDef* graph, NodeMap* node_map,
+                        const std::vector<NodeDef*>& ops,
+                        const std::set<string>& op_instance_names,
+                        const string& op_name, const string& sas_name) {
+    VLOG(2) << "RewireSubgraph";
+    for (int op_idx = 0; op_idx < ops.size(); ++op_idx) {
+      NodeDef* old_op = ops[op_idx];
+      // Copy the output node set since we'll be modifying the version
+      // maintained by NodeMap in the loop.
+      std::set<NodeDef*> output_nodes = node_map->GetOutputs(old_op->name());
+      VLOG(3) << "old_op " << old_op->name() << " had " << output_nodes.size()
+              << " outputs.  Moving them to the PASplit node.";
+      if (VLOG_IS_ON(2)) {
+        for (NodeDef* n : output_nodes) {
+          VLOG(3) << "    output: " << n->name();
+        }
+      }
+      for (NodeDef* n : output_nodes) {
+        VLOG(3) << "really checking old output " << n->name()
+                << " for corresponding input.";
+        if (op_instance_names.find(n->name()) != op_instance_names.end()) {
+          // If this output node is a member of the ops set, it must have
+          // been an internal control edge so drop it.
+          VLOG(3) << "Dropping control output from " << old_op->name() << " to "
+                  << n->name();
+          // However, we may already have dropped it at the clear() below,
+          // so if we fail to find it, that's okay.
+          Status ignore = RemoveEdge(strings::StrCat("^", old_op->name()),
+                                     old_op->name(), n, node_map);
+          continue;
+        }
+        bool found = false;
+        VLOG(3) << "about to iterate over " << n->input_size() << " inputs";
+        for (int i = 0; i < n->input_size(); ++i) {
+          VLOG(3) << "input " << n->input(i);
+          int position = 0;
+          string input_node = ParseNodeName(n->input(i), &position);
+          if (input_node == old_op->name()) {
+            found = true;
+            VLOG(3) << "match pos=" << position;
+            if (position == -1) {
+              // It was a control edge
+              *n->mutable_input(i) = strings::StrCat("^", sas_name);
+            } else {
+              CHECK_EQ(0, position)
+                  << "name " << n->input(i) << " pos " << position;
+              *n->mutable_input(i) = strings::StrCat(sas_name, ":", op_idx);
+            }
+            node_map->RemoveOutput(old_op->name(), n->name());
+            node_map->AddOutput(sas_name, n->name());
+            VLOG(3) << "breaking on success";
+            break;
+          } else {
+            VLOG(3) << "other input " << n->input(i);
+          }
+        }
+        // In general it's required that we found the output node's old
+        // input and replaced it, but one exception is if the output node
+        // is of the same type being coalesced and the edge is a control
+        // input.  In that case it probably got eliminated in an earlier
+        // pass.
+        VLOG(3) << "before HasOp";
+        if (!HasOpName(n->name(), op_name)) {
+          CHECK(found) << "old_op " << old_op->name() << " node "
+                       << " could not find input edge on " << n->DebugString()
+                       << " to replace."
+                       << " " << op_name << " not in " << n->name();
+        }
+        VLOG(3) << "bottom of for output_nodes";
+      }
+      VLOG(3) << "Clearing all inputs of " << old_op->name();
+      node_map->RemoveInputs(old_op->name());
+      old_op->clear_input();
+      node_map->RemoveOutputs(old_op->name());
+      VLOG(3) << "after clear: " << old_op->DebugString();
+      // old_op should be dead, with no further inputs or outputs.
+      // It needs to be removed altogether before the graph is generated,
+      // but we need to leave it around until this Optimizer is done,
+      // because there may be some
+      // Remove.
+      RemoveNode(old_op, graph, node_map);
+    }
+    return Status::OK();
+  }
+
+  // Given a collection of instances of op_name, presumed to be
+  // logically parallel and operating on tensors of the same type,
+  // replace them by a single instance.  First find the upstream Ops
+  // generating their inputs. Create a new ScopedAllocatorOp that
+  // outputs a single backing_tensor pre-arranged for sub-allocation
+  // of all of those input tensors.  Then insert a new
+  // ScopedAllocatorConcatOp below the upstream Ops to make explicit
+  // the materialization of a concatenation of their outputs.  Put the
+  // new op_name instance below the new concat op and follow with a
+  // ScopedAllocatorSplitOp that restores the correct shape outputs
+  // for the consumers of the old op_name instances.
+  //
+  // There must be no non-control edges between Nodes in 'ops'.
+  // Control edges among these nodes will be dropped.
+  Status Rewrite(ScopedAllocatorOptimizer* sa_opti, GraphDef* graph,
+                 const string& op_name, const std::vector<NodeDef*>& ops,
+                 bool* applied) override {
+    if (VLOG_IS_ON(1)) {
+      VLOG(1) << "Rewrite";
+      string op_names;
+      for (auto& nd : ops) {
+        strings::StrAppend(&op_names, nd->name(), ", ");
+      }
+      VLOG(1) << "UnaryElementwiseRewriter::Rewrite " << op_name
+              << " to: " << op_names;
+    }
+    NodeMap* node_map = sa_opti->node_map();
+
+    // Make a set of the node names for faster membership testing.
+    std::set<string> op_instance_names;
+    for (auto& nd : ops) {
+      op_instance_names.insert(nd->name());
+      VLOG(2) << "op_instance_name " << nd->name();
+    }
+    DataType dtype;
+    std::vector<TensorShape> input_shapes;
+    std::vector<InputDesc> inputs;
+    TensorShape sa_shape;
+    string device_name;
+
+    TF_RETURN_IF_ERROR(AnalyzeInputs(sa_opti, node_map, ops, op_instance_names,
+                                     &device_name, &dtype, &input_shapes,
+                                     &inputs, &sa_shape));
+
+    int sa_id = sa_opti->NewScopedAllocatorId(input_shapes.size());
+    string sa_name = strings::StrCat("scoped_allocator_", sa_id);
+    TF_RETURN_IF_ERROR(ConstructScopedAllocatorNode(
+        sa_opti, graph, node_map, ops, device_name, dtype, sa_id, sa_name,
+        input_shapes, inputs, sa_shape));
+
+    // TODO(tucker): Maybe add control edges to delay execution of the
+    // ScopedAllocatorOp until just before first use in order to
+    // conserve memory.  What would be correct?  Let I0...In be the
+    // input nodes that are all going to alloc from SA.  If we make
+    // SA wait until all of these are ready, that might be too slow.
+    // It should probably wait until at least one is ready, but which
+    // one?  Maybe just pick the first.
+    // {
+    //   auto& nd = inputs[0];
+    //   std::vector<InputDesc> inputs_to_first;
+    //   LOG_WARNING_AND_RETURN_IF_ERROR(GetInputs(sa_opti->node_map(),
+    //   {nd.from_node_def},
+    //                                dtype, &inputs_to_first));
+    //   for (int i = 0; i < inputs_to_first.size(); ++i) {
+    //     sa_node->add_input(
+    //         strings::StrCat("^", inputs_to_first[i].from_node_def->name()));
+    //   }
+    // }
+
+    // Build a ScopedAllocatorConcat below all of the input nodes.
+    std::vector<NodeDefBuilder::NodeOut> sac_inputs;
+    string sac_name = strings::StrCat("scoped_allocator_concat_", sa_id);
+    TF_RETURN_IF_ERROR(BuildSAConcatNode(
+        graph, node_map, ops, op_instance_names, device_name, dtype, sa_id,
+        sa_name, sac_name, sa_shape, &sac_inputs));
+
+    // Construct a new instance of the parallel op and insert it
+    // immediately below the new ScopedAllocatorConcat.
+    string sa_op_name = strings::StrCat(sa_name, "_", op_name);
+    TF_RETURN_IF_ERROR(BuildReplacementOp(graph, node_map, ops, device_name,
+                                          dtype, op_name, sac_name,
+                                          sa_op_name));
+
+    // Build a ScopedAllocatorSplit split below the new Op.
+    string sas_name = strings::StrCat("scoped_allocator_split_", sa_id);
+    TF_RETURN_IF_ERROR(BuildSplitNode(graph, node_map, ops, input_shapes,
+                                      sac_inputs, device_name, dtype, op_name,
+                                      sa_id, sas_name, sa_name, sa_op_name));
+
+    // Rewire the graph.
+    TF_RETURN_IF_ERROR(RewireSubgraph(graph, node_map, ops, op_instance_names,
+                                      op_name, sas_name));
+
+    *applied = true;
+    return Status::OK();
+  }
+};
+
+ScopedAllocatorOptimizer::ScopedAllocatorOptimizer(
+    const ScopedAllocatorOptions& opts) {
+  VLOG(1) << "ScopedAllocatorOptimizer::ScopedAllocatorOptimizer";
+  Rewriter* r = new UnaryElementwiseRewriter();
+  to_delete_.push_back(r);
+  if (opts.enable_op_size() == 0) {
+    // Opts handled by default:
+    for (const auto& op_name : {"CollectiveReduce"}) {
+      op_name_set_.insert(op_name);
+      rewriters_[op_name] = r;
+    }
+  } else {
+    for (const auto& op_name : opts.enable_op()) {
+      op_name_set_.insert(op_name);
+      rewriters_[op_name] = r;
+    }
+  }
+}
+
+Status ScopedAllocatorOptimizer::Optimize(Cluster* /*cluster*/,
+                                          const GrapplerItem& item,
+                                          GraphDef* optimized_graph) {
+  *optimized_graph = item.graph;
+  // Nodes that cannot be removed from the graph without damaging correctness,
+  // typically fetch nodes.
+  nodes_to_preserve_ = item.NodesToPreserve();
+
+  GraphProperties graph_properties(item);
+  const bool assume_valid_feeds = opt_level_ == RewriterConfig::AGGRESSIVE;
+  LOG_WARNING_AND_RETURN_IF_ERROR(
+      graph_properties.InferStatically(assume_valid_feeds));
+  node_map_.reset(new NodeMap(optimized_graph));
+
+  LOG_WARNING_AND_RETURN_IF_ERROR(ScopedAllocatorOptimizer::ProcessGraphDef(
+      optimized_graph, graph_properties));
+
+  VLOG(1) << "ScopedAllocatorOptimizer::Optimize() done";
+  return Status::OK();
+}
+
+ScopedAllocatorOptimizer::Rewriter* ScopedAllocatorOptimizer::GetRewriter(
+    const string& op_name) {
+  auto it = rewriters_.find(op_name);
+  if (it != rewriters_.end()) {
+    return it->second;
+  }
+  return nullptr;
+}
+
+int ScopedAllocatorOptimizer::NewScopedAllocatorId(int num_fields) {
+  CHECK_GT(num_fields, 0);
+  int id = next_sa_id_;
+  next_sa_id_ += (num_fields + 1);
+  CHECK_GT(next_sa_id_, 0);
+  return id;
+}
+
+ScopedAllocatorOptimizer::~ScopedAllocatorOptimizer() {
+  for (auto ptr : to_delete_) {
+    delete ptr;
+  }
+}
+
+void ScopedAllocatorOptimizer::FindOpOccurrences(GraphDef* graph,
+                                                 const OpNameSet& op_names,
+                                                 GraphOpOccurrences* occs) {
+  VLOG(1) << "FindOpOccurrences ";
+  for (const auto& it : op_names) {
+    VLOG(1) << "search target " << it;
+  }
+  for (int ni = 0; ni < graph->node_size(); ++ni) {
+    NodeDef* node = graph->mutable_node(ni);
+    const string& op_name = node->op();
+    if (op_names.find(op_name) != op_names.end()) {
+      VLOG(1) << "found " << op_name << " on dev " << node->device();
+      (*occs)[node->device()][op_name].push_back(node);
+    }
+  }
+}
+
+namespace {
+struct OpNameOrder {
+  bool operator()(const NodeDef* a, const NodeDef* b) {
+    return a->name() <= b->name();
+  }
+};
+
+class Tree {
+ public:
+  Tree(const string& edge, int depth) : edge_(edge), depth_(depth) {}
+  ~Tree() {
+    for (auto it : subtrees_) delete it.second;
+  }
+
+  Tree* GetSubTree(const string& edge) {
+    auto it = subtrees_.find(edge);
+    if (it != subtrees_.end()) {
+      return it->second;
+    }
+    Tree* t = new Tree(edge, depth_ + 1);
+    subtrees_[edge] = t;
+    return t;
+  }
+
+  void InsertNode(NodeDef* n) { nodes_.push_back(n); }
+
+  string edge_;
+  int depth_;
+  std::vector<NodeDef*> nodes_;
+  std::unordered_map<string, Tree*> subtrees_;
+};
+
+// Applies a function to every Tree in DFS order.  Terminates early
+// on any non-OK Status.
+Status ApplyToAll(Tree* tree, const std::function<Status(Tree*)>& func) {
+  Status s;
+  for (auto it : tree->subtrees_) {
+    s = ApplyToAll(it.second, func);
+    if (!s.ok()) return s;
+  }
+  s = func(tree);
+  return s;
+}
+
+Tree* ComputeScopeTree(const string& op_name,
+                       const std::vector<NodeDef*>& node_vec) {
+  Tree* root = new Tree("", 0);
+  for (NodeDef* n : node_vec) {
+    std::vector<string> pieces = str_util::Split(n->name(), "/");
+    // last piece is node name proper.
+    int depth = pieces.size() - 1;
+    Tree* subtree = root;
+    for (int i = 0; i < depth; ++i) {
+      subtree = subtree->GetSubTree(pieces[i]);
+    }
+    subtree->InsertNode(n);
+  }
+  return root;
+}
+
+void PartitionByLoopStructure(const FrameMap& frame_map,
+                              std::vector<NodeDef*> nodes,
+                              std::vector<std::vector<NodeDef*>>* loop_groups) {
+  // It is assumed that two nodes with identical loop containment have
+  // identical integer vectors.  Represent those by 64 bit hashes.
+  std::unordered_map<uint64, std::vector<NodeDef*>> loop_sets;
+  for (NodeDef* nd : nodes) {
+    uint64 hash = 0;
+    const auto& it = frame_map.find(nd);
+    if (it != frame_map.end()) {
+      const std::vector<int>& loop_ids = it->second;
+      for (int id : loop_ids) {
+        hash = Hash64Combine(hash, static_cast<uint64>(id));
+      }
+    }
+    loop_sets[hash].push_back(nd);
+  }
+  for (auto it : loop_sets) {
+    loop_groups->push_back(std::move(it.second));
+  }
+}
+
+}  // namespace
+
+Status ScopedAllocatorOptimizer::ProcessGraphDef(
+    GraphDef* graph, const GraphProperties& graph_properties) {
+  VLOG(1) << "ProcessGraphDef";
+  Status status;
+  GraphOpOccurrences occ;
+  FindOpOccurrences(graph, op_name_set_, &occ);
+  if (!occ.empty()) {
+    FrameMap frame_map;
+    int num_frames;
+    LOG_WARNING_AND_RETURN_IF_ERROR(
+        IdentifyFramesWithNodeMap(*graph, *node_map_, &frame_map, &num_frames));
+    for (auto& dt : occ) {
+      VLOG(2) << "Processing device " << dt.first;
+      const DevOpOccurrences& dev_occ = dt.second;
+      for (auto& it : dev_occ) {
+        string op_name = it.first;
+        VLOG(1) << "Processing " << op_name << " set size " << it.second.size();
+        Rewriter* rewriter = GetRewriter(op_name);
+        if (!rewriter) {
+          LOG(ERROR) << "Failed to find PARewriter for op_name " << op_name;
+          continue;
+        }
+        rewriter->SetGraphProperties(graph_properties);
+        std::unique_ptr<Tree> root(ComputeScopeTree(it.first, it.second));
+        // Nodes with a common depth and root path are now grouped
+        // in the same Tree struct.  Split those groups into subgroups that
+        // share identical loop nesting.
+        status = ApplyToAll(
+            root.get(), [this, rewriter, graph, &frame_map, &op_name](Tree* t) {
+              VLOG(2) << "applied to tree node " << t->edge_ << " at depth "
+                      << t->depth_ << " of size " << t->nodes_.size();
+              if (t->nodes_.size() > 1) {
+                std::vector<std::vector<NodeDef*>> loop_groups;
+                PartitionByLoopStructure(frame_map, t->nodes_, &loop_groups);
+                for (auto& lg : loop_groups) {
+                  if (lg.size() > 1) {
+                    bool applied = false;
+                    Status s = OrderNodeSet(&lg);
+                    TF_RETURN_IF_ERROR(s);
+                    VLOG(1) << "Applying Rewriter for " << op_name;
+                    s = rewriter->Rewrite(this, graph, op_name, lg, &applied);
+                    LOG_WARNING_AND_RETURN_IF_ERROR(s);
+                  }
+                }
+              }
+              return Status::OK();
+            });
+        if (!status.ok()) {
+          break;
+        }
+      }
+      if (!status.ok()) {
+        break;
+      }
+    }
+  }
+  VLOG(1) << "ScopedAllocatorOptimizer returning " << status;
+  if (!status.ok()) {
+    LOG(ERROR) << "ScopedAllocatorOptimizer: " << status;
+  }
+  return status;
+}
+
+namespace {
+struct InstanceKeyLess {
+  bool operator()(const NodeDef* a, const NodeDef* b) const {
+    AttrSlice a_attrs = AttrSlice(*a);
+    AttrSlice b_attrs = AttrSlice(*b);
+    int32 a_key = -1;
+    int32 b_key = -1;
+    Status s = GetNodeAttr(a_attrs, "instance_key", &a_key);
+    CHECK(s.ok());
+    s = GetNodeAttr(b_attrs, "instance_key", &b_key);
+    CHECK(s.ok());
+    return a_key < b_key;
+  }
+};
+
+struct NameLess {
+  bool operator()(const NodeDef* a, const NodeDef* b) const {
+    return a->name() < b->name();
+  }
+};
+
+bool IsCollectiveNode(const NodeDef& n) {
+  AttrSlice attrs = AttrSlice(n);
+  int key = -1;
+  if (!IsCollective(n)) return false;
+  Status s = GetNodeAttr(attrs, "instance_key", &key);
+  if (s.ok() && key >= 0) {
+    return true;
+  }
+  return false;
+}
+}  // namespace
+
+Status ScopedAllocatorOptimizer::OrderNodeSet(
+    std::vector<NodeDef*>* nodes) const {
+  // Nodes should be identical type.  Default order is by name but for
+  // collectives we order by increasing instance_key so each group gets
+  // the same instance_key.
+  if (nodes->size() <= 1) return Status::OK();
+  if (IsCollectiveNode(*nodes->at(0))) {
+    sort(nodes->begin(), nodes->end(), InstanceKeyLess());
+  } else {
+    sort(nodes->begin(), nodes->end(), NameLess());
+  }
+  return Status::OK();
+}
+
+}  // namespace grappler
+}  // namespace tensorflow
+
+#undef LOG_WARNING_AND_RETURN_IF_ERROR
diff --git a/tensorflow/core/grappler/optimizers/scoped_allocator_optimizer.h b/tensorflow/core/grappler/optimizers/scoped_allocator_optimizer.h
new file mode 100644
index 0000000000000000000000000000000000000000..ab4d444595f8b484d8c61f8ab0f1b9976ebfc0dc
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/scoped_allocator_optimizer.h
@@ -0,0 +1,107 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#ifndef TENSORFLOW_CORE_GRAPPLER_OPTIMIZERS_SCOPED_ALLOCATOR_OPTIMIZER_H_
+#define TENSORFLOW_CORE_GRAPPLER_OPTIMIZERS_SCOPED_ALLOCATOR_OPTIMIZER_H_
+
+#include <unordered_map>
+#include <unordered_set>
+#include <vector>
+#include "tensorflow/core/grappler/optimizers/graph_optimizer.h"
+#include "tensorflow/core/protobuf/rewriter_config.pb.h"
+
+namespace tensorflow {
+namespace grappler {
+class Graph;
+class GraphProperties;
+class NodeMap;
+class ScopedAllocatorOptimizer;
+
+// An Optimizer that introduces ScopedAllocators in order to reduce data
+// movement and consolidate some kinds of Ops.
+class ScopedAllocatorOptimizer : public GraphOptimizer {
+ public:
+  explicit ScopedAllocatorOptimizer(const ScopedAllocatorOptions& opts);
+  ~ScopedAllocatorOptimizer() override;
+
+  string name() const override { return "scoped_allocator_optimizer"; }
+
+  Status Optimize(Cluster* cluster, const GrapplerItem& item,
+                  GraphDef* optimized_graph) override;
+
+  void Feedback(Cluster* cluster, const GrapplerItem& item,
+                const GraphDef& optimized_graph, double result) override {}
+
+  // Map from an Op name to a vector of Nodes with that Op.
+  typedef std::unordered_map<string, std::vector<NodeDef*>> DevOpOccurrences;
+  // Map from a device name to a DevOpOccurrences map.
+  typedef std::unordered_map<string, DevOpOccurrences> GraphOpOccurrences;
+  typedef std::unordered_set<string> OpNameSet;
+
+  Status ProcessGraphDef(GraphDef* graph,
+                         const GraphProperties& graph_properties);
+
+  // Populates *occs by grouping Nodes with common Ops, according to
+  // their assigned devices.
+  void FindOpOccurrences(GraphDef* graph, const OpNameSet& op_names,
+                         GraphOpOccurrences* occs);
+
+  // Returns a new, unused scope_id to be assigned to a ScopedAllocator that
+  // will allocate num_fields (> 0) separate tensors.
+  int NewScopedAllocatorId(int num_fields);
+
+  NodeMap* node_map() { return node_map_.get(); }
+
+  // Appends values to the attr value under name in node_def, if present.
+  // If not present does an assignment.
+  static void ExtendNodeAttr(StringPiece name, const std::vector<int32>& values,
+                             NodeDef* node_def);
+
+  // Class that knows how to do graph rewriting for a particular kind of Op in
+  // order to take advantage of a ScopedAllocator.
+  class Rewriter {
+   public:
+    virtual ~Rewriter() {}
+
+    virtual Status Rewrite(ScopedAllocatorOptimizer* paopti, GraphDef* graph,
+                           const string& op_name,
+                           const std::vector<NodeDef*>& nodes,
+                           bool* applied) = 0;
+
+    void SetGraphProperties(const GraphProperties& graph_properties) {
+      graph_properties_ = &graph_properties;
+      CHECK(graph_properties_);
+    }
+
+   protected:
+    const GraphProperties* graph_properties_;
+  };
+
+ private:
+  Rewriter* GetRewriter(const string& op_name);
+
+  Status OrderNodeSet(std::vector<NodeDef*>* nodes) const;
+
+  RewriterConfig::Toggle opt_level_;
+  std::unordered_set<string> nodes_to_preserve_;
+  OpNameSet op_name_set_;
+  std::unordered_map<string, Rewriter*> rewriters_;
+  std::vector<Rewriter*> to_delete_;
+  int next_sa_id_ = 1;
+  std::unique_ptr<NodeMap> node_map_;
+};
+
+}  // namespace grappler
+}  // namespace tensorflow
+#endif  // TENSORFLOW_CORE_GRAPPLER_OPTIMIZERS_SCOPED_ALLOCATOR_OPTIMIZER_H_
diff --git a/tensorflow/core/grappler/optimizers/scoped_allocator_optimizer_test.cc b/tensorflow/core/grappler/optimizers/scoped_allocator_optimizer_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..3a2859dc5f0f57fc14de85152ca8ced5a6865758
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/scoped_allocator_optimizer_test.cc
@@ -0,0 +1,243 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include "tensorflow/core/grappler/optimizers/scoped_allocator_optimizer.h"
+
+#include <unordered_set>
+
+#include "tensorflow/cc/ops/standard_ops.h"
+#include "tensorflow/core/framework/node_def.pb.h"
+#include "tensorflow/core/framework/tensor_shape.pb.h"
+#include "tensorflow/core/framework/tensor_testutil.h"
+#include "tensorflow/core/graph/testlib.h"
+#include "tensorflow/core/grappler/grappler_item.h"
+#include "tensorflow/core/grappler/utils.h"
+#include "tensorflow/core/lib/core/status_test_util.h"
+#include "tensorflow/core/lib/strings/strcat.h"
+#include "tensorflow/core/platform/test.h"
+#include "tensorflow/core/protobuf/config.pb.h"
+#include "tensorflow/core/public/session.h"
+#include "tensorflow/core/public/session_options.h"
+
+namespace tensorflow {
+namespace grappler {
+namespace {
+
+class ScopedAllocatorOptimizerTest : public ::testing::Test {
+ public:
+  std::unique_ptr<Session> CreateSession(const GraphDef& graph,
+                                         const ConfigProto& config) {
+    SessionOptions options;
+    options.config = config;
+    (*options.config.mutable_device_count())["CPU"] = 2;
+    Session* session = NewSession(options);
+    TF_CHECK_OK(session->Create(graph));
+    return std::unique_ptr<Session>(session);
+  }
+
+  std::vector<Tensor> EvaluateNodes(const GraphDef& graph,
+                                    const std::vector<string>& fetch) {
+    SessionOptions options;
+    std::unique_ptr<tensorflow::Session> session(NewSession(options));
+    TF_CHECK_OK(session->Create(graph));
+    RunOptions run_options;
+    std::vector<Tensor> output_tensors;
+    TF_CHECK_OK(
+        session->Run(run_options, {}, fetch, fetch, &output_tensors, nullptr));
+    TF_CHECK_OK(session->Close());
+    return output_tensors;
+  }
+
+  // Constructs the following graph.
+  // (Flow is top to bottom, like nature intends.)
+  //
+  // The intended optimization is to have s1 and s2 allocate from
+  // an new ScopedAllocator, then replace a1 and a2 with a3 that
+  // reads from the backing buffer.
+  /*
+        a    b    c
+         \  / \  /
+          s1   s2
+          |    |
+          a1   a2
+          |    |
+          r1   r2
+  */
+  void BuildAbsGraph(GraphDef* graph_def) {
+    tensorflow::Scope s = tensorflow::Scope::NewRootScope();
+    s = s.WithDevice("/job:localhost/replica:0/task:0/device:CPU:0");
+
+    Output a =
+        ops::Const<float>(s.WithOpName("a"), {1.0, 0.0, 0.0, -1.0}, {2, 2});
+    Output b =
+        ops::Const<float>(s.WithOpName("b"), {1.0, -2.0, 3.0, 4.0}, {2, 2});
+    Output c =
+        ops::Const<float>(s.WithOpName("c"), {-5.0, -2.0, 0.0, -2.0}, {2, 2});
+    Output s1 = ops::Add(s.WithOpName("s1"), a, b);
+    Output s2 = ops::Add(s.WithOpName("s2"), b, c);
+    Output a1 = ops::Abs(s.WithOpName("a1"), s1);
+    Output a2 = ops::Abs(s.WithOpName("a2"), s2);
+    Output r1 = ops::Reshape(s.WithOpName("r1"), a1, {1, 4});
+    Output r2 = ops::Reshape(s.WithOpName("r2"), a2, {4, 1});
+    TF_CHECK_OK(s.ToGraphDef(graph_def));
+  }
+
+  void SetShapes(GraphDef* graph_def) {
+    TensorShapeProto shape_proto;
+    shape_proto.add_dim()->set_size(2);
+    shape_proto.add_dim()->set_size(2);
+
+    for (NodeDef& n : *graph_def->mutable_node()) {
+      if (n.op() == "Add" || n.op() == "Abs") {
+        AddNodeAttr("_output_shapes", {shape_proto}, &n);
+      }
+    }
+  }
+};
+
+TEST_F(ScopedAllocatorOptimizerTest, UnaryRewriteOnly) {
+  // Tests that Rewrite of program with parallel unary Ops is done as
+  // anticipated.
+  GrapplerItem item;
+  BuildAbsGraph(&item.graph);
+  SetShapes(&item.graph);
+
+  ScopedAllocatorOptions opts;
+  opts.add_enable_op("Abs");
+  ScopedAllocatorOptimizer sao(opts);
+  ScopedAllocatorOptimizer::OpNameSet ons;
+  ons.insert("Abs");
+
+  GraphDef optimized_graph;
+  TF_ASSERT_OK(sao.Optimize(nullptr /*cluster*/, item, &optimized_graph));
+
+  // Examine the resulting graph def.
+  NodeMap node_map(&optimized_graph);
+  NodeDef* nd = node_map.GetNode("scoped_allocator_1");
+  ASSERT_TRUE(nd);
+  {
+    auto& nd_set = node_map.GetOutputs(nd->name());
+    ASSERT_EQ(3, nd_set.size());
+    std::unordered_set<string> expected = {"scoped_allocator_concat_1", "s1",
+                                           "s2"};
+    for (auto it : nd_set) {
+      ASSERT_NE(expected.find(it->name()), expected.end())
+          << "Failed to find " << it->name();
+    }
+  }
+  {
+    auto& nd_set = node_map.GetOutputs("scoped_allocator_concat_1");
+    ASSERT_EQ(1, nd_set.size());
+    for (auto it : nd_set) {
+      ASSERT_EQ("scoped_allocator_1_Abs", it->name());
+    }
+  }
+  {
+    auto& nd_set = node_map.GetOutputs("scoped_allocator_1_Abs");
+    ASSERT_EQ(1, nd_set.size());
+    for (auto it : nd_set) {
+      ASSERT_EQ("scoped_allocator_split_1", it->name());
+    }
+  }
+  {
+    auto& nd_set = node_map.GetOutputs("scoped_allocator_split_1");
+    ASSERT_EQ(2, nd_set.size());
+    std::unordered_set<string> name_set;
+    for (auto it : nd_set) {
+      name_set.insert(it->name());
+    }
+    ASSERT_TRUE(name_set.find("r1") != name_set.end());
+    ASSERT_TRUE(name_set.find("r2") != name_set.end());
+  }
+}
+
+TEST_F(ScopedAllocatorOptimizerTest, UnaryExecute) {
+  // Constructs the same graph as UnaryRewriteOnly, but actually executes it.
+  GrapplerItem item;
+  BuildAbsGraph(&item.graph);
+
+  // Turn off all optimization except the ScopedAllocatorOptimizer
+  // to avoid anything that would alter the expected graph input/output,
+  // e.g. by constant folding away all calculations.
+  ConfigProto config;
+  GraphOptions* gopt = config.mutable_graph_options();
+  OptimizerOptions* opts = gopt->mutable_optimizer_options();
+  opts->set_do_common_subexpression_elimination(false);
+  opts->set_do_constant_folding(false);
+  opts->set_do_function_inlining(false);
+  opts->set_opt_level(OptimizerOptions::L0);
+  RewriterConfig* rwcfg = gopt->mutable_rewrite_options();
+  rwcfg->clear_optimizers();
+  (*rwcfg->add_optimizers()) = "scoped_allocator";
+  rwcfg->mutable_scoped_allocator_opts()->add_enable_op("Abs");
+  std::unique_ptr<Session> session(CreateSession(item.graph, config));
+
+  std::vector<std::pair<string, Tensor>> inputs;
+
+  // Request two targets: one fetch output and one non-fetched output.
+  std::vector<string> output_names = {"r1:0", "r2:0",
+                                      "scoped_allocator_1_Abs:0"};
+  std::vector<string> target_nodes = {};
+  std::vector<Tensor> outputs;
+  Status s = session->Run(inputs, output_names, target_nodes, &outputs);
+  TF_ASSERT_OK(s);
+  ASSERT_EQ(outputs.size(), 3);
+  std::vector<float> expected_r1({2, 2, 3, 3});
+  std::vector<float> expected_r2({4, 4, 3, 2});
+  // a + b == 2, -2, 3, 3
+  // b + c == -4, -4, 3, 2
+  for (int oi = 0; oi < outputs.size(); ++oi) {
+    for (int i = 0; i < outputs[oi].NumElements(); ++i) {
+      VLOG(0) << "output vec " << oi << " index " << i << " = "
+              << outputs[oi].flat<float>()(i);
+    }
+    if (oi == 0) {
+      ASSERT_EQ(expected_r1.size(), outputs[oi].NumElements());
+      for (int i = 0; i < expected_r1.size(); ++i) {
+        EXPECT_EQ(expected_r1[i], outputs[oi].flat<float>()(i));
+      }
+    } else if (oi == 1) {
+      ASSERT_EQ(expected_r2.size(), outputs[oi].NumElements());
+      for (int i = 0; i < expected_r2.size(); ++i) {
+        EXPECT_EQ(expected_r2[i], outputs[oi].flat<float>()(i));
+      }
+    }
+  }
+}
+
+// Tests static ScopedAllocatorOptimizer::ExtendNodeAttr.
+// Maybe this should be moved elsewhere?
+TEST_F(ScopedAllocatorOptimizerTest, Extend) {
+  NodeDef nd;
+  ScopedAllocatorOptimizer::ExtendNodeAttr("_scoped_allocator", {0, 2}, &nd);
+  ScopedAllocatorOptimizer::ExtendNodeAttr("_scoped_allocator", {6, 7}, &nd);
+  ScopedAllocatorOptimizer::ExtendNodeAttr("_scoped_allocator", {2, 3}, &nd);
+  VLOG(0) << "nd: " << nd.DebugString();
+  std::vector<int> scoped_allocator_attrs;
+  AttrSlice slice(nd);
+  Status sa_status =
+      GetNodeAttr(slice, "_scoped_allocator", &scoped_allocator_attrs);
+  for (int i : scoped_allocator_attrs) {
+    VLOG(0) << "extracted: " << i;
+  }
+  NodeDef nd2;
+  AddNodeAttr("_scoped_allocator", {0, 2}, &nd2);
+  AddNodeAttr("_scoped_allocator", {6, 7}, &nd2);
+  AddNodeAttr("_scoped_allocator", {2, 3}, &nd2);
+  VLOG(0) << "nd2: " << nd2.DebugString();
+}
+
+}  // namespace
+}  // namespace grappler
+}  // namespace tensorflow
diff --git a/tensorflow/core/grappler/utils/functions.cc b/tensorflow/core/grappler/utils/functions.cc
index 5a5dc47fa06626a5363cf9106f41728f98389a54..d64cb497154d354ddb8479cbc1e8764238917d98 100644
--- a/tensorflow/core/grappler/utils/functions.cc
+++ b/tensorflow/core/grappler/utils/functions.cc
@@ -22,6 +22,7 @@ limitations under the License.
 #include "tensorflow/core/framework/graph_def_util.h"
 #include "tensorflow/core/framework/node_def.pb.h"
 #include "tensorflow/core/framework/op.h"
+#include "tensorflow/core/framework/tensor_shape.pb.h"
 #include "tensorflow/core/framework/types.pb.h"
 #include "tensorflow/core/grappler/op_types.h"
 #include "tensorflow/core/grappler/utils.h"
@@ -525,7 +526,9 @@ Status MakeGrapplerFunctionItem(const FunctionDef& func,
     NodeDef* placeholder = function_body.add_node();
     placeholder->set_name(input.name());
     placeholder->set_op("Placeholder");
-    (*placeholder->mutable_attr())["T"].set_type(input_data_type);
+    (*placeholder->mutable_attr())["dtype"].set_type(input_data_type);
+    (*placeholder->mutable_attr())["shape"].mutable_shape()->set_unknown_rank(
+        true);
 
     InputArgExpansion input_expansion{/*input_name=*/input.name(),
                                       /*data_type=*/input_data_type,
diff --git a/tensorflow/core/grappler/utils/functions_test.cc b/tensorflow/core/grappler/utils/functions_test.cc
index 302f02dd392b704ea676f86fafa8b65c1f211c69..8c3cc70351ad5c3bc53f85be91806a9a55e7872d 100644
--- a/tensorflow/core/grappler/utils/functions_test.cc
+++ b/tensorflow/core/grappler/utils/functions_test.cc
@@ -256,7 +256,7 @@ TEST_F(FunctionsTest, FromSimpleFunctionDef) {
   for (const NodeDef &node : item.function_body().node()) {
     if (node.name() == "x" && count++) {
       EXPECT_EQ("Placeholder", node.op());
-      EXPECT_EQ(DT_FLOAT, node.attr().at("T").type());
+      EXPECT_EQ(DT_FLOAT, node.attr().at("dtype").type());
       EXPECT_EQ(0, node.input_size());
     } else if (node.name() == "two" && count++) {
       EXPECT_EQ("Const", node.op());
@@ -333,7 +333,7 @@ TEST_F(FunctionsTest, FromFunctionDefWithMultiOutputNodes) {
     if (node.name() == "x" || node.name() == "y" || node.name() == "dz") {
       count++;
       EXPECT_EQ("Placeholder", node.op());
-      EXPECT_EQ(DT_FLOAT, node.attr().at("T").type());
+      EXPECT_EQ(DT_FLOAT, node.attr().at("dtype").type());
       EXPECT_EQ(0, node.input_size());
     } else if (node.name() == "rx" && count++) {
       EXPECT_EQ("BroadcastGradientArgs", node.op());
@@ -402,7 +402,7 @@ TEST_F(FunctionsTest, FromFunctionDefWithNestedFuncs) {
     if (node.name() == "x" || node.name() == "y") {
       count++;
       EXPECT_EQ("Placeholder", node.op());
-      EXPECT_EQ(DT_FLOAT, node.attr().at("T").type());
+      EXPECT_EQ(DT_FLOAT, node.attr().at("dtype").type());
       EXPECT_EQ(0, node.input_size());
     } else if (node.name() == "a0" && count++) {
       EXPECT_EQ("Swap", node.op());
@@ -465,7 +465,7 @@ TEST_F(FunctionsTest, FromFunctionDefWithOutputMappings) {
   for (const NodeDef &node : item.function_body().node()) {
     if (node.name() == "in" && count++) {
       EXPECT_EQ("Placeholder", node.op());
-      EXPECT_EQ(DT_FLOAT, node.attr().at("T").type());
+      EXPECT_EQ(DT_FLOAT, node.attr().at("dtype").type());
       EXPECT_EQ(0, node.input_size());
     } else if (node.name() == "Linear_func" && count++) {
       EXPECT_EQ("Identity", node.op());
@@ -517,9 +517,9 @@ TEST_F(FunctionsTest, FromFunctionDefWithInputForwarding) {
     count++;
     EXPECT_EQ("Placeholder", node.op());
     if (node.name() == "arg3") {
-      EXPECT_EQ(DT_INT32, node.attr().at("T").type());
+      EXPECT_EQ(DT_INT32, node.attr().at("dtype").type());
     } else {
-      EXPECT_EQ(DT_FLOAT, node.attr().at("T").type());
+      EXPECT_EQ(DT_FLOAT, node.attr().at("dtype").type());
     }
   }
   EXPECT_EQ(5, count);
diff --git a/tensorflow/core/kernels/BUILD b/tensorflow/core/kernels/BUILD
index 1bf6eafb583df2d0e5655c754282f709fa213822..6487cd39712fa16fab6d7305343cc66b89276435 100644
--- a/tensorflow/core/kernels/BUILD
+++ b/tensorflow/core/kernels/BUILD
@@ -1968,6 +1968,7 @@ tf_kernel_library(
 tf_kernel_library(
     name = "resource_variable_ops",
     srcs = ["resource_variable_ops.cc"],
+    hdrs = ["resource_variable_ops.h"],
     deps = [
         ":bounds_check",
         ":dense_update_functor",
@@ -2811,6 +2812,9 @@ tf_kernel_library(
     srcs = [] + if_mkl([
         "mkl_batch_matmul_op.cc",
     ]),
+    # Override EIGEN_STRONG_INLINE to inline when --define=override_eigen_strong_inline=true,
+    # to avoid long compiling time. See https://github.com/tensorflow/tensorflow/issues/10521
+    copts = if_override_eigen_strong_inline(["/DEIGEN_STRONG_INLINE=inline"]),
     prefix = "batch_matmul_op",
     deps = MATH_DEPS + if_mkl([
         "//third_party/mkl:intel_binary_blob",
@@ -2878,6 +2882,9 @@ tf_kernel_library(
         "mkl_matmul_op.cc",
     ]),
     hdrs = ["matmul_op.h"],
+    # Override EIGEN_STRONG_INLINE to inline when --define=override_eigen_strong_inline=true,
+    # to avoid long compiling time. See https://github.com/tensorflow/tensorflow/issues/10521
+    copts = if_override_eigen_strong_inline(["/DEIGEN_STRONG_INLINE=inline"]),
     defines = select({
         ":xsmm": [
             "TENSORFLOW_USE_LIBXSMM",
@@ -3247,8 +3254,7 @@ tf_kernel_library(
         "//conditions:default": [],
     }),
     # Override EIGEN_STRONG_INLINE to inline when --define=override_eigen_strong_inline=true,
-    # So that it doesn't take 20 minutes to compile conv_grad_ops_3d.cc and conv_ops_3d.cc
-    # on Windows. See https://github.com/tensorflow/tensorflow/issues/10521
+    # to avoid long compiling time. See https://github.com/tensorflow/tensorflow/issues/10521
     copts = if_override_eigen_strong_inline(["/DEIGEN_STRONG_INLINE=inline"]),
     defines = select({
         ":xsmm_convolutions": [
@@ -3299,7 +3305,7 @@ tf_kernel_library(
         "//tensorflow/core:nn_ops_op_lib",
     ] + if_cuda([
         "@cub_archive//:cub",
-        "@local_config_cuda//cuda:cudnn",
+        "@local_config_cuda//cuda:cudnn_header",
     ]),
 )
 
@@ -3318,7 +3324,7 @@ tf_kernel_library(
         "//tensorflow/core:lib",
         "//tensorflow/core:nn_ops_op_lib",
     ] + if_cuda([
-        "@local_config_cuda//cuda:cudnn",
+        "@local_config_cuda//cuda:cudnn_header",
     ]),
 )
 
@@ -3375,7 +3381,10 @@ tf_kernel_library(
 tf_kernel_library(
     name = "bias_op",
     prefix = "bias_op",
-    deps = NN_DEPS,
+    deps = NN_DEPS + if_cuda([
+        ":reduction_ops",
+        "@cub_archive//:cub",
+    ]),
 )
 
 tf_kernel_library(
@@ -3394,6 +3403,9 @@ tf_kernel_library(
 
 tf_kernel_library(
     name = "lrn_op",
+    # Override EIGEN_STRONG_INLINE to inline when --define=override_eigen_strong_inline=true,
+    # to avoid long compiling time. See https://github.com/tensorflow/tensorflow/issues/10521
+    copts = if_override_eigen_strong_inline(["/DEIGEN_STRONG_INLINE=inline"]),
     prefix = "lrn_op",
     deps = NN_DEPS,
 )
@@ -5178,9 +5190,9 @@ filegroup(
             "partitioned_function_ops.cc",
             # Excluded due to experimental status:
             "debug_ops.*",
-            "scatter_nd_op*",
             "mutex_ops.*",
             "batch_kernels.*",
+            "regex_full_match_op.cc",
             "regex_replace_op.cc",
         ],
     ),
@@ -6169,7 +6181,7 @@ cc_library(
 tf_kernel_library(
     name = "dataset_ops",
     deps = [
-        "//tensorflow/core/kernels/data:dataset_ops",
+        "//tensorflow/core/kernels/data",
     ],
 )
 
diff --git a/tensorflow/core/kernels/as_string_op.cc b/tensorflow/core/kernels/as_string_op.cc
index 66c4aff3e3376955d8e71c2dae66d0d2d91778c4..a7757d13617c42e9a422569a921a7d3c147e41b9 100644
--- a/tensorflow/core/kernels/as_string_op.cc
+++ b/tensorflow/core/kernels/as_string_op.cc
@@ -73,6 +73,7 @@ class AsStringOp : public OpKernel {
     }
     switch (dtype) {
       case DT_INT8:
+      case DT_INT16:
       case DT_INT32:
         strings::Appendf(&format_, "d");
         break;
@@ -129,6 +130,7 @@ class AsStringOp : public OpKernel {
       ENCODE_TYPE(DT_FLOAT, float, format_);
       ENCODE_TYPE(DT_DOUBLE, double, format_);
       ENCODE_TYPE(DT_INT8, int8, format_);
+      ENCODE_TYPE(DT_INT16, int16, format_);
       case (DT_BOOL): {
         const auto& input_flat = input_tensor->flat<bool>();
         for (int i = 0; i < input_flat.size(); ++i) {
diff --git a/tensorflow/core/kernels/batch_kernels.cc b/tensorflow/core/kernels/batch_kernels.cc
index 8c99ded0a89e8065f4a7112db3b14eb2b27010c1..35ddda0ec04da6f3b6f11606ecb019e38698c6d7 100644
--- a/tensorflow/core/kernels/batch_kernels.cc
+++ b/tensorflow/core/kernels/batch_kernels.cc
@@ -13,6 +13,7 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
+#include "tensorflow/core/framework/function.h"
 #include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/framework/register_types.h"
 #include "tensorflow/core/framework/resource_mgr.h"
@@ -24,6 +25,7 @@ limitations under the License.
 #include "tensorflow/core/kernels/concat_lib.h"
 #include "tensorflow/core/kernels/ops_util.h"
 #include "tensorflow/core/kernels/split_lib.h"
+#include "tensorflow/core/lib/gtl/cleanup.h"
 #include "tensorflow/core/lib/random/random.h"
 #include "tensorflow/core/platform/macros.h"
 
@@ -41,7 +43,7 @@ typedef Eigen::SyclDevice SYCLDevice;
 // ensure proper device placement.
 template <typename T>
 Status Concat(OpKernelContext* context, const gtl::ArraySlice<Tensor>& inputs,
-              int output_index) {
+              Tensor* output) {
   const int input_dims = inputs[0].dims();
   const TensorShape& input_shape = inputs[0].shape();
 
@@ -76,9 +78,8 @@ Status Concat(OpKernelContext* context, const gtl::ArraySlice<Tensor>& inputs,
 
   TensorShape output_shape(input_shape);
   output_shape.set_dim(0, output_dim0);
-  Tensor* output = nullptr;
   TF_RETURN_IF_ERROR(
-      context->allocate_output(output_index, output_shape, &output));
+      context->allocate_temp(DataTypeToEnum<T>::value, output_shape, output));
   if (output->NumElements() > 0) {
     auto output_flat = output->shaped<T, 2>({1, output->NumElements()});
 #if GOOGLE_CUDA
@@ -209,6 +210,7 @@ class BatchResource : public ResourceBase {
   static Status Create(int32 num_batch_threads, int32 max_batch_size,
                        int32 batch_timeout_micros, int32 max_enqueued_batches,
                        const std::vector<int32>& allowed_batch_sizes,
+                       FunctionLibraryRuntime::Handle fhandle,
                        std::unique_ptr<BatchResource>* resource) {
     std::unique_ptr<BatchResource> new_resource(new BatchResource);
 
@@ -225,6 +227,8 @@ class BatchResource : public ResourceBase {
 
     new_resource->allowed_batch_sizes_ = allowed_batch_sizes;
 
+    new_resource->fhandle_ = fhandle;
+
     *resource = std::move(new_resource);
     return Status::OK();
   }
@@ -254,6 +258,14 @@ class BatchResource : public ResourceBase {
       }
       batch_components->inputs.push_back(tensor);
     }
+    OpInputList captured_tensors;
+    const auto captured_status =
+        context->input_list("captured_tensors", &captured_tensors);
+    if (captured_status.ok()) {
+      for (const Tensor& captured_tensor : captured_tensors) {
+        batch_components->captured_inputs.push_back(captured_tensor);
+      }
+    }
     batch_components->context = context;
     batch_components->done_callback = std::move(done_callback);
 
@@ -272,6 +284,7 @@ class BatchResource : public ResourceBase {
     int64 guid;
 
     std::vector<Tensor> inputs;
+    std::vector<Tensor> captured_inputs;
     OpKernelContext* context;
     AsyncOpKernel::DoneCallback done_callback;
 
@@ -314,50 +327,32 @@ class BatchResource : public ResourceBase {
     return batch_size;
   }
 
-  // Processes a batch of one or more BatchTask entries.
-  void ProcessBatch(std::unique_ptr<Batch> batch) const {
-    if (batch->empty()) {
-      return;
+  Status ConcatInputTensors(const Batch& batch, OpKernelContext* context,
+                            std::vector<Tensor>* concatenated_tensors) const {
+    if (batch.num_tasks() == 0) {
+      return errors::InvalidArgument("Empty batch.");
     }
-    const int padded_batch_size = RoundToLowestAllowedBatchSize(batch->size());
-    const int padding_amount = padded_batch_size - batch->size();
 
-    OpKernelContext* last_task_context =
-        batch->task(batch->num_tasks() - 1).context;
-    AsyncOpKernel::DoneCallback last_task_callback =
-        batch->task(batch->num_tasks() - 1).done_callback;
-
-    OP_REQUIRES_OK_ASYNC(last_task_context, ValidateBatch(*batch),
-                         last_task_callback);
+    const int padded_batch_size = RoundToLowestAllowedBatchSize(batch.size());
+    const int padding_amount = padded_batch_size - batch.size();
 
     // All tasks should have the same number of input edges.
-    const int num_input_edges = batch->task(0).inputs.size();
-
-    // Process each input edge one at a time (the typical case has just one).
-    for (int i = 0; i < num_input_edges; ++i) {
-      // Emit batch->num_tasks() - 1 empty output tensors.
-      for (int task_idx = 0; task_idx < batch->num_tasks() - 1; ++task_idx) {
-        const BatchTask& task = batch->task(task_idx);
-        TensorShape output_shape(task.inputs.at(i).shape());
-        output_shape.set_dim(0, 0);
-        Tensor* output = nullptr;
-        OP_REQUIRES_OK_ASYNC(
-            task.context,
-            task.context->allocate_output(i, output_shape, &output),
-            task.done_callback);
-      }
+    const int num_inputs = batch.task(0).inputs.size();
+    concatenated_tensors->reserve(num_inputs);
 
+    // Process each input one at a time (the typical case has just one).
+    for (int i = 0; i < num_inputs; ++i) {
       // Concatenate the tasks ith input tensors into a big output tensor.
       std::vector<Tensor> to_concatenate;
-      to_concatenate.reserve(batch->num_tasks());
-      for (int task_idx = 0; task_idx < batch->num_tasks(); ++task_idx) {
-        to_concatenate.push_back(batch->task(task_idx).inputs.at(i));
+      to_concatenate.reserve(batch.num_tasks());
+      for (int task_idx = 0; task_idx < batch.num_tasks(); ++task_idx) {
+        to_concatenate.push_back(batch.task(task_idx).inputs.at(i));
       }
 
       // Add padding as needed. Use the first row of the first task's tensor as
       // the data for padding.
       if (padding_amount > 0) {
-        const Tensor& padding_source = batch->task(0).inputs.at(i);
+        const Tensor& padding_source = batch.task(0).inputs.at(i);
         Tensor padding;
         if (padding_source.shape().dim_size(0) == 1) {
           padding = padding_source;
@@ -367,10 +362,10 @@ class BatchResource : public ResourceBase {
           Status slice_status;
           std::vector<Tensor> slices;
           switch (type) {
-#define CASE(type)                                                   \
-  case DataTypeToEnum<type>::value:                                  \
-    slice_status = SplitCPU<type>(last_task_context, padding_source, \
-                                  slice_sizes, &slices);             \
+#define CASE(type)                                                     \
+  case DataTypeToEnum<type>::value:                                    \
+    slice_status =                                                     \
+        SplitCPU<type>(context, padding_source, slice_sizes, &slices); \
     break;
             TF_CALL_ALL_TYPES(CASE);
 #undef CASE
@@ -379,8 +374,7 @@ class BatchResource : public ResourceBase {
                   errors::InvalidArgument("Unsupported data type: ", type);
               break;
           }
-          OP_REQUIRES_OK_ASYNC(last_task_context, slice_status,
-                               last_task_callback);
+          TF_RETURN_IF_ERROR(slice_status);
           padding = slices.at(0);
         }
         for (int i = 0; i < padding_amount; ++i) {
@@ -390,10 +384,12 @@ class BatchResource : public ResourceBase {
 
       const DataType type = to_concatenate[0].dtype();
       Status concat_status;
+      Tensor concatenated_tensor;
       switch (type) {
-#define CASE(type)                                                      \
-  case DataTypeToEnum<type>::value:                                     \
-    concat_status = Concat<type>(last_task_context, to_concatenate, i); \
+#define CASE(type)                                                   \
+  case DataTypeToEnum<type>::value:                                  \
+    concat_status =                                                  \
+        Concat<type>(context, to_concatenate, &concatenated_tensor); \
     break;
         TF_CALL_ALL_TYPES(CASE);
 #undef CASE
@@ -402,10 +398,197 @@ class BatchResource : public ResourceBase {
               errors::InvalidArgument("Unsupported data type: ", type);
           break;
       }
-      OP_REQUIRES_OK_ASYNC(last_task_context, concat_status,
-                           last_task_callback);
+      TF_RETURN_IF_ERROR(concat_status);
+      concatenated_tensors->push_back(concatenated_tensor);
+    }
+    return Status::OK();
+  }
+
+  Status SplitOutputTensors(const std::vector<Tensor>& combined_outputs,
+                            Batch* batch) const {
+    DCHECK_GE(batch->num_tasks(), 1);
+    if (batch->num_tasks() < 1) {
+      return errors::Internal("Batch size expected to be positive; was ",
+                              batch->num_tasks());
+    }
+
+    std::vector<int64> task_sizes_plus_optional_padding;
+    task_sizes_plus_optional_padding.reserve(batch->num_tasks());
+    for (int i = 0; i < batch->num_tasks(); ++i) {
+      task_sizes_plus_optional_padding.push_back(batch->task(i).size());
+    }
+    const int padding_size =
+        RoundToLowestAllowedBatchSize(batch->size()) - batch->size();
+    if (padding_size > 0) {
+      task_sizes_plus_optional_padding.push_back(padding_size);
+    }
+
+    // For each output tensor name, a divided-up tensor with one entry per task.
+    std::map<string, std::vector<Tensor>> split_tensors;
+
+    DCHECK_EQ(batch->task(0).context->num_outputs(), combined_outputs.size());
+    if (combined_outputs.size() != batch->task(0).context->num_outputs()) {
+      return errors::Internal("Wrong number of batched output tensors");
+    }
+
+    // Generate 'split_tensors' and populate the context outputs.
+    for (int i = 0; i < combined_outputs.size(); ++i) {
+      const Tensor& output_tensor = combined_outputs[i];
+      if (output_tensor.shape().dims() == 0) {
+        return errors::FailedPrecondition(
+            "Batched output tensor has 0 dimensions");
+      }
+      if (output_tensor.shape().dim_size(0) != batch->size() + padding_size) {
+        return errors::FailedPrecondition(
+            "Batched output tensor's 0th dimension does not equal the sum of "
+            "the 0th dimension sizes of the input tensors");
+      }
+
+      std::vector<Tensor> split_tensor;
+      const Status split_status = tensor::Split(
+          output_tensor, task_sizes_plus_optional_padding, &split_tensor);
+      DCHECK(split_status.ok()) << split_status.ToString();
+      if (!split_status.ok()) {
+        return errors::Internal("Tensor split operation failed: ",
+                                split_status.ToString());
+      }
+      DCHECK_EQ(split_tensor.size(), task_sizes_plus_optional_padding.size());
+      if (split_tensor.size() != task_sizes_plus_optional_padding.size()) {
+        return errors::Internal(
+            "Tensor split operation did not work as expected; got ",
+            split_tensor.size(), " splits; expected ",
+            task_sizes_plus_optional_padding.size());
+      }
+
+      for (int j = 0; j < batch->num_tasks(); ++j) {
+        BatchTask& task = *(batch->mutable_task(j));
+        task.context->set_output(i, split_tensor.at(j));
+      }  // (Ignore a possible final split_tensors entry containing the
+         // padding.)
+    }
+
+    return Status::OK();
+  }
+
+  void ProcessFuncBatch(std::unique_ptr<Batch> batch) const {
+    if (batch->empty()) {
+      return;
+    }
+
+    OpKernelContext* last_task_context =
+        batch->task(batch->num_tasks() - 1).context;
+
+    // Regardless of the outcome, we need to propagate the status to the
+    // individual tasks and signal that they are done. We use MakeCleanup() to
+    // ensure that this happens no matter how we exit the method below.
+    Status status;
+    bool cleanup_done = false;
+    auto cleanup_fn = [&cleanup_done, &batch](const Status& status) {
+      if (cleanup_done) {
+        return;
+      }
+      for (int i = 0; i < batch->num_tasks(); ++i) {
+        batch->mutable_task(i)->context->SetStatus(status);
+        batch->mutable_task(i)->done_callback();
+      }
+      cleanup_done = true;
+    };
+    auto finally =
+        gtl::MakeCleanup([&cleanup_fn, &status] { cleanup_fn(status); });
+
+    status = ValidateBatch(*batch);
+    if (!status.ok()) {
+      return;
     }
 
+    std::vector<Tensor> concatenated_tensors;
+    status =
+        ConcatInputTensors(*batch, last_task_context, &concatenated_tensors);
+    if (!status.ok()) {
+      return;
+    }
+    FunctionLibraryRuntime::Options opts;
+    opts.step_id = last_task_context->step_id();
+    opts.step_container = last_task_context->step_container();
+    opts.cancellation_manager = last_task_context->cancellation_manager();
+    opts.stats_collector = last_task_context->stats_collector();
+    opts.rendezvous = last_task_context->rendezvous();
+    opts.runner = last_task_context->runner();
+
+    auto* flib = last_task_context->function_library();
+    std::vector<Tensor> combined_outputs;
+    Notification done;
+    std::vector<Tensor> args(concatenated_tensors.begin(),
+                             concatenated_tensors.end());
+    const auto& captured_inputs =
+        batch->task(batch->num_tasks() - 1).captured_inputs;
+    args.insert(args.end(), captured_inputs.begin(), captured_inputs.end());
+
+    // Releases the cleanup method here, because the callback of the function
+    // library runtime will handle it now.
+    finally.release();
+    flib->Run(
+        opts, fhandle_, args, &combined_outputs, [&](const Status& run_status) {
+          Status final_status;
+          auto run_finally = gtl::MakeCleanup([&]() {
+            // We do the cleanup here as an optimization, so that it runs in
+            // the underlying TF inter-op threadpool. Running it in the
+            // threadpool, let's the ensuing ops be scheduled faster,
+            // because the executor will add them to the front of the
+            // threadpool's task queue rather than the end.
+            cleanup_fn(final_status);
+            done.Notify();
+          });
+          final_status = run_status;
+          if (!final_status.ok()) {
+            return;
+          }
+          final_status = SplitOutputTensors(combined_outputs, batch.get());
+        });
+    // By waiting for the notification we are ensuring that this thread isn't
+    // used for processing other batches, which gives the batches time to
+    // coalesce upstream. So overall the number of batches going through the
+    // devices goes down, improving latency and throughput in most cases.
+    done.WaitForNotification();
+  }
+
+  // Processes a batch of one or more BatchTask entries.
+  void ProcessBatch(std::unique_ptr<Batch> batch) const {
+    if (batch->empty()) {
+      return;
+    }
+
+    OpKernelContext* last_task_context =
+        batch->task(batch->num_tasks() - 1).context;
+    AsyncOpKernel::DoneCallback last_task_callback =
+        batch->task(batch->num_tasks() - 1).done_callback;
+
+    OP_REQUIRES_OK_ASYNC(last_task_context, ValidateBatch(*batch),
+                         last_task_callback);
+
+    // All tasks should have the same number of input edges.
+    const int num_input_edges = batch->task(0).inputs.size();
+    std::vector<Tensor> concatenated_tensors;
+    const Status concat_status =
+        ConcatInputTensors(*batch, last_task_context, &concatenated_tensors);
+    OP_REQUIRES_OK_ASYNC(last_task_context, concat_status, last_task_callback);
+
+    // Process each input edge one at a time (the typical case has just one).
+    for (int i = 0; i < num_input_edges; ++i) {
+      last_task_context->set_output(i, concatenated_tensors.at(i));
+
+      // Emit batch->num_tasks() - 1 empty output tensors.
+      for (int task_idx = 0; task_idx < batch->num_tasks() - 1; ++task_idx) {
+        const BatchTask& task = batch->task(task_idx);
+        TensorShape output_shape(task.inputs.at(i).shape());
+        output_shape.set_dim(0, 0);
+        Tensor* output = nullptr;
+        OP_REQUIRES_OK_ASYNC(
+            task.context,
+            task.context->allocate_output(i, output_shape, &output),
+            task.done_callback);
+      }
+    }
     // Emit batch->num_tasks() - 1 empty index tensors.
     for (int task_idx = 0; task_idx < batch->num_tasks() - 1; ++task_idx) {
       const BatchTask& task = batch->task(task_idx);
@@ -463,7 +646,7 @@ class BatchResource : public ResourceBase {
     return Status::OK();
   }
 
-  // Looks up the batcher queue for 'queue_name'. If it didn't previously exist,
+  // Looks up the batcher queue for 'queue_name'. If it did't previously exist,
   // creates it.
   Status LookupOrCreateBatcherQueue(const string& queue_name,
                                     BatcherQueue** queue) {
@@ -477,7 +660,11 @@ class BatchResource : public ResourceBase {
 
     std::unique_ptr<BatcherQueue> new_queue;
     auto process_batch_callback = [this](std::unique_ptr<Batch> batch) {
-      ProcessBatch(std::move(batch));
+      if (fhandle_ == kInvalidHandle) {
+        ProcessBatch(std::move(batch));
+      } else {
+        ProcessFuncBatch(std::move(batch));
+      }
     };
     TF_RETURN_IF_ERROR(batcher_->AddQueue(batcher_queue_options_,
                                           process_batch_callback, &new_queue));
@@ -498,8 +685,99 @@ class BatchResource : public ResourceBase {
       GUARDED_BY(batcher_queues_mu_);
 
   std::vector<int32> allowed_batch_sizes_;
+  FunctionLibraryRuntime::Handle fhandle_;
 };
 
+class BatchFunctionKernel : public AsyncOpKernel {
+ public:
+  explicit BatchFunctionKernel(OpKernelConstruction* c) : AsyncOpKernel(c) {
+    OP_REQUIRES_OK(c, c->GetAttr("container", &container_));
+    OP_REQUIRES_OK(c, c->GetAttr("shared_name", &shared_name_));
+    // If shared_name is not supplied, use name instead (prevent collisions by
+    // default).
+    if (shared_name_.empty()) {
+      shared_name_ = name();
+    }
+    OP_REQUIRES_OK(c, c->GetAttr("batching_queue", &batcher_queue_));
+    OP_REQUIRES_OK(c, c->GetAttr("num_batch_threads", &num_batch_threads_));
+    OP_REQUIRES_OK(c, c->GetAttr("max_batch_size", &max_batch_size_));
+    OP_REQUIRES_OK(c,
+                   c->GetAttr("batch_timeout_micros", &batch_timeout_micros_));
+    OP_REQUIRES_OK(c,
+                   c->GetAttr("max_enqueued_batches", &max_enqueued_batches_));
+    OP_REQUIRES_OK(c, c->GetAttr("allowed_batch_sizes", &allowed_batch_sizes_));
+    OP_REQUIRES_OK(c, ValidateAllowedBatchSizes());
+
+    auto lib = c->function_library();
+    OP_REQUIRES(c, lib != nullptr, errors::Internal("No function library"));
+    NameAttrList func;
+    OP_REQUIRES_OK(c, c->GetAttr("f", &func));
+    OP_REQUIRES_OK(
+        c, lib->Instantiate(func.name(), AttrSlice(&func.attr()), &fhandle_));
+  }
+
+  bool IsExpensive() override { return false; }
+
+  void ComputeAsync(OpKernelContext* c, DoneCallback done) final {
+    BatchResource* br;
+    std::function<Status(BatchResource * *r)> creator = [this,
+                                                         c](BatchResource** r) {
+      std::unique_ptr<BatchResource> new_resource;
+      TF_RETURN_IF_ERROR(
+          BatchResource::Create(num_batch_threads_, max_batch_size_,
+                                batch_timeout_micros_, max_enqueued_batches_,
+                                allowed_batch_sizes_, fhandle_, &new_resource));
+      *r = new_resource.release();
+      return Status::OK();
+    };
+    OP_REQUIRES_OK_ASYNC(c,
+                         c->resource_manager()->LookupOrCreate(
+                             container_, shared_name_, &br, creator),
+                         done);
+    const Status status =
+        br->RegisterInput(random::New64(), c, batcher_queue_, done);
+    br->Unref();
+    OP_REQUIRES_OK_ASYNC(c, status, done);
+    // Assume br calls done, so nothing to do here.
+  }
+
+  // Validates 'allowed_batch_sizes_'. The entries must increase monotonically,
+  // and the last one must equal 'max_batch_size_'.
+  Status ValidateAllowedBatchSizes() const {
+    if (allowed_batch_sizes_.empty()) {
+      return Status::OK();
+    }
+    int32 last_size = 0;
+    for (size_t i = 0; i < allowed_batch_sizes_.size(); ++i) {
+      const int32 size = allowed_batch_sizes_.at(i);
+      if (i > 0 && size <= last_size) {
+        return errors::InvalidArgument(
+            "allowed_batch_sizes entries must be monotonically increasing");
+      }
+      if (i == allowed_batch_sizes_.size() - 1 && size != max_batch_size_) {
+        return errors::InvalidArgument(
+            "final entry in allowed_batch_sizes must equal max_batch_size");
+      }
+      last_size = size;
+    }
+    return Status::OK();
+  }
+
+ private:
+  string container_;
+  string shared_name_;
+  string batcher_queue_;
+  int32 num_batch_threads_;
+  int32 max_batch_size_;
+  int32 batch_timeout_micros_;
+  int32 max_enqueued_batches_;
+  std::vector<int32> allowed_batch_sizes_;
+  FunctionLibraryRuntime::Handle fhandle_;
+};
+
+REGISTER_KERNEL_BUILDER(Name("BatchFunction").Device(DEVICE_CPU),
+                        BatchFunctionKernel);
+
 class BatchKernel : public AsyncOpKernel {
  public:
   explicit BatchKernel(OpKernelConstruction* c) : AsyncOpKernel(c) {
@@ -528,7 +806,8 @@ class BatchKernel : public AsyncOpKernel {
           std::unique_ptr<BatchResource> new_resource;
           TF_RETURN_IF_ERROR(BatchResource::Create(
               num_batch_threads_, max_batch_size_, batch_timeout_micros_,
-              max_enqueued_batches_, allowed_batch_sizes_, &new_resource));
+              max_enqueued_batches_, allowed_batch_sizes_, kInvalidHandle,
+              &new_resource));
           *r = new_resource.release();
           return Status::OK();
         };
@@ -539,9 +818,7 @@ class BatchKernel : public AsyncOpKernel {
     const Status status =
         br->RegisterInput(random::New64(), c, batcher_queue_, done);
     br->Unref();
-    if (!status.ok()) {
-      OP_REQUIRES_OK_ASYNC(c, status, done);
-    }
+    OP_REQUIRES_OK_ASYNC(c, status, done);
     // Assume br calls done, so nothing to do here.
   }
 
@@ -800,9 +1077,7 @@ class UnbatchKernel : public AsyncOpKernel {
                          done);
     auto status = ubr->Compute(c, done);
     ubr->Unref();
-    if (!status.ok()) {
-      OP_REQUIRES_OK_ASYNC(c, status, done);
-    }
+    OP_REQUIRES_OK_ASYNC(c, status, done);
     // Assume ubr calls done, so nothing to do here.
   }
 
@@ -840,10 +1115,12 @@ class UnbatchGradResource : public ResourceBase {
     }
 
     const DataType type = tensors[0].dtype();
+    Tensor concatenated_tensor;
     switch (type) {
-#define CASE(type)                                         \
-  case DataTypeToEnum<type>::value:                        \
-    TF_RETURN_IF_ERROR(Concat<type>(context, tensors, 0)); \
+#define CASE(type)                                                            \
+  case DataTypeToEnum<type>::value:                                           \
+    TF_RETURN_IF_ERROR(Concat<type>(context, tensors, &concatenated_tensor)); \
+    context->set_output(0, concatenated_tensor);                              \
     break;
       TF_CALL_ALL_TYPES(CASE);
 #undef CASE
@@ -986,9 +1263,7 @@ class UnbatchGradKernel : public AsyncOpKernel {
                          done);
     Status status = ubr->Compute(c, done);
     ubr->Unref();
-    if (!status.ok()) {
-      OP_REQUIRES_OK_ASYNC(c, status, done);
-    }
+    OP_REQUIRES_OK_ASYNC(c, status, done);
     // Assume ubr calls done, so nothing to do here.
   }
 
diff --git a/tensorflow/core/kernels/batch_matmul_op_complex.cc b/tensorflow/core/kernels/batch_matmul_op_complex.cc
index 96216764fd46971db47b6a11be622cef63e5d103..b77c80c01fa09c51c4a054bde83b0c2263cd0b68 100644
--- a/tensorflow/core/kernels/batch_matmul_op_complex.cc
+++ b/tensorflow/core/kernels/batch_matmul_op_complex.cc
@@ -17,7 +17,7 @@ limitations under the License.
 
 namespace tensorflow {
 
-#if !defined(INTEL_MKL)
+#if !defined(INTEL_MKL) || defined(DO_NOT_USE_ML)
 TF_CALL_complex64(REGISTER_BATCH_MATMUL_CPU);
 TF_CALL_complex128(REGISTER_BATCH_MATMUL_CPU);
 #endif
diff --git a/tensorflow/core/kernels/batch_matmul_op_real.cc b/tensorflow/core/kernels/batch_matmul_op_real.cc
index 87a0795f2fd6b401f1d0151ab05c551b96f3e509..fe259c1634294c62c37ba4a51b0a207d975f8a4f 100644
--- a/tensorflow/core/kernels/batch_matmul_op_real.cc
+++ b/tensorflow/core/kernels/batch_matmul_op_real.cc
@@ -21,7 +21,7 @@ limitations under the License.
 
 namespace tensorflow {
 
-#if !defined(INTEL_MKL)
+#if !defined(INTEL_MKL) || defined(DO_NOT_USE_ML)
 TF_CALL_float(REGISTER_BATCH_MATMUL_CPU);
 TF_CALL_double(REGISTER_BATCH_MATMUL_CPU);
 #endif
diff --git a/tensorflow/core/kernels/batching_util/BUILD b/tensorflow/core/kernels/batching_util/BUILD
index de05c647d6bfc80a0368ee3edba8f31bccff33f9..e292ff200a2f2f7a251ca082d7b253714de9151b 100644
--- a/tensorflow/core/kernels/batching_util/BUILD
+++ b/tensorflow/core/kernels/batching_util/BUILD
@@ -126,6 +126,27 @@ tf_cc_test(
     ],
 )
 
+cc_library(
+    name = "serial_device_batch_scheduler",
+    hdrs = ["serial_device_batch_scheduler.h"],
+    deps = [
+        ":batch_scheduler",
+        "//tensorflow/core:lib",
+    ],
+)
+
+tf_cc_test(
+    name = "serial_device_batch_scheduler_test",
+    srcs = ["serial_device_batch_scheduler_test.cc"],
+    deps = [
+        ":fake_clock_env",
+        ":serial_device_batch_scheduler",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
+    ],
+)
+
 cc_library(
     name = "basic_batch_scheduler",
     hdrs = ["basic_batch_scheduler.h"],
diff --git a/tensorflow/core/kernels/batching_util/serial_device_batch_scheduler.h b/tensorflow/core/kernels/batching_util/serial_device_batch_scheduler.h
new file mode 100644
index 0000000000000000000000000000000000000000..518f2ff8a939ae47dc5bb70f7dc59348a638869b
--- /dev/null
+++ b/tensorflow/core/kernels/batching_util/serial_device_batch_scheduler.h
@@ -0,0 +1,548 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#ifndef TENSORFLOW_CORE_KERNELS_BATCHING_UTIL_SERIAL_DEVICE_BATCH_SCHEDULER_H_
+#define TENSORFLOW_CORE_KERNELS_BATCHING_UTIL_SERIAL_DEVICE_BATCH_SCHEDULER_H_
+
+#include <algorithm>
+#include <functional>
+#include <memory>
+#include <random>
+#include <unordered_map>
+#include <vector>
+
+#include "tensorflow/core/kernels/batching_util/batch_scheduler.h"
+#include "tensorflow/core/lib/core/errors.h"
+#include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/lib/core/threadpool.h"
+#include "tensorflow/core/platform/cpu_info.h"
+#include "tensorflow/core/platform/env.h"
+#include "tensorflow/core/platform/thread_annotations.h"
+#include "tensorflow/core/platform/types.h"
+
+namespace tensorflow {
+namespace serving {
+namespace internal {
+template <typename TaskType>
+class SDBSBatch;
+
+template <typename TaskType>
+class SDBSQueue;
+}  // namespace internal
+
+// EXPERIMENTAL: API MAY BE SUBJECTED TO SUDDEN CHANGES.
+//
+// Shared batch scheduler designed for batches which are processed by a serial
+// device (e.g. GPU, TPU). When batch processing involves a mix of
+// parallelizable cpu work and non-parallelizable on-device work, overall
+// latency can be minimized by producing batches at a (load dependent) rate
+// which keeps the serial device uniformly busy.
+//
+// SerialDeviceBatchScheduler (SDBS) controls the batching rate by limiting the
+// allowed number of concurrently processed batches. Too large a limit causes
+// batches to pile up behind the serial device, adding to the overall batch
+// latency. Too small a limit underutilizes the serial device and harms latency
+// by forcing batches to wait longer to be processed. Feedback from the device
+// (i.e. avg number of batches directly pending on the device) is used to set
+// the correct limit.
+//
+// SDBS groups requests into per model batches which are processed when a batch
+// processing thread becomes available. SDBS prioritizes batches primarily by
+// age (i.e. the batch's oldest request) along with a configurable preference
+// for scheduling larger batches first.
+
+
+template <typename TaskType>
+class SerialDeviceBatchScheduler : public std::enable_shared_from_this<
+                                       SerialDeviceBatchScheduler<TaskType>> {
+ public:
+  ~SerialDeviceBatchScheduler();
+
+  struct Options {
+    // The name to use for the pool of batch threads.
+    string thread_pool_name = {"batch_threads"};
+    // Maximum number of batch processing threads.
+    int64 num_batch_threads = port::NumSchedulableCPUs();
+    // Although batch selection is primarily based on age, this parameter
+    // specifies a preference for larger batches.  A full batch will be
+    // scheduled before an older, nearly empty batch as long as the age gap is
+    // less than full_batch_scheduling_boost_micros.  The optimal value for this
+    // parameter should be of order the batch processing latency, but must be
+    // chosen carefully, as too large a value will harm tail latency.
+    int64 full_batch_scheduling_boost_micros = 0;
+    // The environment to use (typically only overridden by test code).
+    Env* env = Env::Default();
+    // Initial limit for number of batches being concurrently processed.
+    int64 initial_in_flight_batches_limit = 3;
+    // Returns the current number of batches directly waiting to be processed
+    // by the serial device (i.e. GPU, TPU).
+    std::function<int64()> get_pending_on_serial_device;
+    // Desired average number of batches directly waiting to be processed by the
+    // serial device. Small numbers of O(1) should deliver the best latency.
+    double target_pending = 2;
+    // Number of batches between potential adjustments of
+    // in_flight_batches_limit.  Larger numbers will reduce noise, but will be
+    // less responsive to sudden changes in workload.
+    int64 batches_to_average_over = 1000;
+  };
+
+  // Ownership is shared between the caller of Create() and any queues created
+  // via AddQueue().
+  static Status Create(
+      const Options& options,
+      std::shared_ptr<SerialDeviceBatchScheduler<TaskType>>* scheduler);
+
+  struct QueueOptions {
+    // Maximum size of each batch.
+    int max_batch_size = 1000;
+    // Maximum number of enqueued (i.e. non-scheduled) batches.
+    int max_enqueued_batches = 10;
+  };
+
+  using BatchProcessor = std::function<void(std::unique_ptr<Batch<TaskType>>)>;
+
+  // Adds queue (and its callback) to be managed by this scheduler.
+  Status AddQueue(const QueueOptions& options,
+                  BatchProcessor process_batch_callback,
+                  std::unique_ptr<BatchScheduler<TaskType>>* queue);
+
+  double in_flight_batches_limit() {
+    mutex_lock l(mu_);
+    return in_flight_batches_limit_;
+  }
+
+  double recent_low_traffic_ratio() {
+    mutex_lock l(mu_);
+    return recent_low_traffic_ratio_;
+  }
+
+ private:
+  // access to AddBatch(), RemoveQueue(), env().
+  friend class internal::SDBSQueue<TaskType>;
+
+  explicit SerialDeviceBatchScheduler(const Options& options);
+
+  // Continuously retrieves and processes batches.
+  void ProcessBatches();
+
+  // Notifies scheduler of non-empty batch which is eligible for processing.
+  void AddBatch(const internal::SDBSBatch<TaskType>* batch);
+
+  // Removes queue from scheduler.
+  void RemoveQueue(const internal::SDBSQueue<TaskType>* queue);
+
+  Env* env() const { return options_.env; }
+
+  const Options options_;
+
+  // Collection of batches added by AddBatch. Owned by scheduler until they are
+  // released for processing.
+  std::vector<const internal::SDBSBatch<TaskType>*> batches_ GUARDED_BY(mu_);
+
+  // Unowned queues and callbacks added by AddQueue.
+  std::unordered_map<const internal::SDBSQueue<TaskType>*, BatchProcessor>
+      queues_and_callbacks_ GUARDED_BY(mu_);
+
+  // Responsible for running the batch processing callbacks.
+  std::unique_ptr<thread::ThreadPool> batch_thread_pool_;
+
+  // Limit on number of batches which can be concurrently processed.
+  int64 in_flight_batches_limit_ GUARDED_BY(mu_);
+
+  // Number of batch processing threads.
+  int64 processing_threads_ GUARDED_BY(mu_) = 0;
+
+  // Number of batches processed since the last in_flight_batches_limit_
+  // adjustment.
+  int64 batch_count_ GUARDED_BY(mu_) = 0;
+
+  // Number of times since the last in_flight_batches_limit_ adjustment when a
+  // processing thread was available but there were no batches to process.
+  int64 no_batch_count_ GUARDED_BY(mu_) = 0;
+
+  // Sum of batches pending on the serial device since the last
+  // in_flight_batches_limit_ adjustment.
+  int64 pending_sum_ = 0;
+
+  // Sum of batch latencies since the last in_flight_batches_limit_ adjustment.
+  int64 batch_latency_sum_ = 0;
+
+  // Average period between which two consecutive batches begin processing.
+  int64 batch_period_micros_ = 0;
+
+  // Moving average tracking the fraction of recent in_flight_batches_limit_
+  // adjustments where the external traffic was not high enough to provide
+  // useful feedback for an adjustment.
+  double recent_low_traffic_ratio_ = 0;
+
+  mutex mu_;
+
+  TF_DISALLOW_COPY_AND_ASSIGN(SerialDeviceBatchScheduler);
+};
+
+//////////////////////////////////////////////////////////
+// Implementation details follow. API users need not read.
+
+namespace internal {
+// Consolidates tasks into batches, passing them off to the
+// SerialDeviceBatchScheduler for processing.
+template <typename TaskType>
+class SDBSQueue : public BatchScheduler<TaskType> {
+ public:
+  using QueueOptions =
+      typename SerialDeviceBatchScheduler<TaskType>::QueueOptions;
+
+  SDBSQueue(std::shared_ptr<SerialDeviceBatchScheduler<TaskType>> scheduler,
+            const QueueOptions& options);
+
+  ~SDBSQueue() override;
+
+  // Adds task to current batch. Fails if the task size is larger than the batch
+  // size or if the current batch is full and this queue's number of outstanding
+  // batches is at its maximum.
+  Status Schedule(std::unique_ptr<TaskType>* task) override;
+
+  // Number of tasks waiting to be scheduled.
+  size_t NumEnqueuedTasks() const override;
+
+  // Number of size 1 tasks which could currently be scheduled without failing.
+  size_t SchedulingCapacity() const override;
+
+  // Notifies queue that a batch is about to be scheduled; the queue should not
+  // place any more tasks in this batch.
+  void ReleaseBatch(const SDBSBatch<TaskType>* batch);
+
+  size_t max_task_size() const override { return options_.max_batch_size; }
+
+ private:
+  std::shared_ptr<SerialDeviceBatchScheduler<TaskType>> scheduler_;
+  const QueueOptions options_;
+  // Owned by scheduler_.
+  SDBSBatch<TaskType>* current_batch_ GUARDED_BY(mu_) = nullptr;
+  int64 num_enqueued_batches_ GUARDED_BY(mu_) = 0;
+  int64 num_enqueued_tasks_ GUARDED_BY(mu_) = 0;
+  mutable mutex mu_;
+  TF_DISALLOW_COPY_AND_ASSIGN(SDBSQueue);
+};
+
+// Batch which remembers when and by whom it was created.
+template <typename TaskType>
+class SDBSBatch : public Batch<TaskType> {
+ public:
+  SDBSBatch(SDBSQueue<TaskType>* queue, int64 creation_time_micros)
+      : queue_(queue), creation_time_micros_(creation_time_micros) {}
+
+  ~SDBSBatch() override {}
+
+  SDBSQueue<TaskType>* queue() const { return queue_; }
+
+  int64 creation_time_micros() const { return creation_time_micros_; }
+
+ private:
+  SDBSQueue<TaskType>* queue_;
+  const int64 creation_time_micros_;
+  TF_DISALLOW_COPY_AND_ASSIGN(SDBSBatch);
+};
+}  // namespace internal
+
+// ---------------- SerialDeviceBatchScheduler ----------------
+
+template <typename TaskType>
+Status SerialDeviceBatchScheduler<TaskType>::Create(
+    const Options& options,
+    std::shared_ptr<SerialDeviceBatchScheduler<TaskType>>* scheduler) {
+  if (options.num_batch_threads < 1) {
+    return errors::InvalidArgument("num_batch_threads must be positive; was ",
+                                   options.num_batch_threads);
+  }
+  if (options.initial_in_flight_batches_limit < 1) {
+    return errors::InvalidArgument(
+        "initial_in_flight_batches_limit must be positive; was ",
+        options.initial_in_flight_batches_limit);
+  }
+  if (options.initial_in_flight_batches_limit > options.num_batch_threads) {
+    return errors::InvalidArgument(
+        "initial_in_flight_batches_limit (",
+        options.initial_in_flight_batches_limit,
+        ") should not be larger than num_batch_threads (",
+        options.num_batch_threads, ")");
+  }
+  if (options.full_batch_scheduling_boost_micros < 0) {
+    return errors::InvalidArgument(
+        "full_batch_scheduling_boost_micros can't be negative; was ",
+        options.full_batch_scheduling_boost_micros);
+  }
+  if (options.batches_to_average_over < 1) {
+    return errors::InvalidArgument(
+        "batches_to_average_over should be "
+        "greater than or equal to 1; was ",
+        options.batches_to_average_over);
+  }
+  if (options.target_pending <= 0) {
+    return errors::InvalidArgument(
+        "target_pending should be larger than zero; was ",
+        options.target_pending);
+  }
+  if (!options.get_pending_on_serial_device) {
+    return errors::InvalidArgument(
+        "get_pending_on_serial_device must be "
+        "specified");
+  }
+  scheduler->reset(new SerialDeviceBatchScheduler<TaskType>(options));
+  return Status::OK();
+}
+
+template <typename TaskType>
+SerialDeviceBatchScheduler<TaskType>::SerialDeviceBatchScheduler(
+    const Options& options)
+    : options_(options),
+      in_flight_batches_limit_(options.initial_in_flight_batches_limit),
+      processing_threads_(options.initial_in_flight_batches_limit) {
+  batch_thread_pool_.reset(new thread::ThreadPool(
+      env(), options.thread_pool_name, options.num_batch_threads));
+  for (int i = 0; i < processing_threads_; i++) {
+    batch_thread_pool_->Schedule(
+        std::bind(&SerialDeviceBatchScheduler<TaskType>::ProcessBatches, this));
+  }
+}
+
+template <typename TaskType>
+SerialDeviceBatchScheduler<TaskType>::~SerialDeviceBatchScheduler() {
+  // Signal processing threads to exit.
+  {
+    mutex_lock l(mu_);
+    processing_threads_ = 0;
+  }
+  // Hangs until all threads finish.
+  batch_thread_pool_.reset();
+}
+
+template <typename TaskType>
+Status SerialDeviceBatchScheduler<TaskType>::AddQueue(
+    const QueueOptions& options, BatchProcessor process_batch_callback,
+    std::unique_ptr<BatchScheduler<TaskType>>* queue) {
+  if (options.max_batch_size <= 0) {
+    return errors::InvalidArgument("max_batch_size must be positive; was ",
+                                   options.max_batch_size);
+  }
+  if (options.max_enqueued_batches <= 0) {
+    return errors::InvalidArgument(
+        "max_enqueued_batches must be positive; was ",
+        options.max_enqueued_batches);
+  }
+  internal::SDBSQueue<TaskType>* SDBS_queue_raw;
+  queue->reset(SDBS_queue_raw = new internal::SDBSQueue<TaskType>(
+                   this->shared_from_this(), options));
+  mutex_lock l(mu_);
+  queues_and_callbacks_[SDBS_queue_raw] = process_batch_callback;
+  return Status::OK();
+}
+
+template <typename TaskType>
+void SerialDeviceBatchScheduler<TaskType>::AddBatch(
+    const internal::SDBSBatch<TaskType>* batch) {
+  mutex_lock l(mu_);
+  batches_.push_back(batch);
+}
+
+template <typename TaskType>
+void SerialDeviceBatchScheduler<TaskType>::RemoveQueue(
+    const internal::SDBSQueue<TaskType>* queue) {
+  mutex_lock l(mu_);
+  queues_and_callbacks_.erase(queue);
+}
+
+template <typename TaskType>
+void SerialDeviceBatchScheduler<TaskType>::ProcessBatches() {
+  const int64 kIdleThreadSleepTimeMicros = 1000;
+  const double kMaxNoBatchRatio = .1;
+  const double kLowTrafficMovingAverageFactor = .1;
+  for (;;) {
+    mu_.lock();
+    if (processing_threads_ < 1 ||
+        processing_threads_ > in_flight_batches_limit_) {
+      processing_threads_--;
+      mu_.unlock();
+      break;
+    }
+    if (batches_.empty()) {
+      no_batch_count_++;
+      int64 sleep_time = batch_period_micros_ ? batch_period_micros_
+                                              : kIdleThreadSleepTimeMicros;
+      mu_.unlock();
+      env()->SleepForMicroseconds(sleep_time);
+      continue;
+    }
+    auto best_it = batches_.begin();
+    double best_score =
+        (*best_it)->creation_time_micros() -
+        options_.full_batch_scheduling_boost_micros * (*best_it)->size() /
+            static_cast<double>((*best_it)->queue()->max_task_size());
+    for (auto it = batches_.begin() + 1; it != batches_.end(); it++) {
+      const double score =
+          (*it)->creation_time_micros() -
+          options_.full_batch_scheduling_boost_micros * (*it)->size() /
+              static_cast<double>((*it)->queue()->max_task_size());
+      if (score < best_score) {
+        best_score = score;
+        best_it = it;
+      }
+    }
+    const internal::SDBSBatch<TaskType>* batch = *best_it;
+    batches_.erase(best_it);
+    // Queue may destroy itself after ReleaseBatch is called.
+    batch->queue()->ReleaseBatch(batch);
+    auto callback = queues_and_callbacks_[batch->queue()];
+    mu_.unlock();
+    int64 start_time = env()->NowMicros();
+    callback(std::unique_ptr<Batch<TaskType>>(
+        const_cast<internal::SDBSBatch<TaskType>*>(batch)));
+    int64 end_time = env()->NowMicros();
+    mu_.lock();
+    batch_count_++;
+    batch_latency_sum_ += end_time - start_time;
+    pending_sum_ += options_.get_pending_on_serial_device();
+    if (batch_count_ == options_.batches_to_average_over) {
+      recent_low_traffic_ratio_ *= (1 - kLowTrafficMovingAverageFactor);
+      // Only adjust in_flight_batches_limit_ if external load is large enough
+      // to consistently provide batches. Otherwise we would (mistakenly) assume
+      // that the device is underutilized because in_flight_batches_limit_ is
+      // too small.
+      if (no_batch_count_ < kMaxNoBatchRatio * batch_count_) {
+        double avg_pending = pending_sum_ / static_cast<double>(batch_count_);
+        // Avg processing time / # of concurrent batches gives the avg period
+        // between which two consecutive batches begin processing. Used to set a
+        // reasonable sleep time for idle batch processing threads.
+        batch_period_micros_ =
+            batch_latency_sum_ / batch_count_ / in_flight_batches_limit_;
+        // When the processing pipeline is consistently busy, the average number
+        // of pending batches differs from in_flight_batches_limit_ by a
+        // load-dependent offset. Adjust in_flight_batches_limit_to maintain
+        // the desired target pending.
+        in_flight_batches_limit_ +=
+            std::round(options_.target_pending - avg_pending);
+        in_flight_batches_limit_ = std::max(in_flight_batches_limit_, 1LL);
+        in_flight_batches_limit_ =
+            std::min(in_flight_batches_limit_, options_.num_batch_threads);
+        // Add extra processing threads if necessary.
+        if (processing_threads_ > 0 &&
+            processing_threads_ < in_flight_batches_limit_) {
+          int extra_threads = in_flight_batches_limit_ - processing_threads_;
+          for (int i = 0; i < extra_threads; i++) {
+            batch_thread_pool_->Schedule(std::bind(
+                &SerialDeviceBatchScheduler<TaskType>::ProcessBatches, this));
+          }
+          processing_threads_ = in_flight_batches_limit_;
+        }
+      } else {
+        recent_low_traffic_ratio_ += kLowTrafficMovingAverageFactor;
+      }
+      batch_count_ = 0;
+      no_batch_count_ = 0;
+      pending_sum_ = 0;
+      batch_latency_sum_ = 0;
+    }
+    mu_.unlock();
+  }
+}
+
+// ---------------- SDBSQueue ----------------
+
+namespace internal {
+template <typename TaskType>
+SDBSQueue<TaskType>::SDBSQueue(
+    std::shared_ptr<SerialDeviceBatchScheduler<TaskType>> scheduler,
+    const QueueOptions& options)
+    : scheduler_(scheduler), options_(options) {}
+
+template <typename TaskType>
+SDBSQueue<TaskType>::~SDBSQueue() {
+  // Wait until last batch has been scheduled.
+  const int kSleepMicros = 1000;
+  for (;;) {
+    {
+      mutex_lock l(mu_);
+      if (num_enqueued_batches_ == 0) {
+        break;
+      }
+    }
+    scheduler_->env()->SleepForMicroseconds(kSleepMicros);
+  }
+  scheduler_->RemoveQueue(this);
+}
+
+template <typename TaskType>
+Status SDBSQueue<TaskType>::Schedule(std::unique_ptr<TaskType>* task) {
+  SDBSBatch<TaskType>* new_batch = nullptr;
+  size_t size = (*task)->size();
+  if (size > options_.max_batch_size) {
+    return errors::InvalidArgument("Task size ", size,
+                                   " is larger than maximum batch size ",
+                                   options_.max_batch_size);
+  }
+  {
+    mutex_lock l(mu_);
+    // Current batch is full, create another if allowed.
+    if (current_batch_ &&
+        current_batch_->size() + size > options_.max_batch_size) {
+      if (num_enqueued_batches_ >= options_.max_enqueued_batches) {
+        return errors::Unavailable("The batch scheduling queue is full");
+      }
+      current_batch_->Close();
+      current_batch_ = nullptr;
+    }
+    if (!current_batch_) {
+      num_enqueued_batches_++;
+      current_batch_ = new_batch =
+          new SDBSBatch<TaskType>(this, scheduler_->env()->NowMicros());
+    }
+    current_batch_->AddTask(std::move(*task));
+    num_enqueued_tasks_++;
+  }
+  // AddBatch must be called outside of lock, since it may call ReleaseBatch.
+  if (new_batch != nullptr) scheduler_->AddBatch(new_batch);
+  return Status::OK();
+}
+
+template <typename TaskType>
+void SDBSQueue<TaskType>::ReleaseBatch(const SDBSBatch<TaskType>* batch) {
+  mutex_lock l(mu_);
+  num_enqueued_batches_--;
+  num_enqueued_tasks_ -= batch->num_tasks();
+  if (batch == current_batch_) {
+    current_batch_->Close();
+    current_batch_ = nullptr;
+  }
+}
+
+template <typename TaskType>
+size_t SDBSQueue<TaskType>::NumEnqueuedTasks() const {
+  mutex_lock l(mu_);
+  return num_enqueued_tasks_;
+}
+
+template <typename TaskType>
+size_t SDBSQueue<TaskType>::SchedulingCapacity() const {
+  mutex_lock l(mu_);
+  const int current_batch_capacity =
+      current_batch_ ? options_.max_batch_size - current_batch_->size() : 0;
+  const int spare_batches =
+      options_.max_enqueued_batches - num_enqueued_batches_;
+  return spare_batches * options_.max_batch_size + current_batch_capacity;
+}
+}  // namespace internal
+}  // namespace serving
+}  // namespace tensorflow
+
+#endif  // TENSORFLOW_CORE_KERNELS_BATCHING_UTIL_SERIAL_DEVICE_BATCH_SCHEDULER_H_
diff --git a/tensorflow/core/kernels/batching_util/serial_device_batch_scheduler_test.cc b/tensorflow/core/kernels/batching_util/serial_device_batch_scheduler_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..a2f8f9a03ee3a0336a88ab588610c8fe78a296bb
--- /dev/null
+++ b/tensorflow/core/kernels/batching_util/serial_device_batch_scheduler_test.cc
@@ -0,0 +1,394 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/kernels/batching_util/serial_device_batch_scheduler.h"
+
+#include "tensorflow/core/kernels/batching_util/fake_clock_env.h"
+#include "tensorflow/core/lib/core/notification.h"
+#include "tensorflow/core/lib/core/status_test_util.h"
+#include "tensorflow/core/platform/macros.h"
+#include "tensorflow/core/platform/test.h"
+
+namespace tensorflow {
+namespace serving {
+namespace anonymous {
+
+class FakeTask : public BatchTask {
+ public:
+  explicit FakeTask(size_t size) : size_(size) {}
+
+  ~FakeTask() override = default;
+
+  size_t size() const override { return size_; }
+
+ private:
+  const size_t size_;
+
+  TF_DISALLOW_COPY_AND_ASSIGN(FakeTask);
+};
+
+// Creates a FakeTask of size 'task_size', and calls 'scheduler->Schedule()' on
+// that task. Returns the resulting status.
+Status ScheduleTask(size_t task_size, BatchScheduler<FakeTask>* scheduler) {
+  std::unique_ptr<FakeTask> task(new FakeTask(task_size));
+  Status status = scheduler->Schedule(&task);
+  // Schedule() should have consumed 'task' iff it returned Status::OK.
+  CHECK_EQ(status.ok(), task == nullptr);
+  return status;
+}
+
+// Creates a thread that waits on 'start' and then advances the fake clock in
+// 'env' in a loop until 'stop' is notified. Useful for allowing objects that
+// use the clock to be destroyed.
+std::unique_ptr<Thread> CreateFakeClockAdvancerThread(
+    test_util::FakeClockEnv* env, Notification* start, Notification* stop) {
+  return std::unique_ptr<Thread>(Env::Default()->StartThread(
+      {}, "FakeClockAdvancerThread", [env, start, stop] {
+        start->WaitForNotification();
+        while (!stop->HasBeenNotified()) {
+          env->AdvanceByMicroseconds(10);
+          Env::Default()->SleepForMicroseconds(10);
+        }
+      }));
+}
+
+TEST(SerialDeviceBatchSchedulerTest, BadOptions) {
+  using Scheduler = SerialDeviceBatchScheduler<FakeTask>;
+  std::shared_ptr<Scheduler> scheduler;
+  Scheduler::Options default_options;
+  default_options.get_pending_on_serial_device = []() { return 0; };
+  Scheduler::Options options = default_options;
+  options.num_batch_threads = 0;
+  EXPECT_FALSE(Scheduler::Create(options, &scheduler).ok());
+  options = default_options;
+  options.initial_in_flight_batches_limit = 0;
+  EXPECT_FALSE(Scheduler::Create(options, &scheduler).ok());
+  options = default_options;
+  options.num_batch_threads = 5;
+  options.initial_in_flight_batches_limit = 8;
+  EXPECT_FALSE(Scheduler::Create(options, &scheduler).ok());
+  options = default_options;
+  options.batches_to_average_over = -5;
+  EXPECT_FALSE(Scheduler::Create(options, &scheduler).ok());
+  options = default_options;
+  options.target_pending = 0;
+  EXPECT_FALSE(Scheduler::Create(options, &scheduler).ok());
+  options = Scheduler::Options();
+  EXPECT_FALSE(Scheduler::Create(options, &scheduler).ok());
+}
+
+TEST(SerialDeviceBatchSchedulerTest, InFlightBatchesLimit) {
+  SerialDeviceBatchScheduler<FakeTask>::Options options;
+  options.num_batch_threads = 3;
+  options.initial_in_flight_batches_limit = 2;
+  options.batches_to_average_over = 1000;
+  options.get_pending_on_serial_device = []() { return 0; };
+  mutex mu;
+  int processed_batches = 0;
+  Notification finish_processing;
+  auto queue_callback = [&mu, &processed_batches, &finish_processing](
+                            std::unique_ptr<Batch<FakeTask>> batch) {
+    ASSERT_TRUE(batch->IsClosed());
+    EXPECT_GT(batch->num_tasks(), 0);
+    mu.lock();
+    int batch_num = ++processed_batches;
+    mu.unlock();
+    if (batch_num == 2) {
+      // Give third batch a chance to process if it's going to.
+      Env::Default()->SleepForMicroseconds(1000);
+      finish_processing.Notify();
+    }
+    if (batch_num == 3) {
+      ASSERT_TRUE(finish_processing.HasBeenNotified());
+    }
+    finish_processing.WaitForNotification();
+  };
+  std::shared_ptr<SerialDeviceBatchScheduler<FakeTask>> scheduler;
+  TF_ASSERT_OK(
+      SerialDeviceBatchScheduler<FakeTask>::Create(options, &scheduler));
+  std::unique_ptr<BatchScheduler<FakeTask>> queue1;
+  std::unique_ptr<BatchScheduler<FakeTask>> queue2;
+  std::unique_ptr<BatchScheduler<FakeTask>> queue3;
+  TF_ASSERT_OK(scheduler->AddQueue({}, queue_callback, &queue1));
+  TF_ASSERT_OK(scheduler->AddQueue({}, queue_callback, &queue2));
+  TF_ASSERT_OK(scheduler->AddQueue({}, queue_callback, &queue3));
+  // Create 3 batches, only 2 should be processed concurrently.
+  TF_ASSERT_OK(ScheduleTask(100, queue1.get()));
+  TF_ASSERT_OK(ScheduleTask(100, queue2.get()));
+  TF_ASSERT_OK(ScheduleTask(100, queue3.get()));
+}
+
+TEST(SerialDeviceBatchSchedulerTest, PendingOnSerialDevice) {
+  mutex mu;
+  int pending;
+  SerialDeviceBatchScheduler<FakeTask>::Options options;
+  options.num_batch_threads = 3;
+  options.initial_in_flight_batches_limit = 1;
+  options.batches_to_average_over = 1;
+  options.target_pending = 3;
+  options.get_pending_on_serial_device = [&mu, &pending]() {
+    mutex_lock l(mu);
+    return pending;
+  };
+  std::shared_ptr<SerialDeviceBatchScheduler<FakeTask>> scheduler;
+  TF_ASSERT_OK(
+      SerialDeviceBatchScheduler<FakeTask>::Create(options, &scheduler));
+  // Make sure batch processing thread has gone to sleep.
+  Env::Default()->SleepForMicroseconds(1000);
+  int processed_batches = 0;
+  Notification start_processing;
+  auto queue_callback = [&mu, &processed_batches, &start_processing, &pending,
+                         &scheduler](std::unique_ptr<Batch<FakeTask>> batch) {
+    // Be careful with mutex mu to avoid potential deadlock with mutex mu_
+    // held in ProcessBatch() and in_flight_batches_limit().
+    int batch_num;
+    {
+      mutex_lock l(mu);
+      batch_num = ++processed_batches;
+    }
+    switch (batch_num) {
+      case 1:
+        start_processing.WaitForNotification();
+        {
+          mutex_lock l(mu);
+          pending = 2;
+        }
+        break;
+      case 2:
+        // No batches initially --> low traffic --> no adjustment.
+        CHECK_EQ(scheduler->in_flight_batches_limit(), 1);
+        {
+          mutex_lock l(mu);
+          pending = 3;
+        }
+        break;
+      case 3:
+        // Pending at target --> no adjustment.
+        CHECK_EQ(scheduler->in_flight_batches_limit(), 1);
+        {
+          mutex_lock l(mu);
+          pending = 1;
+        }
+        break;
+      case 4:
+        // Small pending --> 2 additional threads added.
+        CHECK_EQ(scheduler->in_flight_batches_limit(), 3);
+        {
+          mutex_lock l(mu);
+          pending = 3;
+        }
+        break;
+      default:
+        break;
+    }
+  };
+  std::unique_ptr<BatchScheduler<FakeTask>> queue;
+  TF_ASSERT_OK(scheduler->AddQueue({}, queue_callback, &queue));
+  // Create 4 batches.
+  for (int i = 0; i < 4; i++) {
+    TF_ASSERT_OK(ScheduleTask(800, queue.get()));
+  }
+  start_processing.Notify();
+}
+
+TEST(SerialDeviceBatchSchedulerTest, FullBatchSchedulingBoostMicros) {
+  test_util::FakeClockEnv env(Env::Default());
+  Notification start_teardown, stop_teardown;
+  std::unique_ptr<Thread> teardown_thread =
+      CreateFakeClockAdvancerThread(&env, &start_teardown, &stop_teardown);
+  {
+    SerialDeviceBatchScheduler<FakeTask>::Options options;
+    options.env = &env;
+    options.initial_in_flight_batches_limit = 1;
+    options.batches_to_average_over = 1000;
+    options.full_batch_scheduling_boost_micros = 10;
+    options.get_pending_on_serial_device = []() { return 0; };
+    mutex mu;
+    int processed_batches = 0;
+    auto queue_callback =
+        [&mu, &processed_batches](std::unique_ptr<Batch<FakeTask>> batch) {
+          ASSERT_TRUE(batch->IsClosed());
+          mutex_lock l(mu);
+          processed_batches++;
+          switch (processed_batches) {
+            case 1:
+              EXPECT_EQ(1000, batch->size());
+              break;
+            case 2:
+              EXPECT_EQ(100, batch->size());
+              break;
+            case 3:
+              EXPECT_EQ(80, batch->size());
+              break;
+            default:
+              EXPECT_TRUE(false) << "Should only have 3 batches";
+          }
+        };
+    std::shared_ptr<SerialDeviceBatchScheduler<FakeTask>> scheduler;
+    TF_ASSERT_OK(
+        SerialDeviceBatchScheduler<FakeTask>::Create(options, &scheduler));
+    // Make sure batch processing thread has gone to sleep.
+    Env::Default()->SleepForMicroseconds(1000);
+    SerialDeviceBatchScheduler<FakeTask>::QueueOptions queue_options;
+    std::unique_ptr<BatchScheduler<FakeTask>> queue1;
+    std::unique_ptr<BatchScheduler<FakeTask>> queue2;
+    std::unique_ptr<BatchScheduler<FakeTask>> queue3;
+    queue_options.max_batch_size = 1000;
+    TF_ASSERT_OK(scheduler->AddQueue(queue_options, queue_callback, &queue1));
+    queue_options.max_batch_size = 1000;
+    TF_ASSERT_OK(scheduler->AddQueue(queue_options, queue_callback, &queue2));
+    queue_options.max_batch_size = 100;
+    TF_ASSERT_OK(scheduler->AddQueue(queue_options, queue_callback, &queue3));
+
+    TF_ASSERT_OK(ScheduleTask(100, queue1.get()));
+    // First batch - creation time: 0, fullness: 0.1, sched score: -1
+    env.AdvanceByMicroseconds(3);
+    TF_ASSERT_OK(ScheduleTask(1000, queue2.get()));
+    // Second batch - creation time: 3, fullness: 1, sched score: -7
+    env.AdvanceByMicroseconds(5);
+    TF_ASSERT_OK(ScheduleTask(80, queue3.get()));
+    // Third batch - creation time: 8, fullness: .8, sched score: 0
+    // Release the batch processing thread.
+    env.AdvanceByMicroseconds(1000);
+    start_teardown.Notify();
+  }
+  stop_teardown.Notify();
+}
+
+TEST(SerialDeviceBatchSchedulerTest, DeleteQueue) {
+  SerialDeviceBatchScheduler<FakeTask>::Options options;
+  options.initial_in_flight_batches_limit = 1;
+  options.batches_to_average_over = 1000;
+  options.get_pending_on_serial_device = []() { return 0; };
+  mutex mu;
+  int processed_batches = 0;
+  Notification finish_processing;
+  auto queue_callback = [&mu, &processed_batches, &finish_processing](
+                            std::unique_ptr<Batch<FakeTask>> batch) {
+    ASSERT_TRUE(batch->IsClosed());
+    EXPECT_GT(batch->num_tasks(), 0);
+    finish_processing.WaitForNotification();
+    mu.lock();
+    processed_batches++;
+    mu.unlock();
+  };
+  std::shared_ptr<SerialDeviceBatchScheduler<FakeTask>> scheduler;
+  TF_ASSERT_OK(
+      SerialDeviceBatchScheduler<FakeTask>::Create(options, &scheduler));
+  std::unique_ptr<BatchScheduler<FakeTask>> queue;
+  TF_ASSERT_OK(scheduler->AddQueue({}, queue_callback, &queue));
+
+  // Enqueue 2 tasks, should result in 2 batches.
+  for (int i = 0; i < 2; i++) {
+    TF_ASSERT_OK(ScheduleTask(800, queue.get()));
+  }
+  std::unique_ptr<Thread> queue_deleter(Env::Default()->StartThread(
+      {}, "QueueDeleterThread", [&queue, &mu, &processed_batches] {
+        // Delete queue, should be kept alive until empty.
+        queue.reset();
+        mutex_lock l(mu);
+        EXPECT_EQ(processed_batches, 2);
+      }));
+  // Give queue_deleter thread time to delete queue.
+  Env::Default()->SleepForMicroseconds(1000);
+  finish_processing.Notify();
+}
+
+TEST(SerialDeviceBatchSchedulerTest, DeleteScheduler) {
+  SerialDeviceBatchScheduler<FakeTask>::Options options;
+  options.initial_in_flight_batches_limit = 1;
+  options.batches_to_average_over = 1000;
+  options.get_pending_on_serial_device = []() { return 0; };
+  mutex mu;
+  int processed_batches = 0;
+  Notification start_processing;
+  Notification finish_processing;
+  auto queue_callback =
+      [&mu, &processed_batches, &start_processing,
+       &finish_processing](std::unique_ptr<Batch<FakeTask>> batch) {
+        ASSERT_TRUE(batch->IsClosed());
+        EXPECT_GT(batch->num_tasks(), 0);
+        start_processing.WaitForNotification();
+        mutex_lock l(mu);
+        processed_batches++;
+        if (processed_batches == 2) {
+          finish_processing.Notify();
+        }
+      };
+
+  std::shared_ptr<SerialDeviceBatchScheduler<FakeTask>> scheduler;
+  TF_ASSERT_OK(
+      SerialDeviceBatchScheduler<FakeTask>::Create(options, &scheduler));
+  std::unique_ptr<BatchScheduler<FakeTask>> queue;
+  TF_ASSERT_OK(scheduler->AddQueue({}, queue_callback, &queue));
+
+  // Enqueue 2 tasks, should result in 2 batches.
+  for (int i = 0; i < 2; i++) {
+    TF_ASSERT_OK(ScheduleTask(800, queue.get()));
+  }
+  // Delete scheduler, should be kept alive until queues are empty.
+  scheduler.reset();
+  start_processing.Notify();
+  finish_processing.WaitForNotification();
+}
+
+TEST(SerialDeviceBatchSchedulerTest, QueueCapacityInfo) {
+  SerialDeviceBatchScheduler<FakeTask>::Options options;
+  options.initial_in_flight_batches_limit = 1;
+  options.batches_to_average_over = 1000;
+  options.full_batch_scheduling_boost_micros = 1000;
+  options.get_pending_on_serial_device = []() { return 0; };
+  mutex mu;
+  int processed_batches = 0;
+  Notification finish_processing;
+  auto queue_callback = [&mu, &processed_batches, &finish_processing](
+                            std::unique_ptr<Batch<FakeTask>> batch) {
+    ASSERT_TRUE(batch->IsClosed());
+    EXPECT_GT(batch->num_tasks(), 0);
+    mu.lock();
+    int batch_num = ++processed_batches;
+    mu.unlock();
+    if (batch_num == 1) {
+      finish_processing.WaitForNotification();
+    }
+  };
+  std::shared_ptr<SerialDeviceBatchScheduler<FakeTask>> scheduler;
+  TF_ASSERT_OK(
+      SerialDeviceBatchScheduler<FakeTask>::Create(options, &scheduler));
+  std::unique_ptr<BatchScheduler<FakeTask>> queue1;
+  std::unique_ptr<BatchScheduler<FakeTask>> queue2;
+  TF_ASSERT_OK(scheduler->AddQueue({}, queue_callback, &queue1));
+  TF_ASSERT_OK(scheduler->AddQueue({}, queue_callback, &queue2));
+
+  // Blocker task, should schedule first.
+  TF_ASSERT_OK(ScheduleTask(800, queue1.get()));
+  TF_ASSERT_OK(ScheduleTask(100, queue2.get()));
+
+  EXPECT_EQ(queue2->NumEnqueuedTasks(), 1);
+  EXPECT_EQ(queue2->SchedulingCapacity(), 9 * 1000 + 900);
+  // Enqueue 2 more tasks, should fall in same batch.
+  TF_ASSERT_OK(ScheduleTask(100, queue2.get()));
+  TF_ASSERT_OK(ScheduleTask(200, queue2.get()));
+  EXPECT_EQ(queue2->NumEnqueuedTasks(), 3);
+  EXPECT_EQ(queue2->SchedulingCapacity(), 9 * 1000 + 600);
+  // Enqueue 1 more task, should create new batch.
+  TF_ASSERT_OK(ScheduleTask(700, queue2.get()));
+  EXPECT_EQ(queue2->NumEnqueuedTasks(), 4);
+  EXPECT_EQ(queue2->SchedulingCapacity(), 8 * 1000 + 300);
+  finish_processing.Notify();
+}
+}  // namespace anonymous
+}  // namespace serving
+}  // namespace tensorflow
diff --git a/tensorflow/core/kernels/bias_op.cc b/tensorflow/core/kernels/bias_op.cc
index 9fda7169a8bf3b6543c0a539a19e01899fea9571..127c3a1ca16c225abfb60fcfff6ae09a87b2d41b 100644
--- a/tensorflow/core/kernels/bias_op.cc
+++ b/tensorflow/core/kernels/bias_op.cc
@@ -29,6 +29,7 @@ limitations under the License.
 #if GOOGLE_CUDA
 #include "tensorflow/core/kernels/bias_op_gpu.h"
 #include "tensorflow/core/platform/stream_executor.h"
+#include "tensorflow/stream_executor/cuda/cuda_stream.h"
 #endif  // GOOGLE_CUDA
 
 namespace tensorflow {
@@ -363,6 +364,40 @@ class BiasOp<GPUDevice, T> : public BinaryOp<T> {
 TF_CALL_GPU_NUMBER_TYPES(REGISTER_GPU_KERNEL);
 #undef REGISTER_GPU_KERNEL
 
+struct BiasGradAutotuneGroup {
+  static string name() { return "BiasGrad"; }
+};
+
+class BiasAddGradGPUConfig {
+ public:
+  BiasAddGradGPUConfig() : mode_(BiasAddGradGPUMode::kReduction) {}
+  string ToString() const {
+    if (mode_ == BiasAddGradGPUMode::kNative) {
+      return "native CUDA kernel.";
+    }
+    if (mode_ == BiasAddGradGPUMode::kReduction) {
+      return "cub reduction kernel.";
+    }
+    return "unknown kernel.";
+  }
+  BiasAddGradGPUMode get_mode() const { return mode_; }
+  void set_mode(BiasAddGradGPUMode val) { mode_ = val; }
+
+  bool operator==(const BiasAddGradGPUConfig& other) const {
+    return this->mode_ == other.get_mode();
+  }
+
+  bool operator!=(const BiasAddGradGPUConfig& other) const {
+    return !(*this == other);
+  }
+
+ private:
+  BiasAddGradGPUMode mode_;
+};
+typedef AutoTuneSingleton<BiasGradAutotuneGroup, BiasAddParams,
+                          BiasAddGradGPUConfig>
+    AutotuneBiasGrad;
+
 template <typename T>
 class BiasGradOp<GPUDevice, T> : public OpKernel {
  public:
@@ -377,6 +412,49 @@ class BiasGradOp<GPUDevice, T> : public OpKernel {
     }
   }
 
+  void ComputeWithCustomKernel(OpKernelContext* context,
+                               const Tensor& output_backprop, int32 batch,
+                               int32 width, int32 height, int32 channel,
+                               Tensor* output) {
+    BiasGradGPU<T>::compute(context->template eigen_device<Device>(),
+                            output_backprop.template flat<T>().data(),
+                            output->flat<T>().data(), batch, width, height,
+                            channel, data_format_);
+  }
+
+  void ComputeWithReduceSum(OpKernelContext* context,
+                            const Tensor& output_backprop, int32 batch,
+                            int32 width, int32 height, int32 channel,
+                            Tensor* output) {
+    if (data_format_ == FORMAT_NCHW) {
+      int32 row_count = batch * channel;
+      int32 col_count = height * width;
+      Tensor temp_grad_outputs;
+      // For 'NCHW' format, we perform reduction twice: first HW, then N.
+      TensorShape temp_grad_output_shape{row_count, col_count};
+      OP_REQUIRES_OK(context, context->allocate_temp(DataTypeToEnum<T>::value,
+                                                     temp_grad_output_shape,
+                                                     &temp_grad_outputs));
+      BiasGradGPU<T>::DoRowReduction(
+          context, temp_grad_outputs.flat<T>().data(),
+          output_backprop.template flat<T>().data(), row_count, col_count);
+
+      row_count = batch;
+      col_count = channel;
+      BiasGradGPU<T>::DoColReduction(context, output->flat<T>().data(),
+                                     temp_grad_outputs.flat<T>().data(),
+                                     row_count, col_count);
+    } else {
+      // For 'NHWC', we simply apply reduction once on NHW.
+      int32 row_count = batch * height * width;
+      int32 col_count = channel;
+      BiasGradGPU<T>::DoColReduction(
+          context, const_cast<T*>(output->flat<T>().data()),
+          reinterpret_cast<const T*>(output_backprop.template flat<T>().data()),
+          row_count, col_count);
+    }
+  }
+
   void Compute(OpKernelContext* context) override {
     const Tensor& output_backprop = context->input(0);
 
@@ -396,11 +474,65 @@ class BiasGradOp<GPUDevice, T> : public OpKernel {
     se::DeviceMemoryBase output_ptr(output->flat<T>().data(),
                                     output->NumElements() * sizeof(T));
     stream->ThenMemZero(&output_ptr, output->NumElements() * sizeof(T));
-    if (output_backprop.NumElements() > 0) {
-      BiasGradGPU<T>::compute(context->template eigen_device<Device>(),
-                              output_backprop.template flat<T>().data(),
-                              output->flat<T>().data(), batch, width, height,
-                              channel, data_format_);
+    if (output_backprop.NumElements() <= 0) return;
+
+    int device_id = stream->parent()->device_ordinal();
+    DataType dtype = output_backprop.dtype();
+    BiasAddParams bias_parameters = {
+        {batch, height * width, channel},
+        data_format_,
+        dtype,
+        device_id,
+    };
+
+    // Autotune two algorithm: customized
+    BiasAddGradGPUConfig algo_config;
+    if (!AutotuneBiasGrad::GetInstance()->Find(bias_parameters, &algo_config)) {
+      BiasGradGPUProfileResult best_result;
+      // Initialize the timer.
+      perftools::gputools::Timer timer(stream->parent());
+      stream->InitTimer(&timer);
+      stream->ThenStartTimer(&timer);
+      ComputeWithCustomKernel(context, output_backprop, batch, width, height,
+                              channel, output);
+      stream->ThenStopTimer(&timer);
+      uint64 elapsed_microseconds = timer.Microseconds();
+      VLOG(1) << "BiasAddGrad " << bias_parameters.ToString()
+              << " Native algo latency: " << elapsed_microseconds;
+      if (elapsed_microseconds < best_result.elapsed_time()) {
+        best_result.set_algorithm(BiasAddGradGPUMode::kNative);
+        best_result.set_elapsed_time(elapsed_microseconds);
+      }
+
+      // Try reduction and profile.
+      stream->ThenStartTimer(&timer);
+      ComputeWithReduceSum(context, output_backprop, batch, width, height,
+                           channel, output);
+      stream->ThenStopTimer(&timer);
+
+      elapsed_microseconds = timer.Microseconds();
+      VLOG(1) << "BiasAddGrad " << bias_parameters.ToString()
+              << " Reduction algo latency: " << elapsed_microseconds;
+      if (elapsed_microseconds < best_result.elapsed_time()) {
+        best_result.set_algorithm(BiasAddGradGPUMode::kReduction);
+        best_result.set_elapsed_time(elapsed_microseconds);
+      }
+
+      algo_config.set_mode(best_result.algorithm());
+      AutotuneBiasGrad::GetInstance()->Insert(bias_parameters, algo_config);
+
+      // Results are already available during autotune, so no need to continue.
+      return;
+    }
+
+    // Choose the best algorithm based on autotune results.
+    if (algo_config.get_mode() == BiasAddGradGPUMode::kReduction) {
+      ComputeWithReduceSum(context, output_backprop, batch, width, height,
+                           channel, output);
+    } else {
+      // Default to the customized kernel.
+      ComputeWithCustomKernel(context, output_backprop, batch, width, height,
+                              channel, output);
     }
   }
 
diff --git a/tensorflow/core/kernels/bias_op_gpu.cu.cc b/tensorflow/core/kernels/bias_op_gpu.cu.cc
index 754b93b073a36d0925a0339956b8224878b849e1..1a7211a7cba8db6b3e57327df4018fb2ea0dbd0a 100644
--- a/tensorflow/core/kernels/bias_op_gpu.cu.cc
+++ b/tensorflow/core/kernels/bias_op_gpu.cu.cc
@@ -24,6 +24,14 @@ limitations under the License.
 #include "tensorflow/core/kernels/bias_op_gpu.h"
 #include "tensorflow/core/util/cuda_kernel_helper.h"
 
+#include "tensorflow/core/framework/tensor.h"
+#include "tensorflow/core/framework/tensor_shape.h"
+#include "tensorflow/core/framework/tensor_types.h"
+#include "tensorflow/core/platform/types.h"
+
+#include "tensorflow/core/kernels/reduction_gpu_kernels.cu.h"
+#include "tensorflow/core/kernels/reduction_ops_common.h"
+
 namespace tensorflow {
 
 typedef Eigen::GpuDevice GPUDevice;
@@ -239,6 +247,26 @@ void BiasGradGPU<T>::compute(const GPUDevice& d, const T* output_backprop,
   }
 }
 
+template <typename T>
+void BiasGradGPU<T>::DoRowReduction(OpKernelContext* context, T* output,
+                                    const T* input, int rows, int cols) {
+  typedef const Eigen::array<TTypes<float>::Tensor::Index, 1>& ReductionAxes;
+  Constants<GPUDevice> constants;
+  cub::Sum op;
+  functor::ReduceImpl<T, cub::Sum, T*, const T*, ReductionAxes>(
+      context, output, input, 2, rows, cols, 1, 1, constants.kOne, op);
+}
+
+template <typename T>
+void BiasGradGPU<T>::DoColReduction(OpKernelContext* context, T* output,
+                                    const T* input, int rows, int cols) {
+  typedef const Eigen::array<TTypes<float>::Tensor::Index, 1>& ReductionAxes;
+  Constants<GPUDevice> constants;
+  cub::Sum op;
+  functor::ReduceImpl<T, cub::Sum, T*, const T*, ReductionAxes>(
+      context, output, input, 2, rows, cols, 1, 1, constants.kZero, op);
+}
+
 #define DEFINE_GPU_SPECS(T)   \
   template struct BiasGPU<T>; \
   template struct BiasGradGPU<T>;
diff --git a/tensorflow/core/kernels/bias_op_gpu.h b/tensorflow/core/kernels/bias_op_gpu.h
index 9f14cc296f661a443985184e67cc9cdd4f7c247c..60c274c82628e144724451409985a24aa66aad4c 100644
--- a/tensorflow/core/kernels/bias_op_gpu.h
+++ b/tensorflow/core/kernels/bias_op_gpu.h
@@ -19,7 +19,9 @@ limitations under the License.
 #define EIGEN_USE_GPU
 
 #include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
+#include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/framework/tensor_types.h"
+#include "tensorflow/core/kernels/gpu_utils.h"
 #include "tensorflow/core/util/tensor_format.h"
 
 namespace tensorflow {
@@ -38,6 +40,91 @@ struct BiasGradGPU {
   static void compute(const GPUDevice& device, const T* output_backprop,
                       T* bias_backprop, int32 batch, int32 height, int32 width,
                       int32 channel, TensorFormat data_format);
+
+  static void DoRowReduction(OpKernelContext* context, T* output,
+                             const T* input, int rows, int cols);
+
+  static void DoColReduction(OpKernelContext* context, T* output,
+                             const T* input, int rows, int cols);
+};
+
+enum class BiasAddGradGPUMode {
+  kInvalid = 0,
+  kNative = 1,
+  kReduction = 2,
+};
+
+// Describe the BiasGradGPU result from a perf experiment.
+//
+// Arguments:
+// algorithm: returns the method to use for bias add grad.
+// elapsed_time; returns the measured elapsed time in microseconds.
+class BiasGradGPUProfileResult {
+ public:
+  bool is_valid() const {
+    return (algorithm_ != BiasAddGradGPUMode::kInvalid &&
+            elapsed_time_ != std::numeric_limits<float>::max());
+  }
+  BiasAddGradGPUMode algorithm() const { return algorithm_; }
+  void set_algorithm(BiasAddGradGPUMode val) { algorithm_ = val; }
+  uint64 elapsed_time() const { return elapsed_time_; }
+  void set_elapsed_time(uint64 val) { elapsed_time_ = val; }
+
+ private:
+  BiasAddGradGPUMode algorithm_ = BiasAddGradGPUMode::kInvalid;
+  uint64 elapsed_time_ = std::numeric_limits<uint64>::max();
+};
+
+// Encapsulate all the shape information that is used in bias add grad
+// operations.
+class BiasAddParams {
+ public:
+  // We use a list to maintain both the shape value and the order (data format).
+  using SpatialArray = gtl::InlinedVector<int64, 4>;
+  BiasAddParams(const SpatialArray& in_shape, TensorFormat data_format,
+                DataType dtype, int device_id)
+      : in_shape_(in_shape),
+        data_format_(data_format),
+        dtype_(dtype),
+        device_id_(device_id) {
+    for (int64 val : in_shape_) {
+      hash_code_ = Hash64Combine(hash_code_, val);
+    }
+    hash_code_ = Hash64Combine(hash_code_, data_format);
+    hash_code_ = Hash64Combine(hash_code_, dtype);
+    hash_code_ = Hash64Combine(hash_code_, device_id);
+  }
+  bool operator==(const BiasAddParams& other) const {
+    return this->get_data_as_tuple() == other.get_data_as_tuple();
+  }
+
+  bool operator!=(const BiasAddParams& other) const {
+    return !(*this == other);
+  }
+  uint64 hash() const { return hash_code_; }
+
+  string ToString() const {
+    // clang-format off
+    return strings::StrCat(
+        "(", str_util::Join(in_shape_, ", "), "), ",
+        data_format_, ", ", dtype_, ", ", device_id_);
+    // clang-format on
+  }
+
+ protected:
+  using ParamsDataType = std::tuple<SpatialArray, TensorFormat, DataType, int>;
+
+  ParamsDataType get_data_as_tuple() const {
+    return std::make_tuple(in_shape_, data_format_, dtype_, device_id_);
+  }
+
+  uint64 hash_code_ = 0;
+
+ private:
+  SpatialArray in_shape_;
+  TensorFormat data_format_;
+  DataType dtype_;
+  int device_id_;
 };
 
 }  // namespace tensorflow
diff --git a/tensorflow/core/kernels/boosted_trees/stats_ops.cc b/tensorflow/core/kernels/boosted_trees/stats_ops.cc
index 6dfcd63ab310a2b682054c6921d4425d344ada15..48afd3fbf316e45f36079966b155508387ea0f61 100644
--- a/tensorflow/core/kernels/boosted_trees/stats_ops.cc
+++ b/tensorflow/core/kernels/boosted_trees/stats_ops.cc
@@ -268,33 +268,36 @@ class BoostedTreesMakeStatsSummaryOp : public OpKernel {
     OpInputList bucketized_features_list;
     OP_REQUIRES_OK(context, context->input_list("bucketized_features_list",
                                                 &bucketized_features_list));
-    std::vector<tensorflow::TTypes<int32>::ConstVec> bucketized_features;
-    bucketized_features.reserve(num_features_);
-    for (const Tensor& tensor : bucketized_features_list) {
-      bucketized_features.emplace_back(tensor.vec<int32>());
-    }
-
     // Infer batch size.
     const int64 batch_size = node_ids_t->dim_size(0);
-    // Allocate output stats tensor (Rank 4).
-    Tensor* output_stats_summary_t = nullptr;
-    OP_REQUIRES_OK(context, context->allocate_output(
-                                "stats_summary",
+
+    // Allocate temporary stats tensor (Rank 4).
+    Tensor temp_stats_double_t;
+    OP_REQUIRES_OK(context, context->allocate_temp(
+                                DT_DOUBLE,
                                 {num_features_, max_splits_, num_buckets_, 2},
-                                &output_stats_summary_t));
-    auto output_stats_summary = output_stats_summary_t->tensor<float, 4>();
-    output_stats_summary.setZero();
+                                &temp_stats_double_t));
+    auto temp_stats_double = temp_stats_double_t.tensor<double, 4>();
+    temp_stats_double.setZero();
 
     // Partition by node, and then bucketize.
     for (int feature_idx = 0; feature_idx < num_features_; ++feature_idx) {
-      const auto& features = bucketized_features[feature_idx];
+      const auto& features = bucketized_features_list[feature_idx].vec<int32>();
       for (int i = 0; i < batch_size; ++i) {
         const int32 node = node_ids(i);
         const int32 bucket = features(i);
-        output_stats_summary(feature_idx, node, bucket, 0) += gradients(i, 0);
-        output_stats_summary(feature_idx, node, bucket, 1) += hessians(i, 0);
+        temp_stats_double(feature_idx, node, bucket, 0) += gradients(i, 0);
+        temp_stats_double(feature_idx, node, bucket, 1) += hessians(i, 0);
       }
     }
+
+    // Copy temp tensor over to output tensor.
+    Tensor* output_stats_summary_t = nullptr;
+    OP_REQUIRES_OK(context, context->allocate_output(
+                                "stats_summary", temp_stats_double_t.shape(),
+                                &output_stats_summary_t));
+    output_stats_summary_t->tensor<float, 4>() =
+        temp_stats_double.template cast<float>();
   }
 
  private:
diff --git a/tensorflow/core/kernels/cholesky_grad.cc b/tensorflow/core/kernels/cholesky_grad.cc
index 9d33845c2f47512b4f0e7b613cd32d365252b595..eac66e580dd2ecd22c1b84d31b5c9dd292db1820 100644
--- a/tensorflow/core/kernels/cholesky_grad.cc
+++ b/tensorflow/core/kernels/cholesky_grad.cc
@@ -84,7 +84,7 @@ class CholeskyGrad : public LinearAlgebraOp<Scalar> {
       Variables names representing the derivative matrix have a trailing '_bar'.
       */
 
-      const int64 block_begin = std::max(0ll, block_end - kMaxBlockSize);
+      const int64 block_begin = std::max(int64{0}, block_end - kMaxBlockSize);
       const int64 block_size = block_end - block_begin;
       const int64 trailing_size = kMatrixSize - block_end;
 
diff --git a/tensorflow/core/kernels/control_flow_ops.cc b/tensorflow/core/kernels/control_flow_ops.cc
index 7d5d54e5bece7d448e7c11c6061109e9e8554008..ebf844d75ffa4db8fcbeee7c78f6edb5c4b53bc5 100644
--- a/tensorflow/core/kernels/control_flow_ops.cc
+++ b/tensorflow/core/kernels/control_flow_ops.cc
@@ -587,24 +587,14 @@ REGISTER_SYCL_HOST_KERNEL(string);
 #undef REGISTER_SYCL_HOST_KERNEL
 #endif  // TENSORFLOW_USE_SYCL
 
-// A LoopCond op has one input and one output. The input is a boolean
-// scalar representing the taken branches of the "pivot" Switch that
-// determines loop termination. As a contract, any high-level front-end
-// should always use port '0' of the "pivot" switches for loop exit.
-class LoopCondOp : public OpKernel {
- public:
-  explicit LoopCondOp(OpKernelConstruction* context) : OpKernel(context) {}
-
-  void Compute(OpKernelContext* context) override {
-    context->set_output(0, context->input(0));
-  }
+LoopCondOp::LoopCondOp(OpKernelConstruction* context) : OpKernel(context) {}
+LoopCondOp::~LoopCondOp() = default;
 
-  bool IsExpensive() override { return false; }
-
-  ~LoopCondOp() override {}
+void LoopCondOp::Compute(OpKernelContext* context) {
+  context->set_output(0, context->input(0));
+}
 
-  TF_DISALLOW_COPY_AND_ASSIGN(LoopCondOp);
-};
+bool LoopCondOp::IsExpensive() { return false; }
 
 REGISTER_KERNEL_BUILDER(Name("LoopCond").Device(DEVICE_CPU), LoopCondOp);
 REGISTER_KERNEL_BUILDER(Name("LoopCond")
diff --git a/tensorflow/core/kernels/control_flow_ops.h b/tensorflow/core/kernels/control_flow_ops.h
index 4838f2e2bf0443700046e634721ecc04fb13bf51..8edbcc9077764a036d6aea2c3c89329088f98d99 100644
--- a/tensorflow/core/kernels/control_flow_ops.h
+++ b/tensorflow/core/kernels/control_flow_ops.h
@@ -97,6 +97,22 @@ class NextIterationOp : public OpKernel {
   TF_DISALLOW_COPY_AND_ASSIGN(NextIterationOp);
 };
 
+// A LoopCond op has one input and one output. The input is a boolean
+// scalar representing the taken branches of the "pivot" Switch that
+// determines loop termination. As a contract, any high-level front-end
+// should always use port '0' of the "pivot" switches for loop exit.
+class LoopCondOp : public OpKernel {
+ public:
+  explicit LoopCondOp(OpKernelConstruction* context);
+  ~LoopCondOp() override;
+
+  void Compute(OpKernelContext* context) override;
+
+  bool IsExpensive() override;
+
+  TF_DISALLOW_COPY_AND_ASSIGN(LoopCondOp);
+};
+
 }  // namespace tensorflow
 
 #endif  // TENSORFLOW_KERNELS_CONTROL_FLOW_OPS_H_
diff --git a/tensorflow/core/kernels/conv_ops_gpu_3.cu.cc b/tensorflow/core/kernels/conv_ops_gpu_3.cu.cc
index a2e7342b046ac0b1fb51bc515e80d26fdc191b1d..a5fa48f85ec87fb1a9c27a6f036b89f7c93e23e7 100644
--- a/tensorflow/core/kernels/conv_ops_gpu_3.cu.cc
+++ b/tensorflow/core/kernels/conv_ops_gpu_3.cu.cc
@@ -247,7 +247,13 @@ __global__ void SwapDimension1And2InTensor3UsingTiles(
   constexpr int ReadRowPerPass = NumThreads / TileSizeJ;
   constexpr int WriteRowPerPass = NumThreads / TileSizeI;
   // One extra line in the inner dimension to avoid share memory bank conflict.
-  __shared__ T shared_memory_tile[TileSizeI][TileSizeJ + 1];
+  // This is to mimic the following, but no constructor of T can be invoked.
+  //     __shared__ T shared_memory_tile[TileSizeI][TileSizeJ + 1];
+  __shared__ __align__(
+      alignof(T)) char shared_mem_raw[TileSizeI * (TileSizeJ + 1) * sizeof(T)];
+  typedef T(*SharedMemoryTile)[TileSizeJ + 1];
+  SharedMemoryTile shared_memory_tile =
+      reinterpret_cast<SharedMemoryTile>(shared_mem_raw);
 
   int x = threadIdx.x;
 
diff --git a/tensorflow/core/kernels/cwise_op_bessel.cc b/tensorflow/core/kernels/cwise_op_bessel.cc
new file mode 100644
index 0000000000000000000000000000000000000000..4372f56408b8e90ea4bcd57acb15f763c9b79a27
--- /dev/null
+++ b/tensorflow/core/kernels/cwise_op_bessel.cc
@@ -0,0 +1,29 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/kernels/cwise_ops_common.h"
+
+namespace tensorflow {
+REGISTER3(UnaryOp, CPU, "BesselI0e", functor::bessel_i0e, Eigen::half, float,
+          double);
+REGISTER3(UnaryOp, CPU, "BesselI1e", functor::bessel_i1e, Eigen::half, float,
+          double);
+#if GOOGLE_CUDA
+REGISTER3(UnaryOp, GPU, "BesselI0e", functor::bessel_i0e, Eigen::half, float,
+          double);
+REGISTER3(UnaryOp, GPU, "BesselI1e", functor::bessel_i1e, Eigen::half, float,
+          double);
+#endif
+}  // namespace tensorflow
diff --git a/tensorflow/core/kernels/cwise_op_bessel.cu.cc b/tensorflow/core/kernels/cwise_op_bessel.cu.cc
new file mode 100644
index 0000000000000000000000000000000000000000..30de8b1fdc4d521c318980317c8f5cb2e276cbbc
--- /dev/null
+++ b/tensorflow/core/kernels/cwise_op_bessel.cu.cc
@@ -0,0 +1,27 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#if GOOGLE_CUDA
+
+#include "tensorflow/core/kernels/cwise_ops_gpu_common.cu.h"
+
+namespace tensorflow {
+namespace functor {
+DEFINE_UNARY3(bessel_i0e, Eigen::half, float, double);
+DEFINE_UNARY3(bessel_i1e, Eigen::half, float, double);
+}  // namespace functor
+}  // namespace tensorflow
+
+#endif  // GOOGLE_CUDA
diff --git a/tensorflow/core/kernels/cwise_op_igammas.cc b/tensorflow/core/kernels/cwise_op_igammas.cc
index a1d7f4dad43bf6934a028b8f5c9498884df15f80..4b5f888bc1f076858eeebb159374efa7c82af900 100644
--- a/tensorflow/core/kernels/cwise_op_igammas.cc
+++ b/tensorflow/core/kernels/cwise_op_igammas.cc
@@ -18,4 +18,8 @@ limitations under the License.
 namespace tensorflow {
 REGISTER2(BinaryOp, CPU, "Igamma", functor::igamma, float, double);
 REGISTER2(BinaryOp, CPU, "Igammac", functor::igammac, float, double);
+#if GOOGLE_CUDA
+REGISTER2(BinaryOp, GPU, "Igamma", functor::igamma, float, double);
+REGISTER2(BinaryOp, GPU, "Igammac", functor::igammac, float, double);
+#endif
 }  // namespace tensorflow
diff --git a/tensorflow/core/kernels/cwise_op_not_equal_to_1.cc b/tensorflow/core/kernels/cwise_op_not_equal_to_1.cc
index 02cd298745795294bfb8117a24ba930a7f471788..935619711c23dab21d426d8909ec0617d13b9a7a 100644
--- a/tensorflow/core/kernels/cwise_op_not_equal_to_1.cc
+++ b/tensorflow/core/kernels/cwise_op_not_equal_to_1.cc
@@ -21,5 +21,28 @@ REGISTER6(BinaryOp, CPU, "NotEqual", functor::not_equal_to, float, Eigen::half,
 #if GOOGLE_CUDA
 REGISTER4(BinaryOp, GPU, "NotEqual", functor::not_equal_to, float, Eigen::half,
           double, uint8);
+// A special GPU kernel for int32.
+// TODO(b/25387198): Also enable int32 in device memory. This kernel
+// registration requires all int32 inputs and outputs to be in host memory.
+REGISTER_KERNEL_BUILDER(Name("NotEqual")
+                            .Device(DEVICE_GPU)
+                            .HostMemory("x")
+                            .HostMemory("y")
+                            .HostMemory("z")
+                            .TypeConstraint<int32>("T"),
+                        BinaryOp<CPUDevice, functor::not_equal_to<int32>>);
 #endif
+
+#ifdef TENSORFLOW_USE_SYCL
+REGISTER2(BinaryOp, SYCL, "NotEqual", functor::not_equal_to, float, double);
+
+REGISTER_KERNEL_BUILDER(Name("NotEqual")
+                            .Device(DEVICE_SYCL)
+                            .HostMemory("x")
+                            .HostMemory("y")
+                            .HostMemory("z")
+                            .TypeConstraint<int32>("T"),
+                        BinaryOp<CPUDevice, functor::not_equal_to<int32>>);
+#endif  // TENSORFLOW_USE_SYCL
+
 }  // namespace tensorflow
diff --git a/tensorflow/core/kernels/cwise_ops.h b/tensorflow/core/kernels/cwise_ops.h
index a80905d1450cc38619bb27c2e27eda58b3cf169d..8b015df4e146891a63ddca6ff02521664ddb0d3f 100644
--- a/tensorflow/core/kernels/cwise_ops.h
+++ b/tensorflow/core/kernels/cwise_ops.h
@@ -616,6 +616,12 @@ struct acos : base<T, Eigen::internal::scalar_acos_op<T>> {};
 template <typename T>
 struct atan : base<T, Eigen::internal::scalar_atan_op<T>> {};
 
+template <typename T>
+struct bessel_i0e : base<T, Eigen::internal::scalar_i0e_op<T>> {};
+
+template <typename T>
+struct bessel_i1e : base<T, Eigen::internal::scalar_i1e_op<T>> {};
+
 struct logical_not : base<bool, Eigen::internal::scalar_boolean_not_op<bool>> {
 };
 
diff --git a/tensorflow/core/kernels/data/BUILD b/tensorflow/core/kernels/data/BUILD
index d35aad980de6affa39b9fa482a1448f3d8c5249e..6d2a04aa25d7d11ca36f07eccc71ab0ef88ad4c7 100644
--- a/tensorflow/core/kernels/data/BUILD
+++ b/tensorflow/core/kernels/data/BUILD
@@ -358,6 +358,7 @@ tf_kernel_library(
         "//tensorflow/core:framework",
         "//tensorflow/core:lib",
         "//tensorflow/core:lib_internal",
+        "//tensorflow/core:protos_all_cc",
     ],
 )
 
@@ -548,22 +549,69 @@ tf_kernel_library(
     ],
 )
 
+tf_kernel_library(
+    name = "identity_dataset_op",
+    srcs = ["identity_dataset_op.cc"],
+    deps = [
+        ":dataset",
+        "//tensorflow/core:framework",
+    ],
+)
+
+tf_kernel_library(
+    name = "optimize_dataset_op",
+    srcs = ["optimize_dataset_op.cc"],
+    deps = [
+        ":dataset",
+        "//tensorflow/core:core_cpu_internal",
+        "//tensorflow/core:dataset_ops_op_lib",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:lib_internal",
+        "//tensorflow/core:protos_all_cc",
+        "//tensorflow/core/grappler:graph_view",
+        "//tensorflow/core/grappler:grappler_item",
+        "//tensorflow/core/grappler:grappler_item_builder",
+        "//tensorflow/core/grappler/clusters:virtual_cluster",
+        "//tensorflow/core/grappler/optimizers:meta_optimizer",
+        "//tensorflow/core/grappler/optimizers/data",
+    ],
+)
+
 tf_kernel_library(
     name = "dataset_ops",
+    srcs = ["dataset_ops.cc"],
+    deps = [
+        ":dataset",
+        "//tensorflow/core:core_cpu_internal",
+        "//tensorflow/core:dataset_ops_op_lib",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:lib_internal",
+        "//tensorflow/core:protos_all_cc",
+    ],
+)
+
+tf_kernel_library(
+    name = "data",
     deps = [
         ":batch_dataset_op",
         ":cache_dataset_ops",
         ":concatenate_dataset_op",
+        ":dataset",
+        ":dataset_ops",
         ":dense_to_sparse_batch_dataset_op",
         ":filter_dataset_op",
         ":flat_map_dataset_op",
         ":generator_dataset_op",
         ":group_by_reducer_dataset_op",
         ":group_by_window_dataset_op",
+        ":identity_dataset_op",
         ":interleave_dataset_op",
         ":iterator_ops",
         ":map_and_batch_dataset_op",
         ":map_dataset_op",
+        ":optimize_dataset_op",
         ":padded_batch_dataset_op",
         ":parallel_interleave_dataset_op",
         ":parallel_map_dataset_op",
diff --git a/tensorflow/core/kernels/data/batch_dataset_op.cc b/tensorflow/core/kernels/data/batch_dataset_op.cc
index 3618c75827f7bff57183ef3ccbf24ac131547e55..58b86f2a088776b3d9a1c27108382cd69074a95e 100644
--- a/tensorflow/core/kernels/data/batch_dataset_op.cc
+++ b/tensorflow/core/kernels/data/batch_dataset_op.cc
@@ -27,7 +27,8 @@ namespace {
 class BatchDatasetOp : public UnaryDatasetOpKernel {
  public:
   explicit BatchDatasetOp(OpKernelConstruction* ctx)
-      : UnaryDatasetOpKernel(ctx) {}
+      : UnaryDatasetOpKernel(ctx),
+        op_version_(ctx->def().op() == "BatchDataset" ? 1 : 2) {}
 
   void MakeDataset(OpKernelContext* ctx, DatasetBase* input,
                    DatasetBase** output) override {
@@ -38,14 +39,24 @@ class BatchDatasetOp : public UnaryDatasetOpKernel {
         ctx, batch_size > 0,
         errors::InvalidArgument("Batch size must be greater than zero."));
 
-    *output = new Dataset(ctx, batch_size, input);
+    bool drop_remainder = false;
+    if (op_version_ > 1) {
+      OP_REQUIRES_OK(ctx, ParseScalarArgument<bool>(ctx, "drop_remainder",
+                                                    &drop_remainder));
+    }
+
+    *output = new Dataset(ctx, batch_size, drop_remainder, input);
   }
 
  private:
   class Dataset : public GraphDatasetBase {
    public:
-    Dataset(OpKernelContext* ctx, int64 batch_size, const DatasetBase* input)
-        : GraphDatasetBase(ctx), batch_size_(batch_size), input_(input) {
+    Dataset(OpKernelContext* ctx, int64 batch_size, bool drop_remainder,
+            const DatasetBase* input)
+        : GraphDatasetBase(ctx),
+          batch_size_(batch_size),
+          drop_remainder_(drop_remainder),
+          input_(input) {
       input_->Ref();
 
       // NOTE(mrry): Currently we implement "batch up to" semantics. If
@@ -54,14 +65,19 @@ class BatchDatasetOp : public UnaryDatasetOpKernel {
       const auto& input_shapes = input_->output_shapes();
       output_shapes_.reserve(input_shapes.size());
       for (const auto& input_shape : input_shapes) {
-        output_shapes_.emplace_back(
-            PartialTensorShape({-1}).Concatenate(input_shape));
+        if (drop_remainder_) {
+          output_shapes_.emplace_back(
+              PartialTensorShape({batch_size_}).Concatenate(input_shape));
+        } else {
+          output_shapes_.emplace_back(
+              PartialTensorShape({-1}).Concatenate(input_shape));
+        }
       }
     }
 
     ~Dataset() override { input_->Unref(); }
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
       return std::unique_ptr<IteratorBase>(new Iterator(
           Iterator::Params{this, strings::StrCat(prefix, "::Batch")}));
@@ -75,7 +91,7 @@ class BatchDatasetOp : public UnaryDatasetOpKernel {
       return output_shapes_;
     }
 
-    string DebugString() override {
+    string DebugString() const override {
       return strings::StrCat("BatchDatasetOp(", batch_size_, ")::Dataset");
     }
 
@@ -86,8 +102,10 @@ class BatchDatasetOp : public UnaryDatasetOpKernel {
       TF_RETURN_IF_ERROR(b->AddParentDataset(ctx, input_, &input_graph_node));
       Node* batch_size = nullptr;
       TF_RETURN_IF_ERROR(b->AddScalar(batch_size_, &batch_size));
-      TF_RETURN_IF_ERROR(
-          b->AddDataset(this, {input_graph_node, batch_size}, output));
+      Node* drop_remainder = nullptr;
+      TF_RETURN_IF_ERROR(b->AddScalar(drop_remainder_, &drop_remainder));
+      TF_RETURN_IF_ERROR(b->AddDataset(
+          this, {input_graph_node, batch_size, drop_remainder}, output));
       return Status::OK();
     }
 
@@ -95,8 +113,11 @@ class BatchDatasetOp : public UnaryDatasetOpKernel {
     class Iterator : public DatasetIterator<Dataset> {
      public:
       explicit Iterator(const Params& params)
-          : DatasetIterator<Dataset>(params),
-            input_impl_(params.dataset->input_->MakeIterator(params.prefix)) {}
+          : DatasetIterator<Dataset>(params) {}
+
+      Status Initialize(IteratorContext* ctx) override {
+        return dataset()->input_->MakeIterator(ctx, prefix(), &input_impl_);
+      }
 
       Status GetNextInternal(IteratorContext* ctx,
                              std::vector<Tensor>* out_tensors,
@@ -130,6 +151,12 @@ class BatchDatasetOp : public UnaryDatasetOpKernel {
           return Status::OK();
         }
 
+        if (dataset()->drop_remainder_ &&
+            batch_elements.size() < dataset()->batch_size_) {
+          *end_of_sequence = true;
+          return Status::OK();
+        }
+
         // Copy the retrieved batch elements into one output tensor
         // per tuple component.
         // NOTE(mrry): If the input or output sizes are statically
@@ -198,14 +225,20 @@ class BatchDatasetOp : public UnaryDatasetOpKernel {
     };
 
     const int64 batch_size_;
+    const bool drop_remainder_;
     const DatasetBase* const input_;
     std::vector<PartialTensorShape> output_shapes_;
   };
+
+  const int op_version_;
 };
 
 REGISTER_KERNEL_BUILDER(Name("BatchDataset").Device(DEVICE_CPU),
                         BatchDatasetOp);
 
+REGISTER_KERNEL_BUILDER(Name("BatchDatasetV2").Device(DEVICE_CPU),
+                        BatchDatasetOp);
+
 }  // namespace
 
 }  // namespace tensorflow
diff --git a/tensorflow/core/kernels/data/cache_dataset_ops.cc b/tensorflow/core/kernels/data/cache_dataset_ops.cc
index 4b4728dab68523aa54176bdce6222a7aa5f8e9d3..ed4932bf32342afc4bacfcd56e30653e3deed186 100644
--- a/tensorflow/core/kernels/data/cache_dataset_ops.cc
+++ b/tensorflow/core/kernels/data/cache_dataset_ops.cc
@@ -41,15 +41,17 @@ class CacheDatasetOp : public UnaryDatasetOpKernel {
     if (filename.empty()) {
       *output = new MemoryDataset(input);
     } else {
-      *output = new FileDataset(input, filename, ctx->env());
+      *output = new FileDataset(ctx, input, filename, ctx->env());
     }
   }
 
  private:
-  class FileDataset : public DatasetBase {
+  class FileDataset : public GraphDatasetBase {
    public:
-    explicit FileDataset(const DatasetBase* input, string filename, Env* env)
-        : input_(input),
+    explicit FileDataset(OpKernelContext* ctx, const DatasetBase* input,
+                         string filename, Env* env)
+        : GraphDatasetBase(ctx),
+          input_(input),
           filename_(std::move(filename)),
           env_(env),
           num_tensors_(input->output_dtypes().size()),
@@ -64,15 +66,10 @@ class CacheDatasetOp : public UnaryDatasetOpKernel {
 
     ~FileDataset() override { input_->Unref(); }
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
-      if (env_->FileExists(strings::StrCat(filename_, ".index")).ok()) {
-        return std::unique_ptr<IteratorBase>(new FileReaderIterator(
-            {this, strings::StrCat(prefix, "::FileReader")}));
-      } else {
-        return std::unique_ptr<IteratorBase>(new FileWriterIterator(
-            {this, strings::StrCat(prefix, "::FileWriter")}));
-      }
+      return std::unique_ptr<IteratorBase>(new FileCacheIterator(
+          {this, strings::StrCat(prefix, "::FileCacheIterator")}));
     }
 
     const DataTypeVector& output_dtypes() const override {
@@ -83,7 +80,20 @@ class CacheDatasetOp : public UnaryDatasetOpKernel {
       return input_->output_shapes();
     }
 
-    string DebugString() override { return "CacheDatasetOp::FileDataset"; }
+    string DebugString() const override {
+      return "CacheDatasetOp::FileDataset";
+    }
+
+   protected:
+    Status AsGraphDefInternal(OpKernelContext* ctx, DatasetGraphDefBuilder* b,
+                              Node** output) const override {
+      Node* input_graph = nullptr;
+      TF_RETURN_IF_ERROR(b->AddParentDataset(ctx, input_, &input_graph));
+      Node* filename = nullptr;
+      TF_RETURN_IF_ERROR(b->AddScalar(filename_, &filename));
+      TF_RETURN_IF_ERROR(b->AddDataset(this, {input_graph, filename}, output));
+      return Status::OK();
+    }
 
    private:
     static size_t StringPaddingSize(size_t num_tensors) {
@@ -95,160 +105,428 @@ class CacheDatasetOp : public UnaryDatasetOpKernel {
                              tensor_index);
     }
 
-    // FileWriterIterator passes through and caches items from the input
-    // FileDataset.
-    //
-    // This iterator is used when the cache directory is not found on disk. It
-    // creates the cache directory, and passes on the underlying iterator's
-    // elements.
-    class FileWriterIterator : public DatasetIterator<FileDataset> {
+    class FileCacheIterator : public DatasetIterator<FileDataset> {
      public:
-      explicit FileWriterIterator(const Params& params)
-          : DatasetIterator<FileDataset>(params),
-            cur_index_(0),
-            input_impl_(params.dataset->input_->MakeIterator(params.prefix)),
-            writer_(params.dataset->env_, params.dataset->filename_),
-            lockfile_(strings::StrCat(params.dataset->filename_, ".lockfile")),
-            lockfile_created_(false),
-            iteration_completed_(false) {}
+      explicit FileCacheIterator(const Params& params)
+          : DatasetIterator<FileDataset>(params) {
+        if (params.dataset->env_
+                ->FileExists(MetaFilename(params.dataset->filename_))
+                .ok()) {
+          mode_ = Mode::read;
+        } else {
+          mode_ = Mode::write;
+        }
+        InitializeIterator();
+      }
+
+      Status Initialize(IteratorContext* ctx) override {
+        mutex_lock l(mu_);
+        return iterator_->Initialize(ctx);
+      }
 
       Status GetNextInternal(IteratorContext* ctx,
                              std::vector<Tensor>* out_tensors,
                              bool* end_of_sequence) override {
         mutex_lock l(mu_);
-        TF_RETURN_IF_ERROR(EnsureLockFileExists());
-        TF_RETURN_IF_ERROR(writer_.status());
-        if (cur_index_ >= kMaxItems) {
-          // As a courtesy, close the [truncated] cache file.
-          Status s = Finish();
-          if (!s.ok()) {
-            LOG(ERROR) << s;
-          }
-          return errors::InvalidArgument(
-              "Upstream iterator is producing more than ", kMaxItems,
-              " items, which is more than the cache limit.");
-        }
+        return iterator_->GetNext(ctx, out_tensors, end_of_sequence);
+      }
 
-        TF_RETURN_IF_ERROR(
-            input_impl_->GetNext(ctx, out_tensors, end_of_sequence));
-        if (*end_of_sequence && out_tensors->empty()) {
-          TF_RETURN_IF_ERROR(Finish());
-          cur_index_++;
-          return Status::OK();
-        }
-        if (out_tensors->size() != dataset()->num_tensors_) {
-          return errors::Internal(
-              "Upstream iterator returned invalid number of tensors. Expected ",
-              dataset()->num_tensors_, " got: ", out_tensors->size());
+     protected:
+      Status SaveInternal(IteratorStateWriter* writer) override {
+        mutex_lock l(mu_);
+        TF_RETURN_IF_ERROR(writer->WriteScalar(full_name("mode"), mode_));
+        return SaveParent(writer, iterator_);
+      }
+      Status RestoreInternal(IteratorContext* ctx,
+                             IteratorStateReader* reader) override {
+        mutex_lock l(mu_);
+        {
+          int64 temp;
+          TF_RETURN_IF_ERROR(reader->ReadScalar(full_name("mode"), &temp));
+          mode_ = static_cast<Mode>(temp);
         }
-        size_t tensor_index = 0;
-        for (const Tensor& t : *out_tensors) {
-          DCHECK_LT(tensor_index, dataset()->num_tensors_);
-          string key = dataset()->FormatName(cur_index_, tensor_index++);
-          TF_RETURN_IF_ERROR(writer_.Add(key, t));
+        if (mode_ == Mode::write &&
+            dataset()
+                ->env_->FileExists(MetaFilename(dataset()->filename_))
+                .ok()) {
+          // This could happen if the cache was completely written after the
+          // checkpoint was saved.
+          LOG(WARNING)
+              << "It looks like the cache was already completely written("
+              << MetaFilename(dataset()->filename_)
+              << ") after the last checkpoint was saved. "
+              << "Attempting to read the cache instead of continuing to "
+              << "write. If this is a mistake, please remove the above file "
+              << "and try running again.";
+          mode_ = Mode::read;
         }
-        if (*end_of_sequence) {
-          TF_RETURN_IF_ERROR(Finish());
-        }
-        cur_index_++;
-        return Status::OK();
+        InitializeIterator();
+        TF_RETURN_IF_ERROR(iterator_->Initialize(ctx));
+        return RestoreParent(ctx, reader, iterator_);
       }
 
      private:
-      Status EnsureLockFileExists() EXCLUSIVE_LOCKS_REQUIRED(mu_) {
-        if (iteration_completed_)
-          return errors::OutOfRange(
-              "Attempting to call get_next after iteration should have "
-              "finished.");
-        if (lockfile_created_ && !iteration_completed_) return Status::OK();
-        // Perform rudimentary locking to help catch concurrent writes to the
-        // same cache files.
-        if (dataset()->env_->FileExists(lockfile_).ok()) {
-          // Attempt to read the contents of the lockfile.
-          char contents_scratch[151] = {0};  // Initialize all to 0.
-          StringPiece contents;
-          std::unique_ptr<RandomAccessFile> file;
-          if (dataset()->env_->NewRandomAccessFile(lockfile_, &file).ok()) {
-            file->Read(0, 150, &contents, contents_scratch).IgnoreError();
+      // FileWriterIterator passes through and caches items from the input
+      // FileDataset.
+      //
+      // This iterator is used when the cache directory is not found on disk. It
+      // creates the cache directory, and passes on the underlying iterator's
+      // elements.
+      //
+      // Caching is performed by writing the input tensors to disk using the
+      // `BundleWriter`. Note that the cache gets fully flushed to disk only
+      // after the input iterator has been fully exhausted. If the program
+      // exits, before completion of an epoch, the cached state would be lost.
+      // To ensure that the partial cache persists across sessions, one should
+      // checkpoint the input pipeline. On each call to `SaveInternal` the
+      // partial cache gets flushed to disk in files with prefix
+      // <filename>_<shard_id> where shard_id is unique for each checkpoint.
+      // When all elements have been produced, these shards get coalesced.
+      class FileWriterIterator : public DatasetIterator<FileDataset> {
+       public:
+        explicit FileWriterIterator(const Params& params)
+            : DatasetIterator<FileDataset>(params),
+              cur_index_(0),
+              shard_id_(0),
+              filename_(
+                  strings::StrCat(params.dataset->filename_, "_", shard_id_)),
+              lockfile_(strings::StrCat(filename_, ".lockfile")),
+              lockfile_created_(false),
+              iteration_completed_(false) {}
+
+        Status Initialize(IteratorContext* ctx) override {
+          return dataset()->input_->MakeIterator(ctx, prefix(), &input_impl_);
+        }
+
+        Status GetNextInternal(IteratorContext* ctx,
+                               std::vector<Tensor>* out_tensors,
+                               bool* end_of_sequence) override {
+          mutex_lock l(mu_);
+          TF_RETURN_IF_ERROR(EnsureLockFileExists());
+          TF_RETURN_IF_ERROR(writer_->status());
+          if (cur_index_ >= kMaxItems) {
+            // As a courtesy, close the [truncated] cache file.
+            Status s = Finish();
+            if (!s.ok()) {
+              LOG(ERROR) << s;
+            }
+            return errors::InvalidArgument(
+                "Upstream iterator is producing more than ", kMaxItems,
+                " items, which is more than the cache limit.");
           }
-          return errors::AlreadyExists(
-              "There appears to be a concurrent caching iterator running - "
-              "cache lockfile already exists ('",
-              lockfile_,
-              "'). If you are sure no other running TF computations are using "
-              "this cache prefix, delete the lockfile and re-initialize the "
-              "iterator. Lockfile contents: ",
-              contents);
-        } else {
-          // Create the file, and write some basic contents.
-          std::unique_ptr<WritableFile> lockfile;
+
           TF_RETURN_IF_ERROR(
-              dataset()->env_->NewWritableFile(lockfile_, &lockfile));
-          TF_RETURN_IF_ERROR(lockfile->Append(
-              strings::StrCat("Created at: ", dataset()->env_->NowSeconds())));
-          lockfile_created_ = true;
+              input_impl_->GetNext(ctx, out_tensors, end_of_sequence));
+          if (*end_of_sequence && out_tensors->empty()) {
+            TF_RETURN_IF_ERROR(Finish());
+            cur_index_++;
+            return Status::OK();
+          }
+          if (out_tensors->size() != dataset()->num_tensors_) {
+            return errors::Internal(
+                "Upstream iterator returned invalid number of tensors. "
+                "Expected ",
+                dataset()->num_tensors_, " got: ", out_tensors->size());
+          }
+          size_t tensor_index = 0;
+          for (const Tensor& t : *out_tensors) {
+            DCHECK_LT(tensor_index, dataset()->num_tensors_);
+            string key = dataset()->FormatName(cur_index_, tensor_index++);
+            TF_RETURN_IF_ERROR(writer_->Add(key, t));
+          }
+          if (*end_of_sequence) {
+            TF_RETURN_IF_ERROR(Finish());
+          }
+          cur_index_++;
           return Status::OK();
         }
-      }
 
-      Status Finish() EXCLUSIVE_LOCKS_REQUIRED(mu_) {
-        iteration_completed_ = true;
-        TF_RETURN_IF_ERROR(writer_.Finish());
-        TF_RETURN_IF_ERROR(dataset()->env_->DeleteFile(lockfile_));
-        return Status::OK();
-      }
+       protected:
+        Status SaveInternal(IteratorStateWriter* writer) override {
+          mutex_lock l(mu_);
+          if (iteration_completed_) {
+            TF_RETURN_IF_ERROR(
+                writer->WriteScalar(full_name("iteration_completed"), ""));
+            return Status::OK();
+          }
 
-      mutex mu_;
-      size_t cur_index_ GUARDED_BY(mu_);
-      std::unique_ptr<IteratorBase> input_impl_ GUARDED_BY(mu_);
-      BundleWriter writer_ GUARDED_BY(mu_);
-      const string lockfile_;
-      bool lockfile_created_ GUARDED_BY(mu_);
-      bool iteration_completed_ GUARDED_BY(mu_);
-    };  // FileWriterIterator
+          // lockfile is created on the first call to GetNextInternal. The
+          // absence of a lockfile means that GetNextInternal was not called
+          // and hence nothing was written to cache. So we don't need to worry
+          // about flushing the current shard. This ensures that we never write
+          // empty shards.
+          if (lockfile_created_) {
+            // Flush the current bundle.
+            TF_RETURN_IF_ERROR(writer_->Finish());
+
+            // Note: We do not delete the lockfile here. We keep lockfiles of
+            // all shards around until the entire cache has been written to
+            // prevent concurrent iterators from corrupting any of the shards.
+
+            // Start caching to a new shard.
+            shard_id_++;
+            filename_ = strings::StrCat(dataset()->filename_, "_", shard_id_);
+            lockfile_ = strings::StrCat(filename_, ".lockfile");
+            lockfile_created_ = false;
+          }
+          TF_RETURN_IF_ERROR(SaveParent(writer, input_impl_));
+          TF_RETURN_IF_ERROR(
+              writer->WriteScalar(full_name("cur_index"), cur_index_));
+          TF_RETURN_IF_ERROR(
+              writer->WriteScalar(full_name("shard_id"), shard_id_));
+          return Status::OK();
+        }
 
-    class FileReaderIterator : public DatasetIterator<FileDataset> {
-     public:
-      explicit FileReaderIterator(const Params& params)
-          : DatasetIterator<FileDataset>(params),
-            cur_index_(0),
-            reader_(dataset()->env_, dataset()->filename_) {}
+        Status RestoreInternal(IteratorContext* ctx,
+                               IteratorStateReader* reader) override {
+          mutex_lock l(mu_);
+          if (reader->Contains(full_name("iteration_completed"))) {
+            iteration_completed_ = true;
+            return Status::OK();
+          }
 
-      Status GetNextInternal(IteratorContext* ctx,
-                             std::vector<Tensor>* out_tensors,
-                             bool* end_of_sequence) override {
-        mutex_lock l(mu_);
-        *end_of_sequence = false;
-        TF_RETURN_IF_ERROR(reader_.status());
-        if (!reader_.Valid()) {
-          return errors::Internal(
-              "Cache iterator is in an invalid state. (Perhaps GetNext called "
-              "after end_of_sequence?)");
+          TF_RETURN_IF_ERROR(RestoreParent(ctx, reader, input_impl_));
+          int64 temp;
+          // TODO(b/78048575): Update this when saving size_t tensors directly
+          // is supported.
+          {
+            TF_RETURN_IF_ERROR(
+                reader->ReadScalar(full_name("cur_index"), &temp));
+            cur_index_ = static_cast<size_t>(temp);
+            if (cur_index_ != temp) {
+              return errors::Internal("Invalid value for cur_index ", temp);
+            }
+          }
+          // TODO(b/78048575): Update this when saving size_t tensors directly
+          // is supported.
+          {
+            TF_RETURN_IF_ERROR(
+                reader->ReadScalar(full_name("shard_id"), &temp));
+            shard_id_ = static_cast<size_t>(temp);
+            if (shard_id_ != temp) {
+              return errors::Internal("Invalid value for shard_id ", temp);
+            }
+          }
+          filename_ = strings::StrCat(dataset()->filename_, "_", shard_id_);
+          lockfile_ = strings::StrCat(filename_, ".lockfile");
+          writer_.reset(new BundleWriter(dataset()->env_, filename_));
+          return Status::OK();
         }
-        out_tensors->clear();
-        out_tensors->resize(dataset()->num_tensors_);
 
-        for (size_t i = 0; i < dataset()->num_tensors_; ++i) {
-          reader_.Next();  // The first entry in the table is a header entry.
-          if (!reader_.Valid()) {
-            out_tensors->clear();
-            *end_of_sequence = true;
+       private:
+        Status EnsureLockFileExists() EXCLUSIVE_LOCKS_REQUIRED(mu_) {
+          if (iteration_completed_)
+            return errors::OutOfRange(
+                "Attempting to call get_next after iteration should have "
+                "finished.");
+          if (lockfile_created_ && !iteration_completed_) return Status::OK();
+
+          // Perform rudimentary locking to help catch concurrent writes to the
+          // same cache files.
+
+          // 1. Check that a checkpoint for the shard has not already been
+          // written.
+          if (dataset()->env_->FileExists(MetaFilename(filename_)).ok()) {
+            return errors::AlreadyExists("Existing cache files found: \n",
+                                         MetaFilename(filename_), "\n",
+                                         DataFilename(filename_, 0, 1), "\n",
+                                         "To continue delete the above files.");
+          }
+
+          // 2. Check that there isn't a concurrent iterator that is writing
+          // to cache.
+          if (dataset()->env_->FileExists(lockfile_).ok()) {
+            // Attempt to read the contents of the lockfile.
+            char contents_scratch[151] = {0};  // Initialize all to 0.
+            StringPiece contents;
+            std::unique_ptr<RandomAccessFile> file;
+            if (dataset()->env_->NewRandomAccessFile(lockfile_, &file).ok()) {
+              file->Read(0, 150, &contents, contents_scratch).IgnoreError();
+            }
+            return errors::AlreadyExists(
+                "There appears to be a concurrent caching iterator running - "
+                "cache lockfile already exists ('",
+                lockfile_,
+                "'). If you are sure no other running TF computations are "
+                "using "
+                "this cache prefix, delete the lockfile and re-initialize the "
+                "iterator. Lockfile contents: ",
+                contents);
+          } else {
+            // Create the file, and write some basic contents.
+            std::unique_ptr<WritableFile> lockfile;
+            TF_RETURN_IF_ERROR(
+                dataset()->env_->NewWritableFile(lockfile_, &lockfile));
+            TF_RETURN_IF_ERROR(lockfile->Append(strings::StrCat(
+                "Created at: ", dataset()->env_->NowSeconds())));
+
+            // At this point we know that
+            // 1. There is no conflicting checkpoint with prefix `filename_`.
+            // 2. There is no concurrent session that is trying to write a ckpt
+            //    to filename.
+            // So it is safe to create a BundleWriter here. Note that it is
+            // unsafe to initialize the BundleWriter anywhere the above
+            // conditions are not met since BundleWriter's constructor creates
+            // new temp files which can delete the temp files created by a
+            // BundleWriter in another Session.
+            writer_.reset(new BundleWriter(dataset()->env_, filename_));
+            lockfile_created_ = true;
             return Status::OK();
           }
-          StringPiece key = reader_.key();
-          DCHECK_EQ(key, dataset()->FormatName(cur_index_, i));
-          TF_RETURN_IF_ERROR(reader_.ReadCurrent(&(*out_tensors)[i]));
+        }
+
+        Status Finish() EXCLUSIVE_LOCKS_REQUIRED(mu_) {
+          iteration_completed_ = true;
+          // Flush the current bundle.
+          TF_RETURN_IF_ERROR(writer_->Finish());
+          // Merge all the bundles.
+          // Currently there are `shard_id_ + 1` bundles, one for each
+          // checkpoint. Each bundle has prefix <filename>_<id> where `id` is an
+          // integer starting at 0 an incremented by 1 for each new checkpoint.
+          // We merge all these bundles into a bundle with prefix <filename> so
+          // that the next call to `MakeIterator` can build a
+          // `FileReaderIterator`.
+          {
+            std::vector<string> prefixes;
+            prefixes.reserve(shard_id_ + 1);
+            for (size_t i = 0; i <= shard_id_; ++i) {
+              prefixes.emplace_back(
+                  strings::StrCat(dataset()->filename_, "_", i));
+            }
+            TF_RETURN_IF_ERROR(
+                MergeBundles(dataset()->env_, prefixes, dataset()->filename_));
+          }
+          // Delete all lockfiles.
+          for (size_t i = 0; i <= shard_id_; ++i) {
+            TF_RETURN_IF_ERROR(dataset()->env_->DeleteFile(
+                strings::StrCat(dataset()->filename_, "_", i, ".lockfile")));
+          }
+          return Status::OK();
+        }
+
+        mutex mu_;
+        size_t cur_index_ GUARDED_BY(mu_);
+        // Index of the current shard. This gets incremented whenever a new
+        // cache shard is saved.
+        size_t shard_id_ GUARDED_BY(mu_);
+        std::unique_ptr<IteratorBase> input_impl_ GUARDED_BY(mu_);
+        // The current prefix for the cache file. This is equal to
+        // `StrCat(dataset()->filename_, "_", shard_id_)`.
+        string filename_;
+        std::unique_ptr<BundleWriter> writer_ GUARDED_BY(mu_);
+        string lockfile_ GUARDED_BY(mu_);
+        bool lockfile_created_ GUARDED_BY(mu_);
+        bool iteration_completed_ GUARDED_BY(mu_);
+      };  // FileWriterIterator
+
+      class FileReaderIterator : public DatasetIterator<FileDataset> {
+       public:
+        explicit FileReaderIterator(const Params& params)
+            : DatasetIterator<FileDataset>(params),
+              cur_index_(0),
+              reader_(dataset()->env_, dataset()->filename_),
+              iterator_restored_(false) {}
+
+        Status GetNextInternal(IteratorContext* ctx,
+                               std::vector<Tensor>* out_tensors,
+                               bool* end_of_sequence) override {
+          mutex_lock l(mu_);
+          *end_of_sequence = false;
           TF_RETURN_IF_ERROR(reader_.status());
+          if (!reader_.Valid()) {
+            return errors::Internal(
+                "Cache iterator is in an invalid state. (Perhaps GetNext "
+                "called "
+                "after end_of_sequence?)");
+          }
+          out_tensors->clear();
+          out_tensors->resize(dataset()->num_tensors_);
+
+          for (size_t i = 0; i < dataset()->num_tensors_; ++i) {
+            // When the iterator is restored from the checkpoint, `reader_` is
+            // already pointing at `key` so we do not need to skip the header
+            // entry.
+            if (!iterator_restored_) {
+              reader_
+                  .Next();  // The first entry in the table is a header entry.
+            } else {
+              iterator_restored_ = false;
+            }
+            if (!reader_.Valid()) {
+              out_tensors->clear();
+              *end_of_sequence = true;
+              return Status::OK();
+            }
+            StringPiece key = reader_.key();
+            DCHECK_EQ(key, dataset()->FormatName(cur_index_, i));
+            TF_RETURN_IF_ERROR(reader_.ReadCurrent(&(*out_tensors)[i]));
+            TF_RETURN_IF_ERROR(reader_.status());
+          }
+          cur_index_++;
+          return Status::OK();
+        }
+
+       protected:
+        Status SaveInternal(IteratorStateWriter* writer) override {
+          mutex_lock l(mu_);
+          TF_RETURN_IF_ERROR(
+              writer->WriteScalar(full_name("cur_index"), cur_index_));
+          return Status::OK();
+        }
+
+        Status RestoreInternal(
+            IteratorContext* ctx,
+            IteratorStateReader* iterator_state_reader) override {
+          mutex_lock l(mu_);
+          {
+            // TODO(b/78048575): Update this when saving size_t tensors directly
+            // is supported.
+            int64 temp;
+            TF_RETURN_IF_ERROR(iterator_state_reader->ReadScalar(
+                full_name("cur_index"), &temp));
+            cur_index_ = static_cast<size_t>(temp);
+            if (cur_index_ != temp) {
+              return errors::Internal("Invalid value for cur_index ", temp);
+            }
+          }
+          if (!reader_.Valid()) {
+            return errors::Internal("Error initializing BundleReader.");
+          }
+          reader_.Seek(dataset()->FormatName(cur_index_, 0));
+          iterator_restored_ = true;
+          return Status::OK();
+        }
+
+       private:
+        mutex mu_;
+        size_t cur_index_ GUARDED_BY(mu_);
+        BundleReader reader_ GUARDED_BY(mu_);
+        bool iterator_restored_ GUARDED_BY(mu_);
+      };  // FileReaderIterator
+
+      void InitializeIterator() EXCLUSIVE_LOCKS_REQUIRED(mu_) {
+        // We intentionally use the same prefix for both `FileReaderIterator`
+        // and `FileWriterIterator`. Since at any time there will be at most
+        // one of them alive, there should be no conflicts. This allows both
+        // iterators to use a common key for `cur_index`. We leverage this
+        // in the corner case when this iterator is restored from an old
+        // checkpoint in `write` mode and the cache has been completely
+        // flushed to disk since then. In that case we simply build a
+        // `FileReaderIterator` and seek to the `cur_index`.
+        switch (mode_) {
+          case Mode::read:
+            iterator_.reset(new FileReaderIterator({dataset(), prefix()}));
+            break;
+          case Mode::write:
+            iterator_.reset(new FileWriterIterator({dataset(), prefix()}));
         }
-        cur_index_++;
-        return Status::OK();
       }
 
-     private:
       mutex mu_;
-      size_t cur_index_ GUARDED_BY(mu_);
-      BundleReader reader_ GUARDED_BY(mu_);
-    };  // FileReaderIterator
+      enum Mode { read, write };
+      Mode mode_ GUARDED_BY(mu_);
+      std::unique_ptr<IteratorBase> iterator_ GUARDED_BY(mu_);
+    };  // FileCacheIterator
 
     const DatasetBase* const input_;
     const string filename_;
@@ -268,7 +546,7 @@ class CacheDatasetOp : public UnaryDatasetOpKernel {
 
     ~MemoryDataset() override { input_->Unref(); }
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
       mutex_lock l(mu_);
       if (cache_) {
@@ -292,7 +570,9 @@ class CacheDatasetOp : public UnaryDatasetOpKernel {
       return input_->output_shapes();
     }
 
-    string DebugString() override { return "CacheDatasetOp::MemoryDataset"; }
+    string DebugString() const override {
+      return "CacheDatasetOp::MemoryDataset";
+    }
 
    private:
     // MemoryWriterIterator passes through and appends items from the input
@@ -305,7 +585,6 @@ class CacheDatasetOp : public UnaryDatasetOpKernel {
      public:
       explicit MemoryWriterIterator(const Params& params)
           : DatasetIterator<MemoryDataset>(params),
-            input_impl_(params.dataset->input_->MakeIterator(params.prefix)),
             cache_(new std::vector<std::vector<Tensor>>) {}
 
       ~MemoryWriterIterator() override {
@@ -323,6 +602,10 @@ class CacheDatasetOp : public UnaryDatasetOpKernel {
         }
       }
 
+      Status Initialize(IteratorContext* ctx) override {
+        return dataset()->input_->MakeIterator(ctx, prefix(), &input_impl_);
+      }
+
       Status GetNextInternal(IteratorContext* ctx,
                              std::vector<Tensor>* out_tensors,
                              bool* end_of_sequence) override {
diff --git a/tensorflow/core/kernels/data/concatenate_dataset_op.cc b/tensorflow/core/kernels/data/concatenate_dataset_op.cc
index f11abc62a67a6937cfa7891022a1643c93439e97..0012a4769d197445e725a892478980dbb43a159d 100644
--- a/tensorflow/core/kernels/data/concatenate_dataset_op.cc
+++ b/tensorflow/core/kernels/data/concatenate_dataset_op.cc
@@ -61,7 +61,7 @@ class ConcatenateDatasetOp : public BinaryDatasetOpKernel {
       to_concatenate_->Unref();
     }
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
       return std::unique_ptr<IteratorBase>(
           new Iterator({this, strings::StrCat(prefix, "::Concatenate")}));
@@ -75,7 +75,9 @@ class ConcatenateDatasetOp : public BinaryDatasetOpKernel {
       return output_shapes_;
     }
 
-    string DebugString() override { return "ConcatenateDatasetOp::Dataset"; }
+    string DebugString() const override {
+      return "ConcatenateDatasetOp::Dataset";
+    }
 
    protected:
     Status AsGraphDefInternal(OpKernelContext* ctx, DatasetGraphDefBuilder* b,
@@ -94,10 +96,12 @@ class ConcatenateDatasetOp : public BinaryDatasetOpKernel {
     class Iterator : public DatasetIterator<Dataset> {
      public:
       explicit Iterator(const Params& params)
-          : DatasetIterator<Dataset>(params),
-            i_(0),
-            input_impl_(params.dataset->input_->MakeIterator(
-                strings::StrCat(params.prefix, "[0]"))) {}
+          : DatasetIterator<Dataset>(params), i_(0) {}
+
+      Status Initialize(IteratorContext* ctx) override {
+        return dataset()->input_->MakeIterator(
+            ctx, strings::StrCat(prefix(), "[0]"), &input_impl_);
+      }
 
       Status GetNextInternal(IteratorContext* ctx,
                              std::vector<Tensor>* out_tensors,
@@ -114,8 +118,8 @@ class ConcatenateDatasetOp : public BinaryDatasetOpKernel {
             return Status::OK();
           }
           if (++i_ < 2) {
-            input_impl_ = dataset()->to_concatenate_->MakeIterator(
-                strings::StrCat(prefix(), "[1]"));
+            TF_RETURN_IF_ERROR(dataset()->to_concatenate_->MakeIterator(
+                ctx, strings::StrCat(prefix(), "[1]"), &input_impl_));
           }
         }
         *end_of_sequence = true;
@@ -147,8 +151,8 @@ class ConcatenateDatasetOp : public BinaryDatasetOpKernel {
         if (!TF_PREDICT_TRUE(i_ >= 0 && i_ <= 2))
           return errors::InvalidArgument("i_ must be in range [0, 2].");
         if (i_ == 1) {
-          input_impl_ = dataset()->to_concatenate_->MakeIterator(
-              strings::StrCat(prefix(), "[1]"));
+          TF_RETURN_IF_ERROR(dataset()->to_concatenate_->MakeIterator(
+              ctx, strings::StrCat(prefix(), "[1]"), &input_impl_));
         } else if (i_ == 2) {
           input_impl_.reset();
         }
diff --git a/tensorflow/core/kernels/data/dataset_ops.cc b/tensorflow/core/kernels/data/dataset_ops.cc
new file mode 100644
index 0000000000000000000000000000000000000000..01989a3bd9652d5f63f87c3eb664e1ba26cb27f5
--- /dev/null
+++ b/tensorflow/core/kernels/data/dataset_ops.cc
@@ -0,0 +1,47 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/framework/graph.pb.h"
+#include "tensorflow/core/framework/op_kernel.h"
+#include "tensorflow/core/graph/graph_def_builder.h"
+#include "tensorflow/core/kernels/data/dataset.h"
+
+namespace tensorflow {
+
+// See documentation in ../ops/dataset_ops.cc for a high-level
+// description of the following op.
+class DatasetToGraphOp : public OpKernel {
+ public:
+  explicit DatasetToGraphOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}
+
+  void Compute(OpKernelContext* ctx) override {
+    DatasetBase* dataset;
+    OP_REQUIRES_OK(ctx, GetDatasetFromVariantTensor(ctx->input(0), &dataset));
+    GraphDefBuilder b;
+    DatasetBase::DatasetGraphDefBuilder db(&b);
+    Node* input_node = nullptr;
+    OP_REQUIRES_OK(ctx, db.AddParentDataset(ctx, dataset, &input_node));
+    GraphDef graph_def;
+    OP_REQUIRES_OK(ctx, b.ToGraphDef(&graph_def));
+    Tensor* result;
+    OP_REQUIRES_OK(ctx, ctx->allocate_output(0, TensorShape({}), &result));
+    result->scalar<string>()() = graph_def.SerializeAsString();
+  }
+};
+
+REGISTER_KERNEL_BUILDER(Name("DatasetToGraph").Device(DEVICE_CPU),
+                        DatasetToGraphOp);
+
+}  // namespace tensorflow
diff --git a/tensorflow/core/kernels/data/dataset_utils.cc b/tensorflow/core/kernels/data/dataset_utils.cc
index c608f9e1c670971769a4d5c27ff061ddc459549b..d85ef1cbabcfc831f256ee14f696a85f9856cecf 100644
--- a/tensorflow/core/kernels/data/dataset_utils.cc
+++ b/tensorflow/core/kernels/data/dataset_utils.cc
@@ -41,9 +41,8 @@ Status MakeIteratorFromInputElement(
       GetDatasetFromVariantTensor(return_values[0], &returned_dataset));
 
   // Create an iterator for the dataset that was returned by `f`.
-  *out_iterator = returned_dataset->MakeIterator(
-      strings::StrCat(prefix, "[", thread_index, "]"));
-  return Status::OK();
+  return returned_dataset->MakeIterator(
+      ctx, strings::StrCat(prefix, "[", thread_index, "]"), out_iterator);
 }
 
 }  // namespace dataset
diff --git a/tensorflow/core/kernels/data/dense_to_sparse_batch_dataset_op.cc b/tensorflow/core/kernels/data/dense_to_sparse_batch_dataset_op.cc
index 132808a5f140a31fc3c1852cb83e5cd8579b6d95..91b927942763d3c562be43dbbcc04aba59faa836 100644
--- a/tensorflow/core/kernels/data/dense_to_sparse_batch_dataset_op.cc
+++ b/tensorflow/core/kernels/data/dense_to_sparse_batch_dataset_op.cc
@@ -94,7 +94,7 @@ class DenseToSparseBatchDatasetOp : public UnaryDatasetOpKernel {
 
     ~Dataset() override { input_->Unref(); }
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
       return std::unique_ptr<IteratorBase>(new Iterator(
           {this, strings::StrCat(prefix, "::DenseToSparseBatch")}));
@@ -109,7 +109,7 @@ class DenseToSparseBatchDatasetOp : public UnaryDatasetOpKernel {
       return output_shapes_;
     }
 
-    string DebugString() override {
+    string DebugString() const override {
       return strings::StrCat("DenseToSparseBatchDatasetOp(", batch_size_,
                              ")::Dataset");
     }
@@ -137,8 +137,12 @@ class DenseToSparseBatchDatasetOp : public UnaryDatasetOpKernel {
     class Iterator : public DatasetIterator<Dataset<T>> {
      public:
       explicit Iterator(const typename Iterator::Params& params)
-          : DatasetIterator<Dataset<T>>(params),
-            input_impl_(params.dataset->input_->MakeIterator(params.prefix)) {}
+          : DatasetIterator<Dataset<T>>(params) {}
+
+      Status Initialize(IteratorContext* ctx) override {
+        return DatasetIterator<Dataset<T>>::dataset()->input_->MakeIterator(
+            ctx, DatasetIterator<Dataset<T>>::prefix(), &input_impl_);
+      }
 
       Status GetNextInternal(IteratorContext* ctx,
                              std::vector<Tensor>* out_tensors,
diff --git a/tensorflow/core/kernels/data/filter_dataset_op.cc b/tensorflow/core/kernels/data/filter_dataset_op.cc
index 186b1e1c6c5a3d5a4aa8e4eb31d474aac4156243..6d6c44552d94901e95a4fe8ca7639b14455a5bfe 100644
--- a/tensorflow/core/kernels/data/filter_dataset_op.cc
+++ b/tensorflow/core/kernels/data/filter_dataset_op.cc
@@ -93,7 +93,7 @@ class FilterDatasetOp : public UnaryDatasetOpKernel {
 
     ~FilterDatasetBase() override { input_->Unref(); }
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
       return std::unique_ptr<IteratorBase>(
           new Iterator({this, strings::StrCat(prefix, "::Filter")}));
@@ -106,7 +106,7 @@ class FilterDatasetOp : public UnaryDatasetOpKernel {
       return input_->output_shapes();
     }
 
-    string DebugString() override { return "FilterDatasetOp::Dataset"; }
+    string DebugString() const override { return "FilterDatasetOp::Dataset"; }
 
    protected:
     Status AsGraphDefInternal(OpKernelContext* ctx, DatasetGraphDefBuilder* b,
@@ -145,8 +145,11 @@ class FilterDatasetOp : public UnaryDatasetOpKernel {
     class Iterator : public DatasetIterator<FilterDatasetBase> {
      public:
       explicit Iterator(const Params& params)
-          : DatasetIterator<FilterDatasetBase>(params),
-            input_impl_(params.dataset->input_->MakeIterator(params.prefix)) {}
+          : DatasetIterator<FilterDatasetBase>(params) {}
+
+      Status Initialize(IteratorContext* ctx) override {
+        return dataset()->input_->MakeIterator(ctx, prefix(), &input_impl_);
+      }
 
       Status GetNextInternal(IteratorContext* ctx,
                              std::vector<Tensor>* out_tensors,
diff --git a/tensorflow/core/kernels/data/flat_map_dataset_op.cc b/tensorflow/core/kernels/data/flat_map_dataset_op.cc
index 77a48a2aa9b0a2be22ef9112cf985964457d65bf..baca022f1eb28423e3484a1e9cfc123712c6bb8e 100644
--- a/tensorflow/core/kernels/data/flat_map_dataset_op.cc
+++ b/tensorflow/core/kernels/data/flat_map_dataset_op.cc
@@ -74,7 +74,7 @@ class FlatMapDatasetOp : public UnaryDatasetOpKernel {
 
     ~Dataset() override { input_->Unref(); }
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
       return std::unique_ptr<IteratorBase>(
           new Iterator({this, strings::StrCat(prefix, "::FlatMap")}));
@@ -88,7 +88,7 @@ class FlatMapDatasetOp : public UnaryDatasetOpKernel {
       return output_shapes_;
     }
 
-    string DebugString() override { return "FlatMapDatasetOp::Dataset"; }
+    string DebugString() const override { return "FlatMapDatasetOp::Dataset"; }
 
    protected:
     Status AsGraphDefInternal(OpKernelContext* ctx, DatasetGraphDefBuilder* b,
@@ -125,8 +125,11 @@ class FlatMapDatasetOp : public UnaryDatasetOpKernel {
     class Iterator : public DatasetIterator<Dataset> {
      public:
       explicit Iterator(const Params& params)
-          : DatasetIterator<Dataset>(params),
-            input_impl_(params.dataset->input_->MakeIterator(params.prefix)) {}
+          : DatasetIterator<Dataset>(params) {}
+
+      Status Initialize(IteratorContext* ctx) override {
+        return dataset()->input_->MakeIterator(ctx, prefix(), &input_impl_);
+      }
 
       Status GetNextInternal(IteratorContext* ctx,
                              std::vector<Tensor>* out_tensors,
@@ -202,7 +205,8 @@ class FlatMapDatasetOp : public UnaryDatasetOpKernel {
         current_element_iterator_.reset();
         captured_func_inputs_.clear();
         if (!reader->Contains(full_name("exhausted"))) {
-          input_impl_ = dataset()->input_->MakeIterator(prefix());
+          TF_RETURN_IF_ERROR(
+              dataset()->input_->MakeIterator(ctx, prefix(), &input_impl_));
           TF_RETURN_IF_ERROR(RestoreParent(ctx, reader, input_impl_));
           {
             int64 temp;
diff --git a/tensorflow/core/kernels/data/generator_dataset_op.cc b/tensorflow/core/kernels/data/generator_dataset_op.cc
index 3f1e441b91d0102b112523a46ac75ce415eacdd7..aae62ad2fe8c2aa055287e13fee0282ab034c435 100644
--- a/tensorflow/core/kernels/data/generator_dataset_op.cc
+++ b/tensorflow/core/kernels/data/generator_dataset_op.cc
@@ -99,7 +99,7 @@ class GeneratorDatasetOp : public DatasetOpKernel {
           output_types_(output_types),
           output_shapes_(output_shapes) {}
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
       return std::unique_ptr<IteratorBase>(
           new Iterator({this, strings::StrCat(prefix, "::Generator")}));
@@ -112,7 +112,9 @@ class GeneratorDatasetOp : public DatasetOpKernel {
       return output_shapes_;
     }
 
-    string DebugString() override { return "GeneratorDatasetOp::Dataset"; }
+    string DebugString() const override {
+      return "GeneratorDatasetOp::Dataset";
+    }
 
    private:
     class Iterator : public DatasetIterator<Dataset> {
diff --git a/tensorflow/core/kernels/data/group_by_reducer_dataset_op.cc b/tensorflow/core/kernels/data/group_by_reducer_dataset_op.cc
index c8aeaab9cba5e8c3140bbd05eb829fc67d2c4c51..03abae79d210da0e0190fc87a5503b1ad685700a 100644
--- a/tensorflow/core/kernels/data/group_by_reducer_dataset_op.cc
+++ b/tensorflow/core/kernels/data/group_by_reducer_dataset_op.cc
@@ -88,7 +88,7 @@ class GroupByReducerDatasetOp : public UnaryDatasetOpKernel {
 
     ~Dataset() override { input_->Unref(); }
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
       return std::unique_ptr<IteratorBase>(
           new Iterator({this, strings::StrCat(prefix, "::GroupByReducer")}));
@@ -101,7 +101,9 @@ class GroupByReducerDatasetOp : public UnaryDatasetOpKernel {
       return output_shapes_;
     }
 
-    string DebugString() override { return "GroupByReducerDatasetOp::Dataset"; }
+    string DebugString() const override {
+      return "GroupByReducerDatasetOp::Dataset";
+    }
 
    protected:
     Status AsGraphDefInternal(OpKernelContext* ctx, DatasetGraphDefBuilder* b,
@@ -183,8 +185,11 @@ class GroupByReducerDatasetOp : public UnaryDatasetOpKernel {
     class Iterator : public DatasetIterator<Dataset> {
      public:
       explicit Iterator(const Params& params)
-          : DatasetIterator<Dataset>(params),
-            input_impl_(params.dataset->input_->MakeIterator(params.prefix)) {}
+          : DatasetIterator<Dataset>(params) {}
+
+      Status Initialize(IteratorContext* ctx) override {
+        return dataset()->input_->MakeIterator(ctx, prefix(), &input_impl_);
+      }
 
       Status GetNextInternal(IteratorContext* ctx,
                              std::vector<Tensor>* out_tensors,
diff --git a/tensorflow/core/kernels/data/group_by_window_dataset_op.cc b/tensorflow/core/kernels/data/group_by_window_dataset_op.cc
index 03f847ce9c6e03eb5c401a44102b68630f1563a2..23d769e1ab0e6121196c26adff551af3499e36c0 100644
--- a/tensorflow/core/kernels/data/group_by_window_dataset_op.cc
+++ b/tensorflow/core/kernels/data/group_by_window_dataset_op.cc
@@ -118,7 +118,7 @@ class GroupByWindowDatasetOp : public UnaryDatasetOpKernel {
 
     ~Dataset() override { input_->Unref(); }
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
       return std::unique_ptr<IteratorBase>(
           new Iterator({this, strings::StrCat(prefix, "::GroupByWindow")}));
@@ -131,7 +131,9 @@ class GroupByWindowDatasetOp : public UnaryDatasetOpKernel {
       return output_shapes_;
     }
 
-    string DebugString() override { return "GroupByWindowDatasetOp::Dataset"; }
+    string DebugString() const override {
+      return "GroupByWindowDatasetOp::Dataset";
+    }
 
    protected:
     Status AsGraphDefInternal(OpKernelContext* ctx, DatasetGraphDefBuilder* b,
@@ -198,8 +200,11 @@ class GroupByWindowDatasetOp : public UnaryDatasetOpKernel {
     class Iterator : public DatasetIterator<Dataset> {
      public:
       explicit Iterator(const Params& params)
-          : DatasetIterator<Dataset>(params),
-            input_impl_(params.dataset->input_->MakeIterator(params.prefix)) {}
+          : DatasetIterator<Dataset>(params) {}
+
+      Status Initialize(IteratorContext* ctx) override {
+        return dataset()->input_->MakeIterator(ctx, prefix(), &input_impl_);
+      }
 
       Status GetNextInternal(IteratorContext* ctx,
                              std::vector<Tensor>* out_tensors,
@@ -484,8 +489,8 @@ class GroupByWindowDatasetOp : public UnaryDatasetOpKernel {
             GetDatasetFromVariantTensor(return_values[0], &returned_dataset));
 
         // Create an iterator for the dataset that was returned by `f`.
-        current_group_iterator_ = returned_dataset->MakeIterator(prefix());
-        return Status::OK();
+        return returned_dataset->MakeIterator(ctx, prefix(),
+                                              &current_group_iterator_);
       }
 
       mutex mu_;
diff --git a/tensorflow/core/kernels/data/identity_dataset_op.cc b/tensorflow/core/kernels/data/identity_dataset_op.cc
new file mode 100644
index 0000000000000000000000000000000000000000..e28f18833616797c8d0eb62ebfb9e1001df16093
--- /dev/null
+++ b/tensorflow/core/kernels/data/identity_dataset_op.cc
@@ -0,0 +1,102 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <map>
+
+#include "tensorflow/core/framework/tensor.h"
+#include "tensorflow/core/kernels/data/dataset.h"
+
+namespace tensorflow {
+namespace {
+
+// The purpose of identity dataset is to serve as a placeholder when performing
+// optimizations. It is not expected to be surfaced in the Python API.
+class IdentityDatasetOp : public UnaryDatasetOpKernel {
+ public:
+  explicit IdentityDatasetOp(OpKernelConstruction* ctx)
+      : UnaryDatasetOpKernel(ctx) {
+    OP_REQUIRES_OK(ctx, ctx->GetAttr("output_types", &output_types_));
+    OP_REQUIRES_OK(ctx, ctx->GetAttr("output_shapes", &output_shapes_));
+  }
+
+ protected:
+  void MakeDataset(OpKernelContext* ctx, DatasetBase* input,
+                   DatasetBase** output) override {
+    *output = new Dataset(ctx, input);
+  }
+
+ private:
+  class Dataset : public GraphDatasetBase {
+   public:
+    Dataset(OpKernelContext* ctx, const DatasetBase* input)
+        : GraphDatasetBase(ctx), input_(input) {
+      input_->Ref();
+    }
+
+    ~Dataset() override { input_->Unref(); }
+
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
+        const string& prefix) const override {
+      return std::unique_ptr<IteratorBase>(
+          new Iterator({this, strings::StrCat(prefix, "::Identity")}));
+    }
+
+    const DataTypeVector& output_dtypes() const override {
+      return input_->output_dtypes();
+    }
+
+    const std::vector<PartialTensorShape>& output_shapes() const override {
+      return input_->output_shapes();
+    }
+
+    string DebugString() const override { return "IdentityDatasetOp::Dataset"; }
+
+   protected:
+    Status AsGraphDefInternal(OpKernelContext* ctx, DatasetGraphDefBuilder* b,
+                              Node** output) const override {
+      Node* input_graph_node = nullptr;
+      TF_RETURN_IF_ERROR(b->AddParentDataset(ctx, input_, &input_graph_node));
+      TF_RETURN_IF_ERROR(b->AddDataset(this, {input_graph_node}, output));
+      return Status::OK();
+    }
+
+   private:
+    class Iterator : public DatasetIterator<Dataset> {
+     public:
+      explicit Iterator(const Params& params)
+          : DatasetIterator<Dataset>(params) {}
+
+      Status Initialize(IteratorContext* ctx) override {
+        return errors::Unimplemented(strings::StrCat(prefix(), "::Initialize"));
+      }
+
+      Status GetNextInternal(IteratorContext* ctx,
+                             std::vector<Tensor>* out_tensors,
+                             bool* end_of_sequence) override {
+        return errors::Unimplemented(
+            strings::StrCat(prefix(), "::GetNextInternal"));
+      }
+    };
+
+    const DatasetBase* const input_;
+  };
+
+  DataTypeVector output_types_;
+  std::vector<PartialTensorShape> output_shapes_;
+};
+
+REGISTER_KERNEL_BUILDER(Name("IdentityDataset").Device(DEVICE_CPU),
+                        IdentityDatasetOp);
+}  // namespace
+}  // namespace tensorflow
diff --git a/tensorflow/core/kernels/data/interleave_dataset_op.cc b/tensorflow/core/kernels/data/interleave_dataset_op.cc
index bce3f28d62bf898e5137568c4241aff4392db65b..0765e639932845cb730432c7588910362eaa4260 100644
--- a/tensorflow/core/kernels/data/interleave_dataset_op.cc
+++ b/tensorflow/core/kernels/data/interleave_dataset_op.cc
@@ -96,7 +96,7 @@ class InterleaveDatasetOp : public UnaryDatasetOpKernel {
 
     ~Dataset() override { input_->Unref(); }
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
       return std::unique_ptr<IteratorBase>(
           new Iterator({this, strings::StrCat(prefix, "::Interleave")}));
@@ -109,7 +109,9 @@ class InterleaveDatasetOp : public UnaryDatasetOpKernel {
       return output_shapes_;
     }
 
-    string DebugString() override { return "InterleaveDatasetOp::Dataset"; }
+    string DebugString() const override {
+      return "InterleaveDatasetOp::Dataset";
+    }
 
    protected:
     Status AsGraphDefInternal(OpKernelContext* ctx, DatasetGraphDefBuilder* b,
@@ -149,10 +151,13 @@ class InterleaveDatasetOp : public UnaryDatasetOpKernel {
      public:
       explicit Iterator(const Params& params)
           : DatasetIterator<Dataset>(params),
-            input_impl_(params.dataset->input_->MakeIterator(params.prefix)),
             current_elements_(params.dataset->cycle_length_),
             args_list_(params.dataset->cycle_length_) {}
 
+      Status Initialize(IteratorContext* ctx) override {
+        return dataset()->input_->MakeIterator(ctx, prefix(), &input_impl_);
+      }
+
       void AdvanceToNextInCycle() EXCLUSIVE_LOCKS_REQUIRED(mu_) {
         block_index_ = 0;
         cycle_index_ = (cycle_index_ + 1) % dataset()->cycle_length_;
@@ -294,7 +299,7 @@ class InterleaveDatasetOp : public UnaryDatasetOpKernel {
       }
 
       mutex mu_;
-      const std::unique_ptr<IteratorBase> input_impl_ GUARDED_BY(mu_);
+      std::unique_ptr<IteratorBase> input_impl_ GUARDED_BY(mu_);
       std::vector<std::unique_ptr<IteratorBase>> current_elements_
           GUARDED_BY(mu_);
       std::vector<std::vector<Tensor>> args_list_ GUARDED_BY(mu_);
diff --git a/tensorflow/core/kernels/data/iterator_ops.cc b/tensorflow/core/kernels/data/iterator_ops.cc
index b6bf0ecd096f157e1ecf0cb1a02156b6844c8730..f33e9cec297835cfa99ad2e0deb35066228e270a 100644
--- a/tensorflow/core/kernels/data/iterator_ops.cc
+++ b/tensorflow/core/kernels/data/iterator_ops.cc
@@ -158,7 +158,10 @@ class IteratorResource : public ResourceBase {
         graph_runner.Run(&graph, lib, {}, {output_node}, &outputs));
     TF_RETURN_IF_ERROR(GetDatasetFromVariantTensor(outputs[0], &dataset));
 
-    TF_RETURN_IF_ERROR(set_iterator(dataset->MakeIterator("Iterator")));
+    IteratorContext iter_ctx = dataset::MakeIteratorContext(ctx);
+    std::unique_ptr<IteratorBase> iterator;
+    TF_RETURN_IF_ERROR(dataset->MakeIterator(&iter_ctx, "Iterator", &iterator));
+    TF_RETURN_IF_ERROR(set_iterator(std::move(iterator)));
     std::shared_ptr<IteratorBase> captured_iterator(iterator_);
 
     if (captured_iterator) {
@@ -204,12 +207,6 @@ class IteratorResource : public ResourceBase {
     return Status::OK();
   }
 
-
-  std::shared_ptr<StatsAggregator> stats_aggregator() {
-    tf_shared_lock l(mu_);
-    return stats_aggregator_;
-  }
-
   string DebugString() override { return "Iterator resource"; }
 
   const DataTypeVector& output_dtypes() const { return output_dtypes_; }
@@ -228,7 +225,6 @@ class IteratorResource : public ResourceBase {
   FunctionLibraryRuntime* lib_ = nullptr;  // not owned.
   std::shared_ptr<IteratorBase> iterator_;
   mutex mu_;
-  std::shared_ptr<StatsAggregator> stats_aggregator_ GUARDED_BY(mu_);
   std::shared_ptr<const FunctionLibraryDefinition> lib_def_ GUARDED_BY(mu_);
   const DataTypeVector output_dtypes_;
   const std::vector<PartialTensorShape> output_shapes_;
@@ -438,6 +434,9 @@ class IteratorStateVariant {
 REGISTER_UNARY_VARIANT_DECODE_FUNCTION(IteratorStateVariant,
                                        kIteratorVariantTypeName);
 
+// Note that IteratorHandleOp holds a reference to the resource it creates. If
+// cleaning up resources with DestroyResourceOp is important, consider creating
+// resource containers with AnonymousIteratorHandleOp instead.
 class IteratorHandleOp : public OpKernel {
  public:
   explicit IteratorHandleOp(OpKernelConstruction* ctx)
@@ -574,6 +573,75 @@ class IteratorHandleOp : public OpKernel {
   string name_;
 };
 
+// Like IteratorHandleOp, but creates handles which are never shared, and does
+// not hold a reference to these handles. The latter is important for eager
+// execution, since OpKernel instances generally live as long as the program
+// running them.
+class AnonymousIteratorHandleOp : public OpKernel {
+ public:
+  explicit AnonymousIteratorHandleOp(OpKernelConstruction* context)
+      : OpKernel(context), graph_def_version_(context->graph_def_version()) {
+    OP_REQUIRES_OK(context, context->GetAttr("output_types", &output_dtypes_));
+    OP_REQUIRES_OK(context, context->GetAttr("output_shapes", &output_shapes_));
+  }
+
+  void Compute(OpKernelContext* context) override {
+    FunctionLibraryRuntime* lib;
+    std::unique_ptr<DeviceMgr> device_mgr(nullptr);
+    std::unique_ptr<FunctionLibraryDefinition> flib_def(nullptr);
+    std::unique_ptr<ProcessFunctionLibraryRuntime> pflr(nullptr);
+    OP_REQUIRES_OK(context,
+                   context->function_library()->Clone(&flib_def, &pflr, &lib));
+
+    ResourceMgr* mgr = context->resource_manager();
+
+    const string container_name = "AnonymousIterator";
+    string unique_name;
+    {
+      mutex_lock l(static_resource_lookup_mutex_);
+      while (true) {  // Find an unused name
+        IteratorResource* existing_resource = nullptr;
+        unique_name = strings::StrCat("AnonymousIterator", current_id_++);
+        Status status = mgr->Lookup<IteratorResource>(
+            container_name, unique_name, &existing_resource);
+        if (status.code() == error::NOT_FOUND) {
+          break;
+        }
+        OP_REQUIRES_OK(context, status);
+        existing_resource->Unref();
+      }
+      IteratorResource* new_resource = new IteratorResource(
+          output_dtypes_, output_shapes_, graph_def_version_,
+          std::move(device_mgr), std::move(flib_def), std::move(pflr), lib);
+      // Create the resource with our chosen name under the resource lookup
+      // mutex to avoid another kernel racily creating a resource with this
+      // name.
+      OP_REQUIRES_OK(context, mgr->Create<IteratorResource>(
+                                  container_name, unique_name, new_resource));
+    }
+    OP_REQUIRES_OK(context, MakeResourceHandleToOutput(
+                                context, 0, container_name, unique_name,
+                                MakeTypeIndex<IteratorResource>()));
+  }
+
+ private:
+  // Coordinates Iterator unique name creation across AnonymousIteratorHandleOp
+  // instances.
+  static mutex static_resource_lookup_mutex_;
+  // current_id_ is just a hint for creating unique names. If it turns out
+  // there's a collision (e.g. because another AnonymousIteratorHandleOp
+  // instance is generating handles) we'll just skip that id.
+  static int64 current_id_ GUARDED_BY(static_resource_lookup_mutex_);
+  DataTypeVector output_dtypes_;
+  std::vector<PartialTensorShape> output_shapes_;
+  const int graph_def_version_;
+};
+
+// Static initializers for AnonymousIteratorHandleOp id counting.
+mutex AnonymousIteratorHandleOp::static_resource_lookup_mutex_{
+    LINKER_INITIALIZED};
+int64 AnonymousIteratorHandleOp::current_id_(0);
+
 class MakeIteratorOp : public OpKernel {
  public:
   explicit MakeIteratorOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}
@@ -585,8 +653,12 @@ class MakeIteratorOp : public OpKernel {
     OP_REQUIRES_OK(
         ctx, LookupResource(ctx, HandleFromInput(ctx, 1), &iterator_resource));
     core::ScopedUnref unref(iterator_resource);
-    OP_REQUIRES_OK(ctx, iterator_resource->set_iterator(
-                            dataset->MakeIterator("Iterator")));
+
+    IteratorContext iter_ctx = dataset::MakeIteratorContext(ctx);
+    std::unique_ptr<IteratorBase> iterator;
+    OP_REQUIRES_OK(ctx,
+                   dataset->MakeIterator(&iter_ctx, "Iterator", &iterator));
+    OP_REQUIRES_OK(ctx, iterator_resource->set_iterator(std::move(iterator)));
   }
 };
 
@@ -608,9 +680,12 @@ class ToSingleElementOp : public AsyncOpKernel {
       DatasetBase* dataset;
       OP_REQUIRES_OK_ASYNC(
           ctx, GetDatasetFromVariantTensor(ctx->input(0), &dataset), done);
-      auto iterator = dataset->MakeIterator("SingleElementIterator");
-
       IteratorContext iter_ctx = dataset::MakeIteratorContext(ctx);
+      std::unique_ptr<IteratorBase> iterator;
+      OP_REQUIRES_OK_ASYNC(
+          ctx,
+          dataset->MakeIterator(&iter_ctx, "SingleElementIterator", &iterator),
+          done);
       std::vector<Tensor> components;
       components.reserve(dataset->output_dtypes().size());
       bool end_of_sequence;
@@ -700,7 +775,7 @@ class OneShotIteratorOp : public AsyncOpKernel {
         return;
       }
     }
-    ProduceOutput(ctx, std::move(done));
+    ProduceOutput(ctx, done);
   }
 
  private:
@@ -721,9 +796,9 @@ class OneShotIteratorOp : public AsyncOpKernel {
     }
 
     for (auto&& ctx_done : callbacks_to_run) {
-      ProduceOutput(ctx_done.first, std::move(ctx_done.second));
+      ProduceOutput(ctx_done.first, ctx_done.second);
     }
-    ProduceOutput(ctx, std::move(done));
+    ProduceOutput(ctx, done);
   }
 
   Status TryInit(OpKernelContext* ctx, IteratorResource** iterator,
@@ -794,8 +869,10 @@ class OneShotIteratorOp : public AsyncOpKernel {
     // factory function.
     DatasetBase* dataset;
     TF_RETURN_IF_ERROR(GetDatasetFromVariantTensor(return_values[0], &dataset));
-    TF_RETURN_IF_ERROR(
-        (*iterator)->set_iterator(dataset->MakeIterator("Iterator")));
+    IteratorContext iter_ctx = dataset::MakeIteratorContext(ctx);
+    std::unique_ptr<IteratorBase> iter;
+    TF_RETURN_IF_ERROR(dataset->MakeIterator(&iter_ctx, "Iterator", &iter));
+    TF_RETURN_IF_ERROR((*iterator)->set_iterator(std::move(iter)));
 
     (*iterator)->Ref();
     return Status::OK();
@@ -860,9 +937,6 @@ class IteratorGetNextOp : public AsyncOpKernel {
 
           IteratorContext::Params params;
           params.env = ctx->env();
-          params.stats_aggregator_getter = [iterator]() {
-            return iterator->stats_aggregator();
-          };
           params.runner = *(ctx->runner());
           params.function_library = iterator->function_library();
           DeviceBase* device = ctx->function_library()->device();
@@ -911,9 +985,6 @@ class IteratorGetNextSyncOp : public OpKernel {
 
     IteratorContext::Params params;
     params.env = ctx->env();
-    params.stats_aggregator_getter = [iterator]() {
-      return iterator->stats_aggregator();
-    };
     params.runner = *(ctx->runner());
     params.function_library = iterator->function_library();
     DeviceBase* device = ctx->function_library()->device();
@@ -1066,6 +1137,8 @@ class DeserializeIteratorOp : public OpKernel {
 REGISTER_KERNEL_BUILDER(Name("Iterator").Device(DEVICE_CPU), IteratorHandleOp);
 REGISTER_KERNEL_BUILDER(Name("MakeIterator").Device(DEVICE_CPU),
                         MakeIteratorOp);
+REGISTER_KERNEL_BUILDER(Name("AnonymousIterator").Device(DEVICE_CPU),
+                        AnonymousIteratorHandleOp);
 REGISTER_KERNEL_BUILDER(Name("DatasetToSingleElement").Device(DEVICE_CPU),
                         ToSingleElementOp);
 REGISTER_KERNEL_BUILDER(Name("OneShotIterator").Device(DEVICE_CPU),
diff --git a/tensorflow/core/kernels/data/map_and_batch_dataset_op.cc b/tensorflow/core/kernels/data/map_and_batch_dataset_op.cc
index 879bb40331e139e240d24d1d537266e50a534423..703ef194a13e4c08d2a8c5e94e4a7400e81a8f0d 100644
--- a/tensorflow/core/kernels/data/map_and_batch_dataset_op.cc
+++ b/tensorflow/core/kernels/data/map_and_batch_dataset_op.cc
@@ -125,7 +125,7 @@ class MapAndBatchDatasetOp : public UnaryDatasetOpKernel {
 
     ~Dataset() override { input_->Unref(); }
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
       return std::unique_ptr<IteratorBase>(
           new Iterator({this, strings::StrCat(prefix, "::MapAndBatch")}));
@@ -139,7 +139,9 @@ class MapAndBatchDatasetOp : public UnaryDatasetOpKernel {
       return output_shapes_;
     }
 
-    string DebugString() override { return "MapAndBatchDatasetOp::Dataset"; }
+    string DebugString() const override {
+      return "MapAndBatchDatasetOp::Dataset";
+    }
 
    protected:
     Status AsGraphDefInternal(OpKernelContext* ctx, DatasetGraphDefBuilder* b,
@@ -188,7 +190,6 @@ class MapAndBatchDatasetOp : public UnaryDatasetOpKernel {
      public:
       explicit Iterator(const Params& params)
           : DatasetIterator<Dataset>(params),
-            input_impl_(params.dataset->input_->MakeIterator(params.prefix)),
             batch_results_((params.dataset->num_parallel_calls_ +
                             params.dataset->batch_size_ - 1) /
                            params.dataset->batch_size_) {
@@ -208,9 +209,14 @@ class MapAndBatchDatasetOp : public UnaryDatasetOpKernel {
         }
       }
 
+      Status Initialize(IteratorContext* ctx) override {
+        return dataset()->input_->MakeIterator(ctx, prefix(), &input_impl_);
+      }
+
       Status GetNextInternal(IteratorContext* ctx,
                              std::vector<Tensor>* out_tensors,
                              bool* end_of_sequence) override {
+        mutex_lock external_l(external_mu_);
         mutex_lock l(mu_);
         EnsureRunnerThreadStarted(ctx);
         BatchResult* result = &batch_results_[ComputeIndex(input_batch_)];
@@ -220,6 +226,7 @@ class MapAndBatchDatasetOp : public UnaryDatasetOpKernel {
 
      protected:
       Status SaveInternal(IteratorStateWriter* writer) override {
+        mutex_lock external_l(external_mu_);
         mutex_lock l(mu_);
         // Wait for all in-flight calls to complete.
         while (num_calls_ > 0) {
@@ -243,6 +250,7 @@ class MapAndBatchDatasetOp : public UnaryDatasetOpKernel {
 
       Status RestoreInternal(IteratorContext* ctx,
                              IteratorStateReader* reader) override {
+        mutex_lock external_l(external_mu_);
         mutex_lock l(mu_);
         TF_RETURN_IF_ERROR(RestoreParent(ctx, reader, input_impl_));
         TF_RETURN_IF_ERROR(
@@ -629,6 +637,8 @@ class MapAndBatchDatasetOp : public UnaryDatasetOpKernel {
         return Status::OK();
       }
 
+      // Used for coordination between the main thread, the runner thread, and
+      // the callback threads.
       mutex mu_;
       // Used for coordination between the main thread, the runner thread, and
       // the callback threads. In particular, the runner thread should only
@@ -636,11 +646,13 @@ class MapAndBatchDatasetOp : public UnaryDatasetOpKernel {
       // user specified level of parallelism and there are slots available in
       // the `batch_results_` buffer.
       condition_variable cond_var_;
+      // Used for serializing external parallelism.
+      mutex external_mu_ ACQUIRED_BEFORE(mu_);
       // Counts the number of outstanding calls for this batch.
       int64 num_calls_ GUARDED_BY(mu_) = 0;
       // Counts the total number of calls.
       int64 call_counter_ GUARDED_BY(mu_) = 0;
-      const std::unique_ptr<IteratorBase> input_impl_;
+      std::unique_ptr<IteratorBase> input_impl_;
       // Identifies the next batch to be read by the caller.
       int64 input_batch_ GUARDED_BY(mu_) = 0;
       // Identifies the next batch to create.
diff --git a/tensorflow/core/kernels/data/map_dataset_op.cc b/tensorflow/core/kernels/data/map_dataset_op.cc
index 89360d1cd95e896ebf284a0058edb122c7f82d09..aa530aea192a295b5b7861a116d3e301781600d0 100644
--- a/tensorflow/core/kernels/data/map_dataset_op.cc
+++ b/tensorflow/core/kernels/data/map_dataset_op.cc
@@ -73,7 +73,7 @@ class MapDatasetOp : public UnaryDatasetOpKernel {
 
     ~Dataset() override { input_->Unref(); }
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
       return std::unique_ptr<IteratorBase>(
           new Iterator({this, strings::StrCat(prefix, "::Map")}));
@@ -86,7 +86,7 @@ class MapDatasetOp : public UnaryDatasetOpKernel {
       return output_shapes_;
     }
 
-    string DebugString() override { return "MapDatasetOp::Dataset"; }
+    string DebugString() const override { return "MapDatasetOp::Dataset"; }
 
    protected:
     Status AsGraphDefInternal(OpKernelContext* ctx, DatasetGraphDefBuilder* b,
@@ -123,8 +123,11 @@ class MapDatasetOp : public UnaryDatasetOpKernel {
     class Iterator : public DatasetIterator<Dataset> {
      public:
       explicit Iterator(const Params& params)
-          : DatasetIterator<Dataset>(params),
-            input_impl_(params.dataset->input_->MakeIterator(params.prefix)) {}
+          : DatasetIterator<Dataset>(params) {}
+
+      Status Initialize(IteratorContext* ctx) override {
+        return dataset()->input_->MakeIterator(ctx, prefix(), &input_impl_);
+      }
 
       Status GetNextInternal(IteratorContext* ctx,
                              std::vector<Tensor>* out_tensors,
@@ -167,7 +170,7 @@ class MapDatasetOp : public UnaryDatasetOpKernel {
       }
 
      private:
-      const std::unique_ptr<IteratorBase> input_impl_;
+      std::unique_ptr<IteratorBase> input_impl_;
     };
 
     const DatasetBase* const input_;
diff --git a/tensorflow/core/kernels/data/optimize_dataset_op.cc b/tensorflow/core/kernels/data/optimize_dataset_op.cc
new file mode 100644
index 0000000000000000000000000000000000000000..8965858e8dd490b06a6457845543cbbd6595cdbe
--- /dev/null
+++ b/tensorflow/core/kernels/data/optimize_dataset_op.cc
@@ -0,0 +1,210 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <map>
+
+#include "tensorflow/core/common_runtime/device_mgr.h"
+#include "tensorflow/core/common_runtime/graph_runner.h"
+#include "tensorflow/core/common_runtime/process_function_library_runtime.h"
+#include "tensorflow/core/framework/device_base.h"
+#include "tensorflow/core/framework/partial_tensor_shape.h"
+#include "tensorflow/core/framework/tensor.h"
+#include "tensorflow/core/graph/graph_constructor.h"
+#include "tensorflow/core/graph/graph_def_builder.h"
+#include "tensorflow/core/grappler/clusters/virtual_cluster.h"
+#include "tensorflow/core/grappler/graph_view.h"
+#include "tensorflow/core/grappler/grappler_item.h"
+#include "tensorflow/core/grappler/grappler_item_builder.h"
+#include "tensorflow/core/grappler/optimizers/meta_optimizer.h"
+#include "tensorflow/core/kernels/data/dataset.h"
+#include "tensorflow/core/lib/random/random.h"
+#include "tensorflow/core/protobuf/meta_graph.pb.h"
+#include "tensorflow/core/protobuf/rewriter_config.pb.h"
+
+namespace tensorflow {
+namespace {
+
+// See documentation in ../ops/dataset_ops.cc for a high-level
+// description of the following op.
+class OptimizeDatasetOp : public UnaryDatasetOpKernel {
+ public:
+  explicit OptimizeDatasetOp(OpKernelConstruction* ctx)
+      : UnaryDatasetOpKernel(ctx),
+        graph_def_version_(ctx->graph_def_version()) {
+    OP_REQUIRES_OK(ctx, ctx->GetAttr("output_types", &output_types_));
+    OP_REQUIRES_OK(ctx, ctx->GetAttr("output_shapes", &output_shapes_));
+  }
+
+ protected:
+  void MakeDataset(OpKernelContext* ctx, DatasetBase* input,
+                   DatasetBase** output) override {
+    std::vector<string> optimizations;
+    OP_REQUIRES_OK(
+        ctx, ParseVectorArgument<string>(ctx, "optimizations", &optimizations));
+    Dataset* dataset =
+        new Dataset(ctx, input, optimizations, output_types_, output_shapes_);
+    core::ScopedUnref unref(dataset);
+    OP_REQUIRES_OK(ctx, dataset->Optimize(ctx, output));
+  }
+
+ private:
+  class Dataset : public GraphDatasetBase {
+   public:
+    Dataset(OpKernelContext* ctx, const DatasetBase* input,
+            const std::vector<string>& optimizations,
+            const DataTypeVector& output_types,
+            const std::vector<PartialTensorShape>& output_shapes)
+        : GraphDatasetBase(ctx),
+          input_(input),
+          optimizations_(optimizations),
+          output_types_(output_types),
+          output_shapes_(output_shapes) {
+      input_->Ref();
+    }
+
+    ~Dataset() override { input_->Unref(); }
+
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
+        const string& prefix) const override {
+      return std::unique_ptr<IteratorBase>(
+          new Iterator({this, strings::StrCat(prefix, "::Optimize")}));
+    }
+
+    Status Optimize(OpKernelContext* ctx, DatasetBase** output) {
+      GraphDefBuilder b;
+      DatasetGraphDefBuilder db(&b);
+      Node* input_node = nullptr;
+      TF_RETURN_IF_ERROR(db.AddParentDataset(ctx, input_, &input_node));
+      string output_node = input_node->name();
+      GraphDef graph_def;
+      TF_RETURN_IF_ERROR(b.ToGraphDef(&graph_def));
+      TF_RETURN_IF_ERROR(ApplyOptimizations(ctx, &graph_def, &output_node));
+
+      Graph graph(OpRegistry::Global());
+      TF_RETURN_IF_ERROR(ImportGraphDef({}, graph_def, &graph, nullptr));
+      std::vector<Tensor> outputs;
+      GraphRunner graph_runner(ctx->env());
+      // Once rewrites that add/modify functions are introduced, we will need
+      // persist the results in a function library runtime.
+      TF_RETURN_IF_ERROR(graph_runner.Run(&graph, ctx->function_library(), {},
+                                          {output_node}, &outputs));
+      TF_RETURN_IF_ERROR(GetDatasetFromVariantTensor(outputs[0], output));
+      (*output)->Ref();
+      return Status::OK();
+    }
+
+    const DataTypeVector& output_dtypes() const override {
+      return output_types_;
+    }
+    const std::vector<PartialTensorShape>& output_shapes() const override {
+      return output_shapes_;
+    }
+
+    string DebugString() const override { return "OptimizeDatasetOp::Dataset"; }
+
+   private:
+    class Iterator : public DatasetIterator<Dataset> {
+     public:
+      explicit Iterator(const Params& params)
+          : DatasetIterator<Dataset>(params) {}
+
+      Status Initialize(IteratorContext* ctx) override {
+        return errors::Unimplemented(strings::StrCat(prefix(), "::Initialize"));
+      }
+
+      Status GetNextInternal(IteratorContext* ctx,
+                             std::vector<Tensor>* out_tensors,
+                             bool* end_of_sequence) override {
+        return errors::Unimplemented(
+            strings::StrCat(prefix(), "::GetNextInternal"));
+      }
+    };
+
+    Status ApplyOptimizations(OpKernelContext* ctx, GraphDef* graph_def,
+                              string* output_node) {
+      // Add a fake sink node to allow rewriting the actual sink node.
+      NodeDef* node = graph_def->mutable_node()->Add();
+      node->set_name("FakeSink");
+      node->set_op("IdentityDataset");
+      node->add_input(*output_node);
+      {
+        grappler::GraphView graph(graph_def);
+        NodeDef* sink = graph.GetNode(*output_node);
+        (*node->mutable_attr())["output_shapes"] =
+            sink->attr().at("output_shapes");
+        (*node->mutable_attr())["output_types"] =
+            sink->attr().at("output_types");
+      }
+
+      // Create metagraph.
+      MetaGraphDef meta_graph_def;
+      (*meta_graph_def.mutable_graph_def()) = *graph_def;
+
+      // Grappler determines fetch ops from collection 'train_op'.
+      CollectionDef collection_def;
+      auto node_list = collection_def.mutable_node_list();
+      node_list->add_value("FakeSink");
+      (*meta_graph_def.mutable_collection_def())["train_op"] = collection_def;
+
+      // Create Grappler item.
+      tensorflow::RewriterConfig rewriter_config;
+      for (const string& optimization : optimizations_) {
+        rewriter_config.add_optimizers(optimization);
+      }
+      // If no optimizations were specified, supply a non-existent optimization
+      // to prevent Grappler from applying the default set of optimizations as
+      // some of them do not work out of the box at the moment (e.g. because we
+      // have no cost model for dataset ops).
+      if (optimizations_.empty()) {
+        rewriter_config.add_optimizers("non-existent");
+      }
+      tensorflow::grappler::ItemConfig item_config;
+      item_config.apply_optimizations = true;
+      std::unique_ptr<tensorflow::grappler::GrapplerItem> grappler_item =
+          tensorflow::grappler::GrapplerItemFromMetaGraphDef(
+              "graph", meta_graph_def, item_config);
+      std::unordered_map<string, tensorflow::DeviceProperties> device_map;
+      tensorflow::grappler::VirtualCluster cluster(device_map);
+
+      // Run optimizer.
+      TF_RETURN_IF_ERROR(tensorflow::grappler::RunMetaOptimizer(
+          *grappler_item, rewriter_config, ctx->device(), &cluster, graph_def));
+
+      // Set `output_node` to the input of the fake sink node.
+      {
+        grappler::GraphView graph(graph_def);
+        grappler::GraphView::InputPort input_port =
+            graph.GetInputPort("FakeSink", 0);
+        *output_node = graph.GetRegularFanin(input_port).node->name();
+      }
+
+      return Status::OK();
+    }
+
+    const DatasetBase* input_;
+    const std::vector<string> optimizations_;
+    const DataTypeVector output_types_;
+    const std::vector<PartialTensorShape> output_shapes_;
+  };
+
+  const int graph_def_version_;
+  DataTypeVector output_types_;
+  std::vector<PartialTensorShape> output_shapes_;
+};
+
+REGISTER_KERNEL_BUILDER(Name("OptimizeDataset").Device(DEVICE_CPU),
+                        OptimizeDatasetOp);
+
+}  // namespace
+}  // namespace tensorflow
diff --git a/tensorflow/core/kernels/data/padded_batch_dataset_op.cc b/tensorflow/core/kernels/data/padded_batch_dataset_op.cc
index e41800a8069ed5a5432395184e0ab3eb713c0523..59cbdb655d954ed72b176703b8f57f283a6875b0 100644
--- a/tensorflow/core/kernels/data/padded_batch_dataset_op.cc
+++ b/tensorflow/core/kernels/data/padded_batch_dataset_op.cc
@@ -28,7 +28,8 @@ namespace {
 class PaddedBatchDatasetOp : public UnaryDatasetOpKernel {
  public:
   explicit PaddedBatchDatasetOp(OpKernelConstruction* ctx)
-      : UnaryDatasetOpKernel(ctx) {}
+      : UnaryDatasetOpKernel(ctx),
+        op_version_(ctx->def().op() == "PaddedBatchDataset" ? 1 : 2) {}
 
   void MakeDataset(OpKernelContext* ctx, DatasetBase* input,
                    DatasetBase** output) override {
@@ -39,6 +40,12 @@ class PaddedBatchDatasetOp : public UnaryDatasetOpKernel {
         ctx, batch_size > 0,
         errors::InvalidArgument("Batch size must be greater than zero."));
 
+    bool drop_remainder = false;
+    if (op_version_ > 1) {
+      OP_REQUIRES_OK(ctx, ParseScalarArgument<bool>(ctx, "drop_remainder",
+                                                    &drop_remainder));
+    }
+
     OpInputList padded_shape_tensors;
     OP_REQUIRES_OK(ctx,
                    ctx->input_list("padded_shapes", &padded_shape_tensors));
@@ -85,18 +92,20 @@ class PaddedBatchDatasetOp : public UnaryDatasetOpKernel {
       padding_values.push_back(tensor::DeepCopy(padding_value_t));
     }
 
-    *output = new Dataset(ctx, batch_size, std::move(padded_shapes),
-                          std::move(padding_values), input);
+    *output =
+        new Dataset(ctx, batch_size, drop_remainder, std::move(padded_shapes),
+                    std::move(padding_values), input);
   }
 
  private:
   class Dataset : public GraphDatasetBase {
    public:
-    Dataset(OpKernelContext* ctx, int64 batch_size,
+    Dataset(OpKernelContext* ctx, int64 batch_size, bool drop_remainder,
             std::vector<PartialTensorShape> padded_shapes,
             std::vector<Tensor> padding_values, const DatasetBase* input)
         : GraphDatasetBase(ctx),
           batch_size_(batch_size),
+          drop_remainder_(drop_remainder),
           padded_shapes_(std::move(padded_shapes)),
           padding_values_(std::move(padding_values)),
           input_(input) {
@@ -112,14 +121,19 @@ class PaddedBatchDatasetOp : public UnaryDatasetOpKernel {
       const auto& input_shapes = input_->output_shapes();
       output_shapes_.reserve(input_shapes.size());
       for (size_t i = 0; i < input_shapes.size(); ++i) {
-        output_shapes_.push_back(
-            PartialTensorShape({-1}).Concatenate(padded_shapes_[i]));
+        if (drop_remainder_) {
+          output_shapes_.push_back(
+              PartialTensorShape({batch_size_}).Concatenate(padded_shapes_[i]));
+        } else {
+          output_shapes_.push_back(
+              PartialTensorShape({-1}).Concatenate(padded_shapes_[i]));
+        }
       }
     }
 
     ~Dataset() override { input_->Unref(); }
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
       return std::unique_ptr<IteratorBase>(
           new Iterator({this, strings::StrCat(prefix, "::PaddedBatch")}));
@@ -133,7 +147,7 @@ class PaddedBatchDatasetOp : public UnaryDatasetOpKernel {
       return output_shapes_;
     }
 
-    string DebugString() override {
+    string DebugString() const override {
       return strings::StrCat("PaddedBatchDatasetOp(", batch_size_,
                              ")::Dataset");
     }
@@ -166,16 +180,19 @@ class PaddedBatchDatasetOp : public UnaryDatasetOpKernel {
         padding_values.emplace_back(node);
       }
 
+      Node* drop_remainder = nullptr;
+      TF_RETURN_IF_ERROR(b->AddScalar(drop_remainder_, &drop_remainder));
+
       AttrValue output_types;
       b->BuildAttrValue(output_dtypes(), &output_types);
 
       AttrValue N;
       b->BuildAttrValue<int64>(padded_shapes_.size(), &N);
 
-      TF_RETURN_IF_ERROR(
-          b->AddDataset(this, {{0, input_graph_node}, {1, batch_size}},
-                        {{2, padded_shapes}, {3, padding_values}},
-                        {{"Toutput_types", output_types}, {"N", N}}, output));
+      TF_RETURN_IF_ERROR(b->AddDataset(
+          this, {{0, input_graph_node}, {1, batch_size}, {4, drop_remainder}},
+          {{2, padded_shapes}, {3, padding_values}},
+          {{"Toutput_types", output_types}, {"N", N}}, output));
       return Status::OK();
     }
 
@@ -186,8 +203,11 @@ class PaddedBatchDatasetOp : public UnaryDatasetOpKernel {
     class Iterator : public DatasetIterator<Dataset> {
      public:
       explicit Iterator(const Params& params)
-          : DatasetIterator<Dataset>(params),
-            input_impl_(params.dataset->input_->MakeIterator(params.prefix)) {}
+          : DatasetIterator<Dataset>(params) {}
+
+      Status Initialize(IteratorContext* ctx) override {
+        return dataset()->input_->MakeIterator(ctx, prefix(), &input_impl_);
+      }
 
       Status GetNextInternal(IteratorContext* ctx,
                              std::vector<Tensor>* out_tensors,
@@ -223,6 +243,12 @@ class PaddedBatchDatasetOp : public UnaryDatasetOpKernel {
           return Status::OK();
         }
 
+        if (dataset()->drop_remainder_ &&
+            batch_elements.size() < dataset()->batch_size_) {
+          *end_of_sequence = true;
+          return Status::OK();
+        }
+
         // Copy the retrieved batch elements into one output tensor
         // per tuple component.
         // NOTE(mrry): If the input or output sizes are statically
@@ -325,7 +351,8 @@ class PaddedBatchDatasetOp : public UnaryDatasetOpKernel {
         if (reader->Contains(full_name("exhausted"))) {
           input_impl_.reset();
         } else {
-          input_impl_ = dataset()->input_->MakeIterator(prefix());
+          TF_RETURN_IF_ERROR(
+              dataset()->input_->MakeIterator(ctx, prefix(), &input_impl_));
           TF_RETURN_IF_ERROR(RestoreParent(ctx, reader, input_impl_));
         }
         return Status::OK();
@@ -337,16 +364,22 @@ class PaddedBatchDatasetOp : public UnaryDatasetOpKernel {
     };
 
     const int64 batch_size_;
+    const bool drop_remainder_;
     const std::vector<PartialTensorShape> padded_shapes_;
     const std::vector<Tensor> padding_values_;
     const DatasetBase* const input_;
     std::vector<PartialTensorShape> output_shapes_;
   };
+
+  const int op_version_;
 };
 
 REGISTER_KERNEL_BUILDER(Name("PaddedBatchDataset").Device(DEVICE_CPU),
                         PaddedBatchDatasetOp);
 
+REGISTER_KERNEL_BUILDER(Name("PaddedBatchDatasetV2").Device(DEVICE_CPU),
+                        PaddedBatchDatasetOp);
+
 }  // namespace
 
 }  // namespace tensorflow
diff --git a/tensorflow/core/kernels/data/parallel_interleave_dataset_op.cc b/tensorflow/core/kernels/data/parallel_interleave_dataset_op.cc
index fa33867ec1181257931715f42478e8518b27db6e..6292b4536ed563de55bf677571a2f485232835b3 100644
--- a/tensorflow/core/kernels/data/parallel_interleave_dataset_op.cc
+++ b/tensorflow/core/kernels/data/parallel_interleave_dataset_op.cc
@@ -116,7 +116,7 @@ class ParallelInterleaveDatasetOp : public UnaryDatasetOpKernel {
 
     ~Dataset() override { input_->Unref(); }
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
       return std::unique_ptr<IteratorBase>(new Iterator(
           {this, strings::StrCat(prefix, "::ParallelInterleave")}));
@@ -129,7 +129,7 @@ class ParallelInterleaveDatasetOp : public UnaryDatasetOpKernel {
       return output_shapes_;
     }
 
-    string DebugString() override {
+    string DebugString() const override {
       return "ParallelInterleaveDatasetOp::Dataset";
     }
 
@@ -236,7 +236,6 @@ class ParallelInterleaveDatasetOp : public UnaryDatasetOpKernel {
      public:
       explicit Iterator(const Params& params)
           : DatasetIterator<Dataset>(params),
-            input_impl_(params.dataset->input_->MakeIterator(params.prefix)),
             workers_(dataset()->num_threads()),
             worker_thread_states_(dataset()->num_threads()) {}
 
@@ -249,6 +248,10 @@ class ParallelInterleaveDatasetOp : public UnaryDatasetOpKernel {
         }
       }
 
+      Status Initialize(IteratorContext* ctx) override {
+        return dataset()->input_->MakeIterator(ctx, prefix(), &input_impl_);
+      }
+
       // It is implemented so that it matches the deterministic interleave
       // unless getting the next element would block and we are allowed to be
       // sloppy.
diff --git a/tensorflow/core/kernels/data/parallel_map_dataset_op.cc b/tensorflow/core/kernels/data/parallel_map_dataset_op.cc
index 7e373f25686899ec8599fc064f9cf7beb3ebfe95..3fa6b0d3a9390a1cc955880950f87564979dadee 100644
--- a/tensorflow/core/kernels/data/parallel_map_dataset_op.cc
+++ b/tensorflow/core/kernels/data/parallel_map_dataset_op.cc
@@ -85,7 +85,7 @@ class ParallelMapDatasetOp : public UnaryDatasetOpKernel {
 
     ~Dataset() override { input_->Unref(); }
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
       return std::unique_ptr<IteratorBase>(
           new Iterator({this, strings::StrCat(prefix, "::ParallelMap")}));
@@ -99,7 +99,9 @@ class ParallelMapDatasetOp : public UnaryDatasetOpKernel {
       return output_shapes_;
     }
 
-    string DebugString() override { return "ParallelMapDatasetOp::Dataset"; }
+    string DebugString() const override {
+      return "ParallelMapDatasetOp::Dataset";
+    }
 
    protected:
     Status AsGraphDefInternal(OpKernelContext* ctx, DatasetGraphDefBuilder* b,
@@ -150,7 +152,6 @@ class ParallelMapDatasetOp : public UnaryDatasetOpKernel {
      public:
       explicit Iterator(const Params& params)
           : DatasetIterator<Dataset>(params),
-            input_impl_(params.dataset->input_->MakeIterator(params.prefix)),
             invocation_results_(params.dataset->num_parallel_calls_) {}
 
       ~Iterator() override {
@@ -169,6 +170,10 @@ class ParallelMapDatasetOp : public UnaryDatasetOpKernel {
         }
       }
 
+      Status Initialize(IteratorContext* ctx) override {
+        return dataset()->input_->MakeIterator(ctx, prefix(), &input_impl_);
+      }
+
       Status GetNextInternal(IteratorContext* ctx,
                              std::vector<Tensor>* out_tensors,
                              bool* end_of_sequence) override {
diff --git a/tensorflow/core/kernels/data/prefetch_autotuner_test.cc b/tensorflow/core/kernels/data/prefetch_autotuner_test.cc
index 2f573dfb3555b2466d84c6341eaa77e69414d103..29a8cc50cde8e6c2feb53f3e89d263c83924dd80 100644
--- a/tensorflow/core/kernels/data/prefetch_autotuner_test.cc
+++ b/tensorflow/core/kernels/data/prefetch_autotuner_test.cc
@@ -33,7 +33,7 @@ TEST(PrefetchAutotuner, Disabled) {
 TEST(PrefetchAutotuner, Enabled) {
   PrefetchAutotuner t(PrefetchAutotuner::kAutoTune);
   EXPECT_EQ(1, t.buffer_limit());
-  t.RecordConsumption(0);  // Expect buffer limit to increase.
+  t.RecordConsumption(0);  // Expect buffer limit to stay the same.
   EXPECT_EQ(1, t.buffer_limit());
   t.RecordConsumption(1);
   EXPECT_EQ(1, t.buffer_limit());
diff --git a/tensorflow/core/kernels/data/prefetch_dataset_op.cc b/tensorflow/core/kernels/data/prefetch_dataset_op.cc
index 536de81fd891f1849cd285d6be4ddefb79fd3386..e2b6aa590e22f9eff8659059d1b168c51ae02899 100644
--- a/tensorflow/core/kernels/data/prefetch_dataset_op.cc
+++ b/tensorflow/core/kernels/data/prefetch_dataset_op.cc
@@ -55,7 +55,7 @@ class PrefetchDatasetOp : public UnaryDatasetOpKernel {
 
     ~Dataset() override { input_->Unref(); }
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
       return std::unique_ptr<IteratorBase>(
           new Iterator({this, strings::StrCat(prefix, "::Prefetch")}));
@@ -68,7 +68,7 @@ class PrefetchDatasetOp : public UnaryDatasetOpKernel {
       return input_->output_shapes();
     }
 
-    string DebugString() override { return "PrefetchDatasetOp::Dataset"; }
+    string DebugString() const override { return "PrefetchDatasetOp::Dataset"; }
 
    protected:
     Status AsGraphDefInternal(OpKernelContext* ctx, DatasetGraphDefBuilder* b,
@@ -87,7 +87,6 @@ class PrefetchDatasetOp : public UnaryDatasetOpKernel {
      public:
       explicit Iterator(const Params& params)
           : DatasetIterator<Dataset>(params),
-            input_impl_(params.dataset->input_->MakeIterator(params.prefix)),
             auto_tuner_(params.dataset->buffer_size_) {}
 
       ~Iterator() override {
@@ -106,6 +105,10 @@ class PrefetchDatasetOp : public UnaryDatasetOpKernel {
         }
       }
 
+      Status Initialize(IteratorContext* ctx) override {
+        return dataset()->input_->MakeIterator(ctx, prefix(), &input_impl_);
+      }
+
       Status GetNextInternal(IteratorContext* ctx,
                              std::vector<Tensor>* out_tensors,
                              bool* end_of_sequence) override {
@@ -327,7 +330,7 @@ class PrefetchDatasetOp : public UnaryDatasetOpKernel {
       // accessing the parent iterator. We keep this separate from `mu_` to
       // allow prefetching to run in parallel with GetNext calls.
       mutex parent_mu_ ACQUIRED_BEFORE(mu_);
-      const std::unique_ptr<IteratorBase> input_impl_ GUARDED_BY(parent_mu_);
+      std::unique_ptr<IteratorBase> input_impl_ GUARDED_BY(parent_mu_);
       condition_variable cond_var_;
       PrefetchAutotuner auto_tuner_ GUARDED_BY(mu_);
       std::deque<BufferElement> buffer_ GUARDED_BY(mu_);
diff --git a/tensorflow/core/kernels/data/random_dataset_op.cc b/tensorflow/core/kernels/data/random_dataset_op.cc
index 210b9ad1b84eeb0c106b0ee538b4957aba7ce1b2..ff166c3be76f9632d2c67d93baf6e2863dcb0b17 100644
--- a/tensorflow/core/kernels/data/random_dataset_op.cc
+++ b/tensorflow/core/kernels/data/random_dataset_op.cc
@@ -54,7 +54,7 @@ class RandomDatasetOp : public DatasetOpKernel {
     Dataset(OpKernelContext* ctx, int64 seed, int64 seed2)
         : GraphDatasetBase(ctx), seed_(seed), seed2_(seed2) {}
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
       return std::unique_ptr<IteratorBase>(
           new Iterator({this, strings::StrCat(prefix, "::Random")}));
@@ -71,7 +71,7 @@ class RandomDatasetOp : public DatasetOpKernel {
       return *shapes;
     }
 
-    string DebugString() override {
+    string DebugString() const override {
       return strings::StrCat("RandomDatasetOp(", seed_, ", ", seed2_,
                              ")::Dataset");
     }
diff --git a/tensorflow/core/kernels/data/range_dataset_op.cc b/tensorflow/core/kernels/data/range_dataset_op.cc
index b57518e678ed185a183e0413d6e90f2a9f85e9fc..0b5c814767b0f7e12e1d49a150d518300dd42a93 100644
--- a/tensorflow/core/kernels/data/range_dataset_op.cc
+++ b/tensorflow/core/kernels/data/range_dataset_op.cc
@@ -48,7 +48,7 @@ class RangeDatasetOp : public DatasetOpKernel {
     Dataset(OpKernelContext* ctx, int64 start, int64 stop, int64 step)
         : GraphDatasetBase(ctx), start_(start), stop_(stop), step_(step) {}
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
       return std::unique_ptr<IteratorBase>(
           new Iterator({this, strings::StrCat(prefix, "::Range")}));
@@ -65,7 +65,7 @@ class RangeDatasetOp : public DatasetOpKernel {
       return *shapes;
     }
 
-    string DebugString() override {
+    string DebugString() const override {
       return strings::StrCat("RangeDatasetOp(", start_, ", ", stop_, ", ",
                              step_, ")::Dataset");
     }
diff --git a/tensorflow/core/kernels/data/reader_dataset_ops.cc b/tensorflow/core/kernels/data/reader_dataset_ops.cc
index 34d7d9f914d7a726135febabb1fbe35b0146977c..29654b9bca9c097c57f469e3d6997a2347174573 100644
--- a/tensorflow/core/kernels/data/reader_dataset_ops.cc
+++ b/tensorflow/core/kernels/data/reader_dataset_ops.cc
@@ -89,7 +89,7 @@ class TextLineDatasetOp : public DatasetOpKernel {
           use_compression_(!compression_type.empty()),
           options_(options) {}
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
       return std::unique_ptr<IteratorBase>(
           new Iterator({this, strings::StrCat(prefix, "::TextLine")}));
@@ -106,7 +106,7 @@ class TextLineDatasetOp : public DatasetOpKernel {
       return *shapes;
     }
 
-    string DebugString() override { return "TextLineDatasetOp::Dataset"; }
+    string DebugString() const override { return "TextLineDatasetOp::Dataset"; }
 
    protected:
     Status AsGraphDefInternal(DatasetGraphDefBuilder* b,
@@ -323,7 +323,7 @@ class FixedLengthRecordDatasetOp : public DatasetOpKernel {
           footer_bytes_(footer_bytes),
           buffer_size_(buffer_size) {}
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
       return std::unique_ptr<IteratorBase>(
           new Iterator({this, strings::StrCat(prefix, "::FixedLengthRecord")}));
@@ -340,7 +340,7 @@ class FixedLengthRecordDatasetOp : public DatasetOpKernel {
       return *shapes;
     }
 
-    string DebugString() override {
+    string DebugString() const override {
       return "FixedLengthRecordDatasetOp::Dataset";
     }
 
@@ -543,7 +543,7 @@ class TFRecordDatasetOp : public DatasetOpKernel {
       }
     }
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
       return std::unique_ptr<IteratorBase>(
           new Iterator({this, strings::StrCat(prefix, "::TFRecord")}));
@@ -560,7 +560,7 @@ class TFRecordDatasetOp : public DatasetOpKernel {
       return *shapes;
     }
 
-    string DebugString() override { return "TFRecordDatasetOp::Dataset"; }
+    string DebugString() const override { return "TFRecordDatasetOp::Dataset"; }
 
    protected:
     Status AsGraphDefInternal(DatasetGraphDefBuilder* b,
diff --git a/tensorflow/core/kernels/data/repeat_dataset_op.cc b/tensorflow/core/kernels/data/repeat_dataset_op.cc
index d37086541dc4714162e00cc6d022b3bd300e3a1c..6b3f4ed27b7297e6866e1c99b492622f302344cb 100644
--- a/tensorflow/core/kernels/data/repeat_dataset_op.cc
+++ b/tensorflow/core/kernels/data/repeat_dataset_op.cc
@@ -48,7 +48,7 @@ class RepeatDatasetOp : public UnaryDatasetOpKernel {
 
     ~Dataset() override { input_->Unref(); }
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
       if (count_ < 0) {
         return std::unique_ptr<IteratorBase>(new ForeverIterator(
@@ -69,7 +69,7 @@ class RepeatDatasetOp : public UnaryDatasetOpKernel {
       return input_->output_shapes();
     }
 
-    string DebugString() override { return "RepeatDatasetOp::Dataset"; }
+    string DebugString() const override { return "RepeatDatasetOp::Dataset"; }
 
    protected:
     Status AsGraphDefInternal(OpKernelContext* ctx, DatasetGraphDefBuilder* b,
@@ -108,9 +108,11 @@ class RepeatDatasetOp : public UnaryDatasetOpKernel {
     class FiniteIterator : public DatasetIterator<Dataset> {
      public:
       explicit FiniteIterator(const Params& params)
-          : DatasetIterator<Dataset>(params),
-            i_(0),
-            input_impl_(params.dataset->input_->MakeIterator(params.prefix)) {}
+          : DatasetIterator<Dataset>(params), i_(0) {}
+
+      Status Initialize(IteratorContext* ctx) override {
+        return dataset()->input_->MakeIterator(ctx, prefix(), &input_impl_);
+      }
 
       Status GetNextInternal(IteratorContext* ctx,
                              std::vector<Tensor>* out_tensors,
@@ -127,7 +129,8 @@ class RepeatDatasetOp : public UnaryDatasetOpKernel {
             return Status::OK();
           }
           ++i_;
-          input_impl_ = dataset()->input_->MakeIterator(prefix());
+          TF_RETURN_IF_ERROR(
+              dataset()->input_->MakeIterator(ctx, prefix(), &input_impl_));
         }
         *end_of_sequence = true;
         input_impl_.reset();
@@ -178,7 +181,8 @@ class RepeatDatasetOp : public UnaryDatasetOpKernel {
           bool first_call = false;
           if (!input_impl_) {
             first_call = true;
-            input_impl_ = dataset()->input_->MakeIterator(prefix());
+            TF_RETURN_IF_ERROR(
+                dataset()->input_->MakeIterator(ctx, prefix(), &input_impl_));
           }
           TF_RETURN_IF_ERROR(
               input_impl_->GetNext(ctx, out_tensors, end_of_sequence));
@@ -214,7 +218,8 @@ class RepeatDatasetOp : public UnaryDatasetOpKernel {
         if (reader->Contains(full_name("uninitialized"))) {
           input_impl_.reset();
         } else {
-          input_impl_ = dataset()->input_->MakeIterator(prefix());
+          TF_RETURN_IF_ERROR(
+              dataset()->input_->MakeIterator(ctx, prefix(), &input_impl_));
           TF_RETURN_IF_ERROR(RestoreParent(ctx, reader, input_impl_));
         }
         return Status::OK();
diff --git a/tensorflow/core/kernels/data/scan_dataset_op.cc b/tensorflow/core/kernels/data/scan_dataset_op.cc
index 5dd6ff848eb4836dd9cbc51b9408d01a652241f0..a3b20016a86431b94d1cb42ce3aad2fbc607959d 100644
--- a/tensorflow/core/kernels/data/scan_dataset_op.cc
+++ b/tensorflow/core/kernels/data/scan_dataset_op.cc
@@ -90,7 +90,7 @@ class ScanDatasetOp : public UnaryDatasetOpKernel {
 
     ~Dataset() override { input_->Unref(); }
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
       return std::unique_ptr<IteratorBase>(
           new Iterator({this, strings::StrCat(prefix, "::Scan")}));
@@ -103,7 +103,7 @@ class ScanDatasetOp : public UnaryDatasetOpKernel {
       return output_shapes_;
     }
 
-    string DebugString() override { return "ScanDatasetOp::Dataset"; }
+    string DebugString() const override { return "ScanDatasetOp::Dataset"; }
 
    protected:
     Status AsGraphDefInternal(OpKernelContext* ctx, DatasetGraphDefBuilder* b,
@@ -149,9 +149,12 @@ class ScanDatasetOp : public UnaryDatasetOpKernel {
      public:
       explicit Iterator(const Params& params)
           : DatasetIterator<Dataset>(params),
-            input_impl_(params.dataset->input_->MakeIterator(params.prefix)),
             state_(params.dataset->initial_state_) {}
 
+      Status Initialize(IteratorContext* ctx) override {
+        return dataset()->input_->MakeIterator(ctx, prefix(), &input_impl_);
+      }
+
       Status GetNextInternal(IteratorContext* ctx,
                              std::vector<Tensor>* out_tensors,
                              bool* end_of_sequence) override {
@@ -250,7 +253,7 @@ class ScanDatasetOp : public UnaryDatasetOpKernel {
 
      private:
       mutex mu_;
-      const std::unique_ptr<IteratorBase> input_impl_ GUARDED_BY(mu_);
+      std::unique_ptr<IteratorBase> input_impl_ GUARDED_BY(mu_);
       std::vector<Tensor> state_ GUARDED_BY(mu_);
     };
 
diff --git a/tensorflow/core/kernels/data/shuffle_dataset_op.cc b/tensorflow/core/kernels/data/shuffle_dataset_op.cc
index 2f6bf83da5d4f1d4b431e6849fd6571f56539dfe..b859295fa479a8f39d23056bd631f114318064bb 100644
--- a/tensorflow/core/kernels/data/shuffle_dataset_op.cc
+++ b/tensorflow/core/kernels/data/shuffle_dataset_op.cc
@@ -61,10 +61,12 @@ class ShuffleDatasetOpBase : public UnaryDatasetOpKernel {
     }
 
    protected:
-    class Iterator : public DatasetIterator<ShuffleDatasetBase> {
+    template <class T>
+    class Iterator : public DatasetIterator<T> {
      public:
-      explicit Iterator(const Params& params, int64 seed, int64 seed2)
-          : DatasetIterator<ShuffleDatasetBase>(params),
+      explicit Iterator(const typename DatasetIterator<T>::Params& params,
+                        int64 seed, int64 seed2)
+          : DatasetIterator<T>(params),
             input_impl_(nullptr),
             seed_(seed),
             seed2_(seed2),
@@ -85,25 +87,28 @@ class ShuffleDatasetOpBase : public UnaryDatasetOpKernel {
         bool first_call = false;
         if (!input_impl_ && epoch_ == 0) {
           first_call = true;
-          input_impl_ = dataset()->input_->MakeIterator(prefix());
+          TF_RETURN_IF_ERROR(this->dataset()->input_->MakeIterator(
+              ctx, this->prefix(), &input_impl_));
         }
-        while (input_impl_ && num_elements_ < dataset()->buffer_size_) {
+        while (input_impl_ && num_elements_ < this->dataset()->buffer_size_) {
           if (ctx->env()->NowMicros() >
               ((num_log_entries + 1) * kLogIntervalMicros) + start_micros) {
             num_log_entries++;
             LOG(INFO) << "Filling up shuffle buffer (this may take a while): "
-                      << num_elements_ << " of " << dataset()->buffer_size_;
+                      << num_elements_ << " of "
+                      << this->dataset()->buffer_size_;
           }
           std::vector<Tensor> input_element;
           bool end_of_input_sequence = false;
-          while (dataset()->count_ == -1 || epoch_ < dataset()->count_) {
+          while (this->dataset()->count_ == -1 ||
+                 epoch_ < this->dataset()->count_) {
             TF_RETURN_IF_ERROR(input_impl_->GetNext(ctx, &input_element,
                                                     &end_of_input_sequence));
             if (!end_of_input_sequence) {
               first_call = false;
               break;
             }
-            if (first_call && dataset()->count_ == -1) {
+            if (first_call && this->dataset()->count_ == -1) {
               // If the first call to GetNext() fails because the end
               // of sequence has been reached, we terminate the
               // iteration immediately. (Otherwise, this iterator
@@ -114,10 +119,11 @@ class ShuffleDatasetOpBase : public UnaryDatasetOpKernel {
             epoch_++;
             int64 n = slices_.back()->end;
             slices_.emplace_back(new Slice{n, n});
-            input_impl_ = dataset()->input_->MakeIterator(prefix());
+            TF_RETURN_IF_ERROR(this->dataset()->input_->MakeIterator(
+                ctx, this->prefix(), &input_impl_));
           }
           if (!end_of_input_sequence) {
-            buffer_[slices_.back()->end % dataset()->buffer_size_] =
+            buffer_[slices_.back()->end % this->dataset()->buffer_size_] =
                 std::move(input_element);
             num_elements_++;
             slices_.back()->end++;
@@ -142,10 +148,11 @@ class ShuffleDatasetOpBase : public UnaryDatasetOpKernel {
           int64 offset =
               Random() % (slices_.front()->end - slices_.front()->start);
           int64 index =
-              (slices_.front()->start + offset) % dataset()->buffer_size_;
+              (slices_.front()->start + offset) % this->dataset()->buffer_size_;
           *out_tensors = std::move(buffer_[index]);
-          std::swap(buffer_[index],
-                    buffer_[slices_.front()->start % dataset()->buffer_size_]);
+          std::swap(
+              buffer_[index],
+              buffer_[slices_.front()->start % this->dataset()->buffer_size_]);
           slices_.front()->start++;
           num_elements_--;
         } else {
@@ -158,40 +165,44 @@ class ShuffleDatasetOpBase : public UnaryDatasetOpKernel {
      protected:
       Status SaveInternal(IteratorStateWriter* writer) override {
         mutex_lock l(mu_);
-
         // Save state needed to restore the random number generators.
-        TF_RETURN_IF_ERROR(writer->WriteScalar(full_name("num_random_samples"),
-                                               num_random_samples_));
+        TF_RETURN_IF_ERROR(writer->WriteScalar(
+            this->full_name("num_random_samples"), num_random_samples_));
+        TF_RETURN_IF_ERROR(writer->WriteScalar(this->full_name("seed"), seed_));
+        TF_RETURN_IF_ERROR(
+            writer->WriteScalar(this->full_name("seed2"), seed2_));
 
         // Save input iterator if it hasn't been exhausted else write
         // "end_of_input_sequence".
         if (!input_impl_) {
-          TF_RETURN_IF_ERROR(
-              writer->WriteScalar(full_name("end_of_input_sequence"), ""));
+          TF_RETURN_IF_ERROR(writer->WriteScalar(
+              this->full_name("end_of_input_sequence"), ""));
         } else {
-          TF_RETURN_IF_ERROR(SaveParent(writer, input_impl_));
+          TF_RETURN_IF_ERROR(this->SaveParent(writer, input_impl_));
         }
 
         // Save the epoch counter, buffer, and buffer slices.
-        TF_RETURN_IF_ERROR(writer->WriteScalar(full_name("epoch"), epoch_));
-        TF_RETURN_IF_ERROR(
-            writer->WriteScalar(full_name("num_elements"), num_elements_));
         TF_RETURN_IF_ERROR(
-            writer->WriteScalar(full_name("slices_size"), slices_.size()));
+            writer->WriteScalar(this->full_name("epoch"), epoch_));
+        TF_RETURN_IF_ERROR(writer->WriteScalar(this->full_name("num_elements"),
+                                               num_elements_));
+        TF_RETURN_IF_ERROR(writer->WriteScalar(this->full_name("slices_size"),
+                                               slices_.size()));
         for (size_t i = 0; i < slices_.size(); ++i) {
           TF_RETURN_IF_ERROR(writer->WriteScalar(
-              full_name(strings::StrCat("slices_start_", i)),
+              this->full_name(strings::StrCat("slices_start_", i)),
               slices_[i]->start));
           TF_RETURN_IF_ERROR(writer->WriteScalar(
-              full_name(strings::StrCat("slices_end_", i)), slices_[i]->end));
+              this->full_name(strings::StrCat("slices_end_", i)),
+              slices_[i]->end));
           for (size_t j = slices_[i]->start; j < slices_[i]->end; ++j) {
-            size_t index = j % dataset()->buffer_size_;
+            size_t index = j % this->dataset()->buffer_size_;
             TF_RETURN_IF_ERROR(writer->WriteScalar(
-                full_name(strings::StrCat("buffer_", index, "_size")),
+                this->full_name(strings::StrCat("buffer_", index, "_size")),
                 buffer_[index].size()));
             for (size_t k = 0; k < buffer_[index].size(); ++k) {
               TF_RETURN_IF_ERROR(writer->WriteTensor(
-                  full_name(strings::StrCat("buffer_", index, "_", k)),
+                  this->full_name(strings::StrCat("buffer_", index, "_", k)),
                   buffer_[index][k]));
             }
           }
@@ -203,50 +214,54 @@ class ShuffleDatasetOpBase : public UnaryDatasetOpKernel {
       Status RestoreInternal(IteratorContext* ctx,
                              IteratorStateReader* reader) override {
         mutex_lock l(mu_);
-
         // Restore the random number generators.
-        TF_RETURN_IF_ERROR(reader->ReadScalar(full_name("num_random_samples"),
-                                              &num_random_samples_));
+        TF_RETURN_IF_ERROR(reader->ReadScalar(
+            this->full_name("num_random_samples"), &num_random_samples_));
+        TF_RETURN_IF_ERROR(reader->ReadScalar(this->full_name("seed"), &seed_));
+        TF_RETURN_IF_ERROR(
+            reader->ReadScalar(this->full_name("seed2"), &seed2_));
         ResetRngs();
 
         // Restore the input iterator if it wasn't already exhausted.
-        if (!reader->Contains(full_name("end_of_input_sequence"))) {
-          input_impl_ = dataset()->input_->MakeIterator(prefix());
-          TF_RETURN_IF_ERROR(RestoreParent(ctx, reader, input_impl_));
+        if (!reader->Contains(this->full_name("end_of_input_sequence"))) {
+          TF_RETURN_IF_ERROR(this->dataset()->input_->MakeIterator(
+              ctx, this->prefix(), &input_impl_));
+          TF_RETURN_IF_ERROR(this->RestoreParent(ctx, reader, input_impl_));
         } else {
           input_impl_.reset();
         }
 
         // Restore the epoch counter, buffer, and buffer slices.
-        TF_RETURN_IF_ERROR(reader->ReadScalar(full_name("epoch"), &epoch_));
         TF_RETURN_IF_ERROR(
-            reader->ReadScalar(full_name("num_elements"), &num_elements_));
+            reader->ReadScalar(this->full_name("epoch"), &epoch_));
+        TF_RETURN_IF_ERROR(reader->ReadScalar(this->full_name("num_elements"),
+                                              &num_elements_));
         size_t slices_size;
         {
           int64 temp;
           TF_RETURN_IF_ERROR(
-              reader->ReadScalar(full_name("slices_size"), &temp));
+              reader->ReadScalar(this->full_name("slices_size"), &temp));
           slices_size = static_cast<size_t>(temp);
         }
-        buffer_.reset(new std::vector<Tensor>[dataset()->buffer_size_]);
+        buffer_.reset(new std::vector<Tensor>[this->dataset()->buffer_size_]);
         for (size_t i = 0; i < slices_size; ++i) {
           int64 start;
           TF_RETURN_IF_ERROR(reader->ReadScalar(
-              full_name(strings::StrCat("slices_start_", i)), &start));
+              this->full_name(strings::StrCat("slices_start_", i)), &start));
           int64 end;
           TF_RETURN_IF_ERROR(reader->ReadScalar(
-              full_name(strings::StrCat("slices_end_", i)), &end));
+              this->full_name(strings::StrCat("slices_end_", i)), &end));
           slices_.emplace_back(new Slice{start, end});
           for (size_t j = start; j < end; ++j) {
-            size_t index = j % dataset()->buffer_size_;
+            size_t index = j % this->dataset()->buffer_size_;
             int64 list_size;
             TF_RETURN_IF_ERROR(reader->ReadScalar(
-                full_name(strings::StrCat("buffer_", index, "_size")),
+                this->full_name(strings::StrCat("buffer_", index, "_size")),
                 &list_size));
             buffer_[index] = std::vector<Tensor>(list_size);
             for (int k = 0; k < list_size; ++k) {
               TF_RETURN_IF_ERROR(reader->ReadTensor(
-                  full_name(strings::StrCat("buffer_", index, "_", k)),
+                  this->full_name(strings::StrCat("buffer_", index, "_", k)),
                   &buffer_[index][k]));
             }
           }
@@ -286,8 +301,8 @@ class ShuffleDatasetOpBase : public UnaryDatasetOpKernel {
       mutex mu_;
       std::unique_ptr<std::vector<Tensor>[]> buffer_ GUARDED_BY(mu_);
       std::unique_ptr<IteratorBase> input_impl_ GUARDED_BY(mu_);
-      const int64 seed_ GUARDED_BY(mu_);
-      const int64 seed2_ GUARDED_BY(mu_);
+      int64 seed_ GUARDED_BY(mu_);
+      int64 seed2_ GUARDED_BY(mu_);
       int64 epoch_ GUARDED_BY(mu_);
       int64 num_elements_ GUARDED_BY(mu_);
       std::deque<std::unique_ptr<Slice>> slices_ GUARDED_BY(mu_);
@@ -356,32 +371,108 @@ class ShuffleDatasetOp : public ShuffleDatasetOpBase {
           parent_generator_(seed, seed2),
           generator_(&parent_generator_) {}
 
-    string DebugString() override {
+    string DebugString() const override {
+      mutex_lock l(mu_);
       return strings::StrCat("ShuffleDatasetOp(", buffer_size_, ", ", seed_,
                              ", ", seed2_, ")::ReshufflingDataset");
     }
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
       int64 iterator_seed;
       int64 iterator_seed2;
       {
         mutex_lock l(mu_);
-        iterator_seed = generator_();
-        iterator_seed2 = generator_();
+        iterator_seed = Random();
+        iterator_seed2 = Random();
       }
-      return std::unique_ptr<IteratorBase>(new ShuffleDatasetBase::Iterator(
-          {this, strings::StrCat(prefix, "::Shuffle")}, iterator_seed,
-          iterator_seed2));
+      return std::unique_ptr<IteratorBase>(
+          new Iterator({this, strings::StrCat(prefix, "::Shuffle")},
+                       iterator_seed, iterator_seed2));
+    }
+
+   protected:
+    class Iterator : public ShuffleDatasetBase::Iterator<ReshufflingDataset> {
+     public:
+      explicit Iterator(const Params& params, int64 seed, int64 seed2)
+          : ShuffleDatasetBase::Iterator<ReshufflingDataset>(params, seed,
+                                                             seed2) {}
+
+     protected:
+      Status SaveInternal(IteratorStateWriter* writer) override {
+        mutex_lock l(dataset()->mu_);
+
+        // Save RNG state of Dataset.
+        TF_RETURN_IF_ERROR(
+            writer->WriteScalar(full_name("ds_num_random_samples"),
+                                dataset()->num_random_samples_));
+
+        // Save the Iterator.
+        return ShuffleDatasetBase::Iterator<ReshufflingDataset>::SaveInternal(
+            writer);
+      }
+
+      Status RestoreInternal(IteratorContext* ctx,
+                             IteratorStateReader* reader) override {
+        mutex_lock l(dataset()->mu_);
+
+        // Restore RNG state of Dataset.
+        TF_RETURN_IF_ERROR(
+            reader->ReadScalar(full_name("ds_num_random_samples"),
+                               &dataset()->num_random_samples_));
+        dataset()->ResetRngs();
+
+        // Restore the Iterator.
+        return ShuffleDatasetBase::Iterator<
+            ReshufflingDataset>::RestoreInternal(ctx, reader);
+      }
+    };
+
+    Status AsGraphDefInternal(OpKernelContext* ctx, DatasetGraphDefBuilder* b,
+                              Node** output) const override {
+      mutex_lock l(mu_);
+      Node* input_graph_node = nullptr;
+      TF_RETURN_IF_ERROR(b->AddParentDataset(ctx, input_, &input_graph_node));
+      Node* buffer_size = nullptr;
+      Node* seed = nullptr;
+      Node* seed2 = nullptr;
+      AttrValue reshuffle_each_iteration;
+
+      TF_RETURN_IF_ERROR(b->AddScalar(buffer_size_, &buffer_size));
+      TF_RETURN_IF_ERROR(b->AddScalar(seed_, &seed));
+      TF_RETURN_IF_ERROR(b->AddScalar(seed2_, &seed2));
+      b->BuildAttrValue(true, &reshuffle_each_iteration);
+      TF_RETURN_IF_ERROR(b->AddDataset(
+          this, {input_graph_node, buffer_size, seed, seed2},  // Inputs
+          {std::make_pair("reshuffle_each_iteration",
+                          reshuffle_each_iteration)},  // Attrs
+          output));
+      return Status::OK();
     }
 
    private:
-    const int64 seed_;
-    const int64 seed2_;
+    random::SingleSampleAdapter<random::PhiloxRandom>::ResultType Random() const
+        EXCLUSIVE_LOCKS_REQUIRED(mu_) {
+      num_random_samples_++;
+      auto out = generator_();
+      return out;
+    }
+
+    void ResetRngs() const EXCLUSIVE_LOCKS_REQUIRED(mu_) {
+      // Reset the generators based on the current seeds.
+      parent_generator_ = random::PhiloxRandom(seed_, seed2_);
+      generator_ =
+          random::SingleSampleAdapter<random::PhiloxRandom>(&parent_generator_);
+      generator_.Skip(num_random_samples_);
+    }
+
+    mutable int64 seed_ GUARDED_BY(mu_);
+    mutable int64 seed2_ GUARDED_BY(mu_);
     mutable mutex mu_;
     mutable random::PhiloxRandom parent_generator_ GUARDED_BY(mu_);
     mutable random::SingleSampleAdapter<random::PhiloxRandom> generator_
         GUARDED_BY(mu_);
+    mutable int64 num_random_samples_ GUARDED_BY(mu_) = 0;
   };
 
   // A dataset that uses the same fixed seed for all iterators created from it.
@@ -394,15 +485,16 @@ class ShuffleDatasetOp : public ShuffleDatasetOpBase {
           seed_(seed),
           seed2_(seed) {}
 
-    string DebugString() override {
+    string DebugString() const override {
       return strings::StrCat("ShuffleDatasetOp(", buffer_size_, ", ", seed_,
                              ", ", seed2_, ")::FixedSeedDataset");
     }
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
-      return std::unique_ptr<IteratorBase>(new ShuffleDatasetBase::Iterator(
-          {this, strings::StrCat(prefix, "::Shuffle")}, seed_, seed2_));
+      return std::unique_ptr<IteratorBase>(
+          new ShuffleDatasetBase::Iterator<ShuffleDatasetBase>(
+              {this, strings::StrCat(prefix, "::Shuffle")}, seed_, seed2_));
     }
 
    protected:
@@ -477,16 +569,17 @@ class ShuffleAndRepeatDatasetOp : public ShuffleDatasetOpBase {
           seed_(seed),
           seed2_(seed2) {}
 
-    string DebugString() override {
+    string DebugString() const override {
       return strings::StrCat("ShuffleAndRepeatDatasetOp(", buffer_size_, ", ",
                              seed_, ", ", seed2_, ", ", count_, ")::Dataset");
     }
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
-      return std::unique_ptr<IteratorBase>(new ShuffleDatasetBase::Iterator(
-          {this, strings::StrCat(prefix, "::ShuffleAndRepeat")}, seed_,
-          seed2_));
+      return std::unique_ptr<IteratorBase>(
+          new ShuffleDatasetBase::Iterator<ShuffleDatasetBase>(
+              {this, strings::StrCat(prefix, "::ShuffleAndRepeat")}, seed_,
+              seed2_));
     }
 
    protected:
diff --git a/tensorflow/core/kernels/data/skip_dataset_op.cc b/tensorflow/core/kernels/data/skip_dataset_op.cc
index d636c37afe2aa0566df7d4a38a8d393c34fd0195..b84afa3e33464c35f1170cb5ce409fcb9f31d761 100644
--- a/tensorflow/core/kernels/data/skip_dataset_op.cc
+++ b/tensorflow/core/kernels/data/skip_dataset_op.cc
@@ -47,14 +47,11 @@ class SkipDatasetOp : public UnaryDatasetOpKernel {
 
     ~Dataset() override { input_->Unref(); }
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
       if (count_ < 0) {
         return std::unique_ptr<IteratorBase>(
             new EmptyIterator({this, strings::StrCat(prefix, "::EmptySkip")}));
-      } else if (count_ == 0) {
-        // Pass through.
-        return input_->MakeIterator(prefix);
       } else {
         return std::unique_ptr<IteratorBase>(new FiniteIterator(
             {this, strings::StrCat(prefix, "::FiniteSkip")}));
@@ -68,7 +65,7 @@ class SkipDatasetOp : public UnaryDatasetOpKernel {
       return input_->output_shapes();
     }
 
-    string DebugString() override { return "SkipDatasetOp::Dataset"; }
+    string DebugString() const override { return "SkipDatasetOp::Dataset"; }
 
    protected:
     Status AsGraphDefInternal(OpKernelContext* ctx, DatasetGraphDefBuilder* b,
@@ -108,9 +105,11 @@ class SkipDatasetOp : public UnaryDatasetOpKernel {
     class FiniteIterator : public DatasetIterator<Dataset> {
      public:
       explicit FiniteIterator(const Params& params)
-          : DatasetIterator<Dataset>(params),
-            i_(0),
-            input_impl_(params.dataset->input_->MakeIterator(params.prefix)) {}
+          : DatasetIterator<Dataset>(params), i_(0) {}
+
+      Status Initialize(IteratorContext* ctx) override {
+        return dataset()->input_->MakeIterator(ctx, prefix(), &input_impl_);
+      }
 
       Status GetNextInternal(IteratorContext* ctx,
                              std::vector<Tensor>* out_tensors,
diff --git a/tensorflow/core/kernels/data/slide_dataset_op.cc b/tensorflow/core/kernels/data/slide_dataset_op.cc
index 78c8363f91a7efcf9ba3355aa4e6b21d1b5eeff7..48776cbf6113f82fe80aaa9d6d2ca001ad92b836 100644
--- a/tensorflow/core/kernels/data/slide_dataset_op.cc
+++ b/tensorflow/core/kernels/data/slide_dataset_op.cc
@@ -33,10 +33,9 @@ class SlideDatasetOp : public UnaryDatasetOpKernel {
                    DatasetBase** output) override {
     int64 window_size = 0;
     int64 stride = 1;
-    OP_REQUIRES_OK(ctx,
-                   ParseScalarArgument<int64>(ctx, "window_size", &window_size));
-    OP_REQUIRES_OK(ctx,
-                   ParseScalarArgument<int64>(ctx, "stride", &stride));
+    OP_REQUIRES_OK(
+        ctx, ParseScalarArgument<int64>(ctx, "window_size", &window_size));
+    OP_REQUIRES_OK(ctx, ParseScalarArgument<int64>(ctx, "stride", &stride));
     OP_REQUIRES(
         ctx, window_size > 0,
         errors::InvalidArgument("Window size must be greater than zero."));
@@ -50,8 +49,12 @@ class SlideDatasetOp : public UnaryDatasetOpKernel {
  private:
   class Dataset : public GraphDatasetBase {
    public:
-    Dataset(OpKernelContext* ctx, int64 window_size, int64 stride, const DatasetBase* input)
-        : GraphDatasetBase(ctx), window_size_(window_size), stride_(stride), input_(input) {
+    Dataset(OpKernelContext* ctx, int64 window_size, int64 stride,
+            const DatasetBase* input)
+        : GraphDatasetBase(ctx),
+          window_size_(window_size),
+          stride_(stride),
+          input_(input) {
       input_->Ref();
 
       const auto& input_shapes = input_->output_shapes();
@@ -64,7 +67,7 @@ class SlideDatasetOp : public UnaryDatasetOpKernel {
 
     ~Dataset() override { input_->Unref(); }
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
       return std::unique_ptr<IteratorBase>(new Iterator(
           Iterator::Params{this, strings::StrCat(prefix, "::Slide")}));
@@ -78,8 +81,9 @@ class SlideDatasetOp : public UnaryDatasetOpKernel {
       return output_shapes_;
     }
 
-    string DebugString() override {
-      return strings::StrCat("SlideDatasetOp(", window_size_, ", ", stride_, ")::Dataset");
+    string DebugString() const override {
+      return strings::StrCat("SlideDatasetOp(", window_size_, ", ", stride_,
+                             ")::Dataset");
     }
 
    protected:
@@ -101,8 +105,11 @@ class SlideDatasetOp : public UnaryDatasetOpKernel {
     class Iterator : public DatasetIterator<Dataset> {
      public:
       explicit Iterator(const Params& params)
-          : DatasetIterator<Dataset>(params),
-            input_impl_(params.dataset->input_->MakeIterator(params.prefix)) {}
+          : DatasetIterator<Dataset>(params) {}
+
+      Status Initialize(IteratorContext* ctx) override {
+        return dataset()->input_->MakeIterator(ctx, prefix(), &input_impl_);
+      }
 
       Status GetNextInternal(IteratorContext* ctx,
                              std::vector<Tensor>* out_tensors,
diff --git a/tensorflow/core/kernels/data/sparse_tensor_slice_dataset_op.cc b/tensorflow/core/kernels/data/sparse_tensor_slice_dataset_op.cc
index fcf17ad68bb1bb5fca7fd7767e12fe9fbc50e0ab..2604822cc9da4e59f006001837e9596cb6af30a7 100644
--- a/tensorflow/core/kernels/data/sparse_tensor_slice_dataset_op.cc
+++ b/tensorflow/core/kernels/data/sparse_tensor_slice_dataset_op.cc
@@ -39,7 +39,7 @@ class Dataset : public GraphDatasetBase {
                  {-1},
                  {sparse_tensor.dims() - 1}}) {}
 
-  std::unique_ptr<IteratorBase> MakeIterator(
+  std::unique_ptr<IteratorBase> MakeIteratorInternal(
       const string& prefix) const override {
     return std::unique_ptr<IteratorBase>(
         new Iterator({this, strings::StrCat(prefix, "::SparseTensorSlice")}));
@@ -50,7 +50,7 @@ class Dataset : public GraphDatasetBase {
     return shapes_;
   }
 
-  string DebugString() override {
+  string DebugString() const override {
     return "SparseTensorSliceDatasetOp::Dataset";
   }
 
diff --git a/tensorflow/core/kernels/data/sql_dataset_ops.cc b/tensorflow/core/kernels/data/sql_dataset_ops.cc
index 634b3c280fedabfca1b69c32a99034a8c198470c..16652e792cb6a94f2a78fe8853e65b6474e32797 100644
--- a/tensorflow/core/kernels/data/sql_dataset_ops.cc
+++ b/tensorflow/core/kernels/data/sql_dataset_ops.cc
@@ -88,7 +88,7 @@ class SqlDatasetOp : public DatasetOpKernel {
           output_types_(output_types),
           output_shapes_(output_shapes) {}
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
       return std::unique_ptr<IteratorBase>(
           new Iterator({this, strings::StrCat(prefix, "::Sql")}));
@@ -102,7 +102,7 @@ class SqlDatasetOp : public DatasetOpKernel {
       return output_shapes_;
     }
 
-    string DebugString() override { return "SqlDatasetOp::Dataset"; }
+    string DebugString() const override { return "SqlDatasetOp::Dataset"; }
 
    protected:
     Status AsGraphDefInternal(OpKernelContext* ctx, DatasetGraphDefBuilder* b,
diff --git a/tensorflow/core/kernels/data/stats_aggregator_dataset_op.cc b/tensorflow/core/kernels/data/stats_aggregator_dataset_op.cc
index eb96b8a872cae71a55ca5de232df2ea402db4561..2ff90d7b10b36df3461da3cbe6dcb0c184ed5dd6 100644
--- a/tensorflow/core/kernels/data/stats_aggregator_dataset_op.cc
+++ b/tensorflow/core/kernels/data/stats_aggregator_dataset_op.cc
@@ -53,7 +53,7 @@ class SetStatsAggregatorDatasetOp : public UnaryDatasetOpKernel {
       stats_aggregator_resource_->Unref();
     }
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
       return std::unique_ptr<IteratorBase>(new Iterator(
           {this, strings::StrCat(prefix, "::SetStatsAggregator")}));
@@ -66,7 +66,7 @@ class SetStatsAggregatorDatasetOp : public UnaryDatasetOpKernel {
       return input_->output_shapes();
     }
 
-    string DebugString() override {
+    string DebugString() const override {
       return "SetStatsAggregatorDatasetOp::Dataset";
     }
 
@@ -82,8 +82,11 @@ class SetStatsAggregatorDatasetOp : public UnaryDatasetOpKernel {
     class Iterator : public DatasetIterator<Dataset> {
      public:
       explicit Iterator(const Params& params)
-          : DatasetIterator<Dataset>(params),
-            input_impl_(params.dataset->input_->MakeIterator(params.prefix)) {}
+          : DatasetIterator<Dataset>(params) {}
+
+      Status Initialize(IteratorContext* ctx) override {
+        return dataset()->input_->MakeIterator(ctx, prefix(), &input_impl_);
+      }
 
       Status GetNextInternal(IteratorContext* ctx,
                              std::vector<Tensor>* out_tensors,
diff --git a/tensorflow/core/kernels/data/stats_dataset_ops.cc b/tensorflow/core/kernels/data/stats_dataset_ops.cc
index 633cd8545114e839e6ffdf9307236a79936957ad..3e0a6ae0490cc17785de9e9cba5a1ba6f347dc64 100644
--- a/tensorflow/core/kernels/data/stats_dataset_ops.cc
+++ b/tensorflow/core/kernels/data/stats_dataset_ops.cc
@@ -13,6 +13,8 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
+#include "tensorflow/core/example/example.pb.h"
+#include "tensorflow/core/example/feature.pb.h"
 #include "tensorflow/core/framework/partial_tensor_shape.h"
 #include "tensorflow/core/framework/stats_aggregator.h"
 #include "tensorflow/core/framework/tensor.h"
@@ -56,7 +58,7 @@ class LatencyStatsDatasetOp : public UnaryDatasetOpKernel {
 
     ~Dataset() override { input_->Unref(); }
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
       return std::unique_ptr<IteratorBase>(
           new Iterator({this, strings::StrCat(prefix, "::LatencyStats")}));
@@ -69,7 +71,9 @@ class LatencyStatsDatasetOp : public UnaryDatasetOpKernel {
       return input_->output_shapes();
     }
 
-    string DebugString() override { return "LatencyStatsDatasetOp::Dataset"; }
+    string DebugString() const override {
+      return "LatencyStatsDatasetOp::Dataset";
+    }
 
    protected:
     Status AsGraphDefInternal(OpKernelContext* ctx, DatasetGraphDefBuilder* b,
@@ -86,8 +90,11 @@ class LatencyStatsDatasetOp : public UnaryDatasetOpKernel {
     class Iterator : public DatasetIterator<Dataset> {
      public:
       explicit Iterator(const Params& params)
-          : DatasetIterator<Dataset>(params),
-            input_impl_(params.dataset->input_->MakeIterator(params.prefix)) {}
+          : DatasetIterator<Dataset>(params) {}
+
+      Status Initialize(IteratorContext* ctx) override {
+        return dataset()->input_->MakeIterator(ctx, prefix(), &input_impl_);
+      }
 
       Status GetNextInternal(IteratorContext* ctx,
                              std::vector<Tensor>* out_tensors,
@@ -150,7 +157,7 @@ class BytesProducedStatsDatasetOp : public UnaryDatasetOpKernel {
 
     ~Dataset() override { input_->Unref(); }
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
       return std::unique_ptr<IteratorBase>(new Iterator(
           {this, strings::StrCat(prefix, "::BytesProducedStats")}));
@@ -163,7 +170,7 @@ class BytesProducedStatsDatasetOp : public UnaryDatasetOpKernel {
       return input_->output_shapes();
     }
 
-    string DebugString() override {
+    string DebugString() const override {
       return "BytesProducedStatsDatasetOp::Dataset";
     }
 
@@ -182,8 +189,11 @@ class BytesProducedStatsDatasetOp : public UnaryDatasetOpKernel {
     class Iterator : public DatasetIterator<Dataset> {
      public:
       explicit Iterator(const Params& params)
-          : DatasetIterator<Dataset>(params),
-            input_impl_(params.dataset->input_->MakeIterator(params.prefix)) {}
+          : DatasetIterator<Dataset>(params) {}
+
+      Status Initialize(IteratorContext* ctx) override {
+        return dataset()->input_->MakeIterator(ctx, prefix(), &input_impl_);
+      }
 
       Status GetNextInternal(IteratorContext* ctx,
                              std::vector<Tensor>* out_tensors,
@@ -226,6 +236,189 @@ class BytesProducedStatsDatasetOp : public UnaryDatasetOpKernel {
   };
 };
 
+class FeatureStatsDatasetOp : public UnaryDatasetOpKernel {
+ public:
+  explicit FeatureStatsDatasetOp(OpKernelConstruction* ctx)
+      : UnaryDatasetOpKernel(ctx) {}
+
+  void MakeDataset(OpKernelContext* ctx, DatasetBase* input,
+                   DatasetBase** output) override {
+    string tag;
+    OP_REQUIRES_OK(ctx, ParseScalarArgument(ctx, "tag", &tag));
+    OP_REQUIRES(ctx, input->output_dtypes()[0] == DT_STRING,
+                errors::InvalidArgument("FeatureStatsDataset only supports "
+                                        "input with a single `tf.string` "
+                                        "component."));
+    *output = new Dataset(ctx, input, std::move(tag));
+  }
+
+ private:
+  class Dataset : public GraphDatasetBase {
+   public:
+    explicit Dataset(OpKernelContext* ctx, const DatasetBase* input, string tag)
+        : GraphDatasetBase(ctx), input_(input), tag_(std::move(tag)) {
+      input_->Ref();
+    }
+
+    ~Dataset() override { input_->Unref(); }
+
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
+        const string& prefix) const override {
+      return std::unique_ptr<IteratorBase>(new Iterator(
+          {this, strings::StrCat(prefix, "::FeatureStatsDataset")}));
+    }
+
+    const DataTypeVector& output_dtypes() const override {
+      return input_->output_dtypes();
+    }
+    const std::vector<PartialTensorShape>& output_shapes() const override {
+      return input_->output_shapes();
+    }
+
+    string DebugString() const override {
+      return "FeatureStatsDatasetOp::Dataset";
+    }
+
+   protected:
+    Status AsGraphDefInternal(OpKernelContext* ctx, DatasetGraphDefBuilder* b,
+                              Node** output) const override {
+      Node* input_node;
+      TF_RETURN_IF_ERROR(b->AddParentDataset(ctx, input_, &input_node));
+      Node* tag_node;
+      TF_RETURN_IF_ERROR(b->AddScalar(tag_, &tag_node));
+      TF_RETURN_IF_ERROR(b->AddDataset(this, {input_node, tag_node}, output));
+      return Status::OK();
+    }
+
+   private:
+    class Iterator : public DatasetIterator<Dataset> {
+     public:
+      explicit Iterator(const Params& params)
+          : DatasetIterator<Dataset>(params) {}
+
+      Status Initialize(IteratorContext* ctx) override {
+        return dataset()->input_->MakeIterator(ctx, prefix(), &input_impl_);
+      }
+
+      Status GetNextInternal(IteratorContext* ctx,
+                             std::vector<Tensor>* out_tensors,
+                             bool* end_of_sequence) override {
+        tf_shared_lock l(mu_);
+        Status s = input_impl_->GetNext(ctx, out_tensors, end_of_sequence);
+        auto stats_aggregator = ctx->stats_aggregator();
+        if (stats_aggregator && s.ok() && !*end_of_sequence) {
+          for (const Tensor& t : *out_tensors) {
+            auto record_t = t.flat<string>();
+            Example example;
+            // TODO(shivaniagrawal): redundant parsing here, potential solutions
+            // to improve performance is to a) have a potential
+            // ParseExampleDataset and collect stats from there and b) make
+            // changes to parse_example() where it returns stats as well.
+            for (int i = 0; i < record_t.size(); ++i) {
+              if (example.ParseFromString(record_t(i))) {
+                AddStatsFeatures(example, stats_aggregator);
+              } else {
+                SequenceExample sequence_example;
+                if (sequence_example.ParseFromString(record_t(i))) {
+                  AddStatsFeatures(sequence_example, stats_aggregator);
+                }
+              }
+            }
+          }
+        }
+        return s;
+      }
+
+      // TODO(shivaniagrawal): Add features/feature-values to streamz metrics.
+      int AddStatsFeatureValues(const Feature& feature) {
+        int feature_values_list_size = 0;
+        switch (feature.kind_case()) {
+          case Feature::kBytesList: {
+            feature_values_list_size = feature.bytes_list().value().size();
+            break;
+          }
+          case Feature::kFloatList: {
+            feature_values_list_size = feature.float_list().value().size();
+            break;
+          }
+          case Feature::kInt64List: {
+            feature_values_list_size = feature.int64_list().value().size();
+            break;
+          }
+          case Feature::KIND_NOT_SET:
+            break;
+        }
+        return feature_values_list_size;
+      }
+
+      void AddStatsFeatures(
+          const Example& example,
+          const std::shared_ptr<StatsAggregator>& stats_aggregator) {
+        stats_aggregator->AddToHistogram(
+            strings::StrCat(dataset()->tag_, ":features"),
+            {static_cast<double>(example.features().feature().size())});
+
+        int feature_values_list_size_sum = 0;
+        for (const auto& feature : example.features().feature()) {
+          feature_values_list_size_sum += AddStatsFeatureValues(feature.second);
+        }
+        stats_aggregator->AddToHistogram(
+            strings::StrCat(dataset()->tag_, ":feature-values"),
+            {static_cast<double>(feature_values_list_size_sum)});
+      }
+
+      void AddStatsFeatures(
+          const SequenceExample& example,
+          const std::shared_ptr<StatsAggregator>& stats_aggregator) {
+        stats_aggregator->AddToHistogram(
+            strings::StrCat(dataset()->tag_, ":features"),
+            {static_cast<double>(
+                example.context().feature().size() +
+                example.feature_lists().feature_list().size())});
+
+        int feature_values_list_size_sum = 0;
+        for (const auto& feature : example.context().feature()) {
+          feature_values_list_size_sum += AddStatsFeatureValues(feature.second);
+        }
+
+        for (const auto& feature_list :
+             example.feature_lists().feature_list()) {
+          for (const auto& feature : feature_list.second.feature()) {
+            feature_values_list_size_sum += AddStatsFeatureValues(feature);
+          }
+        }
+
+        stats_aggregator->AddToHistogram(
+            strings::StrCat(dataset()->tag_, ":feature-values"),
+            {static_cast<double>(feature_values_list_size_sum)});
+      }
+
+     protected:
+      Status SaveInternal(IteratorStateWriter* writer) override {
+        mutex_lock l(mu_);
+        TF_RETURN_IF_ERROR(SaveParent(writer, input_impl_));
+        return Status::OK();
+      }
+
+      Status RestoreInternal(IteratorContext* ctx,
+                             IteratorStateReader* reader) override {
+        mutex_lock l(mu_);
+        TF_RETURN_IF_ERROR(RestoreParent(ctx, reader, input_impl_));
+        return Status::OK();
+      }
+
+     private:
+      mutex mu_;
+      std::unique_ptr<IteratorBase> input_impl_ GUARDED_BY(mu_);
+    };
+
+    const DatasetBase* const input_;
+    const string tag_;
+  };
+};
+
+REGISTER_KERNEL_BUILDER(Name("FeatureStatsDataset").Device(DEVICE_CPU),
+                        FeatureStatsDatasetOp);
 REGISTER_KERNEL_BUILDER(Name("LatencyStatsDataset").Device(DEVICE_CPU),
                         LatencyStatsDatasetOp);
 REGISTER_KERNEL_BUILDER(Name("BytesProducedStatsDataset").Device(DEVICE_CPU),
diff --git a/tensorflow/core/kernels/data/take_dataset_op.cc b/tensorflow/core/kernels/data/take_dataset_op.cc
index 3bea46a747e002633a0db269434b26bad761a771..3d29221f3e66449e067f7f1284b4cbf2bee0d4eb 100644
--- a/tensorflow/core/kernels/data/take_dataset_op.cc
+++ b/tensorflow/core/kernels/data/take_dataset_op.cc
@@ -47,12 +47,9 @@ class TakeDatasetOp : public UnaryDatasetOpKernel {
 
     ~Dataset() override { input_->Unref(); }
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
-      if (count_ < 0) {
-        // Pass through
-        return input_->MakeIterator(prefix);
-      } else if (count_ == 0) {
+      if (count_ == 0) {
         return std::unique_ptr<IteratorBase>(
             new EmptyIterator({this, strings::StrCat(prefix, "::EmptyTake")}));
       } else {
@@ -69,7 +66,7 @@ class TakeDatasetOp : public UnaryDatasetOpKernel {
       return input_->output_shapes();
     }
 
-    string DebugString() override { return "TakeDatasetOp::Dataset"; }
+    string DebugString() const override { return "TakeDatasetOp::Dataset"; }
 
    protected:
     Status AsGraphDefInternal(OpKernelContext* ctx, DatasetGraphDefBuilder* b,
@@ -109,9 +106,11 @@ class TakeDatasetOp : public UnaryDatasetOpKernel {
     class FiniteIterator : public DatasetIterator<Dataset> {
      public:
       explicit FiniteIterator(const Params& params)
-          : DatasetIterator<Dataset>(params),
-            i_(0),
-            input_impl_(params.dataset->input_->MakeIterator(params.prefix)) {}
+          : DatasetIterator<Dataset>(params), i_(0) {}
+
+      Status Initialize(IteratorContext* ctx) override {
+        return dataset()->input_->MakeIterator(ctx, prefix(), &input_impl_);
+      }
 
       Status GetNextInternal(IteratorContext* ctx,
                              std::vector<Tensor>* out_tensors,
@@ -121,7 +120,7 @@ class TakeDatasetOp : public UnaryDatasetOpKernel {
           *end_of_sequence = true;
           return Status::OK();
         }
-        while (i_ < dataset()->count_) {
+        while (dataset()->count_ < 0 || i_ < dataset()->count_) {
           TF_RETURN_IF_ERROR(
               input_impl_->GetNext(ctx, out_tensors, end_of_sequence));
           if (!*end_of_sequence) {
diff --git a/tensorflow/core/kernels/data/tensor_dataset_op.cc b/tensorflow/core/kernels/data/tensor_dataset_op.cc
index 8c8994b1c3f470532cc7c45dabde4639e841dc4b..36fc434d8fbb65a1aeefec35ad8ca029f7e630af 100644
--- a/tensorflow/core/kernels/data/tensor_dataset_op.cc
+++ b/tensorflow/core/kernels/data/tensor_dataset_op.cc
@@ -53,7 +53,7 @@ class TensorDatasetOp : public DatasetOpKernel {
       }
     }
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
       return std::unique_ptr<IteratorBase>(
           new Iterator({this, strings::StrCat(prefix, "::FromTensor")}));
@@ -64,7 +64,7 @@ class TensorDatasetOp : public DatasetOpKernel {
       return shapes_;
     }
 
-    string DebugString() override { return "TensorDatasetOp::Dataset"; }
+    string DebugString() const override { return "TensorDatasetOp::Dataset"; }
 
    protected:
     Status AsGraphDefInternal(DatasetGraphDefBuilder* b,
diff --git a/tensorflow/core/kernels/data/tensor_queue_dataset_op.cc b/tensorflow/core/kernels/data/tensor_queue_dataset_op.cc
index e271a42b2ac679e837d714567e9f26aa23b6c1a2..29b4c9053ea7aa0ea4e3f13bbf7393c8d5dfdded 100644
--- a/tensorflow/core/kernels/data/tensor_queue_dataset_op.cc
+++ b/tensorflow/core/kernels/data/tensor_queue_dataset_op.cc
@@ -81,7 +81,7 @@ class PrependFromQueueAndPaddedBatchDataset : public GraphDatasetBase {
 
   ~PrependFromQueueAndPaddedBatchDataset() override { input_->Unref(); }
 
-  std::unique_ptr<IteratorBase> MakeIterator(
+  std::unique_ptr<IteratorBase> MakeIteratorInternal(
       const string& prefix) const override {
     return std::unique_ptr<IteratorBase>(new Iterator(
         {this, strings::StrCat(prefix, "::PrependFromQueueAndPaddedBatch")}));
@@ -94,7 +94,7 @@ class PrependFromQueueAndPaddedBatchDataset : public GraphDatasetBase {
     return batched_shapes_with_queue_;
   }
 
-  string DebugString() override {
+  string DebugString() const override {
     return "PrependFromQueueAndPaddedBatchDatasetOp::Dataset";
   }
 
@@ -152,15 +152,19 @@ class PrependFromQueueAndPaddedBatchDataset : public GraphDatasetBase {
       : public DatasetIterator<PrependFromQueueAndPaddedBatchDataset> {
    public:
     explicit Iterator(const Params& params)
-        : DatasetIterator<PrependFromQueueAndPaddedBatchDataset>(params),
-          queue_(new TensorQueue(/*input_impl*/
-                                 params.dataset->input_->MakeIterator(
-                                     params.prefix),
-                                 params.dataset->dtypes_,
-                                 params.dataset->shapes_)) {}
+        : DatasetIterator<PrependFromQueueAndPaddedBatchDataset>(params) {}
 
     ~Iterator() override { queue_->Unref(); }
 
+    Status Initialize(IteratorContext* ctx) override {
+      std::unique_ptr<IteratorBase> iterator;
+      TF_RETURN_IF_ERROR(
+          dataset()->input_->MakeIterator(ctx, prefix(), &iterator));
+      queue_ = new TensorQueue(std::move(iterator), dataset()->dtypes_,
+                               dataset()->shapes_);
+      return Status::OK();
+    }
+
     Status GetNextInternal(IteratorContext* ctx,
                            std::vector<Tensor>* out_tensors,
                            bool* end_of_sequence) override {
@@ -372,7 +376,8 @@ class PrependFromQueueAndPaddedBatchDataset : public GraphDatasetBase {
         if (reader->Contains(iter->full_name("input_exhausted"))) {
           input_impl_.reset();
         } else {
-          input_impl_ = iter->dataset_input()->MakeIterator(iter->prefix());
+          TF_RETURN_IF_ERROR(iter->dataset_input()->MakeIterator(
+              ctx, iter->prefix(), &input_impl_));
           TF_RETURN_IF_ERROR(iter->RestoreParent(ctx, reader, input_impl_));
         }
         entries_.clear();
@@ -469,7 +474,7 @@ class PrependFromQueueAndPaddedBatchDataset : public GraphDatasetBase {
     };
 
    private:
-    TensorQueue* const queue_;
+    TensorQueue* queue_;
   };
 
  private:
diff --git a/tensorflow/core/kernels/data/tensor_slice_dataset_op.cc b/tensorflow/core/kernels/data/tensor_slice_dataset_op.cc
index 95708cc01ce6b63275f0c9c562d4f8c4782af5f4..68ce324081d7c096216d51d88125e897f6b43883 100644
--- a/tensorflow/core/kernels/data/tensor_slice_dataset_op.cc
+++ b/tensorflow/core/kernels/data/tensor_slice_dataset_op.cc
@@ -70,7 +70,7 @@ class TensorSliceDatasetOp : public DatasetOpKernel {
       }
     }
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
       return std::unique_ptr<IteratorBase>(
           new Iterator({this, strings::StrCat(prefix, "::TensorSlice")}));
@@ -81,7 +81,9 @@ class TensorSliceDatasetOp : public DatasetOpKernel {
       return shapes_;
     }
 
-    string DebugString() override { return "TensorSliceDatasetOp::Dataset"; }
+    string DebugString() const override {
+      return "TensorSliceDatasetOp::Dataset";
+    }
 
    protected:
     Status AsGraphDefInternal(DatasetGraphDefBuilder* b,
diff --git a/tensorflow/core/kernels/data/unbatch_dataset_op.cc b/tensorflow/core/kernels/data/unbatch_dataset_op.cc
index 2b383e50977a306e4512b3693568b9d88e38f0d1..2aec9fb0900f6bd6f8b6ab34411b0be260b1ae4b 100644
--- a/tensorflow/core/kernels/data/unbatch_dataset_op.cc
+++ b/tensorflow/core/kernels/data/unbatch_dataset_op.cc
@@ -49,7 +49,7 @@ class UnbatchDatasetOp : public UnaryDatasetOpKernel {
       }
     }
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
       return std::unique_ptr<IteratorBase>(
           new Iterator({this, strings::StrCat(prefix, "::Unbatch")}));
@@ -62,7 +62,7 @@ class UnbatchDatasetOp : public UnaryDatasetOpKernel {
       return shapes_;
     }
 
-    string DebugString() override { return "UnbatchDatasetOp::Dataset"; }
+    string DebugString() const override { return "UnbatchDatasetOp::Dataset"; }
 
    protected:
     Status AsGraphDefInternal(OpKernelContext* ctx, DatasetGraphDefBuilder* b,
@@ -80,9 +80,12 @@ class UnbatchDatasetOp : public UnaryDatasetOpKernel {
           : DatasetIterator<Dataset>(params),
             current_index_(0),
             current_batch_size_(0),
-            input_impl_(params.dataset->input_->MakeIterator(params.prefix)),
             shapes_(params.dataset->output_shapes().size()) {}
 
+      Status Initialize(IteratorContext* ctx) override {
+        return dataset()->input_->MakeIterator(ctx, prefix(), &input_impl_);
+      }
+
       Status GetNextInternal(IteratorContext* ctx,
                              std::vector<Tensor>* out_tensors,
                              bool* end_of_sequence) override {
diff --git a/tensorflow/core/kernels/data/window_dataset.cc b/tensorflow/core/kernels/data/window_dataset.cc
index e24bdea4ac70b76edb926419fa9180f13cf51fb0..668b461374998e2167c1d57006a85926763d6d19 100644
--- a/tensorflow/core/kernels/data/window_dataset.cc
+++ b/tensorflow/core/kernels/data/window_dataset.cc
@@ -26,7 +26,7 @@ class WindowDataset : public DatasetBase {
         output_types_(std::move(output_types)),
         output_shapes_(std::move(output_shapes)) {}
 
-  std::unique_ptr<IteratorBase> MakeIterator(
+  std::unique_ptr<IteratorBase> MakeIteratorInternal(
       const string& prefix) const override {
     return std::unique_ptr<IteratorBase>(
         new Iterator({this, strings::StrCat(prefix, "::Window")}));
@@ -38,7 +38,7 @@ class WindowDataset : public DatasetBase {
     return output_shapes_;
   }
 
-  string DebugString() override { return "WindowDataset"; }
+  string DebugString() const override { return "WindowDataset"; }
 
  private:
   class Iterator : public DatasetIterator<WindowDataset> {
diff --git a/tensorflow/core/kernels/data/writer_ops.cc b/tensorflow/core/kernels/data/writer_ops.cc
index 656fee1e856d213cd496ae8a6386230ad48efc3f..80d9a5b867c85934ecd27b79287d4ed208acd463 100644
--- a/tensorflow/core/kernels/data/writer_ops.cc
+++ b/tensorflow/core/kernels/data/writer_ops.cc
@@ -70,9 +70,13 @@ class ToTFRecordOp : public AsyncOpKernel {
       DatasetBase* dataset;
       OP_REQUIRES_OK_ASYNC(
           ctx, GetDatasetFromVariantTensor(ctx->input(0), &dataset), done);
-      auto iterator = dataset->MakeIterator("ToTFRecordOpIterator");
-
       IteratorContext iter_ctx = dataset::MakeIteratorContext(ctx);
+      std::unique_ptr<IteratorBase> iterator;
+      OP_REQUIRES_OK_ASYNC(
+          ctx,
+          dataset->MakeIterator(&iter_ctx, "ToTFRecordOpIterator", &iterator),
+          done);
+
       std::vector<Tensor> components;
       components.reserve(dataset->output_dtypes().size());
       bool end_of_sequence;
diff --git a/tensorflow/core/kernels/data/zip_dataset_op.cc b/tensorflow/core/kernels/data/zip_dataset_op.cc
index 0f79eac94710fafd3cbf5686876f629dac7bac09..00705236f9ee83ca4d6b501b6e9fe60e04a2e2e6 100644
--- a/tensorflow/core/kernels/data/zip_dataset_op.cc
+++ b/tensorflow/core/kernels/data/zip_dataset_op.cc
@@ -60,7 +60,7 @@ class ZipDatasetOp : public DatasetOpKernel {
       }
     }
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
       return std::unique_ptr<IteratorBase>(
           new Iterator({this, strings::StrCat(prefix, "::Zip")}));
@@ -74,7 +74,7 @@ class ZipDatasetOp : public DatasetOpKernel {
       return output_shapes_;
     }
 
-    string DebugString() override { return "ZipDatasetOp::Dataset"; }
+    string DebugString() const override { return "ZipDatasetOp::Dataset"; }
 
    protected:
     Status AsGraphDefInternal(OpKernelContext* ctx, DatasetGraphDefBuilder* b,
@@ -95,13 +95,16 @@ class ZipDatasetOp : public DatasetOpKernel {
     class Iterator : public DatasetIterator<Dataset> {
      public:
       explicit Iterator(const Params& params)
-          : DatasetIterator<Dataset>(params) {
-        input_impls_.reserve(params.dataset->inputs_.size());
-        size_t idx = 0;
-        for (const auto& input : params.dataset->inputs_) {
-          input_impls_.emplace_back(input->MakeIterator(
-              strings::StrCat(params.prefix, "[", idx++, "]")));
+          : DatasetIterator<Dataset>(params) {}
+
+      Status Initialize(IteratorContext* ctx) override {
+        mutex_lock l(mu_);
+        input_impls_.resize(dataset()->inputs_.size());
+        for (size_t i = 0; i < input_impls_.size(); ++i) {
+          TF_RETURN_IF_ERROR(dataset()->inputs_[i]->MakeIterator(
+              ctx, strings::StrCat(prefix(), "[", i, "]"), &input_impls_[i]));
         }
+        return Status::OK();
       }
 
       Status GetNextInternal(IteratorContext* ctx,
diff --git a/tensorflow/core/kernels/deep_conv2d.cc b/tensorflow/core/kernels/deep_conv2d.cc
index 014684de642764c386d39b9af7e0289fef66e0b1..85a9702ae70c5ef80dcffb338f399ffde7c684e8 100644
--- a/tensorflow/core/kernels/deep_conv2d.cc
+++ b/tensorflow/core/kernels/deep_conv2d.cc
@@ -294,11 +294,11 @@ struct TransformFilterRange {
 
     // Compute number of filter shards.
     const int64 residual_row =
-        std::max(0LL, args.filter_rows - base_filter_rows);
+        std::max(int64{0}, args.filter_rows - base_filter_rows);
     const int64 shard_rows = 1 + (residual_row + 2 - 1) / 2;
 
     const int64 residual_col =
-        std::max(0LL, args.filter_cols - base_filter_cols);
+        std::max(int64{0}, args.filter_cols - base_filter_cols);
     const int64 shard_cols = 1 + (residual_col + 2 - 1) / 2;
 
     // Compute strides to be used for input and output IO.
@@ -415,8 +415,9 @@ struct TransformFilters {
         filter_total_size + filter_transform_buffer_size + filter_out_buf_size;
 
     // Remove fixed cost and divide by per-filter cost.
-    const int64 num_filters_cache = std::max(
-        1LL, (cache_size - filter_transform_matrix_size) / per_filter_cost);
+    const int64 num_filters_cache =
+        std::max(int64{1},
+                 (cache_size - filter_transform_matrix_size) / per_filter_cost);
     const int64 num_filters_transform = std::min(out_depth, num_filters_cache);
 
     // Allocate buffer for filter transform matrix:
@@ -952,11 +953,11 @@ struct DeepConv2D<CPUDevice, T> {
     const int64 base_filter_rows = transform->filter_shape().rows;
 
     const int64 filter_residual_row =
-        std::max(0LL, args.filter_rows - base_filter_rows);
+        std::max(int64{0}, args.filter_rows - base_filter_rows);
     const int64 filter_shards_row = 1 + (filter_residual_row + 2 - 1) / 2;
 
     const int64 filter_residual_col =
-        std::max(0LL, args.filter_cols - base_filter_rows);
+        std::max(int64{0}, args.filter_cols - base_filter_rows);
     const int64 filter_shards_col = 1 + (filter_residual_col + 2 - 1) / 2;
 
     // Allocate buffer for transformed filters.
@@ -1045,8 +1046,8 @@ struct DeepConv2D<CPUDevice, T> {
           buffer1_per_tile_size + buffer2_per_tile_size +
           packed_tile_per_tile_size + gemm_out_per_tile_size;
 
-      const int64 num_tiles_cache =
-          std::max(4LL, (cache_size - total_fixed_cost) / total_per_tile_cost);
+      const int64 num_tiles_cache = std::max(
+          int64{4}, (cache_size - total_fixed_cost) / total_per_tile_cost);
       const int64 num_tiles = std::min(num_tiles_cache, col_tiles);
 
       // Allocate temporary buffer 'buffer1', which is first used for copying
diff --git a/tensorflow/core/kernels/draw_bounding_box_op.cc b/tensorflow/core/kernels/draw_bounding_box_op.cc
index b5d5b880bbbacab07c51fc395b86b4fbbb343d36..618c47e6848b39175597ec0d68c6ec305bc766e8 100644
--- a/tensorflow/core/kernels/draw_bounding_box_op.cc
+++ b/tensorflow/core/kernels/draw_bounding_box_op.cc
@@ -93,14 +93,14 @@ class DrawBoundingBoxesOp : public OpKernel {
         int64 color_index = bb % color_table_length;
         const int64 min_box_row =
             static_cast<float>(tboxes(b, bb, 0)) * (height - 1);
-        const int64 min_box_row_clamp = std::max<int64>(min_box_row, 0);
+        const int64 min_box_row_clamp = std::max<int64>(min_box_row, int64{0});
         const int64 max_box_row =
             static_cast<float>(tboxes(b, bb, 2)) * (height - 1);
         const int64 max_box_row_clamp =
             std::min<int64>(max_box_row, height - 1);
         const int64 min_box_col =
             static_cast<float>(tboxes(b, bb, 1)) * (width - 1);
-        const int64 min_box_col_clamp = std::max<int64>(min_box_col, 0);
+        const int64 min_box_col_clamp = std::max<int64>(min_box_col, int64{0});
         const int64 max_box_col =
             static_cast<float>(tboxes(b, bb, 3)) * (width - 1);
         const int64 max_box_col_clamp = std::min<int64>(max_box_col, width - 1);
diff --git a/tensorflow/core/kernels/eigen_pooling.h b/tensorflow/core/kernels/eigen_pooling.h
index 2f83780525090c90a0a9cfa3268115daa6fbc89b..56de6b1d43d5d255d27ef5385001c2bb33719234 100644
--- a/tensorflow/core/kernels/eigen_pooling.h
+++ b/tensorflow/core/kernels/eigen_pooling.h
@@ -372,16 +372,23 @@ struct reducer_traits<AvgPoolMeanReducer<float>, Device> {
     Cost = 1,
 #if (EIGEN_ARCH_i386 || EIGEN_ARCH_x86_64) && !defined(__CUDACC__)
     // We only support packet access for floats.
-    PacketAccess = true
+    PacketAccess = true,
 #else
-    PacketAccess = false
+    PacketAccess = false,
 #endif
+    IsStateful = true,
+    IsExactlyAssociative = false
   };
 };
 
 template <>
 struct reducer_traits<AvgPoolMeanReducer<float>, GpuDevice> {
-  enum { Cost = 1, PacketAccess = false };
+  enum {
+    Cost = 1,
+    PacketAccess = false,
+    IsStateful = true,
+    IsExactlyAssociative = false
+  };
 };
 
 }  // namespace internal
diff --git a/tensorflow/core/kernels/function_ops.cc b/tensorflow/core/kernels/function_ops.cc
index 8f66f0a7b970f20641f18c9579bb9d2a5ec7ae57..f2724735bf4590c5d771171ef70b0f9f6862d360 100644
--- a/tensorflow/core/kernels/function_ops.cc
+++ b/tensorflow/core/kernels/function_ops.cc
@@ -254,6 +254,7 @@ class SymbolicGradientOp : public AsyncOpKernel {
     opts.runner = ctx->runner();
     opts.stats_collector = ctx->stats_collector();
     opts.step_container = ctx->step_container();
+    opts.collective_executor = ctx->collective_executor();
     std::vector<Tensor> args;
     args.reserve(ctx->num_inputs());
     for (int i = 0; i < ctx->num_inputs(); ++i) {
diff --git a/tensorflow/core/kernels/functional_ops.cc b/tensorflow/core/kernels/functional_ops.cc
index 9ae04a1062fe21eb619b2f967358adae53c1b409..519c47533293aa7d36c7b9d2f54e1349263dc1b1 100644
--- a/tensorflow/core/kernels/functional_ops.cc
+++ b/tensorflow/core/kernels/functional_ops.cc
@@ -152,7 +152,7 @@ class IfOp : public AsyncOpKernel {
         : kernel_(kernel),
           ctx_(ctx),
           cond_(cond),
-          done_(done),
+          done_(std::move(done)),
           lib_(CHECK_NOTNULL(ctx_->function_library())) {
       SetRunOptions(ctx_, &opts_, true /* always_collect_stats */);
       for (int i = 1; i < ctx_->num_inputs(); ++i) {
@@ -174,9 +174,9 @@ class IfOp : public AsyncOpKernel {
               s = SetOutputs(kernel_, ctx_, rets_);
             }
             ctx_->SetStatus(s);
-            auto done = done_;
+            DoneCallback captured_done(std::move(done_));
             delete this;
-            done();
+            captured_done();
           });
     }
 
@@ -184,7 +184,7 @@ class IfOp : public AsyncOpKernel {
     IfOp* const kernel_;
     OpKernelContext* const ctx_;
     const bool cond_;
-    const DoneCallback done_;
+    DoneCallback done_;
     FunctionLibraryRuntime* const lib_;
     FunctionLibraryRuntime::Options opts_;
     TensorVec args_;
@@ -257,7 +257,7 @@ class WhileOp : public AsyncOpKernel {
           ctx_(ctx),
           cond_handle_(cond_handle),
           body_handle_(body_handle),
-          done_(done),
+          done_(std::move(done)),
           lib_(CHECK_NOTNULL(ctx_->function_library())) {
       SetRunOptions(ctx_, &opts_, false /* always_collect_stats */);
       for (int i = 0; i < ctx_->num_inputs(); ++i) {
@@ -518,5 +518,24 @@ REGISTER_KERNEL_BUILDER(Name("For")
                             .HostMemory("delta"),
                         ForOp);
 
+class FakeParamOp : public OpKernel {
+ public:
+  explicit FakeParamOp(OpKernelConstruction* context) : OpKernel(context) {
+    OP_REQUIRES_OK(context, context->GetAttr("dtype", &dtype_));
+  }
+
+  void Compute(OpKernelContext* context) override {
+    // We must produce something (only Switch and Recvs are allowed to output
+    // dead tensors). This output is not expected to be consumed by anything.
+    Tensor output_tensor(dtype_, TensorShape({}));
+    context->set_output(0, output_tensor);
+  }
+
+ private:
+  DataType dtype_;
+};
+
+REGISTER_KERNEL_BUILDER(Name("FakeParam").Device(DEVICE_CPU), FakeParamOp);
+
 }  // namespace
 }  // namespace tensorflow
diff --git a/tensorflow/core/kernels/inplace_ops.cc b/tensorflow/core/kernels/inplace_ops.cc
index ef6ce0546b0811edda3331de69906237cca76dd4..8f51cc3819df5edd798a72d2e6970585bad6cf56 100644
--- a/tensorflow/core/kernels/inplace_ops.cc
+++ b/tensorflow/core/kernels/inplace_ops.cc
@@ -476,6 +476,7 @@ REGISTER_EMPTY(string, CPU)
 REGISTER_EMPTY(int32, CPU)
 REGISTER_EMPTY(int64, CPU)
 REGISTER_EMPTY(bool, CPU)
+REGISTER_EMPTY(uint8, CPU)
 
 #if GOOGLE_CUDA
 
diff --git a/tensorflow/core/kernels/lrn_op_test.cc b/tensorflow/core/kernels/lrn_op_test.cc
index 9c8a1dfa9a52c0a6d6dbbc583f6b5c4ed9f0e4dc..5d8c5c21ca21f097cb5030b43e288765ae384eaf 100644
--- a/tensorflow/core/kernels/lrn_op_test.cc
+++ b/tensorflow/core/kernels/lrn_op_test.cc
@@ -71,7 +71,7 @@ class LRNFloatTest : public OpsTestBase {
       Eigen::Tensor<float, 1, Eigen::RowMajor> out_col(depth);
       for (int64 d = 0; d < depth; ++d) {
         float denom = 0.0f;
-        for (int64 r = std::max(0ll, d - depth_radius);
+        for (int64 r = std::max(int64{0}, d - depth_radius);
              r < std::min(depth, d + depth_radius + 1); ++r) {
           denom += in(i, r) * in(i, r);
         }
diff --git a/tensorflow/core/kernels/matmul_op.cc b/tensorflow/core/kernels/matmul_op.cc
index f9c15ce6d745eb2571aedf12d04dfb271461f5a0..fc3b3d344527a3935c83e387646d57bc5e067f75 100644
--- a/tensorflow/core/kernels/matmul_op.cc
+++ b/tensorflow/core/kernels/matmul_op.cc
@@ -551,7 +551,8 @@ struct MatMulFunctor<SYCLDevice, T> {
                               .Label("cublas"),                    \
                           MatMulOp<GPUDevice, T, true /* cublas */>)
 
-#if defined(INTEL_MKL)
+#if defined(INTEL_MKL) && !defined(DO_NOT_USE_ML)
+
 // MKL does not support half and int32 types for matrix-multiplication, so
 // register the kernel to use default Eigen based implementations for these
 // types. Registration for NO-LABEL version is in mkl_matmul_op.cc
diff --git a/tensorflow/core/kernels/matrix_band_part_op.cc b/tensorflow/core/kernels/matrix_band_part_op.cc
index 1439141f6493943c94516e6f0f9c05e8314401d5..61c527746469f42cc87f6b9baedbbff4e66dde7e 100644
--- a/tensorflow/core/kernels/matrix_band_part_op.cc
+++ b/tensorflow/core/kernels/matrix_band_part_op.cc
@@ -159,7 +159,7 @@ struct MatrixBandPartFunctor<CPUDevice, Scalar> {
           const int64 band_start =
               num_lower_diags < 0
                   ? 0
-                  : std::min(n, std::max(0ll, row - num_lower_diags));
+                  : std::min(n, std::max(int64{0}, row - num_lower_diags));
           const int64 band_end =
               num_upper_diags < 0
                   ? n
diff --git a/tensorflow/core/kernels/mkl_aggregate_ops.cc b/tensorflow/core/kernels/mkl_aggregate_ops.cc
index b539b00009eb5cdc383aa557881e32782dce5193..4ad858e4a91299a5427d7165728a9c1c85598bb7 100644
--- a/tensorflow/core/kernels/mkl_aggregate_ops.cc
+++ b/tensorflow/core/kernels/mkl_aggregate_ops.cc
@@ -24,15 +24,16 @@ limitations under the License.
 #include "tensorflow/core/lib/gtl/inlined_vector.h"
 #include "tensorflow/core/platform/logging.h"
 
-#include "mkl_dnn.h"
-#include "mkl_dnn_types.h"
-#include "tensorflow/core/util/mkl_util.h"
 
 #ifndef INTEL_MKL_ML
 #include "mkldnn.hpp"
 using mkldnn::stream;
 using mkldnn::sum;
+#else
+#include "mkl_dnn.h"
+#include "mkl_dnn_types.h"
 #endif
+#include "tensorflow/core/util/mkl_util.h"
 
 namespace tensorflow {
 typedef Eigen::ThreadPoolDevice CPUDevice;
@@ -333,7 +334,7 @@ class MklAddNOp : public OpKernel {
 
       if (!input1_in_mkl_format && src1_dims_size == 0) {
         Tensor* dst_tensor = nullptr;
-        MklShape mkl_shape_dst;
+        MklDnnShape mkl_shape_dst;
         mkl_shape_dst.SetMklTensor(false);
         AllocateOutputSetMklShape(ctx, output_idx, &dst_tensor,
                                   src1_tensor.shape(), mkl_shape_dst);
@@ -347,7 +348,7 @@ class MklAddNOp : public OpKernel {
       if (!input1_in_mkl_format && !input2_in_mkl_format) {
         if (src1_tensor.shape().num_elements() == 0) {
           Tensor* dst_tensor = nullptr;
-          MklShape mkl_shape_dst;
+          MklDnnShape mkl_shape_dst;
           mkl_shape_dst.SetMklTensor(false);
           AllocateOutputSetMklShape(ctx, output_idx, &dst_tensor,
                                     src1_tensor.shape(), mkl_shape_dst);
diff --git a/tensorflow/core/kernels/mkl_batch_matmul_op.cc b/tensorflow/core/kernels/mkl_batch_matmul_op.cc
index 723b445a7568775a13b89c9fbf0e7dc70c4b8b8c..45328b03d683d854dfdcb06549f473f2de68912e 100644
--- a/tensorflow/core/kernels/mkl_batch_matmul_op.cc
+++ b/tensorflow/core/kernels/mkl_batch_matmul_op.cc
@@ -25,7 +25,7 @@ limitations under the License.
 
 #define EIGEN_USE_THREADS
 
-#if defined(INTEL_MKL)
+#if defined(INTEL_MKL) && !defined(DO_NOT_USE_ML)
 #include <vector>
 #include "mkl_cblas.h"
 #include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
diff --git a/tensorflow/core/kernels/mkl_concat_op.cc b/tensorflow/core/kernels/mkl_concat_op.cc
index a9b952095d5798fd7f53907850a186951f88efc9..31d1b949ef0928babea0583a1204218aeac5259d 100644
--- a/tensorflow/core/kernels/mkl_concat_op.cc
+++ b/tensorflow/core/kernels/mkl_concat_op.cc
@@ -27,16 +27,17 @@ limitations under the License.
 #include "tensorflow/core/lib/core/status.h"
 #include "tensorflow/core/platform/types.h"
 
-#include "mkl_dnn.h"
-#include "mkl_dnn_types.h"
-#include "tensorflow/core/util/mkl_util.h"
 
 #ifndef INTEL_MKL_ML
 #include "mkldnn.hpp"
 
 using mkldnn::concat;
 using mkldnn::stream;
+#else
+#include "mkl_dnn.h"
+#include "mkl_dnn_types.h"
 #endif
+#include "tensorflow/core/util/mkl_util.h"
 
 namespace tensorflow {
 typedef Eigen::ThreadPoolDevice CPUDevice;
diff --git a/tensorflow/core/kernels/mkl_conv_grad_bias_ops.cc b/tensorflow/core/kernels/mkl_conv_grad_bias_ops.cc
index a6698a1a0780ad90dea8485c413786d982d2c952..f857be6c3214a4fe758a9b3369ac4d33cf86e57f 100644
--- a/tensorflow/core/kernels/mkl_conv_grad_bias_ops.cc
+++ b/tensorflow/core/kernels/mkl_conv_grad_bias_ops.cc
@@ -39,8 +39,10 @@ limitations under the License.
 #include "tensorflow/core/util/use_cudnn.h"
 #include "tensorflow/core/util/work_sharder.h"
 
+#ifdef INTEL_MKL_ML
 #include "mkl_dnn.h"
 #include "mkl_dnn_types.h"
+#endif
 #include "tensorflow/core/util/mkl_util.h"
 
 namespace tensorflow {
diff --git a/tensorflow/core/kernels/mkl_conv_grad_filter_ops.cc b/tensorflow/core/kernels/mkl_conv_grad_filter_ops.cc
index e0706568b15204312445446a161d0aa9911f9e33..356eed8b676efe2706b9919d0bbdca0cc656e67b 100644
--- a/tensorflow/core/kernels/mkl_conv_grad_filter_ops.cc
+++ b/tensorflow/core/kernels/mkl_conv_grad_filter_ops.cc
@@ -38,9 +38,6 @@ limitations under the License.
 #include "tensorflow/core/util/use_cudnn.h"
 #include "tensorflow/core/util/work_sharder.h"
 
-#include "mkl_dnn.h"
-#include "mkl_dnn_types.h"
-#include "tensorflow/core/util/mkl_util.h"
 
 #ifndef INTEL_MKL_ML
 #include "mkldnn.hpp"
@@ -49,8 +46,13 @@ using mkldnn::convolution_backward_weights;
 using mkldnn::memory;
 using mkldnn::prop_kind;
 using mkldnn::stream;
+#else
+#include "mkl_dnn.h"
+#include "mkl_dnn_types.h"
 #endif
 
+#include "tensorflow/core/util/mkl_util.h"
+
 namespace tensorflow {
 
 typedef Eigen::ThreadPoolDevice CPUDevice;
diff --git a/tensorflow/core/kernels/mkl_conv_grad_input_ops.cc b/tensorflow/core/kernels/mkl_conv_grad_input_ops.cc
index d203c04934131ee56fbca169d4c3e5e534d7986f..21b18f911927448fd2869ea19abb5e4e32363ba9 100644
--- a/tensorflow/core/kernels/mkl_conv_grad_input_ops.cc
+++ b/tensorflow/core/kernels/mkl_conv_grad_input_ops.cc
@@ -23,8 +23,10 @@ limitations under the License.
 #define EIGEN_USE_THREADS
 #include <algorithm>
 #include <vector>
+#ifdef INTEL_MKL_ML
 #include "mkl_dnn.h"
 #include "mkl_dnn_types.h"
+#endif
 #include "tensorflow/core/framework/numeric_op.h"
 #include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/framework/register_types.h"
diff --git a/tensorflow/core/kernels/mkl_conv_ops.cc b/tensorflow/core/kernels/mkl_conv_ops.cc
index f2b14f12789d0aff02fb728cf843813ab45d6662..cede0b9dd6bc5c4d47b8cb2e5a78e357e8354b98 100644
--- a/tensorflow/core/kernels/mkl_conv_ops.cc
+++ b/tensorflow/core/kernels/mkl_conv_ops.cc
@@ -59,7 +59,8 @@ namespace tensorflow {
 
 #ifndef INTEL_MKL_ML
 
-struct ConvFwdDimensions {
+// This structure aggregates multiple inputs to Conv2DFwd* methods.
+struct MklConvFwdParams {
   memory::dims src_dims;
   memory::dims filter_dims;
   memory::dims bias_dims;
@@ -69,7 +70,7 @@ struct ConvFwdDimensions {
   memory::dims padding_left;
   memory::dims padding_right;
 
-  ConvFwdDimensions(memory::dims src_dims,
+  MklConvFwdParams(memory::dims src_dims,
     memory::dims filter_dims, memory::dims bias_dims,
     memory::dims dst_dims, memory::dims strides,
     memory::dims dilations, memory::dims padding_left,
@@ -82,35 +83,41 @@ struct ConvFwdDimensions {
 };
 
 template <typename T>
-class Conv2DFwd : public DnnOp {
+class MklConv2DFwdPrimitive: public MklPrimitive {
  public:
-  explicit Conv2DFwd(const ConvFwdDimensions& convFwdDims) {
-    fwd_stream_.reset(new stream(stream::kind::eager));
+  explicit MklConv2DFwdPrimitive(const MklConvFwdParams& convFwdDims) :
+    cpu_engine_(engine::cpu, 0) {
+    context_.fwd_stream.reset(new stream(stream::kind::eager));
     // create conv primitive
-    if (conv_fwd_ == nullptr) {
+    if (context_.conv_fwd == nullptr) {
       Setup(convFwdDims);
     }
   }
 
-  ~Conv2DFwd() {}
+  ~MklConv2DFwdPrimitive() {}
 
   // Convolution forward execute with bias
   //   src_data:    input data buffer of src
   //   filter_data: input data buffer of filter (weights)
   //   bias_data:   input data buffer of bias
   //   dst_data:    output data buffer of dst
-  void Execute(T* src_data, T* filter_data, T* bias_data, T* dst_data) {
-    src_mem_->set_data_handle(static_cast<void*>(src_data));
-    filter_mem_->set_data_handle(static_cast<void*>(filter_data));
-    bias_mem_->set_data_handle(static_cast<void*>(bias_data));
-    dst_mem_->set_data_handle(static_cast<void*>(dst_data));
-    fwd_stream_->submit(fwd_primitives_);
+  void Execute(const T* src_data, const T* filter_data,
+      const T* bias_data, const T* dst_data) {
+    context_.src_mem->set_data_handle(
+        static_cast<void*>(const_cast<T*>(src_data)));
+    context_.filter_mem->set_data_handle(
+        static_cast<void*>(const_cast<T*>(filter_data)));
+    context_.bias_mem->set_data_handle(
+        static_cast<void*>(const_cast<T*>(bias_data)));
+    context_.dst_mem->set_data_handle(
+        static_cast<void*>(const_cast<T*>(dst_data)));
+    context_.fwd_stream->submit(context_.fwd_primitives);
 
     // after exec, set data handle back
-    src_mem_->set_data_handle(DummyData);
-    filter_mem_->set_data_handle(DummyData);
-    bias_mem_->set_data_handle(DummyData);
-    dst_mem_->set_data_handle(DummyData);
+    context_.src_mem->set_data_handle(DummyData);
+    context_.filter_mem->set_data_handle(DummyData);
+    context_.bias_mem->set_data_handle(DummyData);
+    context_.dst_mem->set_data_handle(DummyData);
 
     return;
   }
@@ -119,139 +126,173 @@ class Conv2DFwd : public DnnOp {
   //   src_data:    input data buffer of src
   //   filter_data: input data buffer of filter (weights)
   //   dst_data:    output data buffer of dst
-  void Execute(T* src_data, T* filter_data, T* dst_data) {
-    src_mem_->set_data_handle(static_cast<void*>(src_data));
-    filter_mem_->set_data_handle(static_cast<void*>(filter_data));
-    dst_mem_->set_data_handle(static_cast<void*>(dst_data));
-    fwd_stream_->submit(fwd_primitives_);
-
-    // after exec, set data handle back
-    src_mem_->set_data_handle(DummyData);
-    filter_mem_->set_data_handle(DummyData);
-    dst_mem_->set_data_handle(DummyData);
+  void Execute(const T* src_data, const T* filter_data,
+      const T* dst_data) {
+    context_.src_mem->set_data_handle(
+        static_cast<void*>(const_cast<T*>(src_data)));
+    context_.filter_mem->set_data_handle(
+        static_cast<void*>(const_cast<T*>(filter_data)));
+    context_.dst_mem->set_data_handle(
+        static_cast<void*>(const_cast<T*>(dst_data)));
+    context_.fwd_stream->submit(context_.fwd_primitives);
+
+    // after execution, set data handle back
+    context_.src_mem->set_data_handle(DummyData);
+    context_.filter_mem->set_data_handle(DummyData);
+    context_.dst_mem->set_data_handle(DummyData);
+  }
 
-    return;
+  memory::format GetSrcMemoryFormat() const {
+    return context_.src_fmt;
   }
 
-  // expected memory format for this primitive instance
-  memory::format src_fmt_;
-  memory::format filter_fmt_;
+  memory::format GetFilterMemoryFormat() const {
+    return context_.filter_fmt;
+  }
 
-  // convolution primitive
-  std::shared_ptr<mkldnn::convolution_forward::primitive_desc> fwd_pd_;
-  std::shared_ptr<mkldnn::primitive> conv_fwd_;
+  std::shared_ptr<mkldnn::convolution_forward::primitive_desc>
+  GetPrimitiveDesc() const {
+    return context_.fwd_pd;
+  }
 
  private:
-  void Setup(const ConvFwdDimensions& convFwdDims) {
+  // Primitive reuse context for Conv2D Fwd op
+  struct ConvFwdContext {
+    // expected memory format for this primitive instance
+    memory::format src_fmt;
+    memory::format filter_fmt;
+
+    // MKLDNN memory
+    std::shared_ptr<mkldnn::memory> src_mem;
+    std::shared_ptr<mkldnn::memory> filter_mem;
+    std::shared_ptr<mkldnn::memory> bias_mem;
+    std::shared_ptr<mkldnn::memory> dst_mem;
+
+    // desc & prmitive desc
+    std::shared_ptr<mkldnn::convolution_forward::desc> fwd_desc;
+
+    // memory desc
+    std::shared_ptr<mkldnn::memory::desc> src_md;
+    std::shared_ptr<mkldnn::memory::desc> filter_md;
+    std::shared_ptr<mkldnn::memory::desc> bias_md;
+    std::shared_ptr<mkldnn::memory::desc> dst_md;
+
+    // convolution primitive
+    std::shared_ptr<mkldnn::convolution_forward::primitive_desc> fwd_pd;
+    std::shared_ptr<mkldnn::primitive> conv_fwd;
+
+    std::shared_ptr<mkldnn::stream> fwd_stream;
+    std::vector<mkldnn::primitive> fwd_primitives;
+
+    ConvFwdContext() :
+       src_fmt(memory::format::any), filter_fmt(memory::format::any),
+       src_mem(nullptr), filter_mem(nullptr), bias_mem(nullptr),
+       dst_mem(nullptr), fwd_desc(nullptr),
+       src_md(nullptr), filter_md(nullptr), bias_md(nullptr),
+       fwd_pd(nullptr), conv_fwd(nullptr), fwd_stream(nullptr) {
+    }
+  };
+
+  void Setup(const MklConvFwdParams& convFwdDims) {
     // create memory descriptors for convolution data w/ no specified format
-    src_md_.reset(new memory::desc({convFwdDims.src_dims},
+    context_.src_md.reset(new memory::desc({convFwdDims.src_dims},
         MklDnnType<T>(), memory::format::any));
 
-    filter_md_.reset(new memory::desc({convFwdDims.filter_dims},
+    context_.filter_md.reset(new memory::desc({convFwdDims.filter_dims},
         MklDnnType<T>(), memory::format::any));
 
-    dst_md_.reset(new memory::desc({convFwdDims.dst_dims},
+    context_.dst_md.reset(new memory::desc({convFwdDims.dst_dims},
         MklDnnType<T>(), memory::format::any));
 
     if (!convFwdDims.bias_dims.empty())
-        bias_md_.reset(new memory::desc({convFwdDims.bias_dims},
+        context_.bias_md.reset(new memory::desc({convFwdDims.bias_dims},
             MklDnnType<T>(), memory::format::any));
 
     // create a convolution
     if (!convFwdDims.bias_dims.empty()) {
-      fwd_desc_.reset(new convolution_forward::desc(prop_kind::forward,
-          convolution_direct, *src_md_, *filter_md_, *bias_md_, *dst_md_,
+      context_.fwd_desc.reset(new convolution_forward::desc(prop_kind::forward,
+          convolution_direct, *context_.src_md, *context_.filter_md,
+          *context_.bias_md, *context_.dst_md,
           convFwdDims.strides, convFwdDims.dilations, convFwdDims.padding_left,
           convFwdDims.padding_right, padding_kind::zero));
     } else {
-      fwd_desc_.reset(new convolution_forward::desc(prop_kind::forward,
-          convolution_direct, *src_md_, *filter_md_, *dst_md_,
-          convFwdDims.strides, convFwdDims.dilations, convFwdDims.padding_left,
-          convFwdDims.padding_right, padding_kind::zero));
+      context_.fwd_desc.reset(new convolution_forward::desc(prop_kind::forward,
+          convolution_direct, *context_.src_md, *context_.filter_md,
+          *context_.dst_md, convFwdDims.strides, convFwdDims.dilations,
+          convFwdDims.padding_left, convFwdDims.padding_right,
+          padding_kind::zero));
     }
 
-    fwd_pd_.reset(new convolution_forward::primitive_desc(
-        *fwd_desc_, cpu_engine_));
+    context_.fwd_pd.reset(new convolution_forward::primitive_desc(
+        *context_.fwd_desc, cpu_engine_));
 
     // store the expected memory format
-    src_fmt_ = static_cast<mkldnn::memory::format>(
-        fwd_pd_.get()->src_primitive_desc().desc().data.format);
+    context_.src_fmt = static_cast<mkldnn::memory::format>(
+        context_.fwd_pd.get()->src_primitive_desc().desc().data.format);
 
-    filter_fmt_ = static_cast<mkldnn::memory::format>(
-        fwd_pd_.get()->weights_primitive_desc().desc().data.format);
+    context_.filter_fmt = static_cast<mkldnn::memory::format>(
+        context_.fwd_pd.get()->weights_primitive_desc().desc().data.format);
 
     // create memory primitive based on dummy data
-    src_mem_.reset(new memory(fwd_pd_.get()->src_primitive_desc(), DummyData));
-    filter_mem_.reset(new memory(fwd_pd_.get()->weights_primitive_desc(),
-                      DummyData));
-    dst_mem_.reset(new memory(fwd_pd_.get()->dst_primitive_desc(), DummyData));
+    context_.src_mem.reset(new memory(
+        context_.fwd_pd.get()->src_primitive_desc(), DummyData));
+    context_.filter_mem.reset(new memory(
+        context_.fwd_pd.get()->weights_primitive_desc(), DummyData));
+    context_.dst_mem.reset(new memory(
+        context_.fwd_pd.get()->dst_primitive_desc(), DummyData));
 
     // create convolution primitive and add it to net
     if (!convFwdDims.bias_dims.empty()) {
-        bias_mem_.reset(new memory({{{convFwdDims.bias_dims}, MklDnnType<T>(),
-                        memory::format::x}, cpu_engine_}, DummyData));
-        conv_fwd_.reset(new convolution_forward(*fwd_pd_, *src_mem_,
-                        *filter_mem_, *bias_mem_, *dst_mem_));
+        context_.bias_mem.reset(new memory({{{convFwdDims.bias_dims},
+            MklDnnType<T>(), memory::format::x}, cpu_engine_}, DummyData));
+        context_.conv_fwd.reset(new convolution_forward(
+            *context_.fwd_pd, *context_.src_mem, *context_.filter_mem,
+            *context_.bias_mem, *context_.dst_mem));
     } else {
-        conv_fwd_.reset(new convolution_forward(*fwd_pd_, *src_mem_,
-                        *filter_mem_, *dst_mem_));
+        context_.conv_fwd.reset(new convolution_forward(
+            *context_.fwd_pd, *context_.src_mem,
+            *context_.filter_mem, *context_.dst_mem));
     }
 
-    fwd_primitives_.push_back(*conv_fwd_);
+    context_.fwd_primitives.push_back(*context_.conv_fwd);
     return;
   }
 
-  // MKLDNN memory
-  std::shared_ptr<mkldnn::memory> src_mem_;
-  std::shared_ptr<mkldnn::memory> filter_mem_;
-  std::shared_ptr<mkldnn::memory> bias_mem_;
-  std::shared_ptr<mkldnn::memory> dst_mem_;
-
-  std::shared_ptr<mkldnn::stream> fwd_stream_;
-  std::vector<mkldnn::primitive> fwd_primitives_;
-
-  // desc & prmitive desc
-  std::shared_ptr<mkldnn::convolution_forward::desc> fwd_desc_;
-
-  // memory desc
-  std::shared_ptr<mkldnn::memory::desc> src_md_;
-  std::shared_ptr<mkldnn::memory::desc> filter_md_;
-  std::shared_ptr<mkldnn::memory::desc> bias_md_;
-  std::shared_ptr<mkldnn::memory::desc> dst_md_;
-
-  engine cpu_engine_ = engine(engine::cpu, 0);
+  struct ConvFwdContext context_;
+  engine cpu_engine_;
 };
 
 template <typename T>
-class Conv2DFwdFactory : public DnnOpFactory<T> {
+class MklConv2DFwdPrimitiveFactory : public MklPrimitiveFactory<T> {
  public:
-  static Conv2DFwd<T>* Get(const ConvFwdDimensions& convFwdDims) {
-     Conv2DFwd<T>* conv2d_fwd = nullptr;
+  static MklConv2DFwdPrimitive<T>* Get(const MklConvFwdParams& convFwdDims) {
+     MklConv2DFwdPrimitive<T>* conv2d_fwd = nullptr;
 
      // try to find a suitable one in pool
-     conv2d_fwd = dynamic_cast<Conv2DFwd<T>*> (
-       Conv2DFwdFactory<T>::GetInstance().GetConv2DFwd(convFwdDims));
+     conv2d_fwd = dynamic_cast<MklConv2DFwdPrimitive<T>*> (
+       MklConv2DFwdPrimitiveFactory<T>::GetInstance().GetConv2DFwd(
+       convFwdDims));
 
      if (conv2d_fwd == nullptr) {
-       conv2d_fwd = new Conv2DFwd<T>(convFwdDims);
-       Conv2DFwdFactory<T>::GetInstance().SetConv2DFwd(
+       conv2d_fwd = new MklConv2DFwdPrimitive<T>(convFwdDims);
+       MklConv2DFwdPrimitiveFactory<T>::GetInstance().SetConv2DFwd(
            convFwdDims, conv2d_fwd);
      }
      return conv2d_fwd;
   }
 
  private:
-  Conv2DFwdFactory() {}
-  ~Conv2DFwdFactory() {}
+  MklConv2DFwdPrimitiveFactory() {}
+  ~MklConv2DFwdPrimitiveFactory() {}
 
   static const int kDilationH = 0, kDilationW = 1;
 
-  static Conv2DFwdFactory& GetInstance() {
-    static Conv2DFwdFactory instance_;
+  static MklConv2DFwdPrimitiveFactory& GetInstance() {
+    static MklConv2DFwdPrimitiveFactory instance_;
     return instance_;
   }
 
-  static std::string CreateKey(const ConvFwdDimensions& convFwdDims) {
+  static std::string CreateKey(const MklConvFwdParams& convFwdDims) {
     std::string prefix = "conv2d_fwd_";
     FactoryKeyCreator key_creator;
     key_creator.AddAsKey(prefix);
@@ -266,12 +307,12 @@ class Conv2DFwdFactory : public DnnOpFactory<T> {
     return key_creator.GetKey();
   }
 
-  DnnOp* GetConv2DFwd(const ConvFwdDimensions& convFwdDims) {
+  MklPrimitive* GetConv2DFwd(const MklConvFwdParams& convFwdDims) {
     std::string key = CreateKey(convFwdDims);
     return this->GetOp(key);
   }
 
-  void SetConv2DFwd(const ConvFwdDimensions& convFwdDims, DnnOp *op) {
+  void SetConv2DFwd(const MklConvFwdParams& convFwdDims, MklPrimitive *op) {
     std::string key = CreateKey(convFwdDims);
     this->SetOp(key, op);
   }
@@ -762,7 +803,6 @@ class MklConv2DOp : public OpKernel {
 
       MklDnnData<T> src(&cpu_engine);
       MklDnnData<T> filter(&cpu_engine);
-      MklDnnData<T> dst(&cpu_engine);  // output
 
       memory::dims src_dims, filter_dims, padding_left, padding_right,
                    dilations, strides;
@@ -812,7 +852,6 @@ class MklConv2DOp : public OpKernel {
       auto src_md = src_mkl_shape.IsMklTensor()
                         ? src_mkl_shape.GetMklLayout()
                         : memory::desc(src_dims, MklDnnType<T>(), tf_fmt);
-      src.SetUsrMem(src_md, &src_tensor);
 
       // Although filter shape (filter_dims) required is in MKL-DNN order,
       // the layout is Tensorflow's layout (HWIO).
@@ -820,29 +859,28 @@ class MklConv2DOp : public OpKernel {
                            ? filter_mkl_shape.GetMklLayout()
                            : memory::desc(filter_dims, MklDnnType<T>(),
                                           memory::format::hwio);
-      filter.SetUsrMem(filter_md, &filter_tensor);
 
       // MKLDNN dilation starts from 0.
       dilations[kDilationH] -= 1;
       dilations[kDilationW] -= 1;
 
       // get a conv2d fwd from primitive pool
-      Conv2DFwd<T> *conv2d_fwd = nullptr;
+      MklConv2DFwdPrimitive<T> *conv2d_fwd = nullptr;
       if (biasEnabled) {
         memory::dims bias_dims = {};
         conv_utl.GetBiasSizeInMklOrder(kInputIndex_Bias, &bias_dims);
-        ConvFwdDimensions convFwdDims(src_dims, filter_dims, bias_dims,
+        MklConvFwdParams convFwdDims(src_dims, filter_dims, bias_dims,
           dst_dims_mkl_order, strides, dilations, padding_left, padding_right);
-        conv2d_fwd = Conv2DFwdFactory<T>::Get(convFwdDims);
+        conv2d_fwd = MklConv2DFwdPrimitiveFactory<T>::Get(convFwdDims);
       } else {
-        ConvFwdDimensions convFwdDims(src_dims, filter_dims, NONE_DIMS,
+        MklConvFwdParams convFwdDims(src_dims, filter_dims, NONE_DIMS,
           dst_dims_mkl_order, strides, dilations, padding_left, padding_right);
-        conv2d_fwd = Conv2DFwdFactory<T>::Get(convFwdDims);
+        conv2d_fwd = MklConv2DFwdPrimitiveFactory<T>::Get(convFwdDims);
       }
 
       // allocate output tensors output_tensor and filter_out_tensor
       std::shared_ptr<mkldnn::convolution_forward::primitive_desc>
-      conv_fwd_pd = conv2d_fwd->fwd_pd_;
+      conv_fwd_pd = conv2d_fwd->GetPrimitiveDesc();
       AllocateOutputTensor(context, *conv_fwd_pd,
                        dst_dims_mkl_order, tf_fmt, &dst_tensor);
       Tensor* filter_out_tensor = nullptr;
@@ -854,20 +892,30 @@ class MklConv2DOp : public OpKernel {
 
       // check whether src/filter need reorder
       std::vector<primitive> net;
-      if (src_md.data.format != conv2d_fwd->src_fmt_)
-          src.CheckReorderToOpMem(
-              conv_fwd_pd.get()->src_primitive_desc(), &net);
-
-      if (filter_md.data.format != conv2d_fwd->filter_fmt_)
-          filter.CheckReorderToOpMem(
-              conv_fwd_pd.get()->weights_primitive_desc(),
-              filter.GetTensorBuffer(filter_out_tensor), &net);
+      T *src_data = nullptr;
+      if (src_md.data.format != conv2d_fwd->GetSrcMemoryFormat()) {
+        src.SetUsrMem(src_md, &src_tensor);
+        src.CheckReorderToOpMem(
+            conv_fwd_pd.get()->src_primitive_desc(), &net);
+        src_data = static_cast<T*>(src.GetOpMem().get_data_handle());
+      } else {
+        src_data = static_cast<T*>(const_cast<T*>(
+                    src_tensor.flat<T>().data()));
+      }
+      T *filter_data = nullptr;
+      if (filter_md.data.format != conv2d_fwd->GetFilterMemoryFormat()) {
+        filter.SetUsrMem(filter_md, &filter_tensor);
+        filter.CheckReorderToOpMem(
+            conv_fwd_pd.get()->weights_primitive_desc(),
+            filter.GetTensorBuffer(filter_out_tensor), &net);
+        filter_data = static_cast<T*>(filter.GetOpMem().get_data_handle());
+      } else {
+        filter_data = static_cast<T*>(const_cast<T*>(
+                       filter_tensor.flat<T>().data()));
+      }
+
       stream(stream::kind::eager).submit(net).wait();
 
-      T* src_data = static_cast<T*>(
-                src.GetOpMem().get_data_handle());
-      T* filter_data = static_cast<T*>(
-                filter.GetOpMem().get_data_handle());
 
       // execute convolution
       if (biasEnabled) {
diff --git a/tensorflow/core/kernels/mkl_fused_batch_norm_op.cc b/tensorflow/core/kernels/mkl_fused_batch_norm_op.cc
index 62aafa793056f233ac84d1c8ca49bba1e73035c9..3fe660cf968b4ea890b8c42930b3808943ce9112 100644
--- a/tensorflow/core/kernels/mkl_fused_batch_norm_op.cc
+++ b/tensorflow/core/kernels/mkl_fused_batch_norm_op.cc
@@ -21,21 +21,21 @@ limitations under the License.
 #include "tensorflow/core/framework/tensor_types.h"
 #include "tensorflow/core/util/tensor_format.h"
 
-#include "mkl_dnn.h"
-#include "mkl_dnn_types.h"
-#include "tensorflow/core/util/mkl_util.h"
 
 #ifndef INTEL_MKL_ML
 #include "mkldnn.hpp"
-
 using mkldnn::batch_normalization_backward;
 using mkldnn::batch_normalization_forward;
 using mkldnn::prop_kind;
 using mkldnn::stream;
 using mkldnn::use_global_stats;
 using mkldnn::use_scale_shift;
+#else
+#include "mkl_dnn.h"
+#include "mkl_dnn_types.h"
 #endif
 
+#include "tensorflow/core/util/mkl_util.h"
 // TODO(inteltf) Address comments from PR 8968.
 
 namespace tensorflow {
diff --git a/tensorflow/core/kernels/mkl_identity_op.cc b/tensorflow/core/kernels/mkl_identity_op.cc
index 6c027f8e728b8660d18a70ae58995fa104f0b375..b02cc5384c347ed05a53c9f320dfcf330903fee9 100644
--- a/tensorflow/core/kernels/mkl_identity_op.cc
+++ b/tensorflow/core/kernels/mkl_identity_op.cc
@@ -24,8 +24,10 @@ limitations under the License.
 #include "tensorflow/core/lib/core/status.h"
 #include "tensorflow/core/platform/logging.h"
 
+#ifdef INTEL_MKL_ML
 #include "mkl_dnn.h"
 #include "mkl_dnn_types.h"
+#endif
 #include "tensorflow/core/util/mkl_util.h"
 
 #ifndef INTEL_MKL_ML
diff --git a/tensorflow/core/kernels/mkl_input_conversion_op.cc b/tensorflow/core/kernels/mkl_input_conversion_op.cc
index cda1402b035cdccc4677a8d074178dfc79170798..dc4da33a0685662cedd1c43ff578416dd07348a4 100644
--- a/tensorflow/core/kernels/mkl_input_conversion_op.cc
+++ b/tensorflow/core/kernels/mkl_input_conversion_op.cc
@@ -369,8 +369,8 @@ class MklInputConversionOp : public OpKernel {
       MklToTfOp<Device, T>::ConvertMklToTf(this, context, data_format_str,
                                            op_data_type, has_avx512f_,
                                            kInputIndex_1);
-      SetDummyMklShapeOutput(context, kInputIndex_0);
-      SetDummyMklShapeOutput(context, kInputIndex_1);
+      SetDummyMklDnnShapeOutput(context, kInputIndex_0);
+      SetDummyMklDnnShapeOutput(context, kInputIndex_1);
       return;
     }
 
@@ -439,11 +439,11 @@ class MklInputConversionOp : public OpKernel {
                    tensor_out, &net);
       if(!reordered) {
         // This is the case that the TF tensor has the same shape and format of
-        // mkl tensor. However, tf_tensor can not be simply forwarded to the output
-        // tensor since mkl data tensor is always one dimensional tensor. 
-        // Tensor::CopyFrom shares the buffer of the other tensor while set its shape
-        // to the other tensor. 
-        tensor_out->CopyFrom(*tf_tensor, tensor_out->shape());
+        // mkl tensor. However, tf_tensor can not be simply forwarded to the
+        // output tensor since mkl data tensor is always one dimensional tensor.
+        // Tensor::CopyFrom shares the buffer of the other tensor while set its
+        // shape to the other tensor.
+        CHECK(tensor_out->CopyFrom(*tf_tensor, tensor_out->shape()));
       }
       else  
         stream(stream::kind::eager).submit(net).wait();
@@ -458,7 +458,7 @@ class MklInputConversionOp : public OpKernel {
       MklToTfOp<Device, T>::ConvertMklToTf(this, context, data_format_str,
                                            op_data_type, has_avx512f_,
                                            mkl_tensor_index);
-      SetDummyMklShapeOutput(context, mkl_tensor_index);
+      SetDummyMklDnnShapeOutput(context, mkl_tensor_index);
 
       // The tensor in TF format passes through
       ForwardTfTensorInToOut(context, tf_tensor_index, tf_tensor_index);
diff --git a/tensorflow/core/kernels/mkl_lrn_op.cc b/tensorflow/core/kernels/mkl_lrn_op.cc
index eef254cdadbde377c463ea2c5dad693d890d1dc5..dfe50e6a7f2137116d9635ecebf8889f528fff26 100644
--- a/tensorflow/core/kernels/mkl_lrn_op.cc
+++ b/tensorflow/core/kernels/mkl_lrn_op.cc
@@ -22,8 +22,6 @@ limitations under the License.
 
 #define EIGEN_USE_THREADS
 #include <vector>
-#include "mkl_dnn.h"
-#include "mkl_dnn_types.h"
 #include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
 #include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/framework/register_types.h"
@@ -31,7 +29,6 @@ limitations under the License.
 #include "tensorflow/core/kernels/bounds_check.h"
 #include "tensorflow/core/kernels/ops_util.h"
 #include "tensorflow/core/lib/core/errors.h"
-#include "tensorflow/core/util/mkl_util.h"
 #include "tensorflow/core/util/tensor_format.h"
 
 #if !defined(IS_MOBILE_PLATFORM)
@@ -45,8 +42,13 @@ using mkldnn::lrn_backward;
 using mkldnn::lrn_forward;
 using mkldnn::prop_kind;
 using mkldnn::stream;
+#else
+#include "mkl_dnn.h"
+#include "mkl_dnn_types.h"
 #endif
 
+#include "tensorflow/core/util/mkl_util.h"
+
 namespace tensorflow {
 
 namespace {
@@ -1236,7 +1238,7 @@ class MklLRNGradOp : public OpKernel {
     auto activations = orig_output_tensor.shaped<T, 2>({nodes * batch, depth});
 
     Tensor* output_dnn_data;
-    MklShape mkl_output_mkl_shape;
+    MklDnnShape mkl_output_mkl_shape;
     mkl_output_mkl_shape.SetMklTensor(false);
     mkl_output_mkl_shape.SetDimensions(4);
     AllocateOutputSetMklShape(context, kIdxOutput, &output_dnn_data,
diff --git a/tensorflow/core/kernels/mkl_matmul_op.cc b/tensorflow/core/kernels/mkl_matmul_op.cc
index dfa6cecc9bdc231ebf35e587183b5f84b17489e0..62c040489125ed65b7c2fcaf81d50f552a1b371d 100644
--- a/tensorflow/core/kernels/mkl_matmul_op.cc
+++ b/tensorflow/core/kernels/mkl_matmul_op.cc
@@ -23,7 +23,7 @@ limitations under the License.
 // and when it is undefined at build time, this file becomes an empty
 // compilation unit
 
-#if defined(INTEL_MKL)
+#if defined(INTEL_MKL) && !defined(DO_NOT_USE_ML)
 
 #include "mkl_cblas.h"
 #include "tensorflow/core/framework/op.h"
diff --git a/tensorflow/core/kernels/mkl_relu_op.cc b/tensorflow/core/kernels/mkl_relu_op.cc
index 1ed43834dd8803679d07a59b2d3a14e3c4953995..78abbdb730bd3a086d8b485af4daa196bcc9996c 100644
--- a/tensorflow/core/kernels/mkl_relu_op.cc
+++ b/tensorflow/core/kernels/mkl_relu_op.cc
@@ -23,9 +23,6 @@ limitations under the License.
 #include "tensorflow/core/framework/tensor.h"
 #include "tensorflow/core/lib/core/errors.h"
 
-#include "mkl_dnn.h"
-#include "mkl_dnn_types.h"
-#include "tensorflow/core/util/mkl_util.h"
 
 #ifndef INTEL_MKL_ML
 #include "mkldnn.hpp"
@@ -38,7 +35,11 @@ using mkldnn::prop_kind;
 using mkldnn::relu_backward;
 using mkldnn::relu_forward;
 using mkldnn::stream;
+#else
+#include "mkl_dnn.h"
+#include "mkl_dnn_types.h"
 #endif
+#include "tensorflow/core/util/mkl_util.h"
 
 namespace tensorflow {
 
diff --git a/tensorflow/core/kernels/mkl_reshape_op.cc b/tensorflow/core/kernels/mkl_reshape_op.cc
index 2cfde1f6fd4112ea1b4e489be3d9ce0014cbaa6a..c44a6f3477ee016f878290c533a5ca14d7aad253 100644
--- a/tensorflow/core/kernels/mkl_reshape_op.cc
+++ b/tensorflow/core/kernels/mkl_reshape_op.cc
@@ -24,15 +24,17 @@ limitations under the License.
 #include "tensorflow/core/lib/core/status.h"
 #include "tensorflow/core/platform/logging.h"
 
-#include "mkl_dnn.h"
-#include "mkl_dnn_types.h"
-#include "tensorflow/core/util/mkl_util.h"
 
 #ifndef INTEL_MKL_ML
 #include "mkldnn.hpp"
 using mkldnn::stream;
+#else
+#include "mkl_dnn.h"
+#include "mkl_dnn_types.h"
 #endif
 
+#include "tensorflow/core/util/mkl_util.h"
+
 namespace tensorflow {
 using CPUDevice = Eigen::ThreadPoolDevice;
 template <typename Device, typename T>
@@ -250,7 +252,7 @@ class MklReshapeOp : public OpKernel {
                 memory::primitive_desc(output_tf_md, cpu_engine);
 
             Tensor* output_tensor = nullptr;
-            MklShape mkl_shape_output;
+            MklDnnShape mkl_shape_output;
             mkl_shape_output.SetMklTensor(false);
             // We allocate output tensor in the shape expected by Reshape.
             AllocateOutputSetMklShape(context, kOutputSlotIdx, &output_tensor,
diff --git a/tensorflow/core/kernels/mkl_softmax_op.cc b/tensorflow/core/kernels/mkl_softmax_op.cc
index f79e18cff29de5682ac2db445160d9346425414f..638392954ed958a3498c2f38dba09c39201344b7 100644
--- a/tensorflow/core/kernels/mkl_softmax_op.cc
+++ b/tensorflow/core/kernels/mkl_softmax_op.cc
@@ -25,8 +25,6 @@ limitations under the License.
 #include "tensorflow/core/lib/core/errors.h"
 #include "tensorflow/core/util/tensor_format.h"
 
-#include "mkldnn.h"
-#include "mkldnn_types.h"
 #include "tensorflow/core/util/mkl_util.h"
 
 #include "mkldnn.hpp"
diff --git a/tensorflow/core/kernels/mkl_tfconv_op.h b/tensorflow/core/kernels/mkl_tfconv_op.h
index 4120f013acd7a714d3015d4c6cdce53d0161d677..7e8ed1b1d619a06b285ac97d3ff17f89962bebba 100644
--- a/tensorflow/core/kernels/mkl_tfconv_op.h
+++ b/tensorflow/core/kernels/mkl_tfconv_op.h
@@ -32,8 +32,10 @@ limitations under the License.
 #include "tensorflow/core/platform/macros.h"
 #include "tensorflow/core/util/tensor_format.h"
 
+#ifdef INTEL_MKL_ML
 #include "mkl_dnn.h"
 #include "mkl_dnn_types.h"
+#endif
 #include "tensorflow/core/util/mkl_util.h"
 
 #ifndef INTEL_MKL_ML
diff --git a/tensorflow/core/kernels/mkl_transpose_op.cc b/tensorflow/core/kernels/mkl_transpose_op.cc
index 3f07b317c4d915fd7d304dbbab966837da64757a..b180c2ff2006e1f0fc5a81a757b05772383ebc50 100644
--- a/tensorflow/core/kernels/mkl_transpose_op.cc
+++ b/tensorflow/core/kernels/mkl_transpose_op.cc
@@ -15,7 +15,7 @@ limitations under the License.
 
 // See docs in ../ops/array_ops.cc.
 
-#ifdef INTEL_MKL
+#if defined(INTEL_MKL) && !defined(DO_NOT_USE_ML)
 #define EIGEN_USE_THREADS
 
 #include "mkl_trans.h"
diff --git a/tensorflow/core/kernels/non_max_suppression_op.cc b/tensorflow/core/kernels/non_max_suppression_op.cc
index 2b010f816d4e503b3b5a89a7b406273511cad954..23fdfe944a2ba6f83b51d18d7ae5822f8cbee304 100644
--- a/tensorflow/core/kernels/non_max_suppression_op.cc
+++ b/tensorflow/core/kernels/non_max_suppression_op.cc
@@ -117,10 +117,6 @@ void DoNonMaxSuppressionOp(OpKernelContext* context, const Tensor& boxes,
     }
   }
 
-  auto suppress_func = [iou_threshold](const float x) {
-    return x <= iou_threshold ? 1 : 0;
-  };
-
   std::vector<int> selected;
   std::vector<float> selected_scores;
   Candidate next_candidate;
@@ -134,14 +130,14 @@ void DoNonMaxSuppressionOp(OpKernelContext* context, const Tensor& boxes,
     // Overlapping boxes are likely to have similar scores,
     // therefore we iterate through the previously selected boxes backwards
     // in order to see if `next_candidate` should be suppressed.
+    bool should_select = true;
     for (int j = selected.size() - 1; j >= 0; --j) {
       iou = IOU(boxes_data, next_candidate.box_index, selected[j]);
       if (iou == 0.0) continue;
-      next_candidate.score *= suppress_func(iou);
-      if (next_candidate.score <= score_threshold) break;
+      if (iou > iou_threshold) should_select = false;
     }
 
-    if (original_score == next_candidate.score) {
+    if (should_select) {
       selected.push_back(next_candidate.box_index);
       selected_scores.push_back(next_candidate.score);
     }
@@ -178,7 +174,7 @@ class NonMaxSuppressionOp : public OpKernel {
         errors::InvalidArgument("max_output_size must be 0-D, got shape ",
                                 max_output_size.shape().DebugString()));
 
-    const float score_threshold_val = 0.0;
+    const float score_threshold_val = std::numeric_limits<float>::lowest();
     DoNonMaxSuppressionOp(context, boxes, scores, max_output_size,
                           iou_threshold_, score_threshold_val);
   }
@@ -211,7 +207,7 @@ class NonMaxSuppressionV2Op : public OpKernel {
                                         iou_threshold.shape().DebugString()));
     const float iou_threshold_val = iou_threshold.scalar<float>()();
 
-    const float score_threshold_val = 0.0;
+    const float score_threshold_val = std::numeric_limits<float>::lowest();
     DoNonMaxSuppressionOp(context, boxes, scores, max_output_size,
                           iou_threshold_val, score_threshold_val);
   }
diff --git a/tensorflow/core/kernels/non_max_suppression_op_test.cc b/tensorflow/core/kernels/non_max_suppression_op_test.cc
index c71aa23e01d1a552ae0ded1e196a031ea018c68c..ed7db313bd59c624300ebe58552a42d46bb1e955 100644
--- a/tensorflow/core/kernels/non_max_suppression_op_test.cc
+++ b/tensorflow/core/kernels/non_max_suppression_op_test.cc
@@ -86,6 +86,23 @@ TEST_F(NonMaxSuppressionOpTest, TestSelectAtMostTwoBoxesFromThreeClusters) {
   test::ExpectTensorEqual<int>(expected, *GetOutput(0));
 }
 
+TEST_F(NonMaxSuppressionOpTest, TestSelectWithNegativeScores) {
+  MakeOp(.5);
+  AddInputFromArray<float>(
+      TensorShape({6, 4}),
+      {0, 0,  1, 1,  0, 0.1f,  1, 1.1f,  0, -0.1f, 1, 0.9f,
+       0, 10, 1, 11, 0, 10.1f, 1, 11.1f, 0, 100,   1, 101});
+  AddInputFromArray<float>(
+      TensorShape({6}), {.9f - 10.0f, .75f - 10.0f, .6f - 10.0f, .95f - 10.0f,
+                         .5f - 10.0f, .3f - 10.0f});
+  AddInputFromArray<int>(TensorShape({}), {6});
+  TF_ASSERT_OK(RunOpKernel());
+
+  Tensor expected(allocator(), DT_INT32, TensorShape({3}));
+  test::FillValues<int>(&expected, {3, 0, 5});
+  test::ExpectTensorEqual<int>(expected, *GetOutput(0));
+}
+
 TEST_F(NonMaxSuppressionOpTest, TestSelectAtMostThirtyBoxesFromThreeClusters) {
   MakeOp(.5);
   AddInputFromArray<float>(
@@ -393,6 +410,27 @@ TEST_F(NonMaxSuppressionV3OpTest,
   test::ExpectTensorEqual<int>(expected, *GetOutput(0));
 }
 
+TEST_F(NonMaxSuppressionV3OpTest,
+       TestSelectFromThreeClustersWithScoreThresholdZeroScores) {
+  MakeOp();
+  AddInputFromArray<float>(
+      TensorShape({6, 4}),
+      {0, 0,  1, 1,  0, 0.1f,  1, 1.1f,  0, -0.1f, 1, 0.9f,
+       0, 10, 1, 11, 0, 10.1f, 1, 11.1f, 0, 100,   1, 101});
+  AddInputFromArray<float>(TensorShape({6}), {.1, 0, 0, .3, .2, -5.0});
+  // If we ask for more boxes than we actually expect to get back;
+  // should still only get 2 boxes back.
+  AddInputFromArray<int>(TensorShape({}), {6});
+  AddInputFromArray<float>(TensorShape({}), {0.5f});
+  AddInputFromArray<float>(TensorShape({}), {-3.0f});
+  TF_ASSERT_OK(RunOpKernel());
+
+  Tensor expected(allocator(), DT_INT32, TensorShape({2}));
+  test::FillValues<int>(&expected, {3, 0});
+
+  test::ExpectTensorEqual<int>(expected, *GetOutput(0));
+}
+
 TEST_F(NonMaxSuppressionV3OpTest,
        TestSelectFromThreeClustersFlippedCoordinates) {
   MakeOp();
diff --git a/tensorflow/core/kernels/partitioned_function_ops.cc b/tensorflow/core/kernels/partitioned_function_ops.cc
index d66b1ba66399f84edfc9380d25b0814e5b9745fc..b6ee808091a26d1cef46ad5b8e9bd75b99875532 100644
--- a/tensorflow/core/kernels/partitioned_function_ops.cc
+++ b/tensorflow/core/kernels/partitioned_function_ops.cc
@@ -22,6 +22,7 @@ limitations under the License.
 #include "tensorflow/core/graph/graph.h"
 #include "tensorflow/core/graph/graph_constructor.h"
 #include "tensorflow/core/graph/graph_partition.h"
+#include "tensorflow/core/util/ptr_util.h"
 #include "tensorflow/core/util/reffed_status_callback.h"
 
 #if GOOGLE_CUDA
@@ -41,7 +42,8 @@ namespace {
 // TODO(akshayka): Support distributed execution.
 class PartitionedCallOp : public AsyncOpKernel {
  public:
-  explicit PartitionedCallOp(OpKernelConstruction* ctx) : AsyncOpKernel(ctx) {
+  explicit PartitionedCallOp(OpKernelConstruction* ctx)
+      : AsyncOpKernel(ctx), local_device_name_(ctx->device()->name()) {
     OP_REQUIRES_OK(ctx, ctx->GetAttr("f", &func_));
   }
 
@@ -73,92 +75,28 @@ class PartitionedCallOp : public AsyncOpKernel {
     {
       mutex_lock l(mu_);
       if (!partitioned_) {
-        // Instantiate the function to obtain its underlying graph, complete
-        // with nodes for arguments and return values.
-        FunctionLibraryRuntime::InstantiateOptions opts;
-        FHandle handle;
-        OP_REQUIRES_OK_ASYNC(
-            ctx,
-            lib->Instantiate(func_.name(), AttrSlice(&func_.attr()), opts,
-                             &handle),
-            done);
-        Graph* graph = lib->GetFunctionBody(handle)->graph;
+        auto graph = tensorflow::MakeUnique<Graph>(OpRegistry::Global());
+        OP_REQUIRES_OK_ASYNC(ctx, GetGraphFromFunction(lib, graph.get()), done);
 
-        // Pin the inputs and outputs to the local device to simplify the
-        // function-dispatching logic.
-        local_device_name_ = lib->device()->name();
-        for (Node* node : graph->op_nodes()) {
-          string node_type = node->type_string();
-          if (node_type == FunctionLibraryDefinition::kArgOp ||
-              node_type == FunctionLibraryDefinition::kRetOp) {
-            node->set_assigned_device_name(local_device_name_);
-          }
-        }
-
-        // Place the graph, i.e,. assign a device to every node in it.
         DeviceSet device_set;
         for (auto d : lib->device_mgr()->ListDevices()) {
           device_set.AddDevice(d);
         }
-        Placer placer(graph, &device_set);
+        Placer placer(graph.get(), &device_set);
         OP_REQUIRES_OK_ASYNC(ctx, placer.Run(), done);
 
-        // Partition the graph into subgraphs: exactly one subgraph per device.
-        //
-        // TODO(akshayka): Let devices rewrite their graphs.
-        PartitionOptions partition_options;
-        partition_options.node_to_loc = [](const Node* node) {
-          // TODO(akshayka): To better support the distributed case, first split
-          // the graph by worker (e.g,. using the master session's
-          // `SplitByWorker` policy), and then recursively partition the
-          // per-worker shards at the remote worker(s).
-          return node->assigned_device_name();
-        };
-        int64 edge_name_counter = 0;
-        partition_options.new_name =
-            [&edge_name_counter](const string& prefix) {
-              return strings::StrCat(prefix, "/_", ++edge_name_counter);
-            };
-        partition_options.get_incarnation =
-            [&device_set](const string& name) -> int64 {
-          const Device* d = device_set.FindDeviceByName(name);
-          if (d == nullptr) {
-            return PartitionOptions::kIllegalIncarnation;
-          } else {
-            return d->attributes().incarnation();
-          }
-        };
-        partition_options.control_flow_added = false;
-        std::unordered_map<string, GraphDef> partitions;
+        std::unordered_map<string, std::unique_ptr<Graph>> subgraphs;
         OP_REQUIRES_OK_ASYNC(
-            ctx, Partition(partition_options, graph, &partitions), done);
-
-        VLOG(3) << "Partitioned function '" << func_.name() << "', yielding "
-                << partitions.size() << " shards.";
-
-        // `subgraphs` is a map from devices to their corresponding subgraphs.
-        gtl::FlatMap<string, std::unique_ptr<Graph>> subgraphs;
-        const FunctionLibraryDefinition* flib_def = &graph->flib_def();
-        for (const auto& partition : partitions) {
-          std::unique_ptr<Graph> subgraph(new Graph(flib_def));
-          GraphConstructorOptions opts;
-          opts.allow_internal_ops = true;
-          opts.expect_device_spec = true;
-          const string& device = partition.first;
-          const GraphDef& graph_def = partition.second;
-          OP_REQUIRES_OK_ASYNC(
-              ctx, ConvertGraphDefToGraph(opts, graph_def, subgraph.get()),
-              done);
-          subgraphs.emplace(device, std::move(subgraph));
-        }
+            ctx, PartitionHelper(device_set, std::move(graph), &subgraphs),
+            done);
 
         // The FunctionLibraryRuntime's library cannot be mutated from within
-        // an OpKernel, so the functions are instantiated in an overlay library.
+        // an OpKernel, so functions are instantiated in an overlay library.
         overlay_lib_.reset(new FunctionLibraryDefinition(
             *lib->GetFunctionLibraryDefinition()));
         for (const auto& pair : subgraphs) {
           const string& target = pair.first;
-          Graph* subgraph = pair.second.get();
+          const auto& subgraph = pair.second;
           FunctionDef shard;
           string unique_name = UniquifyFunctionName(func_.name());
           OP_REQUIRES_OK_ASYNC(
@@ -173,12 +111,96 @@ class PartitionedCallOp : public AsyncOpKernel {
               lib->Instantiate(unique_name, AttrSlice(&shard.attr()), opts,
                                &handle),
               done);
-          device_handle_map_.emplace(target, handle);
+          function_handles_.emplace(target, handle);
         }
         partitioned_ = true;
       }
     }
+    ExecuteFunctions(lib, ctx, std::move(done));
+  }
+
+ private:
+  typedef std::pair<string, FHandle> DeviceAndFHandle;
+
+  // `func_` encapsulates the original, unsharded function.
+  // Copies the graph backing `func_` into `*graph`, pinning the input and
+  // output nodes to the local device.
+  //
+  // `*graph` must be a freshly allocated graph.
+  Status GetGraphFromFunction(FunctionLibraryRuntime* lib, Graph* graph) {
+    FunctionLibraryRuntime::InstantiateOptions opts;
+    FHandle handle;
+    TF_RETURN_IF_ERROR(lib->Instantiate(func_.name(), AttrSlice(&func_.attr()),
+                                        opts, &handle));
+    const FunctionBody* fbody = lib->GetFunctionBody(handle);
+    if (fbody == nullptr) {
+      return errors::Internal("Could not find handle ", handle);
+    }
+    CopyGraph(*fbody->graph, graph);
 
+    // Pin the inputs and outputs to the local device to simplify the
+    // function-dispatching logic.
+    for (Node* node : graph->op_nodes()) {
+      string node_type = node->type_string();
+      if (node_type == FunctionLibraryDefinition::kArgOp ||
+          node_type == FunctionLibraryDefinition::kRetOp) {
+        node->set_assigned_device_name(local_device_name_);
+      }
+    }
+    return Status::OK();
+  }
+
+  // Partitions `graph` and populates `subgraphs` with the partitions.
+  Status PartitionHelper(
+      const DeviceSet& device_set, std::unique_ptr<Graph> graph,
+      std::unordered_map<string, std::unique_ptr<Graph>>* subgraphs) {
+    PartitionOptions partition_options;
+    partition_options.node_to_loc = [](const Node* node) {
+      // TODO(akshayka): To better support the distributed case, first split
+      // the graph by worker (e.g,. using the master session's
+      // `SplitByWorker` policy), and then recursively partition the
+      // per-worker shards at the remote worker(s).
+      return node->assigned_device_name();
+    };
+    int64 edge_name_counter = 0;
+    partition_options.new_name = [&edge_name_counter](const string& prefix) {
+      return strings::StrCat(prefix, "/_", ++edge_name_counter);
+    };
+    partition_options.get_incarnation =
+        [&device_set](const string& name) -> int64 {
+      const Device* d = device_set.FindDeviceByName(name);
+      if (d == nullptr) {
+        return PartitionOptions::kIllegalIncarnation;
+      } else {
+        return d->attributes().incarnation();
+      }
+    };
+    partition_options.control_flow_added = false;
+    std::unordered_map<string, GraphDef> partitions;
+    TF_RETURN_IF_ERROR(Partition(partition_options, graph.get(), &partitions));
+
+    VLOG(3) << "Partitioned function '" << func_.name() << "', yielding "
+            << partitions.size() << " shards.";
+
+    const FunctionLibraryDefinition* flib_def = &graph->flib_def();
+    for (const auto& partition : partitions) {
+      std::unique_ptr<Graph> subgraph(new Graph(flib_def));
+      GraphConstructorOptions opts;
+      opts.allow_internal_ops = true;
+      opts.expect_device_spec = true;
+      const string& device = partition.first;
+      const GraphDef& graph_def = partition.second;
+      TF_RETURN_IF_ERROR(
+          ConvertGraphDefToGraph(opts, graph_def, subgraph.get()));
+      subgraphs->emplace(device, std::move(subgraph));
+    }
+
+    return Status::OK();
+  }
+
+  // Executes the partitioned functions.
+  void ExecuteFunctions(FunctionLibraryRuntime* lib, OpKernelContext* ctx,
+                        DoneCallback done) LOCKS_EXCLUDED(mu_) {
     FunctionLibraryRuntime::Options opts;
     opts.step_id = ctx->step_id();
     opts.step_container = ctx->step_container();
@@ -205,11 +227,11 @@ class PartitionedCallOp : public AsyncOpKernel {
         },
         rendez, std::move(done), std::placeholders::_1);
     auto* refcounted_done = new ReffedStatusCallback(std::move(callback));
-    for (int i = 1; i < device_handle_map_.size(); ++i) {
+    for (int i = 1; i < function_handles_.size(); ++i) {
       refcounted_done->Ref();
     }
 
-    for (const auto& pair : device_handle_map_) {
+    for (const auto& pair : function_handles_) {
       const string& target_device = pair.first;
       FHandle handle = pair.second;
       VLOG(3) << "Running function shard on device " << target_device;
@@ -247,8 +269,6 @@ class PartitionedCallOp : public AsyncOpKernel {
       }
     }
   }
-
- private:
   string UniquifyFunctionName(const string& name) {
     for (;; ++suffix_) {
       const string candidate = strings::StrCat(name, "_", suffix_);
@@ -258,13 +278,13 @@ class PartitionedCallOp : public AsyncOpKernel {
     }
   }
 
-  // `func_` encapsulates the original, unsharded function.
   NameAttrList func_;
-  string local_device_name_;
+  const string local_device_name_;
   // Function shards are added to `overlay_lib_`.
   std::unique_ptr<FunctionLibraryDefinition> overlay_lib_;
-  // A map from device names to handles of function shards.
-  gtl::FlatMap<string, FHandle> device_handle_map_;
+  // A map from device names to handles of function shards; this map is
+  // read-only after the first execution of the OpKernel.
+  gtl::FlatMap<string, FHandle> function_handles_;
 
   mutex mu_;
   bool partitioned_ GUARDED_BY(mu_) = false;
@@ -274,6 +294,12 @@ class PartitionedCallOp : public AsyncOpKernel {
 };
 REGISTER_KERNEL_BUILDER(Name("PartitionedCall").Device(DEVICE_CPU),
                         PartitionedCallOp);
+REGISTER_KERNEL_BUILDER(Name("PartitionedCall").Device(DEVICE_GPU),
+                        PartitionedCallOp);
+#if TENSORFLOW_USE_SYCL
+REGISTER_KERNEL_BUILDER(Name("PartitionedCall").Device(DEVICE_SYCL),
+                        PartitionedCallOp);
+#endif  // TENSORFLOW_USE_SYCL
 
 }  // namespace
 }  // namespace tensorflow
diff --git a/tensorflow/core/kernels/pooling_ops_common.h b/tensorflow/core/kernels/pooling_ops_common.h
index fc7cb437b8f583a811427deaf52a94d9ef996f37..e9265551e386f5e9347ed3e46cae36b4ba423c87 100644
--- a/tensorflow/core/kernels/pooling_ops_common.h
+++ b/tensorflow/core/kernels/pooling_ops_common.h
@@ -596,7 +596,7 @@ void SpatialAvgPool(OpKernelContext* context, Tensor* output,
   // so the factor 0.01 (i.e. 1/100) with a max of 10000, was chosen to limit
   // the work unit cost to an operating range in which it emperically performed
   // best.
-  const int64 work_unit_cost = std::max(10000LL, work_unit_size / 100LL);
+  const int64 work_unit_cost = std::max(int64{10000}, work_unit_size / 100LL);
   const DeviceBase::CpuWorkerThreads& worker_threads =
       *(context->device()->tensorflow_cpu_worker_threads());
   Shard(worker_threads.num_threads, worker_threads.workers,
diff --git a/tensorflow/core/kernels/quantization_utils.h b/tensorflow/core/kernels/quantization_utils.h
index 9fafe6bb65406a1dbcbcb63624fe58019f9e83a3..e67a94e5f83793bf5f459a793633f053c1d6e22d 100644
--- a/tensorflow/core/kernels/quantization_utils.h
+++ b/tensorflow/core/kernels/quantization_utils.h
@@ -273,8 +273,8 @@ inline void RequantizeManyInNewRangeReference(const qint32* input, int64 count,
     const int64 offset_intermediate = fp_value - output_offset_fp;
     const int64 round_intermediate = offset_intermediate + rounding_delta;
     int64 quantized_int64 = round_intermediate >> fp_shift;
-    quantized_int64 = std::max(quantized_int64, 0LL);
-    quantized_int64 = std::min(quantized_int64, 255LL);
+    quantized_int64 = std::max(quantized_int64, int64{0});
+    quantized_int64 = std::min(quantized_int64, int64{255});
     output[index] = static_cast<quint8>(static_cast<int32>(quantized_int64));
   }
 }
diff --git a/tensorflow/core/kernels/quantize_and_dequantize_op.h b/tensorflow/core/kernels/quantize_and_dequantize_op.h
index 3b09ea2527d8b401941c6ef0951c620edd0c5217..906d507c8a415967a72c53bd22e4cf706fc1e7c8 100644
--- a/tensorflow/core/kernels/quantize_and_dequantize_op.h
+++ b/tensorflow/core/kernels/quantize_and_dequantize_op.h
@@ -23,6 +23,8 @@ limitations under the License.
 namespace tensorflow {
 namespace functor {
 
+// TODO(pauldonnelly): 'signed_input' should really be called 'signed_output'.
+
 template <typename Device, typename T>
 struct QuantizeAndDequantizeOneScaleFunctor {
   void operator()(const Device& d, typename TTypes<T>::ConstVec input,
@@ -49,56 +51,51 @@ struct QuantizeAndDequantizeOneScaleImpl {
     d.memcpyDeviceToHost(&min_range, input_min.data(), sizeof(T));
     d.memcpyDeviceToHost(&max_range, input_max.data(), sizeof(T));
 
-    // Make sure the range is symmetric for signed quantization, or start from
-    // 0 for unsigned quantization.
-    max_range = std::max(std::abs(max_range), std::abs(min_range));
+    // Calculate the range for the simulated integer quantization:
+    // e.g. [-128,127] for signed = true, num_bits = 8,
+    // or [0, 255] for signed = false, num_bits = 8.
+    const int64 min_quantized = signed_input ? -(1ULL << (num_bits - 1)) : 0;
+    const int64 max_quantized = min_quantized + ((1ULL << num_bits) - 1);
 
-    // If both min and max are 0, then the output should be just 0.
-    if (max_range == 0) {
-      out.device(d) = input.constant(T(0));
-      return;
-    }
+    // Determine the maximum scaling factor that would scale
+    // [min_range, max_range] to not exceed [min_quantized, max_quantized],
+    // while keeping 0 unchanged.
+    const T scale_from_min_side = (min_quantized * min_range > 0)
+                                      ? min_quantized / min_range
+                                      : std::numeric_limits<T>::max();
+    const T scale_from_max_side = (max_quantized * max_range > 0)
+                                      ? max_quantized / max_range
+                                      : std::numeric_limits<T>::max();
 
-    if (signed_input) {
-      min_range = -max_range;
+    // Note: Avoids changing the side of the range that determines scale.
+    T scale, inverse_scale;
+    if (scale_from_min_side < scale_from_max_side) {
+      scale = scale_from_min_side;
+      inverse_scale = min_range / min_quantized;
+      max_range = max_quantized * inverse_scale;
+    } else {
+      scale = scale_from_max_side;
+      inverse_scale = max_range / max_quantized;
+      min_range = min_quantized * inverse_scale;
+    }
 
-      // If it is signed, we try to keep 0.0 being 0 and drop one bucket. For
-      // example, if it is 8 bits, we have the range [-127, 127]. So for input
-      // range of [-x, x], the scale should be 254/(2*x).
-      T scale = static_cast<T>((uint64_t{1} << (num_bits - 1)) - 1) / max_range;
-      T inverse_scale = T(1.0) / scale;
-      if (range_given) {
-        out.device(d) =
-            ((input.cwiseMin(max_range).cwiseMax(min_range) - min_range) *
-                 scale +
-             T(0.5))
-                    .floor() *
-                inverse_scale +
-            min_range;
-      } else {
-        // No need to compare with min and max as they are measured from the
-        // tensor.
-        out.device(d) =
-            ((input - min_range) * scale + T(0.5)).floor() * inverse_scale +
-            min_range;
-      }
+    if (range_given) {
+      // Note: The clamping here is to avoid overflow in the quantized type.
+      // The semantics of the op does not guarantee to clamp to the specified
+      // min_range and max_range - because we may have changed either min_range
+      // or max_range.
+      out.device(d) =
+          ((input.cwiseMin(max_range).cwiseMax(min_range) - min_range) * scale +
+           T(0.5))
+                  .floor() *
+              inverse_scale +
+          min_range;
     } else {
-      min_range = 0;
-      // If it is unsigned and num_bits == 8, the range with 8 bits is [0, 255].
-      // If the input range is [0, x], then the scale is x/255 instead of 254 as
-      // in the case above.
-      T scale = static_cast<T>((uint64_t{1} << num_bits) - 1) / max_range;
-      T inverse_scale = 1.0 / scale;
-      if (range_given) {
-        out.device(d) =
-            ((input.cwiseMin(max_range).cwiseMax(min_range)) * scale + T(0.5))
-                .floor() *
-            inverse_scale;
-      } else {
-        // No need to compare with min and max as they are measured from the
-        // tensor.
-        out.device(d) = (input * scale + T(0.5)).floor() * inverse_scale;
-      }
+      // No need to clamp to min_range and max_range in this case as they were
+      // measured from the tensor.
+      out.device(d) =
+          ((input - min_range) * scale + T(0.5)).floor() * inverse_scale +
+          min_range;
     }
   }
 };
diff --git a/tensorflow/core/kernels/quantize_and_dequantize_op_test.cc b/tensorflow/core/kernels/quantize_and_dequantize_op_test.cc
index e41df12d914d3971c0a577eb4ff4ee1b74ac8a46..629c69850368f509ac9817bd20433c1e8d26bb4f 100644
--- a/tensorflow/core/kernels/quantize_and_dequantize_op_test.cc
+++ b/tensorflow/core/kernels/quantize_and_dequantize_op_test.cc
@@ -105,13 +105,13 @@ TEST_F(QuantizeAndDequantizeTest, Convert_1D_tensor_with_int8) {
   AddInputFromArray<float>(TensorShape({}), {0.0});  // Min
   AddInputFromArray<float>(TensorShape({}), {0.0});  // Max
 
-  // With int8, the tensor is quantized to {-127, -63, 0, 38, 102, 70}.
+  // With int8, the tensor is quantized to {-128, -64, 0, 38, 102, 71}.
   // Scale is: 1/127
-  // Then it is dequantized to {-1, -63.0/127, 0, 38.0/127, 102.0/127, 70.0/127}
+  // Then it is dequantized to {-1, -0.5, 0, 38.0/128, 102.0/128, 71.0/128}
   TF_ASSERT_OK(RunOpKernel());
   Tensor expected(allocator(), DT_FLOAT, TensorShape({6}));
-  test::FillValues<float>(
-      &expected, {-1, -63.0 / 127, 0, 38.0 / 127, 102.0 / 127, 70.0 / 127});
+  test::FillValues<float>(&expected,
+                          {-1, -0.5, 0, 38.0 / 128, 102.0 / 128, 71.0 / 128});
   test::ExpectTensorNear<float>(expected, *GetOutput(0), 1e-5);
 
   // Ensure that the inputs haven't been changed.
@@ -136,13 +136,13 @@ TEST_F(QuantizeAndDequantizeTest, Convert_1D_tensor_with_int8_V3) {
   AddInputFromArray<float>(TensorShape({}), {0.0});  // Max
   AddInputFromArray<int32>(TensorShape({}), {8});    // num_bits
 
-  // With int8, the tensor is quantized to {-127, -63, 0, 38, 102, 70}.
-  // Scale is: 1/127
-  // Then it is dequantized to {-1, -63.0/127, 0, 38.0/127, 102.0/127, 70.0/127}
+  // With int8, the tensor is quantized to {-128, -64, 0, 38, 102, 71}.
+  // Scale is: 1/128
+  // Then it is dequantized to {-1, -64.0/128, 0, 38.0/128, 102.0/128, 71.0/128}
   TF_ASSERT_OK(RunOpKernel());
   Tensor expected(allocator(), DT_FLOAT, TensorShape({6}));
-  test::FillValues<float>(
-      &expected, {-1, -63.0 / 127, 0, 38.0 / 127, 102.0 / 127, 70.0 / 127});
+  test::FillValues<float>(&expected,
+                          {-1, -0.5, 0, 38.0 / 128, 102.0 / 128, 71.0 / 128});
   test::ExpectTensorNear<float>(expected, *GetOutput(0), 1e-5);
 
   // Ensure that the inputs haven't been changed.
@@ -166,12 +166,11 @@ TEST_F(QuantizeAndDequantizeTest, Convert_1D_tensor_with_int4) {
   AddInputFromArray<float>(TensorShape({}), {0.0});  // Min
   AddInputFromArray<float>(TensorShape({}), {0.0});  // Max
 
-  // With int4, the tensor is quantized to {-7, -3, 0, 2, 6, 4}.
-  // Scale is: 1/7
+  // With int4, the tensor is quantized to {-8, -4, 0, 2, 6, 4}.
+  // Scale is: 1/8
   TF_ASSERT_OK(RunOpKernel());
   Tensor expected(allocator(), DT_FLOAT, TensorShape({6}));
-  test::FillValues<float>(&expected,
-                          {-1, -3.0 / 7, 0, 2.0 / 7, 6.0 / 7, 4.0 / 7});
+  test::FillValues<float>(&expected, {-1, -0.5, 0, 0.25, 0.75, 0.5});
   test::ExpectTensorNear<float>(expected, *GetOutput(0), 1e-5);
 
   // Ensure that the inputs haven't been changed.
@@ -196,12 +195,11 @@ TEST_F(QuantizeAndDequantizeTest, Convert_1D_tensor_with_int4_V3) {
   AddInputFromArray<float>(TensorShape({}), {0.0});  // Max
   AddInputFromArray<int32>(TensorShape({}), {4});    // num_bits
 
-  // With int4, the tensor is quantized to {-7, -3, 0, 2, 6, 4}.
-  // Scale is: 1/7
+  // With int4, the tensor is quantized to {-8, -4, 0, 2, 6, 4}.
+  // Scale is: 1/8
   TF_ASSERT_OK(RunOpKernel());
   Tensor expected(allocator(), DT_FLOAT, TensorShape({6}));
-  test::FillValues<float>(&expected,
-                          {-1, -3.0 / 7, 0, 2.0 / 7, 6.0 / 7, 4.0 / 7});
+  test::FillValues<float>(&expected, {-1, -0.5, 0, 0.25, 0.75, 0.5});
   test::ExpectTensorNear<float>(expected, *GetOutput(0), 1e-5);
 
   // Ensure that the inputs haven't been changed.
@@ -228,13 +226,14 @@ TEST_F(QuantizeAndDequantizeTest, Convert_2D_tensor_with_int8_range_given) {
   AddInputFromArray<float>(TensorShape({}), {1.0});   // Max
 
   // Note that the range is given as [-1, 1].
-  // With int8, the tensor is quantized to {-102, -63, 0, 38, 102, 70, -127,
+  // With int8, the tensor is quantized to {-102, -63, 0, 38, 102, 70, -128,
   // 127}.
   // Scale is: 1/127
   TF_ASSERT_OK(RunOpKernel());
   Tensor expected(allocator(), DT_FLOAT, TensorShape({2, 4}));
-  test::FillValues<float>(&expected, {-102.0 / 127, -63.0 / 127, 0, 38.0 / 127,
-                                      102.0 / 127, 70.0 / 127, -1, 1});
+  test::FillValues<float>(
+      &expected, {-102.0 / 127, -63.0 / 127, 0, 38.0 / 127, 102.0 / 127,
+                  70.0 / 127, -128.0 / 127, 1});
   test::ExpectTensorNear<float>(expected, *GetOutput(0), 1e-5);
 }
 
@@ -258,13 +257,14 @@ TEST_F(QuantizeAndDequantizeTest, Convert_2D_tensor_with_int8_range_given_V3) {
   AddInputFromArray<int32>(TensorShape({}), {8});     // num_bits
 
   // Note that the range is given as [-1, 1].
-  // With int8, the tensor is quantized to {-102, -63, 0, 38, 102, 70, -127,
+  // With int8, the tensor is quantized to {-102, -63, 0, 38, 102, 70, -128,
   // 127}.
   // Scale is: 1/127
   TF_ASSERT_OK(RunOpKernel());
   Tensor expected(allocator(), DT_FLOAT, TensorShape({2, 4}));
-  test::FillValues<float>(&expected, {-102.0 / 127, -63.0 / 127, 0, 38.0 / 127,
-                                      102.0 / 127, 70.0 / 127, -1, 1});
+  test::FillValues<float>(
+      &expected, {-102.0 / 127, -63.0 / 127, 0, 38.0 / 127, 102.0 / 127,
+                  70.0 / 127, -128.0 / 127, 1});
   test::ExpectTensorNear<float>(expected, *GetOutput(0), 1e-5);
 }
 
diff --git a/tensorflow/core/kernels/reduction_gpu_kernels.cu.h b/tensorflow/core/kernels/reduction_gpu_kernels.cu.h
index 0de2ebb5907caa13e0c1b2a4e11d218bd9701bae..6655084045996dcb920620c80752520cdf4b5ee2 100644
--- a/tensorflow/core/kernels/reduction_gpu_kernels.cu.h
+++ b/tensorflow/core/kernels/reduction_gpu_kernels.cu.h
@@ -295,7 +295,11 @@ __global__ void ColumnReduceMax16ColumnsKernel(
 
   // 1D array necessary due to bug in CUDA 9 compiler.
   // TODO(nluehr) revert to 2D array when compiler is ready.
-  __shared__ storage_type<value_type> partial_sums[32 * 33];
+  // This is the mimic the following, but without any constructors:
+  //   __shared__ storage_type<value_type> partial_sums[32 * 33];
+  __shared__ __align__(
+      alignof(value_type)) char partial_sums_raw[32 * 33 * sizeof(value_type)];
+  value_type* partial_sums = reinterpret_cast<value_type*>(partial_sums_raw);
 
   row += rows_per_warp * gridDim.y * blockDim.y;
   for (; row < num_rows; row += rows_per_warp * gridDim.y * blockDim.y) {
@@ -344,7 +348,11 @@ __global__ void ColumnReduceKernel(
 
   // 1D array necessary due to bug in CUDA 9 compiler.
   // TODO(nluehr) revert to 2D array when compiler is ready.
-  __shared__ storage_type<value_type> partial_sums[32 * 33];
+  // This is to mimic the following, but without constructors:
+  //     __shared__ storage_type<value_type> partial_sums[32 * 33];
+  __shared__ __align__(
+      alignof(value_type)) char partial_sums_raw[32 * 33 * sizeof(value_type)];
+  value_type* partial_sums = reinterpret_cast<value_type*>(partial_sums_raw);
 
   row += gridDim.y * blockDim.y;
 
diff --git a/tensorflow/core/kernels/reshape_util.cc b/tensorflow/core/kernels/reshape_util.cc
index 4188ad233ea8f826fda28ee891a54ee9bd1156e3..c75e942039ffa5fb3941151c4341e9b0733482a3 100644
--- a/tensorflow/core/kernels/reshape_util.cc
+++ b/tensorflow/core/kernels/reshape_util.cc
@@ -12,9 +12,10 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
-
 #define EIGEN_USE_THREADS
 
+#include "tensorflow/core/kernels/reshape_util.h"
+
 #include <algorithm>
 #include <numeric>
 #include <unordered_map>
diff --git a/tensorflow/core/kernels/resize_area_op.cc b/tensorflow/core/kernels/resize_area_op.cc
index 98b8a0df282a21f6711cc8926762f7bbb4ef52b0..c996ae60b79ee9cc54a29d7a2fd13fc7ff5465e8 100644
--- a/tensorflow/core/kernels/resize_area_op.cc
+++ b/tensorflow/core/kernels/resize_area_op.cc
@@ -271,7 +271,7 @@ class ResizeAreaOp : public OpKernel {
 
  private:
   static EIGEN_ALWAYS_INLINE int64 Bound(int64 val, int64 limit) {
-    return std::min(limit - 1ll, std::max(0ll, val));
+    return std::min(limit - 1ll, std::max(int64{0}, val));
   }
 
   bool align_corners_;
diff --git a/tensorflow/core/kernels/resize_area_op_test.cc b/tensorflow/core/kernels/resize_area_op_test.cc
index a7e06ef15a1dd15c4c1428f44dbcd5e560b5e993..84ff090b5469291712eb97aa19734e7d194771b8 100644
--- a/tensorflow/core/kernels/resize_area_op_test.cc
+++ b/tensorflow/core/kernels/resize_area_op_test.cc
@@ -124,7 +124,8 @@ class ResizeAreaOpTest : public OpsTestBase {
                                   ? (j + 1 > in_x1 ? width_scale : j + 1 - in_x)
                                   : (j + 1 > in_x1 ? in_x1 - j : 1.0);
               for (int64 c = 0; c < channels; ++c) {
-#define BOUND(val, limit) std::min(((limit)-1ll), (std::max(0ll, (val))))
+#define BOUND(val, limit) \
+  std::min(((limit)-int64{1}), (std::max(int64{0}, (val))))
                 sum_data(c) +=
                     static_cast<float>(input_data(b, BOUND(i, in_height),
                                                   BOUND(j, in_width), c)) *
diff --git a/tensorflow/core/kernels/resize_bicubic_op.cc b/tensorflow/core/kernels/resize_bicubic_op.cc
index 65014b6c44eb2e5b0adb528c3ce08f01c21e4f26..8380ed6d8ff64a2959c453059b2df3242134dd94 100644
--- a/tensorflow/core/kernels/resize_bicubic_op.cc
+++ b/tensorflow/core/kernels/resize_bicubic_op.cc
@@ -57,7 +57,7 @@ const float* GetCoeffsTable() {
 }
 
 inline int64 Bound(int64 val, int64 limit) {
-  return std::min(limit - 1ll, std::max(0ll, val));
+  return std::min(limit - 1ll, std::max(int64{0}, val));
 }
 
 struct WeightsAndIndices {
diff --git a/tensorflow/core/kernels/resize_bicubic_op_test.cc b/tensorflow/core/kernels/resize_bicubic_op_test.cc
index c23570d885be2419a8e045633a31250e0b99703d..eff25f5ad49c242f8a3e2133f968e3585f58f97c 100644
--- a/tensorflow/core/kernels/resize_bicubic_op_test.cc
+++ b/tensorflow/core/kernels/resize_bicubic_op_test.cc
@@ -81,7 +81,7 @@ class ResizeBicubicOpTest : public OpsTestBase {
 
   // Used in the baseline implementation
   inline int64 Bound(int64 val, int64 limit) {
-    return std::min(limit - 1ll, std::max(0ll, val));
+    return std::min(limit - 1ll, std::max(int64{0}, val));
   }
 
   // Used in the baseline implementation
diff --git a/tensorflow/core/kernels/resource_variable_ops.cc b/tensorflow/core/kernels/resource_variable_ops.cc
index 03cc414905c66ab483549098a88e6f433a00ec9f..af921e48158923baeac749d22bea9ecef77cfaed 100644
--- a/tensorflow/core/kernels/resource_variable_ops.cc
+++ b/tensorflow/core/kernels/resource_variable_ops.cc
@@ -51,6 +51,7 @@ limitations under the License.
 #define EIGEN_USE_GPU
 #endif
 
+#include "tensorflow/core/kernels/resource_variable_ops.h"
 #include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/framework/register_types.h"
 #include "tensorflow/core/framework/resource_mgr.h"
@@ -72,40 +73,33 @@ namespace tensorflow {
 
 REGISTER_RESOURCE_HANDLE_KERNEL(Var);
 
-class ReadVariableOp : public OpKernel {
- public:
-  explicit ReadVariableOp(OpKernelConstruction* c) : OpKernel(c) {
-    OP_REQUIRES_OK(c, c->GetAttr("dtype", &dtype_));
-  }
-
-  void Compute(OpKernelContext* ctx) override {
-    Var* variable = nullptr;
-    ResourceHandle handle = HandleFromInput(ctx, 0);
-    const auto status = LookupResource(ctx, handle, &variable);
-    OP_REQUIRES(ctx, status.ok(),
-                errors::FailedPrecondition(
-                    "Error while reading resource variable ", handle.name(),
-                    " from Container: ", handle.container(),
-                    ". This could mean that the variable was uninitialized. ",
-                    status.ToString()));
-
-    core::ScopedUnref s(variable);
-    // We're acquiring a reference to the underlying buffer while
-    // holding a shared lock to guarantee ordering of reads and
-    // writes.
-    tf_shared_lock ml(*variable->mu());
-    const Tensor& t = *variable->tensor();
-    OP_REQUIRES(
-        ctx, dtype_ == t.dtype(),
-        errors::InvalidArgument(
-            "Trying to read variable with wrong dtype. Expected ",
-            DataTypeString(dtype_), " got ", DataTypeString(t.dtype())));
-    ctx->set_output(0, t);
-  }
-
- private:
-  DataType dtype_;
-};
+ReadVariableOp::ReadVariableOp(OpKernelConstruction* c) : OpKernel(c) {
+  OP_REQUIRES_OK(c, c->GetAttr("dtype", &dtype_));
+}
+
+void ReadVariableOp::Compute(OpKernelContext* ctx) {
+  Var* variable = nullptr;
+  ResourceHandle handle = HandleFromInput(ctx, 0);
+  const auto status = LookupResource(ctx, handle, &variable);
+  OP_REQUIRES(ctx, status.ok(),
+              errors::FailedPrecondition(
+                  "Error while reading resource variable ", handle.name(),
+                  " from Container: ", handle.container(),
+                  ". This could mean that the variable was uninitialized. ",
+                  status.ToString()));
+
+  core::ScopedUnref s(variable);
+  // We're acquiring a reference to the underlying buffer while
+  // holding a shared lock to guarantee ordering of reads and
+  // writes.
+  tf_shared_lock ml(*variable->mu());
+  const Tensor& t = *variable->tensor();
+  OP_REQUIRES(ctx, dtype_ == t.dtype(),
+              errors::InvalidArgument(
+                  "Trying to read variable with wrong dtype. Expected ",
+                  DataTypeString(dtype_), " got ", DataTypeString(t.dtype())));
+  ctx->set_output(0, t);
+}
 
 REGISTER_KERNEL_BUILDER(Name("ReadVariableOp").Device(DEVICE_CPU),
                         ReadVariableOp);
diff --git a/tensorflow/core/kernels/resource_variable_ops.h b/tensorflow/core/kernels/resource_variable_ops.h
new file mode 100644
index 0000000000000000000000000000000000000000..8cae5d21f0e5d863df66e2d6d0d7a23ede653a15
--- /dev/null
+++ b/tensorflow/core/kernels/resource_variable_ops.h
@@ -0,0 +1,33 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#ifndef TENSORFLOW_CORE_KERNELS_RESOURCE_VARIABLE_OPS_H_
+#define TENSORFLOW_CORE_KERNELS_RESOURCE_VARIABLE_OPS_H_
+
+#include "tensorflow/core/framework/op_kernel.h"
+
+namespace tensorflow {
+
+class ReadVariableOp : public OpKernel {
+ public:
+  explicit ReadVariableOp(OpKernelConstruction* c);
+  void Compute(OpKernelContext* ctx) override;
+
+ private:
+  DataType dtype_;
+};
+
+}  // namespace tensorflow
+
+#endif  // TENSORFLOW_CORE_KERNELS_RESOURCE_VARIABLE_OPS_H_
diff --git a/tensorflow/core/kernels/scatter_nd_op.cc b/tensorflow/core/kernels/scatter_nd_op.cc
index ff38026ac72e9951fabd36f1a35e273cb7f3db70..e1fc2ea128585fded6cd345d33a54720193146ca 100644
--- a/tensorflow/core/kernels/scatter_nd_op.cc
+++ b/tensorflow/core/kernels/scatter_nd_op.cc
@@ -143,14 +143,10 @@ class ScatterNdUpdateOp : public OpKernel {
 
   void Compute(OpKernelContext* c) override {
     if (dtype_ == DT_RESOURCE) {
-      if (use_exclusive_lock_) {
-        Var* v;
-        OP_REQUIRES_OK(c, LookupResource(c, HandleFromInput(c, 0), &v));
-        mutex_lock m(*v->mu());
-        DoCompute(c);
-      } else {
-        DoCompute(c);
-      }
+      Var* v;
+      OP_REQUIRES_OK(c, LookupResource(c, HandleFromInput(c, 0), &v));
+      mutex_lock m(*v->mu());
+      DoCompute(c);
     } else if (use_exclusive_lock_) {
       // If we're here, it means the input type is a ref.
       DCHECK(IsRefType(c->input_dtype(0)));
@@ -176,13 +172,7 @@ class ScatterNdUpdateOp : public OpKernel {
       Var* v;
       OP_REQUIRES_OK(c, LookupResource(c, HandleFromInput(c, 0), &v));
       Tensor* t = v->tensor();
-      if (!use_exclusive_lock_) {
-        // We're not holding the lock in the outer scope so need it here.
-        mutex_lock m(*v->mu());
-        OP_REQUIRES_OK(c, PrepareToUpdateVariable<Device, T>(c, t));
-      } else {
-        OP_REQUIRES_OK(c, PrepareToUpdateVariable<Device, T>(c, t));
-      }
+      OP_REQUIRES_OK(c, PrepareToUpdateVariable<Device, T>(c, t));
       params = *t;
       params_shape = params.shape();
     } else if (IsRefType(c->input_dtype(0))) {
@@ -260,7 +250,9 @@ class ScatterNdUpdateOp : public OpKernel {
   REGISTER_SCATTER_ND_UPDATE_KERNEL(type, dev, "ScatterNdNonAliasingAdd", \
                                     scatter_nd_op::UpdateOp::ADD);        \
   REGISTER_SCATTER_ND_UPDATE_KERNEL(type, dev, "ScatterNdSub",            \
-                                    scatter_nd_op::UpdateOp::SUB);
+                                    scatter_nd_op::UpdateOp::SUB);        \
+  REGISTER_RESOURCE_SCATTER_ND_UPDATE_KERNEL(                             \
+      type, dev, "ResourceScatterNdAdd", scatter_nd_op::UpdateOp::ADD);
 
 #define REGISTER_SCATTER_ND(type, dev) \
   REGISTER_SCATTER_ND_KERNEL(type, dev, "ScatterNd");
diff --git a/tensorflow/core/kernels/scoped_allocator_ops_test.cc b/tensorflow/core/kernels/scoped_allocator_ops_test.cc
index d2918d20426af351a3613dea41eb7d2802d5a672..634f9ba8876ea0b542fd995e32599686bd201cc5 100644
--- a/tensorflow/core/kernels/scoped_allocator_ops_test.cc
+++ b/tensorflow/core/kernels/scoped_allocator_ops_test.cc
@@ -37,10 +37,12 @@ namespace tensorflow {
 
 class ScopedAllocatorOpTest : public OpsTestBase {
  protected:
-  void MakeOp(const gtl::ArraySlice<TensorShape>& shapes, DataType dtype,
+  void MakeOp(const TensorShape& shape,
+              const gtl::ArraySlice<TensorShape>& shapes, DataType dtype,
               const string& name, int32 id, int32 expected_call_count) {
     TF_EXPECT_OK(NodeDefBuilder("scoped_allocator_op", "_ScopedAllocator")
                      .Attr("T", dtype)
+                     .Attr("shape", shape)
                      .Attr("shapes", shapes)
                      .Attr("sa_name", name)
                      .Attr("id", id)
@@ -61,12 +63,14 @@ class ScopedAllocatorOpTest : public OpsTestBase {
 };
 
 TEST_F(ScopedAllocatorOpTest, Simple) {
-  MakeOp({TensorShape({8})}, DT_FLOAT, "test", 120, 1);
-  MakeOp({TensorShape({32, 32})}, DT_DOUBLE, "test1", 130, 1);
-  MakeOp({TensorShape({64}), TensorShape({3, 3}), TensorShape({5, 5, 5})},
+  MakeOp(TensorShape({8}), {TensorShape({8})}, DT_FLOAT, "test", 120, 1);
+  MakeOp(TensorShape({1024}), {TensorShape({32, 32})}, DT_DOUBLE, "test1", 130,
+         1);
+  MakeOp(TensorShape({204}),
+         {TensorShape({64}), TensorShape({3, 3}), TensorShape({5, 5, 5})},
          DT_HALF, "test2", 140, 3);
-  MakeOp({TensorShape({512}), TensorShape({64, 8})}, DT_UINT32, "test3", 150,
-         2);
+  MakeOp(TensorShape({1024}), {TensorShape({512}), TensorShape({64, 8})},
+         DT_UINT32, "test3", 150, 2);
 }
 
 // PrepOp is common to ConcatOp tests and SplitOpTests.
@@ -254,23 +258,26 @@ TEST_F(ScopedAllocatorConcatOpTest, FailBounds) {
 
 class ScopedAllocatorSplitOpTest : public OpsTestBase {
  protected:
-  void BuildNodeDef(const TensorShape& shape, DataType dtype,
-                    const string& name, int32 id, int32 num_tensors) {
+  void BuildNodeDef(const TensorShape& in_shape, DataType dtype,
+                    const string& name, int32 id, int32 num_tensors,
+                    const std::vector<TensorShape>& out_shapes) {
     TF_EXPECT_OK(
         NodeDefBuilder("scoped_allocator_split_op", "_ScopedAllocatorSplit")
             .Attr("T", dtype)
             .Attr("N", num_tensors)
             .Attr("sa_name", name)
             .Attr("id", id)
+            .Attr("shapes", out_shapes)
             .Input(FakeInput(dtype))  // backing tensor and input
             .Input(
                 FakeInput(num_tensors, dtype))  // list of subtensors to forward
             .Finalize(node_def()));
   }
 
-  void MakeOp(const TensorShape& shape, DataType dtype, const string& name,
-              int32 id, int32 num_tensors) {
-    BuildNodeDef(shape, dtype, name, id, num_tensors);
+  void MakeOp(const TensorShape& in_shape, DataType dtype, const string& name,
+              int32 id, int32 num_tensors,
+              const std::vector<TensorShape>& out_shapes) {
+    BuildNodeDef(in_shape, dtype, name, id, num_tensors, out_shapes);
     TF_EXPECT_OK(InitOp());
   }
 
@@ -310,33 +317,33 @@ class ScopedAllocatorSplitOpTest : public OpsTestBase {
 };
 
 TEST_F(ScopedAllocatorSplitOpTest, Success1) {
-  MakeOp({32}, DT_FLOAT, "test", 120, 2);
+  MakeOp({32}, DT_FLOAT, "test", 120, 2, {{16}, {16}});
   ExecOp(DT_FLOAT, 120, {{16}, {16}});
 }
 
 TEST_F(ScopedAllocatorSplitOpTest, Success2) {
-  MakeOp({2, 2, 2}, DT_DOUBLE, "test", 120, 2);
+  MakeOp({2, 2, 2}, DT_DOUBLE, "test", 120, 2, {{2, 2}, {2, 2}});
   ExecOp(DT_DOUBLE, 120, {{2, 2}, {2, 2}});
 }
 
 TEST_F(ScopedAllocatorSplitOpTest, Success3) {
-  MakeOp({3, 3, 3}, DT_HALF, "test", 120, 3);
+  MakeOp({3, 3, 3}, DT_HALF, "test", 120, 3, {{3, 3}, {3, 3}, {3, 3}});
   ExecOp(DT_HALF, 120, {{3, 3}, {3, 3}, {3, 3}});
 }
 
 TEST_F(ScopedAllocatorSplitOpTest, FailNLessThan2) {
-  BuildNodeDef({4, 4}, DT_FLOAT, "test", 120, 1);
+  BuildNodeDef({4, 4}, DT_FLOAT, "test", 120, 1, {{4, 4}});
   Status s = InitOp();
   EXPECT_EQ(s.code(), error::INVALID_ARGUMENT);
 }
 
 TEST_F(ScopedAllocatorSplitOpTest, FailDtypeCheck) {
-  MakeOp({8}, DT_FLOAT, "test", 120, 2);
+  MakeOp({8}, DT_FLOAT, "test", 120, 2, {{4}, {4}});
   EXPECT_DEATH(ExecOp(DT_HALF, 120, {{4}, {4}}), "");
 }
 
 TEST_F(ScopedAllocatorSplitOpTest, FailBounds) {
-  MakeOp({8}, DT_DOUBLE, "test", 120, 2);
+  MakeOp({8}, DT_DOUBLE, "test", 120, 2, {{4}, {4}});
   AddInputFromArray<double>({8}, {0, 1, 2, 3, 4, 5, 6, 7});
   AddInputFromArray<double>({4}, {0, 1, 2, 3});
   AddInputFromArray<double>({4}, {4, 5, 6, 7});
diff --git a/tensorflow/core/kernels/sparse_fill_empty_rows_op.cc b/tensorflow/core/kernels/sparse_fill_empty_rows_op.cc
index d17b72bc26a6e0a054fa681046c4d55e409ce6d1..c9365be5119391a2824ab7bd36cb2596091b0b9d 100644
--- a/tensorflow/core/kernels/sparse_fill_empty_rows_op.cc
+++ b/tensorflow/core/kernels/sparse_fill_empty_rows_op.cc
@@ -125,7 +125,7 @@ class SparseFillEmptyRowsOp : public OpKernel {
       // Scratch here describes the number of elements in this dense row
       empty_row_indicator(row) = (scratch(row) == 0);
       // In filled version, each row has at least one element.
-      scratch(row) = std::max(scratch(row), 1LL);
+      scratch(row) = std::max(scratch(row), int64{1});
       // Update scratch to represent the number of elements up to and
       // including dense_row + 1:
       //  scratch(0) == #{elements of row 0}
diff --git a/tensorflow/core/kernels/sparse_matmul_op.cc b/tensorflow/core/kernels/sparse_matmul_op.cc
index a1f9667b783ca5f455523874bc4e342f1368d4f3..866c5dcd521b2a33f44e2466262ec72b577ffa23 100644
--- a/tensorflow/core/kernels/sparse_matmul_op.cc
+++ b/tensorflow/core/kernels/sparse_matmul_op.cc
@@ -1490,7 +1490,7 @@ inline void LibxsmmSparseMatMul<TL, TR>::Compute(
 
 #endif  // TENSORFLOW_USE_LIBXSMM
 
-// Here is a an overview of the SparseMatMul code. Note that we assume that the
+// Here is an overview of the SparseMatMul code. Note that we assume that the
 // left matrix is sparse.
 //
 // The matrix "left" is divided into a grid with blocksize of (M, KL). Each
diff --git a/tensorflow/core/kernels/string_split_op.cc b/tensorflow/core/kernels/string_split_op.cc
index 4c2b312c3454e658ac4e288d06580d0ab5c04d52..3996ff002788a10304eb0542f6a4f29f33fdc73c 100644
--- a/tensorflow/core/kernels/string_split_op.cc
+++ b/tensorflow/core/kernels/string_split_op.cc
@@ -22,6 +22,7 @@ limitations under the License.
 #include "tensorflow/core/framework/tensor.h"
 #include "tensorflow/core/lib/core/errors.h"
 #include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/lib/core/stringpiece.h"
 #include "tensorflow/core/lib/strings/str_util.h"
 
 namespace tensorflow {
@@ -43,6 +44,63 @@ std::vector<string> Split(const string& str, const string& delimiter,
   return char_vector;
 }
 
+std::vector<string> SplitV2(const string& str, StringPiece sep, int maxsplit) {
+  // This SplitV2 method matches the behavior of python's str.split:
+  //   If sep is given, consecutive delimiters are not grouped together
+  //   and are deemed to delimit empty strings (for example, '1,,2'.split(',')
+  //   returns ['1', '', '2']). The sep argument may consist of multiple
+  //   characters (for example, '1<>2<>3'.split('<>') returns ['1', '2', '3']).
+  //   Splitting an empty string with a specified separator returns [''].
+  //
+  //   If sep is not specified or is None, a different splitting algorithm is
+  //   applied: runs of consecutive whitespace are regarded as a single
+  //   separator, and the result will contain no empty strings at the start or
+  //   end if the string has leading or trailing whitespace. Consequently,
+  //   splitting an empty string or a string consisting of just whitespace
+  //   with a None separator returns [].
+
+  std::vector<string> result;
+
+  StringPiece text(str);
+  if (maxsplit == 0) {
+    result.emplace_back(std::string(text));
+    return result;
+  }
+
+  if (sep.empty()) {
+    StringPiece token;
+    // Remove leading whitespaces.
+    str_util::RemoveLeadingWhitespace(&text);
+    int split = 0;
+    while (str_util::ConsumeNonWhitespace(&text, &token)) {
+      result.emplace_back(std::string(token));
+      str_util::RemoveLeadingWhitespace(&text);
+      ++split;
+      if (maxsplit > 0 && split == maxsplit) {
+        result.emplace_back(std::string(text));
+        return result;
+      }
+    }
+    return result;
+  }
+  auto p = std::search(text.begin(), text.end(), sep.begin(), sep.end());
+  int split = 0;
+  while (p != text.end()) {
+    StringPiece token = text.substr(0, p - text.begin());
+    result.emplace_back(std::string(token));
+    text.remove_prefix(token.size());
+    text.remove_prefix(sep.size());
+    ++split;
+    if (maxsplit > 0 && split == maxsplit) {
+      result.emplace_back(std::string(text));
+      return result;
+    }
+    p = std::search(text.begin(), text.end(), sep.begin(), sep.end());
+  }
+  result.emplace_back(std::string(text));
+  return result;
+}
+
 }  // namespace
 
 class StringSplitOp : public OpKernel {
@@ -122,6 +180,78 @@ class StringSplitOp : public OpKernel {
   bool skip_empty_;
 };
 
+class StringSplitV2Op : public OpKernel {
+ public:
+  explicit StringSplitV2Op(OpKernelConstruction* context)
+      : OpKernel(context), maxsplit_(-1) {
+    context->GetAttr("maxsplit", &maxsplit_);
+  }
+
+  void Compute(OpKernelContext* ctx) override {
+    const Tensor* input_tensor;
+    OP_REQUIRES_OK(ctx, ctx->input("input", &input_tensor));
+    OP_REQUIRES(ctx, TensorShapeUtils::IsVector(input_tensor->shape()),
+                errors::InvalidArgument("input must be a vector, got shape: ",
+                                        input_tensor->shape().DebugString()));
+
+    const auto input_vec = input_tensor->vec<string>();
+    const int64 batch_size = input_vec.dimension(0);
+
+    const Tensor* sep_tensor;
+    OP_REQUIRES_OK(ctx, ctx->input("sep", &sep_tensor));
+    OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(sep_tensor->shape()),
+                errors::InvalidArgument("sep must be a scalar, got shape: ",
+                                        sep_tensor->shape().DebugString()));
+    const auto sep_vec = sep_tensor->flat<string>();
+    StringPiece sep(sep_vec(0));
+    std::vector<string> tokens;
+    // Guess that we'll be unpacking a handful of tokens per example.
+    static constexpr int kReserveSize = 4;
+    tokens.reserve(batch_size * kReserveSize);
+
+    int64 output_size = 0;
+    int64 max_num_entries = 0;
+    std::vector<int64> num_indices(batch_size);
+    for (int64 i = 0; i < batch_size; ++i) {
+      std::vector<string> parts = SplitV2(input_vec(i), sep, maxsplit_);
+      int64 n_entries = parts.size();
+      num_indices[i] = n_entries;
+      output_size += n_entries;
+      max_num_entries = std::max(max_num_entries, n_entries);
+      tokens.insert(tokens.end(), parts.begin(), parts.end());
+    }
+
+    Tensor* sp_indices_t;
+    OP_REQUIRES_OK(ctx, ctx->allocate_output(0, TensorShape({output_size, 2}),
+                                             &sp_indices_t));
+    Tensor* sp_tokens_t;
+    OP_REQUIRES_OK(
+        ctx, ctx->allocate_output(1, TensorShape({output_size}), &sp_tokens_t));
+    Tensor* sp_shape_t;
+    OP_REQUIRES_OK(ctx, ctx->allocate_output(2, TensorShape({2}), &sp_shape_t));
+
+    auto sp_indices = sp_indices_t->matrix<int64>();
+    auto sp_tokens = sp_tokens_t->vec<string>();
+    auto sp_shape = sp_shape_t->vec<int64>();
+    sp_shape(0) = batch_size;
+    sp_shape(1) = max_num_entries;
+    size_t c = 0;
+    for (size_t i = 0; i < batch_size; ++i) {
+      for (size_t j = 0; j < num_indices[i]; ++j) {
+        sp_indices(c, 0) = i;
+        sp_indices(c, 1) = j;
+        sp_tokens(c) = tokens[c];
+        ++c;
+      }
+    }
+  }
+
+ private:
+  int maxsplit_;
+};
+
 REGISTER_KERNEL_BUILDER(Name("StringSplit").Device(DEVICE_CPU), StringSplitOp);
+REGISTER_KERNEL_BUILDER(Name("StringSplitV2").Device(DEVICE_CPU),
+                        StringSplitV2Op);
 
 }  // namespace tensorflow
diff --git a/tensorflow/core/kernels/tensor_array.cc b/tensorflow/core/kernels/tensor_array.cc
index 7b85ff2ea4105a150e15f548ca9f881f8a71a43d..765467bc1ef3c5cbdd21df5a95cb7691ca2783b4 100644
--- a/tensorflow/core/kernels/tensor_array.cc
+++ b/tensorflow/core/kernels/tensor_array.cc
@@ -81,7 +81,8 @@ TF_CALL_complex128(TENSOR_ARRAY_SET_ZERO_GPU);
 
 std::atomic<int64> TensorArray::tensor_array_counter{0};
 
-Status TensorArray::CopyShapesFrom(TensorArray* rhs) {
+Status TensorArray::CopyShapesFrom(TensorArray* rhs,
+                                   const TensorShape* shape_to_prepend) {
   mutex_lock l(mu_);
   mutex_lock l_rhs(rhs->mu_);
   TF_RETURN_IF_ERROR(LockedReturnIfClosed());
@@ -97,7 +98,12 @@ Status TensorArray::CopyShapesFrom(TensorArray* rhs) {
     if (!rhs->tensors_[i].written) continue;
 
     // Copy the shape over.
-    tensors_[i].shape = rhs->tensors_[i].shape;
+    if (shape_to_prepend) {
+      tensors_[i].shape = *shape_to_prepend;
+      tensors_[i].shape.AppendShape(rhs->tensors_[i].shape);
+    } else {
+      tensors_[i].shape = rhs->tensors_[i].shape;
+    }
     // Mark as written.  Reads will know that if written is true and
     // read is false, and cleared is false, to return zeros of the
     // appropriate shape.  Future aggregating writes will only use the shape
diff --git a/tensorflow/core/kernels/tensor_array.h b/tensorflow/core/kernels/tensor_array.h
index 90b71e370c474f8d7a94a47278601fdb7f3dabe0..68fab85770d89591c0fe223496403354161c8d3b 100644
--- a/tensorflow/core/kernels/tensor_array.h
+++ b/tensorflow/core/kernels/tensor_array.h
@@ -325,13 +325,15 @@ class TensorArray : public ResourceBase {
   bool HasIdenticalElementShapes() const { return identical_element_shapes_; }
 
   // Copy the TensorShapes from another TensorArray into this one.
+  // If `shapes_to_prepend` is set, expands the rank of the copied shape by
+  // prepending the passed in shape prefix to the shape values in `rhs`.
   // The sizes of the two TensorArrays must match and this one
   // may not have any entries filled in.  This performs a "soft copy",
   // essentially filling the current TensorArray with virtual
   // zero-tensors, which will be replaced by future aggregate writes,
   // or instantiated by future reads.  Requires a non-const pointer
   // to the rhs to access its mutex.
-  Status CopyShapesFrom(TensorArray* rhs);
+  Status CopyShapesFrom(TensorArray* rhs, const TensorShape* shape_to_prepend);
 
   // Clear the TensorArray, including any Tensor references, and mark as closed.
   void ClearAndMarkClosed() {
diff --git a/tensorflow/core/kernels/tensor_array_ops.cc b/tensorflow/core/kernels/tensor_array_ops.cc
index ef9748b1aad12906b32eeb33fab1523b506e8fef..37803ec775b66093d07138ed4874695515b4be08 100644
--- a/tensorflow/core/kernels/tensor_array_ops.cc
+++ b/tensorflow/core/kernels/tensor_array_ops.cc
@@ -264,7 +264,10 @@ REGISTER_GPU(bfloat16);
 #endif  // GOOGLE_CUDA
 
 // GRADIENT *******************************************************************
-
+// Note that this op may have an optional third input. If present, it represents
+// a shape value. It indicates that element shape of this gradient array is that
+// shape value concatenated with the element shape of the original tensor array.
+// See TensorArrayGradWithShape.
 class TensorArrayGradOp : public TensorArrayCreationOp {
  public:
   explicit TensorArrayGradOp(OpKernelConstruction* context)
@@ -325,18 +328,38 @@ class TensorArrayGradOp : public TensorArrayCreationOp {
           "previous write?  Gradient calculation is impossible when multiple "
           "writes are performed to the same index.");
     }
+    TensorShape shape_to_prepend;
+    auto element_shape = PartialTensorShape();
+    if (ctx->num_inputs() > 2) {
+      TF_RETURN_IF_ERROR(
+          ctx->op_kernel().MakeShape(ctx->input(2), &shape_to_prepend));
+      auto ta_element_shape = tensor_array->ElemShape();
+      if (!ta_element_shape.unknown_rank()) {
+        std::vector<int64> dims;
+        for (auto dim : shape_to_prepend) {
+          dims.push_back(dim.size);
+        }
+        for (auto dim : ta_element_shape) {
+          dims.push_back(dim.size);
+        }
+        TF_RETURN_IF_ERROR(TensorShapeUtils::MakeShape(
+            gtl::ArraySlice<int64>(dims), &element_shape));
+      }
+    } else {
+      element_shape = tensor_array->ElemShape();
+    }
 
     const auto key = strings::StrCat(output_handle(0), output_handle(1));
     auto creator = [this, key, tensor_array, array_size, marked_size,
-                    tensor_array_output_handle,
+                    element_shape, shape_to_prepend, tensor_array_output_handle,
                     output_handle](TensorArray** ret) -> Status {
       *ret = new TensorArray(
           key, tensor_array->ElemType(), *tensor_array_output_handle,
-          array_size, tensor_array->ElemShape(),
-          tensor_array->HasIdenticalElementShapes(), false /* dynamic_size */,
-          true /* multiple_writes_aggregate */, true /* is_grad */,
-          marked_size /* marked_size */, true /* close_after_read */);
-      return (*ret)->CopyShapesFrom(tensor_array);
+          array_size, element_shape, tensor_array->HasIdenticalElementShapes(),
+          false /* dynamic_size */, true /* multiple_writes_aggregate */,
+          true /* is_grad */, marked_size /* marked_size */,
+          true /* close_after_read */);
+      return (*ret)->CopyShapesFrom(tensor_array, &shape_to_prepend);
     };
 
     Status s = rm->LookupOrCreate<TensorArray>(
@@ -361,7 +384,8 @@ REGISTER_KERNEL_BUILDER(Name("TensorArrayGradV2").Device(DEVICE_CPU),
                         TensorArrayGradOp);
 REGISTER_KERNEL_BUILDER(Name("TensorArrayGradV3").Device(DEVICE_CPU),
                         TensorArrayGradOp);
-
+REGISTER_KERNEL_BUILDER(Name("TensorArrayGradWithShape").Device(DEVICE_CPU),
+                        TensorArrayGradOp);
 REGISTER_KERNEL_BUILDER(Name("TensorArrayGrad")
                             .Device(DEVICE_GPU)
                             .HostMemory("handle")
@@ -377,6 +401,12 @@ REGISTER_KERNEL_BUILDER(Name("TensorArrayGradV3")
                             .HostMemory("handle")
                             .HostMemory("grad_handle"),
                         TensorArrayGradOp);
+REGISTER_KERNEL_BUILDER(Name("TensorArrayGradWithShape")
+                            .Device(DEVICE_GPU)
+                            .HostMemory("handle")
+                            .HostMemory("shape_to_prepend")
+                            .HostMemory("grad_handle"),
+                        TensorArrayGradOp);
 
 // WRITE **********************************************************************
 
diff --git a/tensorflow/core/kernels/transpose_op.cc b/tensorflow/core/kernels/transpose_op.cc
index 7177ad78884cae85a847a283017511dcad2e4878..886b3e7492efa83a64407b3933a4873c2cedb284 100644
--- a/tensorflow/core/kernels/transpose_op.cc
+++ b/tensorflow/core/kernels/transpose_op.cc
@@ -218,7 +218,7 @@ Status ConjugateTransposeCpuOp::DoTranspose(OpKernelContext* ctx,
                                             perm, out);
 }
 
-#ifdef INTEL_MKL
+#if defined(INTEL_MKL) && !defined(DO_NOT_USE_ML)
 #define REGISTER(T)                                   \
   REGISTER_KERNEL_BUILDER(Name("Transpose")           \
                               .Device(DEVICE_CPU)     \
diff --git a/tensorflow/core/kernels/transpose_op.h b/tensorflow/core/kernels/transpose_op.h
index ae67592d044f9ebd67905641d51df780b261489f..709b0a92e90b5ff65fd4412e0c7b30625fb11e54 100644
--- a/tensorflow/core/kernels/transpose_op.h
+++ b/tensorflow/core/kernels/transpose_op.h
@@ -42,7 +42,7 @@ class TransposeCpuOp : public TransposeOp {
                      gtl::ArraySlice<int32> perm, Tensor* out) override;
 };
 
-#ifdef INTEL_MKL
+#if defined(INTEL_MKL) && !defined(DO_NOT_USE_ML)
 class MklTransposeCpuOp : public TransposeOp {
  public:
   explicit MklTransposeCpuOp(OpKernelConstruction* ctx) : TransposeOp(ctx) {}
@@ -85,7 +85,7 @@ class ConjugateTransposeCpuOp : public TransposeOp {
   bool IsConjugate() const override { return true; }
 };
 
-#ifdef INTEL_MKL
+#if defined(INTEL_MKL) && !defined(DO_NOT_USE_ML)
 class MklConjugateTransposeCpuOp : public TransposeOp {
  public:
   explicit MklConjugateTransposeCpuOp(OpKernelConstruction* ctx)
diff --git a/tensorflow/core/lib/io/random_inputstream.cc b/tensorflow/core/lib/io/random_inputstream.cc
index 09336e79cda67b324299d78c65217e6a7b40dc21..e85367df9c817c33d352982fba6e3314a8622147 100644
--- a/tensorflow/core/lib/io/random_inputstream.cc
+++ b/tensorflow/core/lib/io/random_inputstream.cc
@@ -45,16 +45,8 @@ Status RandomAccessInputStream::ReadNBytes(int64 bytes_to_read,
   result->resize(data.size());
   if (s.ok() || errors::IsOutOfRange(s)) {
     pos_ += data.size();
-  } else {
-    return s;
   }
-  // If the amount of data we read is less than what we wanted, we return an
-  // out of range error. We need to catch this explicitly since file_->Read()
-  // would not do so if at least 1 byte is read (b/30839063).
-  if (data.size() < bytes_to_read) {
-    return errors::OutOfRange("reached end of file");
-  }
-  return Status::OK();
+  return s;
 }
 
 // To limit memory usage, the default implementation of SkipNBytes() only reads
diff --git a/tensorflow/core/lib/strings/numbers.cc b/tensorflow/core/lib/strings/numbers.cc
index 987e4fe7330c38143313a93b248167331cc31c82..87aa5915ff80704bbf6d1b38e3ec4651f7da0d0a 100644
--- a/tensorflow/core/lib/strings/numbers.cc
+++ b/tensorflow/core/lib/strings/numbers.cc
@@ -331,31 +331,29 @@ bool safe_strtou32(StringPiece str, uint32* value) {
   return true;
 }
 
-bool safe_strtof(const char* str, float* value) {
+bool safe_strtof(StringPiece str, float* value) {
   int processed_characters_count = -1;
-  auto len = str_util::Strnlen(str, kFastToBufferSize);
+  auto len = str.size();
 
-  // If there is no zero-termination in str, fail.
-  if (len == kFastToBufferSize) return false;
-  // If string length exceeds int max, fail.
+  // If string length exceeds buffer size or int max, fail.
+  if (len >= kFastToBufferSize) return false;
   if (len > std::numeric_limits<int>::max()) return false;
 
-  *value = StringToFloatConverter().StringToFloat(str, static_cast<int>(len),
-                                                  &processed_characters_count);
+  *value = StringToFloatConverter().StringToFloat(
+      str.data(), static_cast<int>(len), &processed_characters_count);
   return processed_characters_count > 0;
 }
 
-bool safe_strtod(const char* str, double* value) {
+bool safe_strtod(StringPiece str, double* value) {
   int processed_characters_count = -1;
-  auto len = str_util::Strnlen(str, kFastToBufferSize);
+  auto len = str.size();
 
-  // If there is no zero-termination in str, fail.
-  if (len == kFastToBufferSize) return false;
-  // If string length exceeds int max, fail.
+  // If string length exceeds buffer size or int max, fail.
+  if (len >= kFastToBufferSize) return false;
   if (len > std::numeric_limits<int>::max()) return false;
 
-  *value = StringToFloatConverter().StringToDouble(str, static_cast<int>(len),
-                                                   &processed_characters_count);
+  *value = StringToFloatConverter().StringToDouble(
+      str.data(), static_cast<int>(len), &processed_characters_count);
   return processed_characters_count > 0;
 }
 
diff --git a/tensorflow/core/lib/strings/numbers.h b/tensorflow/core/lib/strings/numbers.h
index 9cb56415cb6044bbbe98d113319ce8cbaf329aad..1d5bacac93b89a09532c2c4d947551cd141f0663 100644
--- a/tensorflow/core/lib/strings/numbers.h
+++ b/tensorflow/core/lib/strings/numbers.h
@@ -115,13 +115,13 @@ bool safe_strtou64(StringPiece str, uint64* value);
 // Leading and trailing spaces are allowed.
 // Values may be rounded on over- and underflow.
 // Returns false on invalid input or if `strlen(value) >= kFastToBufferSize`.
-bool safe_strtof(const char* str, float* value);
+bool safe_strtof(StringPiece str, float* value);
 
 // Convert strings to double precision floating point values.
 // Leading and trailing spaces are allowed.
 // Values may be rounded on over- and underflow.
 // Returns false on invalid input or if `strlen(value) >= kFastToBufferSize`.
-bool safe_strtod(const char* str, double* value);
+bool safe_strtod(StringPiece str, double* value);
 
 inline bool ProtoParseNumeric(StringPiece s, int32* value) {
   return safe_strto32(s, value);
diff --git a/tensorflow/core/lib/strings/numbers_test.cc b/tensorflow/core/lib/strings/numbers_test.cc
index 0f22dac262bf555cb0ba85ca7102615dc6d7cc7d..5b595f98478a548c32ee4bd84aa3eb4d22165b0f 100644
--- a/tensorflow/core/lib/strings/numbers_test.cc
+++ b/tensorflow/core/lib/strings/numbers_test.cc
@@ -289,12 +289,9 @@ TEST(safe_strtof, Float) {
 
   EXPECT_FALSE(safe_strtof("-infinity is awesome", &result));
 
-  // Make sure we exit cleanly if the string is not terminated
+  // Make sure we exit cleanly if the string is too long
   char test_str[2 * kFastToBufferSize];
   for (int i = 0; i < 2 * kFastToBufferSize; ++i) test_str[i] = 'a';
-  EXPECT_FALSE(safe_strtof(test_str, &result));
-
-  // Make sure we exit cleanly if the string is too long
   test_str[kFastToBufferSize + 1] = '\0';
   EXPECT_FALSE(safe_strtof(test_str, &result));
 
@@ -330,12 +327,9 @@ TEST(safe_strtod, Double) {
   EXPECT_EQ(0.1234567890123, result);
   EXPECT_FALSE(safe_strtod("0.1234567890123abc", &result));
 
-  // Make sure we exit cleanly if the string is not terminated
+  // Make sure we exit cleanly if the string is too long
   char test_str[2 * kFastToBufferSize];
   for (int i = 0; i < 2 * kFastToBufferSize; ++i) test_str[i] = 'a';
-  EXPECT_FALSE(safe_strtod(test_str, &result));
-
-  // Make sure we exit cleanly if the string is too long
   test_str[kFastToBufferSize + 1] = '\0';
   EXPECT_FALSE(safe_strtod(test_str, &result));
 
diff --git a/tensorflow/core/ops/batch_ops.cc b/tensorflow/core/ops/batch_ops.cc
index 0a62965eedd3c053dff558108f21e99a77407587..ba7faeb5e8aecade494dbc6b0954601a4f52e061 100644
--- a/tensorflow/core/ops/batch_ops.cc
+++ b/tensorflow/core/ops/batch_ops.cc
@@ -19,6 +19,26 @@ limitations under the License.
 
 namespace tensorflow {
 
+REGISTER_OP("BatchFunction")
+    .Input("in_tensors: Tin")
+    .Input("captured_tensors: Tcaptured")
+    .Output("out_tensors: Tout")
+    .Attr("f: func")
+    .Attr("num_batch_threads: int")
+    .Attr("max_batch_size: int")
+    .Attr("batch_timeout_micros: int")
+    .Attr("max_enqueued_batches: int = 10")
+    .Attr("allowed_batch_sizes: list(int) = []")
+    .Attr("container: string = ''")
+    .Attr("shared_name: string = ''")
+    .Attr("batching_queue: string = ''")
+    .Attr("Tin: list(type)")
+    .Attr("Tcaptured: list(type) >= 0")
+    .Attr("Tout: list(type)")
+    // TODO(apassos): Fix this shape inference function. It requires shape
+    // inference of function calls.
+    .SetShapeFn(shape_inference::UnknownShape);
+
 REGISTER_OP("Batch")
     .Input("in_tensors: T")
     .Output("batched_tensors: T")
diff --git a/tensorflow/core/ops/compat/ops_history.v1.pbtxt b/tensorflow/core/ops/compat/ops_history.v1.pbtxt
index c86767448923cc68a45e579199ce33a8a20e3d34..726bfd63b7142a369c76f811c801651427e8a354 100644
--- a/tensorflow/core/ops/compat/ops_history.v1.pbtxt
+++ b/tensorflow/core/ops/compat/ops_history.v1.pbtxt
@@ -1499,6 +1499,26 @@ op {
     }
   }
 }
+op {
+  name: "AnonymousIterator"
+  output_arg {
+    name: "handle"
+    type: DT_RESOURCE
+  }
+  attr {
+    name: "output_types"
+    type: "list(type)"
+    has_minimum: true
+    minimum: 1
+  }
+  attr {
+    name: "output_shapes"
+    type: "list(shape)"
+    has_minimum: true
+    minimum: 1
+  }
+  is_stateful: true
+}
 op {
   name: "Any"
   input_arg {
@@ -8700,6 +8720,37 @@ op {
     minimum: 1
   }
 }
+op {
+  name: "BatchDatasetV2"
+  input_arg {
+    name: "input_dataset"
+    type: DT_VARIANT
+  }
+  input_arg {
+    name: "batch_size"
+    type: DT_INT64
+  }
+  input_arg {
+    name: "drop_remainder"
+    type: DT_BOOL
+  }
+  output_arg {
+    name: "handle"
+    type: DT_VARIANT
+  }
+  attr {
+    name: "output_types"
+    type: "list(type)"
+    has_minimum: true
+    minimum: 1
+  }
+  attr {
+    name: "output_shapes"
+    type: "list(shape)"
+    has_minimum: true
+    minimum: 1
+  }
+}
 op {
   name: "BatchFFT"
   input_arg {
@@ -8742,6 +8793,90 @@ op {
     version: 15
   }
 }
+op {
+  name: "BatchFunction"
+  input_arg {
+    name: "in_tensors"
+    type_list_attr: "Tin"
+  }
+  input_arg {
+    name: "captured_tensors"
+    type_list_attr: "Tcaptured"
+  }
+  output_arg {
+    name: "out_tensors"
+    type_list_attr: "Tout"
+  }
+  attr {
+    name: "f"
+    type: "func"
+  }
+  attr {
+    name: "num_batch_threads"
+    type: "int"
+  }
+  attr {
+    name: "max_batch_size"
+    type: "int"
+  }
+  attr {
+    name: "batch_timeout_micros"
+    type: "int"
+  }
+  attr {
+    name: "max_enqueued_batches"
+    type: "int"
+    default_value {
+      i: 10
+    }
+  }
+  attr {
+    name: "allowed_batch_sizes"
+    type: "list(int)"
+    default_value {
+      list {
+      }
+    }
+  }
+  attr {
+    name: "container"
+    type: "string"
+    default_value {
+      s: ""
+    }
+  }
+  attr {
+    name: "shared_name"
+    type: "string"
+    default_value {
+      s: ""
+    }
+  }
+  attr {
+    name: "batching_queue"
+    type: "string"
+    default_value {
+      s: ""
+    }
+  }
+  attr {
+    name: "Tin"
+    type: "list(type)"
+    has_minimum: true
+    minimum: 1
+  }
+  attr {
+    name: "Tcaptured"
+    type: "list(type)"
+    has_minimum: true
+  }
+  attr {
+    name: "Tout"
+    type: "list(type)"
+    has_minimum: true
+    minimum: 1
+  }
+}
 op {
   name: "BatchIFFT"
   input_arg {
@@ -9950,6 +10085,52 @@ op {
     }
   }
 }
+op {
+  name: "BesselI0e"
+  input_arg {
+    name: "x"
+    type_attr: "T"
+  }
+  output_arg {
+    name: "y"
+    type_attr: "T"
+  }
+  attr {
+    name: "T"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_BFLOAT16
+        type: DT_HALF
+        type: DT_FLOAT
+        type: DT_DOUBLE
+      }
+    }
+  }
+}
+op {
+  name: "BesselI1e"
+  input_arg {
+    name: "x"
+    type_attr: "T"
+  }
+  output_arg {
+    name: "y"
+    type_attr: "T"
+  }
+  attr {
+    name: "T"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_BFLOAT16
+        type: DT_HALF
+        type: DT_FLOAT
+        type: DT_DOUBLE
+      }
+    }
+  }
+}
 op {
   name: "Betainc"
   input_arg {
@@ -14194,6 +14375,92 @@ op {
     }
   }
 }
+op {
+  name: "Conv3DBackpropInputV2"
+  input_arg {
+    name: "input_sizes"
+    type_attr: "Tshape"
+  }
+  input_arg {
+    name: "filter"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "out_backprop"
+    type_attr: "T"
+  }
+  output_arg {
+    name: "output"
+    type_attr: "T"
+  }
+  attr {
+    name: "T"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_HALF
+        type: DT_BFLOAT16
+        type: DT_FLOAT
+        type: DT_DOUBLE
+      }
+    }
+  }
+  attr {
+    name: "strides"
+    type: "list(int)"
+    has_minimum: true
+    minimum: 5
+  }
+  attr {
+    name: "padding"
+    type: "string"
+    allowed_values {
+      list {
+        s: "SAME"
+        s: "VALID"
+      }
+    }
+  }
+  attr {
+    name: "data_format"
+    type: "string"
+    default_value {
+      s: "NDHWC"
+    }
+    allowed_values {
+      list {
+        s: "NDHWC"
+        s: "NCDHW"
+      }
+    }
+  }
+  attr {
+    name: "dilations"
+    type: "list(int)"
+    default_value {
+      list {
+        i: 1
+        i: 1
+        i: 1
+        i: 1
+        i: 1
+      }
+    }
+  }
+  attr {
+    name: "Tshape"
+    type: "type"
+    default_value {
+      type: DT_INT32
+    }
+    allowed_values {
+      list {
+        type: DT_INT32
+        type: DT_INT64
+      }
+    }
+  }
+}
 op {
   name: "Copy"
   input_arg {
@@ -16476,6 +16743,17 @@ op {
     }
   }
 }
+op {
+  name: "DatasetToGraph"
+  input_arg {
+    name: "input_dataset"
+    type: DT_VARIANT
+  }
+  output_arg {
+    name: "graph"
+    type: DT_STRING
+  }
+}
 op {
   name: "DatasetToSingleElement"
   input_arg {
@@ -21295,6 +21573,21 @@ op {
     type: DT_STRING
   }
 }
+op {
+  name: "FakeParam"
+  output_arg {
+    name: "output"
+    type_attr: "dtype"
+  }
+  attr {
+    name: "dtype"
+    type: "type"
+  }
+  attr {
+    name: "shape"
+    type: "shape"
+  }
+}
 op {
   name: "FakeQuantWithMinMaxArgs"
   input_arg {
@@ -21896,6 +22189,33 @@ op {
   }
   is_stateful: true
 }
+op {
+  name: "FeatureStatsDataset"
+  input_arg {
+    name: "input_dataset"
+    type: DT_VARIANT
+  }
+  input_arg {
+    name: "tag"
+    type: DT_STRING
+  }
+  output_arg {
+    name: "handle"
+    type: DT_VARIANT
+  }
+  attr {
+    name: "output_types"
+    type: "list(type)"
+    has_minimum: true
+    minimum: 1
+  }
+  attr {
+    name: "output_shapes"
+    type: "list(shape)"
+    has_minimum: true
+    minimum: 1
+  }
+}
 op {
   name: "Fill"
   input_arg {
@@ -25044,6 +25364,29 @@ op {
     type: "type"
   }
 }
+op {
+  name: "IdentityDataset"
+  input_arg {
+    name: "input_dataset"
+    type: DT_VARIANT
+  }
+  output_arg {
+    name: "handle"
+    type: DT_VARIANT
+  }
+  attr {
+    name: "output_types"
+    type: "list(type)"
+    has_minimum: true
+    minimum: 1
+  }
+  attr {
+    name: "output_shapes"
+    type: "list(shape)"
+    has_minimum: true
+    minimum: 1
+  }
+}
 op {
   name: "IdentityN"
   input_arg {
@@ -25171,6 +25514,44 @@ op {
     type: "func"
   }
 }
+op {
+  name: "If"
+  input_arg {
+    name: "cond"
+    type_attr: "Tcond"
+  }
+  input_arg {
+    name: "input"
+    type_list_attr: "Tin"
+  }
+  output_arg {
+    name: "output"
+    type_list_attr: "Tout"
+  }
+  attr {
+    name: "Tcond"
+    type: "type"
+  }
+  attr {
+    name: "Tin"
+    type: "list(type)"
+    has_minimum: true
+  }
+  attr {
+    name: "Tout"
+    type: "list(type)"
+    has_minimum: true
+    minimum: 1
+  }
+  attr {
+    name: "then_branch"
+    type: "func"
+  }
+  attr {
+    name: "else_branch"
+    type: "func"
+  }
+}
 op {
   name: "Igamma"
   input_arg {
@@ -35008,6 +35389,33 @@ op {
     }
   }
 }
+op {
+  name: "OptimizeDataset"
+  input_arg {
+    name: "input_dataset"
+    type: DT_VARIANT
+  }
+  input_arg {
+    name: "optimizations"
+    type: DT_STRING
+  }
+  output_arg {
+    name: "handle"
+    type: DT_VARIANT
+  }
+  attr {
+    name: "output_types"
+    type: "list(type)"
+    has_minimum: true
+    minimum: 1
+  }
+  attr {
+    name: "output_shapes"
+    type: "list(shape)"
+    has_minimum: true
+    minimum: 1
+  }
+}
 op {
   name: "OrderedMapClear"
   attr {
@@ -35524,6 +35932,52 @@ op {
     minimum: 1
   }
 }
+op {
+  name: "PaddedBatchDatasetV2"
+  input_arg {
+    name: "input_dataset"
+    type: DT_VARIANT
+  }
+  input_arg {
+    name: "batch_size"
+    type: DT_INT64
+  }
+  input_arg {
+    name: "padded_shapes"
+    type: DT_INT64
+    number_attr: "N"
+  }
+  input_arg {
+    name: "padding_values"
+    type_list_attr: "Toutput_types"
+  }
+  input_arg {
+    name: "drop_remainder"
+    type: DT_BOOL
+  }
+  output_arg {
+    name: "handle"
+    type: DT_VARIANT
+  }
+  attr {
+    name: "Toutput_types"
+    type: "list(type)"
+    has_minimum: true
+    minimum: 1
+  }
+  attr {
+    name: "output_shapes"
+    type: "list(shape)"
+    has_minimum: true
+    minimum: 1
+  }
+  attr {
+    name: "N"
+    type: "int"
+    has_minimum: true
+    minimum: 1
+  }
+}
 op {
   name: "PaddingFIFOQueue"
   output_arg {
@@ -42361,6 +42815,21 @@ op {
   }
   allows_uninitialized_input: true
 }
+op {
+  name: "RegexFullMatch"
+  input_arg {
+    name: "input"
+    type: DT_STRING
+  }
+  input_arg {
+    name: "pattern"
+    type: DT_STRING
+  }
+  output_arg {
+    name: "output"
+    type: DT_BOOL
+  }
+}
 op {
   name: "RegexReplace"
   input_arg {
@@ -47852,6 +48321,43 @@ op {
   }
   is_stateful: true
 }
+op {
+  name: "ResourceScatterNdAdd"
+  input_arg {
+    name: "ref"
+    type: DT_RESOURCE
+  }
+  input_arg {
+    name: "indices"
+    type_attr: "Tindices"
+  }
+  input_arg {
+    name: "updates"
+    type_attr: "T"
+  }
+  attr {
+    name: "T"
+    type: "type"
+  }
+  attr {
+    name: "Tindices"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_INT32
+        type: DT_INT64
+      }
+    }
+  }
+  attr {
+    name: "use_locking"
+    type: "bool"
+    default_value {
+      b: true
+    }
+  }
+  is_stateful: true
+}
 op {
   name: "ResourceScatterNdUpdate"
   input_arg {
@@ -55295,6 +55801,56 @@ op {
     }
   }
 }
+op {
+  name: "SegmentMean"
+  input_arg {
+    name: "data"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "segment_ids"
+    type_attr: "Tindices"
+  }
+  output_arg {
+    name: "output"
+    type_attr: "T"
+  }
+  attr {
+    name: "T"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_FLOAT
+        type: DT_DOUBLE
+        type: DT_INT32
+        type: DT_UINT8
+        type: DT_INT16
+        type: DT_INT8
+        type: DT_COMPLEX64
+        type: DT_INT64
+        type: DT_QINT8
+        type: DT_QUINT8
+        type: DT_QINT32
+        type: DT_BFLOAT16
+        type: DT_UINT16
+        type: DT_COMPLEX128
+        type: DT_HALF
+        type: DT_UINT32
+        type: DT_UINT64
+      }
+    }
+  }
+  attr {
+    name: "Tindices"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_INT32
+        type: DT_INT64
+      }
+    }
+  }
+}
 op {
   name: "SegmentMin"
   input_arg {
@@ -69126,6 +69682,34 @@ op {
   }
   is_stateful: true
 }
+op {
+  name: "TensorArrayGradWithShape"
+  input_arg {
+    name: "handle"
+    type: DT_RESOURCE
+  }
+  input_arg {
+    name: "flow_in"
+    type: DT_FLOAT
+  }
+  input_arg {
+    name: "shape_to_prepend"
+    type: DT_INT32
+  }
+  output_arg {
+    name: "grad_handle"
+    type: DT_RESOURCE
+  }
+  output_arg {
+    name: "flow_out"
+    type: DT_FLOAT
+  }
+  attr {
+    name: "source"
+    type: "string"
+  }
+  is_stateful: true
+}
 op {
   name: "TensorArrayPack"
   input_arg {
diff --git a/tensorflow/core/ops/data_flow_ops.cc b/tensorflow/core/ops/data_flow_ops.cc
index 3112f35da43d16d7a4cd4c1c8e017cab3366e070..eed0bce174387220d0dde3a1ecb6ef5c8cd22de1 100644
--- a/tensorflow/core/ops/data_flow_ops.cc
+++ b/tensorflow/core/ops/data_flow_ops.cc
@@ -608,6 +608,50 @@ REGISTER_OP("TensorArrayGradV3")
       return Status::OK();
     });
 
+REGISTER_OP("TensorArrayGradWithShape")
+    .Input("handle: resource")
+    .Input("flow_in: float")
+    .Input("shape_to_prepend: int32")
+    .Output("grad_handle: resource")
+    .Output("flow_out: float")
+    .Attr("source: string")
+    .SetIsStateful()
+    .SetShapeFn([](InferenceContext* c) {
+      ShapeHandle handle;
+      DimensionHandle unused_dim;
+      TF_RETURN_IF_ERROR(c->WithRank(c->input(0), 1, &handle));
+      TF_RETURN_IF_ERROR(c->WithValue(c->Dim(handle, 0), 2, &unused_dim));
+      c->set_output(0, c->Vector(2));
+      c->set_output(1, c->Scalar());
+      auto* shape_and_type = c->input_handle_shapes_and_types(0);
+      if (shape_and_type) {
+        auto input_shape = (*shape_and_type)[0].shape;
+        auto dtype = (*shape_and_type)[0].dtype;
+        // Note that shape_to_preped is a rank 1 Tensor representing a shape.
+        // The size of dimension 0 is the number of dimensions we need to add to
+        // output shape.
+        int64 prepend_rank = c->Value(c->Dim(c->input(2), 0));
+        if (c->RankKnown(input_shape) &&
+            prepend_rank != InferenceContext::kUnknownDim) {
+          int32 input_rank = c->Rank(input_shape);
+          std::vector<DimensionHandle> dims;
+          dims.reserve(prepend_rank + input_rank);
+          for (int i = 0; i < prepend_rank; ++i) {
+            dims.push_back(c->UnknownDim());
+          }
+          for (int i = 0; i < input_rank; ++i) {
+            dims.push_back(c->Dim(input_shape, i));
+          }
+          c->set_output_handle_shapes_and_types(0,
+                                                {{c->MakeShape(dims), dtype}});
+        } else {
+          c->set_output_handle_shapes_and_types(0,
+                                                {{c->UnknownShape(), dtype}});
+        }
+      }
+      return Status::OK();
+    });
+
 REGISTER_OP("TensorArrayWriteV3")
     .Input("handle: resource")
     .Input("index: int32")
diff --git a/tensorflow/core/ops/dataset_ops.cc b/tensorflow/core/ops/dataset_ops.cc
index 29d9cfbde94ea8b59de9b8e7b6ca0f726b714dd7..9dca5f53ce5cd96d62dd4bfcb96d883492082545 100644
--- a/tensorflow/core/ops/dataset_ops.cc
+++ b/tensorflow/core/ops/dataset_ops.cc
@@ -166,6 +166,18 @@ REGISTER_OP("LatencyStatsDataset")
       return shape_inference::ScalarShape(c);
     });
 
+REGISTER_OP("FeatureStatsDataset")
+    .Input("input_dataset: variant")
+    .Input("tag: string")
+    .Output("handle: variant")
+    .Attr("output_types: list(type) >= 1")
+    .Attr("output_shapes: list(shape) >= 1")
+    .SetShapeFn([](shape_inference::InferenceContext* c) {
+      shape_inference::ShapeHandle tag_shape;
+      TF_RETURN_IF_ERROR(c->WithRank(c->input(1), 0, &tag_shape));
+      return shape_inference::ScalarShape(c);
+    });
+
 REGISTER_OP("SetStatsAggregatorDataset")
     .Input("input_dataset: variant")
     .Input("stats_aggregator: resource")
@@ -363,6 +375,22 @@ REGISTER_OP("BatchDataset")
       return shape_inference::ScalarShape(c);
     });
 
+REGISTER_OP("BatchDatasetV2")
+    .Input("input_dataset: variant")
+    .Input("batch_size: int64")
+    .Input("drop_remainder: bool")
+    .Output("handle: variant")
+    .Attr("output_types: list(type) >= 1")
+    .Attr("output_shapes: list(shape) >= 1")
+    .SetShapeFn([](shape_inference::InferenceContext* c) {
+      shape_inference::ShapeHandle unused;
+      // batch_size should be a scalar.
+      TF_RETURN_IF_ERROR(c->WithRank(c->input(1), 0, &unused));
+      // drop_remainder should be a scalar.
+      TF_RETURN_IF_ERROR(c->WithRank(c->input(2), 0, &unused));
+      return shape_inference::ScalarShape(c);
+    });
+
 // TODO(mrry): move SlideDataset to contrib in the future.
 REGISTER_OP("SlideDataset")
     .Input("input_dataset: variant")
@@ -379,6 +407,10 @@ REGISTER_OP("SlideDataset")
       return shape_inference::ScalarShape(c);
     });
 
+// TODO(mrry): Validate that `padded_shapes` are all vectors, the lengths of
+// `output_types` and `output_shapes` are `N` the `output_shapes` are (as far as
+// possible to tell statically) compatible with `padded_shapes`, and that
+// `padding_values` are all scalars.
 REGISTER_OP("PaddedBatchDataset")
     .Input("input_dataset: variant")
     .Input("batch_size: int64")
@@ -388,17 +420,32 @@ REGISTER_OP("PaddedBatchDataset")
     .Attr("Toutput_types: list(type) >= 1")
     .Attr("output_shapes: list(shape) >= 1")
     .Attr("N: int >= 1")
-    .SetShapeFn(shape_inference::ScalarShape);  // TODO(mrry): Validate that
-                                                // `padded_shapes` are all
-                                                // vectors, the lengths of
-                                                // `output_types` and
-                                                // `output_shapes` are `N`,
-                                                // the `output_shapes` are (as
-                                                // far as possible to tell
-                                                // statically) compatible with
-                                                // `padded_shapes`, and
-                                                // that `padding_values` are
-                                                // all scalars.
+    .SetShapeFn([](shape_inference::InferenceContext* c) {
+      shape_inference::ShapeHandle unused;
+      // batch_size should be a scalar.
+      TF_RETURN_IF_ERROR(c->WithRank(c->input(1), 0, &unused));
+      return shape_inference::ScalarShape(c);
+    });
+
+REGISTER_OP("PaddedBatchDatasetV2")
+    .Input("input_dataset: variant")
+    .Input("batch_size: int64")
+    .Input("padded_shapes: N * int64")
+    .Input("padding_values: Toutput_types")
+    .Input("drop_remainder: bool")
+    .Output("handle: variant")
+    .Attr("Toutput_types: list(type) >= 1")
+    .Attr("output_shapes: list(shape) >= 1")
+    .Attr("N: int >= 1")
+    .SetShapeFn([](shape_inference::InferenceContext* c) {
+      shape_inference::ShapeHandle unused;
+      // batch_size should be a scalar.
+      TF_RETURN_IF_ERROR(c->WithRank(c->input(1), 0, &unused));
+      // drop_remainder should be a scalar.
+      TF_RETURN_IF_ERROR(
+          c->WithRank(c->input(c->num_inputs() - 1), 0, &unused));
+      return shape_inference::ScalarShape(c);
+    });
 
 REGISTER_OP("DenseToSparseBatchDataset")
     .Input("input_dataset: variant")
@@ -584,6 +631,12 @@ REGISTER_OP("Iterator")
     .Attr("output_shapes: list(shape) >= 1")
     .SetShapeFn(shape_inference::ScalarShape);
 
+REGISTER_OP("AnonymousIterator")
+    .Output("handle: resource")
+    .Attr("output_types: list(type) >= 1")
+    .Attr("output_shapes: list(shape) >= 1")
+    .SetShapeFn(shape_inference::ScalarShape);
+
 REGISTER_OP("MakeIterator")
     .Input("dataset: variant")
     .Input("iterator: resource")
@@ -712,4 +765,24 @@ REGISTER_OP("DatasetToTFRecord")
     .Input("compression_type: string")
     .SetShapeFn(shape_inference::NoOutputs);
 
+REGISTER_OP("DatasetToGraph")
+    .Input("input_dataset: variant")
+    .Output("graph: string")
+    .SetShapeFn(shape_inference::ScalarShape);
+
+REGISTER_OP("IdentityDataset")
+    .Input("input_dataset: variant")
+    .Output("handle: variant")
+    .Attr("output_types: list(type) >= 1")
+    .Attr("output_shapes: list(shape) >= 1")
+    .SetShapeFn(shape_inference::ScalarShape);
+
+REGISTER_OP("OptimizeDataset")
+    .Input("input_dataset: variant")
+    .Input("optimizations: string")
+    .Output("handle: variant")
+    .Attr("output_types: list(type) >= 1")
+    .Attr("output_shapes: list(shape) >= 1")
+    .SetShapeFn(shape_inference::ScalarShape);
+
 }  // namespace tensorflow
diff --git a/tensorflow/core/ops/functional_ops.cc b/tensorflow/core/ops/functional_ops.cc
index 4d4a370478e0512bfb0c4a3ee146bd39d4934e76..88553dff93a255d707fed166de67493a6b35d488 100644
--- a/tensorflow/core/ops/functional_ops.cc
+++ b/tensorflow/core/ops/functional_ops.cc
@@ -82,7 +82,7 @@ REGISTER_OP("If")
     .Input("input: Tin")
     .Output("output: Tout")
     .Attr("Tcond: type")
-    .Attr("Tin: list(type)")
+    .Attr("Tin: list(type) >= 0")
     .Attr("Tout: list(type)")
     .Attr("then_branch: func")
     .Attr("else_branch: func")
@@ -154,4 +154,21 @@ REGISTER_OP("PartitionedCall")
     .Attr("f: func")
     .SetShapeFn(shape_inference::UnknownShape);
 
+// This op is used as a placeholder in If branch functions. It doesn't provide a
+// valid output when run, so must either be removed (e.g. replaced with a
+// function input) or guaranteed not to be used (e.g. if mirroring an
+// intermediate output needed for the gradient computation of the other branch).
+REGISTER_OP("FakeParam")
+    .Output("output: dtype")
+    .Attr("dtype: type")
+    .Attr("shape: shape")
+    .SetShapeFn([](InferenceContext* c) {
+      PartialTensorShape shape;
+      TF_RETURN_IF_ERROR(c->GetAttr("shape", &shape));
+      shape_inference::ShapeHandle out;
+      TF_RETURN_IF_ERROR(c->MakeShapeFromPartialTensorShape(shape, &out));
+      c->set_output(0, out);
+      return Status::OK();
+    });
+
 }  // end namespace tensorflow
diff --git a/tensorflow/core/ops/image_ops.cc b/tensorflow/core/ops/image_ops.cc
index d949e70c661467edbadae9fdb6f74d7e4cc3de02..87f4991134afd7ad78bc9b5f7cd9bdf8f54e80f8 100644
--- a/tensorflow/core/ops/image_ops.cc
+++ b/tensorflow/core/ops/image_ops.cc
@@ -454,7 +454,9 @@ REGISTER_OP("DrawBoundingBoxes")
       DimensionHandle unused;
       TF_RETURN_IF_ERROR(c->WithValue(c->Dim(boxes, 2), 4, &unused));
 
-      return shape_inference::UnchangedShapeWithRankAtLeast(c, 3);
+      // The rank of the input image (rank = 4) has already been restricted
+      // above, and the output is of the same shape as the input.
+      return shape_inference::UnchangedShape(c);
     });
 
 // --------------------------------------------------------------------------
diff --git a/tensorflow/core/ops/math_ops.cc b/tensorflow/core/ops/math_ops.cc
index 8c0b073ce4646875921c7cfbaa47f44e4c67ec19..0f2d438a035c1d4a4c758fcbc002ab6137d65fc9 100644
--- a/tensorflow/core/ops/math_ops.cc
+++ b/tensorflow/core/ops/math_ops.cc
@@ -239,6 +239,10 @@ REGISTER_OP("Acos").UNARY();
 
 REGISTER_OP("Atan").UNARY();
 
+REGISTER_OP("BesselI0e").UNARY_REAL();
+
+REGISTER_OP("BesselI1e").UNARY_REAL();
+
 #undef UNARY
 #undef UNARY_REAL
 #undef UNARY_COMPLEX
@@ -592,7 +596,13 @@ REGISTER_OP("ApproximateEqual")
     .SetIsCommutative()
     .Attr("T: numbertype")
     .Attr("tolerance: float = 0.00001")
-    .SetShapeFn(shape_inference::UnchangedShape);
+    .SetShapeFn([](InferenceContext* c) {
+      // The inputs 'x' and 'y' must have the same shape.
+      ShapeHandle data_x = c->input(0);
+      ShapeHandle data_y = c->input(1);
+      TF_RETURN_IF_ERROR(c->Merge(data_x, data_y, &data_x));
+      return shape_inference::UnchangedShape(c);
+    });
 
 // --------------------------------------------------------------------------
 
@@ -1080,7 +1090,7 @@ REGISTER_OP("UnsortedSegmentProd")
     .Input("segment_ids: Tindices")
     .Input("num_segments: Tnumsegments")
     .Output("output: T")
-    .Attr("T: realnumbertype")
+    .Attr("T: numbertype")
     .Attr("Tindices: {int32,int64}")
     .Attr("Tnumsegments: {int32,int64} = DT_INT32")
     .SetShapeFn(UnsortedSegmentReductionShapeFn);
diff --git a/tensorflow/core/ops/ops.pbtxt b/tensorflow/core/ops/ops.pbtxt
index e45125a1e8768e7136480bba13ac7775507da9bc..c609703bcb069dccea3dcbe6c8833ab2e01631ad 100644
--- a/tensorflow/core/ops/ops.pbtxt
+++ b/tensorflow/core/ops/ops.pbtxt
@@ -649,6 +649,26 @@ op {
     }
   }
 }
+op {
+  name: "AnonymousIterator"
+  output_arg {
+    name: "handle"
+    type: DT_RESOURCE
+  }
+  attr {
+    name: "output_types"
+    type: "list(type)"
+    has_minimum: true
+    minimum: 1
+  }
+  attr {
+    name: "output_shapes"
+    type: "list(shape)"
+    has_minimum: true
+    minimum: 1
+  }
+  is_stateful: true
+}
 op {
   name: "Any"
   input_arg {
@@ -2984,6 +3004,37 @@ op {
     minimum: 1
   }
 }
+op {
+  name: "BatchDatasetV2"
+  input_arg {
+    name: "input_dataset"
+    type: DT_VARIANT
+  }
+  input_arg {
+    name: "batch_size"
+    type: DT_INT64
+  }
+  input_arg {
+    name: "drop_remainder"
+    type: DT_BOOL
+  }
+  output_arg {
+    name: "handle"
+    type: DT_VARIANT
+  }
+  attr {
+    name: "output_types"
+    type: "list(type)"
+    has_minimum: true
+    minimum: 1
+  }
+  attr {
+    name: "output_shapes"
+    type: "list(shape)"
+    has_minimum: true
+    minimum: 1
+  }
+}
 op {
   name: "BatchFFT"
   input_arg {
@@ -3029,6 +3080,90 @@ op {
     explanation: "Use FFT3D"
   }
 }
+op {
+  name: "BatchFunction"
+  input_arg {
+    name: "in_tensors"
+    type_list_attr: "Tin"
+  }
+  input_arg {
+    name: "captured_tensors"
+    type_list_attr: "Tcaptured"
+  }
+  output_arg {
+    name: "out_tensors"
+    type_list_attr: "Tout"
+  }
+  attr {
+    name: "f"
+    type: "func"
+  }
+  attr {
+    name: "num_batch_threads"
+    type: "int"
+  }
+  attr {
+    name: "max_batch_size"
+    type: "int"
+  }
+  attr {
+    name: "batch_timeout_micros"
+    type: "int"
+  }
+  attr {
+    name: "max_enqueued_batches"
+    type: "int"
+    default_value {
+      i: 10
+    }
+  }
+  attr {
+    name: "allowed_batch_sizes"
+    type: "list(int)"
+    default_value {
+      list {
+      }
+    }
+  }
+  attr {
+    name: "container"
+    type: "string"
+    default_value {
+      s: ""
+    }
+  }
+  attr {
+    name: "shared_name"
+    type: "string"
+    default_value {
+      s: ""
+    }
+  }
+  attr {
+    name: "batching_queue"
+    type: "string"
+    default_value {
+      s: ""
+    }
+  }
+  attr {
+    name: "Tin"
+    type: "list(type)"
+    has_minimum: true
+    minimum: 1
+  }
+  attr {
+    name: "Tcaptured"
+    type: "list(type)"
+    has_minimum: true
+  }
+  attr {
+    name: "Tout"
+    type: "list(type)"
+    has_minimum: true
+    minimum: 1
+  }
+}
 op {
   name: "BatchIFFT"
   input_arg {
@@ -3725,6 +3860,52 @@ op {
     }
   }
 }
+op {
+  name: "BesselI0e"
+  input_arg {
+    name: "x"
+    type_attr: "T"
+  }
+  output_arg {
+    name: "y"
+    type_attr: "T"
+  }
+  attr {
+    name: "T"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_BFLOAT16
+        type: DT_HALF
+        type: DT_FLOAT
+        type: DT_DOUBLE
+      }
+    }
+  }
+}
+op {
+  name: "BesselI1e"
+  input_arg {
+    name: "x"
+    type_attr: "T"
+  }
+  output_arg {
+    name: "y"
+    type_attr: "T"
+  }
+  attr {
+    name: "T"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_BFLOAT16
+        type: DT_HALF
+        type: DT_FLOAT
+        type: DT_DOUBLE
+      }
+    }
+  }
+}
 op {
   name: "Betainc"
   input_arg {
@@ -5927,7 +6108,7 @@ op {
   name: "Conv3DBackpropInputV2"
   input_arg {
     name: "input_sizes"
-    type: DT_INT32
+    type_attr: "Tshape"
   }
   input_arg {
     name: "filter"
@@ -5995,6 +6176,19 @@ op {
       }
     }
   }
+  attr {
+    name: "Tshape"
+    type: "type"
+    default_value {
+      type: DT_INT32
+    }
+    allowed_values {
+      list {
+        type: DT_INT32
+        type: DT_INT64
+      }
+    }
+  }
 }
 op {
   name: "Copy"
@@ -7437,6 +7631,17 @@ op {
     }
   }
 }
+op {
+  name: "DatasetToGraph"
+  input_arg {
+    name: "input_dataset"
+    type: DT_VARIANT
+  }
+  output_arg {
+    name: "graph"
+    type: DT_STRING
+  }
+}
 op {
   name: "DatasetToSingleElement"
   input_arg {
@@ -9875,6 +10080,21 @@ op {
     type: DT_STRING
   }
 }
+op {
+  name: "FakeParam"
+  output_arg {
+    name: "output"
+    type_attr: "dtype"
+  }
+  attr {
+    name: "dtype"
+    type: "type"
+  }
+  attr {
+    name: "shape"
+    type: "shape"
+  }
+}
 op {
   name: "FakeQuantWithMinMaxArgs"
   input_arg {
@@ -10126,6 +10346,33 @@ op {
   }
   is_stateful: true
 }
+op {
+  name: "FeatureStatsDataset"
+  input_arg {
+    name: "input_dataset"
+    type: DT_VARIANT
+  }
+  input_arg {
+    name: "tag"
+    type: DT_STRING
+  }
+  output_arg {
+    name: "handle"
+    type: DT_VARIANT
+  }
+  attr {
+    name: "output_types"
+    type: "list(type)"
+    has_minimum: true
+    minimum: 1
+  }
+  attr {
+    name: "output_shapes"
+    type: "list(shape)"
+    has_minimum: true
+    minimum: 1
+  }
+}
 op {
   name: "Fill"
   input_arg {
@@ -12046,6 +12293,29 @@ op {
     type: "type"
   }
 }
+op {
+  name: "IdentityDataset"
+  input_arg {
+    name: "input_dataset"
+    type: DT_VARIANT
+  }
+  output_arg {
+    name: "handle"
+    type: DT_VARIANT
+  }
+  attr {
+    name: "output_types"
+    type: "list(type)"
+    has_minimum: true
+    minimum: 1
+  }
+  attr {
+    name: "output_shapes"
+    type: "list(shape)"
+    has_minimum: true
+    minimum: 1
+  }
+}
 op {
   name: "IdentityN"
   input_arg {
@@ -12134,7 +12404,6 @@ op {
     name: "Tin"
     type: "list(type)"
     has_minimum: true
-    minimum: 1
   }
   attr {
     name: "Tout"
@@ -16796,6 +17065,33 @@ op {
     }
   }
 }
+op {
+  name: "OptimizeDataset"
+  input_arg {
+    name: "input_dataset"
+    type: DT_VARIANT
+  }
+  input_arg {
+    name: "optimizations"
+    type: DT_STRING
+  }
+  output_arg {
+    name: "handle"
+    type: DT_VARIANT
+  }
+  attr {
+    name: "output_types"
+    type: "list(type)"
+    has_minimum: true
+    minimum: 1
+  }
+  attr {
+    name: "output_shapes"
+    type: "list(shape)"
+    has_minimum: true
+    minimum: 1
+  }
+}
 op {
   name: "OrderedMapClear"
   attr {
@@ -17269,6 +17565,52 @@ op {
     minimum: 1
   }
 }
+op {
+  name: "PaddedBatchDatasetV2"
+  input_arg {
+    name: "input_dataset"
+    type: DT_VARIANT
+  }
+  input_arg {
+    name: "batch_size"
+    type: DT_INT64
+  }
+  input_arg {
+    name: "padded_shapes"
+    type: DT_INT64
+    number_attr: "N"
+  }
+  input_arg {
+    name: "padding_values"
+    type_list_attr: "Toutput_types"
+  }
+  input_arg {
+    name: "drop_remainder"
+    type: DT_BOOL
+  }
+  output_arg {
+    name: "handle"
+    type: DT_VARIANT
+  }
+  attr {
+    name: "Toutput_types"
+    type: "list(type)"
+    has_minimum: true
+    minimum: 1
+  }
+  attr {
+    name: "output_shapes"
+    type: "list(shape)"
+    has_minimum: true
+    minimum: 1
+  }
+  attr {
+    name: "N"
+    type: "int"
+    has_minimum: true
+    minimum: 1
+  }
+}
 op {
   name: "PaddingFIFOQueue"
   output_arg {
@@ -21494,6 +21836,21 @@ op {
   }
   allows_uninitialized_input: true
 }
+op {
+  name: "RegexFullMatch"
+  input_arg {
+    name: "input"
+    type: DT_STRING
+  }
+  input_arg {
+    name: "pattern"
+    type: DT_STRING
+  }
+  output_arg {
+    name: "output"
+    type: DT_BOOL
+  }
+}
 op {
   name: "RegexReplace"
   input_arg {
@@ -23423,6 +23780,43 @@ op {
   }
   is_stateful: true
 }
+op {
+  name: "ResourceScatterNdAdd"
+  input_arg {
+    name: "ref"
+    type: DT_RESOURCE
+  }
+  input_arg {
+    name: "indices"
+    type_attr: "Tindices"
+  }
+  input_arg {
+    name: "updates"
+    type_attr: "T"
+  }
+  attr {
+    name: "T"
+    type: "type"
+  }
+  attr {
+    name: "Tindices"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_INT32
+        type: DT_INT64
+      }
+    }
+  }
+  attr {
+    name: "use_locking"
+    type: "bool"
+    default_value {
+      b: true
+    }
+  }
+  is_stateful: true
+}
 op {
   name: "ResourceScatterNdUpdate"
   input_arg {
@@ -26047,9 +26441,14 @@ op {
         type: DT_UINT8
         type: DT_INT16
         type: DT_INT8
+        type: DT_COMPLEX64
         type: DT_INT64
+        type: DT_QINT8
+        type: DT_QUINT8
+        type: DT_QINT32
         type: DT_BFLOAT16
         type: DT_UINT16
+        type: DT_COMPLEX128
         type: DT_HALF
         type: DT_UINT32
         type: DT_UINT64
@@ -32162,6 +32561,34 @@ op {
   }
   is_stateful: true
 }
+op {
+  name: "TensorArrayGradWithShape"
+  input_arg {
+    name: "handle"
+    type: DT_RESOURCE
+  }
+  input_arg {
+    name: "flow_in"
+    type: DT_FLOAT
+  }
+  input_arg {
+    name: "shape_to_prepend"
+    type: DT_INT32
+  }
+  output_arg {
+    name: "grad_handle"
+    type: DT_RESOURCE
+  }
+  output_arg {
+    name: "flow_out"
+    type: DT_FLOAT
+  }
+  attr {
+    name: "source"
+    type: "string"
+  }
+  is_stateful: true
+}
 op {
   name: "TensorArrayPack"
   input_arg {
diff --git a/tensorflow/core/ops/scoped_allocator_ops.cc b/tensorflow/core/ops/scoped_allocator_ops.cc
index 1e0dcdac96cba14763a880290fc4b6924bc7b792..359b4d8756fb23d7282677578f0f0e4215837737 100644
--- a/tensorflow/core/ops/scoped_allocator_ops.cc
+++ b/tensorflow/core/ops/scoped_allocator_ops.cc
@@ -21,6 +21,7 @@ namespace tensorflow {
 REGISTER_OP("_ScopedAllocator")
     .Output("output: T")
     .Attr("shapes: list(shape)")
+    .Attr("shape: shape")
     .Attr("T: type")
     .Attr("sa_name: string")
     .Attr("id: int")
@@ -35,6 +36,16 @@ Returns a reference to this value.
 
 This is an experimental op for internal use only.  It is possible to use this
 op in unsafe ways.
+
+'shapes' is a list of the shapes of the tensors that are to be allocated
+by this ScopedAllocator.
+'shape' is the shape of the output of this Op, i.e. the 1D backing tensor
+from which the individual allocated tensors are aliased.
+'sa_name' is the name assigned to the Node, for connectivity specification
+and debugging.
+'id' is a non-negative integer 'scope_id' handled by the ScopedAllocatorMgr.
+'expected_call_count' is the number of individual tensors expected to
+be allocated from the backing tensor.
 )doc");
 
 REGISTER_OP("_ScopedAllocatorConcat")
@@ -57,6 +68,18 @@ reference to that ScopedAllocator's backing tensor.
 
 This is an experimental op for internal use only.  It is possible to use this
 op in unsafe ways.
+
+'backing' is the backing tensor, i.e. the output of an upstream ScopedAllocator.
+'inputs' is a list of nominal input tensors, all of which must be aliases
+to regions of the backing tensor.  These will be outputs of upstream nodes
+that allocate their outputs from the same ScopedAllocator.
+'shape' is the shape of the output, which will usually be the same shape as
+the input backing tensor.
+'reshape' is true iff the output shape is to be different from that of
+the input backing tensor.
+'sa_name' is the Node name of the upstream ScopedAllocator.
+'id' is the scope_id identifying the upstream ScopedAllocator.
+'N' is the number of nominal inputs to be concatenated.
 )doc");
 
 REGISTER_OP("_ScopedAllocatorSplit")
@@ -67,8 +90,9 @@ REGISTER_OP("_ScopedAllocatorSplit")
     .Attr("sa_name: string")
     .Attr("id: int")
     .Attr("N: int >= 2")
+    .Attr("shapes: list(shape)")
     .SetIsStateful()
-    .SetShapeFn(shape_inference::ExplicitShape)
+    .SetShapeFn(shape_inference::ExplicitShapes)
     .Doc(R"doc(
 Acts roughly like a SplitV Op that splits one tensor into multiple tensors
 but must only be used in conjunction with corresponding ScopedAllocator
@@ -79,6 +103,17 @@ second list.
 
 This is an experimental op for internal use only.  It is possible to use this
 op in unsafe ways.
+
+'concat' is the single output produced by an upstream ScopedAllocatorConcat
+node.  This is actually the backing tensor from a ScopedAllocator node
+upstream of the ScopedAllocatorConcat.
+'split' is a list of tensors aliased from the backing tensor.  It will
+become the output of this ScopedAllocatorSplit node.
+'type' is the common DataType of all of the input and output tensors.
+'sa_name' is the Node name of the upstream ScopedAllocator.
+'id' is the scope_id identifying the upstream ScopedAllocator.
+'N' is the number of split tensors.
+'shapes' is a list of the split tensor shapes.
 )doc");
 
 }  // end namespace tensorflow
diff --git a/tensorflow/core/ops/state_ops.cc b/tensorflow/core/ops/state_ops.cc
index 664f52452e3339e895f568f83e1fbf80cdd8f035..aa975cb77bafb3b31f0d612d0f662cef0bde06f2 100644
--- a/tensorflow/core/ops/state_ops.cc
+++ b/tensorflow/core/ops/state_ops.cc
@@ -222,6 +222,15 @@ REGISTER_OP("ResourceScatterNdUpdate")
     .Attr("use_locking: bool = true")
     .SetShapeFn(shape_inference::ScatterNdUpdateShape);
 
+REGISTER_OP("ResourceScatterNdAdd")
+    .Input("ref: resource")
+    .Input("indices: Tindices")
+    .Input("updates: T")
+    .Attr("T: type")
+    .Attr("Tindices: {int32, int64}")
+    .Attr("use_locking: bool = true")
+    .SetShapeFn(shape_inference::ScatterNdUpdateShape);
+
 REGISTER_OP("ScatterNdAdd")
     .Input("ref: Ref(T)")
     .Input("indices: Tindices")
diff --git a/tensorflow/core/ops/string_ops.cc b/tensorflow/core/ops/string_ops.cc
index 1d5c743a56cec3d9888e0c19c0a6728d1a4fe682..44230623629986c6f47ea67327b26538f6a306dc 100644
--- a/tensorflow/core/ops/string_ops.cc
+++ b/tensorflow/core/ops/string_ops.cc
@@ -78,7 +78,7 @@ REGISTER_OP("ReduceJoin")
 REGISTER_OP("AsString")
     .Input("input: T")
     .Output("output: string")
-    .Attr("T: {int32, int64, complex64, float, double, bool, int8}")
+    .Attr("T: {int8, int16, int32, int64, complex64, float, double, bool}")
     .Attr("precision: int = -1")
     .Attr("scientific: bool = false")
     .Attr("shortest: bool = false")
@@ -134,6 +134,24 @@ REGISTER_OP("StringSplit")
       return Status::OK();
     });
 
+REGISTER_OP("StringSplitV2")
+    .Input("input: string")
+    .Input("sep: string")
+    .Output("indices: int64")
+    .Output("values: string")
+    .Output("shape: int64")
+    .Attr("maxsplit: int = -1")
+    .SetShapeFn([](InferenceContext* c) {
+      ShapeHandle unused;
+      TF_RETURN_IF_ERROR(c->WithRank(c->input(0), 1, &unused));
+      TF_RETURN_IF_ERROR(c->WithRank(c->input(1), 0, &unused));
+
+      c->set_output(0, c->Matrix(InferenceContext::kUnknownDim, 2));
+      c->set_output(1, c->Vector(InferenceContext::kUnknownDim));
+      c->set_output(2, c->Vector(2));
+      return Status::OK();
+    });
+
 REGISTER_OP("StringStrip")
     .Input("input: string")
     .Output("output: string")
diff --git a/tensorflow/core/platform/cloud/BUILD b/tensorflow/core/platform/cloud/BUILD
index 0fc1e4ae45c4163604a7cf9d596006c8deeea147..67651349ea8cc01a5b6e5c8142b46002c0a7c8f1 100644
--- a/tensorflow/core/platform/cloud/BUILD
+++ b/tensorflow/core/platform/cloud/BUILD
@@ -174,6 +174,7 @@ cc_library(
         "oauth_client.h",
     ],
     copts = tf_copts(),
+    visibility = ["//tensorflow:__subpackages__"],
     deps = [
         ":curl_http_request",
         ":http_request",
diff --git a/tensorflow/core/platform/cloud/curl_http_request.cc b/tensorflow/core/platform/cloud/curl_http_request.cc
index 081d4cf043aabd1bf6a87ba6b0e1562209cffedb..a1be4aacceead71ab9dff5138468ec577c725251 100644
--- a/tensorflow/core/platform/cloud/curl_http_request.cc
+++ b/tensorflow/core/platform/cloud/curl_http_request.cc
@@ -112,10 +112,6 @@ class LibCurlProxy : public LibCurl {
   }
 
   void curl_free(void* p) override { ::curl_free(p); }
-
-  const char* curl_easy_strerror(CURLcode errornum) override {
-    return ::curl_easy_strerror(errornum);
-  }
 };
 }  // namespace
 
@@ -313,7 +309,7 @@ void CurlHttpRequest::SetResultBufferDirect(char* buffer, size_t size) {
   CHECK(buffer != nullptr);
   CheckNotSent();
 
-  direct_response_ = DirectResponseState{buffer, size, 0};
+  direct_response_ = DirectResponseState{buffer, size, 0, 0};
   CHECK_CURL_OK(libcurl_->curl_easy_setopt(curl_, CURLOPT_WRITEDATA,
                                            reinterpret_cast<void*>(this)));
   CHECK_CURL_OK(libcurl_->curl_easy_setopt(
@@ -335,24 +331,15 @@ size_t CurlHttpRequest::WriteCallbackDirect(const void* ptr, size_t size,
   size_t curl_bytes_received = size * nmemb;
   size_t user_buffer_bytes_available =
       state->buffer_size_ - state->bytes_transferred_;
-
-  // The HTTP server may send a response body that is longer than what we
-  // expected. We must not use CHECK() for this situation, because that would
-  // imply a code bug (in this client code) where none exists; the violation of
-  // expectations would have been caused by the server, not the client. So we
-  // report a log warning, if an HTTP server is misbehaving.
-  if (curl_bytes_received > user_buffer_bytes_available) {
-    LOG(WARNING) << "The HTTP response body that we received is longer than we "
-                    "requested or expected. "
-                 << "Total bytes requested: " << state->buffer_size_
-                 << " Bytes received (so far) in HTTP response body: "
-                 << (state->bytes_transferred_ + curl_bytes_received);
-  }
-
   size_t bytes_to_copy =
       std::min<size_t>(curl_bytes_received, user_buffer_bytes_available);
   memcpy(&state->buffer_[state->bytes_transferred_], ptr, bytes_to_copy);
   state->bytes_transferred_ += bytes_to_copy;
+  state->bytes_received_ += curl_bytes_received;
+  // If we didn't have room to store the full response, returning less than
+  // curl_bytes_received here will abort the transfer and curl_easy_perform()
+  // will return CURLE_WRITE_ERROR. We will detect and handle this error there,
+  // and can use state->bytes_received_ as stored above for logging purposes.
   return bytes_to_copy;
 }
 
@@ -447,23 +434,7 @@ Status CurlHttpRequest::Send() {
   }
 
   const CURLcode curl_result = libcurl_->curl_easy_perform(curl_);
-  TF_CURL_RETURN_WITH_CONTEXT_IF_ERROR(
-      curl_result, "Performing request. Detailed error: ", error_buffer);
-
-  auto get_error_message = [this, curl_result, &error_buffer]() -> string {
-    StringPiece response = GetResponse();
-    string error_message = strings::StrCat(
-        "Error executing an HTTP request (HTTP response code ", response_code_,
-        ", error code ", curl_result, ", error message '", error_buffer, "')");
-    if (!response.empty()) {
-      return strings::StrCat(
-          error_message, ", response '",
-          response.substr(0,
-                          std::min(response.size(), response_to_error_limit_)),
-          "'");
-    }
-    return error_message;
-  };
+  TF_RETURN_IF_ERROR(CURLcodeToStatus(curl_result, error_buffer));
 
   double written_size = 0;
   CHECK_CURL_OK(libcurl_->curl_easy_getinfo(curl_, CURLINFO_SIZE_DOWNLOAD,
@@ -472,6 +443,18 @@ Status CurlHttpRequest::Send() {
   CHECK_CURL_OK(libcurl_->curl_easy_getinfo(curl_, CURLINFO_RESPONSE_CODE,
                                             &response_code_));
 
+  auto get_error_message = [this]() -> string {
+    string error_message = strings::StrCat(
+        "Error executing an HTTP request: HTTP response code ", response_code_);
+    StringPiece body = GetResponse();
+    if (!body.empty()) {
+      return strings::StrCat(
+          error_message, " with body '",
+          body.substr(0, std::min(body.size(), response_to_error_limit_)), "'");
+    }
+    return error_message;
+  };
+
   Status result;
   switch (response_code_) {
     // The group of response codes indicating that the request achieved
@@ -485,9 +468,12 @@ Status CurlHttpRequest::Send() {
 
     case 416:  // Requested Range Not Satisfiable
       // The requested range had no overlap with the available range.
-      // This doesn't indicate an error, but this does mean an empty response
-      // body.
+      // This doesn't indicate an error, but we should produce an empty response
+      // body. (Not all servers do; GCS returns a short error message body.)
       response_buffer_->clear();
+      if (IsDirectResponse()) {
+        direct_response_.bytes_transferred_ = 0;
+      }
       result = Status::OK();
       break;
 
@@ -613,14 +599,13 @@ int CurlHttpRequest::ProgressCallback(void* this_object, curl_off_t dltotal,
                << " bytes for " << now - that->last_progress_timestamp_
                << " seconds and will be aborted. CURL timing information: "
                << "lookup time: " << lookup_time << " ("
-               << that->libcurl_->curl_easy_strerror(lookup_time_status)
+               << curl_easy_strerror(lookup_time_status)
                << "), connect time: " << connect_time << " ("
-               << that->libcurl_->curl_easy_strerror(connect_time_status)
+               << curl_easy_strerror(connect_time_status)
                << "), pre-transfer time: " << pretransfer_time << " ("
-               << that->libcurl_->curl_easy_strerror(pretransfer_time_status)
+               << curl_easy_strerror(pretransfer_time_status)
                << "), start-transfer time: " << starttransfer_time << " ("
-               << that->libcurl_->curl_easy_strerror(starttransfer_time_status)
-               << ")";
+               << curl_easy_strerror(starttransfer_time_status) << ")";
     return 1;  // Will abort the request.
   }
 
@@ -628,12 +613,36 @@ int CurlHttpRequest::ProgressCallback(void* this_object, curl_off_t dltotal,
   return 0;
 }
 
-Status CURLcodeToStatus(CURLcode code) {
-  // Return Unavailable to retry by default. We probably should distinguish
-  // between permanent or temporary failures.
-  return errors::Unavailable("Error executing an HTTP request (error code ",
-                             code, ", error message '",
-                             curl_easy_strerror(code), "')");
+Status CurlHttpRequest::CURLcodeToStatus(CURLcode code,
+                                         const char* error_buffer) {
+  if (code == CURLE_OK) {
+    return Status::OK();
+  }
+  string error_message = strings::StrCat(
+      "Error executing an HTTP request: libcurl code ", code, " meaning '",
+      curl_easy_strerror(code), "', error details: ");
+  // Special-case response-too-large errors as FAILED_PRECONDITION.
+  if (code == CURLE_WRITE_ERROR && IsDirectResponse() &&
+      direct_response_.bytes_received_ > direct_response_.buffer_size_) {
+    string overflow_message = strings::StrCat(
+        "Received ", direct_response_.bytes_received_, " response bytes ",
+        "for a ", direct_response_.buffer_size_, "-byte buffer");
+    uint64 response_code = 0;
+    const CURLcode get_response_result = libcurl_->curl_easy_getinfo(
+        curl_, CURLINFO_RESPONSE_CODE, &response_code);
+    // Special-case 416 Range Not Satisfied responses; they sometimes have
+    // a response body (e.g. GCS sends one with an error message) but we
+    // pretend as though they don't, so actually ignore this error.
+    if (get_response_result == CURLE_OK && response_code == 416) {
+      return Status::OK();
+    }
+    return errors::FailedPrecondition(
+        strings::StrCat(error_message, overflow_message));
+  }
+  // Return Unavailable to retry by default. There may be other permanent
+  // failures that should be distinguished.
+  return errors::Unavailable(
+      strings::StrCat(error_message, *error_buffer ? error_buffer : "(none)"));
 }
 
 }  // namespace tensorflow
diff --git a/tensorflow/core/platform/cloud/curl_http_request.h b/tensorflow/core/platform/cloud/curl_http_request.h
index e658948ab9b9dd95eee1889b390a48beb2f74f9e..1b2029926d6d600cee628e0faac442f63d922dd6 100644
--- a/tensorflow/core/platform/cloud/curl_http_request.h
+++ b/tensorflow/core/platform/cloud/curl_http_request.h
@@ -167,6 +167,10 @@ class CurlHttpRequest : public HttpRequest {
   void CheckNotSent() const;
   StringPiece GetResponse() const;
 
+  /// Helper to convert the given CURLcode and error buffer, representing the
+  /// result of performing a transfer, into a Status with an error message.
+  Status CURLcodeToStatus(CURLcode code, const char* error_buffer);
+
   LibCurl* libcurl_;
   Env* env_;
 
@@ -181,6 +185,7 @@ class CurlHttpRequest : public HttpRequest {
     char* buffer_;
     size_t buffer_size_;
     size_t bytes_transferred_;
+    size_t bytes_received_;
   };
   DirectResponseState direct_response_ = {};
 
@@ -261,21 +266,8 @@ class LibCurl {
   virtual void curl_slist_free_all(curl_slist* list) = 0;
   virtual char* curl_easy_escape(CURL* curl, const char* str, int length) = 0;
   virtual void curl_free(void* p) = 0;
-
-  virtual const char* curl_easy_strerror(CURLcode errornum) = 0;
 };
 
-Status CURLcodeToStatus(CURLcode code);
-
-#define TF_CURL_RETURN_WITH_CONTEXT_IF_ERROR(_code, ...)                    \
-  do {                                                                      \
-    if (_code != CURLE_OK) {                                                \
-      ::tensorflow::Status _status = ::tensorflow::CURLcodeToStatus(_code); \
-      ::tensorflow::errors::AppendToMessage(&_status, __VA_ARGS__);         \
-      return _status;                                                       \
-    }                                                                       \
-  } while (0)
-
 }  // namespace tensorflow
 
 #endif  // TENSORFLOW_CORE_PLATFORM_CLOUD_CURL_HTTP_REQUEST_H_
diff --git a/tensorflow/core/platform/cloud/curl_http_request_test.cc b/tensorflow/core/platform/cloud/curl_http_request_test.cc
index 522b717568777b22865d38a214086b12e3525e48..eb9023d70890e4f8fc9b6244f940135ec9eae1f6 100644
--- a/tensorflow/core/platform/cloud/curl_http_request_test.cc
+++ b/tensorflow/core/platform/cloud/curl_http_request_test.cc
@@ -149,8 +149,12 @@ class FakeLibCurl : public LibCurl {
       } while (bytes_read > 0);
     }
     if (write_data_ || write_callback_) {
-      write_callback_(response_content_.c_str(), 1, response_content_.size(),
-                      write_data_);
+      size_t bytes_handled = write_callback_(
+          response_content_.c_str(), 1, response_content_.size(), write_data_);
+      // Mimic real libcurl behavior by checking write callback return value.
+      if (bytes_handled != response_content_.size()) {
+        curl_easy_perform_result_ = CURLE_WRITE_ERROR;
+      }
     }
     for (const auto& header : response_headers_) {
       header_callback_(header.c_str(), 1, header.size(), header_data_);
@@ -219,10 +223,6 @@ class FakeLibCurl : public LibCurl {
   }
   void curl_free(void* p) override { port::Free(p); }
 
-  const char* curl_easy_strerror(CURLcode errornum) override {
-    return "<unimplemented>";
-  }
-
   // Variables defining the behavior of this fake.
   string response_content_;
   uint64 response_code_;
@@ -302,7 +302,7 @@ TEST(CurlHttpRequestTest, GetRequest_Direct) {
   string expected_response = "get response";
   size_t response_bytes_transferred =
       http_request.GetResultBufferDirectBytesTransferred();
-  EXPECT_EQ(response_bytes_transferred, expected_response.size());
+  EXPECT_EQ(expected_response.size(), response_bytes_transferred);
   EXPECT_EQ(
       "get response",
       string(scratch.begin(), scratch.begin() + response_bytes_transferred));
@@ -318,6 +318,48 @@ TEST(CurlHttpRequestTest, GetRequest_Direct) {
   EXPECT_EQ(200, http_request.GetResponseCode());
 }
 
+TEST(CurlHttpRequestTest, GetRequest_Direct_ResponseTooLarge) {
+  FakeLibCurl libcurl("get response", 200);
+  CurlHttpRequest http_request(&libcurl);
+
+  std::vector<char> scratch(5, 0);
+
+  http_request.SetUri("http://www.testuri.com");
+  http_request.SetResultBufferDirect(scratch.data(), scratch.size());
+  const Status& status = http_request.Send();
+  EXPECT_EQ(error::FAILED_PRECONDITION, status.code());
+  EXPECT_EQ(
+      "Error executing an HTTP request: libcurl code 23 meaning "
+      "'Failed writing received data to disk/application', error details: "
+      "Received 12 response bytes for a 5-byte buffer",
+      status.error_message());
+
+  // As long as the request clearly fails, ok to leave truncated response here.
+  EXPECT_EQ(5, http_request.GetResultBufferDirectBytesTransferred());
+  EXPECT_EQ("get r", string(scratch.begin(), scratch.begin() + 5));
+}
+
+TEST(CurlHttpRequestTest, GetRequest_Direct_RangeOutOfBound) {
+  FakeLibCurl libcurl("get response", 416);
+  CurlHttpRequest http_request(&libcurl);
+
+  const string initialScratch = "abcde";
+  std::vector<char> scratch;
+  scratch.insert(scratch.end(), initialScratch.begin(), initialScratch.end());
+
+  http_request.SetUri("http://www.testuri.com");
+  http_request.SetRange(0, 4);
+  http_request.SetResultBufferDirect(scratch.data(), scratch.size());
+  TF_EXPECT_OK(http_request.Send());
+  EXPECT_EQ(416, http_request.GetResponseCode());
+
+  // Some servers (in particular, GCS) return an error message payload with a
+  // 416 Range Not Satisfiable response. We should pretend it's not there when
+  // reporting bytes transferred, but it's ok if it writes to scratch.
+  EXPECT_EQ(0, http_request.GetResultBufferDirectBytesTransferred());
+  EXPECT_EQ("get r", string(scratch.begin(), scratch.end()));
+}
+
 TEST(CurlHttpRequestTest, GetRequest_Empty) {
   FakeLibCurl libcurl("", 200);
   CurlHttpRequest http_request(&libcurl);
@@ -357,28 +399,26 @@ TEST(CurlHttpRequestTest, GetRequest_RangeOutOfBound) {
   http_request.SetResultBuffer(&scratch);
   TF_EXPECT_OK(http_request.Send());
 
+  // Some servers (in particular, GCS) return an error message payload with a
+  // 416 Range Not Satisfiable response. We should pretend it's not there.
   EXPECT_TRUE(scratch.empty());
   EXPECT_EQ(416, http_request.GetResponseCode());
 }
 
 TEST(CurlHttpRequestTest, GetRequest_503) {
   FakeLibCurl libcurl("get response", 503);
-  libcurl.curl_easy_perform_result_ = CURLE_WRITE_ERROR;
   CurlHttpRequest http_request(&libcurl);
 
   std::vector<char> scratch;
   scratch.insert(scratch.end(), kTestContent.begin(), kTestContent.end());
 
   http_request.SetUri("http://www.testuri.com");
-  http_request.AddAuthBearerHeader("fake-bearer");
-  http_request.SetRange(100, 199);
   http_request.SetResultBuffer(&scratch);
   const auto& status = http_request.Send();
   EXPECT_EQ(error::UNAVAILABLE, status.code());
   EXPECT_EQ(
-      "Error executing an HTTP request (error code 23, error message 'Failed "
-      "writing received data to disk/application')\n\tPerforming request. "
-      "Detailed error: ",
+      "Error executing an HTTP request: HTTP response code 503 with body "
+      "'get response'",
       status.error_message());
 }
 
@@ -395,9 +435,8 @@ TEST(CurlHttpRequestTest, GetRequest_HttpCode0) {
   const auto& status = http_request.Send();
   EXPECT_EQ(error::UNAVAILABLE, status.code());
   EXPECT_EQ(
-      "Error executing an HTTP request (error code 28, error message 'Timeout "
-      "was reached')\n\tPerforming request. Detailed error: Operation timed "
-      "out",
+      "Error executing an HTTP request: libcurl code 28 meaning "
+      "'Timeout was reached', error details: Operation timed out",
       status.error_message());
   EXPECT_EQ(0, http_request.GetResponseCode());
 }
@@ -630,9 +669,8 @@ TEST(CurlHttpRequestTest, ProgressIsStuck) {
   auto status = http_request.Send();
   EXPECT_EQ(error::UNAVAILABLE, status.code());
   EXPECT_EQ(
-      "Error executing an HTTP request (error code 42, error message "
-      "'Operation was aborted by an application callback')\n\tPerforming "
-      "request. Detailed error: ",
+      "Error executing an HTTP request: libcurl code 42 meaning 'Operation "
+      "was aborted by an application callback', error details: (none)",
       status.error_message());
 }
 
diff --git a/tensorflow/core/platform/cloud/gcs_file_system.cc b/tensorflow/core/platform/cloud/gcs_file_system.cc
index e91c3b1b0c15eca5ac21fa9c55eff924490fd0a0..22ae6121e0fb9ef0f0294034089708186105f3d8 100644
--- a/tensorflow/core/platform/cloud/gcs_file_system.cc
+++ b/tensorflow/core/platform/cloud/gcs_file_system.cc
@@ -64,6 +64,10 @@ constexpr uint64 HTTP_CODE_RESUME_INCOMPLETE = 308;
 // The environment variable that overrides the size of the readahead buffer.
 // DEPRECATED. Use GCS_BLOCK_SIZE_MB instead.
 constexpr char kReadaheadBufferSize[] = "GCS_READAHEAD_BUFFER_SIZE_BYTES";
+// The environment variable that disables the GCS block cache for reads.
+// This is the explicit alternative to setting BLOCK_SIZE or MAX_SIZE to 0, and
+// takes precedence over either of those environment variables.
+constexpr char kReadCacheDisabled[] = "GCS_READ_CACHE_DISABLED";
 // The environment variable that overrides the block size for aligned reads from
 // GCS. Specified in MB (e.g. "16" = 16 x 1024 x 1024 = 16777216 bytes).
 constexpr char kBlockSize[] = "GCS_READ_CACHE_BLOCK_SIZE_MB";
@@ -129,9 +133,6 @@ constexpr char kInitialTokens[] = "GCS_INITIAL_TOKENS";
 
 // TODO: DO NOT use a hardcoded path
 Status GetTmpFilename(string* filename) {
-  if (!filename) {
-    return errors::Internal("'filename' cannot be nullptr.");
-  }
 #ifndef _WIN32
   char buffer[] = "/tmp/gcs_filesystem_XXXXXX";
   int fd = mkstemp(buffer);
@@ -158,9 +159,6 @@ Status GetTmpFilename(string* filename) {
 /// object is empty.
 Status ParseGcsPath(StringPiece fname, bool empty_object_ok, string* bucket,
                     string* object) {
-  if (!bucket || !object) {
-    return errors::Internal("bucket and object cannot be null.");
-  }
   StringPiece scheme, bucketp, objectp;
   io::ParseURI(fname, &scheme, &bucketp, &objectp);
   if (scheme != "gs") {
@@ -290,51 +288,24 @@ Status GetBoolValue(const Json::Value& parent, const char* name, bool* result) {
 /// A GCS-based implementation of a random access file with an LRU block cache.
 class GcsRandomAccessFile : public RandomAccessFile {
  public:
-  using SignatureGenFun =
-      std::function<Status(const string& filename, int64* file_signature)>;
+  using ReadFn =
+      std::function<Status(const string& filename, uint64 offset, size_t n,
+                           StringPiece* result, char* scratch)>;
 
-  GcsRandomAccessFile(const string& filename, FileBlockCache* file_block_cache,
-                      const SignatureGenFun& signature_gen_fun)
-      : filename_(filename),
-        file_block_cache_(file_block_cache),
-        signature_gen_fun_(signature_gen_fun) {}
+  GcsRandomAccessFile(const string& filename, ReadFn read_fn)
+      : filename_(filename), read_fn_(std::move(read_fn)) {}
 
   /// The implementation of reads with an LRU block cache. Thread safe.
   Status Read(uint64 offset, size_t n, StringPiece* result,
               char* scratch) const override {
-    if (file_block_cache_->IsCacheEnabled()) {
-      int64 signature;
-      TF_RETURN_IF_ERROR(signature_gen_fun_(filename_, &signature));
-      if (!file_block_cache_->ValidateAndUpdateFileSignature(filename_,
-                                                             signature)) {
-        VLOG(1) << "File " << filename_
-                << " signature has been changed. Refreshing the cache.";
-      }
-    }
-
-    *result = StringPiece();
-    size_t bytes_transferred;
-    TF_RETURN_IF_ERROR(file_block_cache_->Read(filename_, offset, n, scratch,
-                                               &bytes_transferred));
-    *result = StringPiece(scratch, bytes_transferred);
-
-    if (bytes_transferred < n) {
-      // This is not an error per se. The RandomAccessFile interface expects
-      // that Read returns OutOfRange if fewer bytes were read than requested.
-      return errors::OutOfRange("EOF reached, ", result->size(),
-                                " bytes were read out of ", n,
-                                " bytes requested.");
-    }
-    return Status::OK();
+    return read_fn_(filename_, offset, n, result, scratch);
   }
 
  private:
   /// The filename of this file.
   const string filename_;
-  /// The LRU block cache for this file.
-  mutable FileBlockCache* file_block_cache_;  // not owned
-
-  const SignatureGenFun signature_gen_fun_;
+  /// The implementation of the read operation (provided by the GCSFileSystem).
+  const ReadFn read_fn_;
 };
 
 /// \brief GCS-based implementation of a writeable file.
@@ -475,9 +446,6 @@ class GcsWritableFile : public WritableFile {
   }
 
   Status GetCurrentFileSize(uint64* size) {
-    if (size == nullptr) {
-      return errors::Internal("'size' cannot be nullptr");
-    }
     const auto tellp = outfile_.tellp();
     if (tellp == static_cast<std::streampos>(-1)) {
       return errors::Internal(
@@ -489,9 +457,6 @@ class GcsWritableFile : public WritableFile {
 
   /// Initiates a new resumable upload session.
   Status CreateNewUploadSession(string* session_uri) {
-    if (session_uri == nullptr) {
-      return errors::Internal("'session_uri' cannot be nullptr.");
-    }
     uint64 file_size;
     TF_RETURN_IF_ERROR(GetCurrentFileSize(&file_size));
 
@@ -525,9 +490,6 @@ class GcsWritableFile : public WritableFile {
   /// uploaded size in bytes.
   Status RequestUploadSessionStatus(const string& session_uri, bool* completed,
                                     uint64* uploaded) {
-    if (completed == nullptr || uploaded == nullptr) {
-      return errors::Internal("'completed' and 'uploaded' cannot be nullptr.");
-    }
     uint64 file_size;
     TF_RETURN_IF_ERROR(GetCurrentFileSize(&file_size));
 
@@ -665,6 +627,10 @@ GcsFileSystem::GcsFileSystem()
   if (GetEnvVar(kMaxStaleness, strings::safe_strtou64, &value)) {
     max_staleness = value;
   }
+  if (std::getenv(kReadCacheDisabled)) {
+    // Setting either to 0 disables the cache; set both for good measure.
+    block_size = max_bytes = 0;
+  }
   file_block_cache_ = MakeFileBlockCache(block_size, max_bytes, max_staleness);
   // Apply overrides for the stat cache max age and max entries, if provided.
   uint64 stat_cache_max_age = kStatCacheDefaultMaxAge;
@@ -797,21 +763,50 @@ Status GcsFileSystem::NewRandomAccessFile(
     const string& fname, std::unique_ptr<RandomAccessFile>* result) {
   string bucket, object;
   TF_RETURN_IF_ERROR(ParseGcsPath(fname, false, &bucket, &object));
-  result->reset(new GcsRandomAccessFile(
-      fname, file_block_cache_.get(),
-      [this, bucket, object](const string& fname, int64* signature) {
-        GcsFileStat stat;
-        TF_RETURN_IF_ERROR(stat_cache_->LookupOrCompute(
-            fname, &stat,
-            [this, bucket, object](const string& fname, GcsFileStat* stat) {
-              return UncachedStatForObject(fname, bucket, object, stat);
-            }));
-        *signature = stat.generation_number;
-        return Status::OK();
-      }));
+  result->reset(new GcsRandomAccessFile(fname, [this, bucket, object](
+                                                   const string& fname,
+                                                   uint64 offset, size_t n,
+                                                   StringPiece* result,
+                                                   char* scratch) {
+    tf_shared_lock l(block_cache_lock_);
+    if (file_block_cache_->IsCacheEnabled()) {
+      GcsFileStat stat;
+      TF_RETURN_IF_ERROR(stat_cache_->LookupOrCompute(
+          fname, &stat,
+          [this, bucket, object](const string& fname, GcsFileStat* stat) {
+            return UncachedStatForObject(fname, bucket, object, stat);
+          }));
+      if (!file_block_cache_->ValidateAndUpdateFileSignature(
+              fname, stat.generation_number)) {
+        VLOG(1)
+            << "File signature has been changed. Refreshing the cache. Path: "
+            << fname;
+      }
+    }
+    *result = StringPiece();
+    size_t bytes_transferred;
+    TF_RETURN_IF_ERROR(
+        file_block_cache_->Read(fname, offset, n, scratch, &bytes_transferred));
+    *result = StringPiece(scratch, bytes_transferred);
+    if (bytes_transferred < n) {
+      return errors::OutOfRange("EOF reached, ", result->size(),
+                                " bytes were read out of ", n,
+                                " bytes requested.");
+    }
+    return Status::OK();
+  }));
   return Status::OK();
 }
 
+void GcsFileSystem::ResetFileBlockCache(size_t block_size_bytes,
+                                        size_t max_bytes,
+                                        uint64 max_staleness_secs) {
+  mutex_lock l(block_cache_lock_);
+  file_block_cache_ =
+      MakeFileBlockCache(block_size_bytes, max_bytes, max_staleness_secs);
+  stats_->Configure(this, &throttle_, file_block_cache_.get());
+}
+
 // A helper function to build a FileBlockCache for GcsFileSystem.
 std::unique_ptr<FileBlockCache> GcsFileSystem::MakeFileBlockCache(
     size_t block_size, size_t max_bytes, uint64 max_staleness) {
@@ -880,6 +875,7 @@ Status GcsFileSystem::LoadBufferFromGCS(const string& filename, size_t offset,
 }
 
 void GcsFileSystem::ClearFileCaches(const string& fname) {
+  tf_shared_lock l(block_cache_lock_);
   file_block_cache_->RemoveFile(fname);
   stat_cache_->Delete(fname);
   // TODO(rxsang): Remove the patterns that matche the file in
@@ -962,11 +958,16 @@ Status GcsFileSystem::FileExists(const string& fname) {
       return Status::OK();
     }
   }
-  bool result;
-  TF_RETURN_IF_ERROR(ObjectExists(fname, bucket, object, &result));
-  if (result) {
-    return Status::OK();
+
+  // Check if the object exists.
+  GcsFileStat stat;
+  const Status status = StatForObject(fname, bucket, object, &stat);
+  if (status.code() != errors::Code::NOT_FOUND) {
+    return status;
   }
+
+  // Check if the folder exists.
+  bool result;
   TF_RETURN_IF_ERROR(FolderExists(fname, &result));
   if (result) {
     return Status::OK();
@@ -976,14 +977,11 @@ Status GcsFileSystem::FileExists(const string& fname) {
 
 Status GcsFileSystem::ObjectExists(const string& fname, const string& bucket,
                                    const string& object, bool* result) {
-  if (!result) {
-    return errors::Internal("'result' cannot be nullptr.");
-  }
-  GcsFileStat not_used_stat;
-  const Status status = StatForObject(fname, bucket, object, &not_used_stat);
+  GcsFileStat stat;
+  const Status status = StatForObject(fname, bucket, object, &stat);
   switch (status.code()) {
     case errors::Code::OK:
-      *result = true;
+      *result = !stat.base.is_directory;
       return Status::OK();
     case errors::Code::NOT_FOUND:
       *result = false;
@@ -1037,15 +1035,19 @@ Status GcsFileSystem::UncachedStatForObject(const string& fname,
           << "; mtime_nsec: " << stat->base.mtime_nsec
           << "; updated: " << updated;
 
-  stat->base.is_directory = false;
+  if (str_util::EndsWith(fname, "/")) {
+    // In GCS a path can be both a directory and a file, both it is uncommon for
+    // other file systems. To avoid the ambiguity, if a path ends with "/" in
+    // GCS, we always regard it as a directory mark or a virtual directory.
+    stat->base.is_directory = true;
+  } else {
+    stat->base.is_directory = false;
+  }
   return Status::OK();
 }
 
 Status GcsFileSystem::StatForObject(const string& fname, const string& bucket,
                                     const string& object, GcsFileStat* stat) {
-  if (!stat) {
-    return errors::Internal("'stat' cannot be nullptr.");
-  }
   if (object.empty()) {
     return errors::InvalidArgument(strings::Printf(
         "'object' must be a non-empty string. (File: %s)", fname.c_str()));
@@ -1056,18 +1058,10 @@ Status GcsFileSystem::StatForObject(const string& fname, const string& bucket,
       [this, &bucket, &object](const string& fname, GcsFileStat* stat) {
         return UncachedStatForObject(fname, bucket, object, stat);
       }));
-  if (stat->base.is_directory) {
-    return errors::NotFound(fname, " is a directory.");
-  } else {
-    return Status::OK();
-  }
+  return Status::OK();
 }
 
 Status GcsFileSystem::BucketExists(const string& bucket, bool* result) {
-  if (!result) {
-    return errors::Internal("'result' cannot be nullptr.");
-  }
-
   std::unique_ptr<HttpRequest> request;
   TF_RETURN_IF_ERROR(CreateHttpRequest(&request));
   request->SetUri(strings::StrCat(kGcsUriBase, "b/", bucket));
@@ -1086,9 +1080,6 @@ Status GcsFileSystem::BucketExists(const string& bucket, bool* result) {
 }
 
 Status GcsFileSystem::FolderExists(const string& dirname, bool* result) {
-  if (!result) {
-    return errors::Internal("'result' cannot be nullptr.");
-  }
   StatCache::ComputeFunc compute_func = [this](const string& dirname,
                                                GcsFileStat* stat) {
     std::vector<string> children;
@@ -1103,7 +1094,8 @@ Status GcsFileSystem::FolderExists(const string& dirname, bool* result) {
     }
   };
   GcsFileStat stat;
-  Status s = stat_cache_->LookupOrCompute(dirname, &stat, compute_func);
+  Status s = stat_cache_->LookupOrCompute(MaybeAppendSlash(dirname), &stat,
+                                          compute_func);
   if (s.ok()) {
     *result = stat.base.is_directory;
     return Status::OK();
@@ -1508,6 +1500,7 @@ Status GcsFileSystem::DeleteRecursively(const string& dirname,
 // reclaiming memory once filesystem operations are done (e.g. model is loaded),
 // or for resetting the filesystem to a consistent state.
 void GcsFileSystem::FlushCaches() {
+  tf_shared_lock l(block_cache_lock_);
   file_block_cache_->Flush();
   stat_cache_->Clear();
   matching_paths_cache_->Clear();
@@ -1516,8 +1509,15 @@ void GcsFileSystem::FlushCaches() {
 void GcsFileSystem::SetStats(GcsStatsInterface* stats) {
   CHECK(stats_ == nullptr) << "SetStats() has already been called.";
   CHECK(stats != nullptr);
+  mutex_lock l(block_cache_lock_);
   stats_ = stats;
-  stats_->Init(this, &throttle_, file_block_cache_.get());
+  stats_->Configure(this, &throttle_, file_block_cache_.get());
+}
+
+void GcsFileSystem::SetAuthProvider(
+    std::unique_ptr<AuthProvider> auth_provider) {
+  mutex_lock l(mu_);
+  auth_provider_ = std::move(auth_provider);
 }
 
 // Creates an HttpRequest and sets several parameters that are common to all
@@ -1530,7 +1530,11 @@ Status GcsFileSystem::CreateHttpRequest(std::unique_ptr<HttpRequest>* request) {
   }
 
   string auth_token;
-  TF_RETURN_IF_ERROR(AuthProvider::GetToken(auth_provider_.get(), &auth_token));
+  {
+    tf_shared_lock l(mu_);
+    TF_RETURN_IF_ERROR(
+        AuthProvider::GetToken(auth_provider_.get(), &auth_token));
+  }
 
   new_request->AddAuthBearerHeader(auth_token);
 
diff --git a/tensorflow/core/platform/cloud/gcs_file_system.h b/tensorflow/core/platform/cloud/gcs_file_system.h
index d543db15775121bb59577ebb03d13f7ba9514e0f..74768c98b563bda5caf05ae0e7019a1076637a6a 100644
--- a/tensorflow/core/platform/cloud/gcs_file_system.h
+++ b/tensorflow/core/platform/cloud/gcs_file_system.h
@@ -43,9 +43,12 @@ class GcsFileSystem;
 /// time.
 class GcsStatsInterface {
  public:
-  /// Init is called by the GcsFileSystem immediately after being registered.
-  virtual void Init(GcsFileSystem* fs, GcsThrottle* throttle,
-                    const FileBlockCache* block_cache) = 0;
+  /// Configure is called by the GcsFileSystem to provide instrumentation hooks.
+  ///
+  /// Note: Configure can be called multiple times (e.g. if the block cache is
+  /// re-initialized).
+  virtual void Configure(GcsFileSystem* fs, GcsThrottle* throttle,
+                         const FileBlockCache* block_cache) = 0;
 
   /// RecordBlockLoadRequest is called to record a block load request is about
   /// to be made.
@@ -132,9 +135,18 @@ class GcsFileSystem : public FileSystem {
 
   /// These accessors are mainly for testing purposes, to verify that the
   /// environment variables that control these parameters are handled correctly.
-  size_t block_size() const { return file_block_cache_->block_size(); }
-  size_t max_bytes() const { return file_block_cache_->max_bytes(); }
-  uint64 max_staleness() const { return file_block_cache_->max_staleness(); }
+  size_t block_size() {
+    tf_shared_lock l(block_cache_lock_);
+    return file_block_cache_->block_size();
+  }
+  size_t max_bytes() {
+    tf_shared_lock l(block_cache_lock_);
+    return file_block_cache_->max_bytes();
+  }
+  uint64 max_staleness() {
+    tf_shared_lock l(block_cache_lock_);
+    return file_block_cache_->max_staleness();
+  }
   TimeoutConfig timeouts() const { return timeouts_; }
   string additional_header_name() const {
     return additional_header_ ? additional_header_->first : "";
@@ -190,6 +202,21 @@ class GcsFileSystem : public FileSystem {
 
   Status CreateHttpRequest(std::unique_ptr<HttpRequest>* request);
 
+  /// \brief Sets a new AuthProvider on the GCS FileSystem.
+  ///
+  /// The new auth provider will be used for all subsequent requests.
+  void SetAuthProvider(std::unique_ptr<AuthProvider> auth_provider);
+
+  /// \brief Resets the block cache and re-instantiates it with the new values.
+  ///
+  /// This method can be used to clear the existing block cache and/or to
+  /// re-configure the block cache for different values.
+  ///
+  /// Note: the existing block cache is not cleaned up until all existing files
+  /// have been closed.
+  void ResetFileBlockCache(size_t block_size_bytes, size_t max_bytes,
+                           uint64 max_staleness_secs);
+
  private:
   // GCS file statistics.
   struct GcsFileStat {
@@ -246,9 +273,14 @@ class GcsFileSystem : public FileSystem {
   // Clear all the caches related to the file with name `filename`.
   void ClearFileCaches(const string& fname);
 
-  std::unique_ptr<AuthProvider> auth_provider_;
+  mutex mu_;
+  std::unique_ptr<AuthProvider> auth_provider_ GUARDED_BY(mu_);
   std::unique_ptr<HttpRequest::Factory> http_request_factory_;
-  std::unique_ptr<FileBlockCache> file_block_cache_;
+  // block_cache_lock_ protects the file_block_cache_ pointer (Note that
+  // FileBlockCache instances are themselves threadsafe).
+  mutex block_cache_lock_;
+  std::unique_ptr<FileBlockCache> file_block_cache_
+      GUARDED_BY(block_cache_lock_);
   std::unique_ptr<GcsDnsCache> dns_cache_;
   GcsThrottle throttle_;
 
diff --git a/tensorflow/core/platform/cloud/gcs_file_system_test.cc b/tensorflow/core/platform/cloud/gcs_file_system_test.cc
index bb4ace65a9ed64ce4d8415e402204fc5f5731b61..e791ae5a199eb8e7960c8857c187511baa42e8f9 100644
--- a/tensorflow/core/platform/cloud/gcs_file_system_test.cc
+++ b/tensorflow/core/platform/cloud/gcs_file_system_test.cc
@@ -1107,7 +1107,7 @@ TEST(GcsFileSystemTest, FileExists_StatCache) {
                            "\"updated\": \"2016-04-29T23:15:24.896Z\"}")),
        new FakeHttpRequest(
            "Uri: https://www.googleapis.com/storage/v1/b/bucket/o/"
-           "path%2Fsubfolder?fields=size%2Cgeneration%2Cupdated\n"
+           "path%2Fsubfolder%2F?fields=size%2Cgeneration%2Cupdated\n"
            "Auth Token: fake_token\n"
            "Timeouts: 5 1 10\n",
            "", errors::NotFound("404"), 404),
@@ -1133,10 +1133,32 @@ TEST(GcsFileSystemTest, FileExists_StatCache) {
   // HTTP requests.
   for (int i = 0; i < 10; i++) {
     TF_EXPECT_OK(fs.FileExists("gs://bucket/path/file1.txt"));
-    TF_EXPECT_OK(fs.FileExists("gs://bucket/path/subfolder"));
+    TF_EXPECT_OK(fs.FileExists("gs://bucket/path/subfolder/"));
   }
 }
 
+TEST(GcsFileSystemTest, FileExists_DirectoryMark) {
+  std::vector<HttpRequest*> requests({new FakeHttpRequest(
+      "Uri: https://www.googleapis.com/storage/v1/b/bucket/o/"
+      "dir%2F?fields=size%2Cgeneration%2Cupdated\n"
+      "Auth Token: fake_token\n"
+      "Timeouts: 5 1 10\n",
+      strings::StrCat("{\"size\": \"5\",\"generation\": \"1\","
+                      "\"updated\": \"2016-04-29T23:15:24.896Z\"}"))});
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
+      3600 /* stat cache max age */, 0 /* stat cache max entries */,
+      0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay */,
+      kTestTimeoutConfig, nullptr /* gcs additional header */);
+
+  TF_EXPECT_OK(fs.FileExists("gs://bucket/dir/"));
+  TF_EXPECT_OK(fs.IsDirectory("gs://bucket/dir/"));
+}
+
 TEST(GcsFileSystemTest, GetChildren_NoItems) {
   std::vector<HttpRequest*> requests({new FakeHttpRequest(
       "Uri: https://www.googleapis.com/storage/v1/b/bucket/o?"
@@ -1932,6 +1954,14 @@ TEST(GcsFileSystemTest, RenameFile_Object) {
            "Range: 0-15\n"
            "Timeouts: 5 1 20\n",
            "76543210"),
+       // IsDirectory is checking whether there are children objects.
+       new FakeHttpRequest(
+           "Uri: https://www.googleapis.com/storage/v1/b/bucket/o?"
+           "fields=items%2Fname%2CnextPageToken&prefix=path%2Fsrc.txt%2F"
+           "&maxResults=1\n"
+           "Auth Token: fake_token\n"
+           "Timeouts: 5 1 10\n",
+           "{}"),
        // Copying to the new location.
        new FakeHttpRequest(
            "Uri: https://www.googleapis.com/storage/v1/b/bucket/o/"
@@ -2318,7 +2348,7 @@ TEST(GcsFileSystemTest, Stat_Cache) {
                            "\"updated\": \"2016-04-29T23:15:24.896Z\"}")),
        new FakeHttpRequest(
            "Uri: https://www.googleapis.com/storage/v1/b/bucket/o/"
-           "subfolder?fields=size%2Cgeneration%2Cupdated\n"
+           "subfolder%2F?fields=size%2Cgeneration%2Cupdated\n"
            "Auth Token: fake_token\n"
            "Timeouts: 5 1 10\n",
            "", errors::NotFound("404"), 404),
@@ -2348,7 +2378,7 @@ TEST(GcsFileSystemTest, Stat_Cache) {
     EXPECT_EQ(1010, stat.length);
     EXPECT_NEAR(1461971724896, stat.mtime_nsec / 1000 / 1000, 1);
     EXPECT_FALSE(stat.is_directory);
-    TF_EXPECT_OK(fs.Stat("gs://bucket/subfolder", &stat));
+    TF_EXPECT_OK(fs.Stat("gs://bucket/subfolder/", &stat));
     EXPECT_EQ(0, stat.length);
     EXPECT_EQ(0, stat.mtime_nsec);
     EXPECT_TRUE(stat.is_directory);
@@ -2399,6 +2429,30 @@ TEST(GcsFileSystemTest, Stat_Cache_Flush) {
   }
 }
 
+TEST(GcsFileSystemTest, Stat_FilenameEndingWithSlash) {
+  std::vector<HttpRequest*> requests({new FakeHttpRequest(
+      "Uri: https://www.googleapis.com/storage/v1/b/bucket/o/"
+      "dir%2F?fields=size%2Cgeneration%2Cupdated\n"
+      "Auth Token: fake_token\n"
+      "Timeouts: 5 1 10\n",
+      strings::StrCat("{\"size\": \"5\",\"generation\": \"1\","
+                      "\"updated\": \"2016-04-29T23:15:24.896Z\"}"))});
+  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+                   std::unique_ptr<HttpRequest::Factory>(
+                       new FakeHttpRequestFactory(&requests)),
+                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
+                   0 /* stat cache max age */, 0 /* stat cache max entries */,
+                   0 /* matching paths cache max age */,
+                   0 /* matching paths cache max entries */,
+                   0 /* initial retry delay*/, kTestTimeoutConfig,
+                   nullptr /* gcs additional header */);
+
+  FileStatistics stat;
+  TF_EXPECT_OK(fs.Stat("gs://bucket/dir/", &stat));
+  EXPECT_EQ(5, stat.length);
+  EXPECT_TRUE(stat.is_directory);
+}
+
 TEST(GcsFileSystemTest, IsDirectory_NotFound) {
   std::vector<HttpRequest*> requests(
       {new FakeHttpRequest(
@@ -2938,8 +2992,8 @@ TEST(GcsFileSystemTest, CreateHttpRequest) {
 
 class TestGcsStats : public GcsStatsInterface {
  public:
-  void Init(GcsFileSystem* fs, GcsThrottle* throttle,
-            const FileBlockCache* block_cache) override {
+  void Configure(GcsFileSystem* fs, GcsThrottle* throttle,
+                 const FileBlockCache* block_cache) override {
     CHECK(fs_ == nullptr);
     CHECK(throttle_ == nullptr);
     CHECK(block_cache_ == nullptr);
diff --git a/tensorflow/core/platform/cloud/gcs_throttle.cc b/tensorflow/core/platform/cloud/gcs_throttle.cc
index 27dd06a6250ad457d0ec142c07d29a2358dddaee..940d98fd0908602ca4fd87508a65acf760fcbdb9 100644
--- a/tensorflow/core/platform/cloud/gcs_throttle.cc
+++ b/tensorflow/core/platform/cloud/gcs_throttle.cc
@@ -51,7 +51,7 @@ void GcsThrottle::UpdateState() {
   // TODO(b/72643279): Switch to a monotonic clock.
   int64 now = env_time_->NowSeconds();
   uint64 delta_secs =
-      std::max(0LL, now - static_cast<int64>(last_updated_secs_));
+      std::max(int64{0}, now - static_cast<int64>(last_updated_secs_));
   available_tokens_ += delta_secs * config_.token_rate;
   available_tokens_ = std::min(available_tokens_, config_.bucket_size);
   last_updated_secs_ = now;
diff --git a/tensorflow/core/platform/cloud/ram_file_block_cache.h b/tensorflow/core/platform/cloud/ram_file_block_cache.h
index 2303f9caaa227fd92527d9380b394813d9597971..46fb9a35b88f04940c146b70df8197faaa075a59 100644
--- a/tensorflow/core/platform/cloud/ram_file_block_cache.h
+++ b/tensorflow/core/platform/cloud/ram_file_block_cache.h
@@ -60,6 +60,8 @@ class RamFileBlockCache : public FileBlockCache {
       pruning_thread_.reset(env_->StartThread(ThreadOptions(), "TF_prune_FBC",
                                               [this] { Prune(); }));
     }
+    VLOG(1) << "GCS file block cache is "
+            << (IsCacheEnabled() ? "enabled" : "disabled");
   }
 
   ~RamFileBlockCache() override {
diff --git a/tensorflow/core/platform/default/build_config.bzl b/tensorflow/core/platform/default/build_config.bzl
index b4b756b86620e1f063c25de08c6f79bd028de096..7d558fe880adfac2f6a28743f077958d7fce33c1 100644
--- a/tensorflow/core/platform/default/build_config.bzl
+++ b/tensorflow/core/platform/default/build_config.bzl
@@ -71,6 +71,8 @@ def pyx_library(
         name = filename + "_cython_translation",
         srcs = [filename],
         outs = [filename.split(".")[0] + ".cpp"],
+        # Optionally use PYTHON_BIN_PATH on Linux platforms so that python 3
+        # works. Windows has issues with cython_binary so skip PYTHON_BIN_PATH.
         cmd = "PYTHONHASHSEED=0 $(location @cython//:cython_binary) --cplus $(SRCS) --output-file $(OUTS)",
         tools = ["@cython//:cython_binary"] + pxd_srcs,
     )
@@ -82,7 +84,7 @@ def pyx_library(
     native.cc_binary(
         name=shared_object_name,
         srcs=[stem + ".cpp"],
-        deps=deps + ["//util/python:python_headers"],
+        deps=deps + ["//third_party/python_runtime:headers"],
         linkshared = 1,
     )
     shared_objects.append(shared_object_name)
@@ -304,6 +306,7 @@ def tf_proto_library_cc(name, srcs = [], has_services = None,
                         cc_grpc_version = None,
                         j2objc_api_version = 1,
                         cc_api_version = 2,
+                        dart_api_version = 2,
                         java_api_version = 2, py_api_version = 2,
                         js_api_version = 2, js_codegen = "jspb",
                         default_header = False):
@@ -409,7 +412,7 @@ def tf_proto_library(name, srcs = [], has_services = None,
                      visibility = [], testonly = 0,
                      cc_libs = [],
                      cc_api_version = 2, cc_grpc_version = None,
-                     j2objc_api_version = 1,
+                     dart_api_version = 2, j2objc_api_version = 1,
                      java_api_version = 2, py_api_version = 2,
                      js_api_version = 2, js_codegen = "jspb",
                      provide_cc_alias = False,
@@ -495,14 +498,6 @@ def tf_additional_lib_srcs(exclude = []):
       ], exclude = exclude),
   })
 
-# pylint: disable=unused-argument
-def tf_additional_framework_hdrs(exclude = []):
-  return []
-
-def tf_additional_framework_srcs(exclude = []):
-  return []
-# pylint: enable=unused-argument
-
 def tf_additional_minimal_lib_srcs():
   return [
       "platform/default/integral_types.h",
@@ -523,6 +518,9 @@ def tf_additional_proto_srcs():
       "platform/default/protobuf.cc",
   ]
 
+def tf_additional_human_readable_json_deps():
+  return []
+
 def tf_additional_all_protos():
   return ["//tensorflow/core:protos_all"]
 
diff --git a/tensorflow/core/platform/default/human_readable_json.cc b/tensorflow/core/platform/default/human_readable_json.cc
new file mode 100644
index 0000000000000000000000000000000000000000..6bf2106f6e5d38f61e0291817f5106437c541c19
--- /dev/null
+++ b/tensorflow/core/platform/default/human_readable_json.cc
@@ -0,0 +1,54 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/platform/human_readable_json.h"
+#include "tensorflow/core/lib/core/errors.h"
+#include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/lib/strings/strcat.h"
+
+namespace tensorflow {
+
+Status ProtoToHumanReadableJson(const ::google::protobuf::Message& proto,
+                                string* result) {
+  result->clear();
+
+  auto status = google::protobuf::util::MessageToJsonString(proto, result);
+  if (!status.ok()) {
+    // Convert error_msg google::protobuf::StringPiece to
+    // tensorflow::StringPiece.
+    auto error_msg = status.error_message();
+    return errors::Internal(
+        strings::StrCat("Could not convert proto to JSON string: ",
+                        StringPiece(error_msg.data(), error_msg.length())));
+  }
+  return Status::OK();
+}
+
+Status HumanReadableJsonToProto(const string& str,
+                                ::google::protobuf::Message* proto) {
+  proto->Clear();
+  auto status = google::protobuf::util::JsonStringToMessage(str, proto);
+  if (!status.ok()) {
+    // Convert error_msg google::protobuf::StringPiece to
+    // tensorflow::StringPiece.
+    auto error_msg = status.error_message();
+    return errors::Internal(
+        strings::StrCat("Could not convert JSON string to proto: ",
+                        StringPiece(error_msg.data(), error_msg.length())));
+  }
+  return Status::OK();
+}
+
+}  // namespace tensorflow
diff --git a/tensorflow/core/platform/default/string_coding.cc b/tensorflow/core/platform/default/string_coding.cc
new file mode 100644
index 0000000000000000000000000000000000000000..7410ee67820a384e4843a57386b110e40a7e0680
--- /dev/null
+++ b/tensorflow/core/platform/default/string_coding.cc
@@ -0,0 +1,30 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/platform/default/string_coding.h"
+
+namespace tensorflow {
+namespace port {
+
+std::unique_ptr<StringListEncoder> NewStringListEncoder(string* out) {
+  return std::unique_ptr<StringListEncoder>(new StringListEncoder(out));
+}
+
+std::unique_ptr<StringListDecoder> NewStringListDecoder(const string& in) {
+  return std::unique_ptr<StringListDecoder>(new StringListDecoder(in));
+}
+
+}  // namespace port
+}  // namespace tensorflow
diff --git a/tensorflow/core/platform/default/string_coding.h b/tensorflow/core/platform/default/string_coding.h
new file mode 100644
index 0000000000000000000000000000000000000000..70b8ab01444a6175f9c037e35fadc8196c781b19
--- /dev/null
+++ b/tensorflow/core/platform/default/string_coding.h
@@ -0,0 +1,98 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#ifndef TENSORFLOW_CORE_PLATFORM_DEFAULT_STRING_CODING_H_
+#define TENSORFLOW_CORE_PLATFORM_DEFAULT_STRING_CODING_H_
+
+// IWYU pragma: private, include "third_party/tensorflow/core/platform/tensor_coding.h"
+// IWYU pragma: friend third_party/tensorflow/core/platform/tensor_coding.h
+
+#include "tensorflow/core/lib/core/coding.h"
+#include "tensorflow/core/lib/strings/strcat.h"
+#include "tensorflow/core/platform/protobuf.h"
+#include "tensorflow/core/platform/types.h"
+
+namespace tensorflow {
+namespace port {
+
+// Encodes sequences of strings and serialized protocol buffers into a string.
+// Normal usage consists of zero or more calls to Append() and a single call to
+// Finalize().
+class StringListEncoder {
+ public:
+  explicit StringListEncoder(string* out) : out_(out) {}
+
+  // Encodes the given protocol buffer. This may not be called after Finalize().
+  void Append(const protobuf::MessageLite& m) {
+    core::PutVarint32(out_, m.ByteSize());
+    m.AppendToString(&rest_);
+  }
+
+  // Encodes the given string. This may not be called after Finalize().
+  void Append(const string& s) {
+    core::PutVarint32(out_, s.length());
+    strings::StrAppend(&rest_, s);
+  }
+
+  // Signals end of the encoding process. No other calls are allowed after this.
+  void Finalize() { strings::StrAppend(out_, rest_); }
+
+ private:
+  string* out_;
+  string rest_;
+};
+
+// Decodes a string into sequences of strings (which may represent serialized
+// protocol buffers). Normal usage involves a single call to ReadSizes() in
+// order to retrieve the length of all the strings in the sequence. For each
+// size returned a call to Data() is expected and will return the actual
+// string.
+class StringListDecoder {
+ public:
+  explicit StringListDecoder(const string& in) : reader_(in) {}
+
+  // Populates the given vector with the lengths of each string in the sequence
+  // being decoded. Upon returning the vector is guaranteed to contain as many
+  // elements as there are strings in the sequence.
+  bool ReadSizes(std::vector<uint32>* sizes) {
+    int64 total = 0;
+    for (auto& size : *sizes) {
+      if (!core::GetVarint32(&reader_, &size)) return false;
+      total += size;
+    }
+    if (total != static_cast<int64>(reader_.size())) {
+      return false;
+    }
+    return true;
+  }
+
+  // Returns a pointer to the next string in the sequence, then prepares for the
+  // next call by advancing 'size' characters in the sequence.
+  const char* Data(uint32 size) {
+    const char* data = reader_.data();
+    reader_.remove_prefix(size);
+    return data;
+  }
+
+ private:
+  StringPiece reader_;
+};
+
+std::unique_ptr<StringListEncoder> NewStringListEncoder(string* out);
+std::unique_ptr<StringListDecoder> NewStringListDecoder(const string& in);
+
+}  // namespace port
+}  // namespace tensorflow
+
+#endif  // TENSORFLOW_CORE_PLATFORM_DEFAULT_STRING_CODING_H_
diff --git a/tensorflow/core/platform/env.cc b/tensorflow/core/platform/env.cc
index fe7d0aa7d15919e6ca6c85e30a9ace7ef68d177b..47c59d435b95d65cd7f2cf2efc7fa5b8ef89cd97 100644
--- a/tensorflow/core/platform/env.cc
+++ b/tensorflow/core/platform/env.cc
@@ -26,7 +26,7 @@ limitations under the License.
 #endif
 #if defined(PLATFORM_WINDOWS)
 #include <windows.h>
-#include "tensorflow/core/platform/windows/windows_file_system.h"
+#include "tensorflow/core/platform/windows/wide_char.h"
 #define PATH_MAX MAX_PATH
 #else
 #include <unistd.h>
@@ -311,7 +311,7 @@ string Env::GetExecutablePath() {
   HMODULE hModule = GetModuleHandleW(NULL);
   WCHAR wc_file_path[MAX_PATH] = {0};
   GetModuleFileNameW(hModule, wc_file_path, MAX_PATH);
-  string file_path = WindowsFileSystem::WideCharToUtf8(wc_file_path);
+  string file_path = WideCharToUtf8(wc_file_path);
   std::copy(file_path.begin(), file_path.end(), exe_path);
 #else
   CHECK_NE(-1, readlink("/proc/self/exe", exe_path, sizeof(exe_path) - 1));
diff --git a/tensorflow/core/platform/hadoop/hadoop_file_system.cc b/tensorflow/core/platform/hadoop/hadoop_file_system.cc
index 72c12318cac883987c8013231c5d76d38c5aceaf..ff4b4436bbc1c07343cf317b740d6ed4b0c3a061 100644
--- a/tensorflow/core/platform/hadoop/hadoop_file_system.cc
+++ b/tensorflow/core/platform/hadoop/hadoop_file_system.cc
@@ -115,18 +115,17 @@ class LibHDFS {
     const char* kLibHdfsDso = "libhdfs.so";
 #endif
     char* hdfs_home = getenv("HADOOP_HDFS_HOME");
-    if (hdfs_home == nullptr) {
-      status_ = errors::FailedPrecondition(
-          "Environment variable HADOOP_HDFS_HOME not set");
-      return;
-    }
-    string path = io::JoinPath(hdfs_home, "lib", "native", kLibHdfsDso);
-    status_ = TryLoadAndBind(path.c_str(), &handle_);
-    if (!status_.ok()) {
-      // try load libhdfs.so using dynamic loader's search path in case
-      // libhdfs.so is installed in non-standard location
-      status_ = TryLoadAndBind(kLibHdfsDso, &handle_);
+    if (hdfs_home != nullptr) {
+      string path = io::JoinPath(hdfs_home, "lib", "native", kLibHdfsDso);
+      status_ = TryLoadAndBind(path.c_str(), &handle_);
+      if (status_.ok()) {
+        return;
+      }
     }
+
+    // Try to load the library dynamically in case it has been installed
+    // to a in non-standard location.
+    status_ = TryLoadAndBind(kLibHdfsDso, &handle_);
   }
 
   Status status_;
diff --git a/tensorflow/core/platform/human_readable_json.h b/tensorflow/core/platform/human_readable_json.h
new file mode 100644
index 0000000000000000000000000000000000000000..c759e801e97704641098eb134b6a7deea25fe053
--- /dev/null
+++ b/tensorflow/core/platform/human_readable_json.h
@@ -0,0 +1,37 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CORE_PLATFORM_HUMAN_READABLE_JSON_H_
+#define TENSORFLOW_CORE_PLATFORM_HUMAN_READABLE_JSON_H_
+
+#include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/platform/protobuf.h"
+
+namespace tensorflow {
+
+// Converts a proto to a JSON-like string that's meant to be human-readable
+// but still machine-parseable.
+//
+// This string may not be strictly JSON-compliant, but it must be parseable by
+// HumanReadableJSONToProto.
+Status ProtoToHumanReadableJson(const protobuf::Message& proto, string* result);
+
+// Converts a string produced by ProtoToHumanReadableJSON to a protobuf.  Not
+// guaranteed to work for general JSON.
+Status HumanReadableJsonToProto(const string& str, protobuf::Message* proto);
+
+}  // namespace tensorflow
+
+#endif  // TENSORFLOW_CORE_PLATFORM_HUMAN_READABLE_JSON_H_
diff --git a/tensorflow/core/platform/tensor_coding.cc b/tensorflow/core/platform/tensor_coding.cc
index 17dc81f7e0f49e50178f70df3da31d5745387f3a..84601de39a6547ee78d190764616058b4595dc33 100644
--- a/tensorflow/core/platform/tensor_coding.cc
+++ b/tensorflow/core/platform/tensor_coding.cc
@@ -16,7 +16,7 @@ limitations under the License.
 #include "tensorflow/core/platform/tensor_coding.h"
 
 #include <vector>
-#include "tensorflow/core/framework/resource_handle.pb.h"
+
 #include "tensorflow/core/lib/core/coding.h"
 #include "tensorflow/core/lib/core/stringpiece.h"
 
@@ -66,39 +66,5 @@ void CopyFromArray(string* s, const char* base, size_t bytes) {
   s->assign(base, bytes);
 }
 
-void EncodeResourceHandleList(const ResourceHandle* p, int64 n, string* out) {
-  out->clear();
-  string rest;
-  ResourceHandleProto proto;
-  for (int i = 0; i < n; ++i) {
-    p[i].AsProto(&proto);
-    core::PutVarint32(out, proto.ByteSize());
-    proto.AppendToString(&rest);
-  }
-  *out += rest;
-}
-
-bool DecodeResourceHandleList(const string& in, ResourceHandle* ps, int64 n) {
-  std::vector<uint32> sizes(n);
-  StringPiece reader(in);
-  int64 total = 0;
-  for (auto& size : sizes) {
-    if (!core::GetVarint32(&reader, &size)) return false;
-    total += size;
-  }
-  if (total != static_cast<int64>(reader.size())) {
-    return false;
-  }
-  ResourceHandleProto proto;
-  for (int i = 0; i < n; ++i) {
-    if (!proto.ParseFromArray(reader.data(), sizes[i])) {
-      return false;
-    }
-    ps[i].FromProto(proto);
-    reader.remove_prefix(sizes[i]);
-  }
-  return true;
-}
-
 }  // namespace port
 }  // namespace tensorflow
diff --git a/tensorflow/core/platform/tensor_coding.h b/tensorflow/core/platform/tensor_coding.h
index 19f53e6374f4d0baf68c7c9991661e09fe065ee2..6c6d75830de743b3e24676c1f57b6988aad11a0f 100644
--- a/tensorflow/core/platform/tensor_coding.h
+++ b/tensorflow/core/platform/tensor_coding.h
@@ -18,7 +18,6 @@ limitations under the License.
 #define TENSORFLOW_PLATFORM_TENSOR_CODING_H_
 
 #include <string>
-#include "tensorflow/core/framework/resource_handle.h"
 #include "tensorflow/core/lib/core/refcount.h"
 #include "tensorflow/core/lib/core/stringpiece.h"
 #include "tensorflow/core/platform/platform.h"
@@ -26,6 +25,8 @@ limitations under the License.
 
 #ifdef PLATFORM_GOOGLE
 #include "tensorflow/core/platform/google/cord_coding.h"
+#else
+#include "tensorflow/core/platform/default/string_coding.h"
 #endif
 
 namespace tensorflow {
@@ -51,13 +52,6 @@ bool DecodeStringList(const string& src, string* strings, int64 n);
 // Assigns base[0..bytes-1] to *s
 void CopyFromArray(string* s, const char* base, size_t bytes);
 
-// Encodes a list of ResourceHandle protos in the given string.
-void EncodeResourceHandleList(const ResourceHandle* handles, int64 n,
-                              string* out);
-
-// Decodes a list of ResourceHandle protos from the given string.
-bool DecodeResourceHandleList(const string& in, ResourceHandle* ps, int64 n);
-
 }  // namespace port
 }  // namespace tensorflow
 
diff --git a/tensorflow/core/platform/variant_coding.cc b/tensorflow/core/platform/variant_coding.cc
deleted file mode 100644
index 48c5389d293f25e854e5b196230626c8fc9f697a..0000000000000000000000000000000000000000
--- a/tensorflow/core/platform/variant_coding.cc
+++ /dev/null
@@ -1,71 +0,0 @@
-/* Copyright 2015 The TensorFlow Authors. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-==============================================================================*/
-
-#include "tensorflow/core/platform/variant_coding.h"
-
-#include <vector>
-#include "tensorflow/core/framework/tensor.pb.h"
-#include "tensorflow/core/framework/variant_op_registry.h"
-#include "tensorflow/core/lib/core/coding.h"
-#include "tensorflow/core/lib/core/stringpiece.h"
-#include "tensorflow/core/lib/strings/strcat.h"
-
-namespace tensorflow {
-namespace port {
-
-void EncodeVariantList(const Variant* variant_array, int64 n, string* out) {
-  out->clear();
-  string rest;
-  for (int i = 0; i < n; ++i) {
-    string s;
-    variant_array[i].Encode(&s);
-    core::PutVarint32(out, s.length());
-    strings::StrAppend(&rest, s);
-  }
-  strings::StrAppend(out, rest);
-}
-
-bool DecodeVariantList(const string& in, Variant* variant_array, int64 n) {
-  std::vector<uint32> sizes(n);
-  StringPiece reader(in);
-  int64 total = 0;
-  for (auto& size : sizes) {
-    if (!core::GetVarint32(&reader, &size)) return false;
-    total += size;
-  }
-  if (total != static_cast<int64>(reader.size())) {
-    return false;
-  }
-
-  for (int i = 0; i < n; ++i) {
-    if (variant_array[i].is_empty()) {
-      variant_array[i] = VariantTensorDataProto();
-    }
-    string str(reader.data(), sizes[i]);
-    if (!variant_array[i].Decode(str)) return false;
-    if (!DecodeUnaryVariant(&variant_array[i])) {
-      LOG(ERROR) << "Could not decode variant with type_name: \""
-                 << variant_array[i].TypeName()
-                 << "\".  Perhaps you forgot to register a "
-                    "decoder via REGISTER_UNARY_VARIANT_DECODE_FUNCTION?";
-      return false;
-    }
-    reader.remove_prefix(sizes[i]);
-  }
-  return true;
-}
-
-}  // end namespace port
-}  // end namespace tensorflow
diff --git a/tensorflow/core/platform/variant_coding.h b/tensorflow/core/platform/variant_coding.h
deleted file mode 100644
index a971857e4a8954244c7ccfbfbac0b5234beed746..0000000000000000000000000000000000000000
--- a/tensorflow/core/platform/variant_coding.h
+++ /dev/null
@@ -1,40 +0,0 @@
-/* Copyright 2015 The TensorFlow Authors. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-==============================================================================*/
-
-#ifndef TENSORFLOW_PLATFORM_VARIANT_CODING_H_
-#define TENSORFLOW_PLATFORM_VARIANT_CODING_H_
-
-#include "tensorflow/core/framework/variant.h"
-#include "tensorflow/core/framework/variant_encode_decode.h"
-
-#ifdef PLATFORM_GOOGLE
-#include "tensorflow/core/platform/google/variant_cord_coding.h"
-#endif
-
-namespace tensorflow {
-namespace port {
-
-// Encodes an array of Variant objects in to the given string.
-// `variant_array` is assumed to point to an array of `n` Variant objects.
-void EncodeVariantList(const Variant* variant_array, int64 n, string* out);
-
-// Decodes an array of Variant objects from the given string.
-// `variant_array` is assumed to point to an array of `n` Variant objects.
-bool DecodeVariantList(const string& in, Variant* variant_array, int64 n);
-
-}  // end namespace port
-}  // end namespace tensorflow
-
-#endif  // TENSORFLOW_PLATFORM_VARIANT_CODING_H_
diff --git a/tensorflow/core/platform/windows/env.cc b/tensorflow/core/platform/windows/env.cc
index 2f54f423b2ee7d8842f2edab7f6bf29877fac173..68ee3595a2bf8456cc1774680c9b29545ad18021 100644
--- a/tensorflow/core/platform/windows/env.cc
+++ b/tensorflow/core/platform/windows/env.cc
@@ -31,6 +31,7 @@ limitations under the License.
 #include "tensorflow/core/lib/core/error_codes.pb.h"
 #include "tensorflow/core/platform/load_library.h"
 #include "tensorflow/core/platform/logging.h"
+#include "tensorflow/core/platform/windows/wide_char.h"
 #include "tensorflow/core/platform/windows/windows_file_system.h"
 
 #pragma comment(lib, "Shlwapi.lib")
@@ -71,8 +72,8 @@ class WindowsEnv : public Env {
   }
 
   bool MatchPath(const string& path, const string& pattern) override {
-    std::wstring ws_path(WindowsFileSystem::Utf8ToWideChar(path));
-    std::wstring ws_pattern(WindowsFileSystem::Utf8ToWideChar(pattern));
+    std::wstring ws_path(Utf8ToWideChar(path));
+    std::wstring ws_pattern(Utf8ToWideChar(pattern));
     return PathMatchSpecW(ws_path.c_str(), ws_pattern.c_str()) == TRUE;
   }
 
@@ -125,7 +126,7 @@ class WindowsEnv : public Env {
     std::string file_name = library_filename;
     std::replace(file_name.begin(), file_name.end(), '/', '\\');
 
-    std::wstring ws_file_name(WindowsFileSystem::Utf8ToWideChar(file_name));
+    std::wstring ws_file_name(Utf8ToWideChar(file_name));
 
     HMODULE hModule = LoadLibraryExW(ws_file_name.c_str(), NULL,
                                      LOAD_WITH_ALTERED_SEARCH_PATH);
diff --git a/tensorflow/core/platform/windows/wide_char.h b/tensorflow/core/platform/windows/wide_char.h
new file mode 100644
index 0000000000000000000000000000000000000000..1b86abc3fa120feb331ad46a5221444c7d08effb
--- /dev/null
+++ b/tensorflow/core/platform/windows/wide_char.h
@@ -0,0 +1,46 @@
+/* Copyright 2018 Google Inc. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CORE_PLATFORM_WINDOWS_WIDE_CHAR_H_
+#define TENSORFLOW_CORE_PLATFORM_WINDOWS_WIDE_CHAR_H_
+
+#include <Windows.h>
+#include <string>
+#include "tensorflow/core/platform/types.h"
+
+namespace tensorflow {
+
+inline std::wstring Utf8ToWideChar(const string& utf8str) {
+  int size_required = MultiByteToWideChar(CP_UTF8, 0, utf8str.c_str(),
+                                          (int)utf8str.size(), NULL, 0);
+  std::wstring ws_translated_str(size_required, 0);
+  MultiByteToWideChar(CP_UTF8, 0, utf8str.c_str(), (int)utf8str.size(),
+                      &ws_translated_str[0], size_required);
+  return ws_translated_str;
+}
+
+inline string WideCharToUtf8(const std::wstring& wstr) {
+  if (wstr.empty()) return std::string();
+  int size_required = WideCharToMultiByte(
+      CP_UTF8, 0, wstr.c_str(), (int)wstr.size(), NULL, 0, NULL, NULL);
+  string utf8_translated_str(size_required, 0);
+  WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), (int)wstr.size(),
+                      &utf8_translated_str[0], size_required, NULL, NULL);
+  return utf8_translated_str;
+}
+
+}  // namespace tensorflow
+
+#endif  // TENSORFLOW_CORE_PLATFORM_WINDOWS_WIDE_CHAR_H_
diff --git a/tensorflow/core/platform/windows/windows_file_system.cc b/tensorflow/core/platform/windows/windows_file_system.cc
index dc2efbeaf5e3eabc6077df4c1c126762d36ba8a6..9079a5ccaad338addf1c0f8927157e454ad85da1 100644
--- a/tensorflow/core/platform/windows/windows_file_system.cc
+++ b/tensorflow/core/platform/windows/windows_file_system.cc
@@ -32,6 +32,7 @@ limitations under the License.
 #include "tensorflow/core/platform/logging.h"
 #include "tensorflow/core/platform/posix/error.h"
 #include "tensorflow/core/platform/windows/error.h"
+#include "tensorflow/core/platform/windows/wide_char.h"
 #include "tensorflow/core/platform/windows/windows_file_system.h"
 
 // TODO(mrry): Prevent this Windows.h #define from leaking out of our headers.
diff --git a/tensorflow/core/platform/windows/windows_file_system.h b/tensorflow/core/platform/windows/windows_file_system.h
index ba0302f0fd8b56dabaf9271a725bebdac4716102..6b04720c68f5e941fd49551a7654baf0d066affd 100644
--- a/tensorflow/core/platform/windows/windows_file_system.h
+++ b/tensorflow/core/platform/windows/windows_file_system.h
@@ -64,25 +64,6 @@ class WindowsFileSystem : public FileSystem {
   Status RenameFile(const string& src, const string& target) override;
 
   string TranslateName(const string& name) const override { return name; }
-
-  static std::wstring Utf8ToWideChar(const string& utf8str) {
-    int size_required = MultiByteToWideChar(CP_UTF8, 0, utf8str.c_str(),
-                                            (int)utf8str.size(), NULL, 0);
-    std::wstring ws_translated_str(size_required, 0);
-    MultiByteToWideChar(CP_UTF8, 0, utf8str.c_str(), (int)utf8str.size(),
-                        &ws_translated_str[0], size_required);
-    return ws_translated_str;
-  }
-
-  static string WideCharToUtf8(const std::wstring& wstr) {
-    if (wstr.empty()) return std::string();
-    int size_required = WideCharToMultiByte(
-        CP_UTF8, 0, wstr.c_str(), (int)wstr.size(), NULL, 0, NULL, NULL);
-    string utf8_translated_str(size_required, 0);
-    WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), (int)wstr.size(),
-                        &utf8_translated_str[0], size_required, NULL, NULL);
-    return utf8_translated_str;
-  }
 };
 
 class LocalWinFileSystem : public WindowsFileSystem {
diff --git a/tensorflow/core/protobuf/config.proto b/tensorflow/core/protobuf/config.proto
index a28ed06fb37826cf880bab8ed6a88b6dd947090c..9a48f43a63abade2ea462d088b35a5037a3c6ca1 100644
--- a/tensorflow/core/protobuf/config.proto
+++ b/tensorflow/core/protobuf/config.proto
@@ -379,7 +379,17 @@ message ConfigProto {
   // shared with other sessions.
   bool isolate_session_state = 15;
 
-  // Next: 16
+  // Everything inside Experimental is subject to change and is not subject
+  // to API stability guarantees in
+  // https://www.tensorflow.org/programmers_guide/version_compat.
+  message Experimental {
+    // Task name for group resolution.
+    string collective_group_leader = 1;
+  };
+
+  Experimental experimental = 16;
+
+  // Next: 17
 };
 
 // Options for a single Run() call.
@@ -414,6 +424,19 @@ message RunOptions {
   // Enabling this option can slow down the Run() call.
   bool report_tensor_allocations_upon_oom = 7;
 
+  // Everything inside Experimental is subject to change and is not subject
+  // to API stability guarantees in
+  // https://www.tensorflow.org/programmers_guide/version_compat.
+  message Experimental {
+    // If non-zero, declares that this graph is going to use collective
+    // ops and must synchronize step_ids with any other graph with this
+    // same group_key value (in a distributed computation where tasks
+    // run disjoint graphs).
+    int64 collective_graph_key = 1;
+  };
+
+  Experimental experimental = 8;
+
   reserved 4;
 }
 
diff --git a/tensorflow/core/protobuf/rewriter_config.proto b/tensorflow/core/protobuf/rewriter_config.proto
index 45e57594e4d282aa9c4c7ed5c14f8f39ac422537..bbb25d6f3f755d845e003a3da688fbd78c6b1479 100644
--- a/tensorflow/core/protobuf/rewriter_config.proto
+++ b/tensorflow/core/protobuf/rewriter_config.proto
@@ -14,6 +14,11 @@ message AutoParallelOptions {
   int32 num_replicas = 2;
 }
 
+message ScopedAllocatorOptions {
+  // If present, only perform optimization for these ops.
+  repeated string enable_op = 1;
+}
+
 message RewriterConfig {
   // Graph rewriting is experimental and subject to change, not covered by any
   // API stability guarantees.
@@ -67,6 +72,9 @@ message RewriterConfig {
   Toggle debug_stripper = 11;
   // If true, don't remove unnecessary ops from the graph
   bool disable_model_pruning = 2;
+  // Try to allocate some independent Op outputs contiguously in order to
+  // merge or eliminate downstream Ops (off by default).
+  Toggle scoped_allocator_optimization = 15;
 
   // Controls how many times we run the optimizers in meta optimizer (default
   // is once).
@@ -115,6 +123,8 @@ message RewriterConfig {
   // meta-optimizer or when manually specified through the optimizers field.
   AutoParallelOptions auto_parallel = 5;
 
+  ScopedAllocatorOptions scoped_allocator_opts = 16;
+
   // If non-empty, will use this as an alternative way to specify a list of
   // optimizations to turn on and the order of the optimizations (replacing the
   // meta-optimizer).
diff --git a/tensorflow/core/protobuf/worker.proto b/tensorflow/core/protobuf/worker.proto
index 39dd66fa1f1724a67c71e49193c57c31ed75fe20..a3bc2f422e776abf211d02c57715e7600894a6bb 100644
--- a/tensorflow/core/protobuf/worker.proto
+++ b/tensorflow/core/protobuf/worker.proto
@@ -122,6 +122,14 @@ message RegisterGraphRequest {
 
   // Field(s) used by TensorFlow Debugger (tfdbg).
   DebugOptions debug_options = 5;
+
+  // If graph_def contains any collective ops this must be a positive
+  // integer used to coordinate execution with other graphs.  All
+  // graphs in a distributed execution with the same
+  // collective_graph_key will coordinate to use the same step_id
+  // concurrently so that BufRendezvous entries will make the correct
+  // values accessible.
+  int64 collective_graph_key = 7;
 }
 
 message RegisterGraphResponse {
diff --git a/tensorflow/core/public/version.h b/tensorflow/core/public/version.h
index 522a9d84fddd2e68378866140d3eb63d0ed1421d..cb1fd09dbbbae66bd47a58ada9a90fa6dafba586 100644
--- a/tensorflow/core/public/version.h
+++ b/tensorflow/core/public/version.h
@@ -19,12 +19,12 @@ limitations under the License.
 // TensorFlow uses semantic versioning, see http://semver.org/.
 
 #define TF_MAJOR_VERSION 1
-#define TF_MINOR_VERSION 8
+#define TF_MINOR_VERSION 9
 #define TF_PATCH_VERSION 0
 
 // TF_VERSION_SUFFIX is non-empty for pre-releases (e.g. "-alpha", "-alpha.1",
 // "-beta", "-rc", "-rc.1")
-#define TF_VERSION_SUFFIX ""
+#define TF_VERSION_SUFFIX "-rc0"
 
 #define TF_STR_HELPER(x) #x
 #define TF_STR(x) TF_STR_HELPER(x)
diff --git a/tensorflow/core/util/exec_on_stall.h b/tensorflow/core/util/exec_on_stall.h
new file mode 100644
index 0000000000000000000000000000000000000000..5c8f9d2324d38dc41e2d0790be59e5392feac6db
--- /dev/null
+++ b/tensorflow/core/util/exec_on_stall.h
@@ -0,0 +1,89 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CORE_UTIL_EXEC_ON_STALL_H_
+#define TENSORFLOW_CORE_UTIL_EXEC_ON_STALL_H_
+
+#include <functional>
+
+#include "tensorflow/core/platform/env.h"
+#include "tensorflow/core/platform/mutex.h"
+
+namespace tensorflow {
+
+// An object that executes a particular function only if it
+// is not deleted within the allotted number of seconds.
+//
+// This can be useful in diagnosing deadlocks, stalls and memory leaks
+// without logging too agressively.
+class ExecuteOnStall {
+ public:
+  // delay_secs: If the object still exists after this many seconds,
+  //     execute f.
+  // f: The function to be executed, for example a detailed log of the
+  //    the state of an object to which this is attached.
+  // poll_microseconds: The spawned thread will wake and test whether
+  //    the destructor has been invoked this frequently.
+  ExecuteOnStall(int delay_secs, std::function<void()> f,
+                 int32 poll_microseconds = 100)
+      : disabled_(false),
+        joined_(false),
+        env_(Env::Default()),
+        f_(f),
+        poll_microseconds_(poll_microseconds) {
+    deadline_ = env_->NowMicros() + 1000000 * delay_secs;
+    env_->SchedClosure([this]() {
+      while (env_->NowMicros() < deadline_) {
+        {
+          mutex_lock l(mu_);
+          if (disabled_) {
+            break;
+          }
+        }
+        env_->SleepForMicroseconds(poll_microseconds_);
+      }
+      {
+        mutex_lock l(mu_);
+        if (!disabled_) {
+          f_();
+        }
+        joined_ = true;
+        cond_var_.notify_all();
+      }
+    });
+  }
+
+  ~ExecuteOnStall() {
+    // Wait for spawned thread to terminate.
+    mutex_lock l(mu_);
+    disabled_ = true;
+    if (!joined_) {
+      cond_var_.wait(l);
+    }
+  }
+
+ private:
+  mutex mu_;
+  condition_variable cond_var_;
+  bool disabled_ GUARDED_BY(mu_);
+  bool joined_ GUARDED_BY(mu_);
+  Env* env_;
+  std::function<void()> f_;
+  int64 deadline_;
+  int32 poll_microseconds_;
+};
+
+}  // namespace tensorflow
+#endif  // TENSORFLOW_CORE_UTIL_EXEC_ON_STALL_H_
diff --git a/tensorflow/core/util/exec_on_stall_test.cc b/tensorflow/core/util/exec_on_stall_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..42e66a7e84a707e0f09cca7b37c7c2383025b295
--- /dev/null
+++ b/tensorflow/core/util/exec_on_stall_test.cc
@@ -0,0 +1,58 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/util/exec_on_stall.h"
+
+#include "tensorflow/core/platform/macros.h"
+#include "tensorflow/core/platform/mutex.h"
+#include "tensorflow/core/platform/test.h"
+
+namespace tensorflow {
+namespace {
+
+struct Chunk {
+  std::unique_ptr<ExecuteOnStall> stall_closure;
+};
+
+Chunk* NewChunk(int stall_seconds, std::function<void()> f) {
+  Chunk* c = new Chunk;
+  c->stall_closure.reset(new ExecuteOnStall(stall_seconds, std::move(f)));
+  return c;
+}
+
+TEST(ExecuteOnStallTest, BothWays) {
+  mutex mu;
+  bool a_triggered(false);
+  bool b_triggered(false);
+  Chunk* a = NewChunk(1, [&mu, &a_triggered]() {
+    mutex_lock l(mu);
+    a_triggered = true;
+  });
+  Chunk* b = NewChunk(1, [&mu, &b_triggered]() {
+    mutex_lock l(mu);
+    b_triggered = true;
+  });
+  delete a;
+  Env::Default()->SleepForMicroseconds(2000000);
+  {
+    mutex_lock l(mu);
+    EXPECT_FALSE(a_triggered);
+    EXPECT_TRUE(b_triggered);
+  }
+  delete b;
+}
+
+}  // namespace
+}  // namespace tensorflow
diff --git a/tensorflow/core/util/mkl_util.h b/tensorflow/core/util/mkl_util.h
index 230b4278ca926b0fe46e73e26c6d3fac7d58c807..96944f27cd1a4f35098b92eabf73eab8591c424b 100644
--- a/tensorflow/core/util/mkl_util.h
+++ b/tensorflow/core/util/mkl_util.h
@@ -22,10 +22,13 @@ limitations under the License.
 #include <unordered_map>
 #include <utility>
 
+#ifdef INTEL_MKL_ML
 #include "mkl_dnn.h"
 #include "mkl_dnn_types.h"
 #include "mkl_service.h"
 #include "mkl_trans.h"
+#endif
+
 #include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/framework/tensor.h"
 #include "tensorflow/core/framework/tensor_shape.h"
@@ -39,6 +42,7 @@ limitations under the License.
 
 #ifndef INTEL_MKL_ML
 #include "mkldnn.hpp"
+#include "tensorflow/core/lib/core/stringpiece.h"
 
 using mkldnn::engine;
 using mkldnn::memory;
@@ -51,11 +55,12 @@ using mkldnn::reorder;
 typedef unsigned int uint;
 #endif
 
-// The file contains a number of utility classes and functions used by MKL
-// enabled kernels
 
 namespace tensorflow {
 
+// The file contains a number of utility classes and functions used by MKL
+// enabled kernels
+
 // This class encapsulates all the meta data that is associated with an MKL
 // tensor. A tensor is an MKL tensor if it was created as the result of an
 // MKL operation, and did not go through a conversion to a standard
@@ -71,6 +76,7 @@ typedef enum {
   Dim_I = 1
 } MklDnnDims;
 
+#ifdef INTEL_MKL_ML
 class MklShape {
  public:
   MklShape() {}
@@ -331,7 +337,7 @@ class MklShape {
       nullptr;  // TF dimension corresponding to this MKL dimension
 };
 
-#ifndef INTEL_MKL_ML
+#else
 
 // Forward decl
 TensorFormat MklDnnDataFormatToTFDataFormat(memory::format format);
@@ -664,12 +670,14 @@ class MklDnnShape {
 
 // List of MklShape objects. Used in Concat/Split layers.
 
-typedef std::vector<MklShape> MklShapeList;
 
 #ifndef INTEL_MKL_ML
 typedef std::vector<MklDnnShape> MklDnnShapeList;
+#else
+typedef std::vector<MklShape> MklShapeList;
 #endif
 
+#ifdef INTEL_MKL_ML
 // Check if all tensors specified by MklShapes are MKL tensors.
 inline bool AreAllMklTensors(const MklShapeList& shapes) {
   for (auto& s : shapes) {
@@ -680,7 +688,6 @@ inline bool AreAllMklTensors(const MklShapeList& shapes) {
   return true;
 }
 
-#ifdef INTEL_MKL_ML
 template <typename T>
 inline Tensor ConvertMklToTF(OpKernelContext* context, const Tensor& mkl_tensor,
                              const MklShape& mkl_shape) {
@@ -753,6 +760,7 @@ inline Tensor ConvertMklToTF(OpKernelContext* context, const Tensor& mkl_tensor,
 #endif
 
 // Get the MKL shape from the second string tensor
+#ifdef INTEL_MKL_ML
 inline void GetMklShape(OpKernelContext* ctext, int n, MklShape* mklshape) {
   mklshape->DeSerializeMklShape(
       ctext->input(GetTensorMetaDataIndex(n, ctext->num_inputs()))
@@ -763,8 +771,7 @@ inline void GetMklShape(OpKernelContext* ctext, int n, MklShape* mklshape) {
               .size() *
           sizeof(uint8));
 }
-
-#ifndef INTEL_MKL_ML
+#else
 inline void GetMklShape(OpKernelContext* ctext, int n, MklDnnShape* mklshape) {
   mklshape->DeSerializeMklDnnShape(
       ctext->input(GetTensorMetaDataIndex(n, ctext->num_inputs()))
@@ -838,6 +845,7 @@ inline TensorShape GetTfShape(OpKernelContext* context, size_t input_idx) {
 }
 #endif
 
+#ifdef INTEL_MKL_ML
 // Allocate the second output tensor that will contain
 // the MKL shape serialized
 inline void AllocateOutputSetMklShape(OpKernelContext* ctext, int n,
@@ -853,7 +861,7 @@ inline void AllocateOutputSetMklShape(OpKernelContext* ctext, int n,
       second_tensor->flat<uint8>().size() * sizeof(uint8));
 }
 
-#ifndef INTEL_MKL_ML
+#else
 // Allocate the second output tensor that will contain
 // the MKL shape serialized
 inline void AllocateOutputSetMklShape(OpKernelContext* ctext, int n,
@@ -870,6 +878,7 @@ inline void AllocateOutputSetMklShape(OpKernelContext* ctext, int n,
 }
 #endif
 
+#ifdef INTEL_MKL_ML
 // Allocate the output tensor, create a second output tensor that will contain
 // the MKL shape serialized
 inline void AllocateOutputSetMklShape(OpKernelContext* ctext, int n,
@@ -890,7 +899,7 @@ inline void AllocateOutputSetMklShape(OpKernelContext* ctext, int n,
       second_tensor->flat<uint8>().size() * sizeof(uint8));
 }
 
-#ifndef INTEL_MKL_ML
+#else
 // Allocate the output tensor, create a second output tensor that will contain
 // the MKL shape serialized
 inline void AllocateOutputSetMklShape(OpKernelContext* ctext, int n,
@@ -925,8 +934,7 @@ inline void AllocTmpBuffer(OpKernelContext* context, Tensor* tensor_out,
                                                  tf_shape, tensor_out));
   *buf_out = static_cast<void*>(tensor_out->flat<T>().data());
 }
-#endif
-
+#else
 inline void AllocTmpBuffer(OpKernelContext* context, Tensor* tensor_out,
                            dnnLayout_t lt_buff, void** buf_out) {
   TensorShape tf_shape;
@@ -940,6 +948,7 @@ inline void AllocTmpBuffer(OpKernelContext* context, Tensor* tensor_out,
   *buf_out = static_cast<void*>(tensor_out->flat<float>().data());
 }
 
+#endif
 template <typename T>
 inline void AllocTmpBuffer(OpKernelContext* context, Tensor* tensor_out,
                            TensorShape tf_shape) {
@@ -963,6 +972,7 @@ inline void GetStridesFromSizes(TensorFormat data_format, size_t* strides,
   }
 }
 
+#ifdef INTEL_MKL_ML
 inline void MklSizesToTFSizes(OpKernelContext* context,
                               TensorFormat data_format_,
                               const MklShape& mkl_shape,
@@ -988,6 +998,7 @@ inline void MklSizesToTFSizes(OpKernelContext* context,
 
   OP_REQUIRES_OK(context, TensorShapeUtils::MakeShape(sizes, tf_shape));
 }
+#endif
 
 inline int32 GetMklTensorDimIndex(char dimension) {
   switch (dimension) {
@@ -1005,12 +1016,14 @@ inline int32 GetMklTensorDimIndex(char dimension) {
   }
 }
 
+#ifdef INTEL_MKL_ML
 inline int64 GetMklTensorDim(const MklShape& mkl_shape, char dimension) {
   int index = GetMklTensorDimIndex(dimension);
   CHECK(index >= 0 && index < mkl_shape.GetDimension())
       << "Invalid index from the dimension: " << index << ", " << dimension;
   return mkl_shape.dim_size(index);
 }
+#endif
 
 inline void CopyMklTensorInToOut(OpKernelContext* context, int idx_in,
                                  int idx_out) {
@@ -1130,6 +1143,14 @@ inline void ForwardMklTensorInToOut(OpKernelContext* context, int idx_in,
 }
 
 #ifndef INTEL_MKL_ML
+// Set a dummy MKLDNN shape (called when the output is in TF format)
+inline void SetDummyMklDnnShapeOutput(OpKernelContext* context,
+                                      uint32 idx_data_out) {
+  MklDnnShape mkl_shape_output;
+  mkl_shape_output.SetMklTensor(false);
+  AllocateOutputSetMklShape(context, idx_data_out, mkl_shape_output);
+}
+
 inline void ForwardMklTensorInToOutWithMklShape(OpKernelContext* context,
                                                 int idx_in, int idx_out,
                                                 const MklDnnShape& mkl_shape) {
@@ -1165,6 +1186,7 @@ inline void ForwardMklMetaDataInToOut(OpKernelContext* context,
   }
 }
 
+#ifdef INTEL_MKL_ML
 // Set a dummy MKL shape (called when the output is in TF format)
 inline void SetDummyMklShapeOutput(OpKernelContext* context,
                                    uint32 idx_data_out) {
@@ -1172,8 +1194,6 @@ inline void SetDummyMklShapeOutput(OpKernelContext* context,
   mkl_shape_output.SetMklTensor(false);
   AllocateOutputSetMklShape(context, idx_data_out, mkl_shape_output);
 }
-
-#ifdef INTEL_MKL_ML
 // We don't need these functions in MKLDNN. We have defined equality operator
 // on MklDnnShape class directly.
 
@@ -1243,7 +1263,6 @@ inline bool MklCompareShapes(const TensorShape* input_shape_0,
 
   return true;
 }
-#endif
 
 // These functions do not compile with MKL-DNN since mkl.h is missing.
 // We may need to remove them later.
@@ -1281,6 +1300,7 @@ inline void MklNCHWToNHWC(const Tensor& input, Tensor** output) {
   }
 }
 
+#endif
 // -------------------------------------------------------------------
 
 #ifndef INTEL_MKL_ML
@@ -1794,11 +1814,11 @@ class MklDnnData {
   }
 };
 
-/// Base class for operations with reuse of DNN primitives
+/// Base class for operations with reuse of primitives
 ///
-class DnnOp {
+class MklPrimitive {
  public:
-  virtual ~DnnOp() {}
+  virtual ~MklPrimitive() {}
 
   // Dummy data. Its size, hard-coded as 256 here, does
   // not matter since MKL should never operate on this buffer.
@@ -1806,33 +1826,33 @@ class DnnOp {
 };
 
 const mkldnn::memory::dims NONE_DIMS = {};
-// This constant is used to declare dummy buffer (size), for MKL primitives
+
 template <typename T>
-class DnnOpFactory {
+class MklPrimitiveFactory {
  public:
-  DnnOpFactory() {}
-  ~DnnOpFactory() {}
+  MklPrimitiveFactory() {}
+  ~MklPrimitiveFactory() {}
 
-  DnnOp* GetOp(const std::string& key) {
-    auto stream_iter = DnnOpFactory<T>::GetHashMap().find(key);
-    if (stream_iter == DnnOpFactory<T>::GetHashMap().end()) {
+  MklPrimitive* GetOp(const std::string& key) {
+    auto stream_iter = MklPrimitiveFactory<T>::GetHashMap().find(key);
+    if (stream_iter == MklPrimitiveFactory<T>::GetHashMap().end()) {
       return nullptr;
     } else {
       return stream_iter->second;
     }
   }
 
-  void SetOp(const std::string& key, DnnOp* op) {
-    auto stream_iter = DnnOpFactory<T>::GetHashMap().find(key);
+  void SetOp(const std::string& key, MklPrimitive* op) {
+    auto stream_iter = MklPrimitiveFactory<T>::GetHashMap().find(key);
 
-    CHECK(stream_iter == DnnOpFactory<T>::GetHashMap().end());
+    CHECK(stream_iter == MklPrimitiveFactory<T>::GetHashMap().end());
 
-    DnnOpFactory<T>::GetHashMap()[key] = op;
+    MklPrimitiveFactory<T>::GetHashMap()[key] = op;
   }
 
  private:
-  static inline std::unordered_map<std::string, DnnOp*> &GetHashMap() {
-    static thread_local std::unordered_map<std::string, DnnOp*> map_;
+  static inline std::unordered_map<std::string, MklPrimitive*> &GetHashMap() {
+    static thread_local std::unordered_map<std::string, MklPrimitive*> map_;
     return map_;
   }
 };
@@ -1846,10 +1866,7 @@ class FactoryKeyCreator {
 
   ~FactoryKeyCreator() {}
 
-  void AddAsKey(const string &str) {
-    auto buffer = reinterpret_cast<const char *>(str.c_str());
-    Append(buffer, str.length());
-  }
+  void AddAsKey(const string& str) { Append(str); }
 
   void AddAsKey(const mkldnn::memory::dims &dims) {
     for (unsigned int i = 0; i < dims.size(); i++) {
@@ -1860,7 +1877,7 @@ class FactoryKeyCreator {
   template <typename T>
   void AddAsKey(const T data) {
     auto buffer = reinterpret_cast<const char *>(&data);
-    Append(buffer, sizeof(T));
+    Append(StringPiece(buffer, sizeof(T)));
   }
 
   std::string GetKey() {
@@ -1871,8 +1888,8 @@ class FactoryKeyCreator {
   string key_;
   const char delimiter = 'x';
   const int kMaxKeyLength = 256;
-  void Append(const char* data, int len) {
-    key_.append(data, len);
+  void Append(StringPiece s) {
+    key_.append(s.ToString());
     key_.append(1, delimiter);
   }
 };
diff --git a/tensorflow/core/util/rpc/call_container.h b/tensorflow/core/util/rpc/call_container.h
index e1226a7f162cafceb73dad441fbc97fbc6f501fe..39ead10815abbae413dd9b761f80fb5b9f7c0f21 100644
--- a/tensorflow/core/util/rpc/call_container.h
+++ b/tensorflow/core/util/rpc/call_container.h
@@ -102,7 +102,9 @@ CallContainer<Call>::CallContainer(
     typename CallContainer<Call>::StartCallFn start_call_fn)
     : ctx_(ctx),
       done_(std::move(done)),
-      token_(ctx->cancellation_manager()->get_cancellation_token()),
+      token_(ctx->cancellation_manager() != nullptr
+                 ? ctx->cancellation_manager()->get_cancellation_token()
+                 : CancellationManager::kInvalidToken),
       fail_fast_(fail_fast),
       try_rpc_(try_rpc),
       callback_destroyed_(new Notification) {
@@ -110,7 +112,9 @@ CallContainer<Call>::CallContainer(
 
   // This will run when all RPCs are finished.
   reffed_status_callback_ = new ReffedStatusCallback([this](const Status& s) {
-    ctx_->cancellation_manager()->DeregisterCallback(token_);
+    if (token_ != CancellationManager::kInvalidToken) {
+      ctx_->cancellation_manager()->DeregisterCallback(token_);
+    }
     ctx_->SetStatus(s);
     done_();
     callback_destroyed_->WaitForNotification();
@@ -125,11 +129,14 @@ CallContainer<Call>::CallContainer(
   std::shared_ptr<internal::NotifyWhenDestroyed> notify_when_destroyed(
       new internal::NotifyWhenDestroyed(callback_destroyed_));
   std::shared_ptr<Notification> calls_started(new Notification);
-  bool is_cancelled = !ctx_->cancellation_manager()->RegisterCallback(
-      token_, [this, calls_started, notify_when_destroyed]() {
-        calls_started->WaitForNotification();
-        StartCancel();
-      });
+  bool is_cancelled = false;
+  if (token_ != CancellationManager::kInvalidToken) {
+    is_cancelled = !ctx_->cancellation_manager()->RegisterCallback(
+        token_, [this, calls_started, notify_when_destroyed]() {
+          calls_started->WaitForNotification();
+          StartCancel();
+        });
+  }
 
   for (int i = 0; i < num_calls; ++i) {
     create_call_fn(this, i);
diff --git a/tensorflow/core/util/sparse/group_iterator.h b/tensorflow/core/util/sparse/group_iterator.h
index c0fce207e7a22028818abe1dcd9827434b1e4fcf..fb703180787db5bb0202ba34bb703dc378a6d181 100644
--- a/tensorflow/core/util/sparse/group_iterator.h
+++ b/tensorflow/core/util/sparse/group_iterator.h
@@ -78,7 +78,10 @@ class GroupIterable {
   typedef gtl::ArraySlice<int64> VarDimArray;
 
   GroupIterable(Tensor ix, Tensor vals, int dims, const VarDimArray& group_dims)
-      : ix_(ix), vals_(vals), dims_(dims), group_dims_(group_dims) {}
+      : ix_(ix),
+        vals_(vals),
+        dims_(dims),
+        group_dims_(group_dims.begin(), group_dims.end()) {}
 
   class IteratorStep;
 
@@ -127,7 +130,7 @@ class GroupIterable {
   Tensor ix_;
   Tensor vals_;
   const int dims_;
-  const VarDimArray group_dims_;
+  const gtl::InlinedVector<int64, 8> group_dims_;
 };
 
 // Implementation of Group::values<T>()
diff --git a/tensorflow/core/util/stat_summarizer.cc b/tensorflow/core/util/stat_summarizer.cc
index 8447028e3824383beea287f16cfab9fef37dae3e..a5c1fda1024ee835fa65b3f2ffabbed7bb3461cc 100644
--- a/tensorflow/core/util/stat_summarizer.cc
+++ b/tensorflow/core/util/stat_summarizer.cc
@@ -31,26 +31,22 @@ limitations under the License.
 
 namespace tensorflow {
 
+using Detail = StatsCalculator::Detail;
+
 StatSummarizer::StatSummarizer(const StatSummarizerOptions& options)
-    : options_(options) {}
+    : stats_calculator_(new StatsCalculator(options)) {}
 
 StatSummarizer::StatSummarizer(const tensorflow::GraphDef& tensorflow_graph)
-    : StatSummarizer(StatSummarizerOptions()) {}
+    : stats_calculator_(new StatsCalculator(StatSummarizerOptions())) {}
 
 StatSummarizer::~StatSummarizer() {}
 
-void StatSummarizer::Reset() {
-  run_total_us_.Reset();
-  memory_.Reset();
-  details_.clear();
-}
-
-void StatSummarizer::Validate(const Detail* detail,
+void StatSummarizer::Validate(const std::vector<TensorDescription>* outputs,
                               const NodeExecStats& ns) const {
-  if (detail->outputs.size() != ns.output_size()) {
+  if (outputs->size() != ns.output_size()) {
     LOG(WARNING) << "Number of outputs changed between runs for '"
-                 << ns.node_name() << "' - was " << detail->outputs.size()
-                 << ", now " << ns.output_size();
+                 << ns.node_name() << "' - was " << outputs->size() << ", now "
+                 << ns.output_size();
   } else {
     for (const auto& output : ns.output()) {
       const int32 slot = output.slot();
@@ -58,7 +54,7 @@ void StatSummarizer::Validate(const Detail* detail,
         // This is not a hard error for Switch ops, so just pass.
         continue;
       }
-      const auto& stored = detail->outputs[slot];
+      const auto& stored = (*outputs)[slot];
       const auto& current = output.tensor_description();
 
       bool do_tensors_match =
@@ -82,6 +78,14 @@ void StatSummarizer::Validate(const Detail* detail,
   }
 }
 
+void StatSummarizer::PrintStepStats() const {
+  string output = GetOutputString();
+  std::istringstream iss(output);
+  for (std::string line; std::getline(iss, line);) {
+    LOG(INFO) << line;
+  }
+}
+
 namespace {
 std::string OpType(const DeviceStepStats& ds, const NodeExecStats& ns) {
   // There is no published specification of how DeviceStats and NodeStats
@@ -129,6 +133,7 @@ void StatSummarizer::ProcessStepStats(const StepStats& step_stats) {
 
   int64 first_node_start_us =
       step_stats.dev_stats(0).node_stats(0).all_start_micros();
+  std::map<std::string, Detail> details;
 
   int node_num = 0;
   for (const auto& ds : step_stats.dev_stats()) {
@@ -172,7 +177,10 @@ void StatSummarizer::ProcessStepStats(const StepStats& step_stats) {
       ++node_num;
       const int64 curr_time = ns.all_end_rel_micros();
       curr_total_us += curr_time;
-      auto result = details_.emplace(name, Detail());
+      auto result = details.emplace(name, Detail());
+      auto output_result =
+          outputs_.emplace(name, std::vector<TensorDescription>());
+      std::vector<TensorDescription>* outputs = &(output_result.first->second);
       Detail* detail = &(result.first->second);
 
       detail->start_us.UpdateStat(ns.all_start_micros() - first_node_start_us);
@@ -185,16 +193,15 @@ void StatSummarizer::ProcessStepStats(const StepStats& step_stats) {
 
         detail->run_order = node_num;
 
-        detail->outputs.resize(ns.output_size());
+        outputs->resize(ns.output_size());
         for (const auto& output : ns.output()) {
           const int32 slot = output.slot();
           if ((slot < 0) || (slot >= ns.output_size())) {
             // This is not a hard error for Switch ops, so just pass.
             continue;
           }
-          detail->outputs[slot] = output.tensor_description();
+          (*outputs)[slot] = output.tensor_description();
         }
-
         detail->times_called = 0;
       }
 
@@ -207,273 +214,22 @@ void StatSummarizer::ProcessStepStats(const StepStats& step_stats) {
       mem_total += curr_node_mem;
 
       ++detail->times_called;
+      stats_calculator_->UpdateDetails(details);
 
-      Validate(detail, ns);
-    }
-  }
-
-  run_total_us_.UpdateStat(curr_total_us);
-  memory_.UpdateStat(mem_total);
-}
-
-std::string StatSummarizer::ShortSummary() const {
-  std::stringstream stream;
-  stream << "Timings (microseconds): ";
-  run_total_us_.OutputToStream(&stream);
-  stream << std::endl;
-
-  stream << "Memory (bytes): ";
-  memory_.OutputToStream(&stream);
-  stream << std::endl;
-
-  stream << details_.size() << " nodes observed" << std::endl;
-  return stream.str();
-}
-
-std::ostream& InitField(std::ostream& stream, int width) {
-  stream << "\t" << std::right << std::setw(width) << std::fixed
-         << std::setprecision(3);
-  return stream;
-}
-
-std::string StatSummarizer::HeaderString(const string& title) const {
-  std::stringstream stream;
-
-  stream << "============================== " << title
-         << " ==============================" << std::endl;
-
-  InitField(stream, 24) << "[node type]";
-  InitField(stream, 9) << "[start]";
-  InitField(stream, 9) << "[first]";
-  InitField(stream, 9) << "[avg ms]";
-  InitField(stream, 8) << "[%]";
-  InitField(stream, 8) << "[cdf%]";
-  InitField(stream, 10) << "[mem KB]";
-  InitField(stream, 9) << "[times called]";
-  stream << "\t"
-         << "[Name]";
-  return stream.str();
-}
-
-std::string StatSummarizer::ColumnString(const Detail& detail,
-                                         const int64 cumulative_stat_on_node,
-                                         const Stat<int64>& stat) const {
-  const double start_ms = detail.start_us.avg() / 1000.0;
-  const double first_time_ms = detail.rel_end_us.first() / 1000.0;
-  const double avg_time_ms = detail.rel_end_us.avg() / 1000.0;
-  const double percentage = detail.rel_end_us.sum() * 100.0 / stat.sum();
-  const double cdf_percentage = (cumulative_stat_on_node * 100.0f) / stat.sum();
-  const int64 times_called = detail.times_called / num_runs();
-
-  std::stringstream stream;
-  InitField(stream, 24) << detail.type;
-  InitField(stream, 9) << start_ms;
-  InitField(stream, 9) << first_time_ms;
-  InitField(stream, 9) << avg_time_ms;
-  InitField(stream, 7) << percentage << "%";
-  InitField(stream, 7) << cdf_percentage << "%";
-  InitField(stream, 10) << detail.mem_used.newest() / 1000.0;
-  InitField(stream, 9) << times_called;
-  stream << "\t" << detail.name;
-
-  return stream.str();
-}
-
-void StatSummarizer::OrderNodesByMetric(
-    SortingMetric metric, std::vector<const Detail*>* details) const {
-  std::priority_queue<std::pair<string, const Detail*>> sorted_list;
-  const int num_nodes = details_.size();
-
-  for (const auto& det : details_) {
-    const Detail* detail = &(det.second);
-    std::stringstream stream;
-    stream << std::setw(20) << std::right << std::setprecision(10)
-           << std::fixed;
-
-    switch (metric) {
-      case BY_NAME:
-        stream << detail->name;
-        break;
-      case BY_RUN_ORDER:
-        stream << num_nodes - detail->run_order;
-        break;
-      case BY_TIME:
-        stream << detail->rel_end_us.avg();
-        break;
-      case BY_MEMORY:
-        stream << detail->mem_used.avg();
-        break;
-      case BY_TYPE:
-        stream << detail->type;
-        break;
-      default:
-        stream << "";
-        break;
-    }
-
-    sorted_list.emplace(stream.str(), detail);
-  }
-
-  while (!sorted_list.empty()) {
-    auto entry = sorted_list.top();
-    sorted_list.pop();
-    details->push_back(entry.second);
-  }
-}
-
-void StatSummarizer::ComputeStatsByType(
-    std::map<string, int64>* node_type_map_count,
-    std::map<string, int64>* node_type_map_time,
-    std::map<string, int64>* node_type_map_memory,
-    std::map<string, int64>* node_type_map_times_called,
-    int64* accumulated_us) const {
-  int64 run_count = run_total_us_.count();
-
-  for (const auto& det : details_) {
-    const string node_name = det.first;
-    const Detail& detail = det.second;
-
-    int64 curr_time_val =
-        static_cast<int64>(detail.rel_end_us.sum() / run_count);
-    *accumulated_us += curr_time_val;
-
-    int64 curr_memory_val = detail.mem_used.newest();
-
-    const string& node_type = detail.type;
-
-    (*node_type_map_count)[node_type] += 1;
-    (*node_type_map_time)[node_type] += curr_time_val;
-    (*node_type_map_memory)[node_type] += curr_memory_val;
-    (*node_type_map_times_called)[node_type] += detail.times_called / run_count;
-  }
-}
-
-std::string StatSummarizer::GetStatsByNodeType() const {
-  std::stringstream stream;
-
-  stream << "============================== Summary by node type "
-            "=============================="
-         << std::endl;
-
-  LOG(INFO) << "Number of nodes executed: " << details_.size();
-
-  std::map<string, int64> node_type_map_count;
-  std::map<string, int64> node_type_map_time;
-  std::map<string, int64> node_type_map_memory;
-  std::map<string, int64> node_type_map_times_called;
-  int64 accumulated_us = 0;
-
-  ComputeStatsByType(&node_type_map_count, &node_type_map_time,
-                     &node_type_map_memory, &node_type_map_times_called,
-                     &accumulated_us);
-
-  // Sort them.
-  std::priority_queue<std::pair<int64, std::pair<string, int64>>> timings;
-  for (const auto& node_type : node_type_map_time) {
-    const int64 mem_used = node_type_map_memory[node_type.first];
-    timings.emplace(node_type.second,
-                    std::pair<string, int64>(node_type.first, mem_used));
-  }
-
-  InitField(stream, 24) << "[Node type]";
-  InitField(stream, 9) << "[count]";
-  InitField(stream, 10) << "[avg ms]";
-  InitField(stream, 11) << "[avg %]";
-  InitField(stream, 11) << "[cdf %]";
-  InitField(stream, 10) << "[mem KB]";
-  InitField(stream, 10) << "[times called]";
-  stream << std::endl;
-
-  float cdf = 0.0f;
-  while (!timings.empty()) {
-    auto entry = timings.top();
-    timings.pop();
-
-    const string node_type = entry.second.first;
-    const float memory = entry.second.second / 1000.0f;
-
-    const int64 node_type_total_us = entry.first;
-    const float time_per_run_ms = node_type_total_us / 1000.0f;
-
-    const float percentage =
-        ((entry.first / static_cast<float>(accumulated_us)) * 100.0f);
-    cdf += percentage;
-
-    InitField(stream, 24) << node_type;
-    InitField(stream, 9) << node_type_map_count[node_type];
-    InitField(stream, 10) << time_per_run_ms;
-    InitField(stream, 10) << percentage << "%";
-    InitField(stream, 10) << cdf << "%";
-    InitField(stream, 10) << memory;
-    InitField(stream, 9) << node_type_map_times_called[node_type];
-    stream << std::endl;
-  }
-  stream << std::endl;
-  return stream.str();
-}
-
-std::string StatSummarizer::GetStatsByMetric(const string& title,
-                                             SortingMetric sorting_metric,
-                                             int num_stats) const {
-  std::vector<const Detail*> details;
-  OrderNodesByMetric(sorting_metric, &details);
-
-  double cumulative_stat_on_node = 0;
-
-  std::stringstream stream;
-  stream << HeaderString(title) << std::endl;
-  int stat_num = 0;
-  for (auto detail : details) {
-    ++stat_num;
-    if (num_stats > 0 && stat_num > num_stats) {
-      break;
+      Validate(outputs, ns);
     }
-
-    // TODO(andrewharp): Make this keep track of the particular metric for cdf.
-    cumulative_stat_on_node += detail->rel_end_us.sum();
-    stream << ColumnString(*detail, cumulative_stat_on_node, run_total_us_)
-           << std::endl;
   }
-  stream << std::endl;
-  return stream.str();
-}
 
-std::string StatSummarizer::GetOutputString() const {
-  std::stringstream stream;
-  if (options_.show_run_order) {
-    stream << GetStatsByMetric("Run Order", BY_RUN_ORDER,
-                               options_.run_order_limit);
-  }
-  if (options_.show_time) {
-    stream << GetStatsByMetric("Top by Computation Time", BY_TIME,
-                               options_.time_limit);
-  }
-  if (options_.show_memory) {
-    stream << GetStatsByMetric("Top by Memory Use", BY_MEMORY,
-                               options_.memory_limit);
-  }
-  if (options_.show_type) {
-    stream << GetStatsByNodeType();
-  }
-  if (options_.show_summary) {
-    stream << ShortSummary() << std::endl;
-  }
-  return stream.str();
+  stats_calculator_->UpdateRunTotalUs(curr_total_us);
+  stats_calculator_->UpdateMemoryUsed(mem_total);
 }
 
-void StatSummarizer::PrintStepStats() const {
-  string output = GetOutputString();
-  std::istringstream iss(output);
-  for (std::string line; std::getline(iss, line);) {
-    LOG(INFO) << line;
-  }
-}
 
 void StatSummarizer::PrintOutputs() const {
   std::priority_queue<
       std::pair<int64, const std::pair<const std::string, Detail>*>>
       timings;
-  for (const auto& entry : details_) {
+  for (const auto& entry : stats_calculator_->GetDetails()) {
     timings.emplace(-entry.second.start_us.avg(), &entry);
   }
 
@@ -481,10 +237,10 @@ void StatSummarizer::PrintOutputs() const {
   while (!timings.empty()) {
     auto entry = timings.top();
     timings.pop();
-    const Detail& detail = entry.second->second;
     std::stringstream stream;
-    stream << entry.second->first << "\t" << detail.outputs.size();
-    for (const auto& tensor : detail.outputs) {
+    const auto detail_outputs = outputs_.at(entry.second->first);
+    stream << entry.second->first << "\t" << detail_outputs.size();
+    for (const auto& tensor : detail_outputs) {
       stream << "\t" << DataTypeString(tensor.dtype());
       stream << "\t" << tensor.shape().dim_size();
       for (const auto& d : tensor.shape().dim()) {
diff --git a/tensorflow/core/util/stat_summarizer.h b/tensorflow/core/util/stat_summarizer.h
index 79fa63723ec843e72364f09ad82617788ae9a901..7e6d6f63724ee0f71760f626fb7a47426f19cb3d 100644
--- a/tensorflow/core/util/stat_summarizer.h
+++ b/tensorflow/core/util/stat_summarizer.h
@@ -13,20 +13,23 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#ifndef TENSORFLOW_UTIL_STAT_SUMMARIZER_H_
-#define TENSORFLOW_UTIL_STAT_SUMMARIZER_H_
+#ifndef TENSORFLOW_CORE_UTIL_STAT_SUMMARIZER_H_
+#define TENSORFLOW_CORE_UTIL_STAT_SUMMARIZER_H_
 
 #include <stdlib.h>
 
 #include <cmath>
 #include <limits>
 #include <map>
+#include <memory>
 #include <sstream>
 #include <string>
 
 #include "tensorflow/core/framework/tensor.h"
 #include "tensorflow/core/framework/types.pb.h"
 #include "tensorflow/core/platform/types.h"
+#include "tensorflow/core/util/stat_summarizer_options.h"
+#include "tensorflow/core/util/stats_calculator.h"
 
 namespace tensorflow {
 
@@ -34,103 +37,6 @@ class GraphDef;
 class StepStats;
 class NodeExecStats;
 
-template <typename ValueType, typename HighPrecisionValueType = double>
-class Stat {
- public:
-  void UpdateStat(ValueType v) {
-    if (count_ == 0) {
-      first_ = v;
-    }
-
-    newest_ = v;
-    max_ = std::max(v, max_);
-    min_ = std::min(v, min_);
-    ++count_;
-    sum_ += v;
-    squared_sum_ += static_cast<HighPrecisionValueType>(v) * v;
-  }
-
-  void Reset() { new (this) Stat<ValueType, HighPrecisionValueType>(); }
-
-  bool empty() const { return count_ == 0; }
-
-  ValueType first() const { return first_; }
-
-  ValueType newest() const { return newest_; }
-
-  ValueType max() const { return max_; }
-
-  ValueType min() const { return min_; }
-
-  int64 count() const { return count_; }
-
-  ValueType sum() const { return sum_; }
-
-  HighPrecisionValueType squared_sum() const { return squared_sum_; }
-
-  bool all_same() const { return (count_ == 0 || min_ == max_); }
-
-  HighPrecisionValueType avg() const {
-    return empty() ? std::numeric_limits<ValueType>::quiet_NaN()
-                   : static_cast<HighPrecisionValueType>(sum_) / count_;
-  }
-
-  ValueType std_deviation() const {
-    return all_same() ? 0 : sqrt(squared_sum_ / count_ - avg() * avg());
-  }
-
-  void OutputToStream(std::ostream* stream) const {
-    if (empty()) {
-      *stream << "count=0";
-    } else if (all_same()) {
-      *stream << "count=" << count_ << " curr=" << newest_;
-      if (count_ > 1) *stream << "(all same)";
-    } else {
-      *stream << "count=" << count_ << " first=" << first_
-              << " curr=" << newest_ << " min=" << min_ << " max=" << max_
-              << " avg=" << avg() << " std=" << std_deviation();
-    }
-  }
-
-  friend std::ostream& operator<<(std::ostream& stream,
-                                  const Stat<ValueType>& stat) {
-    stat.OutputToStream(&stream);
-    return stream;
-  }
-
- private:
-  ValueType first_ = 0;
-  ValueType newest_ = 0;
-  ValueType max_ = std::numeric_limits<ValueType>::min();
-  ValueType min_ = std::numeric_limits<ValueType>::max();
-  int64 count_ = 0;
-  ValueType sum_ = 0;
-  HighPrecisionValueType squared_sum_ = 0;
-};
-
-// Used to control the output of the statistics summarizer;
-class StatSummarizerOptions {
- public:
-  StatSummarizerOptions()
-      : show_run_order(true),
-        run_order_limit(0),
-        show_time(true),
-        time_limit(10),
-        show_memory(true),
-        memory_limit(10),
-        show_type(true),
-        show_summary(true) {}
-
-  bool show_run_order;
-  int run_order_limit;
-  bool show_time;
-  int time_limit;
-  bool show_memory;
-  int memory_limit;
-  bool show_type;
-  bool show_summary;
-};
-
 // A StatSummarizer assists in performance analysis of Graph executions.
 //
 // It summarizes time spent executing (on GPU/CPU), memory used etc. across
@@ -140,14 +46,6 @@ class StatSummarizerOptions {
 // See tensorflow/tools/benchmark/benchmark_model.cc for an example usage.
 class StatSummarizer {
  public:
-  enum SortingMetric {
-    BY_NAME,
-    BY_RUN_ORDER,
-    BY_TIME,
-    BY_MEMORY,
-    BY_TYPE,
-  };
-
   explicit StatSummarizer(const StatSummarizerOptions& options);
 
   // Deprecated: Use StatSummarizer(const StatSummarizerOptions&) instead. The
@@ -161,9 +59,13 @@ class StatSummarizer {
 
   // Returns a string detailing the accumulated runtime stats in a tab-separated
   // format which can be pasted into a spreadsheet for further analysis.
-  std::string GetOutputString() const;
+  std::string GetOutputString() const {
+    return stats_calculator_->GetOutputString();
+  }
 
-  std::string ShortSummary() const;
+  std::string ShortSummary() const {
+    return stats_calculator_->GetShortSummary();
+  }
 
   // Prints the string returned by GetOutputString().
   void PrintStepStats() const;
@@ -171,55 +73,44 @@ class StatSummarizer {
   // Prints the output tensor sizes and types for each node.
   void PrintOutputs() const;
 
-  void ComputeStatsByType(std::map<string, int64>* node_type_map_count,
-                          std::map<string, int64>* node_type_map_time,
-                          std::map<string, int64>* node_type_map_memory,
-                          std::map<string, int64>* node_type_map_times_called,
-                          int64* accumulated_us) const;
+  void ComputeStatsByType(
+      std::map<std::string, int64_t>* node_type_map_count,
+      std::map<std::string, int64_t>* node_type_map_time,
+      std::map<std::string, int64_t>* node_type_map_memory,
+      std::map<std::string, int64_t>* node_type_map_times_called,
+      int64_t* accumulated_us) const {
+    stats_calculator_->ComputeStatsByType(
+        node_type_map_count, node_type_map_time, node_type_map_memory,
+        node_type_map_times_called, accumulated_us);
+  }
 
-  std::string GetStatsByNodeType() const;
+  std::string GetStatsByNodeType() const {
+    return stats_calculator_->GetStatsByNodeType();
+  }
 
   std::string GetStatsByMetric(const string& title,
-                               SortingMetric sorting_metric,
-                               int num_stats) const;
-
-  void Reset();
+                               StatsCalculator::SortingMetric sorting_metric,
+                               int num_stats) const {
+    return stats_calculator_->GetStatsByMetric(title, sorting_metric,
+                                               num_stats);
+  }
 
-  // Returns number of runs.
-  int num_runs() const { return static_cast<int>(run_total_us_.count()); }
+  int num_runs() const { return stats_calculator_->num_runs(); }
 
   // Returns stats of total microseconds spent by all nodes in each run.
-  const Stat<int64>& run_total_us() const { return run_total_us_; }
+  const Stat<int64_t>& run_total_us() const {
+    return stats_calculator_->run_total_us();
+  }
 
  private:
-  struct Detail {
-    string name;
-    string type;
-    int64 run_order;
-    Stat<int64> start_us;
-    Stat<int64> rel_end_us;
-    Stat<int64> mem_used;
-    std::vector<TensorDescription> outputs;
-    int64 times_called;
-  };
-
-  void Validate(const Detail* detail, const NodeExecStats& ns) const;
-
-  void OrderNodesByMetric(SortingMetric sorting_metric,
-                          std::vector<const Detail*>* details) const;
-
-  std::string HeaderString(const string& title) const;
-  std::string ColumnString(const Detail& detail,
-                           const int64 cumulative_stat_on_node,
-                           const Stat<int64>& stat) const;
-
-  Stat<int64> run_total_us_;
-  Stat<int64> memory_;
-
-  std::map<std::string, Detail> details_;
-  StatSummarizerOptions options_;
+  void Validate(const std::vector<TensorDescription>* outputs,
+                const NodeExecStats& ns) const;
+
+  std::map<std::string, std::vector<TensorDescription> > outputs_;
+
+  std::unique_ptr<StatsCalculator> stats_calculator_;
 };
 
 }  // namespace tensorflow
 
-#endif  // TENSORFLOW_UTIL_STAT_SUMMARIZER_H_
+#endif  // TENSORFLOW_CORE_UTIL_STAT_SUMMARIZER_H_
diff --git a/tensorflow/core/util/stat_summarizer_options.h b/tensorflow/core/util/stat_summarizer_options.h
new file mode 100644
index 0000000000000000000000000000000000000000..578020676b7698e1ea3fa466efc933200581036e
--- /dev/null
+++ b/tensorflow/core/util/stat_summarizer_options.h
@@ -0,0 +1,43 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CORE_UTIL_STAT_SUMMARIZER_OPTIONS_H_
+#define TENSORFLOW_CORE_UTIL_STAT_SUMMARIZER_OPTIONS_H_
+namespace tensorflow {
+// Used to control the output of the statistics summarizer;
+class StatSummarizerOptions {
+ public:
+  StatSummarizerOptions()
+      : show_run_order(true),
+        run_order_limit(0),
+        show_time(true),
+        time_limit(10),
+        show_memory(true),
+        memory_limit(10),
+        show_type(true),
+        show_summary(true) {}
+
+  bool show_run_order;
+  int run_order_limit;
+  bool show_time;
+  int time_limit;
+  bool show_memory;
+  int memory_limit;
+  bool show_type;
+  bool show_summary;
+};
+}  // namespace tensorflow
+
+#endif  // TENSORFLOW_CORE_UTIL_STAT_SUMMARIZER_OPTIONS_H_
diff --git a/tensorflow/core/util/stats_calculator.cc b/tensorflow/core/util/stats_calculator.cc
new file mode 100644
index 0000000000000000000000000000000000000000..c4befbdb84fec9b09d345a344f249ec758883409
--- /dev/null
+++ b/tensorflow/core/util/stats_calculator.cc
@@ -0,0 +1,280 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/util/stats_calculator.h"
+
+#include <iomanip>
+#include <map>
+#include <queue>
+#include <sstream>
+#include <string>
+
+namespace tensorflow {
+
+StatsCalculator::StatsCalculator(const StatSummarizerOptions& options)
+    : options_(options) {}
+
+std::string StatsCalculator::GetShortSummary() const {
+  std::stringstream stream;
+  stream << "Timings (microseconds): ";
+  run_total_us_.OutputToStream(&stream);
+  stream << std::endl;
+
+  stream << "Memory (bytes): ";
+  memory_.OutputToStream(&stream);
+  stream << std::endl;
+
+  stream << details_.size() << " nodes observed" << std::endl;
+  return stream.str();
+}
+
+std::ostream& InitField(std::ostream& stream, int width) {
+  stream << "\t" << std::right << std::setw(width) << std::fixed
+         << std::setprecision(3);
+  return stream;
+}
+
+std::string StatsCalculator::HeaderString(const std::string& title) const {
+  std::stringstream stream;
+
+  stream << "============================== " << title
+         << " ==============================" << std::endl;
+
+  InitField(stream, 24) << "[node type]";
+  InitField(stream, 9) << "[start]";
+  InitField(stream, 9) << "[first]";
+  InitField(stream, 9) << "[avg ms]";
+  InitField(stream, 8) << "[%]";
+  InitField(stream, 8) << "[cdf%]";
+  InitField(stream, 10) << "[mem KB]";
+  InitField(stream, 9) << "[times called]";
+  stream << "\t"
+         << "[Name]";
+  return stream.str();
+}
+
+std::string StatsCalculator::ColumnString(const Detail& detail,
+                                          const int64_t cumulative_stat_on_node,
+                                          const Stat<int64_t>& stat) const {
+  const double start_ms = detail.start_us.avg() / 1000.0;
+  const double first_time_ms = detail.rel_end_us.first() / 1000.0;
+  const double avg_time_ms = detail.rel_end_us.avg() / 1000.0;
+  const double percentage = detail.rel_end_us.sum() * 100.0 / stat.sum();
+  const double cdf_percentage = (cumulative_stat_on_node * 100.0f) / stat.sum();
+  const int64_t times_called = detail.times_called / num_runs();
+
+  std::stringstream stream;
+  InitField(stream, 24) << detail.type;
+  InitField(stream, 9) << start_ms;
+  InitField(stream, 9) << first_time_ms;
+  InitField(stream, 9) << avg_time_ms;
+  InitField(stream, 7) << percentage << "%";
+  InitField(stream, 7) << cdf_percentage << "%";
+  InitField(stream, 10) << detail.mem_used.newest() / 1000.0;
+  InitField(stream, 9) << times_called;
+  stream << "\t" << detail.name;
+
+  return stream.str();
+}
+
+void StatsCalculator::OrderNodesByMetric(
+    SortingMetric metric, std::vector<const Detail*>* details) const {
+  std::priority_queue<std::pair<std::string, const Detail*>> sorted_list;
+  const int num_nodes = details_.size();
+
+  for (const auto& det : details_) {
+    const Detail* detail = &(det.second);
+    std::stringstream stream;
+    stream << std::setw(20) << std::right << std::setprecision(10)
+           << std::fixed;
+
+    switch (metric) {
+      case BY_NAME:
+        stream << detail->name;
+        break;
+      case BY_RUN_ORDER:
+        stream << num_nodes - detail->run_order;
+        break;
+      case BY_TIME:
+        stream << detail->rel_end_us.avg();
+        break;
+      case BY_MEMORY:
+        stream << detail->mem_used.avg();
+        break;
+      case BY_TYPE:
+        stream << detail->type;
+        break;
+      default:
+        stream << "";
+        break;
+    }
+
+    sorted_list.emplace(stream.str(), detail);
+  }
+
+  while (!sorted_list.empty()) {
+    auto entry = sorted_list.top();
+    sorted_list.pop();
+    details->push_back(entry.second);
+  }
+}
+
+void StatsCalculator::ComputeStatsByType(
+    std::map<std::string, int64_t>* node_type_map_count,
+    std::map<std::string, int64_t>* node_type_map_time,
+    std::map<std::string, int64_t>* node_type_map_memory,
+    std::map<std::string, int64_t>* node_type_map_times_called,
+    int64_t* accumulated_us) const {
+  int64_t run_count = run_total_us_.count();
+
+  for (const auto& det : details_) {
+    const std::string node_name = det.first;
+    const Detail& detail = det.second;
+
+    int64_t curr_time_val =
+        static_cast<int64_t>(detail.rel_end_us.sum() / run_count);
+    *accumulated_us += curr_time_val;
+
+    int64_t curr_memory_val = detail.mem_used.newest();
+
+    const std::string& node_type = detail.type;
+
+    (*node_type_map_count)[node_type] += 1;
+    (*node_type_map_time)[node_type] += curr_time_val;
+    (*node_type_map_memory)[node_type] += curr_memory_val;
+    (*node_type_map_times_called)[node_type] += detail.times_called / run_count;
+  }
+}
+
+std::string StatsCalculator::GetStatsByNodeType() const {
+  std::stringstream stream;
+
+  stream << "Number of nodes executed: " << details_.size() << std::endl;
+
+  stream << "============================== Summary by node type "
+            "=============================="
+         << std::endl;
+
+  std::map<std::string, int64_t> node_type_map_count;
+  std::map<std::string, int64_t> node_type_map_time;
+  std::map<std::string, int64_t> node_type_map_memory;
+  std::map<std::string, int64_t> node_type_map_times_called;
+  int64_t accumulated_us = 0;
+
+  ComputeStatsByType(&node_type_map_count, &node_type_map_time,
+                     &node_type_map_memory, &node_type_map_times_called,
+                     &accumulated_us);
+
+  // Sort them.
+  std::priority_queue<std::pair<int64_t, std::pair<std::string, int64_t>>>
+      timings;
+  for (const auto& node_type : node_type_map_time) {
+    const int64_t mem_used = node_type_map_memory[node_type.first];
+    timings.emplace(node_type.second,
+                    std::pair<std::string, int64_t>(node_type.first, mem_used));
+  }
+
+  InitField(stream, 24) << "[Node type]";
+  InitField(stream, 9) << "[count]";
+  InitField(stream, 10) << "[avg ms]";
+  InitField(stream, 11) << "[avg %]";
+  InitField(stream, 11) << "[cdf %]";
+  InitField(stream, 10) << "[mem KB]";
+  InitField(stream, 10) << "[times called]";
+  stream << std::endl;
+
+  float cdf = 0.0f;
+  while (!timings.empty()) {
+    auto entry = timings.top();
+    timings.pop();
+
+    const std::string node_type = entry.second.first;
+    const float memory = entry.second.second / 1000.0f;
+
+    const int64_t node_type_total_us = entry.first;
+    const float time_per_run_ms = node_type_total_us / 1000.0f;
+
+    const float percentage =
+        ((entry.first / static_cast<float>(accumulated_us)) * 100.0f);
+    cdf += percentage;
+
+    InitField(stream, 24) << node_type;
+    InitField(stream, 9) << node_type_map_count[node_type];
+    InitField(stream, 10) << time_per_run_ms;
+    InitField(stream, 10) << percentage << "%";
+    InitField(stream, 10) << cdf << "%";
+    InitField(stream, 10) << memory;
+    InitField(stream, 9) << node_type_map_times_called[node_type];
+    stream << std::endl;
+  }
+  stream << std::endl;
+  return stream.str();
+}
+
+std::string StatsCalculator::GetStatsByMetric(const std::string& title,
+                                              SortingMetric sorting_metric,
+                                              int num_stats) const {
+  std::vector<const Detail*> details;
+  OrderNodesByMetric(sorting_metric, &details);
+
+  double cumulative_stat_on_node = 0;
+
+  std::stringstream stream;
+  stream << HeaderString(title) << std::endl;
+  int stat_num = 0;
+  for (auto detail : details) {
+    ++stat_num;
+    if (num_stats > 0 && stat_num > num_stats) {
+      break;
+    }
+
+    // TODO(andrewharp): Make this keep track of the particular metric for cdf.
+    cumulative_stat_on_node += detail->rel_end_us.sum();
+    stream << ColumnString(*detail, cumulative_stat_on_node, run_total_us_)
+           << std::endl;
+  }
+  stream << std::endl;
+  return stream.str();
+}
+
+std::string StatsCalculator::GetOutputString() const {
+  std::stringstream stream;
+  if (options_.show_run_order) {
+    stream << GetStatsByMetric("Run Order", BY_RUN_ORDER,
+                               options_.run_order_limit);
+  }
+  if (options_.show_time) {
+    stream << GetStatsByMetric("Top by Computation Time", BY_TIME,
+                               options_.time_limit);
+  }
+  if (options_.show_memory) {
+    stream << GetStatsByMetric("Top by Memory Use", BY_MEMORY,
+                               options_.memory_limit);
+  }
+  if (options_.show_type) {
+    stream << GetStatsByNodeType();
+  }
+  if (options_.show_summary) {
+    stream << GetShortSummary() << std::endl;
+  }
+  return stream.str();
+}
+
+void StatsCalculator::UpdateDetails(
+    const std::map<std::string, Detail>& details) {
+  details_.insert(details.begin(), details.end());
+}
+
+}  // namespace tensorflow
diff --git a/tensorflow/core/util/stats_calculator.h b/tensorflow/core/util/stats_calculator.h
new file mode 100644
index 0000000000000000000000000000000000000000..39cef816f1987ca1d3e2e9f3bcb4aab58894c321
--- /dev/null
+++ b/tensorflow/core/util/stats_calculator.h
@@ -0,0 +1,186 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CORE_UTIL_STATS_CALCULATOR_H_
+#define TENSORFLOW_CORE_UTIL_STATS_CALCULATOR_H_
+
+#include <stdlib.h>
+
+#include <cmath>
+#include <limits>
+#include <map>
+#include <sstream>
+#include <string>
+#include <vector>
+
+#include "tensorflow/core/util/stat_summarizer_options.h"
+
+namespace tensorflow {
+
+template <typename ValueType, typename HighPrecisionValueType = double>
+class Stat {
+ public:
+  void UpdateStat(ValueType v) {
+    if (count_ == 0) {
+      first_ = v;
+    }
+
+    newest_ = v;
+    max_ = std::max(v, max_);
+    min_ = std::min(v, min_);
+    ++count_;
+    sum_ += v;
+    squared_sum_ += static_cast<HighPrecisionValueType>(v) * v;
+  }
+
+  void Reset() { new (this) Stat<ValueType, HighPrecisionValueType>(); }
+
+  bool empty() const { return count_ == 0; }
+
+  ValueType first() const { return first_; }
+
+  ValueType newest() const { return newest_; }
+
+  ValueType max() const { return max_; }
+
+  ValueType min() const { return min_; }
+
+  int64_t count() const { return count_; }
+
+  ValueType sum() const { return sum_; }
+
+  HighPrecisionValueType squared_sum() const { return squared_sum_; }
+
+  bool all_same() const { return (count_ == 0 || min_ == max_); }
+
+  HighPrecisionValueType avg() const {
+    return empty() ? std::numeric_limits<ValueType>::quiet_NaN()
+                   : static_cast<HighPrecisionValueType>(sum_) / count_;
+  }
+
+  ValueType std_deviation() const {
+    return all_same() ? 0 : sqrt(squared_sum_ / count_ - avg() * avg());
+  }
+
+  void OutputToStream(std::ostream* stream) const {
+    if (empty()) {
+      *stream << "count=0";
+    } else if (all_same()) {
+      *stream << "count=" << count_ << " curr=" << newest_;
+      if (count_ > 1) *stream << "(all same)";
+    } else {
+      *stream << "count=" << count_ << " first=" << first_
+              << " curr=" << newest_ << " min=" << min_ << " max=" << max_
+              << " avg=" << avg() << " std=" << std_deviation();
+    }
+  }
+
+  friend std::ostream& operator<<(std::ostream& stream,
+                                  const Stat<ValueType>& stat) {
+    stat.OutputToStream(&stream);
+    return stream;
+  }
+
+ private:
+  ValueType first_ = 0;
+  ValueType newest_ = 0;
+  ValueType max_ = std::numeric_limits<ValueType>::min();
+  ValueType min_ = std::numeric_limits<ValueType>::max();
+  int64_t count_ = 0;
+  ValueType sum_ = 0;
+  HighPrecisionValueType squared_sum_ = 0;
+};
+
+// A StatsCalculator assists in performance analysis of Graph executions.
+//
+// It summarizes time spent executing (on GPU/CPU), memory used etc for
+// graph execution.
+//
+// For example usage see StatsSummarizer.
+class StatsCalculator {
+ public:
+  enum SortingMetric {
+    BY_NAME,
+    BY_RUN_ORDER,
+    BY_TIME,
+    BY_MEMORY,
+    BY_TYPE,
+  };
+
+  explicit StatsCalculator(const StatSummarizerOptions& options);
+
+  // Returns a string detailing the accumulated runtime stats in a tab-separated
+  // format which can be pasted into a spreadsheet for further analysis.
+  std::string GetOutputString() const;
+
+  std::string GetShortSummary() const;
+
+  void ComputeStatsByType(
+      std::map<std::string, int64_t>* node_type_map_count,
+      std::map<std::string, int64_t>* node_type_map_time,
+      std::map<std::string, int64_t>* node_type_map_memory,
+      std::map<std::string, int64_t>* node_type_map_times_called,
+      int64_t* accumulated_us) const;
+
+  std::string GetStatsByNodeType() const;
+
+  std::string GetStatsByMetric(const std::string& title,
+                               SortingMetric sorting_metric,
+                               int num_stats) const;
+
+  // Returns number of runs.
+  int num_runs() const { return static_cast<int>(run_total_us_.count()); }
+
+  // Returns stats of total microseconds spent by all nodes in each run.
+  const Stat<int64_t>& run_total_us() const { return run_total_us_; }
+
+  void UpdateRunTotalUs(int64_t run_total_us) {
+    run_total_us_.UpdateStat(run_total_us);
+  }
+
+  void UpdateMemoryUsed(int64_t memory) { memory_.UpdateStat(memory); }
+
+  struct Detail {
+    std::string name;
+    std::string type;
+    int64_t run_order;
+    Stat<int64_t> start_us;
+    Stat<int64_t> rel_end_us;
+    Stat<int64_t> mem_used;
+    int64_t times_called;
+  };
+
+  const std::map<std::string, Detail>& GetDetails() const { return details_; }
+  void UpdateDetails(const std::map<std::string, Detail>& details);
+
+ private:
+  void OrderNodesByMetric(SortingMetric sorting_metric,
+                          std::vector<const Detail*>* details) const;
+
+  std::string HeaderString(const std::string& title) const;
+  std::string ColumnString(const Detail& detail,
+                           const int64_t cumulative_stat_on_node,
+                           const Stat<int64_t>& stat) const;
+
+  Stat<int64_t> run_total_us_;
+  Stat<int64_t> memory_;
+
+  std::map<std::string, Detail> details_;
+  StatSummarizerOptions options_;
+};
+
+}  // namespace tensorflow
+
+#endif  // TENSORFLOW_CORE_UTIL_STATS_CALCULATOR_H_
diff --git a/tensorflow/core/util/work_sharder.cc b/tensorflow/core/util/work_sharder.cc
index 7922fc9224e8c41f1cef7d5fe508d1a4aba4759c..337af07b50872f6a502938978423af88426d01da 100644
--- a/tensorflow/core/util/work_sharder.cc
+++ b/tensorflow/core/util/work_sharder.cc
@@ -35,7 +35,7 @@ void Shard(int max_parallelism, thread::ThreadPool* workers, int64 total,
     workers->ParallelFor(total, cost_per_unit, work);
     return;
   }
-  cost_per_unit = std::max(1LL, cost_per_unit);
+  cost_per_unit = std::max(int64{1}, cost_per_unit);
   // We shard [0, total) into "num_shards" shards.
   //   1 <= num_shards <= num worker threads
   //
diff --git a/tensorflow/docs_src/api_guides/python/contrib.bayesflow.monte_carlo.md b/tensorflow/docs_src/api_guides/python/contrib.bayesflow.monte_carlo.md
deleted file mode 100644
index 74fe4a323aafbed986a094daa8a6d5f010521ba4..0000000000000000000000000000000000000000
--- a/tensorflow/docs_src/api_guides/python/contrib.bayesflow.monte_carlo.md
+++ /dev/null
@@ -1,50 +0,0 @@
-# BayesFlow Monte Carlo (contrib)
-[TOC]
-
-Monte Carlo integration and helpers.
-
-## Background
-
-Monte Carlo integration refers to the practice of estimating an expectation with
-a sample mean.  For example, given random variable Z in \\(R^k\\) with density `p`,
-the expectation of function `f` can be approximated like:
-
-$$E_p[f(Z)] = \int f(z) p(z) dz$$
-$$          ~ S_n
-          := n^{-1} \sum_{i=1}^n f(z_i),  z_i\ iid\ samples\ from\ p.$$
-
-If \\(E_p[|f(Z)|] < infinity\\), then \\(S_n\\) --> \\(E_p[f(Z)]\\) by the strong law of large
-numbers.  If \\(E_p[f(Z)^2] < infinity\\), then \\(S_n\\) is asymptotically normal with
-variance \\(Var[f(Z)] / n\\).
-
-Practitioners of Bayesian statistics often find themselves wanting to estimate
-\\(E_p[f(Z)]\\) when the distribution `p` is known only up to a constant.  For
-example, the joint distribution `p(z, x)` may be known, but the evidence
-\\(p(x) = \int p(z, x) dz\\) may be intractable.  In that case, a parameterized
-distribution family \\(q_\lambda(z)\\) may be chosen, and the optimal \\(\lambda\\) is the
-one minimizing the KL divergence between \\(q_\lambda(z)\\) and
-\\(p(z | x)\\).  We only know `p(z, x)`, but that is sufficient to find \\(\lambda\\).
-
-
-## Log-space evaluation and subtracting the maximum
-
-Care must be taken when the random variable lives in a high dimensional space.
-For example, the naive importance sample estimate \\(E_q[f(Z) p(Z) / q(Z)]\\)
-involves the ratio of two terms \\(p(Z) / q(Z)\\), each of which must have tails
-dropping off faster than \\(O(|z|^{-(k + 1)})\\) in order to have finite integral.
-This ratio would often be zero or infinity up to numerical precision.
-
-For that reason, we write
-
-$$Log E_q[ f(Z) p(Z) / q(Z) ]$$
-$$   = Log E_q[ \exp\{Log[f(Z)] + Log[p(Z)] - Log[q(Z)] - C\} ] + C,$$  where
-$$C := Max[ Log[f(Z)] + Log[p(Z)] - Log[q(Z)] ].$$
-
-The maximum value of the exponentiated term will be 0.0, and the expectation
-can be evaluated in a stable manner.
-
-## Ops
-
-*   @{tf.contrib.bayesflow.monte_carlo.expectation}
-*   @{tf.contrib.bayesflow.monte_carlo.expectation_importance_sampler}
-*   @{tf.contrib.bayesflow.monte_carlo.expectation_importance_sampler_logspace}
diff --git a/tensorflow/docs_src/api_guides/python/contrib.distributions.bijectors.md b/tensorflow/docs_src/api_guides/python/contrib.distributions.bijectors.md
deleted file mode 100644
index e169897f31717d994a0229f1e1b485874d2b0572..0000000000000000000000000000000000000000
--- a/tensorflow/docs_src/api_guides/python/contrib.distributions.bijectors.md
+++ /dev/null
@@ -1,32 +0,0 @@
-# Random variable transformations (contrib)
-[TOC]
-
-Bijector Ops.
-
-An API for invertible, differentiable transformations of random variables.
-
-## Background
-
-Differentiable, bijective transformations of continuous random variables alter
-the calculations made in the cumulative/probability distribution functions and
-sample function.  This module provides a standard interface for making these
-manipulations.
-
-For more details and examples, see the `Bijector` docstring.
-
-To apply a `Bijector`, use `distributions.TransformedDistribution`.
-
-## Bijectors
-
-*   @{tf.contrib.distributions.bijectors.Affine}
-*   @{tf.contrib.distributions.bijectors.AffineLinearOperator}
-*   @{tf.contrib.distributions.bijectors.Bijector}
-*   @{tf.contrib.distributions.bijectors.Chain}
-*   @{tf.contrib.distributions.bijectors.CholeskyOuterProduct}
-*   @{tf.contrib.distributions.bijectors.Exp}
-*   @{tf.contrib.distributions.bijectors.Identity}
-*   @{tf.contrib.distributions.bijectors.Inline}
-*   @{tf.contrib.distributions.bijectors.Invert}
-*   @{tf.contrib.distributions.bijectors.PowerTransform}
-*   @{tf.contrib.distributions.bijectors.SoftmaxCentered}
-*   @{tf.contrib.distributions.bijectors.Softplus}
diff --git a/tensorflow/docs_src/api_guides/python/contrib.distributions.md b/tensorflow/docs_src/api_guides/python/contrib.distributions.md
deleted file mode 100644
index 533d7dac1373f61ca92dba288a7d29e07e0f37d3..0000000000000000000000000000000000000000
--- a/tensorflow/docs_src/api_guides/python/contrib.distributions.md
+++ /dev/null
@@ -1,83 +0,0 @@
-# Statistical Distributions (contrib)
-[TOC]
-
-Classes representing statistical distributions and ops for working with them.
-
-## Classes for statistical distributions
-
-Classes that represent batches of statistical distributions.  Each class is
-initialized with parameters that define the distributions.
-
-## Base classes
-
-*   @{tf.contrib.distributions.ReparameterizationType}
-*   @{tf.contrib.distributions.Distribution}
-
-## Univariate (scalar) distributions
-
-*   @{tf.contrib.distributions.Binomial}
-*   @{tf.contrib.distributions.Bernoulli}
-*   @{tf.contrib.distributions.Beta}
-*   @{tf.contrib.distributions.Categorical}
-*   @{tf.contrib.distributions.Chi2}
-*   @{tf.contrib.distributions.Chi2WithAbsDf}
-*   @{tf.contrib.distributions.Exponential}
-*   @{tf.contrib.distributions.Gamma}
-*   @{tf.contrib.distributions.InverseGamma}
-*   @{tf.contrib.distributions.Laplace}
-*   @{tf.contrib.distributions.LaplaceWithSoftplusScale}
-*   @{tf.contrib.distributions.Normal}
-*   @{tf.contrib.distributions.NormalWithSoftplusScale}
-*   @{tf.contrib.distributions.Poisson}
-*   @{tf.contrib.distributions.StudentT}
-*   @{tf.contrib.distributions.StudentTWithAbsDfSoftplusScale}
-*   @{tf.contrib.distributions.Uniform}
-
-## Multivariate distributions
-
-### Multivariate normal
-
-*   @{tf.contrib.distributions.MultivariateNormalDiag}
-*   @{tf.contrib.distributions.MultivariateNormalTriL}
-*   @{tf.contrib.distributions.MultivariateNormalDiagPlusLowRank}
-*   @{tf.contrib.distributions.MultivariateNormalDiagWithSoftplusScale}
-
-### Other multivariate distributions
-
-*   @{tf.contrib.distributions.Dirichlet}
-*   @{tf.contrib.distributions.DirichletMultinomial}
-*   @{tf.contrib.distributions.Multinomial}
-*   @{tf.contrib.distributions.WishartCholesky}
-*   @{tf.contrib.distributions.WishartFull}
-
-### Multivariate Utilities
-
-*   @{tf.contrib.distributions.matrix_diag_transform}
-
-## Transformed distributions
-
-*   @{tf.contrib.distributions.TransformedDistribution}
-*   @{tf.contrib.distributions.QuantizedDistribution}
-
-## Mixture Models
-
-*   @{tf.contrib.distributions.Mixture}
-
-## Posterior inference with conjugate priors
-
-Functions that transform conjugate prior/likelihood pairs to distributions
-representing the posterior or posterior predictive.
-
-## Normal likelihood with conjugate prior
-
-*   @{tf.contrib.distributions.normal_conjugates_known_scale_posterior}
-*   @{tf.contrib.distributions.normal_conjugates_known_scale_predictive}
-
-## Kullback-Leibler Divergence
-
-*   @{tf.contrib.distributions.kl_divergence}
-*   @{tf.contrib.distributions.RegisterKL}
-
-## Utilities
-
-*   @{tf.contrib.distributions.softplus_inverse}
diff --git a/tensorflow/docs_src/community/leftnav_files b/tensorflow/docs_src/community/leftnav_files
index 2bae60d9ddc5c18f67b1611054ac58b072e9674a..0bd1f14de9817baae034233b112e6dfdbb17d2a9 100644
--- a/tensorflow/docs_src/community/leftnav_files
+++ b/tensorflow/docs_src/community/leftnav_files
@@ -6,4 +6,3 @@ groups.md
 documentation.md
 style_guide.md
 benchmarks.md
-swift.md
diff --git a/tensorflow/docs_src/community/security.md b/tensorflow/docs_src/community/security.md
deleted file mode 100644
index 8d13c7a1eaf336193b39ea3a9ee8e316a04fcb63..0000000000000000000000000000000000000000
--- a/tensorflow/docs_src/community/security.md
+++ /dev/null
@@ -1,7 +0,0 @@
-# Using TensorFlow Securely
-
-Before using TensorFlow, please take a look at our security model, list of
-recent security announcements, and ways you can report security issues to the
-TensorFlow team at the
-[https://github.com/tensorflow/tensorflow/blob/master/SECURITY.md](Using
-TensorFlow Securely) page on GitHub.
diff --git a/tensorflow/docs_src/community/swift.md b/tensorflow/docs_src/community/swift.md
deleted file mode 100644
index d1625d3b93e2a95229afd897653c113bd45da2ea..0000000000000000000000000000000000000000
--- a/tensorflow/docs_src/community/swift.md
+++ /dev/null
@@ -1,60 +0,0 @@
-<p align="center">
-  <img src="../images/swift_tensorflow_logo.png">
-</p>
-
-# Swift for TensorFlow
-
-Welcome to the Swift for TensorFlow development community!
-
-Swift for TensorFlow is a new way to develop machine learning models. It
-gives you the power of
-[TensorFlow](https://www.tensorflow.org) directly
-integrated into the [Swift programming language](https://swift.org/about).
-With Swift, you can write the following imperative code, and Swift
-automatically turns it into **a single TensorFlow Graph** and runs it
-with the full performance of TensorFlow Sessions on CPU, GPU and
-[TPU](https://cloud.google.com/tpu/docs/tpus).
-
-```swift
-import TensorFlow
-
-var x = Tensor<Float>([[1, 2], [3, 4]])
-
-for i in 1...5 {
-  x += x ⊗ x
-}
-
-print(x)
-```
-
-Swift combines the flexibility of
-[Eager Execution](https://www.tensorflow.org/programmers_guide/eager) with the
-high performance of [Graphs and Sessions](https://www.tensorflow.org/programmers_guide/graphs).
-Behind the scenes, Swift analyzes your Tensor code and automatically builds
-graphs for you. Swift also catches type errors and shape mismatches before
-running your code, and has [Automatic Differentiation](https://en.wikipedia.org/wiki/Automatic_differentiation)
-built right in. We believe that machine learning tools are so important that
-they deserve **a first-class language and a compiler**.
-
-Note: Swift for TensorFlow is an early stage research project. It has been
-released to enable open source development and is not yet ready for general use
-by machine learning developers.
-
-## Open Source
-
-We have released Swift for TensorFlow as an open-source project on GitHub!
-
-Our [documentation repository](https://github.com/tensorflow/swift) contains a
-[project overview](https://github.com/tensorflow/swift/blob/master/docs/DesignOverview.md)
-and [technical papers](https://github.com/tensorflow/swift/tree/master/docs)
-explaining specific areas in depth. There are also instructions for [installing
-pre-built packages](https://github.com/tensorflow/swift/blob/master/Installation.md)
-(for macOS and Ubuntu) as well as a simple
-[usage tutorial](https://github.com/tensorflow/swift/blob/master/Usage.md).
-
-Moving forward, we will use an open design model and all discussions will be
-public.
-
-[Sign up here to join the community Google
-group](https://groups.google.com/a/tensorflow.org/d/forum/swift), which we will
-use for announcements and general discussion.
diff --git a/tensorflow/docs_src/extend/new_data_formats.md b/tensorflow/docs_src/extend/new_data_formats.md
index 2c33a6b6f7e5f1faf04d38e95b74d184134a1edf..d1d1f69766987c2ff02d4ba447a02c0bd91f3a1b 100644
--- a/tensorflow/docs_src/extend/new_data_formats.md
+++ b/tensorflow/docs_src/extend/new_data_formats.md
@@ -45,7 +45,7 @@ Each of these implementations comprises three related classes:
 * A `tensorflow::GraphDatasetBase` subclass (e.g. `TextLineDatasetOp::Dataset`),
   which represents the *immutable* definition of the dataset itself, and tells
   TensorFlow how to construct an iterator object over that dataset, in its
-  `MakeIterator()` method.
+  `MakeIteratorInternal()` method.
 
 * A `tensorflow::DatasetIterator<Dataset>` subclass (e.g.
   `TextLineDatasetOp::Dataset::Iterator`), which represents the *mutable* state
@@ -103,7 +103,7 @@ class MyReaderDatasetOp : public DatasetOpKernel {
    public:
     Dataset(OpKernelContext* ctx) : GraphDatasetBase(ctx) {}
 
-    std::unique_ptr<IteratorBase> MakeIterator(
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
       return std::unique_ptr<IteratorBase>(
           new Iterator({this, strings::StrCat(prefix, "::MyReader")}));
@@ -124,7 +124,7 @@ class MyReaderDatasetOp : public DatasetOpKernel {
       return *shapes;
     }
 
-    string DebugString() override { return "MyReaderDatasetOp::Dataset"; }
+    string DebugString() const override { return "MyReaderDatasetOp::Dataset"; }
 
    protected:
     // Optional: Implementation of `GraphDef` serialization for this dataset.
diff --git a/tensorflow/docs_src/get_started/datasets_quickstart.md b/tensorflow/docs_src/get_started/datasets_quickstart.md
index c972e5e555eea1fab5a67fdecf13264897785519..020e40dd3b8f046f0144e3806468f58833f7b607 100644
--- a/tensorflow/docs_src/get_started/datasets_quickstart.md
+++ b/tensorflow/docs_src/get_started/datasets_quickstart.md
@@ -14,7 +14,7 @@ introduces the API by walking through two simple examples:
 
 Taking slices from an array is the simplest way to get started with `tf.data`.
 
-The @{$get_started/premade_estimators$Premade Estimators} chapter describes
+The @{$premade_estimators$Premade Estimators} chapter describes
 the following `train_input_fn`, from
 [`iris_data.py`](https://github.com/tensorflow/models/blob/master/samples/core/get_started/iris_data.py),
 to pipe the data into the Estimator:
@@ -377,7 +377,7 @@ Now you have the basic idea of how to efficiently load data into an
 Estimator. Consider the following documents next:
 
 
-* @{$get_started/custom_estimators}, which demonstrates how to build your own
+* @{$custom_estimators}, which demonstrates how to build your own
   custom `Estimator` model.
 * The @{$low_level_intro#datasets$Low Level Introduction}, which demonstrates
   how to experiment directly with `tf.data.Datasets` using TensorFlow's low
diff --git a/tensorflow/docs_src/get_started/eager.md b/tensorflow/docs_src/get_started/eager.md
index f08ac74425b6dc9e4974dd7ec9e8486e7f56b7bb..bbb25e20c62f6a2eec78668250a0e748494797c5 100644
--- a/tensorflow/docs_src/get_started/eager.md
+++ b/tensorflow/docs_src/get_started/eager.md
@@ -1,3 +1,3 @@
 # Get Started with Eager Execution
 
-[Colab notebook](https://colab.research.google.com/github/tensorflow/models/blob/r1.8.0/samples/core/get_started/eager.ipynb)
+[Colab notebook](https://colab.research.google.com/github/tensorflow/models/blob/r1.9.0/samples/core/get_started/eager.ipynb)
diff --git a/tensorflow/docs_src/get_started/get_started_for_beginners.md b/tensorflow/docs_src/get_started/get_started_for_beginners.md
deleted file mode 100644
index d5a80e22c5dd199b9e76fb4268871df5100d9934..0000000000000000000000000000000000000000
--- a/tensorflow/docs_src/get_started/get_started_for_beginners.md
+++ /dev/null
@@ -1,751 +0,0 @@
-# Get Started with Graph Execution
-
-This document explains how to use machine learning to classify (categorize)
-Iris flowers by species.  This document dives deeply into the TensorFlow
-code to do exactly that, explaining ML fundamentals along the way.
-
-If the following list describes you, then you are in the right place:
-
-*   You know little to nothing about machine learning.
-*   You want to learn how to write TensorFlow programs.
-*   You can code (at least a little) in Python.
-
-If you are already familiar with basic machine learning concepts
-but are new to TensorFlow, read
-@{$premade_estimators$Getting Started with TensorFlow: for ML Experts}.
-
-If you'd like to learn a lot about the basics of Machine Learning,
-consider taking
-[Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course/).
-
-
-## The Iris classification problem
-
-Imagine you are a botanist seeking an automated way to classify each
-Iris flower you find.  Machine learning provides many ways to classify flowers.
-For instance, a sophisticated machine learning program could classify flowers
-based on photographs.  Our ambitions are more modest--we're going to classify
-Iris flowers based solely on the length and width of their
-[sepals](https://en.wikipedia.org/wiki/Sepal) and
-[petals](https://en.wikipedia.org/wiki/Petal).
-
-The Iris genus entails about 300 species, but our program will classify only
-the following three:
-
-*   Iris setosa
-*   Iris virginica
-*   Iris versicolor
-
-<div style="margin:auto; margin-bottom:10px; margin-top:20px;">
-<img style="width:100%"
-  alt="Petal geometry compared for three iris species: Iris setosa, Iris virginica, and Iris versicolor"
-  src="../images/iris_three_species.jpg">
-</div>
-
-**From left to right,
-[*Iris setosa*](https://commons.wikimedia.org/w/index.php?curid=170298) (by
-[Radomil](https://commons.wikimedia.org/wiki/User:Radomil), CC BY-SA 3.0),
-[*Iris versicolor*](https://commons.wikimedia.org/w/index.php?curid=248095) (by
-[Dlanglois](https://commons.wikimedia.org/wiki/User:Dlanglois), CC BY-SA 3.0),
-and [*Iris virginica*](https://www.flickr.com/photos/33397993@N05/3352169862)
-(by [Frank Mayfield](https://www.flickr.com/photos/33397993@N05), CC BY-SA
-2.0).**
-<p>&nbsp;</p>
-
-Fortunately, someone has already created [a data set of 120 Iris
-flowers](https://en.wikipedia.org/wiki/Iris_flower_data_set)
-with the sepal and petal measurements.  This data set has become
-one of the canonical introductions to machine learning classification problems.
-(The [MNIST database](https://en.wikipedia.org/wiki/MNIST_database),
-which contains handwritten digits, is another popular classification
-problem.) The first 5 entries of the Iris data set
-look as follows:
-
-| Sepal length | sepal width | petal length | petal width | species
-| ---          | ---         | ---          | ---         | ---
-|6.4           | 2.8         | 5.6          | 2.2         | 2
-|5.0           | 2.3         | 3.3          | 1.0         | 1
-|4.9           | 2.5         | 4.5          | 1.7         | 2
-|4.9           | 3.1         | 1.5          | 0.1         | 0
-|5.7           | 3.8         | 1.7          | 0.3         | 0
-
-Let's introduce some terms:
-
-*   The last column (species) is called the
-    [**label**](https://developers.google.com/machine-learning/glossary/#label);
-    the first four columns are called
-    [**features**](https://developers.google.com/machine-learning/glossary/#feature).
-    Features are characteristics of an example, while the label is
-    the thing we're trying to predict.
-
-*   An [**example**](https://developers.google.com/machine-learning/glossary/#example)
-    consists of the set of features and the label for one sample
-    flower. The preceding table shows 5 examples from a data set of
-    120 examples.
-
-Each label is naturally a string (for example, "setosa"), but machine learning
-typically relies on numeric values. Therefore, someone mapped each string to
-a number.  Here's the representation scheme:
-
-* 0 represents setosa
-* 1 represents versicolor
-* 2 represents virginica
-
-For a look at other examples of labels and examples, see the
-[ML Terminology section of Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course/framing/ml-terminology).
-
-
-## Models and training
-
-A **model** is the relationship between features
-and the label.  For the Iris problem, the model defines the relationship
-between the sepal and petal measurements and the predicted Iris species. Some
-simple models can be described with a few lines of algebra, but complex machine
-learning models have a large number of parameters that are difficult to
-summarize.
-
-Could you determine the relationship between the four features and the
-Iris species *without* using machine learning?  That is, could you use
-traditional programming techniques (for example, a lot of conditional
-statements) to create a model?  Maybe. You could play with the data set
-long enough to determine the right relationships of petal and sepal
-measurements to particular species.  However, a good machine learning
-approach *determines the model for you*.  That is, if you feed enough
-representative examples into the right machine learning model type, the program
-will determine the relationship between sepals, petals, and species.
-
-**Training** is the stage of machine learning in which the model is
-gradually optimized (learned).  The Iris problem is an example
-of [**supervised machine
-learning**](https://developers.google.com/machine-learning/glossary/#supervised_machine_learning)
-in which a model is trained from examples that contain labels.  (In
-[**unsupervised machine
-learning**](https://developers.google.com/machine-learning/glossary/#unsupervised_machine_learning),
-the examples don't contain labels. Instead, the model typically finds
-patterns among the features.)
-
-
-
-
-## Get the sample program
-
-Prior to playing with the sample code in this document, do the following:
-
-1.  @{$install$Install TensorFlow}.
-2.  If you installed TensorFlow with virtualenv or Anaconda, activate your
-    TensorFlow environment.
-3.  Install or upgrade pandas by issuing the following command:
-
-     `pip install pandas`
-
-
-Take the following steps to get the sample program:
-
-1. Clone the TensorFlow Models repository from github by entering the following
-   command:
-
-       `git clone https://github.com/tensorflow/models`
-
-2. Change directory within that branch to the location containing the examples
-   used in this document:
-
-       `cd models/samples/core/get_started/`
-
-In that `get_started` directory, you'll find a program
-named `premade_estimator.py`.
-
-
-## Run the sample program
-
-You run TensorFlow programs as you would run any Python program. Therefore,
-issue the following command from a command line to
-run `premade_estimators.py`:
-
-``` bash
-python premade_estimator.py
-```
-
-Running the program should output a whole bunch of information ending with
-three prediction lines like the following:
-
-```None
-...
-Prediction is "Setosa" (99.6%), expected "Setosa"
-
-Prediction is "Versicolor" (99.8%), expected "Versicolor"
-
-Prediction is "Virginica" (97.9%), expected "Virginica"
-```
-
-If the program generates errors instead of predictions, ask yourself the
-following questions:
-
-* Did you install TensorFlow properly?
-* Are you using the correct version of TensorFlow?  The `premade_estimators.py`
-  program requires at least TensorFlow v1.4.
-* If you installed TensorFlow with virtualenv or Anaconda, did you activate
-  the environment?
-
-
-
-## The TensorFlow programming stack
-
-As the following illustration shows, TensorFlow
-provides a programming stack consisting of multiple API layers:
-
-<div style="margin:auto; margin-bottom:10px; margin-top:20px;">
-<img style="width:100%" src="../images/tensorflow_programming_environment.png">
-</div>
-
-**The TensorFlow Programming Environment.**
-<p>&nbsp;</p>
-
-As you start writing TensorFlow programs, we strongly recommend focusing on
-the following two high-level APIs:
-
-*   Estimators
-*   Datasets
-
-Although we'll grab an occasional convenience function from other APIs,
-this document focuses on the preceding two APIs.
-
-
-## The program itself
-
-Thanks for your patience; let's dig into the code.
-The general outline of `premade_estimator.py`--and many other TensorFlow
-programs--is as follows:
-
-*   Import and parse the data sets.
-*   Create feature columns to describe the data.
-*   Select the type of model
-*   Train the model.
-*   Evaluate the model's effectiveness.
-*   Let the trained model make predictions.
-
-The following subsections detail each part.
-
-
-### Import and parse the data sets
-
-The Iris program requires the data from the following two .csv files:
-
-*   `http://download.tensorflow.org/data/iris_training.csv`, which contains
-    the training set.
-*   `http://download.tensorflow.org/data/iris_test.csv`, which contains the
-    test set.
-
-The **training set** contains the examples that we'll use to train the model;
-the **test set** contains the examples that we'll use to evaluate the trained
-model's effectiveness.
-
-The training set and test set started out as a
-single data set.  Then, someone split the examples, with the majority going into
-the training set and the remainder going into the test set.  Adding
-examples to the training set usually builds a better model; however, adding
-more examples to the test set enables us to better gauge the model's
-effectiveness. Regardless of the split, the examples in the test set
-must be separate from the examples in the training set.  Otherwise, you can't
-accurately determine the model's effectiveness.
-
-The `premade_estimators.py` program relies on the `load_data` function
-in the adjacent [`iris_data.py`](
-https://github.com/tensorflow/models/blob/master/samples/core/get_started/iris_data.py)
-file to read in and parse the training set and test set.
-Here is a heavily commented version of the function:
-
-```python
-TRAIN_URL = "http://download.tensorflow.org/data/iris_training.csv"
-TEST_URL = "http://download.tensorflow.org/data/iris_test.csv"
-
-CSV_COLUMN_NAMES = ['SepalLength', 'SepalWidth',
-                    'PetalLength', 'PetalWidth', 'Species']
-
-...
-
-def load_data(label_name='Species'):
-    """Parses the csv file in TRAIN_URL and TEST_URL."""
-
-    # Create a local copy of the training set.
-    train_path = tf.keras.utils.get_file(fname=TRAIN_URL.split('/')[-1],
-                                         origin=TRAIN_URL)
-    # train_path now holds the pathname: ~/.keras/datasets/iris_training.csv
-
-    # Parse the local CSV file.
-    train = pd.read_csv(filepath_or_buffer=train_path,
-                        names=CSV_COLUMN_NAMES,  # list of column names
-                        header=0  # ignore the first row of the CSV file.
-                       )
-    # train now holds a pandas DataFrame, which is data structure
-    # analogous to a table.
-
-    # 1. Assign the DataFrame's labels (the right-most column) to train_label.
-    # 2. Delete (pop) the labels from the DataFrame.
-    # 3. Assign the remainder of the DataFrame to train_features
-    train_features, train_label = train, train.pop(label_name)
-
-    # Apply the preceding logic to the test set.
-    test_path = tf.keras.utils.get_file(TEST_URL.split('/')[-1], TEST_URL)
-    test = pd.read_csv(test_path, names=CSV_COLUMN_NAMES, header=0)
-    test_features, test_label = test, test.pop(label_name)
-
-    # Return four DataFrames.
-    return (train_features, train_label), (test_features, test_label)
-```
-
-Keras is an open-sourced machine learning library; `tf.keras` is a TensorFlow
-implementation of Keras.  The `premade_estimator.py` program only accesses
-one `tf.keras` function; namely, the `tf.keras.utils.get_file` convenience
-function, which copies a remote CSV file to a local file system.
-
-The call to `load_data` returns two `(feature,label)` pairs, for the training
-and test sets respectively:
-
-```python
-    # Call load_data() to parse the CSV file.
-    (train_feature, train_label), (test_feature, test_label) = load_data()
-```
-
-Pandas is an open-source Python library leveraged by several
-TensorFlow functions.  A pandas
-[**DataFrame**](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)
-is a table with named columns headers and numbered rows.
-The features returned by `load_data` are packed in `DataFrames`.
-For example, the `test_feature` DataFrame looks as follows:
-
-```none
-    SepalLength  SepalWidth  PetalLength  PetalWidth
-0           5.9         3.0          4.2         1.5
-1           6.9         3.1          5.4         2.1
-2           5.1         3.3          1.7         0.5
-...
-27          6.7         3.1          4.7         1.5
-28          6.7         3.3          5.7         2.5
-29          6.4         2.9          4.3         1.3
-```
-
-
-### Describe the data
-
-A **feature column** is a data structure that tells your model
-how to interpret the data in each feature.  In the Iris problem,
-we want the model to interpret the data in each
-feature as its literal floating-point value; that is, we want the
-model to interpret an input value like 5.4 as, well, 5.4.  However,
-in other machine learning problems, it is often desirable to interpret
-data less literally.  Using feature columns to
-interpret data is such a rich topic that we devote an entire
-@{$feature_columns$document} to it.
-
-From a code perspective, you build a list of `feature_column` objects by calling
-functions from the @{tf.feature_column} module. Each object describes an input
-to the model. To tell the model to interpret data as a floating-point value,
-call @{tf.feature_column.numeric_column}.  In `premade_estimator.py`, all
-four features should be interpreted as literal floating-point values, so
-the code to create a feature column looks as follows:
-
-```python
-# Create feature columns for all features.
-my_feature_columns = []
-for key in train_x.keys():
-    my_feature_columns.append(tf.feature_column.numeric_column(key=key))
-```
-
-Here is a less elegant, but possibly clearer, alternative way to
-encode the preceding block:
-
-```python
-my_feature_columns = [
-    tf.feature_column.numeric_column(key='SepalLength'),
-    tf.feature_column.numeric_column(key='SepalWidth'),
-    tf.feature_column.numeric_column(key='PetalLength'),
-    tf.feature_column.numeric_column(key='PetalWidth')
-]
-```
-
-
-### Select the type of model
-
-We need to select the kind of model that will be trained.
-Lots of model types exist; picking the ideal type takes experience.
-We've selected a neural network to solve the Iris problem.  [**Neural
-networks**](https://developers.google.com/machine-learning/glossary/#neural_network)
-can find complex relationships between features and the label.
-A neural network is a highly-structured graph, organized into one or more
-[**hidden layers**](https://developers.google.com/machine-learning/glossary/#hidden_layer).
-Each hidden layer consists of one or more
-[**neurons**](https://developers.google.com/machine-learning/glossary/#neuron).
-There are several categories of neural networks.
-We'll be using a [**fully connected neural
-network**](https://developers.google.com/machine-learning/glossary/#fully_connected_layer),
-which means that the neurons in one layer take inputs from *every* neuron in
-the previous layer.  For example, the following figure illustrates a
-fully connected neural network consisting of three hidden layers:
-
-*   The first hidden layer contains four neurons.
-*   The second hidden layer contains three neurons.
-*   The third hidden layer contains two neurons.
-
-<div style="margin:auto; margin-bottom:10px; margin-top:20px;">
-<img style="width:100%" src="../images/simple_dnn.svg">
-</div>
-
-**A neural network with three hidden layers.**
-<p>&nbsp;</p>
-
-For a more detailed introduction to neural networks, see the
-[Introduction to Neural Nets section of Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course/introduction-to-neural-networks/anatomy).
-
-To specify a model type, instantiate an
-[**Estimator**](https://developers.google.com/machine-learning/glossary/#Estimators)
-class.  TensorFlow provides two categories of Estimators:
-
-*   [**pre-made
-    Estimators**](https://developers.google.com/machine-learning/glossary/#pre-made_Estimator),
-    which someone else has already written for you.
-*   [**custom
-    Estimators**](https://developers.google.com/machine-learning/glossary/#custom_estimator),
-    which you must code yourself, at least partially.
-
-To implement a neural network, the `premade_estimators.py` program uses
-a pre-made Estimator named @{tf.estimator.DNNClassifier}.  This Estimator
-builds a neural network that classifies examples.  The following call
-instantiates `DNNClassifier`:
-
-```python
-    classifier = tf.estimator.DNNClassifier(
-        feature_columns=my_feature_columns,
-        hidden_units=[10, 10],
-        n_classes=3)
-```
-
-Use the `hidden_units` parameter to define the number of neurons
-in each hidden layer of the neural network.  Assign this parameter
-a list. For example:
-
-```python
-        hidden_units=[10, 10],
-```
-
-The length of the list assigned to `hidden_units` identifies the number of
-hidden layers (2, in this case).
-Each value in the list represents the number of neurons in a particular
-hidden layer (10 in the first hidden layer and 10 in the second hidden layer).
-To change the number of hidden layers or neurons, simply assign a different
-list to the `hidden_units` parameter.
-
-The ideal number of hidden layers and neurons depends on the problem
-and the data set. Like many aspects of machine learning,
-picking the ideal shape of the neural network requires some mixture
-of knowledge and experimentation.
-As a rule of thumb, increasing the number of hidden layers and neurons
-*typically* creates a more powerful model, which requires more data to
-train effectively.
-
-The `n_classes` parameter specifies the number of possible values that the
-neural network can predict.  Since the Iris problem classifies 3 Iris species,
-we set `n_classes` to 3.
-
-The constructor for `tf.Estimator.DNNClassifier` takes an optional argument
-named `optimizer`, which our sample code chose not to specify.  The
-[**optimizer**](https://developers.google.com/machine-learning/glossary/#optimizer)
-controls how the model will train.  As you develop more expertise in machine
-learning, optimizers and
-[**learning
-rate**](https://developers.google.com/machine-learning/glossary/#learning_rate)
-will become very important.
-
-
-
-### Train the model
-
-Instantiating a `tf.Estimator.DNNClassifier` creates a framework for learning
-the model. Basically, we've wired a network but haven't yet let data flow
-through it. To train the neural network, call the Estimator object's `train`
-method. For example:
-
-```python
-    classifier.train(
-        input_fn=lambda:train_input_fn(train_feature, train_label, args.batch_size),
-        steps=args.train_steps)
-```
-
-The `steps` argument tells `train` to stop training after the specified
-number of iterations.  Increasing `steps` increases the amount of time
-the model will train.  Counter-intuitively, training a model longer
-does not guarantee a better model.  The default value of `args.train_steps`
-is 1000.  The number of steps to train is a
-[**hyperparameter**](https://developers.google.com/machine-learning/glossary/#hyperparameter)
-you can tune. Choosing the right number of steps usually
-requires both experience and experimentation.
-
-The `input_fn` parameter identifies the function that supplies the
-training data.  The call to the `train` method indicates that the
-`train_input_fn` function will supply the training data.  Here's that
-method's signature:
-
-```python
-def train_input_fn(features, labels, batch_size):
-```
-
-We're passing the following arguments to `train_input_fn`:
-
-* `train_feature` is a Python dictionary in which:
-    * Each key is the name of a feature.
-    * Each value is an array containing the values for each example in the
-      training set.
-* `train_label` is an array containing the values of the label for every
-  example in the training set.
-* `args.batch_size` is an integer defining the [**batch
-  size**](https://developers.google.com/machine-learning/glossary/#batch_size).
-
-The `train_input_fn` function relies on the **Dataset API**. This is a
-high-level TensorFlow API for reading data and transforming it into a form
-that the `train` method requires.  The following call converts the
-input features and labels into a `tf.data.Dataset` object, which is the base
-class of the Dataset API:
-
-```python
-    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))
-```
-
-The `tf.dataset` class provides many useful functions for preparing examples
-for training. The following line calls three of those functions:
-
-```python
-    dataset = dataset.shuffle(buffer_size=1000).repeat(count=None).batch(batch_size)
-```
-
-Training works best if the training examples are in
-random order.  To randomize the examples, call
-`tf.data.Dataset.shuffle`.  Setting the `buffer_size` to a value
-larger than the number of examples (120) ensures that the data will
-be well shuffled.
-
-During training, the `train` method typically processes the
-examples multiple times.  Calling the
-`tf.data.Dataset.repeat` method without any arguments ensures
-that the `train` method has an infinite supply of (now shuffled)
-training set examples.
-
-The `train` method processes a
-[**batch**](https://developers.google.com/machine-learning/glossary/#batch)
-of examples at a time.
-The `tf.data.Dataset.batch` method creates a batch by
-concatenating multiple examples.
-This program sets the default [**batch
-size**](https://developers.google.com/machine-learning/glossary/#batch_size)
-to 100, meaning that the `batch` method will concatenate groups of
-100 examples.  The ideal batch size depends on the problem.  As a rule
-of thumb, smaller batch sizes usually enable the `train` method to train
-the model faster at the expense (sometimes) of accuracy.
-
-The following `return` statement passes a batch of examples back to
-the caller (the `train` method).
-
-```python
-   return dataset.make_one_shot_iterator().get_next()
-```
-
-
-### Evaluate the model
-
-**Evaluating** means determining how effectively the model makes
-predictions.  To determine the Iris classification model's effectiveness,
-pass some sepal and petal measurements to the model and ask the model
-to predict what Iris species they represent. Then compare the model's
-prediction against the actual label.  For example, a model that picked
-the correct species on half the input examples would have an
-[accuracy](https://developers.google.com/machine-learning/glossary/#accuracy)
-of 0.5.  The following suggests a more effective model:
-
-
-<table>
-  <tr>
-    <th style="background-color:darkblue" colspan="5">
-       Test Set</th>
-  </tr>
-  <tr>
-    <th colspan="4">Features</th>
-    <th colspan="1">Label</th>
-    <th colspan="1">Prediction</th>
-  </tr>
-  <tr> <td>5.9</td> <td>3.0</td> <td>4.3</td> <td>1.5</td> <td>1</td>
-          <td style="background-color:green">1</td></tr>
-  <tr> <td>6.9</td> <td>3.1</td> <td>5.4</td> <td>2.1</td> <td>2</td>
-          <td style="background-color:green">2</td></tr>
-  <tr> <td>5.1</td> <td>3.3</td> <td>1.7</td> <td>0.5</td> <td>0</td>
-          <td style="background-color:green">0</td></tr>
-  <tr> <td>6.0</td> <td>3.4</td> <td>4.5</td> <td>1.6</td> <td>1</td>
-          <td style="background-color:red">2</td></tr>
-  <tr> <td>5.5</td> <td>2.5</td> <td>4.0</td> <td>1.3</td> <td>1</td>
-          <td style="background-color:green">1</td></tr>
-</table>
-
-**A model that is 80% accurate.**
-<p>&nbsp;</p>
-
-To evaluate a model's effectiveness, each Estimator provides an `evaluate`
-method.  The `premade_estimator.py` program calls `evaluate` as follows:
-
-```python
-# Evaluate the model.
-eval_result = classifier.evaluate(
-    input_fn=lambda:eval_input_fn(test_x, test_y, args.batch_size))
-
-print('\nTest set accuracy: {accuracy:0.3f}\n'.format(**eval_result))
-```
-
-The call to `classifier.evaluate` is similar to the call to `classifier.train`.
-The biggest difference is that `classifier.evaluate` must get its examples
-from the test set rather than the training set.  In other words, to
-fairly assess a model's effectiveness, the examples used to
-*evaluate* a model must be different from the examples used to *train*
-the model.  The `eval_input_fn` function serves a batch of examples from
-the test set.  Here's the `eval_input_fn` method:
-
-```python
-def eval_input_fn(features, labels=None, batch_size=None):
-    """An input function for evaluation or prediction"""
-    if labels is None:
-        # No labels, use only features.
-        inputs = features
-    else:
-        inputs = (features, labels)
-
-    # Convert inputs to a tf.dataset object.
-    dataset = tf.data.Dataset.from_tensor_slices(inputs)
-
-    # Batch the examples
-    assert batch_size is not None, "batch_size must not be None"
-    dataset = dataset.batch(batch_size)
-
-    # Return the read end of the pipeline.
-    return dataset.make_one_shot_iterator().get_next()
-```
-
-In brief, `eval_input_fn` does the following when called by
-`classifier.evaluate`:
-
-1.  Converts the features and labels from the test set to a `tf.dataset`
-    object.
-2.  Creates a batch of test set examples.  (There's no need to shuffle
-    or repeat the test set examples.)
-3.  Returns that batch of test set examples to `classifier.evaluate`.
-
-Running this code yields the following output (or something close to it):
-
-```none
-Test set accuracy: 0.967
-```
-
-An accuracy of 0.967 implies that our trained model correctly classified 29
-out of the 30 Iris species in the test set.
-
-To get a deeper understanding of different metrics for evaluating
-models, see the
-[Classification section of Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course/classification).
-
-
-### Predicting
-
-We've now trained a model and "proven" that it is good--but not
-perfect--at classifying Iris species.  Now let's use the trained
-model to make some predictions on [**unlabeled
-examples**](https://developers.google.com/machine-learning/glossary/#unlabeled_example);
-that is, on examples that contain features but not a label.
-
-In real-life, the unlabeled examples could come from lots of different
-sources including apps, CSV files, and data feeds.  For now, we're simply
-going to manually provide the following three unlabeled examples:
-
-```python
-    predict_x = {
-        'SepalLength': [5.1, 5.9, 6.9],
-        'SepalWidth': [3.3, 3.0, 3.1],
-        'PetalLength': [1.7, 4.2, 5.4],
-        'PetalWidth': [0.5, 1.5, 2.1],
-    }
-```
-
-Every Estimator provides a `predict` method, which `premade_estimator.py`
-calls as follows:
-
-```python
-predictions = classifier.predict(
-    input_fn=lambda:eval_input_fn(predict_x,
-                                  labels=None,
-                                  batch_size=args.batch_size))
-```
-
-As with the `evaluate` method, our `predict` method also gathers examples
-from the `eval_input_fn` method.
-
-When doing predictions, we're *not* passing labels to `eval_input_fn`.
-Therefore, `eval_input_fn` does the following:
-
-1.  Converts the features from the 3-element manual set we just created.
-2.  Creates a batch of 3 examples from that manual set.
-3.  Returns that batch of examples to `classifier.predict`.
-
-The `predict` method returns a python iterable, yielding a dictionary of
-prediction results for each example.  This dictionary contains several keys.
-The `probabilities` key holds a list of three floating-point values,
-each representing the probability that the input example is a particular
-Iris species.  For example, consider the following `probabilities` list:
-
-```none
-'probabilities': array([  1.19127117e-08,   3.97069454e-02,   9.60292995e-01])
-```
-
-The preceding list indicates:
-
-*   A negligible chance of the Iris being Setosa.
-*   A 3.97% chance of the Iris being Versicolor.
-*   A 96.0% chance of the Iris being Virginica.
-
-The `class_ids` key holds a one-element array that identifies the most
-probable species.  For example:
-
-```none
-'class_ids': array([2])
-```
-
-The number `2` corresponds to Virginica.  The following code iterates
-through the returned `predictions` to report on each prediction:
-
-``` python
-for pred_dict, expec in zip(predictions, expected):
-    template = ('\nPrediction is "{}" ({:.1f}%), expected "{}"')
-
-    class_id = pred_dict['class_ids'][0]
-    probability = pred_dict['probabilities'][class_id]
-    print(template.format(iris_data.SPECIES[class_id], 100 * probability, expec))
-```
-
-Running the program yields the following output:
-
-
-``` None
-...
-Prediction is "Setosa" (99.6%), expected "Setosa"
-
-Prediction is "Versicolor" (99.8%), expected "Versicolor"
-
-Prediction is "Virginica" (97.9%), expected "Virginica"
-```
-
-
-## Summary
-
-This document provides a short introduction to machine learning.
-
-Because `premade_estimators.py` relies on high-level APIs, much of the
-mathematical complexity in machine learning is hidden.
-If you intend to become more proficient in machine learning, we recommend
-ultimately learning more about [**gradient
-descent**](https://developers.google.com/machine-learning/glossary/#gradient_descent),
-batching, and neural networks.
-
-We recommend reading the @{$feature_columns$Feature Columns} document next,
-which explains how to represent different kinds of data in machine learning.
diff --git a/tensorflow/docs_src/get_started/index.md b/tensorflow/docs_src/get_started/index.md
index 578080bb592d93902e7fc05cb3f74754ed188d4b..232d2f154703dc10320f9ee074c67d6e1a8ee850 100644
--- a/tensorflow/docs_src/get_started/index.md
+++ b/tensorflow/docs_src/get_started/index.md
@@ -15,26 +15,8 @@ The easiest way to get started with TensorFlow is by using Eager Execution.
   * @{$get_started/eager}, is for anyone new to machine learning or TensorFlow.
 
 TensorFlow provides many APIs. The remainder of this section focuses on the
-Estimator API which provide scalable, high-performance models.
-To get started with Estimators, begin by reading one of the following documents:
-
-  * @{$get_started/get_started_for_beginners}, which is aimed at readers
-    new to machine learning.
-  * @{$get_started/premade_estimators}, which is aimed at readers who have
-    experience in machine learning.
-
-Then, read the following documents, which demonstrate the key features
-in the high-level APIs:
-
-  * @{$get_started/checkpoints}, which explains how to save training progress
-    and resume where you left off.
-  * @{$get_started/feature_columns}, which shows how an
-    Estimator can handle a variety of input data types without changes to the
-    model.
-  * @{$get_started/datasets_quickstart}, which introduces TensorFlow's
-    input pipelines.
-  * @{$get_started/custom_estimators}, which demonstrates how
-    to build and train models you design yourself.
+Estimator API which provide scalable, high-performance models. See the
+@{$estimators} guide.
 
 For more advanced users:
 
diff --git a/tensorflow/docs_src/get_started/leftnav_files b/tensorflow/docs_src/get_started/leftnav_files
index 4c12f0d84b3d13e4d9ececcb4559e806486b4120..e6cc8d565810683947e9cf9692e7cccb43916e66 100644
--- a/tensorflow/docs_src/get_started/leftnav_files
+++ b/tensorflow/docs_src/get_started/leftnav_files
@@ -1,15 +1,4 @@
 index.md
 
-### Beginners
 eager.md
-get_started_for_beginners.md
-premade_estimators.md
-
-### Estimators
-get_started_for_beginners.md: For Beginners
-premade_estimators.md: Premade Estimators
->>>
-checkpoints.md
-feature_columns.md
 datasets_quickstart.md
-custom_estimators.md
diff --git a/tensorflow/docs_src/install/index.md b/tensorflow/docs_src/install/index.md
index 4f85383925bbb8a03372b020e448a0e604f3b999..c2e5a991d459f52be7c08a6da3bcad0c57e58934 100644
--- a/tensorflow/docs_src/install/index.md
+++ b/tensorflow/docs_src/install/index.md
@@ -6,6 +6,7 @@ operating systems:
   * macOS 10.12.6 (Sierra) or later.
   * Ubuntu 16.04 or later
   * Windows 7 or later.
+  * Raspbian 9.0 or later.
 
 Although you might be able to install TensorFlow on other laptop or desktop
 systems, we only support (and only fix issues in) the preceding configurations.
@@ -16,6 +17,7 @@ that enables you to write applications in Python:
   * @{$install_linux$Installing TensorFlow on Ubuntu}
   * @{$install_mac$Installing TensorFlow on macOS}
   * @{$install_windows$Installing TensorFlow on Windows}
+  * @{$install_raspbian$Installing TensorFlow on a Raspberry Pi}
   * @{$install_sources$Installing TensorFlow from Sources}
 
 Many aspects of the Python TensorFlow API changed from version 0.n to 1.0.
diff --git a/tensorflow/docs_src/install/install_c.md b/tensorflow/docs_src/install/install_c.md
index 1abd840ab3ca3f1d2d0d98fc7161d25b8ea9f700..29018487450951661f2b6aaef59e2fe59ab052f0 100644
--- a/tensorflow/docs_src/install/install_c.md
+++ b/tensorflow/docs_src/install/install_c.md
@@ -38,7 +38,7 @@ enable TensorFlow for C:
          OS="linux" # Change to "darwin" for macOS
          TARGET_DIRECTORY="/usr/local"
          curl -L \
-           "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-${TF_TYPE}-${OS}-x86_64-1.8.0.tar.gz" |
+           "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-${TF_TYPE}-${OS}-x86_64-1.9.0-rc0.tar.gz" |
            sudo tar -C $TARGET_DIRECTORY -xz
 
      The `tar` command extracts the TensorFlow C library into the `lib`
diff --git a/tensorflow/docs_src/install/install_go.md b/tensorflow/docs_src/install/install_go.md
index 52a2a3f8a68dd56d6d7ebfb474667ba8a0e39e18..55bc0f64e799ecb7115cc656c8a08ec1ce2a6108 100644
--- a/tensorflow/docs_src/install/install_go.md
+++ b/tensorflow/docs_src/install/install_go.md
@@ -38,7 +38,7 @@ steps to install this library and enable TensorFlow for Go:
          TF_TYPE="cpu" # Change to "gpu" for GPU support
          TARGET_DIRECTORY='/usr/local'
          curl -L \
-           "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-${TF_TYPE}-$(go env GOOS)-x86_64-1.8.0.tar.gz" |
+           "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-${TF_TYPE}-$(go env GOOS)-x86_64-1.9.0-rc0.tar.gz" |
          sudo tar -C $TARGET_DIRECTORY -xz
 
      The `tar` command extracts the TensorFlow C library into the `lib`
diff --git a/tensorflow/docs_src/install/install_java.md b/tensorflow/docs_src/install/install_java.md
index 1256fb99c4307c6e5f6d3eb47467ce0c006427fc..637231da1252097a9a143edea23f3248c7cf2eb6 100644
--- a/tensorflow/docs_src/install/install_java.md
+++ b/tensorflow/docs_src/install/install_java.md
@@ -36,7 +36,7 @@ following to the project's `pom.xml` to use the TensorFlow Java APIs:
 <dependency>
   <groupId>org.tensorflow</groupId>
   <artifactId>tensorflow</artifactId>
-  <version>1.8.0</version>
+  <version>1.9.0-rc0</version>
 </dependency>
 ```
 
@@ -65,7 +65,7 @@ As an example, these steps will create a Maven project that uses TensorFlow:
                <dependency>
                  <groupId>org.tensorflow</groupId>
                  <artifactId>tensorflow</artifactId>
-                 <version>1.8.0</version>
+                 <version>1.9.0-rc0</version>
                </dependency>
              </dependencies>
          </project>
@@ -124,12 +124,12 @@ instead:
 <dependency>
   <groupId>org.tensorflow</groupId>
   <artifactId>libtensorflow</artifactId>
-  <version>1.8.0</version>
+  <version>1.9.0-rc0</version>
 </dependency>
 <dependency>
   <groupId>org.tensorflow</groupId>
   <artifactId>libtensorflow_jni_gpu</artifactId>
-  <version>1.8.0</version>
+  <version>1.9.0-rc0</version>
 </dependency>
 ```
 
@@ -148,7 +148,7 @@ refer to the simpler instructions above instead.
 Take the following steps to install TensorFlow for Java on Linux or macOS:
 
   1. Download
-     [libtensorflow.jar](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-1.8.0.jar),
+     [libtensorflow.jar](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-1.9.0-rc0.jar),
      which is the TensorFlow Java Archive (JAR).
 
   2. Decide whether you will run TensorFlow for Java on CPU(s) only or with
@@ -167,7 +167,7 @@ Take the following steps to install TensorFlow for Java on Linux or macOS:
          OS=$(uname -s | tr '[:upper:]' '[:lower:]')
          mkdir -p ./jni
          curl -L \
-           "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow_jni-${TF_TYPE}-${OS}-x86_64-1.8.0.tar.gz" |
+           "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow_jni-${TF_TYPE}-${OS}-x86_64-1.9.0-rc0.tar.gz" |
            tar -xz -C ./jni
 
 ### Install on Windows
@@ -175,13 +175,13 @@ Take the following steps to install TensorFlow for Java on Linux or macOS:
 Take the following steps to install TensorFlow for Java on Windows:
 
   1. Download
-     [libtensorflow.jar](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-1.8.0.jar),
+     [libtensorflow.jar](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-1.9.0-rc0.jar),
      which is the TensorFlow Java Archive (JAR).
   2. Download the following Java Native Interface (JNI) file appropriate for
-     [TensorFlow for Java on Windows](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow_jni-cpu-windows-x86_64-1.8.0.zip).
+     [TensorFlow for Java on Windows](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow_jni-cpu-windows-x86_64-1.9.0-rc0.zip).
   3. Extract this .zip file.
 
-
+__Note__: The native library (`tensorflow_jni.dll`) requires `msvcp140.dll` at runtime, which is included in the [Visual C++ 2015 Redistributable](https://www.microsoft.com/en-us/download/details.aspx?id=48145) package. 
 
 ### Validate the installation
 
@@ -227,7 +227,7 @@ must be part of your `classpath`. For example, you can include the
 downloaded `.jar` in your `classpath` by using the `-cp` compilation flag
 as follows:
 
-<pre><b>javac -cp libtensorflow-1.8.0.jar HelloTF.java</b></pre>
+<pre><b>javac -cp libtensorflow-1.9.0-rc0.jar HelloTF.java</b></pre>
 
 
 ### Running
@@ -241,11 +241,11 @@ two files are available to the JVM:
 For example, the following command line executes the `HelloTF` program on Linux
 and macOS X:
 
-<pre><b>java -cp libtensorflow-1.8.0.jar:. -Djava.library.path=./jni HelloTF</b></pre>
+<pre><b>java -cp libtensorflow-1.9.0-rc0.jar:. -Djava.library.path=./jni HelloTF</b></pre>
 
 And the following command line executes the `HelloTF` program on Windows:
 
-<pre><b>java -cp libtensorflow-1.8.0.jar;. -Djava.library.path=jni HelloTF</b></pre>
+<pre><b>java -cp libtensorflow-1.9.0-rc0.jar;. -Djava.library.path=jni HelloTF</b></pre>
 
 If the program prints <tt>Hello from <i>version</i></tt>, you've successfully
 installed TensorFlow for Java and are ready to use the API.  If the program
diff --git a/tensorflow/docs_src/install/install_linux.md b/tensorflow/docs_src/install/install_linux.md
index 3b9381625fd675ff30cad50e16737db1389b8cad..c8d706cf3c232d1ea91265bd7ad38d5227c440f0 100644
--- a/tensorflow/docs_src/install/install_linux.md
+++ b/tensorflow/docs_src/install/install_linux.md
@@ -339,9 +339,7 @@ Docker will download the TensorFlow binary image the first time you launch it.
 
 #### GPU support
 
-Prior to installing TensorFlow with GPU support, ensure that your system meets all
-[NVIDIA software requirements](#NVIDIARequirements).  To launch a Docker container
-with NVidia GPU support, enter a command of the following format:
+To launch a Docker container with NVidia GPU support, enter a command of the following format (this [does not require any local CUDA installation](https://github.com/nvidia/nvidia-docker/wiki/CUDA#requirements)):
 
 <pre>
 $ <b>nvidia-docker run -it</b> <i>-p hostPort:containerPort TensorFlowGPUImage</i>
@@ -438,7 +436,7 @@ Take the following steps to install TensorFlow in an Anaconda environment:
 
      <pre>
      (tensorflow)$ <b>pip install --ignore-installed --upgrade \
-     https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.8.0-cp34-cp34m-linux_x86_64.whl</b></pre>
+     https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.9.0rc0-cp34-cp34m-linux_x86_64.whl</b></pre>
 
 <a name="ValidateYourInstallation"></a>
 ## Validate your installation
@@ -684,14 +682,14 @@ This section documents the relevant values for Linux installations.
 CPU only:
 
 <pre>
-https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.8.0-cp27-none-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.9.0rc0-cp27-none-linux_x86_64.whl
 </pre>
 
 
 GPU support:
 
 <pre>
-https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.8.0-cp27-none-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.9.0rc0-cp27-none-linux_x86_64.whl
 </pre>
 
 Note that GPU support requires the NVIDIA hardware and software described in
@@ -703,14 +701,14 @@ Note that GPU support requires the NVIDIA hardware and software described in
 CPU only:
 
 <pre>
-https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.8.0-cp34-cp34m-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.9.0rc0-cp34-cp34m-linux_x86_64.whl
 </pre>
 
 
 GPU support:
 
 <pre>
-https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.8.0-cp34-cp34m-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.9.0rc0-cp34-cp34m-linux_x86_64.whl
 </pre>
 
 Note that GPU support requires the NVIDIA hardware and software described in
@@ -722,14 +720,14 @@ Note that GPU support requires the NVIDIA hardware and software described in
 CPU only:
 
 <pre>
-https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.8.0-cp35-cp35m-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.9.0rc0-cp35-cp35m-linux_x86_64.whl
 </pre>
 
 
 GPU support:
 
 <pre>
-https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.8.0-cp35-cp35m-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.9.0rc0-cp35-cp35m-linux_x86_64.whl
 </pre>
 
 
@@ -741,14 +739,14 @@ Note that GPU support requires the NVIDIA hardware and software described in
 CPU only:
 
 <pre>
-https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.8.0-cp36-cp36m-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.9.0rc0-cp36-cp36m-linux_x86_64.whl
 </pre>
 
 
 GPU support:
 
 <pre>
-https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.8.0-cp36-cp36m-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.9.0rc0-cp36-cp36m-linux_x86_64.whl
 </pre>
 
 
diff --git a/tensorflow/docs_src/install/install_mac.md b/tensorflow/docs_src/install/install_mac.md
index 016e7bf1b90d35cf0199eca3209fb2697856547c..9d01271c5a0beebf75be9e32682583ddc4a666b1 100644
--- a/tensorflow/docs_src/install/install_mac.md
+++ b/tensorflow/docs_src/install/install_mac.md
@@ -119,7 +119,7 @@ Take the following steps to install TensorFlow with Virtualenv:
      TensorFlow in the active Virtualenv is as follows:
 
      <pre> $ <b>pip3 install --upgrade \
-     https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.8.0-py3-none-any.whl</b></pre>
+     https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.9.0rc0-py3-none-any.whl</b></pre>
 
 If you encounter installation problems, see
 [Common Installation Problems](#common-installation-problems).
@@ -242,7 +242,7 @@ take the following steps:
      issue the following command:
 
      <pre> $ <b>sudo pip3 install --upgrade \
-     https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.8.0-py3-none-any.whl</b> </pre>
+     https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.9.0rc0-py3-none-any.whl</b> </pre>
 
      If the preceding command fails, see
      [installation problems](#common-installation-problems).
@@ -350,7 +350,7 @@ Take the following steps to install TensorFlow in an Anaconda environment:
      TensorFlow for Python 2.7:
 
      <pre> (<i>targetDirectory</i>)$ <b>pip install --ignore-installed --upgrade \
-     https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.8.0-py2-none-any.whl</b></pre>
+     https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.9.0rc0-py2-none-any.whl</b></pre>
 
 
 <a name="ValidateYourInstallation"></a>
@@ -403,10 +403,8 @@ writing TensorFlow programs:
 If the system outputs an error message instead of a greeting, see
 [Common installation problems](#common_installation_problems).
 
-If you are new to machine learning, we recommend the following:
-
-*  [Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course)
-*  @{$get_started/get_started_for_beginners$Getting Started for ML Beginners}
+If you are new to machine learning, we recommend the
+[Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course).
 
 If you are experienced with machine learning but new to TensorFlow, see
 @{$get_started/eager}.
@@ -524,7 +522,7 @@ The value you specify depends on your Python version.
 
 
 <pre>
-https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.8.0-py2-none-any.whl
+https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.9.0rc0-py2-none-any.whl
 </pre>
 
 
@@ -532,5 +530,5 @@ https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.8.0-py2-none-any.
 
 
 <pre>
-https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.8.0-py3-none-any.whl
+https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.9.0rc0-py3-none-any.whl
 </pre>
diff --git a/tensorflow/docs_src/install/install_raspbian.md b/tensorflow/docs_src/install/install_raspbian.md
new file mode 100644
index 0000000000000000000000000000000000000000..2f425162a1c63f084702727b5280ed266196b955
--- /dev/null
+++ b/tensorflow/docs_src/install/install_raspbian.md
@@ -0,0 +1,317 @@
+# Installing TensorFlow on Raspbian
+
+This guide explains how to install TensorFlow on a Raspberry Pi running
+Raspbian. Although these instructions might also work on other Pi variants, we
+have only tested (and we only support) these instructions on machines meeting
+the following requirements:
+
+*   Raspberry Pi devices running Raspbian 9.0 or higher
+
+## Determine how to install TensorFlow
+
+You must pick the mechanism by which you install TensorFlow. The supported
+choices are as follows:
+
+*   "Native" pip.
+*   Cross-compiling from sources.
+
+**We recommend pip installation.**
+
+## Installing with native pip
+
+We have uploaded the TensorFlow binaries to piwheels.org. Therefore, you can
+install TensorFlow through pip.
+
+The [REQUIRED_PACKAGES section of
+setup.py](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/pip_package/setup.py)
+lists the packages that pip will install or upgrade.
+
+### Prerequisite: Python
+
+In order to install TensorFlow, your system must contain one of the following
+Python versions:
+
+*   Python 2.7
+*   Python 3.4+
+
+If your system does not already have one of the preceding Python versions,
+[install](https://wiki.python.org/moin/BeginnersGuide/Download) it now. It
+should already be included when Raspbian was installed though, so no extra steps
+should be needed.
+
+### Prerequisite: pip
+
+[Pip](https://en.wikipedia.org/wiki/Pip_\(package_manager\)) installs and
+manages software packages written in Python. If you intend to install with
+native pip, then one of the following flavors of pip must be installed on your
+system:
+
+*   `pip3`, for Python 3.n (preferred).
+*   `pip`, for Python 2.7.
+
+`pip` or `pip3` was probably installed on your system when you installed Python.
+To determine whether pip or pip3 is actually installed on your system, issue one
+of the following commands:
+
+<pre>$ <b>pip3 -V</b> # for Python 3.n
+$ <b>pip -V</b>  # for Python 2.7</pre>
+
+If it gives the error "Command not found", then the package has not been
+installed yet. To install if for the first time, run:
+
+<pre>$ sudo apt-get install python3-pip # for Python 3.n
+sudo apt-get install python-pip # for Python 2.7</pre>
+
+You can find more help on installing and upgrading pip in
+[the Raspberry Pi documentation](https://www.raspberrypi.org/documentation/linux/software/python.md).
+
+### Prerequisite: Atlas
+
+[Atlas](http://math-atlas.sourceforge.net/) is a linear algebra library that
+numpy depends on, and so needs to be installed before TensorFlow. To add it to
+your system, run the following command:
+
+<pre>$ sudo apt install libatlas-base-dev</pre>
+
+### Install TensorFlow
+
+Assuming the prerequisite software is installed on your Pi, install TensorFlow
+by invoking **one** of the following commands:
+
+     <pre> $ <b>pip3 install tensorflow</b>     # Python 3.n
+     $ <b>pip install tensorflow</b>      # Python 2.7</pre>
+
+This can take some time on certain platforms like the Pi Zero, where some Python
+packages like scipy that TensorFlow depends on need to be compiled before the
+installation can complete. The Python 3 version will typically be faster to
+install because piwheels.org has pre-built versions of the dependencies 
+available, so this is our recommended option.
+
+### Next Steps
+
+After installing TensorFlow, [validate your
+installation](#ValidateYourInstallation) to confirm that the installation worked
+properly.
+
+### Uninstalling TensorFlow
+
+To uninstall TensorFlow, issue one of following commands:
+
+<pre>$ <b>pip uninstall tensorflow</b>
+$ <b>pip3 uninstall tensorflow</b> </pre>
+
+## Cross-compiling from sources
+
+Cross-compilation means building on a different machine than than you'll be
+deploying on. Since Raspberry Pi's only have limited RAM and comparatively slow
+processors, and TensorFlow has a large amount of source code to compile, it's
+easier to use a MacOS or Linux desktop or laptop to handle the build process.
+Because it can take over 24 hours to build on a Pi, and requires external swap
+space to cope with the memory shortage, we recommend using cross-compilation if
+you do need to compile TensorFlow from source. To make the dependency management
+process easier, we also recommend using Docker to help simplify building.
+
+Note that we provide well-tested, pre-built TensorFlow binaries for Raspbian
+systems. So, don't build a TensorFlow binary yourself unless you are very
+comfortable building complex packages from source and dealing with the
+inevitable aftermath should things not go exactly as documented
+
+### Prerequisite: Docker
+
+Install Docker on your machine as described in the [Docker
+documentation](https://docs.docker.com/engine/installation/#/on-macos-and-windows).
+
+### Clone the TensorFlow repository
+
+Start the process of building TensorFlow by cloning a TensorFlow repository.
+
+To clone **the latest** TensorFlow repository, issue the following command:
+
+<pre>$ <b>git clone https://github.com/tensorflow/tensorflow</b> </pre>
+
+The preceding <code>git clone</code> command creates a subdirectory named
+`tensorflow`. After cloning, you may optionally build a **specific branch**
+(such as a release branch) by invoking the following commands:
+
+<pre>
+$ <b>cd tensorflow</b>
+$ <b>git checkout</b> <i>Branch</i> # where <i>Branch</i> is the desired branch
+</pre>
+
+For example, to work with the `r1.0` release instead of the master release,
+issue the following command:
+
+<pre>$ <b>git checkout r1.0</b></pre>
+
+### Build from source
+
+To compile TensorFlow and produce a binary pip can install, do the following:
+
+1.  Start a terminal.
+2.  Navigate to the directory containing the tensorflow source code.
+3.  Run a command to cross-compile the library, for example:
+
+<pre>$ CI_DOCKER_EXTRA_PARAMS="-e CI_BUILD_PYTHON=python3 -e CROSSTOOL_PYTHON_INCLUDE_PATH=/usr/include/python3.4" \
+tensorflow/tools/ci_build/ci_build.sh PI-PYTHON3 tensorflow/tools/ci_build/pi/build_raspberry_pi.sh
+ </pre>
+
+This will build a pip .whl file for Python 3.4, with Arm v7 instructions that
+will only work on the Pi models 2 or 3. These NEON instructions are required for
+the fastest operation on those devices, but you can build a library that will
+run across all Pi devices by passing `PI_ONE` at the end of the command line.
+You can also target Python 2.7 by omitting the initial docker parameters. Here's
+an example of building for Python 2.7 and Raspberry Pi model Zero or One
+devices:
+
+<pre>$ tensorflow/tools/ci_build/ci_build.sh PI tensorflow/tools/ci_build/pi/build_raspberry_pi.sh PI_ONE</pre>
+
+This will take some time to complete, typically twenty or thirty minutes, and
+should produce a .whl file in an output-artifacts sub-folder inside your source
+tree at the end. This wheel file can be installed through pip or pip3 (depending
+on your Python version) by copying it to a Raspberry Pi and running a terminal
+command like this (with the name of your actual file substituted):
+
+<pre>$ pip3 install tensorflow-1.9.0-cp34-none-linux_armv7l.whl</pre>
+
+### Troubleshooting the build
+
+The build script uses Docker internally to create a Linux virtual machine to
+handle the compilation. If you do have problems running the script, first check
+that you're able to run Docker tests like `docker run hello-world` on your
+system.
+
+If you're building from the latest development branch, try syncing to an older
+version that's known to work, for example release 1.9, with a command like this:
+
+<pre>$ <b>git checkout r1.0</b></pre>
+
+<a name="ValidateYourInstallation"></a>
+
+## Validate your installation
+
+To validate your TensorFlow installation, do the following:
+
+1.  Ensure that your environment is prepared to run TensorFlow programs.
+2.  Run a short TensorFlow program.
+
+### Prepare your environment
+
+If you installed on native pip, Virtualenv, or Anaconda, then do the following:
+
+1.  Start a terminal.
+2.  If you installed TensorFlow source code, navigate to any directory *except*
+    one containing TensorFlow source code.
+
+### Run a short TensorFlow program
+
+Invoke python from your shell as follows:
+
+<pre>$ <b>python</b></pre>
+
+Enter the following short program inside the python interactive shell:
+
+```python
+# Python
+import tensorflow as tf
+hello = tf.constant('Hello, TensorFlow!')
+sess = tf.Session()
+print(sess.run(hello))
+```
+
+If the system outputs the following, then you are ready to begin writing
+TensorFlow programs:
+
+<pre>Hello, TensorFlow!</pre>
+
+If you're running with Python 3.5, you may see a warning when you first import
+TensorFlow. This is not an error, and TensorFlow should continue to run with no
+problems, despite the log message.
+
+If the system outputs an error message instead of a greeting, see [Common
+installation problems](#common_installation_problems).
+
+If you are new to machine learning, we recommend the [Machine Learning Crash
+Course](https://developers.google.com/machine-learning/crash-course).
+
+If you are experienced with machine learning but new to TensorFlow, see
+@{$get_started/eager}.
+
+## Common installation problems
+
+We are relying on Stack Overflow to document TensorFlow installation problems
+and their remedies. The following table contains links to Stack Overflow answers
+for some common installation problems. If you encounter an error message or
+other installation problem not listed in the following table, search for it on
+Stack Overflow. If Stack Overflow doesn't show the error message, ask a new
+question about it on Stack Overflow and specify the `tensorflow` tag.
+
+<table>
+<tr> <th>Stack Overflow Link</th> <th>Error Message</th> </tr>
+
+
+<tr>
+  <td><a href="http://stackoverflow.com/q/42006320">42006320</a></td>
+  <td><pre>ImportError: Traceback (most recent call last):
+File ".../tensorflow/core/framework/graph_pb2.py", line 6, in <module>
+from google.protobuf import descriptor as _descriptor
+ImportError: cannot import name 'descriptor'</pre>
+  </td>
+</tr>
+
+<tr>
+  <td><a href="https://stackoverflow.com/q/33623453">33623453</a></td>
+  <td><pre>IOError: [Errno 2] No such file or directory:
+  '/tmp/pip-o6Tpui-build/setup.py'</tt></pre>
+</tr>
+
+<tr>
+  <td><a href="https://stackoverflow.com/questions/35190574">35190574</a> </td>
+  <td><pre>SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify
+  failed</pre></td>
+</tr>
+
+<tr>
+  <td><a href="http://stackoverflow.com/q/42009190">42009190</a></td>
+  <td><pre>
+  Installing collected packages: setuptools, protobuf, wheel, numpy, tensorflow
+  Found existing installation: setuptools 1.1.6
+  Uninstalling setuptools-1.1.6:
+  Exception:
+  ...
+  [Errno 1] Operation not permitted:
+  '/tmp/pip-a1DXRT-uninstall/.../lib/python/_markerlib' </pre></td>
+</tr>
+
+<tr>
+  <td><a href="https://stackoverflow.com/q/33622019">33622019</a></td>
+  <td><pre>ImportError: No module named copyreg</pre></td>
+</tr>
+
+<tr>
+  <td><a href="http://stackoverflow.com/q/37810228">37810228</a></td>
+  <td>During a <tt>pip install</tt> operation, the system returns:
+  <pre>OSError: [Errno 1] Operation not permitted</pre>
+  </td>
+</tr>
+
+<tr>
+  <td><a href="http://stackoverflow.com/q/33622842">33622842</a></td>
+  <td>An <tt>import tensorflow</tt> statement triggers an error such as the
+  following:<pre>Traceback (most recent call last):
+  File "<stdin>", line 1, in <module>
+  File "/usr/local/lib/python2.7/site-packages/tensorflow/__init__.py",
+    line 4, in <module>
+    from tensorflow.python import *
+    ...
+  File "/usr/local/lib/python2.7/site-packages/tensorflow/core/framework/tensor_shape_pb2.py",
+    line 22, in <module>
+    serialized_pb=_b('\n,tensorflow/core/framework/tensor_shape.proto\x12\ntensorflow\"d\n\x10TensorShapeProto\x12-\n\x03\x64im\x18\x02
+      \x03(\x0b\x32
+      .tensorflow.TensorShapeProto.Dim\x1a!\n\x03\x44im\x12\x0c\n\x04size\x18\x01
+      \x01(\x03\x12\x0c\n\x04name\x18\x02 \x01(\tb\x06proto3')
+  TypeError: __init__() got an unexpected keyword argument 'syntax'</pre>
+  </td>
+</tr>
+
+
+</table>
diff --git a/tensorflow/docs_src/install/install_sources.md b/tensorflow/docs_src/install/install_sources.md
index 5ba522b436137bc5588382fd79f7559c6e9d11ed..dc6c1e36fc237c2a160887e6417e7f373008309e 100644
--- a/tensorflow/docs_src/install/install_sources.md
+++ b/tensorflow/docs_src/install/install_sources.md
@@ -81,7 +81,7 @@ or
 [macOS](#PrepareMac)
 
 
-<a name="#PrepareLinux"></a>
+<a name="PrepareLinux"></a>
 ## Prepare environment for Linux
 
 Before building TensorFlow on Linux, install the following build
@@ -328,10 +328,10 @@ Invoke `pip install` to install that pip package.
 The filename of the `.whl` file depends on your platform.
 For example, the following command will install the pip package
 
-for TensorFlow 1.8.0 on Linux:
+for TensorFlow 1.9.0rc0 on Linux:
 
 <pre>
-$ <b>sudo pip install /tmp/tensorflow_pkg/tensorflow-1.8.0-py2-none-any.whl</b>
+$ <b>sudo pip install /tmp/tensorflow_pkg/tensorflow-1.9.0rc0-py2-none-any.whl</b>
 </pre>
 
 ## Validate your installation
@@ -373,9 +373,9 @@ The build and installation problems you encounter typically depend on the
 operating system.  See the "Common installation problems" section
 of one of the following guides:
 
-  * @{$install_linux#CommonInstallationProblems$Installing TensorFlow on Linux}
-  * @{$install_mac#CommonInstallationProblems$Installing TensorFlow on Mac OS}
-  * @{$install_windows#CommonInstallationProblems$Installing TensorFlow on Windows}
+  * @{$install_linux#common_installation_problems$Installing TensorFlow on Linux}
+  * @{$install_mac#common_installation_problems$Installing TensorFlow on Mac OS}
+  * @{$install_windows#common_installation_problems$Installing TensorFlow on Windows}
 
 Beyond the errors documented in those two guides, the following table
 notes additional errors specific to building TensorFlow.  Note that we
@@ -433,6 +433,8 @@ Stack Overflow and specify the `tensorflow` tag.
 **Linux**
 <table>
 <tr><th>Version:</th><th>CPU/GPU:</th><th>Python Version:</th><th>Compiler:</th><th>Build Tools:</th><th>cuDNN:</th><th>CUDA:</th></tr>
+<tr><td>tensorflow-1.9.0</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>GCC 4.8</td><td>Bazel 0.11.0</td><td>N/A</td><td>N/A</td></tr>
+<tr><td>tensorflow_gpu-1.9.0</td><td>GPU</td><td>2.7, 3.3-3.6</td><td>GCC 4.8</td><td>Bazel 0.11.0</td><td>7</td><td>9</td></tr>
 <tr><td>tensorflow-1.8.0</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>GCC 4.8</td><td>Bazel 0.10.0</td><td>N/A</td><td>N/A</td></tr>
 <tr><td>tensorflow_gpu-1.8.0</td><td>GPU</td><td>2.7, 3.3-3.6</td><td>GCC 4.8</td><td>Bazel 0.9.0</td><td>7</td><td>9</td></tr>
 <tr><td>tensorflow-1.7.0</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>GCC 4.8</td><td>Bazel 0.10.0</td><td>N/A</td><td>N/A</td></tr>
@@ -456,6 +458,7 @@ Stack Overflow and specify the `tensorflow` tag.
 **Mac**
 <table>
 <tr><th>Version:</th><th>CPU/GPU:</th><th>Python Version:</th><th>Compiler:</th><th>Build Tools:</th><th>cuDNN:</th><th>CUDA:</th></tr>
+<tr><td>tensorflow-1.9.0</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>Clang from xcode</td><td>Bazel 0.11.0</td><td>N/A</td><td>N/A</td></tr>
 <tr><td>tensorflow-1.8.0</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>Clang from xcode</td><td>Bazel 0.10.1</td><td>N/A</td><td>N/A</td></tr>
 <tr><td>tensorflow-1.7.0</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>Clang from xcode</td><td>Bazel 0.10.1</td><td>N/A</td><td>N/A</td></tr>
 <tr><td>tensorflow-1.6.0</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>Clang from xcode</td><td>Bazel 0.8.1</td><td>N/A</td><td>N/A</td></tr>
@@ -472,6 +475,8 @@ Stack Overflow and specify the `tensorflow` tag.
 **Windows**
 <table>
 <tr><th>Version:</th><th>CPU/GPU:</th><th>Python Version:</th><th>Compiler:</th><th>Build Tools:</th><th>cuDNN:</th><th>CUDA:</th></tr>
+<tr><td>tensorflow-1.9.0</td><td>CPU</td><td>3.5-3.6</td><td>MSVC 2015 update 3</td><td>Cmake v3.6.3</td><td>N/A</td><td>N/A</td></tr>
+<tr><td>tensorflow_gpu-1.9.0</td><td>GPU</td><td>3.5-3.6</td><td>MSVC 2015 update 3</td><td>Cmake v3.6.3</td><td>7</td><td>9</td></tr>
 <tr><td>tensorflow-1.8.0</td><td>CPU</td><td>3.5-3.6</td><td>MSVC 2015 update 3</td><td>Cmake v3.6.3</td><td>N/A</td><td>N/A</td></tr>
 <tr><td>tensorflow_gpu-1.8.0</td><td>GPU</td><td>3.5-3.6</td><td>MSVC 2015 update 3</td><td>Cmake v3.6.3</td><td>7</td><td>9</td></tr>
 <tr><td>tensorflow-1.7.0</td><td>CPU</td><td>3.5-3.6</td><td>MSVC 2015 update 3</td><td>Cmake v3.6.3</td><td>N/A</td><td>N/A</td></tr>
diff --git a/tensorflow/docs_src/install/install_windows.md b/tensorflow/docs_src/install/install_windows.md
index a139a49661ee4eb8606d4b9fa1bf96ae12d16b8e..6c4f5b85ab2facdb274e9bdd36f6edb9ad79ba4b 100644
--- a/tensorflow/docs_src/install/install_windows.md
+++ b/tensorflow/docs_src/install/install_windows.md
@@ -157,10 +157,8 @@ TensorFlow programs:
 If the system outputs an error message instead of a greeting, see [Common
 installation problems](#common_installation_problems).
 
-If you are new to machine learning, we recommend the following:
-
-*  [Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course)
-*  @{$get_started/get_started_for_beginners$Getting Started for ML Beginners}
+If you are new to machine learning, we recommend the
+[Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course).
 
 If you are experienced with machine learning but new to TensorFlow, see
 @{$get_started/eager}.
diff --git a/tensorflow/docs_src/mobile/linking_libs.md b/tensorflow/docs_src/mobile/linking_libs.md
index cf0db590210593914d42105c2cfae5bd99e18287..efef5dd0daa0b267d8384d32d62d9ce0226dc102 100644
--- a/tensorflow/docs_src/mobile/linking_libs.md
+++ b/tensorflow/docs_src/mobile/linking_libs.md
@@ -27,7 +27,7 @@ called `libandroid_tensorflow_inference_java.jar`. There are three ways to
 include this functionality in your program:
 
 1. Include the jcenter AAR which contains it, as in this
- [example app](https://github.com/googlecodelabs/tensorflow-for-poets-2/blob/master/android/build.gradle#L59-L65)
+ [example app](https://github.com/googlecodelabs/tensorflow-for-poets-2/blob/master/android/tfmobile/build.gradle#L59-L65)
 
 2. Download the nightly precompiled version from
 [ci.tensorflow.org](http://ci.tensorflow.org/view/Nightly/job/nightly-android/lastSuccessfulBuild/artifact/out/).
diff --git a/tensorflow/docs_src/mobile/prepare_models.md b/tensorflow/docs_src/mobile/prepare_models.md
index 8b22c04d872f18607c485775cb8f096f0a361995..2b84dbb97388b16c6a4ae1d3472e0b1a993285f0 100644
--- a/tensorflow/docs_src/mobile/prepare_models.md
+++ b/tensorflow/docs_src/mobile/prepare_models.md
@@ -105,8 +105,8 @@ inline constants so everything’s in one file.  To handle the conversion, you
 need the `freeze_graph.py` script, that’s held in
 [`tensorflow/python/tools/freeze_graph.py`](https://www.tensorflow.org/code/tensorflow/python/tools/freeze_graph.py). You’ll run it like this:
 
-    bazel build tensorflow/tools:freeze_graph
-    bazel-bin/tensorflow/tools/freeze_graph \
+    bazel build tensorflow/python/tools:freeze_graph
+    bazel-bin/tensorflow/python/tools/freeze_graph \
     --input_graph=/tmp/model/my_graph.pb \
     --input_checkpoint=/tmp/model/model.ckpt-1000 \
     --output_graph=/tmp/frozen_graph.pb \
diff --git a/tensorflow/docs_src/mobile/tflite/index.md b/tensorflow/docs_src/mobile/tflite/index.md
index 562203482763991c412b523bd261b3163d361134..3d1733024e493042a2cc85aa9f2fec4b75eefa94 100644
--- a/tensorflow/docs_src/mobile/tflite/index.md
+++ b/tensorflow/docs_src/mobile/tflite/index.md
@@ -37,8 +37,9 @@ a custom (less-dynamic) memory allocator to ensure minimal load, initialization,
 and execution latency.
 
 TensorFlow Lite provides an interface to leverage hardware acceleration, if
-available on the device. It does so via the Android Neural Networks library,
-released as part of Android O-MR1.
+available on the device. It does so via the
+[Android Neural Networks API](https://developer.android.com/ndk/guides/neuralnetworks/index.html),
+available on Android 8.1 (API level 27) and higher.
 
 ## Why do we need a new mobile-specific library?
 
@@ -116,6 +117,10 @@ following:
       Wear](https://research.googleblog.com/2017/02/on-device-machine-intelligence.html)
       to all first-party and third-party apps.
 
+    Also see the complete list of
+    [TensorFlow Lite's supported models](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/g3doc/models.md),
+    including the model sizes, performance numbers, and downloadable model files.
+
 - Quantized versions of the MobileNet model, which runs faster than the
   non-quantized (float) version on CPU.
 
@@ -131,10 +136,10 @@ compatibility with this release.
 ## Getting Started
 
 We recommend you try out TensorFlow Lite with the pre-tested models indicated
-above. If you have an existing mode, you will need to test whether your model is
-compatible with both the converter and the supported operator set.  To test your
-model, see the [documentation on
-GitHub](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite).
+above. If you have an existing model, you will need to test whether your model
+is compatible with both the converter and the supported operator set.  To test
+your model, see the
+[documentation on GitHub](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite).
 
 ### Retrain Inception-V3 or MobileNet for a custom data set
 
diff --git a/tensorflow/docs_src/performance/benchmarks.md b/tensorflow/docs_src/performance/benchmarks.md
index 20165a090efcf26133ff2677fa4914c5153d5249..a5fa551dd4904df3a73c0c2357ab7c79685f0393 100644
--- a/tensorflow/docs_src/performance/benchmarks.md
+++ b/tensorflow/docs_src/performance/benchmarks.md
@@ -403,8 +403,6 @@ GPUs | InceptionV3 (batch size 32) | ResNet-50 (batch size 32)
 This
 [script](https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks)
 was run on the various platforms to generate the above results.
-@{$performance_models$High-Performance Models} details techniques in the script
-along with examples of how to execute the script.
 
 In order to create results that are as repeatable as possible, each test was run
 5 times and then the times were averaged together. GPUs are run in their default
diff --git a/tensorflow/docs_src/performance/index.md b/tensorflow/docs_src/performance/index.md
index 49343eaac7f0a785691a7633d19cc41d281efb99..131d28fa3eb47ff363888934c728e9971283c45d 100644
--- a/tensorflow/docs_src/performance/index.md
+++ b/tensorflow/docs_src/performance/index.md
@@ -1,19 +1,31 @@
 # Performance
 
-Performance is often a significant issue when training a machine learning
-model.  This section explains various ways to optimize performance.  Start
-your investigation with the @{$performance_guide$Performance Guide} and then go
-deeper with techniques detailed in @{$performance_models$High-Performance Models}:
-
-  * @{$performance_guide$Performance Guide}, which contains a collection of best
+Performance is an important consideration when training machine learning
+models. Performance speeds up and scales research while
+also providing end users with near instant predictions. This section provides
+details on the high level APIs to use along with best practices to build
+and train high performance models, and quantize models for the least latency
+and highest throughput for inference.
+
+  * @{$performance_guide$Performance Guide} contains a collection of best
     practices for optimizing your TensorFlow code.
 
-  * @{$performance_models$High-Performance Models}, which contains a collection
-    of advanced techniques to build highly scalable models targeting different
-    system types and network topologies.
+  * @{$datasets_performance$Data input pipeline guide} describes the tf.data
+    API for building efficient data input pipelines for TensorFlow.
+
+  * @{$performance/benchmarks$Benchmarks} contains a collection of
+    benchmark results for a variety of hardware configurations.
+
+  * For improving inference efficiency on mobile and
+    embedded hardware, see
+    @{$quantization$How to Quantize Neural Networks with TensorFlow}, which
+    explains how to use quantization to reduce model size, both in storage
+    and at runtime.
+
+  * For optimizing inference on GPUs, refer to [NVIDIA TensorRT™
+  integration with TensorFlow.](
+    https://medium.com/tensorflow/speed-up-tensorflow-inference-on-gpus-with-tensorrt-13b49f3db3fa)
 
-  * @{$performance/benchmarks$Benchmarks}, which contains a collection of
-    benchmark results.
 
 XLA (Accelerated Linear Algebra) is an experimental compiler for linear
 algebra that optimizes TensorFlow computations. The following guides explore
@@ -36,10 +48,5 @@ XLA:
     standalone tool that compiles TensorFlow graphs into executable code in
     order to optimize performance.
 
-And finally, we offer the following guide:
 
-  * @{$quantization$How to Quantize Neural Networks with TensorFlow}, which
-    can explains how to use quantization to reduce model size, both in storage
-    and at runtime. Quantization can improve performance, especially on
-    mobile hardware.
 
diff --git a/tensorflow/docs_src/performance/leftnav_files b/tensorflow/docs_src/performance/leftnav_files
index 1f894c39fe4554261cd37ebc8cd48af6b36eef43..12e0dbd48ac4913e20a401f5fa1a1fd05a273fc3 100644
--- a/tensorflow/docs_src/performance/leftnav_files
+++ b/tensorflow/docs_src/performance/leftnav_files
@@ -1,7 +1,6 @@
 index.md
 performance_guide.md
 datasets_performance.md
-performance_models.md
 benchmarks.md
 quantization.md
 
diff --git a/tensorflow/docs_src/performance/quantization.md b/tensorflow/docs_src/performance/quantization.md
index 2fea02d861d314cc61f2ba20475bf08ebea8fb5f..c97f74139c6ee852bf29724a3ac335d349a73fd3 100644
--- a/tensorflow/docs_src/performance/quantization.md
+++ b/tensorflow/docs_src/performance/quantization.md
@@ -227,8 +227,8 @@ of 30.0f, and an 8-bit array, the quantized values represent the following:
   <table>
     <tr><th>Quantized</th><th>Float</th></tr>
     <tr><td>0</td><td>-10.0</td></tr>
-    <tr><td>255</td><td>30.0</td></tr>
     <tr><td>128</td><td>10.0</td></tr>
+    <tr><td>255</td><td>30.0</td></tr>
   </table>
   <figcaption>
     <b>Table 2</b>: Example quantized value range
diff --git a/tensorflow/docs_src/performance/xla/operation_semantics.md b/tensorflow/docs_src/performance/xla/operation_semantics.md
index 5887c3d88bf8c7844349cc1cc0db224586e56719..f7e116bf0f85fe94b2167eca8b623207432b38e9 100644
--- a/tensorflow/docs_src/performance/xla/operation_semantics.md
+++ b/tensorflow/docs_src/performance/xla/operation_semantics.md
@@ -581,12 +581,21 @@ Computes a sum across replicas.
 Arguments | Type    | Semantics
 --------- | ------- | -----------------------------
 `operand` | `XlaOp` | Array to sum across replicas.
+| `replica_group_ids`    | `int64` vector | Group ID for each replica.      |
 
 The output shape is the same as the input shape. For example, if there are two
 replicas and the operand has the value `(1.0, 2.5)` and `(3.0, 5.25)`
 respectively on the two replicas, then the output value from this op will be
 `(4.0, 7.75)` on both replicas.
 
+`replica_group_ids` identifies the group ID of each replica. The group ID must
+either be empty (all replicas belong to a single group), or contain the same
+number of elements as the number of replicas. For example, if
+`replica_group_ids` = {0, 1, 2, 3, 0, 1, 2, 3} has eight replicas, there are
+four subgroups of replica IDs: {0, 4}, {1, 5}, {2, 6}, and {3, 7}. The size of
+each subgroup *must* be identical, so, for example, using:
+`replica_group_ids` = {0, 1, 2, 0} for four replicas is invalid.
+
 Computing the result of CrossReplicaSum requires having one input from each
 replica, so if one replica executes a CrossReplicaSum node more times than
 another, then the former replica will wait forever. Since the replicas are all
diff --git a/tensorflow/docs_src/get_started/checkpoints.md b/tensorflow/docs_src/programmers_guide/checkpoints.md
similarity index 100%
rename from tensorflow/docs_src/get_started/checkpoints.md
rename to tensorflow/docs_src/programmers_guide/checkpoints.md
diff --git a/tensorflow/docs_src/get_started/custom_estimators.md b/tensorflow/docs_src/programmers_guide/custom_estimators.md
similarity index 98%
rename from tensorflow/docs_src/get_started/custom_estimators.md
rename to tensorflow/docs_src/programmers_guide/custom_estimators.md
index 275cda12bc397e1a8a980f6c97e6b2d97c5e64e8..fb20b35c128b5bdafbb88ccb19df05f6a73c9977 100644
--- a/tensorflow/docs_src/get_started/custom_estimators.md
+++ b/tensorflow/docs_src/programmers_guide/custom_estimators.md
@@ -5,7 +5,7 @@ This document introduces custom Estimators. In particular, this document
 demonstrates how to create a custom @{tf.estimator.Estimator$Estimator} that
 mimics the behavior of the pre-made Estimator
 @{tf.estimator.DNNClassifier$`DNNClassifier`} in solving the Iris problem. See
-the @{$get_started/premade_estimators$Pre-Made Estimators chapter} for details
+the @{$premade_estimators$Pre-Made Estimators chapter} for details
 on the Iris problem.
 
 To download and access the example code invoke the following two commands:
@@ -84,7 +84,7 @@ and a logits output layer.
 ## Write an Input function
 
 Our custom Estimator implementation uses the same input function as our
-@{$get_started/premade_estimators$pre-made Estimator implementation}, from
+@{$premade_estimators$pre-made Estimator implementation}, from
 [`iris_data.py`](https://github.com/tensorflow/models/blob/master/samples/core/get_started/iris_data.py).
 Namely:
 
@@ -106,8 +106,8 @@ This input function builds an input pipeline that yields batches of
 
 ## Create feature columns
 
-As detailed in the @{$get_started/premade_estimators$Premade Estimators} and
-@{$get_started/feature_columns$Feature Columns} chapters, you must define
+As detailed in the @{$premade_estimators$Premade Estimators} and
+@{$feature_columns$Feature Columns} chapters, you must define
 your model's feature columns to specify how the model should use each feature.
 Whether working with pre-made Estimators or custom Estimators, you define
 feature columns in the same fashion.
@@ -145,7 +145,7 @@ to the constructor are in turn passed on to the `model_fn`. In
 [`custom_estimator.py`](https://github.com/tensorflow/models/blob/master/samples/core/get_started/custom_estimator.py)
 the following lines create the estimator and set the params to configure the
 model. This configuration step is similar to how we configured the @{tf.estimator.DNNClassifier} in
-@{$get_started/premade_estimators}.
+@{$premade_estimators}.
 
 ```python
 classifier = tf.estimator.Estimator(
@@ -489,7 +489,7 @@ configure your Estimator without modifying the code in the `model_fn`.
 
 The rest of the code to train, evaluate, and generate predictions using our
 Estimator is the same as in the
-@{$get_started/premade_estimators$Premade Estimators} chapter. For
+@{$premade_estimators$Premade Estimators} chapter. For
 example, the following line will train the model:
 
 ```python
diff --git a/tensorflow/docs_src/programmers_guide/debugger.md b/tensorflow/docs_src/programmers_guide/debugger.md
index 6bd941886d7fe883f2fc61a97dc1494e033ba8ac..fc845c68f48204ab5e9796234a6c630f27704def 100644
--- a/tensorflow/docs_src/programmers_guide/debugger.md
+++ b/tensorflow/docs_src/programmers_guide/debugger.md
@@ -33,8 +33,9 @@ and [`inf`s](https://en.wikipedia.org/wiki/Infinity), a frequently-encountered
 type of bug in TensorFlow model development.
 The following example is for users who use the low-level
 [`Session`](https://www.tensorflow.org/api_docs/python/tf/Session) API of
-TensorFlow. A later section of this document describes how to use **tfdbg**
-with a higher-level API, namely `Estimator`s.
+TensorFlow. Later sections of this document describe how to use **tfdbg**
+with higher-level APIs of TensorFlow, including `tf.estimator`,
+`tf.keras` / `keras` and `tf.contrib.slim`.
 To *observe* such an issue, run the following command without the debugger (the
 source code can be found
 [here](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/debug/examples/debug_mnist.py)):
@@ -477,20 +478,31 @@ for more details.
 
 ## Debugging Keras Models with TFDBG
 
-To use TFDBG with [Keras](https://keras.io/), let the Keras backend use
-a TFDBG-wrapped Session object. For example, to use the CLI wrapper:
+To use TFDBG with
+[tf.keras](https://www.tensorflow.org/api_docs/python/tf/keras),
+let the Keras backend use a TFDBG-wrapped Session object. For example, to use
+the CLI wrapper:
 
 ``` python
 import tensorflow as tf
-from keras import backend as keras_backend
 from tensorflow.python import debug as tf_debug
 
-keras_backend.set_session(tf_debug.LocalCLIDebugWrapperSession(tf.Session()))
+tf.keras.backend.set_session(tf_debug.LocalCLIDebugWrapperSession(tf.Session()))
 
 # Define your keras model, called "model".
-model.fit(...)  # This will break into the TFDBG CLI.
+
+# Calls to `fit()`, 'evaluate()` and `predict()` methods will break into the
+# TFDBG CLI.
+model.fit(...)
+model.evaluate(...)
+model.predict(...)
 ```
 
+With minor modification, the preceding code example also works for the
+[non-TensorFlow version of Keras](https://keras.io/) running against a
+TensorFlow backend. You just need to replace `tf.keras.backend` with
+`keras.backend`.
+
 ## Debugging tf-slim with TFDBG
 
 TFDBG supports debugging of training and evaluation with
diff --git a/tensorflow/docs_src/programmers_guide/eager.md b/tensorflow/docs_src/programmers_guide/eager.md
index 00d02b44558d023d4cb3d8e2bce01c4e1911a302..b2bc3273b4529233faaaca8ce325f4e12979faa4 100644
--- a/tensorflow/docs_src/programmers_guide/eager.md
+++ b/tensorflow/docs_src/programmers_guide/eager.md
@@ -149,16 +149,17 @@ it to implement your own layer:
 ```py
 class MySimpleLayer(tf.keras.layers.Layer):
   def __init__(self, output_units):
+    super(MySimpleLayer, self).__init__()
     self.output_units = output_units
 
-  def build(self, input):
+  def build(self, input_shape):
     # The build method gets called the first time your layer is used.
     # Creating variables on build() allows you to make their shape depend
-    # on the input shape and hence remove the need for the user to specify
+    # on the input shape and hence removes the need for the user to specify
     # full shapes. It is possible to create variables during __init__() if
     # you already know their full shapes.
     self.kernel = self.add_variable(
-      "kernel", [input.shape[-1], self.output_units])
+      "kernel", [input_shape[-1], self.output_units])
 
   def call(self, input):
     # Override call() instead of __call__ so we can perform some bookkeeping.
diff --git a/tensorflow/docs_src/programmers_guide/estimators.md b/tensorflow/docs_src/programmers_guide/estimators.md
index de830112e001ee4b5204454725317e5d661af5ec..b13b47184d2b32fffb2390b0318fba8612d7826a 100644
--- a/tensorflow/docs_src/programmers_guide/estimators.md
+++ b/tensorflow/docs_src/programmers_guide/estimators.md
@@ -133,7 +133,7 @@ The heart of every Estimator--whether pre-made or custom--is its
 evaluation, and prediction. When you are using a pre-made Estimator,
 someone else has already implemented the model function. When relying
 on a custom Estimator, you must write the model function yourself. A
-@{$get_started/custom_estimators$companion document}
+@{$custom_estimators$companion document}
 explains how to write the model function.
 
 
diff --git a/tensorflow/docs_src/get_started/feature_columns.md b/tensorflow/docs_src/programmers_guide/feature_columns.md
similarity index 99%
rename from tensorflow/docs_src/get_started/feature_columns.md
rename to tensorflow/docs_src/programmers_guide/feature_columns.md
index f152813442903a4cbe2cca8b56c90b49d9718c3b..90f5c53a17f23200f238f6b0d171e1e225330e27 100644
--- a/tensorflow/docs_src/get_started/feature_columns.md
+++ b/tensorflow/docs_src/programmers_guide/feature_columns.md
@@ -5,7 +5,7 @@ intermediaries between raw data and Estimators. Feature columns are very rich,
 enabling you to transform a diverse range of raw data into formats that
 Estimators can use, allowing easy experimentation.
 
-In @{$get_started/premade_estimators$Premade Estimators}, we used the premade
+In @{$premade_estimators$Premade Estimators}, we used the premade
 Estimator, @{tf.estimator.DNNClassifier$`DNNClassifier`} to train a model to
 predict different types of Iris flowers from four input features. That example
 created only numerical feature columns (of type
diff --git a/tensorflow/docs_src/programmers_guide/index.md b/tensorflow/docs_src/programmers_guide/index.md
index 648d001bd3535fe3dcc460c9ebdb6e6a997dc332..0c2d4afb115c592c1925dde98b3a1a8c2a7ccad1 100644
--- a/tensorflow/docs_src/programmers_guide/index.md
+++ b/tensorflow/docs_src/programmers_guide/index.md
@@ -5,11 +5,31 @@ works. The units are as follows:
 
 ## High Level APIs
 
-  * @{$programmers_guide/eager}, which is the easiest way to use TensorFlow.
-  * @{$programmers_guide/estimators}, which introduces a high-level
-    TensorFlow API that greatly simplifies ML programming.
-  * @{$programmers_guide/datasets}, which explains how to
-    set up data pipelines to read data sets into your TensorFlow program.
+  * @{$programmers_guide/keras}, TensorFlow's high-level API for building and
+    training deep learning models.
+  * @{$programmers_guide/eager}, an API for writing TensorFlow code
+    imperatively, like you would use Numpy.
+  * @{$programmers_guide/estimators}, a high-level API that provides
+    fully-packaged models ready for large-scale training and production.
+  * @{$programmers_guide/datasets}, easy input pipelines to bring your data into
+    your TensorFlow program.
+
+## Estimators
+
+* @{$estimators} provides an introduction.
+* @{$premade_estimators}, introduces Estimators for machine learning.
+* @{$custom_estimators}, which demonstrates how to build and train models you
+  design yourself.
+* @{$feature_columns}, which shows how an Estimator can handle a variety of input
+  data types without changes to the model.
+* @{$checkpoints}, which explains how to save training progress and resume where
+  you left off.
+
+## Accelerators
+
+  * @{$using_gpu} explains how TensorFlow assigns operations to
+    devices and how you can change the arrangement manually.
+  * @{$using_tpu} explains how to modify `Estimator` programs to run on a TPU.
 
 ## Low Level APIs
 
@@ -32,13 +52,6 @@ works. The units are as follows:
   * @{$programmers_guide/saved_model}, which
     explains how to save and restore variables and models.
 
-## Accelerators
-
-  * @{$using_gpu} explains how TensorFlow assigns operations to
-    devices and how you can change the arrangement manually.
-  * @{$using_tpu} explains how to modify `Estimator` programs to run on a TPU.
-
-
 ## ML Concepts
 
   * @{$programmers_guide/embedding}, which introduces the concept
diff --git a/tensorflow/docs_src/programmers_guide/keras.md b/tensorflow/docs_src/programmers_guide/keras.md
new file mode 100644
index 0000000000000000000000000000000000000000..c6aca7ebf4edd085e2b47492fd3a86578620492d
--- /dev/null
+++ b/tensorflow/docs_src/programmers_guide/keras.md
@@ -0,0 +1,623 @@
+# Keras
+
+Keras is a high-level API to build and train deep learning models. It's used for
+fast prototyping, advanced research, and production, with three key advantages:
+
+- *User friendly*<br>
+  Keras has a simple, consistent interface optimized for common use cases. It
+  provides clear and actionable feedback for user errors.
+- *Modular and composable*<br>
+  Keras models are made by connecting configurable building blocks together,
+  with few restrictions.
+- *Easy to extend*<br> Write custom building blocks to express new ideas for
+  research. Create new layers, loss functions, and develop state-of-the-art
+  models.
+
+## Import tf.keras
+
+`tf.keras` is TensorFlow's implementation of the
+[Keras API specification](https://keras.io){:.external}. This is a high-level
+API to build and train models that includes first-class support for
+TensorFlow-specific functionality, such as [eager execution](#eager_execution),
+`tf.data` pipelines, and [Estimators](/programmers_guide/estimators).
+`tf.keras` makes TensorFlow easier to use without sacrificing flexibility and
+performance.
+
+To get started, import `tf.keras` as part of your TensorFlow program setup:
+
+```python
+import tensorflow as tf
+from tensorflow import keras
+```
+
+`tf.keras` can run any Keras-compatible code, but keep in mind:
+
+* The `tf.keras` version in the latest TensorFlow release might not be the same
+  as the latest `keras` version from PyPI. Check `tf.keras.__version__`.
+* When [saving a model's weights](#weights_only), `tf.keras` defaults to the
+  [checkpoint format](/get_started/checkpoints). Pass `save_format='h5'` to use
+  HDF5.
+
+## Build a simple model
+
+### Sequential model
+
+In Keras, you assemble *layers* to build *models*. A model is (usually) a graph
+of layers. The most common type of model is a stack of layers: the
+`tf.keras.Sequential` model.
+
+To build a simple, fully-connected network (i.e. multi-layer perceptron):
+
+```python
+model = keras.Sequential()
+# Adds a densely-connected layer with 64 units to the model:
+model.add(keras.layers.Dense(64, activation='relu'))
+# Add another:
+model.add(keras.layers.Dense(64, activation='relu'))
+# Add a softmax layer with 10 output units:
+model.add(keras.layers.Dense(10, activation='softmax'))
+```
+
+### Configure the layers
+
+There are many `tf.keras.layers` available with some common constructor
+parameters:
+
+* `activation`: Set the activation function for the layer. This parameter is
+  specified by the name of a built-in function or as a callable object. By
+  default, no activation is applied.
+* `kernel_initializer` and `bias_initializer`: The initialization schemes
+  that create the layer's weights (kernel and bias). This parameter is a name or
+  a callable object. This defaults to the `"Glorot uniform"` initializer.
+* `kernel_regularizer` and `bias_regularizer`: The regularization schemes
+  that apply the layer's weights (kernel and bias), such as L1 or L2
+  regularization. By default, no regularization is applied.
+
+The following instantiates `tf.keras.layers.Dense` layers using constructor
+arguments:
+
+```python
+# Create a sigmoid layer:
+layers.Dense(64, activation='sigmoid')
+# Or:
+layers.Dense(64, activation=tf.sigmoid)
+
+# A linear layer with L1 regularization of factor 0.01 applied to the kernel matrix:
+layers.Dense(64, kernel_regularizer=keras.regularizers.l1(0.01))
+# A linear layer with L2 regularization of factor 0.01 applied to the bias vector:
+layers.Dense(64, bias_regularizer=keras.regularizers.l2(0.01))
+
+# A linear layer with a kernel initialized to a random orthogonal matrix:
+layers.Dense(64, kernel_initializer='orthogonal')
+# A linear layer with a bias vector initialized to 2.0s:
+layers.Dense(64, bias_initializer=keras.initializers.constant(2.0))
+```
+
+## Train and evaluate
+
+### Set up training
+
+After the model is constructed, configure its learning process by calling the
+`compile` method:
+
+```python
+model.compile(optimizer=tf.train.AdamOptimizer(0.001),
+              loss='categorical_crossentropy',
+              metrics=['accuracy'])
+```
+
+`tf.keras.Model.compile` takes three important arguments:
+
+* `optimizer`: This object specifies the training procedure. Pass it optimizer
+  instances from the `tf.train` module, such as
+  [`AdamOptimizer`](/api_docs/python/tf/train/AdamOptimizer),
+  [`RMSPropOptimizer`](/api_docs/python/tf/train/RMSPropOptimizer), or
+  [`GradientDescentOptimizer`](/api_docs/python/tf/train/GradientDescentOptimizer).
+* `loss`: The function to minimize during optimization. Common choices include
+  mean square error (`mse`), `categorical_crossentropy`, and
+  `binary_crossentropy`. Loss functions are specified by name or by
+  passing a callable object from the `tf.keras.losses` module.
+* `metrics`: Used to monitor training. These are string names or callables from
+  the `tf.keras.metrics` module.
+
+The following shows a few examples of configuring a model for training:
+
+```python
+# Configure a model for mean-squared error regression.
+model.compile(optimizer=tf.train.AdamOptimizer(0.01),
+              loss='mse',       # mean squared error
+              metrics=['mae'])  # mean absolute error
+
+# Configure a model for categorical classification.
+model.compile(optimizer=tf.train.RMSPropOptimizer(0.01),
+              loss=keras.losses.categorical_crossentropy,
+              metrics=[keras.metrics.categorical_accuracy])
+```
+
+### Input NumPy data
+
+For small datasets, use in-memory [NumPy](https://www.numpy.org/){:.external}
+arrays to train and evaluate a model. The model is "fit" to the training data
+using the `fit` method:
+
+```python
+import numpy as np
+
+data = np.random.random((1000, 32))
+labels = np.random.random((1000, 10))
+
+model.fit(data, labels, epochs=10, batch_size=32)
+```
+
+`tf.keras.Model.fit` takes three important arguments:
+
+* `epochs`: Training is structured into *epochs*. An epoch is one iteration over
+  the entire input data (this is done in smaller batches).
+* `batch_size`: When passed NumPy data, the model slices the data into smaller
+  batches and iterates over these batches during training. This integer
+  specifies the size of each batch. Be aware that the last batch may be smaller
+  if the total number of samples is not divisible by the batch size.
+* `validation_data`: When prototyping a model, you want to easily monitor its
+  performance on some validation data. Passing this argument—a tuple of inputs
+  and labels—allows the model to display the loss and metrics in inference mode
+  for the passed data, at the end of each epoch.
+
+Here's an example using `validation_data`:
+
+```python
+import numpy as np
+
+data = np.random.random((1000, 32))
+labels = np.random.random((1000, 10))
+
+val_data = np.random.random((100, 32))
+val_labels = np.random.random((100, 10))
+
+model.fit(data, labels, epochs=10, batch_size=32,
+          validation_data=(val_data, val_labels))
+```
+
+### Input tf.data datasets
+
+Use the [Datasets API](/programmers_guide/datasets) to scale to large datasets
+or multi-device training. Pass a `tf.data.Dataset` instance to the `fit`
+method:
+
+```python
+# Instantiates a toy dataset instance:
+dataset = tf.data.Dataset.from_tensor_slices((data, labels))
+dataset = dataset.batch(32)
+dataset = dataset.repeat()
+
+# Don't forget to specify `steps_per_epoch` when calling `fit` on a dataset.
+model.fit(dataset, epochs=10, steps_per_epoch=30)
+```
+
+Here, the `fit` method uses the `steps_per_epoch` argument—this is the number of
+training steps the model runs before it moves to the next epoch. Since the
+`Dataset` yields batches of data, this snippet does not require a `batch_size`.
+
+Datasets can also be used for validation:
+
+```python
+dataset = tf.data.Dataset.from_tensor_slices((data, labels))
+dataset = dataset.batch(32).repeat()
+
+val_dataset = tf.data.Dataset.from_tensor_slices((val_data, val_labels))
+val_dataset = val_dataset.batch(32).repeat()
+
+model.fit(dataset, epochs=10, steps_per_epoch=30,
+          validation_data=val_dataset,
+          validation_steps=3)
+```
+
+### Evaluate and predict
+
+The `tf.keras.Model.evaluate` and `tf.keras.Model.predict` methods can use NumPy
+data and a `tf.data.Dataset`.
+
+To *evaluate* the inference-mode loss and metrics for the data provided:
+
+```python
+model.evaluate(x, y, batch_size=32)
+
+model.evaluate(dataset, steps=30
+```
+
+And to *predict* the output of the last layer in inference for the data provided,
+as a NumPy array:
+
+```
+model.predict(x, batch_size=32)
+
+model.predict(dataset, steps=30)
+```
+
+
+## Build advanced models
+
+### Functional API
+
+The `tf.keras.Sequential` model is a simple stack of layers that cannot
+represent arbitrary models. Use the
+[Keras functional API](https://keras.io/getting-started/functional-api-guide/){:.external}
+to build complex model topologies such as:
+
+* Multi-input models,
+* Multi-output models,
+* Models with shared layers (the same layer called several times),
+* Models with non-sequential data flows (e.g. residual connections).
+
+Building a model with the functional API works like this:
+
+1. A layer instance is callable and returns a tensor.
+2. Input tensors and output tensors are used to define a `tf.keras.Model`
+   instance.
+3. This model is trained just like the `Sequential` model.
+
+The following example uses the functional API to build a simple, fully-connected
+network:
+
+```python
+inputs = keras.Input(shape=(32,))  # Returns a placeholder tensor
+
+# A layer instance is callable on a tensor, and returns a tensor.
+x = keras.layers.Dense(64, activation='relu')(inputs)
+x = keras.layers.Dense(64, activation='relu')(x)
+predictions = keras.layers.Dense(10, activation='softmax')(x)
+
+# Instantiate the model given inputs and outputs.
+model = keras.Model(inputs=inputs, outputs=predictions)
+
+# The compile step specifies the training configuration.
+model.compile(optimizer=tf.train.RMSPropOptimizer(0.001),
+              loss='categorical_crossentropy',
+              metrics=['accuracy'])
+
+# Trains for 5 epochs
+model.fit(data, labels, batch_size=32, epochs=5)
+```
+
+### Model subclassing
+
+Build a fully-customizable model by subclassing `tf.keras.Model` and defining
+your own forward pass. Create layers in the `__init__` method and set them as
+attributes of the class instance. Define the forward pass in the `call` method.
+
+Model subclassing is particularly useful when
+[eager execution](/programmers_guide/eager) is enabled since the forward pass
+can be written imperatively.
+
+Key Point: Use the right API for the job. While model subclassing offers
+flexibility, it comes at a cost of greater complexity and more opportunities for
+user errors. If possible, prefer the functional API.
+
+The following example shows a subclassed `tf.keras.Model` using a custom forward
+pass:
+
+```python
+class MyModel(keras.Model):
+
+  def __init__(self, num_classes=10):
+    super(MyModel, self).__init__(name='my_model')
+    self.num_classes = num_classes
+    # Define your layers here.
+    self.dense_1 = keras.layers.Dense(32, activation='relu')
+    self.dense_2 = keras.layers.Dense(num_classes, activation='sigmoid')
+
+  def call(self, inputs):
+    # Define your forward pass here,
+    # using layers you previously defined (in `__init__`).
+    x = self.dense_1(inputs)
+    return self.dense_2(x)
+
+  def compute_output_shape(self, input_shape):
+    # You need to override this function if you want to use the subclassed model
+    # as part of a functional-style model.
+    # Otherwise, this method is optional.
+    shape = tf.TensorShape(input_shape).as_list()
+    shape[-1] = self.num_classes
+    return tf.TensorShape(shape)
+
+
+# Instantiates the subclassed model.
+model = MyModel(num_classes=10)
+
+# The compile step specifies the training configuration.
+model.compile(optimizer=tf.train.RMSPropOptimizer(0.001),
+              loss='categorical_crossentropy',
+              metrics=['accuracy'])
+
+# Trains for 5 epochs.
+model.fit(data, labels, batch_size=32, epochs=5)
+```
+
+
+### Custom layers
+
+Create a custom layer by subclassing `tf.keras.layers.Layer` and implementing
+the following methods:
+
+* `build`: Create the weights of the layer. Add weights with the `add_weight`
+  method.
+* `call`: Define the forward pass.
+* `compute_output_shape`: Specify how to compute the output shape of the layer
+  given the input shape.
+* Optionally, a layer can be serialized by implementing the `get_config` method
+  and the `from_config` class method.
+
+Here's an example of a custom layer that implements a `matmul` of an input with
+a kernel matrix:
+
+```python
+class MyLayer(keras.layers.Layer):
+
+  def __init__(self, output_dim, **kwargs):
+    self.output_dim = output_dim
+    super(MyLayer, self).__init__(**kwargs)
+
+  def build(self, input_shape):
+    shape = tf.TensorShape((input_shape[1], self.output_dim))
+    # Create a trainable weight variable for this layer.
+    self.kernel = self.add_weight(name='kernel',
+                                  shape=shape,
+                                  initializer='uniform',
+                                  trainable=True)
+    # Be sure to call this at the end
+    super(MyLayer, self).build(input_shape)
+
+  def call(self, inputs):
+    return tf.matmul(inputs, self.kernel)
+
+  def compute_output_shape(self, input_shape):
+    shape = tf.TensorShape(input_shape).as_list()
+    shape[-1] = self.output_dim
+    return tf.TensorShape(shape)
+
+  def get_config(self):
+    base_config = super(MyLayer, self).get_config()
+    base_config['output_dim'] = self.output_dim
+
+  @classmethod
+  def from_config(cls, config):
+    return cls(**config)
+
+
+# Create a model using the custom layer
+model = keras.Sequential([MyLayer(10),
+                          keras.layers.Activation('softmax')])
+
+# The compile step specifies the training configuration
+model.compile(optimizer=tf.train.RMSPropOptimizer(0.001),
+              loss='categorical_crossentropy',
+              metrics=['accuracy'])
+
+# Trains for 5 epochs.
+model.fit(data, targets, batch_size=32, epochs=5)
+```
+
+
+## Callbacks
+
+A callback is an object passed to a model to customize and extend its behavior
+during training. You can write your own custom callback, or use the built-in
+`tf.keras.callbacks` that include:
+
+* `tf.keras.callbacks.ModelCheckpoint`: Save checkpoints of your model at
+  regular intervals.
+* `tf.keras.callbacks.LearningRateScheduler`: Dynamically change the learning
+  rate.
+* `tf.keras.callbacks.EarlyStopping`: Interrupt training when validation
+  performance has stopped improving.
+* `tf.keras.callbacks.TensorBoard`: Monitor the model's behavior using
+  [TensorBoard](/programmers_guide/summaries_and_tensorboard).
+
+To use a `tf.keras.callbacks.Callback`, pass it to the model's `fit` method:
+
+```python
+callbacks = [
+  # Interrupt training if `val_loss` stops improving for over 2 epochs
+  keras.callbacks.EarlyStopping(patience=2, monitor='val_loss'),
+  # Write TensorBoard logs to `./logs` directory
+  keras.callbacks.TensorBoard(log_dir='./logs')
+]
+model.fit(data, labels, batch_size=32, epochs=5, callbacks=callbacks,
+          validation_data=(val_data, val_targets))
+```
+
+
+## Save and restore
+
+### Weights only
+
+Save and load the weights of a model using `tf.keras.Model.save_weights`:
+
+```python
+# Save weights to a TensorFlow Checkpoint file
+model.save_weights('./my_model')
+
+# Restore the model's state,
+# this requires a model with the same architecture.
+model.load_weights('my_model')
+```
+
+By default, this saves the model's weights in the
+[TensorFlow checkpoint](/get_started/checkpoints) file format. Weights can also
+be saved to the Keras HDF5 format (the default for the multi-backend
+implementation of Keras):
+
+```python
+# Save weights to a HDF5 file
+model.save_weights('my_model.h5', save_format='h5')
+
+# Restore the model's state
+model.load_weights('my_model.h5')
+```
+
+
+### Configuration only
+
+A model's configuration can be saved—this serializes the model architecture
+without any weights. A saved configuration can recreate and initialize the same
+model, even without the code that defined the original model. Keras supports
+JSON and YAML serialization formats:
+
+```python
+# Serialize a model to JSON format
+json_string = model.to_json()
+
+# Recreate the model (freshly initialized)
+fresh_model = keras.models.from_json(json_string)
+
+# Serializes a model to YAML format
+yaml_string = model.to_yaml()
+
+# Recreate the model
+fresh_model = keras.models.from_yaml(yaml_string)
+```
+
+Caution: Subclassed models are not serializable because their architecture is
+defined by the Python code in the body of the `call` method.
+
+
+### Entire model
+
+The entire model can be saved to a file that contains the weight values, the
+model's configuration, and even the optimizer's configuration. This allows you
+to checkpoint a model and resume training later—from the exact same
+state—without access to the original code.
+
+```python
+# Create a trivial model
+model = keras.Sequential([
+  keras.layers.Dense(10, activation='softmax', input_shape=(32,)),
+  keras.layers.Dense(10, activation='softmax')
+])
+model.compile(optimizer='rmsprop',
+              loss='categorical_crossentropy',
+              metrics=['accuracy'])
+model.fit(data, targets, batch_size=32, epochs=5)
+
+
+# Save entire model to a HDF5 file
+model.save('my_model.h5')
+
+# Recreate the exact same model, including weights and optimizer.
+model = keras.models.load_model('my_model.h5')
+```
+
+
+## Eager execution
+
+[Eager execution](/programmers_guide/eager) is an imperative programming
+environment that evaluates operations immediately. This is not required for
+Keras, but is supported by `tf.keras` and useful for inspecting your program and
+debugging.
+
+All of the `tf.keras` model-building APIs are compatible with eager execution.
+And while the `Sequential` and functional APIs can be used, eager execution
+especially benefits *model subclassing* and building *custom layers*—the APIs
+that require you to write the forward pass as code (instead of the APIs that
+create models by assembling existing layers).
+
+See the [eager execution guide](/programmers_guide/eager#build_a_model) for
+examples of using Keras models with custom training loops and `tf.GradientTape`.
+
+
+## Distribution
+
+### Estimators
+
+The [Estimators](/programmers_guide/estimators) API is used for training models
+for distributed environments. This targets industry use cases such as
+distributed training on large datasets that can export a model for production.
+
+A `tf.keras.Model` can be trained with the `tf.estimator` API by converting the
+model to an `tf.estimator.Estimator` object with
+`tf.keras.estimator.model_to_estimator`. See
+[Creating Estimators from Keras models](/programmers_guide/estimators#creating_estimators_from_keras_models).
+
+```python
+model = keras.Sequential([layers.Dense(10,activation='softmax'),
+                          layers.Dense(10,activation='softmax')])
+
+model.compile(optimizer=tf.train.RMSPropOptimizer(0.001),
+              loss='categorical_crossentropy',
+              metrics=['accuracy'])
+
+estimator = keras.estimator.model_to_estimator(model)
+```
+
+Note: Enable [eager execution](/programmers_guide/eager) for debugging
+[Estimator input functions](/programmers_guide/premade_estimators#create_input_functions)
+and inspecting data.
+
+### Multiple GPUs
+
+`tf.keras` models can run on multiple GPUs using
+`tf.contrib.distribute.DistributionStrategy`. This API provides distributed
+training on multiple GPUs with almost no changes to existing code.
+
+Currently, `tf.contrib.distribute.MirroredStrategy` is the only supported
+distribution strategy. `MirroredStrategy` does in-graph replication with
+synchronous training using all-reduce on a single machine. To use
+`DistributionStrategy` with Keras, convert the `tf.keras.Model` to a
+`tf.estimator.Estimator` with `tf.keras.estimator.model_to_estimator`, then
+train the estimator
+
+The following example distributes a `tf.keras.Model` across multiple GPUs on a
+single machine.
+
+First, define a simple model:
+
+```python
+model = keras.Sequential()
+model.add(keras.layers.Dense(16, activation='relu', input_shape=(10,)))
+model.add(keras.layers.Dense(1, activation='sigmoid'))
+
+optimizer = tf.train.GradientDescentOptimizer(0.2)
+
+model.compile(loss='binary_crossentropy', optimizer=optimizer)
+model.summary()
+```
+
+Convert the Keras model to a `tf.estimator.Estimator` instance:
+
+```python
+keras_estimator = keras.estimator.model_to_estimator(
+  keras_model=model,
+  config=config,
+  model_dir='/tmp/model_dir')
+```
+
+Define an *input pipeline*. The `input_fn` returns a `tf.data.Dataset` object
+used to distribute the data across multiple devices—with each device processing
+a slice of the input batch.
+
+```python
+def input_fn():
+  x = np.random.random((1024, 10))
+  y = np.random.randint(2, size=(1024, 1))
+  x = tf.cast(x, tf.float32)
+  dataset = tf.data.Dataset.from_tensor_slices((x, y))
+  dataset = dataset.repeat(10)
+  dataset = dataset.batch(32)
+  return dataset
+```
+
+Next, create a `tf.estimator.RunConfig` and set the `train_distribute` argument
+to the `tf.contrib.distribute.MirroredStrategy` instance. When creating
+`MirroredStrategy`, you can specify a list of devices or set the `num_gpus`
+argument. The default uses all available GPUs, like the following:
+
+```python
+strategy = tf.contrib.distribute.MirroredStrategy()
+config = tf.estimator.RunConfig(train_distribute=strategy)
+```
+
+Finally, train the `Estimator` instance by providing the `input_fn` and `steps`
+arguments:
+
+```python
+keras_estimator.train(input_fn=input_fn, steps=10)
+```
diff --git a/tensorflow/docs_src/programmers_guide/leftnav_files b/tensorflow/docs_src/programmers_guide/leftnav_files
index 7ac63bf2e019fc3b6aa7ab1b3e6422a97858d8c6..3bcf864e13db0cef40cec74ab872c807c2ec2fb0 100644
--- a/tensorflow/docs_src/programmers_guide/leftnav_files
+++ b/tensorflow/docs_src/programmers_guide/leftnav_files
@@ -1,9 +1,20 @@
 index.md
 
 ### High Level APIs
+keras.md
 eager.md
 datasets.md
-estimators.md
+
+### Estimators
+estimators.md: Introduction to Estimators
+premade_estimators.md
+custom_estimators.md
+feature_columns.md
+checkpoints.md
+
+### Accelerators
+using_gpu.md
+using_tpu.md
 
 ### Low Level APIs
 low_level_intro.md
@@ -12,10 +23,6 @@ variables.md
 graphs.md
 saved_model.md
 
-### Accelerators
-using_gpu.md
-using_tpu.md
-
 ### ML Concepts
 embedding.md
 
diff --git a/tensorflow/docs_src/programmers_guide/low_level_intro.md b/tensorflow/docs_src/programmers_guide/low_level_intro.md
index 05709ad10a9275953d351e4a62cbf6d7fbffbbe3..478e2bb70bc7f58156398c9f9fef4e76ba581e1a 100644
--- a/tensorflow/docs_src/programmers_guide/low_level_intro.md
+++ b/tensorflow/docs_src/programmers_guide/low_level_intro.md
@@ -9,7 +9,7 @@ This guide gets you started programming in the low-level TensorFlow APIs
   * Use high level components ([datasets](#datasets), [layers](#layers), and
     [feature_columns](#feature_columns)) in this low level environment.
   * Build your own training loop, instead of using the one
-    @{$get_started/premade_estimators$provided by Estimators}.
+    @{$premade_estimators$provided by Estimators}.
 
 We recommend using the higher level APIs to build models when possible.
 Knowing TensorFlow Core is valuable for the following reasons:
@@ -398,7 +398,7 @@ and layer reuse impossible.
 
 The easiest way to experiment with feature columns is using the
 @{tf.feature_column.input_layer} function. This function only accepts
-@{$get_started/feature_columns$dense columns} as inputs, so to view the result
+@{$feature_columns$dense columns} as inputs, so to view the result
 of a categorical column you must wrap it in an
 @{tf.feature_column.indicator_column}. For example:
 
@@ -589,7 +589,7 @@ print(sess.run(y_pred))
 
 To learn more about building models with TensorFlow consider the following:
 
-* @{$get_started/custom_estimators$Custom Estimators}, to learn how to build
+* @{$custom_estimators$Custom Estimators}, to learn how to build
   customized models with TensorFlow. Your knowledge of TensorFlow Core will
   help you understand and debug your own models.
 
diff --git a/tensorflow/docs_src/get_started/premade_estimators.md b/tensorflow/docs_src/programmers_guide/premade_estimators.md
similarity index 96%
rename from tensorflow/docs_src/get_started/premade_estimators.md
rename to tensorflow/docs_src/programmers_guide/premade_estimators.md
index 15853bc0abbf9bab0c96f34475890056c3291045..f6dd75eacab1c99215ab918a0854b0a33d0d9cca 100644
--- a/tensorflow/docs_src/get_started/premade_estimators.md
+++ b/tensorflow/docs_src/programmers_guide/premade_estimators.md
@@ -177,13 +177,11 @@ other features so you can concentrate on your model. For more details see
 
 An Estimator is any class derived from @{tf.estimator.Estimator}. TensorFlow
 provides a collection of
-[pre-made Estimators](https://developers.google.com/machine-learning/glossary/#premade_Estimator)
+@{tf.estimator$pre-made Estimators}
 (for example, `LinearRegressor`) to implement common ML algorithms. Beyond
 those, you may write your own
-[custom Estimators](https://developers.google.com/machine-learning/glossary/#custom_Estimator).
-We recommend using pre-made Estimators when just getting started with
-TensorFlow. After gaining expertise with the pre-made Estimators, we recommend
-optimizing your model by creating your own custom Estimators.
+@{$custom_estimators$custom Estimators}.
+We recommend using pre-made Estimators when just getting started.
 
 To write a TensorFlow program based on pre-made Estimators, you must perform the
 following tasks:
@@ -289,7 +287,7 @@ for key in train_x.keys():
 ```
 
 Feature columns can be far more sophisticated than those we're showing here.  We
-detail feature columns @{$get_started/feature_columns$later on} in our Getting
+detail feature columns @{$feature_columns$later on} in our Getting
 Started guide.
 
 Now that we have the description of how we want the model to represent the raw
@@ -425,11 +423,10 @@ Pre-made Estimators are an effective way to quickly create standard models.
 Now that you've gotten started writing TensorFlow programs, consider the
 following material:
 
-* @{$get_started/checkpoints$Checkpoints} to learn how to save and restore
-  models.
+* @{$checkpoints$Checkpoints} to learn how to save and restore models.
 * @{$get_started/datasets_quickstart$Datasets} to learn more about importing
   data into your
   model.
-* @{$get_started/custom_estimators$Creating Custom Estimators} to learn how to
+* @{$custom_estimators$Creating Custom Estimators} to learn how to
   write your own Estimator, customized for a particular problem.
 
diff --git a/tensorflow/docs_src/programmers_guide/using_tpu.md b/tensorflow/docs_src/programmers_guide/using_tpu.md
index 5e3e49d43402cd76f8b7062483259df4598bd8ff..44aabf05571bb7f325a5d642f06362e0088607d2 100644
--- a/tensorflow/docs_src/programmers_guide/using_tpu.md
+++ b/tensorflow/docs_src/programmers_guide/using_tpu.md
@@ -22,8 +22,8 @@ Standard `Estimators` can drive models on CPU and GPUs. You must use
 @{tf.contrib.tpu.TPUEstimator} to drive a model on TPUs.
 
 Refer to TensorFlow's Getting Started section for an introduction to the basics
-of using a @{$get_started/premade_estimators$pre-made `Estimator`}, and
-@{$get_started/custom_estimators$custom `Estimator`s}.
+of using a @{$premade_estimators$pre-made `Estimator`}, and
+@{$custom_estimators$custom `Estimator`s}.
 
 The `TPUEstimator` class differs somewhat from the `Estimator` class.
 
diff --git a/tensorflow/docs_src/tutorials/kernel_methods.md b/tensorflow/docs_src/tutorials/kernel_methods.md
index 73e5c5105784ddc9729b8cea6cd31921572837e1..205e2a2d2c1d1008e62ca4c2caf9f1b0895dff1a 100644
--- a/tensorflow/docs_src/tutorials/kernel_methods.md
+++ b/tensorflow/docs_src/tutorials/kernel_methods.md
@@ -53,7 +53,7 @@ In order to feed data to a `tf.contrib.learn Estimator`, it is helpful to conver
 it to Tensors. For this, we will use an `input function` which adds Ops to the
 TensorFlow graph that, when executed, create mini-batches of Tensors to be used
 downstream. For more background on input functions, check
-@{$get_started/premade_estimators#create_input_functions$this section on input functions}.
+@{$premade_estimators#create_input_functions$this section on input functions}.
 In this example, we will use the `tf.train.shuffle_batch` Op which, besides
 converting numpy arrays to Tensors, allows us to specify the batch_size and
 whether to randomize the input every time the input_fn Ops are executed
diff --git a/tensorflow/docs_src/tutorials/layers.md b/tensorflow/docs_src/tutorials/layers.md
index 496b1e4da9d3b85d88be4dd86086e04fda51b8be..0f17899dae7ccd8686ac159548dec303401b8ad4 100644
--- a/tensorflow/docs_src/tutorials/layers.md
+++ b/tensorflow/docs_src/tutorials/layers.md
@@ -190,7 +190,7 @@ def cnn_model_fn(features, labels, mode):
 The following sections (with headings corresponding to each code block above)
 dive deeper into the `tf.layers` code used to create each layer, as well as how
 to calculate loss, configure the training op, and generate predictions. If
-you're already experienced with CNNs and @{$get_started/custom_estimators$TensorFlow `Estimator`s},
+you're already experienced with CNNs and @{$custom_estimators$TensorFlow `Estimator`s},
 and find the above code intuitive, you may want to skim these sections or just
 skip ahead to ["Training and Evaluating the CNN MNIST Classifier"](#train_eval_mnist).
 
@@ -534,8 +534,8 @@ if mode == tf.estimator.ModeKeys.TRAIN:
 ```
 
 > Note: For a more in-depth look at configuring training ops for Estimator model
-> functions, see @{$get_started/custom_estimators#defining-the-training-op-for-the-model$"Defining the training op for the model"}
-> in the @{$get_started/custom_estimators$"Creating Estimations in tf.estimator"} tutorial.
+> functions, see @{$custom_estimators#defining-the-training-op-for-the-model$"Defining the training op for the model"}
+> in the @{$custom_estimators$"Creating Estimations in tf.estimator"} tutorial.
 
 
 ### Add evaluation metrics
@@ -600,7 +600,7 @@ be saved (here, we specify the temp directory `/tmp/mnist_convnet_model`, but
 feel free to change to another directory of your choice).
 
 > Note: For an in-depth walkthrough of the TensorFlow `Estimator` API, see the
-> tutorial @{$get_started/custom_estimators$"Creating Estimators in tf.estimator."}
+> tutorial @{$custom_estimators$"Creating Estimators in tf.estimator."}
 
 ### Set Up a Logging Hook {#set_up_a_logging_hook}
 
@@ -719,7 +719,7 @@ Here, we've achieved an accuracy of 97.3% on our test data set.
 To learn more about TensorFlow Estimators and CNNs in TensorFlow, see the
 following resources:
 
-*   @{$get_started/custom_estimators$Creating Estimators in tf.estimator}
+*   @{$custom_estimators$Creating Estimators in tf.estimator}
     provides an introduction to the TensorFlow Estimator API. It walks through
     configuring an Estimator, writing a model function, calculating loss, and
     defining a training op.
diff --git a/tensorflow/docs_src/tutorials/linear.md b/tensorflow/docs_src/tutorials/linear.md
index 265ded877d1ff9fb0b1cc2ad678729a3b7247aa8..3f247ade266d2675eac4d0f59a4744daa61f27ea 100644
--- a/tensorflow/docs_src/tutorials/linear.md
+++ b/tensorflow/docs_src/tutorials/linear.md
@@ -17,7 +17,7 @@ tutorial walks through the code in greater detail.
 
 To understand this overview it will help to have some familiarity
 with basic machine learning concepts, and also with
-@{$get_started/premade_estimators$Estimators}.
+@{$premade_estimators$Estimators}.
 
 [TOC]
 
diff --git a/tensorflow/docs_src/tutorials/recurrent_quickdraw.md b/tensorflow/docs_src/tutorials/recurrent_quickdraw.md
index 5d83fbe2a3709c0834f448cbc316453f80428dd1..1afd861738512f20de5171548d539d256f5f5225 100644
--- a/tensorflow/docs_src/tutorials/recurrent_quickdraw.md
+++ b/tensorflow/docs_src/tutorials/recurrent_quickdraw.md
@@ -220,7 +220,7 @@ length 2.
 ### Defining the model
 
 To define the model we create a new `Estimator`. If you want to read more about
-estimators, we recommend @{$get_started/custom_estimators$this tutorial}.
+estimators, we recommend @{$custom_estimators$this tutorial}.
 
 To build the model, we:
 
diff --git a/tensorflow/examples/learn/iris.py b/tensorflow/examples/learn/iris.py
index 03e60972aa660fad4af8d3535e31463c96f7c69b..86f5204ec3e8713d5d22156419b6414acb2fa677 100644
--- a/tensorflow/examples/learn/iris.py
+++ b/tensorflow/examples/learn/iris.py
@@ -21,7 +21,8 @@ from __future__ import division
 from __future__ import print_function
 
 import os
-import urllib
+
+from six.moves.urllib.request import urlretrieve
 
 import tensorflow as tf
 
@@ -38,9 +39,7 @@ FEATURE_KEYS = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
 def maybe_download_iris_data(file_name, download_url):
   """Downloads the file and returns the number of data."""
   if not os.path.exists(file_name):
-    raw = urllib.urlopen(download_url).read()
-    with open(file_name, 'w') as f:
-      f.write(raw)
+    urlretrieve(download_url, file_name)
 
   # The first line is a comma-separated string. The first one is the number of
   # total data in the file.
diff --git a/tensorflow/go/attrs.go b/tensorflow/go/attrs.go
new file mode 100644
index 0000000000000000000000000000000000000000..bfa60d2aa83339a5be6020e876900c330b7174a1
--- /dev/null
+++ b/tensorflow/go/attrs.go
@@ -0,0 +1,215 @@
+/*
+Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+*/
+
+package tensorflow
+
+// #include <stdlib.h>
+// #include "tensorflow/c/c_api.h"
+import "C"
+import (
+	"fmt"
+	"unsafe"
+)
+
+// makeCShape converts a shape specified in C.int64_t into a Shape.
+func makeCShape(shape []C.int64_t) Shape {
+	s := Shape{dims: make([]int64, len(shape))}
+	for i, n := range shape {
+		s.dims[i] = int64(n)
+	}
+	return s
+}
+
+// Attr returns the value of an attribute on op.
+func (op *Operation) Attr(name string) (interface{}, error) {
+	cname := C.CString(name)
+	defer C.free(unsafe.Pointer(cname))
+
+	status := newStatus()
+	meta := C.TF_OperationGetAttrMetadata(op.c, cname, status.c)
+	if err := status.Err(); err != nil {
+		return nil, err
+	}
+
+	if meta.is_list == 1 {
+		return listAttribute(op, cname, meta)
+	}
+	return scalarAttribute(op, cname, meta)
+}
+
+func listAttribute(op *Operation, cname *C.char, meta C.TF_AttrMetadata) (interface{}, error) {
+	status := newStatus()
+
+	switch meta._type {
+	case C.TF_ATTR_STRING:
+		values := make([]unsafe.Pointer, meta.list_size)
+		lengths := make([]C.size_t, meta.list_size)
+		storage := make([]C.char, meta.total_size)
+		C.TF_OperationGetAttrStringList(op.c, cname, &values[0], &lengths[0], C.int(meta.list_size), unsafe.Pointer(&storage[0]), C.size_t(meta.total_size), status.c)
+		if err := status.Err(); err != nil {
+			return nil, err
+		}
+		list := make([]string, meta.list_size)
+		for i, val := range values {
+			length := lengths[i]
+			list[i] = C.GoStringN((*C.char)(val), C.int(length))
+		}
+		return list, nil
+
+	case C.TF_ATTR_INT:
+		list := make([]C.int64_t, meta.list_size)
+		C.TF_OperationGetAttrIntList(op.c, cname, &list[0], C.int(meta.list_size), status.c)
+		if err := status.Err(); err != nil {
+			return nil, err
+		}
+		vals := make([]int64, meta.list_size)
+		for i, val := range list {
+			vals[i] = int64(val)
+		}
+		return vals, nil
+
+	case C.TF_ATTR_FLOAT:
+		list := make([]C.float, meta.list_size)
+		C.TF_OperationGetAttrFloatList(op.c, cname, &list[0], C.int(meta.list_size), status.c)
+		if err := status.Err(); err != nil {
+			return nil, err
+		}
+		vals := make([]float32, meta.list_size)
+		for i, val := range list {
+			vals[i] = float32(val)
+		}
+		return vals, nil
+
+	case C.TF_ATTR_BOOL:
+		list := make([]C.uchar, meta.list_size)
+		C.TF_OperationGetAttrBoolList(op.c, cname, &list[0], C.int(meta.list_size), status.c)
+		if err := status.Err(); err != nil {
+			return nil, err
+		}
+		vals := make([]bool, meta.list_size)
+		for i, val := range list {
+			vals[i] = val == 1
+		}
+		return vals, nil
+
+	case C.TF_ATTR_TYPE:
+		list := make([]C.TF_DataType, meta.list_size)
+		C.TF_OperationGetAttrTypeList(op.c, cname, &list[0], C.int(meta.list_size), status.c)
+		if err := status.Err(); err != nil {
+			return nil, err
+		}
+		vals := make([]DataType, meta.list_size)
+		for i, val := range list {
+			vals[i] = DataType(val)
+		}
+		return vals, nil
+
+	case C.TF_ATTR_TENSOR:
+		list := make([]*C.TF_Tensor, meta.list_size)
+		C.TF_OperationGetAttrTensorList(op.c, cname, &list[0], C.int(meta.list_size), status.c)
+		if err := status.Err(); err != nil {
+			return nil, err
+		}
+		vals := make([]*Tensor, meta.list_size)
+		for i, t := range list {
+			vals[i] = newTensorFromC(t)
+		}
+		return vals, nil
+
+	case C.TF_ATTR_SHAPE:
+		dims := make([]*C.int64_t, meta.list_size)
+		numDims := make([]C.int, meta.list_size)
+		storage := make([]C.int64_t, meta.total_size)
+		C.TF_OperationGetAttrShapeList(op.c, cname, &dims[0], &numDims[0], C.int(meta.list_size), &storage[0], C.int(meta.total_size), status.c)
+		if err := status.Err(); err != nil {
+			return nil, err
+		}
+		list := make([]Shape, meta.list_size)
+		for i, dim := range dims {
+			numDim := numDims[i]
+			// If the number of dimensions is unknown, default to empty shape.
+			if numDim < 0 {
+				continue
+			}
+			// A []C.int64_t slice backed by C memory.
+			// See: https://github.com/golang/go/wiki/cgo#turning-c-arrays-into-go-slices
+			slice := (*[1 << 30]C.int64_t)(unsafe.Pointer(dim))[:numDim:numDim]
+			list[i] = makeCShape(slice)
+		}
+		return list, nil
+
+	default:
+		return nil, fmt.Errorf("list type %v not supported", meta._type)
+	}
+}
+
+func scalarAttribute(op *Operation, cname *C.char, meta C.TF_AttrMetadata) (interface{}, error) {
+	status := newStatus()
+
+	switch meta._type {
+	case C.TF_ATTR_STRING:
+		v := make([]C.char, meta.total_size)
+		C.TF_OperationGetAttrString(op.c, cname, unsafe.Pointer(&v[0]), C.size_t(meta.total_size), status.c)
+		if err := status.Err(); err != nil {
+			return nil, err
+		}
+		return C.GoStringN(&v[0], C.int(meta.total_size)), nil
+
+	case C.TF_ATTR_INT:
+		var v C.int64_t
+		C.TF_OperationGetAttrInt(op.c, cname, &v, status.c)
+		return int64(v), status.Err()
+
+	case C.TF_ATTR_FLOAT:
+		var v C.float
+		C.TF_OperationGetAttrFloat(op.c, cname, &v, status.c)
+		return float32(v), status.Err()
+
+	case C.TF_ATTR_BOOL:
+		var v C.uchar
+		C.TF_OperationGetAttrBool(op.c, cname, &v, status.c)
+		return v == 1, status.Err()
+
+	case C.TF_ATTR_TYPE:
+		var v C.TF_DataType
+		C.TF_OperationGetAttrType(op.c, cname, &v, status.c)
+		return DataType(v), status.Err()
+
+	case C.TF_ATTR_TENSOR:
+		var v *C.TF_Tensor
+		C.TF_OperationGetAttrTensor(op.c, cname, &v, status.c)
+		if err := status.Err(); err != nil {
+			return nil, err
+		}
+		return newTensorFromC(v), nil
+
+	case C.TF_ATTR_SHAPE:
+		numDims := meta.total_size
+		// If number of dims is unknown return empty shape to indicate that.
+		if numDims < 0 {
+			return Shape{}, nil
+		}
+		dims := make([]C.int64_t, numDims)
+		C.TF_OperationGetAttrShape(op.c, cname, (*C.int64_t)(unsafe.Pointer(&dims[0])), C.int(numDims), status.c)
+		if err := status.Err(); err != nil {
+			return nil, err
+		}
+		return makeCShape(dims), nil
+
+	default:
+		return nil, fmt.Errorf("type %v not supported", meta._type)
+	}
+}
diff --git a/tensorflow/go/attrs_test.go b/tensorflow/go/attrs_test.go
new file mode 100644
index 0000000000000000000000000000000000000000..18fc0de90a1470e6d7adc93e4ca0b94ee526784b
--- /dev/null
+++ b/tensorflow/go/attrs_test.go
@@ -0,0 +1,47 @@
+/*
+Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+*/
+
+package tensorflow
+
+import (
+	"reflect"
+	"testing"
+)
+
+func TestOperationAttrs(t *testing.T) {
+	attrs := map[string]interface{}{
+		"dtype": Float,
+	}
+
+	g := NewGraph()
+	op, err := g.AddOperation(OpSpec{
+		Type:  "Placeholder",
+		Name:  "placeholder",
+		Attrs: attrs,
+	})
+	if err != nil {
+		t.Fatal(err)
+	}
+	for key, want := range attrs {
+		out, err := op.Attr(key)
+		if err != nil {
+			t.Fatal(err)
+		}
+		if !reflect.DeepEqual(out, want) {
+			t.Fatalf("%q: Got %+v, wanted %+v", key, out, want)
+		}
+	}
+}
diff --git a/tensorflow/go/op/wrappers.go b/tensorflow/go/op/wrappers.go
index e08f38969cf8392c90725d7bc45db154c37a0e39..a5224fbda04fdfb2460fce96efffb6eab4f08551 100644
--- a/tensorflow/go/op/wrappers.go
+++ b/tensorflow/go/op/wrappers.go
@@ -2674,206 +2674,442 @@ func MatrixBandPart(scope *Scope, input tf.Output, num_lower tf.Output, num_uppe
 	return op.Output(0)
 }
 
-// Clips tensor values to a specified min and max.
+// Returns the batched diagonal part of a batched tensor.
 //
-// Given a tensor `t`, this operation returns a tensor of the same type and
-// shape as `t` with its values clipped to `clip_value_min` and `clip_value_max`.
-// Any values less than `clip_value_min` are set to `clip_value_min`. Any values
-// greater than `clip_value_max` are set to `clip_value_max`.
+// This operation returns a tensor with the `diagonal` part
+// of the batched `input`. The `diagonal` part is computed as follows:
+//
+// Assume `input` has `k` dimensions `[I, J, K, ..., M, N]`, then the output is a
+// tensor of rank `k - 1` with dimensions `[I, J, K, ..., min(M, N)]` where:
+//
+// `diagonal[i, j, k, ..., n] = input[i, j, k, ..., n, n]`.
+//
+// The input must be at least a matrix.
+//
+// For example:
+//
+// ```
+// # 'input' is [[[1, 0, 0, 0]
+//                [0, 2, 0, 0]
+//                [0, 0, 3, 0]
+//                [0, 0, 0, 4]],
+//               [[5, 0, 0, 0]
+//                [0, 6, 0, 0]
+//                [0, 0, 7, 0]
+//                [0, 0, 0, 8]]]
+//
+// and input.shape = (2, 4, 4)
+//
+// tf.matrix_diag_part(input) ==> [[1, 2, 3, 4], [5, 6, 7, 8]]
+//
+// which has shape (2, 4)
+// ```
 //
 // Arguments:
-//	t: A `Tensor`.
-//	clip_value_min: A 0-D (scalar) `Tensor`, or a `Tensor` with the same shape
-// as `t`. The minimum value to clip by.
-//	clip_value_max: A 0-D (scalar) `Tensor`, or a `Tensor` with the same shape
-// as `t`. The maximum value to clip by.
+//	input: Rank `k` tensor where `k >= 2`.
 //
-// Returns A clipped `Tensor` with the same shape as input 't'.
-func ClipByValue(scope *Scope, t tf.Output, clip_value_min tf.Output, clip_value_max tf.Output) (output tf.Output) {
+// Returns The extracted diagonal(s) having shape
+// `diagonal.shape = input.shape[:-2] + [min(input.shape[-2:])]`.
+func MatrixDiagPart(scope *Scope, input tf.Output) (diagonal tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "ClipByValue",
+		Type: "MatrixDiagPart",
 		Input: []tf.Input{
-			t, clip_value_min, clip_value_max,
+			input,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Creates a sequence of numbers.
+// Returns a batched diagonal tensor with a given batched diagonal values.
 //
-// This operation creates a sequence of numbers that begins at `start` and
-// extends by increments of `delta` up to but not including `limit`.
+// Given a `diagonal`, this operation returns a tensor with the `diagonal` and
+// everything else padded with zeros. The diagonal is computed as follows:
+//
+// Assume `diagonal` has `k` dimensions `[I, J, K, ..., N]`, then the output is a
+// tensor of rank `k+1` with dimensions [I, J, K, ..., N, N]` where:
+//
+// `output[i, j, k, ..., m, n] = 1{m=n} * diagonal[i, j, k, ..., n]`.
 //
 // For example:
 //
 // ```
-// # 'start' is 3
-// # 'limit' is 18
-// # 'delta' is 3
-// tf.range(start, limit, delta) ==> [3, 6, 9, 12, 15]
+// # 'diagonal' is [[1, 2, 3, 4], [5, 6, 7, 8]]
+//
+// and diagonal.shape = (2, 4)
+//
+// tf.matrix_diag(diagonal) ==> [[[1, 0, 0, 0]
+//                                      [0, 2, 0, 0]
+//                                      [0, 0, 3, 0]
+//                                      [0, 0, 0, 4]],
+//                                     [[5, 0, 0, 0]
+//                                      [0, 6, 0, 0]
+//                                      [0, 0, 7, 0]
+//                                      [0, 0, 0, 8]]]
+//
+// which has shape (2, 4, 4)
 // ```
 //
 // Arguments:
-//	start: 0-D (scalar). First entry in the sequence.
-//	limit: 0-D (scalar). Upper limit of sequence, exclusive.
-//	delta: 0-D (scalar). Optional. Default is 1. Number that increments `start`.
+//	diagonal: Rank `k`, where `k >= 1`.
 //
-// Returns 1-D.
-func Range(scope *Scope, start tf.Output, limit tf.Output, delta tf.Output) (output tf.Output) {
+// Returns Rank `k+1`, with `output.shape = diagonal.shape + [diagonal.shape[-1]]`.
+func MatrixDiag(scope *Scope, diagonal tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "Range",
+		Type: "MatrixDiag",
 		Input: []tf.Input{
-			start, limit, delta,
+			diagonal,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Computes gradients for SparseSegmentSqrtN.
-//
-// Returns tensor "output" with same shape as grad, except for dimension 0 whose
-// value is output_dim0.
+// QuantizedInstanceNormAttr is an optional argument to QuantizedInstanceNorm.
+type QuantizedInstanceNormAttr func(optionalAttr)
+
+// QuantizedInstanceNormOutputRangeGiven sets the optional output_range_given attribute to value.
 //
-// Arguments:
-//	grad: gradient propagated to the SparseSegmentSqrtN op.
-//	indices: indices passed to the corresponding SparseSegmentSqrtN op.
-//	segment_ids: segment_ids passed to the corresponding SparseSegmentSqrtN op.
-//	output_dim0: dimension 0 of "data" passed to SparseSegmentSqrtN op.
-func SparseSegmentSqrtNGrad(scope *Scope, grad tf.Output, indices tf.Output, segment_ids tf.Output, output_dim0 tf.Output) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "SparseSegmentSqrtNGrad",
-		Input: []tf.Input{
-			grad, indices, segment_ids, output_dim0,
-		},
+// value: If True, `given_y_min` and `given_y_min`
+// and `given_y_max` are used as the output range. Otherwise,
+// the implementation computes the output range.
+// If not specified, defaults to false
+func QuantizedInstanceNormOutputRangeGiven(value bool) QuantizedInstanceNormAttr {
+	return func(m optionalAttr) {
+		m["output_range_given"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// Computes the mean along sparse segments of a tensor.
-//
-// Read @{$math_ops#Segmentation$the section on segmentation} for an explanation of
-// segments.
-//
-// Like `SegmentMean`, but `segment_ids` can have rank less than `data`'s first
-// dimension, selecting a subset of dimension 0, specified by `indices`.
+// QuantizedInstanceNormGivenYMin sets the optional given_y_min attribute to value.
 //
-// Arguments:
+// value: Output in `y_min` if `output_range_given` is True.
+// If not specified, defaults to 0
+func QuantizedInstanceNormGivenYMin(value float32) QuantizedInstanceNormAttr {
+	return func(m optionalAttr) {
+		m["given_y_min"] = value
+	}
+}
+
+// QuantizedInstanceNormGivenYMax sets the optional given_y_max attribute to value.
 //
-//	indices: A 1-D tensor. Has same rank as `segment_ids`.
-//	segment_ids: A 1-D tensor. Values should be sorted and can be repeated.
+// value: Output in `y_max` if `output_range_given` is True.
+// If not specified, defaults to 0
+func QuantizedInstanceNormGivenYMax(value float32) QuantizedInstanceNormAttr {
+	return func(m optionalAttr) {
+		m["given_y_max"] = value
+	}
+}
+
+// QuantizedInstanceNormVarianceEpsilon sets the optional variance_epsilon attribute to value.
 //
-// Returns Has same shape as data, except for dimension 0 which
-// has size `k`, the number of segments.
-func SparseSegmentMean(scope *Scope, data tf.Output, indices tf.Output, segment_ids tf.Output) (output tf.Output) {
-	if scope.Err() != nil {
-		return
+// value: A small float number to avoid dividing by 0.
+// If not specified, defaults to 1e-05
+func QuantizedInstanceNormVarianceEpsilon(value float32) QuantizedInstanceNormAttr {
+	return func(m optionalAttr) {
+		m["variance_epsilon"] = value
 	}
-	opspec := tf.OpSpec{
-		Type: "SparseSegmentMean",
-		Input: []tf.Input{
-			data, indices, segment_ids,
-		},
+}
+
+// QuantizedInstanceNormMinSeparation sets the optional min_separation attribute to value.
+//
+// value: Minimum value of `y_max - y_min`
+// If not specified, defaults to 0.001
+func QuantizedInstanceNormMinSeparation(value float32) QuantizedInstanceNormAttr {
+	return func(m optionalAttr) {
+		m["min_separation"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// Pop the element at the top of the stack.
+// Quantized Instance normalization.
 //
 // Arguments:
-//	handle: The handle to a stack.
-//	elem_type: The type of the elem that is popped.
+//	x: A 4D input Tensor.
+//	x_min: The value represented by the lowest quantized input.
+//	x_max: The value represented by the highest quantized input.
 //
-// Returns The tensor that is popped from the top of the stack.
-func StackPopV2(scope *Scope, handle tf.Output, elem_type tf.DataType) (elem tf.Output) {
+// Returns A 4D Tensor.The value represented by the lowest quantized output.The value represented by the highest quantized output.
+func QuantizedInstanceNorm(scope *Scope, x tf.Output, x_min tf.Output, x_max tf.Output, optional ...QuantizedInstanceNormAttr) (y tf.Output, y_min tf.Output, y_max tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"elem_type": elem_type}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "StackPopV2",
+		Type: "QuantizedInstanceNorm",
 		Input: []tf.Input{
-			handle,
+			x, x_min, x_max,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// Computes the sum along sparse segments of a tensor.
+// Returns the diagonal part of the tensor.
 //
-// Like `SparseSegmentSum`, but allows missing ids in `segment_ids`. If an id is
-// misisng, the `output` tensor at that position will be zeroed.
+// This operation returns a tensor with the `diagonal` part
+// of the `input`. The `diagonal` part is computed as follows:
 //
-// Read @{$math_ops#Segmentation$the section on segmentation} for an explanation of
-// segments.
+// Assume `input` has dimensions `[D1,..., Dk, D1,..., Dk]`, then the output is a
+// tensor of rank `k` with dimensions `[D1,..., Dk]` where:
 //
-// For example:
+// `diagonal[i1,..., ik] = input[i1, ..., ik, i1,..., ik]`.
 //
-// ```python
-// c = tf.constant([[1,2,3,4], [-1,-2,-3,-4], [5,6,7,8]])
+// For example:
 //
-// tf.sparse_segment_sum_with_num_segments(
-//     c, tf.constant([0, 1]), tf.constant([0, 0]), num_segments=3)
-// # => [[0 0 0 0]
-// #     [0 0 0 0]
-// #     [0 0 0 0]]
+// ```
+// # 'input' is [[1, 0, 0, 0]
+//               [0, 2, 0, 0]
+//               [0, 0, 3, 0]
+//               [0, 0, 0, 4]]
 //
-// tf.sparse_segment_sum_with_num_segments(c,
-//                                         tf.constant([0, 1]),
-//                                         tf.constant([0, 2],
-//                                         num_segments=4))
-// # => [[ 1  2  3  4]
-// #     [ 0  0  0  0]
-// #     [-1 -2 -3 -4]
-// #     [ 0  0  0  0]]
+// tf.diag_part(input) ==> [1, 2, 3, 4]
 // ```
 //
 // Arguments:
+//	input: Rank k tensor where k is even and not zero.
 //
-//	indices: A 1-D tensor. Has same rank as `segment_ids`.
-//	segment_ids: A 1-D tensor. Values should be sorted and can be repeated.
-//	num_segments: Should equal the number of distinct segment IDs.
-//
-// Returns Has same shape as data, except for dimension 0 which
-// has size `num_segments`.
-func SparseSegmentSumWithNumSegments(scope *Scope, data tf.Output, indices tf.Output, segment_ids tf.Output, num_segments tf.Output) (output tf.Output) {
+// Returns The extracted diagonal.
+func DiagPart(scope *Scope, input tf.Output) (diagonal tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "SparseSegmentSumWithNumSegments",
+		Type: "DiagPart",
 		Input: []tf.Input{
-			data, indices, segment_ids, num_segments,
+			input,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// PreventGradientAttr is an optional argument to PreventGradient.
-type PreventGradientAttr func(optionalAttr)
-
-// PreventGradientMessage sets the optional message attribute to value.
+// Gives a guarantee to the TF runtime that the input tensor is a constant.
 //
-// value: Will be printed in the error when anyone tries to differentiate
-// this operation.
-// If not specified, defaults to ""
-func PreventGradientMessage(value string) PreventGradientAttr {
-	return func(m optionalAttr) {
+// The runtime is then free to make optimizations based on this.
+//
+// Only accepts value typed tensors as inputs and rejects resource variable handles
+// as input.
+//
+// Returns the input tensor without modification.
+func GuaranteeConst(scope *Scope, input tf.Output) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "GuaranteeConst",
+		Input: []tf.Input{
+			input,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Splits a tensor into `num_split` tensors along one dimension.
+//
+// Arguments:
+//	value: The tensor to split.
+//	size_splits: list containing the sizes of each output tensor along the split
+// dimension. Must sum to the dimension of value along split_dim.
+// Can contain one -1 indicating that dimension is to be inferred.
+//	axis: 0-D.  The dimension along which to split.  Must be in the range
+// `[-rank(value), rank(value))`.
+//
+//
+// Returns Tensors whose shape matches that of `value`
+// except along `axis`, where their sizes are
+// `size_splits[i]`.
+func SplitV(scope *Scope, value tf.Output, size_splits tf.Output, axis tf.Output, num_split int64) (output []tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"num_split": num_split}
+	opspec := tf.OpSpec{
+		Type: "SplitV",
+		Input: []tf.Input{
+			value, size_splits, axis,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	if scope.Err() != nil {
+		return
+	}
+	var idx int
+	var err error
+	if output, idx, err = makeOutputList(op, idx, "output"); err != nil {
+		scope.UpdateErr("SplitV", err)
+		return
+	}
+	return output
+}
+
+// Splits a tensor into `num_split` tensors along one dimension.
+//
+// Arguments:
+//	axis: 0-D.  The dimension along which to split.  Must be in the range
+// `[-rank(value), rank(value))`.
+//	value: The tensor to split.
+//	num_split: The number of ways to split.  Must evenly divide
+// `value.shape[split_dim]`.
+//
+// Returns They are identically shaped tensors, whose shape matches that of `value`
+// except along `axis`, where their sizes are
+// `values.shape[split_dim] / num_split`.
+func Split(scope *Scope, axis tf.Output, value tf.Output, num_split int64) (output []tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"num_split": num_split}
+	opspec := tf.OpSpec{
+		Type: "Split",
+		Input: []tf.Input{
+			axis, value,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	if scope.Err() != nil {
+		return
+	}
+	var idx int
+	var err error
+	if output, idx, err = makeOutputList(op, idx, "output"); err != nil {
+		scope.UpdateErr("Split", err)
+		return
+	}
+	return output
+}
+
+// Creates a sequence of numbers.
+//
+// This operation creates a sequence of numbers that begins at `start` and
+// extends by increments of `delta` up to but not including `limit`.
+//
+// For example:
+//
+// ```
+// # 'start' is 3
+// # 'limit' is 18
+// # 'delta' is 3
+// tf.range(start, limit, delta) ==> [3, 6, 9, 12, 15]
+// ```
+//
+// Arguments:
+//	start: 0-D (scalar). First entry in the sequence.
+//	limit: 0-D (scalar). Upper limit of sequence, exclusive.
+//	delta: 0-D (scalar). Optional. Default is 1. Number that increments `start`.
+//
+// Returns 1-D.
+func Range(scope *Scope, start tf.Output, limit tf.Output, delta tf.Output) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "Range",
+		Input: []tf.Input{
+			start, limit, delta,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Computes gradients for SparseSegmentSqrtN.
+//
+// Returns tensor "output" with same shape as grad, except for dimension 0 whose
+// value is output_dim0.
+//
+// Arguments:
+//	grad: gradient propagated to the SparseSegmentSqrtN op.
+//	indices: indices passed to the corresponding SparseSegmentSqrtN op.
+//	segment_ids: segment_ids passed to the corresponding SparseSegmentSqrtN op.
+//	output_dim0: dimension 0 of "data" passed to SparseSegmentSqrtN op.
+func SparseSegmentSqrtNGrad(scope *Scope, grad tf.Output, indices tf.Output, segment_ids tf.Output, output_dim0 tf.Output) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "SparseSegmentSqrtNGrad",
+		Input: []tf.Input{
+			grad, indices, segment_ids, output_dim0,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Computes the mean along sparse segments of a tensor.
+//
+// Read @{$math_ops#Segmentation$the section on segmentation} for an explanation of
+// segments.
+//
+// Like `SegmentMean`, but `segment_ids` can have rank less than `data`'s first
+// dimension, selecting a subset of dimension 0, specified by `indices`.
+//
+// Arguments:
+//
+//	indices: A 1-D tensor. Has same rank as `segment_ids`.
+//	segment_ids: A 1-D tensor. Values should be sorted and can be repeated.
+//
+// Returns Has same shape as data, except for dimension 0 which
+// has size `k`, the number of segments.
+func SparseSegmentMean(scope *Scope, data tf.Output, indices tf.Output, segment_ids tf.Output) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "SparseSegmentMean",
+		Input: []tf.Input{
+			data, indices, segment_ids,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Pop the element at the top of the stack.
+//
+// Arguments:
+//	handle: The handle to a stack.
+//	elem_type: The type of the elem that is popped.
+//
+// Returns The tensor that is popped from the top of the stack.
+func StackPopV2(scope *Scope, handle tf.Output, elem_type tf.DataType) (elem tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"elem_type": elem_type}
+	opspec := tf.OpSpec{
+		Type: "StackPopV2",
+		Input: []tf.Input{
+			handle,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// PreventGradientAttr is an optional argument to PreventGradient.
+type PreventGradientAttr func(optionalAttr)
+
+// PreventGradientMessage sets the optional message attribute to value.
+//
+// value: Will be printed in the error when anyone tries to differentiate
+// this operation.
+// If not specified, defaults to ""
+func PreventGradientMessage(value string) PreventGradientAttr {
+	return func(m optionalAttr) {
 		m["message"] = value
 	}
 }
@@ -3974,91 +4210,28 @@ func Digamma(scope *Scope, x tf.Output) (y tf.Output) {
 	return op.Output(0)
 }
 
-// Shuffle dimensions of x according to a permutation.
-//
-// The output `y` has the same rank as `x`. The shapes of `x` and `y` satisfy:
-//   `y.shape[i] == x.shape[perm[i]] for i in [0, 1, ..., rank(x) - 1]`
-func Transpose(scope *Scope, x tf.Output, perm tf.Output) (y tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "Transpose",
-		Input: []tf.Input{
-			x, perm,
-		},
+// Conv2DBackpropFilterAttr is an optional argument to Conv2DBackpropFilter.
+type Conv2DBackpropFilterAttr func(optionalAttr)
+
+// Conv2DBackpropFilterUseCudnnOnGpu sets the optional use_cudnn_on_gpu attribute to value.
+// If not specified, defaults to true
+func Conv2DBackpropFilterUseCudnnOnGpu(value bool) Conv2DBackpropFilterAttr {
+	return func(m optionalAttr) {
+		m["use_cudnn_on_gpu"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// MinAttr is an optional argument to Min.
-type MinAttr func(optionalAttr)
-
-// MinKeepDims sets the optional keep_dims attribute to value.
+// Conv2DBackpropFilterDataFormat sets the optional data_format attribute to value.
 //
-// value: If true, retain reduced dimensions with length 1.
-// If not specified, defaults to false
-func MinKeepDims(value bool) MinAttr {
+// value: Specify the data format of the input and output data. With the
+// default format "NHWC", the data is stored in the order of:
+//     [batch, in_height, in_width, in_channels].
+// Alternatively, the format could be "NCHW", the data storage order of:
+//     [batch, in_channels, in_height, in_width].
+// If not specified, defaults to "NHWC"
+func Conv2DBackpropFilterDataFormat(value string) Conv2DBackpropFilterAttr {
 	return func(m optionalAttr) {
-		m["keep_dims"] = value
-	}
-}
-
-// Computes the minimum of elements across dimensions of a tensor.
-//
-// Reduces `input` along the dimensions given in `axis`. Unless
-// `keep_dims` is true, the rank of the tensor is reduced by 1 for each entry in
-// `axis`. If `keep_dims` is true, the reduced dimensions are
-// retained with length 1.
-//
-// Arguments:
-//	input: The tensor to reduce.
-//	axis: The dimensions to reduce. Must be in the range
-// `[-rank(input), rank(input))`.
-//
-// Returns The reduced tensor.
-func Min(scope *Scope, input tf.Output, axis tf.Output, optional ...MinAttr) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "Min",
-		Input: []tf.Input{
-			input, axis,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Conv2DBackpropFilterAttr is an optional argument to Conv2DBackpropFilter.
-type Conv2DBackpropFilterAttr func(optionalAttr)
-
-// Conv2DBackpropFilterUseCudnnOnGpu sets the optional use_cudnn_on_gpu attribute to value.
-// If not specified, defaults to true
-func Conv2DBackpropFilterUseCudnnOnGpu(value bool) Conv2DBackpropFilterAttr {
-	return func(m optionalAttr) {
-		m["use_cudnn_on_gpu"] = value
-	}
-}
-
-// Conv2DBackpropFilterDataFormat sets the optional data_format attribute to value.
-//
-// value: Specify the data format of the input and output data. With the
-// default format "NHWC", the data is stored in the order of:
-//     [batch, in_height, in_width, in_channels].
-// Alternatively, the format could be "NCHW", the data storage order of:
-//     [batch, in_channels, in_height, in_width].
-// If not specified, defaults to "NHWC"
-func Conv2DBackpropFilterDataFormat(value string) Conv2DBackpropFilterAttr {
-	return func(m optionalAttr) {
-		m["data_format"] = value
+		m["data_format"] = value
 	}
 }
 
@@ -4563,6 +4736,68 @@ func Reciprocal(scope *Scope, x tf.Output) (y tf.Output) {
 	return op.Output(0)
 }
 
+// Returns a batched matrix tensor with new batched diagonal values.
+//
+// Given `input` and `diagonal`, this operation returns a tensor with the
+// same shape and values as `input`, except for the main diagonal of the
+// innermost matrices.  These will be overwritten by the values in `diagonal`.
+//
+// The output is computed as follows:
+//
+// Assume `input` has `k+1` dimensions `[I, J, K, ..., M, N]` and `diagonal` has
+// `k` dimensions `[I, J, K, ..., min(M, N)]`.  Then the output is a
+// tensor of rank `k+1` with dimensions `[I, J, K, ..., M, N]` where:
+//
+//   * `output[i, j, k, ..., m, n] = diagonal[i, j, k, ..., n]` for `m == n`.
+//   * `output[i, j, k, ..., m, n] = input[i, j, k, ..., m, n]` for `m != n`.
+//
+// Arguments:
+//	input: Rank `k+1`, where `k >= 1`.
+//	diagonal: Rank `k`, where `k >= 1`.
+//
+// Returns Rank `k+1`, with `output.shape = input.shape`.
+func MatrixSetDiag(scope *Scope, input tf.Output, diagonal tf.Output) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "MatrixSetDiag",
+		Input: []tf.Input{
+			input, diagonal,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Returns the element-wise max of two SparseTensors.
+//
+// Assumes the two SparseTensors have the same shape, i.e., no broadcasting.
+//
+// Arguments:
+//	a_indices: 2-D.  `N x R` matrix with the indices of non-empty values in a
+// SparseTensor, in the canonical lexicographic ordering.
+//	a_values: 1-D.  `N` non-empty values corresponding to `a_indices`.
+//	a_shape: 1-D.  Shape of the input SparseTensor.
+//	b_indices: counterpart to `a_indices` for the other operand.
+//	b_values: counterpart to `a_values` for the other operand; must be of the same dtype.
+//	b_shape: counterpart to `a_shape` for the other operand; the two shapes must be equal.
+//
+// Returns 2-D.  The indices of the output SparseTensor.1-D.  The values of the output SparseTensor.
+func SparseSparseMaximum(scope *Scope, a_indices tf.Output, a_values tf.Output, a_shape tf.Output, b_indices tf.Output, b_values tf.Output, b_shape tf.Output) (output_indices tf.Output, output_values tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "SparseSparseMaximum",
+		Input: []tf.Input{
+			a_indices, a_values, a_shape, b_indices, b_values, b_shape,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0), op.Output(1)
+}
+
 // OrderedMapClearAttr is an optional argument to OrderedMapClear.
 type OrderedMapClearAttr func(optionalAttr)
 
@@ -5115,53 +5350,6 @@ func FloorDiv(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
 	return op.Output(0)
 }
 
-// Returns a batched diagonal tensor with a given batched diagonal values.
-//
-// Given a `diagonal`, this operation returns a tensor with the `diagonal` and
-// everything else padded with zeros. The diagonal is computed as follows:
-//
-// Assume `diagonal` has `k` dimensions `[I, J, K, ..., N]`, then the output is a
-// tensor of rank `k+1` with dimensions [I, J, K, ..., N, N]` where:
-//
-// `output[i, j, k, ..., m, n] = 1{m=n} * diagonal[i, j, k, ..., n]`.
-//
-// For example:
-//
-// ```
-// # 'diagonal' is [[1, 2, 3, 4], [5, 6, 7, 8]]
-//
-// and diagonal.shape = (2, 4)
-//
-// tf.matrix_diag(diagonal) ==> [[[1, 0, 0, 0]
-//                                      [0, 2, 0, 0]
-//                                      [0, 0, 3, 0]
-//                                      [0, 0, 0, 4]],
-//                                     [[5, 0, 0, 0]
-//                                      [0, 6, 0, 0]
-//                                      [0, 0, 7, 0]
-//                                      [0, 0, 0, 8]]]
-//
-// which has shape (2, 4, 4)
-// ```
-//
-// Arguments:
-//	diagonal: Rank `k`, where `k >= 1`.
-//
-// Returns Rank `k+1`, with `output.shape = diagonal.shape + [diagonal.shape[-1]]`.
-func MatrixDiag(scope *Scope, diagonal tf.Output) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "MatrixDiag",
-		Input: []tf.Input{
-			diagonal,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
 // Computes the inverse permutation of a tensor.
 //
 // This operation computes the inverse of an index permutation. It takes a 1-D
@@ -5930,6 +6118,77 @@ func Mod(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
 	return op.Output(0)
 }
 
+// Computes offsets of concat inputs within its output.
+//
+// For example:
+//
+// ```
+// # 'x' is [2, 2, 7]
+// # 'y' is [2, 3, 7]
+// # 'z' is [2, 5, 7]
+// concat_offset(2, [x, y, z]) => [0, 0, 0], [0, 2, 0], [0, 5, 0]
+// ```
+//
+// This is typically used by gradient computations for a concat operation.
+//
+// Arguments:
+//	concat_dim: The dimension along which to concatenate.
+//	shape: The `N` int32 vectors representing shape of tensors being concatenated.
+//
+// Returns The `N` int32 vectors representing the starting offset
+// of input tensors within the concatenated output.
+func ConcatOffset(scope *Scope, concat_dim tf.Output, shape []tf.Output) (offset []tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "ConcatOffset",
+		Input: []tf.Input{
+			concat_dim, tf.OutputList(shape),
+		},
+	}
+	op := scope.AddOperation(opspec)
+	if scope.Err() != nil {
+		return
+	}
+	var idx int
+	var err error
+	if offset, idx, err = makeOutputList(op, idx, "offset"); err != nil {
+		scope.UpdateErr("ConcatOffset", err)
+		return
+	}
+	return offset
+}
+
+// Compute the lower regularized incomplete Gamma function `Q(a, x)`.
+//
+// The lower regularized incomplete Gamma function is defined as:
+//
+//
+// \\(P(a, x) = gamma(a, x) / Gamma(a) = 1 - Q(a, x)\\)
+//
+// where
+//
+// \\(gamma(a, x) = int_{0}^{x} t^{a-1} exp(-t) dt\\)
+//
+// is the lower incomplete Gamma function.
+//
+// Note, above `Q(a, x)` (`Igammac`) is the upper regularized complete
+// Gamma function.
+func Igamma(scope *Scope, a tf.Output, x tf.Output) (z tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "Igamma",
+		Input: []tf.Input{
+			a, x,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
 // DepthToSpaceAttr is an optional argument to DepthToSpace.
 type DepthToSpaceAttr func(optionalAttr)
 
@@ -6749,55 +7008,51 @@ func BiasAddV1(scope *Scope, value tf.Output, bias tf.Output) (output tf.Output)
 	return op.Output(0)
 }
 
-// Transforms a Tensor into a serialized TensorProto proto.
-//
-// Arguments:
-//	tensor: A Tensor of type `T`.
+// Shuffle dimensions of x according to a permutation.
 //
-// Returns A serialized TensorProto proto of the input tensor.
-func SerializeTensor(scope *Scope, tensor tf.Output) (serialized tf.Output) {
+// The output `y` has the same rank as `x`. The shapes of `x` and `y` satisfy:
+//   `y.shape[i] == x.shape[perm[i]] for i in [0, 1, ..., rank(x) - 1]`
+func Transpose(scope *Scope, x tf.Output, perm tf.Output) (y tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "SerializeTensor",
+		Type: "Transpose",
 		Input: []tf.Input{
-			tensor,
+			x, perm,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// MatrixSolveAttr is an optional argument to MatrixSolve.
-type MatrixSolveAttr func(optionalAttr)
+// MinAttr is an optional argument to Min.
+type MinAttr func(optionalAttr)
 
-// MatrixSolveAdjoint sets the optional adjoint attribute to value.
+// MinKeepDims sets the optional keep_dims attribute to value.
 //
-// value: Boolean indicating whether to solve with `matrix` or its (block-wise)
-// adjoint.
+// value: If true, retain reduced dimensions with length 1.
 // If not specified, defaults to false
-func MatrixSolveAdjoint(value bool) MatrixSolveAttr {
+func MinKeepDims(value bool) MinAttr {
 	return func(m optionalAttr) {
-		m["adjoint"] = value
+		m["keep_dims"] = value
 	}
 }
 
-// Solves systems of linear equations.
+// Computes the minimum of elements across dimensions of a tensor.
 //
-// `Matrix` is a tensor of shape `[..., M, M]` whose inner-most 2 dimensions
-// form square matrices. `Rhs` is a tensor of shape `[..., M, K]`. The `output` is
-// a tensor shape `[..., M, K]`.  If `adjoint` is `False` then each output matrix
-// satisfies `matrix[..., :, :] * output[..., :, :] = rhs[..., :, :]`.
-// If `adjoint` is `True` then each output matrix satisfies
-// `adjoint(matrix[..., :, :]) * output[..., :, :] = rhs[..., :, :]`.
+// Reduces `input` along the dimensions given in `axis`. Unless
+// `keep_dims` is true, the rank of the tensor is reduced by 1 for each entry in
+// `axis`. If `keep_dims` is true, the reduced dimensions are
+// retained with length 1.
 //
 // Arguments:
-//	matrix: Shape is `[..., M, M]`.
-//	rhs: Shape is `[..., M, K]`.
+//	input: The tensor to reduce.
+//	axis: The dimensions to reduce. Must be in the range
+// `[-rank(input), rank(input))`.
 //
-// Returns Shape is `[..., M, K]`.
-func MatrixSolve(scope *Scope, matrix tf.Output, rhs tf.Output, optional ...MatrixSolveAttr) (output tf.Output) {
+// Returns The reduced tensor.
+func Min(scope *Scope, input tf.Output, axis tf.Output, optional ...MinAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -6806,9 +7061,9 @@ func MatrixSolve(scope *Scope, matrix tf.Output, rhs tf.Output, optional ...Matr
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "MatrixSolve",
+		Type: "Min",
 		Input: []tf.Input{
-			matrix, rhs,
+			input, axis,
 		},
 		Attrs: attrs,
 	}
@@ -6816,6 +7071,26 @@ func MatrixSolve(scope *Scope, matrix tf.Output, rhs tf.Output, optional ...Matr
 	return op.Output(0)
 }
 
+// Transforms a Tensor into a serialized TensorProto proto.
+//
+// Arguments:
+//	tensor: A Tensor of type `T`.
+//
+// Returns A serialized TensorProto proto of the input tensor.
+func SerializeTensor(scope *Scope, tensor tf.Output) (serialized tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "SerializeTensor",
+		Input: []tf.Input{
+			tensor,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
 // Computes acos of x element-wise.
 func Acos(scope *Scope, x tf.Output) (y tf.Output) {
 	if scope.Err() != nil {
@@ -7310,6 +7585,77 @@ func DecodeRaw(scope *Scope, bytes tf.Output, out_type tf.DataType, optional ...
 	return op.Output(0)
 }
 
+// RandomPoissonAttr is an optional argument to RandomPoisson.
+type RandomPoissonAttr func(optionalAttr)
+
+// RandomPoissonSeed sets the optional seed attribute to value.
+// If not specified, defaults to 0
+func RandomPoissonSeed(value int64) RandomPoissonAttr {
+	return func(m optionalAttr) {
+		m["seed"] = value
+	}
+}
+
+// RandomPoissonSeed2 sets the optional seed2 attribute to value.
+// If not specified, defaults to 0
+func RandomPoissonSeed2(value int64) RandomPoissonAttr {
+	return func(m optionalAttr) {
+		m["seed2"] = value
+	}
+}
+
+// Use RandomPoissonV2 instead.
+//
+// DEPRECATED at GraphDef version 25: Replaced by RandomPoissonV2
+func RandomPoisson(scope *Scope, shape tf.Output, rate tf.Output, optional ...RandomPoissonAttr) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "RandomPoisson",
+		Input: []tf.Input{
+			shape, rate,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Returns the element-wise sum of a list of tensors.
+//
+// `tf.accumulate_n_v2` performs the same operation as `tf.add_n`, but does not
+// wait for all of its inputs to be ready before beginning to sum. This can
+// save memory if inputs are ready at different times, since minimum temporary
+// storage is proportional to the output size rather than the inputs size.
+//
+// Unlike the original `accumulate_n`, `accumulate_n_v2` is differentiable.
+//
+// Returns a `Tensor` of same shape and type as the elements of `inputs`.
+//
+// Arguments:
+//	inputs: A list of `Tensor` objects, each with same shape and type.
+//	shape: Shape of elements of `inputs`.
+func AccumulateNV2(scope *Scope, inputs []tf.Output, shape tf.Shape) (sum tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"shape": shape}
+	opspec := tf.OpSpec{
+		Type: "AccumulateNV2",
+		Input: []tf.Input{
+			tf.OutputList(inputs),
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
 // DepthwiseConv2dNativeBackpropFilterAttr is an optional argument to DepthwiseConv2dNativeBackpropFilter.
 type DepthwiseConv2dNativeBackpropFilterAttr func(optionalAttr)
 
@@ -7380,6 +7726,69 @@ func DepthwiseConv2dNativeBackpropFilter(scope *Scope, input tf.Output, filter_s
 	return op.Output(0)
 }
 
+// Returns immutable tensor from memory region.
+//
+// The current implementation memmaps the tensor from a file.
+//
+// Arguments:
+//	dtype: Type of the returned tensor.
+//	shape: Shape of the returned tensor.
+//	memory_region_name: Name of readonly memory region used by the tensor, see
+// NewReadOnlyMemoryRegionFromFile in tensorflow::Env.
+func ImmutableConst(scope *Scope, dtype tf.DataType, shape tf.Shape, memory_region_name string) (tensor tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"dtype": dtype, "shape": shape, "memory_region_name": memory_region_name}
+	opspec := tf.OpSpec{
+		Type: "ImmutableConst",
+
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// StringJoinAttr is an optional argument to StringJoin.
+type StringJoinAttr func(optionalAttr)
+
+// StringJoinSeparator sets the optional separator attribute to value.
+//
+// value: string, an optional join separator.
+// If not specified, defaults to ""
+func StringJoinSeparator(value string) StringJoinAttr {
+	return func(m optionalAttr) {
+		m["separator"] = value
+	}
+}
+
+// Joins the strings in the given list of string tensors into one tensor;
+//
+// with the given separator (default is an empty separator).
+//
+// Arguments:
+//	inputs: A list of string tensors.  The tensors must all have the same shape,
+// or be scalars.  Scalars may be mixed in; these will be broadcast to the shape
+// of non-scalar inputs.
+func StringJoin(scope *Scope, inputs []tf.Output, optional ...StringJoinAttr) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "StringJoin",
+		Input: []tf.Input{
+			tf.OutputList(inputs),
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
 // LRNGradAttr is an optional argument to LRNGrad.
 type LRNGradAttr func(optionalAttr)
 
@@ -7768,47 +8177,6 @@ func SparseSplit(scope *Scope, split_dim tf.Output, indices tf.Output, values tf
 	return output_indices, output_values, output_shape
 }
 
-// RandomPoissonAttr is an optional argument to RandomPoisson.
-type RandomPoissonAttr func(optionalAttr)
-
-// RandomPoissonSeed sets the optional seed attribute to value.
-// If not specified, defaults to 0
-func RandomPoissonSeed(value int64) RandomPoissonAttr {
-	return func(m optionalAttr) {
-		m["seed"] = value
-	}
-}
-
-// RandomPoissonSeed2 sets the optional seed2 attribute to value.
-// If not specified, defaults to 0
-func RandomPoissonSeed2(value int64) RandomPoissonAttr {
-	return func(m optionalAttr) {
-		m["seed2"] = value
-	}
-}
-
-// Use RandomPoissonV2 instead.
-//
-// DEPRECATED at GraphDef version 25: Replaced by RandomPoissonV2
-func RandomPoisson(scope *Scope, shape tf.Output, rate tf.Output, optional ...RandomPoissonAttr) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "RandomPoisson",
-		Input: []tf.Input{
-			shape, rate,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
 // ResourceSparseApplyFtrlV2Attr is an optional argument to ResourceSparseApplyFtrlV2.
 type ResourceSparseApplyFtrlV2Attr func(optionalAttr)
 
@@ -8220,101 +8588,12 @@ func RestoreSlice(scope *Scope, file_pattern tf.Output, tensor_name tf.Output, s
 	return op.Output(0)
 }
 
-// ImagAttr is an optional argument to Imag.
-type ImagAttr func(optionalAttr)
-
-// ImagTout sets the optional Tout attribute to value.
-// If not specified, defaults to DT_FLOAT
-func ImagTout(value tf.DataType) ImagAttr {
-	return func(m optionalAttr) {
-		m["Tout"] = value
-	}
-}
-
-// Returns the imaginary part of a complex number.
+// Divides sparse updates into the variable referenced by `resource`.
 //
-// Given a tensor `input` of complex numbers, this operation returns a tensor of
-// type `float` that is the imaginary part of each element in `input`. All
-// elements in `input` must be complex numbers of the form \\(a + bj\\), where *a*
-// is the real part and *b* is the imaginary part returned by this operation.
+// This operation computes
 //
-// For example:
-//
-// ```
-// # tensor 'input' is [-2.25 + 4.75j, 3.25 + 5.75j]
-// tf.imag(input) ==> [4.75, 5.75]
-// ```
-func Imag(scope *Scope, input tf.Output, optional ...ImagAttr) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "Imag",
-		Input: []tf.Input{
-			input,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// ComplexAttr is an optional argument to Complex.
-type ComplexAttr func(optionalAttr)
-
-// ComplexTout sets the optional Tout attribute to value.
-// If not specified, defaults to DT_COMPLEX64
-func ComplexTout(value tf.DataType) ComplexAttr {
-	return func(m optionalAttr) {
-		m["Tout"] = value
-	}
-}
-
-// Converts two real numbers to a complex number.
-//
-// Given a tensor `real` representing the real part of a complex number, and a
-// tensor `imag` representing the imaginary part of a complex number, this
-// operation returns complex numbers elementwise of the form \\(a + bj\\), where
-// *a* represents the `real` part and *b* represents the `imag` part.
-//
-// The input tensors `real` and `imag` must have the same shape.
-//
-// For example:
-//
-// ```
-// # tensor 'real' is [2.25, 3.25]
-// # tensor `imag` is [4.75, 5.75]
-// tf.complex(real, imag) ==> [[2.25 + 4.75j], [3.25 + 5.75j]]
-// ```
-func Complex(scope *Scope, real tf.Output, imag tf.Output, optional ...ComplexAttr) (out tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "Complex",
-		Input: []tf.Input{
-			real, imag,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Divides sparse updates into the variable referenced by `resource`.
-//
-// This operation computes
-//
-//     # Scalar indices
-//     ref[indices, ...] /= updates[...]
+//     # Scalar indices
+//     ref[indices, ...] /= updates[...]
 //
 //     # Vector indices (for each i)
 //     ref[indices[i], ...] /= updates[i, ...]
@@ -8350,6 +8629,23 @@ func ResourceScatterDiv(scope *Scope, resource tf.Output, indices tf.Output, upd
 	return scope.AddOperation(opspec)
 }
 
+// Mutually reduces multiple tensors of identical type and shape.
+func CollectiveReduce(scope *Scope, input tf.Output, group_size int64, group_key int64, instance_key int64, merge_op string, final_op string, subdiv_offsets []int64) (data tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"group_size": group_size, "group_key": group_key, "instance_key": instance_key, "merge_op": merge_op, "final_op": final_op, "subdiv_offsets": subdiv_offsets}
+	opspec := tf.OpSpec{
+		Type: "CollectiveReduce",
+		Input: []tf.Input{
+			input,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
 // StatelessRandomNormalAttr is an optional argument to StatelessRandomNormal.
 type StatelessRandomNormalAttr func(optionalAttr)
 
@@ -10094,6 +10390,43 @@ func BatchDataset(scope *Scope, input_dataset tf.Output, batch_size tf.Output, o
 	return op.Output(0)
 }
 
+// Says whether the targets are in the top `K` predictions.
+//
+// This outputs a `batch_size` bool array, an entry `out[i]` is `true` if the
+// prediction for the target class is among the top `k` predictions among
+// all predictions for example `i`. Note that the behavior of `InTopK` differs
+// from the `TopK` op in its handling of ties; if multiple classes have the
+// same prediction value and straddle the top-`k` boundary, all of those
+// classes are considered to be in the top `k`.
+//
+// More formally, let
+//
+//   \\(predictions_i\\) be the predictions for all classes for example `i`,
+//   \\(targets_i\\) be the target class for example `i`,
+//   \\(out_i\\) be the output for example `i`,
+//
+// $$out_i = predictions_{i, targets_i} \in TopKIncludingTies(predictions_i)$$
+//
+// Arguments:
+//	predictions: A `batch_size` x `classes` tensor.
+//	targets: A `batch_size` vector of class ids.
+//	k: Number of top elements to look at for computing precision.
+//
+// Returns Computed precision at `k` as a `bool Tensor`.
+func InTopKV2(scope *Scope, predictions tf.Output, targets tf.Output, k tf.Output) (precision tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "InTopKV2",
+		Input: []tf.Input{
+			predictions, targets, k,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
 // DecodeAndCropJpegAttr is an optional argument to DecodeAndCropJpeg.
 type DecodeAndCropJpegAttr func(optionalAttr)
 
@@ -10949,101 +11282,6 @@ func Fact(scope *Scope) (fact tf.Output) {
 	return op.Output(0)
 }
 
-// AngleAttr is an optional argument to Angle.
-type AngleAttr func(optionalAttr)
-
-// AngleTout sets the optional Tout attribute to value.
-// If not specified, defaults to DT_FLOAT
-func AngleTout(value tf.DataType) AngleAttr {
-	return func(m optionalAttr) {
-		m["Tout"] = value
-	}
-}
-
-// Returns the argument of a complex number.
-//
-// Given a tensor `input` of complex numbers, this operation returns a tensor of
-// type `float` that is the argument of each element in `input`. All elements in
-// `input` must be complex numbers of the form \\(a + bj\\), where *a*
-// is the real part and *b* is the imaginary part.
-//
-// The argument returned by this operation is of the form \\(atan2(b, a)\\).
-//
-// For example:
-//
-// ```
-// # tensor 'input' is [-2.25 + 4.75j, 3.25 + 5.75j]
-// tf.angle(input) ==> [2.0132, 1.056]
-// ```
-//
-// @compatibility(numpy)
-// Equivalent to np.angle.
-// @end_compatibility
-func Angle(scope *Scope, input tf.Output, optional ...AngleAttr) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "Angle",
-		Input: []tf.Input{
-			input,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// VarHandleOpAttr is an optional argument to VarHandleOp.
-type VarHandleOpAttr func(optionalAttr)
-
-// VarHandleOpContainer sets the optional container attribute to value.
-//
-// value: the container this variable is placed in.
-// If not specified, defaults to ""
-func VarHandleOpContainer(value string) VarHandleOpAttr {
-	return func(m optionalAttr) {
-		m["container"] = value
-	}
-}
-
-// VarHandleOpSharedName sets the optional shared_name attribute to value.
-//
-// value: the name by which this variable is referred to.
-// If not specified, defaults to ""
-func VarHandleOpSharedName(value string) VarHandleOpAttr {
-	return func(m optionalAttr) {
-		m["shared_name"] = value
-	}
-}
-
-// Creates a handle to a Variable resource.
-//
-// Arguments:
-//	dtype: the type of this variable. Must agree with the dtypes
-// of all ops using this variable.
-//	shape: The (possibly partially specified) shape of this variable.
-func VarHandleOp(scope *Scope, dtype tf.DataType, shape tf.Shape, optional ...VarHandleOpAttr) (resource tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"dtype": dtype, "shape": shape}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "VarHandleOp",
-
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
 // Elementwise computes the bitwise XOR of `x` and `y`.
 //
 // The result will have those bits set, that are different in `x` and `y`. The
@@ -11126,86 +11364,29 @@ func DeserializeSparse(scope *Scope, serialized_sparse tf.Output, dtype tf.DataT
 	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// ResourceApplyRMSPropAttr is an optional argument to ResourceApplyRMSProp.
-type ResourceApplyRMSPropAttr func(optionalAttr)
+// ResourceScatterNdUpdateAttr is an optional argument to ResourceScatterNdUpdate.
+type ResourceScatterNdUpdateAttr func(optionalAttr)
 
-// ResourceApplyRMSPropUseLocking sets the optional use_locking attribute to value.
+// ResourceScatterNdUpdateUseLocking sets the optional use_locking attribute to value.
 //
-// value: If `True`, updating of the var, ms, and mom tensors is protected
-// by a lock; otherwise the behavior is undefined, but may exhibit less
-// contention.
-// If not specified, defaults to false
-func ResourceApplyRMSPropUseLocking(value bool) ResourceApplyRMSPropAttr {
+// value: An optional bool. Defaults to True. If True, the assignment will
+// be protected by a lock; otherwise the behavior is undefined,
+// but may exhibit less contention.
+// If not specified, defaults to true
+func ResourceScatterNdUpdateUseLocking(value bool) ResourceScatterNdUpdateAttr {
 	return func(m optionalAttr) {
 		m["use_locking"] = value
 	}
 }
 
-// Update '*var' according to the RMSProp algorithm.
-//
-// Note that in dense implementation of this algorithm, ms and mom will
-// update even if the grad is zero, but in this sparse implementation, ms
-// and mom will not update in iterations during which the grad is zero.
+// Applies sparse `updates` to individual values or slices within a given
 //
-// mean_square = decay * mean_square + (1-decay) * gradient ** 2
-// Delta = learning_rate * gradient / sqrt(mean_square + epsilon)
+// variable according to `indices`.
 //
-// ms <- rho * ms_{t-1} + (1-rho) * grad * grad
-// mom <- momentum * mom_{t-1} + lr * grad / sqrt(ms + epsilon)
-// var <- var - mom
+// `ref` is a `Tensor` with rank `P` and `indices` is a `Tensor` of rank `Q`.
 //
-// Arguments:
-//	var_: Should be from a Variable().
-//	ms: Should be from a Variable().
-//	mom: Should be from a Variable().
-//	lr: Scaling factor. Must be a scalar.
-//	rho: Decay rate. Must be a scalar.
-//
-//	epsilon: Ridge term. Must be a scalar.
-//	grad: The gradient.
-//
-// Returns the created operation.
-func ResourceApplyRMSProp(scope *Scope, var_ tf.Output, ms tf.Output, mom tf.Output, lr tf.Output, rho tf.Output, momentum tf.Output, epsilon tf.Output, grad tf.Output, optional ...ResourceApplyRMSPropAttr) (o *tf.Operation) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "ResourceApplyRMSProp",
-		Input: []tf.Input{
-			var_, ms, mom, lr, rho, momentum, epsilon, grad,
-		},
-		Attrs: attrs,
-	}
-	return scope.AddOperation(opspec)
-}
-
-// ResourceScatterNdUpdateAttr is an optional argument to ResourceScatterNdUpdate.
-type ResourceScatterNdUpdateAttr func(optionalAttr)
-
-// ResourceScatterNdUpdateUseLocking sets the optional use_locking attribute to value.
-//
-// value: An optional bool. Defaults to True. If True, the assignment will
-// be protected by a lock; otherwise the behavior is undefined,
-// but may exhibit less contention.
-// If not specified, defaults to true
-func ResourceScatterNdUpdateUseLocking(value bool) ResourceScatterNdUpdateAttr {
-	return func(m optionalAttr) {
-		m["use_locking"] = value
-	}
-}
-
-// Applies sparse `updates` to individual values or slices within a given
-//
-// variable according to `indices`.
-//
-// `ref` is a `Tensor` with rank `P` and `indices` is a `Tensor` of rank `Q`.
-//
-// `indices` must be integer tensor, containing indices into `ref`.
-// It must be shape `[d_0, ..., d_{Q-2}, K]` where `0 < K <= P`.
+// `indices` must be integer tensor, containing indices into `ref`.
+// It must be shape `[d_0, ..., d_{Q-2}, K]` where `0 < K <= P`.
 //
 // The innermost dimension of `indices` (with length `K`) corresponds to
 // indices into elements (if `K = P`) or slices (if `K < P`) along the `K`th
@@ -11482,60 +11663,6 @@ func SparseDenseCwiseMul(scope *Scope, sp_indices tf.Output, sp_values tf.Output
 	return op.Output(0)
 }
 
-// ResizeAreaAttr is an optional argument to ResizeArea.
-type ResizeAreaAttr func(optionalAttr)
-
-// ResizeAreaAlignCorners sets the optional align_corners attribute to value.
-//
-// value: If true, the centers of the 4 corner pixels of the input and output tensors are
-// aligned, preserving the values at the corner pixels. Defaults to false.
-// If not specified, defaults to false
-func ResizeAreaAlignCorners(value bool) ResizeAreaAttr {
-	return func(m optionalAttr) {
-		m["align_corners"] = value
-	}
-}
-
-// Resize `images` to `size` using area interpolation.
-//
-// Input images can be of different types but output images are always float.
-//
-// The range of pixel values for the output image might be slightly different
-// from the range for the input image because of limited numerical precision.
-// To guarantee an output range, for example `[0.0, 1.0]`, apply
-// `tf.clip_by_value` to the output.
-//
-// Each output pixel is computed by first transforming the pixel's footprint into
-// the input tensor and then averaging the pixels that intersect the footprint. An
-// input pixel's contribution to the average is weighted by the fraction of its
-// area that intersects the footprint.  This is the same as OpenCV's INTER_AREA.
-//
-// Arguments:
-//	images: 4-D with shape `[batch, height, width, channels]`.
-//	size: = A 1-D int32 Tensor of 2 elements: `new_height, new_width`.  The
-// new size for the images.
-//
-// Returns 4-D with shape
-// `[batch, new_height, new_width, channels]`.
-func ResizeArea(scope *Scope, images tf.Output, size tf.Output, optional ...ResizeAreaAttr) (resized_images tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "ResizeArea",
-		Input: []tf.Input{
-			images, size,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
 // 2D real-valued fast Fourier transform.
 //
 // Computes the 2-dimensional discrete Fourier transform of a real-valued signal
@@ -11711,23 +11838,6 @@ func TensorArrayGatherV3(scope *Scope, handle tf.Output, indices tf.Output, flow
 	return op.Output(0)
 }
 
-// Mutually reduces multiple tensors of identical type and shape.
-func CollectiveReduce(scope *Scope, input tf.Output, group_size int64, group_key int64, instance_key int64, merge_op string, final_op string, subdiv_offsets []int64) (data tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"group_size": group_size, "group_key": group_key, "instance_key": instance_key, "merge_op": merge_op, "final_op": final_op, "subdiv_offsets": subdiv_offsets}
-	opspec := tf.OpSpec{
-		Type: "CollectiveReduce",
-		Input: []tf.Input{
-			input,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
 // This op consumes a lock created by `MutexLock`.
 //
 // This op exists to consume a tensor created by `MutexLock` (other than
@@ -11829,81 +11939,6 @@ func RestoreV2(scope *Scope, prefix tf.Output, tensor_names tf.Output, shape_and
 	return tensors
 }
 
-// Creates a dataset that skips `count` elements from the `input_dataset`.
-//
-// Arguments:
-//
-//	count: A scalar representing the number of elements from the `input_dataset`
-// that should be skipped.  If count is -1, skips everything.
-//
-//
-func SkipDataset(scope *Scope, input_dataset tf.Output, count tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
-	opspec := tf.OpSpec{
-		Type: "SkipDataset",
-		Input: []tf.Input{
-			input_dataset, count,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Computes the maximum along segments of a tensor.
-//
-// Read @{$math_ops#Segmentation$the section on segmentation} for an explanation of
-// segments.
-//
-// Computes a tensor such that
-// \\(output_i = \max_j(data_j)\\) where `max` is over `j` such
-// that `segment_ids[j] == i`.
-//
-// If the max is empty for a given segment ID `i`, `output[i] = 0`.
-//
-// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
-// <img style="width:100%" src="https://www.tensorflow.org/images/SegmentMax.png" alt>
-// </div>
-//
-// Arguments:
-//
-//	segment_ids: A 1-D tensor whose rank is equal to the rank of `data`'s
-// first dimension.  Values should be sorted and can be repeated.
-//
-// Returns Has same shape as data, except for dimension 0 which
-// has size `k`, the number of segments.
-func SegmentMax(scope *Scope, data tf.Output, segment_ids tf.Output) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "SegmentMax",
-		Input: []tf.Input{
-			data, segment_ids,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Computes hyperbolic tangent of `x` element-wise.
-func Tanh(scope *Scope, x tf.Output) (y tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "Tanh",
-		Input: []tf.Input{
-			x,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
 // Receives a tensor value broadcast from another device.
 func CollectiveBcastRecv(scope *Scope, T tf.DataType, group_size int64, group_key int64, instance_key int64, shape tf.Shape) (data tf.Output) {
 	if scope.Err() != nil {
@@ -13846,44 +13881,37 @@ func ResourceApplyCenteredRMSProp(scope *Scope, var_ tf.Output, mg tf.Output, ms
 	return scope.AddOperation(opspec)
 }
 
-// RealAttr is an optional argument to Real.
-type RealAttr func(optionalAttr)
-
-// RealTout sets the optional Tout attribute to value.
-// If not specified, defaults to DT_FLOAT
-func RealTout(value tf.DataType) RealAttr {
-	return func(m optionalAttr) {
-		m["Tout"] = value
+// Computes the gradient for the inverse of `x` wrt its input.
+//
+// Specifically, `grad = -dy * y*y`, where `y = 1/x`, and `dy`
+// is the corresponding input gradient.
+func ReciprocalGrad(scope *Scope, y tf.Output, dy tf.Output) (z tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "ReciprocalGrad",
+		Input: []tf.Input{
+			y, dy,
+		},
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// Returns the real part of a complex number.
-//
-// Given a tensor `input` of complex numbers, this operation returns a tensor of
-// type `float` that is the real part of each element in `input`. All elements in
-// `input` must be complex numbers of the form \\(a + bj\\), where *a* is the real
-//  part returned by this operation and *b* is the imaginary part.
-//
-// For example:
+// Returns the min of x and y (i.e. x < y ? x : y) element-wise.
 //
-// ```
-// # tensor 'input' is [-2.25 + 4.75j, 3.25 + 5.75j]
-// tf.real(input) ==> [-2.25, 3.25]
-// ```
-func Real(scope *Scope, input tf.Output, optional ...RealAttr) (output tf.Output) {
+// *NOTE*: `Minimum` supports broadcasting. More about broadcasting
+// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
+func Minimum(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "Real",
+		Type: "Minimum",
 		Input: []tf.Input{
-			input,
+			x, y,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
@@ -16239,19 +16267,76 @@ func MutableDenseHashTableV2(scope *Scope, empty_key tf.Output, value_dtype tf.D
 	return op.Output(0)
 }
 
-// Returns element-wise remainder of division. This emulates C semantics in that
-//
-// the result here is consistent with a truncating divide. E.g. `truncate(x / y) *
-// y + truncate_mod(x, y) = x`.
+// ResourceApplyRMSPropAttr is an optional argument to ResourceApplyRMSProp.
+type ResourceApplyRMSPropAttr func(optionalAttr)
+
+// ResourceApplyRMSPropUseLocking sets the optional use_locking attribute to value.
 //
-// *NOTE*: `TruncateMod` supports broadcasting. More about broadcasting
-// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
-func TruncateMod(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "TruncateMod",
+// value: If `True`, updating of the var, ms, and mom tensors is protected
+// by a lock; otherwise the behavior is undefined, but may exhibit less
+// contention.
+// If not specified, defaults to false
+func ResourceApplyRMSPropUseLocking(value bool) ResourceApplyRMSPropAttr {
+	return func(m optionalAttr) {
+		m["use_locking"] = value
+	}
+}
+
+// Update '*var' according to the RMSProp algorithm.
+//
+// Note that in dense implementation of this algorithm, ms and mom will
+// update even if the grad is zero, but in this sparse implementation, ms
+// and mom will not update in iterations during which the grad is zero.
+//
+// mean_square = decay * mean_square + (1-decay) * gradient ** 2
+// Delta = learning_rate * gradient / sqrt(mean_square + epsilon)
+//
+// ms <- rho * ms_{t-1} + (1-rho) * grad * grad
+// mom <- momentum * mom_{t-1} + lr * grad / sqrt(ms + epsilon)
+// var <- var - mom
+//
+// Arguments:
+//	var_: Should be from a Variable().
+//	ms: Should be from a Variable().
+//	mom: Should be from a Variable().
+//	lr: Scaling factor. Must be a scalar.
+//	rho: Decay rate. Must be a scalar.
+//
+//	epsilon: Ridge term. Must be a scalar.
+//	grad: The gradient.
+//
+// Returns the created operation.
+func ResourceApplyRMSProp(scope *Scope, var_ tf.Output, ms tf.Output, mom tf.Output, lr tf.Output, rho tf.Output, momentum tf.Output, epsilon tf.Output, grad tf.Output, optional ...ResourceApplyRMSPropAttr) (o *tf.Operation) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "ResourceApplyRMSProp",
+		Input: []tf.Input{
+			var_, ms, mom, lr, rho, momentum, epsilon, grad,
+		},
+		Attrs: attrs,
+	}
+	return scope.AddOperation(opspec)
+}
+
+// Returns element-wise remainder of division. This emulates C semantics in that
+//
+// the result here is consistent with a truncating divide. E.g. `truncate(x / y) *
+// y + truncate_mod(x, y) = x`.
+//
+// *NOTE*: `TruncateMod` supports broadcasting. More about broadcasting
+// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
+func TruncateMod(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "TruncateMod",
 		Input: []tf.Input{
 			x, y,
 		},
@@ -17512,69 +17597,6 @@ func DeserializeManySparse(scope *Scope, serialized_sparse tf.Output, dtype tf.D
 	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// StringJoinAttr is an optional argument to StringJoin.
-type StringJoinAttr func(optionalAttr)
-
-// StringJoinSeparator sets the optional separator attribute to value.
-//
-// value: string, an optional join separator.
-// If not specified, defaults to ""
-func StringJoinSeparator(value string) StringJoinAttr {
-	return func(m optionalAttr) {
-		m["separator"] = value
-	}
-}
-
-// Joins the strings in the given list of string tensors into one tensor;
-//
-// with the given separator (default is an empty separator).
-//
-// Arguments:
-//	inputs: A list of string tensors.  The tensors must all have the same shape,
-// or be scalars.  Scalars may be mixed in; these will be broadcast to the shape
-// of non-scalar inputs.
-func StringJoin(scope *Scope, inputs []tf.Output, optional ...StringJoinAttr) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "StringJoin",
-		Input: []tf.Input{
-			tf.OutputList(inputs),
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Returns immutable tensor from memory region.
-//
-// The current implementation memmaps the tensor from a file.
-//
-// Arguments:
-//	dtype: Type of the returned tensor.
-//	shape: Shape of the returned tensor.
-//	memory_region_name: Name of readonly memory region used by the tensor, see
-// NewReadOnlyMemoryRegionFromFile in tensorflow::Env.
-func ImmutableConst(scope *Scope, dtype tf.DataType, shape tf.Shape, memory_region_name string) (tensor tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"dtype": dtype, "shape": shape, "memory_region_name": memory_region_name}
-	opspec := tf.OpSpec{
-		Type: "ImmutableConst",
-
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
 // Inverse real-valued fast Fourier transform.
 //
 // Computes the inverse 1-dimensional discrete Fourier transform of a real-valued
@@ -18002,43 +18024,6 @@ func MaxPool(scope *Scope, input tf.Output, ksize []int64, strides []int64, padd
 	return op.Output(0)
 }
 
-// Says whether the targets are in the top `K` predictions.
-//
-// This outputs a `batch_size` bool array, an entry `out[i]` is `true` if the
-// prediction for the target class is among the top `k` predictions among
-// all predictions for example `i`. Note that the behavior of `InTopK` differs
-// from the `TopK` op in its handling of ties; if multiple classes have the
-// same prediction value and straddle the top-`k` boundary, all of those
-// classes are considered to be in the top `k`.
-//
-// More formally, let
-//
-//   \\(predictions_i\\) be the predictions for all classes for example `i`,
-//   \\(targets_i\\) be the target class for example `i`,
-//   \\(out_i\\) be the output for example `i`,
-//
-// $$out_i = predictions_{i, targets_i} \in TopKIncludingTies(predictions_i)$$
-//
-// Arguments:
-//	predictions: A `batch_size` x `classes` tensor.
-//	targets: A `batch_size` vector of class ids.
-//	k: Number of top elements to look at for computing precision.
-//
-// Returns Computed precision at `k` as a `bool Tensor`.
-func InTopKV2(scope *Scope, predictions tf.Output, targets tf.Output, k tf.Output) (precision tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "InTopKV2",
-		Input: []tf.Input{
-			predictions, targets, k,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
 // Assigns a new value to a variable.
 //
 // Any ReadVariableOp with a control dependency on this op is guaranteed to return
@@ -18113,9 +18098,10 @@ func SparseFillEmptyRowsGrad(scope *Scope, reverse_index_map tf.Output, grad_val
 }
 
 // Computes scaled exponential linear: `scale * alpha * (exp(features) - 1)`
-//
 // if < 0, `scale * features` otherwise.
 //
+// Assumes weights to have zero mean and variance 1.0 / fan_in.
+//
 // See [Self-Normalizing Neural Networks](https://arxiv.org/abs/1706.02515)
 func Selu(scope *Scope, features tf.Output) (activations tf.Output) {
 	if scope.Err() != nil {
@@ -19183,99 +19169,69 @@ func SquaredDifference(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
 	return op.Output(0)
 }
 
-// Forwards the input to the output.
-//
-// This operator represents the loop termination condition used by the
-// "pivot" switches of a loop.
-//
-// Arguments:
-//	input: A boolean scalar, representing the branch predicate of the Switch op.
+// RandomGammaAttr is an optional argument to RandomGamma.
+type RandomGammaAttr func(optionalAttr)
+
+// RandomGammaSeed sets the optional seed attribute to value.
 //
-// Returns The same tensor as `input`.
-func LoopCond(scope *Scope, input tf.Output) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "LoopCond",
-		Input: []tf.Input{
-			input,
-		},
+// value: If either `seed` or `seed2` are set to be non-zero, the random number
+// generator is seeded by the given seed.  Otherwise, it is seeded by a
+// random seed.
+// If not specified, defaults to 0
+func RandomGammaSeed(value int64) RandomGammaAttr {
+	return func(m optionalAttr) {
+		m["seed"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// Computes the gradient for the inverse of `x` wrt its input.
+// RandomGammaSeed2 sets the optional seed2 attribute to value.
 //
-// Specifically, `grad = -dy * y*y`, where `y = 1/x`, and `dy`
-// is the corresponding input gradient.
-func ReciprocalGrad(scope *Scope, y tf.Output, dy tf.Output) (z tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "ReciprocalGrad",
-		Input: []tf.Input{
-			y, dy,
-		},
+// value: A second seed to avoid seed collision.
+// If not specified, defaults to 0
+func RandomGammaSeed2(value int64) RandomGammaAttr {
+	return func(m optionalAttr) {
+		m["seed2"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// Returns the min of x and y (i.e. x < y ? x : y) element-wise.
+// Outputs random values from the Gamma distribution(s) described by alpha.
 //
-// *NOTE*: `Minimum` supports broadcasting. More about broadcasting
-// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
-func Minimum(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
+// This op uses the algorithm by Marsaglia et al. to acquire samples via
+// transformation-rejection from pairs of uniform and normal random variables.
+// See http://dl.acm.org/citation.cfm?id=358414
+//
+// Arguments:
+//	shape: 1-D integer tensor. Shape of independent samples to draw from each
+// distribution described by the shape parameters given in alpha.
+//	alpha: A tensor in which each scalar is a "shape" parameter describing the
+// associated gamma distribution.
+//
+// Returns A tensor with shape `shape + shape(alpha)`. Each slice
+// `[:, ..., :, i0, i1, ...iN]` contains the samples drawn for
+// `alpha[i0, i1, ...iN]`. The dtype of the output matches the dtype of alpha.
+func RandomGamma(scope *Scope, shape tf.Output, alpha tf.Output, optional ...RandomGammaAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "Minimum",
+		Type: "RandomGamma",
 		Input: []tf.Input{
-			x, y,
+			shape, alpha,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Returns the element-wise sum of a list of tensors.
-//
-// `tf.accumulate_n_v2` performs the same operation as `tf.add_n`, but does not
-// wait for all of its inputs to be ready before beginning to sum. This can
-// save memory if inputs are ready at different times, since minimum temporary
-// storage is proportional to the output size rather than the inputs size.
+// Convert the quantized 'input' tensor into a lower-precision 'output', using the
 //
-// Unlike the original `accumulate_n`, `accumulate_n_v2` is differentiable.
-//
-// Returns a `Tensor` of same shape and type as the elements of `inputs`.
-//
-// Arguments:
-//	inputs: A list of `Tensor` objects, each with same shape and type.
-//	shape: Shape of elements of `inputs`.
-func AccumulateNV2(scope *Scope, inputs []tf.Output, shape tf.Shape) (sum tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"shape": shape}
-	opspec := tf.OpSpec{
-		Type: "AccumulateNV2",
-		Input: []tf.Input{
-			tf.OutputList(inputs),
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Convert the quantized 'input' tensor into a lower-precision 'output', using the
-//
-// actual distribution of the values to maximize the usage of the lower bit depth
-// and adjusting the output min and max ranges accordingly.
+// actual distribution of the values to maximize the usage of the lower bit depth
+// and adjusting the output min and max ranges accordingly.
 //
 // [input_min, input_max] are scalar floats that specify the range for the float
 // interpretation of the 'input' data. For example, if input_min is -1.0f and
@@ -19320,60 +19276,24 @@ func QuantizeDownAndShrinkRange(scope *Scope, input tf.Output, input_min tf.Outp
 	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// RandomGammaAttr is an optional argument to RandomGamma.
-type RandomGammaAttr func(optionalAttr)
-
-// RandomGammaSeed sets the optional seed attribute to value.
-//
-// value: If either `seed` or `seed2` are set to be non-zero, the random number
-// generator is seeded by the given seed.  Otherwise, it is seeded by a
-// random seed.
-// If not specified, defaults to 0
-func RandomGammaSeed(value int64) RandomGammaAttr {
-	return func(m optionalAttr) {
-		m["seed"] = value
-	}
-}
-
-// RandomGammaSeed2 sets the optional seed2 attribute to value.
-//
-// value: A second seed to avoid seed collision.
-// If not specified, defaults to 0
-func RandomGammaSeed2(value int64) RandomGammaAttr {
-	return func(m optionalAttr) {
-		m["seed2"] = value
-	}
-}
-
-// Outputs random values from the Gamma distribution(s) described by alpha.
+// Forwards the input to the output.
 //
-// This op uses the algorithm by Marsaglia et al. to acquire samples via
-// transformation-rejection from pairs of uniform and normal random variables.
-// See http://dl.acm.org/citation.cfm?id=358414
+// This operator represents the loop termination condition used by the
+// "pivot" switches of a loop.
 //
 // Arguments:
-//	shape: 1-D integer tensor. Shape of independent samples to draw from each
-// distribution described by the shape parameters given in alpha.
-//	alpha: A tensor in which each scalar is a "shape" parameter describing the
-// associated gamma distribution.
+//	input: A boolean scalar, representing the branch predicate of the Switch op.
 //
-// Returns A tensor with shape `shape + shape(alpha)`. Each slice
-// `[:, ..., :, i0, i1, ...iN]` contains the samples drawn for
-// `alpha[i0, i1, ...iN]`. The dtype of the output matches the dtype of alpha.
-func RandomGamma(scope *Scope, shape tf.Output, alpha tf.Output, optional ...RandomGammaAttr) (output tf.Output) {
+// Returns The same tensor as `input`.
+func LoopCond(scope *Scope, input tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "RandomGamma",
+		Type: "LoopCond",
 		Input: []tf.Input{
-			shape, alpha,
+			input,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
@@ -19594,100 +19514,34 @@ func OrderedMapIncompleteSize(scope *Scope, dtypes []tf.DataType, optional ...Or
 	return op.Output(0)
 }
 
-// Counts the number of occurrences of each value in an integer array.
-//
-// Outputs a vector with length `size` and the same dtype as `weights`. If
-// `weights` are empty, then index `i` stores the number of times the value `i` is
-// counted in `arr`. If `weights` are non-empty, then index `i` stores the sum of
-// the value in `weights` at each index where the corresponding value in `arr` is
-// `i`.
-//
-// Values in `arr` outside of the range [0, size) are ignored.
-//
-// Arguments:
-//	arr: int32 `Tensor`.
-//	size: non-negative int32 scalar `Tensor`.
-//	weights: is an int32, int64, float32, or float64 `Tensor` with the same
-// shape as `arr`, or a length-0 `Tensor`, in which case it acts as all weights
-// equal to 1.
-//
-// Returns 1D `Tensor` with length equal to `size`. The counts or summed weights for
-// each value in the range [0, size).
-func Bincount(scope *Scope, arr tf.Output, size tf.Output, weights tf.Output) (bins tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "Bincount",
-		Input: []tf.Input{
-			arr, size, weights,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// CumsumAttr is an optional argument to Cumsum.
-type CumsumAttr func(optionalAttr)
-
-// CumsumExclusive sets the optional exclusive attribute to value.
-//
-// value: If `True`, perform exclusive cumsum.
-// If not specified, defaults to false
-func CumsumExclusive(value bool) CumsumAttr {
-	return func(m optionalAttr) {
-		m["exclusive"] = value
-	}
-}
+// ComplexAttr is an optional argument to Complex.
+type ComplexAttr func(optionalAttr)
 
-// CumsumReverse sets the optional reverse attribute to value.
-//
-// value: A `bool` (default: False).
-// If not specified, defaults to false
-func CumsumReverse(value bool) CumsumAttr {
+// ComplexTout sets the optional Tout attribute to value.
+// If not specified, defaults to DT_COMPLEX64
+func ComplexTout(value tf.DataType) ComplexAttr {
 	return func(m optionalAttr) {
-		m["reverse"] = value
+		m["Tout"] = value
 	}
 }
 
-// Compute the cumulative sum of the tensor `x` along `axis`.
-//
-// By default, this op performs an inclusive cumsum, which means that the first
-// element of the input is identical to the first element of the output:
-//
-// ```python
-// tf.cumsum([a, b, c])  # => [a, a + b, a + b + c]
-// ```
+// Converts two real numbers to a complex number.
 //
-// By setting the `exclusive` kwarg to `True`, an exclusive cumsum is
-// performed instead:
+// Given a tensor `real` representing the real part of a complex number, and a
+// tensor `imag` representing the imaginary part of a complex number, this
+// operation returns complex numbers elementwise of the form \\(a + bj\\), where
+// *a* represents the `real` part and *b* represents the `imag` part.
 //
-// ```python
-// tf.cumsum([a, b, c], exclusive=True)  # => [0, a, a + b]
-// ```
+// The input tensors `real` and `imag` must have the same shape.
 //
-// By setting the `reverse` kwarg to `True`, the cumsum is performed in the
-// opposite direction:
+// For example:
 //
-// ```python
-// tf.cumsum([a, b, c], reverse=True)  # => [a + b + c, b + c, c]
 // ```
-//
-// This is more efficient than using separate `tf.reverse` ops.
-//
-// The `reverse` and `exclusive` kwargs can also be combined:
-//
-// ```python
-// tf.cumsum([a, b, c], exclusive=True, reverse=True)  # => [b + c, c, 0]
+// # tensor 'real' is [2.25, 3.25]
+// # tensor `imag` is [4.75, 5.75]
+// tf.complex(real, imag) ==> [[2.25 + 4.75j], [3.25 + 5.75j]]
 // ```
-//
-// Arguments:
-//	x: A `Tensor`. Must be one of the following types: `float32`, `float64`,
-// `int64`, `int32`, `uint8`, `uint16`, `int16`, `int8`, `complex64`,
-// `complex128`, `qint8`, `quint8`, `qint32`, `half`.
-//	axis: A `Tensor` of type `int32` (default: 0). Must be in the range
-// `[-rank(x), rank(x))`.
-func Cumsum(scope *Scope, x tf.Output, axis tf.Output, optional ...CumsumAttr) (out tf.Output) {
+func Complex(scope *Scope, real tf.Output, imag tf.Output, optional ...ComplexAttr) (out tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -19696,9 +19550,9 @@ func Cumsum(scope *Scope, x tf.Output, axis tf.Output, optional ...CumsumAttr) (
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "Cumsum",
+		Type: "Complex",
 		Input: []tf.Input{
-			x, axis,
+			real, imag,
 		},
 		Attrs: attrs,
 	}
@@ -19706,67 +19560,31 @@ func Cumsum(scope *Scope, x tf.Output, axis tf.Output, optional ...CumsumAttr) (
 	return op.Output(0)
 }
 
-// CumprodAttr is an optional argument to Cumprod.
-type CumprodAttr func(optionalAttr)
-
-// CumprodExclusive sets the optional exclusive attribute to value.
-//
-// value: If `True`, perform exclusive cumprod.
-// If not specified, defaults to false
-func CumprodExclusive(value bool) CumprodAttr {
-	return func(m optionalAttr) {
-		m["exclusive"] = value
-	}
-}
+// ImagAttr is an optional argument to Imag.
+type ImagAttr func(optionalAttr)
 
-// CumprodReverse sets the optional reverse attribute to value.
-//
-// value: A `bool` (default: False).
-// If not specified, defaults to false
-func CumprodReverse(value bool) CumprodAttr {
+// ImagTout sets the optional Tout attribute to value.
+// If not specified, defaults to DT_FLOAT
+func ImagTout(value tf.DataType) ImagAttr {
 	return func(m optionalAttr) {
-		m["reverse"] = value
+		m["Tout"] = value
 	}
 }
 
-// Compute the cumulative product of the tensor `x` along `axis`.
-//
-// By default, this op performs an inclusive cumprod, which means that the first
-// element of the input is identical to the first element of the output:
-//
-// ```python
-// tf.cumprod([a, b, c])  # => [a, a * b, a * b * c]
-// ```
-//
-// By setting the `exclusive` kwarg to `True`, an exclusive cumprod is
-// performed instead:
+// Returns the imaginary part of a complex number.
 //
-// ```python
-// tf.cumprod([a, b, c], exclusive=True)  # => [1, a, a * b]
-// ```
+// Given a tensor `input` of complex numbers, this operation returns a tensor of
+// type `float` that is the imaginary part of each element in `input`. All
+// elements in `input` must be complex numbers of the form \\(a + bj\\), where *a*
+// is the real part and *b* is the imaginary part returned by this operation.
 //
-// By setting the `reverse` kwarg to `True`, the cumprod is performed in the
-// opposite direction:
+// For example:
 //
-// ```python
-// tf.cumprod([a, b, c], reverse=True)  # => [a * b * c, b * c, c]
 // ```
-//
-// This is more efficient than using separate `tf.reverse` ops.
-//
-// The `reverse` and `exclusive` kwargs can also be combined:
-//
-// ```python
-// tf.cumprod([a, b, c], exclusive=True, reverse=True)  # => [b * c, c, 1]
+// # tensor 'input' is [-2.25 + 4.75j, 3.25 + 5.75j]
+// tf.imag(input) ==> [4.75, 5.75]
 // ```
-//
-// Arguments:
-//	x: A `Tensor`. Must be one of the following types: `float32`, `float64`,
-// `int64`, `int32`, `uint8`, `uint16`, `int16`, `int8`, `complex64`,
-// `complex128`, `qint8`, `quint8`, `qint32`, `half`.
-//	axis: A `Tensor` of type `int32` (default: 0). Must be in the range
-// `[-rank(x), rank(x))`.
-func Cumprod(scope *Scope, x tf.Output, axis tf.Output, optional ...CumprodAttr) (out tf.Output) {
+func Imag(scope *Scope, input tf.Output, optional ...ImagAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -19775,9 +19593,9 @@ func Cumprod(scope *Scope, x tf.Output, axis tf.Output, optional ...CumprodAttr)
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "Cumprod",
+		Type: "Imag",
 		Input: []tf.Input{
-			x, axis,
+			input,
 		},
 		Attrs: attrs,
 	}
@@ -19785,148 +19603,74 @@ func Cumprod(scope *Scope, x tf.Output, axis tf.Output, optional ...CumprodAttr)
 	return op.Output(0)
 }
 
-// QuantizedMatMulAttr is an optional argument to QuantizedMatMul.
-type QuantizedMatMulAttr func(optionalAttr)
-
-// QuantizedMatMulToutput sets the optional Toutput attribute to value.
-// If not specified, defaults to DT_QINT32
-func QuantizedMatMulToutput(value tf.DataType) QuantizedMatMulAttr {
-	return func(m optionalAttr) {
-		m["Toutput"] = value
-	}
-}
-
-// QuantizedMatMulTransposeA sets the optional transpose_a attribute to value.
+// Computes the maximum along segments of a tensor.
 //
-// value: If true, `a` is transposed before multiplication.
-// If not specified, defaults to false
-func QuantizedMatMulTransposeA(value bool) QuantizedMatMulAttr {
-	return func(m optionalAttr) {
-		m["transpose_a"] = value
-	}
-}
-
-// QuantizedMatMulTransposeB sets the optional transpose_b attribute to value.
+// Read @{$math_ops#Segmentation$the section on segmentation} for an explanation of
+// segments.
 //
-// value: If true, `b` is transposed before multiplication.
-// If not specified, defaults to false
-func QuantizedMatMulTransposeB(value bool) QuantizedMatMulAttr {
-	return func(m optionalAttr) {
-		m["transpose_b"] = value
-	}
-}
-
-// QuantizedMatMulTactivation sets the optional Tactivation attribute to value.
+// Computes a tensor such that
+// \\(output_i = \max_j(data_j)\\) where `max` is over `j` such
+// that `segment_ids[j] == i`.
 //
-// value: The type of output produced by activation function
-// following this operation.
-// If not specified, defaults to DT_QUINT8
-func QuantizedMatMulTactivation(value tf.DataType) QuantizedMatMulAttr {
-	return func(m optionalAttr) {
-		m["Tactivation"] = value
-	}
-}
-
-// Perform a quantized matrix multiplication of  `a` by the matrix `b`.
+// If the max is empty for a given segment ID `i`, `output[i] = 0`.
 //
-// The inputs must be two-dimensional matrices and the inner dimension of
-// `a` (after being transposed if `transpose_a` is non-zero) must match the
-// outer dimension of `b` (after being transposed if `transposed_b` is
-// non-zero).
+// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
+// <img style="width:100%" src="https://www.tensorflow.org/images/SegmentMax.png" alt>
+// </div>
 //
 // Arguments:
-//	a: Must be a two-dimensional tensor.
-//	b: Must be a two-dimensional tensor.
-//	min_a: The float value that the lowest quantized `a` value represents.
-//	max_a: The float value that the highest quantized `a` value represents.
-//	min_b: The float value that the lowest quantized `b` value represents.
-//	max_b: The float value that the highest quantized `b` value represents.
 //
-// Returns The float value that the lowest quantized output value represents.The float value that the highest quantized output value represents.
-func QuantizedMatMul(scope *Scope, a tf.Output, b tf.Output, min_a tf.Output, max_a tf.Output, min_b tf.Output, max_b tf.Output, optional ...QuantizedMatMulAttr) (out tf.Output, min_out tf.Output, max_out tf.Output) {
+//	segment_ids: A 1-D tensor whose rank is equal to the rank of `data`'s
+// first dimension.  Values should be sorted and can be repeated.
+//
+// Returns Has same shape as data, except for dimension 0 which
+// has size `k`, the number of segments.
+func SegmentMax(scope *Scope, data tf.Output, segment_ids tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "QuantizedMatMul",
+		Type: "SegmentMax",
 		Input: []tf.Input{
-			a, b, min_a, max_a, min_b, max_b,
+			data, segment_ids,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
-}
-
-// Does nothing. Serves as a control trigger for scheduling.
-//
-// Only useful as a placeholder for control edges.
-//
-// Returns the created operation.
-func ControlTrigger(scope *Scope) (o *tf.Operation) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "ControlTrigger",
-	}
-	return scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// Batch normalization.
-//
-// DEPRECATED at GraphDef version 9: Use tf.nn.batch_normalization()
-//
-// This op is deprecated. Prefer `tf.nn.batch_normalization`.
-//
-// Arguments:
-//	t: A 4D input Tensor.
-//	m: A 1D mean Tensor with size matching the last dimension of t.
-// This is the first output from tf.nn.moments,
-// or a saved moving average thereof.
-//	v: A 1D variance Tensor with size matching the last dimension of t.
-// This is the second output from tf.nn.moments,
-// or a saved moving average thereof.
-//	beta: A 1D beta Tensor with size matching the last dimension of t.
-// An offset to be added to the normalized tensor.
-//	gamma: A 1D gamma Tensor with size matching the last dimension of t.
-// If "scale_after_normalization" is true, this tensor will be multiplied
-// with the normalized tensor.
-//	variance_epsilon: A small float number to avoid dividing by 0.
-//	scale_after_normalization: A bool indicating whether the resulted tensor
-// needs to be multiplied with gamma.
-func BatchNormWithGlobalNormalization(scope *Scope, t tf.Output, m tf.Output, v tf.Output, beta tf.Output, gamma tf.Output, variance_epsilon float32, scale_after_normalization bool) (result tf.Output) {
+// Computes hyperbolic tangent of `x` element-wise.
+func Tanh(scope *Scope, x tf.Output) (y tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"variance_epsilon": variance_epsilon, "scale_after_normalization": scale_after_normalization}
 	opspec := tf.OpSpec{
-		Type: "BatchNormWithGlobalNormalization",
+		Type: "Tanh",
 		Input: []tf.Input{
-			t, m, v, beta, gamma,
+			x,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Deprecated. Use TensorArrayReadV3
+// Creates a dataset that skips `count` elements from the `input_dataset`.
 //
-// DEPRECATED at GraphDef version 26: Use TensorArrayReadV3
-func TensorArrayReadV2(scope *Scope, handle tf.Output, index tf.Output, flow_in tf.Output, dtype tf.DataType) (value tf.Output) {
+// Arguments:
+//
+//	count: A scalar representing the number of elements from the `input_dataset`
+// that should be skipped.  If count is -1, skips everything.
+//
+//
+func SkipDataset(scope *Scope, input_dataset tf.Output, count tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtype": dtype}
+	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "TensorArrayReadV2",
+		Type: "SkipDataset",
 		Input: []tf.Input{
-			handle, index, flow_in,
+			input_dataset, count,
 		},
 		Attrs: attrs,
 	}
@@ -19934,32 +19678,31 @@ func TensorArrayReadV2(scope *Scope, handle tf.Output, index tf.Output, flow_in
 	return op.Output(0)
 }
 
-// QuantizedMulAttr is an optional argument to QuantizedMul.
-type QuantizedMulAttr func(optionalAttr)
+// RealAttr is an optional argument to Real.
+type RealAttr func(optionalAttr)
 
-// QuantizedMulToutput sets the optional Toutput attribute to value.
-// If not specified, defaults to DT_QINT32
-func QuantizedMulToutput(value tf.DataType) QuantizedMulAttr {
+// RealTout sets the optional Tout attribute to value.
+// If not specified, defaults to DT_FLOAT
+func RealTout(value tf.DataType) RealAttr {
 	return func(m optionalAttr) {
-		m["Toutput"] = value
+		m["Tout"] = value
 	}
 }
 
-// Returns x * y element-wise, working on quantized buffers.
-//
-// Arguments:
-//
+// Returns the real part of a complex number.
 //
-//	min_x: The float value that the lowest quantized `x` value represents.
-//	max_x: The float value that the highest quantized `x` value represents.
-//	min_y: The float value that the lowest quantized `y` value represents.
-//	max_y: The float value that the highest quantized `y` value represents.
+// Given a tensor `input` of complex numbers, this operation returns a tensor of
+// type `float` that is the real part of each element in `input`. All elements in
+// `input` must be complex numbers of the form \\(a + bj\\), where *a* is the real
+//  part returned by this operation and *b* is the imaginary part.
 //
-// Returns The float value that the lowest quantized output value represents.The float value that the highest quantized output value represents.
+// For example:
 //
-// *NOTE*: `QuantizedMul` supports limited forms of broadcasting. More about
-// broadcasting [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
-func QuantizedMul(scope *Scope, x tf.Output, y tf.Output, min_x tf.Output, max_x tf.Output, min_y tf.Output, max_y tf.Output, optional ...QuantizedMulAttr) (z tf.Output, min_z tf.Output, max_z tf.Output) {
+// ```
+// # tensor 'input' is [-2.25 + 4.75j, 3.25 + 5.75j]
+// tf.real(input) ==> [-2.25, 3.25]
+// ```
+func Real(scope *Scope, input tf.Output, optional ...RealAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -19968,42 +19711,52 @@ func QuantizedMul(scope *Scope, x tf.Output, y tf.Output, min_x tf.Output, max_x
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "QuantizedMul",
+		Type: "Real",
 		Input: []tf.Input{
-			x, y, min_x, max_x, min_y, max_y,
+			input,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
+	return op.Output(0)
 }
 
-// QuantizedAddAttr is an optional argument to QuantizedAdd.
-type QuantizedAddAttr func(optionalAttr)
+// ResizeAreaAttr is an optional argument to ResizeArea.
+type ResizeAreaAttr func(optionalAttr)
 
-// QuantizedAddToutput sets the optional Toutput attribute to value.
-// If not specified, defaults to DT_QINT32
-func QuantizedAddToutput(value tf.DataType) QuantizedAddAttr {
+// ResizeAreaAlignCorners sets the optional align_corners attribute to value.
+//
+// value: If true, the centers of the 4 corner pixels of the input and output tensors are
+// aligned, preserving the values at the corner pixels. Defaults to false.
+// If not specified, defaults to false
+func ResizeAreaAlignCorners(value bool) ResizeAreaAttr {
 	return func(m optionalAttr) {
-		m["Toutput"] = value
+		m["align_corners"] = value
 	}
 }
 
-// Returns x + y element-wise, working on quantized buffers.
+// Resize `images` to `size` using area interpolation.
 //
-// Arguments:
+// Input images can be of different types but output images are always float.
 //
+// The range of pixel values for the output image might be slightly different
+// from the range for the input image because of limited numerical precision.
+// To guarantee an output range, for example `[0.0, 1.0]`, apply
+// `tf.clip_by_value` to the output.
 //
-//	min_x: The float value that the lowest quantized `x` value represents.
-//	max_x: The float value that the highest quantized `x` value represents.
-//	min_y: The float value that the lowest quantized `y` value represents.
-//	max_y: The float value that the highest quantized `y` value represents.
+// Each output pixel is computed by first transforming the pixel's footprint into
+// the input tensor and then averaging the pixels that intersect the footprint. An
+// input pixel's contribution to the average is weighted by the fraction of its
+// area that intersects the footprint.  This is the same as OpenCV's INTER_AREA.
 //
-// Returns The float value that the lowest quantized output value represents.The float value that the highest quantized output value represents.
+// Arguments:
+//	images: 4-D with shape `[batch, height, width, channels]`.
+//	size: = A 1-D int32 Tensor of 2 elements: `new_height, new_width`.  The
+// new size for the images.
 //
-// *NOTE*: `QuantizedAdd` supports limited forms of broadcasting. More about
-// broadcasting [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
-func QuantizedAdd(scope *Scope, x tf.Output, y tf.Output, min_x tf.Output, max_x tf.Output, min_y tf.Output, max_y tf.Output, optional ...QuantizedAddAttr) (z tf.Output, min_z tf.Output, max_z tf.Output) {
+// Returns 4-D with shape
+// `[batch, new_height, new_width, channels]`.
+func ResizeArea(scope *Scope, images tf.Output, size tf.Output, optional ...ResizeAreaAttr) (resized_images tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -20012,75 +19765,93 @@ func QuantizedAdd(scope *Scope, x tf.Output, y tf.Output, min_x tf.Output, max_x
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "QuantizedAdd",
+		Type: "ResizeArea",
 		Input: []tf.Input{
-			x, y, min_x, max_x, min_y, max_y,
+			images, size,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
+	return op.Output(0)
 }
 
-// MfccAttr is an optional argument to Mfcc.
-type MfccAttr func(optionalAttr)
+// VarHandleOpAttr is an optional argument to VarHandleOp.
+type VarHandleOpAttr func(optionalAttr)
 
-// MfccUpperFrequencyLimit sets the optional upper_frequency_limit attribute to value.
+// VarHandleOpContainer sets the optional container attribute to value.
 //
-// value: The highest frequency to use when calculating the
-// ceptstrum.
-// If not specified, defaults to 4000
-func MfccUpperFrequencyLimit(value float32) MfccAttr {
+// value: the container this variable is placed in.
+// If not specified, defaults to ""
+func VarHandleOpContainer(value string) VarHandleOpAttr {
 	return func(m optionalAttr) {
-		m["upper_frequency_limit"] = value
+		m["container"] = value
 	}
 }
 
-// MfccLowerFrequencyLimit sets the optional lower_frequency_limit attribute to value.
+// VarHandleOpSharedName sets the optional shared_name attribute to value.
 //
-// value: The lowest frequency to use when calculating the
-// ceptstrum.
-// If not specified, defaults to 20
-func MfccLowerFrequencyLimit(value float32) MfccAttr {
+// value: the name by which this variable is referred to.
+// If not specified, defaults to ""
+func VarHandleOpSharedName(value string) VarHandleOpAttr {
 	return func(m optionalAttr) {
-		m["lower_frequency_limit"] = value
+		m["shared_name"] = value
 	}
 }
 
-// MfccFilterbankChannelCount sets the optional filterbank_channel_count attribute to value.
+// Creates a handle to a Variable resource.
 //
-// value: Resolution of the Mel bank used internally.
-// If not specified, defaults to 40
-func MfccFilterbankChannelCount(value int64) MfccAttr {
-	return func(m optionalAttr) {
-		m["filterbank_channel_count"] = value
+// Arguments:
+//	dtype: the type of this variable. Must agree with the dtypes
+// of all ops using this variable.
+//	shape: The (possibly partially specified) shape of this variable.
+func VarHandleOp(scope *Scope, dtype tf.DataType, shape tf.Shape, optional ...VarHandleOpAttr) (resource tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"dtype": dtype, "shape": shape}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "VarHandleOp",
+
+		Attrs: attrs,
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// MfccDctCoefficientCount sets the optional dct_coefficient_count attribute to value.
-//
-// value: How many output channels to produce per time slice.
-// If not specified, defaults to 13
-func MfccDctCoefficientCount(value int64) MfccAttr {
+// AngleAttr is an optional argument to Angle.
+type AngleAttr func(optionalAttr)
+
+// AngleTout sets the optional Tout attribute to value.
+// If not specified, defaults to DT_FLOAT
+func AngleTout(value tf.DataType) AngleAttr {
 	return func(m optionalAttr) {
-		m["dct_coefficient_count"] = value
+		m["Tout"] = value
 	}
 }
 
-// Transforms a spectrogram into a form that's useful for speech recognition.
+// Returns the argument of a complex number.
 //
-// Mel Frequency Cepstral Coefficients are a way of representing audio data that's
-// been effective as an input feature for machine learning. They are created by
-// taking the spectrum of a spectrogram (a 'cepstrum'), and discarding some of the
-// higher frequencies that are less significant to the human ear. They have a long
-// history in the speech recognition world, and https://en.wikipedia.org/wiki/Mel-frequency_cepstrum
-// is a good resource to learn more.
+// Given a tensor `input` of complex numbers, this operation returns a tensor of
+// type `float` that is the argument of each element in `input`. All elements in
+// `input` must be complex numbers of the form \\(a + bj\\), where *a*
+// is the real part and *b* is the imaginary part.
 //
-// Arguments:
-//	spectrogram: Typically produced by the Spectrogram op, with magnitude_squared
-// set to true.
-//	sample_rate: How many samples per second the source audio used.
-func Mfcc(scope *Scope, spectrogram tf.Output, sample_rate tf.Output, optional ...MfccAttr) (output tf.Output) {
+// The argument returned by this operation is of the form \\(atan2(b, a)\\).
+//
+// For example:
+//
+// ```
+// # tensor 'input' is [-2.25 + 4.75j, 3.25 + 5.75j]
+// tf.angle(input) ==> [2.0132, 1.056]
+// ```
+//
+// @compatibility(numpy)
+// Equivalent to np.angle.
+// @end_compatibility
+func Angle(scope *Scope, input tf.Output, optional ...AngleAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -20089,9 +19860,9 @@ func Mfcc(scope *Scope, spectrogram tf.Output, sample_rate tf.Output, optional .
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "Mfcc",
+		Type: "Angle",
 		Input: []tf.Input{
-			spectrogram, sample_rate,
+			input,
 		},
 		Attrs: attrs,
 	}
@@ -20099,315 +19870,285 @@ func Mfcc(scope *Scope, spectrogram tf.Output, sample_rate tf.Output, optional .
 	return op.Output(0)
 }
 
-// Given a quantized tensor described by (input, input_min, input_max), outputs a
+// Clips tensor values to a specified min and max.
 //
-// range that covers the actual values present in that tensor.  This op is
-// typically used to produce the requested_output_min and requested_output_max for
-// Requantize.
+// Given a tensor `t`, this operation returns a tensor of the same type and
+// shape as `t` with its values clipped to `clip_value_min` and `clip_value_max`.
+// Any values less than `clip_value_min` are set to `clip_value_min`. Any values
+// greater than `clip_value_max` are set to `clip_value_max`.
 //
 // Arguments:
+//	t: A `Tensor`.
+//	clip_value_min: A 0-D (scalar) `Tensor`, or a `Tensor` with the same shape
+// as `t`. The minimum value to clip by.
+//	clip_value_max: A 0-D (scalar) `Tensor`, or a `Tensor` with the same shape
+// as `t`. The maximum value to clip by.
 //
-//	input_min: The float value that the minimum quantized input value represents.
-//	input_max: The float value that the maximum quantized input value represents.
-//
-// Returns The computed min output.the computed max output.
-func RequantizationRange(scope *Scope, input tf.Output, input_min tf.Output, input_max tf.Output) (output_min tf.Output, output_max tf.Output) {
+// Returns A clipped `Tensor` with the same shape as input 't'.
+func ClipByValue(scope *Scope, t tf.Output, clip_value_min tf.Output, clip_value_max tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "RequantizationRange",
+		Type: "ClipByValue",
 		Input: []tf.Input{
-			input, input_min, input_max,
+			t, clip_value_min, clip_value_max,
 		},
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1)
+	return op.Output(0)
 }
 
-// Rolls the elements of a tensor along an axis.
-//
-// The elements are shifted positively (towards larger indices) by the offset of
-// `shift` along the dimension of `axis`. Negative `shift` values will shift
-// elements in the opposite direction. Elements that roll passed the last position
-// will wrap around to the first and vice versa. Multiple shifts along multiple
-// axes may be specified.
-//
-// For example:
-//
-// ```
-// # 't' is [0, 1, 2, 3, 4]
-// roll(t, shift=2, axis=0) ==> [3, 4, 0, 1, 2]
+// Counts the number of occurrences of each value in an integer array.
 //
-// # shifting along multiple dimensions
-// # 't' is [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
-// roll(t, shift=[1, -2], axis=[0, 1]) ==> [[7, 8, 9, 5, 6], [2, 3, 4, 0, 1]]
+// Outputs a vector with length `size` and the same dtype as `weights`. If
+// `weights` are empty, then index `i` stores the number of times the value `i` is
+// counted in `arr`. If `weights` are non-empty, then index `i` stores the sum of
+// the value in `weights` at each index where the corresponding value in `arr` is
+// `i`.
 //
-// # shifting along the same axis multiple times
-// # 't' is [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
-// roll(t, shift=[2, -3], axis=[1, 1]) ==> [[1, 2, 3, 4, 0], [6, 7, 8, 9, 5]]
-// ```
+// Values in `arr` outside of the range [0, size) are ignored.
 //
 // Arguments:
+//	arr: int32 `Tensor`.
+//	size: non-negative int32 scalar `Tensor`.
+//	weights: is an int32, int64, float32, or float64 `Tensor` with the same
+// shape as `arr`, or a length-0 `Tensor`, in which case it acts as all weights
+// equal to 1.
 //
-//	shift: Dimension must be 0-D or 1-D. `shift[i]` specifies the number of places by which
-// elements are shifted positively (towards larger indices) along the dimension
-// specified by `axis[i]`. Negative shifts will roll the elements in the opposite
-// direction.
-//	axis: Dimension must be 0-D or 1-D. `axis[i]` specifies the dimension that the shift
-// `shift[i]` should occur. If the same axis is referenced more than once, the
-// total shift for that axis will be the sum of all the shifts that belong to that
-// axis.
-//
-// Returns Has the same shape and size as the input. The elements are shifted
-// positively (towards larger indices) by the offsets of `shift` along the
-// dimensions of `axis`.
-func Roll(scope *Scope, input tf.Output, shift tf.Output, axis tf.Output) (output tf.Output) {
+// Returns 1D `Tensor` with length equal to `size`. The counts or summed weights for
+// each value in the range [0, size).
+func Bincount(scope *Scope, arr tf.Output, size tf.Output, weights tf.Output) (bins tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "Roll",
+		Type: "Bincount",
 		Input: []tf.Input{
-			input, shift, axis,
+			arr, size, weights,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// MapPeekAttr is an optional argument to MapPeek.
-type MapPeekAttr func(optionalAttr)
+// CumsumAttr is an optional argument to Cumsum.
+type CumsumAttr func(optionalAttr)
 
-// MapPeekCapacity sets the optional capacity attribute to value.
-// If not specified, defaults to 0
+// CumsumExclusive sets the optional exclusive attribute to value.
 //
-// REQUIRES: value >= 0
-func MapPeekCapacity(value int64) MapPeekAttr {
+// value: If `True`, perform exclusive cumsum.
+// If not specified, defaults to false
+func CumsumExclusive(value bool) CumsumAttr {
 	return func(m optionalAttr) {
-		m["capacity"] = value
+		m["exclusive"] = value
 	}
 }
 
-// MapPeekMemoryLimit sets the optional memory_limit attribute to value.
-// If not specified, defaults to 0
+// CumsumReverse sets the optional reverse attribute to value.
 //
-// REQUIRES: value >= 0
-func MapPeekMemoryLimit(value int64) MapPeekAttr {
-	return func(m optionalAttr) {
-		m["memory_limit"] = value
-	}
-}
-
-// MapPeekContainer sets the optional container attribute to value.
-// If not specified, defaults to ""
-func MapPeekContainer(value string) MapPeekAttr {
-	return func(m optionalAttr) {
-		m["container"] = value
-	}
-}
-
-// MapPeekSharedName sets the optional shared_name attribute to value.
-// If not specified, defaults to ""
-func MapPeekSharedName(value string) MapPeekAttr {
+// value: A `bool` (default: False).
+// If not specified, defaults to false
+func CumsumReverse(value bool) CumsumAttr {
 	return func(m optionalAttr) {
-		m["shared_name"] = value
+		m["reverse"] = value
 	}
 }
 
-// Op peeks at the values at the specified key.  If the
+// Compute the cumulative sum of the tensor `x` along `axis`.
 //
-// underlying container does not contain this key
-// this op will block until it does.
-func MapPeek(scope *Scope, key tf.Output, indices tf.Output, dtypes []tf.DataType, optional ...MapPeekAttr) (values []tf.Output) {
+// By default, this op performs an inclusive cumsum, which means that the first
+// element of the input is identical to the first element of the output:
+//
+// ```python
+// tf.cumsum([a, b, c])  # => [a, a + b, a + b + c]
+// ```
+//
+// By setting the `exclusive` kwarg to `True`, an exclusive cumsum is
+// performed instead:
+//
+// ```python
+// tf.cumsum([a, b, c], exclusive=True)  # => [0, a, a + b]
+// ```
+//
+// By setting the `reverse` kwarg to `True`, the cumsum is performed in the
+// opposite direction:
+//
+// ```python
+// tf.cumsum([a, b, c], reverse=True)  # => [a + b + c, b + c, c]
+// ```
+//
+// This is more efficient than using separate `tf.reverse` ops.
+//
+// The `reverse` and `exclusive` kwargs can also be combined:
+//
+// ```python
+// tf.cumsum([a, b, c], exclusive=True, reverse=True)  # => [b + c, c, 0]
+// ```
+//
+// Arguments:
+//	x: A `Tensor`. Must be one of the following types: `float32`, `float64`,
+// `int64`, `int32`, `uint8`, `uint16`, `int16`, `int8`, `complex64`,
+// `complex128`, `qint8`, `quint8`, `qint32`, `half`.
+//	axis: A `Tensor` of type `int32` (default: 0). Must be in the range
+// `[-rank(x), rank(x))`.
+func Cumsum(scope *Scope, x tf.Output, axis tf.Output, optional ...CumsumAttr) (out tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtypes": dtypes}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "MapPeek",
+		Type: "Cumsum",
 		Input: []tf.Input{
-			key, indices,
+			x, axis,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	if scope.Err() != nil {
-		return
-	}
-	var idx int
-	var err error
-	if values, idx, err = makeOutputList(op, idx, "values"); err != nil {
-		scope.UpdateErr("MapPeek", err)
-		return
-	}
-	return values
+	return op.Output(0)
 }
 
-// Looks up keys in a table, outputs the corresponding values.
-//
-// The tensor `keys` must of the same type as the keys of the table.
-// The output `values` is of the type of the table values.
-//
-// The scalar `default_value` is the value output for keys not present in the
-// table. It must also be of the same type as the table values.
+// CumprodAttr is an optional argument to Cumprod.
+type CumprodAttr func(optionalAttr)
+
+// CumprodExclusive sets the optional exclusive attribute to value.
 //
-// Arguments:
-//	table_handle: Handle to the table.
-//	keys: Any shape.  Keys to look up.
-//
-//
-// Returns Same shape as `keys`.  Values found in the table, or `default_values`
-// for missing keys.
-func LookupTableFindV2(scope *Scope, table_handle tf.Output, keys tf.Output, default_value tf.Output) (values tf.Output) {
-	if scope.Err() != nil {
-		return
+// value: If `True`, perform exclusive cumprod.
+// If not specified, defaults to false
+func CumprodExclusive(value bool) CumprodAttr {
+	return func(m optionalAttr) {
+		m["exclusive"] = value
 	}
-	opspec := tf.OpSpec{
-		Type: "LookupTableFindV2",
-		Input: []tf.Input{
-			table_handle, keys, default_value,
-		},
+}
+
+// CumprodReverse sets the optional reverse attribute to value.
+//
+// value: A `bool` (default: False).
+// If not specified, defaults to false
+func CumprodReverse(value bool) CumprodAttr {
+	return func(m optionalAttr) {
+		m["reverse"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// Bucketizes 'input' based on 'boundaries'.
+// Compute the cumulative product of the tensor `x` along `axis`.
 //
-// For example, if the inputs are
-//     boundaries = [0, 10, 100]
-//     input = [[-5, 10000]
-//              [150,   10]
-//              [5,    100]]
+// By default, this op performs an inclusive cumprod, which means that the first
+// element of the input is identical to the first element of the output:
 //
-// then the output will be
-//     output = [[0, 3]
-//               [3, 2]
-//               [1, 3]]
+// ```python
+// tf.cumprod([a, b, c])  # => [a, a * b, a * b * c]
+// ```
 //
-// Arguments:
-//	input: Any shape of Tensor contains with int or float type.
-//	boundaries: A sorted list of floats gives the boundary of the buckets.
+// By setting the `exclusive` kwarg to `True`, an exclusive cumprod is
+// performed instead:
 //
-// Returns Same shape with 'input', each value of input replaced with bucket index.
+// ```python
+// tf.cumprod([a, b, c], exclusive=True)  # => [1, a, a * b]
+// ```
 //
-// @compatibility(numpy)
-// Equivalent to np.digitize.
-// @end_compatibility
-func Bucketize(scope *Scope, input tf.Output, boundaries []float32) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"boundaries": boundaries}
-	opspec := tf.OpSpec{
-		Type: "Bucketize",
-		Input: []tf.Input{
-			input,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Calculates gains for each feature and returns the best possible split information for the feature.
+// By setting the `reverse` kwarg to `True`, the cumprod is performed in the
+// opposite direction:
 //
-// The split information is the best threshold (bucket id), gains and left/right node contributions per node for each feature.
+// ```python
+// tf.cumprod([a, b, c], reverse=True)  # => [a * b * c, b * c, c]
+// ```
 //
-// It is possible that not all nodes can be split on each feature. Hence, the list of possible nodes can differ between the features. Therefore, we return `node_ids_list` for each feature, containing the list of nodes that this feature can be used to split.
+// This is more efficient than using separate `tf.reverse` ops.
 //
-// In this manner, the output is the best split per features and per node, so that it needs to be combined later to produce the best split for each node (among all possible features).
+// The `reverse` and `exclusive` kwargs can also be combined:
 //
-// The length of output lists are all of the same length, `num_features`.
-// The output shapes are compatible in a way that the first dimension of all tensors of all lists are the same and equal to the number of possible split nodes for each feature.
+// ```python
+// tf.cumprod([a, b, c], exclusive=True, reverse=True)  # => [b * c, c, 1]
+// ```
 //
 // Arguments:
-//	node_id_range: A Rank 1 tensor (shape=[2]) to specify the range [first, last) of node ids to process within `stats_summary_list`. The nodes are iterated between the two nodes specified by the tensor, as like `for node_id in range(node_id_range[0], node_id_range[1])` (Note that the last index node_id_range[1] is exclusive).
-//	stats_summary_list: A list of Rank 3 tensor (#shape=[max_splits, bucket, 2]) for accumulated stats summary (gradient/hessian) per node per buckets for each feature. The first dimension of the tensor is the maximum number of splits, and thus not all elements of it will be used, but only the indexes specified by node_ids will be used.
-//	l1: l1 regularization factor on leaf weights, per instance based.
-//	l2: l2 regularization factor on leaf weights, per instance based.
-//	tree_complexity: adjustment to the gain, per leaf based.
-//	min_node_weight: mininum avg of hessians in a node before required for the node to be considered for splitting.
-//	max_splits: the number of nodes that can be split in the whole tree. Used as a dimension of output tensors.
-//
-// Returns An output list of Rank 1 tensors indicating possible split node ids for each feature. The length of the list is num_features, but each tensor has different size as each feature provides different possible nodes. See above for details like shapes and sizes.An output list of Rank 1 tensors indicating the best gains for each feature to split for certain nodes. See above for details like shapes and sizes.An output list of Rank 1 tensors indicating the bucket id to compare with (as a threshold) for split in each node. See above for details like shapes and sizes.A list of Rank 2 tensors indicating the contribution of the left nodes when branching from parent nodes (given by the tensor element in the output node_ids_list) to the left direction by the given threshold for each feature. This value will be used to make the left node value by adding to the parent node value. Second dimension size is 1 for 1-dimensional logits, but would be larger for multi-class problems. See above for details like shapes and sizes.A list of Rank 2 tensors, with the same shape/conditions as left_node_contribs_list, but just that the value is for the right node.
-func BoostedTreesCalculateBestGainsPerFeature(scope *Scope, node_id_range tf.Output, stats_summary_list []tf.Output, l1 tf.Output, l2 tf.Output, tree_complexity tf.Output, min_node_weight tf.Output, max_splits int64) (node_ids_list []tf.Output, gains_list []tf.Output, thresholds_list []tf.Output, left_node_contribs_list []tf.Output, right_node_contribs_list []tf.Output) {
+//	x: A `Tensor`. Must be one of the following types: `float32`, `float64`,
+// `int64`, `int32`, `uint8`, `uint16`, `int16`, `int8`, `complex64`,
+// `complex128`, `qint8`, `quint8`, `qint32`, `half`.
+//	axis: A `Tensor` of type `int32` (default: 0). Must be in the range
+// `[-rank(x), rank(x))`.
+func Cumprod(scope *Scope, x tf.Output, axis tf.Output, optional ...CumprodAttr) (out tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"max_splits": max_splits}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "BoostedTreesCalculateBestGainsPerFeature",
+		Type: "Cumprod",
 		Input: []tf.Input{
-			node_id_range, tf.OutputList(stats_summary_list), l1, l2, tree_complexity, min_node_weight,
+			x, axis,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	if scope.Err() != nil {
-		return
-	}
-	var idx int
-	var err error
-	if node_ids_list, idx, err = makeOutputList(op, idx, "node_ids_list"); err != nil {
-		scope.UpdateErr("BoostedTreesCalculateBestGainsPerFeature", err)
-		return
-	}
-	if gains_list, idx, err = makeOutputList(op, idx, "gains_list"); err != nil {
-		scope.UpdateErr("BoostedTreesCalculateBestGainsPerFeature", err)
-		return
-	}
-	if thresholds_list, idx, err = makeOutputList(op, idx, "thresholds_list"); err != nil {
-		scope.UpdateErr("BoostedTreesCalculateBestGainsPerFeature", err)
-		return
-	}
-	if left_node_contribs_list, idx, err = makeOutputList(op, idx, "left_node_contribs_list"); err != nil {
-		scope.UpdateErr("BoostedTreesCalculateBestGainsPerFeature", err)
-		return
-	}
-	if right_node_contribs_list, idx, err = makeOutputList(op, idx, "right_node_contribs_list"); err != nil {
-		scope.UpdateErr("BoostedTreesCalculateBestGainsPerFeature", err)
-		return
-	}
-	return node_ids_list, gains_list, thresholds_list, left_node_contribs_list, right_node_contribs_list
+	return op.Output(0)
 }
 
-// EncodePngAttr is an optional argument to EncodePng.
-type EncodePngAttr func(optionalAttr)
+// QuantizedMatMulAttr is an optional argument to QuantizedMatMul.
+type QuantizedMatMulAttr func(optionalAttr)
 
-// EncodePngCompression sets the optional compression attribute to value.
+// QuantizedMatMulToutput sets the optional Toutput attribute to value.
+// If not specified, defaults to DT_QINT32
+func QuantizedMatMulToutput(value tf.DataType) QuantizedMatMulAttr {
+	return func(m optionalAttr) {
+		m["Toutput"] = value
+	}
+}
+
+// QuantizedMatMulTransposeA sets the optional transpose_a attribute to value.
 //
-// value: Compression level.
-// If not specified, defaults to -1
-func EncodePngCompression(value int64) EncodePngAttr {
+// value: If true, `a` is transposed before multiplication.
+// If not specified, defaults to false
+func QuantizedMatMulTransposeA(value bool) QuantizedMatMulAttr {
 	return func(m optionalAttr) {
-		m["compression"] = value
+		m["transpose_a"] = value
 	}
 }
 
-// PNG-encode an image.
+// QuantizedMatMulTransposeB sets the optional transpose_b attribute to value.
 //
-// `image` is a 3-D uint8 or uint16 Tensor of shape `[height, width, channels]`
-// where `channels` is:
+// value: If true, `b` is transposed before multiplication.
+// If not specified, defaults to false
+func QuantizedMatMulTransposeB(value bool) QuantizedMatMulAttr {
+	return func(m optionalAttr) {
+		m["transpose_b"] = value
+	}
+}
+
+// QuantizedMatMulTactivation sets the optional Tactivation attribute to value.
 //
-// *   1: for grayscale.
-// *   2: for grayscale + alpha.
-// *   3: for RGB.
-// *   4: for RGBA.
+// value: The type of output produced by activation function
+// following this operation.
+// If not specified, defaults to DT_QUINT8
+func QuantizedMatMulTactivation(value tf.DataType) QuantizedMatMulAttr {
+	return func(m optionalAttr) {
+		m["Tactivation"] = value
+	}
+}
+
+// Perform a quantized matrix multiplication of  `a` by the matrix `b`.
 //
-// The ZLIB compression level, `compression`, can be -1 for the PNG-encoder
-// default or a value from 0 to 9.  9 is the highest compression level, generating
-// the smallest output, but is slower.
+// The inputs must be two-dimensional matrices and the inner dimension of
+// `a` (after being transposed if `transpose_a` is non-zero) must match the
+// outer dimension of `b` (after being transposed if `transposed_b` is
+// non-zero).
 //
 // Arguments:
-//	image: 3-D with shape `[height, width, channels]`.
+//	a: Must be a two-dimensional tensor.
+//	b: Must be a two-dimensional tensor.
+//	min_a: The float value that the lowest quantized `a` value represents.
+//	max_a: The float value that the highest quantized `a` value represents.
+//	min_b: The float value that the lowest quantized `b` value represents.
+//	max_b: The float value that the highest quantized `b` value represents.
 //
-// Returns 0-D. PNG-encoded image.
-func EncodePng(scope *Scope, image tf.Output, optional ...EncodePngAttr) (contents tf.Output) {
+// Returns The float value that the lowest quantized output value represents.The float value that the highest quantized output value represents.
+func QuantizedMatMul(scope *Scope, a tf.Output, b tf.Output, min_a tf.Output, max_a tf.Output, min_b tf.Output, max_b tf.Output, optional ...QuantizedMatMulAttr) (out tf.Output, min_out tf.Output, max_out tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -20416,111 +20157,81 @@ func EncodePng(scope *Scope, image tf.Output, optional ...EncodePngAttr) (conten
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "EncodePng",
+		Type: "QuantizedMatMul",
 		Input: []tf.Input{
-			image,
+			a, b, min_a, max_a, min_b, max_b,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// Updates the table to associates keys with values.
-//
-// The tensor `keys` must be of the same type as the keys of the table.
-// The tensor `values` must be of the type of the table values.
+// Does nothing. Serves as a control trigger for scheduling.
 //
-// Arguments:
-//	table_handle: Handle to the table.
-//	keys: Any shape.  Keys to look up.
-//	values: Values to associate with keys.
+// Only useful as a placeholder for control edges.
 //
 // Returns the created operation.
-func LookupTableInsertV2(scope *Scope, table_handle tf.Output, keys tf.Output, values tf.Output) (o *tf.Operation) {
+func ControlTrigger(scope *Scope) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "LookupTableInsertV2",
-		Input: []tf.Input{
-			table_handle, keys, values,
-		},
+		Type: "ControlTrigger",
 	}
 	return scope.AddOperation(opspec)
 }
 
-// Returns element-wise smallest integer in not less than x.
-func Ceil(scope *Scope, x tf.Output) (y tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "Ceil",
-		Input: []tf.Input{
-			x,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Computes the number of elements in the given table.
+// Batch normalization.
 //
-// Arguments:
-//	table_handle: Handle to the table.
+// DEPRECATED at GraphDef version 9: Use tf.nn.batch_normalization()
 //
-// Returns Scalar that contains number of elements in the table.
-func LookupTableSizeV2(scope *Scope, table_handle tf.Output) (size tf.Output) {
-	if scope.Err() != nil {
+// This op is deprecated. Prefer `tf.nn.batch_normalization`.
+//
+// Arguments:
+//	t: A 4D input Tensor.
+//	m: A 1D mean Tensor with size matching the last dimension of t.
+// This is the first output from tf.nn.moments,
+// or a saved moving average thereof.
+//	v: A 1D variance Tensor with size matching the last dimension of t.
+// This is the second output from tf.nn.moments,
+// or a saved moving average thereof.
+//	beta: A 1D beta Tensor with size matching the last dimension of t.
+// An offset to be added to the normalized tensor.
+//	gamma: A 1D gamma Tensor with size matching the last dimension of t.
+// If "scale_after_normalization" is true, this tensor will be multiplied
+// with the normalized tensor.
+//	variance_epsilon: A small float number to avoid dividing by 0.
+//	scale_after_normalization: A bool indicating whether the resulted tensor
+// needs to be multiplied with gamma.
+func BatchNormWithGlobalNormalization(scope *Scope, t tf.Output, m tf.Output, v tf.Output, beta tf.Output, gamma tf.Output, variance_epsilon float32, scale_after_normalization bool) (result tf.Output) {
+	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"variance_epsilon": variance_epsilon, "scale_after_normalization": scale_after_normalization}
 	opspec := tf.OpSpec{
-		Type: "LookupTableSizeV2",
+		Type: "BatchNormWithGlobalNormalization",
 		Input: []tf.Input{
-			table_handle,
+			t, m, v, beta, gamma,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// ResizeBilinearGradAttr is an optional argument to ResizeBilinearGrad.
-type ResizeBilinearGradAttr func(optionalAttr)
-
-// ResizeBilinearGradAlignCorners sets the optional align_corners attribute to value.
-//
-// value: If true, the centers of the 4 corner pixels of the input and grad tensors are
-// aligned. Defaults to false.
-// If not specified, defaults to false
-func ResizeBilinearGradAlignCorners(value bool) ResizeBilinearGradAttr {
-	return func(m optionalAttr) {
-		m["align_corners"] = value
-	}
-}
-
-// Computes the gradient of bilinear interpolation.
-//
-// Arguments:
-//	grads: 4-D with shape `[batch, height, width, channels]`.
-//	original_image: 4-D with shape `[batch, orig_height, orig_width, channels]`,
-// The image tensor that was resized.
+// Deprecated. Use TensorArrayReadV3
 //
-// Returns 4-D with shape `[batch, orig_height, orig_width, channels]`.
-// Gradients with respect to the input image. Input image must have been
-// float or double.
-func ResizeBilinearGrad(scope *Scope, grads tf.Output, original_image tf.Output, optional ...ResizeBilinearGradAttr) (output tf.Output) {
+// DEPRECATED at GraphDef version 26: Use TensorArrayReadV3
+func TensorArrayReadV2(scope *Scope, handle tf.Output, index tf.Output, flow_in tf.Output, dtype tf.DataType) (value tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
+	attrs := map[string]interface{}{"dtype": dtype}
 	opspec := tf.OpSpec{
-		Type: "ResizeBilinearGrad",
+		Type: "TensorArrayReadV2",
 		Input: []tf.Input{
-			grads, original_image,
+			handle, index, flow_in,
 		},
 		Attrs: attrs,
 	}
@@ -20528,491 +20239,480 @@ func ResizeBilinearGrad(scope *Scope, grads tf.Output, original_image tf.Output,
 	return op.Output(0)
 }
 
-// Outputs all keys and values in the table.
+// QuantizedMulAttr is an optional argument to QuantizedMul.
+type QuantizedMulAttr func(optionalAttr)
+
+// QuantizedMulToutput sets the optional Toutput attribute to value.
+// If not specified, defaults to DT_QINT32
+func QuantizedMulToutput(value tf.DataType) QuantizedMulAttr {
+	return func(m optionalAttr) {
+		m["Toutput"] = value
+	}
+}
+
+// Returns x * y element-wise, working on quantized buffers.
 //
 // Arguments:
-//	table_handle: Handle to the table.
 //
 //
+//	min_x: The float value that the lowest quantized `x` value represents.
+//	max_x: The float value that the highest quantized `x` value represents.
+//	min_y: The float value that the lowest quantized `y` value represents.
+//	max_y: The float value that the highest quantized `y` value represents.
+//
+// Returns The float value that the lowest quantized output value represents.The float value that the highest quantized output value represents.
 //
-// Returns Vector of all keys present in the table.Tensor of all values in the table. Indexed in parallel with `keys`.
-func LookupTableExportV2(scope *Scope, table_handle tf.Output, Tkeys tf.DataType, Tvalues tf.DataType) (keys tf.Output, values tf.Output) {
+// *NOTE*: `QuantizedMul` supports limited forms of broadcasting. More about
+// broadcasting [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
+func QuantizedMul(scope *Scope, x tf.Output, y tf.Output, min_x tf.Output, max_x tf.Output, min_y tf.Output, max_y tf.Output, optional ...QuantizedMulAttr) (z tf.Output, min_z tf.Output, max_z tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"Tkeys": Tkeys, "Tvalues": Tvalues}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "LookupTableExportV2",
+		Type: "QuantizedMul",
 		Input: []tf.Input{
-			table_handle,
+			x, y, min_x, max_x, min_y, max_y,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1)
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// Replaces the contents of the table with the specified keys and values.
-//
-// The tensor `keys` must be of the same type as the keys of the table.
-// The tensor `values` must be of the type of the table values.
+// QuantizedAddAttr is an optional argument to QuantizedAdd.
+type QuantizedAddAttr func(optionalAttr)
+
+// QuantizedAddToutput sets the optional Toutput attribute to value.
+// If not specified, defaults to DT_QINT32
+func QuantizedAddToutput(value tf.DataType) QuantizedAddAttr {
+	return func(m optionalAttr) {
+		m["Toutput"] = value
+	}
+}
+
+// Returns x + y element-wise, working on quantized buffers.
 //
 // Arguments:
-//	table_handle: Handle to the table.
-//	keys: Any shape.  Keys to look up.
-//	values: Values to associate with keys.
 //
-// Returns the created operation.
-func LookupTableImportV2(scope *Scope, table_handle tf.Output, keys tf.Output, values tf.Output) (o *tf.Operation) {
+//
+//	min_x: The float value that the lowest quantized `x` value represents.
+//	max_x: The float value that the highest quantized `x` value represents.
+//	min_y: The float value that the lowest quantized `y` value represents.
+//	max_y: The float value that the highest quantized `y` value represents.
+//
+// Returns The float value that the lowest quantized output value represents.The float value that the highest quantized output value represents.
+//
+// *NOTE*: `QuantizedAdd` supports limited forms of broadcasting. More about
+// broadcasting [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
+func QuantizedAdd(scope *Scope, x tf.Output, y tf.Output, min_x tf.Output, max_x tf.Output, min_y tf.Output, max_y tf.Output, optional ...QuantizedAddAttr) (z tf.Output, min_z tf.Output, max_z tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "LookupTableImportV2",
+		Type: "QuantizedAdd",
 		Input: []tf.Input{
-			table_handle, keys, values,
+			x, y, min_x, max_x, min_y, max_y,
 		},
+		Attrs: attrs,
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// MapUnstageNoKeyAttr is an optional argument to MapUnstageNoKey.
-type MapUnstageNoKeyAttr func(optionalAttr)
+// MfccAttr is an optional argument to Mfcc.
+type MfccAttr func(optionalAttr)
 
-// MapUnstageNoKeyCapacity sets the optional capacity attribute to value.
-// If not specified, defaults to 0
+// MfccUpperFrequencyLimit sets the optional upper_frequency_limit attribute to value.
 //
-// REQUIRES: value >= 0
-func MapUnstageNoKeyCapacity(value int64) MapUnstageNoKeyAttr {
+// value: The highest frequency to use when calculating the
+// ceptstrum.
+// If not specified, defaults to 4000
+func MfccUpperFrequencyLimit(value float32) MfccAttr {
 	return func(m optionalAttr) {
-		m["capacity"] = value
+		m["upper_frequency_limit"] = value
 	}
 }
 
-// MapUnstageNoKeyMemoryLimit sets the optional memory_limit attribute to value.
-// If not specified, defaults to 0
+// MfccLowerFrequencyLimit sets the optional lower_frequency_limit attribute to value.
 //
-// REQUIRES: value >= 0
-func MapUnstageNoKeyMemoryLimit(value int64) MapUnstageNoKeyAttr {
+// value: The lowest frequency to use when calculating the
+// ceptstrum.
+// If not specified, defaults to 20
+func MfccLowerFrequencyLimit(value float32) MfccAttr {
 	return func(m optionalAttr) {
-		m["memory_limit"] = value
+		m["lower_frequency_limit"] = value
 	}
 }
 
-// MapUnstageNoKeyContainer sets the optional container attribute to value.
-// If not specified, defaults to ""
-func MapUnstageNoKeyContainer(value string) MapUnstageNoKeyAttr {
+// MfccFilterbankChannelCount sets the optional filterbank_channel_count attribute to value.
+//
+// value: Resolution of the Mel bank used internally.
+// If not specified, defaults to 40
+func MfccFilterbankChannelCount(value int64) MfccAttr {
 	return func(m optionalAttr) {
-		m["container"] = value
+		m["filterbank_channel_count"] = value
 	}
 }
 
-// MapUnstageNoKeySharedName sets the optional shared_name attribute to value.
-// If not specified, defaults to ""
-func MapUnstageNoKeySharedName(value string) MapUnstageNoKeyAttr {
+// MfccDctCoefficientCount sets the optional dct_coefficient_count attribute to value.
+//
+// value: How many output channels to produce per time slice.
+// If not specified, defaults to 13
+func MfccDctCoefficientCount(value int64) MfccAttr {
 	return func(m optionalAttr) {
-		m["shared_name"] = value
+		m["dct_coefficient_count"] = value
 	}
 }
 
-// Op removes and returns a random (key, value)
+// Transforms a spectrogram into a form that's useful for speech recognition.
 //
-// from the underlying container.   If the underlying container
-// does not contain elements, the op will block until it does.
-func MapUnstageNoKey(scope *Scope, indices tf.Output, dtypes []tf.DataType, optional ...MapUnstageNoKeyAttr) (key tf.Output, values []tf.Output) {
+// Mel Frequency Cepstral Coefficients are a way of representing audio data that's
+// been effective as an input feature for machine learning. They are created by
+// taking the spectrum of a spectrogram (a 'cepstrum'), and discarding some of the
+// higher frequencies that are less significant to the human ear. They have a long
+// history in the speech recognition world, and https://en.wikipedia.org/wiki/Mel-frequency_cepstrum
+// is a good resource to learn more.
+//
+// Arguments:
+//	spectrogram: Typically produced by the Spectrogram op, with magnitude_squared
+// set to true.
+//	sample_rate: How many samples per second the source audio used.
+func Mfcc(scope *Scope, spectrogram tf.Output, sample_rate tf.Output, optional ...MfccAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtypes": dtypes}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "MapUnstageNoKey",
+		Type: "Mfcc",
 		Input: []tf.Input{
-			indices,
+			spectrogram, sample_rate,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Given a quantized tensor described by (input, input_min, input_max), outputs a
+//
+// range that covers the actual values present in that tensor.  This op is
+// typically used to produce the requested_output_min and requested_output_max for
+// Requantize.
+//
+// Arguments:
+//
+//	input_min: The float value that the minimum quantized input value represents.
+//	input_max: The float value that the maximum quantized input value represents.
+//
+// Returns The computed min output.the computed max output.
+func RequantizationRange(scope *Scope, input tf.Output, input_min tf.Output, input_max tf.Output) (output_min tf.Output, output_max tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	var idx int
-	var err error
-	key = op.Output(idx)
-	if values, idx, err = makeOutputList(op, idx, "values"); err != nil {
-		scope.UpdateErr("MapUnstageNoKey", err)
-		return
+	opspec := tf.OpSpec{
+		Type: "RequantizationRange",
+		Input: []tf.Input{
+			input, input_min, input_max,
+		},
 	}
-	return key, values
+	op := scope.AddOperation(opspec)
+	return op.Output(0), op.Output(1)
 }
 
-// HashTableV2Attr is an optional argument to HashTableV2.
-type HashTableV2Attr func(optionalAttr)
-
-// HashTableV2Container sets the optional container attribute to value.
+// Rolls the elements of a tensor along an axis.
 //
-// value: If non-empty, this table is placed in the given container.
-// Otherwise, a default container is used.
-// If not specified, defaults to ""
-func HashTableV2Container(value string) HashTableV2Attr {
-	return func(m optionalAttr) {
-		m["container"] = value
-	}
-}
-
-// HashTableV2SharedName sets the optional shared_name attribute to value.
+// The elements are shifted positively (towards larger indices) by the offset of
+// `shift` along the dimension of `axis`. Negative `shift` values will shift
+// elements in the opposite direction. Elements that roll passed the last position
+// will wrap around to the first and vice versa. Multiple shifts along multiple
+// axes may be specified.
 //
-// value: If non-empty, this table is shared under the given name across
-// multiple sessions.
-// If not specified, defaults to ""
-func HashTableV2SharedName(value string) HashTableV2Attr {
-	return func(m optionalAttr) {
-		m["shared_name"] = value
-	}
-}
-
-// HashTableV2UseNodeNameSharing sets the optional use_node_name_sharing attribute to value.
+// For example:
 //
-// value: If true and shared_name is empty, the table is shared
-// using the node name.
-// If not specified, defaults to false
-func HashTableV2UseNodeNameSharing(value bool) HashTableV2Attr {
-	return func(m optionalAttr) {
-		m["use_node_name_sharing"] = value
-	}
-}
-
-// Creates a non-initialized hash table.
+// ```
+// # 't' is [0, 1, 2, 3, 4]
+// roll(t, shift=2, axis=0) ==> [3, 4, 0, 1, 2]
 //
-// This op creates a hash table, specifying the type of its keys and values.
-// Before using the table you will have to initialize it.  After initialization the
-// table will be immutable.
+// # shifting along multiple dimensions
+// # 't' is [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
+// roll(t, shift=[1, -2], axis=[0, 1]) ==> [[7, 8, 9, 5, 6], [2, 3, 4, 0, 1]]
+//
+// # shifting along the same axis multiple times
+// # 't' is [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
+// roll(t, shift=[2, -3], axis=[1, 1]) ==> [[1, 2, 3, 4, 0], [6, 7, 8, 9, 5]]
+// ```
 //
 // Arguments:
-//	key_dtype: Type of the table keys.
-//	value_dtype: Type of the table values.
 //
-// Returns Handle to a table.
-func HashTableV2(scope *Scope, key_dtype tf.DataType, value_dtype tf.DataType, optional ...HashTableV2Attr) (table_handle tf.Output) {
+//	shift: Dimension must be 0-D or 1-D. `shift[i]` specifies the number of places by which
+// elements are shifted positively (towards larger indices) along the dimension
+// specified by `axis[i]`. Negative shifts will roll the elements in the opposite
+// direction.
+//	axis: Dimension must be 0-D or 1-D. `axis[i]` specifies the dimension that the shift
+// `shift[i]` should occur. If the same axis is referenced more than once, the
+// total shift for that axis will be the sum of all the shifts that belong to that
+// axis.
+//
+// Returns Has the same shape and size as the input. The elements are shifted
+// positively (towards larger indices) by the offsets of `shift` along the
+// dimensions of `axis`.
+func Roll(scope *Scope, input tf.Output, shift tf.Output, axis tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"key_dtype": key_dtype, "value_dtype": value_dtype}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "HashTableV2",
-
-		Attrs: attrs,
+		Type: "Roll",
+		Input: []tf.Input{
+			input, shift, axis,
+		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// MutableHashTableV2Attr is an optional argument to MutableHashTableV2.
-type MutableHashTableV2Attr func(optionalAttr)
+// MapPeekAttr is an optional argument to MapPeek.
+type MapPeekAttr func(optionalAttr)
 
-// MutableHashTableV2Container sets the optional container attribute to value.
+// MapPeekCapacity sets the optional capacity attribute to value.
+// If not specified, defaults to 0
 //
-// value: If non-empty, this table is placed in the given container.
-// Otherwise, a default container is used.
-// If not specified, defaults to ""
-func MutableHashTableV2Container(value string) MutableHashTableV2Attr {
+// REQUIRES: value >= 0
+func MapPeekCapacity(value int64) MapPeekAttr {
 	return func(m optionalAttr) {
-		m["container"] = value
+		m["capacity"] = value
 	}
 }
 
-// MutableHashTableV2SharedName sets the optional shared_name attribute to value.
+// MapPeekMemoryLimit sets the optional memory_limit attribute to value.
+// If not specified, defaults to 0
 //
-// value: If non-empty, this table is shared under the given name across
-// multiple sessions.
+// REQUIRES: value >= 0
+func MapPeekMemoryLimit(value int64) MapPeekAttr {
+	return func(m optionalAttr) {
+		m["memory_limit"] = value
+	}
+}
+
+// MapPeekContainer sets the optional container attribute to value.
 // If not specified, defaults to ""
-func MutableHashTableV2SharedName(value string) MutableHashTableV2Attr {
+func MapPeekContainer(value string) MapPeekAttr {
 	return func(m optionalAttr) {
-		m["shared_name"] = value
+		m["container"] = value
 	}
 }
 
-// MutableHashTableV2UseNodeNameSharing sets the optional use_node_name_sharing attribute to value.
-//
-// value: If true and shared_name is empty, the table is shared
-// using the node name.
-// If not specified, defaults to false
-func MutableHashTableV2UseNodeNameSharing(value bool) MutableHashTableV2Attr {
+// MapPeekSharedName sets the optional shared_name attribute to value.
+// If not specified, defaults to ""
+func MapPeekSharedName(value string) MapPeekAttr {
 	return func(m optionalAttr) {
-		m["use_node_name_sharing"] = value
+		m["shared_name"] = value
 	}
 }
 
-// Creates an empty hash table.
-//
-// This op creates a mutable hash table, specifying the type of its keys and
-// values. Each value must be a scalar. Data can be inserted into the table using
-// the insert operations. It does not support the initialization operation.
-//
-// Arguments:
-//	key_dtype: Type of the table keys.
-//	value_dtype: Type of the table values.
+// Op peeks at the values at the specified key.  If the
 //
-// Returns Handle to a table.
-func MutableHashTableV2(scope *Scope, key_dtype tf.DataType, value_dtype tf.DataType, optional ...MutableHashTableV2Attr) (table_handle tf.Output) {
+// underlying container does not contain this key
+// this op will block until it does.
+func MapPeek(scope *Scope, key tf.Output, indices tf.Output, dtypes []tf.DataType, optional ...MapPeekAttr) (values []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"key_dtype": key_dtype, "value_dtype": value_dtype}
+	attrs := map[string]interface{}{"dtypes": dtypes}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "MutableHashTableV2",
-
+		Type: "MapPeek",
+		Input: []tf.Input{
+			key, indices,
+		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// DequantizeAttr is an optional argument to Dequantize.
-type DequantizeAttr func(optionalAttr)
-
-// DequantizeMode sets the optional mode attribute to value.
-// If not specified, defaults to "MIN_COMBINED"
-func DequantizeMode(value string) DequantizeAttr {
-	return func(m optionalAttr) {
-		m["mode"] = value
+	if scope.Err() != nil {
+		return
+	}
+	var idx int
+	var err error
+	if values, idx, err = makeOutputList(op, idx, "values"); err != nil {
+		scope.UpdateErr("MapPeek", err)
+		return
 	}
+	return values
 }
 
-// Dequantize the 'input' tensor into a float Tensor.
-//
-// [min_range, max_range] are scalar floats that specify the range for
-// the 'input' data. The 'mode' attribute controls exactly which calculations are
-// used to convert the float values to their quantized equivalents.
-//
-// In 'MIN_COMBINED' mode, each value of the tensor will undergo the following:
-//
-// ```
-// if T == qint8, in[i] += (range(T) + 1)/ 2.0
-// out[i] = min_range + (in[i]* (max_range - min_range) / range(T))
-// ```
-// here `range(T) = numeric_limits<T>::max() - numeric_limits<T>::min()`
-//
-// *MIN_COMBINED Mode Example*
-//
-// If the input comes from a QuantizedRelu6, the output type is
-// quint8 (range of 0-255) but the possible range of QuantizedRelu6 is
-// 0-6.  The min_range and max_range values are therefore 0.0 and 6.0.
-// Dequantize on quint8 will take each value, cast to float, and multiply
-// by 6 / 255.
-// Note that if quantizedtype is qint8, the operation will additionally add
-// each value by 128 prior to casting.
-//
-// If the mode is 'MIN_FIRST', then this approach is used:
-//
-// ```c++
-// num_discrete_values = 1 << (# of bits in T)
-// range_adjust = num_discrete_values / (num_discrete_values - 1)
-// range = (range_max - range_min) * range_adjust
-// range_scale = range / num_discrete_values
-// const double offset_input = static_cast<double>(input) - lowest_quantized;
-// result = range_min + ((input - numeric_limits<T>::min()) * range_scale)
-// ```
-//
-// *SCALED mode Example*
-//
-// `SCALED` mode matches the quantization approach used in
-// `QuantizeAndDequantize{V2|V3}`.
-//
-// If the mode is `SCALED`, we do not use the full range of the output type,
-// choosing to elide the lowest possible value for symmetry (e.g., output range is
-// -127 to 127, not -128 to 127 for signed 8 bit quantization), so that 0.0 maps to
-// 0.
-//
-// We first find the range of values in our tensor. The
-// range we use is always centered on 0, so we find m such that
-// ```c++
-//   m = max(abs(input_min), abs(input_max))
-// ```
-//
-// Our input tensor range is then `[-m, m]`.
-//
-// Next, we choose our fixed-point quantization buckets, `[min_fixed, max_fixed]`.
-// If T is signed, this is
-// ```
-//   num_bits = sizeof(T) * 8
-//   [min_fixed, max_fixed] =
-//       [-(1 << (num_bits - 1) - 1), (1 << (num_bits - 1)) - 1]
-// ```
-//
-// Otherwise, if T is unsigned, the fixed-point range is
-// ```
-//   [min_fixed, max_fixed] = [0, (1 << num_bits) - 1]
-// ```
+// Looks up keys in a table, outputs the corresponding values.
 //
-// From this we compute our scaling factor, s:
-// ```c++
-//   s = (2 * m) / (max_fixed - min_fixed)
-// ```
+// The tensor `keys` must of the same type as the keys of the table.
+// The output `values` is of the type of the table values.
 //
-// Now we can dequantize the elements of our tensor:
-// ```c++
-// result = input * s
-// ```
+// The scalar `default_value` is the value output for keys not present in the
+// table. It must also be of the same type as the table values.
 //
 // Arguments:
+//	table_handle: Handle to the table.
+//	keys: Any shape.  Keys to look up.
 //
-//	min_range: The minimum scalar value possibly produced for the input.
-//	max_range: The maximum scalar value possibly produced for the input.
-func Dequantize(scope *Scope, input tf.Output, min_range tf.Output, max_range tf.Output, optional ...DequantizeAttr) (output tf.Output) {
+//
+// Returns Same shape as `keys`.  Values found in the table, or `default_values`
+// for missing keys.
+func LookupTableFindV2(scope *Scope, table_handle tf.Output, keys tf.Output, default_value tf.Output) (values tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "Dequantize",
+		Type: "LookupTableFindV2",
 		Input: []tf.Input{
-			input, min_range, max_range,
+			table_handle, keys, default_value,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Flips all bits elementwise.
+// Bucketizes 'input' based on 'boundaries'.
 //
-// The result will have exactly those bits set, that are not set in `x`. The
-// computation is performed on the underlying representation of x.
-func Invert(scope *Scope, x tf.Output) (y tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "Invert",
-		Input: []tf.Input{
-			x,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Inverse 3D fast Fourier transform.
+// For example, if the inputs are
+//     boundaries = [0, 10, 100]
+//     input = [[-5, 10000]
+//              [150,   10]
+//              [5,    100]]
 //
-// Computes the inverse 3-dimensional discrete Fourier transform over the
-// inner-most 3 dimensions of `input`.
+// then the output will be
+//     output = [[0, 3]
+//               [3, 2]
+//               [1, 3]]
 //
 // Arguments:
-//	input: A complex64 tensor.
+//	input: Any shape of Tensor contains with int or float type.
+//	boundaries: A sorted list of floats gives the boundary of the buckets.
 //
-// Returns A complex64 tensor of the same shape as `input`. The inner-most 3
-//   dimensions of `input` are replaced with their inverse 3D Fourier transform.
+// Returns Same shape with 'input', each value of input replaced with bucket index.
 //
 // @compatibility(numpy)
-// Equivalent to np.fft.ifftn with 3 dimensions.
+// Equivalent to np.digitize.
 // @end_compatibility
-func IFFT3D(scope *Scope, input tf.Output) (output tf.Output) {
+func Bucketize(scope *Scope, input tf.Output, boundaries []float32) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"boundaries": boundaries}
 	opspec := tf.OpSpec{
-		Type: "IFFT3D",
+		Type: "Bucketize",
 		Input: []tf.Input{
 			input,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Deprecated. Disallowed in GraphDef version >= 2.
+// Calculates gains for each feature and returns the best possible split information for the feature.
 //
-// DEPRECATED at GraphDef version 2: Use AdjustContrastv2 instead
-func AdjustContrast(scope *Scope, images tf.Output, contrast_factor tf.Output, min_value tf.Output, max_value tf.Output) (output tf.Output) {
+// The split information is the best threshold (bucket id), gains and left/right node contributions per node for each feature.
+//
+// It is possible that not all nodes can be split on each feature. Hence, the list of possible nodes can differ between the features. Therefore, we return `node_ids_list` for each feature, containing the list of nodes that this feature can be used to split.
+//
+// In this manner, the output is the best split per features and per node, so that it needs to be combined later to produce the best split for each node (among all possible features).
+//
+// The length of output lists are all of the same length, `num_features`.
+// The output shapes are compatible in a way that the first dimension of all tensors of all lists are the same and equal to the number of possible split nodes for each feature.
+//
+// Arguments:
+//	node_id_range: A Rank 1 tensor (shape=[2]) to specify the range [first, last) of node ids to process within `stats_summary_list`. The nodes are iterated between the two nodes specified by the tensor, as like `for node_id in range(node_id_range[0], node_id_range[1])` (Note that the last index node_id_range[1] is exclusive).
+//	stats_summary_list: A list of Rank 3 tensor (#shape=[max_splits, bucket, 2]) for accumulated stats summary (gradient/hessian) per node per buckets for each feature. The first dimension of the tensor is the maximum number of splits, and thus not all elements of it will be used, but only the indexes specified by node_ids will be used.
+//	l1: l1 regularization factor on leaf weights, per instance based.
+//	l2: l2 regularization factor on leaf weights, per instance based.
+//	tree_complexity: adjustment to the gain, per leaf based.
+//	min_node_weight: mininum avg of hessians in a node before required for the node to be considered for splitting.
+//	max_splits: the number of nodes that can be split in the whole tree. Used as a dimension of output tensors.
+//
+// Returns An output list of Rank 1 tensors indicating possible split node ids for each feature. The length of the list is num_features, but each tensor has different size as each feature provides different possible nodes. See above for details like shapes and sizes.An output list of Rank 1 tensors indicating the best gains for each feature to split for certain nodes. See above for details like shapes and sizes.An output list of Rank 1 tensors indicating the bucket id to compare with (as a threshold) for split in each node. See above for details like shapes and sizes.A list of Rank 2 tensors indicating the contribution of the left nodes when branching from parent nodes (given by the tensor element in the output node_ids_list) to the left direction by the given threshold for each feature. This value will be used to make the left node value by adding to the parent node value. Second dimension size is 1 for 1-dimensional logits, but would be larger for multi-class problems. See above for details like shapes and sizes.A list of Rank 2 tensors, with the same shape/conditions as left_node_contribs_list, but just that the value is for the right node.
+func BoostedTreesCalculateBestGainsPerFeature(scope *Scope, node_id_range tf.Output, stats_summary_list []tf.Output, l1 tf.Output, l2 tf.Output, tree_complexity tf.Output, min_node_weight tf.Output, max_splits int64) (node_ids_list []tf.Output, gains_list []tf.Output, thresholds_list []tf.Output, left_node_contribs_list []tf.Output, right_node_contribs_list []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"max_splits": max_splits}
 	opspec := tf.OpSpec{
-		Type: "AdjustContrast",
+		Type: "BoostedTreesCalculateBestGainsPerFeature",
 		Input: []tf.Input{
-			images, contrast_factor, min_value, max_value,
+			node_id_range, tf.OutputList(stats_summary_list), l1, l2, tree_complexity, min_node_weight,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Table initializer that takes two tensors for keys and values respectively.
-//
-// Arguments:
-//	table_handle: Handle to a table which will be initialized.
-//	keys: Keys of type Tkey.
-//	values: Values of type Tval.
-//
-// Returns the created operation.
-func InitializeTableV2(scope *Scope, table_handle tf.Output, keys tf.Output, values tf.Output) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
-	opspec := tf.OpSpec{
-		Type: "InitializeTableV2",
-		Input: []tf.Input{
-			table_handle, keys, values,
-		},
+	var idx int
+	var err error
+	if node_ids_list, idx, err = makeOutputList(op, idx, "node_ids_list"); err != nil {
+		scope.UpdateErr("BoostedTreesCalculateBestGainsPerFeature", err)
+		return
 	}
-	return scope.AddOperation(opspec)
-}
-
-// PrintAttr is an optional argument to Print.
-type PrintAttr func(optionalAttr)
-
-// PrintMessage sets the optional message attribute to value.
-//
-// value: A string, prefix of the error message.
-// If not specified, defaults to ""
-func PrintMessage(value string) PrintAttr {
-	return func(m optionalAttr) {
-		m["message"] = value
+	if gains_list, idx, err = makeOutputList(op, idx, "gains_list"); err != nil {
+		scope.UpdateErr("BoostedTreesCalculateBestGainsPerFeature", err)
+		return
+	}
+	if thresholds_list, idx, err = makeOutputList(op, idx, "thresholds_list"); err != nil {
+		scope.UpdateErr("BoostedTreesCalculateBestGainsPerFeature", err)
+		return
+	}
+	if left_node_contribs_list, idx, err = makeOutputList(op, idx, "left_node_contribs_list"); err != nil {
+		scope.UpdateErr("BoostedTreesCalculateBestGainsPerFeature", err)
+		return
+	}
+	if right_node_contribs_list, idx, err = makeOutputList(op, idx, "right_node_contribs_list"); err != nil {
+		scope.UpdateErr("BoostedTreesCalculateBestGainsPerFeature", err)
+		return
 	}
+	return node_ids_list, gains_list, thresholds_list, left_node_contribs_list, right_node_contribs_list
 }
 
-// PrintFirstN sets the optional first_n attribute to value.
+// EncodePngAttr is an optional argument to EncodePng.
+type EncodePngAttr func(optionalAttr)
+
+// EncodePngCompression sets the optional compression attribute to value.
 //
-// value: Only log `first_n` number of times. -1 disables logging.
+// value: Compression level.
 // If not specified, defaults to -1
-func PrintFirstN(value int64) PrintAttr {
+func EncodePngCompression(value int64) EncodePngAttr {
 	return func(m optionalAttr) {
-		m["first_n"] = value
+		m["compression"] = value
 	}
 }
 
-// PrintSummarize sets the optional summarize attribute to value.
+// PNG-encode an image.
 //
-// value: Only print this many entries of each tensor.
-// If not specified, defaults to 3
-func PrintSummarize(value int64) PrintAttr {
-	return func(m optionalAttr) {
-		m["summarize"] = value
-	}
-}
-
-// Prints a list of tensors.
+// `image` is a 3-D uint8 or uint16 Tensor of shape `[height, width, channels]`
+// where `channels` is:
 //
-// Passes `input` through to `output` and prints `data` when evaluating.
+// *   1: for grayscale.
+// *   2: for grayscale + alpha.
+// *   3: for RGB.
+// *   4: for RGBA.
+//
+// The ZLIB compression level, `compression`, can be -1 for the PNG-encoder
+// default or a value from 0 to 9.  9 is the highest compression level, generating
+// the smallest output, but is slower.
 //
 // Arguments:
-//	input: The tensor passed to `output`
-//	data: A list of tensors to print out when op is evaluated.
+//	image: 3-D with shape `[height, width, channels]`.
 //
-// Returns = The unmodified `input` tensor
-func Print(scope *Scope, input tf.Output, data []tf.Output, optional ...PrintAttr) (output tf.Output) {
+// Returns 0-D. PNG-encoded image.
+func EncodePng(scope *Scope, image tf.Output, optional ...EncodePngAttr) (contents tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -21021,9 +20721,9 @@ func Print(scope *Scope, input tf.Output, data []tf.Output, optional ...PrintAtt
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "Print",
+		Type: "EncodePng",
 		Input: []tf.Input{
-			input, tf.OutputList(data),
+			image,
 		},
 		Attrs: attrs,
 	}
@@ -21031,104 +20731,101 @@ func Print(scope *Scope, input tf.Output, data []tf.Output, optional ...PrintAtt
 	return op.Output(0)
 }
 
-// Outputs a `Summary` protocol buffer with a tensor and per-plugin data.
+// Updates the table to associates keys with values.
+//
+// The tensor `keys` must be of the same type as the keys of the table.
+// The tensor `values` must be of the type of the table values.
 //
 // Arguments:
-//	tag: A string attached to this summary. Used for organization in TensorBoard.
-//	tensor: A tensor to serialize.
-//	serialized_summary_metadata: A serialized SummaryMetadata proto. Contains plugin
-// data.
-func TensorSummaryV2(scope *Scope, tag tf.Output, tensor tf.Output, serialized_summary_metadata tf.Output) (summary tf.Output) {
+//	table_handle: Handle to the table.
+//	keys: Any shape.  Keys to look up.
+//	values: Values to associate with keys.
+//
+// Returns the created operation.
+func LookupTableInsertV2(scope *Scope, table_handle tf.Output, keys tf.Output, values tf.Output) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "TensorSummaryV2",
+		Type: "LookupTableInsertV2",
 		Input: []tf.Input{
-			tag, tensor, serialized_summary_metadata,
+			table_handle, keys, values,
+		},
+	}
+	return scope.AddOperation(opspec)
+}
+
+// Returns element-wise smallest integer in not less than x.
+func Ceil(scope *Scope, x tf.Output) (y tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "Ceil",
+		Input: []tf.Input{
+			x,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Creates a dataset that asynchronously prefetches elements from `input_dataset`.
+// Computes the number of elements in the given table.
 //
 // Arguments:
+//	table_handle: Handle to the table.
 //
-//	buffer_size: The maximum number of elements to buffer in an iterator over
-// this dataset.
-//
-//
-func PrefetchDataset(scope *Scope, input_dataset tf.Output, buffer_size tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
+// Returns Scalar that contains number of elements in the table.
+func LookupTableSizeV2(scope *Scope, table_handle tf.Output) (size tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "PrefetchDataset",
+		Type: "LookupTableSizeV2",
 		Input: []tf.Input{
-			input_dataset, buffer_size,
+			table_handle,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// TensorSummaryAttr is an optional argument to TensorSummary.
-type TensorSummaryAttr func(optionalAttr)
+// ResizeBilinearGradAttr is an optional argument to ResizeBilinearGrad.
+type ResizeBilinearGradAttr func(optionalAttr)
 
-// TensorSummaryDescription sets the optional description attribute to value.
+// ResizeBilinearGradAlignCorners sets the optional align_corners attribute to value.
 //
-// value: A json-encoded SummaryDescription proto.
-// If not specified, defaults to ""
-func TensorSummaryDescription(value string) TensorSummaryAttr {
+// value: If true, the centers of the 4 corner pixels of the input and grad tensors are
+// aligned. Defaults to false.
+// If not specified, defaults to false
+func ResizeBilinearGradAlignCorners(value bool) ResizeBilinearGradAttr {
 	return func(m optionalAttr) {
-		m["description"] = value
+		m["align_corners"] = value
 	}
 }
 
-// TensorSummaryLabels sets the optional labels attribute to value.
+// Computes the gradient of bilinear interpolation.
 //
-// value: An unused list of strings.
-// If not specified, defaults to <>
-func TensorSummaryLabels(value []string) TensorSummaryAttr {
-	return func(m optionalAttr) {
-		m["labels"] = value
-	}
-}
-
-// TensorSummaryDisplayName sets the optional display_name attribute to value.
+// Arguments:
+//	grads: 4-D with shape `[batch, height, width, channels]`.
+//	original_image: 4-D with shape `[batch, orig_height, orig_width, channels]`,
+// The image tensor that was resized.
 //
-// value: An unused string.
-// If not specified, defaults to ""
-func TensorSummaryDisplayName(value string) TensorSummaryAttr {
-	return func(m optionalAttr) {
-		m["display_name"] = value
-	}
-}
-
-// Outputs a `Summary` protocol buffer with a tensor.
-//
-// This op is being phased out in favor of TensorSummaryV2, which lets callers pass
-// a tag as well as a serialized SummaryMetadata proto string that contains
-// plugin-specific data. We will keep this op to maintain backwards compatibility.
-//
-// Arguments:
-//	tensor: A tensor to serialize.
-func TensorSummary(scope *Scope, tensor tf.Output, optional ...TensorSummaryAttr) (summary tf.Output) {
-	if scope.Err() != nil {
-		return
+// Returns 4-D with shape `[batch, orig_height, orig_width, channels]`.
+// Gradients with respect to the input image. Input image must have been
+// float or double.
+func ResizeBilinearGrad(scope *Scope, grads tf.Output, original_image tf.Output, optional ...ResizeBilinearGradAttr) (output tf.Output) {
+	if scope.Err() != nil {
+		return
 	}
 	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "TensorSummary",
+		Type: "ResizeBilinearGrad",
 		Input: []tf.Input{
-			tensor,
+			grads, original_image,
 		},
 		Attrs: attrs,
 	}
@@ -21136,258 +20833,346 @@ func TensorSummary(scope *Scope, tensor tf.Output, optional ...TensorSummaryAttr
 	return op.Output(0)
 }
 
-// Computes the gradient for the tanh of `x` wrt its input.
+// Outputs all keys and values in the table.
 //
-// Specifically, `grad = dy * (1 - y*y)`, where `y = tanh(x)`, and `dy`
-// is the corresponding input gradient.
-func TanhGrad(scope *Scope, y tf.Output, dy tf.Output) (z tf.Output) {
+// Arguments:
+//	table_handle: Handle to the table.
+//
+//
+//
+// Returns Vector of all keys present in the table.Tensor of all values in the table. Indexed in parallel with `keys`.
+func LookupTableExportV2(scope *Scope, table_handle tf.Output, Tkeys tf.DataType, Tvalues tf.DataType) (keys tf.Output, values tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"Tkeys": Tkeys, "Tvalues": Tvalues}
 	opspec := tf.OpSpec{
-		Type: "TanhGrad",
+		Type: "LookupTableExportV2",
 		Input: []tf.Input{
-			y, dy,
+			table_handle,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1)
 }
 
-// Reduces sparse updates into the variable referenced by `resource` using the `max` operation.
-//
-// This operation computes
-//
-//     # Scalar indices
-//     ref[indices, ...] = max(ref[indices, ...], updates[...])
-//
-//     # Vector indices (for each i)
-//     ref[indices[i], ...] = max(ref[indices[i], ...], updates[i, ...])
-//
-//     # High rank indices (for each i, ..., j)
-//     ref[indices[i, ..., j], ...] = max(ref[indices[i, ..., j], ...], updates[i, ..., j, ...])
-//
-// Duplicate entries are handled correctly: if multiple `indices` reference
-// the same location, their contributions are combined.
-//
-// Requires `updates.shape = indices.shape + ref.shape[1:]` or `updates.shape = []`.
+// Replaces the contents of the table with the specified keys and values.
 //
-// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
-// <img style="width:100%" src='https://www.tensorflow.org/images/ScatterAdd.png' alt>
-// </div>
+// The tensor `keys` must be of the same type as the keys of the table.
+// The tensor `values` must be of the type of the table values.
 //
 // Arguments:
-//	resource: Should be from a `Variable` node.
-//	indices: A tensor of indices into the first dimension of `ref`.
-//	updates: A tensor of updated values to add to `ref`.
+//	table_handle: Handle to the table.
+//	keys: Any shape.  Keys to look up.
+//	values: Values to associate with keys.
 //
 // Returns the created operation.
-func ResourceScatterMax(scope *Scope, resource tf.Output, indices tf.Output, updates tf.Output) (o *tf.Operation) {
+func LookupTableImportV2(scope *Scope, table_handle tf.Output, keys tf.Output, values tf.Output) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "ResourceScatterMax",
+		Type: "LookupTableImportV2",
 		Input: []tf.Input{
-			resource, indices, updates,
+			table_handle, keys, values,
 		},
 	}
 	return scope.AddOperation(opspec)
 }
 
-// Outputs a `Summary` protocol buffer with scalar values.
+// MapUnstageNoKeyAttr is an optional argument to MapUnstageNoKey.
+type MapUnstageNoKeyAttr func(optionalAttr)
+
+// MapUnstageNoKeyCapacity sets the optional capacity attribute to value.
+// If not specified, defaults to 0
 //
-// The input `tags` and `values` must have the same shape.  The generated summary
-// has a summary value for each tag-value pair in `tags` and `values`.
+// REQUIRES: value >= 0
+func MapUnstageNoKeyCapacity(value int64) MapUnstageNoKeyAttr {
+	return func(m optionalAttr) {
+		m["capacity"] = value
+	}
+}
+
+// MapUnstageNoKeyMemoryLimit sets the optional memory_limit attribute to value.
+// If not specified, defaults to 0
 //
-// Arguments:
-//	tags: Tags for the summary.
-//	values: Same shape as `tags.  Values for the summary.
+// REQUIRES: value >= 0
+func MapUnstageNoKeyMemoryLimit(value int64) MapUnstageNoKeyAttr {
+	return func(m optionalAttr) {
+		m["memory_limit"] = value
+	}
+}
+
+// MapUnstageNoKeyContainer sets the optional container attribute to value.
+// If not specified, defaults to ""
+func MapUnstageNoKeyContainer(value string) MapUnstageNoKeyAttr {
+	return func(m optionalAttr) {
+		m["container"] = value
+	}
+}
+
+// MapUnstageNoKeySharedName sets the optional shared_name attribute to value.
+// If not specified, defaults to ""
+func MapUnstageNoKeySharedName(value string) MapUnstageNoKeyAttr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
+	}
+}
+
+// Op removes and returns a random (key, value)
 //
-// Returns Scalar.  Serialized `Summary` protocol buffer.
-func ScalarSummary(scope *Scope, tags tf.Output, values tf.Output) (summary tf.Output) {
+// from the underlying container.   If the underlying container
+// does not contain elements, the op will block until it does.
+func MapUnstageNoKey(scope *Scope, indices tf.Output, dtypes []tf.DataType, optional ...MapUnstageNoKeyAttr) (key tf.Output, values []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"dtypes": dtypes}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "ScalarSummary",
+		Type: "MapUnstageNoKey",
 		Input: []tf.Input{
-			tags, values,
+			indices,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	if scope.Err() != nil {
+		return
+	}
+	var idx int
+	var err error
+	key = op.Output(idx)
+	if values, idx, err = makeOutputList(op, idx, "values"); err != nil {
+		scope.UpdateErr("MapUnstageNoKey", err)
+		return
+	}
+	return key, values
 }
 
-// Outputs a `Summary` protocol buffer with a histogram.
-//
-// The generated
-// [`Summary`](https://www.tensorflow.org/code/tensorflow/core/framework/summary.proto)
-// has one summary value containing a histogram for `values`.
-//
-// This op reports an `InvalidArgument` error if any value is not finite.
+// HashTableV2Attr is an optional argument to HashTableV2.
+type HashTableV2Attr func(optionalAttr)
+
+// HashTableV2Container sets the optional container attribute to value.
 //
-// Arguments:
-//	tag: Scalar.  Tag to use for the `Summary.Value`.
-//	values: Any shape. Values to use to build the histogram.
+// value: If non-empty, this table is placed in the given container.
+// Otherwise, a default container is used.
+// If not specified, defaults to ""
+func HashTableV2Container(value string) HashTableV2Attr {
+	return func(m optionalAttr) {
+		m["container"] = value
+	}
+}
+
+// HashTableV2SharedName sets the optional shared_name attribute to value.
 //
-// Returns Scalar. Serialized `Summary` protocol buffer.
-func HistogramSummary(scope *Scope, tag tf.Output, values tf.Output) (summary tf.Output) {
-	if scope.Err() != nil {
-		return
+// value: If non-empty, this table is shared under the given name across
+// multiple sessions.
+// If not specified, defaults to ""
+func HashTableV2SharedName(value string) HashTableV2Attr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
 	}
-	opspec := tf.OpSpec{
-		Type: "HistogramSummary",
-		Input: []tf.Input{
-			tag, values,
-		},
+}
+
+// HashTableV2UseNodeNameSharing sets the optional use_node_name_sharing attribute to value.
+//
+// value: If true and shared_name is empty, the table is shared
+// using the node name.
+// If not specified, defaults to false
+func HashTableV2UseNodeNameSharing(value bool) HashTableV2Attr {
+	return func(m optionalAttr) {
+		m["use_node_name_sharing"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// Computes the number of elements in the given queue.
+// Creates a non-initialized hash table.
+//
+// This op creates a hash table, specifying the type of its keys and values.
+// Before using the table you will have to initialize it.  After initialization the
+// table will be immutable.
 //
 // Arguments:
-//	handle: The handle to a queue.
+//	key_dtype: Type of the table keys.
+//	value_dtype: Type of the table values.
 //
-// Returns The number of elements in the given queue.
-func QueueSizeV2(scope *Scope, handle tf.Output) (size tf.Output) {
+// Returns Handle to a table.
+func HashTableV2(scope *Scope, key_dtype tf.DataType, value_dtype tf.DataType, optional ...HashTableV2Attr) (table_handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"key_dtype": key_dtype, "value_dtype": value_dtype}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "QueueSizeV2",
-		Input: []tf.Input{
-			handle,
-		},
+		Type: "HashTableV2",
+
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// ImageSummaryAttr is an optional argument to ImageSummary.
-type ImageSummaryAttr func(optionalAttr)
+// MutableHashTableV2Attr is an optional argument to MutableHashTableV2.
+type MutableHashTableV2Attr func(optionalAttr)
 
-// ImageSummaryMaxImages sets the optional max_images attribute to value.
+// MutableHashTableV2Container sets the optional container attribute to value.
 //
-// value: Max number of batch elements to generate images for.
-// If not specified, defaults to 3
+// value: If non-empty, this table is placed in the given container.
+// Otherwise, a default container is used.
+// If not specified, defaults to ""
+func MutableHashTableV2Container(value string) MutableHashTableV2Attr {
+	return func(m optionalAttr) {
+		m["container"] = value
+	}
+}
+
+// MutableHashTableV2SharedName sets the optional shared_name attribute to value.
 //
-// REQUIRES: value >= 1
-func ImageSummaryMaxImages(value int64) ImageSummaryAttr {
+// value: If non-empty, this table is shared under the given name across
+// multiple sessions.
+// If not specified, defaults to ""
+func MutableHashTableV2SharedName(value string) MutableHashTableV2Attr {
 	return func(m optionalAttr) {
-		m["max_images"] = value
+		m["shared_name"] = value
 	}
 }
 
-// ImageSummaryBadColor sets the optional bad_color attribute to value.
+// MutableHashTableV2UseNodeNameSharing sets the optional use_node_name_sharing attribute to value.
 //
-// value: Color to use for pixels with non-finite values.
-// If not specified, defaults to <dtype:DT_UINT8 tensor_shape:<dim:<size:4 > > int_val:255 int_val:0 int_val:0 int_val:255 >
-func ImageSummaryBadColor(value tf.Tensor) ImageSummaryAttr {
+// value: If true and shared_name is empty, the table is shared
+// using the node name.
+// If not specified, defaults to false
+func MutableHashTableV2UseNodeNameSharing(value bool) MutableHashTableV2Attr {
 	return func(m optionalAttr) {
-		m["bad_color"] = value
+		m["use_node_name_sharing"] = value
 	}
 }
 
-// Outputs a `Summary` protocol buffer with images.
+// Creates an empty hash table.
 //
-// The summary has up to `max_images` summary values containing images. The
-// images are built from `tensor` which must be 4-D with shape `[batch_size,
-// height, width, channels]` and where `channels` can be:
-//
-// *  1: `tensor` is interpreted as Grayscale.
-// *  3: `tensor` is interpreted as RGB.
-// *  4: `tensor` is interpreted as RGBA.
-//
-// The images have the same number of channels as the input tensor. For float
-// input, the values are normalized one image at a time to fit in the range
-// `[0, 255]`.  `uint8` values are unchanged.  The op uses two different
-// normalization algorithms:
-//
-// *  If the input values are all positive, they are rescaled so the largest one
-//    is 255.
-//
-// *  If any input value is negative, the values are shifted so input value 0.0
-//    is at 127.  They are then rescaled so that either the smallest value is 0,
-//    or the largest one is 255.
-//
-// The `tag` argument is a scalar `Tensor` of type `string`.  It is used to
-// build the `tag` of the summary values:
-//
-// *  If `max_images` is 1, the summary value tag is '*tag*/image'.
-// *  If `max_images` is greater than 1, the summary value tags are
-//    generated sequentially as '*tag*/image/0', '*tag*/image/1', etc.
-//
-// The `bad_color` argument is the color to use in the generated images for
-// non-finite input values.  It is a `uint8` 1-D tensor of length `channels`.
-// Each element must be in the range `[0, 255]` (It represents the value of a
-// pixel in the output image).  Non-finite values in the input tensor are
-// replaced by this tensor in the output image.  The default value is the color
-// red.
+// This op creates a mutable hash table, specifying the type of its keys and
+// values. Each value must be a scalar. Data can be inserted into the table using
+// the insert operations. It does not support the initialization operation.
 //
 // Arguments:
-//	tag: Scalar. Used to build the `tag` attribute of the summary values.
-//	tensor: 4-D of shape `[batch_size, height, width, channels]` where
-// `channels` is 1, 3, or 4.
+//	key_dtype: Type of the table keys.
+//	value_dtype: Type of the table values.
 //
-// Returns Scalar. Serialized `Summary` protocol buffer.
-func ImageSummary(scope *Scope, tag tf.Output, tensor tf.Output, optional ...ImageSummaryAttr) (summary tf.Output) {
+// Returns Handle to a table.
+func MutableHashTableV2(scope *Scope, key_dtype tf.DataType, value_dtype tf.DataType, optional ...MutableHashTableV2Attr) (table_handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
+	attrs := map[string]interface{}{"key_dtype": key_dtype, "value_dtype": value_dtype}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ImageSummary",
-		Input: []tf.Input{
-			tag, tensor,
-		},
+		Type: "MutableHashTableV2",
+
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// AudioSummaryV2Attr is an optional argument to AudioSummaryV2.
-type AudioSummaryV2Attr func(optionalAttr)
+// DequantizeAttr is an optional argument to Dequantize.
+type DequantizeAttr func(optionalAttr)
 
-// AudioSummaryV2MaxOutputs sets the optional max_outputs attribute to value.
-//
-// value: Max number of batch elements to generate audio for.
-// If not specified, defaults to 3
-//
-// REQUIRES: value >= 1
-func AudioSummaryV2MaxOutputs(value int64) AudioSummaryV2Attr {
+// DequantizeMode sets the optional mode attribute to value.
+// If not specified, defaults to "MIN_COMBINED"
+func DequantizeMode(value string) DequantizeAttr {
 	return func(m optionalAttr) {
-		m["max_outputs"] = value
+		m["mode"] = value
 	}
 }
 
-// Outputs a `Summary` protocol buffer with audio.
+// Dequantize the 'input' tensor into a float Tensor.
 //
-// The summary has up to `max_outputs` summary values containing audio. The
-// audio is built from `tensor` which must be 3-D with shape `[batch_size,
-// frames, channels]` or 2-D with shape `[batch_size, frames]`. The values are
-// assumed to be in the range of `[-1.0, 1.0]` with a sample rate of `sample_rate`.
+// [min_range, max_range] are scalar floats that specify the range for
+// the 'input' data. The 'mode' attribute controls exactly which calculations are
+// used to convert the float values to their quantized equivalents.
 //
-// The `tag` argument is a scalar `Tensor` of type `string`.  It is used to
-// build the `tag` of the summary values:
+// In 'MIN_COMBINED' mode, each value of the tensor will undergo the following:
 //
-// *  If `max_outputs` is 1, the summary value tag is '*tag*/audio'.
-// *  If `max_outputs` is greater than 1, the summary value tags are
-//    generated sequentially as '*tag*/audio/0', '*tag*/audio/1', etc.
+// ```
+// if T == qint8, in[i] += (range(T) + 1)/ 2.0
+// out[i] = min_range + (in[i]* (max_range - min_range) / range(T))
+// ```
+// here `range(T) = numeric_limits<T>::max() - numeric_limits<T>::min()`
+//
+// *MIN_COMBINED Mode Example*
+//
+// If the input comes from a QuantizedRelu6, the output type is
+// quint8 (range of 0-255) but the possible range of QuantizedRelu6 is
+// 0-6.  The min_range and max_range values are therefore 0.0 and 6.0.
+// Dequantize on quint8 will take each value, cast to float, and multiply
+// by 6 / 255.
+// Note that if quantizedtype is qint8, the operation will additionally add
+// each value by 128 prior to casting.
+//
+// If the mode is 'MIN_FIRST', then this approach is used:
+//
+// ```c++
+// num_discrete_values = 1 << (# of bits in T)
+// range_adjust = num_discrete_values / (num_discrete_values - 1)
+// range = (range_max - range_min) * range_adjust
+// range_scale = range / num_discrete_values
+// const double offset_input = static_cast<double>(input) - lowest_quantized;
+// result = range_min + ((input - numeric_limits<T>::min()) * range_scale)
+// ```
+//
+// *SCALED mode Example*
+//
+// `SCALED` mode matches the quantization approach used in
+// `QuantizeAndDequantize{V2|V3}`.
+//
+// If the mode is `SCALED`, we do not use the full range of the output type,
+// choosing to elide the lowest possible value for symmetry (e.g., output range is
+// -127 to 127, not -128 to 127 for signed 8 bit quantization), so that 0.0 maps to
+// 0.
+//
+// We first find the range of values in our tensor. The
+// range we use is always centered on 0, so we find m such that
+// ```c++
+//   m = max(abs(input_min), abs(input_max))
+// ```
+//
+// Our input tensor range is then `[-m, m]`.
+//
+// Next, we choose our fixed-point quantization buckets, `[min_fixed, max_fixed]`.
+// If T is signed, this is
+// ```
+//   num_bits = sizeof(T) * 8
+//   [min_fixed, max_fixed] =
+//       [-(1 << (num_bits - 1) - 1), (1 << (num_bits - 1)) - 1]
+// ```
+//
+// Otherwise, if T is unsigned, the fixed-point range is
+// ```
+//   [min_fixed, max_fixed] = [0, (1 << num_bits) - 1]
+// ```
+//
+// From this we compute our scaling factor, s:
+// ```c++
+//   s = (2 * m) / (max_fixed - min_fixed)
+// ```
+//
+// Now we can dequantize the elements of our tensor:
+// ```c++
+// result = input * s
+// ```
 //
 // Arguments:
-//	tag: Scalar. Used to build the `tag` attribute of the summary values.
-//	tensor: 2-D of shape `[batch_size, frames]`.
-//	sample_rate: The sample rate of the signal in hertz.
 //
-// Returns Scalar. Serialized `Summary` protocol buffer.
-func AudioSummaryV2(scope *Scope, tag tf.Output, tensor tf.Output, sample_rate tf.Output, optional ...AudioSummaryV2Attr) (summary tf.Output) {
+//	min_range: The minimum scalar value possibly produced for the input.
+//	max_range: The maximum scalar value possibly produced for the input.
+func Dequantize(scope *Scope, input tf.Output, min_range tf.Output, max_range tf.Output, optional ...DequantizeAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -21396,9 +21181,9 @@ func AudioSummaryV2(scope *Scope, tag tf.Output, tensor tf.Output, sample_rate t
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "AudioSummaryV2",
+		Type: "Dequantize",
 		Input: []tf.Input{
-			tag, tensor, sample_rate,
+			input, min_range, max_range,
 		},
 		Attrs: attrs,
 	}
@@ -21406,161 +21191,133 @@ func AudioSummaryV2(scope *Scope, tag tf.Output, tensor tf.Output, sample_rate t
 	return op.Output(0)
 }
 
-// AvgPoolAttr is an optional argument to AvgPool.
-type AvgPoolAttr func(optionalAttr)
-
-// AvgPoolDataFormat sets the optional data_format attribute to value.
-//
-// value: Specify the data format of the input and output data. With the
-// default format "NHWC", the data is stored in the order of:
-//     [batch, in_height, in_width, in_channels].
-// Alternatively, the format could be "NCHW", the data storage order of:
-//     [batch, in_channels, in_height, in_width].
-// If not specified, defaults to "NHWC"
-func AvgPoolDataFormat(value string) AvgPoolAttr {
-	return func(m optionalAttr) {
-		m["data_format"] = value
-	}
-}
-
-// Performs average pooling on the input.
-//
-// Each entry in `output` is the mean of the corresponding size `ksize`
-// window in `value`.
-//
-// Arguments:
-//	value: 4-D with shape `[batch, height, width, channels]`.
-//	ksize: The size of the sliding window for each dimension of `value`.
-//	strides: The stride of the sliding window for each dimension of `value`.
-//	padding: The type of padding algorithm to use.
+// Flips all bits elementwise.
 //
-// Returns The average pooled output tensor.
-func AvgPool(scope *Scope, value tf.Output, ksize []int64, strides []int64, padding string, optional ...AvgPoolAttr) (output tf.Output) {
+// The result will have exactly those bits set, that are not set in `x`. The
+// computation is performed on the underlying representation of x.
+func Invert(scope *Scope, x tf.Output) (y tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"ksize": ksize, "strides": strides, "padding": padding}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "AvgPool",
+		Type: "Invert",
 		Input: []tf.Input{
-			value,
+			x,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Merges summaries.
-//
-// This op creates a
-// [`Summary`](https://www.tensorflow.org/code/tensorflow/core/framework/summary.proto)
-// protocol buffer that contains the union of all the values in the input
-// summaries.
+// Inverse 3D fast Fourier transform.
 //
-// When the Op is run, it reports an `InvalidArgument` error if multiple values
-// in the summaries to merge use the same tag.
+// Computes the inverse 3-dimensional discrete Fourier transform over the
+// inner-most 3 dimensions of `input`.
 //
 // Arguments:
-//	inputs: Can be of any shape.  Each must contain serialized `Summary` protocol
-// buffers.
+//	input: A complex64 tensor.
 //
-// Returns Scalar. Serialized `Summary` protocol buffer.
-func MergeSummary(scope *Scope, inputs []tf.Output) (summary tf.Output) {
+// Returns A complex64 tensor of the same shape as `input`. The inner-most 3
+//   dimensions of `input` are replaced with their inverse 3D Fourier transform.
+//
+// @compatibility(numpy)
+// Equivalent to np.fft.ifftn with 3 dimensions.
+// @end_compatibility
+func IFFT3D(scope *Scope, input tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "MergeSummary",
+		Type: "IFFT3D",
 		Input: []tf.Input{
-			tf.OutputList(inputs),
+			input,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Computes the gradient of morphological 2-D dilation with respect to the filter.
-//
-// Arguments:
-//	input: 4-D with shape `[batch, in_height, in_width, depth]`.
-//	filter: 3-D with shape `[filter_height, filter_width, depth]`.
-//	out_backprop: 4-D with shape `[batch, out_height, out_width, depth]`.
-//	strides: 1-D of length 4. The stride of the sliding window for each dimension of
-// the input tensor. Must be: `[1, stride_height, stride_width, 1]`.
-//	rates: 1-D of length 4. The input stride for atrous morphological dilation.
-// Must be: `[1, rate_height, rate_width, 1]`.
-//	padding: The type of padding algorithm to use.
+// Deprecated. Disallowed in GraphDef version >= 2.
 //
-// Returns 3-D with shape `[filter_height, filter_width, depth]`.
-func Dilation2DBackpropFilter(scope *Scope, input tf.Output, filter tf.Output, out_backprop tf.Output, strides []int64, rates []int64, padding string) (filter_backprop tf.Output) {
+// DEPRECATED at GraphDef version 2: Use AdjustContrastv2 instead
+func AdjustContrast(scope *Scope, images tf.Output, contrast_factor tf.Output, min_value tf.Output, max_value tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"strides": strides, "rates": rates, "padding": padding}
 	opspec := tf.OpSpec{
-		Type: "Dilation2DBackpropFilter",
+		Type: "AdjustContrast",
 		Input: []tf.Input{
-			input, filter, out_backprop,
+			images, contrast_factor, min_value, max_value,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// AddSparseToTensorsMapAttr is an optional argument to AddSparseToTensorsMap.
-type AddSparseToTensorsMapAttr func(optionalAttr)
-
-// AddSparseToTensorsMapContainer sets the optional container attribute to value.
+// Table initializer that takes two tensors for keys and values respectively.
 //
-// value: The container name for the `SparseTensorsMap` created by this op.
-// If not specified, defaults to ""
-func AddSparseToTensorsMapContainer(value string) AddSparseToTensorsMapAttr {
-	return func(m optionalAttr) {
-		m["container"] = value
+// Arguments:
+//	table_handle: Handle to a table which will be initialized.
+//	keys: Keys of type Tkey.
+//	values: Values of type Tval.
+//
+// Returns the created operation.
+func InitializeTableV2(scope *Scope, table_handle tf.Output, keys tf.Output, values tf.Output) (o *tf.Operation) {
+	if scope.Err() != nil {
+		return
 	}
+	opspec := tf.OpSpec{
+		Type: "InitializeTableV2",
+		Input: []tf.Input{
+			table_handle, keys, values,
+		},
+	}
+	return scope.AddOperation(opspec)
 }
 
-// AddSparseToTensorsMapSharedName sets the optional shared_name attribute to value.
+// PrintAttr is an optional argument to Print.
+type PrintAttr func(optionalAttr)
+
+// PrintMessage sets the optional message attribute to value.
 //
-// value: The shared name for the `SparseTensorsMap` created by this op.
-// If blank, the new Operation's unique name is used.
+// value: A string, prefix of the error message.
 // If not specified, defaults to ""
-func AddSparseToTensorsMapSharedName(value string) AddSparseToTensorsMapAttr {
+func PrintMessage(value string) PrintAttr {
 	return func(m optionalAttr) {
-		m["shared_name"] = value
+		m["message"] = value
 	}
 }
 
-// Add a `SparseTensor` to a `SparseTensorsMap` return its handle.
+// PrintFirstN sets the optional first_n attribute to value.
 //
-// A `SparseTensor` is represented by three tensors: `sparse_indices`,
-// `sparse_values`, and `sparse_shape`.
+// value: Only log `first_n` number of times. -1 disables logging.
+// If not specified, defaults to -1
+func PrintFirstN(value int64) PrintAttr {
+	return func(m optionalAttr) {
+		m["first_n"] = value
+	}
+}
+
+// PrintSummarize sets the optional summarize attribute to value.
 //
-// This operator takes the given `SparseTensor` and adds it to a container
-// object (a `SparseTensorsMap`).  A unique key within this container is generated
-// in the form of an `int64`, and this is the value that is returned.
+// value: Only print this many entries of each tensor.
+// If not specified, defaults to 3
+func PrintSummarize(value int64) PrintAttr {
+	return func(m optionalAttr) {
+		m["summarize"] = value
+	}
+}
+
+// Prints a list of tensors.
 //
-// The `SparseTensor` can then be read out as part of a minibatch by passing
-// the key as a vector element to `TakeManySparseFromTensorsMap`.  To ensure
-// the correct `SparseTensorsMap` is accessed, ensure that the same
-// `container` and `shared_name` are passed to that Op.  If no `shared_name`
-// is provided here, instead use the *name* of the Operation created by calling
-// `AddSparseToTensorsMap` as the `shared_name` passed to
-// `TakeManySparseFromTensorsMap`.  Ensure the Operations are colocated.
+// Passes `input` through to `output` and prints `data` when evaluating.
 //
 // Arguments:
-//	sparse_indices: 2-D.  The `indices` of the `SparseTensor`.
-//	sparse_values: 1-D.  The `values` of the `SparseTensor`.
-//	sparse_shape: 1-D.  The `shape` of the `SparseTensor`.
+//	input: The tensor passed to `output`
+//	data: A list of tensors to print out when op is evaluated.
 //
-// Returns 0-D.  The handle of the `SparseTensor` now stored in the
-// `SparseTensorsMap`.
-func AddSparseToTensorsMap(scope *Scope, sparse_indices tf.Output, sparse_values tf.Output, sparse_shape tf.Output, optional ...AddSparseToTensorsMapAttr) (sparse_handle tf.Output) {
+// Returns = The unmodified `input` tensor
+func Print(scope *Scope, input tf.Output, data []tf.Output, optional ...PrintAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -21569,9 +21326,9 @@ func AddSparseToTensorsMap(scope *Scope, sparse_indices tf.Output, sparse_values
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "AddSparseToTensorsMap",
+		Type: "Print",
 		Input: []tf.Input{
-			sparse_indices, sparse_values, sparse_shape,
+			input, tf.OutputList(data),
 		},
 		Attrs: attrs,
 	}
@@ -21579,81 +21336,104 @@ func AddSparseToTensorsMap(scope *Scope, sparse_indices tf.Output, sparse_values
 	return op.Output(0)
 }
 
-// Returns a list list which has the passed-in `Tensor` as last element and the other elements of the given list in `input_handle`.
+// Outputs a `Summary` protocol buffer with a tensor and per-plugin data.
 //
-// tensor: The tensor to put on the list.
-// input_handle: The old list.
-// output_handle: A list with the elements of the old list followed by tensor.
-// element_dtype: the type of elements in the list.
-// element_shape: a shape compatible with that of elements in the list.
-func TensorListPushBack(scope *Scope, input_handle tf.Output, tensor tf.Output) (output_handle tf.Output) {
+// Arguments:
+//	tag: A string attached to this summary. Used for organization in TensorBoard.
+//	tensor: A tensor to serialize.
+//	serialized_summary_metadata: A serialized SummaryMetadata proto. Contains plugin
+// data.
+func TensorSummaryV2(scope *Scope, tag tf.Output, tensor tf.Output, serialized_summary_metadata tf.Output) (summary tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "TensorListPushBack",
+		Type: "TensorSummaryV2",
 		Input: []tf.Input{
-			input_handle, tensor,
+			tag, tensor, serialized_summary_metadata,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Returns the number of tensors in the input tensor list.
+// Creates a dataset that asynchronously prefetches elements from `input_dataset`.
 //
-// input_handle: the input list
-// length: the number of tensors in the list
-func TensorListLength(scope *Scope, input_handle tf.Output) (length tf.Output) {
+// Arguments:
+//
+//	buffer_size: The maximum number of elements to buffer in an iterator over
+// this dataset.
+//
+//
+func PrefetchDataset(scope *Scope, input_dataset tf.Output, buffer_size tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "TensorListLength",
+		Type: "PrefetchDataset",
 		Input: []tf.Input{
-			input_handle,
+			input_dataset, buffer_size,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// The shape of the elements of the given list, as a tensor.
+// TensorSummaryAttr is an optional argument to TensorSummary.
+type TensorSummaryAttr func(optionalAttr)
+
+// TensorSummaryDescription sets the optional description attribute to value.
 //
-//   input_handle: the list
-//   element_shape: the shape of elements of the list
-func TensorListElementShape(scope *Scope, input_handle tf.Output, shape_type tf.DataType) (element_shape tf.Output) {
-	if scope.Err() != nil {
-		return
+// value: A json-encoded SummaryDescription proto.
+// If not specified, defaults to ""
+func TensorSummaryDescription(value string) TensorSummaryAttr {
+	return func(m optionalAttr) {
+		m["description"] = value
 	}
-	attrs := map[string]interface{}{"shape_type": shape_type}
-	opspec := tf.OpSpec{
-		Type: "TensorListElementShape",
-		Input: []tf.Input{
-			input_handle,
-		},
-		Attrs: attrs,
+}
+
+// TensorSummaryLabels sets the optional labels attribute to value.
+//
+// value: An unused list of strings.
+// If not specified, defaults to <>
+func TensorSummaryLabels(value []string) TensorSummaryAttr {
+	return func(m optionalAttr) {
+		m["labels"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// Returns the item in the list with the given index.
+// TensorSummaryDisplayName sets the optional display_name attribute to value.
 //
-// input_handle: the list
-// index: the position in the list from which an element will be retrieved
-// item: the element at that position
+// value: An unused string.
+// If not specified, defaults to ""
+func TensorSummaryDisplayName(value string) TensorSummaryAttr {
+	return func(m optionalAttr) {
+		m["display_name"] = value
+	}
+}
+
+// Outputs a `Summary` protocol buffer with a tensor.
 //
+// This op is being phased out in favor of TensorSummaryV2, which lets callers pass
+// a tag as well as a serialized SummaryMetadata proto string that contains
+// plugin-specific data. We will keep this op to maintain backwards compatibility.
 //
-func TensorListGetItem(scope *Scope, input_handle tf.Output, index tf.Output, element_dtype tf.DataType) (item tf.Output) {
+// Arguments:
+//	tensor: A tensor to serialize.
+func TensorSummary(scope *Scope, tensor tf.Output, optional ...TensorSummaryAttr) (summary tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"element_dtype": element_dtype}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "TensorListGetItem",
+		Type: "TensorSummary",
 		Input: []tf.Input{
-			input_handle, index,
+			tensor,
 		},
 		Attrs: attrs,
 	}
@@ -21661,233 +21441,204 @@ func TensorListGetItem(scope *Scope, input_handle tf.Output, index tf.Output, el
 	return op.Output(0)
 }
 
-// Computes the matrix exponential of one or more square matrices:
-//
-// exp(A) = \sum_{n=0}^\infty A^n/n!
-//
-// The exponential is computed using a combination of the scaling and squaring
-// method and the Pade approximation. Details can be founds in:
-// Nicholas J. Higham, "The scaling and squaring method for the matrix exponential
-// revisited," SIAM J. Matrix Anal. Applic., 26:1179-1193, 2005.
-//
-// The input is a tensor of shape `[..., M, M]` whose inner-most 2 dimensions
-// form square matrices. The output is a tensor of the same shape as the input
-// containing the exponential for all input submatrices `[..., :, :]`.
-//
-// Arguments:
-//	input: Shape is `[..., M, M]`.
-//
-// Returns Shape is `[..., M, M]`.
+// Computes the gradient for the tanh of `x` wrt its input.
 //
-// @compatibility(scipy)
-// Equivalent to scipy.linalg.expm
-// @end_compatibility
-func MatrixExponential(scope *Scope, input tf.Output) (output tf.Output) {
+// Specifically, `grad = dy * (1 - y*y)`, where `y = tanh(x)`, and `dy`
+// is the corresponding input gradient.
+func TanhGrad(scope *Scope, y tf.Output, dy tf.Output) (z tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "MatrixExponential",
+		Type: "TanhGrad",
 		Input: []tf.Input{
-			input,
+			y, dy,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Computes the matrix logarithm of one or more square matrices:
+// Reduces sparse updates into the variable referenced by `resource` using the `max` operation.
 //
+// This operation computes
 //
-// log(exp(A)) = A
+//     # Scalar indices
+//     ref[indices, ...] = max(ref[indices, ...], updates[...])
 //
-// This op is only defined for complex matrices. If A is positive-definite and
-// real, then casting to a complex matrix, taking the logarithm and casting back
-// to a real matrix will give the correct result.
+//     # Vector indices (for each i)
+//     ref[indices[i], ...] = max(ref[indices[i], ...], updates[i, ...])
 //
-// This function computes the matrix logarithm using the Schur-Parlett algorithm.
-// Details of the algorithm can be found in Section 11.6.2 of:
-// Nicholas J. Higham, Functions of Matrices: Theory and Computation, SIAM 2008.
-// ISBN 978-0-898716-46-7.
+//     # High rank indices (for each i, ..., j)
+//     ref[indices[i, ..., j], ...] = max(ref[indices[i, ..., j], ...], updates[i, ..., j, ...])
 //
-// The input is a tensor of shape `[..., M, M]` whose inner-most 2 dimensions
-// form square matrices. The output is a tensor of the same shape as the input
-// containing the exponential for all input submatrices `[..., :, :]`.
+// Duplicate entries are handled correctly: if multiple `indices` reference
+// the same location, their contributions are combined.
 //
-// Arguments:
-//	input: Shape is `[..., M, M]`.
+// Requires `updates.shape = indices.shape + ref.shape[1:]` or `updates.shape = []`.
 //
-// Returns Shape is `[..., M, M]`.
+// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
+// <img style="width:100%" src='https://www.tensorflow.org/images/ScatterAdd.png' alt>
+// </div>
 //
-// @compatibility(scipy)
-// Equivalent to scipy.linalg.logm
-// @end_compatibility
-func MatrixLogarithm(scope *Scope, input tf.Output) (output tf.Output) {
+// Arguments:
+//	resource: Should be from a `Variable` node.
+//	indices: A tensor of indices into the first dimension of `ref`.
+//	updates: A tensor of updated values to add to `ref`.
+//
+// Returns the created operation.
+func ResourceScatterMax(scope *Scope, resource tf.Output, indices tf.Output, updates tf.Output) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "MatrixLogarithm",
+		Type: "ResourceScatterMax",
 		Input: []tf.Input{
-			input,
+			resource, indices, updates,
 		},
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// QueueDequeueUpToV2Attr is an optional argument to QueueDequeueUpToV2.
-type QueueDequeueUpToV2Attr func(optionalAttr)
-
-// QueueDequeueUpToV2TimeoutMs sets the optional timeout_ms attribute to value.
-//
-// value: If the queue has fewer than n elements, this operation
-// will block for up to timeout_ms milliseconds.
-// Note: This option is not supported yet.
-// If not specified, defaults to -1
-func QueueDequeueUpToV2TimeoutMs(value int64) QueueDequeueUpToV2Attr {
-	return func(m optionalAttr) {
-		m["timeout_ms"] = value
-	}
+	return scope.AddOperation(opspec)
 }
 
-// Dequeues `n` tuples of one or more tensors from the given queue.
-//
-// This operation is not supported by all queues.  If a queue does not support
-// DequeueUpTo, then an Unimplemented error is returned.
-//
-// If the queue is closed and there are more than 0 but less than `n`
-// elements remaining, then instead of returning an OutOfRange error like
-// QueueDequeueMany, less than `n` elements are returned immediately.  If
-// the queue is closed and there are 0 elements left in the queue, then
-// an OutOfRange error is returned just like in QueueDequeueMany.
-// Otherwise the behavior is identical to QueueDequeueMany:
-//
-// This operation concatenates queue-element component tensors along the
-// 0th dimension to make a single component tensor.  All of the components
-// in the dequeued tuple will have size n in the 0th dimension.
+// Outputs a `Summary` protocol buffer with scalar values.
 //
-// This operation has `k` outputs, where `k` is the number of components in
-// the tuples stored in the given queue, and output `i` is the ith
-// component of the dequeued tuple.
+// The input `tags` and `values` must have the same shape.  The generated summary
+// has a summary value for each tag-value pair in `tags` and `values`.
 //
 // Arguments:
-//	handle: The handle to a queue.
-//	n: The number of tuples to dequeue.
-//	component_types: The type of each component in a tuple.
+//	tags: Tags for the summary.
+//	values: Same shape as `tags.  Values for the summary.
 //
-// Returns One or more tensors that were dequeued as a tuple.
-func QueueDequeueUpToV2(scope *Scope, handle tf.Output, n tf.Output, component_types []tf.DataType, optional ...QueueDequeueUpToV2Attr) (components []tf.Output) {
+// Returns Scalar.  Serialized `Summary` protocol buffer.
+func ScalarSummary(scope *Scope, tags tf.Output, values tf.Output) (summary tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"component_types": component_types}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "QueueDequeueUpToV2",
+		Type: "ScalarSummary",
 		Input: []tf.Input{
-			handle, n,
+			tags, values,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	if scope.Err() != nil {
-		return
-	}
-	var idx int
-	var err error
-	if components, idx, err = makeOutputList(op, idx, "components"); err != nil {
-		scope.UpdateErr("QueueDequeueUpToV2", err)
-		return
-	}
-	return components
+	return op.Output(0)
 }
 
-// Computes the Cholesky decomposition of one or more square matrices.
-//
-// The input is a tensor of shape `[..., M, M]` whose inner-most 2 dimensions
-// form square matrices.
-//
-// The input has to be symmetric and positive definite. Only the lower-triangular
-// part of the input will be used for this operation. The upper-triangular part
-// will not be read.
+// Outputs a `Summary` protocol buffer with a histogram.
 //
-// The output is a tensor of the same shape as the input
-// containing the Cholesky decompositions for all input submatrices `[..., :, :]`.
+// The generated
+// [`Summary`](https://www.tensorflow.org/code/tensorflow/core/framework/summary.proto)
+// has one summary value containing a histogram for `values`.
 //
-// **Note**: The gradient computation on GPU is faster for large matrices but
-// not for large batch dimensions when the submatrices are small. In this
-// case it might be faster to use the CPU.
+// This op reports an `InvalidArgument` error if any value is not finite.
 //
 // Arguments:
-//	input: Shape is `[..., M, M]`.
+//	tag: Scalar.  Tag to use for the `Summary.Value`.
+//	values: Any shape. Values to use to build the histogram.
 //
-// Returns Shape is `[..., M, M]`.
-func Cholesky(scope *Scope, input tf.Output) (output tf.Output) {
+// Returns Scalar. Serialized `Summary` protocol buffer.
+func HistogramSummary(scope *Scope, tag tf.Output, values tf.Output) (summary tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "Cholesky",
+		Type: "HistogramSummary",
 		Input: []tf.Input{
-			input,
+			tag, values,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Writes contents to the file at input filename. Creates file and recursively
-//
-// creates directory if not existing.
+// Computes the number of elements in the given queue.
 //
 // Arguments:
-//	filename: scalar. The name of the file to which we write the contents.
-//	contents: scalar. The content to be written to the output file.
+//	handle: The handle to a queue.
 //
-// Returns the created operation.
-func WriteFile(scope *Scope, filename tf.Output, contents tf.Output) (o *tf.Operation) {
+// Returns The number of elements in the given queue.
+func QueueSizeV2(scope *Scope, handle tf.Output) (size tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "WriteFile",
+		Type: "QueueSizeV2",
 		Input: []tf.Input{
-			filename, contents,
+			handle,
 		},
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// AllAttr is an optional argument to All.
-type AllAttr func(optionalAttr)
+// ImageSummaryAttr is an optional argument to ImageSummary.
+type ImageSummaryAttr func(optionalAttr)
 
-// AllKeepDims sets the optional keep_dims attribute to value.
+// ImageSummaryMaxImages sets the optional max_images attribute to value.
 //
-// value: If true, retain reduced dimensions with length 1.
-// If not specified, defaults to false
-func AllKeepDims(value bool) AllAttr {
+// value: Max number of batch elements to generate images for.
+// If not specified, defaults to 3
+//
+// REQUIRES: value >= 1
+func ImageSummaryMaxImages(value int64) ImageSummaryAttr {
 	return func(m optionalAttr) {
-		m["keep_dims"] = value
+		m["max_images"] = value
 	}
 }
 
-// Computes the "logical and" of elements across dimensions of a tensor.
+// ImageSummaryBadColor sets the optional bad_color attribute to value.
 //
-// Reduces `input` along the dimensions given in `axis`. Unless
-// `keep_dims` is true, the rank of the tensor is reduced by 1 for each entry in
-// `axis`. If `keep_dims` is true, the reduced dimensions are
-// retained with length 1.
+// value: Color to use for pixels with non-finite values.
+// If not specified, defaults to <dtype:DT_UINT8 tensor_shape:<dim:<size:4 > > int_val:255 int_val:0 int_val:0 int_val:255 >
+func ImageSummaryBadColor(value tf.Tensor) ImageSummaryAttr {
+	return func(m optionalAttr) {
+		m["bad_color"] = value
+	}
+}
+
+// Outputs a `Summary` protocol buffer with images.
+//
+// The summary has up to `max_images` summary values containing images. The
+// images are built from `tensor` which must be 4-D with shape `[batch_size,
+// height, width, channels]` and where `channels` can be:
+//
+// *  1: `tensor` is interpreted as Grayscale.
+// *  3: `tensor` is interpreted as RGB.
+// *  4: `tensor` is interpreted as RGBA.
+//
+// The images have the same number of channels as the input tensor. For float
+// input, the values are normalized one image at a time to fit in the range
+// `[0, 255]`.  `uint8` values are unchanged.  The op uses two different
+// normalization algorithms:
+//
+// *  If the input values are all positive, they are rescaled so the largest one
+//    is 255.
+//
+// *  If any input value is negative, the values are shifted so input value 0.0
+//    is at 127.  They are then rescaled so that either the smallest value is 0,
+//    or the largest one is 255.
+//
+// The `tag` argument is a scalar `Tensor` of type `string`.  It is used to
+// build the `tag` of the summary values:
+//
+// *  If `max_images` is 1, the summary value tag is '*tag*/image'.
+// *  If `max_images` is greater than 1, the summary value tags are
+//    generated sequentially as '*tag*/image/0', '*tag*/image/1', etc.
+//
+// The `bad_color` argument is the color to use in the generated images for
+// non-finite input values.  It is a `uint8` 1-D tensor of length `channels`.
+// Each element must be in the range `[0, 255]` (It represents the value of a
+// pixel in the output image).  Non-finite values in the input tensor are
+// replaced by this tensor in the output image.  The default value is the color
+// red.
 //
 // Arguments:
-//	input: The tensor to reduce.
-//	axis: The dimensions to reduce. Must be in the range
-// `[-rank(input), rank(input))`.
+//	tag: Scalar. Used to build the `tag` attribute of the summary values.
+//	tensor: 4-D of shape `[batch_size, height, width, channels]` where
+// `channels` is 1, 3, or 4.
 //
-// Returns The reduced tensor.
-func All(scope *Scope, input tf.Output, axis tf.Output, optional ...AllAttr) (output tf.Output) {
+// Returns Scalar. Serialized `Summary` protocol buffer.
+func ImageSummary(scope *Scope, tag tf.Output, tensor tf.Output, optional ...ImageSummaryAttr) (summary tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -21896,9 +21647,9 @@ func All(scope *Scope, input tf.Output, axis tf.Output, optional ...AllAttr) (ou
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "All",
+		Type: "ImageSummary",
 		Input: []tf.Input{
-			input, axis,
+			tag, tensor,
 		},
 		Attrs: attrs,
 	}
@@ -21906,239 +21657,215 @@ func All(scope *Scope, input tf.Output, axis tf.Output, optional ...AllAttr) (ou
 	return op.Output(0)
 }
 
-// Computes the Eigen Decomposition of a batch of square self-adjoint matrices.
+// AudioSummaryV2Attr is an optional argument to AudioSummaryV2.
+type AudioSummaryV2Attr func(optionalAttr)
+
+// AudioSummaryV2MaxOutputs sets the optional max_outputs attribute to value.
 //
-// DEPRECATED at GraphDef version 11: Use SelfAdjointEigV2 instead.
+// value: Max number of batch elements to generate audio for.
+// If not specified, defaults to 3
 //
-// The input is a tensor of shape `[..., M, M]` whose inner-most 2 dimensions
-// form square matrices, with the same constraints as the single matrix
-// SelfAdjointEig.
+// REQUIRES: value >= 1
+func AudioSummaryV2MaxOutputs(value int64) AudioSummaryV2Attr {
+	return func(m optionalAttr) {
+		m["max_outputs"] = value
+	}
+}
+
+// Outputs a `Summary` protocol buffer with audio.
 //
-// The result is a [..., M+1, M] matrix with [..., 0,:] containing the
-// eigenvalues, and subsequent [...,1:, :] containing the eigenvectors. The eigenvalues
-// are sorted in non-decreasing order.
+// The summary has up to `max_outputs` summary values containing audio. The
+// audio is built from `tensor` which must be 3-D with shape `[batch_size,
+// frames, channels]` or 2-D with shape `[batch_size, frames]`. The values are
+// assumed to be in the range of `[-1.0, 1.0]` with a sample rate of `sample_rate`.
+//
+// The `tag` argument is a scalar `Tensor` of type `string`.  It is used to
+// build the `tag` of the summary values:
+//
+// *  If `max_outputs` is 1, the summary value tag is '*tag*/audio'.
+// *  If `max_outputs` is greater than 1, the summary value tags are
+//    generated sequentially as '*tag*/audio/0', '*tag*/audio/1', etc.
 //
 // Arguments:
-//	input: Shape is `[..., M, M]`.
+//	tag: Scalar. Used to build the `tag` attribute of the summary values.
+//	tensor: 2-D of shape `[batch_size, frames]`.
+//	sample_rate: The sample rate of the signal in hertz.
 //
-// Returns Shape is `[..., M+1, M]`.
-func SelfAdjointEig(scope *Scope, input tf.Output) (output tf.Output) {
+// Returns Scalar. Serialized `Summary` protocol buffer.
+func AudioSummaryV2(scope *Scope, tag tf.Output, tensor tf.Output, sample_rate tf.Output, optional ...AudioSummaryV2Attr) (summary tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "SelfAdjointEig",
+		Type: "AudioSummaryV2",
 		Input: []tf.Input{
-			input,
+			tag, tensor, sample_rate,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Computes softplus gradients for a softplus operation.
-//
-// Arguments:
-//	gradients: The backpropagated gradients to the corresponding softplus operation.
-//	features: The features passed as input to the corresponding softplus operation.
+// AvgPoolAttr is an optional argument to AvgPool.
+type AvgPoolAttr func(optionalAttr)
+
+// AvgPoolDataFormat sets the optional data_format attribute to value.
 //
-// Returns The gradients: `gradients / (1 + exp(-features))`.
-func SoftplusGrad(scope *Scope, gradients tf.Output, features tf.Output) (backprops tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "SoftplusGrad",
-		Input: []tf.Input{
-			gradients, features,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// SelfAdjointEigV2Attr is an optional argument to SelfAdjointEigV2.
-type SelfAdjointEigV2Attr func(optionalAttr)
-
-// SelfAdjointEigV2ComputeV sets the optional compute_v attribute to value.
-//
-// value: If `True` then eigenvectors will be computed and returned in `v`.
-// Otherwise, only the eigenvalues will be computed.
-// If not specified, defaults to true
-func SelfAdjointEigV2ComputeV(value bool) SelfAdjointEigV2Attr {
+// value: Specify the data format of the input and output data. With the
+// default format "NHWC", the data is stored in the order of:
+//     [batch, in_height, in_width, in_channels].
+// Alternatively, the format could be "NCHW", the data storage order of:
+//     [batch, in_channels, in_height, in_width].
+// If not specified, defaults to "NHWC"
+func AvgPoolDataFormat(value string) AvgPoolAttr {
 	return func(m optionalAttr) {
-		m["compute_v"] = value
+		m["data_format"] = value
 	}
 }
 
-// Computes the eigen decomposition of one or more square self-adjoint matrices.
-//
-// Computes the eigenvalues and (optionally) eigenvectors of each inner matrix in
-// `input` such that `input[..., :, :] = v[..., :, :] * diag(e[..., :])`. The eigenvalues
-// are sorted in non-decreasing order.
+// Performs average pooling on the input.
 //
-// ```python
-// # a is a tensor.
-// # e is a tensor of eigenvalues.
-// # v is a tensor of eigenvectors.
-// e, v = self_adjoint_eig(a)
-// e = self_adjoint_eig(a, compute_v=False)
-// ```
+// Each entry in `output` is the mean of the corresponding size `ksize`
+// window in `value`.
 //
 // Arguments:
-//	input: `Tensor` input of shape `[N, N]`.
+//	value: 4-D with shape `[batch, height, width, channels]`.
+//	ksize: The size of the sliding window for each dimension of `value`.
+//	strides: The stride of the sliding window for each dimension of `value`.
+//	padding: The type of padding algorithm to use.
 //
-// Returns Eigenvalues. Shape is `[N]`.Eigenvectors. Shape is `[N, N]`.
-func SelfAdjointEigV2(scope *Scope, input tf.Output, optional ...SelfAdjointEigV2Attr) (e tf.Output, v tf.Output) {
+// Returns The average pooled output tensor.
+func AvgPool(scope *Scope, value tf.Output, ksize []int64, strides []int64, padding string, optional ...AvgPoolAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
+	attrs := map[string]interface{}{"ksize": ksize, "strides": strides, "padding": padding}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "SelfAdjointEigV2",
+		Type: "AvgPool",
 		Input: []tf.Input{
-			input,
+			value,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1)
+	return op.Output(0)
 }
 
-// Adjust the saturation of one or more images.
+// Merges summaries.
 //
-// `images` is a tensor of at least 3 dimensions.  The last dimension is
-// interpretted as channels, and must be three.
+// This op creates a
+// [`Summary`](https://www.tensorflow.org/code/tensorflow/core/framework/summary.proto)
+// protocol buffer that contains the union of all the values in the input
+// summaries.
 //
-// The input image is considered in the RGB colorspace. Conceptually, the RGB
-// colors are first mapped into HSV. A scale is then applied all the saturation
-// values, and then remapped back to RGB colorspace.
+// When the Op is run, it reports an `InvalidArgument` error if multiple values
+// in the summaries to merge use the same tag.
 //
 // Arguments:
-//	images: Images to adjust.  At least 3-D.
-//	scale: A float scale to add to the saturation.
+//	inputs: Can be of any shape.  Each must contain serialized `Summary` protocol
+// buffers.
 //
-// Returns The hue-adjusted image or images.
-func AdjustSaturation(scope *Scope, images tf.Output, scale tf.Output) (output tf.Output) {
+// Returns Scalar. Serialized `Summary` protocol buffer.
+func MergeSummary(scope *Scope, inputs []tf.Output) (summary tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "AdjustSaturation",
+		Type: "MergeSummary",
 		Input: []tf.Input{
-			images, scale,
+			tf.OutputList(inputs),
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// SvdAttr is an optional argument to Svd.
-type SvdAttr func(optionalAttr)
-
-// SvdComputeUv sets the optional compute_uv attribute to value.
-//
-// value: If true, left and right singular vectors will be
-// computed and returned in `u` and `v`, respectively.
-// If false, `u` and `v` are not set and should never referenced.
-// If not specified, defaults to true
-func SvdComputeUv(value bool) SvdAttr {
-	return func(m optionalAttr) {
-		m["compute_uv"] = value
-	}
-}
-
-// SvdFullMatrices sets the optional full_matrices attribute to value.
-//
-// value: If true, compute full-sized `u` and `v`. If false
-// (the default), compute only the leading `P` singular vectors.
-// Ignored if `compute_uv` is `False`.
-// If not specified, defaults to false
-func SvdFullMatrices(value bool) SvdAttr {
-	return func(m optionalAttr) {
-		m["full_matrices"] = value
-	}
-}
-
-// Computes the singular value decompositions of one or more matrices.
-//
-// Computes the SVD of each inner matrix in `input` such that
-// `input[..., :, :] = u[..., :, :] * diag(s[..., :, :]) * transpose(v[..., :, :])`
-//
-// ```python
-// # a is a tensor containing a batch of matrices.
-// # s is a tensor of singular values for each matrix.
-// # u is the tensor containing of left singular vectors for each matrix.
-// # v is the tensor containing of right singular vectors for each matrix.
-// s, u, v = svd(a)
-// s, _, _ = svd(a, compute_uv=False)
-// ```
+// Computes the gradient of morphological 2-D dilation with respect to the filter.
 //
 // Arguments:
-//	input: A tensor of shape `[..., M, N]` whose inner-most 2 dimensions
-// form matrices of size `[M, N]`. Let `P` be the minimum of `M` and `N`.
+//	input: 4-D with shape `[batch, in_height, in_width, depth]`.
+//	filter: 3-D with shape `[filter_height, filter_width, depth]`.
+//	out_backprop: 4-D with shape `[batch, out_height, out_width, depth]`.
+//	strides: 1-D of length 4. The stride of the sliding window for each dimension of
+// the input tensor. Must be: `[1, stride_height, stride_width, 1]`.
+//	rates: 1-D of length 4. The input stride for atrous morphological dilation.
+// Must be: `[1, rate_height, rate_width, 1]`.
+//	padding: The type of padding algorithm to use.
 //
-// Returns Singular values. Shape is `[..., P]`.Left singular vectors. If `full_matrices` is `False` then shape is
-// `[..., M, P]`; if `full_matrices` is `True` then shape is
-// `[..., M, M]`. Undefined if `compute_uv` is `False`.Left singular vectors. If `full_matrices` is `False` then shape is
-// `[..., N, P]`. If `full_matrices` is `True` then shape is `[..., N, N]`.
-// Undefined if `compute_uv` is false.
-func Svd(scope *Scope, input tf.Output, optional ...SvdAttr) (s tf.Output, u tf.Output, v tf.Output) {
+// Returns 3-D with shape `[filter_height, filter_width, depth]`.
+func Dilation2DBackpropFilter(scope *Scope, input tf.Output, filter tf.Output, out_backprop tf.Output, strides []int64, rates []int64, padding string) (filter_backprop tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
+	attrs := map[string]interface{}{"strides": strides, "rates": rates, "padding": padding}
 	opspec := tf.OpSpec{
-		Type: "Svd",
+		Type: "Dilation2DBackpropFilter",
 		Input: []tf.Input{
-			input,
+			input, filter, out_backprop,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
+	return op.Output(0)
 }
 
-// QueueEnqueueManyV2Attr is an optional argument to QueueEnqueueManyV2.
-type QueueEnqueueManyV2Attr func(optionalAttr)
+// AddSparseToTensorsMapAttr is an optional argument to AddSparseToTensorsMap.
+type AddSparseToTensorsMapAttr func(optionalAttr)
 
-// QueueEnqueueManyV2TimeoutMs sets the optional timeout_ms attribute to value.
+// AddSparseToTensorsMapContainer sets the optional container attribute to value.
 //
-// value: If the queue is too full, this operation will block for up
-// to timeout_ms milliseconds.
-// Note: This option is not supported yet.
-// If not specified, defaults to -1
-func QueueEnqueueManyV2TimeoutMs(value int64) QueueEnqueueManyV2Attr {
+// value: The container name for the `SparseTensorsMap` created by this op.
+// If not specified, defaults to ""
+func AddSparseToTensorsMapContainer(value string) AddSparseToTensorsMapAttr {
 	return func(m optionalAttr) {
-		m["timeout_ms"] = value
+		m["container"] = value
 	}
 }
 
-// Enqueues zero or more tuples of one or more tensors in the given queue.
+// AddSparseToTensorsMapSharedName sets the optional shared_name attribute to value.
 //
-// This operation slices each component tensor along the 0th dimension to
-// make multiple queue elements. All of the tuple components must have the
-// same size in the 0th dimension.
+// value: The shared name for the `SparseTensorsMap` created by this op.
+// If blank, the new Operation's unique name is used.
+// If not specified, defaults to ""
+func AddSparseToTensorsMapSharedName(value string) AddSparseToTensorsMapAttr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
+	}
+}
+
+// Add a `SparseTensor` to a `SparseTensorsMap` return its handle.
 //
-// The components input has k elements, which correspond to the components of
-// tuples stored in the given queue.
+// A `SparseTensor` is represented by three tensors: `sparse_indices`,
+// `sparse_values`, and `sparse_shape`.
 //
-// N.B. If the queue is full, this operation will block until the given
-// elements have been enqueued (or 'timeout_ms' elapses, if specified).
+// This operator takes the given `SparseTensor` and adds it to a container
+// object (a `SparseTensorsMap`).  A unique key within this container is generated
+// in the form of an `int64`, and this is the value that is returned.
+//
+// The `SparseTensor` can then be read out as part of a minibatch by passing
+// the key as a vector element to `TakeManySparseFromTensorsMap`.  To ensure
+// the correct `SparseTensorsMap` is accessed, ensure that the same
+// `container` and `shared_name` are passed to that Op.  If no `shared_name`
+// is provided here, instead use the *name* of the Operation created by calling
+// `AddSparseToTensorsMap` as the `shared_name` passed to
+// `TakeManySparseFromTensorsMap`.  Ensure the Operations are colocated.
 //
 // Arguments:
-//	handle: The handle to a queue.
-//	components: One or more tensors from which the enqueued tensors should
-// be taken.
+//	sparse_indices: 2-D.  The `indices` of the `SparseTensor`.
+//	sparse_values: 1-D.  The `values` of the `SparseTensor`.
+//	sparse_shape: 1-D.  The `shape` of the `SparseTensor`.
 //
-// Returns the created operation.
-func QueueEnqueueManyV2(scope *Scope, handle tf.Output, components []tf.Output, optional ...QueueEnqueueManyV2Attr) (o *tf.Operation) {
+// Returns 0-D.  The handle of the `SparseTensor` now stored in the
+// `SparseTensorsMap`.
+func AddSparseToTensorsMap(scope *Scope, sparse_indices tf.Output, sparse_values tf.Output, sparse_shape tf.Output, optional ...AddSparseToTensorsMapAttr) (sparse_handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -22147,435 +21874,302 @@ func QueueEnqueueManyV2(scope *Scope, handle tf.Output, components []tf.Output,
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "QueueEnqueueManyV2",
+		Type: "AddSparseToTensorsMap",
 		Input: []tf.Input{
-			handle, tf.OutputList(components),
+			sparse_indices, sparse_values, sparse_shape,
 		},
 		Attrs: attrs,
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// Computes the product along segments of a tensor.
-//
-// Read @{$math_ops#Segmentation$the section on segmentation} for an explanation of
-// segments.
-//
-// Computes a tensor such that
-// \\(output_i = \prod_j data_j\\) where the product is over `j` such
-// that `segment_ids[j] == i`.
-//
-// If the product is empty for a given segment ID `i`, `output[i] = 1`.
-//
-// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
-// <img style="width:100%" src="https://www.tensorflow.org/images/SegmentProd.png" alt>
-// </div>
-//
-// Arguments:
-//
-//	segment_ids: A 1-D tensor whose rank is equal to the rank of `data`'s
-// first dimension.  Values should be sorted and can be repeated.
+// Returns a list list which has the passed-in `Tensor` as last element and the other elements of the given list in `input_handle`.
 //
-// Returns Has same shape as data, except for dimension 0 which
-// has size `k`, the number of segments.
-func SegmentProd(scope *Scope, data tf.Output, segment_ids tf.Output) (output tf.Output) {
+// tensor: The tensor to put on the list.
+// input_handle: The old list.
+// output_handle: A list with the elements of the old list followed by tensor.
+// element_dtype: the type of elements in the list.
+// element_shape: a shape compatible with that of elements in the list.
+func TensorListPushBack(scope *Scope, input_handle tf.Output, tensor tf.Output) (output_handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "SegmentProd",
+		Type: "TensorListPushBack",
 		Input: []tf.Input{
-			data, segment_ids,
+			input_handle, tensor,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Converts one or more images from RGB to HSV.
+// Returns the number of tensors in the input tensor list.
 //
-// Outputs a tensor of the same shape as the `images` tensor, containing the HSV
-// value of the pixels. The output is only well defined if the value in `images`
-// are in `[0,1]`.
-//
-// `output[..., 0]` contains hue, `output[..., 1]` contains saturation, and
-// `output[..., 2]` contains value. All HSV values are in `[0,1]`. A hue of 0
-// corresponds to pure red, hue 1/3 is pure green, and 2/3 is pure blue.
-//
-// Arguments:
-//	images: 1-D or higher rank. RGB data to convert. Last dimension must be size 3.
-//
-// Returns `images` converted to HSV.
-func RGBToHSV(scope *Scope, images tf.Output) (output tf.Output) {
+// input_handle: the input list
+// length: the number of tensors in the list
+func TensorListLength(scope *Scope, input_handle tf.Output) (length tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "RGBToHSV",
+		Type: "TensorListLength",
 		Input: []tf.Input{
-			images,
+			input_handle,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Does nothing. Only useful as a placeholder for control edges.
+// The shape of the elements of the given list, as a tensor.
 //
-// Returns the created operation.
-func NoOp(scope *Scope) (o *tf.Operation) {
+//   input_handle: the list
+//   element_shape: the shape of elements of the list
+func TensorListElementShape(scope *Scope, input_handle tf.Output, shape_type tf.DataType) (element_shape tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"shape_type": shape_type}
 	opspec := tf.OpSpec{
-		Type: "NoOp",
-	}
-	return scope.AddOperation(opspec)
-}
-
-// MergeV2CheckpointsAttr is an optional argument to MergeV2Checkpoints.
-type MergeV2CheckpointsAttr func(optionalAttr)
-
-// MergeV2CheckpointsDeleteOldDirs sets the optional delete_old_dirs attribute to value.
-//
-// value: see above.
-// If not specified, defaults to true
-func MergeV2CheckpointsDeleteOldDirs(value bool) MergeV2CheckpointsAttr {
-	return func(m optionalAttr) {
-		m["delete_old_dirs"] = value
+		Type: "TensorListElementShape",
+		Input: []tf.Input{
+			input_handle,
+		},
+		Attrs: attrs,
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// V2 format specific: merges the metadata files of sharded checkpoints.  The
-//
-// result is one logical checkpoint, with one physical metadata file and renamed
-// data files.
-//
-// Intended for "grouping" multiple checkpoints in a sharded checkpoint setup.
+// Returns the item in the list with the given index.
 //
-// If delete_old_dirs is true, attempts to delete recursively the dirname of each
-// path in the input checkpoint_prefixes.  This is useful when those paths are non
-// user-facing temporary locations.
+// input_handle: the list
+// index: the position in the list from which an element will be retrieved
+// item: the element at that position
 //
-// Arguments:
-//	checkpoint_prefixes: prefixes of V2 checkpoints to merge.
-//	destination_prefix: scalar.  The desired final prefix.  Allowed to be the same
-// as one of the checkpoint_prefixes.
 //
-// Returns the created operation.
-func MergeV2Checkpoints(scope *Scope, checkpoint_prefixes tf.Output, destination_prefix tf.Output, optional ...MergeV2CheckpointsAttr) (o *tf.Operation) {
+func TensorListGetItem(scope *Scope, input_handle tf.Output, index tf.Output, element_dtype tf.DataType) (item tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
+	attrs := map[string]interface{}{"element_dtype": element_dtype}
 	opspec := tf.OpSpec{
-		Type: "MergeV2Checkpoints",
+		Type: "TensorListGetItem",
 		Input: []tf.Input{
-			checkpoint_prefixes, destination_prefix,
+			input_handle, index,
 		},
 		Attrs: attrs,
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// Saves input tensors slices to disk.
-//
-// This is like `Save` except that tensors can be listed in the saved file as being
-// a slice of a larger tensor.  `shapes_and_slices` specifies the shape of the
-// larger tensor and the slice that this tensor covers. `shapes_and_slices` must
-// have as many elements as `tensor_names`.
+// Returns a diagonal tensor with a given diagonal values.
 //
-// Elements of the `shapes_and_slices` input must either be:
+// Given a `diagonal`, this operation returns a tensor with the `diagonal` and
+// everything else padded with zeros. The diagonal is computed as follows:
 //
-// *  The empty string, in which case the corresponding tensor is
-//    saved normally.
-// *  A string of the form `dim0 dim1 ... dimN-1 slice-spec` where the
-//    `dimI` are the dimensions of the larger tensor and `slice-spec`
-//    specifies what part is covered by the tensor to save.
+// Assume `diagonal` has dimensions [D1,..., Dk], then the output is a tensor of
+// rank 2k with dimensions [D1,..., Dk, D1,..., Dk] where:
 //
-// `slice-spec` itself is a `:`-separated list: `slice0:slice1:...:sliceN-1`
-// where each `sliceI` is either:
+// `output[i1,..., ik, i1,..., ik] = diagonal[i1, ..., ik]` and 0 everywhere else.
 //
-// *  The string `-` meaning that the slice covers all indices of this dimension
-// *  `start,length` where `start` and `length` are integers.  In that
-//    case the slice covers `length` indices starting at `start`.
+// For example:
 //
-// See also `Save`.
+// ```
+// # 'diagonal' is [1, 2, 3, 4]
+// tf.diag(diagonal) ==> [[1, 0, 0, 0]
+//                        [0, 2, 0, 0]
+//                        [0, 0, 3, 0]
+//                        [0, 0, 0, 4]]
+// ```
 //
 // Arguments:
-//	filename: Must have a single element. The name of the file to which we write the
-// tensor.
-//	tensor_names: Shape `[N]`. The names of the tensors to be saved.
-//	shapes_and_slices: Shape `[N]`.  The shapes and slice specifications to use when
-// saving the tensors.
-//	data: `N` tensors to save.
-//
-// Returns the created operation.
-func SaveSlices(scope *Scope, filename tf.Output, tensor_names tf.Output, shapes_and_slices tf.Output, data []tf.Output) (o *tf.Operation) {
+//	diagonal: Rank k tensor where k is at most 1.
+func Diag(scope *Scope, diagonal tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "SaveSlices",
+		Type: "Diag",
 		Input: []tf.Input{
-			filename, tensor_names, shapes_and_slices, tf.OutputList(data),
+			diagonal,
 		},
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// DenseToDenseSetOperationAttr is an optional argument to DenseToDenseSetOperation.
-type DenseToDenseSetOperationAttr func(optionalAttr)
+// ParameterizedTruncatedNormalAttr is an optional argument to ParameterizedTruncatedNormal.
+type ParameterizedTruncatedNormalAttr func(optionalAttr)
 
-// DenseToDenseSetOperationValidateIndices sets the optional validate_indices attribute to value.
-// If not specified, defaults to true
-func DenseToDenseSetOperationValidateIndices(value bool) DenseToDenseSetOperationAttr {
+// ParameterizedTruncatedNormalSeed sets the optional seed attribute to value.
+//
+// value: If either `seed` or `seed2` are set to be non-zero, the random number
+// generator is seeded by the given seed.  Otherwise, it is seeded by a
+// random seed.
+// If not specified, defaults to 0
+func ParameterizedTruncatedNormalSeed(value int64) ParameterizedTruncatedNormalAttr {
 	return func(m optionalAttr) {
-		m["validate_indices"] = value
+		m["seed"] = value
 	}
 }
 
-// Applies set operation along last dimension of 2 `Tensor` inputs.
+// ParameterizedTruncatedNormalSeed2 sets the optional seed2 attribute to value.
 //
-// See SetOperationOp::SetOperationFromContext for values of `set_operation`.
+// value: A second seed to avoid seed collision.
+// If not specified, defaults to 0
+func ParameterizedTruncatedNormalSeed2(value int64) ParameterizedTruncatedNormalAttr {
+	return func(m optionalAttr) {
+		m["seed2"] = value
+	}
+}
+
+// Outputs random values from a normal distribution. The parameters may each be a
 //
-// Output `result` is a `SparseTensor` represented by `result_indices`,
-// `result_values`, and `result_shape`. For `set1` and `set2` ranked `n`, this
-// has rank `n` and the same 1st `n-1` dimensions as `set1` and `set2`. The `nth`
-// dimension contains the result of `set_operation` applied to the corresponding
-// `[0...n-1]` dimension of `set`.
+// scalar which applies to the entire output, or a vector of length shape[0] which
+// stores the parameters for each batch.
 //
 // Arguments:
-//	set1: `Tensor` with rank `n`. 1st `n-1` dimensions must be the same as `set2`.
-// Dimension `n` contains values in a set, duplicates are allowed but ignored.
-//	set2: `Tensor` with rank `n`. 1st `n-1` dimensions must be the same as `set1`.
-// Dimension `n` contains values in a set, duplicates are allowed but ignored.
-//
+//	shape: The shape of the output tensor. Batches are indexed by the 0th dimension.
+//	means: The mean parameter of each batch.
+//	stdevs: The standard deviation parameter of each batch. Must be greater than 0.
+//	minvals: The minimum cutoff. May be -infinity.
+//	maxvals: The maximum cutoff. May be +infinity, and must be more than the minval
+// for each batch.
 //
-// Returns 2D indices of a `SparseTensor`.1D values of a `SparseTensor`.1D `Tensor` shape of a `SparseTensor`. `result_shape[0...n-1]` is
-// the same as the 1st `n-1` dimensions of `set1` and `set2`, `result_shape[n]`
-// is the max result set size across all `0...n-1` dimensions.
-func DenseToDenseSetOperation(scope *Scope, set1 tf.Output, set2 tf.Output, set_operation string, optional ...DenseToDenseSetOperationAttr) (result_indices tf.Output, result_values tf.Output, result_shape tf.Output) {
+// Returns A matrix of shape num_batches x samples_per_batch, filled with random
+// truncated normal values using the parameters for each row.
+func ParameterizedTruncatedNormal(scope *Scope, shape tf.Output, means tf.Output, stdevs tf.Output, minvals tf.Output, maxvals tf.Output, optional ...ParameterizedTruncatedNormalAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"set_operation": set_operation}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "DenseToDenseSetOperation",
+		Type: "ParameterizedTruncatedNormal",
 		Input: []tf.Input{
-			set1, set2,
+			shape, means, stdevs, minvals, maxvals,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
+	return op.Output(0)
 }
 
-// Generate a sharded filename. The filename is printf formatted as
+// Sets the index-th position of the list to contain the given tensor.
 //
-//    %s-%05d-of-%05d, basename, shard, num_shards.
-func ShardedFilename(scope *Scope, basename tf.Output, shard tf.Output, num_shards tf.Output) (filename tf.Output) {
+// input_handle: the list
+// index: the position in the list to which the tensor will be assigned
+// item: the element to be assigned to that position
+// output_handle: the new list, with the element in the proper position
+//
+func TensorListSetItem(scope *Scope, input_handle tf.Output, index tf.Output, item tf.Output) (output_handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "ShardedFilename",
+		Type: "TensorListSetItem",
 		Input: []tf.Input{
-			basename, shard, num_shards,
+			input_handle, index, item,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// BatchToSpace for N-D tensors of type T.
-//
-// This operation reshapes the "batch" dimension 0 into `M + 1` dimensions of shape
-// `block_shape + [batch]`, interleaves these blocks back into the grid defined by
-// the spatial dimensions `[1, ..., M]`, to obtain a result with the same rank as
-// the input.  The spatial dimensions of this intermediate result are then
-// optionally cropped according to `crops` to produce the output.  This is the
-// reverse of SpaceToBatch.  See below for a precise description.
-//
-// Arguments:
-//	input: N-D with shape `input_shape = [batch] + spatial_shape + remaining_shape`,
-// where spatial_shape has M dimensions.
-//	block_shape: 1-D with shape `[M]`, all values must be >= 1.
-//	crops: 2-D with shape `[M, 2]`, all values must be >= 0.
-//   `crops[i] = [crop_start, crop_end]` specifies the amount to crop from input
-//   dimension `i + 1`, which corresponds to spatial dimension `i`.  It is
-//   required that
-//   `crop_start[i] + crop_end[i] <= block_shape[i] * input_shape[i + 1]`.
-//
-// This operation is equivalent to the following steps:
-//
-// 1. Reshape `input` to `reshaped` of shape:
-//      [block_shape[0], ..., block_shape[M-1],
-//       batch / prod(block_shape),
-//       input_shape[1], ..., input_shape[N-1]]
+// Computes the matrix exponential of one or more square matrices:
 //
-// 2. Permute dimensions of `reshaped` to produce `permuted` of shape
-//      [batch / prod(block_shape),
+// exp(A) = \sum_{n=0}^\infty A^n/n!
 //
-//       input_shape[1], block_shape[0],
-//       ...,
-//       input_shape[M], block_shape[M-1],
+// The exponential is computed using a combination of the scaling and squaring
+// method and the Pade approximation. Details can be founds in:
+// Nicholas J. Higham, "The scaling and squaring method for the matrix exponential
+// revisited," SIAM J. Matrix Anal. Applic., 26:1179-1193, 2005.
 //
-//       input_shape[M+1], ..., input_shape[N-1]]
+// The input is a tensor of shape `[..., M, M]` whose inner-most 2 dimensions
+// form square matrices. The output is a tensor of the same shape as the input
+// containing the exponential for all input submatrices `[..., :, :]`.
 //
-// 3. Reshape `permuted` to produce `reshaped_permuted` of shape
-//      [batch / prod(block_shape),
+// Arguments:
+//	input: Shape is `[..., M, M]`.
 //
-//       input_shape[1] * block_shape[0],
-//       ...,
-//       input_shape[M] * block_shape[M-1],
+// Returns Shape is `[..., M, M]`.
 //
-//       input_shape[M+1],
-//       ...,
-//       input_shape[N-1]]
-//
-// 4. Crop the start and end of dimensions `[1, ..., M]` of
-//    `reshaped_permuted` according to `crops` to produce the output of shape:
-//      [batch / prod(block_shape),
-//
-//       input_shape[1] * block_shape[0] - crops[0,0] - crops[0,1],
-//       ...,
-//       input_shape[M] * block_shape[M-1] - crops[M-1,0] - crops[M-1,1],
-//
-//       input_shape[M+1], ..., input_shape[N-1]]
-//
-// Some examples:
-//
-// (1) For the following input of shape `[4, 1, 1, 1]`, `block_shape = [2, 2]`, and
-//     `crops = [[0, 0], [0, 0]]`:
-//
-// ```
-// [[[[1]]], [[[2]]], [[[3]]], [[[4]]]]
-// ```
-//
-// The output tensor has shape `[1, 2, 2, 1]` and value:
-//
-// ```
-// x = [[[[1], [2]], [[3], [4]]]]
-// ```
-//
-// (2) For the following input of shape `[4, 1, 1, 3]`, `block_shape = [2, 2]`, and
-//     `crops = [[0, 0], [0, 0]]`:
-//
-// ```
-// [[[1, 2, 3]], [[4, 5, 6]], [[7, 8, 9]], [[10, 11, 12]]]
-// ```
-//
-// The output tensor has shape `[1, 2, 2, 3]` and value:
-//
-// ```
-// x = [[[[1, 2, 3], [4, 5, 6]],
-//       [[7, 8, 9], [10, 11, 12]]]]
-// ```
-//
-// (3) For the following input of shape `[4, 2, 2, 1]`, `block_shape = [2, 2]`, and
-//     `crops = [[0, 0], [0, 0]]`:
-//
-// ```
-// x = [[[[1], [3]], [[9], [11]]],
-//      [[[2], [4]], [[10], [12]]],
-//      [[[5], [7]], [[13], [15]]],
-//      [[[6], [8]], [[14], [16]]]]
-// ```
-//
-// The output tensor has shape `[1, 4, 4, 1]` and value:
-//
-// ```
-// x = [[[1],   [2],  [3],  [4]],
-//      [[5],   [6],  [7],  [8]],
-//      [[9],  [10], [11],  [12]],
-//      [[13], [14], [15],  [16]]]
-// ```
-//
-// (4) For the following input of shape `[8, 1, 3, 1]`, `block_shape = [2, 2]`, and
-//     `crops = [[0, 0], [2, 0]]`:
-//
-// ```
-// x = [[[[0], [1], [3]]], [[[0], [9], [11]]],
-//      [[[0], [2], [4]]], [[[0], [10], [12]]],
-//      [[[0], [5], [7]]], [[[0], [13], [15]]],
-//      [[[0], [6], [8]]], [[[0], [14], [16]]]]
-// ```
-//
-// The output tensor has shape `[2, 2, 4, 1]` and value:
-//
-// ```
-// x = [[[[1],   [2],  [3],  [4]],
-//       [[5],   [6],  [7],  [8]]],
-//      [[[9],  [10], [11],  [12]],
-//       [[13], [14], [15],  [16]]]]
-// ```
-func BatchToSpaceND(scope *Scope, input tf.Output, block_shape tf.Output, crops tf.Output) (output tf.Output) {
+// @compatibility(scipy)
+// Equivalent to scipy.linalg.expm
+// @end_compatibility
+func MatrixExponential(scope *Scope, input tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "BatchToSpaceND",
+		Type: "MatrixExponential",
 		Input: []tf.Input{
-			input, block_shape, crops,
+			input,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// UnpackAttr is an optional argument to Unpack.
-type UnpackAttr func(optionalAttr)
+// QueueDequeueUpToV2Attr is an optional argument to QueueDequeueUpToV2.
+type QueueDequeueUpToV2Attr func(optionalAttr)
 
-// UnpackAxis sets the optional axis attribute to value.
+// QueueDequeueUpToV2TimeoutMs sets the optional timeout_ms attribute to value.
 //
-// value: Dimension along which to unpack.  Negative values wrap around, so the
-// valid range is `[-R, R)`.
-// If not specified, defaults to 0
-func UnpackAxis(value int64) UnpackAttr {
+// value: If the queue has fewer than n elements, this operation
+// will block for up to timeout_ms milliseconds.
+// Note: This option is not supported yet.
+// If not specified, defaults to -1
+func QueueDequeueUpToV2TimeoutMs(value int64) QueueDequeueUpToV2Attr {
 	return func(m optionalAttr) {
-		m["axis"] = value
+		m["timeout_ms"] = value
 	}
 }
 
-// Unpacks a given dimension of a rank-`R` tensor into `num` rank-`(R-1)` tensors.
+// Dequeues `n` tuples of one or more tensors from the given queue.
 //
-// Unpacks `num` tensors from `value` by chipping it along the `axis` dimension.
-// For example, given a tensor of shape `(A, B, C, D)`;
+// This operation is not supported by all queues.  If a queue does not support
+// DequeueUpTo, then an Unimplemented error is returned.
 //
-// If `axis == 0` then the i'th tensor in `output` is the slice `value[i, :, :, :]`
-//   and each tensor in `output` will have shape `(B, C, D)`. (Note that the
-//   dimension unpacked along is gone, unlike `split`).
+// If the queue is closed and there are more than 0 but less than `n`
+// elements remaining, then instead of returning an OutOfRange error like
+// QueueDequeueMany, less than `n` elements are returned immediately.  If
+// the queue is closed and there are 0 elements left in the queue, then
+// an OutOfRange error is returned just like in QueueDequeueMany.
+// Otherwise the behavior is identical to QueueDequeueMany:
 //
-// If `axis == 1` then the i'th tensor in `output` is the slice `value[:, i, :, :]`
-//   and each tensor in `output` will have shape `(A, C, D)`.
-// Etc.
+// This operation concatenates queue-element component tensors along the
+// 0th dimension to make a single component tensor.  All of the components
+// in the dequeued tuple will have size n in the 0th dimension.
 //
-// This is the opposite of `pack`.
+// This operation has `k` outputs, where `k` is the number of components in
+// the tuples stored in the given queue, and output `i` is the ith
+// component of the dequeued tuple.
 //
 // Arguments:
-//	value: 1-D or higher, with `axis` dimension size equal to `num`.
-//
+//	handle: The handle to a queue.
+//	n: The number of tuples to dequeue.
+//	component_types: The type of each component in a tuple.
 //
-// Returns The list of tensors unpacked from `value`.
-func Unpack(scope *Scope, value tf.Output, num int64, optional ...UnpackAttr) (output []tf.Output) {
+// Returns One or more tensors that were dequeued as a tuple.
+func QueueDequeueUpToV2(scope *Scope, handle tf.Output, n tf.Output, component_types []tf.DataType, optional ...QueueDequeueUpToV2Attr) (components []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"num": num}
+	attrs := map[string]interface{}{"component_types": component_types}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "Unpack",
+		Type: "QueueDequeueUpToV2",
 		Input: []tf.Input{
-			value,
+			handle, n,
 		},
 		Attrs: attrs,
 	}
@@ -22585,295 +22179,273 @@ func Unpack(scope *Scope, value tf.Output, num int64, optional ...UnpackAttr) (o
 	}
 	var idx int
 	var err error
-	if output, idx, err = makeOutputList(op, idx, "output"); err != nil {
-		scope.UpdateErr("Unpack", err)
+	if components, idx, err = makeOutputList(op, idx, "components"); err != nil {
+		scope.UpdateErr("QueueDequeueUpToV2", err)
 		return
 	}
-	return output
+	return components
 }
 
-// Increments variable pointed to by 'resource' until it reaches 'limit'.
+// Computes the Cholesky decomposition of one or more square matrices.
 //
-// Arguments:
-//	resource: Should be from a scalar `Variable` node.
-//	limit: If incrementing ref would bring it above limit, instead generates an
-// 'OutOfRange' error.
+// The input is a tensor of shape `[..., M, M]` whose inner-most 2 dimensions
+// form square matrices.
+//
+// The input has to be symmetric and positive definite. Only the lower-triangular
+// part of the input will be used for this operation. The upper-triangular part
+// will not be read.
 //
+// The output is a tensor of the same shape as the input
+// containing the Cholesky decompositions for all input submatrices `[..., :, :]`.
 //
-// Returns A copy of the input before increment. If nothing else modifies the
-// input, the values produced will all be distinct.
-func ResourceCountUpTo(scope *Scope, resource tf.Output, limit int64, T tf.DataType) (output tf.Output) {
+// **Note**: The gradient computation on GPU is faster for large matrices but
+// not for large batch dimensions when the submatrices are small. In this
+// case it might be faster to use the CPU.
+//
+// Arguments:
+//	input: Shape is `[..., M, M]`.
+//
+// Returns Shape is `[..., M, M]`.
+func Cholesky(scope *Scope, input tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"limit": limit, "T": T}
 	opspec := tf.OpSpec{
-		Type: "ResourceCountUpTo",
+		Type: "Cholesky",
 		Input: []tf.Input{
-			resource,
+			input,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Delete the stack from its resource container.
+// Writes contents to the file at input filename. Creates file and recursively
+//
+// creates directory if not existing.
 //
 // Arguments:
-//	handle: The handle to a stack.
+//	filename: scalar. The name of the file to which we write the contents.
+//	contents: scalar. The content to be written to the output file.
 //
 // Returns the created operation.
-func StackCloseV2(scope *Scope, handle tf.Output) (o *tf.Operation) {
+func WriteFile(scope *Scope, filename tf.Output, contents tf.Output) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "StackCloseV2",
+		Type: "WriteFile",
 		Input: []tf.Input{
-			handle,
+			filename, contents,
 		},
 	}
 	return scope.AddOperation(opspec)
 }
 
-// Generate a glob pattern matching all sharded file names.
-func ShardedFilespec(scope *Scope, basename tf.Output, num_shards tf.Output) (filename tf.Output) {
+// AllAttr is an optional argument to All.
+type AllAttr func(optionalAttr)
+
+// AllKeepDims sets the optional keep_dims attribute to value.
+//
+// value: If true, retain reduced dimensions with length 1.
+// If not specified, defaults to false
+func AllKeepDims(value bool) AllAttr {
+	return func(m optionalAttr) {
+		m["keep_dims"] = value
+	}
+}
+
+// Computes the "logical and" of elements across dimensions of a tensor.
+//
+// Reduces `input` along the dimensions given in `axis`. Unless
+// `keep_dims` is true, the rank of the tensor is reduced by 1 for each entry in
+// `axis`. If `keep_dims` is true, the reduced dimensions are
+// retained with length 1.
+//
+// Arguments:
+//	input: The tensor to reduce.
+//	axis: The dimensions to reduce. Must be in the range
+// `[-rank(input), rank(input))`.
+//
+// Returns The reduced tensor.
+func All(scope *Scope, input tf.Output, axis tf.Output, optional ...AllAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "ShardedFilespec",
+		Type: "All",
 		Input: []tf.Input{
-			basename, num_shards,
+			input, axis,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// TextLineReaderV2Attr is an optional argument to TextLineReaderV2.
-type TextLineReaderV2Attr func(optionalAttr)
-
-// TextLineReaderV2SkipHeaderLines sets the optional skip_header_lines attribute to value.
+// Computes the Eigen Decomposition of a batch of square self-adjoint matrices.
 //
-// value: Number of lines to skip from the beginning of every file.
-// If not specified, defaults to 0
-func TextLineReaderV2SkipHeaderLines(value int64) TextLineReaderV2Attr {
-	return func(m optionalAttr) {
-		m["skip_header_lines"] = value
-	}
-}
-
-// TextLineReaderV2Container sets the optional container attribute to value.
+// DEPRECATED at GraphDef version 11: Use SelfAdjointEigV2 instead.
 //
-// value: If non-empty, this reader is placed in the given container.
-// Otherwise, a default container is used.
-// If not specified, defaults to ""
-func TextLineReaderV2Container(value string) TextLineReaderV2Attr {
-	return func(m optionalAttr) {
-		m["container"] = value
-	}
-}
-
-// TextLineReaderV2SharedName sets the optional shared_name attribute to value.
+// The input is a tensor of shape `[..., M, M]` whose inner-most 2 dimensions
+// form square matrices, with the same constraints as the single matrix
+// SelfAdjointEig.
 //
-// value: If non-empty, this reader is named in the given bucket
-// with this shared_name. Otherwise, the node name is used instead.
-// If not specified, defaults to ""
-func TextLineReaderV2SharedName(value string) TextLineReaderV2Attr {
-	return func(m optionalAttr) {
-		m["shared_name"] = value
+// The result is a [..., M+1, M] matrix with [..., 0,:] containing the
+// eigenvalues, and subsequent [...,1:, :] containing the eigenvectors. The eigenvalues
+// are sorted in non-decreasing order.
+//
+// Arguments:
+//	input: Shape is `[..., M, M]`.
+//
+// Returns Shape is `[..., M+1, M]`.
+func SelfAdjointEig(scope *Scope, input tf.Output) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "SelfAdjointEig",
+		Input: []tf.Input{
+			input,
+		},
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// A Reader that outputs the lines of a file delimited by '\n'.
+// Computes softplus gradients for a softplus operation.
 //
-// Returns The handle to reference the Reader.
-func TextLineReaderV2(scope *Scope, optional ...TextLineReaderV2Attr) (reader_handle tf.Output) {
+// Arguments:
+//	gradients: The backpropagated gradients to the corresponding softplus operation.
+//	features: The features passed as input to the corresponding softplus operation.
+//
+// Returns The gradients: `gradients / (1 + exp(-features))`.
+func SoftplusGrad(scope *Scope, gradients tf.Output, features tf.Output) (backprops tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "TextLineReaderV2",
-
-		Attrs: attrs,
+		Type: "SoftplusGrad",
+		Input: []tf.Input{
+			gradients, features,
+		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// LoadAndRemapMatrixAttr is an optional argument to LoadAndRemapMatrix.
-type LoadAndRemapMatrixAttr func(optionalAttr)
+// SelfAdjointEigV2Attr is an optional argument to SelfAdjointEigV2.
+type SelfAdjointEigV2Attr func(optionalAttr)
 
-// LoadAndRemapMatrixMaxRowsInMemory sets the optional max_rows_in_memory attribute to value.
+// SelfAdjointEigV2ComputeV sets the optional compute_v attribute to value.
 //
-// value: The maximum number of rows to load from the checkpoint at
-// once. If less than or equal to 0, the entire matrix will be loaded into
-// memory. Setting this arg trades increased disk reads for lower memory usage.
-// If not specified, defaults to -1
-func LoadAndRemapMatrixMaxRowsInMemory(value int64) LoadAndRemapMatrixAttr {
+// value: If `True` then eigenvectors will be computed and returned in `v`.
+// Otherwise, only the eigenvalues will be computed.
+// If not specified, defaults to true
+func SelfAdjointEigV2ComputeV(value bool) SelfAdjointEigV2Attr {
 	return func(m optionalAttr) {
-		m["max_rows_in_memory"] = value
+		m["compute_v"] = value
 	}
 }
 
-// Loads a 2-D (matrix) `Tensor` with name `old_tensor_name` from the checkpoint
-//
-// at `ckpt_path` and potentially reorders its rows and columns using the
-// specified remappings.
-//
-// Most users should use one of the wrapper initializers (such as
-// `tf.contrib.framework.load_and_remap_matrix_initializer`) instead of this
-// function directly.
-//
-// The remappings are 1-D tensors with the following properties:
-//
-// * `row_remapping` must have exactly `num_rows` entries. Row `i` of the output
-//   matrix will be initialized from the row corresponding to index
-//   `row_remapping[i]` in the old `Tensor` from the checkpoint.
-// * `col_remapping` must have either 0 entries (indicating that no column
-//   reordering is needed) or `num_cols` entries. If specified, column `j` of the
-//   output matrix will be initialized from the column corresponding to index
-//   `col_remapping[j]` in the old `Tensor` from the checkpoint.
-// * A value of -1 in either of the remappings signifies a "missing" entry. In that
-//   case, values from the `initializing_values` tensor will be used to fill that
-//   missing row or column. If `row_remapping` has `r` missing entries and
-//   `col_remapping` has `c` missing entries, then the following condition must be
-//   true:
-//
-// `(r * num_cols) + (c * num_rows) - (r * c) == len(initializing_values)`
-//
-// The remapping tensors can be generated using the GenerateVocabRemapping op.
+// Computes the eigen decomposition of one or more square self-adjoint matrices.
 //
-// As an example, with row_remapping = [1, 0, -1], col_remapping = [0, 2, -1],
-// initializing_values = [0.5, -0.5, 0.25, -0.25, 42], and w(i, j) representing
-// the value from row i, column j of the old tensor in the checkpoint, the output
-// matrix will look like the following:
+// Computes the eigenvalues and (optionally) eigenvectors of each inner matrix in
+// `input` such that `input[..., :, :] = v[..., :, :] * diag(e[..., :])`. The eigenvalues
+// are sorted in non-decreasing order.
 //
-// [[w(1, 0),  w(1, 2),  0.5],
-//  [w(0, 0),  w(0, 2), -0.5],
-//  [0.25,    -0.25,      42]]
+// ```python
+// # a is a tensor.
+// # e is a tensor of eigenvalues.
+// # v is a tensor of eigenvectors.
+// e, v = self_adjoint_eig(a)
+// e = self_adjoint_eig(a, compute_v=False)
+// ```
 //
 // Arguments:
-//	ckpt_path: Path to the TensorFlow checkpoint (version 2, `TensorBundle`) from
-// which the old matrix `Tensor` will be loaded.
-//	old_tensor_name: Name of the 2-D `Tensor` to load from checkpoint.
-//	row_remapping: An int `Tensor` of row remappings (generally created by
-// `generate_vocab_remapping`).  Even if no row remapping is needed, this must
-// still be an index-valued Tensor (e.g. [0, 1, 2, ...]), or a shifted
-// index-valued `Tensor` (e.g. [8, 9, 10, ...], for partitioned `Variables`).
-//	col_remapping: An int `Tensor` of column remappings (generally created by
-// `generate_vocab_remapping`).  May be a size-0 `Tensor` if only row remapping
-// is to be done (e.g. column ordering is the same).
-//	initializing_values: A float `Tensor` containing  values to fill in for cells
-// in the output matrix that are not loaded from the checkpoint. Length must be
-// exactly the same as the number of missing / new cells.
-//	num_rows: Number of rows (length of the 1st dimension) in the output matrix.
-//	num_cols: Number of columns (length of the 2nd dimension) in the output matrix.
+//	input: `Tensor` input of shape `[N, N]`.
 //
-// Returns Output matrix containing existing values loaded from the
-// checkpoint, and with any missing values filled in from initializing_values.
-func LoadAndRemapMatrix(scope *Scope, ckpt_path tf.Output, old_tensor_name tf.Output, row_remapping tf.Output, col_remapping tf.Output, initializing_values tf.Output, num_rows int64, num_cols int64, optional ...LoadAndRemapMatrixAttr) (output_matrix tf.Output) {
+// Returns Eigenvalues. Shape is `[N]`.Eigenvectors. Shape is `[N, N]`.
+func SelfAdjointEigV2(scope *Scope, input tf.Output, optional ...SelfAdjointEigV2Attr) (e tf.Output, v tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"num_rows": num_rows, "num_cols": num_cols}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "LoadAndRemapMatrix",
+		Type: "SelfAdjointEigV2",
 		Input: []tf.Input{
-			ckpt_path, old_tensor_name, row_remapping, col_remapping, initializing_values,
+			input,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1)
 }
 
-// TFRecordReaderV2Attr is an optional argument to TFRecordReaderV2.
-type TFRecordReaderV2Attr func(optionalAttr)
-
-// TFRecordReaderV2Container sets the optional container attribute to value.
+// Adjust the saturation of one or more images.
 //
-// value: If non-empty, this reader is placed in the given container.
-// Otherwise, a default container is used.
-// If not specified, defaults to ""
-func TFRecordReaderV2Container(value string) TFRecordReaderV2Attr {
-	return func(m optionalAttr) {
-		m["container"] = value
-	}
-}
-
-// TFRecordReaderV2SharedName sets the optional shared_name attribute to value.
+// `images` is a tensor of at least 3 dimensions.  The last dimension is
+// interpretted as channels, and must be three.
 //
-// value: If non-empty, this reader is named in the given bucket
-// with this shared_name. Otherwise, the node name is used instead.
-// If not specified, defaults to ""
-func TFRecordReaderV2SharedName(value string) TFRecordReaderV2Attr {
-	return func(m optionalAttr) {
-		m["shared_name"] = value
-	}
-}
-
-// TFRecordReaderV2CompressionType sets the optional compression_type attribute to value.
-// If not specified, defaults to ""
-func TFRecordReaderV2CompressionType(value string) TFRecordReaderV2Attr {
-	return func(m optionalAttr) {
-		m["compression_type"] = value
-	}
-}
-
-// A Reader that outputs the records from a TensorFlow Records file.
+// The input image is considered in the RGB colorspace. Conceptually, the RGB
+// colors are first mapped into HSV. A scale is then applied all the saturation
+// values, and then remapped back to RGB colorspace.
 //
-// Returns The handle to reference the Reader.
-func TFRecordReaderV2(scope *Scope, optional ...TFRecordReaderV2Attr) (reader_handle tf.Output) {
+// Arguments:
+//	images: Images to adjust.  At least 3-D.
+//	scale: A float scale to add to the saturation.
+//
+// Returns The hue-adjusted image or images.
+func AdjustSaturation(scope *Scope, images tf.Output, scale tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "TFRecordReaderV2",
-
-		Attrs: attrs,
+		Type: "AdjustSaturation",
+		Input: []tf.Input{
+			images, scale,
+		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// QuantizeAndDequantizeV3Attr is an optional argument to QuantizeAndDequantizeV3.
-type QuantizeAndDequantizeV3Attr func(optionalAttr)
-
-// QuantizeAndDequantizeV3SignedInput sets the optional signed_input attribute to value.
-// If not specified, defaults to true
-func QuantizeAndDequantizeV3SignedInput(value bool) QuantizeAndDequantizeV3Attr {
-	return func(m optionalAttr) {
-		m["signed_input"] = value
-	}
-}
+// MatrixSolveAttr is an optional argument to MatrixSolve.
+type MatrixSolveAttr func(optionalAttr)
 
-// QuantizeAndDequantizeV3RangeGiven sets the optional range_given attribute to value.
-// If not specified, defaults to true
-func QuantizeAndDequantizeV3RangeGiven(value bool) QuantizeAndDequantizeV3Attr {
+// MatrixSolveAdjoint sets the optional adjoint attribute to value.
+//
+// value: Boolean indicating whether to solve with `matrix` or its (block-wise)
+// adjoint.
+// If not specified, defaults to false
+func MatrixSolveAdjoint(value bool) MatrixSolveAttr {
 	return func(m optionalAttr) {
-		m["range_given"] = value
+		m["adjoint"] = value
 	}
 }
 
-// Quantizes then dequantizes a tensor.
+// Solves systems of linear equations.
 //
-// This is almost identical to QuantizeAndDequantizeV2, except that num_bits is a
-// tensor, so its value can change during training.
-func QuantizeAndDequantizeV3(scope *Scope, input tf.Output, input_min tf.Output, input_max tf.Output, num_bits tf.Output, optional ...QuantizeAndDequantizeV3Attr) (output tf.Output) {
+// `Matrix` is a tensor of shape `[..., M, M]` whose inner-most 2 dimensions
+// form square matrices. `Rhs` is a tensor of shape `[..., M, K]`. The `output` is
+// a tensor shape `[..., M, K]`.  If `adjoint` is `False` then each output matrix
+// satisfies `matrix[..., :, :] * output[..., :, :] = rhs[..., :, :]`.
+// If `adjoint` is `True` then each output matrix satisfies
+// `adjoint(matrix[..., :, :]) * output[..., :, :] = rhs[..., :, :]`.
+//
+// Arguments:
+//	matrix: Shape is `[..., M, M]`.
+//	rhs: Shape is `[..., M, K]`.
+//
+// Returns Shape is `[..., M, K]`.
+func MatrixSolve(scope *Scope, matrix tf.Output, rhs tf.Output, optional ...MatrixSolveAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -22882,9 +22454,9 @@ func QuantizeAndDequantizeV3(scope *Scope, input tf.Output, input_min tf.Output,
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "QuantizeAndDequantizeV3",
+		Type: "MatrixSolve",
 		Input: []tf.Input{
-			input, input_min, input_max, num_bits,
+			matrix, rhs,
 		},
 		Attrs: attrs,
 	}
@@ -22892,38 +22464,57 @@ func QuantizeAndDequantizeV3(scope *Scope, input tf.Output, input_min tf.Output,
 	return op.Output(0)
 }
 
-// IdentityReaderV2Attr is an optional argument to IdentityReaderV2.
-type IdentityReaderV2Attr func(optionalAttr)
+// SvdAttr is an optional argument to Svd.
+type SvdAttr func(optionalAttr)
 
-// IdentityReaderV2Container sets the optional container attribute to value.
+// SvdComputeUv sets the optional compute_uv attribute to value.
 //
-// value: If non-empty, this reader is placed in the given container.
-// Otherwise, a default container is used.
-// If not specified, defaults to ""
-func IdentityReaderV2Container(value string) IdentityReaderV2Attr {
-	return func(m optionalAttr) {
-		m["container"] = value
+// value: If true, left and right singular vectors will be
+// computed and returned in `u` and `v`, respectively.
+// If false, `u` and `v` are not set and should never referenced.
+// If not specified, defaults to true
+func SvdComputeUv(value bool) SvdAttr {
+	return func(m optionalAttr) {
+		m["compute_uv"] = value
 	}
 }
 
-// IdentityReaderV2SharedName sets the optional shared_name attribute to value.
+// SvdFullMatrices sets the optional full_matrices attribute to value.
 //
-// value: If non-empty, this reader is named in the given bucket
-// with this shared_name. Otherwise, the node name is used instead.
-// If not specified, defaults to ""
-func IdentityReaderV2SharedName(value string) IdentityReaderV2Attr {
+// value: If true, compute full-sized `u` and `v`. If false
+// (the default), compute only the leading `P` singular vectors.
+// Ignored if `compute_uv` is `False`.
+// If not specified, defaults to false
+func SvdFullMatrices(value bool) SvdAttr {
 	return func(m optionalAttr) {
-		m["shared_name"] = value
+		m["full_matrices"] = value
 	}
 }
 
-// A Reader that outputs the queued work as both the key and value.
+// Computes the singular value decompositions of one or more matrices.
 //
-// To use, enqueue strings in a Queue.  ReaderRead will take the front
-// work string and output (work, work).
+// Computes the SVD of each inner matrix in `input` such that
+// `input[..., :, :] = u[..., :, :] * diag(s[..., :, :]) * transpose(v[..., :, :])`
 //
-// Returns The handle to reference the Reader.
-func IdentityReaderV2(scope *Scope, optional ...IdentityReaderV2Attr) (reader_handle tf.Output) {
+// ```python
+// # a is a tensor containing a batch of matrices.
+// # s is a tensor of singular values for each matrix.
+// # u is the tensor containing of left singular vectors for each matrix.
+// # v is the tensor containing of right singular vectors for each matrix.
+// s, u, v = svd(a)
+// s, _, _ = svd(a, compute_uv=False)
+// ```
+//
+// Arguments:
+//	input: A tensor of shape `[..., M, N]` whose inner-most 2 dimensions
+// form matrices of size `[M, N]`. Let `P` be the minimum of `M` and `N`.
+//
+// Returns Singular values. Shape is `[..., P]`.Left singular vectors. If `full_matrices` is `False` then shape is
+// `[..., M, P]`; if `full_matrices` is `True` then shape is
+// `[..., M, M]`. Undefined if `compute_uv` is `False`.Left singular vectors. If `full_matrices` is `False` then shape is
+// `[..., N, P]`. If `full_matrices` is `True` then shape is `[..., N, N]`.
+// Undefined if `compute_uv` is false.
+func Svd(scope *Scope, input tf.Output, optional ...SvdAttr) (s tf.Output, u tf.Output, v tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -22932,37 +22523,50 @@ func IdentityReaderV2(scope *Scope, optional ...IdentityReaderV2Attr) (reader_ha
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "IdentityReaderV2",
-
+		Type: "Svd",
+		Input: []tf.Input{
+			input,
+		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// ResourceApplyGradientDescentAttr is an optional argument to ResourceApplyGradientDescent.
-type ResourceApplyGradientDescentAttr func(optionalAttr)
+// QueueEnqueueManyV2Attr is an optional argument to QueueEnqueueManyV2.
+type QueueEnqueueManyV2Attr func(optionalAttr)
 
-// ResourceApplyGradientDescentUseLocking sets the optional use_locking attribute to value.
+// QueueEnqueueManyV2TimeoutMs sets the optional timeout_ms attribute to value.
 //
-// value: If `True`, the subtraction will be protected by a lock;
-// otherwise the behavior is undefined, but may exhibit less contention.
-// If not specified, defaults to false
-func ResourceApplyGradientDescentUseLocking(value bool) ResourceApplyGradientDescentAttr {
+// value: If the queue is too full, this operation will block for up
+// to timeout_ms milliseconds.
+// Note: This option is not supported yet.
+// If not specified, defaults to -1
+func QueueEnqueueManyV2TimeoutMs(value int64) QueueEnqueueManyV2Attr {
 	return func(m optionalAttr) {
-		m["use_locking"] = value
+		m["timeout_ms"] = value
 	}
 }
 
-// Update '*var' by subtracting 'alpha' * 'delta' from it.
+// Enqueues zero or more tuples of one or more tensors in the given queue.
+//
+// This operation slices each component tensor along the 0th dimension to
+// make multiple queue elements. All of the tuple components must have the
+// same size in the 0th dimension.
+//
+// The components input has k elements, which correspond to the components of
+// tuples stored in the given queue.
+//
+// N.B. If the queue is full, this operation will block until the given
+// elements have been enqueued (or 'timeout_ms' elapses, if specified).
 //
 // Arguments:
-//	var_: Should be from a Variable().
-//	alpha: Scaling factor. Must be a scalar.
-//	delta: The change.
+//	handle: The handle to a queue.
+//	components: One or more tensors from which the enqueued tensors should
+// be taken.
 //
 // Returns the created operation.
-func ResourceApplyGradientDescent(scope *Scope, var_ tf.Output, alpha tf.Output, delta tf.Output, optional ...ResourceApplyGradientDescentAttr) (o *tf.Operation) {
+func QueueEnqueueManyV2(scope *Scope, handle tf.Output, components []tf.Output, optional ...QueueEnqueueManyV2Attr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
@@ -22971,113 +22575,123 @@ func ResourceApplyGradientDescent(scope *Scope, var_ tf.Output, alpha tf.Output,
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ResourceApplyGradientDescent",
+		Type: "QueueEnqueueManyV2",
 		Input: []tf.Input{
-			var_, alpha, delta,
+			handle, tf.OutputList(components),
 		},
 		Attrs: attrs,
 	}
 	return scope.AddOperation(opspec)
 }
 
-// Returns the next record (key, value pair) produced by a Reader.
+// Computes the product along segments of a tensor.
 //
-// Will dequeue from the input queue if necessary (e.g. when the
-// Reader needs to start reading from a new file since it has finished
-// with the previous file).
+// Read @{$math_ops#Segmentation$the section on segmentation} for an explanation of
+// segments.
+//
+// Computes a tensor such that
+// \\(output_i = \prod_j data_j\\) where the product is over `j` such
+// that `segment_ids[j] == i`.
+//
+// If the product is empty for a given segment ID `i`, `output[i] = 1`.
+//
+// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
+// <img style="width:100%" src="https://www.tensorflow.org/images/SegmentProd.png" alt>
+// </div>
 //
 // Arguments:
-//	reader_handle: Handle to a Reader.
-//	queue_handle: Handle to a Queue, with string work items.
 //
-// Returns A scalar.A scalar.
-func ReaderReadV2(scope *Scope, reader_handle tf.Output, queue_handle tf.Output) (key tf.Output, value tf.Output) {
+//	segment_ids: A 1-D tensor whose rank is equal to the rank of `data`'s
+// first dimension.  Values should be sorted and can be repeated.
+//
+// Returns Has same shape as data, except for dimension 0 which
+// has size `k`, the number of segments.
+func SegmentProd(scope *Scope, data tf.Output, segment_ids tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "ReaderReadV2",
+		Type: "SegmentProd",
 		Input: []tf.Input{
-			reader_handle, queue_handle,
+			data, segment_ids,
 		},
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1)
+	return op.Output(0)
 }
 
-// Returns up to `num_records` (key, value) pairs produced by a Reader.
+// Converts one or more images from RGB to HSV.
 //
-// Will dequeue from the input queue if necessary (e.g. when the
-// Reader needs to start reading from a new file since it has finished
-// with the previous file).
-// It may return less than `num_records` even before the last batch.
+// Outputs a tensor of the same shape as the `images` tensor, containing the HSV
+// value of the pixels. The output is only well defined if the value in `images`
+// are in `[0,1]`.
+//
+// `output[..., 0]` contains hue, `output[..., 1]` contains saturation, and
+// `output[..., 2]` contains value. All HSV values are in `[0,1]`. A hue of 0
+// corresponds to pure red, hue 1/3 is pure green, and 2/3 is pure blue.
 //
 // Arguments:
-//	reader_handle: Handle to a `Reader`.
-//	queue_handle: Handle to a `Queue`, with string work items.
-//	num_records: number of records to read from `Reader`.
+//	images: 1-D or higher rank. RGB data to convert. Last dimension must be size 3.
 //
-// Returns A 1-D tensor.A 1-D tensor.
-func ReaderReadUpToV2(scope *Scope, reader_handle tf.Output, queue_handle tf.Output, num_records tf.Output) (keys tf.Output, values tf.Output) {
+// Returns `images` converted to HSV.
+func RGBToHSV(scope *Scope, images tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "ReaderReadUpToV2",
+		Type: "RGBToHSV",
 		Input: []tf.Input{
-			reader_handle, queue_handle, num_records,
+			images,
 		},
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1)
+	return op.Output(0)
 }
 
-// ResourceApplyAdamAttr is an optional argument to ResourceApplyAdam.
-type ResourceApplyAdamAttr func(optionalAttr)
-
-// ResourceApplyAdamUseLocking sets the optional use_locking attribute to value.
+// Does nothing. Only useful as a placeholder for control edges.
 //
-// value: If `True`, updating of the var, m, and v tensors will be protected
-// by a lock; otherwise the behavior is undefined, but may exhibit less
-// contention.
-// If not specified, defaults to false
-func ResourceApplyAdamUseLocking(value bool) ResourceApplyAdamAttr {
-	return func(m optionalAttr) {
-		m["use_locking"] = value
+// Returns the created operation.
+func NoOp(scope *Scope) (o *tf.Operation) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "NoOp",
 	}
+	return scope.AddOperation(opspec)
 }
 
-// ResourceApplyAdamUseNesterov sets the optional use_nesterov attribute to value.
+// MergeV2CheckpointsAttr is an optional argument to MergeV2Checkpoints.
+type MergeV2CheckpointsAttr func(optionalAttr)
+
+// MergeV2CheckpointsDeleteOldDirs sets the optional delete_old_dirs attribute to value.
 //
-// value: If `True`, uses the nesterov update.
-// If not specified, defaults to false
-func ResourceApplyAdamUseNesterov(value bool) ResourceApplyAdamAttr {
+// value: see above.
+// If not specified, defaults to true
+func MergeV2CheckpointsDeleteOldDirs(value bool) MergeV2CheckpointsAttr {
 	return func(m optionalAttr) {
-		m["use_nesterov"] = value
+		m["delete_old_dirs"] = value
 	}
 }
 
-// Update '*var' according to the Adam algorithm.
+// V2 format specific: merges the metadata files of sharded checkpoints.  The
 //
-// lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)
-// m_t <- beta1 * m_{t-1} + (1 - beta1) * g_t
-// v_t <- beta2 * v_{t-1} + (1 - beta2) * g_t * g_t
-// variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)
+// result is one logical checkpoint, with one physical metadata file and renamed
+// data files.
+//
+// Intended for "grouping" multiple checkpoints in a sharded checkpoint setup.
+//
+// If delete_old_dirs is true, attempts to delete recursively the dirname of each
+// path in the input checkpoint_prefixes.  This is useful when those paths are non
+// user-facing temporary locations.
 //
 // Arguments:
-//	var_: Should be from a Variable().
-//	m: Should be from a Variable().
-//	v: Should be from a Variable().
-//	beta1_power: Must be a scalar.
-//	beta2_power: Must be a scalar.
-//	lr: Scaling factor. Must be a scalar.
-//	beta1: Momentum factor. Must be a scalar.
-//	beta2: Momentum factor. Must be a scalar.
-//	epsilon: Ridge term. Must be a scalar.
-//	grad: The gradient.
+//	checkpoint_prefixes: prefixes of V2 checkpoints to merge.
+//	destination_prefix: scalar.  The desired final prefix.  Allowed to be the same
+// as one of the checkpoint_prefixes.
 //
 // Returns the created operation.
-func ResourceApplyAdam(scope *Scope, var_ tf.Output, m tf.Output, v tf.Output, beta1_power tf.Output, beta2_power tf.Output, lr tf.Output, beta1 tf.Output, beta2 tf.Output, epsilon tf.Output, grad tf.Output, optional ...ResourceApplyAdamAttr) (o *tf.Operation) {
+func MergeV2Checkpoints(scope *Scope, checkpoint_prefixes tf.Output, destination_prefix tf.Output, optional ...MergeV2CheckpointsAttr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
@@ -23086,688 +22700,568 @@ func ResourceApplyAdam(scope *Scope, var_ tf.Output, m tf.Output, v tf.Output, b
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ResourceApplyAdam",
+		Type: "MergeV2Checkpoints",
 		Input: []tf.Input{
-			var_, m, v, beta1_power, beta2_power, lr, beta1, beta2, epsilon, grad,
+			checkpoint_prefixes, destination_prefix,
 		},
 		Attrs: attrs,
 	}
 	return scope.AddOperation(opspec)
 }
 
-// Store the input tensor in the state of the current session.
+// Saves input tensors slices to disk.
 //
-// Arguments:
-//	value: The tensor to be stored.
+// This is like `Save` except that tensors can be listed in the saved file as being
+// a slice of a larger tensor.  `shapes_and_slices` specifies the shape of the
+// larger tensor and the slice that this tensor covers. `shapes_and_slices` must
+// have as many elements as `tensor_names`.
 //
-// Returns The handle for the tensor stored in the session state, represented
-// as a ResourceHandle object.
-func GetSessionHandleV2(scope *Scope, value tf.Output) (handle tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "GetSessionHandleV2",
-		Input: []tf.Input{
-			value,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// ResizeBicubicGradAttr is an optional argument to ResizeBicubicGrad.
-type ResizeBicubicGradAttr func(optionalAttr)
-
-// ResizeBicubicGradAlignCorners sets the optional align_corners attribute to value.
+// Elements of the `shapes_and_slices` input must either be:
 //
-// value: If true, the centers of the 4 corner pixels of the input and grad tensors are
-// aligned. Defaults to false.
-// If not specified, defaults to false
-func ResizeBicubicGradAlignCorners(value bool) ResizeBicubicGradAttr {
-	return func(m optionalAttr) {
-		m["align_corners"] = value
-	}
-}
-
-// Computes the gradient of bicubic interpolation.
+// *  The empty string, in which case the corresponding tensor is
+//    saved normally.
+// *  A string of the form `dim0 dim1 ... dimN-1 slice-spec` where the
+//    `dimI` are the dimensions of the larger tensor and `slice-spec`
+//    specifies what part is covered by the tensor to save.
 //
-// Arguments:
-//	grads: 4-D with shape `[batch, height, width, channels]`.
-//	original_image: 4-D with shape `[batch, orig_height, orig_width, channels]`,
-// The image tensor that was resized.
+// `slice-spec` itself is a `:`-separated list: `slice0:slice1:...:sliceN-1`
+// where each `sliceI` is either:
 //
-// Returns 4-D with shape `[batch, orig_height, orig_width, channels]`.
-// Gradients with respect to the input image. Input image must have been
-// float or double.
-func ResizeBicubicGrad(scope *Scope, grads tf.Output, original_image tf.Output, optional ...ResizeBicubicGradAttr) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "ResizeBicubicGrad",
-		Input: []tf.Input{
-			grads, original_image,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// ResizeNearestNeighborAttr is an optional argument to ResizeNearestNeighbor.
-type ResizeNearestNeighborAttr func(optionalAttr)
-
-// ResizeNearestNeighborAlignCorners sets the optional align_corners attribute to value.
+// *  The string `-` meaning that the slice covers all indices of this dimension
+// *  `start,length` where `start` and `length` are integers.  In that
+//    case the slice covers `length` indices starting at `start`.
 //
-// value: If true, the centers of the 4 corner pixels of the input and output tensors are
-// aligned, preserving the values at the corner pixels. Defaults to false.
-// If not specified, defaults to false
-func ResizeNearestNeighborAlignCorners(value bool) ResizeNearestNeighborAttr {
-	return func(m optionalAttr) {
-		m["align_corners"] = value
-	}
-}
-
-// Resize `images` to `size` using nearest neighbor interpolation.
+// See also `Save`.
 //
 // Arguments:
-//	images: 4-D with shape `[batch, height, width, channels]`.
-//	size: = A 1-D int32 Tensor of 2 elements: `new_height, new_width`.  The
-// new size for the images.
+//	filename: Must have a single element. The name of the file to which we write the
+// tensor.
+//	tensor_names: Shape `[N]`. The names of the tensors to be saved.
+//	shapes_and_slices: Shape `[N]`.  The shapes and slice specifications to use when
+// saving the tensors.
+//	data: `N` tensors to save.
 //
-// Returns 4-D with shape
-// `[batch, new_height, new_width, channels]`.
-func ResizeNearestNeighbor(scope *Scope, images tf.Output, size tf.Output, optional ...ResizeNearestNeighborAttr) (resized_images tf.Output) {
+// Returns the created operation.
+func SaveSlices(scope *Scope, filename tf.Output, tensor_names tf.Output, shapes_and_slices tf.Output, data []tf.Output) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "ResizeNearestNeighbor",
+		Type: "SaveSlices",
 		Input: []tf.Input{
-			images, size,
+			filename, tensor_names, shapes_and_slices, tf.OutputList(data),
 		},
-		Attrs: attrs,
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return scope.AddOperation(opspec)
 }
 
-// ResizeNearestNeighborGradAttr is an optional argument to ResizeNearestNeighborGrad.
-type ResizeNearestNeighborGradAttr func(optionalAttr)
+// DenseToDenseSetOperationAttr is an optional argument to DenseToDenseSetOperation.
+type DenseToDenseSetOperationAttr func(optionalAttr)
 
-// ResizeNearestNeighborGradAlignCorners sets the optional align_corners attribute to value.
-//
-// value: If true, the centers of the 4 corner pixels of the input and grad tensors are
-// aligned. Defaults to false.
-// If not specified, defaults to false
-func ResizeNearestNeighborGradAlignCorners(value bool) ResizeNearestNeighborGradAttr {
+// DenseToDenseSetOperationValidateIndices sets the optional validate_indices attribute to value.
+// If not specified, defaults to true
+func DenseToDenseSetOperationValidateIndices(value bool) DenseToDenseSetOperationAttr {
 	return func(m optionalAttr) {
-		m["align_corners"] = value
+		m["validate_indices"] = value
 	}
 }
 
-// Computes the gradient of nearest neighbor interpolation.
+// Applies set operation along last dimension of 2 `Tensor` inputs.
+//
+// See SetOperationOp::SetOperationFromContext for values of `set_operation`.
+//
+// Output `result` is a `SparseTensor` represented by `result_indices`,
+// `result_values`, and `result_shape`. For `set1` and `set2` ranked `n`, this
+// has rank `n` and the same 1st `n-1` dimensions as `set1` and `set2`. The `nth`
+// dimension contains the result of `set_operation` applied to the corresponding
+// `[0...n-1]` dimension of `set`.
 //
 // Arguments:
-//	grads: 4-D with shape `[batch, height, width, channels]`.
-//	size: = A 1-D int32 Tensor of 2 elements: `orig_height, orig_width`. The
-// original input size.
+//	set1: `Tensor` with rank `n`. 1st `n-1` dimensions must be the same as `set2`.
+// Dimension `n` contains values in a set, duplicates are allowed but ignored.
+//	set2: `Tensor` with rank `n`. 1st `n-1` dimensions must be the same as `set1`.
+// Dimension `n` contains values in a set, duplicates are allowed but ignored.
 //
-// Returns 4-D with shape `[batch, orig_height, orig_width, channels]`. Gradients
-// with respect to the input image.
-func ResizeNearestNeighborGrad(scope *Scope, grads tf.Output, size tf.Output, optional ...ResizeNearestNeighborGradAttr) (output tf.Output) {
+//
+// Returns 2D indices of a `SparseTensor`.1D values of a `SparseTensor`.1D `Tensor` shape of a `SparseTensor`. `result_shape[0...n-1]` is
+// the same as the 1st `n-1` dimensions of `set1` and `set2`, `result_shape[n]`
+// is the max result set size across all `0...n-1` dimensions.
+func DenseToDenseSetOperation(scope *Scope, set1 tf.Output, set2 tf.Output, set_operation string, optional ...DenseToDenseSetOperationAttr) (result_indices tf.Output, result_values tf.Output, result_shape tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
+	attrs := map[string]interface{}{"set_operation": set_operation}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ResizeNearestNeighborGrad",
+		Type: "DenseToDenseSetOperation",
 		Input: []tf.Input{
-			grads, size,
+			set1, set2,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// ExtractJpegShapeAttr is an optional argument to ExtractJpegShape.
-type ExtractJpegShapeAttr func(optionalAttr)
-
-// ExtractJpegShapeOutputType sets the optional output_type attribute to value.
-//
-// value: (Optional) The output type of the operation (int32 or int64).
-// Defaults to int32.
-// If not specified, defaults to DT_INT32
-func ExtractJpegShapeOutputType(value tf.DataType) ExtractJpegShapeAttr {
-	return func(m optionalAttr) {
-		m["output_type"] = value
-	}
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// Extract the shape information of a JPEG-encoded image.
-//
-// This op only parses the image header, so it is much faster than DecodeJpeg.
-//
-// Arguments:
-//	contents: 0-D. The JPEG-encoded image.
+// Generate a sharded filename. The filename is printf formatted as
 //
-// Returns 1-D. The image shape with format [height, width, channels].
-func ExtractJpegShape(scope *Scope, contents tf.Output, optional ...ExtractJpegShapeAttr) (image_shape tf.Output) {
+//    %s-%05d-of-%05d, basename, shard, num_shards.
+func ShardedFilename(scope *Scope, basename tf.Output, shard tf.Output, num_shards tf.Output) (filename tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "ExtractJpegShape",
+		Type: "ShardedFilename",
 		Input: []tf.Input{
-			contents,
+			basename, shard, num_shards,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// PaddingFIFOQueueV2Attr is an optional argument to PaddingFIFOQueueV2.
-type PaddingFIFOQueueV2Attr func(optionalAttr)
-
-// PaddingFIFOQueueV2Shapes sets the optional shapes attribute to value.
+// BatchToSpace for N-D tensors of type T.
 //
-// value: The shape of each component in a value. The length of this attr must
-// be either 0 or the same as the length of component_types.
-// Shapes of fixed rank but variable size are allowed by setting
-// any shape dimension to -1.  In this case, the inputs' shape may vary along
-// the given dimension, and DequeueMany will pad the given dimension with
-// zeros up to the maximum shape of all elements in the given batch.
-// If the length of this attr is 0, different queue elements may have
-// different ranks and shapes, but only one element may be dequeued at a time.
-// If not specified, defaults to <>
+// This operation reshapes the "batch" dimension 0 into `M + 1` dimensions of shape
+// `block_shape + [batch]`, interleaves these blocks back into the grid defined by
+// the spatial dimensions `[1, ..., M]`, to obtain a result with the same rank as
+// the input.  The spatial dimensions of this intermediate result are then
+// optionally cropped according to `crops` to produce the output.  This is the
+// reverse of SpaceToBatch.  See below for a precise description.
 //
-// REQUIRES: len(value) >= 0
-func PaddingFIFOQueueV2Shapes(value []tf.Shape) PaddingFIFOQueueV2Attr {
-	return func(m optionalAttr) {
-		m["shapes"] = value
-	}
-}
-
-// PaddingFIFOQueueV2Capacity sets the optional capacity attribute to value.
+// Arguments:
+//	input: N-D with shape `input_shape = [batch] + spatial_shape + remaining_shape`,
+// where spatial_shape has M dimensions.
+//	block_shape: 1-D with shape `[M]`, all values must be >= 1.
+//	crops: 2-D with shape `[M, 2]`, all values must be >= 0.
+//   `crops[i] = [crop_start, crop_end]` specifies the amount to crop from input
+//   dimension `i + 1`, which corresponds to spatial dimension `i`.  It is
+//   required that
+//   `crop_start[i] + crop_end[i] <= block_shape[i] * input_shape[i + 1]`.
 //
-// value: The upper bound on the number of elements in this queue.
-// Negative numbers mean no limit.
-// If not specified, defaults to -1
-func PaddingFIFOQueueV2Capacity(value int64) PaddingFIFOQueueV2Attr {
-	return func(m optionalAttr) {
-		m["capacity"] = value
-	}
-}
-
-// PaddingFIFOQueueV2Container sets the optional container attribute to value.
+// This operation is equivalent to the following steps:
 //
-// value: If non-empty, this queue is placed in the given container.
-// Otherwise, a default container is used.
-// If not specified, defaults to ""
-func PaddingFIFOQueueV2Container(value string) PaddingFIFOQueueV2Attr {
-	return func(m optionalAttr) {
-		m["container"] = value
-	}
-}
-
-// PaddingFIFOQueueV2SharedName sets the optional shared_name attribute to value.
+// 1. Reshape `input` to `reshaped` of shape:
+//      [block_shape[0], ..., block_shape[M-1],
+//       batch / prod(block_shape),
+//       input_shape[1], ..., input_shape[N-1]]
 //
-// value: If non-empty, this queue will be shared under the given name
-// across multiple sessions.
-// If not specified, defaults to ""
-func PaddingFIFOQueueV2SharedName(value string) PaddingFIFOQueueV2Attr {
-	return func(m optionalAttr) {
-		m["shared_name"] = value
-	}
-}
-
-// A queue that produces elements in first-in first-out order.
+// 2. Permute dimensions of `reshaped` to produce `permuted` of shape
+//      [batch / prod(block_shape),
 //
-// Variable-size shapes are allowed by setting the corresponding shape dimensions
-// to 0 in the shape attr.  In this case DequeueMany will pad up to the maximum
-// size of any given element in the minibatch.  See below for details.
+//       input_shape[1], block_shape[0],
+//       ...,
+//       input_shape[M], block_shape[M-1],
 //
-// Arguments:
-//	component_types: The type of each component in a value.
+//       input_shape[M+1], ..., input_shape[N-1]]
 //
-// Returns The handle to the queue.
-func PaddingFIFOQueueV2(scope *Scope, component_types []tf.DataType, optional ...PaddingFIFOQueueV2Attr) (handle tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"component_types": component_types}
-	for _, a := range optional {
-		a(attrs)
+// 3. Reshape `permuted` to produce `reshaped_permuted` of shape
+//      [batch / prod(block_shape),
+//
+//       input_shape[1] * block_shape[0],
+//       ...,
+//       input_shape[M] * block_shape[M-1],
+//
+//       input_shape[M+1],
+//       ...,
+//       input_shape[N-1]]
+//
+// 4. Crop the start and end of dimensions `[1, ..., M]` of
+//    `reshaped_permuted` according to `crops` to produce the output of shape:
+//      [batch / prod(block_shape),
+//
+//       input_shape[1] * block_shape[0] - crops[0,0] - crops[0,1],
+//       ...,
+//       input_shape[M] * block_shape[M-1] - crops[M-1,0] - crops[M-1,1],
+//
+//       input_shape[M+1], ..., input_shape[N-1]]
+//
+// Some examples:
+//
+// (1) For the following input of shape `[4, 1, 1, 1]`, `block_shape = [2, 2]`, and
+//     `crops = [[0, 0], [0, 0]]`:
+//
+// ```
+// [[[[1]]], [[[2]]], [[[3]]], [[[4]]]]
+// ```
+//
+// The output tensor has shape `[1, 2, 2, 1]` and value:
+//
+// ```
+// x = [[[[1], [2]], [[3], [4]]]]
+// ```
+//
+// (2) For the following input of shape `[4, 1, 1, 3]`, `block_shape = [2, 2]`, and
+//     `crops = [[0, 0], [0, 0]]`:
+//
+// ```
+// [[[1, 2, 3]], [[4, 5, 6]], [[7, 8, 9]], [[10, 11, 12]]]
+// ```
+//
+// The output tensor has shape `[1, 2, 2, 3]` and value:
+//
+// ```
+// x = [[[[1, 2, 3], [4, 5, 6]],
+//       [[7, 8, 9], [10, 11, 12]]]]
+// ```
+//
+// (3) For the following input of shape `[4, 2, 2, 1]`, `block_shape = [2, 2]`, and
+//     `crops = [[0, 0], [0, 0]]`:
+//
+// ```
+// x = [[[[1], [3]], [[9], [11]]],
+//      [[[2], [4]], [[10], [12]]],
+//      [[[5], [7]], [[13], [15]]],
+//      [[[6], [8]], [[14], [16]]]]
+// ```
+//
+// The output tensor has shape `[1, 4, 4, 1]` and value:
+//
+// ```
+// x = [[[1],   [2],  [3],  [4]],
+//      [[5],   [6],  [7],  [8]],
+//      [[9],  [10], [11],  [12]],
+//      [[13], [14], [15],  [16]]]
+// ```
+//
+// (4) For the following input of shape `[8, 1, 3, 1]`, `block_shape = [2, 2]`, and
+//     `crops = [[0, 0], [2, 0]]`:
+//
+// ```
+// x = [[[[0], [1], [3]]], [[[0], [9], [11]]],
+//      [[[0], [2], [4]]], [[[0], [10], [12]]],
+//      [[[0], [5], [7]]], [[[0], [13], [15]]],
+//      [[[0], [6], [8]]], [[[0], [14], [16]]]]
+// ```
+//
+// The output tensor has shape `[2, 2, 4, 1]` and value:
+//
+// ```
+// x = [[[[1],   [2],  [3],  [4]],
+//       [[5],   [6],  [7],  [8]]],
+//      [[[9],  [10], [11],  [12]],
+//       [[13], [14], [15],  [16]]]]
+// ```
+func BatchToSpaceND(scope *Scope, input tf.Output, block_shape tf.Output, crops tf.Output) (output tf.Output) {
+	if scope.Err() != nil {
+		return
 	}
 	opspec := tf.OpSpec{
-		Type: "PaddingFIFOQueueV2",
-
-		Attrs: attrs,
+		Type: "BatchToSpaceND",
+		Input: []tf.Input{
+			input, block_shape, crops,
+		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// DecodePngAttr is an optional argument to DecodePng.
-type DecodePngAttr func(optionalAttr)
+// UnpackAttr is an optional argument to Unpack.
+type UnpackAttr func(optionalAttr)
 
-// DecodePngChannels sets the optional channels attribute to value.
+// UnpackAxis sets the optional axis attribute to value.
 //
-// value: Number of color channels for the decoded image.
+// value: Dimension along which to unpack.  Negative values wrap around, so the
+// valid range is `[-R, R)`.
 // If not specified, defaults to 0
-func DecodePngChannels(value int64) DecodePngAttr {
-	return func(m optionalAttr) {
-		m["channels"] = value
-	}
-}
-
-// DecodePngDtype sets the optional dtype attribute to value.
-// If not specified, defaults to DT_UINT8
-func DecodePngDtype(value tf.DataType) DecodePngAttr {
+func UnpackAxis(value int64) UnpackAttr {
 	return func(m optionalAttr) {
-		m["dtype"] = value
+		m["axis"] = value
 	}
 }
 
-// Decode a PNG-encoded image to a uint8 or uint16 tensor.
-//
-// The attr `channels` indicates the desired number of color channels for the
-// decoded image.
+// Unpacks a given dimension of a rank-`R` tensor into `num` rank-`(R-1)` tensors.
 //
-// Accepted values are:
+// Unpacks `num` tensors from `value` by chipping it along the `axis` dimension.
+// For example, given a tensor of shape `(A, B, C, D)`;
 //
-// *   0: Use the number of channels in the PNG-encoded image.
-// *   1: output a grayscale image.
-// *   3: output an RGB image.
-// *   4: output an RGBA image.
+// If `axis == 0` then the i'th tensor in `output` is the slice `value[i, :, :, :]`
+//   and each tensor in `output` will have shape `(B, C, D)`. (Note that the
+//   dimension unpacked along is gone, unlike `split`).
 //
-// If needed, the PNG-encoded image is transformed to match the requested number
-// of color channels.
+// If `axis == 1` then the i'th tensor in `output` is the slice `value[:, i, :, :]`
+//   and each tensor in `output` will have shape `(A, C, D)`.
+// Etc.
 //
-// This op also supports decoding JPEGs and non-animated GIFs since the interface
-// is the same, though it is cleaner to use `tf.image.decode_image`.
+// This is the opposite of `pack`.
 //
 // Arguments:
-//	contents: 0-D.  The PNG-encoded image.
+//	value: 1-D or higher, with `axis` dimension size equal to `num`.
 //
-// Returns 3-D with shape `[height, width, channels]`.
-func DecodePng(scope *Scope, contents tf.Output, optional ...DecodePngAttr) (image tf.Output) {
+//
+// Returns The list of tensors unpacked from `value`.
+func Unpack(scope *Scope, value tf.Output, num int64, optional ...UnpackAttr) (output []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
+	attrs := map[string]interface{}{"num": num}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "DecodePng",
+		Type: "Unpack",
 		Input: []tf.Input{
-			contents,
+			value,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	if scope.Err() != nil {
+		return
+	}
+	var idx int
+	var err error
+	if output, idx, err = makeOutputList(op, idx, "output"); err != nil {
+		scope.UpdateErr("Unpack", err)
+		return
+	}
+	return output
 }
 
-// Decode the first frame of a GIF-encoded image to a uint8 tensor.
-//
-// GIF with frame or transparency compression are not supported
-// convert animated GIF from compressed to uncompressed by:
-//
-//     convert $src.gif -coalesce $dst.gif
-//
-// This op also supports decoding JPEGs and PNGs, though it is cleaner to use
-// `tf.image.decode_image`.
+// Increments variable pointed to by 'resource' until it reaches 'limit'.
 //
 // Arguments:
-//	contents: 0-D.  The GIF-encoded image.
+//	resource: Should be from a scalar `Variable` node.
+//	limit: If incrementing ref would bring it above limit, instead generates an
+// 'OutOfRange' error.
 //
-// Returns 4-D with shape `[num_frames, height, width, 3]`. RGB order
-func DecodeGif(scope *Scope, contents tf.Output) (image tf.Output) {
+//
+// Returns A copy of the input before increment. If nothing else modifies the
+// input, the values produced will all be distinct.
+func ResourceCountUpTo(scope *Scope, resource tf.Output, limit int64, T tf.DataType) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"limit": limit, "T": T}
 	opspec := tf.OpSpec{
-		Type: "DecodeGif",
+		Type: "ResourceCountUpTo",
 		Input: []tf.Input{
-			contents,
+			resource,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Computes the gradient of the sigmoid of `x` wrt its input.
+// Delete the stack from its resource container.
 //
-// Specifically, `grad = dy * y * (1 - y)`, where `y = sigmoid(x)`, and
-// `dy` is the corresponding input gradient.
-func SigmoidGrad(scope *Scope, y tf.Output, dy tf.Output) (z tf.Output) {
+// Arguments:
+//	handle: The handle to a stack.
+//
+// Returns the created operation.
+func StackCloseV2(scope *Scope, handle tf.Output) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "SigmoidGrad",
+		Type: "StackCloseV2",
 		Input: []tf.Input{
-			y, dy,
+			handle,
 		},
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return scope.AddOperation(opspec)
 }
 
-// Convert one or more images from HSV to RGB.
-//
-// Outputs a tensor of the same shape as the `images` tensor, containing the RGB
-// value of the pixels. The output is only well defined if the value in `images`
-// are in `[0,1]`.
-//
-// See `rgb_to_hsv` for a description of the HSV encoding.
-//
-// Arguments:
-//	images: 1-D or higher rank. HSV data to convert. Last dimension must be size 3.
-//
-// Returns `images` converted to RGB.
-func HSVToRGB(scope *Scope, images tf.Output) (output tf.Output) {
+// Generate a glob pattern matching all sharded file names.
+func ShardedFilespec(scope *Scope, basename tf.Output, num_shards tf.Output) (filename tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "HSVToRGB",
+		Type: "ShardedFilespec",
 		Input: []tf.Input{
-			images,
+			basename, num_shards,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Retrieves the tree ensemble resource stamp token, number of trees and growing statistics.
-//
-// Arguments:
-//	tree_ensemble_handle: Handle to the tree ensemble.
+// TextLineReaderV2Attr is an optional argument to TextLineReaderV2.
+type TextLineReaderV2Attr func(optionalAttr)
+
+// TextLineReaderV2SkipHeaderLines sets the optional skip_header_lines attribute to value.
 //
-// Returns Stamp token of the tree ensemble resource.The number of trees in the tree ensemble resource.The number of trees that were finished successfully.The number of layers we attempted to build (but not necessarily succeeded).Rank size 2 tensor that contains start and end ids of the nodes in the latest
-// layer.
-func BoostedTreesGetEnsembleStates(scope *Scope, tree_ensemble_handle tf.Output) (stamp_token tf.Output, num_trees tf.Output, num_finalized_trees tf.Output, num_attempted_layers tf.Output, last_layer_nodes_range tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "BoostedTreesGetEnsembleStates",
-		Input: []tf.Input{
-			tree_ensemble_handle,
-		},
+// value: Number of lines to skip from the beginning of every file.
+// If not specified, defaults to 0
+func TextLineReaderV2SkipHeaderLines(value int64) TextLineReaderV2Attr {
+	return func(m optionalAttr) {
+		m["skip_header_lines"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2), op.Output(3), op.Output(4)
 }
 
-// Gets the next output from the given iterator.
+// TextLineReaderV2Container sets the optional container attribute to value.
 //
-// This operation is a synchronous version IteratorGetNext. It should only be used
-// in situations where the iterator does not block the calling thread, or where
-// the calling thread is not a member of the thread pool used to execute parallel
-// operations (e.g. in eager mode).
-func IteratorGetNextSync(scope *Scope, iterator tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (components []tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
-	opspec := tf.OpSpec{
-		Type: "IteratorGetNextSync",
-		Input: []tf.Input{
-			iterator,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	if scope.Err() != nil {
-		return
-	}
-	var idx int
-	var err error
-	if components, idx, err = makeOutputList(op, idx, "components"); err != nil {
-		scope.UpdateErr("IteratorGetNextSync", err)
-		return
-	}
-	return components
-}
-
-// SampleDistortedBoundingBoxV2Attr is an optional argument to SampleDistortedBoundingBoxV2.
-type SampleDistortedBoundingBoxV2Attr func(optionalAttr)
-
-// SampleDistortedBoundingBoxV2Seed sets the optional seed attribute to value.
-//
-// value: If either `seed` or `seed2` are set to non-zero, the random number
-// generator is seeded by the given `seed`.  Otherwise, it is seeded by a random
-// seed.
-// If not specified, defaults to 0
-func SampleDistortedBoundingBoxV2Seed(value int64) SampleDistortedBoundingBoxV2Attr {
+// value: If non-empty, this reader is placed in the given container.
+// Otherwise, a default container is used.
+// If not specified, defaults to ""
+func TextLineReaderV2Container(value string) TextLineReaderV2Attr {
 	return func(m optionalAttr) {
-		m["seed"] = value
+		m["container"] = value
 	}
 }
 
-// SampleDistortedBoundingBoxV2Seed2 sets the optional seed2 attribute to value.
+// TextLineReaderV2SharedName sets the optional shared_name attribute to value.
 //
-// value: A second seed to avoid seed collision.
-// If not specified, defaults to 0
-func SampleDistortedBoundingBoxV2Seed2(value int64) SampleDistortedBoundingBoxV2Attr {
+// value: If non-empty, this reader is named in the given bucket
+// with this shared_name. Otherwise, the node name is used instead.
+// If not specified, defaults to ""
+func TextLineReaderV2SharedName(value string) TextLineReaderV2Attr {
 	return func(m optionalAttr) {
-		m["seed2"] = value
+		m["shared_name"] = value
 	}
 }
 
-// SampleDistortedBoundingBoxV2AspectRatioRange sets the optional aspect_ratio_range attribute to value.
+// A Reader that outputs the lines of a file delimited by '\n'.
 //
-// value: The cropped area of the image must have an aspect ratio =
-// width / height within this range.
-// If not specified, defaults to <f:0.75 f:1.33 >
-func SampleDistortedBoundingBoxV2AspectRatioRange(value []float32) SampleDistortedBoundingBoxV2Attr {
-	return func(m optionalAttr) {
-		m["aspect_ratio_range"] = value
+// Returns The handle to reference the Reader.
+func TextLineReaderV2(scope *Scope, optional ...TextLineReaderV2Attr) (reader_handle tf.Output) {
+	if scope.Err() != nil {
+		return
 	}
-}
-
-// SampleDistortedBoundingBoxV2AreaRange sets the optional area_range attribute to value.
-//
-// value: The cropped area of the image must contain a fraction of the
-// supplied image within this range.
-// If not specified, defaults to <f:0.05 f:1 >
-func SampleDistortedBoundingBoxV2AreaRange(value []float32) SampleDistortedBoundingBoxV2Attr {
-	return func(m optionalAttr) {
-		m["area_range"] = value
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
 	}
-}
+	opspec := tf.OpSpec{
+		Type: "TextLineReaderV2",
 
-// SampleDistortedBoundingBoxV2MaxAttempts sets the optional max_attempts attribute to value.
-//
-// value: Number of attempts at generating a cropped region of the image
-// of the specified constraints. After `max_attempts` failures, return the entire
-// image.
-// If not specified, defaults to 100
-func SampleDistortedBoundingBoxV2MaxAttempts(value int64) SampleDistortedBoundingBoxV2Attr {
-	return func(m optionalAttr) {
-		m["max_attempts"] = value
+		Attrs: attrs,
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// SampleDistortedBoundingBoxV2UseImageIfNoBoundingBoxes sets the optional use_image_if_no_bounding_boxes attribute to value.
+// LoadAndRemapMatrixAttr is an optional argument to LoadAndRemapMatrix.
+type LoadAndRemapMatrixAttr func(optionalAttr)
+
+// LoadAndRemapMatrixMaxRowsInMemory sets the optional max_rows_in_memory attribute to value.
 //
-// value: Controls behavior if no bounding boxes supplied.
-// If true, assume an implicit bounding box covering the whole input. If false,
-// raise an error.
-// If not specified, defaults to false
-func SampleDistortedBoundingBoxV2UseImageIfNoBoundingBoxes(value bool) SampleDistortedBoundingBoxV2Attr {
+// value: The maximum number of rows to load from the checkpoint at
+// once. If less than or equal to 0, the entire matrix will be loaded into
+// memory. Setting this arg trades increased disk reads for lower memory usage.
+// If not specified, defaults to -1
+func LoadAndRemapMatrixMaxRowsInMemory(value int64) LoadAndRemapMatrixAttr {
 	return func(m optionalAttr) {
-		m["use_image_if_no_bounding_boxes"] = value
+		m["max_rows_in_memory"] = value
 	}
 }
 
-// Generate a single randomly distorted bounding box for an image.
+// Loads a 2-D (matrix) `Tensor` with name `old_tensor_name` from the checkpoint
 //
-// Bounding box annotations are often supplied in addition to ground-truth labels
-// in image recognition or object localization tasks. A common technique for
-// training such a system is to randomly distort an image while preserving
-// its content, i.e. *data augmentation*. This Op outputs a randomly distorted
-// localization of an object, i.e. bounding box, given an `image_size`,
-// `bounding_boxes` and a series of constraints.
+// at `ckpt_path` and potentially reorders its rows and columns using the
+// specified remappings.
 //
-// The output of this Op is a single bounding box that may be used to crop the
-// original image. The output is returned as 3 tensors: `begin`, `size` and
-// `bboxes`. The first 2 tensors can be fed directly into `tf.slice` to crop the
-// image. The latter may be supplied to `tf.image.draw_bounding_boxes` to visualize
-// what the bounding box looks like.
+// Most users should use one of the wrapper initializers (such as
+// `tf.contrib.framework.load_and_remap_matrix_initializer`) instead of this
+// function directly.
 //
-// Bounding boxes are supplied and returned as `[y_min, x_min, y_max, x_max]`. The
-// bounding box coordinates are floats in `[0.0, 1.0]` relative to the width and
-// height of the underlying image.
+// The remappings are 1-D tensors with the following properties:
 //
-// For example,
+// * `row_remapping` must have exactly `num_rows` entries. Row `i` of the output
+//   matrix will be initialized from the row corresponding to index
+//   `row_remapping[i]` in the old `Tensor` from the checkpoint.
+// * `col_remapping` must have either 0 entries (indicating that no column
+//   reordering is needed) or `num_cols` entries. If specified, column `j` of the
+//   output matrix will be initialized from the column corresponding to index
+//   `col_remapping[j]` in the old `Tensor` from the checkpoint.
+// * A value of -1 in either of the remappings signifies a "missing" entry. In that
+//   case, values from the `initializing_values` tensor will be used to fill that
+//   missing row or column. If `row_remapping` has `r` missing entries and
+//   `col_remapping` has `c` missing entries, then the following condition must be
+//   true:
 //
-// ```python
-//     # Generate a single distorted bounding box.
-//     begin, size, bbox_for_draw = tf.image.sample_distorted_bounding_box(
-//         tf.shape(image),
-//         bounding_boxes=bounding_boxes)
+// `(r * num_cols) + (c * num_rows) - (r * c) == len(initializing_values)`
 //
-//     # Draw the bounding box in an image summary.
-//     image_with_box = tf.image.draw_bounding_boxes(tf.expand_dims(image, 0),
-//                                                   bbox_for_draw)
-//     tf.summary.image('images_with_box', image_with_box)
+// The remapping tensors can be generated using the GenerateVocabRemapping op.
 //
-//     # Employ the bounding box to distort the image.
-//     distorted_image = tf.slice(image, begin, size)
-// ```
+// As an example, with row_remapping = [1, 0, -1], col_remapping = [0, 2, -1],
+// initializing_values = [0.5, -0.5, 0.25, -0.25, 42], and w(i, j) representing
+// the value from row i, column j of the old tensor in the checkpoint, the output
+// matrix will look like the following:
 //
-// Note that if no bounding box information is available, setting
-// `use_image_if_no_bounding_boxes = true` will assume there is a single implicit
-// bounding box covering the whole image. If `use_image_if_no_bounding_boxes` is
-// false and no bounding boxes are supplied, an error is raised.
+// [[w(1, 0),  w(1, 2),  0.5],
+//  [w(0, 0),  w(0, 2), -0.5],
+//  [0.25,    -0.25,      42]]
 //
 // Arguments:
-//	image_size: 1-D, containing `[height, width, channels]`.
-//	bounding_boxes: 3-D with shape `[batch, N, 4]` describing the N bounding boxes
-// associated with the image.
-//	min_object_covered: The cropped area of the image must contain at least this
-// fraction of any bounding box supplied. The value of this parameter should be
-// non-negative. In the case of 0, the cropped area does not need to overlap
-// any of the bounding boxes supplied.
+//	ckpt_path: Path to the TensorFlow checkpoint (version 2, `TensorBundle`) from
+// which the old matrix `Tensor` will be loaded.
+//	old_tensor_name: Name of the 2-D `Tensor` to load from checkpoint.
+//	row_remapping: An int `Tensor` of row remappings (generally created by
+// `generate_vocab_remapping`).  Even if no row remapping is needed, this must
+// still be an index-valued Tensor (e.g. [0, 1, 2, ...]), or a shifted
+// index-valued `Tensor` (e.g. [8, 9, 10, ...], for partitioned `Variables`).
+//	col_remapping: An int `Tensor` of column remappings (generally created by
+// `generate_vocab_remapping`).  May be a size-0 `Tensor` if only row remapping
+// is to be done (e.g. column ordering is the same).
+//	initializing_values: A float `Tensor` containing  values to fill in for cells
+// in the output matrix that are not loaded from the checkpoint. Length must be
+// exactly the same as the number of missing / new cells.
+//	num_rows: Number of rows (length of the 1st dimension) in the output matrix.
+//	num_cols: Number of columns (length of the 2nd dimension) in the output matrix.
 //
-// Returns 1-D, containing `[offset_height, offset_width, 0]`. Provide as input to
-// `tf.slice`.1-D, containing `[target_height, target_width, -1]`. Provide as input to
-// `tf.slice`.3-D with shape `[1, 1, 4]` containing the distorted bounding box.
-// Provide as input to `tf.image.draw_bounding_boxes`.
-func SampleDistortedBoundingBoxV2(scope *Scope, image_size tf.Output, bounding_boxes tf.Output, min_object_covered tf.Output, optional ...SampleDistortedBoundingBoxV2Attr) (begin tf.Output, size tf.Output, bboxes tf.Output) {
+// Returns Output matrix containing existing values loaded from the
+// checkpoint, and with any missing values filled in from initializing_values.
+func LoadAndRemapMatrix(scope *Scope, ckpt_path tf.Output, old_tensor_name tf.Output, row_remapping tf.Output, col_remapping tf.Output, initializing_values tf.Output, num_rows int64, num_cols int64, optional ...LoadAndRemapMatrixAttr) (output_matrix tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
+	attrs := map[string]interface{}{"num_rows": num_rows, "num_cols": num_cols}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "SampleDistortedBoundingBoxV2",
+		Type: "LoadAndRemapMatrix",
 		Input: []tf.Input{
-			image_size, bounding_boxes, min_object_covered,
+			ckpt_path, old_tensor_name, row_remapping, col_remapping, initializing_values,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
+	return op.Output(0)
 }
 
-// ExtractGlimpseAttr is an optional argument to ExtractGlimpse.
-type ExtractGlimpseAttr func(optionalAttr)
+// TFRecordReaderV2Attr is an optional argument to TFRecordReaderV2.
+type TFRecordReaderV2Attr func(optionalAttr)
 
-// ExtractGlimpseCentered sets the optional centered attribute to value.
+// TFRecordReaderV2Container sets the optional container attribute to value.
 //
-// value: indicates if the offset coordinates are centered relative to
-// the image, in which case the (0, 0) offset is relative to the center
-// of the input images. If false, the (0,0) offset corresponds to the
-// upper left corner of the input images.
-// If not specified, defaults to true
-func ExtractGlimpseCentered(value bool) ExtractGlimpseAttr {
+// value: If non-empty, this reader is placed in the given container.
+// Otherwise, a default container is used.
+// If not specified, defaults to ""
+func TFRecordReaderV2Container(value string) TFRecordReaderV2Attr {
 	return func(m optionalAttr) {
-		m["centered"] = value
+		m["container"] = value
 	}
 }
 
-// ExtractGlimpseNormalized sets the optional normalized attribute to value.
+// TFRecordReaderV2SharedName sets the optional shared_name attribute to value.
 //
-// value: indicates if the offset coordinates are normalized.
-// If not specified, defaults to true
-func ExtractGlimpseNormalized(value bool) ExtractGlimpseAttr {
+// value: If non-empty, this reader is named in the given bucket
+// with this shared_name. Otherwise, the node name is used instead.
+// If not specified, defaults to ""
+func TFRecordReaderV2SharedName(value string) TFRecordReaderV2Attr {
 	return func(m optionalAttr) {
-		m["normalized"] = value
+		m["shared_name"] = value
 	}
 }
 
-// ExtractGlimpseUniformNoise sets the optional uniform_noise attribute to value.
-//
-// value: indicates if the noise should be generated using a
-// uniform distribution or a Gaussian distribution.
-// If not specified, defaults to true
-func ExtractGlimpseUniformNoise(value bool) ExtractGlimpseAttr {
+// TFRecordReaderV2CompressionType sets the optional compression_type attribute to value.
+// If not specified, defaults to ""
+func TFRecordReaderV2CompressionType(value string) TFRecordReaderV2Attr {
 	return func(m optionalAttr) {
-		m["uniform_noise"] = value
+		m["compression_type"] = value
 	}
 }
 
-// Extracts a glimpse from the input tensor.
+// A Reader that outputs the records from a TensorFlow Records file.
 //
-// Returns a set of windows called glimpses extracted at location
-// `offsets` from the input tensor. If the windows only partially
-// overlaps the inputs, the non overlapping areas will be filled with
-// random noise.
-//
-// The result is a 4-D tensor of shape `[batch_size, glimpse_height,
-// glimpse_width, channels]`. The channels and batch dimensions are the
-// same as that of the input tensor. The height and width of the output
-// windows are specified in the `size` parameter.
-//
-// The argument `normalized` and `centered` controls how the windows are built:
-//
-// * If the coordinates are normalized but not centered, 0.0 and 1.0
-//   correspond to the minimum and maximum of each height and width
-//   dimension.
-// * If the coordinates are both normalized and centered, they range from
-//   -1.0 to 1.0. The coordinates (-1.0, -1.0) correspond to the upper
-//   left corner, the lower right corner is located at (1.0, 1.0) and the
-//   center is at (0, 0).
-// * If the coordinates are not normalized they are interpreted as
-//   numbers of pixels.
-//
-// Arguments:
-//	input: A 4-D float tensor of shape `[batch_size, height, width, channels]`.
-//	size: A 1-D tensor of 2 elements containing the size of the glimpses
-// to extract.  The glimpse height must be specified first, following
-// by the glimpse width.
-//	offsets: A 2-D integer tensor of shape `[batch_size, 2]` containing
-// the y, x locations of the center of each window.
-//
-// Returns A tensor representing the glimpses `[batch_size,
-// glimpse_height, glimpse_width, channels]`.
-func ExtractGlimpse(scope *Scope, input tf.Output, size tf.Output, offsets tf.Output, optional ...ExtractGlimpseAttr) (glimpse tf.Output) {
+// Returns The handle to reference the Reader.
+func TFRecordReaderV2(scope *Scope, optional ...TFRecordReaderV2Attr) (reader_handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -23776,268 +23270,305 @@ func ExtractGlimpse(scope *Scope, input tf.Output, size tf.Output, offsets tf.Ou
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ExtractGlimpse",
-		Input: []tf.Input{
-			input, size, offsets,
-		},
+		Type: "TFRecordReaderV2",
+
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// A container for an iterator resource.
+// QuantizeAndDequantizeV3Attr is an optional argument to QuantizeAndDequantizeV3.
+type QuantizeAndDequantizeV3Attr func(optionalAttr)
+
+// QuantizeAndDequantizeV3SignedInput sets the optional signed_input attribute to value.
+// If not specified, defaults to true
+func QuantizeAndDequantizeV3SignedInput(value bool) QuantizeAndDequantizeV3Attr {
+	return func(m optionalAttr) {
+		m["signed_input"] = value
+	}
+}
+
+// QuantizeAndDequantizeV3RangeGiven sets the optional range_given attribute to value.
+// If not specified, defaults to true
+func QuantizeAndDequantizeV3RangeGiven(value bool) QuantizeAndDequantizeV3Attr {
+	return func(m optionalAttr) {
+		m["range_given"] = value
+	}
+}
+
+// Quantizes then dequantizes a tensor.
 //
-// Returns A handle to the iterator that can be passed to a "MakeIterator"
-// or "IteratorGetNext" op.
-func Iterator(scope *Scope, shared_name string, container string, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
+// This is almost identical to QuantizeAndDequantizeV2, except that num_bits is a
+// tensor, so its value can change during training.
+func QuantizeAndDequantizeV3(scope *Scope, input tf.Output, input_min tf.Output, input_max tf.Output, num_bits tf.Output, optional ...QuantizeAndDequantizeV3Attr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"shared_name": shared_name, "container": container, "output_types": output_types, "output_shapes": output_shapes}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "Iterator",
-
+		Type: "QuantizeAndDequantizeV3",
+		Input: []tf.Input{
+			input, input_min, input_max, num_bits,
+		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// CropAndResizeGradImageAttr is an optional argument to CropAndResizeGradImage.
-type CropAndResizeGradImageAttr func(optionalAttr)
+// IdentityReaderV2Attr is an optional argument to IdentityReaderV2.
+type IdentityReaderV2Attr func(optionalAttr)
 
-// CropAndResizeGradImageMethod sets the optional method attribute to value.
+// IdentityReaderV2Container sets the optional container attribute to value.
 //
-// value: A string specifying the interpolation method. Only 'bilinear' is
-// supported for now.
-// If not specified, defaults to "bilinear"
-func CropAndResizeGradImageMethod(value string) CropAndResizeGradImageAttr {
+// value: If non-empty, this reader is placed in the given container.
+// Otherwise, a default container is used.
+// If not specified, defaults to ""
+func IdentityReaderV2Container(value string) IdentityReaderV2Attr {
 	return func(m optionalAttr) {
-		m["method"] = value
+		m["container"] = value
 	}
 }
 
-// Computes the gradient of the crop_and_resize op wrt the input image tensor.
+// IdentityReaderV2SharedName sets the optional shared_name attribute to value.
 //
-// Arguments:
-//	grads: A 4-D tensor of shape `[num_boxes, crop_height, crop_width, depth]`.
-//	boxes: A 2-D tensor of shape `[num_boxes, 4]`. The `i`-th row of the tensor
-// specifies the coordinates of a box in the `box_ind[i]` image and is specified
-// in normalized coordinates `[y1, x1, y2, x2]`. A normalized coordinate value of
-// `y` is mapped to the image coordinate at `y * (image_height - 1)`, so as the
-// `[0, 1]` interval of normalized image height is mapped to
-// `[0, image_height - 1] in image height coordinates. We do allow y1 > y2, in
-// which case the sampled crop is an up-down flipped version of the original
-// image. The width dimension is treated similarly. Normalized coordinates
-// outside the `[0, 1]` range are allowed, in which case we use
-// `extrapolation_value` to extrapolate the input image values.
-//	box_ind: A 1-D tensor of shape `[num_boxes]` with int32 values in `[0, batch)`.
-// The value of `box_ind[i]` specifies the image that the `i`-th box refers to.
-//	image_size: A 1-D tensor with value `[batch, image_height, image_width, depth]`
-// containing the original image size. Both `image_height` and `image_width` need
-// to be positive.
+// value: If non-empty, this reader is named in the given bucket
+// with this shared_name. Otherwise, the node name is used instead.
+// If not specified, defaults to ""
+func IdentityReaderV2SharedName(value string) IdentityReaderV2Attr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
+	}
+}
+
+// A Reader that outputs the queued work as both the key and value.
 //
+// To use, enqueue strings in a Queue.  ReaderRead will take the front
+// work string and output (work, work).
 //
-// Returns A 4-D tensor of shape `[batch, image_height, image_width, depth]`.
-func CropAndResizeGradImage(scope *Scope, grads tf.Output, boxes tf.Output, box_ind tf.Output, image_size tf.Output, T tf.DataType, optional ...CropAndResizeGradImageAttr) (output tf.Output) {
+// Returns The handle to reference the Reader.
+func IdentityReaderV2(scope *Scope, optional ...IdentityReaderV2Attr) (reader_handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"T": T}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "CropAndResizeGradImage",
-		Input: []tf.Input{
-			grads, boxes, box_ind, image_size,
-		},
+		Type: "IdentityReaderV2",
+
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// ShuffleDatasetAttr is an optional argument to ShuffleDataset.
-type ShuffleDatasetAttr func(optionalAttr)
+// ResourceApplyGradientDescentAttr is an optional argument to ResourceApplyGradientDescent.
+type ResourceApplyGradientDescentAttr func(optionalAttr)
 
-// ShuffleDatasetReshuffleEachIteration sets the optional reshuffle_each_iteration attribute to value.
+// ResourceApplyGradientDescentUseLocking sets the optional use_locking attribute to value.
 //
-// value: If true, each iterator over this dataset will be given
-// a different pseudorandomly generated seed, based on a sequence seeded by the
-// `seed` and `seed2` inputs. If false, each iterator will be given the same
-// seed, and repeated iteration over this dataset will yield the exact same
-// sequence of results.
-// If not specified, defaults to true
-func ShuffleDatasetReshuffleEachIteration(value bool) ShuffleDatasetAttr {
+// value: If `True`, the subtraction will be protected by a lock;
+// otherwise the behavior is undefined, but may exhibit less contention.
+// If not specified, defaults to false
+func ResourceApplyGradientDescentUseLocking(value bool) ResourceApplyGradientDescentAttr {
 	return func(m optionalAttr) {
-		m["reshuffle_each_iteration"] = value
+		m["use_locking"] = value
 	}
 }
 
-// Creates a dataset that shuffles elements from `input_dataset` pseudorandomly.
+// Update '*var' by subtracting 'alpha' * 'delta' from it.
 //
 // Arguments:
+//	var_: Should be from a Variable().
+//	alpha: Scaling factor. Must be a scalar.
+//	delta: The change.
 //
-//	buffer_size: The number of output elements to buffer in an iterator over
-// this dataset. Compare with the `min_after_dequeue` attr when creating a
-// `RandomShuffleQueue`.
-//	seed: A scalar seed for the random number generator. If either `seed` or
-// `seed2` is set to be non-zero, the random number generator is seeded
-// by the given seed.  Otherwise, a random seed is used.
-//	seed2: A second scalar seed to avoid seed collision.
-//
-//
-func ShuffleDataset(scope *Scope, input_dataset tf.Output, buffer_size tf.Output, seed tf.Output, seed2 tf.Output, output_types []tf.DataType, output_shapes []tf.Shape, optional ...ShuffleDatasetAttr) (handle tf.Output) {
+// Returns the created operation.
+func ResourceApplyGradientDescent(scope *Scope, var_ tf.Output, alpha tf.Output, delta tf.Output, optional ...ResourceApplyGradientDescentAttr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ShuffleDataset",
+		Type: "ResourceApplyGradientDescent",
 		Input: []tf.Input{
-			input_dataset, buffer_size, seed, seed2,
+			var_, alpha, delta,
 		},
 		Attrs: attrs,
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return scope.AddOperation(opspec)
 }
 
-// 3D fast Fourier transform.
+// Returns the next record (key, value pair) produced by a Reader.
 //
-// Computes the 3-dimensional discrete Fourier transform over the inner-most 3
-// dimensions of `input`.
+// Will dequeue from the input queue if necessary (e.g. when the
+// Reader needs to start reading from a new file since it has finished
+// with the previous file).
 //
 // Arguments:
-//	input: A complex64 tensor.
-//
-// Returns A complex64 tensor of the same shape as `input`. The inner-most 3
-//   dimensions of `input` are replaced with their 3D Fourier transform.
+//	reader_handle: Handle to a Reader.
+//	queue_handle: Handle to a Queue, with string work items.
 //
-// @compatibility(numpy)
-// Equivalent to np.fft.fftn with 3 dimensions.
-// @end_compatibility
-func FFT3D(scope *Scope, input tf.Output) (output tf.Output) {
+// Returns A scalar.A scalar.
+func ReaderReadV2(scope *Scope, reader_handle tf.Output, queue_handle tf.Output) (key tf.Output, value tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "FFT3D",
+		Type: "ReaderReadV2",
 		Input: []tf.Input{
-			input,
+			reader_handle, queue_handle,
 		},
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1)
 }
 
-// CropAndResizeGradBoxesAttr is an optional argument to CropAndResizeGradBoxes.
-type CropAndResizeGradBoxesAttr func(optionalAttr)
-
-// CropAndResizeGradBoxesMethod sets the optional method attribute to value.
+// Returns up to `num_records` (key, value) pairs produced by a Reader.
 //
-// value: A string specifying the interpolation method. Only 'bilinear' is
-// supported for now.
-// If not specified, defaults to "bilinear"
-func CropAndResizeGradBoxesMethod(value string) CropAndResizeGradBoxesAttr {
-	return func(m optionalAttr) {
-		m["method"] = value
-	}
-}
-
-// Computes the gradient of the crop_and_resize op wrt the input boxes tensor.
+// Will dequeue from the input queue if necessary (e.g. when the
+// Reader needs to start reading from a new file since it has finished
+// with the previous file).
+// It may return less than `num_records` even before the last batch.
 //
 // Arguments:
-//	grads: A 4-D tensor of shape `[num_boxes, crop_height, crop_width, depth]`.
-//	image: A 4-D tensor of shape `[batch, image_height, image_width, depth]`.
-// Both `image_height` and `image_width` need to be positive.
-//	boxes: A 2-D tensor of shape `[num_boxes, 4]`. The `i`-th row of the tensor
-// specifies the coordinates of a box in the `box_ind[i]` image and is specified
-// in normalized coordinates `[y1, x1, y2, x2]`. A normalized coordinate value of
-// `y` is mapped to the image coordinate at `y * (image_height - 1)`, so as the
-// `[0, 1]` interval of normalized image height is mapped to
-// `[0, image_height - 1] in image height coordinates. We do allow y1 > y2, in
-// which case the sampled crop is an up-down flipped version of the original
-// image. The width dimension is treated similarly. Normalized coordinates
-// outside the `[0, 1]` range are allowed, in which case we use
-// `extrapolation_value` to extrapolate the input image values.
-//	box_ind: A 1-D tensor of shape `[num_boxes]` with int32 values in `[0, batch)`.
-// The value of `box_ind[i]` specifies the image that the `i`-th box refers to.
+//	reader_handle: Handle to a `Reader`.
+//	queue_handle: Handle to a `Queue`, with string work items.
+//	num_records: number of records to read from `Reader`.
 //
-// Returns A 2-D tensor of shape `[num_boxes, 4]`.
-func CropAndResizeGradBoxes(scope *Scope, grads tf.Output, image tf.Output, boxes tf.Output, box_ind tf.Output, optional ...CropAndResizeGradBoxesAttr) (output tf.Output) {
+// Returns A 1-D tensor.A 1-D tensor.
+func ReaderReadUpToV2(scope *Scope, reader_handle tf.Output, queue_handle tf.Output, num_records tf.Output) (keys tf.Output, values tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "CropAndResizeGradBoxes",
+		Type: "ReaderReadUpToV2",
 		Input: []tf.Input{
-			grads, image, boxes, box_ind,
+			reader_handle, queue_handle, num_records,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1)
 }
 
-// Saves tensors in V2 checkpoint format.
+// ResourceApplyAdamAttr is an optional argument to ResourceApplyAdam.
+type ResourceApplyAdamAttr func(optionalAttr)
+
+// ResourceApplyAdamUseLocking sets the optional use_locking attribute to value.
 //
-// By default, saves the named tensors in full.  If the caller wishes to save
-// specific slices of full tensors, "shape_and_slices" should be non-empty strings
-// and correspondingly well-formed.
+// value: If `True`, updating of the var, m, and v tensors will be protected
+// by a lock; otherwise the behavior is undefined, but may exhibit less
+// contention.
+// If not specified, defaults to false
+func ResourceApplyAdamUseLocking(value bool) ResourceApplyAdamAttr {
+	return func(m optionalAttr) {
+		m["use_locking"] = value
+	}
+}
+
+// ResourceApplyAdamUseNesterov sets the optional use_nesterov attribute to value.
+//
+// value: If `True`, uses the nesterov update.
+// If not specified, defaults to false
+func ResourceApplyAdamUseNesterov(value bool) ResourceApplyAdamAttr {
+	return func(m optionalAttr) {
+		m["use_nesterov"] = value
+	}
+}
+
+// Update '*var' according to the Adam algorithm.
+//
+// lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)
+// m_t <- beta1 * m_{t-1} + (1 - beta1) * g_t
+// v_t <- beta2 * v_{t-1} + (1 - beta2) * g_t * g_t
+// variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)
 //
 // Arguments:
-//	prefix: Must have a single element. The prefix of the V2 checkpoint to which we
-// write the tensors.
-//	tensor_names: shape {N}. The names of the tensors to be saved.
-//	shape_and_slices: shape {N}.  The slice specs of the tensors to be saved.
-// Empty strings indicate that they are non-partitioned tensors.
-//	tensors: `N` tensors to save.
+//	var_: Should be from a Variable().
+//	m: Should be from a Variable().
+//	v: Should be from a Variable().
+//	beta1_power: Must be a scalar.
+//	beta2_power: Must be a scalar.
+//	lr: Scaling factor. Must be a scalar.
+//	beta1: Momentum factor. Must be a scalar.
+//	beta2: Momentum factor. Must be a scalar.
+//	epsilon: Ridge term. Must be a scalar.
+//	grad: The gradient.
 //
 // Returns the created operation.
-func SaveV2(scope *Scope, prefix tf.Output, tensor_names tf.Output, shape_and_slices tf.Output, tensors []tf.Output) (o *tf.Operation) {
+func ResourceApplyAdam(scope *Scope, var_ tf.Output, m tf.Output, v tf.Output, beta1_power tf.Output, beta2_power tf.Output, lr tf.Output, beta1 tf.Output, beta2 tf.Output, epsilon tf.Output, grad tf.Output, optional ...ResourceApplyAdamAttr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "SaveV2",
+		Type: "ResourceApplyAdam",
 		Input: []tf.Input{
-			prefix, tensor_names, shape_and_slices, tf.OutputList(tensors),
+			var_, m, v, beta1_power, beta2_power, lr, beta1, beta2, epsilon, grad,
 		},
+		Attrs: attrs,
 	}
 	return scope.AddOperation(opspec)
 }
 
-// StatsAggregatorHandleAttr is an optional argument to StatsAggregatorHandle.
-type StatsAggregatorHandleAttr func(optionalAttr)
-
-// StatsAggregatorHandleContainer sets the optional container attribute to value.
-// If not specified, defaults to ""
-func StatsAggregatorHandleContainer(value string) StatsAggregatorHandleAttr {
-	return func(m optionalAttr) {
-		m["container"] = value
+// Store the input tensor in the state of the current session.
+//
+// Arguments:
+//	value: The tensor to be stored.
+//
+// Returns The handle for the tensor stored in the session state, represented
+// as a ResourceHandle object.
+func GetSessionHandleV2(scope *Scope, value tf.Output) (handle tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "GetSessionHandleV2",
+		Input: []tf.Input{
+			value,
+		},
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// StatsAggregatorHandleSharedName sets the optional shared_name attribute to value.
-// If not specified, defaults to ""
-func StatsAggregatorHandleSharedName(value string) StatsAggregatorHandleAttr {
+// ResizeBicubicGradAttr is an optional argument to ResizeBicubicGrad.
+type ResizeBicubicGradAttr func(optionalAttr)
+
+// ResizeBicubicGradAlignCorners sets the optional align_corners attribute to value.
+//
+// value: If true, the centers of the 4 corner pixels of the input and grad tensors are
+// aligned. Defaults to false.
+// If not specified, defaults to false
+func ResizeBicubicGradAlignCorners(value bool) ResizeBicubicGradAttr {
 	return func(m optionalAttr) {
-		m["shared_name"] = value
+		m["align_corners"] = value
 	}
 }
 
-// Creates a statistics manager resource.
-func StatsAggregatorHandle(scope *Scope, optional ...StatsAggregatorHandleAttr) (handle tf.Output) {
+// Computes the gradient of bicubic interpolation.
+//
+// Arguments:
+//	grads: 4-D with shape `[batch, height, width, channels]`.
+//	original_image: 4-D with shape `[batch, orig_height, orig_width, channels]`,
+// The image tensor that was resized.
+//
+// Returns 4-D with shape `[batch, orig_height, orig_width, channels]`.
+// Gradients with respect to the input image. Input image must have been
+// float or double.
+func ResizeBicubicGrad(scope *Scope, grads tf.Output, original_image tf.Output, optional ...ResizeBicubicGradAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -24046,130 +23577,93 @@ func StatsAggregatorHandle(scope *Scope, optional ...StatsAggregatorHandleAttr)
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "StatsAggregatorHandle",
-
+		Type: "ResizeBicubicGrad",
+		Input: []tf.Input{
+			grads, original_image,
+		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Greedily selects a subset of bounding boxes in descending order of score,
-//
-// pruning away boxes that have high intersection-over-union (IOU) overlap
-// with previously selected boxes.  Bounding boxes are supplied as
-// [y1, x1, y2, x2], where (y1, x1) and (y2, x2) are the coordinates of any
-// diagonal pair of box corners and the coordinates can be provided as normalized
-// (i.e., lying in the interval [0, 1]) or absolute.  Note that this algorithm
-// is agnostic to where the origin is in the coordinate system.  Note that this
-// algorithm is invariant to orthogonal transformations and translations
-// of the coordinate system; thus translating or reflections of the coordinate
-// system result in the same boxes being selected by the algorithm.
-//
-// The output of this operation is a set of integers indexing into the input
-// collection of bounding boxes representing the selected boxes.  The bounding
-// box coordinates corresponding to the selected indices can then be obtained
-// using the `tf.gather operation`.  For example:
+// ResizeNearestNeighborAttr is an optional argument to ResizeNearestNeighbor.
+type ResizeNearestNeighborAttr func(optionalAttr)
+
+// ResizeNearestNeighborAlignCorners sets the optional align_corners attribute to value.
 //
-//   selected_indices = tf.image.non_max_suppression_v2(
-//       boxes, scores, max_output_size, iou_threshold)
-//   selected_boxes = tf.gather(boxes, selected_indices)
+// value: If true, the centers of the 4 corner pixels of the input and output tensors are
+// aligned, preserving the values at the corner pixels. Defaults to false.
+// If not specified, defaults to false
+func ResizeNearestNeighborAlignCorners(value bool) ResizeNearestNeighborAttr {
+	return func(m optionalAttr) {
+		m["align_corners"] = value
+	}
+}
+
+// Resize `images` to `size` using nearest neighbor interpolation.
 //
 // Arguments:
-//	boxes: A 2-D float tensor of shape `[num_boxes, 4]`.
-//	scores: A 1-D float tensor of shape `[num_boxes]` representing a single
-// score corresponding to each box (each row of boxes).
-//	max_output_size: A scalar integer tensor representing the maximum number of
-// boxes to be selected by non max suppression.
-//	iou_threshold: A 0-D float tensor representing the threshold for deciding whether
-// boxes overlap too much with respect to IOU.
+//	images: 4-D with shape `[batch, height, width, channels]`.
+//	size: = A 1-D int32 Tensor of 2 elements: `new_height, new_width`.  The
+// new size for the images.
 //
-// Returns A 1-D integer tensor of shape `[M]` representing the selected
-// indices from the boxes tensor, where `M <= max_output_size`.
-func NonMaxSuppressionV2(scope *Scope, boxes tf.Output, scores tf.Output, max_output_size tf.Output, iou_threshold tf.Output) (selected_indices tf.Output) {
+// Returns 4-D with shape
+// `[batch, new_height, new_width, channels]`.
+func ResizeNearestNeighbor(scope *Scope, images tf.Output, size tf.Output, optional ...ResizeNearestNeighborAttr) (resized_images tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "NonMaxSuppressionV2",
+		Type: "ResizeNearestNeighbor",
 		Input: []tf.Input{
-			boxes, scores, max_output_size, iou_threshold,
+			images, size,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// EncodeProtoAttr is an optional argument to EncodeProto.
-type EncodeProtoAttr func(optionalAttr)
+// ResizeNearestNeighborGradAttr is an optional argument to ResizeNearestNeighborGrad.
+type ResizeNearestNeighborGradAttr func(optionalAttr)
 
-// EncodeProtoDescriptorSource sets the optional descriptor_source attribute to value.
-// If not specified, defaults to "local://"
-func EncodeProtoDescriptorSource(value string) EncodeProtoAttr {
+// ResizeNearestNeighborGradAlignCorners sets the optional align_corners attribute to value.
+//
+// value: If true, the centers of the 4 corner pixels of the input and grad tensors are
+// aligned. Defaults to false.
+// If not specified, defaults to false
+func ResizeNearestNeighborGradAlignCorners(value bool) ResizeNearestNeighborGradAttr {
 	return func(m optionalAttr) {
-		m["descriptor_source"] = value
+		m["align_corners"] = value
 	}
 }
 
-// The op serializes protobuf messages provided in the input tensors.
-//
-// The types of the tensors in `values` must match the schema for the
-// fields specified in `field_names`. All the tensors in `values` must
-// have a common shape prefix, *batch_shape*.
-//
-// The `sizes` tensor specifies repeat counts for each field.  The repeat
-// count (last dimension) of a each tensor in `values` must be greater
-// than or equal to corresponding repeat count in `sizes`.
-//
-// A `message_type` name must be provided to give context for the field
-// names. The actual message descriptor can be looked up either in the
-// linked-in descriptor pool or a filename provided by the caller using
-// the `descriptor_source` attribute.
-//
-// The `descriptor_source` attribute selects a source of protocol
-// descriptors to consult when looking up `message_type`. This may be a
-// filename containing a serialized `FileDescriptorSet` message,
-// or the special value `local://`, in which case only descriptors linked
-// into the code will be searched; the filename can be on any filesystem
-// accessible to TensorFlow.
-//
-// You can build a `descriptor_source` file using the `--descriptor_set_out`
-// and `--include_imports` options to the protocol compiler `protoc`.
-//
-// The `local://` database only covers descriptors linked into the
-// code via C++ libraries, not Python imports. You can link in a proto descriptor
-// by creating a cc_library target with alwayslink=1.
-//
-// There are a few special cases in the value mapping:
-//
-// Submessage and group fields must be pre-serialized as TensorFlow strings.
-//
-// TensorFlow lacks support for unsigned int64s, so they must be
-// represented as `tf.int64` with the same twos-complement bit pattern
-// (the obvious way).
-//
-// Unsigned int32 values can be represented exactly with `tf.int64`, or
-// with sign wrapping if the input is of type `tf.int32`.
+// Computes the gradient of nearest neighbor interpolation.
 //
 // Arguments:
-//	sizes: Tensor of int32 with shape `[batch_shape, len(field_names)]`.
-//	values: List of tensors containing values for the corresponding field.
-//	field_names: List of strings containing proto field names.
-//	message_type: Name of the proto message type to decode.
+//	grads: 4-D with shape `[batch, height, width, channels]`.
+//	size: = A 1-D int32 Tensor of 2 elements: `orig_height, orig_width`. The
+// original input size.
 //
-// Returns Tensor of serialized protos with shape `batch_shape`.
-func EncodeProto(scope *Scope, sizes tf.Output, values []tf.Output, field_names []string, message_type string, optional ...EncodeProtoAttr) (bytes tf.Output) {
+// Returns 4-D with shape `[batch, orig_height, orig_width, channels]`. Gradients
+// with respect to the input image.
+func ResizeNearestNeighborGrad(scope *Scope, grads tf.Output, size tf.Output, optional ...ResizeNearestNeighborGradAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"field_names": field_names, "message_type": message_type}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "EncodeProto",
+		Type: "ResizeNearestNeighborGrad",
 		Input: []tf.Input{
-			sizes, tf.OutputList(values),
+			grads, size,
 		},
 		Attrs: attrs,
 	}
@@ -24177,487 +23671,531 @@ func EncodeProto(scope *Scope, sizes tf.Output, values []tf.Output, field_names
 	return op.Output(0)
 }
 
-// Creates a TensorArray for storing the gradients of values in the given handle.
+// ExtractJpegShapeAttr is an optional argument to ExtractJpegShape.
+type ExtractJpegShapeAttr func(optionalAttr)
+
+// ExtractJpegShapeOutputType sets the optional output_type attribute to value.
 //
-// If the given TensorArray gradient already exists, returns a reference to it.
-//
-// Locks the size of the original TensorArray by disabling its dynamic size flag.
-//
-// **A note about the input flow_in:**
-//
-// The handle flow_in forces the execution of the gradient lookup to occur
-// only after certain other operations have occurred.  For example, when
-// the forward TensorArray is dynamically sized, writes to this TensorArray
-// may resize the object.  The gradient TensorArray is statically sized based
-// on the size of the forward TensorArray when this operation executes.
-// Furthermore, the size of the forward TensorArray is frozen by this call.
-// As a result, the flow is used to ensure that the call to generate the gradient
-// TensorArray only happens after all writes are executed.
-//
-// In the case of dynamically sized TensorArrays, gradient computation should
-// only be performed on read operations that have themselves been chained via
-// flow to occur only after all writes have executed. That way the final size
-// of the forward TensorArray is known when this operation is called.
-//
-// **A note about the source attribute:**
-//
-// TensorArray gradient calls use an accumulator TensorArray object.  If
-// multiple gradients are calculated and run in the same session, the multiple
-// gradient nodes may accidentally flow through the same accumulator TensorArray.
-// This double counts and generally breaks the TensorArray gradient flow.
-//
-// The solution is to identify which gradient call this particular
-// TensorArray gradient is being called in.  This is performed by identifying
-// a unique string (e.g. "gradients", "gradients_1", ...) from the input
-// gradient Tensor's name.  This string is used as a suffix when creating
-// the TensorArray gradient object here (the attribute `source`).
+// value: (Optional) The output type of the operation (int32 or int64).
+// Defaults to int32.
+// If not specified, defaults to DT_INT32
+func ExtractJpegShapeOutputType(value tf.DataType) ExtractJpegShapeAttr {
+	return func(m optionalAttr) {
+		m["output_type"] = value
+	}
+}
+
+// Extract the shape information of a JPEG-encoded image.
 //
-// The attribute `source` is added as a suffix to the forward TensorArray's
-// name when performing the creation / lookup, so that each separate gradient
-// calculation gets its own TensorArray accumulator.
+// This op only parses the image header, so it is much faster than DecodeJpeg.
 //
 // Arguments:
-//	handle: The handle to the forward TensorArray.
-//	flow_in: A float scalar that enforces proper chaining of operations.
-//	source: The gradient source string, used to decide which gradient TensorArray
-// to return.
-func TensorArrayGradV3(scope *Scope, handle tf.Output, flow_in tf.Output, source string) (grad_handle tf.Output, flow_out tf.Output) {
+//	contents: 0-D. The JPEG-encoded image.
+//
+// Returns 1-D. The image shape with format [height, width, channels].
+func ExtractJpegShape(scope *Scope, contents tf.Output, optional ...ExtractJpegShapeAttr) (image_shape tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"source": source}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "TensorArrayGradV3",
+		Type: "ExtractJpegShape",
 		Input: []tf.Input{
-			handle, flow_in,
+			contents,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1)
+	return op.Output(0)
 }
 
-// DecodeProtoV2Attr is an optional argument to DecodeProtoV2.
-type DecodeProtoV2Attr func(optionalAttr)
+// PaddingFIFOQueueV2Attr is an optional argument to PaddingFIFOQueueV2.
+type PaddingFIFOQueueV2Attr func(optionalAttr)
 
-// DecodeProtoV2DescriptorSource sets the optional descriptor_source attribute to value.
+// PaddingFIFOQueueV2Shapes sets the optional shapes attribute to value.
 //
-// value: Either the special value `local://` or a path to a file containing
-// a serialized `FileDescriptorSet`.
-// If not specified, defaults to "local://"
-func DecodeProtoV2DescriptorSource(value string) DecodeProtoV2Attr {
+// value: The shape of each component in a value. The length of this attr must
+// be either 0 or the same as the length of component_types.
+// Shapes of fixed rank but variable size are allowed by setting
+// any shape dimension to -1.  In this case, the inputs' shape may vary along
+// the given dimension, and DequeueMany will pad the given dimension with
+// zeros up to the maximum shape of all elements in the given batch.
+// If the length of this attr is 0, different queue elements may have
+// different ranks and shapes, but only one element may be dequeued at a time.
+// If not specified, defaults to <>
+//
+// REQUIRES: len(value) >= 0
+func PaddingFIFOQueueV2Shapes(value []tf.Shape) PaddingFIFOQueueV2Attr {
 	return func(m optionalAttr) {
-		m["descriptor_source"] = value
+		m["shapes"] = value
 	}
 }
 
-// DecodeProtoV2MessageFormat sets the optional message_format attribute to value.
+// PaddingFIFOQueueV2Capacity sets the optional capacity attribute to value.
 //
-// value: Either `binary` or `text`.
-// If not specified, defaults to "binary"
-func DecodeProtoV2MessageFormat(value string) DecodeProtoV2Attr {
+// value: The upper bound on the number of elements in this queue.
+// Negative numbers mean no limit.
+// If not specified, defaults to -1
+func PaddingFIFOQueueV2Capacity(value int64) PaddingFIFOQueueV2Attr {
 	return func(m optionalAttr) {
-		m["message_format"] = value
+		m["capacity"] = value
 	}
 }
 
-// DecodeProtoV2Sanitize sets the optional sanitize attribute to value.
+// PaddingFIFOQueueV2Container sets the optional container attribute to value.
 //
-// value: Whether to sanitize the result or not.
-// If not specified, defaults to false
-func DecodeProtoV2Sanitize(value bool) DecodeProtoV2Attr {
+// value: If non-empty, this queue is placed in the given container.
+// Otherwise, a default container is used.
+// If not specified, defaults to ""
+func PaddingFIFOQueueV2Container(value string) PaddingFIFOQueueV2Attr {
 	return func(m optionalAttr) {
-		m["sanitize"] = value
+		m["container"] = value
 	}
 }
 
-// The op extracts fields from a serialized protocol buffers message into tensors.
+// PaddingFIFOQueueV2SharedName sets the optional shared_name attribute to value.
 //
-// The `decode_proto` op extracts fields from a serialized protocol buffers
-// message into tensors.  The fields in `field_names` are decoded and converted
-// to the corresponding `output_types` if possible.
+// value: If non-empty, this queue will be shared under the given name
+// across multiple sessions.
+// If not specified, defaults to ""
+func PaddingFIFOQueueV2SharedName(value string) PaddingFIFOQueueV2Attr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
+	}
+}
+
+// A queue that produces elements in first-in first-out order.
 //
-// A `message_type` name must be provided to give context for the field
-// names. The actual message descriptor can be looked up either in the
-// linked-in descriptor pool or a filename provided by the caller using
-// the `descriptor_source` attribute.
+// Variable-size shapes are allowed by setting the corresponding shape dimensions
+// to 0 in the shape attr.  In this case DequeueMany will pad up to the maximum
+// size of any given element in the minibatch.  See below for details.
 //
-// Each output tensor is a dense tensor. This means that it is padded to
-// hold the largest number of repeated elements seen in the input
-// minibatch. (The shape is also padded by one to prevent zero-sized
-// dimensions). The actual repeat counts for each example in the
-// minibatch can be found in the `sizes` output. In many cases the output
-// of `decode_proto` is fed immediately into tf.squeeze if missing values
-// are not a concern. When using tf.squeeze, always pass the squeeze
-// dimension explicitly to avoid surprises.
+// Arguments:
+//	component_types: The type of each component in a value.
 //
-// For the most part, the mapping between Proto field types and
-// TensorFlow dtypes is straightforward. However, there are a few
-// special cases:
+// Returns The handle to the queue.
+func PaddingFIFOQueueV2(scope *Scope, component_types []tf.DataType, optional ...PaddingFIFOQueueV2Attr) (handle tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"component_types": component_types}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "PaddingFIFOQueueV2",
+
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// DecodePngAttr is an optional argument to DecodePng.
+type DecodePngAttr func(optionalAttr)
+
+// DecodePngChannels sets the optional channels attribute to value.
 //
-// - A proto field that contains a submessage or group can only be converted
-// to `DT_STRING` (the serialized submessage). This is to reduce the
-// complexity of the API. The resulting string can be used as input
-// to another instance of the decode_proto op.
+// value: Number of color channels for the decoded image.
+// If not specified, defaults to 0
+func DecodePngChannels(value int64) DecodePngAttr {
+	return func(m optionalAttr) {
+		m["channels"] = value
+	}
+}
+
+// DecodePngDtype sets the optional dtype attribute to value.
+// If not specified, defaults to DT_UINT8
+func DecodePngDtype(value tf.DataType) DecodePngAttr {
+	return func(m optionalAttr) {
+		m["dtype"] = value
+	}
+}
+
+// Decode a PNG-encoded image to a uint8 or uint16 tensor.
 //
-// - TensorFlow lacks support for unsigned integers. The ops represent uint64
-// types as a `DT_INT64` with the same twos-complement bit pattern
-// (the obvious way). Unsigned int32 values can be represented exactly by
-// specifying type `DT_INT64`, or using twos-complement if the caller
-// specifies `DT_INT32` in the `output_types` attribute.
+// The attr `channels` indicates the desired number of color channels for the
+// decoded image.
 //
-// The `descriptor_source` attribute selects a source of protocol
-// descriptors to consult when looking up `message_type`. This may be a
-// filename containing a serialized `FileDescriptorSet` message,
-// or the special value `local://`, in which case only descriptors linked
-// into the code will be searched; the filename can be on any filesystem
-// accessible to TensorFlow.
+// Accepted values are:
 //
-// You can build a `descriptor_source` file using the `--descriptor_set_out`
-// and `--include_imports` options to the protocol compiler `protoc`.
+// *   0: Use the number of channels in the PNG-encoded image.
+// *   1: output a grayscale image.
+// *   3: output an RGB image.
+// *   4: output an RGBA image.
 //
-// The `local://` database only covers descriptors linked into the
-// code via C++ libraries, not Python imports. You can link in a proto descriptor
-// by creating a cc_library target with alwayslink=1.
+// If needed, the PNG-encoded image is transformed to match the requested number
+// of color channels.
 //
-// Both binary and text proto serializations are supported, and can be
-// chosen using the `format` attribute.
+// This op also supports decoding JPEGs and non-animated GIFs since the interface
+// is the same, though it is cleaner to use `tf.image.decode_image`.
 //
 // Arguments:
-//	bytes: Tensor of serialized protos with shape `batch_shape`.
-//	message_type: Name of the proto message type to decode.
-//	field_names: List of strings containing proto field names.
-//	output_types: List of TF types to use for the respective field in field_names.
+//	contents: 0-D.  The PNG-encoded image.
 //
-// Returns Tensor of int32 with shape `[batch_shape, len(field_names)]`.
-// Each entry is the number of values found for the corresponding field.
-// Optional fields may have 0 or 1 values.List of tensors containing values for the corresponding field.
-// `values[i]` has datatype `output_types[i]`
-// and shape `[batch_shape, max(sizes[...,i])]`.
-func DecodeProtoV2(scope *Scope, bytes tf.Output, message_type string, field_names []string, output_types []tf.DataType, optional ...DecodeProtoV2Attr) (sizes tf.Output, values []tf.Output) {
+// Returns 3-D with shape `[height, width, channels]`.
+func DecodePng(scope *Scope, contents tf.Output, optional ...DecodePngAttr) (image tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"message_type": message_type, "field_names": field_names, "output_types": output_types}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "DecodeProtoV2",
+		Type: "DecodePng",
 		Input: []tf.Input{
-			bytes,
+			contents,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	if scope.Err() != nil {
-		return
-	}
-	var idx int
-	var err error
-	sizes = op.Output(idx)
-	if values, idx, err = makeOutputList(op, idx, "values"); err != nil {
-		scope.UpdateErr("DecodeProtoV2", err)
-		return
-	}
-	return sizes, values
-}
-
-// Creates a dataset that splits a SparseTensor into elements row-wise.
-func SparseTensorSliceDataset(scope *Scope, indices tf.Output, values tf.Output, dense_shape tf.Output) (handle tf.Output) {
+	return op.Output(0)
+}
+
+// Decode the first frame of a GIF-encoded image to a uint8 tensor.
+//
+// GIF with frame or transparency compression are not supported
+// convert animated GIF from compressed to uncompressed by:
+//
+//     convert $src.gif -coalesce $dst.gif
+//
+// This op also supports decoding JPEGs and PNGs, though it is cleaner to use
+// `tf.image.decode_image`.
+//
+// Arguments:
+//	contents: 0-D.  The GIF-encoded image.
+//
+// Returns 4-D with shape `[num_frames, height, width, 3]`. RGB order
+func DecodeGif(scope *Scope, contents tf.Output) (image tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "SparseTensorSliceDataset",
+		Type: "DecodeGif",
 		Input: []tf.Input{
-			indices, values, dense_shape,
+			contents,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Returns x / y element-wise for real types.
-//
-// If `x` and `y` are reals, this will return the floating-point division.
+// Computes the gradient of the sigmoid of `x` wrt its input.
 //
-// *NOTE*: `Div` supports broadcasting. More about broadcasting
-// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
-func RealDiv(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
+// Specifically, `grad = dy * y * (1 - y)`, where `y = sigmoid(x)`, and
+// `dy` is the corresponding input gradient.
+func SigmoidGrad(scope *Scope, y tf.Output, dy tf.Output) (z tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "RealDiv",
+		Type: "SigmoidGrad",
 		Input: []tf.Input{
-			x, y,
+			y, dy,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-//     Adds v into specified rows of x.
+// Convert one or more images from HSV to RGB.
 //
-//     Computes y = x; y[i, :] += v; return y.
+// Outputs a tensor of the same shape as the `images` tensor, containing the RGB
+// value of the pixels. The output is only well defined if the value in `images`
+// are in `[0,1]`.
+//
+// See `rgb_to_hsv` for a description of the HSV encoding.
 //
 // Arguments:
-//	x: A `Tensor` of type T.
-//	i: A vector. Indices into the left-most dimension of `x`.
-//	v: A `Tensor` of type T. Same dimension sizes as x except the first dimension, which must be the same as i's size.
+//	images: 1-D or higher rank. HSV data to convert. Last dimension must be size 3.
 //
-// Returns A `Tensor` of type T. An alias of `x`. The content of `y` is undefined if there are duplicates in `i`.
-func InplaceAdd(scope *Scope, x tf.Output, i tf.Output, v tf.Output) (y tf.Output) {
+// Returns `images` converted to RGB.
+func HSVToRGB(scope *Scope, images tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "InplaceAdd",
+		Type: "HSVToRGB",
 		Input: []tf.Input{
-			x, i, v,
+			images,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Restore a Reader to its initial clean state.
+// Retrieves the tree ensemble resource stamp token, number of trees and growing statistics.
 //
 // Arguments:
-//	reader_handle: Handle to a Reader.
+//	tree_ensemble_handle: Handle to the tree ensemble.
 //
-// Returns the created operation.
-func ReaderResetV2(scope *Scope, reader_handle tf.Output) (o *tf.Operation) {
+// Returns Stamp token of the tree ensemble resource.The number of trees in the tree ensemble resource.The number of trees that were finished successfully.The number of layers we attempted to build (but not necessarily succeeded).Rank size 2 tensor that contains start and end ids of the nodes in the latest
+// layer.
+func BoostedTreesGetEnsembleStates(scope *Scope, tree_ensemble_handle tf.Output) (stamp_token tf.Output, num_trees tf.Output, num_finalized_trees tf.Output, num_attempted_layers tf.Output, last_layer_nodes_range tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "ReaderResetV2",
+		Type: "BoostedTreesGetEnsembleStates",
 		Input: []tf.Input{
-			reader_handle,
+			tree_ensemble_handle,
 		},
 	}
-	return scope.AddOperation(opspec)
-}
-
-// RpcAttr is an optional argument to Rpc.
-type RpcAttr func(optionalAttr)
-
-// RpcProtocol sets the optional protocol attribute to value.
-//
-// value: RPC protocol to use.  Empty string means use the default protocol.
-// Options include 'grpc'.
-// If not specified, defaults to ""
-func RpcProtocol(value string) RpcAttr {
-	return func(m optionalAttr) {
-		m["protocol"] = value
-	}
-}
-
-// RpcFailFast sets the optional fail_fast attribute to value.
-//
-// value: `boolean`. If `true` (default), then failures to connect
-// (i.e., the server does not immediately respond) cause an RPC failure.
-// If not specified, defaults to true
-func RpcFailFast(value bool) RpcAttr {
-	return func(m optionalAttr) {
-		m["fail_fast"] = value
-	}
-}
-
-// RpcTimeoutInMs sets the optional timeout_in_ms attribute to value.
-//
-// value: `int`. If `0` (default), then the kernel will run the RPC
-// request and only time out if the RPC deadline passes or the session times out.
-// If this value is greater than `0`, then the op will raise an exception if
-// the RPC takes longer than `timeout_in_ms`.
-// If not specified, defaults to 0
-func RpcTimeoutInMs(value int64) RpcAttr {
-	return func(m optionalAttr) {
-		m["timeout_in_ms"] = value
-	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0), op.Output(1), op.Output(2), op.Output(3), op.Output(4)
 }
 
-// Perform batches of RPC requests.
-//
-// This op asynchronously performs either a single RPC request, or a batch
-// of requests.  RPC requests are defined by three main parameters:
-//
-//   - `address` (the host+port or BNS address of the request)
-//   - `method` (the RPC method name for the request)
-//   - `request` (the serialized proto string, or vector of strings,
-//      of the RPC request argument).
-//
-// For example, if you have an RPC service running on port localhost:2345,
-// and its interface is configured with the following proto declaration:
-//
-// ```
-// service MyService {
-//   rpc MyMethod(MyRequestProto) returns (MyResponseProto) {
-//   }
-// };
-// ```
-//
-// then call this op with arguments:
-//
-// ```
-// address = "localhost:2345"
-// method = "MyService/MyMethod"
-// ```
-//
-// The `request` tensor is a string tensor representing serialized `MyRequestProto`
-// strings; and the output string tensor `response` will have the same shape
-// and contain (upon successful completion) corresponding serialized
-// `MyResponseProto` strings.
-//
-// For example, to send a single, empty, `MyRequestProto`, call
-// this op with `request = ""`.  To send 5 **parallel** empty requests,
-// call this op with `request = ["", "", "", "", ""]`.
-//
-// More generally, one can create a batch of `MyRequestProto` serialized protos
-// from regular batched tensors using the `encode_proto` op, and convert
-// the response `MyResponseProto` serialized protos to batched tensors
-// using the `decode_proto` op.
-//
-// **NOTE** Working with serialized proto strings is faster than instantiating
-// actual proto objects in memory, so no performance degradation is expected
-// compared to writing custom kernels for this workflow.
-//
-// If the connection fails or the remote worker returns an error
-// status, the op reraises this exception locally.
-//
-// See the `TryRpc` op if you prefer to handle RPC failures manually in the graph.
-//
-// Arguments:
-//	address: `0-D` or `1-D`.  The address (i.e. host_name:port) of the RPC server.
-// If this tensor has more than 1 element, then multiple parallel rpc requests
-// are sent.  This argument broadcasts with `method` and `request`.
-//	method: `0-D` or `1-D`.  The method address on the RPC server.
-// If this tensor has more than 1 element, then multiple parallel rpc requests
-// are sent.  This argument broadcasts with `address` and `request`.
-//	request: `0-D` or `1-D`.  Serialized proto strings: the rpc request argument.
-// If this tensor has more than 1 element, then multiple parallel rpc requests
-// are sent.  This argument broadcasts with `address` and `method`.
+// Gets the next output from the given iterator.
 //
-// Returns Same shape as `request`. Serialized proto strings: the rpc responses.
-func Rpc(scope *Scope, address tf.Output, method tf.Output, request tf.Output, optional ...RpcAttr) (response tf.Output) {
+// This operation is a synchronous version IteratorGetNext. It should only be used
+// in situations where the iterator does not block the calling thread, or where
+// the calling thread is not a member of the thread pool used to execute parallel
+// operations (e.g. in eager mode).
+func IteratorGetNextSync(scope *Scope, iterator tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (components []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
+	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "Rpc",
+		Type: "IteratorGetNextSync",
 		Input: []tf.Input{
-			address, method, request,
+			iterator,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	if scope.Err() != nil {
+		return
+	}
+	var idx int
+	var err error
+	if components, idx, err = makeOutputList(op, idx, "components"); err != nil {
+		scope.UpdateErr("IteratorGetNextSync", err)
+		return
+	}
+	return components
 }
 
-// OrderedMapStageAttr is an optional argument to OrderedMapStage.
-type OrderedMapStageAttr func(optionalAttr)
+// SampleDistortedBoundingBoxV2Attr is an optional argument to SampleDistortedBoundingBoxV2.
+type SampleDistortedBoundingBoxV2Attr func(optionalAttr)
 
-// OrderedMapStageCapacity sets the optional capacity attribute to value.
+// SampleDistortedBoundingBoxV2Seed sets the optional seed attribute to value.
 //
-// value: Maximum number of elements in the Staging Area. If > 0, inserts
-// on the container will block when the capacity is reached.
+// value: If either `seed` or `seed2` are set to non-zero, the random number
+// generator is seeded by the given `seed`.  Otherwise, it is seeded by a random
+// seed.
 // If not specified, defaults to 0
-//
-// REQUIRES: value >= 0
-func OrderedMapStageCapacity(value int64) OrderedMapStageAttr {
+func SampleDistortedBoundingBoxV2Seed(value int64) SampleDistortedBoundingBoxV2Attr {
 	return func(m optionalAttr) {
-		m["capacity"] = value
+		m["seed"] = value
 	}
 }
 
-// OrderedMapStageMemoryLimit sets the optional memory_limit attribute to value.
-// If not specified, defaults to 0
+// SampleDistortedBoundingBoxV2Seed2 sets the optional seed2 attribute to value.
 //
-// REQUIRES: value >= 0
-func OrderedMapStageMemoryLimit(value int64) OrderedMapStageAttr {
+// value: A second seed to avoid seed collision.
+// If not specified, defaults to 0
+func SampleDistortedBoundingBoxV2Seed2(value int64) SampleDistortedBoundingBoxV2Attr {
 	return func(m optionalAttr) {
-		m["memory_limit"] = value
+		m["seed2"] = value
 	}
 }
 
-// OrderedMapStageContainer sets the optional container attribute to value.
+// SampleDistortedBoundingBoxV2AspectRatioRange sets the optional aspect_ratio_range attribute to value.
 //
-// value: If non-empty, this queue is placed in the given container. Otherwise,
-// a default container is used.
-// If not specified, defaults to ""
-func OrderedMapStageContainer(value string) OrderedMapStageAttr {
+// value: The cropped area of the image must have an aspect ratio =
+// width / height within this range.
+// If not specified, defaults to <f:0.75 f:1.33 >
+func SampleDistortedBoundingBoxV2AspectRatioRange(value []float32) SampleDistortedBoundingBoxV2Attr {
 	return func(m optionalAttr) {
-		m["container"] = value
+		m["aspect_ratio_range"] = value
 	}
 }
 
-// OrderedMapStageSharedName sets the optional shared_name attribute to value.
+// SampleDistortedBoundingBoxV2AreaRange sets the optional area_range attribute to value.
 //
-// value: It is necessary to match this name to the matching Unstage Op.
-// If not specified, defaults to ""
-func OrderedMapStageSharedName(value string) OrderedMapStageAttr {
+// value: The cropped area of the image must contain a fraction of the
+// supplied image within this range.
+// If not specified, defaults to <f:0.05 f:1 >
+func SampleDistortedBoundingBoxV2AreaRange(value []float32) SampleDistortedBoundingBoxV2Attr {
 	return func(m optionalAttr) {
-		m["shared_name"] = value
+		m["area_range"] = value
 	}
 }
 
-// Stage (key, values) in the underlying container which behaves like a ordered
-//
-// associative container.   Elements are ordered by key.
-//
-// Arguments:
-//	key: int64
-//
-//	values: a list of tensors
-// dtypes A list of data types that inserted values should adhere to.
-//
+// SampleDistortedBoundingBoxV2MaxAttempts sets the optional max_attempts attribute to value.
 //
-// Returns the created operation.
-func OrderedMapStage(scope *Scope, key tf.Output, indices tf.Output, values []tf.Output, dtypes []tf.DataType, optional ...OrderedMapStageAttr) (o *tf.Operation) {
+// value: Number of attempts at generating a cropped region of the image
+// of the specified constraints. After `max_attempts` failures, return the entire
+// image.
+// If not specified, defaults to 100
+func SampleDistortedBoundingBoxV2MaxAttempts(value int64) SampleDistortedBoundingBoxV2Attr {
+	return func(m optionalAttr) {
+		m["max_attempts"] = value
+	}
+}
+
+// SampleDistortedBoundingBoxV2UseImageIfNoBoundingBoxes sets the optional use_image_if_no_bounding_boxes attribute to value.
+//
+// value: Controls behavior if no bounding boxes supplied.
+// If true, assume an implicit bounding box covering the whole input. If false,
+// raise an error.
+// If not specified, defaults to false
+func SampleDistortedBoundingBoxV2UseImageIfNoBoundingBoxes(value bool) SampleDistortedBoundingBoxV2Attr {
+	return func(m optionalAttr) {
+		m["use_image_if_no_bounding_boxes"] = value
+	}
+}
+
+// Generate a single randomly distorted bounding box for an image.
+//
+// Bounding box annotations are often supplied in addition to ground-truth labels
+// in image recognition or object localization tasks. A common technique for
+// training such a system is to randomly distort an image while preserving
+// its content, i.e. *data augmentation*. This Op outputs a randomly distorted
+// localization of an object, i.e. bounding box, given an `image_size`,
+// `bounding_boxes` and a series of constraints.
+//
+// The output of this Op is a single bounding box that may be used to crop the
+// original image. The output is returned as 3 tensors: `begin`, `size` and
+// `bboxes`. The first 2 tensors can be fed directly into `tf.slice` to crop the
+// image. The latter may be supplied to `tf.image.draw_bounding_boxes` to visualize
+// what the bounding box looks like.
+//
+// Bounding boxes are supplied and returned as `[y_min, x_min, y_max, x_max]`. The
+// bounding box coordinates are floats in `[0.0, 1.0]` relative to the width and
+// height of the underlying image.
+//
+// For example,
+//
+// ```python
+//     # Generate a single distorted bounding box.
+//     begin, size, bbox_for_draw = tf.image.sample_distorted_bounding_box(
+//         tf.shape(image),
+//         bounding_boxes=bounding_boxes)
+//
+//     # Draw the bounding box in an image summary.
+//     image_with_box = tf.image.draw_bounding_boxes(tf.expand_dims(image, 0),
+//                                                   bbox_for_draw)
+//     tf.summary.image('images_with_box', image_with_box)
+//
+//     # Employ the bounding box to distort the image.
+//     distorted_image = tf.slice(image, begin, size)
+// ```
+//
+// Note that if no bounding box information is available, setting
+// `use_image_if_no_bounding_boxes = true` will assume there is a single implicit
+// bounding box covering the whole image. If `use_image_if_no_bounding_boxes` is
+// false and no bounding boxes are supplied, an error is raised.
+//
+// Arguments:
+//	image_size: 1-D, containing `[height, width, channels]`.
+//	bounding_boxes: 3-D with shape `[batch, N, 4]` describing the N bounding boxes
+// associated with the image.
+//	min_object_covered: The cropped area of the image must contain at least this
+// fraction of any bounding box supplied. The value of this parameter should be
+// non-negative. In the case of 0, the cropped area does not need to overlap
+// any of the bounding boxes supplied.
+//
+// Returns 1-D, containing `[offset_height, offset_width, 0]`. Provide as input to
+// `tf.slice`.1-D, containing `[target_height, target_width, -1]`. Provide as input to
+// `tf.slice`.3-D with shape `[1, 1, 4]` containing the distorted bounding box.
+// Provide as input to `tf.image.draw_bounding_boxes`.
+func SampleDistortedBoundingBoxV2(scope *Scope, image_size tf.Output, bounding_boxes tf.Output, min_object_covered tf.Output, optional ...SampleDistortedBoundingBoxV2Attr) (begin tf.Output, size tf.Output, bboxes tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtypes": dtypes}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "OrderedMapStage",
+		Type: "SampleDistortedBoundingBoxV2",
 		Input: []tf.Input{
-			key, indices, tf.OutputList(values),
+			image_size, bounding_boxes, min_object_covered,
 		},
 		Attrs: attrs,
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// StackPushV2Attr is an optional argument to StackPushV2.
-type StackPushV2Attr func(optionalAttr)
+// ExtractGlimpseAttr is an optional argument to ExtractGlimpse.
+type ExtractGlimpseAttr func(optionalAttr)
 
-// StackPushV2SwapMemory sets the optional swap_memory attribute to value.
+// ExtractGlimpseCentered sets the optional centered attribute to value.
 //
-// value: Swap `elem` to CPU. Default to false.
-// If not specified, defaults to false
-func StackPushV2SwapMemory(value bool) StackPushV2Attr {
+// value: indicates if the offset coordinates are centered relative to
+// the image, in which case the (0, 0) offset is relative to the center
+// of the input images. If false, the (0,0) offset corresponds to the
+// upper left corner of the input images.
+// If not specified, defaults to true
+func ExtractGlimpseCentered(value bool) ExtractGlimpseAttr {
 	return func(m optionalAttr) {
-		m["swap_memory"] = value
+		m["centered"] = value
 	}
 }
 
-// Push an element onto the stack.
+// ExtractGlimpseNormalized sets the optional normalized attribute to value.
+//
+// value: indicates if the offset coordinates are normalized.
+// If not specified, defaults to true
+func ExtractGlimpseNormalized(value bool) ExtractGlimpseAttr {
+	return func(m optionalAttr) {
+		m["normalized"] = value
+	}
+}
+
+// ExtractGlimpseUniformNoise sets the optional uniform_noise attribute to value.
+//
+// value: indicates if the noise should be generated using a
+// uniform distribution or a Gaussian distribution.
+// If not specified, defaults to true
+func ExtractGlimpseUniformNoise(value bool) ExtractGlimpseAttr {
+	return func(m optionalAttr) {
+		m["uniform_noise"] = value
+	}
+}
+
+// Extracts a glimpse from the input tensor.
+//
+// Returns a set of windows called glimpses extracted at location
+// `offsets` from the input tensor. If the windows only partially
+// overlaps the inputs, the non overlapping areas will be filled with
+// random noise.
+//
+// The result is a 4-D tensor of shape `[batch_size, glimpse_height,
+// glimpse_width, channels]`. The channels and batch dimensions are the
+// same as that of the input tensor. The height and width of the output
+// windows are specified in the `size` parameter.
+//
+// The argument `normalized` and `centered` controls how the windows are built:
+//
+// * If the coordinates are normalized but not centered, 0.0 and 1.0
+//   correspond to the minimum and maximum of each height and width
+//   dimension.
+// * If the coordinates are both normalized and centered, they range from
+//   -1.0 to 1.0. The coordinates (-1.0, -1.0) correspond to the upper
+//   left corner, the lower right corner is located at (1.0, 1.0) and the
+//   center is at (0, 0).
+// * If the coordinates are not normalized they are interpreted as
+//   numbers of pixels.
 //
 // Arguments:
-//	handle: The handle to a stack.
-//	elem: The tensor to be pushed onto the stack.
+//	input: A 4-D float tensor of shape `[batch_size, height, width, channels]`.
+//	size: A 1-D tensor of 2 elements containing the size of the glimpses
+// to extract.  The glimpse height must be specified first, following
+// by the glimpse width.
+//	offsets: A 2-D integer tensor of shape `[batch_size, 2]` containing
+// the y, x locations of the center of each window.
 //
-// Returns The same tensor as the input 'elem'.
-func StackPushV2(scope *Scope, handle tf.Output, elem tf.Output, optional ...StackPushV2Attr) (output tf.Output) {
+// Returns A tensor representing the glimpses `[batch_size,
+// glimpse_height, glimpse_width, channels]`.
+func ExtractGlimpse(scope *Scope, input tf.Output, size tf.Output, offsets tf.Output, optional ...ExtractGlimpseAttr) (glimpse tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -24666,9 +24204,9 @@ func StackPushV2(scope *Scope, handle tf.Output, elem tf.Output, optional ...Sta
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "StackPushV2",
+		Type: "ExtractGlimpse",
 		Input: []tf.Input{
-			handle, elem,
+			input, size, offsets,
 		},
 		Attrs: attrs,
 	}
@@ -24676,56 +24214,72 @@ func StackPushV2(scope *Scope, handle tf.Output, elem tf.Output, optional ...Sta
 	return op.Output(0)
 }
 
-// Creates a dataset that concatenates `input_dataset` with `another_dataset`.
-func ConcatenateDataset(scope *Scope, input_dataset tf.Output, another_dataset tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
+// A container for an iterator resource.
+//
+// Returns A handle to the iterator that can be passed to a "MakeIterator"
+// or "IteratorGetNext" op.
+func Iterator(scope *Scope, shared_name string, container string, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
+	attrs := map[string]interface{}{"shared_name": shared_name, "container": container, "output_types": output_types, "output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "ConcatenateDataset",
-		Input: []tf.Input{
-			input_dataset, another_dataset,
-		},
+		Type: "Iterator",
+
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Adds a value to the current value of a variable.
+// CropAndResizeGradImageAttr is an optional argument to CropAndResizeGradImage.
+type CropAndResizeGradImageAttr func(optionalAttr)
+
+// CropAndResizeGradImageMethod sets the optional method attribute to value.
 //
-// Any ReadVariableOp with a control dependency on this op is guaranteed to
-// see the incremented value or a subsequent newer one.
+// value: A string specifying the interpolation method. Only 'bilinear' is
+// supported for now.
+// If not specified, defaults to "bilinear"
+func CropAndResizeGradImageMethod(value string) CropAndResizeGradImageAttr {
+	return func(m optionalAttr) {
+		m["method"] = value
+	}
+}
+
+// Computes the gradient of the crop_and_resize op wrt the input image tensor.
 //
 // Arguments:
-//	resource: handle to the resource in which to store the variable.
-//	value: the value by which the variable will be incremented.
+//	grads: A 4-D tensor of shape `[num_boxes, crop_height, crop_width, depth]`.
+//	boxes: A 2-D tensor of shape `[num_boxes, 4]`. The `i`-th row of the tensor
+// specifies the coordinates of a box in the `box_ind[i]` image and is specified
+// in normalized coordinates `[y1, x1, y2, x2]`. A normalized coordinate value of
+// `y` is mapped to the image coordinate at `y * (image_height - 1)`, so as the
+// `[0, 1]` interval of normalized image height is mapped to
+// `[0, image_height - 1] in image height coordinates. We do allow y1 > y2, in
+// which case the sampled crop is an up-down flipped version of the original
+// image. The width dimension is treated similarly. Normalized coordinates
+// outside the `[0, 1]` range are allowed, in which case we use
+// `extrapolation_value` to extrapolate the input image values.
+//	box_ind: A 1-D tensor of shape `[num_boxes]` with int32 values in `[0, batch)`.
+// The value of `box_ind[i]` specifies the image that the `i`-th box refers to.
+//	image_size: A 1-D tensor with value `[batch, image_height, image_width, depth]`
+// containing the original image size. Both `image_height` and `image_width` need
+// to be positive.
 //
-// Returns the created operation.
-func AssignAddVariableOp(scope *Scope, resource tf.Output, value tf.Output) (o *tf.Operation) {
+//
+// Returns A 4-D tensor of shape `[batch, image_height, image_width, depth]`.
+func CropAndResizeGradImage(scope *Scope, grads tf.Output, boxes tf.Output, box_ind tf.Output, image_size tf.Output, T tf.DataType, optional ...CropAndResizeGradImageAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	opspec := tf.OpSpec{
-		Type: "AssignAddVariableOp",
-		Input: []tf.Input{
-			resource, value,
-		},
-	}
-	return scope.AddOperation(opspec)
-}
-
-// Records the latency of producing `input_dataset` elements in a StatsAggregator.
-func LatencyStatsDataset(scope *Scope, input_dataset tf.Output, tag tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
-	if scope.Err() != nil {
-		return
+	attrs := map[string]interface{}{"T": T}
+	for _, a := range optional {
+		a(attrs)
 	}
-	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "LatencyStatsDataset",
+		Type: "CropAndResizeGradImage",
 		Input: []tf.Input{
-			input_dataset, tag,
+			grads, boxes, box_ind, image_size,
 		},
 		Attrs: attrs,
 	}
@@ -24733,139 +24287,118 @@ func LatencyStatsDataset(scope *Scope, input_dataset tf.Output, tag tf.Output, o
 	return op.Output(0)
 }
 
-// MapSizeAttr is an optional argument to MapSize.
-type MapSizeAttr func(optionalAttr)
+// ShuffleDatasetAttr is an optional argument to ShuffleDataset.
+type ShuffleDatasetAttr func(optionalAttr)
 
-// MapSizeCapacity sets the optional capacity attribute to value.
-// If not specified, defaults to 0
+// ShuffleDatasetReshuffleEachIteration sets the optional reshuffle_each_iteration attribute to value.
 //
-// REQUIRES: value >= 0
-func MapSizeCapacity(value int64) MapSizeAttr {
-	return func(m optionalAttr) {
-		m["capacity"] = value
-	}
-}
-
-// MapSizeMemoryLimit sets the optional memory_limit attribute to value.
-// If not specified, defaults to 0
-//
-// REQUIRES: value >= 0
-func MapSizeMemoryLimit(value int64) MapSizeAttr {
-	return func(m optionalAttr) {
-		m["memory_limit"] = value
-	}
-}
-
-// MapSizeContainer sets the optional container attribute to value.
-// If not specified, defaults to ""
-func MapSizeContainer(value string) MapSizeAttr {
-	return func(m optionalAttr) {
-		m["container"] = value
-	}
-}
-
-// MapSizeSharedName sets the optional shared_name attribute to value.
-// If not specified, defaults to ""
-func MapSizeSharedName(value string) MapSizeAttr {
+// value: If true, each iterator over this dataset will be given
+// a different pseudorandomly generated seed, based on a sequence seeded by the
+// `seed` and `seed2` inputs. If false, each iterator will be given the same
+// seed, and repeated iteration over this dataset will yield the exact same
+// sequence of results.
+// If not specified, defaults to true
+func ShuffleDatasetReshuffleEachIteration(value bool) ShuffleDatasetAttr {
 	return func(m optionalAttr) {
-		m["shared_name"] = value
+		m["reshuffle_each_iteration"] = value
 	}
 }
 
-// Op returns the number of elements in the underlying container.
-func MapSize(scope *Scope, dtypes []tf.DataType, optional ...MapSizeAttr) (size tf.Output) {
+// Creates a dataset that shuffles elements from `input_dataset` pseudorandomly.
+//
+// Arguments:
+//
+//	buffer_size: The number of output elements to buffer in an iterator over
+// this dataset. Compare with the `min_after_dequeue` attr when creating a
+// `RandomShuffleQueue`.
+//	seed: A scalar seed for the random number generator. If either `seed` or
+// `seed2` is set to be non-zero, the random number generator is seeded
+// by the given seed.  Otherwise, a random seed is used.
+//	seed2: A second scalar seed to avoid seed collision.
+//
+//
+func ShuffleDataset(scope *Scope, input_dataset tf.Output, buffer_size tf.Output, seed tf.Output, seed2 tf.Output, output_types []tf.DataType, output_shapes []tf.Shape, optional ...ShuffleDatasetAttr) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtypes": dtypes}
+	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "MapSize",
-
+		Type: "ShuffleDataset",
+		Input: []tf.Input{
+			input_dataset, buffer_size, seed, seed2,
+		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Convert JSON-encoded Example records to binary protocol buffer strings.
+// 3D fast Fourier transform.
 //
-// This op translates a tensor containing Example records, encoded using
-// the [standard JSON
-// mapping](https://developers.google.com/protocol-buffers/docs/proto3#json),
-// into a tensor containing the same records encoded as binary protocol
-// buffers. The resulting tensor can then be fed to any of the other
-// Example-parsing ops.
+// Computes the 3-dimensional discrete Fourier transform over the inner-most 3
+// dimensions of `input`.
 //
 // Arguments:
-//	json_examples: Each string is a JSON object serialized according to the JSON
-// mapping of the Example proto.
+//	input: A complex64 tensor.
 //
-// Returns Each string is a binary Example protocol buffer corresponding
-// to the respective element of `json_examples`.
-func DecodeJSONExample(scope *Scope, json_examples tf.Output) (binary_examples tf.Output) {
+// Returns A complex64 tensor of the same shape as `input`. The inner-most 3
+//   dimensions of `input` are replaced with their 3D Fourier transform.
+//
+// @compatibility(numpy)
+// Equivalent to np.fft.fftn with 3 dimensions.
+// @end_compatibility
+func FFT3D(scope *Scope, input tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "DecodeJSONExample",
+		Type: "FFT3D",
 		Input: []tf.Input{
-			json_examples,
+			input,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// SparseToDenseAttr is an optional argument to SparseToDense.
-type SparseToDenseAttr func(optionalAttr)
+// CropAndResizeGradBoxesAttr is an optional argument to CropAndResizeGradBoxes.
+type CropAndResizeGradBoxesAttr func(optionalAttr)
 
-// SparseToDenseValidateIndices sets the optional validate_indices attribute to value.
+// CropAndResizeGradBoxesMethod sets the optional method attribute to value.
 //
-// value: If true, indices are checked to make sure they are sorted in
-// lexicographic order and that there are no repeats.
-// If not specified, defaults to true
-func SparseToDenseValidateIndices(value bool) SparseToDenseAttr {
+// value: A string specifying the interpolation method. Only 'bilinear' is
+// supported for now.
+// If not specified, defaults to "bilinear"
+func CropAndResizeGradBoxesMethod(value string) CropAndResizeGradBoxesAttr {
 	return func(m optionalAttr) {
-		m["validate_indices"] = value
+		m["method"] = value
 	}
 }
 
-// Converts a sparse representation into a dense tensor.
-//
-// Builds an array `dense` with shape `output_shape` such that
-//
-// ```
-// # If sparse_indices is scalar
-// dense[i] = (i == sparse_indices ? sparse_values : default_value)
-//
-// # If sparse_indices is a vector, then for each i
-// dense[sparse_indices[i]] = sparse_values[i]
-//
-// # If sparse_indices is an n by d matrix, then for each i in [0, n)
-// dense[sparse_indices[i][0], ..., sparse_indices[i][d-1]] = sparse_values[i]
-// ```
-//
-// All other values in `dense` are set to `default_value`.  If `sparse_values` is a
-// scalar, all sparse indices are set to this single value.
-//
-// Indices should be sorted in lexicographic order, and indices must not
-// contain any repeats. If `validate_indices` is true, these properties
-// are checked during execution.
+// Computes the gradient of the crop_and_resize op wrt the input boxes tensor.
 //
 // Arguments:
-//	sparse_indices: 0-D, 1-D, or 2-D.  `sparse_indices[i]` contains the complete
-// index where `sparse_values[i]` will be placed.
-//	output_shape: 1-D.  Shape of the dense output tensor.
-//	sparse_values: 1-D.  Values corresponding to each row of `sparse_indices`,
-// or a scalar value to be used for all sparse indices.
-//	default_value: Scalar value to set for indices not specified in
-// `sparse_indices`.
+//	grads: A 4-D tensor of shape `[num_boxes, crop_height, crop_width, depth]`.
+//	image: A 4-D tensor of shape `[batch, image_height, image_width, depth]`.
+// Both `image_height` and `image_width` need to be positive.
+//	boxes: A 2-D tensor of shape `[num_boxes, 4]`. The `i`-th row of the tensor
+// specifies the coordinates of a box in the `box_ind[i]` image and is specified
+// in normalized coordinates `[y1, x1, y2, x2]`. A normalized coordinate value of
+// `y` is mapped to the image coordinate at `y * (image_height - 1)`, so as the
+// `[0, 1]` interval of normalized image height is mapped to
+// `[0, image_height - 1] in image height coordinates. We do allow y1 > y2, in
+// which case the sampled crop is an up-down flipped version of the original
+// image. The width dimension is treated similarly. Normalized coordinates
+// outside the `[0, 1]` range are allowed, in which case we use
+// `extrapolation_value` to extrapolate the input image values.
+//	box_ind: A 1-D tensor of shape `[num_boxes]` with int32 values in `[0, batch)`.
+// The value of `box_ind[i]` specifies the image that the `i`-th box refers to.
 //
-// Returns Dense output tensor of shape `output_shape`.
-func SparseToDense(scope *Scope, sparse_indices tf.Output, output_shape tf.Output, sparse_values tf.Output, default_value tf.Output, optional ...SparseToDenseAttr) (dense tf.Output) {
+// Returns A 2-D tensor of shape `[num_boxes, 4]`.
+func CropAndResizeGradBoxes(scope *Scope, grads tf.Output, image tf.Output, boxes tf.Output, box_ind tf.Output, optional ...CropAndResizeGradBoxesAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -24874,9 +24407,9 @@ func SparseToDense(scope *Scope, sparse_indices tf.Output, output_shape tf.Outpu
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "SparseToDense",
+		Type: "CropAndResizeGradBoxes",
 		Input: []tf.Input{
-			sparse_indices, output_shape, sparse_values, default_value,
+			grads, image, boxes, box_ind,
 		},
 		Attrs: attrs,
 	}
@@ -24884,299 +24417,227 @@ func SparseToDense(scope *Scope, sparse_indices tf.Output, output_shape tf.Outpu
 	return op.Output(0)
 }
 
-// Computes the grayscale dilation of 4-D `input` and 3-D `filter` tensors.
-//
-// The `input` tensor has shape `[batch, in_height, in_width, depth]` and the
-// `filter` tensor has shape `[filter_height, filter_width, depth]`, i.e., each
-// input channel is processed independently of the others with its own structuring
-// function. The `output` tensor has shape
-// `[batch, out_height, out_width, depth]`. The spatial dimensions of the output
-// tensor depend on the `padding` algorithm. We currently only support the default
-// "NHWC" `data_format`.
-//
-// In detail, the grayscale morphological 2-D dilation is the max-sum correlation
-// (for consistency with `conv2d`, we use unmirrored filters):
-//
-//     output[b, y, x, c] =
-//        max_{dy, dx} input[b,
-//                           strides[1] * y + rates[1] * dy,
-//                           strides[2] * x + rates[2] * dx,
-//                           c] +
-//                     filter[dy, dx, c]
-//
-// Max-pooling is a special case when the filter has size equal to the pooling
-// kernel size and contains all zeros.
-//
-// Note on duality: The dilation of `input` by the `filter` is equal to the
-// negation of the erosion of `-input` by the reflected `filter`.
-//
-// Arguments:
-//	input: 4-D with shape `[batch, in_height, in_width, depth]`.
-//	filter: 3-D with shape `[filter_height, filter_width, depth]`.
-//	strides: The stride of the sliding window for each dimension of the input
-// tensor. Must be: `[1, stride_height, stride_width, 1]`.
-//	rates: The input stride for atrous morphological dilation. Must be:
-// `[1, rate_height, rate_width, 1]`.
-//	padding: The type of padding algorithm to use.
+// Saves tensors in V2 checkpoint format.
 //
-// Returns 4-D with shape `[batch, out_height, out_width, depth]`.
-func Dilation2D(scope *Scope, input tf.Output, filter tf.Output, strides []int64, rates []int64, padding string) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"strides": strides, "rates": rates, "padding": padding}
-	opspec := tf.OpSpec{
-		Type: "Dilation2D",
-		Input: []tf.Input{
-			input, filter,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Converts the given variant tensor to an iterator and stores it in the given resource.
+// By default, saves the named tensors in full.  If the caller wishes to save
+// specific slices of full tensors, "shape_and_slices" should be non-empty strings
+// and correspondingly well-formed.
 //
 // Arguments:
-//	resource_handle: A handle to an iterator resource.
-//	serialized: A variant tensor storing the state of the iterator contained in the
-// resource.
+//	prefix: Must have a single element. The prefix of the V2 checkpoint to which we
+// write the tensors.
+//	tensor_names: shape {N}. The names of the tensors to be saved.
+//	shape_and_slices: shape {N}.  The slice specs of the tensors to be saved.
+// Empty strings indicate that they are non-partitioned tensors.
+//	tensors: `N` tensors to save.
 //
 // Returns the created operation.
-func DeserializeIterator(scope *Scope, resource_handle tf.Output, serialized tf.Output) (o *tf.Operation) {
+func SaveV2(scope *Scope, prefix tf.Output, tensor_names tf.Output, shape_and_slices tf.Output, tensors []tf.Output) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "DeserializeIterator",
+		Type: "SaveV2",
 		Input: []tf.Input{
-			resource_handle, serialized,
+			prefix, tensor_names, shape_and_slices, tf.OutputList(tensors),
 		},
 	}
 	return scope.AddOperation(opspec)
 }
 
-// TensorArrayConcatV2Attr is an optional argument to TensorArrayConcatV2.
-type TensorArrayConcatV2Attr func(optionalAttr)
+// StatsAggregatorHandleAttr is an optional argument to StatsAggregatorHandle.
+type StatsAggregatorHandleAttr func(optionalAttr)
 
-// TensorArrayConcatV2ElementShapeExcept0 sets the optional element_shape_except0 attribute to value.
-// If not specified, defaults to <unknown_rank:true >
-func TensorArrayConcatV2ElementShapeExcept0(value tf.Shape) TensorArrayConcatV2Attr {
+// StatsAggregatorHandleContainer sets the optional container attribute to value.
+// If not specified, defaults to ""
+func StatsAggregatorHandleContainer(value string) StatsAggregatorHandleAttr {
 	return func(m optionalAttr) {
-		m["element_shape_except0"] = value
+		m["container"] = value
 	}
 }
 
-// Deprecated. Use TensorArrayConcatV3
-func TensorArrayConcatV2(scope *Scope, handle tf.Output, flow_in tf.Output, dtype tf.DataType, optional ...TensorArrayConcatV2Attr) (value tf.Output, lengths tf.Output) {
+// StatsAggregatorHandleSharedName sets the optional shared_name attribute to value.
+// If not specified, defaults to ""
+func StatsAggregatorHandleSharedName(value string) StatsAggregatorHandleAttr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
+	}
+}
+
+// Creates a statistics manager resource.
+func StatsAggregatorHandle(scope *Scope, optional ...StatsAggregatorHandleAttr) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtype": dtype}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "TensorArrayConcatV2",
-		Input: []tf.Input{
-			handle, flow_in,
-		},
+		Type: "StatsAggregatorHandle",
+
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1)
+	return op.Output(0)
 }
 
-// Creates a dataset that batches and pads `batch_size` elements from the input.
+// Greedily selects a subset of bounding boxes in descending order of score,
 //
-// Arguments:
+// pruning away boxes that have high intersection-over-union (IOU) overlap
+// with previously selected boxes.  Bounding boxes are supplied as
+// [y1, x1, y2, x2], where (y1, x1) and (y2, x2) are the coordinates of any
+// diagonal pair of box corners and the coordinates can be provided as normalized
+// (i.e., lying in the interval [0, 1]) or absolute.  Note that this algorithm
+// is agnostic to where the origin is in the coordinate system.  Note that this
+// algorithm is invariant to orthogonal transformations and translations
+// of the coordinate system; thus translating or reflections of the coordinate
+// system result in the same boxes being selected by the algorithm.
 //
-//	batch_size: A scalar representing the number of elements to accumulate in a
-// batch.
-//	padded_shapes: A list of int64 tensors representing the desired padded shapes
-// of the corresponding output components. These shapes may be partially
-// specified, using `-1` to indicate that a particular dimension should be
-// padded to the maximum size of all batch elements.
-//	padding_values: A list of scalars containing the padding value to use for
-// each of the outputs.
+// The output of this operation is a set of integers indexing into the input
+// collection of bounding boxes representing the selected boxes.  The bounding
+// box coordinates corresponding to the selected indices can then be obtained
+// using the `tf.gather operation`.  For example:
 //
-func PaddedBatchDataset(scope *Scope, input_dataset tf.Output, batch_size tf.Output, padded_shapes []tf.Output, padding_values []tf.Output, output_shapes []tf.Shape) (handle tf.Output) {
+//   selected_indices = tf.image.non_max_suppression_v2(
+//       boxes, scores, max_output_size, iou_threshold)
+//   selected_boxes = tf.gather(boxes, selected_indices)
+//
+// Arguments:
+//	boxes: A 2-D float tensor of shape `[num_boxes, 4]`.
+//	scores: A 1-D float tensor of shape `[num_boxes]` representing a single
+// score corresponding to each box (each row of boxes).
+//	max_output_size: A scalar integer tensor representing the maximum number of
+// boxes to be selected by non max suppression.
+//	iou_threshold: A 0-D float tensor representing the threshold for deciding whether
+// boxes overlap too much with respect to IOU.
+//
+// Returns A 1-D integer tensor of shape `[M]` representing the selected
+// indices from the boxes tensor, where `M <= max_output_size`.
+func NonMaxSuppressionV2(scope *Scope, boxes tf.Output, scores tf.Output, max_output_size tf.Output, iou_threshold tf.Output) (selected_indices tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "PaddedBatchDataset",
+		Type: "NonMaxSuppressionV2",
 		Input: []tf.Input{
-			input_dataset, batch_size, tf.OutputList(padded_shapes), tf.OutputList(padding_values),
+			boxes, scores, max_output_size, iou_threshold,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Creates a dataset that batches input elements into a SparseTensor.
+// Computes the matrix logarithm of one or more square matrices:
+//
+//
+// log(exp(A)) = A
+//
+// This op is only defined for complex matrices. If A is positive-definite and
+// real, then casting to a complex matrix, taking the logarithm and casting back
+// to a real matrix will give the correct result.
+//
+// This function computes the matrix logarithm using the Schur-Parlett algorithm.
+// Details of the algorithm can be found in Section 11.6.2 of:
+// Nicholas J. Higham, Functions of Matrices: Theory and Computation, SIAM 2008.
+// ISBN 978-0-898716-46-7.
+//
+// The input is a tensor of shape `[..., M, M]` whose inner-most 2 dimensions
+// form square matrices. The output is a tensor of the same shape as the input
+// containing the exponential for all input submatrices `[..., :, :]`.
 //
 // Arguments:
-//	input_dataset: A handle to an input dataset. Must have a single component.
-//	batch_size: A scalar representing the number of elements to accumulate in a
-// batch.
-//	row_shape: A vector representing the dense shape of each row in the produced
-// SparseTensor. The shape may be partially specified, using `-1` to indicate
-// that a particular dimension should use the maximum size of all batch elements.
+//	input: Shape is `[..., M, M]`.
 //
+// Returns Shape is `[..., M, M]`.
 //
-func DenseToSparseBatchDataset(scope *Scope, input_dataset tf.Output, batch_size tf.Output, row_shape tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
+// @compatibility(scipy)
+// Equivalent to scipy.linalg.logm
+// @end_compatibility
+func MatrixLogarithm(scope *Scope, input tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "DenseToSparseBatchDataset",
+		Type: "MatrixLogarithm",
 		Input: []tf.Input{
-			input_dataset, batch_size, row_shape,
+			input,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Deprecated. Use TensorArrayGradV3
-//
-// DEPRECATED at GraphDef version 26: Use TensorArrayGradV3
-func TensorArrayGradV2(scope *Scope, handle tf.Output, flow_in tf.Output, source string) (grad_handle tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"source": source}
-	opspec := tf.OpSpec{
-		Type: "TensorArrayGradV2",
-		Input: []tf.Input{
-			handle, flow_in,
-		},
-		Attrs: attrs,
+// EncodeProtoAttr is an optional argument to EncodeProto.
+type EncodeProtoAttr func(optionalAttr)
+
+// EncodeProtoDescriptorSource sets the optional descriptor_source attribute to value.
+// If not specified, defaults to "local://"
+func EncodeProtoDescriptorSource(value string) EncodeProtoAttr {
+	return func(m optionalAttr) {
+		m["descriptor_source"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// Return substrings from `Tensor` of strings.
-//
-// For each string in the input `Tensor`, creates a substring starting at index
-// `pos` with a total length of `len`.
-//
-// If `len` defines a substring that would extend beyond the length of the input
-// string, then as many characters as possible are used.
-//
-// If `pos` is negative or specifies a character index larger than any of the input
-// strings, then an `InvalidArgumentError` is thrown.
-//
-// `pos` and `len` must have the same shape, otherwise a `ValueError` is thrown on
-// Op creation.
-//
-// *NOTE*: `Substr` supports broadcasting up to two dimensions. More about
-// broadcasting
-// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
-//
-// ---
-//
-// Examples
-//
-// Using scalar `pos` and `len`:
-//
-// ```python
-// input = [b'Hello', b'World']
-// position = 1
-// length = 3
+// The op serializes protobuf messages provided in the input tensors.
 //
-// output = [b'ell', b'orl']
-// ```
+// The types of the tensors in `values` must match the schema for the
+// fields specified in `field_names`. All the tensors in `values` must
+// have a common shape prefix, *batch_shape*.
 //
-// Using `pos` and `len` with same shape as `input`:
+// The `sizes` tensor specifies repeat counts for each field.  The repeat
+// count (last dimension) of a each tensor in `values` must be greater
+// than or equal to corresponding repeat count in `sizes`.
 //
-// ```python
-// input = [[b'ten', b'eleven', b'twelve'],
-//          [b'thirteen', b'fourteen', b'fifteen'],
-//          [b'sixteen', b'seventeen', b'eighteen']]
-// position = [[1, 2, 3],
-//             [1, 2, 3],
-//             [1, 2, 3]]
-// length =   [[2, 3, 4],
-//             [4, 3, 2],
-//             [5, 5, 5]]
+// A `message_type` name must be provided to give context for the field
+// names. The actual message descriptor can be looked up either in the
+// linked-in descriptor pool or a filename provided by the caller using
+// the `descriptor_source` attribute.
 //
-// output = [[b'en', b'eve', b'lve'],
-//           [b'hirt', b'urt', b'te'],
-//           [b'ixtee', b'vente', b'hteen']]
-// ```
+// The `descriptor_source` attribute selects a source of protocol
+// descriptors to consult when looking up `message_type`. This may be a
+// filename containing a serialized `FileDescriptorSet` message,
+// or the special value `local://`, in which case only descriptors linked
+// into the code will be searched; the filename can be on any filesystem
+// accessible to TensorFlow.
 //
-// Broadcasting `pos` and `len` onto `input`:
+// You can build a `descriptor_source` file using the `--descriptor_set_out`
+// and `--include_imports` options to the protocol compiler `protoc`.
 //
-// ```
-// input = [[b'ten', b'eleven', b'twelve'],
-//          [b'thirteen', b'fourteen', b'fifteen'],
-//          [b'sixteen', b'seventeen', b'eighteen'],
-//          [b'nineteen', b'twenty', b'twentyone']]
-// position = [1, 2, 3]
-// length =   [1, 2, 3]
+// The `local://` database only covers descriptors linked into the
+// code via C++ libraries, not Python imports. You can link in a proto descriptor
+// by creating a cc_library target with alwayslink=1.
 //
-// output = [[b'e', b'ev', b'lve'],
-//           [b'h', b'ur', b'tee'],
-//           [b'i', b've', b'hte'],
-//           [b'i', b'en', b'nty']]
-// ```
+// There are a few special cases in the value mapping:
 //
-// Broadcasting `input` onto `pos` and `len`:
+// Submessage and group fields must be pre-serialized as TensorFlow strings.
 //
-// ```
-// input = b'thirteen'
-// position = [1, 5, 7]
-// length =   [3, 2, 1]
+// TensorFlow lacks support for unsigned int64s, so they must be
+// represented as `tf.int64` with the same twos-complement bit pattern
+// (the obvious way).
 //
-// output = [b'hir', b'ee', b'n']
-// ```
+// Unsigned int32 values can be represented exactly with `tf.int64`, or
+// with sign wrapping if the input is of type `tf.int32`.
 //
 // Arguments:
-//	input: Tensor of strings
-//	pos: Scalar defining the position of first character in each substring
-//	len: Scalar defining the number of characters to include in each substring
+//	sizes: Tensor of int32 with shape `[batch_shape, len(field_names)]`.
+//	values: List of tensors containing values for the corresponding field.
+//	field_names: List of strings containing proto field names.
+//	message_type: Name of the proto message type to decode.
 //
-// Returns Tensor of substrings
-func Substr(scope *Scope, input tf.Output, pos tf.Output, len tf.Output) (output tf.Output) {
+// Returns Tensor of serialized protos with shape `batch_shape`.
+func EncodeProto(scope *Scope, sizes tf.Output, values []tf.Output, field_names []string, message_type string, optional ...EncodeProtoAttr) (bytes tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	opspec := tf.OpSpec{
-		Type: "Substr",
-		Input: []tf.Input{
-			input, pos, len,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Creates a Dataset that returns pseudorandom numbers.
-//
-// Arguments:
-//	seed: A scalar seed for the random number generator. If either seed or
-// seed2 is set to be non-zero, the random number generator is seeded
-// by the given seed.  Otherwise, a random seed is used.
-//	seed2: A second scalar seed to avoid seed collision.
-//
-//
-func RandomDataset(scope *Scope, seed tf.Output, seed2 tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
-	if scope.Err() != nil {
-		return
+	attrs := map[string]interface{}{"field_names": field_names, "message_type": message_type}
+	for _, a := range optional {
+		a(attrs)
 	}
-	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "RandomDataset",
+		Type: "EncodeProto",
 		Input: []tf.Input{
-			seed, seed2,
+			sizes, tf.OutputList(values),
 		},
 		Attrs: attrs,
 	}
@@ -25184,277 +24645,381 @@ func RandomDataset(scope *Scope, seed tf.Output, seed2 tf.Output, output_types [
 	return op.Output(0)
 }
 
-// Creates a dataset that shuffles and repeats elements from `input_dataset`
+// Creates a TensorArray for storing the gradients of values in the given handle.
 //
-// pseudorandomly.
+// If the given TensorArray gradient already exists, returns a reference to it.
 //
-// Arguments:
+// Locks the size of the original TensorArray by disabling its dynamic size flag.
 //
-//	buffer_size: The number of output elements to buffer in an iterator over
-// this dataset. Compare with the `min_after_dequeue` attr when creating a
-// `RandomShuffleQueue`.
-//	seed: A scalar seed for the random number generator. If either `seed` or
-// `seed2` is set to be non-zero, the random number generator is seeded
-// by the given seed.  Otherwise, a random seed is used.
-//	seed2: A second scalar seed to avoid seed collision.
-//	count: A scalar representing the number of times the underlying dataset
-// should be repeated. The default is `-1`, which results in infinite repetition.
+// **A note about the input flow_in:**
+//
+// The handle flow_in forces the execution of the gradient lookup to occur
+// only after certain other operations have occurred.  For example, when
+// the forward TensorArray is dynamically sized, writes to this TensorArray
+// may resize the object.  The gradient TensorArray is statically sized based
+// on the size of the forward TensorArray when this operation executes.
+// Furthermore, the size of the forward TensorArray is frozen by this call.
+// As a result, the flow is used to ensure that the call to generate the gradient
+// TensorArray only happens after all writes are executed.
 //
+// In the case of dynamically sized TensorArrays, gradient computation should
+// only be performed on read operations that have themselves been chained via
+// flow to occur only after all writes have executed. That way the final size
+// of the forward TensorArray is known when this operation is called.
 //
-func ShuffleAndRepeatDataset(scope *Scope, input_dataset tf.Output, buffer_size tf.Output, seed tf.Output, seed2 tf.Output, count tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
+// **A note about the source attribute:**
+//
+// TensorArray gradient calls use an accumulator TensorArray object.  If
+// multiple gradients are calculated and run in the same session, the multiple
+// gradient nodes may accidentally flow through the same accumulator TensorArray.
+// This double counts and generally breaks the TensorArray gradient flow.
+//
+// The solution is to identify which gradient call this particular
+// TensorArray gradient is being called in.  This is performed by identifying
+// a unique string (e.g. "gradients", "gradients_1", ...) from the input
+// gradient Tensor's name.  This string is used as a suffix when creating
+// the TensorArray gradient object here (the attribute `source`).
+//
+// The attribute `source` is added as a suffix to the forward TensorArray's
+// name when performing the creation / lookup, so that each separate gradient
+// calculation gets its own TensorArray accumulator.
+//
+// Arguments:
+//	handle: The handle to the forward TensorArray.
+//	flow_in: A float scalar that enforces proper chaining of operations.
+//	source: The gradient source string, used to decide which gradient TensorArray
+// to return.
+func TensorArrayGradV3(scope *Scope, handle tf.Output, flow_in tf.Output, source string) (grad_handle tf.Output, flow_out tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
+	attrs := map[string]interface{}{"source": source}
 	opspec := tf.OpSpec{
-		Type: "ShuffleAndRepeatDataset",
+		Type: "TensorArrayGradV3",
 		Input: []tf.Input{
-			input_dataset, buffer_size, seed, seed2, count,
+			handle, flow_in,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1)
 }
 
-// Creates a dataset that caches elements from `input_dataset`.
+// DecodeProtoV2Attr is an optional argument to DecodeProtoV2.
+type DecodeProtoV2Attr func(optionalAttr)
+
+// DecodeProtoV2DescriptorSource sets the optional descriptor_source attribute to value.
 //
-// A CacheDataset will iterate over the input_dataset, and store tensors. If the
-// cache already exists, the cache will be used. If the cache is inappropriate
-// (e.g. cannot be opened, contains tensors of the wrong shape / size), an error
-// will the returned when used.
+// value: Either the special value `local://` or a path to a file containing
+// a serialized `FileDescriptorSet`.
+// If not specified, defaults to "local://"
+func DecodeProtoV2DescriptorSource(value string) DecodeProtoV2Attr {
+	return func(m optionalAttr) {
+		m["descriptor_source"] = value	}
+}
+
+// DecodeProtoV2MessageFormat sets the optional message_format attribute to value.
 //
-// Arguments:
+// value: Either `binary` or `text`.
+// If not specified, defaults to "binary"
+func DecodeProtoV2MessageFormat(value string) DecodeProtoV2Attr {
+	return func(m optionalAttr) {
+		m["message_format"] = value
+	}
+}
+
+// DecodeProtoV2Sanitize sets the optional sanitize attribute to value.
 //
-//	filename: A path on the filesystem where we should cache the dataset. Note: this
-// will be a directory.
+// value: Whether to sanitize the result or not.
+// If not specified, defaults to false
+func DecodeProtoV2Sanitize(value bool) DecodeProtoV2Attr {
+	return func(m optionalAttr) {
+		m["sanitize"] = value
+	}
+}
+
+// The op extracts fields from a serialized protocol buffers message into tensors.
+//
+// The `decode_proto` op extracts fields from a serialized protocol buffers
+// message into tensors.  The fields in `field_names` are decoded and converted
+// to the corresponding `output_types` if possible.
 //
+// A `message_type` name must be provided to give context for the field
+// names. The actual message descriptor can be looked up either in the
+// linked-in descriptor pool or a filename provided by the caller using
+// the `descriptor_source` attribute.
 //
-func CacheDataset(scope *Scope, input_dataset tf.Output, filename tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
+// Each output tensor is a dense tensor. This means that it is padded to
+// hold the largest number of repeated elements seen in the input
+// minibatch. (The shape is also padded by one to prevent zero-sized
+// dimensions). The actual repeat counts for each example in the
+// minibatch can be found in the `sizes` output. In many cases the output
+// of `decode_proto` is fed immediately into tf.squeeze if missing values
+// are not a concern. When using tf.squeeze, always pass the squeeze
+// dimension explicitly to avoid surprises.
+//
+// For the most part, the mapping between Proto field types and
+// TensorFlow dtypes is straightforward. However, there are a few
+// special cases:
+//
+// - A proto field that contains a submessage or group can only be converted
+// to `DT_STRING` (the serialized submessage). This is to reduce the
+// complexity of the API. The resulting string can be used as input
+// to another instance of the decode_proto op.
+//
+// - TensorFlow lacks support for unsigned integers. The ops represent uint64
+// types as a `DT_INT64` with the same twos-complement bit pattern
+// (the obvious way). Unsigned int32 values can be represented exactly by
+// specifying type `DT_INT64`, or using twos-complement if the caller
+// specifies `DT_INT32` in the `output_types` attribute.
+//
+// The `descriptor_source` attribute selects a source of protocol
+// descriptors to consult when looking up `message_type`. This may be a
+// filename containing a serialized `FileDescriptorSet` message,
+// or the special value `local://`, in which case only descriptors linked
+// into the code will be searched; the filename can be on any filesystem
+// accessible to TensorFlow.
+//
+// You can build a `descriptor_source` file using the `--descriptor_set_out`
+// and `--include_imports` options to the protocol compiler `protoc`.
+//
+// The `local://` database only covers descriptors linked into the
+// code via C++ libraries, not Python imports. You can link in a proto descriptor
+// by creating a cc_library target with alwayslink=1.
+//
+// Both binary and text proto serializations are supported, and can be
+// chosen using the `format` attribute.
+//
+// Arguments:
+//	bytes: Tensor of serialized protos with shape `batch_shape`.
+//	message_type: Name of the proto message type to decode.
+//	field_names: List of strings containing proto field names.
+//	output_types: List of TF types to use for the respective field in field_names.
+//
+// Returns Tensor of int32 with shape `[batch_shape, len(field_names)]`.
+// Each entry is the number of values found for the corresponding field.
+// Optional fields may have 0 or 1 values.List of tensors containing values for the corresponding field.
+// `values[i]` has datatype `output_types[i]`
+// and shape `[batch_shape, max(sizes[...,i])]`.
+func DecodeProtoV2(scope *Scope, bytes tf.Output, message_type string, field_names []string, output_types []tf.DataType, optional ...DecodeProtoV2Attr) (sizes tf.Output, values []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
+	attrs := map[string]interface{}{"message_type": message_type, "field_names": field_names, "output_types": output_types}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "CacheDataset",
+		Type: "DecodeProtoV2",
 		Input: []tf.Input{
-			input_dataset, filename,
+			bytes,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	if scope.Err() != nil {
+		return
+	}
+	var idx int
+	var err error
+	sizes = op.Output(idx)
+	if values, idx, err = makeOutputList(op, idx, "values"); err != nil {
+		scope.UpdateErr("DecodeProtoV2", err)
+		return
+	}
+	return sizes, values
 }
 
-// Creates a dataset that executes a SQL query and emits rows of the result set.
-//
-// Arguments:
-//	driver_name: The database type. Currently, the only supported type is 'sqlite'.
-//	data_source_name: A connection string to connect to the database.
-//	query: A SQL query to execute.
-//
-//
-func SqlDataset(scope *Scope, driver_name tf.Output, data_source_name tf.Output, query tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
+// Creates a dataset that splits a SparseTensor into elements row-wise.
+func SparseTensorSliceDataset(scope *Scope, indices tf.Output, values tf.Output, dense_shape tf.Output) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "SqlDataset",
+		Type: "SparseTensorSliceDataset",
 		Input: []tf.Input{
-			driver_name, data_source_name, query,
+			indices, values, dense_shape,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Creates a dataset that emits the records from one or more binary files.
+// Returns x / y element-wise for real types.
 //
-// Arguments:
-//	filenames: A scalar or a vector containing the name(s) of the file(s) to be
-// read.
-//	header_bytes: A scalar representing the number of bytes to skip at the
-// beginning of a file.
-//	record_bytes: A scalar representing the number of bytes in each record.
-//	footer_bytes: A scalar representing the number of bytes to skip at the end
-// of a file.
-//	buffer_size: A scalar representing the number of bytes to buffer. Must be > 0.
-func FixedLengthRecordDataset(scope *Scope, filenames tf.Output, header_bytes tf.Output, record_bytes tf.Output, footer_bytes tf.Output, buffer_size tf.Output) (handle tf.Output) {
+// If `x` and `y` are reals, this will return the floating-point division.
+//
+// *NOTE*: `Div` supports broadcasting. More about broadcasting
+// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
+func RealDiv(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "FixedLengthRecordDataset",
+		Type: "RealDiv",
 		Input: []tf.Input{
-			filenames, header_bytes, record_bytes, footer_bytes, buffer_size,
+			x, y,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Gradients for batch normalization.
-//
-// DEPRECATED at GraphDef version 9: Use tf.nn.batch_normalization()
+//     Adds v into specified rows of x.
 //
-// This op is deprecated. See `tf.nn.batch_normalization`.
+//     Computes y = x; y[i, :] += v; return y.
 //
 // Arguments:
-//	t: A 4D input Tensor.
-//	m: A 1D mean Tensor with size matching the last dimension of t.
-// This is the first output from tf.nn.moments,
-// or a saved moving average thereof.
-//	v: A 1D variance Tensor with size matching the last dimension of t.
-// This is the second output from tf.nn.moments,
-// or a saved moving average thereof.
-//	gamma: A 1D gamma Tensor with size matching the last dimension of t.
-// If "scale_after_normalization" is true, this Tensor will be multiplied
-// with the normalized Tensor.
-//	backprop: 4D backprop Tensor.
-//	variance_epsilon: A small float number to avoid dividing by 0.
-//	scale_after_normalization: A bool indicating whether the resulted tensor
-// needs to be multiplied with gamma.
+//	x: A `Tensor` of type T.
+//	i: A vector. Indices into the left-most dimension of `x`.
+//	v: A `Tensor` of type T. Same dimension sizes as x except the first dimension, which must be the same as i's size.
 //
-// Returns 4D backprop tensor for input.1D backprop tensor for mean.1D backprop tensor for variance.1D backprop tensor for beta.1D backprop tensor for gamma.
-func BatchNormWithGlobalNormalizationGrad(scope *Scope, t tf.Output, m tf.Output, v tf.Output, gamma tf.Output, backprop tf.Output, variance_epsilon float32, scale_after_normalization bool) (dx tf.Output, dm tf.Output, dv tf.Output, db tf.Output, dg tf.Output) {
+// Returns A `Tensor` of type T. An alias of `x`. The content of `y` is undefined if there are duplicates in `i`.
+func InplaceAdd(scope *Scope, x tf.Output, i tf.Output, v tf.Output) (y tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"variance_epsilon": variance_epsilon, "scale_after_normalization": scale_after_normalization}
 	opspec := tf.OpSpec{
-		Type: "BatchNormWithGlobalNormalizationGrad",
+		Type: "InplaceAdd",
 		Input: []tf.Input{
-			t, m, v, gamma, backprop,
+			x, i, v,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2), op.Output(3), op.Output(4)
+	return op.Output(0)
 }
 
-// Creates a dataset that emits the records from one or more TFRecord files.
+// Restore a Reader to its initial clean state.
 //
 // Arguments:
-//	filenames: A scalar or vector containing the name(s) of the file(s) to be
-// read.
-//	compression_type: A scalar containing either (i) the empty string (no
-// compression), (ii) "ZLIB", or (iii) "GZIP".
-//	buffer_size: A scalar representing the number of bytes to buffer. A value of
-// 0 means no buffering will be performed.
-func TFRecordDataset(scope *Scope, filenames tf.Output, compression_type tf.Output, buffer_size tf.Output) (handle tf.Output) {
+//	reader_handle: Handle to a Reader.
+//
+// Returns the created operation.
+func ReaderResetV2(scope *Scope, reader_handle tf.Output) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "TFRecordDataset",
+		Type: "ReaderResetV2",
 		Input: []tf.Input{
-			filenames, compression_type, buffer_size,
+			reader_handle,
 		},
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return scope.AddOperation(opspec)
 }
 
-// BatchToSpace for 4-D tensors of type T.
-//
-// This is a legacy version of the more general BatchToSpaceND.
-//
-// Rearranges (permutes) data from batch into blocks of spatial data, followed by
-// cropping. This is the reverse transformation of SpaceToBatch. More specifically,
-// this op outputs a copy of the input tensor where values from the `batch`
-// dimension are moved in spatial blocks to the `height` and `width` dimensions,
-// followed by cropping along the `height` and `width` dimensions.
-//
-// Arguments:
-//	input: 4-D tensor with shape
-// `[batch*block_size*block_size, height_pad/block_size, width_pad/block_size,
-//   depth]`. Note that the batch size of the input tensor must be divisible by
-// `block_size * block_size`.
-//	crops: 2-D tensor of non-negative integers with shape `[2, 2]`. It specifies
-// how many elements to crop from the intermediate result across the spatial
-// dimensions as follows:
-//
-//     crops = [[crop_top, crop_bottom], [crop_left, crop_right]]
-//
-//
-// Returns 4-D with shape `[batch, height, width, depth]`, where:
-//
-//       height = height_pad - crop_top - crop_bottom
-//       width = width_pad - crop_left - crop_right
-//
-// The attr `block_size` must be greater than one. It indicates the block size.
+// RpcAttr is an optional argument to Rpc.
+type RpcAttr func(optionalAttr)
+
+// RpcProtocol sets the optional protocol attribute to value.
 //
-// Some examples:
+// value: RPC protocol to use.  Empty string means use the default protocol.
+// Options include 'grpc'.
+// If not specified, defaults to ""
+func RpcProtocol(value string) RpcAttr {
+	return func(m optionalAttr) {
+		m["protocol"] = value
+	}
+}
+
+// RpcFailFast sets the optional fail_fast attribute to value.
 //
-// (1) For the following input of shape `[4, 1, 1, 1]` and block_size of 2:
+// value: `boolean`. If `true` (default), then failures to connect
+// (i.e., the server does not immediately respond) cause an RPC failure.
+// If not specified, defaults to true
+func RpcFailFast(value bool) RpcAttr {
+	return func(m optionalAttr) {
+		m["fail_fast"] = value
+	}
+}
+
+// RpcTimeoutInMs sets the optional timeout_in_ms attribute to value.
 //
-// ```
-// [[[[1]]], [[[2]]], [[[3]]], [[[4]]]]
-// ```
+// value: `int`. If `0` (default), then the kernel will run the RPC
+// request and only time out if the RPC deadline passes or the session times out.
+// If this value is greater than `0`, then the op will raise an exception if
+// the RPC takes longer than `timeout_in_ms`.
+// If not specified, defaults to 0
+func RpcTimeoutInMs(value int64) RpcAttr {
+	return func(m optionalAttr) {
+		m["timeout_in_ms"] = value
+	}
+}
+
+// Perform batches of RPC requests.
 //
-// The output tensor has shape `[1, 2, 2, 1]` and value:
+// This op asynchronously performs either a single RPC request, or a batch
+// of requests.  RPC requests are defined by three main parameters:
 //
-// ```
-// x = [[[[1], [2]], [[3], [4]]]]
-// ```
+//   - `address` (the host+port or BNS address of the request)
+//   - `method` (the RPC method name for the request)
+//   - `request` (the serialized proto string, or vector of strings,
+//      of the RPC request argument).
 //
-// (2) For the following input of shape `[4, 1, 1, 3]` and block_size of 2:
+// For example, if you have an RPC service running on port localhost:2345,
+// and its interface is configured with the following proto declaration:
 //
 // ```
-// [[[1, 2, 3]], [[4, 5, 6]], [[7, 8, 9]], [[10, 11, 12]]]
+// service MyService {
+//   rpc MyMethod(MyRequestProto) returns (MyResponseProto) {
+//   }
+// };
 // ```
 //
-// The output tensor has shape `[1, 2, 2, 3]` and value:
+// then call this op with arguments:
 //
 // ```
-// x = [[[[1, 2, 3], [4, 5, 6]],
-//       [[7, 8, 9], [10, 11, 12]]]]
+// address = "localhost:2345"
+// method = "MyService/MyMethod"
 // ```
 //
-// (3) For the following input of shape `[4, 2, 2, 1]` and block_size of 2:
+// The `request` tensor is a string tensor representing serialized `MyRequestProto`
+// strings; and the output string tensor `response` will have the same shape
+// and contain (upon successful completion) corresponding serialized
+// `MyResponseProto` strings.
 //
-// ```
-// x = [[[[1], [3]], [[9], [11]]],
-//      [[[2], [4]], [[10], [12]]],
-//      [[[5], [7]], [[13], [15]]],
-//      [[[6], [8]], [[14], [16]]]]
-// ```
+// For example, to send a single, empty, `MyRequestProto`, call
+// this op with `request = ""`.  To send 5 **parallel** empty requests,
+// call this op with `request = ["", "", "", "", ""]`.
 //
-// The output tensor has shape `[1, 4, 4, 1]` and value:
+// More generally, one can create a batch of `MyRequestProto` serialized protos
+// from regular batched tensors using the `encode_proto` op, and convert
+// the response `MyResponseProto` serialized protos to batched tensors
+// using the `decode_proto` op.
 //
-// ```
-// x = [[[1],   [2],  [3],  [4]],
-//      [[5],   [6],  [7],  [8]],
-//      [[9],  [10], [11],  [12]],
-//      [[13], [14], [15],  [16]]]
-// ```
+// **NOTE** Working with serialized proto strings is faster than instantiating
+// actual proto objects in memory, so no performance degradation is expected
+// compared to writing custom kernels for this workflow.
 //
-// (4) For the following input of shape `[8, 1, 2, 1]` and block_size of 2:
+// If the connection fails or the remote worker returns an error
+// status, the op reraises this exception locally.
 //
-// ```
-// x = [[[[1], [3]]], [[[9], [11]]], [[[2], [4]]], [[[10], [12]]],
-//      [[[5], [7]]], [[[13], [15]]], [[[6], [8]]], [[[14], [16]]]]
-// ```
+// See the `TryRpc` op if you prefer to handle RPC failures manually in the graph.
 //
-// The output tensor has shape `[2, 2, 4, 1]` and value:
+// Arguments:
+//	address: `0-D` or `1-D`.  The address (i.e. host_name:port) of the RPC server.
+// If this tensor has more than 1 element, then multiple parallel rpc requests
+// are sent.  This argument broadcasts with `method` and `request`.
+//	method: `0-D` or `1-D`.  The method address on the RPC server.
+// If this tensor has more than 1 element, then multiple parallel rpc requests
+// are sent.  This argument broadcasts with `address` and `request`.
+//	request: `0-D` or `1-D`.  Serialized proto strings: the rpc request argument.
+// If this tensor has more than 1 element, then multiple parallel rpc requests
+// are sent.  This argument broadcasts with `address` and `method`.
 //
-// ```
-// x = [[[[1], [3]], [[5], [7]]],
-//      [[[2], [4]], [[10], [12]]],
-//      [[[5], [7]], [[13], [15]]],
-//      [[[6], [8]], [[14], [16]]]]
-// ```
-func BatchToSpace(scope *Scope, input tf.Output, crops tf.Output, block_size int64) (output tf.Output) {
+// Returns Same shape as `request`. Serialized proto strings: the rpc responses.
+func Rpc(scope *Scope, address tf.Output, method tf.Output, request tf.Output, optional ...RpcAttr) (response tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"block_size": block_size}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "BatchToSpace",
+		Type: "Rpc",
 		Input: []tf.Input{
-			input, crops,
+			address, method, request,
 		},
 		Attrs: attrs,
 	}
@@ -25462,640 +25027,537 @@ func BatchToSpace(scope *Scope, input tf.Output, crops tf.Output, block_size int
 	return op.Output(0)
 }
 
-// Makes a new iterator from the given `dataset` and stores it in `iterator`.
+// OrderedMapStageAttr is an optional argument to OrderedMapStage.
+type OrderedMapStageAttr func(optionalAttr)
+
+// OrderedMapStageCapacity sets the optional capacity attribute to value.
 //
-// This operation may be executed multiple times. Each execution will reset the
-// iterator in `iterator` to the first element of `dataset`.
+// value: Maximum number of elements in the Staging Area. If > 0, inserts
+// on the container will block when the capacity is reached.
+// If not specified, defaults to 0
 //
-// Returns the created operation.
-func MakeIterator(scope *Scope, dataset tf.Output, iterator tf.Output) (o *tf.Operation) {
-	if scope.Err() != nil {
-		return
+// REQUIRES: value >= 0
+func OrderedMapStageCapacity(value int64) OrderedMapStageAttr {
+	return func(m optionalAttr) {
+		m["capacity"] = value
 	}
-	opspec := tf.OpSpec{
-		Type: "MakeIterator",
-		Input: []tf.Input{
-			dataset, iterator,
-		},
+}
+
+// OrderedMapStageMemoryLimit sets the optional memory_limit attribute to value.
+// If not specified, defaults to 0
+//
+// REQUIRES: value >= 0
+func OrderedMapStageMemoryLimit(value int64) OrderedMapStageAttr {
+	return func(m optionalAttr) {
+		m["memory_limit"] = value
 	}
-	return scope.AddOperation(opspec)
 }
 
-// Makes the summary of accumulated stats for the batch.
+// OrderedMapStageContainer sets the optional container attribute to value.
 //
-// The summary stats contains gradients and hessians accumulated into the corresponding node and bucket for each example.
+// value: If non-empty, this queue is placed in the given container. Otherwise,
+// a default container is used.
+// If not specified, defaults to ""
+func OrderedMapStageContainer(value string) OrderedMapStageAttr {
+	return func(m optionalAttr) {
+		m["container"] = value
+	}
+}
+
+// OrderedMapStageSharedName sets the optional shared_name attribute to value.
+//
+// value: It is necessary to match this name to the matching Unstage Op.
+// If not specified, defaults to ""
+func OrderedMapStageSharedName(value string) OrderedMapStageAttr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
+	}
+}
+
+// Stage (key, values) in the underlying container which behaves like a ordered
+//
+// associative container.   Elements are ordered by key.
 //
 // Arguments:
-//	node_ids: int32 Rank 1 Tensor containing node ids, which each example falls into for the requested layer.
-//	gradients: float32; Rank 2 Tensor (shape=[#examples, 1]) for gradients.
-//	hessians: float32; Rank 2 Tensor (shape=[#examples, 1]) for hessians.
-//	bucketized_features_list: int32 list of Rank 1 Tensors, each containing the bucketized feature (for each feature column).
-//	max_splits: int; the maximum number of splits possible in the whole tree.
-//	num_buckets: int; equals to the maximum possible value of bucketized feature.
+//	key: int64
 //
-// Returns output Rank 4 Tensor (shape=[#features, #splits, #buckets, 2]) containing accumulated stats put into the corresponding node and bucket. The first index of 4th dimension refers to gradients, and the second to hessians.
-func BoostedTreesMakeStatsSummary(scope *Scope, node_ids tf.Output, gradients tf.Output, hessians tf.Output, bucketized_features_list []tf.Output, max_splits int64, num_buckets int64) (stats_summary tf.Output) {
+//	values: a list of tensors
+// dtypes A list of data types that inserted values should adhere to.
+//
+//
+// Returns the created operation.
+func OrderedMapStage(scope *Scope, key tf.Output, indices tf.Output, values []tf.Output, dtypes []tf.DataType, optional ...OrderedMapStageAttr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"max_splits": max_splits, "num_buckets": num_buckets}
+	attrs := map[string]interface{}{"dtypes": dtypes}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "BoostedTreesMakeStatsSummary",
+		Type: "OrderedMapStage",
 		Input: []tf.Input{
-			node_ids, gradients, hessians, tf.OutputList(bucketized_features_list),
+			key, indices, tf.OutputList(values),
 		},
 		Attrs: attrs,
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return scope.AddOperation(opspec)
 }
 
-// Adjust the contrast of one or more images.
+// StackPushV2Attr is an optional argument to StackPushV2.
+type StackPushV2Attr func(optionalAttr)
+
+// StackPushV2SwapMemory sets the optional swap_memory attribute to value.
 //
-// `images` is a tensor of at least 3 dimensions.  The last 3 dimensions are
-// interpreted as `[height, width, channels]`.  The other dimensions only
-// represent a collection of images, such as `[batch, height, width, channels].`
-//
-// Contrast is adjusted independently for each channel of each image.
-//
-// For each channel, the Op first computes the mean of the image pixels in the
-// channel and then adjusts each component of each pixel to
-// `(x - mean) * contrast_factor + mean`.
+// value: Swap `elem` to CPU. Default to false.
+// If not specified, defaults to false
+func StackPushV2SwapMemory(value bool) StackPushV2Attr {
+	return func(m optionalAttr) {
+		m["swap_memory"] = value
+	}
+}
+
+// Push an element onto the stack.
 //
 // Arguments:
-//	images: Images to adjust.  At least 3-D.
-//	contrast_factor: A float multiplier for adjusting contrast.
+//	handle: The handle to a stack.
+//	elem: The tensor to be pushed onto the stack.
 //
-// Returns The contrast-adjusted image or images.
-func AdjustContrastv2(scope *Scope, images tf.Output, contrast_factor tf.Output) (output tf.Output) {
+// Returns The same tensor as the input 'elem'.
+func StackPushV2(scope *Scope, handle tf.Output, elem tf.Output, optional ...StackPushV2Attr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "AdjustContrastv2",
+		Type: "StackPushV2",
 		Input: []tf.Input{
-			images, contrast_factor,
+			handle, elem,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Gets the next output from the given iterator.
-func IteratorGetNext(scope *Scope, iterator tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (components []tf.Output) {
+// Creates a dataset that concatenates `input_dataset` with `another_dataset`.
+func ConcatenateDataset(scope *Scope, input_dataset tf.Output, another_dataset tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "IteratorGetNext",
+		Type: "ConcatenateDataset",
 		Input: []tf.Input{
-			iterator,
+			input_dataset, another_dataset,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	if scope.Err() != nil {
-		return
-	}
-	var idx int
-	var err error
-	if components, idx, err = makeOutputList(op, idx, "components"); err != nil {
-		scope.UpdateErr("IteratorGetNext", err)
-		return
-	}
-	return components
+	return op.Output(0)
 }
 
-// Outputs the single element from the given dataset.
-//
-// Arguments:
-//	dataset: A handle to a dataset that contains a single element.
+// Adds a value to the current value of a variable.
 //
+// Any ReadVariableOp with a control dependency on this op is guaranteed to
+// see the incremented value or a subsequent newer one.
 //
+// Arguments:
+//	resource: handle to the resource in which to store the variable.
+//	value: the value by which the variable will be incremented.
 //
-// Returns The components of the single element of `input`.
-func DatasetToSingleElement(scope *Scope, dataset tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (components []tf.Output) {
+// Returns the created operation.
+func AssignAddVariableOp(scope *Scope, resource tf.Output, value tf.Output) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "DatasetToSingleElement",
+		Type: "AssignAddVariableOp",
 		Input: []tf.Input{
-			dataset,
+			resource, value,
 		},
-		Attrs: attrs,
 	}
-	op := scope.AddOperation(opspec)
+	return scope.AddOperation(opspec)
+}
+
+// Records the latency of producing `input_dataset` elements in a StatsAggregator.
+func LatencyStatsDataset(scope *Scope, input_dataset tf.Output, tag tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	var idx int
-	var err error
-	if components, idx, err = makeOutputList(op, idx, "components"); err != nil {
-		scope.UpdateErr("DatasetToSingleElement", err)
-		return
+	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
+	opspec := tf.OpSpec{
+		Type: "LatencyStatsDataset",
+		Input: []tf.Input{
+			input_dataset, tag,
+		},
+		Attrs: attrs,
 	}
-	return components
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// Converts the given `resource_handle` representing an iterator to a string.
+// Runs multiple additive regression ensemble predictors on input instances and
+//
+// computes the update to cached logits. It is designed to be used during training.
+// It traverses the trees starting from cached tree id and cached node id and
+// calculates the updates to be pushed to the cache.
 //
 // Arguments:
-//	resource_handle: A handle to an iterator resource.
 //
-// Returns A string representation of the given handle.
-func IteratorToStringHandle(scope *Scope, resource_handle tf.Output) (string_handle tf.Output) {
+//	cached_tree_ids: Rank 1 Tensor containing cached tree ids which is the starting
+// tree of prediction.
+//	cached_node_ids: Rank 1 Tensor containing cached node id which is the starting
+// node of prediction.
+//	bucketized_features: A list of rank 1 Tensors containing bucket id for each
+// feature.
+//	logits_dimension: scalar, dimension of the logits, to be used for partial logits
+// shape.
+//
+// Returns Rank 2 Tensor containing logits update (with respect to cached
+// values stored) for each example.Rank 1 Tensor containing new tree ids for each example.Rank 1 Tensor containing new node ids in the new tree_ids.
+func BoostedTreesTrainingPredict(scope *Scope, tree_ensemble_handle tf.Output, cached_tree_ids tf.Output, cached_node_ids tf.Output, bucketized_features []tf.Output, logits_dimension int64) (partial_logits tf.Output, tree_ids tf.Output, node_ids tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"logits_dimension": logits_dimension}
 	opspec := tf.OpSpec{
-		Type: "IteratorToStringHandle",
+		Type: "BoostedTreesTrainingPredict",
 		Input: []tf.Input{
-			resource_handle,
+			tree_ensemble_handle, cached_tree_ids, cached_node_ids, tf.OutputList(bucketized_features),
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// IteratorFromStringHandleAttr is an optional argument to IteratorFromStringHandle.
-type IteratorFromStringHandleAttr func(optionalAttr)
+// MapSizeAttr is an optional argument to MapSize.
+type MapSizeAttr func(optionalAttr)
 
-// IteratorFromStringHandleOutputTypes sets the optional output_types attribute to value.
-//
-// value: If specified, defines the type of each tuple component in an
-// element produced by the resulting iterator.
-// If not specified, defaults to <>
+// MapSizeCapacity sets the optional capacity attribute to value.
+// If not specified, defaults to 0
 //
-// REQUIRES: len(value) >= 0
-func IteratorFromStringHandleOutputTypes(value []tf.DataType) IteratorFromStringHandleAttr {
+// REQUIRES: value >= 0
+func MapSizeCapacity(value int64) MapSizeAttr {
 	return func(m optionalAttr) {
-		m["output_types"] = value
+		m["capacity"] = value
 	}
 }
 
-// IteratorFromStringHandleOutputShapes sets the optional output_shapes attribute to value.
-//
-// value: If specified, defines the shape of each tuple component in an
-// element produced by the resulting iterator.
-// If not specified, defaults to <>
+// MapSizeMemoryLimit sets the optional memory_limit attribute to value.
+// If not specified, defaults to 0
 //
-// REQUIRES: len(value) >= 0
-func IteratorFromStringHandleOutputShapes(value []tf.Shape) IteratorFromStringHandleAttr {
+// REQUIRES: value >= 0
+func MapSizeMemoryLimit(value int64) MapSizeAttr {
 	return func(m optionalAttr) {
-		m["output_shapes"] = value
+		m["memory_limit"] = value
 	}
 }
 
-// Converts the given string representing a handle to an iterator to a resource.
-//
-// Arguments:
-//	string_handle: A string representation of the given handle.
-//
-// Returns A handle to an iterator resource.
-func IteratorFromStringHandle(scope *Scope, string_handle tf.Output, optional ...IteratorFromStringHandleAttr) (resource_handle tf.Output) {
+// MapSizeContainer sets the optional container attribute to value.
+// If not specified, defaults to ""
+func MapSizeContainer(value string) MapSizeAttr {
+	return func(m optionalAttr) {
+		m["container"] = value
+	}
+}
+
+// MapSizeSharedName sets the optional shared_name attribute to value.
+// If not specified, defaults to ""
+func MapSizeSharedName(value string) MapSizeAttr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
+	}
+}
+
+// Op returns the number of elements in the underlying container.
+func MapSize(scope *Scope, dtypes []tf.DataType, optional ...MapSizeAttr) (size tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
+	attrs := map[string]interface{}{"dtypes": dtypes}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "IteratorFromStringHandle",
-		Input: []tf.Input{
-			string_handle,
-		},
+		Type: "MapSize",
+
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Gather slices from `params` axis `axis` according to `indices`.
-//
-// `indices` must be an integer tensor of any dimension (usually 0-D or 1-D).
-// Produces an output tensor with shape `params.shape[:axis] + indices.shape +
-// params.shape[axis + 1:]` where:
-//
-// ```python
-//     # Scalar indices (output is rank(params) - 1).
-//     output[a_0, ..., a_n, b_0, ..., b_n] =
-//       params[a_0, ..., a_n, indices, b_0, ..., b_n]
-//
-//     # Vector indices (output is rank(params)).
-//     output[a_0, ..., a_n, i, b_0, ..., b_n] =
-//       params[a_0, ..., a_n, indices[i], b_0, ..., b_n]
-//
-//     # Higher rank indices (output is rank(params) + rank(indices) - 1).
-//     output[a_0, ..., a_n, i, ..., j, b_0, ... b_n] =
-//       params[a_0, ..., a_n, indices[i, ..., j], b_0, ..., b_n]
-// ```
-//
-// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
-// <img style="width:100%" src="https://www.tensorflow.org/images/Gather.png" alt>
-// </div>
+// Convert JSON-encoded Example records to binary protocol buffer strings.
 //
-// Note that on CPU, if an out of bound index is found, an error is returned.
-// On GPU, if an out of bound index is found, a 0 is stored in the
-// corresponding output value.
+// This op translates a tensor containing Example records, encoded using
+// the [standard JSON
+// mapping](https://developers.google.com/protocol-buffers/docs/proto3#json),
+// into a tensor containing the same records encoded as binary protocol
+// buffers. The resulting tensor can then be fed to any of the other
+// Example-parsing ops.
 //
 // Arguments:
-//	params: The tensor from which to gather values. Must be at least rank
-// `axis + 1`.
-//	indices: Index tensor. Must be in range `[0, params.shape[axis])`.
-//	axis: The axis in `params` to gather `indices` from. Defaults to the first
-// dimension. Supports negative indexes.
+//	json_examples: Each string is a JSON object serialized according to the JSON
+// mapping of the Example proto.
 //
-// Returns Values from `params` gathered from indices given by `indices`, with
-// shape `params.shape[:axis] + indices.shape + params.shape[axis + 1:]`.
-func GatherV2(scope *Scope, params tf.Output, indices tf.Output, axis tf.Output) (output tf.Output) {
+// Returns Each string is a binary Example protocol buffer corresponding
+// to the respective element of `json_examples`.
+func DecodeJSONExample(scope *Scope, json_examples tf.Output) (binary_examples tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "GatherV2",
+		Type: "DecodeJSONExample",
 		Input: []tf.Input{
-			params, indices, axis,
+			json_examples,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Converts the given `resource_handle` representing an iterator to a variant tensor.
-//
-// Arguments:
-//	resource_handle: A handle to an iterator resource.
+// SparseToDenseAttr is an optional argument to SparseToDense.
+type SparseToDenseAttr func(optionalAttr)
+
+// SparseToDenseValidateIndices sets the optional validate_indices attribute to value.
 //
-// Returns A variant tensor storing the state of the iterator contained in the
-// resource.
-func SerializeIterator(scope *Scope, resource_handle tf.Output) (serialized tf.Output) {
-	if scope.Err() != nil {
-		return
+// value: If true, indices are checked to make sure they are sorted in
+// lexicographic order and that there are no repeats.
+// If not specified, defaults to true
+func SparseToDenseValidateIndices(value bool) SparseToDenseAttr {
+	return func(m optionalAttr) {
+		m["validate_indices"] = value
 	}
-	opspec := tf.OpSpec{
-		Type: "SerializeIterator",
-		Input: []tf.Input{
-			resource_handle,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// FIFOQueueV2Attr is an optional argument to FIFOQueueV2.
-type FIFOQueueV2Attr func(optionalAttr)
-
-// FIFOQueueV2Shapes sets the optional shapes attribute to value.
+// Converts a sparse representation into a dense tensor.
 //
-// value: The shape of each component in a value. The length of this attr must
-// be either 0 or the same as the length of component_types. If the length of
-// this attr is 0, the shapes of queue elements are not constrained, and
-// only one element may be dequeued at a time.
-// If not specified, defaults to <>
+// Builds an array `dense` with shape `output_shape` such that
 //
-// REQUIRES: len(value) >= 0
-func FIFOQueueV2Shapes(value []tf.Shape) FIFOQueueV2Attr {
-	return func(m optionalAttr) {
-		m["shapes"] = value
-	}
-}
-
-// FIFOQueueV2Capacity sets the optional capacity attribute to value.
+// ```
+// # If sparse_indices is scalar
+// dense[i] = (i == sparse_indices ? sparse_values : default_value)
 //
-// value: The upper bound on the number of elements in this queue.
-// Negative numbers mean no limit.
-// If not specified, defaults to -1
-func FIFOQueueV2Capacity(value int64) FIFOQueueV2Attr {
-	return func(m optionalAttr) {
-		m["capacity"] = value
-	}
-}
-
-// FIFOQueueV2Container sets the optional container attribute to value.
+// # If sparse_indices is a vector, then for each i
+// dense[sparse_indices[i]] = sparse_values[i]
 //
-// value: If non-empty, this queue is placed in the given container.
-// Otherwise, a default container is used.
-// If not specified, defaults to ""
-func FIFOQueueV2Container(value string) FIFOQueueV2Attr {
-	return func(m optionalAttr) {
-		m["container"] = value
-	}
-}
-
-// FIFOQueueV2SharedName sets the optional shared_name attribute to value.
+// # If sparse_indices is an n by d matrix, then for each i in [0, n)
+// dense[sparse_indices[i][0], ..., sparse_indices[i][d-1]] = sparse_values[i]
+// ```
 //
-// value: If non-empty, this queue will be shared under the given name
-// across multiple sessions.
-// If not specified, defaults to ""
-func FIFOQueueV2SharedName(value string) FIFOQueueV2Attr {
-	return func(m optionalAttr) {
-		m["shared_name"] = value
-	}
-}
-
-// A queue that produces elements in first-in first-out order.
+// All other values in `dense` are set to `default_value`.  If `sparse_values` is a
+// scalar, all sparse indices are set to this single value.
+//
+// Indices should be sorted in lexicographic order, and indices must not
+// contain any repeats. If `validate_indices` is true, these properties
+// are checked during execution.
 //
 // Arguments:
-//	component_types: The type of each component in a value.
+//	sparse_indices: 0-D, 1-D, or 2-D.  `sparse_indices[i]` contains the complete
+// index where `sparse_values[i]` will be placed.
+//	output_shape: 1-D.  Shape of the dense output tensor.
+//	sparse_values: 1-D.  Values corresponding to each row of `sparse_indices`,
+// or a scalar value to be used for all sparse indices.
+//	default_value: Scalar value to set for indices not specified in
+// `sparse_indices`.
 //
-// Returns The handle to the queue.
-func FIFOQueueV2(scope *Scope, component_types []tf.DataType, optional ...FIFOQueueV2Attr) (handle tf.Output) {
+// Returns Dense output tensor of shape `output_shape`.
+func SparseToDense(scope *Scope, sparse_indices tf.Output, output_shape tf.Output, sparse_values tf.Output, default_value tf.Output, optional ...SparseToDenseAttr) (dense tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"component_types": component_types}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "FIFOQueueV2",
-
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Produces a summary of any statistics recorded by the given statistics manager.
-func StatsAggregatorSummary(scope *Scope, iterator tf.Output) (summary tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "StatsAggregatorSummary",
+		Type: "SparseToDense",
 		Input: []tf.Input{
-			iterator,
+			sparse_indices, output_shape, sparse_values, default_value,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Compute the pairwise cross product.
+// Computes the grayscale dilation of 4-D `input` and 3-D `filter` tensors.
 //
-// `a` and `b` must be the same shape; they can either be simple 3-element vectors,
-// or any shape where the innermost dimension is 3. In the latter case, each pair
-// of corresponding 3-element vectors is cross-multiplied independently.
+// The `input` tensor has shape `[batch, in_height, in_width, depth]` and the
+// `filter` tensor has shape `[filter_height, filter_width, depth]`, i.e., each
+// input channel is processed independently of the others with its own structuring
+// function. The `output` tensor has shape
+// `[batch, out_height, out_width, depth]`. The spatial dimensions of the output
+// tensor depend on the `padding` algorithm. We currently only support the default
+// "NHWC" `data_format`.
+//
+// In detail, the grayscale morphological 2-D dilation is the max-sum correlation
+// (for consistency with `conv2d`, we use unmirrored filters):
+//
+//     output[b, y, x, c] =
+//        max_{dy, dx} input[b,
+//                           strides[1] * y + rates[1] * dy,
+//                           strides[2] * x + rates[2] * dx,
+//                           c] +
+//                     filter[dy, dx, c]
+//
+// Max-pooling is a special case when the filter has size equal to the pooling
+// kernel size and contains all zeros.
+//
+// Note on duality: The dilation of `input` by the `filter` is equal to the
+// negation of the erosion of `-input` by the reflected `filter`.
 //
 // Arguments:
-//	a: A tensor containing 3-element vectors.
-//	b: Another tensor, of same type and shape as `a`.
+//	input: 4-D with shape `[batch, in_height, in_width, depth]`.
+//	filter: 3-D with shape `[filter_height, filter_width, depth]`.
+//	strides: The stride of the sliding window for each dimension of the input
+// tensor. Must be: `[1, stride_height, stride_width, 1]`.
+//	rates: The input stride for atrous morphological dilation. Must be:
+// `[1, rate_height, rate_width, 1]`.
+//	padding: The type of padding algorithm to use.
 //
-// Returns Pairwise cross product of the vectors in `a` and `b`.
-func Cross(scope *Scope, a tf.Output, b tf.Output) (product tf.Output) {
+// Returns 4-D with shape `[batch, out_height, out_width, depth]`.
+func Dilation2D(scope *Scope, input tf.Output, filter tf.Output, strides []int64, rates []int64, padding string) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"strides": strides, "rates": rates, "padding": padding}
 	opspec := tf.OpSpec{
-		Type: "Cross",
+		Type: "Dilation2D",
 		Input: []tf.Input{
-			a, b,
+			input, filter,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Performs a padding as a preprocess during a convolution.
-//
-// Similar to FusedResizeAndPadConv2d, this op allows for an optimized
-// implementation where the spatial padding transformation stage is fused with the
-// im2col lookup, but in this case without the bilinear filtering required for
-// resizing. Fusing the padding prevents the need to write out the intermediate
-// results as whole tensors, reducing memory pressure, and we can get some latency
-// gains by merging the transformation calculations.
-// The data_format attribute for Conv2D isn't supported by this op, and 'NHWC'
-// order is used instead.
-// Internally this op uses a single per-graph scratch buffer, which means that it
-// will block if multiple versions are being run in parallel. This is because this
-// operator is primarily an optimization to minimize memory usage.
+// Converts the given variant tensor to an iterator and stores it in the given resource.
 //
 // Arguments:
-//	input: 4-D with shape `[batch, in_height, in_width, in_channels]`.
-//	paddings: A two-column matrix specifying the padding sizes. The number of
-// rows must be the same as the rank of `input`.
-//	filter: 4-D with shape
-// `[filter_height, filter_width, in_channels, out_channels]`.
+//	resource_handle: A handle to an iterator resource.
+//	serialized: A variant tensor storing the state of the iterator contained in the
+// resource.
 //
-//	strides: 1-D of length 4.  The stride of the sliding window for each dimension
-// of `input`. Must be in the same order as the dimension specified with format.
-//	padding: The type of padding algorithm to use.
-func FusedPadConv2D(scope *Scope, input tf.Output, paddings tf.Output, filter tf.Output, mode string, strides []int64, padding string) (output tf.Output) {
+// Returns the created operation.
+func DeserializeIterator(scope *Scope, resource_handle tf.Output, serialized tf.Output) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"mode": mode, "strides": strides, "padding": padding}
 	opspec := tf.OpSpec{
-		Type: "FusedPadConv2D",
+		Type: "DeserializeIterator",
 		Input: []tf.Input{
-			input, paddings, filter,
+			resource_handle, serialized,
 		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Conv2DBackpropInputAttr is an optional argument to Conv2DBackpropInput.
-type Conv2DBackpropInputAttr func(optionalAttr)
-
-// Conv2DBackpropInputUseCudnnOnGpu sets the optional use_cudnn_on_gpu attribute to value.
-// If not specified, defaults to true
-func Conv2DBackpropInputUseCudnnOnGpu(value bool) Conv2DBackpropInputAttr {
-	return func(m optionalAttr) {
-		m["use_cudnn_on_gpu"] = value
 	}
+	return scope.AddOperation(opspec)
 }
 
-// Conv2DBackpropInputDataFormat sets the optional data_format attribute to value.
-//
-// value: Specify the data format of the input and output data. With the
-// default format "NHWC", the data is stored in the order of:
-//     [batch, in_height, in_width, in_channels].
-// Alternatively, the format could be "NCHW", the data storage order of:
-//     [batch, in_channels, in_height, in_width].
-// If not specified, defaults to "NHWC"
-func Conv2DBackpropInputDataFormat(value string) Conv2DBackpropInputAttr {
-	return func(m optionalAttr) {
-		m["data_format"] = value
-	}
-}
+// TensorArrayConcatV2Attr is an optional argument to TensorArrayConcatV2.
+type TensorArrayConcatV2Attr func(optionalAttr)
 
-// Conv2DBackpropInputDilations sets the optional dilations attribute to value.
-//
-// value: 1-D tensor of length 4.  The dilation factor for each dimension of
-// `input`. If set to k > 1, there will be k-1 skipped cells between each filter
-// element on that dimension. The dimension order is determined by the value of
-// `data_format`, see above for details. Dilations in the batch and depth
-// dimensions must be 1.
-// If not specified, defaults to <i:1 i:1 i:1 i:1 >
-func Conv2DBackpropInputDilations(value []int64) Conv2DBackpropInputAttr {
+// TensorArrayConcatV2ElementShapeExcept0 sets the optional element_shape_except0 attribute to value.
+// If not specified, defaults to <unknown_rank:true >
+func TensorArrayConcatV2ElementShapeExcept0(value tf.Shape) TensorArrayConcatV2Attr {
 	return func(m optionalAttr) {
-		m["dilations"] = value
+		m["element_shape_except0"] = value
 	}
 }
 
-// Computes the gradients of convolution with respect to the input.
-//
-// Arguments:
-//	input_sizes: An integer vector representing the shape of `input`,
-// where `input` is a 4-D `[batch, height, width, channels]` tensor.
-//	filter: 4-D with shape
-// `[filter_height, filter_width, in_channels, out_channels]`.
-//	out_backprop: 4-D with shape `[batch, out_height, out_width, out_channels]`.
-// Gradients w.r.t. the output of the convolution.
-//	strides: The stride of the sliding window for each dimension of the input
-// of the convolution. Must be in the same order as the dimension specified with
-// format.
-//	padding: The type of padding algorithm to use.
-//
-// Returns 4-D with shape `[batch, in_height, in_width, in_channels]`.  Gradient
-// w.r.t. the input of the convolution.
-func Conv2DBackpropInput(scope *Scope, input_sizes tf.Output, filter tf.Output, out_backprop tf.Output, strides []int64, padding string, optional ...Conv2DBackpropInputAttr) (output tf.Output) {
+// Deprecated. Use TensorArrayConcatV3
+func TensorArrayConcatV2(scope *Scope, handle tf.Output, flow_in tf.Output, dtype tf.DataType, optional ...TensorArrayConcatV2Attr) (value tf.Output, lengths tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"strides": strides, "padding": padding}
+	attrs := map[string]interface{}{"dtype": dtype}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "Conv2DBackpropInput",
+		Type: "TensorArrayConcatV2",
 		Input: []tf.Input{
-			input_sizes, filter, out_backprop,
+			handle, flow_in,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1)
 }
 
-// Interleave the values from the `data` tensors into a single tensor.
+// Creates a dataset that batches and pads `batch_size` elements from the input.
 //
-// Builds a merged tensor such that
+// Arguments:
 //
-// ```python
-//     merged[indices[m][i, ..., j], ...] = data[m][i, ..., j, ...]
-// ```
-//
-// For example, if each `indices[m]` is scalar or vector, we have
-//
-// ```python
-//     # Scalar indices:
-//     merged[indices[m], ...] = data[m][...]
-//
-//     # Vector indices:
-//     merged[indices[m][i], ...] = data[m][i, ...]
-// ```
-//
-// Each `data[i].shape` must start with the corresponding `indices[i].shape`,
-// and the rest of `data[i].shape` must be constant w.r.t. `i`.  That is, we
-// must have `data[i].shape = indices[i].shape + constant`.  In terms of this
-// `constant`, the output shape is
-//
-//     merged.shape = [max(indices)] + constant
-//
-// Values are merged in order, so if an index appears in both `indices[m][i]` and
-// `indices[n][j]` for `(m,i) < (n,j)` the slice `data[n][j]` will appear in the
-// merged result. If you do not need this guarantee, ParallelDynamicStitch might
-// perform better on some devices.
-//
-// For example:
-//
-// ```python
-//     indices[0] = 6
-//     indices[1] = [4, 1]
-//     indices[2] = [[5, 2], [0, 3]]
-//     data[0] = [61, 62]
-//     data[1] = [[41, 42], [11, 12]]
-//     data[2] = [[[51, 52], [21, 22]], [[1, 2], [31, 32]]]
-//     merged = [[1, 2], [11, 12], [21, 22], [31, 32], [41, 42],
-//               [51, 52], [61, 62]]
-// ```
-//
-// This method can be used to merge partitions created by `dynamic_partition`
-// as illustrated on the following example:
-//
-// ```python
-//     # Apply function (increments x_i) on elements for which a certain condition
-//     # apply (x_i != -1 in this example).
-//     x=tf.constant([0.1, -1., 5.2, 4.3, -1., 7.4])
-//     condition_mask=tf.not_equal(x,tf.constant(-1.))
-//     partitioned_data = tf.dynamic_partition(
-//         x, tf.cast(condition_mask, tf.int32) , 2)
-//     partitioned_data[1] = partitioned_data[1] + 1.0
-//     condition_indices = tf.dynamic_partition(
-//         tf.range(tf.shape(x)[0]), tf.cast(condition_mask, tf.int32) , 2)
-//     x = tf.dynamic_stitch(condition_indices, partitioned_data)
-//     # Here x=[1.1, -1., 6.2, 5.3, -1, 8.4], the -1. values remain
-//     # unchanged.
-// ```
+//	batch_size: A scalar representing the number of elements to accumulate in a
+// batch.
+//	padded_shapes: A list of int64 tensors representing the desired padded shapes
+// of the corresponding output components. These shapes may be partially
+// specified, using `-1` to indicate that a particular dimension should be
+// padded to the maximum size of all batch elements.
+//	padding_values: A list of scalars containing the padding value to use for
+// each of the outputs.
 //
-// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
-// <img style="width:100%" src="https://www.tensorflow.org/images/DynamicStitch.png" alt>
-// </div>
-func DynamicStitch(scope *Scope, indices []tf.Output, data []tf.Output) (merged tf.Output) {
+func PaddedBatchDataset(scope *Scope, input_dataset tf.Output, batch_size tf.Output, padded_shapes []tf.Output, padding_values []tf.Output, output_shapes []tf.Shape) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "DynamicStitch",
+		Type: "PaddedBatchDataset",
 		Input: []tf.Input{
-			tf.OutputList(indices), tf.OutputList(data),
+			input_dataset, batch_size, tf.OutputList(padded_shapes), tf.OutputList(padding_values),
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Returns the truth value of (x == y) element-wise.
+// Creates a dataset that batches input elements into a SparseTensor.
 //
-// *NOTE*: `Equal` supports broadcasting. More about broadcasting
-// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
-func Equal(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
+// Arguments:
+//	input_dataset: A handle to an input dataset. Must have a single component.
+//	batch_size: A scalar representing the number of elements to accumulate in a
+// batch.
+//	row_shape: A vector representing the dense shape of each row in the produced
+// SparseTensor. The shape may be partially specified, using `-1` to indicate
+// that a particular dimension should use the maximum size of all batch elements.
+//
+//
+func DenseToSparseBatchDataset(scope *Scope, input_dataset tf.Output, batch_size tf.Output, row_shape tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "Equal",
+		Type: "DenseToSparseBatchDataset",
 		Input: []tf.Input{
-			x, y,
+			input_dataset, batch_size, row_shape,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// TensorArrayGatherV2Attr is an optional argument to TensorArrayGatherV2.
-type TensorArrayGatherV2Attr func(optionalAttr)
-
-// TensorArrayGatherV2ElementShape sets the optional element_shape attribute to value.
-// If not specified, defaults to <unknown_rank:true >
-func TensorArrayGatherV2ElementShape(value tf.Shape) TensorArrayGatherV2Attr {
-	return func(m optionalAttr) {
-		m["element_shape"] = value
-	}
-}
-
-// Deprecated. Use TensorArrayGatherV3
+// Deprecated. Use TensorArrayGradV3
 //
-// DEPRECATED at GraphDef version 26: Use TensorArrayGatherV3
-func TensorArrayGatherV2(scope *Scope, handle tf.Output, indices tf.Output, flow_in tf.Output, dtype tf.DataType, optional ...TensorArrayGatherV2Attr) (value tf.Output) {
+// DEPRECATED at GraphDef version 26: Use TensorArrayGradV3
+func TensorArrayGradV2(scope *Scope, handle tf.Output, flow_in tf.Output, source string) (grad_handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtype": dtype}
-	for _, a := range optional {
-		a(attrs)
-	}
+	attrs := map[string]interface{}{"source": source}
 	opspec := tf.OpSpec{
-		Type: "TensorArrayGatherV2",
+		Type: "TensorArrayGradV2",
 		Input: []tf.Input{
-			handle, indices, flow_in,
+			handle, flow_in,
 		},
 		Attrs: attrs,
 	}
@@ -26103,116 +25565,153 @@ func TensorArrayGatherV2(scope *Scope, handle tf.Output, indices tf.Output, flow
 	return op.Output(0)
 }
 
-// Interleave the values from the `data` tensors into a single tensor.
+// Return substrings from `Tensor` of strings.
 //
-// Builds a merged tensor such that
+// For each string in the input `Tensor`, creates a substring starting at index
+// `pos` with a total length of `len`.
+//
+// If `len` defines a substring that would extend beyond the length of the input
+// string, then as many characters as possible are used.
+//
+// If `pos` is negative or specifies a character index larger than any of the input
+// strings, then an `InvalidArgumentError` is thrown.
+//
+// `pos` and `len` must have the same shape, otherwise a `ValueError` is thrown on
+// Op creation.
+//
+// *NOTE*: `Substr` supports broadcasting up to two dimensions. More about
+// broadcasting
+// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
+//
+// ---
+//
+// Examples
+//
+// Using scalar `pos` and `len`:
 //
 // ```python
-//     merged[indices[m][i, ..., j], ...] = data[m][i, ..., j, ...]
+// input = [b'Hello', b'World']
+// position = 1
+// length = 3
+//
+// output = [b'ell', b'orl']
 // ```
 //
-// For example, if each `indices[m]` is scalar or vector, we have
+// Using `pos` and `len` with same shape as `input`:
 //
 // ```python
-//     # Scalar indices:
-//     merged[indices[m], ...] = data[m][...]
+// input = [[b'ten', b'eleven', b'twelve'],
+//          [b'thirteen', b'fourteen', b'fifteen'],
+//          [b'sixteen', b'seventeen', b'eighteen']]
+// position = [[1, 2, 3],
+//             [1, 2, 3],
+//             [1, 2, 3]]
+// length =   [[2, 3, 4],
+//             [4, 3, 2],
+//             [5, 5, 5]]
 //
-//     # Vector indices:
-//     merged[indices[m][i], ...] = data[m][i, ...]
+// output = [[b'en', b'eve', b'lve'],
+//           [b'hirt', b'urt', b'te'],
+//           [b'ixtee', b'vente', b'hteen']]
 // ```
 //
-// Each `data[i].shape` must start with the corresponding `indices[i].shape`,
-// and the rest of `data[i].shape` must be constant w.r.t. `i`.  That is, we
-// must have `data[i].shape = indices[i].shape + constant`.  In terms of this
-// `constant`, the output shape is
+// Broadcasting `pos` and `len` onto `input`:
 //
-//     merged.shape = [max(indices)] + constant
+// ```
+// input = [[b'ten', b'eleven', b'twelve'],
+//          [b'thirteen', b'fourteen', b'fifteen'],
+//          [b'sixteen', b'seventeen', b'eighteen'],
+//          [b'nineteen', b'twenty', b'twentyone']]
+// position = [1, 2, 3]
+// length =   [1, 2, 3]
 //
-// Values may be merged in parallel, so if an index appears in both `indices[m][i]`
-// and `indices[n][j]`, the result may be invalid. This differs from the normal
-// DynamicStitch operator that defines the behavior in that case.
+// output = [[b'e', b'ev', b'lve'],
+//           [b'h', b'ur', b'tee'],
+//           [b'i', b've', b'hte'],
+//           [b'i', b'en', b'nty']]
+// ```
 //
-// For example:
+// Broadcasting `input` onto `pos` and `len`:
 //
-// ```python
-//     indices[0] = 6
-//     indices[1] = [4, 1]
-//     indices[2] = [[5, 2], [0, 3]]
-//     data[0] = [61, 62]
-//     data[1] = [[41, 42], [11, 12]]
-//     data[2] = [[[51, 52], [21, 22]], [[1, 2], [31, 32]]]
-//     merged = [[1, 2], [11, 12], [21, 22], [31, 32], [41, 42],
-//               [51, 52], [61, 62]]
 // ```
+// input = b'thirteen'
+// position = [1, 5, 7]
+// length =   [3, 2, 1]
 //
-// This method can be used to merge partitions created by `dynamic_partition`
-// as illustrated on the following example:
-//
-// ```python
-//     # Apply function (increments x_i) on elements for which a certain condition
-//     # apply (x_i != -1 in this example).
-//     x=tf.constant([0.1, -1., 5.2, 4.3, -1., 7.4])
-//     condition_mask=tf.not_equal(x,tf.constant(-1.))
-//     partitioned_data = tf.dynamic_partition(
-//         x, tf.cast(condition_mask, tf.int32) , 2)
-//     partitioned_data[1] = partitioned_data[1] + 1.0
-//     condition_indices = tf.dynamic_partition(
-//         tf.range(tf.shape(x)[0]), tf.cast(condition_mask, tf.int32) , 2)
-//     x = tf.dynamic_stitch(condition_indices, partitioned_data)
-//     # Here x=[1.1, -1., 6.2, 5.3, -1, 8.4], the -1. values remain
-//     # unchanged.
+// output = [b'hir', b'ee', b'n']
 // ```
 //
-// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
-// <img style="width:100%" src="https://www.tensorflow.org/images/DynamicStitch.png" alt>
-// </div>
-func ParallelDynamicStitch(scope *Scope, indices []tf.Output, data []tf.Output) (merged tf.Output) {
+// Arguments:
+//	input: Tensor of strings
+//	pos: Scalar defining the position of first character in each substring
+//	len: Scalar defining the number of characters to include in each substring
+//
+// Returns Tensor of substrings
+func Substr(scope *Scope, input tf.Output, pos tf.Output, len tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "ParallelDynamicStitch",
+		Type: "Substr",
 		Input: []tf.Input{
-			tf.OutputList(indices), tf.OutputList(data),
+			input, pos, len,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Computes the gradient for the inverse of `x` wrt its input.
+// Creates a Dataset that returns pseudorandom numbers.
 //
-// Specifically, `grad = -dy * y*y`, where `y = 1/x`, and `dy`
-// is the corresponding input gradient.
-func InvGrad(scope *Scope, y tf.Output, dy tf.Output) (z tf.Output) {
+// Arguments:
+//	seed: A scalar seed for the random number generator. If either seed or
+// seed2 is set to be non-zero, the random number generator is seeded
+// by the given seed.  Otherwise, a random seed is used.
+//	seed2: A second scalar seed to avoid seed collision.
+//
+//
+func RandomDataset(scope *Scope, seed tf.Output, seed2 tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "InvGrad",
+		Type: "RandomDataset",
 		Input: []tf.Input{
-			y, dy,
+			seed, seed2,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// List of the given size with empty elements.
+// Creates a dataset that shuffles and repeats elements from `input_dataset`
 //
-// element_shape: the shape of the future elements of the list
-// num_elements: the number of elements to reserve
-// handle: the output list
-// element_dtype: the desired type of elements in the list.
-func TensorListReserve(scope *Scope, element_shape tf.Output, num_elements tf.Output, element_dtype tf.DataType) (handle tf.Output) {
+// pseudorandomly.
+//
+// Arguments:
+//
+//	buffer_size: The number of output elements to buffer in an iterator over
+// this dataset. Compare with the `min_after_dequeue` attr when creating a
+// `RandomShuffleQueue`.
+//	seed: A scalar seed for the random number generator. If either `seed` or
+// `seed2` is set to be non-zero, the random number generator is seeded
+// by the given seed.  Otherwise, a random seed is used.
+//	seed2: A second scalar seed to avoid seed collision.
+//	count: A scalar representing the number of times the underlying dataset
+// should be repeated. The default is `-1`, which results in infinite repetition.
+//
+//
+func ShuffleAndRepeatDataset(scope *Scope, input_dataset tf.Output, buffer_size tf.Output, seed tf.Output, seed2 tf.Output, count tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"element_dtype": element_dtype}
+	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "TensorListReserve",
+		Type: "ShuffleAndRepeatDataset",
 		Input: []tf.Input{
-			element_shape, num_elements,
+			input_dataset, buffer_size, seed, seed2, count,
 		},
 		Attrs: attrs,
 	}
@@ -26220,304 +25719,295 @@ func TensorListReserve(scope *Scope, element_shape tf.Output, num_elements tf.Ou
 	return op.Output(0)
 }
 
-// PriorityQueueV2Attr is an optional argument to PriorityQueueV2.
-type PriorityQueueV2Attr func(optionalAttr)
-
-// PriorityQueueV2ComponentTypes sets the optional component_types attribute to value.
+// Creates a dataset that caches elements from `input_dataset`.
 //
-// value: The type of each component in a value.
-// If not specified, defaults to <>
+// A CacheDataset will iterate over the input_dataset, and store tensors. If the
+// cache already exists, the cache will be used. If the cache is inappropriate
+// (e.g. cannot be opened, contains tensors of the wrong shape / size), an error
+// will the returned when used.
 //
-// REQUIRES: len(value) >= 0
-func PriorityQueueV2ComponentTypes(value []tf.DataType) PriorityQueueV2Attr {
-	return func(m optionalAttr) {
-		m["component_types"] = value
-	}
-}
-
-// PriorityQueueV2Capacity sets the optional capacity attribute to value.
+// Arguments:
 //
-// value: The upper bound on the number of elements in this queue.
-// Negative numbers mean no limit.
-// If not specified, defaults to -1
-func PriorityQueueV2Capacity(value int64) PriorityQueueV2Attr {
-	return func(m optionalAttr) {
-		m["capacity"] = value
-	}
-}
-
-// PriorityQueueV2Container sets the optional container attribute to value.
+//	filename: A path on the filesystem where we should cache the dataset. Note: this
+// will be a directory.
 //
-// value: If non-empty, this queue is placed in the given container.
-// Otherwise, a default container is used.
-// If not specified, defaults to ""
-func PriorityQueueV2Container(value string) PriorityQueueV2Attr {
-	return func(m optionalAttr) {
-		m["container"] = value
-	}
-}
-
-// PriorityQueueV2SharedName sets the optional shared_name attribute to value.
 //
-// value: If non-empty, this queue will be shared under the given name
-// across multiple sessions.
-// If not specified, defaults to ""
-func PriorityQueueV2SharedName(value string) PriorityQueueV2Attr {
-	return func(m optionalAttr) {
-		m["shared_name"] = value
+func CacheDataset(scope *Scope, input_dataset tf.Output, filename tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
+	if scope.Err() != nil {
+		return
 	}
+	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
+	opspec := tf.OpSpec{
+		Type: "CacheDataset",
+		Input: []tf.Input{
+			input_dataset, filename,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// A queue that produces elements sorted by the first component value.
+// Computes the sum along sparse segments of a tensor.
 //
-// Note that the PriorityQueue requires the first component of any element
-// to be a scalar int64, in addition to the other elements declared by
-// component_types.  Therefore calls to Enqueue and EnqueueMany (resp. Dequeue
-// and DequeueMany) on a PriorityQueue will all require (resp. output) one extra
-// entry in their input (resp. output) lists.
+// Like `SparseSegmentSum`, but allows missing ids in `segment_ids`. If an id is
+// misisng, the `output` tensor at that position will be zeroed.
+//
+// Read @{$math_ops#Segmentation$the section on segmentation} for an explanation of
+// segments.
+//
+// For example:
+//
+// ```python
+// c = tf.constant([[1,2,3,4], [-1,-2,-3,-4], [5,6,7,8]])
+//
+// tf.sparse_segment_sum_with_num_segments(
+//     c, tf.constant([0, 1]), tf.constant([0, 0]), num_segments=3)
+// # => [[0 0 0 0]
+// #     [0 0 0 0]
+// #     [0 0 0 0]]
+//
+// tf.sparse_segment_sum_with_num_segments(c,
+//                                         tf.constant([0, 1]),
+//                                         tf.constant([0, 2],
+//                                         num_segments=4))
+// # => [[ 1  2  3  4]
+// #     [ 0  0  0  0]
+// #     [-1 -2 -3 -4]
+// #     [ 0  0  0  0]]
+// ```
 //
 // Arguments:
-//	shapes: The shape of each component in a value. The length of this attr must
-// be either 0 or the same as the length of component_types. If the length of
-// this attr is 0, the shapes of queue elements are not constrained, and
-// only one element may be dequeued at a time.
 //
-// Returns The handle to the queue.
-func PriorityQueueV2(scope *Scope, shapes []tf.Shape, optional ...PriorityQueueV2Attr) (handle tf.Output) {
+//	indices: A 1-D tensor. Has same rank as `segment_ids`.
+//	segment_ids: A 1-D tensor. Values should be sorted and can be repeated.
+//	num_segments: Should equal the number of distinct segment IDs.
+//
+// Returns Has same shape as data, except for dimension 0 which
+// has size `num_segments`.
+func SparseSegmentSumWithNumSegments(scope *Scope, data tf.Output, indices tf.Output, segment_ids tf.Output, num_segments tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"shapes": shapes}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "PriorityQueueV2",
-
-		Attrs: attrs,
+		Type: "SparseSegmentSumWithNumSegments",
+		Input: []tf.Input{
+			data, indices, segment_ids, num_segments,
+		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// UnstageAttr is an optional argument to Unstage.
-type UnstageAttr func(optionalAttr)
-
-// UnstageCapacity sets the optional capacity attribute to value.
-// If not specified, defaults to 0
+// Creates a dataset that executes a SQL query and emits rows of the result set.
 //
-// REQUIRES: value >= 0
-func UnstageCapacity(value int64) UnstageAttr {
-	return func(m optionalAttr) {
-		m["capacity"] = value
-	}
-}
-
-// UnstageMemoryLimit sets the optional memory_limit attribute to value.
-// If not specified, defaults to 0
+// Arguments:
+//	driver_name: The database type. Currently, the only supported type is 'sqlite'.
+//	data_source_name: A connection string to connect to the database.
+//	query: A SQL query to execute.
 //
-// REQUIRES: value >= 0
-func UnstageMemoryLimit(value int64) UnstageAttr {
-	return func(m optionalAttr) {
-		m["memory_limit"] = value
-	}
-}
-
-// UnstageContainer sets the optional container attribute to value.
-// If not specified, defaults to ""
-func UnstageContainer(value string) UnstageAttr {
-	return func(m optionalAttr) {
-		m["container"] = value
-	}
-}
-
-// UnstageSharedName sets the optional shared_name attribute to value.
-// If not specified, defaults to ""
-func UnstageSharedName(value string) UnstageAttr {
-	return func(m optionalAttr) {
-		m["shared_name"] = value
-	}
-}
-
-// Op is similar to a lightweight Dequeue.
 //
-// The basic functionality is similar to dequeue with many fewer
-// capabilities and options.  This Op is optimized for performance.
-func Unstage(scope *Scope, dtypes []tf.DataType, optional ...UnstageAttr) (values []tf.Output) {
+func SqlDataset(scope *Scope, driver_name tf.Output, data_source_name tf.Output, query tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtypes": dtypes}
-	for _, a := range optional {
-		a(attrs)
-	}
+	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "Unstage",
-
+		Type: "SqlDataset",
+		Input: []tf.Input{
+			driver_name, data_source_name, query,
+		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	if scope.Err() != nil {
-		return
-	}
-	var idx int
-	var err error
-	if values, idx, err = makeOutputList(op, idx, "values"); err != nil {
-		scope.UpdateErr("Unstage", err)
-		return
-	}
-	return values
+	return op.Output(0)
 }
 
-// QueueEnqueueV2Attr is an optional argument to QueueEnqueueV2.
-type QueueEnqueueV2Attr func(optionalAttr)
-
-// QueueEnqueueV2TimeoutMs sets the optional timeout_ms attribute to value.
+// Creates a dataset that emits the records from one or more binary files.
 //
-// value: If the queue is full, this operation will block for up to
-// timeout_ms milliseconds.
-// Note: This option is not supported yet.
-// If not specified, defaults to -1
-func QueueEnqueueV2TimeoutMs(value int64) QueueEnqueueV2Attr {
-	return func(m optionalAttr) {
-		m["timeout_ms"] = value
+// Arguments:
+//	filenames: A scalar or a vector containing the name(s) of the file(s) to be
+// read.
+//	header_bytes: A scalar representing the number of bytes to skip at the
+// beginning of a file.
+//	record_bytes: A scalar representing the number of bytes in each record.
+//	footer_bytes: A scalar representing the number of bytes to skip at the end
+// of a file.
+//	buffer_size: A scalar representing the number of bytes to buffer. Must be > 0.
+func FixedLengthRecordDataset(scope *Scope, filenames tf.Output, header_bytes tf.Output, record_bytes tf.Output, footer_bytes tf.Output, buffer_size tf.Output) (handle tf.Output) {
+	if scope.Err() != nil {
+		return
 	}
+	opspec := tf.OpSpec{
+		Type: "FixedLengthRecordDataset",
+		Input: []tf.Input{
+			filenames, header_bytes, record_bytes, footer_bytes, buffer_size,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// Enqueues a tuple of one or more tensors in the given queue.
+// Gradients for batch normalization.
 //
-// The components input has k elements, which correspond to the components of
-// tuples stored in the given queue.
+// DEPRECATED at GraphDef version 9: Use tf.nn.batch_normalization()
 //
-// N.B. If the queue is full, this operation will block until the given
-// element has been enqueued (or 'timeout_ms' elapses, if specified).
+// This op is deprecated. See `tf.nn.batch_normalization`.
 //
 // Arguments:
-//	handle: The handle to a queue.
-//	components: One or more tensors from which the enqueued tensors should be taken.
+//	t: A 4D input Tensor.
+//	m: A 1D mean Tensor with size matching the last dimension of t.
+// This is the first output from tf.nn.moments,
+// or a saved moving average thereof.
+//	v: A 1D variance Tensor with size matching the last dimension of t.
+// This is the second output from tf.nn.moments,
+// or a saved moving average thereof.
+//	gamma: A 1D gamma Tensor with size matching the last dimension of t.
+// If "scale_after_normalization" is true, this Tensor will be multiplied
+// with the normalized Tensor.
+//	backprop: 4D backprop Tensor.
+//	variance_epsilon: A small float number to avoid dividing by 0.
+//	scale_after_normalization: A bool indicating whether the resulted tensor
+// needs to be multiplied with gamma.
 //
-// Returns the created operation.
-func QueueEnqueueV2(scope *Scope, handle tf.Output, components []tf.Output, optional ...QueueEnqueueV2Attr) (o *tf.Operation) {
+// Returns 4D backprop tensor for input.1D backprop tensor for mean.1D backprop tensor for variance.1D backprop tensor for beta.1D backprop tensor for gamma.
+func BatchNormWithGlobalNormalizationGrad(scope *Scope, t tf.Output, m tf.Output, v tf.Output, gamma tf.Output, backprop tf.Output, variance_epsilon float32, scale_after_normalization bool) (dx tf.Output, dm tf.Output, dv tf.Output, db tf.Output, dg tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
+	attrs := map[string]interface{}{"variance_epsilon": variance_epsilon, "scale_after_normalization": scale_after_normalization}
 	opspec := tf.OpSpec{
-		Type: "QueueEnqueueV2",
+		Type: "BatchNormWithGlobalNormalizationGrad",
 		Input: []tf.Input{
-			handle, tf.OutputList(components),
+			t, m, v, gamma, backprop,
 		},
 		Attrs: attrs,
 	}
-	return scope.AddOperation(opspec)
-}
-
-// QueueDequeueManyV2Attr is an optional argument to QueueDequeueManyV2.
-type QueueDequeueManyV2Attr func(optionalAttr)
-
-// QueueDequeueManyV2TimeoutMs sets the optional timeout_ms attribute to value.
-//
-// value: If the queue has fewer than n elements, this operation
-// will block for up to timeout_ms milliseconds.
-// Note: This option is not supported yet.
-// If not specified, defaults to -1
-func QueueDequeueManyV2TimeoutMs(value int64) QueueDequeueManyV2Attr {
-	return func(m optionalAttr) {
-		m["timeout_ms"] = value
-	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0), op.Output(1), op.Output(2), op.Output(3), op.Output(4)
 }
 
-// Dequeues `n` tuples of one or more tensors from the given queue.
-//
-// If the queue is closed and there are fewer than `n` elements, then an
-// OutOfRange error is returned.
-//
-// This operation concatenates queue-element component tensors along the
-// 0th dimension to make a single component tensor.  All of the components
-// in the dequeued tuple will have size `n` in the 0th dimension.
-//
-// This operation has `k` outputs, where `k` is the number of components in
-// the tuples stored in the given queue, and output `i` is the ith
-// component of the dequeued tuple.
-//
-// N.B. If the queue is empty, this operation will block until `n` elements
-// have been dequeued (or 'timeout_ms' elapses, if specified).
+// Creates a dataset that emits the records from one or more TFRecord files.
 //
 // Arguments:
-//	handle: The handle to a queue.
-//	n: The number of tuples to dequeue.
-//	component_types: The type of each component in a tuple.
-//
-// Returns One or more tensors that were dequeued as a tuple.
-func QueueDequeueManyV2(scope *Scope, handle tf.Output, n tf.Output, component_types []tf.DataType, optional ...QueueDequeueManyV2Attr) (components []tf.Output) {
+//	filenames: A scalar or vector containing the name(s) of the file(s) to be
+// read.
+//	compression_type: A scalar containing either (i) the empty string (no
+// compression), (ii) "ZLIB", or (iii) "GZIP".
+//	buffer_size: A scalar representing the number of bytes to buffer. A value of
+// 0 means no buffering will be performed.
+func TFRecordDataset(scope *Scope, filenames tf.Output, compression_type tf.Output, buffer_size tf.Output) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"component_types": component_types}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "QueueDequeueManyV2",
+		Type: "TFRecordDataset",
 		Input: []tf.Input{
-			handle, n,
+			filenames, compression_type, buffer_size,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	if scope.Err() != nil {
-		return
-	}
-	var idx int
-	var err error
-	if components, idx, err = makeOutputList(op, idx, "components"); err != nil {
-		scope.UpdateErr("QueueDequeueManyV2", err)
-		return
-	}
-	return components
-}
-
-// EncodeBase64Attr is an optional argument to EncodeBase64.
-type EncodeBase64Attr func(optionalAttr)
-
-// EncodeBase64Pad sets the optional pad attribute to value.
-//
-// value: Bool whether padding is applied at the ends.
-// If not specified, defaults to false
-func EncodeBase64Pad(value bool) EncodeBase64Attr {
-	return func(m optionalAttr) {
-		m["pad"] = value
-	}
+	return op.Output(0)
 }
 
-// Encode strings into web-safe base64 format.
+// BatchToSpace for 4-D tensors of type T.
 //
-// Refer to the following article for more information on base64 format:
-// en.wikipedia.org/wiki/Base64. Base64 strings may have padding with '=' at the
-// end so that the encoded has length multiple of 4. See Padding section of the
-// link above.
+// This is a legacy version of the more general BatchToSpaceND.
 //
-// Web-safe means that the encoder uses - and _ instead of + and /.
+// Rearranges (permutes) data from batch into blocks of spatial data, followed by
+// cropping. This is the reverse transformation of SpaceToBatch. More specifically,
+// this op outputs a copy of the input tensor where values from the `batch`
+// dimension are moved in spatial blocks to the `height` and `width` dimensions,
+// followed by cropping along the `height` and `width` dimensions.
 //
 // Arguments:
-//	input: Strings to be encoded.
+//	input: 4-D tensor with shape
+// `[batch*block_size*block_size, height_pad/block_size, width_pad/block_size,
+//   depth]`. Note that the batch size of the input tensor must be divisible by
+// `block_size * block_size`.
+//	crops: 2-D tensor of non-negative integers with shape `[2, 2]`. It specifies
+// how many elements to crop from the intermediate result across the spatial
+// dimensions as follows:
 //
-// Returns Input strings encoded in base64.
-func EncodeBase64(scope *Scope, input tf.Output, optional ...EncodeBase64Attr) (output tf.Output) {
+//     crops = [[crop_top, crop_bottom], [crop_left, crop_right]]
+//
+//
+// Returns 4-D with shape `[batch, height, width, depth]`, where:
+//
+//       height = height_pad - crop_top - crop_bottom
+//       width = width_pad - crop_left - crop_right
+//
+// The attr `block_size` must be greater than one. It indicates the block size.
+//
+// Some examples:
+//
+// (1) For the following input of shape `[4, 1, 1, 1]` and block_size of 2:
+//
+// ```
+// [[[[1]]], [[[2]]], [[[3]]], [[[4]]]]
+// ```
+//
+// The output tensor has shape `[1, 2, 2, 1]` and value:
+//
+// ```
+// x = [[[[1], [2]], [[3], [4]]]]
+// ```
+//
+// (2) For the following input of shape `[4, 1, 1, 3]` and block_size of 2:
+//
+// ```
+// [[[1, 2, 3]], [[4, 5, 6]], [[7, 8, 9]], [[10, 11, 12]]]
+// ```
+//
+// The output tensor has shape `[1, 2, 2, 3]` and value:
+//
+// ```
+// x = [[[[1, 2, 3], [4, 5, 6]],
+//       [[7, 8, 9], [10, 11, 12]]]]
+// ```
+//
+// (3) For the following input of shape `[4, 2, 2, 1]` and block_size of 2:
+//
+// ```
+// x = [[[[1], [3]], [[9], [11]]],
+//      [[[2], [4]], [[10], [12]]],
+//      [[[5], [7]], [[13], [15]]],
+//      [[[6], [8]], [[14], [16]]]]
+// ```
+//
+// The output tensor has shape `[1, 4, 4, 1]` and value:
+//
+// ```
+// x = [[[1],   [2],  [3],  [4]],
+//      [[5],   [6],  [7],  [8]],
+//      [[9],  [10], [11],  [12]],
+//      [[13], [14], [15],  [16]]]
+// ```
+//
+// (4) For the following input of shape `[8, 1, 2, 1]` and block_size of 2:
+//
+// ```
+// x = [[[[1], [3]]], [[[9], [11]]], [[[2], [4]]], [[[10], [12]]],
+//      [[[5], [7]]], [[[13], [15]]], [[[6], [8]]], [[[14], [16]]]]
+// ```
+//
+// The output tensor has shape `[2, 2, 4, 1]` and value:
+//
+// ```
+// x = [[[[1], [3]], [[5], [7]]],
+//      [[[2], [4]], [[10], [12]]],
+//      [[[5], [7]], [[13], [15]]],
+//      [[[6], [8]], [[14], [16]]]]
+// ```
+func BatchToSpace(scope *Scope, input tf.Output, crops tf.Output, block_size int64) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
+	attrs := map[string]interface{}{"block_size": block_size}
 	opspec := tf.OpSpec{
-		Type: "EncodeBase64",
+		Type: "BatchToSpace",
 		Input: []tf.Input{
-			input,
+			input, crops,
 		},
 		Attrs: attrs,
 	}
@@ -26525,297 +26015,200 @@ func EncodeBase64(scope *Scope, input tf.Output, optional ...EncodeBase64Attr) (
 	return op.Output(0)
 }
 
-// Deprecated. Use TensorArrayCloseV3
+// Makes a new iterator from the given `dataset` and stores it in `iterator`.
 //
-// DEPRECATED at GraphDef version 26: Use TensorArrayCloseV3
+// This operation may be executed multiple times. Each execution will reset the
+// iterator in `iterator` to the first element of `dataset`.
 //
 // Returns the created operation.
-func TensorArrayCloseV2(scope *Scope, handle tf.Output) (o *tf.Operation) {
+func MakeIterator(scope *Scope, dataset tf.Output, iterator tf.Output) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "TensorArrayCloseV2",
+		Type: "MakeIterator",
 		Input: []tf.Input{
-			handle,
+			dataset, iterator,
 		},
 	}
 	return scope.AddOperation(opspec)
 }
 
-// Forwards the value of an available tensor from `inputs` to `output`.
-//
-// `Merge` waits for at least one of the tensors in `inputs` to become available.
-// It is usually combined with `Switch` to implement branching.
+// Makes the summary of accumulated stats for the batch.
 //
-// `Merge` forwards the first tensor to become available to `output`, and sets
-// `value_index` to its index in `inputs`.
+// The summary stats contains gradients and hessians accumulated into the corresponding node and bucket for each example.
 //
 // Arguments:
-//	inputs: The input tensors, exactly one of which will become available.
+//	node_ids: int32 Rank 1 Tensor containing node ids, which each example falls into for the requested layer.
+//	gradients: float32; Rank 2 Tensor (shape=[#examples, 1]) for gradients.
+//	hessians: float32; Rank 2 Tensor (shape=[#examples, 1]) for hessians.
+//	bucketized_features_list: int32 list of Rank 1 Tensors, each containing the bucketized feature (for each feature column).
+//	max_splits: int; the maximum number of splits possible in the whole tree.
+//	num_buckets: int; equals to the maximum possible value of bucketized feature.
 //
-// Returns Will be set to the available input tensor.The index of the chosen input tensor in `inputs`.
-func Merge(scope *Scope, inputs []tf.Output) (output tf.Output, value_index tf.Output) {
+// Returns output Rank 4 Tensor (shape=[#features, #splits, #buckets, 2]) containing accumulated stats put into the corresponding node and bucket. The first index of 4th dimension refers to gradients, and the second to hessians.
+func BoostedTreesMakeStatsSummary(scope *Scope, node_ids tf.Output, gradients tf.Output, hessians tf.Output, bucketized_features_list []tf.Output, max_splits int64, num_buckets int64) (stats_summary tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"max_splits": max_splits, "num_buckets": num_buckets}
 	opspec := tf.OpSpec{
-		Type: "Merge",
+		Type: "BoostedTreesMakeStatsSummary",
 		Input: []tf.Input{
-			tf.OutputList(inputs),
+			node_ids, gradients, hessians, tf.OutputList(bucketized_features_list),
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1)
+	return op.Output(0)
 }
 
-// QueueCloseV2Attr is an optional argument to QueueCloseV2.
-type QueueCloseV2Attr func(optionalAttr)
-
-// QueueCloseV2CancelPendingEnqueues sets the optional cancel_pending_enqueues attribute to value.
+// Adjust the contrast of one or more images.
 //
-// value: If true, all pending enqueue requests that are
-// blocked on the given queue will be canceled.
-// If not specified, defaults to false
-func QueueCloseV2CancelPendingEnqueues(value bool) QueueCloseV2Attr {
-	return func(m optionalAttr) {
-		m["cancel_pending_enqueues"] = value
-	}
-}
-
-// Closes the given queue.
+// `images` is a tensor of at least 3 dimensions.  The last 3 dimensions are
+// interpreted as `[height, width, channels]`.  The other dimensions only
+// represent a collection of images, such as `[batch, height, width, channels].`
 //
-// This operation signals that no more elements will be enqueued in the
-// given queue. Subsequent Enqueue(Many) operations will fail.
-// Subsequent Dequeue(Many) operations will continue to succeed if
-// sufficient elements remain in the queue. Subsequent Dequeue(Many)
-// operations that would block will fail immediately.
+// Contrast is adjusted independently for each channel of each image.
+//
+// For each channel, the Op first computes the mean of the image pixels in the
+// channel and then adjusts each component of each pixel to
+// `(x - mean) * contrast_factor + mean`.
 //
 // Arguments:
-//	handle: The handle to a queue.
+//	images: Images to adjust.  At least 3-D.
+//	contrast_factor: A float multiplier for adjusting contrast.
 //
-// Returns the created operation.
-func QueueCloseV2(scope *Scope, handle tf.Output, optional ...QueueCloseV2Attr) (o *tf.Operation) {
+// Returns The contrast-adjusted image or images.
+func AdjustContrastv2(scope *Scope, images tf.Output, contrast_factor tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "QueueCloseV2",
+		Type: "AdjustContrastv2",
 		Input: []tf.Input{
-			handle,
+			images, contrast_factor,
 		},
-		Attrs: attrs,
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// Computes inverse hyperbolic tangent of x element-wise.
-func Atanh(scope *Scope, x tf.Output) (y tf.Output) {
+// Gets the next output from the given iterator.
+func IteratorGetNext(scope *Scope, iterator tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (components []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "Atanh",
+		Type: "IteratorGetNext",
 		Input: []tf.Input{
-			x,
+			iterator,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Returns true if queue is closed.
-//
-// This operation returns true if the queue is closed and false if the queue
-// is open.
-//
-// Arguments:
-//	handle: The handle to a queue.
-func QueueIsClosedV2(scope *Scope, handle tf.Output) (is_closed tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	opspec := tf.OpSpec{
-		Type: "QueueIsClosedV2",
-		Input: []tf.Input{
-			handle,
-		},
+	var idx int
+	var err error
+	if components, idx, err = makeOutputList(op, idx, "components"); err != nil {
+		scope.UpdateErr("IteratorGetNext", err)
+		return
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return components
 }
 
-// Returns the batched diagonal part of a batched tensor.
-//
-// This operation returns a tensor with the `diagonal` part
-// of the batched `input`. The `diagonal` part is computed as follows:
-//
-// Assume `input` has `k` dimensions `[I, J, K, ..., M, N]`, then the output is a
-// tensor of rank `k - 1` with dimensions `[I, J, K, ..., min(M, N)]` where:
-//
-// `diagonal[i, j, k, ..., n] = input[i, j, k, ..., n, n]`.
-//
-// The input must be at least a matrix.
-//
-// For example:
-//
-// ```
-// # 'input' is [[[1, 0, 0, 0]
-//                [0, 2, 0, 0]
-//                [0, 0, 3, 0]
-//                [0, 0, 0, 4]],
-//               [[5, 0, 0, 0]
-//                [0, 6, 0, 0]
-//                [0, 0, 7, 0]
-//                [0, 0, 0, 8]]]
-//
-// and input.shape = (2, 4, 4)
+// Outputs the single element from the given dataset.
 //
-// tf.matrix_diag_part(input) ==> [[1, 2, 3, 4], [5, 6, 7, 8]]
+// Arguments:
+//	dataset: A handle to a dataset that contains a single element.
 //
-// which has shape (2, 4)
-// ```
 //
-// Arguments:
-//	input: Rank `k` tensor where `k >= 2`.
 //
-// Returns The extracted diagonal(s) having shape
-// `diagonal.shape = input.shape[:-2] + [min(input.shape[-2:])]`.
-func MatrixDiagPart(scope *Scope, input tf.Output) (diagonal tf.Output) {
+// Returns The components of the single element of `input`.
+func DatasetToSingleElement(scope *Scope, dataset tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (components []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
 	opspec := tf.OpSpec{
-		Type: "MatrixDiagPart",
+		Type: "DatasetToSingleElement",
 		Input: []tf.Input{
-			input,
+			dataset,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Computes the absolute value of a tensor.
-//
-// Given a tensor `x`, this operation returns a tensor containing the absolute
-// value of each element in `x`. For example, if x is an input element and y is
-// an output element, this operation computes \\(y = |x|\\).
-func Abs(scope *Scope, x tf.Output) (y tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	opspec := tf.OpSpec{
-		Type: "Abs",
-		Input: []tf.Input{
-			x,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// StackV2Attr is an optional argument to StackV2.
-type StackV2Attr func(optionalAttr)
-
-// StackV2StackName sets the optional stack_name attribute to value.
-//
-// value: Overrides the name used for the temporary stack resource. Default
-// value is the name of the 'Stack' op (which is guaranteed unique).
-// If not specified, defaults to ""
-func StackV2StackName(value string) StackV2Attr {
-	return func(m optionalAttr) {
-		m["stack_name"] = value
+	var idx int
+	var err error
+	if components, idx, err = makeOutputList(op, idx, "components"); err != nil {
+		scope.UpdateErr("DatasetToSingleElement", err)
+		return
 	}
+	return components
 }
 
-// A stack that produces elements in first-in last-out order.
+// Converts the given `resource_handle` representing an iterator to a string.
 //
 // Arguments:
-//	max_size: The maximum size of the stack if non-negative. If negative, the stack
-// size is unlimited.
-//	elem_type: The type of the elements on the stack.
+//	resource_handle: A handle to an iterator resource.
 //
-// Returns The handle to the stack.
-func StackV2(scope *Scope, max_size tf.Output, elem_type tf.DataType, optional ...StackV2Attr) (handle tf.Output) {
+// Returns A string representation of the given handle.
+func IteratorToStringHandle(scope *Scope, resource_handle tf.Output) (string_handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"elem_type": elem_type}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "StackV2",
+		Type: "IteratorToStringHandle",
 		Input: []tf.Input{
-			max_size,
+			resource_handle,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// FusedBatchNormGradV2Attr is an optional argument to FusedBatchNormGradV2.
-type FusedBatchNormGradV2Attr func(optionalAttr)
+// IteratorFromStringHandleAttr is an optional argument to IteratorFromStringHandle.
+type IteratorFromStringHandleAttr func(optionalAttr)
 
-// FusedBatchNormGradV2Epsilon sets the optional epsilon attribute to value.
+// IteratorFromStringHandleOutputTypes sets the optional output_types attribute to value.
 //
-// value: A small float number added to the variance of x.
-// If not specified, defaults to 0.0001
-func FusedBatchNormGradV2Epsilon(value float32) FusedBatchNormGradV2Attr {
-	return func(m optionalAttr) {
-		m["epsilon"] = value
-	}
-}
-
-// FusedBatchNormGradV2DataFormat sets the optional data_format attribute to value.
+// value: If specified, defines the type of each tuple component in an
+// element produced by the resulting iterator.
+// If not specified, defaults to <>
 //
-// value: The data format for y_backprop, x, x_backprop.
-// Either "NHWC" (default) or "NCHW".
-// If not specified, defaults to "NHWC"
-func FusedBatchNormGradV2DataFormat(value string) FusedBatchNormGradV2Attr {
+// REQUIRES: len(value) >= 0
+func IteratorFromStringHandleOutputTypes(value []tf.DataType) IteratorFromStringHandleAttr {
 	return func(m optionalAttr) {
-		m["data_format"] = value
+		m["output_types"] = value
 	}
 }
 
-// FusedBatchNormGradV2IsTraining sets the optional is_training attribute to value.
+// IteratorFromStringHandleOutputShapes sets the optional output_shapes attribute to value.
 //
-// value: A bool value to indicate the operation is for training (default)
-// or inference.
-// If not specified, defaults to true
-func FusedBatchNormGradV2IsTraining(value bool) FusedBatchNormGradV2Attr {
+// value: If specified, defines the shape of each tuple component in an
+// element produced by the resulting iterator.
+// If not specified, defaults to <>
+//
+// REQUIRES: len(value) >= 0
+func IteratorFromStringHandleOutputShapes(value []tf.Shape) IteratorFromStringHandleAttr {
 	return func(m optionalAttr) {
-		m["is_training"] = value
+		m["output_shapes"] = value
 	}
 }
 
-// Gradient for batch normalization.
-//
-// Note that the size of 4D Tensors are defined by either "NHWC" or "NCHW".
-// The size of 1D Tensors matches the dimension C of the 4D Tensors.
+// Converts the given string representing a handle to an iterator to a resource.
 //
 // Arguments:
-//	y_backprop: A 4D Tensor for the gradient with respect to y.
-//	x: A 4D Tensor for input data.
-//	scale: A 1D Tensor for scaling factor, to scale the normalized x.
-//	reserve_space_1: When is_training is True, a 1D Tensor for the computed batch
-// mean to be reused in gradient computation. When is_training is
-// False, a 1D Tensor for the population mean to be reused in both
-// 1st and 2nd order gradient computation.
-//	reserve_space_2: When is_training is True, a 1D Tensor for the computed batch
-// variance (inverted variance in the cuDNN case) to be reused in
-// gradient computation. When is_training is False, a 1D Tensor
-// for the population variance to be reused in both 1st and 2nd
-// order gradient computation.
+//	string_handle: A string representation of the given handle.
 //
-// Returns A 4D Tensor for the gradient with respect to x.A 1D Tensor for the gradient with respect to scale.A 1D Tensor for the gradient with respect to offset.Unused placeholder to match the mean input in FusedBatchNorm.Unused placeholder to match the variance input
-// in FusedBatchNorm.
-func FusedBatchNormGradV2(scope *Scope, y_backprop tf.Output, x tf.Output, scale tf.Output, reserve_space_1 tf.Output, reserve_space_2 tf.Output, optional ...FusedBatchNormGradV2Attr) (x_backprop tf.Output, scale_backprop tf.Output, offset_backprop tf.Output, reserve_space_3 tf.Output, reserve_space_4 tf.Output) {
+// Returns A handle to an iterator resource.
+func IteratorFromStringHandle(scope *Scope, string_handle tf.Output, optional ...IteratorFromStringHandleAttr) (resource_handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -26824,300 +26217,309 @@ func FusedBatchNormGradV2(scope *Scope, y_backprop tf.Output, x tf.Output, scale
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "FusedBatchNormGradV2",
+		Type: "IteratorFromStringHandle",
 		Input: []tf.Input{
-			y_backprop, x, scale, reserve_space_1, reserve_space_2,
+			string_handle,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2), op.Output(3), op.Output(4)
+	return op.Output(0)
 }
 
-// DecodeCompressedAttr is an optional argument to DecodeCompressed.
-type DecodeCompressedAttr func(optionalAttr)
-
-// DecodeCompressedCompressionType sets the optional compression_type attribute to value.
+// Gather slices from `params` axis `axis` according to `indices`.
 //
-// value: A scalar containing either (i) the empty string (no
-// compression), (ii) "ZLIB", or (iii) "GZIP".
-// If not specified, defaults to ""
-func DecodeCompressedCompressionType(value string) DecodeCompressedAttr {
-	return func(m optionalAttr) {
-		m["compression_type"] = value
-	}
-}
-
-// Decompress strings.
+// `indices` must be an integer tensor of any dimension (usually 0-D or 1-D).
+// Produces an output tensor with shape `params.shape[:axis] + indices.shape +
+// params.shape[axis + 1:]` where:
 //
-// This op decompresses each element of the `bytes` input `Tensor`, which
-// is assumed to be compressed using the given `compression_type`.
+// ```python
+//     # Scalar indices (output is rank(params) - 1).
+//     output[a_0, ..., a_n, b_0, ..., b_n] =
+//       params[a_0, ..., a_n, indices, b_0, ..., b_n]
 //
-// The `output` is a string `Tensor` of the same shape as `bytes`,
-// each element containing the decompressed data from the corresponding
-// element in `bytes`.
+//     # Vector indices (output is rank(params)).
+//     output[a_0, ..., a_n, i, b_0, ..., b_n] =
+//       params[a_0, ..., a_n, indices[i], b_0, ..., b_n]
+//
+//     # Higher rank indices (output is rank(params) + rank(indices) - 1).
+//     output[a_0, ..., a_n, i, ..., j, b_0, ... b_n] =
+//       params[a_0, ..., a_n, indices[i, ..., j], b_0, ..., b_n]
+// ```
+//
+// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
+// <img style="width:100%" src="https://www.tensorflow.org/images/Gather.png" alt>
+// </div>
+//
+// Note that on CPU, if an out of bound index is found, an error is returned.
+// On GPU, if an out of bound index is found, a 0 is stored in the
+// corresponding output value.
 //
 // Arguments:
-//	bytes: A Tensor of string which is compressed.
+//	params: The tensor from which to gather values. Must be at least rank
+// `axis + 1`.
+//	indices: Index tensor. Must be in range `[0, params.shape[axis])`.
+//	axis: The axis in `params` to gather `indices` from. Defaults to the first
+// dimension. Supports negative indexes.
 //
-// Returns A Tensor with the same shape as input `bytes`, uncompressed
-// from bytes.
-func DecodeCompressed(scope *Scope, bytes tf.Output, optional ...DecodeCompressedAttr) (output tf.Output) {
+// Returns Values from `params` gathered from indices given by `indices`, with
+// shape `params.shape[:axis] + indices.shape + params.shape[axis + 1:]`.
+func GatherV2(scope *Scope, params tf.Output, indices tf.Output, axis tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "DecodeCompressed",
+		Type: "GatherV2",
 		Input: []tf.Input{
-			bytes,
+			params, indices, axis,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// CudnnRNNAttr is an optional argument to CudnnRNN.
-type CudnnRNNAttr func(optionalAttr)
-
-// CudnnRNNRnnMode sets the optional rnn_mode attribute to value.
-// If not specified, defaults to "lstm"
-func CudnnRNNRnnMode(value string) CudnnRNNAttr {
-	return func(m optionalAttr) {
-		m["rnn_mode"] = value
+// Converts the given `resource_handle` representing an iterator to a variant tensor.
+//
+// Arguments:
+//	resource_handle: A handle to an iterator resource.
+//
+// Returns A variant tensor storing the state of the iterator contained in the
+// resource.
+func SerializeIterator(scope *Scope, resource_handle tf.Output) (serialized tf.Output) {
+	if scope.Err() != nil {
+		return
 	}
-}
-
-// CudnnRNNInputMode sets the optional input_mode attribute to value.
-// If not specified, defaults to "linear_input"
-func CudnnRNNInputMode(value string) CudnnRNNAttr {
-	return func(m optionalAttr) {
-		m["input_mode"] = value
+	opspec := tf.OpSpec{
+		Type: "SerializeIterator",
+		Input: []tf.Input{
+			resource_handle,
+		},
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// CudnnRNNDirection sets the optional direction attribute to value.
-// If not specified, defaults to "unidirectional"
-func CudnnRNNDirection(value string) CudnnRNNAttr {
-	return func(m optionalAttr) {
-		m["direction"] = value
-	}
-}
+// FIFOQueueV2Attr is an optional argument to FIFOQueueV2.
+type FIFOQueueV2Attr func(optionalAttr)
 
-// CudnnRNNDropout sets the optional dropout attribute to value.
-// If not specified, defaults to 0
-func CudnnRNNDropout(value float32) CudnnRNNAttr {
+// FIFOQueueV2Shapes sets the optional shapes attribute to value.
+//
+// value: The shape of each component in a value. The length of this attr must
+// be either 0 or the same as the length of component_types. If the length of
+// this attr is 0, the shapes of queue elements are not constrained, and
+// only one element may be dequeued at a time.
+// If not specified, defaults to <>
+//
+// REQUIRES: len(value) >= 0
+func FIFOQueueV2Shapes(value []tf.Shape) FIFOQueueV2Attr {
 	return func(m optionalAttr) {
-		m["dropout"] = value
+		m["shapes"] = value
 	}
 }
 
-// CudnnRNNSeed sets the optional seed attribute to value.
-// If not specified, defaults to 0
-func CudnnRNNSeed(value int64) CudnnRNNAttr {
+// FIFOQueueV2Capacity sets the optional capacity attribute to value.
+//
+// value: The upper bound on the number of elements in this queue.
+// Negative numbers mean no limit.
+// If not specified, defaults to -1
+func FIFOQueueV2Capacity(value int64) FIFOQueueV2Attr {
 	return func(m optionalAttr) {
-		m["seed"] = value
+		m["capacity"] = value
 	}
 }
 
-// CudnnRNNSeed2 sets the optional seed2 attribute to value.
-// If not specified, defaults to 0
-func CudnnRNNSeed2(value int64) CudnnRNNAttr {
+// FIFOQueueV2Container sets the optional container attribute to value.
+//
+// value: If non-empty, this queue is placed in the given container.
+// Otherwise, a default container is used.
+// If not specified, defaults to ""
+func FIFOQueueV2Container(value string) FIFOQueueV2Attr {
 	return func(m optionalAttr) {
-		m["seed2"] = value
+		m["container"] = value
 	}
 }
 
-// CudnnRNNIsTraining sets the optional is_training attribute to value.
-// If not specified, defaults to true
-func CudnnRNNIsTraining(value bool) CudnnRNNAttr {
+// FIFOQueueV2SharedName sets the optional shared_name attribute to value.
+//
+// value: If non-empty, this queue will be shared under the given name
+// across multiple sessions.
+// If not specified, defaults to ""
+func FIFOQueueV2SharedName(value string) FIFOQueueV2Attr {
 	return func(m optionalAttr) {
-		m["is_training"] = value
+		m["shared_name"] = value
 	}
 }
 
-// A RNN backed by cuDNN.
+// A queue that produces elements in first-in first-out order.
 //
-// Computes the RNN from the input and initial states, with respect to the params
-// buffer.
+// Arguments:
+//	component_types: The type of each component in a value.
 //
-// rnn_mode: Indicates the type of the RNN model.
-// input_mode: Indicate whether there is a linear projection between the input and
-//   The actual computation before the first layer. 'skip_input' is only allowed
-//   when input_size == num_units; 'auto_select' implies 'skip_input' when
-//   input_size == num_units; otherwise, it implies 'linear_input'.
-// direction: Indicates whether a bidirectional model will be used.
-//   dir = (direction == bidirectional) ? 2 : 1
-// dropout: dropout probability. When set to 0., dropout is disabled.
-// seed: the 1st part of a seed to initialize dropout.
-// seed2: the 2nd part of a seed to initialize dropout.
-// input: a 3-D tensor with the shape of [seq_length, batch_size, input_size].
-// input_h: a 3-D tensor with the shape of [num_layer * dir, batch_size,
-//     num_units].
-// input_c: For LSTM, a 3-D tensor with the shape of
-//     [num_layer * dir, batch, num_units]. For other models, it is ignored.
-// params: a 1-D tensor that contains the weights and biases in an opaque layout.
-//     The size must be created through CudnnRNNParamsSize, and initialized
-//     separately. Note that they might not be compatible across different
-//     generations. So it is a good idea to save and restore
-// output: a 3-D tensor with the shape of [seq_length, batch_size,
-//     dir * num_units].
-// output_h: the same shape has input_h.
-// output_c: the same shape as input_c for LSTM. An empty tensor for other models.
-// is_training: Indicates whether this operation is used for inferenece or
-//   training.
-// reserve_space: an opaque tensor that can be used in backprop calculation. It
-//   is only produced if is_training is false.
-func CudnnRNN(scope *Scope, input tf.Output, input_h tf.Output, input_c tf.Output, params tf.Output, optional ...CudnnRNNAttr) (output tf.Output, output_h tf.Output, output_c tf.Output, reserve_space tf.Output) {
+// Returns The handle to the queue.
+func FIFOQueueV2(scope *Scope, component_types []tf.DataType, optional ...FIFOQueueV2Attr) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
+	attrs := map[string]interface{}{"component_types": component_types}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "CudnnRNN",
-		Input: []tf.Input{
-			input, input_h, input_c, params,
-		},
+		Type: "FIFOQueueV2",
+
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2), op.Output(3)
+	return op.Output(0)
 }
 
-// Compare values of `input` to `threshold` and pack resulting bits into a `uint8`.
-//
-// Each comparison returns a boolean `true` (if `input_value > threshold`)
-// or and `false` otherwise.
-//
-// This operation is useful for Locality-Sensitive-Hashing (LSH) and other
-// algorithms that use hashing approximations of cosine and `L2` distances;
-// codes can be generated from an input via:
-//
-// ```python
-// codebook_size = 50
-// codebook_bits = codebook_size * 32
-// codebook = tf.get_variable('codebook', [x.shape[-1].value, codebook_bits],
-//                            dtype=x.dtype,
-//                            initializer=tf.orthogonal_initializer())
-// codes = compare_and_threshold(tf.matmul(x, codebook), threshold=0.)
-// codes = tf.bitcast(codes, tf.int32)  # go from uint8 to int32
-// # now codes has shape x.shape[:-1] + [codebook_size]
-// ```
-//
-// **NOTE**: Currently, the innermost dimension of the tensor must be divisible
-// by 8.
-//
-// Given an `input` shaped `[s0, s1, ..., s_n]`, the output is
-// a `uint8` tensor shaped `[s0, s1, ..., s_n / 8]`.
-//
-// Arguments:
-//	input: Values to compare against `threshold` and bitpack.
-//	threshold: Threshold to compare against.
-//
-// Returns The bitpacked comparisons.
-func CompareAndBitpack(scope *Scope, input tf.Output, threshold tf.Output) (output tf.Output) {
+// Produces a summary of any statistics recorded by the given statistics manager.
+func StatsAggregatorSummary(scope *Scope, iterator tf.Output) (summary tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "CompareAndBitpack",
+		Type: "StatsAggregatorSummary",
 		Input: []tf.Input{
-			input, threshold,
+			iterator,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Push an element onto the tensor_array.
+// Compute the pairwise cross product.
+//
+// `a` and `b` must be the same shape; they can either be simple 3-element vectors,
+// or any shape where the innermost dimension is 3. In the latter case, each pair
+// of corresponding 3-element vectors is cross-multiplied independently.
 //
 // Arguments:
-//	handle: The handle to a TensorArray.
-//	index: The position to write to inside the TensorArray.
-//	value: The tensor to write to the TensorArray.
-//	flow_in: A float scalar that enforces proper chaining of operations.
+//	a: A tensor containing 3-element vectors.
+//	b: Another tensor, of same type and shape as `a`.
 //
-// Returns A float scalar that enforces proper chaining of operations.
-func TensorArrayWriteV3(scope *Scope, handle tf.Output, index tf.Output, value tf.Output, flow_in tf.Output) (flow_out tf.Output) {
+// Returns Pairwise cross product of the vectors in `a` and `b`.
+func Cross(scope *Scope, a tf.Output, b tf.Output) (product tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "TensorArrayWriteV3",
+		Type: "Cross",
 		Input: []tf.Input{
-			handle, index, value, flow_in,
+			a, b,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Scatter the data from the input value into specific TensorArray elements.
-//
-// `indices` must be a vector, its length must match the first dim of `value`.
-//
-// Arguments:
-//	handle: The handle to a TensorArray.
-//	indices: The locations at which to write the tensor elements.
-//	value: The concatenated tensor to write to the TensorArray.
-//	flow_in: A float scalar that enforces proper chaining of operations.
+// Performs a padding as a preprocess during a convolution.
 //
-// Returns A float scalar that enforces proper chaining of operations.
-func TensorArrayScatterV3(scope *Scope, handle tf.Output, indices tf.Output, value tf.Output, flow_in tf.Output) (flow_out tf.Output) {
+// Similar to FusedResizeAndPadConv2d, this op allows for an optimized
+// implementation where the spatial padding transformation stage is fused with the
+// im2col lookup, but in this case without the bilinear filtering required for
+// resizing. Fusing the padding prevents the need to write out the intermediate
+// results as whole tensors, reducing memory pressure, and we can get some latency
+// gains by merging the transformation calculations.
+// The data_format attribute for Conv2D isn't supported by this op, and 'NHWC'
+// order is used instead.
+// Internally this op uses a single per-graph scratch buffer, which means that it
+// will block if multiple versions are being run in parallel. This is because this
+// operator is primarily an optimization to minimize memory usage.
+//
+// Arguments:
+//	input: 4-D with shape `[batch, in_height, in_width, in_channels]`.
+//	paddings: A two-column matrix specifying the padding sizes. The number of
+// rows must be the same as the rank of `input`.
+//	filter: 4-D with shape
+// `[filter_height, filter_width, in_channels, out_channels]`.
+//
+//	strides: 1-D of length 4.  The stride of the sliding window for each dimension
+// of `input`. Must be in the same order as the dimension specified with format.
+//	padding: The type of padding algorithm to use.
+func FusedPadConv2D(scope *Scope, input tf.Output, paddings tf.Output, filter tf.Output, mode string, strides []int64, padding string) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"mode": mode, "strides": strides, "padding": padding}
 	opspec := tf.OpSpec{
-		Type: "TensorArrayScatterV3",
+		Type: "FusedPadConv2D",
 		Input: []tf.Input{
-			handle, indices, value, flow_in,
+			input, paddings, filter,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// EmptyAttr is an optional argument to Empty.
-type EmptyAttr func(optionalAttr)
+// Conv2DBackpropInputAttr is an optional argument to Conv2DBackpropInput.
+type Conv2DBackpropInputAttr func(optionalAttr)
 
-// EmptyInit sets the optional init attribute to value.
+// Conv2DBackpropInputUseCudnnOnGpu sets the optional use_cudnn_on_gpu attribute to value.
+// If not specified, defaults to true
+func Conv2DBackpropInputUseCudnnOnGpu(value bool) Conv2DBackpropInputAttr {
+	return func(m optionalAttr) {
+		m["use_cudnn_on_gpu"] = value
+	}
+}
+
+// Conv2DBackpropInputDataFormat sets the optional data_format attribute to value.
 //
-// value: If True, initialize the returned tensor with the default value of dtype.  Otherwise, the implementation is free not to initializethe tensor's content.
-// If not specified, defaults to false
-func EmptyInit(value bool) EmptyAttr {
+// value: Specify the data format of the input and output data. With the
+// default format "NHWC", the data is stored in the order of:
+//     [batch, in_height, in_width, in_channels].
+// Alternatively, the format could be "NCHW", the data storage order of:
+//     [batch, in_channels, in_height, in_width].
+// If not specified, defaults to "NHWC"
+func Conv2DBackpropInputDataFormat(value string) Conv2DBackpropInputAttr {
 	return func(m optionalAttr) {
-		m["init"] = value
+		m["data_format"] = value
 	}
 }
 
-// Creates a tensor with the given shape.
+// Conv2DBackpropInputDilations sets the optional dilations attribute to value.
 //
-// This operation creates a tensor of `shape` and `dtype`.
+// value: 1-D tensor of length 4.  The dilation factor for each dimension of
+// `input`. If set to k > 1, there will be k-1 skipped cells between each filter
+// element on that dimension. The dimension order is determined by the value of
+// `data_format`, see above for details. Dilations in the batch and depth
+// dimensions must be 1.
+// If not specified, defaults to <i:1 i:1 i:1 i:1 >
+func Conv2DBackpropInputDilations(value []int64) Conv2DBackpropInputAttr {
+	return func(m optionalAttr) {
+		m["dilations"] = value
+	}
+}
+
+// Computes the gradients of convolution with respect to the input.
 //
 // Arguments:
-//	shape: 1-D. Represents the shape of the output tensor.
-//
+//	input_sizes: An integer vector representing the shape of `input`,
+// where `input` is a 4-D `[batch, height, width, channels]` tensor.
+//	filter: 4-D with shape
+// `[filter_height, filter_width, in_channels, out_channels]`.
+//	out_backprop: 4-D with shape `[batch, out_height, out_width, out_channels]`.
+// Gradients w.r.t. the output of the convolution.
+//	strides: The stride of the sliding window for each dimension of the input
+// of the convolution. Must be in the same order as the dimension specified with
+// format.
+//	padding: The type of padding algorithm to use.
 //
-// Returns A `Tensor` of type `T`.
-func Empty(scope *Scope, shape tf.Output, dtype tf.DataType, optional ...EmptyAttr) (output tf.Output) {
+// Returns 4-D with shape `[batch, in_height, in_width, in_channels]`.  Gradient
+// w.r.t. the input of the convolution.
+func Conv2DBackpropInput(scope *Scope, input_sizes tf.Output, filter tf.Output, out_backprop tf.Output, strides []int64, padding string, optional ...Conv2DBackpropInputAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtype": dtype}
+	attrs := map[string]interface{}{"strides": strides, "padding": padding}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "Empty",
+		Type: "Conv2DBackpropInput",
 		Input: []tf.Input{
-			shape,
+			input_sizes, filter, out_backprop,
 		},
 		Attrs: attrs,
 	}
@@ -27125,116 +26527,128 @@ func Empty(scope *Scope, shape tf.Output, dtype tf.DataType, optional ...EmptyAt
 	return op.Output(0)
 }
 
-// TensorArrayConcatV3Attr is an optional argument to TensorArrayConcatV3.
-type TensorArrayConcatV3Attr func(optionalAttr)
-
-// TensorArrayConcatV3ElementShapeExcept0 sets the optional element_shape_except0 attribute to value.
+// Interleave the values from the `data` tensors into a single tensor.
 //
-// value: The expected shape of an element, if known,
-// excluding the first dimension. Used to validate the shapes of
-// TensorArray elements. If this shape is not fully specified, concatenating
-// zero-size TensorArrays is an error.
-// If not specified, defaults to <unknown_rank:true >
-func TensorArrayConcatV3ElementShapeExcept0(value tf.Shape) TensorArrayConcatV3Attr {
-	return func(m optionalAttr) {
-		m["element_shape_except0"] = value
-	}
-}
-
-// Concat the elements from the TensorArray into value `value`.
+// Builds a merged tensor such that
 //
-// Takes `T` elements of shapes
+// ```python
+//     merged[indices[m][i, ..., j], ...] = data[m][i, ..., j, ...]
+// ```
 //
-//   ```
-//   (n0 x d0 x d1 x ...), (n1 x d0 x d1 x ...), ..., (n(T-1) x d0 x d1 x ...)
-//   ```
+// For example, if each `indices[m]` is scalar or vector, we have
 //
-// and concatenates them into a Tensor of shape:
+// ```python
+//     # Scalar indices:
+//     merged[indices[m], ...] = data[m][...]
 //
-//   ```(n0 + n1 + ... + n(T-1) x d0 x d1 x ...)```
+//     # Vector indices:
+//     merged[indices[m][i], ...] = data[m][i, ...]
+// ```
 //
-// All elements must have the same shape (excepting the first dimension).
+// Each `data[i].shape` must start with the corresponding `indices[i].shape`,
+// and the rest of `data[i].shape` must be constant w.r.t. `i`.  That is, we
+// must have `data[i].shape = indices[i].shape + constant`.  In terms of this
+// `constant`, the output shape is
 //
-// Arguments:
-//	handle: The handle to a TensorArray.
-//	flow_in: A float scalar that enforces proper chaining of operations.
-//	dtype: The type of the elem that is returned.
+//     merged.shape = [max(indices)] + constant
 //
-// Returns All of the elements in the TensorArray, concatenated along the first
-// axis.A vector of the row sizes of the original T elements in the
-// value output.  In the example above, this would be the values:
-// `(n1, n2, ..., n(T-1))`.
-func TensorArrayConcatV3(scope *Scope, handle tf.Output, flow_in tf.Output, dtype tf.DataType, optional ...TensorArrayConcatV3Attr) (value tf.Output, lengths tf.Output) {
+// Values are merged in order, so if an index appears in both `indices[m][i]` and
+// `indices[n][j]` for `(m,i) < (n,j)` the slice `data[n][j]` will appear in the
+// merged result. If you do not need this guarantee, ParallelDynamicStitch might
+// perform better on some devices.
+//
+// For example:
+//
+// ```python
+//     indices[0] = 6
+//     indices[1] = [4, 1]
+//     indices[2] = [[5, 2], [0, 3]]
+//     data[0] = [61, 62]
+//     data[1] = [[41, 42], [11, 12]]
+//     data[2] = [[[51, 52], [21, 22]], [[1, 2], [31, 32]]]
+//     merged = [[1, 2], [11, 12], [21, 22], [31, 32], [41, 42],
+//               [51, 52], [61, 62]]
+// ```
+//
+// This method can be used to merge partitions created by `dynamic_partition`
+// as illustrated on the following example:
+//
+// ```python
+//     # Apply function (increments x_i) on elements for which a certain condition
+//     # apply (x_i != -1 in this example).
+//     x=tf.constant([0.1, -1., 5.2, 4.3, -1., 7.4])
+//     condition_mask=tf.not_equal(x,tf.constant(-1.))
+//     partitioned_data = tf.dynamic_partition(
+//         x, tf.cast(condition_mask, tf.int32) , 2)
+//     partitioned_data[1] = partitioned_data[1] + 1.0
+//     condition_indices = tf.dynamic_partition(
+//         tf.range(tf.shape(x)[0]), tf.cast(condition_mask, tf.int32) , 2)
+//     x = tf.dynamic_stitch(condition_indices, partitioned_data)
+//     # Here x=[1.1, -1., 6.2, 5.3, -1, 8.4], the -1. values remain
+//     # unchanged.
+// ```
+//
+// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
+// <img style="width:100%" src="https://www.tensorflow.org/images/DynamicStitch.png" alt>
+// </div>
+func DynamicStitch(scope *Scope, indices []tf.Output, data []tf.Output) (merged tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtype": dtype}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "TensorArrayConcatV3",
+		Type: "DynamicStitch",
 		Input: []tf.Input{
-			handle, flow_in,
+			tf.OutputList(indices), tf.OutputList(data),
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1)
+	return op.Output(0)
 }
 
-// ParameterizedTruncatedNormalAttr is an optional argument to ParameterizedTruncatedNormal.
-type ParameterizedTruncatedNormalAttr func(optionalAttr)
-
-// ParameterizedTruncatedNormalSeed sets the optional seed attribute to value.
+// Returns the truth value of (x == y) element-wise.
 //
-// value: If either `seed` or `seed2` are set to be non-zero, the random number
-// generator is seeded by the given seed.  Otherwise, it is seeded by a
-// random seed.
-// If not specified, defaults to 0
-func ParameterizedTruncatedNormalSeed(value int64) ParameterizedTruncatedNormalAttr {
-	return func(m optionalAttr) {
-		m["seed"] = value
+// *NOTE*: `Equal` supports broadcasting. More about broadcasting
+// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
+func Equal(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "Equal",
+		Input: []tf.Input{
+			x, y,
+		},
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// ParameterizedTruncatedNormalSeed2 sets the optional seed2 attribute to value.
-//
-// value: A second seed to avoid seed collision.
-// If not specified, defaults to 0
-func ParameterizedTruncatedNormalSeed2(value int64) ParameterizedTruncatedNormalAttr {
+// TensorArrayGatherV2Attr is an optional argument to TensorArrayGatherV2.
+type TensorArrayGatherV2Attr func(optionalAttr)
+
+// TensorArrayGatherV2ElementShape sets the optional element_shape attribute to value.
+// If not specified, defaults to <unknown_rank:true >
+func TensorArrayGatherV2ElementShape(value tf.Shape) TensorArrayGatherV2Attr {
 	return func(m optionalAttr) {
-		m["seed2"] = value
+		m["element_shape"] = value
 	}
 }
 
-// Outputs random values from a normal distribution. The parameters may each be a
-//
-// scalar which applies to the entire output, or a vector of length shape[0] which
-// stores the parameters for each batch.
-//
-// Arguments:
-//	shape: The shape of the output tensor. Batches are indexed by the 0th dimension.
-//	means: The mean parameter of each batch.
-//	stdevs: The standard deviation parameter of each batch. Must be greater than 0.
-//	minvals: The minimum cutoff. May be -infinity.
-//	maxvals: The maximum cutoff. May be +infinity, and must be more than the minval
-// for each batch.
+// Deprecated. Use TensorArrayGatherV3
 //
-// Returns A matrix of shape num_batches x samples_per_batch, filled with random
-// truncated normal values using the parameters for each row.
-func ParameterizedTruncatedNormal(scope *Scope, shape tf.Output, means tf.Output, stdevs tf.Output, minvals tf.Output, maxvals tf.Output, optional ...ParameterizedTruncatedNormalAttr) (output tf.Output) {
+// DEPRECATED at GraphDef version 26: Use TensorArrayGatherV3
+func TensorArrayGatherV2(scope *Scope, handle tf.Output, indices tf.Output, flow_in tf.Output, dtype tf.DataType, optional ...TensorArrayGatherV2Attr) (value tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
+	attrs := map[string]interface{}{"dtype": dtype}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ParameterizedTruncatedNormal",
+		Type: "TensorArrayGatherV2",
 		Input: []tf.Input{
-			shape, means, stdevs, minvals, maxvals,
+			handle, indices, flow_in,
 		},
 		Attrs: attrs,
 	}
@@ -27242,137 +26656,116 @@ func ParameterizedTruncatedNormal(scope *Scope, shape tf.Output, means tf.Output
 	return op.Output(0)
 }
 
-// Sets the index-th position of the list to contain the given tensor.
+// Interleave the values from the `data` tensors into a single tensor.
 //
-// input_handle: the list
-// index: the position in the list to which the tensor will be assigned
-// item: the element to be assigned to that position
-// output_handle: the new list, with the element in the proper position
+// Builds a merged tensor such that
 //
-func TensorListSetItem(scope *Scope, input_handle tf.Output, index tf.Output, item tf.Output) (output_handle tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "TensorListSetItem",
-		Input: []tf.Input{
-			input_handle, index, item,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Returns a diagonal tensor with a given diagonal values.
+// ```python
+//     merged[indices[m][i, ..., j], ...] = data[m][i, ..., j, ...]
+// ```
 //
-// Given a `diagonal`, this operation returns a tensor with the `diagonal` and
-// everything else padded with zeros. The diagonal is computed as follows:
+// For example, if each `indices[m]` is scalar or vector, we have
 //
-// Assume `diagonal` has dimensions [D1,..., Dk], then the output is a tensor of
-// rank 2k with dimensions [D1,..., Dk, D1,..., Dk] where:
+// ```python
+//     # Scalar indices:
+//     merged[indices[m], ...] = data[m][...]
 //
-// `output[i1,..., ik, i1,..., ik] = diagonal[i1, ..., ik]` and 0 everywhere else.
+//     # Vector indices:
+//     merged[indices[m][i], ...] = data[m][i, ...]
+// ```
+//
+// Each `data[i].shape` must start with the corresponding `indices[i].shape`,
+// and the rest of `data[i].shape` must be constant w.r.t. `i`.  That is, we
+// must have `data[i].shape = indices[i].shape + constant`.  In terms of this
+// `constant`, the output shape is
+//
+//     merged.shape = [max(indices)] + constant
+//
+// Values may be merged in parallel, so if an index appears in both `indices[m][i]`
+// and `indices[n][j]`, the result may be invalid. This differs from the normal
+// DynamicStitch operator that defines the behavior in that case.
 //
 // For example:
 //
+// ```python
+//     indices[0] = 6
+//     indices[1] = [4, 1]
+//     indices[2] = [[5, 2], [0, 3]]
+//     data[0] = [61, 62]
+//     data[1] = [[41, 42], [11, 12]]
+//     data[2] = [[[51, 52], [21, 22]], [[1, 2], [31, 32]]]
+//     merged = [[1, 2], [11, 12], [21, 22], [31, 32], [41, 42],
+//               [51, 52], [61, 62]]
 // ```
-// # 'diagonal' is [1, 2, 3, 4]
-// tf.diag(diagonal) ==> [[1, 0, 0, 0]
-//                        [0, 2, 0, 0]
-//                        [0, 0, 3, 0]
-//                        [0, 0, 0, 4]]
+//
+// This method can be used to merge partitions created by `dynamic_partition`
+// as illustrated on the following example:
+//
+// ```python
+//     # Apply function (increments x_i) on elements for which a certain condition
+//     # apply (x_i != -1 in this example).
+//     x=tf.constant([0.1, -1., 5.2, 4.3, -1., 7.4])
+//     condition_mask=tf.not_equal(x,tf.constant(-1.))
+//     partitioned_data = tf.dynamic_partition(
+//         x, tf.cast(condition_mask, tf.int32) , 2)
+//     partitioned_data[1] = partitioned_data[1] + 1.0
+//     condition_indices = tf.dynamic_partition(
+//         tf.range(tf.shape(x)[0]), tf.cast(condition_mask, tf.int32) , 2)
+//     x = tf.dynamic_stitch(condition_indices, partitioned_data)
+//     # Here x=[1.1, -1., 6.2, 5.3, -1, 8.4], the -1. values remain
+//     # unchanged.
 // ```
 //
-// Arguments:
-//	diagonal: Rank k tensor where k is at most 1.
-func Diag(scope *Scope, diagonal tf.Output) (output tf.Output) {
+// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
+// <img style="width:100%" src="https://www.tensorflow.org/images/DynamicStitch.png" alt>
+// </div>
+func ParallelDynamicStitch(scope *Scope, indices []tf.Output, data []tf.Output) (merged tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "Diag",
+		Type: "ParallelDynamicStitch",
 		Input: []tf.Input{
-			diagonal,
+			tf.OutputList(indices), tf.OutputList(data),
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Split the data from the input value into TensorArray elements.
-//
-// Assuming that `lengths` takes on values
-//
-//   ```(n0, n1, ..., n(T-1))```
-//
-// and that `value` has shape
-//
-//   ```(n0 + n1 + ... + n(T-1) x d0 x d1 x ...)```,
-//
-// this splits values into a TensorArray with T tensors.
-//
-// TensorArray index t will be the subtensor of values with starting position
-//
-//   ```(n0 + n1 + ... + n(t-1), 0, 0, ...)```
-//
-// and having size
-//
-//   ```nt x d0 x d1 x ...```
-//
-// Arguments:
-//	handle: The handle to a TensorArray.
-//	value: The concatenated tensor to write to the TensorArray.
-//	lengths: The vector of lengths, how to split the rows of value into the
-// TensorArray.
-//	flow_in: A float scalar that enforces proper chaining of operations.
+// Computes the gradient for the inverse of `x` wrt its input.
 //
-// Returns A float scalar that enforces proper chaining of operations.
-func TensorArraySplitV3(scope *Scope, handle tf.Output, value tf.Output, lengths tf.Output, flow_in tf.Output) (flow_out tf.Output) {
+// Specifically, `grad = -dy * y*y`, where `y = 1/x`, and `dy`
+// is the corresponding input gradient.
+func InvGrad(scope *Scope, y tf.Output, dy tf.Output) (z tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "TensorArraySplitV3",
+		Type: "InvGrad",
 		Input: []tf.Input{
-			handle, value, lengths, flow_in,
+			y, dy,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// SerializeSparseAttr is an optional argument to SerializeSparse.
-type SerializeSparseAttr func(optionalAttr)
-
-// SerializeSparseOutType sets the optional out_type attribute to value.
-//
-// value: The `dtype` to use for serialization; the supported types are `string`
-// (default) and `variant`.
-// If not specified, defaults to DT_STRING
-func SerializeSparseOutType(value tf.DataType) SerializeSparseAttr {
-	return func(m optionalAttr) {
-		m["out_type"] = value
-	}
-}
-
-// Serialize a `SparseTensor` into a `[3]` `Tensor` object.
+// List of the given size with empty elements.
 //
-// Arguments:
-//	sparse_indices: 2-D.  The `indices` of the `SparseTensor`.
-//	sparse_values: 1-D.  The `values` of the `SparseTensor`.
-//	sparse_shape: 1-D.  The `shape` of the `SparseTensor`.
-func SerializeSparse(scope *Scope, sparse_indices tf.Output, sparse_values tf.Output, sparse_shape tf.Output, optional ...SerializeSparseAttr) (serialized_sparse tf.Output) {
+// element_shape: the shape of the future elements of the list
+// num_elements: the number of elements to reserve
+// handle: the output list
+// element_dtype: the desired type of elements in the list.
+func TensorListReserve(scope *Scope, element_shape tf.Output, num_elements tf.Output, element_dtype tf.DataType) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
+	attrs := map[string]interface{}{"element_dtype": element_dtype}
 	opspec := tf.OpSpec{
-		Type: "SerializeSparse",
+		Type: "TensorListReserve",
 		Input: []tf.Input{
-			sparse_indices, sparse_values, sparse_shape,
+			element_shape, num_elements,
 		},
 		Attrs: attrs,
 	}
@@ -27380,106 +26773,79 @@ func SerializeSparse(scope *Scope, sparse_indices tf.Output, sparse_values tf.Ou
 	return op.Output(0)
 }
 
-// RandomShuffleQueueV2Attr is an optional argument to RandomShuffleQueueV2.
-type RandomShuffleQueueV2Attr func(optionalAttr)
+// PriorityQueueV2Attr is an optional argument to PriorityQueueV2.
+type PriorityQueueV2Attr func(optionalAttr)
 
-// RandomShuffleQueueV2Shapes sets the optional shapes attribute to value.
+// PriorityQueueV2ComponentTypes sets the optional component_types attribute to value.
 //
-// value: The shape of each component in a value. The length of this attr must
-// be either 0 or the same as the length of component_types. If the length of
-// this attr is 0, the shapes of queue elements are not constrained, and
-// only one element may be dequeued at a time.
+// value: The type of each component in a value.
 // If not specified, defaults to <>
 //
 // REQUIRES: len(value) >= 0
-func RandomShuffleQueueV2Shapes(value []tf.Shape) RandomShuffleQueueV2Attr {
+func PriorityQueueV2ComponentTypes(value []tf.DataType) PriorityQueueV2Attr {
 	return func(m optionalAttr) {
-		m["shapes"] = value
+		m["component_types"] = value
 	}
 }
 
-// RandomShuffleQueueV2Capacity sets the optional capacity attribute to value.
+// PriorityQueueV2Capacity sets the optional capacity attribute to value.
 //
 // value: The upper bound on the number of elements in this queue.
 // Negative numbers mean no limit.
 // If not specified, defaults to -1
-func RandomShuffleQueueV2Capacity(value int64) RandomShuffleQueueV2Attr {
+func PriorityQueueV2Capacity(value int64) PriorityQueueV2Attr {
 	return func(m optionalAttr) {
 		m["capacity"] = value
 	}
 }
 
-// RandomShuffleQueueV2MinAfterDequeue sets the optional min_after_dequeue attribute to value.
-//
-// value: Dequeue will block unless there would be this
-// many elements after the dequeue or the queue is closed. This
-// ensures a minimum level of mixing of elements.
-// If not specified, defaults to 0
-func RandomShuffleQueueV2MinAfterDequeue(value int64) RandomShuffleQueueV2Attr {
-	return func(m optionalAttr) {
-		m["min_after_dequeue"] = value
-	}
-}
-
-// RandomShuffleQueueV2Seed sets the optional seed attribute to value.
-//
-// value: If either seed or seed2 is set to be non-zero, the random number
-// generator is seeded by the given seed.  Otherwise, a random seed is used.
-// If not specified, defaults to 0
-func RandomShuffleQueueV2Seed(value int64) RandomShuffleQueueV2Attr {
-	return func(m optionalAttr) {
-		m["seed"] = value
-	}
-}
-
-// RandomShuffleQueueV2Seed2 sets the optional seed2 attribute to value.
-//
-// value: A second seed to avoid seed collision.
-// If not specified, defaults to 0
-func RandomShuffleQueueV2Seed2(value int64) RandomShuffleQueueV2Attr {
-	return func(m optionalAttr) {
-		m["seed2"] = value
-	}
-}
-
-// RandomShuffleQueueV2Container sets the optional container attribute to value.
+// PriorityQueueV2Container sets the optional container attribute to value.
 //
 // value: If non-empty, this queue is placed in the given container.
 // Otherwise, a default container is used.
 // If not specified, defaults to ""
-func RandomShuffleQueueV2Container(value string) RandomShuffleQueueV2Attr {
+func PriorityQueueV2Container(value string) PriorityQueueV2Attr {
 	return func(m optionalAttr) {
 		m["container"] = value
 	}
 }
 
-// RandomShuffleQueueV2SharedName sets the optional shared_name attribute to value.
+// PriorityQueueV2SharedName sets the optional shared_name attribute to value.
 //
 // value: If non-empty, this queue will be shared under the given name
 // across multiple sessions.
 // If not specified, defaults to ""
-func RandomShuffleQueueV2SharedName(value string) RandomShuffleQueueV2Attr {
+func PriorityQueueV2SharedName(value string) PriorityQueueV2Attr {
 	return func(m optionalAttr) {
 		m["shared_name"] = value
 	}
 }
 
-// A queue that randomizes the order of elements.
+// A queue that produces elements sorted by the first component value.
+//
+// Note that the PriorityQueue requires the first component of any element
+// to be a scalar int64, in addition to the other elements declared by
+// component_types.  Therefore calls to Enqueue and EnqueueMany (resp. Dequeue
+// and DequeueMany) on a PriorityQueue will all require (resp. output) one extra
+// entry in their input (resp. output) lists.
 //
 // Arguments:
-//	component_types: The type of each component in a value.
+//	shapes: The shape of each component in a value. The length of this attr must
+// be either 0 or the same as the length of component_types. If the length of
+// this attr is 0, the shapes of queue elements are not constrained, and
+// only one element may be dequeued at a time.
 //
 // Returns The handle to the queue.
-func RandomShuffleQueueV2(scope *Scope, component_types []tf.DataType, optional ...RandomShuffleQueueV2Attr) (handle tf.Output) {
+func PriorityQueueV2(scope *Scope, shapes []tf.Shape, optional ...PriorityQueueV2Attr) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"component_types": component_types}
+	attrs := map[string]interface{}{"shapes": shapes}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "RandomShuffleQueueV2",
+		Type: "PriorityQueueV2",
 
 		Attrs: attrs,
 	}
@@ -27487,291 +26853,302 @@ func RandomShuffleQueueV2(scope *Scope, component_types []tf.DataType, optional
 	return op.Output(0)
 }
 
-// Draw bounding boxes on a batch of images.
-//
-// Outputs a copy of `images` but draws on top of the pixels zero or more bounding
-// boxes specified by the locations in `boxes`. The coordinates of the each
-// bounding box in `boxes` are encoded as `[y_min, x_min, y_max, x_max]`. The
-// bounding box coordinates are floats in `[0.0, 1.0]` relative to the width and
-// height of the underlying image.
-//
-// For example, if an image is 100 x 200 pixels (height x width) and the bounding
-// box is `[0.1, 0.2, 0.5, 0.9]`, the upper-left and bottom-right coordinates of
-// the bounding box will be `(40, 10)` to `(100, 50)` (in (x,y) coordinates).
+// UnstageAttr is an optional argument to Unstage.
+type UnstageAttr func(optionalAttr)
+
+// UnstageCapacity sets the optional capacity attribute to value.
+// If not specified, defaults to 0
 //
-// Parts of the bounding box may fall outside the image.
+// REQUIRES: value >= 0
+func UnstageCapacity(value int64) UnstageAttr {
+	return func(m optionalAttr) {
+		m["capacity"] = value
+	}
+}
+
+// UnstageMemoryLimit sets the optional memory_limit attribute to value.
+// If not specified, defaults to 0
 //
-// Arguments:
-//	images: 4-D with shape `[batch, height, width, depth]`. A batch of images.
-//	boxes: 3-D with shape `[batch, num_bounding_boxes, 4]` containing bounding
-// boxes.
+// REQUIRES: value >= 0
+func UnstageMemoryLimit(value int64) UnstageAttr {
+	return func(m optionalAttr) {
+		m["memory_limit"] = value
+	}
+}
+
+// UnstageContainer sets the optional container attribute to value.
+// If not specified, defaults to ""
+func UnstageContainer(value string) UnstageAttr {
+	return func(m optionalAttr) {
+		m["container"] = value
+	}
+}
+
+// UnstageSharedName sets the optional shared_name attribute to value.
+// If not specified, defaults to ""
+func UnstageSharedName(value string) UnstageAttr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
+	}
+}
+
+// Op is similar to a lightweight Dequeue.
 //
-// Returns 4-D with the same shape as `images`. The batch of input images with
-// bounding boxes drawn on the images.
-func DrawBoundingBoxes(scope *Scope, images tf.Output, boxes tf.Output) (output tf.Output) {
+// The basic functionality is similar to dequeue with many fewer
+// capabilities and options.  This Op is optimized for performance.
+func Unstage(scope *Scope, dtypes []tf.DataType, optional ...UnstageAttr) (values []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"dtypes": dtypes}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "DrawBoundingBoxes",
-		Input: []tf.Input{
-			images, boxes,
-		},
+		Type: "Unstage",
+
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// LearnedUnigramCandidateSamplerAttr is an optional argument to LearnedUnigramCandidateSampler.
-type LearnedUnigramCandidateSamplerAttr func(optionalAttr)
-
-// LearnedUnigramCandidateSamplerSeed sets the optional seed attribute to value.
-//
-// value: If either seed or seed2 are set to be non-zero, the random number
-// generator is seeded by the given seed.  Otherwise, it is seeded by a
-// random seed.
-// If not specified, defaults to 0
-func LearnedUnigramCandidateSamplerSeed(value int64) LearnedUnigramCandidateSamplerAttr {
-	return func(m optionalAttr) {
-		m["seed"] = value
+	if scope.Err() != nil {
+		return
 	}
+	var idx int
+	var err error
+	if values, idx, err = makeOutputList(op, idx, "values"); err != nil {
+		scope.UpdateErr("Unstage", err)
+		return
+	}
+	return values
 }
 
-// LearnedUnigramCandidateSamplerSeed2 sets the optional seed2 attribute to value.
+// QueueEnqueueV2Attr is an optional argument to QueueEnqueueV2.
+type QueueEnqueueV2Attr func(optionalAttr)
+
+// QueueEnqueueV2TimeoutMs sets the optional timeout_ms attribute to value.
 //
-// value: An second seed to avoid seed collision.
-// If not specified, defaults to 0
-func LearnedUnigramCandidateSamplerSeed2(value int64) LearnedUnigramCandidateSamplerAttr {
+// value: If the queue is full, this operation will block for up to
+// timeout_ms milliseconds.
+// Note: This option is not supported yet.
+// If not specified, defaults to -1
+func QueueEnqueueV2TimeoutMs(value int64) QueueEnqueueV2Attr {
 	return func(m optionalAttr) {
-		m["seed2"] = value
+		m["timeout_ms"] = value
 	}
 }
 
-// Generates labels for candidate sampling with a learned unigram distribution.
-//
-// See explanations of candidate sampling and the data formats at
-// go/candidate-sampling.
+// Enqueues a tuple of one or more tensors in the given queue.
 //
-// For each batch, this op picks a single set of sampled candidate labels.
+// The components input has k elements, which correspond to the components of
+// tuples stored in the given queue.
 //
-// The advantages of sampling candidates per-batch are simplicity and the
-// possibility of efficient dense matrix multiplication. The disadvantage is that
-// the sampled candidates must be chosen independently of the context and of the
-// true labels.
+// N.B. If the queue is full, this operation will block until the given
+// element has been enqueued (or 'timeout_ms' elapses, if specified).
 //
 // Arguments:
-//	true_classes: A batch_size * num_true matrix, in which each row contains the
-// IDs of the num_true target_classes in the corresponding original label.
-//	num_true: Number of true labels per context.
-//	num_sampled: Number of candidates to randomly sample.
-//	unique: If unique is true, we sample with rejection, so that all sampled
-// candidates in a batch are unique. This requires some approximation to
-// estimate the post-rejection sampling probabilities.
-//	range_max: The sampler will sample integers from the interval [0, range_max).
+//	handle: The handle to a queue.
+//	components: One or more tensors from which the enqueued tensors should be taken.
 //
-// Returns A vector of length num_sampled, in which each element is
-// the ID of a sampled candidate.A batch_size * num_true matrix, representing
-// the number of times each candidate is expected to occur in a batch
-// of sampled candidates. If unique=true, then this is a probability.A vector of length num_sampled, for each sampled
-// candidate representing the number of times the candidate is expected
-// to occur in a batch of sampled candidates.  If unique=true, then this is a
-// probability.
-func LearnedUnigramCandidateSampler(scope *Scope, true_classes tf.Output, num_true int64, num_sampled int64, unique bool, range_max int64, optional ...LearnedUnigramCandidateSamplerAttr) (sampled_candidates tf.Output, true_expected_count tf.Output, sampled_expected_count tf.Output) {
+// Returns the created operation.
+func QueueEnqueueV2(scope *Scope, handle tf.Output, components []tf.Output, optional ...QueueEnqueueV2Attr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"num_true": num_true, "num_sampled": num_sampled, "unique": unique, "range_max": range_max}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "LearnedUnigramCandidateSampler",
+		Type: "QueueEnqueueV2",
 		Input: []tf.Input{
-			true_classes,
+			handle, tf.OutputList(components),
 		},
 		Attrs: attrs,
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
+	return scope.AddOperation(opspec)
 }
 
-// Computes gradients for the scaled exponential linear (Selu) operation.
+// QueueDequeueManyV2Attr is an optional argument to QueueDequeueManyV2.
+type QueueDequeueManyV2Attr func(optionalAttr)
+
+// QueueDequeueManyV2TimeoutMs sets the optional timeout_ms attribute to value.
+//
+// value: If the queue has fewer than n elements, this operation
+// will block for up to timeout_ms milliseconds.
+// Note: This option is not supported yet.
+// If not specified, defaults to -1
+func QueueDequeueManyV2TimeoutMs(value int64) QueueDequeueManyV2Attr {
+	return func(m optionalAttr) {
+		m["timeout_ms"] = value
+	}
+}
+
+// Dequeues `n` tuples of one or more tensors from the given queue.
+//
+// If the queue is closed and there are fewer than `n` elements, then an
+// OutOfRange error is returned.
+//
+// This operation concatenates queue-element component tensors along the
+// 0th dimension to make a single component tensor.  All of the components
+// in the dequeued tuple will have size `n` in the 0th dimension.
+//
+// This operation has `k` outputs, where `k` is the number of components in
+// the tuples stored in the given queue, and output `i` is the ith
+// component of the dequeued tuple.
+//
+// N.B. If the queue is empty, this operation will block until `n` elements
+// have been dequeued (or 'timeout_ms' elapses, if specified).
 //
 // Arguments:
-//	gradients: The backpropagated gradients to the corresponding Selu operation.
-//	outputs: The outputs of the corresponding Selu operation.
+//	handle: The handle to a queue.
+//	n: The number of tuples to dequeue.
+//	component_types: The type of each component in a tuple.
 //
-// Returns The gradients: `gradients * (outputs + scale * alpha)`
-// if outputs < 0, `scale * gradients` otherwise.
-func SeluGrad(scope *Scope, gradients tf.Output, outputs tf.Output) (backprops tf.Output) {
+// Returns One or more tensors that were dequeued as a tuple.
+func QueueDequeueManyV2(scope *Scope, handle tf.Output, n tf.Output, component_types []tf.DataType, optional ...QueueDequeueManyV2Attr) (components []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"component_types": component_types}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "SeluGrad",
+		Type: "QueueDequeueManyV2",
 		Input: []tf.Input{
-			gradients, outputs,
+			handle, n,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	if scope.Err() != nil {
+		return
+	}
+	var idx int
+	var err error
+	if components, idx, err = makeOutputList(op, idx, "components"); err != nil {
+		scope.UpdateErr("QueueDequeueManyV2", err)
+		return
+	}
+	return components
 }
 
-// Get the current size of the TensorArray.
+// EncodeBase64Attr is an optional argument to EncodeBase64.
+type EncodeBase64Attr func(optionalAttr)
+
+// EncodeBase64Pad sets the optional pad attribute to value.
+//
+// value: Bool whether padding is applied at the ends.
+// If not specified, defaults to false
+func EncodeBase64Pad(value bool) EncodeBase64Attr {
+	return func(m optionalAttr) {
+		m["pad"] = value
+	}
+}
+
+// Encode strings into web-safe base64 format.
+//
+// Refer to the following article for more information on base64 format:
+// en.wikipedia.org/wiki/Base64. Base64 strings may have padding with '=' at the
+// end so that the encoded has length multiple of 4. See Padding section of the
+// link above.
+//
+// Web-safe means that the encoder uses - and _ instead of + and /.
 //
 // Arguments:
-//	handle: The handle to a TensorArray (output of TensorArray or TensorArrayGrad).
-//	flow_in: A float scalar that enforces proper chaining of operations.
+//	input: Strings to be encoded.
 //
-// Returns The current size of the TensorArray.
-func TensorArraySizeV3(scope *Scope, handle tf.Output, flow_in tf.Output) (size tf.Output) {
+// Returns Input strings encoded in base64.
+func EncodeBase64(scope *Scope, input tf.Output, optional ...EncodeBase64Attr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "TensorArraySizeV3",
+		Type: "EncodeBase64",
 		Input: []tf.Input{
-			handle, flow_in,
+			input,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Deprecated. Use TensorArrayGradV3
+// Deprecated. Use TensorArrayCloseV3
 //
-// DEPRECATED at GraphDef version 26: Use TensorArrayWriteV3
-func TensorArrayWriteV2(scope *Scope, handle tf.Output, index tf.Output, value tf.Output, flow_in tf.Output) (flow_out tf.Output) {
+// DEPRECATED at GraphDef version 26: Use TensorArrayCloseV3
+//
+// Returns the created operation.
+func TensorArrayCloseV2(scope *Scope, handle tf.Output) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "TensorArrayWriteV2",
+		Type: "TensorArrayCloseV2",
 		Input: []tf.Input{
-			handle, index, value, flow_in,
+			handle,
 		},
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return scope.AddOperation(opspec)
 }
 
-// SparseReduceMaxAttr is an optional argument to SparseReduceMax.
-type SparseReduceMaxAttr func(optionalAttr)
-
-// SparseReduceMaxKeepDims sets the optional keep_dims attribute to value.
+// Forwards the value of an available tensor from `inputs` to `output`.
 //
-// value: If true, retain reduced dimensions with length 1.
-// If not specified, defaults to false
-func SparseReduceMaxKeepDims(value bool) SparseReduceMaxAttr {
-	return func(m optionalAttr) {
-		m["keep_dims"] = value
-	}
-}
-
-// Computes the max of elements across dimensions of a SparseTensor.
-//
-// This Op takes a SparseTensor and is the sparse counterpart to
-// `tf.reduce_max()`.  In particular, this Op also returns a dense `Tensor`
-// instead of a sparse one.
-//
-// Reduces `sp_input` along the dimensions given in `reduction_axes`.  Unless
-// `keep_dims` is true, the rank of the tensor is reduced by 1 for each entry in
-// `reduction_axes`. If `keep_dims` is true, the reduced dimensions are retained
-// with length 1.
+// `Merge` waits for at least one of the tensors in `inputs` to become available.
+// It is usually combined with `Switch` to implement branching.
 //
-// If `reduction_axes` has no entries, all dimensions are reduced, and a tensor
-// with a single element is returned.  Additionally, the axes can be negative,
-// which are interpreted according to the indexing rules in Python.
+// `Merge` forwards the first tensor to become available to `output`, and sets
+// `value_index` to its index in `inputs`.
 //
 // Arguments:
-//	input_indices: 2-D.  `N x R` matrix with the indices of non-empty values in a
-// SparseTensor, possibly not in canonical ordering.
-//	input_values: 1-D.  `N` non-empty values corresponding to `input_indices`.
-//	input_shape: 1-D.  Shape of the input SparseTensor.
-//	reduction_axes: 1-D.  Length-`K` vector containing the reduction axes.
+//	inputs: The input tensors, exactly one of which will become available.
 //
-// Returns `R-K`-D.  The reduced Tensor.
-func SparseReduceMax(scope *Scope, input_indices tf.Output, input_values tf.Output, input_shape tf.Output, reduction_axes tf.Output, optional ...SparseReduceMaxAttr) (output tf.Output) {
+// Returns Will be set to the available input tensor.The index of the chosen input tensor in `inputs`.
+func Merge(scope *Scope, inputs []tf.Output) (output tf.Output, value_index tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "SparseReduceMax",
+		Type: "Merge",
 		Input: []tf.Input{
-			input_indices, input_values, input_shape, reduction_axes,
+			tf.OutputList(inputs),
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// AsStringAttr is an optional argument to AsString.
-type AsStringAttr func(optionalAttr)
-
-// AsStringPrecision sets the optional precision attribute to value.
-//
-// value: The post-decimal precision to use for floating point numbers.
-// Only used if precision > -1.
-// If not specified, defaults to -1
-func AsStringPrecision(value int64) AsStringAttr {
-	return func(m optionalAttr) {
-		m["precision"] = value
-	}
+	return op.Output(0), op.Output(1)
 }
 
-// AsStringScientific sets the optional scientific attribute to value.
-//
-// value: Use scientific notation for floating point numbers.
-// If not specified, defaults to false
-func AsStringScientific(value bool) AsStringAttr {
-	return func(m optionalAttr) {
-		m["scientific"] = value
-	}
-}
+// QueueCloseV2Attr is an optional argument to QueueCloseV2.
+type QueueCloseV2Attr func(optionalAttr)
 
-// AsStringShortest sets the optional shortest attribute to value.
+// QueueCloseV2CancelPendingEnqueues sets the optional cancel_pending_enqueues attribute to value.
 //
-// value: Use shortest representation (either scientific or standard) for
-// floating point numbers.
+// value: If true, all pending enqueue requests that are
+// blocked on the given queue will be canceled.
 // If not specified, defaults to false
-func AsStringShortest(value bool) AsStringAttr {
+func QueueCloseV2CancelPendingEnqueues(value bool) QueueCloseV2Attr {
 	return func(m optionalAttr) {
-		m["shortest"] = value
+		m["cancel_pending_enqueues"] = value
 	}
 }
 
-// AsStringWidth sets the optional width attribute to value.
+// Closes the given queue.
 //
-// value: Pad pre-decimal numbers to this width.
-// Applies to both floating point and integer numbers.
-// Only used if width > -1.
-// If not specified, defaults to -1
-func AsStringWidth(value int64) AsStringAttr {
-	return func(m optionalAttr) {
-		m["width"] = value
-	}
-}
-
-// AsStringFill sets the optional fill attribute to value.
+// This operation signals that no more elements will be enqueued in the
+// given queue. Subsequent Enqueue(Many) operations will fail.
+// Subsequent Dequeue(Many) operations will continue to succeed if
+// sufficient elements remain in the queue. Subsequent Dequeue(Many)
+// operations that would block will fail immediately.
 //
-// value: The value to pad if width > -1.  If empty, pads with spaces.
-// Another typical value is '0'.  String cannot be longer than 1 character.
-// If not specified, defaults to ""
-func AsStringFill(value string) AsStringAttr {
-	return func(m optionalAttr) {
-		m["fill"] = value
-	}
-}
-
-// Converts each entry in the given tensor to strings.  Supports many numeric
+// Arguments:
+//	handle: The handle to a queue.
 //
-// types and boolean.
-func AsString(scope *Scope, input tf.Output, optional ...AsStringAttr) (output tf.Output) {
+// Returns the created operation.
+func QueueCloseV2(scope *Scope, handle tf.Output, optional ...QueueCloseV2Attr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
@@ -27780,347 +27157,226 @@ func AsString(scope *Scope, input tf.Output, optional ...AsStringAttr) (output t
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "AsString",
+		Type: "QueueCloseV2",
 		Input: []tf.Input{
-			input,
+			handle,
 		},
 		Attrs: attrs,
 	}
+	return scope.AddOperation(opspec)
+}
+
+// Computes inverse hyperbolic tangent of x element-wise.
+func Atanh(scope *Scope, x tf.Output) (y tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "Atanh",
+		Input: []tf.Input{
+			x,
+		},
+	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Deprecated. Use TensorArrayScatterV3
+// Returns true if queue is closed.
 //
-// DEPRECATED at GraphDef version 26: Use TensorArrayScatterV3
-func TensorArrayScatterV2(scope *Scope, handle tf.Output, indices tf.Output, value tf.Output, flow_in tf.Output) (flow_out tf.Output) {
+// This operation returns true if the queue is closed and false if the queue
+// is open.
+//
+// Arguments:
+//	handle: The handle to a queue.
+func QueueIsClosedV2(scope *Scope, handle tf.Output) (is_closed tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "TensorArrayScatterV2",
+		Type: "QueueIsClosedV2",
 		Input: []tf.Input{
-			handle, indices, value, flow_in,
+			handle,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Creates a tree ensemble model and returns a handle to it.
-//
-// Arguments:
-//	tree_ensemble_handle: Handle to the tree ensemble resource to be created.
-//	stamp_token: Token to use as the initial value of the resource stamp.
-//	tree_ensemble_serialized: Serialized proto of the tree ensemble.
+// Computes the absolute value of a tensor.
 //
-// Returns the created operation.
-func BoostedTreesCreateEnsemble(scope *Scope, tree_ensemble_handle tf.Output, stamp_token tf.Output, tree_ensemble_serialized tf.Output) (o *tf.Operation) {
+// Given a tensor `x`, this operation returns a tensor containing the absolute
+// value of each element in `x`. For example, if x is an input element and y is
+// an output element, this operation computes \\(y = |x|\\).
+func Abs(scope *Scope, x tf.Output) (y tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "BoostedTreesCreateEnsemble",
+		Type: "Abs",
 		Input: []tf.Input{
-			tree_ensemble_handle, stamp_token, tree_ensemble_serialized,
+			x,
 		},
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// Applies sparse addition to `input` using individual values or slices
-//
-// from `updates` according to indices `indices`.  The updates are non-aliasing:
-// `input` is only modified in-place if no other operations will use it.
-// Otherwise, a copy of `input` is made.  This operation has a gradient with
-// respect to both `input` and `updates`.
-//
-// `input` is a `Tensor` with rank `P` and `indices` is a `Tensor` of rank `Q`.
-//
-// `indices` must be integer tensor, containing indices into `input`.
-// It must be shape `[d_0, ..., d_{Q-2}, K]` where `0 < K <= P`.
-//
-// The innermost dimension of `indices` (with length `K`) corresponds to
-// indices into elements (if `K = P`) or `(P-K)`-dimensional slices
-// (if `K < P`) along the `K`th dimension of `input`.
-//
-// `updates` is `Tensor` of rank `Q-1+P-K` with shape:
-//
-// ```
-// [d_0, ..., d_{Q-2}, input.shape[K], ..., input.shape[P-1]].
-// ```
-//
-// For example, say we want to add 4 scattered elements to a rank-1 tensor to 8
-// elements. In Python, that addition would look like this:
-//
-//     input = tf.constant([1, 2, 3, 4, 5, 6, 7, 8])
-//     indices = tf.constant([[4], [3], [1], [7]])
-//     updates = tf.constant([9, 10, 11, 12])
-//     output = tf.scatter_nd_non_aliasing_add(input, indices, updates)
-//     with tf.Session() as sess:
-//       print(sess.run(output))
-//
-// The resulting value `output` would look like this:
-//
-//     [1, 13, 3, 14, 14, 6, 7, 20]
+// StackV2Attr is an optional argument to StackV2.
+type StackV2Attr func(optionalAttr)
+
+// StackV2StackName sets the optional stack_name attribute to value.
 //
-// See @{tf.scatter_nd} for more details about how to make updates to slices.
+// value: Overrides the name used for the temporary stack resource. Default
+// value is the name of the 'Stack' op (which is guaranteed unique).
+// If not specified, defaults to ""
+func StackV2StackName(value string) StackV2Attr {
+	return func(m optionalAttr) {
+		m["stack_name"] = value
+	}
+}
+
+// A stack that produces elements in first-in last-out order.
 //
 // Arguments:
-//	input: A Tensor.
-//	indices: A Tensor. Must be one of the following types: `int32`, `int64`.
-// A tensor of indices into `input`.
-//	updates: A Tensor. Must have the same type as ref. A tensor of updated values
-// to add to `input`.
+//	max_size: The maximum size of the stack if non-negative. If negative, the stack
+// size is unlimited.
+//	elem_type: The type of the elements on the stack.
 //
-// Returns A `Tensor` with the same shape as `input`, containing values of `input`
-// updated with `updates`.
-func ScatterNdNonAliasingAdd(scope *Scope, input tf.Output, indices tf.Output, updates tf.Output) (output tf.Output) {
+// Returns The handle to the stack.
+func StackV2(scope *Scope, max_size tf.Output, elem_type tf.DataType, optional ...StackV2Attr) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"elem_type": elem_type}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "ScatterNdNonAliasingAdd",
+		Type: "StackV2",
 		Input: []tf.Input{
-			input, indices, updates,
+			max_size,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// FractionalMaxPoolAttr is an optional argument to FractionalMaxPool.
-type FractionalMaxPoolAttr func(optionalAttr)
+// FusedBatchNormGradV2Attr is an optional argument to FusedBatchNormGradV2.
+type FusedBatchNormGradV2Attr func(optionalAttr)
 
-// FractionalMaxPoolPseudoRandom sets the optional pseudo_random attribute to value.
+// FusedBatchNormGradV2Epsilon sets the optional epsilon attribute to value.
 //
-// value: When set to True, generates the pooling sequence in a
-// pseudorandom fashion, otherwise, in a random fashion. Check paper [Benjamin
-// Graham, Fractional Max-Pooling](http://arxiv.org/abs/1412.6071) for
-// difference between pseudorandom and random.
-// If not specified, defaults to false
-func FractionalMaxPoolPseudoRandom(value bool) FractionalMaxPoolAttr {
+// value: A small float number added to the variance of x.
+// If not specified, defaults to 0.0001
+func FusedBatchNormGradV2Epsilon(value float32) FusedBatchNormGradV2Attr {
 	return func(m optionalAttr) {
-		m["pseudo_random"] = value
+		m["epsilon"] = value
 	}
 }
 
-// FractionalMaxPoolOverlapping sets the optional overlapping attribute to value.
-//
-// value: When set to True, it means when pooling, the values at the boundary
-// of adjacent pooling cells are used by both cells. For example:
-//
-// `index  0  1  2  3  4`
-//
-// `value  20 5  16 3  7`
-//
-// If the pooling sequence is [0, 2, 4], then 16, at index 2 will be used twice.
-// The result would be [20, 16] for fractional max pooling.
-// If not specified, defaults to false
-func FractionalMaxPoolOverlapping(value bool) FractionalMaxPoolAttr {
-	return func(m optionalAttr) {
-		m["overlapping"] = value
-	}
-}
-
-// FractionalMaxPoolDeterministic sets the optional deterministic attribute to value.
-//
-// value: When set to True, a fixed pooling region will be used when
-// iterating over a FractionalMaxPool node in the computation graph. Mainly used
-// in unit test to make FractionalMaxPool deterministic.
-// If not specified, defaults to false
-func FractionalMaxPoolDeterministic(value bool) FractionalMaxPoolAttr {
-	return func(m optionalAttr) {
-		m["deterministic"] = value
-	}
-}
-
-// FractionalMaxPoolSeed sets the optional seed attribute to value.
+// FusedBatchNormGradV2DataFormat sets the optional data_format attribute to value.
 //
-// value: If either seed or seed2 are set to be non-zero, the random number
-// generator is seeded by the given seed.  Otherwise, it is seeded by a
-// random seed.
-// If not specified, defaults to 0
-func FractionalMaxPoolSeed(value int64) FractionalMaxPoolAttr {
+// value: The data format for y_backprop, x, x_backprop.
+// Either "NHWC" (default) or "NCHW".
+// If not specified, defaults to "NHWC"
+func FusedBatchNormGradV2DataFormat(value string) FusedBatchNormGradV2Attr {
 	return func(m optionalAttr) {
-		m["seed"] = value
+		m["data_format"] = value
 	}
 }
 
-// FractionalMaxPoolSeed2 sets the optional seed2 attribute to value.
+// FusedBatchNormGradV2IsTraining sets the optional is_training attribute to value.
 //
-// value: An second seed to avoid seed collision.
-// If not specified, defaults to 0
-func FractionalMaxPoolSeed2(value int64) FractionalMaxPoolAttr {
+// value: A bool value to indicate the operation is for training (default)
+// or inference.
+// If not specified, defaults to true
+func FusedBatchNormGradV2IsTraining(value bool) FusedBatchNormGradV2Attr {
 	return func(m optionalAttr) {
-		m["seed2"] = value
+		m["is_training"] = value
 	}
 }
 
-// Performs fractional max pooling on the input.
-//
-// Fractional max pooling is slightly different than regular max pooling.  In
-// regular max pooling, you downsize an input set by taking the maximum value of
-// smaller N x N subsections of the set (often 2x2), and try to reduce the set by
-// a factor of N, where N is an integer.  Fractional max pooling, as you might
-// expect from the word "fractional", means that the overall reduction ratio N
-// does not have to be an integer.
-//
-// The sizes of the pooling regions are generated randomly but are fairly uniform.
-// For example, let's look at the height dimension, and the constraints on the
-// list of rows that will be pool boundaries.
-//
-// First we define the following:
-//
-// 1.  input_row_length : the number of rows from the input set
-// 2.  output_row_length : which will be smaller than the input
-// 3.  alpha = input_row_length / output_row_length : our reduction ratio
-// 4.  K = floor(alpha)
-// 5.  row_pooling_sequence : this is the result list of pool boundary rows
-//
-// Then, row_pooling_sequence should satisfy:
-//
-// 1.  a[0] = 0 : the first value of the sequence is 0
-// 2.  a[end] = input_row_length : the last value of the sequence is the size
-// 3.  K <= (a[i+1] - a[i]) <= K+1 : all intervals are K or K+1 size
-// 4.  length(row_pooling_sequence) = output_row_length+1
+// Gradient for batch normalization.
 //
-// For more details on fractional max pooling, see this paper:
-// [Benjamin Graham, Fractional Max-Pooling](http://arxiv.org/abs/1412.6071)
+// Note that the size of 4D Tensors are defined by either "NHWC" or "NCHW".
+// The size of 1D Tensors matches the dimension C of the 4D Tensors.
 //
 // Arguments:
-//	value: 4-D with shape `[batch, height, width, channels]`.
-//	pooling_ratio: Pooling ratio for each dimension of `value`, currently only
-// supports row and col dimension and should be >= 1.0. For example, a valid
-// pooling ratio looks like [1.0, 1.44, 1.73, 1.0]. The first and last elements
-// must be 1.0 because we don't allow pooling on batch and channels
-// dimensions. 1.44 and 1.73 are pooling ratio on height and width dimensions
-// respectively.
+//	y_backprop: A 4D Tensor for the gradient with respect to y.
+//	x: A 4D Tensor for input data.
+//	scale: A 1D Tensor for scaling factor, to scale the normalized x.
+//	reserve_space_1: When is_training is True, a 1D Tensor for the computed batch
+// mean to be reused in gradient computation. When is_training is
+// False, a 1D Tensor for the population mean to be reused in both
+// 1st and 2nd order gradient computation.
+//	reserve_space_2: When is_training is True, a 1D Tensor for the computed batch
+// variance (inverted variance in the cuDNN case) to be reused in
+// gradient computation. When is_training is False, a 1D Tensor
+// for the population variance to be reused in both 1st and 2nd
+// order gradient computation.
 //
-// Returns output tensor after fractional max pooling.row pooling sequence, needed to calculate gradient.column pooling sequence, needed to calculate gradient.
-func FractionalMaxPool(scope *Scope, value tf.Output, pooling_ratio []float32, optional ...FractionalMaxPoolAttr) (output tf.Output, row_pooling_sequence tf.Output, col_pooling_sequence tf.Output) {
+// Returns A 4D Tensor for the gradient with respect to x.A 1D Tensor for the gradient with respect to scale.A 1D Tensor for the gradient with respect to offset.Unused placeholder to match the mean input in FusedBatchNorm.Unused placeholder to match the variance input
+// in FusedBatchNorm.
+func FusedBatchNormGradV2(scope *Scope, y_backprop tf.Output, x tf.Output, scale tf.Output, reserve_space_1 tf.Output, reserve_space_2 tf.Output, optional ...FusedBatchNormGradV2Attr) (x_backprop tf.Output, scale_backprop tf.Output, offset_backprop tf.Output, reserve_space_3 tf.Output, reserve_space_4 tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"pooling_ratio": pooling_ratio}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "FractionalMaxPool",
+		Type: "FusedBatchNormGradV2",
 		Input: []tf.Input{
-			value,
+			y_backprop, x, scale, reserve_space_1, reserve_space_2,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
-}
-
-// Deprecated. Use TensorArraySizeV3
-//
-// DEPRECATED at GraphDef version 26: Use TensorArraySizeV3
-func TensorArraySizeV2(scope *Scope, handle tf.Output, flow_in tf.Output) (size tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "TensorArraySizeV2",
-		Input: []tf.Input{
-			handle, flow_in,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Conv2DAttr is an optional argument to Conv2D.
-type Conv2DAttr func(optionalAttr)
-
-// Conv2DUseCudnnOnGpu sets the optional use_cudnn_on_gpu attribute to value.
-// If not specified, defaults to true
-func Conv2DUseCudnnOnGpu(value bool) Conv2DAttr {
-	return func(m optionalAttr) {
-		m["use_cudnn_on_gpu"] = value
-	}
+	return op.Output(0), op.Output(1), op.Output(2), op.Output(3), op.Output(4)
 }
 
-// Conv2DDataFormat sets the optional data_format attribute to value.
-//
-// value: Specify the data format of the input and output data. With the
-// default format "NHWC", the data is stored in the order of:
-//     [batch, height, width, channels].
-// Alternatively, the format could be "NCHW", the data storage order of:
-//     [batch, channels, height, width].
-// If not specified, defaults to "NHWC"
-func Conv2DDataFormat(value string) Conv2DAttr {
-	return func(m optionalAttr) {
-		m["data_format"] = value
-	}
-}
+// DecodeCompressedAttr is an optional argument to DecodeCompressed.
+type DecodeCompressedAttr func(optionalAttr)
 
-// Conv2DDilations sets the optional dilations attribute to value.
+// DecodeCompressedCompressionType sets the optional compression_type attribute to value.
 //
-// value: 1-D tensor of length 4.  The dilation factor for each dimension of
-// `input`. If set to k > 1, there will be k-1 skipped cells between each
-// filter element on that dimension. The dimension order is determined by the
-// value of `data_format`, see above for details. Dilations in the batch and
-// depth dimensions must be 1.
-// If not specified, defaults to <i:1 i:1 i:1 i:1 >
-func Conv2DDilations(value []int64) Conv2DAttr {
+// value: A scalar containing either (i) the empty string (no
+// compression), (ii) "ZLIB", or (iii) "GZIP".
+// If not specified, defaults to ""
+func DecodeCompressedCompressionType(value string) DecodeCompressedAttr {
 	return func(m optionalAttr) {
-		m["dilations"] = value
+		m["compression_type"] = value
 	}
 }
 
-// Computes a 2-D convolution given 4-D `input` and `filter` tensors.
-//
-// Given an input tensor of shape `[batch, in_height, in_width, in_channels]`
-// and a filter / kernel tensor of shape
-// `[filter_height, filter_width, in_channels, out_channels]`, this op
-// performs the following:
-//
-// 1. Flattens the filter to a 2-D matrix with shape
-//    `[filter_height * filter_width * in_channels, output_channels]`.
-// 2. Extracts image patches from the input tensor to form a *virtual*
-//    tensor of shape `[batch, out_height, out_width,
-//    filter_height * filter_width * in_channels]`.
-// 3. For each patch, right-multiplies the filter matrix and the image patch
-//    vector.
-//
-// In detail, with the default NHWC format,
+// Decompress strings.
 //
-//     output[b, i, j, k] =
-//         sum_{di, dj, q} input[b, strides[1] * i + di, strides[2] * j + dj, q] *
-//                         filter[di, dj, q, k]
+// This op decompresses each element of the `bytes` input `Tensor`, which
+// is assumed to be compressed using the given `compression_type`.
 //
-// Must have `strides[0] = strides[3] = 1`.  For the most common case of the same
-// horizontal and vertices strides, `strides = [1, stride, stride, 1]`.
+// The `output` is a string `Tensor` of the same shape as `bytes`,
+// each element containing the decompressed data from the corresponding
+// element in `bytes`.
 //
 // Arguments:
-//	input: A 4-D tensor. The dimension order is interpreted according to the value
-// of `data_format`, see below for details.
-//	filter: A 4-D tensor of shape
-// `[filter_height, filter_width, in_channels, out_channels]`
-//	strides: 1-D tensor of length 4.  The stride of the sliding window for each
-// dimension of `input`. The dimension order is determined by the value of
-// `data_format`, see below for details.
-//	padding: The type of padding algorithm to use.
+//	bytes: A Tensor of string which is compressed.
 //
-// Returns A 4-D tensor. The dimension order is determined by the value of
-// `data_format`, see below for details.
-func Conv2D(scope *Scope, input tf.Output, filter tf.Output, strides []int64, padding string, optional ...Conv2DAttr) (output tf.Output) {
+// Returns A Tensor with the same shape as input `bytes`, uncompressed
+// from bytes.
+func DecodeCompressed(scope *Scope, bytes tf.Output, optional ...DecodeCompressedAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"strides": strides, "padding": padding}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "Conv2D",
+		Type: "DecodeCompressed",
 		Input: []tf.Input{
-			input, filter,
+			bytes,
 		},
 		Attrs: attrs,
 	}
@@ -28128,67 +27384,98 @@ func Conv2D(scope *Scope, input tf.Output, filter tf.Output, strides []int64, pa
 	return op.Output(0)
 }
 
-// StageAttr is an optional argument to Stage.
-type StageAttr func(optionalAttr)
+// CudnnRNNAttr is an optional argument to CudnnRNN.
+type CudnnRNNAttr func(optionalAttr)
 
-// StageCapacity sets the optional capacity attribute to value.
-//
-// value: Maximum number of elements in the Staging Area. If > 0, inserts
-// on the container will block when the capacity is reached.
-// If not specified, defaults to 0
-//
-// REQUIRES: value >= 0
-func StageCapacity(value int64) StageAttr {
+// CudnnRNNRnnMode sets the optional rnn_mode attribute to value.
+// If not specified, defaults to "lstm"
+func CudnnRNNRnnMode(value string) CudnnRNNAttr {
 	return func(m optionalAttr) {
-		m["capacity"] = value
+		m["rnn_mode"] = value
 	}
 }
 
-// StageMemoryLimit sets the optional memory_limit attribute to value.
-//
-// value: The maximum number of bytes allowed for Tensors in the Staging Area.
-// If > 0, inserts will block until sufficient space is available.
-// If not specified, defaults to 0
-//
-// REQUIRES: value >= 0
-func StageMemoryLimit(value int64) StageAttr {
+// CudnnRNNInputMode sets the optional input_mode attribute to value.
+// If not specified, defaults to "linear_input"
+func CudnnRNNInputMode(value string) CudnnRNNAttr {
 	return func(m optionalAttr) {
-		m["memory_limit"] = value
+		m["input_mode"] = value
 	}
 }
 
-// StageContainer sets the optional container attribute to value.
-//
-// value: If non-empty, this queue is placed in the given container. Otherwise,
-// a default container is used.
-// If not specified, defaults to ""
-func StageContainer(value string) StageAttr {
+// CudnnRNNDirection sets the optional direction attribute to value.
+// If not specified, defaults to "unidirectional"
+func CudnnRNNDirection(value string) CudnnRNNAttr {
 	return func(m optionalAttr) {
-		m["container"] = value
+		m["direction"] = value
 	}
 }
 
-// StageSharedName sets the optional shared_name attribute to value.
-//
-// value: It is necessary to match this name to the matching Unstage Op.
-// If not specified, defaults to ""
-func StageSharedName(value string) StageAttr {
+// CudnnRNNDropout sets the optional dropout attribute to value.
+// If not specified, defaults to 0
+func CudnnRNNDropout(value float32) CudnnRNNAttr {
 	return func(m optionalAttr) {
-		m["shared_name"] = value
+		m["dropout"] = value
 	}
 }
 
-// Stage values similar to a lightweight Enqueue.
-//
-// The basic functionality of this Op is similar to a queue with many
-// fewer capabilities and options.  This Op is optimized for performance.
+// CudnnRNNSeed sets the optional seed attribute to value.
+// If not specified, defaults to 0
+func CudnnRNNSeed(value int64) CudnnRNNAttr {
+	return func(m optionalAttr) {
+		m["seed"] = value
+	}
+}
+
+// CudnnRNNSeed2 sets the optional seed2 attribute to value.
+// If not specified, defaults to 0
+func CudnnRNNSeed2(value int64) CudnnRNNAttr {
+	return func(m optionalAttr) {
+		m["seed2"] = value
+	}
+}
+
+// CudnnRNNIsTraining sets the optional is_training attribute to value.
+// If not specified, defaults to true
+func CudnnRNNIsTraining(value bool) CudnnRNNAttr {
+	return func(m optionalAttr) {
+		m["is_training"] = value
+	}
+}
+
+// A RNN backed by cuDNN.
 //
-// Arguments:
-//	values: a list of tensors
-// dtypes A list of data types that inserted values should adhere to.
+// Computes the RNN from the input and initial states, with respect to the params
+// buffer.
 //
-// Returns the created operation.
-func Stage(scope *Scope, values []tf.Output, optional ...StageAttr) (o *tf.Operation) {
+// rnn_mode: Indicates the type of the RNN model.
+// input_mode: Indicate whether there is a linear projection between the input and
+//   The actual computation before the first layer. 'skip_input' is only allowed
+//   when input_size == num_units; 'auto_select' implies 'skip_input' when
+//   input_size == num_units; otherwise, it implies 'linear_input'.
+// direction: Indicates whether a bidirectional model will be used.
+//   dir = (direction == bidirectional) ? 2 : 1
+// dropout: dropout probability. When set to 0., dropout is disabled.
+// seed: the 1st part of a seed to initialize dropout.
+// seed2: the 2nd part of a seed to initialize dropout.
+// input: a 3-D tensor with the shape of [seq_length, batch_size, input_size].
+// input_h: a 3-D tensor with the shape of [num_layer * dir, batch_size,
+//     num_units].
+// input_c: For LSTM, a 3-D tensor with the shape of
+//     [num_layer * dir, batch, num_units]. For other models, it is ignored.
+// params: a 1-D tensor that contains the weights and biases in an opaque layout.
+//     The size must be created through CudnnRNNParamsSize, and initialized
+//     separately. Note that they might not be compatible across different
+//     generations. So it is a good idea to save and restore
+// output: a 3-D tensor with the shape of [seq_length, batch_size,
+//     dir * num_units].
+// output_h: the same shape has input_h.
+// output_c: the same shape as input_c for LSTM. An empty tensor for other models.
+// is_training: Indicates whether this operation is used for inferenece or
+//   training.
+// reserve_space: an opaque tensor that can be used in backprop calculation. It
+//   is only produced if is_training is false.
+func CudnnRNN(scope *Scope, input tf.Output, input_h tf.Output, input_c tf.Output, params tf.Output, optional ...CudnnRNNAttr) (output tf.Output, output_h tf.Output, output_c tf.Output, reserve_space tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -28197,283 +27484,389 @@ func Stage(scope *Scope, values []tf.Output, optional ...StageAttr) (o *tf.Opera
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "Stage",
+		Type: "CudnnRNN",
 		Input: []tf.Input{
-			tf.OutputList(values),
+			input, input_h, input_c, params,
 		},
 		Attrs: attrs,
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0), op.Output(1), op.Output(2), op.Output(3)
 }
 
-// StagePeekAttr is an optional argument to StagePeek.
-type StagePeekAttr func(optionalAttr)
-
-// StagePeekCapacity sets the optional capacity attribute to value.
-// If not specified, defaults to 0
+// Compare values of `input` to `threshold` and pack resulting bits into a `uint8`.
 //
-// REQUIRES: value >= 0
-func StagePeekCapacity(value int64) StagePeekAttr {
-	return func(m optionalAttr) {
-		m["capacity"] = value
+// Each comparison returns a boolean `true` (if `input_value > threshold`)
+// or and `false` otherwise.
+//
+// This operation is useful for Locality-Sensitive-Hashing (LSH) and other
+// algorithms that use hashing approximations of cosine and `L2` distances;
+// codes can be generated from an input via:
+//
+// ```python
+// codebook_size = 50
+// codebook_bits = codebook_size * 32
+// codebook = tf.get_variable('codebook', [x.shape[-1].value, codebook_bits],
+//                            dtype=x.dtype,
+//                            initializer=tf.orthogonal_initializer())
+// codes = compare_and_threshold(tf.matmul(x, codebook), threshold=0.)
+// codes = tf.bitcast(codes, tf.int32)  # go from uint8 to int32
+// # now codes has shape x.shape[:-1] + [codebook_size]
+// ```
+//
+// **NOTE**: Currently, the innermost dimension of the tensor must be divisible
+// by 8.
+//
+// Given an `input` shaped `[s0, s1, ..., s_n]`, the output is
+// a `uint8` tensor shaped `[s0, s1, ..., s_n / 8]`.
+//
+// Arguments:
+//	input: Values to compare against `threshold` and bitpack.
+//	threshold: Threshold to compare against.
+//
+// Returns The bitpacked comparisons.
+func CompareAndBitpack(scope *Scope, input tf.Output, threshold tf.Output) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "CompareAndBitpack",
+		Input: []tf.Input{
+			input, threshold,
+		},
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// StagePeekMemoryLimit sets the optional memory_limit attribute to value.
-// If not specified, defaults to 0
+// Push an element onto the tensor_array.
 //
-// REQUIRES: value >= 0
-func StagePeekMemoryLimit(value int64) StagePeekAttr {
-	return func(m optionalAttr) {
-		m["memory_limit"] = value
+// Arguments:
+//	handle: The handle to a TensorArray.
+//	index: The position to write to inside the TensorArray.
+//	value: The tensor to write to the TensorArray.
+//	flow_in: A float scalar that enforces proper chaining of operations.
+//
+// Returns A float scalar that enforces proper chaining of operations.
+func TensorArrayWriteV3(scope *Scope, handle tf.Output, index tf.Output, value tf.Output, flow_in tf.Output) (flow_out tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "TensorArrayWriteV3",
+		Input: []tf.Input{
+			handle, index, value, flow_in,
+		},
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// StagePeekContainer sets the optional container attribute to value.
-// If not specified, defaults to ""
-func StagePeekContainer(value string) StagePeekAttr {
-	return func(m optionalAttr) {
-		m["container"] = value
+// Scatter the data from the input value into specific TensorArray elements.
+//
+// `indices` must be a vector, its length must match the first dim of `value`.
+//
+// Arguments:
+//	handle: The handle to a TensorArray.
+//	indices: The locations at which to write the tensor elements.
+//	value: The concatenated tensor to write to the TensorArray.
+//	flow_in: A float scalar that enforces proper chaining of operations.
+//
+// Returns A float scalar that enforces proper chaining of operations.
+func TensorArrayScatterV3(scope *Scope, handle tf.Output, indices tf.Output, value tf.Output, flow_in tf.Output) (flow_out tf.Output) {
+	if scope.Err() != nil {
+		return
 	}
+	opspec := tf.OpSpec{
+		Type: "TensorArrayScatterV3",
+		Input: []tf.Input{
+			handle, indices, value, flow_in,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// StagePeekSharedName sets the optional shared_name attribute to value.
-// If not specified, defaults to ""
-func StagePeekSharedName(value string) StagePeekAttr {
+// EmptyAttr is an optional argument to Empty.
+type EmptyAttr func(optionalAttr)
+
+// EmptyInit sets the optional init attribute to value.
+//
+// value: If True, initialize the returned tensor with the default value of dtype.  Otherwise, the implementation is free not to initializethe tensor's content.
+// If not specified, defaults to false
+func EmptyInit(value bool) EmptyAttr {
 	return func(m optionalAttr) {
-		m["shared_name"] = value
+		m["init"] = value
 	}
 }
 
-// Op peeks at the values at the specified index.  If the
+// Creates a tensor with the given shape.
 //
-// underlying container does not contain sufficient elements
-// this op will block until it does.   This Op is optimized for
-// performance.
-func StagePeek(scope *Scope, index tf.Output, dtypes []tf.DataType, optional ...StagePeekAttr) (values []tf.Output) {
+// This operation creates a tensor of `shape` and `dtype`.
+//
+// Arguments:
+//	shape: 1-D. Represents the shape of the output tensor.
+//
+//
+// Returns A `Tensor` of type `T`.
+func Empty(scope *Scope, shape tf.Output, dtype tf.DataType, optional ...EmptyAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtypes": dtypes}
+	attrs := map[string]interface{}{"dtype": dtype}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "StagePeek",
+		Type: "Empty",
 		Input: []tf.Input{
-			index,
+			shape,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	if scope.Err() != nil {
-		return
-	}
-	var idx int
-	var err error
-	if values, idx, err = makeOutputList(op, idx, "values"); err != nil {
-		scope.UpdateErr("StagePeek", err)
-		return
-	}
-	return values
+	return op.Output(0)
 }
 
-// MapStageAttr is an optional argument to MapStage.
-type MapStageAttr func(optionalAttr)
+// TensorArrayConcatV3Attr is an optional argument to TensorArrayConcatV3.
+type TensorArrayConcatV3Attr func(optionalAttr)
 
-// MapStageCapacity sets the optional capacity attribute to value.
-//
-// value: Maximum number of elements in the Staging Area. If > 0, inserts
-// on the container will block when the capacity is reached.
-// If not specified, defaults to 0
+// TensorArrayConcatV3ElementShapeExcept0 sets the optional element_shape_except0 attribute to value.
 //
-// REQUIRES: value >= 0
-func MapStageCapacity(value int64) MapStageAttr {
+// value: The expected shape of an element, if known,
+// excluding the first dimension. Used to validate the shapes of
+// TensorArray elements. If this shape is not fully specified, concatenating
+// zero-size TensorArrays is an error.
+// If not specified, defaults to <unknown_rank:true >
+func TensorArrayConcatV3ElementShapeExcept0(value tf.Shape) TensorArrayConcatV3Attr {
 	return func(m optionalAttr) {
-		m["capacity"] = value
+		m["element_shape_except0"] = value
 	}
 }
 
-// MapStageMemoryLimit sets the optional memory_limit attribute to value.
-// If not specified, defaults to 0
+// Concat the elements from the TensorArray into value `value`.
 //
-// REQUIRES: value >= 0
-func MapStageMemoryLimit(value int64) MapStageAttr {
-	return func(m optionalAttr) {
-		m["memory_limit"] = value
-	}
-}
-
-// MapStageContainer sets the optional container attribute to value.
+// Takes `T` elements of shapes
 //
-// value: If non-empty, this queue is placed in the given container. Otherwise,
-// a default container is used.
-// If not specified, defaults to ""
-func MapStageContainer(value string) MapStageAttr {
-	return func(m optionalAttr) {
-		m["container"] = value
-	}
-}
-
-// MapStageSharedName sets the optional shared_name attribute to value.
+//   ```
+//   (n0 x d0 x d1 x ...), (n1 x d0 x d1 x ...), ..., (n(T-1) x d0 x d1 x ...)
+//   ```
 //
-// value: It is necessary to match this name to the matching Unstage Op.
-// If not specified, defaults to ""
-func MapStageSharedName(value string) MapStageAttr {
-	return func(m optionalAttr) {
-		m["shared_name"] = value
-	}
-}
-
-// Stage (key, values) in the underlying container which behaves like a hashtable.
+// and concatenates them into a Tensor of shape:
 //
-// Arguments:
-//	key: int64
+//   ```(n0 + n1 + ... + n(T-1) x d0 x d1 x ...)```
 //
-//	values: a list of tensors
-// dtypes A list of data types that inserted values should adhere to.
+// All elements must have the same shape (excepting the first dimension).
 //
+// Arguments:
+//	handle: The handle to a TensorArray.
+//	flow_in: A float scalar that enforces proper chaining of operations.
+//	dtype: The type of the elem that is returned.
 //
-// Returns the created operation.
-func MapStage(scope *Scope, key tf.Output, indices tf.Output, values []tf.Output, dtypes []tf.DataType, optional ...MapStageAttr) (o *tf.Operation) {
+// Returns All of the elements in the TensorArray, concatenated along the first
+// axis.A vector of the row sizes of the original T elements in the
+// value output.  In the example above, this would be the values:
+// `(n1, n2, ..., n(T-1))`.
+func TensorArrayConcatV3(scope *Scope, handle tf.Output, flow_in tf.Output, dtype tf.DataType, optional ...TensorArrayConcatV3Attr) (value tf.Output, lengths tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtypes": dtypes}
+	attrs := map[string]interface{}{"dtype": dtype}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "MapStage",
+		Type: "TensorArrayConcatV3",
 		Input: []tf.Input{
-			key, indices, tf.OutputList(values),
+			handle, flow_in,
 		},
 		Attrs: attrs,
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0), op.Output(1)
 }
 
-// MapUnstageAttr is an optional argument to MapUnstage.
-type MapUnstageAttr func(optionalAttr)
-
-// MapUnstageCapacity sets the optional capacity attribute to value.
-// If not specified, defaults to 0
+// Split the data from the input value into TensorArray elements.
 //
-// REQUIRES: value >= 0
-func MapUnstageCapacity(value int64) MapUnstageAttr {
-	return func(m optionalAttr) {
-		m["capacity"] = value
-	}
-}
-
-// MapUnstageMemoryLimit sets the optional memory_limit attribute to value.
-// If not specified, defaults to 0
+// Assuming that `lengths` takes on values
 //
-// REQUIRES: value >= 0
-func MapUnstageMemoryLimit(value int64) MapUnstageAttr {
-	return func(m optionalAttr) {
-		m["memory_limit"] = value
+//   ```(n0, n1, ..., n(T-1))```
+//
+// and that `value` has shape
+//
+//   ```(n0 + n1 + ... + n(T-1) x d0 x d1 x ...)```,
+//
+// this splits values into a TensorArray with T tensors.
+//
+// TensorArray index t will be the subtensor of values with starting position
+//
+//   ```(n0 + n1 + ... + n(t-1), 0, 0, ...)```
+//
+// and having size
+//
+//   ```nt x d0 x d1 x ...```
+//
+// Arguments:
+//	handle: The handle to a TensorArray.
+//	value: The concatenated tensor to write to the TensorArray.
+//	lengths: The vector of lengths, how to split the rows of value into the
+// TensorArray.
+//	flow_in: A float scalar that enforces proper chaining of operations.
+//
+// Returns A float scalar that enforces proper chaining of operations.
+func TensorArraySplitV3(scope *Scope, handle tf.Output, value tf.Output, lengths tf.Output, flow_in tf.Output) (flow_out tf.Output) {
+	if scope.Err() != nil {
+		return
 	}
-}
-
-// MapUnstageContainer sets the optional container attribute to value.
-// If not specified, defaults to ""
-func MapUnstageContainer(value string) MapUnstageAttr {
-	return func(m optionalAttr) {
-		m["container"] = value
+	opspec := tf.OpSpec{
+		Type: "TensorArraySplitV3",
+		Input: []tf.Input{
+			handle, value, lengths, flow_in,
+		},
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// MapUnstageSharedName sets the optional shared_name attribute to value.
-// If not specified, defaults to ""
-func MapUnstageSharedName(value string) MapUnstageAttr {
+// SerializeSparseAttr is an optional argument to SerializeSparse.
+type SerializeSparseAttr func(optionalAttr)
+
+// SerializeSparseOutType sets the optional out_type attribute to value.
+//
+// value: The `dtype` to use for serialization; the supported types are `string`
+// (default) and `variant`.
+// If not specified, defaults to DT_STRING
+func SerializeSparseOutType(value tf.DataType) SerializeSparseAttr {
 	return func(m optionalAttr) {
-		m["shared_name"] = value
+		m["out_type"] = value
 	}
 }
 
-// Op removes and returns the values associated with the key
+// Serialize a `SparseTensor` into a `[3]` `Tensor` object.
 //
-// from the underlying container.   If the underlying container
-// does not contain this key, the op will block until it does.
-func MapUnstage(scope *Scope, key tf.Output, indices tf.Output, dtypes []tf.DataType, optional ...MapUnstageAttr) (values []tf.Output) {
+// Arguments:
+//	sparse_indices: 2-D.  The `indices` of the `SparseTensor`.
+//	sparse_values: 1-D.  The `values` of the `SparseTensor`.
+//	sparse_shape: 1-D.  The `shape` of the `SparseTensor`.
+func SerializeSparse(scope *Scope, sparse_indices tf.Output, sparse_values tf.Output, sparse_shape tf.Output, optional ...SerializeSparseAttr) (serialized_sparse tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtypes": dtypes}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "MapUnstage",
+		Type: "SerializeSparse",
 		Input: []tf.Input{
-			key, indices,
+			sparse_indices, sparse_values, sparse_shape,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	if scope.Err() != nil {
-		return
-	}
-	var idx int
-	var err error
-	if values, idx, err = makeOutputList(op, idx, "values"); err != nil {
-		scope.UpdateErr("MapUnstage", err)
-		return
-	}
-	return values
+	return op.Output(0)
 }
 
-// MapIncompleteSizeAttr is an optional argument to MapIncompleteSize.
-type MapIncompleteSizeAttr func(optionalAttr)
+// RandomShuffleQueueV2Attr is an optional argument to RandomShuffleQueueV2.
+type RandomShuffleQueueV2Attr func(optionalAttr)
 
-// MapIncompleteSizeCapacity sets the optional capacity attribute to value.
-// If not specified, defaults to 0
+// RandomShuffleQueueV2Shapes sets the optional shapes attribute to value.
 //
-// REQUIRES: value >= 0
-func MapIncompleteSizeCapacity(value int64) MapIncompleteSizeAttr {
+// value: The shape of each component in a value. The length of this attr must
+// be either 0 or the same as the length of component_types. If the length of
+// this attr is 0, the shapes of queue elements are not constrained, and
+// only one element may be dequeued at a time.
+// If not specified, defaults to <>
+//
+// REQUIRES: len(value) >= 0
+func RandomShuffleQueueV2Shapes(value []tf.Shape) RandomShuffleQueueV2Attr {
+	return func(m optionalAttr) {
+		m["shapes"] = value
+	}
+}
+
+// RandomShuffleQueueV2Capacity sets the optional capacity attribute to value.
+//
+// value: The upper bound on the number of elements in this queue.
+// Negative numbers mean no limit.
+// If not specified, defaults to -1
+func RandomShuffleQueueV2Capacity(value int64) RandomShuffleQueueV2Attr {
 	return func(m optionalAttr) {
 		m["capacity"] = value
 	}
 }
 
-// MapIncompleteSizeMemoryLimit sets the optional memory_limit attribute to value.
+// RandomShuffleQueueV2MinAfterDequeue sets the optional min_after_dequeue attribute to value.
+//
+// value: Dequeue will block unless there would be this
+// many elements after the dequeue or the queue is closed. This
+// ensures a minimum level of mixing of elements.
 // If not specified, defaults to 0
+func RandomShuffleQueueV2MinAfterDequeue(value int64) RandomShuffleQueueV2Attr {
+	return func(m optionalAttr) {
+		m["min_after_dequeue"] = value
+	}
+}
+
+// RandomShuffleQueueV2Seed sets the optional seed attribute to value.
 //
-// REQUIRES: value >= 0
-func MapIncompleteSizeMemoryLimit(value int64) MapIncompleteSizeAttr {
+// value: If either seed or seed2 is set to be non-zero, the random number
+// generator is seeded by the given seed.  Otherwise, a random seed is used.
+// If not specified, defaults to 0
+func RandomShuffleQueueV2Seed(value int64) RandomShuffleQueueV2Attr {
 	return func(m optionalAttr) {
-		m["memory_limit"] = value
+		m["seed"] = value
 	}
 }
 
-// MapIncompleteSizeContainer sets the optional container attribute to value.
+// RandomShuffleQueueV2Seed2 sets the optional seed2 attribute to value.
+//
+// value: A second seed to avoid seed collision.
+// If not specified, defaults to 0
+func RandomShuffleQueueV2Seed2(value int64) RandomShuffleQueueV2Attr {
+	return func(m optionalAttr) {
+		m["seed2"] = value
+	}
+}
+
+// RandomShuffleQueueV2Container sets the optional container attribute to value.
+//
+// value: If non-empty, this queue is placed in the given container.
+// Otherwise, a default container is used.
 // If not specified, defaults to ""
-func MapIncompleteSizeContainer(value string) MapIncompleteSizeAttr {
+func RandomShuffleQueueV2Container(value string) RandomShuffleQueueV2Attr {
 	return func(m optionalAttr) {
 		m["container"] = value
 	}
 }
 
-// MapIncompleteSizeSharedName sets the optional shared_name attribute to value.
+// RandomShuffleQueueV2SharedName sets the optional shared_name attribute to value.
+//
+// value: If non-empty, this queue will be shared under the given name
+// across multiple sessions.
 // If not specified, defaults to ""
-func MapIncompleteSizeSharedName(value string) MapIncompleteSizeAttr {
+func RandomShuffleQueueV2SharedName(value string) RandomShuffleQueueV2Attr {
 	return func(m optionalAttr) {
 		m["shared_name"] = value
 	}
 }
 
-// Op returns the number of incomplete elements in the underlying container.
-func MapIncompleteSize(scope *Scope, dtypes []tf.DataType, optional ...MapIncompleteSizeAttr) (size tf.Output) {
+// A queue that randomizes the order of elements.
+//
+// Arguments:
+//	component_types: The type of each component in a value.
+//
+// Returns The handle to the queue.
+func RandomShuffleQueueV2(scope *Scope, component_types []tf.DataType, optional ...RandomShuffleQueueV2Attr) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtypes": dtypes}
+	attrs := map[string]interface{}{"component_types": component_types}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "MapIncompleteSize",
+		Type: "RandomShuffleQueueV2",
 
 		Attrs: attrs,
 	}
@@ -28481,149 +27874,211 @@ func MapIncompleteSize(scope *Scope, dtypes []tf.DataType, optional ...MapIncomp
 	return op.Output(0)
 }
 
-// OrderedMapUnstageAttr is an optional argument to OrderedMapUnstage.
-type OrderedMapUnstageAttr func(optionalAttr)
-
-// OrderedMapUnstageCapacity sets the optional capacity attribute to value.
-// If not specified, defaults to 0
+// Draw bounding boxes on a batch of images.
 //
-// REQUIRES: value >= 0
-func OrderedMapUnstageCapacity(value int64) OrderedMapUnstageAttr {
-	return func(m optionalAttr) {
-		m["capacity"] = value
-	}
-}
-
-// OrderedMapUnstageMemoryLimit sets the optional memory_limit attribute to value.
-// If not specified, defaults to 0
+// Outputs a copy of `images` but draws on top of the pixels zero or more bounding
+// boxes specified by the locations in `boxes`. The coordinates of the each
+// bounding box in `boxes` are encoded as `[y_min, x_min, y_max, x_max]`. The
+// bounding box coordinates are floats in `[0.0, 1.0]` relative to the width and
+// height of the underlying image.
 //
-// REQUIRES: value >= 0
-func OrderedMapUnstageMemoryLimit(value int64) OrderedMapUnstageAttr {
-	return func(m optionalAttr) {
-		m["memory_limit"] = value
+// For example, if an image is 100 x 200 pixels (height x width) and the bounding
+// box is `[0.1, 0.2, 0.5, 0.9]`, the upper-left and bottom-right coordinates of
+// the bounding box will be `(40, 10)` to `(100, 50)` (in (x,y) coordinates).
+//
+// Parts of the bounding box may fall outside the image.
+//
+// Arguments:
+//	images: 4-D with shape `[batch, height, width, depth]`. A batch of images.
+//	boxes: 3-D with shape `[batch, num_bounding_boxes, 4]` containing bounding
+// boxes.
+//
+// Returns 4-D with the same shape as `images`. The batch of input images with
+// bounding boxes drawn on the images.
+func DrawBoundingBoxes(scope *Scope, images tf.Output, boxes tf.Output) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "DrawBoundingBoxes",
+		Input: []tf.Input{
+			images, boxes,
+		},
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// OrderedMapUnstageContainer sets the optional container attribute to value.
-// If not specified, defaults to ""
-func OrderedMapUnstageContainer(value string) OrderedMapUnstageAttr {
+// LearnedUnigramCandidateSamplerAttr is an optional argument to LearnedUnigramCandidateSampler.
+type LearnedUnigramCandidateSamplerAttr func(optionalAttr)
+
+// LearnedUnigramCandidateSamplerSeed sets the optional seed attribute to value.
+//
+// value: If either seed or seed2 are set to be non-zero, the random number
+// generator is seeded by the given seed.  Otherwise, it is seeded by a
+// random seed.
+// If not specified, defaults to 0
+func LearnedUnigramCandidateSamplerSeed(value int64) LearnedUnigramCandidateSamplerAttr {
 	return func(m optionalAttr) {
-		m["container"] = value
+		m["seed"] = value
 	}
 }
 
-// OrderedMapUnstageSharedName sets the optional shared_name attribute to value.
-// If not specified, defaults to ""
-func OrderedMapUnstageSharedName(value string) OrderedMapUnstageAttr {
+// LearnedUnigramCandidateSamplerSeed2 sets the optional seed2 attribute to value.
+//
+// value: An second seed to avoid seed collision.
+// If not specified, defaults to 0
+func LearnedUnigramCandidateSamplerSeed2(value int64) LearnedUnigramCandidateSamplerAttr {
 	return func(m optionalAttr) {
-		m["shared_name"] = value
+		m["seed2"] = value
 	}
 }
 
-// Op removes and returns the values associated with the key
+// Generates labels for candidate sampling with a learned unigram distribution.
 //
-// from the underlying container.   If the underlying container
-// does not contain this key, the op will block until it does.
-func OrderedMapUnstage(scope *Scope, key tf.Output, indices tf.Output, dtypes []tf.DataType, optional ...OrderedMapUnstageAttr) (values []tf.Output) {
+// See explanations of candidate sampling and the data formats at
+// go/candidate-sampling.
+//
+// For each batch, this op picks a single set of sampled candidate labels.
+//
+// The advantages of sampling candidates per-batch are simplicity and the
+// possibility of efficient dense matrix multiplication. The disadvantage is that
+// the sampled candidates must be chosen independently of the context and of the
+// true labels.
+//
+// Arguments:
+//	true_classes: A batch_size * num_true matrix, in which each row contains the
+// IDs of the num_true target_classes in the corresponding original label.
+//	num_true: Number of true labels per context.
+//	num_sampled: Number of candidates to randomly sample.
+//	unique: If unique is true, we sample with rejection, so that all sampled
+// candidates in a batch are unique. This requires some approximation to
+// estimate the post-rejection sampling probabilities.
+//	range_max: The sampler will sample integers from the interval [0, range_max).
+//
+// Returns A vector of length num_sampled, in which each element is
+// the ID of a sampled candidate.A batch_size * num_true matrix, representing
+// the number of times each candidate is expected to occur in a batch
+// of sampled candidates. If unique=true, then this is a probability.A vector of length num_sampled, for each sampled
+// candidate representing the number of times the candidate is expected
+// to occur in a batch of sampled candidates.  If unique=true, then this is a
+// probability.
+func LearnedUnigramCandidateSampler(scope *Scope, true_classes tf.Output, num_true int64, num_sampled int64, unique bool, range_max int64, optional ...LearnedUnigramCandidateSamplerAttr) (sampled_candidates tf.Output, true_expected_count tf.Output, sampled_expected_count tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtypes": dtypes}
+	attrs := map[string]interface{}{"num_true": num_true, "num_sampled": num_sampled, "unique": unique, "range_max": range_max}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "OrderedMapUnstage",
+		Type: "LearnedUnigramCandidateSampler",
 		Input: []tf.Input{
-			key, indices,
+			true_classes,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
+	return op.Output(0), op.Output(1), op.Output(2)
+}
+
+// Computes gradients for the scaled exponential linear (Selu) operation.
+//
+// Arguments:
+//	gradients: The backpropagated gradients to the corresponding Selu operation.
+//	outputs: The outputs of the corresponding Selu operation.
+//
+// Returns The gradients: `gradients * (outputs + scale * alpha)`
+// if outputs < 0, `scale * gradients` otherwise.
+func SeluGrad(scope *Scope, gradients tf.Output, outputs tf.Output) (backprops tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	var idx int
-	var err error
-	if values, idx, err = makeOutputList(op, idx, "values"); err != nil {
-		scope.UpdateErr("OrderedMapUnstage", err)
-		return
+	opspec := tf.OpSpec{
+		Type: "SeluGrad",
+		Input: []tf.Input{
+			gradients, outputs,
+		},
 	}
-	return values
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// OrderedMapSizeAttr is an optional argument to OrderedMapSize.
-type OrderedMapSizeAttr func(optionalAttr)
-
-// OrderedMapSizeCapacity sets the optional capacity attribute to value.
-// If not specified, defaults to 0
+// Get the current size of the TensorArray.
 //
-// REQUIRES: value >= 0
-func OrderedMapSizeCapacity(value int64) OrderedMapSizeAttr {
-	return func(m optionalAttr) {
-		m["capacity"] = value
-	}
-}
-
-// OrderedMapSizeMemoryLimit sets the optional memory_limit attribute to value.
-// If not specified, defaults to 0
+// Arguments:
+//	handle: The handle to a TensorArray (output of TensorArray or TensorArrayGrad).
+//	flow_in: A float scalar that enforces proper chaining of operations.
 //
-// REQUIRES: value >= 0
-func OrderedMapSizeMemoryLimit(value int64) OrderedMapSizeAttr {
-	return func(m optionalAttr) {
-		m["memory_limit"] = value
-	}
-}
-
-// OrderedMapSizeContainer sets the optional container attribute to value.
-// If not specified, defaults to ""
-func OrderedMapSizeContainer(value string) OrderedMapSizeAttr {
-	return func(m optionalAttr) {
-		m["container"] = value
+// Returns The current size of the TensorArray.
+func TensorArraySizeV3(scope *Scope, handle tf.Output, flow_in tf.Output) (size tf.Output) {
+	if scope.Err() != nil {
+		return
 	}
-}
-
-// OrderedMapSizeSharedName sets the optional shared_name attribute to value.
-// If not specified, defaults to ""
-func OrderedMapSizeSharedName(value string) OrderedMapSizeAttr {
-	return func(m optionalAttr) {
-		m["shared_name"] = value
+	opspec := tf.OpSpec{
+		Type: "TensorArraySizeV3",
+		Input: []tf.Input{
+			handle, flow_in,
+		},
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// Op returns the number of elements in the underlying container.
-func OrderedMapSize(scope *Scope, dtypes []tf.DataType, optional ...OrderedMapSizeAttr) (size tf.Output) {
+// Deprecated. Use TensorArrayGradV3
+//
+// DEPRECATED at GraphDef version 26: Use TensorArrayWriteV3
+func TensorArrayWriteV2(scope *Scope, handle tf.Output, index tf.Output, value tf.Output, flow_in tf.Output) (flow_out tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtypes": dtypes}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "OrderedMapSize",
-
-		Attrs: attrs,
+		Type: "TensorArrayWriteV2",
+		Input: []tf.Input{
+			handle, index, value, flow_in,
+		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// ShapeNAttr is an optional argument to ShapeN.
-type ShapeNAttr func(optionalAttr)
+// SparseReduceMaxAttr is an optional argument to SparseReduceMax.
+type SparseReduceMaxAttr func(optionalAttr)
 
-// ShapeNOutType sets the optional out_type attribute to value.
-// If not specified, defaults to DT_INT32
-func ShapeNOutType(value tf.DataType) ShapeNAttr {
+// SparseReduceMaxKeepDims sets the optional keep_dims attribute to value.
+//
+// value: If true, retain reduced dimensions with length 1.
+// If not specified, defaults to false
+func SparseReduceMaxKeepDims(value bool) SparseReduceMaxAttr {
 	return func(m optionalAttr) {
-		m["out_type"] = value
+		m["keep_dims"] = value
 	}
 }
 
-// Returns shape of tensors.
+// Computes the max of elements across dimensions of a SparseTensor.
 //
-// This operation returns N 1-D integer tensors representing shape of `input[i]s`.
-func ShapeN(scope *Scope, input []tf.Output, optional ...ShapeNAttr) (output []tf.Output) {
+// This Op takes a SparseTensor and is the sparse counterpart to
+// `tf.reduce_max()`.  In particular, this Op also returns a dense `Tensor`
+// instead of a sparse one.
+//
+// Reduces `sp_input` along the dimensions given in `reduction_axes`.  Unless
+// `keep_dims` is true, the rank of the tensor is reduced by 1 for each entry in
+// `reduction_axes`. If `keep_dims` is true, the reduced dimensions are retained
+// with length 1.
+//
+// If `reduction_axes` has no entries, all dimensions are reduced, and a tensor
+// with a single element is returned.  Additionally, the axes can be negative,
+// which are interpreted according to the indexing rules in Python.
+//
+// Arguments:
+//	input_indices: 2-D.  `N x R` matrix with the indices of non-empty values in a
+// SparseTensor, possibly not in canonical ordering.
+//	input_values: 1-D.  `N` non-empty values corresponding to `input_indices`.
+//	input_shape: 1-D.  Shape of the input SparseTensor.
+//	reduction_axes: 1-D.  Length-`K` vector containing the reduction axes.
+//
+// Returns `R-K`-D.  The reduced Tensor.
+func SparseReduceMax(scope *Scope, input_indices tf.Output, input_values tf.Output, input_shape tf.Output, reduction_axes tf.Output, optional ...SparseReduceMaxAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -28632,861 +28087,780 @@ func ShapeN(scope *Scope, input []tf.Output, optional ...ShapeNAttr) (output []t
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ShapeN",
+		Type: "SparseReduceMax",
 		Input: []tf.Input{
-			tf.OutputList(input),
+			input_indices, input_values, input_shape, reduction_axes,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	if scope.Err() != nil {
-		return
-	}
-	var idx int
-	var err error
-	if output, idx, err = makeOutputList(op, idx, "output"); err != nil {
-		scope.UpdateErr("ShapeN", err)
-		return
-	}
-	return output
+	return op.Output(0)
 }
 
-// CudnnRNNParamsToCanonicalAttr is an optional argument to CudnnRNNParamsToCanonical.
-type CudnnRNNParamsToCanonicalAttr func(optionalAttr)
+// AsStringAttr is an optional argument to AsString.
+type AsStringAttr func(optionalAttr)
 
-// CudnnRNNParamsToCanonicalRnnMode sets the optional rnn_mode attribute to value.
-// If not specified, defaults to "lstm"
-func CudnnRNNParamsToCanonicalRnnMode(value string) CudnnRNNParamsToCanonicalAttr {
-	return func(m optionalAttr) {
-		m["rnn_mode"] = value
-	}
-}
-
-// CudnnRNNParamsToCanonicalInputMode sets the optional input_mode attribute to value.
-// If not specified, defaults to "linear_input"
-func CudnnRNNParamsToCanonicalInputMode(value string) CudnnRNNParamsToCanonicalAttr {
+// AsStringPrecision sets the optional precision attribute to value.
+//
+// value: The post-decimal precision to use for floating point numbers.
+// Only used if precision > -1.
+// If not specified, defaults to -1
+func AsStringPrecision(value int64) AsStringAttr {
 	return func(m optionalAttr) {
-		m["input_mode"] = value
+		m["precision"] = value
 	}
 }
 
-// CudnnRNNParamsToCanonicalDirection sets the optional direction attribute to value.
-// If not specified, defaults to "unidirectional"
-func CudnnRNNParamsToCanonicalDirection(value string) CudnnRNNParamsToCanonicalAttr {
+// AsStringScientific sets the optional scientific attribute to value.
+//
+// value: Use scientific notation for floating point numbers.
+// If not specified, defaults to false
+func AsStringScientific(value bool) AsStringAttr {
 	return func(m optionalAttr) {
-		m["direction"] = value
+		m["scientific"] = value
 	}
 }
 
-// CudnnRNNParamsToCanonicalDropout sets the optional dropout attribute to value.
-// If not specified, defaults to 0
-func CudnnRNNParamsToCanonicalDropout(value float32) CudnnRNNParamsToCanonicalAttr {
+// AsStringShortest sets the optional shortest attribute to value.
+//
+// value: Use shortest representation (either scientific or standard) for
+// floating point numbers.
+// If not specified, defaults to false
+func AsStringShortest(value bool) AsStringAttr {
 	return func(m optionalAttr) {
-		m["dropout"] = value
+		m["shortest"] = value
 	}
 }
 
-// CudnnRNNParamsToCanonicalSeed sets the optional seed attribute to value.
-// If not specified, defaults to 0
-func CudnnRNNParamsToCanonicalSeed(value int64) CudnnRNNParamsToCanonicalAttr {
+// AsStringWidth sets the optional width attribute to value.
+//
+// value: Pad pre-decimal numbers to this width.
+// Applies to both floating point and integer numbers.
+// Only used if width > -1.
+// If not specified, defaults to -1
+func AsStringWidth(value int64) AsStringAttr {
 	return func(m optionalAttr) {
-		m["seed"] = value
+		m["width"] = value
 	}
 }
 
-// CudnnRNNParamsToCanonicalSeed2 sets the optional seed2 attribute to value.
-// If not specified, defaults to 0
-func CudnnRNNParamsToCanonicalSeed2(value int64) CudnnRNNParamsToCanonicalAttr {
+// AsStringFill sets the optional fill attribute to value.
+//
+// value: The value to pad if width > -1.  If empty, pads with spaces.
+// Another typical value is '0'.  String cannot be longer than 1 character.
+// If not specified, defaults to ""
+func AsStringFill(value string) AsStringAttr {
 	return func(m optionalAttr) {
-		m["seed2"] = value
+		m["fill"] = value
 	}
 }
 
-// Retrieves CudnnRNN params in canonical form.
-//
-// Retrieves a set of weights from the opaque params buffer that can be saved and
-// restored in a way compatible with future runs.
-//
-// Note that the params buffer may not be compatible across different GPUs. So any
-// save and restoration should be converted to and from the canonical weights and
-// biases.
+// Converts each entry in the given tensor to strings.  Supports many numeric
 //
-// num_layers: Specifies the number of layers in the RNN model.
-// num_units: Specifies the size of the hidden state.
-// input_size: Specifies the size of the input state.
-// num_params: number of parameter sets for all layers.
-//     Each layer may contain multiple parameter sets, with each set consisting of
-//     a weight matrix and a bias vector.
-// weights: the canonical form of weights that can be used for saving
-//     and restoration. They are more likely to be compatible across different
-//     generations.
-// biases: the canonical form of biases that can be used for saving
-//     and restoration. They are more likely to be compatible across different
-//     generations.
-// rnn_mode: Indicates the type of the RNN model.
-// input_mode: Indicate whether there is a linear projection between the input and
-//     The actual computation before the first layer. 'skip_input' is only allowed
-//     when input_size == num_units; 'auto_select' implies 'skip_input' when
-//     input_size == num_units; otherwise, it implies 'linear_input'.
-// direction: Indicates whether a bidirectional model will be used.
-//     dir = (direction == bidirectional) ? 2 : 1
-// dropout: dropout probability. When set to 0., dropout is disabled.
-// seed: the 1st part of a seed to initialize dropout.
-// seed2: the 2nd part of a seed to initialize dropout.
-func CudnnRNNParamsToCanonical(scope *Scope, num_layers tf.Output, num_units tf.Output, input_size tf.Output, params tf.Output, num_params int64, optional ...CudnnRNNParamsToCanonicalAttr) (weights []tf.Output, biases []tf.Output) {
+// types and boolean.
+func AsString(scope *Scope, input tf.Output, optional ...AsStringAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"num_params": num_params}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "CudnnRNNParamsToCanonical",
+		Type: "AsString",
 		Input: []tf.Input{
-			num_layers, num_units, input_size, params,
+			input,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Deprecated. Use TensorArrayScatterV3
+//
+// DEPRECATED at GraphDef version 26: Use TensorArrayScatterV3
+func TensorArrayScatterV2(scope *Scope, handle tf.Output, indices tf.Output, value tf.Output, flow_in tf.Output) (flow_out tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	var idx int
-	var err error
-	if weights, idx, err = makeOutputList(op, idx, "weights"); err != nil {
-		scope.UpdateErr("CudnnRNNParamsToCanonical", err)
-		return
-	}
-	if biases, idx, err = makeOutputList(op, idx, "biases"); err != nil {
-		scope.UpdateErr("CudnnRNNParamsToCanonical", err)
-		return
+	opspec := tf.OpSpec{
+		Type: "TensorArrayScatterV2",
+		Input: []tf.Input{
+			handle, indices, value, flow_in,
+		},
 	}
-	return weights, biases
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// UniformCandidateSamplerAttr is an optional argument to UniformCandidateSampler.
-type UniformCandidateSamplerAttr func(optionalAttr)
-
-// UniformCandidateSamplerSeed sets the optional seed attribute to value.
+// Creates a tree ensemble model and returns a handle to it.
 //
-// value: If either seed or seed2 are set to be non-zero, the random number
-// generator is seeded by the given seed.  Otherwise, it is seeded by a
-// random seed.
-// If not specified, defaults to 0
-func UniformCandidateSamplerSeed(value int64) UniformCandidateSamplerAttr {
-	return func(m optionalAttr) {
-		m["seed"] = value
-	}
-}
-
-// UniformCandidateSamplerSeed2 sets the optional seed2 attribute to value.
+// Arguments:
+//	tree_ensemble_handle: Handle to the tree ensemble resource to be created.
+//	stamp_token: Token to use as the initial value of the resource stamp.
+//	tree_ensemble_serialized: Serialized proto of the tree ensemble.
 //
-// value: An second seed to avoid seed collision.
-// If not specified, defaults to 0
-func UniformCandidateSamplerSeed2(value int64) UniformCandidateSamplerAttr {
-	return func(m optionalAttr) {
-		m["seed2"] = value
+// Returns the created operation.
+func BoostedTreesCreateEnsemble(scope *Scope, tree_ensemble_handle tf.Output, stamp_token tf.Output, tree_ensemble_serialized tf.Output) (o *tf.Operation) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "BoostedTreesCreateEnsemble",
+		Input: []tf.Input{
+			tree_ensemble_handle, stamp_token, tree_ensemble_serialized,
+		},
 	}
+	return scope.AddOperation(opspec)
 }
 
-// Generates labels for candidate sampling with a uniform distribution.
+// Applies sparse addition to `input` using individual values or slices
 //
-// See explanations of candidate sampling and the data formats at
-// go/candidate-sampling.
+// from `updates` according to indices `indices`.  The updates are non-aliasing:
+// `input` is only modified in-place if no other operations will use it.
+// Otherwise, a copy of `input` is made.  This operation has a gradient with
+// respect to both `input` and `updates`.
 //
-// For each batch, this op picks a single set of sampled candidate labels.
+// `input` is a `Tensor` with rank `P` and `indices` is a `Tensor` of rank `Q`.
 //
-// The advantages of sampling candidates per-batch are simplicity and the
-// possibility of efficient dense matrix multiplication. The disadvantage is that
-// the sampled candidates must be chosen independently of the context and of the
-// true labels.
+// `indices` must be integer tensor, containing indices into `input`.
+// It must be shape `[d_0, ..., d_{Q-2}, K]` where `0 < K <= P`.
+//
+// The innermost dimension of `indices` (with length `K`) corresponds to
+// indices into elements (if `K = P`) or `(P-K)`-dimensional slices
+// (if `K < P`) along the `K`th dimension of `input`.
+//
+// `updates` is `Tensor` of rank `Q-1+P-K` with shape:
+//
+// ```
+// [d_0, ..., d_{Q-2}, input.shape[K], ..., input.shape[P-1]].
+// ```
+//
+// For example, say we want to add 4 scattered elements to a rank-1 tensor to 8
+// elements. In Python, that addition would look like this:
+//
+//     input = tf.constant([1, 2, 3, 4, 5, 6, 7, 8])
+//     indices = tf.constant([[4], [3], [1], [7]])
+//     updates = tf.constant([9, 10, 11, 12])
+//     output = tf.scatter_nd_non_aliasing_add(input, indices, updates)
+//     with tf.Session() as sess:
+//       print(sess.run(output))
+//
+// The resulting value `output` would look like this:
+//
+//     [1, 13, 3, 14, 14, 6, 7, 20]
+//
+// See @{tf.scatter_nd} for more details about how to make updates to slices.
 //
 // Arguments:
-//	true_classes: A batch_size * num_true matrix, in which each row contains the
-// IDs of the num_true target_classes in the corresponding original label.
-//	num_true: Number of true labels per context.
-//	num_sampled: Number of candidates to randomly sample.
-//	unique: If unique is true, we sample with rejection, so that all sampled
-// candidates in a batch are unique. This requires some approximation to
-// estimate the post-rejection sampling probabilities.
-//	range_max: The sampler will sample integers from the interval [0, range_max).
+//	input: A Tensor.
+//	indices: A Tensor. Must be one of the following types: `int32`, `int64`.
+// A tensor of indices into `input`.
+//	updates: A Tensor. Must have the same type as ref. A tensor of updated values
+// to add to `input`.
 //
-// Returns A vector of length num_sampled, in which each element is
-// the ID of a sampled candidate.A batch_size * num_true matrix, representing
-// the number of times each candidate is expected to occur in a batch
-// of sampled candidates. If unique=true, then this is a probability.A vector of length num_sampled, for each sampled
-// candidate representing the number of times the candidate is expected
-// to occur in a batch of sampled candidates.  If unique=true, then this is a
-// probability.
-func UniformCandidateSampler(scope *Scope, true_classes tf.Output, num_true int64, num_sampled int64, unique bool, range_max int64, optional ...UniformCandidateSamplerAttr) (sampled_candidates tf.Output, true_expected_count tf.Output, sampled_expected_count tf.Output) {
+// Returns A `Tensor` with the same shape as `input`, containing values of `input`
+// updated with `updates`.
+func ScatterNdNonAliasingAdd(scope *Scope, input tf.Output, indices tf.Output, updates tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"num_true": num_true, "num_sampled": num_sampled, "unique": unique, "range_max": range_max}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "UniformCandidateSampler",
+		Type: "ScatterNdNonAliasingAdd",
 		Input: []tf.Input{
-			true_classes,
+			input, indices, updates,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
+	return op.Output(0)
 }
 
-// CTCLossAttr is an optional argument to CTCLoss.
-type CTCLossAttr func(optionalAttr)
+// FractionalMaxPoolAttr is an optional argument to FractionalMaxPool.
+type FractionalMaxPoolAttr func(optionalAttr)
 
-// CTCLossPreprocessCollapseRepeated sets the optional preprocess_collapse_repeated attribute to value.
+// FractionalMaxPoolPseudoRandom sets the optional pseudo_random attribute to value.
 //
-// value: Scalar, if true then repeated labels are
-// collapsed prior to the CTC calculation.
+// value: When set to True, generates the pooling sequence in a
+// pseudorandom fashion, otherwise, in a random fashion. Check paper [Benjamin
+// Graham, Fractional Max-Pooling](http://arxiv.org/abs/1412.6071) for
+// difference between pseudorandom and random.
 // If not specified, defaults to false
-func CTCLossPreprocessCollapseRepeated(value bool) CTCLossAttr {
+func FractionalMaxPoolPseudoRandom(value bool) FractionalMaxPoolAttr {
 	return func(m optionalAttr) {
-		m["preprocess_collapse_repeated"] = value
+		m["pseudo_random"] = value
 	}
 }
 
-// CTCLossCtcMergeRepeated sets the optional ctc_merge_repeated attribute to value.
+// FractionalMaxPoolOverlapping sets the optional overlapping attribute to value.
 //
-// value: Scalar.  If set to false, *during* CTC calculation
-// repeated non-blank labels will not be merged and are interpreted as
-// individual labels.  This is a simplified version of CTC.
-// If not specified, defaults to true
-func CTCLossCtcMergeRepeated(value bool) CTCLossAttr {
+// value: When set to True, it means when pooling, the values at the boundary
+// of adjacent pooling cells are used by both cells. For example:
+//
+// `index  0  1  2  3  4`
+//
+// `value  20 5  16 3  7`
+//
+// If the pooling sequence is [0, 2, 4], then 16, at index 2 will be used twice.
+// The result would be [20, 16] for fractional max pooling.
+// If not specified, defaults to false
+func FractionalMaxPoolOverlapping(value bool) FractionalMaxPoolAttr {
 	return func(m optionalAttr) {
-		m["ctc_merge_repeated"] = value
+		m["overlapping"] = value
 	}
 }
 
-// CTCLossIgnoreLongerOutputsThanInputs sets the optional ignore_longer_outputs_than_inputs attribute to value.
+// FractionalMaxPoolDeterministic sets the optional deterministic attribute to value.
 //
-// value: Scalar. If set to true, during CTC
-// calculation, items that have longer output sequences than input sequences
-// are skipped: they don't contribute to the loss term and have zero-gradient.
+// value: When set to True, a fixed pooling region will be used when
+// iterating over a FractionalMaxPool node in the computation graph. Mainly used
+// in unit test to make FractionalMaxPool deterministic.
 // If not specified, defaults to false
-func CTCLossIgnoreLongerOutputsThanInputs(value bool) CTCLossAttr {
+func FractionalMaxPoolDeterministic(value bool) FractionalMaxPoolAttr {
 	return func(m optionalAttr) {
-		m["ignore_longer_outputs_than_inputs"] = value
+		m["deterministic"] = value
 	}
 }
 
-// Calculates the CTC Loss (log probability) for each batch entry.  Also calculates
-//
-// the gradient.  This class performs the softmax operation for you, so inputs
-// should be e.g. linear projections of outputs by an LSTM.
-//
-// Arguments:
-//	inputs: 3-D, shape: `(max_time x batch_size x num_classes)`, the logits.
-//	labels_indices: The indices of a `SparseTensor<int32, 2>`.
-// `labels_indices(i, :) == [b, t]` means `labels_values(i)` stores the id for
-// `(batch b, time t)`.
-//	labels_values: The values (labels) associated with the given batch and time.
-//	sequence_length: A vector containing sequence lengths (batch).
+// FractionalMaxPoolSeed sets the optional seed attribute to value.
 //
-// Returns A vector (batch) containing log-probabilities.The gradient of `loss`.  3-D, shape:
-// `(max_time x batch_size x num_classes)`.
-func CTCLoss(scope *Scope, inputs tf.Output, labels_indices tf.Output, labels_values tf.Output, sequence_length tf.Output, optional ...CTCLossAttr) (loss tf.Output, gradient tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "CTCLoss",
-		Input: []tf.Input{
-			inputs, labels_indices, labels_values, sequence_length,
-		},
-		Attrs: attrs,
+// value: If either seed or seed2 are set to be non-zero, the random number
+// generator is seeded by the given seed.  Otherwise, it is seeded by a
+// random seed.
+// If not specified, defaults to 0
+func FractionalMaxPoolSeed(value int64) FractionalMaxPoolAttr {
+	return func(m optionalAttr) {
+		m["seed"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1)
 }
 
-// CTCGreedyDecoderAttr is an optional argument to CTCGreedyDecoder.
-type CTCGreedyDecoderAttr func(optionalAttr)
-
-// CTCGreedyDecoderMergeRepeated sets the optional merge_repeated attribute to value.
+// FractionalMaxPoolSeed2 sets the optional seed2 attribute to value.
 //
-// value: If True, merge repeated classes in output.
-// If not specified, defaults to false
-func CTCGreedyDecoderMergeRepeated(value bool) CTCGreedyDecoderAttr {
+// value: An second seed to avoid seed collision.
+// If not specified, defaults to 0
+func FractionalMaxPoolSeed2(value int64) FractionalMaxPoolAttr {
 	return func(m optionalAttr) {
-		m["merge_repeated"] = value
+		m["seed2"] = value
 	}
 }
 
-// Performs greedy decoding on the logits given in inputs.
+// Performs fractional max pooling on the input.
 //
-// A note about the attribute merge_repeated: if enabled, when
-// consecutive logits' maximum indices are the same, only the first of
-// these is emitted.  Labeling the blank '*', the sequence "A B B * B B"
-// becomes "A B B" if merge_repeated = True and "A B B B B" if
-// merge_repeated = False.
+// Fractional max pooling is slightly different than regular max pooling.  In
+// regular max pooling, you downsize an input set by taking the maximum value of
+// smaller N x N subsections of the set (often 2x2), and try to reduce the set by
+// a factor of N, where N is an integer.  Fractional max pooling, as you might
+// expect from the word "fractional", means that the overall reduction ratio N
+// does not have to be an integer.
 //
-// Regardless of the value of merge_repeated, if the maximum index of a given
-// time and batch corresponds to the blank, index `(num_classes - 1)`, no new
-// element is emitted.
+// The sizes of the pooling regions are generated randomly but are fairly uniform.
+// For example, let's look at the height dimension, and the constraints on the
+// list of rows that will be pool boundaries.
 //
-// Arguments:
-//	inputs: 3-D, shape: `(max_time x batch_size x num_classes)`, the logits.
-//	sequence_length: A vector containing sequence lengths, size `(batch_size)`.
+// First we define the following:
 //
-// Returns Indices matrix, size `(total_decoded_outputs x 2)`,
-// of a `SparseTensor<int64, 2>`.  The rows store: [batch, time].Values vector, size: `(total_decoded_outputs)`,
-// of a `SparseTensor<int64, 2>`.  The vector stores the decoded classes.Shape vector, size `(2)`, of the decoded SparseTensor.
-// Values are: `[batch_size, max_decoded_length]`.Matrix, size `(batch_size x 1)`, containing sequence
-// log-probabilities.
-func CTCGreedyDecoder(scope *Scope, inputs tf.Output, sequence_length tf.Output, optional ...CTCGreedyDecoderAttr) (decoded_indices tf.Output, decoded_values tf.Output, decoded_shape tf.Output, log_probability tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "CTCGreedyDecoder",
-		Input: []tf.Input{
-			inputs, sequence_length,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2), op.Output(3)
-}
-
-// Forwards `data` to the output port determined by `pred`.
+// 1.  input_row_length : the number of rows from the input set
+// 2.  output_row_length : which will be smaller than the input
+// 3.  alpha = input_row_length / output_row_length : our reduction ratio
+// 4.  K = floor(alpha)
+// 5.  row_pooling_sequence : this is the result list of pool boundary rows
 //
-// If `pred` is true, the `data` input is forwarded to `output_true`. Otherwise,
-// the data goes to `output_false`.
+// Then, row_pooling_sequence should satisfy:
 //
-// See also `RefSwitch` and `Merge`.
+// 1.  a[0] = 0 : the first value of the sequence is 0
+// 2.  a[end] = input_row_length : the last value of the sequence is the size
+// 3.  K <= (a[i+1] - a[i]) <= K+1 : all intervals are K or K+1 size
+// 4.  length(row_pooling_sequence) = output_row_length+1
+//
+// For more details on fractional max pooling, see this paper:
+// [Benjamin Graham, Fractional Max-Pooling](http://arxiv.org/abs/1412.6071)
 //
 // Arguments:
-//	data: The tensor to be forwarded to the appropriate output.
-//	pred: A scalar that specifies which output port will receive data.
+//	value: 4-D with shape `[batch, height, width, channels]`.
+//	pooling_ratio: Pooling ratio for each dimension of `value`, currently only
+// supports row and col dimension and should be >= 1.0. For example, a valid
+// pooling ratio looks like [1.0, 1.44, 1.73, 1.0]. The first and last elements
+// must be 1.0 because we don't allow pooling on batch and channels
+// dimensions. 1.44 and 1.73 are pooling ratio on height and width dimensions
+// respectively.
 //
-// Returns If `pred` is false, data will be forwarded to this output.If `pred` is true, data will be forwarded to this output.
-func Switch(scope *Scope, data tf.Output, pred tf.Output) (output_false tf.Output, output_true tf.Output) {
+// Returns output tensor after fractional max pooling.row pooling sequence, needed to calculate gradient.column pooling sequence, needed to calculate gradient.
+func FractionalMaxPool(scope *Scope, value tf.Output, pooling_ratio []float32, optional ...FractionalMaxPoolAttr) (output tf.Output, row_pooling_sequence tf.Output, col_pooling_sequence tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"pooling_ratio": pooling_ratio}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "Switch",
+		Type: "FractionalMaxPool",
 		Input: []tf.Input{
-			data, pred,
+			value,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1)
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// Add all input tensors element wise.
+// Deprecated. Use TensorArraySizeV3
 //
-// Arguments:
-//	inputs: Must all be the same size and shape.
-func AddN(scope *Scope, inputs []tf.Output) (sum tf.Output) {
+// DEPRECATED at GraphDef version 26: Use TensorArraySizeV3
+func TensorArraySizeV2(scope *Scope, handle tf.Output, flow_in tf.Output) (size tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "AddN",
+		Type: "TensorArraySizeV2",
 		Input: []tf.Input{
-			tf.OutputList(inputs),
+			handle, flow_in,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// TryRpcAttr is an optional argument to TryRpc.
-type TryRpcAttr func(optionalAttr)
+// Conv2DAttr is an optional argument to Conv2D.
+type Conv2DAttr func(optionalAttr)
 
-// TryRpcProtocol sets the optional protocol attribute to value.
-//
-// value: RPC protocol to use.  Empty string means use the default protocol.
-// Options include 'grpc'.
-// If not specified, defaults to ""
-func TryRpcProtocol(value string) TryRpcAttr {
+// Conv2DUseCudnnOnGpu sets the optional use_cudnn_on_gpu attribute to value.
+// If not specified, defaults to true
+func Conv2DUseCudnnOnGpu(value bool) Conv2DAttr {
 	return func(m optionalAttr) {
-		m["protocol"] = value
+		m["use_cudnn_on_gpu"] = value
 	}
 }
 
-// TryRpcFailFast sets the optional fail_fast attribute to value.
+// Conv2DDataFormat sets the optional data_format attribute to value.
 //
-// value: `boolean`. If `true` (default), then failures to connect
-// (i.e., the server does not immediately respond) cause an RPC failure.
-// If not specified, defaults to true
-func TryRpcFailFast(value bool) TryRpcAttr {
+// value: Specify the data format of the input and output data. With the
+// default format "NHWC", the data is stored in the order of:
+//     [batch, height, width, channels].
+// Alternatively, the format could be "NCHW", the data storage order of:
+//     [batch, channels, height, width].
+// If not specified, defaults to "NHWC"
+func Conv2DDataFormat(value string) Conv2DAttr {
 	return func(m optionalAttr) {
-		m["fail_fast"] = value
+		m["data_format"] = value
 	}
 }
 
-// TryRpcTimeoutInMs sets the optional timeout_in_ms attribute to value.
+// Conv2DDilations sets the optional dilations attribute to value.
 //
-// value: `int`. If `0` (default), then the kernel will run the RPC
-// request and only time out if the RPC deadline passes or the session times out.
-// If this value is greater than `0`, then the op will raise an exception if
-// the RPC takes longer than `timeout_in_ms`.
-// If not specified, defaults to 0
-func TryRpcTimeoutInMs(value int64) TryRpcAttr {
+// value: 1-D tensor of length 4.  The dilation factor for each dimension of
+// `input`. If set to k > 1, there will be k-1 skipped cells between each
+// filter element on that dimension. The dimension order is determined by the
+// value of `data_format`, see above for details. Dilations in the batch and
+// depth dimensions must be 1.
+// If not specified, defaults to <i:1 i:1 i:1 i:1 >
+func Conv2DDilations(value []int64) Conv2DAttr {
 	return func(m optionalAttr) {
-		m["timeout_in_ms"] = value
+		m["dilations"] = value
 	}
 }
 
-// Perform batches of RPC requests.
-//
-// This op asynchronously performs either a single RPC request, or a batch
-// of requests.  RPC requests are defined by three main parameters:
+// Computes a 2-D convolution given 4-D `input` and `filter` tensors.
 //
-//   - `address` (the host+port or BNS address of the request)
-//   - `method` (the method name for the request)
-//   - `request` (the serialized proto string, or vector of strings,
-//      of the RPC request argument).
+// Given an input tensor of shape `[batch, in_height, in_width, in_channels]`
+// and a filter / kernel tensor of shape
+// `[filter_height, filter_width, in_channels, out_channels]`, this op
+// performs the following:
 //
-// For example, if you have an RPC service running on port localhost:2345,
-// and its interface is configured with the following proto declaration:
+// 1. Flattens the filter to a 2-D matrix with shape
+//    `[filter_height * filter_width * in_channels, output_channels]`.
+// 2. Extracts image patches from the input tensor to form a *virtual*
+//    tensor of shape `[batch, out_height, out_width,
+//    filter_height * filter_width * in_channels]`.
+// 3. For each patch, right-multiplies the filter matrix and the image patch
+//    vector.
 //
-// ```
-// service MyService {
-//   rpc MyMethod(MyRequestProto) returns (MyResponseProto) {
-//   }
-// };
-// ```
+// In detail, with the default NHWC format,
 //
-// then call this op with arguments:
-//
-// ```
-// address = "localhost:2345"
-// method = "MyService/MyMethod"
-// ```
-//
-// The `request` tensor is a string tensor representing serialized `MyRequestProto`
-// strings; and the output string tensor `response` will have the same shape
-// and contain (upon successful completion) corresponding serialized
-// `MyResponseProto` strings.
-//
-// For example, to send a single, empty, `MyRequestProto`, call
-// this op with `request = ""`.  To send 5 **parallel** empty requests,
-// call this op with `request = ["", "", "", "", ""]`.
-//
-// More generally, one can create a batch of `MyRequestProto` serialized protos
-// from regular batched tensors using the `encode_proto` op, and convert
-// the response `MyResponseProto` serialized protos to batched tensors
-// using the `decode_proto` op.
-//
-// **NOTE** Working with serialized proto strings is faster than instantiating
-// actual proto objects in memory, so no performance degradation is expected
-// compared to writing custom kernels for this workflow.
+//     output[b, i, j, k] =
+//         sum_{di, dj, q} input[b, strides[1] * i + di, strides[2] * j + dj, q] *
+//                         filter[di, dj, q, k]
 //
-// Unlike the standard `Rpc` op, if the connection fails or the remote worker
-// returns an error status, this op does **not** reraise the exception.
-// Instead, the `status_code` and `status_message` entry for the corresponding RPC
-// call is set with the error returned from the RPC call.  The `response` tensor
-// will contain valid response values for those minibatch entries whose RPCs did
-// not fail; the rest of the entries will have empty strings.
+// Must have `strides[0] = strides[3] = 1`.  For the most common case of the same
+// horizontal and vertices strides, `strides = [1, stride, stride, 1]`.
 //
 // Arguments:
-//	address: `0-D` or `1-D`.  The address (i.e. host_name:port) of the RPC server.
-// If this tensor has more than 1 element, then multiple parallel rpc requests
-// are sent.  This argument broadcasts with `method` and `request`.
-//	method: `0-D` or `1-D`.  The method address on the RPC server.
-// If this tensor has more than 1 element, then multiple parallel rpc requests
-// are sent.  This argument broadcasts with `address` and `request`.
-//	request: `0-D` or `1-D`.  Serialized proto strings: the rpc request argument.
-// If this tensor has more than 1 element, then multiple parallel rpc requests
-// are sent.  This argument broadcasts with `address` and `method`.
+//	input: A 4-D tensor. The dimension order is interpreted according to the value
+// of `data_format`, see below for details.
+//	filter: A 4-D tensor of shape
+// `[filter_height, filter_width, in_channels, out_channels]`
+//	strides: 1-D tensor of length 4.  The stride of the sliding window for each
+// dimension of `input`. The dimension order is determined by the value of
+// `data_format`, see below for details.
+//	padding: The type of padding algorithm to use.
 //
-// Returns Same shape as `request`. Serialized proto strings: the rpc responses.Same shape as `request`.  Values correspond to tensorflow Status enum codes.Same shape as `request`.  Values correspond to Status messages
-// returned from the RPC calls.
-func TryRpc(scope *Scope, address tf.Output, method tf.Output, request tf.Output, optional ...TryRpcAttr) (response tf.Output, status_code tf.Output, status_message tf.Output) {
+// Returns A 4-D tensor. The dimension order is determined by the value of
+// `data_format`, see below for details.
+func Conv2D(scope *Scope, input tf.Output, filter tf.Output, strides []int64, padding string, optional ...Conv2DAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
+	attrs := map[string]interface{}{"strides": strides, "padding": padding}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "TryRpc",
+		Type: "Conv2D",
 		Input: []tf.Input{
-			address, method, request,
+			input, filter,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
+	return op.Output(0)
 }
 
-// EnterAttr is an optional argument to Enter.
-type EnterAttr func(optionalAttr)
+// StageAttr is an optional argument to Stage.
+type StageAttr func(optionalAttr)
 
-// EnterIsConstant sets the optional is_constant attribute to value.
+// StageCapacity sets the optional capacity attribute to value.
 //
-// value: If true, the output is constant within the child frame.
-// If not specified, defaults to false
-func EnterIsConstant(value bool) EnterAttr {
+// value: Maximum number of elements in the Staging Area. If > 0, inserts
+// on the container will block when the capacity is reached.
+// If not specified, defaults to 0
+//
+// REQUIRES: value >= 0
+func StageCapacity(value int64) StageAttr {
 	return func(m optionalAttr) {
-		m["is_constant"] = value
+		m["capacity"] = value
 	}
 }
 
-// EnterParallelIterations sets the optional parallel_iterations attribute to value.
+// StageMemoryLimit sets the optional memory_limit attribute to value.
 //
-// value: The number of iterations allowed to run in parallel.
-// If not specified, defaults to 10
-func EnterParallelIterations(value int64) EnterAttr {
+// value: The maximum number of bytes allowed for Tensors in the Staging Area.
+// If > 0, inserts will block until sufficient space is available.
+// If not specified, defaults to 0
+//
+// REQUIRES: value >= 0
+func StageMemoryLimit(value int64) StageAttr {
 	return func(m optionalAttr) {
-		m["parallel_iterations"] = value
+		m["memory_limit"] = value
 	}
 }
 
-// Creates or finds a child frame, and makes `data` available to the child frame.
+// StageContainer sets the optional container attribute to value.
 //
-// This op is used together with `Exit` to create loops in the graph.
-// The unique `frame_name` is used by the `Executor` to identify frames. If
-// `is_constant` is true, `output` is a constant in the child frame; otherwise
-// it may be changed in the child frame. At most `parallel_iterations` iterations
-// are run in parallel in the child frame.
+// value: If non-empty, this queue is placed in the given container. Otherwise,
+// a default container is used.
+// If not specified, defaults to ""
+func StageContainer(value string) StageAttr {
+	return func(m optionalAttr) {
+		m["container"] = value
+	}
+}
+
+// StageSharedName sets the optional shared_name attribute to value.
+//
+// value: It is necessary to match this name to the matching Unstage Op.
+// If not specified, defaults to ""
+func StageSharedName(value string) StageAttr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
+	}
+}
+
+// Stage values similar to a lightweight Enqueue.
+//
+// The basic functionality of this Op is similar to a queue with many
+// fewer capabilities and options.  This Op is optimized for performance.
 //
 // Arguments:
-//	data: The tensor to be made available to the child frame.
-//	frame_name: The name of the child frame.
+//	values: a list of tensors
+// dtypes A list of data types that inserted values should adhere to.
 //
-// Returns The same tensor as `data`.
-func Enter(scope *Scope, data tf.Output, frame_name string, optional ...EnterAttr) (output tf.Output) {
+// Returns the created operation.
+func Stage(scope *Scope, values []tf.Output, optional ...StageAttr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"frame_name": frame_name}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "Enter",
+		Type: "Stage",
 		Input: []tf.Input{
-			data,
+			tf.OutputList(values),
 		},
 		Attrs: attrs,
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return scope.AddOperation(opspec)
 }
 
-// Produce a string tensor that encodes the state of a Reader.
+// StagePeekAttr is an optional argument to StagePeek.
+type StagePeekAttr func(optionalAttr)
+
+// StagePeekCapacity sets the optional capacity attribute to value.
+// If not specified, defaults to 0
 //
-// Not all Readers support being serialized, so this can produce an
-// Unimplemented error.
+// REQUIRES: value >= 0
+func StagePeekCapacity(value int64) StagePeekAttr {
+	return func(m optionalAttr) {
+		m["capacity"] = value
+	}
+}
+
+// StagePeekMemoryLimit sets the optional memory_limit attribute to value.
+// If not specified, defaults to 0
 //
-// Arguments:
-//	reader_handle: Handle to a Reader.
-func ReaderSerializeStateV2(scope *Scope, reader_handle tf.Output) (state tf.Output) {
-	if scope.Err() != nil {
-		return
+// REQUIRES: value >= 0
+func StagePeekMemoryLimit(value int64) StagePeekAttr {
+	return func(m optionalAttr) {
+		m["memory_limit"] = value
 	}
-	opspec := tf.OpSpec{
-		Type: "ReaderSerializeStateV2",
-		Input: []tf.Input{
-			reader_handle,
-		},
+}
+
+// StagePeekContainer sets the optional container attribute to value.
+// If not specified, defaults to ""
+func StagePeekContainer(value string) StagePeekAttr {
+	return func(m optionalAttr) {
+		m["container"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// Exits the current frame to its parent frame.
-//
-// Exit makes its input `data` available to the parent frame.
-//
-// Arguments:
-//	data: The tensor to be made available to the parent frame.
+// StagePeekSharedName sets the optional shared_name attribute to value.
+// If not specified, defaults to ""
+func StagePeekSharedName(value string) StagePeekAttr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
+	}
+}
+
+// Op peeks at the values at the specified index.  If the
 //
-// Returns The same tensor as `data`.
-func Exit(scope *Scope, data tf.Output) (output tf.Output) {
+// underlying container does not contain sufficient elements
+// this op will block until it does.   This Op is optimized for
+// performance.
+func StagePeek(scope *Scope, index tf.Output, dtypes []tf.DataType, optional ...StagePeekAttr) (values []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"dtypes": dtypes}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "Exit",
+		Type: "StagePeek",
 		Input: []tf.Input{
-			data,
+			index,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Returns a copy of the input tensor.
-func Snapshot(scope *Scope, input tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	opspec := tf.OpSpec{
-		Type: "Snapshot",
-		Input: []tf.Input{
-			input,
-		},
+	var idx int
+	var err error
+	if values, idx, err = makeOutputList(op, idx, "values"); err != nil {
+		scope.UpdateErr("StagePeek", err)
+		return
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return values
 }
 
-// AbortAttr is an optional argument to Abort.
-type AbortAttr func(optionalAttr)
+// MapStageAttr is an optional argument to MapStage.
+type MapStageAttr func(optionalAttr)
 
-// AbortErrorMsg sets the optional error_msg attribute to value.
+// MapStageCapacity sets the optional capacity attribute to value.
 //
-// value: A string which is the message associated with the exception.
-// If not specified, defaults to ""
-func AbortErrorMsg(value string) AbortAttr {
+// value: Maximum number of elements in the Staging Area. If > 0, inserts
+// on the container will block when the capacity is reached.
+// If not specified, defaults to 0
+//
+// REQUIRES: value >= 0
+func MapStageCapacity(value int64) MapStageAttr {
 	return func(m optionalAttr) {
-		m["error_msg"] = value
+		m["capacity"] = value
 	}
 }
 
-// AbortExitWithoutError sets the optional exit_without_error attribute to value.
-// If not specified, defaults to false
-func AbortExitWithoutError(value bool) AbortAttr {
+// MapStageMemoryLimit sets the optional memory_limit attribute to value.
+// If not specified, defaults to 0
+//
+// REQUIRES: value >= 0
+func MapStageMemoryLimit(value int64) MapStageAttr {
 	return func(m optionalAttr) {
-		m["exit_without_error"] = value
+		m["memory_limit"] = value
 	}
 }
 
-// Raise a exception to abort the process when called.
+// MapStageContainer sets the optional container attribute to value.
 //
-// If exit_without_error is true, the process will exit normally,
-// otherwise it will exit with a SIGABORT signal.
+// value: If non-empty, this queue is placed in the given container. Otherwise,
+// a default container is used.
+// If not specified, defaults to ""
+func MapStageContainer(value string) MapStageAttr {
+	return func(m optionalAttr) {
+		m["container"] = value
+	}
+}
+
+// MapStageSharedName sets the optional shared_name attribute to value.
+//
+// value: It is necessary to match this name to the matching Unstage Op.
+// If not specified, defaults to ""
+func MapStageSharedName(value string) MapStageAttr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
+	}
+}
+
+// Stage (key, values) in the underlying container which behaves like a hashtable.
+//
+// Arguments:
+//	key: int64
+//
+//	values: a list of tensors
+// dtypes A list of data types that inserted values should adhere to.
 //
-// Returns nothing but an exception.
 //
 // Returns the created operation.
-func Abort(scope *Scope, optional ...AbortAttr) (o *tf.Operation) {
+func MapStage(scope *Scope, key tf.Output, indices tf.Output, values []tf.Output, dtypes []tf.DataType, optional ...MapStageAttr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
+	attrs := map[string]interface{}{"dtypes": dtypes}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "Abort",
-
+		Type: "MapStage",
+		Input: []tf.Input{
+			key, indices, tf.OutputList(values),
+		},
 		Attrs: attrs,
 	}
 	return scope.AddOperation(opspec)
 }
 
-// FixedUnigramCandidateSamplerAttr is an optional argument to FixedUnigramCandidateSampler.
-type FixedUnigramCandidateSamplerAttr func(optionalAttr)
-
-// FixedUnigramCandidateSamplerVocabFile sets the optional vocab_file attribute to value.
-//
-// value: Each valid line in this file (which should have a CSV-like format)
-// corresponds to a valid word ID. IDs are in sequential order, starting from
-// num_reserved_ids. The last entry in each line is expected to be a value
-// corresponding to the count or relative probability. Exactly one of vocab_file
-// and unigrams needs to be passed to this op.
-// If not specified, defaults to ""
-func FixedUnigramCandidateSamplerVocabFile(value string) FixedUnigramCandidateSamplerAttr {
-	return func(m optionalAttr) {
-		m["vocab_file"] = value
-	}
-}
-
-// FixedUnigramCandidateSamplerDistortion sets the optional distortion attribute to value.
-//
-// value: The distortion is used to skew the unigram probability distribution.
-// Each weight is first raised to the distortion's power before adding to the
-// internal unigram distribution. As a result, distortion = 1.0 gives regular
-// unigram sampling (as defined by the vocab file), and distortion = 0.0 gives
-// a uniform distribution.
-// If not specified, defaults to 1
-func FixedUnigramCandidateSamplerDistortion(value float32) FixedUnigramCandidateSamplerAttr {
-	return func(m optionalAttr) {
-		m["distortion"] = value
-	}
-}
+// MapUnstageAttr is an optional argument to MapUnstage.
+type MapUnstageAttr func(optionalAttr)
 
-// FixedUnigramCandidateSamplerNumReservedIds sets the optional num_reserved_ids attribute to value.
-//
-// value: Optionally some reserved IDs can be added in the range [0,
-// ..., num_reserved_ids) by the users. One use case is that a special unknown
-// word token is used as ID 0. These IDs will have a sampling probability of 0.
+// MapUnstageCapacity sets the optional capacity attribute to value.
 // If not specified, defaults to 0
-func FixedUnigramCandidateSamplerNumReservedIds(value int64) FixedUnigramCandidateSamplerAttr {
-	return func(m optionalAttr) {
-		m["num_reserved_ids"] = value
-	}
-}
-
-// FixedUnigramCandidateSamplerNumShards sets the optional num_shards attribute to value.
 //
-// value: A sampler can be used to sample from a subset of the original range
-// in order to speed up the whole computation through parallelism. This parameter
-// (together with 'shard') indicates the number of partitions that are being
-// used in the overall computation.
-// If not specified, defaults to 1
-//
-// REQUIRES: value >= 1
-func FixedUnigramCandidateSamplerNumShards(value int64) FixedUnigramCandidateSamplerAttr {
+// REQUIRES: value >= 0
+func MapUnstageCapacity(value int64) MapUnstageAttr {
 	return func(m optionalAttr) {
-		m["num_shards"] = value
+		m["capacity"] = value
 	}
 }
 
-// FixedUnigramCandidateSamplerShard sets the optional shard attribute to value.
-//
-// value: A sampler can be used to sample from a subset of the original range
-// in order to speed up the whole computation through parallelism. This parameter
-// (together with 'num_shards') indicates the particular partition number of a
-// sampler op, when partitioning is being used.
+// MapUnstageMemoryLimit sets the optional memory_limit attribute to value.
 // If not specified, defaults to 0
 //
 // REQUIRES: value >= 0
-func FixedUnigramCandidateSamplerShard(value int64) FixedUnigramCandidateSamplerAttr {
-	return func(m optionalAttr) {
-		m["shard"] = value
-	}
-}
-
-// FixedUnigramCandidateSamplerUnigrams sets the optional unigrams attribute to value.
-//
-// value: A list of unigram counts or probabilities, one per ID in sequential
-// order. Exactly one of vocab_file and unigrams should be passed to this op.
-// If not specified, defaults to <>
-func FixedUnigramCandidateSamplerUnigrams(value []float32) FixedUnigramCandidateSamplerAttr {
+func MapUnstageMemoryLimit(value int64) MapUnstageAttr {
 	return func(m optionalAttr) {
-		m["unigrams"] = value
+		m["memory_limit"] = value
 	}
 }
 
-// FixedUnigramCandidateSamplerSeed sets the optional seed attribute to value.
-//
-// value: If either seed or seed2 are set to be non-zero, the random number
-// generator is seeded by the given seed.  Otherwise, it is seeded by a
-// random seed.
-// If not specified, defaults to 0
-func FixedUnigramCandidateSamplerSeed(value int64) FixedUnigramCandidateSamplerAttr {
+// MapUnstageContainer sets the optional container attribute to value.
+// If not specified, defaults to ""
+func MapUnstageContainer(value string) MapUnstageAttr {
 	return func(m optionalAttr) {
-		m["seed"] = value
+		m["container"] = value
 	}
 }
 
-// FixedUnigramCandidateSamplerSeed2 sets the optional seed2 attribute to value.
-//
-// value: An second seed to avoid seed collision.
-// If not specified, defaults to 0
-func FixedUnigramCandidateSamplerSeed2(value int64) FixedUnigramCandidateSamplerAttr {
+// MapUnstageSharedName sets the optional shared_name attribute to value.
+// If not specified, defaults to ""
+func MapUnstageSharedName(value string) MapUnstageAttr {
 	return func(m optionalAttr) {
-		m["seed2"] = value
+		m["shared_name"] = value
 	}
 }
 
-// Generates labels for candidate sampling with a learned unigram distribution.
-//
-// A unigram sampler could use a fixed unigram distribution read from a
-// file or passed in as an in-memory array instead of building up the distribution
-// from data on the fly. There is also an option to skew the distribution by
-// applying a distortion power to the weights.
-//
-// The vocabulary file should be in CSV-like format, with the last field
-// being the weight associated with the word.
-//
-// For each batch, this op picks a single set of sampled candidate labels.
-//
-// The advantages of sampling candidates per-batch are simplicity and the
-// possibility of efficient dense matrix multiplication. The disadvantage is that
-// the sampled candidates must be chosen independently of the context and of the
-// true labels.
-//
-// Arguments:
-//	true_classes: A batch_size * num_true matrix, in which each row contains the
-// IDs of the num_true target_classes in the corresponding original label.
-//	num_true: Number of true labels per context.
-//	num_sampled: Number of candidates to randomly sample.
-//	unique: If unique is true, we sample with rejection, so that all sampled
-// candidates in a batch are unique. This requires some approximation to
-// estimate the post-rejection sampling probabilities.
-//	range_max: The sampler will sample integers from the interval [0, range_max).
+// Op removes and returns the values associated with the key
 //
-// Returns A vector of length num_sampled, in which each element is
-// the ID of a sampled candidate.A batch_size * num_true matrix, representing
-// the number of times each candidate is expected to occur in a batch
-// of sampled candidates. If unique=true, then this is a probability.A vector of length num_sampled, for each sampled
-// candidate representing the number of times the candidate is expected
-// to occur in a batch of sampled candidates.  If unique=true, then this is a
-// probability.
-func FixedUnigramCandidateSampler(scope *Scope, true_classes tf.Output, num_true int64, num_sampled int64, unique bool, range_max int64, optional ...FixedUnigramCandidateSamplerAttr) (sampled_candidates tf.Output, true_expected_count tf.Output, sampled_expected_count tf.Output) {
+// from the underlying container.   If the underlying container
+// does not contain this key, the op will block until it does.
+func MapUnstage(scope *Scope, key tf.Output, indices tf.Output, dtypes []tf.DataType, optional ...MapUnstageAttr) (values []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"num_true": num_true, "num_sampled": num_sampled, "unique": unique, "range_max": range_max}
+	attrs := map[string]interface{}{"dtypes": dtypes}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "FixedUnigramCandidateSampler",
+		Type: "MapUnstage",
 		Input: []tf.Input{
-			true_classes,
+			key, indices,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
+	if scope.Err() != nil {
+		return
+	}
+	var idx int
+	var err error
+	if values, idx, err = makeOutputList(op, idx, "values"); err != nil {
+		scope.UpdateErr("MapUnstage", err)
+		return
+	}
+	return values
 }
 
-// WholeFileReaderV2Attr is an optional argument to WholeFileReaderV2.
-type WholeFileReaderV2Attr func(optionalAttr)
+// MapIncompleteSizeAttr is an optional argument to MapIncompleteSize.
+type MapIncompleteSizeAttr func(optionalAttr)
 
-// WholeFileReaderV2Container sets the optional container attribute to value.
+// MapIncompleteSizeCapacity sets the optional capacity attribute to value.
+// If not specified, defaults to 0
 //
-// value: If non-empty, this reader is placed in the given container.
-// Otherwise, a default container is used.
+// REQUIRES: value >= 0
+func MapIncompleteSizeCapacity(value int64) MapIncompleteSizeAttr {
+	return func(m optionalAttr) {
+		m["capacity"] = value
+	}
+}
+
+// MapIncompleteSizeMemoryLimit sets the optional memory_limit attribute to value.
+// If not specified, defaults to 0
+//
+// REQUIRES: value >= 0
+func MapIncompleteSizeMemoryLimit(value int64) MapIncompleteSizeAttr {
+	return func(m optionalAttr) {
+		m["memory_limit"] = value
+	}
+}
+
+// MapIncompleteSizeContainer sets the optional container attribute to value.
 // If not specified, defaults to ""
-func WholeFileReaderV2Container(value string) WholeFileReaderV2Attr {
+func MapIncompleteSizeContainer(value string) MapIncompleteSizeAttr {
 	return func(m optionalAttr) {
 		m["container"] = value
 	}
 }
 
-// WholeFileReaderV2SharedName sets the optional shared_name attribute to value.
-//
-// value: If non-empty, this reader is named in the given bucket
-// with this shared_name. Otherwise, the node name is used instead.
+// MapIncompleteSizeSharedName sets the optional shared_name attribute to value.
 // If not specified, defaults to ""
-func WholeFileReaderV2SharedName(value string) WholeFileReaderV2Attr {
+func MapIncompleteSizeSharedName(value string) MapIncompleteSizeAttr {
 	return func(m optionalAttr) {
 		m["shared_name"] = value
 	}
 }
 
-// A Reader that outputs the entire contents of a file as a value.
-//
-// To use, enqueue filenames in a Queue.  The output of ReaderRead will
-// be a filename (key) and the contents of that file (value).
-//
-// Returns The handle to reference the Reader.
-func WholeFileReaderV2(scope *Scope, optional ...WholeFileReaderV2Attr) (reader_handle tf.Output) {
+// Op returns the number of incomplete elements in the underlying container.
+func MapIncompleteSize(scope *Scope, dtypes []tf.DataType, optional ...MapIncompleteSizeAttr) (size tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
+	attrs := map[string]interface{}{"dtypes": dtypes}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "WholeFileReaderV2",
+		Type: "MapIncompleteSize",
 
 		Attrs: attrs,
 	}
@@ -29494,48 +28868,61 @@ func WholeFileReaderV2(scope *Scope, optional ...WholeFileReaderV2Attr) (reader_
 	return op.Output(0)
 }
 
-// Transforms a tf.Example proto (as a string) into typed tensors.
+// OrderedMapUnstageAttr is an optional argument to OrderedMapUnstage.
+type OrderedMapUnstageAttr func(optionalAttr)
+
+// OrderedMapUnstageCapacity sets the optional capacity attribute to value.
+// If not specified, defaults to 0
 //
-// Arguments:
-//	serialized: A vector containing a batch of binary serialized Example protos.
-//	dense_defaults: A list of Tensors (some may be empty), whose length matches
-// the length of `dense_keys`. dense_defaults[j] provides default values
-// when the example's feature_map lacks dense_key[j].  If an empty Tensor is
-// provided for dense_defaults[j], then the Feature dense_keys[j] is required.
-// The input type is inferred from dense_defaults[j], even when it's empty.
-// If dense_defaults[j] is not empty, and dense_shapes[j] is fully defined,
-// then the shape of dense_defaults[j] must match that of dense_shapes[j].
-// If dense_shapes[j] has an undefined major dimension (variable strides dense
-// feature), dense_defaults[j] must contain a single element:
-// the padding element.
-//	num_sparse: The number of sparse features to be parsed from the example. This
-// must match the lengths of `sparse_keys` and `sparse_types`.
-//	sparse_keys: A list of `num_sparse` strings.
-// The keys expected in the Examples' features associated with sparse values.
-//	dense_keys: The keys expected in the Examples' features associated with dense
-// values.
-//	sparse_types: A list of `num_sparse` types; the data types of data in each
-// Feature given in sparse_keys.
-// Currently the ParseSingleExample op supports DT_FLOAT (FloatList),
-// DT_INT64 (Int64List), and DT_STRING (BytesList).
-//	dense_shapes: The shapes of data in each Feature given in dense_keys.
-// The length of this list must match the length of `dense_keys`.  The
-// number of elements in the Feature corresponding to dense_key[j] must
-// always equal dense_shapes[j].NumEntries().  If dense_shapes[j] ==
-// (D0, D1, ..., DN) then the shape of output Tensor dense_values[j]
-// will be (D0, D1, ..., DN): In the case dense_shapes[j] = (-1, D1,
-// ..., DN), the shape of the output Tensor dense_values[j] will be (M,
-// D1, .., DN), where M is the number of blocks of elements of length
-// D1 * .... * DN, in the input.
-func ParseSingleExample(scope *Scope, serialized tf.Output, dense_defaults []tf.Output, num_sparse int64, sparse_keys []string, dense_keys []string, sparse_types []tf.DataType, dense_shapes []tf.Shape) (sparse_indices []tf.Output, sparse_values []tf.Output, sparse_shapes []tf.Output, dense_values []tf.Output) {
+// REQUIRES: value >= 0
+func OrderedMapUnstageCapacity(value int64) OrderedMapUnstageAttr {
+	return func(m optionalAttr) {
+		m["capacity"] = value
+	}
+}
+
+// OrderedMapUnstageMemoryLimit sets the optional memory_limit attribute to value.
+// If not specified, defaults to 0
+//
+// REQUIRES: value >= 0
+func OrderedMapUnstageMemoryLimit(value int64) OrderedMapUnstageAttr {
+	return func(m optionalAttr) {
+		m["memory_limit"] = value
+	}
+}
+
+// OrderedMapUnstageContainer sets the optional container attribute to value.
+// If not specified, defaults to ""
+func OrderedMapUnstageContainer(value string) OrderedMapUnstageAttr {
+	return func(m optionalAttr) {
+		m["container"] = value
+	}
+}
+
+// OrderedMapUnstageSharedName sets the optional shared_name attribute to value.
+// If not specified, defaults to ""
+func OrderedMapUnstageSharedName(value string) OrderedMapUnstageAttr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
+	}
+}
+
+// Op removes and returns the values associated with the key
+//
+// from the underlying container.   If the underlying container
+// does not contain this key, the op will block until it does.
+func OrderedMapUnstage(scope *Scope, key tf.Output, indices tf.Output, dtypes []tf.DataType, optional ...OrderedMapUnstageAttr) (values []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"num_sparse": num_sparse, "sparse_keys": sparse_keys, "dense_keys": dense_keys, "sparse_types": sparse_types, "dense_shapes": dense_shapes}
+	attrs := map[string]interface{}{"dtypes": dtypes}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "ParseSingleExample",
+		Type: "OrderedMapUnstage",
 		Input: []tf.Input{
-			serialized, tf.OutputList(dense_defaults),
+			key, indices,
 		},
 		Attrs: attrs,
 	}
@@ -29545,299 +28932,206 @@ func ParseSingleExample(scope *Scope, serialized tf.Output, dense_defaults []tf.
 	}
 	var idx int
 	var err error
-	if sparse_indices, idx, err = makeOutputList(op, idx, "sparse_indices"); err != nil {
-		scope.UpdateErr("ParseSingleExample", err)
-		return
-	}
-	if sparse_values, idx, err = makeOutputList(op, idx, "sparse_values"); err != nil {
-		scope.UpdateErr("ParseSingleExample", err)
-		return
-	}
-	if sparse_shapes, idx, err = makeOutputList(op, idx, "sparse_shapes"); err != nil {
-		scope.UpdateErr("ParseSingleExample", err)
-		return
-	}
-	if dense_values, idx, err = makeOutputList(op, idx, "dense_values"); err != nil {
-		scope.UpdateErr("ParseSingleExample", err)
+	if values, idx, err = makeOutputList(op, idx, "values"); err != nil {
+		scope.UpdateErr("OrderedMapUnstage", err)
 		return
 	}
-	return sparse_indices, sparse_values, sparse_shapes, dense_values
+	return values
 }
 
-// Deserializes a serialized tree ensemble config and replaces current tree
-//
-// ensemble.
-//
-// Arguments:
-//	tree_ensemble_handle: Handle to the tree ensemble.
-//	stamp_token: Token to use as the new value of the resource stamp.
-//	tree_ensemble_serialized: Serialized proto of the ensemble.
+// OrderedMapSizeAttr is an optional argument to OrderedMapSize.
+type OrderedMapSizeAttr func(optionalAttr)
+
+// OrderedMapSizeCapacity sets the optional capacity attribute to value.
+// If not specified, defaults to 0
 //
-// Returns the created operation.
-func BoostedTreesDeserializeEnsemble(scope *Scope, tree_ensemble_handle tf.Output, stamp_token tf.Output, tree_ensemble_serialized tf.Output) (o *tf.Operation) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "BoostedTreesDeserializeEnsemble",
-		Input: []tf.Input{
-			tree_ensemble_handle, stamp_token, tree_ensemble_serialized,
-		},
+// REQUIRES: value >= 0
+func OrderedMapSizeCapacity(value int64) OrderedMapSizeAttr {
+	return func(m optionalAttr) {
+		m["capacity"] = value
 	}
-	return scope.AddOperation(opspec)
 }
 
-// Runs multiple additive regression ensemble predictors on input instances and
-//
-// computes the update to cached logits. It is designed to be used during training.
-// It traverses the trees starting from cached tree id and cached node id and
-// calculates the updates to be pushed to the cache.
-//
-// Arguments:
-//
-//	cached_tree_ids: Rank 1 Tensor containing cached tree ids which is the starting
-// tree of prediction.
-//	cached_node_ids: Rank 1 Tensor containing cached node id which is the starting
-// node of prediction.
-//	bucketized_features: A list of rank 1 Tensors containing bucket id for each
-// feature.
-//	logits_dimension: scalar, dimension of the logits, to be used for partial logits
-// shape.
+// OrderedMapSizeMemoryLimit sets the optional memory_limit attribute to value.
+// If not specified, defaults to 0
 //
-// Returns Rank 2 Tensor containing logits update (with respect to cached
-// values stored) for each example.Rank 1 Tensor containing new tree ids for each example.Rank 1 Tensor containing new node ids in the new tree_ids.
-func BoostedTreesTrainingPredict(scope *Scope, tree_ensemble_handle tf.Output, cached_tree_ids tf.Output, cached_node_ids tf.Output, bucketized_features []tf.Output, logits_dimension int64) (partial_logits tf.Output, tree_ids tf.Output, node_ids tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"logits_dimension": logits_dimension}
-	opspec := tf.OpSpec{
-		Type: "BoostedTreesTrainingPredict",
-		Input: []tf.Input{
-			tree_ensemble_handle, cached_tree_ids, cached_node_ids, tf.OutputList(bucketized_features),
-		},
-		Attrs: attrs,
+// REQUIRES: value >= 0
+func OrderedMapSizeMemoryLimit(value int64) OrderedMapSizeAttr {
+	return func(m optionalAttr) {
+		m["memory_limit"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// Elementwise computes the bitwise AND of `x` and `y`.
-//
-// The result will have those bits set, that are set in both `x` and `y`. The
-// computation is performed on the underlying representations of `x` and `y`.
-func BitwiseAnd(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
-	if scope.Err() != nil {
-		return
+// OrderedMapSizeContainer sets the optional container attribute to value.
+// If not specified, defaults to ""
+func OrderedMapSizeContainer(value string) OrderedMapSizeAttr {
+	return func(m optionalAttr) {
+		m["container"] = value
 	}
-	opspec := tf.OpSpec{
-		Type: "BitwiseAnd",
-		Input: []tf.Input{
-			x, y,
-		},
+}
+
+// OrderedMapSizeSharedName sets the optional shared_name attribute to value.
+// If not specified, defaults to ""
+func OrderedMapSizeSharedName(value string) OrderedMapSizeAttr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// Elementwise computes the bitwise left-shift of `x` and `y`.
-//
-// If `y` is negative, or greater than or equal to the width of `x` in bits the
-// result is implementation defined.
-func LeftShift(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
+// Op returns the number of elements in the underlying container.
+func OrderedMapSize(scope *Scope, dtypes []tf.DataType, optional ...OrderedMapSizeAttr) (size tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"dtypes": dtypes}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "LeftShift",
-		Input: []tf.Input{
-			x, y,
-		},
+		Type: "OrderedMapSize",
+
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// TensorListStackAttr is an optional argument to TensorListStack.
-type TensorListStackAttr func(optionalAttr)
+// ShapeNAttr is an optional argument to ShapeN.
+type ShapeNAttr func(optionalAttr)
 
-// TensorListStackNumElements sets the optional num_elements attribute to value.
-// If not specified, defaults to -1
-func TensorListStackNumElements(value int64) TensorListStackAttr {
+// ShapeNOutType sets the optional out_type attribute to value.
+// If not specified, defaults to DT_INT32
+func ShapeNOutType(value tf.DataType) ShapeNAttr {
 	return func(m optionalAttr) {
-		m["num_elements"] = value
+		m["out_type"] = value
 	}
 }
 
-// Stacks all tensors in the list.
-//
-// Requires that all tensors have the same shape.
-//
-// input_handle: the input list
-// tensor: the gathered result
-// num_elements: optional. If not -1, the number of elements in the list.
+// Returns shape of tensors.
 //
-func TensorListStack(scope *Scope, input_handle tf.Output, element_dtype tf.DataType, optional ...TensorListStackAttr) (tensor tf.Output) {
+// This operation returns N 1-D integer tensors representing shape of `input[i]s`.
+func ShapeN(scope *Scope, input []tf.Output, optional ...ShapeNAttr) (output []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"element_dtype": element_dtype}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "TensorListStack",
+		Type: "ShapeN",
 		Input: []tf.Input{
-			input_handle,
+			tf.OutputList(input),
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Elementwise computes the bitwise right-shift of `x` and `y`.
-//
-// Performs a logical shift for unsigned integer types, and an arithmetic shift
-// for signed integer types.
-//
-// If `y` is negative, or greater than or equal to than the width of `x` in bits
-// the result is implementation defined.
-func RightShift(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	opspec := tf.OpSpec{
-		Type: "RightShift",
-		Input: []tf.Input{
-			x, y,
-		},
+	var idx int
+	var err error
+	if output, idx, err = makeOutputList(op, idx, "output"); err != nil {
+		scope.UpdateErr("ShapeN", err)
+		return
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return output
 }
 
-// Adjust the hue of one or more images.
-//
-// `images` is a tensor of at least 3 dimensions.  The last dimension is
-// interpretted as channels, and must be three.
-//
-// The input image is considered in the RGB colorspace. Conceptually, the RGB
-// colors are first mapped into HSV. A delta is then applied all the hue values,
-// and then remapped back to RGB colorspace.
-//
-// Arguments:
-//	images: Images to adjust.  At least 3-D.
-//	delta: A float delta to add to the hue.
-//
-// Returns The hue-adjusted image or images.
-func AdjustHue(scope *Scope, images tf.Output, delta tf.Output) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "AdjustHue",
-		Input: []tf.Input{
-			images, delta,
-		},
+// CudnnRNNParamsToCanonicalAttr is an optional argument to CudnnRNNParamsToCanonical.
+type CudnnRNNParamsToCanonicalAttr func(optionalAttr)
+
+// CudnnRNNParamsToCanonicalRnnMode sets the optional rnn_mode attribute to value.
+// If not specified, defaults to "lstm"
+func CudnnRNNParamsToCanonicalRnnMode(value string) CudnnRNNParamsToCanonicalAttr {
+	return func(m optionalAttr) {
+		m["rnn_mode"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// BatchAttr is an optional argument to Batch.
-type BatchAttr func(optionalAttr)
-
-// BatchMaxEnqueuedBatches sets the optional max_enqueued_batches attribute to value.
-// If not specified, defaults to 10
-func BatchMaxEnqueuedBatches(value int64) BatchAttr {
+// CudnnRNNParamsToCanonicalInputMode sets the optional input_mode attribute to value.
+// If not specified, defaults to "linear_input"
+func CudnnRNNParamsToCanonicalInputMode(value string) CudnnRNNParamsToCanonicalAttr {
 	return func(m optionalAttr) {
-		m["max_enqueued_batches"] = value
+		m["input_mode"] = value
 	}
 }
 
-// BatchAllowedBatchSizes sets the optional allowed_batch_sizes attribute to value.
-// If not specified, defaults to <>
-func BatchAllowedBatchSizes(value []int64) BatchAttr {
+// CudnnRNNParamsToCanonicalDirection sets the optional direction attribute to value.
+// If not specified, defaults to "unidirectional"
+func CudnnRNNParamsToCanonicalDirection(value string) CudnnRNNParamsToCanonicalAttr {
 	return func(m optionalAttr) {
-		m["allowed_batch_sizes"] = value
+		m["direction"] = value
 	}
 }
 
-// BatchContainer sets the optional container attribute to value.
-// If not specified, defaults to ""
-func BatchContainer(value string) BatchAttr {
+// CudnnRNNParamsToCanonicalDropout sets the optional dropout attribute to value.
+// If not specified, defaults to 0
+func CudnnRNNParamsToCanonicalDropout(value float32) CudnnRNNParamsToCanonicalAttr {
 	return func(m optionalAttr) {
-		m["container"] = value
+		m["dropout"] = value
 	}
 }
 
-// BatchSharedName sets the optional shared_name attribute to value.
-// If not specified, defaults to ""
-func BatchSharedName(value string) BatchAttr {
+// CudnnRNNParamsToCanonicalSeed sets the optional seed attribute to value.
+// If not specified, defaults to 0
+func CudnnRNNParamsToCanonicalSeed(value int64) CudnnRNNParamsToCanonicalAttr {
 	return func(m optionalAttr) {
-		m["shared_name"] = value
+		m["seed"] = value
 	}
 }
 
-// BatchBatchingQueue sets the optional batching_queue attribute to value.
-// If not specified, defaults to ""
-func BatchBatchingQueue(value string) BatchAttr {
+// CudnnRNNParamsToCanonicalSeed2 sets the optional seed2 attribute to value.
+// If not specified, defaults to 0
+func CudnnRNNParamsToCanonicalSeed2(value int64) CudnnRNNParamsToCanonicalAttr {
 	return func(m optionalAttr) {
-		m["batching_queue"] = value
+		m["seed2"] = value
 	}
 }
 
-// Batches all input tensors nondeterministically.
-//
-// When many instances of this Op are being run concurrently with the same
-// container/shared_name in the same device, some will output zero-shaped Tensors
-// and others will output Tensors of size up to max_batch_size.
-//
-// All Tensors in in_tensors are batched together (so, for example, labels and
-// features should be batched with a single instance of this operation.
-//
-// Each invocation of batch emits an `id` scalar which will be used to identify
-// this particular invocation when doing unbatch or its gradient.
+// Retrieves CudnnRNN params in canonical form.
 //
-// Each op which emits a non-empty batch will also emit a non-empty batch_index
-// Tensor, which, is a [K, 3] matrix where each row contains the invocation's id,
-// start, and length of elements of each set of Tensors present in batched_tensors.
+// Retrieves a set of weights from the opaque params buffer that can be saved and
+// restored in a way compatible with future runs.
 //
-// Batched tensors are concatenated along the first dimension, and all tensors in
-// in_tensors must have the first dimension of the same size.
+// Note that the params buffer may not be compatible across different GPUs. So any
+// save and restoration should be converted to and from the canonical weights and
+// biases.
 //
-// in_tensors: The tensors to be batched.
-// num_batch_threads: Number of scheduling threads for processing batches of work.
-//  Determines the number of batches processed in parallel.
-// max_batch_size: Batch sizes will never be bigger than this.
-// batch_timeout_micros: Maximum number of microseconds to wait before outputting
-//  an incomplete batch.
-// allowed_batch_sizes: Optional list of allowed batch sizes. If left empty, does
-//  nothing. Otherwise, supplies a list of batch sizes, causing the op to pad
-//  batches up to one of those sizes. The entries must increase monotonically, and
-//  the final entry must equal max_batch_size.
-// grad_timeout_micros: The timeout to use for the gradient. See Unbatch.
-// batched_tensors: Either empty tensors or a batch of concatenated Tensors.
-// batch_index: If out_tensors is non-empty, has information to invert it.
-// container: Controls the scope of sharing of this batch.
-// id: always contains a scalar with a unique ID for this invocation of Batch.
-// shared_name: Concurrently running instances of batch in the same device with the
-//  same container and shared_name will batch their elements together. If left
-//  empty, the op name will be used as the shared name.
-// T: the types of tensors to be batched.
-func Batch(scope *Scope, in_tensors []tf.Output, num_batch_threads int64, max_batch_size int64, batch_timeout_micros int64, grad_timeout_micros int64, optional ...BatchAttr) (batched_tensors []tf.Output, batch_index tf.Output, id tf.Output) {
+// num_layers: Specifies the number of layers in the RNN model.
+// num_units: Specifies the size of the hidden state.
+// input_size: Specifies the size of the input state.
+// num_params: number of parameter sets for all layers.
+//     Each layer may contain multiple parameter sets, with each set consisting of
+//     a weight matrix and a bias vector.
+// weights: the canonical form of weights that can be used for saving
+//     and restoration. They are more likely to be compatible across different
+//     generations.
+// biases: the canonical form of biases that can be used for saving
+//     and restoration. They are more likely to be compatible across different
+//     generations.
+// rnn_mode: Indicates the type of the RNN model.
+// input_mode: Indicate whether there is a linear projection between the input and
+//     The actual computation before the first layer. 'skip_input' is only allowed
+//     when input_size == num_units; 'auto_select' implies 'skip_input' when
+//     input_size == num_units; otherwise, it implies 'linear_input'.
+// direction: Indicates whether a bidirectional model will be used.
+//     dir = (direction == bidirectional) ? 2 : 1
+// dropout: dropout probability. When set to 0., dropout is disabled.
+// seed: the 1st part of a seed to initialize dropout.
+// seed2: the 2nd part of a seed to initialize dropout.
+func CudnnRNNParamsToCanonical(scope *Scope, num_layers tf.Output, num_units tf.Output, input_size tf.Output, params tf.Output, num_params int64, optional ...CudnnRNNParamsToCanonicalAttr) (weights []tf.Output, biases []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"num_batch_threads": num_batch_threads, "max_batch_size": max_batch_size, "batch_timeout_micros": batch_timeout_micros, "grad_timeout_micros": grad_timeout_micros}
+	attrs := map[string]interface{}{"num_params": num_params}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "Batch",
+		Type: "CudnnRNNParamsToCanonical",
 		Input: []tf.Input{
-			tf.OutputList(in_tensors),
+			num_layers, num_units, input_size, params,
 		},
 		Attrs: attrs,
 	}
@@ -29847,241 +29141,808 @@ func Batch(scope *Scope, in_tensors []tf.Output, num_batch_threads int64, max_ba
 	}
 	var idx int
 	var err error
-	if batched_tensors, idx, err = makeOutputList(op, idx, "batched_tensors"); err != nil {
-		scope.UpdateErr("Batch", err)
+	if weights, idx, err = makeOutputList(op, idx, "weights"); err != nil {
+		scope.UpdateErr("CudnnRNNParamsToCanonical", err)
 		return
 	}
-	batch_index = op.Output(idx)
-	id = op.Output(idx)
-	return batched_tensors, batch_index, id
+	if biases, idx, err = makeOutputList(op, idx, "biases"); err != nil {
+		scope.UpdateErr("CudnnRNNParamsToCanonical", err)
+		return
+	}
+	return weights, biases
 }
 
-// UnbatchAttr is an optional argument to Unbatch.
-type UnbatchAttr func(optionalAttr)
+// UniformCandidateSamplerAttr is an optional argument to UniformCandidateSampler.
+type UniformCandidateSamplerAttr func(optionalAttr)
 
-// UnbatchContainer sets the optional container attribute to value.
-// If not specified, defaults to ""
-func UnbatchContainer(value string) UnbatchAttr {
+// UniformCandidateSamplerSeed sets the optional seed attribute to value.
+//
+// value: If either seed or seed2 are set to be non-zero, the random number
+// generator is seeded by the given seed.  Otherwise, it is seeded by a
+// random seed.
+// If not specified, defaults to 0
+func UniformCandidateSamplerSeed(value int64) UniformCandidateSamplerAttr {
 	return func(m optionalAttr) {
-		m["container"] = value
+		m["seed"] = value
 	}
 }
 
-// UnbatchSharedName sets the optional shared_name attribute to value.
-// If not specified, defaults to ""
-func UnbatchSharedName(value string) UnbatchAttr {
+// UniformCandidateSamplerSeed2 sets the optional seed2 attribute to value.
+//
+// value: An second seed to avoid seed collision.
+// If not specified, defaults to 0
+func UniformCandidateSamplerSeed2(value int64) UniformCandidateSamplerAttr {
 	return func(m optionalAttr) {
-		m["shared_name"] = value
+		m["seed2"] = value
 	}
 }
 
-// Reverses the operation of Batch for a single output Tensor.
+// Generates labels for candidate sampling with a uniform distribution.
 //
-// An instance of Unbatch either receives an empty batched_tensor, in which case it
-// asynchronously waits until the values become available from a concurrently
-// running instance of Unbatch with the same container and shared_name, or receives
-// a non-empty batched_tensor in which case it finalizes all other concurrently
-// running instances and outputs its own element from the batch.
+// See explanations of candidate sampling and the data formats at
+// go/candidate-sampling.
 //
-// batched_tensor: The possibly transformed output of Batch. The size of the first
-//  dimension should remain unchanged by the transformations for the operation to
-//  work.
-// batch_index: The matching batch_index obtained from Batch.
-// id: The id scalar emitted by Batch.
-// unbatched_tensor: The Tensor corresponding to this execution.
-// timeout_micros: Maximum amount of time (in microseconds) to wait to receive the
-//  batched input tensor associated with a given invocation of the op.
-// container: Container to control resource sharing.
-// shared_name: Instances of Unbatch with the same container and shared_name are
-//  assumed to possibly belong to the same batch. If left empty, the op name will
-//  be used as the shared name.
-func Unbatch(scope *Scope, batched_tensor tf.Output, batch_index tf.Output, id tf.Output, timeout_micros int64, optional ...UnbatchAttr) (unbatched_tensor tf.Output) {
+// For each batch, this op picks a single set of sampled candidate labels.
+//
+// The advantages of sampling candidates per-batch are simplicity and the
+// possibility of efficient dense matrix multiplication. The disadvantage is that
+// the sampled candidates must be chosen independently of the context and of the
+// true labels.
+//
+// Arguments:
+//	true_classes: A batch_size * num_true matrix, in which each row contains the
+// IDs of the num_true target_classes in the corresponding original label.
+//	num_true: Number of true labels per context.
+//	num_sampled: Number of candidates to randomly sample.
+//	unique: If unique is true, we sample with rejection, so that all sampled
+// candidates in a batch are unique. This requires some approximation to
+// estimate the post-rejection sampling probabilities.
+//	range_max: The sampler will sample integers from the interval [0, range_max).
+//
+// Returns A vector of length num_sampled, in which each element is
+// the ID of a sampled candidate.A batch_size * num_true matrix, representing
+// the number of times each candidate is expected to occur in a batch
+// of sampled candidates. If unique=true, then this is a probability.A vector of length num_sampled, for each sampled
+// candidate representing the number of times the candidate is expected
+// to occur in a batch of sampled candidates.  If unique=true, then this is a
+// probability.
+func UniformCandidateSampler(scope *Scope, true_classes tf.Output, num_true int64, num_sampled int64, unique bool, range_max int64, optional ...UniformCandidateSamplerAttr) (sampled_candidates tf.Output, true_expected_count tf.Output, sampled_expected_count tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"timeout_micros": timeout_micros}
+	attrs := map[string]interface{}{"num_true": num_true, "num_sampled": num_sampled, "unique": unique, "range_max": range_max}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "Unbatch",
+		Type: "UniformCandidateSampler",
 		Input: []tf.Input{
-			batched_tensor, batch_index, id,
+			true_classes,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// AvgPool3DGradAttr is an optional argument to AvgPool3DGrad.
-type AvgPool3DGradAttr func(optionalAttr)
+// CTCLossAttr is an optional argument to CTCLoss.
+type CTCLossAttr func(optionalAttr)
 
-// AvgPool3DGradDataFormat sets the optional data_format attribute to value.
+// CTCLossPreprocessCollapseRepeated sets the optional preprocess_collapse_repeated attribute to value.
 //
-// value: The data format of the input and output data. With the
-// default format "NDHWC", the data is stored in the order of:
-//     [batch, in_depth, in_height, in_width, in_channels].
-// Alternatively, the format could be "NCDHW", the data storage order is:
-//     [batch, in_channels, in_depth, in_height, in_width].
-// If not specified, defaults to "NDHWC"
-func AvgPool3DGradDataFormat(value string) AvgPool3DGradAttr {
-	return func(m optionalAttr) {
-		m["data_format"] = value
+// value: Scalar, if true then repeated labels are
+// collapsed prior to the CTC calculation.
+// If not specified, defaults to false
+func CTCLossPreprocessCollapseRepeated(value bool) CTCLossAttr {
+	return func(m optionalAttr) {
+		m["preprocess_collapse_repeated"] = value
 	}
 }
 
-// Computes gradients of average pooling function.
+// CTCLossCtcMergeRepeated sets the optional ctc_merge_repeated attribute to value.
+//
+// value: Scalar.  If set to false, *during* CTC calculation
+// repeated non-blank labels will not be merged and are interpreted as
+// individual labels.  This is a simplified version of CTC.
+// If not specified, defaults to true
+func CTCLossCtcMergeRepeated(value bool) CTCLossAttr {
+	return func(m optionalAttr) {
+		m["ctc_merge_repeated"] = value
+	}
+}
+
+// CTCLossIgnoreLongerOutputsThanInputs sets the optional ignore_longer_outputs_than_inputs attribute to value.
+//
+// value: Scalar. If set to true, during CTC
+// calculation, items that have longer output sequences than input sequences
+// are skipped: they don't contribute to the loss term and have zero-gradient.
+// If not specified, defaults to false
+func CTCLossIgnoreLongerOutputsThanInputs(value bool) CTCLossAttr {
+	return func(m optionalAttr) {
+		m["ignore_longer_outputs_than_inputs"] = value
+	}
+}
+
+// Calculates the CTC Loss (log probability) for each batch entry.  Also calculates
+//
+// the gradient.  This class performs the softmax operation for you, so inputs
+// should be e.g. linear projections of outputs by an LSTM.
 //
 // Arguments:
-//	orig_input_shape: The original input dimensions.
-//	grad: Output backprop of shape `[batch, depth, rows, cols, channels]`.
-//	ksize: 1-D tensor of length 5. The size of the window for each dimension of
-// the input tensor. Must have `ksize[0] = ksize[4] = 1`.
-//	strides: 1-D tensor of length 5. The stride of the sliding window for each
-// dimension of `input`. Must have `strides[0] = strides[4] = 1`.
-//	padding: The type of padding algorithm to use.
+//	inputs: 3-D, shape: `(max_time x batch_size x num_classes)`, the logits.
+//	labels_indices: The indices of a `SparseTensor<int32, 2>`.
+// `labels_indices(i, :) == [b, t]` means `labels_values(i)` stores the id for
+// `(batch b, time t)`.
+//	labels_values: The values (labels) associated with the given batch and time.
+//	sequence_length: A vector containing sequence lengths (batch).
 //
-// Returns The backprop for input.
-func AvgPool3DGrad(scope *Scope, orig_input_shape tf.Output, grad tf.Output, ksize []int64, strides []int64, padding string, optional ...AvgPool3DGradAttr) (output tf.Output) {
+// Returns A vector (batch) containing log-probabilities.The gradient of `loss`.  3-D, shape:
+// `(max_time x batch_size x num_classes)`.
+func CTCLoss(scope *Scope, inputs tf.Output, labels_indices tf.Output, labels_values tf.Output, sequence_length tf.Output, optional ...CTCLossAttr) (loss tf.Output, gradient tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"ksize": ksize, "strides": strides, "padding": padding}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "AvgPool3DGrad",
+		Type: "CTCLoss",
 		Input: []tf.Input{
-			orig_input_shape, grad,
+			inputs, labels_indices, labels_values, sequence_length,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0), op.Output(1)
+}
+
+// CTCGreedyDecoderAttr is an optional argument to CTCGreedyDecoder.
+type CTCGreedyDecoderAttr func(optionalAttr)
+
+// CTCGreedyDecoderMergeRepeated sets the optional merge_repeated attribute to value.
+//
+// value: If True, merge repeated classes in output.
+// If not specified, defaults to false
+func CTCGreedyDecoderMergeRepeated(value bool) CTCGreedyDecoderAttr {
+	return func(m optionalAttr) {
+		m["merge_repeated"] = value
+	}
+}
+
+// Performs greedy decoding on the logits given in inputs.
+//
+// A note about the attribute merge_repeated: if enabled, when
+// consecutive logits' maximum indices are the same, only the first of
+// these is emitted.  Labeling the blank '*', the sequence "A B B * B B"
+// becomes "A B B" if merge_repeated = True and "A B B B B" if
+// merge_repeated = False.
+//
+// Regardless of the value of merge_repeated, if the maximum index of a given
+// time and batch corresponds to the blank, index `(num_classes - 1)`, no new
+// element is emitted.
+//
+// Arguments:
+//	inputs: 3-D, shape: `(max_time x batch_size x num_classes)`, the logits.
+//	sequence_length: A vector containing sequence lengths, size `(batch_size)`.
+//
+// Returns Indices matrix, size `(total_decoded_outputs x 2)`,
+// of a `SparseTensor<int64, 2>`.  The rows store: [batch, time].Values vector, size: `(total_decoded_outputs)`,
+// of a `SparseTensor<int64, 2>`.  The vector stores the decoded classes.Shape vector, size `(2)`, of the decoded SparseTensor.
+// Values are: `[batch_size, max_decoded_length]`.Matrix, size `(batch_size x 1)`, containing sequence
+// log-probabilities.
+func CTCGreedyDecoder(scope *Scope, inputs tf.Output, sequence_length tf.Output, optional ...CTCGreedyDecoderAttr) (decoded_indices tf.Output, decoded_values tf.Output, decoded_shape tf.Output, log_probability tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "CTCGreedyDecoder",
+		Input: []tf.Input{
+			inputs, sequence_length,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
+	return op.Output(0), op.Output(1), op.Output(2), op.Output(3)
+}
+
+// Forwards `data` to the output port determined by `pred`.
+//
+// If `pred` is true, the `data` input is forwarded to `output_true`. Otherwise,
+// the data goes to `output_false`.
+//
+// See also `RefSwitch` and `Merge`.
+//
+// Arguments:
+//	data: The tensor to be forwarded to the appropriate output.
+//	pred: A scalar that specifies which output port will receive data.
+//
+// Returns If `pred` is false, data will be forwarded to this output.If `pred` is true, data will be forwarded to this output.
+func Switch(scope *Scope, data tf.Output, pred tf.Output) (output_false tf.Output, output_true tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "Switch",
+		Input: []tf.Input{
+			data, pred,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0), op.Output(1)
+}
+
+// Add all input tensors element wise.
+//
+// Arguments:
+//	inputs: Must all be the same size and shape.
+func AddN(scope *Scope, inputs []tf.Output) (sum tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "AddN",
+		Input: []tf.Input{
+			tf.OutputList(inputs),
+		},
+	}
+	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// ParseSingleSequenceExampleAttr is an optional argument to ParseSingleSequenceExample.
-type ParseSingleSequenceExampleAttr func(optionalAttr)
+// TryRpcAttr is an optional argument to TryRpc.
+type TryRpcAttr func(optionalAttr)
 
-// ParseSingleSequenceExampleContextSparseTypes sets the optional context_sparse_types attribute to value.
+// TryRpcProtocol sets the optional protocol attribute to value.
 //
-// value: A list of Ncontext_sparse types; the data types of data in
-// each context Feature given in context_sparse_keys.
-// Currently the ParseSingleSequenceExample supports DT_FLOAT (FloatList),
-// DT_INT64 (Int64List), and DT_STRING (BytesList).
-// If not specified, defaults to <>
+// value: RPC protocol to use.  Empty string means use the default protocol.
+// Options include 'grpc'.
+// If not specified, defaults to ""
+func TryRpcProtocol(value string) TryRpcAttr {
+	return func(m optionalAttr) {
+		m["protocol"] = value
+	}
+}
+
+// TryRpcFailFast sets the optional fail_fast attribute to value.
 //
-// REQUIRES: len(value) >= 0
-func ParseSingleSequenceExampleContextSparseTypes(value []tf.DataType) ParseSingleSequenceExampleAttr {
+// value: `boolean`. If `true` (default), then failures to connect
+// (i.e., the server does not immediately respond) cause an RPC failure.
+// If not specified, defaults to true
+func TryRpcFailFast(value bool) TryRpcAttr {
 	return func(m optionalAttr) {
-		m["context_sparse_types"] = value
+		m["fail_fast"] = value
 	}
 }
 
-// ParseSingleSequenceExampleFeatureListDenseTypes sets the optional feature_list_dense_types attribute to value.
-// If not specified, defaults to <>
+// TryRpcTimeoutInMs sets the optional timeout_in_ms attribute to value.
 //
-// REQUIRES: len(value) >= 0
-func ParseSingleSequenceExampleFeatureListDenseTypes(value []tf.DataType) ParseSingleSequenceExampleAttr {
+// value: `int`. If `0` (default), then the kernel will run the RPC
+// request and only time out if the RPC deadline passes or the session times out.
+// If this value is greater than `0`, then the op will raise an exception if
+// the RPC takes longer than `timeout_in_ms`.
+// If not specified, defaults to 0
+func TryRpcTimeoutInMs(value int64) TryRpcAttr {
 	return func(m optionalAttr) {
-		m["feature_list_dense_types"] = value
+		m["timeout_in_ms"] = value
 	}
 }
 
-// ParseSingleSequenceExampleContextDenseShapes sets the optional context_dense_shapes attribute to value.
-//
-// value: A list of Ncontext_dense shapes; the shapes of data in
-// each context Feature given in context_dense_keys.
-// The number of elements in the Feature corresponding to context_dense_key[j]
-// must always equal context_dense_shapes[j].NumEntries().
-// The shape of context_dense_values[j] will match context_dense_shapes[j].
-// If not specified, defaults to <>
+// Perform batches of RPC requests.
+//
+// This op asynchronously performs either a single RPC request, or a batch
+// of requests.  RPC requests are defined by three main parameters:
+//
+//   - `address` (the host+port or BNS address of the request)
+//   - `method` (the method name for the request)
+//   - `request` (the serialized proto string, or vector of strings,
+//      of the RPC request argument).
+//
+// For example, if you have an RPC service running on port localhost:2345,
+// and its interface is configured with the following proto declaration:
+//
+// ```
+// service MyService {
+//   rpc MyMethod(MyRequestProto) returns (MyResponseProto) {
+//   }
+// };
+// ```
+//
+// then call this op with arguments:
+//
+// ```
+// address = "localhost:2345"
+// method = "MyService/MyMethod"
+// ```
+//
+// The `request` tensor is a string tensor representing serialized `MyRequestProto`
+// strings; and the output string tensor `response` will have the same shape
+// and contain (upon successful completion) corresponding serialized
+// `MyResponseProto` strings.
+//
+// For example, to send a single, empty, `MyRequestProto`, call
+// this op with `request = ""`.  To send 5 **parallel** empty requests,
+// call this op with `request = ["", "", "", "", ""]`.
+//
+// More generally, one can create a batch of `MyRequestProto` serialized protos
+// from regular batched tensors using the `encode_proto` op, and convert
+// the response `MyResponseProto` serialized protos to batched tensors
+// using the `decode_proto` op.
+//
+// **NOTE** Working with serialized proto strings is faster than instantiating
+// actual proto objects in memory, so no performance degradation is expected
+// compared to writing custom kernels for this workflow.
+//
+// Unlike the standard `Rpc` op, if the connection fails or the remote worker
+// returns an error status, this op does **not** reraise the exception.
+// Instead, the `status_code` and `status_message` entry for the corresponding RPC
+// call is set with the error returned from the RPC call.  The `response` tensor
+// will contain valid response values for those minibatch entries whose RPCs did
+// not fail; the rest of the entries will have empty strings.
+//
+// Arguments:
+//	address: `0-D` or `1-D`.  The address (i.e. host_name:port) of the RPC server.
+// If this tensor has more than 1 element, then multiple parallel rpc requests
+// are sent.  This argument broadcasts with `method` and `request`.
+//	method: `0-D` or `1-D`.  The method address on the RPC server.
+// If this tensor has more than 1 element, then multiple parallel rpc requests
+// are sent.  This argument broadcasts with `address` and `request`.
+//	request: `0-D` or `1-D`.  Serialized proto strings: the rpc request argument.
+// If this tensor has more than 1 element, then multiple parallel rpc requests
+// are sent.  This argument broadcasts with `address` and `method`.
+//
+// Returns Same shape as `request`. Serialized proto strings: the rpc responses.Same shape as `request`.  Values correspond to tensorflow Status enum codes.Same shape as `request`.  Values correspond to Status messages
+// returned from the RPC calls.
+func TryRpc(scope *Scope, address tf.Output, method tf.Output, request tf.Output, optional ...TryRpcAttr) (response tf.Output, status_code tf.Output, status_message tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "TryRpc",
+		Input: []tf.Input{
+			address, method, request,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0), op.Output(1), op.Output(2)
+}
+
+// EnterAttr is an optional argument to Enter.
+type EnterAttr func(optionalAttr)
+
+// EnterIsConstant sets the optional is_constant attribute to value.
+//
+// value: If true, the output is constant within the child frame.
+// If not specified, defaults to false
+func EnterIsConstant(value bool) EnterAttr {
+	return func(m optionalAttr) {
+		m["is_constant"] = value
+	}
+}
+
+// EnterParallelIterations sets the optional parallel_iterations attribute to value.
+//
+// value: The number of iterations allowed to run in parallel.
+// If not specified, defaults to 10
+func EnterParallelIterations(value int64) EnterAttr {
+	return func(m optionalAttr) {
+		m["parallel_iterations"] = value
+	}
+}
+
+// Creates or finds a child frame, and makes `data` available to the child frame.
+//
+// This op is used together with `Exit` to create loops in the graph.
+// The unique `frame_name` is used by the `Executor` to identify frames. If
+// `is_constant` is true, `output` is a constant in the child frame; otherwise
+// it may be changed in the child frame. At most `parallel_iterations` iterations
+// are run in parallel in the child frame.
+//
+// Arguments:
+//	data: The tensor to be made available to the child frame.
+//	frame_name: The name of the child frame.
+//
+// Returns The same tensor as `data`.
+func Enter(scope *Scope, data tf.Output, frame_name string, optional ...EnterAttr) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"frame_name": frame_name}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "Enter",
+		Input: []tf.Input{
+			data,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Produce a string tensor that encodes the state of a Reader.
+//
+// Not all Readers support being serialized, so this can produce an
+// Unimplemented error.
+//
+// Arguments:
+//	reader_handle: Handle to a Reader.
+func ReaderSerializeStateV2(scope *Scope, reader_handle tf.Output) (state tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "ReaderSerializeStateV2",
+		Input: []tf.Input{
+			reader_handle,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Exits the current frame to its parent frame.
+//
+// Exit makes its input `data` available to the parent frame.
+//
+// Arguments:
+//	data: The tensor to be made available to the parent frame.
+//
+// Returns The same tensor as `data`.
+func Exit(scope *Scope, data tf.Output) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "Exit",
+		Input: []tf.Input{
+			data,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Returns a copy of the input tensor.
+func Snapshot(scope *Scope, input tf.Output) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "Snapshot",
+		Input: []tf.Input{
+			input,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Returns a tensor of zeros with the same shape and type as x.
+//
+// Arguments:
+//	x: a tensor of type T.
+//
+// Returns a tensor of the same shape and type as x but filled with zeros.
+func ZerosLike(scope *Scope, x tf.Output) (y tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "ZerosLike",
+		Input: []tf.Input{
+			x,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// AbortAttr is an optional argument to Abort.
+type AbortAttr func(optionalAttr)
+
+// AbortErrorMsg sets the optional error_msg attribute to value.
+//
+// value: A string which is the message associated with the exception.
+// If not specified, defaults to ""
+func AbortErrorMsg(value string) AbortAttr {
+	return func(m optionalAttr) {
+		m["error_msg"] = value
+	}
+}
+
+// AbortExitWithoutError sets the optional exit_without_error attribute to value.
+// If not specified, defaults to false
+func AbortExitWithoutError(value bool) AbortAttr {
+	return func(m optionalAttr) {
+		m["exit_without_error"] = value
+	}
+}
+
+// Raise a exception to abort the process when called.
+//
+// If exit_without_error is true, the process will exit normally,
+// otherwise it will exit with a SIGABORT signal.
+//
+// Returns nothing but an exception.
+//
+// Returns the created operation.
+func Abort(scope *Scope, optional ...AbortAttr) (o *tf.Operation) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "Abort",
+
+		Attrs: attrs,
+	}
+	return scope.AddOperation(opspec)
+}
+
+// FixedUnigramCandidateSamplerAttr is an optional argument to FixedUnigramCandidateSampler.
+type FixedUnigramCandidateSamplerAttr func(optionalAttr)
+
+// FixedUnigramCandidateSamplerVocabFile sets the optional vocab_file attribute to value.
+//
+// value: Each valid line in this file (which should have a CSV-like format)
+// corresponds to a valid word ID. IDs are in sequential order, starting from
+// num_reserved_ids. The last entry in each line is expected to be a value
+// corresponding to the count or relative probability. Exactly one of vocab_file
+// and unigrams needs to be passed to this op.
+// If not specified, defaults to ""
+func FixedUnigramCandidateSamplerVocabFile(value string) FixedUnigramCandidateSamplerAttr {
+	return func(m optionalAttr) {
+		m["vocab_file"] = value
+	}
+}
+
+// FixedUnigramCandidateSamplerDistortion sets the optional distortion attribute to value.
+//
+// value: The distortion is used to skew the unigram probability distribution.
+// Each weight is first raised to the distortion's power before adding to the
+// internal unigram distribution. As a result, distortion = 1.0 gives regular
+// unigram sampling (as defined by the vocab file), and distortion = 0.0 gives
+// a uniform distribution.
+// If not specified, defaults to 1
+func FixedUnigramCandidateSamplerDistortion(value float32) FixedUnigramCandidateSamplerAttr {
+	return func(m optionalAttr) {
+		m["distortion"] = value
+	}
+}
+
+// FixedUnigramCandidateSamplerNumReservedIds sets the optional num_reserved_ids attribute to value.
+//
+// value: Optionally some reserved IDs can be added in the range [0,
+// ..., num_reserved_ids) by the users. One use case is that a special unknown
+// word token is used as ID 0. These IDs will have a sampling probability of 0.
+// If not specified, defaults to 0
+func FixedUnigramCandidateSamplerNumReservedIds(value int64) FixedUnigramCandidateSamplerAttr {
+	return func(m optionalAttr) {
+		m["num_reserved_ids"] = value
+	}
+}
+
+// FixedUnigramCandidateSamplerNumShards sets the optional num_shards attribute to value.
+//
+// value: A sampler can be used to sample from a subset of the original range
+// in order to speed up the whole computation through parallelism. This parameter
+// (together with 'shard') indicates the number of partitions that are being
+// used in the overall computation.
+// If not specified, defaults to 1
+//
+// REQUIRES: value >= 1
+func FixedUnigramCandidateSamplerNumShards(value int64) FixedUnigramCandidateSamplerAttr {
+	return func(m optionalAttr) {
+		m["num_shards"] = value
+	}
+}
+
+// FixedUnigramCandidateSamplerShard sets the optional shard attribute to value.
+//
+// value: A sampler can be used to sample from a subset of the original range
+// in order to speed up the whole computation through parallelism. This parameter
+// (together with 'num_shards') indicates the particular partition number of a
+// sampler op, when partitioning is being used.
+// If not specified, defaults to 0
+//
+// REQUIRES: value >= 0
+func FixedUnigramCandidateSamplerShard(value int64) FixedUnigramCandidateSamplerAttr {
+	return func(m optionalAttr) {
+		m["shard"] = value
+	}
+}
+
+// FixedUnigramCandidateSamplerUnigrams sets the optional unigrams attribute to value.
+//
+// value: A list of unigram counts or probabilities, one per ID in sequential
+// order. Exactly one of vocab_file and unigrams should be passed to this op.
+// If not specified, defaults to <>
+func FixedUnigramCandidateSamplerUnigrams(value []float32) FixedUnigramCandidateSamplerAttr {
+	return func(m optionalAttr) {
+		m["unigrams"] = value
+	}
+}
+
+// FixedUnigramCandidateSamplerSeed sets the optional seed attribute to value.
+//
+// value: If either seed or seed2 are set to be non-zero, the random number
+// generator is seeded by the given seed.  Otherwise, it is seeded by a
+// random seed.
+// If not specified, defaults to 0
+func FixedUnigramCandidateSamplerSeed(value int64) FixedUnigramCandidateSamplerAttr {
+	return func(m optionalAttr) {
+		m["seed"] = value
+	}
+}
+
+// FixedUnigramCandidateSamplerSeed2 sets the optional seed2 attribute to value.
+//
+// value: An second seed to avoid seed collision.
+// If not specified, defaults to 0
+func FixedUnigramCandidateSamplerSeed2(value int64) FixedUnigramCandidateSamplerAttr {
+	return func(m optionalAttr) {
+		m["seed2"] = value
+	}
+}
+
+// Generates labels for candidate sampling with a learned unigram distribution.
+//
+// A unigram sampler could use a fixed unigram distribution read from a
+// file or passed in as an in-memory array instead of building up the distribution
+// from data on the fly. There is also an option to skew the distribution by
+// applying a distortion power to the weights.
+//
+// The vocabulary file should be in CSV-like format, with the last field
+// being the weight associated with the word.
+//
+// For each batch, this op picks a single set of sampled candidate labels.
+//
+// The advantages of sampling candidates per-batch are simplicity and the
+// possibility of efficient dense matrix multiplication. The disadvantage is that
+// the sampled candidates must be chosen independently of the context and of the
+// true labels.
+//
+// Arguments:
+//	true_classes: A batch_size * num_true matrix, in which each row contains the
+// IDs of the num_true target_classes in the corresponding original label.
+//	num_true: Number of true labels per context.
+//	num_sampled: Number of candidates to randomly sample.
+//	unique: If unique is true, we sample with rejection, so that all sampled
+// candidates in a batch are unique. This requires some approximation to
+// estimate the post-rejection sampling probabilities.
+//	range_max: The sampler will sample integers from the interval [0, range_max).
+//
+// Returns A vector of length num_sampled, in which each element is
+// the ID of a sampled candidate.A batch_size * num_true matrix, representing
+// the number of times each candidate is expected to occur in a batch
+// of sampled candidates. If unique=true, then this is a probability.A vector of length num_sampled, for each sampled
+// candidate representing the number of times the candidate is expected
+// to occur in a batch of sampled candidates.  If unique=true, then this is a
+// probability.
+func FixedUnigramCandidateSampler(scope *Scope, true_classes tf.Output, num_true int64, num_sampled int64, unique bool, range_max int64, optional ...FixedUnigramCandidateSamplerAttr) (sampled_candidates tf.Output, true_expected_count tf.Output, sampled_expected_count tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"num_true": num_true, "num_sampled": num_sampled, "unique": unique, "range_max": range_max}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "FixedUnigramCandidateSampler",
+		Input: []tf.Input{
+			true_classes,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0), op.Output(1), op.Output(2)
+}
+
+// WholeFileReaderV2Attr is an optional argument to WholeFileReaderV2.
+type WholeFileReaderV2Attr func(optionalAttr)
+
+// WholeFileReaderV2Container sets the optional container attribute to value.
 //
-// REQUIRES: len(value) >= 0
-func ParseSingleSequenceExampleContextDenseShapes(value []tf.Shape) ParseSingleSequenceExampleAttr {
+// value: If non-empty, this reader is placed in the given container.
+// Otherwise, a default container is used.
+// If not specified, defaults to ""
+func WholeFileReaderV2Container(value string) WholeFileReaderV2Attr {
 	return func(m optionalAttr) {
-		m["context_dense_shapes"] = value
+		m["container"] = value
 	}
 }
 
-// ParseSingleSequenceExampleFeatureListSparseTypes sets the optional feature_list_sparse_types attribute to value.
-//
-// value: A list of Nfeature_list_sparse types; the data types
-// of data in each FeatureList given in feature_list_sparse_keys.
-// Currently the ParseSingleSequenceExample supports DT_FLOAT (FloatList),
-// DT_INT64 (Int64List), and DT_STRING (BytesList).
-// If not specified, defaults to <>
+// WholeFileReaderV2SharedName sets the optional shared_name attribute to value.
 //
-// REQUIRES: len(value) >= 0
-func ParseSingleSequenceExampleFeatureListSparseTypes(value []tf.DataType) ParseSingleSequenceExampleAttr {
+// value: If non-empty, this reader is named in the given bucket
+// with this shared_name. Otherwise, the node name is used instead.
+// If not specified, defaults to ""
+func WholeFileReaderV2SharedName(value string) WholeFileReaderV2Attr {
 	return func(m optionalAttr) {
-		m["feature_list_sparse_types"] = value
+		m["shared_name"] = value
 	}
 }
 
-// ParseSingleSequenceExampleFeatureListDenseShapes sets the optional feature_list_dense_shapes attribute to value.
+// A Reader that outputs the entire contents of a file as a value.
 //
-// value: A list of Nfeature_list_dense shapes; the shapes of
-// data in each FeatureList given in feature_list_dense_keys.
-// The shape of each Feature in the FeatureList corresponding to
-// feature_list_dense_key[j] must always equal
-// feature_list_dense_shapes[j].NumEntries().
-// If not specified, defaults to <>
+// To use, enqueue filenames in a Queue.  The output of ReaderRead will
+// be a filename (key) and the contents of that file (value).
 //
-// REQUIRES: len(value) >= 0
-func ParseSingleSequenceExampleFeatureListDenseShapes(value []tf.Shape) ParseSingleSequenceExampleAttr {
-	return func(m optionalAttr) {
-		m["feature_list_dense_shapes"] = value
+// Returns The handle to reference the Reader.
+func WholeFileReaderV2(scope *Scope, optional ...WholeFileReaderV2Attr) (reader_handle tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "WholeFileReaderV2",
+
+		Attrs: attrs,
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// Transforms a scalar brain.SequenceExample proto (as strings) into typed tensors.
+// Transforms a tf.Example proto (as a string) into typed tensors.
 //
 // Arguments:
-//	serialized: A scalar containing a binary serialized SequenceExample proto.
-//	feature_list_dense_missing_assumed_empty: A vector listing the
-// FeatureList keys which may be missing from the SequenceExample.  If the
-// associated FeatureList is missing, it is treated as empty.  By default,
-// any FeatureList not listed in this vector must exist in the SequenceExample.
-//	context_sparse_keys: A list of Ncontext_sparse string Tensors (scalars).
-// The keys expected in the Examples' features associated with context_sparse
-// values.
-//	context_dense_keys: A list of Ncontext_dense string Tensors (scalars).
-// The keys expected in the SequenceExamples' context features associated with
-// dense values.
-//	feature_list_sparse_keys: A list of Nfeature_list_sparse string Tensors
-// (scalars).  The keys expected in the FeatureLists associated with sparse
+//	serialized: A vector containing a batch of binary serialized Example protos.
+//	dense_defaults: A list of Tensors (some may be empty), whose length matches
+// the length of `dense_keys`. dense_defaults[j] provides default values
+// when the example's feature_map lacks dense_key[j].  If an empty Tensor is
+// provided for dense_defaults[j], then the Feature dense_keys[j] is required.
+// The input type is inferred from dense_defaults[j], even when it's empty.
+// If dense_defaults[j] is not empty, and dense_shapes[j] is fully defined,
+// then the shape of dense_defaults[j] must match that of dense_shapes[j].
+// If dense_shapes[j] has an undefined major dimension (variable strides dense
+// feature), dense_defaults[j] must contain a single element:
+// the padding element.
+//	num_sparse: The number of sparse features to be parsed from the example. This
+// must match the lengths of `sparse_keys` and `sparse_types`.
+//	sparse_keys: A list of `num_sparse` strings.
+// The keys expected in the Examples' features associated with sparse values.
+//	dense_keys: The keys expected in the Examples' features associated with dense
 // values.
-//	feature_list_dense_keys: A list of Nfeature_list_dense string Tensors (scalars).
-// The keys expected in the SequenceExamples' feature_lists associated
-// with lists of dense values.
-//	context_dense_defaults: A list of Ncontext_dense Tensors (some may be empty).
-// context_dense_defaults[j] provides default values
-// when the SequenceExample's context map lacks context_dense_key[j].
-// If an empty Tensor is provided for context_dense_defaults[j],
-// then the Feature context_dense_keys[j] is required.
-// The input type is inferred from context_dense_defaults[j], even when it's
-// empty.  If context_dense_defaults[j] is not empty, its shape must match
-// context_dense_shapes[j].
-//	debug_name: A scalar containing the name of the serialized proto.
-// May contain, for example, table key (descriptive) name for the
-// corresponding serialized proto.  This is purely useful for debugging
-// purposes, and the presence of values here has no effect on the output.
-// May also be an empty scalar if no name is available.
-func ParseSingleSequenceExample(scope *Scope, serialized tf.Output, feature_list_dense_missing_assumed_empty tf.Output, context_sparse_keys []tf.Output, context_dense_keys []tf.Output, feature_list_sparse_keys []tf.Output, feature_list_dense_keys []tf.Output, context_dense_defaults []tf.Output, debug_name tf.Output, optional ...ParseSingleSequenceExampleAttr) (context_sparse_indices []tf.Output, context_sparse_values []tf.Output, context_sparse_shapes []tf.Output, context_dense_values []tf.Output, feature_list_sparse_indices []tf.Output, feature_list_sparse_values []tf.Output, feature_list_sparse_shapes []tf.Output, feature_list_dense_values []tf.Output) {
+//	sparse_types: A list of `num_sparse` types; the data types of data in each
+// Feature given in sparse_keys.
+// Currently the ParseSingleExample op supports DT_FLOAT (FloatList),
+// DT_INT64 (Int64List), and DT_STRING (BytesList).
+//	dense_shapes: The shapes of data in each Feature given in dense_keys.
+// The length of this list must match the length of `dense_keys`.  The
+// number of elements in the Feature corresponding to dense_key[j] must
+// always equal dense_shapes[j].NumEntries().  If dense_shapes[j] ==
+// (D0, D1, ..., DN) then the shape of output Tensor dense_values[j]
+// will be (D0, D1, ..., DN): In the case dense_shapes[j] = (-1, D1,
+// ..., DN), the shape of the output Tensor dense_values[j] will be (M,
+// D1, .., DN), where M is the number of blocks of elements of length
+// D1 * .... * DN, in the input.
+func ParseSingleExample(scope *Scope, serialized tf.Output, dense_defaults []tf.Output, num_sparse int64, sparse_keys []string, dense_keys []string, sparse_types []tf.DataType, dense_shapes []tf.Shape) (sparse_indices []tf.Output, sparse_values []tf.Output, sparse_shapes []tf.Output, dense_values []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
+	attrs := map[string]interface{}{"num_sparse": num_sparse, "sparse_keys": sparse_keys, "dense_keys": dense_keys, "sparse_types": sparse_types, "dense_shapes": dense_shapes}
 	opspec := tf.OpSpec{
-		Type: "ParseSingleSequenceExample",
+		Type: "ParseSingleExample",
 		Input: []tf.Input{
-			serialized, feature_list_dense_missing_assumed_empty, tf.OutputList(context_sparse_keys), tf.OutputList(context_dense_keys), tf.OutputList(feature_list_sparse_keys), tf.OutputList(feature_list_dense_keys), tf.OutputList(context_dense_defaults), debug_name,
+			serialized, tf.OutputList(dense_defaults),
 		},
 		Attrs: attrs,
 	}
@@ -30091,346 +29952,508 @@ func ParseSingleSequenceExample(scope *Scope, serialized tf.Output, feature_list
 	}
 	var idx int
 	var err error
-	if context_sparse_indices, idx, err = makeOutputList(op, idx, "context_sparse_indices"); err != nil {
-		scope.UpdateErr("ParseSingleSequenceExample", err)
-		return
-	}
-	if context_sparse_values, idx, err = makeOutputList(op, idx, "context_sparse_values"); err != nil {
-		scope.UpdateErr("ParseSingleSequenceExample", err)
-		return
-	}
-	if context_sparse_shapes, idx, err = makeOutputList(op, idx, "context_sparse_shapes"); err != nil {
-		scope.UpdateErr("ParseSingleSequenceExample", err)
-		return
-	}
-	if context_dense_values, idx, err = makeOutputList(op, idx, "context_dense_values"); err != nil {
-		scope.UpdateErr("ParseSingleSequenceExample", err)
-		return
-	}
-	if feature_list_sparse_indices, idx, err = makeOutputList(op, idx, "feature_list_sparse_indices"); err != nil {
-		scope.UpdateErr("ParseSingleSequenceExample", err)
+	if sparse_indices, idx, err = makeOutputList(op, idx, "sparse_indices"); err != nil {
+		scope.UpdateErr("ParseSingleExample", err)
 		return
 	}
-	if feature_list_sparse_values, idx, err = makeOutputList(op, idx, "feature_list_sparse_values"); err != nil {
-		scope.UpdateErr("ParseSingleSequenceExample", err)
+	if sparse_values, idx, err = makeOutputList(op, idx, "sparse_values"); err != nil {
+		scope.UpdateErr("ParseSingleExample", err)
 		return
 	}
-	if feature_list_sparse_shapes, idx, err = makeOutputList(op, idx, "feature_list_sparse_shapes"); err != nil {
-		scope.UpdateErr("ParseSingleSequenceExample", err)
+	if sparse_shapes, idx, err = makeOutputList(op, idx, "sparse_shapes"); err != nil {
+		scope.UpdateErr("ParseSingleExample", err)
 		return
 	}
-	if feature_list_dense_values, idx, err = makeOutputList(op, idx, "feature_list_dense_values"); err != nil {
-		scope.UpdateErr("ParseSingleSequenceExample", err)
+	if dense_values, idx, err = makeOutputList(op, idx, "dense_values"); err != nil {
+		scope.UpdateErr("ParseSingleExample", err)
 		return
 	}
-	return context_sparse_indices, context_sparse_values, context_sparse_shapes, context_dense_values, feature_list_sparse_indices, feature_list_sparse_values, feature_list_sparse_shapes, feature_list_dense_values
+	return sparse_indices, sparse_values, sparse_shapes, dense_values
 }
 
-// UnbatchGradAttr is an optional argument to UnbatchGrad.
-type UnbatchGradAttr func(optionalAttr)
-
-// UnbatchGradContainer sets the optional container attribute to value.
-// If not specified, defaults to ""
-func UnbatchGradContainer(value string) UnbatchGradAttr {
-	return func(m optionalAttr) {
-		m["container"] = value
+// Deserializes a serialized tree ensemble config and replaces current tree
+//
+// ensemble.
+//
+// Arguments:
+//	tree_ensemble_handle: Handle to the tree ensemble.
+//	stamp_token: Token to use as the new value of the resource stamp.
+//	tree_ensemble_serialized: Serialized proto of the ensemble.
+//
+// Returns the created operation.
+func BoostedTreesDeserializeEnsemble(scope *Scope, tree_ensemble_handle tf.Output, stamp_token tf.Output, tree_ensemble_serialized tf.Output) (o *tf.Operation) {
+	if scope.Err() != nil {
+		return
 	}
-}
-
-// UnbatchGradSharedName sets the optional shared_name attribute to value.
-// If not specified, defaults to ""
-func UnbatchGradSharedName(value string) UnbatchGradAttr {
-	return func(m optionalAttr) {
-		m["shared_name"] = value
+	opspec := tf.OpSpec{
+		Type: "BoostedTreesDeserializeEnsemble",
+		Input: []tf.Input{
+			tree_ensemble_handle, stamp_token, tree_ensemble_serialized,
+		},
 	}
+	return scope.AddOperation(opspec)
 }
 
-// Gradient of Unbatch.
-//
-// Acts like Batch but using the given batch_index index of batching things as they
-// become available. This ensures that the gradients are propagated back in the
-// same session which did the forward pass.
+// Elementwise computes the bitwise AND of `x` and `y`.
 //
-// original_input: The input to the Unbatch operation this is the gradient of.
-// batch_index: The batch_index given to the Unbatch operation this is the gradient
-// of.
-// grad: The downstream gradient.
-// id: The id scalar emitted by Batch.
-// batched_grad: The return value, either an empty tensor or the batched gradient.
-// container: Container to control resource sharing.
-// shared_name: Instances of UnbatchGrad with the same container and shared_name
-//  are assumed to possibly belong to the same batch. If left empty, the op name
-//  will be used as the shared name.
-func UnbatchGrad(scope *Scope, original_input tf.Output, batch_index tf.Output, grad tf.Output, id tf.Output, optional ...UnbatchGradAttr) (batched_grad tf.Output) {
+// The result will have those bits set, that are set in both `x` and `y`. The
+// computation is performed on the underlying representations of `x` and `y`.
+func BitwiseAnd(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "UnbatchGrad",
+		Type: "BitwiseAnd",
 		Input: []tf.Input{
-			original_input, batch_index, grad, id,
+			x, y,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// DecodeWavAttr is an optional argument to DecodeWav.
-type DecodeWavAttr func(optionalAttr)
-
-// DecodeWavDesiredChannels sets the optional desired_channels attribute to value.
+// Elementwise computes the bitwise left-shift of `x` and `y`.
 //
-// value: Number of sample channels wanted.
-// If not specified, defaults to -1
-func DecodeWavDesiredChannels(value int64) DecodeWavAttr {
-	return func(m optionalAttr) {
-		m["desired_channels"] = value
+// If `y` is negative, or greater than or equal to the width of `x` in bits the
+// result is implementation defined.
+func LeftShift(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "LeftShift",
+		Input: []tf.Input{
+			x, y,
+		},
 	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// DecodeWavDesiredSamples sets the optional desired_samples attribute to value.
-//
-// value: Length of audio requested.
+// TensorListStackAttr is an optional argument to TensorListStack.
+type TensorListStackAttr func(optionalAttr)
+
+// TensorListStackNumElements sets the optional num_elements attribute to value.
 // If not specified, defaults to -1
-func DecodeWavDesiredSamples(value int64) DecodeWavAttr {
+func TensorListStackNumElements(value int64) TensorListStackAttr {
 	return func(m optionalAttr) {
-		m["desired_samples"] = value
+		m["num_elements"] = value
 	}
 }
 
-// Decode a 16-bit PCM WAV file to a float tensor.
-//
-// The -32768 to 32767 signed 16-bit values will be scaled to -1.0 to 1.0 in float.
-//
-// When desired_channels is set, if the input contains fewer channels than this
-// then the last channel will be duplicated to give the requested number, else if
-// the input has more channels than requested then the additional channels will be
-// ignored.
-//
-// If desired_samples is set, then the audio will be cropped or padded with zeroes
-// to the requested length.
+// Stacks all tensors in the list.
 //
-// The first output contains a Tensor with the content of the audio samples. The
-// lowest dimension will be the number of channels, and the second will be the
-// number of samples. For example, a ten-sample-long stereo WAV file should give an
-// output shape of [10, 2].
+// Requires that all tensors have the same shape.
 //
-// Arguments:
-//	contents: The WAV-encoded audio, usually from a file.
+// input_handle: the input list
+// tensor: the gathered result
+// num_elements: optional. If not -1, the number of elements in the list.
 //
-// Returns 2-D with shape `[length, channels]`.Scalar holding the sample rate found in the WAV header.
-func DecodeWav(scope *Scope, contents tf.Output, optional ...DecodeWavAttr) (audio tf.Output, sample_rate tf.Output) {
+func TensorListStack(scope *Scope, input_handle tf.Output, element_dtype tf.DataType, optional ...TensorListStackAttr) (tensor tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
+	attrs := map[string]interface{}{"element_dtype": element_dtype}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "DecodeWav",
+		Type: "TensorListStack",
 		Input: []tf.Input{
-			contents,
+			input_handle,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1)
+	return op.Output(0)
 }
 
-// Concatenates a list of `N` tensors along the first dimension.
+// Elementwise computes the bitwise right-shift of `x` and `y`.
 //
-// The input tensors are all required to have size 1 in the first dimension.
+// Performs a logical shift for unsigned integer types, and an arithmetic shift
+// for signed integer types.
 //
-// For example:
+// If `y` is negative, or greater than or equal to than the width of `x` in bits
+// the result is implementation defined.
+func RightShift(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "RightShift",
+		Input: []tf.Input{
+			x, y,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Adjust the hue of one or more images.
 //
-// ```
-// # 'x' is [[1, 4]]
-// # 'y' is [[2, 5]]
-// # 'z' is [[3, 6]]
-// parallel_concat([x, y, z]) => [[1, 4], [2, 5], [3, 6]]  # Pack along first dim.
-// ```
+// `images` is a tensor of at least 3 dimensions.  The last dimension is
+// interpretted as channels, and must be three.
 //
-// The difference between concat and parallel_concat is that concat requires all
-// of the inputs be computed before the operation will begin but doesn't require
-// that the input shapes be known during graph construction.  Parallel concat
-// will copy pieces of the input into the output as they become available, in
-// some situations this can provide a performance benefit.
+// The input image is considered in the RGB colorspace. Conceptually, the RGB
+// colors are first mapped into HSV. A delta is then applied all the hue values,
+// and then remapped back to RGB colorspace.
 //
 // Arguments:
-//	values: Tensors to be concatenated. All must have size 1 in the first dimension
-// and same shape.
-//	shape: the final shape of the result; should be equal to the shapes of any input
-// but with the number of input values in the first dimension.
+//	images: Images to adjust.  At least 3-D.
+//	delta: A float delta to add to the hue.
 //
-// Returns The concatenated tensor.
-func ParallelConcat(scope *Scope, values []tf.Output, shape tf.Shape) (output tf.Output) {
+// Returns The hue-adjusted image or images.
+func AdjustHue(scope *Scope, images tf.Output, delta tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"shape": shape}
 	opspec := tf.OpSpec{
-		Type: "ParallelConcat",
+		Type: "AdjustHue",
 		Input: []tf.Input{
-			tf.OutputList(values),
+			images, delta,
 		},
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-//     Subtracts `v` into specified rows of `x`.
+// BatchAttr is an optional argument to Batch.
+type BatchAttr func(optionalAttr)
+
+// BatchMaxEnqueuedBatches sets the optional max_enqueued_batches attribute to value.
+// If not specified, defaults to 10
+func BatchMaxEnqueuedBatches(value int64) BatchAttr {
+	return func(m optionalAttr) {
+		m["max_enqueued_batches"] = value
+	}
+}
+
+// BatchAllowedBatchSizes sets the optional allowed_batch_sizes attribute to value.
+// If not specified, defaults to <>
+func BatchAllowedBatchSizes(value []int64) BatchAttr {
+	return func(m optionalAttr) {
+		m["allowed_batch_sizes"] = value
+	}
+}
+
+// BatchContainer sets the optional container attribute to value.
+// If not specified, defaults to ""
+func BatchContainer(value string) BatchAttr {
+	return func(m optionalAttr) {
+		m["container"] = value
+	}
+}
+
+// BatchSharedName sets the optional shared_name attribute to value.
+// If not specified, defaults to ""
+func BatchSharedName(value string) BatchAttr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
+	}
+}
+
+// BatchBatchingQueue sets the optional batching_queue attribute to value.
+// If not specified, defaults to ""
+func BatchBatchingQueue(value string) BatchAttr {
+	return func(m optionalAttr) {
+		m["batching_queue"] = value
+	}
+}
+
+// Batches all input tensors nondeterministically.
 //
-//     Computes y = x; y[i, :] -= v; return y.
+// When many instances of this Op are being run concurrently with the same
+// container/shared_name in the same device, some will output zero-shaped Tensors
+// and others will output Tensors of size up to max_batch_size.
 //
-// Arguments:
-//	x: A `Tensor` of type T.
-//	i: A vector. Indices into the left-most dimension of `x`.
-//	v: A `Tensor` of type T. Same dimension sizes as x except the first dimension, which must be the same as i's size.
+// All Tensors in in_tensors are batched together (so, for example, labels and
+// features should be batched with a single instance of this operation.
 //
-// Returns A `Tensor` of type T. An alias of `x`. The content of `y` is undefined if there are duplicates in `i`.
-func InplaceSub(scope *Scope, x tf.Output, i tf.Output, v tf.Output) (y tf.Output) {
+// Each invocation of batch emits an `id` scalar which will be used to identify
+// this particular invocation when doing unbatch or its gradient.
+//
+// Each op which emits a non-empty batch will also emit a non-empty batch_index
+// Tensor, which, is a [K, 3] matrix where each row contains the invocation's id,
+// start, and length of elements of each set of Tensors present in batched_tensors.
+//
+// Batched tensors are concatenated along the first dimension, and all tensors in
+// in_tensors must have the first dimension of the same size.
+//
+// in_tensors: The tensors to be batched.
+// num_batch_threads: Number of scheduling threads for processing batches of work.
+//  Determines the number of batches processed in parallel.
+// max_batch_size: Batch sizes will never be bigger than this.
+// batch_timeout_micros: Maximum number of microseconds to wait before outputting
+//  an incomplete batch.
+// allowed_batch_sizes: Optional list of allowed batch sizes. If left empty, does
+//  nothing. Otherwise, supplies a list of batch sizes, causing the op to pad
+//  batches up to one of those sizes. The entries must increase monotonically, and
+//  the final entry must equal max_batch_size.
+// grad_timeout_micros: The timeout to use for the gradient. See Unbatch.
+// batched_tensors: Either empty tensors or a batch of concatenated Tensors.
+// batch_index: If out_tensors is non-empty, has information to invert it.
+// container: Controls the scope of sharing of this batch.
+// id: always contains a scalar with a unique ID for this invocation of Batch.
+// shared_name: Concurrently running instances of batch in the same device with the
+//  same container and shared_name will batch their elements together. If left
+//  empty, the op name will be used as the shared name.
+// T: the types of tensors to be batched.
+func Batch(scope *Scope, in_tensors []tf.Output, num_batch_threads int64, max_batch_size int64, batch_timeout_micros int64, grad_timeout_micros int64, optional ...BatchAttr) (batched_tensors []tf.Output, batch_index tf.Output, id tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"num_batch_threads": num_batch_threads, "max_batch_size": max_batch_size, "batch_timeout_micros": batch_timeout_micros, "grad_timeout_micros": grad_timeout_micros}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "InplaceSub",
+		Type: "Batch",
 		Input: []tf.Input{
-			x, i, v,
+			tf.OutputList(in_tensors),
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	if scope.Err() != nil {
+		return
+	}
+	var idx int
+	var err error
+	if batched_tensors, idx, err = makeOutputList(op, idx, "batched_tensors"); err != nil {
+		scope.UpdateErr("Batch", err)
+		return
+	}
+	batch_index = op.Output(idx)
+	id = op.Output(idx)
+	return batched_tensors, batch_index, id
 }
 
-// Converts a flat index or array of flat indices into a tuple of
-//
-// coordinate arrays.
-//
-// @compatibility(numpy)
-// Equivalent to np.unravel_index
-// @end_compatibility
+// UnbatchAttr is an optional argument to Unbatch.
+type UnbatchAttr func(optionalAttr)
+
+// UnbatchContainer sets the optional container attribute to value.
+// If not specified, defaults to ""
+func UnbatchContainer(value string) UnbatchAttr {
+	return func(m optionalAttr) {
+		m["container"] = value
+	}
+}
+
+// UnbatchSharedName sets the optional shared_name attribute to value.
+// If not specified, defaults to ""
+func UnbatchSharedName(value string) UnbatchAttr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
+	}
+}
+
+// Reverses the operation of Batch for a single output Tensor.
 //
-// Arguments:
-//	indices: An 0-D or 1-D `int` Tensor whose elements are indices into the
-// flattened version of an array of dimensions dims.
-//	dims: An 1-D `int` Tensor. The shape of the array to use for unraveling
-// indices.
+// An instance of Unbatch either receives an empty batched_tensor, in which case it
+// asynchronously waits until the values become available from a concurrently
+// running instance of Unbatch with the same container and shared_name, or receives
+// a non-empty batched_tensor in which case it finalizes all other concurrently
+// running instances and outputs its own element from the batch.
 //
-// Returns An 2-D (or 1-D if indices is 0-D) tensor where each row has the
-// same shape as the indices array.
-func UnravelIndex(scope *Scope, indices tf.Output, dims tf.Output) (output tf.Output) {
+// batched_tensor: The possibly transformed output of Batch. The size of the first
+//  dimension should remain unchanged by the transformations for the operation to
+//  work.
+// batch_index: The matching batch_index obtained from Batch.
+// id: The id scalar emitted by Batch.
+// unbatched_tensor: The Tensor corresponding to this execution.
+// timeout_micros: Maximum amount of time (in microseconds) to wait to receive the
+//  batched input tensor associated with a given invocation of the op.
+// container: Container to control resource sharing.
+// shared_name: Instances of Unbatch with the same container and shared_name are
+//  assumed to possibly belong to the same batch. If left empty, the op name will
+//  be used as the shared name.
+func Unbatch(scope *Scope, batched_tensor tf.Output, batch_index tf.Output, id tf.Output, timeout_micros int64, optional ...UnbatchAttr) (unbatched_tensor tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"timeout_micros": timeout_micros}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "UnravelIndex",
+		Type: "Unbatch",
 		Input: []tf.Input{
-			indices, dims,
+			batched_tensor, batch_index, id,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Compute the lower regularized incomplete Gamma function `Q(a, x)`.
-//
-// The lower regularized incomplete Gamma function is defined as:
-//
-//
-// \\(P(a, x) = gamma(a, x) / Gamma(a) = 1 - Q(a, x)\\)
-//
-// where
+// AvgPool3DGradAttr is an optional argument to AvgPool3DGrad.
+type AvgPool3DGradAttr func(optionalAttr)
+
+// AvgPool3DGradDataFormat sets the optional data_format attribute to value.
 //
-// \\(gamma(a, x) = int_{0}^{x} t^{a-1} exp(-t) dt\\)
+// value: The data format of the input and output data. With the
+// default format "NDHWC", the data is stored in the order of:
+//     [batch, in_depth, in_height, in_width, in_channels].
+// Alternatively, the format could be "NCDHW", the data storage order is:
+//     [batch, in_channels, in_depth, in_height, in_width].
+// If not specified, defaults to "NDHWC"
+func AvgPool3DGradDataFormat(value string) AvgPool3DGradAttr {
+	return func(m optionalAttr) {
+		m["data_format"] = value
+	}
+}
+
+// Computes gradients of average pooling function.
 //
-// is the lower incomplete Gamma function.
+// Arguments:
+//	orig_input_shape: The original input dimensions.
+//	grad: Output backprop of shape `[batch, depth, rows, cols, channels]`.
+//	ksize: 1-D tensor of length 5. The size of the window for each dimension of
+// the input tensor. Must have `ksize[0] = ksize[4] = 1`.
+//	strides: 1-D tensor of length 5. The stride of the sliding window for each
+// dimension of `input`. Must have `strides[0] = strides[4] = 1`.
+//	padding: The type of padding algorithm to use.
 //
-// Note, above `Q(a, x)` (`Igammac`) is the upper regularized complete
-// Gamma function.
-func Igamma(scope *Scope, a tf.Output, x tf.Output) (z tf.Output) {
+// Returns The backprop for input.
+func AvgPool3DGrad(scope *Scope, orig_input_shape tf.Output, grad tf.Output, ksize []int64, strides []int64, padding string, optional ...AvgPool3DGradAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"ksize": ksize, "strides": strides, "padding": padding}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "Igamma",
+		Type: "AvgPool3DGrad",
 		Input: []tf.Input{
-			a, x,
+			orig_input_shape, grad,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Computes offsets of concat inputs within its output.
+// ParseSingleSequenceExampleAttr is an optional argument to ParseSingleSequenceExample.
+type ParseSingleSequenceExampleAttr func(optionalAttr)
+
+// ParseSingleSequenceExampleContextSparseTypes sets the optional context_sparse_types attribute to value.
 //
-// For example:
+// value: A list of Ncontext_sparse types; the data types of data in
+// each context Feature given in context_sparse_keys.
+// Currently the ParseSingleSequenceExample supports DT_FLOAT (FloatList),
+// DT_INT64 (Int64List), and DT_STRING (BytesList).
+// If not specified, defaults to <>
 //
-// ```
-// # 'x' is [2, 2, 7]
-// # 'y' is [2, 3, 7]
-// # 'z' is [2, 5, 7]
-// concat_offset(2, [x, y, z]) => [0, 0, 0], [0, 2, 0], [0, 5, 0]
-// ```
+// REQUIRES: len(value) >= 0
+func ParseSingleSequenceExampleContextSparseTypes(value []tf.DataType) ParseSingleSequenceExampleAttr {
+	return func(m optionalAttr) {
+		m["context_sparse_types"] = value
+	}
+}
+
+// ParseSingleSequenceExampleFeatureListDenseTypes sets the optional feature_list_dense_types attribute to value.
+// If not specified, defaults to <>
 //
-// This is typically used by gradient computations for a concat operation.
+// REQUIRES: len(value) >= 0
+func ParseSingleSequenceExampleFeatureListDenseTypes(value []tf.DataType) ParseSingleSequenceExampleAttr {
+	return func(m optionalAttr) {
+		m["feature_list_dense_types"] = value
+	}
+}
+
+// ParseSingleSequenceExampleContextDenseShapes sets the optional context_dense_shapes attribute to value.
 //
-// Arguments:
-//	concat_dim: The dimension along which to concatenate.
-//	shape: The `N` int32 vectors representing shape of tensors being concatenated.
+// value: A list of Ncontext_dense shapes; the shapes of data in
+// each context Feature given in context_dense_keys.
+// The number of elements in the Feature corresponding to context_dense_key[j]
+// must always equal context_dense_shapes[j].NumEntries().
+// The shape of context_dense_values[j] will match context_dense_shapes[j].
+// If not specified, defaults to <>
 //
-// Returns The `N` int32 vectors representing the starting offset
-// of input tensors within the concatenated output.
-func ConcatOffset(scope *Scope, concat_dim tf.Output, shape []tf.Output) (offset []tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "ConcatOffset",
-		Input: []tf.Input{
-			concat_dim, tf.OutputList(shape),
-		},
+// REQUIRES: len(value) >= 0
+func ParseSingleSequenceExampleContextDenseShapes(value []tf.Shape) ParseSingleSequenceExampleAttr {
+	return func(m optionalAttr) {
+		m["context_dense_shapes"] = value
 	}
-	op := scope.AddOperation(opspec)
-	if scope.Err() != nil {
-		return
+}
+
+// ParseSingleSequenceExampleFeatureListSparseTypes sets the optional feature_list_sparse_types attribute to value.
+//
+// value: A list of Nfeature_list_sparse types; the data types
+// of data in each FeatureList given in feature_list_sparse_keys.
+// Currently the ParseSingleSequenceExample supports DT_FLOAT (FloatList),
+// DT_INT64 (Int64List), and DT_STRING (BytesList).
+// If not specified, defaults to <>
+//
+// REQUIRES: len(value) >= 0
+func ParseSingleSequenceExampleFeatureListSparseTypes(value []tf.DataType) ParseSingleSequenceExampleAttr {
+	return func(m optionalAttr) {
+		m["feature_list_sparse_types"] = value
 	}
-	var idx int
-	var err error
-	if offset, idx, err = makeOutputList(op, idx, "offset"); err != nil {
-		scope.UpdateErr("ConcatOffset", err)
-		return
+}
+
+// ParseSingleSequenceExampleFeatureListDenseShapes sets the optional feature_list_dense_shapes attribute to value.
+//
+// value: A list of Nfeature_list_dense shapes; the shapes of
+// data in each FeatureList given in feature_list_dense_keys.
+// The shape of each Feature in the FeatureList corresponding to
+// feature_list_dense_key[j] must always equal
+// feature_list_dense_shapes[j].NumEntries().
+// If not specified, defaults to <>
+//
+// REQUIRES: len(value) >= 0
+func ParseSingleSequenceExampleFeatureListDenseShapes(value []tf.Shape) ParseSingleSequenceExampleAttr {
+	return func(m optionalAttr) {
+		m["feature_list_dense_shapes"] = value
 	}
-	return offset
 }
 
-// Splits a tensor into `num_split` tensors along one dimension.
+// Transforms a scalar brain.SequenceExample proto (as strings) into typed tensors.
 //
 // Arguments:
-//	axis: 0-D.  The dimension along which to split.  Must be in the range
-// `[-rank(value), rank(value))`.
-//	value: The tensor to split.
-//	num_split: The number of ways to split.  Must evenly divide
-// `value.shape[split_dim]`.
-//
-// Returns They are identically shaped tensors, whose shape matches that of `value`
-// except along `axis`, where their sizes are
-// `values.shape[split_dim] / num_split`.
-func Split(scope *Scope, axis tf.Output, value tf.Output, num_split int64) (output []tf.Output) {
+//	serialized: A scalar containing a binary serialized SequenceExample proto.
+//	feature_list_dense_missing_assumed_empty: A vector listing the
+// FeatureList keys which may be missing from the SequenceExample.  If the
+// associated FeatureList is missing, it is treated as empty.  By default,
+// any FeatureList not listed in this vector must exist in the SequenceExample.
+//	context_sparse_keys: A list of Ncontext_sparse string Tensors (scalars).
+// The keys expected in the Examples' features associated with context_sparse
+// values.
+//	context_dense_keys: A list of Ncontext_dense string Tensors (scalars).
+// The keys expected in the SequenceExamples' context features associated with
+// dense values.
+//	feature_list_sparse_keys: A list of Nfeature_list_sparse string Tensors
+// (scalars).  The keys expected in the FeatureLists associated with sparse
+// values.
+//	feature_list_dense_keys: A list of Nfeature_list_dense string Tensors (scalars).
+// The keys expected in the SequenceExamples' feature_lists associated
+// with lists of dense values.
+//	context_dense_defaults: A list of Ncontext_dense Tensors (some may be empty).
+// context_dense_defaults[j] provides default values
+// when the SequenceExample's context map lacks context_dense_key[j].
+// If an empty Tensor is provided for context_dense_defaults[j],
+// then the Feature context_dense_keys[j] is required.
+// The input type is inferred from context_dense_defaults[j], even when it's
+// empty.  If context_dense_defaults[j] is not empty, its shape must match
+// context_dense_shapes[j].
+//	debug_name: A scalar containing the name of the serialized proto.
+// May contain, for example, table key (descriptive) name for the
+// corresponding serialized proto.  This is purely useful for debugging
+// purposes, and the presence of values here has no effect on the output.
+// May also be an empty scalar if no name is available.
+func ParseSingleSequenceExample(scope *Scope, serialized tf.Output, feature_list_dense_missing_assumed_empty tf.Output, context_sparse_keys []tf.Output, context_dense_keys []tf.Output, feature_list_sparse_keys []tf.Output, feature_list_dense_keys []tf.Output, context_dense_defaults []tf.Output, debug_name tf.Output, optional ...ParseSingleSequenceExampleAttr) (context_sparse_indices []tf.Output, context_sparse_values []tf.Output, context_sparse_shapes []tf.Output, context_dense_values []tf.Output, feature_list_sparse_indices []tf.Output, feature_list_sparse_values []tf.Output, feature_list_sparse_shapes []tf.Output, feature_list_dense_values []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"num_split": num_split}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "Split",
+		Type: "ParseSingleSequenceExample",
 		Input: []tf.Input{
-			axis, value,
+			serialized, feature_list_dense_missing_assumed_empty, tf.OutputList(context_sparse_keys), tf.OutputList(context_dense_keys), tf.OutputList(feature_list_sparse_keys), tf.OutputList(feature_list_dense_keys), tf.OutputList(context_dense_defaults), debug_name,
 		},
 		Attrs: attrs,
 	}
@@ -30440,158 +30463,140 @@ func Split(scope *Scope, axis tf.Output, value tf.Output, num_split int64) (outp
 	}
 	var idx int
 	var err error
-	if output, idx, err = makeOutputList(op, idx, "output"); err != nil {
-		scope.UpdateErr("Split", err)
+	if context_sparse_indices, idx, err = makeOutputList(op, idx, "context_sparse_indices"); err != nil {
+		scope.UpdateErr("ParseSingleSequenceExample", err)
 		return
 	}
-	return output
-}
-
-// Splits a tensor into `num_split` tensors along one dimension.
-//
-// Arguments:
-//	value: The tensor to split.
-//	size_splits: list containing the sizes of each output tensor along the split
-// dimension. Must sum to the dimension of value along split_dim.
-// Can contain one -1 indicating that dimension is to be inferred.
-//	axis: 0-D.  The dimension along which to split.  Must be in the range
-// `[-rank(value), rank(value))`.
-//
-//
-// Returns Tensors whose shape matches that of `value`
-// except along `axis`, where their sizes are
-// `size_splits[i]`.
-func SplitV(scope *Scope, value tf.Output, size_splits tf.Output, axis tf.Output, num_split int64) (output []tf.Output) {
-	if scope.Err() != nil {
+	if context_sparse_values, idx, err = makeOutputList(op, idx, "context_sparse_values"); err != nil {
+		scope.UpdateErr("ParseSingleSequenceExample", err)
 		return
 	}
-	attrs := map[string]interface{}{"num_split": num_split}
-	opspec := tf.OpSpec{
-		Type: "SplitV",
-		Input: []tf.Input{
-			value, size_splits, axis,
-		},
-		Attrs: attrs,
+	if context_sparse_shapes, idx, err = makeOutputList(op, idx, "context_sparse_shapes"); err != nil {
+		scope.UpdateErr("ParseSingleSequenceExample", err)
+		return
+	}
+	if context_dense_values, idx, err = makeOutputList(op, idx, "context_dense_values"); err != nil {
+		scope.UpdateErr("ParseSingleSequenceExample", err)
+		return
+	}
+	if feature_list_sparse_indices, idx, err = makeOutputList(op, idx, "feature_list_sparse_indices"); err != nil {
+		scope.UpdateErr("ParseSingleSequenceExample", err)
+		return
 	}
-	op := scope.AddOperation(opspec)
-	if scope.Err() != nil {
+	if feature_list_sparse_values, idx, err = makeOutputList(op, idx, "feature_list_sparse_values"); err != nil {
+		scope.UpdateErr("ParseSingleSequenceExample", err)
 		return
 	}
-	var idx int
-	var err error
-	if output, idx, err = makeOutputList(op, idx, "output"); err != nil {
-		scope.UpdateErr("SplitV", err)
+	if feature_list_sparse_shapes, idx, err = makeOutputList(op, idx, "feature_list_sparse_shapes"); err != nil {
+		scope.UpdateErr("ParseSingleSequenceExample", err)
 		return
 	}
-	return output
+	if feature_list_dense_values, idx, err = makeOutputList(op, idx, "feature_list_dense_values"); err != nil {
+		scope.UpdateErr("ParseSingleSequenceExample", err)
+		return
+	}
+	return context_sparse_indices, context_sparse_values, context_sparse_shapes, context_dense_values, feature_list_sparse_indices, feature_list_sparse_values, feature_list_sparse_shapes, feature_list_dense_values
 }
 
-// Gives a guarantee to the TF runtime that the input tensor is a constant.
-//
-// The runtime is then free to make optimizations based on this.
-//
-// Only accepts value typed tensors as inputs and rejects resource variable handles
-// as input.
-//
-// Returns the input tensor without modification.
-func GuaranteeConst(scope *Scope, input tf.Output) (output tf.Output) {
-	if scope.Err() != nil {
-		return
+// UnbatchGradAttr is an optional argument to UnbatchGrad.
+type UnbatchGradAttr func(optionalAttr)
+
+// UnbatchGradContainer sets the optional container attribute to value.
+// If not specified, defaults to ""
+func UnbatchGradContainer(value string) UnbatchGradAttr {
+	return func(m optionalAttr) {
+		m["container"] = value
 	}
-	opspec := tf.OpSpec{
-		Type: "GuaranteeConst",
-		Input: []tf.Input{
-			input,
-		},
+}
+
+// UnbatchGradSharedName sets the optional shared_name attribute to value.
+// If not specified, defaults to ""
+func UnbatchGradSharedName(value string) UnbatchGradAttr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// Returns a tensor of zeros with the same shape and type as x.
+// Gradient of Unbatch.
 //
-// Arguments:
-//	x: a tensor of type T.
+// Acts like Batch but using the given batch_index index of batching things as they
+// become available. This ensures that the gradients are propagated back in the
+// same session which did the forward pass.
 //
-// Returns a tensor of the same shape and type as x but filled with zeros.
-func ZerosLike(scope *Scope, x tf.Output) (y tf.Output) {
+// original_input: The input to the Unbatch operation this is the gradient of.
+// batch_index: The batch_index given to the Unbatch operation this is the gradient
+// of.
+// grad: The downstream gradient.
+// id: The id scalar emitted by Batch.
+// batched_grad: The return value, either an empty tensor or the batched gradient.
+// container: Container to control resource sharing.
+// shared_name: Instances of UnbatchGrad with the same container and shared_name
+//  are assumed to possibly belong to the same batch. If left empty, the op name
+//  will be used as the shared name.
+func UnbatchGrad(scope *Scope, original_input tf.Output, batch_index tf.Output, grad tf.Output, id tf.Output, optional ...UnbatchGradAttr) (batched_grad tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "ZerosLike",
+		Type: "UnbatchGrad",
 		Input: []tf.Input{
-			x,
+			original_input, batch_index, grad, id,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// QuantizedInstanceNormAttr is an optional argument to QuantizedInstanceNorm.
-type QuantizedInstanceNormAttr func(optionalAttr)
+// DecodeWavAttr is an optional argument to DecodeWav.
+type DecodeWavAttr func(optionalAttr)
 
-// QuantizedInstanceNormOutputRangeGiven sets the optional output_range_given attribute to value.
+// DecodeWavDesiredChannels sets the optional desired_channels attribute to value.
 //
-// value: If True, `given_y_min` and `given_y_min`
-// and `given_y_max` are used as the output range. Otherwise,
-// the implementation computes the output range.
-// If not specified, defaults to false
-func QuantizedInstanceNormOutputRangeGiven(value bool) QuantizedInstanceNormAttr {
+// value: Number of sample channels wanted.
+// If not specified, defaults to -1
+func DecodeWavDesiredChannels(value int64) DecodeWavAttr {
 	return func(m optionalAttr) {
-		m["output_range_given"] = value
+		m["desired_channels"] = value
 	}
 }
 
-// QuantizedInstanceNormGivenYMin sets the optional given_y_min attribute to value.
+// DecodeWavDesiredSamples sets the optional desired_samples attribute to value.
 //
-// value: Output in `y_min` if `output_range_given` is True.
-// If not specified, defaults to 0
-func QuantizedInstanceNormGivenYMin(value float32) QuantizedInstanceNormAttr {
+// value: Length of audio requested.
+// If not specified, defaults to -1
+func DecodeWavDesiredSamples(value int64) DecodeWavAttr {
 	return func(m optionalAttr) {
-		m["given_y_min"] = value
+		m["desired_samples"] = value
 	}
 }
 
-// QuantizedInstanceNormGivenYMax sets the optional given_y_max attribute to value.
+// Decode a 16-bit PCM WAV file to a float tensor.
 //
-// value: Output in `y_max` if `output_range_given` is True.
-// If not specified, defaults to 0
-func QuantizedInstanceNormGivenYMax(value float32) QuantizedInstanceNormAttr {
-	return func(m optionalAttr) {
-		m["given_y_max"] = value
-	}
-}
-
-// QuantizedInstanceNormVarianceEpsilon sets the optional variance_epsilon attribute to value.
+// The -32768 to 32767 signed 16-bit values will be scaled to -1.0 to 1.0 in float.
 //
-// value: A small float number to avoid dividing by 0.
-// If not specified, defaults to 1e-05
-func QuantizedInstanceNormVarianceEpsilon(value float32) QuantizedInstanceNormAttr {
-	return func(m optionalAttr) {
-		m["variance_epsilon"] = value
-	}
-}
-
-// QuantizedInstanceNormMinSeparation sets the optional min_separation attribute to value.
+// When desired_channels is set, if the input contains fewer channels than this
+// then the last channel will be duplicated to give the requested number, else if
+// the input has more channels than requested then the additional channels will be
+// ignored.
 //
-// value: Minimum value of `y_max - y_min`
-// If not specified, defaults to 0.001
-func QuantizedInstanceNormMinSeparation(value float32) QuantizedInstanceNormAttr {
-	return func(m optionalAttr) {
-		m["min_separation"] = value
-	}
-}
-
-// Quantized Instance normalization.
+// If desired_samples is set, then the audio will be cropped or padded with zeroes
+// to the requested length.
+//
+// The first output contains a Tensor with the content of the audio samples. The
+// lowest dimension will be the number of channels, and the second will be the
+// number of samples. For example, a ten-sample-long stereo WAV file should give an
+// output shape of [10, 2].
 //
 // Arguments:
-//	x: A 4D input Tensor.
-//	x_min: The value represented by the lowest quantized input.
-//	x_max: The value represented by the highest quantized input.
+//	contents: The WAV-encoded audio, usually from a file.
 //
-// Returns A 4D Tensor.The value represented by the lowest quantized output.The value represented by the highest quantized output.
-func QuantizedInstanceNorm(scope *Scope, x tf.Output, x_min tf.Output, x_max tf.Output, optional ...QuantizedInstanceNormAttr) (y tf.Output, y_min tf.Output, y_max tf.Output) {
+// Returns 2-D with shape `[length, channels]`.Scalar holding the sample rate found in the WAV header.
+func DecodeWav(scope *Scope, contents tf.Output, optional ...DecodeWavAttr) (audio tf.Output, sample_rate tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -30600,111 +30605,106 @@ func QuantizedInstanceNorm(scope *Scope, x tf.Output, x_min tf.Output, x_max tf.
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "QuantizedInstanceNorm",
+		Type: "DecodeWav",
 		Input: []tf.Input{
-			x, x_min, x_max,
+			contents,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
+	return op.Output(0), op.Output(1)
 }
 
-// Returns the diagonal part of the tensor.
-//
-// This operation returns a tensor with the `diagonal` part
-// of the `input`. The `diagonal` part is computed as follows:
-//
-// Assume `input` has dimensions `[D1,..., Dk, D1,..., Dk]`, then the output is a
-// tensor of rank `k` with dimensions `[D1,..., Dk]` where:
+// Concatenates a list of `N` tensors along the first dimension.
 //
-// `diagonal[i1,..., ik] = input[i1, ..., ik, i1,..., ik]`.
+// The input tensors are all required to have size 1 in the first dimension.
 //
 // For example:
 //
 // ```
-// # 'input' is [[1, 0, 0, 0]
-//               [0, 2, 0, 0]
-//               [0, 0, 3, 0]
-//               [0, 0, 0, 4]]
-//
-// tf.diag_part(input) ==> [1, 2, 3, 4]
+// # 'x' is [[1, 4]]
+// # 'y' is [[2, 5]]
+// # 'z' is [[3, 6]]
+// parallel_concat([x, y, z]) => [[1, 4], [2, 5], [3, 6]]  # Pack along first dim.
 // ```
 //
+// The difference between concat and parallel_concat is that concat requires all
+// of the inputs be computed before the operation will begin but doesn't require
+// that the input shapes be known during graph construction.  Parallel concat
+// will copy pieces of the input into the output as they become available, in
+// some situations this can provide a performance benefit.
+//
 // Arguments:
-//	input: Rank k tensor where k is even and not zero.
+//	values: Tensors to be concatenated. All must have size 1 in the first dimension
+// and same shape.
+//	shape: the final shape of the result; should be equal to the shapes of any input
+// but with the number of input values in the first dimension.
 //
-// Returns The extracted diagonal.
-func DiagPart(scope *Scope, input tf.Output) (diagonal tf.Output) {
+// Returns The concatenated tensor.
+func ParallelConcat(scope *Scope, values []tf.Output, shape tf.Shape) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"shape": shape}
 	opspec := tf.OpSpec{
-		Type: "DiagPart",
+		Type: "ParallelConcat",
 		Input: []tf.Input{
-			input,
+			tf.OutputList(values),
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Returns the element-wise max of two SparseTensors.
+//     Subtracts `v` into specified rows of `x`.
 //
-// Assumes the two SparseTensors have the same shape, i.e., no broadcasting.
+//     Computes y = x; y[i, :] -= v; return y.
 //
 // Arguments:
-//	a_indices: 2-D.  `N x R` matrix with the indices of non-empty values in a
-// SparseTensor, in the canonical lexicographic ordering.
-//	a_values: 1-D.  `N` non-empty values corresponding to `a_indices`.
-//	a_shape: 1-D.  Shape of the input SparseTensor.
-//	b_indices: counterpart to `a_indices` for the other operand.
-//	b_values: counterpart to `a_values` for the other operand; must be of the same dtype.
-//	b_shape: counterpart to `a_shape` for the other operand; the two shapes must be equal.
+//	x: A `Tensor` of type T.
+//	i: A vector. Indices into the left-most dimension of `x`.
+//	v: A `Tensor` of type T. Same dimension sizes as x except the first dimension, which must be the same as i's size.
 //
-// Returns 2-D.  The indices of the output SparseTensor.1-D.  The values of the output SparseTensor.
-func SparseSparseMaximum(scope *Scope, a_indices tf.Output, a_values tf.Output, a_shape tf.Output, b_indices tf.Output, b_values tf.Output, b_shape tf.Output) (output_indices tf.Output, output_values tf.Output) {
+// Returns A `Tensor` of type T. An alias of `x`. The content of `y` is undefined if there are duplicates in `i`.
+func InplaceSub(scope *Scope, x tf.Output, i tf.Output, v tf.Output) (y tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "SparseSparseMaximum",
+		Type: "InplaceSub",
 		Input: []tf.Input{
-			a_indices, a_values, a_shape, b_indices, b_values, b_shape,
+			x, i, v,
 		},
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1)
+	return op.Output(0)
 }
 
-// Returns a batched matrix tensor with new batched diagonal values.
-//
-// Given `input` and `diagonal`, this operation returns a tensor with the
-// same shape and values as `input`, except for the main diagonal of the
-// innermost matrices.  These will be overwritten by the values in `diagonal`.
-//
-// The output is computed as follows:
+// Converts a flat index or array of flat indices into a tuple of
 //
-// Assume `input` has `k+1` dimensions `[I, J, K, ..., M, N]` and `diagonal` has
-// `k` dimensions `[I, J, K, ..., min(M, N)]`.  Then the output is a
-// tensor of rank `k+1` with dimensions `[I, J, K, ..., M, N]` where:
+// coordinate arrays.
 //
-//   * `output[i, j, k, ..., m, n] = diagonal[i, j, k, ..., n]` for `m == n`.
-//   * `output[i, j, k, ..., m, n] = input[i, j, k, ..., m, n]` for `m != n`.
+// @compatibility(numpy)
+// Equivalent to np.unravel_index
+// @end_compatibility
 //
 // Arguments:
-//	input: Rank `k+1`, where `k >= 1`.
-//	diagonal: Rank `k`, where `k >= 1`.
+//	indices: An 0-D or 1-D `int` Tensor whose elements are indices into the
+// flattened version of an array of dimensions dims.
+//	dims: An 1-D `int` Tensor. The shape of the array to use for unraveling
+// indices.
 //
-// Returns Rank `k+1`, with `output.shape = input.shape`.
-func MatrixSetDiag(scope *Scope, input tf.Output, diagonal tf.Output) (output tf.Output) {
+// Returns An 2-D (or 1-D if indices is 0-D) tensor where each row has the
+// same shape as the indices array.
+func UnravelIndex(scope *Scope, indices tf.Output, dims tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "MatrixSetDiag",
+		Type: "UnravelIndex",
 		Input: []tf.Input{
-			input, diagonal,
+			indices, dims,
 		},
 	}
 	op := scope.AddOperation(opspec)
diff --git a/tensorflow/go/operation.go b/tensorflow/go/operation.go
index 8fcad61f4c6eec597d2b14fb8c9b4fa59987a829..baaac41f4ed9cff285cad0c7e63084102eb9a67c 100644
--- a/tensorflow/go/operation.go
+++ b/tensorflow/go/operation.go
@@ -65,6 +65,11 @@ func (op *Operation) Output(i int) Output {
 	return Output{op, i}
 }
 
+// NumInputs returns the number of inputs of op.
+func (op *Operation) NumInputs() int {
+	return int(C.TF_OperationNumInputs(op.c))
+}
+
 // Output represents one of the outputs of an operation in the graph. Has a
 // DataType (and eventually a Shape).  May be passed as an input argument to a
 // function for adding operations to a graph, or to a Session's Run() method to
@@ -123,6 +128,64 @@ func (p Output) c() C.TF_Output {
 
 func (p Output) canBeAnInput() {}
 
+// Consumers returns the inputs that consume this output.
+func (p Output) Consumers() []Consumer {
+	max := int(C.TF_OperationOutputNumConsumers(p.c()))
+	inputs := make([]C.TF_Input, max)
+	n := C.TF_OperationOutputConsumers(p.c(), (*C.TF_Input)(unsafe.Pointer(&inputs[0])), C.int(max))
+	inputs = inputs[:int(n)]
+
+	var consumers []Consumer
+	for _, consumer := range inputs {
+		consumers = append(consumers, Consumer{
+			Index: int(consumer.index),
+			Op: &Operation{
+				c: consumer.oper,
+				g: p.Op.g,
+			},
+		})
+	}
+
+	return consumers
+}
+
+// Consumer identifies a specific input of an operation that consumes the output
+// of another operation.
+type Consumer struct {
+	// Op is the Operation that is consuming the output of another operation.
+	Op *Operation
+
+	// Index is the index of the input within Op that the output of another
+	// operation is connected to.
+	Index int
+}
+
+func (p Consumer) c() C.TF_Input {
+	if p.Op == nil {
+		// Attempt to provide a more useful panic message than "nil
+		// pointer dereference".
+		panic("nil-Operation. Consumer objects should only be created by a call to Output.Consumers")
+	}
+	return C.TF_Input{oper: p.Op.c, index: C.int(p.Index)}
+}
+
+// DataType returns the type of the input.
+func (p Consumer) DataType() DataType {
+	return DataType(C.TF_OperationInputType(p.c()))
+}
+
+// Producer returns the Output that is connected to this Consumer.
+func (p Consumer) Producer() Output {
+	output := C.TF_OperationInput(p.c())
+	return Output{
+		Op: &Operation{
+			c: output.oper,
+			g: p.Op.g,
+		},
+		Index: int(output.index),
+	}
+}
+
 // Input is the interface for specifying inputs to an operation being added to
 // a Graph.
 //
diff --git a/tensorflow/go/operation_test.go b/tensorflow/go/operation_test.go
index 40c951ab8c13f43e2063b9f9cfadcd44a6da72fe..0672e8ecc75629f114a52b9c677afeea96b213b4 100644
--- a/tensorflow/go/operation_test.go
+++ b/tensorflow/go/operation_test.go
@@ -166,6 +166,64 @@ func TestOutputDataTypeAndShape(t *testing.T) {
 	}
 }
 
+func TestOperationInputs(t *testing.T) {
+	g := NewGraph()
+	x, err := Placeholder(g, "x", Float)
+	if err != nil {
+		t.Fatal(err)
+	}
+	y, err := Placeholder(g, "y", Float)
+	if err != nil {
+		t.Fatal(err)
+	}
+	add, err := Add(g, "add", x, y)
+	if err != nil {
+		t.Fatal(err)
+	}
+	addOp := add.Op
+
+	if out := addOp.NumInputs(); out != 2 {
+		t.Fatalf("Got %d inputs, wanted 2", out)
+	}
+}
+
+func TestOperationConsumers(t *testing.T) {
+	g := NewGraph()
+	x, err := Placeholder(g, "x", Float)
+	if err != nil {
+		t.Fatal(err)
+	}
+	a, err := Neg(g, "a", x)
+	if err != nil {
+		t.Fatal(err)
+	}
+	b, err := Neg(g, "b", x)
+	if err != nil {
+		t.Fatal(err)
+	}
+
+	consumers := []*Operation{a.Op, b.Op}
+
+	xConsumers := x.Consumers()
+	if out := len(xConsumers); out != 2 {
+		t.Fatalf("Got %d consumers, wanted 2", out)
+	}
+
+	for i, consumer := range xConsumers {
+		got := consumer.Op.Name()
+		want := consumers[i].Name()
+		if got != want {
+			t.Fatalf("%d. Got op name %q, wanted %q", i, got, want)
+		}
+
+		got = consumer.Producer().Op.Name()
+		want = x.Op.Name()
+		if got != want {
+			t.Fatalf("%d. Got op name %q, wanted %q", i, got, want)
+		}
+	}
+}
+
 func forceGC() {
 	var mem runtime.MemStats
 	runtime.ReadMemStats(&mem)
diff --git a/tensorflow/java/BUILD b/tensorflow/java/BUILD
index 50d5dcef548758dcab4109f8d61a30290892bbf7..19d2133a55f347cfc3d4dc766e0593a0e188c967 100644
--- a/tensorflow/java/BUILD
+++ b/tensorflow/java/BUILD
@@ -60,9 +60,7 @@ java_library(
 
 filegroup(
     name = "java_op_sources",
-    srcs = glob(["src/main/java/org/tensorflow/op/**/*.java"]) + [
-        ":java_op_gen_sources",
-    ],
+    srcs = glob(["src/main/java/org/tensorflow/op/**/*.java"]) + [":java_op_gen_sources"],
     visibility = [
         "//tensorflow/java:__pkg__",
     ],
@@ -87,6 +85,9 @@ tf_cc_binary(
     linkstatic = 1,
     deps = [
         ":java_op_gen_lib",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:framework_internal",
+        "//tensorflow/core:lib",
         "//tensorflow/core:ops",
     ],
 )
@@ -111,6 +112,8 @@ cc_library(
         "//tensorflow/core:lib",
         "//tensorflow/core:lib_internal",
         "//tensorflow/core:op_gen_lib",
+        "//tensorflow/core:protos_all_cc",
+        "@com_googlesource_code_re2//:re2",
     ],
 )
 
@@ -303,6 +306,7 @@ tf_cc_test(
     ],
     deps = [
         ":java_op_gen_lib",
+        "//tensorflow/core:lib",
         "//tensorflow/core:test",
         "//tensorflow/core:test_main",
     ],
diff --git a/tensorflow/java/maven/.gitignore b/tensorflow/java/maven/.gitignore
index ff080515d5e730b308bf78f7e28244c6c799cdc3..657e2a60bc57c0cf259c000476c75ae58d75fff2 100644
--- a/tensorflow/java/maven/.gitignore
+++ b/tensorflow/java/maven/.gitignore
@@ -11,4 +11,10 @@ tensorflow/src
 tensorflow/target
 proto/src
 proto/target
+hadoop/src
+hadoop/target
+spark-connector/src
+spark-connector/target
+spark-connector/dependency-reduced-pom.xml
+spark-connector/spark-warehouse
 pom.xml.versionsBackup
diff --git a/tensorflow/java/maven/README.md b/tensorflow/java/maven/README.md
index c7e8f0380629f492ade9ba47cdcb4bc286ac82bc..3e030dcd09c886983540b95640230eae3a6f2c0f 100644
--- a/tensorflow/java/maven/README.md
+++ b/tensorflow/java/maven/README.md
@@ -53,6 +53,12 @@ There are seven artifacts and thus `pom.xml`s involved in this release:
 7.  [`parentpom`](https://maven.apache.org/pom/index.html): Common settings
     shared by all of the above.
 
+8. `hadoop`: The TensorFlow TFRecord InputFormat/OutputFormat for Apache Hadoop.
+    The source code for this package is available in the [TensorFlow Ecosystem](https://github.com/tensorflow/ecosystem/tree/master/hadoop)
+
+9. `spark-connector`: A Scala library for loading and storing TensorFlow TFRecord
+    using Apache Spark DataFrames. The source code for this package is available
+    in the [TensorFlow Ecosystem](https://github.com/tensorflow/ecosystem/tree/master/spark/spark-tensorflow-connector)
 
 ## Updating the release
 
diff --git a/tensorflow/java/maven/hadoop/pom.xml b/tensorflow/java/maven/hadoop/pom.xml
new file mode 100644
index 0000000000000000000000000000000000000000..0642be06fa148933902ab450c5cf2f771e268828
--- /dev/null
+++ b/tensorflow/java/maven/hadoop/pom.xml
@@ -0,0 +1,24 @@
+<project
+        xmlns="http://maven.apache.org/POM/4.0.0"
+        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+        xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+    <!-- Placeholder pom which is replaced by TensorFlow ecosystem Hadoop pom during build -->
+    <modelVersion>4.0.0</modelVersion>
+    <description>TensorFlow TFRecord InputFormat/OutputFormat for Apache Hadoop</description>
+    <artifactId>hadoop</artifactId>
+    <packaging>jar</packaging>
+
+    <scm>
+        <url>https://github.com/tensorflow/ecosystem.git</url>
+        <connection>git@github.com:tensorflow/ecosystem.git</connection>
+        <developerConnection>scm:git:https://github.com/tensorflow/ecosystem.git</developerConnection>
+    </scm>
+
+    <url>https://github.com/tensorflow/ecosystem/</url>
+    <parent>
+        <groupId>org.tensorflow</groupId>
+        <artifactId>parentpom</artifactId>
+        <version>1.9.0-rc0</version>
+        <relativePath>../</relativePath>
+    </parent>
+</project>
\ No newline at end of file
diff --git a/tensorflow/java/maven/libtensorflow/pom.xml b/tensorflow/java/maven/libtensorflow/pom.xml
index 08cc860f5795a4cf20f4ab2d09d2c2d37a52faf6..38e87b16399e7b9344f654b553fa1623b6b2d9cd 100644
--- a/tensorflow/java/maven/libtensorflow/pom.xml
+++ b/tensorflow/java/maven/libtensorflow/pom.xml
@@ -6,7 +6,7 @@
   <parent>
     <groupId>org.tensorflow</groupId>
     <artifactId>parentpom</artifactId>
-    <version>1.8.0</version>
+    <version>1.9.0-rc0</version>
     <relativePath>../</relativePath>
   </parent>
   <artifactId>libtensorflow</artifactId>
diff --git a/tensorflow/java/maven/libtensorflow_jni/pom.xml b/tensorflow/java/maven/libtensorflow_jni/pom.xml
index fcc7eacc33b7bab366159425405b4bf5b0216cf1..36c984e280199c97a6e07516cc84290fa91e6b27 100644
--- a/tensorflow/java/maven/libtensorflow_jni/pom.xml
+++ b/tensorflow/java/maven/libtensorflow_jni/pom.xml
@@ -6,7 +6,7 @@
   <parent>
     <groupId>org.tensorflow</groupId>
     <artifactId>parentpom</artifactId>
-    <version>1.8.0</version>
+    <version>1.9.0-rc0</version>
     <relativePath>../</relativePath>
   </parent>
   <artifactId>libtensorflow_jni</artifactId>
diff --git a/tensorflow/java/maven/libtensorflow_jni_gpu/pom.xml b/tensorflow/java/maven/libtensorflow_jni_gpu/pom.xml
index 3d22d86a4970def52bf9a4a452a8131e1357341a..4c846de05ad415bedf2c14b6a07ff8d5bc6f11b8 100644
--- a/tensorflow/java/maven/libtensorflow_jni_gpu/pom.xml
+++ b/tensorflow/java/maven/libtensorflow_jni_gpu/pom.xml
@@ -6,7 +6,7 @@
   <parent>
     <groupId>org.tensorflow</groupId>
     <artifactId>parentpom</artifactId>
-    <version>1.8.0</version>
+    <version>1.9.0-rc0</version>
     <relativePath>../</relativePath>
   </parent>
   <artifactId>libtensorflow_jni_gpu</artifactId>
diff --git a/tensorflow/java/maven/pom.xml b/tensorflow/java/maven/pom.xml
index 0a09a5ea7cb96776b8296f68f599c333559a0729..acab08f58c074144472f80eeaf894e25ad8389f8 100644
--- a/tensorflow/java/maven/pom.xml
+++ b/tensorflow/java/maven/pom.xml
@@ -6,7 +6,7 @@
   <modelVersion>4.0.0</modelVersion>
   <groupId>org.tensorflow</groupId>
   <artifactId>parentpom</artifactId>
-  <version>1.8.0</version>
+  <version>1.9.0-rc0</version>
   <packaging>pom</packaging>
 
   <url>https://www.tensorflow.org</url>
@@ -32,6 +32,8 @@
     <module>libtensorflow_jni_gpu</module>
     <module>tensorflow</module>
     <module>proto</module>
+    <module>hadoop</module>
+    <module>spark-connector</module>
   </modules>
 
   <!-- Two profiles are used:
diff --git a/tensorflow/java/maven/proto/pom.xml b/tensorflow/java/maven/proto/pom.xml
index 77ec6a0ddbab2749119a015094e47a9570e110e2..eb0a952c7d2f8960a387fa63e6c257e33b80bcbb 100644
--- a/tensorflow/java/maven/proto/pom.xml
+++ b/tensorflow/java/maven/proto/pom.xml
@@ -6,7 +6,7 @@
   <parent>
     <groupId>org.tensorflow</groupId>
     <artifactId>parentpom</artifactId>
-    <version>1.8.0</version>
+    <version>1.9.0-rc0</version>
     <relativePath>../</relativePath>
   </parent>
   <artifactId>proto</artifactId>
@@ -16,7 +16,7 @@
     <dependency>
       <groupId>com.google.protobuf</groupId>
       <artifactId>protobuf-java</artifactId>
-      <version>3.3.1</version>
+      <version>3.5.1</version>
     </dependency>
   </dependencies>
 
diff --git a/tensorflow/java/maven/run_inside_container.sh b/tensorflow/java/maven/run_inside_container.sh
index 6136ccfdfb92d6a71c440b23dc0a13ebe86c52e6..2e771064e4a9a9ca4c50f5ecf8cae91cf8b5ce6c 100644
--- a/tensorflow/java/maven/run_inside_container.sh
+++ b/tensorflow/java/maven/run_inside_container.sh
@@ -19,6 +19,7 @@
 
 
 RELEASE_URL_PREFIX="https://storage.googleapis.com/tensorflow/libtensorflow"
+TF_ECOSYSTEM_URL="https://github.com/tensorflow/ecosystem.git"
 
 # By default we deploy to both ossrh and bintray. These two
 # environment variables can be set to skip either repository.
@@ -31,7 +32,7 @@ if [[ "${TF_VERSION}" == *"-SNAPSHOT" ]]; then
   # Bintray does not allow snapshots.
   DEPLOY_BINTRAY="false"
 fi
-PROTOC_RELEASE_URL="https://github.com/google/protobuf/releases/download/v3.3.0/protoc-3.3.0-linux-x86_64.zip"
+PROTOC_RELEASE_URL="https://github.com/google/protobuf/releases/download/v3.5.1/protoc-3.5.1-linux-x86_64.zip"
 if [[ "${DEPLOY_BINTRAY}" != "true" && "${DEPLOY_OSSRH}" != "true" ]]; then
   echo "Must deploy to at least one of Bintray or OSSRH" >&2
   exit 2
@@ -44,7 +45,9 @@ clean() {
   # (though if run inside a clean docker container, there won't be any dirty
   # artifacts lying around)
   mvn -q clean
-  rm -rf libtensorflow_jni/src libtensorflow_jni/target libtensorflow_jni_gpu/src libtensorflow_jni_gpu/target libtensorflow/src libtensorflow/target tensorflow-android/target
+  rm -rf libtensorflow_jni/src libtensorflow_jni/target libtensorflow_jni_gpu/src libtensorflow_jni_gpu/target \
+    libtensorflow/src libtensorflow/target tensorflow-android/target proto/src proto/target \
+    hadoop/src hadoop/target spark-connector/src spark-connector/target
 }
 
 update_version_in_pom() {
@@ -183,6 +186,43 @@ generate_java_protos() {
   rm -rf "${DIR}/proto/tmp"
 }
 
+
+# Download the TensorFlow ecosystem source from git.
+# The pom files from this repo do not inherit from the parent pom so the maven version
+# is updated for each module.
+download_tf_ecosystem() {
+  ECOSYSTEM_DIR="/tmp/tensorflow-ecosystem"
+  HADOOP_DIR="${DIR}/hadoop"
+  SPARK_DIR="${DIR}/spark-connector"
+
+  # Clean any previous attempts
+  rm -rf "${ECOSYSTEM_DIR}"
+
+  # Clone the TensorFlow ecosystem project
+  mkdir -p  "${ECOSYSTEM_DIR}"
+  cd "${ECOSYSTEM_DIR}"
+  git clone "${TF_ECOSYSTEM_URL}"
+  cd ecosystem
+  git checkout r${TF_VERSION}
+
+  # Copy the TensorFlow Hadoop source
+  cp -r "${ECOSYSTEM_DIR}/ecosystem/hadoop/src" "${HADOOP_DIR}"
+  cp "${ECOSYSTEM_DIR}/ecosystem/hadoop/pom.xml" "${HADOOP_DIR}"
+  cd "${HADOOP_DIR}"
+  update_version_in_pom
+
+  # Copy the TensorFlow Spark connector source
+  cp -r "${ECOSYSTEM_DIR}/ecosystem/spark/spark-tensorflow-connector/src" "${SPARK_DIR}"
+  cp "${ECOSYSTEM_DIR}/ecosystem/spark/spark-tensorflow-connector/pom.xml" "${SPARK_DIR}"
+  cd "${SPARK_DIR}"
+  update_version_in_pom
+
+  # Cleanup
+  rm -rf "${ECOSYSTEM_DIR}"
+
+  cd "${DIR}"
+}
+
 # Deploy artifacts using a specific profile.
 # Arguments:
 #   profile - name of selected profile.
@@ -240,7 +280,8 @@ cd "${DIR}"
 # Comment lines out appropriately if debugging/tinkering with the release
 # process.
 # gnupg2 is required for signing
-apt-get -qq update && apt-get -qqq install -y gnupg2
+apt-get -qq update && apt-get -qqq install -y gnupg2 git
+
 clean
 update_version_in_pom
 download_libtensorflow
@@ -248,6 +289,8 @@ download_libtensorflow_jni
 download_libtensorflow_jni_gpu
 update_tensorflow_android
 generate_java_protos
+download_tf_ecosystem
+
 # Build the release artifacts
 mvn verify
 # Push artifacts to repository
diff --git a/tensorflow/java/maven/spark-connector/pom.xml b/tensorflow/java/maven/spark-connector/pom.xml
new file mode 100644
index 0000000000000000000000000000000000000000..19c752d08be1deec40042bc1aa8fd1159b2f2be9
--- /dev/null
+++ b/tensorflow/java/maven/spark-connector/pom.xml
@@ -0,0 +1,24 @@
+<project
+        xmlns="http://maven.apache.org/POM/4.0.0"
+        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+        xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+    <!-- Placeholder pom which is replaced by TensorFlow ecosystem Spark pom during build -->
+    <modelVersion>4.0.0</modelVersion>
+    <description>TensorFlow TFRecord connector for Apache Spark DataFrames</description>
+    <artifactId>spark-connector</artifactId>
+    <packaging>jar</packaging>
+
+    <scm>
+        <url>https://github.com/tensorflow/ecosystem.git</url>
+        <connection>git@github.com:tensorflow/ecosystem.git</connection>
+        <developerConnection>scm:git:https://github.com/tensorflow/ecosystem.git</developerConnection>
+    </scm>
+
+    <url>https://github.com/tensorflow/ecosystem/</url>
+    <parent>
+        <groupId>org.tensorflow</groupId>
+        <artifactId>parentpom</artifactId>
+        <version>1.9.0-rc0</version>
+        <relativePath>../</relativePath>
+    </parent>
+</project>
\ No newline at end of file
diff --git a/tensorflow/java/maven/tensorflow/pom.xml b/tensorflow/java/maven/tensorflow/pom.xml
index 0df1f2814906e548855522335f710e9702f8bb2a..48668a47f2839e06eba774cae75b03663985ea28 100644
--- a/tensorflow/java/maven/tensorflow/pom.xml
+++ b/tensorflow/java/maven/tensorflow/pom.xml
@@ -6,7 +6,7 @@
   <parent>
     <groupId>org.tensorflow</groupId>
     <artifactId>parentpom</artifactId>
-    <version>1.8.0</version>
+    <version>1.9.0-rc0</version>
     <relativePath>../</relativePath>
   </parent>
   <artifactId>tensorflow</artifactId>
diff --git a/tensorflow/java/src/gen/cc/java_defs.h b/tensorflow/java/src/gen/cc/java_defs.h
index 62575f6683089b1e4a62cedb3639570e1711b8a6..f5f54bf4d31af159624c668f1abb106f68944737 100644
--- a/tensorflow/java/src/gen/cc/java_defs.h
+++ b/tensorflow/java/src/gen/cc/java_defs.h
@@ -26,12 +26,12 @@ namespace java {
 
 // An enumeration of different modifiers commonly used in Java
 enum Modifier {
-  PACKAGE   = 0,
-  PUBLIC    = (1 << 0),
+  PACKAGE = 0,
+  PUBLIC = (1 << 0),
   PROTECTED = (1 << 1),
-  PRIVATE   = (1 << 2),
-  STATIC    = (1 << 3),
-  FINAL     = (1 << 4),
+  PRIVATE = (1 << 2),
+  STATIC = (1 << 3),
+  FINAL = (1 << 4),
 };
 
 class Annotation;
@@ -75,12 +75,8 @@ class Type {
     // Reflection API does
     return Type(Type::PRIMITIVE, "void");
   }
-  static Type Generic(const string& name) {
-    return Type(Type::GENERIC, name);
-  }
-  static Type Wildcard() {
-    return Type(Type::GENERIC, "");
-  }
+  static Type Generic(const string& name) { return Type(Type::GENERIC, name); }
+  static Type Wildcard() { return Type(Type::GENERIC, ""); }
   static Type Class(const string& name, const string& package = "") {
     return Type(Type::CLASS, name, package);
   }
@@ -226,9 +222,7 @@ class Method {
 // A definition of a documentation bloc for a Java element (JavaDoc)
 class Javadoc {
  public:
-  static Javadoc Create(const string& brief = "") {
-    return Javadoc(brief);
-  }
+  static Javadoc Create(const string& brief = "") { return Javadoc(brief); }
   const string& brief() const { return brief_; }
   const string& details() const { return details_; }
   Javadoc& details(const string& details) {
diff --git a/tensorflow/java/src/gen/cc/op_gen_main.cc b/tensorflow/java/src/gen/cc/op_gen_main.cc
index 6c35cd9595a0aafd2f6db35dadf899c19047e1f1..0d9e0883af262ee1f262a5e1308cb9df8763488d 100644
--- a/tensorflow/java/src/gen/cc/op_gen_main.cc
+++ b/tensorflow/java/src/gen/cc/op_gen_main.cc
@@ -41,7 +41,7 @@ const char kUsageHeader[] =
     "using an appropriate annotation processor.\n\n"
     "The '--base_package' overrides the default parent package under which "
     "the generated subpackage and classes are to be located.\n\n"
-    "Finally, the `--api_dirs` argument takes a list of comma-seperated "
+    "Finally, the `--api_dirs` argument takes a list of comma-separated "
     "directories of API definitions can be provided to override default\n"
     "values found in the ops definitions. Directories are ordered by priority "
     "(the last having precedence over the first).\n\n";
@@ -55,10 +55,12 @@ int main(int argc, char* argv[]) {
   tensorflow::string api_dirs_str;
   std::vector<tensorflow::Flag> flag_list = {
       tensorflow::Flag("output_dir", &output_dir,
-          "Root directory into which output files are generated"),
-      tensorflow::Flag("base_package", &base_package,
+                       "Root directory into which output files are generated"),
+      tensorflow::Flag(
+          "base_package", &base_package,
           "Package parent to the generated subpackage and classes"),
-      tensorflow::Flag("api_dirs", &api_dirs_str,
+      tensorflow::Flag(
+          "api_dirs", &api_dirs_str,
           "List of directories that contains the ops api definitions")};
   tensorflow::string usage = tensorflow::java::kUsageHeader;
   usage += tensorflow::Flags::Usage(argv[0], flag_list);
diff --git a/tensorflow/java/src/gen/cc/op_generator.cc b/tensorflow/java/src/gen/cc/op_generator.cc
index 284f675c9466a9b571bfb75a151795d465845fb6..9b171f66ec741aa747fa365c13144a4b8b1aecbc 100644
--- a/tensorflow/java/src/gen/cc/op_generator.cc
+++ b/tensorflow/java/src/gen/cc/op_generator.cc
@@ -13,43 +13,44 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#include <string>
-#include <map>
-#include <vector>
 #include <list>
+#include <map>
 #include <memory>
 #include <set>
+#include <string>
+#include <vector>
 
+#include "tensorflow/core/framework/op_gen_lib.h"
 #include "tensorflow/core/lib/core/errors.h"
-#include "tensorflow/core/lib/strings/str_util.h"
 #include "tensorflow/core/lib/io/path.h"
-#include "tensorflow/core/platform/logging.h"
+#include "tensorflow/core/lib/strings/str_util.h"
 #include "tensorflow/core/platform/env.h"
-#include "tensorflow/core/framework/op_gen_lib.h"
+#include "tensorflow/core/platform/logging.h"
 #include "tensorflow/java/src/gen/cc/java_defs.h"
-#include "tensorflow/java/src/gen/cc/source_writer.h"
 #include "tensorflow/java/src/gen/cc/op_generator.h"
 #include "tensorflow/java/src/gen/cc/op_specs.h"
+#include "tensorflow/java/src/gen/cc/source_writer.h"
 
 namespace tensorflow {
 namespace java {
 namespace {
 
 const char* kLicense =
-  "/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.\n"
-  "\n"
-  "Licensed under the Apache License, Version 2.0 (the \"License\");\n"
-  "you may not use this file except in compliance with the License.\n"
-  "You may obtain a copy of the License at\n"
-  "\n"
-  "    http://www.apache.org/licenses/LICENSE-2.0\n"
-  "\n"
-  "Unless required by applicable law or agreed to in writing, software\n"
-  "distributed under the License is distributed on an \"AS IS\" BASIS,\n"
-  "WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n"
-  "See the License for the specific language governing permissions and\n"
-  "limitations under the License.\n"
-  "=======================================================================*/\n";
+    "/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.\n"
+    "\n"
+    "Licensed under the Apache License, Version 2.0 (the \"License\");\n"
+    "you may not use this file except in compliance with the License.\n"
+    "You may obtain a copy of the License at\n"
+    "\n"
+    "    http://www.apache.org/licenses/LICENSE-2.0\n"
+    "\n"
+    "Unless required by applicable law or agreed to in writing, software\n"
+    "distributed under the License is distributed on an \"AS IS\" BASIS,\n"
+    "WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n"
+    "See the License for the specific language governing permissions and\n"
+    "limitations under the License.\n"
+    "=======================================================================*/"
+    "\n";
 
 // There is three different modes to render an op class, depending on the
 // number and type of outputs it has:
@@ -64,20 +65,16 @@ const char* kLicense =
 //          allowing an instance to be passed directly as a list input to
 //          another operation
 //
-enum RenderMode {
-  DEFAULT,
-  OPERAND,
-  LIST_OPERAND
-};
+enum RenderMode { DEFAULT, OPERAND, LIST_OPERAND };
 
 void AddArgument(const Variable& var, const string& description,
-    Method* method_out, Javadoc* javadoc_out) {
+                 Method* method_out, Javadoc* javadoc_out) {
   method_out->add_argument(var);
   javadoc_out->add_param_tag(var.name(), description);
 }
 
 void CollectOpDependencies(const OpSpec& op, RenderMode mode,
-    std::list<Type>* out) {
+                           std::list<Type>* out) {
   out->push_back(Type::Class("Operation", "org.tensorflow"));
   out->push_back(Type::Class("OperationBuilder", "org.tensorflow"));
   out->push_back(Type::Class("Scope", "org.tensorflow.op"));
@@ -110,7 +107,7 @@ void CollectOpDependencies(const OpSpec& op, RenderMode mode,
 }
 
 void WriteSetAttrDirective(const AttributeSpec& attr, bool optional,
-    SourceWriter* writer) {
+                           SourceWriter* writer) {
   string var_name = optional ? "opts." + attr.var().name() : attr.var().name();
   if (attr.iterable()) {
     string array_name = attr.var().name() + "Array";
@@ -143,11 +140,11 @@ void WriteSetAttrDirective(const AttributeSpec& attr, bool optional,
 }
 
 void RenderFactoryMethods(const OpSpec& op, const Type& op_class,
-    SourceWriter* writer) {
+                          SourceWriter* writer) {
   Method factory = Method::Create("create", op_class);
-  Javadoc factory_doc = Javadoc::Create(
-      "Factory method to create a class to wrap a new " + op_class.name()
-      + " operation to the graph.");
+  Javadoc factory_doc =
+      Javadoc::Create("Factory method to create a class to wrap a new " +
+                      op_class.name() + " operation to the graph.");
   Variable scope =
       Variable::Create("scope", Type::Class("Scope", "org.tensorflow.op"));
   AddArgument(scope, "current graph scope", &factory, &factory_doc);
@@ -159,23 +156,23 @@ void RenderFactoryMethods(const OpSpec& op, const Type& op_class,
   }
   if (!op.optional_attributes().empty()) {
     AddArgument(Variable::Varargs("options", Type::Class("Options")),
-        "carries optional attributes values", &factory, &factory_doc);
+                "carries optional attributes values", &factory, &factory_doc);
   }
   factory_doc.add_tag("return", "a new instance of " + op_class.name());
 
-  writer->BeginMethod(factory, PUBLIC|STATIC, &factory_doc);
-  writer->Append("OperationBuilder opBuilder = scope.graph().opBuilder(\""
-      + op.graph_op_name() + "\", scope.makeOpName(\""
-      + op_class.name() + "\"));");
+  writer->BeginMethod(factory, PUBLIC | STATIC, &factory_doc);
+  writer->Append("OperationBuilder opBuilder = scope.graph().opBuilder(\"" +
+                 op.graph_op_name() + "\", scope.makeOpName(\"" +
+                 op_class.name() + "\"));");
   writer->EndLine();
   for (const ArgumentSpec& input : op.inputs()) {
     if (input.iterable()) {
-      writer->Append("opBuilder.addInputList(Operands.asOutputs("
-          + input.var().name() + "));");
+      writer->Append("opBuilder.addInputList(Operands.asOutputs(" +
+                     input.var().name() + "));");
       writer->EndLine();
     } else {
-      writer->Append("opBuilder.addInput(" + input.var().name()
-          + ".asOutput());");
+      writer->Append("opBuilder.addInput(" + input.var().name() +
+                     ".asOutput());");
       writer->EndLine();
     }
   }
@@ -200,7 +197,7 @@ void RenderFactoryMethods(const OpSpec& op, const Type& op_class,
 }
 
 void RenderConstructor(const OpSpec& op, const Type& op_class,
-    SourceWriter* writer) {
+                       SourceWriter* writer) {
   Variable operation =
       Variable::Create("operation", Type::Class("Operation", "org.tensorflow"));
   Method constructor = Method::ConstructorFor(op_class).add_argument(operation);
@@ -214,15 +211,14 @@ void RenderConstructor(const OpSpec& op, const Type& op_class,
   writer->BeginMethod(constructor, PRIVATE)
       .Append("super(operation);")
       .EndLine();
-  if (op.outputs().size() > 0) {
-    writer->Append("int outputIdx = 0;")
-        .EndLine();
+  if (!op.outputs().empty()) {
+    writer->Append("int outputIdx = 0;").EndLine();
     for (const ArgumentSpec& output : op.outputs()) {
       if (output.iterable()) {
         string var_length = output.var().name() + "Length";
         writer->Append("int " + var_length)
-            .Append(" = operation.outputListLength(\"" + output.op_def_name()
-                + "\");")
+            .Append(" = operation.outputListLength(\"" + output.op_def_name() +
+                    "\");")
             .EndLine()
             .Append(output.var().name() + " = Arrays.asList(");
         if (!output.type().wildcard()) {
@@ -235,8 +231,8 @@ void RenderConstructor(const OpSpec& op, const Type& op_class,
             .Append("outputIdx += " + var_length + ";")
             .EndLine();
       } else {
-        writer->Append(output.var().name()
-                + " = operation.output(outputIdx++);")
+        writer
+            ->Append(output.var().name() + " = operation.output(outputIdx++);")
             .EndLine();
       }
     }
@@ -246,13 +242,12 @@ void RenderConstructor(const OpSpec& op, const Type& op_class,
 
 void RenderGettersAndSetters(const OpSpec& op, SourceWriter* writer) {
   for (const AttributeSpec& attr : op.optional_attributes()) {
-    Method setter =
-        Method::Create(attr.var().name(), Type::Class("Options"));
+    Method setter = Method::Create(attr.var().name(), Type::Class("Options"));
     Javadoc setter_doc = Javadoc::Create();
     AddArgument(attr.var(), attr.description(), &setter, &setter_doc);
-    writer->BeginMethod(setter, PUBLIC|STATIC, &setter_doc)
-        .Append("return new Options()." + attr.var().name() + "("
-            + attr.var().name() + ");")
+    writer->BeginMethod(setter, PUBLIC | STATIC, &setter_doc)
+        .Append("return new Options()." + attr.var().name() + "(" +
+                attr.var().name() + ");")
         .EndLine()
         .EndMethod();
   }
@@ -267,15 +262,16 @@ void RenderGettersAndSetters(const OpSpec& op, SourceWriter* writer) {
 }
 
 void RenderInterfaceImpl(const OpSpec& op, RenderMode mode,
-    SourceWriter* writer) {
+                         SourceWriter* writer) {
   ArgumentSpec output = op.outputs().front();
 
   if (mode == OPERAND) {
     bool cast2obj = output.type().wildcard();
-    Type return_type = Type::Class("Output", "org.tensorflow")
-        .add_parameter(cast2obj ? Type::Class("Object") : output.type());
+    Type return_type =
+        Type::Class("Output", "org.tensorflow")
+            .add_parameter(cast2obj ? Type::Class("Object") : output.type());
     Method as_output = Method::Create("asOutput", return_type)
-        .add_annotation(Annotation::Create("Override"));
+                           .add_annotation(Annotation::Create("Override"));
     if (cast2obj) {
       as_output.add_annotation(
           Annotation::Create("SuppressWarnings").attributes("\"unchecked\""));
@@ -286,9 +282,7 @@ void RenderInterfaceImpl(const OpSpec& op, RenderMode mode,
     } else {
       writer->Append("return ");
     }
-    writer->Append(output.var().name() + ";")
-        .EndLine()
-        .EndMethod();
+    writer->Append(output.var().name() + ";").EndLine().EndMethod();
 
   } else if (mode == LIST_OPERAND) {
     Type operand = Type::Interface("Operand", "org.tensorflow");
@@ -297,12 +291,13 @@ void RenderInterfaceImpl(const OpSpec& op, RenderMode mode,
     } else {
       operand.add_parameter(output.type());
     }
-    Type return_type = Type::Interface("Iterator", "java.util")
-        .add_parameter(operand);
-    Method iterator = Method::Create("iterator", return_type)
-        .add_annotation(Annotation::Create("Override"))
-        .add_annotation(Annotation::Create("SuppressWarnings")
-            .attributes("{\"rawtypes\", \"unchecked\"}"));
+    Type return_type =
+        Type::Interface("Iterator", "java.util").add_parameter(operand);
+    Method iterator =
+        Method::Create("iterator", return_type)
+            .add_annotation(Annotation::Create("Override"))
+            .add_annotation(Annotation::Create("SuppressWarnings")
+                                .attributes("{\"rawtypes\", \"unchecked\"}"));
     // cast the output list using a raw List
     writer->BeginMethod(iterator, PUBLIC)
         .Append("return (" + return_type.name() + ") ")
@@ -313,10 +308,10 @@ void RenderInterfaceImpl(const OpSpec& op, RenderMode mode,
 }
 
 void RenderOptionsClass(const OpSpec& op, const Type& op_class,
-    SourceWriter* writer) {
+                        SourceWriter* writer) {
   Type options_class = Type::Class("Options");
-  Javadoc options_doc = Javadoc::Create(
-      "Optional attributes for {@link " + op_class.canonical_name() + "}");
+  Javadoc options_doc = Javadoc::Create("Optional attributes for {@link " +
+                                        op_class.canonical_name() + "}");
   writer->BeginInnerType(options_class, PUBLIC | STATIC, &options_doc);
   for (const AttributeSpec& attr : op.optional_attributes()) {
     Method setter = Method::Create(attr.var().name(), options_class);
@@ -339,24 +334,27 @@ void RenderOptionsClass(const OpSpec& op, const Type& op_class,
 }
 
 inline Type ClassOf(const EndpointSpec& endpoint, const string& base_package) {
-  return Type::Class(endpoint.name(),
+  return Type::Class(
+      endpoint.name(),
       base_package + "." + str_util::Lowercase(endpoint.package()));
 }
 
 void GenerateOp(const OpSpec& op, const EndpointSpec& endpoint,
-    const string& base_package, const string& output_dir, Env* env) {
-  Type op_class(ClassOf(endpoint, base_package)
-      .add_supertype(Type::Class("PrimitiveOp", "org.tensorflow.op")));
+                const string& base_package, const string& output_dir,
+                Env* env) {
+  Type op_class(
+      ClassOf(endpoint, base_package)
+          .add_supertype(Type::Class("PrimitiveOp", "org.tensorflow.op")));
   Javadoc op_javadoc(endpoint.javadoc());
 
   // op interfaces
   RenderMode mode = DEFAULT;
   if (op.outputs().size() == 1) {
     const ArgumentSpec& output = op.outputs().front();
-    Type operand_type(output.type().wildcard() ?
-        Type::Class("Object") : output.type());
+    Type operand_type(output.type().wildcard() ? Type::Class("Object")
+                                               : output.type());
     Type operand_inf(Type::Interface("Operand", "org.tensorflow")
-        .add_parameter(operand_type));
+                         .add_parameter(operand_type));
     if (output.iterable()) {
       mode = LIST_OPERAND;
       op_class.add_supertype(Type::IterableOf(operand_inf));
@@ -368,25 +366,24 @@ void GenerateOp(const OpSpec& op, const EndpointSpec& endpoint,
   // op generic parameters
   std::set<string> generics;
   for (const ArgumentSpec& output : op.outputs()) {
-    if (output.type().kind() == Type::GENERIC && !output.type().wildcard()
-        && generics.find(output.type().name()) == generics.end()) {
+    if (output.type().kind() == Type::GENERIC && !output.type().wildcard() &&
+        generics.find(output.type().name()) == generics.end()) {
       op_class.add_parameter(output.type());
-      op_javadoc.add_param_tag("<" + output.type().name() + ">",
+      op_javadoc.add_param_tag(
+          "<" + output.type().name() + ">",
           "data type for {@code " + output.var().name() + "()} output");
       generics.insert(output.type().name());
     }
   }
   // op annotations
-  op_class.add_annotation(
-      Annotation::Create("Generated", "javax.annotation")
-          .attributes("value = \"TensorFlow Java Op Generator\""));
   if (endpoint.deprecated()) {
     op_class.add_annotation(Annotation::Create("Deprecated"));
     string explanation;
     if (!op.endpoints().front().deprecated()) {
-      explanation = "use {@link " +
-          ClassOf(op.endpoints().front(), base_package).canonical_name()
-          + "} instead";
+      explanation =
+          "use {@link " +
+          ClassOf(op.endpoints().front(), base_package).canonical_name() +
+          "} instead";
     } else {
       explanation = op.deprecation_explanation();
     }
@@ -396,19 +393,20 @@ void GenerateOp(const OpSpec& op, const EndpointSpec& endpoint,
     // expose the op in the Ops Graph API only if it is visible
     op_class.add_annotation(
         Annotation::Create("Operator", "org.tensorflow.op.annotation")
-          .attributes("group = \"" + endpoint.package() + "\""));
+            .attributes("group = \"" + endpoint.package() + "\""));
   }
   // create op class file
-  const string op_dir_name = io::JoinPath(output_dir,
-      str_util::StringReplace(op_class.package(), ".", "/", true));
+  const string op_dir_name = io::JoinPath(
+      output_dir, str_util::StringReplace(op_class.package(), ".", "/", true));
   if (!env->FileExists(op_dir_name).ok()) {
     TF_CHECK_OK(Env::Default()->RecursivelyCreateDir(op_dir_name))
         << op_dir_name;
   }
   const string op_file_name = op_class.name() + ".java";
   std::unique_ptr<tensorflow::WritableFile> op_file;
-  TF_CHECK_OK(env->NewWritableFile(
-      io::JoinPath(op_dir_name, op_file_name), &op_file)) << op_file_name;
+  TF_CHECK_OK(
+      env->NewWritableFile(io::JoinPath(op_dir_name, op_file_name), &op_file))
+      << op_file_name;
 
   // render endpoint source code
   SourceFileWriter writer(op_file.get());
@@ -416,7 +414,10 @@ void GenerateOp(const OpSpec& op, const EndpointSpec& endpoint,
   CollectOpDependencies(op, mode, &dependencies);
   writer.Write(kLicense)
       .EndLine()
-      .BeginType(op_class, PUBLIC|FINAL, &dependencies, &op_javadoc);
+      .Write("// This class has been generated, DO NOT EDIT!")
+      .EndLine()
+      .EndLine()
+      .BeginType(op_class, PUBLIC | FINAL, &dependencies, &op_javadoc);
   if (!op.optional_attributes().empty()) {
     RenderOptionsClass(op, op_class, &writer);
   }
@@ -448,7 +449,7 @@ bool CanGenerateOp(const OpDef& op_def, const ApiDef& api_def) {
 }  // namespace
 
 Status OpGenerator::Run(const OpList& op_list, const string& base_package,
-    const string& output_dir) {
+                        const string& output_dir) {
   ApiDefMap api_map(op_list);
   if (!api_dirs_.empty()) {
     // Only load api files that correspond to the requested "op_list"
diff --git a/tensorflow/java/src/gen/cc/op_generator.h b/tensorflow/java/src/gen/cc/op_generator.h
index cfe842070a77947003e8fedc1897a418f1403241..759d800ecfb5bec10b7bf8454baf5fc4c389e990 100644
--- a/tensorflow/java/src/gen/cc/op_generator.h
+++ b/tensorflow/java/src/gen/cc/op_generator.h
@@ -21,7 +21,7 @@ limitations under the License.
 
 #include "tensorflow/core/framework/op_def.pb.h"
 #include "tensorflow/core/framework/api_def.pb.h"
-#include "tensorflow/core/platform/env.h"
+#include "tensorflow/core/framework/op_def.pb.h"
 #include "tensorflow/core/lib/core/status.h"
 #include "tensorflow/java/src/gen/cc/op_specs.h"
 
@@ -37,14 +37,15 @@ namespace java {
 class OpGenerator {
  public:
   explicit OpGenerator(const std::vector<string>& api_dirs,
-      Env* env = Env::Default()) : api_dirs_(api_dirs), env_(env) {}
+                       Env* env = Env::Default())
+      : api_dirs_(api_dirs), env_(env) {}
 
   // Generates wrappers for the given list of 'ops'.
   //
   // Output files are generated in <output_dir>/<base_package>/<op_package>,
   // where 'op_package' is derived from ops endpoints.
   Status Run(const OpList& op_list, const string& base_package,
-      const string& output_dir);
+             const string& output_dir);
 
  private:
   const std::vector<string> api_dirs_;
diff --git a/tensorflow/java/src/gen/cc/source_writer.cc b/tensorflow/java/src/gen/cc/source_writer.cc
index 56806cbb6dc5da94dc672828c2ff2dfe001b71c9..8e5fba7e32f096504f2aace6e9943b6f7281be31 100644
--- a/tensorflow/java/src/gen/cc/source_writer.cc
+++ b/tensorflow/java/src/gen/cc/source_writer.cc
@@ -16,6 +16,7 @@ limitations under the License.
 #include <string>
 #include <algorithm>
 #include <list>
+#include <string>
 
 #include "tensorflow/java/src/gen/cc/source_writer.h"
 
@@ -123,7 +124,7 @@ SourceWriter& SourceWriter::EndBlock() {
 }
 
 SourceWriter& SourceWriter::BeginMethod(const Method& method, int modifiers,
-    const Javadoc* javadoc) {
+                                        const Javadoc* javadoc) {
   GenericNamespace* generic_namespace = PushGenericNamespace(modifiers);
   if (!method.constructor()) {
     generic_namespace->Visit(method.return_type());
@@ -165,7 +166,8 @@ SourceWriter& SourceWriter::EndMethod() {
 }
 
 SourceWriter& SourceWriter::BeginType(const Type& type, int modifiers,
-    const std::list<Type>* extra_dependencies, const Javadoc* javadoc) {
+                                      const std::list<Type>* extra_dependencies,
+                                      const Javadoc* javadoc) {
   if (!type.package().empty()) {
     Append("package ").Append(type.package()).Append(";").EndLine();
   }
@@ -186,7 +188,7 @@ SourceWriter& SourceWriter::BeginType(const Type& type, int modifiers,
 }
 
 SourceWriter& SourceWriter::BeginInnerType(const Type& type, int modifiers,
-    const Javadoc* javadoc) {
+                                           const Javadoc* javadoc) {
   GenericNamespace* generic_namespace = PushGenericNamespace(modifiers);
   generic_namespace->Visit(type);
   EndLine();
@@ -226,7 +228,7 @@ SourceWriter& SourceWriter::EndType() {
 }
 
 SourceWriter& SourceWriter::WriteField(const Variable& field, int modifiers,
-    const Javadoc* javadoc) {
+                                       const Javadoc* javadoc) {
   // If present, write field javadoc only as one brief line
   if (javadoc != nullptr && !javadoc->brief().empty()) {
     Append("/** ").Append(javadoc->brief()).Append(" */").EndLine();
@@ -345,8 +347,8 @@ void SourceWriter::TypeVisitor::Visit(const Type& type) {
 
 void SourceWriter::GenericNamespace::DoVisit(const Type& type) {
   // ignore non-generic parameters, wildcards and generics already declared
-  if (type.kind() == Type::GENERIC && !type.wildcard()
-      && generic_names_.find(type.name()) == generic_names_.end()) {
+  if (type.kind() == Type::GENERIC && !type.wildcard() &&
+      generic_names_.find(type.name()) == generic_names_.end()) {
     declared_types_.push_back(&type);
     generic_names_.insert(type.name());
   }
diff --git a/tensorflow/java/src/gen/cc/source_writer.h b/tensorflow/java/src/gen/cc/source_writer.h
index 1f0febe9a3135a0b9cc07a406db97b3ccc0182ee..de0113bd5b7092bfae0e2dd15fa1c7a26c491c9e 100644
--- a/tensorflow/java/src/gen/cc/source_writer.h
+++ b/tensorflow/java/src/gen/cc/source_writer.h
@@ -93,7 +93,7 @@ class SourceWriter {
   // This method appends a new opening brace to the current data and indent the
   // next lines according to Google Java Style Guide. The block can optionally
   // be preceded by an expression (e.g. Append("if(true)").BeginBlock();)
-  SourceWriter& BeginBlock(const string& expr = "");
+  SourceWriter& BeginBlock(const string& expression = "");
 
   // Ends the current block of source code.
   //
@@ -108,7 +108,7 @@ class SourceWriter {
   // in parameter to define the access scope of this method and, optionally,
   // a Javadoc.
   SourceWriter& BeginMethod(const Method& method, int modifiers,
-      const Javadoc* javadoc = nullptr);
+                            const Javadoc* javadoc = nullptr);
 
   // Ends the current method.
   //
@@ -125,9 +125,9 @@ class SourceWriter {
   //
   // If not null, all types found in the 'extra_dependencies' list will be
   // imported before declaring the new type.
-  SourceWriter& BeginType(const Type& clazz, int modifiers,
-      const std::list<Type>* extra_dependencies = nullptr,
-      const Javadoc* javadoc = nullptr);
+  SourceWriter& BeginType(const Type& type, int modifiers,
+                          const std::list<Type>* extra_dependencies = nullptr,
+                          const Javadoc* javadoc = nullptr);
 
   // Begins to write a new inner type.
   //
@@ -136,7 +136,7 @@ class SourceWriter {
   // in parameter to define the accesses and the scope of this type and,
   // optionally, a Javadoc.
   SourceWriter& BeginInnerType(const Type& type, int modifiers,
-      const Javadoc* javadoc = nullptr);
+                               const Javadoc* javadoc = nullptr);
 
   // Ends the current type.
   //
@@ -150,7 +150,7 @@ class SourceWriter {
   // or BeginInnerType()). Modifiers are also be passed in parameter to define
   // the accesses and the scope of this field and, optionally, a Javadoc.
   SourceWriter& WriteField(const Variable& field, int modifiers,
-      const Javadoc* javadoc = nullptr);
+                           const Javadoc* javadoc = nullptr);
 
  protected:
   virtual void DoAppend(const StringPiece& str) = 0;
diff --git a/tensorflow/java/src/gen/cc/source_writer_test.cc b/tensorflow/java/src/gen/cc/source_writer_test.cc
index b9a5fee9bea1660dbfdf1a7b6920133f0bb85577..fb8fc64dffa309e4df5ef0e9a0c631fab73c397b 100644
--- a/tensorflow/java/src/gen/cc/source_writer_test.cc
+++ b/tensorflow/java/src/gen/cc/source_writer_test.cc
@@ -245,12 +245,17 @@ TEST(StreamTest, Types) {
   SourceBufferWriter writer;
   Type generic = Type::Generic("T").add_supertype(Type::Class("Number"));
 
-  writer.AppendType(Type::Int()).Append(", ")
-        .AppendType(Type::Class("String")).Append(", ")
-        .AppendType(generic).Append(", ")
-        .AppendType(Type::ListOf(generic)).Append(", ")
-        .AppendType(Type::ListOf(Type::IterableOf(generic))).Append(", ")
-        .AppendType(Type::ListOf(Type::Wildcard()));
+  writer.AppendType(Type::Int())
+      .Append(", ")
+      .AppendType(Type::Class("String"))
+      .Append(", ")
+      .AppendType(generic)
+      .Append(", ")
+      .AppendType(Type::ListOf(generic))
+      .Append(", ")
+      .AppendType(Type::ListOf(Type::IterableOf(generic)))
+      .Append(", ")
+      .AppendType(Type::ListOf(Type::Wildcard()));
 
   const char* expected =
       "int, String, T, List<T>, List<Iterable<T>>, List<?>";
@@ -314,7 +319,7 @@ TEST(WriteType, AnnotatedAndDocumentedClass) {
   SourceBufferWriter writer;
   Type clazz = Type::Class("Test", "org.tensorflow");
   Javadoc clazz_doc = Javadoc::Create("Javadoc test")
-      .details("This is a\nmultiline description.");
+                          .details("This is a\nmultiline description.");
   clazz.add_annotation(Annotation::Create("Bean"));
   clazz.add_annotation(Annotation::Create("SuppressWarnings")
       .attributes("\"rawtypes\""));
@@ -380,10 +385,10 @@ TEST(WriteType, ParameterizedClassFields) {
   Javadoc field3_doc = Javadoc::Create("This variable is documented");
 
   writer.BeginType(clazz, PUBLIC)
-          .WriteField(field1, STATIC | PUBLIC | FINAL)
-          .WriteField(field2, PRIVATE)
-          .WriteField(field3, PRIVATE, &field3_doc)
-        .EndType();
+      .WriteField(field1, STATIC | PUBLIC | FINAL)
+      .WriteField(field2, PRIVATE)
+      .WriteField(field3, PRIVATE, &field3_doc)
+      .EndType();
 
   const char* expected =
       "package org.tensorflow;\n\n"
@@ -402,9 +407,9 @@ TEST(WriteType, SimpleInnerClass) {
   Type inner_class = Type::Class("InnerTest");
 
   writer.BeginType(clazz, PUBLIC)
-          .BeginInnerType(inner_class, PUBLIC)
-          .EndType()
-        .EndType();
+      .BeginInnerType(inner_class, PUBLIC)
+      .EndType()
+      .EndType();
 
   const char* expected =
       "package org.tensorflow;\n\n"
@@ -425,9 +430,9 @@ TEST(WriteType, StaticParameterizedInnerClass) {
   inner_class.add_parameter(type_t);
 
   writer.BeginType(clazz, PUBLIC)
-          .BeginInnerType(inner_class, PUBLIC | STATIC)
-          .EndType()
-        .EndType();
+      .BeginInnerType(inner_class, PUBLIC | STATIC)
+      .EndType()
+      .EndType();
 
   const char* expected =
       "package org.tensorflow;\n\n"
@@ -445,8 +450,9 @@ TEST(WriteMethod, SimpleMethod) {
   Method method = Method::Create("doNothing", Type::Void());
 
   writer.BeginType(clazz, PUBLIC)
-          .BeginMethod(method, PUBLIC).EndMethod()
-        .EndType();
+      .BeginMethod(method, PUBLIC)
+      .EndMethod()
+      .EndType();
 
   const char* expected =
       "package org.tensorflow;\n\n"
@@ -462,15 +468,17 @@ TEST(WriteMethod, AnnotatedAndDocumentedMethod) {
   SourceBufferWriter writer;
   Type clazz = Type::Class("Test", "org.tensorflow");
   Method method = Method::Create("doNothing", Type::Void());
-  Javadoc method_doc = Javadoc::Create("Javadoc test")
-      .details("This method has a\nmultiline description.");
+  Javadoc method_doc =
+      Javadoc::Create("Javadoc test")
+          .details("This method has a\nmultiline description.");
   method.add_annotation(Annotation::Create("Override"));
   method.add_annotation(Annotation::Create("SuppressWarnings")
       .attributes("\"rawtypes\""));
 
   writer.BeginType(clazz, PUBLIC)
-          .BeginMethod(method, PUBLIC, &method_doc).EndMethod()
-        .EndType();
+      .BeginMethod(method, PUBLIC, &method_doc)
+      .EndMethod()
+      .EndType();
 
   const char* expected =
       "package org.tensorflow;\n\n"
@@ -497,20 +505,23 @@ TEST(WriteMethod, DocumentedMethodWithArguments) {
   Method method = Method::Create("boolToInt", Type::Int());
   method.add_argument(Variable::Create("b", Type::Boolean()));
   method.add_argument(reverse);
-  Javadoc method_doc = Javadoc::Create("Converts a boolean to an int")
-      .details("This method will convert\na boolean to an int")
-      .add_param_tag(reverse.name(), "if true, value is reversed")
-      .add_tag("return", "int value for this boolean");
+  Javadoc method_doc =
+      Javadoc::Create("Converts a boolean to an int")
+          .details("This method will convert\na boolean to an int")
+          .add_param_tag(reverse.name(), "if true, value is reversed")
+          .add_tag("return", "int value for this boolean");
 
   writer.BeginType(clazz, PUBLIC)
-          .BeginMethod(method, PUBLIC, &method_doc)
-            .Append("if (b && !reverse)")
-            .BeginBlock()
-              .Append("return 1;").EndLine()
-            .EndBlock()
-          .Append("return 0;").EndLine()
-          .EndMethod()
-        .EndType();
+      .BeginMethod(method, PUBLIC, &method_doc)
+      .Append("if (b && !reverse)")
+      .BeginBlock()
+      .Append("return 1;")
+      .EndLine()
+      .EndBlock()
+      .Append("return 0;")
+      .EndLine()
+      .EndMethod()
+      .EndType();
 
   const char* expected =
       "package org.tensorflow;\n\n"
@@ -543,10 +554,11 @@ TEST(WriteMethod, ParameterizedMethod) {
   Method method = Method::Create("doNothing", type_t);
 
   writer.BeginType(clazz, PUBLIC)
-          .BeginMethod(method, PUBLIC)
-            .Append("return null;").EndLine()
-          .EndMethod()
-        .EndType();
+      .BeginMethod(method, PUBLIC)
+      .Append("return null;")
+      .EndLine()
+      .EndMethod()
+      .EndType();
 
   const char* expected =
       "package org.tensorflow;\n\n"
@@ -567,10 +579,11 @@ TEST(WriteMethod, StaticParameterizedMethod) {
   Method method = Method::Create("doNothing", type_t);
 
   writer.BeginType(clazz, PUBLIC)
-          .BeginMethod(method, PUBLIC | STATIC)
-            .Append("return null;").EndLine()
-          .EndMethod()
-        .EndType();
+      .BeginMethod(method, PUBLIC | STATIC)
+      .Append("return null;")
+      .EndLine()
+      .EndMethod()
+      .EndType();
 
   const char* expected =
       "package org.tensorflow;\n\n"
diff --git a/tensorflow/python/BUILD b/tensorflow/python/BUILD
index 7201e12c50c5f6e1f7c84be1986b93349a7aa01e..a06b536f5b24334963004ef9c321d59dc1685a44 100644
--- a/tensorflow/python/BUILD
+++ b/tensorflow/python/BUILD
@@ -71,6 +71,7 @@ py_library(
     visibility = [
         "//tensorflow:__pkg__",
         "//tensorflow/python/tools:__pkg__",
+        "//tensorflow/tools/api/generator:__pkg__",
     ],
     deps = [
         ":array_ops",
@@ -79,6 +80,7 @@ py_library(
         ":check_ops",
         ":client",
         ":client_testlib",
+        ":collective_ops",
         ":confusion_matrix",
         ":control_flow_ops",
         ":cudnn_rnn_ops_gen",
@@ -255,7 +257,7 @@ cc_library(
         "//tensorflow/core:framework",
         "//tensorflow/core:lib",
         "//third_party/py/numpy:headers",
-        "//util/python:python_headers",
+        "//third_party/python_runtime:headers",
     ],
 )
 
@@ -268,7 +270,7 @@ cc_library(
         ":safe_ptr",
         "//tensorflow/core:framework",
         "//tensorflow/core:lib",
-        "//util/python:python_headers",
+        "//third_party/python_runtime:headers",
     ],
 )
 
@@ -292,7 +294,7 @@ cc_library(
     deps = [
         "//tensorflow/c:c_api",
         "//tensorflow/core:lib",
-        "//util/python:python_headers",
+        "//third_party/python_runtime:headers",
     ],
 )
 
@@ -315,7 +317,7 @@ cc_library(
         ":safe_ptr",
         "//tensorflow/core:lib",
         "//tensorflow/core:lib_internal",
-        "//util/python:python_headers",
+        "//third_party/python_runtime:headers",
     ],
 )
 
@@ -337,7 +339,7 @@ cc_library(
         "//tensorflow/core:script_ops_op_lib",
         "//tensorflow/python/eager:pywrap_tfe_lib",
         "//third_party/py/numpy:headers",
-        "//util/python:python_headers",
+        "//third_party/python_runtime:headers",
     ],
 )
 
@@ -348,7 +350,7 @@ cc_library(
     deps = [
         "//tensorflow/c:c_api",
         "//tensorflow/c/eager:c_api",
-        "//util/python:python_headers",
+        "//third_party/python_runtime:headers",
     ],
 )
 
@@ -378,7 +380,7 @@ cc_library(
         ":safe_ptr",
         "//tensorflow/core:framework",
         "//tensorflow/core:lib",
-        "//util/python:python_headers",
+        "//third_party/python_runtime:headers",
     ],
 )
 
@@ -389,7 +391,7 @@ cc_library(
     deps = [
         "//tensorflow/core:lib",
         "//tensorflow/core:script_ops_op_lib",
-        "//util/python:python_headers",
+        "//third_party/python_runtime:headers",
     ],
 )
 
@@ -715,6 +717,38 @@ py_library(
     ],
 )
 
+py_library(
+    name = "function_def_to_graph",
+    srcs = ["framework/function_def_to_graph.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":framework",
+        ":function",
+        ":op_def_registry",
+        ":tensor_shape",
+        ":versions",
+        "//tensorflow/core:protos_all_py",
+    ],
+)
+
+py_test(
+    name = "function_def_to_graph_test",
+    size = "small",
+    srcs = ["framework/function_def_to_graph_test.py"],
+    srcs_version = "PY2AND3",
+    tags = ["no_pip"],
+    deps = [
+        ":array_ops",
+        ":client_testlib",
+        ":dtypes",
+        ":framework_ops",
+        ":function_def_to_graph",
+        ":graph_to_function_def",
+        ":math_ops",
+        ":test_ops",
+    ],
+)
+
 py_library(
     name = "graph_util",
     srcs = [
@@ -1016,7 +1050,10 @@ py_test(
 
 tf_gen_op_wrapper_private_py(
     name = "functional_ops_gen",
-    visibility = ["//learning/brain/python/ops:__pkg__"],
+    visibility = [
+        "//learning/brain/python/ops:__pkg__",
+        "//tensorflow/contrib/control_flow:__pkg__",
+    ],
 )
 
 py_library(
@@ -1435,6 +1472,14 @@ tf_gen_op_wrapper_private_py(
     ],
 )
 
+tf_gen_op_wrapper_private_py(
+    name = "collective_ops_gen",
+    visibility = ["//tensorflow:internal"],
+    deps = [
+        "//tensorflow/core:collective_ops_op_lib",
+    ],
+)
+
 tf_gen_op_wrapper_private_py(
     name = "control_flow_ops_gen",
     visibility = [
@@ -1736,9 +1781,33 @@ py_test(
     ],
 )
 
+py_library(
+    name = "collective_ops",
+    srcs = ["ops/collective_ops.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":collective_ops_gen",
+        ":framework_for_generated_wrappers",
+    ],
+)
+
+py_test(
+    name = "collective_ops_test",
+    size = "small",
+    srcs = ["ops/collective_ops_test.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":client_testlib",
+        ":collective_ops",
+        ":framework_for_generated_wrappers",
+        "//third_party/py/numpy",
+    ],
+)
+
 py_library(
     name = "control_flow_grad",
-    srcs = ["ops/control_flow_grad.py"],
+    srcs =
+        ["ops/control_flow_grad.py"],
     srcs_version = "PY2AND3",
     deps = [
         ":control_flow_ops",
@@ -2461,6 +2530,7 @@ py_library(
         ":check_ops",
         ":confusion_matrix",
         ":control_flow_ops",
+        ":distribute",
         ":framework",
         ":framework_for_generated_wrappers",
         ":math_ops",
@@ -2666,7 +2736,6 @@ py_library(
         ":util",
         ":variables",
         "//tensorflow/python/eager:context",
-        "//tensorflow/python/estimator:util",
         "@six_archive//:six",
     ],
 )
@@ -3404,7 +3473,7 @@ tf_cuda_library(
         "//tensorflow/core:lib",
         "//tensorflow/core:protos_all_cc",
         "//third_party/py/numpy:headers",
-        "//util/python:python_headers",
+        "//third_party/python_runtime:headers",
     ],
 )
 
@@ -3456,6 +3525,7 @@ tf_py_wrap_cc(
         "util/transform_graph.i",
         "util/util.i",
     ],
+    # add win_def_file
     win_def_file = select({
         "//tensorflow:windows": ":pywrap_tensorflow_filtered_def_file",
         "//conditions:default": None,
@@ -3475,6 +3545,7 @@ tf_py_wrap_cc(
         ":py_record_writer_lib",
         ":python_op_gen",
         ":tf_session_helper",
+        "//third_party/python_runtime:headers",
         "//tensorflow/c:c_api",
         "//tensorflow/c:checkpoint_reader",
         "//tensorflow/c:python_api",
@@ -3497,7 +3568,6 @@ tf_py_wrap_cc(
         "//tensorflow/core/profiler/internal:print_model_analysis",
         "//tensorflow/tools/graph_transforms:transform_graph_lib",
         "//tensorflow/python/eager:pywrap_tfe_lib",
-        "//util/python:python_headers",
     ] + (tf_additional_lib_deps() +
          tf_additional_plugin_deps() +
          tf_additional_verbs_deps() +
@@ -4275,7 +4345,7 @@ py_test(
 
 py_test(
     name = "warm_starting_util_test",
-    size = "small",
+    size = "medium",
     srcs = ["training/warm_starting_util_test.py"],
     srcs_version = "PY2AND3",
     deps = [
diff --git a/tensorflow/python/client/session.py b/tensorflow/python/client/session.py
index 5507d011bb0746c84b868ca7efcc3e4f8d2e146a..35aa37ac6dd721750cd72b54d1b8ef6a70402038 100644
--- a/tensorflow/python/client/session.py
+++ b/tensorflow/python/client/session.py
@@ -619,21 +619,12 @@ class BaseSession(SessionInterface):
       self._config = None
       self._add_shapes = False
 
-    # pylint: disable=protected-access
-    # We cache _USE_C_API's value because some test cases will create a session
-    # with _USE_C_API = False but set it back to True before calling close().
-    self._created_with_new_api = ops._USE_C_API
-    # pylint: enable=protected-access
-
     self._session = None
     opts = tf_session.TF_NewSessionOptions(target=self._target, config=config)
     try:
-      if self._created_with_new_api:
-        # pylint: disable=protected-access
-        self._session = tf_session.TF_NewSession(self._graph._c_graph, opts)
-        # pylint: enable=protected-access
-      else:
-        self._session = tf_session.TF_NewDeprecatedSession(opts)
+      # pylint: disable=protected-access
+      self._session = tf_session.TF_NewSession(self._graph._c_graph, opts)
+      # pylint: enable=protected-access
     finally:
       tf_session.TF_DeleteSessionOptions(opts)
 
@@ -660,11 +651,7 @@ class BaseSession(SessionInterface):
     Returns:
       A list of devices in the session.
     """
-    if self._created_with_new_api:
-      raw_device_list = tf_session.TF_SessionListDevices(self._session)
-    else:
-      raw_device_list = tf_session.TF_DeprecatedSessionListDevices(
-          self._session)
+    raw_device_list = tf_session.TF_SessionListDevices(self._session)
     device_list = []
     size = tf_session.TF_DeviceListCount(raw_device_list)
     for i in range(size):
@@ -684,16 +671,9 @@ class BaseSession(SessionInterface):
       tf.errors.OpError: Or one of its subclasses if an error occurs while
         closing the TensorFlow session.
     """
-    if self._created_with_new_api:
-      if self._session and not self._closed:
-        self._closed = True
-        tf_session.TF_CloseSession(self._session)
-
-    else:
-      with self._extend_lock:
-        if self._opened and not self._closed:
-          self._closed = True
-          tf_session.TF_CloseDeprecatedSession(self._session)
+    if self._session and not self._closed:
+      self._closed = True
+      tf_session.TF_CloseSession(self._session)
 
   def __del__(self):
     # cleanly ignore all exceptions
@@ -703,10 +683,7 @@ class BaseSession(SessionInterface):
       pass
     if self._session is not None:
       try:
-        if self._created_with_new_api:
-          tf_session.TF_DeleteSession(self._session)
-        else:
-          tf_session.TF_DeleteDeprecatedSession(self._session)
+        tf_session.TF_DeleteSession(self._session)
       except AttributeError:
         # At shutdown, `c_api_util` or `tf_session` may have been garbage
         # collected, causing the above method calls to fail. In this case,
@@ -1005,12 +982,9 @@ class BaseSession(SessionInterface):
         try:
           subfeed_t = self.graph.as_graph_element(
               subfeed, allow_tensor=True, allow_operation=False)
-          if self._created_with_new_api:
-            # pylint: disable=protected-access
-            feed_list.append(subfeed_t._as_tf_output())
-            # pylint: enable=protected-access
-          else:
-            feed_list.append(compat.as_bytes(subfeed_t.name))
+          # pylint: disable=protected-access
+          feed_list.append(subfeed_t._as_tf_output())
+          # pylint: enable=protected-access
         except Exception as e:
           e.message = ('Cannot interpret feed_list key as Tensor: ' + e.message)
           e.args = (e.message,)
@@ -1023,22 +997,13 @@ class BaseSession(SessionInterface):
     # Set up a graph with feeds and fetches for partial run.
     def _setup_fn(session, feed_list, fetch_list, target_list):
       self._extend_graph()
-      if self._created_with_new_api:
-        return tf_session.TF_SessionPRunSetup_wrapper(
-            session, feed_list, fetch_list, target_list)
-      else:
-        with errors.raise_exception_on_not_ok_status() as status:
-          return tf_session.TF_PRunSetup(session, feed_list, fetch_list,
-                                         target_list, status)
+      return tf_session.TF_SessionPRunSetup_wrapper(
+          session, feed_list, fetch_list, target_list)
 
-    if self._created_with_new_api:
-      # pylint: disable=protected-access
-      final_fetches = [t._as_tf_output() for t in fetch_handler.fetches()]
-      final_targets = [op._c_op for op in fetch_handler.targets()]
-      # pylint: enable=protected-access
-    else:
-      final_fetches = _name_list(fetch_handler.fetches())
-      final_targets = _name_list(fetch_handler.targets())
+    # pylint: disable=protected-access
+    final_fetches = [t._as_tf_output() for t in fetch_handler.fetches()]
+    final_targets = [op._c_op for op in fetch_handler.targets()]
+    # pylint: enable=protected-access
 
     return self._do_call(_setup_fn, self._session, feed_list, final_fetches,
                          final_targets)
@@ -1196,14 +1161,10 @@ class BaseSession(SessionInterface):
 
     # Create a fetch handler to take care of the structure of fetches.
     fetch_handler = _FetchHandler(self._graph, fetches, {})
-    if self._created_with_new_api:
-      # pylint: disable=protected-access
-      fetch_list = [t._as_tf_output() for t in fetch_handler.fetches()]
-      target_list = [op._c_op for op in fetch_handler.targets()]
-      # pylint: enable=protected-access
-    else:
-      fetch_list = _name_list(fetch_handler.fetches())
-      target_list = _name_list(fetch_handler.targets())
+    # pylint: disable=protected-access
+    fetch_list = [t._as_tf_output() for t in fetch_handler.fetches()]
+    target_list = [op._c_op for op in fetch_handler.targets()]
+    # pylint: enable=protected-access
 
     def _callable_template_with_options_and_metadata(fetch_list,
                                                      target_list,
@@ -1289,16 +1250,11 @@ class BaseSession(SessionInterface):
     Raises:
       tf.errors.OpError: Or one of its subclasses on error.
     """
-    if self._created_with_new_api:
-      # pylint: disable=protected-access
-      feeds = dict((t._as_tf_output(), v) for t, v in feed_dict.items())
-      fetches = [t._as_tf_output() for t in fetch_list]
-      targets = [op._c_op for op in target_list]
-      # pylint: enable=protected-access
-    else:
-      feeds = dict((compat.as_bytes(t.name), v) for t, v in feed_dict.items())
-      fetches = _name_list(fetch_list)
-      targets = _name_list(target_list)
+    # pylint: disable=protected-access
+    feeds = dict((t._as_tf_output(), v) for t, v in feed_dict.items())
+    fetches = [t._as_tf_output() for t in fetch_list]
+    targets = [op._c_op for op in target_list]
+    # pylint: enable=protected-access
 
     def _run_fn(feed_dict, fetch_list, target_list, options, run_metadata):
       # Ensure any changes to the graph are reflected in the runtime.
@@ -1335,22 +1291,8 @@ class BaseSession(SessionInterface):
       raise type(e)(node_def, op, message)
 
   def _extend_graph(self):
-    if self._created_with_new_api:
-      with self._graph._lock:  # pylint: disable=protected-access
-        tf_session.ExtendSession(self._session)
-    else:
-      # Ensure any changes to the graph are reflected in the runtime.
-      with self._extend_lock:
-        if self._graph.version > self._current_version:
-          # pylint: disable=protected-access
-          graph_def, self._current_version = self._graph._as_graph_def(
-              from_version=self._current_version, add_shapes=self._add_shapes)
-          # pylint: enable=protected-access
-
-          with errors.raise_exception_on_not_ok_status() as status:
-            tf_session.TF_ExtendGraph(self._session,
-                                      graph_def.SerializeToString(), status)
-          self._opened = True
+    with self._graph._lock:  # pylint: disable=protected-access
+      tf_session.ExtendSession(self._session)
 
   # The threshold to run garbage collection to delete dead tensors.
   _DEAD_HANDLES_THRESHOLD = 10
@@ -1403,24 +1345,13 @@ class BaseSession(SessionInterface):
 
   def _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list,
                           run_metadata):
-    if self._created_with_new_api:
-      return tf_session.TF_SessionRun_wrapper(
-          self._session, options, feed_dict, fetch_list, target_list,
-          run_metadata)
-    else:
-      with errors.raise_exception_on_not_ok_status() as status:
-        return tf_session.TF_Run(
-            self._session, options, feed_dict, fetch_list, target_list,
-            status, run_metadata)
+    return tf_session.TF_SessionRun_wrapper(
+        self._session, options, feed_dict, fetch_list, target_list,
+        run_metadata)
 
   def _call_tf_sessionprun(self, handle, feed_dict, fetch_list):
-    if self._created_with_new_api:
-      return tf_session.TF_SessionPRun_wrapper(
-          self._session, handle, feed_dict, fetch_list)
-    else:
-      with errors.raise_exception_on_not_ok_status() as status:
-        return tf_session.TF_PRun(
-            self._session, handle, feed_dict, fetch_list, status)
+    return tf_session.TF_SessionPRun_wrapper(
+        self._session, handle, feed_dict, fetch_list)
 
   # pylint: disable=protected-access
   class _Callable(object):
@@ -1433,25 +1364,29 @@ class BaseSession(SessionInterface):
           compat.as_bytes(callable_options.SerializeToString()))
       try:
         with errors.raise_exception_on_not_ok_status() as status:
-          if session._created_with_new_api:
-            self._handle = tf_session.TF_SessionMakeCallable(
-                session._session, options_ptr, status)
-          else:
-            self._handle = tf_session.TF_DeprecatedSessionMakeCallable(
-                session._session, options_ptr, status)
+          self._handle = tf_session.TF_SessionMakeCallable(
+              session._session, options_ptr, status)
       finally:
         tf_session.TF_DeleteBuffer(options_ptr)
 
-    def __call__(self, *args):
+    def __call__(self, *args, **kwargs):
       # TODO(b/74355905): Support argument and return value nested structures,
       # and tensor-like objects such as SparseTensors.
-      with errors.raise_exception_on_not_ok_status() as status:
-        if self._session._created_with_new_api:
-          return tf_session.TF_SessionRunCallable(
-              self._session._session, self._handle, args, status, None)
-        else:
-          return tf_session.TF_DeprecatedSessionRunCallable(
-              self._session._session, self._handle, args, status, None)
+      run_metadata = kwargs.get('run_metadata', None)
+      try:
+        run_metadata_ptr = tf_session.TF_NewBuffer() if run_metadata else None
+        # TODO(mrry): Switch to raising an exception from the SWIG wrapper.
+        with errors.raise_exception_on_not_ok_status() as status:
+          ret = tf_session.TF_SessionRunCallable(
+              self._session._session, self._handle, args, status,
+              run_metadata_ptr)
+        if run_metadata:
+          proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)
+          run_metadata.ParseFromString(compat.as_bytes(proto_data))
+      finally:
+        if run_metadata_ptr:
+          tf_session.TF_DeleteBuffer(run_metadata_ptr)
+      return ret
 
     def __del__(self):
       # NOTE(mrry): It is possible that `self._session.__del__()` could be
@@ -1459,12 +1394,8 @@ class BaseSession(SessionInterface):
       # will be `None`.
       if self._handle is not None and self._session._session is not None:
         with errors.raise_exception_on_not_ok_status() as status:
-          if self._session._created_with_new_api:
-            tf_session.TF_SessionReleaseCallable(
-                self._session._session, self._handle, status)
-          else:
-            tf_session.TF_DeprecatedSessionReleaseCallable(
-                self._session._session, self._handle, status)
+          tf_session.TF_SessionReleaseCallable(
+              self._session._session, self._handle, status)
   # pylint: enable=protected-access
 
   # TODO(b/74355905): Reimplement `Session.make_callable()` using this method
diff --git a/tensorflow/python/client/session_test.py b/tensorflow/python/client/session_test.py
index 482497078cd3e0544b7465fc7c0be0dc81b5ff6a..e49d0671050f557842ad1d3305331d61cd8c9672 100644
--- a/tensorflow/python/client/session_test.py
+++ b/tensorflow/python/client/session_test.py
@@ -1364,6 +1364,20 @@ class SessionTest(test_util.TensorFlowTestCase):
         for _ in range(5):
           self.assertEqual([2.0], callable_fn(np.array(1.0, dtype=np.float32)))
 
+  def testOptimizedMakeCallableWithRunMetadata(self):
+    with session.Session() as sess:
+      ph = array_ops.placeholder(dtypes.float32)
+      a = math_ops.add(ph, 1.0)
+      callable_opts = config_pb2.CallableOptions()
+      callable_opts.feed.append(ph.name)
+      callable_opts.fetch.append(a.name)
+      callable_opts.run_options.trace_level = config_pb2.RunOptions.FULL_TRACE
+      callable_fn = sess._make_callable_from_options(callable_opts)
+      run_metadata = config_pb2.RunMetadata()
+      self.assertEqual([2.0], callable_fn(np.array(1.0, dtype=np.float32),
+                                          run_metadata=run_metadata))
+      self.assertGreater(len(run_metadata.step_stats.dev_stats), 0)
+
   def testFeedError(self):
     with session.Session() as sess:
       feed_t = array_ops.placeholder(dtype=dtypes.float32)
diff --git a/tensorflow/python/data/kernel_tests/BUILD b/tensorflow/python/data/kernel_tests/BUILD
index ed0c11e6c117dcbb810fd3acfc484128ed3519fa..e86c2f69935bc0d66b18b7dbc987bb5966c7ed8d 100644
--- a/tensorflow/python/data/kernel_tests/BUILD
+++ b/tensorflow/python/data/kernel_tests/BUILD
@@ -15,6 +15,7 @@ tf_py_test(
     size = "small",
     srcs = ["batch_dataset_op_test.py"],
     additional_deps = [
+        "@absl_py//absl/testing:parameterized",
         "//third_party/py/numpy",
         "//tensorflow/python:array_ops",
         "//tensorflow/python:client_testlib",
@@ -72,6 +73,17 @@ tf_py_test(
     ],
 )
 
+tf_py_test(
+    name = "dataset_ops_test",
+    size = "small",
+    srcs = ["dataset_ops_test.py"],
+    additional_deps = [
+        "//tensorflow/core:protos_all_py",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python/data/ops:dataset_ops",
+    ],
+)
+
 tf_py_test(
     name = "filter_dataset_op_test",
     size = "small",
diff --git a/tensorflow/python/data/kernel_tests/batch_dataset_op_test.py b/tensorflow/python/data/kernel_tests/batch_dataset_op_test.py
index bd80b9dbf561de16168b05facf0086dadcda6444..50bb0837b7052d67ced4fdf5c9c7e96212bdb415 100644
--- a/tensorflow/python/data/kernel_tests/batch_dataset_op_test.py
+++ b/tensorflow/python/data/kernel_tests/batch_dataset_op_test.py
@@ -18,8 +18,7 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-import math
-
+from absl.testing import parameterized
 import numpy as np
 
 from tensorflow.python.data.ops import dataset_ops
@@ -35,73 +34,83 @@ from tensorflow.python.platform import test
 from tensorflow.python.util import compat
 
 
-class BatchDatasetTest(test.TestCase):
+class BatchDatasetTest(test.TestCase, parameterized.TestCase):
+
+  @parameterized.named_parameters(
+      ('even', 28, 14, False),
+      ('uneven_with_remainder', 28, 15, False),
+      ('uneven_without_remainder', 28, 15, True),
+      ('empty', 0, 14, False),
+  )
+  def testBatchDataset(self, count, batch_size, drop_remainder):
+    """Tests the batch dataset logic for various input configurations.
+
+    Args:
+      count: the number of input elements
+      batch_size: the batch size
+      drop_remainder: whether a smaller batch size should be produced if batch
+        size does not divide number of inputs evenly
+    """
 
-  def testBatchDataset(self):
-    """Test an dataset that maps a TF function across its input elements."""
     # The pipeline is TensorSliceDataset -> MapDataset(square_3) ->
     # RepeatDataset(count) -> BatchDataset(batch_size).
     components = (np.arange(7),
                   np.array([[1, 2, 3]]) * np.arange(7)[:, np.newaxis],
                   np.array(37.0) * np.arange(7))
 
-    count = array_ops.placeholder(dtypes.int64, shape=[])
-    batch_size = array_ops.placeholder(dtypes.int64, shape=[])
+    count_t = array_ops.placeholder(dtypes.int64, shape=[])
+    batch_size_t = array_ops.placeholder(dtypes.int64, shape=[])
+    drop_remainder_t = array_ops.placeholder(dtypes.bool, shape=[])
 
     def _map_fn(x, y, z):
       return math_ops.square(x), math_ops.square(y), math_ops.square(z)
 
     iterator = (
         dataset_ops.Dataset.from_tensor_slices(components).map(_map_fn)
-        .repeat(count).batch(batch_size).make_initializable_iterator())
+        .repeat(count).batch(batch_size,
+                             drop_remainder).make_initializable_iterator())
     init_op = iterator.initializer
     get_next = iterator.get_next()
 
-    self.assertEqual([[None] + list(c.shape[1:]) for c in components],
+    if drop_remainder:
+      dim0 = batch_size
+    else:
+      dim0 = None
+    self.assertEqual([[dim0] + list(c.shape[1:]) for c in components],
                      [t.shape.as_list() for t in get_next])
 
     with self.test_session() as sess:
-      # Batch of a finite input, where the batch_size divides the
-      # total number of elements.
-      sess.run(init_op, feed_dict={count: 28, batch_size: 14})
-      num_batches = (28 * 7) // 14
-      for i in range(num_batches):
+      sess.run(
+          init_op,
+          feed_dict={
+              count_t: count,
+              batch_size_t: batch_size,
+              drop_remainder_t: drop_remainder
+          })
+      num_full_batches = (count * 7) // batch_size
+      for i in range(num_full_batches):
         result = sess.run(get_next)
         for component, result_component in zip(components, result):
-          for j in range(14):
-            self.assertAllEqual(component[(i * 14 + j) % 7]**2,
+          for j in range(batch_size):
+            self.assertAllEqual(component[(i * batch_size + j) % 7]**2,
                                 result_component[j])
-      with self.assertRaises(errors.OutOfRangeError):
-        sess.run(get_next)
-
-      # Batch of a finite input, where the batch_size does not
-      # divide the total number of elements.
-      sess.run(init_op, feed_dict={count: 14, batch_size: 8})
-
-      # We expect (num_batches - 1) full-sized batches.
-      num_batches = int(math.ceil((14 * 7) / 8))
-      for i in range(num_batches - 1):
+      if not drop_remainder and (count * 7) % batch_size > 0:
         result = sess.run(get_next)
         for component, result_component in zip(components, result):
-          for j in range(8):
-            self.assertAllEqual(component[(i * 8 + j) % 7]**2,
-                                result_component[j])
-      result = sess.run(get_next)
-      for component, result_component in zip(components, result):
-        for j in range((14 * 7) % 8):
-          self.assertAllEqual(component[((num_batches - 1) * 8 + j) % 7]**2,
-                              result_component[j])
+          for j in range((count * 7) % batch_size):
+            self.assertAllEqual(
+                component[(num_full_batches * batch_size + j) % 7]**2,
+                result_component[j])
       with self.assertRaises(errors.OutOfRangeError):
         sess.run(get_next)
 
-      # Batch of an empty input should fail straight away.
-      sess.run(init_op, feed_dict={count: 0, batch_size: 8})
-      with self.assertRaises(errors.OutOfRangeError):
-        sess.run(get_next)
+  def testBatchDatasetInvalidBatchSize(self):
+    iterator = (dataset_ops.Dataset.range(10).batch(0).make_one_shot_iterator())
+    get_next = iterator.get_next()
 
-      # Empty batch should be an initialization time error.
+    with self.test_session() as sess:
       with self.assertRaises(errors.InvalidArgumentError):
-        sess.run(init_op, feed_dict={count: 14, batch_size: 0})
+        sess.run(get_next)
 
   def assertSparseValuesEqual(self, a, b):
     self.assertAllEqual(a.indices, b.indices)
@@ -210,66 +219,108 @@ class BatchDatasetTest(test.TestCase):
           r'First element had shape \[3\] and element 2 had shape \[4\].'):
         sess.run(next_element)
 
-  def testPaddedBatchDataset(self):
-    seq_lens = array_ops.placeholder(dtypes.int32, shape=[None])
-    padded_shape = array_ops.placeholder(dtypes.int64, shape=[1])
+
+def _random_seq_lens(count):
+  return np.random.randint(20, size=(count,)).astype(np.int32)
+
+
+class PaddedBatchDatasetTest(test.TestCase, parameterized.TestCase):
+
+  @parameterized.named_parameters(
+      ('default_padding', _random_seq_lens(32), 4, [-1], False),
+      ('constant_padding', _random_seq_lens(32), 4, [25], False),
+      ('uneven_with_remainder', _random_seq_lens(34), 4, [-1], False),
+      ('uneven_without_remainder', _random_seq_lens(34), 4, [-1], True),
+  )
+  def testPaddedBatchDataset(self, seq_lens, batch_size, padded_shapes,
+                             drop_remainder):
+    """Tests the padded batch dataset logic for various input configurations.
+
+    Args:
+      seq_lens: the input sequence lengths
+      batch_size: the batch size
+      padded_shapes: the padded shapes to use
+      drop_remainder: whether a smaller batch size should be produced if batch
+        size does not divide number of inputs evenly
+    """
+
+    seq_lens_t = array_ops.placeholder(dtypes.int32, shape=[None])
+    batch_size_t = array_ops.placeholder(dtypes.int64, shape=[])
+    padded_shapes_t = array_ops.placeholder(dtypes.int64, shape=[1])
+    drop_remainder_t = array_ops.placeholder(dtypes.bool, shape=[])
 
     iterator = (
-        dataset_ops.Dataset.from_tensor_slices(seq_lens)
+        dataset_ops.Dataset.from_tensor_slices(seq_lens_t)
         .map(lambda x: array_ops.fill([x], x)).padded_batch(
-            4, padded_shapes=padded_shape).make_initializable_iterator())
+            batch_size=batch_size_t,
+            drop_remainder=drop_remainder_t,
+            padded_shapes=padded_shapes_t).make_initializable_iterator())
 
     init_op = iterator.initializer
     get_next = iterator.get_next()
 
     with self.test_session() as sess:
-      # Test with random sequence lengths, and max padding.
-      random_seq_lens = np.random.randint(20, size=(32,)).astype(np.int32)
       sess.run(
-          init_op, feed_dict={
-              padded_shape: [-1],
-              seq_lens: random_seq_lens
+          init_op,
+          feed_dict={
+              seq_lens_t: seq_lens,
+              batch_size_t: batch_size,
+              padded_shapes_t: padded_shapes,
+              drop_remainder_t: drop_remainder,
           })
-      for i in range(8):
+
+      num_full_batches = len(seq_lens) // batch_size
+
+      for i in range(num_full_batches):
         result = sess.run(get_next)
-        padded_len = np.max(result)
-        self.assertEqual((4, padded_len), result.shape)
-        for j in range(4):
-          seq_len = random_seq_lens[(i * 4) + j]
+        padded_len = padded_shapes[0]
+        if padded_len is None or padded_len == -1:
+          padded_len = np.max(result)
+        self.assertEqual((batch_size, padded_len), result.shape)
+        for j in range(batch_size):
+          seq_len = seq_lens[(i * batch_size) + j]
           self.assertAllEqual(result[j, :seq_len], [seq_len] * seq_len)
-          self.assertAllEqual(result[j, seq_len:], [0] * (padded_len - seq_len))
-      with self.assertRaises(errors.OutOfRangeError):
-        sess.run(get_next)
+          self.assertAllEqual(result[j, seq_len:],
+                              [0] * (padded_len - seq_len))
 
-      # Test with random sequence lengths, and constant padding.
-      sess.run(
-          init_op, feed_dict={
-              padded_shape: [25],
-              seq_lens: random_seq_lens
-          })
-      for i in range(8):
+      if not drop_remainder and len(seq_lens) % batch_size > 0:
         result = sess.run(get_next)
-        self.assertEqual((4, 25), result.shape)
-        for j in range(4):
-          seq_len = random_seq_lens[(i * 4) + j]
+        padded_len = np.max(result)
+        self.assertEqual((len(seq_lens) % batch_size, padded_len),
+                         result.shape)
+        for j in range(len(seq_lens) % batch_size):
+          seq_len = seq_lens[num_full_batches * batch_size + j]
           self.assertAllEqual(result[j, :seq_len], [seq_len] * seq_len)
-          self.assertAllEqual(result[j, seq_len:], [0] * (25 - seq_len))
+          self.assertAllEqual(result[j, seq_len:],
+                              [0] * (padded_len - seq_len))
+
       with self.assertRaises(errors.OutOfRangeError):
         sess.run(get_next)
 
-      # Test correct handling of empty tensors.
-      sess.run(init_op, feed_dict={padded_shape: [-1], seq_lens: [0, 0, 0, 0]})
+  def testPaddedBatchShortPadding(self):
+    iterator = (
+        dataset_ops.Dataset.from_tensor_slices([6, 5, 5, 5, 5])
+        .map(lambda x: array_ops.fill([x], x)).padded_batch(
+            batch_size=4, padded_shapes=[5]).make_one_shot_iterator())
+    get_next = iterator.get_next()
+
+    with self.test_session() as sess:
+      with self.assertRaises(errors.DataLossError):
+        sess.run(get_next)
+
+  def testPaddedBatchEmptyTensors(self):
+    iterator = (
+        dataset_ops.Dataset.from_tensor_slices([0, 0, 0, 0])
+        .map(lambda x: array_ops.fill([x], x)).padded_batch(
+            batch_size=4, padded_shapes=[-1]).make_one_shot_iterator())
+    get_next = iterator.get_next()
+
+    with self.test_session() as sess:
       result = sess.run(get_next)
       self.assertAllEqual([[], [], [], []], result)
       with self.assertRaises(errors.OutOfRangeError):
         sess.run(get_next)
 
-      # Test error handling with constant sequence lengths, and
-      # too-short padding.
-      sess.run(init_op, feed_dict={padded_shape: [5], seq_lens: [6, 5, 5, 5]})
-      with self.assertRaises(errors.DataLossError):
-        result = sess.run(get_next)
-
   def testPaddedBatchDatasetNonDefaultPadding(self):
     seq_lens = array_ops.placeholder(dtypes.int32, shape=[None])
     padded_shape = array_ops.placeholder(dtypes.int64, shape=[1])
@@ -371,6 +422,44 @@ class BatchDatasetTest(test.TestCase):
     with self.assertRaises(TypeError):
       _ = dataset_ops.Dataset.range(10).map(_map_fn).padded_batch(10)
 
+  def testPaddedBatchShapeError(self):
+    with self.assertRaisesRegexp(
+        ValueError, r'The padded shape \(1,\) is not compatible with the '
+        r'corresponding input component shape \(\).'):
+      _ = dataset_ops.Dataset.range(10).padded_batch(5, padded_shapes=[1])
+
+    with self.assertRaisesRegexp(
+        ValueError, r'The padded shape \(1,\) is not compatible with the '
+        r'corresponding input component shape \(3,\).'):
+      _ = dataset_ops.Dataset.from_tensors([1, 2, 3]).padded_batch(
+          5, padded_shapes=[1])
+
+    with self.assertRaisesRegexp(
+        ValueError, r'Padded shape .* must be a 1-D tensor '
+        r'of tf.int64 values, but its shape was \(2, 2\).'):
+      _ = dataset_ops.Dataset.from_tensors([1, 2, 3]).padded_batch(
+          5, padded_shapes=[[1, 1], [1, 1]])
+
+    with self.assertRaisesRegexp(
+        TypeError, r'Padded shape .* must be a 1-D tensor '
+        r'of tf.int64 values, but its element type was float32.'):
+      _ = dataset_ops.Dataset.from_tensors([1, 2, 3]).padded_batch(
+          5, padded_shapes=constant_op.constant([1., 2., 3.]))
+
+    with self.assertRaisesRegexp(
+        ValueError, r'The padded shape \(1,\) is not compatible with the '
+        r'corresponding input component shape \(\).'):
+      shape_as_tensor = constant_op.constant([1], dtype=dtypes.int64)
+      _ = dataset_ops.Dataset.range(10).padded_batch(
+          5, padded_shapes=shape_as_tensor)
+
+    with self.assertRaisesRegexp(
+        ValueError, r'The padded shape \(\?, \?\) is not compatible with the '
+        r'corresponding input component shape \(\).'):
+      shape_as_tensor = array_ops.placeholder(dtypes.int64, shape=[2])
+      _ = dataset_ops.Dataset.range(10).padded_batch(
+          5, padded_shapes=shape_as_tensor)
+
 
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/python/data/kernel_tests/dataset_from_generator_op_test.py b/tensorflow/python/data/kernel_tests/dataset_from_generator_op_test.py
index 296a76ec887ae7c31cb9d0bd2afd6d1fe827d95c..fb55ae140058349753731b0c257acb3cf3def0a3 100644
--- a/tensorflow/python/data/kernel_tests/dataset_from_generator_op_test.py
+++ b/tensorflow/python/data/kernel_tests/dataset_from_generator_op_test.py
@@ -259,9 +259,7 @@ class DatasetConstructorTest(test.TestCase):
       sess.run(init_op)
       self.assertAllEqual([1, 2, 3], sess.run(get_next))
       self.assertAllEqual([4, 5, 6], sess.run(get_next))
-      # NOTE(mrry): Type name in message differs between Python 2 (`long`) and
-      # 3 (`int`).
-      with self.assertRaisesOpError(r"invalid literal for"):
+      with self.assertRaisesOpError("The expected type was int64"):
         sess.run(get_next)
       self.assertAllEqual([7, 8, 9], sess.run(get_next))
       with self.assertRaises(errors.OutOfRangeError):
@@ -290,6 +288,34 @@ class DatasetConstructorTest(test.TestCase):
       with self.assertRaises(errors.OutOfRangeError):
         sess.run(get_next)
 
+  def testFromGeneratorStructureError(self):
+    def generator():
+      yield 1, 2
+      yield 3, 4
+      yield 5
+      yield 6, 7, 8
+      yield 9, 10
+
+    iterator = (dataset_ops.Dataset.from_generator(
+        generator, output_types=(dtypes.int64, dtypes.int64))
+                .make_initializable_iterator())
+    init_op = iterator.initializer
+    get_next = iterator.get_next()
+
+    with self.test_session() as sess:
+      sess.run(init_op)
+      self.assertEqual((1, 2), sess.run(get_next))
+      self.assertEqual((3, 4), sess.run(get_next))
+      with self.assertRaisesOpError(
+          r"The expected structure was \(tf\.int64, tf\.int64\)"):
+        sess.run(get_next)
+      with self.assertRaisesOpError(
+          r"The expected structure was \(tf\.int64, tf\.int64\)"):
+        sess.run(get_next)
+      self.assertEqual((9, 10), sess.run(get_next))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(get_next)
+
   def testFromGeneratorHeterogeneous(self):
     def generator():
       yield 1
diff --git a/tensorflow/python/data/kernel_tests/dataset_ops_test.py b/tensorflow/python/data/kernel_tests/dataset_ops_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..2c4c11e132d1fc9b8969540994a097098279dd9e
--- /dev/null
+++ b/tensorflow/python/data/kernel_tests/dataset_ops_test.py
@@ -0,0 +1,37 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for the input pipeline ops."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.core.framework import graph_pb2
+from tensorflow.python.data.ops import dataset_ops
+from tensorflow.python.platform import test
+
+
+class DatasetOpsTest(test.TestCase):
+
+  def testAsSerializedGraph(self):
+    dataset = dataset_ops.Dataset.range(10)
+    with self.test_session() as sess:
+      graph = graph_pb2.GraphDef().FromString(
+          sess.run(dataset._as_serialized_graph()))
+      self.assertTrue(any([node.op != "RangeDataset" for node in graph.node]))
+
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/python/data/kernel_tests/map_dataset_op_test.py b/tensorflow/python/data/kernel_tests/map_dataset_op_test.py
index 1ad0b9de5e76e3edd66303ab4666108f43a27428..768d4ac82cce4e78e9ff493cb5b4401614ecd1c0 100644
--- a/tensorflow/python/data/kernel_tests/map_dataset_op_test.py
+++ b/tensorflow/python/data/kernel_tests/map_dataset_op_test.py
@@ -20,6 +20,7 @@ from __future__ import print_function
 from collections import namedtuple
 import threading
 import time
+import warnings
 
 import numpy as np
 
@@ -638,6 +639,26 @@ class MapDatasetTest(test.TestCase):
       with self.assertRaises(errors.OutOfRangeError):
         sess.run(get_next)
 
+  def testWarnOnLookupTable(self):
+    def collecting_function(x):
+      _ = lookup_ops.HashTable(
+          lookup_ops.KeyValueTensorInitializer([], []), 0.0, name="t1")
+      return x
+
+    warnings.simplefilter("always")
+    with warnings.catch_warnings(record=True) as w:
+      _ = dataset_ops.Dataset.range(10).map(collecting_function)
+    # NOTE(mrry): Python 3 prints other warnings in addition to the one we are
+    # testing, so we search for the expected warning.
+    self.assertGreaterEqual(len(w), 1)
+    found_warning = False
+    for warning in w:
+      if ("Creating lookup tables inside a function passed to Dataset.map() is "
+          "not supported." in str(warning)):
+        found_warning = True
+        break
+    self.assertTrue(found_warning)
+
 
 class MapDatasetBenchmark(test.Benchmark):
 
diff --git a/tensorflow/python/data/kernel_tests/reader_dataset_ops_test.py b/tensorflow/python/data/kernel_tests/reader_dataset_ops_test.py
index 1ddedfda4e1c9d6b6949f796be1870f167435763..e99f0a203b4d8b83fc6a95163e23b74300f6f6b8 100644
--- a/tensorflow/python/data/kernel_tests/reader_dataset_ops_test.py
+++ b/tensorflow/python/data/kernel_tests/reader_dataset_ops_test.py
@@ -24,6 +24,7 @@ import zlib
 from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python.data.ops import iterator_ops
 from tensorflow.python.data.ops import readers
+from tensorflow.python.eager import context
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import errors
@@ -38,6 +39,13 @@ from tensorflow.python.platform import test
 from tensorflow.python.util import compat
 
 
+try:
+  import psutil  # pylint: disable=g-import-not-at-top
+  psutil_import_succeeded = True
+except ImportError:
+  psutil_import_succeeded = False
+
+
 class TextLineDatasetTest(test.TestCase):
 
   def _lineText(self, f, l):
@@ -162,6 +170,34 @@ class TextLineDatasetTest(test.TestCase):
       with self.assertRaises(errors.OutOfRangeError):
         sess.run(iterator.get_next())
 
+  def testIteratorResourceCleanup(self):
+    filename = os.path.join(self.get_temp_dir(), "text.txt")
+    with open(filename, "wt") as f:
+      for i in range(3):
+        f.write("%d\n" % (i,))
+    with context.eager_mode():
+      first_iterator = iter(readers.TextLineDataset(filename))
+      self.assertEqual(b"0", next(first_iterator).numpy())
+      second_iterator = iter(readers.TextLineDataset(filename))
+      self.assertEqual(b"0", next(second_iterator).numpy())
+      # Eager kernel caching is based on op attributes, which includes the
+      # Dataset's output shape. Create a different kernel to test that they
+      # don't create resources with the same names.
+      different_kernel_iterator = iter(
+          readers.TextLineDataset(filename).repeat().batch(16))
+      self.assertEqual([16], next(different_kernel_iterator).shape)
+      # Remove our references to the Python Iterator objects, which (assuming no
+      # reference cycles) is enough to trigger DestroyResourceOp and close the
+      # partially-read files.
+      del first_iterator
+      del second_iterator
+      del different_kernel_iterator
+      if not psutil_import_succeeded:
+        self.skipTest(
+            "psutil is required to check that we've closed our files.")
+      open_files = psutil.Process().open_files()
+      self.assertNotIn(filename, [open_file.path for open_file in open_files])
+
 
 class FixedLengthRecordReaderTest(test.TestCase):
 
diff --git a/tensorflow/python/data/ops/dataset_ops.py b/tensorflow/python/data/ops/dataset_ops.py
index 6f9b12b12323fa5bcd0ebc7d383cdcec8aed72cc..d0deed5edef81a03a5ee69a01c16b98a799189c7 100644
--- a/tensorflow/python/data/ops/dataset_ops.py
+++ b/tensorflow/python/data/ops/dataset_ops.py
@@ -19,6 +19,7 @@ from __future__ import print_function
 
 import abc
 import threading
+import warnings
 
 import numpy as np
 import six
@@ -32,6 +33,7 @@ from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import function
 from tensorflow.python.framework import ops
+from tensorflow.python.framework import smart_cond
 from tensorflow.python.framework import sparse_tensor as sparse_tensor_lib
 from tensorflow.python.framework import tensor_shape
 from tensorflow.python.framework import tensor_util
@@ -57,6 +59,15 @@ class Dataset(object):
   def __init__(self):
     pass
 
+  def _as_serialized_graph(self):
+    """Produces serialized graph representation of the dataset.
+
+    Returns:
+      A scalar `tf.Tensor` of `tf.string` type, representing this dataset as a
+      serialized graph.
+    """
+    return gen_dataset_ops.dataset_to_graph(self._as_variant_tensor())
+
   @abc.abstractmethod
   def _as_variant_tensor(self):
     """Creates a scalar `tf.Tensor` of `tf.variant` representing this dataset.
@@ -212,6 +223,13 @@ class Dataset(object):
   def from_tensors(tensors):
     """Creates a `Dataset` with a single element, comprising the given tensors.
 
+    Note that if `tensors` contains a NumPy array, and eager execution is not
+    enabled, the values will be embedded in the graph as one or more
+    @{tf.constant} operations. For large datasets (> 1 GB), this can waste
+    memory and run into byte limits of graph serialization.  If tensors contains
+    one or more large NumPy arrays, consider the alternative described in
+    @{$programmers_guide/datasets#consuming_numpy_arrays$this guide}.
+
     Args:
       tensors: A nested structure of tensors.
 
@@ -224,6 +242,13 @@ class Dataset(object):
   def from_tensor_slices(tensors):
     """Creates a `Dataset` whose elements are slices of the given tensors.
 
+    Note that if `tensors` contains a NumPy array, and eager execution is not
+    enabled, the values will be embedded in the graph as one or more
+    @{tf.constant} operations. For large datasets (> 1 GB), this can waste
+    memory and run into byte limits of graph serialization.  If tensors contains
+    one or more large NumPy arrays, consider the alternative described in
+    @{$programmers_guide/datasets#consuming_numpy_arrays$this guide}.
+
     Args:
       tensors: A nested structure of tensors, each having the same size in the
         0th dimension.
@@ -398,13 +423,23 @@ class Dataset(object):
         # Use the same _convert function from the py_func() implementation to
         # convert the returned values to arrays early, so that we can inspect
         # their values.
-        # pylint: disable=protected-access
-        ret_arrays = [
-            script_ops.FuncRegistry._convert(ret, dtype=dtype.as_numpy_dtype)
-            for ret, dtype in zip(
-                nest.flatten_up_to(output_types, values), flattened_types)
-        ]
-        # pylint: enable=protected-access
+        try:
+          flattened_values = nest.flatten_up_to(output_types, values)
+        except (TypeError, ValueError):
+          raise TypeError(
+              "`generator` yielded an element that did not match the expected "
+              "structure. The expected structure was %s, but the yielded "
+              "element was %s." % (output_types, values))
+        ret_arrays = []
+        for ret, dtype in zip(flattened_values, flattened_types):
+          try:
+            ret_arrays.append(script_ops.FuncRegistry._convert(  # pylint: disable=protected-access
+                ret, dtype=dtype.as_numpy_dtype))
+          except (TypeError, ValueError):
+            raise TypeError(
+                "`generator` yielded an element that could not be converted to "
+                "the expected type. The expected type was %s, but the yielded "
+                "element was %s." % (dtype.name, ret))
 
         # Additional type and shape checking to ensure that the components
         # of the generated element match the `output_types` and `output_shapes`
@@ -781,7 +816,7 @@ class Dataset(object):
 
     return self._enumerate().filter(filter_fn).map(lambda _, elem: elem)
 
-  def batch(self, batch_size):
+  def batch(self, batch_size, drop_remainder=False):
     """Combines consecutive elements of this dataset into batches.
 
     NOTE: If the number of elements (`N`) in this dataset is not an exact
@@ -793,13 +828,21 @@ class Dataset(object):
     Args:
       batch_size: A `tf.int64` scalar `tf.Tensor`, representing the number of
         consecutive elements of this dataset to combine in a single batch.
+      drop_remainder: (Optional.) A `tf.bool` scalar `tf.Tensor`, representing
+        whether the last batch should be dropped in the case its has fewer than
+        `batch_size` elements; the default behavior is not to drop the smaller
+        batch.
 
     Returns:
       Dataset: A `Dataset`.
     """
-    return BatchDataset(self, batch_size)
+    return BatchDataset(self, batch_size, drop_remainder)
 
-  def padded_batch(self, batch_size, padded_shapes, padding_values=None):
+  def padded_batch(self,
+                   batch_size,
+                   padded_shapes,
+                   padding_values=None,
+                   drop_remainder=False):
     """Combines consecutive elements of this dataset into padded batches.
 
     This transformation combines multiple consecutive elements of the input
@@ -842,11 +885,16 @@ class Dataset(object):
         `tf.Tensor`, representing the padding values to use for the
         respective components.  Defaults are `0` for numeric types and
         the empty string for string types.
+      drop_remainder: (Optional.) A `tf.bool` scalar `tf.Tensor`, representing
+        whether the last batch should be dropped in the case its has fewer than
+        `batch_size` elements; the default behavior is not to drop the smaller
+        batch.
 
     Returns:
       Dataset: A `Dataset`.
     """
-    return PaddedBatchDataset(self, batch_size, padded_shapes, padding_values)
+    return PaddedBatchDataset(self, batch_size, padded_shapes, padding_values,
+                              drop_remainder)
 
   def map(self, map_func, num_parallel_calls=None):
     """Maps `map_func` across this dataset.
@@ -958,7 +1006,8 @@ class Dataset(object):
         scalar `tf.bool` tensor.
 
     Returns:
-      Dataset: A `Dataset`.
+      Dataset: The `Dataset` containing the elements of this dataset for which
+          `predicate` is `True`.
     """
     return FilterDataset(self, predicate)
 
@@ -1109,6 +1158,30 @@ class SparseTensorSliceDataset(Dataset):
     return (dtypes.int64, self._sparse_tensor.dtype, dtypes.int64)
 
 
+def flat_structure(dataset):
+  """Helper for setting `output_shapes` and `output_types` attrs of Dataset ops.
+
+  Most Dataset op constructors expect `output_shapes` and `output_types`
+  arguments that represent the flattened structure of an element. This helper
+  function generates these attrs as a keyword argument dictionary, allowing
+  `Dataset._as_variant_tensor()` implementations to pass
+  `**flat_structure(self)` to the op constructor.
+
+  Args:
+    dataset: A @{tf.data.Dataset}.
+
+  Returns:
+    A dictionary of keyword arguments that can be passed to many Dataset op
+    constructors.
+  """
+  return {
+      "output_shapes": nest.flatten(sparse.as_dense_shapes(
+          dataset.output_shapes, dataset.output_classes)),
+      "output_types": nest.flatten(sparse.as_dense_types(
+          dataset.output_types, dataset.output_classes)),
+  }
+
+
 class _GeneratorDataset(Dataset):
   """A `Dataset` that generates elements by invoking a function."""
 
@@ -1281,10 +1354,7 @@ class _GeneratorDataset(Dataset):
         init_func=self._init_func,
         next_func=self._next_func,
         finalize_func=self._finalize_func,
-        output_types=nest.flatten(
-            sparse.as_dense_types(self.output_types, self.output_classes)),
-        output_shapes=nest.flatten(
-            sparse.as_dense_shapes(self.output_shapes, self.output_classes)))
+        **flat_structure(self))
 
   @property
   def output_classes(self):
@@ -1321,16 +1391,7 @@ class ZipDataset(Dataset):
     # pylint: disable=protected-access
     return gen_dataset_ops.zip_dataset(
         [ds._as_variant_tensor() for ds in nest.flatten(self._datasets)],
-        output_shapes=[
-            s
-            for ds in nest.flatten(self._datasets)
-            for s in nest.flatten(ds.output_shapes)
-        ],
-        output_types=[
-            t
-            for ds in nest.flatten(self._datasets)
-            for t in nest.flatten(ds.output_types)
-        ])
+        **flat_structure(self))
     # pylint: enable=protected-access
 
   @property
@@ -1375,10 +1436,7 @@ class ConcatenateDataset(Dataset):
     return gen_dataset_ops.concatenate_dataset(
         self._input_dataset._as_variant_tensor(),
         self._dataset_to_concatenate._as_variant_tensor(),
-        output_shapes=nest.flatten(
-            sparse.as_dense_shapes(self.output_shapes, self.output_classes)),
-        output_types=nest.flatten(
-            sparse.as_dense_types(self.output_types, self.output_classes)))
+        **flat_structure(self))
     # pylint: enable=protected-access
 
   @property
@@ -1416,10 +1474,7 @@ class RepeatDataset(Dataset):
     return gen_dataset_ops.repeat_dataset(
         self._input_dataset._as_variant_tensor(),  # pylint: disable=protected-access
         count=self._count,
-        output_shapes=nest.flatten(
-            sparse.as_dense_shapes(self.output_shapes, self.output_classes)),
-        output_types=nest.flatten(
-            sparse.as_dense_types(self.output_types, self.output_classes)))
+        **flat_structure(self))
 
   @property
   def output_classes(self):
@@ -1466,10 +1521,7 @@ class RangeDataset(Dataset):
         start=self._start,
         stop=self._stop,
         step=self._step,
-        output_shapes=nest.flatten(
-            sparse.as_dense_shapes(self.output_shapes, self.output_classes)),
-        output_types=nest.flatten(
-            sparse.as_dense_types(self.output_types, self.output_classes)))
+        **flat_structure(self))
 
   @property
   def output_classes(self):
@@ -1498,10 +1550,7 @@ class CacheDataset(Dataset):
     return gen_dataset_ops.cache_dataset(
         self._input_dataset._as_variant_tensor(),  # pylint: disable=protected-access
         filename=self._filename,
-        output_shapes=nest.flatten(
-            sparse.as_dense_shapes(self.output_shapes, self.output_classes)),
-        output_types=nest.flatten(
-            sparse.as_dense_types(self.output_types, self.output_classes)))
+        **flat_structure(self))
 
   @property
   def output_classes(self):
@@ -1561,10 +1610,7 @@ class ShuffleDataset(Dataset):
         seed=self._seed,
         seed2=self._seed2,
         reshuffle_each_iteration=self._reshuffle_each_iteration,
-        output_shapes=nest.flatten(
-            sparse.as_dense_shapes(self.output_shapes, self.output_classes)),
-        output_types=nest.flatten(
-            sparse.as_dense_types(self.output_types, self.output_classes)))
+        **flat_structure(self))
 
   @property
   def output_classes(self):
@@ -1592,10 +1638,7 @@ class TakeDataset(Dataset):
     return gen_dataset_ops.take_dataset(
         self._input_dataset._as_variant_tensor(),  # pylint: disable=protected-access
         count=self._count,
-        output_shapes=nest.flatten(
-            sparse.as_dense_shapes(self.output_shapes, self.output_classes)),
-        output_types=nest.flatten(
-            sparse.as_dense_types(self.output_types, self.output_classes)))
+        **flat_structure(self))
 
   @property
   def output_classes(self):
@@ -1623,10 +1666,7 @@ class SkipDataset(Dataset):
     return gen_dataset_ops.skip_dataset(
         self._input_dataset._as_variant_tensor(),  # pylint: disable=protected-access
         count=self._count,
-        output_shapes=nest.flatten(
-            sparse.as_dense_shapes(self.output_shapes, self.output_classes)),
-        output_types=nest.flatten(
-            sparse.as_dense_types(self.output_types, self.output_classes)))
+        **flat_structure(self))
 
   @property
   def output_classes(self):
@@ -1644,21 +1684,28 @@ class SkipDataset(Dataset):
 class BatchDataset(Dataset):
   """A `Dataset` that batches contiguous elements from its input."""
 
-  def __init__(self, input_dataset, batch_size):
+  def __init__(self, input_dataset, batch_size, drop_remainder):
     """See `Dataset.batch()` for details."""
     super(BatchDataset, self).__init__()
     self._input_dataset = input_dataset
     self._batch_size = ops.convert_to_tensor(
         batch_size, dtype=dtypes.int64, name="batch_size")
+    self._drop_remainder = ops.convert_to_tensor(
+        drop_remainder, dtype=dtypes.bool, name="drop_remainder")
 
   def _as_variant_tensor(self):
-    return gen_dataset_ops.batch_dataset(
-        self._input_dataset._as_variant_tensor(),  # pylint: disable=protected-access
-        batch_size=self._batch_size,
-        output_shapes=nest.flatten(
-            sparse.as_dense_shapes(self.output_shapes, self.output_classes)),
-        output_types=nest.flatten(
-            sparse.as_dense_types(self.output_types, self.output_classes)))
+    # TODO(jsimsa): Switch to using v2 only any time after 6/30/2018.
+    if smart_cond.smart_constant_value(self._drop_remainder) is False:
+      return gen_dataset_ops.batch_dataset(
+          self._input_dataset._as_variant_tensor(),  # pylint: disable=protected-access
+          batch_size=self._batch_size,
+          **flat_structure(self))
+    else:
+      return gen_dataset_ops.batch_dataset_v2(
+          self._input_dataset._as_variant_tensor(),  # pylint: disable=protected-access
+          batch_size=self._batch_size,
+          drop_remainder=self._drop_remainder,
+          **flat_structure(self))
 
   @property
   def output_classes(self):
@@ -1668,7 +1715,9 @@ class BatchDataset(Dataset):
   def output_shapes(self):
     input_shapes = self._input_dataset.output_shapes
     return nest.pack_sequence_as(input_shapes, [
-        tensor_shape.vector(None).concatenate(s)
+        tensor_shape.vector(
+            tensor_util.constant_value(self._batch_size) if smart_cond.
+            smart_constant_value(self._drop_remainder) else None).concatenate(s)
         for s in nest.flatten(self._input_dataset.output_shapes)
     ])
 
@@ -1677,20 +1726,77 @@ class BatchDataset(Dataset):
     return self._input_dataset.output_types
 
 
-def _partial_shape_to_tensor(shape_like):
+def _is_padded_shape_compatible_with(padded_shape, input_component_shape):
+  """Returns `True` if `input_component_shape` can be padded to `padded_shape`.
+
+  Args:
+    padded_shape: A `tf.TensorShape`.
+    input_component_shape: A `tf.TensorShape`.
+
+  Returns:
+    `True` if `input_component_shape` can be padded to `padded_shape`, otherwise
+    `False`.
+  """
+
+  if padded_shape.dims is None or input_component_shape.dims is None:
+    return True
+  if len(padded_shape.dims) != len(input_component_shape.dims):
+    return False
+  for padded_dim, input_dim in zip(
+      padded_shape.dims, input_component_shape.dims):
+    if (padded_dim.value is not None and input_dim.value is not None
+        and padded_dim.value < input_dim.value):
+      return False
+  return True
+
+
+def _padded_shape_to_tensor(padded_shape, input_component_shape):
+  """Converts `padded_shape` to a `tf.Tensor` representing that shape.
+
+  Args:
+    padded_shape: A shape-like object, which may be a `tf.TensorShape`, a Python
+      sequence, or a 1-D `tf.Tensor` of `tf.int64` elements.
+    input_component_shape: A `tf.TensorShape`, with which `padded_shape` must
+      be compatible.
+
+  Returns:
+    A 1-D `tf.Tensor` of `tf.int64` elements, representing `padded_shape`.
+
+  Raises:
+    ValueError: If `padded_shape` is not a shape or not compatible with
+      `input_component_shape`.
+    TypeError: If `padded_shape` is not convertible to a `tf.int64` tensor.
+  """
   try:
-    # First attempt to convert the input to a shape, and return the
-    # "canonical" tensor representation, which uses `-1` in place of
-    # `None`.
-    shape_like = tensor_shape.as_shape(shape_like)
-    return ops.convert_to_tensor(
-        [dim if dim is not None else -1 for dim in shape_like.as_list()],
-        dtype=dtypes.int64)
+    # Try to convert the `padded_shape` to a `tf.TensorShape`
+    padded_shape_as_shape = tensor_shape.as_shape(padded_shape)
+    # We will return the "canonical" tensor representation, which uses
+    # `-1` in place of `None`.
+    ret = ops.convert_to_tensor(
+        [dim if dim is not None else -1
+         for dim in padded_shape_as_shape.as_list()], dtype=dtypes.int64)
   except (TypeError, ValueError):
     # The argument was not trivially convertible to a
     # `tf.TensorShape`, so fall back on the conversion to tensor
     # machinery.
-    return ops.convert_to_tensor(shape_like, dtype=dtypes.int64)
+    ret = ops.convert_to_tensor(padded_shape, preferred_dtype=dtypes.int64)
+    if ret.shape.dims is not None and len(ret.shape.dims) != 1:
+      raise ValueError(
+          "Padded shape %s must be a 1-D tensor of tf.int64 values, but its "
+          "shape was %s." % (padded_shape, ret.shape))
+    if ret.dtype != dtypes.int64:
+      raise TypeError(
+          "Padded shape %s must be a 1-D tensor of tf.int64 values, but its "
+          "element type was %s." % (padded_shape, ret.dtype.name))
+    padded_shape_as_shape = tensor_util.constant_value_as_shape(ret)
+
+  if not _is_padded_shape_compatible_with(padded_shape_as_shape,
+                                          input_component_shape):
+    raise ValueError("The padded shape %s is not compatible with the "
+                     "corresponding input component shape %s."
+                     % (padded_shape_as_shape, input_component_shape))
+
+  return ret
 
 
 def _padding_value_to_tensor(value, output_type):
@@ -1732,7 +1838,8 @@ def _default_padding(input_dataset):
 class PaddedBatchDataset(Dataset):
   """A `Dataset` that batches and pads contiguous elements from its input."""
 
-  def __init__(self, input_dataset, batch_size, padded_shapes, padding_values):
+  def __init__(self, input_dataset, batch_size, padded_shapes, padding_values,
+               drop_remainder):
     """See `Dataset.batch()` for details."""
     super(PaddedBatchDataset, self).__init__()
     if sparse.any_sparse(input_dataset.output_classes):
@@ -1745,23 +1852,51 @@ class PaddedBatchDataset(Dataset):
     padding_values = (
         padding_values
         if padding_values is not None else _default_padding(input_dataset))
-    self._padded_shapes = nest.map_structure_up_to(
-        input_dataset.output_shapes, _partial_shape_to_tensor, padded_shapes)
+
+    flat_padded_shapes = nest.flatten_up_to(input_dataset.output_shapes,
+                                            padded_shapes)
+
+    flat_padded_shapes_as_tensors = []
+
+    for input_component_shape, padded_shape in zip(
+        nest.flatten(input_dataset.output_shapes), flat_padded_shapes):
+      flat_padded_shapes_as_tensors.append(
+          _padded_shape_to_tensor(padded_shape, input_component_shape))
+
+    self._padded_shapes = nest.pack_sequence_as(input_dataset.output_shapes,
+                                                flat_padded_shapes_as_tensors)
+
     self._padding_values = nest.map_structure_up_to(
         input_dataset.output_shapes, _padding_value_to_tensor, padding_values,
         input_dataset.output_types)
+    self._drop_remainder = ops.convert_to_tensor(
+        drop_remainder, dtype=dtypes.bool, name="drop_remainder")
 
   def _as_variant_tensor(self):
-    return gen_dataset_ops.padded_batch_dataset(
-        self._input_dataset._as_variant_tensor(),  # pylint: disable=protected-access
-        batch_size=self._batch_size,
-        padded_shapes=[
-            ops.convert_to_tensor(s, dtype=dtypes.int64)
-            for s in nest.flatten(self._padded_shapes)
-        ],
-        padding_values=nest.flatten(self._padding_values),
-        output_shapes=nest.flatten(
-            sparse.as_dense_shapes(self.output_shapes, self.output_classes)))
+    # TODO(jsimsa): Switch to using v2 only any time after 6/30/2018.
+    if smart_cond.smart_constant_value(self._drop_remainder) is False:
+      return gen_dataset_ops.padded_batch_dataset(
+          self._input_dataset._as_variant_tensor(),  # pylint: disable=protected-access
+          batch_size=self._batch_size,
+          padded_shapes=[
+              ops.convert_to_tensor(s, dtype=dtypes.int64)
+              for s in nest.flatten(self._padded_shapes)
+          ],
+          padding_values=nest.flatten(self._padding_values),
+          output_shapes=nest.flatten(
+              sparse.as_dense_shapes(self.output_shapes, self.output_classes)))
+    else:
+      return gen_dataset_ops.padded_batch_dataset_v2(
+          self._input_dataset._as_variant_tensor(),  # pylint: disable=protected-access
+          batch_size=self._batch_size,
+          padded_shapes=[
+              ops.convert_to_tensor(s, dtype=dtypes.int64)
+              for s in nest.flatten(self._padded_shapes)
+          ],
+          padding_values=nest.flatten(self._padding_values),
+          drop_remainder=self._drop_remainder,
+          output_shapes=nest.flatten(
+              sparse.as_dense_shapes(self.output_shapes, self.output_classes)))
 
   @property
   def output_classes(self):
@@ -1771,8 +1906,10 @@ class PaddedBatchDataset(Dataset):
   def output_shapes(self):
 
     def _padded_shape_to_batch_shape(s):
-      return tensor_shape.vector(None).concatenate(
-          tensor_util.constant_value_as_shape(s))
+      return tensor_shape.vector(
+          tensor_util.constant_value(self._batch_size) if smart_cond.
+          smart_constant_value(self._drop_remainder) else None).concatenate(
+              tensor_util.constant_value_as_shape(s))
 
     return nest.map_structure(_padded_shape_to_batch_shape, self._padded_shapes)
 
@@ -1786,6 +1923,24 @@ def _should_unpack_args(args):
   return type(args) is tuple  # pylint: disable=unidiomatic-typecheck
 
 
+def _warn_if_collections(transformation_name):
+  """Prints warning message if the current graph uses common graph collections.
+
+  NOTE(mrry): Currently a warning is only generated for lookup tables. Any
+  variables created will be automatically hoisted out to the outermost scope
+  using `init_scope()`. Some collections (such as for control-flow contexts)
+  are benign and should not generate a warning.
+
+  Args:
+    transformation_name: A human-readable name for the transformation.
+  """
+  if ops.get_default_graph().get_collection(ops.GraphKeys.TABLE_INITIALIZERS):
+    warnings.warn("Creating lookup tables inside a function passed to %s is not"
+                  " supported. Create each table outside the function, and "
+                  "capture it inside the function to use it."
+                  % transformation_name)
+
+
 class MapDataset(Dataset):
   """A `Dataset` that maps a function over elements in its input."""
 
@@ -1845,6 +2000,8 @@ class MapDataset(Dataset):
       self._output_types = nest.pack_sequence_as(
           ret, [t.dtype for t in nest.flatten(ret)])
 
+      _warn_if_collections("Dataset.map()")
+
       # Serialize any sparse tensors.
       ret = nest.pack_sequence_as(
           ret, [t for t in nest.flatten(sparse.serialize_sparse_tensors(ret))])
@@ -1859,10 +2016,7 @@ class MapDataset(Dataset):
         input_t,
         self._map_func.captured_inputs,
         f=self._map_func,
-        output_types=nest.flatten(
-            sparse.as_dense_types(self.output_types, self.output_classes)),
-        output_shapes=nest.flatten(
-            sparse.as_dense_shapes(self.output_shapes, self.output_classes)))
+        **flat_structure(self))
 
   @property
   def output_classes(self):
@@ -1895,10 +2049,7 @@ class ParallelMapDataset(MapDataset):
         self._map_func.captured_inputs,
         f=self._map_func,
         num_parallel_calls=self._num_parallel_calls,
-        output_types=nest.flatten(
-            sparse.as_dense_types(self.output_types, self.output_classes)),
-        output_shapes=nest.flatten(
-            sparse.as_dense_shapes(self.output_shapes, self.output_classes)))
+        **flat_structure(self))
     # pylint: enable=protected-access
 
 
@@ -1933,6 +2084,8 @@ class FlatMapDataset(Dataset):
       if not isinstance(dataset, Dataset):
         raise TypeError("`map_func` must return a `Dataset` object.")
 
+      _warn_if_collections(self._transformation_name())
+
       self._output_classes = dataset.output_classes
       self._output_types = dataset.output_types
       self._output_shapes = dataset.output_shapes
@@ -1947,10 +2100,7 @@ class FlatMapDataset(Dataset):
         self._input_dataset._as_variant_tensor(),  # pylint: disable=protected-access
         self._map_func.captured_inputs,
         f=self._map_func,
-        output_types=nest.flatten(
-            sparse.as_dense_types(self.output_types, self.output_classes)),
-        output_shapes=nest.flatten(
-            sparse.as_dense_shapes(self.output_shapes, self.output_classes)))
+        **flat_structure(self))
 
   @property
   def output_classes(self):
@@ -1964,6 +2114,9 @@ class FlatMapDataset(Dataset):
   def output_types(self):
     return self._output_types
 
+  def _transformation_name(self):
+    return "Dataset.flat_map()"
+
 
 class InterleaveDataset(FlatMapDataset):
   """A `Dataset` that maps a function over its input and interleaves the result.
@@ -1984,10 +2137,10 @@ class InterleaveDataset(FlatMapDataset):
         self._cycle_length,
         self._block_length,
         f=self._map_func,  # pylint: disable=protected-access
-        output_types=nest.flatten(
-            sparse.as_dense_types(self.output_types, self.output_classes)),
-        output_shapes=nest.flatten(
-            sparse.as_dense_shapes(self.output_shapes, self.output_classes)))
+        **flat_structure(self))
+
+  def _transformation_name(self):
+    return "Dataset.interleave()"
 
 
 class FilterDataset(Dataset):
@@ -2023,6 +2176,8 @@ class FilterDataset(Dataset):
               ret.shape.is_compatible_with(tensor_shape.scalar())):
         raise ValueError("`predicate` must return a scalar boolean tensor.")
 
+      _warn_if_collections("Dataset.filter()")
+
       return ret
 
     self._predicate = tf_predicate
@@ -2033,10 +2188,7 @@ class FilterDataset(Dataset):
         self._input_dataset._as_variant_tensor(),  # pylint: disable=protected-access
         other_arguments=self._predicate.captured_inputs,
         predicate=self._predicate,
-        output_types=nest.flatten(
-            sparse.as_dense_types(self.output_types, self.output_classes)),
-        output_shapes=nest.flatten(
-            sparse.as_dense_shapes(self.output_shapes, self.output_classes)))
+        **flat_structure(self))
 
   @property
   def output_classes(self):
@@ -2067,10 +2219,7 @@ class PrefetchDataset(Dataset):
     return gen_dataset_ops.prefetch_dataset(
         self._input_dataset._as_variant_tensor(),  # pylint: disable=protected-access
         buffer_size=self._buffer_size,
-        output_shapes=nest.flatten(
-            sparse.as_dense_shapes(self.output_shapes, self.output_classes)),
-        output_types=nest.flatten(
-            sparse.as_dense_types(self.output_types, self.output_classes)))
+        **flat_structure(self))
 
   @property
   def output_classes(self):
diff --git a/tensorflow/python/data/ops/iterator_ops.py b/tensorflow/python/data/ops/iterator_ops.py
index fd164277b6fd7509403d84c50a62638df7968a03..b6dba4e3ca3874b8e9bc3b7ea92fb91fe41759d8 100644
--- a/tensorflow/python/data/ops/iterator_ops.py
+++ b/tensorflow/python/data/ops/iterator_ops.py
@@ -471,9 +471,7 @@ class EagerIterator(object):
           sparse.as_dense_types(self._output_types, self._output_classes))
       self._flat_output_shapes = nest.flatten(
           sparse.as_dense_shapes(self._output_shapes, self._output_classes))
-      self._resource = gen_dataset_ops.iterator(
-          shared_name="",
-          container=_generate_shared_name("eageriterator"),
+      self._resource = gen_dataset_ops.anonymous_iterator(
           output_types=self._flat_output_types,
           output_shapes=self._flat_output_shapes)
       gen_dataset_ops.make_iterator(ds_variant, self._resource)
diff --git a/tensorflow/python/data/ops/readers.py b/tensorflow/python/data/ops/readers.py
index a73a8b5cdc494d7a14c1a2bcb6aa766dbf819403..066e09969c0ba8f054ada42a40960c7513945963 100644
--- a/tensorflow/python/data/ops/readers.py
+++ b/tensorflow/python/data/ops/readers.py
@@ -19,8 +19,6 @@ from __future__ import print_function
 
 from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python.data.util import convert
-from tensorflow.python.data.util import nest
-from tensorflow.python.data.util import sparse
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import tensor_shape
@@ -150,12 +148,12 @@ class ParallelInterleaveDataset(dataset_ops.InterleaveDataset):
         self._buffer_output_elements,
         self._prefetch_input_elements,
         f=self._map_func,
-        output_types=nest.flatten(
-            sparse.as_dense_types(self.output_types, self.output_classes)),
-        output_shapes=nest.flatten(
-            sparse.as_dense_shapes(self.output_shapes, self.output_classes)))
+        **dataset_ops.flat_structure(self))
     # pylint: enable=protected-access
 
+  def _transformation_name(self):
+    return "tf.contrib.data.parallel_interleave()"
+
 
 @tf_export("data.TFRecordDataset")
 class TFRecordDataset(dataset_ops.Dataset):
diff --git a/tensorflow/python/data/util/BUILD b/tensorflow/python/data/util/BUILD
index 0fc32d51b9fe581a54519139f3bf12118f8f4028..5fcc62b60b696e05d7674c0bf46f57e71d6cc007 100644
--- a/tensorflow/python/data/util/BUILD
+++ b/tensorflow/python/data/util/BUILD
@@ -70,6 +70,7 @@ py_library(
         "//tensorflow/python:constant_op",
         "//tensorflow/python:dtypes",
         "//tensorflow/python:framework_ops",
+        "//tensorflow/python:tensor_shape",
     ],
 )
 
diff --git a/tensorflow/python/data/util/convert.py b/tensorflow/python/data/util/convert.py
index eeb1d700f3c67a1a2ab627aa8a291755bc2127e4..746b3d66de082d59e8c1e316c51e2a9ab7670e6d 100644
--- a/tensorflow/python/data/util/convert.py
+++ b/tensorflow/python/data/util/convert.py
@@ -20,6 +20,7 @@ from __future__ import print_function
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
+from tensorflow.python.framework import tensor_shape
 
 
 def optional_param_to_tensor(argument_name,
@@ -32,3 +33,40 @@ def optional_param_to_tensor(argument_name,
   else:
     return constant_op.constant(
         argument_default, dtype=argument_dtype, name=argument_name)
+
+
+def partial_shape_to_tensor(shape_like):
+  """Returns a @{tf.Tensor} that represents the given shape.
+
+  Args:
+    shape_like: A value that can be converted to a @{tf.TensorShape} or a
+      @{tf.Tensor}.
+
+  Returns:
+    A 1-D `tf.Tensor` of `tf.int64` elements representing the given shape, where
+    `-1` is substituted for any unknown dimensions.
+  """
+  try:
+    # First attempt to convert the input to a shape, and return the
+    # "canonical" tensor representation, which uses `-1` in place of
+    # `None`.
+    shape_like = tensor_shape.as_shape(shape_like)
+    return ops.convert_to_tensor(
+        [dim if dim is not None else -1 for dim in shape_like.as_list()],
+        dtype=dtypes.int64)
+  except (TypeError, ValueError):
+    # The argument was not trivially convertible to a
+    # `tf.TensorShape`, so fall back on the conversion to tensor
+    # machinery.
+    ret = ops.convert_to_tensor(shape_like, preferred_dtype=dtypes.int64)
+    if ret.shape.dims is not None and len(ret.shape.dims) != 1:
+      raise ValueError("The given shape %s must be a 1-D tensor of tf.int64 "
+                       "values, but the shape was %s."
+                       % (shape_like, ret.shape))
+    if ret.dtype != dtypes.int64:
+      raise TypeError("The given shape %s must be a 1-D tensor of tf.int64 "
+                      "values, but the element type was %s."
+                      % (shape_like, ret.dtype.name))
+
+    return ret
+
diff --git a/tensorflow/python/data/util/convert_test.py b/tensorflow/python/data/util/convert_test.py
index 2cb6488070eb422f6c8d56ca5d712cbdf09fa883..6a67093e48c988b01b8137a544078d570aabf74f 100644
--- a/tensorflow/python/data/util/convert_test.py
+++ b/tensorflow/python/data/util/convert_test.py
@@ -19,7 +19,9 @@ from __future__ import division
 from __future__ import print_function
 
 from tensorflow.python.data.util import convert
+from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import tensor_shape
 from tensorflow.python.platform import test
 from tensorflow.python.util import compat
 
@@ -48,6 +50,77 @@ class ConvertTest(test.TestCase):
     with self.test_session() as sess:
       self.assertEqual(compat.as_bytes("value"), sess.run(resp))
 
+  def testPartialShapeToTensorKnownDimension(self):
+    with self.test_session() as sess:
+      self.assertAllEqual([1], sess.run(convert.partial_shape_to_tensor(
+          tensor_shape.TensorShape([1]))))
+      self.assertAllEqual([1], sess.run(convert.partial_shape_to_tensor((1,))))
+      self.assertAllEqual([1], sess.run(convert.partial_shape_to_tensor([1])))
+      self.assertAllEqual([1], sess.run(convert.partial_shape_to_tensor(
+          constant_op.constant([1], dtype=dtypes.int64))))
+
+  def testPartialShapeToTensorUnknownDimension(self):
+    with self.test_session() as sess:
+      self.assertAllEqual([-1], sess.run(convert.partial_shape_to_tensor(
+          tensor_shape.TensorShape([None]))))
+      self.assertAllEqual([-1], sess.run(convert.partial_shape_to_tensor(
+          (None,))))
+      self.assertAllEqual([-1], sess.run(convert.partial_shape_to_tensor(
+          [None])))
+      self.assertAllEqual([-1], sess.run(convert.partial_shape_to_tensor(
+          [-1])))
+      self.assertAllEqual([-1], sess.run(convert.partial_shape_to_tensor(
+          constant_op.constant([-1], dtype=dtypes.int64))))
+
+    with self.assertRaisesRegexp(
+        ValueError, r"The given shape .* must be a 1-D tensor of tf.int64 "
+        r"values, but the shape was \(2, 2\)."):
+      convert.partial_shape_to_tensor(constant_op.constant(
+          [[1, 1], [1, 1]], dtype=dtypes.int64))
+
+    with self.assertRaisesRegexp(
+        TypeError, r"The given shape .* must be a 1-D tensor of tf.int64 "
+        r"values, but the element type was float32."):
+      convert.partial_shape_to_tensor(constant_op.constant([1., 1.]))
+
+  def testPartialShapeToTensorMultipleDimensions(self):
+    with self.test_session() as sess:
+      self.assertAllEqual([3, 6], sess.run(convert.partial_shape_to_tensor(
+          tensor_shape.TensorShape([3, 6]))))
+      self.assertAllEqual([3, 6], sess.run(convert.partial_shape_to_tensor(
+          (3, 6))))
+      self.assertAllEqual([3, 6], sess.run(convert.partial_shape_to_tensor(
+          [3, 6])))
+      self.assertAllEqual([3, 6], sess.run(convert.partial_shape_to_tensor(
+          constant_op.constant([3, 6], dtype=dtypes.int64))))
+
+      self.assertAllEqual([3, -1], sess.run(convert.partial_shape_to_tensor(
+          tensor_shape.TensorShape([3, None]))))
+      self.assertAllEqual([3, -1], sess.run(convert.partial_shape_to_tensor(
+          (3, None))))
+      self.assertAllEqual([3, -1], sess.run(convert.partial_shape_to_tensor(
+          [3, None])))
+      self.assertAllEqual([3, -1], sess.run(convert.partial_shape_to_tensor(
+          constant_op.constant([3, -1], dtype=dtypes.int64))))
+
+      self.assertAllEqual([-1, -1], sess.run(convert.partial_shape_to_tensor(
+          tensor_shape.TensorShape([None, None]))))
+      self.assertAllEqual([-1, -1], sess.run(convert.partial_shape_to_tensor(
+          (None, None))))
+      self.assertAllEqual([-1, -1], sess.run(convert.partial_shape_to_tensor(
+          [None, None])))
+      self.assertAllEqual([-1, -1], sess.run(convert.partial_shape_to_tensor(
+          constant_op.constant([-1, -1], dtype=dtypes.int64))))
+
+  def testPartialShapeToTensorScalar(self):
+    with self.test_session() as sess:
+      self.assertAllEqual([], sess.run(convert.partial_shape_to_tensor(
+          tensor_shape.TensorShape([]))))
+      self.assertAllEqual([], sess.run(convert.partial_shape_to_tensor(())))
+      self.assertAllEqual([], sess.run(convert.partial_shape_to_tensor([])))
+      self.assertAllEqual([], sess.run(convert.partial_shape_to_tensor(
+          constant_op.constant([], dtype=dtypes.int64))))
+
 
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/python/debug/BUILD b/tensorflow/python/debug/BUILD
index 183994ddaa72b5961f62f34fb43935b0859a3b25..09062abd7446628ede12e782e202ee0e55905879 100644
--- a/tensorflow/python/debug/BUILD
+++ b/tensorflow/python/debug/BUILD
@@ -572,6 +572,7 @@ py_test(
         ":source_utils",
         "//tensorflow/core:protos_all_py",
         "//tensorflow/python:client",
+        "//tensorflow/python:client_testlib",
         "//tensorflow/python:framework_ops",
         "//tensorflow/python:framework_test_lib",
         "//tensorflow/python:math_ops",
@@ -1003,6 +1004,7 @@ cuda_py_test(
         "no_oss",  # Test flaky due to port collisions.
         "no_windows",
         "noasan",  # Times out due to size of test (b/73731462).
+        "optonly",  # Test flaky (b/80130873)
         "oss_serial",
     ],
 )
diff --git a/tensorflow/python/debug/cli/cli_shared.py b/tensorflow/python/debug/cli/cli_shared.py
index dea019fef58015fbd7982a81319dcabe4e5f4930..6a368682de5db12e128f010bfe0c9bbf9cf3b997 100644
--- a/tensorflow/python/debug/cli/cli_shared.py
+++ b/tensorflow/python/debug/cli/cli_shared.py
@@ -451,42 +451,48 @@ def get_error_intro(tf_error):
       sample commands for debugging.
   """
 
-  op_name = tf_error.op.name
+  if hasattr(tf_error, "op") and hasattr(tf_error.op, "name"):
+    op_name = tf_error.op.name
+  else:
+    op_name = None
 
   intro_lines = [
       "--------------------------------------",
       RL("!!! An error occurred during the run !!!", "blink"),
       "",
-      "You may use the following commands to debug:",
   ]
 
   out = debugger_cli_common.rich_text_lines_from_rich_line_list(intro_lines)
 
-  out.extend(
-      _recommend_command("ni -a -d -t %s" % op_name,
-                         "Inspect information about the failing op.",
-                         create_link=True))
-  out.extend(
-      _recommend_command("li -r %s" % op_name,
-                         "List inputs to the failing op, recursively.",
-                         create_link=True))
-
-  out.extend(
-      _recommend_command(
-          "lt",
-          "List all tensors dumped during the failing run() call.",
-          create_link=True))
+  if op_name is not None:
+    out.extend(debugger_cli_common.RichTextLines(
+        ["You may use the following commands to debug:"]))
+    out.extend(
+        _recommend_command("ni -a -d -t %s" % op_name,
+                           "Inspect information about the failing op.",
+                           create_link=True))
+    out.extend(
+        _recommend_command("li -r %s" % op_name,
+                           "List inputs to the failing op, recursively.",
+                           create_link=True))
+
+    out.extend(
+        _recommend_command(
+            "lt",
+            "List all tensors dumped during the failing run() call.",
+            create_link=True))
+  else:
+    out.extend(debugger_cli_common.RichTextLines([
+        "WARNING: Cannot determine the name of the op that caused the error."]))
 
   more_lines = [
       "",
-      "Op name:    " + op_name,
+      "Op name:    %s" % op_name,
       "Error type: " + str(type(tf_error)),
       "",
       "Details:",
       str(tf_error),
       "",
-      "WARNING: Using client GraphDef due to the error, instead of "
-      "executor GraphDefs.",
       "--------------------------------------",
       "",
   ]
diff --git a/tensorflow/python/debug/cli/cli_shared_test.py b/tensorflow/python/debug/cli/cli_shared_test.py
index 3d7939490dfe08118ee4972541c4166b2a536608..07b364db9f2aab9c11ecb769a94f36e0809d70a0 100644
--- a/tensorflow/python/debug/cli/cli_shared_test.py
+++ b/tensorflow/python/debug/cli/cli_shared_test.py
@@ -372,6 +372,11 @@ class GetErrorIntroTest(test_util.TensorFlowTestCase):
     self.assertEqual("Details:", error_intro.lines[14])
     self.assertStartsWith(error_intro.lines[15], "foo description")
 
+  def testGetErrorIntroForNoOpName(self):
+    tf_error = errors.OpError(None, None, "Fake OpError", -1)
+    error_intro = cli_shared.get_error_intro(tf_error)
+    self.assertIn("Cannot determine the name of the op", error_intro.lines[3])
+
 
 if __name__ == "__main__":
   googletest.main()
diff --git a/tensorflow/python/debug/examples/examples_test.sh b/tensorflow/python/debug/examples/examples_test.sh
index 2df6c0b6a2701022e3fed6648208b9708197bebc..e9c45a7e6e92d069f51648647620f7a7c3a5aadc 100755
--- a/tensorflow/python/debug/examples/examples_test.sh
+++ b/tensorflow/python/debug/examples/examples_test.sh
@@ -69,6 +69,12 @@ run
 exit
 EOF
 
+cat << EOF | ${DEBUG_ERRORS_BIN} --error=uninitialized_variable --debug --ui_type=readline
+run
+ni -a -d -t v/read
+exit
+EOF
+
 cat << EOF | ${DEBUG_MNIST_BIN} --debug --max_steps=1 --fake_data --ui_type=readline
 run -t 1
 run --node_name_filter hidden --op_type_filter MatMul
diff --git a/tensorflow/python/debug/lib/debug_data.py b/tensorflow/python/debug/lib/debug_data.py
index 8a65ad087b3002d8ad93f3a64f48715d26ff62d8..7c96c2878c78d5650f3d1907065cc17c4eb71f5c 100644
--- a/tensorflow/python/debug/lib/debug_data.py
+++ b/tensorflow/python/debug/lib/debug_data.py
@@ -748,7 +748,7 @@ class DebugDumpDir(object):
     return sum(len(self._dump_tensor_data[device_name])
                for device_name in self._dump_tensor_data)
 
-  def _load_partition_graphs(self, partition_graphs, validate):
+  def _load_partition_graphs(self, client_partition_graphs, validate):
     """Load and process partition graphs.
 
     Load the graphs; parse the input and control input structure; obtain the
@@ -757,8 +757,10 @@ class DebugDumpDir(object):
     tensor dumps.
 
     Args:
-      partition_graphs: A repeated field of GraphDefs representing the
-          partition graphs executed by the TensorFlow runtime.
+      client_partition_graphs: A repeated field of GraphDefs representing the
+        partition graphs executed by the TensorFlow runtime, from the Python
+        client. These partition graphs are used only if partition graphs
+        cannot be loaded from the dump directory on the file system.
       validate: (`bool`) Whether the dump files are to be validated against the
         partition graphs.
 
@@ -769,24 +771,23 @@ class DebugDumpDir(object):
     self._debug_graphs = {}
     self._node_devices = {}
 
-    if partition_graphs:
-      partition_graphs_and_device_names = [
-          (partition_graph, None) for partition_graph in partition_graphs]
-    else:
-      partition_graphs_and_device_names = []
-      for device_name in self._device_names:
-        partition_graph = None
-        if device_name in self._dump_graph_file_paths:
-          partition_graph = _load_graph_def_from_event_file(
-              self._dump_graph_file_paths[device_name])
-        else:
-          partition_graph = self._find_partition_graph(partition_graphs,
-                                                       device_name)
-        if partition_graph:
-          partition_graphs_and_device_names.append((partition_graph,
-                                                    device_name))
-        else:
-          logging.warn("Failed to load partition graphs from disk.")
+    partition_graphs_and_device_names = []
+    for device_name in self._device_names:
+      partition_graph = None
+      if device_name in self._dump_graph_file_paths:
+        partition_graph = _load_graph_def_from_event_file(
+            self._dump_graph_file_paths[device_name])
+      else:
+        logging.warn(
+            "Failed to load partition graphs for device %s from disk. "
+            "As a fallback, the client graphs will be used. This "
+            "may cause mismatches in device names." % device_name)
+        partition_graph = self._find_partition_graph(client_partition_graphs,
+                                                     device_name)
+
+      if partition_graph:
+        partition_graphs_and_device_names.append((partition_graph,
+                                                  device_name))
 
     for partition_graph, maybe_device_name in partition_graphs_and_device_names:
       debug_graph = debug_graphs.DebugGraph(partition_graph,
diff --git a/tensorflow/python/debug/lib/grpc_debug_test_server.py b/tensorflow/python/debug/lib/grpc_debug_test_server.py
index 917004694845c752d1f6bf88cc2a203eb8f9ba73..a7be20948df0d88c0861007c926186a469ffa19e 100644
--- a/tensorflow/python/debug/lib/grpc_debug_test_server.py
+++ b/tensorflow/python/debug/lib/grpc_debug_test_server.py
@@ -245,7 +245,7 @@ class EventListenerTestServicer(grpc_debug_server.EventListenerBaseServicer):
     self._origin_id_to_strings = []
     self._graph_tracebacks = []
     self._graph_versions = []
-    self._source_files = None
+    self._source_files = []
 
   def _initialize_toggle_watch_state(self, toggle_watches):
     self._toggle_watches = toggle_watches
@@ -274,7 +274,7 @@ class EventListenerTestServicer(grpc_debug_server.EventListenerBaseServicer):
     self._origin_id_to_strings = []
     self._graph_tracebacks = []
     self._graph_versions = []
-    self._source_files = None
+    self._source_files = []
 
   def SendTracebacks(self, request, context):
     self._call_types.append(request.call_type)
@@ -286,7 +286,7 @@ class EventListenerTestServicer(grpc_debug_server.EventListenerBaseServicer):
     return debug_service_pb2.EventReply()
 
   def SendSourceFiles(self, request, context):
-    self._source_files = request
+    self._source_files.append(request)
     return debug_service_pb2.EventReply()
 
   def query_op_traceback(self, op_name):
@@ -351,9 +351,10 @@ class EventListenerTestServicer(grpc_debug_server.EventListenerBaseServicer):
     if not self._source_files:
       raise ValueError(
           "This debug server has not received any source file contents yet.")
-    for source_file_proto in self._source_files.source_files:
-      if source_file_proto.file_path == file_path:
-        return source_file_proto.lines[lineno - 1]
+    for source_files in self._source_files:
+      for source_file_proto in source_files.source_files:
+        if source_file_proto.file_path == file_path:
+          return source_file_proto.lines[lineno - 1]
     raise ValueError(
         "Source file at path %s has not been received by the debug server",
         file_path)
diff --git a/tensorflow/python/debug/lib/source_remote.py b/tensorflow/python/debug/lib/source_remote.py
index 4b6b2b995ecd13cffddaa38bd2ec673e6b824574..4afae41bc9a672c2c991f8fd2c3e1e6eecac193f 100644
--- a/tensorflow/python/debug/lib/source_remote.py
+++ b/tensorflow/python/debug/lib/source_remote.py
@@ -28,6 +28,7 @@ from tensorflow.python.debug.lib import common
 from tensorflow.python.debug.lib import debug_service_pb2_grpc
 from tensorflow.python.debug.lib import source_utils
 from tensorflow.python.platform import gfile
+from tensorflow.python.platform import tf_logging
 from tensorflow.python.profiler import tfprof_logger
 
 
@@ -95,6 +96,11 @@ def _source_file_paths_outside_tensorflow_py_library(code_defs, id_to_string):
   return non_tf_files
 
 
+def grpc_message_length_bytes():
+  """Maximum gRPC message length in bytes."""
+  return 4 * 1024 * 1024
+
+
 def _send_call_tracebacks(destinations,
                           origin_stack,
                           is_eager_execution=False,
@@ -155,17 +161,28 @@ def _send_call_tracebacks(destinations,
     source_file_paths.update(_source_file_paths_outside_tensorflow_py_library(
         [call_traceback.origin_stack], call_traceback.origin_id_to_string))
 
-    debugged_source_files = debug_pb2.DebuggedSourceFiles()
+    debugged_source_files = []
     for file_path in source_file_paths:
+      source_files = debug_pb2.DebuggedSourceFiles()
       _load_debugged_source_file(
-          file_path, debugged_source_files.source_files.add())
+          file_path, source_files.source_files.add())
+      debugged_source_files.append(source_files)
 
   for destination in destinations:
     channel = grpc.insecure_channel(destination)
     stub = debug_service_pb2_grpc.EventListenerStub(channel)
     stub.SendTracebacks(call_traceback)
     if send_source:
-      stub.SendSourceFiles(debugged_source_files)
+      for path, source_files in zip(
+          source_file_paths, debugged_source_files):
+        if source_files.ByteSize() < grpc_message_length_bytes():
+          stub.SendSourceFiles(source_files)
+        else:
+          tf_logging.warn(
+              "The content of the source file at %s is not sent to "
+              "gRPC debug server %s, because the message size exceeds "
+              "gRPC message length limit (%d bytes)." % (
+                  path, destination, grpc_message_length_bytes()))
 
 
 def send_graph_tracebacks(destinations,
diff --git a/tensorflow/python/debug/lib/source_remote_test.py b/tensorflow/python/debug/lib/source_remote_test.py
index 27bafa45e1207513e46fd2ae0f92d5bfa686ffd5..29add425e946aadfe941c73e9f9cef4aef3c8a9c 100644
--- a/tensorflow/python/debug/lib/source_remote_test.py
+++ b/tensorflow/python/debug/lib/source_remote_test.py
@@ -33,6 +33,7 @@ from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import resource_variable_ops  # pylint: disable=unused-import
 from tensorflow.python.ops import variables
 from tensorflow.python.platform import googletest
+from tensorflow.python.platform import test
 from tensorflow.python.util import tf_inspect
 
 
@@ -155,6 +156,51 @@ class SendTracebacksTest(test_util.TensorFlowTestCase):
         self.assertEqual(["dummy_run_key"], server.query_call_keys())
         self.assertEqual([sess.graph.version], server.query_graph_versions())
 
+  def testSourceFileSizeExceedsGrpcMessageLengthLimit(self):
+    """In case source file size exceeds the grpc message length limit.
+
+    it ought not to have been sent to the server.
+    """
+    this_func_name = "testSourceFileSizeExceedsGrpcMessageLengthLimit"
+
+    # Patch the method to simulate a very small message length limit.
+    with test.mock.patch.object(
+        source_remote, "grpc_message_length_bytes", return_value=2):
+      with session.Session() as sess:
+        a = variables.Variable(21.0, name="two/a")
+        a_lineno = line_number_above()
+        b = variables.Variable(2.0, name="two/b")
+        b_lineno = line_number_above()
+        x = math_ops.add(a, b, name="two/x")
+        x_lineno = line_number_above()
+
+        send_traceback = traceback.extract_stack()
+        send_lineno = line_number_above()
+        source_remote.send_graph_tracebacks(
+            [self._server_address, self._server_address_2],
+            "dummy_run_key", send_traceback, sess.graph)
+
+        servers = [self._server, self._server_2]
+        for server in servers:
+          # Even though the source file content is not sent, the traceback
+          # should have been sent.
+          tb = server.query_op_traceback("two/a")
+          self.assertIn((self._curr_file_path, a_lineno, this_func_name), tb)
+          tb = server.query_op_traceback("two/b")
+          self.assertIn((self._curr_file_path, b_lineno, this_func_name), tb)
+          tb = server.query_op_traceback("two/x")
+          self.assertIn((self._curr_file_path, x_lineno, this_func_name), tb)
+
+          self.assertIn(
+              (self._curr_file_path, send_lineno, this_func_name),
+              server.query_origin_stack()[-1])
+
+          tf_trace_file_path = (
+              self._findFirstTraceInsideTensorFlowPyLibrary(x.op))
+          # Verify that the source content is not sent to the server.
+          with self.assertRaises(ValueError):
+            self._server.query_source_file_line(tf_trace_file_path, 0)
+
   def testSendEagerTracebacksToSingleDebugServer(self):
     this_func_name = "testSendEagerTracebacksToSingleDebugServer"
     send_traceback = traceback.extract_stack()
diff --git a/tensorflow/python/eager/BUILD b/tensorflow/python/eager/BUILD
index 5530193d4e1dd8f351b60c06502b2406c2c75d33..dee86966f1bb08540c69f158e13ce6a288bd9821 100644
--- a/tensorflow/python/eager/BUILD
+++ b/tensorflow/python/eager/BUILD
@@ -26,12 +26,13 @@ cc_library(
         "//tensorflow/c/eager:tape",
         "//tensorflow/core:lib",
         "//tensorflow/core:protos_all_cc",
+        "//tensorflow/python:cpp_python_util",
         "//tensorflow/python:ndarray_tensor",
         "//tensorflow/python:ndarray_tensor_bridge",
         "//tensorflow/python:numpy_lib",
         "//tensorflow/python:py_seq_tensor",
         "//tensorflow/python:safe_ptr",
-        "//util/python:python_headers",
+        "//third_party/python_runtime:headers",
     ],
 )
 
diff --git a/tensorflow/python/eager/backprop.py b/tensorflow/python/eager/backprop.py
index dcfd03b4583efab06beca712d6cbabb635967136..bd97b181ff7fa5a38ea8ab16e55b3ade7b599261 100644
--- a/tensorflow/python/eager/backprop.py
+++ b/tensorflow/python/eager/backprop.py
@@ -20,7 +20,6 @@ from __future__ import print_function
 
 import functools
 import operator
-import threading
 
 import six
 
@@ -94,8 +93,8 @@ class _MockOp(object):
     )
 
 
-def _magic_gradient_function(op_name, attr_tuple, num_inputs,
-                             inputs, outputs, out_grads):
+def _gradient_function(op_name, attr_tuple, num_inputs, inputs, outputs,
+                       out_grads):
   """Calls the gradient function of the op.
 
   Args:
@@ -117,8 +116,7 @@ def _magic_gradient_function(op_name, attr_tuple, num_inputs,
   return grad_fn(mock_op, *out_grads)
 
 
-_gradient_functions = {}
-_gradient_functions_lock = threading.Lock()
+pywrap_tensorflow.TFE_Py_RegisterGradientFunction(_gradient_function)
 
 
 _tracing = False
@@ -142,22 +140,6 @@ _grad_fn_accepts_none_for_indices = {
 }
 
 
-def _get_backward_fn(op_name, attrs, num_inputs, op_inputs, op_outputs):
-
-  def grad_fn(*orig_outputs):
-    result = _magic_gradient_function(op_name, attrs, num_inputs,
-                                      op_inputs, op_outputs, orig_outputs)
-    if _tracing:
-      print("Gradient for", op_name, "inputs", op_inputs, "output_grads",
-            orig_outputs, "gradients", result)
-    return nest.flatten(result)
-
-  return grad_fn
-
-
-pywrap_tensorflow.TFE_Py_RegisterBackwardFunctionGetter(_get_backward_fn)
-
-
 def _record_gradient(op_name, inputs, attrs, results, name):
   return pywrap_tensorflow.TFE_Py_RecordGradient(op_name, inputs, attrs,
                                                  results, name)
@@ -214,27 +196,25 @@ def implicit_val_and_grad(f):
   # TODO(cais): Remove calls to tf.constant() once the gradients functions
   # accept lists and np.ndarrays.
 
-  def grad_fn(*args):
+  def grad_fn(*args, **kwds):
     """Computes the gradient of the wrapped function."""
     this_tape = tape.push_new_tape()
     try:
-      end_node = f(*args)
+      end_node = f(*args, **kwds)
       if end_node is None:
         raise ValueError("Cannot differentiate a function that returns None; "
                          "did you forget to return a value from {}?".format(
                              f.__name__))
     finally:
       tape.pop_tape(this_tape)
-    # Sorting variables by id, which is monotonically increasing in construction
-    # order. This ensures unique order across executions.
-    # TODO(josh11b): Move the sort to the C++ implementation in pywrap_tfe_src.cc.
-    variables = list(sorted(this_tape.watched_variables(),
-                            key=lambda v: v.handle._id))  # pylint: disable=protected-access
-    sources = [x.handle for x in variables]
-
-    if not sources:
+    # Note: variables are returned in construction order. This ensures unique
+    # order across executions.
+    variables = this_tape.watched_variables()
+    if not variables:
       raise ValueError("No trainable variables were accessed while the "
                        "function was being computed.")
+
+    sources = [v.handle for v in variables]
     grad = imperative_grad.imperative_grad(_default_vspace,
                                            this_tape,
                                            nest.flatten(end_node),
@@ -819,11 +799,8 @@ class GradientTape(object):
     self._push_tape()
 
   def watched_variables(self):
-    # Sorting variables by id, which is monotonically increasing in construction
-    # order. This ensures unique order across executions.
-    # TODO(josh11b): Move the sort to the C++ implementation in pywrap_tfe_src.cc.
-    return list(sorted(self._tape.watched_variables(),
-                       key=lambda v: v.handle._id))  # pylint: disable=protected-access
+    """Returns variables watched by this tape in order of construction."""
+    return self._tape.watched_variables()
 
   def gradient(self, target, sources, output_gradients=None):
     """Computes the gradient using operations recorded in context of this tape.
diff --git a/tensorflow/python/eager/function.py b/tensorflow/python/eager/function.py
index 120b298171b6272fdd63f76a538fa2c92a58616d..03393bcd46c4670fcac8f01f274137ee60df0bea 100644
--- a/tensorflow/python/eager/function.py
+++ b/tensorflow/python/eager/function.py
@@ -409,7 +409,15 @@ class GraphModeFunction(object):
         backward_outputs, in_gradients, output_shapes, attrs=self._attrs)
 
   def _backprop_call(self, args):
-    """Calls the wrapped function and records the result on a tape."""
+    """Calls the wrapped function and records the result on a tape.
+
+    (Only records results on a tape if the function has outputs)
+
+    Args:
+      args: The tensor inputs to the function.
+    Returns:
+      The call output.
+    """
     all_args = args + self._extra_inputs
     signature = self._forward_fdef.signature
     ctx = context.context()
@@ -420,6 +428,8 @@ class GraphModeFunction(object):
           inputs=all_args,
           attrs=None,
           ctx=ctx)
+      if not outputs:
+        return None
     else:
       g = ops.get_default_graph()
       g._add_function(self._forward_fdef)  # pylint: disable=protected-access
@@ -431,8 +441,9 @@ class GraphModeFunction(object):
           name="FunctionCall",
           compute_shapes=False)
       outputs = op.outputs
-      outputs = [outputs] if isinstance(
-          outputs, (ops.Tensor, type(None))) else list(outputs)
+      if not outputs:
+        return op
+      outputs = [outputs] if isinstance(outputs, ops.Tensor) else list(outputs)
       for i, s in enumerate(self._output_shapes):
         outputs[i].set_shape(s)
     real_outputs = outputs[:len(self._returns)]
@@ -494,7 +505,7 @@ class GraphModeFunction(object):
   def __call__(self, *args):
     """Executes the passed function in eager mode."""
     for v in self._variables:
-      if v._trainable:  # pylint: disable=protected-access
+      if v.trainable:
         tape.watch_variable(v)
 
     tensor_inputs = [x for x in nest.flatten(args) if isinstance(x, ops.Tensor)]
@@ -777,7 +788,7 @@ def defun(func=None, compiled=False):
   def h():
     return f(x, y)
 
-  assert h().numpy() == f(x, y)
+  assert (h().numpy() == f(x, y).numpy()).all()
 
   # `defun` automatically lifts variables out of the graphs it creates,
   # allowing you to compile the `call` methods of `tf.keras.layers.Layer` and
@@ -785,6 +796,7 @@ def defun(func=None, compiled=False):
   class MyModel(tf.keras.Model):
 
     def __init__(self, keep_probability=0.2):
+      super(MyModel, self).__init__()
       self.dense1 = tf.keras.layers.Dense(4, activation=tf.nn.relu)
       self.dense2 = tf.keras.layers.Dense(5, activation=tf.nn.softmax)
       self.keep_probability = keep_probability
@@ -804,7 +816,7 @@ def defun(func=None, compiled=False):
   # `defun`-compiled functions are differentiable.
   optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
   with tf.GradientTape() as tape:
-    outputs = model(inputs)
+    outputs = model(x)
   gradient = tape.gradient(outputs, model.trainable_variables)
   optimizer.apply_gradients((grad, var) for grad, var in zip(gradient,
                             model.trainable_variables))
@@ -840,6 +852,8 @@ def defun(func=None, compiled=False):
   import tensorflow as tf
   import numpy as np
 
+  tf.enable_eager_execution()
+
   matrix = tf.eye(5)
   # `matrix` is assumed to be a Tensor
   def add_noise():
@@ -862,6 +876,8 @@ def defun(func=None, compiled=False):
   ```python
   import tensorflow as tf
 
+  tf.enable_eager_execution()
+
   @tf.contrib.eager.defun
   def lossy_matmul(W, x, training=True):
     outputs = tf.matmul(W, x)
@@ -869,6 +885,9 @@ def defun(func=None, compiled=False):
       outputs = tf.nn.dropout(outputs, keep_probability=0.2)
     return outputs
 
+  W = tf.random_normal((3, 5))
+  x = tf.random_normal((5, 1))
+
   # Executes a graph that applies dropout.
   lossy_outputs = lossy_matmul(W, x, training=True)
 
@@ -919,14 +938,14 @@ def defun(func=None, compiled=False):
 
   # `fn` is a Python function, so x is created, initialized, and destroyed upon
   # every invocation
-  assert(fn().numpy() == fn().numpy() == 1.0)
+  assert fn().numpy() == fn().numpy() == 1.0
 
   compiled = tf.contrib.eager.defun(fn)
 
   # Compiling `fn` with `defun` hoists all variables outside of the generated
   # graph, so initialization happens exactly once.
-  assert(compiled().numpy() == 1.0)
-  assert(compiled().numpy() == 2.0)
+  assert compiled().numpy() == 1.0
+  assert compiled().numpy() == 2.0
   ```
 
   Finally, because each input signature is bound to a unique graph, if your
@@ -1207,6 +1226,9 @@ class AutomaticControlDependencies(object):
     # test that it works. Support while loops. Support init_scope escaping from
     # this.
     for op in new_operations:
+      # TODO(apassos) make this code safely support while loops.
+      if isinstance(op._control_flow_context, control_flow_ops.WhileContext):  # pylint: disable=protected-access
+        continue
       control_inputs = set()
       # Ensure stateful ops run
       if (op.type not in self._graph._registered_ops  # pylint: disable=protected-access
diff --git a/tensorflow/python/eager/function_test.py b/tensorflow/python/eager/function_test.py
index f53d6c26083cad8efd291a064393561c4bebfcfb..cfdbe5f079058906d18268c924819e0cc428bca7 100644
--- a/tensorflow/python/eager/function_test.py
+++ b/tensorflow/python/eager/function_test.py
@@ -349,6 +349,23 @@ class FunctionTest(test.TestCase):
 
     g(constant_op.constant(1.0))
 
+  def testNestedDefunWithNoOutputAndTapedInput(self):
+    three = resource_variable_ops.ResourceVariable(3.0, name='v')
+
+    @function.defun
+    def f(x):
+      # This function intentionally takes a taped variable as input,
+      # but does not return any values
+      math_ops.add(x, three)
+
+    @function.defun
+    def g(x):
+      tape.watch_variable(x)
+      y = math_ops.add(x, three)
+      f(y)
+
+    g(three)
+
   def testGradientTensorConversionWithDefun(self):
     three = resource_variable_ops.ResourceVariable(3.0, name='v')
 
diff --git a/tensorflow/python/eager/graph_callable.py b/tensorflow/python/eager/graph_callable.py
index d9ffcbd2036b9e312967012597ceea22e607d2a7..760a1485523798c6587e95804488a14b42a69bc0 100644
--- a/tensorflow/python/eager/graph_callable.py
+++ b/tensorflow/python/eager/graph_callable.py
@@ -202,7 +202,7 @@ class _InitializingFunctionObject(object):
         v.handle).numpy() for v in self._call_fn.variables]
     if all(x for x in initialized):
       for v in self._call_fn.variables:
-        if v._trainable:  # pylint: disable=protected-access
+        if v.trainable:
           tape.watch_variable(v)
       return self._call_fn(*args)
     elif all(not x for x in initialized):
diff --git a/tensorflow/python/eager/pywrap_tensor.cc b/tensorflow/python/eager/pywrap_tensor.cc
index a62af4a06c7d61216ffd3aae7fe3070363b99d94..ea604647faede0e5b86a17938d0a7c8a7621dec1 100644
--- a/tensorflow/python/eager/pywrap_tensor.cc
+++ b/tensorflow/python/eager/pywrap_tensor.cc
@@ -27,8 +27,15 @@ limitations under the License.
 #include "tensorflow/core/lib/strings/strcat.h"
 #include "tensorflow/python/lib/core/ndarray_tensor.h"
 
+// forward declare
+struct EagerTensor;
+
 namespace {
 
+// An instance of _EagerTensorProfiler that will receive callbacks about
+// events on eager tensors. This is set by TFE_Py_InitEagerTensor, if at all.
+PyObject* eager_tensor_profiler = nullptr;
+
 TFE_Context* GetContext(PyObject* ctx) {
   TFE_Context* context =
       reinterpret_cast<TFE_Context*>(PyCapsule_GetPointer(ctx, nullptr));
@@ -253,8 +260,45 @@ typedef struct EagerTensor {
   // to use a TF_Status object. However note that accesses to `status` are not
   // thread-safe.
   TF_Status* status;
+
+  PyObject* weakreflist; /* List of weak references */
 } EagerTensor;
 
+namespace {
+
+// Returns true on success - successfully invoked or no profiler registered.
+// Returns false if some error occurred.
+bool MaybeInvokeCreatedOnEagerTensorProfiler(EagerTensor* created_tensor) {
+  if (eager_tensor_profiler != nullptr) {
+#if PY_MAJOR_VERSION < 3
+    PyObject* created_method_name = PyString_InternFromString("created");
+#else
+    PyObject* created_method_name = PyUnicode_InternFromString("created");
+#endif
+    if (created_method_name == nullptr) {
+      return false;
+    }
+    PyObject* result = PyObject_CallMethodObjArgs(
+        eager_tensor_profiler, created_method_name, created_tensor, NULL);
+    if (result == nullptr) {
+      LOG(ERROR) << "Invoking created() on EagerTensor profiler failed";
+      // While we can potentially continue because the error is related to
+      // profiling, we choose to return an error because:
+      //  - If profiling is used, the user likely wants to stop execution on
+      //    profiling errors.
+      //  - Error in profiling code might have left some state in an invalid
+      //    form that can lead to an error later on. Better to fail fast.
+      Py_DECREF(created_method_name);
+      return false;
+    }
+    Py_DECREF(created_method_name);
+    Py_DECREF(result);
+  }
+  return true;
+}
+
+}  // namespace
+
 // tp_init for EagerTensor.
 int EagerTensor_init(EagerTensor* self, PyObject* args, PyObject* kwds) {
   self->id = get_uid();
@@ -266,6 +310,7 @@ int EagerTensor_init(EagerTensor* self, PyObject* args, PyObject* kwds) {
   Py_INCREF(Py_None);
   self->tensor_shape = Py_None;
   self->status = TF_NewStatus();
+  self->weakreflist = nullptr;
   PyObject* value;
   PyObject* context = nullptr;
   PyObject* device = nullptr;
@@ -344,11 +389,22 @@ int EagerTensor_init(EagerTensor* self, PyObject* args, PyObject* kwds) {
     if (handle == nullptr) return -1;
   }
   self->handle = handle.release();
+
+  if (!MaybeInvokeCreatedOnEagerTensorProfiler(self)) {
+    return -1;
+  }
+
   return 0;
 }
 
 // tp_dealloc for EagerTensor.
 void EagerTensor_dealloc(EagerTensor* self) {
+  // Clear weak references to self.
+  // Needs to happen before any actual destruction.
+  if (self->weakreflist != nullptr) {
+    PyObject_ClearWeakRefs((PyObject*)self);
+  }
+
   TF_DeleteStatus(self->status);
   Py_DECREF(self->handle_data);
   Py_DECREF(self->keras_mask);
@@ -574,43 +630,43 @@ static PyTypeObject _EagerTensorType = {
     // clang-format off
     PyVarObject_HEAD_INIT(nullptr, 0)
     // clang-format on
-    "EagerTensor",                   /* tp_name */
-    sizeof(EagerTensor),             /* tp_basicsize */
-    0,                               /* tp_itemsize */
-    (destructor)EagerTensor_dealloc, /* tp_dealloc */
-    nullptr,                         /* tp_print */
-    nullptr,                         /* tp_getattr */
-    nullptr,                         /* tp_setattr */
-    nullptr,                         /* tp_compare */
-    nullptr,                         /* tp_repr */
-    nullptr,                         /* tp_as_number */
-    nullptr,                         /* tp_as_sequence */
-    nullptr,                         /* tp_as_mapping */
-    nullptr,                         /* tp_hash */
-    nullptr,                         /* tp_call */
-    nullptr,                         /* tp_str */
-    nullptr,                         /* tp_getattro */
-    nullptr,                         /* tp_setattro */
-    nullptr,                         /* tp_as_buffer */
-    Py_TPFLAGS_DEFAULT,              /* tp_flags */
-    nullptr,                         /* tp_doc */
-    nullptr,                         /* tp_traverse */
-    nullptr,                         /* tp_clear */
-    nullptr,                         /* tp_richcompare */
-    0,                               /* tp_weaklistoffset */
-    nullptr,                         /* tp_iter */
-    nullptr,                         /* tp_iternext */
-    EagerTensor_methods,             /* tp_methods */
-    nullptr,                         /* tp_members */
-    EagerTensor_getseters,           /* tp_getset */
-    nullptr,                         /* tp_base */
-    nullptr,                         /* tp_dict */
-    nullptr,                         /* tp_descr_get */
-    nullptr,                         /* tp_descr_set */
-    0,                               /* tp_dictoffset */
-    (initproc)EagerTensor_init,      /* tp_init */
-    nullptr,                         /* tp_alloc */
-    nullptr,                         /* tp_new */
+    "EagerTensor",                      /* tp_name */
+    sizeof(EagerTensor),                /* tp_basicsize */
+    0,                                  /* tp_itemsize */
+    (destructor)EagerTensor_dealloc,    /* tp_dealloc */
+    nullptr,                            /* tp_print */
+    nullptr,                            /* tp_getattr */
+    nullptr,                            /* tp_setattr */
+    nullptr,                            /* tp_compare */
+    nullptr,                            /* tp_repr */
+    nullptr,                            /* tp_as_number */
+    nullptr,                            /* tp_as_sequence */
+    nullptr,                            /* tp_as_mapping */
+    nullptr,                            /* tp_hash */
+    nullptr,                            /* tp_call */
+    nullptr,                            /* tp_str */
+    nullptr,                            /* tp_getattro */
+    nullptr,                            /* tp_setattro */
+    nullptr,                            /* tp_as_buffer */
+    Py_TPFLAGS_DEFAULT,                 /* tp_flags */
+    nullptr,                            /* tp_doc */
+    nullptr,                            /* tp_traverse */
+    nullptr,                            /* tp_clear */
+    nullptr,                            /* tp_richcompare */
+    offsetof(EagerTensor, weakreflist), /* tp_weaklistoffset */
+    nullptr,                            /* tp_iter */
+    nullptr,                            /* tp_iternext */
+    EagerTensor_methods,                /* tp_methods */
+    nullptr,                            /* tp_members */
+    EagerTensor_getseters,              /* tp_getset */
+    nullptr,                            /* tp_base */
+    nullptr,                            /* tp_dict */
+    nullptr,                            /* tp_descr_get */
+    nullptr,                            /* tp_descr_set */
+    0,                                  /* tp_dictoffset */
+    (initproc)EagerTensor_init,         /* tp_init */
+    nullptr,                            /* tp_alloc */
+    nullptr,                            /* tp_new */
 };
 
 #endif
@@ -641,6 +697,11 @@ PyObject* EagerTensorFromHandle(TFE_TensorHandle* handle) {
     t->tensor_shape = Py_None;
     t->handle = handle;
     t->status = TF_NewStatus();
+    t->weakreflist = nullptr;
+
+    if (!MaybeInvokeCreatedOnEagerTensorProfiler(t)) {
+      return nullptr;
+    }
   }
   return reinterpret_cast<PyObject*>(t);
 }
@@ -720,6 +781,18 @@ PyObject* TFE_Py_InitEagerTensor(PyObject* base_class) {
   return reinterpret_cast<PyObject*>(EagerTensorType);
 }
 
+PyObject* TFE_Py_SetEagerTensorProfiler(PyObject* profiler) {
+  Py_XDECREF(eager_tensor_profiler);
+
+  if (profiler == Py_None) {
+    eager_tensor_profiler = nullptr;
+  } else {
+    eager_tensor_profiler = profiler;
+    Py_INCREF(eager_tensor_profiler);
+  }
+  Py_RETURN_NONE;
+}
+
 PyObject* TFE_Py_TensorShapeSlice(PyObject* tensors, int slice_dim) {
   if (!PyList_Check(tensors) && !PyTuple_Check(tensors)) {
     PyErr_SetString(PyExc_TypeError,
@@ -792,3 +865,37 @@ PyObject* TFE_Py_TensorShapeSlice(PyObject* tensors, int slice_dim) {
 
   return EagerTensorFromHandle(handle);
 }
+
+PyObject* TFE_Py_TensorShapeOnDevice(PyObject* tensor) {
+  if (!EagerTensor_CheckExact(tensor)) {
+    PyErr_SetString(
+        PyExc_TypeError,
+        tensorflow::strings::StrCat("Expected an EagerTensors but got type \"",
+                                    Py_TYPE(tensor)->tp_name, "\"")
+            .c_str());
+    return nullptr;
+  }
+  TFE_TensorHandle* handle = EagerTensor_Handle(tensor);
+
+  auto status = tensorflow::make_safe(TF_NewStatus());
+  TFE_TensorDebugInfo* debug_info =
+      TFE_TensorHandleTensorDebugInfo(handle, status.get());
+  if (TF_GetCode(status.get()) != TF_OK) {
+    PyErr_SetString(
+        PyExc_RuntimeError,
+        tensorflow::strings::StrCat("Error retrieving tensor's device shape: ",
+                                    TF_Message(status.get()))
+            .c_str());
+    return nullptr;
+  }
+
+  int rank = TFE_TensorDebugInfoOnDeviceNumDims(debug_info);
+  PyObject* shape = PyTuple_New(rank);
+  for (int i = 0; i < rank; ++i) {
+    tensorflow::int64 dim_size = TFE_TensorDebugInfoOnDeviceDim(debug_info, i);
+    PyTuple_SET_ITEM(shape, i, PyLong_FromLongLong(dim_size));
+  }
+  TFE_DeleteTensorDebugInfo(debug_info);
+
+  return shape;
+}
diff --git a/tensorflow/python/eager/pywrap_tfe.h b/tensorflow/python/eager/pywrap_tfe.h
index 9bc8b9bc726f9187f89855aefc15d7b4f6427e60..a916a75f00cafc077c422cc6aee6828d07e6188d 100644
--- a/tensorflow/python/eager/pywrap_tfe.h
+++ b/tensorflow/python/eager/pywrap_tfe.h
@@ -16,10 +16,11 @@ limitations under the License.
 #ifndef TENSORFLOW_PYTHON_EAGER_PYWRAP_TFE_H_
 #define TENSORFLOW_PYTHON_EAGER_PYWRAP_TFE_H_
 
+#include <Python.h>
+
 #include "tensorflow/c/eager/c_api.h"
 #include "tensorflow/core/lib/core/status.h"
 #include "tensorflow/core/lib/gtl/inlined_vector.h"
-#include <Python.h>
 
 typedef tensorflow::gtl::InlinedVector<TFE_TensorHandle*, 4>
     TFE_InputTensorHandles;
@@ -66,14 +67,15 @@ PyObject* TFE_Py_RegisterResourceVariableType(PyObject* e);
 // This function is not thread-safe.
 PyObject* TFE_Py_RegisterFallbackExceptionClass(PyObject* e);
 
-// Registers e as the backward_function_getter.
-// The registered function creates a backward function (a function that can
-// return the gradient of the inputs an op given the gradient of it's outputs).
-// The registered function will be passed the following arguments:
-//    op_name, attrs, num_inputs, op_inputs, op_outputs
+// Registers e as the gradient_function.
+// The registered function takes
+// (op_name, attrs, num_inputs, inputs, outputs, output_gradients) and returns
+// the input gradients. This function will not correctly be able to generate
+// gradients for functional ops - the gradients for those ops are calculated
+// through a different codepath (see function.py for additional information).
 //
 // This function is not thread-safe.
-PyObject* TFE_Py_RegisterBackwardFunctionGetter(PyObject* e);
+PyObject* TFE_Py_RegisterGradientFunction(PyObject* e);
 
 // Returns 0 if 'status' is TF_OK. Otherwise, raises an exception (using
 // `exception` if not nullptr, else using the class registered via
@@ -113,6 +115,15 @@ TFE_TensorHandle* EagerTensor_Handle(const PyObject* o);
 // newly created type, or nullptr on error.
 PyObject* TFE_Py_InitEagerTensor(PyObject* base_class);
 
+// Sets `profiler` as the current profiler to receive callbacks about events
+// on eager tensors. Currently, the only reported event is creation.
+// `profiler` is expected to have a `created(self, eager_tensor)` method that
+// takes the created tensor as its single argument.
+// Previous profiler, if any, is unset and will not receive any more
+// callbacks.
+// To unset the profiler, pass Py_None as the value of `profiler`.
+PyObject* TFE_Py_SetEagerTensorProfiler(PyObject* profiler);
+
 // Creates a new tape and adds it to the active set. `persistent` must be a
 // PyBool_Type, i.e either Py_True or Py_False
 PyObject* TFE_Py_TapeSetNew(PyObject* persistent);
@@ -186,7 +197,8 @@ PyObject* TFE_Py_RecordGradient(PyObject* op_name, PyObject* inputs,
                                 PyObject* attrs, PyObject* results,
                                 PyObject* name);
 
-// Returns the set of variables watched by the given tape.
+// Returns all variables watched by the given tape in the order those variables
+// were created.
 PyObject* TFE_Py_TapeWatchedVariables(PyObject* tape);
 
 // Returns an EagerTensor of dimension [len(`tensors`)] containing
@@ -201,4 +213,8 @@ PyObject* TFE_Py_TapeWatchedVariables(PyObject* tape);
 //   tensors in `tensors`.
 PyObject* TFE_Py_TensorShapeSlice(PyObject* tensors, int slice_dim);
 
+// Returns the shape of this tensor's on-device representation.
+// The shape is represented as a Python tuple of integers.
+PyObject* TFE_Py_TensorShapeOnDevice(PyObject* tensor);
+
 #endif  // TENSORFLOW_PYTHON_EAGER_PYWRAP_TFE_H_
diff --git a/tensorflow/python/eager/pywrap_tfe_src.cc b/tensorflow/python/eager/pywrap_tfe_src.cc
index 62deb41e9b5ed3706f9a427b6e5c674824a48ce7..6c9481c3af31a96308bb47f6e2aa35988f16f709 100644
--- a/tensorflow/python/eager/pywrap_tfe_src.cc
+++ b/tensorflow/python/eager/pywrap_tfe_src.cc
@@ -32,6 +32,7 @@ limitations under the License.
 #include "tensorflow/core/platform/types.h"
 #include "tensorflow/python/eager/pywrap_tensor.h"
 #include "tensorflow/python/lib/core/safe_ptr.h"
+#include "tensorflow/python/util/util.h"
 
 using tensorflow::string;
 using tensorflow::strings::Printf;
@@ -45,12 +46,14 @@ struct InputInfo {
   bool is_list = false;
 };
 
+// Takes in output gradients, returns input gradients.
+typedef std::function<PyObject*(PyObject*)> PyBackwardFunction;
+
 using AttrToInputsMap =
     tensorflow::gtl::FlatMap<string,
                              tensorflow::gtl::InlinedVector<InputInfo, 4>>;
 
-tensorflow::mutex all_attr_to_input_maps_lock(
-    tensorflow::LINKER_INITIALIZED);
+tensorflow::mutex all_attr_to_input_maps_lock(tensorflow::LINKER_INITIALIZED);
 tensorflow::gtl::FlatMap<string, AttrToInputsMap*>* GetAllAttrToInputsMaps() {
   static auto* all_attr_to_input_maps =
       new tensorflow::gtl::FlatMap<string, AttrToInputsMap*>;
@@ -641,8 +644,8 @@ PyObject* exception_class GUARDED_BY(exception_class_mutex) = nullptr;
 // Python subclass of Exception that is created to signal fallback.
 PyObject* fallback_exception_class = nullptr;
 
-// Python function that returns a backward_function.
-PyObject* backward_function_getter = nullptr;
+// Python function that returns input gradients given output gradients.
+PyObject* gradient_function = nullptr;
 
 PyTypeObject* resource_variable_type = nullptr;
 
@@ -735,26 +738,26 @@ PyObject* TFE_Py_RegisterFallbackExceptionClass(PyObject* e) {
   }
 }
 
-PyObject* TFE_Py_RegisterBackwardFunctionGetter(PyObject* e) {
-  if (backward_function_getter != nullptr) {
-    Py_DECREF(backward_function_getter);
+PyObject* TFE_Py_RegisterGradientFunction(PyObject* e) {
+  if (gradient_function != nullptr) {
+    Py_DECREF(gradient_function);
   }
   if (!PyCallable_Check(e)) {
-    backward_function_getter = nullptr;
+    gradient_function = nullptr;
     PyErr_SetString(PyExc_TypeError,
                     "TFE_Py_RegisterBackwardFunctionGetter: "
                     "Registered object should be function.");
     return nullptr;
   } else {
     Py_INCREF(e);
-    backward_function_getter = e;
+    gradient_function = e;
     Py_RETURN_NONE;
   }
 }
 
 void RaiseFallbackException(const char* message) {
   if (fallback_exception_class != nullptr) {
-    PyErr_SetObject(fallback_exception_class, Py_BuildValue("s", message));
+    PyErr_SetString(fallback_exception_class, message);
     return;
   }
 
@@ -772,8 +775,9 @@ int MaybeRaiseExceptionFromTFStatus(TF_Status* status, PyObject* exception) {
   if (exception == nullptr) {
     tensorflow::mutex_lock l(exception_class_mutex);
     if (exception_class != nullptr) {
-      PyErr_SetObject(exception_class,
-                      Py_BuildValue("si", msg, TF_GetCode(status)));
+      tensorflow::Safe_PyObjectPtr val(
+          Py_BuildValue("si", msg, TF_GetCode(status)));
+      PyErr_SetObject(exception_class, val.get());
       return -1;
     } else {
       exception = PyExc_RuntimeError;
@@ -791,7 +795,8 @@ int MaybeRaiseExceptionFromStatus(const tensorflow::Status& status,
   if (exception == nullptr) {
     tensorflow::mutex_lock l(exception_class_mutex);
     if (exception_class != nullptr) {
-      PyErr_SetObject(exception_class, Py_BuildValue("si", msg, status.code()));
+      tensorflow::Safe_PyObjectPtr val(Py_BuildValue("si", msg, status.code()));
+      PyErr_SetObject(exception_class, val.get());
       return -1;
     } else {
       exception = PyExc_RuntimeError;
@@ -869,41 +874,70 @@ static tensorflow::DataType FastTensorDtype(PyObject* tensor) {
 }
 
 class GradientTape
-    : public tensorflow::eager::GradientTape<PyObject, PyObject> {
+    : public tensorflow::eager::GradientTape<PyObject, PyBackwardFunction> {
  public:
   explicit GradientTape(bool persistent)
-      : tensorflow::eager::GradientTape<PyObject, PyObject>(persistent) {}
+      : tensorflow::eager::GradientTape<PyObject, PyBackwardFunction>(
+            persistent) {}
 
   virtual ~GradientTape() {
-    for (PyObject* v : watched_variables_) {
-      Py_DECREF(v);
+    for (const IdAndVariable& v : watched_variables_) {
+      Py_DECREF(v.variable);
     }
   }
 
   void WatchVariable(PyObject* v) {
-    auto insert_result = watched_variables_.insert(v);
-    if (insert_result.second) {
-      // Only increment the reference count if we aren't already watching this
-      // variable.
-      Py_INCREF(v);
-    }
-    PyObject* handle = PyObject_GetAttrString(v, "handle");
+    tensorflow::Safe_PyObjectPtr handle(PyObject_GetAttrString(v, "handle"));
     if (handle == nullptr) {
       return;
     }
-    tensorflow::int64 id = FastTensorId(handle);
-    Py_DECREF(handle);
+    tensorflow::int64 id = FastTensorId(handle.get());
+
     if (!PyErr_Occurred()) {
       this->Watch(id);
     }
+
+    tensorflow::mutex_lock l(watched_variables_mu_);
+    auto insert_result = watched_variables_.emplace(id, v);
+
+    if (insert_result.second) {
+      // Only increment the reference count if we aren't already watching this
+      // variable.
+      Py_INCREF(v);
+    }
   }
 
-  const std::unordered_set<PyObject*> WatchedVariables() {
-    return watched_variables_;
+  PyObject* GetVariablesAsPyTuple() {
+    tensorflow::mutex_lock l(watched_variables_mu_);
+    PyObject* result = PyTuple_New(watched_variables_.size());
+    Py_ssize_t pos = 0;
+    for (const IdAndVariable& id_and_variable : watched_variables_) {
+      PyTuple_SET_ITEM(result, pos++, id_and_variable.variable);
+      Py_INCREF(id_and_variable.variable);
+    }
+    return result;
   }
 
  private:
-  std::unordered_set<PyObject*> watched_variables_;
+  // We store an IdAndVariable in the map since the map needs to be locked
+  // during insert, but should not call back into python during insert to avoid
+  // deadlocking with the GIL.
+  struct IdAndVariable {
+    tensorflow::int64 id;
+    PyObject* variable;
+
+    IdAndVariable(tensorflow::int64 id, PyObject* variable)
+        : id(id), variable(variable) {}
+  };
+  struct CompareById {
+    bool operator()(const IdAndVariable& lhs, const IdAndVariable& rhs) {
+      return lhs.id < rhs.id;
+    }
+  };
+
+  tensorflow::mutex watched_variables_mu_;
+  std::set<IdAndVariable, CompareById> watched_variables_
+      GUARDED_BY(watched_variables_mu_);
 };
 
 typedef struct {
@@ -1195,13 +1229,7 @@ void TFE_Py_TapeSetWatchVariable(PyObject* variable) {
 }
 
 PyObject* TFE_Py_TapeWatchedVariables(PyObject* tape) {
-  const std::unordered_set<PyObject*>& watched_variables =
-      reinterpret_cast<TFE_Py_Tape*>(tape)->tape->WatchedVariables();
-  PyObject* result = PySet_New(nullptr);
-  for (PyObject* variable : watched_variables) {
-    PySet_Add(result, variable);
-  }
-  return result;
+  return reinterpret_cast<TFE_Py_Tape*>(tape)->tape->GetVariablesAsPyTuple();
 }
 
 namespace {
@@ -1225,11 +1253,13 @@ void TapeSetRecordOperation(
     PyObject* op_type, PyObject* output_tensors,
     const std::vector<tensorflow::int64>& input_ids,
     const std::vector<tensorflow::DataType>& input_dtypes,
-    PyObject* backward_function) {
+    const std::function<PyBackwardFunction*()>& backward_function_getter,
+    const std::function<void(PyBackwardFunction*)>& backward_function_killer) {
   std::vector<tensorflow::eager::TapeTensor> output_info;
   PyObject* seq = PySequence_Fast(output_tensors,
                                   "expected a sequence of integer tensor ids");
   int len = PySequence_Size(output_tensors);
+  if (PyErr_Occurred()) return;
   output_info.reserve(len);
   for (int i = 0; i < len; ++i) {
     output_info.push_back(
@@ -1258,10 +1288,10 @@ void TapeSetRecordOperation(
   }
 
   for (TFE_Py_Tape* tape : SafeTapeSet()) {
-    Py_INCREF(backward_function);
-    tape->tape->RecordOperation(
-        op_type_str, output_info, input_ids, input_dtypes, backward_function,
-        [backward_function]() { Py_DECREF(backward_function); });
+    auto* function = backward_function_getter();
+    tape->tape->RecordOperation(op_type_str, output_info, input_ids,
+                                input_dtypes, function,
+                                backward_function_killer);
   }
 }
 }  // namespace
@@ -1278,8 +1308,21 @@ void TFE_Py_TapeSetRecordOperation(PyObject* op_type, PyObject* output_tensors,
   std::vector<tensorflow::DataType> input_dtypes =
       MakeTensorDtypeList(input_tensors);
   if (PyErr_Occurred()) return;
-  TapeSetRecordOperation(op_type, output_tensors, input_ids, input_dtypes,
-                         backward_function);
+
+  TapeSetRecordOperation(
+      op_type, output_tensors, input_ids, input_dtypes,
+      [backward_function]() {
+        Py_INCREF(backward_function);
+        PyBackwardFunction* function =
+            new PyBackwardFunction([backward_function](PyObject* out_grads) {
+              return PyObject_CallObject(backward_function, out_grads);
+            });
+        return function;
+      },
+      [backward_function](PyBackwardFunction* py_backward_function) {
+        Py_DECREF(backward_function);
+        delete py_backward_function;
+      });
 }
 
 void TFE_Py_TapeSetDeleteTrace(tensorflow::int64 tensor_id) {
@@ -1288,7 +1331,8 @@ void TFE_Py_TapeSetDeleteTrace(tensorflow::int64 tensor_id) {
   }
 }
 
-class PyVSpace : public tensorflow::eager::VSpace<PyObject, PyObject> {
+class PyVSpace
+    : public tensorflow::eager::VSpace<PyObject, PyBackwardFunction> {
  public:
   explicit PyVSpace(PyObject* py_vspace) : py_vspace_(py_vspace) {}
 
@@ -1381,7 +1425,7 @@ class PyVSpace : public tensorflow::eager::VSpace<PyObject, PyObject> {
   }
 
   tensorflow::Status CallBackwardFunction(
-      PyObject* backward_function,
+      PyBackwardFunction* backward_function,
       tensorflow::gtl::ArraySlice<PyObject*> output_gradients,
       std::vector<PyObject*>* result) const final {
     PyObject* grads = PyTuple_New(output_gradients.size());
@@ -1394,8 +1438,7 @@ class PyVSpace : public tensorflow::eager::VSpace<PyObject, PyObject> {
                          reinterpret_cast<PyObject*>(output_gradients[i]));
       }
     }
-    PyObject* py_result = PyEval_CallObject(
-        reinterpret_cast<PyObject*>(backward_function), grads);
+    PyObject* py_result = (*backward_function)(grads);
     Py_DECREF(grads);
     if (py_result == nullptr) {
       return tensorflow::errors::Internal("gradient function threw exceptions");
@@ -1424,10 +1467,6 @@ class PyVSpace : public tensorflow::eager::VSpace<PyObject, PyObject> {
     return tensorflow::Status::OK();
   }
 
-  void ReleaseBackwardFunction(PyObject* backward_function) const final {
-    Py_DECREF(backward_function);
-  }
-
   void DeleteGradient(PyObject* tensor) const final { Py_XDECREF(tensor); }
 
  private:
@@ -1586,12 +1625,12 @@ bool CheckInputsOk(PyObject* seq, int start_index,
       for (Py_ssize_t j = 0; j < PySequence_Fast_GET_SIZE(item); j++) {
         PyObject* inner_item = PySequence_Fast_GET_ITEM(item, j);
         if (!CheckOneInput(inner_item)) {
-          VLOG(1)
-              << "Falling back to slow path for Op \"" << op_def.name()
-              << "\", Input \"" << op_def.input_arg(i).name() << "\", Index "
-              << j
-              << " since we expected an EagerTensor/ResourceVariable, but got "
-              << inner_item->ob_type->tp_name;
+          VLOG(1) << "Falling back to slow path for Op \"" << op_def.name()
+                  << "\", Input \"" << op_def.input_arg(i).name()
+                  << "\", Index " << j
+                  << " since we expected an EagerTensor/ResourceVariable, "
+                     "but got "
+                  << inner_item->ob_type->tp_name;
           return false;
         }
       }
@@ -1798,28 +1837,53 @@ PyObject* RecordGradient(PyObject* op_name, PyObject* inputs, PyObject* attrs,
   }
 
   PyObject* num_inputs = PyLong_FromLong(PySequence_Size(inputs));
-  PyObject* callback_args =
-      Py_BuildValue("OOOOO", op_name, attrs, num_inputs, op_inputs, op_outputs);
-
-  PyObject* backward_function =
-      PyObject_CallObject(backward_function_getter, callback_args);
-  Py_DECREF(callback_args);
-  if (backward_function == nullptr) return nullptr;
 
-  TapeSetRecordOperation(op_name, results, input_ids, input_dtypes,
-                         backward_function);
+  TapeSetRecordOperation(
+      op_name, results, input_ids, input_dtypes,
+      [op_name, attrs, num_inputs, op_inputs, op_outputs]() {
+        Py_INCREF(op_name);
+        Py_INCREF(attrs);
+        Py_INCREF(num_inputs);
+        Py_INCREF(op_inputs);
+        Py_INCREF(op_outputs);
+        PyBackwardFunction* function =
+            new PyBackwardFunction([op_name, attrs, num_inputs, op_inputs,
+                                    op_outputs](PyObject* output_grads) {
+              tensorflow::Safe_PyObjectPtr callback_args(
+                  Py_BuildValue("OOOOOO", op_name, attrs, num_inputs, op_inputs,
+                                op_outputs, output_grads));
+
+              tensorflow::Safe_PyObjectPtr result(
+                  PyObject_CallObject(gradient_function, callback_args.get()));
+
+              if (PyErr_Occurred()) return static_cast<PyObject*>(nullptr);
+
+              return tensorflow::swig::Flatten(result.get());
+            });
+        return function;
+      },
+      [op_name, attrs, num_inputs, op_inputs,
+       op_outputs](PyBackwardFunction* backward_function) {
+        Py_DECREF(op_name);
+        Py_DECREF(attrs);
+        Py_DECREF(num_inputs);
+        Py_DECREF(op_inputs);
+        Py_DECREF(op_outputs);
+
+        delete backward_function;
+      });
 
-  Py_DECREF(backward_function);
+  Py_DECREF(num_inputs);
 
   Py_RETURN_NONE;
 }
 
 void MaybeWatchVariable(PyObject* input) {
   DCHECK(CheckResourceVariable(input));
-  DCHECK(PyObject_HasAttrString(input, "_trainable"));
+  DCHECK(PyObject_HasAttrString(input, "trainable"));
 
   tensorflow::Safe_PyObjectPtr trainable(
-      PyObject_GetAttrString(input, "_trainable"));
+      PyObject_GetAttrString(input, "trainable"));
   if (trainable.get() == Py_False) return;
   TFE_Py_TapeSetWatchVariable(input);
 }
@@ -1869,8 +1933,10 @@ bool ReadVariableOp(const FastPathOpExecInfo& parent_op_exec_info,
     Py_INCREF(output->get());  // stay alive after since tuple steals.
     PyTuple_SET_ITEM(outputs.get(), 0, output->get());
 
-    if (!RecordGradient(GetPythonObjectFromString("ReadVariableOp"),
-                        inputs.get(), Py_None, outputs.get(), Py_None)) {
+    tensorflow::Safe_PyObjectPtr op_string(
+        GetPythonObjectFromString("ReadVariableOp"));
+    if (!RecordGradient(op_string.get(), inputs.get(), Py_None, outputs.get(),
+                        Py_None)) {
       return false;
     }
   }
@@ -1880,8 +1946,8 @@ bool ReadVariableOp(const FastPathOpExecInfo& parent_op_exec_info,
 
 // Supports only 2 cases at the moment:
 //  i) input is an EagerTensor
-//  ii) input is a ResourceVariable - in this case, the is_variable param is set
-//  to true.
+//  ii) input is a ResourceVariable - in this case, the is_variable param is
+//  set to true.
 //
 //  NOTE: dtype_hint_getter must *always* return a PyObject that can be
 //  decref'd. So if no hint is found, Py_RETURN_NONE (which correctly
diff --git a/tensorflow/python/eager/tensor_test.py b/tensorflow/python/eager/tensor_test.py
index b044b30231603b0265aa1ef0320e9f1cfb303724..626a4eb1eee9bda6c910c9dfa9cfff27b04444c1 100644
--- a/tensorflow/python/eager/tensor_test.py
+++ b/tensorflow/python/eager/tensor_test.py
@@ -292,6 +292,11 @@ class TFETensorUtilTest(test_util.TensorFlowTestCase):
   def testUnicode(self):
     self.assertEqual(constant_op.constant(u"asdf").numpy(), b"asdf")
 
+  def testFloatTensor(self):
+    self.assertEqual(dtypes.float64, _create_tensor(np.float64()).dtype)
+    self.assertEqual(dtypes.float32, _create_tensor(np.float32()).dtype)
+    self.assertEqual(dtypes.float32, _create_tensor(0.0).dtype)
+
   def testSliceDimOutOfRange(self):
     t1 = _create_tensor([[1, 2], [3, 4], [5, 6]], dtype=dtypes.int32)
     t2 = _create_tensor([1, 2], dtype=dtypes.int32)
diff --git a/tensorflow/python/estimator/BUILD b/tensorflow/python/estimator/BUILD
index 0754041f9eb50b429d02a06f9f0357c3431d3df5..9e716e81f40a2395b6ff04989f695dc5c0d91d15 100644
--- a/tensorflow/python/estimator/BUILD
+++ b/tensorflow/python/estimator/BUILD
@@ -12,6 +12,10 @@ py_library(
     name = "estimator_py",
     srcs = ["estimator_lib.py"],
     srcs_version = "PY2AND3",
+    visibility = [
+        "//tensorflow:__pkg__",
+        "//tensorflow:internal",
+    ],
     deps = [
         ":baseline",
         ":boosted_trees",
@@ -446,7 +450,26 @@ py_library(
     ],
     srcs_version = "PY2AND3",
     deps = [
+        "//tensorflow/python:platform",
+        "//tensorflow/python:training",
         "//tensorflow/python:util",
+        "//tensorflow/python/data",
+    ],
+)
+
+py_test(
+    name = "util_test",
+    srcs = ["util_test.py"],
+    srcs_version = "PY2AND3",
+    tags = ["notsan"],  # b/67510291
+    deps = [
+        ":util",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:constant_op",
+        "//tensorflow/python:training",
+        "//tensorflow/python/data",
+        "//third_party/py/numpy",
+        "@six_archive//:six",
     ],
 )
 
@@ -598,6 +621,7 @@ py_library(
     ],
     srcs_version = "PY2AND3",
     deps = [
+        ":util",
         "//tensorflow/python:array_ops",
         "//tensorflow/python:dtypes",
         "//tensorflow/python:framework_ops",
@@ -689,6 +713,7 @@ py_test(
         "//tensorflow/python:sparse_tensor",
         "//tensorflow/python:string_ops",
         "//tensorflow/python:training",
+        "//tensorflow/python:variables",
         "//tensorflow/python/feature_column",
         "//tensorflow/python/ops/losses",
         "//tensorflow/python/saved_model:signature_constants",
@@ -950,7 +975,10 @@ py_test(
     size = "large",
     srcs = ["keras_test.py"],
     srcs_version = "PY2AND3",
-    tags = ["notsan"],
+    tags = [
+        "no_windows",
+        "notsan",
+    ],
     deps = [
         ":keras",
         "//tensorflow/core:protos_all_py",
diff --git a/tensorflow/python/estimator/api/BUILD b/tensorflow/python/estimator/api/BUILD
new file mode 100644
index 0000000000000000000000000000000000000000..cddee9b8f30555da63a9aad1190a7644d02e5392
--- /dev/null
+++ b/tensorflow/python/estimator/api/BUILD
@@ -0,0 +1,17 @@
+package(
+    default_visibility = [
+        "//tensorflow:internal",
+    ],
+)
+
+licenses(["notice"])  # Apache 2.0
+
+load("//tensorflow/tools/api/generator:api_gen.bzl", "gen_api_init_files")
+load("//tensorflow/tools/api/generator:api_gen.bzl", "ESTIMATOR_API_INIT_FILES")
+
+gen_api_init_files(
+    name = "estimator_python_api_gen",
+    api_name = "estimator",
+    output_files = ESTIMATOR_API_INIT_FILES,
+    package = "tensorflow.python.estimator",
+)
diff --git a/tensorflow/python/estimator/canned/baseline.py b/tensorflow/python/estimator/canned/baseline.py
index 980c0573726945bcc80863319da98a220c86bd91..3c6816cb03f180fa0ab230cba38d45ef162d58db 100644
--- a/tensorflow/python/estimator/canned/baseline.py
+++ b/tensorflow/python/estimator/canned/baseline.py
@@ -59,7 +59,7 @@ from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import variable_scope
 from tensorflow.python.ops.losses import losses
 from tensorflow.python.training import training_util
-from tensorflow.python.util.tf_export import tf_export
+from tensorflow.python.util.tf_export import estimator_export
 
 # The default learning rate of 0.3 is a historical artifact of the initial
 # implementation, but seems a reasonable choice.
@@ -174,7 +174,7 @@ def _baseline_model_fn(features, labels, mode, head, optimizer,
       train_op_fn=train_op_fn)
 
 
-@tf_export('estimator.BaselineClassifier')
+@estimator_export('estimator.BaselineClassifier')
 class BaselineClassifier(estimator.Estimator):
   """A classifier that can establish a simple baseline.
 
@@ -277,7 +277,7 @@ class BaselineClassifier(estimator.Estimator):
         config=config)
 
 
-@tf_export('estimator.BaselineRegressor')
+@estimator_export('estimator.BaselineRegressor')
 class BaselineRegressor(estimator.Estimator):
   """A regressor that can establish a simple baseline.
 
diff --git a/tensorflow/python/estimator/canned/boosted_trees.py b/tensorflow/python/estimator/canned/boosted_trees.py
index 4e6010a162be6e6b7288900b428d7841db6e453c..6b54f51ca6f9cc7fed644383bec432cc6c8013bd 100644
--- a/tensorflow/python/estimator/canned/boosted_trees.py
+++ b/tensorflow/python/estimator/canned/boosted_trees.py
@@ -39,7 +39,7 @@ from tensorflow.python.summary import summary
 from tensorflow.python.training import distribute as distribute_lib
 from tensorflow.python.training import session_run_hook
 from tensorflow.python.training import training_util
-from tensorflow.python.util.tf_export import tf_export
+from tensorflow.python.util.tf_export import estimator_export
 
 # TODO(nponomareva): Reveal pruning params here.
 _TreeHParams = collections.namedtuple('TreeHParams', [
@@ -712,7 +712,7 @@ def _create_regression_head(label_dimension, weight_column=None):
   # pylint: enable=protected-access
 
 
-@tf_export('estimator.BoostedTreesClassifier')
+@estimator_export('estimator.BoostedTreesClassifier')
 class BoostedTreesClassifier(estimator.Estimator):
   """A Classifier for Tensorflow Boosted Trees models."""
 
@@ -830,7 +830,7 @@ class BoostedTreesClassifier(estimator.Estimator):
         model_fn=_model_fn, model_dir=model_dir, config=config)
 
 
-@tf_export('estimator.BoostedTreesRegressor')
+@estimator_export('estimator.BoostedTreesRegressor')
 class BoostedTreesRegressor(estimator.Estimator):
   """A Regressor for Tensorflow Boosted Trees models."""
 
diff --git a/tensorflow/python/estimator/canned/dnn.py b/tensorflow/python/estimator/canned/dnn.py
index 1feac36f356cc5b2615217b7ca69a79d2a781ca6..b924ad5df43fac68696abf5fe815b5e7fa20908c 100644
--- a/tensorflow/python/estimator/canned/dnn.py
+++ b/tensorflow/python/estimator/canned/dnn.py
@@ -32,7 +32,7 @@ from tensorflow.python.ops import partitioned_variables
 from tensorflow.python.ops import variable_scope
 from tensorflow.python.ops.losses import losses
 from tensorflow.python.summary import summary
-from tensorflow.python.util.tf_export import tf_export
+from tensorflow.python.util.tf_export import estimator_export
 
 # The default learning rate of 0.05 is a historical artifact of the initial
 # implementation, but seems a reasonable choice.
@@ -201,7 +201,7 @@ def _dnn_model_fn(features,
           logits=logits)
 
 
-@tf_export('estimator.DNNClassifier')
+@estimator_export('estimator.DNNClassifier')
 class DNNClassifier(estimator.Estimator):
   """A classifier for TensorFlow DNN models.
 
@@ -353,7 +353,7 @@ class DNNClassifier(estimator.Estimator):
         warm_start_from=warm_start_from)
 
 
-@tf_export('estimator.DNNRegressor')
+@estimator_export('estimator.DNNRegressor')
 class DNNRegressor(estimator.Estimator):
   """A regressor for TensorFlow DNN models.
 
diff --git a/tensorflow/python/estimator/canned/dnn_linear_combined.py b/tensorflow/python/estimator/canned/dnn_linear_combined.py
index 95efc0a028bc90911106a8947dcfc199ddd29444..64d81c46ce6f6cec3f095b8bde82a3039f3377fc 100644
--- a/tensorflow/python/estimator/canned/dnn_linear_combined.py
+++ b/tensorflow/python/estimator/canned/dnn_linear_combined.py
@@ -37,7 +37,7 @@ from tensorflow.python.summary import summary
 from tensorflow.python.training import distribute as distribute_lib
 from tensorflow.python.training import sync_replicas_optimizer
 from tensorflow.python.training import training_util
-from tensorflow.python.util.tf_export import tf_export
+from tensorflow.python.util.tf_export import estimator_export
 
 # The default learning rates are a historical artifact of the initial
 # implementation.
@@ -225,7 +225,7 @@ def _dnn_linear_combined_model_fn(features,
       logits=logits)
 
 
-@tf_export('estimator.DNNLinearCombinedClassifier')
+@estimator_export('estimator.DNNLinearCombinedClassifier')
 class DNNLinearCombinedClassifier(estimator.Estimator):
   """An estimator for TensorFlow Linear and DNN joined classification models.
 
@@ -406,7 +406,7 @@ class DNNLinearCombinedClassifier(estimator.Estimator):
         warm_start_from=warm_start_from)
 
 
-@tf_export('estimator.DNNLinearCombinedRegressor')
+@estimator_export('estimator.DNNLinearCombinedRegressor')
 class DNNLinearCombinedRegressor(estimator.Estimator):
   """An estimator for TensorFlow Linear and DNN joined models for regression.
 
diff --git a/tensorflow/python/estimator/canned/head.py b/tensorflow/python/estimator/canned/head.py
index 04fe4d97e40d60f7e5a5c9c2e9b40a08678f35d1..b74ef1015cc564c20370e17e94e3a09d460c4f85 100644
--- a/tensorflow/python/estimator/canned/head.py
+++ b/tensorflow/python/estimator/canned/head.py
@@ -873,6 +873,7 @@ class _MultiClassHeadWithSoftmaxCrossEntropyLoss(_Head):
         train_op = train_op_fn(regularized_training_loss)
       else:
         raise ValueError('train_op_fn and optimizer cannot both be None.')
+      train_op = _append_update_ops(train_op)
       # Only summarize mean_loss for SUM reduction to preserve backwards
       # compatibility. Otherwise skip it to avoid unnecessary computation.
       if self._loss_reduction == losses.Reduction.SUM:
@@ -1244,6 +1245,7 @@ class _BinaryLogisticHeadWithSigmoidCrossEntropyLoss(_Head):
         train_op = train_op_fn(regularized_training_loss)
       else:
         raise ValueError('train_op_fn and optimizer cannot both be None.')
+      train_op = _append_update_ops(train_op)
       # Only summarize mean_loss for SUM reduction to preserve backwards
       # compatibility. Otherwise skip it to avoid unnecessary computation.
       if self._loss_reduction == losses.Reduction.SUM:
@@ -1506,6 +1508,7 @@ class _RegressionHeadWithMeanSquaredErrorLoss(_Head):
         train_op = train_op_fn(regularized_training_loss)
       else:
         raise ValueError('train_op_fn and optimizer cannot both be None.')
+      train_op = _append_update_ops(train_op)
       # Only summarize mean_loss for SUM reduction to preserve backwards
       # compatibility. Otherwise skip it to avoid unnecessary computation.
       if self._loss_reduction == losses.Reduction.SUM:
@@ -1533,6 +1536,14 @@ class _RegressionHeadWithMeanSquaredErrorLoss(_Head):
         train_op=train_op)
 
 
+def _append_update_ops(train_op):
+  """Returns `train_op` appending `UPDATE_OPS` collection if present."""
+  update_ops = ops.get_collection(ops.GraphKeys.UPDATE_OPS)
+  if update_ops:
+    return control_flow_ops.group(train_op, *update_ops)
+  return train_op
+
+
 def _assert_range(labels, n_classes, message=None):
   with ops.name_scope(None, 'assert_range', (labels,)):
     assert_less = check_ops.assert_less_equal(
diff --git a/tensorflow/python/estimator/canned/head_test.py b/tensorflow/python/estimator/canned/head_test.py
index ecca3e8b0d82864c5fda6b94cc75db0521d5e8d3..08ce5ca8e833fdd88f9c45b668f0914fcc70acd0 100644
--- a/tensorflow/python/estimator/canned/head_test.py
+++ b/tensorflow/python/estimator/canned/head_test.py
@@ -39,6 +39,7 @@ from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import string_ops
+from tensorflow.python.ops import variables
 from tensorflow.python.ops.losses import losses
 from tensorflow.python.platform import test
 from tensorflow.python.saved_model import signature_constants
@@ -969,6 +970,35 @@ class MultiClassHeadWithSoftmaxCrossEntropyLoss(test.TestCase):
           six.b('{0:s}{1:.2f}'.format(expected_train_result, expected_loss)),
           train_result)
 
+  def test_train_with_update_ops(self):
+    n_classes = 3
+    head = head_lib._multi_class_head_with_softmax_cross_entropy_loss(n_classes)
+
+    with ops.Graph().as_default():
+      w = variables.Variable(1)
+      update_op = w.assign_add(1)
+      ops.add_to_collection(ops.GraphKeys.UPDATE_OPS, update_op)
+
+      t = variables.Variable('')
+      expected_train_result = b'my_train_op'
+      def _train_op_fn(loss):
+        del loss
+        return t.assign(expected_train_result)
+
+      spec = head.create_estimator_spec(
+          features={'x': np.array(((42,),), dtype=np.int32)},
+          mode=model_fn.ModeKeys.TRAIN,
+          logits=np.array(((10, 0, 0), (0, 10, 0),), dtype=np.float32),
+          labels=np.array(((1,), (1,)), dtype=np.int64),
+          train_op_fn=_train_op_fn)
+
+      with self.test_session() as sess:
+        _initialize_variables(self, spec.scaffold)
+        sess.run(spec.train_op)
+        w_value, t_value = sess.run([w, t])
+        self.assertEqual(2, w_value)
+        self.assertEqual(expected_train_result, t_value)
+
   def test_train_summaries_with_head_name(self):
     n_classes = 3
     head = head_lib._multi_class_head_with_softmax_cross_entropy_loss(
@@ -2102,6 +2132,34 @@ class BinaryLogisticHeadWithSigmoidCrossEntropyLossTest(test.TestCase):
       self.assertAllClose(expected_loss, loss)
       self.assertEqual(expected_train_result, train_result)
 
+  def test_train_with_update_ops(self):
+    head = head_lib._binary_logistic_head_with_sigmoid_cross_entropy_loss()
+
+    with ops.Graph().as_default():
+      w = variables.Variable(1)
+      update_op = w.assign_add(1)
+      ops.add_to_collection(ops.GraphKeys.UPDATE_OPS, update_op)
+
+      t = variables.Variable('')
+      expected_train_result = b'my_train_op'
+      def _train_op_fn(loss):
+        del loss
+        return t.assign(expected_train_result)
+
+      spec = head.create_estimator_spec(
+          features={'x': np.array(((42,),), dtype=np.int32)},
+          mode=model_fn.ModeKeys.TRAIN,
+          logits=np.array(((45,), (-41,),), dtype=np.float32),
+          labels=np.array(((1,), (1,),), dtype=np.float64),
+          train_op_fn=_train_op_fn)
+
+      with self.test_session() as sess:
+        _initialize_variables(self, spec.scaffold)
+        sess.run(spec.train_op)
+        w_value, t_value = sess.run([w, t])
+        self.assertEqual(2, w_value)
+        self.assertEqual(expected_train_result, t_value)
+
   def test_train_summaries_with_head_name(self):
     head = head_lib._binary_logistic_head_with_sigmoid_cross_entropy_loss(
         name='some_binary_head')
@@ -3278,6 +3336,34 @@ class RegressionHead(test.TestCase):
       self.assertAllClose(expected_loss, loss)
       self.assertEqual(expected_train_result, train_result)
 
+  def test_train_with_update_ops(self):
+    head = head_lib._regression_head()
+
+    with ops.Graph().as_default():
+      w = variables.Variable(1)
+      update_op = w.assign_add(1)
+      ops.add_to_collection(ops.GraphKeys.UPDATE_OPS, update_op)
+
+      t = variables.Variable('')
+      expected_train_result = b'my_train_op'
+      def _train_op_fn(loss):
+        del loss
+        return t.assign(expected_train_result)
+
+      spec = head.create_estimator_spec(
+          features={'x': np.array(((42,),), dtype=np.int32)},
+          mode=model_fn.ModeKeys.TRAIN,
+          logits=np.array(((45,), (41,),), dtype=np.float32),
+          labels=np.array(((43.,), (44.,),), dtype=np.float64),
+          train_op_fn=_train_op_fn)
+
+      with self.test_session() as sess:
+        _initialize_variables(self, spec.scaffold)
+        sess.run(spec.train_op)
+        w_value, t_value = sess.run([w, t])
+        self.assertEqual(2, w_value)
+        self.assertEqual(expected_train_result, t_value)
+
   def test_train_summaries_with_head_name(self):
     head = head_lib._regression_head(name='some_regression_head')
     self.assertEqual(1, head.logits_dimension)
diff --git a/tensorflow/python/estimator/canned/linear.py b/tensorflow/python/estimator/canned/linear.py
index 81657f0c01644524f1f706a0d42dd67e1345273e..705fc3ce06ee9b1c75b3d7adefdfb2ec8459335c 100644
--- a/tensorflow/python/estimator/canned/linear.py
+++ b/tensorflow/python/estimator/canned/linear.py
@@ -33,7 +33,7 @@ from tensorflow.python.ops import variable_scope
 from tensorflow.python.ops.losses import losses
 from tensorflow.python.summary import summary
 from tensorflow.python.training import ftrl
-from tensorflow.python.util.tf_export import tf_export
+from tensorflow.python.util.tf_export import estimator_export
 
 
 # The default learning rate of 0.2 is a historical artifact of the initial
@@ -164,7 +164,7 @@ def _linear_model_fn(features, labels, mode, head, feature_columns, optimizer,
         logits=logits)
 
 
-@tf_export('estimator.LinearClassifier')
+@estimator_export('estimator.LinearClassifier')
 class LinearClassifier(estimator.Estimator):
   """Linear classifier model.
 
@@ -317,7 +317,7 @@ class LinearClassifier(estimator.Estimator):
         warm_start_from=warm_start_from)
 
 
-@tf_export('estimator.LinearRegressor')
+@estimator_export('estimator.LinearRegressor')
 class LinearRegressor(estimator.Estimator):
   """An estimator for TensorFlow Linear regression problems.
 
diff --git a/tensorflow/python/estimator/canned/parsing_utils.py b/tensorflow/python/estimator/canned/parsing_utils.py
index 74e5e5a1bed80229c68daa3ff33ee7af4004bf47..1ae0f1e9f7781be84e71790146a90cf99a5e9831 100644
--- a/tensorflow/python/estimator/canned/parsing_utils.py
+++ b/tensorflow/python/estimator/canned/parsing_utils.py
@@ -23,10 +23,10 @@ import six
 from tensorflow.python.feature_column import feature_column as fc
 from tensorflow.python.framework import dtypes
 from tensorflow.python.ops import parsing_ops
-from tensorflow.python.util.tf_export import tf_export
+from tensorflow.python.util.tf_export import estimator_export
 
 
-@tf_export('estimator.classifier_parse_example_spec')
+@estimator_export('estimator.classifier_parse_example_spec')
 def classifier_parse_example_spec(feature_columns,
                                   label_key,
                                   label_dtype=dtypes.int64,
@@ -166,7 +166,7 @@ def classifier_parse_example_spec(feature_columns,
   return parsing_spec
 
 
-@tf_export('estimator.regressor_parse_example_spec')
+@estimator_export('estimator.regressor_parse_example_spec')
 def regressor_parse_example_spec(feature_columns,
                                  label_key,
                                  label_dtype=dtypes.float32,
diff --git a/tensorflow/python/estimator/estimator.py b/tensorflow/python/estimator/estimator.py
index f09c90bec87f173df18c0327b1b615fe6dbbda1f..41c25f1c73295aceb542887df4fb9164c17e770a 100644
--- a/tensorflow/python/estimator/estimator.py
+++ b/tensorflow/python/estimator/estimator.py
@@ -32,15 +32,17 @@ from tensorflow.core.framework import summary_pb2
 from tensorflow.core.protobuf import config_pb2
 from tensorflow.core.protobuf import rewriter_config_pb2
 from tensorflow.python.client import session as tf_session
-from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python.eager import context
 from tensorflow.python.estimator import model_fn as model_fn_lib
 from tensorflow.python.estimator import run_config
+from tensorflow.python.estimator import util as estimator_util
 from tensorflow.python.estimator.export import export as export_helpers
 from tensorflow.python.estimator.export import export_output
+from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import errors
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import random_seed
+from tensorflow.python.framework import tensor_util
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import metrics as metrics_lib
@@ -64,14 +66,14 @@ from tensorflow.python.util import compat
 from tensorflow.python.util import compat_internal
 from tensorflow.python.util import function_utils
 from tensorflow.python.util import nest
-from tensorflow.python.util.tf_export import tf_export
+from tensorflow.python.util.tf_export import estimator_export
 
 
 _VALID_MODEL_FN_ARGS = set(
     ['features', 'labels', 'mode', 'params', 'self', 'config'])
 
 
-@tf_export('estimator.Estimator')
+@estimator_export('estimator.Estimator')
 class Estimator(object):
   """Estimator class to train and evaluate TensorFlow models.
 
@@ -212,8 +214,8 @@ class Estimator(object):
     else:
       self._session_config = self._config.session_config
 
-    self._device_fn = self._config.device_fn or \
-                      _get_replica_device_setter(self._config)
+    self._device_fn = (
+        self._config.device_fn or _get_replica_device_setter(self._config))
 
     if model_fn is None:
       raise ValueError('model_fn must be provided to Estimator.')
@@ -302,7 +304,7 @@ class Estimator(object):
 
     Args:
       input_fn: A function that provides input data for training as minibatches.
-        See @{$get_started/premade_estimators#create_input_functions} for more
+        See @{$premade_estimators#create_input_functions} for more
         information. The function should construct and return one of
         the following:
 
@@ -398,7 +400,7 @@ class Estimator(object):
 
     Args:
       input_fn: A function that constructs the input data for evaluation.
-        See @{$get_started/premade_estimators#create_input_functions} for more
+        See @{$premade_estimators#create_input_functions} for more
         information. The function should construct and return one of
         the following:
 
@@ -477,7 +479,7 @@ class Estimator(object):
       input_fn: A function that constructs the features. Prediction continues
         until `input_fn` raises an end-of-input exception (`OutOfRangeError` or
         `StopIteration`).
-        See @{$get_started/premade_estimators#create_input_functions} for more
+        See @{$premade_estimators#create_input_functions} for more
         information. The function should construct and return one of
         the following:
 
@@ -564,7 +566,9 @@ class Estimator(object):
     allowed_overrides = set([
         '_call_input_fn', '_create_global_step',
         '_convert_train_steps_to_hooks', '_convert_eval_steps_to_hooks',
-        '_tf_api_names', '_validate_features_in_predict_input'
+        '_tf_api_names', '_estimator_api_names', '_estimator_api_constants',
+        '_validate_features_in_predict_input',
+        '_call_model_fn', '_add_meta_graph_for_mode'
     ])
     estimator_members = set([m for m in Estimator.__dict__.keys()
                              if not m.startswith('__')])
@@ -828,10 +832,14 @@ class Estimator(object):
       gfile.Rename(temp_export_dir, export_dir)
       return export_dir
 
-  def _add_meta_graph_for_mode(
-      self, builder, input_receiver_fn_map, checkpoint_path,
-      strip_default_attrs, save_variables=True,
-      mode=model_fn_lib.ModeKeys.PREDICT):
+  def _add_meta_graph_for_mode(self,
+                               builder,
+                               input_receiver_fn_map,
+                               checkpoint_path,
+                               strip_default_attrs,
+                               save_variables=True,
+                               mode=model_fn_lib.ModeKeys.PREDICT,
+                               export_tags=None):
     # pylint: disable=line-too-long
     """Loads variables and adds them along with a MetaGraphDef for saving.
 
@@ -850,9 +858,14 @@ class Estimator(object):
         True for the first call to this function, and the SavedModelBuilder will
         raise an error if that is not the case.
       mode: tf.estimator.ModeKeys value indicating which mode will be exported.
+      export_tags: The set of tags with which to save `MetaGraphDef`. If None,
+        a default set will be selected to matched the passed mode.
     """
     # pylint: enable=line-too-long
+    if export_tags is None:
+      export_tags = model_fn_lib.EXPORT_TAG_MAP[mode]
     input_receiver_fn = input_receiver_fn_map[mode]
+
     with ops.Graph().as_default() as g:
       self._create_and_assert_global_step(g)
       random_seed.set_random_seed(self._config.tf_random_seed)
@@ -877,17 +890,18 @@ class Estimator(object):
 
       with tf_session.Session(config=self._session_config) as session:
 
-        export_tags = model_fn_lib.EXPORT_TAG_MAP[mode]
-
         local_init_op = (
             estimator_spec.scaffold.local_init_op or
             monitored_session.Scaffold.default_local_init_op())
 
-        saver_for_restore = estimator_spec.scaffold.saver or saver.Saver(
-            sharded=True)
+        # This saver will be used both for restoring variables now,
+        # and in saving out the metagraph below. This ensures that any
+        # Custom Savers stored with the Scaffold are passed through to the
+        # SavedModel for restore later.
+        graph_saver = estimator_spec.scaffold.saver or saver.Saver(sharded=True)
 
         try:
-          saver_for_restore.restore(session, checkpoint_path)
+          graph_saver.restore(session, checkpoint_path)
         except errors.NotFoundError as e:
           msg = ('Could not load all requested variables from the checkpoint. '
                  'Please make sure your model_fn does not expect variables '
@@ -908,7 +922,8 @@ class Estimator(object):
             assets_collection=ops.get_collection(
                 ops.GraphKeys.ASSET_FILEPATHS),
             strip_default_attrs=strip_default_attrs,
-            legacy_init_op=local_init_op)
+            legacy_init_op=local_init_op,
+            saver=graph_saver)
 
         if save_variables:
           builder.add_meta_graph_and_variables(
@@ -956,17 +971,9 @@ class Estimator(object):
   def _get_features_from_input_fn(self, input_fn, mode):
     """Extracts the `features` from return values of `input_fn`."""
     result = self._call_input_fn(input_fn, mode)
-    input_hooks = []
-    if isinstance(result, dataset_ops.Dataset):
-      iterator = result.make_initializable_iterator()
-      input_hooks.append(_DatasetInitializerHook(iterator))
-      result = iterator.get_next()
-    if isinstance(result, (list, tuple)):
-      # Unconditionally drop the label (the second element of result).
-      result = result[0]
-
+    result, _, hooks = estimator_util.parse_input_fn_result(result)
     self._validate_features_in_predict_input(result)
-    return result, input_hooks
+    return result, hooks
 
   def _validate_features_in_predict_input(self, result):
     if not _has_dataset_or_queue_runner(result):
@@ -976,25 +983,13 @@ class Estimator(object):
 
   def _get_features_and_labels_from_input_fn(self, input_fn, mode):
     """Extracts the `features` and labels from return values of `input_fn`."""
-    input_hooks = []
     if self._distribution is not None and mode == model_fn_lib.ModeKeys.TRAIN:
       result = self._distribution.distribute_dataset(
           lambda: self._call_input_fn(input_fn, mode))
-      iterator = result.make_initializable_iterator()
-      input_hooks.append(_DatasetInitializerHook(iterator))
-      result = iterator.get_next()
     else:
       result = self._call_input_fn(input_fn, mode)
-      if isinstance(result, dataset_ops.Dataset):
-        iterator = result.make_initializable_iterator()
-        input_hooks.append(_DatasetInitializerHook(iterator))
-        result = iterator.get_next()
-    if isinstance(result, (list, tuple)):
-      if len(result) != 2:
-        raise ValueError(
-            'input_fn should return (features, labels) as a len 2 tuple.')
-      return result[0], result[1], input_hooks
-    return result, None, input_hooks
+
+    return estimator_util.parse_input_fn_result(result)
 
   def _extract_batch_length(self, preds_evaluated):
     """Extracts batch length of predictions."""
@@ -1059,9 +1054,15 @@ class Estimator(object):
       mode: ModeKeys
 
     Returns:
-      Either features or (features, labels) where features and labels are:
-        features - `Tensor` or dictionary of string feature name to `Tensor`.
-        labels - `Tensor` or dictionary of `Tensor` with labels.
+      The return value of the passed input_fn, which should be one of:
+
+        * A 'tf.data.Dataset' object: Outputs of `Dataset` object must be a
+            tuple (features, labels) with same constraints as below.
+        * A tuple (features, labels): Where `features` is a `Tensor` or a
+          dictionary of string feature name to `Tensor` and `labels` is a
+          `Tensor` or a dictionary of string label name to `Tensor`. Both
+          `features` and `labels` are consumed by `model_fn`. They should
+          satisfy the expectation of `model_fn` from inputs.
 
     Raises:
       ValueError: if input_fn takes invalid arguments.
@@ -1389,10 +1390,18 @@ class Estimator(object):
         hooks=all_hooks,
         config=self._session_config)
 
+    current_global_step = eval_results[ops.GraphKeys.GLOBAL_STEP]
+
     _write_dict_to_summary(
         output_dir=output_dir,
         dictionary=eval_results,
-        current_global_step=eval_results[ops.GraphKeys.GLOBAL_STEP])
+        current_global_step=current_global_step)
+
+    if checkpoint_path:
+      _write_checkpoint_path_to_summary(
+          output_dir=output_dir,
+          checkpoint_path=checkpoint_path,
+          current_global_step=current_global_step)
 
     return eval_results
 
@@ -1591,6 +1600,30 @@ def _write_dict_to_summary(output_dir,
   summary_writer.flush()
 
 
+def _write_checkpoint_path_to_summary(output_dir, checkpoint_path,
+                                      current_global_step):
+  """Writes `checkpoint_path` into summary file in the given output directory.
+
+  Args:
+    output_dir: `str`, directory to write the summary file in.
+    checkpoint_path: `str`, checkpoint file path to be written to summary file.
+    current_global_step: `int`, the current global step.
+  """
+
+  checkpoint_path_tag = 'checkpoint_path'
+
+  logging.info('Saving \'%s\' summary for global step %d: %s',
+               checkpoint_path_tag, current_global_step, checkpoint_path)
+  summary_proto = summary_pb2.Summary()
+  summary_proto.value.add(
+      tag=checkpoint_path_tag,
+      tensor=tensor_util.make_tensor_proto(
+          checkpoint_path, dtype=dtypes.string))
+  summary_writer = writer_cache.FileWriterCache.get(output_dir)
+  summary_writer.add_summary(summary_proto, current_global_step)
+  summary_writer.flush()
+
+
 def _has_dataset_or_queue_runner(maybe_tensor):
   """Returns True if TF dataset or QueueRunner has been used."""
   # Check TF dataset first. Here, we use a simple algorithm to check the top
@@ -1603,23 +1636,11 @@ def _has_dataset_or_queue_runner(maybe_tensor):
   return ops.get_default_graph().get_collection(ops.GraphKeys.QUEUE_RUNNERS)
 
 
-class _DatasetInitializerHook(training.SessionRunHook):
-
-  def __init__(self, iterator):
-    self._iterator = iterator
-
-  def begin(self):
-    self._initializer = self._iterator.initializer
-
-  def after_create_session(self, session, coord):
-    del coord
-    session.run(self._initializer)
-
 VocabInfo = warm_starting_util.VocabInfo  # pylint: disable=invalid-name
-tf_export('estimator.VocabInfo', allow_multiple_exports=True)(VocabInfo)
+estimator_export('estimator.VocabInfo')(VocabInfo)
 
 
-@tf_export('estimator.WarmStartSettings')
+@estimator_export('estimator.WarmStartSettings')
 class WarmStartSettings(
     collections.namedtuple('WarmStartSettings', [
         'ckpt_to_initialize_from',
@@ -1746,10 +1767,19 @@ class WarmStartSettings(
     ckpt_to_initialize_from: [Required] A string specifying the directory with
       checkpoint file(s) or path to checkpoint from which to warm-start the
       model parameters.
-    vars_to_warm_start: [Optional] A regular expression that captures which
-      variables to warm-start (see tf.get_collection).  Defaults to `'.*'`,
-      which warm-starts all variables.  If `None` is explicitly given, only
-      variables specified in `var_name_to_vocab_info` will be warm-started.
+    vars_to_warm_start: [Optional] One of the following:
+
+      - A regular expression (string) that captures which variables to
+        warm-start (see tf.get_collection).  This expression will only consider
+        variables in the TRAINABLE_VARIABLES collection.
+      - A list of Variables to warm-start.
+      - A list of strings, each representing a full variable name to warm-start.
+      - `None`, in which case only variables specified in
+        `var_name_to_vocab_info` will be warm-started.
+
+      Defaults to `'.*'`, which warm-starts all variables in the
+      TRAINABLE_VARIABLES collection.  Note that this excludes variables such as
+      accumulators and moving statistics from batch norm.
     var_name_to_vocab_info: [Optional] Dict of variable names (strings) to
       VocabInfo. The variable names should be "full" variables, not the names
       of the partitions.  If not explicitly provided, the variable is assumed to
diff --git a/tensorflow/python/estimator/estimator_test.py b/tensorflow/python/estimator/estimator_test.py
index 1b701899487ede96a90a6be299323fe278e5027f..a43b820f322d70093a5015457fea294e436daeea 100644
--- a/tensorflow/python/estimator/estimator_test.py
+++ b/tensorflow/python/estimator/estimator_test.py
@@ -39,6 +39,7 @@ from tensorflow.python.estimator.inputs import numpy_io
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
+from tensorflow.python.framework import tensor_util
 from tensorflow.python.framework import test_util
 from tensorflow.python.layers import layers
 from tensorflow.python.lib.io import file_io
@@ -81,21 +82,27 @@ def dummy_model_fn(features, labels, params):
   _, _, _ = features, labels, params
 
 
-def check_eventfile_for_keyword(keyword, dir_):
-  """Checks event files for the keyword."""
+def summaries_with_matching_keyword(keyword, dir_):
+  """Yields summary protos matching given keyword from event file."""
 
   writer_cache.FileWriterCache.clear()
 
-  # Get last Event written.
   event_paths = glob.glob(os.path.join(dir_, 'events*'))
-  last_event = None
-  for last_event in summary_iterator.summary_iterator(event_paths[-1]):
-    if last_event.summary is not None:
-      for value in last_event.summary.value:
+  for event in summary_iterator.summary_iterator(event_paths[-1]):
+    if event.summary is not None:
+      for value in event.summary.value:
         if keyword in value.tag:
-          return True
+          yield event.summary
+
+
+def check_eventfile_for_keyword(keyword, dir_):
+  """Checks event files for the keyword."""
+  return any(summaries_with_matching_keyword(keyword, dir_))
 
-  return False
+
+def get_mock_saver():
+  real_saver = saver.Saver()
+  return test.mock.Mock(wraps=real_saver, saver_def=real_saver.saver_def)
 
 
 class EstimatorInheritanceConstraintTest(test.TestCase):
@@ -814,6 +821,7 @@ class EstimatorTrainTest(test.TestCase):
 
   def test_saving_listeners_are_used(self):
     listener = test.mock.Mock(spec=training.CheckpointSaverListener)
+    listener.after_save.return_value = None
     est = estimator.Estimator(
         model_fn=model_fn_global_step_incrementer,
         config=run_config.RunConfig(save_checkpoints_steps=10))
@@ -1292,9 +1300,7 @@ class EstimatorEvaluateTest(test.TestCase):
     def _model_fn_scaffold(features, labels, mode):
       _, _ = features, labels
       variables.Variable(1., name='weight')
-      real_saver = saver.Saver()
-      self.mock_saver = test.mock.Mock(
-          wraps=real_saver, saver_def=real_saver.saver_def)
+      self.mock_saver = get_mock_saver()
       return model_fn_lib.EstimatorSpec(
           mode=mode,
           predictions=constant_op.constant([[1.]]),
@@ -1397,6 +1403,19 @@ class EstimatorEvaluateTest(test.TestCase):
           check_eventfile_for_keyword(key, est.eval_dir()),
           '{} should be part of reported summaries.'.format(key))
 
+    # Verify that evaluated checkpoint path is written to event file.
+    checkpoint_path_tag = 'checkpoint_path'
+    self.assertTrue(
+        check_eventfile_for_keyword(checkpoint_path_tag, est.eval_dir()),
+        '{} should be part of reported summaries.'.format(checkpoint_path_tag))
+
+    expected_tensor_proto = tensor_util.make_tensor_proto(
+        est.latest_checkpoint(), dtype=dtypes.string)
+    summaries = summaries_with_matching_keyword(checkpoint_path_tag,
+                                                est.eval_dir())
+    self.assertProtoEquals(expected_tensor_proto,
+                           next(summaries).value[0].tensor)
+
 
 class EstimatorPredictTest(test.TestCase):
 
@@ -1803,9 +1822,7 @@ class EstimatorPredictTest(test.TestCase):
     def _model_fn_scaffold(features, labels, mode):
       _, _ = features, labels
       variables.Variable(1., name='weight')
-      real_saver = saver.Saver()
-      self.mock_saver = test.mock.Mock(
-          wraps=real_saver, saver_def=real_saver.saver_def)
+      self.mock_saver = get_mock_saver()
       return model_fn_lib.EstimatorSpec(
           mode=mode,
           predictions=constant_op.constant([[1.]]),
@@ -2299,8 +2316,8 @@ class EstimatorExportTest(test.TestCase):
         graph_ops = [x.name for x in graph.get_operations()]
         self.assertTrue('input_example_tensor' in graph_ops)
         self.assertTrue('ParseExample/ParseExample' in graph_ops)
-        # Note that the SavedModel builder replaced the Saver with a new one
-        self.assertTrue('save_1/LookupTableImportV2' in graph_ops)
+        # The original saver is used to restore variables
+        self.assertTrue('save/LookupTableImportV2' in graph_ops)
 
     # Clean up.
     gfile.DeleteRecursively(tmpdir)
@@ -2465,9 +2482,7 @@ class EstimatorExportTest(test.TestCase):
     def _model_fn_scaffold(features, labels, mode):
       _, _ = features, labels
       variables.Variable(1., name='weight')
-      real_saver = saver.Saver()
-      self.mock_saver = test.mock.Mock(
-          wraps=real_saver, saver_def=real_saver.saver_def)
+      self.mock_saver = get_mock_saver()
       scores = constant_op.constant([3.])
       return model_fn_lib.EstimatorSpec(
           mode=mode,
@@ -2490,19 +2505,24 @@ class EstimatorExportTest(test.TestCase):
     est.export_savedmodel(export_dir_base, serving_input_receiver_fn)
 
     self.assertTrue(self.mock_saver.restore.called)
+    self.assertTrue(self.mock_saver.export_meta_graph.called)
+    self.assertTrue(self.mock_saver.save.called)
 
   def test_scaffold_is_used_for_saver_multiple_modes(self):
     tmpdir = tempfile.mkdtemp()
+    savers = {'predict_saver': None, 'train_saver': None}
 
     def _model_fn_scaffold(features, labels, mode):
       _, _ = features, labels
       variables.Variable(1., name='weight')
-      real_saver = saver.Saver()
-      self.mock_saver = test.mock.Mock(
-          wraps=real_saver, saver_def=real_saver.saver_def)
+
       scores = constant_op.constant([3.])
       if mode == model_fn_lib.ModeKeys.PREDICT:
-        scaffold = training.Scaffold(saver=self.mock_saver)
+        savers['predict_saver'] = get_mock_saver()
+        scaffold = training.Scaffold(saver=savers['predict_saver'])
+      elif mode == model_fn_lib.ModeKeys.TRAIN:
+        savers['train_saver'] = get_mock_saver()
+        scaffold = training.Scaffold(saver=savers['train_saver'])
       else:
         scaffold = training.Scaffold()
       return model_fn_lib.EstimatorSpec(
@@ -2526,7 +2546,13 @@ class EstimatorExportTest(test.TestCase):
         compat.as_bytes(tmpdir), compat.as_bytes('export'))
     est._export_all_saved_models(export_dir_base, input_receiver_fn_map)
 
-    self.assertTrue(self.mock_saver.restore.called)
+    self.assertTrue(savers['train_saver'].restore.called)
+    self.assertEqual(savers['train_saver'].export_meta_graph.call_count, 1)
+    self.assertEqual(savers['train_saver'].save.call_count, 1)
+
+    self.assertTrue(savers['predict_saver'].restore.called)
+    self.assertEqual(savers['predict_saver'].export_meta_graph.call_count, 1)
+    self.assertEqual(savers['predict_saver'].save.call_count, 0)
 
   def test_scaffold_is_used_for_local_init(self):
     tmpdir = tempfile.mkdtemp()
diff --git a/tensorflow/python/estimator/export/export.py b/tensorflow/python/estimator/export/export.py
index 48ae8cd49791c27a1e9674ed1be19d543d690b35..ca26341445e86ad554ac2e7cbf643c7775dd9825 100644
--- a/tensorflow/python/estimator/export/export.py
+++ b/tensorflow/python/estimator/export/export.py
@@ -34,7 +34,7 @@ from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.saved_model import signature_constants
 from tensorflow.python.saved_model import signature_def_utils
 from tensorflow.python.util import compat
-from tensorflow.python.util.tf_export import tf_export
+from tensorflow.python.util.tf_export import estimator_export
 
 _SINGLE_FEATURE_DEFAULT_NAME = 'feature'
 _SINGLE_RECEIVER_DEFAULT_NAME = 'input'
@@ -93,7 +93,7 @@ def _check_tensor_key(name, error_label='feature'):
     raise ValueError('{} keys must be strings: {}.'.format(error_label, name))
 
 
-@tf_export('estimator.export.ServingInputReceiver')
+@estimator_export('estimator.export.ServingInputReceiver')
 class ServingInputReceiver(
     collections.namedtuple(
         'ServingInputReceiver',
@@ -161,7 +161,7 @@ class ServingInputReceiver(
         receiver_tensors_alternatives=receiver_tensors_alternatives)
 
 
-@tf_export('estimator.export.TensorServingInputReceiver')
+@estimator_export('estimator.export.TensorServingInputReceiver')
 class TensorServingInputReceiver(
     collections.namedtuple(
         'TensorServingInputReceiver',
@@ -263,7 +263,7 @@ class SupervisedInputReceiver(
         receiver_tensors=receiver_tensors)
 
 
-@tf_export('estimator.export.build_parsing_serving_input_receiver_fn')
+@estimator_export('estimator.export.build_parsing_serving_input_receiver_fn')
 def build_parsing_serving_input_receiver_fn(feature_spec,
                                             default_batch_size=None):
   """Build a serving_input_receiver_fn expecting fed tf.Examples.
@@ -313,7 +313,7 @@ def _placeholders_from_receiver_tensors_dict(input_vals,
   }
 
 
-@tf_export('estimator.export.build_raw_serving_input_receiver_fn')
+@estimator_export('estimator.export.build_raw_serving_input_receiver_fn')
 def build_raw_serving_input_receiver_fn(features, default_batch_size=None):
   """Build a serving_input_receiver_fn expecting feature Tensors.
 
@@ -333,11 +333,7 @@ def build_raw_serving_input_receiver_fn(features, default_batch_size=None):
     """A serving_input_receiver_fn that expects features to be fed directly."""
     receiver_tensors = _placeholders_from_receiver_tensors_dict(
         features, default_batch_size)
-
-    # TODO(b/34885899): remove the unnecessary copy
-    # The features provided are simply the placeholders, but we defensively copy
-    # the dict because it may be mutated.
-    return ServingInputReceiver(receiver_tensors, receiver_tensors.copy())
+    return ServingInputReceiver(receiver_tensors, receiver_tensors)
 
   return serving_input_receiver_fn
 
@@ -404,6 +400,42 @@ def build_raw_supervised_input_receiver_fn(features,
   return supervised_input_receiver_fn
 
 
+def build_supervised_input_receiver_fn_from_input_fn(input_fn, **input_fn_args):
+  """Get a function that returns a SupervisedInputReceiver matching an input_fn.
+
+  Note that this function calls the input_fn in a local graph in order to
+  extract features and labels. Placeholders are then created from those
+  features and labels in the default graph.
+
+  Args:
+    input_fn: An Estimator input_fn, which is a function that returns one of:
+
+      * A 'tf.data.Dataset' object: Outputs of `Dataset` object must be a
+          tuple (features, labels) with same constraints as below.
+      * A tuple (features, labels): Where `features` is a `Tensor` or a
+        dictionary of string feature name to `Tensor` and `labels` is a
+        `Tensor` or a dictionary of string label name to `Tensor`. Both
+        `features` and `labels` are consumed by `model_fn`. They should
+        satisfy the expectation of `model_fn` from inputs.
+
+    **input_fn_args: set of kwargs to be passed to the input_fn. Note that
+      these will not be checked or validated here, and any errors raised by
+      the input_fn will be thrown to the top.
+
+  Returns:
+    A function taking no arguments that, when called, returns a
+    SupervisedInputReceiver. This function can be passed in as part of the
+    input_receiver_map when exporting SavedModels from Estimator with multiple
+    modes.
+  """
+  # Wrap the input_fn call in a graph to prevent sullying the default namespace
+  with ops.Graph().as_default():
+    result = input_fn(**input_fn_args)
+    features, labels, _ = util.parse_input_fn_result(result)
+  # Placeholders are created back in the default graph.
+  return build_raw_supervised_input_receiver_fn(features, labels)
+
+
 ### Below utilities are specific to SavedModel exports.
 
 
diff --git a/tensorflow/python/estimator/export/export_output.py b/tensorflow/python/estimator/export/export_output.py
index d387ea2940e7a450afe28b884c52113355c70fe6..6c26d299851eaea74f1e564d0fac217f238d76a2 100644
--- a/tensorflow/python/estimator/export/export_output.py
+++ b/tensorflow/python/estimator/export/export_output.py
@@ -26,10 +26,10 @@ import six
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.saved_model import signature_def_utils
-from tensorflow.python.util.tf_export import tf_export
+from tensorflow.python.util.tf_export import estimator_export
 
 
-@tf_export('estimator.export.ExportOutput')
+@estimator_export('estimator.export.ExportOutput')
 class ExportOutput(object):
   """Represents an output of a model that can be served.
 
@@ -100,7 +100,7 @@ class ExportOutput(object):
     return output_dict
 
 
-@tf_export('estimator.export.ClassificationOutput')
+@estimator_export('estimator.export.ClassificationOutput')
 class ClassificationOutput(ExportOutput):
   """Represents the output of a classification head.
 
@@ -169,7 +169,7 @@ class ClassificationOutput(ExportOutput):
         examples, self.classes, self.scores)
 
 
-@tf_export('estimator.export.RegressionOutput')
+@estimator_export('estimator.export.RegressionOutput')
 class RegressionOutput(ExportOutput):
   """Represents the output of a regression head."""
 
@@ -202,7 +202,7 @@ class RegressionOutput(ExportOutput):
     return signature_def_utils.regression_signature_def(examples, self.value)
 
 
-@tf_export('estimator.export.PredictOutput')
+@estimator_export('estimator.export.PredictOutput')
 class PredictOutput(ExportOutput):
   """Represents the output of a generic prediction head.
 
diff --git a/tensorflow/python/estimator/export/export_test.py b/tensorflow/python/estimator/export/export_test.py
index 0af587f2a850dff3ca2dc744e157ed5fbb329735..a7074712c25532a1d2156a11d2314150d9efabc1 100644
--- a/tensorflow/python/estimator/export/export_test.py
+++ b/tensorflow/python/estimator/export/export_test.py
@@ -459,6 +459,41 @@ class ExportTest(test_util.TensorFlowTestCase):
     with self.assertRaises(ValueError):
       export.build_raw_supervised_input_receiver_fn(features, labels)
 
+  def test_build_supervised_input_receiver_fn_from_input_fn(self):
+    def dummy_input_fn():
+      return ({"x": constant_op.constant([[1], [1]]),
+               "y": constant_op.constant(["hello", "goodbye"])},
+              constant_op.constant([[1], [1]]))
+
+    input_receiver_fn = export.build_supervised_input_receiver_fn_from_input_fn(
+        dummy_input_fn)
+
+    with ops.Graph().as_default():
+      input_receiver = input_receiver_fn()
+      self.assertEqual(set(["x", "y"]),
+                       set(input_receiver.features.keys()))
+      self.assertIsInstance(input_receiver.labels, ops.Tensor)
+      self.assertEqual(set(["x", "y", "label"]),
+                       set(input_receiver.receiver_tensors.keys()))
+
+  def test_build_supervised_input_receiver_fn_from_input_fn_args(self):
+    def dummy_input_fn(feature_key="x"):
+      return ({feature_key: constant_op.constant([[1], [1]]),
+               "y": constant_op.constant(["hello", "goodbye"])},
+              {"my_label": constant_op.constant([[1], [1]])})
+
+    input_receiver_fn = export.build_supervised_input_receiver_fn_from_input_fn(
+        dummy_input_fn, feature_key="z")
+
+    with ops.Graph().as_default():
+      input_receiver = input_receiver_fn()
+      self.assertEqual(set(["z", "y"]),
+                       set(input_receiver.features.keys()))
+      self.assertEqual(set(["my_label"]),
+                       set(input_receiver.labels.keys()))
+      self.assertEqual(set(["z", "y", "my_label"]),
+                       set(input_receiver.receiver_tensors.keys()))
+
   def test_build_all_signature_defs_without_receiver_alternatives(self):
     receiver_tensor = array_ops.placeholder(dtypes.string)
     output_1 = constant_op.constant([1.])
diff --git a/tensorflow/python/estimator/exporter.py b/tensorflow/python/estimator/exporter.py
index b775b19127c7fd8ee02142263791f25cd4b1f84f..b18212cfcda8f817f909672007c5b000db718232 100644
--- a/tensorflow/python/estimator/exporter.py
+++ b/tensorflow/python/estimator/exporter.py
@@ -28,10 +28,10 @@ from tensorflow.python.framework import errors_impl
 from tensorflow.python.platform import gfile
 from tensorflow.python.platform import tf_logging
 from tensorflow.python.summary import summary_iterator
-from tensorflow.python.util.tf_export import tf_export
+from tensorflow.python.util.tf_export import estimator_export
 
 
-@tf_export('estimator.Exporter')
+@estimator_export('estimator.Exporter')
 class Exporter(object):
   """A class representing a type of model export."""
 
@@ -172,7 +172,7 @@ def _verify_compare_fn_args(compare_fn):
                      (compare_fn, non_valid_args))
 
 
-@tf_export('estimator.BestExporter')
+@estimator_export('estimator.BestExporter')
 class BestExporter(Exporter):
   """This class exports the serving graph and checkpoints of the best models.
 
@@ -287,11 +287,11 @@ class BestExporter(Exporter):
              is_the_final_export):
     export_result = None
 
-    if self._model_dir != estimator.model_dir() and self._event_file_pattern:
+    if self._model_dir != estimator.model_dir and self._event_file_pattern:
       # Loads best metric from event files.
       tf_logging.info('Loading best metric from event files.')
 
-      self._model_dir = estimator.model_dir()
+      self._model_dir = estimator.model_dir
       full_event_file_pattern = os.path.join(self._model_dir,
                                              self._event_file_pattern)
       self._best_eval_result = self._get_best_eval_result(
@@ -360,13 +360,14 @@ class BestExporter(Exporter):
           for value in event.summary.value:
             if value.HasField('simple_value'):
               event_eval_result[value.tag] = value.simple_value
-          if best_eval_result is None or self._compare_fn(
-              best_eval_result, event_eval_result):
-            best_eval_result = event_eval_result
+          if event_eval_result:
+            if best_eval_result is None or self._compare_fn(
+                best_eval_result, event_eval_result):
+              best_eval_result = event_eval_result
     return best_eval_result
 
 
-@tf_export('estimator.FinalExporter')
+@estimator_export('estimator.FinalExporter')
 class FinalExporter(Exporter):
   """This class exports the serving graph and checkpoints in the end.
 
@@ -417,7 +418,7 @@ class FinalExporter(Exporter):
                                              is_the_final_export)
 
 
-@tf_export('estimator.LatestExporter')
+@estimator_export('estimator.LatestExporter')
 class LatestExporter(Exporter):
   """This class regularly exports the serving graph and checkpoints.
 
diff --git a/tensorflow/python/estimator/exporter_test.py b/tensorflow/python/estimator/exporter_test.py
index 053c5490711cf538b4b83599b7938a43d8eaec34..c4b006955c4128d6f40dcb30215beda761abbd49 100644
--- a/tensorflow/python/estimator/exporter_test.py
+++ b/tensorflow/python/estimator/exporter_test.py
@@ -62,7 +62,7 @@ class BestExporterTest(test.TestCase):
         exports_to_keep=5)
     estimator = test.mock.Mock(spec=estimator_lib.Estimator)
     estimator.export_savedmodel.return_value = "export_result_path"
-    estimator.model_dir.return_value = export_dir_base
+    estimator.model_dir = export_dir_base
 
     export_result = exporter.export(estimator, export_dir_base,
                                     "checkpoint_path", {}, False)
@@ -94,7 +94,7 @@ class BestExporterTest(test.TestCase):
         exports_to_keep=1)
     estimator = test.mock.Mock(spec=estimator_lib.Estimator)
     estimator.export_savedmodel.return_value = "export_result_path"
-    estimator.model_dir.return_value = export_dir_base
+    estimator.model_dir = export_dir_base
 
     export_result = exporter.export(estimator, export_dir_base,
                                     "checkpoint_path", {"loss": 0.5}, False)
@@ -133,7 +133,7 @@ class BestExporterTest(test.TestCase):
         exports_to_keep=1)
 
     estimator = test.mock.Mock(spec=estimator_lib.Estimator)
-    estimator.model_dir.return_value = export_dir_base
+    estimator.model_dir = export_dir_base
     estimator.export_savedmodel.return_value = "export_result_path"
 
     export_result = exporter.export(estimator, export_dir_base,
@@ -148,6 +148,40 @@ class BestExporterTest(test.TestCase):
                                     "checkpoint_path", {"loss": 20}, False)
     self.assertEqual(None, export_result)
 
+  def test_best_exporter_with_empty_event(self):
+
+    def _serving_input_receiver_fn():
+      pass
+
+    export_dir_base = tempfile.mkdtemp()
+    gfile.MkDir(export_dir_base)
+    gfile.MkDir(export_dir_base + "/export")
+    gfile.MkDir(export_dir_base + "/eval")
+
+    eval_dir_base = os.path.join(export_dir_base, "eval_continuous")
+    estimator_lib._write_dict_to_summary(eval_dir_base, {}, 1)
+    estimator_lib._write_dict_to_summary(eval_dir_base, {"loss": 60}, 2)
+
+    exporter = exporter_lib.BestExporter(
+        name="best_exporter",
+        serving_input_receiver_fn=_serving_input_receiver_fn,
+        event_file_pattern="eval_continuous/*.tfevents.*",
+        assets_extra={"from/path": "to/path"},
+        as_text=False,
+        exports_to_keep=1)
+
+    estimator = test.mock.Mock(spec=estimator_lib.Estimator)
+    estimator.model_dir = export_dir_base
+    estimator.export_savedmodel.return_value = "export_result_path"
+
+    export_result = exporter.export(estimator, export_dir_base,
+                                    "checkpoint_path", {"loss": 100}, False)
+    self.assertEqual(None, export_result)
+
+    export_result = exporter.export(estimator, export_dir_base,
+                                    "checkpoint_path", {"loss": 10}, False)
+    self.assertEqual("export_result_path", export_result)
+
   def test_garbage_collect_exports(self):
     export_dir_base = tempfile.mkdtemp()
     gfile.MkDir(export_dir_base)
@@ -172,7 +206,7 @@ class BestExporterTest(test.TestCase):
         serving_input_receiver_fn=_serving_input_receiver_fn,
         exports_to_keep=2)
     estimator = test.mock.Mock(spec=estimator_lib.Estimator)
-    estimator.model_dir.return_value = export_dir_base
+    estimator.model_dir = export_dir_base
     # Garbage collect all but the most recent 2 exports,
     # where recency is determined based on the timestamp directory names.
     exporter.export(estimator, export_dir_base, None, None, False)
diff --git a/tensorflow/python/estimator/inputs/numpy_io.py b/tensorflow/python/estimator/inputs/numpy_io.py
index eefc7c712d79d8d02632ccb928f7ab4af02b2596..a6cefdece21fa8ce944095cb5d3395f2b67142bd 100644
--- a/tensorflow/python/estimator/inputs/numpy_io.py
+++ b/tensorflow/python/estimator/inputs/numpy_io.py
@@ -24,7 +24,7 @@ import numpy as np
 from six import string_types
 
 from tensorflow.python.estimator.inputs.queues import feeding_functions
-from tensorflow.python.util.tf_export import tf_export
+from tensorflow.python.util.tf_export import estimator_export
 
 # Key name to pack the target into dict of `features`. See
 # `_get_unique_target_key` for details.
@@ -87,7 +87,7 @@ def _validate_and_convert_features(x):
   return ordered_dict_data
 
 
-@tf_export('estimator.inputs.numpy_input_fn')
+@estimator_export('estimator.inputs.numpy_input_fn')
 def numpy_input_fn(x,
                    y=None,
                    batch_size=128,
diff --git a/tensorflow/python/estimator/inputs/pandas_io.py b/tensorflow/python/estimator/inputs/pandas_io.py
index 1ed6ed4d846a47d70a72c1363567ce918bb007a6..57f8e5fd6aff366ad1f574d5dd40a6c457966b52 100644
--- a/tensorflow/python/estimator/inputs/pandas_io.py
+++ b/tensorflow/python/estimator/inputs/pandas_io.py
@@ -21,7 +21,7 @@ from __future__ import print_function
 
 import numpy as np
 from tensorflow.python.estimator.inputs.queues import feeding_functions
-from tensorflow.python.util.tf_export import tf_export
+from tensorflow.python.util.tf_export import estimator_export
 
 try:
   # pylint: disable=g-import-not-at-top
@@ -35,7 +35,7 @@ except ImportError:
   HAS_PANDAS = False
 
 
-@tf_export('estimator.inputs.pandas_input_fn')
+@estimator_export('estimator.inputs.pandas_input_fn')
 def pandas_input_fn(x,
                     y=None,
                     batch_size=128,
diff --git a/tensorflow/python/estimator/keras.py b/tensorflow/python/estimator/keras.py
index dd92a0403b9786dae0980992c0cdb7ecb901ea13..2f439f765e6811335667b62437f7aafc934904dc 100644
--- a/tensorflow/python/estimator/keras.py
+++ b/tensorflow/python/estimator/keras.py
@@ -20,7 +20,7 @@ from __future__ import division
 from __future__ import print_function
 
 import os
-
+import re
 from tensorflow.python.client import session
 from tensorflow.python.estimator import estimator as estimator_lib
 from tensorflow.python.estimator import export as export_lib
@@ -42,10 +42,12 @@ from tensorflow.python.ops import metrics as metrics_module
 from tensorflow.python.ops import variables as variables_module
 from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.saved_model import signature_constants
+from tensorflow.python.training import distribute as distribute_lib
 from tensorflow.python.training import saver as saver_lib
 from tensorflow.python.training import training_util
 from tensorflow.python.util.tf_export import tf_export
 
+
 _DEFAULT_SERVING_KEY = signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY
 
 
@@ -136,8 +138,9 @@ def _in_place_subclassed_model_reset(model):
   To "instantiate" an identical model in a new TF graph, we reuse the original
   model object, but we clear its state.
 
-  After calling this function on a model intance, you can use the model instance
-  as if it were a model clone (in particular you can use it in a new graph).
+  After calling this function on a model instance, you can use the model
+  instance as if it were a model clone (in particular you can use it in a new
+  graph).
 
   This method clears the state of the input model. It is thus destructive.
   However the original state can be restored fully by calling
@@ -220,7 +223,6 @@ def _in_place_subclassed_model_reset(model):
       for name in attributes_to_cache:
         attributes_cache[name] = getattr(model, name)
   model._original_attributes_cache = attributes_cache
-
   # Reset built state
   model.built = False
   model.inputs = None
@@ -340,8 +342,19 @@ def _create_keras_model_fn(keras_model, custom_objects=None):
     """model_fn for keras Estimator."""
     model = _clone_and_build_model(mode, keras_model, custom_objects, features,
                                    labels)
+    model_output_names = []
+    # We need to make sure that the output names of the last layer in the model
+    # is the same for each of the cloned models. This is required for mirrored
+    # strategy when we call regroup.
+    if distribute_lib.has_distribution_strategy():
+      for name in model.output_names:
+        name = re.compile(r'_\d$').sub('', name)
+        model_output_names.append(name)
+    else:
+      model_output_names = model.output_names
+
     # Get inputs to EstimatorSpec
-    predictions = dict(zip(model.output_names, model.outputs))
+    predictions = dict(zip(model_output_names, model.outputs))
 
     loss = None
     train_op = None
@@ -445,10 +458,14 @@ def model_to_estimator(keras_model=None,
   @{$programmers_guide/estimators$creating_estimators_from_keras_models}.
 
   Args:
-    keras_model: Keras model in memory.
-    keras_model_path: Directory to a keras model on disk.
+    keras_model: A compiled Keras model object. This argument is mutually
+      exclusive with `keras_model_path`.
+    keras_model_path: Path to a compiled Keras model saved on disk, in HDF5
+      format, which can be generated with the `save()` method of a Keras model.
+      This argument is mutually exclusive with `keras_model`.
     custom_objects: Dictionary for custom objects.
-    model_dir: Directory to save Estimator model parameters, graph and etc.
+    model_dir: Directory to save Estimator model parameters, graph, summary
+      files for TensorBoard, etc.
     config: Configuration object.
 
   Returns:
@@ -460,7 +477,7 @@ def model_to_estimator(keras_model=None,
     ValueError: if the keras_model_path is a GCS URI.
     ValueError: if keras_model has not been compiled.
   """
-  if (not keras_model) and (not keras_model_path):
+  if not (keras_model or keras_model_path):
     raise ValueError(
         'Either `keras_model` or `keras_model_path` needs to be provided.')
   if keras_model and keras_model_path:
@@ -482,8 +499,9 @@ def model_to_estimator(keras_model=None,
 
   if not hasattr(keras_model, 'optimizer') or not keras_model.optimizer:
     raise ValueError(
-        'The given keras model has not been compiled yet. Please compile first '
-        'before calling `model_to_estimator`.')
+        'The given keras model has not been compiled yet. '
+        'Please compile the model with `model.compile()` '
+        'before calling `model_to_estimator()`.')
 
   if isinstance(config, dict):
     config = run_config_lib.RunConfig(**config)
diff --git a/tensorflow/python/estimator/keras_test.py b/tensorflow/python/estimator/keras_test.py
index 6688a841300f04416e4099f3abc9a03858ac245e..5e094ae92bcf88a48d7afe3fb88bbced4971b587 100644
--- a/tensorflow/python/estimator/keras_test.py
+++ b/tensorflow/python/estimator/keras_test.py
@@ -31,10 +31,10 @@ from tensorflow.python.estimator import run_config as run_config_lib
 from tensorflow.python.estimator.inputs import numpy_io
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import test_util
-from tensorflow.python.keras import backend as K
 from tensorflow.python.keras import testing_utils
 from tensorflow.python.keras.applications import mobilenet
 from tensorflow.python.keras.optimizers import SGD
+from tensorflow.python.ops.parsing_ops import gen_parsing_ops
 from tensorflow.python.platform import gfile
 from tensorflow.python.platform import test
 from tensorflow.python.summary.writer import writer_cache
@@ -146,13 +146,13 @@ def randomize_io_type(array, name):
 def multi_inputs_multi_outputs_model():
   a = keras.layers.Input(shape=(16,), name='input_a')
   b = keras.layers.Input(shape=(16,), name='input_b')
-  m = keras.layers.Input(shape=(8,), dtype='bool', name='input_m')
+  m = keras.layers.Input(shape=(8,), dtype='string', name='input_m')
   dense = keras.layers.Dense(8, name='dense_1')
 
   a_2 = dense(a)
-  # Apply a mask
-  s_2 = keras.layers.Lambda(lambda k:
-                            K.switch(k[0], k[1], K.zeros_like(k[1])))([m, a_2])
+  # Read m
+  m_2 = keras.layers.Lambda(gen_parsing_ops.string_to_number)(m)
+  s_2 = keras.layers.Lambda(lambda k: k[0] * k[1])([m_2, a_2])
   b_2 = dense(b)
   merged = keras.layers.concatenate([s_2, b_2], name='merge')
   c = keras.layers.Dense(3, activation='softmax', name='dense_2')(merged)
@@ -372,13 +372,13 @@ class TestKerasEstimator(test_util.TensorFlowTestCase):
 
     def train_input_fn():
       input_dict = {'input_a': a_train, 'input_b': b_train,
-                    'input_m': input_m_train > 0}
+                    'input_m': input_m_train.astype(np.str)}
       output_dict = {'dense_2': c_train, 'dense_3': d_train}
       return input_dict, output_dict
 
     def eval_input_fn():
       input_dict = {'input_a': a_test, 'input_b': b_test,
-                    'input_m': input_m_test > 0}
+                    'input_m': input_m_test.astype(np.str)}
       output_dict = {'dense_2': c_test, 'dense_3': d_test}
       return input_dict, output_dict
 
diff --git a/tensorflow/python/estimator/model_fn.py b/tensorflow/python/estimator/model_fn.py
index 3edf9fe940b19c7a0b1a7c21a9674189faba5acb..c60c7f63bacc810e447bcafe954c55cb49ede7e0 100644
--- a/tensorflow/python/estimator/model_fn.py
+++ b/tensorflow/python/estimator/model_fn.py
@@ -32,10 +32,10 @@ from tensorflow.python.saved_model import tag_constants
 from tensorflow.python.training import monitored_session
 from tensorflow.python.training import session_run_hook
 from tensorflow.python.util import nest
-from tensorflow.python.util.tf_export import tf_export
+from tensorflow.python.util.tf_export import estimator_export
 
 
-@tf_export('estimator.ModeKeys')
+@estimator_export('estimator.ModeKeys')
 class ModeKeys(object):
   """Standard names for model modes.
 
@@ -62,7 +62,7 @@ EXPORT_TAG_MAP = {
 }
 
 
-@tf_export('estimator.EstimatorSpec')
+@estimator_export('estimator.EstimatorSpec')
 class EstimatorSpec(
     collections.namedtuple('EstimatorSpec', [
         'mode', 'predictions', 'loss', 'train_op', 'eval_metric_ops',
diff --git a/tensorflow/python/estimator/run_config.py b/tensorflow/python/estimator/run_config.py
index c7707be8397d950f4e5993b678c215128d3d8b9f..b948ce96e0c09c0537619366403658408cf17895 100644
--- a/tensorflow/python/estimator/run_config.py
+++ b/tensorflow/python/estimator/run_config.py
@@ -29,7 +29,7 @@ from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.training import server_lib
 from tensorflow.python.util import compat_internal
 from tensorflow.python.util import function_utils
-from tensorflow.python.util.tf_export import tf_export
+from tensorflow.python.util.tf_export import estimator_export
 
 
 _USE_DEFAULT = object()
@@ -296,7 +296,7 @@ class TaskType(object):
   EVALUATOR = 'evaluator'
 
 
-@tf_export('estimator.RunConfig')
+@estimator_export('estimator.RunConfig')
 class RunConfig(object):
   """This class specifies the configurations for an `Estimator` run."""
 
diff --git a/tensorflow/python/estimator/training.py b/tensorflow/python/estimator/training.py
index a187b12b24c4e588e3aef8624a09a092a4b62b95..1572af579b964e8cf5cdb3d5d11a56d80b965b5c 100644
--- a/tensorflow/python/estimator/training.py
+++ b/tensorflow/python/estimator/training.py
@@ -35,7 +35,7 @@ from tensorflow.python.training import basic_session_run_hooks
 from tensorflow.python.training import server_lib
 from tensorflow.python.training import session_run_hook
 from tensorflow.python.util import compat
-from tensorflow.python.util.tf_export import tf_export
+from tensorflow.python.util.tf_export import estimator_export
 
 _MAX_DELAY_SECS = 60
 _DELAY_SECS_PER_WORKER = 5
@@ -115,7 +115,7 @@ def _is_google_env():
   return tf_config.get(_ENVIRONMENT_KEY) == _ENVIRONMENT_GOOGLE_VALUE
 
 
-@tf_export('estimator.TrainSpec')
+@estimator_export('estimator.TrainSpec')
 class TrainSpec(
     collections.namedtuple('TrainSpec', ['input_fn', 'max_steps', 'hooks'])):
   """Configuration for the "train" part for the `train_and_evaluate` call.
@@ -129,7 +129,7 @@ class TrainSpec(
 
     Args:
       input_fn: A function that provides input data for training as minibatches.
-        See @{$get_started/premade_estimators#create_input_functions} for more
+        See @{$premade_estimators#create_input_functions} for more
         information. The function should construct and return one of
         the following:
           * A 'tf.data.Dataset' object: Outputs of `Dataset` object must be a
@@ -167,7 +167,7 @@ class TrainSpec(
         cls, input_fn=input_fn, max_steps=max_steps, hooks=hooks)
 
 
-@tf_export('estimator.EvalSpec')
+@estimator_export('estimator.EvalSpec')
 class EvalSpec(
     collections.namedtuple('EvalSpec', [
         'input_fn', 'steps', 'name', 'hooks', 'exporters', 'start_delay_secs',
@@ -193,7 +193,7 @@ class EvalSpec(
 
     Args:
       input_fn: A function that constructs the input data for evaluation.
-        See @{$get_started/premade_estimators#create_input_functions} for more
+        See @{$premade_estimators#create_input_functions} for more
         information. The function should construct and return one of
         the following:
           * A 'tf.data.Dataset' object: Outputs of `Dataset` object must be a
@@ -263,7 +263,7 @@ class EvalSpec(
         throttle_secs=throttle_secs)
 
 
-@tf_export('estimator.train_and_evaluate')
+@estimator_export('estimator.train_and_evaluate')
 def train_and_evaluate(estimator, train_spec, eval_spec):
   """Train and evaluate the `estimator`.
 
@@ -295,6 +295,7 @@ def train_and_evaluate(estimator, train_spec, eval_spec):
   model will be trained with three epochs of training data instead of one epoch.
 
   Example of local (non-distributed) training:
+
   ```python
   # Set up feature columns.
   categorial_feature_a = categorial_column_with_hash_bucket(...)
@@ -339,12 +340,14 @@ def train_and_evaluate(estimator, train_spec, eval_spec):
 
   Setting environment variable depends on the platform. For example, on Linux,
   it can be done as follows (`$` is the shell prompt):
+
   ```
   $ TF_CONFIG='<replace_with_real_content>' python train_model.py
   ```
 
   For the content in `TF_CONFIG`, assume that the training cluster spec looks
   like:
+
   ```
   cluster = {"chief": ["host0:2222"],
              "worker": ["host1:2222", "host2:2222", "host3:2222"],
@@ -352,6 +355,7 @@ def train_and_evaluate(estimator, train_spec, eval_spec):
   ```
 
   Example of `TF_CONFIG` for chief training worker (must have one and only one):
+
   ```
   # This should be a JSON string, which is set as environment variable. Usually
   # the cluster manager handles that.
@@ -371,6 +375,7 @@ def train_and_evaluate(estimator, train_spec, eval_spec):
 
   Example of `TF_CONFIG` for non-chief training worker (optional, could be
   multiple):
+
   ```
   # This should be a JSON string, which is set as environment variable. Usually
   # the cluster manager handles that.
@@ -387,6 +392,7 @@ def train_and_evaluate(estimator, train_spec, eval_spec):
   for non-chief training workers.
 
   Example of `TF_CONFIG` for parameter server, aka ps (could be multiple):
+
   ```
   # This should be a JSON string, which is set as environment variable. Usually
   # the cluster manager handles that.
@@ -405,6 +411,7 @@ def train_and_evaluate(estimator, train_spec, eval_spec):
   Example of `TF_CONFIG` for evaluator task. Evaluator is a special task that is
   not part of the training cluster. There could be only one. It is used for
   model evaluation.
+
   ```
   # This should be a JSON string, which is set as environment variable. Usually
   # the cluster manager handles that.
@@ -444,7 +451,7 @@ def train_and_evaluate(estimator, train_spec, eval_spec):
         'For distributed training, there can only be one `evaluator` task '
         '(with task id 0).  Given task id {}'.format(config.task_id))
 
-  executor.run()
+  return executor.run()
 
 
 class _StopAtSecsHook(session_run_hook.SessionRunHook):
diff --git a/tensorflow/python/estimator/util.py b/tensorflow/python/estimator/util.py
index e4e1d37f74330c9bfd48adff95e6409793714729..924ca309ff0455d3bb06be61ce65bb0a61e84fb0 100644
--- a/tensorflow/python/estimator/util.py
+++ b/tensorflow/python/estimator/util.py
@@ -24,6 +24,7 @@ import time
 
 from tensorflow.python.platform import gfile
 from tensorflow.python.platform import tf_logging as logging
+from tensorflow.python.training import training
 from tensorflow.python.util import compat
 from tensorflow.python.util import function_utils
 
@@ -72,3 +73,59 @@ def get_timestamped_dir(dir_base):
         result_dir, attempts, MAX_DIRECTORY_CREATION_ATTEMPTS))
   raise RuntimeError('Failed to obtain a unique export directory name after '
                      '{} attempts.'.format(MAX_DIRECTORY_CREATION_ATTEMPTS))
+
+
+def parse_input_fn_result(result):
+  """Gets features, labels, and hooks from the result of an Estimator input_fn.
+
+  Args:
+    result: output of an input_fn to an estimator, which should be one of:
+
+      * A 'tf.data.Dataset' object: Outputs of `Dataset` object must be a
+          tuple (features, labels) with same constraints as below.
+      * A tuple (features, labels): Where `features` is a `Tensor` or a
+        dictionary of string feature name to `Tensor` and `labels` is a
+        `Tensor` or a dictionary of string label name to `Tensor`. Both
+        `features` and `labels` are consumed by `model_fn`. They should
+        satisfy the expectation of `model_fn` from inputs.
+
+  Returns:
+    Tuple of features, labels, and input_hooks, where features are as described
+    above, labels are as described above or None, and input_hooks are a list
+    of SessionRunHooks to be included when running.
+
+  Raises:
+    ValueError: if the result is a list or tuple of length != 2.
+  """
+  input_hooks = []
+  try:
+    # We can't just check whether this is a tf.data.Dataset instance here,
+    # as this is plausibly a PerDeviceDataset. Try treating as a dataset first.
+    iterator = result.make_initializable_iterator()
+  except AttributeError:
+    # Not a dataset or dataset-like-object. Move along.
+    pass
+  else:
+    input_hooks.append(_DatasetInitializerHook(iterator))
+    result = iterator.get_next()
+
+  if isinstance(result, (list, tuple)):
+    if len(result) != 2:
+      raise ValueError(
+          'input_fn should return (features, labels) as a len 2 tuple.')
+    return result[0], result[1], input_hooks
+  return result, None, input_hooks
+
+
+class _DatasetInitializerHook(training.SessionRunHook):
+  """Creates a SessionRunHook that initializes the passed iterator."""
+
+  def __init__(self, iterator):
+    self._iterator = iterator
+
+  def begin(self):
+    self._initializer = self._iterator.initializer
+
+  def after_create_session(self, session, coord):
+    del coord
+    session.run(self._initializer)
diff --git a/tensorflow/python/estimator/util_test.py b/tensorflow/python/estimator/util_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..d7e0610779023199aa659b119d366ba69ac2b15f
--- /dev/null
+++ b/tensorflow/python/estimator/util_test.py
@@ -0,0 +1,102 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+
+"""Tests for util.py."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.python.data.ops import dataset_ops
+from tensorflow.python.estimator import util
+from tensorflow.python.framework import constant_op
+from tensorflow.python.platform import test
+from tensorflow.python.training import training
+
+
+class UtilTest(test.TestCase):
+  """Tests for miscellaneous Estimator utils."""
+
+  def test_parse_input_fn_result_tuple(self):
+    def _input_fn():
+      features = constant_op.constant(np.arange(100))
+      labels = constant_op.constant(np.arange(100, 200))
+      return features, labels
+
+    features, labels, hooks = util.parse_input_fn_result(_input_fn())
+
+    with self.test_session() as sess:
+      vals = sess.run([features, labels])
+
+    self.assertAllEqual(vals[0], np.arange(100))
+    self.assertAllEqual(vals[1], np.arange(100, 200))
+    self.assertEqual(hooks, [])
+
+  def test_parse_input_fn_result_dataset(self):
+    def _input_fn():
+      features = np.expand_dims(np.arange(100), 0)
+      labels = np.expand_dims(np.arange(100, 200), 0)
+      return dataset_ops.Dataset.from_tensor_slices((features, labels))
+
+    features, labels, hooks = util.parse_input_fn_result(_input_fn())
+
+    with training.MonitoredSession(hooks=hooks) as sess:
+      vals = sess.run([features, labels])
+
+    self.assertAllEqual(vals[0], np.arange(100))
+    self.assertAllEqual(vals[1], np.arange(100, 200))
+    self.assertIsInstance(hooks[0], util._DatasetInitializerHook)
+
+  def test_parse_input_fn_result_features_only(self):
+    def _input_fn():
+      return constant_op.constant(np.arange(100))
+
+    features, labels, hooks = util.parse_input_fn_result(_input_fn())
+
+    with self.test_session() as sess:
+      vals = sess.run([features])
+
+    self.assertAllEqual(vals[0], np.arange(100))
+    self.assertEqual(labels, None)
+    self.assertEqual(hooks, [])
+
+  def test_parse_input_fn_result_features_only_dataset(self):
+    def _input_fn():
+      features = np.expand_dims(np.arange(100), 0)
+      return dataset_ops.Dataset.from_tensor_slices(features)
+
+    features, labels, hooks = util.parse_input_fn_result(_input_fn())
+
+    with training.MonitoredSession(hooks=hooks) as sess:
+      vals = sess.run([features])
+
+    self.assertAllEqual(vals[0], np.arange(100))
+    self.assertEqual(labels, None)
+    self.assertIsInstance(hooks[0], util._DatasetInitializerHook)
+
+  def test_parse_input_fn_result_invalid(self):
+    def _input_fn():
+      features = np.expand_dims(np.arange(100), 0)
+      labels = np.expand_dims(np.arange(100, 200), 0)
+      return dataset_ops.Dataset.from_tensor_slices((features, labels, labels))
+
+    with self.assertRaisesRegexp(ValueError, 'input_fn should return'):
+      util.parse_input_fn_result(_input_fn())
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/python/feature_column/feature_column.py b/tensorflow/python/feature_column/feature_column.py
index ffcb9990d52c2e3ff3bb356e8f8eebd850ea9e20..af2ead9b8405696977cc9a40be750b939541006a 100644
--- a/tensorflow/python/feature_column/feature_column.py
+++ b/tensorflow/python/feature_column/feature_column.py
@@ -1782,9 +1782,7 @@ class _EmbeddingColumnLayer(base.Layer):
     Args:
       embedding_shape: Shape of the embedding variable used for lookup.
       initializer: A variable initializer function to be used in embedding
-        variable initialization. If not specified, defaults to
-        `tf.truncated_normal_initializer` with mean `0.0` and standard deviation
-        `1/sqrt(dimension)`.
+        variable initialization.
       weight_collections: A list of collection names to which the Variable will
         be added. Note that, variables will also be added to collections
         `tf.GraphKeys.GLOBAL_VARIABLES` and `ops.GraphKeys.MODEL_VARIABLES`.
@@ -1799,6 +1797,15 @@ class _EmbeddingColumnLayer(base.Layer):
     self._initializer = initializer
     self._weight_collections = weight_collections
 
+  def set_weight_collections(self, weight_collections):
+    """Sets the weight collections for the layer.
+
+    Args:
+      weight_collections: A list of collection names to which the Variable will
+        be added.
+    """
+    self._weight_collections = weight_collections
+
   def build(self, _):
     self._embedding_weight_var = self.add_variable(
         name='embedding_weights',
@@ -2163,7 +2170,7 @@ class _LazyBuilder(object):
       self._feature_tensors[key] = feature_tensor
       return feature_tensor
 
-    if isinstance(key, str):
+    if isinstance(key, six.string_types):
       raise ValueError('Feature {} is not in features dictionary.'.format(key))
 
     if not isinstance(key, _FeatureColumn):
@@ -2604,6 +2611,7 @@ class _SharedEmbeddingColumn(
       sparse_ids = sparse_tensors.id_tensor
       sparse_weights = sparse_tensors.weight_tensor
 
+      self._layer.set_weight_collections(weight_collections)
       embedding_weights = self._layer(
           None, scope=variable_scope.get_variable_scope())
       # If we're in graph mode and this is called with a different graph,
@@ -2612,6 +2620,7 @@ class _SharedEmbeddingColumn(
           ops.get_default_graph() !=
           _get_graph_for_variable(embedding_weights)):
         self._reset_config()
+        self._layer.set_weight_collections(weight_collections)
         embedding_weights = self._layer(
             None, scope=variable_scope.get_variable_scope())
 
diff --git a/tensorflow/python/feature_column/feature_column_test.py b/tensorflow/python/feature_column/feature_column_test.py
index f9206f4f38d7631ccdb57d41d8c52f9f0edede6d..627430d6bc5995cf054482ac3004098b8a2472ab 100644
--- a/tensorflow/python/feature_column/feature_column_test.py
+++ b/tensorflow/python/feature_column/feature_column_test.py
@@ -137,6 +137,9 @@ class LazyColumnTest(test.TestCase):
     with self.assertRaisesRegexp(ValueError,
                                  'bbb is not in features dictionary'):
       builder.get('bbb')
+    with self.assertRaisesRegexp(ValueError,
+                                 'bbb is not in features dictionary'):
+      builder.get(u'bbb')
 
   def test_not_supported_feature_column(self):
 
@@ -5612,6 +5615,72 @@ class SharedEmbeddingColumnTest(test.TestCase):
       self.assertAllEqual(expected_lookups_a, embedding_lookup_a.eval())
       self.assertAllEqual(expected_lookups_b, embedding_lookup_b.eval())
 
+  def test_get_dense_tensor_weight_collections(self):
+    # Inputs.
+    vocabulary_size = 3
+    # -1 values are ignored.
+    input_a = np.array([
+        [2, -1, -1],  # example 0, ids [2]
+        [0, 1, -1]
+    ])  # example 1, ids [0, 1]
+    input_b = np.array([
+        [0, -1, -1],  # example 0, ids [0]
+        [-1, -1, -1]
+    ])  # example 1, ids []
+    input_features = {'aaa': input_a, 'bbb': input_b}
+
+    # Embedding variable.
+    embedding_dimension = 2
+    embedding_values = (
+        (1., 2.),  # id 0
+        (3., 5.),  # id 1
+        (7., 11.)  # id 2
+    )
+
+    def _initializer(shape, dtype, partition_info):
+      self.assertAllEqual((vocabulary_size, embedding_dimension), shape)
+      self.assertEqual(dtypes.float32, dtype)
+      self.assertIsNone(partition_info)
+      return embedding_values
+
+    # Expected lookup result, using combiner='mean'.
+    expected_lookups_a = (
+        # example 0:
+        (7., 11.),  # ids [2], embedding = [7, 11]
+        # example 1:
+        (2., 3.5),  # ids [0, 1], embedding = mean([1, 2] + [3, 5]) = [2, 3.5]
+    )
+    expected_lookups_b = (
+        # example 0:
+        (1., 2.),  # ids [0], embedding = [1, 2]
+        # example 1:
+        (0., 0.),  # ids [], embedding = [0, 0]
+    )
+
+    # Build columns.
+    categorical_column_a = fc.categorical_column_with_identity(
+        key='aaa', num_buckets=vocabulary_size)
+    categorical_column_b = fc.categorical_column_with_identity(
+        key='bbb', num_buckets=vocabulary_size)
+    embedding_column_a, embedding_column_b = fc.shared_embedding_columns(
+        [categorical_column_a, categorical_column_b],
+        dimension=embedding_dimension,
+        initializer=_initializer)
+
+    fc.input_layer(
+        input_features, [embedding_column_a, embedding_column_b],
+        weight_collections=('my_vars',))
+
+    # Assert expected embedding variable and lookups.
+    global_vars = ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES)
+    self.assertItemsEqual(
+        ('input_layer/aaa_bbb_shared_embedding/embedding_weights:0',),
+        tuple(v.name for v in global_vars))
+    my_vars = ops.get_collection('my_vars')
+    self.assertItemsEqual(
+        ('input_layer/aaa_bbb_shared_embedding/embedding_weights:0',),
+        tuple(v.name for v in my_vars))
+
   def test_get_dense_tensor_placeholder_inputs(self):
     # Inputs.
     vocabulary_size = 3
diff --git a/tensorflow/python/framework/dtypes.py b/tensorflow/python/framework/dtypes.py
index 7f9ef53457ae060600067b946e686487f55adda1..c3f70df7d8056ee33d2e0d875b19e07f18220545 100644
--- a/tensorflow/python/framework/dtypes.py
+++ b/tensorflow/python/framework/dtypes.py
@@ -120,11 +120,7 @@ class DType(object):
 
   @property
   def is_numpy_compatible(self):
-    numpy_incompatible = [
-        types_pb2.DT_VARIANT, types_pb2.DT_VARIANT_REF, types_pb2.DT_RESOURCE,
-        types_pb2.DT_RESOURCE_REF
-    ]
-    return self._type_enum not in numpy_incompatible
+    return self._type_enum not in _NUMPY_INCOMPATIBLE
 
   @property
   def as_numpy_dtype(self):
@@ -162,7 +158,7 @@ class DType(object):
   @property
   def is_quantized(self):
     """Returns whether this is a quantized data type."""
-    return self.base_dtype in [qint8, quint8, qint16, quint16, qint32]
+    return self.base_dtype in _QUANTIZED_DTYPES_NO_REF
 
   @property
   def is_unsigned(self):
@@ -401,6 +397,11 @@ quint16_ref = DType(types_pb2.DT_QUINT16_REF)
 qint32_ref = DType(types_pb2.DT_QINT32_REF)
 bfloat16_ref = DType(types_pb2.DT_BFLOAT16_REF)
 
+_NUMPY_INCOMPATIBLE = frozenset([
+    types_pb2.DT_VARIANT, types_pb2.DT_VARIANT_REF, types_pb2.DT_RESOURCE,
+    types_pb2.DT_RESOURCE_REF
+])
+
 # Maintain an intern table so that we don't have to create a large
 # number of small objects.
 _INTERN_TABLE = {
@@ -645,10 +646,10 @@ _TF_TO_NP = {
         _np_bfloat16,
 }
 
-QUANTIZED_DTYPES = frozenset([
-    qint8, quint8, qint16, quint16, qint32, qint8_ref, quint8_ref, qint16_ref,
-    quint16_ref, qint32_ref
-])
+_QUANTIZED_DTYPES_NO_REF = frozenset([qint8, quint8, qint16, quint16, qint32])
+_QUANTIZED_DTYPES_REF = frozenset(
+    [qint8_ref, quint8_ref, qint16_ref, quint16_ref, qint32_ref])
+QUANTIZED_DTYPES = _QUANTIZED_DTYPES_REF.union(_QUANTIZED_DTYPES_NO_REF)
 tf_export("QUANTIZED_DTYPES").export_constant(__name__, "QUANTIZED_DTYPES")
 
 _PYTHON_TO_TF = {
@@ -662,10 +663,9 @@ def as_dtype(type_value):
   """Converts the given `type_value` to a `DType`.
 
   Args:
-    type_value: A value that can be converted to a `tf.DType`
-      object. This may currently be a `tf.DType` object, a
-      [`DataType`
-        enum](https://www.tensorflow.org/code/tensorflow/core/framework/types.proto),
+    type_value: A value that can be converted to a `tf.DType` object. This may
+      currently be a `tf.DType` object, a [`DataType`
+      enum](https://www.tensorflow.org/code/tensorflow/core/framework/types.proto),
       a string type name, or a `numpy.dtype`.
 
   Returns:
diff --git a/tensorflow/python/framework/function.py b/tensorflow/python/framework/function.py
index 067522201694d64027ebcc7ea500e4771a96ed92..82ecba310ba4cbad62b8fda073a006b28be2c3ad 100644
--- a/tensorflow/python/framework/function.py
+++ b/tensorflow/python/framework/function.py
@@ -718,8 +718,14 @@ class _FuncGraph(ops.Graph):
           tensor.dtype, shape=tensor.get_shape(), name=name)
     # pylint: disable=protected-access
     if ops._USE_C_SHAPES:
-      handle_data = c_api.GetResourceHandleShapeAndType(tensor.graph._c_graph,
-                                                        tensor._as_tf_output())
+      if isinstance(tensor, ops.EagerTensor):
+        handle_data = tensor._handle_data
+        if handle_data:
+          handle_data = handle_data.SerializeToString()
+      else:
+        handle_data = c_api.GetResourceHandleShapeAndType(
+            tensor.graph._c_graph, tensor._as_tf_output())
+
       if handle_data:
         c_api.SetResourceHandleShapeAndType(ph.graph._c_graph,
                                             ph._as_tf_output(),
@@ -1166,3 +1172,13 @@ _DTYPE_TO_STR = {
     dtypes.qint32: "qi32",
     dtypes.bfloat16: "b16"
 }
+
+
+def function_def_from_tf_function(c_func):
+  """Converts a SWIG-wrapped TF_Function* to a FunctionDef proto."""
+  with c_api_util.tf_buffer() as buf:
+    c_api.TF_FunctionToFunctionDef(c_func, buf)
+    data = c_api.TF_GetBuffer(buf)
+  fdef = function_pb2.FunctionDef()
+  fdef.ParseFromString(compat.as_bytes(data))
+  return fdef
diff --git a/tensorflow/python/framework/function_def_to_graph.py b/tensorflow/python/framework/function_def_to_graph.py
new file mode 100644
index 0000000000000000000000000000000000000000..4fecc41343f26291dac8455f6c972a755b65ecfc
--- /dev/null
+++ b/tensorflow/python/framework/function_def_to_graph.py
@@ -0,0 +1,189 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# =============================================================================
+"""Utlity to convert FunctionDef to GraphDef and Graph."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.core.framework import graph_pb2
+from tensorflow.core.framework import types_pb2
+from tensorflow.core.framework import versions_pb2
+from tensorflow.python.framework import function
+from tensorflow.python.framework import importer
+from tensorflow.python.framework import op_def_registry
+from tensorflow.python.framework import versions
+
+
+def function_def_to_graph(fdef, input_shapes=None):
+  """Converts a FunctionDef to a function._FuncGraph (sub-class Graph).
+
+  The returned _FuncGraph's `name`, `inputs` and `outputs` fields will be set.
+  The input tensors are represented as placeholders.
+
+  Note: `_FuncGraph.inputs` and `_FuncGraph._captured` are not set and may be
+  set by the caller.
+
+  Args:
+    fdef: FunctionDef.
+    input_shapes: Optional. A list of TensorShape objects of the shapes of
+      function inputs. If specified, its length must match length of
+      `fdef.signature.input_arg`. If a shape is None, the corresponding input
+      placeholder will have unknown shape.
+
+  Returns:
+    A _FuncGraph.
+  """
+  func_graph = function._FuncGraph(fdef.signature.name, capture_by_value=False)  # pylint: disable=protected-access
+  graph_def, nested_to_flat_tensor_name = function_def_to_graph_def(
+      fdef, input_shapes)
+
+  with func_graph.as_default():
+    # Add all function nodes to the graph.
+    importer.import_graph_def(graph_def, name="")
+
+    # Initialize fields specific to _FuncGraph.
+
+    # inputs
+    input_tensor_names = [
+        nested_to_flat_tensor_name[arg.name] for arg in fdef.signature.input_arg
+    ]
+    func_graph.inputs = [
+        func_graph.get_tensor_by_name(name) for name in input_tensor_names
+    ]
+
+    # outputs
+    output_tensor_names = [
+        nested_to_flat_tensor_name[fdef.ret[arg.name]]
+        for arg in fdef.signature.output_arg
+    ]
+    func_graph.outputs = [
+        func_graph.get_tensor_by_name(name) for name in output_tensor_names
+    ]
+
+  return func_graph
+
+
+def function_def_to_graph_def(fdef, input_shapes=None):
+  """Convert a FunctionDef to a GraphDef.
+
+  Steps:
+  1. Creates placeholder nodes corresponding to inputs in
+     `FunctionDef.signature.input_arg`.
+  2. Adds NodeDefs in `FunctionDef.node_def` to `GraphDef.node`.
+  3. Renames inputs of all nodes to use the convention of GraphDef instead of
+     FunctionDef. See comment on `FunctionDef.node_def` on how the tensor naming
+     in FunctionDefs is different from GraphDefs.
+
+  Args:
+    fdef: FunctionDef.
+    input_shapes: Optional. A list of TensorShape objects of the shapes of
+      function inputs. If specified, its length must match length of
+      `fdef.signature.input_arg`. If a shape is None, the corresponding input
+      placeholder will have unknown shape.
+
+  Returns:
+    A tuple of (GraphDef, dict<string, string>). The dict contains a mapping
+    from nested tensor names (in FunctionDef) to flattened names (in GraphDef).
+
+  Raises:
+    ValueError: If the length of input_shapes does not match the number of
+      input_args or if the FunctionDef is invalid.
+  """
+  graph_def = graph_pb2.GraphDef()
+  graph_def.versions.CopyFrom(
+      versions_pb2.VersionDef(
+          producer=versions.GRAPH_DEF_VERSION,
+          min_consumer=versions.GRAPH_DEF_VERSION_MIN_CONSUMER))
+
+  if input_shapes and len(input_shapes) != len(fdef.signature.input_arg):
+    raise ValueError("Length of input_shapes must match the number of " +
+                     "input_args. len(input_shapes): {} len(input_arg): {}".
+                     format(len(input_shapes), len(fdef.signature.input_arg)))
+
+  # 1. Create placeholders for input nodes.
+  for i, arg_def in enumerate(fdef.signature.input_arg):
+    node_def = graph_def.node.add()
+    node_def.name = arg_def.name
+    node_def.op = "Placeholder"
+    node_def.attr["dtype"].type = arg_def.type
+    if input_shapes and input_shapes[i] is not None:
+      node_def.attr["shape"].shape.CopyFrom(input_shapes[i].as_proto())
+
+  # 2. Copy all body NodeDefs to the GraphDef.
+  graph_def.node.extend(fdef.node_def)
+
+  # 3. Perform the renaming.
+
+  # Build the tensor name mapping then flatten the tensor names.
+  # See comment on `FunctionDef.node_def` on how the tensor naming in
+  # FunctionDefs is different from GraphDefs.
+  nested_to_flat_tensor_name = {}
+
+  for arg_def in fdef.signature.input_arg:
+    nested_to_flat_tensor_name[arg_def.name] = "{}:0".format(arg_def.name)
+
+  for node_def in fdef.node_def:
+    op_def = op_def_registry.get_registered_ops().get(node_def.op)
+    if not op_def:
+      # TODO(b/80470245): Support functions which refer other functions.
+      raise NotImplementedError(
+          "No op registered for {},".format(node_def.op) +
+          " it may be a function. function_def_to_graph_def " +
+          "currently does not support converting functions with " +
+          "references to other graph functions.")
+
+    for attr in op_def.attr:
+      if attr.type in ("func", "list(func)"):
+        # TODO(b/80470245): Support functions which refer other functions.
+        raise NotImplementedError("Unsupported attr {} ".format(attr.name) +
+                                  " with type {}".format(attr.type) +
+                                  " in op {}. ".format(op_def.name) +
+                                  "function_def_to_graph_def currently does " +
+                                  "not support converting functions with " +
+                                  "references to other graph functions.")
+
+    # Iterate over output_args in op_def to build the map.
+    # Index of the output tensor in the flattened list of *all* output
+    # tensors of the op.
+    flattened_index = 0
+    for arg_def in op_def.output_arg:
+      num_args = _get_num_args(arg_def, node_def)
+      for i in range(num_args):
+        # Map tensor names from "node_name:output_arg_name:index" to
+        # "node_name:flattened_index".
+        nested_name = "{}:{}:{}".format(node_def.name, arg_def.name, i)
+        flat_name = "{}:{}".format(node_def.name, flattened_index)
+        nested_to_flat_tensor_name[nested_name] = flat_name
+        flattened_index += 1
+
+  # Update inputs of all nodes in graph.
+  for node_def in graph_def.node:
+    for i in range(len(node_def.input)):
+      node_def.input[i] = nested_to_flat_tensor_name[node_def.input[i]]
+
+  return graph_def, nested_to_flat_tensor_name
+
+
+# Based on implementation in core/framework/node_def_util.cc::ComputeArgRange.
+def _get_num_args(arg_def, node_def):
+  if arg_def.number_attr:
+    return node_def.attr[arg_def.number_attr].i
+  elif arg_def.type_list_attr:
+    return len(node_def.attr[arg_def.type_list_attr].list.type)
+  elif arg_def.type_attr or arg_def.type != types_pb2.DT_INVALID:
+    return 1
+  else:
+    raise ValueError("Invalid arg_def:\n\n{}".format(str(arg_def)))
diff --git a/tensorflow/python/framework/function_def_to_graph_test.py b/tensorflow/python/framework/function_def_to_graph_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..0f4e6ef54fb02cc6ba52c9de2ccabea982fd2323
--- /dev/null
+++ b/tensorflow/python/framework/function_def_to_graph_test.py
@@ -0,0 +1,184 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for tensorflow.python.framework.function_def_to_graph."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import function_def_to_graph
+from tensorflow.python.framework import graph_to_function_def
+from tensorflow.python.framework import ops
+from tensorflow.python.framework import tensor_shape
+from tensorflow.python.framework import test_ops
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import math_ops
+from tensorflow.python.platform import test
+
+
+class FunctionDefToGraphTest(test.TestCase):
+
+  def _build_function_def(self):
+    with ops.Graph().as_default() as g:
+      # Inputs
+      x = array_ops.placeholder(dtypes.float32, name="x")
+      y = array_ops.placeholder(dtypes.float32, name="y")
+
+      # Outputs
+      sum_squares = math_ops.add_n(
+          [math_ops.pow(x, 2), math_ops.pow(y, 2)], name="sum_squares")
+      sum_cubes = math_ops.add_n(
+          [math_ops.pow(x, 3), math_ops.pow(y, 3)], name="sum_cubes")
+    fdef = graph_to_function_def.graph_to_function_def(
+        g,
+        g.get_operations(),
+        [x, y],  # Inputs
+        [sum_squares, sum_cubes])  # Outputs.
+    fdef.signature.name = "_whats_in_a_name"
+    return fdef
+
+  def testInputsAndOutputs(self):
+    fdef = self._build_function_def()
+    g = function_def_to_graph.function_def_to_graph(fdef)
+    self.assertEqual(g.name, "_whats_in_a_name")
+    with self.test_session(graph=g) as sess:
+      inputs = sess.run(g.inputs, feed_dict={"x:0": 2, "y:0": 3})
+      self.assertSequenceEqual(inputs, [2.0, 3.0])
+      outputs = sess.run(g.outputs, feed_dict={"x:0": 2, "y:0": 3})
+      self.assertSequenceEqual(outputs, [13.0, 35.0])
+
+  def testShapes(self):
+    fdef = self._build_function_def()
+
+    g = function_def_to_graph.function_def_to_graph(fdef)
+    self.assertIsNone(g.inputs[0].shape.dims)  # Unknown dims.
+    self.assertIsNone(g.inputs[1].shape.dims)  # Unknown dims.
+    self.assertIsNone(g.outputs[0].shape.dims)  # Unknown dims.
+    self.assertIsNone(g.outputs[1].shape.dims)  # Unknown dims.
+
+    g = function_def_to_graph.function_def_to_graph(
+        fdef, input_shapes=[tensor_shape.vector(5),
+                            tensor_shape.vector(5)])
+    self.assertSequenceEqual(g.inputs[0].shape.dims, [5])
+    self.assertSequenceEqual(g.inputs[1].shape.dims, [5])
+    self.assertSequenceEqual(g.outputs[0].shape.dims, [5])
+    self.assertSequenceEqual(g.outputs[1].shape.dims, [5])
+
+    g = function_def_to_graph.function_def_to_graph(
+        fdef, input_shapes=[None, tensor_shape.matrix(5, 7)])
+    print(g.as_graph_def())
+    self.assertIsNone(g.inputs[0].shape.dims)
+    self.assertSequenceEqual(g.inputs[1].shape.dims, [5, 7])
+    self.assertSequenceEqual(g.outputs[0].shape.dims, [5, 7])
+    self.assertSequenceEqual(g.outputs[1].shape.dims, [5, 7])
+
+    # Should raise a ValueError if the length of input_shapes does not match
+    # the number of input args in FunctionDef.signature.input_arg.
+    with self.assertRaises(ValueError):
+      g = function_def_to_graph.function_def_to_graph(
+          fdef, input_shapes=[tensor_shape.matrix(5, 7)])
+
+
+class FunctionDefToGraphDefTest(test.TestCase):
+
+  def _build_function_def(self):
+    with ops.Graph().as_default() as g:
+      # Inputs:    x    y    z
+      #            |\   |   /
+      #            | \  |  /
+      #            |  foo_1     list_output
+      #            |   / \       /       \
+      #            | d_1 e_1  a:1        a:0
+      #            |  \   |   /           |
+      #            |   \  |  /            |
+      #            |    foo_2             |
+      #            |     / \              |
+      # Outputs:   x   d_2 e_2           a:0
+
+      x = array_ops.placeholder(dtypes.float32, name="x")
+      y = array_ops.placeholder(dtypes.int32, name="y")
+      z = array_ops.placeholder(dtypes.int32, name="z")
+
+      d_1, e_1 = test_ops._op_def_lib.apply_op(
+          "Foo1", name="foo_1", a=x, b=y, c=z)
+
+      list_output0, list_output1 = test_ops.list_output(
+          T=[dtypes.int32, dtypes.int32], name="list_output")
+
+      d_2, e_2 = test_ops.foo1(a=d_1, b=e_1, c=list_output1, name="foo_2")
+
+    fdef = graph_to_function_def.graph_to_function_def(
+        g,
+        g.get_operations(),
+        [x, y, z],  # Inputs
+        [x, d_2, e_2, list_output0])  # Outputs.
+
+    # Assert that the FunctionDef was correctly built.
+    assert len(fdef.node_def) == 3  # 2 Foo1 nodes and 1 ListOutput node.
+    assert fdef.node_def[0].op == "Foo1"
+    assert fdef.node_def[0].input == ["x", "y", "z"]
+    assert fdef.node_def[1].op == "ListOutput"
+    assert not fdef.node_def[1].input
+    assert fdef.node_def[2].op == "Foo1"
+    assert fdef.node_def[2].input == [
+        "foo_1:d:0", "foo_1:e:0", "list_output:a:1"
+    ]
+    return fdef
+
+  def testTensorNames(self):
+    fdef = self._build_function_def()
+    g, tensor_name_map = function_def_to_graph.function_def_to_graph_def(fdef)
+
+    # Verify that inputs of body nodes are correctly renamed.
+    # foo_1
+    self.assertSequenceEqual(g.node[3].input, ["x:0", "y:0", "z:0"])
+    # foo_2
+    self.assertSequenceEqual(g.node[5].input,
+                             ["foo_1:0", "foo_1:1", "list_output:1"])
+
+    # Verify that the `tensor_name_map` has the correct mapping.
+    self.assertDictEqual(
+        tensor_name_map, {
+            "x": "x:0",
+            "y": "y:0",
+            "z": "z:0",
+            "foo_1:d:0": "foo_1:0",
+            "foo_1:e:0": "foo_1:1",
+            "list_output:a:0": "list_output:0",
+            "list_output:a:1": "list_output:1",
+            "foo_2:d:0": "foo_2:0",
+            "foo_2:e:0": "foo_2:1",
+        })
+
+  def testShapes(self):
+    fdef = self._build_function_def()
+    g, _ = function_def_to_graph.function_def_to_graph_def(
+        fdef,
+        input_shapes=[tensor_shape.scalar(),
+                      tensor_shape.vector(5), None])
+    self.assertEqual("shape" in g.node[0].attr, True)
+    self.assertSequenceEqual(
+        tensor_shape.TensorShape(g.node[0].attr["shape"].shape).as_list(), [])
+    self.assertEqual(g.node[0].attr["shape"].shape.unknown_rank, False)
+    self.assertEqual("shape" in g.node[1].attr, True)
+    self.assertSequenceEqual(
+        tensor_shape.TensorShape(g.node[1].attr["shape"].shape).as_list(), [5])
+    self.assertEqual(g.node[0].attr["shape"].shape.unknown_rank, False)
+    self.assertFalse("shape" in g.node[2].attr)
+
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/python/framework/ops.py b/tensorflow/python/framework/ops.py
index 7fb3a22f5ac77bb54ef37121fd464fa8d66ae9b1..b440cde3ad4268b9f53da77af0c570f7c0f51d65 100644
--- a/tensorflow/python/framework/ops.py
+++ b/tensorflow/python/framework/ops.py
@@ -20,7 +20,6 @@ from __future__ import print_function
 
 import collections
 import copy
-import functools
 import linecache
 import os
 import re
@@ -57,14 +56,13 @@ from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.util import compat
 from tensorflow.python.util import decorator_utils
 from tensorflow.python.util import tf_contextlib
+from tensorflow.python.util.deprecation import deprecated_args
 from tensorflow.python.util.tf_export import tf_export
 
 
-# Temporary global switch determining if we should enable the work-in-progress
-# calls to the C API. Currently disabled by default but can be manually enabled
-# in code or via the environment variable. This will be removed once all
-# functionality is supported and there's no performance penalty with it enabled.
-_USE_C_API = os.getenv("TF_C_API_GRAPH_CONSTRUCTION", "1") is not "0"
+# Temporary global switches determining if we should enable the work-in-progress
+# calls to the C API. These will be removed once all functionality is supported.
+_USE_C_API = True
 _USE_C_SHAPES = os.getenv("TF_C_API_GRAPH_CONSTRUCTION_SHAPES", "0") is not "0"
 
 
@@ -291,15 +289,8 @@ class Tensor(_TensorLike):
     self._value_index = value_index
     self._dtype = dtypes.as_dtype(dtype)
 
-    if _USE_C_API:
-      # This will be set by set_shape_and_handle_data_for_outputs.
-      self._shape_val = None
-    else:
-      # The Python code requires all tensors start with a shape to support shape
-      # inference on imported while loops. This isn't necessary with the C API
-      # enabled because the C API provides the shapes for imported nodes.
-      # TODO(skyewm): remove when _USE_C_API is removed.
-      self._shape_val = tensor_shape.unknown_shape()
+    # This will be set by self.shape().
+    self._shape_val = None
 
     # List of operations that use this Tensor as input.  We maintain this list
     # to easily navigate a computation graph.
@@ -387,7 +378,6 @@ class Tensor(_TensorLike):
       if _USE_C_SHAPES:
         self._shape_val = self._c_api_shape()
       else:
-        assert _USE_C_API
         # Call set_shape_and_handle_data_for_outputs in topological order on all
         # ops that are needed to compute self.op's shape. We do this instead of
         # having set_shape_and_handle_data_for_outputs recursively call
@@ -511,8 +501,6 @@ class Tensor(_TensorLike):
     else:
       self._shape_val = self.shape.merge_with(shape)
 
-    if not self._op._graph._c_graph: return
-
     # Update C shape even if _USE_C_SHAPES = False, since we still want
     # set_shape to be reflected in the C API graph for when we run it.
     if not isinstance(shape, tensor_shape.TensorShape):
@@ -548,33 +536,14 @@ class Tensor(_TensorLike):
     Returns:
       A list of `Operation`s.
     """
-    if self._op._c_op:  # pylint: disable=protected-access
-      consumer_names = c_api.TF_OperationOutputConsumers_wrapper(
-          self._as_tf_output())
-      # pylint: disable=protected-access
-      return [
-          self.graph._get_operation_by_name_unsafe(name)
-          for name in consumer_names
-      ]
-      # pylint: enable=protected-access
-    else:
-      return self._consumers
-
-  def _add_consumer(self, consumer):
-    """Add a consumer to this tensor.
-
-    Args:
-      consumer: an Operation.
-
-    Raises:
-      TypeError: if the consumer is not an Operation.
-    """
+    consumer_names = c_api.TF_OperationOutputConsumers_wrapper(
+        self._as_tf_output())
     # pylint: disable=protected-access
-    assert not self._op._c_op, "Tensor._add_consumer doesn't work with C API"
+    return [
+        self.graph._get_operation_by_name_unsafe(name)
+        for name in consumer_names
+    ]
     # pylint: enable=protected-access
-    if not isinstance(consumer, Operation):
-      raise TypeError("Consumer must be an Operation: %s" % consumer)
-    self._consumers.append(consumer)
 
   def _as_node_def_input(self):
     """Return a value to use for the NodeDef "input" attribute.
@@ -597,7 +566,6 @@ class Tensor(_TensorLike):
 
   def _as_tf_output(self):
     # pylint: disable=protected-access
-    assert self.op._c_op
     return c_api_util.tf_output(self.op._c_op, self.value_index)
     # pylint: enable=protected-access
 
@@ -1725,18 +1693,8 @@ class Operation(object):
                           "a Tensor, or IndexedSlices: %s" % c)
         control_input_ops.append(control_op)
 
-    # Don't set private fields with C API enabled to catch users who need to
-    # switch to public API.
-    # TODO(skyewm): delete these fields once we remove _USE_C_API
-    if not self._graph._c_graph:
-      self._inputs_val = list(inputs)  # Defensive copy.
-      self._input_types_val = input_types
-      self._control_inputs_val = control_input_ops
-      self._node_def_val = copy.deepcopy(node_def)
-      self._op_def_val = op_def
-    else:
-      # This will be set by self.inputs.
-      self._inputs_val = None
+    # This will be set by self.inputs.
+    self._inputs_val = None
 
     self._id_value = self._graph._next_id()  # pylint: disable=protected-access
     self._original_op = original_op
@@ -1745,10 +1703,8 @@ class Operation(object):
 
     # Initialize self._c_op.
     if c_op:
-      # TODO(skyewm): remove this assert when we remove USE_C_API
-      assert self._graph._c_graph  # pylint: disable=protected-access
       self._c_op = c_op
-    elif self._graph._c_graph:  # pylint: disable=protected-access
+    else:
       if op_def is None:
         op_def = self._graph._get_op_def(node_def.op)
       # TODO(skyewm): op_def_library.apply_op() flattens the incoming inputs.
@@ -1757,30 +1713,19 @@ class Operation(object):
           op_def, inputs, node_def.attr)
       self._c_op = _create_c_op(self._graph, node_def, grouped_inputs,
                                 control_input_ops)
-    else:
-      self._c_op = None
-
-    # Mark that we consume the inputs. This is unnecessary and unsupported with
-    # the C API enabled, since the C API tracks the tensor consumers instead.
-    if not self._c_op:
-      for input_tensor in self._inputs_val:
-        input_tensor._add_consumer(self)  # pylint: disable=protected-access
 
     # Initialize self._outputs.
-    if self._c_op:
-      num_outputs = c_api.TF_OperationNumOutputs(self._c_op)
-      output_types = [
-          c_api.TF_OperationOutputType(c_api_util.tf_output(self._c_op, i))
-          for i in range(num_outputs)]
-      assert output_types is not None
-    elif output_types is None:
-      output_types = []
-    self._output_types_val = output_types
+    num_outputs = c_api.TF_OperationNumOutputs(self._c_op)
+    output_types = [
+        c_api.TF_OperationOutputType(c_api_util.tf_output(self._c_op, i))
+        for i in range(num_outputs)]
     self._outputs = [
         Tensor(self, i, output_type)
         for i, output_type in enumerate(output_types)
     ]
 
+    self._graph._add_op(self)  # pylint: disable=protected-access
+
     if not c_op:
       self._control_flow_post_processing()
 
@@ -1794,7 +1739,6 @@ class Operation(object):
       control_flow_util.CheckInputFromValidContext(self, input_tensor.op)
     if self._control_flow_context is not None:
       self._control_flow_context.AddOp(self)
-    self._recompute_node_def()
 
   def _reconstruct_sequence_inputs(self, op_def, inputs, attrs):
     """Regroups a flat list of input tensors into scalar and sequence inputs.
@@ -1875,10 +1819,7 @@ class Operation(object):
   @property
   def name(self):
     """The full name of this operation."""
-    if self._c_op:
-      return c_api.TF_OperationName(self._c_op)
-    else:
-      return self._node_def_val.name
+    return c_api.TF_OperationName(self._c_op)
 
   @property
   def _id(self):
@@ -1894,10 +1835,7 @@ class Operation(object):
       assigned, or an empty string if it has not been assigned to a
       device.
     """
-    if self._c_op:
-      return c_api.TF_OperationDevice(self._c_op)
-    else:
-      return self._node_def_val.device
+    return c_api.TF_OperationDevice(self._c_op)
 
   @property
   def _output_types(self):
@@ -1910,28 +1848,21 @@ class Operation(object):
       The length of this list indicates the number of output endpoints
       of the operation.
     """
-    if self._c_op:
-      num_outputs = c_api.TF_OperationNumOutputs(self._c_op)
-      output_types = [
-          c_api.TF_OperationOutputType(self._tf_output(i))
-          for i in xrange(num_outputs)
-      ]
-      # TODO(iga): Remove this assert after converting to C API by default.
-      # Just being a bit paranoid here.
-      assert self._output_types_val == output_types
-      # In all the tests we have output_types that are passed into
-      # Operation.__init__ are a list of ints (which is illegal according
-      # to the docstring), but input_types are instances of DType.
-      # This extra assert is to catch if we ever use DType for output_types.
-      if output_types:
-        assert isinstance(output_types[0], int)
-      return output_types
-    else:
-      return self._output_types_val
+    num_outputs = c_api.TF_OperationNumOutputs(self._c_op)
+    output_types = [
+        c_api.TF_OperationOutputType(self._tf_output(i))
+        for i in xrange(num_outputs)
+    ]
+    # In all the tests we have output_types that are passed into
+    # Operation.__init__ are a list of ints (which is illegal according
+    # to the docstring), but input_types are instances of DType.
+    # This extra assert is to catch if we ever use DType for output_types.
+    if output_types:
+      assert isinstance(output_types[0], int)
+    return output_types
 
   def _tf_output(self, output_idx):
     """Create and return a new TF_Output for output_idx'th output of this op."""
-    assert self._c_op
     tf_output = c_api.TF_Output()
     tf_output.oper = self._c_op
     tf_output.index = output_idx
@@ -1939,7 +1870,6 @@ class Operation(object):
 
   def _tf_input(self, input_idx):
     """Create and return a new TF_Input for input_idx'th input of this op."""
-    assert self._c_op
     tf_input = c_api.TF_Input()
     tf_input.oper = self._c_op
     tf_input.index = input_idx
@@ -1951,47 +1881,12 @@ class Operation(object):
     Args:
       device: string or device..  The device to set.
     """
-    if self._c_op:
-      c_api.SetRequestedDevice(
-          self._graph._c_graph,  # pylint: disable=protected-access
-          self._c_op,  # pylint: disable=protected-access
-          compat.as_str(_device_string(device)))
-    else:
-      self._node_def_val.device = _device_string(device)
+    c_api.SetRequestedDevice(
+        self._graph._c_graph,  # pylint: disable=protected-access
+        self._c_op,  # pylint: disable=protected-access
+        compat.as_str(_device_string(device)))
 
-  def _add_input(self, tensor, dtype=None):
-    """Add a new input to this operation.
-
-    Args:
-      tensor: the Tensor to add as an input.
-      dtype: tf.DType: type of the input; defaults to
-        the tensor's dtype.
-
-    Raises:
-      TypeError: if tensor is not a Tensor,
-        or if input tensor type is not convertible to dtype.
-      ValueError: if the Tensor is from a different graph.
-    """
-    assert not self._c_op, (
-        "Operation._add_input doesn't work with C API")
-    if not isinstance(tensor, Tensor):
-      raise TypeError("tensor must be a Tensor: %s" % tensor)
-    _assert_same_graph(self, tensor)
-    if dtype is None:
-      dtype = tensor.dtype
-    else:
-      dtype = dtypes.as_dtype(dtype)
-      if not dtype.is_compatible_with(tensor.dtype):
-        raise TypeError(
-            "Cannot convert a tensor of type %s to an input of type %s" %
-            (tensor.dtype.name, dtype.name))
-    self._inputs_val.append(tensor)
-    self._input_types_val.append(dtype)
-    tensor._add_consumer(self)  # pylint: disable=protected-access
-    self._recompute_node_def()
-
-  # TODO(skyewm): Remove `update_dtype` when we enable the C API.
-  def _update_input(self, index, tensor, update_dtype=True):
+  def _update_input(self, index, tensor):
     """Update the input to this operation at the given index.
 
     NOTE: This is for TF internal use only. Please don't use it.
@@ -1999,7 +1894,6 @@ class Operation(object):
     Args:
       index: the index of the input to update.
       tensor: the Tensor to be used as the input at the given index.
-      update_dtype: If `False`, the type for this input is not updated.
 
     Raises:
       TypeError: if tensor is not a Tensor,
@@ -2016,20 +1910,12 @@ class Operation(object):
     if not _USE_C_SHAPES:
       set_shape_and_handle_data_for_outputs(self)
 
-    if self._c_op:
-      # Reset cached inputs.
-      self._inputs_val = None
-      c_api.UpdateEdge(
-          self._graph._c_graph,  # pylint: disable=protected-access
-          tensor._as_tf_output(),  # pylint: disable=protected-access
-          self._tf_input(index))
-    else:
-      self._inputs_val[index].consumers().remove(self)
-      self._inputs_val[index] = tensor
-      if update_dtype:
-        self._input_types_val[index] = tensor.dtype
-      tensor._add_consumer(self)  # pylint: disable=protected-access
-      self._recompute_node_def()
+    # Reset cached inputs.
+    self._inputs_val = None
+    c_api.UpdateEdge(
+        self._graph._c_graph,  # pylint: disable=protected-access
+        tensor._as_tf_output(),  # pylint: disable=protected-access
+        self._tf_input(index))
 
   def _add_control_inputs(self, ops):
     """Add a list of new control inputs to this operation.
@@ -2041,19 +1927,10 @@ class Operation(object):
       TypeError: if ops is not a list of Operations.
       ValueError: if any op in ops is from a different graph.
     """
-    if self._c_op:
-      for op in ops:
-        if not isinstance(op, Operation):
-          raise TypeError("op must be an Operation: %s" % op)
-        c_api.AddControlInput(self._graph._c_graph, self._c_op, op._c_op)  # pylint: disable=protected-access
-    else:
-      if ops:
-        for op in ops:
-          if not isinstance(op, Operation):
-            raise TypeError("op must be an Operation: %s" % op)
-          _assert_same_graph(self, op)
-          self._control_inputs_val.append(op)
-        self._recompute_node_def()
+    for op in ops:
+      if not isinstance(op, Operation):
+        raise TypeError("op must be an Operation: %s" % op)
+      c_api.AddControlInput(self._graph._c_graph, self._c_op, op._c_op)  # pylint: disable=protected-access
 
   def _add_control_input(self, op):
     """Add a new control input to this operation.
@@ -2065,33 +1942,13 @@ class Operation(object):
       TypeError: if op is not an Operation.
       ValueError: if op is from a different graph.
     """
-    if self._c_op:
-      if not isinstance(op, Operation):
-        raise TypeError("op must be an Operation: %s" % op)
-      c_api.AddControlInput(self._graph._c_graph, self._c_op, op._c_op)  # pylint: disable=protected-access
-    else:
-      self._add_control_inputs([op])
+    if not isinstance(op, Operation):
+      raise TypeError("op must be an Operation: %s" % op)
+    c_api.AddControlInput(self._graph._c_graph, self._c_op, op._c_op)  # pylint: disable=protected-access
 
   def _remove_all_control_inputs(self):
     """Removes any control inputs to this operation."""
-    if self._c_op:
-      c_api.RemoveAllControlInputs(self._graph._c_graph, self._c_op)  # pylint: disable=protected-access
-    else:
-      del self.control_inputs[:]
-
-  # Methods below are used when building the NodeDef and Graph proto.
-  def _recompute_node_def(self):
-    # TODO(skyewm): remove this function when we switch to C API
-    if self._c_op: return
-
-    del self._node_def_val.input[:]
-    # pylint: disable=protected-access
-    self._node_def_val.input.extend(
-        [t._as_node_def_input() for t in self._inputs_val])
-    # pylint: enable=protected-access
-    if self._control_inputs_val:
-      self._node_def_val.input.extend(
-          ["^%s" % op.name for op in self._control_inputs_val])
+    c_api.RemoveAllControlInputs(self._graph._c_graph, self._c_op)  # pylint: disable=protected-access
 
   def __str__(self):
     return str(self.node_def)
@@ -2132,19 +1989,16 @@ class Operation(object):
   @property
   def inputs(self):
     """The list of `Tensor` objects representing the data inputs of this op."""
-    if self._c_op:
-      if self._inputs_val is None:
-        tf_outputs = c_api.GetOperationInputs(self._c_op)
-        # pylint: disable=protected-access
-        retval = [
-            self.graph._get_tensor_by_tf_output(tf_output)
-            for tf_output in tf_outputs
-        ]
-        # pylint: enable=protected-access
-        self._inputs_val = Operation._InputList(retval)
-      return self._inputs_val
-    else:
-      return Operation._InputList(self._inputs_val)
+    if self._inputs_val is None:
+      tf_outputs = c_api.GetOperationInputs(self._c_op)
+      # pylint: disable=protected-access
+      retval = [
+          self.graph._get_tensor_by_tf_output(tf_output)
+          for tf_output in tf_outputs
+      ]
+      # pylint: enable=protected-access
+      self._inputs_val = Operation._InputList(retval)
+    return self._inputs_val
 
   @property
   def _inputs(self):
@@ -2158,15 +2012,12 @@ class Operation(object):
 
   @property
   def _input_types(self):
-    if self._c_op:
-      num_inputs = c_api.TF_OperationNumInputs(self._c_op)
-      input_types = [
-          dtypes.as_dtype(c_api.TF_OperationInputType(self._tf_input(i)))
-          for i in xrange(num_inputs)
-      ]
-      return input_types
-    else:
-      return self._input_types_val
+    num_inputs = c_api.TF_OperationNumInputs(self._c_op)
+    input_types = [
+        dtypes.as_dtype(c_api.TF_OperationInputType(self._tf_input(i)))
+        for i in xrange(num_inputs)
+    ]
+    return input_types
 
   @_input_types.setter
   def _input_types(self, value):
@@ -2186,16 +2037,13 @@ class Operation(object):
       A list of `Operation` objects.
 
     """
-    if self._c_op:
-      control_c_ops = c_api.TF_OperationGetControlInputs_wrapper(self._c_op)
-      # pylint: disable=protected-access
-      return [
-          self.graph._get_operation_by_name_unsafe(
-              c_api.TF_OperationName(c_op)) for c_op in control_c_ops
-      ]
-      # pylint: enable=protected-access
-    else:
-      return self._control_inputs_val
+    control_c_ops = c_api.TF_OperationGetControlInputs_wrapper(self._c_op)
+    # pylint: disable=protected-access
+    return [
+        self.graph._get_operation_by_name_unsafe(
+            c_api.TF_OperationName(c_op)) for c_op in control_c_ops
+    ]
+    # pylint: enable=protected-access
 
   @property
   def _control_outputs(self):
@@ -2208,18 +2056,13 @@ class Operation(object):
       A list of `Operation` objects.
 
     """
-    if self._c_op:
-      control_c_ops = c_api.TF_OperationGetControlOutputs_wrapper(self._c_op)
-      # pylint: disable=protected-access
-      return [
-          self.graph._get_operation_by_name_unsafe(
-              c_api.TF_OperationName(c_op)) for c_op in control_c_ops
-      ]
-      # pylint: enable=protected-access
-    else:
-      # TODO(apassos) this should be less inefficient.
-      return [o for o in self._graph.get_operations()
-              if self in o.control_inputs]
+    control_c_ops = c_api.TF_OperationGetControlOutputs_wrapper(self._c_op)
+    # pylint: disable=protected-access
+    return [
+        self.graph._get_operation_by_name_unsafe(
+            c_api.TF_OperationName(c_op)) for c_op in control_c_ops
+    ]
+    # pylint: enable=protected-access
 
   @property
   def _control_inputs(self):
@@ -2243,11 +2086,7 @@ class Operation(object):
   @property
   def type(self):
     """The type of the op (e.g. `"MatMul"`)."""
-    if self._c_op:
-      op_type = c_api.TF_OperationOpType(self._c_op)
-      return op_type
-    else:
-      return self._node_def_val.op
+    return c_api.TF_OperationOpType(self._c_op)
 
   @property
   def graph(self):
@@ -2265,15 +2104,12 @@ class Operation(object):
       protocol buffer.
     """
     # pylint: enable=line-too-long
-    if self._c_op:
-      with c_api_util.tf_buffer() as buf:
-        c_api.TF_OperationToNodeDef(self._c_op, buf)
-        data = c_api.TF_GetBuffer(buf)
-      node_def = node_def_pb2.NodeDef()
-      node_def.ParseFromString(compat.as_bytes(data))
-      return node_def
-    else:
-      return self._node_def_val
+    with c_api_util.tf_buffer() as buf:
+      c_api.TF_OperationToNodeDef(self._c_op, buf)
+      data = c_api.TF_GetBuffer(buf)
+    node_def = node_def_pb2.NodeDef()
+    node_def.ParseFromString(compat.as_bytes(data))
+    return node_def
 
   @property
   def _node_def(self):
@@ -2292,10 +2128,7 @@ class Operation(object):
       protocol buffer.
     """
     # pylint: enable=line-too-long
-    if self._c_op:
-      return self._graph._get_op_def(self.type)
-    else:
-      return self._op_def_val
+    return self._graph._get_op_def(self.type)
 
   @property
   def _op_def(self):
@@ -2321,17 +2154,14 @@ class Operation(object):
 
   def _set_attr(self, attr_name, attr_value):
     """Private method used to set an attribute in the node_def."""
-    if self._c_op:
-      buf = c_api.TF_NewBufferFromString(
-          compat.as_bytes(attr_value.SerializeToString()))
-      try:
-        # pylint: disable=protected-access
-        c_api.SetAttr(self._graph._c_graph, self._c_op, attr_name, buf)
-        # pylint: enable=protected-access
-      finally:
-        c_api.TF_DeleteBuffer(buf)
-    else:
-      self._node_def_val.attr[attr_name].CopyFrom(attr_value)
+    buf = c_api.TF_NewBufferFromString(
+        compat.as_bytes(attr_value.SerializeToString()))
+    try:
+      # pylint: disable=protected-access
+      c_api.SetAttr(self._graph._c_graph, self._c_op, attr_name, buf)
+      # pylint: enable=protected-access
+    finally:
+      c_api.TF_DeleteBuffer(buf)
 
   def get_attr(self, name):
     """Returns the value of the attr of this op with the given `name`.
@@ -2346,21 +2176,15 @@ class Operation(object):
       ValueError: If this op does not have an attr with the given `name`.
     """
     fields = ["s", "i", "f", "b", "type", "shape", "tensor", "func"]
-    if self._c_op:
-      try:
-        with c_api_util.tf_buffer() as buf:
-          c_api.TF_OperationGetAttrValueProto(self._c_op, name, buf)
-          data = c_api.TF_GetBuffer(buf)
-      except errors.InvalidArgumentError as e:
-        # Convert to ValueError for backwards compatibility.
-        raise ValueError(str(e))
-      x = attr_value_pb2.AttrValue()
-      x.ParseFromString(data)
-    else:
-      if name not in self._node_def_val.attr:
-        raise ValueError(
-            "No attr named '" + name + "' in " + str(self._node_def_val))
-      x = self._node_def_val.attr[name]
+    try:
+      with c_api_util.tf_buffer() as buf:
+        c_api.TF_OperationGetAttrValueProto(self._c_op, name, buf)
+        data = c_api.TF_GetBuffer(buf)
+    except errors.InvalidArgumentError as e:
+      # Convert to ValueError for backwards compatibility.
+      raise ValueError(str(e))
+    x = attr_value_pb2.AttrValue()
+    x.ParseFromString(data)
 
     # Treat an empty oneof value as an empty list.
     if not x.WhichOneof("value"):
@@ -2580,9 +2404,9 @@ def _set_shape_and_handle_data_for_outputs_c_api(op):
 def set_shape_and_handle_data_for_outputs(op):
   """Set the shapes and resource handle data for op's outputs.
 
-  When _USE_C_API = True, this is lazily called when a tensor's shape is first
-  requested. Usually this should work automatically, but some edge cases may
-  require manually calling this first to make sure Tensor._shape_val and
+  When _USE_C_SHAPES = False, this is lazily called when a tensor's shape is
+  first requested. Usually this should work automatically, but some edge cases
+  may require manually calling this first to make sure Tensor._shape_val and
   Tensor._handle_data are set (e.g. manually overriding _handle_data, copying a
   Tensor).
   """
@@ -3086,15 +2910,12 @@ class Graph(object):
       A `VersionDef`.
     """
     # pylint: enable=line-too-long
-    if self._c_graph:
-      with c_api_util.tf_buffer() as buf:
-        c_api.TF_GraphVersions(self._c_graph, buf)
-        data = c_api.TF_GetBuffer(buf)
-      version_def = versions_pb2.VersionDef()
-      version_def.ParseFromString(compat.as_bytes(data))
-      return version_def
-    else:
-      return self._graph_def_versions
+    with c_api_util.tf_buffer() as buf:
+      c_api.TF_GraphVersions(self._c_graph, buf)
+      data = c_api.TF_GetBuffer(buf)
+    version_def = versions_pb2.VersionDef()
+    version_def.ParseFromString(compat.as_bytes(data))
+    return version_def
 
   @property
   def seed(self):
@@ -3188,40 +3009,22 @@ class Graph(object):
 
     """
     # pylint: enable=line-too-long
-    if self._c_graph:
-      with self._lock:
-        with c_api_util.tf_buffer() as buf:
-          c_api.TF_GraphToGraphDef(self._c_graph, buf)
-          data = c_api.TF_GetBuffer(buf)
-        graph = graph_pb2.GraphDef()
-        graph.ParseFromString(compat.as_bytes(data))
-        # Strip the experimental library field iff it's empty.
-        if not graph.library.function:
-          graph.ClearField("library")
-
-        if add_shapes:
-          for node in graph.node:
-            op = self._nodes_by_name[node.name]
-            if op.outputs:
-              node.attr["_output_shapes"].list.shape.extend(
-                  [output.get_shape().as_proto() for output in op.outputs])
-    else:
-      with self._lock:
-        graph = graph_pb2.GraphDef()
-        graph.versions.CopyFrom(self._graph_def_versions)
-        bytesize = 0
-        for op_id in sorted(self._nodes_by_id):
-          op = self._nodes_by_id[op_id]
-          if from_version is None or op_id > from_version:
-            graph.node.extend([op.node_def])
-            if op.outputs and add_shapes:
-              assert "_output_shapes" not in graph.node[-1].attr
-              graph.node[-1].attr["_output_shapes"].list.shape.extend(
-                  [output.get_shape().as_proto() for output in op.outputs])
-            bytesize += op.node_def.ByteSize()
-            if bytesize >= (1 << 31) or bytesize < 0:
-              raise ValueError("GraphDef cannot be larger than 2GB.")
-        self._copy_functions_to_graph_def(graph, bytesize)
+    with self._lock:
+      with c_api_util.tf_buffer() as buf:
+        c_api.TF_GraphToGraphDef(self._c_graph, buf)
+        data = c_api.TF_GetBuffer(buf)
+      graph = graph_pb2.GraphDef()
+      graph.ParseFromString(compat.as_bytes(data))
+      # Strip the experimental library field iff it's empty.
+      if not graph.library.function:
+        graph.ClearField("library")
+
+      if add_shapes:
+        for node in graph.node:
+          op = self._nodes_by_name[node.name]
+          if op.outputs:
+            node.attr["_output_shapes"].list.shape.extend(
+                [output.get_shape().as_proto() for output in op.outputs])
     return graph, self._version
 
   def as_graph_def(self, from_version=None, add_shapes=False):
@@ -3295,34 +3098,16 @@ class Graph(object):
 
     # Add function to graph
     # pylint: disable=protected-access
-    if self._c_graph:
-      # Handle functions created without using the C API. TODO(apassos,skyewm)
-      # remove this when all functions are generated using the C API by default
-      # as this will be unnecessary.
-      if not function._c_func:
-        serialized = function.definition.SerializeToString()
-        c_func = c_api.TF_FunctionImportFunctionDef(serialized)
-        function._c_func = c_api_util.ScopedTFFunction(c_func)
-      gradient = (function._grad_func._c_func.func if function._grad_func
-                  else None)
-      c_api.TF_GraphCopyFunction(self._c_graph, function._c_func.func, gradient)
-    else:
-      # If there is already a function with the same name, raise an error
-      # if bodies are different. Else, do nothing. The C API version above
-      # has the same behavior.
-      previous = self._functions.get(name, None)
-      if previous:
-        # This check is not ideal as we can have a hash collision with only
-        # 32 bits in the hash, but the non C API mode is being deprecated.
-        # Don't bother changing it now.
-        if previous._hash_str == function._hash_str:
-          return
-        else:
-          raise ValueError("Cannot add function (%s, hash %s) to graph (%s). "
-                           "Another function (%s, hash %s) is already defined "
-                           "with that name (%s)" % (
-                               function, function._hash_str, self,
-                               previous, previous._hash_str, name))
+    # Handle functions created without using the C API. TODO(apassos,skyewm)
+    # remove this when all functions are generated using the C API by default
+    # as this will be unnecessary.
+    if not function._c_func:
+      serialized = function.definition.SerializeToString()
+      c_func = c_api.TF_FunctionImportFunctionDef(serialized)
+      function._c_func = c_api_util.ScopedTFFunction(c_func)
+    gradient = (function._grad_func._c_func.func if function._grad_func
+                else None)
+    c_api.TF_GraphCopyFunction(self._c_graph, function._c_func.func, gradient)
     # pylint: enable=protected-access
 
     self._functions[name] = function
@@ -3337,6 +3122,9 @@ class Graph(object):
     return self._building_function
 
   # Helper functions to create operations.
+  @deprecated_args(None,
+                   "Shapes are always computed; don't use the compute_shapes "
+                   "as it has no effect.", "compute_shapes")
   def create_op(
       self,
       op_type,
@@ -3373,8 +3161,8 @@ class Graph(object):
         proto).
       op_def: (Optional.) The `OpDef` proto that describes the `op_type` that
         the operation will have.
-      compute_shapes: (Optional.) If True, shape inference will be performed
-        to compute the shapes of the outputs.
+      compute_shapes: (Optional.) Deprecated. Has no effect (shapes are always
+        computed).
       compute_device: (Optional.) If True, device functions will be executed
         to compute the device property of the Operation.
 
@@ -3384,8 +3172,9 @@ class Graph(object):
 
     Returns:
       An `Operation` object.
-
     """
+    del compute_shapes
+
     self._check_not_finalized()
     for idx, a in enumerate(inputs):
       if not isinstance(a, Tensor):
@@ -3415,18 +3204,7 @@ class Graph(object):
           input_types=input_types,
           original_op=self._default_original_op,
           op_def=op_def)
-
-      # Note: shapes are lazily computed with the C API enabled.
-      #
-      # TODO(skyewm): unlike in the original Python implementation, the C API
-      # always computes shape information (even for function calls, which the
-      # original Python shape inference code doesn't handle). Deprecate the
-      # compute_shapes argument.
-      if not _USE_C_API and compute_shapes:
-        set_shape_and_handle_data_for_outputs(ret)
-
-      self._create_op_helper(ret, compute_shapes=compute_shapes,
-                             compute_device=compute_device)
+      self._create_op_helper(ret, compute_device=compute_device)
     return ret
 
   def _create_op_from_tf_operation(self, c_op, compute_device=True):
@@ -3461,11 +3239,8 @@ class Graph(object):
     self._create_op_helper(ret, compute_device=compute_device)
     return ret
 
-  def _create_op_helper(self, op, compute_shapes=True, compute_device=True):
+  def _create_op_helper(self, op, compute_device=True):
     """Common logic for creating an op in this graph."""
-    # TODO(b/XXXX): move to Operation.__init__ once _USE_C_API flag is removed.
-    self._add_op(op)
-
     # Apply any additional attributes requested. Do not overwrite any existing
     # attributes.
     for key, value in self._attr_scope_map.items():
@@ -3532,8 +3307,7 @@ class Graph(object):
     # (2) "is_stateful" is set in OpDef
     # (3) "container" attribute is in OpDef
     # (4) "container" attribute is None
-    # TODO(skyewm): remove op.op_def check when _USE_C_API is removed.
-    if self._container and op.op_def and op.op_def.is_stateful:
+    if self._container and op.op_def.is_stateful:
       try:
         container_attr = op.get_attr("container")
       except ValueError:
@@ -3820,17 +3594,14 @@ class Graph(object):
 
   def _get_op_def(self, type):  # pylint: disable=redefined-builtin
     """Returns the `OpDef` proto for `type`. `type` is a string."""
-    if self._c_graph:
-      with c_api_util.tf_buffer() as buf:
-        # pylint: disable=protected-access
-        c_api.TF_GraphGetOpDef(self._c_graph, compat.as_bytes(type), buf)
-        # pylint: enable=protected-access
-        data = c_api.TF_GetBuffer(buf)
-      op_def = op_def_pb2.OpDef()
-      op_def.ParseFromString(compat.as_bytes(data))
-      return op_def
-    else:
-      return self._registered_ops[type]
+    with c_api_util.tf_buffer() as buf:
+      # pylint: disable=protected-access
+      c_api.TF_GraphGetOpDef(self._c_graph, compat.as_bytes(type), buf)
+      # pylint: enable=protected-access
+      data = c_api.TF_GetBuffer(buf)
+    op_def = op_def_pb2.OpDef()
+    op_def.ParseFromString(compat.as_bytes(data))
+    return op_def
 
   def as_default(self):
     """Returns a context manager that makes this `Graph` the default graph.
@@ -3862,6 +3633,9 @@ class Graph(object):
       assert c.graph is g
     ```
 
+    If eager execution is enabled ops created under this context manager will be
+    added to the graph instead of executed eagerly.
+
     Returns:
       A context manager for using this graph as the default graph.
     """
@@ -3883,7 +3657,6 @@ class Graph(object):
         contains many standard names for collections.
       value: The value to add to the collection.
     """  # pylint: disable=g-doc-exception
-    _assert_collection_is_ok(name)
     self._check_not_finalized()
     with self._lock:
       if name not in self._collections:
@@ -3930,7 +3703,6 @@ class Graph(object):
       The list of values in the collection with the given `name`, or an empty
       list if no value has been added to that collection.
     """  # pylint: disable=g-doc-exception
-    _assert_collection_is_ok(name)
     with self._lock:
       coll_list = self._collections.get(name, None)
       if coll_list is None:
@@ -3960,7 +3732,6 @@ class Graph(object):
       list contains the values in the order under which they were
       collected.
     """  # pylint: disable=g-doc-exception
-    _assert_collection_is_ok(name)
     with self._lock:
       collection = self._collections.get(name, None)
       if collection is None:
@@ -5278,35 +5049,15 @@ class _DefaultGraphStack(_DefaultStack):  # pylint: disable=protected-access
   @tf_contextlib.contextmanager
   def get_controller(self, default):
     try:
-      if context.executing_eagerly():
-        # A Graph alone on the context stack would keep init_scope-wrapped
-        # operations graph building when entered (assuming init_scope is called
-        # in a graph building context). Instead, we push a context which first
-        # enables eager execution and then re-enters the Graph.
-        context.context().context_switches.push(
-            default.building_function,
-            functools.partial(
-                _enter_context_and_graph,
-                context.eager_mode,
-                default.as_default))
-      else:
-        # This Graph is being used from a graph building context. A lack of
-        # context switch implies that the context is graph building.
-        context.context().context_switches.push(default.building_function,
-                                                default.as_default)
-      with super(_DefaultGraphStack, self).get_controller(default) as g:
+      context.context().context_switches.push(
+          default.building_function, default.as_default)
+      with super(_DefaultGraphStack, self).get_controller(
+          default) as g, context.graph_mode():
         yield g
     finally:
       context.context().context_switches.pop()
 
 
-@tf_contextlib.contextmanager
-def _enter_context_and_graph(context_fn, graph_fn):
-  """Combines two context managers."""
-  with context_fn(), graph_fn():
-    yield
-
-
 _default_graph_stack = _DefaultGraphStack()
 
 
@@ -5850,7 +5601,8 @@ def add_to_collection(name, value):
     value: The value to add to the collection.
 
   @compatibility(eager)
-  Collections are not supported when eager execution is enabled.
+  Collections are only supported in eager when variables are created inside an
+  EagerVariableStore (e.g. as part of a layer or template).
   @end_compatibility
   """
   get_default_graph().add_to_collection(name, value)
@@ -5868,7 +5620,8 @@ def add_to_collections(names, value):
     value: The value to add to the collections.
 
   @compatibility(eager)
-  Collections are not supported when eager execution is enabled.
+  Collections are only supported in eager when variables are created inside an
+  EagerVariableStore (e.g. as part of a layer or template).
   @end_compatibility
   """
   get_default_graph().add_to_collections(names, value)
@@ -6161,14 +5914,6 @@ def get_from_proto_function(collection_name):
     return None
 
 
-def _assert_collection_is_ok(collection_name):
-  if context.executing_eagerly():
-    if collection_name in GraphKeys._VARIABLE_COLLECTIONS:  # pylint: disable=protected-access
-      raise ValueError(
-          "variable collections are not supported when eager execution is enabled."
-      )
-
-
 def _operation_conversion_error(op, dtype=None, name=None, as_ref=False):
   """Produce a nice error if someone converts an Operation to a Tensor."""
   raise TypeError(("Can't convert Operation '%s' to Tensor "
diff --git a/tensorflow/python/framework/ops_test.py b/tensorflow/python/framework/ops_test.py
index f8501b27aea3e47f37c60ecba3401a2b40ec77b9..750df4d8e3926fbaee7a38978457e448c21d64c7 100644
--- a/tensorflow/python/framework/ops_test.py
+++ b/tensorflow/python/framework/ops_test.py
@@ -270,7 +270,6 @@ class OperationTest(test_util.TensorFlowTestCase):
     op1 = ops.Operation(
         ops._NodeDef("RefOutputFloatOutput", "op1"), g, [],
         [dtypes.float32_ref, dtypes.float32])
-    g._add_op(op1)
     self.assertProtoEquals("op:'RefOutputFloatOutput' name:'op1'", op1.node_def)
     self.assertEquals([], list(op1.inputs))
     ref_t, nonref_t = op1.values()
@@ -279,14 +278,12 @@ class OperationTest(test_util.TensorFlowTestCase):
         ops._NodeDef("RefInputFloatInput", "op2"),
         g, [ref_t, nonref_t], [],
         input_types=[dtypes.float32_ref, dtypes.float32])
-    g._add_op(op2)
     self.assertProtoEquals(
         "op:'RefInputFloatInput' name:'op2' input:'op1' input:'op1:1'",
         op2.node_def)
     self.assertEquals([ref_t, nonref_t], list(op2.inputs))
     op3 = ops.Operation(
         ops._NodeDef("TwoFloatInputs", "op3"), g, [ref_t, nonref_t], [])
-    g._add_op(op3)
     self.assertProtoEquals(
         "op:'TwoFloatInputs' name:'op3' input:'op1' input:'op1:1'",
         op3.node_def)
@@ -2224,12 +2221,25 @@ class InitScopeTest(test_util.TensorFlowTestCase):
           self.assertEqual(ops.get_name_scope(), "inner")
       self.assertEqual(ops.get_name_scope(), "")
 
-  def testEagerGraphContextsExecuteEagerly(self):
+  def testEnteringGraphFromEagerIsSticky(self):
     with context.eager_mode():
+      g = ops.Graph()
+      with g.as_default():
+        with ops.init_scope():
+          self.assertFalse(context.executing_eagerly())
+          self.assertEqual(g, ops.get_default_graph())
+
+  def testMixGraphEager(self):
+    with context.eager_mode():
+      c = constant_op.constant(1.0)
       with ops.Graph().as_default():
-        with context.graph_mode():
-          with ops.init_scope():
-            self.assertTrue(context.executing_eagerly())
+        with self.assertRaisesRegexp(
+            RuntimeError, "Attempting to capture an EagerTensor"):
+          math_ops.add(c, c)
+        c2 = constant_op.constant(2.0)
+      with self.assertRaisesRegexp(
+          TypeError, "contains objects other than 'EagerTensor'"):
+        math_ops.add(c2, c2)
 
   def testPreservesNameScopeInEagerExecution(self):
     with context.eager_mode():
@@ -2263,6 +2273,11 @@ class GraphTest(test_util.TensorFlowTestCase):
       with g0.as_default():
         ops.reset_default_graph()
 
+  def testGraphContextManagerCancelsEager(self):
+    with context.eager_mode():
+      with ops.Graph().as_default():
+        self.assertFalse(context.executing_eagerly())
+
   def testGraphContextManager(self):
     g0 = ops.Graph()
     with g0.as_default() as g1:
diff --git a/tensorflow/python/framework/tensor_shape.py b/tensorflow/python/framework/tensor_shape.py
index 0dd29460ed93aadf61ef1f1b2dbf1d7802ca4877..c9be3d50056b2838e8cf39c3a17e1cff14e67ea0 100644
--- a/tensorflow/python/framework/tensor_shape.py
+++ b/tensorflow/python/framework/tensor_shape.py
@@ -961,9 +961,12 @@ def unknown_shape(ndims=None):
     return TensorShape([Dimension(None)] * ndims)
 
 
+_SCALAR_SHAPE = TensorShape([])
+
+
 def scalar():
   """Returns a shape representing a scalar."""
-  return TensorShape([])
+  return _SCALAR_SHAPE
 
 
 def vector(length):
diff --git a/tensorflow/python/framework/test_util.py b/tensorflow/python/framework/test_util.py
index 5b01df48fea19fa473658103341998b29920170c..5582b14249599f96453c0686fd89d56d4985531d 100644
--- a/tensorflow/python/framework/test_util.py
+++ b/tensorflow/python/framework/test_util.py
@@ -321,32 +321,6 @@ def NCHWToNHWC(input_tensor):
     return [input_tensor[a] for a in new_axes[ndims]]
 
 
-# TODO(skyewm): remove this eventually
-# pylint: disable=protected-access
-def _use_c_api_wrapper(fn, use_c_api, *args, **kwargs):
-  prev_value = ops._USE_C_API
-  ops._USE_C_API = use_c_api
-  try:
-    # Reset the default graph so it has the C API enabled. We call
-    # reset_default_graph() instead of creating a new default Graph context to
-    # make this robust to tests that call reset_default_graph(), which requires
-    # that the current default graph isn't nested.
-    ops.reset_default_graph()
-    fn(*args, **kwargs)
-  finally:
-    ops._USE_C_API = prev_value
-    # Make sure default graph reflects prev_value in case next test doesn't call
-    # reset_default_graph().
-    ops.reset_default_graph()
-
-
-# pylint: disable=protected-access
-
-
-def c_api_and_cuda_enabled():
-  return ops._USE_C_API and IsGoogleCudaEnabled()
-
-
 def skip_if(condition):
   """Skips the decorated function if condition is or evaluates to True.
 
@@ -372,46 +346,6 @@ def skip_if(condition):
   return real_skip_if
 
 
-# TODO(skyewm): remove this eventually
-def disable_c_api(fn):
-  """Decorator for disabling the C API on a test.
-
-  Note this disables the C API after running the test class's setup/teardown
-  methods.
-
-  Args:
-    fn: the function to be wrapped
-
-  Returns:
-    The wrapped function
-  """
-
-  def wrapper(*args, **kwargs):
-    _use_c_api_wrapper(fn, False, *args, **kwargs)
-
-  return wrapper
-
-
-# TODO(skyewm): remove this eventually
-def enable_c_api(fn):
-  """Decorator for enabling the C API on a test.
-
-  Note this enables the C API after running the test class's setup/teardown
-  methods.
-
-  Args:
-    fn: the function to be wrapped
-
-  Returns:
-    The wrapped function
-  """
-
-  def wrapper(*args, **kwargs):
-    _use_c_api_wrapper(fn, True, *args, **kwargs)
-
-  return wrapper
-
-
 def enable_c_shapes(fn):
   """Decorator for enabling C shapes on a test.
 
@@ -425,46 +359,19 @@ def enable_c_shapes(fn):
     The wrapped function
   """
 
+  # pylint: disable=protected-access
   def wrapper(*args, **kwargs):
     prev_value = ops._USE_C_SHAPES
-    # Only use C shapes if the C API is already enabled.
-    ops._USE_C_SHAPES = ops._USE_C_API
+    ops._USE_C_SHAPES = True
     try:
       fn(*args, **kwargs)
     finally:
       ops._USE_C_SHAPES = prev_value
+  # pylint: enable=protected-access
 
   return wrapper
 
 
-# This decorator is a hacky way to run all the test methods in a decorated
-# class with and without C API enabled.
-# TODO(iga): Remove this and its uses once we switch to using C API by default.
-def with_c_api(cls):
-  """Adds methods that call original methods but with C API enabled.
-
-  Note this enables the C API in new methods after running the test class's
-  setup method. This can be a problem if some objects are created in it
-  before the C API is enabled.
-
-  Args:
-    cls: class to decorate
-
-  Returns:
-    cls with new test methods added
-  """
-  # If the C API is already enabled, don't do anything. Some tests break if the
-  # same test is run twice, so this allows us to turn on the C API by default
-  # without breaking these tests.
-  if ops._USE_C_API:
-    return cls
-
-  for name, value in cls.__dict__.copy().items():
-    if callable(value) and name.startswith("test"):
-      setattr(cls, name + "WithCApi", enable_c_api(value))
-  return cls
-
-
 def with_c_shapes(cls):
   """Adds methods that call original methods but with C API shapes enabled.
 
@@ -556,12 +463,16 @@ def assert_no_new_tensors(f):
 
     tensors_before = set(
         id(obj) for obj in gc.get_objects() if _is_tensorflow_object(obj))
-    outside_graph_key = ops.get_default_graph()._graph_key
-    with ops.Graph().as_default():
+    if context.executing_eagerly():
+      f(self, **kwargs)
+      ops.reset_default_graph()
+    else:
       # Run the test in a new graph so that collections get cleared when it's
       # done, but inherit the graph key so optimizers behave.
-      ops.get_default_graph()._graph_key = outside_graph_key
-      f(self, **kwargs)
+      outside_graph_key = ops.get_default_graph()._graph_key
+      with ops.Graph().as_default():
+        ops.get_default_graph()._graph_key = outside_graph_key
+        f(self, **kwargs)
     # Make an effort to clear caches, which would otherwise look like leaked
     # Tensors.
     backprop._zeros_cache.flush()
@@ -640,6 +551,7 @@ def assert_no_garbage_created(f):
 
 
 def run_all_in_graph_and_eager_modes(cls):
+  """Execute all test methods in the given class with and without eager."""
   base_decorator = run_in_graph_and_eager_modes()
   for name, value in cls.__dict__.copy().items():
     if callable(value) and name.startswith("test"):
@@ -727,12 +639,12 @@ def run_in_graph_and_eager_modes(__unused__=None,
           f(self, **kwargs)
 
       if assert_no_eager_garbage:
+        ops.reset_default_graph()
         run_eagerly = assert_no_new_tensors(
             assert_no_garbage_created(run_eagerly))
 
       with context.eager_mode():
-        with ops.Graph().as_default():
-          run_eagerly(self, **kwargs)
+        run_eagerly(self, **kwargs)
 
     return decorated
 
@@ -1027,7 +939,9 @@ class TensorFlowTestCase(googletest.TestCase):
           rewriter_config_pb2.RewriterConfig.OFF)
       return config
 
-    if graph is None:
+    if context.executing_eagerly():
+      yield None
+    elif graph is None:
       if self._cached_session is None:
         self._cached_session = session.Session(
             graph=None, config=prepare_config(config))
@@ -1328,11 +1242,11 @@ class TensorFlowTestCase(googletest.TestCase):
             b,
             rtol=rtol,
             atol=atol,
-            msg="Mismatched value: a%s is different from b%s." % (path_str,
-                                                                  path_str))
+            msg=("Mismatched value: a%s is different from b%s. %s" %
+                 (path_str, path_str, msg)))
       except TypeError as e:
-        msg = "Error: a%s has %s, but b%s has %s" % (path_str, type(a),
-                                                     path_str, type(b))
+        msg = ("Error: a%s has %s, but b%s has %s. %s" %
+               (path_str, type(a), path_str, type(b), msg))
         e.args = ((e.args[0] + " : " + msg,) + e.args[1:])
         raise
 
diff --git a/tensorflow/python/framework/test_util_test.py b/tensorflow/python/framework/test_util_test.py
index 0f53762f6fab16f3fef2f5511a11acf53467e250..0178908bcc9c0613353e3beea8e1eb11638f9531 100644
--- a/tensorflow/python/framework/test_util_test.py
+++ b/tensorflow/python/framework/test_util_test.py
@@ -619,6 +619,7 @@ class GarbageCollectionTest(test_util.TensorFlowTestCase):
 
     ReferenceCycleTest().test_has_no_cycle()
 
+  @test_util.run_in_graph_and_eager_modes()
   def test_no_leaked_tensor_decorator(self):
 
     class LeakedTensorTest(object):
@@ -628,11 +629,11 @@ class GarbageCollectionTest(test_util.TensorFlowTestCase):
 
       @test_util.assert_no_new_tensors
       def test_has_leak(self):
-        self.a = constant_op.constant([3.])
+        self.a = constant_op.constant([3.], name="leak")
 
       @test_util.assert_no_new_tensors
       def test_has_no_leak(self):
-        constant_op.constant([3.])
+        constant_op.constant([3.], name="no-leak")
 
     with self.assertRaisesRegexp(AssertionError, "Tensors not deallocated"):
       LeakedTensorTest().test_has_leak()
diff --git a/tensorflow/python/grappler/layout_optimizer_test.py b/tensorflow/python/grappler/layout_optimizer_test.py
index 2d6925d1a825808ce133eb0404b5bd4925861723..af5d709f7e936e0438d5e03f60b44bc0017cb4b6 100644
--- a/tensorflow/python/grappler/layout_optimizer_test.py
+++ b/tensorflow/python/grappler/layout_optimizer_test.py
@@ -1389,7 +1389,7 @@ class LayoutOptimizerTest(test.TestCase):
       expected_num_transposes = 3
       self.assertEqual(expected_num_transposes, num_transposes)
       self._assert_trans_nhwc_to_nchw('map/while/Conv2D-0', nodes)
-      self._assert_trans_nchw_to_nhwc('map/while/Add-0-2', nodes)
+      self._assert_trans_nchw_to_nhwc('map/while/Add_1-0-2', nodes)
       self.assertAllClose(output_val_ref, output_val, atol=1e-3)
 
   def testLoopWithVecAnd4D(self):
@@ -1413,7 +1413,7 @@ class LayoutOptimizerTest(test.TestCase):
       expected_num_transposes = 2
       self.assertEqual(expected_num_transposes, num_transposes)
       self._assert_trans_nhwc_to_nchw('map/while/Conv2D-0', nodes)
-      self._assert_trans_nchw_to_nhwc('map/while/Add-0-2', nodes)
+      self._assert_trans_nchw_to_nhwc('map/while/Add_1-0-2', nodes)
       self.assertAllClose(output_val_ref, output_val, atol=1e-3)
 
   def testBinaryOpSecondPort(self):
diff --git a/tensorflow/python/keras/BUILD b/tensorflow/python/keras/BUILD
index 5d730695b9f818137da6f6ced821a941a4518f71..fe40c9fbed7c041ad6b6dc8cdb1c50b80f57a48f 100755
--- a/tensorflow/python/keras/BUILD
+++ b/tensorflow/python/keras/BUILD
@@ -135,6 +135,7 @@ py_library(
     deps = [
         ":backend",
         "//tensorflow/python/data",
+        "//tensorflow/python/training/checkpointable:data_structures_base",
         "@six_archive//:six",
     ],
 )
diff --git a/tensorflow/python/keras/__init__.py b/tensorflow/python/keras/__init__.py
index 197f3060970743d9bd245dc7018fa0503500d176..3493069a5bf53ffbfe6447f2c1b3df7ac64cbf3a 100644
--- a/tensorflow/python/keras/__init__.py
+++ b/tensorflow/python/keras/__init__.py
@@ -41,8 +41,12 @@ from tensorflow.python.keras.layers import Input
 from tensorflow.python.keras.models import Model
 from tensorflow.python.keras.models import Sequential
 
+from tensorflow.python.util.tf_export import tf_export
+
 __version__ = '2.1.6-tf'
 
+tf_export('keras.__version__').export_constant(__name__, '__version__')
+
 del absolute_import
 del division
 del print_function
diff --git a/tensorflow/python/keras/activations.py b/tensorflow/python/keras/activations.py
index a62dadb830cd1cce51fca6efd8cc071f22499cdd..f608dea430f0573503713f0cbc60f8921e6df51e 100644
--- a/tensorflow/python/keras/activations.py
+++ b/tensorflow/python/keras/activations.py
@@ -32,7 +32,7 @@ def softmax(x, axis=-1):
   """Softmax activation function.
 
   Arguments:
-      x : Tensor.
+      x : Input tensor.
       axis: Integer, axis along which the softmax normalization is applied.
 
   Returns:
@@ -49,28 +49,52 @@ def softmax(x, axis=-1):
     s = math_ops.reduce_sum(e, axis=axis, keepdims=True)
     return e / s
   else:
-    raise ValueError('Cannot apply softmax to a tensor that is 1D')
+    raise ValueError('Cannot apply softmax to a tensor that is 1D. '
+                     'Received input: %s' % (x,))
 
 
 @tf_export('keras.activations.elu')
 def elu(x, alpha=1.0):
+  """Exponential linear unit.
+
+  Arguments:
+      x: Input tensor.
+      alpha: A scalar, slope of negative section.
+
+  Returns:
+      The exponential linear activation: `x` if `x > 0` and
+        `alpha * (exp(x)-1)` if `x < 0`.
+
+  Reference:
+      - [Fast and Accurate Deep Network Learning by Exponential
+        Linear Units (ELUs)](https://arxiv.org/abs/1511.07289)
+  """
   return K.elu(x, alpha)
 
 
 @tf_export('keras.activations.selu')
 def selu(x):
-  """Scaled Exponential Linear Unit. (Klambauer et al., 2017).
+  """Scaled Exponential Linear Unit (SELU).
+
+  SELU is equal to: `scale * elu(x, alpha)`, where alpha and scale
+  are pre-defined constants. The values of `alpha` and `scale` are
+  chosen so that the mean and variance of the inputs are preserved
+  between two consecutive layers as long as the weights are initialized
+  correctly (see `lecun_normal` initialization) and the number of inputs
+  is "large enough" (see references for more information).
 
   Arguments:
       x: A tensor or variable to compute the activation function for.
 
   Returns:
-      Tensor with the same shape and dtype as `x`.
+      The scaled exponential unit activation: `scale * elu(x, alpha)`.
 
   # Note
       - To be used together with the initialization "lecun_normal".
       - To be used together with the dropout variant "AlphaDropout".
 
+  References:
+      - [Self-Normalizing Neural Networks](https://arxiv.org/abs/1706.02515)
   """
   alpha = 1.6732632423543772848170429916717
   scale = 1.0507009873554804934193349852946
@@ -79,16 +103,44 @@ def selu(x):
 
 @tf_export('keras.activations.softplus')
 def softplus(x):
+  """Softplus activation function.
+
+  Arguments:
+      x: Input tensor.
+
+  Returns:
+      The softplus activation: `log(exp(x) + 1)`.
+  """
   return nn.softplus(x)
 
 
 @tf_export('keras.activations.softsign')
 def softsign(x):
+  """Softsign activation function.
+
+  Arguments:
+      x: Input tensor.
+
+  Returns:
+      The softplus activation: `x / (abs(x) + 1)`.
+  """
   return nn.softsign(x)
 
 
 @tf_export('keras.activations.relu')
 def relu(x, alpha=0., max_value=None):
+  """Rectified Linear Unit.
+
+  Arguments:
+      x: Input tensor.
+      alpha: Slope of the negative part. Defaults to zero.
+      max_value: Maximum value for the output.
+
+  Returns:
+      The (leaky) rectified linear unit activation: `x` if `x > 0`,
+        `alpha * x` if `x < 0`. If `max_value` is defined, the result
+        is truncated to this value.
+  """
   return K.relu(x, alpha=alpha, max_value=max_value)
 
 
@@ -104,6 +156,19 @@ def sigmoid(x):
 
 @tf_export('keras.activations.hard_sigmoid')
 def hard_sigmoid(x):
+  """Hard sigmoid activation function.
+
+  Faster to compute than sigmoid activation.
+
+  Arguments:
+      x: Input tensor.
+
+  Returns:
+      Hard sigmoid activation:
+      - `0` if `x < -2.5`
+      - `1` if `x > 2.5`
+      - `0.2 * x + 0.5` if `-2.5 <= x <= 2.5`.
+  """
   return K.hard_sigmoid(x)
 
 
diff --git a/tensorflow/python/keras/backend.py b/tensorflow/python/keras/backend.py
index af3d1fa33d3431e7b13d1910a8581393e7b912c6..84821918bfe6160d7ee1b4556e00f533a07f5ebd 100644
--- a/tensorflow/python/keras/backend.py
+++ b/tensorflow/python/keras/backend.py
@@ -2973,30 +2973,29 @@ def rnn(step_function,
 
   Arguments:
       step_function: RNN step function.
-          Parameters;
-              input; tensor with shape `(samples, ...)` (no time dimension),
+          Args;
+              input; Tensor with shape `(samples, ...)` (no time dimension),
                   representing input for the batch of samples at a certain
                   time step.
-              states; list of tensors.
+              states; List of tensors.
           Returns;
-              output; tensor with shape `(samples, output_dim)`
+              output; Tensor with shape `(samples, output_dim)`
                   (no time dimension).
-              new_states; list of tensors, same length and shapes
+              new_states; List of tensors, same length and shapes
                   as 'states'. The first state in the list must be the
                   output tensor at the previous timestep.
-      inputs: tensor of temporal data of shape `(samples, time, ...)`
+      inputs: Tensor of temporal data of shape `(samples, time, ...)`
           (at least 3D).
-      initial_states: tensor with shape (samples, output_dim)
+      initial_states: Tensor with shape `(samples, output_dim)`
           (no time dimension),
           containing the initial values for the states used in
           the step function.
-      go_backwards: boolean. If True, do the iteration over the time
+      go_backwards: Boolean. If True, do the iteration over the time
           dimension in reverse order and return the reversed sequence.
-      mask: binary tensor with shape `(samples, time, 1)`,
+      mask: Binary tensor with shape `(samples, time, 1)`,
           with a zero for every element that is masked.
-      constants: a list of constant values passed at each step.
-      unroll: whether to unroll the RNN or to use a symbolic loop
-          (`while_loop` or `scan` depending on backend).
+      constants: List of constant values passed at each step.
+      unroll: Whether to unroll the RNN or to use a symbolic `while_loop`.
       input_length: If specified, assume time dimension is of this length.
 
   Returns:
@@ -3637,12 +3636,12 @@ def _preprocess_conv1d_input(x, data_format):
   Returns:
       A tensor.
   """
-  tf_data_format = 'NHWC'  # to pass TF Conv2dNative operations
+  tf_data_format = 'NWC'  # to pass TF Conv2dNative operations
   if data_format == 'channels_first':
     if not _has_nchw_support():
       x = array_ops.transpose(x, (0, 2, 1))  # NCW -> NWC
     else:
-      tf_data_format = 'NCHW'
+      tf_data_format = 'NCW'
   return x, tf_data_format
 
 
@@ -3741,10 +3740,8 @@ def conv1d(x,
     x = temporal_padding(x, (left_pad, 0))
     padding = 'valid'
   padding = _preprocess_padding(padding)
-  if data_format == 'channels_last':
-    tf_data_format = 'NWC'
-  else:
-    tf_data_format = 'NCW'
+
+  x, tf_data_format = _preprocess_conv1d_input(x, data_format)
   x = nn.convolution(
       input=x,
       filter=kernel,
@@ -3752,6 +3749,8 @@ def conv1d(x,
       strides=(strides,),
       padding=padding,
       data_format=tf_data_format)
+  if data_format == 'channels_first' and tf_data_format == 'NWC':
+    x = array_ops.transpose(x, (0, 2, 1))  # NWC -> NCW
   return x
 
 
@@ -3892,11 +3891,16 @@ def separable_conv1d(x,
   if data_format not in {'channels_first', 'channels_last'}:
     raise ValueError('Unknown data_format: ' + str(data_format))
 
+  if isinstance(strides, int):
+    strides = (strides,)
+  if isinstance(dilation_rate, int):
+    dilation_rate = (dilation_rate,)
+
   x, tf_data_format = _preprocess_conv1d_input(x, data_format)
   padding = _preprocess_padding(padding)
   if not isinstance(strides, tuple):
     strides = tuple(strides)
-  if tf_data_format == 'NHWC':
+  if tf_data_format == 'NWC':
     spatial_start_dim = 1
     strides = (1,) + strides * 2 + (1,)
   else:
@@ -3918,7 +3922,7 @@ def separable_conv1d(x,
 
   x = array_ops.squeeze(x, [spatial_start_dim])
 
-  if data_format == 'channels_first' and tf_data_format == 'NHWC':
+  if data_format == 'channels_first' and tf_data_format == 'NWC':
     x = array_ops.transpose(x, (0, 2, 1))  # NWC -> NCW
 
   return x
@@ -4242,7 +4246,11 @@ def local_conv1d(inputs, kernel, kernel_size, strides, data_format=None):
   """Apply 1D conv with un-shared weights.
 
   Arguments:
-      inputs: 3D tensor with shape: (batch_size, steps, input_dim)
+      inputs: 3D tensor with shape:
+              (batch_size, steps, input_dim)
+              if data_format is "channels_last" or
+              (batch_size, input_dim, steps)
+              if data_format is "channels_first".
       kernel: the unshared weight for convolution,
               with shape (output_length, feature_dim, filters)
       kernel_size: a tuple of a single integer,
@@ -4272,11 +4280,20 @@ def local_conv1d(inputs, kernel, kernel_size, strides, data_format=None):
   xs = []
   for i in range(output_length):
     slice_length = slice(i * stride, i * stride + kernel_size[0])
-    xs.append(reshape(inputs[:, slice_length, :], (1, -1, feature_dim)))
+    if data_format == 'channels_first':
+      xs.append(reshape(inputs[:, :, slice_length], (1, -1, feature_dim)))
+    else:
+      xs.append(reshape(inputs[:, slice_length, :], (1, -1, feature_dim)))
+
   x_aggregate = concatenate(xs, axis=0)
   # Shape: `(output_length, batch_size, filters)`.
   output = batch_dot(x_aggregate, kernel)
-  return permute_dimensions(output, (1, 0, 2))
+
+  if data_format == 'channels_first':
+    output = permute_dimensions(output, (1, 2, 0))
+  else:
+    output = permute_dimensions(output, (1, 0, 2))
+  return output
 
 
 def local_conv2d(inputs,
@@ -4704,8 +4721,13 @@ def foldr(fn, elems, initializer=None, name=None):
 
 
 # Load Keras default configuration from config file if present.
-_keras_base_dir = os.path.expanduser('~')
-_keras_dir = os.path.join(_keras_base_dir, '.keras')
+# Set Keras base dir path given KERAS_HOME env variable, if applicable.
+# Otherwise either ~/.keras or /tmp.
+if 'KERAS_HOME' in os.environ:
+  _keras_dir = os.environ.get('KERAS_HOME')
+else:
+  _keras_base_dir = os.path.expanduser('~')
+  _keras_dir = os.path.join(_keras_base_dir, '.keras')
 _config_path = os.path.expanduser(os.path.join(_keras_dir, 'keras.json'))
 if os.path.exists(_config_path):
   try:
diff --git a/tensorflow/python/keras/backend_test.py b/tensorflow/python/keras/backend_test.py
index 58df263a4f24278f8b61bd9e89f5d8af5e589c6d..53e30e0e4aeda1847c4f1867cb37b87841ad1ee7 100644
--- a/tensorflow/python/keras/backend_test.py
+++ b/tensorflow/python/keras/backend_test.py
@@ -810,6 +810,53 @@ class BackendNNOpsTest(test.TestCase):
                              padding='same', data_format='channels_last')
     self.assertEqual(y.get_shape().as_list(), [10, 5, 5])
 
+  def test_local_conv1d_channels_dim(self):
+    input_length = 5
+    input_dim = 3
+    batch_size = 2
+
+    inputs = np.random.normal(0, 1, (batch_size, input_dim, input_length))
+    inputs_cf = keras.backend.variable(inputs)
+
+    filters = 4
+    for kernel_size in [(1,), (2,), (3,)]:
+      for strides in [(1,), (2,), (3,)]:
+        output_length = (input_length - kernel_size[0]
+                         + strides[0]) // strides[0]
+
+        kernel_shape = (output_length, kernel_size[0] * input_dim, filters)
+        kernel = np.random.normal(0, 1, (output_length,
+                                         input_dim,
+                                         kernel_size[0],
+                                         filters))
+        kernel_cf = np.reshape(kernel, kernel_shape)
+        kernel_cf = keras.backend.variable(kernel_cf)
+
+        conv_cf = keras.backend.local_conv1d(inputs_cf,
+                                             kernel_cf,
+                                             kernel_size,
+                                             strides,
+                                             'channels_first')
+
+        inputs_cl = np.transpose(inputs, (0, 2, 1))
+        inputs_cl = keras.backend.variable(inputs_cl)
+
+        kernel_cl = np.reshape(np.transpose(kernel, (0, 2, 1, 3)),
+                               kernel_shape)
+        kernel_cl = keras.backend.variable(kernel_cl)
+
+        conv_cl = keras.backend.local_conv1d(inputs_cl,
+                                             kernel_cl,
+                                             kernel_size,
+                                             strides,
+                                             'channels_last')
+        with self.test_session():
+          conv_cf = keras.backend.eval(conv_cf)
+          conv_cl = keras.backend.eval(conv_cl)
+
+        self.assertAllCloseAccordingToType(conv_cf,
+                                           np.transpose(conv_cl, (0, 2, 1)))
+
   def test_conv2d(self):
     val = np.random.random((10, 4, 10, 10))
     x = keras.backend.variable(val)
diff --git a/tensorflow/python/keras/callbacks.py b/tensorflow/python/keras/callbacks.py
index a6dbe2ba71ae1aab025ab61820c5309d166414f8..9f91368e5bd772b47ac951a600f458126c1e12a6 100644
--- a/tensorflow/python/keras/callbacks.py
+++ b/tensorflow/python/keras/callbacks.py
@@ -424,7 +424,7 @@ class ModelCheckpoint(Callback):
 
     if mode not in ['auto', 'min', 'max']:
       logging.warning('ModelCheckpoint mode %s is unknown, '
-                      'fallback to auto mode.', (mode), RuntimeWarning)
+                      'fallback to auto mode.', mode)
       mode = 'auto'
 
     if mode == 'min':
@@ -451,7 +451,7 @@ class ModelCheckpoint(Callback):
         current = logs.get(self.monitor)
         if current is None:
           logging.warning('Can save best model only with %s available, '
-                          'skipping.', self.monitor, RuntimeWarning)
+                          'skipping.', self.monitor)
         else:
           if self.monitor_op(current, self.best):
             if self.verbose > 0:
@@ -515,7 +515,7 @@ class EarlyStopping(Callback):
 
     if mode not in ['auto', 'min', 'max']:
       logging.warning('EarlyStopping mode %s is unknown, '
-                      'fallback to auto mode.', mode, RuntimeWarning)
+                      'fallback to auto mode.', mode)
       mode = 'auto'
 
     if mode == 'min':
@@ -544,7 +544,7 @@ class EarlyStopping(Callback):
     if current is None:
       logging.warning('Early stopping conditioned on metric `%s` '
                       'which is not available. Available metrics are: %s',
-                      self.monitor, ','.join(list(logs.keys())), RuntimeWarning)
+                      self.monitor, ','.join(list(logs.keys())))
       return
     if self.monitor_op(current - self.min_delta, self.best):
       self.best = current
@@ -635,7 +635,11 @@ class LearningRateScheduler(Callback):
   def on_epoch_begin(self, epoch, logs=None):
     if not hasattr(self.model.optimizer, 'lr'):
       raise ValueError('Optimizer must have a "lr" attribute.')
-    lr = self.schedule(epoch)
+    try:  # new API
+      lr = float(K.get_value(self.model.optimizer.lr))
+      lr = self.schedule(epoch, lr)
+    except TypeError:  # Support for old API for backward compatibility
+      lr = self.schedule(epoch)
     if not isinstance(lr, (float, np.float32, np.float64)):
       raise ValueError('The output of the "schedule" function '
                        'should be float.')
@@ -901,7 +905,7 @@ class ReduceLROnPlateau(Callback):
     """
     if self.mode not in ['auto', 'min', 'max']:
       logging.warning('Learning Rate Plateau Reducing mode %s is unknown, '
-                      'fallback to auto mode.', self.mode, RuntimeWarning)
+                      'fallback to auto mode.', self.mode)
       self.mode = 'auto'
     if (self.mode == 'min' or
         (self.mode == 'auto' and 'acc' not in self.monitor)):
@@ -923,7 +927,7 @@ class ReduceLROnPlateau(Callback):
     if current is None:
       logging.warning('Reduce LR on plateau conditioned on metric `%s` '
                       'which is not available. Available metrics are: %s',
-                      self.monitor, ','.join(list(logs.keys())), RuntimeWarning)
+                      self.monitor, ','.join(list(logs.keys())))
 
     else:
       if self.in_cooldown():
diff --git a/tensorflow/python/keras/callbacks_test.py b/tensorflow/python/keras/callbacks_test.py
index eb40fb4acc11d278fd456b95af0f24058b0df7c1..5062a26580ddb10011fd04f9a6e75ee6d2adbc68 100644
--- a/tensorflow/python/keras/callbacks_test.py
+++ b/tensorflow/python/keras/callbacks_test.py
@@ -321,8 +321,26 @@ class KerasCallbacksTest(test.TestCase):
           callbacks=cbks,
           epochs=5,
           verbose=0)
-      assert (float(keras.backend.get_value(model.optimizer.lr)) - 0.2
-             ) < keras.backend.epsilon()
+      assert (
+          float(keras.backend.get_value(
+              model.optimizer.lr)) - 0.2) < keras.backend.epsilon()
+
+      cbks = [keras.callbacks.LearningRateScheduler(lambda x, lr: lr / 2)]
+      model.compile(
+          loss='categorical_crossentropy',
+          optimizer='sgd',
+          metrics=['accuracy'])
+      model.fit(
+          x_train,
+          y_train,
+          batch_size=BATCH_SIZE,
+          validation_data=(x_test, y_test),
+          callbacks=cbks,
+          epochs=2,
+          verbose=0)
+      assert (
+          float(keras.backend.get_value(
+              model.optimizer.lr)) - 0.01 / 4) < keras.backend.epsilon()
 
   def test_ReduceLROnPlateau(self):
     with self.test_session():
diff --git a/tensorflow/python/keras/datasets/boston_housing.py b/tensorflow/python/keras/datasets/boston_housing.py
index 8c043638c0d1167948d19b44bcc3272bcef4f830..4c4cab8c0865098ebed1a7fbe29246ef51bb9833 100644
--- a/tensorflow/python/keras/datasets/boston_housing.py
+++ b/tensorflow/python/keras/datasets/boston_housing.py
@@ -39,9 +39,10 @@ def load_data(path='boston_housing.npz', test_split=0.2, seed=113):
       Tuple of Numpy arrays: `(x_train, y_train), (x_test, y_test)`.
   """
   assert 0 <= test_split < 1
+  origin_folder = 'https://storage.googleapis.com/tensorflow/tf-keras-datasets/'
   path = get_file(
       path,
-      origin='https://s3.amazonaws.com/keras-datasets/boston_housing.npz',
+      origin=origin_folder + 'boston_housing.npz',
       file_hash=
       'f553886a1f8d56431e820c5b82552d9d95cfcb96d1e678153f8839538947dff5')
   f = np.load(path)
diff --git a/tensorflow/python/keras/datasets/fashion_mnist.py b/tensorflow/python/keras/datasets/fashion_mnist.py
index 45e27aad34f6584721b11d306a3f24e78f2ed48b..3f4c6c7413e01313fda051a5603f223f9f7c4d27 100644
--- a/tensorflow/python/keras/datasets/fashion_mnist.py
+++ b/tensorflow/python/keras/datasets/fashion_mnist.py
@@ -33,9 +33,15 @@ def load_data():
 
   Returns:
       Tuple of Numpy arrays: `(x_train, y_train), (x_test, y_test)`.
+
+  License:
+      The copyright for Fashion-MNIST is held by Zalando SE.
+      Fashion-MNIST is licensed under the [MIT license](
+      https://github.com/zalandoresearch/fashion-mnist/blob/master/LICENSE).
+
   """
   dirname = os.path.join('datasets', 'fashion-mnist')
-  base = 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/'
+  base = 'https://storage.googleapis.com/tensorflow/tf-keras-datasets/'
   files = [
       'train-labels-idx1-ubyte.gz', 'train-images-idx3-ubyte.gz',
       't10k-labels-idx1-ubyte.gz', 't10k-images-idx3-ubyte.gz'
diff --git a/tensorflow/python/keras/datasets/imdb.py b/tensorflow/python/keras/datasets/imdb.py
index 411b3e8635f5b95b19375fb5f4686b20c643190f..b73b024162ac3fde4c430c34ff4f0f7b1174abe6 100644
--- a/tensorflow/python/keras/datasets/imdb.py
+++ b/tensorflow/python/keras/datasets/imdb.py
@@ -77,9 +77,10 @@ def load_data(path='imdb.npz',
   if kwargs:
     raise TypeError('Unrecognized keyword arguments: ' + str(kwargs))
 
+  origin_folder = 'https://storage.googleapis.com/tensorflow/tf-keras-datasets/'
   path = get_file(
       path,
-      origin='https://s3.amazonaws.com/text-datasets/imdb.npz',
+      origin=origin_folder + 'imdb.npz',
       file_hash='599dadb1135973df5b59232a0e9a887c')
   with np.load(path) as f:
     x_train, labels_train = f['x_train'], f['y_train']
@@ -140,9 +141,10 @@ def get_word_index(path='imdb_word_index.json'):
   Returns:
       The word index dictionary.
   """
+  origin_folder = 'https://storage.googleapis.com/tensorflow/tf-keras-datasets/'
   path = get_file(
       path,
-      origin='https://s3.amazonaws.com/text-datasets/imdb_word_index.json',
+      origin=origin_folder + 'imdb_word_index.json',
       file_hash='bfafd718b763782e994055a2d397834f')
   with open(path) as f:
     return json.load(f)
diff --git a/tensorflow/python/keras/datasets/mnist.py b/tensorflow/python/keras/datasets/mnist.py
index 631189731a91bf352ed347b1887f80f17860c807..03564accc74507713d198b8ba1ed8c08bd597e8d 100644
--- a/tensorflow/python/keras/datasets/mnist.py
+++ b/tensorflow/python/keras/datasets/mnist.py
@@ -34,10 +34,18 @@ def load_data(path='mnist.npz'):
 
   Returns:
       Tuple of Numpy arrays: `(x_train, y_train), (x_test, y_test)`.
+
+  License:
+      Yann LeCun and Corinna Cortes hold the copyright of MNIST dataset,
+      which is a derivative work from original NIST datasets.
+      MNIST dataset is made available under the terms of the
+      [Creative Commons Attribution-Share Alike 3.0 license.](
+      https://creativecommons.org/licenses/by-sa/3.0/)
   """
+  origin_folder = 'https://storage.googleapis.com/tensorflow/tf-keras-datasets/'
   path = get_file(
       path,
-      origin='https://s3.amazonaws.com/img-datasets/mnist.npz',
+      origin=origin_folder + 'mnist.npz',
       file_hash='8a61469f7ea1b51cbae51d4f78837e45')
   f = np.load(path)
   x_train, y_train = f['x_train'], f['y_train']
diff --git a/tensorflow/python/keras/datasets/reuters.py b/tensorflow/python/keras/datasets/reuters.py
index b070ba8d125614e4c555915e4acdbf8a8803863d..2120b4b2421c652c9587a2e644bf008c3ece3980 100644
--- a/tensorflow/python/keras/datasets/reuters.py
+++ b/tensorflow/python/keras/datasets/reuters.py
@@ -75,9 +75,10 @@ def load_data(path='reuters.npz',
   if kwargs:
     raise TypeError('Unrecognized keyword arguments: ' + str(kwargs))
 
+  origin_folder = 'https://storage.googleapis.com/tensorflow/tf-keras-datasets/'
   path = get_file(
       path,
-      origin='https://s3.amazonaws.com/text-datasets/reuters.npz',
+      origin=origin_folder + 'reuters.npz',
       file_hash='87aedbeb0cb229e378797a632c1997b6')
   with np.load(path) as f:
     xs, labels = f['x'], f['y']
@@ -124,9 +125,10 @@ def get_word_index(path='reuters_word_index.json'):
   Returns:
       The word index dictionary.
   """
+  origin_folder = 'https://storage.googleapis.com/tensorflow/tf-keras-datasets/'
   path = get_file(
       path,
-      origin='https://s3.amazonaws.com/text-datasets/reuters_word_index.json',
+      origin=origin_folder + 'reuters_word_index.json',
       file_hash='4d44cc38712099c9e383dc6e5f11a921')
   f = open(path)
   data = json.load(f)
diff --git a/tensorflow/python/keras/engine/base_layer.py b/tensorflow/python/keras/engine/base_layer.py
index 24716cfbe4978c6fd5a7884a581524f804817d60..4814275fd5ba53e7845f383b3447a6ef9f47f6c2 100644
--- a/tensorflow/python/keras/engine/base_layer.py
+++ b/tensorflow/python/keras/engine/base_layer.py
@@ -19,6 +19,7 @@ from __future__ import division
 from __future__ import print_function
 
 import collections
+import enum  # pylint: disable=g-bad-import-order
 import inspect  # Necessary supplement to tf_inspect to deal with variadic args.
 
 import numpy as np
@@ -50,6 +51,20 @@ from tensorflow.python.util import tf_inspect
 from tensorflow.python.util.tf_export import tf_export
 
 
+class CallConvention(enum.Enum):
+  """Calling conventions for passing `Layer` inputs to `Layer.call`."""
+  # The Layer takes inputs as its first argument, named "inputs" for
+  # compatibility with the signature of Layer.__call__. This is the mode assumed
+  # for Layers which are not subclassed Models.
+  EXPLICIT_INPUTS_ARGUMENT = 1
+  # The Layer takes a single positional argument, not named "inputs". It's
+  # treated like an "inputs" argument.
+  SINGLE_POSITIONAL_ARGUMENT = 2
+  # The Layer has multiple positional arguments to which its inputs should be
+  # bound.
+  POSITIONAL_ARGUMENTS_ARE_INPUTS = 3
+
+
 @tf_export('keras.layers.Layer')
 class Layer(checkpointable.CheckpointableBase):
   """Base layer class.
@@ -149,7 +164,7 @@ class Layer(checkpointable.CheckpointableBase):
     self._call_fn_args = function_utils.fn_args(self.call)
     self._compute_previous_mask = ('mask' in self._call_fn_args or
                                    hasattr(self, 'compute_mask'))
-    self._uses_inputs_arg = True
+    self._call_convention = CallConvention.EXPLICIT_INPUTS_ARGUMENT
 
     # These lists will be filled via successive calls
     # to self._add_inbound_node().
@@ -793,12 +808,22 @@ class Layer(checkpointable.CheckpointableBase):
           pass  # C type such as dict. Masking not supported in this case.
 
   def _set_connectivity_metadata_(self, inputs, outputs, args, kwargs):
-    if args and getattr(self, '_uses_inputs_arg', True):
-      raise TypeError(
-          'This Layer takes an `inputs` argument to call(), and only the '
-          '`inputs` argument may be specified as a positional argument. '
-          'Pass everything else as a keyword argument (those arguments will'
-          ' not be tracked as inputs to the Layer).')
+    call_convention = getattr(self, '_call_convention',
+                              CallConvention.EXPLICIT_INPUTS_ARGUMENT)
+    if args:
+      if call_convention == CallConvention.EXPLICIT_INPUTS_ARGUMENT:
+        raise TypeError(
+            'This Layer takes an `inputs` argument to call(), and only the '
+            '`inputs` argument may be specified as a positional argument. '
+            'Pass everything else as a keyword argument (those arguments will'
+            ' not be tracked as inputs to the Layer).')
+      elif call_convention == CallConvention.SINGLE_POSITIONAL_ARGUMENT:
+        raise TypeError(
+            'This Layer takes a single positional argument to call(), which is '
+            'by convention the inputs argument, and only this argument may be '
+            'specified as a positional argument. Pass everything else as a '
+            'keyword argument (those arguments will not be tracked as inputs '
+            'to the Layer).')
 
     # If the layer returns tensors from its inputs, unmodified,
     # we copy them to avoid loss of tensor metadata.
@@ -834,7 +859,11 @@ class Layer(checkpointable.CheckpointableBase):
       A tuple of (inputs, non_input_kwargs). These may be the same objects as
       were passed in (call_args and call_kwargs).
     """
-    if getattr(self, '_uses_inputs_arg', True):
+    call_convention = getattr(self, '_call_convention',
+                              CallConvention.EXPLICIT_INPUTS_ARGUMENT)
+    if (call_convention in (
+        CallConvention.EXPLICIT_INPUTS_ARGUMENT,
+        CallConvention.SINGLE_POSITIONAL_ARGUMENT)):
       assert len(call_args) == 1  # TypeError raised earlier in __call__.
       return call_args[0], call_kwargs
     else:
diff --git a/tensorflow/python/keras/engine/input_layer.py b/tensorflow/python/keras/engine/input_layer.py
index b04dc3c60be2a9e13cb1bc56ef12f0de36ed105c..7996110829b56b6f7f0b3e4c5b6c5b9f35affb64 100644
--- a/tensorflow/python/keras/engine/input_layer.py
+++ b/tensorflow/python/keras/engine/input_layer.py
@@ -119,6 +119,12 @@ class InputLayer(base_layer.Layer):
       self.is_placeholder = False
       self._batch_input_shape = tuple(input_tensor.get_shape().as_list())
 
+      if context.executing_eagerly():
+        raise ValueError('You should not pass an input tensor when executing '
+                         'in eager mode. For example, instead of creating an '
+                         'InputLayer, you should instantiate your model and '
+                         'directly call it on your input.')
+
     # Create an input node to add to self.outbound_node
     # and set output_tensors' _keras_history.
     input_tensor._keras_history = (self, 0, 0)  # pylint: disable=protected-access
diff --git a/tensorflow/python/keras/engine/network.py b/tensorflow/python/keras/engine/network.py
index 552e7b7d735137af7edaa2f475a3af00bb4ba5d3..427efaaf11ecd964d72b3e34233920d2cdfeaeeb 100644
--- a/tensorflow/python/keras/engine/network.py
+++ b/tensorflow/python/keras/engine/network.py
@@ -20,6 +20,7 @@ from __future__ import division
 from __future__ import print_function
 
 import copy
+import functools
 import json
 import os
 import weakref
@@ -36,11 +37,13 @@ from tensorflow.python.keras import backend
 from tensorflow.python.keras.engine import base_layer
 from tensorflow.python.keras.engine import saving
 from tensorflow.python.keras.utils import generic_utils
+from tensorflow.python.keras.utils import layer_utils
 from tensorflow.python.keras.utils import tf_utils
 from tensorflow.python.keras.utils.io_utils import ask_to_proceed_with_overwrite
-from tensorflow.python.keras.utils.layer_utils import print_summary as print_layer_summary
+from tensorflow.python.ops import variables
 from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.training.checkpointable import base as checkpointable
+from tensorflow.python.training.checkpointable import data_structures_base
 from tensorflow.python.training.checkpointable import util as checkpointable_utils
 from tensorflow.python.util import nest
 from tensorflow.python.util import tf_inspect
@@ -93,6 +96,11 @@ class Network(base_layer.Layer):
     self.trainable = True
     self._is_compiled = False
     self._expects_training_arg = False
+    # A list of "extra" variables assigned to attributes of this class, included
+    # in self.weights and self.variables. Always empty for graph networks (but
+    # included in base_init to avoid excessive special casing when retrieving
+    # the value).
+    self._extra_variables = []
 
     self.supports_masking = False
     if not hasattr(self, 'optimizer'):
@@ -127,7 +135,7 @@ class Network(base_layer.Layer):
     self._in_progress_restore_finalizer = None
 
   def _init_graph_network(self, inputs, outputs, name=None):
-    self._uses_inputs_arg = True
+    self._call_convention = base_layer.CallConvention.EXPLICIT_INPUTS_ARGUMENT
     # Normalize and set self.inputs, self.outputs.
     if isinstance(inputs, (list, tuple)):
       self.inputs = list(inputs)  # Tensor or list of tensors.
@@ -287,19 +295,55 @@ class Network(base_layer.Layer):
   def _init_subclassed_network(self, name=None):
     self._base_init(name=name)
     self._is_graph_network = False
-    call_args = tf_inspect.getargspec(self.call).args
-    if 'training' in call_args:
+    call_argspec = tf_inspect.getargspec(self.call)
+    if 'training' in call_argspec.args:
       self._expects_training_arg = True
     else:
       self._expects_training_arg = False
-    if 'inputs' in call_args:
-      self._uses_inputs_arg = True
-    else:
-      self._uses_inputs_arg = False
+    self._call_convention = self._determine_call_convention(call_argspec)
     self.outputs = None
     self.inputs = None
     self.built = False
 
+  def _determine_call_convention(self, call_argspec):
+    """Decides how `self.call()` is invoked. See base_layer.CallConvention."""
+    if call_argspec.varargs:
+      may_take_single_argument = False
+    else:
+      try:
+        # Note: tf_inspect doesn't raise a TypeError when regular inspect would,
+        # so we need to keep in mind that "getcallargs" may have returned
+        # something even though we under-specified positional arguments.
+        all_args = tf_inspect.getcallargs(self.call, None)
+        self_args = set()
+        for arg_name, obj in all_args.items():
+          if obj is self:
+            self_args.add(arg_name)
+        may_take_single_argument = True
+      except TypeError:
+        may_take_single_argument = False
+    if may_take_single_argument:
+      # A single positional argument (plus "self") is considered equivalent to
+      # an "inputs" argument.
+      all_positional_args = len(call_argspec.args)
+      if call_argspec.defaults is not None:
+        all_positional_args -= len(call_argspec.defaults)
+      non_self_positional_args = all_positional_args
+      for positional_arg_name in call_argspec.args[:all_positional_args]:
+        if positional_arg_name in self_args:
+          non_self_positional_args -= 1
+      if non_self_positional_args == 1:
+        if 'inputs' in call_argspec.args[all_positional_args:]:
+          raise TypeError(
+              "Model.call() takes a single positional argument (to which "
+              "inputs are passed by convention) and a separate 'inputs' "
+              "argument. Unable to determine which arguments are inputs.")
+        return base_layer.CallConvention.SINGLE_POSITIONAL_ARGUMENT
+    if 'inputs' in call_argspec.args:
+      return base_layer.CallConvention.EXPLICIT_INPUTS_ARGUMENT
+    else:
+      return base_layer.CallConvention.POSITIONAL_ARGUMENTS_ARE_INPUTS
+
   def _track_layers(self, layers):
     """Add Checkpointable dependencies on a list of Layers."""
     weight_layer_index = 0
@@ -321,7 +365,10 @@ class Network(base_layer.Layer):
     no_dependency = isinstance(value, checkpointable.NoDependency)
     if no_dependency:
       value = value.value
-    if isinstance(value, (base_layer.Layer, Network)):
+    if isinstance(value, (
+        base_layer.Layer,
+        Network,
+        data_structures_base.CheckpointableDataStructureBase)):
       try:
         is_graph_network = self._is_graph_network
       except AttributeError:
@@ -343,11 +390,22 @@ class Network(base_layer.Layer):
       # layers). Therefore Model tracks Checkpointable objects itself.
       self._track_checkpointable(
           checkpointable=value, name=name, overwrite=True)
+      if (  # For subclassed models only, users may add extra weights/variables
+            # simply by assigning them to attributes.
+          not self._is_graph_network
+          and isinstance(value, variables.Variable)):
+        self._extra_variables.append(value)
     super(Network, self).__setattr__(name, value)
 
   def add_variable(self, name, shape, dtype=None, initializer=None,
                    regularizer=None, trainable=True, constraint=None):
-    raise NotImplementedError('`add_variable` is not supported on Networks.')
+    if self._is_graph_network:
+      raise NotImplementedError('`add_variable` is not supported on Networks.')
+    else:
+      raise NotImplementedError(
+          '`add_variable` is not supported on Networks. However, you may '
+          'assign variables to attributes and they will show up in the weights '
+          'and variables properties.')
 
   def add_loss(self, *args, **kwargs):
     if context.executing_eagerly():
@@ -585,24 +643,17 @@ class Network(base_layer.Layer):
 
   @property
   def trainable_weights(self):
-    if not self.trainable:
-      return []
-    weights = []
-    for layer in self.layers:
-      weights += layer.trainable_weights
-    return weights
+    return layer_utils.gather_trainable_weights(
+        trainable=self.trainable,
+        sub_layers=self.layers,
+        extra_variables=self._extra_variables)
 
   @property
   def non_trainable_weights(self):
-    weights = []
-    for layer in self.layers:
-      weights += layer.non_trainable_weights
-    if not self.trainable:
-      trainable_weights = []
-      for layer in self.layers:
-        trainable_weights += layer.trainable_weights
-      return trainable_weights + weights
-    return weights
+    return layer_utils.gather_non_trainable_weights(
+        trainable=self.trainable,
+        sub_layers=self.layers,
+        extra_variables=self._extra_variables)
 
   @property
   def input_spec(self):
@@ -1250,7 +1301,11 @@ class Network(base_layer.Layer):
       with h5py.File(filepath, 'w') as f:
         saving.save_weights_to_hdf5_group(f, self.layers)
     else:
-      self._checkpointable_saver.save(filepath)
+      if context.executing_eagerly():
+        session = None
+      else:
+        session = backend.get_session()
+      self._checkpointable_saver.save(filepath, session=session)
 
   def load_weights(self, filepath, by_name=False):
     """Loads all layer weights, either from a TensorFlow or an HDF5 weight file.
@@ -1310,7 +1365,8 @@ class Network(base_layer.Layer):
             'loading TensorFlow-formatted weights (got by_name=True to '
             'load_weights).')
       if not context.executing_eagerly():
-        finalizer = status.run_restore_ops
+        session = backend.get_session()
+        finalizer = functools.partial(status.run_restore_ops, session=session)
         if self.built:
           finalizer()
         else:
@@ -1407,7 +1463,8 @@ class Network(base_layer.Layer):
         ImportError: if yaml module is not found.
     """
     if yaml is None:
-      raise ImportError('Requires yaml module installed.')
+      raise ImportError(
+          'Requires yaml module installed (`pip install pyyaml`).')
     return yaml.dump(self._updated_config(), **kwargs)
 
   def summary(self, line_length=None, positions=None, print_fn=None):
@@ -1424,11 +1481,19 @@ class Network(base_layer.Layer):
             It will be called on each line of the summary.
             You can set it to a custom function
             in order to capture the string summary.
+
+    Raises:
+        ValueError: if `summary()` is called before the model is built.
     """
-    print_layer_summary(self,
-                        line_length=line_length,
-                        positions=positions,
-                        print_fn=print_fn)
+    if not self.built:
+      raise ValueError('This model has never been called, thus its weights '
+                       'have not yet been created, so no summary can be '
+                       'displayed. Build the model first '
+                       '(e.g. by calling it on some data).')
+    layer_utils.print_summary(self,
+                              line_length=line_length,
+                              positions=positions,
+                              print_fn=print_fn)
 
 
 def get_source_inputs(tensor, layer=None, node_index=None):
diff --git a/tensorflow/python/keras/engine/saving.py b/tensorflow/python/keras/engine/saving.py
index 99ce64a469db97f27372f4821fdc5c4f7565c763..b9a2e1f25f637dc8017f751bbdd400c1e5c9dd44 100644
--- a/tensorflow/python/keras/engine/saving.py
+++ b/tensorflow/python/keras/engine/saving.py
@@ -106,7 +106,7 @@ def save_model(model, filepath, overwrite=True, include_optimizer=True):
     model_layers = model.layers
     save_weights_to_hdf5_group(model_weights_group, model_layers)
 
-    if include_optimizer and hasattr(model, 'optimizer'):
+    if include_optimizer and model.optimizer:
       if isinstance(model.optimizer, optimizers.TFOptimizer):
         logging.warning(
             'TensorFlow optimizers do not '
@@ -323,7 +323,7 @@ def model_from_yaml(yaml_string, custom_objects=None):
       ImportError: if yaml module is not found.
   """
   if yaml is None:
-    raise ImportError('Requires yaml module installed.')
+    raise ImportError('Requires yaml module installed (`pip install pyyaml`).')
   config = yaml.load(yaml_string)
   from tensorflow.python.keras.layers import deserialize  # pylint: disable=g-import-not-at-top
   return deserialize(config, custom_objects=custom_objects)
diff --git a/tensorflow/python/keras/engine/saving_test.py b/tensorflow/python/keras/engine/saving_test.py
index 30bcd3d1855d28012ce0ae747b900468c57465e0..7e82db028b8db01d6fe2b693e2087cbdfac55314 100644
--- a/tensorflow/python/keras/engine/saving_test.py
+++ b/tensorflow/python/keras/engine/saving_test.py
@@ -288,6 +288,30 @@ class TestWholeModelSaving(test.TestCase):
       out2 = new_model.predict(x)
       self.assertAllClose(out, out2, atol=1e-05)
 
+  def test_sequential_model_saving_without_compile(self):
+    if h5py is None:
+      self.skipTest('h5py required to run this test')
+
+    with self.test_session():
+      model = keras.models.Sequential()
+      model.add(keras.layers.Dense(2, input_shape=(3,)))
+      model.add(keras.layers.RepeatVector(3))
+      model.add(keras.layers.TimeDistributed(keras.layers.Dense(3)))
+
+      x = np.random.random((1, 3))
+      out = model.predict(x)
+      fd, fname = tempfile.mkstemp('.h5')
+
+      # Save the model without any compilation or training.
+      keras.models.save_model(model, fname)
+
+      new_model = keras.models.load_model(fname)
+      os.close(fd)
+      os.remove(fname)
+
+      out2 = new_model.predict(x)
+      self.assertAllClose(out, out2, atol=1e-05)
+
   def test_sequential_model_saving_2(self):
     if h5py is None:
       self.skipTest('h5py required to run this test')
@@ -404,26 +428,27 @@ class TestWholeModelSaving(test.TestCase):
       os.remove(fname)
 
   def test_saving_lambda_numpy_array_arguments(self):
-    if h5py is None:
-      self.skipTest('h5py required to run this test')
+    with self.test_session():
+      if h5py is None:
+        self.skipTest('h5py required to run this test')
 
-    mean = np.random.random((4, 2, 3))
-    std = np.abs(np.random.random((4, 2, 3))) + 1e-5
-    inputs = keras.layers.Input(shape=(4, 2, 3))
-    output = keras.layers.Lambda(lambda image, mu, std: (image - mu) / std,
-                                 arguments={'mu': mean, 'std': std})(inputs)
-    model = keras.models.Model(inputs, output)
-    model.compile(loss='mse', optimizer='sgd', metrics=['acc'])
+      mean = np.random.random((4, 2, 3))
+      std = np.abs(np.random.random((4, 2, 3))) + 1e-5
+      inputs = keras.layers.Input(shape=(4, 2, 3))
+      output = keras.layers.Lambda(lambda image, mu, std: (image - mu) / std,
+                                   arguments={'mu': mean, 'std': std})(inputs)
+      model = keras.models.Model(inputs, output)
+      model.compile(loss='mse', optimizer='sgd', metrics=['acc'])
 
-    fd, fname = tempfile.mkstemp('.h5')
-    keras.models.save_model(model, fname)
+      fd, fname = tempfile.mkstemp('.h5')
+      keras.models.save_model(model, fname)
 
-    model = keras.models.load_model(fname)
-    os.close(fd)
-    os.remove(fname)
+      model = keras.models.load_model(fname)
+      os.close(fd)
+      os.remove(fname)
 
-    self.assertAllClose(mean, model.layers[1].arguments['mu'])
-    self.assertAllClose(std, model.layers[1].arguments['std'])
+      self.assertAllClose(mean, model.layers[1].arguments['mu'])
+      self.assertAllClose(std, model.layers[1].arguments['std'])
 
   def test_saving_model_with_long_layer_names(self):
     if h5py is None:
@@ -580,6 +605,25 @@ class TestWeightSavingAndLoadingTFFormat(test.TestCase):
         # Indirectly tests that the user is prompted
         model.save_weights(prefix, save_format='tensorflow', overwrite=False)
 
+  def test_no_default_session(self):
+    with ops.Graph().as_default():
+      self.assertFalse(ops.get_default_session())
+      data = np.random.random((1000, 32)).astype(np.float32)
+      labels = np.random.random((1000, 10)).astype(np.float32)
+
+      model = keras.models.Sequential([
+          keras.layers.Dense(10, activation='softmax'),
+          keras.layers.Dense(10, activation='softmax')])
+
+      model.compile(optimizer=training_module.RMSPropOptimizer(0.001),
+                    loss='categorical_crossentropy',
+                    metrics=['accuracy'])
+
+      model.fit(data, labels)
+      fname = os.path.join(self.get_temp_dir(), 'weights', 'ckpt')
+      model.save_weights(fname)
+      model.load_weights(fname)
+
   def test_no_graph_pollution(self):
     with context.graph_mode():
       graph = ops.Graph()
diff --git a/tensorflow/python/keras/engine/sequential.py b/tensorflow/python/keras/engine/sequential.py
index 52e29b0ffad7d26d2a2fbe8f287146daaffa3059..3ca8fdd3260d964442c18bc30c3925f252e8a304 100644
--- a/tensorflow/python/keras/engine/sequential.py
+++ b/tensorflow/python/keras/engine/sequential.py
@@ -222,11 +222,16 @@ class Sequential(Model):
       for layer in self._layers:
         x = layer(x)
       self.outputs = [x]
+      # Make sure that the model's input shape will be preserved during
+      # serialization.
+      if self._layers:
+        self._layers[0]._batch_input_shape = batch_shape
 
     if self.inputs:
       self._init_graph_network(self.inputs, self.outputs, name=self.name)
       self.built = True
-    self._track_layers(self._layers)
+    if self._layers:
+      self._track_layers(self._layers)
 
   def predict_proba(self, x, batch_size=32, verbose=0):
     """Generates class probability predictions for the input samples.
diff --git a/tensorflow/python/keras/engine/sequential_test.py b/tensorflow/python/keras/engine/sequential_test.py
index 69a288e69b60b03383b2cb54f8a2fde641516628..cdaf9162de8c5318c7e092077d409f05a7edc717 100644
--- a/tensorflow/python/keras/engine/sequential_test.py
+++ b/tensorflow/python/keras/engine/sequential_test.py
@@ -209,6 +209,30 @@ class TestSequential(test.TestCase):
       x2 = model.predict(val_a)
       assert np.abs(np.sum(x1 - x2)) > 1e-5
 
+  def test_sequential_deferred_build_serialization(self):
+    num_hidden = 5
+    input_dim = 3
+    batch_size = 5
+    num_classes = 2
+
+    model = keras.models.Sequential()
+    # We don't specify the input shape.
+    model.add(keras.layers.Dense(num_hidden))
+    model.add(keras.layers.Dense(num_classes))
+    model.compile(loss='mse', optimizer=rmsprop.RMSPropOptimizer(1e-3))
+    self.assertFalse(model.built)
+
+    x = np.random.random((batch_size, input_dim))
+    y = np.random.random((batch_size, num_classes))
+    model.train_on_batch(x, y)
+    self.assertTrue(model.built)
+
+    config = model.get_config()
+    new_model = keras.models.Sequential.from_config(config)
+    self.assertTrue(new_model.built)
+    self.assertEqual(len(model.layers), 2)
+    self.assertEqual(len(model.weights), 4)
+
 
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/python/keras/engine/training.py b/tensorflow/python/keras/engine/training.py
index ff50d0b6e27f15ef13449b4eef7b815f4684a033..fce6cbdb7a09d5a4daa70d0483732735e70c28b7 100644
--- a/tensorflow/python/keras/engine/training.py
+++ b/tensorflow/python/keras/engine/training.py
@@ -24,6 +24,7 @@ import numpy as np
 from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python.data.ops import iterator_ops
 from tensorflow.python.eager import context
+from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import errors
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import tensor_util
@@ -31,12 +32,11 @@ from tensorflow.python.keras import backend as K
 from tensorflow.python.keras import losses
 from tensorflow.python.keras import metrics as metrics_module
 from tensorflow.python.keras import optimizers
+from tensorflow.python.keras.engine import base_layer
 from tensorflow.python.keras.engine import training_arrays
 from tensorflow.python.keras.engine import training_eager
 from tensorflow.python.keras.engine import training_generator
 from tensorflow.python.keras.engine import training_utils
-from tensorflow.python.keras.engine.base_layer import DeferredTensor
-from tensorflow.python.keras.engine.base_layer import Layer
 from tensorflow.python.keras.engine.network import Network
 from tensorflow.python.keras.utils.generic_utils import slice_arrays
 from tensorflow.python.ops import array_ops
@@ -112,6 +112,8 @@ class Model(Network):
     super(Model, self).__init__(*args, **kwargs)
     # Create a cache for iterator get_next op.
     self._iterator_get_next = weakref.WeakKeyDictionary()
+    # Create a cache for dataset - uninitialized iterators
+    self._dataset_iterator_cache = weakref.WeakKeyDictionary()
 
   def compile(self,
               optimizer,
@@ -408,11 +410,13 @@ class Model(Network):
         else:
           if sample_weight_mode == 'temporal':
             sample_weights.append(array_ops.placeholder_with_default(
-                [[1.]], shape=[None, None], name=name + '_sample_weights'))
+                constant_op.constant([[1.]], dtype=K.floatx()),
+                shape=[None, None], name=name + '_sample_weights'))
             sample_weight_modes.append('temporal')
           else:
             sample_weights.append(array_ops.placeholder_with_default(
-                [1.], shape=[None], name=name + '_sample_weights'))
+                constant_op.constant([1.], dtype=K.floatx()),
+                shape=[None], name=name + '_sample_weights'))
             sample_weight_modes.append(None)
     self.sample_weight_modes = sample_weight_modes
     self._feed_sample_weight_modes = []
@@ -521,7 +525,7 @@ class Model(Network):
 
             # Keep track of state updates created by
             # stateful metrics (i.e. metrics layers).
-            if isinstance(metric_fn, Layer) and metric_fn.stateful:
+            if isinstance(metric_fn, base_layer.Layer) and metric_fn.stateful:
               self.stateful_metric_names.append(metric_name)
               self.stateful_metric_functions.append(metric_fn)
               self.metrics_updates += metric_fn.updates
@@ -670,12 +674,12 @@ class Model(Network):
           (in case the model has multiple inputs).
         - A dict mapping input names to the corresponding array/tensors,
           if the model has named inputs.
-        - A `tf.data` dataset iterator.
+        - A `tf.data` dataset or a dataset iterator.
       y: Target data. Like the input data `x`,
         it could be either Numpy array(s) or TensorFlow tensor(s).
         It should be consistent with `x` (you cannot have Numpy inputs and
-        tensor targets, or inversely). If `x` is a dataset iterator,
-        `y` should not be specified
+        tensor targets, or inversely). If `x` is a dataset or a
+        dataset iterator, `y` should not be specified
         (since targets will be obtained from the iterator).
       sample_weight: An optional sample-weight array passed by the user to
         weight the importance of each sample in `x`.
@@ -706,11 +710,16 @@ class Model(Network):
       RuntimeError: If the model was never compiled.
     """
     if isinstance(x, dataset_ops.Dataset):
-      raise ValueError('You passed a `Dataset` instance to your model (%s), '
-                       'which is not supported. Instead, pass an `Iterator`, '
-                       'which you can obtain e.g. via '
-                       '`dataset.make_one_shot_iterator()` (the exact method '
-                       'to use will depend on your specific dataset).' % x)
+      if context.executing_eagerly():
+        x = x.make_one_shot_iterator()
+      else:
+        if x in self._dataset_iterator_cache:
+          x = self._dataset_iterator_cache[x]
+        else:
+          iterator = x.make_initializable_iterator()
+          self._dataset_iterator_cache[x] = iterator
+          x = iterator
+        K.get_session().run(x.initializer)
 
     # Validates `steps` argument based on x's type.
     if check_steps:
@@ -719,7 +728,7 @@ class Model(Network):
     is_x_eager_iterator = isinstance(x, iterator_ops.EagerIterator)
     is_x_iterator = isinstance(x, iterator_ops.Iterator)
 
-    # Validate user inputs when data is given as a dataset iterator.
+    # Validate user inputs when data is given as a dataset or dataset iterator.
     if is_x_iterator or is_x_eager_iterator:
       training_utils.validate_iterator_input(x, y, sample_weight,
                                              validation_split)
@@ -839,7 +848,8 @@ class Model(Network):
     # in the case where all inputs are value arrays.
 
     if context.executing_eagerly():
-      # In eager mode, do not do shape validation.
+      # In eager mode, do not do shape validation
+      # since the network has no input nodes (placeholders) to be fed.
       feed_input_names = self.input_names
       feed_input_shapes = None
     elif not self._is_graph_network:
@@ -951,11 +961,17 @@ class Model(Network):
         whether to build the model's graph in inference mode (False), training
         mode (True), or using the Keras learning phase (None).
     """
-    if not getattr(self, '_uses_inputs_arg', True):
+    call_convention = getattr(
+        self,
+        '_call_convention',
+        base_layer.CallConvention.EXPLICIT_INPUTS_ARGUMENT)
+    if call_convention not in (
+        base_layer.CallConvention.EXPLICIT_INPUTS_ARGUMENT,
+        base_layer.CallConvention.SINGLE_POSITIONAL_ARGUMENT):
       raise NotImplementedError(
-          'Subclassed Models without "inputs" in their call() signatures do '
-          'not yet support shape inference. File a feature request if this '
-          'limitation bothers you.')
+          'Subclassed Models without "inputs" (or single positional arguments) '
+          'in their call() signatures do not yet support shape inference. File '
+          'a feature request if this limitation bothers you.')
     if self.__class__.__name__ == 'Sequential':
       # Note: we can't test whether the model is `Sequential` via `isinstance`
       # since `Sequential` depends on `Model`.
@@ -995,14 +1011,16 @@ class Model(Network):
     # to keep track of number of inputs and outputs and their ndim.
     if isinstance(inputs, (list, tuple)):
       if tensor_util.is_tensor(inputs[0]):
-        dummy_output_values = self.call(inputs)
+        dummy_output_values = self.call(
+            training_utils.cast_if_floating_dtype(inputs))
       else:
         dummy_output_values = self.call(
             [ops.convert_to_tensor(v, dtype=K.floatx()) for v in inputs])
       dummy_input_values = list(inputs)
     else:
       if tensor_util.is_tensor(inputs):
-        dummy_output_values = self.call(inputs)
+        dummy_output_values = self.call(
+            training_utils.cast_if_floating_dtype(inputs))
       else:
         dummy_output_values = self.call(
             ops.convert_to_tensor(inputs, dtype=K.floatx()))
@@ -1012,11 +1030,11 @@ class Model(Network):
     else:
       dummy_output_values = [dummy_output_values]
     self.outputs = [
-        DeferredTensor(shape=(None for _ in v.shape),
-                       dtype=v.dtype) for v in dummy_output_values]
+        base_layer.DeferredTensor(shape=(None for _ in v.shape),
+                                  dtype=v.dtype) for v in dummy_output_values]
     self.inputs = [
-        DeferredTensor(shape=(None for _ in v.shape),
-                       dtype=v.dtype) for v in dummy_input_values]
+        base_layer.DeferredTensor(shape=(None for _ in v.shape),
+                                  dtype=v.dtype) for v in dummy_input_values]
     self.input_names = [
         'input_%d' % (i + 1) for i in range(len(dummy_input_values))]
     self.output_names = [
@@ -1130,19 +1148,19 @@ class Model(Network):
             (in case the model has multiple inputs).
           - A dict mapping input names to the corresponding array/tensors,
             if the model has named inputs.
-          - A `tf.data` dataset iterator.
+          - A `tf.data` dataset or a dataset iterator.
         y: Target data. Like the input data `x`,
           it could be either Numpy array(s) or TensorFlow tensor(s).
           It should be consistent with `x` (you cannot have Numpy inputs and
-          tensor targets, or inversely). If `x` is a dataset iterator,
-          `y` should not be specified
+          tensor targets, or inversely). If `x` is a dataset or dataset
+          iterator, `y` should not be specified
           (since targets will be obtained from the iterator).
         batch_size: Integer or `None`.
             Number of samples per gradient update.
             If unspecified, `batch_size` will default to 32.
             Do not specify the `batch_size` if your data is in the
-            form of symbolic tensors or dataset iterators (since they generate
-            batches).
+            form of symbolic tensors, datasets, or dataset iterators
+            (since they generate batches).
         epochs: Integer. Number of epochs to train the model.
             An epoch is an iteration over the entire `x` and `y`
             data provided.
@@ -1164,7 +1182,7 @@ class Model(Network):
             on this data at the end of each epoch.
             The validation data is selected from the last samples
             in the `x` and `y` data provided, before shuffling. This argument is
-            not supported when `x` is a dataset iterator.
+            not supported when `x` is a dataset or a dataset iterator.
         validation_data: Data on which to evaluate
             the loss and any model metrics at the end of each epoch.
             The model will not be trained on this data.
@@ -1172,7 +1190,7 @@ class Model(Network):
             `validation_data` could be:
               - tuple `(x_val, y_val)` of Numpy arrays or tensors
               - tuple `(x_val, y_val, val_sample_weights)` of Numpy arrays
-              - dataset iterator
+              - dataset or a dataset iterator
         shuffle: Boolean (whether to shuffle the training data
             before each epoch) or str (for 'batch').
             'batch' is a special option for dealing with the
@@ -1195,7 +1213,7 @@ class Model(Network):
             to apply a different weight to every timestep of every sample.
             In this case you should make sure to specify
             `sample_weight_mode="temporal"` in `compile()`. This argument is not
-            supported when `x` is a dataset iterator.
+            supported when `x` is a dataset or a dataset iterator.
         initial_epoch: Integer.
             Epoch at which to start training
             (useful for resuming a previous training run).
@@ -1252,7 +1270,8 @@ class Model(Network):
     # Prepare validation data.
     if validation_data:
       if (isinstance(validation_data, iterator_ops.Iterator) or
-          isinstance(validation_data, iterator_ops.EagerIterator)):
+          isinstance(validation_data, iterator_ops.EagerIterator) or
+          isinstance(validation_data, dataset_ops.Dataset)):
         val_x = validation_data
         val_y = None
         val_sample_weight = None
@@ -1266,8 +1285,9 @@ class Model(Network):
             'When passing a `validation_data` argument, '
             'it must contain either 2 items (x_val, y_val), '
             'or 3 items (x_val, y_val, val_sample_weights), '
-            'or alternatively it could be a dataset iterator. However we '
-            'received `validation_data=%s`' % validation_data)
+            'or alternatively it could be a dataset or a '
+            'dataset or a dataset iterator. '
+            'However we received `validation_data=%s`' % validation_data)
 
       # Validate and standardize validation data.
       val_x, val_y, val_sample_weights = self._standardize_user_data(
@@ -1351,19 +1371,19 @@ class Model(Network):
             (in case the model has multiple inputs).
           - A dict mapping input names to the corresponding array/tensors,
             if the model has named inputs.
-          - A `tf.data` dataset iterator.
+          - A `tf.data` dataset or a dataset iterator.
         y: Target data. Like the input data `x`,
           it could be either Numpy array(s) or TensorFlow tensor(s).
           It should be consistent with `x` (you cannot have Numpy inputs and
-          tensor targets, or inversely). If `x` is a dataset iterator,
-          `y` should not be specified
-          (since targets will be obtained from the iterator).
+          tensor targets, or inversely).
+          If `x` is a dataset or a dataset iterator, `y` should not be specified
+          (since targets will be obtained from the iterator/dataset).
         batch_size: Integer or `None`.
             Number of samples per gradient update.
             If unspecified, `batch_size` will default to 32.
             Do not specify the `batch_size` is your data is in the
-            form of symbolic tensors or dataset iterators (since they generate
-            batches).
+            form of symbolic tensors, datasets, or dataset iterators
+            (since they generate batches).
         verbose: 0 or 1. Verbosity mode.
             0 = silent, 1 = progress bar.
         sample_weight: Optional Numpy array of weights for
@@ -1377,7 +1397,7 @@ class Model(Network):
             to apply a different weight to every timestep of every sample.
             In this case you should make sure to specify
             `sample_weight_mode="temporal"` in `compile()`. This argument is not
-            supported when `x` is a dataset iterator.
+            supported when `x` is a dataset or a dataset iterator.
         steps: Integer or `None`.
             Total number of steps (batches of samples)
             before declaring the evaluation round finished.
@@ -1426,13 +1446,13 @@ class Model(Network):
             (in case the model has multiple inputs).
           - A TensorFlow tensor, or a list of tensors
             (in case the model has multiple inputs).
-          - A `tf.data` dataset iterator.
+          - A `tf.data` dataset or a dataset iterator.
         batch_size: Integer or `None`.
             Number of samples per gradient update.
             If unspecified, `batch_size` will default to 32.
             Do not specify the `batch_size` is your data is in the
-            form of symbolic tensors or dataset iterators (since they generate
-            batches).
+            form of symbolic tensors, dataset, or dataset iterators
+            (since they generate batches).
         verbose: Verbosity mode, 0 or 1.
         steps: Total number of steps (batches of samples)
             before declaring the prediction round finished.
@@ -1473,12 +1493,12 @@ class Model(Network):
             (in case the model has multiple inputs).
           - A dict mapping input names to the corresponding array/tensors,
             if the model has named inputs.
-          - A `tf.data` dataset iterator.
+          - A `tf.data` dataset or a dataset iterator.
         y: Target data. Like the input data `x`,
           it could be either Numpy array(s) or TensorFlow tensor(s).
           It should be consistent with `x` (you cannot have Numpy inputs and
-          tensor targets, or inversely). If `x` is a dataset iterator,
-          `y` should not be specified
+          tensor targets, or inversely). If `x` is a dataset or a
+          dataset iterator, `y` should not be specified
           (since targets will be obtained from the iterator).
         sample_weight: Optional array of the same length as x, containing
             weights to apply to the model's loss for each sample.
@@ -1487,8 +1507,7 @@ class Model(Network):
             to apply a different weight to every timestep of every sample.
             In this case you should make sure to specify
             sample_weight_mode="temporal" in compile(). This argument is not
-            supported when `x` is a dataset iterator.
-
+            supported when `x` is a dataset or a dataset iterator.
         class_weight: Optional dictionary mapping
             class indices (integers) to
             a weight (float) to apply to the model's loss for the samples
@@ -1537,12 +1556,12 @@ class Model(Network):
             (in case the model has multiple inputs).
           - A dict mapping input names to the corresponding array/tensors,
             if the model has named inputs.
-          - A `tf.data` dataset iterator.
+          - A `tf.data` dataset or a dataset iterator.
         y: Target data. Like the input data `x`,
           it could be either Numpy array(s) or TensorFlow tensor(s).
           It should be consistent with `x` (you cannot have Numpy inputs and
-          tensor targets, or inversely). If `x` is a dataset iterator,
-          `y` should not be specified
+          tensor targets, or inversely). If `x` is a dataset or a
+          dataset iterator, `y` should not be specified
           (since targets will be obtained from the iterator).
         sample_weight: Optional array of the same length as x, containing
             weights to apply to the model's loss for each sample.
@@ -1551,7 +1570,7 @@ class Model(Network):
             to apply a different weight to every timestep of every sample.
             In this case you should make sure to specify
             sample_weight_mode="temporal" in compile(). This argument is not
-            supported when `x` is a dataset iterator.
+            supported when `x` is a dataset or a dataset iterator.
 
     Returns:
         Scalar test loss (if the model has a single output and no metrics)
@@ -1590,7 +1609,7 @@ class Model(Network):
             (in case the model has multiple inputs).
           - A TensorFlow tensor, or a list of tensors
             (in case the model has multiple inputs).
-          - A `tf.data` dataset iterator.
+          - A `tf.data` dataset or a dataset iterator.
 
     Returns:
         Numpy array(s) of predictions.
@@ -1602,7 +1621,10 @@ class Model(Network):
     # Validate and standardize user data.
     inputs, _, _ = self._standardize_user_data(x)
     if context.executing_eagerly():
-      if not isinstance(inputs, iterator_ops.EagerIterator):
+      if (isinstance(x, iterator_ops.EagerIterator) or
+          (isinstance(x, dataset_ops.Dataset) and context.executing_eagerly())):
+        inputs = training_utils.cast_if_floating_dtype(inputs)
+      else:
         inputs = [
             ops.convert_to_tensor(val, dtype=K.floatx()) for val in inputs
         ]
diff --git a/tensorflow/python/keras/engine/training_arrays.py b/tensorflow/python/keras/engine/training_arrays.py
index 93f4f1bd1dde848d9d3afbfd1dcbd26741b9c745..281ad9bd50edf519f520fe4aa664ae05b72528d8 100644
--- a/tensorflow/python/keras/engine/training_arrays.py
+++ b/tensorflow/python/keras/engine/training_arrays.py
@@ -185,6 +185,7 @@ def fit_loop(model,
     callbacks.on_epoch_begin(epoch)
     epoch_logs = {}
     if steps_per_epoch is not None:
+      # Step-wise fit loop.
       for step_index in range(steps_per_epoch):
         batch_logs = {}
         batch_logs['batch'] = step_index
@@ -215,7 +216,6 @@ def fit_loop(model,
             val_inputs,
             val_targets,
             sample_weights=val_sample_weights,
-            batch_size=batch_size,
             steps=validation_steps,
             verbose=0)
         if not isinstance(val_outs, list):
@@ -224,6 +224,7 @@ def fit_loop(model,
         for l, o in zip(out_labels, val_outs):
           epoch_logs['val_' + l] = o
     else:
+      # Sample-wise fit loop.
       if shuffle == 'batch':
         index_array = training_utils.batch_shuffle(index_array, batch_size)
       elif shuffle:
diff --git a/tensorflow/python/keras/engine/training_eager.py b/tensorflow/python/keras/engine/training_eager.py
index 081e46aa66176d9d19c48c41b950f8df887e7a84..e8838cd3bca7b3afba80504f9e705943474423c5 100644
--- a/tensorflow/python/keras/engine/training_eager.py
+++ b/tensorflow/python/keras/engine/training_eager.py
@@ -255,6 +255,8 @@ def iterator_fit_loop(model,
     # Validate and standardize data.
     x, y, sample_weights = model._standardize_user_data(
         x, y, class_weight=class_weight)
+    x = training_utils.cast_if_floating_dtype(x)
+    y = training_utils.cast_if_floating_dtype(y)
     if sample_weights:
       sample_weights = [
           ops.convert_to_tensor(val, dtype=backend.floatx())
@@ -471,6 +473,8 @@ def iterator_test_loop(model, inputs, steps, verbose=0):
 
     # Validate and standardize data.
     x, y, sample_weights = model._standardize_user_data(x, y)
+    x = training_utils.cast_if_floating_dtype(x)
+    y = training_utils.cast_if_floating_dtype(y)
 
     # Calculate model output, loss values.
     loss_outs, loss, loss_metrics = _model_loss(
@@ -501,11 +505,11 @@ def iterator_test_loop(model, inputs, steps, verbose=0):
     if verbose == 1:
       progbar.update(step_index + 1)
 
-    for i in range(len(outs)):
-      outs[i] /= num_samples
-    if len(outs) == 1:
-      return outs[0]
-    return outs
+  for i in range(len(outs)):
+    outs[i] /= num_samples
+  if len(outs) == 1:
+    return outs[0]
+  return outs
 
 
 def batch_test_loop(model,
@@ -639,6 +643,7 @@ def iterator_predict_loop(model, inputs, steps, verbose=0):
 
     # Validate and standardize data.
     x, _, _ = model._standardize_user_data(x)
+    x = training_utils.cast_if_floating_dtype(x)
 
     if model._expects_training_arg:
       batch_outs = model.call(x[0] if len(x) == 1 else x, training=False)
@@ -814,7 +819,10 @@ def train_on_batch(model, inputs, targets, sample_weights=None):
   Returns:
       total loss and the loss associated with each output.
   """
-  if len(inputs) and not tensor_util.is_tensor(inputs[0]):
+  if len(inputs) and tensor_util.is_tensor(inputs[0]):
+    inputs = training_utils.cast_if_floating_dtype(inputs)
+    targets = training_utils.cast_if_floating_dtype(targets)
+  else:
     inputs = [
         ops.convert_to_tensor(val, dtype=backend.floatx()) for val in inputs
     ]
@@ -849,7 +857,10 @@ def test_on_batch(model, inputs, targets, sample_weights=None):
   Returns:
       total loss, loss and metrics associated with each output.
   """
-  if len(inputs) and not tensor_util.is_tensor(inputs[0]):
+  if len(inputs) and tensor_util.is_tensor(inputs[0]):
+    inputs = training_utils.cast_if_floating_dtype(inputs)
+    targets = training_utils.cast_if_floating_dtype(targets)
+  else:
     inputs = [
         ops.convert_to_tensor(val, dtype=backend.floatx()) for val in inputs
     ]
diff --git a/tensorflow/python/keras/engine/training_eager_test.py b/tensorflow/python/keras/engine/training_eager_test.py
index d9446fd4373e7403345a4b5a8a4e35faace941a2..1571a7782aaae3c3be233c7d4fdd91456a96e3e3 100644
--- a/tensorflow/python/keras/engine/training_eager_test.py
+++ b/tensorflow/python/keras/engine/training_eager_test.py
@@ -20,6 +20,7 @@ from __future__ import print_function
 
 import numpy as np
 
+from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python import keras
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import test_util as tf_test_util
@@ -402,6 +403,24 @@ class TrainingTest(test.TestCase):
     model.train_on_batch(inputs, targets)
     model.test_on_batch(inputs, targets)
 
+  def test_generator_methods(self):
+    model = keras.Sequential()
+    model.add(keras.layers.Dense(4, input_shape=(3,)))
+    optimizer = RMSPropOptimizer(learning_rate=0.001)
+    model.compile(optimizer, 'mse', metrics=['mae'])
+
+    x = np.random.random((10, 3))
+    y = np.random.random((10, 4))
+
+    def iterator():
+      while True:
+        yield x, y
+
+    model.fit_generator(iterator(), steps_per_epoch=3, epochs=1)
+    model.evaluate_generator(iterator(), steps=3)
+    out = model.predict_generator(iterator(), steps=3)
+    self.assertEqual(out.shape, (30, 4))
+
 
 class LossWeightingTest(test.TestCase):
 
@@ -670,6 +689,59 @@ class CorrectnessTest(test.TestCase):
     outs = model.evaluate(x, y)
     self.assertEqual(outs[1], 0.)
 
+  @tf_test_util.run_in_graph_and_eager_modes()
+  def test_loss_correctness_with_iterator(self):
+    # Test that training loss is the same in eager and graph
+    # (by comparing it to a reference value in a deterministic case)
+    model = keras.Sequential()
+    model.add(
+        keras.layers.Dense(
+            3, activation='relu', input_dim=4, kernel_initializer='ones'))
+    model.add(
+        keras.layers.Dense(2, activation='softmax', kernel_initializer='ones'))
+    model.compile(
+        loss='sparse_categorical_crossentropy',
+        optimizer=RMSPropOptimizer(learning_rate=0.001))
+    x = np.ones((100, 4), dtype=np.float32)
+    np.random.seed(123)
+    y = np.random.randint(0, 1, size=(100, 1))
+    dataset = dataset_ops.Dataset.from_tensor_slices((x, y))
+    dataset = dataset.repeat(100)
+    dataset = dataset.batch(10)
+    iterator = dataset.make_one_shot_iterator()
+    history = model.fit(iterator, epochs=1, steps_per_epoch=10)
+    self.assertEqual(np.around(history.history['loss'][-1], decimals=4), 0.6173)
+
+  @tf_test_util.run_in_graph_and_eager_modes()
+  def test_metrics_correctness_with_iterator(self):
+    model = keras.Sequential()
+    model.add(
+        keras.layers.Dense(
+            8, activation='relu', input_dim=4, kernel_initializer='ones'))
+    model.add(
+        keras.layers.Dense(1, activation='sigmoid', kernel_initializer='ones'))
+    model.compile(
+        loss='binary_crossentropy',
+        metrics=['accuracy'],
+        optimizer=RMSPropOptimizer(learning_rate=0.001))
+    np.random.seed(123)
+    x = np.random.randint(10, size=(100, 4)).astype(np.float32)
+    y = np.random.randint(2, size=(100, 1)).astype(np.float32)
+    dataset = dataset_ops.Dataset.from_tensor_slices((x, y))
+    dataset = dataset.batch(10)
+    iterator = dataset.make_one_shot_iterator()
+    outs = model.evaluate(iterator, steps=10)
+    self.assertEqual(np.around(outs[1], decimals=1), 0.5)
+
+    y = np.zeros((100, 1), dtype=np.float32)
+    dataset = dataset_ops.Dataset.from_tensor_slices((x, y))
+    dataset = dataset.repeat(100)
+    dataset = dataset.batch(10)
+    iterator = dataset.make_one_shot_iterator()
+    outs = model.evaluate(iterator, steps=10)
+    self.assertEqual(outs[1], 0.)
+
+
 if __name__ == '__main__':
   ops.enable_eager_execution()
   test.main()
diff --git a/tensorflow/python/keras/engine/training_test.py b/tensorflow/python/keras/engine/training_test.py
index 7dec0bbf8a681215d9a23db8f3419cc315a4b39c..a1ab72018957a26984f9b7b1ccba9a128a136866 100644
--- a/tensorflow/python/keras/engine/training_test.py
+++ b/tensorflow/python/keras/engine/training_test.py
@@ -129,8 +129,10 @@ class TrainingTest(test.TestCase):
           {
               'input_a': input_a_np,
               'input_b': input_b_np
-          }, {'dense': output_d_np,
-              'dropout': output_e_np},
+          }, {
+              'dense': output_d_np,
+              'dropout': output_e_np
+          },
           epochs=1,
           batch_size=5,
           verbose=0)
@@ -138,8 +140,10 @@ class TrainingTest(test.TestCase):
           {
               'input_a': input_a_np,
               'input_b': input_b_np
-          }, {'dense': output_d_np,
-              'dropout': output_e_np},
+          }, {
+              'dense': output_d_np,
+              'dropout': output_e_np
+          },
           epochs=1,
           batch_size=5,
           verbose=1)
@@ -147,8 +151,10 @@ class TrainingTest(test.TestCase):
           {
               'input_a': input_a_np,
               'input_b': input_b_np
-          }, {'dense': output_d_np,
-              'dropout': output_e_np},
+          }, {
+              'dense': output_d_np,
+              'dropout': output_e_np
+          },
           validation_data=({
               'input_a': input_a_np,
               'input_b': input_b_np
@@ -162,8 +168,10 @@ class TrainingTest(test.TestCase):
       model.train_on_batch({
           'input_a': input_a_np,
           'input_b': input_b_np
-      }, {'dense': output_d_np,
-          'dropout': output_e_np})
+      }, {
+          'dense': output_d_np,
+          'dropout': output_e_np
+      })
 
       # Test with lists for loss, metrics
       loss = ['mae', 'mse']
@@ -285,16 +293,20 @@ class TrainingTest(test.TestCase):
           {
               'input_a': input_a_np,
               'input_b': input_b_np
-          }, {'dense': output_d_np,
-              'dropout': output_e_np},
+          }, {
+              'dense': output_d_np,
+              'dropout': output_e_np
+          },
           batch_size=5,
           verbose=0)
       model.evaluate(
           {
               'input_a': input_a_np,
               'input_b': input_b_np
-          }, {'dense': output_d_np,
-              'dropout': output_e_np},
+          }, {
+              'dense': output_d_np,
+              'dropout': output_e_np
+          },
           batch_size=5,
           verbose=1)
 
@@ -349,9 +361,11 @@ class TrainingTest(test.TestCase):
 
     with self.test_session():
       test_inputs = [
-          scipy_sparse.random(6, 3, density=0.25).tocsr() for _ in range(2)]
+          scipy_sparse.random(6, 3, density=0.25).tocsr() for _ in range(2)
+      ]
       test_outputs = [
-          scipy_sparse.random(6, i, density=0.25).tocsr() for i in range(3, 5)]
+          scipy_sparse.random(6, i, density=0.25).tocsr() for i in range(3, 5)
+      ]
       in1 = keras.layers.Input(shape=(3,))
       in2 = keras.layers.Input(shape=(3,))
       out1 = keras.layers.Dropout(0.5, name='dropout')(in1)
@@ -1721,8 +1735,8 @@ class TestTrainingWithDatasetIterators(test.TestCase):
       metrics = ['mae']
       model.compile(optimizer, loss, metrics=metrics)
 
-      inputs = np.zeros((10, 3), dtype=np.float32)
-      targets = np.zeros((10, 4), dtype=np.float32)
+      inputs = np.zeros((10, 3))
+      targets = np.zeros((10, 4))
       dataset = dataset_ops.Dataset.from_tensor_slices((inputs, targets))
       dataset = dataset.repeat(100)
       dataset = dataset.batch(10)
@@ -1742,7 +1756,7 @@ class TestTrainingWithDatasetIterators(test.TestCase):
       # Test with validation split
       with self.assertRaisesRegexp(
           ValueError, '`validation_split` argument is not supported '
-          'when input `x` is a dataset iterator'):
+          'when input `x` is a dataset or a dataset iterator'):
         model.fit(iterator,
                   epochs=1, steps_per_epoch=2, verbose=0,
                   validation_split=0.5, validation_steps=2)
@@ -1751,7 +1765,7 @@ class TestTrainingWithDatasetIterators(test.TestCase):
       sample_weight = np.random.random((10,))
       with self.assertRaisesRegexp(
           ValueError, '`sample_weight` argument is not supported '
-          'when input `x` is a dataset iterator'):
+          'when input `x` is a dataset or a dataset iterator'):
         model.fit(
             iterator,
             epochs=1,
@@ -1760,10 +1774,6 @@ class TestTrainingWithDatasetIterators(test.TestCase):
             sample_weight=sample_weight)
 
       # Test invalid usage
-      with self.assertRaisesRegexp(ValueError,
-                                   'Instead, pass an `Iterator`'):
-        model.fit(dataset,
-                  epochs=1, steps_per_epoch=2, verbose=0)
       with self.assertRaisesRegexp(ValueError,
                                    'you should not specify a target'):
         model.fit(iterator, iterator,
@@ -1790,8 +1800,8 @@ class TestTrainingWithDatasetIterators(test.TestCase):
       metrics = ['mae']
       model.compile(optimizer, loss, metrics=metrics)
 
-      inputs = np.zeros((10, 3), dtype=np.float32)
-      targets = np.zeros((10, 4), dtype=np.float32)
+      inputs = np.zeros((10, 3))
+      targets = np.zeros((10, 4))
       dataset = dataset_ops.Dataset.from_tensor_slices((inputs, targets))
       dataset = dataset.repeat(100)
       dataset = dataset.batch(10)
@@ -1815,8 +1825,8 @@ class TestTrainingWithDatasetIterators(test.TestCase):
       metrics = ['mae']
       model.compile(optimizer, loss, metrics=metrics)
 
-      inputs = np.zeros((10, 3), dtype=np.float32)
-      targets = np.zeros((10, 4), dtype=np.float32)
+      inputs = np.zeros((10, 3))
+      targets = np.zeros((10, 4))
       dataset = dataset_ops.Dataset.from_tensor_slices((inputs, targets))
       dataset = dataset.repeat(2)
       dataset = dataset.batch(10)
@@ -1829,5 +1839,129 @@ class TestTrainingWithDatasetIterators(test.TestCase):
             'dataset iterator ran out of data')
 
 
+class TestTrainingWithDataset(test.TestCase):
+
+  def test_calling_model_on_same_dataset(self):
+    with self.test_session():
+      x = keras.layers.Input(shape=(3,), name='input')
+      y = keras.layers.Dense(4, name='dense')(x)
+      model = keras.Model(x, y)
+
+      optimizer = RMSPropOptimizer(learning_rate=0.001)
+      loss = 'mse'
+      metrics = ['mae']
+      model.compile(optimizer, loss, metrics=metrics)
+
+      inputs = np.zeros((10, 3))
+      targets = np.zeros((10, 4))
+      dataset = dataset_ops.Dataset.from_tensor_slices((inputs, targets))
+      dataset = dataset.repeat(100)
+      dataset = dataset.batch(10)
+
+      # Call fit with validation data
+      model.fit(dataset, epochs=1, steps_per_epoch=2, verbose=0,
+                validation_data=dataset, validation_steps=2)
+      # Finalize the graph to make sure new ops aren't added when calling on the
+      # same dataset
+      ops.get_default_graph().finalize()
+      model.fit(dataset, epochs=1, steps_per_epoch=2, verbose=0,
+                validation_data=dataset, validation_steps=2)
+
+  @tf_test_util.run_in_graph_and_eager_modes()
+  def test_training_and_eval_methods_on_dataset(self):
+    with self.test_session():
+      x = keras.layers.Input(shape=(3,), name='input')
+      y = keras.layers.Dense(4, name='dense')(x)
+      model = keras.Model(x, y)
+
+      optimizer = RMSPropOptimizer(learning_rate=0.001)
+      loss = 'mse'
+      metrics = ['mae']
+      model.compile(optimizer, loss, metrics=metrics)
+
+      inputs = np.zeros((10, 3))
+      targets = np.zeros((10, 4))
+      dataset = dataset_ops.Dataset.from_tensor_slices((inputs, targets))
+      dataset = dataset.repeat(100)
+      dataset = dataset.batch(10)
+
+      model.fit(dataset, epochs=1, steps_per_epoch=2, verbose=1)
+      model.evaluate(dataset, steps=2, verbose=1)
+      model.predict(dataset, steps=2)
+      model.train_on_batch(dataset)
+      model.predict_on_batch(dataset)
+
+      # Test with validation data
+      model.fit(dataset, epochs=1, steps_per_epoch=2, verbose=0,
+                validation_data=dataset, validation_steps=2)
+
+      # Test with validation split
+      with self.assertRaisesRegexp(
+          ValueError, '`validation_split` argument is not supported '
+          'when input `x` is a dataset or a dataset iterator'):
+        model.fit(dataset,
+                  epochs=1, steps_per_epoch=2, verbose=0,
+                  validation_split=0.5, validation_steps=2)
+
+      # Test with sample weight.
+      sample_weight = np.random.random((10,))
+      with self.assertRaisesRegexp(
+          ValueError, '`sample_weight` argument is not supported '
+          'when input `x` is a dataset or a dataset iterator'):
+        model.fit(
+            dataset,
+            epochs=1,
+            steps_per_epoch=2,
+            verbose=0,
+            sample_weight=sample_weight)
+
+      # Test invalid usage
+      with self.assertRaisesRegexp(ValueError,
+                                   'you should not specify a target'):
+        model.fit(dataset, dataset,
+                  epochs=1, steps_per_epoch=2, verbose=0)
+
+      with self.assertRaisesRegexp(
+          ValueError, 'you should specify the `steps_per_epoch` argument'):
+        model.fit(dataset, epochs=1, verbose=0)
+      with self.assertRaisesRegexp(ValueError,
+                                   'you should specify the `steps` argument'):
+        model.evaluate(dataset, verbose=0)
+      with self.assertRaisesRegexp(ValueError,
+                                   'you should specify the `steps` argument'):
+        model.predict(dataset, verbose=0)
+
+  def test_dataset_input_shape_validation(self):
+    with self.test_session():
+      x = keras.layers.Input(shape=(3,), name='input')
+      y = keras.layers.Dense(4, name='dense')(x)
+      model = keras.Model(x, y)
+
+      optimizer = RMSPropOptimizer(learning_rate=0.001)
+      loss = 'mse'
+      model.compile(optimizer, loss)
+
+      # User forgets to batch the dataset
+      inputs = np.zeros((10, 3))
+      targets = np.zeros((10, 4))
+      dataset = dataset_ops.Dataset.from_tensor_slices((inputs, targets))
+      dataset = dataset.repeat(100)
+
+      with self.assertRaisesRegexp(ValueError,
+                                   'expected input to have 2 dimensions'):
+        model.train_on_batch(dataset)
+
+      # Wrong input shape
+      inputs = np.zeros((10, 5))
+      targets = np.zeros((10, 4))
+      dataset = dataset_ops.Dataset.from_tensor_slices((inputs, targets))
+      dataset = dataset.repeat(100)
+      dataset = dataset.batch(10)
+
+      with self.assertRaisesRegexp(ValueError,
+                                   'expected input to have shape'):
+        model.train_on_batch(dataset)
+
+
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/python/keras/engine/training_utils.py b/tensorflow/python/keras/engine/training_utils.py
index 7d214d61a4d04a2083119fab3db84d3a9b151df5..728a2b493b9f076cc2942766d2677c1f24fb3c15 100644
--- a/tensorflow/python/keras/engine/training_utils.py
+++ b/tensorflow/python/keras/engine/training_utils.py
@@ -166,10 +166,16 @@ def standardize_input_data(data,
   # Check shapes compatibility.
   if shapes:
     for i in range(len(names)):
-      if shapes[i] is not None and not tensor_util.is_tensor(data[i]):
-        data_shape = data[i].shape
+      if shapes[i] is not None:
+        if tensor_util.is_tensor(data[i]):
+          tensorshape = data[i].get_shape()
+          if not tensorshape:
+            continue
+          data_shape = tuple(tensorshape.as_list())
+        else:
+          data_shape = data[i].shape
         shape = shapes[i]
-        if data[i].ndim != len(shape):
+        if len(data_shape) != len(shape):
           raise ValueError('Error when checking ' + exception_prefix +
                            ': expected ' + names[i] + ' to have ' +
                            str(len(shape)) + ' dimensions, but got array '
@@ -178,7 +184,7 @@ def standardize_input_data(data,
           data_shape = data_shape[1:]
           shape = shape[1:]
         for dim, ref_dim in zip(data_shape, shape):
-          if ref_dim != dim and ref_dim:
+          if ref_dim != dim and ref_dim is not None and dim is not None:
             raise ValueError(
                 'Error when checking ' + exception_prefix + ': expected ' +
                 names[i] + ' to have shape ' + str(shape) +
@@ -547,6 +553,10 @@ def standardize_weights(y,
 def has_symbolic_tensors(ls):
   if context.executing_eagerly():
     return False
+  return has_tensors(ls)
+
+
+def has_tensors(ls):
   if isinstance(ls, (list, tuple)):
     return any(tensor_util.is_tensor(v) for v in ls)
   return tensor_util.is_tensor(ls)
@@ -632,19 +642,20 @@ def validate_iterator_input(x, y, sample_weight, validation_split=None):
         provided by user.
   """
   if y is not None:
-    raise ValueError('You passed a dataset iterator (%s) as input `x` to '
-                     'your model. In that case, you should not specify '
-                     'a target (`y`) argument, since the dataset iterator '
-                     'generates both input data and target data. '
+    raise ValueError('You passed a dataset or dataset iterator (%s) as '
+                     'input `x` to your model. In that case, you should '
+                     'not specify a target (`y`) argument, since the dataset '
+                     'or dataset iterator generates both input data and '
+                     'target data. '
                      'Received: %s' % (x, y))
   if sample_weight is not None:
-    raise ValueError('`sample_weight` argument is not supported when input'
-                     ' `x` is a dataset iterator. '
+    raise ValueError('`sample_weight` argument is not supported when input '
+                     '`x` is a dataset or a dataset iterator. '
                      'Received: x=%s, sample_weight=%s' % (x, sample_weight))
   if validation_split is not None and validation_split != 0.0:
     raise ValueError(
         '`validation_split` argument is not supported when '
-        'input `x` is a dataset iterator. '
+        'input `x` is a dataset or a dataset iterator. '
         'Received: x=%s, validation_split=%f' % (x, validation_split))
 
 
@@ -685,3 +696,29 @@ def check_steps_argument(input_data, steps, steps_name):
                            input_type=input_type_str, steps_name=steps_name))
     return True
   return False
+
+
+def cast_if_floating_dtype(x):
+  """Casts the given data tensors to the default floating point type.
+
+  Casts only if the input is already a floating point type.
+  Args:
+    x: tensor or list/tuple of tensors.
+
+  Returns:
+    Converted input.
+
+  Raises:
+    RuntimeError: if data isn't tensors.
+  """
+  if not has_tensors(x):
+    raise RuntimeError(
+        'Please provide tensors for casting, got: {x}'.format(x=x))
+
+  if isinstance(x, (list, tuple)):
+    return [
+        math_ops.cast(val, dtype=K.floatx())
+        if tensor_util.is_tensor(val) and val.dtype.is_floating else val
+        for val in x
+    ]
+  return math_ops.cast(x, dtype=K.floatx()) if x.dtype.is_floating else x
diff --git a/tensorflow/python/keras/initializers_test.py b/tensorflow/python/keras/initializers_test.py
index a54d6da83907b71ce5f7fd6070598545731b7428..c519e194bdc21692025f259533b8b75e2dc48c09 100644
--- a/tensorflow/python/keras/initializers_test.py
+++ b/tensorflow/python/keras/initializers_test.py
@@ -71,7 +71,7 @@ class KerasInitializersTest(test.TestCase):
                                                       stddev=1,
                                                       seed=126),
                    tensor_shape,
-                   target_mean=0., target_std=None, target_max=2)
+                   target_mean=0., target_max=2, target_min=-2)
 
   def test_constant(self):
     tensor_shape = (5, 6, 4)
@@ -83,49 +83,49 @@ class KerasInitializersTest(test.TestCase):
     tensor_shape = (5, 6, 4, 2)
     with self.test_session():
       fan_in, _ = init_ops._compute_fans(tensor_shape)
-      scale = np.sqrt(3. / fan_in)
+      std = np.sqrt(1. / fan_in)
       self._runner(keras.initializers.lecun_uniform(seed=123), tensor_shape,
-                   target_mean=0., target_max=scale, target_min=-scale)
+                   target_mean=0., target_std=std)
 
   def test_glorot_uniform(self):
     tensor_shape = (5, 6, 4, 2)
     with self.test_session():
       fan_in, fan_out = init_ops._compute_fans(tensor_shape)
-      scale = np.sqrt(6. / (fan_in + fan_out))
+      std = np.sqrt(2. / (fan_in + fan_out))
       self._runner(keras.initializers.glorot_uniform(seed=123), tensor_shape,
-                   target_mean=0., target_max=scale, target_min=-scale)
+                   target_mean=0., target_std=std)
 
   def test_he_uniform(self):
     tensor_shape = (5, 6, 4, 2)
     with self.test_session():
       fan_in, _ = init_ops._compute_fans(tensor_shape)
-      scale = np.sqrt(6. / fan_in)
+      std = np.sqrt(2. / fan_in)
       self._runner(keras.initializers.he_uniform(seed=123), tensor_shape,
-                   target_mean=0., target_max=scale, target_min=-scale)
+                   target_mean=0., target_std=std)
 
   def test_lecun_normal(self):
     tensor_shape = (5, 6, 4, 2)
     with self.test_session():
       fan_in, _ = init_ops._compute_fans(tensor_shape)
-      scale = np.sqrt(1. / fan_in)
+      std = np.sqrt(1. / fan_in)
       self._runner(keras.initializers.lecun_normal(seed=123), tensor_shape,
-                   target_mean=0., target_std=None, target_max=2 * scale)
+                   target_mean=0., target_std=std)
 
   def test_glorot_normal(self):
     tensor_shape = (5, 6, 4, 2)
     with self.test_session():
       fan_in, fan_out = init_ops._compute_fans(tensor_shape)
-      scale = np.sqrt(2. / (fan_in + fan_out))
+      std = np.sqrt(2. / (fan_in + fan_out))
       self._runner(keras.initializers.glorot_normal(seed=123), tensor_shape,
-                   target_mean=0., target_std=None, target_max=2 * scale)
+                   target_mean=0., target_std=std)
 
   def test_he_normal(self):
     tensor_shape = (5, 6, 4, 2)
     with self.test_session():
       fan_in, _ = init_ops._compute_fans(tensor_shape)
-      scale = np.sqrt(2. / fan_in)
+      std = np.sqrt(2. / fan_in)
       self._runner(keras.initializers.he_normal(seed=123), tensor_shape,
-                   target_mean=0., target_std=None, target_max=2 * scale)
+                   target_mean=0., target_std=std)
 
   def test_orthogonal(self):
     tensor_shape = (20, 20)
diff --git a/tensorflow/python/keras/integration_test.py b/tensorflow/python/keras/integration_test.py
index 2e83544d97e66c3ab7efa41fb31d11752eed29b8..2a05699407cc608c1ed0dd97d230beeb6e99e0ef 100644
--- a/tensorflow/python/keras/integration_test.py
+++ b/tensorflow/python/keras/integration_test.py
@@ -29,6 +29,9 @@ from tensorflow.python.platform import test
 
 class KerasIntegrationTest(test.TestCase):
 
+  def test_version(self):
+    self.assertTrue(keras.__version__.endswith('-tf'))
+
   def test_vector_classification_sequential(self):
     with self.test_session():
       np.random.seed(1337)
diff --git a/tensorflow/python/keras/layers/__init__.py b/tensorflow/python/keras/layers/__init__.py
index 8fb663a17e16f9a16c67393327347f6cc463a5b6..ce0cdb2e1b34d1133f64965818b0a2bcef108d86 100644
--- a/tensorflow/python/keras/layers/__init__.py
+++ b/tensorflow/python/keras/layers/__init__.py
@@ -86,9 +86,11 @@ from tensorflow.python.keras.layers.local import LocallyConnected2D
 
 # Merge layers.
 from tensorflow.python.keras.layers.merge import Add
+from tensorflow.python.keras.layers.merge import Subtract
 from tensorflow.python.keras.layers.merge import Multiply
 from tensorflow.python.keras.layers.merge import Average
 from tensorflow.python.keras.layers.merge import Maximum
+from tensorflow.python.keras.layers.merge import Minimum
 from tensorflow.python.keras.layers.merge import Concatenate
 from tensorflow.python.keras.layers.merge import Dot
 from tensorflow.python.keras.layers.merge import add
diff --git a/tensorflow/python/keras/layers/convolutional.py b/tensorflow/python/keras/layers/convolutional.py
index ce1c84e98d04c84aad7aa381b2536facfae2322d..720b386c4dd5fbed95b5275c0ca7d3885afda422 100644
--- a/tensorflow/python/keras/layers/convolutional.py
+++ b/tensorflow/python/keras/layers/convolutional.py
@@ -151,21 +151,23 @@ class Conv(Layer):
     input_dim = int(input_shape[channel_axis])
     kernel_shape = self.kernel_size + (input_dim, self.filters)
 
-    self.kernel = self.add_variable(name='kernel',
-                                    shape=kernel_shape,
-                                    initializer=self.kernel_initializer,
-                                    regularizer=self.kernel_regularizer,
-                                    constraint=self.kernel_constraint,
-                                    trainable=True,
-                                    dtype=self.dtype)
+    self.kernel = self.add_weight(
+        name='kernel',
+        shape=kernel_shape,
+        initializer=self.kernel_initializer,
+        regularizer=self.kernel_regularizer,
+        constraint=self.kernel_constraint,
+        trainable=True,
+        dtype=self.dtype)
     if self.use_bias:
-      self.bias = self.add_variable(name='bias',
-                                    shape=(self.filters,),
-                                    initializer=self.bias_initializer,
-                                    regularizer=self.bias_regularizer,
-                                    constraint=self.bias_constraint,
-                                    trainable=True,
-                                    dtype=self.dtype)
+      self.bias = self.add_weight(
+          name='bias',
+          shape=(self.filters,),
+          initializer=self.bias_initializer,
+          regularizer=self.bias_regularizer,
+          constraint=self.bias_constraint,
+          trainable=True,
+          dtype=self.dtype)
     else:
       self.bias = None
     self.input_spec = InputSpec(ndim=self.rank + 2,
@@ -380,11 +382,11 @@ class Conv2D(Conv):
       filters: Integer, the dimensionality of the output space
           (i.e. the number of output filters in the convolution).
       kernel_size: An integer or tuple/list of 2 integers, specifying the
-          width and height of the 2D convolution window.
+          height and width of the 2D convolution window.
           Can be a single integer to specify the same value for
           all spatial dimensions.
       strides: An integer or tuple/list of 2 integers,
-          specifying the strides of the convolution along the width and height.
+          specifying the strides of the convolution along the height and width.
           Can be a single integer to specify the same value for
           all spatial dimensions.
           Specifying any stride value != 1 is incompatible with specifying
@@ -611,11 +613,11 @@ class Conv2DTranspose(Conv2D):
       filters: Integer, the dimensionality of the output space
           (i.e. the number of output filters in the convolution).
       kernel_size: An integer or tuple/list of 2 integers, specifying the
-          width and height of the 2D convolution window.
+          height and width of the 2D convolution window.
           Can be a single integer to specify the same value for
           all spatial dimensions.
       strides: An integer or tuple/list of 2 integers,
-          specifying the strides of the convolution along the width and height.
+          specifying the strides of the convolution along the height and width.
           Can be a single integer to specify the same value for
           all spatial dimensions.
           Specifying any stride value != 1 is incompatible with specifying
@@ -720,21 +722,23 @@ class Conv2DTranspose(Conv2D):
     self.input_spec = InputSpec(ndim=4, axes={channel_axis: input_dim})
     kernel_shape = self.kernel_size + (self.filters, input_dim)
 
-    self.kernel = self.add_variable(name='kernel',
-                                    shape=kernel_shape,
-                                    initializer=self.kernel_initializer,
-                                    regularizer=self.kernel_regularizer,
-                                    constraint=self.kernel_constraint,
-                                    trainable=True,
-                                    dtype=self.dtype)
+    self.kernel = self.add_weight(
+        name='kernel',
+        shape=kernel_shape,
+        initializer=self.kernel_initializer,
+        regularizer=self.kernel_regularizer,
+        constraint=self.kernel_constraint,
+        trainable=True,
+        dtype=self.dtype)
     if self.use_bias:
-      self.bias = self.add_variable(name='bias',
-                                    shape=(self.filters,),
-                                    initializer=self.bias_initializer,
-                                    regularizer=self.bias_regularizer,
-                                    constraint=self.bias_constraint,
-                                    trainable=True,
-                                    dtype=self.dtype)
+      self.bias = self.add_weight(
+          name='bias',
+          shape=(self.filters,),
+          initializer=self.bias_initializer,
+          regularizer=self.bias_regularizer,
+          constraint=self.bias_constraint,
+          trainable=True,
+          dtype=self.dtype)
     else:
       self.bias = None
     self.built = True
@@ -961,7 +965,7 @@ class Conv3DTranspose(Conv3D):
     kernel_shape = self.kernel_size + (self.filters, input_dim)
     self.input_spec = InputSpec(ndim=5, axes={channel_axis: input_dim})
 
-    self.kernel = self.add_variable(
+    self.kernel = self.add_weight(
         'kernel',
         shape=kernel_shape,
         initializer=self.kernel_initializer,
@@ -970,7 +974,7 @@ class Conv3DTranspose(Conv3D):
         trainable=True,
         dtype=self.dtype)
     if self.use_bias:
-      self.bias = self.add_variable(
+      self.bias = self.add_weight(
           'bias',
           shape=(self.filters,),
           initializer=self.bias_initializer,
@@ -1222,7 +1226,7 @@ class SeparableConv(Conv):
     pointwise_kernel_shape = (
         1,) * self.rank + (self.depth_multiplier * input_dim, self.filters)
 
-    self.depthwise_kernel = self.add_variable(
+    self.depthwise_kernel = self.add_weight(
         name='depthwise_kernel',
         shape=depthwise_kernel_shape,
         initializer=self.depthwise_initializer,
@@ -1230,7 +1234,7 @@ class SeparableConv(Conv):
         constraint=self.depthwise_constraint,
         trainable=True,
         dtype=self.dtype)
-    self.pointwise_kernel = self.add_variable(
+    self.pointwise_kernel = self.add_weight(
         name='pointwise_kernel',
         shape=pointwise_kernel_shape,
         initializer=self.pointwise_initializer,
@@ -1239,13 +1243,14 @@ class SeparableConv(Conv):
         trainable=True,
         dtype=self.dtype)
     if self.use_bias:
-      self.bias = self.add_variable(name='bias',
-                                    shape=(self.filters,),
-                                    initializer=self.bias_initializer,
-                                    regularizer=self.bias_regularizer,
-                                    constraint=self.bias_constraint,
-                                    trainable=True,
-                                    dtype=self.dtype)
+      self.bias = self.add_weight(
+          name='bias',
+          shape=(self.filters,),
+          initializer=self.bias_initializer,
+          regularizer=self.bias_regularizer,
+          constraint=self.bias_constraint,
+          trainable=True,
+          dtype=self.dtype)
     else:
       self.bias = None
     self.built = True
@@ -1447,11 +1452,11 @@ class SeparableConv2D(SeparableConv):
       filters: Integer, the dimensionality of the output space
           (i.e. the number of output filters in the convolution).
       kernel_size: An integer or tuple/list of 2 integers, specifying the
-          width and height of the 2D convolution window.
+          height and width of the 2D convolution window.
           Can be a single integer to specify the same value for
           all spatial dimensions.
       strides: An integer or tuple/list of 2 integers,
-          specifying the strides of the convolution along the width and height.
+          specifying the strides of the convolution along the height and width.
           Can be a single integer to specify the same value for
           all spatial dimensions.
           Specifying any stride value != 1 is incompatible with specifying
@@ -1591,11 +1596,11 @@ class DepthwiseConv2D(Conv2D):
 
   Arguments:
     kernel_size: An integer or tuple/list of 2 integers, specifying the
-        width and height of the 2D convolution window.
+        height and width of the 2D convolution window.
         Can be a single integer to specify the same value for
         all spatial dimensions.
     strides: An integer or tuple/list of 2 integers,
-        specifying the strides of the convolution along the width and height.
+        specifying the strides of the convolution along the height and width.
         Can be a single integer to specify the same value for
         all spatial dimensions.
         Specifying any stride value != 1 is incompatible with specifying
@@ -2002,7 +2007,7 @@ class ZeroPadding2D(Layer):
   Arguments:
       padding: int, or tuple of 2 ints, or tuple of 2 tuples of 2 ints.
           - If int: the same symmetric padding
-              is applied to width and height.
+              is applied to height and width.
           - If tuple of 2 ints:
               interpreted as two different
               symmetric padding values for height and width:
@@ -2101,7 +2106,7 @@ class ZeroPadding3D(Layer):
   Arguments:
       padding: int, or tuple of 3 ints, or tuple of 3 tuples of 2 ints.
           - If int: the same symmetric padding
-              is applied to width and height.
+              is applied to height and width.
           - If tuple of 3 ints:
               interpreted as two different
               symmetric padding values for height and width:
@@ -2261,12 +2266,12 @@ class Cropping1D(Layer):
 class Cropping2D(Layer):
   """Cropping layer for 2D input (e.g. picture).
 
-  It crops along spatial dimensions, i.e. width and height.
+  It crops along spatial dimensions, i.e. height and width.
 
   Arguments:
       cropping: int, or tuple of 2 ints, or tuple of 2 tuples of 2 ints.
           - If int: the same symmetric cropping
-              is applied to width and height.
+              is applied to height and width.
           - If tuple of 2 ints:
               interpreted as two different
               symmetric cropping values for height and width:
diff --git a/tensorflow/python/keras/layers/core.py b/tensorflow/python/keras/layers/core.py
index df4c3915a3097d52553557208b074f6923341673..f60064ed6363d36731795d08bb42e75398628283 100644
--- a/tensorflow/python/keras/layers/core.py
+++ b/tensorflow/python/keras/layers/core.py
@@ -19,7 +19,9 @@ from __future__ import division
 from __future__ import print_function
 
 import copy
+import sys
 import types as python_types
+import warnings
 
 import numpy as np
 
@@ -714,6 +716,7 @@ class Lambda(Layer):
     return self.mask
 
   def get_config(self):
+    module = self.function.__module__
     if isinstance(self.function, python_types.LambdaType):
       function = generic_utils.func_dump(self.function)
       function_type = 'lambda'
@@ -721,21 +724,26 @@ class Lambda(Layer):
       function = self.function.__name__
       function_type = 'function'
 
+    output_shape_module = None
     if isinstance(self._output_shape, python_types.LambdaType):
       output_shape = generic_utils.func_dump(self._output_shape)
       output_shape_type = 'lambda'
+      output_shape_module = self._output_shape.__module__
     elif callable(self._output_shape):
       output_shape = self._output_shape.__name__
       output_shape_type = 'function'
+      output_shape_module = self._output_shape.__module__
     else:
       output_shape = self._output_shape
       output_shape_type = 'raw'
 
     config = {
         'function': function,
+        'module': module,
         'function_type': function_type,
         'output_shape': output_shape,
         'output_shape_type': output_shape_type,
+        'output_shape_module': output_shape_module,
         'arguments': self.arguments
     }
     base_config = super(Lambda, self).get_config()
@@ -745,8 +753,16 @@ class Lambda(Layer):
   def from_config(cls, config, custom_objects=None):
     config = config.copy()
     globs = globals()
+    module = config.pop('module', None)
+    if module in sys.modules:
+      globs.update(sys.modules[module].__dict__)
+    elif module is not None:
+      # Note: we don't know the name of the function if it's a lambda.
+      warnings.warn('{} is not loaded, but a Lambda layer uses it. '
+                    'It may cause errors.'.format(module)
+                    , UserWarning)
     if custom_objects:
-      globs = dict(list(globs.items()) + list(custom_objects.items()))
+      globs.update(custom_objects)
     function_type = config.pop('function_type')
     if function_type == 'function':
       # Simple lookup in custom objects
@@ -760,6 +776,14 @@ class Lambda(Layer):
     else:
       raise TypeError('Unknown function type:', function_type)
 
+    output_shape_module = config.pop('output_shape_module', None)
+    if output_shape_module in sys.modules:
+      globs.update(sys.modules[output_shape_module].__dict__)
+    elif output_shape_module is not None:
+      # Note: we don't know the name of the function if it's a lambda.
+      warnings.warn('{} is not loaded, but a Lambda layer uses it. '
+                    'It may cause errors.'.format(output_shape_module)
+                    , UserWarning)
     output_shape_type = config.pop('output_shape_type')
     if output_shape_type == 'function':
       # Simple lookup in custom objects
@@ -882,21 +906,23 @@ class Dense(Layer):
                        'should be defined. Found `None`.')
     self.input_spec = InputSpec(min_ndim=2,
                                 axes={-1: input_shape[-1].value})
-    self.kernel = self.add_variable('kernel',
-                                    shape=[input_shape[-1].value, self.units],
-                                    initializer=self.kernel_initializer,
-                                    regularizer=self.kernel_regularizer,
-                                    constraint=self.kernel_constraint,
-                                    dtype=self.dtype,
-                                    trainable=True)
+    self.kernel = self.add_weight(
+        'kernel',
+        shape=[input_shape[-1].value, self.units],
+        initializer=self.kernel_initializer,
+        regularizer=self.kernel_regularizer,
+        constraint=self.kernel_constraint,
+        dtype=self.dtype,
+        trainable=True)
     if self.use_bias:
-      self.bias = self.add_variable('bias',
-                                    shape=[self.units,],
-                                    initializer=self.bias_initializer,
-                                    regularizer=self.bias_regularizer,
-                                    constraint=self.bias_constraint,
-                                    dtype=self.dtype,
-                                    trainable=True)
+      self.bias = self.add_weight(
+          'bias',
+          shape=[self.units,],
+          initializer=self.bias_initializer,
+          regularizer=self.bias_regularizer,
+          constraint=self.bias_constraint,
+          dtype=self.dtype,
+          trainable=True)
     else:
       self.bias = None
     self.built = True
diff --git a/tensorflow/python/keras/layers/cudnn_recurrent.py b/tensorflow/python/keras/layers/cudnn_recurrent.py
index 5c4a2dbe92e75296b21bfdb628621efa03f19e02..ad6594279d037c8dc0e1408955d2a2eebd51ce1d 100644
--- a/tensorflow/python/keras/layers/cudnn_recurrent.py
+++ b/tensorflow/python/keras/layers/cudnn_recurrent.py
@@ -20,6 +20,7 @@ from __future__ import print_function
 
 import collections
 
+from tensorflow.python.framework import constant_op
 from tensorflow.python.keras import backend as K
 from tensorflow.python.keras import constraints
 from tensorflow.python.keras import initializers
@@ -71,10 +72,11 @@ class _CuDNNRNN(RNN):
     self.constants_spec = None
     self._states = None
     self._num_constants = None
+    self._vector_shape = constant_op.constant([-1])
 
   def _canonical_to_params(self, weights, biases):
-    weights = [array_ops.reshape(x, (-1,)) for x in weights]
-    biases = [array_ops.reshape(x, (-1,)) for x in biases]
+    weights = [array_ops.reshape(x, self._vector_shape) for x in weights]
+    biases = [array_ops.reshape(x, self._vector_shape) for x in biases]
     return array_ops.concat(weights + biases, axis=0)
 
   def call(self, inputs, mask=None, training=None, initial_state=None):
diff --git a/tensorflow/python/keras/layers/local.py b/tensorflow/python/keras/layers/local.py
index 46c18b763e80b58da1ec0c2655978753af75b4f8..f222ea3083bad48094fbec7fe6750921f0233e35 100644
--- a/tensorflow/python/keras/layers/local.py
+++ b/tensorflow/python/keras/layers/local.py
@@ -62,6 +62,16 @@ class LocallyConnected1D(Layer):
           any `dilation_rate` value != 1.
       padding: Currently only supports `"valid"` (case-insensitive).
           `"same"` may be supported in the future.
+      data_format: A string,
+          one of `channels_last` (default) or `channels_first`.
+          The ordering of the dimensions in the inputs.
+          `channels_last` corresponds to inputs with shape
+          `(batch, length, channels)` while `channels_first`
+          corresponds to inputs with shape
+          `(batch, channels, length)`.
+          It defaults to the `image_data_format` value found in your
+          Keras config file at `~/.keras/keras.json`.
+          If you never set it, then it will be "channels_last".
       activation: Activation function to use.
           If you don't specify anything, no activation is applied
           (ie. "linear" activation: `a(x) = x`).
@@ -122,12 +132,16 @@ class LocallyConnected1D(Layer):
 
   @tf_utils.shape_type_conversion
   def build(self, input_shape):
-    input_dim = input_shape[2]
+    if self.data_format == 'channels_first':
+      input_dim, input_length = input_shape[1], input_shape[2]
+    else:
+      input_dim, input_length = input_shape[2], input_shape[1]
+
     if input_dim is None:
       raise ValueError('Axis 2 of input should be fully-defined. '
                        'Found shape:', input_shape)
     output_length = conv_utils.conv_output_length(
-        input_shape[1], self.kernel_size[0], self.padding, self.strides[0])
+        input_length, self.kernel_size[0], self.padding, self.strides[0])
     self.kernel_shape = (output_length, self.kernel_size[0] * input_dim,
                          self.filters)
     self.kernel = self.add_weight(
@@ -145,19 +159,33 @@ class LocallyConnected1D(Layer):
           constraint=self.bias_constraint)
     else:
       self.bias = None
-    self.input_spec = InputSpec(ndim=3, axes={2: input_dim})
+
+    if self.data_format == 'channels_first':
+      self.input_spec = InputSpec(ndim=3, axes={1: input_dim})
+    else:
+      self.input_spec = InputSpec(ndim=3, axes={-1: input_dim})
     self.built = True
 
   @tf_utils.shape_type_conversion
   def compute_output_shape(self, input_shape):
-    length = conv_utils.conv_output_length(input_shape[1], self.kernel_size[0],
+    if self.data_format == 'channels_first':
+      input_length = input_shape[2]
+    else:
+      input_length = input_shape[1]
+
+    length = conv_utils.conv_output_length(input_length, self.kernel_size[0],
                                            self.padding, self.strides[0])
-    return (input_shape[0], length, self.filters)
+
+    if self.data_format == 'channels_first':
+      return (input_shape[0], self.filters, length)
+    elif self.data_format == 'channels_last':
+      return (input_shape[0], length, self.filters)
 
   def call(self, inputs):
-    output = K.local_conv1d(inputs, self.kernel, self.kernel_size, self.strides)
+    output = K.local_conv1d(inputs, self.kernel, self.kernel_size,
+                            self.strides, self.data_format)
     if self.use_bias:
-      output = K.bias_add(output, self.bias)
+      output = K.bias_add(output, self.bias, data_format=self.data_format)
     if self.activation is not None:
       output = self.activation(output)
     return output
@@ -172,6 +200,8 @@ class LocallyConnected1D(Layer):
             self.strides,
         'padding':
             self.padding,
+        'data_format':
+            self.data_format,
         'activation':
             activations.serialize(self.activation),
         'use_bias':
diff --git a/tensorflow/python/keras/layers/local_test.py b/tensorflow/python/keras/layers/local_test.py
index 90ae1719e171b19e1c3b95fef434bd53285c858c..8df3f6b7bd741ad0b698fe500f0ac72e73985421 100644
--- a/tensorflow/python/keras/layers/local_test.py
+++ b/tensorflow/python/keras/layers/local_test.py
@@ -40,16 +40,17 @@ class LocallyConnectedLayersTest(test.TestCase):
       for strides in [1]:
         if padding == 'same' and strides != 1:
           continue
-
-        testing_utils.layer_test(
-            keras.layers.LocallyConnected1D,
-            kwargs={
-                'filters': filters,
-                'kernel_size': filter_length,
-                'padding': padding,
-                'strides': strides
-            },
-            input_shape=(num_samples, num_steps, input_dim))
+        for data_format in ['channels_first', 'channels_last']:
+          testing_utils.layer_test(
+              keras.layers.LocallyConnected1D,
+              kwargs={
+                  'filters': filters,
+                  'kernel_size': filter_length,
+                  'padding': padding,
+                  'strides': strides,
+                  'data_format': data_format
+              },
+              input_shape=(num_samples, num_steps, input_dim))
 
   def test_locallyconnected_1d_regularization(self):
     num_samples = 2
@@ -57,35 +58,39 @@ class LocallyConnectedLayersTest(test.TestCase):
     input_dim = 5
     filter_length = 3
     filters = 4
-    kwargs = {
-        'filters': filters,
-        'kernel_size': filter_length,
-        'kernel_regularizer': 'l2',
-        'bias_regularizer': 'l2',
-        'activity_regularizer': 'l2',
-    }
-
-    with self.test_session():
-      layer = keras.layers.LocallyConnected1D(**kwargs)
-      layer.build((num_samples, num_steps, input_dim))
-      self.assertEqual(len(layer.losses), 2)
-      layer(
-          keras.backend.variable(np.ones((num_samples, num_steps, input_dim))))
-      self.assertEqual(len(layer.losses), 3)
-
-    k_constraint = keras.constraints.max_norm(0.01)
-    b_constraint = keras.constraints.max_norm(0.01)
-    kwargs = {
-        'filters': filters,
-        'kernel_size': filter_length,
-        'kernel_constraint': k_constraint,
-        'bias_constraint': b_constraint,
-    }
-    with self.test_session():
-      layer = keras.layers.LocallyConnected1D(**kwargs)
-      layer.build((num_samples, num_steps, input_dim))
-      self.assertEqual(layer.kernel.constraint, k_constraint)
-      self.assertEqual(layer.bias.constraint, b_constraint)
+    for data_format in ['channels_first', 'channels_last']:
+      kwargs = {
+          'filters': filters,
+          'kernel_size': filter_length,
+          'kernel_regularizer': 'l2',
+          'bias_regularizer': 'l2',
+          'activity_regularizer': 'l2',
+          'data_format': data_format
+      }
+
+      with self.test_session():
+        layer = keras.layers.LocallyConnected1D(**kwargs)
+        layer.build((num_samples, num_steps, input_dim))
+        self.assertEqual(len(layer.losses), 2)
+        layer(
+            keras.backend.variable(np.ones((num_samples,
+                                            num_steps,
+                                            input_dim))))
+        self.assertEqual(len(layer.losses), 3)
+
+      k_constraint = keras.constraints.max_norm(0.01)
+      b_constraint = keras.constraints.max_norm(0.01)
+      kwargs = {
+          'filters': filters,
+          'kernel_size': filter_length,
+          'kernel_constraint': k_constraint,
+          'bias_constraint': b_constraint,
+      }
+      with self.test_session():
+        layer = keras.layers.LocallyConnected1D(**kwargs)
+        layer.build((num_samples, num_steps, input_dim))
+        self.assertEqual(layer.kernel.constraint, k_constraint)
+        self.assertEqual(layer.bias.constraint, b_constraint)
 
   @tf_test_util.run_in_graph_and_eager_modes()
   def test_locallyconnected_2d(self):
@@ -113,6 +118,7 @@ class LocallyConnectedLayersTest(test.TestCase):
             },
             input_shape=(num_samples, num_row, num_col, stack_size))
 
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_locallyconnected_2d_channels_first(self):
     num_samples = 8
     filters = 3
@@ -120,15 +126,14 @@ class LocallyConnectedLayersTest(test.TestCase):
     num_row = 6
     num_col = 10
 
-    with self.test_session():
-      testing_utils.layer_test(
-          keras.layers.LocallyConnected2D,
-          kwargs={
-              'filters': filters,
-              'kernel_size': 3,
-              'data_format': 'channels_first'
-          },
-          input_shape=(num_samples, num_row, num_col, stack_size))
+    testing_utils.layer_test(
+        keras.layers.LocallyConnected2D,
+        kwargs={
+            'filters': filters,
+            'kernel_size': 3,
+            'data_format': 'channels_first'
+        },
+        input_shape=(num_samples, num_row, num_col, stack_size))
 
   def test_locallyconnected_2d_regularization(self):
     num_samples = 8
diff --git a/tensorflow/python/keras/layers/merge.py b/tensorflow/python/keras/layers/merge.py
index 683e3e0ed1ce9a1fc56dcae0c0c8841148f008d5..f295af3fe04d87d260e4f6a98762dcfb90883531 100644
--- a/tensorflow/python/keras/layers/merge.py
+++ b/tensorflow/python/keras/layers/merge.py
@@ -250,6 +250,7 @@ class Add(_Merge):
     return output
 
 
+@tf_export('keras.layers.Subtract')
 class Subtract(_Merge):
   """Layer that subtracts two inputs.
 
@@ -336,6 +337,7 @@ class Maximum(_Merge):
     return output
 
 
+@tf_export('keras.layers.Minimum')
 class Minimum(_Merge):
   """Layer that computes the minimum (element-wise) a list of inputs.
 
@@ -446,8 +448,8 @@ class Concatenate(_Merge):
 class Dot(_Merge):
   """Layer that computes a dot product between samples in two tensors.
 
-  E.g. if applied to two tensors `a` and `b` of shape `(batch_size, n)`,
-  the output will be a tensor of shape `(batch_size, 1)`
+  E.g. if applied to a list of two tensors `a` and `b` of shape
+  `(batch_size, n)`, the output will be a tensor of shape `(batch_size, 1)`
   where each entry `i` will be the dot product between
   `a[i]` and `b[i]`.
 
@@ -586,6 +588,7 @@ def add(inputs, **kwargs):
   return Add(**kwargs)(inputs)
 
 
+@tf_export('keras.layers.subtract')
 def subtract(inputs, **kwargs):
   """Functional interface to the `Subtract` layer.
 
@@ -656,6 +659,7 @@ def maximum(inputs, **kwargs):
   return Maximum(**kwargs)(inputs)
 
 
+@tf_export('keras.layers.minimum')
 def minimum(inputs, **kwargs):
   """Functional interface to the `Minimum` layer.
 
diff --git a/tensorflow/python/keras/layers/normalization.py b/tensorflow/python/keras/layers/normalization.py
index c0dc5220f1ea63930e787fee4d8ff95e2c4cb321..ff51eadee9928902cb310b28569c06a49d8e5ecc 100644
--- a/tensorflow/python/keras/layers/normalization.py
+++ b/tensorflow/python/keras/layers/normalization.py
@@ -183,7 +183,7 @@ class BatchNormalization(Layer):
   def _add_tower_local_variable(self, *args, **kwargs):
     tower_context = distribute_lib.get_tower_context()
     with tower_context.tower_local_var_scope('mean'):
-      return self.add_variable(*args, **kwargs)
+      return self.add_weight(*args, **kwargs)
 
   def build(self, input_shape):
     input_shape = tensor_shape.TensorShape(input_shape)
@@ -276,7 +276,7 @@ class BatchNormalization(Layer):
           self.axis[idx] = x + 1      # Account for added dimension
 
     if self.scale:
-      self.gamma = self.add_variable(
+      self.gamma = self.add_weight(
           name='gamma',
           shape=param_shape,
           dtype=param_dtype,
@@ -291,7 +291,7 @@ class BatchNormalization(Layer):
             1.0, dtype=param_dtype, shape=param_shape)
 
     if self.center:
-      self.beta = self.add_variable(
+      self.beta = self.add_weight(
           name='beta',
           shape=param_shape,
           dtype=param_dtype,
@@ -574,28 +574,26 @@ class BatchNormalization(Layer):
                                      lambda: variance,
                                      lambda: moving_variance)
 
+      if self.virtual_batch_size is not None:
+        # This isn't strictly correct since in ghost batch norm, you are
+        # supposed to sequentially update the moving_mean and moving_variance
+        # with each sub-batch. However, since the moving statistics are only
+        # used during evaluation, it is more efficient to just update in one
+        # step and should not make a significant difference in the result.
+        new_mean = math_ops.reduce_mean(mean, axis=1, keepdims=True)
+        new_variance = math_ops.reduce_mean(variance, axis=1, keepdims=True)
+      else:
+        new_mean, new_variance = mean, variance
+
       if self.renorm:
         r, d, new_mean, new_variance = self._renorm_correction_and_moments(
-            mean, variance, training)
+            new_mean, new_variance, training)
         # When training, the normalized values (say, x) will be transformed as
         # x * gamma + beta without renorm, and (x * r + d) * gamma + beta
         # = x * (r * gamma) + (d * gamma + beta) with renorm.
         r = _broadcast(array_ops.stop_gradient(r, name='renorm_r'))
         d = _broadcast(array_ops.stop_gradient(d, name='renorm_d'))
         scale, offset = _compose_transforms(r, d, scale, offset)
-      else:
-        new_mean, new_variance = mean, variance
-
-      if self.virtual_batch_size is not None:
-        # This isn't strictly correct since in ghost batch norm, you are
-        # supposed to sequentially update the moving_mean and moving_variance
-        # with each sub-batch. However, since the moving statistics are only
-        # used during evaluation, it is more efficient to just update in one
-        # step and should not make a significant difference in the result.
-        new_mean = math_ops.reduce_mean(new_mean,
-                                        axis=1, keepdims=True)
-        new_variance = math_ops.reduce_mean(new_variance,
-                                            axis=1, keepdims=True)
 
       def _do_update(var, value):
         if in_eager_mode and not self.trainable:
diff --git a/tensorflow/python/keras/losses.py b/tensorflow/python/keras/losses.py
index d82ebd9c314c0427da86f6f6f617b7b282240c3d..9f548bfe0408d5c053c25b9ae14810d582b83e1e 100644
--- a/tensorflow/python/keras/losses.py
+++ b/tensorflow/python/keras/losses.py
@@ -30,19 +30,31 @@ from tensorflow.python.util.tf_export import tf_export
 
 
 @tf_export('keras.metrics.mean_squared_error',
-           'keras.losses.mean_squared_error')
+           'keras.metrics.mse',
+           'keras.metrics.MSE',
+           'keras.losses.mean_squared_error',
+           'keras.losses.mse',
+           'keras.losses.MSE')
 def mean_squared_error(y_true, y_pred):
   return K.mean(math_ops.square(y_pred - y_true), axis=-1)
 
 
 @tf_export('keras.metrics.mean_absolute_error',
-           'keras.losses.mean_absolute_error')
+           'keras.metrics.mae',
+           'keras.metrics.MAE',
+           'keras.losses.mean_absolute_error',
+           'keras.losses.mae',
+           'keras.losses.MAE')
 def mean_absolute_error(y_true, y_pred):
   return K.mean(math_ops.abs(y_pred - y_true), axis=-1)
 
 
 @tf_export('keras.metrics.mean_absolute_percentage_error',
-           'keras.losses.mean_absolute_percentage_error')
+           'keras.metrics.mape',
+           'keras.metrics.MAPE',
+           'keras.losses.mean_absolute_percentage_error',
+           'keras.losses.mape',
+           'keras.losses.MAPE')
 def mean_absolute_percentage_error(y_true, y_pred):
   diff = math_ops.abs(
       (y_true - y_pred) / K.clip(math_ops.abs(y_true), K.epsilon(), None))
@@ -50,7 +62,11 @@ def mean_absolute_percentage_error(y_true, y_pred):
 
 
 @tf_export('keras.metrics.mean_squared_logarithmic_error',
-           'keras.losses.mean_squared_logarithmic_error')
+           'keras.metrics.msle',
+           'keras.metrics.MSLE',
+           'keras.losses.mean_squared_logarithmic_error',
+           'keras.losses.msle',
+           'keras.losses.MSLE')
 def mean_squared_logarithmic_error(y_true, y_pred):
   first_log = math_ops.log(K.clip(y_pred, K.epsilon(), None) + 1.)
   second_log = math_ops.log(K.clip(y_true, K.epsilon(), None) + 1.)
@@ -117,7 +133,11 @@ def binary_crossentropy(y_true, y_pred):
 
 
 @tf_export('keras.metrics.kullback_leibler_divergence',
-           'keras.losses.kullback_leibler_divergence')
+           'keras.metrics.kld',
+           'keras.metrics.KLD',
+           'keras.losses.kullback_leibler_divergence',
+           'keras.losses.kld',
+           'keras.losses.KLD')
 def kullback_leibler_divergence(y_true, y_pred):
   y_true = K.clip(y_true, K.epsilon(), 1)
   y_pred = K.clip(y_pred, K.epsilon(), 1)
@@ -129,7 +149,10 @@ def poisson(y_true, y_pred):
   return K.mean(y_pred - y_true * math_ops.log(y_pred + K.epsilon()), axis=-1)
 
 
-@tf_export('keras.metrics.cosine_proximity', 'keras.losses.cosine_proximity')
+@tf_export('keras.metrics.cosine_proximity',
+           'keras.metrics.cosine',
+           'keras.losses.cosine_proximity',
+           'keras.losses.cosine')
 def cosine_proximity(y_true, y_pred):
   y_true = nn.l2_normalize(y_true, axis=-1)
   y_pred = nn.l2_normalize(y_pred, axis=-1)
diff --git a/tensorflow/python/keras/model_subclassing_test.py b/tensorflow/python/keras/model_subclassing_test.py
index 558854ab97bec203281162a03a8d513e487b3dfb..8fb957da439dd490bc3378df96f611733335c809 100644
--- a/tensorflow/python/keras/model_subclassing_test.py
+++ b/tensorflow/python/keras/model_subclassing_test.py
@@ -56,8 +56,8 @@ class SimpleTestModel(keras.Model):
     if self.use_bn:
       self.bn = keras.layers.BatchNormalization(axis=-1)
 
-  def call(self, inputs):
-    x = self.dense1(inputs)
+  def call(self, x):
+    x = self.dense1(x)
     if self.use_dp:
       x = self.dp(x)
     if self.use_bn:
@@ -622,6 +622,51 @@ class ModelSubclassingTest(test.TestCase):
     self.assertIs(m.isdep, m._checkpoint_dependencies[0].ref)
     self.assertEqual('notdep_var:0', m.notdep_var.name)
 
+  def test_extra_variable(self):
+
+    class ExtraVar(keras.Model):
+
+      def __init__(self):
+        super(ExtraVar, self).__init__()
+        self.dense = keras.layers.Dense(1)
+        self.var = resource_variable_ops.ResourceVariable(1.)
+        self.not_trainable_var = resource_variable_ops.ResourceVariable(
+            2., trainable=False)
+
+      def call(self, inputs):
+        return self.dense(inputs + self.var)
+
+    m = ExtraVar()
+    self.assertTrue(m.trainable)
+    self.assertEqual([m.dense], m.layers)
+    self.assertEqual([m.var, m.not_trainable_var], m.variables)
+    self.assertEqual([m.var], m.trainable_variables)
+    self.assertEqual([m.not_trainable_var], m.non_trainable_variables)
+    m.trainable = False
+    self.assertEqual([m.var, m.not_trainable_var], m.variables)
+    self.assertEqual([], m.trainable_variables)
+    self.assertEqual([m.var, m.not_trainable_var], m.non_trainable_variables)
+    m.trainable = True
+
+    m(array_ops.ones([1, 1]))
+
+    self.assertEqual([m.dense.kernel, m.dense.bias], m.dense.variables)
+    self.assertEqual([m.dense.kernel, m.dense.bias], m.dense.weights)
+
+    self.assertEqual([m.dense.kernel, m.dense.bias, m.var, m.not_trainable_var],
+                     m.variables)
+    self.assertEqual([m.dense.kernel, m.dense.bias, m.var],
+                     m.trainable_variables)
+    self.assertEqual([m.not_trainable_var], m.non_trainable_variables)
+
+    m.dense.trainable = False
+    self.assertEqual(
+        [m.var, m.dense.kernel, m.dense.bias, m.not_trainable_var],
+        m.variables)
+    self.assertEqual([m.var], m.trainable_variables)
+    self.assertEqual([m.dense.kernel, m.dense.bias, m.not_trainable_var],
+                     m.non_trainable_variables)
+
 
 class CustomCallModel(keras.Model):
 
diff --git a/tensorflow/python/keras/models_test.py b/tensorflow/python/keras/models_test.py
index 01fb41b8ee134e29fdd9852b6c2f10a6bd79ef1c..e6e45902a8f117e5765249da18afa7cc35aa6b16 100644
--- a/tensorflow/python/keras/models_test.py
+++ b/tensorflow/python/keras/models_test.py
@@ -18,10 +18,14 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import os
+
 import numpy as np
 
 from tensorflow.python import keras
+from tensorflow.python.framework import test_util
 from tensorflow.python.platform import test
+from tensorflow.python.training import adam
 
 
 class TestModelCloning(test.TestCase):
@@ -123,5 +127,36 @@ class TestModelCloning(test.TestCase):
       keras.models._clone_sequential_model(seq_model, input_tensors=y)
 
 
+class CheckpointingTests(test.TestCase):
+
+  @test_util.run_in_graph_and_eager_modes()
+  def test_optimizer_dependency(self):
+    model = keras.models.Sequential()
+    model.add(keras.layers.Dense(1, input_shape=(4,)))
+    opt = adam.AdamOptimizer(0.01)
+    model.compile(optimizer=opt, loss='mse')
+    model.fit(x=np.array([[1., 2., 3., 4.]]), y=[1.], epochs=2)
+    save_prefix = os.path.join(self.get_temp_dir(), 'ckpt')
+    beta1_power, _ = opt._get_beta_accumulators()
+    self.evaluate(beta1_power.assign(12.))
+    model.save_weights(save_prefix)
+    self.evaluate(beta1_power.assign(13.))
+    model.load_weights(save_prefix)
+    self.assertEqual(12., self.evaluate(beta1_power))
+
+class TestModelBackend(test.TestCase):
+
+  def test_model_backend_float64_use_cases(self):
+    # Test case for GitHub issue 19318
+    floatx = keras.backend.floatx()
+    keras.backend.set_floatx('float64')
+
+    x = keras.Input((5,))
+    y = keras.layers.Dense(1)(x)
+    model = keras.models.Model(x, y)
+    model.compile('rmsprop', 'mse')
+
+    keras.backend.set_floatx(floatx)
+
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/python/keras/optimizers.py b/tensorflow/python/keras/optimizers.py
index febbda4df6c6e2ad67d30f7337fbd518da3d2439..f58aeaea1acae2717f00a0323b5ff297a8cc8b46 100644
--- a/tensorflow/python/keras/optimizers.py
+++ b/tensorflow/python/keras/optimizers.py
@@ -35,6 +35,7 @@ from tensorflow.python.ops import state_ops
 from tensorflow.python.training import distribute as distribute_lib
 from tensorflow.python.training import optimizer as tf_optimizer_module
 from tensorflow.python.training import training_util
+from tensorflow.python.training.checkpointable import base as checkpointable
 from tensorflow.python.util.tf_export import tf_export
 
 
@@ -718,7 +719,7 @@ class Nadam(Optimizer):
     return dict(list(base_config.items()) + list(config.items()))
 
 
-class TFOptimizer(Optimizer):
+class TFOptimizer(Optimizer, checkpointable.Checkpointable):
   """Wrapper class for native TensorFlow optimizers.
   """
 
diff --git a/tensorflow/python/keras/utils/data_utils.py b/tensorflow/python/keras/utils/data_utils.py
index a1f89d9d43400983baaf81d47aeb480d4d8f30c4..c1ee34ae467b7037bafa53ea1a9b4b8596917df4 100644
--- a/tensorflow/python/keras/utils/data_utils.py
+++ b/tensorflow/python/keras/utils/data_utils.py
@@ -324,12 +324,12 @@ def validate_file(fpath, file_hash, algorithm='auto', chunk_size=65535):
 class Sequence(object):
   """Base object for fitting to a sequence of data, such as a dataset.
 
-  Every `Sequence` must implements the `__getitem__` and the `__len__` methods.
+  Every `Sequence` must implement the `__getitem__` and the `__len__` methods.
   If you want to modify your dataset between epochs you may implement
   `on_epoch_end`.
   The method `__getitem__` should return a complete batch.
 
-  # Notes
+  Notes:
 
   `Sequence` are a safer way to do multiprocessing. This structure guarantees
   that the network will only train once
diff --git a/tensorflow/python/keras/utils/io_utils.py b/tensorflow/python/keras/utils/io_utils.py
index f82e3277de70a631c93f0ef3c240f41ddb3390a7..62674a9c77fc410a551d2ac79c22ecf959b16fc3 100644
--- a/tensorflow/python/keras/utils/io_utils.py
+++ b/tensorflow/python/keras/utils/io_utils.py
@@ -102,13 +102,12 @@ class HDF5Matrix(object):
         idx = (self.start + key).tolist()
       else:
         raise IndexError
-    elif isinstance(key, list):
+    else:
+      # Assume list/iterable
       if max(key) + self.start < self.end:
         idx = [x + self.start for x in key]
       else:
         raise IndexError
-    else:
-      raise IndexError
     if self.normalizer is not None:
       return self.normalizer(self.data[idx])
     else:
diff --git a/tensorflow/python/keras/utils/io_utils_test.py b/tensorflow/python/keras/utils/io_utils_test.py
index 3895dca68e37e1597b93d8eeded7e5cfb0d3e338..81bb661edd8d815f8565285ad5dc8126f4f52e98 100644
--- a/tensorflow/python/keras/utils/io_utils_test.py
+++ b/tensorflow/python/keras/utils/io_utils_test.py
@@ -22,6 +22,7 @@ import os
 import shutil
 
 import numpy as np
+import six
 
 from tensorflow.python import keras
 from tensorflow.python.platform import test
@@ -95,6 +96,29 @@ class TestIOUtils(test.TestCase):
     self.assertEqual(out_eval.shape, ())
     self.assertGreater(out_eval, 0)
 
+    # test slicing for shortened array
+    self.assertEqual(len(x_train[0:]), len(x_train))
+
+    # test __getitem__ invalid use cases
+    with self.assertRaises(IndexError):
+      _ = x_train[1000]
+    with self.assertRaises(IndexError):
+      _ = x_train[1000: 1001]
+    with self.assertRaises(IndexError):
+      _ = x_train[[1000, 1001]]
+    with self.assertRaises(IndexError):
+      _ = x_train[six.moves.range(1000, 1001)]
+    with self.assertRaises(IndexError):
+      _ = x_train[np.array([1000])]
+    with self.assertRaises(TypeError):
+      _ = x_train[None]
+
+    # test normalizer
+    normalizer = lambda x: x + 1
+    normalized_x_train = keras.utils.io_utils.HDF5Matrix(
+        h5_path, 'my_data', start=0, end=150, normalizer=normalizer)
+    self.assertAllClose(normalized_x_train[0][0], x_train[0][0] + 1)
+
 
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/python/keras/utils/layer_utils.py b/tensorflow/python/keras/utils/layer_utils.py
index bd61f8e9cccb1a0ba25dd9a449453df837b89ad2..88daff0461593f6270f3be8c06a277c7e6751286 100644
--- a/tensorflow/python/keras/utils/layer_utils.py
+++ b/tensorflow/python/keras/utils/layer_utils.py
@@ -201,6 +201,61 @@ def print_summary(model, line_length=None, positions=None, print_fn=None):
   print_fn('_' * line_length)
 
 
+def gather_trainable_weights(trainable, sub_layers, extra_variables):
+  """Lists the trainable weights for an object with sub-layers.
+
+  Args:
+    trainable: Whether the object collecting the variables is trainable.
+    sub_layers: A flat list of Layer objects owned by this object, to collect
+      variables from.
+    extra_variables: Any extra variables to include. Their `.trainable` property
+      is used to categorize them.
+
+  Returns:
+    A list of collected trainable weights/variables.
+  """
+  if not trainable:
+    return []
+  weights = []
+  for layer in sub_layers:
+    weights += layer.trainable_weights
+  trainable_extra_variables = [
+      v for v in extra_variables if v.trainable]
+  return weights + trainable_extra_variables
+
+
+def gather_non_trainable_weights(trainable, sub_layers, extra_variables):
+  """Lists the non-trainable weights for an object with sub-layers.
+
+  Args:
+    trainable: Whether the object collecting the variables is trainable.
+    sub_layers: A flat list of Layer objects owned by this object, to collect
+      variables from.
+    extra_variables: Any extra variables to include. Their `.trainable` property
+      is used to categorize them.
+
+  Returns:
+    A list of collected non-trainable weights/variables.
+  """
+  trainable_extra_variables = []
+  non_trainable_extra_variables = []
+  for v in extra_variables:
+    if v.trainable:
+      trainable_extra_variables.append(v)
+    else:
+      non_trainable_extra_variables.append(v)
+  weights = []
+  for layer in sub_layers:
+    weights += layer.non_trainable_weights
+  if not trainable:
+    trainable_weights = []
+    for layer in sub_layers:
+      trainable_weights += layer.trainable_weights
+    return (trainable_weights + trainable_extra_variables
+            + weights + non_trainable_extra_variables)
+  return weights + non_trainable_extra_variables
+
+
 @tf_export('keras.utils.convert_all_kernels_in_model')
 def convert_all_kernels_in_model(model):
   """Converts all convolution kernels in a model from Theano to TensorFlow.
diff --git a/tensorflow/python/keras/utils/multi_gpu_utils.py b/tensorflow/python/keras/utils/multi_gpu_utils.py
index e5442f04e316c6c2ec6f814cf8ae2aad546dc7d7..e1c49bc85221aa94241ed746c2063aadf881f3cd 100644
--- a/tensorflow/python/keras/utils/multi_gpu_utils.py
+++ b/tensorflow/python/keras/utils/multi_gpu_utils.py
@@ -196,7 +196,7 @@ def multi_gpu_model(model, gpus, cpu_merge=True, cpu_relocation=False):
     batch_size = shape[:1]
     input_shape = shape[1:]
     step = batch_size // parts
-    if i == num_gpus - 1:
+    if i == parts - 1:
       size = batch_size - step * i
     else:
       size = step
diff --git a/tensorflow/python/kernel_tests/BUILD b/tensorflow/python/kernel_tests/BUILD
index 83b353600ad47c59759fff81c64fe5e8c2ed7926..5d29c2e5f86bd3c4997cc3f18f4cb760dc87d63b 100644
--- a/tensorflow/python/kernel_tests/BUILD
+++ b/tensorflow/python/kernel_tests/BUILD
@@ -9,6 +9,7 @@ licenses(["notice"])  # Apache 2.0
 load("//tensorflow:tensorflow.bzl", "tf_py_test")
 load("//tensorflow:tensorflow.bzl", "cuda_py_test")
 load("//tensorflow:tensorflow.bzl", "sycl_py_test")
+load("//tensorflow:tensorflow.bzl", "tf_custom_op_library")
 
 # CPU only tests should use tf_py_test, GPU tests use cuda_py_test
 # Please avoid the py_tests and cuda_py_tests (plural) while we
@@ -2334,6 +2335,9 @@ cuda_py_test(
         "//tensorflow/python:nn_ops",
     ],
     shard_count = 2,
+    tags = [
+        "no_gpu",  #  Flaky: b/80127739
+    ],
 )
 
 cuda_py_test(
@@ -3026,3 +3030,60 @@ tf_py_test(
         "//tensorflow/python/eager:tape",
     ],
 )
+
+# Custom op tests
+tf_custom_op_library(
+    name = "ackermann_op.so",
+    srcs = ["ackermann_op.cc"],
+)
+
+tf_py_test(
+    name = "ackermann_test",
+    size = "small",
+    srcs = ["ackermann_test.py"],
+    additional_deps = [
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:framework",
+        "//tensorflow/python:platform",
+    ],
+    data = [":ackermann_op.so"],
+    tags = ["no_pip"],
+)
+
+tf_custom_op_library(
+    name = "duplicate_op.so",
+    srcs = ["duplicate_op.cc"],
+)
+
+tf_py_test(
+    name = "duplicate_op_test",
+    size = "small",
+    srcs = ["duplicate_op_test.py"],
+    additional_deps = [
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:framework",
+        "//tensorflow/python:math_ops",
+        "//tensorflow/python:platform",
+    ],
+    data = [":duplicate_op.so"],
+    tags = ["no_pip"],
+)
+
+tf_custom_op_library(
+    name = "invalid_op.so",
+    srcs = ["invalid_op.cc"],
+)
+
+tf_py_test(
+    name = "invalid_op_test",
+    size = "small",
+    srcs = ["invalid_op_test.py"],
+    additional_deps = [
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:errors",
+        "//tensorflow/python:framework",
+        "//tensorflow/python:platform",
+    ],
+    data = [":invalid_op.so"],
+    tags = ["no_pip"],
+)
diff --git a/tensorflow/python/kernel_tests/accumulate_n_eager_test.py b/tensorflow/python/kernel_tests/accumulate_n_eager_test.py
index dc11b7deceb9040584aca1f629f4d003aef39428..5f516f2c7e6af2d5b77deeebf1d71d3d0fa6be39 100644
--- a/tensorflow/python/kernel_tests/accumulate_n_eager_test.py
+++ b/tensorflow/python/kernel_tests/accumulate_n_eager_test.py
@@ -43,10 +43,9 @@ class AccumulateNV2EagerTest(test_util.TensorFlowTestCase):
     np.random.seed(12345)
     x = [np.random.random((1, 2, 3, 4, 5)) - 0.5 for _ in range(5)]
     tf_x = ops.convert_n_to_tensor(x)
-    with self.test_session(use_gpu=True):
-      self.assertAllClose(sum(x), math_ops.accumulate_n(tf_x).numpy())
-      self.assertAllClose(x[0] * 5,
-                          math_ops.accumulate_n([tf_x[0]] * 5).numpy())
+    self.assertAllClose(sum(x), math_ops.accumulate_n(tf_x))
+    self.assertAllClose(x[0] * 5,
+                        math_ops.accumulate_n([tf_x[0]] * 5))
 
   def testGrad(self):
     np.random.seed(42)
diff --git a/tensorflow/user_ops/ackermann_op.cc b/tensorflow/python/kernel_tests/ackermann_op.cc
similarity index 100%
rename from tensorflow/user_ops/ackermann_op.cc
rename to tensorflow/python/kernel_tests/ackermann_op.cc
diff --git a/tensorflow/user_ops/ackermann_test.py b/tensorflow/python/kernel_tests/ackermann_test.py
similarity index 76%
rename from tensorflow/user_ops/ackermann_test.py
rename to tensorflow/python/kernel_tests/ackermann_test.py
index 257de498088d1f8a71898e490b8951beb7975b7a..5e0d87c783109b5ec8055e4c975157f3da07bcd4 100644
--- a/tensorflow/user_ops/ackermann_test.py
+++ b/tensorflow/python/kernel_tests/ackermann_test.py
@@ -17,17 +17,19 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-import os.path
+import os
 
-import tensorflow as tf
+from tensorflow.python.framework import load_library
+from tensorflow.python.platform import resource_loader
+from tensorflow.python.platform import test
 
 
-class AckermannTest(tf.test.TestCase):
+class AckermannTest(test.TestCase):
 
   def testBasic(self):
-    library_filename = os.path.join(tf.resource_loader.get_data_files_path(),
+    library_filename = os.path.join(resource_loader.get_data_files_path(),
                                     'ackermann_op.so')
-    ackermann = tf.load_op_library(library_filename)
+    ackermann = load_library.load_op_library(library_filename)
 
     self.assertEqual(len(ackermann.OP_LIST.op), 1)
     self.assertEqual(ackermann.OP_LIST.op[0].name, 'Ackermann')
@@ -37,4 +39,4 @@ class AckermannTest(tf.test.TestCase):
 
 
 if __name__ == '__main__':
-  tf.test.main()
+  test.main()
diff --git a/tensorflow/python/kernel_tests/as_string_op_test.py b/tensorflow/python/kernel_tests/as_string_op_test.py
index 9d54add2644fb9ba6931357dbaa96368952b7486..94ed8ebd31f5874024bb6b0988073ece15d39d87 100644
--- a/tensorflow/python/kernel_tests/as_string_op_test.py
+++ b/tensorflow/python/kernel_tests/as_string_op_test.py
@@ -130,6 +130,16 @@ class AsStringOpTest(test.TestCase):
       result = output.eval(feed_dict={input_: int_inputs_})
       self.assertAllEqual(s(result), ["%d" % x for x in int_inputs_])
 
+  def testHalfInt(self):
+    s = lambda strs: [x.decode("ascii") for x in strs]
+
+    with self.test_session():
+      input_ = array_ops.placeholder(dtypes.int16)
+      int_inputs_ = [np.iinfo(np.int16).min, np.iinfo(np.int16).max]
+      output = string_ops.as_string(input_)
+      result = output.eval(feed_dict={input_: int_inputs_})
+      self.assertAllEqual(s(result), ["%d" % x for x in int_inputs_])
+
   def testBool(self):
     bool_inputs_ = [False, True]
     s = lambda strs: [x.decode("ascii") for x in strs]
diff --git a/tensorflow/python/kernel_tests/betainc_op_test.py b/tensorflow/python/kernel_tests/betainc_op_test.py
index 08b03f851803a34dd050721e47471bafd1cd6cac..16fdedac4136d7e53eb66ba060a92b9fd7d58307 100644
--- a/tensorflow/python/kernel_tests/betainc_op_test.py
+++ b/tensorflow/python/kernel_tests/betainc_op_test.py
@@ -172,7 +172,7 @@ class BetaincTest(test.TestCase):
       tf_gout_t = math_ops.betainc(tf_ga_s, tf_gb_s, tf_gx_s)
       err = gradient_checker.compute_gradient_error(
           [tf_gx_s], [gx_s.shape], tf_gout_t, gx_s.shape)
-      print("betainc gradient err = %g " % err)
+      tf_logging.info("betainc gradient err = %g " % err)
       self.assertLess(err, err_tolerance)
 
       # Test broadcast gradient
@@ -181,7 +181,7 @@ class BetaincTest(test.TestCase):
       tf_gout_t = math_ops.betainc(tf_ga_s, tf_gb_s, tf_gx_s)
       err = gradient_checker.compute_gradient_error(
           [tf_gx_s], [()], tf_gout_t, ga_s.shape)
-      print("betainc gradient err = %g " % err)
+      tf_logging.info("betainc gradient err = %g " % err)
       self.assertLess(err, err_tolerance)
 
 
diff --git a/tensorflow/python/kernel_tests/boosted_trees/BUILD b/tensorflow/python/kernel_tests/boosted_trees/BUILD
index 30e6289420b36a75589ef25150480e48f8245ec2..4f92ab0795d231f973f988d9c5b8b39166357c4c 100644
--- a/tensorflow/python/kernel_tests/boosted_trees/BUILD
+++ b/tensorflow/python/kernel_tests/boosted_trees/BUILD
@@ -52,7 +52,7 @@ tf_py_test(
 
 tf_py_test(
     name = "stats_ops_test",
-    size = "small",
+    size = "medium",
     srcs = ["stats_ops_test.py"],
     additional_deps = [
         "//tensorflow/python:boosted_trees_ops",
diff --git a/tensorflow/python/kernel_tests/boosted_trees/stats_ops_test.py b/tensorflow/python/kernel_tests/boosted_trees/stats_ops_test.py
index 5cceb98cff26ec11137c3f8ff73e8d2e05d009e1..568e695fd590a6fd6e915acb493adb8926e92eaf 100644
--- a/tensorflow/python/kernel_tests/boosted_trees/stats_ops_test.py
+++ b/tensorflow/python/kernel_tests/boosted_trees/stats_ops_test.py
@@ -17,7 +17,10 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import test_util
+from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import boosted_trees_ops
 from tensorflow.python.platform import googletest
 
@@ -388,6 +391,41 @@ class StatsOpsTest(test_util.TensorFlowTestCase):
           ],
           result.eval())
 
+  def _verify_precision(self, length):
+    with self.test_session():
+      max_splits = 1
+      num_buckets = 1
+      node_ids = array_ops.fill([length], 0)
+
+      gradients = constant_op.constant(
+          2.0 / length, dtype=dtypes.float32, shape=[length, 1])
+      hessians = constant_op.constant(
+          0.2 / length, dtype=dtypes.float32, shape=[length, 1])
+
+      bucketized_features = array_ops.zeros([length], dtype=dtypes.int32)
+
+      result = boosted_trees_ops.make_stats_summary(
+          node_ids, gradients, hessians, [bucketized_features], max_splits,
+          num_buckets)  # shape=[max_splits, num_buckets, num_features, 2]
+
+      self.assertAllClose([[[[2., 0.2]]]], result.eval())
+
+  def testMakeStatsSummaryNumericalPrecisionSmallBatch(self):
+    """Tests numeric precision."""
+    self._verify_precision(length=2000)
+
+  def testMakeStatsSummaryNumericalPrecisionMediumBatch(self):
+    """Tests numeric precision."""
+    self._verify_precision(length=100000)
+
+  def testMakeStatsSummaryNumericalPrecisionLargeBatch(self):
+    """Tests numeric precision."""
+    self._verify_precision(length=1000000)
+
+  def testMakeStatsSummaryNumericalPrecisionMegaBatch(self):
+    """Tests numeric precision."""
+    self._verify_precision(length=50000000)
+
 
 if __name__ == '__main__':
   googletest.main()
diff --git a/tensorflow/python/kernel_tests/check_ops_test.py b/tensorflow/python/kernel_tests/check_ops_test.py
index 5a83ec8d302b4c26aef7abfa7465eb9fd0cca019..7ef841c96b5cec9c7ae56c631896231ed663b8be 100644
--- a/tensorflow/python/kernel_tests/check_ops_test.py
+++ b/tensorflow/python/kernel_tests/check_ops_test.py
@@ -88,6 +88,13 @@ class AssertEqualTest(test.TestCase):
       out = array_ops.identity(small)
     self.evaluate(out)
 
+  @test_util.run_in_graph_and_eager_modes()
+  def test_scalar_comparison(self):
+    const_true = constant_op.constant(True, name="true")
+    const_false = constant_op.constant(False, name="false")
+    with self.assertRaisesRegexp(errors.InvalidArgumentError, "fail"):
+      check_ops.assert_equal(const_true, const_false, message="fail")
+
   def test_returns_none_with_eager(self):
     with context.eager_mode():
       small = constant_op.constant([1, 2], name="small")
diff --git a/tensorflow/python/kernel_tests/conv_ops_test.py b/tensorflow/python/kernel_tests/conv_ops_test.py
index a291bef0ad6f16184ff29f665457a53b77447d54..450428707dab0d5fb88d180b8e54bb2c56958d8d 100644
--- a/tensorflow/python/kernel_tests/conv_ops_test.py
+++ b/tensorflow/python/kernel_tests/conv_ops_test.py
@@ -312,8 +312,8 @@ class Conv2DTest(test.TestCase):
       expected_values = self.evaluate(expected_results)
       computed_values = self.evaluate(computed_results)
       for e_value, c_value in zip(expected_values, computed_values):
-        print("expected = ", e_value)
-        print("actual = ", c_value)
+        tf_logging.info("expected = ", e_value)
+        tf_logging.info("actual = ", c_value)
         self.assertAllClose(
             e_value.flatten(), c_value.flatten(), atol=tolerance, rtol=1e-4)
 
@@ -337,8 +337,8 @@ class Conv2DTest(test.TestCase):
       for i in range(len(tensors)):
         conv = tensors[i]
         value = values[i]
-        print("expected = ", expected)
-        print("actual = ", value)
+        tf_logging.info("expected = ", expected)
+        tf_logging.info("actual = ", value)
         tol = 1e-5
         if value.dtype == np.float16:
           tol = 1e-3
@@ -547,8 +547,8 @@ class Conv2DTest(test.TestCase):
       # "values" consists of two tensors for two backprops
       value = self.evaluate(conv)
       self.assertShapeEqual(value, conv)
-    print("expected = ", expected)
-    print("actual = ", value)
+    tf_logging.info("expected = ", expected)
+    tf_logging.info("actual = ", value)
     self.assertArrayNear(expected, value.flatten(), err)
 
   def _CompareBackpropInput(self, input_sizes, filter_sizes, output_sizes,
@@ -723,8 +723,8 @@ class Conv2DTest(test.TestCase):
             data_format=data_format)
         value = self.evaluate(conv)
         self.assertShapeEqual(value, conv)
-      print("expected = ", expected)
-      print("actual = ", value)
+      tf_logging.info("expected = ", expected)
+      tf_logging.info("actual = ", value)
       self.assertArrayNear(expected, value.flatten(), 1e-5)
 
   def _CompareBackFilter(self, input_sizes, filter_sizes, output_sizes,
@@ -912,8 +912,8 @@ class Conv2DTest(test.TestCase):
         value_2 = sess.run(conv_2)
         self.assertShapeEqual(value, conv)
         self.assertShapeEqual(value_2, conv_2)
-      print("expected = ", value_2)
-      print("actual = ", value)
+      tf_logging.info("expected = ", value_2)
+      tf_logging.info("actual = ", value)
       self.assertArrayNear(value_2.flatten(), value.flatten(), err)
 
   # Testing for backprops
@@ -965,8 +965,8 @@ class Conv2DTest(test.TestCase):
         value_2 = sess.run(conv_2)
         self.assertShapeEqual(value, conv)
         self.assertShapeEqual(value_2, conv_2)
-      print("expected = ", value_2)
-      print("actual = ", value)
+      tf_logging.info("expected = ", value_2)
+      tf_logging.info("actual = ", value)
       self.assertArrayNear(value_2.flatten(), value.flatten(), err)
 
   def testConv2D2x2Depth3ValidBackpropFilterStride1x1Dilation2x1(self):
@@ -1178,7 +1178,7 @@ class Conv2DTest(test.TestCase):
           # since fp16 numerical gradients are too imprecise.
           err = np.fabs(jacob_t - reference_jacob_t).max()
 
-        print("conv_2d gradient error = ", err)
+        tf_logging.info("conv_2d gradient error = ", err)
         self.assertLess(err, 0.002)
 
   def testInputGradientValidPaddingStrideOne(self):
@@ -1546,7 +1546,7 @@ class DepthwiseConv2DTest(test.TestCase):
       conv = nn_impl.depthwise_conv2d(
           t1, t2, strides=[1, stride, stride, 1], padding=padding)
       value = sess.run(conv)
-    print("value = ", value)
+    tf_logging.info("value = ", value)
     self.assertArrayNear(expected, np.ravel(value), 1e-5)
     self.assertShapeEqual(value, conv)
 
@@ -1668,7 +1668,7 @@ class SeparableConv2DTest(test.TestCase):
         conv = array_ops.transpose(conv, [0, 2, 3, 1])
 
       value = sess.run(conv)
-    print("value = ", value)
+    tf_logging.info("value = ", value)
     self.assertArrayNear(expected, np.ravel(value), 1e-5)
     self.assertShapeEqual(value, conv)
 
@@ -1826,7 +1826,7 @@ class Conv2DBenchmark(test.Benchmark):
         wall_time = time.time() - start
         self.report_benchmark(
             name="conv_stack_iter_%d" % iter_index, wall_time=wall_time)
-        print("conv_stack_iter_%d: %.4f" % (iter_index, wall_time))
+        tf_logging.info("conv_stack_iter_%d: %.4f" % (iter_index, wall_time))
 
 
 def GetInceptionFwdTest(input_size, filter_size, stride, padding,
diff --git a/tensorflow/python/kernel_tests/cwise_ops_test.py b/tensorflow/python/kernel_tests/cwise_ops_test.py
index 87da89831c8ded9b8382c7bb251948b6d202300e..8a3e64b174d9572ce11981a9d8d0e71cd9a336bc 100644
--- a/tensorflow/python/kernel_tests/cwise_ops_test.py
+++ b/tensorflow/python/kernel_tests/cwise_ops_test.py
@@ -27,6 +27,7 @@ from tensorflow.python.framework import dtypes as dtypes_lib
 from tensorflow.python.framework import errors_impl
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import sparse_tensor
+from tensorflow.python.framework import test_util
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import gen_math_ops
 from tensorflow.python.ops import gradient_checker
@@ -152,7 +153,7 @@ class UnaryOpTest(test.TestCase):
 
   def _compareGpu(self, x, np_func, tf_func):
     np_ans = np_func(x)
-    with self.test_session(use_gpu=True):
+    with self.test_session(force_gpu=test_util.is_gpu_available()):
       result = tf_func(ops.convert_to_tensor(x))
       tf_gpu = result.eval()
     if x.dtype == np.float16:
@@ -164,7 +165,7 @@ class UnaryOpTest(test.TestCase):
   def _compareSparseGpu(self, x, np_func, tf_func, tol):
     x_sp, x_sp_vals = _sparsify(x)
     res_np = np_func(x_sp_vals)
-    with self.test_session(use_gpu=True):
+    with self.test_session(force_gpu=test_util.is_gpu_available()):
       self._check(tf_func(x_sp), res_np, x_sp, tol)
 
   def _compareBoth(self, x, np_func, tf_func):
@@ -240,6 +241,12 @@ class UnaryOpTest(test.TestCase):
                       math_ops.lgamma)
     self._compareBoth(x, np.vectorize(math.erf), math_ops.erf)
     self._compareBoth(x, np.vectorize(math.erfc), math_ops.erfc)
+    try:
+      from scipy import special  # pylint: disable=g-import-not-at-top
+      self._compareBoth(x, special.i0e, math_ops.bessel_i0e)
+      self._compareBoth(x, special.i1e, math_ops.bessel_i1e)
+    except ImportError as e:
+      tf_logging.warn("Cannot test special functions: %s" % str(e))
 
     self._compareBothSparse(x, np.abs, math_ops.abs)
     self._compareBothSparse(x, np.negative, math_ops.negative)
@@ -285,6 +292,12 @@ class UnaryOpTest(test.TestCase):
     self._compareBoth(x, np.arcsin, math_ops.asin)
     self._compareBoth(x, np.arccos, math_ops.acos)
     self._compareBoth(x, np.arctan, math_ops.atan)
+    try:
+      from scipy import special  # pylint: disable=g-import-not-at-top
+      self._compareBoth(x, special.i0e, math_ops.bessel_i0e)
+      self._compareBoth(x, special.i1e, math_ops.bessel_i1e)
+    except ImportError as e:
+      tf_logging.warn("Cannot test special functions: %s" % str(e))
 
     self._compareBothSparse(x, np.abs, math_ops.abs)
     self._compareBothSparse(x, np.negative, math_ops.negative)
@@ -333,6 +346,12 @@ class UnaryOpTest(test.TestCase):
     self._compareBoth(k, np.arcsin, math_ops.asin)
     self._compareBoth(k, np.arccos, math_ops.acos)
     self._compareBoth(k, np.tan, math_ops.tan)
+    try:
+      from scipy import special  # pylint: disable=g-import-not-at-top
+      self._compareBoth(x, special.i0e, math_ops.bessel_i0e)
+      self._compareBoth(x, special.i1e, math_ops.bessel_i1e)
+    except ImportError as e:
+      tf_logging.warn("Cannot test special functions: %s" % str(e))
 
     self._compareBothSparse(x, np.abs, math_ops.abs)
     self._compareBothSparse(x, np.negative, math_ops.negative)
@@ -369,6 +388,12 @@ class UnaryOpTest(test.TestCase):
                       math_ops.lgamma)
     self._compareBoth(x, np.vectorize(math.erf), math_ops.erf)
     self._compareBoth(x, np.vectorize(math.erfc), math_ops.erfc)
+    try:
+      from scipy import special  # pylint: disable=g-import-not-at-top
+      self._compareBoth(x, special.i0e, math_ops.bessel_i0e)
+      self._compareBoth(x, special.i1e, math_ops.bessel_i1e)
+    except ImportError as e:
+      tf_logging.warn("Cannot test special functions: %s" % str(e))
 
     self._compareBothSparse(x, np.abs, math_ops.abs)
     self._compareBothSparse(x, np.negative, math_ops.negative)
@@ -630,7 +655,7 @@ class BinaryOpTest(test.TestCase):
 
   def _compareGpu(self, x, y, np_func, tf_func):
     np_ans = np_func(x, y)
-    with self.test_session(use_gpu=True):
+    with self.test_session(force_gpu=test_util.is_gpu_available()):
       inx = ops.convert_to_tensor(x)
       iny = ops.convert_to_tensor(y)
       out = tf_func(inx, iny)
@@ -1203,7 +1228,7 @@ class BinaryOpTest(test.TestCase):
 class ComparisonOpTest(test.TestCase):
 
   def _compareScalar(self, func, x, y, dtype):
-    with self.test_session(use_gpu=True):
+    with self.test_session(force_gpu=test_util.is_gpu_available()):
       out = func(
           ops.convert_to_tensor(np.array([x]).astype(dtype)),
           ops.convert_to_tensor(np.array([y]).astype(dtype)))
@@ -1236,7 +1261,7 @@ class ComparisonOpTest(test.TestCase):
 
   def _compare(self, x, y, np_func, tf_func):
     np_ans = np_func(x, y)
-    with self.test_session(use_gpu=True):
+    with self.test_session(force_gpu=test_util.is_gpu_available()):
       out = tf_func(ops.convert_to_tensor(x), ops.convert_to_tensor(y))
       tf_ans = out.eval()
     self.assertAllEqual(np_ans, tf_ans)
@@ -1337,7 +1362,8 @@ class LogicalOpTest(test.TestCase):
 
   def _compareBinary(self, x, y, np_func, tf_func, use_gpu=False):
     np_ans = np_func(x, y)
-    with self.test_session(use_gpu=use_gpu):
+    with self.test_session(use_gpu=use_gpu,
+                           force_gpu=use_gpu and test_util.is_gpu_available()):
       inx = ops.convert_to_tensor(x)
       iny = ops.convert_to_tensor(y)
       out = tf_func(inx, iny)
@@ -1348,7 +1374,8 @@ class LogicalOpTest(test.TestCase):
 
   def _not(self, x, use_gpu=False):
     np_ans = np.logical_not(x)
-    with self.test_session(use_gpu=use_gpu):
+    with self.test_session(use_gpu=use_gpu,
+                           force_gpu=use_gpu and test_util.is_gpu_available()):
       out = math_ops.logical_not(ops.convert_to_tensor(x))
       tf_val = out.eval()
     self.assertEqual(out.dtype, dtypes_lib.bool)
@@ -1433,7 +1460,8 @@ class SelectOpTest(test.TestCase):
 
   def _compare(self, c, x, y, use_gpu):
     np_ans = np.where(c, x, y)
-    with self.test_session(use_gpu=use_gpu):
+    with self.test_session(use_gpu=use_gpu,
+                           force_gpu=use_gpu and test_util.is_gpu_available()):
       out = array_ops.where(c, x, y)
       tf_ans = out.eval()
     self.assertAllEqual(np_ans, tf_ans)
@@ -1576,7 +1604,8 @@ class BatchSelectOpTest(test.TestCase):
     np_ans = np.dstack(
         [x_i if c_i else y_i for c_i, x_i, y_i in zip(c, x, y)]).transpose(
             [2, 0, 1])
-    with self.test_session(use_gpu=use_gpu):
+    with self.test_session(use_gpu=use_gpu,
+                           force_gpu=use_gpu and test_util.is_gpu_available()):
       out = array_ops.where(c, x, y)
       tf_ans = out.eval()
     self.assertAllEqual(np_ans, tf_ans)
@@ -1681,7 +1710,9 @@ class MinMaxOpTest(test.TestCase):
 
   def _compare(self, x, y, use_gpu):
     np_min, np_max = np.minimum(x, y), np.maximum(x, y)
-    with self.test_session(use_gpu=use_gpu) as sess:
+    with self.test_session(
+        use_gpu=use_gpu,
+        force_gpu=use_gpu and test_util.is_gpu_available()) as sess:
       inx = ops.convert_to_tensor(x)
       iny = ops.convert_to_tensor(y)
       omin, omax = math_ops.minimum(inx, iny), math_ops.maximum(inx, iny)
@@ -1843,7 +1874,9 @@ class IsFiniteInfNanTest(test.TestCase):
 
   def _compare(self, x, use_gpu):
     np_finite, np_inf, np_nan = np.isfinite(x), np.isinf(x), np.isnan(x)
-    with self.test_session(use_gpu=use_gpu) as sess:
+    with self.test_session(
+        use_gpu=use_gpu,
+        force_gpu=use_gpu and test_util.is_gpu_available()) as sess:
       inx = ops.convert_to_tensor(x)
       ofinite, oinf, onan = math_ops.is_finite(inx), math_ops.is_inf(
           inx), math_ops.is_nan(inx)
@@ -1884,7 +1917,7 @@ class IsFiniteInfNanTest(test.TestCase):
           x = np.full((size,), value, dtype=dtype)
           np_y = np.sqrt(x)
           np_nan = np.isnan(np_y)
-          with self.test_session(use_gpu=True):
+          with self.test_session(force_gpu=test_util.is_gpu_available()):
             tf_y = math_ops.sqrt(x)
             tf_nan = math_ops.is_nan(tf_y)
             if value < 0:
@@ -1939,7 +1972,8 @@ class ComplexMakeRealImagTest(test.TestCase):
 
   def _compareMake(self, real, imag, use_gpu):
     np_ans = real + (1j) * imag
-    with self.test_session(use_gpu=use_gpu):
+    with self.test_session(use_gpu=use_gpu,
+                           force_gpu=use_gpu and test_util.is_gpu_available()):
       real = ops.convert_to_tensor(real)
       imag = ops.convert_to_tensor(imag)
       tf_ans = math_ops.complex(real, imag)
@@ -1958,7 +1992,8 @@ class ComplexMakeRealImagTest(test.TestCase):
   def _compareRealImag(self, cplx, use_gpu):
     np_real, np_imag = np.real(cplx), np.imag(cplx)
     np_zeros = np_real * 0
-    with self.test_session(use_gpu=use_gpu):
+    with self.test_session(use_gpu=use_gpu,
+                           force_gpu=use_gpu and test_util.is_gpu_available()):
       inx = ops.convert_to_tensor(cplx)
       tf_real = math_ops.real(inx)
       tf_imag = math_ops.imag(inx)
@@ -1985,7 +2020,9 @@ class ComplexMakeRealImagTest(test.TestCase):
 
   def _compareAngle(self, cplx, use_gpu):
     np_angle = np.angle(cplx)
-    with self.test_session(use_gpu=use_gpu) as sess:
+    with self.test_session(
+        use_gpu=use_gpu,
+        force_gpu=use_gpu and test_util.is_gpu_available()) as sess:
       inx = ops.convert_to_tensor(cplx)
       tf_angle = math_ops.angle(inx)
       tf_angle_val = sess.run(tf_angle)
@@ -2019,7 +2056,8 @@ class ComplexMakeRealImagTest(test.TestCase):
 
   def _compareConj(self, cplx, use_gpu):
     np_ans = np.conj(cplx)
-    with self.test_session(use_gpu=use_gpu):
+    with self.test_session(use_gpu=use_gpu,
+                           force_gpu=use_gpu and test_util.is_gpu_available()):
       inx = ops.convert_to_tensor(cplx)
       tf_conj = math_ops.conj(inx)
       tf_ans = tf_conj.eval()
diff --git a/tensorflow/python/kernel_tests/distributions/bijector_test.py b/tensorflow/python/kernel_tests/distributions/bijector_test.py
index a7fe336e6a8e88f1130bce3fec9f6d253246c37e..8b11556330acc7dab68715ddc69563107a313ee6 100644
--- a/tensorflow/python/kernel_tests/distributions/bijector_test.py
+++ b/tensorflow/python/kernel_tests/distributions/bijector_test.py
@@ -90,9 +90,10 @@ class IntentionallyMissingError(Exception):
 class BrokenBijector(bijector.Bijector):
   """Forward and inverse are not inverses of each other."""
 
-  def __init__(self, forward_missing=False, inverse_missing=False):
+  def __init__(
+      self, forward_missing=False, inverse_missing=False, validate_args=False):
     super(BrokenBijector, self).__init__(
-        validate_args=False, forward_min_event_ndims=0, name="broken")
+        validate_args=validate_args, forward_min_event_ndims=0, name="broken")
     self._forward_missing = forward_missing
     self._inverse_missing = inverse_missing
 
@@ -116,6 +117,33 @@ class BrokenBijector(bijector.Bijector):
       raise IntentionallyMissingError
     return math_ops.log(2.)
 
+class BijectorTestEventNdims(test.TestCase):
+
+  def testBijectorNonIntegerEventNdims(self):
+    bij = BrokenBijector()
+    with self.assertRaisesRegexp(ValueError, "Expected integer"):
+      bij.forward_log_det_jacobian(1., event_ndims=1.5)
+    with self.assertRaisesRegexp(ValueError, "Expected integer"):
+      bij.inverse_log_det_jacobian(1., event_ndims=1.5)
+
+  def testBijectorArrayEventNdims(self):
+    bij = BrokenBijector()
+    with self.assertRaisesRegexp(ValueError, "Expected scalar"):
+      bij.forward_log_det_jacobian(1., event_ndims=(1, 2))
+    with self.assertRaisesRegexp(ValueError, "Expected scalar"):
+      bij.inverse_log_det_jacobian(1., event_ndims=(1, 2))
+
+  def testBijectorDynamicEventNdims(self):
+    bij = BrokenBijector(validate_args=True)
+    event_ndims = array_ops.placeholder(dtype=np.int32, shape=None)
+    with self.test_session():
+      with self.assertRaisesOpError("Expected scalar"):
+        bij.forward_log_det_jacobian(1., event_ndims=event_ndims).eval({
+            event_ndims: (1, 2)})
+      with self.assertRaisesOpError("Expected scalar"):
+        bij.inverse_log_det_jacobian(1., event_ndims=event_ndims).eval({
+            event_ndims: (1, 2)})
+
 
 @six.add_metaclass(abc.ABCMeta)
 class BijectorCachingTestBase(object):
diff --git a/tensorflow/python/kernel_tests/distributions/dirichlet_test.py b/tensorflow/python/kernel_tests/distributions/dirichlet_test.py
index 3bcfae0deb5e14b1be0f491bf8c5c7fcf9bbcf1d..bcec6ef610d0389f4b0f164ff4ab1a1cd1f6d1e5 100644
--- a/tensorflow/python/kernel_tests/distributions/dirichlet_test.py
+++ b/tensorflow/python/kernel_tests/distributions/dirichlet_test.py
@@ -26,6 +26,7 @@ from tensorflow.python.framework import test_util
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops.distributions import dirichlet as dirichlet_lib
+from tensorflow.python.ops.distributions import kullback_leibler
 from tensorflow.python.platform import test
 from tensorflow.python.platform import tf_logging
 
@@ -39,6 +40,7 @@ def try_import(name):  # pylint: disable=invalid-name
   return module
 
 
+special = try_import("scipy.special")
 stats = try_import("scipy.stats")
 
 
@@ -262,6 +264,39 @@ class DirichletTest(test.TestCase):
                   a=1., b=2.).cdf)[0],
           0.01)
 
+  def testDirichletDirichletKL(self):
+    conc1 = np.array([[1., 2., 3., 1.5, 2.5, 3.5],
+                      [1.5, 2.5, 3.5, 4.5, 5.5, 6.5]])
+    conc2 = np.array([[0.5, 1., 1.5, 2., 2.5, 3.]])
+
+    d1 = dirichlet_lib.Dirichlet(conc1)
+    d2 = dirichlet_lib.Dirichlet(conc2)
+    x = d1.sample(int(1e4), seed=0)
+    kl_sample = math_ops.reduce_mean(d1.log_prob(x) - d2.log_prob(x), 0)
+    kl_actual = kullback_leibler.kl_divergence(d1, d2)
+
+    kl_sample_val = self.evaluate(kl_sample)
+    kl_actual_val = self.evaluate(kl_actual)
+
+    self.assertEqual(conc1.shape[:-1], kl_actual.get_shape())
+
+    if not special:
+      return
+
+    kl_expected = (
+        special.gammaln(np.sum(conc1, -1))
+        - special.gammaln(np.sum(conc2, -1))
+        - np.sum(special.gammaln(conc1) - special.gammaln(conc2), -1)
+        + np.sum((conc1 - conc2) * (special.digamma(conc1) - special.digamma(
+            np.sum(conc1, -1, keepdims=True))), -1))
+
+    self.assertAllClose(kl_expected, kl_actual_val, atol=0., rtol=1e-6)
+    self.assertAllClose(kl_sample_val, kl_actual_val, atol=0., rtol=1e-1)
+
+    # Make sure KL(d1||d1) is 0
+    kl_same = self.evaluate(kullback_leibler.kl_divergence(d1, d1))
+    self.assertAllClose(kl_same, np.zeros_like(kl_expected))
+
 
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/python/kernel_tests/distributions/util_test.py b/tensorflow/python/kernel_tests/distributions/util_test.py
index 63d19c15cfdc8c8a29b98d3cc76df5aaa22ee36d..2f256d3e8beac145a14ca1dd63f267fb5f4ef3a5 100644
--- a/tensorflow/python/kernel_tests/distributions/util_test.py
+++ b/tensorflow/python/kernel_tests/distributions/util_test.py
@@ -814,6 +814,30 @@ class FillTriangularTest(test.TestCase):
     self._run_test(self._rng.randn(2, 3, int(7*8/2)), upper=True)
 
 
+class FillTriangularInverseTest(FillTriangularTest):
+
+  def _run_test(self, x_, use_deferred_shape=False, **kwargs):
+    x_ = np.asarray(x_)
+    with self.test_session() as sess:
+      static_shape = None if use_deferred_shape else x_.shape
+      x_pl = array_ops.placeholder_with_default(x_, shape=static_shape)
+      zeros_like_x_pl = (x_pl * array_ops.stop_gradient(x_pl - 1.)
+                         - array_ops.stop_gradient(x_pl * (x_pl - 1.)))
+      x = x_pl + zeros_like_x_pl
+      actual = du.fill_triangular(x, **kwargs)
+      inverse_actual = du.fill_triangular_inverse(actual, **kwargs)
+
+      inverse_actual_ = sess.run(
+          inverse_actual,
+          feed_dict={x_pl: x_})
+
+    if use_deferred_shape:
+      self.assertEqual(None, inverse_actual.shape)
+    else:
+      self.assertAllEqual(x_.shape, inverse_actual.shape)
+    self.assertAllEqual(x_, inverse_actual_)
+
+
 class ReduceWeightedLogSumExp(test.TestCase):
 
   def _reduce_weighted_logsumexp(self, logx, w, axis, keep_dims=False):
diff --git a/tensorflow/user_ops/duplicate_op.cc b/tensorflow/python/kernel_tests/duplicate_op.cc
similarity index 100%
rename from tensorflow/user_ops/duplicate_op.cc
rename to tensorflow/python/kernel_tests/duplicate_op.cc
diff --git a/tensorflow/user_ops/duplicate_op_test.py b/tensorflow/python/kernel_tests/duplicate_op_test.py
similarity index 69%
rename from tensorflow/user_ops/duplicate_op_test.py
rename to tensorflow/python/kernel_tests/duplicate_op_test.py
index b61e68d75e3ef253788da82cce56d113bc5e44f9..529d3dd0b3aa1f1013119ef4a90363dbd8d53cd0 100644
--- a/tensorflow/user_ops/duplicate_op_test.py
+++ b/tensorflow/python/kernel_tests/duplicate_op_test.py
@@ -17,23 +17,26 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-import os.path
+import os
 
-import tensorflow as tf
+from tensorflow.python.framework import load_library
+from tensorflow.python.ops import math_ops
+from tensorflow.python.platform import resource_loader
+from tensorflow.python.platform import test
 
 
-class DuplicateOpTest(tf.test.TestCase):
+class DuplicateOpTest(test.TestCase):
 
   def testBasic(self):
-    library_filename = os.path.join(tf.resource_loader.get_data_files_path(),
+    library_filename = os.path.join(resource_loader.get_data_files_path(),
                                     'duplicate_op.so')
-    duplicate = tf.load_op_library(library_filename)
+    duplicate = load_library.load_op_library(library_filename)
 
     self.assertEqual(len(duplicate.OP_LIST.op), 0)
 
     with self.test_session():
-      self.assertEqual(tf.add(1, 41).eval(), 42)
+      self.assertEqual(math_ops.add(1, 41).eval(), 42)
 
 
 if __name__ == '__main__':
-  tf.test.main()
+  test.main()
diff --git a/tensorflow/python/kernel_tests/fifo_queue_test.py b/tensorflow/python/kernel_tests/fifo_queue_test.py
index ce73e7ad3e5f822363c697609dfa163b6f13751a..14a336c6881960cbf03ba767a051835feebf9d04 100644
--- a/tensorflow/python/kernel_tests/fifo_queue_test.py
+++ b/tensorflow/python/kernel_tests/fifo_queue_test.py
@@ -31,6 +31,7 @@ from tensorflow.python.framework import dtypes as dtypes_lib
 from tensorflow.python.framework import errors_impl
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import tensor_shape
+from tensorflow.python.framework import test_util
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import data_flow_ops
@@ -125,12 +126,21 @@ class FIFOQueueTest(test.TestCase):
       q.enqueue_many([[1, 2, 3, 4], [[1, 1], [2, 2], [3, 3], [4, 4]]]).run()
       self.assertEqual(4, q.size().eval())
 
+  @test_util.run_in_graph_and_eager_modes()
   def testMultipleDequeues(self):
-    with self.test_session() as session:
-      q = data_flow_ops.FIFOQueue(10, [dtypes_lib.int32], shapes=[()])
-      q.enqueue_many([[1, 2, 3]]).run()
-      a, b, c = session.run([q.dequeue(), q.dequeue(), q.dequeue()])
-      self.assertAllEqual(set([1, 2, 3]), set([a, b, c]))
+    q = data_flow_ops.FIFOQueue(10, [dtypes_lib.int32], shapes=[()])
+    self.evaluate(q.enqueue_many([[1, 2, 3]]))
+    a, b, c = self.evaluate([q.dequeue(), q.dequeue(), q.dequeue()])
+    self.assertAllEqual(set([1, 2, 3]), set([a, b, c]))
+
+  @test_util.run_in_graph_and_eager_modes()
+  def testQueuesDontShare(self):
+    q = data_flow_ops.FIFOQueue(10, [dtypes_lib.int32], shapes=[()])
+    self.evaluate(q.enqueue(1))
+    q2 = data_flow_ops.FIFOQueue(10, [dtypes_lib.int32], shapes=[()])
+    self.evaluate(q2.enqueue(2))
+    self.assertAllEqual(self.evaluate(q2.dequeue()), 2)
+    self.assertAllEqual(self.evaluate(q.dequeue()), 1)
 
   def testEnqueueDictWithoutNames(self):
     with self.test_session():
diff --git a/tensorflow/python/kernel_tests/init_ops_test.py b/tensorflow/python/kernel_tests/init_ops_test.py
index a9b55854f1b4a3dfc49f05397ca32bc7b2ccb88e..795aa67248f66e72f8f772845c4ca5b2b1b06d3d 100644
--- a/tensorflow/python/kernel_tests/init_ops_test.py
+++ b/tensorflow/python/kernel_tests/init_ops_test.py
@@ -362,6 +362,33 @@ class UniformUnitScalingInitializationTest(test.TestCase):
         dtype=dtypes.string)
 
 
+class VarianceScalingInitializationTest(test.TestCase):
+
+  def testNormalDistribution(self):
+    shape = [100, 100]
+    expect_mean = 0.
+    expect_var = 1. / shape[0]
+    init = init_ops.variance_scaling_initializer(distribution='normal')
+
+    with self.test_session(use_gpu=True):
+      x = init(shape).eval()
+
+    self.assertNear(np.mean(x), expect_mean, err=1e-2)
+    self.assertNear(np.var(x), expect_var, err=1e-2)
+
+  def testUniformDistribution(self):
+    shape = [100, 100]
+    expect_mean = 0.
+    expect_var = 1. / shape[0]
+    init = init_ops.variance_scaling_initializer(distribution='uniform')
+
+    with self.test_session(use_gpu=True):
+      x = init(shape).eval()
+
+    self.assertNear(np.mean(x), expect_mean, err=1e-2)
+    self.assertNear(np.var(x), expect_var, err=1e-2)
+
+
 # TODO(vrv): move to sequence_ops_test?
 class RangeTest(test.TestCase):
 
diff --git a/tensorflow/python/kernel_tests/inplace_ops_test.py b/tensorflow/python/kernel_tests/inplace_ops_test.py
index 0f95e13187fcd5cc199d871ea5efdca363b37cd0..6e894365af68877bd4f2ff4ae0f18db7c0829275 100644
--- a/tensorflow/python/kernel_tests/inplace_ops_test.py
+++ b/tensorflow/python/kernel_tests/inplace_ops_test.py
@@ -166,7 +166,8 @@ class InplaceOpsTest(test_util.TensorFlowTestCase):
 
   def testEmpty(self):
     for dtype in [
-        dtypes.float32, dtypes.float64, dtypes.int32, dtypes.int64, dtypes.bool
+        dtypes.float32, dtypes.float64, dtypes.int32, dtypes.int64, dtypes.bool,
+        dtypes.uint8
     ]:
       with self.test_session(use_gpu=True):
         test_shapes = [(), (1,), (2, 3), (0, 2), (2, 3, 5), (2, 0, 5)]
@@ -187,11 +188,12 @@ class InplaceOpsTest(test_util.TensorFlowTestCase):
           self.assertEqual(val.dtype, dtype.as_numpy_dtype)
           self.assertAllEqual(val, np.zeros(shape, dtype.as_numpy_dtype))
 
-        val = inplace_ops.empty((1, 2), dtypes.string, init=True).eval()
-        self.assertEqual(val.tolist(), [[b"", b""]])
+    with self.test_session(use_gpu=True):
+      val = inplace_ops.empty((1, 2), dtypes.string, init=True).eval()
+      self.assertEqual(val.tolist(), [[b"", b""]])
 
-        val = inplace_ops.empty((1, 2), dtypes.string, init=False).eval()
-        self.assertEqual(val.tolist(), [[b"", b""]])
+      val = inplace_ops.empty((1, 2), dtypes.string, init=False).eval()
+      self.assertEqual(val.tolist(), [[b"", b""]])
 
 
 if __name__ == "__main__":
diff --git a/tensorflow/user_ops/invalid_op.cc b/tensorflow/python/kernel_tests/invalid_op.cc
similarity index 100%
rename from tensorflow/user_ops/invalid_op.cc
rename to tensorflow/python/kernel_tests/invalid_op.cc
diff --git a/tensorflow/user_ops/invalid_op_test.py b/tensorflow/python/kernel_tests/invalid_op_test.py
similarity index 67%
rename from tensorflow/user_ops/invalid_op_test.py
rename to tensorflow/python/kernel_tests/invalid_op_test.py
index c90a00ce58bb4f6e1bd74c9f323e6cdc86397365..238299a895487b1cab7db053fd7f354d4a167ea9 100644
--- a/tensorflow/user_ops/invalid_op_test.py
+++ b/tensorflow/python/kernel_tests/invalid_op_test.py
@@ -17,19 +17,22 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-import os.path
+import os
 
-import tensorflow as tf
+from tensorflow.python.framework import errors
+from tensorflow.python.framework import load_library
+from tensorflow.python.platform import resource_loader
+from tensorflow.python.platform import test
 
 
-class InvalidOpTest(tf.test.TestCase):
+class InvalidOpTest(test.TestCase):
 
   def testBasic(self):
-    library_filename = os.path.join(tf.resource_loader.get_data_files_path(),
+    library_filename = os.path.join(resource_loader.get_data_files_path(),
                                     'invalid_op.so')
-    with self.assertRaises(tf.errors.InvalidArgumentError):
-      tf.load_op_library(library_filename)
+    with self.assertRaises(errors.InvalidArgumentError):
+      load_library.load_op_library(library_filename)
 
 
 if __name__ == '__main__':
-  tf.test.main()
+  test.main()
diff --git a/tensorflow/python/kernel_tests/linalg/BUILD b/tensorflow/python/kernel_tests/linalg/BUILD
index 91be80322c37792be02d1b625df6757c9d80b060..0123adc2c3e5c32fd86ef11e7b1f552964232abd 100644
--- a/tensorflow/python/kernel_tests/linalg/BUILD
+++ b/tensorflow/python/kernel_tests/linalg/BUILD
@@ -124,6 +124,7 @@ cuda_py_test(
         "//tensorflow/python:random_ops",
     ],
     shard_count = 5,
+    tags = ["optonly"],  # Test is flaky without optimization.
 )
 
 cuda_py_test(
diff --git a/tensorflow/python/kernel_tests/linalg_grad_test.py b/tensorflow/python/kernel_tests/linalg_grad_test.py
index 7d367a92750ae3562c93d2381eb895c94a866eaa..6f401358a2519a699488f0372323b5a41621c4cd 100644
--- a/tensorflow/python/kernel_tests/linalg_grad_test.py
+++ b/tensorflow/python/kernel_tests/linalg_grad_test.py
@@ -177,6 +177,12 @@ if __name__ == '__main__':
             MatrixUnaryFunctorGradientTest, 'MatrixDeterminantGradient', name,
             _GetMatrixUnaryFunctorGradientTest(linalg_ops.matrix_determinant,
                                                dtype, shape))
+        _AddTest(
+            MatrixUnaryFunctorGradientTest, 'LogMatrixDeterminantGradient',
+            name,
+            _GetMatrixUnaryFunctorGradientTest(
+                lambda x: linalg_ops.log_matrix_determinant(x)[1],
+                dtype, shape))
 
   # Tests for gradients of matrix_solve_ls
   for dtype in np.float32, np.float64:
diff --git a/tensorflow/python/kernel_tests/pooling_ops_test.py b/tensorflow/python/kernel_tests/pooling_ops_test.py
index a0c372db7d0a4e76c37c01e1ce24cd8fc9123f7a..e95c72971521452a239b78ff4ab9c25c3089f1da 100644
--- a/tensorflow/python/kernel_tests/pooling_ops_test.py
+++ b/tensorflow/python/kernel_tests/pooling_ops_test.py
@@ -947,7 +947,7 @@ class PoolingTest(test.TestCase):
           output_sizes,
           x_init_value=x_init_value,
           delta=1e-2)
-    print("%s gradient error = " % func_name, err)
+    tf_logging.info("%s gradient error = " % func_name, err)
     self.assertLess(err, err_tolerance)
 
   def _ConstructAndTestSecondGradient(self,
@@ -1024,7 +1024,7 @@ class PoolingTest(test.TestCase):
           input_sizes,
           x_init_value=x_init_value,
           delta=1e-2)
-    print("%s second-order gradient error = " % func_name, err)
+    tf_logging.info("%s second-order gradient error = " % func_name, err)
     self.assertLess(err, err_tolerance)
 
   def _testMaxPoolGradValidPadding1_1(self, data_format, use_gpu):
diff --git a/tensorflow/python/kernel_tests/py_func_test.py b/tensorflow/python/kernel_tests/py_func_test.py
index b9f44d728a1d9843df1e836594f9caa7010d8a94..4239151070b35ac11511360bd85c5c2424d80f3d 100644
--- a/tensorflow/python/kernel_tests/py_func_test.py
+++ b/tensorflow/python/kernel_tests/py_func_test.py
@@ -19,6 +19,7 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import gc
 import re
 
 import numpy as np
@@ -26,6 +27,7 @@ from six.moves import queue
 from six.moves import xrange  # pylint: disable=redefined-builtin
 
 from tensorflow.python.client import session as session_lib
+from tensorflow.python.eager import backprop
 from tensorflow.python.eager import context
 from tensorflow.python.eager import function
 from tensorflow.python.framework import constant_op
@@ -34,6 +36,7 @@ from tensorflow.python.framework import errors
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import test_util
 from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import gradients_impl
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import resource_variable_ops
 from tensorflow.python.ops import script_ops
@@ -432,13 +435,29 @@ class PyFuncTest(test.TestCase):
 
   # ----- Tests shared by py_func and eager_py_func -----
   def testCleanup(self):
-    for _ in xrange(1000):
-      g = ops.Graph()
-      with g.as_default():
-        c = constant_op.constant([1.], dtypes.float32)
-        _ = script_ops.py_func(lambda x: x + 1, [c], [dtypes.float32])
-        _ = script_ops.eager_py_func(lambda x: x + 1, [c], [dtypes.float32])
-    self.assertTrue(script_ops._py_funcs.size() < 100)
+    # Delete everything created by previous tests to avoid side effects.
+    ops.reset_default_graph()
+    gc.collect()
+    initial_size = script_ops._py_funcs.size()
+    # Encapsulate the graph generation, so locals can be deleted.
+    def make_graphs():
+      for _ in xrange(1000):
+        g = ops.Graph()
+        with g.as_default():
+          c = constant_op.constant([1.], dtypes.float32)
+          _ = script_ops.py_func(lambda x: x + 1, [c], [dtypes.float32])
+          _ = script_ops.eager_py_func(lambda x: x + 1, [c], [dtypes.float32])
+          # These ops have a reference to 'c' which has a reference to the graph.
+          # Checks if the functions are being deleted though the graph is referenced from them.
+          # (see #18292)
+          _ = script_ops.py_func(lambda x: x + c.shape[0], [c], [dtypes.float32])
+          _ = script_ops.eager_py_func(lambda x: x + c.shape[0], [c], [dtypes.float32])
+ 
+    # Call garbage collector to enforce deletion.
+    make_graphs()
+    ops.reset_default_graph()
+    gc.collect()
+    self.assertEqual(initial_size, script_ops._py_funcs.size())
 
   # ----- Tests for eager_py_func -----
   @test_util.run_in_graph_and_eager_modes()
@@ -446,9 +465,8 @@ class PyFuncTest(test.TestCase):
     a = array_ops.ones((3, 3), dtype=dtypes.int32)
     x = array_ops.ones((3, 1), dtype=dtypes.int32)
     output = script_ops.eager_py_func(matmul, inp=[a, x], Tout=dtypes.int32)
-    with self.test_session():
-      ret = self.evaluate(output)
-      self.assertAllEqual(ret, [[3], [3], [3]])
+    ret = self.evaluate(output)
+    self.assertAllEqual(ret, [[3], [3], [3]])
 
   @test_util.run_in_graph_and_eager_modes()
   def testEagerSingleOutputFloat32(self):
@@ -516,8 +534,7 @@ class PyFuncTest(test.TestCase):
   @test_util.run_in_graph_and_eager_modes()
   def testEagerReturningVariableRaisesError(self):
     def return_variable():
-      variable = resource_variable_ops.ResourceVariable(0.0)
-      return variable
+      return resource_variable_ops.ResourceVariable(0.0)
 
     with self.assertRaisesRegexp(errors.UnknownError,
                                  "Attempting to return a variable"):
@@ -525,6 +542,80 @@ class PyFuncTest(test.TestCase):
           return_variable, inp=[], Tout=dtypes.float32)
       self.evaluate(output)
 
+  @test_util.run_in_graph_and_eager_modes()
+  def testEagerGradientTape(self):
+
+    def f(x):
+      return x**2
+
+    x = constant_op.constant(3.0)
+    with backprop.GradientTape() as tape:
+      tape.watch(x)
+      y = script_ops.eager_py_func(f, inp=[x], Tout=dtypes.float32)
+    dy_dx = tape.gradient(y, x)
+    self.assertEqual(self.evaluate(dy_dx), 6.0)
+
+  def testEagerGradientGraph(self):
+
+    def f(x):
+      return x**2
+
+    x = constant_op.constant(3.0)
+    y = script_ops.eager_py_func(f, inp=[x], Tout=dtypes.float32)
+    dy_dx = gradients_impl.gradients(y, x)[0]
+    self.assertEqual(self.evaluate(dy_dx), 6.0)
+
+  @test_util.run_in_graph_and_eager_modes()
+  def testEagerGradientTapeMultipleArgs(self):
+
+    def f(x, y):
+      return x**2 + y**2
+
+    x = constant_op.constant(3.0)
+    y = constant_op.constant(4.0)
+    with backprop.GradientTape() as tape:
+      tape.watch(x)
+      tape.watch(y)
+      z = script_ops.eager_py_func(f, inp=[x, y], Tout=dtypes.float32)
+
+    dz_dx, dz_dy = tape.gradient(z, [x, y])
+    self.assertEqual(self.evaluate(dz_dx), 6.0)
+    self.assertEqual(self.evaluate(dz_dy), 8.0)
+
+  def testEagerGradientGraphMultipleArgs(self):
+
+    def f(x, y):
+      return x**2 + y**2
+
+    x = constant_op.constant(3.0)
+    y = constant_op.constant(4.0)
+    z = script_ops.eager_py_func(f, inp=[x, y], Tout=dtypes.float32)
+
+    dz_dx, dz_dy = gradients_impl.gradients(z, [x, y])
+    self.assertEqual(self.evaluate(dz_dx), 6.0)
+    self.assertEqual(self.evaluate(dz_dy), 8.0)
+
+  def testEagerGradientGraphLogHuber(self):
+
+    def log_huber(x, m):
+      if math_ops.abs(x) <= m:
+        return x**2
+      else:
+        return m**2 * (1 - 2 * math_ops.log(m) + math_ops.log(x**2))
+
+    x = array_ops.placeholder(dtypes.float32)
+    m = array_ops.placeholder(dtypes.float32)
+
+    y = script_ops.eager_py_func(
+        func=log_huber, inp=[x, m], Tout=dtypes.float32)
+    dy_dx = gradients_impl.gradients(y, x)[0]
+
+    with self.test_session() as sess:
+      # Takes the first branch of log_huber.
+      y, dy_dx = sess.run([y, dy_dx], feed_dict={x: 1.0, m: 2.0})
+      self.assertEqual(y, 1.0)
+      self.assertEqual(dy_dx, 2.0)
+
 
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/python/kernel_tests/reader_ops_test.py b/tensorflow/python/kernel_tests/reader_ops_test.py
index 82a27eebeef16c9dacaf1b900f0398a56533cd2d..7be473a5e750d9d6880f112cb0ca89b3ae61a7fd 100644
--- a/tensorflow/python/kernel_tests/reader_ops_test.py
+++ b/tensorflow/python/kernel_tests/reader_ops_test.py
@@ -77,6 +77,69 @@ _TEXT = b"""Gaily bedight,
     """
 
 
+class TFCompressionTestCase(test.TestCase):
+
+  def setUp(self):
+    super(TFCompressionTestCase, self).setUp()
+    self._num_files = 2
+    self._num_records = 7
+
+  def _Record(self, f, r):
+    return compat.as_bytes("Record %d of file %d" % (r, f))
+
+  def _CreateFiles(self, options=None, prefix=""):
+    filenames = []
+    for i in range(self._num_files):
+      name = prefix + "tfrecord.%d.txt" % i
+      records = [self._Record(i, j) for j in range(self._num_records)]
+      fn = self._WriteRecordsToFile(records, name, options)
+      filenames.append(fn)
+    return filenames
+
+  def _WriteRecordsToFile(self, records, name="tfrecord", options=None):
+    fn = os.path.join(self.get_temp_dir(), name)
+    with tf_record.TFRecordWriter(fn, options=options) as writer:
+      for r in records:
+        writer.write(r)
+    return fn
+
+  def _ZlibCompressFile(self, infile, name="tfrecord.z"):
+    # zlib compress the file and write compressed contents to file.
+    with open(infile, "rb") as f:
+      cdata = zlib.compress(f.read())
+
+    zfn = os.path.join(self.get_temp_dir(), name)
+    with open(zfn, "wb") as f:
+      f.write(cdata)
+    return zfn
+
+  def _GzipCompressFile(self, infile, name="tfrecord.gz"):
+    # gzip compress the file and write compressed contents to file.
+    with open(infile, "rb") as f:
+      cdata = f.read()
+
+    gzfn = os.path.join(self.get_temp_dir(), name)
+    with gzip.GzipFile(gzfn, "wb") as f:
+      f.write(cdata)
+    return gzfn
+
+  def _ZlibDecompressFile(self, infile, name="tfrecord"):
+    with open(infile, "rb") as f:
+      cdata = zlib.decompress(f.read())
+    fn = os.path.join(self.get_temp_dir(), name)
+    with open(fn, "wb") as f:
+      f.write(cdata)
+    return fn
+
+  def _GzipDecompressFile(self, infile, name="tfrecord"):
+    with gzip.GzipFile(infile, "rb") as f:
+      cdata = f.read()
+    fn = os.path.join(self.get_temp_dir(), name)
+    with open(fn, "wb") as f:
+      f.write(cdata)
+    return fn
+
+
 class IdentityReaderTest(test.TestCase):
 
   def _ExpectRead(self, sess, key, value, expected):
@@ -348,7 +411,7 @@ class TextLineReaderTest(test.TestCase):
         k, v = sess.run([key, value])
 
 
-class FixedLengthRecordReaderTest(test.TestCase):
+class FixedLengthRecordReaderTest(TFCompressionTestCase):
 
   def setUp(self):
     super(FixedLengthRecordReaderTest, self).setUp()
@@ -407,40 +470,18 @@ class FixedLengthRecordReaderTest(test.TestCase):
 
   # gap_bytes=hop_bytes-record_bytes
   def _CreateGzipFiles(self, num_records, gap_bytes):
-    filenames = []
-    for i in range(self._num_files):
-      fn = os.path.join(self.get_temp_dir(), "fixed_length_record.%d.txt" % i)
-      filenames.append(fn)
-      with gzip.GzipFile(fn, "wb") as f:
-        f.write(b"H" * self._header_bytes)
-        if num_records > 0:
-          f.write(self._Record(i, 0))
-        for j in range(1, num_records):
-          if gap_bytes > 0:
-            f.write(b"G" * gap_bytes)
-          f.write(self._Record(i, j))
-        f.write(b"F" * self._footer_bytes)
+    filenames = self._CreateFiles(num_records, gap_bytes)
+    for fn in filenames:
+      # compress inplace.
+      self._GzipCompressFile(fn, fn)
     return filenames
 
   # gap_bytes=hop_bytes-record_bytes
   def _CreateZlibFiles(self, num_records, gap_bytes):
-    filenames = []
-    for i in range(self._num_files):
-      fn = os.path.join(self.get_temp_dir(), "fixed_length_record.%d.txt" % i)
-      filenames.append(fn)
-      with open(fn + ".tmp", "wb") as f:
-        f.write(b"H" * self._header_bytes)
-        if num_records > 0:
-          f.write(self._Record(i, 0))
-        for j in range(1, num_records):
-          if gap_bytes > 0:
-            f.write(b"G" * gap_bytes)
-          f.write(self._Record(i, j))
-        f.write(b"F" * self._footer_bytes)
-      with open(fn + ".tmp", "rb") as f:
-        cdata = zlib.compress(f.read())
-        with open(fn, "wb") as zf:
-          zf.write(cdata)
+    filenames = self._CreateFiles(num_records, gap_bytes)
+    for fn in filenames:
+      # compress inplace.
+      self._ZlibCompressFile(fn, fn)
     return filenames
 
   def _CreateGzipOverlappedRecordFiles(self, num_overlapped_records):
@@ -477,10 +518,7 @@ class FixedLengthRecordReaderTest(test.TestCase):
           ])
           f.write(compat.as_bytes(all_records_str))
         f.write(b"F" * self._footer_bytes)
-      with open(fn + ".tmp", "rb") as f:
-        cdata = zlib.compress(f.read())
-        with open(fn, "wb") as zf:
-          zf.write(cdata)
+      self._ZlibCompressFile(fn + ".tmp", fn)
     return filenames
 
   # gap_bytes=hop_bytes-record_bytes
@@ -529,7 +567,6 @@ class FixedLengthRecordReaderTest(test.TestCase):
       for i in range(self._num_files):
         for j in range(num_overlapped_records):
           k, v = sess.run([key, value])
-          print(v)
           self.assertAllEqual("%s:%d" % (files[i], j), compat.as_text(k))
           self.assertAllEqual(self._OverlappedRecord(i, j), v)
 
@@ -579,25 +616,10 @@ class FixedLengthRecordReaderTest(test.TestCase):
           files, num_overlapped_records, encoding="ZLIB")
 
 
-class TFRecordReaderTest(test.TestCase):
+class TFRecordReaderTest(TFCompressionTestCase):
 
   def setUp(self):
     super(TFRecordReaderTest, self).setUp()
-    self._num_files = 2
-    self._num_records = 7
-
-  def _Record(self, f, r):
-    return compat.as_bytes("Record %d of file %d" % (r, f))
-
-  def _CreateFiles(self):
-    filenames = []
-    for i in range(self._num_files):
-      fn = os.path.join(self.get_temp_dir(), "tf_record.%d.txt" % i)
-      filenames.append(fn)
-      writer = tf_record.TFRecordWriter(fn)
-      for j in range(self._num_records):
-        writer.write(self._Record(i, j))
-    return filenames
 
   def testOneEpoch(self):
     files = self._CreateFiles()
@@ -647,107 +669,106 @@ class TFRecordReaderTest(test.TestCase):
       self.assertEqual(self._num_files * self._num_records, num_v)
 
   def testReadZlibFiles(self):
-    files = self._CreateFiles()
-    zlib_files = []
-    for i, fn in enumerate(files):
-      with open(fn, "rb") as f:
-        cdata = zlib.compress(f.read())
-
-        zfn = os.path.join(self.get_temp_dir(), "tfrecord_%s.z" % i)
-        with open(zfn, "wb") as f:
-          f.write(cdata)
-        zlib_files.append(zfn)
+    options = tf_record.TFRecordOptions(TFRecordCompressionType.ZLIB)
+    files = self._CreateFiles(options)
 
     with self.test_session() as sess:
-      options = tf_record.TFRecordOptions(TFRecordCompressionType.ZLIB)
       reader = io_ops.TFRecordReader(name="test_reader", options=options)
       queue = data_flow_ops.FIFOQueue(99, [dtypes.string], shapes=())
       key, value = reader.read(queue)
 
-      queue.enqueue_many([zlib_files]).run()
+      queue.enqueue_many([files]).run()
       queue.close().run()
       for i in range(self._num_files):
         for j in range(self._num_records):
           k, v = sess.run([key, value])
-          self.assertTrue(compat.as_text(k).startswith("%s:" % zlib_files[i]))
+          self.assertTrue(compat.as_text(k).startswith("%s:" % files[i]))
           self.assertAllEqual(self._Record(i, j), v)
 
   def testReadGzipFiles(self):
-    files = self._CreateFiles()
-    gzip_files = []
-    for i, fn in enumerate(files):
-      with open(fn, "rb") as f:
-        cdata = f.read()
-
-        zfn = os.path.join(self.get_temp_dir(), "tfrecord_%s.gz" % i)
-        with gzip.GzipFile(zfn, "wb") as f:
-          f.write(cdata)
-        gzip_files.append(zfn)
+    options = tf_record.TFRecordOptions(TFRecordCompressionType.GZIP)
+    files = self._CreateFiles(options)
 
     with self.test_session() as sess:
-      options = tf_record.TFRecordOptions(TFRecordCompressionType.GZIP)
       reader = io_ops.TFRecordReader(name="test_reader", options=options)
       queue = data_flow_ops.FIFOQueue(99, [dtypes.string], shapes=())
       key, value = reader.read(queue)
 
-      queue.enqueue_many([gzip_files]).run()
+      queue.enqueue_many([files]).run()
       queue.close().run()
       for i in range(self._num_files):
         for j in range(self._num_records):
           k, v = sess.run([key, value])
-          self.assertTrue(compat.as_text(k).startswith("%s:" % gzip_files[i]))
+          self.assertTrue(compat.as_text(k).startswith("%s:" % files[i]))
           self.assertAllEqual(self._Record(i, j), v)
 
 
-class TFRecordWriterZlibTest(test.TestCase):
+class TFRecordWriterTest(TFCompressionTestCase):
 
   def setUp(self):
-    super(TFRecordWriterZlibTest, self).setUp()
-    self._num_files = 2
-    self._num_records = 7
+    super(TFRecordWriterTest, self).setUp()
+
+  def _AssertFilesEqual(self, a, b, equal):
+    for an, bn in zip(a, b):
+      with open(an, "rb") as af, open(bn, "rb") as bf:
+        if equal:
+          self.assertEqual(af.read(), bf.read())
+        else:
+          self.assertNotEqual(af.read(), bf.read())
+
+  def testWriteReadZLibFiles(self):
+    # Write uncompressed then compress manually.
+    options = tf_record.TFRecordOptions(TFRecordCompressionType.NONE)
+    files = self._CreateFiles(options, prefix="uncompressed")
+    zlib_files = [
+        self._ZlibCompressFile(fn, "tfrecord_%s.z" % i)
+        for i, fn in enumerate(files)
+    ]
+    self._AssertFilesEqual(files, zlib_files, False)
 
-  def _Record(self, f, r):
-    return compat.as_bytes("Record %d of file %d" % (r, f))
+    # Now write compressd and verify same.
+    options = tf_record.TFRecordOptions(TFRecordCompressionType.ZLIB)
+    compressed_files = self._CreateFiles(options, prefix="compressed")
+    self._AssertFilesEqual(compressed_files, zlib_files, True)
 
-  def _CreateFiles(self):
-    filenames = []
-    for i in range(self._num_files):
-      fn = os.path.join(self.get_temp_dir(), "tf_record.%d.txt" % i)
-      filenames.append(fn)
-      options = tf_record.TFRecordOptions(
-          compression_type=TFRecordCompressionType.ZLIB)
-      writer = tf_record.TFRecordWriter(fn, options=options)
-      for j in range(self._num_records):
-        writer.write(self._Record(i, j))
-      writer.close()
-      del writer
+    # Decompress compress and verify same.
+    uncompressed_files = [
+        self._ZlibDecompressFile(fn, "tfrecord_%s.z" % i)
+        for i, fn in enumerate(compressed_files)
+    ]
+    self._AssertFilesEqual(uncompressed_files, files, True)
+
+  def testWriteReadGzipFiles(self):
+    # Write uncompressed then compress manually.
+    options = tf_record.TFRecordOptions(TFRecordCompressionType.NONE)
+    files = self._CreateFiles(options, prefix="uncompressed")
+    gzip_files = [
+        self._GzipCompressFile(fn, "tfrecord_%s.gz" % i)
+        for i, fn in enumerate(files)
+    ]
+    self._AssertFilesEqual(files, gzip_files, False)
 
-    return filenames
+    # Now write compressd and verify same.
+    options = tf_record.TFRecordOptions(TFRecordCompressionType.GZIP)
+    compressed_files = self._CreateFiles(options, prefix="compressed")
 
-  def _WriteRecordsToFile(self, records, name="tf_record"):
-    fn = os.path.join(self.get_temp_dir(), name)
-    writer = tf_record.TFRecordWriter(fn, options=None)
-    for r in records:
-      writer.write(r)
-    writer.close()
-    del writer
-    return fn
+    # Note: Gzips written by TFRecordWriter add 'tfrecord_0' so
+    # compressed_files can't be compared with gzip_files
 
-  def _ZlibCompressFile(self, infile, name="tfrecord.z"):
-    # zlib compress the file and write compressed contents to file.
-    with open(infile, "rb") as f:
-      cdata = zlib.compress(f.read())
+    # Decompress compress and verify same.
+    uncompressed_files = [
+        self._GzipDecompressFile(fn, "tfrecord_%s.gz" % i)
+        for i, fn in enumerate(compressed_files)
+    ]
+    self._AssertFilesEqual(uncompressed_files, files, True)
 
-    zfn = os.path.join(self.get_temp_dir(), name)
-    with open(zfn, "wb") as f:
-      f.write(cdata)
-    return zfn
+
+class TFRecordWriterZlibTest(TFCompressionTestCase):
 
   def testOneEpoch(self):
-    files = self._CreateFiles()
+    options = tf_record.TFRecordOptions(TFRecordCompressionType.ZLIB)
+    files = self._CreateFiles(options)
     with self.test_session() as sess:
-      options = tf_record.TFRecordOptions(
-          compression_type=TFRecordCompressionType.ZLIB)
       reader = io_ops.TFRecordReader(name="test_reader", options=options)
       queue = data_flow_ops.FIFOQueue(99, [dtypes.string], shapes=())
       key, value = reader.read(queue)
@@ -788,8 +809,7 @@ class TFRecordWriterZlibTest(test.TestCase):
       h.write(output)
 
     with self.test_session() as sess:
-      options = tf_record.TFRecordOptions(
-          compression_type=TFRecordCompressionType.ZLIB)
+      options = tf_record.TFRecordOptions(TFRecordCompressionType.ZLIB)
       reader = io_ops.TFRecordReader(name="test_reader", options=options)
       queue = data_flow_ops.FIFOQueue(1, [dtypes.string], shapes=())
       key, value = reader.read(queue)
@@ -808,9 +828,7 @@ class TFRecordWriterZlibTest(test.TestCase):
     # read the compressed contents and verify.
     actual = []
     for r in tf_record.tf_record_iterator(
-        zfn,
-        options=tf_record.TFRecordOptions(
-            tf_record.TFRecordCompressionType.ZLIB)):
+        zfn, options=tf_record.TFRecordOptions(TFRecordCompressionType.ZLIB)):
       actual.append(r)
     self.assertEqual(actual, original)
 
@@ -822,12 +840,9 @@ class TFRecordWriterZlibTest(test.TestCase):
     fn = self._WriteRecordsToFile(original, "zlib_read_write_large.tfrecord")
     zfn = self._ZlibCompressFile(fn, "zlib_read_write_large.tfrecord.z")
 
-    # read the compressed contents and verify.
     actual = []
     for r in tf_record.tf_record_iterator(
-        zfn,
-        options=tf_record.TFRecordOptions(
-            tf_record.TFRecordCompressionType.ZLIB)):
+        zfn, options=tf_record.TFRecordOptions(TFRecordCompressionType.ZLIB)):
       actual.append(r)
     self.assertEqual(actual, original)
 
@@ -835,13 +850,7 @@ class TFRecordWriterZlibTest(test.TestCase):
     """Verify that files produced are gzip compatible."""
     original = [b"foo", b"bar"]
     fn = self._WriteRecordsToFile(original, "gzip_read_write.tfrecord")
-
-    # gzip compress the file and write compressed contents to file.
-    with open(fn, "rb") as f:
-      cdata = f.read()
-    gzfn = os.path.join(self.get_temp_dir(), "tf_record.gz")
-    with gzip.GzipFile(gzfn, "wb") as f:
-      f.write(cdata)
+    gzfn = self._GzipCompressFile(fn, "tfrecord.gz")
 
     actual = []
     for r in tf_record.tf_record_iterator(
@@ -850,89 +859,54 @@ class TFRecordWriterZlibTest(test.TestCase):
     self.assertEqual(actual, original)
 
 
-class TFRecordIteratorTest(test.TestCase):
+class TFRecordIteratorTest(TFCompressionTestCase):
 
   def setUp(self):
     super(TFRecordIteratorTest, self).setUp()
     self._num_records = 7
 
-  def _Record(self, r):
-    return compat.as_bytes("Record %d" % r)
-
-  def _WriteCompressedRecordsToFile(
-      self,
-      records,
-      name="tfrecord.z",
-      compression_type=tf_record.TFRecordCompressionType.ZLIB):
-    fn = os.path.join(self.get_temp_dir(), name)
-    options = tf_record.TFRecordOptions(compression_type=compression_type)
-    writer = tf_record.TFRecordWriter(fn, options=options)
-    for r in records:
-      writer.write(r)
-    writer.close()
-    del writer
-    return fn
-
-  def _ZlibDecompressFile(self, infile, name="tfrecord", wbits=zlib.MAX_WBITS):
-    with open(infile, "rb") as f:
-      cdata = zlib.decompress(f.read(), wbits)
-    zfn = os.path.join(self.get_temp_dir(), name)
-    with open(zfn, "wb") as f:
-      f.write(cdata)
-    return zfn
-
   def testIterator(self):
-    fn = self._WriteCompressedRecordsToFile(
-        [self._Record(i) for i in range(self._num_records)],
-        "compressed_records")
-    options = tf_record.TFRecordOptions(
-        compression_type=TFRecordCompressionType.ZLIB)
+    records = [self._Record(0, i) for i in range(self._num_records)]
+    options = tf_record.TFRecordOptions(TFRecordCompressionType.ZLIB)
+    fn = self._WriteRecordsToFile(records, "compressed_records", options)
+
     reader = tf_record.tf_record_iterator(fn, options)
-    for i in range(self._num_records):
+    for expected in records:
       record = next(reader)
-      self.assertAllEqual(self._Record(i), record)
+      self.assertAllEqual(expected, record)
     with self.assertRaises(StopIteration):
       record = next(reader)
 
   def testWriteZlibRead(self):
     """Verify compression with TFRecordWriter is zlib library compatible."""
     original = [b"foo", b"bar"]
-    fn = self._WriteCompressedRecordsToFile(original,
-                                            "write_zlib_read.tfrecord.z")
+    options = tf_record.TFRecordOptions(TFRecordCompressionType.ZLIB)
+    fn = self._WriteRecordsToFile(original, "write_zlib_read.tfrecord.z",
+                                  options)
+
     zfn = self._ZlibDecompressFile(fn, "write_zlib_read.tfrecord")
-    actual = []
-    for r in tf_record.tf_record_iterator(zfn):
-      actual.append(r)
+    actual = list(tf_record.tf_record_iterator(zfn))
     self.assertEqual(actual, original)
 
   def testWriteZlibReadLarge(self):
     """Verify compression for large records is zlib library compatible."""
     # Make it large (about 5MB)
     original = [_TEXT * 10240]
-    fn = self._WriteCompressedRecordsToFile(original,
-                                            "write_zlib_read_large.tfrecord.z")
-    zfn = self._ZlibDecompressFile(fn, "write_zlib_read_large.tf_record")
-    actual = []
-    for r in tf_record.tf_record_iterator(zfn):
-      actual.append(r)
+    options = tf_record.TFRecordOptions(TFRecordCompressionType.ZLIB)
+    fn = self._WriteRecordsToFile(original, "write_zlib_read_large.tfrecord.z",
+                                  options)
+    zfn = self._ZlibDecompressFile(fn, "write_zlib_read_large.tfrecord")
+    actual = list(tf_record.tf_record_iterator(zfn))
     self.assertEqual(actual, original)
 
   def testWriteGzipRead(self):
     original = [b"foo", b"bar"]
-    fn = self._WriteCompressedRecordsToFile(
-        original,
-        "write_gzip_read.tfrecord.gz",
-        compression_type=TFRecordCompressionType.GZIP)
-
-    with gzip.GzipFile(fn, "rb") as f:
-      cdata = f.read()
-    zfn = os.path.join(self.get_temp_dir(), "tf_record")
-    with open(zfn, "wb") as f:
-      f.write(cdata)
+    options = tf_record.TFRecordOptions(TFRecordCompressionType.GZIP)
+    fn = self._WriteRecordsToFile(original, "write_gzip_read.tfrecord.gz",
+                                  options)
 
-    actual = []
-    for r in tf_record.tf_record_iterator(zfn):
-      actual.append(r)
+    gzfn = self._GzipDecompressFile(fn, "write_gzip_read.tfrecord")
+    actual = list(tf_record.tf_record_iterator(gzfn))
     self.assertEqual(actual, original)
 
   def testBadFile(self):
diff --git a/tensorflow/python/kernel_tests/resource_variable_ops_test.py b/tensorflow/python/kernel_tests/resource_variable_ops_test.py
index 846231fe8197461d42f4a23aeda64ffa370cd086..5267eabf0e4f779a96840069a858609059554c89 100644
--- a/tensorflow/python/kernel_tests/resource_variable_ops_test.py
+++ b/tensorflow/python/kernel_tests/resource_variable_ops_test.py
@@ -119,6 +119,13 @@ class ResourceVariableOpsTest(test_util.TensorFlowTestCase):
           dtype=dtypes.int32, shape=[1], name="foo")
       self.assertGreater(len(handle.eval()), 0)
 
+  def testCachedValueReadBeforeWrite(self):
+    with self.test_session() as sess:
+      v = resource_variable_ops.ResourceVariable(0.0, caching_device="cpu:0")
+      sess.run(v.initializer)
+      value, _ = sess.run([v, v.assign_add(1.0)])
+      self.assertAllEqual(value, 0.0)
+
   def testAssignVariableDtypeMismatchEager(self):
     with context.eager_mode():
       handle = resource_variable_ops.var_handle_op(
@@ -145,6 +152,10 @@ class ResourceVariableOpsTest(test_util.TensorFlowTestCase):
     self.evaluate(resource_variable_ops.assign_variable_op(
         id_handle, constant_op.constant(0, dtype=dtypes.int32)))
 
+  def testUnreadOpName(self):
+    v = resource_variable_ops.ResourceVariable(1.0)
+    self.assertNotEqual(v.name, v.assign_add(1.0).name)
+
   @test_util.run_in_graph_and_eager_modes()
   def testCreateRead(self):
     handle = resource_variable_ops.var_handle_op(dtype=dtypes.int32, shape=[])
@@ -531,6 +542,25 @@ class ResourceVariableOpsTest(test_util.TensorFlowTestCase):
       with self.assertRaises(ValueError):
         sess.run(v.initialized_value())
 
+  def testTrainableInProto(self):
+    with ops.Graph().as_default():
+      non_trainable_variable = resource_variable_ops.ResourceVariable(
+          trainable=False,
+          initial_value=constant_op.constant(10.0))
+      self.assertEqual(
+          False,
+          resource_variable_ops.ResourceVariable(
+              variable_def=non_trainable_variable.to_proto())
+          .trainable)
+      trainable_variable = resource_variable_ops.ResourceVariable(
+          trainable=True,
+          initial_value=constant_op.constant(10.0))
+      self.assertEqual(
+          True,
+          resource_variable_ops.ResourceVariable(
+              variable_def=trainable_variable.to_proto())
+          .trainable)
+
   @test_util.run_in_graph_and_eager_modes()
   def testSparseRead(self):
     with self.test_session():
@@ -796,6 +826,16 @@ class ResourceVariableOpsTest(test_util.TensorFlowTestCase):
       state_ops.scatter_add(v, [1], [3])
       self.assertAllEqual([1.0, 5.0], v.numpy())
 
+  def testScatterNdAddStateOps(self):
+    with context.eager_mode():
+      v = resource_variable_ops.ResourceVariable(
+          [1, 1, 1, 1, 1, 1, 1, 1], dtype=dtypes.float32, name="add")
+      indices = constant_op.constant([[4], [3], [1], [7]], dtype=dtypes.int32)
+      updates = constant_op.constant([9, 10, 11, 12], dtype=dtypes.float32)
+      expected = np.array([1, 12, 1, 11, 10, 1, 1, 13])
+      state_ops.scatter_nd_add(v, indices, updates)
+      self.assertAllClose(expected, v.numpy())
+
   def testScatterUpdateCast(self):
     with context.eager_mode():
       v = resource_variable_ops.ResourceVariable([1.0, 2.0], name="update")
diff --git a/tensorflow/python/kernel_tests/segment_reduction_ops_test.py b/tensorflow/python/kernel_tests/segment_reduction_ops_test.py
index 794be096b7309a18f9fe225642bcaafb5058df78..a82855dfeb5b8fcf215f545e53ae0f26638011da 100644
--- a/tensorflow/python/kernel_tests/segment_reduction_ops_test.py
+++ b/tensorflow/python/kernel_tests/segment_reduction_ops_test.py
@@ -264,7 +264,9 @@ class UnsortedSegmentTest(SegmentReductionHelper):
 
     # A subset of ops has been enabled for complex numbers
     self.complex_ops_list = [(np.add, None,
-                              math_ops.unsorted_segment_sum, lambda t: 0)]
+                              math_ops.unsorted_segment_sum, lambda t: 0),
+                             (np.ndarray.__mul__, None,
+                              math_ops.unsorted_segment_prod, lambda t: 1)]
     self.differentiable_dtypes = [dtypes_lib.float16, dtypes_lib.float32,
                                   dtypes_lib.float64]
     self.all_dtypes = (self.differentiable_dtypes +
diff --git a/tensorflow/python/kernel_tests/string_split_op_test.py b/tensorflow/python/kernel_tests/string_split_op_test.py
index a5bd1b6ee072e4e025bf76351a971782b4c23fad..e20daccb28a72f939a448f4cf3bfb283ea616e96 100644
--- a/tensorflow/python/kernel_tests/string_split_op_test.py
+++ b/tensorflow/python/kernel_tests/string_split_op_test.py
@@ -146,5 +146,101 @@ class StringSplitOpTest(test.TestCase):
       self.assertAllEqual(shape, [3, 1])
 
 
+class StringSplitV2OpTest(test.TestCase):
+
+  def testSplitV2(self):
+    strings = ["pigs on the wing", "animals"]
+
+    with self.test_session() as sess:
+      tokens = string_ops.string_split_v2(strings)
+      indices, values, shape = sess.run(tokens)
+      self.assertAllEqual(indices, [[0, 0], [0, 1], [0, 2], [0, 3], [1, 0]])
+      self.assertAllEqual(values, [b"pigs", b"on", b"the", b"wing", b"animals"])
+      self.assertAllEqual(shape, [2, 4])
+
+  def testSplitV2MultiCharSeparator(self):
+    # Match Python behavior:
+    # >>> '1<>2<>3'.split('<>')
+    # ['1', '2', '3']
+    # >>> "<><>4<>5<><>6<>".split("<>")
+    # ['', '', '4', '5', '', '6', '']
+    strings = ["1<>2<>3", "<><>4<>5<><>6<>"]
+
+    with self.test_session() as sess:
+      tokens = string_ops.string_split_v2(strings, sep="<>")
+      indices, values, shape = sess.run(tokens)
+      self.assertAllEqual(
+          indices, [[0, 0], [0, 1], [0, 2],
+                    [1, 0], [1, 1], [1, 2], [1, 3], [1, 4], [1, 5], [1, 6]])
+      self.assertAllEqual(values, [b"1", b"2", b"3",
+                                   b"", b"", b"4", b"5", b"", b"6", b""])
+      self.assertAllEqual(shape, [2, 7])
+
+  def testSplitV2SimpleSeparator(self):
+    # Match Python behavior:
+    # >>> '1,2,3'.split(',')
+    # ['1', '2', '3']
+    # >>> '1,2,,3,'.split(',')
+    # ['1', '2', '', '3', '']
+    strings = ["1,2,3", "4,5,,6,"]
+
+    with self.test_session() as sess:
+      tokens = string_ops.string_split_v2(strings, sep=',')
+      indices, values, shape = sess.run(tokens)
+      self.assertAllEqual(indices, [[0, 0], [0, 1], [0, 2],
+                                    [1, 0], [1, 1], [1, 2], [1, 3], [1, 4]])
+      self.assertAllEqual(values, [b"1", b"2", b"3",
+                                   b"4", b"5", b"", b"6", b""])
+      self.assertAllEqual(shape, [2, 5])
+
+  def testSplitV2EmptySeparator(self):
+    # Match Python behavior:
+    # >>> '1 2 3'.split()
+    # ['1', '2', '3']
+    #>>> '   1   2   3   '.split()
+    #['1', '2', '3']
+    strings = ["1 2 3", "  4  5    6  "]
+
+    with self.test_session() as sess:
+      tokens = string_ops.string_split_v2(strings)
+      indices, values, shape = sess.run(tokens)
+      self.assertAllEqual(indices, [[0, 0], [0, 1], [0, 2],
+                                    [1, 0], [1, 1], [1, 2]])
+      self.assertAllEqual(values, [b"1", b"2", b"3", b"4", b"5", b"6"])
+      self.assertAllEqual(shape, [2, 3])
+
+  def testSplitV2SimpleSeparatorMaxSplit(self):
+    # Match Python behavior:
+    # >>> '1,2,3'.split(',', maxsplit=1)
+    # ['1', '2,3']
+    # >>> '4,5,,6,'.split(',', maxsplit=1)
+    # ['4', '5,,6,']
+    strings = ["1,2,3", "4,5,,6,"]
+
+    with self.test_session() as sess:
+      tokens = string_ops.string_split_v2(strings, sep=',', maxsplit=1)
+      indices, values, shape = sess.run(tokens)
+      self.assertAllEqual(indices, [[0, 0], [0, 1],
+                                    [1, 0], [1, 1]])
+      self.assertAllEqual(values, [b"1", b"2,3", b"4", b"5,,6,"])
+      self.assertAllEqual(shape, [2, 2])
+
+  def testSplitV2EmptySeparatorMaxSplit(self):
+    # Match Python behavior:
+    # '1 2 3'.split(maxsplit=1)
+    # ['1', '2 3']
+    # >>> "  4  5    6  ".split(maxsplit=1)
+    # ['4', '5    6  ']
+    strings = ["1 2 3", "  4  5    6  "]
+
+    with self.test_session() as sess:
+      tokens = string_ops.string_split_v2(strings, maxsplit=1)
+      indices, values, shape = sess.run(tokens)
+      self.assertAllEqual(indices, [[0, 0], [0, 1],
+                                    [1, 0], [1, 1]])
+      self.assertAllEqual(values, [b"1", b"2 3", b"4", b"5    6  "])
+      self.assertAllEqual(shape, [2, 2])
+
+
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/python/kernel_tests/tensor_array_ops_test.py b/tensorflow/python/kernel_tests/tensor_array_ops_test.py
index c0b36f143d109eb28e2784b49e8fd4099b5799a6..ea06357804f45bbba3a9e7e847659d47bf52bffb 100644
--- a/tensorflow/python/kernel_tests/tensor_array_ops_test.py
+++ b/tensorflow/python/kernel_tests/tensor_array_ops_test.py
@@ -26,11 +26,13 @@ from tensorflow.python.eager import backprop
 from tensorflow.python.eager import context
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import errors
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import tensor_shape
 from tensorflow.python.framework import test_util
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import control_flow_ops
+from tensorflow.python.ops import data_flow_ops
 from tensorflow.python.ops import gen_data_flow_ops
 from tensorflow.python.ops import gradients_impl
 from tensorflow.python.ops import init_ops
@@ -549,6 +551,58 @@ class TensorArrayTest(test.TestCase):
                   dtypes.complex64, dtypes.complex128):
       self._testTensorArrayWriteGradientAddMultipleAdds(dtype)
 
+  def testTensorArrayGradWithShapeKnownElementShape(self):
+    with self.test_session(use_gpu=True) as sess:
+      ta = tensor_array_ops.TensorArray(
+          size=3,
+          dtype=dtypes.float32,
+          element_shape=tensor_shape.TensorShape([2, 3]))
+      handle, flow = data_flow_ops.tensor_array_grad_with_shape(
+          handle=ta.handle,
+          flow_in=ta.flow,
+          shape_to_prepend=tensor_shape.TensorShape([4, 5]),
+          source="source")
+      ta_grad = tensor_array_ops.TensorArray(
+          dtypes.float32, handle=handle, flow=flow)
+      value = array_ops.placeholder(dtypes.float32)
+      ta_grad = ta_grad.write(0, value)
+      read_value = ta_grad.read(0)
+
+      # Make sure shape inference worked.
+      self.assertAllEqual([None, None, 2, 3], read_value.shape.as_list())
+      # Writing with wrong shape should not work.
+      with self.assertRaisesRegexp(errors.InvalidArgumentError,
+                                   "Could not write to TensorArray"):
+        fed_value = np.random.random([2, 3])
+        sess.run(read_value, feed_dict={value: fed_value})
+      # Writing with correct shape should work.
+      fed_value = np.random.random([4, 5, 2, 3])
+      self.assertAllClose(fed_value,
+                          sess.run(read_value, feed_dict={value: fed_value}))
+
+  def testTensorArrayGradWithShapeUnknownElementShape(self):
+    with self.test_session(use_gpu=True) as sess:
+      ta = tensor_array_ops.TensorArray(
+          size=3, dtype=dtypes.float32,
+          element_shape=None)  # Note that element_shape is unknown
+      handle, flow = data_flow_ops.tensor_array_grad_with_shape(
+          handle=ta.handle,
+          flow_in=ta.flow,
+          shape_to_prepend=tensor_shape.TensorShape([4, 5]),
+          source="source")
+      ta_grad = tensor_array_ops.TensorArray(
+          dtypes.float32, handle=handle, flow=flow)
+      value = array_ops.placeholder(dtypes.float32)
+      ta_grad = ta_grad.write(0, value)
+      read_value = ta_grad.read(0)
+
+      # Make sure shape inference worked.
+      self.assertIsNone(read_value.shape.ndims)
+      # Write with some shape and check read value.
+      fed_value = np.random.random([4, 5, 7])
+      self.assertAllClose(fed_value,
+                          sess.run(read_value, feed_dict={value: fed_value}))
+
   @test_util.run_in_graph_and_eager_modes()
   def testMultiTensorArray(self):
     with self.test_session(use_gpu=True):
diff --git a/tensorflow/python/kernel_tests/variable_scope_test.py b/tensorflow/python/kernel_tests/variable_scope_test.py
index 9dc4ec0f9625ccf399807316c9c46309432bb2e7..2ee53df9317331dafd96f7884e9a8728cf443923 100644
--- a/tensorflow/python/kernel_tests/variable_scope_test.py
+++ b/tensorflow/python/kernel_tests/variable_scope_test.py
@@ -197,6 +197,32 @@ class VariableScopeTest(test.TestCase):
         self.assertAllEqual([v1, v2], [v3, v4])
       f()
 
+  @test_util.run_in_graph_and_eager_modes()
+  def testEagerVariablesStoreAddsToCollections(self):
+    store = variable_scope.EagerVariableStore()
+    with store.as_default():
+      trainable = variable_scope.get_variable("v1", [], trainable=True)
+      not_trainable = variable_scope.get_variable("v2", [], trainable=False)
+      concat = variable_scope.get_variable(
+          "v3", [], collections=[ops.GraphKeys.CONCATENATED_VARIABLES])
+      self.assertEqual(
+          ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES),
+          [trainable, not_trainable])
+      self.assertEqual(
+          ops.get_collection(ops.GraphKeys.TRAINABLE_VARIABLES),
+          [trainable, concat])
+      self.assertEqual(
+          ops.get_collection(ops.GraphKeys.CONCATENATED_VARIABLES), [concat])
+
+  @test_util.run_in_graph_and_eager_modes()
+  def testEagerVariablesOutsideStoreNotAddedToCollections(self):
+    if not context.executing_eagerly():
+      return
+    variable_scope.get_variable("v1", [], trainable=True)
+    variable_scope.get_variable("v2", [], trainable=False)
+    self.assertFalse(ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES))
+    self.assertFalse(ops.get_collection(ops.GraphKeys.TRAINABLE_VARIABLES))
+
   @test_util.run_in_graph_and_eager_modes()
   def testInitFromNonTensorValue(self):
     v = variable_scope.get_variable("v4", initializer=4, dtype=dtypes.int32)
diff --git a/tensorflow/python/kernel_tests/variables_test.py b/tensorflow/python/kernel_tests/variables_test.py
index 27599868b74be323189b872c2147c6a33f84d170..62d596da91682c396c04efbc64cf063c8e29e7cc 100644
--- a/tensorflow/python/kernel_tests/variables_test.py
+++ b/tensorflow/python/kernel_tests/variables_test.py
@@ -496,6 +496,23 @@ class VariablesTestCase(test.TestCase):
       with self.assertRaises(ValueError):
         sess.run(v.initialized_value())
 
+  def testTrainableInProto(self):
+    with ops.Graph().as_default():
+      non_trainable_variable = variables.Variable(
+          trainable=False,
+          initial_value=constant_op.constant(10.0))
+      self.assertEqual(
+          False,
+          variables.Variable(variable_def=non_trainable_variable.to_proto())
+          .trainable)
+      trainable_variable = variables.Variable(
+          trainable=True,
+          initial_value=constant_op.constant(10.0))
+      self.assertEqual(
+          True,
+          variables.Variable(variable_def=trainable_variable.to_proto())
+          .trainable)
+
   def testLoad(self):
     with self.test_session():
       var = variables.Variable(np.zeros((5, 5), np.float32))
diff --git a/tensorflow/python/lib/core/ndarray_tensor.cc b/tensorflow/python/lib/core/ndarray_tensor.cc
index 9df38d464ca6ad40f22b720902e1c6f127cf846d..ec1ba7b8f7d611ad659ac483505a7d86bf4b31e5 100644
--- a/tensorflow/python/lib/core/ndarray_tensor.cc
+++ b/tensorflow/python/lib/core/ndarray_tensor.cc
@@ -312,6 +312,40 @@ Status GetPyArrayDescrForTensor(const TF_Tensor* tensor,
 
   return Status::OK();
 }
+
+inline void FastMemcpy(void* dst, const void* src, size_t size) {
+  // clang-format off
+  switch (size) {
+    // Most compilers will generate inline code for fixed sizes,
+    // which is significantly faster for small copies.
+    case  1: memcpy(dst, src, 1); break;
+    case  2: memcpy(dst, src, 2); break;
+    case  3: memcpy(dst, src, 3); break;
+    case  4: memcpy(dst, src, 4); break;
+    case  5: memcpy(dst, src, 5); break;
+    case  6: memcpy(dst, src, 6); break;
+    case  7: memcpy(dst, src, 7); break;
+    case  8: memcpy(dst, src, 8); break;
+    case  9: memcpy(dst, src, 9); break;
+    case 10: memcpy(dst, src, 10); break;
+    case 11: memcpy(dst, src, 11); break;
+    case 12: memcpy(dst, src, 12); break;
+    case 13: memcpy(dst, src, 13); break;
+    case 14: memcpy(dst, src, 14); break;
+    case 15: memcpy(dst, src, 15); break;
+    case 16: memcpy(dst, src, 16); break;
+#if defined(PLATFORM_GOOGLE) || defined(PLATFORM_POSIX) && \
+    !defined(IS_MOBILE_PLATFORM)
+    // On Linux, memmove appears to be faster than memcpy for
+    // large sizes, strangely enough.
+    default: memmove(dst, src, size); break;
+#else
+    default: memcpy(dst, src, size); break;
+#endif
+  }
+  // clang-format on
+}
+
 }  // namespace
 
 // Converts the given TF_Tensor to a numpy ndarray.
@@ -362,8 +396,8 @@ Status TF_TensorToPyArray(Safe_TF_TensorPtr tensor, PyObject** out_ndarray) {
                             " bytes but TF_Tensor was ",
                             TF_TensorByteSize(tensor.get()), " bytes");
   } else {
-    memcpy(PyArray_DATA(py_array), TF_TensorData(tensor.get()),
-           PyArray_NBYTES(py_array));
+    FastMemcpy(PyArray_DATA(py_array), TF_TensorData(tensor.get()),
+               PyArray_NBYTES(py_array));
   }
 
   // PyArray_Return turns rank 0 arrays into numpy scalars
@@ -377,7 +411,7 @@ Status PyArrayToTF_Tensor(PyObject* ndarray, Safe_TF_TensorPtr* out_tensor) {
 
   // Make sure we dereference this array object in case of error, etc.
   Safe_PyObjectPtr array_safe(make_safe(
-      PyArray_FromAny(ndarray, nullptr, 0, 0, NPY_ARRAY_CARRAY, nullptr)));
+      PyArray_FromAny(ndarray, nullptr, 0, 0, NPY_ARRAY_CARRAY_RO, nullptr)));
   if (!array_safe) return errors::InvalidArgument("Not a ndarray.");
   PyArrayObject* array = reinterpret_cast<PyArrayObject*>(array_safe.get());
 
diff --git a/tensorflow/python/lib/core/numpy.h b/tensorflow/python/lib/core/numpy.h
index 25322b458b8475882830599dd4ae02f10d97094b..98354083c7e06103166a6fe535b153eaaf201c17 100644
--- a/tensorflow/python/lib/core/numpy.h
+++ b/tensorflow/python/lib/core/numpy.h
@@ -29,6 +29,8 @@ limitations under the License.
 #define NO_IMPORT_ARRAY
 #endif
 
+// Place `<locale>` before <Python.h> to avoid build failure in macOS.
+#include <locale>
 #include <Python.h>
 
 #include "numpy/arrayobject.h"
diff --git a/tensorflow/python/lib/core/py_exception_registry.cc b/tensorflow/python/lib/core/py_exception_registry.cc
index 6637de632b48e4dfc8219543161464b10dcdbe12..d03cf8930b9e2b12d92c72678b501f6a9e659768 100644
--- a/tensorflow/python/lib/core/py_exception_registry.cc
+++ b/tensorflow/python/lib/core/py_exception_registry.cc
@@ -13,10 +13,10 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#include "tensorflow/python/lib/core/py_exception_registry.h"
-
 #include <Python.h>
 
+#include "tensorflow/python/lib/core/py_exception_registry.h"
+
 namespace tensorflow {
 
 PyExceptionRegistry* PyExceptionRegistry::singleton_ = nullptr;
diff --git a/tensorflow/python/lib/core/py_func.cc b/tensorflow/python/lib/core/py_func.cc
index 8c6bb7955a4e29daddd92860e41d7105192eb24b..30c1a9c75986f242c6cf5a8aa2ed1b64938d2bda 100644
--- a/tensorflow/python/lib/core/py_func.cc
+++ b/tensorflow/python/lib/core/py_func.cc
@@ -17,6 +17,8 @@ limitations under the License.
 
 #include <array>
 
+#include <Python.h>
+
 #include "numpy/arrayobject.h"
 #include "tensorflow/c/eager/c_api.h"
 #include "tensorflow/c/eager/c_api_internal.h"
@@ -33,8 +35,6 @@ limitations under the License.
 #include "tensorflow/python/lib/core/py_util.h"
 #include "tensorflow/python/lib/core/safe_ptr.h"
 
-#include <Python.h>
-
 namespace tensorflow {
 namespace {
 
diff --git a/tensorflow/python/lib/core/py_seq_tensor.cc b/tensorflow/python/lib/core/py_seq_tensor.cc
index 32ea737a99067877e7f527e44d261a0b7c2eb07e..386be35ba2ff1fed07d6b6f5ee5d60a0f2039441 100644
--- a/tensorflow/python/lib/core/py_seq_tensor.cc
+++ b/tensorflow/python/lib/core/py_seq_tensor.cc
@@ -51,6 +51,10 @@ bool IsPyInt(PyObject* obj) {
 #endif
 }
 
+bool IsPyDouble(PyObject* obj) {
+  return PyIsInstance(obj, &PyDoubleArrType_Type);  // NumPy double type.
+}
+
 bool IsPyFloat(PyObject* obj) {
   return PyFloat_Check(obj) ||
          PyIsInstance(obj, &PyFloatingArrType_Type);  // NumPy float types
@@ -113,8 +117,10 @@ Status InferShapeAndType(PyObject* obj, TensorShape* shape, DataType* dtype) {
               "Attempted to convert an invalid sequence to a Tensor.");
         }
       }
-    } else if (IsPyFloat(obj)) {
+    } else if (IsPyDouble(obj)) {
       *dtype = DT_DOUBLE;
+    } else if (IsPyFloat(obj)) {
+      *dtype = DT_FLOAT;
     } else if (PyBool_Check(obj) || PyIsInstance(obj, &PyBoolArrType_Type)) {
       // Have to test for bool before int, since IsInt(True/False) == true.
       *dtype = DT_BOOL;
@@ -433,7 +439,7 @@ Status PySeqToTensor(PyObject* obj, PyObject* dtype, Tensor* ret) {
       break;
   }
   switch (infer_dtype) {
-    case DT_DOUBLE:
+    case DT_FLOAT:
       // TODO(josh11b): Handle mixed floats and complex numbers?
       if (requested_dtype == DT_INVALID) {
         // TensorFlow uses float32s to represent floating point numbers
@@ -446,7 +452,8 @@ Status PySeqToTensor(PyObject* obj, PyObject* dtype, Tensor* ret) {
         // final type.
         RETURN_STRING_AS_STATUS(ConvertDouble(obj, shape, ret));
       }
-
+    case DT_DOUBLE:
+      RETURN_STRING_AS_STATUS(ConvertDouble(obj, shape, ret));
     case DT_INT64:
       if (requested_dtype == DT_INVALID) {
         const char* error = ConvertInt32(obj, shape, ret);
diff --git a/tensorflow/python/lib/core/py_util.cc b/tensorflow/python/lib/core/py_util.cc
index 00cbf0c532cf80d3bb27afe168ecde963ba3591d..572693b1cfafa04a7716e09464885faa4c92e299 100644
--- a/tensorflow/python/lib/core/py_util.cc
+++ b/tensorflow/python/lib/core/py_util.cc
@@ -15,9 +15,12 @@ limitations under the License.
 
 #include "tensorflow/python/lib/core/py_util.h"
 
+// Place `<locale>` before <Python.h> to avoid build failure in macOS.
+#include <locale>
+#include <Python.h>
+
 #include "tensorflow/core/lib/core/errors.h"
 #include "tensorflow/core/lib/strings/strcat.h"
-#include <Python.h>
 
 namespace tensorflow {
 namespace {
diff --git a/tensorflow/python/lib/core/safe_ptr.h b/tensorflow/python/lib/core/safe_ptr.h
index 32d286888666bde8742403bb8e231b3d6d4bf695..35d71f7629e54027eb3a8bdd7f4275e325ac7f11 100644
--- a/tensorflow/python/lib/core/safe_ptr.h
+++ b/tensorflow/python/lib/core/safe_ptr.h
@@ -19,6 +19,7 @@ limitations under the License.
 #include <memory>
 
 #include <Python.h>
+
 #include "tensorflow/c/c_api.h"
 #include "tensorflow/c/eager/c_api.h"
 
diff --git a/tensorflow/python/lib/io/file_io.py b/tensorflow/python/lib/io/file_io.py
index 59f5075f177ef5335115cb4f24182d28a9b547c8..f22fb253e4d59813226f0e9741cabcfbf0cdcd1a 100644
--- a/tensorflow/python/lib/io/file_io.py
+++ b/tensorflow/python/lib/io/file_io.py
@@ -21,6 +21,7 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import binascii
 import os
 import uuid
 
@@ -33,6 +34,10 @@ from tensorflow.python.util import compat
 from tensorflow.python.util import deprecation
 from tensorflow.python.util.tf_export import tf_export
 
+# A good default block size depends on the system in question.
+# A somewhat conservative default chosen here.
+_DEFAULT_BLOCK_SIZE = 16 * 1024 * 1024
+
 
 class FileIO(object):
   """FileIO class that exposes methods to read / write to / from files.
@@ -551,3 +556,56 @@ def stat(filename):
   with errors.raise_exception_on_not_ok_status() as status:
     pywrap_tensorflow.Stat(compat.as_bytes(filename), file_statistics, status)
     return file_statistics
+
+
+def filecmp(filename_a, filename_b):
+  """Compare two files, returning True if they are the same, False otherwise.
+
+  We check size first and return False quickly if the files are different sizes.
+  If they are the same size, we continue to generating a crc for the whole file.
+
+  You might wonder: why not use Python's filecmp.cmp() instead? The answer is
+  that the builtin library is not robust to the many different filesystems
+  TensorFlow runs on, and so we here perform a similar comparison with
+  the more robust FileIO.
+
+  Args:
+    filename_a: string path to the first file.
+    filename_b: string path to the second file.
+
+  Returns:
+    True if the files are the same, False otherwise.
+  """
+  size_a = FileIO(filename_a, "rb").size()
+  size_b = FileIO(filename_b, "rb").size()
+  if size_a != size_b:
+    return False
+
+  # Size is the same. Do a full check.
+  crc_a = file_crc32(filename_a)
+  crc_b = file_crc32(filename_b)
+  return crc_a == crc_b
+
+
+def file_crc32(filename, block_size=_DEFAULT_BLOCK_SIZE):
+  """Get the crc32 of the passed file.
+
+  The crc32 of a file can be used for error checking; two files with the same
+  crc32 are considered equivalent. Note that the entire file must be read
+  to produce the crc32.
+
+  Args:
+    filename: string, path to a file
+    block_size: Integer, process the files by reading blocks of `block_size`
+      bytes. Use -1 to read the file as once.
+
+  Returns:
+    hexadecimal as string, the crc32 of the passed file.
+  """
+  crc = 0
+  with FileIO(filename, mode="rb") as f:
+    chunk = f.read(n=block_size)
+    while chunk:
+      crc = binascii.crc32(chunk, crc)
+      chunk = f.read(n=block_size)
+  return hex(crc & 0xFFFFFFFF)
diff --git a/tensorflow/python/lib/io/file_io_test.py b/tensorflow/python/lib/io/file_io_test.py
index 223858edfa84eaa1c7879a9774dcc836de4f4672..c21eb931037f1728149456d62b1534f59527cfdb 100644
--- a/tensorflow/python/lib/io/file_io_test.py
+++ b/tensorflow/python/lib/io/file_io_test.py
@@ -491,5 +491,96 @@ class FileIoTest(test.TestCase):
     v = file_io.file_exists(file_path)
     self.assertEqual(v, True)
 
+  def testFilecmp(self):
+    file1 = os.path.join(self._base_dir, "file1")
+    file_io.write_string_to_file(file1, "This is a sentence\n" * 100)
+
+    file2 = os.path.join(self._base_dir, "file2")
+    file_io.write_string_to_file(file2, "This is another sentence\n" * 100)
+
+    file3 = os.path.join(self._base_dir, "file3")
+    file_io.write_string_to_file(file3, u"This is another sentence\n" * 100)
+
+    self.assertFalse(file_io.filecmp(file1, file2))
+    self.assertTrue(file_io.filecmp(file2, file3))
+
+  def testFilecmpSameSize(self):
+    file1 = os.path.join(self._base_dir, "file1")
+    file_io.write_string_to_file(file1, "This is a sentence\n" * 100)
+
+    file2 = os.path.join(self._base_dir, "file2")
+    file_io.write_string_to_file(file2, "This is b sentence\n" * 100)
+
+    file3 = os.path.join(self._base_dir, "file3")
+    file_io.write_string_to_file(file3, u"This is b sentence\n" * 100)
+
+    self.assertFalse(file_io.filecmp(file1, file2))
+    self.assertTrue(file_io.filecmp(file2, file3))
+
+  def testFilecmpBinary(self):
+    file1 = os.path.join(self._base_dir, "file1")
+    file_io.FileIO(file1, "wb").write("testing\n\na")
+
+    file2 = os.path.join(self._base_dir, "file2")
+    file_io.FileIO(file2, "wb").write("testing\n\nb")
+
+    file3 = os.path.join(self._base_dir, "file3")
+    file_io.FileIO(file3, "wb").write("testing\n\nb")
+
+    file4 = os.path.join(self._base_dir, "file4")
+    file_io.FileIO(file4, "wb").write("testing\n\ntesting")
+
+    self.assertFalse(file_io.filecmp(file1, file2))
+    self.assertFalse(file_io.filecmp(file1, file4))
+    self.assertTrue(file_io.filecmp(file2, file3))
+
+  def testFileCrc32(self):
+    file1 = os.path.join(self._base_dir, "file1")
+    file_io.write_string_to_file(file1, "This is a sentence\n" * 100)
+    crc1 = file_io.file_crc32(file1)
+
+    file2 = os.path.join(self._base_dir, "file2")
+    file_io.write_string_to_file(file2, "This is another sentence\n" * 100)
+    crc2 = file_io.file_crc32(file2)
+
+    file3 = os.path.join(self._base_dir, "file3")
+    file_io.write_string_to_file(file3, "This is another sentence\n" * 100)
+    crc3 = file_io.file_crc32(file3)
+
+    self.assertTrue(crc1 != crc2)
+    self.assertEqual(crc2, crc3)
+
+  def testFileCrc32WithBytes(self):
+    file1 = os.path.join(self._base_dir, "file1")
+    file_io.write_string_to_file(file1, "This is a sentence\n" * 100)
+    crc1 = file_io.file_crc32(file1, block_size=24)
+
+    file2 = os.path.join(self._base_dir, "file2")
+    file_io.write_string_to_file(file2, "This is another sentence\n" * 100)
+    crc2 = file_io.file_crc32(file2, block_size=24)
+
+    file3 = os.path.join(self._base_dir, "file3")
+    file_io.write_string_to_file(file3, "This is another sentence\n" * 100)
+    crc3 = file_io.file_crc32(file3, block_size=-1)
+
+    self.assertTrue(crc1 != crc2)
+    self.assertEqual(crc2, crc3)
+
+  def testFileCrc32Binary(self):
+    file1 = os.path.join(self._base_dir, "file1")
+    file_io.FileIO(file1, "wb").write("testing\n\n")
+    crc1 = file_io.file_crc32(file1)
+
+    file2 = os.path.join(self._base_dir, "file2")
+    file_io.FileIO(file2, "wb").write("testing\n\n\n")
+    crc2 = file_io.file_crc32(file2)
+
+    file3 = os.path.join(self._base_dir, "file3")
+    file_io.FileIO(file3, "wb").write("testing\n\n\n")
+    crc3 = file_io.file_crc32(file3)
+
+    self.assertTrue(crc1 != crc2)
+    self.assertEqual(crc2, crc3)
+
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/python/ops/array_ops.py b/tensorflow/python/ops/array_ops.py
index fb81798602cb900da2fca83f4547a3776f63eddb..fae63b1132cca527c6bc5d5f9f5c8be2952d8f3c 100644
--- a/tensorflow/python/ops/array_ops.py
+++ b/tensorflow/python/ops/array_ops.py
@@ -1623,7 +1623,7 @@ def ones_like(tensor, dtype=None, name=None, optimize=True):
   Args:
     tensor: A `Tensor`.
     dtype: A type for the returned `Tensor`. Must be `float32`, `float64`,
-      `int8`, `uint8`, `int16`, `uint16`, int32`, `int64`,
+      `int8`, `uint8`, `int16`, `uint16`, `int32`, `int64`,
       `complex64`, `complex128` or `bool`.
     name: A name for the operation (optional).
     optimize: if true, attempt to statically determine the shape of 'tensor'
diff --git a/tensorflow/python/ops/check_ops.py b/tensorflow/python/ops/check_ops.py
index cabc1e724cdb667f4d0c5059ff1d78854a45b30c..375a5ec2c30069c955152e590b3edea0319de73a 100644
--- a/tensorflow/python/ops/check_ops.py
+++ b/tensorflow/python/ops/check_ops.py
@@ -341,8 +341,8 @@ def assert_equal(x, y, data=None, summarize=None, message=None, name=None):
                           y_sum, y_np[:y_sum]))
 
         index_and_values_str = ''
-        if x.shape == y.shape:
-          # If the shapes of x and y are the same,
+        if x.shape == y.shape and x.shape.as_list():
+          # If the shapes of x and y are the same (and not scalars),
           # Get the values that actually differed and their indices.
           # If shapes are different this information is more confusing
           # than useful.
diff --git a/tensorflow/python/ops/collective_ops.py b/tensorflow/python/ops/collective_ops.py
new file mode 100644
index 0000000000000000000000000000000000000000..a05fd15eca12a423bf02dfb13044dd1f7630b99c
--- /dev/null
+++ b/tensorflow/python/ops/collective_ops.py
@@ -0,0 +1,133 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""TensorFlow collective Ops."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.python.framework import device
+from tensorflow.python.ops import gen_collective_ops
+
+
+def all_reduce(t, group_size, group_key, instance_key, merge_op, final_op,
+               subdiv_offsets=(0)):
+  """Reduces tensors collectively, across devices.
+
+  Args:
+    t: the tensor to be reduced.
+    group_size: the total number of tensors to be collectively reduced.
+      Each must reside on a different device.
+    group_key: an integer identifying the group of devices.
+    instance_key: an integer identifying the participating group of Ops.
+    merge_op: string naming the binary Op to be applied to compute each
+      partial reduction.
+    final_op: string naming the unary Op to be applied to each fully
+      reduced value.  Can be 'Id' for no operation.
+    subdiv_offsets: a list of integer offsets into the tensor at which each
+      independent subdivision should begin.  Use [0] if no subdivision should
+      be done.
+
+  Returns:
+    An Op implementing the distributed reduction.
+
+  Raises:
+    ValueError: if any of the input parameter constraints are not met.
+  """
+  if not device.canonical_name(t.device):
+    raise ValueError('Device assignment required for collective ops')
+  if group_size <= 1:
+    raise ValueError('Parameter group_size to add_reduce must be at least 2.')
+  return gen_collective_ops.collective_reduce(t,
+                                              group_size=group_size,
+                                              group_key=group_key,
+                                              instance_key=instance_key,
+                                              merge_op=merge_op,
+                                              final_op=final_op,
+                                              subdiv_offsets=subdiv_offsets)
+
+
+def broadcast_send(t, shape, dtype, group_size, group_key, instance_key):
+  """Broadcasts one tensor to a group of others, across devices.
+
+  Args:
+    t: the tensor to be sent.
+    shape: the shape of the tensor being sent, which must agree with t.
+    dtype: the type of the tensor being sent, which must agree with t.
+    group_size: one plus the number of receiving tensors, i.e. the total
+      number of devices participating.  Each tensor must reside on a
+      different device.
+    group_key: an integer identifying the group of devices.
+    instance_key: an integer identifying the participating group of Ops.
+
+  Returns:
+    An Op implementing the distributed broadcast send.
+
+  Raises:
+    ValueError: if any of the input parameter constraints are not met.
+
+  Note that the shape and dtype arguments appear redundant since they
+  should be obtainable from t.  The are two reasons for including
+  them.  First, the shape and type of tensors passed via broadcast must
+  be known ahead of time in their most specific form so that the receive
+  side can allocate memory for the operation and shape/type inference can
+  carry forward from there.  Including the same declarations on the
+  send side clarifies a commitment already made.  Secondly, having nearly
+  identical use syntax for send and receive sides may simplify tool-driven
+  generation of broadcast.
+  """
+  if not device.canonical_name(t.device):
+    raise ValueError('Device assignment required for collective ops')
+  if group_size <= 1:
+    raise ValueError(
+        'Parameter group_size to broadcast_send must be at least 2.')
+  if t.shape != shape:
+    raise ValueError(
+        'Shape of broadcast_send tensor not equal to delcared shape')
+  if t.dtype != dtype:
+    raise ValueError(
+        'Type of broadcast_send tensor not equal to declared type')
+  return gen_collective_ops.collective_bcast_send(t,
+                                                  shape=shape,
+                                                  group_size=group_size,
+                                                  group_key=group_key,
+                                                  instance_key=instance_key)
+
+
+def broadcast_recv(shape, dtype, group_size, group_key, instance_key):
+  """Receives a broadcasts tensor, across devices.
+
+  Args:
+    shape: Shape of the tensor to be received.
+    dtype: Type of the tensor to be received.
+    group_size: one plus the number of receiving tensors, i.e. the total
+      number of devices participating.  Each tensor must reside on a
+      different device.
+    group_key: an integer identifying the group of devices.
+    instance_key: an integer identifying the participating group of Ops.
+
+  Returns:
+    An Op implementing the broadcast receive.
+
+  Raises:
+    ValueError: if any of the input parameter constraints are not met.
+  """
+  if group_size <= 1:
+    raise ValueError(
+        'Parameter group_size to broadcast_send must be at least 2.')
+  return gen_collective_ops.collective_bcast_recv(shape=shape,
+                                                  T=dtype,
+                                                  group_size=group_size,
+                                                  group_key=group_key,
+                                                  instance_key=instance_key)
diff --git a/tensorflow/python/ops/collective_ops_test.py b/tensorflow/python/ops/collective_ops_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..8e16cffdf4917ba361a3c313047e39af514273bc
--- /dev/null
+++ b/tensorflow/python/ops/collective_ops_test.py
@@ -0,0 +1,80 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for Collective Operations."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.core.protobuf import config_pb2
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import collective_ops
+from tensorflow.python.platform import test
+
+# TODO(tucker): Make these ops work in eager mode. b/79776476
+
+
+class CollectiveOpTest(test.TestCase):
+
+  def _testCollectiveReduce(self, t0, t1, expected):
+    group_key = 1
+    instance_key = 1
+    with self.test_session(
+        config=config_pb2.ConfigProto(device_count={'CPU': 2})) as sess:
+      with ops.device('/CPU:0'):
+        in0 = constant_op.constant(t0)
+        colred0 = collective_ops.all_reduce(in0, 2, group_key, instance_key,
+                                            'Add', 'Div', [0])
+      with ops.device('/CPU:1'):
+        in1 = constant_op.constant(t1)
+        colred1 = collective_ops.all_reduce(in1, 2, group_key, instance_key,
+                                            'Add', 'Div', [0])
+      run_options = config_pb2.RunOptions()
+      run_options.experimental.collective_graph_key = 1
+      results = sess.run([colred0, colred1], options=run_options)
+    self.assertAllClose(results[0], expected, rtol=1e-5, atol=1e-5)
+    self.assertAllClose(results[1], expected, rtol=1e-5, atol=1e-5)
+
+  def testCollectiveReduce(self):
+    self._testCollectiveReduce([0.1, 1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.1],
+                               [0.3, 1.3, 2.3, 3.3, 4.3, 5.3, 6.3, 7.3],
+                               [0.2, 1.2, 2.2, 3.2, 4.2, 5.2, 6.2, 7.2])
+
+  def _testCollectiveBroadcast(self, t0):
+    group_key = 1
+    instance_key = 1
+    with self.test_session(
+        config=config_pb2.ConfigProto(device_count={'CPU': 2})) as sess:
+      with ops.device('/CPU:0'):
+        in0 = constant_op.constant(t0)
+        out0 = collective_ops.broadcast_send(in0, in0.shape, in0.dtype,
+                                             2, group_key, instance_key)
+      with ops.device('/CPU:1'):
+        c1 = constant_op.constant(t0)
+        out1 = collective_ops.broadcast_recv(c1.shape, c1.dtype,
+                                             2, group_key, instance_key)
+      run_options = config_pb2.RunOptions()
+      run_options.experimental.collective_graph_key = 1
+      results = sess.run([out0, out1], options=run_options)
+    self.assertAllClose(results[0], t0, rtol=1e-5, atol=1e-5)
+    self.assertAllClose(results[1], t0, rtol=1e-5, atol=1e-5)
+
+  def testCollectiveBroadcast(self):
+    self._testCollectiveBroadcast([0.1, 1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.1])
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/python/ops/control_flow_ops.py b/tensorflow/python/ops/control_flow_ops.py
index ee024ce64a79de3aa326ce710b3f9daba25fb260..2e5a801f8e96aa1266695a1440d98e6bff53607c 100644
--- a/tensorflow/python/ops/control_flow_ops.py
+++ b/tensorflow/python/ops/control_flow_ops.py
@@ -2729,7 +2729,8 @@ class WhileContext(ControlFlowContext):
           self.outer_context.Exit()
       else:
         shape_acc = array_ops.zeros_like(
-            array_ops.shape_internal(op.inputs[0], optimize=False),
+            array_ops.shape_internal(op.inputs[0], optimize=False,
+                                     out_type=dense_shape.dtype),
             optimize=False)
 
     if self.outer_context:
diff --git a/tensorflow/python/ops/custom_gradient.py b/tensorflow/python/ops/custom_gradient.py
index d934f27cb96f4a65e2adf860e0c5e08b7bd0b7d4..ca24f11054039472baaefd301e45f57c9444f60d 100644
--- a/tensorflow/python/ops/custom_gradient.py
+++ b/tensorflow/python/ops/custom_gradient.py
@@ -89,7 +89,7 @@ def custom_gradient(f):
          operations in `f` to `x`.
        - `grad_fn` is a function with the signature `g(*grad_ys)` which returns
          a list of `Tensor`s - the derivatives of `Tensor`s in `y` with respect
-         to the `Tensor`s in `x.  `grad_ys` is a `Tensor` or sequence of
+         to the `Tensor`s in `x`.  `grad_ys` is a `Tensor` or sequence of
          `Tensor`s the same size as `y` holding the initial value gradients for
          each `Tensor` in `y`. If `f` uses `Variable`s (that are not part of the
          inputs), i.e. through `get_variable`, then `grad_fn` should have
diff --git a/tensorflow/python/ops/data_flow_ops.py b/tensorflow/python/ops/data_flow_ops.py
index 62c5adc385a2e87d27298c72f8dd2f67303119df..abf597ca55c647cca3f6012ed602a815298e1ed3 100644
--- a/tensorflow/python/ops/data_flow_ops.py
+++ b/tensorflow/python/ops/data_flow_ops.py
@@ -35,6 +35,7 @@ from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import gen_data_flow_ops
 from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import resource_variable_ops
 # go/tf-wildcard-import
 # pylint: disable=wildcard-import
 from tensorflow.python.ops.gen_data_flow_ops import *
@@ -129,11 +130,6 @@ class QueueBase(object):
   @{tf.RandomShuffleQueue} for concrete
   implementations of this class, and instructions on how to create
   them.
-
-  @compatibility(eager)
-  Queues are not compatible with eager execution. Instead, please
-  use `tf.data` to get data into your model.
-  @end_compatibility
   """
 
   def __init__(self, dtypes, shapes, names, queue_ref):
@@ -157,12 +153,7 @@ class QueueBase(object):
 
     Raises:
       ValueError: If one of the arguments is invalid.
-      RuntimeError: If eager execution is enabled.
     """
-    if context.executing_eagerly():
-      raise RuntimeError(
-          "Queues are not supported when eager execution is enabled. "
-          "Instead, please use tf.data to get data into your model.")
     self._dtypes = dtypes
     if shapes is not None:
       if len(shapes) != len(dtypes):
@@ -179,6 +170,8 @@ class QueueBase(object):
     self._queue_ref = queue_ref
     if context.executing_eagerly():
       self._name = context.context().scope_name
+      self._resource_deleter = resource_variable_ops.EagerResourceDeleter(
+          queue_ref, None)
     else:
       self._name = self._queue_ref.op.name.split("/")[-1]
 
@@ -605,6 +598,11 @@ class QueueBase(object):
     else:
       return gen_data_flow_ops.queue_size(self._queue_ref, name=name)
 
+def _shared_name(shared_name):
+  if context.executing_eagerly():
+    return str(ops.uid())
+  return shared_name
+
 
 @tf_export("RandomShuffleQueue")
 class RandomShuffleQueue(QueueBase):
@@ -612,11 +610,6 @@ class RandomShuffleQueue(QueueBase):
 
   See @{tf.QueueBase} for a description of the methods on
   this class.
-
-  @compatibility(eager)
-  Queues are not compatible with eager execution. Instead, please
-  use `tf.data` to get data into your model.
-  @end_compatibility
   """
 
   def __init__(self,
@@ -690,7 +683,7 @@ class RandomShuffleQueue(QueueBase):
         min_after_dequeue=min_after_dequeue,
         seed=seed1,
         seed2=seed2,
-        shared_name=shared_name,
+        shared_name=_shared_name(shared_name),
         name=name)
 
     super(RandomShuffleQueue, self).__init__(dtypes, shapes, names, queue_ref)
@@ -702,11 +695,6 @@ class FIFOQueue(QueueBase):
 
   See @{tf.QueueBase} for a description of the methods on
   this class.
-
-  @compatibility(eager)
-  Queues are not compatible with eager execution. Instead, please
-  use `tf.data` to get data into your model.
-  @end_compatibility
   """
 
   def __init__(self,
@@ -752,7 +740,7 @@ class FIFOQueue(QueueBase):
         component_types=dtypes,
         shapes=shapes,
         capacity=capacity,
-        shared_name=shared_name,
+        shared_name=_shared_name(shared_name),
         name=name)
 
     super(FIFOQueue, self).__init__(dtypes, shapes, names, queue_ref)
@@ -767,11 +755,6 @@ class PaddingFIFOQueue(QueueBase):
 
   See @{tf.QueueBase} for a description of the methods on
   this class.
-
-  @compatibility(eager)
-  Queues are not compatible with eager execution. Instead, please
-  use `tf.data` to get data into your model.
-  @end_compatibility
   """
 
   def __init__(self,
@@ -831,7 +814,7 @@ class PaddingFIFOQueue(QueueBase):
         component_types=dtypes,
         shapes=shapes,
         capacity=capacity,
-        shared_name=shared_name,
+        shared_name=_shared_name(shared_name),
         name=name)
 
     super(PaddingFIFOQueue, self).__init__(dtypes, shapes, names, queue_ref)
@@ -843,11 +826,6 @@ class PriorityQueue(QueueBase):
 
   See @{tf.QueueBase} for a description of the methods on
   this class.
-
-  @compatibility(eager)
-  Queues are not compatible with eager execution. Instead, please
-  use `tf.data` to get data into your model.
-  @end_compatibility
   """
 
   def __init__(self,
@@ -899,7 +877,7 @@ class PriorityQueue(QueueBase):
         component_types=types,
         shapes=shapes,
         capacity=capacity,
-        shared_name=shared_name,
+        shared_name=_shared_name(shared_name),
         name=name)
 
     priority_dtypes = [_dtypes.int64] + types
diff --git a/tensorflow/python/ops/distributions/bernoulli.py b/tensorflow/python/ops/distributions/bernoulli.py
index d7fb3f1f783cceef280e07b6110098a80011b19f..84d9d40a35095643dc29946f8827cfd486a9fd9a 100644
--- a/tensorflow/python/ops/distributions/bernoulli.py
+++ b/tensorflow/python/ops/distributions/bernoulli.py
@@ -71,7 +71,7 @@ class Bernoulli(distribution.Distribution):
     Raises:
       ValueError: If p and logits are passed, or if neither are passed.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     with ops.name_scope(name) as name:
       self._logits, self._probs = distribution_util.get_logits_and_probs(
           logits=logits,
diff --git a/tensorflow/python/ops/distributions/beta.py b/tensorflow/python/ops/distributions/beta.py
index b6978486004affbf97ee0e5da15fee7ab092eb32..f28f76b6c42a861c51c1fc06f99fa73b71b625a9 100644
--- a/tensorflow/python/ops/distributions/beta.py
+++ b/tensorflow/python/ops/distributions/beta.py
@@ -150,7 +150,7 @@ class Beta(distribution.Distribution):
         more of the statistic's batch members are undefined.
       name: Python `str` name prefixed to Ops created by this class.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     with ops.name_scope(name, values=[concentration1, concentration0]) as name:
       self._concentration1 = self._maybe_assert_valid_concentration(
           ops.convert_to_tensor(concentration1, name="concentration1"),
@@ -321,7 +321,7 @@ class BetaWithSoftplusConcentration(Beta):
                validate_args=False,
                allow_nan_stats=True,
                name="BetaWithSoftplusConcentration"):
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     with ops.name_scope(name, values=[concentration1,
                                       concentration0]) as name:
       super(BetaWithSoftplusConcentration, self).__init__(
diff --git a/tensorflow/python/ops/distributions/bijector_impl.py b/tensorflow/python/ops/distributions/bijector_impl.py
index caceadf53a0a0816379b0d75808ec756b558e861..b65e64d401b800884bdee7ad883bd7cc41dcfd20 100644
--- a/tensorflow/python/ops/distributions/bijector_impl.py
+++ b/tensorflow/python/ops/distributions/bijector_impl.py
@@ -160,13 +160,20 @@ class Bijector(object):
 
   3. `log_det_jacobian(x)`
 
-     "The log of the determinant of the matrix of all first-order partial
-     derivatives of the inverse function."
+     "The log of the absolute value of the determinant of the matrix of all
+     first-order partial derivatives of the inverse function."
 
      Useful for inverting a transformation to compute one probability in terms
      of another. Geometrically, the Jacobian determinant is the volume of the
      transformation and is used to scale the probability.
 
+     We take the absolute value of the determinant before log to avoid NaN
+     values.  Geometrically, a negative determinant corresponds to an
+     orientation-reversing transformation.  It is ok for us to discard the sign
+     of the determinant because we only integrate everywhere-nonnegative
+     functions (probability densities) and the correct orientation is always the
+     one that produces a nonnegative integrand.
+
   By convention, transformations of random variables are named in terms of the
   forward transformation. The forward transformation creates samples, the
   inverse is useful for computing probabilities.
@@ -1021,7 +1028,7 @@ class Bijector(object):
         axis=self._get_event_reduce_dims(min_event_ndims, event_ndims))
     # The multiplication by ones can change the inferred static shape so we try
     # to recover as much as possible.
-    event_ndims_ = self._maybe_get_event_ndims_statically(event_ndims)
+    event_ndims_ = self._maybe_get_static_event_ndims(event_ndims)
     if (event_ndims_ is not None and
         y.shape.ndims is not None and
         ildj.shape.ndims is not None):
@@ -1036,7 +1043,7 @@ class Bijector(object):
 
   def _get_event_reduce_dims(self, min_event_ndims, event_ndims):
     """Compute the reduction dimensions given event_ndims."""
-    event_ndims_ = self._maybe_get_event_ndims_statically(event_ndims)
+    event_ndims_ = self._maybe_get_static_event_ndims(event_ndims)
 
     if event_ndims_ is not None:
       return [-index for index in range(1, event_ndims_ - min_event_ndims + 1)]
@@ -1046,9 +1053,18 @@ class Bijector(object):
 
   def _check_valid_event_ndims(self, min_event_ndims, event_ndims):
     """Check whether event_ndims is atleast min_event_ndims."""
-    event_ndims_ = self._maybe_get_event_ndims_statically(event_ndims)
+    event_ndims = ops.convert_to_tensor(event_ndims, name="event_ndims")
+    event_ndims_ = tensor_util.constant_value(event_ndims)
     assertions = []
+
+    if not event_ndims.dtype.is_integer:
+      raise ValueError("Expected integer dtype, got dtype {}".format(
+          event_ndims.dtype))
+
     if event_ndims_ is not None:
+      if event_ndims.shape.ndims != 0:
+        raise ValueError("Expected scalar event_ndims, got shape {}".format(
+            event_ndims.shape))
       if min_event_ndims > event_ndims_:
         raise ValueError("event_ndims ({}) must be larger than "
                          "min_event_ndims ({})".format(
@@ -1056,17 +1072,29 @@ class Bijector(object):
     elif self.validate_args:
       assertions += [
           check_ops.assert_greater_equal(event_ndims, min_event_ndims)]
+
+    if event_ndims.shape.is_fully_defined():
+      if event_ndims.shape.ndims != 0:
+        raise ValueError("Expected scalar shape, got ndims {}".format(
+            event_ndims.shape.ndims))
+
+    elif self.validate_args:
+      assertions += [
+          check_ops.assert_rank(event_ndims, 0, message="Expected scalar.")]
     return assertions
 
-  def _maybe_get_event_ndims_statically(self, event_ndims):
+  def _maybe_get_static_event_ndims(self, event_ndims):
     """Helper which returns tries to return an integer static value."""
     event_ndims_ = distribution_util.maybe_get_static_value(event_ndims)
 
-    if isinstance(event_ndims_, np.ndarray):
-      if (event_ndims_.dtype not in (np.int32, np.int64) or
-          len(event_ndims_.shape)):
+    if isinstance(event_ndims_, (np.generic, np.ndarray)):
+      if event_ndims_.dtype not in (np.int32, np.int64):
+        raise ValueError("Expected integer dtype, got dtype {}".format(
+            event_ndims_.dtype))
+
+      if isinstance(event_ndims_, np.ndarray) and len(event_ndims_.shape):
         raise ValueError("Expected a scalar integer, got {}".format(
             event_ndims_))
-      event_ndims_ = event_ndims_.tolist()
+      event_ndims_ = int(event_ndims_)
 
     return event_ndims_
diff --git a/tensorflow/python/ops/distributions/categorical.py b/tensorflow/python/ops/distributions/categorical.py
index bbdc8c455af66c7c6ad1866302e84f50cf221f9b..b88a0518b6db15021b9917d4c2b5ffb7bcf9484f 100644
--- a/tensorflow/python/ops/distributions/categorical.py
+++ b/tensorflow/python/ops/distributions/categorical.py
@@ -182,7 +182,7 @@ class Categorical(distribution.Distribution):
         more of the statistic's batch members are undefined.
       name: Python `str` name prefixed to Ops created by this class.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     with ops.name_scope(name, values=[logits, probs]) as name:
       self._logits, self._probs = distribution_util.get_logits_and_probs(
           logits=logits,
diff --git a/tensorflow/python/ops/distributions/dirichlet.py b/tensorflow/python/ops/distributions/dirichlet.py
index 8d0d1d860bf4a7efadd8b6bf101708974b87e5d1..72567e62f78665947c001282c9c4f4929e9ea0ef 100644
--- a/tensorflow/python/ops/distributions/dirichlet.py
+++ b/tensorflow/python/ops/distributions/dirichlet.py
@@ -28,6 +28,7 @@ from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import random_ops
 from tensorflow.python.ops import special_math_ops
 from tensorflow.python.ops.distributions import distribution
+from tensorflow.python.ops.distributions import kullback_leibler
 from tensorflow.python.ops.distributions import util as distribution_util
 from tensorflow.python.util.tf_export import tf_export
 
@@ -154,7 +155,7 @@ class Dirichlet(distribution.Distribution):
         more of the statistic's batch members are undefined.
       name: Python `str` name prefixed to Ops created by this class.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     with ops.name_scope(name, values=[concentration]) as name:
       self._concentration = self._maybe_assert_valid_concentration(
           ops.convert_to_tensor(concentration, name="concentration"),
@@ -297,3 +298,80 @@ class Dirichlet(distribution.Distribution):
             math_ops.reduce_sum(x, -1),
             message="sample last-dimension must sum to `1`"),
     ], x)
+
+
+@kullback_leibler.RegisterKL(Dirichlet, Dirichlet)
+def _kl_dirichlet_dirichlet(d1, d2, name=None):
+  """Batchwise KL divergence KL(d1 || d2) with d1 and d2 Dirichlet.
+
+  Args:
+    d1: instance of a Dirichlet distribution object.
+    d2: instance of a Dirichlet distribution object.
+    name: (optional) Name to use for created operations.
+      default is "kl_dirichlet_dirichlet".
+
+  Returns:
+    Batchwise KL(d1 || d2)
+  """
+  with ops.name_scope(name, "kl_dirichlet_dirichlet", values=[
+      d1.concentration, d2.concentration]):
+    # The KL between Dirichlet distributions can be derived as follows. We have
+    #
+    #   Dir(x; a) = 1 / B(a) * prod_i[x[i]^(a[i] - 1)]
+    #
+    # where B(a) is the multivariate Beta function:
+    #
+    #   B(a) = Gamma(a[1]) * ... * Gamma(a[n]) / Gamma(a[1] + ... + a[n])
+    #
+    # The KL is
+    #
+    #   KL(Dir(x; a), Dir(x; b)) = E_Dir(x; a){log(Dir(x; a) / Dir(x; b))}
+    #
+    # so we'll need to know the log density of the Dirichlet. This is
+    #
+    #   log(Dir(x; a)) = sum_i[(a[i] - 1) log(x[i])] - log B(a)
+    #
+    # The only term that matters for the expectations is the log(x[i]). To
+    # compute the expectation of this term over the Dirichlet density, we can
+    # use the following facts about the Dirichlet in exponential family form:
+    #   1. log(x[i]) is a sufficient statistic
+    #   2. expected sufficient statistics (of any exp family distribution) are
+    #      equal to derivatives of the log normalizer with respect to
+    #      corresponding natural parameters: E{T[i](x)} = dA/d(eta[i])
+    #
+    # To proceed, we can rewrite the Dirichlet density in exponential family
+    # form as follows:
+    #
+    #   Dir(x; a) = exp{eta(a) . T(x) - A(a)}
+    #
+    # where '.' is the dot product of vectors eta and T, and A is a scalar:
+    #
+    #   eta[i](a) = a[i] - 1
+    #     T[i](x) = log(x[i])
+    #        A(a) = log B(a)
+    #
+    # Now, we can use fact (2) above to write
+    #
+    #   E_Dir(x; a)[log(x[i])]
+    #       = dA(a) / da[i]
+    #       = d/da[i] log B(a)
+    #       = d/da[i] (sum_j lgamma(a[j])) - lgamma(sum_j a[j])
+    #       = digamma(a[i])) - digamma(sum_j a[j])
+    #
+    # Putting it all together, we have
+    #
+    # KL[Dir(x; a) || Dir(x; b)]
+    #     = E_Dir(x; a){log(Dir(x; a) / Dir(x; b)}
+    #     = E_Dir(x; a){sum_i[(a[i] - b[i]) log(x[i])} - (lbeta(a) - lbeta(b))
+    #     = sum_i[(a[i] - b[i]) * E_Dir(x; a){log(x[i])}] - lbeta(a) + lbeta(b)
+    #     = sum_i[(a[i] - b[i]) * (digamma(a[i]) - digamma(sum_j a[j]))]
+    #          - lbeta(a) + lbeta(b))
+
+    digamma_sum_d1 = math_ops.digamma(
+        math_ops.reduce_sum(d1.concentration, axis=-1, keepdims=True))
+    digamma_diff = math_ops.digamma(d1.concentration) - digamma_sum_d1
+    concentration_diff = d1.concentration - d2.concentration
+
+    return (math_ops.reduce_sum(concentration_diff * digamma_diff, axis=-1) -
+            special_math_ops.lbeta(d1.concentration) +
+            special_math_ops.lbeta(d2.concentration))
diff --git a/tensorflow/python/ops/distributions/dirichlet_multinomial.py b/tensorflow/python/ops/distributions/dirichlet_multinomial.py
index 3a35e0caa0f411dbb413aa5e8fe68143e0914db9..5350c8284704a15f71bce7be0c44e298067d3692 100644
--- a/tensorflow/python/ops/distributions/dirichlet_multinomial.py
+++ b/tensorflow/python/ops/distributions/dirichlet_multinomial.py
@@ -191,7 +191,7 @@ class DirichletMultinomial(distribution.Distribution):
         more of the statistic's batch members are undefined.
       name: Python `str` name prefixed to Ops created by this class.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     with ops.name_scope(name, values=[total_count, concentration]) as name:
       # Broadcasting works because:
       # * The broadcasting convention is to prepend dimensions of size [1], and
diff --git a/tensorflow/python/ops/distributions/distribution.py b/tensorflow/python/ops/distributions/distribution.py
index a6579e3246d4c2c6bf266754868cba43515e3256..41dcd401887a124780a35c3dbd84140553860485 100644
--- a/tensorflow/python/ops/distributions/distribution.py
+++ b/tensorflow/python/ops/distributions/distribution.py
@@ -525,7 +525,7 @@ class Distribution(_BaseDistribution):
     """Dictionary of parameters used to instantiate this `Distribution`."""
     # Remove "self", "__class__", or other special variables. These can appear
     # if the subclass used:
-    # `parameters = distribution_util.parent_frame_arguments()`.
+    # `parameters = dict(locals())`.
     return dict((k, v) for k, v in self._parameters.items()
                 if not k.startswith("__") and k != "self")
 
@@ -722,11 +722,8 @@ class Distribution(_BaseDistribution):
       value = ops.convert_to_tensor(value, name="value")
       try:
         return self._log_prob(value, **kwargs)
-      except NotImplementedError as original_exception:
-        try:
-          return math_ops.log(self._prob(value, **kwargs))
-        except NotImplementedError:
-          raise original_exception
+      except NotImplementedError:
+        return math_ops.log(self._prob(value, **kwargs))
 
   def log_prob(self, value, name="log_prob"):
     """Log probability density/mass function.
@@ -749,11 +746,8 @@ class Distribution(_BaseDistribution):
       value = ops.convert_to_tensor(value, name="value")
       try:
         return self._prob(value, **kwargs)
-      except NotImplementedError as original_exception:
-        try:
-          return math_ops.exp(self._log_prob(value, **kwargs))
-        except NotImplementedError:
-          raise original_exception
+      except NotImplementedError:
+        return math_ops.exp(self._log_prob(value, **kwargs))
 
   def prob(self, value, name="prob"):
     """Probability density/mass function.
@@ -776,11 +770,8 @@ class Distribution(_BaseDistribution):
       value = ops.convert_to_tensor(value, name="value")
       try:
         return self._log_cdf(value, **kwargs)
-      except NotImplementedError as original_exception:
-        try:
-          return math_ops.log(self._cdf(value, **kwargs))
-        except NotImplementedError:
-          raise original_exception
+      except NotImplementedError:
+        return math_ops.log(self._cdf(value, **kwargs))
 
   def log_cdf(self, value, name="log_cdf"):
     """Log cumulative distribution function.
@@ -813,11 +804,8 @@ class Distribution(_BaseDistribution):
       value = ops.convert_to_tensor(value, name="value")
       try:
         return self._cdf(value, **kwargs)
-      except NotImplementedError as original_exception:
-        try:
-          return math_ops.exp(self._log_cdf(value, **kwargs))
-        except NotImplementedError:
-          raise original_exception
+      except NotImplementedError:
+        return math_ops.exp(self._log_cdf(value, **kwargs))
 
   def cdf(self, value, name="cdf"):
     """Cumulative distribution function.
@@ -846,11 +834,8 @@ class Distribution(_BaseDistribution):
       value = ops.convert_to_tensor(value, name="value")
       try:
         return self._log_survival_function(value, **kwargs)
-      except NotImplementedError as original_exception:
-        try:
-          return math_ops.log1p(-self.cdf(value, **kwargs))
-        except NotImplementedError:
-          raise original_exception
+      except NotImplementedError:
+        return math_ops.log1p(-self.cdf(value, **kwargs))
 
   def log_survival_function(self, value, name="log_survival_function"):
     """Log survival function.
@@ -884,11 +869,8 @@ class Distribution(_BaseDistribution):
       value = ops.convert_to_tensor(value, name="value")
       try:
         return self._survival_function(value, **kwargs)
-      except NotImplementedError as original_exception:
-        try:
-          return 1. - self.cdf(value, **kwargs)
-        except NotImplementedError:
-          raise original_exception
+      except NotImplementedError:
+        return 1. - self.cdf(value, **kwargs)
 
   def survival_function(self, value, name="survival_function"):
     """Survival function.
@@ -933,10 +915,7 @@ class Distribution(_BaseDistribution):
   def _call_quantile(self, value, name, **kwargs):
     with self._name_scope(name, values=[value]):
       value = ops.convert_to_tensor(value, name="value")
-      try:
-        return self._quantile(value, **kwargs)
-      except NotImplementedError as original_exception:
-        raise original_exception
+      return self._quantile(value, **kwargs)
 
   def quantile(self, value, name="quantile"):
     """Quantile function. Aka "inverse cdf" or "percent point function".
@@ -982,11 +961,8 @@ class Distribution(_BaseDistribution):
     with self._name_scope(name):
       try:
         return self._variance()
-      except NotImplementedError as original_exception:
-        try:
-          return math_ops.square(self._stddev())
-        except NotImplementedError:
-          raise original_exception
+      except NotImplementedError:
+        return math_ops.square(self._stddev())
 
   def _stddev(self):
     raise NotImplementedError("stddev is not implemented")
@@ -1014,11 +990,8 @@ class Distribution(_BaseDistribution):
     with self._name_scope(name):
       try:
         return self._stddev()
-      except NotImplementedError as original_exception:
-        try:
-          return math_ops.sqrt(self._variance())
-        except NotImplementedError:
-          raise original_exception
+      except NotImplementedError:
+        return math_ops.sqrt(self._variance())
 
   def _covariance(self):
     raise NotImplementedError("covariance is not implemented")
diff --git a/tensorflow/python/ops/distributions/exponential.py b/tensorflow/python/ops/distributions/exponential.py
index 1e08f48d529b164ddbb77d4d36ba0bd3390475b7..24bc3f3d3eb06a01d5173cb6c7fb0f09172a0587 100644
--- a/tensorflow/python/ops/distributions/exponential.py
+++ b/tensorflow/python/ops/distributions/exponential.py
@@ -27,7 +27,6 @@ from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import nn
 from tensorflow.python.ops import random_ops
 from tensorflow.python.ops.distributions import gamma
-from tensorflow.python.ops.distributions import util as distribution_util
 from tensorflow.python.util.tf_export import tf_export
 
 
@@ -91,7 +90,7 @@ class Exponential(gamma.Gamma):
         more of the statistic's batch members are undefined.
       name: Python `str` name prefixed to Ops created by this class.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     # Even though all statistics of are defined for valid inputs, this is not
     # true in the parent class "Gamma."  Therefore, passing
     # allow_nan_stats=True
@@ -144,7 +143,7 @@ class ExponentialWithSoftplusRate(Exponential):
                validate_args=False,
                allow_nan_stats=True,
                name="ExponentialWithSoftplusRate"):
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     with ops.name_scope(name, values=[rate]) as name:
       super(ExponentialWithSoftplusRate, self).__init__(
           rate=nn.softplus(rate, name="softplus_rate"),
diff --git a/tensorflow/python/ops/distributions/gamma.py b/tensorflow/python/ops/distributions/gamma.py
index 7ca690d9d2f8348a103adb57abe57f2ce058f17c..163a27f7585518c321dd1ea59b71029e2ae6a1e7 100644
--- a/tensorflow/python/ops/distributions/gamma.py
+++ b/tensorflow/python/ops/distributions/gamma.py
@@ -126,7 +126,7 @@ class Gamma(distribution.Distribution):
     Raises:
       TypeError: if `concentration` and `rate` are different dtypes.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     with ops.name_scope(name, values=[concentration, rate]) as name:
       with ops.control_dependencies([
           check_ops.assert_positive(concentration),
@@ -261,7 +261,7 @@ class GammaWithSoftplusConcentrationRate(Gamma):
                validate_args=False,
                allow_nan_stats=True,
                name="GammaWithSoftplusConcentrationRate"):
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     with ops.name_scope(name, values=[concentration, rate]) as name:
       super(GammaWithSoftplusConcentrationRate, self).__init__(
           concentration=nn.softplus(concentration,
diff --git a/tensorflow/python/ops/distributions/laplace.py b/tensorflow/python/ops/distributions/laplace.py
index ee3a6a40ff78fb95545c3af4eff85e84a1abd515..be17cf2527eacb7930333cbb03f1dba860b19dea 100644
--- a/tensorflow/python/ops/distributions/laplace.py
+++ b/tensorflow/python/ops/distributions/laplace.py
@@ -33,7 +33,6 @@ from tensorflow.python.ops import nn
 from tensorflow.python.ops import random_ops
 from tensorflow.python.ops.distributions import distribution
 from tensorflow.python.ops.distributions import special_math
-from tensorflow.python.ops.distributions import util as distribution_util
 from tensorflow.python.util.tf_export import tf_export
 
 
@@ -101,7 +100,7 @@ class Laplace(distribution.Distribution):
     Raises:
       TypeError: if `loc` and `scale` are of different dtype.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     with ops.name_scope(name, values=[loc, scale]) as name:
       with ops.control_dependencies([check_ops.assert_positive(scale)] if
                                     validate_args else []):
@@ -218,7 +217,7 @@ class LaplaceWithSoftplusScale(Laplace):
                validate_args=False,
                allow_nan_stats=True,
                name="LaplaceWithSoftplusScale"):
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     with ops.name_scope(name, values=[loc, scale]) as name:
       super(LaplaceWithSoftplusScale, self).__init__(
           loc=loc,
diff --git a/tensorflow/python/ops/distributions/multinomial.py b/tensorflow/python/ops/distributions/multinomial.py
index 036ba45cccf49913c339d305e110960c988f8ccf..d0943e8eee69a5ef23d1829651801513e4bc1d69 100644
--- a/tensorflow/python/ops/distributions/multinomial.py
+++ b/tensorflow/python/ops/distributions/multinomial.py
@@ -182,7 +182,7 @@ class Multinomial(distribution.Distribution):
         more of the statistic's batch members are undefined.
       name: Python `str` name prefixed to Ops created by this class.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     with ops.name_scope(name, values=[total_count, logits, probs]) as name:
       self._total_count = ops.convert_to_tensor(total_count, name="total_count")
       if validate_args:
diff --git a/tensorflow/python/ops/distributions/normal.py b/tensorflow/python/ops/distributions/normal.py
index 0620aae10d0d3bb1f902171ed5c033630520dd7a..d0a987ba7ccf0b7ef2f38c2fc868540f6b5ff005 100644
--- a/tensorflow/python/ops/distributions/normal.py
+++ b/tensorflow/python/ops/distributions/normal.py
@@ -32,7 +32,6 @@ from tensorflow.python.ops import random_ops
 from tensorflow.python.ops.distributions import distribution
 from tensorflow.python.ops.distributions import kullback_leibler
 from tensorflow.python.ops.distributions import special_math
-from tensorflow.python.ops.distributions import util as distribution_util
 from tensorflow.python.util.tf_export import tf_export
 
 
@@ -132,7 +131,7 @@ class Normal(distribution.Distribution):
     Raises:
       TypeError: if `loc` and `scale` have different `dtype`.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     with ops.name_scope(name, values=[loc, scale]) as name:
       with ops.control_dependencies([check_ops.assert_positive(scale)] if
                                     validate_args else []):
@@ -244,7 +243,7 @@ class NormalWithSoftplusScale(Normal):
                validate_args=False,
                allow_nan_stats=True,
                name="NormalWithSoftplusScale"):
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     with ops.name_scope(name, values=[scale]) as name:
       super(NormalWithSoftplusScale, self).__init__(
           loc=loc,
diff --git a/tensorflow/python/ops/distributions/student_t.py b/tensorflow/python/ops/distributions/student_t.py
index 9330b930b5140b50e2b8f060c70702bd47b2c4ae..20a2d16181442bede797ded5e4d3ebbd3d55ca2b 100644
--- a/tensorflow/python/ops/distributions/student_t.py
+++ b/tensorflow/python/ops/distributions/student_t.py
@@ -157,7 +157,7 @@ class StudentT(distribution.Distribution):
     Raises:
       TypeError: if loc and scale are different dtypes.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     with ops.name_scope(name, values=[df, loc, scale]) as name:
       with ops.control_dependencies([check_ops.assert_positive(df)]
                                     if validate_args else []):
@@ -349,7 +349,7 @@ class StudentTWithAbsDfSoftplusScale(StudentT):
                validate_args=False,
                allow_nan_stats=True,
                name="StudentTWithAbsDfSoftplusScale"):
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     with ops.name_scope(name, values=[df, scale]) as name:
       super(StudentTWithAbsDfSoftplusScale, self).__init__(
           df=math_ops.floor(math_ops.abs(df)),
diff --git a/tensorflow/python/ops/distributions/transformed_distribution.py b/tensorflow/python/ops/distributions/transformed_distribution.py
index 9392464ec11613cf318a9a86124bdba6f46ba595..e80bf9ee4272c832f1ab53e6bf2b1f0cf6faafee 100644
--- a/tensorflow/python/ops/distributions/transformed_distribution.py
+++ b/tensorflow/python/ops/distributions/transformed_distribution.py
@@ -252,7 +252,7 @@ class TransformedDistribution(distribution_lib.Distribution):
       name: Python `str` name prefixed to Ops created by this class. Default:
         `bijector.name + distribution.name`.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     name = name or (("" if bijector is None else bijector.name) +
                     distribution.name)
     with ops.name_scope(name, values=[event_shape, batch_shape]) as name:
@@ -416,7 +416,7 @@ class TransformedDistribution(distribution_lib.Distribution):
     # For caching to work, it is imperative that the bijector is the first to
     # modify the input.
     x = self.bijector.inverse(y)
-    event_ndims = self._maybe_get_event_ndims_statically()
+    event_ndims = self._maybe_get_static_event_ndims()
 
     ildj = self.bijector.inverse_log_det_jacobian(y, event_ndims=event_ndims)
     if self.bijector._is_injective:  # pylint: disable=protected-access
@@ -435,13 +435,15 @@ class TransformedDistribution(distribution_lib.Distribution):
       log_prob = math_ops.reduce_sum(log_prob, self._reduce_event_indices)
     log_prob += math_ops.cast(ildj, log_prob.dtype)
     if self._is_maybe_event_override and isinstance(event_ndims, int):
-      log_prob.set_shape(array_ops.broadcast_static_shape(
-          x.get_shape().with_rank_at_least(1)[:-event_ndims], self.batch_shape))
+      log_prob.set_shape(
+          array_ops.broadcast_static_shape(
+              y.get_shape().with_rank_at_least(1)[:-event_ndims],
+              self.batch_shape))
     return log_prob
 
   def _prob(self, y):
     x = self.bijector.inverse(y)
-    event_ndims = self._maybe_get_event_ndims_statically()
+    event_ndims = self._maybe_get_static_event_ndims()
     ildj = self.bijector.inverse_log_det_jacobian(y, event_ndims=event_ndims)
     if self.bijector._is_injective:  # pylint: disable=protected-access
       return self._finish_prob_for_one_fiber(y, x, ildj, event_ndims)
@@ -459,8 +461,10 @@ class TransformedDistribution(distribution_lib.Distribution):
       prob = math_ops.reduce_prod(prob, self._reduce_event_indices)
     prob *= math_ops.exp(math_ops.cast(ildj, prob.dtype))
     if self._is_maybe_event_override and isinstance(event_ndims, int):
-      prob.set_shape(array_ops.broadcast_static_shape(
-          y.get_shape().with_rank_at_least(1)[:-event_ndims], self.batch_shape))
+      prob.set_shape(
+          array_ops.broadcast_static_shape(
+              y.get_shape().with_rank_at_least(1)[:-event_ndims],
+              self.batch_shape))
     return prob
 
   def _log_cdf(self, y):
@@ -618,15 +622,14 @@ class TransformedDistribution(distribution_lib.Distribution):
     return array_ops.transpose(
         x, _concat_vectors(math_ops.range(n, ndims), math_ops.range(0, n)))
 
-  def _maybe_get_event_ndims_statically(self):
+  def _maybe_get_static_event_ndims(self):
     if self.event_shape.ndims is not None:
       return self.event_shape.ndims
 
     event_ndims = array_ops.size(self.event_shape_tensor())
+    event_ndims_ = distribution_util.maybe_get_static_value(event_ndims)
 
-    static_event_ndims = tensor_util.constant_value(event_ndims)
-
-    if static_event_ndims is not None:
-      return static_event_ndims
+    if event_ndims_ is not None:
+      return event_ndims_
 
     return event_ndims
diff --git a/tensorflow/python/ops/distributions/uniform.py b/tensorflow/python/ops/distributions/uniform.py
index dfa10331e3e9d64c5d3cae2c52373edfbcda723a..e66c4a37e7c6330656bdefb6639670896a488fc7 100644
--- a/tensorflow/python/ops/distributions/uniform.py
+++ b/tensorflow/python/ops/distributions/uniform.py
@@ -29,7 +29,6 @@ from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import random_ops
 from tensorflow.python.ops.distributions import distribution
-from tensorflow.python.ops.distributions import util as distribution_util
 from tensorflow.python.util.tf_export import tf_export
 
 
@@ -103,7 +102,7 @@ class Uniform(distribution.Distribution):
     Raises:
       InvalidArgumentError: if `low >= high` and `validate_args=False`.
     """
-    parameters = distribution_util.parent_frame_arguments()
+    parameters = dict(locals())
     with ops.name_scope(name, values=[low, high]) as name:
       with ops.control_dependencies([
           check_ops.assert_less(
diff --git a/tensorflow/python/ops/distributions/util.py b/tensorflow/python/ops/distributions/util.py
index 59c89d21f9142fbbaf52b1d273873194a30a0f7f..401676bf842b4dd76fc64b5f4599804a0f3a46f8 100644
--- a/tensorflow/python/ops/distributions/util.py
+++ b/tensorflow/python/ops/distributions/util.py
@@ -179,6 +179,7 @@ def maybe_get_static_value(x, dtype=None):
   if x is None:
     return x
   try:
+    # This returns an np.ndarray.
     x_ = tensor_util.constant_value(x)
   except TypeError:
     x_ = x
@@ -823,8 +824,8 @@ def fill_triangular(x, upper=False, name=None):
   Triangular matrix elements are filled in a clockwise spiral. See example,
   below.
 
-  If `x.get_shape()` is `[b1, b2, ..., bK, d]` then the output shape is `[b1,
-  b2, ..., bK, n, n]` where `n` is such that `d = n(n+1)/2`, i.e.,
+  If `x.get_shape()` is `[b1, b2, ..., bB, d]` then the output shape is
+  `[b1, b2, ..., bB, n, n]` where `n` is such that `d = n(n+1)/2`, i.e.,
   `n = int(np.sqrt(0.25 + 2. * m) - 0.5)`.
 
   Example:
@@ -913,10 +914,11 @@ def fill_triangular(x, upper=False, name=None):
     #   = 2 (n**2 / 2 + n / 2) - n**2
     #   = n**2 + n - n**2
     #   = n
+    ndims = prefer_static_rank(x)
     if upper:
-      x_list = [x, array_ops.reverse(x[..., n:], axis=[-1])]
+      x_list = [x, array_ops.reverse(x[..., n:], axis=[ndims - 1])]
     else:
-      x_list = [x[..., n:], array_ops.reverse(x, axis=[-1])]
+      x_list = [x[..., n:], array_ops.reverse(x, axis=[ndims - 1])]
     new_shape = (
         static_final_shape.as_list()
         if static_final_shape.is_fully_defined()
@@ -930,6 +932,74 @@ def fill_triangular(x, upper=False, name=None):
     return x
 
 
+def fill_triangular_inverse(x, upper=False, name=None):
+  """Creates a vector from a (batch of) triangular matrix.
+
+  The vector is created from the lower-triangular or upper-triangular portion
+  depending on the value of the parameter `upper`.
+
+  If `x.shape` is `[b1, b2, ..., bB, n, n]` then the output shape is
+  `[b1, b2, ..., bB, d]` where `d = n (n + 1) / 2`.
+
+  Example:
+
+  ```python
+  fill_triangular_inverse(
+    [[4, 0, 0],
+     [6, 5, 0],
+     [3, 2, 1]])
+
+  # ==> [1, 2, 3, 4, 5, 6]
+
+  fill_triangular_inverse(
+    [[1, 2, 3],
+     [0, 5, 6],
+     [0, 0, 4]], upper=True)
+
+  # ==> [1, 2, 3, 4, 5, 6]
+  ```
+
+  Args:
+    x: `Tensor` representing lower (or upper) triangular elements.
+    upper: Python `bool` representing whether output matrix should be upper
+      triangular (`True`) or lower triangular (`False`, default).
+    name: Python `str`. The name to give this op.
+
+  Returns:
+    flat_tril: (Batch of) vector-shaped `Tensor` representing vectorized lower
+      (or upper) triangular elements from `x`.
+  """
+
+  with ops.name_scope(name, "fill_triangular_inverse", values=[x]):
+    x = ops.convert_to_tensor(x, name="x")
+    if x.shape.with_rank_at_least(2)[-1].value is not None:
+      n = np.int32(x.shape[-1].value)
+      m = np.int32((n * (n + 1)) // 2)
+      static_final_shape = x.shape[:-2].concatenate([m])
+    else:
+      n = array_ops.shape(x)[-1]
+      m = (n * (n + 1)) // 2
+      static_final_shape = x.shape.with_rank_at_least(2)[:-2].concatenate(
+          [None])
+    ndims = prefer_static_rank(x)
+    if upper:
+      initial_elements = x[..., 0, :]
+      triangular_portion = x[..., 1:, :]
+    else:
+      initial_elements = array_ops.reverse(x[..., -1, :], axis=[ndims - 2])
+      triangular_portion = x[..., :-1, :]
+    rotated_triangular_portion = array_ops.reverse(
+        array_ops.reverse(triangular_portion, axis=[ndims - 1]),
+        axis=[ndims - 2])
+    consolidated_matrix = triangular_portion + rotated_triangular_portion
+    end_sequence = array_ops.reshape(
+        consolidated_matrix,
+        array_ops.concat([array_ops.shape(x)[:-2], [n * (n - 1)]], axis=0))
+    y = array_ops.concat([initial_elements, end_sequence[..., :m - n]], axis=-1)
+    y.set_shape(static_final_shape)
+    return y
+
+
 def tridiag(below=None, diag=None, above=None, name=None):
   """Creates a matrix with values set above, below, and on the diagonal.
 
diff --git a/tensorflow/python/ops/functional_ops.py b/tensorflow/python/ops/functional_ops.py
index 394ad0b1a2284ac147a09f165fb1f50d24f4cedc..30413f289a0674db61406153cc05d12c7cc98f9b 100644
--- a/tensorflow/python/ops/functional_ops.py
+++ b/tensorflow/python/ops/functional_ops.py
@@ -455,7 +455,8 @@ def map_fn(fn, elems, dtype=None, parallel_iterations=10, back_prop=True,
         lambda i, _: i < n, compute, (i, accs_ta),
         parallel_iterations=parallel_iterations,
         back_prop=back_prop,
-        swap_memory=swap_memory)
+        swap_memory=swap_memory,
+        maximum_iterations=n)
     results_flat = [r.stack() for r in r_a]
 
     n_static = elems_flat[0].get_shape().with_rank_at_least(1)[0]
diff --git a/tensorflow/python/ops/gradients_impl.py b/tensorflow/python/ops/gradients_impl.py
index 716b54f07cd61b3062ce7f488f5dd806915ac254..7385cb758514e160efec61d731e734d1af126742 100644
--- a/tensorflow/python/ops/gradients_impl.py
+++ b/tensorflow/python/ops/gradients_impl.py
@@ -1006,21 +1006,32 @@ def _AggregatedGrads(grads,
         logging.vlog(2, "  _AggregatedGrads %d x %s using %s", len(out_grad),
                      tensor_shape, used)
       else:
-        out_grad = math_ops._as_indexed_slices_list(
-            [g for g in out_grad if g is not None])
-        out_grad = [_HandleNestedIndexedSlices(x) for x in out_grad]
-        # Form IndexedSlices out of the concatenated values and
-        # indices.
-        out_grads[i] = ops.IndexedSlices(
-            array_ops.concat([x.values for x in out_grad], 0),
-            array_ops.concat([x.indices for x in out_grad], 0),
-            out_grad[0].dense_shape)
+        out_grads[i] = _AggregateIndexedSlicesGradients(out_grad)
     else:  # not out_grad
       # out_grads[i] is [], thus its aggregation is simply None.
       out_grads[i] = None
   return out_grads
 
 
+def _AggregateIndexedSlicesGradients(grads):
+  """Aggregates gradients of type `IndexedSlices` by concatenation."""
+  if len(grads) < 1:
+    return None
+  elif len(grads) == 1:
+    return grads[0]
+  else:
+    grads = math_ops._as_indexed_slices_list(  # pylint: disable=protected-access
+        [g for g in grads if g is not None])
+    grads = [_HandleNestedIndexedSlices(x) for x in grads]  # pylint: disable=protected-access
+    # Form IndexedSlices out of the concatenated values and indices.
+    concat_grad = ops.IndexedSlices(
+        array_ops.concat([x.values for x in grads], axis=0),
+        array_ops.concat([x.indices for x in grads], axis=0),
+        grads[0].dense_shape)
+
+    return concat_grad
+
+
 # TODO(vrv): Make this available when we want to make it public.
 def _hessian_vector_product(ys, xs, v):
   """Multiply the Hessian of `ys` wrt `xs` by `v`.
diff --git a/tensorflow/python/ops/gradients_test.py b/tensorflow/python/ops/gradients_test.py
index 70d500a108f0d4633aed3602cbab6938bddcab29..d81c756f1cbc0a46d094066cda369067f7d3d1f6 100644
--- a/tensorflow/python/ops/gradients_test.py
+++ b/tensorflow/python/ops/gradients_test.py
@@ -83,7 +83,6 @@ def _OpsBetween(to_ops, from_ops):
   return between_ops
 
 
-@test_util.with_c_api
 class GradientsTest(test_util.TensorFlowTestCase):
 
   def _OpNames(self, op_list):
@@ -946,5 +945,53 @@ class CustomGradientTest(test_util.TensorFlowTestCase):
       self.assertAllEqual(g.eval(feed_dict={conditional: False}), [3.0])
 
 
+class AggregateIndexedSlicesGradientsTest(test_util.TensorFlowTestCase):
+
+  def _assert_indexed_slices_equal(self, left, right):
+    self.assertAllEqual(
+        self.evaluate(ops.convert_to_tensor(left)),
+        self.evaluate(ops.convert_to_tensor(right)))
+
+  def testNoGradients(self):
+    self.assertIsNone(gradients_impl._AggregateIndexedSlicesGradients([]))
+
+  def testOneGradient(self):
+    t = math_ops._as_indexed_slices(constant_op.constant(
+        [[1., 2.], [0, 0], [3., 4.]]))
+    result = gradients_impl._AggregateIndexedSlicesGradients([t])
+    self._assert_indexed_slices_equal(t, result)
+
+  def testMultipleGradients(self):
+    t0 = math_ops._as_indexed_slices(constant_op.constant(
+        [[1., 2.], [0, 0], [3., 4.]]))
+    t1 = math_ops._as_indexed_slices(constant_op.constant(
+        [[0., 0.], [5, 6], [7., 8.]]))
+    total = constant_op.constant(
+        [[1., 2.], [5, 6], [10., 12.]])
+    result = gradients_impl._AggregateIndexedSlicesGradients([t0, t1])
+    self._assert_indexed_slices_equal(total, result)
+
+  def testMultipleGradientsWithNones(self):
+    t0 = math_ops._as_indexed_slices(constant_op.constant(
+        [[1., 2.], [0, 0], [3., 4.]]))
+    t1 = math_ops._as_indexed_slices(constant_op.constant(
+        [[0., 0.], [5, 6], [7., 8.]]))
+    t3 = None
+    total = constant_op.constant(
+        [[1., 2.], [5, 6], [10., 12.]])
+    result = gradients_impl._AggregateIndexedSlicesGradients([t0, t1, t3])
+    self._assert_indexed_slices_equal(total, result)
+
+  def testMixedTensorAndIndexedSlices(self):
+    t0 = math_ops._as_indexed_slices(constant_op.constant(
+        [[1., 2.], [0, 0], [3., 4.]]))
+    t1 = constant_op.constant(
+        [[0., 0.], [5, 6], [7., 8.]])
+    total = constant_op.constant(
+        [[1., 2.], [5, 6], [10., 12.]])
+    result = gradients_impl._AggregateIndexedSlicesGradients([t0, t1])
+    self._assert_indexed_slices_equal(total, result)
+
+
 if __name__ == "__main__":
   googletest.main()
diff --git a/tensorflow/python/ops/image_ops_impl.py b/tensorflow/python/ops/image_ops_impl.py
index 3f40e3ff7552ecaf9c81b69addb7f4a35e2301e4..f27d9224c15cc9c3791d1d56454415ed95d196bd 100644
--- a/tensorflow/python/ops/image_ops_impl.py
+++ b/tensorflow/python/ops/image_ops_impl.py
@@ -28,6 +28,7 @@ from tensorflow.python.framework import tensor_util
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import control_flow_ops
+from tensorflow.python.ops import functional_ops
 from tensorflow.python.ops import gen_image_ops
 from tensorflow.python.ops import gen_nn_ops
 from tensorflow.python.ops import math_ops
@@ -258,14 +259,14 @@ def random_flip_up_down(image, seed=None):
   dimension, which is `height`.  Otherwise output the image as-is.
 
   Args:
-    image: A 3-D tensor of shape `[height, width, channels].`
+    image: 4-D Tensor of shape `[batch, height, width, channels]` or
+           3-D Tensor of shape `[height, width, channels]`.
     seed: A Python integer. Used to create a random seed. See
       @{tf.set_random_seed}
       for behavior.
 
   Returns:
-    A 3-D tensor of the same type and shape as `image`.
-
+    A tensor of the same type and shape as `image`.
   Raises:
     ValueError: if the shape of `image` not supported.
   """
@@ -280,13 +281,14 @@ def random_flip_left_right(image, seed=None):
   second dimension, which is `width`.  Otherwise output the image as-is.
 
   Args:
-    image: A 3-D tensor of shape `[height, width, channels].`
+    image: 4-D Tensor of shape `[batch, height, width, channels]` or
+           3-D Tensor of shape `[height, width, channels]`.
     seed: A Python integer. Used to create a random seed. See
       @{tf.set_random_seed}
       for behavior.
 
   Returns:
-    A 3-D tensor of the same type and shape as `image`.
+    A tensor of the same type and shape as `image`.
 
   Raises:
     ValueError: if the shape of `image` not supported.
@@ -297,7 +299,8 @@ def random_flip_left_right(image, seed=None):
 def _random_flip(image, flip_index, seed, scope_name):
   """Randomly (50% chance) flip an image along axis `flip_index`.
     Args:
-      image: A 3-D tensor of shape `[height, width, channels].`
+      image: 4-D Tensor of shape `[batch, height, width, channels]` or
+             3-D Tensor of shape `[height, width, channels]`.
       flip_index: The dimension along which to flip the image.
                   Vertical: 0, Horizontal: 1
       seed: A Python integer. Used to create a random seed. See
@@ -306,22 +309,37 @@ def _random_flip(image, flip_index, seed, scope_name):
       scope_name: Name of the scope in which the ops are added.
 
     Returns:
-      A 3-D tensor of the same type and shape as `image`.
+      A tensor of the same type and shape as `image`.
 
     Raises:
       ValueError: if the shape of `image` not supported.
   """
   with ops.name_scope(None, scope_name, [image]) as scope:
     image = ops.convert_to_tensor(image, name='image')
-    image = _Assert3DImage(image)
-    uniform_random = random_ops.random_uniform([], 0, 1.0, seed=seed)
-    mirror_cond = math_ops.less(uniform_random, .5)
-    result = control_flow_ops.cond(
-        mirror_cond,
-        lambda: array_ops.reverse(image, [flip_index]),
-        lambda: image,
-        name=scope)
-    return fix_image_flip_shape(image, result)
+    image = _AssertAtLeast3DImage(image)
+    shape = image.get_shape()
+    if shape.ndims == 3 or shape.ndims is None:
+      uniform_random = random_ops.random_uniform([], 0, 1.0, seed=seed)
+      mirror_cond = math_ops.less(uniform_random, .5)
+      result = control_flow_ops.cond(
+          mirror_cond,
+          lambda: array_ops.reverse(image, [flip_index]),
+          lambda: image,
+          name=scope
+      )
+      return fix_image_flip_shape(image, result)
+    elif shape.ndims == 4:
+      uniform_random = random_ops.random_uniform(
+          [array_ops.shape(image)[0]], 0, 1.0, seed=seed
+      )
+      mirror_cond = math_ops.less(uniform_random, .5)
+      return array_ops.where(
+          mirror_cond,
+          image,
+          functional_ops.map_fn(lambda x: array_ops.reverse(x, [flip_index]), image, dtype=image.dtype)
+      )
+    else:
+      raise ValueError('\'image\' must have either 3 or 4 dimensions.')
 
 
 @tf_export('image.flip_left_right')
@@ -523,7 +541,7 @@ def transpose_image(image):
 
 @tf_export('image.central_crop')
 def central_crop(image, central_fraction):
-  """Crop the central region of the image.
+  """Crop the central region of the image(s).
 
   Remove the outer parts of an image but retain the central region of the image
   along each dimension. If we specify central_fraction = 0.5, this function
@@ -536,15 +554,19 @@ def central_crop(image, central_fraction):
       |        |   where "X" is the central 50% of the image.
        --------
 
+  This function works on either a single image (`image` is a 3-D Tensor), or a
+  batch of images (`image` is a 4-D Tensor).
+
   Args:
-    image: 3-D float Tensor of shape [height, width, depth]
+    image: Either a 3-D float Tensor of shape [height, width, depth], or a 4-D
+      Tensor of shape [batch_size, height, width, depth].
     central_fraction: float (0, 1], fraction of size to crop
 
   Raises:
     ValueError: if central_crop_fraction is not within (0, 1].
 
   Returns:
-    3-D float Tensor
+    3-D / 4-D float Tensor, as per the input.
   """
   with ops.name_scope(None, 'central_crop', [image]):
     image = ops.convert_to_tensor(image, name='image')
@@ -553,24 +575,75 @@ def central_crop(image, central_fraction):
     if central_fraction == 1.0:
       return image
 
-    image = _Assert3DImage(image)
+    _AssertAtLeast3DImage(image)
+    rank = image.get_shape().ndims
+    if rank != 3 and rank != 4:
+      raise ValueError('`image` should either be a Tensor with rank = 3 or '
+                       'rank = 4. Had rank = {}.'.format(rank))
+
+    # Helper method to return the `idx`-th dimension of `tensor`, along with
+    # a boolean signifying if the dimension is dynamic.
+    def _get_dim(tensor, idx):
+      static_shape = tensor.get_shape()[idx].value
+      if static_shape is not None:
+        return static_shape, False
+      return array_ops.shape(tensor)[idx], True
+
+    # Get the height, width, depth (and batch size, if the image is a 4-D
+    # tensor).
+    if rank == 3:
+      img_h, dynamic_h = _get_dim(image, 0)
+      img_w, dynamic_w = _get_dim(image, 1)
+      img_d = image.get_shape()[2]
+    else:
+      img_bs = image.get_shape()[0]
+      img_h, dynamic_h = _get_dim(image, 1)
+      img_w, dynamic_w = _get_dim(image, 2)
+      img_d = image.get_shape()[3]
+
+    # Compute the bounding boxes for the crop. The type and value of the
+    # bounding boxes depend on the `image` tensor's rank and whether / not the
+    # dimensions are statically defined.
+    if dynamic_h:
+      img_hd = math_ops.to_double(img_h)
+      bbox_h_start = math_ops.to_int32((img_hd - img_hd * central_fraction) / 2)
+    else:
+      img_hd = float(img_h)
+      bbox_h_start = int((img_hd - img_hd * central_fraction) / 2)
+
+    if dynamic_w:
+      img_wd = math_ops.to_double(img_w)
+      bbox_w_start = math_ops.to_int32((img_wd - img_wd * central_fraction) / 2)
+    else:
+      img_wd = float(img_w)
+      bbox_w_start = int((img_wd - img_wd * central_fraction) / 2)
 
-    img_shape = array_ops.shape(image)
-    depth = image.get_shape()[2]
-    img_h = math_ops.to_double(img_shape[0])
-    img_w = math_ops.to_double(img_shape[1])
-    bbox_h_start = math_ops.to_int32((img_h - img_h * central_fraction) / 2)
-    bbox_w_start = math_ops.to_int32((img_w - img_w * central_fraction) / 2)
+    bbox_h_size = img_h - bbox_h_start * 2
+    bbox_w_size = img_w - bbox_w_start * 2
 
-    bbox_h_size = img_shape[0] - bbox_h_start * 2
-    bbox_w_size = img_shape[1] - bbox_w_start * 2
+    if rank == 3:
+      bbox_begin = array_ops.stack([bbox_h_start, bbox_w_start, 0])
+      bbox_size = array_ops.stack([bbox_h_size, bbox_w_size, -1])
+    else:
+      bbox_begin = array_ops.stack([0, bbox_h_start, bbox_w_start, 0])
+      bbox_size = array_ops.stack([-1, bbox_h_size, bbox_w_size, -1])
 
-    bbox_begin = array_ops.stack([bbox_h_start, bbox_w_start, 0])
-    bbox_size = array_ops.stack([bbox_h_size, bbox_w_size, -1])
     image = array_ops.slice(image, bbox_begin, bbox_size)
 
-    # The first two dimensions are dynamic and unknown.
-    image.set_shape([None, None, depth])
+    # Reshape the `image` tensor to the desired size.
+    if rank == 3:
+      image.set_shape([
+          None if dynamic_h else bbox_h_size,
+          None if dynamic_w else bbox_w_size,
+          img_d
+      ])
+    else:
+      image.set_shape([
+          img_bs,
+          None if dynamic_h else bbox_h_size,
+          None if dynamic_w else bbox_w_size,
+          img_d
+      ])
     return image
 
 
@@ -866,7 +939,8 @@ class ResizeMethod(object):
 def resize_images(images,
                   size,
                   method=ResizeMethod.BILINEAR,
-                  align_corners=False):
+                  align_corners=False,
+                  preserve_aspect_ratio=False):
   """Resize `images` to `size` using the specified `method`.
 
   Resized images will be distorted if their original aspect ratio is not
@@ -898,6 +972,10 @@ def resize_images(images,
     align_corners: bool.  If True, the centers of the 4 corner pixels of the
         input and output tensors are aligned, preserving the values at the
         corner pixels. Defaults to `False`.
+    preserve_aspect_ratio: Whether to preserve the aspect ratio. If this is set,
+      then `images` will be resized to a size that fits in `size` while
+      preserving the aspect ratio of the original image. Scales up the image if
+      `size` is bigger than the current size of the `image`. Defaults to False.
 
   Raises:
     ValueError: if the shape of `images` is incompatible with the
@@ -936,6 +1014,28 @@ def resize_images(images,
     new_height_const = size_const_as_shape[0].value
     new_width_const = size_const_as_shape[1].value
 
+    if preserve_aspect_ratio:
+      # Get the current shapes of the image, even if dynamic.
+      _, current_height, current_width, _ = _ImageDimensions(images, rank=4)
+
+      # do the computation to find the right scale and height/width.
+      scale_factor_height = (math_ops.to_float(new_height_const) /
+                             math_ops.to_float(current_height))
+      scale_factor_width = (math_ops.to_float(new_width_const) /
+                            math_ops.to_float(current_width))
+      scale_factor = math_ops.minimum(scale_factor_height, scale_factor_width)
+      scaled_height_const = math_ops.to_int32(scale_factor *
+                                              math_ops.to_float(current_height))
+      scaled_width_const = math_ops.to_int32(scale_factor *
+                                             math_ops.to_float(current_width))
+
+      # NOTE: Reset the size and other constants used later.
+      size = ops.convert_to_tensor([scaled_height_const, scaled_width_const],
+                                   dtypes.int32, name='size')
+      size_const_as_shape = tensor_util.constant_value_as_shape(size)
+      new_height_const = size_const_as_shape[0].value
+      new_width_const = size_const_as_shape[1].value
+
     # If we can determine that the height and width will be unmodified by this
     # transformation, we avoid performing the resize.
     if all(x is not None
@@ -1396,6 +1496,75 @@ def adjust_hue(image, delta, name=None):
     return convert_image_dtype(rgb_altered, orig_dtype)
 
 
+# pylint: disable=invalid-name
+@tf_export('image.random_jpeg_quality')
+def random_jpeg_quality(image, min_jpeg_quality, max_jpeg_quality, seed=None):
+  """Randomly changes jpeg encoding quality for inducing jpeg noise.
+
+  `min_jpeg_quality` must be in the interval `[0, 100]` and less than
+  `max_jpeg_quality`.
+  `max_jpeg_quality` must be in the interval `[0, 100]`.
+
+  Args:
+    image: RGB image or images. Size of the last dimension must be 3.
+    min_jpeg_quality: Minimum jpeg encoding quality to use.
+    max_jpeg_quality: Maximum jpeg encoding quality to use.
+    seed: An operation-specific seed. It will be used in conjunction
+      with the graph-level seed to determine the real seeds that will be
+      used in this operation. Please see the documentation of
+      set_random_seed for its interaction with the graph-level random seed.
+
+  Returns:
+    Adjusted image(s), same shape and DType as `image`.
+
+  Raises:
+    ValueError: if `min_jpeg_quality` or `max_jpeg_quality` is invalid.
+  """
+  if (min_jpeg_quality < 0 or max_jpeg_quality < 0 or
+      min_jpeg_quality > 100 or max_jpeg_quality > 100):
+    raise ValueError('jpeg encoding range must be between 0 and 100.')
+
+  if min_jpeg_quality >= max_jpeg_quality:
+    raise ValueError('`min_jpeg_quality` must be less than `max_jpeg_quality`.')
+
+  np.random.seed(seed)
+  jpeg_quality = np.random.randint(min_jpeg_quality, max_jpeg_quality)
+  return adjust_jpeg_quality(image, jpeg_quality)
+
+
+@tf_export('image.adjust_jpeg_quality')
+def adjust_jpeg_quality(image, jpeg_quality, name=None):
+  """Adjust jpeg encoding quality of an RGB image.
+
+  This is a convenience method that adjusts jpeg encoding quality of an
+  RGB image.
+
+  `image` is an RGB image.  The image's encoding quality is adjusted
+  to `jpeg_quality`.
+  `jpeg_quality` must be in the interval `[0, 100]`.
+
+  Args:
+    image: RGB image or images. Size of the last dimension must be 3.
+    jpeg_quality: int.  jpeg encoding quality.
+    name: A name for this operation (optional).
+
+  Returns:
+    Adjusted image(s), same shape and DType as `image`.
+  """
+  with ops.name_scope(name, 'adjust_jpeg_quality', [image]) as name:
+    image = ops.convert_to_tensor(image, name='image')
+    # Remember original dtype to so we can convert back if needed
+    orig_dtype = image.dtype
+    # Convert to uint8
+    image = convert_image_dtype(image, dtypes.uint8)
+    # Encode image to jpeg with given jpeg quality
+    image = gen_image_ops.encode_jpeg(image, quality=jpeg_quality)
+    # Decode jpeg image
+    image = gen_image_ops.decode_jpeg(image)
+    # Convert back to original dtype and return
+    return convert_image_dtype(image, orig_dtype)
+
+
 @tf_export('image.random_saturation')
 def random_saturation(image, lower, upper, seed=None):
   """Adjust the saturation of an RGB image by a random factor.
@@ -1483,13 +1652,13 @@ def is_jpeg(contents, name=None):
 
 
 @tf_export('image.decode_image')
-def decode_image(contents, channels=None, name=None):
+def decode_image(contents, channels=None, dtype=dtypes.uint8, name=None):
   """Convenience function for `decode_bmp`, `decode_gif`, `decode_jpeg`,
   and `decode_png`.
 
   Detects whether an image is a BMP, GIF, JPEG, or PNG, and performs the
-  appropriate operation to convert the input bytes `string` into a `Tensor` of
-  type `uint8`.
+  appropriate operation to convert the input bytes `string` into a `Tensor`
+  of type `dtype`.
 
   Note: `decode_gif` returns a 4-D array `[num_frames, height, width, 3]`, as
   opposed to `decode_bmp`, `decode_jpeg` and `decode_png`, which return 3-D
@@ -1501,10 +1670,11 @@ def decode_image(contents, channels=None, name=None):
     contents: 0-D `string`. The encoded image bytes.
     channels: An optional `int`. Defaults to `0`. Number of color channels for
       the decoded image.
+    dtype: The desired DType of the returned `Tensor`.
     name: A name for the operation (optional)
 
   Returns:
-    `Tensor` with type `uint8` with shape `[height, width, num_channels]` for
+    `Tensor` with type `dtype` and shape `[height, width, num_channels]` for
       BMP, JPEG, and PNG images and shape `[num_frames, height, width, 3]` for
       GIF images.
 
@@ -1528,7 +1698,7 @@ def decode_image(contents, channels=None, name=None):
       channels_msg = 'Channels must be in (None, 0, 3) when decoding BMP images'
       assert_channels = control_flow_ops.Assert(good_channels, [channels_msg])
       with ops.control_dependencies([assert_decode, assert_channels]):
-        return gen_image_ops.decode_bmp(contents)
+        return convert_image_dtype(gen_image_ops.decode_bmp(contents), dtype)
 
     def _gif():
       # Create assert to make sure that channels is not set to 1
@@ -1541,7 +1711,7 @@ def decode_image(contents, channels=None, name=None):
       channels_msg = 'Channels must be in (None, 0, 3) when decoding GIF images'
       assert_channels = control_flow_ops.Assert(good_channels, [channels_msg])
       with ops.control_dependencies([assert_channels]):
-        return gen_image_ops.decode_gif(contents)
+        return convert_image_dtype(gen_image_ops.decode_gif(contents), dtype)
 
     def check_gif():
       # Create assert op to check that bytes are GIF decodable
@@ -1550,7 +1720,11 @@ def decode_image(contents, channels=None, name=None):
 
     def _png():
       """Decodes a PNG image."""
-      return gen_image_ops.decode_png(contents, channels)
+      return convert_image_dtype(
+          gen_image_ops.decode_png(contents, channels,
+                                   dtype=dtypes.uint8
+                                   if dtype == dtypes.uint8
+                                   else dtypes.uint16), dtype)
 
     def check_png():
       """Checks if an image is PNG."""
@@ -1566,7 +1740,8 @@ def decode_image(contents, channels=None, name=None):
                       'images')
       assert_channels = control_flow_ops.Assert(good_channels, [channels_msg])
       with ops.control_dependencies([assert_channels]):
-        return gen_image_ops.decode_jpeg(contents, channels)
+        return convert_image_dtype(
+            gen_image_ops.decode_jpeg(contents, channels), dtype)
 
     # Decode normal JPEG images (start with \xff\xd8\xff\xe0)
     # as well as JPEG images with EXIF data (start with \xff\xd8\xff\xe1).
@@ -1772,7 +1947,7 @@ def non_max_suppression(boxes,
                         scores,
                         max_output_size,
                         iou_threshold=0.5,
-                        score_threshold=0.0,
+                        score_threshold=float('-inf'),
                         name=None):
   """Greedily selects a subset of bounding boxes in descending order of score.
 
diff --git a/tensorflow/python/ops/image_ops_test.py b/tensorflow/python/ops/image_ops_test.py
index c437c12c2744792eaee197bf7d2a5f2b75d280bf..2a6ab26e96b2339f005cbe8712f7b8f15304b1cc 100644
--- a/tensorflow/python/ops/image_ops_test.py
+++ b/tensorflow/python/ops/image_ops_test.py
@@ -533,6 +533,37 @@ class FlipImageBenchmark(test.Benchmark):
         iters=benchmark_rounds,
         wall_time=step_time)
 
+  def _benchmarkBatchedRandomFlipLeftRight(self, device, cpu_count):
+    image_shape = [16, 299, 299, 3]
+    warmup_rounds = 100
+    benchmark_rounds = 1000
+    config = config_pb2.ConfigProto()
+    if cpu_count is not None:
+      config.inter_op_parallelism_threads = 1
+      config.intra_op_parallelism_threads = cpu_count
+    with session.Session("", graph=ops.Graph(), config=config) as sess:
+      with ops.device(device):
+        inputs = variables.Variable(
+            random_ops.random_uniform(image_shape, dtype=dtypes.float32) * 255,
+            trainable=False,
+            dtype=dtypes.float32)
+        run_op = image_ops.random_flip_left_right(inputs)
+        sess.run(variables.global_variables_initializer())
+        for i in xrange(warmup_rounds + benchmark_rounds):
+          if i == warmup_rounds:
+            start = time.time()
+          sess.run(run_op)
+    end = time.time()
+    step_time = (end - start) / benchmark_rounds
+    tag = device + "_%s" % (cpu_count if cpu_count is not None else "_all")
+    print("benchmarkBatchedRandomFlipLeftRight_16_299_299_3_%s step_time: "
+          "%.2f us" %
+          (tag, step_time * 1e6))
+    self.report_benchmark(
+        name="benchmarkBatchedRandomFlipLeftRight_16_299_299_3_%s" % (tag),
+        iters=benchmark_rounds,
+        wall_time=step_time)
+
   def benchmarkFlipLeftRightCpu1(self):
     self._benchmarkFlipLeftRight("/cpu:0", 1)
 
@@ -551,6 +582,15 @@ class FlipImageBenchmark(test.Benchmark):
   def benchmarkRandomFlipLeftRightGpu(self):
     self._benchmarkRandomFlipLeftRight(test.gpu_device_name(), None)
 
+  def benchmarkBatchedRandomFlipLeftRightCpu1(self):
+    self._benchmarkBatchedRandomFlipLeftRight("/cpu:0", 1)
+
+  def benchmarkBatchedRandomFlipLeftRightCpuAll(self):
+    self._benchmarkBatchedRandomFlipLeftRight("/cpu:0", None)
+
+  def benchmarkBatchedRandomFlipLeftRightGpu(self):
+    self._benchmarkBatchedRandomFlipLeftRight(test.gpu_device_name(), None)
+
 
 class AdjustHueBenchmark(test.Benchmark):
 
@@ -987,7 +1027,7 @@ class FlipTransposeRotateTest(test_util.TensorFlowTestCase):
 
     with self.test_session(use_gpu=True):
       x_tf = constant_op.constant(x_np, shape=x_np.shape)
-      y = image_ops.random_flip_left_right(x_tf)
+      y = image_ops.random_flip_left_right(x_tf, seed=seed)
       self.assertTrue(y.op.name.startswith("random_flip_left_right"))
 
       count_flipped = 0
@@ -1008,6 +1048,50 @@ class FlipTransposeRotateTest(test_util.TensorFlowTestCase):
       self.assertGreaterEqual(count_flipped, 20)
       self.assertGreaterEqual(count_unflipped, 20)
 
+  def testRandomFlipLeftRightWithBatch(self):
+    batch_size = 16
+    seed = 42
+
+    # create single item of test data
+    x_np_raw = np.array(
+        [[1, 2, 3], [1, 2, 3]], dtype=np.uint8
+    ).reshape([1, 2, 3, 1])
+    y_np_raw = np.array(
+        [[3, 2, 1], [3, 2, 1]], dtype=np.uint8
+    ).reshape([1, 2, 3, 1])
+
+    # create batched test data
+    x_np = np.vstack([x_np_raw for _ in range(batch_size)])
+    y_np = np.vstack([y_np_raw for _ in range(batch_size)])
+
+    with self.test_session(use_gpu=True):
+      x_tf = constant_op.constant(x_np, shape=x_np.shape)
+      y = image_ops.random_flip_left_right(x_tf, seed=seed)
+      self.assertTrue(y.op.name.startswith("random_flip_left_right"))
+
+      count_flipped = 0
+      count_unflipped = 0
+      for _ in range(100):
+        y_tf = y.eval()
+
+        # check every element of the batch
+        for i in range(batch_size):
+          if y_tf[i][0][0] == 1:
+            self.assertAllEqual(y_tf[i], x_np[i])
+            count_unflipped += 1
+          else:
+            self.assertAllEqual(y_tf[i], y_np[i])
+            count_flipped += 1
+
+      # 100 trials, each containing batch_size elements
+      # Mean: 50 * batch_size
+      # Std Dev: ~5 * sqrt(batch_size)
+      # Six Sigma: 50 * batch_size - (5 * 6 * sqrt(batch_size))
+      #          = 50 * batch_size - 30 * sqrt(batch_size) = 800 - 30 * 4 = 680
+      six_sigma = 50 * batch_size - 30 * np.sqrt(batch_size)
+      self.assertGreaterEqual(count_flipped, six_sigma)
+      self.assertGreaterEqual(count_unflipped, six_sigma)
+
   def testInvolutionUpDown(self):
     x_np = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.uint8).reshape([2, 3, 1])
 
@@ -1057,9 +1141,11 @@ class FlipTransposeRotateTest(test_util.TensorFlowTestCase):
     x_np = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.uint8).reshape([2, 3, 1])
     y_np = np.array([[4, 5, 6], [1, 2, 3]], dtype=np.uint8).reshape([2, 3, 1])
 
+    seed = 42
+
     with self.test_session(use_gpu=True):
       x_tf = constant_op.constant(x_np, shape=x_np.shape)
-      y = image_ops.random_flip_up_down(x_tf, seed=42)
+      y = image_ops.random_flip_up_down(x_tf, seed=seed)
       self.assertTrue(y.op.name.startswith("random_flip_up_down"))
       count_flipped = 0
       count_unflipped = 0
@@ -1079,6 +1165,50 @@ class FlipTransposeRotateTest(test_util.TensorFlowTestCase):
       self.assertGreaterEqual(count_flipped, 20)
       self.assertGreaterEqual(count_unflipped, 20)
 
+  def testRandomFlipUpDownWithBatch(self):
+    batch_size = 16
+    seed = 42
+
+    # create single item of test data
+    x_np_raw = np.array(
+        [[1, 2, 3], [4, 5, 6]], dtype=np.uint8
+    ).reshape([1, 2, 3, 1])
+    y_np_raw = np.array(
+        [[4, 5, 6], [1, 2, 3]], dtype=np.uint8
+    ).reshape([1, 2, 3, 1])
+
+    # create batched test data
+    x_np = np.vstack([x_np_raw for _ in range(batch_size)])
+    y_np = np.vstack([y_np_raw for _ in range(batch_size)])
+
+    with self.test_session(use_gpu=True):
+      x_tf = constant_op.constant(x_np, shape=x_np.shape)
+      y = image_ops.random_flip_up_down(x_tf, seed=seed)
+      self.assertTrue(y.op.name.startswith("random_flip_up_down"))
+
+      count_flipped = 0
+      count_unflipped = 0
+      for _ in range(100):
+        y_tf = y.eval()
+
+        # check every element of the batch
+        for i in range(batch_size):
+          if y_tf[i][0][0] == 1:
+            self.assertAllEqual(y_tf[i], x_np[i])
+            count_unflipped += 1
+          else:
+            self.assertAllEqual(y_tf[i], y_np[i])
+            count_flipped += 1
+
+      # 100 trials, each containing batch_size elements
+      # Mean: 50 * batch_size
+      # Std Dev: ~5 * sqrt(batch_size)
+      # Six Sigma: 50 * batch_size - (5 * 6 * sqrt(batch_size))
+      #          = 50 * batch_size - 30 * sqrt(batch_size) = 800 - 30 * 4 = 680
+      six_sigma = 50 * batch_size - 30 * np.sqrt(batch_size)
+      self.assertGreaterEqual(count_flipped, six_sigma)
+      self.assertGreaterEqual(count_unflipped, six_sigma)
+
   def testInvolutionTranspose(self):
     x_np = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.uint8).reshape([2, 3, 1])
 
@@ -1156,6 +1286,7 @@ class FlipTransposeRotateTest(test_util.TensorFlowTestCase):
     #Ops that support 4D input
     for op in [
         image_ops.flip_left_right, image_ops.flip_up_down,
+        image_ops.random_flip_left_right, image_ops.random_flip_up_down,
         image_ops.transpose_image, image_ops.rot90
     ]:
       transformed_unknown_dims_4 = op(p_unknown_dims_4)
@@ -1166,14 +1297,6 @@ class FlipTransposeRotateTest(test_util.TensorFlowTestCase):
                                    "must be at least three-dimensional"):
         op(p_wrong_rank)
 
-    for op in [
-        image_ops.random_flip_left_right,
-        image_ops.random_flip_up_down,
-    ]:
-      with self.assertRaisesRegexp(ValueError, "must be three-dimensional"):
-        op(p_wrong_rank)
-
-
   def testRot90GroupOrder(self):
     image = np.arange(24, dtype=np.uint8).reshape([2, 4, 3])
     with self.test_session(use_gpu=True):
@@ -1208,41 +1331,6 @@ class FlipTransposeRotateTest(test_util.TensorFlowTestCase):
         y_np = np.rot90(image, k=k, axes=(1, 2))
         self.assertAllEqual(y_np, y_tf.eval({k_placeholder: k}))
 
-class RandomFlipTest(test_util.TensorFlowTestCase):
-
-  def testRandomLeftRight(self):
-    x_np = np.array([0, 1], dtype=np.uint8).reshape([1, 2, 1])
-    num_iterations = 500
-
-    hist = [0, 0]
-    with self.test_session(use_gpu=True):
-      x_tf = constant_op.constant(x_np, shape=x_np.shape)
-      y = image_ops.random_flip_left_right(x_tf)
-      for _ in xrange(num_iterations):
-        y_np = y.eval().flatten()[0]
-        hist[y_np] += 1
-
-    # Ensure that each entry is observed within 4 standard deviations.
-    four_stddev = 4.0 * np.sqrt(num_iterations / 2.0)
-    self.assertAllClose(hist, [num_iterations / 2.0] * 2, atol=four_stddev)
-
-  def testRandomUpDown(self):
-    x_np = np.array([0, 1], dtype=np.uint8).reshape([2, 1, 1])
-    num_iterations = 500
-
-    hist = [0, 0]
-    with self.test_session(use_gpu=True):
-      x_tf = constant_op.constant(x_np, shape=x_np.shape)
-      y = image_ops.random_flip_up_down(x_tf)
-      for _ in xrange(num_iterations):
-        y_np = y.eval().flatten()[0]
-        hist[y_np] += 1
-
-    # Ensure that each entry is observed within 4 standard deviations.
-    four_stddev = 4.0 * np.sqrt(num_iterations / 2.0)
-    self.assertAllClose(hist, [num_iterations / 2.0] * 2, atol=four_stddev)
-
-
 class AdjustContrastTest(test_util.TensorFlowTestCase):
 
   def _testContrast(self, x_np, y_np, contrast_factor):
@@ -1585,14 +1673,16 @@ class CentralCropTest(test_util.TensorFlowTestCase):
       self.assertEqual(y.get_shape().as_list(), post_shape)
 
   def testNoOp(self):
-    x_shape = [13, 9, 3]
-    x_np = np.ones(x_shape, dtype=np.float32)
-    with self.test_session(use_gpu=True):
-      x = constant_op.constant(x_np, shape=x_shape)
-      y = image_ops.central_crop(x, 1.0)
-      y_tf = y.eval()
-      self.assertAllEqual(y_tf, x_np)
-      self.assertEqual(y.op.name, x.op.name)
+    x_shapes = [[13, 9, 3], [5, 13, 9, 3]]
+    for x_shape in x_shapes:
+      x_np = np.ones(x_shape, dtype=np.float32)
+      for use_gpu in [True, False]:
+        with self.test_session(use_gpu=use_gpu):
+          x = constant_op.constant(x_np, shape=x_shape)
+          y = image_ops.central_crop(x, 1.0)
+          y_tf = y.eval()
+          self.assertAllEqual(y_tf, x_np)
+          self.assertEqual(y.op.name, x.op.name)
 
   def testCropping(self):
     x_shape = [4, 8, 1]
@@ -1601,6 +1691,23 @@ class CentralCropTest(test_util.TensorFlowTestCase):
          [1, 2, 3, 4, 5, 6, 7, 8], [1, 2, 3, 4, 5, 6, 7, 8]],
         dtype=np.int32).reshape(x_shape)
     y_np = np.array([[3, 4, 5, 6], [3, 4, 5, 6]]).reshape([2, 4, 1])
+    for use_gpu in [True, False]:
+      with self.test_session(use_gpu=use_gpu):
+        x = constant_op.constant(x_np, shape=x_shape)
+        y = image_ops.central_crop(x, 0.5)
+        y_tf = y.eval()
+        self.assertAllEqual(y_tf, y_np)
+        self.assertAllEqual(y_tf.shape, y_np.shape)
+
+    x_shape = [2, 4, 8, 1]
+    x_np = np.array(
+        [[1, 2, 3, 4, 5, 6, 7, 8], [1, 2, 3, 4, 5, 6, 7, 8],
+         [1, 2, 3, 4, 5, 6, 7, 8], [1, 2, 3, 4, 5, 6, 7, 8],
+         [8, 7, 6, 5, 4, 3, 2, 1], [8, 7, 6, 5, 4, 3, 2, 1],
+         [8, 7, 6, 5, 4, 3, 2, 1], [8, 7, 6, 5, 4, 3, 2, 1]],
+        dtype=np.int32).reshape(x_shape)
+    y_np = np.array([[[3, 4, 5, 6], [3, 4, 5, 6]],
+                     [[6, 5, 4, 3], [6, 5, 4, 3]]]).reshape([2, 2, 4, 1])
     with self.test_session(use_gpu=True):
       x = constant_op.constant(x_np, shape=x_shape)
       y = image_ops.central_crop(x, 0.5)
@@ -1610,52 +1717,87 @@ class CentralCropTest(test_util.TensorFlowTestCase):
 
   def testCropping2(self):
     # Test case for 10315
-    x_shape = [240, 320, 3]
-    x_np = np.zeros(x_shape, dtype=np.int32)
-    y_np = np.zeros([80, 106, 3], dtype=np.int32)
-    with self.test_session(use_gpu=True):
-      x = array_ops.placeholder(shape=x_shape, dtype=dtypes.int32)
-      y = image_ops.central_crop(x, 0.33)
-      y_tf = y.eval(feed_dict={x: x_np})
-      self.assertAllEqual(y_tf, y_np)
-      self.assertAllEqual(y_tf.shape, y_np.shape)
+    x_shapes = [[240, 320, 3], [5, 240, 320, 3]]
+    expected_y_shapes = [[80, 106, 3], [5, 80, 106, 3]]
+
+    for x_shape, y_shape in zip(x_shapes, expected_y_shapes):
+      x_np = np.zeros(x_shape, dtype=np.int32)
+      y_np = np.zeros(y_shape, dtype=np.int32)
+      for use_gpu in [True, False]:
+        with self.test_session(use_gpu=use_gpu):
+          x = array_ops.placeholder(shape=x_shape, dtype=dtypes.int32)
+          y = image_ops.central_crop(x, 0.33)
+          y_tf = y.eval(feed_dict={x: x_np})
+          self.assertAllEqual(y_tf, y_np)
+          self.assertAllEqual(y_tf.shape, y_np.shape)
 
   def testShapeInference(self):
-    # Test no-op fraction=1.0
+    # Test no-op fraction=1.0, with 3-D tensors.
     self._assertShapeInference([50, 60, 3], 1.0, [50, 60, 3])
     self._assertShapeInference([None, 60, 3], 1.0, [None, 60, 3])
     self._assertShapeInference([50, None, 3], 1.0, [50, None, 3])
     self._assertShapeInference([None, None, 3], 1.0, [None, None, 3])
     self._assertShapeInference([50, 60, None], 1.0, [50, 60, None])
     self._assertShapeInference([None, None, None], 1.0, [None, None, None])
-    self._assertShapeInference(None, 1.0, None)
-    # TODO(toddw): Currently central_crop() doesn't infer the result shape even
-    # when it's possible.  If we change it to do so, we can test as follows:
-    #
-    # self._assertShapeInference([50, 60, 3], 0.5, [25, 30, 3])
-    # self._assertShapeInference([None, 60, 3], 0.5, [None, 30, 3])
-    # self._assertShapeInference([50, None, 3], 0.5, [25, None, 3])
-    # self._assertShapeInference([None, None, 3], 0.5, [None, None, 3])
-    # self._assertShapeInference([50, 60, None], 0.5, [25, 30, None])
-    # self._assertShapeInference([None, None, None], 0.5, [None, None, None])
-    # self._assertShapeInference(None, 0.5, None)
 
-  def testError(self):
+    # Test no-op fraction=0.5, with 3-D tensors.
+    self._assertShapeInference([50, 60, 3], 0.5, [26, 30, 3])
+    self._assertShapeInference([None, 60, 3], 0.5, [None, 30, 3])
+    self._assertShapeInference([50, None, 3], 0.5, [26, None, 3])
+    self._assertShapeInference([None, None, 3], 0.5, [None, None, 3])
+    self._assertShapeInference([50, 60, None], 0.5, [26, 30, None])
+    self._assertShapeInference([None, None, None], 0.5, [None, None, None])
+
+    # Test no-op fraction=1.0, with 4-D tensors.
+    self._assertShapeInference([5, 50, 60, 3], 1.0, [5, 50, 60, 3])
+    self._assertShapeInference([5, None, 60, 3], 1.0, [5, None, 60, 3])
+    self._assertShapeInference([5, 50, None, 3], 1.0, [5, 50, None, 3])
+    self._assertShapeInference([5, None, None, 3], 1.0, [5, None, None, 3])
+    self._assertShapeInference([5, 50, 60, None], 1.0, [5, 50, 60, None])
+    self._assertShapeInference([5, None, None, None], 1.0,
+                               [5, None, None, None])
+    self._assertShapeInference([None, None, None, None], 1.0,
+                               [None, None, None, None])
+
+    # Test no-op fraction=0.5, with 4-D tensors.
+    self._assertShapeInference([5, 50, 60, 3], 0.5, [5, 26, 30, 3])
+    self._assertShapeInference([5, None, 60, 3], 0.5, [5, None, 30, 3])
+    self._assertShapeInference([5, 50, None, 3], 0.5, [5, 26, None, 3])
+    self._assertShapeInference([5, None, None, 3], 0.5, [5, None, None, 3])
+    self._assertShapeInference([5, 50, 60, None], 0.5, [5, 26, 30, None])
+    self._assertShapeInference([5, None, None, None], 0.5,
+                               [5, None, None, None])
+    self._assertShapeInference([None, None, None, None], 0.5,
+                               [None, None, None, None])
+
+  def testErrorOnInvalidCentralCropFractionValues(self):
     x_shape = [13, 9, 3]
     x_np = np.ones(x_shape, dtype=np.float32)
-    with self.test_session(use_gpu=True):
-      x = constant_op.constant(x_np, shape=x_shape)
-      with self.assertRaises(ValueError):
-        _ = image_ops.central_crop(x, 0.0)
-      with self.assertRaises(ValueError):
-        _ = image_ops.central_crop(x, 1.01)
+    for use_gpu in [True, False]:
+      with self.test_session(use_gpu=use_gpu):
+        x = constant_op.constant(x_np, shape=x_shape)
+        with self.assertRaises(ValueError):
+          _ = image_ops.central_crop(x, 0.0)
+        with self.assertRaises(ValueError):
+          _ = image_ops.central_crop(x, 1.01)
+
+  def testErrorOnInvalidShapes(self):
+    x_shapes = [None, [], [3], [3, 9], [3, 9, 3, 9, 3]]
+    for x_shape in x_shapes:
+      x_np = np.ones(x_shape, dtype=np.float32)
+      for use_gpu in [True, False]:
+        with self.test_session(use_gpu=use_gpu):
+          x = constant_op.constant(x_np, shape=x_shape)
+          with self.assertRaises(ValueError):
+            _ = image_ops.central_crop(x, 0.5)
 
   def testNameScope(self):
     x_shape = [13, 9, 3]
     x_np = np.ones(x_shape, dtype=np.float32)
-    with self.test_session(use_gpu=True):
-      y = image_ops.central_crop(x_np, 1.0)
-      self.assertTrue(y.op.name.startswith("central_crop"))
+    for use_gpu in [True, False]:
+      with self.test_session(use_gpu=use_gpu):
+        y = image_ops.central_crop(x_np, 1.0)
+        self.assertTrue(y.op.name.startswith("central_crop"))
 
 
 class PadToBoundingBoxTest(test_util.TensorFlowTestCase):
@@ -2457,6 +2599,86 @@ class ResizeImagesTest(test_util.TensorFlowTestCase):
       y = image_ops.resize_images(single_image, [55, 66])
       self.assertTrue(y.op.name.startswith("resize_images"))
 
+  def _ResizeImageCall(self, x, max_h, max_w, preserve_aspect_ratio,
+                       use_tensor_inputs):
+    if use_tensor_inputs:
+      target_max = ops.convert_to_tensor([max_h, max_w])
+      x_tensor = array_ops.placeholder(x.dtype, shape=[None] * x.ndim)
+      feed_dict = {x_tensor: x}
+    else:
+      target_max = [max_h, max_w]
+      x_tensor = x
+      feed_dict = {}
+
+    y = image_ops.resize_images(x_tensor, target_max,
+                                preserve_aspect_ratio=preserve_aspect_ratio)
+
+    with self.test_session(use_gpu=True):
+      return y.eval(feed_dict=feed_dict)
+
+  def _assertResizeEqual(self, x, x_shape, y, y_shape,
+                         preserve_aspect_ratio=True,
+                         use_tensor_inputs_options=None):
+    use_tensor_inputs_options = use_tensor_inputs_options or [False, True]
+    target_height, target_width, _ = y_shape
+    x = np.array(x).reshape(x_shape)
+    y = np.array(y).reshape(y_shape)
+
+    for use_tensor_inputs in use_tensor_inputs_options:
+      y_tf = self._ResizeImageCall(x, target_height, target_width,
+                                   preserve_aspect_ratio, use_tensor_inputs)
+      self.assertAllClose(y, y_tf)
+
+  def _assertResizeCheckShape(self, x, x_shape, target_shape,
+                              y_shape, preserve_aspect_ratio=True,
+                              use_tensor_inputs_options=None):
+    use_tensor_inputs_options = use_tensor_inputs_options or [False, True]
+    target_height, target_width = target_shape
+    x = np.array(x).reshape(x_shape)
+    y = np.zeros(y_shape)
+
+    for use_tensor_inputs in use_tensor_inputs_options:
+      y_tf = self._ResizeImageCall(x, target_height, target_width,
+                                   preserve_aspect_ratio, use_tensor_inputs)
+      self.assertShapeEqual(y, ops.convert_to_tensor(y_tf))
+
+  def testPreserveAspectRatioMultipleImages(self):
+    x_shape = [10, 100, 100, 10]
+    x = np.random.uniform(size=x_shape)
+
+    self._assertResizeCheckShape(x, x_shape, [250, 250], [10, 250, 250, 10],
+                                 preserve_aspect_ratio=False)
+
+  def testPreserveAspectRatioNoOp(self):
+    x_shape = [10, 10, 10]
+    x = np.random.uniform(size=x_shape)
+
+    self._assertResizeEqual(x, x_shape, x, x_shape)
+
+  def testPreserveAspectRatioSmaller(self):
+    x_shape = [100, 100, 10]
+    x = np.random.uniform(size=x_shape)
+
+    self._assertResizeCheckShape(x, x_shape, [75, 50], [50, 50, 10])
+
+  def testPreserveAspectRatioSmallerMultipleImages(self):
+    x_shape = [10, 100, 100, 10]
+    x = np.random.uniform(size=x_shape)
+
+    self._assertResizeCheckShape(x, x_shape, [75, 50], [10, 50, 50, 10])
+
+  def testPreserveAspectRatioLarger(self):
+    x_shape = [100, 100, 10]
+    x = np.random.uniform(size=x_shape)
+
+    self._assertResizeCheckShape(x, x_shape, [150, 200], [150, 150, 10])
+
+  def testPreserveAspectRatioSameRatio(self):
+    x_shape = [1920, 1080, 3]
+    x = np.random.uniform(size=x_shape)
+
+    self._assertResizeCheckShape(x, x_shape, [3840, 2160], [3840, 2160, 3])
+
 
 class ResizeImageWithCropOrPadTest(test_util.TensorFlowTestCase):
 
@@ -3746,5 +3968,88 @@ class SobelEdgesTest(test_util.TensorFlowTestCase):
       self.assertAllClose(expected_batch, actual_sobel)
 
 
+class DecodeImageTest(test_util.TensorFlowTestCase):
+
+  def testJpegUint16(self):
+    with self.test_session(use_gpu=True) as sess:
+      base = "tensorflow/core/lib/jpeg/testdata"
+      jpeg0 = io_ops.read_file(os.path.join(base, "jpeg_merge_test1.jpg"))
+      image0 = image_ops.decode_image(jpeg0, dtype=dtypes.uint16)
+      image1 = image_ops.convert_image_dtype(image_ops.decode_jpeg(jpeg0),
+                                             dtypes.uint16)
+      image0, image1 = sess.run([image0, image1])
+      self.assertAllEqual(image0, image1)
+
+  def testPngUint16(self):
+    with self.test_session(use_gpu=True) as sess:
+      base = "tensorflow/core/lib/png/testdata"
+      png0 = io_ops.read_file(os.path.join(base, "lena_rgba.png"))
+      image0 = image_ops.decode_image(png0, dtype=dtypes.uint16)
+      image1 = image_ops.convert_image_dtype(
+          image_ops.decode_png(png0, dtype=dtypes.uint16), dtypes.uint16)
+      image0, image1 = sess.run([image0, image1])
+      self.assertAllEqual(image0, image1)
+
+  def testGifUint16(self):
+    with self.test_session(use_gpu=True) as sess:
+      base = "tensorflow/core/lib/gif/testdata"
+      gif0 = io_ops.read_file(os.path.join(base, "scan.gif"))
+      image0 = image_ops.decode_image(gif0, dtype=dtypes.uint16)
+      image1 = image_ops.convert_image_dtype(image_ops.decode_gif(gif0),
+                                             dtypes.uint16)
+      image0, image1 = sess.run([image0, image1])
+      self.assertAllEqual(image0, image1)
+
+  def testBmpUint16(self):
+    with self.test_session(use_gpu=True) as sess:
+      base = "tensorflow/core/lib/bmp/testdata"
+      bmp0 = io_ops.read_file(os.path.join(base, "lena.bmp"))
+      image0 = image_ops.decode_image(bmp0, dtype=dtypes.uint16)
+      image1 = image_ops.convert_image_dtype(image_ops.decode_bmp(bmp0),
+                                             dtypes.uint16)
+      image0, image1 = sess.run([image0, image1])
+      self.assertAllEqual(image0, image1)
+
+  def testJpegFloat32(self):
+    with self.test_session(use_gpu=True) as sess:
+      base = "tensorflow/core/lib/jpeg/testdata"
+      jpeg0 = io_ops.read_file(os.path.join(base, "jpeg_merge_test1.jpg"))
+      image0 = image_ops.decode_image(jpeg0, dtype=dtypes.float32)
+      image1 = image_ops.convert_image_dtype(image_ops.decode_jpeg(jpeg0),
+                                             dtypes.float32)
+      image0, image1 = sess.run([image0, image1])
+      self.assertAllEqual(image0, image1)
+
+  def testPngFloat32(self):
+    with self.test_session(use_gpu=True) as sess:
+      base = "tensorflow/core/lib/png/testdata"
+      png0 = io_ops.read_file(os.path.join(base, "lena_rgba.png"))
+      image0 = image_ops.decode_image(png0, dtype=dtypes.float32)
+      image1 = image_ops.convert_image_dtype(
+          image_ops.decode_png(png0, dtype=dtypes.uint16), dtypes.float32)
+      image0, image1 = sess.run([image0, image1])
+      self.assertAllEqual(image0, image1)
+
+  def testGifFloat32(self):
+    with self.test_session(use_gpu=True) as sess:
+      base = "tensorflow/core/lib/gif/testdata"
+      gif0 = io_ops.read_file(os.path.join(base, "scan.gif"))
+      image0 = image_ops.decode_image(gif0, dtype=dtypes.float32)
+      image1 = image_ops.convert_image_dtype(image_ops.decode_gif(gif0),
+                                             dtypes.float32)
+      image0, image1 = sess.run([image0, image1])
+      self.assertAllEqual(image0, image1)
+
+  def testBmpFloat32(self):
+    with self.test_session(use_gpu=True) as sess:
+      base = "tensorflow/core/lib/bmp/testdata"
+      bmp0 = io_ops.read_file(os.path.join(base, "lena.bmp"))
+      image0 = image_ops.decode_image(bmp0, dtype=dtypes.float32)
+      image1 = image_ops.convert_image_dtype(image_ops.decode_bmp(bmp0),
+                                             dtypes.float32)
+      image0, image1 = sess.run([image0, image1])
+      self.assertAllEqual(image0, image1)
+
+
 if __name__ == "__main__":
   googletest.main()
diff --git a/tensorflow/python/ops/init_ops.py b/tensorflow/python/ops/init_ops.py
index 1f8d8dc4f3e7b84cea9850f5da08d8c5a189e096..724fcc39cddbdd2a8acd9c0bbaa7b968c6d1510d 100644
--- a/tensorflow/python/ops/init_ops.py
+++ b/tensorflow/python/ops/init_ops.py
@@ -86,7 +86,7 @@ class Initializer(object):
 
 
 @tf_export("keras.initializers.Zeros", "initializers.zeros",
-           "zeros_initializer")
+           "zeros_initializer", "keras.initializers.zeros")
 class Zeros(Initializer):
   """Initializer that generates tensors initialized to 0."""
 
@@ -102,7 +102,8 @@ class Zeros(Initializer):
     return {"dtype": self.dtype.name}
 
 
-@tf_export("keras.initializers.Ones", "initializers.ones", "ones_initializer")
+@tf_export("keras.initializers.Ones", "initializers.ones", "ones_initializer",
+           "keras.initializers.ones")
 class Ones(Initializer):
   """Initializer that generates tensors initialized to 1."""
 
@@ -119,7 +120,7 @@ class Ones(Initializer):
 
 
 @tf_export("keras.initializers.Constant", "initializers.constant",
-           "constant_initializer")
+           "constant_initializer", "keras.initializers.constant")
 class Constant(Initializer):
   """Initializer that generates tensors with constant values.
 
@@ -225,7 +226,8 @@ class Constant(Initializer):
 
 
 @tf_export("keras.initializers.RandomUniform", "initializers.random_uniform",
-           "random_uniform_initializer")
+           "random_uniform_initializer", "keras.initializers.uniform",
+           "keras.initializers.random_uniform")
 class RandomUniform(Initializer):
   """Initializer that generates tensors with a uniform distribution.
 
@@ -262,7 +264,8 @@ class RandomUniform(Initializer):
 
 
 @tf_export("keras.initializers.RandomNormal", "initializers.random_normal",
-           "random_normal_initializer")
+           "random_normal_initializer", "keras.initializers.normal",
+           "keras.initializers.random_normal")
 class RandomNormal(Initializer):
   """Initializer that generates tensors with a normal distribution.
 
@@ -299,7 +302,8 @@ class RandomNormal(Initializer):
 
 
 @tf_export("keras.initializers.TruncatedNormal",
-           "initializers.truncated_normal", "truncated_normal_initializer")
+           "initializers.truncated_normal", "truncated_normal_initializer",
+           "keras.initializers.truncated_normal")
 class TruncatedNormal(Initializer):
   """Initializer that generates a truncated normal distribution.
 
@@ -463,7 +467,8 @@ class VarianceScaling(Initializer):
     else:
       scale /= max(1., (fan_in + fan_out) / 2.)
     if self.distribution == "normal":
-      stddev = math.sqrt(scale)
+      # constant taken from scipy.stats.truncnorm.std(a=-2, b=2, loc=0., scale=1.)
+      stddev = math.sqrt(scale) / .87962566103423978
       return random_ops.truncated_normal(
           shape, 0.0, stddev, dtype, seed=self.seed)
     else:
@@ -482,7 +487,7 @@ class VarianceScaling(Initializer):
 
 
 @tf_export("keras.initializers.Orthogonal", "initializers.orthogonal",
-           "orthogonal_initializer")
+           "orthogonal_initializer", "keras.initializers.orthogonal")
 class Orthogonal(Initializer):
   """Initializer that generates an orthogonal matrix.
 
@@ -1062,7 +1067,8 @@ class ConvolutionOrthogonal3D(ConvolutionOrthogonal):
     return self._dict_to_tensor(p, ksize, ksize, ksize)
 
 
-@tf_export("keras.initializers.Identity", "initializers.identity")
+@tf_export("keras.initializers.Identity", "initializers.identity",
+           "keras.initializers.identity")
 class Identity(Initializer):
   """Initializer that generates the identity matrix.
 
diff --git a/tensorflow/python/ops/linalg_grad.py b/tensorflow/python/ops/linalg_grad.py
index 3cbbf3412a2a1bd974354a5819d410b4074ab47d..b6b98d5c86fd3285b35377c9158dcdb649b88a83 100644
--- a/tensorflow/python/ops/linalg_grad.py
+++ b/tensorflow/python/ops/linalg_grad.py
@@ -55,6 +55,17 @@ def _MatrixDeterminantGrad(op, grad):
   return multipliers * a_adj_inv
 
 
+@ops.RegisterGradient("LogMatrixDeterminant")
+def _LogMatrixDeterminantGrad(op, _, grad_b):
+  """Gradient for LogMatrixDeterminant."""
+  a = op.inputs[0]
+  c = op.outputs[1]
+  a_adj_inv = linalg_ops.matrix_inverse(a, adjoint=True)
+  multipliers = array_ops.reshape(
+      grad_b, array_ops.concat([array_ops.shape(c), [1, 1]], 0))
+  return multipliers * a_adj_inv
+
+
 @ops.RegisterGradient("Cholesky")
 def _CholeskyGrad(op, grad):
   """Gradient for Cholesky."""
diff --git a/tensorflow/python/ops/lookup_ops.py b/tensorflow/python/ops/lookup_ops.py
index 0e547689cc51857adb77791bfb94c2527cdffef2..fb51fbc6264d3b797d134005cbf1e700d0a9990c 100644
--- a/tensorflow/python/ops/lookup_ops.py
+++ b/tensorflow/python/ops/lookup_ops.py
@@ -366,6 +366,10 @@ class KeyValueTensorInitializer(TableInitializerBase):
     with ops.name_scope(
         self._name, values=(table.table_ref, self._keys,
                             self._values)) as scope:
+      if context.executing_eagerly():
+        # Ensure a unique name when eager execution is enabled to avoid spurious
+        # sharing issues.
+        scope += str(ops.uid())
       init_op = gen_lookup_ops.initialize_table_v2(
           table.table_ref, self._keys, self._values, name=scope)
     ops.add_to_collection(ops.GraphKeys.TABLE_INITIALIZERS, init_op)
@@ -1108,6 +1112,10 @@ def index_table_from_tensor(vocabulary_list,
 
     shared_name = ""
     with ops.name_scope(None, "hash_table") as hash_table_scope:
+      if context.executing_eagerly():
+        # Ensure a unique name when eager execution is enabled to avoid spurious
+        # sharing issues.
+        shared_name += str(ops.uid())
       table_keys = math_ops.to_int64(keys) if keys.dtype.is_integer else keys
       init = KeyValueTensorInitializer(
           table_keys,
diff --git a/tensorflow/python/ops/math_grad.py b/tensorflow/python/ops/math_grad.py
index 563c0b3ab3f6316b89f5ea76f5d075d9f4b77eea..a48b3c9395f48ecaf3879b95199f69c84f7f095a 100644
--- a/tensorflow/python/ops/math_grad.py
+++ b/tensorflow/python/ops/math_grad.py
@@ -620,6 +620,35 @@ def _DigammaGrad(op, grad):
     return grad * math_ops.polygamma(array_ops.constant(1, dtype=x.dtype), x)
 
 
+@ops.RegisterGradient("BesselI0e")
+def _BesselI0eGrad(op, grad):
+  """Compute gradient of bessel_i0e(x) with respect to its argument."""
+  x = op.inputs[0]
+  y = op.outputs[0]
+  with ops.control_dependencies([grad]):
+    return grad * (math_ops.bessel_i1e(x) - math_ops.sign(x) * y)
+
+
+@ops.RegisterGradient("BesselI1e")
+def _BesselI1eGrad(op, grad):
+  """Compute gradient of bessel_i1e(x) with respect to its argument."""
+  x = op.inputs[0]
+  y = op.outputs[0]
+  with ops.control_dependencies([grad]):
+    # For x = 0, the correct gradient is 0.5.
+    # However, the main branch gives NaN because of the division by x, so
+    # we impute the gradient manually.
+    # An alternative solution is to express the gradient via bessel_i0e and
+    # bessel_i2e, but the latter is not yet implemented in Eigen.
+    eps = np.finfo(x.dtype.as_numpy_dtype).eps
+    zeros = array_ops.zeros_like(x)
+    x_is_not_tiny = math_ops.abs(x) > eps
+    safe_x = array_ops.where(x_is_not_tiny, x, eps + zeros)
+    dy_dx = math_ops.bessel_i0e(safe_x) - y * (
+        math_ops.sign(safe_x) + math_ops.reciprocal(safe_x))
+    return grad * array_ops.where(x_is_not_tiny, dy_dx, 0.5 + zeros)
+
+
 @ops.RegisterGradient("Igamma")
 def _IgammaGrad(op, grad):
   """Returns gradient of igamma(a, x) with respect to x."""
diff --git a/tensorflow/python/ops/math_ops.py b/tensorflow/python/ops/math_ops.py
index 118b02c6c7ffb9e1f8bca6fec20325b3965b888f..466d0dadc8a430861ab27b6a522ca6acd2db7855 100644
--- a/tensorflow/python/ops/math_ops.py
+++ b/tensorflow/python/ops/math_ops.py
@@ -370,7 +370,7 @@ def erf(x, name=None):
   """Computes the Gauss error function of `x` element-wise.
 
   Args:
-    x: A `Tensor` of `SparseTensor`. Must be one of the following types: `half`,
+    x: A `Tensor` or `SparseTensor`. Must be one of the following types: `half`,
       `float32`, `float64`.
     name: A name for the operation (optional).
 
@@ -1617,7 +1617,7 @@ def reduce_all(input_tensor,
   entry in `axis`. If `keepdims` is true, the reduced dimensions
   are retained with length 1.
 
-  If `axis` has no entries, all dimensions are reduced, and a
+  If `axis` is None, all dimensions are reduced, and a
   tensor with a single element is returned.
 
   For example:
@@ -1675,7 +1675,7 @@ def reduce_any(input_tensor,
   entry in `axis`. If `keepdims` is true, the reduced dimensions
   are retained with length 1.
 
-  If `axis` has no entries, all dimensions are reduced, and a
+  If `axis` is None, all dimensions are reduced, and a
   tensor with a single element is returned.
 
   For example:
@@ -2225,8 +2225,8 @@ def sigmoid(x, name=None):
   Returns:
     A Tensor with the same type as `x`.
 
-  @compatibility(numpy)
-  Equivalent to np.scipy.special.expit
+  @compatibility(scipy)
+  Equivalent to scipy.special.expit
   @end_compatibility
   """
   with ops.name_scope(name, "Sigmoid", [x]) as name:
@@ -2954,6 +2954,67 @@ def polyval(coeffs, x, name=None):
       p = c + p * x
     return p
 
+
+@tf_export("math.bessel_i0e")
+def bessel_i0e(x, name=None):
+  """Computes the Bessel i0e function of `x` element-wise.
+
+  Exponentially scaled modified Bessel function of order 0 defined as
+  `bessel_i0e(x) = exp(-abs(x)) bessel_i0(x)`.
+
+  This function is faster and numerically stabler than `bessel_i0(x)`.
+
+  Args:
+    x: A `Tensor` or `SparseTensor`. Must be one of the following types: `half`,
+      `float32`, `float64`.
+    name: A name for the operation (optional).
+
+  Returns:
+    A `Tensor` or `SparseTensor`, respectively. Has the same type as `x`.
+
+  @compatibility(scipy)
+  Equivalent to scipy.special.i0e
+  @end_compatibility
+  """
+  with ops.name_scope(name, "bessel_i0e", [x]) as name:
+    if isinstance(x, sparse_tensor.SparseTensor):
+      x_i0e = gen_math_ops.bessel_i0e(x.values, name=name)
+      return sparse_tensor.SparseTensor(
+          indices=x.indices, values=x_i0e, dense_shape=x.dense_shape)
+    else:
+      return gen_math_ops.bessel_i0e(x, name=name)
+
+
+@tf_export("math.bessel_i1e")
+def bessel_i1e(x, name=None):
+  """Computes the Bessel i1e function of `x` element-wise.
+
+  Exponentially scaled modified Bessel function of order 1 defined as
+  `bessel_i1e(x) = exp(-abs(x)) bessel_i1(x)`.
+
+  This function is faster and numerically stabler than `bessel_i1(x)`.
+
+  Args:
+    x: A `Tensor` or `SparseTensor`. Must be one of the following types: `half`,
+      `float32`, `float64`.
+    name: A name for the operation (optional).
+
+  Returns:
+    A `Tensor` or `SparseTensor`, respectively. Has the same type as `x`.
+
+  @compatibility(scipy)
+  Equivalent to scipy.special.i1e
+  @end_compatibility
+  """
+  with ops.name_scope(name, "bessel_i1e", [x]) as name:
+    if isinstance(x, sparse_tensor.SparseTensor):
+      x_i1e = gen_math_ops.bessel_i1e(x.values, name=name)
+      return sparse_tensor.SparseTensor(
+          indices=x.indices, values=x_i1e, dense_shape=x.dense_shape)
+    else:
+      return gen_math_ops.bessel_i1e(x, name=name)
+
+
 # FFT ops were moved to tf.spectral. tf.fft symbols were part of the TensorFlow
 # 1.0 API so we leave these here for backwards compatibility.
 fft = gen_spectral_ops.fft
diff --git a/tensorflow/python/ops/math_ops_test.py b/tensorflow/python/ops/math_ops_test.py
index 980c92b0d592bccc34e1fbee636ebdd39056f2fc..c807c8bc2efbf867700f0df37783b02fefa0ca82 100644
--- a/tensorflow/python/ops/math_ops_test.py
+++ b/tensorflow/python/ops/math_ops_test.py
@@ -235,6 +235,15 @@ class ApproximateEqualTest(test_util.TensorFlowTestCase):
         z_tf = self.evaluate(math_ops.approximate_equal(x, y, tolerance=0.0001))
         self.assertAllEqual(z, z_tf)
 
+  def testApproximateEqualShape(self):
+    for dtype in [np.float32, np.double]:
+      x = np.array([1, 2], dtype=np.float32)
+      y = np.array([[1, 2]], dtype=np.float32)
+      # The inputs 'x' and 'y' must have the same shape.
+      with self.assertRaisesRegexp(
+          ValueError, "Shapes must be equal rank, but are 1 and 2"):
+        math_ops.approximate_equal(x, y)
+
 
 class ScalarMulTest(test_util.TensorFlowTestCase):
 
diff --git a/tensorflow/python/ops/metrics_impl.py b/tensorflow/python/ops/metrics_impl.py
index 47eea6ef6b58abd4819544e29783048964104922..5eab12c41d5f781b496a3a1bcfa9ce35fca4fc54 100644
--- a/tensorflow/python/ops/metrics_impl.py
+++ b/tensorflow/python/ops/metrics_impl.py
@@ -34,21 +34,54 @@ from tensorflow.python.ops import state_ops
 from tensorflow.python.ops import variable_scope
 from tensorflow.python.ops import weights_broadcast_ops
 from tensorflow.python.platform import tf_logging as logging
+from tensorflow.python.training import distribute as distribute_lib
 from tensorflow.python.util.deprecation import deprecated
 from tensorflow.python.util.tf_export import tf_export
 
 
 def metric_variable(shape, dtype, validate_shape=True, name=None):
-  """Create variable in `GraphKeys.(LOCAL|METRIC_VARIABLES`) collections."""
-
-  return variable_scope.variable(
-      lambda: array_ops.zeros(shape, dtype),
-      trainable=False,
-      collections=[
-          ops.GraphKeys.LOCAL_VARIABLES, ops.GraphKeys.METRIC_VARIABLES
-      ],
-      validate_shape=validate_shape,
-      name=name)
+  """Create variable in `GraphKeys.(LOCAL|METRIC_VARIABLES)` collections.
+
+  If running in a `DistributionStrategy` context, the variable will be
+  "tower local". This means:
+
+  *   The returned object will be a container with separate variables
+      per replica/tower of the model.
+
+  *   When writing to the variable, e.g. using `assign_add` in a metric
+      update, the update will be applied to the variable local to the
+      replica/tower.
+
+  *   To get a metric's result value, we need to sum the variable values
+      across the replicas/towers before computing the final answer.
+      Furthermore, the final answer should be computed once instead of
+      in every replica/tower. Both of these are accomplished by
+      running the computation of the final result value inside
+      `tf.contrib.distribute.get_tower_context().merge_call(fn)`.
+      Inside the `merge_call()`, ops are only added to the graph once
+      and access to a tower-local variable in a computation returns
+      the sum across all replicas/towers.
+
+  Args:
+    shape: Shape of the created variable.
+    dtype: Type of the created variable.
+    validate_shape: (Optional) Whether shape validation is enabled for
+      the created variable.
+    name: (Optional) String name of the created variable.
+
+  Returns:
+    A (non-trainable) variable initialized to zero, or if inside a
+    `DistributionStrategy` scope a tower-local variable container.
+  """
+  with distribute_lib.get_tower_context().tower_local_var_scope('sum'):
+    # Note that "tower local" implies trainable=False.
+    return variable_scope.variable(
+        lambda: array_ops.zeros(shape, dtype),
+        collections=[
+            ops.GraphKeys.LOCAL_VARIABLES, ops.GraphKeys.METRIC_VARIABLES
+        ],
+        validate_shape=validate_shape,
+        name=name)
 
 
 def _remove_squeezable_dimensions(predictions, labels, weights):
@@ -333,11 +366,15 @@ def mean(values,
     with ops.control_dependencies([values]):
       update_count_op = state_ops.assign_add(count, num_values)
 
-    mean_t = _safe_div(total, count, 'value')
-    update_op = _safe_div(update_total_op, update_count_op, 'update_op')
+    def aggregate_across_towers(_, t, c):
+      mean_t = _safe_div(t, c, 'value')
+      if metrics_collections:
+        ops.add_to_collections(metrics_collections, mean_t)
+      return mean_t
 
-    if metrics_collections:
-      ops.add_to_collections(metrics_collections, mean_t)
+    mean_t = distribute_lib.get_tower_context().merge_call(
+        aggregate_across_towers, total, count)
+    update_op = _safe_div(update_total_op, update_count_op, 'update_op')
 
     if updates_collections:
       ops.add_to_collections(updates_collections, update_op)
@@ -572,6 +609,17 @@ def _confusion_matrix_at_thresholds(labels,
   return values, update_ops
 
 
+def _aggregate_variable(v, collections):
+
+  def f(distribution, value):
+    value = distribution.read_var(value)
+    if collections:
+      ops.add_to_collections(collections, value)
+    return value
+
+  return distribute_lib.get_tower_context().merge_call(f, v)
+
+
 @tf_export('metrics.auc')
 def auc(labels,
         predictions,
@@ -757,14 +805,18 @@ def auc(labels,
         raise ValueError('Invalid summation_method: %s' % summation_method)
 
     # sum up the areas of all the trapeziums
-    auc_value = compute_auc(values['tp'], values['fn'], values['tn'],
-                            values['fp'], 'value')
+    def aggregate_auc(_, values):
+      auc_value = compute_auc(values['tp'], values['fn'], values['tn'],
+                              values['fp'], 'value')
+      if metrics_collections:
+        ops.add_to_collections(metrics_collections, auc_value)
+      return auc_value
+
+    auc_value = distribute_lib.get_tower_context().merge_call(
+        aggregate_auc, values)
     update_op = compute_auc(update_ops['tp'], update_ops['fn'],
                             update_ops['tn'], update_ops['fp'], 'update_op')
 
-    if metrics_collections:
-      ops.add_to_collections(metrics_collections, auc_value)
-
     if updates_collections:
       ops.add_to_collections(updates_collections, update_op)
 
@@ -992,15 +1044,18 @@ def mean_per_class_accuracy(labels,
     update_total_op = state_ops.scatter_add(total, labels, ones)
     update_count_op = state_ops.scatter_add(count, labels, is_correct)
 
-    per_class_accuracy = _safe_div(count, total, None)
+    def aggregate_mean_accuracy(_, count, total):
+      per_class_accuracy = _safe_div(count, total, None)
+      mean_accuracy_v = math_ops.reduce_mean(
+          per_class_accuracy, name='mean_accuracy')
+      if metrics_collections:
+        ops.add_to_collections(metrics_collections, mean_accuracy_v)
+      return mean_accuracy_v
 
-    mean_accuracy_v = math_ops.reduce_mean(
-        per_class_accuracy, name='mean_accuracy')
-    update_op = _safe_div(update_count_op, update_total_op, name='update_op')
-
-    if metrics_collections:
-      ops.add_to_collections(metrics_collections, mean_accuracy_v)
+    mean_accuracy_v = distribute_lib.get_tower_context().merge_call(
+        aggregate_mean_accuracy, count, total)
 
+    update_op = _safe_div(update_count_op, update_total_op, name='update_op')
     if updates_collections:
       ops.add_to_collections(updates_collections, update_op)
 
@@ -1071,7 +1126,7 @@ def mean_iou(labels,
     total_cm, update_op = _streaming_confusion_matrix(labels, predictions,
                                                       num_classes, weights)
 
-    def compute_mean_iou(name):
+    def compute_mean_iou(total_cm, name):
       """Compute the mean intersection-over-union via the confusion matrix."""
       sum_over_row = math_ops.to_float(math_ops.reduce_sum(total_cm, 0))
       sum_over_col = math_ops.to_float(math_ops.reduce_sum(total_cm, 1))
@@ -1098,10 +1153,14 @@ def mean_iou(labels,
           math_ops.reduce_sum(iou, name=name) / num_valid_entries, 0)
       return result
 
-    mean_iou_v = compute_mean_iou('mean_iou')
+    def mean_iou_across_towers(_, v):
+      mean_iou_v = compute_mean_iou(v, 'mean_iou')
+      if metrics_collections:
+        ops.add_to_collections(metrics_collections, mean_iou_v)
+      return mean_iou_v
 
-    if metrics_collections:
-      ops.add_to_collections(metrics_collections, mean_iou_v)
+    mean_iou_v = distribute_lib.get_tower_context().merge_call(
+        mean_iou_across_towers, total_cm)
 
     if updates_collections:
       ops.add_to_collections(updates_collections, update_op)
@@ -1310,12 +1369,16 @@ def mean_tensor(values,
     with ops.control_dependencies([values]):
       update_count_op = state_ops.assign_add(count, num_values)
 
-    mean_t = _safe_div(total, count, 'value')
-    update_op = _safe_div(update_total_op, update_count_op, 'update_op')
+    def aggregate_across_towers(_, t, c):
+      mean_t = _safe_div(t, c, 'value')
+      if metrics_collections:
+        ops.add_to_collections(metrics_collections, mean_t)
+      return mean_t
 
-    if metrics_collections:
-      ops.add_to_collections(metrics_collections, mean_t)
+    mean_t = distribute_lib.get_tower_context().merge_call(
+        aggregate_across_towers, total, count)
 
+    update_op = _safe_div(update_total_op, update_count_op, 'update_op')
     if updates_collections:
       ops.add_to_collections(updates_collections, update_op)
 
@@ -1413,12 +1476,9 @@ def _count_condition(values,
       weights = math_ops.to_float(weights)
       values = math_ops.multiply(values, weights)
 
-  value_tensor = array_ops.identity(count)
-  update_op = state_ops.assign_add(count, math_ops.reduce_sum(values))
-
-  if metrics_collections:
-    ops.add_to_collections(metrics_collections, value_tensor)
+  value_tensor = _aggregate_variable(count, metrics_collections)
 
+  update_op = state_ops.assign_add(count, math_ops.reduce_sum(values))
   if updates_collections:
     ops.add_to_collections(updates_collections, update_op)
 
@@ -1525,13 +1585,12 @@ def false_negatives_at_thresholds(labels,
     values, update_ops = _confusion_matrix_at_thresholds(
         labels, predictions, thresholds, weights=weights, includes=('fn',))
 
-    if metrics_collections:
-      ops.add_to_collections(metrics_collections, values['fn'])
+    fn_value = _aggregate_variable(values['fn'], metrics_collections)
 
     if updates_collections:
       ops.add_to_collections(updates_collections, update_ops['fn'])
 
-    return values['fn'], update_ops['fn']
+    return fn_value, update_ops['fn']
 
 
 @tf_export('metrics.false_positives')
@@ -1635,13 +1694,12 @@ def false_positives_at_thresholds(labels,
     values, update_ops = _confusion_matrix_at_thresholds(
         labels, predictions, thresholds, weights=weights, includes=('fp',))
 
-    if metrics_collections:
-      ops.add_to_collections(metrics_collections, values['fp'])
+    fp_value = _aggregate_variable(values['fp'], metrics_collections)
 
     if updates_collections:
       ops.add_to_collections(updates_collections, update_ops['fp'])
 
-    return values['fp'], update_ops['fp']
+    return fp_value, update_ops['fp']
 
 
 @tf_export('metrics.true_negatives')
@@ -1745,13 +1803,12 @@ def true_negatives_at_thresholds(labels,
     values, update_ops = _confusion_matrix_at_thresholds(
         labels, predictions, thresholds, weights=weights, includes=('tn',))
 
-    if metrics_collections:
-      ops.add_to_collections(metrics_collections, values['tn'])
+    tn_value = _aggregate_variable(values['tn'], metrics_collections)
 
     if updates_collections:
       ops.add_to_collections(updates_collections, update_ops['tn'])
 
-    return values['tn'], update_ops['tn']
+    return tn_value, update_ops['tn']
 
 
 @tf_export('metrics.true_positives')
@@ -1855,13 +1912,12 @@ def true_positives_at_thresholds(labels,
     values, update_ops = _confusion_matrix_at_thresholds(
         labels, predictions, thresholds, weights=weights, includes=('tp',))
 
-    if metrics_collections:
-      ops.add_to_collections(metrics_collections, values['tp'])
+    tp_value = _aggregate_variable(values['tp'], metrics_collections)
 
     if updates_collections:
       ops.add_to_collections(updates_collections, update_ops['tp'])
 
-    return values['tp'], update_ops['tp']
+    return tp_value, update_ops['tp']
 
 
 @tf_export('metrics.precision')
@@ -1945,13 +2001,17 @@ def precision(labels,
       return array_ops.where(
           math_ops.greater(tp + fp, 0), math_ops.div(tp, tp + fp), 0, name)
 
-    p = compute_precision(true_p, false_p, 'value')
-    update_op = compute_precision(true_positives_update_op,
-                                  false_positives_update_op, 'update_op')
+    def once_across_towers(_, true_p, false_p):
+      p = compute_precision(true_p, false_p, 'value')
+      if metrics_collections:
+        ops.add_to_collections(metrics_collections, p)
+      return p
 
-    if metrics_collections:
-      ops.add_to_collections(metrics_collections, p)
+    p = distribute_lib.get_tower_context().merge_call(
+        once_across_towers, true_p, false_p)
 
+    update_op = compute_precision(true_positives_update_op,
+                                  false_positives_update_op, 'update_op')
     if updates_collections:
       ops.add_to_collections(updates_collections, update_op)
 
@@ -2025,13 +2085,17 @@ def precision_at_thresholds(labels,
     def compute_precision(tp, fp, name):
       return math_ops.div(tp, epsilon + tp + fp, name='precision_' + name)
 
-    prec = compute_precision(values['tp'], values['fp'], 'value')
-    update_op = compute_precision(update_ops['tp'], update_ops['fp'],
-                                  'update_op')
+    def precision_across_towers(_, values):
+      prec = compute_precision(values['tp'], values['fp'], 'value')
+      if metrics_collections:
+        ops.add_to_collections(metrics_collections, prec)
+      return prec
 
-    if metrics_collections:
-      ops.add_to_collections(metrics_collections, prec)
+    prec = distribute_lib.get_tower_context().merge_call(
+        precision_across_towers, values)
 
+    update_op = compute_precision(update_ops['tp'], update_ops['fp'],
+                                  'update_op')
     if updates_collections:
       ops.add_to_collections(updates_collections, update_op)
 
@@ -2050,7 +2114,7 @@ def recall(labels,
   The `recall` function creates two local variables, `true_positives`
   and `false_negatives`, that are used to compute the recall. This value is
   ultimately returned as `recall`, an idempotent operation that simply divides
-  `true_positives` by the sum of `true_positives`  and `false_negatives`.
+  `true_positives` by the sum of `true_positives` and `false_negatives`.
 
   For estimation of the metric over a stream of data, the function creates an
   `update_op` that updates these variables and returns the `recall`. `update_op`
@@ -2117,13 +2181,17 @@ def recall(labels,
           math_ops.greater(true_p + false_n, 0),
           math_ops.div(true_p, true_p + false_n), 0, name)
 
-    rec = compute_recall(true_p, false_n, 'value')
-    update_op = compute_recall(true_positives_update_op,
-                               false_negatives_update_op, 'update_op')
+    def once_across_towers(_, true_p, false_n):
+      rec = compute_recall(true_p, false_n, 'value')
+      if metrics_collections:
+        ops.add_to_collections(metrics_collections, rec)
+      return rec
 
-    if metrics_collections:
-      ops.add_to_collections(metrics_collections, rec)
+    rec = distribute_lib.get_tower_context().merge_call(
+        once_across_towers, true_p, false_n)
 
+    update_op = compute_recall(true_positives_update_op,
+                               false_negatives_update_op, 'update_op')
     if updates_collections:
       ops.add_to_collections(updates_collections, update_op)
 
@@ -2552,11 +2620,17 @@ def recall_at_top_k(labels,
         class_id=class_id,
         weights=weights)
 
-    metric = math_ops.div(tp, math_ops.add(tp, fn), name=scope)
+    def aggregate_across_towers(_, tp, fn):
+      metric = math_ops.div(tp, math_ops.add(tp, fn), name=scope)
+      if metrics_collections:
+        ops.add_to_collections(metrics_collections, metric)
+      return metric
+
+    metric = distribute_lib.get_tower_context().merge_call(
+        aggregate_across_towers, tp, fn)
+
     update = math_ops.div(
         tp_update, math_ops.add(tp_update, fn_update), name='update')
-    if metrics_collections:
-      ops.add_to_collections(metrics_collections, metric)
     if updates_collections:
       ops.add_to_collections(updates_collections, update)
     return metric, update
@@ -2627,12 +2701,16 @@ def recall_at_thresholds(labels,
     def compute_recall(tp, fn, name):
       return math_ops.div(tp, epsilon + tp + fn, name='recall_' + name)
 
-    rec = compute_recall(values['tp'], values['fn'], 'value')
-    update_op = compute_recall(update_ops['tp'], update_ops['fn'], 'update_op')
+    def recall_across_towers(_, values):
+      rec = compute_recall(values['tp'], values['fn'], 'value')
+      if metrics_collections:
+        ops.add_to_collections(metrics_collections, rec)
+      return rec
 
-    if metrics_collections:
-      ops.add_to_collections(metrics_collections, rec)
+    rec = distribute_lib.get_tower_context().merge_call(
+        recall_across_towers, values)
 
+    update_op = compute_recall(update_ops['tp'], update_ops['fn'], 'update_op')
     if updates_collections:
       ops.add_to_collections(updates_collections, update_op)
 
@@ -2698,13 +2776,16 @@ def root_mean_squared_error(labels,
   mse, update_mse_op = mean_squared_error(labels, predictions, weights, None,
                                           None, name or
                                           'root_mean_squared_error')
+  def once_across_towers(_, mse):
+    rmse = math_ops.sqrt(mse)
+    if metrics_collections:
+      ops.add_to_collections(metrics_collections, rmse)
+    return rmse
 
-  rmse = math_ops.sqrt(mse)
-  update_rmse_op = math_ops.sqrt(update_mse_op)
-
-  if metrics_collections:
-    ops.add_to_collections(metrics_collections, rmse)
+  rmse = distribute_lib.get_tower_context().merge_call(
+      once_across_towers, mse)
 
+  update_rmse_op = math_ops.sqrt(update_mse_op)
   if updates_collections:
     ops.add_to_collections(updates_collections, update_rmse_op)
 
@@ -2797,15 +2878,19 @@ def sensitivity_at_specificity(labels,
       return math_ops.div(tp[tf_index], tp[tf_index] + fn[tf_index] + kepsilon,
                           name)
 
-    sensitivity = compute_sensitivity_at_specificity(
-        values['tp'], values['tn'], values['fp'], values['fn'], 'value')
+    def aggregate_across_towers(_, values):
+      sensitivity = compute_sensitivity_at_specificity(
+          values['tp'], values['tn'], values['fp'], values['fn'], 'value')
+      if metrics_collections:
+        ops.add_to_collections(metrics_collections, sensitivity)
+      return sensitivity
+
+    sensitivity = distribute_lib.get_tower_context().merge_call(
+        aggregate_across_towers, values)
+
     update_op = compute_sensitivity_at_specificity(
         update_ops['tp'], update_ops['tn'], update_ops['fp'], update_ops['fn'],
         'update_op')
-
-    if metrics_collections:
-      ops.add_to_collections(metrics_collections, sensitivity)
-
     if updates_collections:
       ops.add_to_collections(updates_collections, update_op)
 
@@ -3070,11 +3155,16 @@ def _streaming_sparse_average_precision_at_top_k(labels,
       total_update = state_ops.assign_add(total_var, batch_total, name='update')
 
     # Divide total by max to get mean, for both vars and the update ops.
-    mean_average_precision = _safe_scalar_div(total_var, max_var, name='mean')
-    update = _safe_scalar_div(total_update, max_update, name=scope)
+    def aggregate_across_towers(_, total_var, max_var):
+      mean_average_precision = _safe_scalar_div(total_var, max_var, name='mean')
+      if metrics_collections:
+        ops.add_to_collections(metrics_collections, mean_average_precision)
+      return mean_average_precision
 
-    if metrics_collections:
-      ops.add_to_collections(metrics_collections, mean_average_precision)
+    mean_average_precision = distribute_lib.get_tower_context().merge_call(
+        aggregate_across_towers, total_var, max_var)
+
+    update = _safe_scalar_div(total_update, max_update, name=scope)
     if updates_collections:
       ops.add_to_collections(updates_collections, update)
 
@@ -3351,11 +3441,17 @@ def precision_at_top_k(labels,
         class_id=class_id,
         weights=weights)
 
-    metric = math_ops.div(tp, math_ops.add(tp, fp), name=scope)
+    def aggregate_across_towers(_, tp, fp):
+      metric = math_ops.div(tp, math_ops.add(tp, fp), name=scope)
+      if metrics_collections:
+        ops.add_to_collections(metrics_collections, metric)
+      return metric
+
+    metric = distribute_lib.get_tower_context().merge_call(
+        aggregate_across_towers, tp, fp)
+
     update = math_ops.div(
         tp_update, math_ops.add(tp_update, fp_update), name='update')
-    if metrics_collections:
-      ops.add_to_collections(metrics_collections, metric)
     if updates_collections:
       ops.add_to_collections(updates_collections, update)
     return metric, update
@@ -3583,15 +3679,19 @@ def specificity_at_sensitivity(labels,
       return math_ops.div(tn[tf_index], tn[tf_index] + fp[tf_index] + kepsilon,
                           name)
 
-    specificity = compute_specificity_at_sensitivity(
-        values['tp'], values['tn'], values['fp'], values['fn'], 'value')
+    def aggregate_across_towers(_, values):
+      specificity = compute_specificity_at_sensitivity(
+          values['tp'], values['tn'], values['fp'], values['fn'], 'value')
+      if metrics_collections:
+        ops.add_to_collections(metrics_collections, specificity)
+      return specificity
+
+    specificity = distribute_lib.get_tower_context().merge_call(
+        aggregate_across_towers, values)
+
     update_op = compute_specificity_at_sensitivity(
         update_ops['tp'], update_ops['tn'], update_ops['fp'], update_ops['fn'],
         'update_op')
-
-    if metrics_collections:
-      ops.add_to_collections(metrics_collections, specificity)
-
     if updates_collections:
       ops.add_to_collections(updates_collections, update_op)
 
diff --git a/tensorflow/python/ops/nn_impl.py b/tensorflow/python/ops/nn_impl.py
index 783d4858925d3e2b1ca210a8162a2b4df07d3089..f47f38e29e328ea92bfc494d60673c70a58274d3 100644
--- a/tensorflow/python/ops/nn_impl.py
+++ b/tensorflow/python/ops/nn_impl.py
@@ -621,7 +621,7 @@ def normalize_moments(counts, mean_ss, variance_ss, shift, name=None):
   """Calculate the mean and variance of based on the sufficient statistics.
 
   Args:
-    counts: A `Tensor` containing a the total count of the data (one value).
+    counts: A `Tensor` containing the total count of the data (one value).
     mean_ss: A `Tensor` containing the mean sufficient statistics: the (possibly
       shifted) sum of the elements to average over.
     variance_ss: A `Tensor` containing the variance sufficient statistics: the
@@ -689,6 +689,9 @@ def moments(
     # Compute true mean while keeping the dims for proper broadcasting.
     mean = math_ops.reduce_mean(y, axes, keepdims=True, name="mean")
     # sample variance, not unbiased variance
+    # Note: stop_gradient does not change the gradient that gets 
+    #       backpropagated to the mean from the variance calculation,
+    #       because that gradient is zero
     variance = math_ops.reduce_mean(
         math_ops.squared_difference(y, array_ops.stop_gradient(mean)),
         axes,
diff --git a/tensorflow/python/ops/nn_ops.py b/tensorflow/python/ops/nn_ops.py
index 54b08a564b1d7da3d4e30d593c18ee3811cff3d4..0c2f5b06c497e8ca7db20ac09938c86b425d66a0 100644
--- a/tensorflow/python/ops/nn_ops.py
+++ b/tensorflow/python/ops/nn_ops.py
@@ -2311,13 +2311,22 @@ def dropout(x, keep_prob, noise_shape=None, seed=None, name=None):  # pylint: di
     if isinstance(keep_prob, numbers.Real) and not 0 < keep_prob <= 1:
       raise ValueError("keep_prob must be a scalar tensor or a float in the "
                        "range (0, 1], got %g" % keep_prob)
-    keep_prob = ops.convert_to_tensor(
-        keep_prob, dtype=x.dtype, name="keep_prob")
-    keep_prob.get_shape().assert_is_compatible_with(tensor_shape.scalar())
 
-    # Do nothing if we know keep_prob == 1
-    if tensor_util.constant_value(keep_prob) == 1:
+    # Early return if nothing needs to be dropped.
+    if isinstance(keep_prob, float) and keep_prob == 1:
       return x
+    if context.executing_eagerly():
+      if isinstance(keep_prob, ops.EagerTensor):
+        if keep_prob.numpy() == 1:
+          return x
+    else:
+      keep_prob = ops.convert_to_tensor(
+          keep_prob, dtype=x.dtype, name="keep_prob")
+      keep_prob.get_shape().assert_is_compatible_with(tensor_shape.scalar())
+
+      # Do nothing if we know keep_prob == 1
+      if tensor_util.constant_value(keep_prob) == 1:
+        return x
 
     noise_shape = _get_noise_shape(x, noise_shape)
 
diff --git a/tensorflow/python/ops/resource_variable_ops.py b/tensorflow/python/ops/resource_variable_ops.py
index e5b80200c09f867faba96329c3515f88c654a554..de44a3e848586d92b3b3155edfbfcadc47755089 100644
--- a/tensorflow/python/ops/resource_variable_ops.py
+++ b/tensorflow/python/ops/resource_variable_ops.py
@@ -507,6 +507,9 @@ class ResourceVariable(variables.Variable):
           else:
             self._cached_value = None
         if not context.executing_eagerly():
+          # Eager variables are only added to collections if they are part of an
+          # eager variable store (otherwise in an interactive session they would
+          # hog memory and cause OOM). This is done in ops/variable_scope.py.
           ops.add_to_collections(collections, self)
         elif ops.GraphKeys.GLOBAL_STEP in collections:
           ops.add_to_collections(ops.GraphKeys.GLOBAL_STEP, self)
@@ -551,6 +554,7 @@ class ResourceVariable(variables.Variable):
                                  import_scope=import_scope))
     else:
       self._initial_value = None
+    self._trainable = getattr(variable_def, "trainable", True)
     if variable_def.snapshot_name:
       snapshot = g.as_graph_element(
           ops.prepend_name_scope(
@@ -576,6 +580,21 @@ class ResourceVariable(variables.Variable):
     self._constraint = None
     self._cached_shape_as_list = None
 
+  @contextlib.contextmanager
+  def _assign_dependencies(self):
+    """Makes assignments depend on the cached value, if any.
+
+    This prevents undefined behavior with reads not ordered wrt writes.
+
+    Yields:
+      None.
+    """
+    if self._cached_value is not None:
+      with ops.control_dependencies([self._cached_value]):
+        yield
+    else:
+      yield
+
   def __nonzero__(self):
     return self.__bool__()
 
@@ -720,7 +739,7 @@ class ResourceVariable(variables.Variable):
     return self._save_slice_info
 
   def _read_variable_op(self):
-    if hasattr(self, "_trainable") and self._trainable:
+    if self.trainable:
       tape.watch_variable(self)
     return gen_resource_variable_ops.read_variable_op(self._handle,
                                                       self._dtype)
@@ -745,7 +764,7 @@ class ResourceVariable(variables.Variable):
   def sparse_read(self, indices, name=None):
     """Reads the value of this variable sparsely, using `gather`."""
     with ops.name_scope("Gather" if name is None else name) as name:
-      if self._trainable:
+      if self.trainable:
         tape.watch_variable(self)
       value = gen_resource_variable_ops.resource_gather(
           self._handle, indices, dtype=self._dtype, name=name)
@@ -786,6 +805,7 @@ class ResourceVariable(variables.Variable):
         var_def.snapshot_name = ops.strip_name_scope(self._graph_element.name,
                                                      export_scope)
       var_def.is_resource = True
+      var_def.trainable = self.trainable
       if self._save_slice_info:
         var_def.save_slice_info_def.MergeFrom(
             self._save_slice_info.to_proto(export_scope=export_scope))
@@ -865,7 +885,7 @@ class ResourceVariable(variables.Variable):
     # TODO(apassos): this here and below is not atomic. Consider making it
     # atomic if there's a way to do so without a performance cost for those who
     # don't need it.
-    with _handle_graph(self.handle):
+    with _handle_graph(self.handle), self._assign_dependencies():
       assign_sub_op = gen_resource_variable_ops.assign_sub_variable_op(
           self.handle, ops.convert_to_tensor(delta, dtype=self.dtype),
           name=name)
@@ -889,7 +909,7 @@ class ResourceVariable(variables.Variable):
       it will return the `Operation` that does the assignment, and when in eager
       mode it will return `None`.
     """
-    with _handle_graph(self.handle):
+    with _handle_graph(self.handle), self._assign_dependencies():
       assign_add_op = gen_resource_variable_ops.assign_add_variable_op(
           self.handle, ops.convert_to_tensor(delta, dtype=self.dtype),
           name=name)
@@ -898,7 +918,7 @@ class ResourceVariable(variables.Variable):
     return assign_add_op
 
   def _lazy_read(self, op):
-    if hasattr(self, "_trainable") and self._trainable:
+    if self.trainable:
       tape.watch_variable(self)
     return _UnreadVariable(
         self._handle, self.dtype, self._shape, self._in_graph_mode,
@@ -921,6 +941,8 @@ class ResourceVariable(variables.Variable):
       it will return the `Operation` that does the assignment, and when in eager
       mode it will return `None`.
     """
+    # Note: not depending on the cached value here since this can used to
+    # initialize the variable.
     with _handle_graph(self.handle):
       value_tensor = ops.convert_to_tensor(value, dtype=self.dtype)
       self._shape.assert_is_compatible_with(value_tensor.shape)
@@ -933,7 +955,7 @@ class ResourceVariable(variables.Variable):
   def _strided_slice_assign(self, begin, end, strides, value, name, begin_mask,
                             end_mask, ellipsis_mask, new_axis_mask,
                             shrink_axis_mask):
-    with _handle_graph(self.handle):
+    with _handle_graph(self.handle), self._assign_dependencies():
       return self._lazy_read(
           gen_array_ops.resource_strided_slice_assign(
               ref=self.handle,
@@ -1045,6 +1067,10 @@ class _UnreadVariable(ResourceVariable):
       self._graph_element = self.read_value()
     self._handle_deleter = deleter
 
+  @property
+  def name(self):
+    return self._parent_op.name
+
   def value(self):
     return self._read_variable_op()
 
diff --git a/tensorflow/python/ops/script_ops.py b/tensorflow/python/ops/script_ops.py
index f87c5dc5e39b7b3a58ca3cf2e9cd943cb081a020..cc23d0d133ecdb1415e0effcbc2ce52a962fb41e 100644
--- a/tensorflow/python/ops/script_ops.py
+++ b/tensorflow/python/ops/script_ops.py
@@ -12,7 +12,6 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-
 """Script Language Operators. See the @{$python/script_ops} guide."""
 
 # pylint: disable=g-bad-name
@@ -24,35 +23,60 @@ import threading
 
 # Used by py_util.cc to get tracebacks.
 import traceback  # pylint: disable=unused-import
+import weakref
 
 import numpy as np
 import six
 
 from tensorflow.python import pywrap_tensorflow
+from tensorflow.python.eager import backprop
 from tensorflow.python.eager import context
+from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import function
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import gen_script_ops
 from tensorflow.python.ops import resource_variable_ops
+from tensorflow.python.util import compat
 from tensorflow.python.util import nest
 from tensorflow.python.util.tf_export import tf_export
 
+# Map from EagerPyFunc token to tuple (tape, eager args, eager outputs);
+# used for differentiation.
+tape_cache = {}
+
 
 class EagerFunc(object):
   """A wrapper for a function owned by an EagerPyFunc."""
 
-  def __init__(self, func, Tout):
+  def __init__(self, func, Tout, is_grad_func):
     """Constructs an EagerFunc.
 
     Args:
       func: The function to wrap.
       Tout: A list of datatypes for the output; an empty list if the output is
             None.
+      is_grad_func: Whether this EagerFunc is the gradient of another
+        EagerPyFunc.
     """
     self._func = func
     self._out_dtypes = Tout
+    self._is_grad_func = is_grad_func
 
   def _convert(self, value, dtype):
+    """Converts `value` to a tensor of type `dtype`, with error checking.
+
+    Args:
+      value: The tensor to convert.
+      dtype: The desired dtype.
+
+    Returns:
+      A tensor of type `dtype`, or a zeros tensor if value is None and
+      this function is in fact a grdient function.
+
+    Raises:
+      RuntimeError: if `value` is a variable.
+    """
+
     if isinstance(value, resource_variable_ops.ResourceVariable):
       raise RuntimeError(
           "Attempting to return a variable from an eagerly executed py_func. "
@@ -60,22 +84,40 @@ class EagerFunc(object):
           "be returned; to return the value of a variable, make sure to obtain "
           "the Tensor backing it by calling `.read_value()` on the variable in "
           "question: %s" % value)
+    if value is None and self._is_grad_func:
+      # Gradient functions may legitimately return a list that contains
+      # both Tensors and Python Nones. Unfortuantely this breaks the
+      # OpKernel, so for now we replace None objects with zeros, which is
+      # mathematically correct but will prevent short-circuiting gradient
+      # computations.
+      #
+      # TODO(akshayka): Make it possible to return a list of both Tensors and
+      # Nones from an EagerPyFunc.
+      return constant_op.constant(0.0, dtype=dtype)
     return ops.convert_to_tensor(value, dtype=dtype)
 
-  def __call__(self, on_gpu, args):
+  def __call__(self, on_gpu, token, args):
     """Passes `args` to `self._func`, which is executed eagerly."""
+
     with context.eager_mode():
-      ret = self._func(*args)
-      maybe_copy_to_gpu = lambda x: x if not on_gpu else x.gpu()
-      if isinstance(ret, (tuple, list)):
-        return [
-            maybe_copy_to_gpu(self._convert(x, dtype=dtype))
-            for (x, dtype) in zip(ret, self._out_dtypes)
-        ]
-      elif ret is None:
-        return ret
-      else:
-        return maybe_copy_to_gpu(self._convert(ret, dtype=self._out_dtypes[0]))
+      with backprop.GradientTape() as tape:
+        for tensor in args:
+          tape.watch(tensor)
+        ret = self._func(*args)
+        # NB: The tape needs to watch copies across devices.
+        maybe_copy_to_gpu = lambda x: x if not on_gpu else x.gpu()
+        if isinstance(ret, (tuple, list)):
+          outputs = [
+              maybe_copy_to_gpu(self._convert(x, dtype=dtype))
+              for (x, dtype) in zip(ret, self._out_dtypes)
+          ]
+        elif ret is None:
+          outputs = None
+        else:
+          outputs = maybe_copy_to_gpu(
+              self._convert(ret, dtype=self._out_dtypes[0]))
+      tape_cache[compat.as_bytes(token)] = (tape, args, outputs)
+      return outputs
 
 
 class FuncRegistry(object):
@@ -88,11 +130,14 @@ class FuncRegistry(object):
   def __init__(self):
     self._lock = threading.Lock()
     self._unique_id = 0  # GUARDED_BY(self._lock)
-    self._funcs = {}
+    # Only store weakrefs to the funtions. The strong reference is stored in
+    # the graph.
+    self._funcs = weakref.WeakValueDictionary()
 
   def insert(self, func):
     """Registers `func` and returns a unique token for this entry."""
     token = self._next_unique_token()
+    # Store a weakref to the function
     self._funcs[token] = func
     return token
 
@@ -145,11 +190,18 @@ class FuncRegistry(object):
     Raises:
       ValueError: if no function is registered for `token`.
     """
-    func = self._funcs[token]
+    func = self._funcs.get(token, None)
     if func is None:
       raise ValueError("callback %s is not found" % token)
     if isinstance(func, EagerFunc):
-      return func(on_gpu, args)
+      # NB: Different invocations of the same py_func will share the same
+      # token, and the entries they stash in the tape_cache will collide.
+      # In practice, when executing a graph, this should only happen if
+      # the py_func is in a while_loop whose iterations are run in parallel
+      # or if the graph is being driven by concurrent session.run() calls.
+      #
+      # TODO(akshayka): Key the tape cache in a thread-safe way.
+      return func(on_gpu, token, args)
     else:
       ret = func(*args)
       # Strings seem to lead to a memory leak here if they're not wrapped in a
@@ -180,20 +232,8 @@ _py_funcs = FuncRegistry()
 pywrap_tensorflow.InitializePyTrampoline(_py_funcs)
 
 
-class CleanupFunc(object):
-  """A helper class to remove a registered function from _py_funcs."""
-
-  def __init__(self, token):
-    self._token = token
-
-  def __del__(self):
-    if _py_funcs is not None:
-      # If _py_funcs is None, the program is most likely in shutdown, and the
-      # _py_funcs object has been destroyed already.
-      _py_funcs.remove(self._token)
-
-
-def _internal_py_func(func, inp, Tout, stateful=None, eager=False, name=None):
+def _internal_py_func(func, inp, Tout, stateful=None, eager=False,
+                      is_grad_func=False, name=None):
   """See documentation for py_func and eager_py_func."""
 
   is_list_or_tuple = False
@@ -203,7 +243,7 @@ def _internal_py_func(func, inp, Tout, stateful=None, eager=False, name=None):
     Tout = [Tout]
 
   if eager:
-    func = EagerFunc(func, Tout)
+    func = EagerFunc(func, Tout, is_grad_func)
 
   token = _py_funcs.insert(func)
   # We tie the registered function's lifetime with the current default graph,
@@ -216,17 +256,15 @@ def _internal_py_func(func, inp, Tout, stateful=None, eager=False, name=None):
     # bound to that of the outer graph instead.
     graph = graph._outer_graph
 
-  cleanup = CleanupFunc(token)
-
   # TODO(zhifengc): Consider adding a Graph method to collect
   # `cleanup` objects in one of its member.
-  if not hasattr(graph, "_cleanup_py_funcs_used_in_graph"):
-    graph._cleanup_py_funcs_used_in_graph = []
+  if not hasattr(graph, "_py_funcs_used_in_graph"):
+    graph._py_funcs_used_in_graph = []
 
-  # When `graph` is destroyed, elements in _cleanup_py_funcs_used_in_graph
-  # will be destroyed and their __del__ will remove the 'token' from
-  # the funcs registry.
-  graph._cleanup_py_funcs_used_in_graph.append(cleanup)
+  # Store a reference to the function in the graph to ensure it stays alive
+  # as long as the graph lives. When the graph is destroyed, the function
+  # is left to the garbage collector for destruction as well.
+  graph._py_funcs_used_in_graph.append(func)
   # pylint: enable=protected-access
 
   if eager:
@@ -242,34 +280,55 @@ def _internal_py_func(func, inp, Tout, stateful=None, eager=False, name=None):
   return result if is_list_or_tuple else result[0]
 
 
+# TODO(akshayka): Implement higher-order derivatives.
+@ops.RegisterGradient("EagerPyFunc")
+def _EagerPyFuncGrad(op, dy):
+  """Computes the gradient of an EagerPyFunc."""
+
+  token = op.get_attr("token")
+
+  def eagerly_executed_grad(dy):
+    tape, eager_inputs, eager_outputs = tape_cache.pop(compat.as_bytes(token))
+    return tape.gradient(eager_outputs, eager_inputs, output_gradients=dy)
+
+  with ops.control_dependencies(op.outputs):
+    return _internal_py_func(
+        func=eagerly_executed_grad,
+        inp=[dy] if isinstance(dy, ops.Tensor) else dy,
+        Tout=[tensor.dtype for tensor in op.inputs],
+        eager=True, is_grad_func=True)
+
+
 def eager_py_func(func, inp, Tout, name=None):
   """Wraps a python function into a TensorFlow op that executes it eagerly.
 
   This function allows expressing computations in a TensorFlow graph as
   Python functions. In particular, it wraps a Python function `func`
-  in a TensorFlow operation that executes it with eager exeuction enabled. As a
-  consequence, `tf.contrib.eager.py_func` makes it possible to express control
-  flow using Python constructs (`if`, `while`, `for`, etc.), instead of
-  TensorFlow control flow constructs (@{tf.cond}, @{tf.while_loop}). For
-  example, you might use `tf.contrib.eager.py_func` to implement the log huber
-  function:
+  in a once-differentiable TensorFlow operation that executes it with eager
+  exeuction enabled. As a consequence, `tf.contrib.eager.py_func` makes it
+  possible to express control flow using Python constructs (`if`, `while`,
+  `for`, etc.), instead of TensorFlow control flow constructs (@{tf.cond},
+  @{tf.while_loop}). For example, you might use `tf.contrib.eager.py_func` to
+  implement the log huber function:
 
   ```python
   def log_huber(x, m):
     if tf.abs(x) <= m:
-      return x ** 2
+      return x**2
     else:
-      return m ** 2 * (1 - 2 * tf.log(m) + tf.log(x ** 2))
+      return m**2 * (1 - 2 * tf.log(m) + tf.log(x**2))
 
   x = tf.placeholder(tf.float32)
   m = tf.placeholder(tf.float32)
 
   y = tf.contrib.eager.py_func(func=log_huber, inp=[x, m], Tout=tf.float32)
+  dy_dx = tf.gradients(y, x)[0]
 
   with tf.Session() as sess:
     # The session executes `log_huber` eagerly. Given the feed values below,
-    # it will take the second branch, so `output` evaluates to 7.24372.
-    output = sess.run(y, feed_dict={x: 3.0, m: 2.0})
+    # it will take the first branch, so `y` evaluates to 1.0 and
+    # `dy_dx` evaluates to 2.0.
+    y, dy_dx = sess.run([y, dy_dx], feed_dict={x: 1.0, m: 2.0})
   ```
 
   You can also use `tf.contrib.eager.py_func` to debug your models at runtime
@@ -288,10 +347,6 @@ def eager_py_func(func, inp, Tout, name=None):
   that take Tensors as inputs, execute TensorFlow operations in their bodies,
   and return Tensors as outputs.
 
-  `tf.contrib.eager.py_func` is not differentiable, though a gradient may be
-  implemented in the future; if you would like to differentiate through it,
-  please file an issue on Github.
-
   Like @{tf.py_func}, `tf.contrib.eager.py_func` has the following limitations
   with respect to serialization and distribution:
 
diff --git a/tensorflow/python/ops/special_math_ops.py b/tensorflow/python/ops/special_math_ops.py
index 6204adef3bb5dc96dab4a16bf05824d32627fccc..1508873b751c4fa42d3488ff2d18b5795fda9652 100644
--- a/tensorflow/python/ops/special_math_ops.py
+++ b/tensorflow/python/ops/special_math_ops.py
@@ -82,6 +82,54 @@ def lbeta(x, name='lbeta'):
     return result
 
 
+@tf_export('math.bessel_i0')
+def bessel_i0(x, name='bessel_i0'):
+  """Computes the Bessel i0 function of `x` element-wise.
+
+  Modified Bessel function of order 0.
+
+  It is preferable to use the numerically stabler function `i0e(x)` instead.
+
+  Args:
+    x: A `Tensor` or `SparseTensor`. Must be one of the following types: `half`,
+      `float32`, `float64`.
+    name: A name for the operation (optional).
+
+  Returns:
+    A `Tensor` or `SparseTensor`, respectively. Has the same type as `x`.
+
+  @compatibility(scipy)
+  Equivalent to scipy.special.i0
+  @end_compatibility
+  """
+  with ops.name_scope(name, [x]):
+    return math_ops.exp(math_ops.abs(x)) * math_ops.bessel_i0e(x)
+
+
+@tf_export('math.bessel_i1')
+def bessel_i1(x, name='bessel_i1'):
+  """Computes the Bessel i1 function of `x` element-wise.
+
+  Modified Bessel function of order 1.
+
+  It is preferable to use the numerically stabler function `i1e(x)` instead.
+
+  Args:
+    x: A `Tensor` or `SparseTensor`. Must be one of the following types: `half`,
+      `float32`, `float64`.
+    name: A name for the operation (optional).
+
+  Returns:
+    A `Tensor` or `SparseTensor`, respectively. Has the same type as `x`.
+
+  @compatibility(scipy)
+  Equivalent to scipy.special.i1
+  @end_compatibility
+  """
+  with ops.name_scope(name, [x]):
+    return math_ops.exp(math_ops.abs(x)) * math_ops.bessel_i1e(x)
+
+
 @tf_export('einsum', 'linalg.einsum')
 def einsum(equation, *inputs, **kwargs):
   """A generalized contraction between tensors of arbitrary dimension.
@@ -153,6 +201,8 @@ def einsum(equation, *inputs, **kwargs):
         indices in its subscript, or
       - the input shapes are inconsistent along a particular axis.
   """
+  equation = equation.replace(" ", "")
+  
   name = kwargs.pop('name', None)
   if kwargs:
     raise TypeError('invalid keyword arguments for this function: ' + ', '.join(
diff --git a/tensorflow/python/ops/special_math_ops_test.py b/tensorflow/python/ops/special_math_ops_test.py
index d7c3a7e8dc7c2ad611cf47718dddcf38700ce304..b7e164f149a9cca336fee061ae2cc3a464ca6132 100644
--- a/tensorflow/python/ops/special_math_ops_test.py
+++ b/tensorflow/python/ops/special_math_ops_test.py
@@ -29,6 +29,7 @@ from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import special_math_ops
 from tensorflow.python.platform import test
+from tensorflow.python.platform import tf_logging
 
 
 class LBetaTest(test.TestCase):
@@ -150,6 +151,33 @@ class LBetaTest(test.TestCase):
         self.assertEqual(expected_result.get_shape(), lbeta_x.get_shape())
 
 
+class BesselTest(test.TestCase):
+
+  def test_bessel_i0(self):
+    x_single = np.arange(-3, 3).reshape(1, 3, 2).astype(np.float32)
+    x_double = np.arange(-3, 3).reshape(1, 3, 2).astype(np.float64)
+    try:
+      from scipy import special  # pylint: disable=g-import-not-at-top
+      self.assertAllClose(special.i0(x_single),
+                          self.evaluate(special_math_ops.bessel_i0(x_single)))
+      self.assertAllClose(special.i0(x_double),
+                          self.evaluate(special_math_ops.bessel_i0(x_double)))
+    except ImportError as e:
+      tf_logging.warn('Cannot test special functions: %s' % str(e))
+
+  def test_bessel_i1(self):
+    x_single = np.arange(-3, 3).reshape(1, 3, 2).astype(np.float32)
+    x_double = np.arange(-3, 3).reshape(1, 3, 2).astype(np.float64)
+    try:
+      from scipy import special  # pylint: disable=g-import-not-at-top
+      self.assertAllClose(special.i1(x_single),
+                          self.evaluate(special_math_ops.bessel_i1(x_single)))
+      self.assertAllClose(special.i1(x_double),
+                          self.evaluate(special_math_ops.bessel_i1(x_double)))
+    except ImportError as e:
+      tf_logging.warn('Cannot test special functions: %s' % str(e))
+
+
 class EinsumTest(test.TestCase):
 
   simple_cases = [
@@ -195,6 +223,12 @@ class EinsumTest(test.TestCase):
       'iJ,Jk->ik',
       'iJ,Ki->JK',
       'iJk,Jklm->Jk'
+      'ij, jk, kl -> il',
+      'a, ab, abc -> abc',
+      'ab, ab, cd, cd, ef, ef -> ',
+      'abc, bac',
+      'iJ, Ki -> JK',
+      'iJk, Jklm -> Jk'
   ]
 
   long_cases = [
@@ -203,6 +237,8 @@ class EinsumTest(test.TestCase):
       'ea,fb,gc,hd,abcd->efgh',
       'ea,fb,abcd,gc,hd->efgh',
       'abhe,hidj,jgba,hiab,gab',
+      'efc, dbc, acf, fd -> abe',
+      'abhe, hidj, jgba, hiab, gab',
   ]
 
   invalid_cases = [
@@ -273,7 +309,7 @@ class EinsumTest(test.TestCase):
     input_axes, _, _ = axes.partition('->')
 
     for idx in input_axes.split(','):
-      shape = [all_axes[ax] for ax in idx]
+      shape = [all_axes[ax] for ax in idx if ax.isalpha()]
       input_vals.append(np.random.random(shape))
 
     input_tensors = [constant_op.constant(val) for val in input_vals]
@@ -285,8 +321,8 @@ class EinsumTest(test.TestCase):
     correct_value = np.einsum(axes, *input_vals)
 
     err = np.abs(correct_value - output_value).max()
-    print(axes, err)
-    assert err < 1e-8
+    # print(axes, err)
+    self.assertLess(err, 1e-8)
 
   def test_input_is_placeholder(self):
     with ops.Graph().as_default():
@@ -298,8 +334,7 @@ class EinsumTest(test.TestCase):
             m0: [[1, 2, 3]],
             m1: [[2], [1], [1]],
         }
-        np.testing.assert_almost_equal([[7]], sess.run(
-            out, feed_dict=feed_dict))
+        self.assertAllClose([[7]], sess.run(out, feed_dict=feed_dict))
 
     with ops.Graph().as_default():
       m0 = array_ops.placeholder(dtypes.int32, shape=(None, 3))
@@ -310,7 +345,7 @@ class EinsumTest(test.TestCase):
             m0: [[1, 2, 3]],
             m1: [2, 1, 1],
         }
-        np.testing.assert_almost_equal([7], sess.run(out, feed_dict=feed_dict))
+        self.assertAllClose([7], sess.run(out, feed_dict=feed_dict))
 
     # Tests for placeholders which have two or more None values
     with ops.Graph().as_default():
@@ -322,8 +357,7 @@ class EinsumTest(test.TestCase):
             m0: [[[1, 2]]],
             m1: [[3], [2]],
         }
-        np.testing.assert_almost_equal([[[7]]],
-                                       sess.run(out, feed_dict=feed_dict))
+        self.assertAllClose([[[7]]], sess.run(out, feed_dict=feed_dict))
 
     with ops.Graph().as_default():
       m0 = array_ops.placeholder(dtypes.int32, shape=(2, 1))
@@ -334,8 +368,7 @@ class EinsumTest(test.TestCase):
             m0: [[3], [2]],
             m1: [[[1, 2]]],
         }
-        np.testing.assert_almost_equal([[[7]]],
-                                       sess.run(out, feed_dict=feed_dict))
+        self.assertAllClose([[[7]]], sess.run(out, feed_dict=feed_dict))
 
     with ops.Graph().as_default():
       m0 = array_ops.placeholder(dtypes.int32, shape=(None, None, 2))
@@ -346,8 +379,7 @@ class EinsumTest(test.TestCase):
             m0: [[[1, 2]]],
             m1: [3, 2],
         }
-        np.testing.assert_almost_equal([[7]], sess.run(
-            out, feed_dict=feed_dict))
+        self.assertAllClose([[7]], sess.run(out, feed_dict=feed_dict))
 
     with ops.Graph().as_default():
       m0 = array_ops.placeholder(dtypes.int32, shape=(None, 2, None, 2))
@@ -358,8 +390,7 @@ class EinsumTest(test.TestCase):
             m0: [[[[1, 2]], [[2, 1]]]],
             m1: [[3, 2]],
         }
-        np.testing.assert_almost_equal([[[7, 8]]],
-                                       sess.run(out, feed_dict=feed_dict))
+        self.assertAllClose([[[7, 8]]], sess.run(out, feed_dict=feed_dict))
 
 
 if __name__ == '__main__':
diff --git a/tensorflow/python/ops/state_ops.py b/tensorflow/python/ops/state_ops.py
index 94d7458ec8735836566033faae95a3aed3af1824..08b7cda73bdc739912ec58f161ec7113aeffd9e8 100644
--- a/tensorflow/python/ops/state_ops.py
+++ b/tensorflow/python/ops/state_ops.py
@@ -338,7 +338,6 @@ def scatter_nd_update(ref, indices, updates, use_locking=True, name=None):
   Args:
     ref: A Variable.
     indices: A `Tensor`. Must be one of the following types: `int32`, `int64`.
-      A Tensor. Must be one of the following types: int32, int64.
       A tensor of indices into ref.
     updates: A `Tensor`. Must have the same type as `ref`.
       A Tensor. Must have the same type as ref. A tensor of updated
@@ -355,10 +354,9 @@ def scatter_nd_update(ref, indices, updates, use_locking=True, name=None):
   if ref.dtype._is_ref_dtype:
     return gen_state_ops.scatter_nd_update(
         ref, indices, updates, use_locking, name)
-  with ops.control_dependencies([gen_state_ops.resource_scatter_nd_update(
-      ref.handle, indices, ops.convert_to_tensor(updates, dtype=ref.dtype),
-      use_locking, name)]):
-    return ref.read_value()
+  return ref._lazy_read(gen_state_ops.resource_scatter_nd_update(  # pylint: disable=protected-access
+      ref.handle, indices, ops.convert_to_tensor(updates, ref.dtype),
+      name=name))
 
 
 @tf_export("scatter_add")
@@ -411,3 +409,67 @@ def scatter_add(ref, indices, updates, use_locking=False, name=None):
   return ref._lazy_read(gen_resource_variable_ops.resource_scatter_add(  # pylint: disable=protected-access
       ref.handle, indices, ops.convert_to_tensor(updates, ref.dtype),
       name=name))
+
+
+@tf_export("scatter_nd_add")
+def scatter_nd_add(ref, indices, updates, use_locking=False, name=None):
+  r"""Applies sparse addition to individual values or slices in a Variable.
+
+  `ref` is a `Tensor` with rank `P` and `indices` is a `Tensor` of rank `Q`.
+
+  `indices` must be integer tensor, containing indices into `ref`.
+  It must be shape `[d_0, ..., d_{Q-2}, K]` where `0 < K <= P`.
+
+  The innermost dimension of `indices` (with length `K`) corresponds to
+  indices into elements (if `K = P`) or slices (if `K < P`) along the `K`th
+  dimension of `ref`.
+
+  `updates` is `Tensor` of rank `Q-1+P-K` with shape:
+
+  ```
+  [d_0, ..., d_{Q-2}, ref.shape[K], ..., ref.shape[P-1]].
+  ```
+
+  For example, say we want to add 4 scattered elements to a rank-1 tensor to
+  8 elements. In Python, that update would look like this:
+
+  ```python
+      ref = tf.Variable([1, 2, 3, 4, 5, 6, 7, 8])
+      indices = tf.constant([[4], [3], [1] ,[7]])
+      updates = tf.constant([9, 10, 11, 12])
+      add = tf.scatter_nd_add(ref, indices, updates)
+      with tf.Session() as sess:
+        print sess.run(add)
+  ```
+
+  The resulting update to ref would look like this:
+
+      [1, 13, 3, 14, 14, 6, 7, 20]
+
+  See @{tf.scatter_nd} for more details about how to make updates to
+  slices.
+
+  Args:
+    ref: A mutable `Tensor`. Must be one of the following types: `float32`,
+      `float64`, `int32`, `uint8`, `int16`, `int8`, `complex64`, `int64`,
+      `qint8`, `quint8`, `qint32`, `bfloat16`, `uint16`, `complex128`, `half`,
+      `uint32`, `uint64`. A mutable Tensor. Should be from a Variable node.
+    indices: A `Tensor`. Must be one of the following types: `int32`, `int64`.
+      A tensor of indices into ref.
+    updates: A `Tensor`. Must have the same type as `ref`.
+      A tensor of updated values to add to ref.
+    use_locking: An optional `bool`. Defaults to `True`.
+      An optional bool. Defaults to True. If True, the assignment will
+      be protected by a lock; otherwise the behavior is undefined,
+      but may exhibit less contention.
+    name: A name for the operation (optional).
+
+  Returns:
+    A mutable `Tensor`. Has the same type as `ref`.
+  """
+  if ref.dtype._is_ref_dtype:
+    return gen_state_ops.scatter_nd_add(
+        ref, indices, updates, use_locking, name)
+  return ref._lazy_read(gen_state_ops.resource_scatter_nd_add(  # pylint: disable=protected-access
+      ref.handle, indices, ops.convert_to_tensor(updates, ref.dtype),
+      name=name))
diff --git a/tensorflow/python/ops/string_ops.py b/tensorflow/python/ops/string_ops.py
index ae79c0194954a052db799d7a00ce1ddc584ea6ed..0280c89c10f264dcc37c89598599d377e8ac9e07 100644
--- a/tensorflow/python/ops/string_ops.py
+++ b/tensorflow/python/ops/string_ops.py
@@ -91,6 +91,59 @@ def string_split(source, delimiter=" ", skip_empty=True):  # pylint: disable=inv
   shape.set_shape([2])
   return sparse_tensor.SparseTensor(indices, values, shape)
 
+@tf_export("strings.split")
+def string_split_v2(source, sep=None, maxsplit=-1):
+  """Split elements of `source` based on `sep` into a `SparseTensor`.
+
+  Let N be the size of source (typically N will be the batch size). Split each
+  element of `source` based on `sep` and return a `SparseTensor`
+  containing the split tokens. Empty tokens are ignored.
+
+  For example, N = 2, source[0] is 'hello world' and source[1] is 'a b c',
+  then the output will be
+
+  st.indices = [0, 0;
+                0, 1;
+                1, 0;
+                1, 1;
+                1, 2]
+  st.shape = [2, 3]
+  st.values = ['hello', 'world', 'a', 'b', 'c']
+
+  If `sep` is given, consecutive delimiters are not grouped together and are
+  deemed to delimit empty strings. For example, source of `"1<>2<><>3"` and
+  sep of `"<>"` returns `["1", "2", "", "3"]`. If `sep` is None or an empty
+  string, consecutive whitespace are regarded as a single separator, and the
+  result will contain no empty strings at the startor end if the string has
+  leading or trailing whitespace.
+
+  Note that the above mentioned behavior matches python's str.split.
+
+  Args:
+    source: `1-D` string `Tensor`, the strings to split.
+    sep: `0-D` string `Tensor`, the delimiter character.
+    maxsplit: An `int`. If `maxsplit > 0`, limit of the split of the result.
+
+  Raises:
+    ValueError: If sep is not a string.
+
+  Returns:
+    A `SparseTensor` of rank `2`, the strings split according to the delimiter.
+    The first column of the indices corresponds to the row in `source` and the
+    second column corresponds to the index of the split component in this row.
+  """
+  if sep is None:
+    sep = ''
+  sep = ops.convert_to_tensor(sep, dtype=dtypes.string)
+  source = ops.convert_to_tensor(source, dtype=dtypes.string)
+
+  indices, values, shape = gen_string_ops.string_split_v2(
+      source, sep=sep, maxsplit=maxsplit)
+  indices.set_shape([None, 2])
+  values.set_shape([None])
+  shape.set_shape([2])
+  return sparse_tensor.SparseTensor(indices, values, shape)
+
 
 def _reduce_join_reduction_dims(x, axis, reduction_indices):
   """Returns range(rank(x) - 1, 0, -1) if reduction_indices is None."""
diff --git a/tensorflow/python/ops/tensor_array_grad.py b/tensorflow/python/ops/tensor_array_grad.py
index 1f70d695485ca0aab22c532099caad1b361d3637..d34134980400999ee2b0de9362423b2ec495868f 100644
--- a/tensorflow/python/ops/tensor_array_grad.py
+++ b/tensorflow/python/ops/tensor_array_grad.py
@@ -34,6 +34,7 @@ ops.NotDifferentiable("TensorArrayCloseV2")
 
 ops.NotDifferentiable("TensorArrayV3")
 ops.NotDifferentiable("TensorArrayGradV3")
+ops.NotDifferentiable("TensorArrayGradWithShape")
 ops.NotDifferentiable("TensorArraySizeV3")
 ops.NotDifferentiable("TensorArrayCloseV3")
 
diff --git a/tensorflow/python/ops/variable_scope.py b/tensorflow/python/ops/variable_scope.py
index 8d93d24b149a2fd27b956e73d9e866b61ca97287..9a711edaa44f711aa4122ca282ca99ca23d17bdc 100644
--- a/tensorflow/python/ops/variable_scope.py
+++ b/tensorflow/python/ops/variable_scope.py
@@ -1,4 +1,4 @@
-# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
+ # Copyright 2015 The TensorFlow Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -794,6 +794,14 @@ class _VariableStore(object):
         validate_shape=validate_shape,
         constraint=constraint,
         use_resource=use_resource)
+    if context.executing_eagerly() and self._store_eager_variables:
+      if collections:
+        ops.add_to_collections(collections, v)
+      else:
+        ops.add_to_collection(ops.GraphKeys.GLOBAL_VARIABLES, v)
+      if trainable:
+        ops.add_to_collection(ops.GraphKeys.TRAINABLE_VARIABLES, v)
+
     if not context.executing_eagerly() or self._store_eager_variables:
       # In eager mode we do not want to keep default references to Variable
       # objects as this will prevent their memory from being released.
@@ -1261,13 +1269,13 @@ class EagerVariableStore(object):
 
   def trainable_variables(self):
     # pylint: disable=protected-access
-    return sorted([x for x in self._store._vars.values() if x._trainable],
+    return sorted([x for x in self._store._vars.values() if x.trainable],
                   key=lambda x: x.name)
     # pylint: enable=protected-access
 
   def non_trainable_variables(self):
     # pylint: disable=protected-access
-    return sorted([x for x in self._store._vars.values() if not x._trainable],
+    return sorted([x for x in self._store._vars.values() if not x.trainable],
                   key=lambda x: x.name)
     # pylint: enable=protected-access
 
@@ -1296,7 +1304,7 @@ class EagerVariableStore(object):
       new_var = resource_variable_ops.ResourceVariable(
           var.read_value(),
           name=stripped_var_name,
-          trainable=var._trainable)
+          trainable=var.trainable)
       new_store._store._vars[key] = new_var
     return new_store
     # pylint: enable=protected-access
@@ -1778,6 +1786,23 @@ class variable_scope(object):
           assert v.name == "foo/bar/v:0"
   ```
 
+  Simple example of how to reenter a premade variable scope safely:
+
+  ```python
+  with tf.variable_scope("foo") as vs:
+    pass
+
+  # Re-enter the variable scope.
+  with tf.variable_scope(vs,
+                         auxiliary_name_scope=False) as vs1:
+    # Restore the original name_scope.
+    with tf.name_scope(vs1.original_name_scope):
+        v = tf.get_variable("v", [1])
+        assert v.name == "foo/v:0"
+        c = tf.constant([1], name="c")
+        assert c.name == "foo/c:0"
+  ```
+
   Basic example of sharing a variable AUTO_REUSE:
 
   ```python
@@ -1915,7 +1940,9 @@ class variable_scope(object):
         (which must have the same shape). Constraints are not safe to
         use when doing asynchronous distributed training.
       auxiliary_name_scope: If `True`, we create an auxiliary name scope with
-        the scope. If `False`, we don't touch name scope.
+        the scope. If `False`, we don't create it. Note that the argument is
+        not inherited, and it only takes effect for once when creating. You
+        should only use it for re-entering a premade variable scope.
 
     Returns:
       A scope that can be captured and reused.
diff --git a/tensorflow/python/ops/variables.py b/tensorflow/python/ops/variables.py
index 294ee0e32831928d34d6bae5668898609a37dc59..4be9f5eb6864015cd9c3f6f3526285ebbdc180f9 100644
--- a/tensorflow/python/ops/variables.py
+++ b/tensorflow/python/ops/variables.py
@@ -123,6 +123,30 @@ class Variable(checkpointable.CheckpointableBase):
   various `Optimizer` classes use this collection as the default list of
   variables to optimize.
 
+  WARNING: tf.Variable objects have a non-intuitive memory model. A Variable is
+  represented internally as a mutable Tensor which can non-deterministically
+  alias other Tensors in a graph. The set of operations which consume a Variable
+  and can lead to aliasing is undetermined and can change across TensorFlow
+  versions. Avoid writing code which relies on the value of a Variable either
+  changing or not changing as other operations happen. For example, using
+  Variable objects or simple functions thereof as predicates in a `tf.cond` is
+  dangerous and error-prone:
+
+  ```
+  v = tf.Variable(True)
+  tf.cond(v, lambda: v.assign(False), my_false_fn)  # Note: this is broken.
+  ```
+
+  Here replacing tf.Variable with tf.contrib.eager.Variable will fix any
+  nondeterminism issues.
+
+  To use the replacement for variables which does
+  not have these issues:
+
+  * Replace `tf.Variable` with `tf.contrib.eager.Variable`;
+  * Call `tf.get_variable_scope().set_use_resource(True)` inside a
+    `tf.variable_scope` before the `tf.get_variable()` call.
+
   @compatibility(eager)
   `tf.Variable` is not compatible with eager execution.  Use
   `tf.contrib.eager.Variable` instead which is compatible with both eager
@@ -235,7 +259,7 @@ class Variable(checkpointable.CheckpointableBase):
           constraint=constraint)
 
   def __repr__(self):
-    if context.executing_eagerly():
+    if context.executing_eagerly() and not self._in_graph_mode:
       return "<tf.Variable '%s' shape=%s dtype=%s, numpy=%s>" % (
           self.name, self.get_shape(), self.dtype.name,
           ops.numpy_text(self.read_value(), is_repr=True))
@@ -317,6 +341,7 @@ class Variable(checkpointable.CheckpointableBase):
       self._update_uid = initial_value.checkpoint_position.restore_uid
       initial_value = initial_value.wrapped_value
 
+    self._trainable = trainable
     if trainable and ops.GraphKeys.TRAINABLE_VARIABLES not in collections:
       collections = list(collections) + [ops.GraphKeys.TRAINABLE_VARIABLES]
     with ops.init_scope():
@@ -426,6 +451,7 @@ class Variable(checkpointable.CheckpointableBase):
                                  import_scope=import_scope))
     else:
       self._initial_value = None
+    self._trainable = getattr(variable_def, "trainable", True)
     self._snapshot = g.as_graph_element(
         ops.prepend_name_scope(variable_def.snapshot_name,
                                import_scope=import_scope))
@@ -519,6 +545,10 @@ class Variable(checkpointable.CheckpointableBase):
     self._ref().set_shape(shape)
     self.value().set_shape(shape)
 
+  @property
+  def trainable(self):
+    return self._trainable
+
   def eval(self, session=None):
     """In a session, computes and returns the value of this variable.
 
@@ -1026,6 +1056,7 @@ class Variable(checkpointable.CheckpointableBase):
         # For backwards compatibility.
         var_def.initial_value_name = ops.strip_name_scope(
             self._initial_value.name, export_scope)
+      var_def.trainable = self.trainable
       var_def.initializer_name = ops.strip_name_scope(
           self.initializer.name, export_scope)
       var_def.snapshot_name = ops.strip_name_scope(
diff --git a/tensorflow/python/pywrap_tfe.i b/tensorflow/python/pywrap_tfe.i
index 5f1fafb9dcaa441b284dce6b57eae1227ebeeaa1..500dc30cc30f757965791e504bc79718bb7f7bd7 100644
--- a/tensorflow/python/pywrap_tfe.i
+++ b/tensorflow/python/pywrap_tfe.i
@@ -33,8 +33,9 @@ limitations under the License.
 %rename("%s") TFE_ContextAsyncClearError;
 %rename("%s") TFE_OpNameGetAttrType;
 %rename("%s") TFE_Py_InitEagerTensor;
+%rename("%s") TFE_Py_SetEagerTensorProfiler;
 %rename("%s") TFE_Py_RegisterExceptionClass;
-%rename("%s") TFE_Py_RegisterBackwardFunctionGetter;
+%rename("%s") TFE_Py_RegisterGradientFunction;
 %rename("%s") TFE_Py_RegisterFallbackExceptionClass;
 %rename("%s") TFE_Py_RegisterResourceVariableType;
 %rename("%s") TFE_Py_Execute;
@@ -60,6 +61,7 @@ limitations under the License.
 %rename("%s") TFE_ContextOptionsSetAsync;
 %rename("%s") TFE_DeleteContextOptions;
 %rename("%s") TFE_Py_TensorShapeSlice;
+%rename("%s") TFE_Py_TensorShapeOnDevice;
 
 %{
 #include "tensorflow/python/eager/pywrap_tfe.h"
diff --git a/tensorflow/python/saved_model/BUILD b/tensorflow/python/saved_model/BUILD
index 2609a5d222659f6ebf775d6baa48bd7bc39fd7f6..81786fbf435ffebba6217c0a03f06494195afc3c 100644
--- a/tensorflow/python/saved_model/BUILD
+++ b/tensorflow/python/saved_model/BUILD
@@ -149,6 +149,7 @@ py_test(
         "//tensorflow/python:saver_test_utils",
         "//tensorflow/python:state_ops",
         "//tensorflow/python:test_ops",
+        "//tensorflow/python:training",
         "//tensorflow/python:util",
         "//tensorflow/python:variables",
     ],
diff --git a/tensorflow/python/saved_model/builder_impl.py b/tensorflow/python/saved_model/builder_impl.py
index 8f1d5a099f8dce262ae902230a010e532c279e3e..e58be804c2738dbad0e2f90c21d6eff3832a8148 100644
--- a/tensorflow/python/saved_model/builder_impl.py
+++ b/tensorflow/python/saved_model/builder_impl.py
@@ -104,10 +104,10 @@ class SavedModelBuilder(object):
     Args:
       assets_collection_to_add: The collection where the asset paths are setup.
     """
-    asset_source_filepath_list = _maybe_save_assets(assets_collection_to_add)
+    asset_filename_map = _maybe_save_assets(assets_collection_to_add)
 
     # Return if there are no assets to write.
-    if len(asset_source_filepath_list) is 0:
+    if not asset_filename_map:
       tf_logging.info("No assets to write.")
       return
 
@@ -119,12 +119,10 @@ class SavedModelBuilder(object):
       file_io.recursive_create_dir(assets_destination_dir)
 
     # Copy each asset from source path to destination path.
-    for asset_source_filepath in asset_source_filepath_list:
-      asset_source_filename = os.path.basename(asset_source_filepath)
-
+    for asset_basename, asset_source_filepath in asset_filename_map.items():
       asset_destination_filepath = os.path.join(
           compat.as_bytes(assets_destination_dir),
-          compat.as_bytes(asset_source_filename))
+          compat.as_bytes(asset_basename))
 
       # Only copy the asset file to the destination if it does not already
       # exist. This is to ensure that an asset with the same name defined as
@@ -272,6 +270,18 @@ class SavedModelBuilder(object):
 
     self._add_train_op(train_op)
 
+  def _maybe_create_saver(self, saver=None):
+    """Creates a sharded saver if one does not already exist."""
+    if not saver:
+      # Initialize a saver to generate a sharded output for all saveables in the
+      # current scope.
+      saver = tf_saver.Saver(
+          variables._all_saveable_objects(),  # pylint: disable=protected-access
+          sharded=True,
+          write_version=saver_pb2.SaverDef.V2,
+          allow_empty=True)
+    return saver
+
   def add_meta_graph(self,
                      tags,
                      signature_def_map=None,
@@ -279,7 +289,8 @@ class SavedModelBuilder(object):
                      legacy_init_op=None,
                      clear_devices=False,
                      main_op=None,
-                     strip_default_attrs=False):
+                     strip_default_attrs=False,
+                     saver=None):
     # pylint: disable=line-too-long
     """Adds the current meta graph to the SavedModel.
 
@@ -304,6 +315,9 @@ class SavedModelBuilder(object):
       strip_default_attrs: Boolean. If `True`, default-valued attributes will be
         removed from the NodeDefs. For a detailed guide, see
         [Stripping Default-Valued Attributes](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/saved_model/README.md#stripping-default-valued-attributes).
+      saver: An instance of tf.train.Saver that will be used to export the
+        metagraph. If None, a sharded Saver that restores all variables will
+        be used.
 
     Raises:
       AssertionError: If the variables for the SavedModel have not been saved
@@ -322,18 +336,11 @@ class SavedModelBuilder(object):
     # Add assets and ops
     self._add_collections(assets_collection, legacy_init_op, main_op, None)
 
-    # Initialize a saver to generate a sharded output for all saveables in the
-    # current scope.
-    saver = tf_saver.Saver(
-        variables._all_saveable_objects(),  # pylint: disable=protected-access
-        sharded=True,
-        write_version=saver_pb2.SaverDef.V2,
-        allow_empty=True)
+    saver = self._maybe_create_saver(saver)
 
     # The graph almost certainly previously contained at least one Saver, and
     # possibly several (e.g. one for loading a pretrained embedding, and another
-    # for the model weights).  However, a *new* Saver was just created that
-    # includes all of the variables.  Removing the preexisting ones was the
+    # for the model weights).  Removing the preexisting ones was the
     # motivation for the clear_extraneous_savers option, but it turns out that
     # there are edge cases where that option breaks the graph.  Until that is
     # resolved, we just leave the option set to False for now.
@@ -352,7 +359,8 @@ class SavedModelBuilder(object):
                                    legacy_init_op=None,
                                    clear_devices=False,
                                    main_op=None,
-                                   strip_default_attrs=False):
+                                   strip_default_attrs=False,
+                                   saver=None):
     # pylint: disable=line-too-long
     """Adds the current meta graph to the SavedModel and saves variables.
 
@@ -379,6 +387,9 @@ class SavedModelBuilder(object):
       strip_default_attrs: Boolean. If `True`, default-valued attributes will be
         removed from the NodeDefs. For a detailed guide, see
         [Stripping Default-Valued Attributes](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/saved_model/README.md#stripping-default-valued-attributes).
+      saver: An instance of tf.train.Saver that will be used to export the
+        metagraph and save variables. If None, a sharded Saver that restores
+        all variables will be used.
 
     """
     # pylint: enable=line-too-long
@@ -405,13 +416,7 @@ class SavedModelBuilder(object):
         compat.as_text(variables_dir),
         compat.as_text(constants.VARIABLES_FILENAME))
 
-    # Initialize a saver to generate a sharded output for all saveables in the
-    # current scope.
-    saver = tf_saver.Saver(
-        variables._all_saveable_objects(),  # pylint: disable=protected-access
-        sharded=True,
-        write_version=saver_pb2.SaverDef.V2,
-        allow_empty=True)
+    saver = self._maybe_create_saver(saver)
 
     # Save the variables. Also, disable writing the checkpoint state proto. The
     # file is not used during SavedModel loading. In addition, since a
@@ -423,8 +428,7 @@ class SavedModelBuilder(object):
 
     # The graph almost certainly previously contained at least one Saver, and
     # possibly several (e.g. one for loading a pretrained embedding, and another
-    # for the model weights).  However, a *new* Saver was just created that
-    # includes all of the variables.  Removing the preexisting ones was the
+    # for the model weights).  Removing the preexisting ones was the
     # motivation for the clear_extraneous_savers option, but it turns out that
     # there are edge cases where that option breaks the graph.  Until that is
     # resolved, we just leave the option set to False for now.
@@ -476,16 +480,17 @@ def _maybe_save_assets(assets_collection_to_add=None):
     assets_collection_to_add: The collection where the asset paths are setup.
 
   Returns:
-    The list of filepaths to the assets in the assets collection.
+    A dict of asset basenames for saving to the original full path to the asset.
 
   Raises:
     ValueError: Indicating an invalid filepath tensor.
   """
-  asset_source_filepath_list = []
+  # Map of target file names to original filenames
+  asset_filename_map = {}
 
   if assets_collection_to_add is None:
     tf_logging.info("No assets to save.")
-    return asset_source_filepath_list
+    return asset_filename_map
 
   # Iterate over the supplied asset collection, build the `AssetFile` proto
   # and add them to the collection with key `constants.ASSETS_KEY`, in the
@@ -495,15 +500,71 @@ def _maybe_save_assets(assets_collection_to_add=None):
     if not asset_source_filepath:
       raise ValueError("Invalid asset filepath tensor %s" % asset_tensor)
 
-    asset_source_filename = os.path.basename(asset_source_filepath)
+    asset_filename = _get_asset_filename_to_add(
+        asset_source_filepath, asset_filename_map)
 
     # Build `AssetFile` proto and add it to the asset collection in the graph.
-    _add_asset_to_collection(asset_source_filename, asset_tensor)
+    # Note that this should be done even when the file is a duplicate of an
+    # already-added file, as the tensor reference should still exist.
+    _add_asset_to_collection(asset_filename, asset_tensor)
 
-    asset_source_filepath_list.append(asset_source_filepath)
+    # In the cases where we are adding a duplicate, this will result in the
+    # last of the filepaths being the one used for copying the file to the
+    # SavedModel. Since the files in question are the same, it doesn't matter
+    # either way.
+    asset_filename_map[asset_filename] = asset_source_filepath
 
   tf_logging.info("Assets added to graph.")
-  return asset_source_filepath_list
+  return asset_filename_map
+
+
+def _get_asset_filename_to_add(asset_filepath, asset_filename_map):
+  """Get a unique basename to add to the SavedModel if this file is unseen.
+
+  Assets come from users as full paths, and we save them out to the
+  SavedModel as basenames. In some cases, the basenames collide. Here,
+  we dedupe asset basenames by first checking if the file is the same,
+  and, if different, generate and return an index-suffixed basename
+  that can be used to add the asset to the SavedModel.
+
+  Args:
+    asset_filepath: the full path to the asset that is being saved
+    asset_filename_map: a dict of filenames used for saving the asset in
+      the SavedModel to full paths from which the filenames were derived.
+
+  Returns:
+    Uniquified filename string if the file is not a duplicate, or the original
+    filename if the file has already been seen and saved.
+  """
+  asset_filename = os.path.basename(asset_filepath)
+
+  if asset_filename not in asset_filename_map:
+    # This is an unseen asset. Safe to add.
+    return asset_filename
+
+  other_asset_filepath = asset_filename_map[asset_filename]
+  if other_asset_filepath == asset_filepath:
+    # This is the same file, stored twice in the collection list. No need
+    # to make unique.
+    return asset_filename
+
+  # Else, asset_filename is in the map, and the filepath is different. Dedupe.
+  if not file_io.filecmp(asset_filepath, other_asset_filepath):
+    # Files are different; dedupe filenames.
+    return _get_unique_asset_filename(asset_filename, asset_filename_map)
+
+  # Files are the same; don't make unique.
+  return asset_filename
+
+
+def _get_unique_asset_filename(asset_filename, asset_filename_map):
+  i = 1
+  unique_filename = asset_filename
+  while unique_filename in asset_filename_map:
+    unique_filename = compat.as_bytes("_").join(
+        [compat.as_bytes(asset_filename), compat.as_bytes(str(i))])
+    i += 1
+  return unique_filename
 
 
 def _asset_path_from_tensor(path_tensor):
diff --git a/tensorflow/python/saved_model/loader_impl.py b/tensorflow/python/saved_model/loader_impl.py
index bebf1d5e0d3cc6ac0e431230577704365d37a437..d1bd8d47aee94fd913b807860eff7fa94bb469e5 100644
--- a/tensorflow/python/saved_model/loader_impl.py
+++ b/tensorflow/python/saved_model/loader_impl.py
@@ -79,12 +79,14 @@ def _parse_saved_model(export_dir):
                    constants.SAVED_MODEL_FILENAME_PB))
 
 
-def _get_asset_tensors(export_dir, meta_graph_def_to_load):
+def _get_asset_tensors(export_dir, meta_graph_def_to_load, import_scope=None):
   """Gets the asset tensors, if defined in the meta graph def to load.
 
   Args:
     export_dir: Directory where the SavedModel is located.
     meta_graph_def_to_load: The meta graph def from the SavedModel to be loaded.
+    import_scope: Optional `string` -- if specified, prepend this followed by
+        '/' to all returned asset tensor names.
 
   Returns:
     A dictionary of asset tensors, keyed by the name of the asset tensor. The
@@ -104,7 +106,10 @@ def _get_asset_tensors(export_dir, meta_graph_def_to_load):
     for asset_any_proto in assets_any_proto:
       asset_proto = meta_graph_pb2.AssetFileDef()
       asset_any_proto.Unpack(asset_proto)
-      asset_tensor_dict[asset_proto.tensor_info.name] = os.path.join(
+      tensor_name = asset_proto.tensor_info.name
+      if import_scope:
+        tensor_name = "%s/%s" % (import_scope, tensor_name)
+      asset_tensor_dict[tensor_name] = os.path.join(
           compat.as_bytes(assets_directory),
           compat.as_bytes(asset_proto.filename))
   return asset_tensor_dict
@@ -179,7 +184,7 @@ def maybe_saved_model_directory(export_dir):
 
 
 @tf_export("saved_model.loader.load")
-def load(sess, tags, export_dir, **saver_kwargs):
+def load(sess, tags, export_dir, import_scope=None, **saver_kwargs):
   """Loads the model from a SavedModel as specified by tags.
 
   Args:
@@ -189,6 +194,10 @@ def load(sess, tags, export_dir, **saver_kwargs):
         SavedModel `save()` API.
     export_dir: Directory in which the SavedModel protocol buffer and variables
         to be loaded are located.
+    import_scope: Optional `string` -- if specified, prepend this string
+        followed by '/' to all loaded tensor names. This scope is applied to
+        tensor instances loaded into the passed session, but it is *not* written
+        through to the static `MetaGraphDef` protocol buffer that is returned.
     **saver_kwargs: Optional keyword arguments passed through to Saver.
 
   Returns:
@@ -216,7 +225,8 @@ def load(sess, tags, export_dir, **saver_kwargs):
       )
 
     # Build a saver by importing the meta graph def to load.
-    saver = tf_saver.import_meta_graph(meta_graph_def_to_load, **saver_kwargs)
+    saver = tf_saver.import_meta_graph(
+        meta_graph_def_to_load, import_scope=import_scope, **saver_kwargs)
 
     if saver:
       # Build the checkpoint path where the variables are located.
@@ -232,8 +242,8 @@ def load(sess, tags, export_dir, **saver_kwargs):
                       "checkpoints were restored.")
 
     # Get asset tensors, if any.
-    asset_tensors_dictionary = _get_asset_tensors(export_dir,
-                                                  meta_graph_def_to_load)
+    asset_tensors_dictionary = _get_asset_tensors(
+        export_dir, meta_graph_def_to_load, import_scope=import_scope)
 
     main_op_tensor = (
         _get_main_op_tensor(meta_graph_def_to_load) or
diff --git a/tensorflow/python/saved_model/saved_model_test.py b/tensorflow/python/saved_model/saved_model_test.py
index 1b83d60df926f727be2469a2ea1409bc0c59d6c5..fb4732aca21d4661aaea21a472475690687a42be 100644
--- a/tensorflow/python/saved_model/saved_model_test.py
+++ b/tensorflow/python/saved_model/saved_model_test.py
@@ -44,6 +44,7 @@ from tensorflow.python.saved_model import main_op
 from tensorflow.python.saved_model import signature_def_utils
 from tensorflow.python.saved_model import tag_constants
 from tensorflow.python.training import saver_test_utils
+from tensorflow.python.training import training
 from tensorflow.python.util import compat
 
 SAVED_MODEL_PATH = ("cc/saved_model/testdata/half_plus_two/00000123")
@@ -64,9 +65,12 @@ class SavedModelTest(test.TestCase):
     self.assertEqual(variable_value, v.eval())
 
   def _build_asset_collection(self, asset_file_name, asset_file_contents,
-                              asset_file_tensor_name):
+                              asset_file_tensor_name, asset_subdir=""):
+    parent_dir = os.path.join(
+        compat.as_bytes(test.get_temp_dir()), compat.as_bytes(asset_subdir))
+    file_io.recursive_create_dir(parent_dir)
     asset_filepath = os.path.join(
-        compat.as_bytes(test.get_temp_dir()), compat.as_bytes(asset_file_name))
+        compat.as_bytes(parent_dir), compat.as_bytes(asset_file_name))
     file_io.write_string_to_file(asset_filepath, asset_file_contents)
     asset_file_tensor = constant_op.constant(
         asset_filepath, name=asset_file_tensor_name)
@@ -77,10 +81,11 @@ class SavedModelTest(test.TestCase):
   def _validate_asset_collection(self, export_dir, graph_collection_def,
                                  expected_asset_file_name,
                                  expected_asset_file_contents,
-                                 expected_asset_tensor_name):
+                                 expected_asset_tensor_name,
+                                 asset_id=0):
     assets_any = graph_collection_def[constants.ASSETS_KEY].any_list.value
     asset = meta_graph_pb2.AssetFileDef()
-    assets_any[0].Unpack(asset)
+    assets_any[asset_id].Unpack(asset)
     assets_path = os.path.join(
         compat.as_bytes(export_dir),
         compat.as_bytes(constants.ASSETS_DIRECTORY),
@@ -634,6 +639,141 @@ class SavedModelTest(test.TestCase):
           compat.as_bytes("ignored.txt"))
       self.assertFalse(file_io.file_exists(ignored_asset_path))
 
+  def testAssetsNameCollisionDiffFile(self):
+    export_dir = self._get_export_dir("test_assets_name_collision_diff_file")
+    builder = saved_model_builder.SavedModelBuilder(export_dir)
+
+    with self.test_session(graph=ops.Graph()) as sess:
+      self._init_and_validate_variable(sess, "v", 42)
+
+      asset_collection = self._build_asset_collection(
+          "hello42.txt", "foo bar bak", "asset_file_tensor",
+          asset_subdir="1")
+
+      asset_collection = self._build_asset_collection(
+          "hello42.txt", "foo bar baz", "asset_file_tensor_1",
+          asset_subdir="2")
+
+      builder.add_meta_graph_and_variables(
+          sess, ["foo"], assets_collection=asset_collection)
+
+    # Save the SavedModel to disk.
+    builder.save()
+
+    with self.test_session(graph=ops.Graph()) as sess:
+      foo_graph = loader.load(sess, ["foo"], export_dir)
+      self._validate_asset_collection(export_dir, foo_graph.collection_def,
+                                      "hello42.txt", "foo bar bak",
+                                      "asset_file_tensor:0")
+      self._validate_asset_collection(export_dir, foo_graph.collection_def,
+                                      "hello42.txt_1", "foo bar baz",
+                                      "asset_file_tensor_1:0",
+                                      asset_id=1)
+
+  def testAssetsNameCollisionSameFilepath(self):
+    export_dir = self._get_export_dir("test_assets_name_collision_same_path")
+    builder = saved_model_builder.SavedModelBuilder(export_dir)
+
+    with self.test_session(graph=ops.Graph()) as sess:
+      self._init_and_validate_variable(sess, "v", 42)
+
+      asset_collection = self._build_asset_collection(
+          "hello42.txt", "foo bar baz", "asset_file_tensor")
+
+      asset_collection = self._build_asset_collection(
+          "hello42.txt", "foo bar baz", "asset_file_tensor_1")
+
+      builder.add_meta_graph_and_variables(
+          sess, ["foo"], assets_collection=asset_collection)
+
+    # Save the SavedModel to disk.
+    builder.save()
+
+    with self.test_session(graph=ops.Graph()) as sess:
+      foo_graph = loader.load(sess, ["foo"], export_dir)
+      self._validate_asset_collection(export_dir, foo_graph.collection_def,
+                                      "hello42.txt", "foo bar baz",
+                                      "asset_file_tensor:0")
+      # The second tensor should be recorded, but the same.
+      self._validate_asset_collection(export_dir, foo_graph.collection_def,
+                                      "hello42.txt", "foo bar baz",
+                                      "asset_file_tensor_1:0",
+                                      asset_id=1)
+      ignored_asset_path = os.path.join(
+          compat.as_bytes(export_dir),
+          compat.as_bytes(constants.ASSETS_DIRECTORY),
+          compat.as_bytes("hello42.txt_1"))
+      self.assertFalse(file_io.file_exists(ignored_asset_path))
+
+  def testAssetsNameCollisionSameFile(self):
+    export_dir = self._get_export_dir("test_assets_name_collision_same_file")
+    builder = saved_model_builder.SavedModelBuilder(export_dir)
+
+    with self.test_session(graph=ops.Graph()) as sess:
+      self._init_and_validate_variable(sess, "v", 42)
+
+      asset_collection = self._build_asset_collection(
+          "hello42.txt", "foo bar baz", "asset_file_tensor",
+          asset_subdir="1")
+
+      asset_collection = self._build_asset_collection(
+          "hello42.txt", "foo bar baz", "asset_file_tensor_1",
+          asset_subdir="2")
+
+      builder.add_meta_graph_and_variables(
+          sess, ["foo"], assets_collection=asset_collection)
+
+    # Save the SavedModel to disk.
+    builder.save()
+
+    with self.test_session(graph=ops.Graph()) as sess:
+      foo_graph = loader.load(sess, ["foo"], export_dir)
+      self._validate_asset_collection(export_dir, foo_graph.collection_def,
+                                      "hello42.txt", "foo bar baz",
+                                      "asset_file_tensor:0")
+      # The second tensor should be recorded, but the same.
+      self._validate_asset_collection(export_dir, foo_graph.collection_def,
+                                      "hello42.txt", "foo bar baz",
+                                      "asset_file_tensor_1:0",
+                                      asset_id=1)
+      ignored_asset_path = os.path.join(
+          compat.as_bytes(export_dir),
+          compat.as_bytes(constants.ASSETS_DIRECTORY),
+          compat.as_bytes("hello42.txt_1"))
+      self.assertFalse(file_io.file_exists(ignored_asset_path))
+
+  def testAssetsNameCollisionManyFiles(self):
+    export_dir = self._get_export_dir("test_assets_name_collision_many_files")
+    builder = saved_model_builder.SavedModelBuilder(export_dir)
+
+    with self.test_session(graph=ops.Graph()) as sess:
+      self._init_and_validate_variable(sess, "v", 42)
+
+      for i in range(5):
+        idx = str(i)
+        asset_collection = self._build_asset_collection(
+            "hello42.txt", "foo bar baz " + idx, "asset_file_tensor_" + idx,
+            asset_subdir=idx)
+
+      builder.add_meta_graph_and_variables(
+          sess, ["foo"], assets_collection=asset_collection)
+
+    # Save the SavedModel to disk.
+    builder.save()
+
+    with self.test_session(graph=ops.Graph()) as sess:
+      foo_graph = loader.load(sess, ["foo"], export_dir)
+      for i in range(1, 5):
+        idx = str(i)
+        self._validate_asset_collection(
+            export_dir, foo_graph.collection_def, "hello42.txt_" + idx,
+            "foo bar baz " + idx, "asset_file_tensor_{}:0".format(idx),
+            asset_id=i)
+
+      self._validate_asset_collection(export_dir, foo_graph.collection_def,
+                                      "hello42.txt", "foo bar baz 0",
+                                      "asset_file_tensor_0:0")
+
   def testCustomMainOp(self):
     export_dir = self._get_export_dir("test_main_op")
     builder = saved_model_builder.SavedModelBuilder(export_dir)
@@ -983,6 +1123,133 @@ class SavedModelTest(test.TestCase):
       self.assertEqual(b"k1", v1.keys().eval())
       self.assertEqual(3.0, v1.values().eval())
 
+  def testCustomSaver(self):
+    export_dir = self._get_export_dir("test_custom_saver")
+    builder = saved_model_builder.SavedModelBuilder(export_dir)
+
+    with self.test_session(graph=ops.Graph()) as sess:
+      variables.Variable(1, name="v1")
+      sess.run(variables.global_variables_initializer())
+      custom_saver = training.Saver(name="my_saver")
+      builder.add_meta_graph_and_variables(sess, ["tag"], saver=custom_saver)
+
+    # Save the SavedModel to disk.
+    builder.save()
+
+    with ops.Graph().as_default() as graph:
+      with self.test_session(graph=graph) as sess:
+        saved_graph = loader.load(sess, ["tag"], export_dir)
+        graph_ops = [x.name for x in graph.get_operations()]
+        self.assertTrue("my_saver/restore_all" in graph_ops)
+        self.assertFalse("save/restore_all" in graph_ops)
+        self.assertEqual(
+            saved_graph.saver_def.restore_op_name, "my_saver/restore_all")
+
+  def testNoCustomSaver(self):
+    export_dir = self._get_export_dir("test_no_custom_saver")
+    builder = saved_model_builder.SavedModelBuilder(export_dir)
+
+    with self.test_session(graph=ops.Graph()) as sess:
+      variables.Variable(1, name="v1")
+      sess.run(variables.global_variables_initializer())
+      training.Saver(name="my_saver")
+      builder.add_meta_graph_and_variables(sess, ["tag"])
+
+    # Save the SavedModel to disk.
+    builder.save()
+
+    with ops.Graph().as_default() as graph:
+      with self.test_session(graph=graph) as sess:
+        saved_graph = loader.load(sess, ["tag"], export_dir)
+        graph_ops = [x.name for x in graph.get_operations()]
+        self.assertTrue("my_saver/restore_all" in graph_ops)
+        self.assertTrue("save/restore_all" in graph_ops)
+        self.assertEqual(
+            saved_graph.saver_def.restore_op_name, "save/restore_all")
+
+  def testMultipleCustomSavers(self):
+    export_dir = self._get_export_dir("test_multiple_custom_savers")
+    builder = saved_model_builder.SavedModelBuilder(export_dir)
+
+    with self.test_session(graph=ops.Graph()) as sess:
+      variables.Variable(1, name="v1")
+      sess.run(variables.global_variables_initializer())
+      builder.add_meta_graph_and_variables(sess, ["tag_0"])
+
+      saver_1 = training.Saver()
+      builder.add_meta_graph(["tag_1"], saver=saver_1)
+
+      saver_2 = training.Saver()
+      builder.add_meta_graph(["tag_2"], saver=saver_2)
+
+    # Save the SavedModel to disk.
+    builder.save()
+
+    def _validate_custom_saver(tag_name, saver_name):
+      with ops.Graph().as_default() as graph:
+        with self.test_session(graph=graph) as sess:
+          saved_graph = loader.load(sess, [tag_name], export_dir)
+          self.assertEqual(
+              saved_graph.saver_def.restore_op_name,
+              saver_name)
+
+    _validate_custom_saver("tag_0", "save/restore_all")
+    _validate_custom_saver("tag_1", "save_1/restore_all")
+    _validate_custom_saver("tag_2", "save_2/restore_all")
+
+  def testImportScope(self):
+    export_dir = self._get_export_dir("test_scoped_assets")
+    builder = saved_model_builder.SavedModelBuilder(export_dir)
+
+    # Build a SavedModel with a variable, an asset, and a constant tensor.
+    with self.test_session(graph=ops.Graph()) as sess:
+      self._init_and_validate_variable(sess, "v", 42)
+      asset_collection = self._build_asset_collection("foo.txt", "content_foo",
+                                                      "asset_file_tensor")
+      constant_op.constant("constant value", name="constant_tensor_name")
+      builder.add_meta_graph_and_variables(
+          sess, ["tag_name"], assets_collection=asset_collection)
+
+      # Save the asset file path for later comparison.
+      asset_file_path = asset_collection[0].eval()
+
+    # Save the SavedModel to disk.
+    builder.save()
+
+    with self.test_session(graph=ops.Graph()) as sess:
+      # Restore the SavedModel under an import_scope in a new graph/session.
+      graph_proto = loader.load(
+          sess, ["tag_name"], export_dir, import_scope="scope_name")
+
+      # The loaded variable tensor should be scoped, but its contents should be
+      # unchanged.
+      self.assertEqual(
+          "scope_name/v:0",
+          ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES)[0].name)
+      self.assertEqual(
+          42,
+          ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES)[0].eval())
+
+      # The loaded asset tensor should be scoped, but the asset file path and
+      # contents should be unchanged.
+      asset_collection = ops.get_collection(ops.GraphKeys.ASSET_FILEPATHS)
+      self.assertEqual(1, len(asset_collection))
+      self.assertEqual(asset_file_path, asset_collection[0].eval())
+      self.assertEqual("scope_name/asset_file_tensor:0",
+                       asset_collection[0].name)
+      # The static asset data inside graph_proto.collection_def should not be
+      # scoped.
+      self._validate_asset_collection(export_dir, graph_proto.collection_def,
+                                      "foo.txt", "content_foo",
+                                      "asset_file_tensor:0")
+
+      # The constant tensor should be scoped, but its contents should be
+      # unchanged.
+      self.assertEqual(
+          compat.as_bytes("constant value"),
+          ops.get_default_graph().get_tensor_by_name(
+              "scope_name/constant_tensor_name:0").eval())
+
   def testClearDevices(self):
     export_dir = self._get_export_dir("test_clear_devices")
     builder = saved_model_builder.SavedModelBuilder(export_dir)
diff --git a/tensorflow/python/tools/import_pb_to_tensorboard.py b/tensorflow/python/tools/import_pb_to_tensorboard.py
old mode 100755
new mode 100644
diff --git a/tensorflow/python/training/adadelta.py b/tensorflow/python/training/adadelta.py
index c08e3cca007dc17f1112d53bf729c1accf61b5df..95eca76496992f7ac66643a4c94d7e9e812cecf8 100644
--- a/tensorflow/python/training/adadelta.py
+++ b/tensorflow/python/training/adadelta.py
@@ -46,6 +46,13 @@ class AdadeltaOptimizer(optimizer.Optimizer):
       use_locking: If `True` use locks for update operations.
       name: Optional name prefix for the operations created when applying
         gradients.  Defaults to "Adadelta".
+
+    @compatibility(eager)
+    When eager execution is enabled, `learning_rate`, `rho`, and `epsilon` can
+    each be a callable that takes no arguments and returns the actual value to
+    use. This can be useful for changing these values across different
+    invocations of optimizer functions.
+    @end_compatibility
     """
     super(AdadeltaOptimizer, self).__init__(use_locking, name)
     self._lr = learning_rate
@@ -63,9 +70,13 @@ class AdadeltaOptimizer(optimizer.Optimizer):
       self._zeros_slot(v, "accum_update", self._name)
 
   def _prepare(self):
-    self._lr_t = ops.convert_to_tensor(self._lr, name="lr")
-    self._rho_t = ops.convert_to_tensor(self._rho, name="rho")
-    self._epsilon_t = ops.convert_to_tensor(self._epsilon, name="epsilon")
+    lr = self._call_if_callable(self._lr)
+    rho = self._call_if_callable(self._rho)
+    epsilon = self._call_if_callable(self._epsilon)
+
+    self._lr_t = ops.convert_to_tensor(lr, name="lr")
+    self._rho_t = ops.convert_to_tensor(rho, name="rho")
+    self._epsilon_t = ops.convert_to_tensor(epsilon, name="epsilon")
 
   def _apply_dense(self, grad, var):
     accum = self.get_slot(var, "accum")
diff --git a/tensorflow/python/training/adadelta_test.py b/tensorflow/python/training/adadelta_test.py
index 50f435236b41fcda7ab5ea37a4e96b72dd1043e7..2678016d24b99b30cbf7021d67e33910051e2561 100644
--- a/tensorflow/python/training/adadelta_test.py
+++ b/tensorflow/python/training/adadelta_test.py
@@ -20,8 +20,10 @@ from __future__ import print_function
 
 import numpy as np
 
+from tensorflow.python.eager import context
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import test_util
 from tensorflow.python.ops import embedding_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import resource_variable_ops
@@ -32,44 +34,52 @@ from tensorflow.python.training import adadelta
 
 class AdadeltaOptimizerTest(test.TestCase):
 
-  def doTestBasic(self, use_resource=False):
+  def doTestBasic(self, use_resource=False, use_callable_params=False):
     num_updates = 4  # number of ADADELTA steps to perform
     for dtype in [dtypes.half, dtypes.float32]:
       for grad in [0.2, 0.1, 0.01]:
         for lr in [1.0, 0.5, 0.1]:
-          with self.test_session():
-            var0_init = [1.0, 2.0]
-            var1_init = [3.0, 4.0]
-            if use_resource:
-              var0 = resource_variable_ops.ResourceVariable(
-                  var0_init, dtype=dtype)
-              var1 = resource_variable_ops.ResourceVariable(
-                  var1_init, dtype=dtype)
-            else:
-              var0 = variables.Variable(var0_init, dtype=dtype)
-              var1 = variables.Variable(var1_init, dtype=dtype)
-
-            grads = constant_op.constant([grad, grad], dtype=dtype)
-
-            accum = 0.0
-            accum_update = 0.0
-
-            # ADADELTA gradient optimizer
-            rho = 0.95
-            epsilon = 1e-8
-            adadelta_opt = adadelta.AdadeltaOptimizer(lr, rho, epsilon)
+          var0_init = [1.0, 2.0]
+          var1_init = [3.0, 4.0]
+          if use_resource:
+            var0 = resource_variable_ops.ResourceVariable(
+                var0_init, dtype=dtype)
+            var1 = resource_variable_ops.ResourceVariable(
+                var1_init, dtype=dtype)
+          else:
+            var0 = variables.Variable(var0_init, dtype=dtype)
+            var1 = variables.Variable(var1_init, dtype=dtype)
+
+          grads = constant_op.constant([grad, grad], dtype=dtype)
+
+          accum = 0.0
+          accum_update = 0.0
+
+          # ADADELTA gradient optimizer
+          rho = 0.95
+          epsilon = 1e-8
+          if use_callable_params:
+            adadelta_opt = adadelta.AdadeltaOptimizer(
+                learning_rate=lambda: lr,  # pylint: disable=cell-var-from-loop
+                rho=lambda: rho,  # pylint: disable=cell-var-from-loop
+                epsilon=lambda: epsilon)  # pylint: disable=cell-var-from-loop
+          else:
+            adadelta_opt = adadelta.AdadeltaOptimizer(
+                learning_rate=lr, rho=rho, epsilon=epsilon)
+          if not context.executing_eagerly():
             adadelta_update = adadelta_opt.apply_gradients(
                 zip([grads, grads], [var0, var1]))
+            self.evaluate(variables.global_variables_initializer())
 
+            # TODO(lxuechen): This is hard to test in eager mode,
+            # since the optimizer is not fully initialized until the first
+            # call to `apply_gradients`
             opt_vars = adadelta_opt.variables()
             self.assertStartsWith(opt_vars[0].name, var0._shared_name)
             self.assertStartsWith(opt_vars[1].name, var0._shared_name)
             self.assertStartsWith(opt_vars[2].name, var1._shared_name)
             self.assertStartsWith(opt_vars[3].name, var1._shared_name)
             self.assertEqual(4, len(opt_vars))
-
-            variables.global_variables_initializer().run()
-
             # Assign slots
             slot = [None] * 2
             slot_update = [None] * 2
@@ -91,36 +101,42 @@ class AdadeltaOptimizerTest(test.TestCase):
             self.assertEquals(slot_update[1].get_shape(), var1.get_shape())
             self.assertFalse(slot_update[1] in variables.trainable_variables())
 
-            # Fetch params to validate initial values
-            self.assertAllClose(var0_init, var0.eval())
-            self.assertAllClose(var1_init, var1.eval())
-
-            update = [None] * num_updates
-            tot_update = 0
-            for step in range(num_updates):
-              # Run adadelta update for comparison
-              adadelta_update.run()
-
-              # Perform initial update without previous accum values
-              accum = accum * rho + (grad**2) * (1 - rho)
-              update[step] = (np.sqrt(accum_update + epsilon) *
-                              (1. / np.sqrt(accum + epsilon)) * grad)
-              accum_update = (accum_update * rho + (update[step]**2) *
-                              (1.0 - rho))
-              tot_update += update[step] * lr
+          # Fetch params to validate initial values
+          self.assertAllClose(var0_init, self.evaluate(var0))
+          self.assertAllClose(var1_init, self.evaluate(var1))
 
+          update = [None] * num_updates
+          tot_update = 0
+          for step in range(num_updates):
+            # Run adadelta update for comparison
+            if not context.executing_eagerly():
+              self.evaluate(adadelta_update)
+            else:
+              adadelta_opt.apply_gradients(zip([grads, grads], [var0, var1]))
+
+            # Perform initial update without previous accum values
+            accum = accum * rho + (grad**2) * (1 - rho)
+            update[step] = (
+                np.sqrt(accum_update + epsilon) *
+                (1. / np.sqrt(accum + epsilon)) * grad)
+            accum_update = (
+                accum_update * rho + (update[step]**2) * (1.0 - rho))
+            tot_update += update[step] * lr
+
+            if not context.executing_eagerly():
               # Check that the accumulators have been updated
+              # TODO(lxuechen): This is hard to test in eager mode
               for slot_idx in range(2):
                 self.assertAllCloseAccordingToType(
                     np.array([accum, accum], dtype=dtype.as_numpy_dtype()),
-                    slot[slot_idx].eval(),
+                    self.evaluate(slot[slot_idx]),
                     rtol=1e-5)
 
                 self.assertAllCloseAccordingToType(
                     np.array(
                         [accum_update, accum_update],
                         dtype=dtype.as_numpy_dtype()),
-                    slot_update[slot_idx].eval(),
+                    self.evaluate(slot_update[slot_idx]),
                     rtol=1e-5)
 
               # Check that the parameters have been updated
@@ -128,22 +144,28 @@ class AdadeltaOptimizerTest(test.TestCase):
                   np.array(
                       [var0_init[0] - tot_update, var0_init[1] - tot_update],
                       dtype=dtype.as_numpy_dtype()),
-                  var0.eval(),
+                  self.evaluate(var0),
                   rtol=1e-5)
 
               self.assertAllCloseAccordingToType(
                   np.array(
                       [var1_init[0] - tot_update, var1_init[1] - tot_update],
                       dtype=dtype.as_numpy_dtype()),
-                  var1.eval(),
+                  self.evaluate(var1),
                   rtol=1e-5)
 
   def testBasic(self):
-    self.doTestBasic(use_resource=False)
+    with self.test_session():
+      self.doTestBasic(use_resource=False)
 
+  @test_util.run_in_graph_and_eager_modes(reset_test=True)
   def testResourceBasic(self):
     self.doTestBasic(use_resource=True)
 
+  def testBasicCallableParams(self):
+    with context.eager_mode():
+      self.doTestBasic(use_resource=True, use_callable_params=True)
+
   def testMinimizeSparseResourceVariable(self):
     for dtype in [dtypes.half, dtypes.float32, dtypes.float64]:
       with self.test_session():
diff --git a/tensorflow/python/training/adagrad.py b/tensorflow/python/training/adagrad.py
index deb4e6f546379eff330235dbc302a30c44193830..6778f3c735a70fc32ed299bc9d800b270f06cc66 100644
--- a/tensorflow/python/training/adagrad.py
+++ b/tensorflow/python/training/adagrad.py
@@ -51,6 +51,13 @@ class AdagradOptimizer(optimizer.Optimizer):
 
     Raises:
       ValueError: If the `initial_accumulator_value` is invalid.
+
+    @compatibility(eager)
+    When eager execution is enabled, `learning_rate` can be a callable that
+    takes no arguments and returns the actual value to use. This can be useful
+    for changing these values across different invocations of optimizer
+    functions.
+    @end_compatibility
     """
     if initial_accumulator_value <= 0.0:
       raise ValueError("initial_accumulator_value must be positive: %s" %
@@ -78,8 +85,9 @@ class AdagradOptimizer(optimizer.Optimizer):
                                               "accumulator", self._name)
 
   def _prepare(self):
-    self._learning_rate_tensor = ops.convert_to_tensor(self._learning_rate,
-                                                       name="learning_rate")
+    learning_rate = self._call_if_callable(self._learning_rate)
+    self._learning_rate_tensor = ops.convert_to_tensor(
+        learning_rate, name="learning_rate")
 
   def _apply_dense(self, grad, var):
     acc = self.get_slot(var, "accumulator")
diff --git a/tensorflow/python/training/adagrad_test.py b/tensorflow/python/training/adagrad_test.py
index 15b007b46dea6b3125c5f7bffe8782594bb23692..c9aec33d0916781e3d1a41b996083da92a4ae839 100644
--- a/tensorflow/python/training/adagrad_test.py
+++ b/tensorflow/python/training/adagrad_test.py
@@ -20,9 +20,11 @@ from __future__ import print_function
 
 import numpy as np
 
+from tensorflow.python.eager import context
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
+from tensorflow.python.framework import test_util
 from tensorflow.python.ops import embedding_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import resource_variable_ops
@@ -34,40 +36,63 @@ from tensorflow.python.training import adagrad
 
 class AdagradOptimizerTest(test.TestCase):
 
-  def doTestBasic(self, use_locking=False, use_resource=False):
+  def doTestBasic(self,
+                  use_locking=False,
+                  use_resource=False,
+                  use_callable_params=False):
     for dtype in [dtypes.half, dtypes.float32, dtypes.float64]:
-      with self.test_session():
-        if use_resource:
-          var0 = resource_variable_ops.ResourceVariable([1.0, 2.0], dtype=dtype)
-          var1 = resource_variable_ops.ResourceVariable([3.0, 4.0], dtype=dtype)
-        else:
-          var0 = variables.Variable([1.0, 2.0], dtype=dtype)
-          var1 = variables.Variable([3.0, 4.0], dtype=dtype)
-        grads0 = constant_op.constant([0.1, 0.1], dtype=dtype)
-        grads1 = constant_op.constant([0.01, 0.01], dtype=dtype)
-        ada_opt = adagrad.AdagradOptimizer(
-            3.0, initial_accumulator_value=0.1, use_locking=use_locking)
+      if use_resource:
+        var0 = resource_variable_ops.ResourceVariable([1.0, 2.0], dtype=dtype)
+        var1 = resource_variable_ops.ResourceVariable([3.0, 4.0], dtype=dtype)
+      else:
+        var0 = variables.Variable([1.0, 2.0], dtype=dtype)
+        var1 = variables.Variable([3.0, 4.0], dtype=dtype)
+      grads0 = constant_op.constant([0.1, 0.1], dtype=dtype)
+      grads1 = constant_op.constant([0.01, 0.01], dtype=dtype)
+
+      learning_rate = lambda: 3.0
+      if not use_callable_params:
+        learning_rate = learning_rate()
+
+      ada_opt = adagrad.AdagradOptimizer(
+          learning_rate, initial_accumulator_value=0.1, use_locking=use_locking)
+
+      if not context.executing_eagerly():
         ada_update = ada_opt.apply_gradients(
             zip([grads0, grads1], [var0, var1]))
-        variables.global_variables_initializer().run()
-        # Fetch params to validate initial values
-        self.assertAllClose([1.0, 2.0], var0.eval())
-        self.assertAllClose([3.0, 4.0], var1.eval())
-        # Run 3 steps of adagrad
-        for _ in range(3):
-          ada_update.run()
-        # Validate updated params
-        self.assertAllCloseAccordingToType(
-            np.array([-1.6026098728179932, -0.6026098728179932]), var0.eval())
-        self.assertAllCloseAccordingToType(
-            np.array([2.715679168701172, 3.715679168701172]), var1.eval())
+        self.evaluate(variables.global_variables_initializer())
+
+      # Fetch params to validate initial values
+      v0_val, v1_val = self.evaluate([var0, var1])
+      self.assertAllClose([1.0, 2.0], v0_val)
+      self.assertAllClose([3.0, 4.0], v1_val)
+
+      # Run 3 steps of adagrad
+      for _ in range(3):
+        if not context.executing_eagerly():
+          self.evaluate(ada_update)
+        else:
+          ada_opt.apply_gradients(zip([grads0, grads1], [var0, var1]))
+
+      # Validate updated params
+      v0_val, v1_val = self.evaluate([var0, var1])
+      self.assertAllCloseAccordingToType(
+          np.array([-1.6026098728179932, -0.6026098728179932]), v0_val)
+      self.assertAllCloseAccordingToType(
+          np.array([2.715679168701172, 3.715679168701172]), v1_val)
 
   def testBasic(self):
     self.doTestBasic(use_locking=False)
 
+  @test_util.run_in_graph_and_eager_modes(reset_test=True)
   def testBasicResource(self):
     self.doTestBasic(use_locking=False, use_resource=True)
 
+  def testBasicCallableParams(self):
+    with context.eager_mode():
+      self.doTestBasic(
+          use_locking=False, use_resource=True, use_callable_params=True)
+
   def testBasicLocked(self):
     self.doTestBasic(use_locking=True)
 
diff --git a/tensorflow/python/training/adam.py b/tensorflow/python/training/adam.py
index 6fa3ff66583ce07a6ee7b0d8158c851ea578637c..b65c88e972454da14dc5161a19cd26280d51d28f 100644
--- a/tensorflow/python/training/adam.py
+++ b/tensorflow/python/training/adam.py
@@ -85,6 +85,13 @@ class AdamOptimizer(optimizer.Optimizer):
       use_locking: If True use locks for update operations.
       name: Optional name for the operations created when applying gradients.
         Defaults to "Adam".
+
+    @compatibility(eager)
+    When eager execution is enabled, `learning_rate`, `beta1`, `beta2`, and
+    `epsilon` can each be a callable that takes no arguments and returns the
+    actual value to use. This can be useful for changing these values across
+    different invocations of optimizer functions.
+    @end_compatibility
     """
     super(AdamOptimizer, self).__init__(use_locking, name)
     self._lr = learning_rate
@@ -128,10 +135,15 @@ class AdamOptimizer(optimizer.Optimizer):
       self._zeros_slot(v, "v", self._name)
 
   def _prepare(self):
-    self._lr_t = ops.convert_to_tensor(self._lr, name="learning_rate")
-    self._beta1_t = ops.convert_to_tensor(self._beta1, name="beta1")
-    self._beta2_t = ops.convert_to_tensor(self._beta2, name="beta2")
-    self._epsilon_t = ops.convert_to_tensor(self._epsilon, name="epsilon")
+    lr = self._call_if_callable(self._lr)
+    beta1 = self._call_if_callable(self._beta1)
+    beta2 = self._call_if_callable(self._beta2)
+    epsilon = self._call_if_callable(self._epsilon)
+
+    self._lr_t = ops.convert_to_tensor(lr, name="learning_rate")
+    self._beta1_t = ops.convert_to_tensor(beta1, name="beta1")
+    self._beta2_t = ops.convert_to_tensor(beta2, name="beta2")
+    self._epsilon_t = ops.convert_to_tensor(epsilon, name="epsilon")
 
   def _apply_dense(self, grad, var):
     m = self.get_slot(var, "m")
diff --git a/tensorflow/python/training/adam_test.py b/tensorflow/python/training/adam_test.py
index 9be8b6aafefa33977511cde24dd2e87dd6c3b81a..ccdc7e384da2ae792a681298c7076fc582d362df 100644
--- a/tensorflow/python/training/adam_test.py
+++ b/tensorflow/python/training/adam_test.py
@@ -150,7 +150,7 @@ class AdamOptimizerTest(test.TestCase):
           self.assertAllClose(aggregated_update_var.eval(),
                               repeated_index_update_var.eval())
 
-  def doTestBasic(self, use_resource=False):
+  def doTestBasic(self, use_resource=False, use_callable_params=False):
     for i, dtype in enumerate([dtypes.half, dtypes.float32, dtypes.float64]):
       with self.test_session(graph=ops.Graph()):
         # Initialize variables for numpy implementation.
@@ -171,7 +171,17 @@ class AdamOptimizerTest(test.TestCase):
         grads0 = constant_op.constant(grads0_np)
         grads1 = constant_op.constant(grads1_np)
 
-        opt = adam.AdamOptimizer()
+        learning_rate = lambda: 0.001
+        beta1 = lambda: 0.9
+        beta2 = lambda: 0.999
+        epsilon = lambda: 1e-8
+        if not use_callable_params:
+          learning_rate = learning_rate()
+          beta1 = beta1()
+          beta2 = beta2()
+          epsilon = epsilon()
+
+        opt = adam.AdamOptimizer(learning_rate=learning_rate)
         update = opt.apply_gradients(zip([grads0, grads1], [var0, var1]))
         opt_variables = opt.variables()
         beta1_power, beta2_power = opt._get_beta_accumulators()
@@ -180,11 +190,10 @@ class AdamOptimizerTest(test.TestCase):
         self.assertIn(beta1_power, opt_variables)
         self.assertIn(beta2_power, opt_variables)
 
-        with ops.Graph().as_default():
-          # Shouldn't return non-slot variables from other graphs.
-          self.assertEqual(0, len(opt.variables()))
-
         if not context.executing_eagerly():
+          with ops.Graph().as_default():
+            # Shouldn't return non-slot variables from other graphs.
+            self.assertEqual(0, len(opt.variables()))
           self.evaluate(variables.global_variables_initializer())
           # Fetch params to validate initial values
           self.assertAllClose([1.0, 2.0], self.evaluate(var0))
@@ -222,6 +231,10 @@ class AdamOptimizerTest(test.TestCase):
   def testResourceBasic(self):
     self.doTestBasic(use_resource=True)
 
+  def testBasicCallableParams(self):
+    with context.eager_mode():
+      self.doTestBasic(use_resource=True, use_callable_params=True)
+
   def testTensorLearningRate(self):
     for dtype in [dtypes.half, dtypes.float32, dtypes.float64]:
       with self.test_session():
diff --git a/tensorflow/python/training/basic_session_run_hooks.py b/tensorflow/python/training/basic_session_run_hooks.py
index df528d54d6503ef1626d0b3bc4b5afe9e0616c31..b0dd188db14a46aae44f8150095cf9ed337ee8a7 100644
--- a/tensorflow/python/training/basic_session_run_hooks.py
+++ b/tensorflow/python/training/basic_session_run_hooks.py
@@ -28,6 +28,7 @@ from tensorflow.core.framework.summary_pb2 import Summary
 from tensorflow.core.protobuf import config_pb2
 from tensorflow.core.util.event_pb2 import SessionLog
 from tensorflow.python.client import timeline
+from tensorflow.python.framework import errors
 from tensorflow.python.framework import meta_graph
 from tensorflow.python.framework import ops
 from tensorflow.python.platform import gfile
@@ -336,6 +337,8 @@ class CheckpointSaverListener(object):
 
     def after_save(self, session, global_step_value):
       print('Done writing checkpoint.')
+      if decided_to_stop_training():
+        return True
 
     def end(self, session, global_step_value):
       print('Done with the session.')
@@ -354,6 +357,11 @@ class CheckpointSaverListener(object):
   implementors should implement the `end()` method to handle actions related to
   the last checkpoint save. But the listener should not act twice if
   `after_save()` already handled this last checkpoint save.
+
+  A `CheckpointSaverListener` can request training to be stopped, by returning
+  True in `after_save`. Please note that, in replicated distributed training
+  setting, only `chief` should use this behavior. Otherwise each worker will do
+  their own evaluation, which may be wasteful of resources.
   """
 
   def begin(self):
@@ -453,7 +461,8 @@ class CheckpointSaverHook(session_run_hook.SessionRunHook):
       global_step = run_context.session.run(self._global_step_tensor)
       if self._timer.should_trigger_for_step(global_step):
         self._timer.update_last_triggered_step(global_step)
-        self._save(run_context.session, global_step)
+        if self._save(run_context.session, global_step):
+          run_context.request_stop()
 
   def end(self, session):
     last_step = session.run(self._global_step_tensor)
@@ -463,7 +472,7 @@ class CheckpointSaverHook(session_run_hook.SessionRunHook):
       l.end(session, last_step)
 
   def _save(self, session, step):
-    """Saves the latest checkpoint."""
+    """Saves the latest checkpoint, returns should_stop."""
     logging.info("Saving checkpoints for %d into %s.", step, self._save_path)
 
     for l in self._listeners:
@@ -475,8 +484,14 @@ class CheckpointSaverHook(session_run_hook.SessionRunHook):
             status=SessionLog.CHECKPOINT, checkpoint_path=self._save_path),
         step)
 
+    should_stop = False
     for l in self._listeners:
-      l.after_save(session, step)
+      if l.after_save(session, step):
+        logging.info(
+            "A CheckpointSaverListener requested that training be stopped. "
+            "listener: {}".format(l))
+        should_stop = True
+    return should_stop
 
   def _get_saver(self):
     if self._saver is not None:
@@ -804,8 +819,25 @@ class FinalOpsHook(session_run_hook.SessionRunHook):
 
   def end(self, session):
     if self._final_ops is not None:
-      self._final_ops_values = session.run(self._final_ops,
-                                           feed_dict=self._final_ops_feed_dict)
+      try:
+        self._final_ops_values = session.run(
+            self._final_ops, feed_dict=self._final_ops_feed_dict)
+      except (errors.OutOfRangeError, StopIteration) as e:
+        logging.warning(
+            "An OutOfRangeError or StopIteration exception is raised by the "
+            "code in FinalOpsHook. This typically means the Ops running by the "
+            "FinalOpsHook have a dependency back to some input source, which "
+            "should not happen. For example, for metrics in "
+            "tf.estimator.Estimator, all metrics functions return two Ops: "
+            "`value_op` and  `update_op`. Estimator.evaluate calls the "
+            "`update_op` for each batch of the data in input source and, once "
+            "it is exhausted, it call the `value_op` to get the metric values. "
+            "The `value_op` here should have dependency back to variables "
+            "reading only, rather than reading another batch from input. "
+            "Otherwise, the `value_op`, executed by `FinalOpsHook`, triggers "
+            "another data reading, which ends OutOfRangeError/StopIteration. "
+            "Please fix that.")
+        raise e
 
 
 @tf_export("train.FeedFnHook")
diff --git a/tensorflow/python/training/basic_session_run_hooks_test.py b/tensorflow/python/training/basic_session_run_hooks_test.py
index 7344ce2758658e3a39add365dd7f089e693c4c4a..b49a871a56a3402fc46a0185d02e6099b6e69f79 100644
--- a/tensorflow/python/training/basic_session_run_hooks_test.py
+++ b/tensorflow/python/training/basic_session_run_hooks_test.py
@@ -29,8 +29,10 @@ from tensorflow.contrib.framework.python.framework import checkpoint_utils
 from tensorflow.contrib.framework.python.ops import variables
 from tensorflow.contrib.testing.python.framework import fake_summary_writer
 from tensorflow.python.client import session as session_lib
+from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import errors
 from tensorflow.python.framework import meta_graph
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import array_ops
@@ -58,6 +60,7 @@ class MockCheckpointSaverListener(
     self.before_save_count = 0
     self.after_save_count = 0
     self.end_count = 0
+    self.ask_for_stop = False
 
   def begin(self):
     self.begin_count += 1
@@ -67,6 +70,8 @@ class MockCheckpointSaverListener(
 
   def after_save(self, session, global_step):
     self.after_save_count += 1
+    if self.ask_for_stop:
+      return True
 
   def end(self, session, global_step):
     self.end_count += 1
@@ -471,6 +476,25 @@ class CheckpointSaverHookTest(test.TestCase):
         'end': 1
     }, listener_counts)
 
+  def test_listener_stops_training_in_after_save(self):
+    with ops.Graph().as_default():
+      scaffold = monitored_session.Scaffold()
+      variables.get_or_create_global_step()
+      train_op = training_util._increment_global_step(1)
+      listener = MockCheckpointSaverListener()
+      hook = basic_session_run_hooks.CheckpointSaverHook(
+          self.model_dir, save_steps=1, scaffold=scaffold, listeners=[listener])
+      with monitored_session.SingularMonitoredSession(
+          hooks=[hook], scaffold=scaffold,
+          checkpoint_dir=self.model_dir) as sess:
+        sess.run(train_op)
+        self.assertFalse(sess.should_stop())
+        sess.run(train_op)
+        self.assertFalse(sess.should_stop())
+        listener.ask_for_stop = True
+        sess.run(train_op)
+        self.assertTrue(sess.should_stop())
+
   def test_listener_with_default_saver(self):
     with ops.Graph().as_default():
       global_step = variables.get_or_create_global_step()
@@ -1306,6 +1330,26 @@ class FinalOpsHookTest(test.TestCase):
         self.assertListEqual(expected_values,
                              hook.final_ops_values.tolist())
 
+  def test_final_ops_triggers_out_of_range_error(self):
+    with ops.Graph().as_default():
+      dataset = dataset_ops.Dataset.range(1)
+      iterator = dataset.make_one_shot_iterator()
+      read_ops = iterator.get_next()
+      final_ops = read_ops
+
+      hook = basic_session_run_hooks.FinalOpsHook(final_ops)
+      hook.begin()
+
+      with session_lib.Session() as session:
+        session.run(read_ops)
+        with test.mock.patch.object(tf_logging, 'warning') as mock_log:
+          with self.assertRaisesRegexp(errors.OutOfRangeError,
+                                       'End of sequence'):
+            hook.end(session)
+          self.assertRegexpMatches(
+              str(mock_log.call_args),
+              'dependency back to some input source')
+
   def test_final_ops_with_dictionary(self):
     with ops.Graph().as_default():
       expected_values = [4, -3]
diff --git a/tensorflow/python/training/checkpointable/BUILD b/tensorflow/python/training/checkpointable/BUILD
index a7ae6e50a9975d8d164d5c7455aedb8b39ed802e..87ba4dc91c89e03ac5f2a93bedca81878f5254a6 100644
--- a/tensorflow/python/training/checkpointable/BUILD
+++ b/tensorflow/python/training/checkpointable/BUILD
@@ -22,8 +22,9 @@ py_library(
         "//tensorflow/python:constant_op",
         "//tensorflow/python:control_flow_ops",
         "//tensorflow/python:dtypes",
+        "//tensorflow/python:framework_ops",
         "//tensorflow/python:io_ops_gen",
-        "//tensorflow/python:ops",
+        "//tensorflow/python:platform",
         "//tensorflow/python:saveable_object",
         "//tensorflow/python:util",
         "//tensorflow/python/eager:context",
@@ -40,6 +41,42 @@ py_test(
     ],
 )
 
+py_library(
+    name = "data_structures_base",
+    srcs = ["data_structures_base.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":base",
+    ],
+)
+
+py_library(
+    name = "data_structures",
+    srcs = ["data_structures.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":base",
+        ":data_structures_base",
+    ],
+)
+
+py_test(
+    name = "data_structures_test",
+    srcs = ["data_structures_test.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":data_structures",
+        "//tensorflow/python:array_ops",
+        "//tensorflow/python:framework_test_lib",
+        "//tensorflow/python:layers",
+        "//tensorflow/python:math_ops",
+        "//tensorflow/python/eager:context",
+        "//tensorflow/python/eager:test",
+        "//tensorflow/python/keras:engine",
+        "//tensorflow/python/keras:layers",
+    ],
+)
+
 py_library(
     name = "util",
     srcs = ["util.py"],
diff --git a/tensorflow/python/training/checkpointable/base.py b/tensorflow/python/training/checkpointable/base.py
index e378f0e898c88d6f3e4fa22e93ca2165f2325660..cfe7259e1b6d9932fff9e78049fa85554f022076 100644
--- a/tensorflow/python/training/checkpointable/base.py
+++ b/tensorflow/python/training/checkpointable/base.py
@@ -591,11 +591,11 @@ class CheckpointableBase(object):
           self._unconditional_checkpoint_dependencies):
         if name == old_name:
           self._unconditional_checkpoint_dependencies[index] = new_reference
-    else:
+    elif current_object is None:
       self._unconditional_checkpoint_dependencies.append(new_reference)
-
-    self._unconditional_dependency_names[name] = checkpointable
-    self._handle_deferred_dependencies(name=name, checkpointable=checkpointable)
+      self._unconditional_dependency_names[name] = checkpointable
+      self._handle_deferred_dependencies(
+          name=name, checkpointable=checkpointable)
     return checkpointable
 
   def _handle_deferred_dependencies(self, name, checkpointable):
diff --git a/tensorflow/python/training/checkpointable/data_structures.py b/tensorflow/python/training/checkpointable/data_structures.py
new file mode 100644
index 0000000000000000000000000000000000000000..69ed253fb2d874954ee7563cd8bb21add59a7318
--- /dev/null
+++ b/tensorflow/python/training/checkpointable/data_structures.py
@@ -0,0 +1,257 @@
+"""Checkpointable data structures."""
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import collections
+
+import six
+
+from tensorflow.python.keras.engine import base_layer
+from tensorflow.python.keras.utils import layer_utils
+from tensorflow.python.ops import variables
+from tensorflow.python.training.checkpointable import base as checkpointable_lib
+from tensorflow.python.training.checkpointable import data_structures_base
+
+
+# TODO(allenl): We could track regular Python data structures which get assigned
+# to Checkpointable objects. Making this work with restore-on-create would be
+# tricky; we'd need to re-create nested structures with our own wrapped objects
+# on assignment to an attribute, and track the user's original structure to make
+# sure they don't modify it except through the wrappers (since we could save the
+# user's updated structure, but would have no way to support restore-on-create
+# for those modifications).
+# TODO(allenl): A dictionary data structure would be good too.
+class CheckpointableDataStructure(
+    data_structures_base.CheckpointableDataStructureBase):
+  """Base class for data structures which contain checkpointable objects."""
+
+  def __init__(self):
+    self._layers = []
+    self.trainable = True
+    self._extra_variables = []
+
+  def _track_value(self, value, name):
+    """Add a dependency on `value`."""
+    if isinstance(value, checkpointable_lib.CheckpointableBase):
+      self._track_checkpointable(value, name=name)
+      if isinstance(value, variables.Variable):
+        self._extra_variables.append(value)
+    else:
+      raise ValueError(
+          ("Only checkpointable objects (such as Layers or Optimizers) may be "
+           "stored in a List object. Got %s, which does not inherit from "
+           "CheckpointableBase.") % (value,))
+    if isinstance(value, (
+        base_layer.Layer,
+        data_structures_base.CheckpointableDataStructureBase)):
+      if value not in self._layers:
+        self._layers.append(value)
+        if hasattr(value, "_use_resource_variables"):
+          # In subclassed models, legacy layers (tf.layers) must always use
+          # resource variables.
+          value._use_resource_variables = True  # pylint: disable=protected-access
+
+  @property
+  def layers(self):
+    return self._layers
+
+  @property
+  def trainable_weights(self):
+    return layer_utils.gather_trainable_weights(
+        trainable=self.trainable,
+        sub_layers=self.layers,
+        extra_variables=self._extra_variables)
+
+  @property
+  def non_trainable_weights(self):
+    return layer_utils.gather_non_trainable_weights(
+        trainable=self.trainable,
+        sub_layers=self.layers,
+        extra_variables=self._extra_variables)
+
+  @property
+  def weights(self):
+    return self.trainable_weights + self.non_trainable_weights
+
+  @property
+  def trainable_variables(self):
+    return self.trainable_weights
+
+  @property
+  def non_trainable_variables(self):
+    return self.non_trainable_weights
+
+  @property
+  def variables(self):
+    return self.weights
+
+  @property
+  def updates(self):
+    """Aggregate updates from any `Layer` instances."""
+    # Updates and conditional losses are forwarded as-is rather than being
+    # filtered based on inputs, since this is just a container and won't ever
+    # have any inputs.
+    aggregated = []
+    for layer in self.layers:
+      aggregated += layer.updates
+    return aggregated
+
+  @property
+  def losses(self):
+    """Aggregate losses from any `Layer` instances."""
+    aggregated = []
+    for layer in self.layers:
+      aggregated += layer.losses
+    return aggregated
+
+  def __hash__(self):
+    # Support object-identity hashing, so these structures can be used as keys
+    # in sets/dicts.
+    return id(self)
+
+  def __eq__(self, other):
+    # Similar to Tensors, checkpointable data structures use object-identity
+    # equality to support set/dict membership.
+    return self is other
+
+
+class List(CheckpointableDataStructure, collections.Sequence):
+  """An append-only sequence type which is checkpointable.
+
+  Maintains checkpoint dependencies on its contents (which must also be
+  checkpointable), and forwards any `Layer` metadata such as updates and losses.
+
+  Note that `List` is purely a container. It lets a `tf.keras.Model` or
+  other checkpointable object know about its contents, but does not call any
+  `Layer` instances which are added to it. To indicate a sequence of `Layer`
+  instances which should be called sequentially, use `tf.keras.Sequential`.
+
+  Example usage:
+  ```python
+  class HasList(tf.keras.Model):
+
+    def __init__(self):
+      super(HasList, self).__init__()
+      self.layer_list = tf.contrib.checkpoint.List([layers.Dense(3)])
+      self.layer_list.append(layers.Dense(4))
+
+    def call(self, x):
+      aggregation = 0.
+      for l in self.layer_list:
+        x = l(x)
+        aggregation += tf.reduce_sum(x)
+      return aggregation
+  ```
+
+  This kind of wrapping is necessary because `Checkpointable` objects do not
+  (yet) deeply inspect regular Python data structures, so for example assigning
+  a regular list (`self.layer_list = [layers.Dense(3)]`) does not create a
+  checkpoint dependency and does not add the `Layer` instance's weights to its
+  parent `Model`.
+  """
+
+  def __init__(self, *args, **kwargs):
+    """Construct a new sequence. Arguments are passed to `list()`."""
+    super(List, self).__init__()
+    self._storage = list(*args, **kwargs)
+    for index, element in enumerate(self._storage):
+      self._track_value(element, name=self._name_element(index))
+
+  def _name_element(self, index):
+    return "%d" % (index,)
+
+  def append(self, value):
+    """Add a new checkpointable value."""
+    self._track_value(value, self._name_element(len(self._storage)))
+    self._storage.append(value)
+
+  def extend(self, values):
+    """Add a sequence of checkpointable values."""
+    for index_offset, value in enumerate(values):
+      self._track_value(
+          value, name=self._name_element(len(self._storage) + index_offset))
+    self._storage.extend(values)
+
+  def __iadd__(self, values):
+    self.extend(values)
+    return self
+
+  def __add__(self, other):
+    if isinstance(other, List):
+      return List(self._storage + other._storage)  # pylint: disable=protected-access
+    else:
+      return List(self._storage + other)
+
+  def __getitem__(self, key):
+    return self._storage[key]
+
+  def __len__(self):
+    return len(self._storage)
+
+  def __repr__(self):
+    return "List(%s)" % (repr(self._storage),)
+
+
+class Mapping(CheckpointableDataStructure, collections.Mapping):
+  """An append-only checkpointable mapping data structure with string keys.
+
+  Maintains checkpoint dependencies on its contents (which must also be
+  checkpointable), named based on its keys.
+
+  Note that once a key has been added, it may not be deleted or replaced. If
+  names may not be unique, see `tf.contrib.checkpoint.UniqueNameTracker`.
+  """
+
+  def __init__(self, *args, **kwargs):
+    """Construct a new sequence. Arguments are passed to `dict()`."""
+    super(Mapping, self).__init__()
+    self._storage = dict(*args, **kwargs)
+    for key, value in self._storage.items():
+      self._track_value(value, name=self._name_element(key))
+
+  def _name_element(self, key):
+    if not isinstance(key, six.string_types):
+      raise TypeError(
+          "Mapping accepts only string keys, but got a key %s."
+          % repr(key))
+    return str(key)
+
+  def __setitem__(self, key, value):
+    current_value = self._storage.setdefault(key, value)
+    if current_value is not value:
+      raise ValueError(
+          ("Mappings are an append-only data structure. Tried to overwrite the "
+           "key '%s' with value %s, but it already contains %s")
+          % (key, value, current_value))
+    self._track_value(value, name=self._name_element(key))
+
+  def update(self, *args, **kwargs):
+    for key, value in dict(*args, **kwargs).items():
+      self[key] = value
+
+  def __getitem__(self, key):
+    return self._storage[key]
+
+  def __len__(self):
+    return len(self._storage)
+
+  def __repr__(self):
+    return "Mapping(%s)" % (repr(self._storage),)
+
+  def __iter__(self):
+    return iter(self._storage)
diff --git a/tensorflow/python/training/checkpointable/data_structures_base.py b/tensorflow/python/training/checkpointable/data_structures_base.py
new file mode 100644
index 0000000000000000000000000000000000000000..f1b2cf105b81490ea12e0a667f53fb02d45135c9
--- /dev/null
+++ b/tensorflow/python/training/checkpointable/data_structures_base.py
@@ -0,0 +1,27 @@
+"""A trivial base class to avoid circular imports for isinstance checks."""
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+
+from tensorflow.python.training.checkpointable import base as checkpointable_lib
+
+
+class CheckpointableDataStructureBase(checkpointable_lib.CheckpointableBase):
+  """Base class for data structures which contain checkpointable objects."""
+
+  pass
diff --git a/tensorflow/python/training/checkpointable/data_structures_test.py b/tensorflow/python/training/checkpointable/data_structures_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..b05b3a88002e31560ed6c2005fdd29f56c5227a3
--- /dev/null
+++ b/tensorflow/python/training/checkpointable/data_structures_test.py
@@ -0,0 +1,238 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import os
+
+import numpy
+
+from tensorflow.python.eager import context
+from tensorflow.python.eager import test
+from tensorflow.python.framework import test_util
+from tensorflow.python.keras.engine import training
+from tensorflow.python.keras.layers import core
+from tensorflow.python.keras.layers import normalization
+from tensorflow.python.layers import core as non_keras_core
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import resource_variable_ops
+from tensorflow.python.training.checkpointable import data_structures
+
+
+class HasList(training.Model):
+
+  def __init__(self):
+    super(HasList, self).__init__()
+    self.layer_list = data_structures.List([core.Dense(3)])
+    self.layer_list.append(core.Dense(4))
+    self.layer_list.extend(
+        [core.Dense(5),
+         core.Dense(6, kernel_regularizer=math_ops.reduce_sum)])
+    self.layer_list += [
+        core.Dense(7, bias_regularizer=math_ops.reduce_sum),
+        core.Dense(8)
+    ]
+    self.layer_list += (
+        data_structures.List([core.Dense(9)]) + data_structures.List(
+            [core.Dense(10)]))
+    self.layer_list.extend(
+        data_structures.List(
+            list(sequence=[core.Dense(11)]) + [core.Dense(12)]))
+    self.layers_with_updates = data_structures.List(
+        sequence=(normalization.BatchNormalization(),))
+
+  def call(self, x):
+    aggregation = 0.
+    for l in self.layer_list:
+      x = l(x)
+      aggregation += math_ops.reduce_sum(x)
+    bn, = self.layers_with_updates
+    return bn(x) / aggregation
+
+
+class ListTests(test.TestCase):
+
+  @test_util.run_in_graph_and_eager_modes()
+  def testTracking(self):
+    model = HasList()
+    output = model(array_ops.ones([32, 2]))
+    self.assertAllEqual([32, 12], output.shape)
+    self.assertEqual(2, len(model.layers))
+    self.assertIs(model.layer_list, model.layers[0])
+    self.assertEqual(10, len(model.layers[0].layers))
+    for index in range(10):
+      self.assertEqual(3 + index, model.layers[0].layers[index].units)
+    self.assertEqual(2, len(model._checkpoint_dependencies))
+    self.assertIs(model.layer_list, model._checkpoint_dependencies[0].ref)
+    self.assertIs(model.layers_with_updates,
+                  model._checkpoint_dependencies[1].ref)
+    self.assertEqual(
+        10, len(model._checkpoint_dependencies[0].ref._checkpoint_dependencies))
+    self.evaluate([v.initializer for v in model.variables])
+    self.evaluate(model.variables[0].assign([[1., 2., 3.], [4., 5., 6.]]))
+    save_path = os.path.join(self.get_temp_dir(), "ckpt")
+    model.save_weights(save_path)
+    self.evaluate(model.variables[0].assign(array_ops.zeros([2, 3])))
+    model.load_weights(save_path)
+    self.assertAllEqual([[1., 2., 3.], [4., 5., 6.]],
+                        self.evaluate(model.variables[0]))
+
+  def testUpdatesForwarded(self):
+    with context.graph_mode():
+      model = HasList()
+      model_input = array_ops.ones([32, 2])
+      model(model_input)
+      self.assertGreater(len(model.layers_with_updates[0].updates), 0)
+      self.assertEqual(set(model.layers_with_updates[0].updates),
+                       set(model.updates))
+
+    with context.eager_mode():
+      model = HasList()
+      model_input = array_ops.ones([32, 2])
+      model(model_input)
+      self.assertEqual(0, len(model.updates))
+
+  @test_util.run_in_graph_and_eager_modes()
+  def testLossesForwarded(self):
+    model = HasList()
+    model_input = array_ops.ones([32, 2])
+    model(model_input)
+    self.assertEqual(2, len(model.losses))
+
+  def testNotCheckpointable(self):
+    class NotCheckpointable(object):
+      pass
+
+    with self.assertRaises(ValueError):
+      data_structures.List([NotCheckpointable()])
+
+  def testCallNotImplemented(self):
+    with self.assertRaisesRegexp(TypeError, "not callable"):
+      data_structures.List()(1.)
+
+  def testNoPop(self):
+    with self.assertRaises(AttributeError):
+      data_structures.List().pop()
+
+  def testNesting(self):
+    with context.graph_mode():
+      inner = data_structures.List()
+      outer = data_structures.List([inner])
+      inner.append(non_keras_core.Dense(1))
+      inner[0](array_ops.ones([2, 3]))
+      self.assertEqual(2, len(outer.variables))
+      self.assertIsInstance(
+          outer.variables[0],
+          resource_variable_ops.ResourceVariable)
+
+  def testNonLayerVariables(self):
+    v = resource_variable_ops.ResourceVariable([1.])
+    l = data_structures.List([v])
+    self.assertTrue(l.trainable)
+    self.assertEqual([], l.layers)
+    self.assertEqual([v], l.variables)
+    self.assertEqual([v], l.trainable_weights)
+    self.assertEqual([], l.non_trainable_variables)
+    l.trainable = False
+    self.assertEqual([v], l.variables)
+    self.assertEqual([], l.trainable_variables)
+    self.assertEqual([v], l.non_trainable_variables)
+    l.trainable = True
+    v2 = resource_variable_ops.ResourceVariable(1., trainable=False)
+    l.append(v2)
+    self.assertEqual([v, v2], l.weights)
+    self.assertEqual([v], l.trainable_weights)
+    self.assertEqual([v2], l.non_trainable_weights)
+
+  def testHashing(self):
+    has_sequences = set([data_structures.List(),
+                         data_structures.List()])
+    self.assertEqual(2, len(has_sequences))
+    self.assertNotIn(data_structures.List(), has_sequences)
+
+
+class HasMapping(training.Model):
+
+  def __init__(self):
+    super(HasMapping, self).__init__()
+    self.layer_dict = data_structures.Mapping(output=core.Dense(7))
+    self.layer_dict["norm"] = data_structures.List()
+    self.layer_dict["dense"] = data_structures.List()
+    self.layer_dict["dense"].extend(
+        [core.Dense(5),
+         core.Dense(6, kernel_regularizer=math_ops.reduce_sum)])
+    self.layer_dict["norm"].append(
+        normalization.BatchNormalization())
+    self.layer_dict["norm"].append(
+        normalization.BatchNormalization())
+
+  def call(self, x):
+    aggregation = 0.
+    for norm, dense in zip(self.layer_dict["norm"], self.layer_dict["dense"]):
+      x = norm(dense(x))
+      aggregation += math_ops.reduce_sum(x)
+    return self.layer_dict["output"](x) / aggregation
+
+
+class MappingTests(test.TestCase):
+
+  @test_util.run_in_graph_and_eager_modes()
+  def testTracking(self):
+    model = HasMapping()
+    output = model(array_ops.ones([32, 2]))
+    self.assertAllEqual([32, 7], output.shape)
+    self.assertEqual(1, len(model.layers))
+    self.assertIs(model.layer_dict, model.layers[0])
+    self.assertEqual(3, len(model.layers[0].layers))
+    self.assertEqual(1, len(model._checkpoint_dependencies))
+    self.assertIs(model.layer_dict, model._checkpoint_dependencies[0].ref)
+    self.evaluate([v.initializer for v in model.variables])
+    test_var = model.layer_dict["output"].kernel
+    self.evaluate(test_var.assign(array_ops.ones([6, 7])))
+    save_path = os.path.join(self.get_temp_dir(), "ckpt")
+    model.save_weights(save_path)
+    self.evaluate(test_var.assign(array_ops.zeros([6, 7])))
+    model.load_weights(save_path)
+    self.assertAllEqual(numpy.ones([6, 7]),
+                        self.evaluate(test_var))
+
+  def testNoOverwrite(self):
+    mapping = data_structures.Mapping()
+    original = data_structures.List()
+    mapping["a"] = original
+    with self.assertRaises(ValueError):
+      mapping["a"] = data_structures.List()
+    self.assertIs(original, mapping["a"])
+    with self.assertRaises(AttributeError):
+      del mapping["a"]
+    mapping.update(b=data_structures.Mapping())
+    with self.assertRaises(ValueError):
+      mapping.update({"b": data_structures.Mapping()})
+
+  def testNonStringKeys(self):
+    mapping = data_structures.Mapping()
+    with self.assertRaises(TypeError):
+      mapping[1] = data_structures.List()
+
+  def testHashing(self):
+    has_mappings = set([data_structures.Mapping(),
+                        data_structures.Mapping()])
+    self.assertEqual(2, len(has_mappings))
+    self.assertNotIn(data_structures.Mapping(), has_mappings)
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/python/training/device_util.py b/tensorflow/python/training/device_util.py
index e31fa02d60679d218a62f4e2affc16f0f5bc51c3..70e1ca4b5d77e5e7529cb0d06a9ffb4657dc74fe 100644
--- a/tensorflow/python/training/device_util.py
+++ b/tensorflow/python/training/device_util.py
@@ -27,13 +27,15 @@ def canonicalize(d, default=None):
   """Canonicalize device string.
 
   If d has missing components, the rest would be deduced from the `default`
-  argument or from '/job:localhost/replica:0/task:0/device:CPU:0'. For example:
+  argument or from '/replica:0/task:0/device:CPU:0'. For example:
     If d = '/cpu:0', default='/job:worker/task:1', it returns
       '/job:worker/replica:0/task:1/device:CPU:0'.
     If d = '/cpu:0', default='/job:worker', it returns
       '/job:worker/replica:0/task:0/device:CPU:0'.
     If d = '/gpu:0', default=None, it returns
-      '/job:localhost/replica:0/task:0/device:GPU:0'.
+      '/replica:0/task:0/device:GPU:0'.
+
+  Note: This uses "job:localhost" as the default if executing eagerly.
 
   Args:
     d: a device string.
@@ -47,7 +49,9 @@ def canonicalize(d, default=None):
       "Device type '%s' must be all-caps." % (d.device_type,))
   # Fill in missing device fields using defaults.
   result = tf_device.DeviceSpec(
-      job="localhost", replica=0, task=0, device_type="CPU", device_index=0)
+      replica=0, task=0, device_type="CPU", device_index=0)
+  if context.executing_eagerly():
+    result.job = "localhost"
   if default:
     result.merge_from(tf_device.DeviceSpec.from_string(default))
   result.merge_from(d)
diff --git a/tensorflow/python/training/device_util_test.py b/tensorflow/python/training/device_util_test.py
index 61525e21f508bcef5b61fd077d288b93803f1aa8..cdbb08229d2f06c2cfeeb855b32665f7c03ea969 100644
--- a/tensorflow/python/training/device_util_test.py
+++ b/tensorflow/python/training/device_util_test.py
@@ -52,7 +52,7 @@ class DeviceUtilTest(test.TestCase):
   def testCanonicalizeWithoutDefaultDevice(self):
     self.assertEqual(
         device_util.canonicalize("/cpu:0"),
-        "/job:localhost/replica:0/task:0/device:CPU:0")
+        "/replica:0/task:0/device:CPU:0")
     self.assertEqual(
         device_util.canonicalize("/job:worker/cpu:0"),
         "/job:worker/replica:0/task:0/device:CPU:0")
diff --git a/tensorflow/python/training/distribute.py b/tensorflow/python/training/distribute.py
index ab8b37bb655bfc3c222ed661b6d48f0ecdc3a858..caffd042a0917209c87cab8993169dc4bc956039 100644
--- a/tensorflow/python/training/distribute.py
+++ b/tensorflow/python/training/distribute.py
@@ -528,6 +528,8 @@ class DistributionStrategy(object):
   * `d.update_non_slot(d.non_slot_devices(), fn)`: in cross-tower
     context, like `d.update()` except with locality N.
   * `d.fetch(t)`: Copy `t` with any locality to the client's CPU device.
+    TODO(josh11b): Deprecate `fetch`, switch to `read_var` for
+    reading tower-local variables.
 
   The standard pattern for updating variables is to:
 
@@ -614,8 +616,8 @@ class DistributionStrategy(object):
 
     There will still be one component variable per tower, but there is
     no requirement that they stay in sync. Instead, when saving them
-    or calling `fetch()`, we use the value that results when calling
-    `reduce()` on all the towers' variables.
+    or calling `fetch()/read_var()`, we use the value that
+    results when calling `reduce()` on all the towers' variables.
 
     Note: tower-local implies not trainable. Instead, it is expected
     that each tower will directly update (using `assign_add()` or
@@ -646,6 +648,21 @@ class DistributionStrategy(object):
     _require_distribution_strategy_scope(self)
     return variable_scope.variable_creator_scope(create_tower_local_variable)
 
+  def read_var(self, v):
+    """Reads the value of a variable.
+
+    Returns the aggregate value of a tower-local variable, or the
+    (read-only) value of any other variable.
+
+    Args:
+      v: A variable allocated within the scope of this `DistributionStrategy`.
+
+    Returns:
+      A tensor representing the value of `v`, aggregated across towers if
+      necessary.
+    """
+    raise NotImplementedError("must be implemented in descendants")
+
   def colocate_vars_with(self, colocate_with_variable):
     """Scope that controls which devices variables will be created on.
 
@@ -904,6 +921,8 @@ class DistributionStrategy(object):
     will attempt to avoid a copy by checking if the value is already
     on the destination device.
 
+    TODO(josh11b): Switch to `read_var`.
+
     Args:
       val: Value (which may be mirrored) to copy.
       destination: A device string to copy the value to.
@@ -946,7 +965,7 @@ class DistributionStrategy(object):
       return control_flow_ops.group(value, name=name)
     # Special handling for the common case of one op.
     v, = value
-    if isinstance(v, ops.Tensor):
+    if hasattr(v, "op"):
       v = v.op
     return v
 
@@ -1197,6 +1216,9 @@ class _DefaultDistributionStrategy(DistributionStrategy):
     with ops.colocate_with(colocate_with), UpdateContext(colocate_with):
       return fn(*args, **kwargs)
 
+  def read_var(self, tower_local_var):
+    return array_ops.identity(tower_local_var)
+
   def _fetch(self, var, destination, fn):
     with ops.colocate_with(var):
       var = fn(var)
diff --git a/tensorflow/python/training/gradient_descent.py b/tensorflow/python/training/gradient_descent.py
index 6caf29d83af546f821314179e17f7bf1a693ff1a..ef50f6315dd623647e000b9b713d3ae557c31427 100644
--- a/tensorflow/python/training/gradient_descent.py
+++ b/tensorflow/python/training/gradient_descent.py
@@ -18,7 +18,6 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-from tensorflow.python.eager import context
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import resource_variable_ops
@@ -41,6 +40,13 @@ class GradientDescentOptimizer(optimizer.Optimizer):
       use_locking: If True use locks for update operations.
       name: Optional name prefix for the operations created when applying
         gradients. Defaults to "GradientDescent".
+
+    @compatibility(eager)
+    When eager execution is enabled, `learning_rate` can be a callable that
+    takes no arguments and returns the actual value to use. This can be useful
+    for changing these values across different invocations of optimizer
+    functions.
+    @end_compatibility
     """
     super(GradientDescentOptimizer, self).__init__(use_locking, name)
     self._learning_rate = learning_rate
@@ -71,6 +77,6 @@ class GradientDescentOptimizer(optimizer.Optimizer):
     return var.scatter_sub(delta, use_locking=self._use_locking)
 
   def _prepare(self):
-    if not context.executing_eagerly() or self._learning_rate_tensor is None:
-      self._learning_rate_tensor = ops.convert_to_tensor(self._learning_rate,
-                                                         name="learning_rate")
+    learning_rate = self._call_if_callable(self._learning_rate)
+    self._learning_rate_tensor = ops.convert_to_tensor(
+        learning_rate, name="learning_rate")
diff --git a/tensorflow/python/training/gradient_descent_test.py b/tensorflow/python/training/gradient_descent_test.py
index 5370cafbcfab6e5ea46685db997989bf6f218a1a..b304e924212c49d84b7c85e01869603b47fc1222 100644
--- a/tensorflow/python/training/gradient_descent_test.py
+++ b/tensorflow/python/training/gradient_descent_test.py
@@ -18,6 +18,9 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+from tensorflow.python.eager import backprop
+from tensorflow.python.eager import context
+from tensorflow.python.eager import function
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
@@ -80,6 +83,32 @@ class GradientDescentOptimizerTest(test.TestCase):
         self.assertAllCloseAccordingToType([3.0 - 3.0 * 0.01, 4.0 - 3.0 * 0.01],
                                            var1.eval())
 
+  def testBasicCallableParams(self):
+    for dtype in [dtypes.half, dtypes.float32, dtypes.float64]:
+      with self.test_session():
+        var0 = resource_variable_ops.ResourceVariable([1.0, 2.0], dtype=dtype)
+        var1 = resource_variable_ops.ResourceVariable([3.0, 4.0], dtype=dtype)
+        grads0 = constant_op.constant([0.1, 0.1], dtype=dtype)
+        grads1 = constant_op.constant([0.01, 0.01], dtype=dtype)
+        lr = lambda: 3.0
+        sgd_op = gradient_descent.GradientDescentOptimizer(lr).apply_gradients(
+            zip([grads0, grads1], [var0, var1]))
+        # TODO(apassos) calling initialize_resources on all resources here
+        # doesn't work because the sessions and graph are reused across unit
+        # tests and this would mean trying to reinitialize variables. Figure out
+        # a long-term solution for this.
+        resources.initialize_resources([var0, var1]).run()
+        # Fetch params to validate initial values
+        self.assertAllCloseAccordingToType([1.0, 2.0], var0.eval())
+        self.assertAllCloseAccordingToType([3.0, 4.0], var1.eval())
+        # Run 1 step of sgd
+        sgd_op.run()
+        # Validate updated params
+        self.assertAllCloseAccordingToType([1.0 - 3.0 * 0.1, 2.0 - 3.0 * 0.1],
+                                           var0.eval())
+        self.assertAllCloseAccordingToType([3.0 - 3.0 * 0.01, 4.0 - 3.0 * 0.01],
+                                           var1.eval())
+
   def testMinimizeResourceVariable(self):
     for dtype in [dtypes.half, dtypes.float32, dtypes.float64]:
       with self.test_session():
@@ -218,6 +247,26 @@ class GradientDescentOptimizerTest(test.TestCase):
         self.assertAllCloseAccordingToType([[3.0], [4.0 - 3.0 * 0.01]],
                                            var1.eval())
 
+  def testCapturingInDefunWhileExecutingEagerly(self):
+    with context.eager_mode():
+      optimizer = gradient_descent.GradientDescentOptimizer(1.0)
+
+      def step():
+        v = resource_variable_ops.ResourceVariable(1.0)
+        with backprop.GradientTape() as tape:
+          loss = v ** 2
+        grad = tape.gradient(loss, v)
+        optimizer.apply_gradients([(grad, v)])
+        return v.read_value()
+
+      compiled_step = function.defun(step)
+
+      self.assertEqual(float(step()), -1.0)
+      self.assertEqual(float(compiled_step()), -1.0)
+      # This shouldn't fail; in particular, the learning rate tensor should
+      # be an EagerTensor once again, not a graph Tensor.
+      self.assertEqual(float(step()), -1.0)
+
 
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/python/training/momentum.py b/tensorflow/python/training/momentum.py
index bd9fa79d8feac68c149f787ee8501bdddb173d33..cb3ec6f053e2e7f5aa80152ed233c8fbb6920be0 100644
--- a/tensorflow/python/training/momentum.py
+++ b/tensorflow/python/training/momentum.py
@@ -61,8 +61,8 @@ class MomentumOptimizer(optimizer.Optimizer):
         variable(s) track the values called `theta_t + mu*v_t` in the paper.
 
     @compatibility(eager)
-    When eager execution is enabled, learning_rate and momentum can each be a
-    callable that takes no arguments and returns the actual value to use. This
+    When eager execution is enabled, `learning_rate` and `momentum` can each be
+    a callable that takes no arguments and returns the actual value to use. This
     can be useful for changing these values across different invocations of
     optimizer functions.
     @end_compatibility
diff --git a/tensorflow/python/training/momentum_test.py b/tensorflow/python/training/momentum_test.py
index 7bd57ad3d854534e196fa7b72bebbd7195e6bca8..f7e78071d8b15500ba607a5cd16aefcaf9d2abfe 100644
--- a/tensorflow/python/training/momentum_test.py
+++ b/tensorflow/python/training/momentum_test.py
@@ -134,7 +134,6 @@ class MomentumOptimizerTest(test.TestCase):
     with context.eager_mode():
       self.doTestBasic(use_resource=True, use_callable_params=True)
 
-  @test_util.run_in_graph_and_eager_modes(reset_test=True)
   def testVariablesAcrossGraphs(self):
     optimizer = momentum_lib.MomentumOptimizer(0.01, 0.5)
     with ops.Graph().as_default():
@@ -142,10 +141,7 @@ class MomentumOptimizerTest(test.TestCase):
           [1.0, 2.0], dtype=dtypes.float32, name="var0")
       var1 = resource_variable_ops.ResourceVariable(
           [3.0, 4.0], dtype=dtypes.float32, name="var1")
-      if context.executing_eagerly():
-        loss = lambda: math_ops.reduce_sum(var0 + var1)
-      else:
-        loss = math_ops.reduce_sum(var0 + var1)
+      loss = math_ops.reduce_sum(var0 + var1)
       optimizer.minimize(loss)
       optimizer_variables = optimizer.variables()
       self.assertStartsWith(optimizer_variables[0].name, "var0")
@@ -157,10 +153,7 @@ class MomentumOptimizerTest(test.TestCase):
           [1.0, 2.0], dtype=dtypes.float32, name="var2")
       var3 = resource_variable_ops.ResourceVariable(
           [3.0, 4.0], dtype=dtypes.float32, name="var3")
-      if context.executing_eagerly():
-        loss = lambda: math_ops.reduce_sum(var2 + var3)
-      else:
-        loss = math_ops.reduce_sum(var2 + var3)
+      loss = math_ops.reduce_sum(var2 + var3)
       optimizer.minimize(loss)
       optimizer_variables = optimizer.variables()
       self.assertStartsWith(optimizer_variables[0].name, "var2")
diff --git a/tensorflow/python/training/moving_averages.py b/tensorflow/python/training/moving_averages.py
index 61fc828a840c490b0f787119134a0941f60f947a..60cc54c2645a0f44195bbb86013e0306387aa8aa 100644
--- a/tensorflow/python/training/moving_averages.py
+++ b/tensorflow/python/training/moving_averages.py
@@ -344,6 +344,11 @@ class ExponentialMovingAverage(object):
     self._name = name
     self._averages = {}
 
+  @property
+  def name(self):
+    """The name of this ExponentialMovingAverage object."""
+    return self._name
+
   def apply(self, var_list=None):
     """Maintains moving averages of variables.
 
@@ -394,7 +399,7 @@ class ExponentialMovingAverage(object):
         if isinstance(var, variables.Variable):
           avg = slot_creator.create_slot(var,
                                          var.initialized_value(),
-                                         self._name,
+                                         self.name,
                                          colocate_with_primary=True)
           # NOTE(mrry): We only add `tf.Variable` objects to the
           # `MOVING_AVERAGE_VARIABLES` collection.
@@ -402,7 +407,7 @@ class ExponentialMovingAverage(object):
         else:
           avg = slot_creator.create_zeros_slot(
               var,
-              self._name,
+              self.name,
               colocate_with_primary=(var.op.type in ["Variable",
                                                      "VariableV2",
                                                      "VarHandleOp"]))
@@ -410,7 +415,7 @@ class ExponentialMovingAverage(object):
             zero_debias_true.add(avg)
       self._averages[var] = avg
 
-    with ops.name_scope(self._name) as scope:
+    with ops.name_scope(self.name) as scope:
       decay = ops.convert_to_tensor(self._decay, name="decay")
       if self._num_updates is not None:
         num_updates = math_ops.cast(self._num_updates,
@@ -462,7 +467,7 @@ class ExponentialMovingAverage(object):
     if var in self._averages:
       return self._averages[var].op.name
     return ops.get_default_graph().unique_name(
-        var.op.name + "/" + self._name, mark_as_used=False)
+        var.op.name + "/" + self.name, mark_as_used=False)
 
   def variables_to_restore(self, moving_avg_variables=None):
     """Returns a map of names to `Variables` to restore.
diff --git a/tensorflow/python/training/moving_averages_test.py b/tensorflow/python/training/moving_averages_test.py
index 6717811bbb0f05723a5ad0fbcbfba75249d0d43b..3e85e6bfa7b20da061d36ab0fba4913e402f8f0c 100644
--- a/tensorflow/python/training/moving_averages_test.py
+++ b/tensorflow/python/training/moving_averages_test.py
@@ -263,6 +263,7 @@ class ExponentialMovingAverageTest(test.TestCase):
       tensor2 = v0 + v1
       ema = moving_averages.ExponentialMovingAverage(
           0.25, zero_debias=zero_debias, name="foo")
+      self.assertEqual("foo", ema.name)
       self.assertEqual("v0/foo", ema.average_name(v0))
       self.assertEqual("v1/foo", ema.average_name(v1))
       self.assertEqual("add/foo", ema.average_name(tensor2))
diff --git a/tensorflow/python/training/optimizer.py b/tensorflow/python/training/optimizer.py
index a9287a0f0d0391cc6e0b297cce18eebaf9f64291..cae29eea933aec1554beea8d0413fd9febcf2d94 100644
--- a/tensorflow/python/training/optimizer.py
+++ b/tensorflow/python/training/optimizer.py
@@ -1211,3 +1211,7 @@ class Optimizer(
       self._deferred_slot_restorations.setdefault(
           slot_name, {}).setdefault(variable_key, []).append(
               slot_variable_position)
+
+  def _call_if_callable(self, param):
+    """Call the function if param is callable."""
+    return param() if callable(param) else param
diff --git a/tensorflow/python/training/rmsprop.py b/tensorflow/python/training/rmsprop.py
index 341b970c92e42b4fe392d91f57219d713d2513e5..f38c9861d64aa258cde07ccd3041d3c50932c33b 100644
--- a/tensorflow/python/training/rmsprop.py
+++ b/tensorflow/python/training/rmsprop.py
@@ -92,6 +92,13 @@ class RMSPropOptimizer(optimizer.Optimizer):
         computation and memory. Defaults to False.
       name: Optional name prefix for the operations created when applying
         gradients. Defaults to "RMSProp".
+
+    @compatibility(eager)
+    When eager execution is enabled, `learning_rate`, `decay`, `momentum`, and
+    `epsilon` can each be a callable that takes no arguments and returns the
+    actual value to use. This can be useful for changing these values across
+    different invocations of optimizer functions.
+    @end_compatibility
     """
     super(RMSPropOptimizer, self).__init__(use_locking, name)
     self._learning_rate = learning_rate
@@ -120,12 +127,15 @@ class RMSPropOptimizer(optimizer.Optimizer):
       self._zeros_slot(v, "momentum", self._name)
 
   def _prepare(self):
-    self._learning_rate_tensor = ops.convert_to_tensor(
-        self._learning_rate, name="learning_rate")
-    self._decay_tensor = ops.convert_to_tensor(self._decay, name="decay")
-    self._momentum_tensor = ops.convert_to_tensor(
-        self._momentum, name="momentum")
-    self._epsilon_tensor = ops.convert_to_tensor(self._epsilon, name="epsilon")
+    lr = self._call_if_callable(self._learning_rate)
+    decay = self._call_if_callable(self._decay)
+    momentum = self._call_if_callable(self._momentum)
+    epsilon = self._call_if_callable(self._epsilon)
+
+    self._learning_rate_tensor = ops.convert_to_tensor(lr, name="learning_rate")
+    self._decay_tensor = ops.convert_to_tensor(decay, name="decay")
+    self._momentum_tensor = ops.convert_to_tensor(momentum, name="momentum")
+    self._epsilon_tensor = ops.convert_to_tensor(epsilon, name="epsilon")
 
   def _apply_dense(self, grad, var):
     rms = self.get_slot(var, "rms")
diff --git a/tensorflow/python/training/rmsprop_test.py b/tensorflow/python/training/rmsprop_test.py
index ee5385596c8b11e607969f94153f7e4f5d2d4cdd..604332738456bfc8b3ff24242f6032bf95273072 100644
--- a/tensorflow/python/training/rmsprop_test.py
+++ b/tensorflow/python/training/rmsprop_test.py
@@ -24,6 +24,7 @@ import math
 
 import numpy as np
 
+from tensorflow.python.eager import context
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
@@ -141,7 +142,7 @@ class RMSPropOptimizerTest(test.TestCase):
         self.assertAllClose([3.0, 4.0], var1.eval())
 
         # Run 4 steps of RMSProp
-        for t in range(1, 5):
+        for _ in range(1, 5):
           update.run()
 
           var0_np, mg0_np, rms0_np, mom0_np = self._rmsprop_update_numpy(
@@ -261,7 +262,7 @@ class RMSPropOptimizerTest(test.TestCase):
         self.assertAllClose([3.0, 4.0], var1.eval())
 
         # Run 4 steps of RMSProp
-        for t in range(1, 5):
+        for _ in range(1, 5):
           update.run()
 
           var0_np, mg0_np, rms0_np, mom0_np = self._sparse_rmsprop_update_numpy(
@@ -444,6 +445,55 @@ class RMSPropOptimizerTest(test.TestCase):
                  (0.01 * 2.0 / math.sqrt(0.90001 * 0.9 + 2e-5)))
             ]), var1.eval())
 
+  def testCallableParams(self):
+    with context.eager_mode():
+      for dtype in [dtypes.half, dtypes.float32]:
+        var0 = resource_variable_ops.ResourceVariable([1.0, 2.0], dtype=dtype)
+        var1 = resource_variable_ops.ResourceVariable([3.0, 4.0], dtype=dtype)
+        grads0 = constant_op.constant([0.1, 0.1], dtype=dtype)
+        grads1 = constant_op.constant([0.01, 0.01], dtype=dtype)
+
+        learning_rate = lambda: 2.0
+        decay = lambda: 0.9
+        momentum = lambda: 0.0
+        epsilon = lambda: 1.0
+        opt = rmsprop.RMSPropOptimizer(learning_rate, decay, momentum, epsilon)
+
+        # Fetch params to validate initial values
+        self.assertAllClose([1.0, 2.0], self.evaluate(var0))
+        self.assertAllClose([3.0, 4.0], self.evaluate(var1))
+        # Step 1: the rms accumulators where 1. So we should see a normal
+        # update: v -= grad * learning_rate
+        opt.apply_gradients(zip([grads0, grads1], [var0, var1]))
+        # Check the parameters.
+        self.assertAllCloseAccordingToType(
+            np.array([
+                1.0 - (0.1 * 2.0 / math.sqrt(0.901 + 1.0)),
+                2.0 - (0.1 * 2.0 / math.sqrt(0.901 + 1.0))
+            ]), self.evaluate(var0))
+        self.assertAllCloseAccordingToType(
+            np.array([
+                3.0 - (0.01 * 2.0 / math.sqrt(0.90001 + 1.0)),
+                4.0 - (0.01 * 2.0 / math.sqrt(0.90001 + 1.0))
+            ]), self.evaluate(var1))
+        # Step 2: the root mean square accumulators contain the previous update.
+        opt.apply_gradients(zip([grads0, grads1], [var0, var1]))
+        # Check the parameters.
+        self.assertAllCloseAccordingToType(
+            np.array([
+                1.0 - (0.1 * 2.0 / math.sqrt(0.901 + 1.0)) -
+                (0.1 * 2.0 / math.sqrt(0.901 * 0.9 + 0.001 + 1.0)),
+                2.0 - (0.1 * 2.0 / math.sqrt(0.901 + 1.0)) -
+                (0.1 * 2.0 / math.sqrt(0.901 * 0.9 + 0.001 + 1.0))
+            ]), self.evaluate(var0))
+        self.assertAllCloseAccordingToType(
+            np.array([
+                3.0 - (0.01 * 2.0 / math.sqrt(0.90001 + 1.0)) -
+                (0.01 * 2.0 / math.sqrt(0.90001 * 0.9 + 1e-5 + 1.0)),
+                4.0 - (0.01 * 2.0 / math.sqrt(0.90001 + 1.0)) -
+                (0.01 * 2.0 / math.sqrt(0.90001 * 0.9 + 1e-5 + 1.0))
+            ]), self.evaluate(var1))
+
 
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/python/training/saver.py b/tensorflow/python/training/saver.py
index 4d464135fd03330134c0a371853d6bc8a228cd21..b8f58a288c73cde881a60a8e467c8224d2547c5d 100644
--- a/tensorflow/python/training/saver.py
+++ b/tensorflow/python/training/saver.py
@@ -1373,23 +1373,6 @@ class Saver(object):
     name, _ = p
     return name
 
-  def _MetaGraphFilename(self, checkpoint_filename, meta_graph_suffix="meta"):
-    """Returns the meta graph filename.
-
-    Args:
-      checkpoint_filename: Name of the checkpoint file.
-      meta_graph_suffix: Suffix for `MetaGraphDef` file. Defaults to 'meta'.
-
-    Returns:
-      MetaGraph file name.
-    """
-    # If the checkpoint_filename is sharded, the checkpoint_filename could
-    # be of format model.ckpt-step#-?????-of-shard#. For example,
-    # model.ckpt-123456-?????-of-00005, or model.ckpt-123456-00001-of-00002.
-    basename = re.sub(r"-[\d\?]+-of-\d+$", "", checkpoint_filename)
-    meta_graph_filename = ".".join([basename, meta_graph_suffix])
-    return meta_graph_filename
-
   def _RecordLastCheckpoint(self, latest_save_path):
     """Manages the list of the latest checkpoints."""
     if not self.saver_def.max_to_keep:
@@ -1430,24 +1413,12 @@ class Saver(object):
 
       # Otherwise delete the files.
       try:
-        checkpoint_prefix = self._CheckpointFilename(p)
-        self._delete_file_if_exists(
-            self._MetaGraphFilename(checkpoint_prefix, meta_graph_suffix))
-        if self.saver_def.version == saver_pb2.SaverDef.V2:
-          # V2 has a metadata file and some data files.
-          self._delete_file_if_exists(checkpoint_prefix + ".index")
-          self._delete_file_if_exists(checkpoint_prefix +
-                                      ".data-?????-of-?????")
-        else:
-          # V1, Legacy.  Exact match on the data file.
-          self._delete_file_if_exists(checkpoint_prefix)
+        remove_checkpoint(
+            self._CheckpointFilename(p), self.saver_def.version,
+            meta_graph_suffix)
       except Exception as e:  # pylint: disable=broad-except
         logging.warning("Ignoring: %s", str(e))
 
-  def _delete_file_if_exists(self, filespec):
-    for pathname in file_io.get_matching_files(filespec):
-      file_io.delete_file(pathname)
-
   def as_saver_def(self):
     """Generates a `SaverDef` representation of this saver.
 
@@ -1669,7 +1640,7 @@ class Saver(object):
         raise exc
 
     if write_meta_graph:
-      meta_graph_filename = self._MetaGraphFilename(
+      meta_graph_filename = _meta_graph_filename(
           checkpoint_file, meta_graph_suffix=meta_graph_suffix)
       if not context.executing_eagerly():
         with sess.graph.as_default():
@@ -1970,7 +1941,7 @@ def import_meta_graph(meta_graph_or_file, clear_devices=False,
 
     return Saver(saver_def=meta_graph_def.saver_def, name=scope)
   else:
-    if variables._all_saveable_objects():  # pylint: disable=protected-access
+    if variables._all_saveable_objects(scope=import_scope):  # pylint: disable=protected-access
       # Return the default saver instance for all graph variables.
       return Saver()
     else:
@@ -2121,6 +2092,55 @@ def get_checkpoint_mtimes(checkpoint_prefixes):
   return mtimes
 
 
+@tf_export("train.remove_checkpoint")
+def remove_checkpoint(checkpoint_prefix,
+                      checkpoint_format_version=saver_pb2.SaverDef.V2,
+                      meta_graph_suffix="meta"):
+  """Removes a checkpoint given by `checkpoint_prefix`.
+
+  Args:
+    checkpoint_prefix: The prefix of a V1 or V2 checkpoint. Typically the result
+      of `Saver.save()` or that of `tf.train.latest_checkpoint()`, regardless of
+      sharded/non-sharded or V1/V2.
+    checkpoint_format_version: `SaverDef.CheckpointFormatVersion`, defaults to
+      `SaverDef.V2`.
+    meta_graph_suffix: Suffix for `MetaGraphDef` file. Defaults to 'meta'.
+  """
+  _delete_file_if_exists(
+      _meta_graph_filename(checkpoint_prefix, meta_graph_suffix))
+  if checkpoint_format_version == saver_pb2.SaverDef.V2:
+    # V2 has a metadata file and some data files.
+    _delete_file_if_exists(checkpoint_prefix + ".index")
+    _delete_file_if_exists(checkpoint_prefix + ".data-?????-of-?????")
+  else:
+    # V1, Legacy.  Exact match on the data file.
+    _delete_file_if_exists(checkpoint_prefix)
+
+
+def _delete_file_if_exists(filespec):
+  """Deletes files matching `filespec`."""
+  for pathname in file_io.get_matching_files(filespec):
+    file_io.delete_file(pathname)
+
+
+def _meta_graph_filename(checkpoint_filename, meta_graph_suffix="meta"):
+  """Returns the meta graph filename.
+
+  Args:
+    checkpoint_filename: Name of the checkpoint file.
+    meta_graph_suffix: Suffix for `MetaGraphDef` file. Defaults to 'meta'.
+
+  Returns:
+    MetaGraph file name.
+  """
+  # If the checkpoint_filename is sharded, the checkpoint_filename could
+  # be of format model.ckpt-step#-?????-of-shard#. For example,
+  # model.ckpt-123456-?????-of-00005, or model.ckpt-123456-00001-of-00002.
+  basename = re.sub(r"-[\d\?]+-of-\d+$", "", checkpoint_filename)
+  meta_graph_filename = ".".join([basename, meta_graph_suffix])
+  return meta_graph_filename
+
+
 ops.register_proto_function(
     ops.GraphKeys.SAVERS,
     proto_type=saver_pb2.SaverDef,
diff --git a/tensorflow/python/training/saver_test.py b/tensorflow/python/training/saver_test.py
index f1991093e0b519da7448809e759a1cd5c57b80d9..e3be7d868e5805cf89c111cf29ea8412521da1be 100644
--- a/tensorflow/python/training/saver_test.py
+++ b/tensorflow/python/training/saver_test.py
@@ -809,7 +809,7 @@ class SaveRestoreShardedTest(test.TestCase):
         self.assertEqual(save_path + "-?????-of-00002", val)
       else:
         self.assertEqual(save_path, val)
-      meta_graph_filename = save._MetaGraphFilename(val)
+      meta_graph_filename = saver_module._meta_graph_filename(val)
       self.assertEqual(save_path + ".meta", meta_graph_filename)
 
     if save._write_version is saver_pb2.SaverDef.V1:
@@ -1185,13 +1185,13 @@ class MaxToKeepTest(test.TestCase):
       self.assertEqual([s3, s2], save.last_checkpoints)
       self.assertFalse(saver_module.checkpoint_exists(s1))
       self.assertFalse(
-          saver_module.checkpoint_exists(save._MetaGraphFilename(s1)))
+          saver_module.checkpoint_exists(saver_module._meta_graph_filename(s1)))
       self.assertTrue(saver_module.checkpoint_exists(s3))
       self.assertTrue(
-          saver_module.checkpoint_exists(save._MetaGraphFilename(s3)))
+          saver_module.checkpoint_exists(saver_module._meta_graph_filename(s3)))
       self.assertTrue(saver_module.checkpoint_exists(s2))
       self.assertTrue(
-          saver_module.checkpoint_exists(save._MetaGraphFilename(s2)))
+          saver_module.checkpoint_exists(saver_module._meta_graph_filename(s2)))
       self.assertCheckpointState(
           model_checkpoint_path=s2,
           all_model_checkpoint_paths=[s3, s2],
@@ -1202,13 +1202,13 @@ class MaxToKeepTest(test.TestCase):
       self.assertEqual([s2, s1], save.last_checkpoints)
       self.assertFalse(saver_module.checkpoint_exists(s3))
       self.assertFalse(
-          saver_module.checkpoint_exists(save._MetaGraphFilename(s3)))
+          saver_module.checkpoint_exists(saver_module._meta_graph_filename(s3)))
       self.assertTrue(saver_module.checkpoint_exists(s2))
       self.assertTrue(
-          saver_module.checkpoint_exists(save._MetaGraphFilename(s2)))
+          saver_module.checkpoint_exists(saver_module._meta_graph_filename(s2)))
       self.assertTrue(saver_module.checkpoint_exists(s1))
       self.assertTrue(
-          saver_module.checkpoint_exists(save._MetaGraphFilename(s1)))
+          saver_module.checkpoint_exists(saver_module._meta_graph_filename(s1)))
       self.assertCheckpointState(
           model_checkpoint_path=s1,
           all_model_checkpoint_paths=[s2, s1],
@@ -1222,14 +1222,14 @@ class MaxToKeepTest(test.TestCase):
       # Created by the first helper.
       self.assertTrue(saver_module.checkpoint_exists(s1))
       self.assertTrue(
-          saver_module.checkpoint_exists(save._MetaGraphFilename(s1)))
+          saver_module.checkpoint_exists(saver_module._meta_graph_filename(s1)))
       # Deleted by the first helper.
       self.assertFalse(saver_module.checkpoint_exists(s3))
       self.assertFalse(
-          saver_module.checkpoint_exists(save._MetaGraphFilename(s3)))
+          saver_module.checkpoint_exists(saver_module._meta_graph_filename(s3)))
       self.assertTrue(saver_module.checkpoint_exists(s2))
       self.assertTrue(
-          saver_module.checkpoint_exists(save._MetaGraphFilename(s2)))
+          saver_module.checkpoint_exists(saver_module._meta_graph_filename(s2)))
       self.assertCheckpointState(
           model_checkpoint_path=s2,
           all_model_checkpoint_paths=[s3, s2],
@@ -1240,13 +1240,13 @@ class MaxToKeepTest(test.TestCase):
       self.assertEqual([s2, s1], save2.last_checkpoints)
       self.assertFalse(saver_module.checkpoint_exists(s3))
       self.assertFalse(
-          saver_module.checkpoint_exists(save._MetaGraphFilename(s3)))
+          saver_module.checkpoint_exists(saver_module._meta_graph_filename(s3)))
       self.assertTrue(saver_module.checkpoint_exists(s2))
       self.assertTrue(
-          saver_module.checkpoint_exists(save._MetaGraphFilename(s2)))
+          saver_module.checkpoint_exists(saver_module._meta_graph_filename(s2)))
       self.assertTrue(saver_module.checkpoint_exists(s1))
       self.assertTrue(
-          saver_module.checkpoint_exists(save._MetaGraphFilename(s1)))
+          saver_module.checkpoint_exists(saver_module._meta_graph_filename(s1)))
       self.assertCheckpointState(
           model_checkpoint_path=s1,
           all_model_checkpoint_paths=[s2, s1],
@@ -1260,14 +1260,14 @@ class MaxToKeepTest(test.TestCase):
       # Created by the first helper.
       self.assertTrue(saver_module.checkpoint_exists(s1))
       self.assertTrue(
-          saver_module.checkpoint_exists(save._MetaGraphFilename(s1)))
+          saver_module.checkpoint_exists(saver_module._meta_graph_filename(s1)))
       # Deleted by the first helper.
       self.assertFalse(saver_module.checkpoint_exists(s3))
       self.assertFalse(
-          saver_module.checkpoint_exists(save._MetaGraphFilename(s3)))
+          saver_module.checkpoint_exists(saver_module._meta_graph_filename(s3)))
       self.assertTrue(saver_module.checkpoint_exists(s2))
       self.assertTrue(
-          saver_module.checkpoint_exists(save._MetaGraphFilename(s2)))
+          saver_module.checkpoint_exists(saver_module._meta_graph_filename(s2)))
       # Even though the file for s1 exists, this saver isn't aware of it, which
       # is why it doesn't end up in the checkpoint state.
       self.assertCheckpointState(
@@ -1280,13 +1280,13 @@ class MaxToKeepTest(test.TestCase):
       self.assertEqual([s2, s1], save3.last_checkpoints)
       self.assertFalse(saver_module.checkpoint_exists(s3))
       self.assertFalse(
-          saver_module.checkpoint_exists(save._MetaGraphFilename(s3)))
+          saver_module.checkpoint_exists(saver_module._meta_graph_filename(s3)))
       self.assertTrue(saver_module.checkpoint_exists(s2))
       self.assertTrue(
-          saver_module.checkpoint_exists(save._MetaGraphFilename(s2)))
+          saver_module.checkpoint_exists(saver_module._meta_graph_filename(s2)))
       self.assertTrue(saver_module.checkpoint_exists(s1))
       self.assertTrue(
-          saver_module.checkpoint_exists(save._MetaGraphFilename(s1)))
+          saver_module.checkpoint_exists(saver_module._meta_graph_filename(s1)))
       self.assertCheckpointState(
           model_checkpoint_path=s1,
           all_model_checkpoint_paths=[s2, s1],
@@ -1317,7 +1317,7 @@ class MaxToKeepTest(test.TestCase):
       else:
         self.assertEqual(4, len(gfile.Glob(s1 + "*")))
 
-      self.assertTrue(gfile.Exists(save._MetaGraphFilename(s1)))
+      self.assertTrue(gfile.Exists(saver_module._meta_graph_filename(s1)))
 
       s2 = save.save(sess, os.path.join(save_dir, "s2"))
       self.assertEqual([s1, s2], save.last_checkpoints)
@@ -1325,27 +1325,27 @@ class MaxToKeepTest(test.TestCase):
         self.assertEqual(2, len(gfile.Glob(s1)))
       else:
         self.assertEqual(4, len(gfile.Glob(s1 + "*")))
-      self.assertTrue(gfile.Exists(save._MetaGraphFilename(s1)))
+      self.assertTrue(gfile.Exists(saver_module._meta_graph_filename(s1)))
       if save._write_version is saver_pb2.SaverDef.V1:
         self.assertEqual(2, len(gfile.Glob(s2)))
       else:
         self.assertEqual(4, len(gfile.Glob(s2 + "*")))
-      self.assertTrue(gfile.Exists(save._MetaGraphFilename(s2)))
+      self.assertTrue(gfile.Exists(saver_module._meta_graph_filename(s2)))
 
       s3 = save.save(sess, os.path.join(save_dir, "s3"))
       self.assertEqual([s2, s3], save.last_checkpoints)
       self.assertEqual(0, len(gfile.Glob(s1 + "*")))
-      self.assertFalse(gfile.Exists(save._MetaGraphFilename(s1)))
+      self.assertFalse(gfile.Exists(saver_module._meta_graph_filename(s1)))
       if save._write_version is saver_pb2.SaverDef.V1:
         self.assertEqual(2, len(gfile.Glob(s2)))
       else:
         self.assertEqual(4, len(gfile.Glob(s2 + "*")))
-      self.assertTrue(gfile.Exists(save._MetaGraphFilename(s2)))
+      self.assertTrue(gfile.Exists(saver_module._meta_graph_filename(s2)))
       if save._write_version is saver_pb2.SaverDef.V1:
         self.assertEqual(2, len(gfile.Glob(s3)))
       else:
         self.assertEqual(4, len(gfile.Glob(s3 + "*")))
-      self.assertTrue(gfile.Exists(save._MetaGraphFilename(s3)))
+      self.assertTrue(gfile.Exists(saver_module._meta_graph_filename(s3)))
 
   def testNoMaxToKeep(self):
     save_dir = self._get_test_dir("no_max_to_keep")
@@ -1385,7 +1385,7 @@ class MaxToKeepTest(test.TestCase):
 
       s1 = save.save(sess, os.path.join(save_dir, "s1"), write_meta_graph=False)
       self.assertTrue(saver_module.checkpoint_exists(s1))
-      self.assertFalse(gfile.Exists(save._MetaGraphFilename(s1)))
+      self.assertFalse(gfile.Exists(saver_module._meta_graph_filename(s1)))
 
 
 class KeepCheckpointEveryNHoursTest(test.TestCase):
@@ -2339,6 +2339,46 @@ class MetaGraphTest(test.TestCase):
               10, size=[1, 10])
       })
 
+  def testImportIntoNamescopeWithoutVariables(self):
+    # Save a simple graph that contains no variables into a checkpoint.
+    test_dir = self._get_test_dir("no_vars_graph")
+    filename = os.path.join(test_dir, "ckpt")
+    graph_1 = ops_lib.Graph()
+    with session.Session(graph=graph_1) as sess:
+      constant_op.constant([1, 2, 3], name="x")
+      constant_op.constant([1, 2, 3], name="y")
+      saver = saver_module.Saver(allow_empty=True)
+      saver.save(sess, filename)
+
+    # Create a fresh graph.
+    graph_2 = ops_lib.Graph()
+    with session.Session(graph=graph_2) as sess:
+      # Restore the above checkpoint under scope "subgraph_1".
+      new_saver_1 = saver_module.import_meta_graph(
+          filename + ".meta", graph=graph_2, import_scope="subgraph_1")
+      # There are no variables to restore, so import_meta_graph should not
+      # return a Saver.
+      self.assertIsNone(new_saver_1)
+
+      # Create a variable in graph_2 under scope "my_scope".
+      variables.Variable(array_ops.zeros([10]), name="my_scope/my_var")
+      sess.run(variables.global_variables_initializer())
+      # Restore the checkpoint into a different scope "subgraph_2".
+      new_saver_2 = saver_module.import_meta_graph(
+          filename + ".meta", graph=graph_2, import_scope="subgraph_2")
+      # Because the variable does not live in scope "subgraph_2",
+      # import_meta_graph should not attempt to restore the variable. So,
+      # import_meta_graph still won't return a Saver instance.
+      self.assertIsNone(new_saver_2)
+
+      # However, if we restore the checkpoint under scope "my_scope",
+      # import_meta_graph will detect the variable and return a Saver for
+      # restoring it. This should happen even when the variable does not
+      # originate from graph_1.
+      new_saver_3 = saver_module.import_meta_graph(
+          filename + ".meta", graph=graph_2, import_scope="my_scope")
+      self.assertIsInstance(new_saver_3, saver_module.Saver)
+
   def testImportIntoImplicitNamescope(self):
     # Test that we can import a meta graph into an implicit namescope.
     test_dir = self._get_test_dir("import_into_namescope")
@@ -2581,6 +2621,20 @@ class SaverUtilsTest(test.TestCase):
     self.assertEqual(2, len(mtimes))
     self.assertTrue(mtimes[1] >= mtimes[0])
 
+  def testRemoveCheckpoint(self):
+    for sharded in (False, True):
+      for version in (saver_pb2.SaverDef.V2, saver_pb2.SaverDef.V1):
+        with self.test_session(graph=ops_lib.Graph()) as sess:
+          unused_v = variables.Variable(1.0, name="v")
+          variables.global_variables_initializer().run()
+          saver = saver_module.Saver(sharded=sharded, write_version=version)
+
+          path = os.path.join(self._base_dir, "%s-%s" % (sharded, version))
+          ckpt_prefix = saver.save(sess, path)
+          self.assertTrue(saver_module.checkpoint_exists(ckpt_prefix))
+          saver_module.remove_checkpoint(ckpt_prefix, version)
+          self.assertFalse(saver_module.checkpoint_exists(ckpt_prefix))
+
 
 class ScopedGraphTest(test.TestCase):
 
diff --git a/tensorflow/python/training/session_manager.py b/tensorflow/python/training/session_manager.py
index 3cb3877cc2fb846477640058b5bcdfea97ee6828..974f75777f43ab4ef3be2edea564d1ad902e4fd5 100644
--- a/tensorflow/python/training/session_manager.py
+++ b/tensorflow/python/training/session_manager.py
@@ -95,7 +95,8 @@ class SessionManager(object):
                ready_op=None,
                ready_for_local_init_op=None,
                graph=None,
-               recovery_wait_secs=30):
+               recovery_wait_secs=30,
+               local_init_run_options=None):
     """Creates a SessionManager.
 
     The `local_init_op` is an `Operation` that is run always after a new session
@@ -127,6 +128,8 @@ class SessionManager(object):
          to run local_init_op.
       graph: The `Graph` that the model will use.
       recovery_wait_secs: Seconds between checks for the model to be ready.
+      local_init_run_options: RunOptions to be passed to session.run when
+        executing the local_init_op.
 
     Raises:
       ValueError: If ready_for_local_init_op is not None but local_init_op is
@@ -141,6 +144,7 @@ class SessionManager(object):
     self._graph = graph
     self._recovery_wait_secs = recovery_wait_secs
     self._target = None
+    self._local_init_run_options = local_init_run_options
     if ready_for_local_init_op is not None and local_init_op is None:
       raise ValueError("If you pass a ready_for_local_init_op "
                        "you must also pass a local_init_op "
@@ -485,7 +489,7 @@ class SessionManager(object):
       is_ready_for_local_init, msg = self._model_ready_for_local_init(sess)
       if is_ready_for_local_init:
         logging.info("Running local_init_op.")
-        sess.run(self._local_init_op)
+        sess.run(self._local_init_op, options=self._local_init_run_options)
         logging.info("Done running local_init_op.")
         return True, None
       else:
diff --git a/tensorflow/python/training/supervisor.py b/tensorflow/python/training/supervisor.py
index 7389e344c7d8eef8e26c4d24c0985ff66276deea..372ea415df0ee299ebb51b2369c1027eb2db4865 100644
--- a/tensorflow/python/training/supervisor.py
+++ b/tensorflow/python/training/supervisor.py
@@ -225,7 +225,8 @@ class Supervisor(object):
                checkpoint_basename="model.ckpt",
                session_manager=None,
                summary_writer=USE_DEFAULT,
-               init_fn=None):
+               init_fn=None,
+               local_init_run_options=None):
     """Create a `Supervisor`.
 
     Args:
@@ -294,6 +295,8 @@ class Supervisor(object):
       init_fn: Optional callable used to initialize the model. Called
         after the optional `init_op` is called.  The callable must accept one
         argument, the session being initialized.
+      local_init_run_options: RunOptions to be passed as the SessionManager
+        local_init_run_options parameter.
 
     Returns:
       A `Supervisor`.
@@ -327,6 +330,7 @@ class Supervisor(object):
     self._recovery_wait_secs = recovery_wait_secs
     self._stop_grace_secs = stop_grace_secs
     self._init_fn = init_fn
+    self._local_init_run_options = local_init_run_options
 
     # Set all attributes related to checkpointing and writing events to None.
     # Afterwards, set them appropriately for chief supervisors, as these are
@@ -362,7 +366,8 @@ class Supervisor(object):
           ready_op=self._ready_op,
           ready_for_local_init_op=self._ready_for_local_init_op,
           graph=self._graph,
-          recovery_wait_secs=self._recovery_wait_secs)
+          recovery_wait_secs=self._recovery_wait_secs,
+          local_init_run_options=self._local_init_run_options)
     else:
       self._session_manager = session_manager
 
diff --git a/tensorflow/python/training/training_util.py b/tensorflow/python/training/training_util.py
index d05e1d2c830b2aa7008c9cba9f28eb6230d8bc82..0877b2a8a2fc7d59c4075c7d37c52ab691ec0361 100644
--- a/tensorflow/python/training/training_util.py
+++ b/tensorflow/python/training/training_util.py
@@ -119,18 +119,18 @@ def create_global_step(graph=None):
   graph = graph or ops.get_default_graph()
   if get_global_step(graph) is not None:
     raise ValueError('"global_step" already exists.')
+  if context.executing_eagerly():
+    with ops.device('cpu:0'):
+      return variable_scope.get_variable(
+          ops.GraphKeys.GLOBAL_STEP,
+          shape=[],
+          dtype=dtypes.int64,
+          initializer=init_ops.zeros_initializer(),
+          trainable=False,
+          collections=[ops.GraphKeys.GLOBAL_VARIABLES,
+                       ops.GraphKeys.GLOBAL_STEP])
   # Create in proper graph and base name_scope.
   with graph.as_default() as g, g.name_scope(None):
-    if context.executing_eagerly():
-      with ops.device('cpu:0'):
-        return variable_scope.get_variable(
-            ops.GraphKeys.GLOBAL_STEP,
-            shape=[],
-            dtype=dtypes.int64,
-            initializer=init_ops.zeros_initializer(),
-            trainable=False,
-            collections=[ops.GraphKeys.GLOBAL_VARIABLES,
-                         ops.GraphKeys.GLOBAL_STEP])
     return variable_scope.get_variable(
         ops.GraphKeys.GLOBAL_STEP,
         shape=[],
diff --git a/tensorflow/python/training/warm_starting_util.py b/tensorflow/python/training/warm_starting_util.py
index b0f37f8cb9e4d51e73335cab153d159484803158..ec740abdd15ae2904f79246429deaa5fc831dad5 100644
--- a/tensorflow/python/training/warm_starting_util.py
+++ b/tensorflow/python/training/warm_starting_util.py
@@ -237,6 +237,62 @@ def _warm_start_var_with_vocab(var,
 # pylint: enable=protected-access
 
 
+def _get_grouped_variables(vars_to_warm_start):
+  """Collects and groups (possibly partitioned) variables into a dictionary.
+
+  The variables can be provided explicitly through vars_to_warm_start, or they
+  are retrieved from collections (see below).
+
+  Args:
+    vars_to_warm_start: One of the following:
+
+      - A regular expression (string) that captures which variables to
+        warm-start (see tf.get_collection).  This expression will only consider
+        variables in the TRAINABLE_VARIABLES collection.
+      - A list of Variables to warm-start.
+      - A list of strings, each representing a full variable name to warm-start.
+      - `None`, in which case only variables specified in
+        `var_name_to_vocab_info` will be warm-started.
+  Returns:
+    A dictionary mapping variable names (strings) to lists of Variables.
+  Raises:
+    ValueError: If vars_to_warm_start is not a string, `None`, a list of
+      `Variables`, or a list of strings.
+  """
+  if isinstance(vars_to_warm_start, str) or vars_to_warm_start is None:
+    # Both vars_to_warm_start = '.*' and vars_to_warm_start = None will match
+    # everything (in TRAINABLE_VARIABLES) here.
+    list_of_vars = ops.get_collection(
+        ops.GraphKeys.TRAINABLE_VARIABLES,
+        scope=vars_to_warm_start)
+  elif isinstance(vars_to_warm_start, list):
+    if all([isinstance(v, str) for v in vars_to_warm_start]):
+      list_of_vars = []
+      for v in vars_to_warm_start:
+        list_of_vars += ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES,
+                                           scope=v)
+    elif all([_is_variable(v) for v in vars_to_warm_start]):
+      list_of_vars = vars_to_warm_start
+    else:
+      raise ValueError("If `vars_to_warm_start` is a list, it must be all "
+                       "`Variable` or all `str`.  Given types are {}".format(
+                           [type(v) for v in vars_to_warm_start]))
+  else:
+    raise ValueError("`vars_to_warm_start must be a `list` or `str`.  Given "
+                     "type is {}".format(type(vars_to_warm_start)))
+  # We have to deal with partitioned variables, since get_collection flattens
+  # out the list.
+  grouped_variables = {}
+  for v in list_of_vars:
+    if not isinstance(v, list):
+      var_name = _infer_var_name([v])
+    else:
+      var_name = _infer_var_name(v)
+    grouped_variables.setdefault(var_name, []).append(v)
+
+  return grouped_variables
+
+
 @tf_export("train.warm_start")
 def warm_start(ckpt_to_initialize_from,
                vars_to_warm_start=".*",
@@ -251,10 +307,19 @@ def warm_start(ckpt_to_initialize_from,
     ckpt_to_initialize_from: [Required] A string specifying the directory with
       checkpoint file(s) or path to checkpoint from which to warm-start the
       model parameters.
-    vars_to_warm_start: [Optional] A regular expression that captures which
-      variables to warm-start (see tf.get_collection).  Defaults to `'.*'`,
-      which warm-starts all variables.  If `None` is explicitly given, only
-      variables specified in `var_name_to_vocab_info` will be warm-started.
+    vars_to_warm_start: [Optional] One of the following:
+
+      - A regular expression (string) that captures which variables to
+        warm-start (see tf.get_collection).  This expression will only consider
+        variables in the TRAINABLE_VARIABLES collection.
+      - A list of Variables to warm-start.
+      - A list of strings, each representing a full variable name to warm-start.
+      - `None`, in which case only variables specified in
+        `var_name_to_vocab_info` will be warm-started.
+
+      Defaults to `'.*'`, which warm-starts all variables in the
+      TRAINABLE_VARIABLES collection.  Note that this excludes variables such as
+      accumulators and moving statistics from batch norm.
     var_name_to_vocab_info: [Optional] Dict of variable names (strings) to
       VocabInfo. The variable names should be "full" variables, not the names
       of the partitions.  If not explicitly provided, the variable is assumed to
@@ -274,21 +339,7 @@ def warm_start(ckpt_to_initialize_from,
   if var_name_to_prev_var_name is None:
     var_name_to_prev_var_name = {}
   logging.info("Warm-starting from: %s", (ckpt_to_initialize_from,))
-  # We have to deal with partitioned variables, since get_collection flattens
-  # out the list.
-  grouped_variables = {}
-  # Both vars_to_warm_start = '.*' and
-  # vars_to_warm_start = None will match everything here.
-  for v in ops.get_collection(
-      # TODO(eddz): Allow for different collections here (to support
-      # warm-starting accumulators).
-      ops.GraphKeys.TRAINABLE_VARIABLES,
-      scope=vars_to_warm_start):
-    if not isinstance(v, list):
-      var_name = _infer_var_name([v])
-    else:
-      var_name = _infer_var_name(v)
-    grouped_variables.setdefault(var_name, []).append(v)
+  grouped_variables = _get_grouped_variables(vars_to_warm_start)
 
   # Keep track of which var_names in var_name_to_prev_var_name and
   # var_name_to_vocab_info have been used.  Err on the safer side by throwing an
diff --git a/tensorflow/python/training/warm_starting_util_test.py b/tensorflow/python/training/warm_starting_util_test.py
index 7e8cbd6baeea160075b61d1191c8f1da5fe2163c..6a4c207d79edf22d635c38fe98589396e781e84e 100644
--- a/tensorflow/python/training/warm_starting_util_test.py
+++ b/tensorflow/python/training/warm_starting_util_test.py
@@ -36,6 +36,7 @@ from tensorflow.python.training import warm_starting_util as ws_util
 ones = init_ops.ones_initializer
 norms = init_ops.truncated_normal_initializer
 rand = init_ops.random_uniform_initializer
+zeros = init_ops.zeros_initializer
 
 
 class WarmStartingUtilTest(test.TestCase):
@@ -305,6 +306,46 @@ class WarmStartingUtilTest(test.TestCase):
         self.assertAllEqual([[0.5], [0.], [0.]],
                             fruit_weights_vars[1].eval(sess))
 
+  def testWarmStart_ListOfVariables(self):
+    # Save checkpoint from which to warm-start.
+    _, prev_int_val = self._create_prev_run_var("v1", shape=[10, 1],
+                                                initializer=ones())
+    # Verify we initialized the values correctly.
+    self.assertAllEqual(np.ones([10, 1]), prev_int_val)
+
+    # New graph, new session with warm-starting.
+    with ops.Graph().as_default() as g:
+      with self.test_session(graph=g) as sess:
+        # Initialize with zeros.
+        var = variable_scope.get_variable(
+            "v1",
+            shape=[10, 1],
+            initializer=zeros())
+        ws_util.warm_start(self.get_temp_dir(), vars_to_warm_start=[var])
+        sess.run(variables.global_variables_initializer())
+        # Verify weights were correctly warm-started (init overridden to ones).
+        self.assertAllEqual(var.eval(), prev_int_val)
+
+  def testWarmStart_ListOfStrings(self):
+    # Save checkpoint from which to warm-start.
+    _, prev_int_val = self._create_prev_run_var("v1", shape=[10, 1],
+                                                initializer=ones())
+    # Verify we initialized the values correctly.
+    self.assertAllEqual(np.ones([10, 1]), prev_int_val)
+
+    # New graph, new session with warm-starting.
+    with ops.Graph().as_default() as g:
+      with self.test_session(graph=g) as sess:
+        # Initialize with zeros.
+        var = variable_scope.get_variable(
+            "v1",
+            shape=[10, 1],
+            initializer=zeros())
+        ws_util.warm_start(self.get_temp_dir(), vars_to_warm_start=["v1"])
+        sess.run(variables.global_variables_initializer())
+        # Verify weights were correctly warm-started (init overridden to ones).
+        self.assertAllEqual(var.eval(), prev_int_val)
+
   def testWarmStart_SparseColumnIntegerized(self):
     # Create feature column.
     sc_int = fc.categorical_column_with_identity("sc_int", num_buckets=10)
diff --git a/tensorflow/python/util/stat_summarizer.i b/tensorflow/python/util/stat_summarizer.i
index 6aeaa0e31b9b48f7e6705ab7146828cc0e0e5e08..73fa85494b72d920d00577c826b76c3381d963a4 100644
--- a/tensorflow/python/util/stat_summarizer.i
+++ b/tensorflow/python/util/stat_summarizer.i
@@ -73,7 +73,7 @@ void _DeleteStatSummarizer(tensorflow::StatSummarizer* ss);
     return ss;
 }
 }
-
+%include "tensorflow/core/util/stat_summarizer_options.h"
 %include "tensorflow/core/util/stat_summarizer.h"
 %unignoreall
 
@@ -88,9 +88,4 @@ def NewStatSummarizer(unused):
 
 def DeleteStatSummarizer(stat_summarizer):
   _DeleteStatSummarizer(stat_summarizer)
-
-NewStatSummarizer._tf_api_names = ["contrib.stat_summarizer.NewStatSummarizer"]
-DeleteStatSummarizer._tf_api_names = [
-    "contrib.stat_summarizer.DeleteStatSummarizer"]
-StatSummarizer._tf_api_names = ["contrib.stat_summarizer.StatSummarizer"]
 %}
diff --git a/tensorflow/python/util/tf_export.py b/tensorflow/python/util/tf_export.py
index bf3961c6920c4c6ade0593b28f9eb1fd23ea8e0d..e154ffb68a4f0ccdebf5320cad7d3da056117197 100644
--- a/tensorflow/python/util/tf_export.py
+++ b/tensorflow/python/util/tf_export.py
@@ -41,17 +41,35 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import collections
+import functools
 import sys
 
 from tensorflow.python.util import tf_decorator
 
+ESTIMATOR_API_NAME = 'estimator'
+TENSORFLOW_API_NAME = 'tensorflow'
+
+_Attributes = collections.namedtuple(
+    'ExportedApiAttributes', ['names', 'constants'])
+
+# Attribute values must be unique to each API.
+API_ATTRS = {
+    TENSORFLOW_API_NAME: _Attributes(
+        '_tf_api_names',
+        '_tf_api_constants'),
+    ESTIMATOR_API_NAME: _Attributes(
+        '_estimator_api_names',
+        '_estimator_api_constants')
+}
+
 
 class SymbolAlreadyExposedError(Exception):
   """Raised when adding API names to symbol that already has API names."""
   pass
 
 
-class tf_export(object):  # pylint: disable=invalid-name
+class api_export(object):  # pylint: disable=invalid-name
   """Provides ways to export symbols to the TensorFlow API."""
 
   def __init__(self, *args, **kwargs):
@@ -63,15 +81,12 @@ class tf_export(object):  # pylint: disable=invalid-name
           overrides: List of symbols that this is overriding
           (those overrided api exports will be removed). Note: passing overrides
           has no effect on exporting a constant.
-          allow_multiple_exports: Allows exporting the same symbol multiple
-          times with multiple `tf_export` usages. Prefer however, to list all
-          of the exported names in a single `tf_export` usage when possible.
-
+          api_name: Name of the API you want to generate (e.g. `tensorflow` or
+          `estimator`). Default is `tensorflow`.
     """
     self._names = args
+    self._api_name = kwargs.get('api_name', TENSORFLOW_API_NAME)
     self._overrides = kwargs.get('overrides', [])
-    self._allow_multiple_exports = kwargs.get(
-        'allow_multiple_exports', False)
 
   def __call__(self, func):
     """Calls this decorator.
@@ -86,25 +101,24 @@ class tf_export(object):  # pylint: disable=invalid-name
       SymbolAlreadyExposedError: Raised when a symbol already has API names
         and kwarg `allow_multiple_exports` not set.
     """
+    api_names_attr = API_ATTRS[self._api_name].names
+
     # Undecorate overridden names
     for f in self._overrides:
       _, undecorated_f = tf_decorator.unwrap(f)
-      del undecorated_f._tf_api_names  # pylint: disable=protected-access
+      delattr(undecorated_f, api_names_attr)
 
     _, undecorated_func = tf_decorator.unwrap(func)
 
     # Check for an existing api. We check if attribute name is in
     # __dict__ instead of using hasattr to verify that subclasses have
     # their own _tf_api_names as opposed to just inheriting it.
-    if '_tf_api_names' in undecorated_func.__dict__:
-      if self._allow_multiple_exports:
-        undecorated_func._tf_api_names += self._names  # pylint: disable=protected-access
-      else:
-        raise SymbolAlreadyExposedError(
-            'Symbol %s is already exposed as %s.' %
-            (undecorated_func.__name__, undecorated_func._tf_api_names))  # pylint: disable=protected-access
-    else:
-      undecorated_func._tf_api_names = self._names  # pylint: disable=protected-access
+    if api_names_attr in undecorated_func.__dict__:
+      raise SymbolAlreadyExposedError(
+          'Symbol %s is already exposed as %s.' %
+          (undecorated_func.__name__, getattr(
+              undecorated_func, api_names_attr)))  # pylint: disable=protected-access
+    setattr(undecorated_func, api_names_attr, self._names)
     return func
 
   def export_constant(self, module_name, name):
@@ -126,8 +140,12 @@ class tf_export(object):  # pylint: disable=invalid-name
       name: (string) Current constant name.
     """
     module = sys.modules[module_name]
-    if not hasattr(module, '_tf_api_constants'):
-      module._tf_api_constants = []  # pylint: disable=protected-access
+    if not hasattr(module, API_ATTRS[self._api_name].constants):
+      setattr(module, API_ATTRS[self._api_name].constants, [])
     # pylint: disable=protected-access
-    module._tf_api_constants.append((self._names, name))
+    getattr(module, API_ATTRS[self._api_name].constants).append(
+        (self._names, name))
+
 
+tf_export = functools.partial(api_export, api_name=TENSORFLOW_API_NAME)
+estimator_export = functools.partial(tf_export, api_name=ESTIMATOR_API_NAME)
diff --git a/tensorflow/python/util/tf_export_test.py b/tensorflow/python/util/tf_export_test.py
index ace3f054ba952f012aa5ca642e490b1f45f8ba1d..b9e26ecb33383f5aa936a6bc92acea6d91eb996e 100644
--- a/tensorflow/python/util/tf_export_test.py
+++ b/tensorflow/python/util/tf_export_test.py
@@ -128,13 +128,6 @@ class ValidateExportTest(test.TestCase):
     with self.assertRaises(tf_export.SymbolAlreadyExposedError):
       export_decorator(_test_function)
 
-  def testEAllowMultipleExports(self):
-    _test_function._tf_api_names = ['name1', 'name2']
-    tf_export.tf_export('nameRed', 'nameBlue', allow_multiple_exports=True)(
-        _test_function)
-    self.assertEquals(['name1', 'name2', 'nameRed', 'nameBlue'],
-                      _test_function._tf_api_names)
-
   def testOverridesFunction(self):
     _test_function2._tf_api_names = ['abc']
 
diff --git a/tensorflow/python/util/tf_inspect.py b/tensorflow/python/util/tf_inspect.py
index 7872f521720de0db34377d6fe9a67137fb1af306..fbd65617670b15bfc69506bab1e83369081502af 100644
--- a/tensorflow/python/util/tf_inspect.py
+++ b/tensorflow/python/util/tf_inspect.py
@@ -42,32 +42,67 @@ def currentframe():
   return _inspect.stack()[1][0]
 
 
-def getargspec(object):  # pylint: disable=redefined-builtin
+def getargspec(obj):
   """TFDecorator-aware replacement for inspect.getargspec.
 
   Args:
-    object: A callable (function or partial function), possibly decorated.
+    obj: A function, partial function, or callable object, possibly
+    decorated.
 
   Returns:
     The `ArgSpec` that describes the signature of the outermost decorator that
     changes the callable's signature. If the callable is not decorated,
-    `inspect.getargspec()` will be called directly on the callable.
+    `inspect.getargspec()` will be called directly on the object.
 
   Raises:
-    ValueError: When callable's function signature can not be expressed with
-    ArgSpec.
+    ValueError: When callable's signature can not be expressed with
+      ArgSpec.
+    TypeError: For objects of unsupported types.
   """
+  if isinstance(obj, functools.partial):
+    return _get_argspec_for_partial(obj)
 
-  def get_argspec_with_decorator(obj):
-    decorators, target = tf_decorator.unwrap(obj)
-    return next((d.decorator_argspec
-                 for d in decorators
-                 if d.decorator_argspec is not None),
-                _inspect.getargspec(target))
+  decorators, target = tf_decorator.unwrap(obj)
+
+  spec = next((d.decorator_argspec
+               for d in decorators
+               if d.decorator_argspec is not None), None)
+  if spec:
+    return spec
+
+  try:
+    # Python3 will handle most callables here (not partial).
+    return _inspect.getargspec(target)
+  except TypeError:
+    pass
+
+  if isinstance(target, type):
+    try:
+      return _inspect.getargspec(target.__init__)
+    except TypeError:
+      pass
+
+    try:
+      return _inspect.getargspec(target.__new__)
+    except TypeError:
+      pass
 
-  if not isinstance(object, functools.partial):
-    return get_argspec_with_decorator(object)
+  # The `type(target)` ensures that if a class is received we don't return
+  # the signature of it's __call__ method.
+  return _inspect.getargspec(type(target).__call__)
 
+
+def _get_argspec_for_partial(obj):
+  """Implements `getargspec` for `functools.partial` objects.
+
+  Args:
+    obj: The `functools.partial` obeject
+  Returns:
+    An `inspect.ArgSpec`
+  Raises:
+    ValueError: When callable's signature can not be expressed with
+      ArgSpec.
+  """
   # When callable is a functools.partial object, we construct its ArgSpec with
   # following strategy:
   # - If callable partial contains default value for positional arguments (ie.
@@ -97,10 +132,10 @@ def getargspec(object):  # pylint: disable=redefined-builtin
   # value and ensures all following arguments also have default values. When
   # this is not true, a ValueError is raised.
 
-  n_prune_args = len(object.args)
-  partial_keywords = object.keywords or {}
+  n_prune_args = len(obj.args)
+  partial_keywords = obj.keywords or {}
 
-  args, varargs, keywords, defaults = get_argspec_with_decorator(object.func)
+  args, varargs, keywords, defaults = getargspec(obj.func)
 
   # Pruning first n_prune_args arguments.
   args = args[n_prune_args:]
@@ -137,11 +172,34 @@ def getargspec(object):  # pylint: disable=redefined-builtin
   return ArgSpec(args, varargs, keywords, tuple(all_defaults[first_default:]))
 
 
-def getfullargspec(obj):  # pylint: disable=redefined-builtin
-  """TFDecorator-aware replacement for `inspect.getfullargspec`/`getargspec`.
+if hasattr(_inspect, 'getfullargspec'):
+  _getfullargspec = _inspect.getfullargspec
+else:
+
+  def _getfullargspec(target):
+    """A python2 version of getfullargspec.
+
+    Args:
+      target: the target object to inspect.
+    Returns:
+      A FullArgSpec with empty kwonlyargs, kwonlydefaults and annotations.
+    """
+    argspecs = _inspect.getargspec(target)
+    fullargspecs = FullArgSpec(
+        args=argspecs.args,
+        varargs=argspecs.varargs,
+        varkw=argspecs.keywords,
+        defaults=argspecs.defaults,
+        kwonlyargs=[],
+        kwonlydefaults=None,
+        annotations={})
+    return fullargspecs
+
 
-  This wrapper uses `inspect.getfullargspec` if available and falls back to
-  `inspect.getargspec` in Python 2.
+def getfullargspec(obj):
+  """TFDecorator-aware replacement for `inspect.getfullargspec`.
+
+  This wrapper emulates `inspect.getfullargspec` in[^)]* Python2.
 
   Args:
     obj: A callable, possibly decorated.
@@ -152,34 +210,10 @@ def getfullargspec(obj):  # pylint: disable=redefined-builtin
     callable is not decorated, `inspect.getfullargspec()` will be called
     directly on the callable.
   """
-  if hasattr(_inspect, 'getfullargspec'):
-    spec_fn = _inspect.getfullargspec
-  else:
-    def spec_fn(target):
-      """Spec function that adding default value from FullArgSpec.
-
-      It is used when getfullargspec is not available (eg in PY2).
-
-      Args:
-        target: the target object to inspect.
-      Returns:
-        The full argument specs with empty kwonlyargs, kwonlydefaults and
-        annotations.
-      """
-      argspecs = _inspect.getargspec(target)
-      fullargspecs = FullArgSpec(
-          args=argspecs.args,
-          varargs=argspecs.varargs,
-          varkw=argspecs.keywords,
-          defaults=argspecs.defaults,
-          kwonlyargs=[],
-          kwonlydefaults=None,
-          annotations={})
-      return fullargspecs
-
   decorators, target = tf_decorator.unwrap(obj)
-  return next((d.decorator_argspec for d in decorators
-               if d.decorator_argspec is not None), spec_fn(target))
+  return next((d.decorator_argspec
+               for d in decorators
+               if d.decorator_argspec is not None), _getfullargspec(target))
 
 
 def getcallargs(func, *positional, **named):
diff --git a/tensorflow/python/util/tf_inspect_test.py b/tensorflow/python/util/tf_inspect_test.py
index 325131c4f4759ea06d1266921880d3e3c45e5ecb..beaf350de1e469a7675a4b55ff341419262b79b2 100644
--- a/tensorflow/python/util/tf_inspect_test.py
+++ b/tensorflow/python/util/tf_inspect_test.py
@@ -245,6 +245,52 @@ class TfInspectTest(test.TestCase):
     self.assertEqual(partial_argspec,
                      tf_inspect.getargspec(partial_with_decorator))
 
+  def testGetArgSpecOnCallableObject(self):
+
+    class Callable(object):
+
+      def __call__(self, a, b=1, c='hello'):
+        pass
+
+    argspec = tf_inspect.ArgSpec(
+        args=['self', 'a', 'b', 'c'],
+        varargs=None,
+        keywords=None,
+        defaults=(1, 'hello'))
+
+    test_obj = Callable()
+    self.assertEqual(argspec, tf_inspect.getargspec(test_obj))
+
+  def testGetArgSpecOnInitClass(self):
+
+    class InitClass(object):
+
+      def __init__(self, a, b=1, c='hello'):
+        pass
+
+    argspec = tf_inspect.ArgSpec(
+        args=['self', 'a', 'b', 'c'],
+        varargs=None,
+        keywords=None,
+        defaults=(1, 'hello'))
+
+    self.assertEqual(argspec, tf_inspect.getargspec(InitClass))
+
+  def testGetArgSpecOnNewClass(self):
+
+    class NewClass(object):
+
+      def __new__(cls, a, b=1, c='hello'):
+        pass
+
+    argspec = tf_inspect.ArgSpec(
+        args=['cls', 'a', 'b', 'c'],
+        varargs=None,
+        keywords=None,
+        defaults=(1, 'hello'))
+
+    self.assertEqual(argspec, tf_inspect.getargspec(NewClass))
+
   def testGetDoc(self):
     self.assertEqual('Test Decorated Function With Defaults Docstring.',
                      tf_inspect.getdoc(test_decorated_function_with_defaults))
diff --git a/tensorflow/python/util/util.cc b/tensorflow/python/util/util.cc
index 8bfc0ec9584598965c32a630d933a992c85aa18b..c79d8a84458800937e3e51a8dae26605bd834233 100644
--- a/tensorflow/python/util/util.cc
+++ b/tensorflow/python/util/util.cc
@@ -14,6 +14,7 @@ limitations under the License.
 ==============================================================================*/
 #include "tensorflow/python/util/util.h"
 
+#include <functional>
 #include <unordered_map>
 #include <vector>
 
@@ -32,6 +33,8 @@ namespace {
 PyObject* CollectionsSequenceType = nullptr;
 PyTypeObject* SparseTensorValueType = nullptr;
 
+const int kMaxItemsInCache = 1024;
+
 bool WarnedThatSetIsNotSequence = false;
 
 bool IsString(PyObject* o) {
@@ -171,17 +174,20 @@ int IsSequenceHelper(PyObject* o) {
   // Try not to return to Python - see if the type has already been seen
   // before.
 
-  // NOTE: It's not clear whether the lock is required (we should be holding the
-  // python GIL in this code already).
-  mutex_lock l(g_type_to_sequence_map);
   auto* type_to_sequence_map = IsTypeSequenceMap();
   auto* type = Py_TYPE(o);
 
-  auto it = type_to_sequence_map->find(type);
-  if (it != type_to_sequence_map->end()) {
-    return it->second;
+  {
+    mutex_lock l(g_type_to_sequence_map);
+    auto it = type_to_sequence_map->find(type);
+    if (it != type_to_sequence_map->end()) {
+      return it->second;
+    }
   }
 
+  // NOTE: We explicitly release the g_type_to_sequence_map mutex,
+  // because PyObject_IsInstance() may release the GIL, allowing another thread
+  // concurrent entry to this function.
   int is_instance = PyObject_IsInstance(o, CollectionsSequenceType);
 
   // Don't cache a failed is_instance check.
@@ -192,9 +198,15 @@ int IsSequenceHelper(PyObject* o) {
   // NOTE: This is never decref'd, but we don't want the type to get deleted
   // as long as it is in the map. This should not be too much of a
   // leak, as there should only be a relatively small number of types in the
-  // map, and an even smaller number that are eligible for decref.
-  Py_INCREF(type);
-  type_to_sequence_map->insert({type, is_sequence});
+  // map, and an even smaller number that are eligible for decref. As a
+  // precaution, we limit the size of the map to 1024.
+  {
+    mutex_lock l(g_type_to_sequence_map);
+    if (type_to_sequence_map->size() < kMaxItemsInCache) {
+      Py_INCREF(type);
+      type_to_sequence_map->insert({type, is_sequence});
+    }
+  }
 
   return is_sequence;
 }
@@ -236,6 +248,9 @@ bool GetNextValuesForIterable(PyObject* nested,
                               std::vector<Safe_PyObjectPtr>* next_values) {
   PyObject* item;
   PyObject* iterator = PyObject_GetIter(nested);
+  if (iterator == nullptr || PyErr_Occurred()) {
+    return false;
+  }
   while ((item = PyIter_Next(iterator)) != nullptr) {
     next_values->emplace_back(item);
   }
diff --git a/tensorflow/security/advisory/tfsa-2018-001.md b/tensorflow/security/advisory/tfsa-2018-001.md
new file mode 100644
index 0000000000000000000000000000000000000000..bb97543a21988b4370ddac912102add6a10e2b35
--- /dev/null
+++ b/tensorflow/security/advisory/tfsa-2018-001.md
@@ -0,0 +1,34 @@
+## TFSA-2018-001: BMP File Parser Out-of-bounds Read.
+
+### CVE Number
+
+CVE-2018-7574
+
+### Issue Description
+
+The BMP (bitmap image file graphics format) decoder had an out-of-bounds read
+due to insufficient checking of header sizes and signed integer values. 
+
+### Impact
+
+The most likely consequence of this vulnerability would be that an invalid BMP
+file could lead to an unhandled process crash, but may permit read access to
+unintended regions of the TensorFlow process memory.
+
+### Vulnerable Versions
+
+TensorFlow 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.6.0
+
+### Mitigation
+
+We have patched the vulnerability in GitHub commit
+[49f73c55](https://github.com/tensorflow/tensorflow/commit/49f73c55d56edffebde4bca4a407ad69c1cae4333c55).
+If users are running TensorFlow in production or on untrusted data, they are
+encouraged to apply this patch.
+
+Additionally, this patch has already been integrated into TensorFlow 1.7.0 and
+newer.
+
+### Credits
+
+This issue was discovered by the Blade Team of Tencent.
diff --git a/tensorflow/security/advisory/tfsa-2018-002.md b/tensorflow/security/advisory/tfsa-2018-002.md
new file mode 100644
index 0000000000000000000000000000000000000000..fad7fdd40f6dcc651ee72e0496f99377ebe24dbc
--- /dev/null
+++ b/tensorflow/security/advisory/tfsa-2018-002.md
@@ -0,0 +1,33 @@
+## TFSA-2018-002: GIF File Parsing Null Pointer Dereference Error
+
+### CVE Number
+
+CVE-2018-7576
+
+### Issue Description
+
+When parsing certain invalid GIF files, an internal function in the GIF decoder
+returned a null pointer, which was subsequently used as an argument to strcat.
+
+### Impact
+
+A maliciously crafted GIF could be used to cause the TensorFlow process to
+crash.
+
+### Vulnerable Versions
+
+TensorFlow 1.0.0, 1.0.1, 1.1.0, 1.2.0, 1.2.1, 1.3.0, 1.3.1, 1 1.4.1, 1.5.0, 1.5.1
+
+### Mitigation
+
+We have patched the vulnerability in GitHub commit
+[c4843158](https://github.com/tensorflow/tensorflow/commit/c48431588e7cf8aff61d4c299231e3e925144df8).
+If users are running TensorFlow in production or on untrusted data, they are
+encouraged to apply this patch.
+
+Additionally, this patch has already been integrated into TensorFlow 1.6.0 and
+newer.
+
+### Credits
+
+This issue was discovered by the Blade Team of Tencent.
diff --git a/tensorflow/security/advisory/tfsa-2018-003.md b/tensorflow/security/advisory/tfsa-2018-003.md
new file mode 100644
index 0000000000000000000000000000000000000000..747d37064c02db84b92e669512b5ca4e40c431a2
--- /dev/null
+++ b/tensorflow/security/advisory/tfsa-2018-003.md
@@ -0,0 +1,48 @@
+## TFSA-2018-003: TensorFlow Lite TOCO FlatBuffer Parsing Vulnerability
+
+### CVE Number
+
+CVE-2018-8825
+
+### Issue Description
+
+The TensorFlow Lite TOCO compiler does not perform correct boundary checks when
+reading from some fields within TFLite files. 
+
+As background, TFLite files are based on the FlatBuffers serialization format,
+which does not have bounds checking built-in, rather it relies on the clients to
+handle the appropriate security checks by themselves.
+
+In particular, TOCO is not performing correct bounds checks in the following places:
+* Out of bounds read in TOCO in import.cc:42
+* Null dereference in TOCO in import.cc:135
+* Out of bounds read in TOCO in import.cc:104
+* Null dereference in TOCO in import.cc:121
+* Out of bounds read in TOCO in import.cc:62
+* Out of bounds read in TOCO in operator.cc:48
+* Out of bounds read in TOCO graph_transformations (propagate_fixed_sizes.cc:93)
+
+
+### Impact
+
+Users passing a malformed or malicious version of a TFLite graph into TOCO will
+cause TOCO to crash or cause a buffer overflow, potentially allowing malicious
+code to be executed.
+
+### Vulnerable Versions
+
+TensorFlow 1.5.0, 1.5.1, 1.6.0, 1.7.0
+
+### Mitigation
+
+We have patched the vulnerability in GitHub commits [41335abb](https://github.com/tensorflow/tensorflow/commit/41335abb46f80ca644b5738550daef6136ba5476) and
+[8badd11d](https://github.com/tensorflow/tensorflow/commit/8badd11d875a826bd318ed439909d5c47a7fb811).
+If users are running the TensorFlow TFLite TOCO compiler in production or on
+untrusted data, they are encouraged to apply this patch.
+
+Additionally, we have released TensorFlow version 1.7.1 to mitigate this
+vulnerability.
+
+### Credits
+
+This issue was discovered by the Blade Team of Tencent.
diff --git a/tensorflow/security/advisory/tfsa-2018-004.md b/tensorflow/security/advisory/tfsa-2018-004.md
new file mode 100644
index 0000000000000000000000000000000000000000..3af28defa1387fc8ff99c9f07ae2ff2bcda9b268
--- /dev/null
+++ b/tensorflow/security/advisory/tfsa-2018-004.md
@@ -0,0 +1,35 @@
+## TFSA-2018-004: Checkpoint Meta File Out-of-Bounds Read
+
+### CVE Number
+
+CVE-2018-7575
+
+### Issue Description
+
+The block size in meta file might contain a large int64 value which causes
+an integer overflow upon addition. Subsequent code using n as index may cause
+an out-of-bounds read.
+
+### Impact
+
+A maliciously crafted meta checkpoint could be used to cause the TensorFlow
+process to perform an out of bounds read on in process memory.
+
+### Vulnerable Versions
+
+TensorFlow 1.0.0, 1.0.1, 1.1.0, 1.2.0, 1.2.1, 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.6.0, 1.7.0
+
+### Mitigation
+
+We have patched the vulnerability in GitHub commit
+[d107fee1](https://github.com/tensorflow/tensorflow/commit/d107fee1e4a9a4462f01564798d345802acc2aef).
+If users are running TensorFlow on untrusted meta checkpoints, such as those
+downloaded from the Internet, in production or on untrusted data, they are
+encouraged to apply this patch.
+
+Additionally, we have released TensorFlow version 1.7.1 to mitigate this
+vulnerability.
+
+### Credits
+
+This issue was discovered by the Blade Team of Tencent.
diff --git a/tensorflow/security/advisory/tfsa-2018-005.md b/tensorflow/security/advisory/tfsa-2018-005.md
new file mode 100644
index 0000000000000000000000000000000000000000..c0f339fd976f5635fe774141e671a31d27523a0b
--- /dev/null
+++ b/tensorflow/security/advisory/tfsa-2018-005.md
@@ -0,0 +1,36 @@
+## TFSA-2018-005: Old Snappy Library Usage Resulting in Memcpy Parameter Overlap
+
+### CVE Number
+
+CVE-2018-7577
+
+### Issue Description
+
+TensorFlow checkpoint meta file uses Google's [https://github.com/google/snappy](snappy)
+compression/decompression library. There is a memcpy-param-overlap issue in the
+version of snappy currently used by TensorFlow.
+
+### Impact
+
+A maliciously crafted checkpoint meta file could cause TensorFlow to crash or
+read from other parts of its process memory.
+
+### Vulnerable Versions
+
+TensorFlow 1.1.0, 1.2.0, 1.2.1, 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.6.0, 1.7.0
+
+### Mitigation
+
+We have patched the vulnerability in GitHub commit
+[dfa9921e](https://github.com/tensorflow/tensorflow/commit/dfa9921e6343727b05f42f8d4a918b19528ff994)
+by upgrading the version of the snappy library used by TensorFlow to v1.1.7.
+
+If users are loading untrusted checkpoints in TensorFlow, we encourage users to
+apply the patch to upgrade snappy.
+
+Additionally, we have released TensorFlow version 1.7.1 to mitigate this
+vulnerability.
+
+### Credits
+
+This issue was discovered by the Blade Team of Tencent.
diff --git a/tensorflow/security/advisory/tfsa-2018-006.md b/tensorflow/security/advisory/tfsa-2018-006.md
new file mode 100644
index 0000000000000000000000000000000000000000..17f514d8d2b5435d3325cc2e30bb4e48fe3284cf
--- /dev/null
+++ b/tensorflow/security/advisory/tfsa-2018-006.md
@@ -0,0 +1,35 @@
+## TFSA-2018-006: Crafted Configuration File results in Invalid Memory Access
+
+### CVE Number
+
+CVE-2018-10055
+
+### Issue Description
+
+A maliciously crafted configuration file passed into the TensorFlow XLA compiler
+could cause an invalid memory access and/or a heap buffer overflow.
+
+### Impact
+
+A maliciously crafted configuration file could cause TensorFlow to crash or
+read from other parts of its process memory.
+
+### Vulnerable Versions
+
+TensorFlow 1.1.0, 1.2.0, 1.2.1, 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.6.0, 1.7.0
+
+### Mitigation
+
+We have patched the vulnerability in GitHub commit
+[c89ab82a](https://github.com/tensorflow/tensorflow/commit/c89ab82a82585cdaa90bf4911980e9e845909e78).
+
+If users are loading untrusted configurations in TensorFlow, we encourage users
+to apply the patch to upgrade snappy or upgrade the version of TensorFlow they
+are currently using.
+
+Additionally, we have released TensorFlow version 1.7.1 to mitigate this
+vulnerability.
+
+### Credits
+
+This issue was discovered by the Blade Team of Tencent.
diff --git a/tensorflow/security/index.md b/tensorflow/security/index.md
new file mode 100644
index 0000000000000000000000000000000000000000..ea39e17ab2bb417bba1ebe4a589833309fc2c626
--- /dev/null
+++ b/tensorflow/security/index.md
@@ -0,0 +1,18 @@
+# TensorFlow Security Advisories
+
+We regularly publish security advisories about using TensorFlow.
+
+*Note*: In conjunction with these security advisories, we strongly encourage
+TensorFlow users to read and understand TensorFlow's security model as outlined
+in (https://github.com/tensorflow/tensorflow/blob/master/SECURITY.md)[SECURITY.md].
+
+| Advisory Number | Type               | Versions affected | Reported by           | Additional Information      |
+|-----------------|--------------------|:-----------------:|-----------------------|-----------------------------|
+| [TFSA-2018-006](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/security/advisory/tfsa-2018-006.md)   | Crafted Configuration File results in Invalid Memory Access | <= 1.7 | Blade Team of Tencent |  |
+| [TFSA-2018-005](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/security/advisory/tfsa-2018-005.md)   | Old Snappy Library Usage Resulting in Memcpy Parameter Overlap | <= 1.7 | Blade Team of Tencent |  |
+| [TFSA-2018-004](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/security/advisory/tfsa-2018-004.md)   | Checkpoint Meta File Out-of-Bounds Read | <= 1.7 | Blade Team of Tencent |  |
+| [TFSA-2018-003](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/security/advisory/tfsa-2018-003.md)   | TensorFlow Lite TOCO FlatBuffer Parsing Vulnerability | <= 1.7 | Blade Team of Tencent |  |
+| [TFSA-2018-002](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/security/advisory/tfsa-2018-002.md)   | GIF File Parsing Null Pointer Dereference Error | <= 1.5 | Blade Team of Tencent |  |
+| [TFSA-2018-001](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/security/advisory/tfsa-2018-001.md)   | BMP File Parser Out-of-bounds Read | <= 1.6 | Blade Team of Tencent |  |
+| -               | Out Of Bounds Read |             <= 1.4 | Blade Team of Tencent | [issue report](https://github.com/tensorflow/tensorflow/issues/14959) |
+
diff --git a/tensorflow/stream_executor/cuda/cuda_blas.cc b/tensorflow/stream_executor/cuda/cuda_blas.cc
index 08fe153b5909d36eae7848862932bb1359c29fe0..92c1a5fc07190e77a6ecc814145ec06da303e27e 100644
--- a/tensorflow/stream_executor/cuda/cuda_blas.cc
+++ b/tensorflow/stream_executor/cuda/cuda_blas.cc
@@ -2155,10 +2155,7 @@ bool CUDABlas::DoBlasGemmWithAlgorithmImpl(
     const HostOrDeviceScalar<CompT> &beta, DeviceMemory<OutT> *c, int ldc,
     blas::ComputationType computation_type, blas::AlgorithmType algorithm,
     blas::ProfileResult *output_profile_result) {
-// CUDA < version 8 and GPUs < sm_50 don't support cublasGemmEx.
-#if CUDA_VERSION < 8000
-  return false;
-#else
+  // GPUs < sm_50 don't support cublasGemmEx.
   int cc_major, cc_minor;
   if (stream->parent()->GetDeviceDescription().cuda_compute_capability(
           &cc_major, &cc_minor) &&
@@ -2184,6 +2181,13 @@ bool CUDABlas::DoBlasGemmWithAlgorithmImpl(
     }
   }
 
+  // Return false if we might be hitting a cuBLAS bug that produces the wrong
+  // result. See nvbugs/2156201, b/79126339.
+  if (CUDA_VERSION < 9020 && algorithm != CUBLAS_GEMM_ALGO12 &&
+      std::max({m, n, k}) >= 2097153 && cc_major < 7) {
+    return false;
+  }
+
   cudaDataType_t cuda_in_type = CUDADataType<InT>::type;
   // Since we are converting 'algorithm' to cublasGemmAlgo_t by static_cast,
   // we do the following compile-time check on the default value:
@@ -2213,7 +2217,6 @@ bool CUDABlas::DoBlasGemmWithAlgorithmImpl(
         timer->GetElapsedMilliseconds());
   }
   return result;
-#endif
 }
 
 bool CUDABlas::GetBlasGemmAlgorithms(
diff --git a/tensorflow/stream_executor/cuda/cuda_diagnostics.cc b/tensorflow/stream_executor/cuda/cuda_diagnostics.cc
index 46e5deed8474dfa0c0ce6402bd6e5e2675491b31..10f6d21d54a15b46ebfc6f8ad32e3e908fab9a96 100644
--- a/tensorflow/stream_executor/cuda/cuda_diagnostics.cc
+++ b/tensorflow/stream_executor/cuda/cuda_diagnostics.cc
@@ -24,17 +24,12 @@ limitations under the License.
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
-#ifdef __APPLE__
-#include <IOKit/kext/KextManager.h>
-#include <mach-o/dyld.h>
-#else
 #if !defined(PLATFORM_WINDOWS)
 #include <link.h>
 #include <sys/sysmacros.h>
 #include <unistd.h>
 #endif
 #include <sys/stat.h>
-#endif
 #include <algorithm>
 #include <memory>
 #include <vector>
@@ -54,9 +49,7 @@ limitations under the License.
 namespace stream_executor {
 namespace cuda {
 
-#ifdef __APPLE__
-static const CFStringRef kDriverKextIdentifier = CFSTR("com.nvidia.CUDA");
-#elif !defined(PLATFORM_WINDOWS)
+#if !defined(PLATFORM_WINDOWS)
 static const char *kDriverVersionPath = "/proc/driver/nvidia/version";
 #endif
 
@@ -121,26 +114,7 @@ string Diagnostician::GetDevNodePath(int dev_node_ordinal) {
 }
 
 void Diagnostician::LogDiagnosticInformation() {
-#ifdef __APPLE__
-  CFStringRef kext_ids[1];
-  kext_ids[0] = kDriverKextIdentifier;
-  CFArrayRef kext_id_query = CFArrayCreate(nullptr, (const void**)kext_ids, 1, &kCFTypeArrayCallBacks);
-  CFDictionaryRef kext_infos = KextManagerCopyLoadedKextInfo(kext_id_query, nullptr);
-  CFRelease(kext_id_query);
-
-  CFDictionaryRef cuda_driver_info = nullptr;
-  if (CFDictionaryGetValueIfPresent(kext_infos, kDriverKextIdentifier, (const void**)&cuda_driver_info)) {
-    bool started = CFBooleanGetValue((CFBooleanRef)CFDictionaryGetValue(cuda_driver_info, CFSTR("OSBundleStarted")));
-    if (!started) {
-      LOG(INFO) << "kernel driver is installed, but does not appear to be running on this host "
-                << "(" << port::Hostname() << ")";
-    }
-  } else {
-    LOG(INFO) << "kernel driver does not appear to be installed on this host "
-              << "(" << port::Hostname() << ")";
-  }
-  CFRelease(kext_infos);
-#elif !defined(PLATFORM_WINDOWS)
+#if !defined(PLATFORM_WINDOWS)
   if (access(kDriverVersionPath, F_OK) != 0) {
     LOG(INFO) << "kernel driver does not appear to be running on this host "
               << "(" << port::Hostname() << "): "
@@ -194,8 +168,7 @@ void Diagnostician::LogDiagnosticInformation() {
 	  << DriverVersionStatusToString(kernel_version);
 #endif
 
-  // OS X kernel driver does not report version accurately
-#if !defined(__APPLE__) && !defined(PLATFORM_WINDOWS)
+#if !defined(PLATFORM_WINDOWS)
   if (kernel_version.ok() && dso_version.ok()) {
     WarnOnDsoKernelMismatch(dso_version, kernel_version);
   }
@@ -209,29 +182,6 @@ port::StatusOr<DriverVersion> Diagnostician::FindDsoVersion() {
       port::error::NOT_FOUND,
       "was unable to find libcuda.so DSO loaded into this program"));
 
-#if defined(__APPLE__)
-    // OSX CUDA libraries have names like: libcuda_310.41.15_mercury.dylib
-    const string prefix("libcuda_");
-    const string suffix("_mercury.dylib");
-    for (uint32_t image_index = 0; image_index < _dyld_image_count(); ++image_index) {
-      const string path(_dyld_get_image_name(image_index));
-      const size_t suffix_pos = path.rfind(suffix);
-      const size_t prefix_pos = path.rfind(prefix, suffix_pos);
-      if (prefix_pos == string::npos ||
-          suffix_pos == string::npos) {
-        // no match
-        continue;
-      }
-      const size_t start = prefix_pos + prefix.size();
-      if (start >= suffix_pos) {
-        // version not included
-        continue;
-      }
-      const size_t length = suffix_pos - start;
-      const string version = path.substr(start, length);
-      result = StringToDriverVersion(version);
-    }
-#else
 #if !defined(PLATFORM_WINDOWS) && !defined(ANDROID_TEGRA)
   // Callback used when iterating through DSOs. Looks for the driver-interfacing
   // DSO and yields its version number into the callback data, when found.
@@ -264,7 +214,6 @@ port::StatusOr<DriverVersion> Diagnostician::FindDsoVersion() {
   };
 
   dl_iterate_phdr(iterate_phdr, &result);
-#endif
 #endif
 
   return result;
@@ -310,38 +259,7 @@ void Diagnostician::WarnOnDsoKernelMismatch(
 
 
 port::StatusOr<DriverVersion> Diagnostician::FindKernelDriverVersion() {
-#if defined(__APPLE__)
-  CFStringRef kext_ids[1];
-  kext_ids[0] = kDriverKextIdentifier;
-  CFArrayRef kext_id_query = CFArrayCreate(nullptr, (const void**)kext_ids, 1, &kCFTypeArrayCallBacks);
-  CFDictionaryRef kext_infos = KextManagerCopyLoadedKextInfo(kext_id_query, nullptr);
-  CFRelease(kext_id_query);
-
-  CFDictionaryRef cuda_driver_info = nullptr;
-  if (CFDictionaryGetValueIfPresent(kext_infos, kDriverKextIdentifier, (const void**)&cuda_driver_info)) {
-    // NOTE: OSX CUDA driver does not currently store the same driver version
-    // in kCFBundleVersionKey as is returned by cuDriverGetVersion
-    CFRelease(kext_infos);
-    const CFStringRef str = (CFStringRef)CFDictionaryGetValue(
-        cuda_driver_info, kCFBundleVersionKey);
-    const char *version = CFStringGetCStringPtr(str, kCFStringEncodingUTF8);
-
-    // version can be NULL in which case treat it as empty string
-    // see
-    // https://developer.apple.com/library/mac/documentation/CoreFoundation/Conceptual/CFStrings/Articles/AccessingContents.html#//apple_ref/doc/uid/20001184-100980-TPXREF112
-    if (version == NULL) {
-      return StringToDriverVersion("");
-    }
-    return StringToDriverVersion(version);
-  }
-  CFRelease(kext_infos);
-  auto status = port::Status(
-      port::error::INTERNAL,
-      port::StrCat(
-          "failed to read driver bundle version: ",
-          CFStringGetCStringPtr(kDriverKextIdentifier, kCFStringEncodingUTF8)));
-  return status;
-#elif defined(PLATFORM_WINDOWS)
+#if defined(PLATFORM_WINDOWS)
   auto status =
       port::Status(port::error::UNIMPLEMENTED,
                    "kernel reported driver version not implemented on Windows");
diff --git a/tensorflow/stream_executor/cuda/cuda_dnn.cc b/tensorflow/stream_executor/cuda/cuda_dnn.cc
index 5ece80e5518d19e5ed596fe405d20b3360d2f0c0..d4f2fd262544a6a2771e17a712ae911c7249a7d5 100644
--- a/tensorflow/stream_executor/cuda/cuda_dnn.cc
+++ b/tensorflow/stream_executor/cuda/cuda_dnn.cc
@@ -17,6 +17,7 @@ limitations under the License.
 
 #include <functional>
 #include <memory>
+#include <utility>
 
 #include "third_party/eigen3/Eigen/Core"
 #include "tensorflow/core/lib/core/errors.h"
@@ -55,6 +56,33 @@ namespace {
 
 static_assert(CUDNN_VERSION >= 6000, "cuDNN needs to be version 6.0 or higher");
 
+// Exits the program if 'expr' doesn't return CUDNN_STATUS_SUCCESS.
+#define CHECK_CUDNN_OK(expr) CHECK_EQ(expr, CUDNN_STATUS_SUCCESS)
+
+// If 'expr' doesn't return CUDNN_STATUS_SUCCESS, returns from the current
+// function with a non-successful port::Status.
+#define RETURN_IF_CUDNN_ERROR(expr)                                      \
+  do {                                                                   \
+    cudnnStatus_t _status = expr;                                        \
+    if (!SE_PREDICT_TRUE(_status == CUDNN_STATUS_SUCCESS)) {             \
+      std::ostringstream oss;                                            \
+      oss << ToString(_status) << "\nin " << __FILE__ << "(" << __LINE__ \
+          << "): '" << #expr << "'";                                     \
+      return port::Status(port::error::UNKNOWN, oss.str().c_str());      \
+    }                                                                    \
+  } while (false)
+
+// Returns whether status is 'ok', and potentially logs the error.
+bool IsStatusOk(const port::Status& status, bool report_error) {
+  if (status.ok()) {
+    return true;
+  }
+  if (report_error) {
+    LOG(ERROR) << status.error_message();
+  }
+  return false;
+}
+
 // Converts (via narrowing) a type T value to a type U, and checks that the
 // value has no value change due to the conversion.
 template <typename WideT, typename NarrowT>
@@ -89,26 +117,20 @@ string ToString(cudnnStatus_t status) {
       return "CUDNN_STATUS_NOT_SUPPORTED";
     case CUDNN_STATUS_LICENSE_ERROR:
       return "CUDNN_STATUS_LICENSE_ERROR";
+    case CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING:
+      return "CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING";
+#if CUDNN_VERSION >= 7000
+    case CUDNN_STATUS_RUNTIME_IN_PROGRESS:
+      return "CUDNN_STATUS_RUNTIME_IN_PROGRESS";
+    case CUDNN_STATUS_RUNTIME_FP_OVERFLOW:
+      return "CUDNN_STATUS_RUNTIME_FP_OVERFLOW";
+#endif
     default:
       return port::StrCat("<unknown cudnn status: ", static_cast<int>(status),
                           ">");
   }
 }
 
-string ToString(libraryPropertyType type) {
-  switch (type) {
-    case MAJOR_VERSION:
-      return "MAJOR_VERSION";
-    case MINOR_VERSION:
-      return "MINOR_VERSION";
-    case PATCH_LEVEL:
-      return "PATCH_LEVEL";
-    default:
-      return port::StrCat(
-          "<unknown libraryPropertyType: ", static_cast<int>(type), ">");
-  }
-}
-
 template <typename T>
 cudnnDataType_t GetCudnnDataType();
 
@@ -150,9 +172,9 @@ class CudnnHandle {
 
 }  // namespace
 
-// Wraps a cuDNN handle and provides access to it through CudnnHandle instances,
-// which also locks a mutex, acquires the CUDA context, and sets the stream
-// that cuDNN should use to enqueue any work.
+// Wraps a cuDNN handle and provides access to it through CudnnHandle
+// instances, which also locks a mutex, acquires the CUDA context, and sets
+// the stream that cuDNN should use to enqueue any work.
 //
 // Note: CudnnSupport::cudnn_ should be the only instantiation of this class.
 class CudnnAccess {
@@ -167,13 +189,13 @@ class CudnnAccess {
 
   // Creates a CudnnHandle instance for stream.
   //
-  // cuDNN API calls using the same handle instance need to be serialized across
-  // threads. This is guaranteed by CudnnHandle instances locking the mutex
-  // owned by this class.
+  // cuDNN API calls using the same handle instance need to be serialized
+  // across threads. This is guaranteed by CudnnHandle instances locking the
+  // mutex owned by this class.
   //
   // Most cuDNN APIs taking a handle perform work on a CUDA stream. The
-  // CudnnHandle instance acquires the executor's CUDA context and sets cuDNN to
-  // use the provided stream.
+  // CudnnHandle instance acquires the executor's CUDA context and sets cuDNN
+  // to use the provided stream.
   //
   // The stream argument may be null, which translates to the legacy default
   // stream. See
@@ -187,7 +209,6 @@ class CudnnAccess {
     CUstream cu_stream = stream ? AsCUDAStreamValue(stream) : cudaStreamLegacy;
     auto status = cudnnSetStream(handle_, cu_stream);
     CHECK_EQ(status, CUDNN_STATUS_SUCCESS) << "Failed to set cuDNN stream.";
-    using my_mutex_lock = mutex_lock;
     return CudnnHandle(std::move(context), std::move(lock), handle_);
   }
 
@@ -201,6 +222,8 @@ class CudnnAccess {
 
 namespace {
 
+// A helper function to return the internal compute type for
+// RNNs in cudnn.
 cudnnDataType_t GetRnnComputeType(dnn::DataType data_type);
 
 cudnnConvolutionFwdAlgo_t ToConvForwardAlgo(dnn::AlgorithmDesc algorithm) {
@@ -264,16 +287,10 @@ cudnnConvolutionBwdFilterAlgo_t ToConvBackwardFilterAlgo(
   }
 }
 
-port::Status GetCudnnProperty(libraryPropertyType type, int* value) {
-  cudnnStatus_t status = cudnnGetProperty(type, value);
-  if (status != CUDNN_STATUS_SUCCESS) {
-    const string error =
-        port::StrCat("cudnnGetProperty failed for type: ", ToString(type),
-                     " with status: ", ToString(status));
-    LOG(ERROR) << error;
-    return port::Status(port::error::INTERNAL, error);
-  }
-  return port::Status::OK();
+port::StatusOr<int> GetCudnnProperty(libraryPropertyType type) {
+  int value;
+  RETURN_IF_CUDNN_ERROR(cudnnGetProperty(type, &value));
+  return value;
 }
 
 cudnnRNNAlgo_t ToCudnnRNNAlgo(const dnn::AlgorithmDesc& algorithm) {
@@ -294,9 +311,9 @@ cudnnRNNAlgo_t ToCudnnRNNAlgo(const dnn::AlgorithmDesc& algorithm) {
 }
 
 port::Status GetLoadedCudnnVersion(CudnnVersion* version) {
-  TF_RETURN_IF_ERROR(GetCudnnProperty(MAJOR_VERSION, &version->major_version));
-  TF_RETURN_IF_ERROR(GetCudnnProperty(MINOR_VERSION, &version->minor_version));
-  TF_RETURN_IF_ERROR(GetCudnnProperty(PATCH_LEVEL, &version->patch_level));
+  SE_ASSIGN_OR_RETURN(version->major_version, GetCudnnProperty(MAJOR_VERSION));
+  SE_ASSIGN_OR_RETURN(version->minor_version, GetCudnnProperty(MINOR_VERSION));
+  SE_ASSIGN_OR_RETURN(version->patch_level, GetCudnnProperty(PATCH_LEVEL));
   return port::Status::OK();
 }
 
@@ -319,9 +336,11 @@ port::Status CudnnSupport::Init() {
           ".  CuDNN library major and minor version needs to match or have "
           "higher minor version in case of CuDNN 7.0 or later version. If "
           "using a binary install, upgrade your CuDNN library.  If building "
-          "from sources, make sure the library loaded at runtime is compatible "
+          "from sources, make sure the library loaded at runtime is "
+          "compatible "
           "with the version specified during compile configuration.");
       LOG(ERROR) << error;
+      cudnnDestroy(cudnn_handle);
       return port::Status(port::error::INTERNAL, error);
     }
 
@@ -329,23 +348,17 @@ port::Status CudnnSupport::Init() {
     return port::Status::OK();
   }
 
-  LOG(ERROR) << "could not create cudnn handle: " << ToString(status);
+  CHECK_EQ(cudnn_handle, nullptr);
+  LOG(ERROR) << "Could not create cudnn handle: " << ToString(status);
   if (status == CUDNN_STATUS_NOT_INITIALIZED) {
     auto result = cuda::Diagnostician::FindKernelDriverVersion();
     if (!result.ok()) {
-      LOG(ERROR) << "error retrieving driver version: "
+      LOG(ERROR) << "Error retrieving driver version: "
                  << DriverVersionStatusToString(result);
     } else {
       const auto& version = result.ValueOrDie();
-      LOG(ERROR) << "possibly insufficient driver version: "
+      LOG(ERROR) << "Possibly insufficient driver version: "
                  << DriverVersionToString(version);
-      // OS X kernel driver does not report version accurately
-#if !defined(__APPLE__)
-      if (std::get<0>(version) < 340) {
-        LOG(ERROR)
-            << "cudnn library is only supported on 340.XX+ driver versions";
-      }
-#endif
     }
   }
 
@@ -364,18 +377,129 @@ CudnnSupport::GetVersion() {
 
 namespace {
 
-// Turns a BatchDescriptor structure into a cudnn tensor handle within a scope.
-class ScopedTensorDescriptor {
- public:
-  ScopedTensorDescriptor(const dnn::BatchDescriptor& batch_descriptor,
-                         cudnnDataType_t elem_type)
-      : handle_(nullptr) {
-    cudnnStatus_t status = cudnnCreateTensorDescriptor(&handle_);
-    if (status != CUDNN_STATUS_SUCCESS) {
-      LOG(FATAL) << "could not create cudnn tensor descriptor: "
-                 << ToString(status);
-    }
+// Deleter functors for cuDNN types that need to be deleted.
+struct TensorDescriptorDeleter {
+  void operator()(cudnnTensorDescriptor_t descriptor) const {
+    CHECK_CUDNN_OK(cudnnDestroyTensorDescriptor(descriptor));
+  }
+};
+struct FilterDescriptorDeleter {
+  void operator()(cudnnFilterDescriptor_t descriptor) const {
+    CHECK_CUDNN_OK(cudnnDestroyFilterDescriptor(descriptor));
+  }
+};
+struct ConvolutionDescriptorDeleter {
+  void operator()(cudnnConvolutionDescriptor_t descriptor) const {
+    CHECK_CUDNN_OK(cudnnDestroyConvolutionDescriptor(descriptor));
+  }
+};
+struct PoolingDescriptorDeleter {
+  void operator()(cudnnPoolingDescriptor_t descriptor) const {
+    CHECK_CUDNN_OK(cudnnDestroyPoolingDescriptor(descriptor));
+  }
+};
+struct LrnDescriptorDeleter {
+  void operator()(cudnnLRNDescriptor_t descriptor) const {
+    CHECK_CUDNN_OK(cudnnDestroyLRNDescriptor(descriptor));
+  }
+};
 
+struct ActivationDescriptorDeleter {
+  void operator()(cudnnActivationDescriptor_t descriptor) const {
+    CHECK_CUDNN_OK(cudnnDestroyActivationDescriptor(descriptor));
+  }
+};
+struct DropoutDescriptorDeleter {
+  void operator()(cudnnDropoutDescriptor_t descriptor) const {
+    CHECK_CUDNN_OK(cudnnDestroyDropoutDescriptor(descriptor));
+  }
+};
+struct RnnDescriptorDeleter {
+  void operator()(cudnnRNNDescriptor_t descriptor) const {
+    CHECK_CUDNN_OK(cudnnDestroyRNNDescriptor(descriptor));
+  }
+};
+struct PersistentRnnPlanDeleter {
+  void operator()(cudnnPersistentRNNPlan_t plan) const {
+    CHECK_CUDNN_OK(cudnnDestroyPersistentRNNPlan(plan));
+  }
+};
+
+// RAII wrappers for cuDNN types.
+using TensorDescriptor =
+    std::unique_ptr<cudnnTensorStruct, TensorDescriptorDeleter>;
+using FilterDescriptor =
+    std::unique_ptr<cudnnFilterStruct, FilterDescriptorDeleter>;
+using ConvolutionDescriptor =
+    std::unique_ptr<cudnnConvolutionStruct, ConvolutionDescriptorDeleter>;
+using PoolingDescriptor =
+    std::unique_ptr<cudnnPoolingStruct, PoolingDescriptorDeleter>;
+using LrnDescriptor = std::unique_ptr<cudnnLRNStruct, LrnDescriptorDeleter>;
+using ActivationDescriptor =
+    std::unique_ptr<cudnnActivationStruct, ActivationDescriptorDeleter>;
+using DropoutDescriptor =
+    std::unique_ptr<cudnnDropoutStruct, DropoutDescriptorDeleter>;
+using RnnDescriptor = std::unique_ptr<cudnnRNNStruct, RnnDescriptorDeleter>;
+using PersistentRnnPlan =
+    std::unique_ptr<cudnnPersistentRNNPlan, PersistentRnnPlanDeleter>;
+
+// Factory methods for cuDNN types.
+TensorDescriptor CreateTensorDescriptor() {
+  cudnnTensorDescriptor_t result;
+  CHECK_CUDNN_OK(cudnnCreateTensorDescriptor(&result));
+  return TensorDescriptor(result);
+}
+FilterDescriptor CreateFilterDescriptor() {
+  cudnnFilterDescriptor_t result;
+  CHECK_CUDNN_OK(cudnnCreateFilterDescriptor(&result));
+  return FilterDescriptor(result);
+}
+ConvolutionDescriptor CreateConvolutionDescriptor() {
+  cudnnConvolutionDescriptor_t result;
+  CHECK_CUDNN_OK(cudnnCreateConvolutionDescriptor(&result));
+  return ConvolutionDescriptor(result);
+}
+PoolingDescriptor CreatePoolingDescriptor() {
+  cudnnPoolingDescriptor_t result;
+  CHECK_CUDNN_OK(cudnnCreatePoolingDescriptor(&result));
+  return PoolingDescriptor(result);
+}
+LrnDescriptor CreateLrnDescriptor() {
+  cudnnLRNDescriptor_t result;
+  CHECK_CUDNN_OK(cudnnCreateLRNDescriptor(&result));
+  return LrnDescriptor(result);
+}
+ActivationDescriptor CreateActivationDescriptor() {
+  cudnnActivationDescriptor_t result;
+  CHECK_CUDNN_OK(cudnnCreateActivationDescriptor(&result));
+  return ActivationDescriptor(result);
+}
+DropoutDescriptor CreateDropoutDescriptor() {
+  cudnnDropoutDescriptor_t result;
+  CHECK_CUDNN_OK(cudnnCreateDropoutDescriptor(&result));
+  return DropoutDescriptor(result);
+}
+RnnDescriptor CreateRnnDescriptor() {
+  cudnnRNNDescriptor_t result;
+  CHECK_CUDNN_OK(cudnnCreateRNNDescriptor(&result));
+  return RnnDescriptor(result);
+}
+PersistentRnnPlan CreatePersistentRnnPlan(cudnnRNNDescriptor_t rnn_desc,
+                                          int batch_size,
+                                          cudnnDataType_t data_type) {
+  cudnnPersistentRNNPlan_t result;
+  CHECK_CUDNN_OK(
+      cudnnCreatePersistentRNNPlan(rnn_desc, batch_size, data_type, &result));
+  return PersistentRnnPlan(result);
+}
+
+// Turns a BatchDescriptor structure into a cudnn tensor handle within a
+// scope.
+class CudnnTensorDescriptor {
+ public:
+  CudnnTensorDescriptor(const dnn::BatchDescriptor& batch_descriptor,
+                        cudnnDataType_t elem_type)
+      : handle_(CreateTensorDescriptor()) {
     switch (batch_descriptor.layout()) {
       case dnn::DataLayout::kBatchYXDepth:
       case dnn::DataLayout::kBatchDepthYX: {
@@ -393,25 +517,16 @@ class ScopedTensorDescriptor {
                        &CheckedNarrowing<int64, int>);
         std::transform(dims64.cbegin(), dims64.cend(), dims.begin(),
                        &CheckedNarrowing<int64, int>);
-        status = cudnnSetTensorNdDescriptor(handle_, elem_type, nd, dims.data(),
-                                            strides.data());
-
-        if (status != CUDNN_STATUS_SUCCESS) {
-          LOG(FATAL) << "could not convert BatchDescriptor "
-                     << batch_descriptor.ToString()
-                     << " to cudnn tensor descriptor: " << ToString(status);
-        }
+        CHECK_CUDNN_OK(cudnnSetTensorNdDescriptor(handle_.get(), elem_type, nd,
+                                                  dims.data(), strides.data()))
+            << "batch_descriptor: " << batch_descriptor.ToString();
       } break;
       case dnn::DataLayout::kBatchDepthYX4: {
-        status = cudnnSetTensor4dDescriptor(
-            handle_, CUDNN_TENSOR_NCHW_VECT_C, elem_type,
+        CHECK_CUDNN_OK(cudnnSetTensor4dDescriptor(
+            handle_.get(), CUDNN_TENSOR_NCHW_VECT_C, elem_type,
             batch_descriptor.count(), batch_descriptor.feature_map_count(),
-            batch_descriptor.height(), batch_descriptor.width());
-        if (status != CUDNN_STATUS_SUCCESS) {
-          LOG(FATAL) << "could not convert BatchDescriptor "
-                     << batch_descriptor.ToString()
-                     << " to cudnn tensor descriptor: " << ToString(status);
-        }
+            batch_descriptor.height(), batch_descriptor.width()))
+            << "batch_descriptor: " << batch_descriptor.ToString();
       } break;
       default:
         LOG(FATAL) << "Unsupported tensor format "
@@ -420,43 +535,33 @@ class ScopedTensorDescriptor {
     }
   }
 
-  ~ScopedTensorDescriptor() {
-    cudnnStatus_t status = cudnnDestroyTensorDescriptor(handle_);
-    if (status != CUDNN_STATUS_SUCCESS) {
-      LOG(ERROR) << "could not destroy cudnn tensor descriptor: "
-                 << ToString(status);
-    }
-  }
-
-  cudnnTensorDescriptor_t handle() const { return handle_; }
+  cudnnTensorDescriptor_t handle() const { return handle_.get(); }
 
  private:
-  cudnnTensorDescriptor_t handle_;  // Owned.
+  TensorDescriptor handle_;
 
-  SE_DISALLOW_COPY_AND_ASSIGN(ScopedTensorDescriptor);
+  SE_DISALLOW_COPY_AND_ASSIGN(CudnnTensorDescriptor);
 };
 
-// Turns a FilterDescriptor structure into a cudnn filter handle within a scope.
-class ScopedFilterDescriptor {
+// Turns a FilterDescriptor structure into a cudnn filter handle within a
+// scope.
+class CudnnFilterDescriptor {
  public:
-  ScopedFilterDescriptor(const dnn::FilterDescriptor& filter_descriptor,
-                         cudnnDataType_t elem_type)
-      : handle_(nullptr) {
-    cudnnStatus_t status = cudnnCreateFilterDescriptor(&handle_);
-    if (status != CUDNN_STATUS_SUCCESS) {
-      LOG(FATAL) << "could not create cudnn filter descriptor: "
-                 << ToString(status);
-    }
-
+  CudnnFilterDescriptor(const dnn::FilterDescriptor& filter_descriptor,
+                        cudnnDataType_t elem_type)
+      : handle_(CreateFilterDescriptor()) {
     // TODO(b/23032134): Even if the filter layout is not supported,
-    // cudnnSetFilter4DDescriptor_v4 will return CUDNN_STATUS_SUCCESS because it
-    // does not take layout as an input. Maybe force cuDNN by giving wrong
+    // cudnnSetFilter4DDescriptor_v4 will return CUDNN_STATUS_SUCCESS because
+    // it does not take layout as an input. Maybe force cuDNN by giving wrong
     // inputs intentionally?
     cudnnTensorFormat_t format;
     switch (filter_descriptor.layout()) {
       case dnn::FilterLayout::kOutputInputYX:
         format = CUDNN_TENSOR_NCHW;
         break;
+      case dnn::FilterLayout::kOutputYXInput:
+        format = CUDNN_TENSOR_NHWC;
+        break;
       case dnn::FilterLayout::kOutputInputYX4:
         format = CUDNN_TENSOR_NCHW_VECT_C;
         break;
@@ -472,32 +577,20 @@ class ScopedFilterDescriptor {
     const auto& spatial_dims = filter_descriptor.input_filter_dims();
     std::copy(spatial_dims.begin(), spatial_dims.end(), dims.begin() + 2);
 
-    status = cudnnSetFilterNdDescriptor(handle_, elem_type, format, dims.size(),
-                                        dims.data());
-    if (status != CUDNN_STATUS_SUCCESS) {
-      LOG(FATAL) << "could not set cudnn filter descriptor: "
-                 << ToString(status);
-    }
+    CHECK_CUDNN_OK(cudnnSetFilterNdDescriptor(handle_.get(), elem_type, format,
+                                              dims.size(), dims.data()));
   }
 
-  ~ScopedFilterDescriptor() {
-    cudnnStatus_t status = cudnnDestroyFilterDescriptor(handle_);
-    if (status != CUDNN_STATUS_SUCCESS) {
-      LOG(ERROR) << "could not destroy cudnn filter descriptor: "
-                 << ToString(status);
-    }
-  }
-
-  cudnnFilterDescriptor_t handle() const { return handle_; }
+  cudnnFilterDescriptor_t handle() const { return handle_.get(); }
 
  private:
-  cudnnFilterDescriptor_t handle_;  // Owned.
+  FilterDescriptor handle_;  // Owned.
 
-  SE_DISALLOW_COPY_AND_ASSIGN(ScopedFilterDescriptor);
+  SE_DISALLOW_COPY_AND_ASSIGN(CudnnFilterDescriptor);
 };
 
 // A helper function to decide whether to enable the TENSOR_OP_MATH math type
-static bool TensorOpMathEnabled() {
+bool TensorOpMathEnabled() {
   static bool is_enabled = [] {
     bool is_disabled = false;
     TF_CHECK_OK(
@@ -510,7 +603,7 @@ static bool TensorOpMathEnabled() {
 
 // A helper function to decide whether to enable the TENSOR_OP_MATH math type
 // for RNNs.
-static bool RnnTensorOpMathEnabled() {
+bool RnnTensorOpMathEnabled() {
   static bool is_enabled = [] {
     bool is_disabled = false;
     TF_CHECK_OK(
@@ -521,15 +614,16 @@ static bool RnnTensorOpMathEnabled() {
   return is_enabled;
 }
 
-// A helper function to decide whether to use CUDNN_BATCHNORM_SPATIAL_PERSISTENT
-// in batchnorm. This mode can be faster in some tasks because an optimized path
-// may be selected for CUDNN_DATA_FLOAT and CUDNN_DATA_HALF data types, compute
-// capability 6.0 or higher. The reason we set it to false by default is that
-// this mode may use scaled atomic integer reduction that may cause a numerical
-// overflow for certain input data range.
+// A helper function to decide whether to use
+// CUDNN_BATCHNORM_SPATIAL_PERSISTENT in batchnorm. This mode can be faster in
+// some tasks because an optimized path may be selected for CUDNN_DATA_FLOAT
+// and CUDNN_DATA_HALF data types, compute capability 6.0 or higher. The
+// reason we set it to false by default is that this mode may use scaled
+// atomic integer reduction that may cause a numerical overflow for certain
+// input data range.
 // TODO(yangzihao): Use autotune to choose between this mode and
 // CUDNN_BATCHNORM_SPATIAL mode.
-static bool BatchnormSpatialPersistentEnabled() {
+bool BatchnormSpatialPersistentEnabled() {
   static bool is_enabled = [] {
     bool is_enabled = false;
     TF_CHECK_OK(tensorflow::ReadBoolFromEnvVar(
@@ -542,24 +636,18 @@ static bool BatchnormSpatialPersistentEnabled() {
 
 // Turns a ConvolutionDescriptor structure into a cudnn convolution handle
 // within a scope.
-class ScopedConvolutionDescriptor {
+class CudnnConvolutionDescriptor {
  public:
-  ScopedConvolutionDescriptor(
+  CudnnConvolutionDescriptor(
       const dnn::ConvolutionDescriptor& convolution_descriptor,
       cudnnDataType_t data_type)
-      : handle_(nullptr) {
-    cudnnStatus_t status = cudnnCreateConvolutionDescriptor(&handle_);
-    if (status != CUDNN_STATUS_SUCCESS) {
-      LOG(FATAL) << "could not create cudnn convolution descriptor: "
-                 << ToString(status);
-    }
+      : handle_(CreateConvolutionDescriptor()) {
     const auto& strides64 = convolution_descriptor.strides();
     const auto& padding64 = convolution_descriptor.padding();
     const auto& dilations64 = convolution_descriptor.dilations();
-    if (convolution_descriptor.pad_alignment() ==
-        dnn::PadAlignment::kTensorFlowPadding) {
-      LOG(ERROR) << "TensorFlow padding alignment is not supported.";
-    }
+    CHECK_NE(convolution_descriptor.pad_alignment(),
+             dnn::PadAlignment::kTensorFlowPadding)
+        << "TensorFlow padding alignment is not supported.";
 
     // cuDNN requires arrays of ints.
     std::vector<int> strides(convolution_descriptor.ndims());
@@ -574,18 +662,14 @@ class ScopedConvolutionDescriptor {
     std::transform(dilations64.cbegin(), dilations64.cend(), dilations.begin(),
                    &CheckedNarrowing<int64, int>);
 
-    status = cudnnSetConvolutionNdDescriptor(
-        handle_, convolution_descriptor.ndims(), padding.data(), strides.data(),
-        dilations.data(),
+    CHECK_CUDNN_OK(cudnnSetConvolutionNdDescriptor(
+        handle_.get(), convolution_descriptor.ndims(), padding.data(),
+        strides.data(), dilations.data(),
         // NOTE(keveman): cuDNN supports convolution and cross correlation.
         // However, almost all the use cases do cross correlation, so just
         // hard coding it here.
-        CUDNN_CROSS_CORRELATION, data_type);
+        CUDNN_CROSS_CORRELATION, data_type));
 
-    if (status != CUDNN_STATUS_SUCCESS) {
-      LOG(FATAL) << "could not set cudnn convolution descriptor: "
-                 << ToString(status);
-    }
     // NOTE(benbarsdell): This only applies if tensor op math is enabled
     //                      and algo selection is set to Default.
     this->set_use_tensor_op_math(true);
@@ -593,60 +677,39 @@ class ScopedConvolutionDescriptor {
 #if CUDNN_MAJOR >= 7
     VLOG(2) << "Requesting grouped convolution: "
             << convolution_descriptor.group_count();
-    status = cudnnSetConvolutionGroupCount(
-        handle_, convolution_descriptor.group_count());
-    if (status != CUDNN_STATUS_SUCCESS) {
-      LOG(FATAL) << "could not set cudnn convolution group count: "
-                 << ToString(status);
-    }
+    CHECK_CUDNN_OK(cudnnSetConvolutionGroupCount(
+        handle_.get(), convolution_descriptor.group_count()));
 #else
     CHECK_EQ(convolution_descriptor.group_count(), 1)
         << "Requested grouped convolution for cuDNN version < 7";
 #endif
   }
 
-  void set_use_tensor_op_math(bool use_tensor_op_math) {
+  void set_use_tensor_op_math(bool use_tensor_op_math) const {
 #if CUDNN_VERSION >= 7000
     cudnnMathType_t math_type =
         (use_tensor_op_math ? CUDNN_TENSOR_OP_MATH : CUDNN_DEFAULT_MATH);
     if (TensorOpMathEnabled()) {
-      cudnnStatus_t status = cudnnSetConvolutionMathType(handle_, math_type);
-      if (status != CUDNN_STATUS_SUCCESS) {
-        LOG(FATAL) << "could not set cudnn convolution math type: "
-                   << ToString(status);
-      }
+      CHECK_CUDNN_OK(cudnnSetConvolutionMathType(handle_.get(), math_type));
     }
 #endif
   }
 
-  ~ScopedConvolutionDescriptor() {
-    cudnnStatus_t status = cudnnDestroyConvolutionDescriptor(handle_);
-    if (status != CUDNN_STATUS_SUCCESS) {
-      LOG(ERROR) << "could not destroy cudnn convolution descriptor: "
-                 << ToString(status);
-    }
-  }
-
-  cudnnConvolutionDescriptor_t handle() const { return handle_; }
+  cudnnConvolutionDescriptor_t handle() const { return handle_.get(); }
 
  private:
-  cudnnConvolutionDescriptor_t handle_;  // Owned.
+  ConvolutionDescriptor handle_;  // Owned.
 
-  SE_DISALLOW_COPY_AND_ASSIGN(ScopedConvolutionDescriptor);
+  SE_DISALLOW_COPY_AND_ASSIGN(CudnnConvolutionDescriptor);
 };
 
 // Turns a PoolingDescriptor structure into a cudnn pooling descriptor handle
 // within a scope.
-class ScopedPoolingDescriptor {
+class CudnnPoolingDescriptor {
  public:
-  explicit ScopedPoolingDescriptor(
+  explicit CudnnPoolingDescriptor(
       const dnn::PoolingDescriptor& pooling_descriptor)
-      : handle_(nullptr) {
-    cudnnStatus_t status = cudnnCreatePoolingDescriptor(&handle_);
-    if (status != CUDNN_STATUS_SUCCESS) {
-      LOG(FATAL) << "could not create cudnn pooling descriptor: "
-                 << ToString(status);
-    }
+      : handle_(CreatePoolingDescriptor()) {
     const std::vector<int64> strides64 = pooling_descriptor.strides();
     const std::vector<int64> padding64 = pooling_descriptor.padding();
     const std::vector<int64> shape64 = pooling_descriptor.window();
@@ -662,46 +725,29 @@ class ScopedPoolingDescriptor {
     std::transform(shape64.cbegin(), shape64.cend(), shape.begin(),
                    &CheckedNarrowing<int64, int>);
     bool propagate_nans = pooling_descriptor.propagate_nans();
-    status = cudnnSetPoolingNdDescriptor(
-        handle_,
+    CHECK_CUDNN_OK(cudnnSetPoolingNdDescriptor(
+        handle_.get(),
         (pooling_descriptor.mode() == dnn::PoolingMode::kMaximum
              ? CUDNN_POOLING_MAX
              : CUDNN_POOLING_AVERAGE_COUNT_EXCLUDE_PADDING),
         propagate_nans ? CUDNN_PROPAGATE_NAN : CUDNN_NOT_PROPAGATE_NAN, nd,
-        shape.data(), padding.data(), strides.data());
-    if (status != CUDNN_STATUS_SUCCESS) {
-      LOG(FATAL) << "could not set cudnn pooling descriptor: "
-                 << ToString(status);
-    }
-  }
-  ~ScopedPoolingDescriptor() {
-    cudnnStatus_t status = cudnnDestroyPoolingDescriptor(handle_);
-    if (status != CUDNN_STATUS_SUCCESS) {
-      LOG(ERROR) << "could not destroy cudnn pooling descriptor: "
-                 << ToString(status);
-    }
+        shape.data(), padding.data(), strides.data()));
   }
 
-  cudnnPoolingDescriptor_t handle() const { return handle_; }
+  cudnnPoolingDescriptor_t handle() const { return handle_.get(); }
 
  private:
-  cudnnPoolingDescriptor_t handle_;  // Owned.
+  PoolingDescriptor handle_;  // Owned.
 
-  SE_DISALLOW_COPY_AND_ASSIGN(ScopedPoolingDescriptor);
+  SE_DISALLOW_COPY_AND_ASSIGN(CudnnPoolingDescriptor);
 };
 
 // Turns a NormalizeDescriptor structure into a cudnn LRN descriptor handle.
-class ScopedNormalizeDescriptor {
+class CudnnNormalizeDescriptor {
  public:
-  explicit ScopedNormalizeDescriptor(
+  explicit CudnnNormalizeDescriptor(
       const dnn::NormalizeDescriptor& normalize_descriptor)
-      : handle_(nullptr) {
-    cudnnStatus_t status = cudnnCreateLRNDescriptor(&handle_);
-    if (status != CUDNN_STATUS_SUCCESS) {
-      LOG(FATAL) << "could not create cudnn LRN descriptor: "
-                 << ToString(status);
-    }
-
+      : handle_(CreateLrnDescriptor()) {
     // The range specifies that the indices in the closed range
     // [i - range, i + range] should be included in the normalization for index
     // i. The lrnN value is the total number of elements in the range, so
@@ -722,42 +768,26 @@ class ScopedNormalizeDescriptor {
 
     double lrnBeta = normalize_descriptor.beta();
     double lrnK = normalize_descriptor.bias();
-    status = cudnnSetLRNDescriptor(handle_, lrnN, lrnAlpha, lrnBeta, lrnK);
-    if (status != CUDNN_STATUS_SUCCESS) {
-      LOG(FATAL) << "could not set cudnn LRN descriptor: " << ToString(status);
-    }
+    CHECK_CUDNN_OK(
+        cudnnSetLRNDescriptor(handle_.get(), lrnN, lrnAlpha, lrnBeta, lrnK));
   }
 
-  ~ScopedNormalizeDescriptor() {
-    cudnnStatus_t status = cudnnDestroyLRNDescriptor(handle_);
-    if (status != CUDNN_STATUS_SUCCESS) {
-      LOG(ERROR) << "could not destroy cudnn LRN descriptor: "
-                 << ToString(status);
-    }
-  }
-
-  cudnnLRNDescriptor_t handle() const { return handle_; }
+  cudnnLRNDescriptor_t handle() const { return handle_.get(); }
 
  private:
-  cudnnLRNDescriptor_t handle_;  // Owned.
+  LrnDescriptor handle_;  // Owned.
 
-  SE_DISALLOW_COPY_AND_ASSIGN(ScopedNormalizeDescriptor);
+  SE_DISALLOW_COPY_AND_ASSIGN(CudnnNormalizeDescriptor);
 };
 
 // Turns a ActivationDescriptor structure into a cudnn activation
 // descriptor handle within a scope.
-class ScopedActivationDescriptor {
+class CudnnActivationDescriptor {
  public:
-  ScopedActivationDescriptor(dnn::ActivationMode activation_mode,
-                             cudnnNanPropagation_t nan_propagation,
-                             double value_max)
-      : handle_(nullptr) {
-    cudnnStatus_t status = cudnnCreateActivationDescriptor(&handle_);
-    if (status != CUDNN_STATUS_SUCCESS) {
-      LOG(FATAL) << "could not create cudnn activation descriptor: "
-                 << ToString(status);
-    }
-
+  CudnnActivationDescriptor(dnn::ActivationMode activation_mode,
+                            cudnnNanPropagation_t nan_propagation,
+                            double value_max)
+      : handle_(CreateActivationDescriptor()) {
     double relu_ceiling = 0.0;
     cudnnActivationMode_t mode;
     switch (activation_mode) {
@@ -783,28 +813,16 @@ class ScopedActivationDescriptor {
                    << static_cast<int>(activation_mode);
     }
 
-    status = cudnnSetActivationDescriptor(handle_, mode, nan_propagation,
-                                          relu_ceiling);
-    if (status != CUDNN_STATUS_SUCCESS) {
-      LOG(FATAL) << "could not set cudnn activation descriptor: "
-                 << ToString(status);
-    }
+    CHECK_CUDNN_OK(cudnnSetActivationDescriptor(handle_.get(), mode,
+                                                nan_propagation, relu_ceiling));
   }
 
-  ~ScopedActivationDescriptor() {
-    cudnnStatus_t status = cudnnDestroyActivationDescriptor(handle_);
-    if (status != CUDNN_STATUS_SUCCESS) {
-      LOG(ERROR) << "could not destroy cudnn activation descriptor: "
-                 << ToString(status);
-    }
-  }
-
-  cudnnActivationDescriptor_t handle() const { return handle_; }
+  cudnnActivationDescriptor_t handle() const { return handle_.get(); }
 
  private:
-  cudnnActivationDescriptor_t handle_;  // Owned.
+  ActivationDescriptor handle_;  // Owned.
 
-  SE_DISALLOW_COPY_AND_ASSIGN(ScopedActivationDescriptor);
+  SE_DISALLOW_COPY_AND_ASSIGN(CudnnActivationDescriptor);
 };
 
 cudnnDataType_t ToCudnnDataType(
@@ -870,117 +888,74 @@ int CudnnDataTypeToByteSize(cudnnDataType_t data_type) {
   }
 }
 
-template <typename Base>
-class MixinBase : public Base {};
-template <>
-class MixinBase<void> {};
-
-#define CUDNN_RETURN_IF_FAIL(STATUS, ...)                                \
-  if (!SE_PREDICT_TRUE((STATUS) == CUDNN_STATUS_SUCCESS)) {              \
-    string error_msg = port::StrCat(ToString(STATUS), " ", __VA_ARGS__); \
-    SetFailure(port::Status(port::error::UNKNOWN, error_msg));           \
-    LOG(ERROR) << error_msg;                                             \
-    return;                                                              \
-  }
+class CudnnDropoutDescriptor {
+  explicit CudnnDropoutDescriptor(DropoutDescriptor handle)
+      : handle_(std::move(handle)) {}
 
-// TODO(csigg): Remove inheritance for code reuse.
-template <typename Base>
-class CudnnDescriptorCommon : public MixinBase<Base> {
  public:
-  bool ok() const { return status_.ok(); }
-  port::Status Status() const { return status_; }
+  CudnnDropoutDescriptor(CudnnDropoutDescriptor&&) = default;
 
- protected:
-  void SetFailure(const port::Status& status) { status_.Update(status); }
-  port::Status status_;
-};
+  static port::StatusOr<CudnnDropoutDescriptor> Create(
+      const CudnnHandle& cudnn, float dropout, uint64 seed,
+      ScratchAllocator* state_allocator) {
+    DropoutDescriptor handle = CreateDropoutDescriptor();
 
-class CudnnDropoutDescriptor : public CudnnDescriptorCommon<void> {
- public:
-  CudnnDropoutDescriptor(const CudnnHandle& cudnn, float dropout, uint64 seed,
-                         ScratchAllocator* state_allocator)
-      : handle_(nullptr) {
-    cudnnStatus_t status;
-    status = cudnnCreateDropoutDescriptor(&handle_);
-    CUDNN_RETURN_IF_FAIL(status, "Failed to create dropout descriptor");
-
-    if (dropout == 0.f) {
-      return;
+    if (dropout == 0.0f) {
+      // Return 'empty' dropout descriptor.
+      return CudnnDropoutDescriptor(std::move(handle));
     }
 
     DeviceMemory<uint8> state_memory;
     if (state_allocator) {
       size_t state_sizes_in_bytes = 0;
-      status = cudnnDropoutGetStatesSize(cudnn.handle(), &state_sizes_in_bytes);
-      CUDNN_RETURN_IF_FAIL(status, "Failed to query dropout state sizes");
-
-      auto allocated =
-          state_allocator->AllocateBytes(nullptr, state_sizes_in_bytes);
-      if (!allocated.ok() ||
-          (state_memory = allocated.ValueOrDie()) == nullptr) {
-        string error_msg =
-            port::StrCat("Failed to allocate Cudnn dropout state memory of ",
-                         state_sizes_in_bytes, " bytes.");
-        status_ = port::Status(port::error::UNKNOWN, error_msg);
-        LOG(ERROR) << error_msg;
-        return;
-      }
+      RETURN_IF_CUDNN_ERROR(
+          cudnnDropoutGetStatesSize(cudnn.handle(), &state_sizes_in_bytes));
+      SE_ASSIGN_OR_RETURN(state_memory, state_allocator->AllocateBytes(
+                                            nullptr, state_sizes_in_bytes));
     }
-    status = cudnnSetDropoutDescriptor(handle_, cudnn.handle(), dropout,
-                                       state_memory.opaque(),
-                                       state_memory.size(), seed);
-    CUDNN_RETURN_IF_FAIL(
-        status, port::StrCat(
-                    "Failed to set dropout descriptor with state memory size: ",
-                    state_memory.size(), " bytes."));
-  }
+    RETURN_IF_CUDNN_ERROR(cudnnSetDropoutDescriptor(
+        handle.get(), cudnn.handle(), dropout, state_memory.opaque(),
+        state_memory.size(), seed));
 
-  ~CudnnDropoutDescriptor() {
-    cudnnStatus_t status = cudnnDestroyDropoutDescriptor(handle_);
-    // TODO(csigg): This is a no-op (error is not reported). Same below.
-    CUDNN_RETURN_IF_FAIL(status, "Failed to destroy Cudnn dropout handle: ");
+    return CudnnDropoutDescriptor(std::move(handle));
   }
 
-  cudnnDropoutDescriptor_t handle() const {
-    if (!ok()) return nullptr;
-    return handle_;
-  }
+  cudnnDropoutDescriptor_t handle() const { return handle_.get(); }
 
  private:
-  cudnnDropoutDescriptor_t handle_;  // Owned.
-  float dropout_;
-  uint64 seed_;
+  DropoutDescriptor handle_;  // Owned.
   SE_DISALLOW_COPY_AND_ASSIGN(CudnnDropoutDescriptor);
 };
 
-class CudnnRnnParamsDescriptor : public CudnnDescriptorCommon<void> {
- public:
-  typedef dnn::RnnDescriptor::ParamsRegion ParamsRegion;
+class CudnnRnnParamsDescriptor {
   typedef dnn::RnnDescriptor::ParamsRegions ParamsRegions;
-  CudnnRnnParamsDescriptor(const CudnnHandle& cudnn,
-                           const CudnnRnnDescriptor& rnn_desc);
-  ~CudnnRnnParamsDescriptor() {
-    cudnnStatus_t status = cudnnDestroyFilterDescriptor(handle_);
-    CUDNN_RETURN_IF_FAIL(status, "Failed to destroy RNN filter descriptor");
-  }
-  cudnnFilterDescriptor_t handle() const {
-    if (!ok()) return nullptr;
-    return handle_;
-  }
+
+  CudnnRnnParamsDescriptor(FilterDescriptor handle, int64 params_size_in_bytes,
+                           ParamsRegions weights, ParamsRegions biases)
+      : handle_(std::move(handle)),
+        params_size_in_bytes_(params_size_in_bytes),
+        weights_(std::move(weights)),
+        biases_(std::move(biases)) {}
+
+ public:
+  CudnnRnnParamsDescriptor(CudnnRnnParamsDescriptor&&) = default;
+
+  static port::StatusOr<CudnnRnnParamsDescriptor> Create(
+      const CudnnHandle& cudnn, int input_size, cudnnDataType_t data_type,
+      cudnnRNNDescriptor_t rnn_desc, cudnnRNNMode_t rnn_mode,
+      cudnnDirectionMode_t direction_mode, int num_layers);
+
+  cudnnFilterDescriptor_t handle() const { return handle_.get(); }
   int64 params_size_in_bytes() const { return params_size_in_bytes_; }
   ParamsRegions params_weights() const {
-    if (!ok()) return ParamsRegions();
     return weights_;
   }
   ParamsRegions params_biases() const {
-    if (!ok()) return ParamsRegions();
     return biases_;
   }
 
  private:
-  int GetRegionCountPerLayer() const;
-  cudnnFilterDescriptor_t handle_;
-  const CudnnRnnDescriptor* rnn_desc_;
+  FilterDescriptor handle_;
   int64 params_size_in_bytes_;
   ParamsRegions weights_;
   ParamsRegions biases_;
@@ -989,97 +964,98 @@ class CudnnRnnParamsDescriptor : public CudnnDescriptorCommon<void> {
 
 }  // namespace
 
-class CudnnRnnDescriptor : public CudnnDescriptorCommon<dnn::RnnDescriptor> {
- public:
-  CudnnRnnDescriptor(const CudnnHandle& cudnn, int num_layers, int hidden_size,
-                     int input_size, int batch_size,
+class CudnnRnnDescriptor : public dnn::RnnDescriptor {
+  CudnnRnnDescriptor(const CudnnHandle& cudnn, cuda::RnnDescriptor rnn_desc,
+                     PersistentRnnPlan rnn_plan, int num_layers,
+                     int hidden_size, int input_size, int batch_size,
                      cudnnRNNInputMode_t input_mode,
                      cudnnDirectionMode_t direction_mode,
                      cudnnRNNMode_t rnn_mode, cudnnDataType_t data_type,
                      cudnnDataType_t compute_type,
                      const dnn::AlgorithmConfig& algorithm_config,
-                     float dropout, uint64 seed,
-                     ScratchAllocator* state_allocator)
-      : rnn_desc_(nullptr),
+                     CudnnDropoutDescriptor dropout_desc,
+                     CudnnRnnParamsDescriptor params_desc)
+      : rnn_desc_(std::move(rnn_desc)),
+        rnn_plan_(std::move(rnn_plan)),
         num_layers_(num_layers),
         hidden_size_(hidden_size),
         input_size_(input_size),
         batch_size_(batch_size),
-        rnn_plan_(nullptr),
+        rnn_algo_(ToCudnnRNNAlgo(algorithm_config.algorithm())),
         input_mode_(input_mode),
         direction_mode_(direction_mode),
         rnn_mode_(rnn_mode),
         data_type_(data_type),
         compute_type_(compute_type),
-        algorithm_config_(algorithm_config) {
-    // Create the dropout handle.
-    cudnn_dropout_desc_.reset(
-        new CudnnDropoutDescriptor(cudnn, dropout, seed, state_allocator));
-    if (!cudnn_dropout_desc_->ok()) {
-      SetFailure(cudnn_dropout_desc_->Status());
-      return;
-    }
+        algorithm_config_(algorithm_config),
+        dropout_desc_(std::move(dropout_desc)),
+        params_desc_(std::move(params_desc)) {}
+
+ public:
+  CudnnRnnDescriptor(CudnnRnnDescriptor&& other) = default;
+
+  static port::StatusOr<CudnnRnnDescriptor> Create(
+      const CudnnHandle& cudnn, int num_layers, int hidden_size, int input_size,
+      int batch_size, cudnnRNNInputMode_t input_mode,
+      cudnnDirectionMode_t direction_mode, cudnnRNNMode_t rnn_mode,
+      cudnnDataType_t data_type, cudnnDataType_t compute_type,
+      const dnn::AlgorithmConfig& algorithm_config, float dropout, uint64 seed,
+      ScratchAllocator* state_allocator) {
+    SE_ASSIGN_OR_RETURN(
+        CudnnDropoutDescriptor dropout_desc,
+        CudnnDropoutDescriptor::Create(cudnn, dropout, seed, state_allocator));
+
+    cuda::RnnDescriptor rnn_desc = CreateRnnDescriptor();
+    cudnnRNNAlgo_t rnn_algo = ToCudnnRNNAlgo(algorithm_config.algorithm());
 
-    // Create the RNN handle
-    cudnnStatus_t status = cudnnCreateRNNDescriptor(&rnn_desc_);
-    CUDNN_RETURN_IF_FAIL(status, "Unable to create RNN descriptor");
     // TODO: allow the user to choose an algorithm.
-    rnn_algo_ = ToCudnnRNNAlgo(algorithm_config_.algorithm());
-    status = cudnnSetRNNDescriptor_v6(
-        cudnn.handle(), /*rnnDesc=*/rnn_desc_, /*hiddenSize=*/hidden_size,
-        /*numLayers=*/num_layers, /*dropoutDesc=*/dropout_handle(),
+    RETURN_IF_CUDNN_ERROR(cudnnSetRNNDescriptor_v6(
+        cudnn.handle(), /*rnnDesc=*/rnn_desc.get(), /*hiddenSize=*/hidden_size,
+        /*numLayers=*/num_layers, /*dropoutDesc=*/dropout_desc.handle(),
         /*inputMode=*/input_mode, /*direction=*/direction_mode,
-        /*mode=*/rnn_mode, /*algo=*/rnn_algo_, /*dataType=*/compute_type);
-    CUDNN_RETURN_IF_FAIL(status, ::tensorflow::strings::Printf(
-                                     "Unable to update RNN descriptor with "
-                                     "algo_id: %d and compute_type: %d",
-                                     static_cast<int>(rnn_algo_),
-                                     static_cast<int>(compute_type)));
-
-    if (rnn_algo_ == CUDNN_RNN_ALGO_PERSIST_DYNAMIC) {
-      CHECK_GE(batch_size_, 0);
-      status = cudnnCreatePersistentRNNPlan(rnn_desc_, batch_size_, data_type_,
-                                            &rnn_plan_);
-      CUDNN_RETURN_IF_FAIL(status, "Unable to create persistent RNN plan.");
-      status = cudnnSetPersistentRNNPlan(rnn_desc_, rnn_plan_);
-      CUDNN_RETURN_IF_FAIL(status, "Unable to update persistent RNN plan.");
+        /*mode=*/rnn_mode, /*algo=*/rnn_algo,
+        /*dataType=*/compute_type));
+
+    PersistentRnnPlan rnn_plan;
+    if (rnn_algo == CUDNN_RNN_ALGO_PERSIST_DYNAMIC) {
+      CHECK_GE(batch_size, 0);
+      rnn_plan = CreatePersistentRnnPlan(rnn_desc.get(), batch_size, data_type);
+      RETURN_IF_CUDNN_ERROR(
+          cudnnSetPersistentRNNPlan(rnn_desc.get(), rnn_plan.get()));
     }
 
     // Create the params handle.
-    cudnn_params_desc_.reset(new CudnnRnnParamsDescriptor(cudnn, *this));
-    if (!cudnn_params_desc_->ok()) {
-      SetFailure(cudnn_params_desc_->Status());
-      return;
-    }
-    set_use_tensor_op_math(algorithm_config_.algorithm().tensor_ops_enabled());
-  }
-  ~CudnnRnnDescriptor() override {
-    if (rnn_desc_) {
-      cudnnStatus_t status;
-      if (rnn_algo_ == CUDNN_RNN_ALGO_PERSIST_DYNAMIC && rnn_plan_) {
-        status = cudnnDestroyPersistentRNNPlan(rnn_plan_);
-        CUDNN_RETURN_IF_FAIL(status, "Unable to destroy persistent RNN plan.");
-      }
-      status = cudnnDestroyRNNDescriptor(rnn_desc_);
-      CUDNN_RETURN_IF_FAIL(status, "Unable to destroy RNN descriptor");
-    }
-  }
-  void set_use_tensor_op_math(bool use_tensor_op_math) {
+    SE_ASSIGN_OR_RETURN(auto params_desc,
+                        CudnnRnnParamsDescriptor::Create(
+                            cudnn, input_size, data_type, rnn_desc.get(),
+                            rnn_mode, direction_mode, num_layers));
+
 #if CUDNN_VERSION >= 7000
-    cudnnMathType_t math_type =
-        (use_tensor_op_math ? CUDNN_TENSOR_OP_MATH : CUDNN_DEFAULT_MATH);
-    if (RnnTensorOpMathEnabled()) {
-      cudnnStatus_t status = cudnnSetRNNMatrixMathType(rnn_desc_, math_type);
-      if (status != CUDNN_STATUS_SUCCESS) {
-        LOG(FATAL) << "could not set cudnn RNN math type: " << ToString(status);
-      }
+    // Require explicit algorithm config to enable tensor cores. Some configs
+    // return CUDNN_NOT_SUPPORTED when tensor ops are enabled (which is against
+    // the idiom that enabling tensor ops is only a hint: see nvbugs/2172799).
+    // We can only reasonably expect the user to handle the subsequent failure
+    // in profile mode, which is run with algorithms returned from
+    // GetRnnAlgorithms() (which are non-default and explicitly set whether to
+    // use tensor ops).
+    if (RnnTensorOpMathEnabled() &&
+        !algorithm_config.algorithm().is_default()) {
+      cudnnMathType_t math_type =
+          algorithm_config.algorithm().tensor_ops_enabled()
+              ? CUDNN_TENSOR_OP_MATH
+              : CUDNN_DEFAULT_MATH;
+      CHECK_CUDNN_OK(cudnnSetRNNMatrixMathType(rnn_desc.get(), math_type));
     }
 #endif
+
+    return CudnnRnnDescriptor(cudnn, std::move(rnn_desc), std::move(rnn_plan),
+                              num_layers, hidden_size, input_size, batch_size,
+                              input_mode, direction_mode, rnn_mode, data_type,
+                              compute_type, algorithm_config,
+                              std::move(dropout_desc), std::move(params_desc));
   }
-  cudnnRNNDescriptor_t handle() const {
-    if (!ok()) return nullptr;
-    return rnn_desc_;
-  }
+
+  cudnnRNNDescriptor_t handle() const { return rnn_desc_.get(); }
   int num_layers() const { return num_layers_; }
   int hidden_size() const { return hidden_size_; }
   int input_size() const { return input_size_; }
@@ -1093,27 +1069,21 @@ class CudnnRnnDescriptor : public CudnnDescriptorCommon<dnn::RnnDescriptor> {
     return algorithm_config_;
   }
   int64 ParamsSizeInBytes() const override {
-    return cudnn_params_desc_->params_size_in_bytes();
-  }
-  cudnnDropoutDescriptor_t dropout_handle() const {
-    if (!cudnn_dropout_desc_) return nullptr;
-    return cudnn_dropout_desc_->handle();
+    return params_desc_.params_size_in_bytes();
   }
   cudnnFilterDescriptor_t params_handle() const {
-    if (!cudnn_params_desc_) return nullptr;
-    return cudnn_params_desc_->handle();
+    return params_desc_.handle();
   }
   ParamsRegions ParamsWeightRegions() const override {
-    if (!ok()) return ParamsRegions();
-    return cudnn_params_desc_->params_weights();
+    return params_desc_.params_weights();
   }
   ParamsRegions ParamsBiasRegions() const override {
-    if (!ok()) return ParamsRegions();
-    return cudnn_params_desc_->params_biases();
+    return params_desc_.params_biases();
   }
 
  private:
-  cudnnRNNDescriptor_t rnn_desc_;
+  cuda::RnnDescriptor rnn_desc_;
+  PersistentRnnPlan rnn_plan_;
   int num_layers_;
   int hidden_size_;
   int input_size_;
@@ -1121,180 +1091,142 @@ class CudnnRnnDescriptor : public CudnnDescriptorCommon<dnn::RnnDescriptor> {
   // algorithm.
   int batch_size_;
   cudnnRNNAlgo_t rnn_algo_;
-  cudnnPersistentRNNPlan_t rnn_plan_;
   cudnnRNNInputMode_t input_mode_;
   cudnnDirectionMode_t direction_mode_;
   cudnnRNNMode_t rnn_mode_;
   cudnnDataType_t data_type_;
   cudnnDataType_t compute_type_;
   dnn::AlgorithmConfig algorithm_config_;
-  std::unique_ptr<CudnnDropoutDescriptor> cudnn_dropout_desc_;
-  std::unique_ptr<CudnnRnnParamsDescriptor> cudnn_params_desc_;
+  CudnnDropoutDescriptor dropout_desc_;
+  CudnnRnnParamsDescriptor params_desc_;
   SE_DISALLOW_COPY_AND_ASSIGN(CudnnRnnDescriptor);
 };
 
 namespace {
 
-CudnnRnnParamsDescriptor::CudnnRnnParamsDescriptor(
-    const CudnnHandle& cudnn, const CudnnRnnDescriptor& rnn_desc)
-    : handle_(nullptr), rnn_desc_(&rnn_desc), params_size_in_bytes_(0) {
-  cudnnTensorDescriptor_t input_desc = nullptr;
-  {
-    // Query the params size.
-    auto status = cudnnCreateTensorDescriptor(&input_desc);
-    CUDNN_RETURN_IF_FAIL(status, "Cudnn fails to create tensor descriptor");
-    int dims[] = {1, rnn_desc.input_size(), 1};
-    int strides[] = {dims[1] * dims[2], dims[2], 1};
-    status = cudnnSetTensorNdDescriptor(
-        /*tensorDesc=*/input_desc, /*dataType=*/rnn_desc.data_type(),
-        /*nbDims=*/sizeof(dims) / sizeof(dims[0]), /*dimA=*/dims,
-        /*strideA=*/strides);
-    CUDNN_RETURN_IF_FAIL(status, "Cudnn fails to set tensor descriptor");
-
-    size_t params_size = 0;
-    status = cudnnGetRNNParamsSize(
-        /*handle=*/cudnn.handle(), /*rnnDesc=*/rnn_desc.handle(),
-        /*xDesc=*/input_desc, /*sizeInBytes=*/&params_size,
-        /*dataType=*/rnn_desc.data_type());
-    CUDNN_RETURN_IF_FAIL(status, "Cudnn fails to get RNN parameter size");
-    params_size_in_bytes_ = static_cast<int64>(params_size);
-  }
-
-  {
-    // Create the params descriptor.
-    auto status = cudnnCreateFilterDescriptor(&handle_);
-    CUDNN_RETURN_IF_FAIL(status, "Cudnn fails to create RNN filter descriptor");
-    int dims[] = {static_cast<int>(params_size_in_bytes_), 1, 1};
-    status = cudnnSetFilterNdDescriptor(
-        /*filterDesc=*/handle_, /*dataType=*/rnn_desc.data_type(),
-        /*format=*/CUDNN_TENSOR_NCHW, /*nbDims=*/sizeof(dims) / sizeof(dims[0]),
-        /*filterDimA=*/dims);
-    CUDNN_RETURN_IF_FAIL(status, "Cudnn fails to update RNN filter descriptor");
-  }
+port::StatusOr<CudnnRnnParamsDescriptor> CudnnRnnParamsDescriptor::Create(
+    const CudnnHandle& cudnn, int input_size, cudnnDataType_t data_type,
+    cudnnRNNDescriptor_t rnn_desc, cudnnRNNMode_t rnn_mode,
+    cudnnDirectionMode_t direction_mode, int num_layers) {
+  // Query the params size.
+  TensorDescriptor input_desc = CreateTensorDescriptor();
+  int tensor_dims[] = {1, input_size, 1};
+  int strides[] = {tensor_dims[1] * tensor_dims[2], tensor_dims[2], 1};
+  RETURN_IF_CUDNN_ERROR(cudnnSetTensorNdDescriptor(
+      /*tensorDesc=*/input_desc.get(), /*dataType=*/data_type,
+      /*nbDims=*/sizeof(tensor_dims) / sizeof(tensor_dims[0]),
+      /*dimA=*/tensor_dims,
+      /*strideA=*/strides));
+
+  size_t params_size = 0;
+  RETURN_IF_CUDNN_ERROR(cudnnGetRNNParamsSize(
+      /*handle=*/cudnn.handle(), /*rnnDesc=*/rnn_desc,
+      /*xDesc=*/input_desc.get(), /*sizeInBytes=*/&params_size,
+      /*dataType=*/data_type));
+  int64 params_size_in_bytes = static_cast<int64>(params_size);
+
+  FilterDescriptor filter_desc = CreateFilterDescriptor();
+  int filter_dims[] = {static_cast<int>(params_size_in_bytes), 1, 1};
+  RETURN_IF_CUDNN_ERROR(cudnnSetFilterNdDescriptor(
+      /*filterDesc=*/filter_desc.get(), /*dataType=*/data_type,
+      /*format=*/CUDNN_TENSOR_NCHW,
+      /*nbDims=*/sizeof(filter_dims) / sizeof(filter_dims[0]),
+      /*filterDimA=*/filter_dims));
+
+  // Create the weights and biases into the params buffer
+  int region_count_per_layer = [&] {
+    switch (rnn_mode) {
+      case CUDNN_RNN_RELU:
+      case CUDNN_RNN_TANH:
+        return 2;
+      case CUDNN_LSTM:
+        return 8;
+      case CUDNN_GRU:
+        return 6;
+      default:
+        LOG(FATAL) << "Invalid RNN Mode: " << static_cast<int>(rnn_mode);
+        return 0;
+    }
+  }();
 
-  {
-    // Create the weights and biases into the params buffer
-    int region_count_per_layer = GetRegionCountPerLayer();
-    cudnnFilterDescriptor_t region_desc_handle = nullptr;
-    auto status = cudnnCreateFilterDescriptor(&region_desc_handle);
-    CUDNN_RETURN_IF_FAIL(status, "Cudnn fails to create filter descriptor");
-    const int layer_count = rnn_desc.direction_mode() == CUDNN_UNIDIRECTIONAL
-                                ? rnn_desc.num_layers()
-                                : 2 * rnn_desc.num_layers();
-    for (int layer = 0; layer < layer_count; layer++) {
-      for (int region = 0; region < region_count_per_layer; region++) {
-        for (int type = 0; type < 2; type++) {
-          void* offset = nullptr;
-          if (type == 0) {
-            status = cudnnGetRNNLinLayerMatrixParams(
-                /*handle=*/cudnn.handle(), /*rnnDesc=*/rnn_desc.handle(),
-                /*layer=*/layer, /*xDesc=*/input_desc, /*wDesc=*/handle_,
-                /*w=*/nullptr, /*linLayerID=*/region,
-                /*linLayerMatDesc=*/region_desc_handle,
-                /*linLayerMat=*/&offset);
-            CUDNN_RETURN_IF_FAIL(
-                status, "Cudnn fails to call cudnnGetRNNLinLayerMatrixParams");
-          } else {
-            status = cudnnGetRNNLinLayerBiasParams(
-                /*handle=*/cudnn.handle(), /*rnnDesc=*/rnn_desc.handle(),
-                /*layer=*/layer, /*xDesc=*/input_desc, /*wDesc=*/handle_,
-                /*w=*/nullptr, /*linLayerID=*/region,
-                /*linLayerBiasDesc=*/region_desc_handle,
-                /*linLayerBias=*/&offset);
-            CUDNN_RETURN_IF_FAIL(
-                status, "Cudnn fails to call cudnnGetRNNLinLayerBiasParams");
-          }
-          int dims[] = {1, 1, 1};
-          cudnnDataType_t data_type;
-          cudnnTensorFormat_t tensor_format;
-          int n_dims;
-          status = cudnnGetFilterNdDescriptor(
-              /*filterDesc=*/region_desc_handle,
-              /*nbDimsRequested=*/sizeof(dims) / sizeof(dims[0]),
-              /*dataType=*/&data_type, /*format=*/&tensor_format,
-              /*nbDims=*/&n_dims, /*filterDimA=*/dims);
-          CUDNN_RETURN_IF_FAIL(status, "Cudnn fails to get filter description");
-          int64 size = dims[0] * dims[1] * dims[2] *
-                       CudnnDataTypeToByteSize(rnn_desc.data_type());
-          ParamsRegion region = {reinterpret_cast<int64>(offset), size};
-          if (type == 0) {
-            weights_.push_back(region);
-          } else {
-            biases_.push_back(region);
-          }
-        }
+  FilterDescriptor region_desc_handle = CreateFilterDescriptor();
+  const int layer_count =
+      direction_mode == CUDNN_UNIDIRECTIONAL ? num_layers : 2 * num_layers;
+
+  ParamsRegions weights;
+  ParamsRegions biases;
+
+  for (int layer = 0; layer < layer_count; layer++) {
+    for (int region = 0; region < region_count_per_layer; region++) {
+      for (int type = 0; type < 2; type++) {
+        void* offset = nullptr;
+        RETURN_IF_CUDNN_ERROR((type == 0 ? cudnnGetRNNLinLayerMatrixParams
+                                         : cudnnGetRNNLinLayerBiasParams)(
+            /*handle=*/cudnn.handle(), /*rnnDesc=*/rnn_desc,
+            /*layer=*/layer, /*xDesc=*/input_desc.get(),
+            /*wDesc=*/filter_desc.get(),
+            /*w=*/nullptr, /*linLayerID=*/region,
+            /*linLayerMatDesc=*/region_desc_handle.get(),
+            /*linLayerMat or linLayerBias=*/&offset));
+        int dims[] = {1, 1, 1};
+        cudnnDataType_t data_type;
+        cudnnTensorFormat_t tensor_format;
+        int n_dims;
+        RETURN_IF_CUDNN_ERROR(cudnnGetFilterNdDescriptor(
+            /*filterDesc=*/region_desc_handle.get(),
+            /*nbDimsRequested=*/sizeof(dims) / sizeof(dims[0]),
+            /*dataType=*/&data_type, /*format=*/&tensor_format,
+            /*nbDims=*/&n_dims, /*filterDimA=*/dims));
+        int64 size =
+            dims[0] * dims[1] * dims[2] * CudnnDataTypeToByteSize(data_type);
+        dnn::RnnDescriptor::ParamsRegion region = {
+            reinterpret_cast<int64>(offset), size};
+        (type == 0 ? weights : biases).push_back(region);
       }
     }
-    status = cudnnDestroyFilterDescriptor(region_desc_handle);
-    CUDNN_RETURN_IF_FAIL(status, "Cudnn fails to destroy filter descriptor");
   }
 
-  {
-    // Release the dummy input tensor descriptor.
-    auto status = cudnnDestroyTensorDescriptor(input_desc);
-    CUDNN_RETURN_IF_FAIL(status, "Cudnn fails to destroy tensor descriptor");
-  }
-}
-
-int CudnnRnnParamsDescriptor::GetRegionCountPerLayer() const {
-  auto rnn_mode = rnn_desc_->rnn_mode();
-  switch (rnn_mode) {
-    case CUDNN_RNN_RELU:
-    case CUDNN_RNN_TANH:
-      return 2;
-    case CUDNN_LSTM:
-      return 8;
-    case CUDNN_GRU:
-      return 6;
-    default:
-      LOG(FATAL) << "Invalid RNN Mode: " << static_cast<int>(rnn_mode);
-  }
+  return CudnnRnnParamsDescriptor(std::move(filter_desc), params_size_in_bytes,
+                                  weights, biases);
 }
 
 }  // namespace
 
 class CudnnRnnSequenceTensorDescriptor
-    : public CudnnDescriptorCommon<dnn::RnnSequenceTensorDescriptor> {
- public:
+    : public dnn::RnnSequenceTensorDescriptor {
   CudnnRnnSequenceTensorDescriptor(CUDAExecutor* parent, int seq_length,
                                    int batch_size, int data_size,
-                                   cudnnDataType_t data_type)
+                                   cudnnDataType_t data_type,
+                                   TensorDescriptor handle)
       : parent_(parent),
         seq_length_(seq_length),
         batch_size_(batch_size),
         data_size_(data_size),
-        data_type_(data_type) {
-    cudnnTensorDescriptor_t handle = nullptr;
-    if (seq_length <= 0) {
-      string error_msg =
-          port::StrCat("sequence length must be positive: ", seq_length);
-      LOG(ERROR) << error_msg;
-      SetFailure(port::Status(port::error::UNKNOWN, error_msg));
-      return;
-    }
-    cudnnStatus_t status = cudnnCreateTensorDescriptor(&handle);
-    CUDNN_RETURN_IF_FAIL(status, "Failed to create tensor descriptor");
+        data_type_(data_type),
+        handle_(std::move(handle)),
+        handles_(seq_length, handle_.get()) {}
+
+ public:
+  CudnnRnnSequenceTensorDescriptor(CudnnRnnSequenceTensorDescriptor&&) =
+      default;
+
+  static port::StatusOr<CudnnRnnSequenceTensorDescriptor> Create(
+      CUDAExecutor* parent, int seq_length, int batch_size, int data_size,
+      cudnnDataType_t data_type) {
+    CHECK_GT(seq_length, 0);
     int dims[] = {batch_size, data_size, 1};
     int strides[] = {dims[1] * dims[2], dims[2], 1};
-    status = cudnnSetTensorNdDescriptor(
-        /*tensorDesc=*/handle, /*dataType=*/data_type,
+    TensorDescriptor tensor_desc = CreateTensorDescriptor();
+    RETURN_IF_CUDNN_ERROR(cudnnSetTensorNdDescriptor(
+        /*tensorDesc=*/tensor_desc.get(), /*dataType=*/data_type,
         /*nbDims=*/sizeof(dims) / sizeof(dims[0]), /*dimA=*/dims,
-        /*strideA=*/strides);
-    CUDNN_RETURN_IF_FAIL(status, "Failed to update tensor descriptor");
-    // Replicate handle across the number of steps.
-    handles_.assign(seq_length, handle);
-  }
-
-  ~CudnnRnnSequenceTensorDescriptor() override {
-    // Only the first one needs to be destroyed. All others are the same.
-    cudnnStatus_t status = cudnnDestroyTensorDescriptor(handles_[0]);
-    CUDNN_RETURN_IF_FAIL(status,
-                         "Failed to destroy sequence tensor descriptor");
+        /*strideA=*/strides));
+    return CudnnRnnSequenceTensorDescriptor(parent, seq_length, batch_size,
+                                            data_size, data_type,
+                                            std::move(tensor_desc));
   }
 
   const cudnnTensorDescriptor_t* handles() const {
-    if (!ok()) return nullptr;
-    CHECK(!handles_.empty()) << "handles cannot be empty";
     return handles_.data();
   }
 
@@ -1308,51 +1240,39 @@ class CudnnRnnSequenceTensorDescriptor
   int batch_size_;
   int data_size_;
   cudnnDataType_t data_type_;
-  std::vector<cudnnTensorDescriptor_t> handles_;
+  TensorDescriptor handle_;
+  std::vector<cudnnTensorDescriptor_t> handles_;  // Copies of handle_.
   SE_DISALLOW_COPY_AND_ASSIGN(CudnnRnnSequenceTensorDescriptor);
 };
 
-class CudnnRnnStateTensorDescriptor
-    : public CudnnDescriptorCommon<dnn::RnnStateTensorDescriptor> {
+class CudnnRnnStateTensorDescriptor : public dnn::RnnStateTensorDescriptor {
  public:
   CudnnRnnStateTensorDescriptor(CUDAExecutor* parent, int num_layers,
                                 int batch_size, int data_size,
                                 cudnnDataType_t data_type)
       : parent_(parent),
-        handle_(nullptr),
+        handle_(CreateTensorDescriptor()),
         num_layers_(num_layers),
         batch_size_(batch_size),
         data_size_(data_size),
         data_type_(data_type) {
-    cudnnStatus_t status = cudnnCreateTensorDescriptor(&handle_);
-    CUDNN_RETURN_IF_FAIL(status, "Failed to create tensor descriptor");
     int dims[] = {num_layers, batch_size, data_size};
     int strides[] = {dims[1] * dims[2], dims[2], 1};
-    status = cudnnSetTensorNdDescriptor(
-        /*tensorDesc=*/handle_, /*dataType=*/data_type,
+    CHECK_CUDNN_OK(cudnnSetTensorNdDescriptor(
+        /*tensorDesc=*/handle_.get(), /*dataType=*/data_type,
         /*nbDims=*/sizeof(dims) / sizeof(dims[0]), /*dimA=*/dims,
-        /*strideA=*/strides);
-    CUDNN_RETURN_IF_FAIL(status, "Failed to update tensor descriptor");
+        /*strideA=*/strides));
   }
 
-  ~CudnnRnnStateTensorDescriptor() override {
-    if (!handle_) {
-      cudnnStatus_t status = cudnnDestroyTensorDescriptor(handle_);
-      CUDNN_RETURN_IF_FAIL(status, "Unable to destroy RNN state tensor");
-    }
-  }
+  cudnnTensorDescriptor_t handle() const { return handle_.get(); }
 
-  cudnnTensorDescriptor_t handle() const {
-    if (!ok()) return nullptr;
-    return handle_;
-  }
   int num_layers() const { return num_layers_; }
   int batch_size() const { return batch_size_; }
   int data_size() const { return data_size_; }
 
  private:
   CUDAExecutor* parent_;
-  cudnnTensorDescriptor_t handle_;
+  TensorDescriptor handle_;
   int num_layers_;
   int batch_size_;
   int data_size_;
@@ -1372,7 +1292,7 @@ struct RnnModelDims {
 };
 
 template <class T>
-bool ExtractAndCheckRnnForward(
+port::StatusOr<RnnModelDims> ExtractAndCheckRnnForward(
     const CudnnRnnDescriptor& rnn_desc,
     const CudnnRnnSequenceTensorDescriptor& input_desc,
     const DeviceMemory<T>& input_data,
@@ -1385,103 +1305,89 @@ bool ExtractAndCheckRnnForward(
     const CudnnRnnStateTensorDescriptor& output_h_desc,
     const DeviceMemory<T>& output_h_data,
     const CudnnRnnStateTensorDescriptor& output_c_desc,
-    const DeviceMemory<T>& output_c_data, RnnModelDims* model_dims) {
+    const DeviceMemory<T>& output_c_data) {
   // extract model parameters
-  model_dims->num_layers = rnn_desc.num_layers();
-  model_dims->batch_size = input_desc.batch_size();
-  model_dims->seq_length = input_desc.seq_length();
-  model_dims->hidden_size = rnn_desc.hidden_size();
-  model_dims->input_size = input_desc.data_size();
-  model_dims->dir_count =
+  RnnModelDims model_dims;
+  model_dims.num_layers = rnn_desc.num_layers();
+  model_dims.batch_size = input_desc.batch_size();
+  model_dims.seq_length = input_desc.seq_length();
+  model_dims.hidden_size = rnn_desc.hidden_size();
+  model_dims.input_size = input_desc.data_size();
+  model_dims.dir_count =
       (rnn_desc.direction_mode() == CUDNN_BIDIRECTIONAL) ? 2 : 1;
 
   // check parameters
   if (!(input_h_desc.num_layers() ==
-            model_dims->num_layers * model_dims->dir_count &&
-        input_h_desc.batch_size() == model_dims->batch_size &&
-        input_h_desc.data_size() == model_dims->hidden_size)) {
-    LOG(ERROR) << "Invalid input_h shape";
-    return false;
+            model_dims.num_layers * model_dims.dir_count &&
+        input_h_desc.batch_size() == model_dims.batch_size &&
+        input_h_desc.data_size() == model_dims.hidden_size)) {
+    return port::Status(port::error::INVALID_ARGUMENT, "Invalid input_h shape");
   }
   if (!(input_h_desc.num_layers() == input_c_desc.num_layers() &&
         input_h_desc.batch_size() == input_c_desc.batch_size() &&
         input_h_desc.data_size() == input_c_desc.data_size())) {
-    LOG(ERROR) << "Invalid input_c shape";
-    return false;
+    return port::Status(port::error::INVALID_ARGUMENT, "Invalid input_c shape");
   }
-  if (!(output_desc.seq_length() == model_dims->seq_length &&
-        output_desc.batch_size() == model_dims->batch_size &&
+  if (!(output_desc.seq_length() == model_dims.seq_length &&
+        output_desc.batch_size() == model_dims.batch_size &&
         output_desc.data_size() ==
-            model_dims->hidden_size * model_dims->dir_count)) {
-    LOG(ERROR) << "Invalid output shape";
-    return false;
+            model_dims.hidden_size * model_dims.dir_count)) {
+    return port::Status(port::error::INVALID_ARGUMENT, "Invalid output shape");
   }
   if (!(input_h_desc.num_layers() == output_h_desc.num_layers() &&
         input_h_desc.batch_size() == output_h_desc.batch_size() &&
         input_h_desc.data_size() == output_h_desc.data_size())) {
-    LOG(ERROR) << "Invalid output_h shape";
-    return false;
+    return port::Status(port::error::INVALID_ARGUMENT,
+                        "Invalid output_h shape");
   }
   if (!(input_h_desc.num_layers() == output_c_desc.num_layers() &&
         input_h_desc.batch_size() == output_c_desc.batch_size() &&
         input_h_desc.data_size() == output_c_desc.data_size())) {
-    LOG(ERROR) << "Invalid output_h shape";
-    return false;
+    return port::Status(port::error::INVALID_ARGUMENT,
+                        "Invalid output_c shape");
   }
 
-  return true;
+  return model_dims;
 }
 
-bool CheckRNNParameterSize(const CudnnHandle& cudnn,
-                           const CudnnRnnDescriptor& rnn_desc,
-                           const CudnnRnnSequenceTensorDescriptor& input_desc) {
+port::Status CheckRNNParameterSize(
+    const CudnnHandle& cudnn, const CudnnRnnDescriptor& rnn_desc,
+    const CudnnRnnSequenceTensorDescriptor& input_desc) {
   size_t params_size_in_bytes = 0;
-  cudnnStatus_t status = cudnnGetRNNParamsSize(
+  RETURN_IF_CUDNN_ERROR(cudnnGetRNNParamsSize(
       /*handle=*/cudnn.handle(), /*rnnDesc=*/rnn_desc.handle(),
       /*xDesc=*/input_desc.handles()[0], /*sizeInBytes=*/&params_size_in_bytes,
-      /*dataType=*/rnn_desc.data_type());
-  if (status != CUDNN_STATUS_SUCCESS) {
-    LOG(ERROR) << "Unable to check RNN param size: " << ToString(status);
-    return false;
+      /*dataType=*/rnn_desc.data_type()));
+  if (static_cast<int64>(params_size_in_bytes) !=
+      rnn_desc.ParamsSizeInBytes()) {
+    return port::Status(port::error::INVALID_ARGUMENT,
+                        "Mismatching RNN parameter size");
   }
-  return static_cast<int64>(params_size_in_bytes) ==
-         rnn_desc.ParamsSizeInBytes();
+  return port::Status::OK();
 }
 
-bool CreateRnnWorkspace(Stream* stream, const CudnnHandle& cudnn,
-                        const CudnnRnnDescriptor& rnn_desc,
-                        const CudnnRnnSequenceTensorDescriptor& input_desc,
-                        ScratchAllocator* workspace_allocator,
-                        DeviceMemory<uint8>* workspace) {
+port::StatusOr<DeviceMemory<uint8>> CreateRnnWorkspace(
+    Stream* stream, const CudnnHandle& cudnn,
+    const CudnnRnnDescriptor& rnn_desc,
+    const CudnnRnnSequenceTensorDescriptor& input_desc,
+    ScratchAllocator* workspace_allocator) {
   // Query the workspace size.
   size_t workspace_size_in_bytes = 0;
-  cudnnStatus_t status = cudnnGetRNNWorkspaceSize(
+  RETURN_IF_CUDNN_ERROR(cudnnGetRNNWorkspaceSize(
       /*handle=*/cudnn.handle(), /*rnnDesc=*/rnn_desc.handle(),
       /*seqLength=*/input_desc.seq_length(), /*xDesc=*/input_desc.handles(),
-      /*sizeInBytes=*/&workspace_size_in_bytes);
-  if (status != CUDNN_STATUS_SUCCESS) {
-    LOG(ERROR) << "Unable to query workspace size: " << ToString(status);
-    return false;
-  }
+      /*sizeInBytes=*/&workspace_size_in_bytes));
   // Allocate the workspace.
-  if (workspace_size_in_bytes > 0) {
-    auto allocated =
-        workspace_allocator->AllocateBytes(stream, workspace_size_in_bytes);
-    if (!allocated.ok() || (*workspace = allocated.ValueOrDie()) == nullptr) {
-      LOG(ERROR) << port::StrCat("Failed to allocate RNN workspace of ",
-                                 workspace_size_in_bytes, " bytes.");
-      return false;
-    }
-  } else {
-    *workspace = DeviceMemory<uint8>();
+  if (workspace_size_in_bytes == 0) {
+    return DeviceMemory<uint8>();
   }
-  return true;
+  return workspace_allocator->AllocateBytes(stream, workspace_size_in_bytes);
 }
 
 }  // namespace
 
 template <class T>
-bool CudnnSupport::DoRnnForwardImpl(
+port::Status CudnnSupport::DoRnnForwardImpl(
     Stream* stream, const CudnnRnnDescriptor& rnn_desc,
     const CudnnRnnSequenceTensorDescriptor& input_desc,
     const DeviceMemory<T>& input_data,
@@ -1498,57 +1404,34 @@ bool CudnnSupport::DoRnnForwardImpl(
     ScratchAllocator* reserve_space_allocator,
     ScratchAllocator* workspace_allocator,
     dnn::ProfileResult* output_profile_result) {
-  // extract model parameters
-  RnnModelDims model_dims;
-  bool res = ExtractAndCheckRnnForward(
-      rnn_desc, input_desc, input_data, input_h_desc, input_h_data,
-      input_c_desc, input_c_data, params, output_desc, *output_data,
-      output_h_desc, *output_h_data, output_c_desc, *output_c_data,
-      &model_dims);
-  if (!res) {
-    LOG(ERROR) << "Invalid parameters for RNN Model";
-    return false;
-  }
+  SE_ASSIGN_OR_RETURN(
+      RnnModelDims model_dims,
+      ExtractAndCheckRnnForward(
+          rnn_desc, input_desc, input_data, input_h_desc, input_h_data,
+          input_c_desc, input_c_data, params, output_desc, *output_data,
+          output_h_desc, *output_h_data, output_c_desc, *output_c_data));
 
   auto cudnn = cudnn_->GetHandle(parent_, stream);
 
-  // check params size
-  if (!CheckRNNParameterSize(cudnn, rnn_desc, input_desc)) {
-    LOG(ERROR) << "Invalid parameters";
-    return false;
-  }
-
-  // create the workspace
-  DeviceMemory<uint8> workspace;
-  if (!CreateRnnWorkspace(stream, cudnn, rnn_desc, input_desc,
-                          workspace_allocator, &workspace)) {
-    LOG(ERROR) << "Unable to create rnn workspace";
-    return false;
-  }
+  SE_RETURN_IF_ERROR(CheckRNNParameterSize(cudnn, rnn_desc, input_desc));
+  SE_ASSIGN_OR_RETURN(DeviceMemory<uint8> workspace,
+                      CreateRnnWorkspace(stream, cudnn, rnn_desc, input_desc,
+                                         workspace_allocator))
 
   // query the reserve space size
   // allocate the reserve space
   DeviceMemory<uint8> reserve_space;
   if (is_training) {
     size_t reserve_space_size_in_bytes = 0;
-    cudnnStatus_t status = cudnnGetRNNTrainingReserveSize(
+    RETURN_IF_CUDNN_ERROR(cudnnGetRNNTrainingReserveSize(
         /*handle=*/cudnn.handle(), /*rnnDesc=*/rnn_desc.handle(),
         /*seqLength=*/model_dims.seq_length, /*xDesc=*/input_desc.handles(),
-        /*sizeInBytes=*/&reserve_space_size_in_bytes);
-    if (status != CUDNN_STATUS_SUCCESS) {
-      LOG(ERROR) << "Unable to query reserve space size: " << ToString(status);
-      return false;
-    }
+        /*sizeInBytes=*/&reserve_space_size_in_bytes));
 
     if (reserve_space_size_in_bytes > 0) {
-      auto allocated = reserve_space_allocator->AllocateBytes(
-          stream, reserve_space_size_in_bytes);
-      if (!allocated.ok() ||
-          (reserve_space = allocated.ValueOrDie()) == nullptr) {
-        LOG(ERROR) << "Failed to allocate RNN reserve space of "
-                   << reserve_space_size_in_bytes << " bytes.";
-        return false;
-      }
+      SE_ASSIGN_OR_RETURN(reserve_space,
+                          reserve_space_allocator->AllocateBytes(
+                              stream, reserve_space_size_in_bytes));
     }
   }
 
@@ -1556,20 +1439,16 @@ bool CudnnSupport::DoRnnForwardImpl(
   const bool is_profiling = output_profile_result != nullptr;
   if (is_profiling) {
     timer.reset(new CUDATimer(parent_));
-    if (!timer->Init()) {
-      return false;
-    }
     // The start and stop of the timer should be as close to the Cudnn call as
     // possible. It is still possible for other threads to issue workload on
     // to this stream. So it could take multiple profiling measurements.
-    if (!timer->Start(AsCUDAStream(stream))) {
-      return false;
+    if (!timer->Init() || !timer->Start(AsCUDAStream(stream))) {
+      return port::Status(port::error::INTERNAL, "Failed to start timer");
     }
   }
-  // make the forward call
-  cudnnStatus_t status;
+
   if (!is_training) {
-    status = cudnnRNNForwardInference(
+    RETURN_IF_CUDNN_ERROR(cudnnRNNForwardInference(
         /*handle=*/cudnn.handle(), /*rnnDesc=*/rnn_desc.handle(),
         /*seqLength=*/model_dims.seq_length, /*xDesc=*/input_desc.handles(),
         /*x=*/input_data.opaque(), /*hxDesc=*/input_h_desc.handle(),
@@ -1579,9 +1458,9 @@ bool CudnnSupport::DoRnnForwardImpl(
         /*y=*/output_data->opaque(), /*hyDesc=*/output_h_desc.handle(),
         /*hy=*/output_h_data->opaque(), /*cyDesc=*/output_c_desc.handle(),
         /*cy=*/output_c_data->opaque(), /*workspace=*/workspace.opaque(),
-        /*workSpaceSizeInBytes=*/workspace.size());
+        /*workSpaceSizeInBytes=*/workspace.size()));
   } else {
-    status = cudnnRNNForwardTraining(
+    RETURN_IF_CUDNN_ERROR(cudnnRNNForwardTraining(
         /*handle=*/cudnn.handle(), /*rnnDesc=*/rnn_desc.handle(),
         /*seqLength=*/model_dims.seq_length, /*xDesc=*/input_desc.handles(),
         /*x=*/input_data.opaque(), /*hxDesc=*/input_h_desc.handle(),
@@ -1593,35 +1472,24 @@ bool CudnnSupport::DoRnnForwardImpl(
         /*cy=*/output_c_data->opaque(), /*workspace=*/workspace.opaque(),
         /*workSpaceSizeInBytes=*/workspace.size(),
         /*reserveSpace=*/reserve_space.opaque(),
-        /*reserveSpaceSizeInBytes=*/reserve_space.size());
+        /*reserveSpaceSizeInBytes=*/reserve_space.size()));
   }
+
   if (is_profiling) {
     if (!timer->Stop(AsCUDAStream(stream))) {
-      return false;
-    }
-    if (status == CUDNN_STATUS_SUCCESS) {
-      auto algo_desc = rnn_desc.algorithm_config().algorithm();
-      output_profile_result->set_algorithm(algo_desc);
-      output_profile_result->set_elapsed_time_in_ms(
-          timer->GetElapsedMilliseconds());
-    }
-  }
-  if (status != CUDNN_STATUS_SUCCESS) {
-    // Silently return when we are profiling.
-    if (!is_profiling) {
-      LOG(ERROR) << "Failed to call "
-                 << (is_training ? "cudnnRNNForwardTraining "
-                                 : "cudnnRNNForwardInference ")
-                 << ToString(status);
-      return false;
+      return port::Status(port::error::INTERNAL, "Failed to stop timer");
     }
+    auto algo_desc = rnn_desc.algorithm_config().algorithm();
+    output_profile_result->set_algorithm(algo_desc);
+    output_profile_result->set_elapsed_time_in_ms(
+        timer->GetElapsedMilliseconds());
   }
 
-  return true;
+  return port::Status::OK();
 }
 
 template <class T>
-bool CudnnSupport::DoRnnBackwardImpl(
+port::Status CudnnSupport::DoRnnBackwardImpl(
     Stream* stream, const CudnnRnnDescriptor& rnn_desc,
     const CudnnRnnSequenceTensorDescriptor& input_desc,
     const DeviceMemory<T>& input_data,
@@ -1645,53 +1513,38 @@ bool CudnnSupport::DoRnnBackwardImpl(
     DeviceMemory<uint8>* reserve_space_data,
     ScratchAllocator* workspace_allocator,
     dnn::ProfileResult* output_profile_result) {
-  // extract model parameters
-  RnnModelDims model_dims;
-  bool res = ExtractAndCheckRnnForward(
-      rnn_desc, input_desc, input_data, input_h_desc, input_h_data,
-      input_c_desc, input_c_data, params, output_desc, output_data,
-      output_h_desc, output_h_data, output_c_desc, output_c_data, &model_dims);
-  if (!res) {
-    LOG(ERROR) << "Invalid parameters for RNN Model";
-    return false;
-  }
+  SE_ASSIGN_OR_RETURN(
+      RnnModelDims model_dims,
+      ExtractAndCheckRnnForward(rnn_desc, input_desc, input_data, input_h_desc,
+                                input_h_data, input_c_desc, input_c_data,
+                                params, output_desc, output_data, output_h_desc,
+                                output_h_data, output_c_desc, output_c_data));
 
   auto cudnn = cudnn_->GetHandle(parent_, stream);
 
-  // check params size
-  if (!CheckRNNParameterSize(cudnn, rnn_desc, input_desc)) {
-    LOG(ERROR) << "Invalid parameters";
-    return false;
-  }
-
-  // create the workspace
-  DeviceMemory<uint8> workspace;
-  if (!CreateRnnWorkspace(stream, cudnn, rnn_desc, input_desc,
-                          workspace_allocator, &workspace)) {
-    LOG(ERROR) << "Unable to create rnn workspace";
-    return false;
-  }
+  SE_RETURN_IF_ERROR(CheckRNNParameterSize(cudnn, rnn_desc, input_desc));
+  SE_ASSIGN_OR_RETURN(DeviceMemory<uint8> workspace,
+                      CreateRnnWorkspace(stream, cudnn, rnn_desc, input_desc,
+                                         workspace_allocator));
 
   std::unique_ptr<CUDATimer, TimerDeleter> timer;
   const bool is_profiling = output_profile_result != nullptr;
   if (is_profiling) {
     timer.reset(new CUDATimer(parent_));
-    if (!timer->Init()) {
-      return false;
-    }
     // The start and stop of the timer should be as close to the Cudnn call as
     // possible. It is still possible for other threads to issue workload on
     // to this stream. So it could take multiple profiling measurements.
-    if (!timer->Start(AsCUDAStream(stream))) {
-      return false;
+    if (!timer->Init() || !timer->Start(AsCUDAStream(stream))) {
+      return port::Status(port::error::INTERNAL, "Failed to start timer");
     }
   }
-  // make the backward data call
-  cudnnStatus_t status = cudnnRNNBackwardData(
+
+  RETURN_IF_CUDNN_ERROR(cudnnRNNBackwardData(
       /*handle=*/cudnn.handle(), /*rnnDesc=*/rnn_desc.handle(),
       /*seqLength=*/model_dims.seq_length, /*yDesc=*/output_desc.handles(),
       /*y=*/output_data.opaque(), /*dyDesc=*/output_desc.handles(),
-      /*dy=*/output_backprop_data.opaque(), /*dhyDesc=*/output_h_desc.handle(),
+      /*dy=*/output_backprop_data.opaque(),
+      /*dhyDesc=*/output_h_desc.handle(),
       /*dhy=*/output_h_backprop_data.opaque(),
       /*dcyDesc=*/output_c_desc.handle(),
       /*dcy=*/output_c_backprop_data.opaque(),
@@ -1702,24 +1555,17 @@ bool CudnnSupport::DoRnnBackwardImpl(
       /*dhxDesc=*/input_h_desc.handle(),
       /*dhx=*/input_h_backprop_data->opaque(),
       /*dcxDesc=*/input_c_desc.handle(),
-      /*dcx=*/input_c_backprop_data->opaque(), /*workspace=*/workspace.opaque(),
+      /*dcx=*/input_c_backprop_data->opaque(),
+      /*workspace=*/workspace.opaque(),
       /*workSpaceSizeInBytes=*/workspace.size(),
       /*reserveSpace=*/reserve_space_data->opaque(),
-      /*reserveSpaceSizeInBytes=*/reserve_space_data->size());
-
-  if (status != CUDNN_STATUS_SUCCESS) {
-    if (is_profiling) {
-      timer->Stop(AsCUDAStream(stream));
-    }
-    LOG(ERROR) << "Failed to call cudnnRNNBackwardData: " << ToString(status);
-    return false;
-  }
+      /*reserveSpaceSizeInBytes=*/reserve_space_data->size()));
 
   if (params_backprop_data != nullptr) {
     // Clear the dw to zeros.
     stream->ThenMemZero(params_backprop_data, params_backprop_data->size());
     // make the backward weight call
-    status = cudnnRNNBackwardWeights(
+    RETURN_IF_CUDNN_ERROR(cudnnRNNBackwardWeights(
         /*handle=*/cudnn.handle(), /*rnnDesc=*/rnn_desc.handle(),
         /*seqLength=*/model_dims.seq_length, /*xDesc=*/input_desc.handles(),
         /*x=*/input_data.opaque(), /*hxDesc=*/input_h_desc.handle(),
@@ -1729,19 +1575,12 @@ bool CudnnSupport::DoRnnBackwardImpl(
         /*dwDesc=*/rnn_desc.params_handle(),
         /*dw=*/params_backprop_data->opaque(),
         /*reserveSpace=*/reserve_space_data->opaque(),
-        /*reserveSpaceSizeInBytes=*/reserve_space_data->size());
-    if (status != CUDNN_STATUS_SUCCESS) {
-      if (is_profiling) {
-        timer->Stop(AsCUDAStream(stream));
-      }
-      LOG(ERROR) << "Failed to call cudnnRNNBackwardWeights: "
-                 << ToString(status);
-      return false;
-    }
+        /*reserveSpaceSizeInBytes=*/reserve_space_data->size()));
   }
+
   if (is_profiling) {
     if (!timer->Stop(AsCUDAStream(stream))) {
-      return false;
+      return port::Status(port::error::INTERNAL, "Failed to stop timer");
     }
     auto algo_desc = rnn_desc.algorithm_config().algorithm();
     output_profile_result->set_algorithm(algo_desc);
@@ -1749,7 +1588,7 @@ bool CudnnSupport::DoRnnBackwardImpl(
         timer->GetElapsedMilliseconds());
   }
 
-  return true;
+  return port::Status::OK();
 }
 
 port::StatusOr<std::unique_ptr<dnn::RnnDescriptor>>
@@ -1762,46 +1601,37 @@ CudnnSupport::createRnnDescriptor(
   // Setting up a cudnnRNNDescriptor requires a cuDNN handle, but because it's
   // not enqueueing anything into a stream, we pass in the null stream.
   auto cudnn = cudnn_->GetHandle(parent_, /*stream=*/nullptr);
-  std::unique_ptr<CudnnRnnDescriptor> rnn_desc(new CudnnRnnDescriptor(
-      cudnn, num_layers, hidden_size, input_size, batch_size,
-      ToCudnnRnnInputMode(input_mode), ToCudnnRnnDirectionMode(direction_mode),
-      ToCudnnRnnMode(rnn_mode), ToCudnnDataType(data_type),
-      GetRnnComputeType(data_type), algorithm_config, dropout, seed,
-      state_allocator));
-  if (!rnn_desc->ok()) {
-    return rnn_desc->Status();
-  }
-  return port::StatusOr<std::unique_ptr<dnn::RnnDescriptor>>(
-      std::move(rnn_desc));
+  SE_ASSIGN_OR_RETURN(
+      CudnnRnnDescriptor rnn_desc,
+      CudnnRnnDescriptor::Create(
+          cudnn, num_layers, hidden_size, input_size, batch_size,
+          ToCudnnRnnInputMode(input_mode),
+          ToCudnnRnnDirectionMode(direction_mode), ToCudnnRnnMode(rnn_mode),
+          ToCudnnDataType(data_type), GetRnnComputeType(data_type),
+          algorithm_config, dropout, seed, state_allocator));
+  return std::unique_ptr<dnn::RnnDescriptor>(
+      new CudnnRnnDescriptor(std::move(rnn_desc)));
 }
 
 port::StatusOr<std::unique_ptr<dnn::RnnSequenceTensorDescriptor>>
 CudnnSupport::createRnnSequenceTensorDescriptor(int seq_length, int batch_size,
                                                 int data_size,
                                                 dnn::DataType data_type) {
-  std::unique_ptr<CudnnRnnSequenceTensorDescriptor> seq_desc(
-      new CudnnRnnSequenceTensorDescriptor(parent_, seq_length, batch_size,
-                                           data_size,
-                                           ToCudnnDataType(data_type)));
-  if (!seq_desc->ok()) {
-    return seq_desc->Status();
-  }
-  return port::StatusOr<std::unique_ptr<dnn::RnnSequenceTensorDescriptor>>(
-      std::move(seq_desc));
+  SE_ASSIGN_OR_RETURN(CudnnRnnSequenceTensorDescriptor descriptor,
+                      CudnnRnnSequenceTensorDescriptor::Create(
+                          parent_, seq_length, batch_size, data_size,
+                          ToCudnnDataType(data_type)));
+  return std::unique_ptr<dnn::RnnSequenceTensorDescriptor>(
+      new CudnnRnnSequenceTensorDescriptor(std::move(descriptor)));
 }
 
 port::StatusOr<std::unique_ptr<dnn::RnnStateTensorDescriptor>>
 CudnnSupport::createRnnStateTensorDescriptor(int num_layer, int batch_size,
                                              int data_size,
                                              dnn::DataType data_type) {
-  std::unique_ptr<CudnnRnnStateTensorDescriptor> state_desc(
+  return std::unique_ptr<dnn::RnnStateTensorDescriptor>(
       new CudnnRnnStateTensorDescriptor(parent_, num_layer, batch_size,
                                         data_size, ToCudnnDataType(data_type)));
-  if (!state_desc->ok()) {
-    return state_desc->Status();
-  }
-  return port::StatusOr<std::unique_ptr<dnn::RnnStateTensorDescriptor>>(
-      std::move(state_desc));
 }
 
 bool CudnnSupport::DoRnnForward(
@@ -1837,12 +1667,14 @@ bool CudnnSupport::DoRnnForward(
   const CudnnRnnStateTensorDescriptor& cudnn_output_c_desc =
       static_cast<const CudnnRnnStateTensorDescriptor&>(output_c_desc);
 
-  return DoRnnForwardImpl<Eigen::half>(
-      stream, cudnn_rnn_desc, cudnn_input_desc, input_data, cudnn_input_h_desc,
-      input_h_data, cudnn_input_c_desc, input_c_data, params, cudnn_output_desc,
-      output_data, cudnn_output_h_desc, output_h_data, cudnn_output_c_desc,
-      output_c_data, is_training, reserve_space_allocator, workspace_allocator,
-      output_profile_result);
+  return IsStatusOk(
+      DoRnnForwardImpl<Eigen::half>(
+          stream, cudnn_rnn_desc, cudnn_input_desc, input_data,
+          cudnn_input_h_desc, input_h_data, cudnn_input_c_desc, input_c_data,
+          params, cudnn_output_desc, output_data, cudnn_output_h_desc,
+          output_h_data, cudnn_output_c_desc, output_c_data, is_training,
+          reserve_space_allocator, workspace_allocator, output_profile_result),
+      /*report_error=*/!output_profile_result);
 }
 
 bool CudnnSupport::DoRnnForward(
@@ -1877,12 +1709,14 @@ bool CudnnSupport::DoRnnForward(
   const CudnnRnnStateTensorDescriptor& cudnn_output_c_desc =
       static_cast<const CudnnRnnStateTensorDescriptor&>(output_c_desc);
 
-  return DoRnnForwardImpl<float>(
-      stream, cudnn_rnn_desc, cudnn_input_desc, input_data, cudnn_input_h_desc,
-      input_h_data, cudnn_input_c_desc, input_c_data, params, cudnn_output_desc,
-      output_data, cudnn_output_h_desc, output_h_data, cudnn_output_c_desc,
-      output_c_data, is_training, reserve_space_allocator, workspace_allocator,
-      output_profile_result);
+  return IsStatusOk(
+      DoRnnForwardImpl<float>(
+          stream, cudnn_rnn_desc, cudnn_input_desc, input_data,
+          cudnn_input_h_desc, input_h_data, cudnn_input_c_desc, input_c_data,
+          params, cudnn_output_desc, output_data, cudnn_output_h_desc,
+          output_h_data, cudnn_output_c_desc, output_c_data, is_training,
+          reserve_space_allocator, workspace_allocator, output_profile_result),
+      /*report_error=*/!output_profile_result);
 }
 
 bool CudnnSupport::DoRnnForward(
@@ -1918,12 +1752,14 @@ bool CudnnSupport::DoRnnForward(
   const CudnnRnnStateTensorDescriptor& cudnn_output_c_desc =
       static_cast<const CudnnRnnStateTensorDescriptor&>(output_c_desc);
 
-  return DoRnnForwardImpl<double>(
-      stream, cudnn_rnn_desc, cudnn_input_desc, input_data, cudnn_input_h_desc,
-      input_h_data, cudnn_input_c_desc, input_c_data, params, cudnn_output_desc,
-      output_data, cudnn_output_h_desc, output_h_data, cudnn_output_c_desc,
-      output_c_data, is_training, reserve_space_allocator, workspace_allocator,
-      output_profile_result);
+  return IsStatusOk(
+      DoRnnForwardImpl<double>(
+          stream, cudnn_rnn_desc, cudnn_input_desc, input_data,
+          cudnn_input_h_desc, input_h_data, cudnn_input_c_desc, input_c_data,
+          params, cudnn_output_desc, output_data, cudnn_output_h_desc,
+          output_h_data, cudnn_output_c_desc, output_c_data, is_training,
+          reserve_space_allocator, workspace_allocator, output_profile_result),
+      /*report_error=*/!output_profile_result);
 }
 
 bool CudnnSupport::DoRnnBackward(
@@ -1966,14 +1802,17 @@ bool CudnnSupport::DoRnnBackward(
   const CudnnRnnStateTensorDescriptor& cudnn_output_c_desc =
       static_cast<const CudnnRnnStateTensorDescriptor&>(output_c_desc);
 
-  return DoRnnBackwardImpl<Eigen::half>(
-      stream, cudnn_rnn_desc, cudnn_input_desc, input_data, cudnn_input_h_desc,
-      input_h_data, cudnn_input_c_desc, input_c_data, params, cudnn_output_desc,
-      output_data, cudnn_output_h_desc, output_h_data, cudnn_output_c_desc,
-      output_c_data, output_backprop_data, output_h_backprop_data,
-      output_c_backprop_data, input_backprop_data, input_h_backprop_data,
-      input_c_backprop_data, params_backprop_data, reserve_space_data,
-      workspace_allocator, output_profile_result);
+  return IsStatusOk(
+      DoRnnBackwardImpl<Eigen::half>(
+          stream, cudnn_rnn_desc, cudnn_input_desc, input_data,
+          cudnn_input_h_desc, input_h_data, cudnn_input_c_desc, input_c_data,
+          params, cudnn_output_desc, output_data, cudnn_output_h_desc,
+          output_h_data, cudnn_output_c_desc, output_c_data,
+          output_backprop_data, output_h_backprop_data, output_c_backprop_data,
+          input_backprop_data, input_h_backprop_data, input_c_backprop_data,
+          params_backprop_data, reserve_space_data, workspace_allocator,
+          output_profile_result),
+      /*report_error=*/!output_profile_result);
 }
 
 bool CudnnSupport::DoRnnBackward(
@@ -2015,14 +1854,17 @@ bool CudnnSupport::DoRnnBackward(
   const CudnnRnnStateTensorDescriptor& cudnn_output_c_desc =
       static_cast<const CudnnRnnStateTensorDescriptor&>(output_c_desc);
 
-  return DoRnnBackwardImpl<float>(
-      stream, cudnn_rnn_desc, cudnn_input_desc, input_data, cudnn_input_h_desc,
-      input_h_data, cudnn_input_c_desc, input_c_data, params, cudnn_output_desc,
-      output_data, cudnn_output_h_desc, output_h_data, cudnn_output_c_desc,
-      output_c_data, output_backprop_data, output_h_backprop_data,
-      output_c_backprop_data, input_backprop_data, input_h_backprop_data,
-      input_c_backprop_data, params_backprop_data, reserve_space_data,
-      workspace_allocator, output_profile_result);
+  return IsStatusOk(
+      DoRnnBackwardImpl<float>(
+          stream, cudnn_rnn_desc, cudnn_input_desc, input_data,
+          cudnn_input_h_desc, input_h_data, cudnn_input_c_desc, input_c_data,
+          params, cudnn_output_desc, output_data, cudnn_output_h_desc,
+          output_h_data, cudnn_output_c_desc, output_c_data,
+          output_backprop_data, output_h_backprop_data, output_c_backprop_data,
+          input_backprop_data, input_h_backprop_data, input_c_backprop_data,
+          params_backprop_data, reserve_space_data, workspace_allocator,
+          output_profile_result),
+      /*report_error=*/!output_profile_result);
 }
 
 bool CudnnSupport::DoRnnBackward(
@@ -2065,121 +1907,351 @@ bool CudnnSupport::DoRnnBackward(
   const CudnnRnnStateTensorDescriptor& cudnn_output_c_desc =
       static_cast<const CudnnRnnStateTensorDescriptor&>(output_c_desc);
 
-  return DoRnnBackwardImpl<double>(
-      stream, cudnn_rnn_desc, cudnn_input_desc, input_data, cudnn_input_h_desc,
-      input_h_data, cudnn_input_c_desc, input_c_data, params, cudnn_output_desc,
-      output_data, cudnn_output_h_desc, output_h_data, cudnn_output_c_desc,
-      output_c_data, output_backprop_data, output_h_backprop_data,
-      output_c_backprop_data, input_backprop_data, input_h_backprop_data,
-      input_c_backprop_data, params_backprop_data, reserve_space_data,
-      workspace_allocator, output_profile_result);
+  return IsStatusOk(
+      DoRnnBackwardImpl<double>(
+          stream, cudnn_rnn_desc, cudnn_input_desc, input_data,
+          cudnn_input_h_desc, input_h_data, cudnn_input_c_desc, input_c_data,
+          params, cudnn_output_desc, output_data, cudnn_output_h_desc,
+          output_h_data, cudnn_output_c_desc, output_c_data,
+          output_backprop_data, output_h_backprop_data, output_c_backprop_data,
+          input_backprop_data, input_h_backprop_data, input_c_backprop_data,
+          params_backprop_data, reserve_space_data, workspace_allocator,
+          output_profile_result),
+      /*report_error=*/!output_profile_result);
 }
 
 namespace {
 
-inline cudnnConvolutionFwdAlgo_t GetCudnnConvolutionForwardAlgo(
-    const CudnnHandle& cudnn, const ScopedTensorDescriptor& input_nd,
-    const ScopedFilterDescriptor& filter,
-    const ScopedConvolutionDescriptor& conv,
-    const ScopedTensorDescriptor& output_nd, bool specify_workspace_limit,
+// TODO(csigg): Merge a lot of duplicate code below for forward, backward data,
+// and backward filter.
+
+port::StatusOr<cudnnConvolutionFwdAlgo_t> GetCudnnConvolutionForwardAlgo(
+    const CudnnHandle& cudnn, const CudnnTensorDescriptor& input_nd,
+    const CudnnFilterDescriptor& filter, const CudnnConvolutionDescriptor& conv,
+    const CudnnTensorDescriptor& output_nd, bool specify_workspace_limit,
     size_t memory_limit_bytes) {
   cudnnConvolutionFwdPreference_t preference =
       specify_workspace_limit ? CUDNN_CONVOLUTION_FWD_SPECIFY_WORKSPACE_LIMIT
                               : CUDNN_CONVOLUTION_FWD_NO_WORKSPACE;
-
   cudnnConvolutionFwdAlgo_t algo_to_use;
-  auto status = cudnnGetConvolutionForwardAlgorithm(
+  RETURN_IF_CUDNN_ERROR(cudnnGetConvolutionForwardAlgorithm(
       cudnn.handle(), input_nd.handle(), filter.handle(), conv.handle(),
-      output_nd.handle(), preference, memory_limit_bytes, &algo_to_use);
-  CHECK_EQ(status, CUDNN_STATUS_SUCCESS)
-      << "Unable to find a suitable algorithm for doing forward convolution";
+      output_nd.handle(), preference, memory_limit_bytes, &algo_to_use));
   return algo_to_use;
 }
 
-dnn::AlgorithmDesc GetCudnnConvolutionForwardAlgorithm(
-    Stream* stream, const CudnnHandle& cudnn,
-    const dnn::AlgorithmConfig& algorithm_config, bool is_profiling,
-    const ScopedTensorDescriptor& input_nd,
-    const ScopedFilterDescriptor& filter,
-    const ScopedConvolutionDescriptor& conv,
-    const ScopedTensorDescriptor& output_nd,
-    ScratchAllocator* scratch_allocator, DeviceMemory<uint8>* scratch) {
-  cudnnConvolutionFwdAlgo_t algo;
-  bool use_tensor_ops;
-  if (algorithm_config.algorithm().is_default()) {
-    use_tensor_ops = true;
+port::StatusOr<cudnnConvolutionBwdDataAlgo_t>
+GetCudnnConvolutionBackwardDataAlgo(const CudnnHandle& cudnn,
+                                    const CudnnTensorDescriptor& input_nd,
+                                    const CudnnFilterDescriptor& filter,
+                                    const CudnnConvolutionDescriptor& conv,
+                                    const CudnnTensorDescriptor& output_nd,
+                                    bool specify_workspace_limit,
+                                    size_t memory_limit_bytes) {
+  cudnnConvolutionBwdDataPreference_t preference =
+      specify_workspace_limit
+          ? CUDNN_CONVOLUTION_BWD_DATA_SPECIFY_WORKSPACE_LIMIT
+          : CUDNN_CONVOLUTION_BWD_DATA_NO_WORKSPACE;
+  cudnnConvolutionBwdDataAlgo_t algo_to_use;
+  RETURN_IF_CUDNN_ERROR(cudnnGetConvolutionBackwardDataAlgorithm(
+      cudnn.handle(), filter.handle(), output_nd.handle(), conv.handle(),
+      input_nd.handle(), preference, memory_limit_bytes, &algo_to_use));
+  return algo_to_use;
+}
 
-    auto memory_limit_bytes =
-        scratch_allocator == nullptr
-            ? 0
-            : scratch_allocator->GetMemoryLimitInBytes(stream);
-    if (memory_limit_bytes < 0) {
-      memory_limit_bytes = 0;
-    }
+port::StatusOr<cudnnConvolutionBwdFilterAlgo_t>
+GetCudnnConvolutionBackwardFilterAlgo(const CudnnHandle& cudnn,
+                                      const CudnnTensorDescriptor& input_nd,
+                                      const CudnnFilterDescriptor& filter,
+                                      const CudnnConvolutionDescriptor& conv,
+                                      const CudnnTensorDescriptor& output_nd,
+                                      bool specify_workspace_limit,
+                                      size_t memory_limit_bytes) {
+  cudnnConvolutionBwdFilterPreference_t preference =
+      specify_workspace_limit
+          ? CUDNN_CONVOLUTION_BWD_FILTER_SPECIFY_WORKSPACE_LIMIT
+          : CUDNN_CONVOLUTION_BWD_FILTER_NO_WORKSPACE;
+  cudnnConvolutionBwdFilterAlgo_t algo_to_use;
+  RETURN_IF_CUDNN_ERROR(cudnnGetConvolutionBackwardFilterAlgorithm(
+      cudnn.handle(), input_nd.handle(), output_nd.handle(), conv.handle(),
+      filter.handle(), preference, memory_limit_bytes, &algo_to_use));
+  return algo_to_use;
+}
 
-    algo = GetCudnnConvolutionForwardAlgo(
-        cudnn, input_nd, filter, conv, output_nd,
-        /*specify_workspace_limit=*/scratch_allocator != nullptr,
-        memory_limit_bytes);
-  } else {
-    use_tensor_ops = algorithm_config.algorithm().tensor_ops_enabled();
-    algo = ToConvForwardAlgo(algorithm_config.algorithm());
-  }
+port::StatusOr<DeviceMemory<uint8>> AllocateCudnnConvolutionForwardWorkspace(
+    Stream* stream, const CudnnHandle& cudnn,
+    const dnn::AlgorithmDesc& algorithm_desc,
+    const CudnnTensorDescriptor& input_nd, const CudnnFilterDescriptor& filter,
+    const CudnnConvolutionDescriptor& conv,
+    const CudnnTensorDescriptor& output_nd,
+    ScratchAllocator* scratch_allocator) {
+  // TODO(csigg): This has side effects on the convolution descriptor. It is
+  // functionally correct because the convolution is run with the algorithm of
+  // the last call to this function, but should be fixed anyway.
+  conv.set_use_tensor_op_math(algorithm_desc.tensor_ops_enabled());
+
+  // Query the size of the workspace and allocate it.
   size_t size_in_bytes;
-  auto status = cudnnGetConvolutionForwardWorkspaceSize(
+  RETURN_IF_CUDNN_ERROR(cudnnGetConvolutionForwardWorkspaceSize(
       cudnn.handle(),
       /*xDesc=*/input_nd.handle(),
       /*wDesc=*/filter.handle(), /*convDesc=*/conv.handle(),
-      /*yDesc=*/output_nd.handle(), /*algo=*/algo,
-      /*sizeInBytes=*/&size_in_bytes);
+      /*yDesc=*/output_nd.handle(), /*algo=*/ToConvForwardAlgo(algorithm_desc),
+      /*sizeInBytes=*/&size_in_bytes));
   int64 size_in_bytes_int64 = size_in_bytes;
-  if (TF_PREDICT_FALSE(status != CUDNN_STATUS_SUCCESS)) {
-    CHECK(is_profiling) << "Cannot query the size of workspace needed "
-                           "for the specified algorithm: "
-                        << algorithm_config.algorithm().algo_id() << " "
-                        << ToString(status);
-    // Silently return when we are profiling.
-    return dnn::AlgorithmDesc();
+
+  if (TF_PREDICT_FALSE(size_in_bytes_int64 < 0)) {
+    return port::Status(
+        port::error::INTERNAL,
+        "cudnnGetConvolutionForwardWorkspaceSize() returned "
+        "negative sizeInBytes value. This could be a cudnn bug.");
   }
+
+  if (size_in_bytes_int64 == 0) {
+    return DeviceMemory<uint8>();
+  }
+
+  if (TF_PREDICT_FALSE(!scratch_allocator)) {
+    return port::Status(port::error::INVALID_ARGUMENT,
+                        "No scratch allocator provided");
+  }
+
+  return scratch_allocator->AllocateBytes(stream, size_in_bytes);
+}
+
+port::StatusOr<DeviceMemory<uint8>>
+AllocateCudnnConvolutionBackwardDataWorkspace(
+    Stream* stream, const CudnnHandle& cudnn,
+    const dnn::AlgorithmDesc& algorithm_desc,
+    const CudnnTensorDescriptor& input_nd, const CudnnFilterDescriptor& filter,
+    const CudnnConvolutionDescriptor& conv,
+    const CudnnTensorDescriptor& output_nd,
+    ScratchAllocator* scratch_allocator) {
+  // TODO(csigg): This has side effects on the convolution descriptor. It is
+  // functionally correct because the convolution is run with the algorithm of
+  // the last call to this function, but should be fixed anyway.
+  conv.set_use_tensor_op_math(algorithm_desc.tensor_ops_enabled());
+
+  // Query the size of the workspace and allocate it.
+  size_t size_in_bytes;
+  RETURN_IF_CUDNN_ERROR(cudnnGetConvolutionBackwardDataWorkspaceSize(
+      cudnn.handle(),
+      /*wDesc=*/filter.handle(),
+      /*dyDesc=*/output_nd.handle(),
+      /*convDesc=*/conv.handle(),
+      /*dxDesc=*/input_nd.handle(),
+      /*algo=*/ToConvBackwardDataAlgo(algorithm_desc),
+      /*sizeInBytes=*/&size_in_bytes));
+  int64 size_in_bytes_int64 = size_in_bytes;
+
   if (TF_PREDICT_FALSE(size_in_bytes_int64 < 0)) {
-    LOG(WARNING) << "cudnnGetConvolutionForwardWorkspaceSize() returned "
-                    "negative sizeInBytes value. This could be a cudnn bug.";
-    if (TF_PREDICT_TRUE(is_profiling)) {
-      return dnn::AlgorithmDesc();
-    }
-  } else if (size_in_bytes_int64 > 0) {
-    port::StatusOr<DeviceMemory<uint8>> allocated;
-    if (TF_PREDICT_TRUE(scratch_allocator)) {
-      allocated = scratch_allocator->AllocateBytes(stream, size_in_bytes);
-      if (TF_PREDICT_TRUE(allocated.ok())) {
-        *scratch = allocated.ValueOrDie();
-      } else {
-        if (TF_PREDICT_TRUE(is_profiling)) {
-          // Silently return when we are profiling.
-          return dnn::AlgorithmDesc();
-        }
-        LOG(WARNING) << allocated.status().error_message();
-        // For the int8 case, we fail at this point since the no_scratch
-        // algorithm should be set to dnn::kDefaultAlgorithm.
-        CHECK(!algorithm_config.algorithm_no_scratch().is_default())
-            << "The primary convolution algorithm failed memory allocation, "
-               "while a secondary algorithm is not provided.";
-      }
-    }
-    if (TF_PREDICT_FALSE(!allocated.ok())) {
-      if (algorithm_config.algorithm_no_scratch().is_default()) {
-        use_tensor_ops = true;
-        algo = GetCudnnConvolutionForwardAlgo(
-            cudnn, input_nd, filter, conv, output_nd,
-            /*specify_workspace_limit=*/false, 0);
-      } else {
-        use_tensor_ops = algorithm_config.algorithm().tensor_ops_enabled();
-        algo = ToConvForwardAlgo(algorithm_config.algorithm_no_scratch());
-      }
-    }
+    return port::Status(
+        port::error::INTERNAL,
+        "cudnnGetConvolutionBackwardDataWorkspaceSize() returned "
+        "negative sizeInBytes value. This could be a cudnn bug.");
+  }
+
+  if (size_in_bytes_int64 == 0) {
+    return DeviceMemory<uint8>();
+  }
+
+  if (TF_PREDICT_FALSE(!scratch_allocator)) {
+    return port::Status(port::error::INVALID_ARGUMENT,
+                        "No scratch allocator provided");
+  }
+
+  return scratch_allocator->AllocateBytes(stream, size_in_bytes);
+}
+
+port::StatusOr<DeviceMemory<uint8>>
+AllocateCudnnConvolutionBackwardFilterWorkspace(
+    Stream* stream, const CudnnHandle& cudnn,
+    const dnn::AlgorithmDesc& algorithm_desc,
+    const CudnnTensorDescriptor& input_nd, const CudnnFilterDescriptor& filter,
+    const CudnnConvolutionDescriptor& conv,
+    const CudnnTensorDescriptor& output_nd,
+    ScratchAllocator* scratch_allocator) {
+  // TODO(csigg): This has side effects on the convolution descriptor. It is
+  // functionally correct because the convolution is run with the algorithm of
+  // the last call to this function, but should be fixed anyway.
+  conv.set_use_tensor_op_math(algorithm_desc.tensor_ops_enabled());
+
+  // Query the size of the workspace and allocate it.
+  size_t size_in_bytes;
+  RETURN_IF_CUDNN_ERROR(cudnnGetConvolutionBackwardFilterWorkspaceSize(
+      cudnn.handle(),
+      /*xDesc=*/input_nd.handle(),
+      /*dyDesc=*/output_nd.handle(),
+      /*convDesc=*/conv.handle(),
+      /*gradDesc=*/filter.handle(),
+      /*algo=*/ToConvBackwardFilterAlgo(algorithm_desc),
+      /*sizeInBytes=*/&size_in_bytes));
+  int64 size_in_bytes_int64 = size_in_bytes;
+
+  if (TF_PREDICT_FALSE(size_in_bytes_int64 < 0)) {
+    return port::Status(
+        port::error::INTERNAL,
+        "cudnnGetConvolutionBackwardFilterWorkspaceSize() returned "
+        "negative sizeInBytes value. This could be a cudnn bug.");
+  }
+
+  if (size_in_bytes_int64 == 0) {
+    return DeviceMemory<uint8>();
+  }
+
+  if (TF_PREDICT_FALSE(!scratch_allocator)) {
+    return port::Status(port::error::INVALID_ARGUMENT,
+                        "No scratch allocator provided");
+  }
+
+  return scratch_allocator->AllocateBytes(stream, size_in_bytes);
+}
+
+port::StatusOr<dnn::AlgorithmDesc> GetCudnnConvolutionForwardAlgorithm(
+    Stream* stream, const CudnnHandle& cudnn,
+    const dnn::AlgorithmConfig& algorithm_config,
+    const CudnnTensorDescriptor& input_nd, const CudnnFilterDescriptor& filter,
+    const CudnnConvolutionDescriptor& conv,
+    const CudnnTensorDescriptor& output_nd, ScratchAllocator* scratch_allocator,
+    DeviceMemory<uint8>* scratch) {
+  dnn::AlgorithmDesc algo_desc = algorithm_config.algorithm();
+  if (algorithm_config.algorithm().is_default()) {
+    // Pick fastest algorithm within memory limit according to cuDNN's
+    // heuristics.
+    bool specify_workspace_limit = scratch_allocator != nullptr;
+    auto memory_limit_bytes =
+        specify_workspace_limit
+            ? std::max(scratch_allocator->GetMemoryLimitInBytes(stream), 0ll)
+            : 0ll;
+    SE_ASSIGN_OR_RETURN(cudnnConvolutionFwdAlgo_t algo,
+                        GetCudnnConvolutionForwardAlgo(
+                            cudnn, input_nd, filter, conv, output_nd,
+                            specify_workspace_limit, memory_limit_bytes));
+    algo_desc = dnn::AlgorithmDesc(
+        algo, algorithm_config.algorithm().tensor_ops_enabled());
+  }
+
+  auto scratch_or = AllocateCudnnConvolutionForwardWorkspace(
+      stream, cudnn, algo_desc, input_nd, filter, conv, output_nd,
+      scratch_allocator);
+
+  if (scratch_or.ok()) {
+    *scratch = scratch_or.ValueOrDie();
+    return algo_desc;
+  }
+
+  // Failed to allocate workspace for the first algorithm, fall back to the
+  // no_scratch algorithm.
+  if (algorithm_config.algorithm_no_scratch().is_default()) {
+    return port::Status(
+        port::error::INVALID_ARGUMENT,
+        "The primary convolution algorithm failed memory allocation, "
+        "while a secondary algorithm is not provided.");
+  }
+
+  SE_ASSIGN_OR_RETURN(
+      *scratch, AllocateCudnnConvolutionForwardWorkspace(
+                    stream, cudnn, algorithm_config.algorithm_no_scratch(),
+                    input_nd, filter, conv, output_nd, scratch_allocator));
+  return algorithm_config.algorithm_no_scratch();
+}
+
+port::StatusOr<dnn::AlgorithmDesc> GetCudnnConvolutionBackwardDataAlgorithm(
+    Stream* stream, const CudnnHandle& cudnn,
+    const dnn::AlgorithmConfig& algorithm_config,
+    const CudnnTensorDescriptor& input_nd, const CudnnFilterDescriptor& filter,
+    const CudnnConvolutionDescriptor& conv,
+    const CudnnTensorDescriptor& output_nd, ScratchAllocator* scratch_allocator,
+    DeviceMemory<uint8>* scratch) {
+  dnn::AlgorithmDesc algo_desc = algorithm_config.algorithm();
+  if (algorithm_config.algorithm().is_default()) {
+    // Pick fastest algorithm within memory limit according to cuDNN's
+    // heuristics.
+    bool specify_workspace_limit = scratch_allocator != nullptr;
+    auto memory_limit_bytes =
+        specify_workspace_limit
+            ? std::max(scratch_allocator->GetMemoryLimitInBytes(stream), 0ll)
+            : 0ll;
+    SE_ASSIGN_OR_RETURN(cudnnConvolutionBwdDataAlgo_t algo,
+                        GetCudnnConvolutionBackwardDataAlgo(
+                            cudnn, input_nd, filter, conv, output_nd,
+                            specify_workspace_limit, memory_limit_bytes));
+    algo_desc = dnn::AlgorithmDesc(
+        algo, algorithm_config.algorithm().tensor_ops_enabled());
+  }
+
+  auto scratch_or = AllocateCudnnConvolutionBackwardDataWorkspace(
+      stream, cudnn, algo_desc, input_nd, filter, conv, output_nd,
+      scratch_allocator);
+
+  if (scratch_or.ok()) {
+    *scratch = scratch_or.ValueOrDie();
+    return algo_desc;
+  }
+
+  // Failed to allocate workspace for the first algorithm, fall back to the
+  // no_scratch algorithm.
+  if (algorithm_config.algorithm_no_scratch().is_default()) {
+    return port::Status(
+        port::error::INVALID_ARGUMENT,
+        "The primary convolution algorithm failed memory allocation, "
+        "while a secondary algorithm is not provided.");
+  }
+
+  SE_ASSIGN_OR_RETURN(
+      *scratch, AllocateCudnnConvolutionBackwardDataWorkspace(
+                    stream, cudnn, algorithm_config.algorithm_no_scratch(),
+                    input_nd, filter, conv, output_nd, scratch_allocator));
+  return algorithm_config.algorithm_no_scratch();
+}
+
+port::StatusOr<dnn::AlgorithmDesc> GetCudnnConvolutionBackwardFilterAlgorithm(
+    Stream* stream, const CudnnHandle& cudnn,
+    const dnn::AlgorithmConfig& algorithm_config,
+    const CudnnTensorDescriptor& input_nd, const CudnnFilterDescriptor& filter,
+    const CudnnConvolutionDescriptor& conv,
+    const CudnnTensorDescriptor& output_nd, ScratchAllocator* scratch_allocator,
+    DeviceMemory<uint8>* scratch) {
+  dnn::AlgorithmDesc algo_desc = algorithm_config.algorithm();
+  if (algorithm_config.algorithm().is_default()) {
+    // Pick fastest algorithm within memory limit according to cuDNN's
+    // heuristics.
+    bool specify_workspace_limit = scratch_allocator != nullptr;
+    auto memory_limit_bytes =
+        specify_workspace_limit
+            ? std::max(scratch_allocator->GetMemoryLimitInBytes(stream), 0ll)
+            : 0ll;
+    SE_ASSIGN_OR_RETURN(cudnnConvolutionBwdFilterAlgo_t algo,
+                        GetCudnnConvolutionBackwardFilterAlgo(
+                            cudnn, input_nd, filter, conv, output_nd,
+                            specify_workspace_limit, memory_limit_bytes));
+    algo_desc = dnn::AlgorithmDesc(
+        algo, algorithm_config.algorithm().tensor_ops_enabled());
+  }
+
+  auto scratch_or = AllocateCudnnConvolutionBackwardFilterWorkspace(
+      stream, cudnn, algo_desc, input_nd, filter, conv, output_nd,
+      scratch_allocator);
+
+  if (scratch_or.ok()) {
+    *scratch = scratch_or.ValueOrDie();
+    return algo_desc;
   }
 
-  return dnn::AlgorithmDesc(algo, use_tensor_ops);
+  // Failed to allocate workspace for the first algorithm, fall back to the
+  // no_scratch algorithm.
+  if (algorithm_config.algorithm_no_scratch().is_default()) {
+    return port::Status(
+        port::error::INVALID_ARGUMENT,
+        "The primary convolution algorithm failed memory allocation, "
+        "while a secondary algorithm is not provided.");
+  }
+
+  SE_ASSIGN_OR_RETURN(*scratch,
+                      AllocateCudnnConvolutionBackwardFilterWorkspace(
+                          stream, cudnn, algorithm_config.algorithm(), input_nd,
+                          filter, conv, output_nd, scratch_allocator));
+  return algorithm_config.algorithm_no_scratch();
 }
 
 // A helper class to set env-vars and choose options for cudnn-related
@@ -2212,9 +2284,7 @@ class CudnnEnvVar {
 // algorithm through an env-var "TF_ENABLE_FFT_TILING_FORWARD=1".
 struct FftTilingForward {
   static constexpr const char* kName = "TF_ENABLE_FFT_TILING_FORWARD";
-  // TODO(csigg): Enabling this algo causes XLA test failures, for example in
-  // platforms/xla/tests/internal:convolution_test_gpu. See b/80018418.
-  static constexpr bool kDefaultFlag = false;  // CUDNN_VERSION >= 7000;
+  static constexpr bool kDefaultFlag = CUDNN_VERSION >= 7000;
 };
 
 // A helper struct to decide whether to enable the WINOGRAD_NONFUSED algorithms.
@@ -2279,8 +2349,6 @@ struct RnnDoFP32ComputationFP16Input {
   static constexpr bool kDefaultFlag = false;
 };
 
-// A helper function to return the internal compute type for
-// RNNs in cudnn.
 cudnnDataType_t GetRnnComputeType(dnn::DataType data_type) {
   switch (data_type) {
     case dnn::DataType::kFloat:
@@ -2301,7 +2369,7 @@ cudnnDataType_t GetRnnComputeType(dnn::DataType data_type) {
 }  // namespace
 
 template <class T>
-bool CudnnSupport::DoConvolveImpl(
+port::Status CudnnSupport::DoConvolveImpl(
     Stream* stream, const dnn::BatchDescriptor& input_descriptor,
     const DeviceMemory<T>& input_data,
     const dnn::FilterDescriptor& filter_descriptor,
@@ -2312,11 +2380,11 @@ bool CudnnSupport::DoConvolveImpl(
     const dnn::AlgorithmConfig& algorithm_config,
     dnn::ProfileResult* output_profile_result) {
   cudnnDataType_t cudnn_type = GetCudnnDataType<T>();
-  ScopedTensorDescriptor input_nd(input_descriptor, cudnn_type);
-  ScopedTensorDescriptor output_nd(output_descriptor, cudnn_type);
-  ScopedFilterDescriptor filter(filter_descriptor, cudnn_type);
-  ScopedConvolutionDescriptor conv(convolution_descriptor,
-                                   GetConvComputeType<T>());
+  CudnnTensorDescriptor input_nd(input_descriptor, cudnn_type);
+  CudnnTensorDescriptor output_nd(output_descriptor, cudnn_type);
+  CudnnFilterDescriptor filter(filter_descriptor, cudnn_type);
+  CudnnConvolutionDescriptor conv(convolution_descriptor,
+                                  GetConvComputeType<T>());
 
   auto cudnn = cudnn_->GetHandle(parent_, stream);
   // Alpha is the scaling factor for input.
@@ -2331,177 +2399,75 @@ bool CudnnSupport::DoConvolveImpl(
                                                : static_cast<void*>(&fbeta);
 
   const bool is_profiling = output_profile_result != nullptr;
-  cudnnConvolutionFwdAlgo_t algo;
-  bool use_tensor_ops;
-  DeviceMemory<uint8> scratch;
-
-  // TODO(pauldonnelly): Replace the following code with a call to
-  //   GetCudnnConvolutionForwardAlgorithm().
-  if (algorithm_config.algorithm().is_default()) {
-    // With the default algorithm, use Cudnn's heuristics.
-    auto get_algorithm = [&](bool specify_limit) {
-      cudnnConvolutionFwdPreference_t preference =
-          specify_limit ? CUDNN_CONVOLUTION_FWD_SPECIFY_WORKSPACE_LIMIT
-                        : CUDNN_CONVOLUTION_FWD_NO_WORKSPACE;
-
-      auto memory_limit_bytes =
-          scratch_allocator == nullptr
-              ? 0
-              : scratch_allocator->GetMemoryLimitInBytes(stream);
-      if (memory_limit_bytes < 0) {
-        memory_limit_bytes = 0;
-      }
 
-      cudnnConvolutionFwdAlgo_t algo_to_use;
-      auto status = cudnnGetConvolutionForwardAlgorithm(
-          cudnn.handle(), input_nd.handle(), filter.handle(), conv.handle(),
-          output_nd.handle(),
-          /*preference=*/preference,
-          /*memoryLimitInBytes=*/memory_limit_bytes,
-          /*algo=*/&algo_to_use);
-      CHECK_EQ(status, CUDNN_STATUS_SUCCESS) << "Unable to find a suitable "
-                                                "algorithm for doing forward "
-                                                "convolution";
-      return algo_to_use;
-    };
-
-    algo = get_algorithm(/*specify_limit=*/scratch_allocator != nullptr);
-    use_tensor_ops = true;
-    if (scratch_allocator != nullptr) {
-      size_t size_in_bytes;
-      auto status = cudnnGetConvolutionForwardWorkspaceSize(
-          cudnn.handle(),
-          /*xDesc=*/input_nd.handle(),
-          /*wDesc=*/filter.handle(), /*convDesc=*/conv.handle(),
-          /*yDesc=*/output_nd.handle(), /*algo=*/algo,
-          /*sizeInBytes=*/&size_in_bytes);
-      int64 size_in_bytes_int64 = size_in_bytes;
-      if (status == CUDNN_STATUS_SUCCESS && size_in_bytes_int64 != 0) {
-        if (size_in_bytes_int64 > 0) {
-          auto allocated =
-              scratch_allocator->AllocateBytes(stream, size_in_bytes);
-          if (allocated.ok()) {
-            scratch = allocated.ValueOrDie();
-          } else {
-            LOG(WARNING) << allocated.status().error_message();
-          }
-        } else {
-          LOG(WARNING)
-              << "cudnnGetConvolutionForwardWorkspaceSize() returned "
-                 "negative sizeInBytes value. This could be a cudnn bug.";
-        }
-      }
-    }
+  DeviceMemory<uint8> scratch;
+  SE_ASSIGN_OR_RETURN(dnn::AlgorithmDesc algo_desc,
+                      GetCudnnConvolutionForwardAlgorithm(
+                          stream, cudnn, algorithm_config, input_nd, filter,
+                          conv, output_nd, scratch_allocator, &scratch));
 
-    // If we didn't allocate any scratch space (perhaps because of failed
-    // allocation), we force a switch back to the "no workspace" algorithm.
-    if (scratch == nullptr) {
-      algo = get_algorithm(/*specify_limit=*/false);
-    }
-  } else {
-    // An algorithm has been specified.
-    dnn::AlgorithmDesc algotype = algorithm_config.algorithm();
-    algo = ToConvForwardAlgo(algotype);
-    use_tensor_ops = algotype.tensor_ops_enabled();
-    conv.set_use_tensor_op_math(use_tensor_ops);
-    size_t size_in_bytes;
-    auto status = cudnnGetConvolutionForwardWorkspaceSize(
-        cudnn.handle(),
-        /*xDesc=*/input_nd.handle(),
-        /*wDesc=*/filter.handle(), /*convDesc=*/conv.handle(),
-        /*yDesc=*/output_nd.handle(), /*algo=*/algo,
-        /*sizeInBytes=*/&size_in_bytes);
-    if (status != CUDNN_STATUS_SUCCESS) {
-      if (is_profiling) {
-        // Silently return when we are profiling.
-        return false;
-      }
-      LOG(FATAL) << "Cannot query the size of workspace needed for the given "
-                    "algorithm: "
-                 << algorithm_config.algorithm().algo_id();
-    }
-    int64 size_in_bytes_int64 = size_in_bytes;
-    if (size_in_bytes_int64 > 0) {
-      if (scratch_allocator == nullptr) {
-        LOG(FATAL) << "An allocator must be specified when scratch memory is "
-                      "needed";
-      }
-      auto allocated = scratch_allocator->AllocateBytes(stream, size_in_bytes);
-      if (is_profiling && !allocated.ok()) {
-        // Silently return when we are profiling.
-        return false;
-      }
-      if (allocated.ok()) {
-        scratch = allocated.ValueOrDie();
-      } else {
-        LOG(WARNING) << allocated.status().error_message();
-      }
-      if (scratch == nullptr) {
-        CHECK(!algorithm_config.algorithm_no_scratch().is_default())
-            << "The primary convolution algorithm failed memory allocation, "
-               "while a secondary algorithm is not provided.";
-        dnn::AlgorithmDesc algotype = algorithm_config.algorithm_no_scratch();
-        algo = ToConvForwardAlgo(algotype);
-        use_tensor_ops = algotype.tensor_ops_enabled();
-        conv.set_use_tensor_op_math(use_tensor_ops);
-      }
-    } else if (size_in_bytes_int64 < 0) {
-      LOG(WARNING) << "cudnnGetConvolutionForwardWorkspaceSize() returned "
-                      "negative sizeInBytes value. This could be a cudnn bug.";
-    }
-  }
-  std::unique_ptr<CUDATimer> timer;
+  std::unique_ptr<CUDATimer, TimerDeleter> timer;
   if (is_profiling) {
     timer.reset(new CUDATimer(parent_));  // NOLINT
-    if (!timer->Init()) {
-      return false;
-    }
     // The start and stop of the timer should be as close to the Cudnn call as
     // possible. It is still possible for other threads to issue workload on
     // to this stream. So it could take multiple profiling measurements.
-    if (!timer->Start(AsCUDAStream(stream))) {
-      timer->Destroy();
-      return false;
+    if (!timer->Init() || !timer->Start(AsCUDAStream(stream))) {
+      return port::Status(port::error::INTERNAL, "Failed to start timer");
     }
   }
-  auto status = cudnnConvolutionForward(
+
+  // Report an error if we might be hitting a cuDNN bug that accesses illegal
+  // memory. See nvbugs/2138754, b/80018418.
+  SE_RETURN_IF_ERROR([&] {
+    if (algo_desc.algo_id() != CUDNN_CONVOLUTION_FWD_ALGO_FFT_TILING) {
+      return port::Status::OK();
+    }
+    if (input_descriptor.ndims() < 3) {
+      return port::Status::OK();
+    }
+    // Checks that a*b is within the valid range (as provided by NVIDIA).
+    auto check_sizes = [](size_t a, size_t b) {
+      if ((a * b * 4608 - 1) >> 31 == 0) {
+        return port::Status::OK();
+      }
+      return port::Status(
+          port::error::FAILED_PRECONDITION,
+          "This configuration potentially accesses illegal memory.");
+    };
+    SE_RETURN_IF_ERROR(check_sizes(input_descriptor.feature_map_count(),
+                                   output_descriptor.feature_map_count()));
+    SE_RETURN_IF_ERROR(check_sizes(input_descriptor.count(),
+                                   input_descriptor.feature_map_count()));
+    SE_RETURN_IF_ERROR(check_sizes(input_descriptor.count(),
+                                   output_descriptor.feature_map_count()));
+    return port::Status::OK();
+  }());
+
+  RETURN_IF_CUDNN_ERROR(cudnnConvolutionForward(
       cudnn.handle(),
       /*alpha=*/alpha, /*srcDesc=*/input_nd.handle(),
       /*srcData=*/input_data.opaque(), /*filterDesc=*/filter.handle(),
       /*filterData=*/filter_data.opaque(), /*convDesc=*/conv.handle(),
-      /*algo=*/algo, /*workSpace=*/scratch.opaque(),
+      /*algo=*/ToConvForwardAlgo(algo_desc), /*workSpace=*/scratch.opaque(),
       /*workSpaceSizeInBytes=*/scratch.size(), /*beta=*/beta,
-      /*destDesc=*/output_nd.handle(), /*destData=*/output_data->opaque());
+      /*yDesc=*/output_nd.handle(), /*y=*/output_data->opaque()));
 
   if (is_profiling) {
     if (!timer->Stop(AsCUDAStream(stream))) {
-      timer->Destroy();
-      return false;
-    }
-    if (status == CUDNN_STATUS_SUCCESS) {
-      dnn::AlgorithmDesc algotype(algo, use_tensor_ops);
-      output_profile_result->set_algorithm(algotype);
-      output_profile_result->set_elapsed_time_in_ms(
-          timer->GetElapsedMilliseconds());
+      return port::Status(port::error::INTERNAL, "Failed to stop timer");
     }
-    timer->Destroy();
-  }
-
-  if (status != CUDNN_STATUS_SUCCESS) {
-    // Silently return when we are profiling.
-    if (!is_profiling) {
-      LOG(ERROR) << "failed to enqueue convolution on stream: "
-                 << ToString(status);
-    }
-    return false;
+    output_profile_result->set_algorithm(algo_desc);
+    output_profile_result->set_elapsed_time_in_ms(
+        timer->GetElapsedMilliseconds());
   }
 
-  return true;
+  return port::Status::OK();
 }
 
 template <typename Type, typename BiasType, typename ScaleType,
           int cudnn_data_type, int cudnn_compute_type>
-bool CudnnSupport::DoFusedConvolveImpl(
+port::Status CudnnSupport::DoFusedConvolveImpl(
     Stream* stream, const dnn::BatchDescriptor& conv_input_descriptor,
     const DeviceMemory<Type>& conv_input_data, ScaleType conv_input_scale,
     const dnn::FilterDescriptor& filter_descriptor,
@@ -2514,56 +2480,48 @@ bool CudnnSupport::DoFusedConvolveImpl(
     DeviceMemory<Type>* output_data, ScratchAllocator* scratch_allocator,
     const dnn::AlgorithmConfig& algorithm_config,
     dnn::ProfileResult* output_profile_result) {
-  ScopedTensorDescriptor conv_input_nd(
+  if (activation_mode != dnn::ActivationMode::kRelu) {
+    return port::Status(port::error::INVALID_ARGUMENT,
+                        "cudnnConvolutionBiasActivationForward() only supports "
+                        "Relu activation.");
+  }
+
+  CudnnTensorDescriptor conv_input_nd(
       conv_input_descriptor, static_cast<cudnnDataType_t>(cudnn_data_type));
-  ScopedTensorDescriptor output_nd(
+  CudnnTensorDescriptor output_nd(
       output_descriptor, static_cast<cudnnDataType_t>(cudnn_data_type));
-  ScopedFilterDescriptor filter(filter_descriptor,
-                                static_cast<cudnnDataType_t>(cudnn_data_type));
-  ScopedTensorDescriptor bias_nd(bias_descriptor, CUDNN_DATA_FLOAT);
-  ScopedConvolutionDescriptor conv(
+  CudnnFilterDescriptor filter(filter_descriptor,
+                               static_cast<cudnnDataType_t>(cudnn_data_type));
+  CudnnTensorDescriptor bias_nd(bias_descriptor, CUDNN_DATA_FLOAT);
+  CudnnConvolutionDescriptor conv(
       convolution_descriptor, static_cast<cudnnDataType_t>(cudnn_compute_type));
 
   auto cudnn = cudnn_->GetHandle(parent_, stream);
+
   const bool is_profiling = output_profile_result != nullptr;
-  DeviceMemory<uint8> scratch;
-  dnn::AlgorithmDesc algotype = GetCudnnConvolutionForwardAlgorithm(
-      stream, cudnn, algorithm_config, is_profiling, conv_input_nd, filter,
-      conv, output_nd, scratch_allocator, &scratch);
-  if (algotype.is_default()) {
-    if (!is_profiling) {
-      LOG(ERROR) << "No suitable algorithm found";
-    }
-    return false;
-  }
-  auto algo = static_cast<cudnnConvolutionFwdAlgo_t>(algotype.algo_id());
-  conv.set_use_tensor_op_math(algotype.tensor_ops_enabled());
 
-  if (activation_mode != dnn::ActivationMode::kRelu) {
-    LOG(ERROR) << "cudnnConvolutionBiasActivationForward() only supports Relu "
-                  "activation.";
-    return false;
-  }
+  DeviceMemory<uint8> scratch;
+  SE_ASSIGN_OR_RETURN(
+      dnn::AlgorithmDesc algo_desc,
+      GetCudnnConvolutionForwardAlgorithm(
+          stream, cudnn, algorithm_config, conv_input_nd, filter, conv,
+          output_nd, scratch_allocator, &scratch));
 
-  std::unique_ptr<CUDATimer> timer;
+  std::unique_ptr<CUDATimer, TimerDeleter> timer;
   if (is_profiling) {
     timer.reset(new CUDATimer(parent_));  // NOLINT
-    if (!timer->Init()) {
-      return false;
-    }
     // The start and stop of the timer should be as close to the Cudnn call as
     // possible. It is still possible for other threads to issue workload on
     // to this stream. So it could take multiple profiling measurements.
-    if (!timer->Start(AsCUDAStream(stream))) {
-      timer->Destroy();
-      return false;
+    if (!timer->Init() || !timer->Start(AsCUDAStream(stream))) {
+      return port::Status(port::error::INTERNAL, "Failed to start timer");
     }
   }
   // CUDNN v6 only supports CUDNN_NOT_PROPAGATE_NAN as the reluNanOpt for
   // activation descriptor. Note that this will change the nan propagation
   // behavior from separate conv, bias, and relu (which by default is
   // CUDNN_PROPAGATE_NAN.
-  ScopedActivationDescriptor activation_desc(
+  CudnnActivationDescriptor activation_desc(
       activation_mode, CUDNN_NOT_PROPAGATE_NAN, output_descriptor.value_max());
   auto side_input_data_ptr = (side_input_scale == 0) ? output_data->opaque()
                                                      : side_input_data.opaque();
@@ -2573,7 +2531,8 @@ bool CudnnSupport::DoFusedConvolveImpl(
           << "\nconv_input_data.opaque() = " << conv_input_data.opaque()
           << "\nfilter.handle() = " << filter.handle()
           << "\nfilter_data.opaque() = " << filter_data.opaque()
-          << "\nconv.handle() = " << conv.handle() << "\nalgo = " << algo
+          << "\nconv.handle() = " << conv.handle()
+          << "\nalgo = " << algo_desc.algo_id()
           << "\nscratch.opaque() = " << scratch.opaque()
           << "\nscratch.size() = " << scratch.size()
           << "\nside_input_scale = " << side_input_scale
@@ -2585,41 +2544,29 @@ bool CudnnSupport::DoFusedConvolveImpl(
           << "\noutput_nd.handle() = " << output_nd.handle()
           << "\noutput_data->opaque() = " << output_data->opaque();
 
-  auto status = cudnnConvolutionBiasActivationForward(
+  RETURN_IF_CUDNN_ERROR(cudnnConvolutionBiasActivationForward(
       cudnn.handle(),
       /*alpha1=*/&conv_input_scale,
       /*srcDesc=*/conv_input_nd.handle(), /*srcData=*/conv_input_data.opaque(),
       /*filterDesc=*/filter.handle(), /*filterData=*/filter_data.opaque(),
-      /*convDesc=*/conv.handle(), algo, /*workSpace=*/scratch.opaque(),
+      /*convDesc=*/conv.handle(), ToConvForwardAlgo(algo_desc),
+      /*workSpace=*/scratch.opaque(),
       /*workSpaceSizeInBytes=*/scratch.size(), /*alpha2=*/&side_input_scale,
       /*zDesc=*/output_nd.handle(), /*z=*/side_input_data_ptr,
       /*biasDesc=*/bias_nd.handle(), /*bias=*/biases.opaque(),
       /*activationDesc=*/activation_desc.handle(),
-      /*destDesc=*/output_nd.handle(), /*destData=*/output_data->opaque());
+      /*yDesc=*/output_nd.handle(), /*y=*/output_data->opaque()));
 
   if (is_profiling) {
     if (!timer->Stop(AsCUDAStream(stream))) {
-      timer->Destroy();
-      return false;
-    }
-    if (status == CUDNN_STATUS_SUCCESS) {
-      output_profile_result->set_algorithm(algotype);
-      output_profile_result->set_elapsed_time_in_ms(
-          timer->GetElapsedMilliseconds());
+      return port::Status(port::error::INTERNAL, "Failed to stop timer");
     }
-    timer->Destroy();
-  }
-
-  if (status != CUDNN_STATUS_SUCCESS) {
-    // Silently return when we are profiling.
-    if (!is_profiling) {
-      LOG(ERROR) << "failed to enqueue convolution on stream: "
-                 << ToString(status);
-    }
-    return false;
+    output_profile_result->set_algorithm(algo_desc);
+    output_profile_result->set_elapsed_time_in_ms(
+        timer->GetElapsedMilliseconds());
   }
 
-  return true;
+  return port::Status::OK();
 }
 
 bool CudnnSupport::GetConvolveAlgorithms(
@@ -2743,11 +2690,13 @@ bool CudnnSupport::DoBatchNormalizationForward(
     DeviceMemory<float>* saved_inv_var, bool is_training,
     std::function<const DeviceMemory<float>&()> var_to_inv_var,
     std::function<void()> inv_var_to_var) {
-  return DoBatchNormalizationForwardImpl<float, float>(
-      stream, dnn::DataType::kFloat, dnn::DataType::kFloat, x, scale, offset,
-      estimated_mean, estimated_variance, x_desc, scale_offset_desc, epsilon, y,
-      batch_mean, batch_var, saved_mean, saved_inv_var, is_training,
-      std::move(var_to_inv_var), std::move(inv_var_to_var));
+  return IsStatusOk(
+      DoBatchNormalizationForwardImpl<float, float>(
+          stream, dnn::DataType::kFloat, dnn::DataType::kFloat, x, scale,
+          offset, estimated_mean, estimated_variance, x_desc, scale_offset_desc,
+          epsilon, y, batch_mean, batch_var, saved_mean, saved_inv_var,
+          is_training, std::move(var_to_inv_var), std::move(inv_var_to_var)),
+      /*report_error=*/true);
 }
 
 bool CudnnSupport::DoBatchNormalizationForward(
@@ -2762,15 +2711,17 @@ bool CudnnSupport::DoBatchNormalizationForward(
     DeviceMemory<float>* saved_inv_var, bool is_training,
     std::function<const DeviceMemory<float>&()> var_to_inv_var,
     std::function<void()> inv_var_to_var) {
-  return DoBatchNormalizationForwardImpl<Eigen::half, float>(
-      stream, dnn::DataType::kHalf, dnn::DataType::kFloat, x, scale, offset,
-      estimated_mean, estimated_variance, x_desc, scale_offset_desc, epsilon, y,
-      batch_mean, batch_var, saved_mean, saved_inv_var, is_training,
-      std::move(var_to_inv_var), std::move(inv_var_to_var));
+  return IsStatusOk(
+      DoBatchNormalizationForwardImpl<Eigen::half, float>(
+          stream, dnn::DataType::kHalf, dnn::DataType::kFloat, x, scale, offset,
+          estimated_mean, estimated_variance, x_desc, scale_offset_desc,
+          epsilon, y, batch_mean, batch_var, saved_mean, saved_inv_var,
+          is_training, std::move(var_to_inv_var), std::move(inv_var_to_var)),
+      /*report_error=*/true);
 }
 
 template <class T, class U>
-bool CudnnSupport::DoBatchNormalizationForwardImpl(
+port::Status CudnnSupport::DoBatchNormalizationForwardImpl(
     Stream* stream, dnn::DataType input_data_type,
     dnn::DataType scale_data_type, const DeviceMemory<T>& x,
     const DeviceMemory<U>& scale, const DeviceMemory<U>& offset,
@@ -2782,8 +2733,8 @@ bool CudnnSupport::DoBatchNormalizationForwardImpl(
     DeviceMemory<U>* saved_mean, DeviceMemory<U>* saved_inv_var,
     bool is_training, std::function<const DeviceMemory<U>&()> var_to_inv_var,
     std::function<void()> inv_var_to_var) {
-  ScopedTensorDescriptor x_descriptor(x_desc, ToCudnnDataType(input_data_type));
-  ScopedTensorDescriptor scale_offset_descriptor(
+  CudnnTensorDescriptor x_descriptor(x_desc, ToCudnnDataType(input_data_type));
+  CudnnTensorDescriptor scale_offset_descriptor(
       scale_offset_desc, ToCudnnDataType(scale_data_type));
   cudnnBatchNormMode_t mode = CUDNN_BATCHNORM_SPATIAL;
 #if CUDNN_VERSION >= 7000
@@ -2795,7 +2746,6 @@ bool CudnnSupport::DoBatchNormalizationForwardImpl(
   float zero = 0.0;
   auto cudnn = cudnn_->GetHandle(parent_, stream);
 
-  auto status = CUDNN_STATUS_SUCCESS;
   if (is_training) {
     CHECK_EQ(batch_mean->is_null(), batch_var->is_null())
         << "batch_mean and batch_var must both be null or both be non-null";
@@ -2812,26 +2762,21 @@ bool CudnnSupport::DoBatchNormalizationForwardImpl(
       batch_var_opaque = nullptr;
     }
 
-    status = cudnnBatchNormalizationForwardTraining(
+    RETURN_IF_CUDNN_ERROR(cudnnBatchNormalizationForwardTraining(
         cudnn.handle(), mode, &one, &zero, x_descriptor.handle(), x.opaque(),
         x_descriptor.handle(), y->opaque(), scale_offset_descriptor.handle(),
         scale.opaque(), offset.opaque(), 1.0, batch_mean_opaque,
         batch_var_opaque, epsilon, saved_mean->opaque(),
-        saved_inv_var->opaque());
+        saved_inv_var->opaque()));
   } else {
     const void* maybe_inv_var = estimated_variance.opaque();
-    status = cudnnBatchNormalizationForwardInference(
+    RETURN_IF_CUDNN_ERROR(cudnnBatchNormalizationForwardInference(
         cudnn.handle(), mode, &one, &zero, x_descriptor.handle(), x.opaque(),
         x_descriptor.handle(), y->opaque(), scale_offset_descriptor.handle(),
         scale.opaque(), offset.opaque(), estimated_mean.opaque(), maybe_inv_var,
-        epsilon);
-  }
-  if (status != CUDNN_STATUS_SUCCESS) {
-    LOG(ERROR) << "failed to enqueue forward batch normalization on stream: "
-               << ToString(status);
-    return false;
+        epsilon));
   }
-  return true;
+  return port::Status::OK();
 }
 
 bool CudnnSupport::DoBatchNormalizationBackward(
@@ -2842,10 +2787,11 @@ bool CudnnSupport::DoBatchNormalizationBackward(
     const dnn::BatchDescriptor& scale_offset_desc, const double epsilon,
     DeviceMemory<float>* x_backprop, DeviceMemory<float>* scale_backprop,
     DeviceMemory<float>* offset_backprop) {
-  return DoBatchNormalizationBackwardImpl(
-      stream, CUDNN_DATA_FLOAT, CUDNN_DATA_FLOAT, y_backprop, x, scale, mean,
-      inv_var, x_desc, scale_offset_desc, epsilon, x_backprop, scale_backprop,
-      offset_backprop);
+  return IsStatusOk(DoBatchNormalizationBackwardImpl(
+                        stream, CUDNN_DATA_FLOAT, CUDNN_DATA_FLOAT, y_backprop,
+                        x, scale, mean, inv_var, x_desc, scale_offset_desc,
+                        epsilon, x_backprop, scale_backprop, offset_backprop),
+                    /*report_error=*/true);
 }
 
 bool CudnnSupport::DoBatchNormalizationBackward(
@@ -2856,14 +2802,15 @@ bool CudnnSupport::DoBatchNormalizationBackward(
     const dnn::BatchDescriptor& scale_offset_desc, const double epsilon,
     DeviceMemory<Eigen::half>* x_backprop, DeviceMemory<float>* scale_backprop,
     DeviceMemory<float>* offset_backprop) {
-  return DoBatchNormalizationBackwardImpl(
-      stream, CUDNN_DATA_HALF, CUDNN_DATA_FLOAT, y_backprop, x, scale, mean,
-      inv_var, x_desc, scale_offset_desc, epsilon, x_backprop, scale_backprop,
-      offset_backprop);
+  return IsStatusOk(DoBatchNormalizationBackwardImpl(
+                        stream, CUDNN_DATA_HALF, CUDNN_DATA_FLOAT, y_backprop,
+                        x, scale, mean, inv_var, x_desc, scale_offset_desc,
+                        epsilon, x_backprop, scale_backprop, offset_backprop),
+                    /*report_error=*/true);
 }
 
 template <class T, class U>
-bool CudnnSupport::DoBatchNormalizationBackwardImpl(
+port::Status CudnnSupport::DoBatchNormalizationBackwardImpl(
     Stream* stream, int cudnn_input_type, int cudnn_scale_type,
     const DeviceMemory<T>& y_backprop, const DeviceMemory<T>& x,
     const DeviceMemory<U>& scale, const DeviceMemory<U>& mean,
@@ -2871,9 +2818,9 @@ bool CudnnSupport::DoBatchNormalizationBackwardImpl(
     const dnn::BatchDescriptor& scale_offset_desc, const double epsilon,
     DeviceMemory<T>* x_backprop, DeviceMemory<U>* scale_backprop,
     DeviceMemory<U>* offset_backprop) {
-  ScopedTensorDescriptor x_descriptor(
+  CudnnTensorDescriptor x_descriptor(
       x_desc, static_cast<cudnnDataType_t>(cudnn_input_type));
-  ScopedTensorDescriptor scale_offset_descriptor(
+  CudnnTensorDescriptor scale_offset_descriptor(
       scale_offset_desc, static_cast<cudnnDataType_t>(cudnn_scale_type));
   cudnnBatchNormMode_t mode = CUDNN_BATCHNORM_SPATIAL;
 #if CUDNN_VERSION >= 7000
@@ -2886,19 +2833,14 @@ bool CudnnSupport::DoBatchNormalizationBackwardImpl(
 
   auto cudnn = cudnn_->GetHandle(parent_, stream);
 
-  auto status = cudnnBatchNormalizationBackward(
+  RETURN_IF_CUDNN_ERROR(cudnnBatchNormalizationBackward(
       cudnn.handle(), mode, &one, &zero, &one, &zero, x_descriptor.handle(),
       x.opaque(), x_descriptor.handle(), y_backprop.opaque(),
       x_descriptor.handle(), x_backprop->opaque(),
       scale_offset_descriptor.handle(), scale.opaque(),
       scale_backprop->opaque(), offset_backprop->opaque(), epsilon,
-      mean.opaque(), inv_var.opaque());
-  if (status != CUDNN_STATUS_SUCCESS) {
-    LOG(ERROR) << "failed to enqueue backward batch normalization on stream: "
-               << ToString(status);
-    return false;
-  }
-  return true;
+      mean.opaque(), inv_var.opaque()));
+  return port::Status::OK();
 }
 
 bool CudnnSupport::DoConvolve(
@@ -2911,10 +2853,12 @@ bool CudnnSupport::DoConvolve(
     DeviceMemory<float>* output_data, ScratchAllocator* scratch_allocator,
     const dnn::AlgorithmConfig& algorithm_config,
     dnn::ProfileResult* output_profile_result) {
-  return DoConvolveImpl<float>(
-      stream, batch_descriptor, input_data, filter_descriptor, filter_data,
-      convolution_descriptor, output_descriptor, output_data, scratch_allocator,
-      algorithm_config, output_profile_result);
+  return IsStatusOk(
+      DoConvolveImpl<float>(
+          stream, batch_descriptor, input_data, filter_descriptor, filter_data,
+          convolution_descriptor, output_descriptor, output_data,
+          scratch_allocator, algorithm_config, output_profile_result),
+      /*report_error=*/!output_profile_result);
 }
 
 bool CudnnSupport::DoConvolve(
@@ -2927,10 +2871,12 @@ bool CudnnSupport::DoConvolve(
     DeviceMemory<double>* output_data, ScratchAllocator* scratch_allocator,
     const dnn::AlgorithmConfig& algorithm_config,
     dnn::ProfileResult* output_profile_result) {
-  return DoConvolveImpl<double>(
-      stream, batch_descriptor, input_data, filter_descriptor, filter_data,
-      convolution_descriptor, output_descriptor, output_data, scratch_allocator,
-      algorithm_config, output_profile_result);
+  return IsStatusOk(
+      DoConvolveImpl<double>(
+          stream, batch_descriptor, input_data, filter_descriptor, filter_data,
+          convolution_descriptor, output_descriptor, output_data,
+          scratch_allocator, algorithm_config, output_profile_result),
+      /*report_error=*/!output_profile_result);
 }
 
 bool CudnnSupport::DoConvolve(
@@ -2943,10 +2889,12 @@ bool CudnnSupport::DoConvolve(
     DeviceMemory<Eigen::half>* output_data, ScratchAllocator* scratch_allocator,
     const dnn::AlgorithmConfig& algorithm_config,
     dnn::ProfileResult* output_profile_result) {
-  return DoConvolveImpl<Eigen::half>(
-      stream, batch_descriptor, input_data, filter_descriptor, filter_data,
-      convolution_descriptor, output_descriptor, output_data, scratch_allocator,
-      algorithm_config, output_profile_result);
+  return IsStatusOk(
+      DoConvolveImpl<Eigen::half>(
+          stream, batch_descriptor, input_data, filter_descriptor, filter_data,
+          convolution_descriptor, output_descriptor, output_data,
+          scratch_allocator, algorithm_config, output_profile_result),
+      /*report_error=*/!output_profile_result);
 }
 
 bool CudnnSupport::DoFusedConvolve(
@@ -2962,13 +2910,15 @@ bool CudnnSupport::DoFusedConvolve(
     DeviceMemory<double>* output_data, ScratchAllocator* scratch_allocator,
     const dnn::AlgorithmConfig& algorithm_config,
     dnn::ProfileResult* output_profile_result) {
-  return DoFusedConvolveImpl<double, double, double, CUDNN_DATA_DOUBLE,
-                             CUDNN_DATA_DOUBLE>(
-      stream, conv_input_descriptor, conv_input_data, conv_input_scale,
-      filter_descriptor, filter_data, convolution_descriptor, side_input_data,
-      side_input_scale, bias_descriptor, biases, activation_mode,
-      output_descriptor, output_data, scratch_allocator, algorithm_config,
-      output_profile_result);
+  return IsStatusOk(
+      DoFusedConvolveImpl<double, double, double, CUDNN_DATA_DOUBLE,
+                          CUDNN_DATA_DOUBLE>(
+          stream, conv_input_descriptor, conv_input_data, conv_input_scale,
+          filter_descriptor, filter_data, convolution_descriptor,
+          side_input_data, side_input_scale, bias_descriptor, biases,
+          activation_mode, output_descriptor, output_data, scratch_allocator,
+          algorithm_config, output_profile_result),
+      /*report_error=*/!output_profile_result);
 }
 
 bool CudnnSupport::DoFusedConvolve(
@@ -2984,13 +2934,15 @@ bool CudnnSupport::DoFusedConvolve(
     DeviceMemory<float>* output_data, ScratchAllocator* scratch_allocator,
     const dnn::AlgorithmConfig& algorithm_config,
     dnn::ProfileResult* output_profile_result) {
-  return DoFusedConvolveImpl<float, float, float, CUDNN_DATA_FLOAT,
-                             CUDNN_DATA_FLOAT>(
-      stream, conv_input_descriptor, conv_input_data, conv_input_scale,
-      filter_descriptor, filter_data, convolution_descriptor, side_input_data,
-      side_input_scale, bias_descriptor, biases, activation_mode,
-      output_descriptor, output_data, scratch_allocator, algorithm_config,
-      output_profile_result);
+  return IsStatusOk(
+      DoFusedConvolveImpl<float, float, float, CUDNN_DATA_FLOAT,
+                          CUDNN_DATA_FLOAT>(
+          stream, conv_input_descriptor, conv_input_data, conv_input_scale,
+          filter_descriptor, filter_data, convolution_descriptor,
+          side_input_data, side_input_scale, bias_descriptor, biases,
+          activation_mode, output_descriptor, output_data, scratch_allocator,
+          algorithm_config, output_profile_result),
+      /*report_error=*/!output_profile_result);
 }
 
 bool CudnnSupport::DoFusedConvolve(
@@ -3007,13 +2959,15 @@ bool CudnnSupport::DoFusedConvolve(
     DeviceMemory<Eigen::half>* output_data, ScratchAllocator* scratch_allocator,
     const dnn::AlgorithmConfig& algorithm_config,
     dnn::ProfileResult* output_profile_result) {
-  return DoFusedConvolveImpl<Eigen::half, Eigen::half, float, CUDNN_DATA_HALF,
-                             CUDNN_DATA_FLOAT>(
-      stream, conv_input_descriptor, conv_input_data, conv_input_scale,
-      filter_descriptor, filter_data, convolution_descriptor, side_input_data,
-      side_input_scale, bias_descriptor, biases, activation_mode,
-      output_descriptor, output_data, scratch_allocator, algorithm_config,
-      output_profile_result);
+  return IsStatusOk(
+      DoFusedConvolveImpl<Eigen::half, Eigen::half, float, CUDNN_DATA_HALF,
+                          CUDNN_DATA_FLOAT>(
+          stream, conv_input_descriptor, conv_input_data, conv_input_scale,
+          filter_descriptor, filter_data, convolution_descriptor,
+          side_input_data, side_input_scale, bias_descriptor, biases,
+          activation_mode, output_descriptor, output_data, scratch_allocator,
+          algorithm_config, output_profile_result),
+      /*report_error=*/!output_profile_result);
 }
 
 bool CudnnSupport::DoFusedConvolve(
@@ -3037,61 +2991,16 @@ bool CudnnSupport::DoFusedConvolve(
                     "supported on GPUs with compute capability 6.1 or later.";
     return false;
   }
-  return DoFusedConvolveImpl<int8, float, float, CUDNN_DATA_INT8x4,
-                             CUDNN_DATA_INT32>(
-      stream, conv_input_descriptor, conv_input_data, conv_input_scale,
-      filter_descriptor, filter_data, convolution_descriptor, side_input_data,
-      side_input_scale, bias_descriptor, biases, activation_mode,
-      output_descriptor, output_data, scratch_allocator, algorithm_config,
-      output_profile_result);
-}
-
-namespace {
-// NOTE(keveman): Temporary data layout transformation until cuDNN supports
-// kBatchYXDepth for backward pass. This function allocates temporary memory,
-// lays out the source data into the temporary but in the kBatchDepthXY
-// layout, and returns the temporary memory. The caller is responsible for
-// deallocating the temporary. Since the allocation is done using Stream's
-// AllocateTemporaryMemory, a later BlockHostUntilDone could be used for
-// deallocation.
-//
-// transform_scratch is populated with a legitimate temporary allocation iff
-// the original output data needs to be transformed.
-template <class T>
-DeviceMemory<T> MaybeTransformLayout(
-    Stream* stream, const CudnnHandle& cudnn,
-    dnn::BatchDescriptor* output_descriptor,
-    DeviceMemory<T> backward_output_data,
-    std::unique_ptr<TemporaryDeviceMemory<T>>* transform_scratch) {
-  if (output_descriptor->layout() == dnn::DataLayout::kBatchDepthYX) {
-    return backward_output_data;
-  }
-  CHECK(output_descriptor->layout() == dnn::DataLayout::kBatchYXDepth);
-  *transform_scratch =
-      stream->AllocateTemporaryArray<T>(backward_output_data.ElementCount())
-          .ConsumeValueOrDie();
-  dnn::BatchDescriptor transformed_output_descriptor;
-  transformed_output_descriptor.CloneFrom(*output_descriptor);
-  transformed_output_descriptor.set_layout(dnn::DataLayout::kBatchDepthYX);
-  cudnnDataType_t cudnn_type = GetCudnnDataType<T>();
-  ScopedTensorDescriptor orig_out_back_nd(*output_descriptor, cudnn_type);
-  ScopedTensorDescriptor transformed_out_back_nd(transformed_output_descriptor,
-                                                 cudnn_type);
-
-  float alpha = 1.0f;
-  float beta = 0.0f;
-  auto status = cudnnTransformTensor(
-      cudnn.handle(), &alpha, orig_out_back_nd.handle(),
-      backward_output_data.opaque(), &beta, transformed_out_back_nd.handle(),
-      (*transform_scratch)->mutable_device_memory()->opaque());
-
-  if (status != CUDNN_STATUS_SUCCESS) {
-    LOG(FATAL) << "Failed to transform the data layout.";
-  }
-  output_descriptor->set_layout(dnn::DataLayout::kBatchDepthYX);
-  return (*transform_scratch)->device_memory();
+  return IsStatusOk(
+      DoFusedConvolveImpl<int8, float, float, CUDNN_DATA_INT8x4,
+                          CUDNN_DATA_INT32>(
+          stream, conv_input_descriptor, conv_input_data, conv_input_scale,
+          filter_descriptor, filter_data, convolution_descriptor,
+          side_input_data, side_input_scale, bias_descriptor, biases,
+          activation_mode, output_descriptor, output_data, scratch_allocator,
+          algorithm_config, output_profile_result),
+      /*report_error=*/!output_profile_result);
 }
-}  // namespace
 
 bool CudnnSupport::DoTransformTensor(Stream* stream,
                                      const dnn::BatchDescriptor& input_desc,
@@ -3101,30 +3010,25 @@ bool CudnnSupport::DoTransformTensor(Stream* stream,
                                      dnn::DataType output_type, float scale,
                                      DeviceMemoryBase* output_data) {
   float beta = 0.0f;
-  ScopedTensorDescriptor input_tensor_desc(
+  CudnnTensorDescriptor input_tensor_desc(
       input_desc, ToCudnnDataType(input_type, input_desc.layout()));
-  ScopedTensorDescriptor output_tensor_desc(
+  CudnnTensorDescriptor output_tensor_desc(
       output_desc, ToCudnnDataType(output_type, output_desc.layout()));
   auto cudnn = cudnn_->GetHandle(parent_, stream);
-  auto status = cudnnTransformTensor(
-      cudnn.handle(), &scale, input_tensor_desc.handle(), input_data.opaque(),
-      &beta, output_tensor_desc.handle(), output_data->opaque());
-  if (status != CUDNN_STATUS_SUCCESS) {
-    LOG(ERROR) << "Could not transform a tensor with layout "
-               << input_desc.ToString() << " and data type "
-               << static_cast<int>(input_type) << " to another with layout "
-               << output_desc.ToString() << " and data type "
-               << static_cast<int>(output_type) << ": " << ToString(status);
-    return false;
-  }
-  return true;
+  auto status = [&] {
+    RETURN_IF_CUDNN_ERROR(cudnnTransformTensor(
+        cudnn.handle(), &scale, input_tensor_desc.handle(), input_data.opaque(),
+        &beta, output_tensor_desc.handle(), output_data->opaque()));
+    return port::Status::OK();
+  }();
+  return IsStatusOk(status, /*report_error=*/true);
 }
 
 template <class T>
-bool CudnnSupport::DoConvolveBackwardDataImpl(
+port::Status CudnnSupport::DoConvolveBackwardDataImpl(
     Stream* stream, const dnn::FilterDescriptor& filter_descriptor,
     const DeviceMemory<T>& filter_data,
-    const dnn::BatchDescriptor& output_descriptor_in,
+    const dnn::BatchDescriptor& output_descriptor,
     DeviceMemory<T> backward_output_data,
     const dnn::ConvolutionDescriptor& convolution_descriptor,
     const dnn::BatchDescriptor& input_descriptor,
@@ -3145,154 +3049,32 @@ bool CudnnSupport::DoConvolveBackwardDataImpl(
 
   auto cudnn = cudnn_->GetHandle(parent_, stream);
 
-  // TBD(keveman): remove once cuDNN supports kBatchYXDepth for backward pass.
-  dnn::BatchDescriptor output_descriptor;
-  output_descriptor.CloneFrom(output_descriptor_in);
-  std::unique_ptr<TemporaryDeviceMemory<T>> transform_scratch;
-  backward_output_data =
-      MaybeTransformLayout(stream, cudnn, &output_descriptor,
-                           backward_output_data, &transform_scratch);
-
-  ScopedTensorDescriptor out_back_nd(output_descriptor, cudnn_type);
-  ScopedTensorDescriptor in_back_nd(input_descriptor, cudnn_type);
-  ScopedFilterDescriptor filter(filter_descriptor, cudnn_type);
-  ScopedConvolutionDescriptor conv(convolution_descriptor,
-                                   GetConvComputeType<T>());
+  CudnnTensorDescriptor out_back_nd(output_descriptor, cudnn_type);
+  CudnnTensorDescriptor in_back_nd(input_descriptor, cudnn_type);
+  CudnnFilterDescriptor filter(filter_descriptor, cudnn_type);
+  CudnnConvolutionDescriptor conv(convolution_descriptor,
+                                  GetConvComputeType<T>());
 
   const bool is_profiling = output_profile_result != nullptr;
-  cudnnConvolutionBwdDataAlgo_t algo;
-  DeviceMemory<uint8> scratch;
 
-  if (algorithm_config.algorithm().is_default()) {
-    // With the default algorithm, use Cudnn's heuristics.
-    auto get_algorithm =
-        [&](bool specify_limit) -> cudnnConvolutionBwdDataAlgo_t {
-      cudnnConvolutionBwdDataPreference_t preference =
-          specify_limit ? CUDNN_CONVOLUTION_BWD_DATA_SPECIFY_WORKSPACE_LIMIT
-                        : CUDNN_CONVOLUTION_BWD_DATA_NO_WORKSPACE;
-
-      auto memory_limit_bytes =
-          scratch_allocator == nullptr
-              ? 0
-              : scratch_allocator->GetMemoryLimitInBytes(stream);
-      if (memory_limit_bytes < 0) {
-        memory_limit_bytes = 0;
-      }
-      cudnnConvolutionBwdDataAlgo_t algo_to_use;
-      cudnnStatus_t status = cudnnGetConvolutionBackwardDataAlgorithm(
-          cudnn.handle(),
-          /*filterDesc=*/filter.handle(),
-          /*diffDesc=*/out_back_nd.handle(),
-          /*convDesc=*/conv.handle(),
-          /*gradDesc=*/in_back_nd.handle(),
-          /*preference=*/preference,
-          /*memoryLimitInBytes=*/memory_limit_bytes,
-          /*algo=*/&algo_to_use);
-      CHECK_EQ(status, CUDNN_STATUS_SUCCESS) << "Unable to find a suitable "
-                                                "algorithm for doing backward "
-                                                "data convolution";
-      return algo_to_use;
-    };
-
-    algo = get_algorithm(/*specify_limit=*/scratch_allocator != nullptr);
-
-    if (scratch_allocator != nullptr) {
-      size_t size_in_bytes;
-      auto status = cudnnGetConvolutionBackwardDataWorkspaceSize(
-          cudnn.handle(),
-          /*filterDesc=*/filter.handle(),
-          /*diffDesc=*/out_back_nd.handle(),
-          /*convDesc=*/conv.handle(),
-          /*gradDesc=*/in_back_nd.handle(),
-          /*algo=*/algo,
-          /*sizeInBytes=*/&size_in_bytes);
-      int64 size_in_bytes_int64 = size_in_bytes;
-      if (status == CUDNN_STATUS_SUCCESS && size_in_bytes_int64 != 0) {
-        if (size_in_bytes_int64 > 0) {
-          auto allocated =
-              scratch_allocator->AllocateBytes(stream, size_in_bytes);
-          if (allocated.ok()) {
-            scratch = allocated.ValueOrDie();
-          } else {
-            LOG(WARNING) << allocated.status().error_message();
-          }
-        } else {
-          LOG(WARNING)
-              << "cudnnGetConvolutionBackwardDataWorkspaceSize() returned "
-                 "negative sizeInBytes value. This could be a cudnn bug.";
-        }
-      }
-    }
-
-    // If we didn't allocate any scratch space (perhaps because of failed
-    // allocation), we force a switch back to the "no workspace" algorithm.
-    if (scratch == nullptr) {
-      algo = get_algorithm(/*specify_limit=*/false);
-    }
-  } else {
-    // An algorithm has been specified.
-    dnn::AlgorithmDesc algotype = algorithm_config.algorithm();
-    algo = ToConvBackwardDataAlgo(algotype);
-    conv.set_use_tensor_op_math(algotype.tensor_ops_enabled());
-    size_t size_in_bytes;
-    auto status = cudnnGetConvolutionBackwardDataWorkspaceSize(
-        cudnn.handle(),
-        /*filterDesc=*/filter.handle(),
-        /*diffDesc=*/out_back_nd.handle(),
-        /*convDesc=*/conv.handle(),
-        /*gradDesc=*/in_back_nd.handle(),
-        /*algo=*/algo,
-        /*sizeInBytes=*/&size_in_bytes);
-    if (status != CUDNN_STATUS_SUCCESS) {
-      if (is_profiling) {
-        // Silently return when we are profiling.
-        return false;
-      }
-      LOG(FATAL) << "Cannot query the size of workspace needed for the given "
-                    "algorithm: "
-                 << algorithm_config.algorithm().algo_id();
-    }
-    int64 size_in_bytes_int64 = size_in_bytes;
-    if (size_in_bytes_int64 > 0) {
-      if (scratch_allocator == nullptr) {
-        LOG(FATAL) << "An allocator must be specified when scratch memory is "
-                      "needed";
-      }
-      auto allocated = scratch_allocator->AllocateBytes(stream, size_in_bytes);
-      if (is_profiling && !allocated.ok()) {
-        // Silently return when we are profiling.
-        return false;
-      }
-      if (allocated.ok()) {
-        scratch = allocated.ValueOrDie();
-      } else {
-        LOG(WARNING) << allocated.status().error_message();
-      }
-      if (scratch == nullptr) {
-        CHECK(!algorithm_config.algorithm_no_scratch().is_default())
-            << "The primary convolution algorithm failed memory allocation, "
-               "while a secondary algorithm is not provided.";
-        dnn::AlgorithmDesc algotype = algorithm_config.algorithm_no_scratch();
-        algo = ToConvBackwardDataAlgo(algotype);
-        conv.set_use_tensor_op_math(algotype.tensor_ops_enabled());
-      }
-    } else if (size_in_bytes_int64 < 0) {
-      LOG(WARNING) << "cudnnGetConvolutionBackwardDataWorkspaceSize() returned "
-                      "negative sizeInBytes value. This could be a cudnn bug.";
-    }
-  }
+  DeviceMemory<uint8> scratch;
+  SE_ASSIGN_OR_RETURN(dnn::AlgorithmDesc algo_desc,
+                      GetCudnnConvolutionBackwardDataAlgorithm(
+                          stream, cudnn, algorithm_config, in_back_nd, filter,
+                          conv, out_back_nd, scratch_allocator, &scratch));
 
-  std::unique_ptr<CUDATimer> timer;
+  std::unique_ptr<CUDATimer, TimerDeleter> timer;
   if (is_profiling) {
     timer.reset(new CUDATimer(parent_));  // NOLINT
-    timer->Init();
     // The start and stop of the timer should be as close to the Cudnn call as
     // possible. It is still possible for other threads to issue workload on
     // to this stream. So it could take multiple profiling measurements.
-    timer->Start(AsCUDAStream(stream));
+    if (!timer->Init() || !timer->Start(AsCUDAStream(stream))) {
+      return port::Status(port::error::INTERNAL, "Failed to start timer");
+    }
   }
 
-  auto status =
+  RETURN_IF_CUDNN_ERROR(
       cudnnConvolutionBackwardData(cudnn.handle(),
                                    /*alpha=*/alpha,
                                    /*wDesc=*/filter.handle(),
@@ -3300,32 +3082,22 @@ bool CudnnSupport::DoConvolveBackwardDataImpl(
                                    /*dyDesc=*/out_back_nd.handle(),
                                    /*dy=*/backward_output_data.opaque(),
                                    /*convDesc=*/conv.handle(),
-                                   /*algo=*/algo,
+                                   /*algo=*/ToConvBackwardDataAlgo(algo_desc),
                                    /*workSpace=*/scratch.opaque(),
                                    /*workSpaceSizeInBytes=*/scratch.size(),
                                    /*beta=*/beta,
                                    /*dxDesc=*/in_back_nd.handle(),
-                                   /*dx=*/backward_input_data->opaque());
+                                   /*dx=*/backward_input_data->opaque()));
   if (is_profiling) {
-    timer->Stop(AsCUDAStream(stream));
-    if (status == CUDNN_STATUS_SUCCESS) {
-      bool use_tensor_ops = algorithm_config.algorithm().tensor_ops_enabled();
-      dnn::AlgorithmDesc algotype(algo, use_tensor_ops);
-      output_profile_result->set_algorithm(algotype);
-      output_profile_result->set_elapsed_time_in_ms(
-          timer->GetElapsedMilliseconds());
-    }
-    timer->Destroy();
-  }
-  if (status != CUDNN_STATUS_SUCCESS) {
-    // Silently return when we are profiling.
-    if (!is_profiling) {
-      LOG(ERROR) << "failed to enqueue convolution on stream: "
-                 << ToString(status);
+    if (!timer->Stop(AsCUDAStream(stream))) {
+      return port::Status(port::error::INTERNAL, "Failed to stop timer");
     }
-    return false;
+    output_profile_result->set_algorithm(algo_desc);
+    output_profile_result->set_elapsed_time_in_ms(
+        timer->GetElapsedMilliseconds());
   }
-  return true;
+
+  return port::Status::OK();
 }
 
 bool CudnnSupport::DoConvolveBackwardData(
@@ -3339,11 +3111,13 @@ bool CudnnSupport::DoConvolveBackwardData(
     ScratchAllocator* scratch_allocator,
     const dnn::AlgorithmConfig& algorithm_config,
     dnn::ProfileResult* output_profile_result) {
-  return DoConvolveBackwardDataImpl(stream, filter_descriptor, filter_data,
-                                    output_descriptor, backward_output_data,
-                                    convolution_descriptor, input_descriptor,
-                                    backward_input_data, scratch_allocator,
-                                    algorithm_config, output_profile_result);
+  return IsStatusOk(
+      DoConvolveBackwardDataImpl(stream, filter_descriptor, filter_data,
+                                 output_descriptor, backward_output_data,
+                                 convolution_descriptor, input_descriptor,
+                                 backward_input_data, scratch_allocator,
+                                 algorithm_config, output_profile_result),
+      /*report_error=*/!output_profile_result);
 }
 
 bool CudnnSupport::DoConvolveBackwardData(
@@ -3357,11 +3131,13 @@ bool CudnnSupport::DoConvolveBackwardData(
     ScratchAllocator* scratch_allocator,
     const dnn::AlgorithmConfig& algorithm_config,
     dnn::ProfileResult* output_profile_result) {
-  return DoConvolveBackwardDataImpl(stream, filter_descriptor, filter_data,
-                                    output_descriptor, backward_output_data,
-                                    convolution_descriptor, input_descriptor,
-                                    backward_input_data, scratch_allocator,
-                                    algorithm_config, output_profile_result);
+  return IsStatusOk(
+      DoConvolveBackwardDataImpl(stream, filter_descriptor, filter_data,
+                                 output_descriptor, backward_output_data,
+                                 convolution_descriptor, input_descriptor,
+                                 backward_input_data, scratch_allocator,
+                                 algorithm_config, output_profile_result),
+      /*report_error=*/!output_profile_result);
 }
 
 bool CudnnSupport::DoConvolveBackwardData(
@@ -3375,18 +3151,20 @@ bool CudnnSupport::DoConvolveBackwardData(
     ScratchAllocator* scratch_allocator,
     const dnn::AlgorithmConfig& algorithm_config,
     dnn::ProfileResult* output_profile_result) {
-  return DoConvolveBackwardDataImpl(stream, filter_descriptor, filter_data,
-                                    output_descriptor, backward_output_data,
-                                    convolution_descriptor, input_descriptor,
-                                    backward_input_data, scratch_allocator,
-                                    algorithm_config, output_profile_result);
+  return IsStatusOk(
+      DoConvolveBackwardDataImpl(stream, filter_descriptor, filter_data,
+                                 output_descriptor, backward_output_data,
+                                 convolution_descriptor, input_descriptor,
+                                 backward_input_data, scratch_allocator,
+                                 algorithm_config, output_profile_result),
+      /*report_error=*/!output_profile_result);
 }
 
 template <class T>
-bool CudnnSupport::DoConvolveBackwardFilterImpl(
+port::Status CudnnSupport::DoConvolveBackwardFilterImpl(
     Stream* stream, const dnn::BatchDescriptor& input_descriptor,
     const DeviceMemory<T>& input_data,
-    const dnn::BatchDescriptor& output_descriptor_in,
+    const dnn::BatchDescriptor& output_descriptor,
     DeviceMemory<T> backward_output_data,
     const dnn::ConvolutionDescriptor& convolution_descriptor,
     const dnn::FilterDescriptor& filter_descriptor,
@@ -3407,156 +3185,60 @@ bool CudnnSupport::DoConvolveBackwardFilterImpl(
 
   auto cudnn = cudnn_->GetHandle(parent_, stream);
 
-  // TBD(keveman): remove once cuDNN supports kBatchYXDepth for backward pass.
-  dnn::BatchDescriptor output_descriptor;
-  output_descriptor.CloneFrom(output_descriptor_in);
-  std::unique_ptr<TemporaryDeviceMemory<T>> transform_scratch;
-  backward_output_data =
-      MaybeTransformLayout(stream, cudnn, &output_descriptor,
-                           backward_output_data, &transform_scratch);
-
-  ScopedTensorDescriptor out_back_nd(output_descriptor, cudnn_type);
-  ScopedTensorDescriptor input_nd(input_descriptor, cudnn_type);
-  ScopedFilterDescriptor filter(filter_descriptor, cudnn_type);
-  ScopedConvolutionDescriptor conv(convolution_descriptor,
-                                   GetConvComputeType<T>());
+  CudnnTensorDescriptor out_back_nd(output_descriptor, cudnn_type);
+  CudnnTensorDescriptor input_nd(input_descriptor, cudnn_type);
+  CudnnFilterDescriptor filter(filter_descriptor, cudnn_type);
+  CudnnConvolutionDescriptor conv(convolution_descriptor,
+                                  GetConvComputeType<T>());
 
   const bool is_profiling = output_profile_result != nullptr;
-  cudnnConvolutionBwdFilterAlgo_t algo;
-  DeviceMemory<uint8> scratch;
-
-  if (algorithm_config.algorithm().is_default()) {
-    // With the default algorithm, use Cudnn's heuristics.
-
-    // Lambda that retrieves the algorithm.
-    // specify_limit will occur when we have a scratch allocator and it succeeds
-    // in allocating; otherwise, we'll fall back to the "no workspace" version.
-    auto get_algorithm = [&](bool specify_limit) {
-      cudnnConvolutionBwdFilterPreference_t preference =
-          specify_limit ? CUDNN_CONVOLUTION_BWD_FILTER_SPECIFY_WORKSPACE_LIMIT
-                        : CUDNN_CONVOLUTION_BWD_FILTER_NO_WORKSPACE;
-
-      auto memory_limit_bytes =
-          scratch_allocator == nullptr
-              ? 0
-              : scratch_allocator->GetMemoryLimitInBytes(stream);
-      if (memory_limit_bytes < 0) {
-        memory_limit_bytes = 0;
-      }
 
-      cudnnConvolutionBwdFilterAlgo_t algo_to_use;
-      cudnnStatus_t status = cudnnGetConvolutionBackwardFilterAlgorithm(
-          cudnn.handle(),
-          /*srcDesc=*/input_nd.handle(),
-          /*diffDesc=*/out_back_nd.handle(),
-          /*convDesc=*/conv.handle(),
-          /*gradDesc=*/filter.handle(),
-          /*preference=*/preference,
-          /*memoryLimitInBytes=*/memory_limit_bytes,
-          /*algo=*/&algo_to_use);
-      CHECK_EQ(status, CUDNN_STATUS_SUCCESS) << "Unable to find a suitable "
-                                                "algorithm for doing backward "
-                                                "filter convolution";
-      return algo_to_use;
-    };
-
-    algo = get_algorithm(/*specify_limit=*/scratch_allocator != nullptr);
-
-    if (scratch_allocator != nullptr) {
-      size_t size_in_bytes;
-      auto status = cudnnGetConvolutionBackwardFilterWorkspaceSize(
-          cudnn.handle(),
-          /*xDesc=*/input_nd.handle(),
-          /*dyDesc=*/out_back_nd.handle(), /*convDesc=*/conv.handle(),
-          /*gradDesc=*/filter.handle(), /*algo=*/algo,
-          /*sizeInBytes=*/&size_in_bytes);
-      int64 size_in_bytes_int64 = size_in_bytes;
-      if (status == CUDNN_STATUS_SUCCESS && size_in_bytes_int64 != 0) {
-        if (size_in_bytes_int64 > 0) {
-          auto allocated =
-              scratch_allocator->AllocateBytes(stream, size_in_bytes);
-          if (allocated.ok()) {
-            scratch = allocated.ValueOrDie();
-          } else {
-            LOG(WARNING) << allocated.status().error_message();
-          }
-        } else {
-          LOG(WARNING)
-              << "cudnnGetConvolutionBackwardFilterWorkspaceSize() returned "
-                 "negative sizeInBytes value. This could be a cudnn bug.";
-        }
-      }
-    }
-
-    // If we didn't allocate any scratch space (perhaps because of failed
-    // allocation), we force a switch back to the "no workspace" algorithm.
-    if (scratch == nullptr) {
-      algo = get_algorithm(/*specify_limit=*/false);
-    }
-  } else {
-    // An algorithm has been specified.
-    dnn::AlgorithmDesc algotype = algorithm_config.algorithm();
-    algo = ToConvBackwardFilterAlgo(algotype);
-    conv.set_use_tensor_op_math(algotype.tensor_ops_enabled());
-
-    size_t size_in_bytes;
-    auto status = cudnnGetConvolutionBackwardFilterWorkspaceSize(
-        cudnn.handle(),
-        /*xDesc=*/input_nd.handle(),
-        /*dyDesc=*/out_back_nd.handle(), /*convDesc=*/conv.handle(),
-        /*gradDesc=*/filter.handle(), /*algo=*/algo,
-        /*sizeInBytes=*/&size_in_bytes);
-    if (status != CUDNN_STATUS_SUCCESS) {
-      if (is_profiling) {
-        // Silently return when we are profiling.
-        return false;
-      }
-      LOG(FATAL) << "Cannot query the size of workspace needed for the given "
-                    "algorithm: "
-                 << algorithm_config.algorithm().algo_id();
-    }
-    int64 size_in_bytes_int64 = size_in_bytes;
-    if (size_in_bytes_int64 > 0) {
-      if (scratch_allocator == nullptr) {
-        LOG(FATAL) << "An allocator must be specified when scratch memory is "
-                      "needed";
-      }
-      auto allocated = scratch_allocator->AllocateBytes(stream, size_in_bytes);
-      if (is_profiling && !allocated.ok()) {
-        // Silently return when we are profiling.
-        return false;
-      }
-      if (allocated.ok()) {
-        scratch = allocated.ValueOrDie();
-      } else {
-        LOG(WARNING) << allocated.status().error_message();
-      }
-      if (scratch == nullptr) {
-        CHECK(!algorithm_config.algorithm_no_scratch().is_default())
-            << "The primary convolution algorithm failed memory allocation, "
-               "while a secondary algorithm is not provided.";
-        dnn::AlgorithmDesc algotype = algorithm_config.algorithm_no_scratch();
-        algo = ToConvBackwardFilterAlgo(algotype);
-        conv.set_use_tensor_op_math(algotype.tensor_ops_enabled());
-      }
-    } else if (size_in_bytes_int64 < 0) {
-      LOG(WARNING)
-          << "cudnnGetConvolutionBackwardFilterWorkspaceSize() returned "
-             "negative sizeInBytes value. This could be a cudnn bug.";
-    }
-  }
+  DeviceMemory<uint8> scratch;
+  SE_ASSIGN_OR_RETURN(dnn::AlgorithmDesc algo_desc,
+                      GetCudnnConvolutionBackwardFilterAlgorithm(
+                          stream, cudnn, algorithm_config, input_nd, filter,
+                          conv, out_back_nd, scratch_allocator, &scratch));
 
-  std::unique_ptr<CUDATimer> timer;
+  std::unique_ptr<CUDATimer, TimerDeleter> timer;
   if (is_profiling) {
     timer.reset(new CUDATimer(parent_));  // NOLINT
-    timer->Init();
     // The start and stop of the timer should be as close to the Cudnn call as
     // possible. It is still possible for other threads to issue workload on
     // to this stream. So it could take multiple profiling measurements.
-    timer->Start(AsCUDAStream(stream));
+    if (!timer->Init() || !timer->Start(AsCUDAStream(stream))) {
+      return port::Status(port::error::INTERNAL, "Failed to start timer");
+    }
   }
 
-  auto status = cudnnConvolutionBackwardFilter(
+  // Report an error if we might be hitting a cuDNN bug that produces incorrect
+  // results. See nvbugs/2072856
+  SE_RETURN_IF_ERROR([&] {
+    if (algo_desc.algo_id() != CUDNN_CONVOLUTION_BWD_FILTER_ALGO_FFT_TILING) {
+      return port::Status::OK();
+    }
+    if (output_descriptor.height() > 1 && output_descriptor.width() > 1) {
+      return port::Status::OK();
+    }
+    int convolution_size = output_descriptor.height() > 1
+                               ? filter_descriptor.input_filter_height()
+                               : filter_descriptor.input_filter_width();
+    if (convolution_size <= 32) {
+      return port::Status::OK();
+    }
+    cudnnConvolutionMode_t convolution_mode;
+    cudnnDataType_t compute_type;
+    RETURN_IF_CUDNN_ERROR(cudnnGetConvolutionNdDescriptor(
+        conv.handle(), 0, nullptr, nullptr, nullptr, nullptr, &convolution_mode,
+        &compute_type));
+    if (convolution_mode != CUDNN_CONVOLUTION) {
+      return port::Status::OK();
+    }
+    return port::Status(
+        port::error::FAILED_PRECONDITION,
+        "This configuration potentially produces incorrect results.");
+  }());
+
+  RETURN_IF_CUDNN_ERROR(cudnnConvolutionBackwardFilter(
       cudnn.handle(),
       /*alpha=*/alpha,
       /*srcDesc=*/input_nd.handle(),
@@ -3564,33 +3246,22 @@ bool CudnnSupport::DoConvolveBackwardFilterImpl(
       /*diffDesc=*/out_back_nd.handle(),
       /*diffData=*/backward_output_data.opaque(),
       /*convDesc=*/conv.handle(),
-      /*algo=*/algo,
+      /*algo=*/ToConvBackwardFilterAlgo(algo_desc),
       /*workSpace=*/scratch.opaque(),
       /*workSpaceSizeInBytes=*/scratch.size(),
       /*beta=*/beta,
       /*gradDesc=*/filter.handle(),
-      /*gradData=*/backward_filter_data->opaque());
-
+      /*dw=*/backward_filter_data->opaque()));
   if (is_profiling) {
-    timer->Stop(AsCUDAStream(stream));
-    if (status == CUDNN_STATUS_SUCCESS) {
-      bool use_tensor_ops = algorithm_config.algorithm().tensor_ops_enabled();
-      dnn::AlgorithmDesc algotype(algo, use_tensor_ops);
-      output_profile_result->set_algorithm(algotype);
-      output_profile_result->set_elapsed_time_in_ms(
-          timer->GetElapsedMilliseconds());
-    }
-    timer->Destroy();
-  }
-  if (status != CUDNN_STATUS_SUCCESS) {
-    // Silently return when we are profiling.
-    if (!is_profiling) {
-      LOG(ERROR) << "failed to enqueue convolution on stream: "
-                 << ToString(status);
+    if (!timer->Stop(AsCUDAStream(stream))) {
+      return port::Status(port::error::INTERNAL, "Failed to stop timer");
     }
-    return false;
+    output_profile_result->set_algorithm(algo_desc);
+    output_profile_result->set_elapsed_time_in_ms(
+        timer->GetElapsedMilliseconds());
   }
-  return true;
+
+  return port::Status::OK();
 }
 
 bool CudnnSupport::DoConvolveBackwardFilter(
@@ -3604,11 +3275,13 @@ bool CudnnSupport::DoConvolveBackwardFilter(
     ScratchAllocator* scratch_allocator,
     const dnn::AlgorithmConfig& algorithm_config,
     dnn::ProfileResult* output_profile_result) {
-  return DoConvolveBackwardFilterImpl(stream, input_descriptor, input_data,
-                                      output_descriptor, backward_output_data,
-                                      convolution_descriptor, filter_descriptor,
-                                      backward_filter_data, scratch_allocator,
-                                      algorithm_config, output_profile_result);
+  return IsStatusOk(
+      DoConvolveBackwardFilterImpl(stream, input_descriptor, input_data,
+                                   output_descriptor, backward_output_data,
+                                   convolution_descriptor, filter_descriptor,
+                                   backward_filter_data, scratch_allocator,
+                                   algorithm_config, output_profile_result),
+      /*report_error=*/!output_profile_result);
 }
 
 bool CudnnSupport::DoConvolveBackwardFilter(
@@ -3622,11 +3295,13 @@ bool CudnnSupport::DoConvolveBackwardFilter(
     ScratchAllocator* scratch_allocator,
     const dnn::AlgorithmConfig& algorithm_config,
     dnn::ProfileResult* output_profile_result) {
-  return DoConvolveBackwardFilterImpl(stream, input_descriptor, input_data,
-                                      output_descriptor, backward_output_data,
-                                      convolution_descriptor, filter_descriptor,
-                                      backward_filter_data, scratch_allocator,
-                                      algorithm_config, output_profile_result);
+  return IsStatusOk(
+      DoConvolveBackwardFilterImpl(stream, input_descriptor, input_data,
+                                   output_descriptor, backward_output_data,
+                                   convolution_descriptor, filter_descriptor,
+                                   backward_filter_data, scratch_allocator,
+                                   algorithm_config, output_profile_result),
+      /*report_error=*/!output_profile_result);
 }
 
 bool CudnnSupport::DoConvolveBackwardFilter(
@@ -3640,22 +3315,24 @@ bool CudnnSupport::DoConvolveBackwardFilter(
     ScratchAllocator* scratch_allocator,
     const dnn::AlgorithmConfig& algorithm_config,
     dnn::ProfileResult* output_profile_result) {
-  return DoConvolveBackwardFilterImpl(stream, input_descriptor, input_data,
-                                      output_descriptor, backward_output_data,
-                                      convolution_descriptor, filter_descriptor,
-                                      backward_filter_data, scratch_allocator,
-                                      algorithm_config, output_profile_result);
+  return IsStatusOk(
+      DoConvolveBackwardFilterImpl(stream, input_descriptor, input_data,
+                                   output_descriptor, backward_output_data,
+                                   convolution_descriptor, filter_descriptor,
+                                   backward_filter_data, scratch_allocator,
+                                   algorithm_config, output_profile_result),
+      /*report_error=*/!output_profile_result);
 }
 
 template <class T>
-bool CudnnSupport::DoConvolveBackwardBiasImpl(
+port::Status CudnnSupport::DoConvolveBackwardBiasImpl(
     Stream* stream, const dnn::BatchDescriptor& input_descriptor,
     const DeviceMemory<T>& input_data,
     const dnn::BatchDescriptor& bias_descriptor,
     DeviceMemory<T>* backward_bias_data) {
   cudnnDataType_t cudnn_type = GetCudnnDataType<T>();
-  ScopedTensorDescriptor input_nd(input_descriptor, cudnn_type);
-  ScopedTensorDescriptor bias_nd(bias_descriptor, cudnn_type);
+  CudnnTensorDescriptor input_nd(input_descriptor, cudnn_type);
+  CudnnTensorDescriptor bias_nd(bias_descriptor, cudnn_type);
 
   // Alpha is the scaling factor for input.
   float alpha = 1.0;
@@ -3663,15 +3340,10 @@ bool CudnnSupport::DoConvolveBackwardBiasImpl(
   float beta = 0.0;
 
   auto cudnn = cudnn_->GetHandle(parent_, stream);
-  auto status = cudnnConvolutionBackwardBias(
+  RETURN_IF_CUDNN_ERROR(cudnnConvolutionBackwardBias(
       cudnn.handle(), &alpha, input_nd.handle(), input_data.opaque(), &beta,
-      bias_nd.handle(), backward_bias_data->opaque());
-  if (status != CUDNN_STATUS_SUCCESS) {
-    LOG(ERROR) << "failed to enqueue backward convolution on stream: "
-               << ToString(status);
-    return false;
-  }
-  return true;
+      bias_nd.handle(), backward_bias_data->opaque()));
+  return port::Status::OK();
 }
 
 bool CudnnSupport::DoConvolveBackwardBias(
@@ -3679,8 +3351,10 @@ bool CudnnSupport::DoConvolveBackwardBias(
     const DeviceMemory<double>& input_data,
     const dnn::BatchDescriptor& bias_descriptor,
     DeviceMemory<double>* backward_bias_data) {
-  return DoConvolveBackwardBiasImpl(stream, input_descriptor, input_data,
-                                    bias_descriptor, backward_bias_data);
+  return IsStatusOk(
+      DoConvolveBackwardBiasImpl(stream, input_descriptor, input_data,
+                                 bias_descriptor, backward_bias_data),
+      /*report_error=*/true);
 }
 
 bool CudnnSupport::DoConvolveBackwardBias(
@@ -3688,8 +3362,10 @@ bool CudnnSupport::DoConvolveBackwardBias(
     const DeviceMemory<float>& input_data,
     const dnn::BatchDescriptor& bias_descriptor,
     DeviceMemory<float>* backward_bias_data) {
-  return DoConvolveBackwardBiasImpl(stream, input_descriptor, input_data,
-                                    bias_descriptor, backward_bias_data);
+  return IsStatusOk(
+      DoConvolveBackwardBiasImpl(stream, input_descriptor, input_data,
+                                 bias_descriptor, backward_bias_data),
+      /*report_error=*/true);
 }
 
 bool CudnnSupport::DoConvolveBackwardBias(
@@ -3697,8 +3373,10 @@ bool CudnnSupport::DoConvolveBackwardBias(
     const DeviceMemory<Eigen::half>& input_data,
     const dnn::BatchDescriptor& bias_descriptor,
     DeviceMemory<Eigen::half>* backward_bias_data) {
-  return DoConvolveBackwardBiasImpl(stream, input_descriptor, input_data,
-                                    bias_descriptor, backward_bias_data);
+  return IsStatusOk(
+      DoConvolveBackwardBiasImpl(stream, input_descriptor, input_data,
+                                 bias_descriptor, backward_bias_data),
+      /*report_error=*/true);
 }
 
 bool CudnnSupport::DoMatMul(Stream* stream,
@@ -3841,7 +3519,7 @@ bool CudnnSupport::DoBiasAdd(Stream* stream,
                              const DeviceMemory<float>& biases,
                              const dnn::BatchDescriptor& dimensions,
                              DeviceMemory<float>* output_data) {
-  ScopedTensorDescriptor input_descriptor(dimensions, CUDNN_DATA_FLOAT);
+  CudnnTensorDescriptor input_descriptor(dimensions, CUDNN_DATA_FLOAT);
 
   dnn::BatchDescriptor bias_dimensions;
   bias_dimensions.set_count(1)
@@ -3849,7 +3527,7 @@ bool CudnnSupport::DoBiasAdd(Stream* stream,
       .set_height(1)
       .set_width(1)
       .set_layout(dnn::DataLayout::kBatchYXDepth);
-  ScopedTensorDescriptor bias_descriptor(bias_dimensions, CUDNN_DATA_FLOAT);
+  CudnnTensorDescriptor bias_descriptor(bias_dimensions, CUDNN_DATA_FLOAT);
 
   // cudnnAddTensor after R3 is in-place, so we need to copy input_data to
   // output_data before doing the addition, unless the input and
@@ -3870,16 +3548,13 @@ bool CudnnSupport::DoBiasAdd(Stream* stream,
 
   auto cudnn = cudnn_->GetHandle(parent_, stream);
 
-  auto status = cudnnAddTensor(
-      cudnn.handle(), &alpha, bias_descriptor.handle(), biases.opaque(), &beta,
-      input_descriptor.handle(), output_data->opaque());
-
-  if (status != CUDNN_STATUS_SUCCESS) {
-    LOG(ERROR) << "stream " << stream << " could not enqueue bias addition.";
-    return false;
-  }
-
-  return true;
+  auto status = [&] {
+    RETURN_IF_CUDNN_ERROR(cudnnAddTensor(
+        cudnn.handle(), &alpha, bias_descriptor.handle(), biases.opaque(),
+        &beta, input_descriptor.handle(), output_data->opaque()));
+    return port::Status::OK();
+  }();
+  return IsStatusOk(status, /*report_error=*/true);
 }
 
 bool CudnnSupport::DoActivate(Stream* stream,
@@ -3888,26 +3563,23 @@ bool CudnnSupport::DoActivate(Stream* stream,
                               const DeviceMemory<float>& input_data,
                               DeviceMemory<float>* output_data,
                               uint64 options) {
-  ScopedActivationDescriptor activation_desc(
+  CudnnActivationDescriptor activation_desc(
       activation_mode, CUDNN_PROPAGATE_NAN, dimensions.value_max());
 
-  ScopedTensorDescriptor input_nd(dimensions, CUDNN_DATA_FLOAT);
+  CudnnTensorDescriptor input_nd(dimensions, CUDNN_DATA_FLOAT);
   // Alpha is the input scaling factor.
   float alpha = 1.0;
   // Beta is the output scaling factor.
   float beta = 0.0;
 
   auto cudnn = cudnn_->GetHandle(parent_, stream);
-  auto status = cudnnActivationForward(
-      cudnn.handle(), activation_desc.handle(), &alpha, input_nd.handle(),
-      input_data.opaque(), &beta, input_nd.handle(), output_data->opaque());
-  if (status != CUDNN_STATUS_SUCCESS) {
-    LOG(ERROR) << "stream " << stream
-               << " could not enqueue activation: " << ToString(status);
-    return false;
-  }
-
-  return true;
+  auto status = [&] {
+    RETURN_IF_CUDNN_ERROR(cudnnActivationForward(
+        cudnn.handle(), activation_desc.handle(), &alpha, input_nd.handle(),
+        input_data.opaque(), &beta, input_nd.handle(), output_data->opaque()));
+    return port::Status::OK();
+  }();
+  return IsStatusOk(status, /*report_error=*/true);
 }
 
 bool CudnnSupport::DoPoolForward(
@@ -3921,20 +3593,18 @@ bool CudnnSupport::DoPoolForward(
   // Beta is the scaling factor for output.
   double beta = 0.0;
 
-  ScopedTensorDescriptor src_desc(input_dimensions, CUDNN_DATA_DOUBLE);
-  ScopedTensorDescriptor dest_desc(output_dimensions, CUDNN_DATA_DOUBLE);
-  ScopedPoolingDescriptor pooling_desc(pooling_dimensions);
+  CudnnTensorDescriptor src_desc(input_dimensions, CUDNN_DATA_DOUBLE);
+  CudnnTensorDescriptor dest_desc(output_dimensions, CUDNN_DATA_DOUBLE);
+  CudnnPoolingDescriptor pooling_desc(pooling_dimensions);
 
   auto cudnn = cudnn_->GetHandle(parent_, stream);
-  auto status = cudnnPoolingForward(
-      cudnn.handle(), pooling_desc.handle(), &alpha, src_desc.handle(),
-      input_data.opaque(), &beta, dest_desc.handle(), output_data->opaque());
-  if (status != CUDNN_STATUS_SUCCESS) {
-    LOG(ERROR) << "failed to enqueue forward pooling on stream: "
-               << ToString(status);
-    return false;
-  }
-  return true;
+  auto status = [&] {
+    RETURN_IF_CUDNN_ERROR(cudnnPoolingForward(
+        cudnn.handle(), pooling_desc.handle(), &alpha, src_desc.handle(),
+        input_data.opaque(), &beta, dest_desc.handle(), output_data->opaque()));
+    return port::Status::OK();
+  }();
+  return IsStatusOk(status, /*report_error=*/true);
 }
 
 bool CudnnSupport::DoPoolForward(
@@ -3948,20 +3618,18 @@ bool CudnnSupport::DoPoolForward(
   // Beta is the scaling factor for output.
   float beta = 0.0;
 
-  ScopedTensorDescriptor src_desc(input_dimensions, CUDNN_DATA_FLOAT);
-  ScopedTensorDescriptor dest_desc(output_dimensions, CUDNN_DATA_FLOAT);
-  ScopedPoolingDescriptor pooling_desc(pooling_dimensions);
+  CudnnTensorDescriptor src_desc(input_dimensions, CUDNN_DATA_FLOAT);
+  CudnnTensorDescriptor dest_desc(output_dimensions, CUDNN_DATA_FLOAT);
+  CudnnPoolingDescriptor pooling_desc(pooling_dimensions);
 
   auto cudnn = cudnn_->GetHandle(parent_, stream);
-  auto status = cudnnPoolingForward(
-      cudnn.handle(), pooling_desc.handle(), &alpha, src_desc.handle(),
-      input_data.opaque(), &beta, dest_desc.handle(), output_data->opaque());
-  if (status != CUDNN_STATUS_SUCCESS) {
-    LOG(ERROR) << "failed to enqueue forward pooling on stream: "
-               << ToString(status);
-    return false;
-  }
-  return true;
+  auto status = [&] {
+    RETURN_IF_CUDNN_ERROR(cudnnPoolingForward(
+        cudnn.handle(), pooling_desc.handle(), &alpha, src_desc.handle(),
+        input_data.opaque(), &beta, dest_desc.handle(), output_data->opaque()));
+    return port::Status::OK();
+  }();
+  return IsStatusOk(status, /*report_error=*/true);
 }
 
 bool CudnnSupport::DoPoolForward(
@@ -3975,19 +3643,17 @@ bool CudnnSupport::DoPoolForward(
   // Beta is the scaling factor for output.
   float beta = 0.0;
 
-  ScopedTensorDescriptor src_desc(input_dimensions, CUDNN_DATA_HALF);
-  ScopedTensorDescriptor dest_desc(output_dimensions, CUDNN_DATA_HALF);
-  ScopedPoolingDescriptor pooling_desc(pooling_dimensions);
+  CudnnTensorDescriptor src_desc(input_dimensions, CUDNN_DATA_HALF);
+  CudnnTensorDescriptor dest_desc(output_dimensions, CUDNN_DATA_HALF);
+  CudnnPoolingDescriptor pooling_desc(pooling_dimensions);
   auto cudnn = cudnn_->GetHandle(parent_, stream);
-  auto status = cudnnPoolingForward(
-      cudnn.handle(), pooling_desc.handle(), &alpha, src_desc.handle(),
-      input_data.opaque(), &beta, dest_desc.handle(), output_data->opaque());
-  if (status != CUDNN_STATUS_SUCCESS) {
-    LOG(ERROR) << "failed to enqueue forward pooling on stream: "
-               << ToString(status);
-    return false;
-  }
-  return true;
+  auto status = [&] {
+    RETURN_IF_CUDNN_ERROR(cudnnPoolingForward(
+        cudnn.handle(), pooling_desc.handle(), &alpha, src_desc.handle(),
+        input_data.opaque(), &beta, dest_desc.handle(), output_data->opaque()));
+    return port::Status::OK();
+  }();
+  return IsStatusOk(status, /*report_error=*/true);
 }
 
 bool CudnnSupport::DoPoolBackward(
@@ -4003,22 +3669,20 @@ bool CudnnSupport::DoPoolBackward(
   // Beta is the scaling factor for output.
   double beta = 0.0;
 
-  ScopedTensorDescriptor src_desc(input_dimensions, CUDNN_DATA_DOUBLE);
-  ScopedTensorDescriptor dest_desc(output_dimensions, CUDNN_DATA_DOUBLE);
-  ScopedPoolingDescriptor pooling_desc(pooling_dimensions);
+  CudnnTensorDescriptor src_desc(input_dimensions, CUDNN_DATA_DOUBLE);
+  CudnnTensorDescriptor dest_desc(output_dimensions, CUDNN_DATA_DOUBLE);
+  CudnnPoolingDescriptor pooling_desc(pooling_dimensions);
 
   auto cudnn = cudnn_->GetHandle(parent_, stream);
-  auto status = cudnnPoolingBackward(
-      cudnn.handle(), pooling_desc.handle(), &alpha, dest_desc.handle(),
-      output_data.opaque(), dest_desc.handle(), input_diff_data.opaque(),
-      src_desc.handle(), input_data.opaque(), &beta, src_desc.handle(),
-      output_diff_data->opaque());
-  if (status != CUDNN_STATUS_SUCCESS) {
-    LOG(ERROR) << "failed to enqueue backward pooling on stream: "
-               << ToString(status);
-    return false;
-  }
-  return true;
+  auto status = [&] {
+    RETURN_IF_CUDNN_ERROR(cudnnPoolingBackward(
+        cudnn.handle(), pooling_desc.handle(), &alpha, dest_desc.handle(),
+        output_data.opaque(), dest_desc.handle(), input_diff_data.opaque(),
+        src_desc.handle(), input_data.opaque(), &beta, src_desc.handle(),
+        output_diff_data->opaque()));
+    return port::Status::OK();
+  }();
+  return IsStatusOk(status, /*report_error=*/true);
 }
 
 bool CudnnSupport::DoPoolBackward(
@@ -4034,22 +3698,20 @@ bool CudnnSupport::DoPoolBackward(
   // Beta is the scaling factor for output.
   float beta = 0.0;
 
-  ScopedTensorDescriptor src_desc(input_dimensions, CUDNN_DATA_FLOAT);
-  ScopedTensorDescriptor dest_desc(output_dimensions, CUDNN_DATA_FLOAT);
-  ScopedPoolingDescriptor pooling_desc(pooling_dimensions);
+  CudnnTensorDescriptor src_desc(input_dimensions, CUDNN_DATA_FLOAT);
+  CudnnTensorDescriptor dest_desc(output_dimensions, CUDNN_DATA_FLOAT);
+  CudnnPoolingDescriptor pooling_desc(pooling_dimensions);
 
   auto cudnn = cudnn_->GetHandle(parent_, stream);
-  auto status = cudnnPoolingBackward(
-      cudnn.handle(), pooling_desc.handle(), &alpha, dest_desc.handle(),
-      output_data.opaque(), dest_desc.handle(), input_diff_data.opaque(),
-      src_desc.handle(), input_data.opaque(), &beta, src_desc.handle(),
-      output_diff_data->opaque());
-  if (status != CUDNN_STATUS_SUCCESS) {
-    LOG(ERROR) << "failed to enqueue backward pooling on stream: "
-               << ToString(status);
-    return false;
-  }
-  return true;
+  auto status = [&] {
+    RETURN_IF_CUDNN_ERROR(cudnnPoolingBackward(
+        cudnn.handle(), pooling_desc.handle(), &alpha, dest_desc.handle(),
+        output_data.opaque(), dest_desc.handle(), input_diff_data.opaque(),
+        src_desc.handle(), input_data.opaque(), &beta, src_desc.handle(),
+        output_diff_data->opaque()));
+    return port::Status::OK();
+  }();
+  return IsStatusOk(status, /*report_error=*/true);
 }
 
 bool CudnnSupport::DoPoolBackward(
@@ -4065,22 +3727,20 @@ bool CudnnSupport::DoPoolBackward(
   // Beta is the scaling factor for output.
   float beta = 0.0;
 
-  ScopedTensorDescriptor src_desc(input_dimensions, CUDNN_DATA_HALF);
-  ScopedTensorDescriptor dest_desc(output_dimensions, CUDNN_DATA_HALF);
-  ScopedPoolingDescriptor pooling_desc(pooling_dimensions);
+  CudnnTensorDescriptor src_desc(input_dimensions, CUDNN_DATA_HALF);
+  CudnnTensorDescriptor dest_desc(output_dimensions, CUDNN_DATA_HALF);
+  CudnnPoolingDescriptor pooling_desc(pooling_dimensions);
 
   auto cudnn = cudnn_->GetHandle(parent_, stream);
-  auto status = cudnnPoolingBackward(
-      cudnn.handle(), pooling_desc.handle(), &alpha, dest_desc.handle(),
-      output_data.opaque(), dest_desc.handle(), input_diff_data.opaque(),
-      src_desc.handle(), input_data.opaque(), &beta, src_desc.handle(),
-      output_diff_data->opaque());
-  if (status != CUDNN_STATUS_SUCCESS) {
-    LOG(ERROR) << "failed to enqueue backward pooling on stream: "
-               << ToString(status);
-    return false;
-  }
-  return true;
+  auto status = [&] {
+    RETURN_IF_CUDNN_ERROR(cudnnPoolingBackward(
+        cudnn.handle(), pooling_desc.handle(), &alpha, dest_desc.handle(),
+        output_data.opaque(), dest_desc.handle(), input_diff_data.opaque(),
+        src_desc.handle(), input_data.opaque(), &beta, src_desc.handle(),
+        output_diff_data->opaque()));
+    return port::Status::OK();
+  }();
+  return IsStatusOk(status, /*report_error=*/true);
 }
 
 bool CudnnSupport::DoNormalize(
@@ -4104,8 +3764,8 @@ bool CudnnSupport::DoNormalizeWithDimensions(
     return false;
   }
 
-  ScopedTensorDescriptor dims(dimensions, CUDNN_DATA_FLOAT);
-  ScopedNormalizeDescriptor normalize(normalize_descriptor);
+  CudnnTensorDescriptor dims(dimensions, CUDNN_DATA_FLOAT);
+  CudnnNormalizeDescriptor normalize(normalize_descriptor);
 
   // Alpha is the scaling factor for input.
   float alpha = 1.0f;
@@ -4115,15 +3775,14 @@ bool CudnnSupport::DoNormalizeWithDimensions(
   auto cudnn = cudnn_->GetHandle(parent_, stream);
 
   // Launch the normalization.
-  auto status = cudnnLRNCrossChannelForward(
-      cudnn.handle(), normalize.handle(), CUDNN_LRN_CROSS_CHANNEL_DIM1, &alpha,
-      dims.handle(), input_data.opaque(), &beta, dims.handle(),
-      output_data->opaque());
-  if (status != CUDNN_STATUS_SUCCESS) {
-    LOG(ERROR) << "failed to run cudnnLRNCrossChannelForward";
-    return false;
-  }
-  return true;
+  auto status = [&] {
+    RETURN_IF_CUDNN_ERROR(cudnnLRNCrossChannelForward(
+        cudnn.handle(), normalize.handle(), CUDNN_LRN_CROSS_CHANNEL_DIM1,
+        &alpha, dims.handle(), input_data.opaque(), &beta, dims.handle(),
+        output_data->opaque()));
+    return port::Status::OK();
+  }();
+  return IsStatusOk(status, /*report_error=*/true);
 }
 
 bool CudnnSupport::DoNormalizeBackwardWithDimensions(
@@ -4142,23 +3801,22 @@ bool CudnnSupport::DoNormalizeBackwardWithDimensions(
     return false;
   }
 
-  ScopedTensorDescriptor dims(dimensions, CUDNN_DATA_FLOAT);
-  ScopedNormalizeDescriptor normalize(normalize_descriptor);
+  CudnnTensorDescriptor dims(dimensions, CUDNN_DATA_FLOAT);
+  CudnnNormalizeDescriptor normalize(normalize_descriptor);
 
   float alpha = 1.0f;
   float beta = 0.0f;
 
   auto cudnn = cudnn_->GetHandle(parent_, stream);
-  auto status = cudnnLRNCrossChannelBackward(
-      cudnn.handle(), normalize.handle(), CUDNN_LRN_CROSS_CHANNEL_DIM1, &alpha,
-      dims.handle(), normalized_data.opaque(), dims.handle(),
-      normalized_variable_gradient.opaque(), dims.handle(), raw_data.opaque(),
-      &beta, dims.handle(), raw_variable_gradient->opaque());
-  if (status != CUDNN_STATUS_SUCCESS) {
-    LOG(ERROR) << "failed to run cudnnLRNCrossChannelBackward";
-    return false;
-  }
-  return true;
+  auto status = [&] {
+    RETURN_IF_CUDNN_ERROR(cudnnLRNCrossChannelBackward(
+        cudnn.handle(), normalize.handle(), CUDNN_LRN_CROSS_CHANNEL_DIM1,
+        &alpha, dims.handle(), normalized_data.opaque(), dims.handle(),
+        normalized_variable_gradient.opaque(), dims.handle(), raw_data.opaque(),
+        &beta, dims.handle(), raw_variable_gradient->opaque()));
+    return port::Status::OK();
+  }();
+  return IsStatusOk(status, /*report_error=*/true);
 }
 
 bool CudnnSupport::DoDepthConcatenate(
@@ -4267,30 +3925,26 @@ bool CudnnSupport::DeriveOutputBatchDescriptor(
     const dnn::FilterDescriptor& filter_descriptor,
     const dnn::ConvolutionDescriptor& convolution_descriptor,
     dnn::BatchDescriptor* output_batch_descriptor) {
-  ScopedTensorDescriptor input_nd(batch_descriptor, CUDNN_DATA_FLOAT);
-  ScopedFilterDescriptor filter(filter_descriptor, CUDNN_DATA_FLOAT);
-  ScopedConvolutionDescriptor conv(convolution_descriptor, CUDNN_DATA_FLOAT);
+  CudnnTensorDescriptor input_nd(batch_descriptor, CUDNN_DATA_FLOAT);
+  CudnnFilterDescriptor filter(filter_descriptor, CUDNN_DATA_FLOAT);
+  CudnnConvolutionDescriptor conv(convolution_descriptor, CUDNN_DATA_FLOAT);
 
   int dn = batch_descriptor.ndims() + 2;
   std::vector<int> dims(dn);  // in BDYX
-  auto status = cudnnGetConvolutionNdForwardOutputDim(
-      conv.handle(), input_nd.handle(), filter.handle(), dn, dims.data());
-  if (status != CUDNN_STATUS_SUCCESS) {
-    LOG(ERROR) << "could not get output tensor for convolution: "
-               << ToString(status);
-    return false;
-  }
-
-  output_batch_descriptor->set_count(dims[0])
-      .set_feature_map_count(dims[1])
-      .set_layout(batch_descriptor.layout());
-
-  for (int i = 0; i < batch_descriptor.ndims(); i++) {
-    output_batch_descriptor->set_spatial_dim(static_cast<dnn::DimIndex>(i),
-                                             dims.rbegin()[i]);
-  }
+  auto status = [&] {
+    RETURN_IF_CUDNN_ERROR(cudnnGetConvolutionNdForwardOutputDim(
+        conv.handle(), input_nd.handle(), filter.handle(), dn, dims.data()));
+    output_batch_descriptor->set_count(dims[0])
+        .set_feature_map_count(dims[1])
+        .set_layout(batch_descriptor.layout());
 
-  return true;
+    for (int i = 0; i < batch_descriptor.ndims(); i++) {
+      output_batch_descriptor->set_spatial_dim(static_cast<dnn::DimIndex>(i),
+                                               dims.rbegin()[i]);
+    }
+    return port::Status::OK();
+  }();
+  return IsStatusOk(status, /*report_error=*/true);
 }
 
 }  // namespace cuda
diff --git a/tensorflow/stream_executor/cuda/cuda_dnn.h b/tensorflow/stream_executor/cuda/cuda_dnn.h
index e2de3c62d81ae56c28fd4b888c74435ceecc6b22..c924d41cb5239d704e658f0b5452e04087caeba2 100644
--- a/tensorflow/stream_executor/cuda/cuda_dnn.h
+++ b/tensorflow/stream_executor/cuda/cuda_dnn.h
@@ -631,7 +631,7 @@ class CudnnSupport : public dnn::DnnSupport {
   std::unique_ptr<class CudnnAccess> cudnn_;
 
   template <class T, class U>
-  bool DoBatchNormalizationForwardImpl(
+  port::Status DoBatchNormalizationForwardImpl(
       Stream* stream, dnn::DataType input_data_type,
       dnn::DataType scale_data_type, const DeviceMemory<T>& x,
       const DeviceMemory<U>& scale, const DeviceMemory<U>& offset,
@@ -646,7 +646,7 @@ class CudnnSupport : public dnn::DnnSupport {
       std::function<void()> inv_var_to_var);
 
   template <class T, class U>
-  bool DoBatchNormalizationBackwardImpl(
+  port::Status DoBatchNormalizationBackwardImpl(
       Stream* stream, int cudnn_input_type, int cudnn_scale_type,
       const DeviceMemory<T>& y_backprop, const DeviceMemory<T>& x,
       const DeviceMemory<U>& scale, const DeviceMemory<U>& mean,
@@ -656,21 +656,20 @@ class CudnnSupport : public dnn::DnnSupport {
       DeviceMemory<U>* offset_backprop);
 
   template <class T>
-  bool DoConvolveImpl(Stream* stream,
-                      const dnn::BatchDescriptor& input_descriptor,
-                      const DeviceMemory<T>& input_data,
-                      const dnn::FilterDescriptor& filter_descriptor,
-                      const DeviceMemory<T>& filter_data,
-                      const dnn::ConvolutionDescriptor& convolution_descriptor,
-                      const dnn::BatchDescriptor& output_descriptor,
-                      DeviceMemory<T>* output_data,
-                      ScratchAllocator* scratch_allocator,
-                      const dnn::AlgorithmConfig& algorithm_config,
-                      dnn::ProfileResult* output_profile_result);
+  port::Status DoConvolveImpl(
+      Stream* stream, const dnn::BatchDescriptor& input_descriptor,
+      const DeviceMemory<T>& input_data,
+      const dnn::FilterDescriptor& filter_descriptor,
+      const DeviceMemory<T>& filter_data,
+      const dnn::ConvolutionDescriptor& convolution_descriptor,
+      const dnn::BatchDescriptor& output_descriptor,
+      DeviceMemory<T>* output_data, ScratchAllocator* scratch_allocator,
+      const dnn::AlgorithmConfig& algorithm_config,
+      dnn::ProfileResult* output_profile_result);
 
   template <typename Type, typename BiasType, typename ScaleType,
             int cudnn_data_type, int cudnn_compute_type>
-  bool DoFusedConvolveImpl(
+  port::Status DoFusedConvolveImpl(
       Stream* stream, const dnn::BatchDescriptor& conv_input_descriptor,
       const DeviceMemory<Type>& conv_input_data, ScaleType conv_input_scale,
       const dnn::FilterDescriptor& filter_descriptor,
@@ -685,9 +684,8 @@ class CudnnSupport : public dnn::DnnSupport {
       dnn::ProfileResult* output_profile_result);
 
   template <class T>
-  bool DoConvolveBackwardDataImpl(
-      Stream* stream,
-      const dnn::FilterDescriptor& filter_descriptor,
+  port::Status DoConvolveBackwardDataImpl(
+      Stream* stream, const dnn::FilterDescriptor& filter_descriptor,
       const DeviceMemory<T>& filter_data,
       const dnn::BatchDescriptor& output_descriptor,
       DeviceMemory<T> backward_output_data,
@@ -698,10 +696,10 @@ class CudnnSupport : public dnn::DnnSupport {
       dnn::ProfileResult* output_profile_result);
 
   template <class T>
-  bool DoConvolveBackwardFilterImpl(
+  port::Status DoConvolveBackwardFilterImpl(
       Stream* stream, const dnn::BatchDescriptor& input_descriptor,
       const DeviceMemory<T>& input_data,
-      const dnn::BatchDescriptor& output_descriptor_in,
+      const dnn::BatchDescriptor& output_descriptor,
       DeviceMemory<T> backward_output_data,
       const dnn::ConvolutionDescriptor& convolution_descriptor,
       const dnn::FilterDescriptor& filter_descriptor,
@@ -711,56 +709,56 @@ class CudnnSupport : public dnn::DnnSupport {
       dnn::ProfileResult* output_profile_result);
 
   template <class T>
-  bool DoConvolveBackwardBiasImpl(Stream* stream,
-                                  const dnn::BatchDescriptor& input_descriptor,
-                                  const DeviceMemory<T>& input_data,
-                                  const dnn::BatchDescriptor& bias_descriptor,
-                                  DeviceMemory<T>* backward_bias_data);
+  port::Status DoConvolveBackwardBiasImpl(
+      Stream* stream, const dnn::BatchDescriptor& input_descriptor,
+      const DeviceMemory<T>& input_data,
+      const dnn::BatchDescriptor& bias_descriptor,
+      DeviceMemory<T>* backward_bias_data);
 
   template <class T>
-  bool DoRnnForwardImpl(Stream* stream, const CudnnRnnDescriptor& rnn_desc,
-                        const CudnnRnnSequenceTensorDescriptor& input_desc,
-                        const DeviceMemory<T>& input_data,
-                        const CudnnRnnStateTensorDescriptor& input_h_desc,
-                        const DeviceMemory<T>& input_h_data,
-                        const CudnnRnnStateTensorDescriptor& input_c_desc,
-                        const DeviceMemory<T>& input_c_data,
-                        const DeviceMemory<T>& params,
-                        const CudnnRnnSequenceTensorDescriptor& output_desc,
-                        DeviceMemory<T>* output_data,
-                        const CudnnRnnStateTensorDescriptor& output_h_desc,
-                        DeviceMemory<T>* output_h_data,
-                        const CudnnRnnStateTensorDescriptor& output_c_desc,
-                        DeviceMemory<T>* output_c_data, bool is_training,
-                        ScratchAllocator* reserve_space_allocator,
-                        ScratchAllocator* workspace_allocator,
-                        dnn::ProfileResult* output_profile_result);
+  port::Status DoRnnForwardImpl(
+      Stream* stream, const CudnnRnnDescriptor& rnn_desc,
+      const CudnnRnnSequenceTensorDescriptor& input_desc,
+      const DeviceMemory<T>& input_data,
+      const CudnnRnnStateTensorDescriptor& input_h_desc,
+      const DeviceMemory<T>& input_h_data,
+      const CudnnRnnStateTensorDescriptor& input_c_desc,
+      const DeviceMemory<T>& input_c_data, const DeviceMemory<T>& params,
+      const CudnnRnnSequenceTensorDescriptor& output_desc,
+      DeviceMemory<T>* output_data,
+      const CudnnRnnStateTensorDescriptor& output_h_desc,
+      DeviceMemory<T>* output_h_data,
+      const CudnnRnnStateTensorDescriptor& output_c_desc,
+      DeviceMemory<T>* output_c_data, bool is_training,
+      ScratchAllocator* reserve_space_allocator,
+      ScratchAllocator* workspace_allocator,
+      dnn::ProfileResult* output_profile_result);
 
   template <class T>
-  bool DoRnnBackwardImpl(Stream* stream, const CudnnRnnDescriptor& rnn_desc,
-                         const CudnnRnnSequenceTensorDescriptor& input_desc,
-                         const DeviceMemory<T>& input_data,
-                         const CudnnRnnStateTensorDescriptor& input_h_desc,
-                         const DeviceMemory<T>& input_h_data,
-                         const CudnnRnnStateTensorDescriptor& input_c_desc,
-                         const DeviceMemory<T>& input_c_data,
-                         const DeviceMemory<T>& params,
-                         const CudnnRnnSequenceTensorDescriptor& output_desc,
-                         const DeviceMemory<T>& output_data,
-                         const CudnnRnnStateTensorDescriptor& output_h_desc,
-                         const DeviceMemory<T>& output_h_data,
-                         const CudnnRnnStateTensorDescriptor& output_c_desc,
-                         const DeviceMemory<T>& output_c_data,
-                         const DeviceMemory<T>& output_backprop_data,
-                         const DeviceMemory<T>& output_h_backprop_data,
-                         const DeviceMemory<T>& output_c_backprop_data,
-                         DeviceMemory<T>* input_backprop_data,
-                         DeviceMemory<T>* input_h_backprop_data,
-                         DeviceMemory<T>* input_c_backprop_data,
-                         DeviceMemory<T>* params_backprop_data,
-                         DeviceMemory<uint8>* reserve_space_data,
-                         ScratchAllocator* workspace_allocator,
-                         dnn::ProfileResult* output_profile_result);
+  port::Status DoRnnBackwardImpl(
+      Stream* stream, const CudnnRnnDescriptor& rnn_desc,
+      const CudnnRnnSequenceTensorDescriptor& input_desc,
+      const DeviceMemory<T>& input_data,
+      const CudnnRnnStateTensorDescriptor& input_h_desc,
+      const DeviceMemory<T>& input_h_data,
+      const CudnnRnnStateTensorDescriptor& input_c_desc,
+      const DeviceMemory<T>& input_c_data, const DeviceMemory<T>& params,
+      const CudnnRnnSequenceTensorDescriptor& output_desc,
+      const DeviceMemory<T>& output_data,
+      const CudnnRnnStateTensorDescriptor& output_h_desc,
+      const DeviceMemory<T>& output_h_data,
+      const CudnnRnnStateTensorDescriptor& output_c_desc,
+      const DeviceMemory<T>& output_c_data,
+      const DeviceMemory<T>& output_backprop_data,
+      const DeviceMemory<T>& output_h_backprop_data,
+      const DeviceMemory<T>& output_c_backprop_data,
+      DeviceMemory<T>* input_backprop_data,
+      DeviceMemory<T>* input_h_backprop_data,
+      DeviceMemory<T>* input_c_backprop_data,
+      DeviceMemory<T>* params_backprop_data,
+      DeviceMemory<uint8>* reserve_space_data,
+      ScratchAllocator* workspace_allocator,
+      dnn::ProfileResult* output_profile_result);
 
   SE_DISALLOW_COPY_AND_ASSIGN(CudnnSupport);
 };
diff --git a/tensorflow/stream_executor/cuda/cuda_driver.cc b/tensorflow/stream_executor/cuda/cuda_driver.cc
index 09e9f9f7582904e7bdccb54abafa1f99f6e3e2a8..d508f6594a9f9ac3c924b0b952620b6a4ac727ea 100644
--- a/tensorflow/stream_executor/cuda/cuda_driver.cc
+++ b/tensorflow/stream_executor/cuda/cuda_driver.cc
@@ -21,7 +21,6 @@ limitations under the License.
 #include <set>
 #include <utility>
 
-#include "cuda/include/cuda_runtime.h"
 #include "tensorflow/stream_executor/cuda/cuda_diagnostics.h"
 #include "tensorflow/stream_executor/lib/casts.h"
 #include "tensorflow/stream_executor/lib/env.h"
diff --git a/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc b/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc
index f2be68bc421c1fbc31ea5a054b91130c11949635..edf217875ff30b26407e35a39d1ee603b980b480 100644
--- a/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc
+++ b/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc
@@ -15,9 +15,6 @@ limitations under the License.
 
 #include "tensorflow/stream_executor/cuda/cuda_gpu_executor.h"
 
-#if defined(__APPLE__)
-#include <mach-o/dyld.h>
-#endif
 #if defined(PLATFORM_WINDOWS)
 #include <windows.h>
 #define PATH_MAX MAX_PATH
@@ -179,19 +176,11 @@ bool CUDAExecutor::FindOnDiskForComputeCapability(
 //                 would return /usr/bin.
 static string GetBinaryDir(bool strip_exe) {
   char exe_path[PATH_MAX] = {0};
-#if defined(__APPLE__)
-    uint32_t buffer_size = 0U;
-    _NSGetExecutablePath(nullptr, &buffer_size);
-    char unresolved_path[buffer_size];
-    _NSGetExecutablePath(unresolved_path, &buffer_size);
-    CHECK_ERR(realpath(unresolved_path, exe_path) ? 1 : -1);
-#else
 #if defined(PLATFORM_WINDOWS)
   HMODULE hModule = GetModuleHandle(NULL);
   GetModuleFileName(hModule, exe_path, MAX_PATH);
 #else
   CHECK_ERR(readlink("/proc/self/exe", exe_path, sizeof(exe_path) - 1));
-#endif
 #endif
   // Make sure it's null-terminated:
   exe_path[sizeof(exe_path) - 1] = 0;
@@ -854,10 +843,7 @@ CudaContext* CUDAExecutor::cuda_context() { return context_; }
 // For anything more complicated/prod-focused than this, you'll likely want to
 // turn to gsys' topology modeling.
 static int TryToReadNumaNode(const string &pci_bus_id, int device_ordinal) {
-#if defined(__APPLE__)
-  LOG(INFO) << "OS X does not support NUMA - returning NUMA node zero";
-  return 0;
-#elif defined(PLATFORM_WINDOWS)
+#if defined(PLATFORM_WINDOWS)
   // Windows support for NUMA is not currently implemented. Return node 0.
   return 0;
 #elif defined(__aarch64__)
diff --git a/tensorflow/stream_executor/cuda/cuda_timer.h b/tensorflow/stream_executor/cuda/cuda_timer.h
index 70554ec93120fcb0251ba0995a1ce9d6e5997016..e040cf86fad1f40a708ad4ca28693e31908393f0 100644
--- a/tensorflow/stream_executor/cuda/cuda_timer.h
+++ b/tensorflow/stream_executor/cuda/cuda_timer.h
@@ -37,8 +37,9 @@ class CUDATimer : public internal::TimerInterface {
   explicit CUDATimer(CUDAExecutor *parent)
       : parent_(parent), start_event_(nullptr), stop_event_(nullptr) {}
 
-  // Note: teardown is explicitly handled in this API by a call to
+  // Note: teardown needs to be explicitly handled in this API by a call to
   // StreamExecutor::DeallocateTimer(), which invokes Destroy().
+  // TODO(csigg): Change to RAII.
   ~CUDATimer() override {}
 
   // Allocates the platform-specific pieces of the timer, called as part of
diff --git a/tensorflow/stream_executor/dnn.cc b/tensorflow/stream_executor/dnn.cc
index eed93efc8d655276d4afc8c651abc90dab7dc3c4..82aa8ceb3298a30a4c117882dc96c504d9d10226 100644
--- a/tensorflow/stream_executor/dnn.cc
+++ b/tensorflow/stream_executor/dnn.cc
@@ -141,6 +141,10 @@ string PadAlignmentString(PadAlignment alignment) {
   return "unknown pad alignment";
 }
 
+std::ostream& operator<<(std::ostream& str, dnn::PadAlignment alignment) {
+  return str << PadAlignmentString(alignment);
+}
+
 string ShortPoolingModeString(PoolingMode mode) {
   switch (mode) {
     case PoolingMode::kMaximum:
@@ -407,6 +411,8 @@ string FilterDescriptor::ToShortString() const {
   switch (layout_) {
     case FilterLayout::kOutputInputYX:
       return port::StrCat(od, id, spatial);
+    case FilterLayout::kOutputYXInput:
+      return port::StrCat(od, spatial, id);
     case FilterLayout::kOutputInputYX4:
       return port::StrCat(od, id, spatial, "(VECT_C)");
     case FilterLayout::kInputYXOutput:
diff --git a/tensorflow/stream_executor/dnn.h b/tensorflow/stream_executor/dnn.h
index 38abc66079264dd46634fca1b0d9297844a31aa1..9eca5abe1ae7265ebca0a1ea653823816deaa8f5 100644
--- a/tensorflow/stream_executor/dnn.h
+++ b/tensorflow/stream_executor/dnn.h
@@ -349,6 +349,8 @@ enum class FilterLayout : int64 {
   kOutputInputYX = 0,  // cuDNN's default filter layout, laid out as:
                        // (major) output feature maps >> input feature maps >>
                        // rows >> columns (minor).
+  kOutputYXInput,      // major to minor:
+                       //   (output features, row, columns, input features)
   kOutputInputYX4,  // laid out the same as kOutputInputYX but each element is a
                     // vector of 4 feature maps.
   kInputYXOutput,   // Same as dist_belief's default filter layout.
@@ -467,6 +469,9 @@ enum class PadAlignment : int64 {
 // Returns a string representation of the given padding alignment.
 string PadAlignmentString(PadAlignment alignment);
 
+// Print alignment to str. Needed to use CHECK_EQ between two PadAlignments.
+std::ostream& operator<<(std::ostream& str, dnn::PadAlignment alignment);
+
 // Describes a convolution.
 //
 // Uses the named argument construction form:
@@ -708,7 +713,7 @@ class PoolingDescriptor {
 class AlgorithmDesc {
  public:
   typedef int64 Index;
-  AlgorithmDesc() : algo_(kDefaultAlgorithm), tensor_ops_enabled_(false) {}
+  AlgorithmDesc() : algo_(kDefaultAlgorithm), tensor_ops_enabled_(true) {}
   AlgorithmDesc(Index a, bool use_tensor_ops)
       : algo_(a), tensor_ops_enabled_(use_tensor_ops) {}
   bool is_default() const { return algo_ == kDefaultAlgorithm; }
diff --git a/tensorflow/stream_executor/stream_executor_pimpl.cc b/tensorflow/stream_executor/stream_executor_pimpl.cc
index b222a4d82a3e87a52c44427627e7aaacd0ed5c0d..000795ff0048dddb0eb4a08956e6de6f5e336f28 100644
--- a/tensorflow/stream_executor/stream_executor_pimpl.cc
+++ b/tensorflow/stream_executor/stream_executor_pimpl.cc
@@ -610,7 +610,7 @@ port::Status StreamExecutor::SynchronousMemcpyD2H(
 port::Status StreamExecutor::SynchronousMemcpyH2D(
     const void *host_src, int64 size, DeviceMemoryBase *device_dst) {
   VLOG(1) << "Called StreamExecutor::SynchronousMemcpyH2D(host_src=" << host_src
-          << ", size=" << size << ", device_dst" << device_dst->opaque() << ")"
+          << ", size=" << size << ", device_dst=" << device_dst->opaque() << ")"
           << StackTraceIfVLOG10();
 
   port::Status result;
diff --git a/tensorflow/tensorflow.bzl b/tensorflow/tensorflow.bzl
index d71fd71bbd83add63d11bcd62ae7ecaa2a8be8d1..b59f8e1f987567727ef3d4051618edd377d06f89 100644
--- a/tensorflow/tensorflow.bzl
+++ b/tensorflow/tensorflow.bzl
@@ -1353,12 +1353,6 @@ register_extension_info(
     label_regex_for_dep = "{extension_name}",
 )
 
-def tf_extension_linkopts():
-  return []  # No extension link opts
-
-def tf_extension_copts():
-  return []  # No extension c opts
-
 # In tf_py_wrap_cc generated libraries
 # module init functions are not exported unless
 # they contain one of the keywords in the version file
@@ -1459,10 +1453,10 @@ def tf_py_wrap_cc(name,
   tf_cc_shared_object(
       name=cc_library_name,
       srcs=[module_name + ".cc"],
-      copts=(copts + if_not_windows([
+      copts=copts + if_not_windows([
           "-Wno-self-assign", "-Wno-sign-compare", "-Wno-write-strings"
-      ]) + tf_extension_copts()),
-      linkopts=tf_extension_linkopts() + extra_linkopts,
+      ]),
+      linkopts=extra_linkopts,
       linkstatic=1,
       deps=deps + extra_deps,
       **kwargs)
@@ -1725,7 +1719,7 @@ def tf_py_build_info_genrule():
       name="py_build_info_gen",
       outs=["platform/build_info.py"],
       cmd=
-      "$(location //tensorflow/tools/build_info:gen_build_info.py) --raw_generate \"$@\" --build_config " + if_cuda("cuda", "cpu"),
+     "$(location //tensorflow/tools/build_info:gen_build_info.py) --raw_generate \"$@\" --build_config " + if_cuda("cuda", "cpu"),
       local=1,
       tools=[clean_dep("//tensorflow/tools/build_info:gen_build_info.py")],)
 
diff --git a/tensorflow/tools/api/generator/BUILD b/tensorflow/tools/api/generator/BUILD
index f46bb4b5fcc5d6eface8617ba5261abf29e34b02..3a28153e522a25b02858c7aaa75f2ade620479a1 100644
--- a/tensorflow/tools/api/generator/BUILD
+++ b/tensorflow/tools/api/generator/BUILD
@@ -5,12 +5,22 @@ licenses(["notice"])  # Apache 2.0
 
 exports_files(["LICENSE"])
 
+load("//tensorflow/tools/api/generator:api_gen.bzl", "TENSORFLOW_API_INIT_FILES")
+
+py_library(
+    name = "doc_srcs",
+    srcs = ["doc_srcs.py"],
+    srcs_version = "PY2AND3",
+)
+
 py_binary(
     name = "create_python_api",
     srcs = ["create_python_api.py"],
     srcs_version = "PY2AND3",
+    visibility = ["//visibility:public"],
     deps = [
-        "//tensorflow/python",
+        ":doc_srcs",
+        "//tensorflow/python:no_contrib",
     ],
 )
 
@@ -24,115 +34,17 @@ py_test(
     ],
 )
 
-genrule(
-    name = "python_api_gen",
-    # List of API files. This list should include file name for
-    # every module exported using tf_export. For e.g. if an op is decorated with
-    # @tf_export('module1.module2', 'module3'). Then, outs should include
-    # api/module1/module2/__init__.py and api/module3/__init__.py.
-    # keep sorted
-    outs = [
-        # BEGIN GENERATED FILES
-        "api/__init__.py",
-        "api/app/__init__.py",
-        "api/bitwise/__init__.py",
-        "api/compat/__init__.py",
-        "api/contrib/__init__.py",
-        "api/contrib/stat_summarizer/__init__.py",
-        "api/data/__init__.py",
-        "api/distributions/__init__.py",
-        "api/distributions/bijectors/__init__.py",
-        "api/errors/__init__.py",
-        "api/estimator/__init__.py",
-        "api/estimator/export/__init__.py",
-        "api/estimator/inputs/__init__.py",
-        "api/feature_column/__init__.py",
-        "api/gfile/__init__.py",
-        "api/graph_util/__init__.py",
-        "api/image/__init__.py",
-        "api/initializers/__init__.py",
-        "api/keras/__init__.py",
-        "api/keras/activations/__init__.py",
-        "api/keras/applications/__init__.py",
-        "api/keras/applications/densenet/__init__.py",
-        "api/keras/applications/inception_resnet_v2/__init__.py",
-        "api/keras/applications/inception_v3/__init__.py",
-        "api/keras/applications/mobilenet/__init__.py",
-        "api/keras/applications/nasnet/__init__.py",
-        "api/keras/applications/resnet50/__init__.py",
-        "api/keras/applications/vgg16/__init__.py",
-        "api/keras/applications/vgg19/__init__.py",
-        "api/keras/applications/xception/__init__.py",
-        "api/keras/backend/__init__.py",
-        "api/keras/callbacks/__init__.py",
-        "api/keras/constraints/__init__.py",
-        "api/keras/datasets/__init__.py",
-        "api/keras/datasets/boston_housing/__init__.py",
-        "api/keras/datasets/cifar10/__init__.py",
-        "api/keras/datasets/cifar100/__init__.py",
-        "api/keras/datasets/fashion_mnist/__init__.py",
-        "api/keras/datasets/imdb/__init__.py",
-        "api/keras/datasets/mnist/__init__.py",
-        "api/keras/datasets/reuters/__init__.py",
-        "api/keras/estimator/__init__.py",
-        "api/keras/initializers/__init__.py",
-        "api/keras/layers/__init__.py",
-        "api/keras/losses/__init__.py",
-        "api/keras/metrics/__init__.py",
-        "api/keras/models/__init__.py",
-        "api/keras/optimizers/__init__.py",
-        "api/keras/preprocessing/__init__.py",
-        "api/keras/preprocessing/image/__init__.py",
-        "api/keras/preprocessing/sequence/__init__.py",
-        "api/keras/preprocessing/text/__init__.py",
-        "api/keras/regularizers/__init__.py",
-        "api/keras/utils/__init__.py",
-        "api/keras/wrappers/__init__.py",
-        "api/keras/wrappers/scikit_learn/__init__.py",
-        "api/layers/__init__.py",
-        "api/linalg/__init__.py",
-        "api/logging/__init__.py",
-        "api/losses/__init__.py",
-        "api/manip/__init__.py",
-        "api/math/__init__.py",
-        "api/metrics/__init__.py",
-        "api/nn/__init__.py",
-        "api/nn/rnn_cell/__init__.py",
-        "api/profiler/__init__.py",
-        "api/python_io/__init__.py",
-        "api/resource_loader/__init__.py",
-        "api/strings/__init__.py",
-        "api/saved_model/__init__.py",
-        "api/saved_model/builder/__init__.py",
-        "api/saved_model/constants/__init__.py",
-        "api/saved_model/loader/__init__.py",
-        "api/saved_model/main_op/__init__.py",
-        "api/saved_model/signature_constants/__init__.py",
-        "api/saved_model/signature_def_utils/__init__.py",
-        "api/saved_model/tag_constants/__init__.py",
-        "api/saved_model/utils/__init__.py",
-        "api/sets/__init__.py",
-        "api/sparse/__init__.py",
-        "api/spectral/__init__.py",
-        "api/summary/__init__.py",
-        "api/sysconfig/__init__.py",
-        "api/test/__init__.py",
-        "api/train/__init__.py",
-        "api/train/queue_runner/__init__.py",
-        "api/user_ops/__init__.py",
-        # END GENERATED FILES
-    ],
-    cmd = "$(location create_python_api) $(OUTS)",
-    tools = ["create_python_api"],
-)
-
-py_library(
-    name = "python_api",
-    srcs = [":python_api_gen"],
+py_test(
+    name = "tensorflow_doc_srcs_test",
+    srcs = ["doc_srcs_test.py"],
+    args = [
+        "--package=tensorflow.python",
+    ] + TENSORFLOW_API_INIT_FILES,
+    main = "doc_srcs_test.py",
     srcs_version = "PY2AND3",
-    visibility = ["//tensorflow:__subpackages__"],
     deps = [
-        "//tensorflow/contrib:contrib_py",  # keep
-        "//tensorflow/python",  # keep
+        ":doc_srcs",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:no_contrib",
     ],
 )
diff --git a/tensorflow/tools/api/generator/api_gen.bzl b/tensorflow/tools/api/generator/api_gen.bzl
new file mode 100644
index 0000000000000000000000000000000000000000..41713a94ecbde567340a9e1571a06aca3bbda97a
--- /dev/null
+++ b/tensorflow/tools/api/generator/api_gen.bzl
@@ -0,0 +1,135 @@
+"""Targets for generating TensorFlow Python API __init__.py files."""
+
+# keep sorted
+TENSORFLOW_API_INIT_FILES = [
+    # BEGIN GENERATED FILES
+    "__init__.py",
+    "app/__init__.py",
+    "bitwise/__init__.py",
+    "compat/__init__.py",
+    "data/__init__.py",
+    "distributions/__init__.py",
+    "distributions/bijectors/__init__.py",
+    "errors/__init__.py",
+    "feature_column/__init__.py",
+    "gfile/__init__.py",
+    "graph_util/__init__.py",
+    "image/__init__.py",
+    "initializers/__init__.py",
+    "keras/__init__.py",
+    "keras/activations/__init__.py",
+    "keras/applications/__init__.py",
+    "keras/applications/densenet/__init__.py",
+    "keras/applications/inception_resnet_v2/__init__.py",
+    "keras/applications/inception_v3/__init__.py",
+    "keras/applications/mobilenet/__init__.py",
+    "keras/applications/nasnet/__init__.py",
+    "keras/applications/resnet50/__init__.py",
+    "keras/applications/vgg16/__init__.py",
+    "keras/applications/vgg19/__init__.py",
+    "keras/applications/xception/__init__.py",
+    "keras/backend/__init__.py",
+    "keras/callbacks/__init__.py",
+    "keras/constraints/__init__.py",
+    "keras/datasets/__init__.py",
+    "keras/datasets/boston_housing/__init__.py",
+    "keras/datasets/cifar10/__init__.py",
+    "keras/datasets/cifar100/__init__.py",
+    "keras/datasets/fashion_mnist/__init__.py",
+    "keras/datasets/imdb/__init__.py",
+    "keras/datasets/mnist/__init__.py",
+    "keras/datasets/reuters/__init__.py",
+    "keras/estimator/__init__.py",
+    "keras/initializers/__init__.py",
+    "keras/layers/__init__.py",
+    "keras/losses/__init__.py",
+    "keras/metrics/__init__.py",
+    "keras/models/__init__.py",
+    "keras/optimizers/__init__.py",
+    "keras/preprocessing/__init__.py",
+    "keras/preprocessing/image/__init__.py",
+    "keras/preprocessing/sequence/__init__.py",
+    "keras/preprocessing/text/__init__.py",
+    "keras/regularizers/__init__.py",
+    "keras/utils/__init__.py",
+    "keras/wrappers/__init__.py",
+    "keras/wrappers/scikit_learn/__init__.py",
+    "layers/__init__.py",
+    "linalg/__init__.py",
+    "logging/__init__.py",
+    "losses/__init__.py",
+    "manip/__init__.py",
+    "math/__init__.py",
+    "metrics/__init__.py",
+    "nn/__init__.py",
+    "nn/rnn_cell/__init__.py",
+    "profiler/__init__.py",
+    "python_io/__init__.py",
+    "resource_loader/__init__.py",
+    "strings/__init__.py",
+    "saved_model/__init__.py",
+    "saved_model/builder/__init__.py",
+    "saved_model/constants/__init__.py",
+    "saved_model/loader/__init__.py",
+    "saved_model/main_op/__init__.py",
+    "saved_model/signature_constants/__init__.py",
+    "saved_model/signature_def_utils/__init__.py",
+    "saved_model/tag_constants/__init__.py",
+    "saved_model/utils/__init__.py",
+    "sets/__init__.py",
+    "sparse/__init__.py",
+    "spectral/__init__.py",
+    "summary/__init__.py",
+    "sysconfig/__init__.py",
+    "test/__init__.py",
+    "train/__init__.py",
+    "train/queue_runner/__init__.py",
+    "user_ops/__init__.py",
+    # END GENERATED FILES
+]
+
+# keep sorted
+ESTIMATOR_API_INIT_FILES = [
+    # BEGIN GENERATED ESTIMATOR FILES
+    "__init__.py",
+    "estimator/__init__.py",
+    "estimator/export/__init__.py",
+    "estimator/inputs/__init__.py",
+    # END GENERATED ESTIMATOR FILES
+]
+
+# Creates a genrule that generates a directory structure with __init__.py
+# files that import all exported modules (i.e. modules with tf_export
+# decorators).
+#
+# Args:
+#   name: name of genrule to create.
+#   output_files: List of __init__.py files that should be generated.
+#     This list should include file name for every module exported using
+#     tf_export. For e.g. if an op is decorated with
+#     @tf_export('module1.module2', 'module3'). Then, output_files should
+#     include module1/module2/__init__.py and module3/__init__.py.
+#   root_init_template: Python init file that should be used as template for
+#     root __init__.py file. "# API IMPORTS PLACEHOLDER" comment inside this
+#     template will be replaced with root imports collected by this genrule.
+#   srcs: genrule sources. If passing root_init_template, the template file
+#     must be included in sources.
+def gen_api_init_files(name,
+                       output_files=TENSORFLOW_API_INIT_FILES,
+                       root_init_template=None,
+                       srcs=[],
+                       api_name="tensorflow",
+                       package="tensorflow.python"):
+  root_init_template_flag = ""
+  if root_init_template:
+    root_init_template_flag = "--root_init_template=$(location " + root_init_template + ")"
+  native.genrule(
+      name = name,
+      outs = output_files,
+      cmd = (
+          "$(location //tensorflow/tools/api/generator:create_python_api) " +
+          root_init_template_flag + " --apidir=$(@D) --apiname=" + api_name + " --package=" + package + " $(OUTS)"),
+      srcs = srcs,
+      tools = ["//tensorflow/tools/api/generator:create_python_api"],
+      visibility = ["//tensorflow:__pkg__"],
+  )
diff --git a/tensorflow/tools/api/generator/create_python_api.py b/tensorflow/tools/api/generator/create_python_api.py
index 18182090dabab1f0552001e1388e4f74e3514f1a..e375cd48d8ce0cf1b132e851f4133ad255f14aca 100644
--- a/tensorflow/tools/api/generator/create_python_api.py
+++ b/tensorflow/tools/api/generator/create_python_api.py
@@ -25,17 +25,21 @@ import os
 import sys
 
 from tensorflow.python.util import tf_decorator
+from tensorflow.python.util import tf_export
+from tensorflow.tools.api.generator import doc_srcs
 
+API_ATTRS = tf_export.API_ATTRS
 
-_API_CONSTANTS_ATTR = '_tf_api_constants'
-_API_NAMES_ATTR = '_tf_api_names'
-_API_DIR = '/api/'
 _DEFAULT_PACKAGE = 'tensorflow.python'
-_OUTPUT_MODULE = 'tensorflow.tools.api.generator.api'
-_GENERATED_FILE_HEADER = """\"\"\"Imports for Python API.
-
-This file is MACHINE GENERATED! Do not edit.
-Generated by: tensorflow/tools/api/generator/create_python_api.py script.
+_GENFILES_DIR_SUFFIX = 'genfiles/'
+_SYMBOLS_TO_SKIP_EXPLICITLY = {
+    # Overrides __getattr__, so that unwrapping tf_decorator
+    # would have side effects.
+    'tensorflow.python.platform.flags.FLAGS'
+}
+_GENERATED_FILE_HEADER = """# This file is MACHINE GENERATED! Do not edit.
+# Generated by: tensorflow/tools/api/generator/create_python_api.py script.
+\"\"\"%s
 \"\"\"
 
 from __future__ import print_function
@@ -147,20 +151,21 @@ class _ModuleInitCodeBuilder(object):
     # the script outputs.
     module_text_map[''] = module_text_map.get('', '') + '''
 _names_with_underscore = [%s]
-__all__ = [s for s in dir() if not s.startswith('_')]
-__all__.extend([s for s in _names_with_underscore])
+__all__ = [_s for _s in dir() if not _s.startswith('_')]
+__all__.extend([_s for _s in _names_with_underscore])
 __all__.remove('print_function')
 ''' % underscore_names_str
 
     return module_text_map
 
 
-def get_api_init_text(package):
+def get_api_init_text(package, api_name):
   """Get a map from destination module to __init__.py code for that module.
 
   Args:
     package: Base python package containing python with target tf_export
       decorators.
+    api_name: API you want to generate (e.g. `tensorflow` or `estimator`).
 
   Returns:
     A dictionary where
@@ -182,10 +187,13 @@ def get_api_init_text(package):
       continue
 
     for module_contents_name in dir(module):
+      if (module.__name__ + '.' + module_contents_name
+          in _SYMBOLS_TO_SKIP_EXPLICITLY):
+        continue
       attr = getattr(module, module_contents_name)
 
       # If attr is _tf_api_constants attribute, then add the constants.
-      if module_contents_name == _API_CONSTANTS_ATTR:
+      if module_contents_name == API_ATTRS[api_name].constants:
         for exports, value in attr:
           for export in exports:
             names = export.split('.')
@@ -197,8 +205,9 @@ def get_api_init_text(package):
       _, attr = tf_decorator.unwrap(attr)
       # If attr is a symbol with _tf_api_names attribute, then
       # add import for it.
-      if hasattr(attr, '__dict__') and _API_NAMES_ATTR in attr.__dict__:
-        for export in attr._tf_api_names:  # pylint: disable=protected-access
+      if (hasattr(attr, '__dict__') and
+          API_ATTRS[api_name].names in attr.__dict__):
+        for export in getattr(attr, API_ATTRS[api_name].names):  # pylint: disable=protected-access
           names = export.split('.')
           dest_module = '.'.join(names[:-1])
           module_code_builder.add_import(
@@ -209,6 +218,7 @@ def get_api_init_text(package):
   # For e.g. if we import 'foo.bar.Value'. Then, we also
   # import 'bar' in 'foo'.
   imported_modules = set(module_code_builder.module_imports.keys())
+  import_from = '.'
   for module in imported_modules:
     if not module:
       continue
@@ -216,11 +226,9 @@ def get_api_init_text(package):
     parent_module = ''  # we import submodules in their parent_module
 
     for submodule_index in range(len(module_split)):
-      import_from = _OUTPUT_MODULE
       if submodule_index > 0:
         parent_module += ('.' + module_split[submodule_index-1] if parent_module
                           else module_split[submodule_index-1])
-        import_from += '.' + parent_module
       module_code_builder.add_import(
           -1, parent_module, import_from,
           module_split[submodule_index], module_split[submodule_index])
@@ -228,7 +236,62 @@ def get_api_init_text(package):
   return module_code_builder.build()
 
 
-def create_api_files(output_files, package):
+def get_module(dir_path, relative_to_dir):
+  """Get module that corresponds to path relative to relative_to_dir.
+
+  Args:
+    dir_path: Path to directory.
+    relative_to_dir: Get module relative to this directory.
+
+  Returns:
+    Name of module that corresponds to the given directory.
+  """
+  dir_path = dir_path[len(relative_to_dir):]
+  # Convert path separators to '/' for easier parsing below.
+  dir_path = dir_path.replace(os.sep, '/')
+  return dir_path.replace('/', '.').strip('.')
+
+
+def get_module_docstring(module_name, package):
+  """Get docstring for the given module.
+
+  This method looks for docstring in the following order:
+  1. Checks if module has a docstring specified in doc_srcs.
+  2. Checks if module has a docstring source module specified
+     in doc_srcs. If it does, gets docstring from that module.
+  3. Checks if module with module_name exists under base package.
+     If it does, gets docstring from that module.
+  4. Returns a default docstring.
+
+  Args:
+    module_name: module name relative to tensorflow
+      (excluding 'tensorflow.' prefix) to get a docstring for.
+    package: Base python package containing python with target tf_export
+      decorators.
+
+  Returns:
+    One-line docstring to describe the module.
+  """
+  # Module under base package to get a docstring from.
+  docstring_module_name = module_name
+
+  if module_name in doc_srcs.TENSORFLOW_DOC_SOURCES:
+    docsrc = doc_srcs.TENSORFLOW_DOC_SOURCES[module_name]
+    if docsrc.docstring:
+      return docsrc.docstring
+    if docsrc.docstring_module_name:
+      docstring_module_name = docsrc.docstring_module_name
+
+  docstring_module_name = package + '.' + docstring_module_name
+  if (docstring_module_name in sys.modules and
+      sys.modules[docstring_module_name].__doc__):
+    return sys.modules[docstring_module_name].__doc__
+
+  return 'Public API for tf.%s namespace.' % module_name
+
+
+def create_api_files(
+    output_files, package, root_init_template, output_dir, api_name):
   """Creates __init__.py files for the Python API.
 
   Args:
@@ -236,6 +299,11 @@ def create_api_files(output_files, package):
       Each file must be under api/ directory.
     package: Base python package containing python with target tf_export
       decorators.
+    root_init_template: Template for top-level __init__.py file.
+      "#API IMPORTS PLACEHOLDER" comment in the template file will be replaced
+      with imports.
+    output_dir: output API root directory.
+    api_name: API you want to generate (e.g. `tensorflow` or `estimator`).
 
   Raises:
     ValueError: if an output file is not under api/ directory,
@@ -243,18 +311,7 @@ def create_api_files(output_files, package):
   """
   module_name_to_file_path = {}
   for output_file in output_files:
-    # Convert path separators to '/' for easier parsing below.
-    normalized_output_file = output_file.replace(os.sep, '/')
-    if _API_DIR not in output_file:
-      raise ValueError(
-          'Output files must be in api/ directory, found %s.' % output_file)
-    # Get the module name that corresponds to output_file.
-    # First get module directory under _API_DIR.
-    module_dir = os.path.dirname(
-        normalized_output_file[
-            normalized_output_file.rfind(_API_DIR)+len(_API_DIR):])
-    # Convert / to .
-    module_name = module_dir.replace('/', '.').strip('.')
+    module_name = get_module(os.path.dirname(output_file), output_dir)
     module_name_to_file_path[module_name] = os.path.normpath(output_file)
 
   # Create file for each expected output in genrule.
@@ -263,25 +320,35 @@ def create_api_files(output_files, package):
       os.makedirs(os.path.dirname(file_path))
     open(file_path, 'a').close()
 
-  module_text_map = get_api_init_text(package)
+  module_text_map = get_api_init_text(package, api_name)
 
   # Add imports to output files.
   missing_output_files = []
   for module, text in module_text_map.items():
     # Make sure genrule output file list is in sync with API exports.
     if module not in module_name_to_file_path:
-      module_file_path = '"api/%s/__init__.py"' %  (
+      module_file_path = '"%s/__init__.py"' %  (
           module.replace('.', '/'))
       missing_output_files.append(module_file_path)
       continue
+    contents = ''
+    if module or not root_init_template:
+      contents = (
+          _GENERATED_FILE_HEADER %
+          get_module_docstring(module, package) + text + _GENERATED_FILE_FOOTER)
+    else:
+      # Read base init file
+      with open(root_init_template, 'r') as root_init_template_file:
+        contents = root_init_template_file.read()
+        contents = contents.replace('# API IMPORTS PLACEHOLDER', text)
     with open(module_name_to_file_path[module], 'w') as fp:
-      fp.write(_GENERATED_FILE_HEADER + text + _GENERATED_FILE_FOOTER)
+      fp.write(contents)
 
   if missing_output_files:
     raise ValueError(
         'Missing outputs for python_api_gen genrule:\n%s.'
         'Make sure all required outputs are in the '
-        'tensorflow/tools/api/generator/BUILD file.' %
+        'tensorflow/tools/api/generator/api_gen.bzl file.' %
         ',\n'.join(sorted(missing_output_files)))
 
 
@@ -297,6 +364,20 @@ def main():
       '--package', default=_DEFAULT_PACKAGE, type=str,
       help='Base package that imports modules containing the target tf_export '
            'decorators.')
+  parser.add_argument(
+      '--root_init_template', default='', type=str,
+      help='Template for top level __init__.py file. '
+           '"#API IMPORTS PLACEHOLDER" comment will be replaced with imports.')
+  parser.add_argument(
+      '--apidir', type=str, required=True,
+      help='Directory where generated output files are placed. '
+           'gendir should be a prefix of apidir. Also, apidir '
+           'should be a prefix of every directory in outputs.')
+  parser.add_argument(
+      '--apiname', required=True, type=str,
+      choices=API_ATTRS.keys(),
+      help='The API you want to generate.')
+
   args = parser.parse_args()
 
   if len(args.outputs) == 1:
@@ -309,7 +390,8 @@ def main():
 
   # Populate `sys.modules` with modules containing tf_export().
   importlib.import_module(args.package)
-  create_api_files(outputs, args.package)
+  create_api_files(outputs, args.package, args.root_init_template,
+                   args.apidir, args.apiname)
 
 
 if __name__ == '__main__':
diff --git a/tensorflow/tools/api/generator/create_python_api_test.py b/tensorflow/tools/api/generator/create_python_api_test.py
index 986340cf6d4a1bb18841d781dcd11c0208279ec8..651ec9d040302a4343ae6e0053cf6a4b37a971d4 100644
--- a/tensorflow/tools/api/generator/create_python_api_test.py
+++ b/tensorflow/tools/api/generator/create_python_api_test.py
@@ -57,7 +57,8 @@ class CreatePythonApiTest(test.TestCase):
 
   def testFunctionImportIsAdded(self):
     imports = create_python_api.get_api_init_text(
-        package=create_python_api._DEFAULT_PACKAGE)
+        package=create_python_api._DEFAULT_PACKAGE,
+        api_name='tensorflow')
     expected_import = (
         'from tensorflow.python.test_module '
         'import test_op as test_op1')
@@ -73,7 +74,8 @@ class CreatePythonApiTest(test.TestCase):
 
   def testClassImportIsAdded(self):
     imports = create_python_api.get_api_init_text(
-        package=create_python_api._DEFAULT_PACKAGE)
+        package=create_python_api._DEFAULT_PACKAGE,
+        api_name='tensorflow')
     expected_import = ('from tensorflow.python.test_module '
                        'import TestClass')
     self.assertTrue(
@@ -82,7 +84,8 @@ class CreatePythonApiTest(test.TestCase):
 
   def testConstantIsAdded(self):
     imports = create_python_api.get_api_init_text(
-        package=create_python_api._DEFAULT_PACKAGE)
+        package=create_python_api._DEFAULT_PACKAGE,
+        api_name='tensorflow')
     expected = ('from tensorflow.python.test_module '
                 'import _TEST_CONSTANT')
     self.assertTrue(expected in str(imports),
diff --git a/tensorflow/tools/api/generator/doc_srcs.py b/tensorflow/tools/api/generator/doc_srcs.py
new file mode 100644
index 0000000000000000000000000000000000000000..74f6db98fd58adf17d062a7aff6776fe1db0445b
--- /dev/null
+++ b/tensorflow/tools/api/generator/doc_srcs.py
@@ -0,0 +1,65 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Specifies sources of doc strings for API modules."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import collections
+
+
+# Specifies docstring source for a module.
+# Only one of docstring or docstring_module_name should be set.
+# * If docstring is set, then we will use this docstring when
+#   for the module.
+# * If docstring_module_name is set, then we will copy the docstring
+#   from docstring source module.
+DocSource = collections.namedtuple(
+    'DocSource', ['docstring', 'docstring_module_name'])
+# Each attribute of DocSource is optional.
+DocSource.__new__.__defaults__ = (None,) * len(DocSource._fields)
+
+TENSORFLOW_DOC_SOURCES = {
+    'app': DocSource(docstring_module_name='platform.app'),
+    'compat': DocSource(docstring_module_name='util.compat'),
+    'distributions': DocSource(
+        docstring_module_name='ops.distributions.distributions'),
+    'bitwise': DocSource(docstring_module_name='ops.bitwise_ops'),
+    'errors': DocSource(docstring_module_name='framework.errors'),
+    'gfile': DocSource(docstring_module_name='platform.gfile'),
+    'graph_util': DocSource(docstring_module_name='framework.graph_util'),
+    'image': DocSource(docstring_module_name='ops.image_ops'),
+    'keras.estimator': DocSource(docstring_module_name='estimator.keras'),
+    'linalg': DocSource(docstring_module_name='ops.linalg_ops'),
+    'logging': DocSource(docstring_module_name='ops.logging_ops'),
+    'losses': DocSource(docstring_module_name='ops.losses.losses'),
+    'manip': DocSource(docstring_module_name='ops.manip_ops'),
+    'math': DocSource(docstring_module_name='ops.math_ops'),
+    'metrics': DocSource(docstring_module_name='ops.metrics'),
+    'nn': DocSource(docstring_module_name='ops.nn_ops'),
+    'nn.rnn_cell': DocSource(docstring_module_name='ops.rnn_cell'),
+    'python_io': DocSource(docstring_module_name='lib.io.python_io'),
+    'resource_loader': DocSource(
+        docstring_module_name='platform.resource_loader'),
+    'sets': DocSource(docstring_module_name='ops.sets'),
+    'sparse': DocSource(docstring_module_name='ops.sparse_ops'),
+    'spectral': DocSource(docstring_module_name='ops.spectral_ops'),
+    'strings': DocSource(docstring_module_name='ops.string_ops'),
+    'sysconfig': DocSource(docstring_module_name='platform.sysconfig'),
+    'test': DocSource(docstring_module_name='platform.test'),
+    'train': DocSource(docstring_module_name='training.training'),
+    'train.queue_runner': DocSource(
+        docstring_module_name='training.queue_runner'),
+}
diff --git a/tensorflow/tools/api/generator/doc_srcs_test.py b/tensorflow/tools/api/generator/doc_srcs_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..9ba95a34391a8011477ec6afd630ec5c37586496
--- /dev/null
+++ b/tensorflow/tools/api/generator/doc_srcs_test.py
@@ -0,0 +1,80 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# =============================================================================
+"""Tests for tensorflow.tools.api.generator.doc_srcs."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import argparse
+import importlib
+import sys
+
+from tensorflow.python.platform import test
+from tensorflow.tools.api.generator import doc_srcs
+
+
+FLAGS = None
+
+
+class DocSrcsTest(test.TestCase):
+
+  def testModulesAreValidAPIModules(self):
+    for module_name in doc_srcs.TENSORFLOW_DOC_SOURCES:
+      # Convert module_name to corresponding __init__.py file path.
+      file_path = module_name.replace('.', '/')
+      if file_path:
+        file_path += '/'
+      file_path += '__init__.py'
+
+      if file_path not in FLAGS.outputs:
+        self.assertFalse('%s is not a valid API module' % module_name)
+
+  def testHaveDocstringOrDocstringModule(self):
+    for module_name, docsrc in doc_srcs.TENSORFLOW_DOC_SOURCES.items():
+      if docsrc.docstring and docsrc.docstring_module_name:
+        self.assertFalse(
+            '%s contains DocSource has both a docstring and a '
+            'docstring_module_name. '
+            'Only one of "docstring" or "docstring_module_name" should be set.'
+            % (module_name))
+
+  def testDocstringModulesAreValidModules(self):
+    for _, docsrc in doc_srcs.TENSORFLOW_DOC_SOURCES.items():
+      if docsrc.docstring_module_name:
+        doc_module_name = '.'.join([
+            FLAGS.package, docsrc.docstring_module_name])
+        if doc_module_name not in sys.modules:
+          sys.assertFalse(
+              'docsources_module %s is not a valid module under %s.' %
+              (docsrc.docstring_module_name, FLAGS.package))
+
+
+if __name__ == '__main__':
+  parser = argparse.ArgumentParser()
+  parser.add_argument(
+      'outputs', metavar='O', type=str, nargs='+',
+      help='create_python_api output files.')
+  parser.add_argument(
+      '--package', type=str,
+      help='Base package that imports modules containing the target tf_export '
+           'decorators.')
+  FLAGS, unparsed = parser.parse_known_args()
+
+  importlib.import_module(FLAGS.package)
+
+  # Now update argv, so that unittest library does not get confused.
+  sys.argv = [sys.argv[0]] + unparsed
+  test.main()
diff --git a/tensorflow/tools/api/golden/tensorflow.-attr-value.-list-value.pbtxt b/tensorflow/tools/api/golden/tensorflow.-attr-value.-list-value.pbtxt
index 0fb1aaba2831e63cea9b9a38954b361e5cabd072..f1dffd595285098afaeb0ff04e5db35d594f7fac 100644
--- a/tensorflow/tools/api/golden/tensorflow.-attr-value.-list-value.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.-attr-value.-list-value.pbtxt
@@ -1,108 +1,70 @@
 path: "tensorflow.AttrValue.ListValue"
-tf_class {
-  is_instance: "<class \'tensorflow.core.framework.attr_value_pb2.ListValue\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "B_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "FUNC_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "F_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "I_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "SHAPE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "S_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TENSOR_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TYPE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "ListValue"
+    field {
+      name: "s"
+      number: 2
+      label: LABEL_REPEATED
+      type: TYPE_BYTES
+    }
+    field {
+      name: "i"
+      number: 3
+      label: LABEL_REPEATED
+      type: TYPE_INT64
+      options {
+        packed: true
+      }
+    }
+    field {
+      name: "f"
+      number: 4
+      label: LABEL_REPEATED
+      type: TYPE_FLOAT
+      options {
+        packed: true
+      }
+    }
+    field {
+      name: "b"
+      number: 5
+      label: LABEL_REPEATED
+      type: TYPE_BOOL
+      options {
+        packed: true
+      }
+    }
+    field {
+      name: "type"
+      number: 6
+      label: LABEL_REPEATED
+      type: TYPE_ENUM
+      type_name: ".tensorflow.DataType"
+      options {
+        packed: true
+      }
+    }
+    field {
+      name: "shape"
+      number: 7
+      label: LABEL_REPEATED
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.TensorShapeProto"
+    }
+    field {
+      name: "tensor"
+      number: 8
+      label: LABEL_REPEATED
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.TensorProto"
+    }
+    field {
+      name: "func"
+      number: 9
+      label: LABEL_REPEATED
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.NameAttrList"
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.-attr-value.pbtxt b/tensorflow/tools/api/golden/tensorflow.-attr-value.pbtxt
index e7a3a1f02faf104a03eecc4a45f5a54ab1a26f9a..6ccd64f428c3b87c807d0af82f67a884187f738c 100644
--- a/tensorflow/tools/api/golden/tensorflow.-attr-value.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.-attr-value.pbtxt
@@ -1,120 +1,151 @@
 path: "tensorflow.AttrValue"
-tf_class {
-  is_instance: "<class \'tensorflow.core.framework.attr_value_pb2.AttrValue\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "B_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "FUNC_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "F_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "I_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "LIST_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "ListValue"
-    mtype: "<class \'google.protobuf.pyext.cpp_message.GeneratedProtocolMessageType\'>"
-  }
-  member {
-    name: "PLACEHOLDER_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "SHAPE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "S_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TENSOR_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TYPE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "AttrValue"
+    field {
+      name: "s"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_BYTES
+      oneof_index: 0
+    }
+    field {
+      name: "i"
+      number: 3
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+      oneof_index: 0
+    }
+    field {
+      name: "f"
+      number: 4
+      label: LABEL_OPTIONAL
+      type: TYPE_FLOAT
+      oneof_index: 0
+    }
+    field {
+      name: "b"
+      number: 5
+      label: LABEL_OPTIONAL
+      type: TYPE_BOOL
+      oneof_index: 0
+    }
+    field {
+      name: "type"
+      number: 6
+      label: LABEL_OPTIONAL
+      type: TYPE_ENUM
+      type_name: ".tensorflow.DataType"
+      oneof_index: 0
+    }
+    field {
+      name: "shape"
+      number: 7
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.TensorShapeProto"
+      oneof_index: 0
+    }
+    field {
+      name: "tensor"
+      number: 8
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.TensorProto"
+      oneof_index: 0
+    }
+    field {
+      name: "list"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.AttrValue.ListValue"
+      oneof_index: 0
+    }
+    field {
+      name: "func"
+      number: 10
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.NameAttrList"
+      oneof_index: 0
+    }
+    field {
+      name: "placeholder"
+      number: 9
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+      oneof_index: 0
+    }
+    nested_type {
+      name: "ListValue"
+      field {
+        name: "s"
+        number: 2
+        label: LABEL_REPEATED
+        type: TYPE_BYTES
+      }
+      field {
+        name: "i"
+        number: 3
+        label: LABEL_REPEATED
+        type: TYPE_INT64
+        options {
+          packed: true
+        }
+      }
+      field {
+        name: "f"
+        number: 4
+        label: LABEL_REPEATED
+        type: TYPE_FLOAT
+        options {
+          packed: true
+        }
+      }
+      field {
+        name: "b"
+        number: 5
+        label: LABEL_REPEATED
+        type: TYPE_BOOL
+        options {
+          packed: true
+        }
+      }
+      field {
+        name: "type"
+        number: 6
+        label: LABEL_REPEATED
+        type: TYPE_ENUM
+        type_name: ".tensorflow.DataType"
+        options {
+          packed: true
+        }
+      }
+      field {
+        name: "shape"
+        number: 7
+        label: LABEL_REPEATED
+        type: TYPE_MESSAGE
+        type_name: ".tensorflow.TensorShapeProto"
+      }
+      field {
+        name: "tensor"
+        number: 8
+        label: LABEL_REPEATED
+        type: TYPE_MESSAGE
+        type_name: ".tensorflow.TensorProto"
+      }
+      field {
+        name: "func"
+        number: 9
+        label: LABEL_REPEATED
+        type: TYPE_MESSAGE
+        type_name: ".tensorflow.NameAttrList"
+      }
+    }
+    oneof_decl {
+      name: "value"
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.-config-proto.-device-count-entry.pbtxt b/tensorflow/tools/api/golden/tensorflow.-config-proto.-device-count-entry.pbtxt
index 29bb3be35cba5f261f44811c731ba4c1fc007612..d9b142682899bf5d9fd5d942437359adf8962466 100644
--- a/tensorflow/tools/api/golden/tensorflow.-config-proto.-device-count-entry.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.-config-proto.-device-count-entry.pbtxt
@@ -1,84 +1,21 @@
 path: "tensorflow.ConfigProto.DeviceCountEntry"
-tf_class {
-  is_instance: "<class \'tensorflow.core.protobuf.config_pb2.DeviceCountEntry\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "KEY_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "VALUE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "DeviceCountEntry"
+    field {
+      name: "key"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
+    field {
+      name: "value"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_INT32
+    }
+    options {
+      map_entry: true
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.-config-proto.-experimental.pbtxt b/tensorflow/tools/api/golden/tensorflow.-config-proto.-experimental.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..9e09a8d48ec7a501cb25a30163b5dae84b7c8655
--- /dev/null
+++ b/tensorflow/tools/api/golden/tensorflow.-config-proto.-experimental.pbtxt
@@ -0,0 +1,12 @@
+path: "tensorflow.ConfigProto.Experimental"
+tf_proto {
+  descriptor {
+    name: "Experimental"
+    field {
+      name: "collective_group_leader"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
+  }
+}
diff --git a/tensorflow/tools/api/golden/tensorflow.-config-proto.pbtxt b/tensorflow/tools/api/golden/tensorflow.-config-proto.pbtxt
index 009d64aed09ddcb47410d6ee6fb42fca42861ddd..4af4ed70ef0698e996905bcb3b2222380b8694d8 100644
--- a/tensorflow/tools/api/golden/tensorflow.-config-proto.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.-config-proto.pbtxt
@@ -1,140 +1,136 @@
 path: "tensorflow.ConfigProto"
-tf_class {
-  is_instance: "<class \'tensorflow.core.protobuf.config_pb2.ConfigProto\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "ALLOW_SOFT_PLACEMENT_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "CLUSTER_DEF_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "DEVICE_COUNT_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "DEVICE_FILTERS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "DeviceCountEntry"
-    mtype: "<class \'google.protobuf.pyext.cpp_message.GeneratedProtocolMessageType\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "GPU_OPTIONS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "GRAPH_OPTIONS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "INTER_OP_PARALLELISM_THREADS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "INTRA_OP_PARALLELISM_THREADS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "ISOLATE_SESSION_STATE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "LOG_DEVICE_PLACEMENT_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "OPERATION_TIMEOUT_IN_MS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "PLACEMENT_PERIOD_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "RPC_OPTIONS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "SESSION_INTER_OP_THREAD_POOL_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "USE_PER_SESSION_THREADS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "ConfigProto"
+    field {
+      name: "device_count"
+      number: 1
+      label: LABEL_REPEATED
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.ConfigProto.DeviceCountEntry"
+    }
+    field {
+      name: "intra_op_parallelism_threads"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_INT32
+    }
+    field {
+      name: "inter_op_parallelism_threads"
+      number: 5
+      label: LABEL_OPTIONAL
+      type: TYPE_INT32
+    }
+    field {
+      name: "use_per_session_threads"
+      number: 9
+      label: LABEL_OPTIONAL
+      type: TYPE_BOOL
+    }
+    field {
+      name: "session_inter_op_thread_pool"
+      number: 12
+      label: LABEL_REPEATED
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.ThreadPoolOptionProto"
+    }
+    field {
+      name: "placement_period"
+      number: 3
+      label: LABEL_OPTIONAL
+      type: TYPE_INT32
+    }
+    field {
+      name: "device_filters"
+      number: 4
+      label: LABEL_REPEATED
+      type: TYPE_STRING
+    }
+    field {
+      name: "gpu_options"
+      number: 6
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.GPUOptions"
+    }
+    field {
+      name: "allow_soft_placement"
+      number: 7
+      label: LABEL_OPTIONAL
+      type: TYPE_BOOL
+    }
+    field {
+      name: "log_device_placement"
+      number: 8
+      label: LABEL_OPTIONAL
+      type: TYPE_BOOL
+    }
+    field {
+      name: "graph_options"
+      number: 10
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.GraphOptions"
+    }
+    field {
+      name: "operation_timeout_in_ms"
+      number: 11
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "rpc_options"
+      number: 13
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.RPCOptions"
+    }
+    field {
+      name: "cluster_def"
+      number: 14
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.ClusterDef"
+    }
+    field {
+      name: "isolate_session_state"
+      number: 15
+      label: LABEL_OPTIONAL
+      type: TYPE_BOOL
+    }
+    field {
+      name: "experimental"
+      number: 16
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.ConfigProto.Experimental"
+    }
+    nested_type {
+      name: "DeviceCountEntry"
+      field {
+        name: "key"
+        number: 1
+        label: LABEL_OPTIONAL
+        type: TYPE_STRING
+      }
+      field {
+        name: "value"
+        number: 2
+        label: LABEL_OPTIONAL
+        type: TYPE_INT32
+      }
+      options {
+        map_entry: true
+      }
+    }
+    nested_type {
+      name: "Experimental"
+      field {
+        name: "collective_group_leader"
+        number: 1
+        label: LABEL_OPTIONAL
+        type: TYPE_STRING
+      }
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.-event.pbtxt b/tensorflow/tools/api/golden/tensorflow.-event.pbtxt
index 9bf8c124288854abc847a59db2c68b29759bfc7a..3b75a1735be76fe77689736e492c42c54ab795c1 100644
--- a/tensorflow/tools/api/golden/tensorflow.-event.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.-event.pbtxt
@@ -1,112 +1,74 @@
 path: "tensorflow.Event"
-tf_class {
-  is_instance: "<class \'tensorflow.core.util.event_pb2.Event\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "FILE_VERSION_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "GRAPH_DEF_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "LOG_MESSAGE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "META_GRAPH_DEF_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "SESSION_LOG_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "STEP_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "SUMMARY_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TAGGED_RUN_METADATA_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "WALL_TIME_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "Event"
+    field {
+      name: "wall_time"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_DOUBLE
+    }
+    field {
+      name: "step"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "file_version"
+      number: 3
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+      oneof_index: 0
+    }
+    field {
+      name: "graph_def"
+      number: 4
+      label: LABEL_OPTIONAL
+      type: TYPE_BYTES
+      oneof_index: 0
+    }
+    field {
+      name: "summary"
+      number: 5
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.Summary"
+      oneof_index: 0
+    }
+    field {
+      name: "log_message"
+      number: 6
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.LogMessage"
+      oneof_index: 0
+    }
+    field {
+      name: "session_log"
+      number: 7
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.SessionLog"
+      oneof_index: 0
+    }
+    field {
+      name: "tagged_run_metadata"
+      number: 8
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.TaggedRunMetadata"
+      oneof_index: 0
+    }
+    field {
+      name: "meta_graph_def"
+      number: 9
+      label: LABEL_OPTIONAL
+      type: TYPE_BYTES
+      oneof_index: 0
+    }
+    oneof_decl {
+      name: "what"
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.-g-p-u-options.pbtxt b/tensorflow/tools/api/golden/tensorflow.-g-p-u-options.pbtxt
index 875d802a9c458e299f73c130bb2b37c5d8828aad..f819b174c0b701153af4709fade9313efa7f7fb6 100644
--- a/tensorflow/tools/api/golden/tensorflow.-g-p-u-options.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.-g-p-u-options.pbtxt
@@ -1,116 +1,86 @@
 path: "tensorflow.GPUOptions"
-tf_class {
-  is_instance: "<class \'tensorflow.core.protobuf.config_pb2.GPUOptions\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "ALLOCATOR_TYPE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "ALLOW_GROWTH_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "DEFERRED_DELETION_BYTES_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "EXPERIMENTAL_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "Experimental"
-    mtype: "<class \'google.protobuf.pyext.cpp_message.GeneratedProtocolMessageType\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "FORCE_GPU_COMPATIBLE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "PER_PROCESS_GPU_MEMORY_FRACTION_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "POLLING_ACTIVE_DELAY_USECS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "POLLING_INACTIVE_DELAY_MSECS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "VISIBLE_DEVICE_LIST_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "GPUOptions"
+    field {
+      name: "per_process_gpu_memory_fraction"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_DOUBLE
+    }
+    field {
+      name: "allow_growth"
+      number: 4
+      label: LABEL_OPTIONAL
+      type: TYPE_BOOL
+    }
+    field {
+      name: "allocator_type"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
+    field {
+      name: "deferred_deletion_bytes"
+      number: 3
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "visible_device_list"
+      number: 5
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
+    field {
+      name: "polling_active_delay_usecs"
+      number: 6
+      label: LABEL_OPTIONAL
+      type: TYPE_INT32
+    }
+    field {
+      name: "polling_inactive_delay_msecs"
+      number: 7
+      label: LABEL_OPTIONAL
+      type: TYPE_INT32
+    }
+    field {
+      name: "force_gpu_compatible"
+      number: 8
+      label: LABEL_OPTIONAL
+      type: TYPE_BOOL
+    }
+    field {
+      name: "experimental"
+      number: 9
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.GPUOptions.Experimental"
+    }
+    nested_type {
+      name: "Experimental"
+      field {
+        name: "virtual_devices"
+        number: 1
+        label: LABEL_REPEATED
+        type: TYPE_MESSAGE
+        type_name: ".tensorflow.GPUOptions.Experimental.VirtualDevices"
+      }
+      field {
+        name: "use_unified_memory"
+        number: 2
+        label: LABEL_OPTIONAL
+        type: TYPE_BOOL
+      }
+      nested_type {
+        name: "VirtualDevices"
+        field {
+          name: "memory_limit_mb"
+          number: 1
+          label: LABEL_REPEATED
+          type: TYPE_FLOAT
+        }
+      }
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.-graph-def.pbtxt b/tensorflow/tools/api/golden/tensorflow.-graph-def.pbtxt
index 1495e847cb08ed39ee5e365744ab1d798c3eed41..19eccff03d24719d95ea84ccdad4014aa777ccd5 100644
--- a/tensorflow/tools/api/golden/tensorflow.-graph-def.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.-graph-def.pbtxt
@@ -1,92 +1,36 @@
 path: "tensorflow.GraphDef"
-tf_class {
-  is_instance: "<class \'tensorflow.core.framework.graph_pb2.GraphDef\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "LIBRARY_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "NODE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "VERSIONS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "VERSION_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "GraphDef"
+    field {
+      name: "node"
+      number: 1
+      label: LABEL_REPEATED
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.NodeDef"
+    }
+    field {
+      name: "versions"
+      number: 4
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.VersionDef"
+    }
+    field {
+      name: "version"
+      number: 3
+      label: LABEL_OPTIONAL
+      type: TYPE_INT32
+      options {
+        deprecated: true
+      }
+    }
+    field {
+      name: "library"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.FunctionDefLibrary"
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.-graph-options.pbtxt b/tensorflow/tools/api/golden/tensorflow.-graph-options.pbtxt
index 0844f891cad3d4ea798dec82d318e2bc53c53683..a9f99bc171cc3661031981f467f583b122e43476 100644
--- a/tensorflow/tools/api/golden/tensorflow.-graph-options.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.-graph-options.pbtxt
@@ -1,112 +1,67 @@
 path: "tensorflow.GraphOptions"
-tf_class {
-  is_instance: "<class \'tensorflow.core.protobuf.config_pb2.GraphOptions\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "BUILD_COST_MODEL_AFTER_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "BUILD_COST_MODEL_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "ENABLE_BFLOAT16_SENDRECV_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "ENABLE_RECV_SCHEDULING_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "INFER_SHAPES_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "OPTIMIZER_OPTIONS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "PLACE_PRUNED_GRAPH_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "REWRITE_OPTIONS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TIMELINE_STEP_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "GraphOptions"
+    field {
+      name: "enable_recv_scheduling"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_BOOL
+    }
+    field {
+      name: "optimizer_options"
+      number: 3
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.OptimizerOptions"
+    }
+    field {
+      name: "build_cost_model"
+      number: 4
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "build_cost_model_after"
+      number: 9
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "infer_shapes"
+      number: 5
+      label: LABEL_OPTIONAL
+      type: TYPE_BOOL
+    }
+    field {
+      name: "place_pruned_graph"
+      number: 6
+      label: LABEL_OPTIONAL
+      type: TYPE_BOOL
+    }
+    field {
+      name: "enable_bfloat16_sendrecv"
+      number: 7
+      label: LABEL_OPTIONAL
+      type: TYPE_BOOL
+    }
+    field {
+      name: "timeline_step"
+      number: 8
+      label: LABEL_OPTIONAL
+      type: TYPE_INT32
+    }
+    field {
+      name: "rewrite_options"
+      number: 10
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.RewriterConfig"
+    }
+    reserved_range {
+      start: 1
+      end: 2
+    }
+    reserved_name: "skip_common_subexpression_elimination"
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.-histogram-proto.pbtxt b/tensorflow/tools/api/golden/tensorflow.-histogram-proto.pbtxt
index 2567d2fe60293833b340d790ac1110f91d018107..d4402f330b8a28eaa61eb2b74c9ca412dce06b62 100644
--- a/tensorflow/tools/api/golden/tensorflow.-histogram-proto.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.-histogram-proto.pbtxt
@@ -1,104 +1,54 @@
 path: "tensorflow.HistogramProto"
-tf_class {
-  is_instance: "<class \'tensorflow.core.framework.summary_pb2.HistogramProto\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "BUCKET_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "BUCKET_LIMIT_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "MAX_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "MIN_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "NUM_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "SUM_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "SUM_SQUARES_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "HistogramProto"
+    field {
+      name: "min"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_DOUBLE
+    }
+    field {
+      name: "max"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_DOUBLE
+    }
+    field {
+      name: "num"
+      number: 3
+      label: LABEL_OPTIONAL
+      type: TYPE_DOUBLE
+    }
+    field {
+      name: "sum"
+      number: 4
+      label: LABEL_OPTIONAL
+      type: TYPE_DOUBLE
+    }
+    field {
+      name: "sum_squares"
+      number: 5
+      label: LABEL_OPTIONAL
+      type: TYPE_DOUBLE
+    }
+    field {
+      name: "bucket_limit"
+      number: 6
+      label: LABEL_REPEATED
+      type: TYPE_DOUBLE
+      options {
+        packed: true
+      }
+    }
+    field {
+      name: "bucket"
+      number: 7
+      label: LABEL_REPEATED
+      type: TYPE_DOUBLE
+      options {
+        packed: true
+      }
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.-log-message.pbtxt b/tensorflow/tools/api/golden/tensorflow.-log-message.pbtxt
index a43c5eb7e30c3c2b025e750de5786ef4338e4ffc..5023aa96bf3b4f3f550421db5f41872d9f62b70d 100644
--- a/tensorflow/tools/api/golden/tensorflow.-log-message.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.-log-message.pbtxt
@@ -1,112 +1,46 @@
 path: "tensorflow.LogMessage"
-tf_class {
-  is_instance: "<class \'tensorflow.core.util.event_pb2.LogMessage\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "DEBUGGING"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "ERROR"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "FATAL"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "INFO"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "LEVEL_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "Level"
-    mtype: "<class \'google.protobuf.internal.enum_type_wrapper.EnumTypeWrapper\'>"
-  }
-  member {
-    name: "MESSAGE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "UNKNOWN"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "WARN"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "LogMessage"
+    field {
+      name: "level"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_ENUM
+      type_name: ".tensorflow.LogMessage.Level"
+    }
+    field {
+      name: "message"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
+    enum_type {
+      name: "Level"
+      value {
+        name: "UNKNOWN"
+        number: 0
+      }
+      value {
+        name: "DEBUGGING"
+        number: 10
+      }
+      value {
+        name: "INFO"
+        number: 20
+      }
+      value {
+        name: "WARN"
+        number: 30
+      }
+      value {
+        name: "ERROR"
+        number: 40
+      }
+      value {
+        name: "FATAL"
+        number: 50
+      }
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.-meta-graph-def.-collection-def-entry.pbtxt b/tensorflow/tools/api/golden/tensorflow.-meta-graph-def.-collection-def-entry.pbtxt
index 3572126fbfd77dbefc2ecb0246bd732f0c9aec63..0ba09bec4b3fa6e9eaf59978beaa958ebc038b4c 100644
--- a/tensorflow/tools/api/golden/tensorflow.-meta-graph-def.-collection-def-entry.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.-meta-graph-def.-collection-def-entry.pbtxt
@@ -1,84 +1,22 @@
 path: "tensorflow.MetaGraphDef.CollectionDefEntry"
-tf_class {
-  is_instance: "<class \'tensorflow.core.protobuf.meta_graph_pb2.CollectionDefEntry\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "KEY_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "VALUE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "CollectionDefEntry"
+    field {
+      name: "key"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
+    field {
+      name: "value"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.CollectionDef"
+    }
+    options {
+      map_entry: true
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.-meta-graph-def.-meta-info-def.pbtxt b/tensorflow/tools/api/golden/tensorflow.-meta-graph-def.-meta-info-def.pbtxt
index b0e983115499c5b5b79459affc931600ad16256b..41c62a407b8577288016f2376c35ba6ec1c3c1ca 100644
--- a/tensorflow/tools/api/golden/tensorflow.-meta-graph-def.-meta-info-def.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.-meta-graph-def.-meta-info-def.pbtxt
@@ -1,104 +1,50 @@
 path: "tensorflow.MetaGraphDef.MetaInfoDef"
-tf_class {
-  is_instance: "<class \'tensorflow.core.protobuf.meta_graph_pb2.MetaInfoDef\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "ANY_INFO_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "META_GRAPH_VERSION_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "STRIPPED_DEFAULT_ATTRS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "STRIPPED_OP_LIST_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TAGS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TENSORFLOW_GIT_VERSION_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TENSORFLOW_VERSION_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "MetaInfoDef"
+    field {
+      name: "meta_graph_version"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
+    field {
+      name: "stripped_op_list"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.OpList"
+    }
+    field {
+      name: "any_info"
+      number: 3
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".google.protobuf.Any"
+    }
+    field {
+      name: "tags"
+      number: 4
+      label: LABEL_REPEATED
+      type: TYPE_STRING
+    }
+    field {
+      name: "tensorflow_version"
+      number: 5
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
+    field {
+      name: "tensorflow_git_version"
+      number: 6
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
+    field {
+      name: "stripped_default_attrs"
+      number: 7
+      label: LABEL_OPTIONAL
+      type: TYPE_BOOL
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.-meta-graph-def.-signature-def-entry.pbtxt b/tensorflow/tools/api/golden/tensorflow.-meta-graph-def.-signature-def-entry.pbtxt
index 48fccac99d60b5035e207dc3ddf10054c70c5e61..73dc414a779ded3d1f896e743b7f1f1a443352f0 100644
--- a/tensorflow/tools/api/golden/tensorflow.-meta-graph-def.-signature-def-entry.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.-meta-graph-def.-signature-def-entry.pbtxt
@@ -1,84 +1,22 @@
 path: "tensorflow.MetaGraphDef.SignatureDefEntry"
-tf_class {
-  is_instance: "<class \'tensorflow.core.protobuf.meta_graph_pb2.SignatureDefEntry\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "KEY_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "VALUE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "SignatureDefEntry"
+    field {
+      name: "key"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
+    field {
+      name: "value"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.SignatureDef"
+    }
+    options {
+      map_entry: true
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.-meta-graph-def.pbtxt b/tensorflow/tools/api/golden/tensorflow.-meta-graph-def.pbtxt
index 3e683a8715992357c1a2e744cb3cef510ce966ae..d71c2358c93e9597726665fdf8f92e648b2ea772 100644
--- a/tensorflow/tools/api/golden/tensorflow.-meta-graph-def.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.-meta-graph-def.pbtxt
@@ -1,112 +1,133 @@
 path: "tensorflow.MetaGraphDef"
-tf_class {
-  is_instance: "<class \'tensorflow.core.protobuf.meta_graph_pb2.MetaGraphDef\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "ASSET_FILE_DEF_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "COLLECTION_DEF_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "CollectionDefEntry"
-    mtype: "<class \'google.protobuf.pyext.cpp_message.GeneratedProtocolMessageType\'>"
-  }
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "GRAPH_DEF_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "META_INFO_DEF_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "MetaInfoDef"
-    mtype: "<class \'google.protobuf.pyext.cpp_message.GeneratedProtocolMessageType\'>"
-  }
-  member {
-    name: "SAVER_DEF_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "SIGNATURE_DEF_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "SignatureDefEntry"
-    mtype: "<class \'google.protobuf.pyext.cpp_message.GeneratedProtocolMessageType\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "MetaGraphDef"
+    field {
+      name: "meta_info_def"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.MetaGraphDef.MetaInfoDef"
+    }
+    field {
+      name: "graph_def"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.GraphDef"
+    }
+    field {
+      name: "saver_def"
+      number: 3
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.SaverDef"
+    }
+    field {
+      name: "collection_def"
+      number: 4
+      label: LABEL_REPEATED
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.MetaGraphDef.CollectionDefEntry"
+    }
+    field {
+      name: "signature_def"
+      number: 5
+      label: LABEL_REPEATED
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.MetaGraphDef.SignatureDefEntry"
+    }
+    field {
+      name: "asset_file_def"
+      number: 6
+      label: LABEL_REPEATED
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.AssetFileDef"
+    }
+    nested_type {
+      name: "MetaInfoDef"
+      field {
+        name: "meta_graph_version"
+        number: 1
+        label: LABEL_OPTIONAL
+        type: TYPE_STRING
+      }
+      field {
+        name: "stripped_op_list"
+        number: 2
+        label: LABEL_OPTIONAL
+        type: TYPE_MESSAGE
+        type_name: ".tensorflow.OpList"
+      }
+      field {
+        name: "any_info"
+        number: 3
+        label: LABEL_OPTIONAL
+        type: TYPE_MESSAGE
+        type_name: ".google.protobuf.Any"
+      }
+      field {
+        name: "tags"
+        number: 4
+        label: LABEL_REPEATED
+        type: TYPE_STRING
+      }
+      field {
+        name: "tensorflow_version"
+        number: 5
+        label: LABEL_OPTIONAL
+        type: TYPE_STRING
+      }
+      field {
+        name: "tensorflow_git_version"
+        number: 6
+        label: LABEL_OPTIONAL
+        type: TYPE_STRING
+      }
+      field {
+        name: "stripped_default_attrs"
+        number: 7
+        label: LABEL_OPTIONAL
+        type: TYPE_BOOL
+      }
+    }
+    nested_type {
+      name: "CollectionDefEntry"
+      field {
+        name: "key"
+        number: 1
+        label: LABEL_OPTIONAL
+        type: TYPE_STRING
+      }
+      field {
+        name: "value"
+        number: 2
+        label: LABEL_OPTIONAL
+        type: TYPE_MESSAGE
+        type_name: ".tensorflow.CollectionDef"
+      }
+      options {
+        map_entry: true
+      }
+    }
+    nested_type {
+      name: "SignatureDefEntry"
+      field {
+        name: "key"
+        number: 1
+        label: LABEL_OPTIONAL
+        type: TYPE_STRING
+      }
+      field {
+        name: "value"
+        number: 2
+        label: LABEL_OPTIONAL
+        type: TYPE_MESSAGE
+        type_name: ".tensorflow.SignatureDef"
+      }
+      options {
+        map_entry: true
+      }
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.-name-attr-list.-attr-entry.pbtxt b/tensorflow/tools/api/golden/tensorflow.-name-attr-list.-attr-entry.pbtxt
index 2750bd780caa418f933ada2073c5e8d0475c2a33..b119b208772199e5c3596be142f3e0f62d3ed50e 100644
--- a/tensorflow/tools/api/golden/tensorflow.-name-attr-list.-attr-entry.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.-name-attr-list.-attr-entry.pbtxt
@@ -1,84 +1,22 @@
 path: "tensorflow.NameAttrList.AttrEntry"
-tf_class {
-  is_instance: "<class \'tensorflow.core.framework.attr_value_pb2.AttrEntry\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "KEY_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "VALUE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "AttrEntry"
+    field {
+      name: "key"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
+    field {
+      name: "value"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.AttrValue"
+    }
+    options {
+      map_entry: true
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.-name-attr-list.pbtxt b/tensorflow/tools/api/golden/tensorflow.-name-attr-list.pbtxt
index d10faf67d027a4dc8c7a32ec31ea22773104508a..fcdb411ffce9b68ac28696f86ca11a47f9e64e8f 100644
--- a/tensorflow/tools/api/golden/tensorflow.-name-attr-list.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.-name-attr-list.pbtxt
@@ -1,88 +1,38 @@
 path: "tensorflow.NameAttrList"
-tf_class {
-  is_instance: "<class \'tensorflow.core.framework.attr_value_pb2.NameAttrList\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "ATTR_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "AttrEntry"
-    mtype: "<class \'google.protobuf.pyext.cpp_message.GeneratedProtocolMessageType\'>"
-  }
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "NAME_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "NameAttrList"
+    field {
+      name: "name"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
+    field {
+      name: "attr"
+      number: 2
+      label: LABEL_REPEATED
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.NameAttrList.AttrEntry"
+    }
+    nested_type {
+      name: "AttrEntry"
+      field {
+        name: "key"
+        number: 1
+        label: LABEL_OPTIONAL
+        type: TYPE_STRING
+      }
+      field {
+        name: "value"
+        number: 2
+        label: LABEL_OPTIONAL
+        type: TYPE_MESSAGE
+        type_name: ".tensorflow.AttrValue"
+      }
+      options {
+        map_entry: true
+      }
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.-node-def.-attr-entry.pbtxt b/tensorflow/tools/api/golden/tensorflow.-node-def.-attr-entry.pbtxt
index b1b62d60f1e8c95a5e8cc13bc8162cf1de087195..622e4c3d0f60ce4842a6fd4cc421551aa795fcbf 100644
--- a/tensorflow/tools/api/golden/tensorflow.-node-def.-attr-entry.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.-node-def.-attr-entry.pbtxt
@@ -1,84 +1,22 @@
 path: "tensorflow.NodeDef.AttrEntry"
-tf_class {
-  is_instance: "<class \'tensorflow.core.framework.node_def_pb2.AttrEntry\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "KEY_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "VALUE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "AttrEntry"
+    field {
+      name: "key"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
+    field {
+      name: "value"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.AttrValue"
+    }
+    options {
+      map_entry: true
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.-node-def.pbtxt b/tensorflow/tools/api/golden/tensorflow.-node-def.pbtxt
index b812b4df2b3c15af3c2c81944a82d9878865b8fb..646fa8abb9b22dbd908ff821cbe66a33ad02ba64 100644
--- a/tensorflow/tools/api/golden/tensorflow.-node-def.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.-node-def.pbtxt
@@ -1,100 +1,56 @@
 path: "tensorflow.NodeDef"
-tf_class {
-  is_instance: "<class \'tensorflow.core.framework.node_def_pb2.NodeDef\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "ATTR_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "AttrEntry"
-    mtype: "<class \'google.protobuf.pyext.cpp_message.GeneratedProtocolMessageType\'>"
-  }
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "DEVICE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "INPUT_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "NAME_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "OP_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "NodeDef"
+    field {
+      name: "name"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
+    field {
+      name: "op"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
+    field {
+      name: "input"
+      number: 3
+      label: LABEL_REPEATED
+      type: TYPE_STRING
+    }
+    field {
+      name: "device"
+      number: 4
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
+    field {
+      name: "attr"
+      number: 5
+      label: LABEL_REPEATED
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.NodeDef.AttrEntry"
+    }
+    nested_type {
+      name: "AttrEntry"
+      field {
+        name: "key"
+        number: 1
+        label: LABEL_OPTIONAL
+        type: TYPE_STRING
+      }
+      field {
+        name: "value"
+        number: 2
+        label: LABEL_OPTIONAL
+        type: TYPE_MESSAGE
+        type_name: ".tensorflow.AttrValue"
+      }
+      options {
+        map_entry: true
+      }
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.-optimizer-options.pbtxt b/tensorflow/tools/api/golden/tensorflow.-optimizer-options.pbtxt
index 6cac5c4d99fd7537b8fa852013ab348344be3f7e..3ccf9d459b133b48e5456f02e4780ade8d3042c8 100644
--- a/tensorflow/tools/api/golden/tensorflow.-optimizer-options.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.-optimizer-options.pbtxt
@@ -1,132 +1,74 @@
 path: "tensorflow.OptimizerOptions"
-tf_class {
-  is_instance: "<class \'tensorflow.core.protobuf.config_pb2.OptimizerOptions\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "DEFAULT"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "DO_COMMON_SUBEXPRESSION_ELIMINATION_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "DO_CONSTANT_FOLDING_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "DO_FUNCTION_INLINING_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "GLOBAL_JIT_LEVEL_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "GlobalJitLevel"
-    mtype: "<class \'google.protobuf.internal.enum_type_wrapper.EnumTypeWrapper\'>"
-  }
-  member {
-    name: "L0"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "L1"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "Level"
-    mtype: "<class \'google.protobuf.internal.enum_type_wrapper.EnumTypeWrapper\'>"
-  }
-  member {
-    name: "MAX_FOLDED_CONSTANT_IN_BYTES_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "OFF"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "ON_1"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "ON_2"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "OPT_LEVEL_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "OptimizerOptions"
+    field {
+      name: "do_common_subexpression_elimination"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_BOOL
+    }
+    field {
+      name: "do_constant_folding"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_BOOL
+    }
+    field {
+      name: "max_folded_constant_in_bytes"
+      number: 6
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "do_function_inlining"
+      number: 4
+      label: LABEL_OPTIONAL
+      type: TYPE_BOOL
+    }
+    field {
+      name: "opt_level"
+      number: 3
+      label: LABEL_OPTIONAL
+      type: TYPE_ENUM
+      type_name: ".tensorflow.OptimizerOptions.Level"
+    }
+    field {
+      name: "global_jit_level"
+      number: 5
+      label: LABEL_OPTIONAL
+      type: TYPE_ENUM
+      type_name: ".tensorflow.OptimizerOptions.GlobalJitLevel"
+    }
+    enum_type {
+      name: "Level"
+      value {
+        name: "L1"
+        number: 0
+      }
+      value {
+        name: "L0"
+        number: -1
+      }
+    }
+    enum_type {
+      name: "GlobalJitLevel"
+      value {
+        name: "DEFAULT"
+        number: 0
+      }
+      value {
+        name: "OFF"
+        number: -1
+      }
+      value {
+        name: "ON_1"
+        number: 1
+      }
+      value {
+        name: "ON_2"
+        number: 2
+      }
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.-run-metadata.pbtxt b/tensorflow/tools/api/golden/tensorflow.-run-metadata.pbtxt
index 808fa0fa217a407b2c86459b32fcef46b96afa5c..1287940326c0196e76fff2cf6363622226092504 100644
--- a/tensorflow/tools/api/golden/tensorflow.-run-metadata.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.-run-metadata.pbtxt
@@ -1,88 +1,27 @@
 path: "tensorflow.RunMetadata"
-tf_class {
-  is_instance: "<class \'tensorflow.core.protobuf.config_pb2.RunMetadata\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "COST_GRAPH_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "PARTITION_GRAPHS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "STEP_STATS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "RunMetadata"
+    field {
+      name: "step_stats"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.StepStats"
+    }
+    field {
+      name: "cost_graph"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.CostGraphDef"
+    }
+    field {
+      name: "partition_graphs"
+      number: 3
+      label: LABEL_REPEATED
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.GraphDef"
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.-run-options.-experimental.pbtxt b/tensorflow/tools/api/golden/tensorflow.-run-options.-experimental.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..537e73aa8969905c108a59688cfd99793ce211f0
--- /dev/null
+++ b/tensorflow/tools/api/golden/tensorflow.-run-options.-experimental.pbtxt
@@ -0,0 +1,12 @@
+path: "tensorflow.RunOptions.Experimental"
+tf_proto {
+  descriptor {
+    name: "Experimental"
+    field {
+      name: "collective_graph_key"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+  }
+}
diff --git a/tensorflow/tools/api/golden/tensorflow.-run-options.pbtxt b/tensorflow/tools/api/golden/tensorflow.-run-options.pbtxt
index 2f3e7f1a847dd3609f06b1af535be6f5968edfaf..cec04a2bf0962455495340da001214914cc8bb36 100644
--- a/tensorflow/tools/api/golden/tensorflow.-run-options.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.-run-options.pbtxt
@@ -1,120 +1,83 @@
 path: "tensorflow.RunOptions"
-tf_class {
-  is_instance: "<class \'tensorflow.core.protobuf.config_pb2.RunOptions\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "DEBUG_OPTIONS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "FULL_TRACE"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "HARDWARE_TRACE"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "INTER_OP_THREAD_POOL_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "NO_TRACE"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "OUTPUT_PARTITION_GRAPHS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "REPORT_TENSOR_ALLOCATIONS_UPON_OOM_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "SOFTWARE_TRACE"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TIMEOUT_IN_MS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TRACE_LEVEL_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TraceLevel"
-    mtype: "<class \'google.protobuf.internal.enum_type_wrapper.EnumTypeWrapper\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "RunOptions"
+    field {
+      name: "trace_level"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_ENUM
+      type_name: ".tensorflow.RunOptions.TraceLevel"
+    }
+    field {
+      name: "timeout_in_ms"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "inter_op_thread_pool"
+      number: 3
+      label: LABEL_OPTIONAL
+      type: TYPE_INT32
+    }
+    field {
+      name: "output_partition_graphs"
+      number: 5
+      label: LABEL_OPTIONAL
+      type: TYPE_BOOL
+    }
+    field {
+      name: "debug_options"
+      number: 6
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.DebugOptions"
+    }
+    field {
+      name: "report_tensor_allocations_upon_oom"
+      number: 7
+      label: LABEL_OPTIONAL
+      type: TYPE_BOOL
+    }
+    field {
+      name: "experimental"
+      number: 8
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.RunOptions.Experimental"
+    }
+    nested_type {
+      name: "Experimental"
+      field {
+        name: "collective_graph_key"
+        number: 1
+        label: LABEL_OPTIONAL
+        type: TYPE_INT64
+      }
+    }
+    enum_type {
+      name: "TraceLevel"
+      value {
+        name: "NO_TRACE"
+        number: 0
+      }
+      value {
+        name: "SOFTWARE_TRACE"
+        number: 1
+      }
+      value {
+        name: "HARDWARE_TRACE"
+        number: 2
+      }
+      value {
+        name: "FULL_TRACE"
+        number: 3
+      }
+    }
+    reserved_range {
+      start: 4
+      end: 5
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.-session-log.pbtxt b/tensorflow/tools/api/golden/tensorflow.-session-log.pbtxt
index ec66d7f3354083f953066e33dff73ba9c185fc16..259f2418740cbfe47cdb4bd871d4f5c6306d25f5 100644
--- a/tensorflow/tools/api/golden/tensorflow.-session-log.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.-session-log.pbtxt
@@ -1,108 +1,44 @@
 path: "tensorflow.SessionLog"
-tf_class {
-  is_instance: "<class \'tensorflow.core.util.event_pb2.SessionLog\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "CHECKPOINT"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "CHECKPOINT_PATH_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "MSG_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "START"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "STATUS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "STATUS_UNSPECIFIED"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "STOP"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "SessionStatus"
-    mtype: "<class \'google.protobuf.internal.enum_type_wrapper.EnumTypeWrapper\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "SessionLog"
+    field {
+      name: "status"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_ENUM
+      type_name: ".tensorflow.SessionLog.SessionStatus"
+    }
+    field {
+      name: "checkpoint_path"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
+    field {
+      name: "msg"
+      number: 3
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
+    enum_type {
+      name: "SessionStatus"
+      value {
+        name: "STATUS_UNSPECIFIED"
+        number: 0
+      }
+      value {
+        name: "START"
+        number: 1
+      }
+      value {
+        name: "STOP"
+        number: 2
+      }
+      value {
+        name: "CHECKPOINT"
+        number: 3
+      }
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.-summary-metadata.-plugin-data.pbtxt b/tensorflow/tools/api/golden/tensorflow.-summary-metadata.-plugin-data.pbtxt
index 067f02ce8cbb1a1f6e65758f37bb1d36927fad98..a66b74b315c6132e8f884bd52e7a3b5bd7f52ccd 100644
--- a/tensorflow/tools/api/golden/tensorflow.-summary-metadata.-plugin-data.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.-summary-metadata.-plugin-data.pbtxt
@@ -1,84 +1,18 @@
 path: "tensorflow.SummaryMetadata.PluginData"
-tf_class {
-  is_instance: "<class \'tensorflow.core.framework.summary_pb2.PluginData\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "CONTENT_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "PLUGIN_NAME_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "PluginData"
+    field {
+      name: "plugin_name"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
+    field {
+      name: "content"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_BYTES
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.-summary-metadata.pbtxt b/tensorflow/tools/api/golden/tensorflow.-summary-metadata.pbtxt
index b9156521ccbee25486113a82ddec1053f8b32e3b..c02575b9626c848e9b871d2cc6febb26a5142f08 100644
--- a/tensorflow/tools/api/golden/tensorflow.-summary-metadata.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.-summary-metadata.pbtxt
@@ -1,92 +1,40 @@
 path: "tensorflow.SummaryMetadata"
-tf_class {
-  is_instance: "<class \'tensorflow.core.framework.summary_pb2.SummaryMetadata\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "DISPLAY_NAME_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "PLUGIN_DATA_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "PluginData"
-    mtype: "<class \'google.protobuf.pyext.cpp_message.GeneratedProtocolMessageType\'>"
-  }
-  member {
-    name: "SUMMARY_DESCRIPTION_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "SummaryMetadata"
+    field {
+      name: "plugin_data"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.SummaryMetadata.PluginData"
+    }
+    field {
+      name: "display_name"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
+    field {
+      name: "summary_description"
+      number: 3
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
+    nested_type {
+      name: "PluginData"
+      field {
+        name: "plugin_name"
+        number: 1
+        label: LABEL_OPTIONAL
+        type: TYPE_STRING
+      }
+      field {
+        name: "content"
+        number: 2
+        label: LABEL_OPTIONAL
+        type: TYPE_BYTES
+      }
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.-summary.-audio.pbtxt b/tensorflow/tools/api/golden/tensorflow.-summary.-audio.pbtxt
index 781010d75e23c16624b193e9f1041b6d58eef34e..94f712073e0d0dda201fcf7adba849dd45a1229b 100644
--- a/tensorflow/tools/api/golden/tensorflow.-summary.-audio.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.-summary.-audio.pbtxt
@@ -1,96 +1,36 @@
 path: "tensorflow.Summary.Audio"
-tf_class {
-  is_instance: "<class \'tensorflow.core.framework.summary_pb2.Audio\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "CONTENT_TYPE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "ENCODED_AUDIO_STRING_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "LENGTH_FRAMES_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "NUM_CHANNELS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "SAMPLE_RATE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "Audio"
+    field {
+      name: "sample_rate"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_FLOAT
+    }
+    field {
+      name: "num_channels"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "length_frames"
+      number: 3
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "encoded_audio_string"
+      number: 4
+      label: LABEL_OPTIONAL
+      type: TYPE_BYTES
+    }
+    field {
+      name: "content_type"
+      number: 5
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.-summary.-image.pbtxt b/tensorflow/tools/api/golden/tensorflow.-summary.-image.pbtxt
index feb9c7ee9270a7d64cf228dffeb1187fbd225704..fc1acb483b3051cba01f5d9bc8501a61965bbc37 100644
--- a/tensorflow/tools/api/golden/tensorflow.-summary.-image.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.-summary.-image.pbtxt
@@ -1,92 +1,30 @@
 path: "tensorflow.Summary.Image"
-tf_class {
-  is_instance: "<class \'tensorflow.core.framework.summary_pb2.Image\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "COLORSPACE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "ENCODED_IMAGE_STRING_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "HEIGHT_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "WIDTH_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "Image"
+    field {
+      name: "height"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_INT32
+    }
+    field {
+      name: "width"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_INT32
+    }
+    field {
+      name: "colorspace"
+      number: 3
+      label: LABEL_OPTIONAL
+      type: TYPE_INT32
+    }
+    field {
+      name: "encoded_image_string"
+      number: 4
+      label: LABEL_OPTIONAL
+      type: TYPE_BYTES
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.-summary.-value.pbtxt b/tensorflow/tools/api/golden/tensorflow.-summary.-value.pbtxt
index ffb4f45fc5e2a9db35b57c36a23de5bb74a9517d..feb84b6ee996549ac58aa0e8a4ac560f947b6339 100644
--- a/tensorflow/tools/api/golden/tensorflow.-summary.-value.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.-summary.-value.pbtxt
@@ -1,112 +1,74 @@
 path: "tensorflow.Summary.Value"
-tf_class {
-  is_instance: "<class \'tensorflow.core.framework.summary_pb2.Value\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "AUDIO_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "HISTO_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "IMAGE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "METADATA_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "NODE_NAME_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "OBSOLETE_OLD_STYLE_HISTOGRAM_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "SIMPLE_VALUE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TAG_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TENSOR_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "Value"
+    field {
+      name: "node_name"
+      number: 7
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
+    field {
+      name: "tag"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
+    field {
+      name: "metadata"
+      number: 9
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.SummaryMetadata"
+    }
+    field {
+      name: "simple_value"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_FLOAT
+      oneof_index: 0
+    }
+    field {
+      name: "obsolete_old_style_histogram"
+      number: 3
+      label: LABEL_OPTIONAL
+      type: TYPE_BYTES
+      oneof_index: 0
+    }
+    field {
+      name: "image"
+      number: 4
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.Summary.Image"
+      oneof_index: 0
+    }
+    field {
+      name: "histo"
+      number: 5
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.HistogramProto"
+      oneof_index: 0
+    }
+    field {
+      name: "audio"
+      number: 6
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.Summary.Audio"
+      oneof_index: 0
+    }
+    field {
+      name: "tensor"
+      number: 8
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.TensorProto"
+      oneof_index: 0
+    }
+    oneof_decl {
+      name: "value"
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.-summary.pbtxt b/tensorflow/tools/api/golden/tensorflow.-summary.pbtxt
index 38de17fa9e52b87d19413a64271b70755e604610..b2bdff7171804aae114d1e3631e3074b1e4006ba 100644
--- a/tensorflow/tools/api/golden/tensorflow.-summary.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.-summary.pbtxt
@@ -1,92 +1,144 @@
 path: "tensorflow.Summary"
-tf_class {
-  is_instance: "<class \'tensorflow.core.framework.summary_pb2.Summary\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "Audio"
-    mtype: "<class \'google.protobuf.pyext.cpp_message.GeneratedProtocolMessageType\'>"
-  }
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "Image"
-    mtype: "<class \'google.protobuf.pyext.cpp_message.GeneratedProtocolMessageType\'>"
-  }
-  member {
-    name: "VALUE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "Value"
-    mtype: "<class \'google.protobuf.pyext.cpp_message.GeneratedProtocolMessageType\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "Summary"
+    field {
+      name: "value"
+      number: 1
+      label: LABEL_REPEATED
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.Summary.Value"
+    }
+    nested_type {
+      name: "Image"
+      field {
+        name: "height"
+        number: 1
+        label: LABEL_OPTIONAL
+        type: TYPE_INT32
+      }
+      field {
+        name: "width"
+        number: 2
+        label: LABEL_OPTIONAL
+        type: TYPE_INT32
+      }
+      field {
+        name: "colorspace"
+        number: 3
+        label: LABEL_OPTIONAL
+        type: TYPE_INT32
+      }
+      field {
+        name: "encoded_image_string"
+        number: 4
+        label: LABEL_OPTIONAL
+        type: TYPE_BYTES
+      }
+    }
+    nested_type {
+      name: "Audio"
+      field {
+        name: "sample_rate"
+        number: 1
+        label: LABEL_OPTIONAL
+        type: TYPE_FLOAT
+      }
+      field {
+        name: "num_channels"
+        number: 2
+        label: LABEL_OPTIONAL
+        type: TYPE_INT64
+      }
+      field {
+        name: "length_frames"
+        number: 3
+        label: LABEL_OPTIONAL
+        type: TYPE_INT64
+      }
+      field {
+        name: "encoded_audio_string"
+        number: 4
+        label: LABEL_OPTIONAL
+        type: TYPE_BYTES
+      }
+      field {
+        name: "content_type"
+        number: 5
+        label: LABEL_OPTIONAL
+        type: TYPE_STRING
+      }
+    }
+    nested_type {
+      name: "Value"
+      field {
+        name: "node_name"
+        number: 7
+        label: LABEL_OPTIONAL
+        type: TYPE_STRING
+      }
+      field {
+        name: "tag"
+        number: 1
+        label: LABEL_OPTIONAL
+        type: TYPE_STRING
+      }
+      field {
+        name: "metadata"
+        number: 9
+        label: LABEL_OPTIONAL
+        type: TYPE_MESSAGE
+        type_name: ".tensorflow.SummaryMetadata"
+      }
+      field {
+        name: "simple_value"
+        number: 2
+        label: LABEL_OPTIONAL
+        type: TYPE_FLOAT
+        oneof_index: 0
+      }
+      field {
+        name: "obsolete_old_style_histogram"
+        number: 3
+        label: LABEL_OPTIONAL
+        type: TYPE_BYTES
+        oneof_index: 0
+      }
+      field {
+        name: "image"
+        number: 4
+        label: LABEL_OPTIONAL
+        type: TYPE_MESSAGE
+        type_name: ".tensorflow.Summary.Image"
+        oneof_index: 0
+      }
+      field {
+        name: "histo"
+        number: 5
+        label: LABEL_OPTIONAL
+        type: TYPE_MESSAGE
+        type_name: ".tensorflow.HistogramProto"
+        oneof_index: 0
+      }
+      field {
+        name: "audio"
+        number: 6
+        label: LABEL_OPTIONAL
+        type: TYPE_MESSAGE
+        type_name: ".tensorflow.Summary.Audio"
+        oneof_index: 0
+      }
+      field {
+        name: "tensor"
+        number: 8
+        label: LABEL_OPTIONAL
+        type: TYPE_MESSAGE
+        type_name: ".tensorflow.TensorProto"
+        oneof_index: 0
+      }
+      oneof_decl {
+        name: "value"
+      }
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.-tensor-info.-coo-sparse.pbtxt b/tensorflow/tools/api/golden/tensorflow.-tensor-info.-coo-sparse.pbtxt
index 425c35e0674610904c65a2876b1a184a30355682..0064c8460cb374f1e3f108085a2efed4131dd205 100644
--- a/tensorflow/tools/api/golden/tensorflow.-tensor-info.-coo-sparse.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.-tensor-info.-coo-sparse.pbtxt
@@ -1,88 +1,24 @@
 path: "tensorflow.TensorInfo.CooSparse"
-tf_class {
-  is_instance: "<class \'tensorflow.core.protobuf.meta_graph_pb2.CooSparse\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "DENSE_SHAPE_TENSOR_NAME_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "INDICES_TENSOR_NAME_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "VALUES_TENSOR_NAME_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "CooSparse"
+    field {
+      name: "values_tensor_name"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
+    field {
+      name: "indices_tensor_name"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
+    field {
+      name: "dense_shape_tensor_name"
+      number: 3
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.-tensor-info.pbtxt b/tensorflow/tools/api/golden/tensorflow.-tensor-info.pbtxt
index 41ea393be51bd77acb2136affc203d7df332064d..63566c808e55cb4d3b630f0a017fa3a2c8a30de3 100644
--- a/tensorflow/tools/api/golden/tensorflow.-tensor-info.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.-tensor-info.pbtxt
@@ -1,96 +1,59 @@
 path: "tensorflow.TensorInfo"
-tf_class {
-  is_instance: "<class \'tensorflow.core.protobuf.meta_graph_pb2.TensorInfo\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "COO_SPARSE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "CooSparse"
-    mtype: "<class \'google.protobuf.pyext.cpp_message.GeneratedProtocolMessageType\'>"
-  }
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "DTYPE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "NAME_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TENSOR_SHAPE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "TensorInfo"
+    field {
+      name: "name"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+      oneof_index: 0
+    }
+    field {
+      name: "coo_sparse"
+      number: 4
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.TensorInfo.CooSparse"
+      oneof_index: 0
+    }
+    field {
+      name: "dtype"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_ENUM
+      type_name: ".tensorflow.DataType"
+    }
+    field {
+      name: "tensor_shape"
+      number: 3
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.TensorShapeProto"
+    }
+    nested_type {
+      name: "CooSparse"
+      field {
+        name: "values_tensor_name"
+        number: 1
+        label: LABEL_OPTIONAL
+        type: TYPE_STRING
+      }
+      field {
+        name: "indices_tensor_name"
+        number: 2
+        label: LABEL_OPTIONAL
+        type: TYPE_STRING
+      }
+      field {
+        name: "dense_shape_tensor_name"
+        number: 3
+        label: LABEL_OPTIONAL
+        type: TYPE_STRING
+      }
+    }
+    oneof_decl {
+      name: "encoding"
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.-variable.pbtxt b/tensorflow/tools/api/golden/tensorflow.-variable.pbtxt
index 8c8912dfabb9b5ee7ce15725064f1bdf2fd35bfd..23b552cc38488bdc15d7deed20f563379dba24c3 100644
--- a/tensorflow/tools/api/golden/tensorflow.-variable.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.-variable.pbtxt
@@ -43,6 +43,10 @@ tf_class {
     name: "shape"
     mtype: "<type \'property\'>"
   }
+  member {
+    name: "trainable"
+    mtype: "<type \'property\'>"
+  }
   member_method {
     name: "__init__"
     argspec: "args=[\'self\', \'initial_value\', \'trainable\', \'collections\', \'validate_shape\', \'caching_device\', \'name\', \'variable_def\', \'dtype\', \'expected_shape\', \'import_scope\', \'constraint\'], varargs=None, keywords=None, defaults=[\'None\', \'True\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\'], "
diff --git a/tensorflow/tools/api/golden/tensorflow.data.-dataset.pbtxt b/tensorflow/tools/api/golden/tensorflow.data.-dataset.pbtxt
index 8e7e945ed1bc26669d7c7f0ed3c2002df9f1883b..834f0954d5bba655a8eb923672d89bac6bb80808 100644
--- a/tensorflow/tools/api/golden/tensorflow.data.-dataset.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.data.-dataset.pbtxt
@@ -24,7 +24,7 @@ tf_class {
   }
   member_method {
     name: "batch"
-    argspec: "args=[\'self\', \'batch_size\'], varargs=None, keywords=None, defaults=None"
+    argspec: "args=[\'self\', \'batch_size\', \'drop_remainder\'], varargs=None, keywords=None, defaults=[\'False\'], "
   }
   member_method {
     name: "cache"
@@ -80,7 +80,7 @@ tf_class {
   }
   member_method {
     name: "padded_batch"
-    argspec: "args=[\'self\', \'batch_size\', \'padded_shapes\', \'padding_values\'], varargs=None, keywords=None, defaults=[\'None\'], "
+    argspec: "args=[\'self\', \'batch_size\', \'padded_shapes\', \'padding_values\', \'drop_remainder\'], varargs=None, keywords=None, defaults=[\'None\', \'False\'], "
   }
   member_method {
     name: "prefetch"
diff --git a/tensorflow/tools/api/golden/tensorflow.data.-fixed-length-record-dataset.pbtxt b/tensorflow/tools/api/golden/tensorflow.data.-fixed-length-record-dataset.pbtxt
index 5cfb2fd2f0c6a7b733e70445aa130e96c512205e..4d854a4ceea3907d7d795d0a19d081f4069c9ba9 100644
--- a/tensorflow/tools/api/golden/tensorflow.data.-fixed-length-record-dataset.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.data.-fixed-length-record-dataset.pbtxt
@@ -25,7 +25,7 @@ tf_class {
   }
   member_method {
     name: "batch"
-    argspec: "args=[\'self\', \'batch_size\'], varargs=None, keywords=None, defaults=None"
+    argspec: "args=[\'self\', \'batch_size\', \'drop_remainder\'], varargs=None, keywords=None, defaults=[\'False\'], "
   }
   member_method {
     name: "cache"
@@ -81,7 +81,7 @@ tf_class {
   }
   member_method {
     name: "padded_batch"
-    argspec: "args=[\'self\', \'batch_size\', \'padded_shapes\', \'padding_values\'], varargs=None, keywords=None, defaults=[\'None\'], "
+    argspec: "args=[\'self\', \'batch_size\', \'padded_shapes\', \'padding_values\', \'drop_remainder\'], varargs=None, keywords=None, defaults=[\'None\', \'False\'], "
   }
   member_method {
     name: "prefetch"
diff --git a/tensorflow/tools/api/golden/tensorflow.data.-t-f-record-dataset.pbtxt b/tensorflow/tools/api/golden/tensorflow.data.-t-f-record-dataset.pbtxt
index 3327e5b274b43c0b424933cb086c894d47ad25cb..601f095a60ae481b895a535efa37341611499499 100644
--- a/tensorflow/tools/api/golden/tensorflow.data.-t-f-record-dataset.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.data.-t-f-record-dataset.pbtxt
@@ -25,7 +25,7 @@ tf_class {
   }
   member_method {
     name: "batch"
-    argspec: "args=[\'self\', \'batch_size\'], varargs=None, keywords=None, defaults=None"
+    argspec: "args=[\'self\', \'batch_size\', \'drop_remainder\'], varargs=None, keywords=None, defaults=[\'False\'], "
   }
   member_method {
     name: "cache"
@@ -81,7 +81,7 @@ tf_class {
   }
   member_method {
     name: "padded_batch"
-    argspec: "args=[\'self\', \'batch_size\', \'padded_shapes\', \'padding_values\'], varargs=None, keywords=None, defaults=[\'None\'], "
+    argspec: "args=[\'self\', \'batch_size\', \'padded_shapes\', \'padding_values\', \'drop_remainder\'], varargs=None, keywords=None, defaults=[\'None\', \'False\'], "
   }
   member_method {
     name: "prefetch"
diff --git a/tensorflow/tools/api/golden/tensorflow.data.-text-line-dataset.pbtxt b/tensorflow/tools/api/golden/tensorflow.data.-text-line-dataset.pbtxt
index 9d59375282b39564456b4c8aa49435c3836c58ea..587829a4c078e8ab945f66c64f5adad21223dfb1 100644
--- a/tensorflow/tools/api/golden/tensorflow.data.-text-line-dataset.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.data.-text-line-dataset.pbtxt
@@ -25,7 +25,7 @@ tf_class {
   }
   member_method {
     name: "batch"
-    argspec: "args=[\'self\', \'batch_size\'], varargs=None, keywords=None, defaults=None"
+    argspec: "args=[\'self\', \'batch_size\', \'drop_remainder\'], varargs=None, keywords=None, defaults=[\'False\'], "
   }
   member_method {
     name: "cache"
@@ -81,7 +81,7 @@ tf_class {
   }
   member_method {
     name: "padded_batch"
-    argspec: "args=[\'self\', \'batch_size\', \'padded_shapes\', \'padding_values\'], varargs=None, keywords=None, defaults=[\'None\'], "
+    argspec: "args=[\'self\', \'batch_size\', \'padded_shapes\', \'padding_values\', \'drop_remainder\'], varargs=None, keywords=None, defaults=[\'None\', \'False\'], "
   }
   member_method {
     name: "prefetch"
diff --git a/tensorflow/tools/api/golden/tensorflow.image.pbtxt b/tensorflow/tools/api/golden/tensorflow.image.pbtxt
index acc3fc4c5bbb767997b9844c7268b717184e4ea8..10171b3d60f300962e05f14d6aa7e13f364d2bde 100644
--- a/tensorflow/tools/api/golden/tensorflow.image.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.image.pbtxt
@@ -20,6 +20,10 @@ tf_module {
     name: "adjust_hue"
     argspec: "args=[\'image\', \'delta\', \'name\'], varargs=None, keywords=None, defaults=[\'None\'], "
   }
+  member_method {
+    name: "adjust_jpeg_quality"
+    argspec: "args=[\'image\', \'jpeg_quality\', \'name\'], varargs=None, keywords=None, defaults=[\'None\'], "
+  }
   member_method {
     name: "adjust_saturation"
     argspec: "args=[\'image\', \'saturation_factor\', \'name\'], varargs=None, keywords=None, defaults=[\'None\'], "
@@ -54,7 +58,7 @@ tf_module {
   }
   member_method {
     name: "decode_image"
-    argspec: "args=[\'contents\', \'channels\', \'name\'], varargs=None, keywords=None, defaults=[\'None\', \'None\'], "
+    argspec: "args=[\'contents\', \'channels\', \'dtype\', \'name\'], varargs=None, keywords=None, defaults=[\'None\', \"<dtype: \'uint8\'>\", \'None\'], "
   }
   member_method {
     name: "decode_jpeg"
@@ -110,7 +114,7 @@ tf_module {
   }
   member_method {
     name: "non_max_suppression"
-    argspec: "args=[\'boxes\', \'scores\', \'max_output_size\', \'iou_threshold\', \'score_threshold\', \'name\'], varargs=None, keywords=None, defaults=[\'0.5\', \'0.0\', \'None\'], "
+    argspec: "args=[\'boxes\', \'scores\', \'max_output_size\', \'iou_threshold\', \'score_threshold\', \'name\'], varargs=None, keywords=None, defaults=[\'0.5\', \'-inf\', \'None\'], "
   }
   member_method {
     name: "pad_to_bounding_box"
@@ -144,6 +148,10 @@ tf_module {
     name: "random_hue"
     argspec: "args=[\'image\', \'max_delta\', \'seed\'], varargs=None, keywords=None, defaults=[\'None\'], "
   }
+  member_method {
+    name: "random_jpeg_quality"
+    argspec: "args=[\'image\', \'min_jpeg_quality\', \'max_jpeg_quality\', \'seed\'], varargs=None, keywords=None, defaults=[\'None\'], "
+  }
   member_method {
     name: "random_saturation"
     argspec: "args=[\'image\', \'lower\', \'upper\', \'seed\'], varargs=None, keywords=None, defaults=[\'None\'], "
@@ -166,7 +174,7 @@ tf_module {
   }
   member_method {
     name: "resize_images"
-    argspec: "args=[\'images\', \'size\', \'method\', \'align_corners\'], varargs=None, keywords=None, defaults=[\'0\', \'False\'], "
+    argspec: "args=[\'images\', \'size\', \'method\', \'align_corners\', \'preserve_aspect_ratio\'], varargs=None, keywords=None, defaults=[\'0\', \'False\', \'False\'], "
   }
   member_method {
     name: "resize_nearest_neighbor"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.initializers.constant.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.initializers.constant.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..bddc37b907e7573c9fff27a0c3a5f7e199b88a9a
--- /dev/null
+++ b/tensorflow/tools/api/golden/tensorflow.keras.initializers.constant.pbtxt
@@ -0,0 +1,18 @@
+path: "tensorflow.keras.initializers.constant"
+tf_class {
+  is_instance: "<class \'tensorflow.python.ops.init_ops.Constant\'>"
+  is_instance: "<class \'tensorflow.python.ops.init_ops.Initializer\'>"
+  is_instance: "<type \'object\'>"
+  member_method {
+    name: "__init__"
+    argspec: "args=[\'self\', \'value\', \'dtype\', \'verify_shape\'], varargs=None, keywords=None, defaults=[\'0\', \"<dtype: \'float32\'>\", \'False\'], "
+  }
+  member_method {
+    name: "from_config"
+    argspec: "args=[\'cls\', \'config\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "get_config"
+    argspec: "args=[\'self\'], varargs=None, keywords=None, defaults=None"
+  }
+}
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.initializers.identity.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.initializers.identity.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..a4c5a6149047ffdaadde1243e4c80feae05cd77b
--- /dev/null
+++ b/tensorflow/tools/api/golden/tensorflow.keras.initializers.identity.pbtxt
@@ -0,0 +1,18 @@
+path: "tensorflow.keras.initializers.identity"
+tf_class {
+  is_instance: "<class \'tensorflow.python.ops.init_ops.Identity\'>"
+  is_instance: "<class \'tensorflow.python.ops.init_ops.Initializer\'>"
+  is_instance: "<type \'object\'>"
+  member_method {
+    name: "__init__"
+    argspec: "args=[\'self\', \'gain\', \'dtype\'], varargs=None, keywords=None, defaults=[\'1.0\', \"<dtype: \'float32\'>\"], "
+  }
+  member_method {
+    name: "from_config"
+    argspec: "args=[\'cls\', \'config\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "get_config"
+    argspec: "args=[\'self\'], varargs=None, keywords=None, defaults=None"
+  }
+}
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.initializers.normal.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.initializers.normal.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..7485772784d40b7bf552efe9bbe8b22fadee3b86
--- /dev/null
+++ b/tensorflow/tools/api/golden/tensorflow.keras.initializers.normal.pbtxt
@@ -0,0 +1,18 @@
+path: "tensorflow.keras.initializers.normal"
+tf_class {
+  is_instance: "<class \'tensorflow.python.ops.init_ops.RandomNormal\'>"
+  is_instance: "<class \'tensorflow.python.ops.init_ops.Initializer\'>"
+  is_instance: "<type \'object\'>"
+  member_method {
+    name: "__init__"
+    argspec: "args=[\'self\', \'mean\', \'stddev\', \'seed\', \'dtype\'], varargs=None, keywords=None, defaults=[\'0.0\', \'1.0\', \'None\', \"<dtype: \'float32\'>\"], "
+  }
+  member_method {
+    name: "from_config"
+    argspec: "args=[\'cls\', \'config\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "get_config"
+    argspec: "args=[\'self\'], varargs=None, keywords=None, defaults=None"
+  }
+}
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.initializers.ones.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.initializers.ones.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..a89f78d1e1a47c7cd5a252cfd0a7b2fa23979e90
--- /dev/null
+++ b/tensorflow/tools/api/golden/tensorflow.keras.initializers.ones.pbtxt
@@ -0,0 +1,18 @@
+path: "tensorflow.keras.initializers.ones"
+tf_class {
+  is_instance: "<class \'tensorflow.python.ops.init_ops.Ones\'>"
+  is_instance: "<class \'tensorflow.python.ops.init_ops.Initializer\'>"
+  is_instance: "<type \'object\'>"
+  member_method {
+    name: "__init__"
+    argspec: "args=[\'self\', \'dtype\'], varargs=None, keywords=None, defaults=[\"<dtype: \'float32\'>\"], "
+  }
+  member_method {
+    name: "from_config"
+    argspec: "args=[\'cls\', \'config\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "get_config"
+    argspec: "args=[\'self\'], varargs=None, keywords=None, defaults=None"
+  }
+}
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.initializers.orthogonal.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.initializers.orthogonal.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..ee1e9bbae2b7130db5b96309e2d87719169d788a
--- /dev/null
+++ b/tensorflow/tools/api/golden/tensorflow.keras.initializers.orthogonal.pbtxt
@@ -0,0 +1,18 @@
+path: "tensorflow.keras.initializers.orthogonal"
+tf_class {
+  is_instance: "<class \'tensorflow.python.ops.init_ops.Orthogonal\'>"
+  is_instance: "<class \'tensorflow.python.ops.init_ops.Initializer\'>"
+  is_instance: "<type \'object\'>"
+  member_method {
+    name: "__init__"
+    argspec: "args=[\'self\', \'gain\', \'seed\', \'dtype\'], varargs=None, keywords=None, defaults=[\'1.0\', \'None\', \"<dtype: \'float32\'>\"], "
+  }
+  member_method {
+    name: "from_config"
+    argspec: "args=[\'cls\', \'config\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "get_config"
+    argspec: "args=[\'self\'], varargs=None, keywords=None, defaults=None"
+  }
+}
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.initializers.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.initializers.pbtxt
index 093c56595bd54eef4062d4ac9134e4bb3e7f7d98..14a667870d3118e48bfac03eee9accb3d48a72ce 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.initializers.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.initializers.pbtxt
@@ -40,6 +40,46 @@ tf_module {
     name: "Zeros"
     mtype: "<type \'type\'>"
   }
+  member {
+    name: "constant"
+    mtype: "<type \'type\'>"
+  }
+  member {
+    name: "identity"
+    mtype: "<type \'type\'>"
+  }
+  member {
+    name: "normal"
+    mtype: "<type \'type\'>"
+  }
+  member {
+    name: "ones"
+    mtype: "<type \'type\'>"
+  }
+  member {
+    name: "orthogonal"
+    mtype: "<type \'type\'>"
+  }
+  member {
+    name: "random_normal"
+    mtype: "<type \'type\'>"
+  }
+  member {
+    name: "random_uniform"
+    mtype: "<type \'type\'>"
+  }
+  member {
+    name: "truncated_normal"
+    mtype: "<type \'type\'>"
+  }
+  member {
+    name: "uniform"
+    mtype: "<type \'type\'>"
+  }
+  member {
+    name: "zeros"
+    mtype: "<type \'type\'>"
+  }
   member_method {
     name: "deserialize"
     argspec: "args=[\'config\', \'custom_objects\'], varargs=None, keywords=None, defaults=[\'None\'], "
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.initializers.random_normal.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.initializers.random_normal.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..a6df1e87a3f68fb16e32dce1ba4ee29f6d86e74e
--- /dev/null
+++ b/tensorflow/tools/api/golden/tensorflow.keras.initializers.random_normal.pbtxt
@@ -0,0 +1,18 @@
+path: "tensorflow.keras.initializers.random_normal"
+tf_class {
+  is_instance: "<class \'tensorflow.python.ops.init_ops.RandomNormal\'>"
+  is_instance: "<class \'tensorflow.python.ops.init_ops.Initializer\'>"
+  is_instance: "<type \'object\'>"
+  member_method {
+    name: "__init__"
+    argspec: "args=[\'self\', \'mean\', \'stddev\', \'seed\', \'dtype\'], varargs=None, keywords=None, defaults=[\'0.0\', \'1.0\', \'None\', \"<dtype: \'float32\'>\"], "
+  }
+  member_method {
+    name: "from_config"
+    argspec: "args=[\'cls\', \'config\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "get_config"
+    argspec: "args=[\'self\'], varargs=None, keywords=None, defaults=None"
+  }
+}
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.initializers.random_uniform.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.initializers.random_uniform.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..37a0fa0d5508de0026472ff1a3aa532bb8f343cd
--- /dev/null
+++ b/tensorflow/tools/api/golden/tensorflow.keras.initializers.random_uniform.pbtxt
@@ -0,0 +1,18 @@
+path: "tensorflow.keras.initializers.random_uniform"
+tf_class {
+  is_instance: "<class \'tensorflow.python.ops.init_ops.RandomUniform\'>"
+  is_instance: "<class \'tensorflow.python.ops.init_ops.Initializer\'>"
+  is_instance: "<type \'object\'>"
+  member_method {
+    name: "__init__"
+    argspec: "args=[\'self\', \'minval\', \'maxval\', \'seed\', \'dtype\'], varargs=None, keywords=None, defaults=[\'0\', \'None\', \'None\', \"<dtype: \'float32\'>\"], "
+  }
+  member_method {
+    name: "from_config"
+    argspec: "args=[\'cls\', \'config\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "get_config"
+    argspec: "args=[\'self\'], varargs=None, keywords=None, defaults=None"
+  }
+}
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.initializers.truncated_normal.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.initializers.truncated_normal.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..f97e93f0b72d5e959722d15fa9dc35869c550710
--- /dev/null
+++ b/tensorflow/tools/api/golden/tensorflow.keras.initializers.truncated_normal.pbtxt
@@ -0,0 +1,18 @@
+path: "tensorflow.keras.initializers.truncated_normal"
+tf_class {
+  is_instance: "<class \'tensorflow.python.ops.init_ops.TruncatedNormal\'>"
+  is_instance: "<class \'tensorflow.python.ops.init_ops.Initializer\'>"
+  is_instance: "<type \'object\'>"
+  member_method {
+    name: "__init__"
+    argspec: "args=[\'self\', \'mean\', \'stddev\', \'seed\', \'dtype\'], varargs=None, keywords=None, defaults=[\'0.0\', \'1.0\', \'None\', \"<dtype: \'float32\'>\"], "
+  }
+  member_method {
+    name: "from_config"
+    argspec: "args=[\'cls\', \'config\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "get_config"
+    argspec: "args=[\'self\'], varargs=None, keywords=None, defaults=None"
+  }
+}
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.initializers.uniform.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.initializers.uniform.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..58186b1383d8997165bb457e1cb54df86cd02d11
--- /dev/null
+++ b/tensorflow/tools/api/golden/tensorflow.keras.initializers.uniform.pbtxt
@@ -0,0 +1,18 @@
+path: "tensorflow.keras.initializers.uniform"
+tf_class {
+  is_instance: "<class \'tensorflow.python.ops.init_ops.RandomUniform\'>"
+  is_instance: "<class \'tensorflow.python.ops.init_ops.Initializer\'>"
+  is_instance: "<type \'object\'>"
+  member_method {
+    name: "__init__"
+    argspec: "args=[\'self\', \'minval\', \'maxval\', \'seed\', \'dtype\'], varargs=None, keywords=None, defaults=[\'0\', \'None\', \'None\', \"<dtype: \'float32\'>\"], "
+  }
+  member_method {
+    name: "from_config"
+    argspec: "args=[\'cls\', \'config\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "get_config"
+    argspec: "args=[\'self\'], varargs=None, keywords=None, defaults=None"
+  }
+}
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.initializers.zeros.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.initializers.zeros.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..a262390687f31a5fb79822e69273306b9e1897b5
--- /dev/null
+++ b/tensorflow/tools/api/golden/tensorflow.keras.initializers.zeros.pbtxt
@@ -0,0 +1,18 @@
+path: "tensorflow.keras.initializers.zeros"
+tf_class {
+  is_instance: "<class \'tensorflow.python.ops.init_ops.Zeros\'>"
+  is_instance: "<class \'tensorflow.python.ops.init_ops.Initializer\'>"
+  is_instance: "<type \'object\'>"
+  member_method {
+    name: "__init__"
+    argspec: "args=[\'self\', \'dtype\'], varargs=None, keywords=None, defaults=[\"<dtype: \'float32\'>\"], "
+  }
+  member_method {
+    name: "from_config"
+    argspec: "args=[\'cls\', \'config\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "get_config"
+    argspec: "args=[\'self\'], varargs=None, keywords=None, defaults=None"
+  }
+}
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-minimum.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-minimum.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..56e32e9d3690a92c3f6e41bf2b5164c6bf62f443
--- /dev/null
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-minimum.pbtxt
@@ -0,0 +1,176 @@
+path: "tensorflow.keras.layers.Minimum"
+tf_class {
+  is_instance: "<class \'tensorflow.python.keras.layers.merge.Minimum\'>"
+  is_instance: "<class \'tensorflow.python.keras.layers.merge._Merge\'>"
+  is_instance: "<class \'tensorflow.python.keras.engine.base_layer.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.base.CheckpointableBase\'>"
+  is_instance: "<type \'object\'>"
+  member {
+    name: "activity_regularizer"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "dtype"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "inbound_nodes"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "input"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "input_mask"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "input_shape"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "losses"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "name"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "non_trainable_variables"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "non_trainable_weights"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "outbound_nodes"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "output"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "output_mask"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "output_shape"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "trainable_variables"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "trainable_weights"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "updates"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "variables"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "weights"
+    mtype: "<type \'property\'>"
+  }
+  member_method {
+    name: "__init__"
+    argspec: "args=[\'self\'], varargs=None, keywords=kwargs, defaults=None"
+  }
+  member_method {
+    name: "add_loss"
+    argspec: "args=[\'self\', \'losses\', \'inputs\'], varargs=None, keywords=None, defaults=[\'None\'], "
+  }
+  member_method {
+    name: "add_update"
+    argspec: "args=[\'self\', \'updates\', \'inputs\'], varargs=None, keywords=None, defaults=[\'None\'], "
+  }
+  member_method {
+    name: "add_variable"
+    argspec: "args=[\'self\'], varargs=args, keywords=kwargs, defaults=None"
+  }
+  member_method {
+    name: "add_weight"
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+  }
+  member_method {
+    name: "apply"
+    argspec: "args=[\'self\', \'inputs\'], varargs=args, keywords=kwargs, defaults=None"
+  }
+  member_method {
+    name: "build"
+    argspec: "args=[\'instance\', \'input_shape\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "call"
+    argspec: "args=[\'self\', \'inputs\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "compute_mask"
+    argspec: "args=[\'self\', \'inputs\', \'mask\'], varargs=None, keywords=None, defaults=[\'None\'], "
+  }
+  member_method {
+    name: "compute_output_shape"
+    argspec: "args=[\'instance\', \'input_shape\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "count_params"
+    argspec: "args=[\'self\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "from_config"
+    argspec: "args=[\'cls\', \'config\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "get_config"
+    argspec: "args=[\'self\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "get_input_at"
+    argspec: "args=[\'self\', \'node_index\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "get_input_mask_at"
+    argspec: "args=[\'self\', \'node_index\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "get_input_shape_at"
+    argspec: "args=[\'self\', \'node_index\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "get_losses_for"
+    argspec: "args=[\'self\', \'inputs\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "get_output_at"
+    argspec: "args=[\'self\', \'node_index\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "get_output_mask_at"
+    argspec: "args=[\'self\', \'node_index\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "get_output_shape_at"
+    argspec: "args=[\'self\', \'node_index\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "get_updates_for"
+    argspec: "args=[\'self\', \'inputs\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "get_weights"
+    argspec: "args=[\'self\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "set_weights"
+    argspec: "args=[\'self\', \'weights\'], varargs=None, keywords=None, defaults=None"
+  }
+}
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-subtract.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-subtract.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..35ad87ad5d91f1cc5d413b0adc8e9e5d1403726a
--- /dev/null
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-subtract.pbtxt
@@ -0,0 +1,176 @@
+path: "tensorflow.keras.layers.Subtract"
+tf_class {
+  is_instance: "<class \'tensorflow.python.keras.layers.merge.Subtract\'>"
+  is_instance: "<class \'tensorflow.python.keras.layers.merge._Merge\'>"
+  is_instance: "<class \'tensorflow.python.keras.engine.base_layer.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.base.CheckpointableBase\'>"
+  is_instance: "<type \'object\'>"
+  member {
+    name: "activity_regularizer"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "dtype"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "inbound_nodes"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "input"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "input_mask"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "input_shape"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "losses"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "name"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "non_trainable_variables"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "non_trainable_weights"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "outbound_nodes"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "output"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "output_mask"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "output_shape"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "trainable_variables"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "trainable_weights"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "updates"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "variables"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "weights"
+    mtype: "<type \'property\'>"
+  }
+  member_method {
+    name: "__init__"
+    argspec: "args=[\'self\'], varargs=None, keywords=kwargs, defaults=None"
+  }
+  member_method {
+    name: "add_loss"
+    argspec: "args=[\'self\', \'losses\', \'inputs\'], varargs=None, keywords=None, defaults=[\'None\'], "
+  }
+  member_method {
+    name: "add_update"
+    argspec: "args=[\'self\', \'updates\', \'inputs\'], varargs=None, keywords=None, defaults=[\'None\'], "
+  }
+  member_method {
+    name: "add_variable"
+    argspec: "args=[\'self\'], varargs=args, keywords=kwargs, defaults=None"
+  }
+  member_method {
+    name: "add_weight"
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+  }
+  member_method {
+    name: "apply"
+    argspec: "args=[\'self\', \'inputs\'], varargs=args, keywords=kwargs, defaults=None"
+  }
+  member_method {
+    name: "build"
+    argspec: "args=[\'instance\', \'input_shape\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "call"
+    argspec: "args=[\'self\', \'inputs\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "compute_mask"
+    argspec: "args=[\'self\', \'inputs\', \'mask\'], varargs=None, keywords=None, defaults=[\'None\'], "
+  }
+  member_method {
+    name: "compute_output_shape"
+    argspec: "args=[\'instance\', \'input_shape\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "count_params"
+    argspec: "args=[\'self\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "from_config"
+    argspec: "args=[\'cls\', \'config\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "get_config"
+    argspec: "args=[\'self\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "get_input_at"
+    argspec: "args=[\'self\', \'node_index\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "get_input_mask_at"
+    argspec: "args=[\'self\', \'node_index\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "get_input_shape_at"
+    argspec: "args=[\'self\', \'node_index\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "get_losses_for"
+    argspec: "args=[\'self\', \'inputs\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "get_output_at"
+    argspec: "args=[\'self\', \'node_index\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "get_output_mask_at"
+    argspec: "args=[\'self\', \'node_index\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "get_output_shape_at"
+    argspec: "args=[\'self\', \'node_index\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "get_updates_for"
+    argspec: "args=[\'self\', \'inputs\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "get_weights"
+    argspec: "args=[\'self\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "set_weights"
+    argspec: "args=[\'self\', \'weights\'], varargs=None, keywords=None, defaults=None"
+  }
+}
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.pbtxt
index 709eb5be55ef180ce9836def4bef601ea4315be0..475e9dade3a6f5ba8a5020afbd4668be5b5ed9d7 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.pbtxt
@@ -280,6 +280,10 @@ tf_module {
     name: "Maximum"
     mtype: "<type \'type\'>"
   }
+  member {
+    name: "Minimum"
+    mtype: "<type \'type\'>"
+  }
   member {
     name: "Multiply"
     mtype: "<type \'type\'>"
@@ -348,6 +352,10 @@ tf_module {
     name: "StackedRNNCells"
     mtype: "<type \'type\'>"
   }
+  member {
+    name: "Subtract"
+    mtype: "<type \'type\'>"
+  }
   member {
     name: "ThresholdedReLU"
     mtype: "<type \'type\'>"
@@ -408,8 +416,16 @@ tf_module {
     name: "maximum"
     argspec: "args=[\'inputs\'], varargs=None, keywords=kwargs, defaults=None"
   }
+  member_method {
+    name: "minimum"
+    argspec: "args=[\'inputs\'], varargs=None, keywords=kwargs, defaults=None"
+  }
   member_method {
     name: "multiply"
     argspec: "args=[\'inputs\'], varargs=None, keywords=kwargs, defaults=None"
   }
+  member_method {
+    name: "subtract"
+    argspec: "args=[\'inputs\'], varargs=None, keywords=kwargs, defaults=None"
+  }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.losses.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.losses.pbtxt
index ae5f6305b7d1bb85c1c6acd8daf5628d83814b27..eca6b915388ebff0103f7ad16f43c6be0df60b7d 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.losses.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.losses.pbtxt
@@ -1,5 +1,25 @@
 path: "tensorflow.keras.losses"
 tf_module {
+  member_method {
+    name: "KLD"
+    argspec: "args=[\'y_true\', \'y_pred\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "MAE"
+    argspec: "args=[\'y_true\', \'y_pred\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "MAPE"
+    argspec: "args=[\'y_true\', \'y_pred\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "MSE"
+    argspec: "args=[\'y_true\', \'y_pred\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "MSLE"
+    argspec: "args=[\'y_true\', \'y_pred\'], varargs=None, keywords=None, defaults=None"
+  }
   member_method {
     name: "binary_crossentropy"
     argspec: "args=[\'y_true\', \'y_pred\'], varargs=None, keywords=None, defaults=None"
@@ -12,6 +32,10 @@ tf_module {
     name: "categorical_hinge"
     argspec: "args=[\'y_true\', \'y_pred\'], varargs=None, keywords=None, defaults=None"
   }
+  member_method {
+    name: "cosine"
+    argspec: "args=[\'y_true\', \'y_pred\'], varargs=None, keywords=None, defaults=None"
+  }
   member_method {
     name: "cosine_proximity"
     argspec: "args=[\'y_true\', \'y_pred\'], varargs=None, keywords=None, defaults=None"
@@ -28,6 +52,10 @@ tf_module {
     name: "hinge"
     argspec: "args=[\'y_true\', \'y_pred\'], varargs=None, keywords=None, defaults=None"
   }
+  member_method {
+    name: "kld"
+    argspec: "args=[\'y_true\', \'y_pred\'], varargs=None, keywords=None, defaults=None"
+  }
   member_method {
     name: "kullback_leibler_divergence"
     argspec: "args=[\'y_true\', \'y_pred\'], varargs=None, keywords=None, defaults=None"
@@ -36,6 +64,14 @@ tf_module {
     name: "logcosh"
     argspec: "args=[\'y_true\', \'y_pred\'], varargs=None, keywords=None, defaults=None"
   }
+  member_method {
+    name: "mae"
+    argspec: "args=[\'y_true\', \'y_pred\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "mape"
+    argspec: "args=[\'y_true\', \'y_pred\'], varargs=None, keywords=None, defaults=None"
+  }
   member_method {
     name: "mean_absolute_error"
     argspec: "args=[\'y_true\', \'y_pred\'], varargs=None, keywords=None, defaults=None"
@@ -52,6 +88,14 @@ tf_module {
     name: "mean_squared_logarithmic_error"
     argspec: "args=[\'y_true\', \'y_pred\'], varargs=None, keywords=None, defaults=None"
   }
+  member_method {
+    name: "mse"
+    argspec: "args=[\'y_true\', \'y_pred\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "msle"
+    argspec: "args=[\'y_true\', \'y_pred\'], varargs=None, keywords=None, defaults=None"
+  }
   member_method {
     name: "poisson"
     argspec: "args=[\'y_true\', \'y_pred\'], varargs=None, keywords=None, defaults=None"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.metrics.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.metrics.pbtxt
index 42729e4237685638d38301cece6e93383ddfffba..a97a9b57587070ec4841b627920ac91737a67997 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.metrics.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.metrics.pbtxt
@@ -1,5 +1,25 @@
 path: "tensorflow.keras.metrics"
 tf_module {
+  member_method {
+    name: "KLD"
+    argspec: "args=[\'y_true\', \'y_pred\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "MAE"
+    argspec: "args=[\'y_true\', \'y_pred\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "MAPE"
+    argspec: "args=[\'y_true\', \'y_pred\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "MSE"
+    argspec: "args=[\'y_true\', \'y_pred\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "MSLE"
+    argspec: "args=[\'y_true\', \'y_pred\'], varargs=None, keywords=None, defaults=None"
+  }
   member_method {
     name: "binary_accuracy"
     argspec: "args=[\'y_true\', \'y_pred\'], varargs=None, keywords=None, defaults=None"
@@ -16,6 +36,10 @@ tf_module {
     name: "categorical_crossentropy"
     argspec: "args=[\'y_true\', \'y_pred\'], varargs=None, keywords=None, defaults=None"
   }
+  member_method {
+    name: "cosine"
+    argspec: "args=[\'y_true\', \'y_pred\'], varargs=None, keywords=None, defaults=None"
+  }
   member_method {
     name: "cosine_proximity"
     argspec: "args=[\'y_true\', \'y_pred\'], varargs=None, keywords=None, defaults=None"
@@ -32,10 +56,22 @@ tf_module {
     name: "hinge"
     argspec: "args=[\'y_true\', \'y_pred\'], varargs=None, keywords=None, defaults=None"
   }
+  member_method {
+    name: "kld"
+    argspec: "args=[\'y_true\', \'y_pred\'], varargs=None, keywords=None, defaults=None"
+  }
   member_method {
     name: "kullback_leibler_divergence"
     argspec: "args=[\'y_true\', \'y_pred\'], varargs=None, keywords=None, defaults=None"
   }
+  member_method {
+    name: "mae"
+    argspec: "args=[\'y_true\', \'y_pred\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "mape"
+    argspec: "args=[\'y_true\', \'y_pred\'], varargs=None, keywords=None, defaults=None"
+  }
   member_method {
     name: "mean_absolute_error"
     argspec: "args=[\'y_true\', \'y_pred\'], varargs=None, keywords=None, defaults=None"
@@ -52,6 +88,14 @@ tf_module {
     name: "mean_squared_logarithmic_error"
     argspec: "args=[\'y_true\', \'y_pred\'], varargs=None, keywords=None, defaults=None"
   }
+  member_method {
+    name: "mse"
+    argspec: "args=[\'y_true\', \'y_pred\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "msle"
+    argspec: "args=[\'y_true\', \'y_pred\'], varargs=None, keywords=None, defaults=None"
+  }
   member_method {
     name: "poisson"
     argspec: "args=[\'y_true\', \'y_pred\'], varargs=None, keywords=None, defaults=None"
diff --git a/tensorflow/tools/api/golden/tensorflow.math.pbtxt b/tensorflow/tools/api/golden/tensorflow.math.pbtxt
index 897718c05e0d10a6f961f33b8c65f5dab1d03f5b..03fbf6266d24f6a31ef76e2668c15987e55d2651 100644
--- a/tensorflow/tools/api/golden/tensorflow.math.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.math.pbtxt
@@ -1,5 +1,21 @@
 path: "tensorflow.math"
 tf_module {
+  member_method {
+    name: "bessel_i0"
+    argspec: "args=[\'x\', \'name\'], varargs=None, keywords=None, defaults=[\'bessel_i0\'], "
+  }
+  member_method {
+    name: "bessel_i0e"
+    argspec: "args=[\'x\', \'name\'], varargs=None, keywords=None, defaults=[\'None\'], "
+  }
+  member_method {
+    name: "bessel_i1"
+    argspec: "args=[\'x\', \'name\'], varargs=None, keywords=None, defaults=[\'bessel_i1\'], "
+  }
+  member_method {
+    name: "bessel_i1e"
+    argspec: "args=[\'x\', \'name\'], varargs=None, keywords=None, defaults=[\'None\'], "
+  }
   member_method {
     name: "polyval"
     argspec: "args=[\'coeffs\', \'x\', \'name\'], varargs=None, keywords=None, defaults=[\'None\'], "
diff --git a/tensorflow/tools/api/golden/tensorflow.pbtxt b/tensorflow/tools/api/golden/tensorflow.pbtxt
index 3051c4437e9a14bf0ef86adfa8c596b736a6172d..01b80581188e65d228aaa669254d9951546ecfa0 100644
--- a/tensorflow/tools/api/golden/tensorflow.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.pbtxt
@@ -792,6 +792,10 @@ tf_module {
     name: "broadcast_static_shape"
     argspec: "args=[\'shape_x\', \'shape_y\'], varargs=None, keywords=None, defaults=None"
   }
+  member_method {
+    name: "broadcast_to"
+    argspec: "args=[\'input\', \'shape\', \'name\'], varargs=None, keywords=None, defaults=[\'None\'], "
+  }
   member_method {
     name: "case"
     argspec: "args=[\'pred_fn_pairs\', \'default\', \'exclusive\', \'strict\', \'name\'], varargs=None, keywords=None, defaults=[\'None\', \'False\', \'False\', \'case\'], "
diff --git a/tensorflow/tools/api/golden/tensorflow.profiler.-advice-proto.-checker.pbtxt b/tensorflow/tools/api/golden/tensorflow.profiler.-advice-proto.-checker.pbtxt
index bd5c36f390add9cfb31642b80a792d65d59bb3e8..e09c44cc9ce71305692740ba2d63b0940b2e0573 100644
--- a/tensorflow/tools/api/golden/tensorflow.profiler.-advice-proto.-checker.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.profiler.-advice-proto.-checker.pbtxt
@@ -1,80 +1,12 @@
 path: "tensorflow.profiler.AdviceProto.Checker"
-tf_class {
-  is_instance: "<class \'tensorflow.core.profiler.tfprof_output_pb2.Checker\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "REPORTS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "Checker"
+    field {
+      name: "reports"
+      number: 2
+      label: LABEL_REPEATED
+      type: TYPE_STRING
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.profiler.-advice-proto.-checkers-entry.pbtxt b/tensorflow/tools/api/golden/tensorflow.profiler.-advice-proto.-checkers-entry.pbtxt
index 7c8c68e155c99da4f0c1c1ba2c944719c42c12c7..87462435496fd2eedeb0bc8d92e8a833671b6531 100644
--- a/tensorflow/tools/api/golden/tensorflow.profiler.-advice-proto.-checkers-entry.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.profiler.-advice-proto.-checkers-entry.pbtxt
@@ -1,84 +1,22 @@
 path: "tensorflow.profiler.AdviceProto.CheckersEntry"
-tf_class {
-  is_instance: "<class \'tensorflow.core.profiler.tfprof_output_pb2.CheckersEntry\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "KEY_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "VALUE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "CheckersEntry"
+    field {
+      name: "key"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
+    field {
+      name: "value"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.tfprof.AdviceProto.Checker"
+    }
+    options {
+      map_entry: true
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.profiler.-advice-proto.pbtxt b/tensorflow/tools/api/golden/tensorflow.profiler.-advice-proto.pbtxt
index 1b789f4fc92ed63fc72f3ecfe6be80a99eb3427f..a8a8858ccd5af3fb3dac612eef44e5cb450df914 100644
--- a/tensorflow/tools/api/golden/tensorflow.profiler.-advice-proto.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.profiler.-advice-proto.pbtxt
@@ -1,88 +1,41 @@
 path: "tensorflow.profiler.AdviceProto"
-tf_class {
-  is_instance: "<class \'tensorflow.core.profiler.tfprof_output_pb2.AdviceProto\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "CHECKERS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "Checker"
-    mtype: "<class \'google.protobuf.pyext.cpp_message.GeneratedProtocolMessageType\'>"
-  }
-  member {
-    name: "CheckersEntry"
-    mtype: "<class \'google.protobuf.pyext.cpp_message.GeneratedProtocolMessageType\'>"
-  }
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "AdviceProto"
+    field {
+      name: "checkers"
+      number: 1
+      label: LABEL_REPEATED
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.tfprof.AdviceProto.CheckersEntry"
+    }
+    nested_type {
+      name: "CheckersEntry"
+      field {
+        name: "key"
+        number: 1
+        label: LABEL_OPTIONAL
+        type: TYPE_STRING
+      }
+      field {
+        name: "value"
+        number: 2
+        label: LABEL_OPTIONAL
+        type: TYPE_MESSAGE
+        type_name: ".tensorflow.tfprof.AdviceProto.Checker"
+      }
+      options {
+        map_entry: true
+      }
+    }
+    nested_type {
+      name: "Checker"
+      field {
+        name: "reports"
+        number: 2
+        label: LABEL_REPEATED
+        type: TYPE_STRING
+      }
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.profiler.-graph-node-proto.-input-shapes-entry.pbtxt b/tensorflow/tools/api/golden/tensorflow.profiler.-graph-node-proto.-input-shapes-entry.pbtxt
index f0b9605bee1c7cf2f0154f65c475aac49c411f76..afec73f537aadd5d1a274db8d57e37b8c6fa3e74 100644
--- a/tensorflow/tools/api/golden/tensorflow.profiler.-graph-node-proto.-input-shapes-entry.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.profiler.-graph-node-proto.-input-shapes-entry.pbtxt
@@ -1,84 +1,22 @@
 path: "tensorflow.profiler.GraphNodeProto.InputShapesEntry"
-tf_class {
-  is_instance: "<class \'tensorflow.core.profiler.tfprof_output_pb2.InputShapesEntry\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "KEY_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "VALUE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "InputShapesEntry"
+    field {
+      name: "key"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_INT32
+    }
+    field {
+      name: "value"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.TensorShapeProto"
+    }
+    options {
+      map_entry: true
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.profiler.-graph-node-proto.pbtxt b/tensorflow/tools/api/golden/tensorflow.profiler.-graph-node-proto.pbtxt
index b80896a8a0f36a9f1c4da76528b9c4e70500ad4c..3c83177005323a277f929d8c769cd7b1eeff4d51 100644
--- a/tensorflow/tools/api/golden/tensorflow.profiler.-graph-node-proto.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.profiler.-graph-node-proto.pbtxt
@@ -1,188 +1,191 @@
 path: "tensorflow.profiler.GraphNodeProto"
-tf_class {
-  is_instance: "<class \'tensorflow.core.profiler.tfprof_output_pb2.GraphNodeProto\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "ACCELERATOR_EXEC_MICROS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "CHILDREN_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "CPU_EXEC_MICROS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "DEVICES_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "EXEC_MICROS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "FLOAT_OPS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "INPUT_SHAPES_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "InputShapesEntry"
-    mtype: "<class \'google.protobuf.pyext.cpp_message.GeneratedProtocolMessageType\'>"
-  }
-  member {
-    name: "NAME_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "OUTPUT_BYTES_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "PARAMETERS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "PEAK_BYTES_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "REQUESTED_BYTES_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "RESIDUAL_BYTES_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "RUN_COUNT_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "SHAPES_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TENSOR_VALUE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TOTAL_ACCELERATOR_EXEC_MICROS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TOTAL_CPU_EXEC_MICROS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TOTAL_DEFINITION_COUNT_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TOTAL_EXEC_MICROS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TOTAL_FLOAT_OPS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TOTAL_OUTPUT_BYTES_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TOTAL_PARAMETERS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TOTAL_PEAK_BYTES_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TOTAL_REQUESTED_BYTES_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TOTAL_RESIDUAL_BYTES_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TOTAL_RUN_COUNT_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "GraphNodeProto"
+    field {
+      name: "name"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
+    field {
+      name: "tensor_value"
+      number: 15
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.tfprof.TFProfTensorProto"
+    }
+    field {
+      name: "run_count"
+      number: 21
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "exec_micros"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "accelerator_exec_micros"
+      number: 17
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "cpu_exec_micros"
+      number: 18
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "requested_bytes"
+      number: 3
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "peak_bytes"
+      number: 24
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "residual_bytes"
+      number: 25
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "output_bytes"
+      number: 26
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "parameters"
+      number: 4
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "float_ops"
+      number: 13
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "devices"
+      number: 10
+      label: LABEL_REPEATED
+      type: TYPE_STRING
+    }
+    field {
+      name: "total_definition_count"
+      number: 23
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "total_run_count"
+      number: 22
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "total_exec_micros"
+      number: 6
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "total_accelerator_exec_micros"
+      number: 19
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "total_cpu_exec_micros"
+      number: 20
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "total_requested_bytes"
+      number: 7
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "total_peak_bytes"
+      number: 27
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "total_residual_bytes"
+      number: 28
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "total_output_bytes"
+      number: 29
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "total_parameters"
+      number: 8
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "total_float_ops"
+      number: 14
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "shapes"
+      number: 11
+      label: LABEL_REPEATED
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.TensorShapeProto"
+    }
+    field {
+      name: "input_shapes"
+      number: 16
+      label: LABEL_REPEATED
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.tfprof.GraphNodeProto.InputShapesEntry"
+    }
+    field {
+      name: "children"
+      number: 12
+      label: LABEL_REPEATED
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.tfprof.GraphNodeProto"
+    }
+    nested_type {
+      name: "InputShapesEntry"
+      field {
+        name: "key"
+        number: 1
+        label: LABEL_OPTIONAL
+        type: TYPE_INT32
+      }
+      field {
+        name: "value"
+        number: 2
+        label: LABEL_OPTIONAL
+        type: TYPE_MESSAGE
+        type_name: ".tensorflow.TensorShapeProto"
+      }
+      options {
+        map_entry: true
+      }
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.profiler.-multi-graph-node-proto.pbtxt b/tensorflow/tools/api/golden/tensorflow.profiler.-multi-graph-node-proto.pbtxt
index 33deff649791322c0e8361a46de6d180f23be9a9..2b08a05437f90b91160fc08e670b2466ae163149 100644
--- a/tensorflow/tools/api/golden/tensorflow.profiler.-multi-graph-node-proto.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.profiler.-multi-graph-node-proto.pbtxt
@@ -1,160 +1,134 @@
 path: "tensorflow.profiler.MultiGraphNodeProto"
-tf_class {
-  is_instance: "<class \'tensorflow.core.profiler.tfprof_output_pb2.MultiGraphNodeProto\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "ACCELERATOR_EXEC_MICROS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "CHILDREN_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "CPU_EXEC_MICROS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "EXEC_MICROS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "FLOAT_OPS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "GRAPH_NODES_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "NAME_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "OUTPUT_BYTES_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "PARAMETERS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "PEAK_BYTES_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "REQUESTED_BYTES_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "RESIDUAL_BYTES_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TOTAL_ACCELERATOR_EXEC_MICROS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TOTAL_CPU_EXEC_MICROS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TOTAL_EXEC_MICROS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TOTAL_FLOAT_OPS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TOTAL_OUTPUT_BYTES_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TOTAL_PARAMETERS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TOTAL_PEAK_BYTES_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TOTAL_REQUESTED_BYTES_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TOTAL_RESIDUAL_BYTES_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "MultiGraphNodeProto"
+    field {
+      name: "name"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
+    field {
+      name: "exec_micros"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "accelerator_exec_micros"
+      number: 12
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "cpu_exec_micros"
+      number: 13
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "requested_bytes"
+      number: 3
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "peak_bytes"
+      number: 16
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "residual_bytes"
+      number: 17
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "output_bytes"
+      number: 18
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "parameters"
+      number: 4
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "float_ops"
+      number: 5
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "total_exec_micros"
+      number: 6
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "total_accelerator_exec_micros"
+      number: 14
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "total_cpu_exec_micros"
+      number: 15
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "total_requested_bytes"
+      number: 7
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "total_peak_bytes"
+      number: 19
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "total_residual_bytes"
+      number: 20
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "total_output_bytes"
+      number: 21
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "total_parameters"
+      number: 8
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "total_float_ops"
+      number: 9
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "graph_nodes"
+      number: 10
+      label: LABEL_REPEATED
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.tfprof.GraphNodeProto"
+    }
+    field {
+      name: "children"
+      number: 11
+      label: LABEL_REPEATED
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.tfprof.MultiGraphNodeProto"
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.profiler.-op-log-proto.-id-to-string-entry.pbtxt b/tensorflow/tools/api/golden/tensorflow.profiler.-op-log-proto.-id-to-string-entry.pbtxt
index 8c4727cf35bdfcdaf3c45e636f6627b3c85102d6..b3adc50c7e14152a81a148df9deccc5272189aad 100644
--- a/tensorflow/tools/api/golden/tensorflow.profiler.-op-log-proto.-id-to-string-entry.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.profiler.-op-log-proto.-id-to-string-entry.pbtxt
@@ -1,84 +1,21 @@
 path: "tensorflow.profiler.OpLogProto.IdToStringEntry"
-tf_class {
-  is_instance: "<class \'tensorflow.core.profiler.tfprof_log_pb2.IdToStringEntry\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "KEY_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "VALUE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "IdToStringEntry"
+    field {
+      name: "key"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "value"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
+    options {
+      map_entry: true
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.profiler.-op-log-proto.pbtxt b/tensorflow/tools/api/golden/tensorflow.profiler.-op-log-proto.pbtxt
index 1071a82b5ce1396e235b37af1991908dbd4ca7a4..7510c566ba574e9370f5e54c29023ef4fb5ee804 100644
--- a/tensorflow/tools/api/golden/tensorflow.profiler.-op-log-proto.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.profiler.-op-log-proto.pbtxt
@@ -1,88 +1,38 @@
 path: "tensorflow.profiler.OpLogProto"
-tf_class {
-  is_instance: "<class \'tensorflow.core.profiler.tfprof_log_pb2.OpLogProto\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "ID_TO_STRING_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "IdToStringEntry"
-    mtype: "<class \'google.protobuf.pyext.cpp_message.GeneratedProtocolMessageType\'>"
-  }
-  member {
-    name: "LOG_ENTRIES_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "OpLogProto"
+    field {
+      name: "log_entries"
+      number: 1
+      label: LABEL_REPEATED
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.tfprof.OpLogEntry"
+    }
+    field {
+      name: "id_to_string"
+      number: 2
+      label: LABEL_REPEATED
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.tfprof.OpLogProto.IdToStringEntry"
+    }
+    nested_type {
+      name: "IdToStringEntry"
+      field {
+        name: "key"
+        number: 1
+        label: LABEL_OPTIONAL
+        type: TYPE_INT64
+      }
+      field {
+        name: "value"
+        number: 2
+        label: LABEL_OPTIONAL
+        type: TYPE_STRING
+      }
+      options {
+        map_entry: true
+      }
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.saved_model.builder.-saved-model-builder.pbtxt b/tensorflow/tools/api/golden/tensorflow.saved_model.builder.-saved-model-builder.pbtxt
index ca8e5884b18110d4293225e595c030e9629b5663..83bd7035409534abf036c7e2b0d66fcc060ada3a 100644
--- a/tensorflow/tools/api/golden/tensorflow.saved_model.builder.-saved-model-builder.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.saved_model.builder.-saved-model-builder.pbtxt
@@ -8,11 +8,11 @@ tf_class {
   }
   member_method {
     name: "add_meta_graph"
-    argspec: "args=[\'self\', \'tags\', \'signature_def_map\', \'assets_collection\', \'legacy_init_op\', \'clear_devices\', \'main_op\', \'strip_default_attrs\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'False\', \'None\', \'False\'], "
+    argspec: "args=[\'self\', \'tags\', \'signature_def_map\', \'assets_collection\', \'legacy_init_op\', \'clear_devices\', \'main_op\', \'strip_default_attrs\', \'saver\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'False\', \'None\', \'False\', \'None\'], "
   }
   member_method {
     name: "add_meta_graph_and_variables"
-    argspec: "args=[\'self\', \'sess\', \'tags\', \'signature_def_map\', \'assets_collection\', \'legacy_init_op\', \'clear_devices\', \'main_op\', \'strip_default_attrs\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'False\', \'None\', \'False\'], "
+    argspec: "args=[\'self\', \'sess\', \'tags\', \'signature_def_map\', \'assets_collection\', \'legacy_init_op\', \'clear_devices\', \'main_op\', \'strip_default_attrs\', \'saver\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'False\', \'None\', \'False\', \'None\'], "
   }
   member_method {
     name: "save"
diff --git a/tensorflow/tools/api/golden/tensorflow.saved_model.loader.pbtxt b/tensorflow/tools/api/golden/tensorflow.saved_model.loader.pbtxt
index 896e2160c693039ab5582be13286f387c08d8f37..511e6b4712d3c55746a39fe9098fa3b649bc75dc 100644
--- a/tensorflow/tools/api/golden/tensorflow.saved_model.loader.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.saved_model.loader.pbtxt
@@ -2,7 +2,7 @@ path: "tensorflow.saved_model.loader"
 tf_module {
   member_method {
     name: "load"
-    argspec: "args=[\'sess\', \'tags\', \'export_dir\'], varargs=None, keywords=saver_kwargs, defaults=None"
+    argspec: "args=[\'sess\', \'tags\', \'export_dir\', \'import_scope\'], varargs=None, keywords=saver_kwargs, defaults=[\'None\'], "
   }
   member_method {
     name: "maybe_saved_model_directory"
diff --git a/tensorflow/tools/api/golden/tensorflow.strings.pbtxt b/tensorflow/tools/api/golden/tensorflow.strings.pbtxt
index a3fbe95bbad4b8c1d803e1002b2cf9ef2812fed0..b641c39feb6bcc4b5b73ba81ce0f0d4a499007ea 100644
--- a/tensorflow/tools/api/golden/tensorflow.strings.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.strings.pbtxt
@@ -4,4 +4,8 @@ tf_module {
     name: "regex_full_match"
     argspec: "args=[\'input\', \'pattern\', \'name\'], varargs=None, keywords=None, defaults=[\'None\'], "
   }
+  member_method {
+    name: "split"
+    argspec: "args=[\'source\', \'sep\', \'maxsplit\'], varargs=None, keywords=None, defaults=[\'None\', \'-1\'], "
+  }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.summary.-event.pbtxt b/tensorflow/tools/api/golden/tensorflow.summary.-event.pbtxt
index ab3449d80f6108d83b721563427bd07d07a7104b..eb99d0f5334457aa654fed0553af143839328dba 100644
--- a/tensorflow/tools/api/golden/tensorflow.summary.-event.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.summary.-event.pbtxt
@@ -1,112 +1,74 @@
 path: "tensorflow.summary.Event"
-tf_class {
-  is_instance: "<class \'tensorflow.core.util.event_pb2.Event\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "FILE_VERSION_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "GRAPH_DEF_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "LOG_MESSAGE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "META_GRAPH_DEF_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "SESSION_LOG_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "STEP_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "SUMMARY_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TAGGED_RUN_METADATA_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "WALL_TIME_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "Event"
+    field {
+      name: "wall_time"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_DOUBLE
+    }
+    field {
+      name: "step"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "file_version"
+      number: 3
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+      oneof_index: 0
+    }
+    field {
+      name: "graph_def"
+      number: 4
+      label: LABEL_OPTIONAL
+      type: TYPE_BYTES
+      oneof_index: 0
+    }
+    field {
+      name: "summary"
+      number: 5
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.Summary"
+      oneof_index: 0
+    }
+    field {
+      name: "log_message"
+      number: 6
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.LogMessage"
+      oneof_index: 0
+    }
+    field {
+      name: "session_log"
+      number: 7
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.SessionLog"
+      oneof_index: 0
+    }
+    field {
+      name: "tagged_run_metadata"
+      number: 8
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.TaggedRunMetadata"
+      oneof_index: 0
+    }
+    field {
+      name: "meta_graph_def"
+      number: 9
+      label: LABEL_OPTIONAL
+      type: TYPE_BYTES
+      oneof_index: 0
+    }
+    oneof_decl {
+      name: "what"
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.summary.-session-log.pbtxt b/tensorflow/tools/api/golden/tensorflow.summary.-session-log.pbtxt
index 92ca4872caf1c1ce7e19201b0a612c1a74ef59b0..73de73869c8d1a6808b16fe8853fd21cc8891879 100644
--- a/tensorflow/tools/api/golden/tensorflow.summary.-session-log.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.summary.-session-log.pbtxt
@@ -1,108 +1,44 @@
 path: "tensorflow.summary.SessionLog"
-tf_class {
-  is_instance: "<class \'tensorflow.core.util.event_pb2.SessionLog\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "CHECKPOINT"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "CHECKPOINT_PATH_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "MSG_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "START"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "STATUS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "STATUS_UNSPECIFIED"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "STOP"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "SessionStatus"
-    mtype: "<class \'google.protobuf.internal.enum_type_wrapper.EnumTypeWrapper\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "SessionLog"
+    field {
+      name: "status"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_ENUM
+      type_name: ".tensorflow.SessionLog.SessionStatus"
+    }
+    field {
+      name: "checkpoint_path"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
+    field {
+      name: "msg"
+      number: 3
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
+    enum_type {
+      name: "SessionStatus"
+      value {
+        name: "STATUS_UNSPECIFIED"
+        number: 0
+      }
+      value {
+        name: "START"
+        number: 1
+      }
+      value {
+        name: "STOP"
+        number: 2
+      }
+      value {
+        name: "CHECKPOINT"
+        number: 3
+      }
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.summary.-summary-description.pbtxt b/tensorflow/tools/api/golden/tensorflow.summary.-summary-description.pbtxt
index f93da2196adbc28524f93746a8e047b5c0f610d8..4a8b59cf02ed46ef70f22564f3134214840600fe 100644
--- a/tensorflow/tools/api/golden/tensorflow.summary.-summary-description.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.summary.-summary-description.pbtxt
@@ -1,80 +1,12 @@
 path: "tensorflow.summary.SummaryDescription"
-tf_class {
-  is_instance: "<class \'tensorflow.core.framework.summary_pb2.SummaryDescription\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "TYPE_HINT_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "SummaryDescription"
+    field {
+      name: "type_hint"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.summary.-summary.-audio.pbtxt b/tensorflow/tools/api/golden/tensorflow.summary.-summary.-audio.pbtxt
index 605e305e82cc3f4dd6a0bce68f846a43347a00e2..8b271cf58fc11c8666abd456021afeedc0b14c7a 100644
--- a/tensorflow/tools/api/golden/tensorflow.summary.-summary.-audio.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.summary.-summary.-audio.pbtxt
@@ -1,96 +1,36 @@
 path: "tensorflow.summary.Summary.Audio"
-tf_class {
-  is_instance: "<class \'tensorflow.core.framework.summary_pb2.Audio\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "CONTENT_TYPE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "ENCODED_AUDIO_STRING_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "LENGTH_FRAMES_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "NUM_CHANNELS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "SAMPLE_RATE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "Audio"
+    field {
+      name: "sample_rate"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_FLOAT
+    }
+    field {
+      name: "num_channels"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "length_frames"
+      number: 3
+      label: LABEL_OPTIONAL
+      type: TYPE_INT64
+    }
+    field {
+      name: "encoded_audio_string"
+      number: 4
+      label: LABEL_OPTIONAL
+      type: TYPE_BYTES
+    }
+    field {
+      name: "content_type"
+      number: 5
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.summary.-summary.-image.pbtxt b/tensorflow/tools/api/golden/tensorflow.summary.-summary.-image.pbtxt
index 0646972196dc728b3f39aad07540aa7b6893ab88..dbbc02dd0506dbcebd1690602b5786b02c3ed4a0 100644
--- a/tensorflow/tools/api/golden/tensorflow.summary.-summary.-image.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.summary.-summary.-image.pbtxt
@@ -1,92 +1,30 @@
 path: "tensorflow.summary.Summary.Image"
-tf_class {
-  is_instance: "<class \'tensorflow.core.framework.summary_pb2.Image\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "COLORSPACE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "ENCODED_IMAGE_STRING_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "HEIGHT_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "WIDTH_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "Image"
+    field {
+      name: "height"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_INT32
+    }
+    field {
+      name: "width"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_INT32
+    }
+    field {
+      name: "colorspace"
+      number: 3
+      label: LABEL_OPTIONAL
+      type: TYPE_INT32
+    }
+    field {
+      name: "encoded_image_string"
+      number: 4
+      label: LABEL_OPTIONAL
+      type: TYPE_BYTES
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.summary.-summary.-value.pbtxt b/tensorflow/tools/api/golden/tensorflow.summary.-summary.-value.pbtxt
index b319cd03d9e8678b36ae107b581aea431a969b4e..4176171cd938e383fe5366153364d8e8e8c1a1ee 100644
--- a/tensorflow/tools/api/golden/tensorflow.summary.-summary.-value.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.summary.-summary.-value.pbtxt
@@ -1,112 +1,74 @@
 path: "tensorflow.summary.Summary.Value"
-tf_class {
-  is_instance: "<class \'tensorflow.core.framework.summary_pb2.Value\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "AUDIO_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "HISTO_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "IMAGE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "METADATA_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "NODE_NAME_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "OBSOLETE_OLD_STYLE_HISTOGRAM_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "SIMPLE_VALUE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TAG_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TENSOR_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "Value"
+    field {
+      name: "node_name"
+      number: 7
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
+    field {
+      name: "tag"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
+    field {
+      name: "metadata"
+      number: 9
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.SummaryMetadata"
+    }
+    field {
+      name: "simple_value"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_FLOAT
+      oneof_index: 0
+    }
+    field {
+      name: "obsolete_old_style_histogram"
+      number: 3
+      label: LABEL_OPTIONAL
+      type: TYPE_BYTES
+      oneof_index: 0
+    }
+    field {
+      name: "image"
+      number: 4
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.Summary.Image"
+      oneof_index: 0
+    }
+    field {
+      name: "histo"
+      number: 5
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.HistogramProto"
+      oneof_index: 0
+    }
+    field {
+      name: "audio"
+      number: 6
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.Summary.Audio"
+      oneof_index: 0
+    }
+    field {
+      name: "tensor"
+      number: 8
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.TensorProto"
+      oneof_index: 0
+    }
+    oneof_decl {
+      name: "value"
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.summary.-summary.pbtxt b/tensorflow/tools/api/golden/tensorflow.summary.-summary.pbtxt
index 132ef1b7d2e933c3fe953ca2eb19b32133db8186..d6c5e3a87a115b9bdcfd044abe93177eda2af275 100644
--- a/tensorflow/tools/api/golden/tensorflow.summary.-summary.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.summary.-summary.pbtxt
@@ -1,92 +1,144 @@
 path: "tensorflow.summary.Summary"
-tf_class {
-  is_instance: "<class \'tensorflow.core.framework.summary_pb2.Summary\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "Audio"
-    mtype: "<class \'google.protobuf.pyext.cpp_message.GeneratedProtocolMessageType\'>"
-  }
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "Image"
-    mtype: "<class \'google.protobuf.pyext.cpp_message.GeneratedProtocolMessageType\'>"
-  }
-  member {
-    name: "VALUE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "Value"
-    mtype: "<class \'google.protobuf.pyext.cpp_message.GeneratedProtocolMessageType\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "Summary"
+    field {
+      name: "value"
+      number: 1
+      label: LABEL_REPEATED
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.Summary.Value"
+    }
+    nested_type {
+      name: "Image"
+      field {
+        name: "height"
+        number: 1
+        label: LABEL_OPTIONAL
+        type: TYPE_INT32
+      }
+      field {
+        name: "width"
+        number: 2
+        label: LABEL_OPTIONAL
+        type: TYPE_INT32
+      }
+      field {
+        name: "colorspace"
+        number: 3
+        label: LABEL_OPTIONAL
+        type: TYPE_INT32
+      }
+      field {
+        name: "encoded_image_string"
+        number: 4
+        label: LABEL_OPTIONAL
+        type: TYPE_BYTES
+      }
+    }
+    nested_type {
+      name: "Audio"
+      field {
+        name: "sample_rate"
+        number: 1
+        label: LABEL_OPTIONAL
+        type: TYPE_FLOAT
+      }
+      field {
+        name: "num_channels"
+        number: 2
+        label: LABEL_OPTIONAL
+        type: TYPE_INT64
+      }
+      field {
+        name: "length_frames"
+        number: 3
+        label: LABEL_OPTIONAL
+        type: TYPE_INT64
+      }
+      field {
+        name: "encoded_audio_string"
+        number: 4
+        label: LABEL_OPTIONAL
+        type: TYPE_BYTES
+      }
+      field {
+        name: "content_type"
+        number: 5
+        label: LABEL_OPTIONAL
+        type: TYPE_STRING
+      }
+    }
+    nested_type {
+      name: "Value"
+      field {
+        name: "node_name"
+        number: 7
+        label: LABEL_OPTIONAL
+        type: TYPE_STRING
+      }
+      field {
+        name: "tag"
+        number: 1
+        label: LABEL_OPTIONAL
+        type: TYPE_STRING
+      }
+      field {
+        name: "metadata"
+        number: 9
+        label: LABEL_OPTIONAL
+        type: TYPE_MESSAGE
+        type_name: ".tensorflow.SummaryMetadata"
+      }
+      field {
+        name: "simple_value"
+        number: 2
+        label: LABEL_OPTIONAL
+        type: TYPE_FLOAT
+        oneof_index: 0
+      }
+      field {
+        name: "obsolete_old_style_histogram"
+        number: 3
+        label: LABEL_OPTIONAL
+        type: TYPE_BYTES
+        oneof_index: 0
+      }
+      field {
+        name: "image"
+        number: 4
+        label: LABEL_OPTIONAL
+        type: TYPE_MESSAGE
+        type_name: ".tensorflow.Summary.Image"
+        oneof_index: 0
+      }
+      field {
+        name: "histo"
+        number: 5
+        label: LABEL_OPTIONAL
+        type: TYPE_MESSAGE
+        type_name: ".tensorflow.HistogramProto"
+        oneof_index: 0
+      }
+      field {
+        name: "audio"
+        number: 6
+        label: LABEL_OPTIONAL
+        type: TYPE_MESSAGE
+        type_name: ".tensorflow.Summary.Audio"
+        oneof_index: 0
+      }
+      field {
+        name: "tensor"
+        number: 8
+        label: LABEL_OPTIONAL
+        type: TYPE_MESSAGE
+        type_name: ".tensorflow.TensorProto"
+        oneof_index: 0
+      }
+      oneof_decl {
+        name: "value"
+      }
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.summary.-tagged-run-metadata.pbtxt b/tensorflow/tools/api/golden/tensorflow.summary.-tagged-run-metadata.pbtxt
index 4dce20819de06fb3a31d6b044a8c751c22da5c74..27c8873320403cb2e7402ef9f1bb0e7134d5f96b 100644
--- a/tensorflow/tools/api/golden/tensorflow.summary.-tagged-run-metadata.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.summary.-tagged-run-metadata.pbtxt
@@ -1,84 +1,18 @@
 path: "tensorflow.summary.TaggedRunMetadata"
-tf_class {
-  is_instance: "<class \'tensorflow.core.util.event_pb2.TaggedRunMetadata\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "RUN_METADATA_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TAG_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "TaggedRunMetadata"
+    field {
+      name: "tag"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
+    field {
+      name: "run_metadata"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_BYTES
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-bytes-list.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-bytes-list.pbtxt
index 8cf52b817f342a3ccd8bcf5f4f532b886a318f23..87e4f160e5bd5950dfc338649fb531c92cc81b60 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-bytes-list.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-bytes-list.pbtxt
@@ -1,80 +1,12 @@
 path: "tensorflow.train.BytesList"
-tf_class {
-  is_instance: "<class \'tensorflow.core.example.feature_pb2.BytesList\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "VALUE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "BytesList"
+    field {
+      name: "value"
+      number: 1
+      label: LABEL_REPEATED
+      type: TYPE_BYTES
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-cluster-def.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-cluster-def.pbtxt
index 93ff856b09de15f12954bb11802a935b82c1d278..f9de26839f5f6dc1591bfc909ca8e6c02271b5c7 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-cluster-def.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-cluster-def.pbtxt
@@ -1,80 +1,13 @@
 path: "tensorflow.train.ClusterDef"
-tf_class {
-  is_instance: "<class \'tensorflow.core.protobuf.cluster_pb2.ClusterDef\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "JOB_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "ClusterDef"
+    field {
+      name: "job"
+      number: 1
+      label: LABEL_REPEATED
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.JobDef"
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-example.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-example.pbtxt
index f7215a20372e981a2fb20f20d9e4cfa43973c7cc..23c30f1ef4fe2dd93e8714655dbb1ef3b8e05c65 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-example.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-example.pbtxt
@@ -1,80 +1,13 @@
 path: "tensorflow.train.Example"
-tf_class {
-  is_instance: "<class \'tensorflow.core.example.example_pb2.Example\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "FEATURES_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "Example"
+    field {
+      name: "features"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.Features"
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-exponential-moving-average.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-exponential-moving-average.pbtxt
index 737acbe07c93da30b4a206cbdae2efcbc2cb2159..c9fe136e68b5f3cadaff6d4fd0638b7f10d18365 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-exponential-moving-average.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-exponential-moving-average.pbtxt
@@ -2,6 +2,10 @@ path: "tensorflow.train.ExponentialMovingAverage"
 tf_class {
   is_instance: "<class \'tensorflow.python.training.moving_averages.ExponentialMovingAverage\'>"
   is_instance: "<type \'object\'>"
+  member {
+    name: "name"
+    mtype: "<type \'property\'>"
+  }
   member_method {
     name: "__init__"
     argspec: "args=[\'self\', \'decay\', \'num_updates\', \'zero_debias\', \'name\'], varargs=None, keywords=None, defaults=[\'None\', \'False\', \'ExponentialMovingAverage\'], "
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-feature-list.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-feature-list.pbtxt
index 3ad98354d69453d6f66a858991d4a19e2525d1e0..2a8b3714fc0c4f5e979bc02550a8e08835d53cb4 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-feature-list.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-feature-list.pbtxt
@@ -1,80 +1,13 @@
 path: "tensorflow.train.FeatureList"
-tf_class {
-  is_instance: "<class \'tensorflow.core.example.feature_pb2.FeatureList\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "FEATURE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "FeatureList"
+    field {
+      name: "feature"
+      number: 1
+      label: LABEL_REPEATED
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.Feature"
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-feature-lists.-feature-list-entry.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-feature-lists.-feature-list-entry.pbtxt
index cd171f4ca3ef1e48848be1bd71f8a56685534b8c..cd1d56e606c96b62346b936001a5a0f07a8a8ad8 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-feature-lists.-feature-list-entry.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-feature-lists.-feature-list-entry.pbtxt
@@ -1,84 +1,22 @@
 path: "tensorflow.train.FeatureLists.FeatureListEntry"
-tf_class {
-  is_instance: "<class \'tensorflow.core.example.feature_pb2.FeatureListEntry\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "KEY_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "VALUE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "FeatureListEntry"
+    field {
+      name: "key"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
+    field {
+      name: "value"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.FeatureList"
+    }
+    options {
+      map_entry: true
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-feature-lists.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-feature-lists.pbtxt
index 3d95017d584ad95f96a54ef52a966aa6f2a69a58..3c183a64769b59b104c52b6840e8f351f4b0cef5 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-feature-lists.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-feature-lists.pbtxt
@@ -1,84 +1,32 @@
 path: "tensorflow.train.FeatureLists"
-tf_class {
-  is_instance: "<class \'tensorflow.core.example.feature_pb2.FeatureLists\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "FEATURE_LIST_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "FeatureListEntry"
-    mtype: "<class \'google.protobuf.pyext.cpp_message.GeneratedProtocolMessageType\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "FeatureLists"
+    field {
+      name: "feature_list"
+      number: 1
+      label: LABEL_REPEATED
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.FeatureLists.FeatureListEntry"
+    }
+    nested_type {
+      name: "FeatureListEntry"
+      field {
+        name: "key"
+        number: 1
+        label: LABEL_OPTIONAL
+        type: TYPE_STRING
+      }
+      field {
+        name: "value"
+        number: 2
+        label: LABEL_OPTIONAL
+        type: TYPE_MESSAGE
+        type_name: ".tensorflow.FeatureList"
+      }
+      options {
+        map_entry: true
+      }
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-feature.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-feature.pbtxt
index 9cca132bba91c46398c2fecb4ff7b45bd5ed2af2..5d0eb871c2f4aeb13d6b8518486f11b1f80d0620 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-feature.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-feature.pbtxt
@@ -1,88 +1,33 @@
 path: "tensorflow.train.Feature"
-tf_class {
-  is_instance: "<class \'tensorflow.core.example.feature_pb2.Feature\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "BYTES_LIST_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "FLOAT_LIST_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "INT64_LIST_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "Feature"
+    field {
+      name: "bytes_list"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.BytesList"
+      oneof_index: 0
+    }
+    field {
+      name: "float_list"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.FloatList"
+      oneof_index: 0
+    }
+    field {
+      name: "int64_list"
+      number: 3
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.Int64List"
+      oneof_index: 0
+    }
+    oneof_decl {
+      name: "kind"
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-features.-feature-entry.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-features.-feature-entry.pbtxt
index 858aee03415dead500cdb450f5885a904f620221..f912005f1cc35f12ce6eba5313b0c67adebe70f7 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-features.-feature-entry.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-features.-feature-entry.pbtxt
@@ -1,84 +1,22 @@
 path: "tensorflow.train.Features.FeatureEntry"
-tf_class {
-  is_instance: "<class \'tensorflow.core.example.feature_pb2.FeatureEntry\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "KEY_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "VALUE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "FeatureEntry"
+    field {
+      name: "key"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
+    field {
+      name: "value"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.Feature"
+    }
+    options {
+      map_entry: true
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-features.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-features.pbtxt
index 49cd12153bf3078eb1e68cfd6efad6e2673439f4..b788ca1d57e1d679a1b809d85c6aa9bcef01f252 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-features.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-features.pbtxt
@@ -1,84 +1,32 @@
 path: "tensorflow.train.Features"
-tf_class {
-  is_instance: "<class \'tensorflow.core.example.feature_pb2.Features\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "FEATURE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "FeatureEntry"
-    mtype: "<class \'google.protobuf.pyext.cpp_message.GeneratedProtocolMessageType\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "Features"
+    field {
+      name: "feature"
+      number: 1
+      label: LABEL_REPEATED
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.Features.FeatureEntry"
+    }
+    nested_type {
+      name: "FeatureEntry"
+      field {
+        name: "key"
+        number: 1
+        label: LABEL_OPTIONAL
+        type: TYPE_STRING
+      }
+      field {
+        name: "value"
+        number: 2
+        label: LABEL_OPTIONAL
+        type: TYPE_MESSAGE
+        type_name: ".tensorflow.Feature"
+      }
+      options {
+        map_entry: true
+      }
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-float-list.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-float-list.pbtxt
index e3f01334b547feef87d07166eb3784659c41d542..55d3b46f20e17ec4e6fbac5672e1b0a8ef98552d 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-float-list.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-float-list.pbtxt
@@ -1,80 +1,15 @@
 path: "tensorflow.train.FloatList"
-tf_class {
-  is_instance: "<class \'tensorflow.core.example.feature_pb2.FloatList\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "VALUE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "FloatList"
+    field {
+      name: "value"
+      number: 1
+      label: LABEL_REPEATED
+      type: TYPE_FLOAT
+      options {
+        packed: true
+      }
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-int64-list.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-int64-list.pbtxt
index 8917dc122cfd0b0a7de0a3a74da3c45104d9eaff..1de92b3ab7b5e0ff873a7e8092c7e6c2edcbd2ce 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-int64-list.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-int64-list.pbtxt
@@ -1,80 +1,15 @@
 path: "tensorflow.train.Int64List"
-tf_class {
-  is_instance: "<class \'tensorflow.core.example.feature_pb2.Int64List\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "VALUE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "Int64List"
+    field {
+      name: "value"
+      number: 1
+      label: LABEL_REPEATED
+      type: TYPE_INT64
+      options {
+        packed: true
+      }
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-job-def.-tasks-entry.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-job-def.-tasks-entry.pbtxt
index ac6d81541a43e934ebd131afe07be0bd6e427a7b..58115590a5eebd742afac4b31b5f585e8077e049 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-job-def.-tasks-entry.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-job-def.-tasks-entry.pbtxt
@@ -1,84 +1,21 @@
 path: "tensorflow.train.JobDef.TasksEntry"
-tf_class {
-  is_instance: "<class \'tensorflow.core.protobuf.cluster_pb2.TasksEntry\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "KEY_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "VALUE_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "TasksEntry"
+    field {
+      name: "key"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_INT32
+    }
+    field {
+      name: "value"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
+    options {
+      map_entry: true
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-job-def.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-job-def.pbtxt
index ce34537fa13b92f7900128d769ac3161d2b4d287..d7eb505e27930d6411a589909584f237a7e8b8f5 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-job-def.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-job-def.pbtxt
@@ -1,88 +1,37 @@
 path: "tensorflow.train.JobDef"
-tf_class {
-  is_instance: "<class \'tensorflow.core.protobuf.cluster_pb2.JobDef\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "NAME_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TASKS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TasksEntry"
-    mtype: "<class \'google.protobuf.pyext.cpp_message.GeneratedProtocolMessageType\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "JobDef"
+    field {
+      name: "name"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
+    field {
+      name: "tasks"
+      number: 2
+      label: LABEL_REPEATED
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.JobDef.TasksEntry"
+    }
+    nested_type {
+      name: "TasksEntry"
+      field {
+        name: "key"
+        number: 1
+        label: LABEL_OPTIONAL
+        type: TYPE_INT32
+      }
+      field {
+        name: "value"
+        number: 2
+        label: LABEL_OPTIONAL
+        type: TYPE_STRING
+      }
+      options {
+        map_entry: true
+      }
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-saver-def.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-saver-def.pbtxt
index 84498a64f5b04526e989ec03f1894dcea19d850e..4ec99469e4025603e7ab340b190cbebf7e33eed7 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-saver-def.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-saver-def.pbtxt
@@ -1,120 +1,64 @@
 path: "tensorflow.train.SaverDef"
-tf_class {
-  is_instance: "<class \'tensorflow.core.protobuf.saver_pb2.SaverDef\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "CheckpointFormatVersion"
-    mtype: "<class \'google.protobuf.internal.enum_type_wrapper.EnumTypeWrapper\'>"
-  }
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "FILENAME_TENSOR_NAME_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "KEEP_CHECKPOINT_EVERY_N_HOURS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "LEGACY"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "MAX_TO_KEEP_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "RESTORE_OP_NAME_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "SAVE_TENSOR_NAME_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "SHARDED_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "V1"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "V2"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "VERSION_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "SaverDef"
+    field {
+      name: "filename_tensor_name"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
+    field {
+      name: "save_tensor_name"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
+    field {
+      name: "restore_op_name"
+      number: 3
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
+    field {
+      name: "max_to_keep"
+      number: 4
+      label: LABEL_OPTIONAL
+      type: TYPE_INT32
+    }
+    field {
+      name: "sharded"
+      number: 5
+      label: LABEL_OPTIONAL
+      type: TYPE_BOOL
+    }
+    field {
+      name: "keep_checkpoint_every_n_hours"
+      number: 6
+      label: LABEL_OPTIONAL
+      type: TYPE_FLOAT
+    }
+    field {
+      name: "version"
+      number: 7
+      label: LABEL_OPTIONAL
+      type: TYPE_ENUM
+      type_name: ".tensorflow.SaverDef.CheckpointFormatVersion"
+    }
+    enum_type {
+      name: "CheckpointFormatVersion"
+      value {
+        name: "LEGACY"
+        number: 0
+      }
+      value {
+        name: "V1"
+        number: 1
+      }
+      value {
+        name: "V2"
+        number: 2
+      }
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-sequence-example.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-sequence-example.pbtxt
index 9ab95537021167f368d3a8f6b1e1ec1a3996aa88..6a4553bbc157960696ef17959f532fecdfd54ae8 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-sequence-example.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-sequence-example.pbtxt
@@ -1,84 +1,20 @@
 path: "tensorflow.train.SequenceExample"
-tf_class {
-  is_instance: "<class \'tensorflow.core.example.example_pb2.SequenceExample\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "CONTEXT_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "FEATURE_LISTS_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "SequenceExample"
+    field {
+      name: "context"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.Features"
+    }
+    field {
+      name: "feature_lists"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.FeatureLists"
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-server-def.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-server-def.pbtxt
index af0a3b73cc2ff3510e9a0426c28696fe51097f9d..83ee7b3eb91a558765abcde630fe6e0480b9818f 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-server-def.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-server-def.pbtxt
@@ -1,96 +1,38 @@
 path: "tensorflow.train.ServerDef"
-tf_class {
-  is_instance: "<class \'tensorflow.core.protobuf.tensorflow_server_pb2.ServerDef\'>"
-  is_instance: "<type \'google.protobuf.pyext._message.CMessage\'>"
-  member {
-    name: "CLUSTER_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "DEFAULT_SESSION_CONFIG_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "DESCRIPTOR"
-    mtype: "<type \'google.protobuf.pyext._message.MessageDescriptor\'>"
-  }
-  member {
-    name: "Extensions"
-    mtype: "<type \'getset_descriptor\'>"
-  }
-  member {
-    name: "JOB_NAME_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "PROTOCOL_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member {
-    name: "TASK_INDEX_FIELD_NUMBER"
-    mtype: "<type \'int\'>"
-  }
-  member_method {
-    name: "ByteSize"
-  }
-  member_method {
-    name: "Clear"
-  }
-  member_method {
-    name: "ClearExtension"
-  }
-  member_method {
-    name: "ClearField"
-  }
-  member_method {
-    name: "CopyFrom"
-  }
-  member_method {
-    name: "DiscardUnknownFields"
-  }
-  member_method {
-    name: "FindInitializationErrors"
-  }
-  member_method {
-    name: "FromString"
-  }
-  member_method {
-    name: "HasExtension"
-  }
-  member_method {
-    name: "HasField"
-  }
-  member_method {
-    name: "IsInitialized"
-  }
-  member_method {
-    name: "ListFields"
-  }
-  member_method {
-    name: "MergeFrom"
-  }
-  member_method {
-    name: "MergeFromString"
-  }
-  member_method {
-    name: "ParseFromString"
-  }
-  member_method {
-    name: "RegisterExtension"
-  }
-  member_method {
-    name: "SerializePartialToString"
-  }
-  member_method {
-    name: "SerializeToString"
-  }
-  member_method {
-    name: "SetInParent"
-  }
-  member_method {
-    name: "WhichOneof"
-  }
-  member_method {
-    name: "__init__"
+tf_proto {
+  descriptor {
+    name: "ServerDef"
+    field {
+      name: "cluster"
+      number: 1
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.ClusterDef"
+    }
+    field {
+      name: "job_name"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
+    field {
+      name: "task_index"
+      number: 3
+      label: LABEL_OPTIONAL
+      type: TYPE_INT32
+    }
+    field {
+      name: "default_session_config"
+      number: 4
+      label: LABEL_OPTIONAL
+      type: TYPE_MESSAGE
+      type_name: ".tensorflow.ConfigProto"
+    }
+    field {
+      name: "protocol"
+      number: 5
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-session-manager.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-session-manager.pbtxt
index cc31bb4e4b396917a00d1162125b6d2e47343322..448764fe081b250e1e22633f118268ad638cb9dd 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-session-manager.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-session-manager.pbtxt
@@ -4,7 +4,7 @@ tf_class {
   is_instance: "<type \'object\'>"
   member_method {
     name: "__init__"
-    argspec: "args=[\'self\', \'local_init_op\', \'ready_op\', \'ready_for_local_init_op\', \'graph\', \'recovery_wait_secs\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'30\'], "
+    argspec: "args=[\'self\', \'local_init_op\', \'ready_op\', \'ready_for_local_init_op\', \'graph\', \'recovery_wait_secs\', \'local_init_run_options\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'30\', \'None\'], "
   }
   member_method {
     name: "prepare_session"
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-supervisor.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-supervisor.pbtxt
index 1f0e59a1ac2d899a50ff30c7c8da8f91a0258a1e..9677e5a98e4a8308093f51a84d8b1edae405cd2b 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-supervisor.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-supervisor.pbtxt
@@ -104,7 +104,7 @@ tf_class {
   }
   member_method {
     name: "__init__"
-    argspec: "args=[\'self\', \'graph\', \'ready_op\', \'ready_for_local_init_op\', \'is_chief\', \'init_op\', \'init_feed_dict\', \'local_init_op\', \'logdir\', \'summary_op\', \'saver\', \'global_step\', \'save_summaries_secs\', \'save_model_secs\', \'recovery_wait_secs\', \'stop_grace_secs\', \'checkpoint_basename\', \'session_manager\', \'summary_writer\', \'init_fn\'], varargs=None, keywords=None, defaults=[\'None\', \'0\', \'0\', \'True\', \'0\', \'None\', \'0\', \'None\', \'0\', \'0\', \'0\', \'120\', \'600\', \'30\', \'120\', \'model.ckpt\', \'None\', \'0\', \'None\'], "
+    argspec: "args=[\'self\', \'graph\', \'ready_op\', \'ready_for_local_init_op\', \'is_chief\', \'init_op\', \'init_feed_dict\', \'local_init_op\', \'logdir\', \'summary_op\', \'saver\', \'global_step\', \'save_summaries_secs\', \'save_model_secs\', \'recovery_wait_secs\', \'stop_grace_secs\', \'checkpoint_basename\', \'session_manager\', \'summary_writer\', \'init_fn\', \'local_init_run_options\'], varargs=None, keywords=None, defaults=[\'None\', \'0\', \'0\', \'True\', \'0\', \'None\', \'0\', \'None\', \'0\', \'0\', \'0\', \'120\', \'600\', \'30\', \'120\', \'model.ckpt\', \'None\', \'0\', \'None\', \'None\'], "
   }
   member_method {
     name: "loop"
diff --git a/tensorflow/tools/api/golden/tensorflow.train.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.pbtxt
index 9fb18e77afd7c9c989ad5e967be291406e7239aa..5f45b3b1ad9fc04cd9ab306b9a05917ecd54a837 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.pbtxt
@@ -400,6 +400,10 @@ tf_module {
     name: "range_input_producer"
     argspec: "args=[\'limit\', \'num_epochs\', \'shuffle\', \'seed\', \'capacity\', \'shared_name\', \'name\'], varargs=None, keywords=None, defaults=[\'None\', \'True\', \'None\', \'32\', \'None\', \'None\'], "
   }
+  member_method {
+    name: "remove_checkpoint"
+    argspec: "args=[\'checkpoint_prefix\', \'checkpoint_format_version\', \'meta_graph_suffix\'], varargs=None, keywords=None, defaults=[\'2\', \'meta\'], "
+  }
   member_method {
     name: "replica_device_setter"
     argspec: "args=[\'ps_tasks\', \'ps_device\', \'worker_device\', \'merge_devices\', \'cluster\', \'ps_ops\', \'ps_strategy\'], varargs=None, keywords=None, defaults=[\'0\', \'/job:ps\', \'/job:worker\', \'True\', \'None\', \'None\', \'None\'], "
diff --git a/tensorflow/tools/api/lib/api_objects.proto b/tensorflow/tools/api/lib/api_objects.proto
index 0966a5f1d530ecd70c9e904c12816f0aa33b3ada..7207b9c5a9f4db7a8efcea3207adf1eb99df7d5b 100644
--- a/tensorflow/tools/api/lib/api_objects.proto
+++ b/tensorflow/tools/api/lib/api_objects.proto
@@ -1,5 +1,7 @@
 syntax = "proto2";
 
+import "google/protobuf/descriptor.proto";
+
 package third_party.tensorflow.tools.api;
 
 message TFAPIMember {
@@ -24,8 +26,17 @@ message TFAPIClass {
   repeated TFAPIMethod member_method = 3;
 };
 
+message TFAPIProto {
+  // Suppress generation of the proto API's descriptor() method lest it
+  // conflict with the standard accessor for the field having the same name.
+  option no_standard_descriptor_accessor = true;
+
+  optional google.protobuf.DescriptorProto descriptor = 1;
+};
+
 message TFAPIObject {
   optional string path = 1;
   optional TFAPIModule tf_module = 2;
   optional TFAPIClass tf_class = 3;
+  optional TFAPIProto tf_proto = 4;
 };
diff --git a/tensorflow/tools/api/lib/python_object_to_proto_visitor.py b/tensorflow/tools/api/lib/python_object_to_proto_visitor.py
index 0b30f7b4d12e3d78b12f8a229b19cb0e8d530d56..1cf330e70247260cd9e50b18903bdfecad6260e4 100644
--- a/tensorflow/tools/api/lib/python_object_to_proto_visitor.py
+++ b/tensorflow/tools/api/lib/python_object_to_proto_visitor.py
@@ -19,6 +19,7 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+from google.protobuf import message
 from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.util import tf_decorator
 from tensorflow.python.util import tf_inspect
@@ -101,6 +102,11 @@ def _SanitizedMRO(obj):
   return return_list
 
 
+def _IsProtoClass(obj):
+  """Returns whether the passed obj is a Protocol Buffer class."""
+  return isinstance(obj, type) and issubclass(obj, message.Message)
+
+
 class PythonObjectToProtoVisitor(object):
   """A visitor that summarizes given python objects as protobufs."""
 
@@ -153,6 +159,13 @@ class PythonObjectToProtoVisitor(object):
         # Store the constructed module object.
         self._protos[lib_path] = api_objects_pb2.TFAPIObject(
             path=lib_path, tf_module=module_obj)
+      elif _IsProtoClass(parent):
+        proto_obj = api_objects_pb2.TFAPIProto()
+        parent.DESCRIPTOR.CopyToProto(proto_obj.descriptor)
+
+        # Store the constructed proto object.
+        self._protos[lib_path] = api_objects_pb2.TFAPIObject(
+            path=lib_path, tf_proto=proto_obj)
       elif tf_inspect.isclass(parent):
         # Construct a class.
         class_obj = api_objects_pb2.TFAPIClass()
@@ -161,7 +174,7 @@ class PythonObjectToProtoVisitor(object):
           if name in parent_corner_cases:
             # If we have an empty entry, skip this object.
             if parent_corner_cases[name]:
-              module_obj.member.add(**(parent_corner_cases[name]))
+              class_obj.member.add(**(parent_corner_cases[name]))
           else:
             _AddMember(name, child, class_obj)
 
diff --git a/tensorflow/tools/api/tests/api_compatibility_test.py b/tensorflow/tools/api/tests/api_compatibility_test.py
index 1ad6b6d1c0ae5ca1ac1329fd49e972840020e4c3..90375a794f64a9edd2bab2671f5870ae02e84e3c 100644
--- a/tensorflow/tools/api/tests/api_compatibility_test.py
+++ b/tensorflow/tools/api/tests/api_compatibility_test.py
@@ -35,6 +35,7 @@ import unittest
 
 import tensorflow as tf
 
+from google.protobuf import message
 from google.protobuf import text_format
 
 from tensorflow.python.lib.io import file_io
@@ -195,6 +196,25 @@ class ApiCompatibilityTest(test.TestCase):
     else:
       logging.info('No differences found between API and golden.')
 
+  def testNoSubclassOfMessage(self):
+
+    def Visit(path, parent, unused_children):
+      """A Visitor that crashes on subclasses of generated proto classes."""
+      # If the traversed object is a proto Message class
+      if not (isinstance(parent, type) and
+              issubclass(parent, message.Message)):
+        return
+      if parent is message.Message:
+        return
+      # Check that it is a direct subclass of Message.
+      if message.Message not in parent.__bases__:
+        raise NotImplementedError(
+            'Object tf.%s is a subclass of a generated proto Message. '
+            'They are not yet supported by the API tools.' % path)
+    visitor = public_api.PublicAPIVisitor(Visit)
+    visitor.do_not_descend_map['tf'].append('contrib')
+    traverse.traverse(tf, visitor)
+
   @unittest.skipUnless(
       sys.version_info.major == 2,
       'API compabitility test goldens are generated using python2.')
diff --git a/tensorflow/tools/benchmark/benchmark_model.cc b/tensorflow/tools/benchmark/benchmark_model.cc
index eeb1fab40c4c7a2fff417b18111ecfe8ceabc71a..de93b12b97081feea5be96edf3b6e6dfbe5599b4 100644
--- a/tensorflow/tools/benchmark/benchmark_model.cc
+++ b/tensorflow/tools/benchmark/benchmark_model.cc
@@ -667,12 +667,12 @@ int Main(int argc, char** argv) {
         output_prefix, benchmark_name, "meta-init-plus-first-inference", 1,
         initialization_time_s + (warmup_time_us / 1000000.0) / warmup_runs);
 
-    std::map<string, int64> node_type_map_count;
-    std::map<string, int64> node_type_map_time;
-    std::map<string, int64> node_type_map_memory;
-    std::map<string, int64> node_type_map_times_called;
+    std::map<std::string, int64_t> node_type_map_count;
+    std::map<std::string, int64_t> node_type_map_time;
+    std::map<std::string, int64_t> node_type_map_memory;
+    std::map<std::string, int64_t> node_type_map_times_called;
 
-    int64 accumulated_us;
+    int64_t accumulated_us;
     stats->ComputeStatsByType(&node_type_map_count, &node_type_map_time,
                               &node_type_map_memory,
                               &node_type_map_times_called, &accumulated_us);
diff --git a/tensorflow/tools/ci_build/builds/pip.sh b/tensorflow/tools/ci_build/builds/pip.sh
index 5fa75e1d61cceeebfa77439bb64f1c644c9dba70..883bb9364742e74b4a5a7c8b0d41253352d6c2e7 100755
--- a/tensorflow/tools/ci_build/builds/pip.sh
+++ b/tensorflow/tools/ci_build/builds/pip.sh
@@ -322,6 +322,10 @@ create_activate_virtualenv_and_install_tensorflow() {
   pip install -v ${PIP_FLAGS} ${WHL_PATH} || \
     die "pip install (forcing to reinstall tensorflow) FAILED"
   echo "Successfully installed pip package ${TF_WHEEL_PATH}"
+
+  # Force downgrade setuptools.
+  pip install --upgrade setuptools==39.1.0
+
 }
 
 ################################################################################
diff --git a/tensorflow/tools/ci_build/builds/test_user_ops.sh b/tensorflow/tools/ci_build/builds/test_user_ops.sh
index c342367bacea9d2ba8152d928b93bf61cf60d0e7..25ecee472524d5346252772b3058a5e824eef217 100755
--- a/tensorflow/tools/ci_build/builds/test_user_ops.sh
+++ b/tensorflow/tools/ci_build/builds/test_user_ops.sh
@@ -239,8 +239,9 @@ function run_op() {
   fi
 }
 
-run_op $("${PYTHON_BIN_PATH}" -c "import tensorflow as tf; print(tf.Session('').run(tf.load_op_library('./${USER_OP_SO}').${USER_OP}(${OP_INPUT})))")
-run_op $("${PYTHON_BIN_PATH}" -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.load_op_library('./${USER_OP_SO}').${USER_OP}(${OP_INPUT}))") " in eager mode"
+run_op "$("${PYTHON_BIN_PATH}" -c "import tensorflow as tf; print(tf.Session('').run(tf.load_op_library('./${USER_OP_SO}').${USER_OP}(${OP_INPUT})))")"
+run_op "$("${PYTHON_BIN_PATH}" -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.load_op_library('./${USER_OP_SO}').${USER_OP}(${OP_INPUT}).numpy())")" " in eager mode"
+
 
 popd
 
diff --git a/tensorflow/tools/ci_build/builds/with_the_same_user b/tensorflow/tools/ci_build/builds/with_the_same_user
index d4bf546d401d058bd205a70c147615c8efc4f4ba..b216e3549f8ab7850c966e5a8e138f3b566f9952 100755
--- a/tensorflow/tools/ci_build/builds/with_the_same_user
+++ b/tensorflow/tools/ci_build/builds/with_the_same_user
@@ -40,7 +40,7 @@ if [ -n "${CI_BUILD_USER_FORCE_BADNAME}" ]; then
   ADDUSER_OPTS="--force-badname"
 fi
 
-getent group "${CI_BUILD_GID}" || addgroup --gid "${CI_BUILD_GID}" "${CI_BUILD_GROUP}"
+getent group "${CI_BUILD_GID}" || addgroup ${ADDUSER_OPTS} --gid "${CI_BUILD_GID}" "${CI_BUILD_GROUP}"
 getent passwd "${CI_BUILD_UID}" || adduser ${ADDUSER_OPTS} \
     --gid "${CI_BUILD_GID}" --uid "${CI_BUILD_UID}" \
     --gecos "${CI_BUILD_USER} (generated by with_the_same_user script)" \
diff --git a/tensorflow/tools/ci_build/ci_build.sh b/tensorflow/tools/ci_build/ci_build.sh
index 072dd6ab995bb41c3197d6c898405be487534593..1f0fd0387af28bf15e5c42fa14f5c1a1ee5a8cfb 100755
--- a/tensorflow/tools/ci_build/ci_build.sh
+++ b/tensorflow/tools/ci_build/ci_build.sh
@@ -134,6 +134,12 @@ if [[ $? != "0" ]]; then
   die "ERROR: docker build failed. Dockerfile is at ${DOCKERFILE_PATH}"
 fi
 
+# If caller wants the with_the_same_user script to allow bad usernames, 
+# pass the var to the docker environment
+if [ -n "${CI_BUILD_USER_FORCE_BADNAME}" ]; then
+        CI_BUILD_USER_FORCE_BADNAME_ENV="-e CI_BUILD_USER_FORCE_BADNAME=yes"
+fi
+
 # Run the command inside the container.
 echo "Running '${COMMAND[*]}' inside ${DOCKER_IMG_NAME}..."
 mkdir -p ${WORKSPACE}/bazel-ci_build-cache
@@ -148,6 +154,7 @@ ${DOCKER_BINARY} run --rm --pid=host \
     -e "CI_BUILD_GROUP=$(id -g -n)" \
     -e "CI_BUILD_GID=$(id -g)" \
     -e "CI_TENSORFLOW_SUBMODULE_PATH=${CI_TENSORFLOW_SUBMODULE_PATH}" \
+    ${CI_BUILD_USER_FORCE_BADNAME_ENV} \
     -v ${WORKSPACE}:/workspace \
     -w /workspace \
     ${GPU_EXTRA_PARAMS} \
diff --git a/tensorflow/tools/ci_build/ci_parameterized_build.sh b/tensorflow/tools/ci_build/ci_parameterized_build.sh
index e621f85652588f7b5cba6dc5128f857f9eb0fe09..90bd8bc3d0c349b5df51535cadc3b1d85d76b7d0 100755
--- a/tensorflow/tools/ci_build/ci_parameterized_build.sh
+++ b/tensorflow/tools/ci_build/ci_parameterized_build.sh
@@ -94,10 +94,6 @@
 #
 # This script can be used by Jenkins parameterized / matrix builds.
 
-# TODO(jhseu): Temporary for the gRPC pull request due to the
-# protobuf -> protobuf_archive rename. Remove later.
-TF_BUILD_BAZEL_CLEAN=1
-
 # Helper function: Convert to lower case
 to_lower () {
   echo "$1" | tr '[:upper:]' '[:lower:]'
diff --git a/tensorflow/tools/ci_build/copy_binary.py b/tensorflow/tools/ci_build/copy_binary.py
index 420d390d2b9dc1ec25461b3502c63467a7eda16b..148526492d25e9acebe036294175e2814b2ead12 100755
--- a/tensorflow/tools/ci_build/copy_binary.py
+++ b/tensorflow/tools/ci_build/copy_binary.py
@@ -32,7 +32,8 @@ import shutil
 import tempfile
 import zipfile
 
-TF_NIGHTLY_REGEX = r"(.+)tf_nightly(|_gpu)-(\d\.\d\.\d.dev[\d]{0,8})-(.+)\.whl"
+TF_NIGHTLY_REGEX = (r"(.+)tf_nightly(|_gpu)-(\d\.[\d]{1,2}"
+                    "\.\d.dev[\d]{0,8})-(.+)\.whl")
 BINARY_STRING_TEMPLATE = "%s-%s-%s.whl"
 
 
diff --git a/tensorflow/tools/ci_build/install/install_pip_packages.sh b/tensorflow/tools/ci_build/install/install_pip_packages.sh
index 982161cefeefddce7705515b0771b53acfff2706..b3d3f23ec8fd16c78b525159abeeb22776e3eac5 100755
--- a/tensorflow/tools/ci_build/install/install_pip_packages.sh
+++ b/tensorflow/tools/ci_build/install/install_pip_packages.sh
@@ -109,3 +109,7 @@ pip2 install --upgrade gast
 pip3 install --upgrade gast
 pip2 install --upgrade termcolor
 pip3 install --upgrade termcolor
+
+# Install last working version of setuptools.
+pip2 install --upgrade setuptools==39.1.0
+pip3 install --upgrade setuptools==39.1.0
diff --git a/tensorflow/tools/ci_build/install/install_python3.5_pip_packages.sh b/tensorflow/tools/ci_build/install/install_python3.5_pip_packages.sh
index 204a82f647eed550a1ad14bd6fed4cd72b0f7dba..61d34c7304708be3290b116b7e528560ea907031 100755
--- a/tensorflow/tools/ci_build/install/install_python3.5_pip_packages.sh
+++ b/tensorflow/tools/ci_build/install/install_python3.5_pip_packages.sh
@@ -39,7 +39,6 @@ if [[ -z $pip35_version ]]; then
 fi
 
 set -e
-pip3.5 install --upgrade setuptools
 pip3.5 install --upgrade pip
 
 pip3.5 install --upgrade virtualenv
@@ -82,4 +81,7 @@ pip3.5 install --upgrade astor
 pip3.5 install --upgrade gast
 pip3.5 install --upgrade termcolor
 
+# Install last working version of setuptools.
+pip3.5 install --upgrade setuptools==39.1.0
+
 # LINT.ThenChange(//tensorflow/tools/ci_build/install/install_python3.6_pip_packages.sh)
diff --git a/tensorflow/tools/ci_build/install/install_python3.6_pip_packages.sh b/tensorflow/tools/ci_build/install/install_python3.6_pip_packages.sh
index 275abeb669792423301f09339c786bd3869d6de9..fe2d2cf11c3a172cb795f91184e474a0d7d15167 100755
--- a/tensorflow/tools/ci_build/install/install_python3.6_pip_packages.sh
+++ b/tensorflow/tools/ci_build/install/install_python3.6_pip_packages.sh
@@ -49,7 +49,6 @@ cd Python-3.6.1
 make altinstall
 ln -s /usr/local/bin/pip3.6 /usr/local/bin/pip3
 
-pip3 install --upgrade setuptools
 pip3 install --upgrade pip
 
 pip3 install --upgrade virtualenv
@@ -98,4 +97,7 @@ pip3 install --upgrade astor
 pip3 install --upgrade gast
 pip3 install --upgrade termcolor
 
+# Install last working version of setuptools.
+pip3 install --upgrade setuptools==39.1.0
+
 # LINT.ThenChange(//tensorflow/tools/ci_build/install/install_python3.5_pip_packages.sh)
diff --git a/tensorflow/tools/ci_build/linux/cpu/run_cc_core.sh b/tensorflow/tools/ci_build/linux/cpu/run_cc_core.sh
index 51e10f81f82da7920e9d219eaec3e1eb2973b998..8eeddcdb824e84c8c20e948488fd40f3b26fff01 100755
--- a/tensorflow/tools/ci_build/linux/cpu/run_cc_core.sh
+++ b/tensorflow/tools/ci_build/linux/cpu/run_cc_core.sh
@@ -34,5 +34,5 @@ yes "" | $PYTHON_BIN_PATH configure.py
 # Run bazel test command. Double test timeouts to avoid flakes.
 bazel test --test_tag_filters=-no_oss,-gpu,-benchmark-test --test_lang_filters=cc,java -k \
     --jobs=${N_JOBS} --test_timeout 300,450,1200,3600 --config=opt \
-    --test_output=errors -- \
+    --test_output=errors --test_size_filters=small,medium -- \
     //tensorflow/... -//tensorflow/compiler/... -//tensorflow/contrib/...
diff --git a/tensorflow/tools/ci_build/linux/cpu/run_py2_core.sh b/tensorflow/tools/ci_build/linux/cpu/run_py2_core.sh
index ea14848b1ae74ef0c42d14678fde225d465512bf..8eca1987f08491d92953971584ee612b07c13566 100755
--- a/tensorflow/tools/ci_build/linux/cpu/run_py2_core.sh
+++ b/tensorflow/tools/ci_build/linux/cpu/run_py2_core.sh
@@ -33,5 +33,5 @@ yes "" | $PYTHON_BIN_PATH configure.py
 # Run bazel test command. Double test timeouts to avoid flakes.
 bazel test --test_tag_filters=-no_oss,-oss_serial,-gpu,-benchmark-test --test_lang_filters=py -k \
     --jobs=${N_JOBS} --test_timeout 300,450,1200,3600 --build_tests_only --config=opt \
-    --test_output=errors -- \
+    --test_output=errors --test_size_filters=small,medium -- \
     //tensorflow/... -//tensorflow/compiler/... -//tensorflow/contrib/...
diff --git a/tensorflow/tools/ci_build/linux/cpu/run_py3_contrib.sh b/tensorflow/tools/ci_build/linux/cpu/run_py3_contrib.sh
index c7980812507d05a100cf54158fb0581bd4a71296..2b68de3c5b9bbb0c09ddead7466049827fac4147 100755
--- a/tensorflow/tools/ci_build/linux/cpu/run_py3_contrib.sh
+++ b/tensorflow/tools/ci_build/linux/cpu/run_py3_contrib.sh
@@ -33,7 +33,7 @@ yes "" | $PYTHON_BIN_PATH configure.py
 # Run bazel test command. Double test timeouts to avoid flakes.
 bazel test --test_tag_filters=-no_oss,-oss_serial,-gpu,-benchmark-test -k \
     --jobs=${N_JOBS} --test_timeout 300,450,1200,3600 --config=opt \
-    --test_output=errors -- \
+    --test_size_filters=small,medium --test_output=errors -- \
     //tensorflow/contrib/... \
     -//tensorflow/contrib/lite/... \
     //tensorflow/contrib/lite:context_test \
diff --git a/tensorflow/tools/ci_build/linux/cpu/run_py3_core.sh b/tensorflow/tools/ci_build/linux/cpu/run_py3_core.sh
index a9accb9dd5b2d23e028a34ac3d99976d5f2f59db..51eb2cd7e67a4eb53cfc033e9543e13ceaf1b963 100755
--- a/tensorflow/tools/ci_build/linux/cpu/run_py3_core.sh
+++ b/tensorflow/tools/ci_build/linux/cpu/run_py3_core.sh
@@ -33,5 +33,5 @@ yes "" | $PYTHON_BIN_PATH configure.py
 # Run bazel test command. Double test timeouts to avoid flakes.
 bazel test --test_tag_filters=-no_oss,-oss_serial,-gpu,-benchmark-test --test_lang_filters=py -k \
     --jobs=${N_JOBS} --test_timeout 300,450,1200,3600 --build_tests_only --config=opt \
-    --test_output=errors -- \
+    --test_output=errors --test_size_filters=small,medium -- \
     //tensorflow/... -//tensorflow/compiler/... -//tensorflow/contrib/...
diff --git a/tensorflow/tools/ci_build/linux/gpu/run_cc_core.sh b/tensorflow/tools/ci_build/linux/gpu/run_cc_core.sh
index 02224d8e9d9efd92b5c1658118bd0c45bdf4f1db..9d2c8383fae5c1a65ff2bf16a496116c519f6ce4 100755
--- a/tensorflow/tools/ci_build/linux/gpu/run_cc_core.sh
+++ b/tensorflow/tools/ci_build/linux/gpu/run_cc_core.sh
@@ -37,5 +37,6 @@ yes "" | $PYTHON_BIN_PATH configure.py
 bazel test --config=cuda --test_tag_filters=-no_oss,-oss_serial,-no_gpu,-benchmark-test -k \
     --test_lang_filters=cc --jobs=${N_JOBS} --test_timeout 300,450,1200,3600 \
     --build_tests_only --test_output=errors --local_test_jobs=8 --config=opt \
+    --test_size_filters=small,medium \
     --run_under=//tensorflow/tools/ci_build/gpu_build:parallel_gpu_execute -- \
     //tensorflow/... -//tensorflow/compiler/... -//tensorflow/contrib/...
diff --git a/tensorflow/tools/ci_build/linux/gpu/run_py3_core.sh b/tensorflow/tools/ci_build/linux/gpu/run_py3_core.sh
index 0367a53d1459e7207a76c83e0c1e5c83580722a7..5b3383e1059a189c55ac2f2374a3160df572e5e8 100755
--- a/tensorflow/tools/ci_build/linux/gpu/run_py3_core.sh
+++ b/tensorflow/tools/ci_build/linux/gpu/run_py3_core.sh
@@ -37,5 +37,6 @@ yes "" | $PYTHON_BIN_PATH configure.py
 bazel test --config=cuda --test_tag_filters=-no_oss,-oss_serial,-no_gpu,-benchmark-test -k \
     --test_lang_filters=py --jobs=${N_JOBS} --test_timeout 300,450,1200,3600 \
     --build_tests_only --test_output=errors --local_test_jobs=8 --config=opt \
+    --test_size_filters=small,medium \
     --run_under=//tensorflow/tools/ci_build/gpu_build:parallel_gpu_execute -- \
     //tensorflow/... -//tensorflow/compiler/... -//tensorflow/contrib/...
diff --git a/tensorflow/tools/ci_build/linux/mkl/basic-mkl-test.sh b/tensorflow/tools/ci_build/linux/mkl/basic-mkl-test.sh
new file mode 100755
index 0000000000000000000000000000000000000000..10a09a415a1fd5657efe1734ebf63b9cfc3dfc6e
--- /dev/null
+++ b/tensorflow/tools/ci_build/linux/mkl/basic-mkl-test.sh
@@ -0,0 +1,29 @@
+#!/usr/bin/env bash
+# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+#
+# Usage: basic_mkl_test.sh
+
+# Helper function to traverse directories up until given file is found.
+function upsearch () {
+  test / == "$PWD" && return || \
+      test -e "$1" && echo "$PWD" && return || \
+      cd .. && upsearch "$1"
+}
+
+# Set up WORKSPACE.
+WORKSPACE="${WORKSPACE:-$(upsearch WORKSPACE)}"
+
+BUILD_TAG=mkl-ci-test CI_BUILD_USER_FORCE_BADNAME=yes ${WORKSPACE}/tensorflow/tools/ci_build/ci_build.sh cpu tensorflow/tools/ci_build/linux/cpu/run_mkl.sh
diff --git a/tensorflow/tools/ci_build/pi/build_raspberry_pi.sh b/tensorflow/tools/ci_build/pi/build_raspberry_pi.sh
index 30ea8539aa33fb5b033b5ba3c2e53ae66fd17203..b8bce57c87ab39ab2f51288163187f2e87c9135d 100755
--- a/tensorflow/tools/ci_build/pi/build_raspberry_pi.sh
+++ b/tensorflow/tools/ci_build/pi/build_raspberry_pi.sh
@@ -79,6 +79,7 @@ if [[ $1 == "PI_ONE" ]]; then
   --linkopt=-L${OPENBLAS_INSTALL_PATH}/lib/
   --linkopt=-l:libopenblas.a"
   echo "Building for the Pi One/Zero, with no NEON support"
+  WHEEL_ARCH=linux_armv6l
 else
   PI_COPTS='--copt=-march=armv7-a --copt=-mfpu=neon-vfpv4
   --copt=-std=gnu11 --copt=-DS_IREAD=S_IRUSR --copt=-DS_IWRITE=S_IWUSR
@@ -86,6 +87,7 @@ else
   --copt=-U__GCC_HAVE_SYNC_COMPARE_AND_SWAP_1
   --copt=-U__GCC_HAVE_SYNC_COMPARE_AND_SWAP_2
   --copt=-U__GCC_HAVE_SYNC_COMPARE_AND_SWAP_8'
+  WHEEL_ARCH=linux_armv7l
   echo "Building for the Pi Two/Three, with NEON acceleration"
 fi
 
@@ -100,7 +102,8 @@ bazel build -c opt ${PI_COPTS} \
   --copt=-fomit-frame-pointer --cpu=armeabi \
   --crosstool_top=@local_config_arm_compiler//:toolchain \
   --verbose_failures \
-  --distinct_host_configuration=true \
+  //tensorflow:libtensorflow.so \
+  //tensorflow:libtensorflow_framework.so \
   //tensorflow/tools/benchmark:benchmark_model \
   //tensorflow/tools/pip_package:build_pip_package
 
@@ -113,10 +116,12 @@ BDIST_OPTS="--universal" \
   bazel-bin/tensorflow/tools/pip_package/build_pip_package "${OUTDIR}"
 
 OLD_FN=$(ls "${OUTDIR}" | grep -m 1 \.whl)
-SUB='s/tensorflow-([^-]+)-([^-]+)-.*/tensorflow-\1-\2-none-any.whl/; print'
+SUB='s/tensorflow-([^-]+)-([^-]+)-.*/tensorflow-\1-\2-none-'${WHEEL_ARCH}'.whl/; print'
 NEW_FN=$(echo "${OLD_FN}" | perl -ne "${SUB}")
 mv "${OUTDIR}/${OLD_FN}" "${OUTDIR}/${NEW_FN}"
 cp bazel-bin/tensorflow/tools/benchmark/benchmark_model "${OUTDIR}"
+cp bazel-bin/tensorflow/libtensorflow.so "${OUTDIR}"
+cp bazel-bin/tensorflow/libtensorflow_framework.so "${OUTDIR}"
 
 echo "Output can be found here:"
 find "${OUTDIR}"
diff --git a/tensorflow/tools/ci_build/windows/bazel/bazel_test_lib.sh b/tensorflow/tools/ci_build/windows/bazel/bazel_test_lib.sh
index 582188fc00b260926820a6add1331cf8fe0c8a9b..a3e07737a4fa79de80cf667d058517772db9f103 100644
--- a/tensorflow/tools/ci_build/windows/bazel/bazel_test_lib.sh
+++ b/tensorflow/tools/ci_build/windows/bazel/bazel_test_lib.sh
@@ -14,130 +14,20 @@
 # limitations under the License.
 # ==============================================================================
 #
-# C++ tests
-failing_cpu_cc_tests="\
-    //tensorflow/core/kernels:control_flow_ops_test + \
-    //tensorflow/core:example_example_parser_configuration_test + \
-    //tensorflow/core:lib_core_status_test + \
-    //tensorflow/core:lib_monitoring_collection_registry_test + \
-    //tensorflow/core:lib_strings_numbers_test + \
-    //tensorflow/core/platform/hadoop:hadoop_file_system_test + \
-    //tensorflow/core:platform_file_system_test + \
-    //tensorflow/core:platform_logging_test + \
-    //tensorflow/core:util_sparse_sparse_tensor_test + \
-    //tensorflow/cc:framework_gradient_checker_test + \
-    //tensorflow/cc:framework_gradients_test + \
-    //tensorflow/cc:gradients_array_grad_test + \
-    //tensorflow/cc:gradients_math_grad_test + \
-    //tensorflow/cc:gradients_nn_grad_test + \
-    //tensorflow/cc/saved_model:loader_test \
-"
-
-broken_cpu_cc_tests="\
-    //tensorflow/cc:framework_cc_ops_test + \
-    //tensorflow/core/platform/cloud:time_util_test + \
-    //tensorflow/core/platform/cloud:oauth_client_test + \
-    //tensorflow/core/platform/cloud:http_request_test + \
-    //tensorflow/core/platform/cloud:google_auth_provider_test + \
-    //tensorflow/core/platform/cloud:gcs_file_system_test + \
-    //tensorflow/core/kernels/cloud:bigquery_table_accessor_test + \
-    //tensorflow/core/kernels/hexagon:graph_transferer_test + \
-    //tensorflow/core/kernels:remote_fused_graph_execute_utils_test + \
-    //tensorflow/core/kernels:requantize_op_test + \
-    //tensorflow/core/kernels:requantization_range_op_test + \
-    //tensorflow/core/kernels:quantized_reshape_op_test + \
-    //tensorflow/core/kernels:quantized_pooling_ops_test + \
-    //tensorflow/core/kernels:quantized_matmul_op_test + \
-    //tensorflow/core/kernels:quantized_conv_ops_test + \
-    //tensorflow/core/kernels:quantized_concat_op_test + \
-    //tensorflow/core/kernels:quantized_bias_add_op_test + \
-    //tensorflow/core/kernels:quantized_batch_norm_op_test + \
-    //tensorflow/core/kernels:quantized_activation_ops_test + \
-    //tensorflow/core/kernels:quantize_op_test + \
-    //tensorflow/core/kernels:quantize_down_and_shrink_range_op_test + \
-    //tensorflow/core/kernels:quantize_and_dequantize_op_test_gpu + \
-    //tensorflow/core/kernels:quantize_and_dequantize_op_test + \
-    //tensorflow/core/kernels:quantization_utils_test + \
-    //tensorflow/core/kernels:debug_ops_test + \
-    //tensorflow/core/distributed_runtime/rpc:rpc_rendezvous_mgr_test_gpu + \
-    //tensorflow/core/distributed_runtime/rpc:rpc_rendezvous_mgr_test + \
-    //tensorflow/core/distributed_runtime/rpc:grpc_tensor_coding_test + \
-    //tensorflow/core/distributed_runtime/rpc:grpc_session_test_gpu + \
-    //tensorflow/core/distributed_runtime/rpc:grpc_session_test + \
-    //tensorflow/core/distributed_runtime/rpc:grpc_channel_test_gpu + \
-    //tensorflow/core/distributed_runtime/rpc:grpc_channel_test + \
-    //tensorflow/core/distributed_runtime:remote_device_test_gpu + \
-    //tensorflow/core/distributed_runtime:remote_device_test + \
-    //tensorflow/core/distributed_runtime:executor_test_gpu + \
-    //tensorflow/core/distributed_runtime:executor_test + \
-    //tensorflow/core/debug:debug_gateway_test + \
-    //tensorflow/core/debug:debug_grpc_io_utils_test + \
-    //tensorflow/core:util_reporter_test + \
-    //tensorflow/core:util_memmapped_file_system_test + \
-    //tensorflow/core:platform_subprocess_test + \
-    //tensorflow/core:platform_profile_utils_cpu_utils_test + \
-    //tensorflow/core:lib_jpeg_jpeg_mem_unittest + \
-    //tensorflow/core/debug:debug_io_utils_test \
-"
-
-# lib_core_threadpool_test is timeout, but it passes when running alone
-extra_failing_gpu_cc_tests="\
-    //tensorflow/core:lib_core_threadpool_test + \
-    //tensorflow/core:cuda_libdevice_path_test + \
-    //tensorflow/core:common_runtime_direct_session_test + \
-    //tensorflow/core:common_runtime_direct_session_with_tracking_alloc_test + \
-    //tensorflow/core:device_tracer_test + \
-    //tensorflow/core:ops_math_grad_test \
-"
-
-exclude_cpu_cc_tests="${failing_cpu_cc_tests} + ${broken_cpu_cc_tests}"
-
-exclude_gpu_cc_tests="${extra_failing_gpu_cc_tests} + ${exclude_cpu_cc_tests}"
 
 function run_configure_for_cpu_build {
-  # Due to a bug in Bazel: https://github.com/bazelbuild/bazel/issues/2182
-  # yes "" | ./configure doesn't work on Windows, so we set all the
-  # environment variables in advance to avoid interact with the script.
-  export TF_NEED_CUDA=0
-  if [ -z "$TF_ENABLE_XLA" ]; then
-    export TF_ENABLE_XLA=0
-  fi
-  if [ -z "$TF_NEED_MKL" ]; then
-    export TF_NEED_MKL=0
-  fi
-  export TF_NEED_VERBS=0
-  export TF_NEED_GCP=1
-  export TF_NEED_HDFS=0
-  export TF_NEED_OPENCL_SYCL=0
-  echo "" | ./configure
+  yes "" | ./configure
 }
 
 function run_configure_for_gpu_build {
-  # Due to a bug in Bazel: https://github.com/bazelbuild/bazel/issues/2182
-  # yes "" | ./configure doesn't work on Windows, so we set all the
-  # environment variables in advance to avoid interact with the script.
+  # Enable CUDA support
   export TF_NEED_CUDA=1
-  export TF_CUDA_VERSION=9.0
-  export CUDA_TOOLKIT_PATH="C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v9.0"
-  export TF_CUDNN_VERSION=7.0
-  if [ -z "$CUDNN_INSTALL_PATH" ]; then
-    export CUDNN_INSTALL_PATH="C:/tools/cuda"
-  fi
-  export TF_CUDA_COMPUTE_CAPABILITIES="3.7"
-  if [ -z "$TF_ENABLE_XLA" ]; then
-    export TF_ENABLE_XLA=0
-  fi
-  export TF_NEED_VERBS=0
-  export TF_NEED_MKL=0
-  export TF_NEED_GCP=0
-  export TF_NEED_HDFS=0
-  export TF_NEED_OPENCL_SYCL=0
 
   # TODO(pcloudy): Remove this after TensorFlow uses its own CRSOOTOOL
   # for GPU build on Windows
   export USE_MSVC_WRAPPER=1
 
-  echo "" | ./configure
+  yes "" | ./configure
 }
 
 function set_gcs_remote_cache_options {
diff --git a/tensorflow/tools/ci_build/windows/bazel/common_env.sh b/tensorflow/tools/ci_build/windows/bazel/common_env.sh
index 0e6c0227b7ffb6b35193e133aa7d3fbcd16ce3c4..eefa8ee2d504945991c91e1574b6a74330ba3a8d 100644
--- a/tensorflow/tools/ci_build/windows/bazel/common_env.sh
+++ b/tensorflow/tools/ci_build/windows/bazel/common_env.sh
@@ -49,8 +49,3 @@ export PATH="/c/Program Files/Git/cmd:$PATH"
 
 # Make sure we have pip in PATH
 export PATH="/c/${PYTHON_BASE_PATH}/Scripts:$PATH"
-
-# Add Cuda and Cudnn dll directories into PATH
-export PATH="/c/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v9.0/bin:$PATH"
-export PATH="/c/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v9.0/extras/CUPTI/libx64:$PATH"
-export PATH="/c/tools/cuda/bin:$PATH"
diff --git a/tensorflow/tools/ci_build/windows/cpu/pip/build_tf_windows.sh b/tensorflow/tools/ci_build/windows/cpu/pip/build_tf_windows.sh
index a2300811bb93b9e9d96b9db314943ab08870fcb3..0b13b97209fa6cd6c629a64fdd54a0423535a9a3 100644
--- a/tensorflow/tools/ci_build/windows/cpu/pip/build_tf_windows.sh
+++ b/tensorflow/tools/ci_build/windows/cpu/pip/build_tf_windows.sh
@@ -54,24 +54,34 @@ function cleanup {
 trap cleanup EXIT
 
 skip_test=0
+release_build=0
 
 for ARG in "$@"; do
   if [[ "$ARG" == --skip_test ]]; then
     skip_test=1
   elif [[ "$ARG" == --enable_gcs_remote_cache ]]; then
     set_gcs_remote_cache_options
+  elif [[ "$ARG" == --release_build ]]; then
+    release_build=1
   fi
 done
 
-# --define=override_eigen_strong_inline=true speeds up the compiling of conv_grad_ops_3d.cc and conv_ops_3d.cc
-# by 20 minutes. See https://github.com/tensorflow/tensorflow/issues/10521
-echo "build --define=override_eigen_strong_inline=true" >> "${TMP_BAZELRC}"
+if [[ "$release_build" != 1 ]]; then
+  # --define=override_eigen_strong_inline=true speeds up the compiling of conv_grad_ops_3d.cc and conv_ops_3d.cc
+  # by 20 minutes. See https://github.com/tensorflow/tensorflow/issues/10521
+  # Because this hurts the performance of TF, we don't enable it in release build.
+  echo "build --define=override_eigen_strong_inline=true" >> "${TMP_BAZELRC}"
+fi
+
+# The host and target platforms are the same in Windows build. So we don't have
+# to distinct them. This helps avoid building the same targets twice.
+echo "build --distinct_host_configuration=false" >> "${TMP_BAZELRC}"
 
 echo "import %workspace%/${TMP_BAZELRC}" >> .bazelrc
 
 run_configure_for_cpu_build
 
-bazel build --announce_rc -c opt tensorflow/tools/pip_package:build_pip_package || exit $?
+bazel build --announce_rc --config=opt tensorflow/tools/pip_package:build_pip_package || exit $?
 
 if [[ "$skip_test" == 1 ]]; then
   exit 0
@@ -92,7 +102,7 @@ N_JOBS="${NUMBER_OF_PROCESSORS}"
 
 # Define no_tensorflow_py_deps=true so that every py_test has no deps anymore,
 # which will result testing system installed tensorflow
-bazel test -c opt -k --test_output=errors \
+bazel test --announce_rc --config=opt -k --test_output=errors \
   --define=no_tensorflow_py_deps=true --test_lang_filters=py \
   --test_tag_filters=-no_pip,-no_windows,-no_oss \
   --build_tag_filters=-no_pip,-no_windows,-no_oss --build_tests_only \
diff --git a/tensorflow/tools/ci_build/windows/gpu/cmake/run_build.bat b/tensorflow/tools/ci_build/windows/gpu/cmake/run_build.bat
index 4656afe0256d03540fed6912677c8e93f9cf9eb6..cec5b717f8ad07c0090ee424f3ae47e60df34a5a 100644
--- a/tensorflow/tools/ci_build/windows/gpu/cmake/run_build.bat
+++ b/tensorflow/tools/ci_build/windows/gpu/cmake/run_build.bat
@@ -30,7 +30,6 @@ IF DEFINED SWIG_EXE (ECHO SWIG_EXE is set to %SWIG_EXE%) ELSE (SET SWIG_EXE="C:\
 IF DEFINED PY_EXE (ECHO PY_EXE is set to %PY_EXE%) ELSE (SET PY_EXE="C:\Program Files\Anaconda3\python.exe")
 IF DEFINED PY_LIB (ECHO PY_LIB is set to %PY_LIB%) ELSE (SET PY_LIB="C:\Program Files\Anaconda3\libs\python35.lib")
 IF DEFINED CUDNN_HOME (ECHO CUDNN_HOME is set to %CUDNN_HOME%) ELSE (SET CUDNN_HOME="c:\tools\cuda")
-verbosity:quiet
 IF DEFINED DISABLE_FORCEINLINE (ECHO DISABLE_FORCEINLINE is set to %DISABLE_FORCEINLINE%) ELSE (SET DISABLE_FORCEINLINE="OFF")
 
 SET CMAKE_DIR=%REPO_ROOT%\tensorflow\contrib\cmake
diff --git a/tensorflow/tools/ci_build/xla/linux/gpu/run_py3.sh b/tensorflow/tools/ci_build/xla/linux/gpu/run_py3.sh
index a410c10b61b9f3f2cf8fc00074237a2bcfcbbf78..d085e21b0305d1c4266db0adb97f586286a12735 100755
--- a/tensorflow/tools/ci_build/xla/linux/gpu/run_py3.sh
+++ b/tensorflow/tools/ci_build/xla/linux/gpu/run_py3.sh
@@ -37,6 +37,7 @@ bazel clean
 # Run bazel test command. Double test timeouts to avoid flakes.
 bazel test --config=cuda --test_tag_filters=-no_gpu,-benchmark-test,-no_oss -k \
     --jobs=${N_JOBS} --test_timeout 300,450,1200,3600 \
+    --test_size_filters=small,medium \
     --build_tests_only --test_output=errors --local_test_jobs=8 \
     --run_under=//tensorflow/tools/ci_build/gpu_build:parallel_gpu_execute \
     --config=xla -- \
diff --git a/tensorflow/tools/dist_test/build_server.sh b/tensorflow/tools/dist_test/build_server.sh
index 225c0347416ec8c8fef855946d18e838bd767690..345217d733acec62c599dd6dfeffd4839e5a79bc 100755
--- a/tensorflow/tools/dist_test/build_server.sh
+++ b/tensorflow/tools/dist_test/build_server.sh
@@ -23,7 +23,7 @@
 #     E.g.: tensorflow/tf_grpc_test_server:0.11.0rc1
 #
 #   whl_file_location: URL from which the TensorFlow whl file will be downloaded.
-#     E.g.: https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=cpu-slave/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow-0.11.0rc1-cp27-none-linux_x86_64.whl
+#     E.g.: https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.5.0-cp27-none-linux_x86_64.whl
 #     E.g.: /path/to/folder/tensorflow-0.11.0rc1-cp27-none-linux_x86_64.whl
 #
 # The optional flag --test lets the script to use the Dockerfile for the
diff --git a/tensorflow/tools/dist_test/local_test.sh b/tensorflow/tools/dist_test/local_test.sh
index 99e09502be3439888d478b256b47a65c54a07a3e..b0114721bd2435dd2d4b8ee667250d3b824f1207 100755
--- a/tensorflow/tools/dist_test/local_test.sh
+++ b/tensorflow/tools/dist_test/local_test.sh
@@ -35,7 +35,7 @@
 #
 # Arguments:
 # whl_file_location: URL from which the TensorFlow whl file will be acquired.
-#   E.g.: https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=cpu-slave/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow-0.11.0rc1-cp27-none-linux_x86_64.whl
+#   E.g.: https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.5.0-cp27-none-linux_x86_64.whl
 #   E.g.: /path/to/folder/tensorflow-0.11.0rc1-cp27-none-linux_x86_64.whl
 #
 # --leave_container_running:  Do not stop the docker-in-docker container after
diff --git a/tensorflow/tools/docker/Dockerfile.devel b/tensorflow/tools/docker/Dockerfile.devel
index 406d134699ff182dde219c137f79a27094b09169..57a491255ea968b08e6e9cbaf9dd0178e8d2c3bf 100644
--- a/tensorflow/tools/docker/Dockerfile.devel
+++ b/tensorflow/tools/docker/Dockerfile.devel
@@ -76,7 +76,7 @@ RUN mkdir /bazel && \
 
 # Download and build TensorFlow.
 WORKDIR /tensorflow
-RUN git clone --branch=r1.8 --depth=1 https://github.com/tensorflow/tensorflow.git .
+RUN git clone --branch=r1.9 --depth=1 https://github.com/tensorflow/tensorflow.git .
 
 # TODO(craigcitro): Don't install the pip package, since it makes it
 # more difficult to experiment with local changes. Instead, just add
diff --git a/tensorflow/tools/docker/Dockerfile.devel-cpu-mkl b/tensorflow/tools/docker/Dockerfile.devel-cpu-mkl
index a6cd44ced1d546846f274ef79aad75bcf950fd03..6796ad70e5d22ca683343680b142081d8d58a9e4 100644
--- a/tensorflow/tools/docker/Dockerfile.devel-cpu-mkl
+++ b/tensorflow/tools/docker/Dockerfile.devel-cpu-mkl
@@ -3,7 +3,7 @@ FROM tensorflow/tensorflow:latest-devel
 LABEL maintainer="Clayne Robison<clayne.b.robison@intel.com>"
 
 # These arguments are parameterized. Use --build-args to override.
-ARG TF_BRANCH=r1.8
+ARG TF_BRANCH=r1.9
 ARG WHL_DIR=/whl
 
 RUN apt-get update && apt-get install -y --no-install-recommends \
diff --git a/tensorflow/tools/docker/Dockerfile.devel-gpu b/tensorflow/tools/docker/Dockerfile.devel-gpu
index 2fe47f3356ce26da4174b95d59dce1889d3ec90c..204b5b4dba1b607fb709b7f45d145ceafc33f3e7 100644
--- a/tensorflow/tools/docker/Dockerfile.devel-gpu
+++ b/tensorflow/tools/docker/Dockerfile.devel-gpu
@@ -13,8 +13,8 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
         cuda-cusparse-dev-9-0 \
         curl \
         git \
-        libcudnn7=7.0.5.15-1+cuda9.0 \
-        libcudnn7-dev=7.0.5.15-1+cuda9.0 \
+        libcudnn7=7.1.4.18-1+cuda9.0 \
+        libcudnn7-dev=7.1.4.18-1+cuda9.0 \
         libcurl3-dev \
         libfreetype6-dev \
         libhdf5-serial-dev \
@@ -85,7 +85,7 @@ RUN mkdir /bazel && \
 
 # Download and build TensorFlow.
 WORKDIR /tensorflow
-RUN git clone --branch=r1.8 --depth=1 https://github.com/tensorflow/tensorflow.git .
+RUN git clone --branch=r1.9 --depth=1 https://github.com/tensorflow/tensorflow.git .
 
 # Configure the build for our CUDA configuration.
 ENV CI_BUILD_PYTHON python
diff --git a/tensorflow/tools/docker/Dockerfile.gpu b/tensorflow/tools/docker/Dockerfile.gpu
index bff4a20392076994c75705b73c25dcb740ba1f09..9197651ff4326e9b40264183a94b82e936746010 100644
--- a/tensorflow/tools/docker/Dockerfile.gpu
+++ b/tensorflow/tools/docker/Dockerfile.gpu
@@ -12,7 +12,7 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
         cuda-cusolver-9-0 \
         cuda-cusparse-9-0 \
         curl \
-        libcudnn7=7.0.5.15-1+cuda9.0 \
+        libcudnn7=7.1.4.18-1+cuda9.0 \
         libfreetype6-dev \
         libhdf5-serial-dev \
         libpng12-dev \
diff --git a/tensorflow/tools/docs/generate_lib.py b/tensorflow/tools/docs/generate_lib.py
index 111d54d8205f805cc24d21c610acc81610b8d47d..853ec6194f8327f13b3eb6ac7792511c9c4494cd 100644
--- a/tensorflow/tools/docs/generate_lib.py
+++ b/tensorflow/tools/docs/generate_lib.py
@@ -50,7 +50,11 @@ def _is_free_function(py_object, full_name, index):
   return True
 
 
-def write_docs(output_dir, parser_config, yaml_toc, root_title='TensorFlow'):
+def write_docs(output_dir,
+               parser_config,
+               yaml_toc,
+               root_title='TensorFlow',
+               search_hints=True):
   """Write previously extracted docs to disk.
 
   Write a docs page for each symbol included in the indices of parser_config to
@@ -66,6 +70,8 @@ def write_docs(output_dir, parser_config, yaml_toc, root_title='TensorFlow'):
       indices.
     yaml_toc: Set to `True` to generate a "_toc.yaml" file.
     root_title: The title name for the root level index.md.
+    search_hints: (bool) include meta-data search hints at the top of each
+      output file.
 
   Raises:
     ValueError: if `output_dir` is not an absolute path
@@ -134,7 +140,13 @@ def write_docs(output_dir, parser_config, yaml_toc, root_title='TensorFlow'):
       if not os.path.exists(directory):
         os.makedirs(directory)
       # This function returns raw bytes in PY2 or unicode in PY3.
-      text = pretty_docs.build_md_page(page_info)
+      if search_hints:
+        content = [page_info.get_metadata_html()]
+      else:
+        content = ['']
+
+      content.append(pretty_docs.build_md_page(page_info))
+      text = '\n'.join(content)
       if six.PY3:
         text = text.encode('utf-8')
       with open(path, 'wb') as f:
@@ -467,6 +479,12 @@ class DocGenerator(object):
     self._do_not_descend_map = _get_default_do_not_descend_map()
     self.yaml_toc = True
 
+    self.argument_parser.add_argument(
+        '--no_search_hints',
+        dest='search_hints',
+        action='store_false',
+        default=True)
+
   def add_output_dir_argument(self):
     self.argument_parser.add_argument(
         '--output_dir',
@@ -553,7 +571,8 @@ class DocGenerator(object):
         output_dir,
         parser_config,
         yaml_toc=self.yaml_toc,
-        root_title=root_title)
+        root_title=root_title,
+        search_hints=getattr(flags, 'search_hints', True))
     _other_docs(flags.src_dir, flags.output_dir, reference_resolver)
 
     parser_config.reference_resolver.log_errors()
diff --git a/tensorflow/tools/docs/parser.py b/tensorflow/tools/docs/parser.py
index fb0bd2c2ff438aa9b3fa04719c447a2f3a91a95e..50c90527413d0904c78dab199a68678f6cc91845 100644
--- a/tensorflow/tools/docs/parser.py
+++ b/tensorflow/tools/docs/parser.py
@@ -21,6 +21,7 @@ from __future__ import print_function
 import ast
 import collections
 import functools
+import itertools
 import json
 import os
 import re
@@ -614,6 +615,9 @@ def _parse_md_docstring(py_object, relative_path_to_root, reference_resolver):
   docstring, compatibility = _handle_compatibility(raw_docstring)
   docstring, function_details = _parse_function_details(docstring)
 
+  if 'Generated by: tensorflow/tools/api/generator' in docstring:
+    docstring = ''
+
   return _DocstringInfo(
       docstring.split('\n')[0], docstring, function_details, compatibility)
 
@@ -906,6 +910,9 @@ class _FunctionPageInfo(object):
   def add_decorator(self, dec):
     self._decorators.append(dec)
 
+  def get_metadata_html(self):
+    return _Metadata(self.full_name).build_html()
+
 
 class _ClassPageInfo(object):
   """Collects docs for a class page.
@@ -1099,6 +1106,14 @@ class _ClassPageInfo(object):
     """Returns a list of `_LinkInfo` pointing to any nested classes."""
     return self._classes
 
+  def get_metadata_html(self):
+    meta_data = _Metadata(self.full_name)
+    for item in itertools.chain(self.classes, self.properties, self.methods,
+                                self.other_members):
+      meta_data.append(item)
+
+    return meta_data.build_html()
+
   def _add_class(self, short_name, full_name, obj, doc, url):
     """Adds a `_LinkInfo` for a nested class to `classes` list.
 
@@ -1330,6 +1345,16 @@ class _ModulePageInfo(object):
     self._other_members.append(
         _OtherMemberInfo(short_name, full_name, obj, doc))
 
+  def get_metadata_html(self):
+    meta_data = _Metadata(self.full_name)
+
+    # Objects with their own pages are not added to the matadata list for the
+    # module, the module only has a link to the object page. No docs.
+    for item in self.other_members:
+      meta_data.append(item)
+
+    return meta_data.build_html()
+
   def collect_docs_for_module(self, parser_config):
     """Collect information necessary specifically for a module's doc page.
 
@@ -1575,7 +1600,8 @@ class _GeneratedFile(object):
     return True
 
   def __str__(self):
-    return 'Defined in `%s%s`.\n\n' % (self.path_prefix, self.path)
+    return 'Defined in generated file: `%s%s`.\n\n' % (self.path_prefix,
+                                                       self.path)
 
 
 def _get_defined_in(py_object, parser_config):
@@ -1612,6 +1638,8 @@ def _get_defined_in(py_object, parser_config):
 
   if re.match(r'.*/gen_[^/]*\.py$', path):
     return _GeneratedFile(path, parser_config)
+  if 'genfiles' in path or 'tools/api/generator' in path:
+    return _GeneratedFile(path, parser_config)
   elif re.match(r'.*_pb2\.py$', path):
     # The _pb2.py files all appear right next to their defining .proto file.
     return _ProtoFile(path[:-7] + '.proto', parser_config)
@@ -1656,3 +1684,41 @@ def generate_global_index(library_name, index, reference_resolver):
 
   # TODO(markdaoust): use a _ModulePageInfo -> prety_docs.build_md_page()
   return '\n'.join(lines)
+
+
+class _Metadata(object):
+  """A class for building a page's Metadata block.
+
+  Attributes:
+    name: The name of the page being described by the Metadata block.
+  """
+
+  def __init__(self, name):
+    """Creates a Metadata builder.
+
+    Args:
+      name: The name of the page being described by the Metadata block.
+    """
+    self.name = name
+    self._content = []
+
+  def append(self, item):
+    """Adds an item from the page to the Metadata block.
+
+    Args:
+      item: The parsed page section to add.
+    """
+    self._content.append(item.short_name)
+
+  def build_html(self):
+    """Returns the Metadata block as an Html string."""
+    schema = 'http://developers.google.com/ReferenceObject'
+    parts = ['<div itemscope itemtype="%s">' % schema]
+
+    parts.append('<meta itemprop="name" content="%s" />' % self.name)
+    for item in self._content:
+      parts.append('<meta itemprop="property" content="%s"/>' % item)
+
+    parts.extend(['</div>', ''])
+
+    return '\n'.join(parts)
diff --git a/tensorflow/tools/docs/pretty_docs.py b/tensorflow/tools/docs/pretty_docs.py
index 55ab5bdd49a427e680221f4864b3f31a65b12e8d..63d4fef91cc752b8fa053b92c833349ca3bc8f19 100644
--- a/tensorflow/tools/docs/pretty_docs.py
+++ b/tensorflow/tools/docs/pretty_docs.py
@@ -27,7 +27,6 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-import itertools
 import textwrap
 
 
@@ -58,8 +57,7 @@ def build_md_page(page_info):
 
 def _build_function_page(page_info):
   """Given a FunctionPageInfo object Return the page as an md string."""
-  parts = [_Metadata(page_info.full_name).build_html()]
-  parts.append('# %s\n\n' % page_info.full_name)
+  parts = ['# %s\n\n' % page_info.full_name]
 
   if len(page_info.aliases) > 1:
     parts.append('### Aliases:\n\n')
@@ -83,17 +81,7 @@ def _build_function_page(page_info):
 
 def _build_class_page(page_info):
   """Given a ClassPageInfo object Return the page as an md string."""
-  meta_data = _Metadata(page_info.full_name)
-  for item in itertools.chain(
-      page_info.classes,
-      page_info.properties,
-      page_info.methods,
-      page_info.other_members):
-    meta_data.append(item)
-
-  parts = [meta_data.build_html()]
-
-  parts.append('# {page_info.full_name}\n\n'.format(page_info=page_info))
+  parts = ['# {page_info.full_name}\n\n'.format(page_info=page_info)]
 
   parts.append('## Class `%s`\n\n' % page_info.full_name.split('.')[-1])
   if page_info.bases:
@@ -186,17 +174,7 @@ def _build_class_page(page_info):
 
 def _build_module_page(page_info):
   """Given a ClassPageInfo object Return the page as an md string."""
-  meta_data = _Metadata(page_info.full_name)
-
-  # Objects with their own pages are not added to the matadata list for the
-  # module, as the only thing on the module page is a link to the object's page.
-  for item in page_info.other_members:
-    meta_data.append(item)
-
-  parts = [meta_data.build_html()]
-
-  parts.append(
-      '# Module: {full_name}\n\n'.format(full_name=page_info.full_name))
+  parts = ['# Module: {full_name}\n\n'.format(full_name=page_info.full_name)]
 
   if len(page_info.aliases) > 1:
     parts.append('### Aliases:\n\n')
@@ -317,41 +295,3 @@ def _build_function_details(function_details):
     parts.append(''.join(sub))
 
   return '\n'.join(parts)
-
-
-class _Metadata(object):
-  """A class for building a page's Metadata block.
-
-  Attributes:
-    name: The name of the page being described by the Metadata block.
-  """
-
-  def __init__(self, name):
-    """Create a Metadata builder.
-
-    Args:
-      name: The name of the page being described by the Metadata block.
-    """
-    self.name = name
-    self._content = []
-
-  def append(self, item):
-    """Add an item from the page to the Metadata block.
-
-    Args:
-      item: The parsed page section to add.
-    """
-    self._content.append(item.short_name)
-
-  def build_html(self):
-    """Return the Metadata block as an Html string."""
-    schema = 'http://developers.google.com/ReferenceObject'
-    parts = ['<div itemscope itemtype="%s">' % schema]
-
-    parts.append('<meta itemprop="name" content="%s" />' % self.name)
-    for item in self._content:
-      parts.append('<meta itemprop="property" content="%s"/>' % item)
-
-    parts.extend(['</div>', '', ''])
-
-    return '\n'.join(parts)
diff --git a/tensorflow/tools/git/gen_git_source.py b/tensorflow/tools/git/gen_git_source.py
index 73dee98bae8946b747e1b28bd14b0a26edc62736..cc2288a7fa9202efcd077e54b941cc278b25993c 100755
--- a/tensorflow/tools/git/gen_git_source.py
+++ b/tensorflow/tools/git/gen_git_source.py
@@ -164,14 +164,17 @@ def get_git_version(git_base_path, git_tag_override):
         "git", str("--git-dir=%s/.git" % git_base_path),
         str("--work-tree=" + git_base_path), "describe", "--long", "--tags"
     ]).strip())
-    if git_tag_override:
+    if git_tag_override and val:
       split_val = val.split("-")
-      if len(split_val) != 3:
+      if len(split_val) < 3:
         raise Exception(
             ("Expected git version in format 'TAG-COMMITS AFTER TAG-HASH' "
              "but got '%s'") % val)
-      split_val[0] = git_tag_override
-      val = bytes("-".join(split_val))
+      # There might be "-" in the tag name. But we can be sure that the final
+      # two "-" are those inserted by the git describe command.
+      abbrev_commit = split_val[-1]
+      val = bytes(
+          "-".join([git_tag_override, "0", abbrev_commit]))
     return val if val else unknown_label
   except (subprocess.CalledProcessError, OSError):
     return unknown_label
diff --git a/tensorflow/tools/graph_transforms/fold_constants_lib.cc b/tensorflow/tools/graph_transforms/fold_constants_lib.cc
index 85660f94a85dce29360525f7bb7474494b3f010f..f85841187670fef0fdc9237886237f84057d6bd5 100644
--- a/tensorflow/tools/graph_transforms/fold_constants_lib.cc
+++ b/tensorflow/tools/graph_transforms/fold_constants_lib.cc
@@ -117,6 +117,31 @@ Status ReplaceSendRecvs(const GraphDef& original_graph_def,
   return Status::OK();
 }
 
+Status RewriteInputsAsPlaceholders(const TransformFuncContext& context,
+                                   GraphDef* graph_def) {
+  std::unordered_set<string> input_names;
+  for (const string& input_name : context.input_names) {
+    input_names.insert(ParseTensorName(input_name).first.ToString());
+  }
+
+  for (NodeDef& node : *graph_def->mutable_node()) {
+    if (input_names.find(node.name()) == input_names.end()) {
+      continue;
+    }
+    if (node.op() == "PlaceholderWithDefault") {
+      node.set_op("Placeholder");
+      node.clear_input();
+    } else if (node.op() != "Placeholder") {
+      return errors::InvalidArgument(
+          "Input '", node.name(),
+          "' was expected to be a Placeholder or PlaceholderWithDefault op, "
+          "but was ",
+          node.op());
+    }
+  }
+  return Status::OK();
+}
+
 Status RemoveUnusedNodes(const GraphDef& input_graph_def,
                          const TransformFuncContext& context,
                          GraphDef* output_graph_def) {
@@ -165,6 +190,7 @@ Status RemoveUnusedNodes(const GraphDef& input_graph_def,
       input_graph_def,
       [&](const NodeDef& node) { return used_nodes.count(node.name()) > 0; },
       output_graph_def);
+  TF_RETURN_IF_ERROR(RewriteInputsAsPlaceholders(context, output_graph_def));
 
   return Status::OK();
 }
diff --git a/tensorflow/tools/graph_transforms/fold_constants_test.cc b/tensorflow/tools/graph_transforms/fold_constants_test.cc
index a082399a87dbaad913be421fe273ba89b6f7340e..dcdc3c29069c212c499aa21e420b47f239ce62f2 100644
--- a/tensorflow/tools/graph_transforms/fold_constants_test.cc
+++ b/tensorflow/tools/graph_transforms/fold_constants_test.cc
@@ -330,48 +330,6 @@ class ConstantFoldingTest : public ::testing::Test {
     EXPECT_EQ(0, node_map.count("unused"));
   }
 
-  void TestRemoveUnusedNodesMultipleOutputs() {
-    using namespace ::tensorflow::ops;  // NOLINT(build/namespaces)
-    auto root = tensorflow::Scope::NewRootScope();
-
-    //    a    b
-    //     \  /
-    //    shape_n
-    //     \  /
-    //       c
-    auto a = Placeholder(root.WithOpName("a"), DT_FLOAT);
-    auto b = Placeholder(root.WithOpName("b"), DT_FLOAT);
-    auto shape_n = ShapeN(root.WithOpName("shape_n"), {Output(a), Output(b)});
-    auto c = Add(root.WithOpName("c"), shape_n[0], shape_n[1]);
-
-    GraphDef graph_def;
-    TF_ASSERT_OK(root.ToGraphDef(&graph_def));
-    GraphDef result_graph_def;
-    TF_ASSERT_OK(graph_transforms::RemoveUnusedNodes(
-        graph_def, {{shape_n[0].name()}, {"c"}}, &result_graph_def));
-
-    // Only one output of shape_n node is fed input. Hence the graph search
-    // should propagate to inputs of shape_n. Nothing to remove here.
-    std::map<string, const NodeDef*> node_map;
-    graph_transforms::MapNamesToNodes(result_graph_def, &node_map);
-    EXPECT_EQ(1, node_map.count("a"));
-    EXPECT_EQ(1, node_map.count("b"));
-    EXPECT_EQ(1, node_map.count("c"));
-
-    result_graph_def.Clear();
-    TF_ASSERT_OK(graph_transforms::RemoveUnusedNodes(
-        graph_def, {{shape_n[0].name(), shape_n[1].name()}, {"c"}},
-        &result_graph_def));
-
-    // Both outputs of shape_n node are fed inputs. shape_n does not function
-    // and inputs to shape_n should be removed.
-    node_map.clear();
-    graph_transforms::MapNamesToNodes(result_graph_def, &node_map);
-    EXPECT_EQ(0, node_map.count("a"));
-    EXPECT_EQ(0, node_map.count("b"));
-    EXPECT_EQ(1, node_map.count("c"));
-  }
-
   void TestMaxConstantSizeInBytes() {
     auto root = tensorflow::Scope::NewRootScope();
 
@@ -431,10 +389,6 @@ TEST_F(ConstantFoldingTest, TestReplaceSendRecvsPrefixNames) {
 
 TEST_F(ConstantFoldingTest, TestRemoveUnusedNodes) { TestRemoveUnusedNodes(); }
 
-TEST_F(ConstantFoldingTest, TestRemoveUnusedNodesMultipleOutputs) {
-  TestRemoveUnusedNodesMultipleOutputs();
-}
-
 TEST_F(ConstantFoldingTest, TestMaxConstantSizeInBytes) {
   TestMaxConstantSizeInBytes();
 }
diff --git a/tensorflow/tools/graph_transforms/fold_old_batch_norms_test.cc b/tensorflow/tools/graph_transforms/fold_old_batch_norms_test.cc
index 7651a03fe51012678d6d6fc495fd82e497aa512b..435f46c107cd9b0a6d64d4c0d52607ec5f41eb4f 100644
--- a/tensorflow/tools/graph_transforms/fold_old_batch_norms_test.cc
+++ b/tensorflow/tools/graph_transforms/fold_old_batch_norms_test.cc
@@ -191,7 +191,7 @@ class FoldOldBatchNormsTest : public ::testing::Test {
     std::vector<Tensor> fused_outputs;
     TF_ASSERT_OK(fused_session->Run({}, {"output"}, {}, &fused_outputs));
 
-    test::ExpectTensorNear<float>(original_outputs[0], fused_outputs[0], 1e-5);
+    test::ExpectTensorNear<float>(original_outputs[0], fused_outputs[0], 2e-5);
 
     for (const NodeDef& node : fused_graph_def.node()) {
       EXPECT_NE("FusedBatchNorm", node.op());
diff --git a/tensorflow/tools/pip_package/BUILD b/tensorflow/tools/pip_package/BUILD
index 677ea65edd91df9eef2347ab305f47a05f6cedaa..9d4148c07f37fb511836425e1b6ffceb7c259777 100644
--- a/tensorflow/tools/pip_package/BUILD
+++ b/tensorflow/tools/pip_package/BUILD
@@ -59,6 +59,7 @@ COMMON_PIP_DEPS = [
     "//tensorflow/contrib/autograph/converters:converters",
     "//tensorflow/contrib/autograph/converters:test_lib",
     "//tensorflow/contrib/autograph/impl:impl",
+    "//tensorflow/contrib/autograph/operators:operators",
     "//tensorflow/contrib/autograph/pyct:pyct",
     "//tensorflow/contrib/autograph/pyct/static_analysis:static_analysis",
     "//tensorflow/contrib/boosted_trees:boosted_trees_pip",
@@ -173,9 +174,7 @@ sh_binary(
         "//conditions:default": COMMON_PIP_DEPS + [
             ":simple_console",
             "//tensorflow/contrib/lite/python:interpreter_test_data",
-            "//tensorflow/contrib/lite/python:tf_lite_py_pip",
-            "//tensorflow/contrib/lite/toco:toco",
-            "//tensorflow/contrib/lite/toco/python:toco_wrapper",
+            "//tensorflow/contrib/lite/python:tflite_convert",
             "//tensorflow/contrib/lite/toco/python:toco_from_protos",
         ],
     }) + if_mkl(["//third_party/mkl:intel_binary_blob"]) + if_tensorrt([
diff --git a/tensorflow/tools/pip_package/build_pip_package.sh b/tensorflow/tools/pip_package/build_pip_package.sh
index 1a83c6e7578fed88f0bd7db5a5b620a5281fd95a..f7e42ce5362163a23ae1060c47e3e02869372556 100755
--- a/tensorflow/tools/pip_package/build_pip_package.sh
+++ b/tensorflow/tools/pip_package/build_pip_package.sh
@@ -41,51 +41,15 @@ function is_windows() {
   fi
 }
 
-function main() {
+function prepare_src() {
   if [ $# -lt 1 ] ; then
     echo "No destination dir provided"
     exit 1
   fi
 
-  DEST=$(real_path $1)
-  TMPDIR=$(mktemp -d -t tmp.XXXXXXXXXX)
-
-  PKG_NAME_FLAG=""
-  GPU_BUILD=0
-  NIGHTLY_BUILD=0
-  PROJECT_NAME=""
-  while true; do
-    if [[ "$1" == "--nightly_flag" ]]; then
-      NIGHTLY_BUILD=1
-    elif [[ "$1" == "--gpu" ]]; then
-      GPU_BUILD=1
-    elif [[ "$1" == "--gpudirect" ]]; then
-      PKG_NAME_FLAG="--project_name tensorflow_gpudirect"
-    elif [[ "$1" == "--project_name" ]]; then
-      shift
-      if [[ -z "$1" ]]; then
-        break
-      fi
-      PROJECT_NAME="$1"
-    fi
-    shift
-
-    if [[ -z "$1" ]]; then
-      break
-    fi
-  done
-
-  if [[ -n ${PROJECT_NAME} ]]; then
-    PKG_NAME_FLAG="--project_name ${PROJECT_NAME}"
-  elif [[ ${NIGHTLY_BUILD} == "1" && ${GPU_BUILD} == "1" ]]; then
-    PKG_NAME_FLAG="--project_name tf_nightly_gpu"
-  elif [[ ${NIGHTLY_BUILD} == "1" ]]; then
-    PKG_NAME_FLAG="--project_name tf_nightly"
-  elif [[ ${GPU_BUILD} == "1" ]]; then
-    PKG_NAME_FLAG="--project_name tensorflow_gpu"
-  fi
-
-  echo $(date) : "=== Using tmpdir: ${TMPDIR}"
+  TMPDIR="$1"
+  mkdir -p "$TMPDIR"
+  echo $(date) : "=== Preparing sources in dir: ${TMPDIR}"
 
   if [ ! -d bazel-bin/tensorflow ]; then
     echo "Could not find bazel-bin.  Did you run from the root of the build tree?"
@@ -148,26 +112,35 @@ function main() {
     fi
     mkdir "${TMPDIR}/tensorflow/aux-bin"
     # Install toco as a binary in aux-bin.
-    # TODO(aselle): Re-enable this when we find a way to do it without doubling
-    # the whl size (over the limit).
-    # cp bazel-bin/tensorflow/contrib/lite/toco/toco ${TMPDIR}/tensorflow/aux-bin/
+    cp bazel-bin/tensorflow/contrib/lite/python/tflite_convert ${TMPDIR}/tensorflow/aux-bin/
   fi
 
   # protobuf pip package doesn't ship with header files. Copy the headers
   # over so user defined ops can be compiled.
   mkdir -p ${TMPDIR}/google
   mkdir -p ${TMPDIR}/third_party
-  pushd ${RUNFILES%org_tensorflow}
+  pushd ${RUNFILES%org_tensorflow} > /dev/null
   for header in $(find protobuf_archive -name \*.h); do
     mkdir -p "${TMPDIR}/google/$(dirname ${header})"
     cp "$header" "${TMPDIR}/google/$(dirname ${header})/"
   done
-  popd
+  popd > /dev/null
   cp -R $RUNFILES/third_party/eigen3 ${TMPDIR}/third_party
 
   cp tensorflow/tools/pip_package/MANIFEST.in ${TMPDIR}
   cp tensorflow/tools/pip_package/README ${TMPDIR}
   cp tensorflow/tools/pip_package/setup.py ${TMPDIR}
+}
+
+function build_wheel() {
+  if [ $# -lt 2 ] ; then
+    echo "No src and dest dir provided"
+    exit 1
+  fi
+
+  TMPDIR="$1"
+  DEST="$2"
+  PKG_NAME_FLAG="$3"
 
   # Before we leave the top-level directory, make sure we know how to
   # call python.
@@ -175,15 +148,110 @@ function main() {
     source tools/python_bin_path.sh
   fi
 
-  pushd ${TMPDIR}
+  pushd ${TMPDIR} > /dev/null
   rm -f MANIFEST
   echo $(date) : "=== Building wheel"
   "${PYTHON_BIN_PATH:-python}" setup.py bdist_wheel ${PKG_NAME_FLAG} >/dev/null
   mkdir -p ${DEST}
   cp dist/* ${DEST}
-  popd
-  rm -rf ${TMPDIR}
+  popd > /dev/null
   echo $(date) : "=== Output wheel file is in: ${DEST}"
 }
 
+function usage() {
+  echo "Usage:"
+  echo "$0 [--src srcdir] [--dst dstdir] [options]"
+  echo "$0 dstdir [options]"
+  echo ""
+  echo "    --src                 prepare sources in srcdir"
+  echo "                              will use temporary dir if not specified"
+  echo ""
+  echo "    --dst                 build wheel in dstdir"
+  echo "                              if dstdir is not set do not build, only prepare sources"
+  echo ""
+  echo "  Options:"
+  echo "    --project_name <name> set project name to name"
+  echo "    --gpu                 build tensorflow_gpu"
+  echo "    --gpudirect           build tensorflow_gpudirect"
+  echo "    --nightly_flag        build tensorflow nightly"
+  echo ""
+  exit 1
+}
+
+function main() {
+  PKG_NAME_FLAG=""
+  PROJECT_NAME=""
+  GPU_BUILD=0
+  NIGHTLY_BUILD=0
+  SRCDIR=""
+  DSTDIR=""
+  CLEANSRC=1
+  while true; do
+    if [[ "$1" == "--help" ]]; then
+      usage
+      exit 1
+    elif [[ "$1" == "--nightly_flag" ]]; then
+      NIGHTLY_BUILD=1
+    elif [[ "$1" == "--gpu" ]]; then
+      GPU_BUILD=1
+    elif [[ "$1" == "--gpudirect" ]]; then
+      PKG_NAME_FLAG="--project_name tensorflow_gpudirect"
+    elif [[ "$1" == "--project_name" ]]; then
+      shift
+      if [[ -z "$1" ]]; then
+        break
+      fi
+      PROJECT_NAME="$1"
+    elif [[ "$1" == "--src" ]]; then
+      shift
+      SRCDIR="$(real_path $1)"
+      CLEANSRC=0
+    elif [[ "$1" == "--dst" ]]; then
+      shift
+      DSTDIR="$(real_path $1)"
+    else
+      DSTDIR="$(real_path $1)"
+    fi
+    shift
+
+    if [[ -z "$1" ]]; then
+      break
+    fi
+  done
+
+  if [[ -z "$DSTDIR" ]] && [[ -z "$SRCDIR" ]]; then
+    echo "No destination dir provided"
+    usage
+    exit 1
+  fi
+
+  if [[ -z "$SRCDIR" ]]; then
+    # make temp srcdir if none set
+    SRCDIR="$(mktemp -d -t tmp.XXXXXXXXXX)"
+  fi
+
+  prepare_src "$SRCDIR"
+
+  if [[ -z "$DSTDIR" ]]; then
+      # only want to prepare sources
+      exit
+  fi
+
+  if [[ -n ${PROJECT_NAME} ]]; then
+    PKG_NAME_FLAG="--project_name ${PROJECT_NAME}"
+  elif [[ ${NIGHTLY_BUILD} == "1" && ${GPU_BUILD} == "1" ]]; then
+    PKG_NAME_FLAG="--project_name tf_nightly_gpu"
+  elif [[ ${NIGHTLY_BUILD} == "1" ]]; then
+    PKG_NAME_FLAG="--project_name tf_nightly"
+  elif [[ ${GPU_BUILD} == "1" ]]; then
+    PKG_NAME_FLAG="--project_name tensorflow_gpu"
+  fi
+
+  build_wheel "$SRCDIR" "$DSTDIR" "$PKG_NAME_FLAG"
+
+  if [[ $CLEANSRC -ne 0 ]]; then
+    rm -rf "${TMPDIR}"
+  fi
+}
+
 main "$@"
diff --git a/tensorflow/tools/pip_package/setup.py b/tensorflow/tools/pip_package/setup.py
index 319878e1b5ae9ff9d72132e2421062f6ca26197a..97f625e7e9cf5bc5064596cc76737f5cb7a591f2 100644
--- a/tensorflow/tools/pip_package/setup.py
+++ b/tensorflow/tools/pip_package/setup.py
@@ -12,6 +12,18 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
+"""TensorFlow is an open source machine learning framework for everyone.
+
+TensorFlow is an open source software library for high performance numerical
+computation. Its flexible architecture allows easy deployment of computation
+across a variety of platforms (CPUs, GPUs, TPUs), and from desktops to clusters
+of servers to mobile and edge devices.
+
+Originally developed by researchers and engineers from the Google Brain team
+within Google's AI organization, it comes with strong support for machine
+learning and deep learning and the flexible numerical computation core is used
+across many other scientific domains.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -28,25 +40,12 @@ from setuptools import setup
 from setuptools.command.install import install as InstallCommandBase
 from setuptools.dist import Distribution
 
+DOCLINES = __doc__.split('\n')
+
 # This version string is semver compatible, but incompatible with pip.
 # For pip, we will remove all '-' characters from this string, and use the
 # result for pip.
-_VERSION = '1.8.0'
-
-_SHORT_DESCRIPTION = ('TensorFlow is an open source machine learning framework '
-                      'for everyone.')
-
-_LONG_DESCRIPTION = ('TensorFlow is an open source software library for high '
-                     'performance numerical computation. Its flexible '
-                     'architecture allows easy deployment of computation across'
-                     ' a variety of platforms (CPUs, GPUs, TPUs), and from '
-                     'desktops to clusters of servers to mobile and edge '
-                     'devices. Originally developed by researchers and '
-                     'engineers from the Google Brain team within Google\'s AI '
-                     'organization, it comes with strong support for machine '
-                     'learning and deep learning and the flexible numerical '
-                     'computation core is used across many other scientific '
-                     'domains.')
+_VERSION = '1.9.0-rc0'
 
 REQUIRED_PACKAGES = [
     'absl-py >= 0.1.6',
@@ -55,6 +54,7 @@ REQUIRED_PACKAGES = [
     'numpy >= 1.13.3',
     'six >= 1.10.0',
     'protobuf >= 3.4.0',
+    'setuptools <= 39.1.0',
     'tensorboard >= 1.8.0, < 1.9.0',
     'termcolor >= 1.1.0',
 ]
@@ -96,7 +96,8 @@ if sys.version_info < (3, 4):
 CONSOLE_SCRIPTS = [
     'freeze_graph = tensorflow.python.tools.freeze_graph:run_main',
     'toco_from_protos = tensorflow.contrib.lite.toco.python.toco_from_protos:main',
-    'toco = tensorflow.contrib.lite.toco.python.toco_wrapper:main',
+    'tflite_convert = tensorflow.contrib.lite.python.tflite_convert:main',
+    'toco = tensorflow.contrib.lite.python.tflite_convert:main',
     'saved_model_cli = tensorflow.python.tools.saved_model_cli:main',
     # We need to keep the TensorBoard command, even though the console script
     # is now declared by the tensorboard pip package. If we remove the
@@ -229,9 +230,10 @@ headers = (list(find_files('*.h', 'tensorflow/core')) +
 setup(
     name=project_name,
     version=_VERSION.replace('-', ''),
-    description=_SHORT_DESCRIPTION,
-    long_description=_LONG_DESCRIPTION,
+    description=DOCLINES[0],
+    long_description='\n'.join(DOCLINES[2:]),
     url='https://www.tensorflow.org/',
+    download_url='https://github.com/tensorflow/tensorflow/tags',
     author='Google Inc.',
     author_email='opensource@google.com',
     # Contained modules and scripts.
@@ -257,7 +259,7 @@ setup(
     },
     # PyPI package information.
     classifiers=[
-        'Development Status :: 4 - Beta',
+        'Development Status :: 5 - Production/Stable',
         'Intended Audience :: Developers',
         'Intended Audience :: Education',
         'Intended Audience :: Science/Research',
diff --git a/tensorflow/tools/proto_text/gen_proto_text_functions_lib.cc b/tensorflow/tools/proto_text/gen_proto_text_functions_lib.cc
index aa56cc676d0dfdf88f449f07de4189eaabfa3112..15d7c702819ddec256b779f41b8745633d4a7769 100644
--- a/tensorflow/tools/proto_text/gen_proto_text_functions_lib.cc
+++ b/tensorflow/tools/proto_text/gen_proto_text_functions_lib.cc
@@ -279,8 +279,13 @@ void Generator::AppendFieldValueAppend(const FieldDescriptor& field,
       if (omit_default) {
         Print("if (", field_expr, " != 0) {").Nest();
       }
-      Print("o->AppendEnumName(\"", field.name(), "\", ",
-            GetQualifiedEnumNameFn(*field.enum_type()), "(", field_expr, "));");
+      Print("const char* enum_name = ",
+            GetQualifiedEnumNameFn(*field.enum_type()), "(", field_expr, ");");
+      Print("if (enum_name[0]) {").Nest();
+      Print("o->AppendEnumName(\"", field.name(), "\", enum_name);");
+      Unnest().Print("} else {").Nest();
+      Print("o->AppendNumeric(\"", field.name(), "\", ", field_expr, ");");
+      Unnest().Print("}");
       if (omit_default) {
         Unnest().Print("}");
       }
@@ -540,18 +545,24 @@ void Generator::AppendParseMessageFunction(const Descriptor& md) {
       for (int enum_i = 0; enum_i < enum_d->value_count(); ++enum_i) {
         const auto* value_d = enum_d->value(enum_i);
         const string& value_name = value_d->name();
-        string condition = StrCat("value == \"", value_name,
-                                  "\" || value == \"", value_d->number(), "\"");
-        if (value_d->number() == 0) {
-          StrAppend(&condition, " || value == \"-0\"");
-        }
+        string condition = StrCat("value == \"", value_name, "\"");
 
         Print(enum_i == 0 ? "" : "} else ", "if (", condition, ") {");
         Nest();
         Print(set_value_prefix, "(", value_prefix, value_name, ");");
         Unnest();
       }
+      Print("} else {");
+      Nest();
+      // Proto3 allows all numeric values.
+      Print("int32 int_value;");
+      Print("if (strings::SafeStringToNumeric(value, &int_value)) {");
+      Nest();
+      Print(set_value_prefix, "(static_cast<", GetQualifiedName(*enum_d),
+            ">(int_value));");
+      Unnest();
       Print("} else {").Nest().Print("return false;").Unnest().Print("}");
+      Unnest().Print("}");
     } else {
       Print(field->cpp_type_name(), " value;");
       switch (field->cpp_type()) {
diff --git a/tensorflow/tools/proto_text/gen_proto_text_functions_lib_test.cc b/tensorflow/tools/proto_text/gen_proto_text_functions_lib_test.cc
index 6f0b4f47de6464aa0f0648f3b0a2fac1e7d3c7cc..e67add72de660b9c8dd566b6db978a8dc489c749 100644
--- a/tensorflow/tools/proto_text/gen_proto_text_functions_lib_test.cc
+++ b/tensorflow/tools/proto_text/gen_proto_text_functions_lib_test.cc
@@ -455,7 +455,10 @@ TEST(CreateProtoDebugStringLibTest, Enums) {
        "repeated_nested_enum: 1"));
 
   EXPECT_PARSE_SUCCESS("", "optional_nested_enum: -0");
-  EXPECT_PARSE_FAILURE("optional_nested_enum: 6");
+  // TODO(amauryfa): restore the line below when protobuf::TextFormat also
+  // supports unknonwn enum values.
+  // EXPECT_PARSE_SUCCESS("optional_nested_enum: 6", "optional_nested_enum: 6");
+  EXPECT_PARSE_FAILURE("optional_nested_enum: 2147483648");  // > INT32_MAX
   EXPECT_PARSE_FAILURE("optional_nested_enum: BARNONE");
   EXPECT_PARSE_FAILURE("optional_nested_enum: 'BAR'");
   EXPECT_PARSE_FAILURE("optional_nested_enum: \"BAR\" ");
diff --git a/tensorflow/tools/quantization/quantize_graph_test.py b/tensorflow/tools/quantization/quantize_graph_test.py
index df71840b64db3a1a451ec74b12d039a412976666..92bb5127dacf316c62cd64b3874b283309deffd5 100644
--- a/tensorflow/tools/quantization/quantize_graph_test.py
+++ b/tensorflow/tools/quantization/quantize_graph_test.py
@@ -119,8 +119,8 @@ def are_tensors_near(a, b, tolerance):
   flat_a = a.flatten()
   flat_b = b.flatten()
   if len(flat_a) != len(flat_b):
-    print("Tensors are different sizes: " + str(len(flat_a)) + " vs " + str(
-        len(flat_b)))
+    tf_logging.info("Tensors are different sizes: " + str(len(flat_a)) + " vs "
+                    + str(len(flat_b)))
     return False
   value_count = len(flat_a)
   how_many_different = 0
@@ -140,10 +140,10 @@ def are_tensors_near(a, b, tolerance):
   if how_many_different == 0:
     return True
   else:
-    print("Tensors have {0} different values ({1}%), with mean difference"
-          " {2} and mean absolute difference {3}".format(
-              how_many_different, proportion_different * 100, mean_difference,
-              mean_abs_difference))
+    tf_logging.info("Tensors have {0} different values ({1}%), with mean"
+                    " difference {2} and mean absolute difference {3}".format(
+                        how_many_different, proportion_different * 100,
+                        mean_difference, mean_abs_difference))
     return False
 
 
diff --git a/tensorflow/user_ops/BUILD b/tensorflow/user_ops/BUILD
deleted file mode 100644
index 71443cc41eb5ecdd23e1a47712633c77fcd7d395..0000000000000000000000000000000000000000
--- a/tensorflow/user_ops/BUILD
+++ /dev/null
@@ -1,52 +0,0 @@
-# Description:
-# An example for custom op and kernel defined as a TensorFlow plugin.
-
-package(
-    default_visibility = ["//tensorflow:internal"],
-)
-
-licenses(["notice"])  # Apache 2.0
-
-exports_files(["LICENSE"])
-
-load("//tensorflow:tensorflow.bzl", "tf_py_test")
-load("//tensorflow:tensorflow.bzl", "tf_custom_op_library")
-
-tf_custom_op_library(
-    name = "ackermann_op.so",
-    srcs = ["ackermann_op.cc"],
-)
-
-tf_py_test(
-    name = "ackermann_test",
-    size = "small",
-    srcs = ["ackermann_test.py"],
-    additional_deps = ["//tensorflow:tensorflow_py"],
-    data = [":ackermann_op.so"],
-)
-
-tf_custom_op_library(
-    name = "duplicate_op.so",
-    srcs = ["duplicate_op.cc"],
-)
-
-tf_py_test(
-    name = "duplicate_op_test",
-    size = "small",
-    srcs = ["duplicate_op_test.py"],
-    additional_deps = ["//tensorflow:tensorflow_py"],
-    data = [":duplicate_op.so"],
-)
-
-tf_custom_op_library(
-    name = "invalid_op.so",
-    srcs = ["invalid_op.cc"],
-)
-
-tf_py_test(
-    name = "invalid_op_test",
-    size = "small",
-    srcs = ["invalid_op_test.py"],
-    additional_deps = ["//tensorflow:tensorflow_py"],
-    data = [":invalid_op.so"],
-)
diff --git a/tensorflow/workspace.bzl b/tensorflow/workspace.bzl
index 07c8f5e41622a034682724cbc6fc04130ffc5ad2..ef5cb60ceed9d2b67dd96c6c1f36e2e6ca927a9c 100644
--- a/tensorflow/workspace.bzl
+++ b/tensorflow/workspace.bzl
@@ -107,13 +107,12 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
   tf_http_archive(
       name = "eigen_archive",
       urls = [
-          "https://mirror.bazel.build/bitbucket.org/eigen/eigen/get/6913f0cf7d06.tar.gz",
-          "https://bitbucket.org/eigen/eigen/get/6913f0cf7d06.tar.gz",
+          "https://mirror.bazel.build/bitbucket.org/eigen/eigen/get/267806ed9b4f.tar.gz",
+          "https://bitbucket.org/eigen/eigen/get/267806ed9b4f.tar.gz",
       ],
-      sha256 = "791b836cacd03e20bae5bdd25f1c4a5505a0a9975ba94a61eb4e2631fbd1d53a",
-      strip_prefix = "eigen-eigen-6913f0cf7d06",
+      sha256 = "ade57357093463cab9e4e51cd5749c81483a75451b1471a3ebc73f9c1d14043b",
+      strip_prefix = "eigen-eigen-267806ed9b4f",
       build_file = clean_dep("//third_party:eigen.BUILD"),
-      patch_file = clean_dep("//third_party:eigen_fix_cuda_compilation.patch")
   )
 
   tf_http_archive(
@@ -452,11 +451,11 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
   tf_http_archive(
       name = "llvm",
       urls = [
-          "https://mirror.bazel.build/github.com/llvm-mirror/llvm/archive/638915a37f90f26599941977846408864f70ab35.tar.gz",
-          "https://github.com/llvm-mirror/llvm/archive/638915a37f90f26599941977846408864f70ab35.tar.gz",
+          "https://mirror.bazel.build/github.com/llvm-mirror/llvm/archive/582e5dd5553e3089fef97f9ab5a3f063e0160fa9.tar.gz",
+          "https://github.com/llvm-mirror/llvm/archive/582e5dd5553e3089fef97f9ab5a3f063e0160fa9.tar.gz",
       ],
-      sha256 = "aae3cacefa318cef030b4ca1e81ee9906752bbd89013cf9d47e156b5ad04b3a5",
-      strip_prefix = "llvm-638915a37f90f26599941977846408864f70ab35",
+      sha256 = "9a0e63469ae5a546e0c84b778955f0febabfc8497d312324546ec7d0db68430e",
+      strip_prefix = "llvm-582e5dd5553e3089fef97f9ab5a3f063e0160fa9",
       build_file = clean_dep("//third_party/llvm:llvm.BUILD"),
   )
 
@@ -779,11 +778,9 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
       actual = "@grpc//:grpc_python_plugin",
   )
 
-  # gRPC has three empty C++ functions which it wants the user to define
-  # at build time. https://github.com/grpc/grpc/issues/13590
   native.bind(
       name = "grpc_lib",
-      actual = "@grpc//:grpc++_unsecure",
+      actual = "@grpc//:grpc++",
   )
 
   # Needed by gRPC
@@ -820,7 +817,7 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
   # Needed by Protobuf
   native.bind(
       name = "python_headers",
-      actual = clean_dep("//util/python:python_headers"),
+      actual = clean_dep("//third_party/python_runtime:headers"),
   )
 
   # Needed by Protobuf
diff --git a/third_party/examples/eager/spinn/README.md b/third_party/examples/eager/spinn/README.md
deleted file mode 100644
index fbb1fde837b92bc521698d0a517a946da0438dbc..0000000000000000000000000000000000000000
--- a/third_party/examples/eager/spinn/README.md
+++ /dev/null
@@ -1,109 +0,0 @@
-# SPINN with TensorFlow eager execution
-
-SPINN, or Stack-Augmented Parser-Interpreter Neural Network, is a recursive
-neural network that utilizes syntactic parse information for natural language
-understanding.
-
-SPINN was originally described by:
-Bowman, S.R., Gauthier, J., Rastogi A., Gupta, R., Manning, C.D., & Potts, C.
-  (2016). A Fast Unified Model for Parsing and Sentence Understanding.
-  https://arxiv.org/abs/1603.06021
-
-Our implementation is based on @jekbradbury's PyTorch implementation at:
-https://github.com/jekbradbury/examples/blob/spinn/snli/spinn.py,
-
-which was released under the BSD 3-Clause License at:
-https://github.com/jekbradbury/examples/blob/spinn/LICENSE
-
-Other eager execution examples can be found under [tensorflow/contrib/eager/python/examples](../../../../tensorflow/contrib/eager/python/examples).
-
-##  Content
-
-- [`data.py`](../../../../tensorflow/contrib/eager/python/examples/spinn/data.py): Pipeline for loading and preprocessing the
-   [SNLI](https://nlp.stanford.edu/projects/snli/) data and
-   [GloVe](https://nlp.stanford.edu/projects/glove/) word embedding, written
-   using the [`tf.data`](https://www.tensorflow.org/programmers_guide/datasets)
-   API.
-- [`spinn.py`](./spinn.py): Model definition and training routines.
-  This example illustrates how one might perform the following actions with
-  eager execution enabled:
-  * defining a model consisting of a dynamic computation graph,
-  * assigning operations to the CPU or GPU dependending on device availability,
-  * training the model using the data from the `tf.data`-based pipeline,
-  * obtaining metrics such as mean accuracy during training,
-  * saving and loading checkpoints,
-  * writing summaries for monitoring and visualization in TensorBoard.
-
-## To run
-
-- Make sure you have installed TensorFlow release 1.5 or higher. Alternatively,
-  you can use the latest `tf-nightly` or `tf-nightly-gpu` pip
-  package to access the eager execution feature.
-
-- Download and extract the raw SNLI data and GloVe embedding vectors.
-  For example:
-
-  ```bash
-  curl -fSsL https://nlp.stanford.edu/projects/snli/snli_1.0.zip --create-dirs -o /tmp/spinn-data/snli/snli_1.0.zip
-  unzip -d /tmp/spinn-data/snli /tmp/spinn-data/snli/snli_1.0.zip
-  curl -fSsL http://nlp.stanford.edu/data/glove.42B.300d.zip --create-dirs -o /tmp/spinn-data/glove/glove.42B.300d.zip
-  unzip -d /tmp/spinn-data/glove /tmp/spinn-data/glove/glove.42B.300d.zip
-  ```
-
-- Train model. E.g.,
-
-  ```bash
-  python spinn.py --data_root /tmp/spinn-data --logdir /tmp/spinn-logs
-  ```
-
-  During training, model checkpoints and TensorBoard summaries will be written
-  periodically to the directory specified with the `--logdir` flag.
-  The training script will reload a saved checkpoint from the directory if it
-  can find one there.
-
-  To view the summaries with TensorBoard:
-
-  ```bash
-  tensorboard --logdir /tmp/spinn-logs
-  ```
-
-- After training, you may use the model to perform inference on input data in
-  the SNLI data format. The premise and hypotheses sentences are specified with
-  the command-line flags `--inference_premise` and `--inference_hypothesis`,
-  respectively. Each sentence should include the words, as well as parentheses
-  representing a binary parsing of the sentence. The words and parentheses
-  should all be separated by spaces. For instance,
-
-  ```bash
-  python spinn.py --data_root /tmp/spinn-data --logdir /tmp/spinn-logs \
-      --inference_premise '( ( The dog ) ( ( is running ) . ) )' \
-      --inference_hypothesis '( ( The dog ) ( moves . ) )'
-  ```
-
-  which will generate an output like the following, due to the semantic
-  consistency of the two sentences.
-
-  ```none
-  Inference logits:
-    entailment:     1.101249 (winner)
-    contradiction:  -2.374171
-    neutral:        -0.296733
-  ```
-
-  By contrast, the following sentence pair:
-
-  ```bash
-  python spinn.py --data_root /tmp/spinn-data --logdir /tmp/spinn-logs \
-      --inference_premise '( ( The dog ) ( ( is running ) . ) )' \
-      --inference_hypothesis '( ( The dog ) ( rests . ) )'
-  ```
-
-  will give you an output like the following, due to the semantic
-  contradiction of the two sentences.
-
-  ```none
-  Inference logits:
-    entailment:     -1.070098
-    contradiction:  2.798695 (winner)
-    neutral:        -1.402287
-  ```
diff --git a/third_party/mkl/BUILD b/third_party/mkl/BUILD
deleted file mode 100644
index 017613abb0246fdcda8ca189da694102abfc8529..0000000000000000000000000000000000000000
--- a/third_party/mkl/BUILD
+++ /dev/null
@@ -1,60 +0,0 @@
-licenses(["notice"])  # 3-Clause BSD
-
-config_setting(
-    name = "using_mkl",
-    values = {
-        "define": "using_mkl=true",
-    },
-    visibility = ["//visibility:public"],
-)
-
-config_setting(
-    name = "using_mkl_lnx_x64",
-    values = {
-        "cpu": "k8",
-        "define": "using_mkl=true",
-    },
-    visibility = ["//visibility:public"],
-)
-
-load(
-    "//third_party/mkl:build_defs.bzl",
-    "if_mkl",
-)
-
-filegroup(
-    name = "LICENSE",
-    srcs = ["MKL_LICENSE"] + select({
-        "@org_tensorflow//tensorflow:linux_x86_64": [
-            "@mkl_linux//:LICENSE",
-        ],
-        "@org_tensorflow//tensorflow:darwin": [
-            "@mkl_darwin//:LICENSE",
-        ],
-        "@org_tensorflow//tensorflow:windows": [
-            "@mkl_windows//:LICENSE",
-        ],
-        "//conditions:default": []
-    }),
-    visibility = ["//visibility:public"],
-)
-
-cc_library(
-    name = "intel_binary_blob",
-    visibility = ["//visibility:public"],
-    deps = select({
-        "@org_tensorflow//tensorflow:linux_x86_64": [
-            "@mkl_linux//:mkl_headers",
-            "@mkl_linux//:mkl_libs_linux",
-        ],
-        "@org_tensorflow//tensorflow:darwin": [
-            "@mkl_darwin//:mkl_headers",
-            "@mkl_darwin//:mkl_libs_darwin",
-        ],
-        "@org_tensorflow//tensorflow:windows": [
-            "@mkl_windows//:mkl_headers",
-            "@mkl_windows//:mkl_libs_windows",
-        ],
-        "//conditions:default": []
-    }),
-)
diff --git a/third_party/png_fix_rpi.patch b/third_party/png_fix_rpi.patch
deleted file mode 100644
index 80da7b3c06444ff22a869de48bca273f99963728..0000000000000000000000000000000000000000
--- a/third_party/png_fix_rpi.patch
+++ /dev/null
@@ -1,16 +0,0 @@
-diff -r -u /tmp/libpng-1.6.34/scripts/pnglibconf.h.prebuilt ./scripts/pnglibconf.h.prebuilt
---- /tmp/libpng-1.6.34/scripts/pnglibconf.h.prebuilt	2017-09-29 01:42:33.000000000 -0700
-+++ ./scripts/pnglibconf.h.prebuilt	2018-05-01 09:51:24.719318242 -0700
-@@ -20,6 +20,12 @@
- #define PNG_ALIGNED_MEMORY_SUPPORTED
- /*#undef PNG_ARM_NEON_API_SUPPORTED*/
- /*#undef PNG_ARM_NEON_CHECK_SUPPORTED*/
-+
-+/* Workaround not having a great build file by forcing
-+ * png filter optimization to be disabled on arm */
-+#define PNG_ARM_NEON_OPT 0
-+
-+
- /*#undef PNG_POWERPC_VSX_API_SUPPORTED*/
- /*#undef PNG_POWERPC_VSX_CHECK_SUPPORTED*/
- #define PNG_BENIGN_ERRORS_SUPPORTED
diff --git a/third_party/tflite_ovic_testdata.BUILD b/third_party/tflite_ovic_testdata.BUILD
deleted file mode 100644
index de47ed61f9db9ad980468aa325e3c770e0aae4f1..0000000000000000000000000000000000000000
--- a/third_party/tflite_ovic_testdata.BUILD
+++ /dev/null
@@ -1,12 +0,0 @@
-package(default_visibility = ["//visibility:public"])
-
-licenses(["notice"])  # Apache 2.0
-
-exports_files(
-    glob(
-        ["**/*"],
-        exclude = [
-            "BUILD",
-        ],
-    ),
-)
diff --git a/tools/bazel.rc b/tools/bazel.rc
index 03aa52da1f6e9c113d6db6cb9c1d38b5be21927d..1c1e6afb65ab8da5b689d58ecaec6ac6c8a69bb8 100644
--- a/tools/bazel.rc
+++ b/tools/bazel.rc
@@ -1,14 +1,8 @@
-# By default, we don't distinct target and host platfroms.
-# When doing cross compilation, use --config=cross_compile to distinct them.
-build --distinct_host_configuration=false
-build:cross_compile --distinct_host_configuration=true
-
 # Android configs. Bazel needs to have --cpu and --fat_apk_cpu both set to the
 # target CPU to build transient dependencies correctly. See
 # https://docs.bazel.build/versions/master/user-manual.html#flag--fat_apk_cpu
 build:android --crosstool_top=//external:android/crosstool
 build:android --host_crosstool_top=@bazel_tools//tools/cpp:toolchain
-build:android --config=cross_compile
 build:android_arm --config=android
 build:android_arm --cpu=armeabi-v7a
 build:android_arm --fat_apk_cpu=armeabi-v7a
diff --git a/util/python/BUILD b/util/python/BUILD
deleted file mode 100644
index f5fa0c6d29c905cd9073e5001e993da5c8560ec0..0000000000000000000000000000000000000000
--- a/util/python/BUILD
+++ /dev/null
@@ -1,8 +0,0 @@
-licenses(["notice"])  # New BSD, Python Software Foundation
-
-package(default_visibility = ["//visibility:public"])
-
-alias(
-    name = "python_headers",
-    actual = "@local_config_python//:python_headers",
-)